This API client for the Internet Archive is
intended primarily for searching for items, retrieving metadata for
items, and downloading the files associated with items. The functions
can be used with the pipe operator (%>%) from magrittr and the data
manipulation verbs in dplyr to create pipelines from searching to
downloading. For the full details of what is possible with the Internet
Archive API, see their advanced search documentation.
Install the development version of this package from GitHub:
# install.packages("devtools")
devtools::install_github("ropensci/internetarchive", build_vignettes = TRUE)
Then load the package. We will also use dplyr for manipulating the retrieved data.
library("internetarchive")
library("dplyr", warn.conflicts = FALSE)
Basic search and browse
The simplest way to search the Internet Archive is to use a keyword search. The following function searches for these keywords in the most important metadata fields, and returns a list of item identifiers.
ia_keyword_search("isaac hecker")
#> 31 total items found. This query requested 5 results.
#> [1] "americanexperien00fari" "fatherhecker00sedggoog"
#> [3] "fatherhecker01sedg" "abitunpublished00heckgoog"
#> [5] "TheLifeOfFatherHecker"
You can pass an item identifier to the
ia_browse() function to open an
item in your browser. If you pass this function multiple identifiers, it
will open only the first one.
Usually it is more useful to perform an advanced search. You can
construct an advanced search as a named character vector, where the
names correspond to the fields. The following search, for instance,
looks for items published by the American Tract Society in 1864. Run
ia_list_fields() to see the list of accepted metadata fields.
ats_query <- c("publisher" = "american tract society", "year" = "1864")
ia_search(ats_query, num_results = 20)
#> 13 total items found. This query requested 20 results.
#>  [1] "huguenotsfrance00martgoog" "missionsmartyrsi00bost"
#>  [3] "vitalgodlinessa00plumgoog" "liliantaleofthre00lili"
#>  [5] "littlewillietrue00amer" "ourvillageinwart00mart"
#>  [7] "vitalgodlinesstrws00plum" "vitalgodlinesstr00plum"
#>  [9] "songsofzionenlar00amer" "ilvertonrectoryo00mart"
#> [11] "colorbearerfranc01amer" "sketcheseloquen00wategoog"
#> [13] "sketchesofeloque00wate"
You can change the number of items returned by the search using the
num_results = argument, and you can request subsequent pages of
results with the
page = argument.
Both ia_search() and ia_keyword_search() return a
character vector of identifiers, so both can be used in the same way at
the beginning of a pipeline.
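Paging can be combined with a small loop to gather identifiers from several pages at once. The sketch below assumes a hypothetical helper, collect_pages(), which is not part of the package; it calls the search function once per page with the num_results = and page = arguments described above. The search_fn parameter is also an assumption, added so that the helper can be tried with a stand-in function instead of the live API.

```r
# Hypothetical helper (not part of internetarchive): request several result
# pages and combine the identifiers into one character vector.
collect_pages <- function(query, pages = 1:2, per_page = 5,
                          search_fn = internetarchive::ia_search) {
  ids <- lapply(pages, function(p) {
    search_fn(query, num_results = per_page, page = p)
  })
  # Drop any identifier that appears on more than one page.
  unique(unlist(ids))
}

# Only runs in an interactive session, because it queries the live API.
if (interactive()) {
  ats_query <- c("publisher" = "american tract society", "year" = "1864")
  collect_pages(ats_query, pages = 1:2, per_page = 5)
}
```

Because the result is still a character vector of identifiers, it can begin a pipeline in the same way as a single call to ia_search().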
To search by a date range, use the
date field and the years (or
ISO 8601 dates) separated by
TO. Here we search for publications by the American Tract Society
between 1840 and 1850.
ia_search(c("publisher" = "american tract society", date = "1840 TO 1850"))
#> 104 total items found. This query requested 5 results.
#> [1] "historyreformat09aubgoog" "scripturebiogra00hookgoog"
#> [3] "historyreformat22aubgoog" "memoirmrssarahl00hookgoog"
#> [5] "circulationandc00socigoog"
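The same kind of range can be written with full ISO 8601 dates instead of years. The following is a minimal sketch: date_range() is a hypothetical helper, not a function from the package, and the dates chosen are only an illustration. It builds the "start TO end" string and places it in the date field of a query.

```r
# Hypothetical helper (not part of internetarchive): build a date range
# string of the form "YYYY-MM-DD TO YYYY-MM-DD".
date_range <- function(start, end) {
  paste(format(as.Date(start), "%Y-%m-%d"),
        "TO",
        format(as.Date(end), "%Y-%m-%d"))
}

# Illustrative query: one publisher, restricted to a single calendar year.
query <- c("publisher" = "american tract society",
           "date" = date_range("1845-01-01", "1845-12-31"))

# Only runs in an interactive session, because it queries the live API.
if (interactive()) {
  library("internetarchive")
  ia_search(query)
}
```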
Getting item metadata and files
Once you have retrieved a list of items, you can retrieve their metadata and the list of files associated with the items.
To get a single item’s metadata, you can pass its identifier to the
ia_get_items() function.
hecker <- ia_get_items("TheLifeOfFatherHecker")
The result is a list where the names of items in the list are the item
identifiers, and the rest of the list is the metadata. This nested list
can be difficult to work with, so ia_metadata() returns a data
frame of the metadata, and
ia_files() returns a data frame of the
files associated with the item.
These functions can also retrieve the information for multiple items when used in a pipeline. Here we search for all the items about Hecker, retrieve their metadata, and turn it into a data frame. We then filter the data frame to get only the titles.
ia_keyword_search("isaac hecker", num_results = 20) %>%
  ia_get_items() %>%
  ia_metadata() %>%
  filter(field == "title") %>%
  select(value)
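To see what the filtering step in this pipeline does without querying the API, it can be tried on a small mock data frame. This is only a sketch: the id column and every value below are assumptions made for illustration, and only the field and value columns are taken from the pipeline above.

```r
library("dplyr", warn.conflicts = FALSE)

# Mock stand-in for the output of ia_metadata(): one row per item and
# metadata field. The identifiers and values are hypothetical.
mock_metadata <- data.frame(
  id    = c("item_a", "item_a", "item_b", "item_b"),
  field = c("title", "year", "title", "year"),
  value = c("First example title", "1864", "Second example title", "1864"),
  stringsAsFactors = FALSE
)

# Keep only the title rows, retaining the identifier next to each title.
titles <- mock_metadata %>%
  filter(field == "title") %>%
  select(id, value)
```

Keeping the identifier alongside the value makes it easier to match each title back to its item.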
The ia_download() function will download all the files in a data frame
returned by ia_files(). This function should be used with caution,
and you should first filter the data frame to download only the files
that you wish. In the following example, we retrieve a list of all the
files associated with items published by the American Tract Society in
1864. Then we filter the list so we get only text files, then we pick
only the first text file associated with each item. Finally we download
the files to a directory we specify (in this case, a temporary
directory).
dir <- tempdir()
ia_search(ats_query) %>%
  ia_get_items() %>%
  ia_files() %>%
  filter(type == "txt") %>%
  group_by(id) %>%
  slice(1) %>%
  ia_download(dir = dir, overwrite = FALSE) %>%
  glimpse()
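Since downloading should be done with caution, it can help to inspect the selection before passing it to ia_download(). The sketch below applies the same filter, group and slice steps to a mock data frame. The file column and all of the values are assumptions for illustration, and nothing is downloaded.

```r
library("dplyr", warn.conflicts = FALSE)

# Mock stand-in for the output of ia_files(); the identifiers, file names
# and the `file` column itself are hypothetical.
mock_files <- data.frame(
  id   = c("item_a", "item_a", "item_a", "item_b", "item_b"),
  file = c("/a.pdf", "/a_djvu.txt", "/a_notes.txt", "/b.txt", "/b.epub"),
  type = c("pdf", "txt", "txt", "txt", "epub"),
  stringsAsFactors = FALSE
)

# Keep text files only, then the first text file for each item.
to_download <- mock_files %>%
  filter(type == "txt") %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()

# Check how many files were selected per item before downloading anything.
count(to_download, id)
```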
ia_download() returns a modified version of the data frame
that was passed to it, adding a column
local_file with the path to each downloaded file. If the
overwrite = argument is
FALSE, then you can pass the same
data frame of files to
ia_download() and it will download only the
files that it has not already downloaded.
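The local_file column can also be used to check which files are already on disk. The following is a minimal sketch on a mock data frame: the identifiers and file names are hypothetical, a partial download is simulated by creating one empty file, and ia_download() itself is not called.

```r
# Mock stand-in for the data frame returned by ia_download(); the ids and
# file names are hypothetical, and only the local_file column matters here.
dir <- file.path(tempdir(), "ia_mock")
dir.create(dir, showWarnings = FALSE)

downloaded <- data.frame(
  id         = c("item_a", "item_b"),
  local_file = file.path(dir, c("item_a.txt", "item_b.txt")),
  stringsAsFactors = FALSE
)

# Start from a clean state, then simulate a partial download in which only
# the first file exists on disk.
unlink(downloaded$local_file)
file.create(downloaded$local_file[1])

# Flag which rows already have a local copy and list the ones that do not.
downloaded$on_disk <- file.exists(downloaded$local_file)
still_needed <- downloaded[!downloaded$on_disk, ]
```

Under these assumptions, the rows in still_needed correspond to the files that a repeated call with overwrite = FALSE would be expected to fetch.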