Initial draft of package
dgrtwo committed Apr 28, 2016
0 parents commit 0481175
Showing 29 changed files with 80,660 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .Rbuildignore
@@ -0,0 +1,6 @@
^.*\.Rproj$
^\.Rproj\.user$
^data-raw$
^README\.Rmd$
^README-.*\.png$
^\.travis\.yml$
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.Rproj.user
.Rhistory
.RData
inst/doc
5 changes: 5 additions & 0 deletions .travis.yml
@@ -0,0 +1,5 @@
# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r

language: R
sudo: false
cache: packages
26 changes: 26 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,26 @@
Package: gutenbergr
Type: Package
Title: Download and process public domain books stored in
Project Gutenberg
Version: 0.1
Date: 2016-04-27
Authors@R: person("David", "Robinson", email = "admiral.david@gmail.com", role = c("aut", "cre"))
Description: Download and process public domain works in the Project
Gutenberg collection. Includes metadata for all Project Gutenberg works,
so that they can be searched and retrieved.
License: MIT + file LICENSE
LazyData: TRUE
Maintainer: David Robinson <admiral.david@gmail.com>
VignetteBuilder: knitr
Depends: R (>= 2.10)
Imports:
dplyr,
readr,
purrr,
urltools,
rvest,
xml2,
stringr
RoxygenNote: 5.0.1
Suggests: knitr,
rmarkdown
2 changes: 2 additions & 0 deletions LICENSE
@@ -0,0 +1,2 @@
YEAR: 2016
COPYRIGHT HOLDER: David Robinson
6 changes: 6 additions & 0 deletions NAMESPACE
@@ -0,0 +1,6 @@
# Generated by roxygen2: do not edit by hand

export(gutenberg_download)
export(gutenberg_get_mirror)
export(gutenberg_strip)
import(dplyr)
115 changes: 115 additions & 0 deletions R/data.R
@@ -0,0 +1,115 @@
#' Gutenberg metadata about each work
#'
#' Selected fields of metadata about each of the Project Gutenberg
#' works. These were collected using the gitenberg Python package,
#' particularly the \code{pg_rdf_to_json} function.
#'
#' @format A tbl_df with one row for each work in Project Gutenberg
#' and the following columns:
#' \describe{
#' \item{gutenberg_id}{Numeric ID, used to retrieve works from
#' Project Gutenberg}
#' \item{title}{Title}
#' \item{author}{Author, if a single one is given, as last name
#' first (e.g. "Doyle, Arthur Conan")}
#' \item{author_id}{Project Gutenberg author ID}
#' \item{language}{Language code, separated by / if multiple}
#' \item{gutenberg_bookshelf}{Which collection or collections this
#' is found in, separated by / if multiple}
#' \item{rights}{Generally one of three options: "Public domain in the USA."
#' (the most common by far), "Copyrighted. Read the copyright notice inside this book
#' for details.", or "None".}
#' \item{has_text}{Whether there is a file containing digits followed by
#' \code{.txt} in Project Gutenberg for this record (as opposed to, for
#' example, audiobooks). If not, the work cannot be retrieved with
#' \code{\link{gutenberg_download}}}
#' }
#'
#' @examples
#'
#' library(dplyr)
#' library(stringr)
#'
#' # look for Shakespeare, excluding collections (containing "Works") and translations
#' shakespeare_metadata <- gutenberg_metadata %>%
#' filter(author == "Shakespeare, William",
#' language == "en",
#' !str_detect(title, "Works"),
#' has_text) %>%
#' distinct(title)
#'
#' \dontrun{
#' shakespeare_works <- gutenberg_download(shakespeare_metadata$gutenberg_id)
#' }
"gutenberg_metadata"


#' Gutenberg metadata about the subject of each work
#'
#' Gutenberg metadata about the subject of each work, particularly
#' Library of Congress Classifications (lcc) and Library of Congress
#' Subject Headings (lcsh).
#'
#' @format A tbl_df with one row for each pairing of work and subject, with
#' columns:
#' \describe{
#' \item{gutenberg_id}{ID describing a work that can be joined with
#' \link{gutenberg_metadata}}
#' \item{subject_type}{Either "lcc" (Library of Congress Classification) or
#' "lcsh" (Library of Congress Subject Headings)}
#' \item{subject}{Subject}
#' }
#'
#' @examples
#'
#' library(dplyr)
#' library(stringr)
#'
#' gutenberg_subjects %>%
#' filter(subject_type == "lcsh") %>%
#' count(subject, sort = TRUE)
#'
#' sherlock_holmes_subjects <- gutenberg_subjects %>%
#' filter(str_detect(subject, "Holmes, Sherlock"))
#'
#' sherlock_holmes_subjects
#'
#' sherlock_holmes_metadata <- gutenberg_metadata %>%
#' semi_join(sherlock_holmes_subjects) %>%
#' filter(author == "Doyle, Arthur Conan",
#' language == "en",
#' has_text) %>%
#' arrange(gutenberg_id) %>%
#' distinct(title)
#'
#' sherlock_holmes_metadata
#'
#' \dontrun{
#' holmes_books <- gutenberg_download(sherlock_holmes_metadata$gutenberg_id)
#'
#' holmes_books
#' }
"gutenberg_subjects"


#' Metadata about Project Gutenberg authors
#'
#' Data frame with metadata about each author of a Project
#' Gutenberg work. To save space, only people who are the single
#' author of a work (not one of multiple authors, contributors,
#' etc.) are included.
#'
#' @format A tbl_df with one row for each author, with the columns
#' \describe{
#' \item{gutenberg_author_id}{Unique identifier for the author that can
#' be used to join with the \link{gutenberg_metadata} dataset}
#' \item{author}{The \code{agent_name} field from the original metadata}
#' \item{alias}{Alias}
#' \item{birthdate}{Year of birth}
#' \item{deathdate}{Year of death}
#' \item{wikipedia}{Link to Wikipedia article on the author. If there
#' are multiple, they are "/"-delimited}
#' \item{aliases}{List column of character vectors of aliases. If there
#' are multiple, they are "/"-delimited}
#' }
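#'
#' @examples
#'
#' # an illustrative query (not from the original documentation):
#' # authors with a recorded Wikipedia link, oldest first
#' library(dplyr)
#'
#' gutenberg_authors %>%
#'   filter(!is.na(wikipedia)) %>%
#'   arrange(birthdate)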
"gutenberg_authors"
1 change: 1 addition & 0 deletions R/globals.R
@@ -0,0 +1 @@
globalVariables(c(".", "gutenberg_id"))
127 changes: 127 additions & 0 deletions R/gutenberg_download.R
@@ -0,0 +1,127 @@
#' Download a book using its Gutenberg ID
#'
#' @param gutenberg_id A vector of Project Gutenberg IDs
#' @param mirror Optionally a mirror URL to retrieve the books from
#'
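#' @examples
#'
#' \dontrun{
#' # e.g. Frankenstein (84) and Dracula (345); a vector downloads several books
#' books <- gutenberg_download(c(84, 345))
#' }
#'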
#' @import dplyr
#'
#' @export
gutenberg_download <- function(gutenberg_id, mirror = NULL) {
if (is.null(mirror)) {
mirror <- gutenberg_get_mirror()
}

id <- as.character(gutenberg_id)

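# Project Gutenberg stores each text under a path built from its ID: all but
# the last digit, one directory level per digit (e.g. ID 84 -> <mirror>/8/84/84.zip,
# ID 1342 -> <mirror>/1/3/4/1342/1342.zip); single-digit IDs are filed under "0"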
path <- id %>%
stringr::str_sub(1, -2) %>%
stringr::str_split("") %>%
sapply(stringr::str_c, collapse = "/")

path <- ifelse(nchar(id) == 1, "0", path)

full_url <- stringr::str_c(mirror, path, id,
stringr::str_c(id, ".zip"),
sep = "/")
names(full_url) <- id

try_download <- function(url) {
ret <- read_zip_url(url)
if (!is.null(ret)) {
return(ret)
}
base_url <- stringr::str_replace(url, ".zip$", "")
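# some books are only available as encoding variants, with a "-8"
# (8-bit/Latin-1) or "-0" (UTF-8) suffix, so try those before giving up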
for (suffix in c("-8", "-0")) {
new_url <- paste0(base_url, suffix, ".zip")
ret <- read_zip_url(new_url)
if (!is.null(ret)) {
return(ret)
}
}
warning("Could not download a book at ", url)

NULL
}

# run this on all requested books
ret <- full_url %>%
purrr::map(try_download) %>%
purrr::discard(is.null) %>%
purrr::map(gutenberg_strip) %>%
purrr::map_df(~data_frame(text = .), .id = "gutenberg_id") %>%
mutate(gutenberg_id = as.integer(gutenberg_id))

ret
}


#' Strip header and footer content from a Project Gutenberg book
#'
#' Strip header and footer content from a Project Gutenberg book. This
#' is based on some formatting guesses so it may not be perfect. It
#' will also not strip tables of contents, prologues, or other text
#' that appears at the start of a book.
#'
#' @param text A character vector of book text
#'
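#' @examples
#'
#' # a small, made-up example with Gutenberg-style header and footer lines
#' book_text <- c("*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
#'                "",
#'                "Some actual text of the book.",
#'                "",
#'                "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***")
#'
#' gutenberg_strip(book_text)
#'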
#' @export
gutenberg_strip <- function(text) {
text[is.na(text)] <- ""

starting_regex <- "^\\*\\*\\*.*PROJECT GUTENBERG"
start_after <- which(stringr::str_detect(text, starting_regex))[1]

if (!is.na(start_after)) {
text <- tail(text, -(start_after))
}

ending_regex <- "^(End of .*Project Gutenberg.*|\\*\\*\\*.*END OF.*PROJECT GUTENBERG)"
stop_before <- which(stringr::str_detect(text, ending_regex))[1]

if (!is.na(stop_before)) {
text <- head(text, stop_before - 1)
}

# strip empty lines from the start and end
if (text[1] == "") {
text <- tail(text, -(min(which(text != "")) - 1))
}
if (tail(text, 1) == "") {
text <- head(text, max(which(text != "")))
}

text
}


#' Get the recommended mirror for Gutenberg files
#'
#' Get the recommended mirror for Gutenberg files by accessing
#' the wget harvest path, which is
#' \url{http://www.gutenberg.org/robot/harvest?filetypes[]=txt}.
#' Also sets the global \code{gutenberg_mirror} option.
#'
#' @export
gutenberg_get_mirror <- function() {
mirror <- getOption("gutenberg_mirror")
if (!is.null(mirror)) {
return(mirror)
}

# figure out the mirror for this location from wget
message("Determining mirror for Gutenberg http://www.gutenberg.org/robot/harvest")
wget_url <- "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
mirror_full_url <- xml2::read_html(wget_url) %>%
rvest::html_nodes("a") %>%
.[[1]] %>%
rvest::html_attr("href")

# parse and leave out the path
parsed <- urltools::url_parse(mirror_full_url)
mirror <- paste0(parsed$scheme, "://", parsed$domain)

# set option for next time
options(gutenberg_mirror = mirror)

return(mirror)
}
14 changes: 14 additions & 0 deletions R/utils.R
@@ -0,0 +1,14 @@
#' Read a file from a .zip URL
#'
#' Download, read, and delete a .zip file
#'
#' @param url URL to a .zip file
read_zip_url <- function(url) {
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, quiet = TRUE)

ret <- purrr::possibly(readr::read_lines, NULL)(tmp)
unlink(tmp)

ret
}
45 changes: 45 additions & 0 deletions README.Rmd
@@ -0,0 +1,45 @@
<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-",
message = FALSE
)
```

gutenbergr: R package to search and download public domain texts from Project Gutenberg
----------------

[![Travis-CI Build Status](https://travis-ci.org/.svg?branch=master)](https://travis-ci.org/)

Download and process public domain works from the [Project Gutenberg](https://www.gutenberg.org/) collection. Includes

* A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein (see the example after this list).
* Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
* `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc
* `gutenberg_authors` contains information about each author, such as aliases and birth/death year
* `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics
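
For example, a minimal workflow might look like the sketch below (the author name and filters are purely illustrative):

```{r, eval = FALSE}
library(dplyr)
library(gutenbergr)

# find Jane Austen's English-language works that have text available
austen_metadata <- gutenberg_metadata %>%
  filter(author == "Austen, Jane",
         language == "en",
         has_text)

# download all of them by Gutenberg ID
austen_books <- gutenberg_download(austen_metadata$gutenberg_id)
```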

### FAQ

#### What do I do with the text once I have it?

We recommend the [tidytext](https://github.com/juliasilge/tidytext) package for tokenization and analysis!
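
As a rough sketch of what that could look like (this assumes tidytext's `unnest_tokens()` function and `stop_words` dataset):

```{r, eval = FALSE}
library(dplyr)
library(tidytext)

frankenstein <- gutenberg_download(84)

frankenstein %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)
```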

#### How were the metadata R files generated?

See the [data-raw](data-raw) directory for the scripts that generate these datasets. As of now, these were generated from [the Project Gutenberg catalog](https://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog) on **April 27 2016**.

There are no guarantees about how often the metadata will be updated in the package. If you're interested in works that have just been added or heavily edited on Project Gutenberg, you may want to run the scripts yourself.

#### Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects [these rules](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) and complies to the best of our ability. Namely:

* Project Gutenberg allows wget to harvest Project Gutenberg using [this list of links](http://www.gutenberg.org/robot/harvest?filetypes[]=html). The gutenbergr package visits the page once to find the recommended mirror for the user's location.
* We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from `http://www.gutenberg.lib.md.us/8/84/84.zip`.
* We retrieve the .zip file rather than txt to minimize bandwidth on the mirror.
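
If you want to inspect or override that behavior, a quick sketch (the mirror URL here is only an example):

```{r, eval = FALSE}
# report (and cache in the "gutenberg_mirror" option) the recommended mirror
gutenberg_get_mirror()

# or pass a specific mirror directly
gutenberg_download(84, mirror = "http://aleph.gutenberg.org")
```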

Still, this package is *not* the right way to download the entire Gutenberg corpus (or all from a particular language). For that, follow [their recommendation](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic for analysis.