Initial draft of package
dgrtwo committed Apr 28, 2016
0 parents commit 0481175
Showing 29 changed files with 80,660 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .Rbuildignore
@@ -0,0 +1,6 @@
^.*\.Rproj$
^\.Rproj\.user$
^data-raw$
^README\.Rmd$
^README-.*\.png$
^\.travis\.yml$
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.Rproj.user
.Rhistory
.RData
inst/doc
5 changes: 5 additions & 0 deletions .travis.yml
@@ -0,0 +1,5 @@
# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r

language: R
sudo: false
cache: packages
26 changes: 26 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,26 @@
Package: gutenbergr
Type: Package
Title: Download and process public domain books stored in
Project Gutenberg
Version: 0.1
Date: 2016-04-27
Authors@R: person("David", "Robinson", email = "admiral.david@gmail.com", role = c("aut", "cre"))
Description: Download and process public domain works in the Project
Gutenberg collection. Includes metadata for all Project Gutenberg works,
so that they can be searched and retrieved.
License: MIT + file LICENSE
LazyData: TRUE
Maintainer: David Robinson <admiral.david@gmail.com>
VignetteBuilder: knitr
Depends: R (>= 2.10)
Imports:
dplyr,
readr,
purrr,
urltools,
rvest,
xml2,
stringr
RoxygenNote: 5.0.1
Suggests: knitr,
rmarkdown
2 changes: 2 additions & 0 deletions LICENSE
@@ -0,0 +1,2 @@
YEAR: 2016
COPYRIGHT HOLDER: David Robinson
6 changes: 6 additions & 0 deletions NAMESPACE
@@ -0,0 +1,6 @@
# Generated by roxygen2: do not edit by hand

export(gutenberg_download)
export(gutenberg_get_mirror)
export(gutenberg_strip)
import(dplyr)
115 changes: 115 additions & 0 deletions R/data.R
@@ -0,0 +1,115 @@
#' Gutenberg metadata about each work
#'
#' Selected fields of metadata about each of the Project Gutenberg
#' works. These were collected using the gitenberg Python package,
#' particularly the \code{pg_rdf_to_json} function.
#'
#' @format A tbl_df with one row for each work in Project Gutenberg
#' and the following columns:
#' \describe{
#' \item{gutenberg_id}{Numeric ID, used to retrieve works from
#' Project Gutenberg}
#' \item{title}{Title}
#' \item{author}{Author, if a single one is given, as last name
#' first (e.g. "Doyle, Arthur Conan")}
#' \item{author_id}{Project Gutenberg author ID}
#' \item{language}{Language code, separated by / if multiple}
#' \item{gutenberg_bookshelf}{Which collection or collections this
#' is found in, separated by / if multiple}
#' \item{rights}{Generally one of three options: "Public domain in the USA."
#' (the most common by far), "Copyrighted. Read the copyright notice inside this book
#' for details.", or "None".}
#' \item{has_text}{Whether there is a file containing digits followed by
#' \code{.txt} in Project Gutenberg for this record (as opposed to, for
#' example, audiobooks). If not, the work cannot be retrieved with
#' \code{\link{gutenberg_download}}}
#' }
#'
#' @examples
#'
#' library(dplyr)
#' library(stringr)
#'
#' # look for Shakespeare, excluding collections (containing "Works") and translations
#' shakespeare_metadata <- gutenberg_metadata %>%
#' filter(author == "Shakespeare, William",
#' language == "en",
#' !str_detect(title, "Works"),
#' has_text) %>%
#' distinct(title)
#'
#' \dontrun{
#' shakespeare_works <- gutenberg_download(shakespeare_metadata$gutenberg_id)
#' }
"gutenberg_metadata"


#' Gutenberg metadata about the subject of each work
#'
#' Gutenberg metadata about the subject of each work, particularly
#' Library of Congress Classifications (lcc) and Library of Congress
#' Subject Headings (lcsh).
#'
#' @format A tbl_df with one row for each pairing of work and subject, with
#' columns:
#' \describe{
#' \item{gutenberg_id}{ID describing a work that can be joined with
#' \link{gutenberg_metadata}}
#' \item{subject_type}{Either "lcc" (Library of Congress Classification) or
#' "lcsh" (Library of Congress Subject Headings)}
#' \item{subject}{Subject}
#' }
#'
#' @examples
#'
#' library(dplyr)
#' library(stringr)
#'
#' gutenberg_subjects %>%
#' filter(subject_type == "lcsh") %>%
#' count(subject, sort = TRUE)
#'
#' sherlock_holmes_subjects <- gutenberg_subjects %>%
#' filter(str_detect(subject, "Holmes, Sherlock"))
#'
#' sherlock_holmes_subjects
#'
#' sherlock_holmes_metadata <- gutenberg_metadata %>%
#' semi_join(sherlock_holmes_subjects) %>%
#' filter(author == "Doyle, Arthur Conan",
#' language == "en",
#' has_text) %>%
#' arrange(gutenberg_id) %>%
#' distinct(title)
#'
#' sherlock_holmes_metadata
#'
#' \dontrun{
#' holmes_books <- gutenberg_download(sherlock_holmes_metadata$gutenberg_id)
#'
#' holmes_books
#' }
"gutenberg_subjects"


#' Metadata about Project Gutenberg authors
#'
#' Data frame with metadata about each author of a Project
#' Gutenberg work. To save space, only people who are the single
#' author of a work (not one of multiple authors, contributors,
#' etc.) are included.
#'
#' @format A tbl_df with one row for each author, with the columns
#' \describe{
#' \item{gutenberg_author_id}{Unique identifier for the author that can
#' be used to join with the \link{gutenberg_metadata} dataset}
#' \item{author}{The \code{agent_name} field from the original metadata}
#' \item{alias}{Alias}
#' \item{birthdate}{Year of birth}
#' \item{deathdate}{Year of death}
#' \item{wikipedia}{Link to Wikipedia article on the author. If there
#' are multiple, they are "/"-delimited}
#' \item{aliases}{List column of character vectors of aliases. If there
#' are multiple, they are "/"-delimited}
#' }
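#'
#' @examples
#'
#' # an illustrative query (not from the original documentation):
#' # authors with a recorded Wikipedia link, oldest first
#' library(dplyr)
#'
#' gutenberg_authors %>%
#'   filter(!is.na(wikipedia)) %>%
#'   arrange(birthdate)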
"gutenberg_authors"
1 change: 1 addition & 0 deletions R/globals.R
@@ -0,0 +1 @@
globalVariables(c(".", "gutenberg_id"))
127 changes: 127 additions & 0 deletions R/gutenberg_download.R
@@ -0,0 +1,127 @@
#' Download a book using its Gutenberg ID
#'
#' @param gutenberg_id A vector of Project Gutenberg IDs
#' @param mirror Optionally a mirror URL to retrieve the books from
#'
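#' @examples
#'
#' \dontrun{
#' # e.g. Frankenstein (84) and Dracula (345); a vector downloads several books
#' books <- gutenberg_download(c(84, 345))
#' }
#'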
#' @import dplyr
#'
#' @export
gutenberg_download <- function(gutenberg_id, mirror = NULL) {
if (is.null(mirror)) {
mirror <- gutenberg_get_mirror()
}

id <- as.character(gutenberg_id)

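# Project Gutenberg stores each text under a path built from its ID: all but
# the last digit, one directory level per digit (e.g. ID 84 -> <mirror>/8/84/84.zip,
# ID 1342 -> <mirror>/1/3/4/1342/1342.zip); single-digit IDs are filed under "0"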
path <- id %>%
stringr::str_sub(1, -2) %>%
stringr::str_split("") %>%
sapply(stringr::str_c, collapse = "/")

path <- ifelse(nchar(id) == 1, "0", path)

full_url <- stringr::str_c(mirror, path, id,
stringr::str_c(id, ".zip"),
sep = "/")
names(full_url) <- id

try_download <- function(url) {
ret <- read_zip_url(url)
if (!is.null(ret)) {
return(ret)
}
base_url <- stringr::str_replace(url, ".zip$", "")
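# some books are only available as encoding variants, with a "-8"
# (8-bit/Latin-1) or "-0" (UTF-8) suffix, so try those before giving up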
for (suffix in c("-8", "-0")) {
new_url <- paste0(base_url, suffix, ".zip")
ret <- read_zip_url(new_url)
if (!is.null(ret)) {
return(ret)
}
}
warning("Could not download a book at ", url)

NULL
}

# run this on all requested books
ret <- full_url %>%
purrr::map(try_download) %>%
purrr::discard(is.null) %>%
purrr::map(gutenberg_strip) %>%
purrr::map_df(~data_frame(text = .), .id = "gutenberg_id") %>%
mutate(gutenberg_id = as.integer(gutenberg_id))

ret
}


#' Strip header and footer content from a Project Gutenberg book
#'
#' Strip header and footer content from a Project Gutenberg book. This
#' is based on some formatting guesses so it may not be perfect. It
#' will also not strip tables of contents, prologues, or other text
#' that appears at the start of a book.
#'
#' @param text A character vector of book text
#'
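#' @examples
#'
#' # a small, made-up example with Gutenberg-style header and footer lines
#' book_text <- c("*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
#'                "",
#'                "Some actual text of the book.",
#'                "",
#'                "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***")
#'
#' gutenberg_strip(book_text)
#'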
#' @export
gutenberg_strip <- function(text) {
text[is.na(text)] <- ""

starting_regex <- "^\\*\\*\\*.*PROJECT GUTENBERG"
start_after <- which(stringr::str_detect(text, starting_regex))[1]

if (!is.na(start_after)) {
text <- tail(text, -(start_after))
}

ending_regex <- "^(End of .*Project Gutenberg.*|\\*\\*\\*.*END OF.*PROJECT GUTENBERG)"
stop_before <- which(stringr::str_detect(text, ending_regex))[1]

if (!is.na(stop_before)) {
text <- head(text, stop_before - 1)
}

# strip empty lines from the start and end
if (text[1] == "") {
text <- tail(text, -(min(which(text != "")) - 1))
}
if (tail(text, 1) == "") {
text <- head(text, max(which(text != "")))
}

text
}


#' Get the recommended mirror for Gutenberg files
#'
#' Get the recommended mirror for Gutenberg files by accessing
#' the wget harvest path, which is
#' \url{http://www.gutenberg.org/robot/harvest?filetypes[]=txt}.
#' Also sets the global \code{gutenberg_mirror} option.
#'
#' @export
gutenberg_get_mirror <- function() {
mirror <- getOption("gutenberg_mirror")
if (!is.null(mirror)) {
return(mirror)
}

# figure out the mirror for this location from wget
message("Determining mirror for Gutenberg http://www.gutenberg.org/robot/harvest")
wget_url <- "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
mirror_full_url <- xml2::read_html(wget_url) %>%
rvest::html_nodes("a") %>%
.[[1]] %>%
rvest::html_attr("href")

# parse and leave out the path
parsed <- urltools::url_parse(mirror_full_url)
mirror <- paste0(parsed$scheme, "://", parsed$domain)

# set option for next time
options(gutenberg_mirror = mirror)

return(mirror)
}
14 changes: 14 additions & 0 deletions R/utils.R
@@ -0,0 +1,14 @@
#' Read a file from a .zip URL
#'
#' Download, read, and delete a .zip file
#'
#' @param url URL to a .zip file
read_zip_url <- function(url) {
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, quiet = TRUE)

ret <- purrr::possibly(readr::read_lines, NULL)(tmp)
unlink(tmp)

ret
}
45 changes: 45 additions & 0 deletions README.Rmd
@@ -0,0 +1,45 @@
<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-",
message = FALSE
)
```

gutenbergr: R package to search and download public domain texts from Project Gutenberg
----------------

[![Travis-CI Build Status](https://travis-ci.org/.svg?branch=master)](https://travis-ci.org/)

Download and process public domain works from the [Project Gutenberg](https://www.gutenberg.org/) collection. Includes

* A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein (see the example after this list).
* Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
* `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc
* `gutenberg_authors` contains information about each author, such as aliases and birth/death year
* `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics
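
For example, a minimal workflow might look like the sketch below (the author name and filters are purely illustrative):

```{r, eval = FALSE}
library(dplyr)
library(gutenbergr)

# find Jane Austen's English-language works that have text available
austen_metadata <- gutenberg_metadata %>%
  filter(author == "Austen, Jane",
         language == "en",
         has_text)

# download all of them by Gutenberg ID
austen_books <- gutenberg_download(austen_metadata$gutenberg_id)
```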

### FAQ

#### What do I do with the text once I have it?

We recommend the [tidytext](https://github.com/juliasilge/tidytext) package for tokenization and analysis!
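
As a rough sketch of what that could look like (this assumes tidytext's `unnest_tokens()` function and `stop_words` dataset):

```{r, eval = FALSE}
library(dplyr)
library(tidytext)

frankenstein <- gutenberg_download(84)

frankenstein %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)
```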

#### How were the metadata R files generated?

See the [data-raw](data-raw) directory for the scripts that generate these datasets. As of now, these were generated from [the Project Gutenberg catalog](https://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog) on **April 27 2016**.

There are no guarantees about how often the metadata will be updated in the package. If you're interested in works that have just been added or heavily edited on Project Gutenberg, you may want to run the scripts yourself.

#### Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects [these rules](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) and complies to the best of our ability. Namely:

* Project Gutenberg allows wget to harvest Project Gutenberg using [this list of links](http://www.gutenberg.org/robot/harvest?filetypes[]=html). The gutenbergr package visits the page once to find the recommended mirror for the user's location.
* We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from `http://www.gutenberg.lib.md.us/8/84/84.zip`.
* We retrieve the .zip file rather than txt to minimize bandwidth on the mirror.
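
If you want to inspect or override that behavior, a quick sketch (the mirror URL here is only an example):

```{r, eval = FALSE}
# report (and cache in the "gutenberg_mirror" option) the recommended mirror
gutenberg_get_mirror()

# or pass a specific mirror directly
gutenberg_download(84, mirror = "http://aleph.gutenberg.org")
```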

Still, this package is *not* the right way to download the entire Gutenberg corpus (or all from a particular language). For that, follow [their recommendation](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic for analysis.