-
Notifications
You must be signed in to change notification settings - Fork 22
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
dgrtwo
committed
Apr 28, 2016
0 parents
commit 0481175
Showing
29 changed files
with
80,660 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
^.*\.Rproj$ | ||
^\.Rproj\.user$ | ||
^data-raw$ | ||
^README\.Rmd$ | ||
^README-.*\.png$ | ||
^\.travis\.yml$ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
.Rproj.user | ||
.Rhistory | ||
.RData | ||
inst/doc |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r | ||
|
||
language: R | ||
sudo: false | ||
cache: packages |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
Package: gutenbergr | ||
Type: Package | ||
Title: Download and process public domain books stored in | ||
Project Gutenberg | ||
Version: 0.1 | ||
Date: 2016-04-27 | ||
Authors@R: person("David", "Robinson", email = "admiral.david@gmail.com", role = c("aut", "cre")) | ||
Description: Download and process public domain works in the Project | ||
Gutenberg collection. Includes metadata for all Project Gutenberg works, | ||
so that they can be searched and retrieved. | ||
License: MIT + file LICENSE | ||
LazyData: TRUE | ||
Maintainer: David Robinson <admiral.david@gmail.com> | ||
VignetteBuilder: knitr | ||
Depends: R (>= 2.10) | ||
Imports: | ||
dplyr, | ||
readr, | ||
purrr, | ||
urltools, | ||
rvest, | ||
xml2, | ||
stringr | ||
RoxygenNote: 5.0.1 | ||
Suggests: knitr, | ||
rmarkdown |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
YEAR: 2016 | ||
COPYRIGHT HOLDER: David Robinson |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Generated by roxygen2: do not edit by hand | ||
|
||
export(gutenberg_download) | ||
export(gutenberg_get_mirror) | ||
export(gutenberg_strip) | ||
import(dplyr) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
#' Gutenberg metadata about each work | ||
#' | ||
#' Selected fields of metadata about each of the Project Gutenberg | ||
#' works. These were collected using the gitenberg Python package, | ||
#' particularly the \code{pg_rdf_to_json} function. | ||
#' | ||
#' @format A tbl_df with one row for each work in Project Gutenberg | ||
#' and the following columns: | ||
#' \describe{ | ||
#' \item{gutenberg_id}{Numeric ID, used to retrieve works from | ||
#' Project Gutenberg} | ||
#' \item{title}{Title} | ||
#' \item{author}{Author, if a single one given. Given as last name | ||
#' first (e.g. "Doyle, Arthur Conan")} | ||
#' \item{author_id}{Project Gutenberg author ID} | ||
#' \item{language}{Language code, separated by / if multiple} | ||
#' \item{gutenberg_bookshelf}{Which collection or collections this | ||
#' is found in, separated by / if multiple} | ||
#' \item{rights}{Generally one of three options: "Public domain in the USA." | ||
#' (the most common by far), "Copyrighted. Read the copyright notice inside this book | ||
#' for details.", or "None".} | ||
#' \item{has_text}{Whether there is a file containing digits followed by | ||
#' \code{.txt} in Project Gutenberg for this record (as opposed to, for | ||
#' example, audiobooks). If not, cannot be retrieved with | ||
#' \code{\link{gutenberg_download}}} | ||
#' } | ||
#' | ||
#' @examples | ||
#' | ||
#' library(dplyr) | ||
#' library(stringr) | ||
#' | ||
#' # look for Shakespeare, excluding collections (containing "Works") and translations | ||
#' shakespeare_metadata <- gutenberg_metadata %>% | ||
#' filter(author == "Shakespeare, William", | ||
#' language == "en", | ||
#' !str_detect(title, "Works"), | ||
#' has_text) %>% | ||
#' distinct(title) | ||
#' | ||
#' \dontrun{ | ||
#' shakespeare_works <- gutenberg_download(shakespeare_metadata$gutenberg_id) | ||
#' } | ||
"gutenberg_metadata" | ||
|
||
|
||
#' Gutenberg metadata about the subject of each work | ||
#' | ||
#' Gutenberg metadata about the subject of each work, particularly | ||
#' Library of Congress Classifications (lcc) and Library of Congress | ||
#' Subject Headings (lcsh). | ||
#' | ||
#' @format A tbl_df with one row for each pairing of work and subject, with | ||
#' columns: | ||
#' \describe{ | ||
#' \item{gutenberg_id}{ID describing a work that can be joined with | ||
#' \link{gutenberg_metadata}} | ||
#' \item{subject_type}{Either "lcc" (Library of Congress Classification) or | ||
#' "lcsh" (Library of Congress Subject Headings)} | ||
#' \item{subject}{Subject} | ||
#' } | ||
#' | ||
#' @examples | ||
#' | ||
#' library(dplyr) | ||
#' library(stringr) | ||
#' | ||
#' gutenberg_subjects %>% | ||
#' filter(subject_type == "lcsh") %>% | ||
#' count(subject, sort = TRUE) | ||
#' | ||
#' sherlock_holmes_subjects <- gutenberg_subjects %>% | ||
#' filter(str_detect(subject, "Holmes, Sherlock")) | ||
#' | ||
#' sherlock_holmes_subjects | ||
#' | ||
#' sherlock_holmes_metadata <- gutenberg_metadata %>% | ||
#' semi_join(sherlock_holmes_subjects) %>% | ||
#' filter(author == "Doyle, Arthur Conan", | ||
#' language == "en", | ||
#' has_text) %>% | ||
#' arrange(gutenberg_id) %>% | ||
#' distinct(title) | ||
#' | ||
#' sherlock_holmes_metadata | ||
#' | ||
#' \dontrun{ | ||
#' holmes_books <- gutenberg_download(sherlock_holmes_metadata$gutenberg_id) | ||
#' | ||
#' holmes_books | ||
#' } | ||
"gutenberg_subjects" | ||
|
||
|
||
#' Metadata about Project Gutenberg authors | ||
#' | ||
#' Data frame with metadata about each author of a Project | ||
#' Gutenberg work. For space only metadata from people that have | ||
#' been the single author of a work (not multiple authors, | ||
#' contributors, etc) are included. | ||
#' | ||
#' @format A tbl_df with one row for each author, with the columns | ||
#' \describe{ | ||
#' \item{gutenberg_author_id}{Unique identifier for the author that can | ||
#' be used to join with the \link{gutenberg_metadata} dataset} | ||
#' \item{author}{The \code{agent_name} field from the original metadata} | ||
#' \item{alias}{Alias} | ||
#' \item{birthdate}{Year of birth} | ||
#' \item{deathdate}{Year of death} | ||
#' \item{wikipedia}{Link to Wikipedia article on the author. If there | ||
#' are multiple, they are "/"-delimited} | ||
#' \item{aliases}{List column of character vectors of aliases. If there | ||
#' are multiple, they are "/"-delimited} | ||
#' } | ||
"gutenberg_authors" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
globalVariables(c(".", "gutenberg_id")) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,127 @@ | ||
#' Download a book using its Gutenberg ID | ||
#' | ||
#' @param gutenberg_id A vector of Project Gutenberg ID | ||
#' @param mirror Optionally a mirror URL to retrieve the books from | ||
#' | ||
#' @import dplyr | ||
#' | ||
#' @export | ||
gutenberg_download <- function(gutenberg_id, mirror = NULL) { | ||
if (is.null(mirror)) { | ||
mirror <- gutenberg_get_mirror() | ||
} | ||
|
||
id <- as.character(gutenberg_id) | ||
|
||
path <- id %>% | ||
stringr::str_sub(1, -2) %>% | ||
stringr::str_split("") %>% | ||
sapply(stringr::str_c, collapse = "/") | ||
|
||
path <- ifelse(nchar(id) == 1, "0", path) | ||
|
||
full_url <- stringr::str_c(mirror, path, id, | ||
stringr::str_c(id, ".zip"), | ||
sep = "/") | ||
names(full_url) <- id | ||
|
||
try_download <- function(url) { | ||
ret <- read_zip_url(url) | ||
if (!is.null(ret)) { | ||
return(ret) | ||
} | ||
base_url <- stringr::str_replace(url, ".zip$", "") | ||
for (suffix in c("-8", "-0")) { | ||
new_url <- paste0(base_url, suffix, ".zip") | ||
ret <- read_zip_url(new_url) | ||
if (!is.null(ret)) { | ||
return(ret) | ||
} | ||
} | ||
warning("Could not download a book at ", url) | ||
|
||
NULL | ||
} | ||
|
||
# run this on all requested books | ||
ret <- full_url %>% | ||
purrr::map(try_download) %>% | ||
purrr::discard(is.null) %>% | ||
purrr::map(gutenberg_strip) %>% | ||
purrr::map_df(~data_frame(text = .), .id = "gutenberg_id") %>% | ||
mutate(gutenberg_id = as.integer(gutenberg_id)) | ||
|
||
ret | ||
} | ||
|
||
|
||
#' Strip header and footer content from a Project Gutenberg book | ||
#' | ||
#' Strip header and footer content from a Project Gutenberg book. This | ||
#' is based on some formatting guesses so it may not be perfect. It | ||
#' will also not strip tables of contents, prologues, or other text | ||
#' that appears at the start of a book. | ||
#' | ||
#' @param text A character vector of book text | ||
#' | ||
#' @export | ||
gutenberg_strip <- function(text) { | ||
text[is.na(text)] <- "" | ||
|
||
starting_regex <- "^\\*\\*\\*.*PROJECT GUTENBERG" | ||
start_after <- which(stringr::str_detect(text, starting_regex))[1] | ||
|
||
if (!is.na(start_after)) { | ||
text <- tail(text, -(start_after)) | ||
} | ||
|
||
ending_regex <- "^(End of .*Project Gutenberg.*|\\*\\*\\*.*END OF.*PROJECT GUTENBERG)" | ||
stop_before <- which(stringr::str_detect(text, ending_regex))[1] | ||
|
||
if (!is.na(stop_before)) { | ||
text <- head(text, stop_before - 1) | ||
} | ||
|
||
# strip empty lines from start and empty | ||
if (text[1] == "") { | ||
text <- tail(text, -(min(which(text != "")) - 1)) | ||
} | ||
if (tail(text, 1) == "") { | ||
text <- head(text, max(which(text != ""))) | ||
} | ||
|
||
text | ||
} | ||
|
||
|
||
#' Get the recommended mirror for Gutenberg files | ||
#' | ||
#' Get the recommended mirror for Gutenberg files by accessing | ||
#' the wget harvest path, which is | ||
#' \url{http://www.gutenberg.org/robot/harvest?filetypes[]=txt}. | ||
#' Also set the global \code{gutenberg_mirror} option | ||
#' | ||
#' @export | ||
gutenberg_get_mirror <- function() { | ||
mirror <- getOption("gutenberg_mirror") | ||
if (!is.null(mirror)) { | ||
return(mirror) | ||
} | ||
|
||
# figure out the mirror for this location from wget | ||
message("Determining mirror for Gutenberg http://www.gutenberg.org/robot/harvest") | ||
wget_url <- "http://www.gutenberg.org/robot/harvest?filetypes[]=txt" | ||
mirror_full_url <- xml2::read_html(wget_url) %>% | ||
rvest::html_nodes("a") %>% | ||
.[[1]] %>% | ||
rvest::html_attr("href") | ||
|
||
# parse and leave out the path | ||
parsed <- urltools::url_parse(mirror_full_url) | ||
mirror <- paste0(parsed$scheme, "://", parsed$domain) | ||
|
||
# set option for next time | ||
options(gutenberg_mirror = mirror) | ||
|
||
return(mirror) | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
#' Read a file from a .zip URL | ||
#' | ||
#' Download, read, and delete a .zip file | ||
#' | ||
#' @param url URL to a .zip file | ||
read_zip_url <- function(url) { | ||
tmp <- tempfile(fileext = ".zip") | ||
download.file(url, tmp, quiet = TRUE) | ||
|
||
ret <- purrr::possibly(readr::read_lines, NULL)(tmp) | ||
unlink(tmp) | ||
|
||
ret | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
<!-- README.md is generated from README.Rmd. Please edit that file --> | ||
|
||
```{r, echo = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>", | ||
fig.path = "README-", | ||
message = FALSE | ||
) | ||
``` | ||
|
||
gutenbergr: R package to search and download public domain texts from Project Gutenberg | ||
---------------- | ||
|
||
[![Travis-CI Build Status](https://travis-ci.org/.svg?branch=master)](https://travis-ci.org/) | ||
|
||
Download and process public domain works from the [Project Gutenberg](https://www.gutenberg.org/) collection. Includes | ||
|
||
* A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein. | ||
* Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered: | ||
* `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc | ||
* `gutenberg_authors` contains information about each author, such as aliases and birth/death year | ||
* `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics | ||
|
||
### FAQ | ||
|
||
### What do I do with the text once I have it? | ||
|
||
We recommend the [tidytext](https://github.com/juliasilge/tidytext) package for tokenization and analysis! | ||
|
||
#### How were the metadata R files generated? | ||
|
||
See the [data-raw](data-raw) directory for the scripts that generate these datasets. As of now, these were generated from [the Project Gutenberg catalog](https://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog) on **April 27 2016**. | ||
|
||
There are no guarantees about how often the metadata will be updated in the package. If you're interested in works that have just been added or heavily edited on Project Gutenberg, you may want to run the scripts yourself. | ||
|
||
#### Do you respect the rules regarding robot access to Project Gutenberg? | ||
|
||
Yes! The package respects [these rules](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) and complies to the best of our ability. Namely: | ||
|
||
* Project Gutenberg allows wget to harvest Project Gutenberg using [this list of links](http://www.gutenberg.org/robot/harvest?filetypes[]=html). The gutenbergr package visits the page once to find the recommended mirror for the user's location. | ||
* We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from `http://www.gutenberg.lib.md.us/8/84/84.zip`. | ||
* We retrieve the .zip file rather than txt to minimize bandwidth on the mirror. | ||
|
||
Still, this package is *not* the right way to download the entire Gutenberg corpus (or all from a particular language). For that, follow [their recommendation](https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages) to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic for analysis. |
Oops, something went wrong.