  _____     .__  .__   __                   __
_/ ____\_ __|  | |  |_/  |_  ____ ___  ____/  |_
\   __\  |  \  | |  |\   __\/ __ \\  \/  /\   __\
 |  | |  |  /  |_|  |_|  | \  ___/ >    <  |  |
 |__| |____/|____/____/__|  \___  >__/\_ \ |__|
                                \/      \/

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Get full text articles from lots of places

Check out the fulltext manual to get started.


rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.

fulltext makes it easy to do text-mining by supporting the following steps:

  • Search for articles - ft_search
  • Fetch articles - ft_get
  • Get links for full text articles (xml, pdf) - ft_links
  • Extract text from articles / convert formats - ft_extract
  • Collect bits of articles that you actually need - ft_chunks/ft_tabularize
  • Collect all texts into a data.frame - ft_table
  • Download supplementary materials from papers - ft_get_si

It's easy to go from the outputs of ft_get to text-mining packages such as tm and quanteda.

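Data sources in fulltext include those reported in the ft_search() output below: PLoS, BMC, Crossref, Entrez, arxiv, biorxiv, Europe PMC, Scopus, and Microsoft.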

Authorization: A number of publishers require authorization via API key, and some use even more draconian authorization processes that involve checking IP addresses. We are working on supporting the various authorization schemes for different publishers, but all the open access (OA) content is already easily available.
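
Because several publishers require an API key, it can help to check which keys are configured before starting a long job. The sketch below uses only base R. The environment variable names are illustrative assumptions, not taken from this README, so consult the fulltext manual for the names each data source actually reads. `missing_keys()` is a hypothetical helper.

```r
# Hypothetical helper: return the names of environment variables
# that are unset or empty.
missing_keys <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  unname(vars[!nzchar(values)])
}

# Assumed variable names, for illustration only.
key_vars <- c("ELSEVIER_SCOPUS_KEY", "CROSSREF_TDM")

absent <- missing_keys(key_vars)
if (length(absent) > 0) {
  message("No API key set for: ", paste(absent, collapse = ", "))
}
```

Keys kept in environment variables (for example in `.Renviron`) stay out of scripts that get shared or committed.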

We'd love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd

Important Note: Supplementary data from papers is being moved to the suppdata package. Once suppdata is on CRAN, we'll deprecate the ft_get_si function here; after that, suppdata will focus on supplementary materials and fulltext will focus on the papers themselves.

Installation

Stable version from CRAN

install.packages("fulltext")

Development version from GitHub

devtools::install_github("ropensci/fulltext")

Load library

library('fulltext')

Search

ft_search() - get metadata on a search query.

ft_search(query = 'ecology', from = 'crossref')
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 152831; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]

Get full text links

ft_links() - get links for articles (xml and pdf).

res1 <- ft_search(query = 'ecology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 4 
#> [IDs] ID_30082897 ID_30082725 ID_30082706 ID_30042191 ...

Or pass in DOIs directly

ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 4 
#> [IDs] ID_30082897 ID_30082725 ID_30082706 ID_30042191 ...
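
The `res1$entrez$data$doi` expression above suggests that each source's results sit under `<source>$data`. The following is a small sketch of pulling DOIs out of a search result for reuse. `dois_from_search()` is a hypothetical helper, not part of fulltext, and it assumes the chosen source's data has a `doi` column, which may not hold for every source.

```r
# Hypothetical helper (not part of fulltext): pull the unique, non-missing
# DOIs for one source out of an ft_search()-style result.
dois_from_search <- function(res, source) {
  dat <- res[[source]]$data
  # Return an empty vector when the source is absent or has no doi column.
  if (is.null(dat) || !("doi" %in% names(dat))) {
    return(character(0))
  }
  dois <- as.character(dat$doi)
  unique(dois[!is.na(dois) & nzchar(dois)])
}

# Possible usage, assuming fulltext is installed and the network is reachable:
# res1 <- ft_search(query = 'ecology', from = 'entrez', limit = 5)
# ft_links(dois_from_search(res1, "entrez"), from = "entrez")
```

Dropping missing and duplicated DOIs first avoids passing empty identifiers on to later steps.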

Get full text

ft_get() - get full or partial text of articles.

ft_get('10.7717/peerj.228')
#> <fulltext text>
#> [Docs] 1 
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext 
#> [IDs] 10.7717/peerj.228 ...

Extract chunks

x <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
x %>% ft_collect() %>% ft_chunks("publisher") %>% ft_tabularize()
#> $elife
#>                          publisher
#> 1 eLife Sciences Publications, Ltd
#> 2 eLife Sciences Publications, Ltd

Get multiple fields at once

x %>% ft_collect() %>% ft_chunks(c("doi","publisher")) %>% ft_tabularize()
#> $elife
#>                   doi                        publisher
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd
#> 2 10.7554/eLife.32763 eLife Sciences Publications, Ltd

Use dplyr to munge the data

library("dplyr")
x %>%
  ft_collect() %>% 
  ft_chunks(c("doi", "publisher", "permissions")) %>%
  ft_tabularize() %>%
  .$elife %>%
  select(-permissions.license, -permissions.license_url)
#>                   doi                        publisher
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd
#> 2 10.7554/eLife.32763 eLife Sciences Publications, Ltd
#>   permissions.copyright.statement permissions.copyright.year
#> 1              © 2014, Zhao et al                       2014
#> 2            © 2017, Mhatre et al                       2017
#>   permissions.copyright.holder permissions.free_to_read
#> 1                   Zhao et al                     <NA>
#> 2                 Mhatre et al

Supplementary materials

Grab supplementary materials for (re-)analysis of data

ft_get_si() accepts article identifiers as well as output from ft_search() and ft_get().

catching.crabs <- read.csv(ft_get_si("10.6084/m9.figshare.979288", 2))
head(catching.crabs)
#>   trap.no. length.deployed no..crabs
#> 1        1          10 sec         0
#> 2        2          10 sec         0
#> 3        3          10 sec         0
#> 4        4          10 sec         0
#> 5        5          10 sec         0
#> 6        1           1 min         0

Extract text from PDFs

Some results you find with ft_search() have full text available in text, XML, or other machine-readable formats; others may be open access but available only as PDF. This package has a series of convenience functions to help extract text from PDFs, both locally and remotely.

Locally, using code adapted from the tm package and two PDF-to-text parsing backends:

pdf <- system.file("examples", "example2.pdf", package = "fulltext")
ft_extract(pdf)
#> <document>/Library/Frameworks/R.framework/Versions/3.5/Resources/library/fulltext/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Interoperability with other packages downstream

# Store downloaded files in a known cache directory
cache_options_set(path = (td <- 'foobar'))
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), type = "pdf")
# The remaining steps require the readtext and quanteda packages to be installed
library(readtext)
# Read every cached PDF into a readtext data.frame
x <- readtext::readtext(file.path(cache_options_get()$path, "*.pdf"))
library(quanteda)
# Build a quanteda corpus from the extracted text
quanteda::corpus(x)

Contributors

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for fulltext: citation(package = 'fulltext')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
