An R API to search across and get full text for open access journals

  _____     .__  .__   __                   __
_/ ____\_ __|  | |  |_/  |_  ____ ___  ____/  |_
\   __\  |  \  | |  |\   __\/ __ \\  \/  /\   __\
 |  | |  |  /  |_|  |_|  | \  ___/ >    <  |  |
 |__| |____/|____/____/__|  \___  >__/\_ \ |__|
                                \/      \/

Get full text articles from (almost) anywhere


rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.

fulltext makes it easy to do text-mining by supporting the following steps:

  • Search for articles
  • Fetch articles
  • Get links for full text articles (xml, pdf)
  • Extract text from articles / convert formats
  • Collect bits of articles that you actually need
  • Download supplementary materials from papers

Additional steps we hope to include in future versions:

  • Analysis enabled via the tm package and friends
  • Visualization

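Data sources in fulltext include PLoS, BMC, Crossref, Entrez, arXiv, bioRxiv, Europe PMC, Scopus, and Microsoft (the sources reported in the ft_search() output shown under Search).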

Authorization: A number of publishers require authorization via an API key, and some use even more draconian authorization processes that involve checking IP addresses. We are working on supporting the authorization schemes of the different publishers. In the meantime, all the OA content is already easily available.
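
As a minimal sketch of one common way to keep an API key out of scripts, the key can be read from an environment variable. The helper `get_api_key()` and the variable name `FULLTEXT_DEMO_KEY` are hypothetical and not part of fulltext; check each data source's documentation for how it expects credentials to be supplied.

```r
# Hypothetical helper: read an API key from an environment variable.
# Neither the function nor the variable name comes from fulltext.
get_api_key <- function(var = "FULLTEXT_DEMO_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found; set the ", var, " environment variable",
         call. = FALSE)
  }
  key
}

# Usage: the variable is set here only for demonstration.
# Normally it would live in ~/.Renviron rather than in code.
Sys.setenv(FULLTEXT_DEMO_KEY = "my-secret-key")
key <- get_api_key()
```

Failing loudly when the key is missing tends to be easier to debug than sending an unauthenticated request and interpreting the publisher's error.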

We'd love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd

Installation

Stable version from CRAN

install.packages("fulltext")

Development version from GitHub

devtools::install_github("ropensci/fulltext")

Load library

library('fulltext')

Search

ft_search() - get metadata on a search query.

ft_search(query = 'ecology', from = 'plos')
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 39041; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]

Get full text links

ft_links() - get links for articles (xml and pdf).

res1 <- ft_search(query = 'ecology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 1 
#> [IDs] ID_28724921 ...

Or pass in DOIs directly

ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 1 
#> [IDs] ID_28724921 ...

Get full text

ft_get() - get full or partial text of articles.

ft_get('10.1371/journal.pone.0086169', from = 'plos')
#> <fulltext text>
#> [Docs] 1 
#> [Source] R session  
#> [IDs] 10.1371/journal.pone.0086169 ...

Extract chunks

library("rplos")
(dois <- searchplos(q = "*:*", fl = 'id',
   fq = list('doc_type:full',"article_type:\"research article\""), limit = 5)$data$id)
#> [1] "10.1371/journal.pone.0003649" "10.1371/journal.pone.0057589"
#> [3] "10.1371/journal.pone.0003616" "10.1371/journal.pone.0003505"
#> [5] "10.1371/journal.pone.0003677"
x <- ft_get(dois, from = "plos")
x %>% chunks("publisher") %>% tabularize()
#> $plos
#>                                     publisher
#> 1 Public Library of ScienceSan Francisco, USA
#> 2 Public Library of ScienceSan Francisco, USA
#> 3 Public Library of ScienceSan Francisco, USA
#> 4 Public Library of ScienceSan Francisco, USA
#> 5 Public Library of ScienceSan Francisco, USA
x %>% chunks(c("doi","publisher")) %>% tabularize()
#> $plos
#>                            doi                                   publisher
#> 1 10.1371/journal.pone.0003649 Public Library of ScienceSan Francisco, USA
#> 2 10.1371/journal.pone.0057589 Public Library of ScienceSan Francisco, USA
#> 3 10.1371/journal.pone.0003616 Public Library of ScienceSan Francisco, USA
#> 4 10.1371/journal.pone.0003505 Public Library of ScienceSan Francisco, USA
#> 5 10.1371/journal.pone.0003677 Public Library of ScienceSan Francisco, USA

Use dplyr to munge the data

library("dplyr")
x %>%
 chunks(c("doi", "publisher", "permissions")) %>%
 tabularize() %>%
 .$plos %>%
 select(-permissions.license)
#>                            doi                                   publisher permissions.copyright.year permissions.copyright.holder permissions.license_url
#> 1 10.1371/journal.pone.0003649 Public Library of ScienceSan Francisco, USA                       2008           Rajagovindan et al                    <NA>
#> 2 10.1371/journal.pone.0057589 Public Library of ScienceSan Francisco, USA                       2013                   Dane et al                    <NA>
#> 3 10.1371/journal.pone.0003616 Public Library of ScienceSan Francisco, USA                       2008                Bandera et al                    <NA>
#> 4 10.1371/journal.pone.0003505 Public Library of ScienceSan Francisco, USA                       2008                Brodeur et al                    <NA>
#> 5 10.1371/journal.pone.0003677 Public Library of ScienceSan Francisco, USA                       2008              Kuparinen et al                    <NA>
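
The same kind of munging can be tried without a network connection. The sketch below does not call fulltext: it rebuilds a small data frame by hand from three of the rows printed above and works on it with base R only. The column types are an assumption (years are stored as text here and converted).

```r
# Hand-built stand-in for the tabularized result, copied from the printed
# output above. Column types are an assumption, not taken from fulltext.
tab <- data.frame(
  doi = c("10.1371/journal.pone.0003649",
          "10.1371/journal.pone.0057589",
          "10.1371/journal.pone.0003616"),
  permissions.copyright.year = c("2008", "2013", "2008"),
  permissions.copyright.holder = c("Rajagovindan et al", "Dane et al",
                                   "Bandera et al"),
  stringsAsFactors = FALSE
)

# Convert the year to integer so it can be compared numerically
tab$permissions.copyright.year <- as.integer(tab$permissions.copyright.year)

# Keep only the 2008 articles, with just the DOI and copyright holder
older <- tab[tab$permissions.copyright.year == 2008L,
             c("doi", "permissions.copyright.holder")]

# Count articles per copyright year
per_year <- table(tab$permissions.copyright.year)
```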

Supplementary materials

Grab supplementary materials for (re-)analysis of data.

ft_get_si() accepts article identifiers, as well as output from ft_search() and ft_get().

catching.crabs <- read.csv(ft_get_si("10.6084/m9.figshare.979288", 2))
head(catching.crabs)
#>   trap.no. length.deployed no..crabs
#> 1        1          10 sec         0
#> 2        2          10 sec         0
#> 3        3          10 sec         0
#> 4        4          10 sec         0
#> 5        5          10 sec         0
#> 6        1           1 min         0

Cache

Full text data adds up quickly, and it can take a long time to download. That's where caching comes in. Once you have pulled down a bunch of data within an R session, you don't want to lose it if the session crashes. When you search, you will be able to (i.e., this is not ready yet) optionally cache the raw JSON/XML/etc. of each request locally. When you run that exact search again, we'll just give you the local data, unless you want new data, which you can also request.

ft_get('10.1371/journal.pone.0086169', from='plos', cache=TRUE)
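
To make the idea concrete, here is a minimal sketch of request-level caching in base R. It is not how fulltext implements `cache = TRUE`; the helper `cached_fetch()`, its arguments, and the choice of `saveRDS()` files in a temporary directory are all assumptions made for illustration.

```r
# Hypothetical helper: return the stored result for `key` if one exists,
# otherwise call `fetch()` once and store its result on disk.
cached_fetch <- function(key, fetch,
                         cache_dir = file.path(tempdir(), "ft-cache-demo"),
                         refresh = FALSE) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  # Make the key safe to use as a file name (DOIs contain "/")
  fname <- paste0(gsub("[^A-Za-z0-9._-]", "_", key), ".rds")
  path <- file.path(cache_dir, fname)
  if (!refresh && file.exists(path)) {
    return(readRDS(path))
  }
  value <- fetch()
  saveRDS(value, path)
  value
}

# Usage with a stand-in for a slow network request
calls <- 0
slow_request <- function() {
  calls <<- calls + 1
  "<article>...</article>"
}
first  <- cached_fetch("10.1371/journal.pone.0086169", slow_request)
second <- cached_fetch("10.1371/journal.pone.0086169", slow_request)  # from disk
```

Passing `refresh = TRUE` forces a new request and overwrites the stored copy, which corresponds to the "unless you want new data" case. Because the file name is a sanitized version of the key, two different keys could in principle collide; a hash of the key would be a more robust choice.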

Extract text from PDFs

Some results you find with ft_search() have full text available in text, XML, or other machine-readable formats, while others are open access but available only as PDF. This package has a series of convenience functions to help extract text from PDFs, both locally and remotely.

Locally, using code adapted from the tm package and two PDF-to-text parsing backends

pdf <- system.file("examples", "example2.pdf", package = "fulltext")
(res <- ft_extract(pdf))
#> <document>/Users/sacmac/github/ropensci/fulltext/inst/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Or extract directly into a tm Corpus

paths <- sapply(paste0("example", 2:5, ".pdf"), function(x) system.file("examples", x, package = "fulltext"))
(corpus <- ft_extract_corpus(paths))
#> $meta
#> data frame with 0 columns and 0 rows
#> 
#> $data
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 4
#> 
#> attr(,"class")
#> [1] "ft_extract"
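
When downloaded files come in mixed formats, the PDFs can be separated out before they are handed to the extraction functions above. The following is a minimal base R sketch; `split_by_format()` is a hypothetical helper, not part of fulltext, and the file names are made up.

```r
# Hypothetical helper: group file paths by lower-cased file extension
split_by_format <- function(paths) {
  ext <- tolower(tools::file_ext(paths))
  split(paths, ext)
}

files <- c("a.pdf", "b.xml", "c.PDF", "d.txt")  # made-up file names
by_format <- split_by_format(files)

# Paths that look like PDFs, regardless of extension case
pdf_paths <- by_format$pdf
```

If the paths pointed at real files, `pdf_paths` could then be passed on to `ft_extract_corpus()`.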

Extract text from a PDF remotely, using a web service called PDFX

pdf5 <- system.file("examples", "example5.pdf", package = "fulltext")
pdfx(file = pdf5)
#> $meta
#> $meta$job
#> [1] "34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044"
#>
#> $meta$base_name
#> [1] "5kpp"
#>
#> $meta$doi
#> [1] "10.7554/eLife.03640"
#>
#>
#> $data
#> <?xml version="1.0" encoding="UTF-8"?>
#> <pdfx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://pdfx.cs.man.ac.uk/static/article-schema.xsd">
#>   <meta>
#>     <job>34b281c10730b9e777de8a29b2dbdcc19f7d025c71afe9d674f3c5311a1f2044</job>
#>     <base_name>5kpp</base_name>
#>     <doi>10.7554/eLife.03640</doi>
#>   </meta>
#>    <article>
#>  .....

Contributors

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for fulltext: citation(package = 'fulltext')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
