This repository has been archived by the owner on Sep 9, 2022. It is now read-only.
/
README.Rmd
155 lines (115 loc) · 4.21 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
```{r echo=FALSE}
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
warning = FALSE,
message = FALSE,
cache.path = "inst/cache/"
)
```
pubchunks
=========
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![cran checks](https://cranchecks.info/badges/worst/pubchunks)](https://cranchecks.info/pkgs/pubchunks)
[![R-check](https://github.com/ropensci/pubchunks/workflows/R-check/badge.svg)](https://github.com/ropensci/pubchunks/actions?query=workflow%3AR-check)
[![codecov](https://codecov.io/gh/ropensci/pubchunks/branch/master/graph/badge.svg)](https://codecov.io/gh/ropensci/pubchunks)
[![rstudio mirror downloads](https://cranlogs.r-pkg.org/badges/pubchunks)](https://github.com/r-hub/cranlogs.app)
[![cran version](https://www.r-pkg.org/badges/version/pubchunks)](https://cran.r-project.org/package=pubchunks)
## Get chunks of XML articles
## Package API
```{r echo=FALSE, comment=NA, results='asis'}
cat(paste(" -", paste(getNamespaceExports("pubchunks"), collapse = "\n - ")))
```
The main workhorse function is `pub_chunks()`. It allows you to pull out sections of articles from many different publishers (see next section below) WITHOUT having to know how to parse/navigate XML. XML has a steep learning curve, and can require quite a bit of Googling to sort out how to get to different parts of an XML document.
The other main function is `pub_tabularize()` - which takes the output of `pub_chunks()` and coerces into a data.frame for easier downstream processing.
## Supported publishers/sources
- eLife
- PLOS
- Entrez/Pubmed
- Elsevier
- Hindawi
- Pensoft
- PeerJ
- Copernicus
- Frontiers
- F1000 Research
If you know of other publishers or sources that provide XML let us know by [opening an issue](https://github.com/ropensci/pubchunks/issues).
We'll continue adding additional publishers.
## Installation
Stable version
```{r eval=FALSE}
install.packages("pubchunks")
```
Development version from GitHub
```{r eval=FALSE}
remotes::install_github("ropensci/pubchunks")
```
Load library
```{r}
library('pubchunks')
```
## Working with files
```{r}
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
package = "pubchunks")
```
```{r}
pub_chunks(x, "abstract")
pub_chunks(x, "title")
pub_chunks(x, "authors")
pub_chunks(x, c("title", "refs"))
```
The output of `pub_chunks()` is a list with an S3 class `pub_chunks` to make
internal work in the package easier. You can easily see the list structure
by using `unclass()`.
## Working with the xml already in a string
```{r}
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
```
## Working with xml2 class object
```{r}
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
```
## Working with output of fulltext::ft_get()
```{r eval=FALSE}
install.packages("fulltext")
```
```{r}
library("fulltext")
x <- fulltext::ft_get('10.1371/journal.pone.0086169')
pub_chunks(fulltext::ft_collect(x), sections="authors")
```
## Coerce pub_chunks output into data.frame's
```{r}
x <- system.file("examples/elife_1.xml", package = "pubchunks")
res <- pub_chunks(x, c("doi", "title", "keywords"))
pub_tabularize(res)
```
## Get a random XML article
```{r cache=TRUE}
library(rcrossref)
library(dplyr)
res <- cr_works(filter = list(
full_text_type = "application/xml",
license_url="http://creativecommons.org/licenses/by/4.0/"))
links <- bind_rows(res$data$link) %>% filter(content.type == "application/xml")
download.file(links$URL[1], (i <- tempfile(fileext = ".xml")))
pub_chunks(i)
download.file(links$URL[13], (j <- tempfile(fileext = ".xml")))
pub_chunks(j)
download.file(links$URL[20], (k <- tempfile(fileext = ".xml")))
pub_chunks(k)
```
```{r echo=FALSE}
unlink(i)
unlink(j)
unlink(k)
```
## Meta
* Please [report any issues or bugs](https://github.com/ropensci/pubchunks/issues).
* License: MIT
* Get citation information for `pubchunks`: `citation(package = 'pubchunks')`
* Please note that this package is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/). By contributing to this project, you agree to abide by its terms.