---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(magrittr)
```
# parsermd
<!-- badges: start -->
<!-- badges: end -->
The goal of parsermd is to extract the content of an R Markdown file to allow for programmatic interactions with the document's contents (i.e. code chunks and markdown text). The package aims to capture the fundamental structure of the document, and as such we do not attempt to parse every detail of the Rmd. Specifically, the YAML front matter, markdown text, and R code are read as text lines, allowing them to be processed using other tools.
## Installation
`parsermd` can be installed from CRAN with:
```r
install.packages("parsermd")
```
You can install the latest development version of `parsermd` from [GitHub](https://github.com/rundel/parsermd) with:
```r
remotes::install_github("rundel/parsermd")
```
```{r}
library(parsermd)
```
## Parsing Rmds
This basic example shows the abstract syntax tree (AST) that results from parsing a simple Rmd file:
```{r example}
rmd = parsermd::parse_rmd(system.file("minimal.Rmd", package = "parsermd"))
```
The R Markdown document is parsed and stored in a flat, ordered list object containing tagged elements. By default the package will present a hierarchical view of the document where chunks and markdown text are nested within headings, which is shown by the default print method for `rmd_ast` objects.
```{r tree}
print(rmd)
```
If you would prefer to see the underlying flat structure, this can be printed by setting `use_headings = FALSE` with `print`.
```{r no_headings}
print(rmd, use_headings = FALSE)
```
Additionally, to ease manipulation of the AST, the package supports transforming the object into a tidy tibble with `as_tibble` or `as.data.frame` (both return a tibble).
```{r tibble}
as_tibble(rmd)
```
It is also possible to convert these data frames back into an `rmd_ast`.
```{r as_ast}
as_ast( as_tibble(rmd) )
```
Finally, we can also convert the `rmd_ast` back into an R Markdown document via `as_document`:
```{r as_doc}
cat(
  as_document(rmd),
  sep = "\n"
)
```
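As a minimal sketch of the full round trip, the example below writes a tiny Rmd to a temporary file, parses it, writes the result of `as_document` back to disk, and parses that file again. The document contents, the `plot-hist` chunk label, and the temporary file paths are made up for illustration, and the final comparison only checks that the sequence of node types survives the round trip, not that the text is byte-for-byte identical.

```r
library(parsermd)

# Build a tiny Rmd on disk; the contents here are hypothetical.
# The chunk fence is assembled with strrep() only to avoid writing
# literal backtick fences inside this example.
fence = strrep("`", 3)
src = c(
  "# Intro",
  "",
  "Some text.",
  "",
  paste0(fence, "{r plot-hist}"),
  "hist(rnorm(100))",
  fence
)
path_in = tempfile(fileext = ".Rmd")
writeLines(src, path_in)

# Parse the file into an rmd_ast
ast = parse_rmd(path_in)

# as_document() returns character text, so it can be written straight back out
path_out = tempfile(fileext = ".Rmd")
writeLines(as_document(ast), path_out)

# Re-parse the written file and compare the sequence of node types
ast2 = parse_rmd(path_out)
same_types = identical(as_tibble(ast)$type, as_tibble(ast2)$type)
same_types
```

Comparing node types rather than raw text is a deliberate choice here, since whitespace may be normalized when the document is regenerated.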
## Working with the AST
Once we have parsed an R Markdown document, there are a variety of things that we can do with the resulting AST. Below we demonstrate some of the basic functionality within `parsermd` for manipulating and editing these objects, as well as checking their properties.
```{r}
rmd = parse_rmd(system.file("hw01-student.Rmd", package="parsermd"))
rmd
```
Say we are interested in examining the solution a student entered for Exercise 1. We can access it using the `rmd_select` function and its selection helper functions, specifically the `by_section` helper.
```{r}
rmd_select(rmd, by_section( c("Exercise 1", "Solution") ))
```
To view the content instead of the AST we can use the `as_document()` function,
```{r}
rmd_select(rmd, by_section( c("Exercise 1", "Solution") )) %>%
  as_document()
```
Note that this gives us the *Exercise 1* and *Solution* headings as well as the contained markdown text. If we only want the markdown text, we can refine our selector to include only nodes with the type `rmd_markdown` via the `has_type` helper.
```{r}
rmd_select(rmd, by_section(c("Exercise 1", "Solution")) & has_type("rmd_markdown")) %>%
  as_document()
```
This approach uses the tidyselect `&` operator within the selection to find the intersection of the selectors `by_section(c("Exercise 1", "Solution"))` and `has_type("rmd_markdown")`. Alternatively, the same result can be achieved by chaining multiple `rmd_select` calls together,
```{r}
rmd_select(rmd, by_section(c("Exercise 1", "Solution"))) %>%
  rmd_select(has_type("rmd_markdown")) %>%
  as_document()
```
### Wildcards
One useful feature of the `by_section()` and `has_label()` selection helpers is that they support [glob](https://en.wikipedia.org/wiki/Glob_(programming)) style pattern matching. As such we can do the following to extract all of the solutions from our document:
```{r}
rmd_select(rmd, by_section(c("Exercise *", "Solution")))
```
Similarly, if we wanted to just extract the chunks that involve plotting we can match for chunk labels with a "plot" prefix,
```{r}
rmd_select(rmd, has_label("plot*"))
```
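To illustrate what glob-style matching means for a pattern such as `"plot*"`, here is a small sketch using base R's `utils::glob2rx`, which translates a glob into a regular expression. This only demonstrates glob semantics in general; it is not a claim about how `parsermd` implements its matching internally, and the chunk labels below are hypothetical.

```r
# Hypothetical chunk labels, for illustration only
labels = c("plot-dist", "plot-inc", "load-data", "setup")

# Translate the glob into an anchored regular expression
pattern = utils::glob2rx("plot*")

# Keep only the labels that match the glob
matched = labels[grepl(pattern, labels)]
matched
#> [1] "plot-dist" "plot-inc"
```

The `*` stands for any run of characters, so only labels beginning with `plot` are kept.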
### AST as a tibble
As mentioned earlier, the AST can also be represented as a tibble, in which case we construct several columns using the properties of the AST (sections, type, and chunk label).
```{r}
tbl = as_tibble(rmd)
tbl
```
All of the functions above also work with this tibble representation and allow for the same manipulations of the underlying AST.
```{r}
rmd_select(tbl, by_section(c("Exercise *", "Solution")))
```
As the complete AST is stored directly in the `ast` column, we can also manipulate this tibble using dplyr or similar packages and have these changes persist. For example, we can use the `rmd_node_length` function to return the number of lines in each node of the AST and add a new `lines` column to our tibble.
```{r}
tbl_lines = tbl %>%
  dplyr::mutate(lines = rmd_node_length(ast))
tbl_lines
```
Now we can apply a `rmd_select` to this updated tibble
```{r}
rmd_select(tbl_lines, by_section(c("Exercise 2", "Solution")))
```
and see that our new `lines` column is maintained.
Note that using the `rmd_select` function is optional here; we can accomplish the same task using `dplyr::filter` or any similar approach:
```{r}
tbl_lines %>%
  dplyr::filter(sec_h3 == "Exercise 2", sec_h4 == "Solution")
```
As such, it is possible to mix and match between `parsermd`'s built-in functions and any of your other preferred data manipulation packages.
One small note of caution: when converting back to an AST with `as_ast`, or to a document with `as_document`, only the structure of the `ast` column matters, so changes made to the section columns, the `type` column, or the `label` column will not affect the output in any way. This is particularly important when headings are filtered out, as their values may still appear in the section columns of the tibble even though the heading nodes are no longer in the AST. `rmd_select` attempts to avoid this by recalculating these specific columns as part of the subsetting process.
```{r}
tbl %>%
  dplyr::filter(sec_h3 == "Exercise 2", sec_h4 == "Solution", type == "rmd_chunk")
```
<br/>
```{r}
tbl %>%
  dplyr::filter(sec_h3 == "Exercise 2", sec_h4 == "Solution", type == "rmd_chunk") %>%
  as_document() %>%
  cat(sep="\n")
```
<br/>
```{r}
tbl %>%
  rmd_select(by_section(c("Exercise 2", "Solution")) & has_type("rmd_chunk")) %>%
  as_document() %>%
  cat(sep="\n")
```