Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
417 lines (295 sloc) 16.3 KB
---
title: The Rt of good package READMEs
date: '2019-12-03'
slug: readmes
tags:
- readme
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
cache = TRUE,
warning = FALSE)
```
A recent topic of the Write The Docs' great newsletter was [READMEs](https://www.writethedocs.org/blog/newsletter-july-2019/#readmes-on-readmes-and-other-readme-related-resources). It read _"As they’re often the first thing people see about a code project, READMEs are pretty important to get right."_. In this post, we'll share some insights around the READMEs of R packages: why they're crucial; what they usually contain; how you can best write yours. Let's dive in! :swimmer:
## Why is a good README key
As mentioned above, the WTD newsletter stated that READMEs are often the first entry point to a project. For a package you could think of other entry points such as the CRAN homepage, but the README remains quite important as seen in the poll below
```{r jonathan-carroll, echo = FALSE}
blogdown::shortcode("tweet", "969442252610191361")
```
A good README is crucial to recruit users that'll actually gain something from using your package. As written by [noffle](https://github.com/noffle/) in [the Art of README](https://github.com/noffle/art-of-readme),
> "your job, when you're doing it with optimal altruism in mind, isn't to "sell" people on your work. It's to let them evaluate what your creation does as objectively as possible, and decide whether it meets their needs or not -- not to, say, maximize your downloads or userbase."
Furthermore, you can recycle the content of your README in other venues (more on how to do that -- without copy-pasting -- later) like that vignette mentioned in the poll. If you summarize your package in a good one-liner for the top of the README,
```
> Connect to R-hub, from R
```
you can re-use it
* as Package Title in DESCRIPTION,
```
Title: Connect to 'R-hub'
```
* in the GitHub repo description,
* in your introduction at a social event (ok, maybe not).
Other parts of a pitch are good talk fodder, blog post introductions, vignette sections, etc. Therefore, the time you spend pitching your package in the best possible way is a gift that'll keep on giving, to your users and you.
## What is a good README
In the Art of README, noffle [includes a checklist](https://github.com/noffle/art-of-readme#bonus-the-readme-checklist); and the rOpenSci dev guide [features guidance about the README](https://devguide.ropensci.org/building.html#readme). Now, what about good READMEs in the wild? In this section, we'll have a look at a small sample of READMEs.
### Sampling READMEs
We shall start by merging the lists of top downloaded and trending CRAN packages one can obtain using [`pkgsearch`](https://r-hub.github.io/pkgsearch/).
```{r sample-readmes}
library("magrittr")
trending <- pkgsearch::cran_trending()
top <- pkgsearch::cran_top_downloaded()
pkglist <- unique(c(trending[["package"]], top[["package"]]))
```
This is a list of `r length(pkglist)` package names, including `r glue::glue_collapse(head(pkglist), sep = ", ", last = " and ")`. Then, again with `pkgsearch`, we'll extract their metadata, before keeping only those that have a GitHub README. More arbitrary choices. :grimacing:
```{r metapkgs}
meta <- pkgsearch::cran_packages(pkglist)
meta <- meta %>%
dplyr::mutate(URL = strsplit(URL, "\\,")) %>%
tidyr::unnest(URL) %>%
dplyr::filter(stringr::str_detect(URL, "github\\.com")) %>%
dplyr::mutate(URL = stringr::str_remove_all(URL, "\\(.*")) %>%
dplyr::mutate(URL = stringr::str_remove_all(URL, "\\#.*")) %>%
dplyr::mutate(URL = trimws(URL)) %>%
dplyr::select(Package, Title, Date, Version,
URL) %>%
dplyr::mutate(path = urltools::path(URL)) %>%
dplyr::mutate(path = stringr::str_remove(path, "\\/$")) %>%
tidyr::separate(path, sep = "\\/", into = c("owner", "repo"))
str(meta)
```
At this point we have `r nrow(meta)` packages with `r length(unique(meta$URL))` unique GitHub repo URLs, pfiew.
We'll then extract their [preferred README from the GitHub V3 API](https://developer.github.com/v3/repos/contents/#get-the-readme). Some of them won't even have one so we'll lose them from the sample.
```{r, get-readmes}
gh <- memoise::memoise(ratelimitr::limit_rate(gh::gh,
ratelimitr::rate(1, 1)))
get_readme <- function(owner, repo){
readme <- try(gh("GET /repos/:owner/:repo/readme",
owner = owner, repo = repo),
silent = TRUE)
if(inherits(readme, "try-error")){
return(NULL)
}
lines <- suppressWarnings(
readLines(
readme$download_url
)
)
if (length(lines) == 1){
sub(readme$path, lines,
readme$download_url) -> link
} else {
link <- readme$download_url
}
tibble::tibble(owner = owner,
repo = repo,
readme = list(suppressWarnings(readLines(link))))
}
readmes <- purrr::map2_df(.x = meta$owner, .y = meta$repo,
.f = get_readme)
```
The `readmes` data.frame has `r nrow(readmes)` lines so we lost a few more packages.
### Assessing README size
#### Number of lines
A first metric we'll extract is the number of lines of the README.
```{r}
count_lines <- function(readme_lines){
readme_lines %>%
purrr::discard(. == "") %>% # emtpy lines
purrr::discard(stringr::str_detect(., "\\<\\!\\-\\-")) %>% # html comments
length()
}
readmes <- dplyr::group_by(readmes, owner, repo) %>%
dplyr::mutate(lines_no = count_lines(readme[[1]])) %>%
dplyr::ungroup()
```
How long are usual READMEs? Their number of lines range from `r min(readmes$lines_no)` to `r max(readmes$lines_no)` with a median of `r median(readmes$lines_no)`.
```{r nolinesplot, fig.cap="Dot plot of the number of lines in READMEs"}
library("ggplot2")
ggplot(readmes) +
geom_histogram(aes(lines_no), binwidth = 5) +
xlab("No. of lines") +
scale_y_continuous(NULL, breaks = NULL) +
hrbrthemes::theme_ipsum(base_size = 16,
axis_title_size = 16) +
ggtitle(glue::glue("Number of lines in a sample of {nrow(readmes)} READMEs"))
```
READMEs in our sample most often don't have more than 200 lines. Now, this metric might indicate how much a potential user needs to take in and how long they need to scroll down but we shall now look into other indicators of size: the number of lines of R code, the number of words outside of code and output.
#### Other size indicators
To access the numbers we're after without using too many regular expressions, [we shall convert the Markdown content to XML via `commonmark` and use XPath to parse it](https://ropensci.org/technotes/2018/09/05/commonmark/).
```{r xml}
get_xml <- function(readme_lines){
readme_lines %>%
glue::glue_collapse(sep = "\n") %>%
commonmark::markdown_xml(normalize = TRUE,
hardbreaks = TRUE) %>%
xml2::read_xml() %>%
xml2::xml_ns_strip() -> xml
xml2::xml_replace(xml2::xml_find_all(xml, "//softbreak"),
xml2::read_xml("<text>\n</text>"))
list(xml)
}
readmes <- dplyr::group_by(readmes, owner, repo) %>%
dplyr::mutate(xml_readme = get_xml(readme[[1]])) %>%
dplyr::ungroup()
```
This is how a single README XML looks like:
```{r xmlex}
readmes$xml_readme[[1]]
```
Let's count lines of code.
```{r loc}
get_code_lines <- function(xml) {
if(is.null(xml)) {
return(NULL)
}
xml2::xml_find_all(xml, "code_block") %>%
purrr::keep(xml2::xml_attr(., "info") == "r") %>%
xml2::xml_text() %>%
length
}
loc <- readmes %>%
dplyr::group_by(repo, owner) %>%
dplyr::summarise(loc = get_code_lines(xml_readme[[1]]))
```
The number of lines of code of READMEs range from `r min(loc$loc)` (for `r sum(loc$loc == min(loc$loc))` READMEs) to `r max(loc$loc)` with a median of `r median(loc$loc)`. The README with the most lines of code is `r glue::glue("https://github.com/{loc$owner[loc$loc == max(loc$loc)]}/{loc$repo[loc$loc == max(loc$loc)]}#readme")`.
What about words in text?
```{r words}
get_wordcount <- function(xml, package) {
if(is.null(xml)) {
return(NULL)
}
xml %>%
xml2::xml_find_all("*[not(self::code_block) and not(self::html_block)]") %>%
xml2::xml_text() %>%
glue::glue_collapse(sep = " ") %>%
tibble::tibble(text = .) %>%
tidytext::unnest_tokens(word, text) %>%
nrow()
}
words <- readmes %>%
dplyr::group_by(repo, owner) %>%
dplyr::summarise(wordcount = get_wordcount(xml_readme[[1]]))
```
The number of words in READMEs range from `r min(words$wordcount)` to `r max(words$wordcount)` with a median of `r median(words$wordcount)`. The READMEs with respectively the most words and least words are `r glue::glue("https://github.com/{words$owner[words$wordcount == max(words$wordcount)]}/{words$repo[words$wordcount == max(words$wordcount)]}#readme")` and `r glue::glue("https://github.com/{words$owner[words$wordcount == min(words$wordcount)]}/{words$repo[words$wordcount == min(words$wordcount)]}#readme")`.
The README size might depend on the package interface size itself, i.e. a package with a single function/dataset probably doesn't warrant many lines. Beyond a certain interface size or complexity, one might want to make it easier on potential users by breaking up documentation into smaller articles, instead of showing all there is in one page.
Now, a helpful way to still convey information efficiently when there are more than a few things is a good README _structure_. Besides, a good structure is important for READMEs of all sizes.
### Glimpsing at README structure
To assess README structures a bit, we first need to extract headers.
```{r structure}
get_headings <- function(xml) {
if(is.null(xml)) {
return(NULL)
}
xml %>%
xml2::xml_find_all("heading") -> headings
list(tibble::tibble(text = xml2::xml_text(headings),
position = seq_along(headings),
level = xml2::xml_attr(headings, "level")))
}
structure <- readmes %>%
dplyr::group_by(repo, owner) %>%
dplyr::mutate(structure = get_headings(xml_readme[[1]])) %>%
dplyr::ungroup()
```
Here's the structure of the 42th README.
```{r structureex}
structure$structure[42][[1]]
```
We wrote an ugly long and imperfect function to visualize the structure of any of the sampled READMEs, inspired by not ugly and better code in [`fs`](https://github.com/r-lib/fs/blob/master/R/tree.R).
<details>
<summary>Click to see the function.</summary>
```{r ugly}
pc <- function(...) {
paste0(..., collapse = "")
}
print_readme_structure <- function(structure){
structure$parent <- NA
for (i in seq_len(nrow(structure))) {
possible_parents <- structure$position[structure$level < structure$level[i]
& structure$position < structure$position[i]]
if (any(possible_parents)) {
structure$parent[i] <- max(possible_parents)
} else {
structure$parent[i] <- NA
}
}
structure$parent[structure$level == min(structure$level)] <- 0
for (i in seq_len(nrow(structure))) {
if (structure$level[i] == 1) {
cat(structure$text[i])
} else {
if(structure$position[i] == max(structure$position[structure$parent==structure$parent[i]])) {
firstchar <- cli:::box_chars()$l
} else {
firstchar <- cli:::box_chars()$j
}
cat(
rep(" ", max(
0,
as.numeric(structure$level[i]) - 1
)),
pc(
firstchar,
pc(rep(
cli:::box_chars()$h,
max(
0,
as.numeric(structure$level[i]) - 1
)
))
), structure$text[i]
)
}
cat("\n")
}
}
```
</details>
In practice,
```{r r42}
print_readme_structure(structure$structure[42][[1]])
print_readme_structure(structure$structure[7][[1]])
print_readme_structure(structure$structure[1][[1]])
```
What are the most common headers?
```{r}
structure %>%
tidyr::unnest(structure) %>%
dplyr::count(text, sort = TRUE) %>%
dplyr::filter(n > 4) %>%
knitr::kable()
```
This seems to be quite in line with noffle's checklist (e.g. having installation instructions). Headers related to the background and to examples might have a title specific to the package, in which case they don't appear in the table above.
The headers are one thing, their _order_ is another one which we won't analyze in this post. In the Art of README, noffle [discusses "cognitive funneling"](https://github.com/noffle/art-of-readme#cognitive-funneling) which might help you choose an order.
## How to write a good README
The previous sections aimed at giving an overview of reasons for writing a good README and content of READMEs in the wild, this one should give a few helpful tips.
### Tools for writing and re-using content
#### R Markdown
In [noffle's checklist for a README](https://github.com/noffle/art-of-readme#bonus-the-readme-checklist) one item is "Clear, runnable example of usage". You probably want to go a step further and have "Clear, runnable, executed example of usage" for which using R Markdown is quite handy. Using R Markdown to produce your package README is our number one recommendation.
#### usethis' README templates
To create the README, you can use either [`usethis::use_readme_rmd()` (for an R Markdown README) or `usethis::use_readme_md()` (for a Markdown README)](https://usethis.r-lib.org/reference/use_readme_rmd.html). The created README file will include a few sections for you to fill in.
#### Re-use of Rmd portions in other Rmds
Then, instead of writing everything in one .Rmd and then copying entire sections of it in a vignette, you can re-use Rmd chunks as explained in [this blog post by Garrick Aden-Buie presenting an idea by Brodie Gaslam](https://www.garrickadenbuie.com/blog/dry-vignette-and-readme/). Compared to that blog post, we recommend keeping re-usable Rmd pieces in man/rmdhunks/ so that they're available to build vignettes. In the vignettes, use
````markdown
```{r child='../man/rmdhunks/child.Rmd'} `r ''`
```
````
In the README, use
````markdown
```{r child='man/rmdhunks/child.Rmd'} `r ''`
```
````
#### Re-use of Rmd portions in manual pages
Since `roxygen2`'s latest release, you can include md/Rmd in manual pages. What a handy thing for, say, your [package-level documentation](https://usethis.r-lib.org/reference/use_package_doc.html)! [Refer to `roxygen2`'s documentation](https://roxygen2.r-lib.org/articles/rd.html#including-external--rmd-md-files).
In the R files use
```r
#' @includeRmd man/rmdhunks/child.Rmd
```
#### Table of contents
If your README is a bit long you might need to add a table of contents at the top, [as done for `pkgsearch` README](https://github.com/r-hub/pkgsearch/blob/846ccec71b4dfcc4d7a72257187320c94be8e454/README.Rmd#L5). In a pkgdown website, the README does not get a table of content in the sidebar, which might be an argument for keeping it small as opposed to articles that do get a table of contents in the sidebar.
#### Hide clutter in details tags
You can use the [`details` package](https://cran.r-project.org/web/packages/details/index.html) to hide some content that'd otherwise clutter the README, but that you still want to be available upon click. You can see it in action [in `reactor` README](https://github.com/yonicd/reactor/blob/master/README.Rmd).
### How to assess a README
You should write the package README with potential users in mind, who might not understand the use case for your package, who might not know the tool you're wrapping or porting, etc. It is not easy to change perspectives, though! Having actual external people review your README, or using the audience feedback after say a talk introducing the package, can help a lot here. Ask someone on your corridor or your usual discussion venues (forums, Twitter, etc.).
## Conclusion
In this post we discussed the importance of a good package README, mentioned existing guidance, and gave some data-driven clues as to what is a usual README. We also mentioned useful tools to write a good README. For further READing we recommend the [Write the Docs' list of README-related resources](https://www.writethedocs.org/blog/newsletter-july-2019/#readmes-on-readmes-and-other-readme-related-resources) as well as [this curated list of awesome READMEs](https://github.com/matiassingers/awesome-readme). We also welcome your input below... What do _you_ like seeing in a README?
You can’t perform that action at this time.