The State Of Data On CRAN: Discovering Good Data Packages

Discovering good data packages

DOI: 10.5281/zenodo.47223

Project participants:

Most of us teach R in some way, and it is always a struggle to find suitable datasets to teach with, especially for audiences from different domains. Many packages contain data, but finding those packages and knowing what is in them is hard because the documentation is often inadequate.


  1. Make it easy to discover suitable data
  2. Write some guidance on documenting data in packages


  1. A Google Doc that describes best practices for documenting data.

Checklist of things to document.

Make sure your documentation answers as many of these questions as possible.

  • What does the data represent?
  • What format is the data in?
  • How big is the dataset?
  • Where does the data come from?
  • How has the data been processed?
  • What does the data look like?
  • How do you analyze the data?
  • Where is this data used?
  • Is there a paper, or other external resource discussing this dataset?
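
As an illustration of the checklist, here is a minimal sketch of a roxygen2 documentation block for a dataset. The dataset name `fuel_economy`, its variables, its size, and its provenance are all hypothetical placeholders, not taken from any real package; only the roxygen2 tags themselves (`@format`, `@source`, `@references`, `@examples`) are standard.

```r
# Hypothetical example: every name and value below is a placeholder.

#' Fuel economy of tested vehicles (hypothetical dataset)
#'
#' What the data represents: one row per vehicle model tested.
#' How it was processed: describe any filtering, unit conversion,
#' or aggregation applied to the raw source.
#' Where it is used: list vignettes, courses, or analyses that rely on it.
#'
#' @format A data frame with 120 rows and 3 variables:
#' \describe{
#'   \item{model}{Vehicle model name (character).}
#'   \item{mpg}{Fuel economy in miles per gallon (numeric).}
#'   \item{year}{Model year (integer).}
#' }
#' @source Where the data comes from: a URL or citation for the provider.
#' @references A paper or other external resource discussing the dataset.
#' @examples
#' # What the data looks like, and one way to analyze it
#' head(fuel_economy)
#' summary(fuel_economy$mpg)
"fuel_economy"
```

In a package, a block like this would typically live in a file such as `R/data.R`, with the data itself stored under `data/`.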
  2. A patch for usethis::use_readme_rmd() to display datasets in package README files.

  3. A flexdashboard with a searchable table that shows metadata on datasets from many CRAN packages. It has information for over 4000 datasets.
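
For a single installed package, basic dataset metadata can be collected with base R's `utils::data()`, whose `results` matrix lists each dataset's name and title. The sketch below is an assumption about how such a table could be assembled, not the dashboard's actual code; `list_datasets` is a hypothetical helper name.

```r
# Hypothetical helper: list the datasets shipped with an installed package.
list_datasets <- function(pkg) {
  # Character matrix with columns Package, LibPath, Item, Title
  res <- utils::data(package = pkg)$results
  data.frame(
    package = res[, "Package"],
    # Items can look like "beaver1 (beavers)"; keep only the object name
    dataset = sub(" .*$", "", res[, "Item"]),
    title = res[, "Title"],
    stringsAsFactors = FALSE
  )
}

# The 'datasets' package ships with every R installation
meta <- list_datasets("datasets")
head(meta)
```

Applying the same function over every installed package and row-binding the results would give one table that can be searched by name or title.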

The state of data on CRAN

What makes a good data package?

Twitter Bot


Potential Future Work

Additional Data Sources

  • Crawl Bioconductor

  • Examine inst/extdata folders
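
Examining `inst/extdata` could start from something like the sketch below. It relies on the fact that the contents of `inst/` are copied to the top level of a package at installation, so raw data files end up under `extdata/`. The helper name `list_extdata` is hypothetical.

```r
# Hypothetical helper: list raw data files an installed package ships in extdata/.
list_extdata <- function(pkg) {
  # system.file() returns "" when the directory does not exist
  dir <- system.file("extdata", package = pkg)
  if (!nzchar(dir)) {
    return(character(0))
  }
  list.files(dir, recursive = TRUE)
}

# Returns a character vector of relative file paths, possibly empty
files <- list_extdata("datasets")
```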

Additional Package Stats

  • Use GitHub URLs to pull geo-location of package maintainers

Quality Assessment

  • Scoring the quality of data in a package

  • Creating badges to advertise data quality

  • Contact the authors of packages that have data quality deficiencies