The State Of Data On CRAN: Discovering Good Data Packages

Discovering good data packages

DOI: 10.5281/zenodo.47223

Project participants:

Most of us are involved in teaching R in some way, and it is always a struggle to find suitable datasets to teach with, especially ones that work across areas of domain expertise. Many packages contain data, but finding them and knowing what is in them is difficult because the documentation is often inadequate.

Goals:

  1. Make it easy to discover suitable data
  2. Write some guidance on documenting data in packages

Deliverables:

  1. A Google Doc that describes best practices for documentation.

Checklist of things to document.

Make sure your documentation answers as many of these questions as possible.

  • What does the data represent?
  • What format is the data in?
  • How big is the dataset?
  • Where does the data come from?
  • How has the data been processed?
  • What does the data look like?
  • How do you analyze the data?
  • Where is this data used?
  • Is there a paper, or other external resource discussing this dataset?
  2. A patch for usethis::use_readme_rmd() to display datasets in package README files.

  3. A flexdashboard with a searchable table that shows metadata on datasets from many CRAN packages. It has information for over 4000 datasets.
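
As a rough illustration of the kind of metadata such a table might contain, the sketch below lists the datasets shipped with locally installed packages using base R's `data(package = ...)`. The helper name `list_datasets` is hypothetical, and this is not necessarily how the dashboard itself was built; it only covers packages already installed, whereas the project looked at packages across CRAN.

```r
# Hypothetical helper: tabulate the datasets found in installed packages.
# utils::data(package = ...) returns a list whose `results` element is a
# character matrix with columns "Package", "LibPath", "Item", and "Title".
list_datasets <- function(pkgs) {
  res <- utils::data(package = pkgs)$results
  data.frame(
    package = res[, "Package"],
    item    = res[, "Item"],
    title   = res[, "Title"],
    stringsAsFactors = FALSE
  )
}

# The built-in "datasets" package is used here only because it is always available.
ds <- list_datasets("datasets")
head(ds)
```

Passing a character vector of several package names returns one combined table, which could then be fed to a searchable table widget.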

The state of data on CRAN

What makes a good data package?

https://docs.google.com/document/d/1xhJmt0v4p49jpwINNak9N7AMMb5yohTwwNOXH8WzqqQ/edit?usp=sharing
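
Some of the checklist questions above (what format the data is in, how big it is, what it looks like) can be answered partly by inspecting the object itself. The following is a minimal sketch, assuming the dataset is a data frame; `describe_dataset` is a hypothetical helper, not part of the project, and the remaining questions (provenance, processing, related papers) still need to be written by hand.

```r
# Hypothetical helper: collect basic facts about a data frame that a
# documentation page could report.
describe_dataset <- function(x) {
  list(
    class  = class(x),                             # format of the object
    dims   = dim(x),                               # rows and columns
    bytes  = as.numeric(utils::object.size(x)),    # approximate size in memory
    types  = vapply(x, function(col) class(col)[1], character(1)),  # column types
    sample = utils::head(x, 3)                     # what the data looks like
  )
}

# mtcars is used only as a convenient built-in example.
info <- describe_dataset(datasets::mtcars)
str(info, max.level = 1)
```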

Twitter Bot

https://twitter.com/rstatsdata

Graphs

Potential Future Work

Additional Data Sources

  • Crawl Bioconductor

  • Examine inst/extdata folders
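
For the extdata idea, one possible starting point is sketched below. It relies on the fact that a package's inst/extdata folder is installed as extdata at the top level of the package, so `system.file()` can locate it. The helper name `list_extdata` is hypothetical, and the sketch only inspects packages that are already installed locally.

```r
# Hypothetical helper: list raw data files a package ships in extdata.
list_extdata <- function(pkg) {
  dir <- system.file("extdata", package = pkg)
  # system.file() returns "" when the folder (or the package) is not found
  if (!nzchar(dir)) {
    return(character(0))
  }
  list.files(dir, recursive = TRUE)
}

# Returns a character vector of relative file paths, possibly empty.
list_extdata("stats")
```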

Additional Package Stats

  • Use GitHub URLs to pull the geolocation of package maintainers

Quality Assessment

  • Scoring the quality of data in a package

  • Creating badges to advertise data quality

  • Contact the authors of packages that have data quality deficiencies