Data for Data Carpentry workshop

Data Carpentry data

This directory contains the datasets used as examples for the lessons in the datacarpentry/lessons directory. The subdirectories and their contents are listed below.


Example data from the biological sciences.

aphid data

Publication: Bahlai, C.A., Schaafsma, A.W., Lagos, D., Voegtlin, D., Smith, J.L., Welsman, J.A., Xue, Y., DiFonzo, C., Hallett, R.H., 2014. Factors inducing migratory forms of soybean aphid and an examination of North American spatial dynamics of this species in the context of migratory behavior. Agricultural and Forest Entomology. 16, 240-250

Downloaded from:

Used in: Excel lessons (datacarpentry/lessons/excel/ecology-examples)


  • Master_suction_trap_data_list_uncleaned.csv : a pre-cleaning version of the dataset
  • aphid_data_Bahlai_2014.xlsx : spreadsheet with aphid data

Portal mammals data

These data describe a small mammal community in southern Arizona over the last 35 years. They are part of a larger project studying the effects of rodents and ants on the plant community. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access the plots.

Publication: S. K. Morgan Ernest, Thomas J. Valone, and James H. Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708.

Downloaded from:

Used in: Excel, shell, R, Python and SQL lessons


  • plots.csv : a list of the experimental plot IDs and descriptions
  • species.csv : a list of the two-letter species codes and information about each species
  • surveys.csv : the full list of observations of species on plots
  • surveys-exercise-extract_month.csv : a small subset of the surveys data used in one of the Excel lessons
  • portal_mammals.sqlite : a SQLite database of the mammal data; incorporates plots.csv, species.csv and surveys.csv
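
As a rough illustration of how CSV files like these can be combined into a single SQLite database, here is a minimal Python sketch using only the standard library. It is not the procedure used to build portal_mammals.sqlite. The `load_csv` helper is hypothetical, and the column names and rows below are made-up stand-ins, not the real contents of plots.csv or surveys.csv.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, handle):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(handle)
    header = next(reader)
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


# Made-up stand-ins for the real files (assumed column names).
plots_csv = "plot_id,plot_type\n1,Control\n2,Rodent Exclosure\n"
surveys_csv = "record_id,plot_id,species_id\n1,1,DM\n2,2,PF\n3,1,DM\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "plots", io.StringIO(plots_csv))
load_csv(conn, "surveys", io.StringIO(surveys_csv))

# Count observations per plot type by joining the two tables.
rows = conn.execute(
    "SELECT plots.plot_type, COUNT(*) "
    "FROM surveys JOIN plots ON surveys.plot_id = plots.plot_id "
    "GROUP BY plots.plot_type ORDER BY plots.plot_type"
).fetchall()
print(rows)
```

To work with files on disk, pass `open("plots.csv", newline="")` in place of the `io.StringIO` objects and connect to a file path instead of `":memory:"`.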


Data used in lessons aimed at text-mining in the social sciences.


Full-text of several articles from PLOS ONE, PLOS Computational Biology, and PLOS Biology.

Downloaded from:

Used in: text-mining R lesson in lessons/R/materials/08-text_mining-R



Sample data for data cleaning exercises. Includes the words spoken by characters of different races and genders in the Lord of the Rings movie trilogy.

Publication: J.R.R. Tolkien. The Lord of the Rings. Ballantine Books, New York. Copyright 1954-1974. Volume I. The Fellowship of the Ring. Volume II. The Two Towers. Volume III. The Return of the King.

Downloaded from: jennybc on GitHub; original dataset at manyeyes

Used in: data-tidying R lesson in lessons/tidy-data


  • Male.csv: the word counts for male characters in LOTR
  • Female.csv: the word counts for female characters in LOTR
  • The_Fellowship_Of_The_Ring.csv: word counts in FOTR
  • The_Return_Of_The_King.csv: word counts in ROTK
  • The_Two_Towers.csv: word counts in TT
  • lotr_clean.tsv: original data in tidy form
  • lotr_tidy.tsv: the multi-film, tidy dataset generated at the end of the lessons
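
To give a sense of what "tidy form" means for word-count tables like these, here is a minimal Python sketch that reshapes a wide table (one column per gender) into one row per observation. The lesson itself uses R, so this is only an illustration. The `tidy` helper is hypothetical, and the column layout (Film, Race, Female, Male) and the values are assumptions, not the actual file contents.

```python
import csv
import io


def tidy(handle, id_cols, var_name, value_name):
    """Turn each non-identifier column into its own row (wide to long)."""
    tidy_rows = []
    for row in csv.DictReader(handle):
        for col, value in row.items():
            if col in id_cols:
                continue
            record = {c: row[c] for c in id_cols}
            record[var_name] = col
            record[value_name] = int(value)
            tidy_rows.append(record)
    return tidy_rows


# Placeholder data with an assumed layout; the numbers are invented.
wide_csv = "Film,Race,Female,Male\nFilm A,Elf,10,20\nFilm A,Hobbit,0,30\n"

tidy_rows = tidy(io.StringIO(wide_csv), ["Film", "Race"], "Gender", "Words")
for record in tidy_rows:
    print(record)
```

Each input row with two count columns becomes two output rows, so every row holds exactly one film, race, gender and word count.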