Data Carpentry data

This directory contains the datasets used as examples for the lessons in the datacarpentry/lessons directory. The subdirectories and their contents are listed below.


Example data from the biological sciences.

aphid data

Publication: Bahlai, C.A., Schaafsma, A.W., Lagos, D., Voegtlin, D., Smith, J.L., Welsman, J.A., Xue, Y., DiFonzo, C., Hallett, R.H., 2014. Factors inducing migratory forms of soybean aphid and an examination of North American spatial dynamics of this species in the context of migratory behavior. Agricultural and Forest Entomology. 16, 240-250.

Downloaded from:

Used in: Excel lessons (datacarpentry/lessons/excel/ecology-examples)


  • Master_suction_trap_data_list_uncleaned.csv : a pre-cleaning version of the dataset
  • aphid_data_Bahlai_2014.xlsx : spreadsheet with aphid data

Portal mammals data

These data describe a small mammal community in southern Arizona, sampled over the last 35 years. They are part of a larger project studying the effects of rodents and ants on the plant community. The rodents are sampled on a series of 24 plots, with experimental manipulations that control which rodents can access each plot.

Publication: S. K. Morgan Ernest, Thomas J. Valone, and James H. Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708.

Downloaded from:

Used in: Excel, shell, R, Python and SQL lessons


  • plots.csv : a list of the experimental plot IDs and descriptions
  • species.csv : a list of the two-letter species codes and information about each species
  • surveys.csv : the full list of observations of species on plots
  • surveys-exercise-extract_month.csv : a small subset of the surveys data used in one of the Excel lessons
  • portal_mammals.sqlite : a SQLite database of the mammal data; incorporates plots.csv, species.csv and surveys.csv
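
As a rough illustration of how a database like portal_mammals.sqlite could be assembled from the three CSV files, here is a minimal sketch using only the Python standard library. It is not the procedure actually used to build the shipped database. The function names (load_csv_into_sqlite, build_database) are hypothetical, each table is simply named after its file, all columns are created without declared types, and empty fields are stored as NULL.

```python
import csv
import sqlite3
from pathlib import Path


def load_csv_into_sqlite(conn, csv_path, table_name=None):
    """Create one table from a CSV header row and insert every data row.

    Assumption: every row has the same number of fields as the header.
    """
    csv_path = Path(csv_path)
    table = table_name or csv_path.stem  # e.g. "surveys" for surveys.csv
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        # Store empty fields as NULL rather than as empty strings.
        rows = ([value if value != "" else None for value in row] for row in reader)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return table


def build_database(db_path, csv_paths):
    """Load each CSV file into its own table and return the open connection."""
    conn = sqlite3.connect(db_path)
    for path in csv_paths:
        load_csv_into_sqlite(conn, path)
    return conn
```

Calling build_database("portal_copy.sqlite", ["plots.csv", "species.csv", "surveys.csv"]) from this directory would write a new file (the output name is only an example) with one table per CSV. The call fails if a table of the same name already exists, so point it at a fresh file.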


Data used in lessons aimed at text-mining in the social sciences.


Full-text of several articles from PLOS ONE, PLOS Computational Biology, and PLOS Biology.

Downloaded from:

Used in: text-mining R lesson in lessons/R/materials/08-text_mining-R



Sample data for data cleaning exercises. Includes the number of words spoken by characters of different races and genders in the Lord of the Rings movie trilogy.

Publication: J.R.R. Tolkien. The Lord of the Rings. Ballantine Books, New York. Copyright 1954-1974. Volume I. The Fellowship of the Ring. Volume II. The Two Towers. Volume III. The Return of the King.

Downloaded from: jennybc on GitHub; original dataset at manyeyes

Used in: data-tidying R lesson in lessons/tidy-data


  • Male.csv : the word counts for male characters in LOTR
  • Female.csv : the word counts for female characters in LOTR
  • The_Fellowship_Of_The_Ring.csv : word counts in FOTR
  • The_Return_Of_The_King.csv : word counts in ROTK
  • The_Two_Towers.csv : word counts in TT
  • lotr_clean.tsv : original data in tidy form
  • lotr_tidy.tsv : the multi-film, tidy dataset generated at the end of the lessons
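
To give a sense of what "tidy form" means for these files, here is a small Python sketch of the reshaping step. The lesson itself uses R, so this is only an illustration, not the lesson's code. It assumes that each per-film file has the columns Film, Race, Female and Male, and that the tidy result has one row per film, race and gender with columns named Gender and Words. These column names are assumptions, not something this README states, and the function names are hypothetical.

```python
import csv


def melt_rows(rows, id_columns, value_columns, var_name="Gender", value_name="Words"):
    """Turn one wide row per (film, race) into one long row per value column."""
    tidy = []
    for row in rows:
        for column in value_columns:
            record = {key: row[key] for key in id_columns}
            record[var_name] = column              # the old column name becomes a value
            record[value_name] = int(row[column])  # the old cell becomes the count
            tidy.append(record)
    return tidy


def tidy_film_files(paths, id_columns=("Film", "Race"), value_columns=("Female", "Male")):
    """Read several per-film CSV files and stack their tidy rows into one list."""
    tidy = []
    for path in paths:
        with open(path, newline="") as handle:
            tidy.extend(melt_rows(csv.DictReader(handle), id_columns, value_columns))
    return tidy
```

Under those assumptions, passing the three per-film file names to tidy_film_files would give a list of dictionaries that csv.DictWriter could write out with a tab delimiter.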