Datasets

For the purposes of this class, we're going to be focusing on datasets that are better suited to humanities projects (data relevant to questions about literature, art, history, and culture).

General resources for humanities data:

Melanie Walsh has compiled an excellent list of links to datasets, including a dataset of Nobel Laureates (1901-2017), a dataset of dialogue in 2,000 Hollywood films, a collection of European novels, a collection of Colonial South Asian literature, network data about Game of Thrones character relationships, an archive of Donald Trump's tweets, and all the obituaries published in the NYTimes from 1852 to 2007.
Miriam Posner's list of datasets for her graduate course.
Alan Liu's list of data collections and datasets in DH Toychest
- Liu's list includes small- and medium-sized collections of texts, as well as links to larger collections of documents and images
Rutgers University Libraries' list of datasets generated from their collections that are relevant for digital humanities research

Specific Topics:

A collection of annotations from Jacques Derrida's library
A collection of tales about the Virgin Mary in Ethiopia, Eritrea, and Egypt from 1300 to the present
A collection of metadata for 2,000 novels published between 1660 and 1850, created as a part of the Early Novels Database (END)
- Full dataset
- Small collection of metadata about 25 novels and full-texts for those 25 novels
A collection of borrowing records from an English-language lending library in Paris in the 1920s and 1930s. See the Shakespeare and Company dataset and project documentation
A collection of 40,245 records of art sales from the stock books of NYC art dealer M. Knoedler & Co. (1872-1970)
- There's also a smaller subcollection of this data (4,100 records) created by Matthew Lincoln, downloadable here
A dataset from the White House tapes of the Nixon Administration, 1971-1973
A collection of items from Princeton University Art Museum datasets
- Data from the Princeton Art Museum is available via an API (an application programming interface). For instructions on how to access the interface through specific URL addresses, see the project overview page. The data is available in JSON (short for JavaScript Object Notation) format.
- Links to a zip file with sample PUAM collection data (created by Dan Brennan, PUAM web developer), which includes:
  - A CSV of metadata for items from the Meginnity Collection of Latin American art
  - A CSV of metadata for items in the collection of miniature paintings from Southeast Asia
  - A CSV of metadata for items in the African American prints exhibition
  - A CSV of metadata for items in the Clarence H. White Archives
  - A CSV of metadata for items in the Princeton Portrait Collection (more information about the collection here: https://artmuseum.princeton.edu/collections/1416)
  - A CSV of 500 items recently updated in the PUAM catalog
- Links to zip files of image collections
A dataset of 50 years of Billboard Hot 100 pop music lyrics, created by data scientist Kaylin Walker
A collection of metadata about comics in North America in the Michigan State University Library Comics Art Collection
The New York Public Library's Menu dataset
- Link to project resources and dataset here. The NYPL project page is no longer live now that the project, which asked the public to help transcribe menus from the library's collection, is complete, but a zip file of the archived dataset can still be accessed here
- To access an overview of the data and project glossary, see Curating Menus
A dataset of books checked out at the Seattle Public Library. “Checkouts by Title | City of Seattle Open Data Portal.”
“At the Circulating Library: A Database of Victorian Fiction, 1837–1901”, see also a snapshot of the dataset here
A dataset of works translated into English, created by PublishersWeekly.com
"Who Has Your Face?" - a dataset detailing which agencies have access to your data in facial recognition, put together by the Atlas of Surveillance.
- Link to download data
- Link to the data report
Torn Apart/Separados project data
- Download a modified set of the project data (courtesy of Melanie Walsh) here
- Download dataset (as well as additional project data) from the project's GitHub repository
- For more on the composition and origins of the data, scroll down to the section labeled "Data" on the project credits page.
Switching the Lens - Rediscovering Londoners of African, Caribbean, Asian and Indigenous Heritage, 1561 to 1840 project dataset
The Victorian Women Writers Project
- Includes links to a collection of full-texts by 19th-century women writers and a collection of XML-encoded versions of those texts (XML is short for Extensible Markup Language, a markup language for capturing metadata within the file)
The Black Book Interactive Project
Alan Liu's 1880s British Fiction corpus
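
Several of the datasets above are distributed as CSV or JSON files. As a rough starting point, here is a minimal Python sketch of reading both formats with the standard library. The sample data, the column names, and the `records` key are invented for illustration; they are assumptions and do not describe the layout of any dataset listed here, so check each dataset's documentation for its actual fields.

```python
import csv
import io
import json

# Hypothetical sample data standing in for a downloaded file. The column
# names and values are invented for illustration only.
SAMPLE_CSV = """title,artist,year
Untitled Portrait,Unknown,1890
Harbor Scene,Unknown,1912
"""

# Hypothetical JSON payload; the "records" key is an assumed layout.
SAMPLE_JSON = '{"records": [{"title": "Untitled Portrait", "year": 1890}]}'


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text, key="records"):
    """Parse JSON text and return the list stored under `key`."""
    return json.loads(text)[key]


rows = read_csv_rows(SAMPLE_CSV)
records = read_json_records(SAMPLE_JSON)

print(len(rows), rows[0]["title"])
print(records[0]["year"])
```

Note that the `csv` module returns every value as a string, so a column such as `year` needs converting (for example with `int(...)`) before you can sort or count by it, whereas JSON keeps numbers as numbers. To read a file from disk instead of a string, you could pass `open(path, newline="", encoding="utf-8")` to `csv.DictReader`.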

Other humanities data resources:

The Pudding's repository of data connected to their data journalism stories
- A heads up: with The Pudding, you'll want to be careful to say something new about the data beyond the original story that the dataset was assembled for.
Jeremy Singer-Vine (an independent data journalist) also runs a dataset newsletter called Data is Plural. You can view the archive of datasets or subscribe here.
- A heads up: unlike the more curated sets of humanities data above, these dataset lists are in varied formats and come with varying levels of contextual detail.
