Data and code for analyzing language associated with fictional characters.


Research on characterization

Repo containing code and data for research on characterization (Ted Underwood and David Bamman, 2015-17).

The original texts of many volumes are under copyright and could not be shared even if the size limits of this repository permitted it. Instead, we are sharing derived data, plus metadata that would allow a researcher to retrieve the original texts from HathiTrust Research Center.

Right now, all the data provided in this repo is aggregated by year. We have not yet made available word counts broken out by volume or by character name; those will come out with our article, as will a more tightly integrated and replicable codebase. At the moment, the metadata is in /metadata and the data is in /yearlysummaries.

blogpost

Scripts used to calculate confidence intervals and plot visualizations in the blog post "The Gender Balance of Fiction, 1800-2007."
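
As a rough illustration of what calculating a confidence interval can involve, here is a minimal sketch of a percentile bootstrap for the mean of a small sample. This is an assumption for illustration only: the scripts in this folder may use a different method, and the function name `bootstrap_ci`, its parameters, and the sample values are all hypothetical.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Hypothetical helper, not taken from the repo's scripts.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    # Read off the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up yearly proportions, purely for demonstration.
sample = [0.42, 0.45, 0.39, 0.47, 0.44, 0.41, 0.46, 0.43]
lower, upper = bootstrap_ci(sample)
print(f"95% interval for the mean: {lower:.3f} to {upper:.3f}")
```

The percentile method is chosen here only because it is short and needs nothing beyond the standard library.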

error

A brief discussion of sources of error in the project.

genre_experiment

An experiment checking whether the rise of genre fiction might explain changes in the gender balance of the larger dataset.

metadata

Contains metadata for volumes used in this project, along with a discussion of metadata error.

yearlysummaries

Aggregated yearly word counts, broken out by author gender, by character gender, and by the grammatical role of the word. The counts are not broken out by the word itself. For more detailed lexical information, you currently need to consult the vizdata folder.
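
To show how counts aggregated this way might be used, here is a minimal sketch that computes, for each year, the share of words associated with women characters. The column names (`year`, `authgender`, `chargender`, `role`, `wordcount`), the gender codes `f` and `m`, and the inline sample rows are all assumptions made for illustration; check the actual files for their real layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout; the numbers are invented.
SAMPLE = """year,authgender,chargender,role,wordcount
1850,f,f,agent,120
1850,f,m,agent,80
1850,m,f,agent,50
1850,m,m,agent,150
1851,f,f,agent,100
1851,m,m,agent,300
"""


def share_of_words_about_women(rows):
    """Return {year: fraction of counted words attached to women characters}."""
    totals = defaultdict(lambda: {"f": 0, "m": 0})
    for row in rows:
        gender = row["chargender"]
        if gender in ("f", "m"):  # skip any other codes in this sketch
            totals[int(row["year"])][gender] += int(row["wordcount"])
    return {
        year: counts["f"] / (counts["f"] + counts["m"])
        for year, counts in sorted(totals.items())
        if counts["f"] + counts["m"] > 0
    }


shares = share_of_words_about_women(csv.DictReader(io.StringIO(SAMPLE)))
for year, share in shares.items():
    print(year, round(share, 3))
```

To run this against a real file, pass `csv.DictReader(open(path, newline=""))` instead of the inline sample and adjust the column names to match.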

vizdata

Metadata and data used in an interactive visualization.