Research on characterization
Repo containing code and data for research on characterization (Ted Underwood and David Bamman, 2015-17).
The original texts of many volumes are under copyright and could not be shared even if the size limits of this repository permitted it. So we are sharing derived data, plus metadata that would allow a researcher to retrieve those original texts from HathiTrust Research Center.
Right now, all the data provided in this repo is aggregated by year. We have not yet made available word counts broken out by volume or by character name; those will come out with our article, as will a more tightly integrated and replicable codebase. At the moment, the metadata is in /metadata and the data is in /yearlysummaries.
Scripts used to calculate confidence intervals and plot visualizations in the blog post "The Gender Balance of Fiction, 1800-2007."
A brief discussion of sources of error in the project.
Checking to see whether the rise of genre fiction might explain changes in the gender balance of the larger dataset.
Contains metadata for volumes used in this project, along with a discussion of metadata error.
Aggregated yearly word counts, broken out by author gender, by character gender, and by the grammatical role of the word. They are not broken out by the word itself. For more detailed lexical information, you would currently need to consult the vizdata folder.
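As a rough illustration of how yearly counts of this shape could be used, here is a minimal sketch that computes, for each year, the share of words attached to female characters. The column names (`year`, `authgender`, `chargender`, `role`, `wordcount`) and the sample rows are assumptions made up for this example; they are not the actual schema or data of the files in this repo, so check the real headers before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data. Column names and values are invented for
# illustration and do not come from the files in this repository.
SAMPLE = """year,authgender,chargender,role,wordcount
1850,f,f,agent,120
1850,f,m,agent,180
1850,m,f,agent,60
1850,m,m,agent,240
1851,f,f,agent,100
1851,m,m,agent,300
"""


def share_for_women(rows):
    """Return {year: fraction of counted words attached to female characters}."""
    totals = defaultdict(int)
    women = defaultdict(int)
    for row in rows:
        year = int(row["year"])
        count = int(row["wordcount"])
        totals[year] += count
        if row["chargender"] == "f":
            women[year] += count
    # Skip years with no words at all to avoid dividing by zero.
    return {
        year: women[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = share_for_women(rows)
for year, share in shares.items():
    print(year, round(share, 3))
```

To run this against a real file, replace `io.StringIO(SAMPLE)` with an open file handle and adjust the column names to match that file's header row.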
Metadata and data used in an interactive visualization.