Code and data to support "Machine Learning and Human Perspective."

tedunderwood/measureperspective

Machine Learning and Human Perspective

Code and data to support Ted Underwood, "Machine Learning and Human Perspective," accepted for publication in PMLA.

The repository has been organized to document particular figures and assertions in the article.

The subdirectory genderedperspectives/ contains code, and intermediate stages of data, used to produce figure 1. The raw data is contained in another repository (it runs to several gigabytes).

Sections 2 (Multiplying perspectives) and 3 (Measuring parallax)

These two sections of the article use shared sources of data and some shared code.

So instead of creating a separate folder for each section, I have spread the various components of the workflow across different folders (data/, metadata/, and so on) documented below.

If you're interested in understanding the immediate sources of evidence for a particular figure in the article, I would start with rplots/, which contains the R scripts actually used for visualization.

The paper trails for many passing assertions in the article--e.g., briefly cited accuracy figures--lead through the interpretations/ subfolder. For instance, arguments about the increasing blurriness of the boundary between science fiction and fantasy are documented here.

Note that several notebooks in the interpretations/ folder use a more rigorous measurement of the distance between two models than I had time to explain in the article. For a full explanation of this more rigorous metric, see "The Historical Significance of Textual Distances" and/or the experiment documented in measuredivergence/.

To fully reproduce the predictive modeling in the article, you will need word counts for volume parts. I store these in a folder called simply data/, but that folder is a little large for a GitHub repo, so I am instead providing a download link: DataForMeasuredPerspective.zip.

If you want to replicate the research process from the beginning--and perhaps develop your own independent sample--I would recommend starting with rawdata/, where I document the process of selecting the sample of books I used.

rplots/

R scripts that actually produce the figures in the article. Each script is associated with a brief pointer to its sources of data.

Scripts I used to scrape genre tags, download data, and tokenize extracted features.

Early metadata files. It should really be named "rawmetadata" but the name is in too many scripts to change at this point.

All-purpose folder covering transformations of data and, especially, metadata.

logistic/

Code for predictive modeling. main_experiment and methodological_experiment in this folder are the heart of the project.

Files produced by individual modeling runs. Files that end simply ".csv" contain predictions about individual volumes; files that end ".coefs.csv" contain the coefficients attached to individual words, and can be used to get a sense of the words that matter for a particular genre in a particular period. Files that end ".pkl" are machine-readable versions of a model; see logistic/versatiletrainer2.py to understand the format.
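As a rough illustration of how a ".coefs.csv" file might be used to see which words matter, here is a minimal Python sketch. The column names ("word", "coefficient") and the sample values are assumptions made up for this example; check an actual output file for its real layout. The sketch also assumes that positive coefficients push a volume toward the genre being modeled and negative ones push it toward the contrast set.

```python
import csv
import io

# Hypothetical stand-in for a ".coefs.csv" file. The real column layout is
# not documented here, so "word" and "coefficient" are assumed names and
# the values are invented for illustration.
SAMPLE_COEFS = """word,coefficient
spaceship,1.82
planet,1.10
the,0.01
castle,-0.95
dragon,-1.47
"""

def top_words(csv_file, n=2):
    """Return the n most positive and the n most negative (word, weight) pairs."""
    rows = [(row["word"], float(row["coefficient"]))
            for row in csv.DictReader(csv_file)]
    # Sort from most positive to most negative weight.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    most_positive = rows[:n]
    most_negative = rows[-n:][::-1]  # strongest negative weight first
    return most_positive, most_negative

positive, negative = top_words(io.StringIO(SAMPLE_COEFS))
print("pushes toward the genre:", positive)
print("pushes away from the genre:", negative)
```

To try this on a real file, replace the io.StringIO object with an open file handle and adjust the column names to match.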

This folder mostly contains files that summarize results across multiple modeling runs.

interpretations/

Several Jupyter notebooks that survey results, visualize them, and discuss them. These notebooks provide support for several assertions made in passing in the third and fourth sections of the article.

One of the challenges of this project is to figure out how we should measure the "distance" between predictive models. This task is not as straightforward as it may seem; I put scare quotes around "distance" because the measure is probably not literally a distance. I've reached a tentative conclusion, explained in the Jupyter notebook spacebetweengenres.
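To make the scare quotes concrete, here is a small Python sketch of one simple way two models could be compared. It is not the measure used in spacebetweengenres, only an illustration. It assumes each model has assigned a probability to the same set of volumes, and it treats one minus the Pearson correlation of those predictions as a dissimilarity. All names and numbers are invented. With these values the measure violates the triangle inequality, which is one way a plausible-looking measure can fail to be a distance in the strict sense.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def dissimilarity(preds_1, preds_2):
    """One minus the correlation of two models' predictions (an assumed measure)."""
    return 1.0 - pearson(preds_1, preds_2)

# Invented probabilities that three hypothetical models assign to the
# same four volumes.
model_a = [0.7, 0.3, 0.7, 0.3]
model_b = [0.7, 0.5, 0.5, 0.3]
model_c = [0.7, 0.7, 0.3, 0.3]

d_ab = dissimilarity(model_a, model_b)
d_bc = dissimilarity(model_b, model_c)
d_ac = dissimilarity(model_a, model_c)

# A true distance must satisfy d(a, c) <= d(a, b) + d(b, c).
violates_triangle = d_ac > d_ab + d_bc
print(round(d_ab, 3), round(d_bc, 3), round(d_ac, 3), violates_triangle)
```

Here models a and c are uncorrelated, while b sits halfway between them, so going "through" b looks shorter than going directly from a to c.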

These scripts are used to find passages in a book that are "surprising" to earlier models of a genre. They're used in the section of the article on "measuring parallax."
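The general idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scripts' actual method: it assumes a passage counts as "surprising" when an earlier model of the genre assigns it a low probability, and it stands in for that model with a handful of invented word weights (WEIGHTS and INTERCEPT are hypothetical).

```python
from collections import Counter
from math import exp

# Hypothetical coefficients from an "earlier" model of a genre.
# Real models would have thousands of words; these are invented.
WEIGHTS = {"rocket": 2.0, "planet": 1.5, "feelings": -1.0, "memory": -1.5}
INTERCEPT = 0.0

def genre_probability(passage):
    """Logistic score for a passage, using relative word frequencies as features."""
    words = passage.lower().split()
    score = INTERCEPT
    if words:
        counts = Counter(words)
        score += sum(WEIGHTS.get(word, 0.0) * count
                     for word, count in counts.items()) / len(words)
    return 1.0 / (1.0 + exp(-score))

def rank_by_surprise(passages):
    """Order passages from most to least surprising (lowest probability first)."""
    return sorted(passages, key=genre_probability)

passages = [
    "the rocket left the planet",
    "her memory of old feelings",
    "the planet held a memory",
]

ranked = rank_by_surprise(passages)
for passage in ranked:
    print(round(genre_probability(passage), 3), passage)
```

The passage with the lowest probability comes first, so the top of the list is where the earlier model's expectations about the genre are most strongly contradicted.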