Source code for An Interactive Topic Model of Signs (Signs@40)
Latest commit f5944a4 Nov 17, 2014

README.md

This repository holds source code for An Interactive Topic Model of Signs, part of Signs@40, a project of Signs: Journal of Women in Culture and Society. It also includes code and documentation for the creation of the topic model displayed on Signs@40. This work is by Andrew Goldstone, Susana Galán, C. Laura Lovin, Andrew Mazzaschi, and Lindsey Whitmore. We make the source code available for modification or duplication (with attribution to us) under the terms of the MIT License. See LICENSE.

The collection of scripts in the modeling subdirectory was used to create the topic model itself from a collection of full texts of Signs articles supplied by JSTOR. These full texts are not publicly available, but researchers can request full-text data sets by contacting support@jstor.org.

Beyond the missing source data, these scripts use file paths specific to the computers on which the model was generated, so they would need some modification to run on another system. We have nonetheless included them as documentation, so that our topic-modeling choices are explicit. Much of this code uses dfrtopics, an in-development R package by Andrew Goldstone that helps use MALLET from R to analyze JSTOR Data for Research datasets.

  • Metadata. Metadata was mostly as supplied by JSTOR. One issue's metadata, however, was missing. We obtained this data by exporting citations for Signs 40, no. 1 from the regular JSTOR site in RIS format, converting that data into CSV format with a small Python script, and then processing the CSV into the expected metadata format using metadata_40.1.R.
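
The RIS-to-CSV step can be sketched as follows. This is a minimal illustration, not the script actually used for Signs@40: the choice of RIS tags (TI, AU, PY) and the CSV column names are assumptions made for the example, and the sample record is invented.

```python
import csv
import io


def parse_ris(text):
    """Parse RIS text into a list of records (dicts of tag -> list of values)."""
    records, current = [], {}
    for line in text.splitlines():
        # RIS lines look like "TY  - JOUR": a two-character tag,
        # two spaces, a hyphen, then the value.
        if len(line) < 5 or line[2:5] != "  -":
            continue
        tag, value = line[:2], line[5:].strip()
        if tag == "ER":  # "ER" marks the end of a record
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


def ris_to_csv(text, columns=(("TI", "title"), ("AU", "author"), ("PY", "year"))):
    """Write one CSV row per RIS record; repeated tags are joined with '; '.

    The tag-to-column mapping is hypothetical, not the project's own.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for _, name in columns])
    for record in parse_ris(text):
        writer.writerow(["; ".join(record.get(tag, [])) for tag, _ in columns])
    return out.getvalue()


# Invented sample record, for illustration only.
SAMPLE_RIS = (
    "TY  - JOUR\n"
    "TI  - An Example Title\n"
    "AU  - Doe, Jane\n"
    "AU  - Roe, Ann\n"
    "PY  - 2014\n"
    "ER  - \n"
)

if __name__ == "__main__":
    print(ris_to_csv(SAMPLE_RIS))
```

The resulting CSV would still need to be reshaped into the metadata layout the modeling scripts expect, which is the job the text assigns to metadata_40.1.R.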

  • Featurizing. The instances_stoprefs.R script was used to build the MALLET instances file from the full texts.

    • We used MALLET to tokenize the text, opting for MALLET's default tokenization, which uses the regular expression \p{Alpha}+.
    • Our list of stop words is in stop_refs.txt. In addition to very frequent words, names, and similar, we also removed a set of words whose over-representation in article reference lists created problematic results in the modeling process. These words were suggested using the ref_words function in instances_stoprefs.R to compare the vocabulary of article texts with a set of reference list texts from Signs (also supplied by JSTOR).
    • The script also removes infrequent words (those occurring four or fewer times).
    • Documents shorter than 800 words (counted before any words are removed) are omitted.
    • Only "full-length articles" (JSTOR type fla) were included, but we hand-corrected some errors in the metadata classifications.
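
The filtering steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of instances_stoprefs.R: Python's re module has no \p{Alpha} class, so ASCII letters stand in for it (Java's default reading of \p{Alpha}, as far as we know); lowercasing, and counting rare words only after short documents and stop words are gone, are also assumptions. The function name and parameters are hypothetical.

```python
import re
from collections import Counter

# Assumption: ASCII letters stand in for \p{Alpha}, since Python's re module
# has no \p{...} character classes.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def featurize(docs, stopwords, min_doc_length=800, max_rare_count=4):
    """Map document id -> filtered token list, given a dict of id -> raw text."""
    # Tokenize; lowercasing is an assumption of this sketch.
    tokenized = {doc_id: TOKEN_RE.findall(text.lower())
                 for doc_id, text in docs.items()}
    # Omit short documents, measured before any words are removed.
    kept = {doc_id: tokens for doc_id, tokens in tokenized.items()
            if len(tokens) >= min_doc_length}
    # Remove stop words.
    kept = {doc_id: [t for t in tokens if t not in stopwords]
            for doc_id, tokens in kept.items()}
    # Remove words occurring max_rare_count or fewer times in what remains.
    counts = Counter(t for tokens in kept.values() for t in tokens)
    return {doc_id: [t for t in tokens if counts[t] > max_rare_count]
            for doc_id, tokens in kept.items()}


if __name__ == "__main__":
    # Tiny thresholds so the toy corpus exercises every filter.
    toy = {"a": "Gender gender labor 1975 the", "b": "the poem"}
    print(featurize(toy, {"the"}, min_doc_length=3, max_rare_count=1))
```

With the defaults, the thresholds match the ones described above (800 words, four or fewer occurrences); the toy call lowers them only so that a two-document example shows each filter taking effect.
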
  • Modeling parameters. MALLET is used to generate the model via the model_k70.R script. The key parameter is the number of topics, 70, which, after experimenting with many values, we found to yield the most fruitful topics for exploration and interpretation. A number of other parameters are set to default values (the starting hyperparameter values, for example). This script also sets MALLET's random seed. This is important for reproducibility: every time the script is run, exactly the same model is generated.

  • An operator error. We discovered late in our development process that one issue's worth of articles had been omitted from the model inputs. As the beta version of our visualization was already being used by commentators, we did not want to rerun the model and reshuffle all of the topics. Instead, we constructed a compatible MALLET instances file of the missing documents (modeling/instance_signs39.4.R) and used MALLET's capacity to infer topics for new documents on the basis of an existing model. This inference process is scripted at the end of model_k70.R. This might be of some interest to R and MALLET users, as it shows how to use the rJava glue to interact with MALLET's Java API from R.
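
For readers unfamiliar with the idea of inferring topics for new documents, here is a minimal Python sketch of the general technique: Gibbs sampling over one document's topic assignments while the topic-word probabilities of an existing model are held fixed. It is not MALLET's implementation and not the rJava code in model_k70.R; the function name, the alpha value, and the toy topics are all assumptions made for illustration.

```python
import random


def infer_topics(doc_tokens, topic_word_probs, alpha=0.1, iterations=200, seed=1):
    """Estimate topic proportions for one new document.

    topic_word_probs is a list of dicts (one per topic) mapping word ->
    probability; these stay fixed, so the existing topics are not reshuffled.
    """
    rng = random.Random(seed)  # fixed seed, so repeated runs agree
    n_topics = len(topic_word_probs)
    # Words unknown to the existing model cannot be assigned and are skipped.
    tokens = [w for w in doc_tokens
              if any(t.get(w, 0.0) > 0.0 for t in topic_word_probs)]
    assignments = [rng.randrange(n_topics) for _ in tokens]
    counts = [0] * n_topics
    for z in assignments:
        counts[z] += 1
    for _ in range(iterations):
        for i, word in enumerate(tokens):
            counts[assignments[i]] -= 1  # take this token out of the counts
            # Weight of topic k: (document count + alpha) * P(word | topic k).
            weights = [(counts[k] + alpha) * topic_word_probs[k].get(word, 0.0)
                       for k in range(n_topics)]
            z = rng.choices(range(n_topics), weights=weights)[0]
            assignments[i] = z
            counts[z] += 1
    total = len(tokens) + n_topics * alpha
    return [(counts[k] + alpha) / total for k in range(n_topics)]


if __name__ == "__main__":
    # Invented two-topic "model", for illustration only.
    toy_topics = [{"gender": 0.5, "labor": 0.5}, {"poem": 0.5, "novel": 0.5}]
    print(infer_topics(["gender", "labor", "gender", "unseen"], toy_topics))
```

Because only the new document's assignments are sampled, the topics learned from the original corpus are left exactly as they were, which is the property that made this approach attractive once the visualization was already in use.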

  • Browser data generation. The script browser_k70.R prepares most of the data inputs for the web-browser visualization code.

    The meta_fix.R script incorporates additional hand-curated metadata about special issues and modifies two files in the data directory: info.json and meta.csv.zip. The versions of these files in the repository (identical to those on Signs@40) have already been modified accordingly, so we have not included the additional metadata these scripts use as input.

  • Browser source. The model browser itself is written in JavaScript. The running code lives in the js subdirectory, but these files are generated by uglifyJS from the actual source in the src directory. The supplied Makefile has an uglify target.

  • Libraries. The visualization uses the following open-source libraries, included here to record the particular versions we used: d3 by Mike Bostock; bootstrap by Twitter, Inc. (customization parameters in config.json); jQuery by the jQuery Foundation; and JSZip by Stuart Knightley. This site builds on dfr-browser by Andrew Goldstone.