SerendipSlim is a visualization tool designed to help researchers explore large collections of text documents through the use of probabilistic topic modeling. SerendipSlim is an updated version of an earlier tool called Serendip, which first appeared in a publication at IEEE VAST 2014. Serendip was created by Eric Alexander and Joe Kohlmann, working as part of the Visualizing English Print project, a cross-disciplinary collaboration of computer scientists and literature scholars interested in bringing the practices of data visualization and statistical analysis to the study of historical documents.
SerendipSlim's back-end is written in Python 2.7, so a Python 2.7 installation is required to run it. We suggest a distribution like Anaconda, which comes with the additional libraries our scripts require, such as Flask (along with other useful ones like SciPy and NLTK).
Running the local server
With Python and Flask installed, SerendipSlim can be started by running the
__init__.py file from within the SerendipSlim directory:
> python __init__.py
This should show information indicating that the server is running, like this:
> python __init__.py
 * Restarting with stat
 * Debugger is active!
 * Debugger pin code: 544-249-008
 * Running on http://127.0.0.1:5001/ (Press CTRL+C to quit)
Running the program with no other arguments will start a local server that serves up any models contained within the directory
/SerendipSlim/Data/Metadata/ (including the sample model
ShakeINF_ch50, built on Shakespeare's First Folio). To serve up models located in other directories, simply pass the directory's path as a command-line argument:
> python __init__.py "C:\path\to\directory\containing\topic\models"
To be valid for Serendip, models have to be formatted in a very specific way, described below.
Interacting with SerendipSlim
Once the local server is running, researchers can view the corpus-level visualization by navigating to localhost:5001 within a web browser. From there, individual models can be selected from the dropdown menu in the top navigation bar, or controlled using the URL.
We have built a number of models on sample corpora curated by the Visualizing English Print project. Though they perform better when run from a local server rather than from ours, they can be interacted with here:
SlimCV (originally "CorpusViewer") is meant to help researchers explore collections of documents at the corpus level. At its heart is a reorderable matrix plotting topics (along the horizontal axis) against documents (along the vertical axis). The proportion of each topic within each document is indicated by the size of a circular glyph located at the corresponding intersection.
Controls for (re-)ordering, labeling, selecting, and coloring the topics and documents can be found in the control panels on the left.
Views for examining metadata and distributions from individual topics and documents can be found in the panels on the right.
SlimTV (originally "TextViewer") is meant to help researchers examine topic modeling data at a lower level of abstraction. By connecting the high-level patterns of a topic model down to individual passages, researchers can combine the practices of close and distant reading, helping them build explanations for the trends they observe.
SlimTV is centered around a tagged-text view of a single document. Individual words are highlighted with a color corresponding to the topic the model has associated with them (after the topic has been toggled on using one of the buttons in the list on the left).
On the right, a line graph plots topic density (along the horizontal axis) against position within the document (along the vertical axis). By looking for peaks and valleys in individual topic lines and clicking on the corresponding places within the visualization, researchers can navigate directly to relevant passages of text.
SerendipSlim model format
Serendip requires models to be laid out in a specific directory format:
NAME_OF_YOUR_MODEL_DIR/
    NAME_OF_FIRST_MODEL/
        TopicModel/
            HTML/
                NAME_OF_FIRST_DOC/
                    tokens.csv
                    rules.json
                NAME_OF_SECOND_DOC/
                ...
            topics_freq/
                topic_0.csv
                topic_1.csv
                ...
            topics_sal/
                topic_0.csv
                topic_1.csv
                ...
            topics_ig/
                topic_0.csv
                topic_1.csv
                ...
            theta.csv
        metadata.csv
    NAME_OF_SECOND_MODEL/
    ...
These files should be structured thus:
- theta.csv: A CSV file containing a single row per document. Each row contains alternating cells giving the topics present in the document and the proportions of those topics. For example, the row for a document containing 60% of topic 3, 25% of topic 4, and 15% of topic 6 would look like 3,0.6,4,0.25,6,0.15.
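The alternating-pairs layout described above can be parsed with a few lines of Python. This is a minimal sketch, not part of SerendipSlim itself; the function name and the use of the zero-based row index as the document index are our assumptions:

```python
import csv
import io

def parse_theta(fileobj):
    """Parse theta.csv rows into {doc_index: {topic: proportion}} dicts.

    Each row holds alternating (topic, proportion) cells, e.g. the row
    "3,0.6,4,0.25,6,0.15" describes a document that is 60% topic 3,
    25% topic 4, and 15% topic 6.
    """
    docs = {}
    for doc_index, row in enumerate(csv.reader(fileobj)):
        cells = [c for c in row if c.strip() != ""]
        # Even positions are topic ids, odd positions are proportions.
        docs[doc_index] = {int(cells[i]): float(cells[i + 1])
                           for i in range(0, len(cells), 2)}
    return docs

# Hypothetical two-document theta.csv
sample = io.StringIO(u"3,0.6,4,0.25,6,0.15\n0,1.0\n")
print(parse_theta(sample))
```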
- metadata.csv: A CSV file containing columns for each piece of human-curated metadata for the corpus. The first row of this file should contain the field names for the metadata. The first column must be id and the second column must be filename. The second row of the file should give the data type of each column, chosen from:
  - int for numerical data (e.g., "year"),
  - cat for categorical data (e.g., "genre"), and
  - str for arbitrary string data (e.g., "title" or "author").

  The rest of the rows correspond to the values for individual documents. (example)
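A quick way to catch formatting mistakes before serving a model is to check the header and type rows against the rules above. A minimal sketch (the function name and the type values chosen for the id and filename columns in the sample are our assumptions):

```python
import csv
import io

VALID_TYPES = {"int", "cat", "str"}

def check_metadata(fileobj):
    """Validate the field-name and data-type rows of a metadata.csv file."""
    rows = list(csv.reader(fileobj))
    fields, types = rows[0], rows[1]
    if fields[0] != "id" or fields[1] != "filename":
        raise ValueError("first two columns must be 'id' and 'filename'")
    bad = [t for t in types if t not in VALID_TYPES]
    if bad:
        raise ValueError("unknown data types: %s" % bad)
    return fields

# Hypothetical metadata.csv with "year" and "genre" columns
sample = io.StringIO(
    u"id,filename,year,genre\n"
    u"int,str,int,cat\n"
    u"1,hamlet.txt,1603,tragedy\n")
print(check_metadata(sample))
```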
- topic_X.csv: These topic CSV files contain word, proportion pairs for each word in each topic, in descending order. Only the topics_freq distributions are required; the other directories are for the alternative orderings of saliency and information gain (as described in the paper). (example)
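Because the rows are already sorted in descending order, pulling the top words for a topic is a simple read. A minimal sketch with hypothetical file contents (the function name and sample words are ours):

```python
import csv
import io

def load_topic(fileobj, n=5):
    """Return the top-n (word, proportion) pairs from a topic_X.csv file.

    Rows are assumed to already be sorted in descending order of proportion.
    """
    reader = csv.reader(fileobj)
    return [(word, float(prop)) for word, prop in reader][:n]

# Hypothetical topics_freq/topic_0.csv contents
sample = io.StringIO(u"love,0.041\nheart,0.025\nsweet,0.019\n")
print(load_topic(sample, n=2))
```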
- tokens.csv: There is a tokens CSV file for each document containing the tokens of the document and their corresponding tags. Each line of these files looks like
  token, tokenToMatch, endReason, topic_X. (example)
  - token is simply the unchanged token from the document.
  - tokenToMatch is the simplified token used to match it to corresponding tokens during modeling. Generally, this is done through lowercasing, but it can include more complex transformations like lemmatizing.
  - endReason is the event that cut off the token, and can have values of
    s for a space,
    c for a character like punctuation, and
    n for a newline.
  - topic_X (with X being the number of the topic) indicates the topic that this word is tagged with. This value is optional, as not every word will necessarily get tagged (e.g., stopwords will not).
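The endReason column carries enough information to rebuild the document's plain text from its tokens. Here is a minimal sketch; the mapping of each code to a separator string is our reading of the descriptions above, and the function name and sample rows are hypothetical:

```python
import csv
import io

# Text each endReason code stands for when rebuilding the document:
# "s" = space, "n" = newline, "c" = the cutting character itself appears
# as the next token, so no separator is inserted.
SEPARATORS = {"s": " ", "n": "\n", "c": ""}

def rebuild_text(fileobj):
    """Reconstruct a document's plain text from its tokens.csv rows.

    Each row is token, tokenToMatch, endReason[, topic_X]; the optional
    topic column is ignored here.
    """
    pieces = []
    for row in csv.reader(fileobj):
        token, end_reason = row[0], row[2]
        pieces.append(token)
        pieces.append(SEPARATORS.get(end_reason, ""))
    return "".join(pieces)

# Hypothetical four-token document
sample = io.StringIO(u"To,to,s\nbe,be,s\nor,or,s\nnot,not,n\n")
print(repr(rebuild_text(sample)))
```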
- rules.json: A JSON object telling SlimTV details about the distribution of topic tags in this document. (example)
Building new models
Serendip can display a variety of models, so long as they conform to the above format. Sample scripts that build models using Mallet and Gensim can be found in our VEP_TMScripts repository. These scripts are provided AS IS and may need to be tuned or updated to create models in certain environments or on certain documents.
Updates, requests, and contact:
SerendipSlim offers many new features not available in the original Serendip, along with a huge upgrade in speed, scale, and ease of use. However, some rarely used features were dropped. If you have requests for the return of certain features, or suggestions for other features, improvements, or bug fixes, let me know!