layout | title | nav_order |
---|---|---|
default | Conclusion & Additional Resources | 6 |
Congratulations! You've just finished this workshop.
You should now be able to:
- Define topic modeling
- Use at least one tool to perform topic modeling on a text corpus
- Explain the limitations of topic modeling
To learn more about any particular topic, take a look at the links below.
In the lesson, we briefly discussed how topic modeling works without getting into the mathematical basis for the practice. David Blei gives an overview of topic modeling, with a plain-language description of latent Dirichlet allocation (LDA), in the Winter 2012 issue of the Journal of Digital Humanities. The entire issue of the journal is dedicated to topic modeling and may also be of interest!
If you wish to read more about the specifics of LDA, the seminal article by David M. Blei, Andrew Y. Ng and Michael I. Jordan is a good place to start. Or, if you prefer to head off on a tangent that might enrich your understanding, Alexandra Schofield, Måns Magnusson and David Mimno (yes, the David Mimno who is the primary maintainer of MALLET) have written a provocative paper suggesting that, in topic modeling, removing stopwords after training is as effective as removing them beforehand.
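To make the idea of removing stopwords after training concrete, here is a minimal sketch in plain Python. It is not the procedure from the paper: the `topics` dictionary, the tiny `stopwords` set, and the helper `remove_stopwords_after_training` are all hypothetical, made up for illustration. The sketch assumes a model has already been trained with stopwords left in, and that each topic is available as a list of (word, weight) pairs sorted from most to least probable.

```python
# Hypothetical output of an already-trained topic model (invented values):
# each topic is a list of (word, weight) pairs, most probable word first.
topics = {
    0: [("the", 0.09), ("whale", 0.05), ("of", 0.04), ("ship", 0.03), ("sea", 0.02)],
    1: [("and", 0.08), ("love", 0.04), ("the", 0.04), ("heart", 0.03), ("letter", 0.02)],
}

# A tiny illustrative stopword list; a real project would use a fuller one.
stopwords = {"the", "of", "and", "a", "to", "in"}


def remove_stopwords_after_training(topic_words, stopwords, top_n=3):
    """Drop stopwords from a topic's word list, then keep the top_n remaining words."""
    kept = [(word, weight) for word, weight in topic_words
            if word.lower() not in stopwords]
    return kept[:top_n]


# Filter every topic after the fact, leaving the trained model untouched.
cleaned = {
    topic_id: remove_stopwords_after_training(words, stopwords)
    for topic_id, words in topics.items()
}

for topic_id, words in cleaned.items():
    print(topic_id, [word for word, _ in words])
```

Because the filtering happens on the model's output rather than on the corpus, you can change the stopword list and re-run this step without retraining.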
If you are already familiar with Python or have enjoyed working with it, William J.B. Mattingly, a historian and digital humanist, has developed a playlist of video tutorials that go into greater depth about topic modeling, including another package (Top2Vec) that Mattingly insists is the best way to do topic modeling in Python. The script in the "Topic Modeling with Python" part of the lesson owes a debt to Mattingly's experiments with Gensim, which may have been abandoned to pursue Top2Vec!
If you would rather continue using Gensim for topic modeling, however, the Gensim documentation is worth exploring further.
We used the Python programming language for topic modeling, but if you are more familiar or comfortable with the R programming language, which is popular in academic and data science contexts, there is an abundance of resources to guide you:
- Julia Silge's video on Topic modeling with R and tidy data principles (uses the tidytext package, developed by Silge, and the stm package)
- A series of R tutorials from a team at Vrije Universiteit Amsterdam, including Fitting LDA models in R with an accompanying video (uses the quanteda and topicmodels packages)
- A notebook developed by Martin Schweinberger of the University of Queensland, if you prefer a notebook approach
- Thomas W. Jones' textmineR library, which supports LDA as well as other modeling techniques
There is much more out there if R is your language of choice!
Visualization is one modality for exploratory data analysis, but it privileges the visual sense and may not be accessible for all audiences. Shawn Graham has created a Programming Historian lesson on sonification, or the mapping of dataset features to sound. Graham demonstrates several tools for sonifying data in the lesson, and the part of the lesson that explores Sonic Pi uses the probabilistic weights of topics from a topic model, which is data that you will have available to try out from your experiments with Voyant, MALLET and Gensim.
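As a rough illustration of what "mapping dataset features to sound" can mean, the sketch below converts topic weights into MIDI note numbers with a simple linear scaling. This is not the code from Graham's lesson: the function `weight_to_midi`, the note range of 48 to 84, and the sample weights are all assumptions chosen for this example. The only fixed convention it relies on is that MIDI note 60 is middle C.

```python
def weight_to_midi(weight, low=48, high=84):
    """Linearly map a topic weight in [0, 1] to a MIDI note number in [low, high]."""
    # Clamp out-of-range values so the result always stays inside the note range.
    weight = min(max(weight, 0.0), 1.0)
    return round(low + weight * (high - low))


# Hypothetical weights of one topic across four documents (invented values).
doc_topic_weights = [0.0, 0.25, 0.5, 1.0]

# Higher topic weight -> higher pitch.
notes = [weight_to_midi(w) for w in doc_topic_weights]
print(notes)
```

A list of note numbers like this could then be passed to whichever sound tool you prefer; a stronger presence of the topic in a document would be heard as a higher pitch.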