This course introduces students to the knowledge discovery process and methods used to mine patterns from a collection of text. We will critically review text mining methods developed in the knowledge discovery and databases, information science, and computational linguistics communities. Students will develop proficiency with modeling text through individual projects.
How can computers read? When we look at a paragraph of text, we have a set of skills to understand and interpret it: What is the message? Is it an argument? What is the sentiment? Computers don't have the same context or literacy. Their language is quantitative. Through text mining, this course will equip you with the skills to understand text through computing.
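As a small illustration of what "quantitative" means here, the sketch below turns a sentence into word counts, one of the simplest numeric views of text. The function name `word_counts`, the tokenizing pattern, and the sample sentence are made up for this illustration and are not taken from the course materials.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    # Assumption: a "word" is any run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


sample = "The cat sat on the mat. The mat was warm."
counts = word_counts(sample)

# Show the three most frequent tokens with their counts.
print(counts.most_common(3))  # [('the', 3), ('mat', 2), ('cat', 1)]
```

The counts say nothing about what the sentence means, but they can be computed the same way for one sentence or for millions of documents.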
Text mining is most useful in the new affordances that it allows. In most cases, the tools of text mining aren't meant to replace 'close reading'; they give us new ways to ask questions - about literature, news, scholarship, correspondence, etc. - and are best applied in service of that novelty. Computing allows for:
- Scale: Computers compare poorly to us in their ability to interpret meaning, but the things they can do may be applied at enormous scales. If you're interested in hundreds of books, thousands of web pages, or millions of tweets, simply reading them is infeasible.
- Re-contextualization: With text mining, you take apart texts and put them together in new ways. These give you new ways to understand information in a text or appreciate a book. Likewise, breaking down text into data also provides new comparative or critical tools. For example, we can understand what makes Jane Austen's books different from those of her contemporaries, or attribute authorship for anonymous or pseudonymous writing.
- Summarization: Aggregation, extraction, and visualization all serve to report patterns to you. For example, text summarization models can extract the takeaway points from a set of medical literature.
A few final notes on course philosophy.
First, the broad view of text mining can encompass many disciplinary approaches. This course hews closely to the sub-area referred to as text analysis, which treats text mining in the service of qualitative questions. This is closest to the treatments in the digital humanities and computational social sciences.
For this course, you will be expected to learn new programming skills. Note that this is not a programming course. We will cover a subset of skills in Python that pertain to data science. Most of the time, your needs will be served by tinkering with and modifying code examples that I provide for you.
I understand the time constraints of being a student. To account for the time you will spend in this course learning new tools and writing code, I have tried to keep reading and writing loads reasonable.
Success in this course comes through many little steps. The assignments are small but frequent. If you look at the entire outline of ideas and skills in this course, it may seem overwhelming. However, if you go one step at a time, learning the language of text mining won't be scary.
Prerequisite: an introductory-level database and programming course, or permission of the instructor.
This course incorporates readings from a variety of sources. Readings will be openly accessible and posted on or linked from the course website. In addition to individual essays and papers, we will also return repeatedly to the following texts:
- Art of Literary Text Analysis - Stefan Sinclair, 2015-
- Introduction to Information Retrieval - Manning, Raghavan, and Schütze, 2008
- Speech and Language Processing, 3rd edition - Dan Jurafsky and James H. Martin, 2017
- Search Engines: Information Retrieval in Practice - Croft, Metzler and Strohman, 2009
- Week 1: Introduction
- Week 2: Fundamentals
- Week 3: Features
- Week 4: Text Mining for Art and Criticism
- Week 5: Documentation Access; Natural Language Processing 1 - Part of Speech Tagging
- Week 6: Natural Language Processing 2 - Information Extraction and Dependency Parsing
- Week 7: Classification 1
- Week 8: Classification 2
- Week 9: Clustering
- Week 10: Topic Modeling and Dimensionality Reduction 1
- Week 11: Topic Modeling 2; Sentiment Analysis
- Week 12: Visualization
- Week 13: Word Embeddings
- Week 14: What's Next: Remainder Notes from Text Mining
The week-to-week syllabus, with readings, slides, and schedule notes, is on the Syllabus page.
- 30% Lab Tasks - Due Weekly
- 20% Small Assignments
  - 10% Twitter Bot Assignment
  - 10% Topic Modeling Assignment
- 35% Text Mining Project
  - 5% Problem Statement
  - 5% Literature review + 5% Data collection
  - 20% Final report
- 15% Participation
  - 5% Attendance
  - 10% Forum posts, comments, class engagement
Details are on the Assignments page.