<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Xanda Schofield](https://www.cs.hmc.edu/~xanda) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email xanda@cs.hmc.edu.<br />
____

# Text Data Curation 3

This is lesson 3 of 3 in the educational series on Text Data Curation. This notebook is intended to look at how trained models, such as naive Bayes models and topic models, can actually help the text curation process. 

**Audience:** `Learners` / `Researchers`

**Use case:** [`How-To`](https://constellate.org/docs/documentation-categories#howtoproblemoriented) 

**Difficulty:** `Intermediate`
Assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.

**Completion time:** `90 minutes`

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* How Python libraries work (installation and imports)

**Knowledge Recommended:**
* Basic file operations (open, close, read, write)
* How text is stored on computers

**Learning Objectives:**
After this lesson, learners will be able to:
1. Use a lexicon to retrieve interesting documents
2. Augment a lexicon using correlation between words
3. Use a simple topic model to check for oddities in a corpus

___

# Required Python Libraries

* `nltk`
* `numpy`
* `sklearn`

## Install Required Libraries

In [1]:
### Install Libraries ###

# # Using !pip installs
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [2]:
### Import Libraries ###
from collections import Counter
import csv
import os
import urllib.request

from nltk.tokenize import word_tokenize
import numpy as np
from scipy.stats import spearmanr

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# import spacy

# Required Data

**Data Format:** 
* comma-separated value (.csv)

**Data Source:**
* [Rotten Tomatoes Dataset](https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset)


## Download Required Data

In [2]:
### Retrieve multiple files using a list ###

download_urls = [
    'https://cs.hmc.edu/~xanda/data/rotten_tomatoes_critic_reviews_50k.csv',
    # https://cs.hmc.edu/~xanda/data/rotten_tomatoes_critic_reviews_50k.csv', # the full dataset
    'https://cs.hmc.edu/~xanda/data/rotten_tomatoes_movies.csv',
    'https://cs.hmc.edu/~xanda/data/stoplist_en.txt' # a modification of an English stoplist constructed by David Mimno
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])

# Introduction

In the last section, we looked at classic ways to get data prepared to use for analysis. Today, we will do two types of analysis: first, finding terms within a lexicon, and second, running a topic model. However, our goal right now is not going to be finishing analysis, but instead starting to try things out and spot if there are subtler issues with our corpus.

# Lesson

For our third and last lesson, our dataset will be a collection of RottenTomatoes reviews posted on Kaggle. (I sampled 50k reviews so it wouldn't take forever to download and run, but if you'd like the full collection you can comment out the lines above and below to download and load in the whole thing.)

## Inspecting and slicing data

Before we get far, let's go ahead and inspect our data by loading it in:

In [55]:
"""
First, load in movie review data from RottenTomatoes. Sample will be in a CSV
Next, look at data and check features - lengths reasonable? top/bottom features seem normal? Who are our reviewers? What's the fresh vs rotten ratio?
+ Develop some hypotheses of things that might cause issues with this data
Check for repeats - should be there (will add some if there aren't)
Train classifier for fresh vs rotten - any obvious features? Find something promising and something weird!
+ Specific movie names? (Overrepresent a movie?)
+ Unusual features for sentiment?
Train topic model with 5 topics - what do they look like?
+ Spot high-frequency unusual features
+ Find a weird topic (may need more topics) and look at docs - any sign of why they're together?

What is the point of this process?
+ First, sometimes it's not obvious immediately that something is wrong (ex. unusual feature or copyright text) - that's okay
+ Sometimes it's going to be hard to know you need something until you get into it (I need to limit # of each movie) - also okay
+ Models fixate on unusual distinguishers so they can actually be good sleuths...so long as you have an evaluation plan afterwards
   - Held out test set until the very end
   - Human validation (labeling task or comparison task)
   
   
What works?
+ Keeping a log of choices
+ Trying several options at once
+ Don't remove stuff unless you have a reason why related to your question
+ 
"""

"\nFirst, load in movie review data from RottenTomatoes. Sample will be in a CSV\nNext, look at data and check features - lengths reasonable? top/bottom features seem normal? Who are our reviewers? What's the fresh vs rotten ratio?\n+ Develop some hypotheses of things that might cause issues with this data\nCheck for repeats - should be there (will add some if there aren't)\nTrain classifier for fresh vs rotten - any obvious features? Find something promising and something weird!\n+ Specific movie names? (Overrepresent a movie?)\n+ Unusual features for sentiment?\nTrain topic model with 5 topics - what do they look like?\n+ Spot high-frequency unusual features\n+ Find a weird topic (may need more topics) and look at docs - any sign of why they're together?\n\nWhat is the point of this process?\n+ First, sometimes it's not obvious immediately that something is wrong (ex. unusual feature or copyright text) - that's okay\n+ Sometimes it's going to be hard to know you need something until 

In [56]:
# big file alternative:
# with open("rotten_tomatoes_critic_reviews.csv", encoding='utf-8') as reviews_file:
with open("rotten_tomatoes_critic_reviews_50k.csv", encoding='utf-8') as reviews_file:
    csvr = csv.DictReader(reviews_file)
    review_data = [row for row in csvr]

Let's confirm we do have 50,000 reviews in our reviews file and what data we get with each:

In [22]:
print("# of reviews:", len(review_data))
print(review_data[0])

# of reviews: 50000
{'rotten_tomatoes_link': 'm/the_king_of_staten_island', 'critic_name': 'Roger Tennis', 'top_critic': 'False', 'publisher_name': 'Cinemaclips.com', 'review_type': 'Fresh', 'review_score': '3.5/5', 'review_date': '2020-06-11', 'review_content': "SNL's Pete Davidson is a commanding presence in this appealing comedy/drama."}


Interesting - we get the review and reviewer information, but instead of getting a proper movie title or metadata about the movie, we just get a `rotten_tomatoes_link` to the part of the URL where a movie is. That's because there's a second CSV with metadata for each movie. Since we're going to be doing some cross-referencing between multiple dictionaries of things and it'll be easy to mistype, I'm going to leave myself some variables with the keys I need to pull information I care about: a unique ID for each movie, and where the text of the review is.

In [6]:
ID_LINK = 'rotten_tomatoes_link'
TEXT = 'review_content'

Now, let's take a look at what's in the second CSV of movie data:

In [7]:
with open("rotten_tomatoes_movies.csv", encoding='utf-8') as movies_file:
    csvr = csv.DictReader(movies_file)
    movie_data = [row for row in csvr]

In [8]:
print("# of movies:", len(movie_data))
print(movie_data[0])

# of movies: 17712
{'rotten_tomatoes_link': 'm/0814255', 'movie_title': 'Percy Jackson & the Olympians: The Lightning Thief', 'movie_info': "Always trouble-prone, the life of teenager Percy Jackson (Logan Lerman) gets a lot more complicated when he learns he's the son of the Greek god Poseidon. At a training ground for the children of deities, Percy learns to harness his divine powers and prepare for the adventure of a lifetime: he must prevent a feud among the Olympians from erupting into a devastating war on Earth, and rescue his mother from the clutches of Hades, god of the underworld.", 'critics_consensus': 'Though it may seem like just another Harry Potter knockoff, Percy Jackson benefits from a strong supporting cast, a speedy plot, and plenty of fun with Greek mythology.', 'content_rating': 'PG', 'genres': 'Action & Adventure, Comedy, Drama, Science Fiction & Fantasy', 'directors': 'Chris Columbus', 'authors': 'Craig Titley, Chris Columbus, Rick Riordan', 'actors': "Logan Lerman

Here we go: this gives us movie information and, lucky for us, also has a `rotten_tomatoes_link` we can use to cross-reference between the two CSVs. We're going to quickly make a dictionary to help us look up the metadata for each movie using a *dictionary comprehension* (which is a lot like a list comprehension in Python, but generates key-value pairs in a dictionary instead!)

In [12]:
movie_lookup = {md[ID_LINK]: md for md in movie_data}

In [13]:
movie_lookup['m/0814255']

{'rotten_tomatoes_link': 'm/0814255',
 'movie_title': 'Percy Jackson & the Olympians: The Lightning Thief',
 'movie_info': "Always trouble-prone, the life of teenager Percy Jackson (Logan Lerman) gets a lot more complicated when he learns he's the son of the Greek god Poseidon. At a training ground for the children of deities, Percy learns to harness his divine powers and prepare for the adventure of a lifetime: he must prevent a feud among the Olympians from erupting into a devastating war on Earth, and rescue his mother from the clutches of Hades, god of the underworld.",
 'critics_consensus': 'Though it may seem like just another Harry Potter knockoff, Percy Jackson benefits from a strong supporting cast, a speedy plot, and plenty of fun with Greek mythology.',
 'content_rating': 'PG',
 'genres': 'Action & Adventure, Comedy, Drama, Science Fiction & Fantasy',
 'directors': 'Chris Columbus',
 'authors': 'Craig Titley, Chris Columbus, Rick Riordan',
 'actors': "Logan Lerman, Brandon T

In [23]:
num_reviews_by_movie = Counter(rd[ID_LINK] for rd in review_data)
top_movies = num_reviews_by_movie.most_common()
for movie_title, count in top_movies[:100]:
    print(movie_title, count)

m/star_wars_episode_vii_the_force_awakens 48
m/solo_a_star_wars_story 46
m/suicide_squad_2016 45
m/star_wars_the_rise_of_skywalker 44
m/spider_man_far_from_home 43
m/spider_man_homecoming 42
m/ready_player_one 42
m/shazam 42
m/spotlight_2015 41
m/star_wars_the_last_jedi 41
m/ready_or_not_2019 37
m/star_wars_episode_iii_revenge_of_the_sith 37
m/roma_2018 36
m/spider_man_into_the_spider_verse 36
m/star_trek_11 35
m/rocketman_2019 34
m/room_2015 34
m/split_2017 33
m/red_sparrow 33
m/sully 33
m/rogue_one_a_star_wars_story 31
m/sin_city 31
m/prometheus_2012 31
m/steve_jobs_2015 30
m/joker_2019 29
m/skyfall 29
m/prisoners_2013 29
m/shrek_3 29
m/richard_jewell 28
m/super_8 28
m/slumdog_millionaire 28
m/spiderman_2 28
m/son_of_saul 28
m/scott_pilgrims_vs_the_world 28
m/san_andreas 27
m/star_trek_beyond 27
m/silver_linings_playbook 27
m/prince_of_persia_sands_of_time 26
m/sicario_2015 26
m/side_effects_2013 26
m/once_upon_a_time_in_hollywood 26
m/snowden 26
m/avengers_endgame 26
m/sideways 26
m

**Exercise** What do we notice about the movies here? What's present and what's absent? What gets the most reviews?

That's a lot of movies! Can we just pull out the comedies to explore those more? I'm going to generate a *set* of the movie IDs for movies marked as comedies. A set keeps track of distinct elements in a way that makes it quick to check whether or not something is in the set, but without preserving order:

In [19]:
comedy_ids = set([m[ID_LINK] for m in movie_data if 'Comedy' in m['genres']])
print("Number of comedies:", len(comedy_ids))

Number of comedies: 5674


Now, let's check which genres this pulled out:

In [16]:
genres = Counter([movie_lookup[c]['genres'] for c in comedy_ids])
print(genres.most_common())

[('Comedy', 1263), ('Comedy, Drama', 863), ('Comedy, Drama, Romance', 312), ('Comedy, Romance', 273), ('Art House & International, Comedy, Drama', 268), ('Action & Adventure, Comedy', 180), ('Comedy, Kids & Family', 125), ('Comedy, Horror', 96), ('Art House & International, Comedy', 93), ('Action & Adventure, Comedy, Drama', 87), ('Comedy, Drama, Mystery & Suspense', 81), ('Comedy, Science Fiction & Fantasy', 66), ('Animation, Comedy, Kids & Family', 65), ('Art House & International, Comedy, Drama, Romance', 63), ('Classics, Comedy, Drama', 61), ('Classics, Comedy', 59), ('Action & Adventure, Comedy, Science Fiction & Fantasy', 55), ('Comedy, Drama, Musical & Performing Arts', 49), ('Classics, Comedy, Drama, Romance', 46), ('Classics, Comedy, Romance', 42), ('Comedy, Mystery & Suspense', 42), ('Comedy, Musical & Performing Arts', 40), ('Action & Adventure, Animation, Comedy, Kids & Family', 40), ('Action & Adventure, Comedy, Kids & Family', 38), ('Comedy, Drama, Kids & Family', 36), ('

Whoops, okay, that's not super readable. It looks like because most movies have multiple genres and "Comedy" has a lot of overlap, this gives us a huge list of different options. Looking through, it looks like genres are separated by `", "`, so let's use that to split them up and then count those!

In [17]:
genres = Counter()
for c in comedy_ids:
    genres.update(movie_lookup[c]['genres'].split(', '))
print(genres.most_common())

[('Comedy', 5674), ('Drama', 2377), ('Romance', 1006), ('Action & Adventure', 806), ('Art House & International', 708), ('Kids & Family', 549), ('Classics', 473), ('Science Fiction & Fantasy', 460), ('Mystery & Suspense', 358), ('Musical & Performing Arts', 337), ('Horror', 271), ('Animation', 240), ('Special Interest', 118), ('Documentary', 109), ('Television', 73), ('Western', 49), ('Gay & Lesbian', 40), ('Cult Movies', 32), ('Sports & Fitness', 31), ('Faith & Spirituality', 7), ('Anime & Manga', 2)]


Better. There are some weird things we notice (for instance, we have a large Animation category but a small Anime and Manga category), but reassuringly, the number of times Comedy shows up is the same as the number of movies we grabbed. Phew.

Let's grab the reviews that go with these movies and verify it seems to be working:

In [24]:
comedy_reviews = [r for r in review_data if r[ID_LINK] in comedy_ids]

In [25]:
print(len(comedy_reviews))
print(comedy_reviews[1000])

16434
{'rotten_tomatoes_link': 'm/men_in_black', 'critic_name': 'Mike Massie', 'top_critic': 'False', 'publisher_name': 'Gone With The Twins', 'review_type': 'Fresh', 'review_score': '9/10', 'review_date': '2020-09-14', 'review_content': "Choosing to go the route of puppets and primarily practical effects, the film's look remains striking, even after years of advancing computer graphics."}


Yup - Men in Black is an action comedy, so this checks out. Now it's time to featurize the text. Since this is in a highly processed dataset, we're not going to start with much cleaning - let's just see how well it does as it is.

In [26]:
texts = [cr[TEXT] for cr in comedy_reviews]

Since these review excerpts are very short, it turns out the threshold I recommend for nontrivial-length documents (e.g. .3, .5) aren't going to work well here to remove stopwords. So I'm going to use a small stopword file of my own to help clean out some words I don't anticipate needing today:

In [28]:
with open('stoplist_en.txt') as stop_file:
    stoplist = [line.strip() for line in stop_file]
print(stoplist)

['the', 'and', 'of', 'for', 'in', 'a', 'on', 'is', 'an', 'this', 'to', 'by', 'our', 'that', 'will', 'have', 'are', 'with', 'all', 'must', 'not', 'more', 'their', 'has', 'but', 'can', 'people', 'new', 'world', 'from', 'year', 'which', 'they', 'these', 'you', 'years', 'now', 'than', 'been', 'who', 'should', 'its', 'one', 'make', 'every', 'other', 'those', 'them', 'time', 'was', 'also', 'there', 'many', 'great', 'last', 'first', 'only', 'would', 'when', 'most', 'need', 'own', 'what', 'because']


This is a fairly conservative list - I might have missed some words I later will not be interested in. But for now, let's keep it conservative - we'll also keep any feature that shows up in three distinct review snippets. I'll give my stoplist to my CountVectorizer to proceed:

In [29]:
cv = CountVectorizer(input=texts, min_df=3, stop_words=stoplist)
review_features = cv.fit_transform(texts)
feature_names = cv.get_feature_names()
print("Documents by features:", review_features.shape)

Documents by features: (16434, 9207)


As previously mentioned, CountVectorizers generate sparse matrices - that is, they only represent the entries in our 2D data structure of numbers that are nonzero. For some math we're about to do, I'm going to want the dense matrix, or a representation with all of the zeros where they should be. I'll call `toarray()` to make that happen, then look at how the data looks:

In [30]:
dense_review_features = review_features.toarray()
print(dense_review_features.sum(), "words")
nonzero_prop = len(dense_review_features.nonzero()[0])/(dense_review_features.shape[0]*dense_review_features.shape[1])
print("{:.4f}% nonzero".format(nonzero_prop*100))

print(feature_names[:10])
print(dense_review_features[:20,:10])

192934 words
0.1235% nonzero
['000', '007', '10', '100', '11', '110', '12', '13', '14', '15']
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]


As we can see, a little over one in a thousand of the entries in our matrix are nonzero - wild! For example, for the first features in our list (

**Exercise:** Find the index `i` of the word "the" in the features, then look at the counts in each review for that feature by selecting `dense_review_features[:, i]`. Can you figure out how to compute the average value in that matrix? What is it? ([This](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) may help).

## Finding interesting documents with lexicons

Let's narrow our search further - maybe we're interested in how people talk about money in conjunction with comedies. (Maybe we'll compare that with how it looks for other genres.) Let's start by coming up with a few words that we're interested in, then finding the list of documents that contain those words.

In [31]:
money_lexicon = ['money', 'earning', 'cash', 'income']
money_lexicon_idxs = [feature_names.index(w) for w in money_lexicon if w in feature_names]
print(money_lexicon_idxs)
print([feature_names[i] for i in money_lexicon_idxs])

[5307, 2539, 1238]
['money', 'earning', 'cash']


We notice that not all the features we were looking for are actually in our data. If we knew we should have expected a feature that didn't show up, then we could go back through and check our data processing for why documents that should contain that word aren't registering as having that feature...but in this case, I'm not surprised "income" isn't something people mention in review snippets, so we won't worry about it. Let's start by just getting the list of reviews that mention any of these three things:

In [44]:
def find_reviews_by_lexicon(lexicon, feature_names, feature_matrix, data_entries):
    """Use the words that show up in a lexicon to find documents with nonzero lexicon items in them"""
    # get all the lexicon indices for the items present
    lexicon_idxs = [feature_names.index(w) for w in lexicon if w in feature_names]
    # sum across all the different words for each dictionary (that is, across columns - which are counted in shape[1])
    lexicon_sums = np.sum(feature_matrix[:, lexicon_idxs], axis=1)
    # find all the indices of documents with nonzero counts
    doc_idxs = np.argwhere(lexicon_sums > 0)[:,0]
    
    return [data_entries[i] for i in doc_idxs]

In [46]:
money_docs = find_reviews_by_lexicon(lexicon=money_lexicon,
                                     feature_names=feature_names,
                                     feature_matrix=dense_review_features,
                                     data_entries=comedy_reviews)

print("Number of money docs:", len(money_docs))
for doc in money_docs[:5]:
    print(doc)

Number of money docs: 67
{'rotten_tomatoes_link': 'm/10009083-land_of_the_lost', 'critic_name': 'Erik Childress', 'top_critic': 'False', 'publisher_name': 'eFilmCritic.com', 'review_type': 'Rotten', 'review_score': '1.5/4', 'review_date': '2009-06-05', 'review_content': 'Land of the Lost will have kids asking if they can just go see Up again. Hopefully mom and dad have a little money left after blowing it on this.'}
{'rotten_tomatoes_link': 'm/1188347-mad_money', 'critic_name': 'Tricia Olszewski', 'top_critic': 'False', 'publisher_name': "Let's Not Listen", 'review_type': 'Fresh', 'review_score': '', 'review_date': '2008-02-29', 'review_content': "Keaton and Latifah lend enough intelligence, wit, and charm to their characters that Mad Money often feels like an ovarian Ocean's Eleven."}
{'rotten_tomatoes_link': 'm/the_hangover_2', 'critic_name': 'Nick Schager', 'top_critic': 'False', 'publisher_name': 'Slant Magazine', 'review_type': 'Rotten', 'review_score': '1/4', 'review_date': '2011

Okay, we're down to 67 reviews, which is...not much. What if we could expand our lexicon? Unfortunately, brainstorming lexicon words from scratch is kind of hard. Lucky for us, we can write programs to help!

Without going into too much detail: this program uses one kind of correlation metric, Spearman rho, to see whether the occurrences of each word in the vocabulary follow a similar pattern to the lexicon counts for our smaller lexicon. It'll sort the resulting words in decreasing order by the amount they're correlated. You can do this with lots of metrics (I often use PMI) with different results, but the general idea is that we can use word correlations to help suggest other lexicon words we might not have come up with:

In [50]:
def find_words_by_spearman(lexicon, feature_names, feature_matrix):
    """uses Spearman rho correlation to find words that are correlated with our lexicon"""
    lexicon_idxs = [feature_names.index(w) for w in lexicon if w in feature_names]
    lexicon_sums = np.sum(dense_review_features[:, money_lexicon_idxs], axis=1)
    n_features = len(feature_names)
    # for each word, check how correlated it is with the lexicon counts
    feature_scores = [(spearmanr(lexicon_sums, feature_matrix[:,i]).correlation, feature_names[i]) for i in range(n_features)]
    return sorted(feature_scores, reverse=True)

Let's see what this suggests for our money lexicon (warning, this is slow):

In [51]:
money_scores = find_words_by_spearman(money_lexicon, feature_names, dense_review_features)

In [52]:
for score, wd in money_scores[:100]:
    print(wd, "{:.4f}".format(score))

money 0.8891
cash 0.4400
earning 0.2112
grab 0.1618
buy 0.1398
fork 0.1214
spend 0.0917
sent 0.0912
happiness 0.0722
zealand 0.0698
tickets 0.0698
stoned 0.0698
sale 0.0698
sack 0.0698
rudo 0.0698
robbins 0.0698
pretending 0.0698
plotless 0.0698
novella 0.0698
losers 0.0698
limitless 0.0698
individuals 0.0698
id 0.0698
horny 0.0698
hollowness 0.0698
fold 0.0698
feelgood 0.0698
extras 0.0698
drillbit 0.0698
dreamy 0.0698
dispensing 0.0698
disasters 0.0698
cursi 0.0698
contraption 0.0698
congeniality 0.0698
basement 0.0698
produce 0.0690
joyous 0.0690
15 0.0690
quality 0.0665
corporate 0.0614
confidence 0.0613
garcia 0.0605
hoary 0.0605
terry 0.0602
stranded 0.0602
soda 0.0602
sexist 0.0602
resources 0.0602
replace 0.0602
provided 0.0602
property 0.0602
poise 0.0602
pointlessly 0.0602
millions 0.0602
madison 0.0602
lend 0.0602
laced 0.0602
katz 0.0602
juggling 0.0602
incoherent 0.0602
harbor 0.0602
grabbing 0.0602
gigli 0.0602
excrement 0.0602
dash 0.0602
cost 0.0602
converted 0.0602
bou

While I could have removed my "seed words" from the list, I like keeping them in there to verify my algorithm is working how I expect - if "money" wasn't at the top of the list, that'd tell me I had a bug in my code!

**Exercise** Let's add some of these suggested words to our lexicon and see the effect:

In [53]:
new_money_lexicon = ["money", "cash", "earning", "buy"]
new_money_docs = find_reviews_by_lexicon(lexicon=new_money_lexicon,
                                     feature_names=feature_names,
                                     feature_matrix=dense_review_features,
                                     data_entries=comedy_reviews)

print("Number of new money docs:", len(new_money_docs))
for doc in new_money_docs[:5]:
    print(doc)

Number of new money docs: 75
{'rotten_tomatoes_link': 'm/10009083-land_of_the_lost', 'critic_name': 'Erik Childress', 'top_critic': 'False', 'publisher_name': 'eFilmCritic.com', 'review_type': 'Rotten', 'review_score': '1.5/4', 'review_date': '2009-06-05', 'review_content': 'Land of the Lost will have kids asking if they can just go see Up again. Hopefully mom and dad have a little money left after blowing it on this.'}
{'rotten_tomatoes_link': 'm/1188347-mad_money', 'critic_name': 'Tricia Olszewski', 'top_critic': 'False', 'publisher_name': "Let's Not Listen", 'review_type': 'Fresh', 'review_score': '', 'review_date': '2008-02-29', 'review_content': "Keaton and Latifah lend enough intelligence, wit, and charm to their characters that Mad Money often feels like an ovarian Ocean's Eleven."}
{'rotten_tomatoes_link': 'm/the_hangover_2', 'critic_name': 'Nick Schager', 'top_critic': 'False', 'publisher_name': 'Slant Magazine', 'review_type': 'Rotten', 'review_score': '1/4', 'review_date': '

I find lexicons to be underrated: they're a really helpful tool to make it easy to document a particular filtering or counting task. They're often a pain because they have to be made manually, and since machine learning researchers don't usually count as experts in the domains of the data they study, there's not as much discussion in the places I publish papers about how to do this effectively. However, if you have expertise in the data, you can (and should) use it to help develop things like lexicons to help check and filter things as needed.

## Rough topic models for rough-draft datasets

When I don't know much about a dataset, one of the first things I'll often do is train an LDA topic model on it - sometimes before many of the steps we've talked about in the other two lessons.

Why would I do that? Well, first, let's briefly talk about what an LDA topic model does. If you haven't run into these before, I recommend sources like [Lisa Rhody's Digital Humanities article](http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/) and [Ted Underwood's blog post](https://tedunderwood.com/category/methodology/topic-modeling/) to get some of the intuitions. Once you've gotten at the intuition of these models, it can be good to dig into tutorials, like the existing [topic modeling course on Constellate](https://constellate.org/tutorials/topic-modeling) or other variations online, to explore what you can do with these.

First, in this context, a *topic* is just a probability distribution of words (for instance, a topic could have high probabilities of the words "great", "cool", and "fun" and low probabilities of the words "boring", "dinosaur", and "edward"). Every topic will have at least some tiny probability of every word in the vocabulary of the text collection, but there should be a small subset of the vocabulary with high probability and very little probability of the majority of the vocabulary. A topic *model* describes a collection of texts using a fixed number of different topics: each document is described as having proportions of each topic, and each topic is described as having proportions of each word.

To *infer* a topic model is to run an algorithm that knows the list of documents and the words in them (think of the data in our CountVectorizer - no order information, just word counts) and to try to find a fixed number of topics that can best describe the actual words present in the documents when combined together. Not all documents necessarily have one dominant topic - a document #372 could be 15% about topic 1, 80% about topic 2, and ~5% distributed over everything else. However, our topic model is working well if between topics 1 and 2, we get high probabilities of the words that actually show up in document #372, and likewise for the topics present in each of our other documents.

A topic model has two major outputs: topic-word distributions, which describe how often each word shows up in each topic, and document-topic distributions, which describe the topic breakdown of each document. The nice thing is that all we need to infer one of these models is some way to get word counts by document (again, our CountVectorizer is great at that) and some existing code that trains a model.

Notably, since finding the "best" topics is an impossibly hard math problem, we instead have programs that use randomness and iteration to eventually converge to "good" topics. You should expect the outputs of standard topic modeling software to change each time you run it. But that aside, usually you'll see some shuffling of topic order and the order of top words in each topic but a fair amount of consistency in what topics and words are present.

Now, hot take time: most people training topic models in Python will probably turn to a library called `gensim`. I'm instead using the implementation that's built into scikit-learn, not because it's better (the interface is actually worse) but because
1. we've been using sklearn already, and
2. while gensim has nicer interfaces for some parts of this, both scikit-learn and gensim don't train good LDA topic models on normal-size text collections.

What? That's right - with the libraries currently available, I do not recommend using Python to train topic models for your actual analysis. Both gensim and scikit-learn use a strategy called batch/online LDA to find the "best" topic model for a corpus that is meant to work well for very large collections (think millions of documents). On even tens of thousands, the topics they learn tend to be pretty iffy. Without digging into the math of why, when you train a topic model on a non-massive corpus, you probably want to use something that says "MCMC" or "Gibbs sampling" in how it does inference. [MALLET](https://mimno.github.io/Mallet/) (a command-line tool) and the R package [topicmodels](https://cran.r-project.org/web/packages/topicmodels/index.html) both support this and will give you better analyses and have plenty of tutorials available online. If you want help getting MALLET set up, check out [Melanie Walsh's tutorial](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/06-Topic-Modeling-Overview.html).

Since we're in Python-land for this tutorial and just exploring how to use topic models to notice if something's up with the corpus, we'll make do with what we've got. Let's train a 25-topic model and pull out our document-topic and topic-word information.

In [58]:
n_topics = 25
lda = LatentDirichletAllocation(n_components=n_topics,batch_size=len(comedy_reviews))
doc_topic_vecs = lda.fit_transform(review_features)
topic_word_vecs = lda.components_

To look at our topics, we generally want to pull out highest-probability words for each topic. We can do this using numpy's `argsort`, which takes in data and, rather than just sorting the data, puts the *indices* of different elements of the data in order based on the values present. So, when we *argsort* the list of word proportions for a particular topic below, we're listing indices of words in our vocabulary in order of how present they are in the topic. (The `[::-1]` syntax is a weird way of saying "put these elements highest to lowest instead of lowest to highest".)

In [59]:
n_words_to_print = 10
for topic in range(n_topics):
    top_word_idxs = np.argsort(topic_word_vecs[topic])[-n_words_to_print:][::-1]
    top_words = [feature_names[i] for i in top_word_idxs]
    print(topic, ' '.join(top_words))
    

0 it too at enough or comedy be much movie good
1 his movie he it neither nor as about or enough
2 it movie no as be film about comedy how just
3 it film movie story just as worth so be about
4 full review it spanish as film good so comedy be
5 as it about at much ve film be we he
6 it film as movie at or way be into doesn
7 as it be film animated like movie his out anderson
8 film it family comedy about plot be action movie quality
9 it film fun like movie kids family at adults out
10 it your or don if at out we movie be
11 as film movie over some it he live good comedy
12 it he little be no comedy never movie film really
13 it as like her film much movie so be movies
14 comedy romantic rom charming com love drama performances teen funny
15 it funny movie film comedy into too little so smart
16 it so as movie up just if re big out
17 it at as film be really movie like we re
18 funny as it comedy film his director best very up
19 it movie about comedy director film as thought writer we

At this point, everyone I know has the instinct to say "Our stoplist is incomplete - let's go back and fix it and then retrain our model again." This commences a very long loop of changing the stoplist, then retraining, then changing again, then retraining again...it's one of many pre-processing loops that happens, and it can require revisiting over and over if you change your tokenization or other things about your corpus.

I'm here to bear good news: aside from extremely frequent words, most stopwords aren't actually affecting how well your topic model distinguishes documents or themes. They look like they do, because they interfere with your ability to guess what a topic might be about based on the top words, but you can just grab more words and ignore the ones you don't care about instead of retraining. In English, "the", "was", etc. are likely to be important to remove before training, but most won't affect what happens to the rest of the text...so you can just write a function like the one below to ignore those words after the fact. (I [wrote a paper](https://aclanthology.org/E17-2069/) showing this works out fine for a few different Latin-based languages.)

**Exercise** Add some words to the `post_stoplist` and modify `n_words_to_print` until you feel like the topics look distinct. Anything stick out as unusual about the topics?

In [61]:
def print_topic_keys(topic_word_vecs, n_words_to_print=20, post_stoplist=[]):
    for topic in range(n_topics):
        top_word_idxs = np.argsort(topic_word_vecs[topic])[-n_words_to_print+len(post_stoplist):][::-1]
        top_words = [feature_names[i] for i in top_word_idxs if feature_names[i] not in post_stoplist]
        print(topic, ' '.join(top_words[:n_words_to_print]))
        print()
    
print_topic_keys(topic_word_vecs, n_words_to_print=20, post_stoplist=['it', 'be', 'at', 'movie', 'film'])

0 too enough or comedy much good down even as least something

1 his he neither nor as about or enough made debut do characters so

2 no as about comedy how just still good family if so

3 story just as worth so about makes over up his laughs

4 full review spanish as good so comedy bad content up some parents

5 as about much ve we he seen his just so up

6 as or way into doesn out while comedy enough well

7 as animated like his out anderson off some humor still work

8 family comedy about plot action quality if us performances like characters

9 fun like kids family adults out comedy funny good high

10 your or don if out we little re so laugh

11 as over some he live good comedy his nothing self even

12 he little no comedy never really good better story does two

13 as like her much so movies about his watching over just

14 comedy romantic rom charming com love drama performances teen funny perfect like age some strong

15 funny comedy into too little so smart heart if no stupid 

Generally speaking, these top words can help you start to get at some broad themes -- we expect to see things like rom-coms, or clear positive or negative words, as signals of topics. However, it's worth taking a skim through to see if there's anything unusual in our top words that sets off alarm bells, or any topics where we can't clear discern why that would be a topic. Again, our topic model is probably not super great on this data since we're only using a few thousand documents, but we might notice some words that are weird as high-probability parts of topics. In those cases, what we should do is look at the documents with the highest proportion of a topic and see if we can figure out in what context those words are showing up.

**Exercise** Pick a couple of topics that look odd and inspect their top documents using the code below. See if you can figure out which documents are producing the words that looked odd, and whether the cause of their oddness is benign or something that might require further intervention.

In [34]:
def get_top_docs_by_topic(docs, doc_topic_vecs, topic, n_top_docs=50):
    doc_idxs = doc_topic_vecs[:,topic].argsort()[:-n_top_docs-1:-1]
    for idx in doc_idxs:
        print("Topic proportion:", doc_topic_vecs[idx][topic])
        print(docs[idx])
        print()

get_top_docs_by_topic(comedy_reviews, doc_topic_vecs, 0)

Topic proportion: 0.9657142857094732
{'rotten_tomatoes_link': 'm/the_five_year_engagement', 'critic_name': 'Jason Best', 'top_critic': 'False', 'publisher_name': 'Movie Talk', 'review_type': 'Fresh', 'review_score': '', 'review_date': '2012-06-25', 'review_content': "The Five-Year Engagement seems at times to have taken its title literally - it's way too long. But though the pace sometimes drags, Blunt and Segal are amiable companions and the lack of gross-out gags makes a refreshing change."}

Topic proportion: 0.9657142857057369
{'rotten_tomatoes_link': 'm/donovans_reef', 'critic_name': 'Emanuel Levy', 'top_critic': 'False', 'publisher_name': 'EmanuelLevy.Com', 'review_type': 'Fresh', 'review_score': '', 'review_date': '2008-03-11', 'review_content': 'Though not in top form, director John Ford and star John Wayne, building on previous and better collaborations in this easy-going (and lazy) adventure set in the Pacific, offering good old fun with the Duke, Jack Warden, Dorothy Lamour 

Topic models are very good at finding words that show up together, which is actually a perk for us when we're cleaning a text collection: if there's a systematic issue that causes repeated text to appear or that means a subset of the data is fundamentally very different from the rest, the topic model will almost always put the words that indicate that issue together in a topic. Since topic models try to represent the whole text as well as possible, even when things are working well, we expect some topics may not look interesting - but if you see sequences of words that aren't just boring but actually puzzling (or if you're trying to label topics) it's always important to go look at the documents!

With that, we wrap up this brief overview of text curation. One thing we didn't do in this lesson that I'd hoped to was to show that the use of a model to detect issues in data isn't limited to topic models: supervised text classifiers like Naive Bayes classifiers learn weights on how much certain features matter, so one could train a classifier to predict which reviews are fresh versus rotten, then inspect whether some of the words that are considered good indicators of that are indicating a systamtic issue. If you have time to explore more, I recommend checking out [Susan Li's tutorial](https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f) for an idea of how to use scikit-learn's multinomial Naive Bayes classifier `MultinomialNB` to do analyses.

___

Thank you! For access to the source code of all three lessons, go to [https://github.com/xandaschofield/tapi-text-data]