<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Xanda Schofield](https://www.cs.hmc.edu/~xanda) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email xanda@cs.hmc.edu.<br />
____

# Text Data Curation 3

This is lesson 3 of 3 in the educational series on Text Data Curation. This notebook is intended to look at how trained models, such as naive Bayes models and topic models, can actually help the text curation process. 

**Audience:** `Learners` / `Researchers`

**Use case:** [`How-To`](https://constellate.org/docs/documentation-categories#howtoproblemoriented) 

**Difficulty:** `Intermediate`
Assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.

**Completion time:** `90 minutes`

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* How Python libraries work (installation and imports)

**Knowledge Recommended:**
* Basic file operations (open, close, read, write)
* How text is stored on computers

**Learning Objectives:**
After this lesson, learners will be able to:
1. Use a lexicon to retrieve interesting documents
2. Get a basic scikit-learn classifier running
3. Use a simple topic model to check for oddities in a corpus

___

# Required Python Libraries

* `matplotlib`
* `nltk`
* `numpy`
* `seaborn`
* `sklearn`

## Install Required Libraries

In [None]:
### Install Libraries ###

# # Using !pip installs
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [1]:
### Import Libraries ###
from collections import Counter
import csv
import os
import urllib.request

import matplotlib
import seaborn
from matplotlib import pyplot as plt
from nltk.tokenize import word_tokenize
import numpy as np
from scipy.stats import spearmanr

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# import spacy

# Required Data

**Data Format:** 
* comma-separated value (.csv)

**Data Source:**
* [Rotten Tomatoes Dataset](https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset)


## Download Required Data

In [2]:
### Retrieve multiple files using a list ###

download_urls = [
    'https://cs.hmc.edu/~xanda/data/rotten_tomatoes_critic_reviews_50k.csv',
    'https://cs.hmc.edu/~xanda/data/rotten_tomatoes_movies.csv',
    'https://cs.hmc.edu/~xanda/data/stoplist_en.txt' # a modification of an English stoplist constructed by David Mimno
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])

# Introduction

```
Introduce the lesson topic. Answer questions such as:
* Why is it useful? 
* Why should we learn it? 
* Who might use it? 
* Where has it been used by scholars/industry?
* What do we need to do it?
* What subjects are included in the notebooks?
* What is not in this notebook? Where should we look for it?
```

# Lesson

In [3]:
"""
First, load in movie review data from RottenTomatoes. Sample will be in a CSV
Next, look at data and check features - lengths reasonable? top/bottom features seem normal? Who are our reviewers? What's the fresh vs rotten ratio?
+ Develop some hypotheses of things that might cause issues with this data
Check for repeats - should be there (will add some if there aren't)
Train classifier for fresh vs rotten - any obvious features? Find something promising and something weird!
+ Specific movie names? (Overrepresent a movie?)
+ Unusual features for sentiment?
Train topic model with 5 topics - what do they look like?
+ Spot high-frequency unusual features
+ Find a weird topic (may need more topics) and look at docs - any sign of why they're together?

What is the point of this process?
+ First, sometimes it's not obvious immediately that something is wrong (ex. unusual feature or copyright text) - that's okay
+ Sometimes it's going to be hard to know you need something until you get into it (I need to limit # of each movie) - also okay
+ Models fixate on unusual distinguishers so they can actually be good sleuths...so long as you have an evaluation plan afterwards
   - Held out test set until the very end
   - Human validation (labeling task or comparison task)
   
   
What works?
+ Keeping a log of choices
+ Trying several options at once
+ Don't remove stuff unless you have a reason why related to your question
+ 
"""

"\nFirst, load in movie review data from RottenTomatoes. Sample will be in a CSV\nNext, look at data and check features - lengths reasonable? top/bottom features seem normal? Who are our reviewers? What's the fresh vs rotten ratio?\n+ Develop some hypotheses of things that might cause issues with this data\nCheck for repeats - should be there (will add some if there aren't)\nTrain classifier for fresh vs rotten - any obvious features? Find something promising and something weird!\n+ Specific movie names? (Overrepresent a movie?)\n+ Unusual features for sentiment?\nTrain topic model with 5 topics - what do they look like?\n+ Spot high-frequency unusual features\n+ Find a weird topic (may need more topics) and look at docs - any sign of why they're together?\n\nWhat is the point of this process?\n+ First, sometimes it's not obvious immediately that something is wrong (ex. unusual feature or copyright text) - that's okay\n+ Sometimes it's going to be hard to know you need something until 

In [4]:
with open("rotten_tomatoes_critic_reviews_50k.csv", encoding='utf-8') as reviews_file:
    csvr = csv.DictReader(reviews_file)
    review_data = [row for row in csvr]

In [5]:
print("# of reviews:", len(review_data))
print(review_data[0])

# of reviews: 50000
{'rotten_tomatoes_link': 'm/the_king_of_staten_island', 'critic_name': 'Roger Tennis', 'top_critic': 'False', 'publisher_name': 'Cinemaclips.com', 'review_type': 'Fresh', 'review_score': '3.5/5', 'review_date': '2020-06-11', 'review_content': "SNL's Pete Davidson is a commanding presence in this appealing comedy/drama."}


In [6]:
ID_LINK = 'rotten_tomatoes_link'
TYPE = 'review_type'
TEXT = 'review_content'

In [7]:
with open("rotten_tomatoes_movies.csv", encoding='utf-8') as movies_file:
    csvr = csv.DictReader(movies_file)
    movie_data = [row for row in csvr]

In [8]:
print("# of movies:", len(movie_data))
print(movie_data[0])

# of movies: 17712
{'rotten_tomatoes_link': 'm/0814255', 'movie_title': 'Percy Jackson & the Olympians: The Lightning Thief', 'movie_info': "Always trouble-prone, the life of teenager Percy Jackson (Logan Lerman) gets a lot more complicated when he learns he's the son of the Greek god Poseidon. At a training ground for the children of deities, Percy learns to harness his divine powers and prepare for the adventure of a lifetime: he must prevent a feud among the Olympians from erupting into a devastating war on Earth, and rescue his mother from the clutches of Hades, god of the underworld.", 'critics_consensus': 'Though it may seem like just another Harry Potter knockoff, Percy Jackson benefits from a strong supporting cast, a speedy plot, and plenty of fun with Greek mythology.', 'content_rating': 'PG', 'genres': 'Action & Adventure, Comedy, Drama, Science Fiction & Fantasy', 'directors': 'Chris Columbus', 'authors': 'Craig Titley, Chris Columbus, Rick Riordan', 'actors': "Logan Lerman

In [9]:
movie_lookup = {md[ID_LINK]: md for md in movie_data}

In [10]:
movie_lookup['m/0814255']

{'rotten_tomatoes_link': 'm/0814255',
 'movie_title': 'Percy Jackson & the Olympians: The Lightning Thief',
 'movie_info': "Always trouble-prone, the life of teenager Percy Jackson (Logan Lerman) gets a lot more complicated when he learns he's the son of the Greek god Poseidon. At a training ground for the children of deities, Percy learns to harness his divine powers and prepare for the adventure of a lifetime: he must prevent a feud among the Olympians from erupting into a devastating war on Earth, and rescue his mother from the clutches of Hades, god of the underworld.",
 'critics_consensus': 'Though it may seem like just another Harry Potter knockoff, Percy Jackson benefits from a strong supporting cast, a speedy plot, and plenty of fun with Greek mythology.',
 'content_rating': 'PG',
 'genres': 'Action & Adventure, Comedy, Drama, Science Fiction & Fantasy',
 'directors': 'Chris Columbus',
 'authors': 'Craig Titley, Chris Columbus, Rick Riordan',
 'actors': "Logan Lerman, Brandon T

In [11]:
num_reviews_by_movie = Counter(rd[ID_LINK] for rd in review_data)
top_movies = num_reviews_by_movie.most_common()
for movie_title, count in top_movies[:100]:
    print(movie_title, count)

m/star_wars_episode_vii_the_force_awakens 48
m/solo_a_star_wars_story 46
m/suicide_squad_2016 45
m/star_wars_the_rise_of_skywalker 44
m/spider_man_far_from_home 43
m/spider_man_homecoming 42
m/ready_player_one 42
m/shazam 42
m/spotlight_2015 41
m/star_wars_the_last_jedi 41
m/ready_or_not_2019 37
m/star_wars_episode_iii_revenge_of_the_sith 37
m/roma_2018 36
m/spider_man_into_the_spider_verse 36
m/star_trek_11 35
m/rocketman_2019 34
m/room_2015 34
m/split_2017 33
m/red_sparrow 33
m/sully 33
m/rogue_one_a_star_wars_story 31
m/sin_city 31
m/prometheus_2012 31
m/steve_jobs_2015 30
m/joker_2019 29
m/skyfall 29
m/prisoners_2013 29
m/shrek_3 29
m/richard_jewell 28
m/super_8 28
m/slumdog_millionaire 28
m/spiderman_2 28
m/son_of_saul 28
m/scott_pilgrims_vs_the_world 28
m/san_andreas 27
m/star_trek_beyond 27
m/silver_linings_playbook 27
m/prince_of_persia_sands_of_time 26
m/sicario_2015 26
m/side_effects_2013 26
m/once_upon_a_time_in_hollywood 26
m/snowden 26
m/avengers_endgame 26
m/sideways 26
m

In [12]:
comedy_ids = set([m[ID_LINK] for m in movie_data if 'Comedy' in m['genres']])

In [13]:
genres = Counter([movie_lookup[c]['genres'] for c in comedy_ids])
print(genres.most_common())

[('Comedy', 1263), ('Comedy, Drama', 863), ('Comedy, Drama, Romance', 312), ('Comedy, Romance', 273), ('Art House & International, Comedy, Drama', 268), ('Action & Adventure, Comedy', 180), ('Comedy, Kids & Family', 125), ('Comedy, Horror', 96), ('Art House & International, Comedy', 93), ('Action & Adventure, Comedy, Drama', 87), ('Comedy, Drama, Mystery & Suspense', 81), ('Comedy, Science Fiction & Fantasy', 66), ('Animation, Comedy, Kids & Family', 65), ('Art House & International, Comedy, Drama, Romance', 63), ('Classics, Comedy, Drama', 61), ('Classics, Comedy', 59), ('Action & Adventure, Comedy, Science Fiction & Fantasy', 55), ('Comedy, Drama, Musical & Performing Arts', 49), ('Classics, Comedy, Drama, Romance', 46), ('Comedy, Mystery & Suspense', 42), ('Classics, Comedy, Romance', 42), ('Comedy, Musical & Performing Arts', 40), ('Action & Adventure, Animation, Comedy, Kids & Family', 40), ('Action & Adventure, Comedy, Kids & Family', 38), ('Comedy, Drama, Kids & Family', 36), ('

In [14]:
genres = Counter()
for c in comedy_ids:
    genres.update(movie_lookup[c]['genres'].split(', '))
print(genres.most_common())

[('Comedy', 5674), ('Drama', 2377), ('Romance', 1006), ('Action & Adventure', 806), ('Art House & International', 708), ('Kids & Family', 549), ('Classics', 473), ('Science Fiction & Fantasy', 460), ('Mystery & Suspense', 358), ('Musical & Performing Arts', 337), ('Horror', 271), ('Animation', 240), ('Special Interest', 118), ('Documentary', 109), ('Television', 73), ('Western', 49), ('Gay & Lesbian', 40), ('Cult Movies', 32), ('Sports & Fitness', 31), ('Faith & Spirituality', 7), ('Anime & Manga', 2)]


In [15]:
comedy_reviews = [r for r in review_data if r[ID_LINK] in comedy_ids]

In [16]:
print(len(comedy_reviews))
print(comedy_reviews[1000])

16434
{'rotten_tomatoes_link': 'm/men_in_black', 'critic_name': 'Mike Massie', 'top_critic': 'False', 'publisher_name': 'Gone With The Twins', 'review_type': 'Fresh', 'review_score': '9/10', 'review_date': '2020-09-14', 'review_content': "Choosing to go the route of puppets and primarily practical effects, the film's look remains striking, even after years of advancing computer graphics."}


In [17]:
scores = np.array([cr[TYPE] for cr in comedy_reviews])
texts = [cr[TEXT] for cr in comedy_reviews]

In [18]:
with open('stoplist_en.txt') as stop_file:
    stoplist = [line.strip() for line in stop_file]
print(stoplist)

['the', 'and', 'of', 'for', 'in', 'a', 'on', 'is', 'an', 'this', 'to', 'by', 'our', 'that', 'will', 'have', 'are', 'with', 'all', 'must', 'not', 'more', 'their', 'has', 'but', 'can', 'people', 'new', 'world', 'from', 'year', 'which', 'they', 'these', 'you', 'years', 'now', 'than', 'been', 'who', 'should', 'its', 'one', 'make', 'every', 'other', 'those', 'them', 'time', 'was', 'also', 'there', 'many', 'great', 'last', 'first', 'only', 'would', 'when', 'most', 'need', 'own', 'what', 'because']


In [19]:
cv = CountVectorizer(input=texts, min_df=3, stop_words=stoplist)
review_features = cv.fit_transform(texts)
feature_names = cv.get_feature_names()
print("Documents by features:", review_features.shape)

Documents by features: (16434, 9207)


In [20]:
dense_review_features = review_features.toarray()
print(dense_review_features.sum(), "words")
nonzero_prop = len(dense_review_features.nonzero()[0])/(dense_review_features.shape[0]*dense_review_features.shape[1])
print("{:.4f}% nonzero".format(nonzero_prop*100))

print(feature_names[:10])
print(dense_review_features[:20,:10])

192934 words
0.1235% nonzero
['000', '007', '10', '100', '11', '110', '12', '13', '14', '15']
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]


In [21]:
money_lexicon = ['money', 'finance', 'income', 'investment', 'cash', 'dollar']
money_lexicon_idxs = [feature_names.index(w) for w in money_lexicon if w in feature_names]
print(money_lexicon_idxs)
print([feature_names[i] for i in money_lexicon_idxs])

[5307, 4328, 1238, 2376]
['money', 'investment', 'cash', 'dollar']


In [22]:
def find_words_by_spearman(lexicon, feature_names, feature_matrix):
    """uses Spearman rho correlation to find words that are correlated with our lexicon"""
    lexicon_idxs = [feature_names.index(w) for w in lexicon if w in feature_names]
    lexicon_sums = np.sum(dense_review_features[:, money_lexicon_idxs], axis=1)
    n_features = len(feature_names)
    feature_scores = [(spearmanr(lexicon_sums, feature_matrix[:,i]).correlation, feature_names[i]) for i in range(n_features)]
    return sorted(feature_scores, reverse=True)

In [23]:
money_scores = find_words_by_spearman(money_lexicon, feature_names, dense_review_features)

In [29]:
for score, wd in money_scores[:100]:
    print(wd, "{:.4f}".format(score))

money 0.8575
cash 0.4244
dollar 0.2629
investment 0.2037
billion 0.1761
grab 0.1559
buy 0.1348
fork 0.1171
million 0.0895
spend 0.0883
sent 0.0879
happiness 0.0695
zealand 0.0673
weakness 0.0673
tickets 0.0673
savior 0.0673
sale 0.0673
sack 0.0673
rudo 0.0673
robbins 0.0673
pretending 0.0673
plotless 0.0673
novella 0.0673
losers 0.0673
limitless 0.0673
legit 0.0673
individuals 0.0673
hollowness 0.0673
genteel 0.0673
fold 0.0673
feelgood 0.0673
extras 0.0673
drillbit 0.0673
dreamy 0.0673
dispensing 0.0673
disasters 0.0673
cursi 0.0673
contraption 0.0673
congeniality 0.0673
basement 0.0673
produce 0.0664
joyous 0.0664
continue 0.0664
15 0.0664
hope 0.0662
quality 0.0639
reeves 0.0637
corporate 0.0591
garcia 0.0583
hoary 0.0583
terry 0.0580
surrender 0.0580
stranded 0.0580
soda 0.0580
sexist 0.0580
resources 0.0580
replace 0.0580
provided 0.0580
property 0.0580
pointlessly 0.0580
millions 0.0580
lend 0.0580
laced 0.0580
katz 0.0580
juggling 0.0580
incoherent 0.0580
harbor 0.0580
grabbing 

In [25]:
n_topics = 25
lda = LatentDirichletAllocation(n_components=n_topics,batch_size=len(comedy_reviews))
doc_topic_vecs = lda.fit_transform(review_features)
topic_word_vecs = lda.components_

In [26]:
n_words_to_print = 10
for topic in range(n_topics):
    top_word_idxs = np.argsort(topic_word_vecs[topic])[-n_words_to_print:][-1::-1]
    top_words = [feature_names[i] for i in top_word_idxs]
    print(topic, ' '.join(top_words))
    

0 it out movie at like his well series ll gross
1 full review spanish it movie film his best he at
2 it if be like movie as were could we out
3 film as up comedy at it her never heartfelt beautiful
4 it too film be movie like so over just much
5 it entertaining comedy film movie be characters about into romantic
6 film movie director it best kid his since martin another
7 it as be so movie he like his about much
8 as film it so about witty fun your like or
9 it as comedy at out film movie so her about
10 it if even as film movie be re films so
11 it we as movie comedy be funny about his two
12 about it movie de do know or much be comedy
13 as film it piece his heart best find comedy be
14 it family action comedy rom com disney like film as
15 it film movie some like as comedy if be may
16 it fun story film something doesn funny be while mean
17 it comedy his at director writer movie he as about
18 as it movie film just funny be or about his
19 as at it low best movie or comedy cast up


In [27]:
def print_topic_keys(topic_word_vecs, n_words_to_print=20, post_stoplist=[]):
    for topic in range(n_topics):
        top_word_idxs = np.argsort(topic_word_vecs[topic])[-n_words_to_print+len(post_stoplist):][::-1]
        top_words = [feature_names[i] for i in top_word_idxs if feature_names[i] not in post_stoplist]
        print(topic, ' '.join(top_words[:n_words_to_print]))
        print()
    
print_topic_keys(topic_word_vecs, n_words_to_print=50, post_stoplist=['it', 'be', 'at', 'movie', 'as', 'or', 'we', 'film', 'if'])

0 out like his well series ll gross heart soul lot makes too often little yet us story set script laugh comedy into feels old feel movies no way performance fun good familiar off lives through rather re

1 full review spanish his best he like performances director performance anderson out lead good here la fun gives movies among chemistry films very british eye thing after see una laughs my standard so screen making black

2 like were could out little about even too so then love had something into ve better never much any his just go real script no up do written brilliant see maybe such may

3 up comedy her never heartfelt beautiful story apatow his hilarious two out cast funny judd look us life small delightful well comic man about rather woman performance sweet wonderful john very fresh while off way romantic

4 too like so over just much little kids story action even doesn while feels no good out though some looks teen well laughs about he comedy audience characters might together e

In [34]:
def get_top_docs_by_topic(docs, doc_topic_vecs, topic, n_top_docs=50):
    doc_idxs = doc_topic_vecs[:,topic].argsort()[:-n_top_docs-1:-1]
    for idx in doc_idxs:
        print("Topic proportion:", doc_topic_vecs[idx][topic])
        print(docs[idx])
        print()

get_top_docs_by_topic(comedy_reviews, doc_topic_vecs, 0)

Topic proportion: 0.9657142857094732
{'rotten_tomatoes_link': 'm/the_five_year_engagement', 'critic_name': 'Jason Best', 'top_critic': 'False', 'publisher_name': 'Movie Talk', 'review_type': 'Fresh', 'review_score': '', 'review_date': '2012-06-25', 'review_content': "The Five-Year Engagement seems at times to have taken its title literally - it's way too long. But though the pace sometimes drags, Blunt and Segal are amiable companions and the lack of gross-out gags makes a refreshing change."}

Topic proportion: 0.9657142857057369
{'rotten_tomatoes_link': 'm/donovans_reef', 'critic_name': 'Emanuel Levy', 'top_critic': 'False', 'publisher_name': 'EmanuelLevy.Com', 'review_type': 'Fresh', 'review_score': '', 'review_date': '2008-03-11', 'review_content': 'Though not in top form, director John Ford and star John Wayne, building on previous and better collaborations in this easy-going (and lazy) adventure set in the Pacific, offering good old fun with the Duke, Jack Warden, Dorothy Lamour 

___

For access to all three lessons, go to [https://github.com/xandaschofield/tapi-text-data]