# Homework 3: film dialogue

In our lab for Feb 15 we looked at two different ways to compare collections of texts: by examining specific words that are overrepresented in one collection relative to another, or by assessing the strength of a model that attempts to describe the boundary between the two.

We'll pursue both of those strategies a little further in this homework.


## Familiar preparation

To start with we'll import useful modules.

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path

#### Read in the dialogue dataset

It has one line for each character; the field ```lines``` contains all dialogue attributed to that character in the [Cornell Movie Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Separate lines are delimited by slashes, but we'll ignore that here.

We call the dataset ```chars``` because it has one row per character.

Notice that the path below has changed because the ```homeworks/``` directory is only one level below the parent is417 directory, whereas ```labs/Feb15Dialog/``` was two levels below it.

In [3]:
dialogpath = Path('../data/movie_dialogue.tsv')

chars = pd.read_csv(dialogpath, sep = '\t')

# let's also randomize the row order
chars = chars.sample(frac = 1.0)

In [64]:
chars.head()

Unnamed: 0,mid,cid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
1310,m334,u5054,KATHARINE,the english patient,f,1159,1996,"['romance', 'drama', 'war']",False,False,True,True,"Our Garden, our garden - not so much the garde..."
317,m153,u2401,PETE,"o brother, where art thou?",m,460,2000,"['comedy', 'adventure', 'crime', 'music']",True,False,False,False,"Good Lord, what do we do? / Awful sorry I betr..."
964,m272,u4081,BLADE,blade,m,696,1998,"['action', 'adventure', 'fantasy', 'horror', '...",False,True,False,False,"Whistler, I -- / No, we can treat the wounds -..."
24,m101,u1502,MIKE WALLACE,the insider,m,534,1999,"['biography', 'drama', 'thriller']",False,True,True,False,"""Mike?"" / How grave? / No, no, we're fine... /..."
1958,m455,u6837,WILL,nothing but a man,m,303,1964,"['drama', 'romance']",False,False,True,True,Didn't I tell you to beat it - huh? / Couldn't...


## A better way to validate models

In our lab, we spent a lot of energy selecting test sets and training sets. We learned that if you want to produce a truly general model of a category like "comedy," it's important to make sure your algorithm can't "cheat" by just memorizing (say) the names of characters who appear in a limited set of comedies.

Defining train and test sets so they contain non-overlapping groups of movies produced a more general model of comedy, and a more realistic estimate of accuracy.

But each of us got a slightly different measure of model accuracy, because our measure of accuracy depended entirely on a small subset of the data (1/5th of the movies) that we randomly chose as the test set.

To get a more stable measure of accuracy, it's better to *cross-validate* your model, by repeatedly holding out a different 1/5th (or 1/10th) of the data as the test set, and training on the remainder. If we hold out a different test set each time we can eventually test on all the data, without ever testing on data that was included in our training set.

We could do this ourselves by writing a loop, but scikit-learn also comes with a ```cross_validate()``` function that does it for us automatically.

In [4]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_predict
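
If you're curious what the hand-written loop would look like, here is a minimal sketch. It runs on synthetic stand-in data (the arrays `X`, `y`, and `movie_ids` are invented for illustration, not taken from the dialogue dataset), and it only approximates what ```cross_validate()``` does internally.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data (hypothetical): 100 "characters" with 20 count
# features each, drawn from 20 "movies" of 5 characters apiece.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 20))
y = rng.integers(0, 2, size=100)
movie_ids = np.arange(100) % 20

splitter = GroupKFold(n_splits=5)
fold_scores = []
tested_rows = []

for train_idx, test_idx in splitter.split(X, y, groups=movie_ids):
    model = MultinomialNB(alpha=1)            # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])     # train on four fifths
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on the rest
    tested_rows.extend(test_idx.tolist())

print(fold_scores)
print(np.mean(fold_scores))
```

Because the labels here are random, the accuracies themselves mean nothing. The point is the bookkeeping: after five passes every row has been used as test data exactly once, and never in the same pass in which it was used for training.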

First we need to create a matrix of ```wordcounts``` that we'll use to make predictions, and define a useful function.

In [10]:
if 'cid' in chars.columns:         
    character_ids = chars['cid']
    chars = chars.set_index('cid')   # If we haven't made this the index yet, let's do it.
else:
    character_ids = chars.index.tolist()

vectorizer = CountVectorizer(max_features = 8000)
sparse_counts = vectorizer.fit_transform(chars['lines'])  # the vectorizer produces something
                                                          # called a 'sparse matrix'; we need to
                                                          # unpack it
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
wordcounts = pd.DataFrame(sparse_counts.toarray(), index = character_ids,
                            columns = vectorizer.get_feature_names_out())
wordcounts.head()

# We'll also define a useful function

def gendertonumber(astring):
    if astring.lower() == 'f':  # note that we lowercase before checking:
        return 1                    # this version of the function is slightly
    else:                           # better than the one in the lab notebook
        return 0

Having done that, we can proceed to train models.

In the lines below we cross-validate a model of romance, while making sure that the data is grouped by movie-id. If you're curious about GroupKFold (or any other aspect of scikit-learn), you can always [inspect the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html). Basically, what's happening here is that we define an object that splits the data up into five parts, and then instruct it to use movie-id to do the splitting.

What happens if you change the number of splits (```n_splits```) to 10, or to 2? If you see slight changes in accuracy, reflect on why they're happening.

In [22]:
all_y = chars['romance'].astype(int)
bayes = MultinomialNB(alpha = 1)
grouper = GroupKFold(n_splits = 5)
cv_results = cross_validate(bayes, wordcounts, all_y, groups = chars['mid'], cv = grouper)
np.mean(cv_results['test_score'])

0.6850795759733364

The ```cross_validate()``` function produces a different test_score for each of the five separate test sets. Above, we take the mean score, but we can also inspect them individually.

In [128]:
cv_results

{'fit_time': array([0.56459904, 0.56746197, 0.55672193, 0.54945993, 0.5609231 ]),
 'score_time': array([0.07308292, 0.07102227, 0.07092404, 0.07096982, 0.07038021]),
 'test_score': array([0.70707071, 0.72390572, 0.6952862 , 0.71380471, 0.71838111])}

Notice that if we take out the ```groups``` and ```cv``` parameters, ```cross_validate()``` no longer divides the test and training sets by movie-id. In that case, we get an unrealistically high accuracy for genre, because genre becomes easy to predict if you can memorize the vocabulary of specific movies.

In [18]:
all_y = chars['romance'].astype(int)
bayes = MultinomialNB(alpha = 1)
cv_results = cross_validate(bayes, wordcounts, all_y)
np.mean(cv_results['test_score'])

0.8231829253751682
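
One way to see where the extra accuracy comes from is to count how many movies end up on both sides of each split. The sketch below uses an invented layout of movie ids rather than the real data, and plain ```KFold``` as a stand-in for an ungrouped splitter (the splitter that ```cross_validate()``` picks by default may differ in its details).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical layout: 100 characters from 20 movies, with each movie's
# characters scattered through the rows, as they would be after shuffling.
movie_ids = np.arange(100) % 20
X = np.zeros((100, 1))   # the features don't matter for inspecting the splits

def shared_movies(splitter, groups=None):
    """For each fold, count the movies that appear in both train and test."""
    overlaps = []
    for train_idx, test_idx in splitter.split(X, groups=groups):
        common = set(movie_ids[train_idx]) & set(movie_ids[test_idx])
        overlaps.append(len(common))
    return overlaps

ungrouped = shared_movies(KFold(n_splits=5))
grouped = shared_movies(GroupKFold(n_splits=5), groups=movie_ids)
print(ungrouped)
print(grouped)
```

In this toy layout the ungrouped splitter puts every movie on both sides of every split, while the grouped splitter never does. That is the leak that lets a model "cheat" by memorizing movie-specific vocabulary.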

We can also use this same function to model gender. We just have to change the way the response variable *y* is generated.

In [33]:
all_y = chars['gender'].map(gendertonumber)
bayes = MultinomialNB(alpha = 1)
grouper = GroupKFold(n_splits = 5)
cv_results = cross_validate(bayes, wordcounts, all_y, groups = chars['mid'], cv = grouper)
score = np.mean(cv_results['test_score'])
print(score)

0.7322301145235378


## Assignment 1: Changing the smoothing parameter of the model

Write a little loop that tests the gender model above while varying the ```alpha``` parameter of the model.

Remember, this is the Laplace smoothing parameter: a number that gets added to the count of each word to acknowledge that it probably *sometimes* occurs in dialogue spoken by women (or by men), even if not in this sample.
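
As a reminder of the arithmetic, here is a small sketch on invented counts (the matrix `X` and its three-word vocabulary are hypothetical). It computes the smoothed word probabilities by hand as (count + alpha) / (total + alpha × vocabulary size) and checks them against what ```MultinomialNB``` stores after fitting.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts for a three-word vocabulary, one row per class.
# The third word never occurs in class 0.
X = np.array([[3, 1, 0],
              [2, 2, 1]])
y = np.array([0, 1])

for alpha in [0.001, 1, 8]:
    # Additive smoothing by hand: add alpha to every count, then renormalize
    manual = (X + alpha) / (X.sum(axis=1, keepdims=True) + alpha * X.shape[1])

    # The fitted model keeps the same quantities as log probabilities
    model = MultinomialNB(alpha=alpha).fit(X, y)
    fitted = np.exp(model.feature_log_prob_)

    print(alpha, manual[0].round(4), np.allclose(manual, fitted))
```

With a tiny alpha the unseen word gets a probability close to zero in class 0; as alpha grows, the three probabilities are pulled toward equal values.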

**1a.** Try each of the settings in this list: [0.001, 1, 2, 4, 8, 16, 32, 64]. In each case print ```alpha``` and the resulting accuracy.

**1b.** Then do the same thing--in a separate cell--for a model of genre (say, romance).

**1c.** Write a reflection on these results. Why might alphas higher than one improve the model? Why is the ideal alpha not the same for every category? (Your answer may be a little speculative; that's okay; speculate.)

In [None]:
# 1a

In [None]:
# 1b

In [None]:
# 1c

## Assignment 2: Words overrepresented in romance

In the lab on Feb 15 we practiced ways to identify words that are overrepresented in one corpus relative to another.

If you review [Ben Schmidt's blog post about comparing corpuses](http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html), you'll realize that the methods we used in that lab were fairly crude. We were just asking how much larger a word's frequency is in one corpus than in the other, or how many *times* larger its frequency is. As Schmidt's post shows, there may be better ways to make that comparison, and in future weeks we'll explore them.

But for right now, let's just practice the simple ratio test. **2a** Create a list of 25 words that are many times more common in romances than in other movies, and **2b** a list of 25 words that are many times more common in other movies than in romances. Compare these lists to the similar test we performed on gender categories in the lab. **2c** Interpret the evidence. The evidence you get from this comparison will not be decisive, but what hypotheses come to mind as deserving further investigation?
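
As a reminder of the difference between "how much larger" and "how many times larger," here is a toy sketch on made-up counts (the words, the numbers, and the names `corpus_a` and `corpus_b` are all invented). It is not necessarily identical to the code from the lab, and you will still need to build the real frequency tables from ```wordcounts``` yourself.

```python
import pandas as pd

# Made-up counts for four words in two hypothetical corpora
counts = pd.DataFrame({'corpus_a': [40, 10, 900, 50],
                       'corpus_b': [20, 1, 879, 100]},
                      index=['love', 'darling', 'the', 'gun'])

# Turn raw counts into relative frequencies within each corpus
freqs = counts / counts.sum()

difference = freqs['corpus_a'] - freqs['corpus_b']   # how much larger
ratio = freqs['corpus_a'] / freqs['corpus_b']        # how many times larger

print(difference.sort_values(ascending=False))
print(ratio.sort_values(ascending=False))
```

In this toy table the two measures disagree about the top word: the difference favors a very common word ("the"), while the ratio favors a rare one ("darling"). Also note that a word with a count of zero in the second corpus would make the ratio blow up, so on real data you'll need some way to handle zeros.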

In [None]:
# 2a

In [None]:
# 2b

In [None]:
# 2c

## Assignment 3: Do gender stereotypes weaken over time?

It probably doesn't make sense to try to answer this question with a single big model of all movies, because the gender norms involved in shaping dialogue could have *changed* over time.

Instead, let's find the median release date for characters, and divide our dataset into two roughly evenly-sized groups (early and late). Then we can cross-validate models to predict gender in each half of the dataset, and see if the accuracy of the later model is lower.
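
The splitting step might look something like the sketch below. It uses a tiny invented table (the names and years in `toy` are hypothetical), so treat it as an illustration of the idea rather than the solution.

```python
import pandas as pd

# A made-up stand-in for the real table (hypothetical values)
toy = pd.DataFrame({'cname': ['A', 'B', 'C', 'D', 'E', 'F'],
                    'year': [1941, 1974, 1989, 1996, 1999, 2003]})

median_year = toy['year'].median()

early = toy[toy['year'] <= median_year]   # released in the median year or before
late = toy[toy['year'] > median_year]     # released after the median year

print(median_year, len(early), len(late))
```

On real data many characters will share the median year, so the two halves will only be roughly equal in size. Remember too that each half needs its own matching rows of ```wordcounts``` and its own ```groups``` when you cross-validate.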

In the lab notebook, I proposed running this test many times, in order to measure uncertainty and draw inferences about significance. But the technique of "sampling with replacement" may require more explanation. I'll demonstrate a way to do that in the homework solution, but for right now just cross-validate a single model for each half of the timeline -- using the methods from Assignment 1 -- and see what result you get.

You may want to try different values of "alpha," as we did in Assignment 1, and use the value that performs best for each model.

In [None]:
# 3a accuracy of a gender model based on movies released 1995 or before

In [54]:
# 3b accuracy of a gender model based on movies released after 1995