# Exploring film dialogue

There are two parts to this notebook. The first part is just general practice in Pandas. We won't work through all of that in class. But if Pandas is unfamiliar, you may want to work through those sections on your own.

The second part develops models of film dialogue. We will

1. look for words overrepresented in the dialogue of men and women
2. ask whether gender norms are stronger in particular genres, and
3. ask whether gender norms weaken over time


## A. Loading the film dialogue dataset; reviewing Pandas

To start with we'll import useful modules.

In [103]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import random
from pathlib import Path

#### read in the dialogue dataset

It has one line for each character; the field ```lines``` contains all dialogue attributed to that character in the [Cornell Movie Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). Separate lines are delimited by slashes, but we'll ignore that here.

We call the dataset ```chars``` because it has one line per character.

In [104]:
dialogpath = Path('../../data/movie_dialogue.tsv')

chars = pd.read_csv(dialogpath, sep = '\t')

# let's also randomize the row order
chars = chars.sample(frac = 1)

In [105]:
chars.head()

Unnamed: 0,mid,cid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
989,m278,u4168,JOANNE,body of evidence,f,581,1993,"['drama', 'romance', 'thriller']",False,True,True,True,Yes. / Horrible. He was tired and pale. / Yes...
244,m144,u2224,ALEXANDER,napoleon,m,746,1995,"['family', 'adventure']",False,False,False,False,I have given a great deal of thought to that p...
773,m238,u3599,ADDISON,all about eve,m,1833,1950,['drama'],False,False,True,False,"Why not? Tell me, Phoebe, do you want some day..."
2069,m48,u753,MAX,dark angel,M,2663,1990,"['action', 'crime', 'drama', 'horror', 'sci-fi...",False,True,True,False,"This. / Okay, okay. I can explain... You ever..."
1869,m440,u6608,PETER,mimic,m,568,1997,"['drama', 'horror', 'sci-fi']",False,False,True,False,Weird shit...? / Yeah...? / There are thirty f...


#### reviewing pandas: ways to explore the data

Once we've loaded this dataframe, you can explore it in all the ways Melanie Walsh recommends in ["Pandas Basics — Part 2."](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part2.html) For instance, what are the data types of the columns? (```mid``` and ```cid``` are the movie id and the character id.)

In [106]:
chars.dtypes

mid          object
cid          object
cname        object
mname        object
gender       object
wordcount     int64
year          int64
genres       object
comedy         bool
thriller       bool
drama          bool
romance        bool
lines        object
dtype: object

You can also use the ```.describe()``` function, check for missing data with ```.isna()```, rename a column or drop a column, and sort the whole dataframe, by say wordcount or release year. Note that when we run the sort below, we get a sorted dataframe as *output* of the statement. But it doesn't actually change the order of rows in ```chars``` unless we start the command with ```chars =``` or include an argument ```inplace = True``` inside the parentheses.

In [107]:
chars.sort_values(by='wordcount', ascending=False)

Unnamed: 0,mid,cid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
292,m150,u2340,NIXON,nixon,m,7798,1995,"['biography', 'drama']",False,False,True,False,"For Christ's sake, it soils my mother's memory..."
1047,m289,u4331,ACE,casino,m,6387,1995,"['biography', 'crime', 'drama']",False,False,True,False,"Now, instead of the cops only lookin' at Nicky..."
809,m243,u3681,ALVY,annie hall,m,5853,1977,"['comedy', 'drama', 'romance']",True,False,True,True,Your girl friend's name is Ralph? / You don't ...
41,m104,u1568,JIM,jfk,m,5511,1991,"['biography', 'drama', 'history', 'mystery', '...",False,True,True,False,"Gentlemen, I will not hear this. I value Bill..."
361,m161,u2495,STEW,platinum blonde,m,4905,1931,"['comedy', 'romance']",True,False,False,True,"All right, I'm a child. Have it any way you wa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
691,m221,u3335,HARRY,total recall,m,103,1990,"['action', 'adventure', 'sci-fi', 'thriller']",False,True,False,False,"Unh-uh, pal. You've got yourself mixed up wit..."
1824,m431,u6496,SPRINGFIELD,manhunter,m,102,1986,"['crime', 'thriller']",False,True,False,False,No. / It's at the vet's. The kids brought it i...
2551,m571,u8418,SRT,thx 1138,m,102,1971,"['drama', 'mystery', 'sci-fi', 'thriller']",False,True,True,False,Just look at all those people. / Save yourself...
1292,m331,u4986,SHERRY,election,f,101,1999,"['comedy', 'drama']",True,False,True,False,I was lonely. You took advantage / We made a ...
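
To make the point about sorting concrete, here is a minimal sketch on a made-up three-row frame. The names ```toy``` and ```sorted_toy``` and the values in them are invented for illustration; they are not part of the dialogue dataset. It shows that ```.sort_values()``` hands back a sorted copy and leaves the original row order alone until we reassign.

```python
import pandas as pd

# A tiny made-up frame standing in for chars (hypothetical values).
toy = pd.DataFrame({'cname': ['A', 'B', 'C'], 'wordcount': [200, 900, 500]})

# sort_values() returns a new, sorted dataframe ...
sorted_toy = toy.sort_values(by='wordcount', ascending=False)
order_before = toy['wordcount'].tolist()         # ... so toy itself is unchanged here
order_sorted = sorted_toy['wordcount'].tolist()

# Reassigning the result is what makes the new order stick.
toy = toy.sort_values(by='wordcount', ascending=False)
order_after = toy['wordcount'].tolist()

print(order_before, order_sorted, order_after)
```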


#### reviewing more powers of Pandas: indexing and selecting data

The functions explored here are related to Melanie Walsh's post on [Pandas — Part 3.](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part3.html)

How many different movies do we have in this data? We have one row for each character--so 2969 characters--but how many *movies*? The ```unique``` method is useful here. (You could also transform the column into a set and take the length of the set.)

In [108]:
len(chars['mname'].unique())

600
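
As a quick check on the two approaches mentioned above, the sketch below counts distinct titles in a small made-up Series (the titles are just placeholders). It also uses ```.nunique()```, a pandas method that returns the count directly; note that ```.nunique()``` skips missing values by default, so the three can disagree when a column contains NaN.

```python
import pandas as pd

# Made-up titles with repeats, standing in for chars['mname'].
titles = pd.Series(['annie hall', 'annie hall', 'casino', 'nixon', 'casino'])

n_from_unique = len(titles.unique())   # array of distinct values, then its length
n_from_set = len(set(titles))          # plain Python set of the values
n_from_nunique = titles.nunique()      # pandas shortcut for the same count

print(n_from_unique, n_from_set, n_from_nunique)
```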

Notice that in order to focus on a single column, I selected by using square brackets and a string that was the column name. There are several other ways to select parts of a dataframe.

The first (bolded) column in the dataframe is an "index." We can select rows based on the index, using ```.loc[]```. *Notice the square brackets; this is a form of indexing/selection, not a function call, which is done with round parentheses.*

Right now the index is just an integer counter that was created when we read this file. For instance, when we originally read in the file, Neo was in the row labeled 1831.

In [109]:
chars.loc[[1831, 1832], : ]

Unnamed: 0,mid,cid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
1831,m433,u6523,NEO,the matrix,m,926,1999,"['action', 'adventure', 'adventure', 'adventur...",False,False,False,False,You ever have the feeling that you're not sure...
1832,m433,u6524,ORACLE,the matrix,f,231,1999,"['action', 'adventure', 'adventure', 'adventur...",False,False,False,False,You're going to have to make a final choice. ...


You can also pass an integer to ```.iloc[]``` to select rows in the current order. Since we randomized the rows after reading them in, this produces a different result than the command above. ```.iloc[]``` gives you the row that is *now* at position 1831, whereas ```.loc[]``` gives you the row that was originally labeled 1831.

We're passing lists of numbers right now, but a single number will also work. Take out the inner brackets in the command below and see what happens.

In [110]:
chars.iloc[[1831], : ]

Unnamed: 0,mid,cid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
2607,m582,u8591,TOMMY,trainspotting,m,178,1996,"['crime', 'drama']",False,False,True,False,It'll be here somewhere. I might have returned...
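
The difference between label-based and position-based selection is easier to see on a frame small enough to read at a glance. In this sketch the four-row ```toy``` frame is made up, and reversing the rows stands in for the random shuffle so that the result is the same on every run.

```python
import pandas as pd

# Made-up frame; the default index labels are 0, 1, 2, 3.
toy = pd.DataFrame({'cname': ['NEO', 'ORACLE', 'TOMMY', 'ALVY']})

# Reorder the rows (a deterministic stand-in for .sample(frac = 1)).
# The index labels travel with their rows.
reordered = toy.iloc[::-1]

by_label = reordered.loc[0, 'cname']       # the row whose label is 0
by_position = reordered.iloc[0]['cname']   # whichever row is first now

print(reordered)
print(by_label, by_position)
```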


We can also set the index of the dataframe to be something other than a mere row number.

In [111]:
chars = chars.set_index('cid')
chars.head()

cid,mid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
u4168,m278,JOANNE,body of evidence,f,581,1993,"['drama', 'romance', 'thriller']",False,True,True,True,Yes. / Horrible. He was tired and pale. / Yes...
u2224,m144,ALEXANDER,napoleon,m,746,1995,"['family', 'adventure']",False,False,False,False,I have given a great deal of thought to that p...
u3599,m238,ADDISON,all about eve,m,1833,1950,['drama'],False,False,True,False,"Why not? Tell me, Phoebe, do you want some day..."
u753,m48,MAX,dark angel,M,2663,1990,"['action', 'crime', 'drama', 'horror', 'sci-fi...",False,True,True,False,"This. / Okay, okay. I can explain... You ever..."
u6608,m440,PETER,mimic,m,568,1997,"['drama', 'horror', 'sci-fi']",False,False,True,False,Weird shit...? / Yeah...? / There are thirty f...


Now notice what happens when we pass a list of character ids to ```.loc[]```.

In [112]:
chars.loc[['u3681', 'u8858']]

cid,mid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
u3681,m243,ALVY,annie hall,m,5853,1977,"['comedy', 'drama', 'romance']",True,False,True,True,Your girl friend's name is Ralph? / You don't ...
u8858,m601,CAITLIN,what lies beneath,f,100,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,Someone crying. A girl. I thought I was craz...


We can also select rows using a Boolean mask. For instance, suppose we create a list of length equal to the dataframe length, with all members of the list ```False```, except for two ```Trues.```

In [113]:
mask = [False] * chars.shape[0]   # If it's not clear what this does, create a cell, and
mask[130] = True                  # execute chars.shape. Then another cell, and 
mask[1030] = True                 # execute [True] * 4
print('The length is ', len(mask))
print('and the number of Trues is ', sum(mask))

The length is  2969
and the number of Trues is  2


In [114]:
chars[mask]

cid,mid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
u6872,m459,SALLY,the nightmare before christmas,f,120,1993,"['animation', 'family', 'fantasy', 'musical']",False,False,False,False,"Experiments? / When he left, he took a lot of ..."
u642,m40,WILLIAM,braveheart,m,217,1995,"['action', 'biography', 'drama', 'history', 'w...",False,False,True,False,"What are they doing? / It was in Latin, sir. /..."


In this example, the Boolean mask is just a silly way of doing the same thing as ```chars.iloc[[130, 1030]]```. But there are also ways to use the mask where it becomes genuinely useful.

In [115]:
chars[chars['mname'] == 'what lies beneath']

cid,mid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
u8867,m601,NORMAN,what lies beneath,m,1169,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,What? / Shhhh. / IT'S TRUE. That's...you have...
u8858,m601,CAITLIN,what lies beneath,f,100,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,Someone crying. A girl. I thought I was craz...
u8864,m601,MRS. FEUR,what lies beneath,f,240,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,I'm sorry that I frightened you like that. Bu...
u8859,m601,CLAIRE,what lies beneath,f,1965,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,Then we both are. / What? / Did you? / I start...
u8862,m601,JODY,what lies beneath,f,630,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,"Trust me, Claire. You hear something... chang..."
u8860,m601,DR. DRAYTON,what lies beneath,m,141,2000,"['drama', 'horror', 'mystery', 'thriller']",False,True,True,False,That can't feel good. / What? / You were in an...


The comparison operator ```==``` will test every member of the column ```mname``` and return True in cases where the expression is true. Note that comparisons only *broadcast* across a sequence this way if you're working with pandas columns or numpy arrays. Broadcasting doesn't work with basic Python lists; comparing a list to a single element will always return False, because a list is not *the same thing as* a single element.

In [116]:
fruit_list = ['apple', 'orange', 'apple']
fruit_list == 'apple'

False

In [117]:
fruit_series = pd.Series(['apple', 'orange', 'apple'])
fruit_series == 'apple'

0     True
1    False
2     True
dtype: bool

The columns in a pandas DataFrame are each a pandas Series; that's why you can use comparisons to create Boolean masks.

#### Exercise for you to try at home

Create a list of all the movie titles that are comedies. The list should only include *unique* values; we don't need six copies of 'airplane!'

## B. Now let's create some models of film dialogue!

Because we've explored Naive Bayes models, I'm going to use that function in the examples that follow. But if you're disappointed with the accuracy we're getting, rest assured that it's possible to do better once we learn fancier algorithms.

We're going to be using raw word frequencies as predictive features.

So first we need to create a dataframe that stores the wordcounts for each character.

In [118]:
if 'cid' in chars.columns:         
    character_ids = chars['cid']
    chars = chars.set_index('cid')   # If we haven't made this the index yet, let's do it.
else:
    character_ids = chars.index.tolist()

vectorizer = CountVectorizer(max_features = 8000)
sparse_counts = vectorizer.fit_transform(chars['lines']) # the vectorizer produces something
                                                               # called a 'sparse matrix'; we need to
                                                               # unpack it
wordcounts = pd.DataFrame(sparse_counts.toarray(), index = character_ids, 
                            columns = vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
wordcounts.head()

Unnamed: 0,00,000,10,100,1000,11,12,13,14,15,...,yuh,yup,yuppie,zack,zero,zip,zoe,zone,zoo,zuzu
u4168,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
u2224,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
u3599,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
u753,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
u6608,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Modeling comedic characters

Let's start by building a model that distinguishes characters in comedies from other characters.

Since we've got lots of variables, a regular statistical model trained on all the data would overfit. So we'll need to distinguish our test set from a training set. Let's choose 600 random rows as a test set. The .sample() function will do this.

In [119]:
test_X = wordcounts.sample(600)
test_X.head(2)

Unnamed: 0,00,000,10,100,1000,11,12,13,14,15,...,yuh,yup,yuppie,zack,zero,zip,zoe,zone,zoo,zuzu
u3777,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
u2188,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Notice that we have the character ids as index. That means we can select the 'comedy' columns for the same rows and make those test_y. 

In [120]:
test_characters = test_X.index.tolist()
test_y = chars.loc[test_characters, 'comedy']

print('The length of your test set is ', len(test_y))
print('And of that group', sum(test_y), 'are from a comedy.')

The length of your test set is  600
And of that group 170 are from a comedy.


Note that right now we have these values as Boolean Trues and Falses, but we want ones and zeroes. We can change that using ```.astype()```.

In [121]:
print(test_y[0:10])
test_y = test_y.astype(int)
print(test_y[0:10])

cid
u3777     True
u2188     True
u2654     True
u5829     True
u8657    False
u1520     True
u2215     True
u1788    False
u5179    False
u5754    False
Name: comedy, dtype: bool
cid
u3777    1
u2188    1
u2654    1
u5829    1
u8657    0
u1520    1
u2215    1
u1788    0
u5179    0
u5754    0
Name: comedy, dtype: int64


Now we need a training set, which has everything not in the test set.

In [122]:
train_X = wordcounts.loc[~wordcounts.index.isin(test_characters), : ]

# That's worth reading closely; I'm introducing the method 'isin()', which returns
# True when an item in a Pandas series or index is in a collection (a set or list)
# passed inside the parentheses

train_y = chars.loc[~chars.index.isin(test_characters), 'comedy'].astype(int)

Now our task becomes simple. Create a naive Bayes model, fit it to our training set, then use that model to predict on the test set.

In [123]:
bayes = MultinomialNB(alpha = 1)
bayes.fit(train_X, train_y)

predictions = bayes.predict(test_X)
sum(predictions == test_y) / len(test_y)

0.8083333333333333

That's pretty high accuracy, but let's not get too excited. There are reasons to be a little suspicious here. Remember that characters come from movies. We randomly selected a test set of characters. None of those *characters* will be in the training set, but many characters *from the same movies* will be in the training set. And when two characters come from the same movie, vocabulary is likely to be similar. So our model might not be learning "funny vocabulary"; it might just be learning that references to "Holy" and "Grail" prove you're in a comedy.

That sort of model will generalize well *here* (if you're likely to encounter other characters from Monty Python). But it won't generalize well in the wild.
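
One way to see how much of this overlap a random split produces is to count the test characters whose movie also shows up in the training set. The sketch below does that on a made-up six-character table; the ids, the ```toy``` frame, and the hand-picked ```test_ids``` are all invented for illustration. It is only a diagnostic, not a fix.

```python
import pandas as pd

# Made-up character table: the index is a character id, 'mid' is the movie id.
toy = pd.DataFrame({'mid': ['m1', 'm1', 'm2', 'm2', 'm3', 'm3']},
                   index=['u1', 'u2', 'u3', 'u4', 'u5', 'u6'])

test_ids = ['u1', 'u3']                          # pretend these were sampled at random
test_rows = toy.loc[test_ids]
train_rows = toy.loc[~toy.index.isin(test_ids)]  # everything not in the test set

# Movies that appear on both sides of the split.
shared_movies = set(test_rows['mid']) & set(train_rows['mid'])

# Fraction of test characters that have a castmate in the training set.
leaky_fraction = test_rows['mid'].isin(shared_movies).mean()

print(sorted(shared_movies), leaky_fraction)
```

Running the same check on the real split above would show how many of the 600 test characters share a movie with the training data.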

### Lab project 1: For a better test, let's separate train and test sets *by movie id*

Here's what you need to do:

    a) create a list of unique movie ids
    b) randomly select a fifth of them (120) as a test set
    c) find the character ids that correspond to that list of movies
    d) then do everything else as above, except that you won't need to create the test set
    by random sampling

I'm going to help you with parts A & B because there's a little kink that complicates things.

In [124]:
# Parts A & B: 

# random sampling unique values is a little tricky,
# because the unique values are no longer a pandas series,
# so you need a different random sampling function, like
# 'random.sample()' from base Python.

movies = sorted(set(chars['mid']))   # random.sample() needs a sequence; Python 3.11+ rejects a set
test_movies = random.sample(movies, 120)
len(test_movies)

120

In [125]:
# Now get a list of characters who are in test_movies


In [126]:
# and use that list to create train_X and test_X,
# plus train_y and test_y


In [127]:
# Now some boilerplate that tests a model

bayes = MultinomialNB(alpha = 1)
bayes.fit(train_X, train_y)

predictions = bayes.predict(test_X)
sum(predictions == test_y) / len(test_y)

0.8083333333333333

So we can *weakly* predict whether dialogue comes from a comedy, but not as well as we thought. Our earlier model was mostly learning to recognize a particular set of movies.

Can we predict character gender? Here it's probably less crucial to separate test and training sets by movie id, because character gender does not correlate strongly with a limited set of movie ids. But let's test that empirically. First let's use our existing list of characters, divided by movie, keep the same train_X and test_X, and just convert train_y and test_y so we're learning to predict gender.

Because gender is coded as 'm' and 'f,' not True and False, we'll need to write a function to convert genders to 1s and 0s.

In [128]:
def gendertonumber(astring):
    if astring == 'f':
        return 1
    else:
        return 0

test_y = chars.loc[test_characters, 'gender'].map(gendertonumber)
train_y = chars.loc[~chars.index.isin(test_characters), 'gender'].map(gendertonumber)

In [129]:
bayes = MultinomialNB(alpha = 1)
bayes.fit(train_X, train_y)

predictions = bayes.predict(test_X)
sum(predictions == test_y) / len(test_y)

0.7133333333333334

Gender is about as predictable as the genre a character is in.

But why is it predictable?

### Lab project 2: what do men and women sound like in the movies?

Let's start by counting up all the words spoken by men and women.

In [130]:
masculine_counts = wordcounts.loc[chars['gender'] == 'm', : ].sum(axis = 'rows')
masculine_counts[1000:1005]

bye        131
cab         77
cabin       32
cabinet     14
cable       31
dtype: int64

In [131]:
feminine_counts = wordcounts.loc[chars['gender'] == 'f', : ].sum(axis = 'rows')
feminine_counts[1000:1005]

bye        106
cab         28
cabin       23
cabinet      4
cable       12
dtype: int64

Now your mission, if you choose to accept it: 

    find words that are overrepresented in the movie dialogue of men and women.

I recommend you start by normalizing the counts into frequencies, using the smoothing formula below.

Then review [Ben Schmidt's blog post about comparing corpuses,](http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html) and try both the "multiplication" method and the "addition" method to compare ```masculine_freqs``` and ```feminine_freqs.``` In other words, look first at the ratio of feminine frequencies to masculine -- and then at the absolute difference between them.

```.sort_values()``` will put a Pandas series in numeric order.

What do we learn? (PS: there's one aspect of this that I actually find jaw-droppingly weird. At first I thought it was an error.)

In [132]:
# Normal Laplacian smoothing would be +1 instead of +100
# as I've done below. I'm using heavy smoothing for reasons
# that will shortly become clear. But you can play around
# with that constant.

masculine_freqs = (masculine_counts + 100) / sum(masculine_counts)
feminine_freqs = (feminine_counts + 100) / sum(feminine_counts)


### A better way to validate models

In the lines above, we spent a lot of energy selecting test sets and training sets. We also didn't get a very stable measure of model accuracy, because our measure of accuracy depended entirely on a small subset of the data that we chose as the test set.

It's better to *cross-validate* the model, by repeatedly holding out a different 1/5th (or 1/10th) of the data as the test set, and training on the remainder. If we hold out a different test set each time we can eventually test on all the data, without ever testing on data that was included in our training set.

We could do this ourselves by writing a loop, but scikit-learn also comes with functions that do it for us automatically.
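
For reference, a hand-written version of that loop might look roughly like the sketch below. It runs on randomly generated stand-in data rather than on ```wordcounts```: the arrays ```X```, ```y```, and ```groups``` are made up, with ```groups``` playing the role of movie ids. ```GroupKFold.split()``` yields row positions for each fold, and the loop checks that no group lands on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 20))   # fake word counts: 100 characters, 20 words
y = rng.integers(0, 2, size=100)         # fake 0/1 labels
groups = np.repeat(np.arange(20), 5)     # 20 fake "movies", 5 characters each

scores = []
tested = np.zeros(len(y), dtype=bool)    # tracks which rows have served as test data

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # No movie should appear in both the training and the test fold.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    model = MultinomialNB(alpha=1)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # accuracy on the held-out fold
    tested[test_idx] = True

mean_accuracy = np.mean(scores)
print(mean_accuracy, tested.all())
```

Because the labels here are random, the accuracy itself is meaningless; the point is only the shape of the loop.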

In the lines below we cross-validate a model of comedy, while making sure that the data is grouped by movie-id. If you take out the last two arguments of the cross-validate function, you can do the same thing without grouping the data. **Why does the result change?**

In [133]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_predict

In [134]:
all_y = chars['comedy'].astype(int)
bayes = MultinomialNB(alpha = 1)
grouper = GroupKFold(n_splits = 5)
cv_results = cross_validate(bayes, wordcounts, all_y, groups = chars['mid'], cv = grouper)
np.mean(cv_results['test_score'])

0.7150788378444366

### Lab project 3a: Changing the smoothing parameter of the model

Write a little loop that tests the genre model above while varying the ```alpha``` parameter of the model.

Remember, this is the Laplacian smoothing--a number that gets added to the count of each word to acknowledge that it probably *sometimes* occurs inside/outside comedies, even if not in this sample.

Try each of the settings in this list: [1, 2, 4, 8, 10, 12, 16, 20]. In each case print ```alpha,``` and the resulting accuracy. What is this telling us? Why might different parameters perform better or worse in different datasets?
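
For reference, the smoothing described above can be written out. This is the smoothed estimate that scikit-learn documents for ```MultinomialNB```; the symbols are chosen here for illustration and don't appear elsewhere in the notebook.

$$\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha \, |V|}$$

Here $N_{wc}$ is the number of times word $w$ occurs in training examples of class $c$, $N_c$ is the total count of all words in class $c$, and $|V|$ is the number of words in the vocabulary (8000 in our vectorizer). With $\alpha = 0$ a word never seen in a class would get probability zero; larger values of $\alpha$ pull every word's estimate toward $1 / |V|$.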

### Collective accuracy is a blunt description. Can this tell us anything about individuals?

Yes, maybe--with the proviso that we're only going to get very rough probabilistic indications. There's a lot of noise and error. But with that proviso, we can inquire, for instance, which characters *do or don't conform* to the overall gender norms of movie dialogue.

We find this out by asking about the probabilities the model generates for individual characters, using the "predict_proba" method.

In [135]:
all_y = chars['gender'].map(gendertonumber)
bayes = MultinomialNB(alpha = 10)
grouper = GroupKFold(n_splits = 10)
proba = cross_val_predict(bayes, wordcounts, all_y, groups = chars['mid'], cv = grouper, method='predict_proba')

In [136]:
proba = [x[1] for x in proba]   # since 'f' is the positive class in this model
proba[0:10]            # we're interested in the predicted probability of being feminine

[0.0037331418556135957,
 5.707649512670865e-12,
 1.4177570883788445e-07,
 1.6590718488505837e-52,
 1.1148026651816514e-10,
 0.007589326239666668,
 0.004826050019705022,
 0.8024038799973086,
 3.032764171503745e-25,
 0.9808879828078247]
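
It may help to see why column 1 is the right one to keep. In scikit-learn, the columns of ```predict_proba``` follow the order of the fitted model's ```classes_``` attribute, which is sorted, so with labels 0 and 1 the second column belongs to class 1. The sketch below checks this on four made-up rows of counts; the data are invented and far too small to mean anything.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])   # made-up counts for two words
y = np.array([0, 0, 1, 1])                       # 1 plays the role of 'f'

model = MultinomialNB(alpha=1).fit(X, y)
probs = model.predict_proba(X)

print(model.classes_)        # the column order of probs follows this array
print(probs.sum(axis=1))     # the two columns in each row sum to 1

positive = probs[:, 1]       # same values as [x[1] for x in probs]
print(positive)
```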

Now we can map those predicted probabilities back onto the character dataset.

In [137]:
charserror = chars.assign(error = np.abs(all_y - proba))
charserror.round(4)

cid,mid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines,error
u4168,m278,JOANNE,body of evidence,f,581,1993,"['drama', 'romance', 'thriller']",False,True,True,True,Yes. / Horrible. He was tired and pale. / Yes...,0.9963
u2224,m144,ALEXANDER,napoleon,m,746,1995,"['family', 'adventure']",False,False,False,False,I have given a great deal of thought to that p...,0.0000
u3599,m238,ADDISON,all about eve,m,1833,1950,['drama'],False,False,True,False,"Why not? Tell me, Phoebe, do you want some day...",0.0000
u753,m48,MAX,dark angel,M,2663,1990,"['action', 'crime', 'drama', 'horror', 'sci-fi...",False,True,True,False,"This. / Okay, okay. I can explain... You ever...",0.0000
u6608,m440,PETER,mimic,m,568,1997,"['drama', 'horror', 'sci-fi']",False,False,True,False,Weird shit...? / Yeah...? / There are thirty f...,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
u4731,m314,GREGORY,the curse,m,166,1987,"['sci-fi', 'horror']",False,False,False,False,So pee? Here? / I wouldn't mind doin' somethin...,0.5146
u3710,m244,MONICA,the anniversary party,f,623,2001,"['drama', 'comedy']",True,False,True,False,You have? / Excuse me? / We live next door. / ...,0.9998
u6361,m424,BILLY,lord of illusions,m,129,1995,"['fantasy', 'horror', 'mystery', 'thriller']",False,True,False,False,"Hey, anytime. Actually, no. This was enough....",0.0046
u6749,m450,JEFFREY,my girl 2,m,212,1994,"['comedy', 'drama', 'family', 'romance']",True,False,True,True,She didn't wanna miss out on anything...especi...,0.4781


### Lab project 3b: Are gender norms stronger in some genres?

The ```error``` column above basically indicates whether a character diverged from our model of gender. Are there some genres where characters are especially likely, or unlikely, to fit the gender norms of the larger dataset? Consider the four genres we've broken out as separate columns. Calculate the mean error for characters in each genre; then use a t-test to figure out whether the difference between the most gender-conforming genre and the least gender-conforming genre is statistically significant.
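
If you haven't used a t-test in Python before, the sketch below shows the shape of the call on two made-up samples. The arrays ```group_a``` and ```group_b``` are random numbers, not errors from our model, and ```equal_var = False``` is one possible choice (Welch's version of the test, which doesn't assume the two groups have the same variance).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.uniform(0.0, 0.4, size=200)   # fake "errors" for one group
group_b = rng.uniform(0.1, 0.5, size=200)   # fake "errors" for another group

# Welch's t-test: compares the two means without assuming equal variances.
result = ttest_ind(group_a, group_b, equal_var=False)

print(group_a.mean(), group_b.mean())
print(result.statistic, result.pvalue)
```

A negative statistic means the first group's mean is lower than the second's; the p-value says how surprising a gap that large would be if the two groups really had the same mean.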

In [138]:
from scipy.stats import ttest_ind



#### Individual movies

Just for fun we can take the mean error for each movie and see if any patterns emerge.

In [139]:
movieerror = charserror.groupby('mname', as_index = False)['error'].mean()
movieerror.round(5)

Unnamed: 0,mname,error
0,"""murderland""",0.00000
1,10 things i hate about you,0.07713
2,1492: conquest of paradise,0.01239
3,15 minutes,0.24999
4,2001: a space odyssey,0.00000
...,...,...
595,willow,0.33147
596,witness,0.25000
597,wonder boys,0.56387
598,xxx,0.00000


In [140]:
movieerror.sort_values(by = 'error').head(12)   # movies where gender is predictable

Unnamed: 0,mname,error
503,the life of david gale,7.927802e-67
554,thunderheart,7.482145e-41
552,thirteen days,5.520761e-32
556,ticker,6.135094e-30
427,star wars,3.574646e-24
21,alien,3.950033e-24
293,midnight cowboy,8.448863000000001e-23
363,red white black & blue,2.361128e-17
598,xxx,7.843311000000001e-17
352,predator,2.890573e-16


In [141]:
movieerror.sort_values(by = 'error').tail(12)     # movies where gender is hard to predict

Unnamed: 0,mname,error
530,the silence of the lambs,0.666667
152,final destination 2,0.666667
383,serial mom,0.666745
545,the witching hour,0.66813
382,scream 3,0.713455
95,cherry falls,0.71569
309,my best friend's wedding,0.73323
410,sounder,0.749896
200,hellraiser iii: hell on earth,0.89813
127,drop dead gorgeous,0.931086


### Lab project 4: Do gender stereotypes in film weaken over time?

It probably doesn't make sense to try to answer this question with a single big model of all movies, because the gender norms involved in shaping dialogue could have *changed* over time.

Instead, let's find the median release date for characters, and divide our dataset into two roughly evenly-sized groups (early and late). Then we can cross-validate models to predict gender in each half of the dataset, and see if the accuracy of the later model is lower.

We can start by doing this just once for each half of the data. Then, to get a sense of uncertainty, we can write a loop that repeatedly resamples the data and runs the cross-validation, say, twenty times for each half of the timeline. We can compare the sets of accuracies to see if the difference of means is significant.

Then we'll consider possible confounding factors that could be producing this result.