# Homework 3: distinctive words

We calculated distinctive words in movie titles. But there aren't really very many words in movie titles!

It would be more fun to do the same thing with movie dialogue.

Fortunately we have a dataset available.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import math
pd.set_option('display.max_rows', 100)

In [2]:
dialogpath = Path('../data/movie_dialogue.tsv')

chars = pd.read_csv(dialogpath, sep = '\t')

# let's also randomize the row order
chars = chars.sample(frac = 1.0)

chars.head()

| Unnamed: 0 | mid | cid | cname | mname | gender | wordcount | year | genres | comedy | thriller | drama | romance | lines |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2183 | m505 | u7468 | DEREK | scream 2 | m | 463 | 1997 | ['horror', 'mystery', 'thriller'] | False | True | False | False | Yeah. / She dumped me. / Not one bit. / Defini... |
| 299 | m151 | u2360 | MOSS | no country for old men | m | 673 | 2007 | ['crime', 'drama', 'mystery', 'thriller', 'wes... | False | True | True | False | She'll be all right. / No. This works better. ... |
| 1240 | m322 | u4829 | MA STONE | the devil and daniel webster | M | 795 | 2004 | ['comedy', 'drama', 'fantasy'] | True | False | True | False | I see you! Riding pretty high, ain't you? Look... |
| 920 | m265 | u3979 | ADAM | beetle juice | m | 945 | 1988 | ['comedy', 'fantasy'] | True | False | False | False | I will <u>never</u> sell this house. I'll be b... |
| 1207 | m316 | u4752 | DAVE | dave | m | 1127 | 1993 | ['comedy', 'romance'] | True | False | False | True | I'm the President and as they say 'The buck st... |


## Assignment 1

Find the 25 words that most strongly characterize romance, and the 25 that most strongly characterize dialogue that is not in a romance.

1. In doing this, create a CountVectorizer that considers the top *5000* words, not just the top *100* we took for titles. You've got more words to work with now.

2. You can use the 'romance' column to divide rows.

3. Use the get_dunnings function we developed in the lab to measure Dunning's log-likelihood.

Finally, report the 25 words at the top and the 25 at the bottom of a list sorted by Dunning's statistic.

In [3]:
vectorizer = CountVectorizer(max_features = 5000)
sparse_wordcounts = vectorizer.fit_transform(chars['lines'])
wordcounts = sparse_wordcounts.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = vectorizer.get_feature_names_out()
wordcounts = pd.DataFrame(wordcounts, columns = features)

In [4]:
romantic = wordcounts.loc[chars['romance'] == True, features].sum(axis = 'rows')
unromantic = wordcounts.loc[chars['romance'] != True, features].sum(axis = 'rows')
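
One thing worth double-checking in the cell above: `chars` was shuffled with `sample(frac = 1.0)`, so its index labels are out of order, while `wordcounts` was built with a fresh 0..n-1 index. pandas aligns a boolean Series mask on index labels, not on row position, so the two can silently disagree. The sketch below uses toy data (`toy_chars` and `toy_counts` are hypothetical stand-ins, not the real frames) to show the difference, plus two ways to force positional agreement.

```python
import pandas as pd

# Toy stand-in for `chars` after shuffling: index labels are out of order.
# Positional order of 'romance' is [False, True, True, False].
toy_chars = pd.DataFrame({'romance': [True, False, False, True]}).iloc[[2, 0, 3, 1]]

# Toy stand-in for `wordcounts`: rows follow toy_chars' positional order,
# but the index is a fresh 0..3 range.
toy_counts = pd.DataFrame({'love': [0, 5, 7, 0]})

mask = toy_chars['romance'] == True

# A Series mask is aligned on labels, so rows 0 and 3 are selected.
by_label = toy_counts.loc[mask, 'love'].sum()

# Option 1: drop the labels so the mask is applied by position.
by_position = toy_counts.loc[mask.to_numpy(), 'love'].sum()

# Option 2: give the counts the same index as the shuffled frame.
aligned = toy_counts.set_axis(toy_chars.index, axis = 0)
by_shared_index = aligned.loc[mask, 'love'].sum()

print(by_label, by_position, by_shared_index)
```

If the same thing is happening with the real data, passing `index = chars.index` when building `wordcounts` (or calling `chars.reset_index(drop = True)` after shuffling) would keep each character's counts paired with that character's own `romance` label.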

In [5]:
def get_dunnings(word, series1, series2):
    observed = pd.DataFrame({'series1': [series1[word], sum(series1) - series1[word]],
                          'series2': [series2[word], sum(series2) - series2[word]]},
                        index = [word, 'all_others'])
    total_words = observed.to_numpy().sum()
    observed['word_totals'] = observed.sum(axis = 1)
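    # DataFrame.append was removed in pandas 2.0; pd.concat adds the same totals row.
    # Casting to float avoids dtype problems when the totals are normalized below.
    group_totals = observed.sum(axis = 0).to_frame('group_totals').T
    observed = pd.concat([observed, group_totals]).astype(float)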
    observed.iat[2,2] = 0
    observed['word_totals'] = observed['word_totals'] / sum(observed['word_totals'])
    observed.loc['group_totals', : ] = observed.loc['group_totals', : ] / sum(observed.loc['group_totals', : ])
    expected = np.outer(observed['word_totals'][0:2], observed.loc['group_totals', : ][0:2])
    expected = pd.DataFrame(expected, index = [word, 'all_others'], columns = ['series1', 'series2'])
    expected = expected * total_words
    
    G = 0
    for i in range(2):
        for j in range(2):
            O = observed.iat[i, j] + .000001
            E = expected.iat[i, j] + .000001
            G = G + O * math.log(O / E)
    
    if (observed.iat[0, 0] / sum(observed.iloc[0: 2, 0])) < (observed.iat[0, 1] / sum(observed.iloc[0 : 2, 1])):
        G = -G
    
    return 2 * G, observed, expected
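
For reference, the number `get_dunnings` returns is a signed version of the G statistic, $G = 2 \sum_{i,j} O_{ij} \ln(O_{ij} / E_{ij})$, taken over the four cells of a 2×2 table (this word vs. all other words, group 1 vs. group 2). Here is a minimal NumPy sketch of the same calculation, working directly from the four counts. The function name `signed_g` and its arguments are hypothetical, not part of the lab code; it reuses the same small smoothing constant so that zero counts don't break the logarithm.

```python
import numpy as np

def signed_g(a, b, c, d):
    """Signed G statistic for a 2x2 table.

    a = count of the word in group 1, b = count of the word in group 2,
    c = all other words in group 1,  d = all other words in group 2.
    """
    observed = np.array([[a, b], [c, d]], dtype = float)
    row_totals = observed.sum(axis = 1, keepdims = True)   # shape (2, 1)
    col_totals = observed.sum(axis = 0, keepdims = True)   # shape (1, 2)
    # Expected counts if the word were used at the same rate in both groups.
    expected = row_totals @ col_totals / observed.sum()

    eps = 0.000001  # same smoothing constant as get_dunnings
    g = 2 * np.sum((observed + eps) * np.log((observed + eps) / (expected + eps)))

    # Negative sign when the word is relatively more common in group 2.
    if a / (a + c) < b / (b + d):
        g = -g
    return g

print(signed_g(20, 5, 80, 95))
```

A word used at the same rate in both groups scores zero, and swapping the two groups flips the sign.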

In [6]:
dunningslist = []

for w in features:
    G, observed, expected = get_dunnings(w, romantic, unromantic)
    dunningslist.append(G)

In [7]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [8]:
dunnings = pd.Series(dunningslist, index = features)

In [9]:
dunnings = dunnings.sort_values()

In [10]:
dunnings[0: 25]

ray         -56.21675
theo        -42.28872
marty       -41.72248
brad        -41.14686
roy         -35.06822
pete        -34.63867
claudia     -34.22905
lenny       -32.59347
walter      -29.58421
diego       -29.00593
sergeant    -27.86446
west        -27.72823
deeds       -27.26712
amy         -26.82898
lex         -26.68696
alvy        -26.10680
superman    -25.17840
an          -24.85525
ned         -24.63396
preysing    -24.36632
gallagher   -24.36632
debbie      -24.09016
diz         -23.78617
creasy      -23.78617
neo         -23.20601
dtype: float64

In [11]:
dunnings[-25 : ]

kubelik     32.74856
lecter      32.88106
marcus      34.16615
grail       34.19810
jake        34.73883
huh         36.28533
joanna      37.38169
harold      38.20273
barton      39.23366
thou        44.94803
norman      46.78519
mulwray     52.36025
edie        52.42091
mister      55.08855
lloyd       57.83746
agnes       58.29420
mcmurphy    61.86155
dil         62.21768
toto        64.38499
beavis      66.20027
dickie      67.72528
grace       86.57108
sal        106.73629
thelma     122.16247
heh        189.70823
dtype: float64

I confess these are not very interpretable features! If you don't immediately see why one set of proper names is associated with romance, and another set isn't -- you're right to be unsure.

## Assignment 2

A. What's the probability $P(\text{character-in-romance})$, for all characters in this dataset?

B. What's the conditional probability of a character occurring in a romance, given that the character speaks the word 'love'?

In other words, calculate

$P(\text{character-in-romance} \mid \text{character-says-love})$

Note that both of these questions require a slightly different approach from the probability table we used to calculate Dunning's log-likelihood. The things being counted here are not words, but characters.
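
To make the "count characters, not words" idea concrete, here is a minimal sketch on a toy frame. The column `says_word` and all the values are made up for illustration. Each row is one character, so a probability is just the share of rows where a condition holds, and a conditional probability is that share computed within the subset of rows that meet the condition.

```python
import pandas as pd

# One row per character; values are invented for illustration.
toy = pd.DataFrame({
    'romance':   [True, False, True, False, False],
    'says_word': [True, True, False, False, True],
})

# P(romance): the mean of a boolean column is the fraction of True rows.
p_romance = toy['romance'].mean()

# P(romance | says_word): restrict to characters who say the word,
# then take the fraction of those who are in a romance.
p_romance_given_word = toy.loc[toy['says_word'], 'romance'].mean()

print(p_romance, p_romance_given_word)
```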

#### Part A

In [12]:
# part A

chars['romance'].value_counts() / len(chars)

False   0.76255
True    0.23745
Name: romance, dtype: float64

P(romance) is 0.23745.

#### Part B

In [13]:
lovechars = chars.loc[wordcounts['love'] > 1, : ]

In [14]:
lovechars['romance'].value_counts() / len(lovechars)

False   0.73785
True    0.26215
Name: romance, dtype: float64

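P(romance|says-love) is 0.26215. Note that the filter `wordcounts['love'] > 1` keeps characters who say "love" at least twice.

This is not a huge difference from the baseline of 0.23745, but it's intuitive that a character's chance of being in a romance is at least slightly higher if they say the word "love."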

If we wanted a sanity check, we could compare this to evidence from Assignment 1:

In [15]:
dunnings['love']

2.311183142577576
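
One way to read that number is to assume the usual approximation that G follows a chi-square distribution with one degree of freedom. That is an assumption on my part, not something we established in the lab. Under it, the upper-tail probability can be computed with the standard library alone; `chi2_sf_1df` is a hypothetical helper name.

```python
import math

def chi2_sf_1df(g):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Uses the identity P(Z**2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    The absolute value is taken because our G carries a sign for direction.
    """
    return math.erfc(math.sqrt(abs(g) / 2))

p_love = chi2_sf_1df(2.311183142577576)
print(round(p_love, 3))
```

Under that assumption, values of G below roughly 3.84 do not clear the conventional 0.05 threshold, so a positive but small score for "love" fits the modest difference in probabilities seen in Part B.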