# Homework 3: distinctive words

We calculated distinctive words in movie titles. But there aren't really very many words in movie titles!

It would be more fun to do the same thing with movie dialogue.

Fortunately we have a dataset available.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import math
pd.set_option('display.max_rows', 100)

In [2]:
dialogpath = Path('../data/movie_dialogue.tsv')

chars = pd.read_csv(dialogpath, sep = '\t')

# let's also randomize the row order
chars = chars.sample(frac = 1.0)

chars.head()

Unnamed: 0,mid,cid,cname,mname,gender,wordcount,year,genres,comedy,thriller,drama,romance,lines
955,m27,u441,DUNWITTY,bamboozled,m,1175,2000,"['comedy', 'drama', 'music']",True,False,True,False,"I want to meet her one day, please tell her th..."
1998,m465,u6953,EDIE,on the waterfront,f,1278,1954,"['crime', 'drama', 'romance']",False,False,True,True,They're waiting for him to walk in. / Terry......
1609,m394,u5951,HAWK,hudson hawk,m,1845,1991,"['action', 'adventure', 'comedy', 'action', 'a...",True,False,False,False,"Let's just forget it, I mean... / You're a ree..."
2865,m78,u1195,PREYSING,grand hotel,m,1335,1932,"['drama', 'romance']",False,False,True,True,Hello! Hello! -- / I'm going to call the polic...
2921,m9,u132,DAVE,the atomic submarine,m,258,1959,"['sci-fi', 'thriller']",False,True,False,False,"What's goin' on in here, Lad? What - ? / Oh it..."


## Assignment 1

Find the 25 words that most strongly characterize romance, and the 25 that most strongly characterize dialogue that is not-in-a-romance. 

1. In doing this, create a CountVectorizer that considers the top *5000* words, not just the top *100* we took for titles. You've got more words to work with now.

2. You can use the 'romance' column to divide rows.

3. Use the get_dunnings function we developed in the lab to measure Dunning's log-likelihood.

Finally report the 25 words at the top, and the bottom, of a list sorted by the Dunning's statistic.

In [3]:
vectorizer = CountVectorizer(max_features = 5000)
sparse_wordcounts = vectorizer.fit_transform(chars['lines'])
wordcounts = sparse_wordcounts.toarray()
features = vectorizer.get_feature_names()
wordcounts = pd.DataFrame(wordcounts, columns = vectorizer.get_feature_names())

In [4]:
romantic = wordcounts.loc[chars['romance'] == True, features].sum(axis = 'rows')
unromantic = wordcounts.loc[chars['romance'] != True, features].sum(axis = 'rows')

In [5]:
def get_dunnings(word, series1, series2):
    observed = pd.DataFrame({'series1': [series1[word], sum(series1) - series1[word]],
                          'series2': [series2[word], sum(series2) - series2[word]]},
                        index = [word, 'all_others'])
    total_words = observed.to_numpy().sum()
    observed['word_totals'] = observed.sum(axis = 1)
    observed = observed.append(observed.sum(axis = 0).rename(index = 'group_totals'))
    observed.iat[2,2] = 0
    observed['word_totals'] = observed['word_totals'] / sum(observed['word_totals'])
    observed.loc['group_totals', : ] = observed.loc['group_totals', : ] / sum(observed.loc['group_totals', : ])
    expected = np.outer(observed['word_totals'][0:2], observed.loc['group_totals', : ][0:2])
    expected = pd.DataFrame(expected, index = [word, 'all_others'], columns = ['series1', 'series2'])
    expected = expected * total_words
    
    G = 0
    for i in range(2):
        for j in range(2):
            O = observed.iat[i, j] + .000001
            E = expected.iat[i, j] + .000001
            G = G + O * math.log(O / E)
    
    if (observed.iat[0, 0] / sum(observed.iloc[0: 2, 0])) < (observed.iat[0, 1] / sum(observed.iloc[0 : 2, 1])):
        G = -G
    
    return 2 * G, observed, expected

In [6]:
dunningslist = []

for w in features:
    G, observed, expected = get_dunnings(w, romantic, unromantic)
    dunningslist.append(G)

In [7]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [8]:
dunnings = pd.Series(dunningslist, index = features)

In [9]:
dunnings = dunnings.sort_values()

In [10]:
dunnings[0: 25]

heh         -75.68132
yah         -70.82271
brad        -51.69719
hildy       -45.30152
walter      -41.39556
jim         -39.42503
spock       -37.90584
russia      -37.83995
hannah      -31.97733
beth        -29.64249
richard     -29.25958
ted         -28.97996
oswald      -28.77777
louise      -28.69289
lad         -28.18934
joe         -25.69103
saunders    -23.98290
alvy        -23.98290
marylin     -23.44994
dil         -23.44994
caitlin     -22.91698
norman      -22.89427
shelly      -22.46723
gallagher   -22.38402
john        -21.72625
dtype: float64

In [11]:
dunnings[-25 : ]

lenny       38.64834
frances     38.88159
lecter      40.43748
mulder      40.59826
paulie      41.73455
norma       42.74878
channing    43.72901
friedman    45.09753
hudson      45.16919
susie       46.30778
sailor      47.07013
dignan      47.97458
mcmurphy    49.89092
dude        51.89048
mister      52.50631
mulwray     57.29341
mama        58.29879
moraes      61.86898
stephen     64.66821
thelma      71.19726
viktor      78.55825
karl        80.80105
bela        85.04651
bob        110.25736
wyatt      116.14330
dtype: float64

I confess these are not very interpretable features! If you don't immediately see why one set of proper names is associated with romance, and another set isn't -- you're right to be unsure.

## Assignment 2

A. What's the probability $P(character-in-romance)$, for all characters in this dataset?

B. What's the conditional probability of a character occurring in a romance, given that the character speaks the word 'you'?

In other words, calculate

$P(character-in-romance \mid character-says-you)$

Note that both of these questions require a slightly different approach from the probability table we used to calculate Dunnings. The things being counted here are not words, but characters.

#### Part A

In [23]:
# part A

chars['romance'].value_counts() / len(chars)

False   0.76255
True    0.23745
Name: romance, dtype: float64

P(romance) is 0.23745.

#### Part B

In [24]:
youchars = chars.loc[wordcounts['you'] >= 1, : ]

In [25]:
youchars['romance'].value_counts() / len(youchars)

False   0.76240
True    0.23760
Name: romance, dtype: float64

P(romance|says-love) is 0.23760. The conditional probability is higher than the marginal.

In [21]:
lovechars = chars.loc[wordcounts['love'] >= 1, : ]

In [22]:
lovechars['romance'].value_counts() / len(lovechars)

False   0.76173
True    0.23827
Name: romance, dtype: float64

P(romance|says-love) is 0.23827.

This is not a huge difference, but it's intuitive that a character's chance of being in a romance is at least slightly higher if they say the word "love."
