# Graph Analysis
Create graphs from movie scripts and obtain statistics on the node centralities.

As input, this file expects data from the text-pipeline.

Outputs statistics, graphs and aditional files to use with R for additional tests.

Author: Victor R Martinez

Last Modified: Sept 18, 2017

In [1]:
%matplotlib inline
from utils import read, createGraph, readExtraInfo, readGenre, readBirthdays, getCharacterAges
from glob import iglob as glob
from os.path import exists, basename
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
import numpy as np
import scipy as sp
import warnings
import networkx as nx
import logging
from funcy import walk_values, partial
from scipy import stats
import itertools
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison

plt.rcParams["figure.figsize"] = (10, 8)

  from pandas.core import datetools


In [2]:
logging.basicConfig(format = "%(asctime)-15s %(message)s", level = logging.DEBUG)

# Config

* Threshold (_th_) is the minimum number of utterances a character needs to have in order to be considered as a node.

#### Outputs from text pipeline:
* _data_dir_ refers to the folder where the files with utterances and charnames are stored. 
* _info_dir_ has the files with character info
* _birthdays_f_ is the file that has the birthdays of actors

In [3]:
th = 2
data_dir = "../data/utterances_with_charnames/*"
info_dir = "../data/charandmovie_info/"
birthdays_f = "../data/age/actor_birthdays.txt"

What measures of centrality are we going to use? For a list of all centralities available, check: https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.centrality.html

In [4]:
centrality_measures = ['degree_centrality', 'betweenness_centrality']

Define a set of colors to use for the graphs.

In [5]:
xkcd_colors = sns.xkcd_palette(["grass green", "sand", "blue", "light red", "cerulean",
                                "red", "light blue", "teal", "orange", "light green",
                                "magenta", "yellow", "sky blue", "grey", "cobalt",
                                "grass", "algae green", "coral", "cerise", "steel",
                                "hot purple", "mango", "pale lime", "rouge"])
colors = itertools.cycle(xkcd_colors)

## Controling for multiple comparisons.

In order to control for multiple comparissons, we're going to use the [Holm-Bonferroni method](https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method). We use a in house implementation, defined as follows:

In [6]:
# Using Holm-Bonferroni method
def holmBonferroni(tests):
    res_2 = sorted(tests, key = lambda x: x[1].pvalue)
    m = len(res_2)

    k = 0
    while k < len(res_2) and res_2[k][1].pvalue < 0.05 / (m + 1 - k - 1):
        k += 1

    return res_2[:(k - 1)]

# Read all scripts
The following lines reads all files and creates a data structure holding the graph, character list, movie genres, character/actor ages and races, as well as year of release.

In [7]:
birthdays = readBirthdays(birthdays_f)

In [8]:
data = {}
for script in glob(data_dir):
    
    _, char_list, adj = read(script, threshold = th)

    extra_info = info_dir + basename(script)
    if exists(extra_info):
        genders, races, namesids, movieyear = readExtraInfo(extra_info)
        genres = readGenre(extra_info)
    else:
#         logging.warning("Info for {} not found".format(basename(script)))
#         gens = defaultdict(lambda: 'unknown')
#         races = defaultdict(lambda: 'unknown')  
        continue
    
    def splitRaces(x):
        r = x.split(",")
        if len(r) > 0:
            if len(r) > 1:
                return "mixed"
            else:
                return r[0]
    
    races = walk_values(splitRaces, races)
    ages = getCharacterAges(char_list, namesids, movieyear, birthdays)
        
        
    G = createGraph(char_list,
                    adj,
                    genders = genders,
                    races = races,
                    ages = ages)

    

    key = basename(script)

    data[key] = {}
    data[key]['graph'] = G
    data[key]['chars'] = char_list
    data[key]['genres'] = genres
    data[key]['ages'] = ages
    data[key]['races'] = races
    data[key]['year'] = movieyear
    

These variables helps us with the graphs later on.

First, we obtain the distribution of movies across genres (remember that a movie might belong to one or more genres). _types_ is a list of movie genres without repeats.

Second, we obtain the distribution of actor races in the movie dataset. _races_ is a list of races without repeats.

In [9]:
types = Counter([y for x in [d['genres'] for script, d in data.items()] for y in x])
print(types)
types = list(types.keys())

Counter({'Drama': 559, 'Thriller': 368, 'Comedy': 287, 'Action': 252, 'Crime': 242, 'Romance': 194, 'Adventure': 170, 'Sci-Fi': 156, 'Mystery': 145, 'Horror': 116, 'Fantasy': 115, 'Biography': 70, 'Family': 49, 'History': 34, 'War': 34, 'Sport': 32, 'Animation': 32, 'Music': 22, 'Musical': 19, 'Western': 17, 'Film-Noir': 5, 'Short': 5})


In [10]:
races = Counter([y for x in [list(d['races'].values()) for d in data.values()] for y in x])
print(races)
races = list(races.keys())

Counter({'unknown': 7893, 'caucasian': 6887, 'african': 618, 'mixed': 449, 'latino': 165, 'eastasian': 78, 'asianindian': 44, 'other': 25, 'nativeamerican': 15, 'pacificislander': 7, 'others': 2})


## Pre-check
In [GENDER BIAS WITHOUT BORDERS](http://seejane.org/wp-content/uploads/gender-bias-without-borders-executive-summary.pdf), there is a ratio of 2.25 men for every women on screen (women = $30.9$%). Lets check our numbers.

In [11]:
total, males, females = 0, 0, 0
for _, d in data.items():
    G = d['graph']
    
    for i in G.nodes():
        if G.node[i]['gender'] == 'male':
            males += 1
        elif G.node[i]['gender'] == 'female':
            females += 1
        
        total += 1
        
print("total: {}".format(total))
print("males: {:.2f}%".format(float(males) / total))
print("females: {:.2f}%".format(float(females) / total))


total: 15133
males: 0.55%
females: 0.21%


What if we drop unknown?... shouldn't matter right?

In [12]:
print("males: {:.2f}%".format(float(males) / (males + females)))
print("females: {:.2f}%".format(float(females) / (males + females)))

males: 0.72%
females: 0.28%


# Graph analysis

## Centrality Measurements

### Calculate and save centralities

In [13]:
for script, d in data.items():
    G = d['graph']
    
    for cent in centrality_measures:
        vals = nx.__getattribute__(cent)(G) #Obtain the function using reflection on nx
        nx.set_node_attributes(G, cent, vals) #Save it in the graph nodes
    

### Some examples
What are the most prominent women / most prominent men in movies?

In [14]:
def averageCents(node):
    """Calculates the mean of each centrality"""
    return np.mean([node[c] for c in centrality_measures])  
    
male_cents, female_cents = [], []
for script, d in data.items():
    G = d['graph']
    char_list = d['chars']
       
    male_cents.extend([(averageCents(G.node[i]), char_list[i], script) for i in G.nodes() if G.node[i]['gender'] == 'male'])
    female_cents.extend([(averageCents(G.node[i]), char_list[i], script) for i in G.nodes() if G.node[i]['gender'] == 'female'])

In [15]:
sorted(male_cents, key=lambda x: x[0], reverse=True)[0:10]

[(1.0, 'LT', 'bad_lieutenant.txt'),
 (1.0, 'SCOTTIE', 'vertigo.txt'),
 (0.91746031746031753, 'TONY', 'scarface.txt'),
 (0.86818181818181817, 'NICHOLAS', 'the_game.txt'),
 (0.8666666666666667, 'CHUCK', 'cast_away.txt'),
 (0.8660714285714286, 'ALEXANDER', 'the_time_machine.txt'),
 (0.86228070175438609, 'FORREST', 'forrest_gump.txt'),
 (0.85897435897435903, 'RODERICK', 'barry_lyndon.txt'),
 (0.84829059829059827, 'DANNY', 'salton_sea_the.txt'),
 (0.84761904761904761, 'BILLY', 'gremlins.txt')]

In [16]:
sorted(female_cents, key=lambda x: x[0], reverse=True)[0:10]

[(0.85119047619047628, 'OLIVE', 'easy_a.txt'),
 (0.8214285714285714, 'CORALINE', 'coraline.txt'),
 (0.81341991341991338, 'DOROTHY', 'wizard_of_oz_the.txt'),
 (0.79075943179204056, 'KATHY', 'whistleblower_the.txt'),
 (0.77412698412698422, 'SISSY', 'even_cowgirls_get_the_blues.txt'),
 (0.76930014430014437, 'MAYA', 'zero_dark_thirty.txt'),
 (0.75803811443932423, 'LISA', 'margaret.txt'),
 (0.75555555555555554, 'FRANCES', 'frances.txt'),
 (0.75555555555555554, 'THE BRIDE', 'kill_bill_volume_1_and_2.txt'),
 (0.75218253968253967, 'EMMA', 'no_strings_attached.txt')]

## By Gender
First, we describe the distribution of centralities across different genders:

In [17]:
for cent in centrality_measures:
    print(cent)
    print(stats.describe([vals for script, d in data.items() for vals in nx.get_node_attributes(d['graph'], cent).values()]))

degree_centrality
DescribeResult(nobs=15133, minmax=(0.0, 1.0), mean=0.39085344359028013, variance=0.060658846118447539, skewness=0.8413833310233242, kurtosis=-0.17330570803736434)
betweenness_centrality
DescribeResult(nobs=15133, minmax=(0.0, 1.0), mean=0.045594842347982223, variance=0.0087036555743516143, skewness=3.5561931822032755, kurtosis=15.501383735343612)


What are the mean values for each gender?
__FIXME:__ This should be medians

In [18]:
for cent in centrality_measures:
    male_cent, female_cent = [], []
    for script, d in data.items():
        G = d['graph']
        male_cent.extend([G.node[i][cent] for i in G.nodes() if G.node[i]['gender'] == 'male'])
        female_cent.extend([G.node[i][cent] for i in G.nodes() if G.node[i]['gender'] == 'female'])
    print(cent)
    print("#male: {}, #female: {}".format(len(male_cent), len(female_cent)))
    print("male's mean: {}, female's mean: {}".format(np.mean(male_cent), np.mean(female_cent)))

degree_centrality
#male: 8270, #female: 3168
male's mean: 0.439224457856615, female's mean: 0.4496618675033694
betweenness_centrality
#male: 8270, #female: 3168
male's mean: 0.0588726277976661, female's mean: 0.051286038407297764


We want to test if the median between male and female centrality differs significantly. We make use of Mann Whitney U-test and control for multiple comparissons (i.e., the different centralities) using Holm-Bonferroni method. When controling for FWER, we could not find any difference between degree centralities. Likewise, there were no significant differences in betweenness centrality.

In [19]:
tests = {}
for cent in centrality_measures:
    
    male_cent, female_cent = [], []
    
    for script, d in data.items():
        G = d['graph']    
        male_cent.extend([G.node[i][cent] for i in G.nodes() if G.node[i]['gender'] == 'male'])
        female_cent.extend([G.node[i][cent] for i in G.nodes() if G.node[i]['gender'] == 'female'])
    
    #Remove nans
    male_cent, female_cent = np.array(male_cent), np.array(female_cent)
    male_cent = male_cent[~np.isnan(male_cent)]
    female_cent = female_cent[~np.isnan(female_cent)]

    tests[cent] = stats.mannwhitneyu(male_cent, female_cent)

__ There were no significant differences in medians __

In [20]:
# Using Holm-Bonferroni method
holmBonferroni(tests.items())

[]

### Split by genre
What if we take into consideration the movie genre?

In [21]:
centr_byGenre = []
for script, d in data.items():
    G = d['graph']
    genres = d['genres']
        
    for centr_t in centrality_measures:
        centr_v = nx.get_node_attributes(G, centr_t)
        genders = nx.get_node_attributes(G, 'gender')
        
        for k in genders:
            if k in centr_v:
                centr_byGenre.extend([(t, centr_t, genders[k], centr_v[k]) for t in genres])
            else:
                centr_byGenre.extend([(t, centr_t, genders[k], np.nan) for t in genres])
                
centr_byGenre = pd.DataFrame(centr_byGenre, columns = ["genre", "centrality", "gender", "value"])
centr_byGenre.head()
    

Unnamed: 0,genre,centrality,gender,value
0,Comedy,degree_centrality,female,0.769231
1,Drama,degree_centrality,female,0.769231
2,Romance,degree_centrality,female,0.769231
3,Comedy,degree_centrality,male,0.230769
4,Drama,degree_centrality,male,0.230769


In [22]:
#Drop nan's
centr_byGenre = centr_byGenre[(centr_byGenre.genre!="Documentary")]
centr_byGenre = centr_byGenre[(centr_byGenre.genre!="Short")]
centr_byGenre = centr_byGenre[(centr_byGenre.genre!="Reality-TV")]

centr_byGenre = centr_byGenre[(centr_byGenre.gender!="unknown")]

centr_byGenre = centr_byGenre.dropna()

centr_byGenre.head()

Unnamed: 0,genre,centrality,gender,value
0,Comedy,degree_centrality,female,0.769231
1,Drama,degree_centrality,female,0.769231
2,Romance,degree_centrality,female,0.769231
3,Comedy,degree_centrality,male,0.230769
4,Drama,degree_centrality,male,0.230769


In [23]:
#TODO: Use a .loc instead of .ix
res = centr_byGenre.groupby(["centrality", "genre"]).apply(lambda x: stats.mannwhitneyu(x.ix[x['gender'] == 'male', 'value'].values,
                                                                     x.ix[x['gender'] == 'female', 'value'].values))

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  


__ There is a real difference in the centrality of characters in: __
* Degree of Horror movies (p < 0.001)

In [24]:
# Using Holm-Bonferroni's method
bygenre = res.unstack(level=0)
for centr_t in centrality_measures:
    
    tmp = zip(bygenre[centr_t].index, bygenre[centr_t].values)

    print(centr_t)
    for g, ttest in holmBonferroni(tmp):
        print(g, ttest)
    print()

degree_centrality
Horror MannwhitneyuResult(statistic=144206.0, pvalue=0.00098616602154669359)

betweenness_centrality



## By Race
Calculate statistics for centralities per race.

In [25]:
# Agg by race
centr_race_byGenre = []
for script, d in data.items():
    G = d['graph']
    genres = d['genres']
        
    for centr_t in centrality_measures:
        centr_v = nx.get_node_attributes(G, centr_t)
        genders = nx.get_node_attributes(G, 'gender')
        races = nx.get_node_attributes(G, 'race')
        
        for k in genders:
            if k in centr_v:
                centr_race_byGenre.extend([(t, centr_t, genders[k], races[k], centr_v[k]) for t in genres])
            else:
                centr_race_byGenre.extend([(t, centr_t, genders[k], races[k], np.nan) for t in genres])
                
centr_race_byGenre = pd.DataFrame(centr_race_byGenre, columns = ["genre", "centrality", "gender", "race", "value"])
centr_race_byGenre.head()

Unnamed: 0,genre,centrality,gender,race,value
0,Comedy,degree_centrality,female,caucasian,0.769231
1,Drama,degree_centrality,female,caucasian,0.769231
2,Romance,degree_centrality,female,caucasian,0.769231
3,Comedy,degree_centrality,male,unknown,0.230769
4,Drama,degree_centrality,male,unknown,0.230769


In [26]:
# Fix others -> other
# TODO: Use .loc instead of .ix
centr_race_byGenre.ix[centr_race_byGenre.race == "others", "race"] = "other"

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  This is separate from the ipykernel package so we can avoid doing imports until


In [27]:
centr_race_byGenre.groupby(["centrality", "race"]).agg([len, sp.median])

Unnamed: 0_level_0,Unnamed: 1_level_0,value,value
Unnamed: 0_level_1,Unnamed: 1_level_1,len,median
centrality,race,Unnamed: 2_level_2,Unnamed: 3_level_2
betweenness_centrality,african,1570.0,0.026181
betweenness_centrality,asianindian,102.0,0.029853
betweenness_centrality,caucasian,17878.0,0.031528
betweenness_centrality,eastasian,212.0,0.02374
betweenness_centrality,latino,412.0,0.02358
betweenness_centrality,mixed,1256.0,0.029609
betweenness_centrality,nativeamerican,53.0,0.006349
betweenness_centrality,other,53.0,0.022894
betweenness_centrality,pacificislander,11.0,0.088603
betweenness_centrality,unknown,23628.0,0.00262


### Kruskal-Wallis in R

This code comes from __CorrectedANOVA.R__ from this same folder. I would recomend you run that file in R instead.

For completeness, we present a running version of the code using __Rscript__

In [28]:
# Save to use in R
centr_race_byGenre.to_csv("../data/R/aggByRace.csv", index = False)

__This calls R, installs dependencies and prints output here__

In [31]:
!Rscript CorrectedANOVA.R "race" "../data/R/aggByRace.csv"

[1] "CorrectedANOVA.R"
[1] "CWD:  /Users/victor/Workspace/mica-text-charactergraphs/analysis"
[1] "Analyzing by:  race"
[1] "File:  ../data/R/aggByRace.csv"
Parsed with column specification:
cols(
  genre = col_character(),
  centrality = col_character(),
  gender = col_character(),
  race = col_character(),
  value = col_double()
)
[1] "Levene's test"
[1] "Degree: "
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     9   466.8 < 2.2e-16 ***
      45165                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] ""
[1] ""
[1] "Betweenness: "
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)    
group     9  588.24 < 2.2e-16 ***
      45165                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] ""
[1] ""

	Pairwise comparisons using Tukey and Kramer (Nemenyi) test	
                   with Tukey-Dist approximation for independe

## By Age

Calculate statistics for each age. Removes one guy/gal who had more than 100 years, as it is most likely an outlier.

In [32]:
for script, d in data.items():
    G = d['graph']
    char_list = d['chars']
    
    for n in G.nodes():
        if G.node[n]['age'] > 100:
            print(script, char_list[n], G.node[n]['age'])

In [33]:
# Agg by age

centr_age_byGenre = []
for script, d in data.items():
    G = d['graph']
    genres = d['genres']
        
    for centr_t in centrality_measures:
        centr_v = nx.get_node_attributes(G, centr_t)
        genders = nx.get_node_attributes(G, 'gender')
        ages = nx.get_node_attributes(G, 'age')
        
        for k in genders:
            if k in centr_v:
                centr_age_byGenre.extend([(t, centr_t, genders[k], ages[k], centr_v[k]) for t in genres])
            else:
                centr_age_byGenre.extend([(t, centr_t, genders[k], ages[k], np.nan) for t in genres])
                
centr_age_byGenre = pd.DataFrame(centr_age_byGenre, columns = ["genre", "centrality", "gender", "age", "value"])
centr_age_byGenre.head()

Unnamed: 0,genre,centrality,gender,age,value
0,Comedy,degree_centrality,female,18,0.769231
1,Drama,degree_centrality,female,18,0.769231
2,Romance,degree_centrality,female,18,0.769231
3,Comedy,degree_centrality,male,0,0.230769
4,Drama,degree_centrality,male,0,0.230769


In [34]:
# Drop the guy with more than 100 yrs
centr_age_byGenre = centr_age_byGenre[centr_age_byGenre.age < 100]

__ Analysis moved to R __

In [35]:
centr_age_byGenre.to_csv("../data/R/aggByAgeGender.csv", index=False)

__This calls R, installs dependencies and prints output here__

In [36]:
!Rscript CorrectedANOVA.R "age" "../data/R/aggByAgeGender.csv"

[1] "CorrectedANOVA.R"
[1] "CWD:  /Users/victor/Workspace/mica-text-charactergraphs/analysis"
[1] "Analyzing by:  age"
[1] "File:  ../data/R/aggByAgeGender.csv"
Parsed with column specification:
cols(
  genre = col_character(),
  centrality = col_character(),
  gender = col_character(),
  age = col_integer(),
  value = col_double()
)

Call:
lm(formula = value ~ age, data = degree)

Coefficients:
(Intercept)          age  
   0.314007     0.003371  


Call:
lm(formula = value ~ age, data = betweenness)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.10085 -0.04173 -0.02614 -0.00429  0.93123 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.671e-02  6.070e-04   44.00   <2e-16 ***
age         8.413e-04  1.928e-05   43.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09135 on 45173 degrees of freedom
Multiple R-squared:  0.04043,	Adjusted R-squared:  0.04041 
F-statistic:  1903 on 1