## Do movies with strong female roles make as much money as those with male leads?

### Links
* https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/
* http://bechdeltest.com/

### Table of Contents
* [Explore distributions](#1)
* [Analytics by test passed/failed](#2)
* [Top Movies](#3)
* [Other Explorations](#4)

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# stats
from statsmodels.graphics.mosaicplot import mosaic

In [None]:
# load and preview
df = pd.read_csv('../input/women-in-movies/movies.csv')
df.head()

<a id='1'></a>
# Explore distributions

In [None]:
# distribution over years
plt.figure(figsize=(12,4))
df.year.value_counts().plot(kind='bar')
plt.title('Movies by year')
plt.grid()
plt.show()

In [None]:
# test outcome (clean)
plt.figure(figsize=(12,4))
df.clean_test.value_counts().plot(kind='bar')
plt.title('clean_test')
plt.grid()
plt.show()

In [None]:
# also show the test categories before cleaning
df.test.value_counts()

In [None]:
# test passed / failed [binary]
plt.figure(figsize=(12,4))
df.binary.value_counts().plot(kind='bar')
plt.title('Test Pass / Fail [binary]')
plt.grid()
plt.show()

### Evaluate financial features

#### To compare old and new movies fairly, we work from here on with the inflation-indexed figures (2013 dollars) instead of the original values.
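
As a rough illustration of what indexing means: a nominal amount is scaled by the ratio of the price level in the base year to the price level in the release year. The sketch below uses made-up index values (`price_index` holds hypothetical numbers, not real CPI data) and a hypothetical helper `to_2013_dollars`. The dataset already ships the adjusted columns, so this only shows the idea and is not how the columns were actually computed.

```python
# Hypothetical price index values (NOT real CPI figures), for illustration only
price_index = {1990: 50.0, 2000: 75.0, 2013: 100.0}

def to_2013_dollars(amount, year, index=price_index, base_year=2013):
    """Scale a nominal amount from `year` to `base_year` price levels."""
    return amount * index[base_year] / index[year]

# With these toy values, prices doubled between 1990 and 2013,
# so a nominal 10 million from 1990 becomes 20 million in base-year dollars
adjusted = to_2013_dollars(10_000_000, 1990)
print(adjusted)
```

Amounts from the base year itself are left unchanged, since the ratio is 1.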

In [None]:
col_budget = 'budget_2013$'
col_domgross = 'domgross_2013$'
col_intgross = 'intgross_2013$'

# for a non-indexed perspective, use the commented-out lines below instead:
# col_budget = 'budget'
# col_domgross = 'domgross'
# col_intgross = 'intgross'

features_num = [col_budget, col_domgross, col_intgross]
features_num

In [None]:
# pairwise scatter plots + distributions
sns.pairplot(df[features_num + ['binary']], hue='binary',
             plot_kws={'alpha': 0.5})
plt.show()

In [None]:
# log plots
for f in features_num:
    np.log10(1+df[f]).plot(kind='hist', bins=25)
    plt.title('log_10(1+' + f + ')')
    plt.xlim(0,10) # same range for all plots
    plt.grid()
    plt.show()

<a id='2'></a>
# Analytics by test passed/failed

In [None]:
plt.figure(figsize=(14,6))
sns.violinplot(data=df, y='binary', x=col_budget)
plt.title('Budget by test passed/failed')
plt.grid()
plt.show()

In [None]:
plt.figure(figsize=(14,6))
sns.violinplot(data=df, y='binary', x=col_domgross)
plt.title('Domestic gross by test passed/failed')
plt.grid()
plt.show()

In [None]:
plt.figure(figsize=(14,6))
sns.violinplot(data=df, y='binary', x=col_intgross)
plt.title('International gross by test passed/failed')
plt.grid()
plt.show()

In [None]:
# basic stats - budget
print('BUDGET')
print('======')
print('FAIL:')
print(df[col_budget][df.binary=='FAIL'].describe())
print('PASS:')
print(df[col_budget][df.binary=='PASS'].describe())

In [None]:
# basic stats - international gross
print('INT.GROSS')
print('=========')
print('FAIL:')
print(df[col_intgross][df.binary=='FAIL'].describe())
print('PASS:')
print(df[col_intgross][df.binary=='PASS'].describe())

In [None]:
# basic stats - domestic gross
print('DOM.GROSS')
print('=========')
print('FAIL:')
print(df[col_domgross][df.binary=='FAIL'].describe())
print('PASS:')
print(df[col_domgross][df.binary=='PASS'].describe())

#### For all three financial perspectives the "FAIL" group shows considerably higher means and medians.
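
The same comparison can be condensed into one small table per column. The following is a minimal sketch on a made-up toy frame (`toy` and its values are placeholders, not the movie data); on the real data one would group `df` by `binary` in the same way.

```python
import pandas as pd

# Toy stand-in for the movie data (values are made up for illustration)
toy = pd.DataFrame({
    'binary': ['FAIL', 'FAIL', 'FAIL', 'PASS', 'PASS', 'PASS'],
    'budget_2013$': [10.0, 20.0, 60.0, 5.0, 10.0, 15.0],
})

# mean and median per group in a single table
summary = toy.groupby('binary')['budget_2013$'].agg(['mean', 'median'])

# FAIL / PASS ratio for each statistic
ratios = summary.loc['FAIL'] / summary.loc['PASS']
print(summary)
print(ratios)
```

Looking at the median ratio next to the mean ratio helps to judge how much a few very expensive movies drive the difference.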

In [None]:
# show mean ratios
print('All Years:\n')
for f in features_num:
    ratio_temp = df[f][df.binary=='FAIL'].mean() / df[f][df.binary=='PASS'].mean()
    print('Perspective', f, ': mean ratio =', np.round(ratio_temp,4))

#### Maybe the effect is driven by older movies. Let's look only at the most recent 10 years (2004-2013):

In [None]:
df_recent = df[df.year>=2004]

In [None]:

# show mean ratios (masks are built from df_recent itself so that they align with its rows)
print('2004 - 2013:\n')
for f in features_num:
    ratio_temp = df_recent[f][df_recent.binary=='FAIL'].mean() / df_recent[f][df_recent.binary=='PASS'].mean()
    print('Perspective', f, ': mean ratio =', np.round(ratio_temp,4))

#### Still a big difference... Let's now evaluate the distributions by year (again only for the most recent years):
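
A numeric companion to the yearly view could look like the sketch below: the median per year and test outcome, plus their ratio. The frame `toy` and the column name `FAIL_to_PASS` are made up for illustration; with the real data, `df_recent` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for df_recent (values are made up for illustration)
toy = pd.DataFrame({
    'year':   [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    'binary': ['FAIL', 'FAIL', 'PASS', 'PASS', 'FAIL', 'FAIL', 'PASS', 'PASS'],
    'budget_2013$': [40.0, 60.0, 20.0, 30.0, 80.0, 100.0, 30.0, 60.0],
})

# median per year and test outcome, one column per outcome
medians = toy.groupby(['year', 'binary'])['budget_2013$'].median().unstack('binary')

# ratio of the two medians for each year
medians['FAIL_to_PASS'] = medians['FAIL'] / medians['PASS']
print(medians)
```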

In [None]:
for f in features_num:
    plt.figure(figsize=(14,6))
    sns.boxplot(data=df_recent, x='year', y=f, hue='binary')
    plt.xticks(rotation=90)
    plt.title(f + ' by year')
    plt.grid()
    plt.show()

#### Remark: The 2009 outlier is the movie "Avatar".
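
One way to identify such an outlier is to look up the row holding the maximum value. The sketch below uses a made-up toy frame with placeholder titles; on the real data the same two lines would be applied to `df_recent` and one of the financial columns.

```python
import pandas as pd

# Toy stand-in with placeholder titles (values are made up for illustration)
toy = pd.DataFrame({
    'year':  [2008, 2009, 2009, 2010],
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'budget_2013$': [100.0, 450.0, 120.0, 90.0],
})

# row label of the largest value, then the identifying columns of that row
idx_max = toy['budget_2013$'].idxmax()
outlier = toy.loc[idx_max, ['year', 'title']]
print(outlier)
```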

### Development of pass/fail over time

In [None]:
# plot development of passed/failed over time using mosaic plot
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
mosaic(data=df.sort_values('year'), 
       index=['year', 'binary'],
       title='Test passed/failed by year',
       label_rotation=90)
plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

#### Zoom in on older years:

In [None]:
# plot development of passed/failed over time using mosaic plot - zoom
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
mosaic(data=df[df.year<=1980].sort_values('year'),
       index=['year', 'binary'], 
       title='Test passed/failed by year [1970-1980]',
       label_rotation=90)
plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

In [None]:
# plot development of passed/failed over time using mosaic plot - zoom
plt_para_save = plt.rcParams['figure.figsize'] # remember plot settings

plt.rcParams['figure.figsize'] = (16,6) # increase plot size for mosaics
mosaic(data=df[(df.year>=1981) & (df.year<=1995)].sort_values('year'),
       index=['year', 'binary'],
       title='Test passed/failed by year [1981-1995]',
       label_rotation=90)
plt.show()
    
# reset plot size again
plt.rcParams['figure.figsize'] = plt_para_save

In [None]:
# show corresponding table
df_pass_dev = pd.crosstab(df.year, df.binary)
df_pass_dev['n'] = df_pass_dev.FAIL + df_pass_dev.PASS
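# scale to percent first, then round (avoids floating point artifacts such as 56.00000000000001)
df_pass_dev['FAIL_perc'] = np.round(100 * df_pass_dev.FAIL / df_pass_dev.n, 2)
df_pass_dev['PASS_perc'] = np.round(100 * df_pass_dev.PASS / df_pass_dev.n, 2)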
df_pass_dev

In [None]:
# plot development of "passed" percentage
plt.figure(figsize=(14,5))
plt.scatter(df_pass_dev.index, df_pass_dev.PASS_perc)
plt.title('Percentage "passed" per year')
plt.grid()
plt.show()

#### For 1970 we have only one movie, so the leftmost point is best ignored.
#### Let's make another plot starting in 1984; from then on we have at least 10 observations for each year:
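
A cutoff year of this kind can also be derived from the yearly counts instead of being read off the table. The sketch below works on made-up counts; `n_per_year` plays the role of `df_pass_dev['n']`, and `min_obs`, `thin_years` and `start_year` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy yearly movie counts (made up); index = year, values = number of movies
n_per_year = pd.Series([1, 12, 4, 10, 15, 11],
                       index=[1970, 1971, 1972, 1973, 1974, 1975])
min_obs = 10  # minimum number of observations required per year

# years that fall short of the threshold
thin_years = n_per_year[n_per_year < min_obs].index

if len(thin_years) == 0:
    # every year already has enough observations
    start_year = n_per_year.index.min()
else:
    # first year after the last thin year
    # (assumes the last year in the series is not itself thin)
    later_years = n_per_year.index[n_per_year.index > thin_years.max()]
    start_year = later_years.min()

print(start_year)
```

In the toy series, 1971 has enough movies but 1972 does not, so the cutoff lands on 1973, the first year after which no year drops below the threshold.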

In [None]:
# plot development of "passed" percentage; year >= 1984
df_pass_dev_temp = df_pass_dev[df_pass_dev.index>=1984]
plt.figure(figsize=(14,5))
plt.scatter(df_pass_dev_temp.index, df_pass_dev_temp.PASS_perc)
# add regression line
m, b = np.polyfit(df_pass_dev_temp.index, df_pass_dev_temp.PASS_perc, 1)
plt.plot(df_pass_dev_temp.index, m*df_pass_dev_temp.index + b, c='darkred')
plt.title('Percentage "passed" per year [starting 1984 => at least 10 observations for each year]')
plt.grid()
plt.show()

#### The "pass" percentage increases over time.
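
The size of this trend can be read from the slope of the fitted line: with a degree-1 `np.polyfit`, the first coefficient is the change in percentage points per year. The sketch below demonstrates this on made-up, perfectly linear data (`years`, `pass_perc` and `change_per_decade` are illustrative names), so it says nothing about the actual slope in the movie data.

```python
import numpy as np

# Toy pass percentages rising by exactly 0.5 points per year (made up)
years = np.arange(1984, 1994)
pass_perc = 40.0 + 0.5 * (years - 1984)

# degree-1 fit: m is the slope in percentage points per year, b the intercept
m, b = np.polyfit(years, pass_perc, 1)

# the same trend expressed per decade
change_per_decade = 10 * m
print(round(m, 3), round(change_per_decade, 2))
```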

<a id='3'></a>
# Top Movies

In [None]:
n_top = 20
show_cols = ['year', 'title' , 'binary'] + features_num

### By Budget

In [None]:
df_top_b = df.nlargest(columns=[col_budget], n=n_top)
df_top_b[show_cols]

In [None]:
# plot fail/pass distribution
df_top_b.binary.value_counts().plot(kind='bar')
plt.title('Top Movies by budget')
plt.grid()
plt.show()

### By international gross

In [None]:
df_top_i = df.nlargest(columns=[col_intgross], n=n_top)
df_top_i[show_cols]

In [None]:
# plot fail/pass distribution
df_top_i.binary.value_counts().plot(kind='bar')
plt.title('Top Movies by international gross')
plt.grid()
plt.show()

### By domestic gross

In [None]:
df_top_d = df.nlargest(columns=[col_domgross], n=n_top)
df_top_d[show_cols]

In [None]:
# plot fail/pass distribution
df_top_d.binary.value_counts().plot(kind='bar')
plt.title('Top Movies by domestic gross')
plt.grid()
plt.show()

#### Regardless of which financial perspective we choose, the top movies segment contains considerably more movies that fail the Bechdel test than movies that pass it. The picture improves when using the non-indexed figures (as older movies then have much less impact).
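
To put a number on this, the share of failing movies among the top n can be computed per financial column. The sketch below uses a made-up toy frame; `toy`, `n_top_toy` and `fail_share` are hypothetical names, and on the real data `df`, `n_top` and `features_num` would take their place.

```python
import pandas as pd

# Toy stand-in for the movie data (values are made up for illustration)
toy = pd.DataFrame({
    'binary': ['FAIL', 'PASS', 'FAIL', 'FAIL', 'PASS', 'PASS'],
    'budget_2013$':   [90.0, 80.0, 70.0, 60.0, 10.0, 5.0],
    'intgross_2013$': [300.0, 260.0, 250.0, 100.0, 270.0, 20.0],
})
n_top_toy = 4

# share of FAIL among the top n movies, per financial column
fail_share = {}
for col in ['budget_2013$', 'intgross_2013$']:
    top = toy.nlargest(n_top_toy, col)
    fail_share[col] = (top['binary'] == 'FAIL').mean()

print(fail_share)
```

Comparing these shares with the overall share of failing movies shows whether the top segment is more skewed than the full dataset.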

<a id='4'></a>
# Other Explorations

### Financial figures by year

In [None]:
for f in features_num:
    plt.figure(figsize=(14,6))
    sns.boxplot(data=df, x='year', y=f)
    plt.xticks(rotation=90)
    plt.title(f + ' by year')
    plt.grid()
    plt.show()