![title](https://user-images.githubusercontent.com/30674288/36270852-b5fd146a-1231-11e8-9432-a4b0bcbe574c.png)

## High-Level Topics Covered
* Bokeh
* API Calls
* Pandas Dataframe manipulation
* Speed measurement/optimization
* Gender_guesser
* Bechdel test
* Feminism
* Machine learning models

## Female Representation - The Bechdel Test
1. https://bechdeltest.com/
1. The Bechdel test measures female representation in movies by assigning each movie one of four scores (0 to 3), one point for each of the following criteria it meets, in order:
    1. It has to have at least two women in it 
    1. ... who talk to each other 
    1. ... about something besides a man.
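
The scoring can be read as a short cascade of checks. Here is a minimal sketch, assuming three hypothetical boolean inputs that a human reviewer would supply (the function and parameter names are illustrative and not part of bechdeltest.com):

```python
def bechdel_score(two_women: bool, talk_to_each_other: bool, not_about_a_man: bool) -> int:
    """Return a Bechdel score from 0 (no criterion met) to 3 (passes the test).

    Each criterion only counts if the previous one is met.
    """
    if not two_women:
        return 0
    if not talk_to_each_other:
        return 1
    if not not_about_a_man:
        return 2
    return 3


# Two women who talk to each other, but only about a man.
print(bechdel_score(True, True, False))
```

Because the criteria are nested, a movie cannot earn a later point without the earlier ones, which is why a single integer is enough to describe the result.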

![Chart Image](https://user-images.githubusercontent.com/30674288/36269933-5101af50-122f-11e8-8b86-c71ce6050c40.png)

![Chart Image](https://user-images.githubusercontent.com/30674288/36269934-5122a354-122f-11e8-86d1-17ef5424621a.png)

![Chart Image](https://user-images.githubusercontent.com/30674288/36269935-513ae054-122f-11e8-9115-f20be8ad235f.png)

![Chart Image](https://user-images.githubusercontent.com/30674288/36269936-51531066-122f-11e8-96b5-0db838b25ac5.png)

## The Problem
1. The Bechdel test is labor-intensive, as it requires watching an entire movie before a score can be assigned.
1. There are only 7,000 movies in the Bechdel test database. Compared to the IMDb Database, which has over 200,000 movies, this is very small.
1. Can I build a machine learning model that will classify movies based on their Bechdel test scores and apply that model to every movie in the IMDb dataset?

## Table of Contents<a id='back to top'></a>

1. [Libraries](#link1)
1. [Load Data](#link2)
    1. [IMDb Data](#link2a)
    1. [Bechdel Test API Call](#link2b)
1. [Merge Data](#link3)
    1. [Entity Relationship (ER) Diagram](#link3a)
    1. [Merge Name-Related Tables](#link3b)
    1. [Merge Title-Related Tables](#link3c)
    1. [Apply Gender Detector](#link3d)
    1. [Missing Values](#link3e)
    1. [Create Dummies](#link3f)
    1. [Merge Bechdel Test Table](#link3g)
1. [Start Here (with data on local machine)](#link4)
1. [Exploratory Data Analysis (EDA)](#link5)
    1. [Dataset Overview](#link5a)
    1. [Bechdel Test Scores](#link5b)
    1. [IMDb Rating](#link5c)
    1. [Regional Map](#link5d)
    1. [Crew](#link5e)
1. [Classification and Modeling](#link6)
    1. [Feature Selection](#link6a)
    1. [Support Vector Machine (SVM)](#link6b)
    1. [Logistic Regression](#link6c)
    1. [K-Nearest-Neighbors](#link6d)
    1. [Random Forest](#link6e)
    1. [Model Application to IMDb Data](#link6f)
1. [Outcomes](#link7)
    1. [Predicted Dataset Overview](#link7a)
    1. [Key Directors and Actors](#link7b)
    1. [Regional Map](#link7c)
1. [Takeaways / Conclusions](#link8)
    1. [What is the Point?](#link8a)
    1. [Limitation](#link8b)
    1. [Review of Concepts Covered](#link8c)
    1. [Next Steps to Expand This Analysis](#link8d)
    
    

IMDb dataset documentation: http://www.imdb.com/interfaces/

You can view the notebook here:
https://nbviewer.jupyter.org/



# Libraries<a id='link1'></a>

[Back to Top](#back to top)

In [2]:
# install the libraries in the requirements.txt file to easily reproduce this notebook
#
# !pip install -r requirements.txt

In [3]:
# These are all the libraries we'll be using in this notebook.
import pandas as pd
import numpy as np

import requests
import json

from bokeh.io import show,output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource,HoverTool,CategoricalColorMapper
from bokeh.layouts import row,column,gridplot
from bokeh.models.widgets import Tabs,Panel
from bokeh.palettes import Spectral5, Spectral6
from bokeh.transform import factor_cmap
output_notebook()


from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

from sklearn.svm import SVC

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

from sklearn import neighbors

import sklearn.ensemble as ske

from sklearn.metrics import accuracy_score, make_scorer

# Load Data <a id='link2'></a> 

[Back to Top](#back to top)

This notebook loads several tab-separated values (TSV) files from the IMDb website, plus the results of a Bechdel test API call. The fields in each source are described below.

IMDb data can be found here: http://www.imdb.com/interfaces/

##### title.basics.tsv.gz - Contains the following information for titles:
* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title.
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year.
* endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

##### title.crew.tsv.gz – Contains the director and writer information for all the titles in IMDb. Fields include:
* tconst (string)
* directors (array of nconsts) - director(s) of the given title
* writers (array of nconsts) – writer(s) of the given title

##### title.episode.tsv.gz – Contains the tv episode information. Fields include:
* tconst (string) - alphanumeric identifier of episode
* parentTconst (string) - alphanumeric identifier of the parent TV Series
* seasonNumber (integer) – season number the episode belongs to
* episodeNumber (integer) – episode number of the tconst in the TV series.

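##### title.principals.tsv.gz – Contains the principal cast/crew for titles
* tconst (string)
* ordering (integer) – position of the person within the title's credits (1 = top billed)
* nconst (string) – alphanumeric unique identifier of the name/person
* category (string) – the category of job that person was in
* job (string) – the specific job title if applicable, else ‘\N’
* characters (string) – the name of the character played if applicable, else ‘\N’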

##### title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles
* tconst (string)
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received

##### name.basics.tsv.gz – Contains the following information for names:
* nconst (string) - alphanumeric unique identifier of the name/person
* primaryName (string)– name by which the person is most often credited
* birthYear – in YYYY format
* deathYear – in YYYY format if applicable, else ‘\N’
* primaryProfession (array of strings)– the top-3 professions of the person
* knownForTitles (array of tconsts) – titles the person is known for

##### Bechdel Test Scores
* rating - The actual score. Number from 0 to 3 (0 means no two women, 1 means no talking, 2 means talking about a man, 3 means it passes the test).
* imdbid - The IMDb id.
* id - The bechdeltest.com unique id.
* title - The title of the movie. Any weird characters are HTML encoded (so Brüno is returned as "Br&uuml;no").
* year - The year this movie was released (according to IMDb).

## IMDb Datasets<a id='link2a'></a>

In [4]:
# load title.ratings.tsv
rating_path = "https://datasets.imdbws.com/title.ratings.tsv.gz"
ratings = pd.read_csv(rating_path, sep='\t')

In [5]:
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.8,1350
1,tt0000002,6.5,157
2,tt0000003,6.6,933
3,tt0000004,6.4,93
4,tt0000005,6.2,1620


In [6]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 805224 entries, 0 to 805223
Data columns (total 3 columns):
tconst           805224 non-null object
averageRating    805224 non-null float64
numVotes         805224 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 18.4+ MB


In [7]:
# title.basics.tsv

title_basics_path = "https://datasets.imdbws.com/title.basics.tsv.gz"
title_basics = pd.read_csv(title_basics_path, sep='\t')



In [8]:
title_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short


In [9]:
title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809400 entries, 0 to 4809399
Data columns (total 9 columns):
tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           int64
startYear         object
endYear           object
runtimeMinutes    object
genres            object
dtypes: int64(1), object(8)
memory usage: 330.2+ MB
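
Columns such as `startYear` and `runtimeMinutes` come back as `object` because IMDb marks missing values with the literal string `\N`. One possible way to handle this at load time is the `na_values` parameter of `pd.read_csv`. The sketch below uses a tiny in-memory sample (made-up rows, not the real file) so it runs without downloading anything:

```python
import io

import pandas as pd

# Toy stand-in for a few rows of title.basics.tsv (tab-separated, "\N" = missing).
sample_tsv = (
    "tconst\tstartYear\truntimeMinutes\n"
    "tt0000001\t1894\t1\n"
    "tt0000004\t1892\t\\N\n"
)

# na_values tells pandas to read the literal string \N as a missing value.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(df.dtypes)
print(df["runtimeMinutes"].isna().sum())
```

With the marker recognized, numeric columns are parsed as numbers and `isnull()` reports the gaps. The notebook itself does not do this, so its later null counts only reflect values that are missing after the merges.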


In [10]:
# title.crew.tsv

title_crew_path = "https://datasets.imdbws.com/title.crew.tsv.gz"
title_crew = pd.read_csv(title_crew_path, sep='\t')

In [11]:
title_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


In [12]:
title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809400 entries, 0 to 4809399
Data columns (total 3 columns):
tconst       object
directors    object
writers      object
dtypes: object(3)
memory usage: 110.1+ MB


In [13]:
# name.basics.tsv

name_basics_path = 'https://datasets.imdbws.com/name.basics.tsv.gz'
name_basics = pd.read_csv(name_basics_path, sep= '\t')

In [14]:
name_basics.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0043044,tt0072308,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0040506,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0049189,tt0063715,tt0059956,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0077975,tt0080455,tt0078723,tt0072562"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0050976,tt0083922,tt0060827"


In [15]:
name_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8427042 entries, 0 to 8427041
Data columns (total 6 columns):
nconst               object
primaryName          object
birthYear            object
deathYear            object
primaryProfession    object
knownForTitles       object
dtypes: object(6)
memory usage: 385.8+ MB


In [16]:
# title.akas.tsv

title_akas_path = 'https://datasets.imdbws.com/title.akas.tsv.gz'
title_akas = pd.read_csv(title_akas_path, sep= '\t')



In [17]:
title_akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
1,tt0000001,2,Карменсита,RU,\N,\N,\N,0
2,tt0000001,3,Carmencita,US,\N,\N,\N,0
3,tt0000001,4,Carmencita,\N,\N,original,\N,1
4,tt0000002,1,Le clown et ses chiens,\N,\N,original,\N,1


In [18]:
title_akas.sample(5)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
3085852,tt6885848,1,Záletí,CZ,\N,\N,\N,0
1283524,tt0843746,1,Team Holby,GB,\N,\N,\N,0
1768407,tt1613050,1,Radio,US,\N,\N,\N,0
1816494,tt1686022,2,Einai apla i arhi,GR,\N,festival,\N,0
1537046,tt1258701,3,Falsely Accused,GB,\N,alternative,\N,0


In [19]:
# title.principals.tsv

title_principals_path = 'https://datasets.imdbws.com/title.principals.tsv.gz'
title_principals = pd.read_csv(title_principals_path, sep= '\t')

In [20]:
title_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Herself""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


In [21]:
title_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27058256 entries, 0 to 27058255
Data columns (total 6 columns):
tconst        object
ordering      int64
nconst        object
category      object
job           object
characters    object
dtypes: int64(1), object(5)
memory usage: 1.2+ GB


In [22]:
# title.episode.tsv

title_episode_path = 'https://datasets.imdbws.com/title.episode.tsv.gz'
title_episode = pd.read_csv(title_episode_path, sep='\t')

In [23]:
title_episode.head()

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0041951,tt0041038,1,9
1,tt0042816,tt0989125,1,17
2,tt0042889,tt0989125,\N,\N
3,tt0043426,tt0040051,3,42
4,tt0043631,tt0989125,2,16


In [24]:
title_episode.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3206390 entries, 0 to 3206389
Data columns (total 4 columns):
tconst           object
parentTconst     object
seasonNumber     object
episodeNumber    object
dtypes: object(4)
memory usage: 97.9+ MB


## Bechdel Test API Call<a id='link2b'></a>

Resources I used to learn how to connect to the API:
* https://www.dataquest.io/blog/python-api-tutorial/
* https://github.com/noahgift/web_scraping_python/blob/master/notebooks/Web_Scraping_in_Python_for_AI_Fun_Profit_(PART%201).ipynb

This is the Bechdel Test API documentation:
* https://bechdeltest.com/api/v1/doc

This is the general form of a Bechdel test API call:
* http://bechdeltest.com/api/v1/method?params

This is my more in-depth documentation on how I connected to the API and created this dataframe:
* https://github.com/svirshup/IMDb-analysis/blob/master/Bechdel%20API%20connection.ipynb

In [25]:
import requests
import json
import pandas as pd

# Let's call the method that returns all movies at once
response_all = requests.get("http://bechdeltest.com/api/v1/getAllMovies")

# Let's get an object we can use
x = response_all.json()

# Turn the API response into a dataframe
bechdel = pd.DataFrame(x)

# add a 'tt' to the beginning of the bechdel imdbid field so it matches the tconst fields in other tables
bechdel['imdbid'] = 'tt' + bechdel['imdbid'].astype(str)

# Drop unnecessary columns
columns_to_drop = ['id','year']
bechdel = bechdel.drop(columns_to_drop, axis=1)

# Rename rating to bechdelScore
bechdel = bechdel.rename(index=str, columns={"rating": "bechdelScore"})


bechdel.head()

Unnamed: 0,imdbid,bechdelScore,title
0,tt0000003,0,Pauvre Pierrot
1,tt0132134,0,"Execution of Mary, Queen of Scots, The"
2,tt0000014,0,Tables Turned on the Gardener
3,tt0000012,0,"Arrival of a Train, The"
4,tt0000131,0,Une nuit terrible
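
The reshaping step can also be tried without a network connection. The sketch below is an assumption-laden illustration: `parse_bechdel` is a hypothetical helper, and the sample payload is hand-written to match the field names listed under "Bechdel Test Scores" rather than copied from a real API response.

```python
import json


def parse_bechdel(payload: str) -> list:
    """Convert a getAllMovies-style JSON payload into rows keyed by tconst."""
    rows = []
    for movie in json.loads(payload):
        rows.append({
            # Prefix with "tt" so the id lines up with IMDb's tconst column.
            "tconst": "tt" + str(movie["imdbid"]),
            "bechdelScore": int(movie["rating"]),
            "title": movie["title"],
        })
    return rows


# Hand-written sample shaped like the documented fields (not real API output).
sample = '[{"id": 1, "imdbid": "0000003", "rating": 0, "title": "Pauvre Pierrot", "year": 1892}]'
print(parse_bechdel(sample))
```

Keeping the parsing separate from the `requests.get` call makes it easier to test the id handling on a small sample before pulling the whole database.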


# Merge Data<a id='link3'></a>

[Back to Top](#back to top)

Here's a link to some documentation on how to perform merges/joins in python:
* https://pandas.pydata.org/pandas-docs/stable/merging.html
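
As a quick illustration of the join types used in this section, here is a small sketch on made-up frames. The column names mirror the IMDb tables, but the rows and the `*_toy` variable names are invented for the example.

```python
import pandas as pd

# Toy frames; the rows are made up.
titles_toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                           "primaryTitle": ["A", "B", "C"]})
ratings_toy = pd.DataFrame({"tconst": ["tt1", "tt3"],
                            "averageRating": [5.8, 6.6]})

# Inner join (the pd.merge default): only titles that have a rating survive.
inner = pd.merge(titles_toy, ratings_toy, on="tconst")

# Left join: every title is kept; unmatched ratings become NaN.
# indicator=True adds a _merge column showing where each row came from.
left = pd.merge(titles_toy, ratings_toy, on="tconst", how="left", indicator=True)

print(len(inner), len(left))
print(left["_merge"].value_counts())
```

The `_merge` column is a convenient way to check how many rows failed to match before dropping or filling them.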

## Entity Relationship (ER) Diagram<a id='link3a'></a>

![Chart Image](https://user-images.githubusercontent.com/30674288/36070466-00569dd6-0eb0-11e8-867f-ea7ed30dc553.png)

## Merge Name-Related Tables<a id='link3b'></a>

In [26]:
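# Join the name basics table and the crew table in order to match names to ids
# Note: directors can hold several comma-separated nconsts, so only titles with a single director find a match here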
names_crew = pd.merge(title_crew, name_basics, left_on='directors', right_on='nconst', how='left')

# Now there's a ton of extra columns here. Let's drop the unnecessary ones
columns_to_drop = ['directors','writers','birthYear','deathYear','knownForTitles','primaryProfession']
names_crew = names_crew.drop(columns_to_drop, axis=1)

# Rename nconst to Director_ID
names_crew = names_crew.rename(index=str, columns={"nconst": "Director_ID"})

# Rename primaryName to Director
names_crew = names_crew.rename(index=str, columns={"primaryName": "Director"})

names_crew.head()

Unnamed: 0,tconst,Director_ID,Director
0,tt0000001,nm0005690,William K.L. Dickson
1,tt0000002,nm0721526,Émile Reynaud
2,tt0000003,nm0721526,Émile Reynaud
3,tt0000004,nm0721526,Émile Reynaud
4,tt0000005,nm0005690,William K.L. Dickson
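
Since `directors` is described as an array of nconsts, a title with several directors will not match any single `nconst` in the join above. One possible workaround, which is not what this notebook does, is to split the field and give each director a separate row before merging. The frames below are made up for illustration:

```python
import pandas as pd

# Made-up rows: the second title lists two directors in one comma-separated field.
crew_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "directors": ["nm1", "nm2,nm3"],
})
names_toy = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["Ann", "Bob", "Cy"],
})

# Split the field into lists, then explode so each director gets a separate row.
crew_long = crew_toy.assign(directors=crew_toy["directors"].str.split(",")).explode("directors")

matched = pd.merge(crew_long, names_toy, left_on="directors", right_on="nconst", how="left")
print(matched[["tconst", "primaryName"]])
```

The trade-off is that a title can then appear on more than one row, so any later merge on `tconst` would need to account for that.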


In [27]:
# At the last minute, the title_principals dataset changed format.
# We need to make some subsets to account for this change

# top billed means ordering = 1
# 2nd billed means ordering = 2
# 3rd billed means ordering = 3
# etc.

topBilled_subset = title_principals[title_principals['ordering'] == 1]
topBilled_subset = topBilled_subset.rename(index=str, columns={"nconst": "topBilled_ID"})

Billed2_subset = title_principals[title_principals['ordering'] == 2]
Billed2_subset = Billed2_subset.rename(index=str, columns={"nconst": "Billed2nd_ID"})

Billed3_subset = title_principals[title_principals['ordering'] == 3]
Billed3_subset = Billed3_subset.rename(index=str, columns={"nconst": "Billed3rd_ID"})

In [28]:
name_basics.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0043044,tt0072308,tt0053137"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0040506,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0049189,tt0063715,tt0059956,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0077975,tt0080455,tt0078723,tt0072562"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0050976,tt0083922,tt0060827"


In [29]:
# For each subset, merge with the names basics table to find the actual names

topBilled_subset = pd.merge(topBilled_subset, name_basics, left_on='topBilled_ID', right_on='nconst', how='left')

Billed2_subset = pd.merge(Billed2_subset, name_basics, left_on='Billed2nd_ID', right_on='nconst', how='left')

Billed3_subset = pd.merge(Billed3_subset, name_basics, left_on='Billed3rd_ID', right_on='nconst', how='left')

In [30]:
# rename the name columns with names that distinguish them

topBilled_subset = topBilled_subset.rename(index=str, columns={"primaryName": "topBilled"})

Billed2_subset = Billed2_subset.rename(index=str, columns={"primaryName": "TopBilled2nd"})

Billed3_subset = Billed3_subset.rename(index=str, columns={"primaryName": "TopBilled3rd"})

In [31]:
drop_columns = ['birthYear','deathYear', 'primaryProfession','knownForTitles']

topBilled_subset = topBilled_subset.drop(drop_columns, axis=1)

Billed2_subset = Billed2_subset.drop(drop_columns, axis=1)

Billed3_subset = Billed3_subset.drop(drop_columns, axis=1)

In [32]:
# Join the new subsets of title_principals with each other
# top billed + billed 2
names_billed = pd.merge(topBilled_subset, Billed2_subset, on='tconst', how='outer')

# names_billed + billed 3
names_billed = pd.merge(names_billed, Billed3_subset, on='tconst', how='outer')

In [33]:
# Now there's a ton of extra columns here. Let's drop the unnecessary ones
columns_to_drop = ['ordering_x','category_x','job_x','characters_x','nconst_x',
                   'ordering_y','category_y','job_y','characters_y','nconst_y',
                   'ordering','category','job','characters','nconst',
                  'topBilled_ID','Billed2nd_ID','Billed3rd_ID']

names_billed = names_billed.drop(columns_to_drop, axis=1)

names_billed.head()

Unnamed: 0,tconst,topBilled,TopBilled2nd,TopBilled3rd
0,tt0000001,Carmencita,William K.L. Dickson,William Heise
1,tt0000002,Émile Reynaud,Gaston Paulin,
2,tt0000003,Émile Reynaud,Julien Pappé,Gaston Paulin
3,tt0000004,Émile Reynaud,Gaston Paulin,
4,tt0000005,Charles Kayser,John Ott,William K.L. Dickson
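
The three subsets and two outer merges above could also be written as a single pivot. This is a sketch of that alternative on made-up rows, assuming each `(tconst, ordering)` pair appears at most once; it is not the approach used in the notebook.

```python
import pandas as pd

# Made-up principals rows; column names follow title.principals.
principals_toy = pd.DataFrame({
    "tconst":   ["tt1", "tt1", "tt1", "tt2", "tt2"],
    "ordering": [1, 2, 3, 1, 2],
    "nconst":   ["nm1", "nm2", "nm3", "nm4", "nm5"],
})

top3 = principals_toy[principals_toy["ordering"] <= 3]

# One row per title, one column per billing position; missing positions become NaN.
wide = top3.pivot(index="tconst", columns="ordering", values="nconst")
wide = wide.rename(columns={1: "topBilled_ID", 2: "Billed2nd_ID", 3: "Billed3rd_ID"}).reset_index()
print(wide)
```

A pivot avoids the `_x`/`_y` suffix columns that the chained merges produce, at the cost of raising an error if a title has duplicate `ordering` values.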


## Merge Title-Related Tables<a id='link3c'></a>

In [34]:
# Continue to pull it all together within IMDb

# title_basics + ratings
titles = pd.merge(title_basics, ratings, on='tconst')

# movies (from above) + crew
titles = pd.merge(titles, names_crew, on='tconst')

# movies (from above) + principals
titles = pd.merge(titles, names_billed, on='tconst')

titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,Director_ID,Director,topBilled,TopBilled2nd,TopBilled3rd
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.8,1350,nm0005690,William K.L. Dickson,Carmencita,William K.L. Dickson,William Heise
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",6.5,157,nm0721526,Émile Reynaud,Émile Reynaud,Gaston Paulin,
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",6.6,933,nm0721526,Émile Reynaud,Émile Reynaud,Julien Pappé,Gaston Paulin
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short",6.4,93,nm0721526,Émile Reynaud,Émile Reynaud,Gaston Paulin,
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short,6.2,1620,nm0005690,William K.L. Dickson,Charles Kayser,John Ott,William K.L. Dickson


In [35]:
# Merge akas table to include region
title_akas_subset = title_akas[title_akas['ordering'] == 1]

title_akas_subset = title_akas_subset.rename(index=str, columns={"region": "primaryRegion"})

titles = pd.merge(titles, title_akas_subset, left_on='tconst', right_on='titleId', how='left')

columns_to_drop = ['ordering','title','language','types','attributes',
                   'isOriginalTitle','titleId']

titles = titles.drop(columns_to_drop, axis=1)

titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,Director_ID,Director,topBilled,TopBilled2nd,TopBilled3rd,primaryRegion
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.8,1350,nm0005690,William K.L. Dickson,Carmencita,William K.L. Dickson,William Heise,HU
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",6.5,157,nm0721526,Émile Reynaud,Émile Reynaud,Gaston Paulin,,\N
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",6.6,933,nm0721526,Émile Reynaud,Émile Reynaud,Julien Pappé,Gaston Paulin,HU
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short",6.4,93,nm0721526,Émile Reynaud,Émile Reynaud,Gaston Paulin,,\N
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short,6.2,1620,nm0005690,William K.L. Dickson,Charles Kayser,John Ott,William K.L. Dickson,US


## Apply Gender Detector<a id='link3d'></a>

Quick data cleaning to prepare for applying the gender detector to the dataset. We need to:
* Subset the titles dataset to only have movies in it (size goes from 800k to approximately 200k)
* Split each name column (topBilled, TopBilled2nd, TopBilled3rd, Director) into first name and name remaining
* Drop unnecessary columns
* Apply gender detector to a small subset to make sure it works and get an idea of timing
* Apply gender detector to movies dataset

In [36]:
# We're looking at nearly 800,000 records in our dataset
titles.shape

(795543, 17)

In [37]:
# When you look at the counts by titleType, you see that over 210,000 are movies
titles.titleType.value_counts()

tvEpisode       324100
movie           214195
short            97201
tvSeries         53679
video            42248
tvMovie          41443
tvMiniSeries      7432
videoGame         7161
tvSpecial         5744
tvShort           2340
Name: titleType, dtype: int64

We're going to constrain our analysis to movies only.
As we'll see later, the Bechdel dataset consists primarily of movies, and the full dataset is larger than this problem needs. In particular, the gender detector function takes a long time to run even on the roughly 210,000 movies, so we avoid running it unnecessarily on all 800,000 records.

In [38]:
# Subset dataset into just a movies dataset

movies = titles[titles['titleType'] == 'movie']

In [39]:
# We're left with 214,000 rows, as expected
movies.shape

(214195, 17)

In [40]:
# Split the topBilled column so we have a column called "firstNameTopBilled"
# This will be input in our gender detector function later on
# Do this for 2nd and 3rd billed as well as director

# n=1 is passed by keyword so the split also works on newer pandas versions
movies = movies.join(movies['topBilled']
                     .str.split(' ', n=1, expand=True).rename(columns={0:'firstNameTopBilled', 1:'firstNameRemaining'}))

movies = movies.join(movies['TopBilled2nd']
                     .str.split(' ', n=1, expand=True).rename(columns={0:'first2ndTopBilled', 1:'first2ndNameRemaining'}))

movies = movies.join(movies['TopBilled3rd']
                     .str.split(' ', n=1, expand=True).rename(columns={0:'first3rdTopBilled', 1:'first3rdNameRemaining'}))

movies = movies.join(movies['Director']
                     .str.split(' ', n=1, expand=True).rename(columns={0:'firstNameDirector', 1:'DirectorRemaining'}))
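
To see what the split produces, here is a small sketch on a handful of names taken from the outputs above, plus a missing value. The Series and the column labels are made up for the example.

```python
import pandas as pd

names = pd.Series(["Blanche Bayliss", "James J. Corbett", "Carmencita", None])

# n=1 splits on the first space only, so everything after it stays together.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["firstName", "nameRemaining"]
print(parts)
```

Single-word names get an empty "remaining" part, and missing names stay missing in both columns, which is why the first-name columns are filled with blanks before the gender detector runs.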

In [41]:
# Drop unnecessary columns to simplify our dataset

columns_to_drop = ["titleType","isAdult","originalTitle","endYear",
                   "Director_ID","firstNameRemaining",
                  'first2ndNameRemaining','first3rdNameRemaining','DirectorRemaining']

movies = movies.drop(columns_to_drop, axis=1)

In [42]:
movies.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Director,topBilled,TopBilled2nd,TopBilled3rd,primaryRegion,firstNameTopBilled,first2ndTopBilled,first3rdTopBilled,firstNameDirector
8,tt0000009,Miss Jerry,1894,45,Romance,5.4,62,Alexander Black,Blanche Bayliss,William Courtenay,Chauncey Depew,\N,Blanche,William,Chauncey,Alexander
141,tt0000147,The Corbett-Fitzsimmons Fight,1897,20,"Documentary,News,Sport",5.2,245,Enoch J. Rector,James J. Corbett,Bob Fitzsimmons,Billy Madden,US,James,Bob,Billy,Enoch
232,tt0000335,Soldiers of the Cross,1900,\N,"Biography,Drama",6.2,32,,Beatrice Day,Harold Graham,Mr. Graham,AU,Beatrice,Harold,Mr.,
330,tt0000574,The Story of the Kelly Gang,1906,70,"Biography,Crime,Drama",6.3,431,Charles Tait,Elizabeth Tait,John Tait,Norman Campbell,HU,Elizabeth,John,Norman,Charles
346,tt0000615,Robbery Under Arms,1907,\N,Drama,5.1,12,Charles MacMahon,Jim Gerald,George Merriman,Lance Vane,AU,Jim,George,Lance,Charles


#### Now to actually implement Gender Detector
* Information on the library can be found here: https://pypi.python.org/pypi/gender-guesser/

In [43]:
import gender_guesser.detector as gender

In [44]:
# User-defined function to detect gender.
# Simplify the result so that "male" and "mostly_male" both return "Male", and likewise for female.
# Anything else returns "Unknown".
# For example, the name "shannon" is normally "mostly_female", which becomes "Female".
# Test it on an empty string to see what a result looks like.

gender_detector = gender.Detector(case_sensitive=False)

def detect_gender(name):
    detected = gender_detector.get_gender(name)
    if detected in ("male", "mostly_male"):
        return "Male"
    if detected in ("female", "mostly_female"):
        return "Female"
    return "Unknown"

print(detect_gender(""))

Unknown
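
The label-collapsing step can be checked on its own, without installing gender_guesser. The sketch below assumes the detector's labels are `male`, `mostly_male`, `female`, `mostly_female`, `andy` and `unknown`; `SIMPLIFIED` and `simplify_label` are illustrative names, not part of the library.

```python
# Labels returned by the detector, collapsed to three buckets.
# "andy" (androgynous) and "unknown" both fall through to "Unknown".
SIMPLIFIED = {
    "male": "Male",
    "mostly_male": "Male",
    "female": "Female",
    "mostly_female": "Female",
}


def simplify_label(label: str) -> str:
    """Map a raw detector label to Male, Female or Unknown."""
    return SIMPLIFIED.get(label, "Unknown")


print(simplify_label("mostly_female"))
```

Folding the "mostly" labels into the main buckets trades some accuracy for coverage, so the resulting counts are best read as estimates.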


#### Test the gender detector speed
The function is pretty slow, and we're applying it to over 200,000 rows, so let's evaluate the timing of a few different methods. We'll look at:
* map()
* apply()

We'll be applying the above methods on a small subset of the large dataset.

In [45]:
# We'll call the small dataset movies_small

movies_small = movies.head(50).copy()  # copy so adding columns does not modify a slice of movies

In [46]:
# map()
%timeit movies_small["topBilledGender_map"] = list(map(detect_gender,movies_small["firstNameTopBilled"]))



26.2 ms ± 660 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [47]:
# .apply()
%timeit movies_small["topBilledGender_apply"] = movies_small["firstNameTopBilled"].apply(detect_gender)



25.5 ms ± 678 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Results of timing tests
* map() = 26.2 ms ± 660 µs per loop
* apply() = 25.5 ms ± 678 µs per loop

On this 50-row sample the two are effectively tied: the difference is about one standard deviation, so it likely won't matter which one we use. I'll use map().
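
Since many first names repeat across 200,000 rows, another option worth considering is caching, so each distinct name is looked up only once. This is a sketch of that idea using the standard library; `slow_lookup` is a hypothetical stand-in for the detector, and the approach was not timed in this notebook.

```python
from functools import lru_cache


def slow_lookup(name: str) -> str:
    """Hypothetical stand-in for an expensive per-name lookup."""
    table = {"john": "Male", "mary": "Female"}
    return table.get(name.lower(), "Unknown")


@lru_cache(maxsize=None)
def cached_lookup(name: str) -> str:
    # Repeated names are answered from the cache instead of calling slow_lookup again.
    return slow_lookup(name)


first_names = ["John", "Mary", "John", "John", "Mary", ""]
labels = list(map(cached_lookup, first_names))

print(labels)
print(cached_lookup.cache_info())
```

How much this helps depends on how often names repeat in the real data, so it would need to be measured with `%timeit` in the same way as the two methods above.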

In [48]:
movies.isnull().sum()

tconst                    0
primaryTitle              0
startYear                 0
runtimeMinutes            0
genres                    0
averageRating             0
numVotes                  0
Director              19899
topBilled                 0
TopBilled2nd           2214
TopBilled3rd           4700
primaryRegion         35083
firstNameTopBilled        0
first2ndTopBilled      2214
first3rdTopBilled      4700
firstNameDirector     19899
dtype: int64

In [49]:
# fill na values of top-billed first names with blanks
movies.firstNameTopBilled = movies.firstNameTopBilled.fillna('')

# fill na values of 2nd-billed first names with blanks
movies.first2ndTopBilled = movies.first2ndTopBilled.fillna('')

# fill na values of 3rd-billed first names with blanks
movies.first3rdTopBilled = movies.first3rdTopBilled.fillna('')

# fill na values of director first names with blanks
movies.firstNameDirector = movies.firstNameDirector.fillna('')

In [50]:
# This will run the gender detector on the movies dataset

# topBilledGender
movies["topBilledGender"] = list(map(detect_gender,movies["firstNameTopBilled"]))

# secondTopBilledGender
movies["secondTopBilledGender"] = list(map(detect_gender,movies["first2ndTopBilled"]))

# thirdTopBilledGender
movies["thirdTopBilledGender"] = list(map(detect_gender,movies["first3rdTopBilled"]))

# directorGender
movies["directorGender"] = list(map(detect_gender,movies["firstNameDirector"]))
movies.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Director,topBilled,TopBilled2nd,TopBilled3rd,primaryRegion,firstNameTopBilled,first2ndTopBilled,first3rdTopBilled,firstNameDirector,topBilledGender,secondTopBilledGender,thirdTopBilledGender,directorGender
8,tt0000009,Miss Jerry,1894,45,Romance,5.4,62,Alexander Black,Blanche Bayliss,William Courtenay,Chauncey Depew,\N,Blanche,William,Chauncey,Alexander,Female,Male,Male,Male
141,tt0000147,The Corbett-Fitzsimmons Fight,1897,20,"Documentary,News,Sport",5.2,245,Enoch J. Rector,James J. Corbett,Bob Fitzsimmons,Billy Madden,US,James,Bob,Billy,Enoch,Male,Male,Male,Male
232,tt0000335,Soldiers of the Cross,1900,\N,"Biography,Drama",6.2,32,,Beatrice Day,Harold Graham,Mr. Graham,AU,Beatrice,Harold,Mr.,,Female,Male,Unknown,Unknown
330,tt0000574,The Story of the Kelly Gang,1906,70,"Biography,Crime,Drama",6.3,431,Charles Tait,Elizabeth Tait,John Tait,Norman Campbell,HU,Elizabeth,John,Norman,Charles,Female,Male,Male,Male
346,tt0000615,Robbery Under Arms,1907,\N,Drama,5.1,12,Charles MacMahon,Jim Gerald,George Merriman,Lance Vane,AU,Jim,George,Lance,Charles,Male,Male,Male,Male


In [51]:
movies.topBilledGender.value_counts()

Male       120515
Female      57296
Unknown     36384
Name: topBilledGender, dtype: int64

In [52]:
movies.secondTopBilledGender.value_counts()

Male       102186
Female      73358
Unknown     38651
Name: secondTopBilledGender, dtype: int64

In [53]:
movies.thirdTopBilledGender.value_counts()

Male       109028
Female      63377
Unknown     41790
Name: thirdTopBilledGender, dtype: int64

In [54]:
movies.directorGender.value_counts()

Male       143750
Unknown     53855
Female      16590
Name: directorGender, dtype: int64
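
Raw counts are easier to compare as shares. The arithmetic below uses the `directorGender` counts printed above; the variable names are illustrative, and the "female share of known" figure simply excludes the Unknown bucket.

```python
# Counts copied from the directorGender value_counts() output above.
director_counts = {"Male": 143750, "Unknown": 53855, "Female": 16590}

total = sum(director_counts.values())
known = director_counts["Male"] + director_counts["Female"]

unknown_share = director_counts["Unknown"] / total
female_share_of_known = director_counts["Female"] / known

print(f"Unknown: {unknown_share:.1%}")
print(f"Female among known: {female_share_of_known:.1%}")
```

In pandas the same shares of the total are available directly via `value_counts(normalize=True)`. Because these labels come from first names only, and a sizeable part of the rows is Unknown, the shares are rough estimates.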

In [55]:
# Drop unnecessary columns to simplify our dataset

columns_to_drop = ["firstNameTopBilled","first2ndTopBilled","first3rdTopBilled",
                   "firstNameDirector"]

movies = movies.drop(columns_to_drop, axis=1)

In [56]:
movies.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Director,topBilled,TopBilled2nd,TopBilled3rd,primaryRegion,topBilledGender,secondTopBilledGender,thirdTopBilledGender,directorGender
8,tt0000009,Miss Jerry,1894,45,Romance,5.4,62,Alexander Black,Blanche Bayliss,William Courtenay,Chauncey Depew,\N,Female,Male,Male,Male
141,tt0000147,The Corbett-Fitzsimmons Fight,1897,20,"Documentary,News,Sport",5.2,245,Enoch J. Rector,James J. Corbett,Bob Fitzsimmons,Billy Madden,US,Male,Male,Male,Male
232,tt0000335,Soldiers of the Cross,1900,\N,"Biography,Drama",6.2,32,,Beatrice Day,Harold Graham,Mr. Graham,AU,Female,Male,Unknown,Unknown
330,tt0000574,The Story of the Kelly Gang,1906,70,"Biography,Crime,Drama",6.3,431,Charles Tait,Elizabeth Tait,John Tait,Norman Campbell,HU,Female,Male,Male,Male
346,tt0000615,Robbery Under Arms,1907,\N,Drama,5.1,12,Charles MacMahon,Jim Gerald,George Merriman,Lance Vane,AU,Male,Male,Male,Male


## Missing Values<a id='link3e'></a>

In [57]:
movies.isnull().sum()

tconst                       0
primaryTitle                 0
startYear                    0
runtimeMinutes               0
genres                       0
averageRating                0
numVotes                     0
Director                 19899
topBilled                    0
TopBilled2nd              2214
TopBilled3rd              4700
primaryRegion            35083
topBilledGender              0
secondTopBilledGender        0
thirdTopBilledGender         0
directorGender               0
dtype: int64

In [58]:
# let's take a final look at the details about the data
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214195 entries, 8 to 795530
Data columns (total 16 columns):
tconst                   214195 non-null object
primaryTitle             214195 non-null object
startYear                214195 non-null object
runtimeMinutes           214195 non-null object
genres                   214195 non-null object
averageRating            214195 non-null float64
numVotes                 214195 non-null int64
Director                 194296 non-null object
topBilled                214195 non-null object
TopBilled2nd             211981 non-null object
TopBilled3rd             209495 non-null object
primaryRegion            179112 non-null object
topBilledGender          214195 non-null object
secondTopBilledGender    214195 non-null object
thirdTopBilledGender     214195 non-null object
directorGender           214195 non-null object
dtypes: float64(1), int64(1), object(14)
memory usage: 37.8+ MB


In [59]:
# make sure data types are correct before moving forward

# runtimeMinutes: non-numeric entries such as "\N" are coerced to NaN, then stored as 0
movies.runtimeMinutes = pd.to_numeric(movies.runtimeMinutes, errors='coerce').fillna(0).astype(np.int64)
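
The conversion above can be illustrated on a toy column. This is a minimal sketch, not the notebook's data: `runtime_raw` and the other names are made up for the example, and the nullable `Int64` alternative is shown only as an option for keeping missing runtimes distinguishable from a real 0.

```python
import numpy as np
import pandas as pd

# Toy runtime column: two valid values and IMDb's "\N" missing-value marker
runtime_raw = pd.Series(["45", r"\N", "70"])

# errors='coerce' turns anything unparseable into NaN instead of raising
runtime_numeric = pd.to_numeric(runtime_raw, errors="coerce")

# Approach used in the cell above: store missing runtimes as 0
runtime_zero_filled = runtime_numeric.fillna(0).astype(np.int64)

# Alternative: keep missing runtimes missing with pandas' nullable integer dtype
runtime_nullable = runtime_numeric.astype("Int64")

print(runtime_zero_filled.tolist())
print(int(runtime_nullable.isna().sum()))
```

Filling with 0 keeps the column a plain integer, but summary statistics such as the minimum or mean then include those placeholder zeros.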

## Director and Crew Names

In [60]:
# nearly 20,000 null director records
movies.Director = movies.Director.fillna('')

movies.TopBilled2nd = movies.TopBilled2nd.fillna('')

movies.TopBilled3rd = movies.TopBilled3rd.fillna('')

## Region

In [61]:
# Region has a significant number of null values
movies.primaryRegion.isnull().sum()

35083

In [62]:
movies.primaryRegion = movies.primaryRegion.fillna('')

In [63]:
movies.primaryRegion.isnull().sum()

0

## startYear

In [64]:
# startYear
movies.startYear = pd.to_numeric(movies.startYear, errors='coerce').fillna(0).astype(np.int64)

# 3 movies have startYear = 0 (movies.info() shows 214,195 rows before this step and 214,192 after)
# remove those rows
movies = movies.loc[movies['startYear'] != 0]

# Check to make sure it worked
movies[movies['startYear'] == 0].head(5)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Director,topBilled,TopBilled2nd,TopBilled3rd,primaryRegion,topBilledGender,secondTopBilledGender,thirdTopBilledGender,directorGender


## Genre

## Create Dummies<a id='link3f'></a>

We'll make dummy variables for a few categorical variables
* genres
    * Since the genre field is a list of genres, we can say that if a movie's genre contains "comedy", then it is a comedy movie, and similar logic for other popular genres such as action, adventure, drama, sci-fi, etc.
* top Billed gender
    * A dummy for the gender of the top billed person, which will have 3 columns: male, female, or unknown/androgynous
* Director gender
    * Since this overlaps heavily with top Billed person, it might not be worthwhile calculating this, but it can be done the same way as top Billed gender
* Other crew gender
    * The movie's principal cast/crew field contains the top billed individual, but generally also contains the names of other prominent people in the cast/crew. We can have 3 dummy columns for this, based on the number of females among the other people in this field:
        * column 1 = no females in remaining cast/crew
        * column 2 = 1 female in remaining cast/crew
        * column 3 = 2 or more females in remaining cast/crew
* Region
    * We can split region into a number of different groups, but since the US region accounts for a vast majority of the entries, I'll split it into US or not US.
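
The "other crew gender" idea from the list above could be sketched as follows. This is only an illustration on toy rows: the `otherCrew_*` column names are hypothetical, and it assumes the remaining cast/crew are the second and third billed people with the `Male` / `Female` / `Unknown` labels produced earlier.

```python
import pandas as pd

# Toy rows using the gender labels produced earlier ("Male", "Female", "Unknown")
toy = pd.DataFrame({
    "secondTopBilledGender": ["Male", "Female", "Female", "Unknown"],
    "thirdTopBilledGender":  ["Male", "Male",   "Female", "Female"],
})

other_cols = ["secondTopBilledGender", "thirdTopBilledGender"]

# Count how many of the remaining billed people were labeled Female
female_count = toy[other_cols].eq("Female").sum(axis=1)

# Hypothetical column names for the three buckets described above
toy["otherCrew_NoFemale"] = (female_count == 0).astype(int)
toy["otherCrew_OneFemale"] = (female_count == 1).astype(int)
toy["otherCrew_TwoPlusFemale"] = (female_count >= 2).astype(int)

print(toy)
```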

### genres
Major movie genres that we will create dummy columns for:
* Action
* Adventure
* Comedy
* Drama
* Sci-fi
* Horror
* Romance
* Thriller
* Mystery
* Crime
* Animation
* Fantasy
* Documentary
* Musical
* Sport

In [65]:
movies.genres.value_counts()

Drama                              35432
Comedy                             19964
Documentary                        17712
\N                                 14507
Comedy,Drama                        6777
Drama,Romance                       6189
Horror                              4547
Comedy,Romance                      3967
Comedy,Drama,Romance                3073
Thriller                            2977
Action                              2933
Crime,Drama                         2930
Adult                               2232
Western                             2146
Drama,Thriller                      2136
Action,Crime,Drama                  2045
Action,Drama                        1999
Romance                             1847
Drama,War                           1676
Horror,Thriller                     1630
Crime,Drama,Thriller                1333
Drama,Family                        1233
Documentary,Music                   1218
Family                              1207
Adventure       

In [66]:
# Dummy column name -> substring to look for in the comma-separated genres string
genre_keywords = {
    "Action": "Action",
    "Adventure": "Adventure",
    "Comedy": "Comedy",
    "Drama": "Drama",
    "SciFi": "Sci-Fi",
    "Horror": "Horror",
    "Romance": "Romance",
    "Thriller": "Thriller",
    "Mystery": "Mystery",
    "Crime": "Crime",
    "Animation": "Animation",
    "Fantasy": "Fantasy",
    "Documentary": "Documentary",
    "Musical": "Music",  # substring match: flags both "Music" and "Musical"
    "Sport": "Sport",
    "Adult": "Adult",
    "Western": "Western",
    "Family": "Family",
}

# A movie gets a 1 in a genre column if the keyword appears anywhere in its genres string
for column, keyword in genre_keywords.items():
    movies[column] = movies["genres"].map(lambda x, kw=keyword: 1 if kw in x else 0)
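
As a point of comparison, pandas can also split the comma-separated string into exact genre tokens with `Series.str.get_dummies`. The sketch below uses a made-up `genres` series, not the notebook's data, and contrasts exact-token columns with the substring check used above, which treats "Music" and "Musical" as the same thing.

```python
import pandas as pd

genres = pd.Series(["Comedy,Drama", "Documentary,Music", "Musical,Romance", r"\N"])

# One indicator column per exact genre token, split on commas
genre_dummies = genres.str.get_dummies(sep=",")

# Substring matching, as in the cell above, flags "Music" and "Musical" alike
substring_musical = genres.map(lambda x: 1 if "Music" in x else 0)

print(genre_dummies)
print(substring_musical.tolist())
```

Note that the exact-token version also produces a column for the `\N` placeholder, which would need to be dropped or renamed before modeling.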

In [67]:
movies.sample(10)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Director,topBilled,TopBilled2nd,...,Mystery,Crime,Animation,Fantasy,Documentary,Musical,Sport,Adult,Western,Family
477086,tt1475580,The Girl from Shady Streets,2008,50,Documentary,6.7,6,Johanna Enäsuo,Marta Correa,Mileider Gil,...,0,0,0,0,1,0,0,0,0,0
63593,tt0093407,Less Than Zero,1987,98,"Crime,Drama",6.4,15816,Marek Kanievska,Andrew McCarthy,Jami Gertz,...,0,1,0,0,0,0,0,0,0,0
590739,tt2375097,Heino Jaeger - Look Before You Kuck,2012,120,"Biography,Comedy,Documentary",7.3,18,Gerd Kroske,Ivonne Durand,Renate Durand,...,0,0,0,0,1,0,0,0,0,0
733983,tt5497458,The Watcher,2016,89,"Horror,Thriller",5.4,2199,Ryan Rothmaier,Erin Cahill,Edi Gathegi,...,0,0,0,0,0,0,0,0,0,0
18437,tt0038911,Secret Flight,1946,102,"Drama,War",6.6,98,Peter Ustinov,Ralph Richardson,Raymond Huntley,...,0,0,0,0,0,0,0,0,0,0
200536,tt0390454,Evergreen Tree,1961,110,Drama,8.4,8,Sang-ok Shin,Eun-hie Choi,Yeong-gyun Shin,...,0,0,0,0,0,0,0,0,0,0
47292,tt0073946,The Softening of the Egg,1975,107,Comedy,6.2,670,Hans Alfredson,Gösta Ekman,Max von Sydow,...,0,0,0,0,0,0,0,0,0,0
84503,tt0119005,Don't Sleep Alone,1997,80,"Mystery,Thriller",4.3,47,Tim Andrew,Lisa Welti,Doug Jeffery,...,1,0,0,0,0,0,0,0,0,0
747244,tt5831124,Taming the Horse,2017,118,Documentary,7.1,9,Tao Gu,Tao Gu,Aonan Yang,...,0,0,0,0,1,0,0,0,0,0
417456,tt1122603,Russian Bride,2007,97,"Comedy,Drama",6.1,18,Natasha Guruleva,Elena Roth,Richard Virga,...,0,0,0,0,0,0,0,0,0,0


### Gender Dummies

In [68]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214192 entries, 8 to 795530
Data columns (total 34 columns):
tconst                   214192 non-null object
primaryTitle             214192 non-null object
startYear                214192 non-null int64
runtimeMinutes           214192 non-null int64
genres                   214192 non-null object
averageRating            214192 non-null float64
numVotes                 214192 non-null int64
Director                 214192 non-null object
topBilled                214192 non-null object
TopBilled2nd             214192 non-null object
TopBilled3rd             214192 non-null object
primaryRegion            214192 non-null object
topBilledGender          214192 non-null object
secondTopBilledGender    214192 non-null object
thirdTopBilledGender     214192 non-null object
directorGender           214192 non-null object
Action                   214192 non-null int64
Adventure                214192 non-null int64
Comedy                   2141

In [69]:
# topBilledGender
# Male
movies["topBilled_Male"] = movies["topBilledGender"].map(lambda x: 
                                                        1 if "Male" in x else 0)
# Female
movies["topBilled_Female"] = movies["topBilledGender"].map(lambda x: 
                                                        1 if "Female" in x else 0)
# Unknown
movies["topBilled_Unknown"] = movies["topBilledGender"].map(lambda x: 
                                                        1 if "Unknown" in x else 0)

# secondTopBilledGender
# Male
movies["secondBilled_Male"] = movies["secondTopBilledGender"].map(lambda x: 
                                                        1 if "Male" in x else 0)
# Female
movies["secondBilled_Female"] = movies["secondTopBilledGender"].map(lambda x: 
                                                        1 if "Female" in x else 0)
# Unknown
movies["secondBilled_Unknown"] = movies["secondTopBilledGender"].map(lambda x: 
                                                        1 if "Unknown" in x else 0)

# thirdTopBilledGender
# Male
movies["thirdBilled_Male"] = movies["thirdTopBilledGender"].map(lambda x: 
                                                        1 if "Male" in x else 0)
# Female
movies["thirdBilled_Female"] = movies["thirdTopBilledGender"].map(lambda x: 
                                                        1 if "Female" in x else 0)
# Unknown
movies["thirdBilled_Unknown"] = movies["thirdTopBilledGender"].map(lambda x: 
                                                        1 if "Unknown" in x else 0)

# directorGender
# Male
movies["director_Male"] = movies["directorGender"].map(lambda x: 
                                                        1 if "Male" in x else 0)
# Female
movies["director_Female"] = movies["directorGender"].map(lambda x: 
                                                        1 if "Female" in x else 0)
# Unknown
movies["director_Unknown"] = movies["directorGender"].map(lambda x: 
                                                        1 if "Unknown" in x else 0)
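
The same kind of columns can be produced with `pd.get_dummies`. This is a sketch on a toy frame rather than a drop-in replacement: `get_dummies` only creates columns for labels that actually occur in the data, and it orders them alphabetically.

```python
import pandas as pd

toy = pd.DataFrame({"topBilledGender": ["Male", "Female", "Unknown", "Female"]})

# prefix reproduces names like topBilled_Male; dtype=int gives 0/1 rather than booleans
top_billed_dummies = pd.get_dummies(toy["topBilledGender"], prefix="topBilled", dtype=int)

# Attach the new indicator columns to the original frame
toy = pd.concat([toy, top_billed_dummies], axis=1)

print(toy)
```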

### Region Dummy

In [70]:
# Region dummy
movies["US_Region"] = movies["primaryRegion"].map(lambda x: 
                                                        1 if "US" in x else 0)

In [71]:
movies.US_Region.value_counts()

0    180910
1     33282
Name: US_Region, dtype: int64

## Merge Bechdel Test Table<a id='link3g'></a>

In [72]:
#This dataset will be much smaller, and will only have records with a Bechdel score
movies_bechdel = pd.merge(movies, bechdel, left_on='tconst', right_on='imdbid', how='inner')

In [73]:
# Make sure that the bechdel scores are integer datatypes

movies_bechdel.bechdelScore = pd.to_numeric(movies_bechdel.bechdelScore, errors='coerce').fillna(0).astype(np.int64)


In [74]:
# Drop unnecessary columns
columns_to_drop = ["imdbid", "title"]

movies_bechdel = movies_bechdel.drop(columns_to_drop, axis=1)
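
Before relying on an inner join, it can help to check how many rows match on each side. The sketch below uses two toy tables with made-up IDs and scores; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column labeling each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

movies_toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                           "primaryTitle": ["A", "B", "C"]})
bechdel_toy = pd.DataFrame({"imdbid": ["tt2", "tt3", "tt9"],
                            "bechdelScore": [3, 1, 2]})

# Outer join with indicator=True labels each row left_only / right_only / both
audit = pd.merge(movies_toy, bechdel_toy, left_on="tconst", right_on="imdbid",
                 how="outer", indicator=True)
match_counts = audit["_merge"].value_counts()

# Inner join, as in the cell above, keeps only rows found in both tables
matched = pd.merge(movies_toy, bechdel_toy, left_on="tconst", right_on="imdbid",
                   how="inner")

print(match_counts)
print(len(matched))
```

A large `right_only` count on the real data would suggest that the two ID columns are formatted differently, which is worth ruling out before modeling.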

# Start Here (with data on local machine)<a id='link4'></a>
### If you already have the files locally, start running things from here

[Back to Top](#back to top)

In [77]:
# Save the datasets to CSV
#movies_bechdel.to_csv('movies_bechdel.csv')  # remove the hash at the beginning of the line to save
#movies.to_csv('movies.csv')                  # remove the hash at the beginning of the line to save

# If you already have the files saved, use this to re-load the tables
#movies_bechdel = pd.read_csv('movies_bechdel.csv')  # remove the hash at the beginning of the line to load from csv
#movies = pd.read_csv('movies.csv')                  # remove the hash at the beginning of the line to load from csv
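
One thing to watch for when saving and re-loading this way: `to_csv` writes the index by default, and a plain `read_csv` brings it back as an ordinary column named `Unnamed: 0`. The sketch below shows this on a toy frame written to an in-memory buffer (the buffer stands in for a file on disk).

```python
import io
import pandas as pd

toy = pd.DataFrame({"tconst": ["tt0000009", "tt0000147"]}, index=[8, 141])

# Default to_csv writes the index as an unnamed first column...
buffer = io.StringIO()
toy.to_csv(buffer)

# ...which a plain read_csv brings back as a regular column named "Unnamed: 0"
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# index_col=0 restores it as the index instead
buffer.seek(0)
reloaded_indexed = pd.read_csv(buffer, index_col=0)

print(list(reloaded_default.columns))
print(list(reloaded_indexed.columns))
```

An extra `Unnamed: 0` column matters later because any numeric column left in the feature table is passed to the model.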

# Exploratory Data Analysis (EDA)<a id='link5'></a>

[Back to Top](#back to top)

In general I'm trying to do my visualizations in Bokeh to teach myself that package. Here are some of the resources that I used while creating this work.
* https://www.kaggle.com/kanncaa1/interactive-bokeh-tutorial-part-1
* https://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh-plotting

In [80]:
# bokeh packages
from bokeh.io import show,output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource,HoverTool,CategoricalColorMapper
from bokeh.layouts import row,column,gridplot
from bokeh.models.widgets import Tabs,Panel
from bokeh.palettes import Spectral5, Spectral6
from bokeh.transform import factor_cmap
output_notebook()

## Dataset Overview<a id='link5a'></a>

In [81]:
movies_bechdel.shape

(6966, 48)

In [82]:
merged_all.shape

(214198, 50)

In [83]:
movies_bechdel.describe()

Unnamed: 0,startYear,runtimeMinutes,averageRating,numVotes,Action,Adventure,Comedy,Drama,SciFi,Horror,...,secondBilled_Female,secondBilled_Unknown,thirdBilled_Male,thirdBilled_Female,thirdBilled_Unknown,director_Male,director_Female,director_Unknown,US_Region,bechdelScore
count,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,...,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0,6966.0
mean,1995.299167,105.687913,6.705225,73032.06,0.168246,0.158915,0.364915,0.562159,0.078381,0.116997,...,0.404823,0.087138,0.54321,0.349411,0.107379,0.756245,0.098335,0.145421,0.05857,2.16035
std,22.475819,26.807842,0.992848,139675.9,0.374111,0.365623,0.481441,0.496157,0.268789,0.32144,...,0.490893,0.282057,0.498165,0.476818,0.309616,0.429378,0.297788,0.35255,0.234835,1.07453
min,1906.0,0.0,1.3,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1987.0,92.0,6.1,4483.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
50%,2004.0,102.0,6.8,21080.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0
75%,2011.0,116.0,7.4,78832.25,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0
max,2017.0,1440.0,9.3,1913498.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0


In [84]:
merged_all.describe()

Unnamed: 0,startYear,runtimeMinutes,averageRating,numVotes,Action,Adventure,Comedy,Drama,SciFi,Horror,...,secondBilled_Female,secondBilled_Unknown,thirdBilled_Male,thirdBilled_Female,thirdBilled_Unknown,director_Male,director_Female,director_Unknown,US_Region,bechdelScore
count,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,...,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0,214198.0
mean,1989.805633,80.725161,6.263551,3130.426,0.100631,0.05874,0.253336,0.428132,0.024608,0.066378,...,0.342487,0.180445,0.509015,0.295885,0.1951,0.671117,0.077456,0.251426,0.155384,0.070257
std,24.853926,56.163873,1.339313,28728.94,0.30084,0.235138,0.434923,0.494809,0.154928,0.248942,...,0.474543,0.384559,0.49992,0.456441,0.396278,0.469808,0.267315,0.433834,0.362271,0.429406
min,1894.0,0.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1974.0,72.0,5.5,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1998.0,89.0,6.4,44.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,2010.0,100.0,7.2,241.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
max,2018.0,14400.0,10.0,1913498.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0


## Bechdel Test Scores<a id='link5b'></a>
Things I'm looking at in regard to this test score
* Distribution of scores & 5-number summary
* Distribution of scores by genre
* The scores of some famous directors (Quentin Tarantino, Steven Spielberg, Wes Anderson, Martin Scorsese, Stanley Kubrick, etc)
* Importantly, whether we're missing any big-name movies whose scores people are interested in

In [86]:
# histogram of bechdel scores for movies_bechdel
# I treat it as a categorical variable here: https://bokeh.pydata.org/en/latest/docs/user_guide/categorical.html

#output_file("bechdel_histogram.html")

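# Count movies per score, ordered 0 to 3; a categorical axis needs string factors
score_counts = movies_bechdel['bechdelScore'].value_counts().sort_index()
bechdel_scores = [str(score) for score in score_counts.index]
bechdel_counts = score_counts.tolist()

# Every ColumnDataSource column must have the same length, so use one palette color per score
source = ColumnDataSource(data=dict(scores=bechdel_scores, counts=bechdel_counts,
                                    color=Spectral6[:len(bechdel_scores)]))

p = figure(x_range=bechdel_scores, y_range=(0,max(bechdel_counts)*1.1), plot_height=250, title="Bechdel Score Counts",
           toolbar_location=None, tools="")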

p.vbar(x='scores', top='counts', width=0.9, color='color', legend="scores", source=source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)



In [166]:
# For each year, plot the average bechdel scores within that year
movies_bechdel.bechdelScore = movies_bechdel.bechdelScore.astype(int)

grouped = movies_bechdel.groupby("startYear")
bechdelScore = grouped["bechdelScore"]
bechdelScore_avg = bechdelScore.mean()

years = list(grouped.groups.keys())

p = figure(title="Bechdel Score Average by Year")

p.circle(x=years, y=bechdelScore_avg, size=10, alpha=0.5,
         color="red")

show(p)

In [167]:
# Can I do number of records by year in Bokeh instead?
# For each year, plot the number of records within that year
movies_bechdel.bechdelScore = movies_bechdel.bechdelScore.astype(int)

grouped = movies_bechdel.groupby("startYear")
bechdelScore = grouped["bechdelScore"]
bechdelScore_count = bechdelScore.count()

years = list(grouped.groups.keys())

p = figure(title="Number of records by Year")

p.circle(x=years, y=bechdelScore_count, size=10, alpha=0.5,
         color="red")

show(p)
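
The yearly average and the yearly record count plotted in the two cells above can also be computed in a single pass. A minimal sketch on toy data (the values are made up) using `groupby(...).agg`:

```python
import pandas as pd

toy = pd.DataFrame({
    "startYear":    [1999, 1999, 2004, 2004, 2004],
    "bechdelScore": [3,    1,    3,    3,    0],
})

# One pass over the groups yields both the yearly average and the record count
by_year = toy.groupby("startYear")["bechdelScore"].agg(["mean", "count"])

# The result's index holds the years in the same order as the aggregated values
years = by_year.index.tolist()

print(by_year)
```

Taking the years from the result's index guarantees that the x and y values passed to a plot line up row for row.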

![not predicted by gender](https://user-images.githubusercontent.com/30674288/36279820-39f86fec-124d-11e8-9a29-7894a9560e04.png)

## IMDb Rating<a id='link5c'></a>

In [95]:
# histogram of average ratings

p1 = figure(title="Average Rating Histogram")

hist, edges = np.histogram(movies_bechdel['averageRating'], density=True, bins=20)

p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="black")
show(p1)
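
A note on `density=True` in the cell above: the bar heights are not counts but densities, scaled so that the total bar area is 1. The toy example below (made-up ratings) checks that property and shows why `edges` has one more entry than `hist`.

```python
import numpy as np

ratings = np.array([5.4, 6.3, 6.8, 7.4, 7.4, 8.1])

# hist holds one density per bin; edges holds the bin boundaries
hist, edges = np.histogram(ratings, density=True, bins=3)

# Area of each bar = height * width; the areas add up to 1
bin_widths = np.diff(edges)
total_area = float(np.sum(hist * bin_widths))

print(len(hist), len(edges), total_area)
```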

In [96]:
# scatter plot of rating and year

source = ColumnDataSource(movies_bechdel)

# Add tooltip element (however it seems to crash with so much data)
hover = HoverTool(tooltips = [("Movie Title","@primaryTitle"),("Year","@startYear"),("Average Rating","@averageRating")], mode="hline")
plot = figure(tools=[hover])

# Choose circle (aka scatter plot)
plot.circle(x='startYear', y='averageRating', source=source)
show(plot)

In [162]:
# For each year, plot the average rating within that year
movies_bechdel.bechdelScore = movies_bechdel.bechdelScore.astype(int)

grouped = movies_bechdel.groupby("startYear")
average_Rating = grouped["averageRating"]
averageRating_avg = average_Rating.mean()

years = list(grouped.groups.keys())

p = figure(title="Average IMDb Rating by Year")

p.circle(x=years, y=averageRating_avg, size=10, alpha=0.5,
         color="red")

show(p)

<a id='mapbefore'></a>
## Regional Map<a id='link5d'></a>

[to see map with predictions - click here](#mapafter)

![map](https://user-images.githubusercontent.com/30674288/36272413-54cb5562-1236-11e8-8e64-fd5d800154b5.png)

## Crew<a id='link5e'></a>

In [146]:
# Top 10 instances of topBilled name
movies_bechdel.topBilled.value_counts().head(10)

Tom Hanks             32
Johnny Depp           29
Tom Cruise            26
Robert De Niro        23
Adam Sandler          23
Nicolas Cage          21
Bruce Willis          21
Jim Carrey            19
Bette Davis           19
Sylvester Stallone    19
Name: topBilled, dtype: int64

In [147]:
# Top 10 instances of Director name
movies_bechdel.Director.value_counts().head(10)

                     521
Alfred Hitchcock      43
Woody Allen           30
Steven Spielberg      25
Martin Scorsese       18
Pedro Almodóvar       17
Steven Soderbergh     17
Clint Eastwood        17
John Carpenter        17
Tim Burton            16
Name: Director, dtype: int64

# Classification & Modeling <a id='link6'></a>

[Back to Top](#back to top)

* Feature Selection
* Model Testing
* Model Application to IMDb Data
* Testing the model on brand-new movies that are not yet in the IMDb dataset (new releases that are coming out)


## Feature Selection<a id='link6a'></a>

In [149]:
# Dataset for model training/testing
model_building = movies_bechdel

In [105]:
# Make sure it has no null values
model_building.isnull().sum()

tconst                   0
primaryTitle             0
startYear                0
runtimeMinutes           0
genres                   0
averageRating            0
numVotes                 0
Director                 0
topBilled                0
TopBilled2nd             0
TopBilled3rd             0
primaryRegion            0
topBilledGender          0
secondTopBilledGender    0
thirdTopBilledGender     0
directorGender           0
Action                   0
Adventure                0
Comedy                   0
Drama                    0
SciFi                    0
Horror                   0
Romance                  0
Thriller                 0
Mystery                  0
Crime                    0
Animation                0
Fantasy                  0
Documentary              0
Musical                  0
Sport                    0
Adult                    0
Western                  0
Family                   0
topBilled_Male           0
topBilled_Female         0
topBilled_Unknown        0
s

In [150]:
# Choose the Features

features_to_drop = ['tconst','primaryTitle','primaryRegion','Director','topBilled','genres',
                    'TopBilled2nd','TopBilled3rd','bechdelScore',
                    'topBilledGender','secondTopBilledGender','thirdTopBilledGender',
                    'directorGender','numVotes','runtimeMinutes']

y = model_building["bechdelScore"]
x = model_building.drop(features_to_drop, axis=1)
x.head()

Unnamed: 0,startYear,averageRating,Action,Adventure,Comedy,Drama,SciFi,Horror,Romance,Thriller,...,secondBilled_Male,secondBilled_Female,secondBilled_Unknown,thirdBilled_Male,thirdBilled_Female,thirdBilled_Unknown,director_Male,director_Female,director_Unknown,US_Region
0,1906,6.3,0,0,0,1,0,0,0,0,...,1,0,0,1,0,0,1,0,0,0
1,1912,5.1,0,0,0,1,0,0,0,0,...,0,1,0,0,0,1,1,0,0,0
2,1914,6.1,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,1,0,0,0
3,1915,6.7,0,0,0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,1,0
4,1916,6.5,0,0,0,1,0,0,1,0,...,1,0,0,1,0,0,0,0,1,1


In [113]:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, make_scorer

from sklearn.svm import SVC

from sklearn import neighbors, linear_model
import sklearn.ensemble as ske

from sklearn.linear_model import LogisticRegression

In [114]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
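
Because the Bechdel scores are imbalanced, one option is to pass `stratify=` to `train_test_split` so the test split keeps the same class proportions as the full data. This is a sketch on toy labels (the class sizes are invented), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a similar imbalance pattern: class 3 dominates
y_toy = np.repeat([0, 1, 2, 3], [10, 20, 10, 60])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps each class's share the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

test_class_counts = np.bincount(y_te, minlength=4)
print(test_class_counts)
```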

## Support Vector Machine (SVM)<a id='link6b'></a>

In [168]:
# Support Vector Machine (SVM/SVC)
cv = StratifiedKFold(n_splits=5,shuffle=True)

svc = SVC(gamma=0.0001)

scorer = make_scorer(accuracy_score, greater_is_better = True)
cross_val_score(svc, x,y, cv=cv, scoring=scorer).mean()

0.5756533386649224
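
SVMs are sensitive to feature scale, and the feature table mixes a year column with 0/1 dummies. One thing that could be tried is standardizing inside a pipeline so the scaling is re-fit on each cross-validation fold. The sketch below runs on synthetic data with random labels, so its score says nothing about the Bechdel data; it also fixes `random_state` on the folds so repeated runs give the same number.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 200

# Synthetic stand-ins: a year-like column, a rating-like column, one 0/1 dummy
X_toy = np.column_stack([
    rng.randint(1900, 2018, size=n),
    rng.uniform(1, 10, size=n),
    rng.randint(0, 2, size=n),
])
y_toy = rng.randint(0, 4, size=n)  # random labels, so the score itself is meaningless

# The scaler is fit on each training fold only, then applied to the held-out fold
scaled_svc = make_pipeline(StandardScaler(), SVC())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(scaled_svc, X_toy, y_toy, cv=cv, scoring="accuracy")

print(scores.mean())
```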

## Logistic Regression<a id='link6c'></a>

In [133]:
scorer = make_scorer(accuracy_score, greater_is_better = True)

cv = StratifiedKFold(n_splits=5,shuffle=True)

# Let's try 4 different parameter combinations within the logistic model
print( cross_val_score(LogisticRegression(class_weight='balanced'), x,y, cv=cv, scoring=scorer).mean())
print( cross_val_score(LogisticRegression(C=100), x,y, cv=cv, scoring=scorer).mean())
print( cross_val_score(LogisticRegression(), x,y, cv=cv, scoring=scorer).mean())
print( cross_val_score(LogisticRegression(class_weight='balanced',C=100), x,y, cv=cv, scoring=scorer).mean())

0.5533986874732649
0.6055144796929215
0.6047940264728144
0.5542636414823906


From this, it looks like the logistic regression with C=100 is the best one. However, once we look at the classification reports for all four models, we see that LogisticRegression(C=100) and LogisticRegression() both failed to make any predictions with a score of 0 or 2, which is certainly a bad sign. Their overall accuracy is higher only because they concentrate on the two most common scores (score 3 alone makes up 787 of the 1,394 test movies); for scores 0 and 2 their precision and recall are both 0.00.

In fact, the balanced logistic regression has the best predictive capabilities in this instance, even though its accuracy is lower than the others, because it makes predictions for all four scores.

In [132]:
logmodel_balanced = LogisticRegression(class_weight='balanced').fit(X_train,y_train)
logmodel_C100 = LogisticRegression(C=100).fit(X_train,y_train)
logmodel_default = LogisticRegression().fit(X_train,y_train)
logmodel_balanced_C100 = LogisticRegression(class_weight='balanced',C=100).fit(X_train,y_train)

predictions_balanced = logmodel_balanced.predict(X_test)
predictions_C100 = logmodel_C100.predict(X_test)
predictions_default = logmodel_default.predict(X_test)
predictions_balanced_C100 = logmodel_balanced_C100.predict(X_test)

print(classification_report(y_test, predictions_balanced))
print(classification_report(y_test, predictions_C100))
print(classification_report(y_test, predictions_default))
print(classification_report(y_test, predictions_balanced_C100))

             precision    recall  f1-score   support

          0       0.29      0.27      0.28       139
          1       0.40      0.48      0.44       318
          2       0.19      0.17      0.18       150
          3       0.75      0.71      0.73       787

avg / total       0.56      0.56      0.56      1394

             precision    recall  f1-score   support

          0       0.00      0.00      0.00       139
          1       0.40      0.38      0.39       318
          2       0.00      0.00      0.00       150
          3       0.65      0.90      0.76       787

avg / total       0.46      0.60      0.52      1394

             precision    recall  f1-score   support

          0       0.00      0.00      0.00       139
          1       0.39      0.38      0.39       318
          2       0.00      0.00      0.00       150
          3       0.65      0.90      0.75       787

avg / total       0.46      0.59      0.51      1394

             precision    recall  f1-

  'precision', 'predicted', average, warn_for)
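
To put the accuracy numbers in context, here is a small worked check using the class supports from the reports above (139, 318, 150, 787). A hypothetical baseline that predicts a score of 3 for every movie already gets roughly 56% accuracy on this split, while its balanced accuracy (the mean of the per-class recalls) is only 0.25.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class supports copied from the classification reports above (scores 0, 1, 2, 3)
support = [139, 318, 150, 787]
y_true = np.repeat([0, 1, 2, 3], support)

# Hypothetical baseline that predicts a score of 3 for every movie
y_pred_all_3 = np.full_like(y_true, 3)

baseline_accuracy = accuracy_score(y_true, y_pred_all_3)           # 787 / 1394
baseline_balanced = balanced_accuracy_score(y_true, y_pred_all_3)  # mean per-class recall

print(round(baseline_accuracy, 4), baseline_balanced)
```

This is why a model with about 60% accuracy that never predicts a 0 or a 2 is only a small step above always guessing 3.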


## K-Nearest Neighbors<a id='link6d'></a>

* https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
* https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/

In [118]:
# k nearest neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors = 7)

cv = StratifiedKFold(n_splits=5,shuffle=True)

scorer = make_scorer(accuracy_score, greater_is_better = True)

cross_val_score(knn, x,y, cv=cv, scoring=scorer).mean()

0.5618701327985455

## Random Forest<a id='link6e'></a>

In [120]:
random_forest = ske.RandomForestClassifier(n_estimators=50, max_depth=10)

In [121]:
scorer = make_scorer(accuracy_score, greater_is_better = True)

cross_val_score(random_forest, x,y, cv=cv, scoring=scorer).mean()

0.6059451002719577

In [122]:
random_forest_fitted = random_forest.fit(X_train,y_train)
predictions = random_forest_fitted.predict(X_test)
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.81      0.09      0.17       139
          1       0.43      0.38      0.40       318
          2       0.25      0.01      0.01       150
          3       0.65      0.90      0.76       787

avg / total       0.57      0.61      0.54      1394
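
A confusion matrix shows where the misclassified movies end up, which the report above only summarizes. The sketch below uses small made-up label arrays rather than the notebook's predictions; `confusion_matrix` is already imported earlier in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_toy = np.array([0, 0, 1, 1, 2, 3, 3, 3])
y_pred_toy = np.array([3, 0, 1, 3, 3, 3, 3, 1])

# Rows are true scores, columns are predicted scores
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2, 3])

# Recall per class = correct predictions on the diagonal / row total
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```

On the real data, the column for score 3 would show how many movies from each true class are being absorbed into the majority prediction.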



## Model Application to IMDb<a id='link6f'></a>

In this section, we will apply the best model to the IMDb dataset that does not have Bechdel scores.

The 'best' model from above was the logistic regression with balanced class weights (the cell below refits the `class_weight='balanced', C=100` variant). We refit it on all of the labeled data, not just the training split, so that when we apply it to the rest of the IMDb dataset for predictions, it has seen as many examples as possible.

In [134]:
# Re-fit the Logistic Regression model on all the data
logmodel_balanced_C100 = LogisticRegression(class_weight='balanced',C=100).fit(x,y)

In [125]:
# movies dataset is the one with everything except bechdel scores

model_application = movies

columns_to_drop = ['primaryRegion','Director','topBilled','genres',
                    'TopBilled2nd','TopBilled3rd',
                    'topBilledGender','secondTopBilledGender','thirdTopBilledGender',
                    'directorGender','numVotes','runtimeMinutes']

model_application = model_application.drop(columns_to_drop, axis=1)

Unnamed: 0,tconst,primaryTitle,startYear,averageRating,Action,Adventure,Comedy,Drama,SciFi,Horror,...,secondBilled_Male,secondBilled_Female,secondBilled_Unknown,thirdBilled_Male,thirdBilled_Female,thirdBilled_Unknown,director_Male,director_Female,director_Unknown,US_Region
8,tt0000009,Miss Jerry,1894,5.4,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,0
141,tt0000147,The Corbett-Fitzsimmons Fight,1897,5.2,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,1
232,tt0000335,Soldiers of the Cross,1900,6.2,0,0,0,1,0,0,...,1,0,0,0,0,1,0,0,1,0
330,tt0000574,The Story of the Kelly Gang,1906,6.3,0,0,0,1,0,0,...,1,0,0,1,0,0,1,0,0,0
346,tt0000615,Robbery Under Arms,1907,5.1,0,0,0,1,0,0,...,1,0,0,1,0,0,1,0,0,0


In [165]:
# Create a dataframe of the model_results with an additional column called "Bechdel_Predictions"

to_drop = ['tconst','primaryTitle']
prediction_data = model_application.drop(to_drop, axis=1)
IMDb_predictions = logmodel_balanced_C100.predict(prediction_data)

IMDb_Bechdel_predictions = pd.DataFrame({'tconst': model_application['tconst'], 
                                         'primaryTitle': model_application['primaryTitle'], 
                                         'primaryRegion': movies['primaryRegion'],
                                         'Director': movies['Director'],
                                         'topBilled': movies['topBilled'],
                                         'genres': movies['genres'],
                                         'year': movies['startYear'],
                                         'averageRating':movies['averageRating'],
                                         'numVotes':movies['numVotes'],
                                         'director_Female':movies['director_Female'],
                                         'Bechdel_Predictions': IMDb_predictions})

IMDb_Bechdel_predictions.to_csv('IMDb_Bechdel_predictions.csv', index=False)

# Outcomes <a id='link7'></a>

[Back to Top](#back to top)

## Predicted Dataset Overview<a id='link7a'></a>

In [151]:
IMDb_Bechdel_predictions.shape

(214192, 10)

In [152]:
IMDb_Bechdel_predictions.head()

Unnamed: 0,Bechdel_Predictions,Director,averageRating,genres,numVotes,primaryRegion,primaryTitle,tconst,topBilled,year
8,3,Alexander Black,5.4,Romance,62,\N,Miss Jerry,tt0000009,Blanche Bayliss,1894
141,0,Enoch J. Rector,5.2,"Documentary,News,Sport",245,US,The Corbett-Fitzsimmons Fight,tt0000147,James J. Corbett,1897
232,3,,6.2,"Biography,Drama",32,AU,Soldiers of the Cross,tt0000335,Beatrice Day,1900
330,3,Charles Tait,6.3,"Biography,Crime,Drama",431,HU,The Story of the Kelly Gang,tt0000574,Elizabeth Tait,1906
346,1,Charles MacMahon,5.1,Drama,12,AU,Robbery Under Arms,tt0000615,Jim Gerald,1907


In [153]:
# Search by IMDb ID
IMDb_Bechdel_predictions[IMDb_Bechdel_predictions['tconst'] == 'tt0362270'].head()

Unnamed: 0,Bechdel_Predictions,Director,averageRating,genres,numVotes,primaryRegion,primaryTitle,tconst,topBilled,year
188596,1,Wes Anderson,7.3,"Adventure,Comedy,Drama",154786,US,The Life Aquatic with Steve Zissou,tt0362270,Bill Murray,2004


In [154]:
# Search by Movie title
IMDb_Bechdel_predictions[IMDb_Bechdel_predictions['primaryTitle'].str.contains("Matrix")==True]

Unnamed: 0,Bechdel_Predictions,Director,averageRating,genres,numVotes,primaryRegion,primaryTitle,tconst,topBilled,year
92025,0,,8.7,"Action,Sci-Fi",1374962,,The Matrix,tt0133093,Keanu Reeves,1999
132511,3,Udo Blass,4.0,Drama,165,\N,Sex Files: Sexual Matrix,tt0224086,Jason Schnuit,2000
136210,0,,7.2,"Action,Sci-Fi",453664,ES,The Matrix Reloaded,tt0234215,Keanu Reeves,2003
139360,1,,6.7,"Action,Sci-Fi",393448,FR,The Matrix Revolutions,tt0242653,Keanu Reeves,2003
480654,3,Greg Becker,6.8,Documentary,152,US,The Living Matrix,tt1499960,Adam Dreamhealer,2009
496006,0,B.A. Brooks,6.1,"Documentary,News",14,US,The American Matrix: Age of Deception,tt1595473,George Bruch,2009
530606,1,Josh Oreck,7.1,\N,15,US,The Matrix Reloaded Revisited,tt1830850,Tanveer K. Atwal,2004
530607,1,Josh Oreck,5.8,\N,18,US,The Matrix Revolutions Revisited,tt1830851,Josh Oreck,2004
563297,0,Alex Jones,4.9,Documentary,35,US,Matrix of Evil,tt2121323,Alex Jones,2003
691739,3,Korak Day,7.9,Documentary,14,US,Matrix of Love,tt4413244,Korak Day,2008


In [155]:
# Search by Director Name
IMDb_Bechdel_predictions[IMDb_Bechdel_predictions['Director'].str.contains("Wes Anderson")==True]

Unnamed: 0,Bechdel_Predictions,Director,averageRating,genres,numVotes,primaryRegion,primaryTitle,tconst,topBilled,year
81851,1,Wes Anderson,7.1,"Comedy,Crime,Drama",59709,,Bottle Rocket,tt0115734,Luke Wilson,1996
90159,1,Wes Anderson,7.7,"Comedy,Drama",147722,,Rushmore,tt0128445,Jason Schwartzman,1998
149071,3,Wes Anderson,7.6,"Comedy,Drama",230136,PE,The Royal Tenenbaums,tt0265666,Gene Hackman,2001
188596,1,Wes Anderson,7.3,"Adventure,Comedy,Drama",154786,US,The Life Aquatic with Steve Zissou,tt0362270,Bill Murray,2004
217232,1,Wes Anderson,7.8,"Adventure,Animation,Comedy",159087,NO,Fantastic Mr. Fox,tt0432283,George Clooney,2009
362081,1,Wes Anderson,7.2,"Adventure,Comedy,Drama",152991,PL,The Darjeeling Limited,tt0838221,Owen Wilson,2007
519070,1,Wes Anderson,7.8,"Adventure,Comedy,Drama",268384,GR,Moonrise Kingdom,tt1748122,Jared Gilman,2012
580168,1,Wes Anderson,8.1,"Adventure,Comedy,Drama",570448,IT,The Grand Budapest Hotel,tt2278388,Ralph Fiennes,2014


In [156]:
# Search by topBilled name
IMDb_Bechdel_predictions[IMDb_Bechdel_predictions['topBilled'].str.contains("Tom Cruise")==True]

Unnamed: 0,Bechdel_Predictions,Director,averageRating,genres,numVotes,primaryRegion,primaryTitle,tconst,topBilled,year
56678,2,Michael Chapman,5.9,"Drama,Romance,Sport",13422,FR,All the Right Moves,tt0085154,Tom Cruise,1983
57271,1,Curtis Hanson,4.9,"Comedy,Drama",3398,FR,Losin' It,tt0085868,Tom Cruise,1983
57541,1,Paul Brickman,6.8,"Comedy,Crime,Drama",67962,DE,Risky Business,tt0086200,Tom Cruise,1983
60273,2,Ridley Scott,6.5,"Adventure,Fantasy,Romance",50889,CA,Legend,tt0089469,Tom Cruise,1985
62488,2,Tony Scott,6.9,"Action,Drama,Romance",242057,FR,Top Gun,tt0092099,Tom Cruise,1986
64844,2,Roger Donaldson,5.8,"Comedy,Drama,Romance",67408,US,Cocktail,tt0094889,Tom Cruise,1988
66609,1,Oliver Stone,7.2,"Biography,Drama,War",84367,BR,Born on the Fourth of July,tt0096969,Tom Cruise,1989
68645,2,Tony Scott,5.9,"Action,Drama,Sport",66641,,Days of Thunder,tt0099371,Tom Cruise,1990
72624,2,Ron Howard,6.5,"Adventure,Drama,Romance",51759,,Far and Away,tt0104231,Tom Cruise,1992
72646,1,Rob Reiner,7.7,"Drama,Thriller",200128,,A Few Good Men,tt0104257,Tom Cruise,1992
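
The four searches above repeat the same pattern, so they can be wrapped in one small helper. This is a sketch, not part of the original notebook: `search_predictions` is a hypothetical name, and the tiny `sample` frame only mimics a few rows of `IMDb_Bechdel_predictions`. Passing `na=False` to `str.contains` treats missing values as non-matches, which replaces the `== True` comparison used above.

```python
import pandas as pd

def search_predictions(frame, column, text):
    """Return rows whose `column` contains `text`, ignoring case and missing values."""
    mask = frame[column].str.contains(text, case=False, na=False, regex=False)
    return frame[mask]

# Hypothetical miniature of the predictions table
sample = pd.DataFrame({
    'primaryTitle': ['The Matrix', 'Rushmore', 'The Matrix Reloaded'],
    'Director': [None, 'Wes Anderson', None],
    'Bechdel_Predictions': [0, 1, 0],
})
print(search_predictions(sample, 'primaryTitle', 'matrix'))
print(search_predictions(sample, 'Director', 'Wes Anderson'))
```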


In [157]:
# histogram of bechdel scores for predicted values

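counts_by_score = IMDb_Bechdel_predictions['Bechdel_Predictions'].value_counts().sort_index()
bechdel_counts = counts_by_score.tolist()
# The x_range below is categorical, so the scores must be strings to line up with it
bechdel_scores = [str(score) for score in counts_by_score.index]

# Use one color per bar; all ColumnDataSource columns should have the same length
source = ColumnDataSource(data=dict(scores=bechdel_scores, counts=bechdel_counts,
                                    color=Spectral6[:len(bechdel_scores)]))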

p = figure(x_range=('0','1','2','3'), y_range=(0,max(bechdel_counts)*1.1), plot_height=250, title="Bechdel Score Counts",
           toolbar_location=None, tools="")

p.vbar(x='scores', top='counts', width=0.9, color='color', legend="scores", source=source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)



In [158]:
# For each year, plot the average bechdel scores within that year
# http://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/master/quickstart/quickstart.ipynb

# Assign with bracket notation; attribute-style assignment does not update the column
IMDb_Bechdel_predictions['Bechdel_Predictions'] = IMDb_Bechdel_predictions['Bechdel_Predictions'].astype(int)

grouped = IMDb_Bechdel_predictions.groupby("year")
bechdelScore = grouped["Bechdel_Predictions"]
bechdelScore_avg = bechdelScore.mean()
bechdelScore_median = bechdelScore.median()

# Take the years from the aggregated result so x and y are guaranteed to line up
years = bechdelScore_avg.index.tolist()

p = figure(title="Bechdel Score Average by Year")

p.circle(x=years, y=bechdelScore_avg, size=10, alpha=0.5,
         color="red")

p.triangle(x=years, y=bechdelScore_median, size=10, alpha=0.5,
         color="blue")

show(p)
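
Because the score is ordinal, a yearly mean or median can hide how the predictions are distributed. A complementary view is the share of movies per year predicted to score 3, meaning they pass all three criteria. The sketch below shows that aggregation on a hypothetical miniature table; the column names `passes`, `mean_score`, `share_passing`, and `n_movies` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical miniature of the predictions table
predictions = pd.DataFrame({
    'year': [1999, 1999, 2003, 2003, 2003, 2003],
    'Bechdel_Predictions': [0, 3, 0, 1, 3, 3],
})

yearly = (
    predictions
    .assign(passes=predictions['Bechdel_Predictions'].eq(3))  # True when predicted score is 3
    .groupby('year')
    .agg(mean_score=('Bechdel_Predictions', 'mean'),
         share_passing=('passes', 'mean'),
         n_movies=('passes', 'size'))
)
print(yearly)
```

Keeping `n_movies` next to the share makes it easy to spot early years where a handful of films drives the whole data point.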



![yearly percent](https://user-images.githubusercontent.com/30674288/36275071-accaefcc-123e-11e8-8b6f-60c5dfce6bac.png)

![bechdel score by director gender](https://user-images.githubusercontent.com/30674288/36279656-97a61a78-124c-11e8-98db-df4e9730cfd4.png)

 <a id='link7b'></a>
## Key Directors and Actors

![Image](https://user-images.githubusercontent.com/30674288/36269940-51bed2ba-122f-11e8-8244-6138c60affb4.png)

![Image](https://user-images.githubusercontent.com/30674288/36269939-519eb4a8-122f-11e8-9e3b-c147753ee282.png)

![Image](https://user-images.githubusercontent.com/30674288/36269937-5169db70-122f-11e8-9bac-46d3ead5dbc3.png)

![Image](https://user-images.githubusercontent.com/30674288/36274749-932da880-123d-11e8-85d6-4ed97643b864.png)

<a id='link7c'></a>
<a id='mapafter'></a>
## Regional Map 

[to see map before predictions - click here](#mapbefore)

![Image](https://user-images.githubusercontent.com/30674288/36272414-550bc75a-1236-11e8-836f-85c932ced371.png)

<a id='link8'></a>
# Takeaways / Conclusions 

[Back to Top](#back to top)

<a id='link8a'></a>
## What is the Point?
* We are adding to the literature and making sure that the representation of women in movies is being measured.
* We're using fairly complicated machine learning models to predict something that everyone can relate to.

<a id='link8b'></a>
## Limitations
This analysis has many limitations. There are points of bias, ways the model can be improved, and things about the data that will likely never change.

Some areas of bias are:
* The people who submit reviews for movies are inherently the kinds of people who know about the Bechdel test. If that selection doesn't come with bias, then I don't know what does. These people are probably more likely to watch movies that are expected to come up as a '3' on the scale, which would explain why our dataset is heavily skewed towards 3's.
* The IMDb dataset is not comprehensive, so whatever selection of movies is available there is taken as the truth for what movies exist in the world.
* I only looked at movies, but Bechdel scores exist for other types of film, such as TV series, short films, and others. By not including them, I'm leaving out some potential learning opportunities for myself and the model.

This predictive classification model will be enhanced as I learn more about model building, and continue to iterate on this problem. Some specific areas of improvement are listed below. Hopefully others who are interested in this topic can further build upon it to improve model features and parameters.

In general, the results of this analysis should not be taken as gospel. The model's estimated accuracy is only around 55%, which leaves plenty of room for improvement. As IMDb adds more data, or as people find ways to merge other relevant data into this table, we can add features that improve the model. For now, these predictions should be taken with a grain of salt and used as motivation to keep learning about machine learning, feminism, movies, and Python!
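
To put an accuracy figure in context, it helps to compare it with the trivial strategy of always predicting the most common score. The arithmetic below uses the recall and support columns of the classification report printed just before the Model Application section. Which model produced that report is not restated here, so treat this as an illustration of the calculation rather than a statement about the final model.

```python
# (recall, support) per score, copied from the classification report above
report = {
    0: (0.09, 139),
    1: (0.38, 318),
    2: (0.01, 150),
    3: (0.90, 787),
}

total = sum(support for _, support in report.values())

# Accuracy equals the support-weighted average of per-class recall
# (approximate here, because the report's recalls are rounded to two decimals)
approx_accuracy = sum(recall * support for recall, support in report.values()) / total

# Baseline: always predict the most frequent score
majority_baseline = max(support for _, support in report.values()) / total

print(f"approximate accuracy: {approx_accuracy:.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

On that test split, always answering '3' would already be right for 787 of 1,394 movies, so any accuracy number should be read against that baseline and alongside the per-class recall.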

<a id='link8c'></a>   
## Review of Concepts Covered
* Bokeh
* API Calls
* Pandas Dataframe manipulation
* Speed measurement/optimization
* Gender_guesser
* Bechdel test
* Feminism
* Machine learning models

<a id='link8d'></a>    
## Next Steps to Expand This Analysis
* Bring in new data even more often
* Profit from my predictions (gotta find a way to make this a profitable enterprise, right?)
* Downsample the training set so it is less dominated by 3's and the class distribution is more even
* Predict the Bechdel score as a float, then round it to the nearest integer to get the class
* GridSearchCV
* RFECV (Feature selection)
* Gradient Boosted Trees (GradientBoostingClassifier)
* Automated Feature Generation - use decision tree branches as features in another model
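
As a starting point for the downsampling idea in the list above, here is a minimal sketch that keeps as many rows per score as the rarest score has. It was not part of the original analysis: the helper `downsample_to_smallest_class`, the column name `bechdel_score`, and the toy data are all hypothetical, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

def downsample_to_smallest_class(frame, label_column, random_state=0):
    """Randomly keep as many rows per class as the rarest class has."""
    smallest = frame[label_column].value_counts().min()
    return frame.groupby(label_column).sample(n=smallest, random_state=random_state)

# Hypothetical, deliberately skewed training frame
training = pd.DataFrame({
    'averageRating': [5.4, 6.2, 7.1, 7.7, 8.1, 6.5, 5.9, 7.3],
    'bechdel_score': [0, 1, 3, 3, 3, 3, 1, 0],
})
balanced = downsample_to_smallest_class(training, 'bechdel_score')
print(balanced['bechdel_score'].value_counts().sort_index())
```

Downsampling throws away data from the largest class, so it is worth comparing against the `class_weight='balanced'` approach already used above before committing to it.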

# Thank you for learning with me