# FEMALE REPRESENTATION IN MOVIES


<img src="titanic.jpg" width="15%" height="15%">
<img src="gravity.jpg" width="15%" height="15%">
<img src="panther.jpg" width="15%" height="15%">

## Introduction

### Motivation

Movies have become a pervasive societal influence. They are a form of mass communication: information or ideology, in the form of art, is distributed simultaneously to a large number of people. Because they are so instrumental in shaping viewers' opinions, it is crucial that the messages they send are the sort we want to see in our society. As we read about movies and the social issues related to them, we grew increasingly interested in gender bias and how it is monitored and measured in the film industry.

* In 2014, globally, there were 2.24 male characters for every 1 female character.
* Out of a total of 5,799 speaking or named characters, 30.9 percent were female and 69.1 percent were male.
* Women made up 7 percent of directors, 19.7 percent of writers, and 22.7 percent of producers.

You can read more about such statistics [here](https://www.huffingtonpost.com/soraya-chemaly/20-mustknow-facts-about-g_b_5869564.html).

### The Bechdel Test
As we did more research, we came across a metric known as the Bechdel Test. Originally proposed in 1985 by Alison Bechdel, it is used to "grade" movies on their representation of women.

To pass, a movie must satisfy all three of the following criteria:
1. Have at least two named women in it
2. Who talk to each other
3. About something besides a man

A film is given a score from 0 to 3 based on how many of the Bechdel test criteria it satisfies, taken in order. For example, a film with fewer than two named women gets a score of 0, a film with two named women who don't speak to each other gets a 1, a film with at least two named women who talk to each other only about a man gets a 2, and a film in which they talk about something besides a man gets a 3 and passes the test.
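
The scoring rule can be written down as a small function. This is only a sketch of the rule as described here, not the code used by bechdeltest.com; the function and parameter names are hypothetical.

```python
def bechdel_score(named_women: int, women_talk: bool, about_something_else: bool) -> int:
    """Return how many Bechdel criteria a film satisfies, checked in order (0 to 3)."""
    if named_women < 2:
        return 0  # criterion 1 fails: fewer than two named women
    if not women_talk:
        return 1  # criterion 2 fails: the women never talk to each other
    if not about_something_else:
        return 2  # criterion 3 fails: they only talk about a man
    return 3      # all three criteria satisfied


def passes_bechdel(score: int) -> bool:
    """A film passes only when it satisfies all three criteria."""
    return score == 3


# A film with a single named woman cannot get past the first criterion.
print(bechdel_score(1, False, False))  # 0
print(passes_bechdel(bechdel_score(2, True, True)))  # True
```

Because the criteria are checked in order, a film cannot earn credit for a later criterion while failing an earlier one.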

The test itself has been questioned as a metric for gender bias: is it accurate, sufficient, and indicative of equal representation of women in film? For example, we found it interesting that Gravity fails the test even though its protagonist, played by Sandra Bullock, has a well-developed plotline and background. There are no other women in the movie, so it cannot satisfy the criteria. On the other hand, the blockbuster Titanic is a movie about Rose's journey and her growing into herself, yet it passes because of conversations in which her mother and her friends gossip about another female character. We felt that this movie passed for the 'wrong' reasons.

We then decided to focus our project on understanding why such gender bias arises, i.e. what factors influence whether or not a movie passes the Bechdel Test?

Our two key questions were: 
1. What makes a movie likely to pass or fail the Bechdel test? 
2. Is the Bechdel test a good metric for evaluating female representation?

In [2]:
# Set up by importing all the necessary packages
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Dataset
### Data Collection

We found a [website](https://bechdeltest.com/) with a crowd-sourced list of movies and their Bechdel test scores. We used their API to create a CSV file of ~7500 movies and their Bechdel test scores. The API also provided each movie's IMDB ID, the unique identifier for each movie on [IMDB.com](http://www.imdb.com) -- useful for combining this data with other information about the movies.

Because IMDB doesn't have a public API, we used an API provided by [The Movie Database](https://www.themoviedb.org/?language=en) (TMDB) to get details about each movie's production and success. We used the API to connect a movie's IMDB ID to its TMDB ID, and then sent requests to its endpoints to get each movie's details, credits, and the similar movies that TMDB would recommend to someone who liked it. We then went through each movie's first ten cast members, directors, and writers and sent requests to obtain their details.
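
As a rough illustration of that lookup chain, the sketch below only builds the request URLs and sends nothing. It is not the code in `get_movies.py`; the helper names and the `"YOUR_KEY"` placeholder are hypothetical, and the paths are the TMDB v3 endpoints as we understand them (`/find`, `/movie/{id}`, `/movie/{id}/credits`, `/movie/{id}/recommendations`, `/person/{id}`). It also assumes IMDB IDs are stored as plain integers, as they appear in our CSV, so they need the `tt` prefix and zero padding.

```python
from urllib.parse import urlencode

TMDB_BASE = "https://api.themoviedb.org/3"


def imdb_tt_id(imdb_id):
    """Format a numeric IMDB ID (e.g. 120338) in the 'tt'-prefixed, zero-padded form."""
    return f"tt{int(imdb_id):07d}"


def tmdb_find_url(imdb_id, api_key):
    """URL that looks up the TMDB record matching an IMDB ID."""
    query = urlencode({"api_key": api_key, "external_source": "imdb_id"})
    return f"{TMDB_BASE}/find/{imdb_tt_id(imdb_id)}?{query}"


def tmdb_movie_urls(tmdb_id, api_key):
    """URLs for a movie's details, credits, and recommendations."""
    query = urlencode({"api_key": api_key})
    return {
        "details": f"{TMDB_BASE}/movie/{tmdb_id}?{query}",
        "credits": f"{TMDB_BASE}/movie/{tmdb_id}/credits?{query}",
        "recommendations": f"{TMDB_BASE}/movie/{tmdb_id}/recommendations?{query}",
    }


def tmdb_person_url(person_id, api_key):
    """URL for the details of one cast or crew member."""
    return f"{TMDB_BASE}/person/{person_id}?{urlencode({'api_key': api_key})}"


# Titanic: IMDB ID 120338, TMDB ID 597
print(tmdb_find_url(120338, "YOUR_KEY"))
print(tmdb_movie_urls(597, "YOUR_KEY")["credits"])
```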

Our dataset about movies is hosted [here](https://github.com/shelly/pds-bechdel-test/blob/master/movies.csv), generated by our code available [here](https://github.com/shelly/pds-bechdel-test/blob/master/get_movies.py).

Our dataset about people is hosted [here](https://github.com/shelly/pds-bechdel-test/blob/master/people.csv), generated by our code available [here](https://github.com/shelly/pds-bechdel-test/blob/master/get_people.py).   

In [42]:
# Load the movie data 
movie_data = pd.read_csv("https://raw.githubusercontent.com/shelly/pds-bechdel-test/master/movies.csv?token=AFdj5s79coVYnuD-37TFX_bgvLMRlD1Dks5a_cjkwA%3D%3D").set_index('TMDB_ID').drop(columns=['Overview']) 
movie_data.loc[[268896, 284054, 49047, 597, 15121]]


| TMDB_ID | Unnamed: 0 | Title | IMDB_ID | Year | Bechdel_Rating | Budget | Popularity | Revenue | Genres | Cast | Crew | Recommendations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 268896 | 7652 | Pacific Rim: Uprising | 2557478 | 2018 | 3 | 150000000 | 39.495549 | 286536960 | `[{'id': 28, 'name': 'Action'}, {'id': 14, 'nam...` | `[{'id': 236695, 'order': 0, 'character': 'Jake...` | `[{'id': 10828, 'department': 'Production', 'jo...` | `[333339, 338970, 299536, 427641, 284054]` |
| 284054 | 7641 | Black Panther | 1825683 | 2018 | 3 | 200000000 | 361.506277 | 1325776812 | `[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...` | `[{'id': 172069, 'order': 0, 'character': "T'Ch...` | `[{'id': 1376891, 'department': 'Art', 'job': '...` | `[299536, 284053, 333339, 141052, 181808]` |
| 49047 | 6146 | Gravity | 1454468 | 2013 | 0 | 105000000 | 20.679122 | 716392705 | `[{'id': 878, 'name': 'Science Fiction'}, {'id'...` | `[{'id': 18277, 'order': 0, 'character': 'Dr. R...` | `[{'id': 11218, 'department': 'Directing', 'job...` | `[68724, 109424, 137113, 17654, 9693]` |
| 597 | 2694 | Titanic | 120338 | 1997 | 3 | 200000000 | 22.530842 | 1845034188 | `[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...` | `[{'id': 204, 'order': 0, 'character': 'Rose De...` | `[{'id': 2710, 'department': 'Directing', 'job'...` | `[8587, 425, 808, 12, 607]` |
| 15121 | 976 | The Sound of Music | 59742 | 1965 | 3 | 8200000 | 9.550476 | 286214286 | `[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...` | `[{'id': 5823, 'order': 0, 'character': 'Maria'...` | `[{'id': 1744, 'department': 'Directing', 'job'...` | `[433, 11113, 630, 872, 11708]` |


In [44]:
# Load the cast and crew data 
people_data = pd.read_csv("https://raw.githubusercontent.com/shelly/pds-bechdel-test/master/people.csv?token=AFdj5qKnfIZX5POf1pf7e5ndysWnq75Nks5a_clJwA%3D%3D").set_index('TMDB_ID')
people_data.loc[[54697, 10990, 488, 6884, 1932]]

| TMDB_ID | Unnamed: 0 | Name | Birthday | Deathday | Gender | Place of Birth | Popularity |
|---|---|---|---|---|---|---|---|
| 54697 | 54697 | Dave Franco | 1985-06-12 | | 2.0 | Palo Alto, California, USA | 3.095248 |
| 10990 | 10990 | Emma Watson | 1990-04-15 | | 1.0 | Paris, France | 11.092018 |
| 488 | 488 | Steven Spielberg | 1946-12-18 | | 2.0 | Cincinnati - Ohio - USA | 8.235147 |
| 6884 | 6884 | Patty Jenkins | 1971-07-24 | | 1.0 | Victorville - California - USA | 0.215638 |
| 1932 | 1932 | Audrey Hepburn | 1929-05-04 | 1993-01-20 | 1.0 | Ixelles, Belgium | 3.081924 |


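The movie data is indexed by `TMDB_ID`, and its columns are `Title`, `IMDB_ID`, `Year`, `Bechdel_Rating`, `Budget`, `Popularity`, `Revenue`, `Genres`, `Cast`, `Crew`, and `Recommendations`. The people data is also indexed by `TMDB_ID`, and its columns are `Name`, `Birthday`, `Deathday`, `Gender`, `Place of Birth`, and `Popularity`.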

### Exploratory Data Analysis

To find signal and patterns in our data, we explored the dataset we had curated in more detail. We looked at two views in particular:

* Bechdel score over time, grouped by decade
* Bechdel score per genre
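
A minimal sketch of how the per-decade and per-genre averages could be computed is shown here. It is not our plotting code: the helper names are hypothetical, and the rows are just the five sample movies displayed earlier, with only the first listed genre of each because the rest of the `Genres` value is truncated in that display.

```python
import ast
from collections import defaultdict

# Year, rating, and first listed genre of the five sample movies shown earlier.
rows = [
    {"Title": "Pacific Rim: Uprising", "Year": 2018, "Bechdel_Rating": 3,
     "Genres": "[{'id': 28, 'name': 'Action'}]"},
    {"Title": "Black Panther", "Year": 2018, "Bechdel_Rating": 3,
     "Genres": "[{'id': 28, 'name': 'Action'}]"},
    {"Title": "Gravity", "Year": 2013, "Bechdel_Rating": 0,
     "Genres": "[{'id': 878, 'name': 'Science Fiction'}]"},
    {"Title": "Titanic", "Year": 1997, "Bechdel_Rating": 3,
     "Genres": "[{'id': 18, 'name': 'Drama'}]"},
    {"Title": "The Sound of Music", "Year": 1965, "Bechdel_Rating": 3,
     "Genres": "[{'id': 18, 'name': 'Drama'}]"},
]


def mean_rating_by(rows, groups_of):
    """Average Bechdel rating per group; groups_of(row) lists the groups a row belongs to."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of ratings, number of movies]
    for row in rows:
        for group in groups_of(row):
            totals[group][0] += row["Bechdel_Rating"]
            totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def decade_of(row):
    """A movie belongs to exactly one decade, e.g. 1997 -> 1990."""
    return [row["Year"] // 10 * 10]


def genres_of(row):
    """The Genres column holds a stringified list of dicts, so parse it first."""
    return [genre["name"] for genre in ast.literal_eval(row["Genres"])]


by_decade = mean_rating_by(rows, decade_of)
by_genre = mean_rating_by(rows, genres_of)
print(by_decade)  # {2010: 2.0, 1990: 3.0, 1960: 3.0}
print(by_genre)   # {'Action': 3.0, 'Science Fiction': 0.0, 'Drama': 3.0}
```

With only five movies these averages say nothing about real trends; the point is the grouping logic, which applies unchanged to the full dataset.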

## Data Analysis

### Feature Engineering
We realized that training a model on the rudimentary features we had collated (such as year of release or names of directors) would not be enough: such models would be unable to correctly and consistently predict whether or not a movie passed the Bechdel Test.

Based on our EDA, we created new, improved features that we thought would better correlate with whether or not movies passed the Bechdel test. The file containing all the functions we wrote to create these features is available [here](GITHUB).

The new features included categorical variables indicating whether or not the first-billed actor was female and whether or not a female writer was involved with the script, as well as the average age of the directors.

The example below shows how we created a new feature to find the fraction of women in the writing crew.


In [3]:
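import ast

# Returns, for each movie, the fraction of women among its writers of known gender
def get_female_writing_score():
    crews = movie_data['Crew']
    # Start from NaN so movies without usable writer information stay undefined
    scores = np.full(len(crews), np.nan)
    for ind, crew in enumerate(crews):
        # read_csv loads each crew list as a string, so parse it into a list of dicts
        if isinstance(crew, str):
            crew = ast.literal_eval(crew)
        if not isinstance(crew, list):
            continue  # missing crew: leave the score as NaN
        writers = 0
        fem_writ = 0
        for mem in crew:
            if mem['department'] != 'Writing':
                continue
            person_id = int(mem['id'])
            if person_id not in people_data.index:
                continue
            person_gender = people_data.loc[person_id, 'Gender']
            # A gender of 0 (or a missing value) means unknown, so skip that writer
            if pd.isna(person_gender) or person_gender == 0:
                continue
            writers += 1
            # A gender of 1 marks a woman
            if person_gender == 1:
                fem_writ += 1
        # Only assign a score when at least one writer has a known gender
        if writers != 0:
            scores[ind] = fem_writ / writers
    return scores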
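
Another feature mentioned above, whether the first-billed actor is female, can be sketched along the same lines. This is an illustration rather than our actual feature code: the function name and the `gender_by_id` lookup are hypothetical, the two gender entries come from the people sample shown earlier, and the cast lists are made up for the example.

```python
import math


def first_billed_is_female(cast, gender_by_id):
    """Return 1.0 if the first-billed cast member is a woman, 0.0 if a man, NaN if unknown."""
    if not cast:
        return float("nan")
    # The first-billed actor is the cast entry with the smallest 'order'
    lead = min(cast, key=lambda member: member["order"])
    gender = gender_by_id.get(lead["id"])
    if gender == 1:
        return 1.0
    if gender == 2:
        return 0.0
    return float("nan")  # gender 0, missing, or person not in the lookup


# Gender codes taken from the people sample: Emma Watson (1), Dave Franco (2)
gender_by_id = {10990: 1, 54697: 2}

# Made-up cast lists, in the same shape as entries of the Cast column
cast_a = [{"id": 54697, "order": 1}, {"id": 10990, "order": 0}]
cast_b = [{"id": 54697, "order": 0}, {"id": 10990, "order": 1}]

print(first_billed_is_female(cast_a, gender_by_id))  # 1.0
print(first_billed_is_female(cast_b, gender_by_id))  # 0.0
```

Returning NaN for unknown cases keeps "no information" distinct from "the lead is a man", which matters if the feature is later imputed or filtered.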

### Modeling

Our aim was to train a model that would accurately predict whether or not a movie would pass the Bechdel Test. We created models using the following algorithms:

* Support Vector Machines
  * RBF Kernel
  * Linear Kernel
* Decision Trees
  * Max-Depth 3
  * Max-Depth 4
* Gaussian Mixture Models
* Naive Bayes
  * Gaussian Naive Bayes
  * Multinomial Naive Bayes
  
In our dataset, 58% of movies passed the Bechdel Test, so a model that always predicts "pass" would be right 58% of the time; we treated this as the baseline. Our best model, which used an SVM, gave us a testing accuracy of 71%, 13 percentage points above that baseline. We found that similar movies behaved similarly on the Bechdel Test: a movie was more likely to pass if other movies similar to it also passed, and vice versa.
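
One way to turn the "similar movies behave similarly" observation into a feature is the share of a movie's TMDB recommendations that pass the test. The sketch below is an assumption about how such a feature might look, not a description of the features our models actually used; the function name is hypothetical, and the ratings lookup contains only the five sample movies shown earlier.

```python
import math


def recommended_pass_rate(recommendations, rating_by_id):
    """Fraction of a movie's recommended movies (with a known rating) that score 3."""
    known = [rating_by_id[rec] for rec in recommendations if rec in rating_by_id]
    if not known:
        return float("nan")  # none of the recommended movies is in our dataset
    return sum(1 for rating in known if rating == 3) / len(known)


# TMDB ID -> Bechdel rating for the five sample movies
rating_by_id = {268896: 3, 284054: 3, 49047: 0, 597: 3, 15121: 3}

# Pacific Rim: Uprising recommends Black Panther (284054), the only one in this lookup
pacific_rim_recs = [333339, 338970, 299536, 427641, 284054]
print(recommended_pass_rate(pacific_rim_recs, rating_by_id))  # 1.0

# None of Titanic's recommendations appear in this tiny lookup, so the result is NaN
titanic_recs = [8587, 425, 808, 12, 607]
print(math.isnan(recommended_pass_rate(titanic_recs, rating_by_id)))  # True
```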

## Conclusion

### Results
Our findings indicated that clusters of movies tend to pass or fail the test together. For example, movies that have at least one female writer are likely to pass, while Westerns are likely to fail.

Overall, our work supports the hypothesis that the Bechdel test is a good baseline metric for female representation. As we mentioned, the fact that movies with at least one female writer were more likely to pass the Bechdel Test indicates that the test is a reasonably good indicator. However, as mentioned before, the fact that Gravity does not pass but Titanic does raises doubt in our.....

### Future Work and Extensions

We found out about the [Mako Mori Test](http://geekfeminism.wikia.com/wiki/Mako_Mori_test), and believe it is a viable alternative that should be considered as a measure of female representation in film.