# DATA 512 - Final Project Plan 

## Submitted by: Vamsy Atluri 

<br>

### Project Statement - Gender Diversity in Movies

I have often found that movies are among the most powerful mediums of expression, given their reach across a huge section of the population. It is an industry which often brings topics into social discussion by portraying them on screen. I come from India, where gender stereotypes are more strictly enforced and, with a few exceptions, women in movies are more often than not arm candy. At the time, American movies which portrayed strong women characters were a widely different and refreshing experience.

Nowadays, the industry and media claim that gender diversity is improving and more and more women are playing major roles than ever. I wish to use the data science skills I have learnt over the past year and analyze this claim in more detail. How gender diverse are movies really in terms of actors on the screen? I will be comparing the total number of actresses to the total number of actors over a time period of the last few decades.

### Why is this important?

Movies, like any other industry, are aimed at making money. Companies like Google, Facebook, Amazon etc. have started to enforce gender equality requirements, trying to make the workplace more diverse rather than remaining stereotypically male-dominated. The aim of this project is to see where the movie industry falls when looking at the people acting in its films. We know that it was a male-dominated industry for several decades and might still be. But has the share of women on screen changed over the years?

My hope is that movie makers, historians, film studies majors, cultural scientists and people who study society at large will be able to derive a variety of answers from this analysis, either directly or by building on it.

### What are the human-centered aspects of this project?

The human-centered aspects are the analysis of data aimed at seeing how diversity in the movie industry has changed over time, and the visualization of the results showing the trends.

### Data Sources

The primary data sources used in this project are from IMDB and are hosted [here](https://datasets.imdbws.com/). A description of the datasets and their fields is given [here](https://www.imdb.com/interfaces/).

Three of the available data sources (title.basics.tsv.gz, name.basics.tsv.gz, title.principals.tsv.gz) are, in my opinion, sufficient for the purposes of this project at the moment, and these are the ones I plan to work on. If they prove inadequate, I will augment them with the other data sources.

### Data Handling Tools

Since the data is made available as .tsv (tab-separated values) files, Python's [pandas](https://pandas.pydata.org/) library will be useful and adequate to create subset dataframes and perform my analysis. Also, the sizes of these datasets are in the range of a few hundred MB, which makes it convenient to stick to pandas rather than having to use Big Data tools like Hadoop, Pig etc.

### Data Structure

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing for that title/name --- ([reference](https://www.imdb.com/interfaces/)).
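As a minimal sketch of how the `\N` convention could be handled at load time, the snippet below reads a tiny in-memory sample laid out like title.basics and asks pandas to treat `\N` as missing through the `na_values` parameter. The sample string and variable names are illustrative assumptions and are not part of the project code.

```python
import io

import pandas as pd

# Small in-memory stand-in for a title.basics-style TSV file (illustrative only).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\n"
    "tt0000004\tshort\tUn bon bock\t1892\t\\N\t\\N\n"
)

# sep="\t" parses tab-separated values; na_values marks the literal \N as missing.
titles_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])

# Count missing values per column to confirm that \N was recognised.
missing_per_column = titles_sample.isna().sum()
print(missing_per_column)
```

With the real files, the same `sep` and `na_values` arguments could be passed when reading the gzipped TSVs, which would turn `\N` into NaN instead of leaving it as a string.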

I will be showing the structure of all the relevant sources below.

In [12]:
import gzip
import pandas as pd

In [None]:
movie_title_details = pd.read_table(gzip.open("title.basics.tsv.gz"))
actor_details = pd.read_table(gzip.open("name.basics.tsv.gz"))
movie_crew_details = pd.read_table(gzip.open("title.principals.tsv.gz"))

The title.basics.tsv.gz file lists all the titles (movies, shorts, TV episodes etc.) in the IMDB database. The title, genre etc. are not required for my analysis; the most relevant columns are <b>tconst</b>, which is a unique identifier for each title, and the year columns.

In [24]:
movie_title_details.shape

(5424031, 9)

As seen, there are 5,424,031 titles overall to consider, spanning all title types and not only feature films.

In [25]:
movie_title_details.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
| 1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
| 2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
| 3 | tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | \N | Animation,Short |
| 4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |


The titleType column is also important, since it can be used to eliminate entries such as short movies, video games etc.

In [27]:
movie_title_details.titleType.unique()

array(['short', 'movie', 'tvMovie', 'tvSeries', 'tvEpisode', 'tvShort',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame'], dtype=object)
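The filtering step described above can be sketched as follows. The sample rows are hypothetical, and keeping only `movie` is an assumption made for illustration; the final analysis might also keep types such as `tvMovie`.

```python
import pandas as pd

# Hypothetical rows in the layout of title.basics (only the columns needed here).
titles = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2", "tt3", "tt4"],
        "titleType": ["short", "movie", "videoGame", "movie"],
        "startYear": [1894, 1990, 2001, 2005],
    }
)

# Assumption: only full-length theatrical movies are kept for the analysis.
keep_types = ["movie"]

# Boolean mask: True for rows whose titleType is in the allowed list.
feature_films = titles[titles["titleType"].isin(keep_types)]
print(feature_films)
```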

The name.basics.tsv.gz file lists all the professionals in the IMDB database. We will be using the primaryProfession column to filter only entries that have actor/actress as the first entry. Each entry has a unique identifier <b>nconst</b>.

In [28]:
actor_details.shape

(8973383, 6)

In [20]:
actor_details.head()

| | nconst | primaryName | birthYear | deathYear | primaryProfession | knownForTitles |
|---|---|---|---|---|---|---|
| 0 | nm0000001 | Fred Astaire | 1899 | 1987 | soundtrack,actor,miscellaneous | tt0050419,tt0072308,tt0043044,tt0053137 |
| 1 | nm0000002 | Lauren Bacall | 1924 | 2014 | actress,soundtrack | tt0117057,tt0037382,tt0071877,tt0038355 |
| 2 | nm0000003 | Brigitte Bardot | 1934 | \N | actress,soundtrack,producer | tt0059956,tt0057345,tt0049189,tt0054452 |
| 3 | nm0000004 | John Belushi | 1949 | 1982 | actor,writer,soundtrack | tt0072562,tt0080455,tt0078723,tt0077975 |
| 4 | nm0000005 | Ingmar Bergman | 1918 | 2007 | writer,director,actor | tt0060827,tt0050986,tt0050976,tt0083922 |
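A rough sketch of the primaryProfession filter mentioned above, using the five rows shown in the preview. The variable names are my own, and checking only the first listed profession is the rule stated in this plan rather than anything required by IMDB.

```python
import pandas as pd

# The five preview rows, reduced to the columns needed for the filter.
people = pd.DataFrame(
    {
        "nconst": ["nm0000001", "nm0000002", "nm0000003", "nm0000004", "nm0000005"],
        "primaryName": [
            "Fred Astaire",
            "Lauren Bacall",
            "Brigitte Bardot",
            "John Belushi",
            "Ingmar Bergman",
        ],
        "primaryProfession": [
            "soundtrack,actor,miscellaneous",
            "actress,soundtrack",
            "actress,soundtrack,producer",
            "actor,writer,soundtrack",
            "writer,director,actor",
        ],
    }
)

# Take the first comma-separated profession for each person.
first_profession = people["primaryProfession"].str.split(",").str[0]

# Keep only people whose first listed profession is actor or actress.
performers = people[first_profession.isin(["actor", "actress"])]
print(performers["primaryName"].tolist())
```

Note that this rule drops Fred Astaire, whose first listed profession is `soundtrack`. Whether to match `actor`/`actress` in any position instead is a design decision to revisit during the analysis.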


The title.principals.tsv.gz file brings the two files above together by listing each movie and its crew as a list of entries with the unique identifiers tconst and nconst identifying each tuple.

In [21]:
movie_crew_details.head()

| | tconst | ordering | nconst | category | job | characters |
|---|---|---|---|---|---|---|
| 0 | tt0000001 | 1 | nm1588970 | self | \N | ["Herself"] |
| 1 | tt0000001 | 2 | nm0005690 | director | \N | \N |
| 2 | tt0000001 | 3 | nm0374658 | cinematographer | director of photography | \N |
| 3 | tt0000002 | 1 | nm0721526 | director | \N | \N |
| 4 | tt0000002 | 2 | nm1335271 | composer | \N | \N |


The category column can be used to filter only actors and actresses.

In [30]:
movie_crew_details.category.unique()

array(['self', 'director', 'cinematographer', 'composer', 'producer',
       'editor', 'actor', 'actress', 'writer', 'production_designer',
       'archive_footage', 'archive_sound'], dtype=object)

### Gender limitations

As listed on the IMDB guidelines page for gender [here](https://help.imdb.com/article/contribution/filmography-credits/cast-acting-credits-guidelines/GH3JZC74FVYKKFMD#gender), only acting credits are separated by gender.

The classification is binary: actor and actress are the only available types. But the page also states that <b>each listing is based on the '<u>individual's chosen gender</u>'</b>. While this is not inclusive of people who do not identify with the male/female classification, this is the best dataset I could find that does classify people based on their chosen gender.

### Limitations of the analysis

Once again, due to the nature of the dataset, gender information can only be obtained for actors and actresses, not for other people in the crew like directors, producers, cinematographers etc.

Hence the project scope will be limited only to actors and actresses in front of the camera rather than those working behind the scenes.

### Conclusion

While this project and analysis will not prove to be as exhaustive or definitive as past works on analyzing gender diversity, most of which use the Bechdel Test (like [this](http://poly-graph.co/bechdel/) one), the scope is different in that we treat the movie industry as more of a workplace for people.

The aim is to compare pure employment numbers for people who appear on screen, extras and side actors included. Regular companies often claim to have a [diverse workforce](https://informationisbeautiful.net/visualizations/diversity-in-tech/) in terms of gender even if this diversity is restricted to the lower rungs of the organization and does not extend to the upper tiers, which hold the major decision makers.

By applying this same measure to the movie industry, I wish to see if it performs any better.

### Licensing

As stated on this [page](https://www.imdb.com/interfaces/), the datasets provided by IMDB can be used for personal and non-commercial use.

Local copies of the data can also be stored.

IMDB prohibits data mining, robots, screen scraping, or similar data gathering and extraction tools. None of these techniques have been used to access the data, which has been downloaded from IMDB's public dataset [link](https://datasets.imdbws.com/).

More information about non-commercial licensing is given on the IMDB [website](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=BMZ4YZDQZXG5J9Y3H7RJ&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1#).

### References

A list of references I used in creating this report, covering both the subject matter and technical knowledge.

1. https://womenintvfilm.sdsu.edu/files/2014-15_Boxed_In_Report.pdf

2. https://informationisbeautiful.net/visualizations/diversity-in-tech/

3. http://poly-graph.co/bechdel/

4. https://www.imdb.com/interfaces/

5. https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=BMZ4YZDQZXG5J9Y3H7RJ&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1#

6. https://datasets.imdbws.com/

7. https://help.imdb.com/article/contribution/filmography-credits/cast-acting-credits-guidelines/GH3JZC74FVYKKFMD#gender

8. https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

9. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html