# Working with Bechdel Test JSON

In [28]:
from IPython.display import Image
Image(url='https://upload.wikimedia.org/wikipedia/en/b/bf/Dykes_to_Watch_Out_For_%28Bechdel_test_origin%29.jpg')

We retrieved the Bechdel Test data as a JSON file through the API documented at https://bechdeltest.com/api/v1/doc. This notebook shows how we import the JSON file, reorganize the data, and export it as a CSV file that is easier to use in the analysis.

The Bechdel Test rating is on a 0-3 scale. Only a rating of 3 passes the full test.

- 0: There are fewer than two women with names in the movie
- 1: There are at least two women with names in the movie
- 2: There are at least two women with names in the movie and they talk to each other
- 3: There are at least two women with names in the movie and they talk to each other about something other than a man
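
The scale above can be sketched as a small lookup plus a pass/fail helper. This is only an illustration: `RATING_LABELS` and `passes_bechdel` are hypothetical names that are not part of the API or of the rest of this notebook.

```python
# Hypothetical labels paraphrasing the 0-3 scale described above.
RATING_LABELS = {
    0: "fewer than two named women",
    1: "at least two named women",
    2: "at least two named women who talk to each other",
    3: "at least two named women who talk to each other "
       "about something other than a man",
}


def passes_bechdel(rating: int) -> bool:
    """Return True only when all three criteria are met (rating 3)."""
    if rating not in RATING_LABELS:
        raise ValueError(f"rating must be between 0 and 3, got {rating!r}")
    return rating == 3
```

A movie rated 2 meets the first two criteria but still fails the test, so `passes_bechdel(2)` returns `False`.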

In [3]:
import json
import csv
import pandas as pd

with open('../Data/bechdel_test_data/bechdel_test_all_movies.json', encoding='utf-8', mode='rt') as bechdel_movies:
    bechdel_in = json.load(bechdel_movies)


# included the header in advance
bechdel_csv = [['bechdel_id', 'imdb_id', 'rating', 'title', 'year']]
bechdel_convert = []

# items() turns each movie dict into (key, value) tuples
# sorting those tuples by key gives every row the same column order:
# id, imdbid, rating, title, year (matching the header above)
# the next loop then reads the key and value from each tuple by index
for movie in bechdel_in:
    row_movie = []
    for item in movie.items():
        row_movie.append(item)
    row_movie.sort()
    bechdel_convert.append(row_movie)

# prefix the ID with 'tt' to match the IMDb tconst format
for movie in bechdel_convert:
    row_movie = []
    for item in movie:
        if item[0] == 'imdbid':
            row_movie.append('tt' + item[1])
        else:
            row_movie.append(item[1])
    bechdel_csv.append(row_movie)

#print(bechdel_csv[0:200])

with open('bechdel.csv', encoding='utf-8', mode='wt', newline='') as bechdel_write:
    bechdel_out = csv.writer(bechdel_write)
    bechdel_out.writerows(bechdel_csv)
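
As an alternative to sorting the key/value tuples, the same reorganization can be sketched with `csv.DictWriter`, which writes columns by name rather than by position. This is a minimal sketch under assumptions: the key `imdbid` appears in the code above, while the remaining keys (`id`, `rating`, `title`, `year`) are inferred from the sorted header order. `COLUMN_MAP`, `movies_to_csv`, and `sample_movies` are hypothetical names, and the two sample records are taken from the output shown further down.

```python
import csv
import io

# Assumed JSON key -> CSV column name. Only 'imdbid' is confirmed by the
# code above; the other keys are inferred from the sorted header order.
COLUMN_MAP = {
    "id": "bechdel_id",
    "imdbid": "imdb_id",
    "rating": "rating",
    "title": "title",
    "year": "year",
}

# Two records in the assumed API layout, with keys deliberately out of order.
sample_movies = [
    {"year": 1888, "title": "Roundhay Garden Scene", "rating": 0,
     "imdbid": "0392728", "id": 8040},
    {"title": "Execution of Mary, Queen of Scots, The", "id": 6200,
     "imdbid": "0132134", "rating": 0, "year": 1895},
]


def movies_to_csv(movies, out_file):
    """Write movie dicts to out_file as CSV with a fixed column order."""
    writer = csv.DictWriter(out_file, fieldnames=list(COLUMN_MAP.values()))
    writer.writeheader()
    for movie in movies:
        row = {COLUMN_MAP[key]: movie[key] for key in COLUMN_MAP}
        row["imdb_id"] = "tt" + row["imdb_id"]  # match the IMDb tconst format
        writer.writerow(row)


# Write to an in-memory buffer here; a real file object works the same way.
buffer = io.StringIO()
movies_to_csv(sample_movies, buffer)
csv_text = buffer.getvalue()
```

Because the columns are looked up by name, the key order inside each JSON record does not matter, and a record with a missing key raises a `KeyError` instead of silently shifting values into the wrong column.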

# Checking for Errors

In [11]:
# load the exported CSV into a pandas DataFrame
bechdel_df = pd.read_csv("bechdel.csv")

# the CSV header already supplies these names; set them explicitly as a safeguard
bechdel_df.columns = ['bechdel_id', 'imdb_id', 'rating', 'title', 'year']

bechdel_df.head()

   bechdel_id    imdb_id  rating                                   title  year
0        8040  tt0392728       0                   Roundhay Garden Scene  1888
1        5433  tt0000003       0                          Pauvre Pierrot  1892
2        5444  tt0000014       0           Tables Turned on the Gardener  1895
3        6200  tt0132134       0  Execution of Mary, Queen of Scots, The  1895
4        4982  tt0000091       0                 House of the Devil, The  1896


In [12]:
bechdel_df.shape

(7721, 5)

There are 7,721 rows in the Bechdel Test data. First, we'll check for duplicate imdb_id values. Grouping by imdb_id and counting the rows in each group shows which IDs appear more than once.

In [25]:
temp_bechdel1 = bechdel_df.groupby('imdb_id')['title'].count()
temp_bechdel1[temp_bechdel1 > 1]

imdb_id
tt0035279    2
tt0086425    2
tt0117056    2
tt2043900    2
tt2083355    2
tt2457282    2
Name: title, dtype: int64

This uncovered six imdb_id values that each appear twice, which will have to be checked and removed. We did this in SQL during our data filtering and cleaning process, using the IMDb title.principals table. We then exported a deduplicated copy (bechdel_test_updated.csv) of the Bechdel Test data and worked from that file going forward.
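
The cleanup itself was done in SQL, so the following is not the method we used. It is only a hedged pandas sketch of how duplicated IDs could be inspected and dropped. The rows in `toy_df` are invented for illustration and are not the real duplicates.

```python
import pandas as pd

# Invented rows: one imdb_id (tt0000002) appears twice.
toy_df = pd.DataFrame({
    "bechdel_id": [1, 2, 3, 4],
    "imdb_id": ["tt0000001", "tt0000002", "tt0000002", "tt0000003"],
    "rating": [3, 1, 1, 2],
    "title": ["A", "B", "B", "C"],
    "year": [2000, 2001, 2001, 2002],
})

# keep=False flags every row of a duplicated group, so both copies
# can be compared side by side before deciding which one to keep.
dupe_rows = toy_df[toy_df.duplicated(subset="imdb_id", keep=False)]

# drop_duplicates keeps the first row for each imdb_id by default.
deduped = toy_df.drop_duplicates(subset="imdb_id")
```

Inspecting `dupe_rows` first matters when the two copies disagree (for example on rating), because `drop_duplicates` keeps whichever row happens to come first.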

Let's also check for duplicates in bechdel_id to see if there are any errors there.

In [26]:
temp_bechdel2 = bechdel_df.groupby('bechdel_id')['title'].count()
temp_bechdel2[temp_bechdel2 > 1]

Series([], Name: title, dtype: int64)

There are no duplicates in the bechdel_id column. We won't check for duplicates in the title column because different movies can share a title, and remakes often reuse the original's name.
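
To illustrate why a shared title is not evidence of a duplicate, here is a small sketch that counts distinct release years per title. The rows in `toy_titles` are invented and the variable names are hypothetical.

```python
import pandas as pd

# Invented rows: two different movies share the title "Example Film".
toy_titles = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["Example Film", "Example Film", "Another Film"],
    "year": [1950, 2010, 1999],
})

# Count how many distinct release years share each title.
years_per_title = toy_titles.groupby("title")["year"].nunique()

# Titles used by movies from more than one year (e.g. remakes).
shared_titles = years_per_title[years_per_title > 1]
```

Here the two "Example Film" rows have different imdb_id values and years, so they are separate movies rather than duplicate records.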