## We are going to pre-process the Movies and TV dataset from https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

We pre-process the dataset for two reasons:
1. To add a gender attribute, so the dataset can be used to check whether certain models are fair with regard to the gender of individuals.
2. To add a cover attribute, which will store an image of the respective movie or TV show.

### Step 1
Get required packages

In [11]:
!pip install ijson
!pip install pandas
!pip install gender-guesser

Collecting gender-guesser
  Downloading gender_guesser-0.4.0-py2.py3-none-any.whl.metadata (3.0 kB)
Downloading gender_guesser-0.4.0-py2.py3-none-any.whl (379 kB)
Installing collected packages: gender-guesser
Successfully installed gender-guesser-0.4.0


In [12]:
import json
import ijson
from pathlib import Path
from datetime import datetime
import pandas as pd
import gender_guesser.detector as gender


### Step 2
Load the JSON file with the movie and TV show data.


In [3]:
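# The dataset sits in a sibling directory of the current working directory
cwd = Path.cwd()
fpath = cwd.parents[0] / "movie_tv_shows_dataset" / 'data' / 'Movies_and_TV_5.json'
# The file is in JSON Lines format (one review per line), hence lines=True
with open(fpath, 'r') as file:
    df = pd.read_json(file, lines=True)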
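
`lines=True` tells pandas that the file is in JSON Lines format, with one JSON object per line. For a file of this size it can help to stream the records instead of loading everything at once. The sketch below uses only the standard library; `iter_reviews` and the sample records are hypothetical and made up for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_reviews(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed review per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Hypothetical two-record sample standing in for Movies_and_TV_5.json
sample = io.StringIO(
    '{"overall": 5.0, "reviewerName": "Harold", "asin": "0005092663"}\n'
    '\n'
    '{"overall": 4.0, "reviewerName": "Jane Doe", "asin": "0000000000"}\n'
)

reviews = list(iter_reviews(sample))
print(len(reviews))
```

With the real file, `open(fpath, 'r')` would take the place of `sample`. The pandas counterpart is `pd.read_json(fpath, lines=True, chunksize=...)`, which returns an iterator of DataFrames.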

In [51]:
# Inspect a single review to see the available fields
x = df.iloc[10]
print(x)
# Check that every rating lies in the valid range of 1 to 5
invalid_ratings = df[(df['overall'] < 1) | (df['overall'] > 5)]

if invalid_ratings.empty:
    print("All ratings are within the range of 1 to 5.")
else:
    print("There are ratings outside the range of 1 to 5.")
    print("Invalid ratings:")
    print(invalid_ratings)


overall                                                           5
verified                                                       True
reviewTime                                              10 24, 2009
reviewerID                                           A1E7VTRDMI4XMV
asin                                                     0005092663
style                                           {'Format:': ' DVD'}
reviewerName                                                 Harold
reviewText        Really excellent piece of work showing the bel...
summary                            Freedom and Religon under attack
unixReviewTime                                           1256342400
vote                                                            NaN
image                                                           NaN
Gender                                                         male
Name: 23, dtype: object
All ratings are within the range of 1 to 5.
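
One caveat with the comparison above: `NaN < 1` and `NaN > 5` are both false, so a missing rating would pass the check unnoticed. A possible variant, sketched here on a tiny made-up frame (`toy` is not part of the dataset), uses `Series.between`, which is inclusive on both ends and returns `False` for missing values.

```python
import pandas as pd

# Made-up ratings for illustration: one too high, one missing
toy = pd.DataFrame({"overall": [5.0, 1.0, 6.0, float("nan")]})

# Original-style check: comparisons with NaN are False, so NaN slips through
by_comparison = toy[(toy["overall"] < 1) | (toy["overall"] > 5)]

# between() is False for NaN, so negating it flags missing ratings as well
by_between = toy[~toy["overall"].between(1, 5)]

print(len(by_comparison), len(by_between))
```

Whether missing ratings should count as invalid is a design choice; the second form simply makes them visible.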


### Step 3 Adding a gender column
Use the [gender-guesser](https://pypi.org/project/gender-guesser/) package to infer each reviewer's gender from the first name in the `reviewerName` column.

In [38]:
# Initialize the gender detector
detector = gender.Detector()

# Remove rows without names
df.dropna(subset=['reviewerName'], inplace=True)  

# print(detector.get_gender("Jennifer"))

# Apply the gender detector to the "reviewerName" column
df['Gender'] = df['reviewerName'].apply(lambda x: detector.get_gender(x.split()[0]))
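
Note that `x.split()[0]` raises `IndexError` when a name consists only of whitespace, and gender-guesser appears to match names case-sensitively by default, so `harold` and `Harold` may be treated differently. A defensive helper could look like the sketch below; `first_name` is a hypothetical name, not part of the original notebook or of gender-guesser.

```python
from typing import Optional


def first_name(reviewer_name: str) -> Optional[str]:
    """Return the first whitespace-separated token, capitalized, or None."""
    tokens = reviewer_name.split()
    if not tokens:
        return None  # empty or whitespace-only name
    # capitalize() upper-cases the first letter and lower-cases the rest
    return tokens[0].capitalize()


print(first_name("harold smith"))
print(first_name("   "))
```

It could be applied with `df['reviewerName'].apply(first_name)`, dropping the `None` results before calling `detector.get_gender`.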



In [46]:
# We only want male and female, so map mostly_male to male and mostly_female to female
replacements = {'mostly_male': 'male', 'mostly_female': 'female'}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Gender'] = df['Gender'].replace(replacements)
# Remove rows where the gender wasn't identified
df = df[df['Gender'] != 'unknown']
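
gender-guesser returns one of six labels: `male`, `female`, `mostly_male`, `mostly_female`, `andy` (androgynous) and `unknown`. The filter above removes only `unknown`, so `andy` rows remain. If a strictly binary column is wanted, one option is an explicit mapping. `BINARY_LABELS` and `to_binary_gender` below are hypothetical names, and dropping `andy` is an assumption about the intended behavior.

```python
from typing import Optional

# Labels gender-guesser can return, mapped to a binary label or None (drop)
BINARY_LABELS = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
    "andy": None,     # androgynous name: assumed to be dropped
    "unknown": None,  # name not found
}


def to_binary_gender(label: str) -> Optional[str]:
    """Collapse a gender-guesser label to 'male', 'female', or None."""
    return BINARY_LABELS.get(label)


labels = ["male", "mostly_female", "andy", "unknown"]
print([to_binary_gender(label) for label in labels])
```

On the DataFrame, `df['Gender'].map(BINARY_LABELS)` followed by `df.dropna(subset=['Gender'])` would apply the same mapping.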


In [47]:
print(len(df))

1554957


