## Pre-processing the Movies and TV dataset from https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

This pre-processing has two goals:
1. Add a `Gender` attribute, so the dataset can be used to check whether certain models are fair with regard to the gender of individuals.
2. Add a `cover` attribute that stores a link to an image of the respective movie or TV show.

### Step 1
Get required packages

In [53]:
!pip install ijson
!pip install pandas
!pip install gender-guesser
!pip install requests



In [54]:
import json
import ijson
from pathlib import Path
from datetime import datetime
import pandas as pd
import gender_guesser.detector as gender
import requests
import os


### Step 2
Load json file with movie and tv show data


In [3]:
cwd = Path.cwd()
fpath = cwd.parents[0] / "movie_tv_shows_dataset" / 'data' / 'Movies_and_TV_5.json'
with open(fpath, 'r') as file:
    df= pd.read_json(file, lines=True)
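
The review file holds well over a million rows, so reading it in one call can need a lot of memory. A minimal sketch of a chunked alternative, assuming the same JSON Lines layout. The helper name `load_reviews`, the `chunksize` value, and the `columns` argument are illustrative and not part of the original notebook:

```python
import pandas as pd


def load_reviews(path, chunksize=100_000, columns=None):
    """Read a JSON Lines file in chunks and return a single DataFrame.

    `columns` (hypothetical convenience argument) keeps only the listed
    columns from each chunk, which reduces peak memory use.
    """
    chunks = []
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames.
    # dtype={"asin": str} asks pandas to keep ASINs as strings; a chunk that
    # happens to contain only all-digit ASINs could otherwise be read as numbers.
    with pd.read_json(path, lines=True, chunksize=chunksize,
                      dtype={"asin": str}) as reader:
        for chunk in reader:
            if columns is not None:
                chunk = chunk[columns]
            chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)
```

With the paths from the cell above, this could be called as `df = load_reviews(fpath, columns=['overall', 'asin', 'reviewerName'])` if only those columns are needed.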

In [51]:
x=df.iloc[10]
print(x)
# Check to see if ratings are valid
invalid_ratings = df[(df['overall'] < 1) | (df['overall'] > 5)]

if invalid_ratings.empty:
    print("All ratings are within the range of 1 to 5.")
else:
    print("There are ratings outside the range of 1 to 5.")
    print("Invalid ratings:")
    print(invalid_ratings)


overall                                                           5
verified                                                       True
reviewTime                                              10 24, 2009
reviewerID                                           A1E7VTRDMI4XMV
asin                                                     0005092663
style                                           {'Format:': ' DVD'}
reviewerName                                                 Harold
reviewText        Really excellent piece of work showing the bel...
summary                            Freedom and Religon under attack
unixReviewTime                                           1256342400
vote                                                            NaN
image                                                           NaN
Gender                                                         male
Name: 23, dtype: object
All ratings are within the range of 1 to 5.


### Step 3 Adding a gender column
Use the [gender-guesser](https://pypi.org/project/gender-guesser/) package to infer a gender for each reviewer from their first name.

In [38]:
# Initialize the gender detector
detector = gender.Detector()

# Remove rows without names
df.dropna(subset=['reviewerName'], inplace=True)  

# print(detector.get_gender("Jennifer"))

# Apply the gender detector to the first word of each "reviewerName".
# get_gender returns one of: male, female, mostly_male, mostly_female, andy, unknown
df['Gender'] = df['reviewerName'].apply(lambda x: detector.get_gender(x.split()[0]))
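
Taking the first whitespace-separated token raises an error for names that are empty, contain only spaces, or are not strings. A minimal sketch of a more defensive lookup. `first_name_gender` and its `lookup` parameter are hypothetical names, and `toy_lookup` is a stand-in so the sketch runs without the gender-guesser package:

```python
def first_name_gender(name, lookup):
    """Return lookup(first_token) for a reviewer name, or 'unknown'.

    `lookup` is any callable mapping a first name to a label, for example
    `detector.get_gender` from the cell above.
    """
    if not isinstance(name, str):
        return "unknown"
    tokens = name.split()
    if not tokens:
        return "unknown"
    return lookup(tokens[0])


def toy_lookup(first_name):
    """Stand-in for detector.get_gender, used only to make this sketch runnable."""
    return {"Harold": "male", "Jennifer": "female"}.get(first_name, "unknown")


print(first_name_gender("Harold K.", toy_lookup))  # male
print(first_name_gender("   ", toy_lookup))        # unknown
```

In the notebook, the same helper could be applied with `df['reviewerName'].apply(first_name_gender, lookup=detector.get_gender)`.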


In [46]:
# We want male and female, so replace mostly_male with male and mostly_female with female
replacements = {'mostly_male': 'male', 'mostly_female': 'female'}
# Assign the result back: a chained replace(..., inplace=True) on df['Gender']
# triggers a FutureWarning and will stop working in pandas 3.0
df['Gender'] = df['Gender'].replace(replacements)
# Remove rows where gender wasn't identified (rows labelled 'andy' are kept)
df = df[df['Gender'] != 'unknown']


In [56]:
print(len(df))

1554957
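
gender-guesser can also return `andy` for androgynous names, and a `!= 'unknown'` filter keeps those rows. If only `male` and `female` are wanted, a sketch along these lines could be used. The toy DataFrame and the helper name `keep_binary_gender` are illustrative only:

```python
import pandas as pd


def keep_binary_gender(frame, column="Gender"):
    """Collapse 'mostly_*' labels and keep only rows labelled male or female."""
    replacements = {"mostly_male": "male", "mostly_female": "female"}
    out = frame.copy()  # leave the caller's DataFrame untouched
    out[column] = out[column].replace(replacements)
    return out[out[column].isin(["male", "female"])]


# Toy data: the labels are written by hand, not produced by gender-guesser
toy = pd.DataFrame({
    "reviewerName": ["A", "B", "C", "D", "E"],
    "Gender": ["male", "mostly_male", "unknown", "andy", "mostly_female"],
})
print(keep_binary_gender(toy)["Gender"].value_counts())
```

Printing `value_counts()` before and after the filter is a quick way to see how many rows each label contributes.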


### Step 4 Adding an image representing the cover of the movie or tv show
This is done using the ASIN (Amazon Standard Identification Number) attribute of each row.

In [63]:
# Keep only rows with a valid ASIN; to be valid it has to be 10 characters long.
# .copy() avoids a SettingWithCopyWarning when the new column is added below.
df = df[df['asin'].apply(lambda x: len(str(x)) == 10)].copy()

# Add a new column to hold the links to the covers
df['cover'] = 'https://images.amazon.com/images/P/' + df['asin'] + '.jpg'

# check it worked
print(df['cover'].iloc[5])

http://images.amazon.com/images/P/0005092663.jpg
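
The length check accepts any 10-character string. A slightly stricter sketch, assuming ASINs consist of ten uppercase letters or digits (an assumption made for illustration, not taken from the dataset documentation). `is_plausible_asin` and `cover_url` are hypothetical helper names:

```python
import re

# Assumption: an ASIN is exactly ten characters drawn from A-Z and 0-9
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")
COVER_BASE = "https://images.amazon.com/images/P/"


def is_plausible_asin(value):
    """Return True if value is a string that looks like a ten-character ASIN."""
    return isinstance(value, str) and ASIN_PATTERN.fullmatch(value) is not None


def cover_url(asin):
    """Build the cover link used in Step 4 for a single ASIN."""
    return f"{COVER_BASE}{asin}.jpg"


print(is_plausible_asin("0005092663"))  # True
print(is_plausible_asin("12345"))       # False
print(cover_url("0005092663"))          # https://images.amazon.com/images/P/0005092663.jpg
```

Applied to the DataFrame, the filter would read `df[df['asin'].apply(is_plausible_asin)]`.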


### Step 5 Download the covers into a folder using the links

In [76]:
# url="https://images.amazon.com/images/P/0005019282.jpg"

headers ={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

fipath = cwd.parents[0] / "movie_tv_shows_dataset" / 'images'
fipath.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists

# Many reviews share an ASIN, so request each cover only once
covers = df.drop_duplicates(subset='asin')
for asin, cover in zip(covers['asin'], covers['cover']):
    filename = os.path.join(fipath, asin + ".jpg")
    # Skip covers that were already downloaded
    if os.path.exists(filename):
        continue
    response = requests.get(cover, headers=headers, timeout=10)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Save the image to a file
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"Image downloaded successfully: {filename}")
    else:
        print(response)
        print(f"Failed to download image from {cover}")

<Response [404]>
Failed to download image from http://images.amazon.com/images/P/0792836529.jpg
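
Printing one line per cover gets noisy at this scale, and a single network error would stop the loop. A sketch that returns a summary instead. `download_covers` is a hypothetical helper, and `session` is assumed to be any object with a `requests`-style `get(url, timeout=...)` method, such as `requests.Session()`:

```python
from pathlib import Path


def download_covers(asins, out_dir, session,
                    base_url="https://images.amazon.com/images/P/"):
    """Download one cover per unique ASIN; return (saved, failed) lists of ASINs.

    `session` is assumed to expose a requests-style get(url, timeout=...)
    method returning an object with `status_code` and `content`.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    # dict.fromkeys drops duplicate ASINs while keeping their first-seen order
    for asin in dict.fromkeys(asins):
        target = out_dir / f"{asin}.jpg"
        if target.exists():
            continue  # already downloaded in an earlier run
        try:
            response = session.get(f"{base_url}{asin}.jpg", timeout=10)
        except OSError:
            # requests.RequestException derives from OSError (IOError)
            failed.append(asin)
            continue
        if response.status_code == 200:
            target.write_bytes(response.content)
            saved.append(asin)
        else:
            failed.append(asin)
    return saved, failed
```

In the notebook this might be used as `saved, failed = download_covers(df['asin'], fipath, session)`, after creating `session = requests.Session()` and calling `session.headers.update(headers)`. The `failed` list can then be inspected or written to a file rather than scrolled through in the cell output.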