In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# Exploratory Data Analysis of the CineEthics Dataset
By Wanga Mulaudzi
<br>
23 February 2023
***
The datasets were found on [Kaggle](https://www.kaggle.com/datasets/).

## Import Statements

In [None]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [None]:
import ast
import cv2
from fuzzywuzzy import fuzz
import glob
from google.cloud import storage
from google.colab import auth
import io
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from PIL import Image
import random
import re
import seaborn as sns
import sys
from tqdm import tqdm
import zipfile

## Download the data
First, generate a Kaggle API token (on Kaggle, under Settings) and make sure it is stored at ```/root/.kaggle/kaggle.json```, which is where the Kaggle CLI looks for it.

In [None]:
!mkdir -p /root/.kaggle

# Upload your kaggle.json file to Google Drive, then move it into /root/.kaggle
# (Drive is mounted at /content/gdrive in the first cell of this notebook)
!mv /content/gdrive/MyDrive/kaggle.json /root/.kaggle
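
The same step can also be done from Python, which makes it easy to tighten the file permissions at the same time. The following is a minimal sketch: `install_kaggle_token` is a hypothetical helper (not part of the Kaggle package), it copies rather than moves the file, and the path in the usage comment assumes the token sits at the top level of the mounted Drive.

```python
import shutil
from pathlib import Path


def install_kaggle_token(src, dest_dir="/root/.kaggle"):
    """Copy kaggle.json into dest_dir and restrict it to owner read/write.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"No Kaggle token found at {src}")

    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / "kaggle.json"
    shutil.copy(src, dest)
    dest.chmod(0o600)  # owner read/write only
    return dest


# Example call (assumed Drive location of the token):
# install_kaggle_token("/content/gdrive/MyDrive/kaggle.json")
```

Raising `FileNotFoundError` early gives a clearer message than the later `Could not find kaggle.json` error from the Kaggle CLI.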

### Movie Identification Dataset (screengrabs)
[Link](https://www.kaggle.com/datasets/asaniczka/movie-identification-dataset-800-movies) to Kaggle dataset.

First, create a ```cine_ethics/data/``` folder in Google Drive where the Kaggle datasets can be stored before they are uploaded to Google Cloud.

Since this dataset is large, download the zip file locally and then upload it to the Google Cloud bucket from a terminal:

In [None]:
!gcloud auth login
!gsutil cp archive.zip gs://cine_ethics/data/

Once the upload has finished, create a virtual machine (VM) instance, SSH into it from a terminal, and download the archive from the bucket onto the VM:

In [None]:
!gcloud compute ssh --zone "europe-west1-b" "instance-20240302-120502" --project "ornate-lens-411311"
!gsutil cp gs://cine_ethics/data/archive.zip .

This tool needs to create the directory [/root/.ssh] before being able to 
generate SSH keys.

Do you want to continue (Y/n)?  Y

Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): ^C
AccessDeniedException: 403 annishared@gmail.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).


You can then unzip the file and reupload the unzipped contents to the bucket:

In [None]:
!sudo apt install unzip
!unzip archive.zip
!gsutil -m cp -r resized_frames/ gs://cine_ethics/data/

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unzip is already the newest version (6.0-26ubuntu3.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
unzip:  cannot find or open archive.zip, archive.zip.zip or archive.zip.ZIP.
CommandException: No URLs matched: resized_frames/
CommandException: 1 file/object could not be transferred.


### Cornell Movie Dialogs Corpus (Dialog Datasets)
[Link](https://www.kaggle.com/datasets/pandey881062/cornell-movie-dialogs-corpusdialog-datasets?select=movie_titles_metadata.txt) to Kaggle dataset.

In [None]:
!kaggle datasets download -d pandey881062/cornell-movie-dialogs-corpusdialog-datasets -p cine_ethics/data/cornel_corpus

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.10/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.10/dist-packages/kaggle/api/kaggle_api_extended.py", line 403, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


In [None]:
!unzip -qq cine_ethics/data/cornel_corpus/cornell-movie-dialogs-corpusdialog-datasets.zip -d cine_ethics/data/cornel_corpus

unzip:  cannot find or open cine_ethics/data/cornel_corpus/cornell-movie-dialogs-corpusdialog-datasets.zip, cine_ethics/data/cornel_corpus/cornell-movie-dialogs-corpusdialog-datasets.zip.zip or cine_ethics/data/cornel_corpus/cornell-movie-dialogs-corpusdialog-datasets.zip.ZIP.


Then upload the ```cornel_corpus``` directory to the ```cine_ethics``` bucket:

In [None]:
!gsutil -m cp -r cine_ethics/data/cornel_corpus gs://cine_ethics/data

AccessDeniedException: 403 annishared@gmail.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).


### AI Related Movies Dataset
These .csv files are IMDb list exports that were found via a Google search.

In [None]:
# Upload the directory to the bucket (-r is needed to copy a directory)
!gsutil -m cp -r cine_ethics/data/ai_movies_imdb gs://cine_ethics/data/

CommandException: No URLs matched: cine_ethics/data/ai_movies_imdb


## Loading the data
Now that the data is in a Google Cloud Storage bucket, we can load it into the notebook. First, we need to authorize Google Colab to access the project.

In [None]:
# Authenticate google colab with google cloud resources
auth.authenticate_user()

project_id = "ornate-lens-411311"
client = storage.Client(project=project_id)

bucket_name = "cine_ethics"
bucket = client.get_bucket(bucket_name)

### AI Related Movies Dataset

In [None]:
ai_movies_path = "data/ai_movies_imdb"

# Get blobs within the subfolder
blobs = bucket.list_blobs(prefix=ai_movies_path)

ai_related_movies_list = []

# Read each csv file into a dataframe
for blob in blobs:
    data_str = blob.download_as_string()
    data_df = pd.read_csv(io.BytesIO(data_str))

    ai_related_movies_list.append(data_df)

ai_related_movies_df = pd.concat(ai_related_movies_list, ignore_index=True)
ai_related_movies_df.head()

Unnamed: 0,Position,Const,Created,Modified,Description,Title,URL,Title Type,IMDb Rating,Runtime (mins),Year,Genres,Num Votes,Release Date,Directors,Your Rating,Date Rated
0,1,tt0470752,2018-11-19,2018-11-19,,Ex Machina,https://www.imdb.com/title/tt0470752/,movie,7.7,108.0,2014.0,"Drama, Sci-Fi, Thriller",581563.0,2014-12-16,Alex Garland,,
1,2,tt1798709,2018-11-19,2018-11-19,,Her,https://www.imdb.com/title/tt1798709/,movie,8.0,126.0,2013.0,"Drama, Romance, Sci-Fi",663896.0,2013-10-12,Spike Jonze,,
2,3,tt0343818,2018-11-19,2018-11-19,,"I, Robot",https://www.imdb.com/title/tt0343818/,movie,7.1,115.0,2004.0,"Action, Mystery, Sci-Fi, Thriller",571470.0,2004-07-07,Alex Proyas,,
3,4,tt0212720,2018-11-19,2018-11-19,,A.I. Artificial Intelligence,https://www.imdb.com/title/tt0212720/,movie,7.2,146.0,2001.0,"Drama, Sci-Fi",321397.0,2001-06-26,Steven Spielberg,,
4,5,tt2209764,2018-11-19,2018-11-19,,Transcendence,https://www.imdb.com/title/tt2209764/,movie,6.2,119.0,2014.0,"Action, Drama, Sci-Fi, Thriller",237916.0,2014-04-10,Wally Pfister,,


In [None]:
# Number of rows, Number of columns
ai_related_movies_df.shape

(248, 17)

In [None]:
# Fraction of NaNs per column (0 = none missing, 1 = all missing)
ai_related_movies_df.isna().sum()/len(ai_related_movies_df)

Position          0.000000
Const             0.000000
Created           0.000000
Modified          0.000000
Description       0.995968
Title             0.000000
URL               0.000000
Title Type        0.000000
IMDb Rating       0.028226
Runtime (mins)    0.020161
Year              0.008065
Genres            0.000000
Num Votes         0.028226
Release Date      0.016129
Directors         0.032258
Your Rating       1.000000
Date Rated        1.000000
dtype: float64

In [None]:
# Description, Your Rating and Date Rated are mostly NaNs, and Position, Created and Modified
# are not needed for the analysis, so we can drop all six columns
ai_related_movies_df.drop(columns=["Position", "Created", "Modified", "Description", "Your Rating",
                                   "Date Rated"], inplace=True)

In [None]:
# Rename columns
columns = ["imdb_ID", "movie_title", "imdb_url", "title_type", "imdb_rating", "runtime_mins", "year", "genres",
           "num_votes", "release_date", "directors"]
ai_related_movies_df.columns = columns

In [None]:
# Lower case the movie_title column
ai_related_movies_df["movie_title"] = ai_related_movies_df["movie_title"].str.lower()

In [None]:
# Convert genres column to have each entry a list of strings, not just a string of characters
ai_related_movies_df["genres"] = ai_related_movies_df["genres"].apply(lambda x: x.split(","))

In [None]:
# Lower case the genres
ai_related_movies_df["genres"] = ai_related_movies_df["genres"].apply(lambda x: [genre.lower() for genre in x])
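
One subtlety in the two cells above: the IMDb export separates genres with a comma followed by a space, so splitting on `,` alone leaves a leading space on every genre after the first. The notebook strips that whitespace later, when the genre columns are merged. A small sketch of the effect, using the genre string from the first row of the export shown earlier:

```python
raw_genres = "Drama, Sci-Fi, Thriller"

# Splitting on "," alone keeps the space that follows each comma
naive = raw_genres.split(",")
print(naive)

# Splitting, stripping, and lower-casing in a single pass
clean = [genre.strip().lower() for genre in raw_genres.split(",")]
print(clean)
```

Doing the strip at split time would be an alternative to stripping during the merge; either way, the whitespace has to be removed before genres from the two sources can be compared.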

In [None]:
ai_related_movies_df.head()

Unnamed: 0,imdb_ID,movie_title,imdb_url,title_type,imdb_rating,runtime_mins,year,genres,num_votes,release_date,directors
0,tt0470752,ex machina,https://www.imdb.com/title/tt0470752/,movie,7.7,108.0,2014.0,"[drama, sci-fi, thriller]",581563.0,2014-12-16,Alex Garland
1,tt1798709,her,https://www.imdb.com/title/tt1798709/,movie,8.0,126.0,2013.0,"[drama, romance, sci-fi]",663896.0,2013-10-12,Spike Jonze
2,tt0343818,"i, robot",https://www.imdb.com/title/tt0343818/,movie,7.1,115.0,2004.0,"[action, mystery, sci-fi, thriller]",571470.0,2004-07-07,Alex Proyas
3,tt0212720,a.i. artificial intelligence,https://www.imdb.com/title/tt0212720/,movie,7.2,146.0,2001.0,"[drama, sci-fi]",321397.0,2001-06-26,Steven Spielberg
4,tt2209764,transcendence,https://www.imdb.com/title/tt2209764/,movie,6.2,119.0,2014.0,"[action, drama, sci-fi, thriller]",237916.0,2014-04-10,Wally Pfister


### Cornell Movie Dialogs Corpus
Note that the text files use ``` +++$+++ ``` as a separator.
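
As a quick illustration of the separator, the sketch below splits a single line. The sample line is rebuilt from the first row of `movie_lines_df` displayed later in this notebook, so treat its exact layout as an assumption about the raw file.

```python
SEPARATOR = " +++$+++ "
COLUMNS = ["lineID", "characterID", "movieID", "character_name", "utterance"]

# One line in the assumed layout of movie_lines.txt
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"

# str.split treats the separator literally, so "+" and "$" need no escaping
fields = sample_line.strip().split(SEPARATOR)
record = dict(zip(COLUMNS, fields))
print(record)
```

Splitting with `str.split` avoids a pitfall of `pd.read_csv`: a multi-character `sep` is interpreted there as a regular expression, in which `+` and `$` are special characters.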

In [None]:
# Prefix of the Cornell corpus files within the bucket
dir_cornel_corpus = "data/cornel_corpus/"

In [None]:
def file_to_df(blob, columns):
  """
  Function that takes in a blob and a list of column names for the blob's dataframe

  Returns:
    dataframe
  """
  # Download whole file as a string
  data_str = blob.download_as_string()

  # Decode the byte string to Unicode
  data_unicode = data_str.decode('latin1')

  # Read the text file and split each line using ' +++$+++ ' as the separator
  lines = []
  for line in data_unicode.splitlines():
    lines.append(line.strip().split(' +++$+++ '))

  return pd.DataFrame(lines, columns=columns)

In [None]:
# Get blobs within the subfolder
blobs = bucket.list_blobs(prefix=dir_cornel_corpus)

for blob in blobs:
  # movie_titles_metadata.txt
  if os.path.basename(blob.name) == "movie_titles_metadata.txt":
    columns = ["movieID", "movie_title", "movie_year", "imdb_rating", "num_votes", "genres"]
    movie_titles_metadata_df = file_to_df(blob, columns)

    # Convert genres column into list as opposed to string of a list
    movie_titles_metadata_df["genres"] = movie_titles_metadata_df["genres"].apply(ast.literal_eval)

  # movie_conversations.txt
  elif os.path.basename(blob.name) == "movie_conversations.txt":
    columns = ["characterID1", "characterID2", "movieID", "utterances"]
    movie_conversations_df = file_to_df(blob, columns)

    # Convert utterances column into list as opposed to string of a list
    movie_conversations_df["utterances"] = movie_conversations_df["utterances"].apply(ast.literal_eval)

  # movie_lines.txt
  elif os.path.basename(blob.name) == "movie_lines.txt":
    columns = ["lineID", "characterID", "movieID", "character_name", "utterance"]
    movie_lines_df = file_to_df(blob, columns)
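
The `ast.literal_eval` calls above are needed because every field read from the text files is a plain string, including the ones that look like Python lists. A minimal sketch, assuming the raw fields have the form shown in the comments:

```python
import ast

# As read from the text file: each value is one string, not a list
raw_genres = "['comedy', 'romance']"
raw_utterances = "['L194', 'L195', 'L196', 'L197']"

genres = ast.literal_eval(raw_genres)
utterances = ast.literal_eval(raw_utterances)

print(type(raw_genres).__name__, "->", type(genres).__name__)
print(genres[0], len(utterances))

# literal_eval only accepts Python literals, so arbitrary code is rejected
try:
    ast.literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
print("non-literal rejected:", rejected)
```

This is why `ast.literal_eval` is preferable to `eval` for parsing data files: it converts literals such as lists and strings but refuses function calls.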

Display the created dataframes.

In [None]:
movie_titles_metadata_df.head()

Unnamed: 0,movieID,movie_title,movie_year,imdb_rating,num_votes,genres
0,m0,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
1,m1,1492: conquest of paradise,1992,6.2,10421,"[adventure, biography, drama, history]"
2,m2,15 minutes,2001,6.1,25854,"[action, crime, drama, thriller]"
3,m3,2001: a space odyssey,1968,8.4,163227,"[adventure, mystery, sci-fi]"
4,m4,48 hrs.,1982,6.9,22289,"[action, comedy, crime, drama, thriller]"


In [None]:
movie_conversations_df.head()

Unnamed: 0,characterID1,characterID2,movieID,utterances
0,u0,u2,m0,"[L194, L195, L196, L197]"
1,u0,u2,m0,"[L198, L199]"
2,u0,u2,m0,"[L200, L201, L202, L203]"
3,u0,u2,m0,"[L204, L205, L206]"
4,u0,u2,m0,"[L207, L208]"


In [None]:
movie_lines_df.head()

Unnamed: 0,lineID,characterID,movieID,character_name,utterance
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.


In [None]:
# Create a dictionary with lineID as keys and {character_name: utterance} as values
movie_lines_dict = {}

for _, row in movie_lines_df.iterrows():
    movie_lines_dict[row['lineID']] = {row['character_name']: row['utterance']}
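
With this dictionary, a conversation's list of line IDs can be turned back into dialogue. The sketch below uses rows copied from the head of `movie_lines_df` shown above; the conversation itself and the helper `render_conversation` are hypothetical and only serve as an illustration.

```python
# Rows copied from the head of movie_lines_df
movie_lines_dict = {
    "L1045": {"BIANCA": "They do not!"},
    "L1044": {"CAMERON": "They do to!"},
    "L985": {"BIANCA": "I hope so."},
    "L984": {"CAMERON": "She okay?"},
}

# Hypothetical conversation: line IDs in speaking order
conversation = ["L984", "L985"]


def render_conversation(line_ids, lines_dict):
    """Turn a list of line IDs into 'SPEAKER: text' strings."""
    rendered = []
    for line_id in line_ids:
        for speaker, text in lines_dict[line_id].items():
            rendered.append(f"{speaker}: {text}")
    return rendered


for line in render_conversation(conversation, movie_lines_dict):
    print(line)
```

Keying the dictionary by `lineID` makes each lookup a constant-time operation, which matters because the utterance lists reference many line IDs per movie.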

Create one ```cornell_corpus``` dataframe.

In [None]:
# Define a custom aggregation function to concatenate utterances into one list
def concat_utterances(series):
    return sum(series, [])
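
`sum(series, [])` works because `sum` starts from the empty list and repeatedly applies `+`, which concatenates lists. A minimal sketch with two utterance lists taken from the `movie_conversations_df` preview above; the `itertools.chain` variant is an alternative shown for comparison, not what the notebook uses:

```python
from itertools import chain

# Two conversations from the same movie
utterance_lists = [["L194", "L195", "L196", "L197"], ["L198", "L199"]]

# sum() starts from the empty list and adds each list in turn
flattened = sum(utterance_lists, [])

# Same result without copying the growing list at every step
flattened_chain = list(chain.from_iterable(utterance_lists))

print(flattened)
print(flattened == flattened_chain)
```

Repeated list addition copies the accumulated list at each step, so `chain.from_iterable` tends to scale better when a movie has many conversations.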

In [None]:
# Group movie_conversations_df by movieID
grouped_movie_conversations_df = movie_conversations_df.groupby("movieID")["utterances"].agg(concat_utterances).reset_index()
grouped_movie_conversations_df.head()

Unnamed: 0,movieID,utterances
0,m0,"[L194, L195, L196, L197, L198, L199, L200, L20..."
1,m1,"[L2170, L2171, L2172, L2173, L2174, L2175, L21..."
2,m10,"[L12735, L12736, L12737, L12738, L12739, L1274..."
3,m100,"[L302360, L302361, L302362, L302363, L302364, ..."
4,m101,"[L303685, L303686, L303687, L303688, L303689, ..."


In [None]:
# Merge grouped_movie_conversations_df with movie_titles_metadata_df
merge_grouped_movie_conversations_movie_titles_metadata_df = pd.merge(grouped_movie_conversations_df,
                                                                      movie_titles_metadata_df, on="movieID")

# Drop movieID column
merge_grouped_movie_conversations_movie_titles_metadata_df.drop(columns="movieID", inplace=True)

merge_grouped_movie_conversations_movie_titles_metadata_df.head()

Unnamed: 0,utterances,movie_title,movie_year,imdb_rating,num_votes,genres
0,"[L194, L195, L196, L197, L198, L199, L200, L20...",10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
1,"[L2170, L2171, L2172, L2173, L2174, L2175, L21...",1492: conquest of paradise,1992,6.2,10421,"[adventure, biography, drama, history]"
2,"[L12735, L12736, L12737, L12738, L12739, L1274...",affliction,1997,6.9,7252,"[drama, mystery, thriller]"
3,"[L302360, L302361, L302362, L302363, L302364, ...",innerspace,1987,6.5,16854,"[action, adventure, comedy, crime, sci-fi]"
4,"[L303685, L303686, L303687, L303688, L303689, ...",the insider,1999,8.0,69660,"[biography, drama, thriller]"


In [None]:
merge_grouped_movie_conversations_movie_titles_metadata_df.shape

(617, 6)

In [None]:
# Check for NaNs
merge_grouped_movie_conversations_movie_titles_metadata_df.isna().sum() / len(merge_grouped_movie_conversations_movie_titles_metadata_df)

utterances     0.0
movie_title    0.0
movie_year     0.0
imdb_rating    0.0
num_votes      0.0
genres         0.0
dtype: float64

Check if there are any movies in ```merge_grouped_movie_conversations_movie_titles_metadata_df``` that are in ```ai_related_movies_df```.

In [None]:
# Merging
ai_movies_script_imdb_df = ai_related_movies_df.merge(merge_grouped_movie_conversations_movie_titles_metadata_df,
                                                      on="movie_title", how="left")
ai_movies_script_imdb_df.head()

Unnamed: 0,imdb_ID,movie_title,imdb_url,title_type,imdb_rating_x,runtime_mins,year,genres_x,num_votes_x,release_date,directors,utterances,movie_year,imdb_rating_y,num_votes_y,genres_y
0,tt0470752,ex machina,https://www.imdb.com/title/tt0470752/,movie,7.7,108.0,2014.0,"[drama, sci-fi, thriller]",581563.0,2014-12-16,Alex Garland,,,,,
1,tt1798709,her,https://www.imdb.com/title/tt1798709/,movie,8.0,126.0,2013.0,"[drama, romance, sci-fi]",663896.0,2013-10-12,Spike Jonze,,,,,
2,tt0343818,"i, robot",https://www.imdb.com/title/tt0343818/,movie,7.1,115.0,2004.0,"[action, mystery, sci-fi, thriller]",571470.0,2004-07-07,Alex Proyas,,,,,
3,tt0212720,a.i. artificial intelligence,https://www.imdb.com/title/tt0212720/,movie,7.2,146.0,2001.0,"[drama, sci-fi]",321397.0,2001-06-26,Steven Spielberg,,,,,
4,tt2209764,transcendence,https://www.imdb.com/title/tt2209764/,movie,6.2,119.0,2014.0,"[action, drama, sci-fi, thriller]",237916.0,2014-04-10,Wally Pfister,,,,,


Movies without a match in the Cornell corpus have NaNs in the columns that come from it, so we need to handle the NaNs.
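
A toy version of the merge shows where the `_x`/`_y` suffixes and the NaNs come from. The two small dataframes and their values are made up for illustration; `indicator=True` is an optional `merge` argument that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; the values are illustrative only
imdb_df = pd.DataFrame({
    "movie_title": ["ex machina", "the matrix"],
    "imdb_rating": [7.7, 8.7],
})
cornell_df = pd.DataFrame({
    "movie_title": ["the matrix"],
    "imdb_rating": ["8.70"],  # values parsed from the text files are strings
    "utterances": [["L1", "L2"]],
})

merged = imdb_df.merge(cornell_df, on="movie_title", how="left", indicator=True)

# Column names shared by both sides get _x (left) and _y (right) suffixes
print(merged.columns.tolist())

# Rows marked "left_only" have no match on the right, so those columns are NaN
print(merged[["movie_title", "utterances", "_merge"]])
```

Counting the `_merge` values is a quick way to see how many AI-related movies actually have a script in the corpus before any imputation is done.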

In [None]:
# Impute missing values (NaNs)
ai_movies_script_imdb_df.loc[ai_movies_script_imdb_df["utterances"].isna(), "utterances"] = "none"

In [None]:
# Dropping duplicates based on movie_title
ai_movies_script_imdb_df = ai_movies_script_imdb_df.drop_duplicates(subset="movie_title")

In [None]:
# Calculate average imdb rating
ai_movies_script_imdb_df["av_rating"] = ai_movies_script_imdb_df[["imdb_rating_x", "imdb_rating_y"]].astype(float).mean(axis=1)

In [None]:
# Define a function to merge, strip, and de-duplicate the genres columns
def merge_and_strip_lists(row):
    genres_x = row["genres_x"]
    genres_y = row["genres_y"]

    # If both genres_x and genres_y are lists, merge them
    if isinstance(genres_x, list) and isinstance(genres_y, list):
        merged_genres = genres_x + genres_y

    # Else if only one of them is a list (the other is NaN), keep that one
    elif isinstance(genres_x, list):
        merged_genres = genres_x

    elif isinstance(genres_y, list):
        merged_genres = genres_y

    # Neither side has genres, so propagate the missing value
    else:
        return genres_x

    # Strip white spaces and drop duplicates while preserving order
    return list(dict.fromkeys(genre.strip() for genre in merged_genres))

In [None]:
# Merge genres_x and genres_y lists and drop duplicates
# First strip white spaces
ai_movies_script_imdb_df["genres"] = ai_movies_script_imdb_df.apply(merge_and_strip_lists, axis=1)

In [None]:
# Calculate total votes
ai_movies_script_imdb_df["num_votes_x"] = ai_movies_script_imdb_df["num_votes_x"].fillna(0) # Fill with zeros if NaN
ai_movies_script_imdb_df["num_votes_y"] = ai_movies_script_imdb_df["num_votes_y"].fillna(0) # Fill with zeros if NaN
ai_movies_script_imdb_df["tot_votes"] = ai_movies_script_imdb_df["num_votes_x"] + ai_movies_script_imdb_df["num_votes_y"].astype(float)

In [None]:
# Drop columns that are no longer needed
ai_movies_script_imdb_df = ai_movies_script_imdb_df.drop(columns=["imdb_rating_x", "imdb_rating_y", "genres_x",
                                                                  "genres_y", "num_votes_x", "num_votes_y",
                                                                  "movie_year", "year"])

In [None]:
# Keep only movies
ai_movies_script_imdb_df = ai_movies_script_imdb_df[ai_movies_script_imdb_df["title_type"] == "movie"]

In [None]:
ai_movies_script_imdb_df.head()

Unnamed: 0,imdb_ID,movie_title,imdb_url,title_type,runtime_mins,release_date,directors,utterances,av_rating,genres,tot_votes
0,tt0470752,ex machina,https://www.imdb.com/title/tt0470752/,movie,108.0,2014-12-16,Alex Garland,none,7.7,"[drama, sci-fi, thriller]",581563.0
1,tt1798709,her,https://www.imdb.com/title/tt1798709/,movie,126.0,2013-10-12,Spike Jonze,none,8.0,"[drama, romance, sci-fi]",663896.0
2,tt0343818,"i, robot",https://www.imdb.com/title/tt0343818/,movie,115.0,2004-07-07,Alex Proyas,none,7.1,"[action, mystery, sci-fi, thriller]",571470.0
3,tt0212720,a.i. artificial intelligence,https://www.imdb.com/title/tt0212720/,movie,146.0,2001-06-26,Steven Spielberg,none,7.2,"[drama, sci-fi]",321397.0
4,tt2209764,transcendence,https://www.imdb.com/title/tt2209764/,movie,119.0,2014-04-10,Wally Pfister,none,6.2,"[action, drama, sci-fi, thriller]",237916.0


In [None]:
# Keep only the rows that have no NaN values in any column
ai_movies_script_imdb_df = ai_movies_script_imdb_df.dropna()
ai_movies_script_imdb_df.head()

Unnamed: 0,imdb_ID,movie_title,imdb_url,title_type,runtime_mins,release_date,directors,utterances,av_rating,genres,tot_votes
0,tt0470752,ex machina,https://www.imdb.com/title/tt0470752/,movie,108.0,2014-12-16,Alex Garland,none,7.7,"[drama, sci-fi, thriller]",581563.0
1,tt1798709,her,https://www.imdb.com/title/tt1798709/,movie,126.0,2013-10-12,Spike Jonze,none,8.0,"[drama, romance, sci-fi]",663896.0
2,tt0343818,"i, robot",https://www.imdb.com/title/tt0343818/,movie,115.0,2004-07-07,Alex Proyas,none,7.1,"[action, mystery, sci-fi, thriller]",571470.0
3,tt0212720,a.i. artificial intelligence,https://www.imdb.com/title/tt0212720/,movie,146.0,2001-06-26,Steven Spielberg,none,7.2,"[drama, sci-fi]",321397.0
4,tt2209764,transcendence,https://www.imdb.com/title/tt2209764/,movie,119.0,2014-04-10,Wally Pfister,none,6.2,"[action, drama, sci-fi, thriller]",237916.0


In [None]:
ai_movies_script_imdb_df[ai_movies_script_imdb_df["utterances"] != "none"].shape

(17, 11)

In [None]:
ai_movies_script_imdb_df.isna().sum() / len(ai_movies_script_imdb_df)

imdb_ID         0.0
movie_title     0.0
imdb_url        0.0
title_type      0.0
runtime_mins    0.0
release_date    0.0
directors       0.0
utterances      0.0
av_rating       0.0
genres          0.0
tot_votes       0.0
dtype: float64

### Movie Identification Dataset
For this dataset, we eventually need the images as numpy arrays so that we can feed them into TensorFlow. Loading every image up front would be memory inefficient, so for now we only identify which screengrab folders match the titles in ```ai_movies_script_imdb_df```.

In [None]:
ai_movies_path = "data/resized_frames"

# Get blobs within the subfolder
blobs = bucket.list_blobs(prefix=ai_movies_path)

# Set to store unique subfolder names
subfolders = set()

# Set to store titles from ai_movies_script_imdb_df
ai_movies_script_imdb_df_titles = set()

# Set to store titles from screengrabs folders
screengrabs_titles = set()
screengrabs_paths = []

# Threshold for similarity
similarity_threshold = 80

# Read each subfolder and extract unique subfolder names
for blob in blobs:
    subfolder = blob.name.split("/")[2]  # Extract the movie folder (third path component)
    subfolders.add(subfolder)

# Iterate over unique subfolder names
for subfolder in subfolders:

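    # Iterate over blobs again to process files within the current subfolder.
    # Note: the title is the same for every blob in a subfolder, so each further blob
    # can match another, not-yet-matched AI title (one folder may map to several titles)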
    for blob in bucket.list_blobs(prefix=f"{ai_movies_path}/{subfolder}/"):

        # Extract movie title from path
        full_title = blob.name.split("/")[2]

        # Remove year
        title = re.sub(r"\s*\(\d+\)", "", full_title).lower()

        # Check similarity with titles in ai_movies_script_imdb_df
        for ai_title in ai_movies_script_imdb_df["movie_title"].values:

            if ai_title in ai_movies_script_imdb_df_titles:
                continue

            elif fuzz.token_set_ratio(title, ai_title.lower()) >= similarity_threshold:
                # Store the titles and paths
                ai_movies_script_imdb_df_titles.add(ai_title)
                screengrabs_titles.add(title)

                print(f"Found similar title: {title} -> {ai_title}")

                break

Found similar title: the matrix -> the matrix
Found similar title: the matrix -> the matrix reloaded
Found similar title: the matrix -> the matrix revolutions
Found similar title: star trek first contact -> star trek: first contact
Found similar title: vice -> vice
Found similar title: real steel -> real steel
Found similar title: the machinist -> the machine
Found similar title: star wars episode vii - the force awakens -> star wars
Found similar title: terminator 2 -> terminator genisys
Found similar title: terminator 2 -> the terminator
Found similar title: terminator 2 -> terminator 2: judgment day
Found similar title: terminator 2 -> terminator 3: rise of the machines
Found similar title: terminator 2 -> terminator salvation
Found similar title: blade runner 2049 -> blade runner 2049
Found similar title: blade runner 2049 -> blade runner
Found similar title: flight -> flight of the navigator
Found similar title: her -> her
Found similar title: back to the future iii -> back to the
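
Several of the matches above pair a screengrab folder with a different film (for example `flight -> flight of the navigator` or `blade runner 2049 -> blade runner`). A likely contributor is that token-set style scoring rates two titles very highly when every word of one title also appears in the other. The sketch below does not use fuzzywuzzy; it is a plain-Python illustration with hypothetical helpers (`normalize_title`, `title_tokens`, `is_token_subset`) and a made-up blob path, showing the year-stripping step and a check that flags such subset matches for manual review.

```python
import re


def normalize_title(blob_name):
    """Extract the movie folder from a blob path and drop a trailing '(year)'."""
    folder = blob_name.split("/")[2]  # data/resized_frames/<movie folder>/<file>
    return re.sub(r"\s*\(\d+\)", "", folder).lower()


def title_tokens(title):
    """Lower-case alphanumeric tokens of a title, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def is_token_subset(title_a, title_b):
    """True when the titles differ but one title's tokens all occur in the other."""
    a, b = title_tokens(title_a), title_tokens(title_b)
    return a != b and (a <= b or b <= a)


# Hypothetical blob path; the real folder naming may differ
title = normalize_title("data/resized_frames/Blade Runner 2049 (2017)/frame_0001.jpg")
print(title)

for candidate in ["blade runner 2049", "blade runner", "the machine"]:
    print(candidate, "->", is_token_subset(title, candidate))
```

Matches flagged by `is_token_subset` are not necessarily wrong, but they are the ones worth checking by hand, for instance by also comparing release years.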

In [None]:
related_titles

{'aliens',
 'back to the future',
 'blade runner',
 'star trek: first contact',
 'star trek: the motion picture',
 'star wars',
 'terminator 2: judgment day',
 'the matrix',
 'the terminator'}