# Capstone Project Part 4: Fine-Tuning and Enhancing Model

**Author:** Soohyun Ahn <br>
    
**Date:** March - April 2023<br>
    
**Notebook Number:** 4/4

In this final notebook, we showcase the iterative process of fine-tuning a movie recommendation model by incorporating genre information and experimenting with similarity weights to find the optimal combination for `cosine similarity`, `Jaccard similarity` based on keywords, and `Jaccard similarity` based on genres. This process not only improved the effectiveness of the model but also provided valuable learning opportunities, as we encountered and overcame various challenges.

We started by preprocessing the dataset and updating the `most_similar_movies` function (from our [second notebook](https://github.com/treelunar/2023_Capstone_BSTN/blob/main/Part_2_Feature_Engineering_Modeling.ipynb)) to include the weighted average of the `cosine similarity`, `Jaccard similarity` (keywords), and `Jaccard similarity` (genres) in the similarity calculation. Throughout the fine-tuning process, we faced several issues that prevented our model from functioning properly, such as the `eval` function used when loading the data and `Ellipsis` placeholders that ended up inside the saved embeddings.

In conclusion, while our current model has some limitations, its iterative development process demonstrates the potential for continuous improvement, paving the way for future enhancements that can further refine and optimize the movie recommendation system.

## Dataset and Model Selection

So far, we have been working with two datasets concurrently. Moving forward, we have decided to focus on the dataset containing 40,000 movies for several reasons. 
- Firstly, the document embeddings derived from this dataset appear more effective for our modeling purposes.
- Secondly, the 40,000 movies dataset includes a `genres` column, which can be utilized to enhance our model. Although the MPST dataset contains a `tags` column, it encompasses a broader scope of information beyond genres, such as plot descriptors like "atmospheric." Given more time and resources, the MPST dataset could have been a better choice. However, due to our time constraints, the dataset with genres proved to be more suitable.
- Lastly, and most importantly, the MPST dataset primarily consists of lengthy movie synopses from Wikipedia, exhibiting significant variation in length. This characteristic made it less suitable for our project. Despite these limitations, we believe the MPST dataset holds great potential for future work, provided there is ample time and resources to fully explore its capabilities.

We opted to use [OpenAI's embeddings model](https://platform.openai.com/docs/guides/embeddings) for our project due to its state-of-the-art performance and ability to capture complex semantic relationships in text. Leveraging this pre-trained model allowed us to harness the power of a vast amount of data and training, ensuring a more reliable and effective representation of our movie dataset.

## Dataset Inspection and Cleaning

Now we can read in our 40,000 movies dataset and see whether further cleaning or processing is required.

In [129]:
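# Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import gensim
import pyLDAvis.gensim_models

# Helpers used by later cells (literal_eval, nlargest, cosine_similarity)
from ast import literal_eval
from heapq import nlargest
from sklearn.metrics.pairwise import cosine_similarity

# Set the random seed

random.seed(42)
np.random.seed(42)

# Hide warnings

import warnings
warnings.filterwarnings('ignore')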

In [130]:
# Read in the csv file
# OpenAI instruction says we should use `apply(eval).apply(np.array)` to retrieve the csv file

ada_40000_df = pd.read_csv('ada_embeddings_movie_40000.csv')
ada_40000_df['ada_embeddings'] = ada_40000_df.ada_embeddings.apply(eval).apply(np.array)
ada_40000_df.head()

Unnamed: 0,imdb_id,title,original_title,overview,clean_overview,genres,tagline,poster_path,num_tokens,ada_embeddings
0,tt0114709,Toy Story,Toy Story,"Led by Woody, Andy's toys live happily in his ...",led woodi toy live happili room birthday bring...,"['Animation', 'Comedy', 'Family']",,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,26,"[-0.01801527477800846, -0.02206587977707386, 0..."
1,tt0113497,Jumanji,Jumanji,When siblings Judy and Peter discover an encha...,sibl discov enchant board game open door magic...,"['Adventure', 'Fantasy', 'Family']",Roll the dice and unleash the excitement!,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,33,"[-0.005219690967351198, -0.019913190975785255,..."
2,tt0113228,Grumpier Old Men,Grumpier Old Men,A family wedding reignites the ancient feud be...,famili wed reignit ancient feud next door neig...,"['Romance', 'Comedy']",Still Yelling. Still Fighting. Still Ready for...,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,31,"[0.006672864779829979, -0.010083439759910107, ..."
3,tt0114885,Waiting to Exhale,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",cheat mistreat step women hold breath wait elu...,"['Comedy', 'Drama', 'Romance']",Friends are the people who let you be yourself...,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,24,"[-0.016164006665349007, -0.0014150608330965042..."
4,tt0113041,Father of the Bride Part II,Father of the Bride Part II,Just when George Banks has recovered from his ...,recov daughter wed receiv news pregnant wife e...,['Comedy'],Just When His World Is Back To Normal... He's ...,/e64sOI48hQXyru7naBFyssKFxVd.jpg,17,"[-0.012427903711795807, 0.0046121967025101185,..."


Since the file is large, it would be good to reduce its size by removing unnecessary columns.

In [30]:
# Check column names
ada_40000_df.columns

Index(['imdb_id', 'title', 'original_title', 'overview', 'clean_overview',
       'genres', 'tagline', 'poster_path', 'num_tokens', 'ada_embeddings'],
      dtype='object')

The columns necessary for our project are:

- `imdb_id`: Required for fetching movie posters using the TMDb API.
- `title`: Necessary to display the title of the recommended movies.
- `overview`: Needed to provide the movie's synopsis for the recommended movies.
- `clean_overview`: Used for calculating Jaccard similarity.
- `genres`: Not used in the earlier version of the recommendation system, but needed to incorporate genre information later in this notebook.
- `ada_embeddings`: Essential for calculating cosine similarity between user input and movie descriptions.

We can remove the other columns (`original_title`, `tagline`, `poster_path`, and `num_tokens`) since they're not used in our application.

In [37]:
# Drop unwanted columns

columns_to_remove = ['original_title', 'tagline', 'poster_path', 'num_tokens']
ada_40000_min_df = ada_40000_df.drop(columns=columns_to_remove)
ada_40000_min_df.head()

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
0,tt0114709,Toy Story,"Led by Woody, Andy's toys live happily in his ...",led woodi toy live happili room birthday bring...,"['Animation', 'Comedy', 'Family']","[-0.01801527477800846, -0.02206587977707386, 0..."
1,tt0113497,Jumanji,When siblings Judy and Peter discover an encha...,sibl discov enchant board game open door magic...,"['Adventure', 'Fantasy', 'Family']","[-0.005219690967351198, -0.019913190975785255,..."
2,tt0113228,Grumpier Old Men,A family wedding reignites the ancient feud be...,famili wed reignit ancient feud next door neig...,"['Romance', 'Comedy']","[0.006672864779829979, -0.010083439759910107, ..."
3,tt0114885,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",cheat mistreat step women hold breath wait elu...,"['Comedy', 'Drama', 'Romance']","[-0.016164006665349007, -0.0014150608330965042..."
4,tt0113041,Father of the Bride Part II,Just when George Banks has recovered from his ...,recov daughter wed receiv news pregnant wife e...,['Comedy'],"[-0.012427903711795807, 0.0046121967025101185,..."


Now we can save the current dataset as a csv file and use it for building our Streamlit app.

In [None]:
# Save the dataset
#ada_40000_min_df.to_csv('ada_40000_min_streamlit.csv', index=False)

Interestingly, the draft version of our Streamlit app recommends movies that are completely different from our earlier experiment. For example, in our [second notebook](https://github.com/treelunar/2023_Capstone_BSTN/blob/main/Part_2_Feature_Engineering_Modeling.ipynb), we experimented with the user input of "humans meeting friendly and curious aliens and becoming friends," and the movie **E.T. the Extra-Terrestrial** was in the list (although not the top choice).

However, our Streamlit app recommends irrelevant movies, and worse, it excludes **E.T. the Extra-Terrestrial** from the recommendations!

**NOTE**: Troubleshooting for this issue took an entire day. The key takeaway is that the code used when saving and retrieving a file can substantially change the dataset, which matters significantly for a language model.

Seemingly innocuous details can break this round trip. In our case, the long embedding values appear to have been abbreviated when they were written out as text, and the placeholder ended up in the CSV file as the word `Ellipsis`. Loading the file then fails with `ValueError: malformed node or string`, because `literal_eval()` only accepts Python literals and cannot parse the bare name `Ellipsis`.

The next cell reproduces the error message, and the cells after it show how the problem was eventually resolved.

In [131]:
# This cell will return the `ValueError: malformed node or string`
#ada_40000_min_df = pd.read_csv('ada_40000_min_streamlit2.csv')
#ada_40000_min_df['ada_embeddings'] = ada_40000_min_df['ada_embeddings'].apply(literal_eval).apply(np.array)
#ada_40000_min_df.head()

ValueError: malformed node or string: <ast.Name object at 0x0000019C367740D0>

The `ValueError: malformed node or string` occurs.<br>We can examine the `ada_embeddings` column to find out what the cause is.

In [132]:
# Display 5 randomly sampled ada_embeddings

pd.set_option('display.max_colwidth', None)
print(ada_40000_min_df['ada_embeddings'].sample(5))

36544      [0.01422171,-0.0055085,-0.00897453,Ellipsis,0.00543973,-0.00144848,-0.02521121]
9728     [-0.04175027,-0.01659017,0.00512364,Ellipsis,-0.00926766,-0.00923242,-0.00717803]
22089     [-0.01487184,-0.02023523,0.0108318,Ellipsis,-0.01922696,-0.00039801,-0.00400153]
38351        [-0.02080106,-0.0188146,0.00699872,Ellipsis,0.01052114,0.00533151,0.00249549]
24740      [-0.01146415,-0.03905477,0.02411304,Ellipsis,-0.0304131,0.00498987,-0.00502471]
Name: ada_embeddings, dtype: object


We encountered an issue with `Ellipsis` in every row of our dataset, which caused an error. The `literal_eval()` function cannot parse strings containing `Ellipsis`.
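To make the failure easier to reason about, here is a minimal sketch of one plausible route by which `Ellipsis` can end up in saved text. It is an illustration under assumptions rather than a reconstruction of what happened to our file: the vector length of 1536 and the variable names are made up for the example.

```python
import ast
import numpy as np

# NumPy abbreviates long arrays when converting them to text
# (by default, arrays with more than 1000 elements are summarized).
embedding = np.zeros(1536)  # hypothetical stand-in for one long embedding
as_text = str(embedding)
print(as_text)

# Python reads `...` as the Ellipsis object, whose repr is the word "Ellipsis",
# so a list containing it is written out with that word inside.
round_tripped = str([0.1, ..., 0.2])
print(round_tripped)

# literal_eval only accepts literals, so the bare name `Ellipsis` is rejected.
try:
    ast.literal_eval(round_tripped)
    parse_error = None
except ValueError as err:
    parse_error = err
print(type(parse_error).__name__)
```

Whatever the exact route, the practical point is the same: once the values have been abbreviated in text form, the dropped numbers cannot be recovered from the CSV file.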

Although `np.fromstring()` could potentially sidestep this issue because it does not create `Ellipsis`, we did not consider it suitable for our Streamlit app because of its deprecation warning (NumPy has deprecated its binary mode). The suggested replacements for `np.fromstring()` are `np.frombuffer()` and `np.loadtxt()`. However, `np.frombuffer()` is not appropriate for our case, as it is designed to work with binary data stored in buffer-like objects, such as bytes objects or bytearrays. Unfortunately, `np.loadtxt()` also fails when encountering `Ellipsis`.

Initially, the problem stemmed from using the `eval` function when loading the dataset (as suggested by OpenAI). The `eval` call parses each embedding's string representation into a Python list, which is then converted back to a NumPy array. To avoid this, we tried several similar functions, including `literal_eval`. Along the way, we discovered that `Ellipsis` was causing problems, and we eventually arrived at the following solution:

- Save the embeddings as a separate binary file using NumPy's save function.
- Save the rest of the DataFrame to a CSV file.
- When loading the data, load the embeddings from the binary file and the rest of the DataFrame from the CSV file.

This approach successfully circumvents the issues related to `Ellipsis` and the limitations of the available functions.

In [133]:
# Read in the csv file
ada_40000_df = pd.read_csv('ada_embeddings_movie_40000.csv')
ada_40000_df['ada_embeddings'] = ada_40000_df.ada_embeddings.apply(eval).apply(np.array)

# Save the embeddings as a binary file
np.save('ada_embeddings_movie_40000.npy', ada_40000_df['ada_embeddings'].to_numpy())

columns_to_remove = ['original_title', 'tagline', 'poster_path', 'num_tokens', 'ada_embeddings']
ada_40000_min_df = ada_40000_df.drop(columns=columns_to_remove)
ada_40000_min_df.head()

# Save the DataFrame to a new CSV file without the 'ada_embeddings' column
# ada_40000_min_df.to_csv('ada_40000_min_streamlit.csv', index=False)

Unnamed: 0,imdb_id,title,overview,clean_overview,genres
0,tt0114709,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",led woodi toy live happili room birthday bring onto scene afraid lose place heart woodi plot circumst separ woodi owner duo eventu learn put asid differ,"['Animation', 'Comedy', 'Family']"
1,tt0113497,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",sibl discov enchant board game open door magic world unwittingli invit adult trap insid game year live room hope freedom finish game prove riski three find run giant rhinoceros evil monkey terrifi creatur,"['Adventure', 'Fantasy', 'Family']"
2,tt0113228,Grumpier Old Men,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",famili wed reignit ancient feud next door neighbor fish buddi meanwhil sultri italian divorc open restaur local bait shop alarm local worri scare fish away less interest seafood cook hot time,"['Romance', 'Comedy']"
3,tt0114885,Waiting to Exhale,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.",cheat mistreat step women hold breath wait elus good man break string less stellar lover friend confid vannah talk determin find better way breath,"['Comedy', 'Drama', 'Romance']"
4,tt0113041,Father of the Bride Part II,"Just when George Banks has recovered from his daughter's wedding, he receives the news that she's pregnant ... and that George's wife, Nina, is expecting too. He was planning on selling their home, but that's a plan that -- like George -- will have to change with the arrival of both a grandchild and a kid of his own.",recov daughter wed receiv news pregnant wife expect plan sell home plan like chang arriv grandchild kid,['Comedy']


In [134]:
# Load the embeddings from the binary file.
# By default, allow_pickle is set to False for security reasons,
# as loading pickled data can potentially execute arbitrary code,
# which may pose a risk if the data is from an untrusted source. 
# However, we created this file ourselves, so we are confident that the data is safe.

embeddings = np.load('ada_embeddings_movie_40000.npy', allow_pickle=True)

# Load the rest of the DataFrame from the CSV file
ada_40000_min_df = pd.read_csv('ada_40000_min_streamlit.csv')

# Add the embeddings back to the DataFrame
ada_40000_min_df['ada_embeddings'] = pd.Series(embeddings)
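A possible variation on the cells above, sketched on a toy DataFrame: if every embedding has the same length, the vectors can be stacked into a single 2-D float array before saving, in which case `np.load` works with its default `allow_pickle=False`. The toy data, the temporary file path, and the variable names are assumptions for illustration only.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame: each row holds one equal-length vector.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "ada_embeddings": [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])],
})

# Stack the per-row vectors into one (n_rows, n_dims) float array.
matrix = np.vstack(toy_df["ada_embeddings"].tolist())

# Save to a temporary location (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.npy")
np.save(path, matrix)

# A plain float array loads without enabling pickle.
loaded = np.load(path)

# Reattach one row vector per DataFrame row.
toy_df["ada_embeddings"] = list(loaded)
print(loaded.shape, loaded.dtype)
```

A 2-D array also makes it straightforward to compute all cosine similarities in one matrix operation later, instead of looping over rows.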

We can check whether `Ellipsis` is created or not by looking at the first row of the `ada_embeddings` column.

In [138]:
# Check the dataset
pd.set_option('display.max_colwidth', None)
ada_40000_min_df.head(1)

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
0,tt0114709,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",led woodi toy live happili room birthday bring onto scene afraid lose place heart woodi plot circumst separ woodi owner duo eventu learn put asid differ,"['Animation', 'Comedy', 'Family']","[-0.01801527477800846, -0.02206587977707386, 0.008851844817399979, -0.028354229405522346, -0.012003101408481598, 0.016868075355887413, 0.0012941397726535797, -0.0013968211133033037, -0.014035485684871674, -0.009163429960608482, -0.0008055178914219141, 0.01163486484438181, -0.024416929110884666, -0.0010188473388552666, 0.003399108536541462, 0.019488221034407616, 0.007676319684833288, 0.01944573223590851, 0.011507398448884487, 0.006231698673218489, -0.02530919574201107, 0.028750792145729065, -0.011224139481782913, -0.0033760936930775642, 0.009999044239521027, -0.020012250170111656, 0.017392104491591454, 0.0012401434360072017, 0.0012118176091462374, 0.006252943072468042, 0.02101781964302063, 0.004673773888498545, -0.005817432422190905, -0.0029246495105326176, -0.0211452879011631, -0.020947005599737167, 0.0056864251382648945, -0.016188254579901695, 0.02199506387114525, -0.01644318737089634, 0.008554423227906227, 0.00802331231534481, -0.004032900556921959, -0.021556012332439423, -0.03580394387245178, 0.02120193839073181, -0.027929341420531273, 0.006206913851201534, -0.006362705957144499, 0.006614098325371742, 0.026215624064207077, 0.006493713241070509, -0.0012985656503587961, 0.009290896356105804, -0.00625648396089673, -0.0010046843672171235, 0.017335452139377594, 0.0006368902395479381, -0.003331834450364113, -0.025677431374788284, -0.027249518781900406, -0.0024962201714515686, -0.016768934205174446, 0.013051159679889679, -0.007212483324110508, 
0.001765057910233736, -0.005222588311880827, -0.01398591510951519, -0.008398630656301975, -0.015182684175670147, 0.01928994059562683, 0.017760341987013817, -0.006316676735877991, -0.010600969195365906, 0.030790258198976517, -0.014382477849721909, -0.027249518781900406, 0.005944898817688227, 0.010006125085055828, 0.010678865946829319, 0.009567073546350002, -0.029487265273928642, -0.024516070261597633, 0.04129916802048683, 0.012746656313538551, 0.0194032434374094, -0.0007293919916264713, 0.032348182052373886, -0.02842504344880581, 0.003073360538110137, 0.005328810773789883, 0.015395129099488258, -0.012321768328547478, -0.0014499322278425097, 0.008795193396508694, 0.02461520954966545, -0.014255011454224586, 0.018383512273430824, -0.009652052074670792, -0.017108846455812454, ...]"


`Ellipsis` is not created this time!

## Fine-Tuning with `genres` Information

In this section, we use the `genres` column in our dataset to refine our model's performance. First, we create the list of unique genres. Then, we redefine the `most_similar_movies` function to incorporate genre information:

- Extract keywords from the user input and check if any of them match the unique genres.
- Calculate the `Jaccard similarity` between the matched genres and the genres of each movie in the dataset.
- Update the similarity calculation by incorporating a weighted average of the `cosine similarity`, `Jaccard similarity` (based on keywords), and `Jaccard similarity` (based on genres).

Later, we can also adjust the weights (`cosine_weight`, `jaccard_weight`, and `genre_weight`) to produce the best result. 

In [154]:
# Convert the 'genres' column into lists
ada_40000_min_df['genres'] = ada_40000_min_df['genres'].apply(str_to_list)

# Unique genres in the dataset
unique_genres = set(sum(ada_40000_min_df['genres'].tolist(), []))
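The `str_to_list` helper is not defined in this notebook. A minimal sketch of what it might look like, assuming it only needs to turn strings such as `"['Animation', 'Comedy']"` back into Python lists (the handling of missing values is my own assumption):

```python
from ast import literal_eval

def str_to_list(value):
    """Parse a string such as "['Animation', 'Comedy']" into a Python list.

    Hypothetical stand-in for the helper used above: values that are already
    lists are returned unchanged, and missing or empty values become [].
    """
    if isinstance(value, list):
        return value
    if not isinstance(value, str) or not value.strip():
        return []
    return literal_eval(value)

genres = str_to_list("['Animation', 'Comedy', 'Family']")
print(genres)
```

Since the genre strings contain only plain string literals, `literal_eval` is sufficient here and avoids the risks of `eval`.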

We also need to update our `jaccard_similarity` function (another iterative step). It turned out that the `jaccard_similarity` function causes a division by zero error when the union of the two sets is empty. To handle this case, we can modify the function to return 0 when the union is empty.

In [157]:
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    
    if len(union) == 0:
        return 0
    else:
        return len(intersection) / len(union)
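As a quick sanity check, the function can be worked through by hand on two small sets. The tokens below are illustrative stand-ins rather than actual preprocessing output, and the function is restated compactly so the snippet runs on its own.

```python
def jaccard_similarity(set1, set2):
    # Same behaviour as the cell above: |intersection| / |union|, or 0 for an empty union.
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0

# Hypothetical keyword sets for a user query and a movie overview.
user_keywords = {"human", "meet", "friendli", "alien", "friend"}
movie_keywords = {"alien", "creatur", "stalk", "human", "prey"}

# Shared tokens: {"alien", "human"} (2 items); union: 5 + 5 - 2 = 8 items.
score = jaccard_similarity(user_keywords, movie_keywords)
print(score)

# Two empty sets no longer raise ZeroDivisionError.
empty_score = jaccard_similarity(set(), set())
print(empty_score)
```

Because short queries share few tokens with any overview, keyword-based Jaccard scores tend to be small compared with cosine similarities, which is worth keeping in mind when choosing weights.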

In [158]:
def most_similar_movies(user_input, df, n=5, cosine_weight=0.5, jaccard_weight=0.25, genre_weight=0.25):
    input_embeddings = ada_embeddings(user_input)

    # Preprocess the user input
    user_input_preprocessed = preprocess_text(user_input)
    user_keywords = set(user_input_preprocessed.split())

    # Unique genres in the dataset
    unique_genres = set(sum(df['genres'].tolist(), []))

    # Find the matched genres
    matched_genres = unique_genres.intersection(user_keywords)

    similarities = []

    for index, row in df.iterrows():
        cur_embeddings = row['ada_embeddings']
        cosine_sim = cosine_similarity(input_embeddings.reshape(1, -1), cur_embeddings.reshape(1, -1))[0, 0]

        # Use preprocessed movie description
        movie_keywords = set(row['clean_overview'].split())

        jaccard_sim_keywords = jaccard_similarity(user_keywords, movie_keywords)

        # Calculate Jaccard similarity for genres
        movie_genres = set(row['genres'])
        jaccard_sim_genres = jaccard_similarity(matched_genres, movie_genres)

        # Calculate the weighted average of cosine similarity, Jaccard similarity (keywords), and Jaccard similarity (genres)
        similarity = cosine_weight * cosine_sim + jaccard_weight * jaccard_sim_keywords + genre_weight * jaccard_sim_genres
        similarities.append((similarity, index))

    top_n_similarities = nlargest(n, similarities)

    top_n_indices = [index for similarity, index in top_n_similarities]

    return df.loc[top_n_indices]
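One thing worth checking in the function above: the genre names in the dataset are capitalized and can span two words (for example "Science Fiction"), while `user_keywords` holds single tokens produced by `preprocess_text`. If that function lowercases and stems its input, as the `clean_overview` column suggests, the intersection with `unique_genres` may often be empty, and the genre term would then contribute nothing. A minimal sketch of case-insensitive matching against the raw user input follows; `match_genres` is a hypothetical helper, not part of the original code.

```python
import re

def match_genres(user_input, unique_genres):
    """Return the genres whose names appear in the raw user input, ignoring case."""
    text = user_input.lower()
    matched = set()
    for genre in unique_genres:
        # Word boundaries keep "war" from matching inside "warning", for example.
        pattern = r"\b" + re.escape(genre.lower()) + r"\b"
        if re.search(pattern, text):
            matched.add(genre)
    return matched

# Hypothetical genre set and query for illustration.
genres = {"Comedy", "Science Fiction", "Horror", "War"}
matched = match_genres("A funny science fiction comedy without warning", genres)
print(sorted(matched))
```

This only catches genres that the user names explicitly; a query such as the alien example below mentions no genre at all, so the genre term would still be zero for it.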

Now we can try our new functions with the same user input of "humans meeting friendly and curious aliens and becoming friends." Ideally, it should return **E.T. the Extra-Terrestrial** as the top movie and exclude **V** or similar scary movies.

**NOTE**: Occasionally, you may get an `APIConnectionError` while trying to get the movie recommendations. Running the same code again a few seconds later usually returns the results.
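Re-running by hand can be automated with a small retry wrapper. The sketch below is generic and makes no assumptions about the OpenAI client: `with_retries` is a hypothetical helper, and the exception class to retry on has to be supplied by the caller.

```python
import time

def with_retries(func, retry_on=(Exception,), attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

# The sleep is replaced with a no-op so the demo finishes immediately.
result = with_retries(flaky, retry_on=(ConnectionError,), sleep=lambda seconds: None)
print(result, calls["count"])
```

In the notebook, the call to `most_similar_movies` could be wrapped in a `lambda` and passed as `func`, with `retry_on` set to whichever connection error class the installed client library raises.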

In [160]:
user_input = (
    "humans meeting friendly and curious aliens and becoming friends."
)

pd.set_option('display.max_colwidth', 50)
top_5_similar_movies = most_similar_movies(user_input, ada_40000_min_df)
top_5_similar_movies

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
5466,tt0081764,Without Warning,An alien creature stalks human prey.,alien creatur stalk human prey,"[Horror, Science Fiction]","[-0.0027367251459509134, -0.025949206203222275..."
24297,tt0085106,V,Aliens pretending to be friendly come to Earth...,alien pretend friendli come earth receiv openl...,"[Drama, Science Fiction, Action, Adventure]","[0.013806127943098545, -0.03587256744503975, -..."
12326,tt0765476,Meet Dave,A crew of miniature aliens operate a spaceship...,crew miniatur alien oper spaceship human form ...,"[Comedy, Science Fiction, Adventure, Family]","[-0.010635590180754662, -0.022797487676143646,..."
41025,tt1070753,Bobik Visiting Barbos,"Two dogs, one stray the other well-off, become...",two dog one stray well becom friend,"[Animation, Comedy]","[-0.0059995767660439014, -0.00912640243768692,..."
37842,tt0128224,Fugitive Alien,An alien is pursued as a traitor by his own ra...,alien pursu traitor race refus kill human,"[Science Fiction, Action, Comedy, Foreign]","[-0.0144078079611063, -0.02955961413681507, -0..."


Oh no! Our model returns **Without Warning** and **V** as the first and second recommended movies and drops **E.T. the Extra-Terrestrial**. In this particular case, genre information doesn't help us much since the movie **V** contains 4 genres including **Drama**.

This suggests that we need to adjust the weights for the similarities.<br>OpenAI states that "We recommend cosine similarity. The choice of distance function typically doesn’t matter much."<br>We can give more weight to cosine similarity.

In [164]:
def most_similar_movies(user_input, df, n=5, cosine_weight=0.8, jaccard_weight=0.1, genre_weight=0.1):
    input_embeddings = ada_embeddings(user_input)

    # Preprocess the user input
    user_input_preprocessed = preprocess_text(user_input)
    user_keywords = set(user_input_preprocessed.split())

    # Unique genres in the dataset
    unique_genres = set(sum(df['genres'].tolist(), []))

    # Find the matched genres
    matched_genres = unique_genres.intersection(user_keywords)

    similarities = []

    for index, row in df.iterrows():
        cur_embeddings = row['ada_embeddings']
        cosine_sim = cosine_similarity(input_embeddings.reshape(1, -1), cur_embeddings.reshape(1, -1))[0, 0]

        # Use preprocessed movie description
        movie_keywords = set(row['clean_overview'].split())

        jaccard_sim_keywords = jaccard_similarity(user_keywords, movie_keywords)

        # Calculate Jaccard similarity for genres
        movie_genres = set(row['genres'])
        jaccard_sim_genres = jaccard_similarity(matched_genres, movie_genres)

        # Calculate the weighted average of cosine similarity, Jaccard similarity (keywords), and Jaccard similarity (genres)
        similarity = cosine_weight * cosine_sim + jaccard_weight * jaccard_sim_keywords + genre_weight * jaccard_sim_genres
        similarities.append((similarity, index))

    top_n_similarities = nlargest(n, similarities)

    top_n_indices = [index for similarity, index in top_n_similarities]

    return df.loc[top_n_indices]

Let's see what movies our model recommends this time.

In [165]:
user_input = (
    "humans meeting friendly and curious aliens and becoming friends."
)

pd.set_option('display.max_colwidth', 50)
top_5_similar_movies = most_similar_movies(user_input, ada_40000_min_df)
top_5_similar_movies

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
5466,tt0081764,Without Warning,An alien creature stalks human prey.,alien creatur stalk human prey,"[Horror, Science Fiction]","[-0.0027367251459509134, -0.025949206203222275..."
24297,tt0085106,V,Aliens pretending to be friendly come to Earth...,alien pretend friendli come earth receiv openl...,"[Drama, Science Fiction, Action, Adventure]","[0.013806127943098545, -0.03587256744503975, -..."
12326,tt0765476,Meet Dave,A crew of miniature aliens operate a spaceship...,crew miniatur alien oper spaceship human form ...,"[Comedy, Science Fiction, Adventure, Family]","[-0.010635590180754662, -0.022797487676143646,..."
10528,tt0443693,The Wild Blue Yonder,An alien narrates the story of his dying plane...,alien narrat stori die planet peopl visit eart...,"[Drama, Science Fiction]","[0.023845355957746506, -0.02781049720942974, -..."
41025,tt1070753,Bobik Visiting Barbos,"Two dogs, one stray the other well-off, become...",two dog one stray well becom friend,"[Animation, Comedy]","[-0.0059995767660439014, -0.00912640243768692,..."


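Again, the result is poor: **Without Warning** and **V** still lead the list, and **E.T. the Extra-Terrestrial** is still missing.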

We can continue experimenting with different weight combinations to fine-tune our model further. For instance, we can increase the `cosine_weight` and `genre_weight` while decreasing the `jaccard_weight`. This approach is based on the assumption that, given the prevalent use of `cosine similarity` in language models, it should occupy a higher proportion. Moreover, genre information plays a crucial role in movie-related datasets, which justifies increasing the weight for genre-based similarity.

By iteratively adjusting these weights and evaluating the model's performance, we can identify the optimal balance between `cosine similarity`, `Jaccard similarity`, and `genre similarity` to achieve the best recommendations for a diverse range of user inputs and preferences.
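Since only the weights change between experiments, the three similarity scores can be computed once per query and then recombined cheaply. A minimal sketch with made-up scores for four hypothetical movies (none of these numbers come from our dataset):

```python
import numpy as np

# Hypothetical per-movie scores for one query; real values would come
# from the cosine and Jaccard calculations in most_similar_movies.
titles = np.array(["Movie A", "Movie B", "Movie C", "Movie D"])
cosine_sims = np.array([0.86, 0.84, 0.83, 0.80])
keyword_sims = np.array([0.30, 0.10, 0.05, 0.00])
genre_sims = np.array([0.00, 0.00, 0.50, 0.25])

def rank(weights, n=3):
    """Combine the three score arrays with the given weights and return the top-n titles."""
    w_cos, w_kw, w_genre = weights
    combined = w_cos * cosine_sims + w_kw * keyword_sims + w_genre * genre_sims
    top = np.argsort(-combined)[:n]  # indices of the n largest combined scores
    return titles[top].tolist()

# The three weight combinations tried in this notebook.
weight_grid = [(0.5, 0.25, 0.25), (0.8, 0.1, 0.1), (0.7, 0.05, 0.25)]
rankings = {weights: rank(weights) for weights in weight_grid}
for weights, top_titles in rankings.items():
    print(weights, top_titles)
```

Separating scoring from weighting in this way would also avoid calling the embeddings API again for every weight combination.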

We can set `cosine_weight=0.70, jaccard_weight=0.05, genre_weight=0.25` and test with different user inputs.

In [189]:
def most_similar_movies(user_input, df, n=5, cosine_weight=0.70, jaccard_weight=0.05, genre_weight=0.25):
    input_embeddings = ada_embeddings(user_input)

    # Preprocess the user input
    user_input_preprocessed = preprocess_text(user_input)
    user_keywords = set(user_input_preprocessed.split())

    # Unique genres in the dataset
    unique_genres = set(sum(df['genres'].tolist(), []))

    # Find the matched genres
    matched_genres = unique_genres.intersection(user_keywords)

    similarities = []

    for index, row in df.iterrows():
        cur_embeddings = row['ada_embeddings']
        cosine_sim = cosine_similarity(input_embeddings.reshape(1, -1), cur_embeddings.reshape(1, -1))[0, 0]

        # Use preprocessed movie description
        movie_keywords = set(row['clean_overview'].split())

        jaccard_sim_keywords = jaccard_similarity(user_keywords, movie_keywords)

        # Calculate Jaccard similarity for genres
        movie_genres = set(row['genres'])
        jaccard_sim_genres = jaccard_similarity(matched_genres, movie_genres)

        # Calculate the weighted average of cosine similarity, Jaccard similarity (keywords), and Jaccard similarity (genres)
        similarity = cosine_weight * cosine_sim + jaccard_weight * jaccard_sim_keywords + genre_weight * jaccard_sim_genres
        similarities.append((similarity, index))

    top_n_similarities = nlargest(n, similarities)

    top_n_indices = [index for similarity, index in top_n_similarities]

    return df.loc[top_n_indices]

Let's try the troubling alien movie user input again!

In [190]:
user_input = (
    "humans meeting friendly and curious aliens and becoming friends."
)

pd.set_option('display.max_colwidth', 50)
top_5_similar_movies = most_similar_movies(user_input, ada_40000_min_df)
top_5_similar_movies

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
24297,tt0085106,V,Aliens pretending to be friendly come to Earth...,alien pretend friendli come earth receiv openl...,"[Drama, Science Fiction, Action, Adventure]","[0.013806127943098545, -0.03587256744503975, -..."
5466,tt0081764,Without Warning,An alien creature stalks human prey.,alien creatur stalk human prey,"[Horror, Science Fiction]","[-0.0027367251459509134, -0.025949206203222275..."
12326,tt0765476,Meet Dave,A crew of miniature aliens operate a spaceship...,crew miniatur alien oper spaceship human form ...,"[Comedy, Science Fiction, Adventure, Family]","[-0.010635590180754662, -0.022797487676143646,..."
10528,tt0443693,The Wild Blue Yonder,An alien narrates the story of his dying plane...,alien narrat stori die planet peopl visit eart...,"[Drama, Science Fiction]","[0.023845355957746506, -0.02781049720942974, -..."
1050,tt0083866,E.T. the Extra-Terrestrial,After a gentle alien becomes stranded on Earth...,gentl alien becom strand earth discov befriend...,"[Science Fiction, Adventure, Family, Fantasy]","[0.025521887466311455, -0.014421803876757622, ..."


Unfortunately, the model fails to exclude **V** and **Without Warning**. However, **E.T. the Extra-Terrestrial** is back on our list, which is definitely an improvement.

Let's try out a different input.

In [191]:
user_input = (
    "An ordinary woman is unexpectedly thrust into a criminal scheme that has no connection to her."
)
top_5_similar_movies = most_similar_movies(user_input, ada_40000_min_df)
top_5_similar_movies

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
37428,tt2315582,Una,When a young woman unexpectedly arrives at an ...,young woman unexpectedli arriv older man workp...,[Drama],"[-0.0017255287384614348, -0.03788863867521286,..."
9379,tt0249378,Backflash,A woman is released from prison and heads home...,woman releas prison head home help outwit loca...,"[Crime, Action]","[-0.006919859908521175, -0.009919636882841587,..."
25086,tt0040802,Smart Girls Don't Talk,A society woman gets involved with a gangster ...,societi woman get involv gangster find hidden ...,[Drama],"[-0.01806347817182541, -0.036013972014188766, ..."
24452,tt0024334,Midnight Mary,A young woman is on trial for murder. In flash...,young woman trial murder flashback learn strug...,"[Romance, Crime, Drama]","[-0.0067757428623735905, -0.0228307843208313, ..."
1008,tt0117202,Normal Life,Chris Anderson and his wife Pam live a fairly ...,wife live fairli normal life lose job polic fo...,"[Crime, Drama]","[-0.004159923177212477, -0.023969894275069237,..."


The result is the same as before (in our second notebook).

In [192]:
user_input = (
    # Adjacent string literals are concatenated as-is, so each sentence
    # ends with a space to keep the sentences from running together.
    "I'm looking for a movie with a classic love story. "
    "A poor young man meets a wealthy young woman on an enormous cruise ship. "
    "They fall in love at first sight. "
    "However, tragedy strikes when the ship hits an iceberg and begins to sink. "
    "The man sacrifices his life to save the woman."
)
top_5_similar_movies = most_similar_movies(user_input, ada_40000_min_df)
top_5_similar_movies

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
10682,tt0012938,Beyond the Rocks,A young woman marries an older millionaire and...,young woman marri older millionair fall love h...,"[Drama, Romance]","[-0.026327291503548622, -0.02202616259455681, ..."
15020,tt0181212,Don Quixote,"The classic tale of a man's dream, his epic jo...",classic tale man dream epic journey one true love,"[Adventure, Comedy, Romance, Drama]","[0.010016578249633312, -0.028224851936101913, ..."
27,tt0114117,Persuasion,This film adaptation of Jane Austen's last nov...,film adapt last novel follow daughter financi ...,"[Drama, Romance]","[-0.005888927727937698, -0.025731371715664864,..."
1558,tt0120257,Swept from the Sea,The film tells the story of Russian emigree an...,film tell stori russian emigre survivor ship c...,"[Drama, Romance]","[-0.0036487895995378494, -0.02344241552054882,..."
25827,tt0053437,The Wayward Girl,"The story of a young Gerd, played by Liv Ullma...",stori young play first lead role fall love you...,[Drama],"[-0.013105043210089207, -0.030474917963147163,..."


The result is neither ideal nor disappointing.

In [193]:
user_input = (
    # Trailing spaces keep the concatenated literals from running together.
    "I'm in the mood for a thrilling science fiction movie with a futuristic setting, advanced technology, "
    "and maybe some space travel or exploration. "
    "I'd also like some action and adventure elements in the plot."
)
top_5_similar_movies = most_similar_movies(user_input, ada_40000_min_df)
top_5_similar_movies

Unnamed: 0,imdb_id,title,overview,clean_overview,genres,ada_embeddings
1555,tt0119177,Gattaca,Science fiction drama about a future society i...,scienc fiction drama futur societi era indefin...,"[Thriller, Science Fiction, Mystery, Romance]","[0.008741836994886398, -0.023597246035933495, ..."
1492,tt0118884,Contact,Contact is a science fiction film about an enc...,contact scienc fiction film encount alien inte...,"[Drama, Science Fiction, Mystery]","[0.02214028500020504, -0.02403842844069004, -0..."
41297,tt0120200,Starquest II,Sci-fi thriller directed by Fred Gallo.,sci fi thriller direct,"[Thriller, Science Fiction]","[-0.0091245137155056, -0.0281309112906456, -0...."
40265,tt0054415,12 to the Moon,A group of twelve international scientists are...,group twelv intern scientist first land moon e...,[Science Fiction],"[-0.004101065453141928, -0.014631942845880985,..."
28819,tt1824904,95ers: Time Runners,"Time is unraveling, paradoxes are everywhere, ...",time unravel paradox everywher stranger terrif...,"[Thriller, Science Fiction]","[0.006056539714336395, -0.02809791825711727, -..."


In [194]:
pd.set_option('display.max_colwidth', None)
top_5_similar_movies['overview']

1555    Science fiction drama about a future society in the era of indefinite eugenics where humans are set on a life course depending on their DNA. The young Vincent Freeman is born with a condition that would prevent him from space travel, yet he is determined to infiltrate the GATTACA space program.
1492

This result is quite good. Classic sci-fi movies such as **Gattaca** and **Contact** are included, and the other recommendations also fit the user input.

**NOTE**: After conducting several iterations to fine-tune the model, the results remained relatively consistent across different weight combinations. As such, we can consider the current combination of `cosine_weight=0.70, jaccard_weight=0.05, genre_weight=0.25` to be a reasonable and well-performing choice at this point.
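To make the role of these three weights concrete, here is a minimal sketch of how a weighted score of this kind might be computed for a single query/movie pair. The helper names (`cosine`, `jaccard`, `combined_score`) and the toy inputs are illustrative assumptions, not the actual `most_similar_movies` implementation; only the three weight values come from the note above. How the query's keywords and genres are extracted is left out.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def jaccard(x, y):
    """Jaccard similarity: size of the intersection over size of the union."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def combined_score(query_vec, movie_vec, query_words, movie_words,
                   query_genres, movie_genres,
                   cosine_weight=0.70, jaccard_weight=0.05, genre_weight=0.25):
    """Weighted average of the three similarities (hypothetical helper)."""
    return (cosine_weight * cosine(query_vec, movie_vec)
            + jaccard_weight * jaccard(query_words, movie_words)
            + genre_weight * jaccard(query_genres, movie_genres))


# Toy example: identical embeddings, partially overlapping keywords and genres.
score = combined_score(
    [1.0, 0.0], [1.0, 0.0],
    {"alien", "earth"}, {"alien", "planet"},
    {"Science Fiction"}, {"Science Fiction", "Drama"},
)
```

Because the weights sum to 1 and each similarity lies in a bounded range, the combined score stays on a comparable scale across weight settings, which makes rankings from different combinations easier to compare.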

## Summary

In this notebook, we have fine-tuned and enhanced our movie recommendation model by incorporating genre information and iterating through different combinations of `cosine similarity`, `Jaccard similarity`, and `genre similarity` weights.
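The notebook does not state how the candidate weight combinations were chosen. As one hypothetical way to enumerate them systematically, the sketch below lists every (cosine, Jaccard, genre) triple on a fixed step whose entries sum to 1. The function name `weight_grid` and the step size are assumptions made for illustration.

```python
from itertools import product


def weight_grid(step=0.05):
    """Return all (cosine, jaccard, genre) weight triples on a grid that sum to 1."""
    n = round(1 / step)  # number of steps that make up the full weight
    combos = []
    for i, j in product(range(n + 1), repeat=2):
        k = n - i - j  # the remaining weight goes to the genre similarity
        if k >= 0:
            combos.append((round(i * step, 2), round(j * step, 2), round(k * step, 2)))
    return combos


candidates = weight_grid()
```

Each candidate triple could then be passed to the recommendation function and the resulting top-5 lists compared by hand, as was done for the example inputs in this notebook.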

We have also addressed the issues encountered along the way, such as the use of the `eval` function and the handling of `Ellipsis` objects. In doing so, we improved the model's efficiency by storing the precomputed embeddings in a separate binary file using NumPy's `save` function. After iterating over the weights, we arrived at a well-performing combination of `cosine_weight=0.70`, `jaccard_weight=0.05`, and `genre_weight=0.25`.
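As a rough illustration of why a binary file is more robust than parsing embeddings back from text, the sketch below saves a toy array with `np.save`, reloads it with `np.load`, and shows one plausible way an `Ellipsis` can appear when an abbreviated string is passed to `eval`. The file name, the toy values, and the abbreviated string are assumptions, not the notebook's actual data.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real embedding matrix (one row per movie).
embeddings = np.array([[0.0138, -0.0359, 0.0021],
                       [-0.0027, -0.0259, 0.0104]])

# Hypothetical file name; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "ada_embeddings.npy")
np.save(path, embeddings)   # writes the array in NumPy's binary .npy format
loaded = np.load(path)      # restores it with the same shape and dtype

# By contrast, a string that was abbreviated with "..." still evaluates,
# but the "..." becomes Python's Ellipsis object instead of numbers.
abbreviated = "[0.0138, -0.0359, ...]"
parsed = eval(abbreviated)
has_ellipsis = parsed[-1] is Ellipsis
```

Loading from the `.npy` file avoids both the string parsing step and the risk of silently truncated vectors.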

This fine-tuning process provided valuable learning opportunities and insights. Although our current model has limitations, we have shown that it can be further improved and adapted to various scenarios. We hope to keep improving it by adding more features (e.g., user feedback) and by further fine-tuning, so that the recommendation system can serve a diverse range of user inputs and preferences.

**Thank you** for taking the time to explore and engage with my series of notebooks. I appreciate your interest and commitment to learning alongside me as we navigated through various concepts, techniques, and challenges together.

## Streamlit App

Our Streamlit App, **ReelWhisperer**, is also ready!

Although my model doesn't explicitly filter movies by genre, movie descriptions frequently contain genre-related terms, and text embeddings can capture the relationships between these words and their contexts. Consequently, mentioning genre-related keywords can still help users obtain more relevant recommendations.

The model isn't perfect and may occasionally suggest seemingly unrelated movies. However, during this experimental stage, I hope users can appreciate and enjoy even the unexpected and surprising movie recommendations! :)