Let's silence the warnings



In [62]:
import warnings
warnings.filterwarnings("ignore")

Import libraries

In [1]:
import pandas as pd

Download data

### Download of data

We download two CSV files:

1. **`ratings.csv`**: A CSV file likely containing user ratings or evaluations for items, potentially for use in a recommendation or classification system.
2. **`metadata.csv`**: A CSV file probably containing metadata related to the items or users in the `ratings.csv`.

The `-O` option in `curl` ensures the files are saved with their original filenames in the current working directory.



In [2]:
!curl -O https://raw.githubusercontent.com/giuspillo/MRI-24-25_CBRS/refs/heads/main/cbrs_classifier/data/ratings.csv
!curl -O https://raw.githubusercontent.com/giuspillo/MRI-24-25_CBRS/refs/heads/main/cbrs_classifier/data/metadata.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.5M  100 20.5M    0     0  12.4M      0  0:00:01  0:00:01 --:--:-- 12.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2135k  100 2135k    0     0  1709k      0  0:00:01  0:00:01 --:--:-- 1709k


Load data related to ratings and metadata

### Load data as Pandas

This cell loads the two CSV files downloaded earlier into pandas DataFrames:

1. **`ratings`**: This DataFrame loads the `ratings.csv` file, which contains user-item ratings \(from 1 ro 5\).
2. **`metadata`**: This DataFrame loads the `metadata.csv` file, which holds additional information about the items or users referenced in the `ratings` file.


In [83]:
# load data we are interest in
ratings = pd.read_csv('ratings.csv')
metadata = pd.read_csv('metadata.csv')

Let's see how the ratings are formatted and which information it contains

In [4]:
print(f'In total we have {len(ratings)} ratings')
ratings.head(5)

In total we have 998034 ratings


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


This cell identifies the user with the highest number of interactions in the `ratings` dataset:

1. **`interaction_counts`**: Calculates the number of interactions per user by counting the occurrences of each unique `user_id` in the `ratings` DataFrame using the `value_counts()` method.

2. **`most_active_user`**: Identifies the user ID with the maximum number of interactions using the `idxmax()` function, which returns the index of the highest value.

3. **`max_interactions`**: Retrieves the maximum number of interactions for the most active user using the `max()` function.



In [229]:
# Let's find the user with the highest number of interactions - just for fun
interaction_counts = ratings['user_id'].value_counts()
most_active_user = interaction_counts.idxmax()
max_interactions = interaction_counts.max()
print(f"The most active user is {most_active_user} with {max_interactions} interactions")

The most active user is 4169 with 2306 interactions


### Description of the Code

This cell explores the distribution of user interactions in the `ratings` dataset by focusing on users with a specific range of interactions:

1. **Filter Users by Interaction Count**:
   - **`bet_int_users`**: Filters the `interaction_counts` DataFrame to include only users with more than 100 and fewer than 300 interactions using a conditional statement.
   - The result is reset with `reset_index()` to create a new DataFrame for easier handling.

2. **Count Users in Range**:
   - The number of users in the specified range is determined using `len(bet_int_users)`.
   - This count is printed to indicate how many users fall within the range of 100–300 interactions.

3. **Retrieve Specific User Details**:
   - Selects the first user from the filtered list and prints their ID.
   - Retrieves and prints their exact interaction count from the filtered DataFrame.


In [230]:
# Other curiosity: how many users have a number of interactions between 100 and 300?
bet_int_users = interaction_counts[(interaction_counts > 100) & (interaction_counts < 300)].reset_index()
print(f"There are {len(bet_int_users)} users with interactions between 100 and 300")

user_id = list(bet_int_users['user_id'])[0]
print(f"The first one of them is user {user_id} with {bet_int_users[bet_int_users['user_id'] == user_id]['count'].values[0]} interactions")

There are 1948 users with interactions between 100 and 300
The first one of them is user 4593 with 299 interactions


### Let's see which information is encoded in the metadata file

In [174]:
print(f"In total we have {len(metadata)} movies")
metadata.head(5)

In total we have 3859 movies


Unnamed: 0,movie_id,name,genres,overview
0,1,Toy Story (1995),Animation|Children's|Comedy,A little boy named Andy loves to be in his roo...
1,2,Jumanji (1995),Adventure|Children's|Fantasy,After being trapped in a jungle board game for...
2,3,Grumpier Old Men (1995),Comedy|Romance,Things don't seem to change much in Wabasha Co...
3,4,Waiting to Exhale (1995),Comedy|Drama,This story based on the best selling novel by ...
4,5,Father of the Bride Part II (1995),Comedy,"In this sequel to ""Father of the Bride"" Georg..."


### Binarize the ratings

- 1, 2 ,3 ratings become 0s (dislike)
- 4, 5 ratings become 1s (like)

In [175]:
# create a new column with 1 or 0 based on the rating
ratings['bin_rat'] = (ratings['rating'] >= 4).astype(int)

Let's count how many 0s and 1s we have

In [176]:
# show how many 1s and 0s the dataset now has
ratings['bin_rat'].value_counts()

Unnamed: 0_level_0,count
bin_rat,Unnamed: 1_level_1
1,573759
0,424275


### Fuse ratings and metadata

We merge the `metadata` and `ratings` DataFrames and analyze the preferences (likes and dislikes) of a specific user (identified earlier as `user_id`):

1. **Merging the DataFrames**:
   - **`df`**: The `metadata` DataFrame (which contains movie details) is merged with the `ratings` DataFrame (which contains user ratings) on the common `movie_id` column. The merged DataFrame is then sorted by `user_id` in ascending order for easier analysis.

2. **Identifying Likes and Dislikes**:
   - **`user_like`**: Filters the merged DataFrame to extract the movie names (`'name'`) where the specified user (with ID `user_id`) has given a positive rating (`'bin_rat' == 1`), indicating a like.
   - **`user_dislike`**: Similarly, filters for the movie names where the user has given a negative rating (`'bin_rat' == 0`), indicating a dislike.

3. **Displaying the Results**:
   - The first 5 liked movies are displayed as a pandas Series using the `display()` function.
   - A complete list of the liked movies is printed.
   - The first 5 disliked movies are displayed similarly, and a list of all disliked movies is printed.


In [177]:
# merge movies and ratings pandas and show user 0 like and dislikes
df = pd.merge(metadata, ratings, on='movie_id').sort_values(by=['user_id'], ascending=True)

# get a list of the titles of user 1 likes and a list of user 1 dislikes
user_like = df[(df['user_id'] == user_id) & (df['bin_rat'] == 1)]['name']
user_dislike = df[(df['user_id'] == user_id) & (df['bin_rat'] == 0)]['name']

# display the first likes as pandas...
display((user_like.head(5)))
# ... or list
print(f'user {user_id} likes the following {len(user_like.to_list())} movies: {user_like.tolist()}')

# display the first DISlikes as pandas...
display((user_dislike.head(5)))
# ... or list
print(f'user {user_id} doesn\'t like the following {len(user_dislike.to_list())} movies: {user_dislike.tolist()}')


Unnamed: 0,name
84365,"Santa Clause, The (1994)"
394883,"Crucible, The (1996)"
404917,Grease (1978)
563489,Doctor Zhivago (1965)
916253,"Odd Couple, The (1968)"


user 4593 likes the following 134 movies: ['Santa Clause, The (1994)', 'Crucible, The (1996)', 'Grease (1978)', 'Doctor Zhivago (1965)', 'Odd Couple, The (1968)', 'Babe: Pig in the City (1998)', 'Being John Malkovich (1999)', 'Dancing at Lughnasa (1998)', 'Koyaanisqatsi (1983)', 'Lovers of the Arctic Circle, The (Los Amantes del Círculo Polar) (1998)', 'Golden Bowl, The (2000)', 'As Good As It Gets (1997)', 'Castle, The (1997)', 'Restoration (1995)', 'Bowfinger (1999)', 'Night on Earth (1991)', 'Best in Show (2000)', 'Central Station (Central do Brasil) (1998)', 'All About My Mother (Todo Sobre Mi Madre) (1999)', 'Piano, The (1993)', 'Dead Man Walking (1995)', 'Sunshine (1999)', 'Remains of the Day, The (1993)', 'Analyze This (1999)', 'Four Weddings and a Funeral (1994)', 'Crying Game, The (1992)', 'Dog Day Afternoon (1975)', 'Maya Lin: A Strong Clear Vision (1994)', 'Hilary and Jackie (1998)', 'Babe (1995)', 'Gladiator (2000)', 'Three Colors: Red (1994)', 'Chariots of Fire (1981)', 'P

Unnamed: 0,name
924827,East is East (1999)
418056,Evita (1996)
164984,Batman (1989)
742997,Dick (1999)
135949,Mrs. Doubtfire (1993)


user 4593 doesn't like the following 165 movies: ['East is East (1999)', 'Evita (1996)', 'Batman (1989)', 'Dick (1999)', 'Mrs. Doubtfire (1993)', 'Truman Show, The (1998)', 'Wonder Boys (2000)', 'Back to the Future (1985)', 'Private Benjamin (1980)', 'Bodyguard, The (1992)', 'Airplane! (1980)', 'Frequency (2000)', 'Peggy Sue Got Married (1986)', 'Immortal Beloved (1994)', 'Nine Months (1995)', 'Working Girl (1988)', 'Fabulous Baker Boys, The (1989)', 'Space Cowboys (2000)', 'Erin Brockovich (2000)', 'Manhattan Murder Mystery (1993)', 'Gods and Monsters (1998)', 'Hideous Kinky (1998)', 'Chicken Run (2000)', 'Little Big Man (1970)', 'Good Will Hunting (1997)', 'Insider, The (1999)', 'Honey, I Shrunk the Kids (1989)', 'Horse Whisperer, The (1998)', 'Your Friends and Neighbors (1998)', 'What Dreams May Come (1998)', 'Ed Wood (1994)', 'Back to the Future Part II (1989)', 'Mass Appeal (1984)', 'Orlando (1993)', 'Notting Hill (1999)', 'Breaking Away (1979)', 'Blues Brothers, The (1980)', 'Dri

Just a curiosity: how many people liked and disliked star wars?

In [178]:
like_sw = df[(df['name'] == 'Star Wars: Episode IV - A New Hope (1977)') & (df['bin_rat'] == 1)].shape[0]
dislike_sw = df[(df['name'] == 'Star Wars: Episode IV - A New Hope (1977)') & (df['bin_rat'] == 0)].shape[0]

print(f"{like_sw} users liked Star Wars while {dislike_sw} users didn't like it")

2622 users liked Star Wars while 369 users didn't like it


Let's check some descriptions - we use Star Wars (pefforza)

In [179]:
# first we find the id of star wars
metadata[metadata['name'] == 'Star Wars: Episode IV - A New Hope (1977)']['overview']

Unnamed: 0,overview
257,The Imperial Forces under orders from cruel D...


In [180]:
# let's get the value of this df
overview = metadata[metadata['name'] == 'Star Wars: Episode IV - A New Hope (1977)']['overview'].values[0]

We count the number of missing overview, and, to avoid problems in the computations, we replace NaN values with empty strings

In [231]:
# how many overview are missing?
print(metadata['overview'].isna().sum())

# let's fix this by replacing these overview with a dummy string
# (it can be the empty string, a blank, 'pippo', 'pluto', 'colui che egli')
metadata['overview'] = metadata['overview'].fillna('')

0


### Learning movie embeddings through TF-IDF

This cell performs the process of representing each movie by an embedding using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. The goal is to transform the textual data (in this case, the movie overviews) into numerical vectors that capture the importance of words across different movies.

1. **Importing Required Libraries**:
   - `TfidfVectorizer`: From `sklearn.feature_extraction.text`, it is used to convert a collection of text documents into a matrix of TF-IDF features.
   - `nltk`: The Natural Language Toolkit (NLTK) is used to download a list of stopwords (common words like "the", "is", etc. that are typically removed from text analysis).
   
2. **Downloading Stopwords**:
   - The list of English stopwords is downloaded using NLTK’s `stopwords.words('english')`. These words will be ignored during the TF-IDF calculation.

3. **Initializing the Vectorizer**:
   - A `TfidfVectorizer` is initialized with the stop words set to the list from NLTK. It will ignore common English words that do not carry significant meaning in the context of text analysis.

4. **Computing the TF-IDF Matrix**:
   - **`tfidf_matrix`**: The `fit_transform()` function is applied to the `overview` column of the `metadata` DataFrame, which contains textual descriptions of each movie. This results in a sparse matrix representing the TF-IDF values for each term in each movie's overview.

5. **Converting the Matrix to a DataFrame**:
   - **`tfidf_features`**: The sparse matrix is converted into a dense format using `toarray()` and then into a pandas DataFrame for easier inspection. The DataFrame's rows represent movies (indexed by `movie_id`), and the columns represent the terms from the overviews.

6. **Displaying the TF-IDF Features**:
   - The shape of the TF-IDF matrix (number of movies and terms) is printed for an overview.
   - The resulting `tfidf_features` DataFrame is displayed, showing the TF-IDF values for each movie and word in the overviews.


In [232]:
# now we represent each movie as an embedding, by exploiting the genres we have stored in the movies.csv file, and we use the tf-idf technique
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords

# Download stopwords list
nltk.download('stopwords')
stop_words = list(stopwords.words('english'))

# init vectorizer
# possible parameter: max_features=n. it considers only the top-n popular words
# currently, we do not set it, try by yourself :)
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Compute TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(metadata['overview'])
print(tfidf_matrix.shape)

# Convert the sparse matrix to a DataFrame for better visualization
tfidf_features = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out(), index=metadata['movie_id'])

print("Tf-idf Features:")
display(tfidf_features)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(3859, 24526)
Tf-idf Features:


Unnamed: 0_level_0,00,000,007,00h,00pm,05pm,10,100,1000,105,...,zorin,zorro,zquez,zucco,zuckerman,zuko,zulu,zundel,zuoqian,zyto
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3950,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3951,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Description of the Code

This cell computes the similarity between movies using **cosine similarity**, a measure of similarity between two non-zero vectors that calculates the cosine of the angle between them. In this case, the vectors represent the TF-IDF embeddings of movie overviews.

1. **Importing Cosine Similarity**:
   - The `cosine_similarity` function from `sklearn.metrics.pairwise` is imported. This function will be used to compute the pairwise cosine similarity between all movies based on their TF-IDF embeddings.

2. **Computing the Similarity Matrix**:
   - **`similarity_matrix`**: The `cosine_similarity()` function is applied to the `tfidf_matrix` computed previously. This generates a square matrix where each element `(i, j)` represents the cosine similarity between the `i`-th and `j`-th movies.

3. **Matrix Shape**:
   - The shape of the `similarity_matrix` is printed to confirm the number of movies being compared. The matrix will be a square matrix with dimensions equal to the number of movies in the dataset.


In [233]:
# just for fun, let's find the most similar movie to star wars
from sklearn.metrics.pairwise import cosine_similarity

# First, we need to compute the similarity matrix using cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)
print(similarity_matrix.shape)

(3859, 3859)


### Find similar movies to Star Wars Episode IV

In order to find the most similar movies to "Star Wars" (represented by its movie ID), we compute the cosine similarity scores between it and all other movies.

1. **Input: Movie ID of "Star Wars"**:
   - **`item_id`**: The movie ID for "Star Wars" is set to 260. This is the movie for which we want to find the most similar movies.

2. **Finding the Index of the Movie**:
   - **`item_index`**: The index of the movie in the `tfidf_features` DataFrame is found using the `get_loc()` method, which retrieves the position of the movie ID in the DataFrame's index.

3. **Computing Similarity Scores**:
   - **`similarity_scores`**: The cosine similarity values for the selected movie (i.e., "Star Wars") are extracted from the `similarity_matrix`. The `similarity_matrix[item_index]` gives the similarity scores between "Star Wars" and all other movies.
   - The scores are then enumerated into a list of tuples where each tuple contains the movie index and the corresponding similarity score.

4. **Displaying Similarity Scores**:
   - The unsorted list of movie-similarity score pairs is printed. This list includes the similarity scores between "Star Wars" and every other movie, which can be further processed to identify the most similar ones.


In [184]:
# Input: Movie ID of Star Wars to find similar movies
item_id = 260

# Find the corresponding index of the movie ID in tfidf_features
item_index = tfidf_features.index.get_loc(item_id)

# Compute similarity scores for the given movie
# It will contain pairs in the format (id_matrix, sim_score)
similarity_scores = list(enumerate(similarity_matrix[item_index]))
print('Movie-similarity score pairs (not sorted)')
print(similarity_scores)

Movie-similarity score pairs (not sorted)
[(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.005611259521416583), (6, 0.005438556469947753), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.0), (12, 0.0), (13, 0.0), (14, 0.020513331429116138), (15, 0.019995212297021628), (16, 0.0), (17, 0.01934702244621565), (18, 0.0), (19, 0.008713596376681952), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.0), (24, 0.0), (25, 0.0), (26, 0.013276764834495514), (27, 0.0), (28, 0.0), (29, 0.024059512435648148), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.0), (34, 0.0), (35, 0.0051105447999713486), (36, 0.006348626965806443), (37, 0.0), (38, 0.007639667146395728), (39, 0.0), (40, 0.021127327939290914), (41, 0.0), (42, 0.0), (43, 0.0), (44, 0.0), (45, 0.0), (46, 0.011680379906654527), (47, 0.0), (48, 0.004338957731692339), (49, 0.006763629390154573), (50, 0.0), (51, 0.0), (52, 0.0), (53, 0.0), (54, 0.0), (55, 0.0), (56, 0.0), (57, 0.004724119319047174), (58, 0.0), (59, 0.0), (60, 0.0031183166973559394), (61, 0

### Description of the Code

We sort the similarity scores and find the top-k most similar movies to Star Wars

1. **Sorting the Similarity Scores**:
   - **`similarity_scores`**: The list of movie-similarity score pairs is sorted in descending order based on the similarity score. This is done using Python's `sorted()` function, where the `key` argument is a lambda function that extracts the similarity score (the second element in each tuple) to sort by.

2. **Displaying Sorted Similarity Scores**:
   - The sorted list is printed, showing the movie IDs along with their cosine similarity scores, with the most similar movies appearing first.


In [185]:
# Since we are interested in the most similar movies, we sort
# Sort by similarity score in descending order
similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
print('Movie-similarity score pairs (sorted)')
print(similarity_scores)

Movie-similarity score pairs (sorted)
[(257, 0.9999999999999999), (1184, 0.4726571038951565), (1171, 0.4335392852988003), (865, 0.12776183015002765), (3076, 0.10763980156127986), (3533, 0.10130248189765484), (3118, 0.0952227515187248), (2012, 0.08853496379109525), (1342, 0.08412621888340621), (574, 0.08321747918144712), (3633, 0.0814412634666697), (941, 0.08127429536911791), (1248, 0.07953463431301931), (1766, 0.07012974824437315), (462, 0.06934650192307547), (3628, 0.06861253545163931), (2553, 0.06795390810632497), (255, 0.06712046678403279), (2551, 0.06541532509609614), (228, 0.06032442937679674), (2347, 0.060294998631532634), (3103, 0.060220595184710464), (899, 0.05937161781971309), (3267, 0.05453470237410196), (2552, 0.049221093951288604), (3115, 0.04857444626870798), (1339, 0.04838283309302201), (3066, 0.0478597473190266), (2239, 0.047780390419612746), (2100, 0.047733350277797265), (1601, 0.04746997295787472), (3394, 0.04691724273427195), (582, 0.04640025589284035), (506, 0.046245

Finally, we print the top-10

In [186]:
# NB the first movie is Star Wars itself
# Exclude it and select the top 10 similar movies
top_k_similar = similarity_scores[1:11]

# Convert indices back to movie names
top_k_similar_ids = [(tfidf_features.index[i], score) for i, score in top_k_similar]

# Output the results
print(f"Top 10 similar movies to {metadata[metadata['movie_id']==item_id]['name'].values[0]}:")
for movie_id, score in top_k_similar_ids:
    print(f"Title: {metadata[metadata['movie_id']==movie_id]['name'].values[0]}, Similarity Score: {score}")

Top 10 similar movies to Star Wars: Episode IV - A New Hope (1977):
Title: Star Wars: Episode VI - Return of the Jedi (1983), Similarity Score: 0.4726571038951565
Title: Star Wars: Episode V - The Empire Strikes Back (1980), Similarity Score: 0.4335392852988003
Title: First Kid (1996), Similarity Score: 0.12776183015002765
Title: Topsy-Turvy (1999), Similarity Score: 0.10763980156127986
Title: Shanghai Noon (2000), Similarity Score: 0.10130248189765484
Title: Black Sunday (La Maschera Del Demonio) (1960), Similarity Score: 0.0952227515187248
Title: Sleeping Beauty (1959), Similarity Score: 0.08853496379109525
Title: Star Trek V: The Final Frontier (1989), Similarity Score: 0.08412621888340621
Title: Princess Caraboo (1994), Similarity Score: 0.08321747918144712
Title: Coming Home (1978), Similarity Score: 0.0814412634666697


## Classifier as Content-based RecSys

### Data preparation

Here we extract and prepare the movie embeddings for the movies that the user has liked and disliked, based on their TF-IDF representations.

1. **Converting the Sparse Matrix to Dense**:
   - **`movie_embs`**: The sparse `tfidf_matrix` is converted into a dense array using the `toarray()` method. This results in a full matrix where each row represents a movie's TF-IDF embedding, making it easier to process the embeddings for further analysis.

2. **Extracting Embeddings for Liked Movies**:
   - **`likes`**: The movie IDs of the movies that user 1 has liked (where the `bin_rat` is 1) are extracted from the `df` DataFrame.
   - **`like_indices`**: The indices of these movies in the `tfidf_features` DataFrame are found using the `get_loc()` method for each movie ID in the `likes` list.
   - **`like_embeddings`**: The embeddings corresponding to the liked movies are extracted from the `movie_embs` dense matrix using the `like_indices`. This results in an array of TF-IDF embeddings for the liked movies.

3. **Extracting Embeddings for Disliked Movies**:
   - **`dislikes`**: Similarly, the movie IDs of the movies that user 1 has disliked (where the `bin_rat` is 0) are extracted from the `df` DataFrame.
   - **`dislike_indices`**: The indices of the disliked movies are found using the `get_loc()` method, just like with the liked movies.
   - **`dislike_embeddings`**: The embeddings for the disliked movies are extracted from the `movie_embs` matrix using the `dislike_indices`.

4. **Printing the Shapes**:
   - The shapes of the `like_embeddings` and `dislike_embeddings` arrays are printed to confirm how many movies were liked and disliked by user 1, and how many embeddings were retrieved for each.



In [188]:
# from sparse to dense tfidf_matrix
movie_embs = tfidf_matrix.toarray()

# let's gather user 1 like embeddings
likes = df[(df['user_id'] == user_id) & (df['bin_rat'] == 1)]['movie_id']
like_indices = [tfidf_features.index.get_loc(movie_id) for movie_id in likes]
like_embeddings = movie_embs[like_indices]

print(like_embeddings.shape)

# let's gather user 1 dislike embeddings
dislikes = df[(df['user_id'] == user_id) & (df['bin_rat'] == 0)]['movie_id']
dislike_indices = [tfidf_features.index.get_loc(movie_id) for movie_id in dislikes]
dislike_embeddings = movie_embs[dislike_indices]

print(dislike_embeddings.shape)


(134, 24526)
(165, 24526)


In [189]:
like_embeddings

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Format the input

Here we prepare the input features and labels for training a machine learning model (likely a classifier) to predict whether a user would like or dislike a movie based on the movie embeddings.

1. **Combining the Embeddings**:
   - **`X`**: The embeddings for the movies that the user liked and disliked are combined into a single array. Each movie's embedding is represented as a numpy array, and the liked movies' embeddings (`like_embeddings`) are combined with the disliked movies' embeddings (`dislike_embeddings`). The resulting `X` is a 2D array where each row corresponds to a movie's TF-IDF embedding.

2. **Creating the Labels**:
   - **`y`**: A label array is created where movies that the user liked are labeled as 1 (positive class), and movies that the user disliked are labeled as 0 (negative class). The labels are created by concatenating a list of 1s (for likes) and a list of 0s (for dislikes), with the length of each list matching the number of liked and disliked movies.

3. **Printing the Shapes**:
   - The shapes of the `X` (input features) and `y` (labels) arrays are printed to confirm the dimensions. `X` should have a shape of `(number of liked + disliked movies, number of features)` and `y` should have a shape of `(number of movies,)`.


In [190]:
# the input features: user like (positive class), user dislike (negative class)
from sklearn import svm
import numpy as np

# Combine embeddings
X = np.array([np.array(emb) for emb in like_embeddings] + [np.array(emb) for emb in dislike_embeddings])
# create labels (positive and negative)
y = np.array(([1]) * len(likes) + ([0]) * len(dislikes))

print(X.shape)
print(y.shape)


(299, 24526)
(299,)


Let us use the KNN Classifier

### K-Nearest Neighbors Classifier

At this step, we user the **K-Nearest Neighbors (KNN)** algorithm for a recommendation task, using it to classify whether a user is likely to like or dislike a movie based on movie embeddings.

1. **K-Nearest Neighbors Classifier (KNN)**:
   - The **KNeighborsClassifier** is a supervised machine learning algorithm that makes predictions based on the similarity between data points. It works by finding the **k** nearest neighbors to a given data point in the feature space and classifying the data point based on the majority class among those neighbors.
   
   - In our context:
     - Each movie is represented by a feature vector (its TF-IDF embedding).
     - The classifier learns from the embeddings of movies the user has liked (positive class) and disliked (negative class).
     - For a new or unseen movie, the classifier uses the movie's embedding and compares it to the embeddings of the movies the user has already interacted with. The prediction is made based on the majority label (like or dislike) among the **k** nearest movies.

2. **How we use the KNN Classifier**:
   - **`cb_knn = KNeighborsClassifier(n_neighbors=20)`**: This line initializes the KNN classifier with 20 nearest neighbors (`n_neighbors=20`). This means the algorithm will consider the 20 most similar movies to classify the movie as either a like (1) or dislike (0).
   
   - **`cb_knn.fit(X, y)`**: The `fit()` method trains the KNN classifier using the movie embeddings (`X`) as input features and the user’s ratings (likes and dislikes) as the labels (`y`). This allows the model to learn the relationship between movie embeddings and user preferences.

### KNN in the Recommendation Task

- **Task Objective**: The goal is to predict whether the user would like or dislike a given movie based on its features (i.e., its embedding).
- **KNN's Role**: KNN is particularly well-suited for this task because it makes predictions based on the proximity (similarity) of movies in the feature space (defined by the TF-IDF embeddings).
   - For example, after training, when given a new movie's embedding, KNN will find the 20 most similar movies in the feature space (based on cosine similarity or other distance metrics) and predict whether the user is likely to like or dislike the movie by taking a majority vote from the labels of those similar movies.


In [234]:
from sklearn.neighbors import KNeighborsClassifier
cb_knn = KNeighborsClassifier(n_neighbors=20)
cb_knn.fit(X, y)

### Generating Recommendations for Unobserved Movies

Once the recommendation model (the classifier) is trained, we now have to provide the recommendation list to the user; to do so, we predict the score of all the movies in our catalog, find the top-n, and provide them as suggestions.

To do so, we need to:

1. **Identify Unobserved Movies**:
   - **`all_movie_ids`**: A set containing all unique movie IDs present in the `tfidf_features` index (i.e., all movies in the dataset).
   - **`observed_movie_ids`**: A set of movie IDs that the user has rated, which includes both liked (`likes`) and disliked (`dislikes`) movies.
   - **`unobserved_movie_ids`**: The difference between all movie IDs and the observed movie IDs, representing the movies that the user has not rated yet.

2. **Get the indices for Unobserved Movies**:
   - **`unobserved_indices`**: The indices of the unobserved movies are retrieved using the `get_loc()` method from `tfidf_features.index`. These indices correspond to the positions of the unobserved movies in the TF-IDF feature matrix.

3. **Extract the Embeddings for Unobserved Movies**:
   - **`unobserved_embeddings`**: The embeddings for the unobserved movies are extracted from the `movie_embs` dense matrix using the indices of the unobserved movies.

4. **Predict the Scores**:
   - **`scores`**: The `predict_proba()` method of the trained KNN classifier (`cb_knn`) is used to predict the probability of each unobserved movie being liked or disliked by the user. This returns a probability distribution for each unobserved movie across the two classes (like or dislike).



In [235]:
# Identify unobserved movies
# they are movies that have never been rated by users
all_movie_ids = set(tfidf_features.index)
observed_movie_ids = set(likes + dislikes)
unobserved_movie_ids = all_movie_ids - observed_movie_ids

# Get indices for unobserved movies
unobserved_indices = [tfidf_features.index.get_loc(movie_id) for movie_id in unobserved_movie_ids]

# Get the embeddings
unobserved_embeddings = movie_embs[unobserved_indices]  # Convert to dense array

print('I\'m generating the recommendations...')

# Predict scores for unobserved movies using our content based recsys
scores = cb_knn.predict_proba(unobserved_embeddings)

print('Done!')

I'm generating the recommendations...
Done!


Let's check what is inside the variable scores

In [239]:
scores

array([[0.65, 0.35],
       [0.6 , 0.4 ],
       [0.9 , 0.1 ],
       ...,
       [0.65, 0.35],
       [0.55, 0.45],
       [0.5 , 0.5 ]])

And let's check the shpe

In [240]:
scores.shape

(3859, 2)

### Generating Top Movie Recommendations

Finally, we get the top-10 of suggested movies.

1. **Extracting the Probability of "Like"**:
   - **`like_scores`**: The `scores` array returned by `predict_proba()` contains the predicted probabilities for both the "like" (positive) and "dislike" (negative) classes. The second column of the `scores` array corresponds to the probability that the user will "like" a movie. These probabilities are extracted by selecting all rows and the second column (`scores[:, 1]`).

2. **Creating Movie Scores List**:
   - **`movie_scores`**: A list of tuples is created, where each tuple contains a movie ID from the set of unobserved movies and its corresponding "like" score. This list pairs each movie with the likelihood that the user will like it.

3. **Sorting the Movies by "Like" Scores**:
   - **`sorted_movie_scores`**: The `movie_scores` list is sorted in descending order based on the "like" scores. This allows us to prioritize movies that the model predicts the user is most likely to enjoy.

4. **Displaying the Top 10 Recommendations**:
   - The top 10 movies are printed out by iterating through the sorted list. For each of the top 10 recommended movies:
     - The movie name is retrieved from the `metadata` DataFrame using the `movie_id`.
     - The movie name and its corresponding "like" score are displayed in a user-friendly format.


In [241]:
# Extract the probability of "like" from scores (second column)
like_scores = scores[:, 1]

# Create a list of movie IDs and their "like" scores
movie_scores = list(zip(unobserved_movie_ids, like_scores))

# Sort by "like" scores in descending order
sorted_movie_scores = sorted(movie_scores, key=lambda x: x[1], reverse=True)

# Print the top 10 recommendations
print(f"For user {user_id} we suggest...")
for i, (movie_id, score) in enumerate(sorted_movie_scores[0:10]):
    movie_name = metadata[metadata['movie_id'] == movie_id]['name'].values[0]
    print(f"{i+1}): {movie_name}, Score: {score:.4f}")

For user 4593 we suggest...
1): Running Man, The (1987), Score: 0.9000
2): Death and the Maiden (1994), Score: 0.8000
3): Central Station (Central do Brasil) (1998), Score: 0.8000
4): King of the Hill (1993), Score: 0.7500
5): Children of the Revolution (1996), Score: 0.7500
6): Antz (1998), Score: 0.7500
7): Battle of the Sexes, The (1959), Score: 0.7500
8): Titus (1999), Score: 0.7500
9): Anne Frank Remembered (1995), Score: 0.7000
10): White Man's Burden (1995), Score: 0.7000


Possible excercise: change the number of neighbors and check how the recommendation lists changens, both in terms of suggested movies and recommendation scores

## Now, let us use the Naive Bayesian Classifier

### Naive Bayes Classifier

Now, we train a **Naive Bayes classifier** to predict whether a user will like or dislike a movie based on the movie's TF-IDF embeddings.

1. **Naive Bayes Classifier**:
   - **Multinomial Naive Bayes (MultinomialNB)** is a probabilistic classifier that is particularly useful for classification tasks where the features are discrete or represent counts (like word counts in text classification). It is based on applying Bayes' Theorem with strong (naive) independence assumptions between the features.
   
   - **How it works**:
     - It calculates the probability of each class (like or dislike) given the features (movie embeddings), assuming that each feature is conditionally independent given the class.
     - The algorithm then predicts the class that maximizes the probability of the given features.
     - In this case, the classifier is trained to predict whether a user will like (`1`) or dislike (`0`) a movie based on the TF-IDF embedding of the movie.

2. **How the Naive Bayes Classifier is Used in our context**:
   - **`clf = MultinomialNB()`**: This line initializes the **Multinomial Naive Bayes** classifier.
   - **`clf.fit(X, y)`**: The `fit()` method is used to train the Naive Bayes classifier using the input features (`X`), which are the TF-IDF embeddings of the movies, and the labels (`y`), which indicate whether the user liked or disliked the movie. The classifier learns the relationship between the movie embeddings and the user's preferences.


In [243]:
from sklearn.naive_bayes import MultinomialNB
cb_nb = MultinomialNB()
cb_nb.fit(X, y)

As before, we compute the recommendation scores for the unobserved movies

In [244]:
# Identify unobserved movies
# they are movies that have never been rated by users
all_movie_ids = set(tfidf_features.index)
observed_movie_ids = set(likes + dislikes)
unobserved_movie_ids = all_movie_ids - observed_movie_ids

# Get indices for unobserved movies
unobserved_indices = [tfidf_features.index.get_loc(movie_id) for movie_id in unobserved_movie_ids]

# Get the embeddings
unobserved_embeddings = movie_embs[unobserved_indices]  # Convert to dense array

print('I\'m generating the recommendations...')

# Predict scores for unobserved movies using our content based recsys
scores = cb_nb.predict_proba(unobserved_embeddings)

print('Done!')

I'm generating the recommendations...
Done!


In [227]:
scores

array([[0.63438702, 0.36561298],
       [0.60722228, 0.39277772],
       [0.68614468, 0.31385532],
       ...,
       [0.61397919, 0.38602081],
       [0.58245863, 0.41754137],
       [0.34123119, 0.65876881]])

And find the top-10 of recommendations

In [228]:
# Extract the probability of "like" from scores (second column)
like_scores = scores[:, 1]

# Create a list of movie IDs and their "like" scores
movie_scores = list(zip(unobserved_movie_ids, like_scores))

# Sort by "like" scores in descending order
sorted_movie_scores = sorted(movie_scores, key=lambda x: x[1], reverse=True)

# Print the top 10 recommendations
print(f"For user {user_id} we suggest...")
for i, (movie_id, score) in enumerate(sorted_movie_scores[0:10]):
    movie_name = metadata[metadata['movie_id'] == movie_id]['name'].values[0]
    print(f"{i+1}): {movie_name}, Score: {score:.4f}")



For user 4593 we suggest...
1): Winslow Boy, The (1998), Score: 0.7031
2): Sunshine (1999), Score: 0.6983
3): Wilde (1997), Score: 0.6948
4): Oscar and Lucinda (a.k.a. Oscar & Lucinda) (1997), Score: 0.6923
5): Ideal Husband, An (1999), Score: 0.6906
6): Kolya (1996), Score: 0.6873
7): Color of Paradise, The (Rang-e Khoda) (1999), Score: 0.6854
8): Breaker Morant (1980), Score: 0.6758
9): Sense and Sensibility (1995), Score: 0.6745
10): Mansfield Park (1999), Score: 0.6740
