# Practice PS06: Recommendations engines (interactions-based)

For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies.

Author: Tània Pazos Puig

E-mail: tania.pazos01@estudiant.upf.edu

Date: 06/11/2024

# 1. The Movies dataset

Our dataset will be the 25M version of [MovieLens DataSet](https://grouplens.org/datasets/movielens/) released in late 2019. We will use a sub-set containing only movies released in the 2000s, and only 10% of the users and all of their ratings.

* **MOVIES** are described in `movies-2000s.csv` in the following format: `movieId,title,genres`
* **RATINGS** are contained in `ratings-2000s.csv` in the following format: `userId,movieId,rating`
* **TAGS** are contained in `tags-2000s.csv` in the following format: `userId,movieId,tag,timestamp`

## 1.1. Load the input files

In [1]:
# Leave this code as-is

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import*
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Leave this code as-is

FILENAME_MOVIES = "data/movielens-25M-filtered/movies-2000s.csv"
FILENAME_RATINGS = "data/movielens-25M-filtered/ratings-2000s.csv"
FILENAME_TAGS = "data/movielens-25M-filtered/tags-2000s.csv"

In [3]:
# Leave this code as-is

movies = pd.read_csv(FILENAME_MOVIES, 
                    sep=',', 
                    engine='python', 
                    encoding='latin-1',
                    names=['movie_id', 'title', 'genres'])
display(movies.head(5))

ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    encoding='latin-1',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,movie_id,title,genres
0,2769,"Yards, The (2000)",Crime|Drama
1,3177,Next Friday (2000),Comedy
2,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
3,3225,Down to You (2000),Comedy|Romance
4,3228,Wirey Spindell (2000),Comedy


Unnamed: 0,user_id,movie_id,rating
0,4,1,3.0
1,4,260,3.5
2,4,296,4.0
3,4,541,4.5
4,4,589,4.0


## 1.2. Merge the data into a single dataframe

First, we join the data into a single dataframe with columns: user_id, movie_id, rating, title, genres.

In [4]:
ratings = pd.merge(ratings_raw, movies, how='inner', on='movie_id')

display(ratings.head(5))

Unnamed: 0,user_id,movie_id,rating,title,genres
0,4,3624,2.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
1,152,3624,3.0,Shanghai Noon (2000),Action|Adventure|Comedy|Western
2,171,3624,3.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western
3,276,3624,4.0,Shanghai Noon (2000),Action|Adventure|Comedy|Western
4,494,3624,3.5,Shanghai Noon (2000),Action|Adventure|Comedy|Western


Next, we use the code from the previous practice for the function `find_movies`, that lists movies containing a keyword.

In [5]:
def find_movies(keyword, movies_df):
    keyword_lower = keyword.lower()
    
    # Find movies that contain the keyword in the title
    results = movies_df[movies_df['title'].str.lower().str.contains(keyword_lower, na=False)]
    
    for index, row in results.iterrows():
        print(f"movie_id: {row['movie_id']}, title: {row['title']}")

In [6]:
# LEAVE AS-IS

# For testing, this should print 9 movies
find_movies("Spider-Man", movies)

movie_id: 5349, title: Spider-Man (2002)
movie_id: 8636, title: Spider-Man 2 (2004)
movie_id: 52722, title: Spider-Man 3 (2007)
movie_id: 76709, title: Spider-Man: The Ultimate Villain Showdown (2002)
movie_id: 95510, title: Amazing Spider-Man, The (2012)
movie_id: 110553, title: The Amazing Spider-Man 2 (2014)
movie_id: 122926, title: Untitled Spider-Man Reboot (2017)
movie_id: 195159, title: Spider-Man: Into the Spider-Verse (2018)
movie_id: 201773, title: Spider-Man: Far from Home (2019)


The following function returns the title of a movie given its movie_id.

In [7]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [8]:
# LEAVE AS-IS

# For testing, should print "Spider-Man 2 (2004)"
print(get_title(8636, movies))

Spider-Man 2 (2004)


## 1.3. Count unique records

Next, we determine the number of unique users and unique movies in the `ratings` dataframe, and the total number of movies in the `movies` variable.

In [9]:
num_users = len(ratings['user_id'].unique())
num_rated_movies = len(ratings['movie_id'].unique())
total_movies = len(movies['movie_id'].unique())

print(f"Number of users who have rated a movie : {num_users}")
print(f"Number of movies that have been rated  : {num_rated_movies}")
print(f"Total number of movies                 : {total_movies}")

Number of users who have rated a movie : 12676
Number of movies that have been rated  : 2049
Total number of movies                 : 33168


# 2. Item-based Collaborative Filtering

For the purpose of this assignment, we will implement **item-based collaborative filtering** using **Pearson similarity**.

## 2.1. Data pre-processing

Firstly, we create a new dataframe called "rated_movies" that is simply the "ratings" dataset with column genres removed.

In [10]:
rated_movies = ratings.drop(columns=['genres'])

display(rated_movies.head(10))

Unnamed: 0,user_id,movie_id,rating,title
0,4,3624,2.5,Shanghai Noon (2000)
1,152,3624,3.0,Shanghai Noon (2000)
2,171,3624,3.5,Shanghai Noon (2000)
3,276,3624,4.0,Shanghai Noon (2000)
4,494,3624,3.5,Shanghai Noon (2000)
5,1148,3624,2.5,Shanghai Noon (2000)
6,1967,3624,2.0,Shanghai Noon (2000)
7,2189,3624,4.0,Shanghai Noon (2000)
8,2287,3624,4.0,Shanghai Noon (2000)
9,2360,3624,4.0,Shanghai Noon (2000)


Now, we create a new dataframe named `ratings_summary` containing the following columns:

* movie_id
* title
* ratings_mean (average rating)
* ratings_count (number of people who have rated this movie)

In [11]:
# Initialize ratings_summary to be only the movie_id and title of movies in rated_movies
ratings_summary = rated_movies[['movie_id', 'title']].groupby('movie_id', as_index=False).first()

# Compute the average rating for each movie
ratings_mean = rated_movies.groupby('movie_id')['rating'].mean()

# Compute the count of ratings for each movie
ratings_count = rated_movies.groupby('movie_id')['rating'].count()

# Merge the ratings_mean and ratings_count with ratings_summary to align by 'movie_id'
ratings_summary = pd.merge(ratings_summary, ratings_mean, on='movie_id', how='left')
ratings_summary = pd.merge(ratings_summary, ratings_count, on='movie_id', how='left')

# Rename the columns to match the required names
ratings_summary.rename(columns={'rating_x': 'ratings_mean', 'rating_y': 'ratings_count'}, inplace=True)

# Display the first 10 rows of the ratings_summary dataframe
display(ratings_summary.head(10))

Unnamed: 0,movie_id,title,ratings_mean,ratings_count
0,2769,"Yards, The (2000)",3.122549,102
1,3177,Next Friday (2000),2.824,125
2,3190,Supernova (2000),2.395683,139
3,3225,Down to You (2000),2.577273,110
4,3228,Wirey Spindell (2000),2.5,2
5,3239,Isn't She Great? (2000),1.947368,19
6,3273,Scream 3 (2000),2.444664,759
7,3275,"Boondock Saints, The (2000)",3.870682,1071
8,3276,Gun Shy (2000),3.33871,31
9,3279,Knockout (2000),2.0,2


Now, we print the top 5 highest rated movies, considering only movies receiving at least 100 ratings. Note that we keep the original indices of the `ratings_summary` dataframe.

In [12]:
top_rated_movies_100 = ratings_summary[ratings_summary.ratings_count >= 100]
top_rated_movies_100 = top_rated_movies_100.sort_values(by='ratings_mean', ascending=False)
display(top_rated_movies_100.head(5))

Unnamed: 0,movie_id,title,ratings_mean,ratings_count
740,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.215216,2458
881,6016,City of God (Cidade de Deus) (2002),4.186592,2133
259,4226,Memento (2000),4.158512,4476
1250,7156,Fog of War: Eleven Lessons from the Life of Ro...,4.112013,308
488,4973,"Amelie (Fabuleux destin d'AmÃ©lie Poulain, Le)...",4.097234,3687


We repeat this process, but this time considering movies receiving at least 3 ratings.

In [13]:
top_rated_movies_3 = ratings_summary[ratings_summary.ratings_count >= 3]
top_rated_movies_3 = top_rated_movies_3.sort_values(by='ratings_mean', ascending=False)
display(top_rated_movies_3.head(5))

Unnamed: 0,movie_id,title,ratings_mean,ratings_count
534,5082,"Rumor of Angels, A (2000)",4.666667,6
1778,27764,2LDK (2003),4.5,3
1961,31954,Beautiful City (Shah-re ziba) (2004),4.4,5
572,5224,Promises (2001),4.388889,18
1149,6775,Life and Debt (2001),4.333333,3


When the minimum number of ratings is set to a small value such as 3, movies with only a few ratings are likely to appear in the top-rated list. However, this may not reflect the movie's quality, since an average over so few ratings is easily skewed, particularly if those ratings happen to be unusually high. For example, 2LDK (2003) has a high average rating (4.5), but this is based on only 3 ratings. In contrast, Memento (2000) has a slightly lower average rating (4.158512), but it is based on a much larger sample of 4476 ratings.
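One way to make this trade-off explicit is to shrink each movie's mean toward the global mean, so that movies with few ratings are pulled more strongly. This is not part of the assignment; it is a minimal sketch on invented toy data, and the function name `damped_mean` and the strength parameter `k` are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings, invented for illustration (not taken from MovieLens)
toy = pd.DataFrame({
    "movie_id": [1, 1, 1] + [2] * 10,
    "rating": [5.0, 5.0, 4.5] + [4.0, 4.5] * 5,
})

def damped_mean(ratings_df, k=5):
    """Shrink each movie's mean toward the global mean.

    k behaves like k extra 'virtual' ratings equal to the global mean
    (hypothetical parameter, to be tuned).
    """
    global_mean = ratings_df["rating"].mean()
    stats = ratings_df.groupby("movie_id")["rating"].agg(["sum", "count", "mean"])
    stats["damped_mean"] = (stats["sum"] + k * global_mean) / (stats["count"] + k)
    return stats

toy_summary = damped_mean(toy, k=5)
print(toy_summary[["count", "mean", "damped_mean"]])
```

In this toy example the movie with 3 ratings moves further toward the global mean than the movie with 10 ratings, so the gap between them narrows.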

## 2.2. Compute the user-movie matrix

Now, we generate a `user_movie` matrix where columns are movies and rows are users, and each movie-user cell contains the score of that user for that movie.

In [14]:
user_movie = rated_movies.pivot_table(index='user_id', columns='movie_id', values='rating')

display(user_movie.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
33,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,4.5,,,...,,,,,,,,,,3.5
63,,,,,,,,,,,...,,,,,,,,,,
95,,,,,,,,3.5,,,...,,,,,,,,,,


In [15]:
# Compare matrix dimensions with results obtained in 1.3
num_rows = user_movie.shape[0]
num_cols = user_movie.shape[1]
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

Number of rows: 12676
Number of columns: 2049


Note that the number of rows of the matrix is the number of unique users that rated a movie, and the number of columns in the matrix is the number of unique movies that have been rated.

It should also be pointed out that the `user_movie` matrix has many NaN values because each user has rated only a small fraction of the 2049 movies. As a result, only a small portion of the user-movie pairs in the `user_movie` matrix are filled with ratings, while the rest are NaN values. This characteristic of user ratings in recommender systems is known as sparsity. It poses a significant challenge in collaborative filtering, as it means there is limited information to generate accurate recommendations.
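Sparsity can be quantified as the fraction of cells that actually hold a rating. The sketch below uses a small invented matrix; the helper name `matrix_density` is an assumption, and the same function could be applied to `user_movie`.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (3 users x 4 movies), invented for illustration
toy_user_movie = pd.DataFrame(
    {
        10: [4.0, np.nan, np.nan],
        20: [np.nan, 3.5, np.nan],
        30: [5.0, np.nan, 2.0],
        40: [np.nan, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

def matrix_density(matrix):
    """Fraction of user-movie cells that contain a rating (non-NaN)."""
    return matrix.notna().to_numpy().sum() / matrix.size

density = matrix_density(toy_user_movie)
print(f"Density: {density:.2%}, sparsity: {1 - density:.2%}")
```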

## 2.3. Explore some correlations in the user-movie matrix

First, we will locate the movie_id for three movies: "Lord of the Rings: The Fellowship of the Ring (2001)", "Finding Nemo (2003)" and "Talk to Her (Hable con Ella) (2002)". Then, we will obtain the ratings for each of these movies and consolidate the three resulting series into the dataframe `ratings3`. From this dataframe, we will keep only users that have rated the 3 movies.

In [16]:
# Retrieve the movie_id for the specified movies from the dataset
id_pivot = 4993
id_m1 = 6377
id_m2 = 5878

# Obtain ratings for each movie, dropping NaN values
s1 = user_movie[id_pivot].dropna()
s2 = user_movie[id_m1].dropna()
s3 = user_movie[id_m2].dropna()

# Consolidate these series into a single dataframe
ratings3 = pd.concat([s1, s2, s3], axis=1)

# Drop rows with NaN values to keep only users who rated all three movies
ratings3.dropna(inplace=True)

display(ratings3.head(10))

Unnamed: 0_level_0,4993,6377,5878
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
859,3.0,4.0,5.0
1229,4.0,4.0,4.5
1281,3.0,2.5,3.0
1722,5.0,4.5,4.0
2004,4.5,3.0,3.5
4590,4.0,4.0,2.0
5052,2.0,4.0,4.0
5144,5.0,5.0,5.0
6497,3.5,3.5,3.5
8369,3.0,4.0,4.5


The code below computes all Pearson correlations between the three movies.

In [17]:
# Calculate the correlations between each pair of movies
similarity_lotr_finding_nemo = ratings3[4993].corr(ratings3[6377])
similarity_lotr_talk_to_her = ratings3[4993].corr(ratings3[5878])
similarity_finding_nemo_talk_to_her = ratings3[6377].corr(ratings3[5878])

print(f"Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': {similarity_lotr_finding_nemo:.2f}")
print(f"Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': {similarity_lotr_talk_to_her:.2f}")
print(f"Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': {similarity_finding_nemo_talk_to_her:.2f}")

Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Finding Nemo (2003)': 0.38
Similarity between 'Lord of the Rings: The Fellowship of the Ring, The (2001)' and 'Talk to Her (Hable con Ella) (2002)': 0.16
Similarity between 'Finding Nemo (2003)' and 'Talk to Her (Hable con Ella) (2002)': 0.20


On the one hand, "Lord of the Rings: The Fellowship of the Ring (2001)" and "Finding Nemo (2003)" have a moderate correlation of 0.38, indicating that users who rated one of them highly tended to rate the other highly as well.

On the other hand, the lower correlation between "Lord of the Rings: The Fellowship of the Ring (2001)" and "Talk to Her (Hable con Ella) (2002)" (0.16) and between "Finding Nemo (2003)" and "Talk to Her (Hable con Ella) (2002)" (0.20) suggests that "Talk to Her (Hable con Ella) (2002)" has a more specific audience, whose ratings are only weakly related to their ratings of mainstream fantasy or family-friendly films.
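To make explicit what `.corr()` computes, the sketch below implements the Pearson correlation by hand and compares it with pandas. It uses only the first five rows of `ratings3` displayed above (movies 4993 and 6377), so the value differs from the 0.38 obtained on all common raters; the helper name `pearson` is an assumption.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation of two equally long, NaN-free sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center each variable on its own mean
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# First five rows of ratings3 as displayed above
lotr = [3.0, 4.0, 3.0, 5.0, 4.5]   # movie 4993
nemo = [4.0, 4.0, 2.5, 4.5, 3.0]   # movie 6377

manual = pearson(lotr, nemo)
with_pandas = pd.Series(lotr).corr(pd.Series(nemo))
print(f"manual: {manual:.4f}, pandas: {with_pandas:.4f}")
```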

Now, we will compute the correlation of the "pivot" movie "Lord of the Rings: The Fellowship of the Ring (2001)" with all other movies. We will store the result in a new dataframe named `similarity_to_pivot` containing two columns: `movie_id` and `corr_with_pivot`.

In [18]:
# Extract ratings for the pivot movie into a single-column dataframe
df_pivot = pd.DataFrame(user_movie[id_pivot].dropna()).rename(columns={id_pivot: "rating"})

correlations = [] # List to store all correlations with the pivot

# Loop through each movie in user_movie columns
for movie_id in user_movie.columns:
    # Extract ratings for the current movie into a single-column dataframe
    df_movie = pd.DataFrame(user_movie[movie_id].dropna()).rename(columns={movie_id: "rating"})
    
    # Compute correlation between the pivot and each movie and store in the list
    corr = df_pivot.corrwith(df_movie).iloc[0]
    correlations.append((movie_id, corr))

# Convert the list to a dataframe and rename columns
similarity_to_pivot = pd.DataFrame(correlations, columns=['movie_id', 'corr_with_pivot'])

# Drop rows where correlation with pivot is NaN
# (no overlapping users who rated both the pivot movie and the individual movie)
similarity_to_pivot.dropna(subset=['corr_with_pivot'], inplace=True)

display(similarity_to_pivot.head(10))

Unnamed: 0,movie_id,corr_with_pivot
0,2769,-0.127515
1,3177,0.093221
2,3190,0.041206
3,3225,0.1266
5,3239,0.338378
6,3273,0.166968
7,3275,0.182484
8,3276,0.134264
10,3285,0.075311
11,3286,0.242781
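
The explicit loop above can probably be replaced by a single call to `DataFrame.corrwith`, which correlates every column with a given Series using only the users both have rated. The sketch below checks this equivalence on an invented toy matrix; the column ids 100, 200 and 300 are hypothetical, with 100 playing the role of the pivot.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix, invented for illustration; column 100 is the "pivot"
toy_user_movie = pd.DataFrame(
    {
        100: [5.0, 4.0, 3.0, 2.0, np.nan],
        200: [4.5, 4.0, 2.5, 2.0, 1.0],
        300: [1.0, np.nan, 3.0, 5.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
toy_user_movie.columns.name = "movie_id"
toy_pivot_id = 100

# Loop version: one Series.corr call per movie (NaN pairs are dropped)
looped = pd.Series(
    {m: toy_user_movie[toy_pivot_id].corr(toy_user_movie[m]) for m in toy_user_movie.columns}
)

# Vectorized version: correlate all columns with the pivot column at once
vectorized = toy_user_movie.corrwith(toy_user_movie[toy_pivot_id])

toy_similarity = (
    vectorized.rename("corr_with_pivot").rename_axis("movie_id").reset_index().dropna()
)
print(toy_similarity)
```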


Next, we create a dataframe `corr_with_pivot` by using `similarity_to_pivot` and `ratings_summary`. This dataframe will have the following columns:

* movie_id
* corr_with_pivot (the correlation between movies movie_id and id_pivot)
* title
* ratings_mean
* ratings_count

We will keep only rows in which *ratings_count* > 500, i.e., popular movies, and display the 20 movies rated more than 500 times that have the highest correlation with the pivot movie "Lord of the Rings: The Fellowship of the Ring (2001)".

In [19]:
corr_with_pivot = pd.merge(similarity_to_pivot, ratings_summary, on='movie_id')

# Filter for popular movies (ratings_count > 500)
corr_with_pivot = corr_with_pivot[corr_with_pivot['ratings_count'] > 500]

# Sort by correlation with the pivot movie in descending order
corr_with_pivot = corr_with_pivot.sort_values(by='corr_with_pivot', ascending=False)

display(corr_with_pivot[['movie_id', 'corr_with_pivot', 'title', 'ratings_mean', 'ratings_count']].head(20))

Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count
481,4993,1.0,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
808,5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
1178,7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
987,6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
1340,8368,0.340934,Harry Potter and the Prisoner of Azkaban (2004),3.809971,2397
55,3578,0.337667,Gladiator (2000),3.95105,4811
86,3793,0.329686,X-Men (2000),3.556436,3535
451,4896,0.31918,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.678509,2843
68,3624,0.307471,Shanghai Noon (2000),3.297443,1017
1775,31658,0.303898,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.064417,1141


The movie with the highest correlation with the pivot movie is, naturally, the pivot movie itself. The films "Lord of the Rings: The Two Towers" (2002) and "Lord of the Rings: The Return of the King" (2003) show a strong correlation with the pivot movie (around 0.89) since they are part of the same trilogy and attract the same audience.

However, starting from the fourth position, we see a significant drop in correlation values, from 0.89 to 0.38. This suggests that the rating patterns of The Lord of the Rings audience are fairly specific to the trilogy.

Films like the Harry Potter series, as well as other action and fantasy films, show moderate correlations with the pivot movie, which makes sense given the shared genres. Nevertheless, there are some surprising links with superhero films and animated movies such as "Spider-Man" and "Shrek". This suggests that the audience of Lord of the Rings is also likely to enjoy action and animated movies.

Regarding the threshold on `ratings_count`, if we set it to a large value (e.g., 4000), we lose diversity of movie types, as only popular and widely rated movies appear in the list. This may stop us from discovering more specific preferences that could be interesting when making recommendations. However, correlations computed for movies with a large number of ratings tend to be more stable (and reliable).
On the other hand, if we set the threshold to a very small value (e.g., 10), we will obtain less reliable correlation values, as they are strongly affected by outliers and individual preferences. Although the list then contains a wider range of movies, this does not make up for the lack of reliability.
All in all, we should choose an intermediate threshold that ensures reliable correlation values while still preserving diversity.

## 2.4. Implement the item-based recommendations

We are now ready to implement the item-based recommender. We will compute all correlations between columns (movies) in the matrix `user_movie` and store the result in `item_similarity`.

In [20]:
# Each cell in item_similarity matrix will contain the correlation between a pair of movies
item_similarity = user_movie.corr()

display(item_similarity.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,0.115068,0.033721,-0.232268,,-0.5,0.197011,0.199514,0.250873,,...,0.37998,0.87831,,,,0.248126,0.1806095,-0.08557,-0.408248,0.105671
3177,0.115068,1.0,0.30382,0.559533,,,0.331191,0.167918,1.0,,...,0.546119,0.735767,-1.0,,,-0.221382,0.3174747,0.014735,0.661989,0.185654
3190,0.033721,0.30382,1.0,0.636361,,-0.014315,0.146042,0.394293,-0.290397,,...,0.246183,0.632026,,,,0.378181,0.1709261,0.022444,-0.07336,-0.054114
3225,-0.232268,0.559533,0.636361,1.0,,0.578414,0.347716,0.263671,-0.250313,,...,-0.300376,0.318377,,,,0.480173,0.7503063,0.536828,0.753141,0.098748
3228,,,,,1.0,,,,,,...,,,,,,,,,,
3239,-0.5,,-0.014315,0.578414,,1.0,0.180846,1.0,,,...,,,,,,1.0,,1.0,0.636285,0.8882
3273,0.197011,0.331191,0.146042,0.347716,,0.180846,1.0,0.105735,0.154371,,...,0.006774,0.409968,1.0,,,0.088405,0.07516779,0.143492,0.466705,0.084202
3275,0.199514,0.167918,0.394293,0.263671,,1.0,0.105735,1.0,0.485071,,...,-0.011426,0.279624,,,,0.075827,0.2994603,0.187713,0.285584,0.225317
3276,0.250873,1.0,-0.290397,-0.250313,,,0.154371,0.485071,1.0,,...,,0.29277,,,,0.0,-6.885311000000001e-17,-0.45553,0.5,-0.138013
3279,,,,,,,,,,1.0,...,,,,,,,,,,


Since similarities between movies that do not have many ratings in common are unreliable, we will re-generate `item_similarity` setting `min_periods` (the minimum number of elements in common that two columns must have to compute the correlation) to 100 and store the result in `item_similarity_min_ratings`.

In [21]:
item_similarity_min_ratings = user_movie.corr(min_periods=100)
display(item_similarity_min_ratings.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,,,,,,,,,,...,,,,,,,,,,
3177,,1.0,,,,,,,,,...,,,,,,,,,,
3190,,,1.0,,,,,,,,...,,,,,,,,,,
3225,,,,1.0,,,,,,,...,,,,,,,,,,
3228,,,,,,,,,,,...,,,,,,,,,,
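
The effect of `min_periods` is easier to see on a small example. In the invented toy matrix below, movies 1 and 3 share only two raters, so their correlation is a perfect -1 by construction; requiring at least 3 common ratings turns it into NaN. The same rule applies to the diagonal, which would explain why movie 3228 (rated only twice) has NaN even with itself in the output above.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix, invented for illustration (rows are users)
toy = pd.DataFrame({
    1: [5.0, 4.0, 3.0, 2.0, 1.0],
    2: [4.0, 4.5, 3.0, 2.5, 1.0],
    3: [np.nan, np.nan, np.nan, 1.0, 5.0],  # only two ratings
})

corr_all = toy.corr()                  # any overlap is accepted
corr_min3 = toy.corr(min_periods=3)    # at least 3 common raters required

print(corr_all.round(2))
print(corr_min3.round(2))
```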


Next, in order to test our recommender, we select a couple of interesting users.

Our first user, `user_id_super` will be someone who has given the following 3 films a rating higher than 4.5:

* movie_id=5349: *Spider-Man (2002)*
* movie_id=3793: *X-Men (2000)*
* movie_id=6534: *Hulk (2003)* 	

Our second user, `user_id_drama` will be someone who has given the following 3 films a rating higher than 4.5:

* movie_id=6870: *Mystic River (2003)*
* movie_id=5995: *Pianist, The (2002)*
* movie_id=3555: *U-571 (2000)*

The following code finds the user ids of the two example users: `user_id_super` and `user_id_drama`. 

In [22]:
superhero_movies = [5349, 3793, 6534]  # Spider-Man, X-Men, Hulk
drama_movies = [6870, 5995, 3555]      # Mystic River, The Pianist, U-571

# To find user_id_super, search for someone who has rated 
# the following 3 movies above 4.5
user_id_super = user_movie[
    (user_movie[5349] > 4.5) & 
    (user_movie[3793] > 4.5) & 
    (user_movie[6534] > 4.5)
].index

# To find user_id_drama, search for someone who has rated 
# the 3 following movies above 4.5
user_id_drama = user_movie[
    (user_movie[6870] > 4.5) & 
    (user_movie[5995] > 4.5) & 
    (user_movie[3555] > 4.5)
].index

# In case multiple users are found, select the first user in the list
user_id_super = user_id_super[0] if not user_id_super.empty else None
user_id_drama = user_id_drama[0] if not user_id_drama.empty else None

print(f"user_id_super: {user_id_super}")
print(f"user_id_drama: {user_id_drama}")

user_id_super: 127342
user_id_drama: 34336


Some auxiliary functions are provided below.

In [23]:
# Leave this code as-is

# Gets a list of watched movies for a user_id
def get_watched_movies(user_id, user_movie):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)
    
# Gets the rating a user_id has given to a movie_id
def get_rating(user_id, movie_id, user_movie):
    return user_movie[movie_id][user_id]

# Print watched movies
def print_watched_movies(user_id, user_movie, movies):
    for movie_id in get_watched_movies(user_id, user_movie):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_id, movie_id, user_movie), get_title(movie_id, movies)))


In [24]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_super, user_movie, movies)

5502 5.0 Signs (2002) 
5445 5.0 Minority Report (2002) 
6156 5.0 Shanghai Knights (2003) 
5952 5.0 Lord of the Rings: The Two Towers, The (2002) 
5944 5.0 Star Trek: Nemesis (2002) 
5816 5.0 Harry Potter and the Chamber of Secrets (2002) 
5618 5.0 Spirited Away (Sen to Chihiro no kamikakushi) (2001) 
5524 5.0 Blue Crush (2002) 
5480 5.0 Stuart Little 2 (2002) 
5459 5.0 Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (2002) 
5420 5.0 Windtalkers (2002) 
4388 5.0 Scary Movie 2 (2001) 
5389 5.0 Spirit: Stallion of the Cimarron (2002) 
5349 5.0 Spider-Man (2002) 
5218 5.0 Ice Age (2002) 
5064 5.0 The Count of Monte Cristo (2002) 
4993 5.0 Lord of the Rings: The Fellowship of the Ring, The (2001) 
4973 5.0 Amelie (Fabuleux destin d'AmÃ©lie Poulain, Le) (2001) 
4896 5.0 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001) 
4886 5.0 Monsters, Inc. (2001) 
6186 5.0 Gods and Generals (2003) 
6333 5.0 X2: X-Men United (2003) 
6377 5.0 Finding Nemo (2003) 
6

In [25]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_drama, user_movie, movies)

3967 5.0 Billy Elliot (2000) 
4014 5.0 Chocolat (2000) 
4034 5.0 Traffic (2000) 
5995 5.0 Pianist, The (2002) 
7147 5.0 Big Fish (2003) 
4995 5.0 Beautiful Mind, A (2001) 
3555 5.0 U-571 (2000) 
6870 5.0 Mystic River (2003) 
5991 5.0 Chicago (2002) 
8464 5.0 Super Size Me (2004) 
5669 5.0 Bowling for Columbine (2002) 
8622 5.0 Fahrenheit 9/11 (2004) 
30707 5.0 Million Dollar Baby (2004) 
6953 4.5 21 Grams (2003) 
5015 4.5 Monster's Ball (2001) 
5464 4.5 Road to Perdition (2002) 
3510 4.5 Frequency (2000) 
5989 4.5 Catch Me If You Can (2002) 
4022 4.0 Cast Away (2000) 
5010 4.0 Black Hawk Down (2001) 
5299 4.0 My Big Fat Greek Wedding (2002) 
3897 4.0 Almost Famous (2000) 
3755 4.0 Perfect Storm, The (2000) 
4308 4.0 Moulin Rouge (2001) 
4447 3.5 Legally Blonde (2001) 
4246 3.5 Bridget Jones's Diary (2001) 
4975 3.5 Vanilla Sky (2001) 
4019 3.5 Finding Forrester (2000) 
5377 3.5 About a Boy (2002) 
3948 3.5 Meet the Parents (2000) 
5956 3.0 Gangs of New York (2002) 
6281 3.0 Phone Booth

The following function `get_movies_relevance` calculates a relevance score for each movie based on its similarity to the movies the user has rated. For a given user, the relevance of a movie is the weighted sum of the similarities between that movie and all the movies the user has rated, with the user's ratings as weights. The output is a dataframe with two columns, `movie_id` and `relevance`, where higher relevance scores indicate movies that are more similar to those the user has rated highly. Note that movies the user has already rated also receive a score at this stage.

In [26]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
    # Create an empty series to store relevance scores
    movies_relevance = pd.Series(dtype=float)
    
    # Get the user's ratings from the user_movie matrix
    user_ratings = user_movie.loc[user_id].dropna()
    
    # Iterate through the movies the user has rated
    for watched_movie, rating_given in user_ratings.items():
        
        # Obtain similarity vector between rated movie and all other movies
        similarities = item_similarity_matrix[watched_movie]
        
        # Multiply each similarity by the rating given
        weighted_similarities = similarities * rating_given
        
        # Append to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
        
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df
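
The weighted sum computed by this function can also be written as a matrix-vector product: treat missing similarities and unrated movies as 0 and multiply the similarity matrix by the user's rating vector. The sketch below checks this on an invented 3-movie example; all names prefixed with `toy_` are assumptions, and this is an illustration rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Toy item-item similarity matrix, invented for illustration (NaN = unknown)
toy_sim = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, 0.2],
     [np.nan, 0.2, 1.0]],
    index=[10, 20, 30],
    columns=[10, 20, 30],
)
toy_user_ratings = pd.Series({10: 4.0, 30: 2.0})  # movie 20 is unrated

# Loop version: add rating-weighted similarities, skipping NaN
loop_relevance = pd.Series(0.0, index=toy_sim.index)
for movie, rating in toy_user_ratings.items():
    loop_relevance += (toy_sim[movie] * rating).fillna(0.0)

# Vectorized version: similarity matrix times the (zero-filled) rating vector
rating_vector = toy_user_ratings.reindex(toy_sim.columns).fillna(0.0)
vec_relevance = toy_sim.fillna(0.0).dot(rating_vector)

print(vec_relevance)
```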

Next, we apply `get_movies_relevance` to the two users we have selected (`user_id_super` and `user_id_drama`) and merge the result with the `movies` dataframe to obtain movie titles. We then display the results by descending relevance and print the top 5 recommendations for each user.

In [32]:
# Apply get_movies_relevance for each user
# Note we use the matrix item_similarity_min_ratings
# since similarities between movies that do not have many ratings
# in common are unreliable
user_id_super_relevance = get_movies_relevance(user_id_super, user_movie, item_similarity_min_ratings)
user_id_drama_relevance = get_movies_relevance(user_id_drama, user_movie, item_similarity_min_ratings)

# Merge with the movies df on movie_id to obtain movie titles
user_id_super_recommendations = user_id_super_relevance.merge(movies, on='movie_id')
user_id_drama_recommendations = user_id_drama_relevance.merge(movies, on='movie_id')

# Sort by descending relevance and get top 5 for each user
user_id_super_top5 = user_id_super_recommendations.sort_values(by='relevance', ascending=False).head(5)
user_id_drama_top5 = user_id_drama_recommendations.sort_values(by='relevance', ascending=False).head(5)

# Display the top 5 recommendations for each user
print("Top 5 recommendations for user_id_super:")
display(user_id_super_top5[['movie_id', 'title', 'relevance']])

print("\nTop 5 recommendations for user_id_drama:")
display(user_id_drama_top5[['movie_id', 'title', 'relevance']])


Top 5 recommendations for user_id_super:


Unnamed: 0,movie_id,title,relevance
1472,8644,"I, Robot (2004)",189.170085
663,5459,Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (...,181.63812
85,3753,"Patriot, The (2000)",176.650945
1414,8361,"Day After Tomorrow, The (2004)",172.899804
310,4310,Pearl Harbor (2001),172.700877



Top 5 recommendations for user_id_drama:


Unnamed: 0,movie_id,title,relevance
1572,8958,Ray (2004),65.46137
195,4019,Finding Forrester (2000),63.007635
1055,6565,Seabiscuit (2003),61.354376
501,4995,"Beautiful Mind, A (2001)",61.21305
508,5014,I Am Sam (2001),61.209632


Firstly, the relevance scores for the top 5 recommendations for `user_id_super` (the superhero fan) are relatively high (between 172 and 189). Because relevance is an unnormalized sum over all the movies the user has rated, these values reflect both the similarity to the user's rated movies and how many movies the user has rated. The list of recommendations includes "I, Robot" and "Men in Black II", two sci-fi movies packed with adventure that align well with the interest of `user_id_super` in superhero films; note that "Men in Black II" is a movie this user has already rated, since watched movies have not been filtered out yet. With slightly lower relevance scores we encounter three drama-oriented movies: "The Patriot", "The Day After Tomorrow" and "Pearl Harbor". Although these movies lack traditional superheroes, they depict heroism as courage and strength in extreme situations (war or environmental disaster), which may appeal to the superhero audience.

For `user_id_drama` (the drama fan), the relevance scores are considerably lower (between 61 and 65). This does not necessarily indicate a worse match, since the score is a sum and therefore depends on how many movies the user has rated. Two of the recommended movies ("Ray" and "A Beautiful Mind") are biographical; while at first glance they might seem misaligned with the user's preferences, they deal with typical dramatic topics like real-life struggles, talent and resilience. Movies like "Finding Forrester", "Seabiscuit" and "I Am Sam" focus on key drama themes like social issues and personal growth, making them a good fit for the target user. Note that "A Beautiful Mind" and "Finding Forrester" are movies this user has already rated.
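Since the raw relevance scores are not comparable across users, one commonly used alternative is to divide the weighted sum by the sum of absolute similarities, which yields a score on the rating scale. This is not what the notebook implements; the sketch below is an assumption-laden illustration on invented data, and the function name `predict_ratings` is hypothetical.

```python
import numpy as np
import pandas as pd

def predict_ratings(user_ratings, sim):
    """Similarity-weighted average of the user's ratings, per movie.

    user_ratings: Series indexed by movie_id (NaN = unrated).
    sim: item-item similarity DataFrame (NaN = unknown similarity).
    """
    rated = user_ratings.dropna().index
    s = sim.loc[:, rated].fillna(0.0)          # similarities to rated movies only
    numerator = s.dot(user_ratings.loc[rated])
    denominator = s.abs().sum(axis=1)
    # Movies with no known similarity to any rated movie get NaN
    return (numerator / denominator).where(denominator > 0)

# Toy data, invented for illustration
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=[10, 20, 30],
    columns=[10, 20, 30],
)
toy_ratings = pd.Series({10: 5.0, 20: np.nan, 30: 2.0})

predicted = predict_ratings(toy_ratings, toy_sim)
print(predicted.round(3))
```

With non-negative similarities, as in this toy example, each score stays between the user's lowest and highest rating, which makes scores comparable between users who have rated different numbers of movies.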

The following code defines the function `get_recommended_movies`, which removes the movies the user has already watched from the list of recommendations.

In [28]:
def get_recommended_movies(user_id, user_movie, item_similarity_matrix):
    # Relevant movies for the user
    relevant_movies = get_movies_relevance(user_id, user_movie, item_similarity_matrix)

    # Set this dataframe index to movie_id
    relevant_movies.set_index('movie_id', inplace=True)

    # Get the list of watched movies
    watched_movies = get_watched_movies(user_id, user_movie)

    # Drop the watched movies from the relevant movies df
    recommended_movies = relevant_movies.loc[~relevant_movies.index.isin(watched_movies)]

    return recommended_movies
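
The filtering step can be illustrated in isolation. The following is a minimal sketch on toy data: the movie ids, relevance scores and watched list are hypothetical and are not taken from the MovieLens files. It only shows how `isin` combined with `~` keeps the rows whose index is not in the watched list.

```python
import pandas as pd

# Toy relevance table (hypothetical ids and scores, not taken from the dataset)
relevant_movies = pd.DataFrame(
    {"movie_id": [10, 20, 30, 40], "relevance": [9.0, 7.5, 6.0, 4.5]}
).set_index("movie_id")

# Hypothetical list of movies the user has already rated
watched_movies = [20, 40]

# isin() flags rows whose index appears in watched_movies; ~ negates the mask
mask_unwatched = ~relevant_movies.index.isin(watched_movies)
recommended_movies = relevant_movies.loc[mask_unwatched]

print(recommended_movies)
```

Since the mask is built from the index, this approach assumes that `movie_id` is the index of the relevance dataframe, as in the function above.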

Finally, we print the 10 most recommended (unwatched) movies for `user_id_super` and `user_id_drama`.

In [33]:
# Again, use item_similarity_min_ratings to ensure similarity scores are reliable
recommended_super = get_recommended_movies(user_id_super, user_movie, item_similarity_min_ratings)
recommended_drama = get_recommended_movies(user_id_drama, user_movie, item_similarity_min_ratings)

# Merge with movies df to get movie titles
recommended_super = recommended_super.merge(movies[['movie_id', 'title']], on='movie_id', how='left')
recommended_drama = recommended_drama.merge(movies[['movie_id', 'title']], on='movie_id', how='left')

# Sort by descending relevance
recommended_super_sorted = recommended_super.sort_values(by='relevance', ascending=False)
recommended_drama_sorted = recommended_drama.sort_values(by='relevance', ascending=False)

print("Top 10 recommendations for user_id_super:")
display(recommended_super_sorted.head(10))
print("\nTop 10 recommendations for user_id_drama:")
display(recommended_drama_sorted.head(10))

Top 10 recommendations for user_id_super:


Unnamed: 0,movie_id,relevance,title
898,6365,166.866641,"Matrix Reloaded, The (2003)"
163,4018,165.338077,What Women Want (2000)
169,4025,163.032765,Miss Congeniality (2000)
614,5507,161.080324,xXx (2002)
908,6378,155.293219,"Italian Job, The (2003)"
1791,31685,154.993274,Hitch (2005)
130,3948,150.570934,Meet the Parents (2000)
285,4369,148.949754,"Fast and the Furious, The (2001)"
1091,6934,148.394158,"Matrix Revolutions, The (2003)"
429,4963,148.251901,Ocean's Eleven (2001)



Top 10 recommendations for user_id_drama:


Unnamed: 0,movie_id,relevance,title
1519,8958,65.46137,Ray (2004)
1015,6565,61.354376,Seabiscuit (2003)
482,5014,61.209632,I Am Sam (2001)
1267,7325,59.820898,Starsky & Hutch (2004)
1198,7149,59.294621,Something's Gotta Give (2003)
330,4448,58.968024,"Score, The (2001)"
1312,7445,58.192646,Man on Fire (2004)
533,5152,58.004447,We Were Soldiers (2002)
81,3753,57.920754,"Patriot, The (2000)"
240,4223,57.482846,Enemy at the Gates (2001)


After filtering out the movies each user has already watched, the resulting lists show a decrease in the relevance scores, especially for `user_id_super`: the top score falls from about 189 to about 167. For `user_id_drama`, the top recommendation ("Ray", with a score of about 65) is unchanged, and only the lower ranks decrease.

Regarding the recommendations for `user_id_super`, a few movies align well with the user's preference for superhero themes. Indeed, movies like "The Matrix Reloaded", "The Matrix Revolutions", "xXx" and "The Fast and the Furious" are packed with adrenaline and spectacular visuals, which are key elements of the superhero genre. Nevertheless, some recommendations, such as "What Women Want", "Miss Congeniality", "Meet the Parents" and "Hitch", fall into the comedy and romance genres and might be less appealing to a superhero-focused audience.

On the other hand, many of `user_id_drama`'s recommendations closely align with the user's interest in dramatic stories. Movies like "Ray", "I Am Sam" and "Seabiscuit" are good fits, since they are built around personal growth and social injustice. The list also includes war dramas ("We Were Soldiers" and "Enemy at the Gates"), whose strong emotional impact is likely to connect with drama audiences. However, "Starsky & Hutch" leans more toward comedy, and "Something’s Gotta Give" is centered on both comedy and romance, making them less likely to resonate with fans of traditional drama.

In a nutshell, the lists of recommendations obtained after dropping previously watched movies are less aligned with the users' preferences, as they include several outliers that may not be relevant to the target users.
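
The comparison above relies on reading the titles. One possible way to make it less subjective is to measure what fraction of a recommendation list carries a given genre label, using the pipe-separated `genres` column of the movies file. The sketch below is only an illustration: `genre_share`, `movies_toy` and `recommended_toy` are hypothetical names, and the titles, genres and scores are made-up toy values rather than results from the dataset.

```python
import pandas as pd

# Toy stand-ins for the movies dataframe and a top-N recommendation list
movies_toy = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Drama", "Comedy|Romance", "Drama|War", "Action|Sci-Fi"],
})
recommended_toy = pd.DataFrame({"movie_id": [1, 2, 3], "relevance": [3.0, 2.0, 1.0]})


def genre_share(recommended, movies, genre):
    """Fraction of recommended movies whose pipe-separated genres include `genre`."""
    merged = recommended.merge(movies[["movie_id", "genres"]], on="movie_id", how="left")
    # Split "Drama|War" into ["Drama", "War"] and test for an exact genre match
    has_genre = merged["genres"].fillna("").str.split("|").apply(lambda g: genre in g)
    return has_genre.mean()


print(genre_share(recommended_toy, movies_toy, "Drama"))
```

Applied to the real lists, the same function could be called once before and once after removing watched movies, so that the two shares can be compared directly.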

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>