# Practice PS06: Recommendations engines (interactions-based)

For this assignment we will build and apply an item-based collaborative filtering recommenders for movies. 

Author: <font color="blue">Shaoya Zhang</font>

E-mail: <font color="blue">shaoya.zhang01@estudiant.upf.edu</font>

Date: <font color="blue">2025-11-03</font>

# 1. The Movies dataset

We will use the same dataset as in PS05, the the [32M version of Movielens](https://www.kaggle.com/datasets/justsahil/movielens-32m), which was released in 2024. We will use a sub-set containing only movies released in 2000 or later, and only 10% of the users and all of their ratings.

**MOVIES** are described in `ml32m-movies-2000s.csv.gz` in the following format: `movieId,title,genres`.

**RATINGS** are contained in `ml32m-ratings-2000s.csv.gz` in the following format: `userId,movieId,rating`

**TAGS** are contained in `ml32m-tags-2000s.csv.gz` in the following format: `userId,movieId,tag`

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1.1. Load the input files

In [111]:
# LEAVE THIS CODE AS-IS
# But feel free to add imports in an extra cell if needed

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import *
import random
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [112]:
# LEAVE THIS CODE AS-IS

FILENAME_MOVIES = "ml32m-movies-2000s.csv.gz"
FILENAME_RATINGS = "ml32m-ratings-2000s.csv.gz"
FILENAME_TAGS = "ml32m-tags-2000s.csv.gz"

In [113]:
# LEAVE THIS CODE AS-IS

# Load movies
movies = pd.read_csv(FILENAME_MOVIES, 
                    compression='gzip',
                    sep=',', 
                    engine='python', 
                    encoding='utf-8',
                    names=['movie_id', 'title', 'genres'])

# Remove header row from this file
movies.drop(index=0, inplace=True)

# Make sure the movie id is numeric
movies["movie_id"] = pd.to_numeric(movies["movie_id"])
display(movies.head(5))

Unnamed: 0,movie_id,title,genres
1,2769,"Yards, The (2000)",Crime|Drama
2,3177,Next Friday (2000),Comedy
3,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
4,3225,Down to You (2000),Comedy|Romance
5,3228,Wirey Spindell (2000),Comedy


In [114]:
# LEAVE THIS CODE AS-IS

# Load ratings
ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    compression='gzip',
                    encoding='utf-8',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,user_id,movie_id,rating
0,4,223,4.0
1,4,1210,3.0
2,4,1272,4.0
3,4,1327,3.0
4,4,1513,2.0


# 1.2. Merge the data into a single dataframe

Join the data into a single dataframe that should contain columns: user_id, movie_id, rating, timestamp, title, genders.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code from the previous practice that joined these three dataframes using "merge" into a single dataframe named "ratings". Print the first 5 rows of the resulting dataframe, which should contain columns "user_id", "movie_id", "rating", "title", and "genres".</font>

In [115]:
ratings = pd.merge(ratings_raw, movies, how='inner', on='movie_id')[
    ['user_id', 'movie_id', 'rating', 'title', 'genres']
]
display(ratings.head(10))

Unnamed: 0,user_id,movie_id,rating,title,genres
0,33,3285,4.5,"Beach, The (2000)",Adventure|Drama
1,33,3535,4.5,American Psycho (2000),Crime|Horror|Mystery|Thriller
2,33,3578,4.5,Gladiator (2000),Action|Adventure|Drama
3,33,3617,1.5,Road Trip (2000),Comedy
4,33,3623,4.0,Mission: Impossible II (2000),Action|Adventure|Thriller
5,33,3751,4.5,Chicken Run (2000),Animation|Children|Comedy
6,33,3793,4.0,X-Men (2000),Action|Adventure|Sci-Fi
7,33,3948,4.0,Meet the Parents (2000),Comedy
8,33,3949,5.0,Requiem for a Dream (2000),Drama
9,33,4011,4.5,Snatch (2000),Comedy|Crime|Thriller


<font size="+1" color="red">Replace this cell with your code from the previous practice for "find_movies" that list movies containing a keyword</font>

In [116]:
def find_movies(keyword, df):
    for _, row in df.iterrows():
        if keyword.lower() in row['title'].lower():
            print(f"movie_id: {row['movie_id']}, title: {row['title']}")

In [117]:
# LEAVE AS-IS

# For testing, this should print 6 movies
find_movies("Final Destination", movies)

movie_id: 3409, title: Final Destination (2000)
movie_id: 6058, title: Final Destination 2 (2003)
movie_id: 43679, title: Final Destination 3 (2006)
movie_id: 71252, title: Final Destination, The (Final Destination 4) (Final Destination in 3-D, The) (2009)
movie_id: 85278, title: City of Your Final Destination, The (2009)
movie_id: 88932, title: Final Destination 5 (2011)


The following function, which you can leave as-is, prints the title of a movie given its movie_id.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [118]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [119]:
# LEAVE AS-IS

# For testing, should print "Final Destination 5 (2011)"
print(get_title(88932, movies))

Final Destination 5 (2011)


## 1.3. Count unique registers

Count the number of unique users and unique movies in the `ratings` variable. Use [unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html). Print also the total number of movies in the `movies` variable. Your code should print:

```
Number of users who have rated a movie : 16348
Number of movies that have been rated  : 2878
Total number of movies                 : 51444
```

Note that ratings are heavily concentrated on a few popular movies.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your own code to indicate the number of unique users and unique movies in the "ratings" variable.</font>

In [120]:
users_unique = ratings.user_id.unique()
movies_uni = ratings.movie_id.unique()
num_movies_tot = movies.movie_id.unique()

print(f'Number of users who have rated a movie : {len(users_unique)}')
print(f'Number of movies that have been rated  : {len(movies_uni)}')
print(f'Total number of movies                 : {len(num_movies_tot)}')

Number of users who have rated a movie : 16348
Number of movies that have been rated  : 2878
Total number of movies                 : 51444


# 2. Item-based Collaborative Filtering

The two main types of interactions-based recommender system, also known as *collaborative filtering* algorithms are:

1. **User-based Collaborative Filtering**: To recommend items for user A, we first look at other users B1, B2, ..., Bk with a similar behavior to A, and aggregate their preferences. For instance, if all Bi like a movie that A has not watched, it would be a good candidate to be recommended. 


2. **Item-based Collaborative Filtering**: To recommend items for user A, we first look at all the items I1, I2, ..., Ik that the user A has consumed, and find items that elicit similar ratings from other users. For instnce, an item that is rated positively by the same users that rate positively the Ii items, and negatively by the same users that rate negatively the Ii items, would be a good candidate to be recommended.

In both cases, a similarity matrix needs to be built. For user-based, the **user-similarity matrix** will consist of some **distance metrics** that measure the similarity between any two pairs of users. For item-based, the **matrix** will measure the similarity between any two pairs of items.

As we already know, there are several metrics strategy for measure the "similarity" of two items. Some of the most used metrics are Jaccard, Cosine and Pearson. Meanwhile, Jaccard similarity is based on the number of users which have rated item A and B divided by the number of users who have rated either A or B (very useful for those use cases where there is not a numeric rating but just a boolean value like a product being bought), in Pearson and Cosine similarities we measure the similarity between two vectors.

For the purpose of this assignment, we will use **Pearson Similarity** and we will implement a **Item-based Collaborative filtering**.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 2.1. Data pre-processing

Firstly, create a new dataframe called "rated_movies" that is simply the "ratings" dataset with column genres removed using the [Drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title</font>

In [121]:
rated_movies = ratings.drop(columns='genres', inplace=False)
display(rated_movies.head(10))

Unnamed: 0,user_id,movie_id,rating,title
0,33,3285,4.5,"Beach, The (2000)"
1,33,3535,4.5,American Psycho (2000)
2,33,3578,4.5,Gladiator (2000)
3,33,3617,1.5,Road Trip (2000)
4,33,3623,4.0,Mission: Impossible II (2000)
5,33,3751,4.5,Chicken Run (2000)
6,33,3793,4.0,X-Men (2000)
7,33,3948,4.0,Meet the Parents (2000)
8,33,3949,5.0,Requiem for a Dream (2000)
9,33,4011,4.5,Snatch (2000)


Now, using the `rated_movies` dataframe, create a new dataframe named `ratings_summary` containing the following columns:

* movie_id
* title
* ratings_mean (average rating)
* ratings_count (number of people who have rated this movie)

You can use the following operations:

* Initialize `ratings_summary` to be only the movie_id and title of all movies in `rated_movies`
   * To group dataframe `df` by column `a` and keep only one unique row per value of `a`, use: `df.groupby('a').first()`
* Compute three series: `ratings_mean`, `ratings_median`, `ratings_count`:
   * To obtain a series with the XX of column `a` for each distinct value of column `b` in dataframe `df`, use: `df.groupby(b)['a'].XX()` (XX=mean, median, count)
* Add these series to the `ratings_summary`
   * To add a series `s` with column name `a` to dataframe `df`, use: `df['a'] = s`
    
<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to generate "ratings_summary" and print the first 10 rows.</font>

In [122]:
# Only preserve movie_id and title 
ratings_summary = rated_movies.groupby('movie_id', as_index=False).first()[['movie_id', 'title']]

# group by movie_id to compute mean and count
ratings_mean = rated_movies.groupby('movie_id')['rating'].mean()
ratings_count = rated_movies.groupby('movie_id')['rating'].count()
ratings_median = rated_movies.groupby('movie_id')['rating'].median()

# Add the new Series to ratings_summary
ratings_summary['ratings_mean'] = ratings_summary['movie_id'].map(ratings_mean)
ratings_summary['ratings_count'] = ratings_summary['movie_id'].map(ratings_count)
ratings_summary['ratings_median'] = ratings_summary['movie_id'].map(ratings_median)

# Round ratings
ratings_summary['ratings_mean'] = ratings_summary['ratings_mean'].round(2)

display(ratings_summary.head(10))


Unnamed: 0,movie_id,title,ratings_mean,ratings_count,ratings_median
0,2769,"Yards, The (2000)",3.0,73,3.0
1,3177,Next Friday (2000),2.92,161,3.0
2,3190,Supernova (2000),2.34,131,2.5
3,3225,Down to You (2000),2.68,109,3.0
4,3228,Wirey Spindell (2000),1.67,3,2.0
5,3239,Isn't She Great? (2000),2.29,29,2.0
6,3273,Scream 3 (2000),2.44,832,2.5
7,3275,"Boondock Saints, The (2000)",3.88,1406,4.0
8,3276,Gun Shy (2000),3.02,31,3.0
9,3279,Knockout (2000),1.0,1,1.0


To select from dataframe A those having column C larger or equal to N, you can do `A[A.C >= N]`.

To sort dataframe A by decreasing values of column C, you can do `A.sort_values(by='C', ascending=False)`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to print the top 10 highest rated movies by average rating and the top 10 highest rated movies by median rating, considering only movies receiving at least 1000 ratings.</font>

In [123]:
# Filter movies with ratings >= 1000
popular_movies = ratings_summary[ratings_summary['ratings_count'] >= 1000]

# Top 10 mean
top10_mean = popular_movies.sort_values(by='ratings_mean', ascending=False).head(10)
print("Top 10 highest rated movies by average rating (min 1000 ratings):")
display(top10_mean[['title', 'ratings_mean', 'ratings_count']])

# Top 10 median
top10_median = popular_movies.sort_values(by='ratings_median', ascending=False).head(10)
print("\nTop 10 highest rated movies by median rating (min 1000 ratings):")
display(top10_median[['title', 'ratings_mean', 'ratings_median', 'ratings_count']])

Top 10 highest rated movies by average rating (min 1000 ratings):


Unnamed: 0,title,ratings_mean,ratings_count
736,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.22,3607
2469,"Lives of Others, The (Das leben der Anderen) (...",4.21,1321
882,City of God (Cidade de Deus) (2002),4.19,2848
259,Memento (2000),4.16,5804
2692,"Departed, The (2006)",4.14,3819
2718,"Prestige, The (2006)",4.1,3610
498,"Lord of the Rings: The Fellowship of the Ring,...",4.09,7935
1934,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.09,1703
1255,"Lord of the Rings: The Return of the King, The...",4.09,7365
849,"Lord of the Rings: The Two Towers, The (2002)",4.07,7447



Top 10 highest rated movies by median rating (min 1000 ratings):


Unnamed: 0,title,ratings_mean,ratings_median,ratings_count
498,"Lord of the Rings: The Fellowship of the Ring,...",4.09,4.5,7935
2469,"Lives of Others, The (Das leben der Anderen) (...",4.21,4.5,1321
1255,"Lord of the Rings: The Return of the King, The...",4.09,4.5,7365
882,City of God (Cidade de Deus) (2002),4.19,4.5,2848
736,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.22,4.5,3607
29,Erin Brockovich (2000),3.58,4.0,2326
35,High Fidelity (2000),3.82,4.0,2240
45,American Psycho (2000),3.7,4.0,2053
60,Gladiator (2000),3.94,4.0,6230
414,Training Day (2001),3.77,4.0,1604


<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings, and having a median of 4.5 or above.</font>

In [124]:
top_rating_median = ratings_summary[(ratings_summary['ratings_count'] >= 3) & (ratings_summary['ratings_median'] >= 4.5)]
top_10 = top_rating_median.sort_values(by=['ratings_count', 'ratings_median'], ascending=[False, False])
print("\nTop rated movies by median rating and rating count (min 1000 ratings):")
display(top_10.head(10))


Top rated movies by median rating and rating count (min 1000 ratings):


Unnamed: 0,movie_id,title,ratings_mean,ratings_count,ratings_median
498,4993,"Lord of the Rings: The Fellowship of the Ring,...",4.09,7935,4.5
1255,7153,"Lord of the Rings: The Return of the King, The...",4.09,7365,4.5
736,5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.22,3607,4.5
882,6016,City of God (Cidade de Deus) (2002),4.19,2848,4.5
2469,44555,"Lives of Others, The (Das leben der Anderen) (...",4.21,1321,4.5
1393,7767,"Best of Youth, The (La meglio gioventÃ¹) (2003)",4.25,54,4.5
1421,8327,Dolls (2002),4.2,52,4.5
1668,27423,"O Auto da Compadecida (Dog's Will, A) (2000)",4.24,35,4.5
1902,31148,Day of the Wacko (Dzien swira) (2002),3.61,18,4.5
298,4271,Eureka (YurÃ®ka) (2000),4.08,13,4.5


<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, comparing the three lists above.</font>

### Answer
The first list ranks movies by their average rating among those with at least 1,000 ratings.
It highlights globally acclaimed titles like Spirited Away, City of God, Memento, and the Lord of the Rings trilogy — all with both high popularity and strong audience approval. These results reflect general audience preferences weighted by a large number of votes, favoring mainstream but high-quality films.

The second list, based on median ratings (also requiring ≥1,000 ratings), produces a very similar top set — the same major titles reappear, though their order shifts slightly.
This indicates that these films are consistently well-rated, not merely benefiting from a few extreme high scores. The median is more robust to outliers, so it confirms that these movies have uniformly positive reception.

The third list relaxes the popularity constraint (only ≥3 ratings) but still requires a median ≥ 4.5.
Here, while the top entries again include the same celebrated films, we also see lesser-known international or niche productions (The Best of Youth, Dolls, Dog’s Will, Eureka). These titles have very few ratings but extremely high medians — likely from a small group of passionate viewers.
This suggests that when the rating threshold is low, the ranking becomes less reliable but can highlight cult favorites or highly specialized films.

## 2.2. Compute the user-movie matrix

Before calculating the **similarity matrix**, we create a table where columns are movies and rows are users, and each movie-user cell contains the score of that user for that movie.

We will use the [pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) function of Pandas, which receives a dataframe plus the variable that will make the rows, the variable that will make the columns, and the variable that will make the cells, and transform it into a matrix of the specified rows, columns, and cells.

For instance, if you have a dataframe D containing:

```
U V W
1 a 3.0
1 b 2.0
2 a 1.0
2 c 4.0
```

Calling `D.pivot_table(index='U', columns='V', values='W')` will create the following:

```
V  a   b   c
U
1 3.0 2.0 NaN
2 1.0 NaN 4.0
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to generate a "user_movie" matrix by calling "pivot_table" on "rated_movies". Print the first 5 rows. It might take about one minute to compute, depending on your computer.</font>

In [140]:
user_movie = rated_movies.pivot_table(index='user_id', columns='movie_id', values='rating')
print(user_movie.head(5))

movie_id  2769   3177   3190   3225   3228   3239   3273   3275   3276   \
user_id                                                                   
33          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
63          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
94          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
95          NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
131         NaN    NaN    NaN    0.5    NaN    NaN    NaN    NaN    NaN   

movie_id  3279   ...  51187  51194  51255  51312  51314  51317  51402  51412  \
user_id          ...                                                           
33          NaN  ...    NaN    NaN    4.0    NaN    NaN    NaN    NaN    NaN   
63          NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
94          NaN  ...    NaN    NaN    4.5    NaN    NaN    NaN    NaN    NaN   
95          NaN  ...    NaN    NaN    NaN    NaN    NaN    NaN    NaN    N

<font size="+1" color="red">Replace this a brief commentary indicating why do you think the "user_movie" matrix has so many "NaN" values. How do we call this characteristic of user ratings in recommender systems?</font>

### Answer
The user_movie matrix has many NaN values because most users rate only a few of the thousands of available movies.
This results in many missing entries where no rating exists.
Such a pattern is called data sparsity in recommender systems — a common challenge caused by the limited overlap between users and the items they rate.

# 2.3. Explore some correlations in the user-movie matrix

Now let us explore whether correlations in this matrix make sense.

1. Locate the movie_id for the following three movies:
  * *Finding Nemo (2003)* -- name this id_pivot
  * *Animatrix, The (2003)* -- name this id_m1
  * *Hey Arnold! The Movie (2002)* -- name this id_m2
2. Obtain the ratings for each of these movies: `user_movie[movie_id].dropna()`. You will obtain a column, containing a series of ratings for each movie.
3. Consolidate these four series into a single dataframe: `ratings3 = pd.concat([s1, s2, s3], axis=1)`
4. Drop from `ratings3` all rows containing a *NaN*. This will keep only the users that have rated all the 3 movies.
5. Display the first 10 rows from this table.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute and display the first 10 rows of the "ratings3" table as described above.</font>

In [141]:
# Movies id
id_pivot = ratings[ratings['title'] == 'Finding Nemo (2003)']['movie_id'].iloc[0]
id_m1 = ratings[ratings['title'] == 'Animatrix, The (2003)']['movie_id'].iloc[0]
id_m2 = ratings[ratings['title'] == 'Hey Arnold! The Movie (2002)']['movie_id'].iloc[0]

# Ratings
s1 = user_movie[id_pivot]
s2 = user_movie[id_m1]
s3 = user_movie[id_m2]

ratings3 = pd.concat([s1, s2, s3], axis=1)
ratings3.columns = ['Finding Nemo', 'Animatrix', 'Hey Arnold']
ratings3 = ratings3.dropna()
ratings3 = ratings3.reset_index()
display(ratings3.head(10))

Unnamed: 0,user_id,Finding Nemo,Animatrix,Hey Arnold
0,9867,3.5,2.5,2.5
1,26686,4.0,3.0,3.0
2,95370,5.0,5.0,5.0
3,181756,3.5,4.0,3.0


<font size="+1" color="red">Replace this cell with an explanation, in your own words, of the contents of *ratings_3*.</font>

### Anwser
The dataframe ratings3 shows the ratings from four users who evaluated the same three movies: Finding Nemo (2003), The Animatrix (2003), and Hey Arnold! The Movie (2002). When comparing horizontally, each user’s ratings across the three films are relatively consistent, which suggests that these users have stable personal preferences. For example, user 95370 gave all three movies a perfect score of 5.0, indicating strong overall appreciation, while user 9867 rated all the three movies with relative lower scores.

When comparing vertically, across users, Finding Nemo tends to receive higher ratings overall than Hey Arnold! and The Animatrix, which suggests it is generally better received among users who watched all three. This difference in collective ratings reflects how users with similar viewing habits can still express varying degrees of enthusiasm for different films.

Overall, The table shows that the three movies receive generally consistent ratings across users, but individual users’ rating habits vary significantly. Some users tend to rate higher overall, while others rate more conservatively, so similar scores may reflect personal rating style as much as actual preference. This variation is important to consider in collaborative filtering.

To compute Pearson correlation, we use the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html) method.

To compute the correlation between two columns `a`, `b` in dataframe `df`, we use: `df[a].corr(df[b])`.

Compute the correlations between all pairs of columns of the `ratings3` table. You should display:

```
Similarity between 'Finding Nemo (2003)' and 'Animatrix, The (2003)': 0.74
Similarity between 'Finding Nemo (2003)' and 'Hey Arnold! The Movie (2002)': 0.96
Similarity between 'Animatrix, The (2003)' and 'Hey Arnold! The Movie (2002)': 0.90
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute all correlations between these three movies, as described above.</font>

In [142]:
corr1 = ratings3['Finding Nemo'].corr(ratings3['Animatrix']).round(2)
print(f"Similarity between 'Finding Nemo (2003)' and 'Animatrix, The (2003)': {corr1}")
corr2 = ratings3['Finding Nemo'].corr(ratings3['Hey Arnold']).round(2)
print(f"Similarity between 'Finding Nemo (2003)' and 'Hey Arnold! The Movie (2002)': {corr2}")
corr3 = ratings3['Animatrix'].corr(ratings3['Hey Arnold']).round(2)
print(f"Similarity between 'Animatrix, The (2003)' and 'Hey Arnold! The Movie (2002)': {corr3}")

Similarity between 'Finding Nemo (2003)' and 'Animatrix, The (2003)': 0.74
Similarity between 'Finding Nemo (2003)' and 'Hey Arnold! The Movie (2002)': 0.96
Similarity between 'Animatrix, The (2003)' and 'Hey Arnold! The Movie (2002)': 0.9


<font size="+1" color="red">Replace this cell with a brief commentary on the correlations you find.</font>

### Answer
The correlations between the three movies are all quite high, ranging from 0.74 to 0.96, indicating that users who rated one of these movies tended to rate the others similarly. The strongest correlation is between Finding Nemo and Hey Arnold!, suggesting that these two movies appeal to very similar users. Animatrix also shows strong correlations with both, though slightly lower with Finding Nemo, which may indicate small differences in user preferences or movie style. Overall, these results reflect the presence of strong positive relationships in user ratings, which can be leveraged in collaborative filtering to recommend movies to users with similar tastes.

Now let us take the first movie selected above, the one with movie_id `id_pivot`.

Select the column corresponding to this movie in `user_movies` and compute its correlation with all other columns in `user_movies`. This can be done with  [corrwith](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corrwith.html).

To extract the ratings for a movie into a dataframe containing a single column "rating", you can use:

```
df = pd.DataFrame(user_movie[id_movie].dropna()).rename(columns={id_movie: "rating"})
```

To compute the correlation between two single-column dataframes `df1` and `df2`, you can use:

```
corr = df1.corrwith(df2)[0]
```

Store the result in a new dataframe named `similarity_to_pivot` containing two columns: `movie_id` and `corr_with_pivot`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create a "similarity_to_pivot" series that contains the computed correlations, droping the NaNs in the series.</font>

In [143]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

df_pivot = pd.DataFrame(user_movie[id_pivot].dropna()).rename(columns={id_pivot: "rating"})

# Compute correlations of the pivot movie with all other movies
similarity_to_pivot = user_movie.corrwith(df_pivot['rating'])

# Drop NaN correlations (movies with no overlapping users)
similarity_to_pivot = similarity_to_pivot.dropna()

# 4. Convert to a dataframe
similarity_to_pivot = similarity_to_pivot.reset_index()
similarity_to_pivot.columns = ['movie_id', 'corr_with_pivot']

display(similarity_to_pivot.sort_values(by='corr_with_pivot', ascending=False).head(15))


Unnamed: 0,movie_id,corr_with_pivot
2502,51317,1.0
2202,45303,1.0
916,6377,1.0
1850,33847,1.0
898,6336,1.0
1020,6737,1.0
1109,6962,1.0
1599,27811,1.0
2158,44671,1.0
1049,6833,1.0


Next, create a dataframe `corr_with_pivot` by using `similarity_to_pivot` and `ratings_summary`. This dataframe should have the following columns:

* movie_id
* corr_with_pivot - the correlation between movies movie_id and id_pivot
* title
* ratings_mean
* ratings_count

Keep only rows in which *ratings_count* > 500, i.e., popular movies. To filter a dataframe `df` and keep only rows having column `c` larger than `x`, use `df[df[c] > x]`.

Display the top 10 rows with the largest correlation. To select the largest `n` rows from dataframe `df` according to column `c`, use `df.sort_values(c, ascending=False).head(n)`. 

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create a "corr_with_pivot" dataframe as specified above, and to print the 20 movies (rated 1000 times or more, and with a median rating of 4.0 or better) with the highest correlation with the selected movie.</font>

In [129]:
# Merge similarity_to_pivot with ratings_summary on movie_id
corr_with_pivot = pd.merge(similarity_to_pivot, ratings_summary, on='movie_id',how='inner')
# Filter movies with at least 1000 ratings and median rating >= 4.0
popular_high_rated = corr_with_pivot[
    (corr_with_pivot['ratings_count'] >= 1000) &
    (ratings_summary['ratings_median'] >= 4.0)
]

# Sort by correlation with pivot movie, descending
top_corr_movies = popular_high_rated.sort_values(
    by='corr_with_pivot',
    ascending=False
).head(20)

display(top_corr_movies)

  popular_high_rated = corr_with_pivot[


Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count,ratings_median
1426,8961,0.556336,"Incredibles, The (2004)",3.82,4575,4.0
2225,45722,0.381615,Pirates of the Caribbean: Dead Man's Chest (2006),3.45,2359,3.5
338,4701,0.341302,Rush Hour 2 (2001),3.29,1105,3.5
259,4246,0.330092,Bridget Jones's Diary (2001),3.41,1992,3.5
1700,31658,0.327425,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.09,1703,4.0
1649,30749,0.316286,Hotel Rwanda (2004),3.96,1525,4.0
917,6378,0.294069,"Italian Job, The (2003)",3.56,1849,3.5
180,4018,0.288693,What Women Want (2000),3.1,1571,3.0
2268,47099,0.281284,"Pursuit of Happyness, The (2006)",3.88,1522,4.0
462,4993,0.280466,"Lord of the Rings: The Fellowship of the Ring,...",4.09,7935,4.5


<font size="+1" color="red">Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?</font>

### Answer
The movies on this list tend to be popular, widely-seen films that share similar audiences with the pivot movie. Titles like Incredibles, The (2004), Howl’s Moving Castle, and Lord of the Rings: The Fellowship of the Ring are generally well-rated family or adventure films, while other entries like Rush Hour 2 or X-Men: The Last Stand reflect more action-oriented tastes. The correlations are moderate rather than extremely high, suggesting that while these movies appeal to similar viewers, they are not identical in style or genre.

If the condition on ratings_count is set to a much larger value, the list will shrink to include only very popular films with many ratings, which tends to favor blockbuster titles and may exclude niche or cult favorites. Conversely, setting it to a much smaller value allows more obscure movies with fewer ratings to appear; this increases the variability of the correlation values because fewer overlapping ratings can produce less reliable similarity estimates.

# 2.4. Implement the item-based recommendations

Now that we believe that this type of correlation sort of makes sense, let us implement the item-based recommender. We need all correlations between columns in `user_movie`.

To compute all correlations between columns in a dataframe, use [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html). This function receives a matrix with *r* rows and *c* columns, and returns a square matrix of *c x c* containing all pair-wise correlations between columns.

**This process may take a few minutes.**

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in "item_similarity", and print the first 10 rows.</font>

In [144]:
item_similarity = user_movie.corr()
display(item_similarity.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,51187,51194,51255,51312,51314,51317,51402,51412,51418,51433
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,-0.406295,-0.049976,0.119232,,,0.261728,-0.166238,-0.754337,,...,,,0.359602,,,,,-0.274301,,
3177,-0.406295,1.0,0.680619,0.174519,,,-0.010883,0.321711,0.09759,,...,-1.0,,0.223872,,,,,0.707963,,
3190,-0.049976,0.680619,1.0,0.358953,,-0.582209,0.114346,0.215633,-0.304842,,...,,,-0.047137,,,,,0.593796,,
3225,0.119232,0.174519,0.358953,1.0,,0.5,0.318758,0.473451,0.044622,,...,,,-0.398374,,,,,0.963143,,
3228,,,,,1.0,,1.0,,,,...,,,,,,,,,,
3239,,,-0.582209,0.5,,1.0,0.591579,,0.5,,...,,,-1.0,,,,,,,
3273,0.261728,-0.010883,0.114346,0.318758,1.0,0.591579,1.0,0.176796,-0.168983,,...,1.0,,0.170269,,,,,0.206301,-0.567178,
3275,-0.166238,0.321711,0.215633,0.473451,,,0.176796,1.0,0.495821,,...,0.298818,,0.233542,-1.0,,,,0.191663,0.342029,
3276,-0.754337,0.09759,-0.304842,0.044622,,0.5,-0.168983,0.495821,1.0,,...,,,0.892531,,,,,0.486664,,
3279,,,,,,,,,,,...,,,,,,,,,,


Similarities between movies that do not have many ratings in common are unreliable. Fortunately, the `corr` method includes a parameter `min_periods` that establishes a minimum number of elements in common that two columns must have to compute the correlation.

Re-generate item_similarity setting min_periods to 100.

This process will also take a few minutes.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in "item_similarity_min_ratings", and print the first 10 rows</font>

In [131]:
item_similarity_min_ratings = user_movie.corr(min_periods=100)
display(item_similarity_min_ratings.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,51187,51194,51255,51312,51314,51317,51402,51412,51418,51433
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,,,,,,,,,,,...,,,,,,,,,,
3177,,1.0,,,,,,,,,...,,,,,,,,,,
3190,,,1.0,,,,,,,,...,,,,,,,,,,
3225,,,,1.0,,,,,,,...,,,,,,,,,,
3228,,,,,,,,,,,...,,,,,,,,,,
3239,,,,,,,,,,,...,,,,,,,,,,
3273,,,,,,,1.0,0.176796,,,...,,,0.170269,,,,,,,
3275,,,,,,,0.176796,1.0,,,...,,,0.233542,,,,,0.191663,,
3276,,,,,,,,,,,...,,,,,,,,,,
3279,,,,,,,,,,,...,,,,,,,,,,


Next will need some auxiliary functions that are provided below. These give us the list of movies that a user has rated. You can leave as-is.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [132]:
# Leave this code as-is

# Gets the rating a user_id has given to a movie_id
def get_rating(user_movie, user_id, movie_id):
    return user_movie[movie_id][user_id]

# Gets a list of rated movies for a user_id
def get_rated_movies(user_movie, user_id):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)
    
# Print rated movies
def print_rated_movies(user_movie, movies, user_id):
    for movie_id in get_rated_movies(user_movie, user_id):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_movie, user_id, movie_id), get_title(movie_id, movies)))


We will need to test our function so let us select a couple of interesting users.

Our first user, `user_id_super` will be someone who has given the following 3 films a rating higher than 4.5:

* super_movie_1=5349: *Spider-Man (2002)*
* super_movie_2=3793: *X-Men (2000)*
* super_movie_3=8961: *Incredibles, The (2004)* 	

Our second user, `user_id_drama` will be someone who has given the following 3 films a rating higher than 4.5:

* drama_movie_1=3408: *Erin Brockovich (2000)*
* drama_movie_2=5995: *Pianist, The (2002)*
* drama_movie_3=4995: *Beautiful Mind, A (2001)*
* and that has NOT rated the first superhero movies, i.e., having `user_movie[super_movie_1].isnull()`.

Print the number of users satisfying each condition and choose one at random.

*Tip:* To filter a dataframe by multiple conditions you can use, e.g., `df[(a > 1) & (b > 2)]`. 

**Important**: these particular users have watched lots of movies, so we cannot tell for sure they have only these interests.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to find userids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas and did not rate the first superhero movie). Print the number of users satisfying the conditions, and choose one at random using `random.choice()`. Print the user ids.</font>

In [145]:
super_movie_1 = 5349
super_movie_2 = 3793
super_movie_3 = 8961

user_ids_super = user_movie.index[
    (user_movie[[super_movie_1, super_movie_2, super_movie_3]] > 4.5).all(axis=1)  
].tolist()

num_super_ids = len(user_ids_super)
user_id_super = random.choice(user_ids_super)
print(f"There are in total {num_super_ids}, one example user id is: {user_id_super}")

drama_movie_1=3408
drama_movie_2=5995
drama_movie_3=4995

user_ids_drama = user_movie.index[
                                (user_movie[[drama_movie_1, drama_movie_2, drama_movie_3]] > 4.5).all(axis=1) &
                                (user_movie[super_movie_1].isna()) 
].tolist()

num_drama_ids = len(user_ids_drama)
user_id_drama = random.choice(user_ids_drama)
print(f"\nThere are in total {num_drama_ids}, one example user id is: {user_id_drama}")


There are in total 26, one example user id is: 101185

There are in total 5, one example user id is: 16212


In [134]:
# df_drama = rated_movies[
#     (rated_movies['user_id'].isin(user_id_drama)) &
#     (rated_movies['rating'] > 4.5)
# ]
# display(df_drama['user_id'].unique())

The next code, that you should leave as-is, checks that the users you selected are correct.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [146]:
# LEAVE THIS CODE AS-IS
# We use this to check that the user ids you selected are correct

assert get_rating(user_movie, user_id_super, super_movie_1) > 4.5
assert get_rating(user_movie, user_id_super, super_movie_2) > 4.5
assert get_rating(user_movie, user_id_super, super_movie_3) > 4.5

assert get_rating(user_movie, user_id_drama, drama_movie_1) > 4.5
assert get_rating(user_movie, user_id_drama, drama_movie_2) > 4.5
assert get_rating(user_movie, user_id_drama, drama_movie_3) > 4.5


<font size="+1" color="red">Given that the rest of the practice requires you to fix the user ids, so that your results are consistent, replace this cell with code to assign them to fixed values (user_id_super=XXX, user_id_drama=YYY).</font>

In [147]:
# LEAVE AS-IS (TESTING CODE)
user_id_super=53310
print_rated_movies(user_movie, movies, user_id_super)

51412 5.0 Next (2007) 
48304 5.0 Apocalypto (2006) 
47200 5.0 Crank (2006) 
48322 5.0 Jackass Number Two (2006) 
46976 5.0 Stranger than Fiction (2006) 
3793 5.0 X-Men (2000) 
3826 5.0 Hollow Man (2000) 
46578 5.0 Little Miss Sunshine (2006) 
46723 5.0 Babel (2006) 
46970 5.0 Talladega Nights: The Ballad of Ricky Bobby (2006) 
3827 5.0 Space Cowboys (2000) 
3785 5.0 Scary Movie (2000) 
4056 5.0 Pledge, The (2001) 
3908 5.0 Urban Legends: Final Cut (2000) 
4238 5.0 Along Came a Spider (2001) 
4030 5.0 Dracula 2000 (2000) 
4020 5.0 Gift, The (2000) 
4036 5.0 Shadow of the Vampire (2000) 
4226 5.0 Memento (2000) 
4167 5.0 15 Minutes (2001) 
4014 5.0 Chocolat (2000) 
3752 5.0 Me, Myself & Irene (2000) 
3994 5.0 Unbreakable (2000) 
4247 5.0 Joe Dirt (2001) 
4343 5.0 Evolution (2001) 
4448 5.0 Score, The (2001) 
5093 5.0 Collateral Damage (2002) 
4992 5.0 Kate & Leopold (2001) 
4958 5.0 Behind Enemy Lines (2001) 
43558 5.0 Big Momma's House 2 (2006) 
42723 5.0 Hostel (2005) 
41997 5.0 Munich

In [148]:
# LEAVE AS-IS (TESTING CODE)
user_id_drama = 98561
print_rated_movies(user_movie, movies, user_id_drama)

3408 5.0 Erin Brockovich (2000) 
3949 5.0 Requiem for a Dream (2000) 
4306 5.0 Shrek (2001) 
4886 5.0 Monsters, Inc. (2001) 
5218 5.0 Ice Age (2002) 
4995 5.0 Beautiful Mind, A (2001) 
8798 5.0 Collateral (2004) 
6874 5.0 Kill Bill: Vol. 1 (2003) 
7438 5.0 Kill Bill: Vol. 2 (2004) 
6942 5.0 Love Actually (2003) 
5991 5.0 Chicago (2002) 
30707 5.0 Million Dollar Baby (2004) 
48780 5.0 Prestige, The (2006) 
40819 5.0 Walk the Line (2005) 
5995 5.0 Pianist, The (2002) 
4973 4.5 Amelie (Fabuleux destin d'AmÃ©lie Poulain, Le) (2001) 
5989 4.5 Catch Me If You Can (2002) 
7153 4.5 Lord of the Rings: The Return of the King, The (2003) 
37733 4.5 History of Violence, A (2005) 
7323 4.5 Good bye, Lenin! (2003) 
4308 4.5 Moulin Rouge (2001) 
3578 4.5 Gladiator (2000) 
36529 4.5 Lord of War (2005) 
8961 4.5 Incredibles, The (2004) 
36517 4.5 Constant Gardener, The (2005) 
44555 4.5 Lives of Others, The (Das leben der Anderen) (2006) 
44665 4.5 Lucky Number Slevin (2006) 
44191 4.5 V for Vendetta (

For every user, we will consider that the importance of a new movie (a movie s/he has not rated) will be equal to the sum of the similarities between that new movie and all the movies the user has already rated.

Indeed, to further improve this, we will compute a weighted sum, in which the weight will be the rating given to the movie.

For instance, suppose a user has rated movies as follows:

```
movie_id rating
1        2.0
2        3.0
3        NaN
4        NaN
```

And that movie similarities are as follows (values with a "." do not matter in this example):

```
movie_id   1   2   3   4
1         ...............
2         ...............
3         0.1 0.2 NaN ...
4         0.9 0.8 ... NaN
```

The importance of movie 3 to this user will be:

```
2.0 * 0.1 + 3.0 * 0.2 = 0.8
```

While the importance of movie 4 to this user will be:

```
2.0 * 0.9 + 3.0 + 0.8 = 5.6
```

As we can see, we are favoring movies that are highly similar to many movies that the user has rated high.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Create a function `get_movies_relevance` that returns a dataframe with columns `movie_id` and `relevance`. You can use the following template:

```python
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
    # Create an empty series
    movies_relevance = ...
    
    # Iterate through the movies the user has rated
    for rated_movie in ...
        
        # Obtain the rating given
        rating_given = ...
        
        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = ...
        
        # Multiply this vector by the given rating
        weighted_similarities = ...
        
        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df

```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for "get_movies_relevance"</font>

In [None]:
# def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
#     # Create an empty series
#     movies_relevance = pd.Series(dtype=float)
#     # print(user_movie.loc[user_id])
#     # Iterate through the movies the user has rated
#     for rated_movie in user_movie.loc[user_id].dropna().index:
        
        
#         # Obtain the rating given
#         rating_given = user_movie.loc[user_id, rated_movie]
        
#         # Obtain the vector containing the similarities of watched_movie
#         # with all other movies in item_similarity_matrix
#         similarities = item_similarity_matrix[rated_movie]
        
#         # Multiply this vector by the given rating
#         weighted_similarities = similarities * rating_given
        
#         # Append these terms to movies_relevance
#         movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
#     # Compute the sum for each movie
#     movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
#     # Convert to a dataframe
#     movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
#     movies_relevance_df['movie_id'] = movies_relevance_df.index
    
#     return movies_relevance_df

In [182]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    movies_relevance = pd.Series(dtype=float)
    
    # 获取该用户看过的电影
    rated_movies = get_rated_movies(user_movie, user_id)
    
    for rated_movie in rated_movies:
        rating_given = get_rating(user_movie, user_id, rated_movie)
        similarities = item_similarity_matrix[rated_movie]
        weighted_similarities = similarities * rating_given
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df


Apply `get_movies_relevance` to the two users we have selected, `user_id_super` and `user_id_drama`.

The result will contain only `movie_id` and `relevance`, you will have to merge with the `movies` dataframe on the `movie_id` attribute.

Sort the results by descending relevance and print the top 10 for each case.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)</font>

In [183]:
df1 = get_movies_relevance(user_id_super, user_movie, item_similarity)
df1_sorted = df1.sort_values(by='relevance', ascending=False)
print(f"Top 10 most relevant movies for user: #{user_id_super}")
display(df1_sorted.head(10))

df2 = get_movies_relevance(user_id_drama, user_movie, item_similarity)
df2_sorted = df2.sort_values(by='relevance', ascending=False)
print(f"\nTop 10 most relevant movies for user: #{user_id_drama}")
display(df2_sorted.head(10))

Top 10 most relevant movies for user: #53310


Unnamed: 0,relevance,movie_id
8868,569.635437,8868
49282,549.53487,49282
40574,549.297223,40574
43987,545.876134,43987
27865,544.821168,27865
8534,535.802038,8534
48879,531.51384,48879
36531,514.003972,36531
46430,513.385655,46430
30659,508.495624,30659



Top 10 most relevant movies for user: #98561


Unnamed: 0,relevance,movie_id
48673,178.443658,48673
34375,177.864398,34375
27871,170.703256,27871
27742,170.604259,27742
27765,169.531592,27765
6041,169.199486,6041
27671,168.244406,27671
35828,165.564906,35828
46430,164.989491,46430
27840,163.5,27840


In [179]:
# Super hero related movie
ids_super = df1_sorted.head(10)['movie_id'].tolist()
recommended_movies = movies[movies['movie_id'].isin(ids_super)]
print(recommended_movies[['movie_id', 'title']])
print()

# Drama related movie
ids_drama = df2_sorted.head(10)['movie_id'].tolist()
recommended_movies1 = movies[movies['movie_id'].isin(ids_drama)]
print(recommended_movies1[['movie_id', 'title']])

      movie_id                               title
1505      8534  Two Brothers (Deux frÃ¨res) (2004)
1595      8868           Head in the Clouds (2004)
1928     27865                        Azumi (2003)
1953     30659                         Noel (2004)
2314     36531                        Venom (2005)
2428     40574       Get Rich or Die Tryin' (2005)
2575     43987                   Half Light (2006)
2710     46430                   Waist Deep (2006)
2870     48879                 Catch a Fire (2006)
2895     49282               Deck the Halls (2006)

      movie_id                                              title
911       6041                                       Amen. (2002)
1821     27671  As If I Didn't Exist (Elina - Som om jag inte ...
1855     27742  Vampire Effect (The Twins Effect) (Chin gei bi...
1866     27765                                      Trauma (2004)
1915     27840                    Day Without a Mexican, A (2004)
1931     27871                     Somethi

<font size="+1" color="red">Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.</font>

<font size="-1" color="gray">All those trivial facts you learned about 2000s pop culture were supposed to be useful one day; that day has arrived :-)</font>

### Answer
For user_id_super, the top-10 recommendations (Two Brothers, Head in the Clouds, Azumi, Catch a Fire, etc.) mostly feature action-drama themes with emotional or moral depth. Around 7–8 of them seem relevant, showing a strong match to a viewer interested in intense and story-driven films.

For user_id_drama, the list (Amen., Something the Lord Made, A Touch of Spice, Black, etc.) leans toward serious drama and biographical stories, consistent with someone who rated dramatic films highly. Most titles fit well, though one or two (like Vampire Effect) are slightly off-genre. Overall, both profiles show good recommendation consistency.

Finally, we need to remove the movies that the user has already watched. To do so:

* Obtain the dataframe of relevant movies with `get_movies_relevance`
* Set this dataframe index to 'movie_id'
* Obtain the list of movie_ids of watched movies with `get_watched_movies`
* Drop from the relevant movies dataframe the watched movies

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code implementing "get_recommended_movies"</font>

In [189]:
def get_recommended_movies(user_id, user_movie, item_similarity_matrix, top_n=10):
    """
    Returns a dataframe of the top-N recommended movies for a given user.
    The relevance is computed as the weighted sum of similarities between 
    the movies the user has already rated and all others.
    """

    # Step 1: Get the relevance scores for all movies
    movies_relevance_df = get_movies_relevance(user_id, user_movie, item_similarity_matrix)

    # Step 2: Set index to movie_id
    movies_relevance_df = movies_relevance_df.set_index('movie_id')

    # Step 3: Get watched movies (already rated)
    watched_movies = get_rated_movies(user_movie, user_id)

    # Step 4: Drop watched movies from recommendations
    movies_relevance_df = movies_relevance_df.drop(watched_movies, errors='ignore')

    # Step 5: Sort by relevance (descending)
    movies_relevance_df = movies_relevance_df.sort_values(by='relevance', ascending=False)

    # Step 6: Return top-N recommendations
    return movies_relevance_df.head(top_n)

    

<font size="+1" color="red">Replace this cell with your code to obtain the 20 most recommended movies for the users user_id_super and user_id_drama</font>

In [None]:
# Recommended movies for user_id_super
super_movies = get_recommended_movies(user_id_super, user_movie, item_similarity, 20)
super_movies.reset_index(inplace=True)
print(f"Top 20 most relevant movies to recommend for user: #{user_id_super}")
display(super_movies)

# Recommended movies for user_id_sdrama

drama_movies = get_recommended_movies(user_id_drama, user_movie, item_similarity, 20)
drama_movies.reset_index(inplace=True)
print(f"\nTop 20 most relevant movies to recommend for user: #{user_id_drama}")
display(drama_movies)

Top 20 most relevant movies to recommend for user: #53310


Unnamed: 0,movie_id,relevance
0,8868,569.635437
1,49282,549.53487
2,40574,549.297223
3,43987,545.876134
4,27865,544.821168
5,8534,535.802038
6,48879,531.51384
7,36531,514.003972
8,46430,513.385655
9,30659,508.495624



Top 20 most relevant movies to recommend for user: #98561


Unnamed: 0,movie_id,relevance
0,48673,178.443658
1,34375,177.864398
2,27871,170.703256
3,27742,170.604259
4,27765,169.531592
5,6041,169.199486
6,27671,168.244406
7,35828,165.564906
8,46430,164.989491
9,27840,163.5


In [207]:
# Get movie titles above to analyze
# super movies
super_movies_ids = super_movies['movie_id'].tolist()
print(super_movies_ids)
super_movies_lst = rated_movies[rated_movies['movie_id'].isin(super_movies_ids)]
super_movies_lst = super_movies_lst.drop_duplicates(subset='movie_id', keep='first')
display(super_movies_lst[['movie_id', 'title']])

# drama movies
drama_movies_ids = drama_movies['movie_id'].tolist()
print(drama_movies_ids)
drama_movies_lst = rated_movies[rated_movies['movie_id'].isin(drama_movies_ids)]
drama_movies_lst = drama_movies_lst.drop_duplicates(subset='movie_id', keep='first')
display(drama_movies_lst[['movie_id', 'title']])

[8868, 49282, 40574, 43987, 27865, 8534, 48879, 36531, 46430, 30659, 48161, 7257, 4885, 27871, 27524, 6013, 48791, 6616, 43589, 42013]


Unnamed: 0,movie_id,title
3507,7257,"Big Bounce, The (2004)"
3536,8534,Two Brothers (Deux frÃ¨res) (2004)
3582,27871,Something the Lord Made (2004)
3673,43987,Half Light (2006)
3720,48791,Flicka (2006)
6879,6013,Kangaroo Jack (2003)
7531,27865,Azumi (2003)
15904,49282,Deck the Halls (2006)
16597,4885,Domestic Disturbance (2001)
19445,8868,Head in the Clouds (2004)


[48673, 34375, 27871, 27742, 27765, 6041, 27671, 35828, 46430, 27840, 48239, 27639, 27351, 43987, 7475, 35015, 30659, 50183, 33826, 48167]


Unnamed: 0,movie_id,title
3582,27871,Something the Lord Made (2004)
3627,34375,"Touch of Spice, A (Politiki kouzina) (2003)"
3673,43987,Half Light (2006)
5291,6041,Amen. (2002)
7473,7475,Raid (2003)
7515,27671,As If I Didn't Exist (Elina - Som om jag inte ...
16189,33826,Saint Ralph (2004)
22228,50183,Loving Annabelle (2006)
24391,27351,Spiral (2000)
35039,48673,Kummelin jackpot (2006)


<font size="+1" color="red">Replace this cell with a brief commentary on these recommendations. (1) What percentage of recommendations would you say are relevant for the user who likes superhero movies? And for the user who like drama movies? (2) After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?</font>

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook

## Extra points available

For more learning and extra points, use the [surprise](http://surpriselib.com/) library to generate recommendations for the same two users. Display the generated recommendations and comment on them.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: surprise library</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that I completed this practice myself, that my answers were not written by an AI-enabled code assistant, and that except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>