# Recommender Systems. IUH 2025.
### 21 August 2025. Lab 1.
Objective: review how to compute distances and similarities between vectors, and how to build a basic recommender system.

**Problem 1.**

The data includes four users $A$, $B$, $C$, and $D$, who have each rated three movies. The ratings are stored in the following lists; each list contains three numbers, one rating per movie:

- Ratings by $A$ are $[4.0; 3.0; 5.0]$.
- Ratings by $B$ are $[2.0; 4.0; 3.0]$.
- Ratings by $C$ are $[2.0; 4.0; 1.0]$.
- Ratings by $D$ are $[4.0; 5.0; 2.0]$.

a) Compute the Euclidean distance and the Manhattan distance between all pairs of these users, then remark on which pair is closest in each case.

b) Create the matrix of Pearson similarities between these rating vectors.
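
For reference, the three measures in this problem can be written directly with NumPy. The following is a minimal sketch that compares only users $A$ and $B$; the helper names `euclid_dist`, `manhattan_dist`, and `pearson_sim` are illustrative and do not come from any library.

```python
import numpy as np

def euclid_dist(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan_dist(x, y):
    """Manhattan distance: sum of absolute differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

def pearson_sim(x, y):
    """Pearson correlation: cosine of the angle between the mean-centered vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

ratings_a = [4.0, 3.0, 5.0]  # user A
ratings_b = [2.0, 4.0, 3.0]  # user B

d_euclid = euclid_dist(ratings_a, ratings_b)
d_manhattan = manhattan_dist(ratings_a, ratings_b)
r_pearson = pearson_sim(ratings_a, ratings_b)
print(d_euclid, d_manhattan, round(r_pearson, 3))
```

For this pair the differences are $(2, -1, 2)$, so the Euclidean distance is $\sqrt{9} = 3$ and the Manhattan distance is $5$; the Pearson similarity is approximately $-0.5$.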

In [3]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean, cityblock
from scipy.stats import pearsonr

# Ratings of each user
ratings = {
    'A': [4.0, 3.0, 5.0],
    'B': [2.0, 4.0, 3.0],
    'C': [2.0, 4.0, 1.0],
    'D': [4.0, 5.0, 2.0]
}

df = pd.DataFrame(ratings, index=['Movie1', 'Movie2', 'Movie3']).T
print("Ratings:\n", df, "\n")

users = df.index
n = len(users)

# --- a) Euclidean and Manhattan ---
euclid = pd.DataFrame(np.zeros((n, n)), index=users, columns=users)
manhattan = pd.DataFrame(np.zeros((n, n)), index=users, columns=users)

for i in users:
    for j in users:
        euclid.loc[i, j] = euclidean(df.loc[i], df.loc[j])
        manhattan.loc[i, j] = cityblock(df.loc[i], df.loc[j])

print("Euclidean distance matrix:\n", euclid.round(3), "\n")
print("Manhattan distance matrix:\n", manhattan.round(3), "\n")

# --- b) Pearson similarity ---
pearson = pd.DataFrame(np.zeros((n, n)), index=users, columns=users)

for i in users:
    for j in users:
        r, _ = pearsonr(df.loc[i], df.loc[j])
        pearson.loc[i, j] = r

print("Pearson similarity matrix:\n", pearson.round(3), "\n")

# --- Remarks on the closest pairs ---
from itertools import combinations

# Consider each unordered pair of distinct users once, so the diagonal
# (a user compared with itself) can never be reported as the closest pair
pairs = list(combinations(users, 2))

min_euclid = min(euclid.loc[i, j] for i, j in pairs)
min_manhattan = min(manhattan.loc[i, j] for i, j in pairs)
max_pearson = max(pearson.loc[i, j] for i, j in pairs)

# np.isclose guards against floating-point round-off when collecting ties
closest_euclid = [(i, j) for i, j in pairs if np.isclose(euclid.loc[i, j], min_euclid)]
closest_manhattan = [(i, j) for i, j in pairs if np.isclose(manhattan.loc[i, j], min_manhattan)]
closest_pearson = [(i, j) for i, j in pairs if np.isclose(pearson.loc[i, j], max_pearson)]

print(f"Closest pair (Euclidean): {closest_euclid}")
print(f"Closest pair (Manhattan): {closest_manhattan}")
print(f"Most similar pair (Pearson): {closest_pearson}")

Ratings:
    Movie1  Movie2  Movie3
A     4.0     3.0     5.0
B     2.0     4.0     3.0
C     2.0     4.0     1.0
D     4.0     5.0     2.0 

Euclidean distance matrix:
        A      B      C      D
A  0.000  3.000  4.583  3.606
B  3.000  0.000  2.000  2.449
C  4.583  2.000  0.000  2.449
D  3.606  2.449  2.449  0.000 

Manhattan distance matrix:
      A    B    C    D
A  0.0  5.0  7.0  5.0
B  5.0  0.0  2.0  4.0
C  7.0  2.0  0.0  4.0
D  5.0  4.0  4.0  0.0 

Pearson similarity matrix:
        A      B      C      D
A  1.000 -0.500 -0.982 -0.982
B -0.500  1.000  0.655  0.327
C -0.982  0.655  1.000  0.929
D -0.982  0.327  0.929  1.000 

Closest pair (Euclidean): [('B', 'C')]
Closest pair (Manhattan): [('B', 'C')]
Most similar pair (Pearson): [('C', 'D')]
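
The same three matrices can also be built without explicit double loops. The sketch below is one possible alternative, assuming SciPy's `pdist`/`squareform` and NumPy's `corrcoef` are acceptable for this lab; the variable names `ratings_df`, `euclid_m`, `manhattan_m`, and `pearson_m` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same ratings as in Problem 1: one row per user, one column per movie
ratings_df = pd.DataFrame(
    {'A': [4.0, 3.0, 5.0], 'B': [2.0, 4.0, 3.0],
     'C': [2.0, 4.0, 1.0], 'D': [4.0, 5.0, 2.0]},
    index=['Movie1', 'Movie2', 'Movie3'],
).T
user_ids = ratings_df.index

# pdist returns the condensed pairwise distances; squareform expands them
# into a full symmetric matrix with zeros on the diagonal
euclid_m = pd.DataFrame(squareform(pdist(ratings_df.values, metric='euclidean')),
                        index=user_ids, columns=user_ids)
manhattan_m = pd.DataFrame(squareform(pdist(ratings_df.values, metric='cityblock')),
                           index=user_ids, columns=user_ids)

# np.corrcoef treats each row as one variable, so it yields the
# user-by-user Pearson correlation matrix directly
pearson_m = pd.DataFrame(np.corrcoef(ratings_df.values),
                         index=user_ids, columns=user_ids)

print(euclid_m.round(3), manhattan_m, pearson_m.round(3), sep="\n\n")
```

With only four users the loop version is just as readable; the vectorized form mainly pays off as the number of users grows.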


**Problem 2.** The cold-start problem: trend-based recommendations for new users, combining high ratings and the number of ratings with a reasonable weighting.

The dataset files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. This dataset captures feature points like cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages. These feature points could be potentially used to train your models for content and collaborative filtering. This dataset consists of the following files:

- *movies_metadata.csv*: This file contains information on ~45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.
- *keywords.csv*: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
- *credits.csv*: Consists of Cast and Crew Information for all the movies. Available in the form of a stringified JSON Object.
- *links.csv*: This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
- *links_small.csv*: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
- *ratings_small.csv*: The subset of 100,000 ratings from 700 users on 9,000 movies.

The Full MovieLens Dataset comprises 26 million ratings and 750,000 tag applications from 270,000 users on all the 45,000 movies in this dataset. It can be accessed from the official GroupLens website.

In [None]:
# Import Pandas
import pandas as pd

# Load Movies Metadata (described above as movies_metadata.csv; adjust the path to match your local copy)
metadata = pd.read_csv('movies.csv', low_memory=False)

# Print the first three rows
metadata.head(3)

One of the most basic approaches is to rank the movies by their average rating and take the top 250. However, using the rating alone as a metric has a few caveats:

- For one, it does not take into consideration the popularity of a movie. A movie with a rating of 9 from 10 voters will be considered 'better' than a movie with a rating of 8.9 from 10,000 voters. For example, imagine you want to order Chinese food and have two options: one restaurant has a 5-star rating from only 5 people, while the other has a 4.5-star rating from 1,000 people. Which restaurant would you prefer? The second one, right? Of course, there could be exceptions: the first restaurant may have opened just a few days ago and so has fewer votes, while the second has been operating for a year.
- On a related note, this metric will also tend to favor movies with a small number of voters and skewed and/or extremely high ratings. As the number of voters increases, the rating of a movie regularizes and approaches a value that reflects the movie's quality and gives the user a much better idea of which movie to choose. Because it is difficult to discern the quality of a movie with extremely few voters, you might have to consult external sources to reach a conclusion.

Taking these shortcomings into consideration, you must come up with a weighted rating that takes into account the average rating and the number of votes it has accumulated. Such a system will make sure that a movie with a 9 rating from 100,000 voters gets a (far) higher score than a movie with the same rating but a mere few hundred voters. Since you are trying to build a clone of IMDB's Top 250, let's use its weighted rating formula as a metric/score. Mathematically, it is represented as follows:

$$\text{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$$

In the above equation,

- v is the number of votes for the movie;
- m is the minimum votes required to be listed in the chart;
- R is the average rating of the movie;
- C is the mean vote across the whole report.
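
To see how the formula behaves, here is a small numeric sketch using the two movies from the caveat above (a rating of 9 from 10 voters versus 8.9 from 10,000 voters). The values `m_example = 160` and `C_example = 5.6` are assumptions picked for illustration, not values computed from the dataset, and `weighted_rating_value` is a hypothetical helper name.

```python
def weighted_rating_value(v, R, m, C):
    """Weighted rating: blend the movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m_example = 160   # assumed minimum-vote threshold (illustrative only)
C_example = 5.6   # assumed global mean rating (illustrative only)

# Few voters: the score is pulled strongly toward the global mean C
wr_few = weighted_rating_value(v=10, R=9.0, m=m_example, C=C_example)

# Many voters: the score stays close to the movie's own average R
wr_many = weighted_rating_value(v=10_000, R=8.9, m=m_example, C=C_example)

print(round(wr_few, 3), round(wr_many, 3))
```

Under these assumed values the 10-voter movie scores about 5.8 while the 10,000-voter movie scores about 8.85, so the ordering of the two movies is reversed compared with ranking by raw average.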

You already have the values for v (*vote_count*) and R (*vote_average*) for each movie in the dataset. It is also possible to calculate C directly from this data. The value of m is a hyperparameter that you can choose as you see fit, since there is no single right value for it. You can think of it as a preliminary filter that simply removes the movies with fewer votes than a certain threshold m. The selectivity of your filter is up to your discretion.

In this exercise, you will use cutoff $m$ as the 90th percentile. In other words, for a movie to be featured in the charts, it must have more votes than at least 90% of the movies on the list. (On the other hand, if you had chosen the 75th percentile, you would have considered the top 25% of the movies in terms of the number of votes garnered. As percentile decreases, the number of movies considered will increase).
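
The effect of the percentile choice can be checked on a toy example. The vote counts below are made up for illustration; the sketch only shows that a lower percentile gives a lower cutoff and therefore keeps more movies.

```python
import pandas as pd

# Made-up vote counts for ten hypothetical movies
vote_counts = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

kept_by_q = {}
for q in (0.90, 0.75):
    cutoff = vote_counts.quantile(q)              # linear interpolation by default
    kept = int((vote_counts >= cutoff).sum())     # movies that pass the filter
    kept_by_q[q] = kept
    print(f"percentile={q:.2f}  cutoff={cutoff:.2f}  movies kept={kept}")
```

Here the 90th percentile keeps a single movie, while the 75th percentile keeps three.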

As a first step, let's calculate the value of C, the mean rating across all movies using the pandas .mean() function:

In [None]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)

Next, let's calculate $m$, the number of votes received by a movie at the 90th percentile. The pandas .quantile() method makes this task trivial:

In [None]:
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

Now that you have m, you can use a greater-than-or-equal-to condition to keep only the movies with at least m votes (160 for this dataset). You can use the .copy() method to ensure that the new q_movies DataFrame is independent of your original metadata DataFrame. In other words, any changes made to q_movies will not affect the original metadata DataFrame.
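
The independence that .copy() provides can be illustrated on a tiny DataFrame. This is only a sketch with made-up titles and vote counts; `toy` and `qualified` are hypothetical names.

```python
import pandas as pd

# Made-up data standing in for the metadata DataFrame
toy = pd.DataFrame({'title': ['X', 'Y', 'Z'],
                    'vote_count': [5, 200, 1000]})

# Filter first, then copy, so the result owns its own data
qualified = toy.loc[toy['vote_count'] >= 160].copy()

# Adding a column to the copy leaves the original untouched
qualified['score'] = 0.0

print(list(toy.columns))        # the original has no 'score' column
print(list(qualified.columns))  # the copy does
```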

In [None]:
# Filter all qualified movies into a new, independent DataFrame
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()

# Compare the sizes of the filtered and original DataFrames
print(q_movies.shape)
print(metadata.shape)

From the above output, it is clear that around 10% of the movies have a vote count of at least 160 and qualify for this list. The next and most important step is to calculate the weighted rating for each qualified movie. To do this, you will:

- Define a function, weighted_rating();
- Since you have already calculated m and C, you will simply pass them as arguments to the function;
- Then you will select the vote_count(v) and vote_average(R) column from the q_movies data frame;
- Finally, you will compute the weighted average and return the result.
  
You will define a new feature, score, whose value you'll calculate by applying this function to your DataFrame of qualified movies:

In [None]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v / (v + m)) * R + (m / (v + m)) * C

In [None]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
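
Because the formula only involves column arithmetic, the row-wise `apply` can also be written as a vectorized expression. The sketch below compares both forms on made-up data; `toy_movies`, `m_toy`, and `C_toy` are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up movies standing in for q_movies
toy_movies = pd.DataFrame({
    'title': ['X', 'Y', 'Z'],
    'vote_count': [200, 1500, 12000],
    'vote_average': [8.5, 7.9, 8.1],
})
m_toy = 160   # assumed threshold (illustrative only)
C_toy = 5.6   # assumed global mean (illustrative only)

def weighted_rating_row(x, m=m_toy, C=C_toy):
    """Row-wise version, same shape as the function defined above."""
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

# Row-wise: one Python call per movie
score_apply = toy_movies.apply(weighted_rating_row, axis=1).astype(float)

# Vectorized: the same formula applied to whole columns at once
v = toy_movies['vote_count']
R = toy_movies['vote_average']
score_vectorized = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy

print(np.allclose(score_apply, score_vectorized))
```

Both forms give the same scores; the vectorized one usually runs faster on large DataFrames because it avoids a Python-level call per row.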

Finally, let's sort the DataFrame in descending order based on the score feature column and output the title, vote count, vote average, and weighted rating (score) of the top 20 movies.

In [None]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

# Print the top 20 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Well, from the above output, you can see that the simple recommender did a great job!

The chart has a lot of movies in common with the IMDB Top 250 chart: for example, your top two movies, "Shawshank Redemption" and "The Godfather", are the same as IMDB's. We all know they are indeed amazing movies; in fact, all of the top 20 movies deserve to be on that list, don't they?

In [None]:
#CODE HERE

**Problem 3. Finding all pairs of movies** In this problem, we will find all pairs of movies (that is, all permutations of pairs of movies) that have been watched by the same person, which can then be used to make recommendations.

The *user_ratings_df* DataFrame has been loaded again; it contains users and the movies they have seen.

You will first need to create a function that finds all possible pairs of items in the list it is applied to. For ease of use, you will output the values as a DataFrame. Since you only want to find movies that have been seen by the same person, and not all possible permutations across users, you will group by the user ID column (*userId* in the code below) when applying the function.

**Hint.** Create a function called *find_movie_pairs* that finds all permutations of a Series, and stores the results as a DataFrame.
Apply this function to the *user_ratings_df* DataFrame and print the results.
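
As a quick reminder of what "all permutations of pairs" means, the sketch below lists the pairs for one hypothetical user who watched three movies (the titles are made up for illustration). `itertools.permutations` yields ordered pairs, so each pair appears in both directions, whereas `itertools.combinations` yields each unordered pair once.

```python
from itertools import combinations, permutations

# Movies watched by one hypothetical user
watched = ['Thor', 'Iron Man', 'Up']

ordered_pairs = list(permutations(watched, 2))    # (a, b) and (b, a) both appear
unordered_pairs = list(combinations(watched, 2))  # each pair appears once

print(len(ordered_pairs), ordered_pairs)
print(len(unordered_pairs), unordered_pairs)
```

Keeping both directions is convenient here because it lets you later look up any movie in the `movie_a` column and read its partners from `movie_b`.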

In [None]:
from itertools import permutations
import pandas as pd

# Create the function to find all permutations (all pairs of movies a user has watched)
def find_movie_pairs(x):
    pairs = pd.DataFrame(list(permutations(x.values, 2)),
                         columns=['movie_a', 'movie_b'])
    return pairs

# Apply the function to the title column and reset the index
movie_combinations = user_ratings_df.groupby('userId')['title'].apply(find_movie_pairs).reset_index(drop=True)

print(movie_combinations.head())

**Counting up the pairs** 
You have now created a DataFrame of all the permutations of movies that have been watched by the same user. This is of limited use unless you can find which movies are most commonly paired. In this exercise, you will work with the *movie_combinations* DataFrame that you created in the last exercise (it has been loaded for you), and generate a new DataFrame containing the number of occurrences of each pair.

**Hint.** Find the number of times each pair of movies occurs and assign it to combination_counts.

In [None]:
# Count how often each (movie_a, movie_b) pair occurs
combination_counts = movie_combinations.groupby(['movie_a', 'movie_b']).size()

# Convert to a DataFrame with a 'size' column, as used in the next exercise
combination_counts_df = combination_counts.to_frame(name='size').reset_index()

# Inspect the results
print(combination_counts_df.head())
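
Since *user_ratings_df* is provided externally, the pairing and counting steps can be tried end to end on a small stand-in. The sketch below uses a made-up `toy_ratings` DataFrame with the same `userId` and `title` columns, and builds the pairs with a plain list comprehension rather than `groupby(...).apply(...)`; it illustrates the idea and is not the lab's reference solution.

```python
from itertools import permutations
import pandas as pd

# Made-up viewing history for three hypothetical users
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3, 3, 3],
    'title': ['Thor', 'Iron Man', 'The Avengers',
              'Thor', 'Iron Man',
              'Thor', 'Iron Man', 'Up'],
})

# All ordered pairs of movies watched by the same user
rows = [pair
        for _, titles in toy_ratings.groupby('userId')['title']
        for pair in permutations(titles, 2)]
toy_pairs = pd.DataFrame(rows, columns=['movie_a', 'movie_b'])

# Number of users who watched both movies of each pair
toy_counts = (toy_pairs.groupby(['movie_a', 'movie_b'])
              .size()
              .reset_index(name='size'))

# Movies most often watched together with 'Thor'
thor_counts = (toy_counts[toy_counts['movie_a'] == 'Thor']
               .sort_values('size', ascending=False))
print(thor_counts)
```

In this toy data all three users watched both 'Thor' and 'Iron Man', so that pair has the highest count.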


**Making your first movie recommendations**

Now that you have found the most commonly paired movies, you can make your first recommendations! Although you are not using any information about the person watching, and do not even know any details about the movie, valuable recommendations can still be made by examining which groups of movies are watched by the same people. In this exercise, you will examine the movies often watched by the same people who watched Thor, and then use this data to give a recommendation to someone who has just watched the movie. The DataFrame you generated in the last exercise, combination_counts_df, which contains counts of how often movies are watched together, has been loaded for you.

**Hint.** Order the *combination_counts_df* object from largest to smallest by the size column. Then find the ordered movie frequencies for the movie Thor by subsetting the ordered *combination_counts_df* object where movie_a is Thor, assign the result to *thor_df*, and plot the results.

In [None]:
import matplotlib.pyplot as plt

# Sort the counts from highest to lowest
combination_counts_df.sort_values('size', ascending=False, inplace=True)

# Find the movies most frequently watched by people who watched Thor
thor_df = combination_counts_df[combination_counts_df['movie_a'] == 'Thor']

# Plot the results
thor_df.plot.bar(x="movie_b", y="size", legend=False, color='skyblue')
plt.title("Movies most frequently watched with 'Thor'")
plt.ylabel("Count")
plt.xlabel("Movie")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
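
Beyond plotting, the sorted counts can be turned into an actual recommendation list. The following is a minimal sketch assuming a counts DataFrame with `movie_a`, `movie_b`, and `size` columns as above; the helper `top_co_watched` and the toy counts are hypothetical and serve only to illustrate the lookup.

```python
import pandas as pd

def top_co_watched(counts_df, title, k=3):
    """Return up to k titles most often watched by people who watched `title`."""
    subset = counts_df[counts_df['movie_a'] == title]
    return subset.nlargest(k, 'size')['movie_b'].tolist()

# Made-up pair counts standing in for combination_counts_df
toy_counts_df = pd.DataFrame({
    'movie_a': ['Thor', 'Thor', 'Thor', 'Up'],
    'movie_b': ['Iron Man', 'The Avengers', 'Up', 'Thor'],
    'size': [3, 2, 1, 1],
})

recs = top_co_watched(toy_counts_df, 'Thor', k=2)
print(recs)
```

For someone who has just watched 'Thor', this toy example recommends 'Iron Man' first and 'The Avengers' second.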
