# Recommender Systems. IUH 2025.
### 25 September 2025. Lab 5.
Objective: use the sklearn library to build recommendations.

**Exercise 1.** Scikit-learn with content-based filtering (adapted from the blog post *Building a Simple Recommendation System with Scikit-Learn*). Students should install the required libraries themselves in order to read and understand the ideas behind the sample code below.

💻 We will build a simple recommendation system with the popular Scikit-learn library. A content-based recommender suggests items to a user based on that user's previous interactions. For example, if a user has watched many action movies, the recommender will suggest more action movies to that user. To build a content-based recommender, we need a dataset that contains information about the items and about users' interactions with them.

In [3]:
import pandas as pd

movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

In [4]:
movies.head(10)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [5]:
ratings.head(10)

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


🔍 Next, we need to clean and preprocess the data. This tutorial works with the movie dataset and the rating dataset, and we remove duplicate rows and missing values from both.

Data cleaning:

- Remove duplicate rows: make sure every record in the dataset is unique so that duplicates do not skew the recommendations.
- Remove missing values: empty values can cause errors or distort the model, so they need to be handled or removed.


In [6]:
# Removing duplicate rows
movies.drop_duplicates(inplace=True)
ratings.drop_duplicates(inplace=True)

# Removing missing values
movies.dropna(inplace=True)
ratings.dropna(inplace=True)
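
To make the effect of the cleaning step visible, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and not part of the lab dataset). It uses the non-inplace variants of `drop_duplicates` and `dropna`, which return a new DataFrame so the row counts before and after can be compared.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: one duplicated row and one row with a missing title.
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "title": ["Toy Story (1995)", "Toy Story (1995)", np.nan, "Heat (1995)"],
})

# Non-inplace variants return a new frame and leave the original intact.
cleaned = toy.drop_duplicates().dropna()

print(len(toy), "->", len(cleaned))  # 4 -> 2
print(cleaned.reset_index(drop=True))
```

Comparing the lengths in this way is a quick check that the cleaning step did not silently discard more rows than expected.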

💻 Now that the data has been cleaned and preprocessed, we can start building the recommendation system. A popular Python library for this is scikit-learn, a general-purpose machine learning library whose preprocessing, nearest-neighbor, and similarity tools can be used to build, evaluate, and improve a recommender.

🔍 To build a content-based recommender, we first need to extract features from the movie dataset. In this example, we use the movie genres as features. We use the OneHotEncoder class from scikit-learn to convert the genres into a numeric format that can serve as input to the recommender.

In [7]:
from sklearn.preprocessing import OneHotEncoder

# Extracting the genres column
genres = movies['genres']

# Creating an instance of the OneHotEncoder
encoder = OneHotEncoder()
# Fitting and transforming the genres column
genres_encoded = encoder.fit_transform(genres.values.reshape(-1, 1))
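
Note that `OneHotEncoder` applied to the whole `genres` string treats each distinct combination (for example `Adventure|Children|Fantasy`) as a single category, so two movies look similar only when their genre strings are identical. As an alternative sketch, not the blog's method, the strings can be split into one 0/1 column per genre with pandas' `Series.str.get_dummies`. The `toy_genres` data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows in the same "A|B|C" format as the genres column.
toy_genres = pd.Series([
    "Adventure|Animation|Children",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
])

# str.get_dummies splits on the separator and creates one 0/1 column per genre.
genre_matrix = toy_genres.str.get_dummies(sep="|")

print(genre_matrix.columns.tolist())
print(genre_matrix.values)
```

With this encoding, the first two movies share two genres and therefore have a non-zero cosine similarity, whereas the one-hot encoding of the full string would treat them as unrelated.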

💻 Now that we have extracted and encoded the features, we can start building the recommendation system. In this example, we will be using the NearestNeighbors class from scikit-learn to build the recommendation system. We will be using cosine similarity as the metric for measuring the similarity between the movies.

In [8]:
from sklearn.neighbors import NearestNeighbors

# Creating an instance of the NearestNeighbors class
recommender = NearestNeighbors(metric='cosine')

# Fitting the encoded genres to the recommender
recommender.fit(genres_encoded.toarray())

🔍 Now that the recommendation system is built, we can start making recommendations to the users. To make a recommendation, we will need to pass in the index of a movie that the user has previously watched. The recommendation system will then return the indexes of the most similar movies.

In [9]:
# Index of the movie the user has previously watched
movie_index = 0
# Number of recommendations to return
num_recommendations = 5
# Getting the recommendations
_, recommendations = recommender.kneighbors(genres_encoded[movie_index].toarray(), n_neighbors=num_recommendations)
# Extracting the movie titles from the recommendations
recommended_movie_titles = movies.iloc[recommendations[0]]['title']
recommended_movie_titles

0                                        Toy Story (1995)
9430                                         Moana (2016)
8927                             The Good Dinosaur (2015)
8219                                         Turbo (2013)
7760    Asterix and the Vikings (Astérix et les Viking...
Name: title, dtype: object
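
The output above includes the query movie itself (Toy Story), because a movie is always at distance 0 from itself. A minimal sketch of one way to exclude it is shown below. The `features` array and the `recommend` helper are hypothetical and stand in for the encoded genres and the recommender built above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical multi-hot genre vectors for five movies (rows) and four genres.
features = np.array([
    [1, 1, 0, 0],  # movie 0
    [1, 1, 1, 0],  # movie 1
    [0, 0, 1, 1],  # movie 2
    [1, 0, 0, 0],  # movie 3
    [0, 0, 0, 1],  # movie 4
], dtype=float)

def recommend(query_index, k):
    """Return the indexes of the k nearest movies, excluding the query itself."""
    model = NearestNeighbors(metric="cosine").fit(features)
    # Ask for one extra neighbor because the query is normally among the results.
    _, idx = model.kneighbors(features[[query_index]], n_neighbors=k + 1)
    return [int(i) for i in idx[0] if i != query_index][:k]

print(recommend(0, 2))  # [1, 3]
```

Filtering by index rather than dropping the first result is deliberate: when several movies have identical feature vectors, the query is not guaranteed to come first.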

🚀 And that's it! We have successfully built a simple content-based recommendation system using scikit-learn. You can experiment with different features and metrics to see how they affect the recommendations.

💡 Remember that building a recommendation system is an iterative process: you will need to test, evaluate, and improve your model over time. You can use libraries such as TensorFlow or Keras to build more complex recommendation systems.

**Exercise 2.** Scikit-learn with collaborative filtering.

In this project we build a movie recommender using matrix factorization in Python. The tools used in this project are NumPy, Pandas and Scikit-learn. Matrix factorization is simply a mathematical tool for working with matrices, and is therefore applicable in many scenarios where one would like to uncover structure hidden in the data.

In [14]:
import pandas as pd
import numpy as np
movie_df = pd.read_csv('movies.csv')
rating_df = pd.read_csv('ratings.csv')

In [15]:
# Combine the two tables; columns we don't need are dropped in the next cell

combine_movie_rating = pd.merge(rating_df, movie_df, on='movieId')
combine_movie_rating.head(10)

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
5,1,70,3.0,964982400,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller
6,1,101,5.0,964980868,Bottle Rocket (1996),Adventure|Comedy|Crime|Romance
7,1,110,4.0,964982176,Braveheart (1995),Action|Drama|War
8,1,151,5.0,964984041,Rob Roy (1995),Action|Drama|Romance|War
9,1,157,5.0,964984100,Canadian Bacon (1995),Comedy|War


In [16]:
columns = ['timestamp', 'genres']
combine_movie_rating = combine_movie_rating.drop(columns, axis=1)
combine_movie_rating.head(10)

,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,1,3,4.0,Grumpier Old Men (1995)
2,1,6,4.0,Heat (1995)
3,1,47,5.0,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,"Usual Suspects, The (1995)"
5,1,70,3.0,From Dusk Till Dawn (1996)
6,1,101,5.0,Bottle Rocket (1996)
7,1,110,4.0,Braveheart (1995)
8,1,151,5.0,Rob Roy (1995)
9,1,157,5.0,Canadian Bacon (1995)


In [17]:
combine_movie_rating = combine_movie_rating.dropna(axis = 0, subset = ['title'])

movie_ratingCount = (combine_movie_rating.
     groupby(by = ['title'])['rating'].
     count().
     reset_index().
     rename(columns = {'rating': 'totalRatingCount'})
     [['title', 'totalRatingCount']]
    )
movie_ratingCount.head(10)

,title,totalRatingCount
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2
5,'Tis the Season for Love (2015),1
6,"'burbs, The (1989)",17
7,'night Mother (1986),1
8,(500) Days of Summer (2009),42
9,*batteries not included (1987),7


In [18]:
rating_with_totalRatingCount = combine_movie_rating.merge(movie_ratingCount, left_on = 'title', right_on = 'title', how = 'left')
rating_with_totalRatingCount.head(10)

,userId,movieId,rating,title,totalRatingCount
0,1,1,4.0,Toy Story (1995),215
1,1,3,4.0,Grumpier Old Men (1995),52
2,1,6,4.0,Heat (1995),102
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),203
4,1,50,5.0,"Usual Suspects, The (1995)",204
5,1,70,3.0,From Dusk Till Dawn (1996),55
6,1,101,5.0,Bottle Rocket (1996),23
7,1,110,4.0,Braveheart (1995),237
8,1,151,5.0,Rob Roy (1995),44
9,1,157,5.0,Canadian Bacon (1995),11


In [19]:
# Now drop the duplicate data
user_rating = rating_with_totalRatingCount.drop_duplicates(['userId','title'])
user_rating.head(10)

,userId,movieId,rating,title,totalRatingCount
0,1,1,4.0,Toy Story (1995),215
1,1,3,4.0,Grumpier Old Men (1995),52
2,1,6,4.0,Heat (1995),102
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),203
4,1,50,5.0,"Usual Suspects, The (1995)",204
5,1,70,3.0,From Dusk Till Dawn (1996),55
6,1,101,5.0,Bottle Rocket (1996),23
7,1,110,4.0,Braveheart (1995),237
8,1,151,5.0,Rob Roy (1995),44
9,1,157,5.0,Canadian Bacon (1995),11


### Matrix Factorization

Now create the user-by-movie rating matrix and fill the missing ratings with 0.

In [20]:
movie_user_rating_pivot = user_rating.pivot(index = 'userId', columns = 'title', values = 'rating').fillna(0)
movie_user_rating_pivot.head(10)

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
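
The pivot step is easier to follow on a tiny example. The sketch below uses made-up ratings (`toy_ratings` is hypothetical) to show how the long table of (user, movie, rating) rows becomes a wide matrix, and how sparse that matrix is. Keep in mind that filling with 0 makes "not rated" indistinguishable from a rating of 0, which is a simplification.

```python
import pandas as pd

# Hypothetical ratings in long format: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Heat (1995)", "Toy Story (1995)", "Toy Story (1995)", "Heat (1995)"],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Wide format: users as rows, movies as columns, 0 where no rating exists.
toy_pivot = toy_ratings.pivot(index="userId", columns="title", values="rating").fillna(0)

print(toy_pivot)
# Fraction of cells that hold a real rating (here 4 of 6).
print((toy_pivot.values > 0).mean())
```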


In [21]:
X = movie_user_rating_pivot.values.T
X.shape

(9719, 610)

In [22]:
# Now let's fit the model

from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=12, random_state=17)
matrix = SVD.fit_transform(X)
matrix.shape

(9719, 12)

In [23]:
import warnings
warnings.filterwarnings("ignore",category =RuntimeWarning)
corr = np.corrcoef(matrix)
corr.shape

(9719, 9719)
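
The two cells above can be sketched on a small made-up matrix to show what each shape means. The `toy_X` matrix is hypothetical. `TruncatedSVD` compresses each movie's ratings into a few latent factors, and `np.corrcoef` then computes the Pearson correlation between every pair of rows (movies) across those factors.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical movie-by-user rating matrix: 4 movies (rows), 5 users (columns).
toy_X = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [0, 0, 5, 4, 0],
    [0, 1, 4, 5, 0],
], dtype=float)

# Compress each movie's 5 ratings into 3 latent factors.
toy_svd = TruncatedSVD(n_components=3, random_state=17)
toy_latent = toy_svd.fit_transform(toy_X)

# np.corrcoef treats each row as one variable, giving a movie-by-movie matrix.
toy_corr = np.corrcoef(toy_latent)

print(toy_latent.shape, toy_corr.shape)  # (4, 3) (4, 4)
print(np.round(toy_corr, 2))
```

Because each correlation is computed over only `n_components` numbers (12 in the lab), the values tend to be extreme, so many pairs can end up near 1. Increasing `n_components` is one way to make the scores more discriminating.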

In [24]:
# Now let's check the results

movie_title = movie_user_rating_pivot.columns
movie_title_list = list(movie_title)
# Row index of the query movie in the correlation matrix
query_index = movie_title_list.index("Guardians of the Galaxy (2014)")

In [25]:
# Titles whose correlation with the query movie is at least 0.9
corr_query = corr[query_index]
list(movie_title[(corr_query >= 0.9)])

['Amazing Spider-Man, The (2012)',
 'Ant-Man (2015)',
 'Avatar (2009)',
 'Avengers, The (2012)',
 'Avengers: Age of Ultron (2015)',
 'Big Hero 6 (2014)',
 'Brave (2012)',
 'Captain America: The First Avenger (2011)',
 'Captain America: The Winter Soldier (2014)',
 'Cloudy with a Chance of Meatballs (2009)',
 'Dark Knight Rises, The (2012)',
 'Deadpool (2016)',
 'Deadpool 2 (2018)',
 'Despicable Me (2010)',
 'District 9 (2009)',
 'Django Unchained (2012)',
 'Doctor Strange (2016)',
 'Edge of Tomorrow (2014)',
 "Ender's Game (2013)",
 'Grand Budapest Hotel, The (2014)',
 'Gravity (2013)',
 'Guardians of the Galaxy (2014)',
 'Guardians of the Galaxy 2 (2017)',
 'Harry Potter and the Deathly Hallows: Part 1 (2010)',
 'Harry Potter and the Deathly Hallows: Part 2 (2011)',
 'Hobbit: An Unexpected Journey, The (2012)',
 'Hobbit: The Desolation of Smaug, The (2013)',
 'How to Train Your Dragon (2010)',
 'Hugo (2011)',
 'Inside Out (2015)',
 'Interstellar (2014)',
 'Iron Man (2008)',
 'Iron Man
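
The `>= 0.9` threshold returns every title above the cutoff, in alphabetical column order rather than by strength of correlation. As an alternative sketch, the titles can be ranked with `np.argsort` to keep only the top few. The `titles` and `corr_row` arrays below are hypothetical stand-ins for `movie_title` and one row of `corr`.

```python
import numpy as np

titles = np.array(["A", "B", "C", "D"])        # hypothetical movie titles
corr_row = np.array([1.0, 0.95, 0.10, 0.97])   # hypothetical correlations with movie "A"

# Indexes ordered from most to least correlated.
order = np.argsort(corr_row)[::-1]

# Skip the query movie itself and keep the two best matches.
top = [t for t in titles[order].tolist() if t != "A"][:2]

print(top)  # ['D', 'B']
```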

**Exercise 3.** Building a recommendation engine:


After downloading the dataset, we need to import all the required libraries and then read the csv file using read_csv() method.

In [27]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
df = pd.read_csv("movie_dataset.csv")

If you inspect the dataset, you will see that it contains a lot of extra information about each movie. We don't need all of it, so we choose the keywords, cast, genres and director columns as our feature set (the so-called "content" of the movie).

In [None]:
features = ['keywords','cast','genres','director']

Our next task is to create a function for combining the values of these columns into a single string.

In [None]:
def combine_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']

Now, we need to call this function on each row of our dataframe. Before doing that, we need to clean the data: a NaN in any of these columns would make the string concatenation fail, so we fill all NaN values in the dataframe with an empty string.

In [None]:
for feature in features:
    df[feature] = df[feature].fillna('')  # fill all NaNs with an empty string
# Apply combine_features() to each row and store the result in the "combined_features" column
df["combined_features"] = df.apply(combine_features, axis=1)

Now that we have obtained the combined strings, we can now feed these strings to a CountVectorizer() object for getting the count matrix.

In [None]:
cv = CountVectorizer() #creating new CountVectorizer() object
count_matrix = cv.fit_transform(df["combined_features"]) 
#feeding combined strings(movie contents) to CountVectorizer() object

At this point, about 60% of the work is done. Now, we need to obtain the cosine similarity matrix from the count matrix.

In [None]:
cosine_sim = cosine_similarity(count_matrix)
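
To see what the count matrix and the similarity matrix contain, here is a minimal sketch on three made-up feature strings (`docs` is hypothetical and stands in for the `combined_features` column). Each string becomes a vector of word counts, and the cosine similarity of two vectors grows with the number of words they share.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined-feature strings for three movies.
docs = [
    "space alien action cameron",
    "space alien horror scott",
    "romance comedy paris",
]

toy_counts = CountVectorizer().fit_transform(docs)  # sparse matrix, 3 x vocabulary size
toy_sim = cosine_similarity(toy_counts)             # dense 3 x 3 array

print(toy_sim.round(2))
```

The first two strings share two of their four words, so their similarity is 0.5. The third shares none with the others, so its off-diagonal scores are 0. The diagonal is always 1 because every movie is identical to itself.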

Now, we will define two helper functions to get movie title from movie index and vice-versa.

In [None]:
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]
def get_index_from_title(title):
    # Relies on the dataset having a column literally named "index"
    return df[df.title == title]["index"].values[0]

Our next step is to take the title of the movie that the user currently likes and find the index of that movie. We then access the row for this movie in the similarity matrix, which holds its similarity scores against every movie in the dataset. Finally, we enumerate these scores to build tuples of movie index and similarity score. This converts a row of similarity scores such as $[1, 0.5, 0.2, 0.9]$ into $[(0, 1), (1, 0.5), (2, 0.2), (3, 0.9)]$, where each item has the form (movie index, similarity score).

In [None]:
movie_user_likes = "Avatar"
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index])) 
#accessing the row corresponding to given movie to find all the similarity scores for that movie and then enumerating over it

Now comes the most vital point. We will sort the list similar_movies according to similarity scores in descending order. Since the most similar movie to a given movie will be itself, we will discard the first element after sorting the movies.

In [None]:
sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:]
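
The enumerate-and-sort steps can be traced on the four example scores from the text. This is a small sketch with hypothetical values, where `query_index` marks the movie the user likes.

```python
# Hypothetical similarity row for movie 0 against four movies (itself included).
scores = [1.0, 0.5, 0.2, 0.9]
query_index = 0

pairs = list(enumerate(scores))  # [(0, 1.0), (1, 0.5), (2, 0.2), (3, 0.9)]

# Sort by score, highest first, then drop the first element (the movie itself).
ranked = sorted(pairs, key=lambda x: x[1], reverse=True)[1:]
print(ranked)  # [(3, 0.9), (1, 0.5), (2, 0.2)]

# Variant that removes the query by index instead of by position.
ranked_by_index = sorted(
    (p for p in pairs if p[0] != query_index),
    key=lambda x: x[1],
    reverse=True,
)
print(ranked_by_index)  # [(3, 0.9), (1, 0.5), (2, 0.2)]
```

Dropping the first element assumes the query movie sorts first. If another movie has an identical feature string (similarity 1.0) and a lower index, that assumption fails, and filtering by index is the safer variant.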

Now, we will run a loop to print first 5 entries from sorted_similar_movies list.

In [None]:
print("Top 5 similar movies to "+movie_user_likes+" are:\n")
# Print the titles of the 5 most similar movies
for element in sorted_similar_movies[:5]:
    print(get_title_from_index(element[0]))

And we are done here!

Now it's time to run our code and see the output. If you run the above code, you will see this output:

> Top 5 similar movies to Avatar are:
- Guardians of the Galaxy
- Aliens
- Star Wars: Clone Wars: Volume 1
- Star Trek Into Darkness
- Star Trek Beyond