Step 2: Import required libraries and load the dataset


In [25]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, SVD
from surprise.model_selection import cross_validate

movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/ratings.csv')

Step 3: Prepare the data for collaborative filtering

Using the scikit-surprise library, we'll create a dataset object and split the data into a train and test set.



In [2]:
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)


Step 4: Build a collaborative filtering model

We'll use the SVD algorithm to make movie recommendations based on collaborative filtering.

In [3]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.7777  0.7779  0.7773  0.7778  0.7777  0.7777  0.0002  
MAE (testset)     0.5866  0.5869  0.5865  0.5869  0.5865  0.5867  0.0002  
Fit time          294.01  5609.21 501.17  4167.51 3482.13 2810.81 2087.63 
Test time         952.00  4490.50 608.83  3210.54 6241.48 3100.67 2127.65 


{'test_rmse': array([0.77766929, 0.77790203, 0.77733974, 0.77776941, 0.77765096]),
 'test_mae': array([0.5865782 , 0.58685871, 0.58649324, 0.58687172, 0.58646588]),
 'fit_time': (294.00930404663086,
  5609.209953069687,
  501.1718888282776,
  4167.514079093933,
  3482.1269397735596),
 'test_time': (952.0028901100159,
  4490.501410961151,
  608.8305280208588,
  3210.5446338653564,
  6241.480406761169)}

Step 5: Prepare the data for content-based filtering
To perform content-based filtering, we need to transform the movie genres into a feature vector using the TF-IDF approach.


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf = TfidfVectorizer(stop_words='english')
movies['genres'] = movies['genres'].fillna('')
tfidf_matrix = tfidf.fit_transform(movies['genres'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)


Step 6: Create a function for content-based recommendations
This function takes a movie title as input and returns the top n similar movies based on the genre.

In [5]:
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()

def content_based_recommendations(title, n=10):
    index = indices[title]
    sim_scores = list(enumerate(cosine_sim[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:n+1]
    movie_indices = [i[0] for i in sim_scores]
    return movies['title'].iloc[movie_indices]

Step 7: Combine collaborative and content-based filtering
Create a hybrid recommendation function that takes a user ID and movie title as input, and outputs a list of top n recommended movies.

In [26]:
# Load links.csv file from the MovieLens dataset
links = pd.read_csv('data/links.csv')

# Convert the 'tmdbId' to integer and drop rows with missing values
links = links.dropna(subset=['tmdbId'])
links['tmdbId'] = links['tmdbId'].astype(int)

# Merge the MovieLens and TMDB datasets using 'movieId' column
movies = movies.merge(links, left_on='movieId', right_on='movieId')

# Merge the movies dataset with the ratings dataset
movies_with_ratings = movies.merge(ratings, on='movieId')


In [35]:
def hybrid_recommendations(user_id, title, n=10):
    content_based = content_based_recommendations(title, n).to_frame()
    content_based.columns = ['title']
    content_based = content_based.merge(movies_with_ratings, on='title')
    content_based = content_based.drop_duplicates(subset=['title'], keep='first')
    content_based['est'] = content_based['movieId'].apply(lambda x: svd.predict(user_id, x).est)
    content_based = content_based.sort_values('est', ascending=False)
    return content_based.head(n)['title']


Step 8: Testing recommendation system

In [36]:
user_id = 1
title = 'Toy Story (1995)'
n = 10
recommendations = hybrid_recommendations(user_id, title, n)
print(f"Top {n} recommendations for User {user_id} who likes '{title}':")
print(recommendations)


Top 10 recommendations for User 1 who likes 'Toy Story (1995)':
1097                                        Traffic (2000)
0                                       Mighty, The (1998)
18439    Godzilla vs. Destroyah (Gojira vs. Desutoroiâ)...
908                                 Daddy Long Legs (1919)
18550                                    Stagecoach (1966)
18191                            Down in the Valley (2005)
18120                             Hell Up in Harlem (1973)
1053     Hatchet for the Honeymoon (Rosso segno della f...
18138                                      Airborne (1993)
18478                           The Alphabet Killer (2008)
Name: title, dtype: object
