## **PROJECT TITLE:** A COMPREHENSIVE APPROACH TO ADDRESS THE COLD START PROBLEM IN RECOMMENDER SYSTEMS

## **Final Results and Codes**

**Submitted by:** Md Shamsul Arif Khan

**Student ID:** 501140715


**Supervisor Name:** Ceni Babaoglu

**Course Code:** CIND820

**Date of Submission:** November 30, 2023

**The project was built on the Google Colab platform. The following libraries and tools were installed and imported into Google Colab to start the project.**

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from zipfile import ZipFile
import urllib.request
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error

**Once the necessary libraries were imported, the datasets were loaded into Google Colab and processed to build the recommender systems.**

In [None]:
# Accessing the link for the MovieLens small dataset
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
file_name = 'movielens_small.zip'

# Downloading and extracting the datasets (skipped when the archive already exists locally)
if not os.path.exists(file_name):
    urllib.request.urlretrieve(url, file_name)
    with ZipFile(file_name, 'r') as zip_ref:
        zip_ref.extractall()

# Loading datasets into Pandas DataFrames
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')
links = pd.read_csv('ml-latest-small/links.csv')

The next cells show descriptive statistics for each dataset.

In [None]:
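# Showing descriptive statistics for movies dataset
# (the output below includes userId, timestamp, imdbId, and tmdbId, so it was captured after the merges with tags and links further down)
movies.describe()

statistic,movieId,userId,timestamp,imdbId,tmdbId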
count,11853.0,3683.0,3683.0,11853.0,11845.0
mean,40756.628195,431.149335,1320032000.0,661619.8,52040.693457
std,51463.872496,158.472553,172102500.0,1071299.0,91980.526761
min,1.0,2.0,1137179000.0,417.0,2.0
25%,2900.0,424.0,1137521000.0,97576.0,7229.0
50%,7022.0,474.0,1269833000.0,167260.0,14251.0
75%,72998.0,477.0,1498457000.0,799949.0,41946.0
max,193609.0,610.0,1537099000.0,8391976.0,525662.0


In [None]:
# Showing descriptive statistics for tags dataset
tags.describe()

statistic,userId,movieId,timestamp
count,3683.0,3683.0,3683.0
mean,431.149335,27252.013576,1320032000.0
std,158.472553,43490.558803,172102500.0
min,2.0,1.0,1137179000.0
25%,424.0,1262.5,1137521000.0
50%,474.0,4454.0,1269833000.0
75%,477.0,39263.0,1498457000.0
max,610.0,193565.0,1537099000.0


In [None]:
# Showing descriptive statistics for links dataset
links.describe()

statistic,movieId,imdbId,tmdbId
count,9742.0,9742.0,9734.0
mean,42200.353623,677183.9,55162.123793
std,52160.494854,1107228.0,93653.481487
min,1.0,417.0,2.0
25%,3248.25,95180.75,9665.5
50%,7300.0,167260.5,16529.0
75%,76232.0,805568.5,44205.75
max,193609.0,8391976.0,525662.0


In [None]:
# Showing descriptive statistics for ratings dataset
ratings.describe()

statistic,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


The next cell merges the datasets, cleans the combined data, engineers a text feature column, splits the ratings into training and test sets for model evaluation, builds a user-item matrix for collaborative filtering, and converts that matrix into a sparse format suitable for the matrix factorization step.

In [None]:
# Merging 'tags' into 'movies' on 'movieId' with a left join to add tag information.
# The join produces one row per (movie, tag) pair, so a movie with several tags appears several times.
movies = pd.merge(movies, tags, on='movieId', how='left')

# Merging 'links' into 'movies' on 'movieId' with a left join to add the IMDb and TMDb IDs.
movies = pd.merge(movies, links, on='movieId', how='left')

# Filling missing values in the 'tag' column with empty strings.
movies['tag'] = movies['tag'].fillna('')

# Replacing the '|' separator in 'genres' with a space; regex=False treats '|' as a literal character.
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

# Combining 'genres' and 'tag' into a single 'features' column.
movies['features'] = movies['genres'] + ' ' + movies['tag']

# To split data into training and test sets for collaborative filtering
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# To create a user-item matrix for collaborative filtering
train_user_item_matrix = train_data.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

# To convert the DataFrame into a sparse matrix
train_user_item_matrix_sparse = csr_matrix(train_user_item_matrix.values)
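
Because the left join on tags yields one row per (movie, tag) pair, the same title can occur several times in the merged table. A minimal sketch of one way to keep a single row per movie is shown below; it collapses the tags of each movie into one string before merging. The frames `toy_movies` and `toy_tags` are hypothetical stand-ins for `movies.csv` and `tags.csv`, not the project's data.

```python
import pandas as pd

# Hypothetical toy frames that mimic the layout of movies.csv and tags.csv
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": ["Adventure|Animation", "Adventure|Children", "Action|Crime"],
})
toy_tags = pd.DataFrame({
    "userId": [336, 474, 567, 62],
    "movieId": [1, 1, 1, 2],
    "tag": ["pixar", "pixar", "fun", "fantasy"],
})

# Collapse all tags of a movie into one space-separated string (duplicates dropped)
tags_per_movie = (
    toy_tags.groupby("movieId")["tag"]
    .agg(lambda s: " ".join(sorted(set(s))))
    .reset_index()
)

# A left join now keeps exactly one row per movie
merged = toy_movies.merge(tags_per_movie, on="movieId", how="left")
merged["tag"] = merged["tag"].fillna("")
merged["genres"] = merged["genres"].str.replace("|", " ", regex=False)
merged["features"] = (merged["genres"] + " " + merged["tag"]).str.strip()
```

With one row per movie, row positions in a similarity matrix built from `features` map to distinct titles.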



**Building recommender system using Collaborative Filtering method to handle cold start problem**

The following code performs matrix factorization with SVD to decompose the user-item interaction matrix into latent factors and reconstructs the matrix to predict ratings for items that users have not rated. The predicted_ratings matrix contains the estimated ratings, which are then used to generate user recommendations in the collaborative filtering-based recommendation system.

In [None]:
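# Utilizing matrix factorization (SVD) for collaborative filtering
num_factors = 50
U, sigma, Vt = svds(train_user_item_matrix_sparse, k=num_factors)
sigma = np.diag(sigma)
# predicted_ratings is a plain NumPy array: row i and column j follow the order of
# train_user_item_matrix.index and train_user_item_matrix.columns, not the raw IDs
predicted_ratings = np.dot(np.dot(U, sigma), Vt)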
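
Since the reconstructed matrix is a plain array, user and movie IDs have to be mapped back to row and column positions before a score can be looked up. The sketch below shows one possible way to do this on a small, hypothetical ratings table: the array is wrapped in a DataFrame that reuses the labels of the pivoted matrix. The names `toy_ratings`, `matrix`, and `predicted_df` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy ratings; movie IDs are deliberately non-contiguous
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 50, 70, 10, 90, 50, 70, 10, 90, 50, 90],
    "rating":  [5.0, 3.0, 4.0, 4.0, 2.0, 4.0, 5.0, 3.0, 4.0, 2.0, 5.0],
})

matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

# k must be smaller than both matrix dimensions for svds
U, sigma, Vt = svds(csr_matrix(matrix.values), k=2)
reconstructed = U @ np.diag(sigma) @ Vt

# Re-attach the labels so rows and columns can be addressed by ID
predicted_df = pd.DataFrame(reconstructed, index=matrix.index, columns=matrix.columns)

# Label-based lookup: predicted score of user 2 for movie 50
print(predicted_df.loc[2, 50])
```

Label-based access with `.loc` avoids assuming that movie IDs are contiguous or that user IDs start at zero.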

The collaborative_filtering_recommendations function will generate movie recommendations for a specific user by utilizing predicted ratings from a collaborative filtering model.

In [None]:
# Developing a function for collaborative filtering recommendations
def collaborative_filtering_recommendations(user_id, predicted_ratings, num_recommendations=10):
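    # Intended check: does the user have a row in the predicted ratings?
    # predicted_ratings is a NumPy array, so `in` compares user_id with the array's values
    # rather than with user IDs; in the run shown below this fallback branch was taken for User 1.
    if user_id not in predicted_ratings:
        # Handle cold start problem by providing general recommendations:
        # the first rows of `movies`, each with a placeholder score of 0
        top_movies = movies.head(num_recommendations)
        recommended_movies = [(title, 0) for title in top_movies['title']]
        return recommended_movies

    # Positional lookup: row number `user_id` of the array, which is not necessarily that user's row
    user_ratings = predicted_ratings[user_id]
    sorted_indices = np.argsort(user_ratings)[::-1]

    user_seen_movies = train_user_item_matrix.columns[train_user_item_matrix.loc[user_id].gt(0)].tolist()

    recommended_movies = []
    for idx in sorted_indices:
        # Assumes column position idx corresponds to movieId idx + 1,
        # which holds only when movie IDs are contiguous and start at 1
        movie_id = idx + 1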
        if movie_id not in user_seen_movies:
            movie_info = movies[movies['movieId'] == movie_id]['title'].values
            if len(movie_info) > 0:
                movie_title = movie_info[0]
                recommended_movies.append((movie_title, user_ratings[idx]))
                if len(recommended_movies) >= num_recommendations:
                    break

    return recommended_movies

The next cell generates collaborative filtering recommendations for User 1 by calling collaborative_filtering_recommendations with predicted_ratings, then prints each recommended title with its score as a numbered list. In the output below every score is 0 and titles repeat, which corresponds to the function's fallback branch: it returns the first rows of the merged movies table, where a movie appears once per tag.

In [None]:
# Showing an example of Collaborative Filtering recommendations
user_id_collab = 1
collab_recommended_movies = collaborative_filtering_recommendations(user_id_collab, predicted_ratings)
print(f"Collaborative Filtering Recommendations for User {user_id_collab}:")
for idx, (movie, rating) in enumerate(collab_recommended_movies, start=1):
    print(f"{idx}. {movie}, {rating} ")

Collaborative Filtering Recommendations for User 1:
1. Toy Story (1995), 0 
2. Toy Story (1995), 0 
3. Toy Story (1995), 0 
4. Jumanji (1995), 0 
5. Jumanji (1995), 0 
6. Jumanji (1995), 0 
7. Jumanji (1995), 0 
8. Grumpier Old Men (1995), 0 
9. Grumpier Old Men (1995), 0 
10. Waiting to Exhale (1995), 0 
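
For comparison, here is a minimal sketch of a label-based variant, assuming the predictions are stored in a DataFrame indexed by userId with movieId columns. The function `recommend_for_user`, the toy frames, and the popularity fallback are hypothetical and are not part of the project code; the fallback is one common choice for unknown users, not the method used above.

```python
import pandas as pd

def recommend_for_user(user_id, predicted_df, seen_df, popularity, n=3):
    """Return up to n (movieId, score) pairs for user_id.

    predicted_df: predicted scores, indexed by userId with movieId columns
    seen_df:      training ratings in the same layout, 0 meaning "not rated"
    popularity:   Series of fallback scores per movieId for unknown users
    """
    if user_id not in predicted_df.index:
        # Cold start: no row for this user, fall back to overall popularity
        top = popularity.sort_values(ascending=False).head(n)
        return list(top.items())

    scores = predicted_df.loc[user_id]
    unseen = scores[seen_df.loc[user_id] == 0]  # drop movies the user already rated
    top = unseen.sort_values(ascending=False).head(n)
    return list(top.items())

# Hypothetical toy inputs
seen = pd.DataFrame([[5.0, 0.0, 3.0], [0.0, 4.0, 0.0]], index=[1, 2], columns=[10, 50, 70])
pred = pd.DataFrame([[4.8, 3.9, 3.1], [2.5, 4.1, 3.6]], index=[1, 2], columns=[10, 50, 70])
pop = pd.Series({10: 4.0, 50: 4.5, 70: 3.0})

known_user = recommend_for_user(1, pred, seen, pop)   # user with a row in pred
new_user = recommend_for_user(99, pred, seen, pop)    # user without a row in pred
```

Checking membership against the DataFrame index tests for the user ID itself, so the fallback is reached only for users that are absent from the training data.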


**Building recommender system using Content-based Filtering method to handle cold start problem**

This section creates a similarity matrix using cosine similarity for content-based movie recommendations, employing TF-IDF to compute movie features. It also demonstrates content-based recommendations for the film 'Toy Story (1995)' by identifying similar movies based on genres and tags and listing their titles, genres, and IMDb IDs.

The next cell uses TfidfVectorizer to compute a TF-IDF matrix (tfidf_matrix) representing movie features (genres and tags) and calculates a similarity matrix (item_similarity) using cosine similarity between the feature vectors of the movies.

In [None]:
# Computing the similarity matrix using cosine similarity for content-based filtering
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['features'].values.astype('U'))
item_similarity = cosine_similarity(tfidf_matrix, tfidf_matrix)
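
To make the TF-IDF and cosine similarity step concrete, the following sketch applies the same two calls to three hypothetical feature strings. The strings and the names `toy_features`, `toy_tfidf`, and `toy_similarity` are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature strings (genres followed by tags), one per movie
toy_features = [
    "Adventure Animation Children pixar",   # row 0
    "Adventure Animation Children disney",  # row 1
    "Crime Drama Thriller mafia",           # row 2
]

# Each row becomes a TF-IDF vector over the vocabulary of all strings
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_features)

# Entry [i, j] is the cosine similarity between movie i and movie j
toy_similarity = cosine_similarity(toy_tfidf)
```

Rows 0 and 1 share three of their four terms and therefore score higher with each other than with row 2, which shares none.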

The content_based_recommendations function takes a movie title as input, finds its index in the movie dataset, retrieves the similarity scores for that index from the similarity matrix, ranks the other movies by score, and returns the most similar ones.

In [None]:
# Developing a function for content-based recommendations
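def content_based_recommendations(movie_title, similarity_matrix, num_recommendations=10):
    # Use the first row whose title matches (the merged table can hold several rows per movie)
    movie_index = movies[movies['title'] == movie_title].index.values[0]
    similar_scores = similarity_matrix[movie_index]
    # Sort by descending similarity, skip the top-scoring row (normally the query row itself)
    # and keep num_recommendations rows
    similar_movies_indices = similar_scores.argsort()[::-1][1:num_recommendations + 1]
    similar_movies = movies.iloc[similar_movies_indices]
    return similar_movies[['title', 'genres', 'imdbId']]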
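
Skipping the first entry of the sorted scores assumes that the query row ranks first, which may not hold when several rows tie at the maximum similarity. A minimal alternative sketch, using a hypothetical helper `most_similar` and a toy similarity matrix, excludes the query by position instead.

```python
import numpy as np

def most_similar(query_idx, similarity, n=2):
    """Return the positions of the n rows most similar to query_idx."""
    scores = similarity[query_idx].astype(float)  # astype returns a copy
    scores[query_idx] = -np.inf                   # exclude the query row itself
    order = np.argsort(scores)[::-1]              # descending similarity
    return order[:n].tolist()

# Hypothetical symmetric similarity matrix for three movies
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

neighbours_of_first = most_similar(0, toy_sim)
neighbour_of_last = most_similar(2, toy_sim, n=1)
```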


The next cell calls content_based_recommendations with the item similarity matrix (item_similarity) to generate content-based recommendations for the movie 'Toy Story (1995)' and prints the similar movie titles along with their genres and IMDb IDs.

In [None]:
# Example of the recommended movies similar to the movie Toy Story (1995) using Content-Based Filtering
movie_title_content = 'Toy Story (1995)'
content_recommended_movies = content_based_recommendations(movie_title_content, item_similarity)
print("\nContent-Based Filtering Recommendations:")
print(content_recommended_movies.head(10))


Content-Based Filtering Recommendations:
                                                   title  \
1                                       Toy Story (1995)   
3214                                  Toy Story 2 (1999)   
3217                                  Toy Story 2 (1999)   
2484                                Bug's Life, A (1998)   
8672                                           Up (2009)   
4633                               Monsters, Inc. (2001)   
11499                                       Moana (2016)   
3966                    Emperor's New Groove, The (2000)   
9544    Asterix and the Vikings (Astérix et les Viking...   
10948                           The Good Dinosaur (2015)   

                                            genres   imdbId  
1      Adventure Animation Children Comedy Fantasy   114709  
3214   Adventure Animation Children Comedy Fantasy   120363  
3217   Adventure Animation Children Comedy Fantasy   120363  
2484           Adventure Animation Children Come

**Building recommender system using Hybrid Filtering method to handle the cold start problem**

The hybrid_recommendations_cold_start function merges collaborative and content-based recommendations into a single list for a given user and a given movie.

For a new user with no interaction history in the training data, it addresses the cold start problem by returning content-based recommendations only.

For an existing user, it gathers collaborative recommendations based on user behaviour and content-based suggestions for the given movie.

The two lists are then merged by rank: collaborative titles are placed first, content-based titles that are not already in the list follow, and the top entries are returned as the hybrid list.

In [None]:
# Developing a function for hybrid recommendations handling cold start problem
def hybrid_recommendations_cold_start(user_id, movie_title, num_recommendations=10):
    # Using content-based recommendations for new users
    if user_id not in train_user_item_matrix.index:
        return content_based_recommendations(movie_title, item_similarity, num_recommendations)

    # Proceeding with hybrid recommendations for existing users and items
    collab_recommended = collaborative_filtering_recommendations(user_id, predicted_ratings, num_recommendations)
    content_recommended = content_based_recommendations(movie_title, item_similarity, num_recommendations)

    hybrid_recommendations = []

    # Adding collaborative recommendations to hybrid list
    collab_titles = [title for title, _ in collab_recommended]
    for idx, (title, _) in enumerate(collab_recommended):
        hybrid_recommendations.append((title, idx+1))

    # Adding content-based recommendations to hybrid list
    content_titles = [title for title in content_recommended['title']]
    for idx, title in enumerate(content_titles, start=len(collab_recommended)):
        if title not in collab_titles:
            hybrid_recommendations.append((title, idx+1))

    hybrid_recommendations = sorted(hybrid_recommendations, key=lambda x: x[1])
    return [movie[0] for movie in hybrid_recommendations[:num_recommendations]]
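
Because the function above ranks every content-based title after all collaborative ones, a top-10 cut keeps only collaborative titles whenever that list already holds ten entries. One possible alternative, sketched below under the assumption that both inputs are ranked lists of titles, alternates between the two sources and drops repeats. The helper `interleave_rankings` is hypothetical and not part of the project code.

```python
from itertools import zip_longest

def interleave_rankings(collab_titles, content_titles, n=10):
    """Alternate between two ranked lists, keeping the first occurrence of each title."""
    merged, seen = [], set()
    for pair in zip_longest(collab_titles, content_titles):
        for title in pair:
            # zip_longest pads the shorter list with None
            if title is not None and title not in seen:
                seen.add(title)
                merged.append(title)
    return merged[:n]

# Hypothetical ranked lists
mixed = interleave_rankings(["A", "B", "C"], ["B", "D"], n=4)
```

Interleaving gives both sources a share of the final list; a weighted score would be another option when both methods produce comparable scores.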

The next cell generates hybrid recommendations for User 1 and the movie 'Toy Story (1995)' by combining the collaborative and content-based approaches, and prints the resulting list of titles.

In [None]:
# Showing an example usage of hybrid recommendation for handling cold start problems for User 1 and movie title Toy Story (1995).
user_id_hybrid = 1
movie_title_content = 'Toy Story (1995)'
hybrid_recommended_movies_cold_start = hybrid_recommendations_cold_start(user_id_hybrid, movie_title_content)

# Showing the list of recommended movies
print(f"\nHybrid Recommendations for User {user_id_hybrid} based on '{movie_title_content}':")
for idx, movie in enumerate(hybrid_recommended_movies_cold_start, start=1):
    print(f"{idx}. {movie}")



Hybrid Recommendations for User 1 based on 'Toy Story (1995)':
1. Toy Story (1995)
2. Toy Story (1995)
3. Toy Story (1995)
4. Jumanji (1995)
5. Jumanji (1995)
6. Jumanji (1995)
7. Jumanji (1995)
8. Grumpier Old Men (1995)
9. Grumpier Old Men (1995)
10. Waiting to Exhale (1995)


**Evaluation Metrics**

This section calculates the Root Mean Squared Error (RMSE) for the collaborative filtering predictions by comparing predicted ratings against actual ratings in the test dataset. The code iterates through the test interactions, retrieves a predicted rating for each one (or uses the mean rating when no prediction is available), and computes the RMSE to assess how well the collaborative filtering model predicts user-item interactions.

**Calculating RMSE for Collaborative model**

In [None]:
# Calculating RMSE for collaborative filtering model
mean_rating = ratings['rating'].mean()  # Mean of the ratings dataset, used when no prediction is available
predicted_test_ratings = []
for user_id, movie_id, rating in test_data[['userId', 'movieId', 'rating']].values:
    # predicted_ratings is a NumPy array, so `in` tests whether the number occurs among the
    # array's values, not whether the user or movie has a row or column
    if user_id in predicted_ratings and movie_id in predicted_ratings[user_id]:
        predicted_rating = predicted_ratings[user_id][movie_id]
        predicted_test_ratings.append(predicted_rating)
    else:
        # Handle cases where the prediction is not available by using the mean rating
        predicted_test_ratings.append(mean_rating)

test_ratings = test_data['rating'].values

# RMSE is the square root of the mean squared error
rmse_collab = np.sqrt(mean_squared_error(test_ratings, predicted_test_ratings))
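
As an illustration of the lookup-then-RMSE pattern, the sketch below assumes the predictions are held in a DataFrame labelled by userId and movieId, so that availability can be checked against the labels. The helper `rmse_from_lookup` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse_from_lookup(test_df, predicted_df, fallback):
    """RMSE between test ratings and label-based predictions, with a fallback score."""
    preds = []
    for user_id, movie_id in zip(test_df["userId"], test_df["movieId"]):
        if user_id in predicted_df.index and movie_id in predicted_df.columns:
            preds.append(predicted_df.at[user_id, movie_id])
        else:
            preds.append(fallback)  # unknown user or movie
    errors = test_df["rating"].to_numpy() - np.array(preds)
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical predictions for two users and two movies
toy_pred = pd.DataFrame([[4.0, 3.0], [2.0, 5.0]], index=[1, 2], columns=[10, 50])

# Hypothetical test ratings; user 3 has no row in toy_pred
toy_test = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 50, 10],
    "rating": [5.0, 4.0, 3.0],
})

toy_rmse = rmse_from_lookup(toy_test, toy_pred, fallback=3.5)
```

In this toy case the errors are 1, -1, and -0.5, so the RMSE is the square root of 0.75.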

**Calculating RMSE for Content Based model**

The first step in calculating RMSE for the content-based technique is to create a test dataset of 1000 user-item ratings sampled at random from the original ratings. A function then produces a predicted rating for each user-movie pair in the test data. In this version the prediction is a random value drawn uniformly between 1 and 5 and serves as a placeholder; the content-based recommendations retrieved for each movie are not yet used in the score. The predicted ratings are compared with the actual ratings from this test dataset, and the RMSE is calculated as the square root of the mean squared difference between them.

In [None]:
# Calculating and showing the mean rating from rating dataset
mean_rating = ratings['rating'].mean()
print(f"The mean rating is: {mean_rating}")

The mean rating is: 3.501556983616962


In [None]:
# Generating sample test data containing user-item ratings for movies - creating a small test dataset with 1000 ratings.
# This replaces the earlier test_data split and samples from the full ratings table.
num_ratings = 1000
test_data = ratings.sample(n=num_ratings, random_state=42)

# Developing a function to predict ratings for movies based on content-based filtering
def predict_ratings_for_movies(test_data, similarity_matrix):
    predicted_ratings = []
    for user_id, movie_id, rating in test_data[['userId', 'movieId', 'rating']].values:
        recommended_movies = content_based_recommendations(movies[movies['movieId'] == movie_id]['title'].values[0], similarity_matrix)
        # Placeholder prediction: a random rating between 1 and 5 (recommended_movies is not used in it)
        predicted_rating = np.random.uniform(1, 5)
        predicted_ratings.append(predicted_rating)
    return predicted_ratings

# Generating predicted ratings for the test data
predicted_test_ratings_content = predict_ratings_for_movies(test_data, item_similarity)

# The actual ratings from the test data
test_ratings = test_data['rating'].values

# Calculating RMSE for content-based filtering (square root of the mean squared error)
rmse_content = np.sqrt(mean_squared_error(test_ratings, predicted_test_ratings_content))
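
One common way to turn item similarities into a rating prediction is a similarity-weighted average of the ratings the user has already given. The sketch below illustrates that idea on toy data; it is an assumption about how the placeholder could be replaced, not the method used in this project, and `predict_from_similar_items` is a hypothetical helper.

```python
import numpy as np

def predict_from_similar_items(target_idx, rated_idx, rated_values, similarity, fallback):
    """Similarity-weighted average of a user's known ratings for one target item."""
    weights = similarity[target_idx, np.asarray(rated_idx)]
    total = weights.sum()
    if total <= 0:
        return fallback  # no rated item resembles the target
    return float(np.dot(weights, np.asarray(rated_values, dtype=float)) / total)

# Hypothetical item-item similarity matrix for three movies
toy_item_sim = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# The user rated items 1 and 2 with 4.0 and 2.0; predict a rating for item 0
weighted = predict_from_similar_items(0, [1, 2], [4.0, 2.0], toy_item_sim, fallback=3.5)

# The user rated only item 2, which has zero similarity to item 0
no_signal = predict_from_similar_items(0, [2], [2.0], toy_item_sim, fallback=3.5)
```

Item 2 has zero similarity to item 0, so the first prediction equals the rating of item 1, and the second call returns the fallback.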

The next cells evaluate the hybrid recommendation system. A predicted rating is produced for each user-item pair in the test dataset, and the RMSE is computed by comparing these predicted ratings with the actual ratings in the test dataset, measuring how closely the hybrid predictions match the user ratings for movies.

In [None]:
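# Developing a function to evaluate the hybrid recommendation system
# (unfinished outline: it is not called anywhere in this notebook and returns nothing)
def evaluate_hybrid_recommendation(train_data, test_data):
    # Intended to generate predictions for collaborative, content-based, and hybrid systems.
    # As written, the calls pass DataFrames where a user ID or movie title is expected, and
    # hybrid_recommended_movies_cold_start is the list created earlier, not a function.
    collaborative_predictions = collaborative_filtering_recommendations(train_data, test_data)
    content_predictions = content_based_recommendations(train_data, test_data)
    hybrid_predictions = hybrid_recommended_movies_cold_start(train_data, test_data)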

In [None]:
# Generating predicted ratings for the test data using hybrid recommendations
def predict_ratings_hybrid(test_data, train_user_item_matrix, predicted_ratings, item_similarity):
    predicted_ratings_hybrid = []

    for user_id, movie_id, _ in test_data[['userId', 'movieId', 'rating']].values:
        # Check whether a prediction is available for this user and movie.
        # predicted_ratings is a NumPy array, so `in` compares against the array's values,
        # not against user or movie IDs.
        if user_id not in predicted_ratings or movie_id not in predicted_ratings[user_id]:
            predicted_rating = np.random.uniform(1, 5)  # Fallback: random rating between 1 and 5
        else:
            # Using the SVD-based predicted rating for existing users
            predicted_rating = predicted_ratings[user_id][movie_id]

        predicted_ratings_hybrid.append(predicted_rating)

    return predicted_ratings_hybrid

# Generating predicted test ratings for the test data using hybrid recommendations
predicted_test_ratings_hybrid = predict_ratings_hybrid(test_data, train_user_item_matrix, predicted_ratings, item_similarity)
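
A hybrid rating prediction can also be formed by blending the two scores directly. The sketch below assumes that collaborative and content-based predictions are already available as aligned arrays, with NaN marking pairs for which no collaborative score exists (the cold start case). The helper `blend_predictions` and the weight `alpha` are hypothetical.

```python
import numpy as np

def blend_predictions(collab_pred, content_pred, alpha=0.5):
    """Weighted blend of two prediction arrays; content-only where collab is NaN."""
    collab = np.asarray(collab_pred, dtype=float)
    content = np.asarray(content_pred, dtype=float)
    blended = alpha * collab + (1 - alpha) * content
    # Cold start pairs have no collaborative score, so use the content score alone
    return np.where(np.isnan(collab), content, blended)

# Hypothetical predictions: the second pair has no collaborative score
blended_scores = blend_predictions([4.0, np.nan], [2.0, 3.0], alpha=0.5)
```

Setting `alpha` closer to 1 favours the collaborative score, while values closer to 0 favour the content-based one.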

In [None]:
# Calculating RMSE for hybrid recommendations (square root of the mean squared error)
rmse_hybrid = np.sqrt(mean_squared_error(test_ratings, predicted_test_ratings_hybrid))

In [None]:
# Showing and comparing RMSE values to find the best model
print(f"Collaborative Filtering RMSE: {rmse_collab}")
print(f"Content-Based Filtering RMSE: {rmse_content}")
print(f"Hybrid Recommender System RMSE: {rmse_hybrid}")
if rmse_collab < rmse_content and rmse_collab < rmse_hybrid:
    print("Collaborative Filtering is the best model since it has the lowest RMSE score.")
elif rmse_content < rmse_collab and rmse_content < rmse_hybrid:
    print("Content-Based Filtering is the best model since it has the lowest RMSE score.")
else:
    print("Hybrid Recommender System is the best model since it has the lowest RMSE score")


Collaborative Filtering RMSE: 1.0488361768130714
Content-Based Filtering RMSE: 1.6148782575997571
Hybrid Recommender System RMSE: 1.6157367265293443
Collaborative Filtering is the best model since it has the lowest RMSE score.


GitHub link: https://github.com/shakhan-17/Big-Data-Projects/tree/main
