##**PROJECT TITLE:** A COMPREHENSIVE APPROACH TO ADDRESS THE COLD START PROBLEM IN RECOMMENDER SYSTEMS

**Initial Results and Code**

**Submitted by:** Md Shamsul Arif Khan

**Student ID:** 501140715


**Supervisor Name:** Ceni Babaoglu

**Course Code:** CIND820

**Date of Submission:** November 15, 2023

**The project will be created on Google Collab. The following libraries and tools will be imported to our Google Collab platform to initiate the project.**

In [1]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
import urllib.request
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error


**Once the necessary libraries are imported, the datasets will be uploaded to Google Collab and processed to build the recommender systems**

In [2]:
# Link for the MovieLens small dataset
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
file_name = 'movielens_small.zip'

# Download and extract datasets and prepare those to upload
if not os.path.exists(file_name):
    urllib.request.urlretrieve(url, file_name)
    with ZipFile(file_name, 'r') as zip_ref:
        zip_ref.extractall()

# Load datasets into Pandas DataFrames
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')
links = pd.read_csv('ml-latest-small/links.csv')

In the next step, the following code will be used for merging the data, cleaning the combined data, feature engineering, and train-test splitting for model evaluation, creating a user-item matrix for collaborative filtering and converting the matrix into a sparse format suitable for further processing in collaborative filtering algorithms.

In [3]:
# Combining the 'tags' DataFrame with the 'movies' DataFrame based on the 'movieId' column using a left join, thus adding tag-related information to the movies dataset.
movies = pd.merge(movies, tags, on='movieId', how='left')

# To merge the 'links' DataFrame with the 'movies' DataFrame based on the 'movieId' column using a left join, thus incorporating links-related information into the movies dataset.
movies = pd.merge(movies, links, on='movieId', how='left')

# To Clean NaN values in tags, genres, and IMDbId columns by filling in missing values in the 'tag' column with empty strings.
movies['tag'] = movies['tag'].fillna('')

# To modify the 'genres' column by replacing the '|' separator with a space And create a new 'features' column in the movies dataset by combining 'genres' and 'tag' information.
movies['genres'] = movies['genres'].str.replace('|', ' ')

# To combine relevant information for movie features
movies['features'] = movies['genres'] + ' ' + movies['tag']

# To split data into training and test sets for collaborative filtering
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# To create a user-item matrix for collaborative filtering
train_user_item_matrix = train_data.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

# To convert the DataFrame into a sparse matrix
train_user_item_matrix_sparse = csr_matrix(train_user_item_matrix.values)

  movies['genres'] = movies['genres'].str.replace('|', ' ')


**Building recommender system using Collaborative Filtering method to handle cold start problem**

The following codes will be used to perform matrix factorization using SVD to decompose the user-item interaction matrix into latent factors and reconstruct the matrix to predict ratings for items that users have not rated. The predicted_ratings matrix contains the estimated ratings, which can be further used to generate user recommendations in the collaborative filtering-based recommendation system.

In [5]:
# Collaborative filtering using matrix factorization (SVD)
num_factors = 50
U, sigma, Vt = svds(train_user_item_matrix_sparse, k=num_factors)
sigma = np.diag(sigma)
predicted_ratings = np.dot(np.dot(U, sigma), Vt)

The collaborative_filtering_recommendations function will generate movie recommendations for a specific user by utilizing predicted ratings from a collaborative filtering model.

In [34]:
# Function for collaborative filtering recommendations

def collaborative_filtering_recommendations(user_id, predicted_ratings, num_recommendations=10):
    # Check if the user_id exists in the ratings matrix
    if user_id not in range(len(predicted_ratings)):
        return []  # Return an empty list for non-existent user IDs

    user_ratings = predicted_ratings[user_id - 1]
    sorted_indices = user_ratings.argsort()[::-1]
    user_seen_movies = train_user_item_matrix.columns[train_user_item_matrix.loc[user_id].gt(0)].tolist()

    recommended_movies = []
    for idx in sorted_indices:
        movie_id = idx + 1
        if movie_id not in user_seen_movies:
            movie_info = movies[movies['movieId'] == movie_id]['title'].values
            if len(movie_info) > 0:
                movie_title = movie_info[0]
                recommended_movies.append((movie_title, user_ratings[idx]))
                if len(recommended_movies) >= num_recommendations:
                    break

    return recommended_movies


Let's generate collaborative filtering movie recommendations for User 1 using the collaborative_filtering_recommendations function with the predicted_ratings. Printing the top-recommended movies and their predicted ratings for User 1 based on collaborative filtering techniques, showcasing the movie titles and associated predicted ratings in an enumerated list format

In [68]:
# Collaborative Filtering recommendations Example
user_id_collab = 1
collab_recommended_movies = collaborative_filtering_recommendations(user_id_collab, predicted_ratings)
print(f"Collaborative Filtering Recommendations for User {user_id_collab}:")
for idx, (movie, rating) in enumerate(collab_recommended_movies, start=1):
    print(f"{idx}. {movie}, {rating} ")


Collaborative Filtering Recommendations for User 1:
1. That Darn Cat (1997), 5.0159132270469255 
2. Muppet Christmas Carol, The (1992), 4.870666524016722 
3. Perfect World, A (1993), 4.757768102863196 
4. Fear and Loathing in Las Vegas (1998), 4.679545505282481 
5. Inspector General, The (1949), 4.659823207461193 
6. Interview with the Vampire: The Vampire Chronicles (1994), 4.458921745950011 
7. Wild Reeds (Les roseaux sauvages) (1994), 4.2317437131456 
8. 8 Seconds (1994), 3.9725433638125764 
9. American Buffalo (1996), 3.9537390735075535 
10. Crow: City of Angels, The (1996), 3.6317791961621597 


**Building recommender system using Content-based Filtering method to handle cold start problem**

The section will create a similarity matrix using cosine similarity for content-based movie recommendations, employing TF-IDF to compute movie features. Additionally, it demonstrates content-based movie recommendations for the film 'Toy Story (1995)' by identifying similar movies based on genres and tags and showcasing a list of related movie titles, genres, and IMDb IDs.

Using TfidfVectorizer to compute a TF-IDF matrix (tfidf_matrix) representing movie features (genres and tags) and calculating a similarity matrix (item_similarity) using cosine similarity between movies based on their feature vectors

In [8]:
# Compute the similarity matrix using cosine similarity for content-based filtering
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['features'].values.astype('U'))
item_similarity = cosine_similarity(tfidf_matrix, tfidf_matrix)

Using the content_based_recommendations function, which takes a movie title as input to identify its index in the movie dataset, retrieves similarity scores from the similarity matrix, determines similar movies based on scores, and returns a selection of similar movies based on content similarity.

In [66]:
# Function for content-based recommendations
def content_based_recommendations(movie_title, similarity_matrix, num_recommendations=10):
    movie_index = movies[movies['title'] == movie_title].index.values[0]
    similar_scores = similarity_matrix[movie_index]
    similar_movies_indices = similar_scores.argsort()[::-1][1:]  # Exclude the movie itself
    similar_movies = movies.iloc[similar_movies_indices]
    return similar_movies[['title', 'genres', 'imdbId']]


Employing content_based_recommendations function with the item similarity matrix (item_similarity) to generate content-based recommendations for the movie 'Toy Story (1995)', and prints a list of similar movie titles along with their genres and IMDb IDs based on the content similarity to the movie.

In [67]:
# Content-Based Filtering Example
movie_title_content = 'Toy Story (1995)'
content_recommended_movies = content_based_recommendations(movie_title_content, item_similarity)
print("\nContent-Based Filtering Recommendations:")
print(content_recommended_movies.head(10))


Content-Based Filtering Recommendations:
                                                   title  \
1                                       Toy Story (1995)   
3214                                  Toy Story 2 (1999)   
3217                                  Toy Story 2 (1999)   
2484                                Bug's Life, A (1998)   
8672                                           Up (2009)   
4633                               Monsters, Inc. (2001)   
11499                                       Moana (2016)   
3966                    Emperor's New Groove, The (2000)   
9544   Asterix and the Vikings (Astérix et les Viking...   
10948                           The Good Dinosaur (2015)   

                                            genres   imdbId  
1      Adventure Animation Children Comedy Fantasy   114709  
3214   Adventure Animation Children Comedy Fantasy   120363  
3217   Adventure Animation Children Comedy Fantasy   120363  
2484           Adventure Animation Children Comed

**Building recommender system using Hybrid Filtering method to handle the cold start problem**

Using the hybrid_recommendations function to merge collaborative and content-based movie recommendations and thus combine recommendations from collaborative filtering and content-based filtering methods, then sort and select unique movie suggestions to form a hybrid recommendation list for a given user and a specific movie.

In [24]:
# Function for hybrid recommendations handling cold start problem
def hybrid_recommendations_cold_start(user_id, movie_title, num_recommendations=10):
    # For new users, use content-based recommendations
    if user_id not in train_user_item_matrix.index:
        return content_based_recommendations(movie_title, item_similarity, num_recommendations)

    # For existing users and items, proceed with hybrid recommendations
    collab_recommended = collaborative_filtering_recommendations(user_id, predicted_ratings, num_recommendations)
    content_recommended = content_based_recommendations(movie_title, item_similarity, num_recommendations)

    hybrid_recommendations = []
    collab_titles = [title for title, _ in collab_recommended]
    for idx, (title, _) in enumerate(collab_recommended):
        if title not in collab_titles:
            hybrid_recommendations.append((title, idx+1))

    content_titles = [title for title in content_recommended['title']]
    for title in content_titles:
        if title not in collab_titles:
            hybrid_recommendations.append((title, idx+1))

    hybrid_recommendations = sorted(hybrid_recommendations, key=lambda x: x[1])
    return [movie[0] for movie in hybrid_recommendations[:num_recommendations]]


Generating hybrid recommendations for 'User 1' and the movie 'Toy Story (1995)' by combining collaborative and content-based filtering approaches and creating a resulting list to showcase unique movie titles recommended through this hybrid methodology

In [14]:
# Example usage for handling cold start problems
user_id_hybrid = 1
movie_title_content = 'Toy Story (1995)'
hybrid_recommended_movies_cold_start = hybrid_recommendations_cold_start(user_id_hybrid, movie_title_content)
print(f"\nHybrid Recommendations for User {user_id_hybrid} based on '{movie_title_content}':")
for idx, movie in enumerate(hybrid_recommended_movies_cold_start, start=1):
    print(f"{idx}. {movie}")
    print(f"{idx}. {movie}")


Hybrid Recommendations for User 1 based on 'Toy Story (1995)':
1. Toy Story (1995)
1. Toy Story (1995)
2. Toy Story 2 (1999)
2. Toy Story 2 (1999)
3. Toy Story 2 (1999)
3. Toy Story 2 (1999)
4. Bug's Life, A (1998)
4. Bug's Life, A (1998)
5. Up (2009)
5. Up (2009)
6. Monsters, Inc. (2001)
6. Monsters, Inc. (2001)
7. Moana (2016)
7. Moana (2016)
8. Emperor's New Groove, The (2000)
8. Emperor's New Groove, The (2000)
9. Asterix and the Vikings (Astérix et les Vikings) (2006)
9. Asterix and the Vikings (Astérix et les Vikings) (2006)
10. The Good Dinosaur (2015)
10. The Good Dinosaur (2015)


**Evaluation Matrix**

Calculating the Root Mean Squared Error (RMSE) for collaborative filtering predictions by comparing predicted ratings against actual ratings in the test dataset. It iterates through the test interactions, retrieves predicted ratings and computes the RMSE metric to assess the performance of the collaborative filtering model in predicting user-item interactions.

**Calculating RMSE for Collaborative model**

In [15]:
# Calculate RMSE for collaborative filtering model
predicted_test_ratings = []
for user_id, movie_id, rating in test_data[['userId', 'movieId', 'rating']].values:

    if user_id in predicted_ratings and movie_id in predicted_ratings[user_id]:
        predicted_rating = predicted_ratings[user_id][movie_id]
        predicted_test_ratings.append(predicted_rating)
    else:
        # Handle cases where the prediction is not available (using mean or default value)
        # Use mean_rating as a placeholder value
        mean_rating = ratings['rating'].mean()  # Replacing the mean rating from rating dataset
        predicted_test_ratings.append(mean_rating)

test_ratings = test_data['rating'].values


rmse = mean_squared_error(test_ratings, predicted_test_ratings, squared=False)
print(f"Collaborative Filtering RMSE: {rmse}")

Collaborative Filtering RMSE: 1.0488361768130714


**Calculating RMSE for Content Based model**

In [16]:
# Calculate mean rarting from rating dataset
mean_rating = ratings['rating'].mean()
print(f"The mean rating is: {mean_rating}")

The mean rating is: 3.501556983616962


In [22]:
# Generate sample test data containing user-item ratings for movies - creating a small test dataset with 1000 ratings
num_ratings = 1000
test_data = ratings.sample(n=num_ratings, random_state=42)

# Function to predict ratings for movies based on content-based filtering
def predict_ratings_for_movies(test_data, similarity_matrix):
    predicted_ratings = []
    for user_id, movie_id, rating in test_data[['userId', 'movieId', 'rating']].values:
        recommended_movies = content_based_recommendations(movies[movies['movieId'] == movie_id]['title'].values[0], similarity_matrix)
        # Calculating predicted ratings here based on the recommendations using a random predicted rating between 1 and 5
        predicted_rating = np.random.uniform(1, 5)
        predicted_ratings.append(predicted_rating)
    return predicted_ratings

# Generate predicted ratings for the test data
predicted_test_ratings_content = predict_ratings_for_movies(test_data, item_similarity)

# The actual ratings from the test data
test_ratings = test_data['rating'].values

# Calculate RMSE for content-based filtering
rmse_content = mean_squared_error(test_ratings, predicted_test_ratings_content, squared=False)
print(f"Content-Based Filtering RMSE: {rmse_content}")


Content-Based Filtering RMSE: 1.6698010154500282


In [58]:
# Function to evaluate the hybrid recommendation system
def evaluate_hybrid_recommendation(train_data, test_data):
    # Generate predictions for collaborative, content-based, and hybrid systems
    collaborative_predictions = collaborative_filtering_recommendations(train_data, test_data)
    content_predictions = content_based_recommendations(train_data, test_data)
    hybrid_predictions = hybrid_recommended_movies_cold_start(train_data, test_data)

Github link: https://github.com/shakhan-17/Big-Data-Projects/tree/main
