# Mini Project: Recommendation Engines

Recommendation engines are algorithms designed to provide personalized suggestions or recommendations to users. These systems analyze user behavior, preferences, and interactions with items (products, movies, music, articles, etc.) to predict and offer items that users are likely to be interested in. Recommendation engines play a crucial role in enhancing user experience, driving engagement, and increasing conversion rates in various applications, including e-commerce, entertainment, content platforms, and more.

Recommendation engines generally take one of two approaches: collaborative filtering or content-based recommendation.

**1. Collaborative Filtering:**
Collaborative Filtering is a popular approach to building recommendation systems that leverages the collective behavior of users to make personalized recommendations. It is based on the idea that users who have agreed in the past will likely agree in the future. There are two main types of collaborative filtering:

- **User-based Collaborative Filtering:** This method finds users similar to the target user based on their past interactions (e.g., ratings or purchases). It then recommends items that similar users have liked but the target user has not interacted with yet.

- **Item-based Collaborative Filtering:** In this approach, the system identifies similar items based on user interactions. It recommends items that are similar to the ones the target user has already liked or interacted with.

Collaborative filtering does not require any explicit information about items but relies on the similarity between users or items. It is effective in capturing complex patterns and can provide serendipitous recommendations. However, it suffers from the cold-start problem (i.e., difficulty in recommending to new users or items with no interactions) and scalability challenges in large datasets.
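
As a rough illustration of the two variants, the sketch below computes user-to-user and item-to-item cosine similarities on a tiny made-up ratings matrix. The names `toy_ratings`, `user_sim`, and `item_sim` and all the values are hypothetical and are not part of the MovieLens data.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
toy_ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 1],
     [0, 0, 5, 4]],
    index=["alice", "bob", "carol"],
    columns=["item_a", "item_b", "item_c", "item_d"],
)

# User-based: compare rows (users) with each other.
user_sim = pd.DataFrame(
    cosine_similarity(toy_ratings),
    index=toy_ratings.index, columns=toy_ratings.index,
)

# Item-based: compare columns (items) by transposing the matrix first.
item_sim = pd.DataFrame(
    cosine_similarity(toy_ratings.T),
    index=toy_ratings.columns, columns=toy_ratings.columns,
)

print(user_sim.round(2))
print(item_sim.round(2))
```

In this toy data, alice and bob rate the same items similarly, so a user-based engine would treat bob's other ratings as candidate recommendations for alice, while carol shares no rated items with alice and gets a similarity of zero.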

**2. Content-Based Recommendation:**
Content-based recommendation is an alternative approach to building recommendation systems that focuses on the attributes or features of items and users. It leverages the characteristics of items to make recommendations. The key steps involved in content-based recommendation are:

- **Feature Extraction:** For each item, relevant features are extracted. For movies, these features could be genre, director, actors, and plot summary.

- **User Profile:** A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.

- **Similarity Calculation:** The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity or Euclidean distance.

- **Recommendation:** Items that are most similar to the user profile are recommended to the user.

Content-based recommendation systems are less affected by the cold-start problem as they can still recommend items based on their features. They are also more interpretable as they rely on item attributes. However, they may miss out on providing serendipitous recommendations and can be limited by the quality of feature extraction and user profiles.
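
The four steps above can be sketched on a handful of made-up items. This is a minimal illustration under stated assumptions rather than a reference implementation: the genre features, the ratings, and the rating-weighted-average profile are all hypothetical choices.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Feature extraction: hypothetical binary genre features [action, comedy, drama].
item_names = ["movie_a", "movie_b", "movie_c", "movie_d"]
item_features = np.array([
    [1, 0, 0],   # movie_a: action
    [1, 1, 0],   # movie_b: action + comedy
    [0, 0, 1],   # movie_c: drama
    [0, 1, 1],   # movie_d: comedy + drama
], dtype=float)

# The user has rated movie_a (5 stars) and movie_c (1 star).
rated_idx = np.array([0, 2])
user_ratings = np.array([5.0, 1.0])

# User profile: rating-weighted average of the rated items' feature vectors.
user_profile = (user_ratings[:, None] * item_features[rated_idx]).sum(axis=0)
user_profile /= user_ratings.sum()

# Similarity calculation: score every item against the profile,
# then hide the items the user has already rated.
scores = cosine_similarity(user_profile.reshape(1, -1), item_features).ravel()
scores[rated_idx] = -np.inf

# Recommendation: remaining items, most similar first.
ranked = [item_names[i] for i in np.argsort(scores)[::-1] if np.isfinite(scores[i])]
print(ranked)
```

Because the profile is dominated by the highly rated action movie, the action comedy ranks ahead of the comedy drama.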

**Choosing Between Collaborative Filtering and Content-Based:**
Both collaborative filtering and content-based approaches have their strengths and weaknesses. The choice between them depends on the specific requirements of the recommendation system, the type of data available, and the user base. Hybrid approaches that combine collaborative filtering and content-based techniques are also common, aiming to leverage the strengths of both methods and mitigate their weaknesses.
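
One simple way to build a hybrid is to blend the scores of the two engines. The sketch below assumes each engine returns one score per candidate item; `min_max_scale`, `hybrid_scores`, the weight `alpha`, and the example scores are hypothetical, and a weighted sum is only one of several possible combination strategies.

```python
import numpy as np

def min_max_scale(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def hybrid_scores(cf_scores, content_scores, alpha=0.5):
    """Blend two score vectors; alpha is the weight on collaborative filtering."""
    return alpha * min_max_scale(cf_scores) + (1 - alpha) * min_max_scale(content_scores)

# Hypothetical scores for four candidate items from each engine.
cf_scores = [4.5, 3.0, 1.0, 2.0]        # e.g. predicted ratings
content_scores = [0.2, 0.9, 0.1, 0.8]   # e.g. cosine similarities

blended = hybrid_scores(cf_scores, content_scores, alpha=0.5)
print(np.argsort(blended)[::-1])  # item indices, best first
```

Rescaling first matters because predicted ratings and cosine similarities live on different scales; without it, the engine with the larger numeric range would dominate the blend.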

In this mini-project, you'll be building both content-based and collaborative filtering engines for the [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/). The MovieLens 25M dataset is one of the most widely used datasets for building and evaluating recommendation systems. It is provided by the GroupLens Research project, which collects and studies datasets related to movie ratings and recommendations. The dataset contains movie ratings and related information contributed by users of the MovieLens website.

**Dataset Details:**
- **Size:** The dataset contains approximately 25 million movie ratings.
- **Users:** It includes ratings from over 162,000 users.
- **Movies:** The dataset consists of ratings for more than 62,000 movies.
- **Ratings:** In the 25M release, ratings are on a 5-star scale in half-star increments (0.5 to 5.0). The 100K release loaded below uses whole stars from 1 to 5.
- **Timestamps:** Each rating is associated with a timestamp, indicating when the rating was given.

**Data Files:**
The dataset ships as several CSV files. The three most relevant here are:

1. **movies.csv:** Contains information about movies, including the movie ID, title (with the release year embedded in it), and genres.
   - Columns: movieId, title, genres

2. **ratings.csv:** Contains movie ratings provided by users, including the user ID, movie ID, rating, and timestamp.
   - Columns: userId, movieId, rating, timestamp

3. **tags.csv:** Contains user-generated tags for movies, including the user ID, movie ID, tag, and timestamp.
   - Columns: userId, movieId, tag, timestamp

First, import all the libraries you'll need.

In [5]:
import zipfile
import numpy as np
import pandas as pd
from urllib.request import urlretrieve
from sklearn.metrics.pairwise import cosine_similarity

Next, download the relevant components of the MovieLens dataset. Note that the code below fetches the smaller MovieLens 100K release (`ml-100k.zip`) rather than the 25M release, so the file names and column names differ from the CSV layout described above. These instructions are roughly based on the colab [here](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/recommendation-systems/recommendation-systems.ipynb?utm_source=ss-recommendation-systems&utm_campaign=colab-external&utm_medium=referral&utm_content=recommendation-systems#scrollTo=O3bcgduFo4s6).

In [6]:
print("Downloading movielens data...")

urlretrieve('http://files.grouplens.org/datasets/movielens/ml-100k.zip', 'movielens.zip')
zip_ref = zipfile.ZipFile('movielens.zip', 'r')
zip_ref.extractall()
print("Done. Dataset contains:")
print(zip_ref.read('ml-100k/u.info'))

ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# The movies file contains a binary feature for each genre.
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
] + genre_cols
movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')

Downloading movielens data...
Done. Dataset contains:
b'943 users\n1682 items\n100000 ratings\n'


Before doing any kind of machine learning, it's always good to familiarize yourself with the datasets you'll be working with.

Here are your tasks:

1. Spend some time familiarizing yourself with both the `movies` and `ratings` dataframes. How many unique user ids are present? How many unique movies are there?
2. Create a new dataframe that merges the `movies` and `ratings` tables on 'movie_id'. Only keep the 'user_id', 'title', 'rating' fields in this new dataframe.
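
A minimal sketch of both tasks on toy data follows. The `toy_ratings` and `toy_movies` frames are hypothetical stand-ins for `ratings` and `movies`, and an inner merge is just one reasonable choice here.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the `ratings` and `movies` dataframes.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [4, 5, 3],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Task 1: count unique users and unique movies.
print(toy_ratings["user_id"].nunique())
print(toy_movies["movie_id"].nunique())

# Task 2: merge on movie_id, then keep only the three requested fields.
toy_merged = toy_ratings.merge(toy_movies, on="movie_id", how="inner")
toy_merged = toy_merged[["user_id", "title", "rating"]]
print(toy_merged)
```

Movie C has no ratings, so it drops out of the inner merge; a left merge from the ratings side would give the same rows here, since every rating refers to a known movie.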

In [32]:
# Print the first 5 rows of movies data
movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [33]:
# Print first 5 rows of ratings data
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [34]:
# Number of unique users (user_id) in the ratings data
unique_count_ratings = ratings.nunique()
print(f"There are {unique_count_ratings['user_id']} number of unique users (i.e., user_id) in the ratings dataframe.")

There are 943 number of unique users (i.e., user_id) in the ratings dataframe.


In [35]:
# Number of unique movies in the movies dataframe
unique_count_movies = movies.nunique()
print(f"There are {unique_count_movies['movie_id']} number of unique movies (i.e., movie_id) in the movies dataframe.")
print(f"There are {unique_count_ratings['movie_id']} number of unique movies (i.e., movie_id) in the ratings dataframe.")

There are 1682 number of unique movies (i.e., movie_id) in the movies dataframe.
There are 1682 number of unique movies (i.e., movie_id) in the ratings dataframe.


In [36]:
# Check the size of the movie data frame
movies.shape

(1682, 24)

In [37]:
# Check the size of the rating data frame
ratings.shape

(100000, 4)

In [39]:
# Find the ratings given by specific user (user_id=7)
rating_given_user = ratings[ratings['user_id'] == 7]
rating_given_user

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
39,7,32,4,891350932
139,7,479,4,891352010
187,7,455,4,891353086
279,7,382,4,891352093
409,7,163,4,891353444
...,...,...,...,...
95929,7,441,2,891354257
97848,7,198,3,891351685
98150,7,152,4,891351851
98173,7,367,5,891350810


User 7 provided 403 of the 100,000 ratings in the dataset.

In [7]:
# Merge movies and ratings dataframes
# Merging is based on 'movie_id'. Merging 'left' is used to keep all the rows of the rating data frame (how='outer' also is possible)
movie_rating_df = ratings.merge(movies, how='left', on='movie_id')

In [40]:
# Check the shape of the merged data frame
movie_rating_df.shape

(100000, 27)

In [41]:
# Check the first few rows of the merged data frame
movie_rating_df.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,title,release_date,video_release_date,imdb_url,genre_unknown,Action,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
1,186,302,3,891717742,L.A. Confidential (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?L%2EA%2E+Conf...,0,0,...,0,1,0,0,1,0,0,1,0,0
2,22,377,1,878887116,Heavyweights (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Heavyweights%...,0,0,...,0,0,0,0,0,0,0,0,0,0
3,244,51,2,880606923,Legends of the Fall (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Legends%20of%...,0,0,...,0,0,0,0,0,1,0,0,1,1
4,166,346,1,886397596,Jackie Brown (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?imdb-title-11...,0,0,...,0,0,0,0,0,0,0,0,0,0


As mentioned in the introduction, content-based filtering is a recommendation engine approach that focuses on the attributes or features of items (products, movies, music, articles, etc.) and leverages these features to make personalized recommendations. The underlying idea is to match the characteristics of items with the preferences of users to suggest items that align with their interests. Content-based filtering is particularly useful when explicit user-item interactions (e.g., ratings or purchases) are sparse or unavailable.

**Key Steps in Content-Based Filtering:**

1. **Feature Extraction:**
   - For each item, relevant features are extracted. These features are typically descriptive attributes that can be represented numerically, such as genre, director, actors, author, publication date, and keywords.
   - In the case of text-based items, natural language processing techniques may be used to extract features like TF-IDF (Term Frequency-Inverse Document Frequency) scores.

2. **User Profile Creation:**
   - A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.
   - For example, if a user has watched several action movies, the action genre feature would receive a higher weight in their profile.

3. **Similarity Calculation:**
   - The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity, Euclidean distance, or Pearson correlation.
   - Cosine similarity is commonly used as it measures the cosine of the angle between two vectors, which represents their similarity.

4. **Recommendation:**
   - Items that are most similar to the user profile are recommended to the user. These are items whose features have the highest similarity scores with the user profile.
   - The recommended items are presented as a list sorted by their similarity scores.
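
To make the similarity step concrete, the sketch below computes cosine similarity by hand, as the dot product divided by the product of the vector norms, and checks it against scikit-learn. The helper `cosine_by_hand` and the three genre vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_by_hand(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical genre vectors over [action, thriller, romance].
profile = [1, 1, 0]   # a user who likes action thrillers
movie_x = [1, 0, 0]   # pure action
movie_y = [0, 0, 1]   # pure romance

print(cosine_by_hand(profile, movie_x))  # 1 / sqrt(2), about 0.707
print(cosine_by_hand(profile, movie_y))  # 0.0, no shared genres

# The library function returns the same value (as a 1x1 matrix).
print(cosine_similarity([profile], [movie_x])[0, 0])
```

Because the measure depends only on the angle between vectors, a movie with one matching genre out of the profile's two scores about 0.707 regardless of how large the vector entries are.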

**Advantages of Content-Based Filtering:**
1. **Reduced Item Cold-Start Problem:** Content-based filtering can recommend a new item that has no interaction history, because it relies on item features rather than on other users' ratings. A new user still needs some interactions (or stated preferences) before a profile can be built.

2. **User Independence:** The recommendations are based solely on the features of items and do not require knowledge of other users' preferences or behavior.

3. **Transparency:** Content-based recommendations are interpretable, as they depend on the features of items, making it easier for users to understand why specific items are recommended.

4. **Serendipity (limited):** Content-based filtering can surface items the user has not seen before, provided their features resemble items the user already liked. Truly unexpected discoveries are less common than with collaborative filtering.

5. **Diversity in Recommendations:** The method can offer diverse recommendations since it suggests items with different feature combinations.

**Limitations of Content-Based Filtering:**
1. **Limited Discovery:** Content-based filtering may struggle to recommend items outside the scope of users' historical interactions or interests.

2. **Over-Specialization:** Users may receive recommendations that are too similar to their previous choices, leading to a lack of exposure to new item categories.

3. **Dependency on Feature Quality:** The quality and relevance of item features significantly influence the quality of recommendations.

4. **Limited for Cold Items:** Content-based filtering can struggle to recommend new items with limited feature information.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return content-based recommendations for this user. Here are steps you can take:

  A. Get the user's rated movies

  B. Create a TF-IDF matrix using movie genres. Note, this can be extracted from the `movies` dataframe.

  C. Compute the cosine similarity between movie genres. Use the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function.

  D. Get the indices of similar movies to those rated by the user based on cosine similarity. Keep only the top 5.

  E. Remove duplicates and movies already rated by the user.
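
One possible shape for steps A to E is sketched below on toy data. It is an illustrative alternative, not the solution given in this notebook: it assumes a movie-level table with a space-separated `genres` column, and the names `toy_movies`, `toy_ratings`, and `recommend_by_genre` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins: one row per movie, and one row per rating.
toy_movies = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "genres": ["Action Thriller", "Action Thriller", "Romance", "Sci-Fi Action"],
})
toy_ratings = pd.DataFrame({"user_id": [7, 7, 8], "movie_id": [1, 3, 2]})

def recommend_by_genre(user_id, ratings_df, movies_df, top_k=5):
    # A. Movies the user has already rated.
    rated = set(ratings_df.loc[ratings_df["user_id"] == user_id, "movie_id"].tolist())

    # B. TF-IDF over genre strings; \S+ keeps "Sci-Fi" as a single token.
    tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(movies_df["genres"])

    # C. Movie-to-movie cosine similarity (one row and column per movie).
    sim = cosine_similarity(tfidf)

    # D. For each rated movie, record the best similarity seen per candidate.
    movie_ids = movies_df["movie_id"].tolist()
    best = {}
    for pos in np.flatnonzero(np.isin(movie_ids, list(rated))):
        for other, score in zip(movie_ids, sim[pos]):
            # E. Skip already-rated movies; the dict keys remove duplicates.
            if other not in rated:
                best[other] = max(best.get(other, 0.0), float(score))

    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

print(recommend_by_genre(7, toy_ratings, toy_movies))
```

Computing similarities once per movie rather than once per rating row keeps the similarity matrix at (number of movies) squared, which is considerably smaller than a row-level matrix when movies have many ratings.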

In [42]:
# Print the first few rows of merged data frame
movie_rating_df.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,title,release_date,video_release_date,imdb_url,genre_unknown,Action,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,Kolya (1996),24-Jan-1997,,http://us.imdb.com/M/title-exact?Kolya%20(1996),0,0,...,0,0,0,0,0,0,0,0,0,0
1,186,302,3,891717742,L.A. Confidential (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?L%2EA%2E+Conf...,0,0,...,0,1,0,0,1,0,0,1,0,0
2,22,377,1,878887116,Heavyweights (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Heavyweights%...,0,0,...,0,0,0,0,0,0,0,0,0,0
3,244,51,2,880606923,Legends of the Fall (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Legends%20of%...,0,0,...,0,0,0,0,0,1,0,0,1,1
4,166,346,1,886397596,Jackie Brown (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?imdb-title-11...,0,0,...,0,0,0,0,0,0,0,0,0,0


In [127]:
# Check the unique number of rating numbers given by users within the data set
movie_rating_df['rating'].unique()

array([3, 1, 2, 4, 5])

Ratings in this dataset take only the integer values 1 through 5.

In [43]:
# Count missing (NA) values in the rating column
movie_rating_df['rating'].isna().sum()

0

Every row of `movie_rating_df` has a rating between 1 and 5; none are missing.

In [125]:
# Visualize the statistics
movie_rating_df.describe()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,video_release_date,genre_unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
count,100000.0,100000.0,100000.0,100000.0,0.0,100000.0,100000.0,100000.0,100000.0,100000.0,...,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0,,0.0001,0.25589,0.13753,0.03605,0.07182,...,0.01352,0.01733,0.05317,0.04954,0.05245,0.19461,0.1273,0.21872,0.09398,0.01854
std,266.61442,330.798356,1.125674,5343856.0,,0.01,0.436362,0.344408,0.186416,0.258191,...,0.115487,0.130498,0.224373,0.216994,0.222934,0.395902,0.33331,0.41338,0.291802,0.134894
min,1.0,1.0,1.0,874724700.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,254.0,175.0,3.0,879448700.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,447.0,322.0,4.0,882826900.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,682.0,631.0,4.0,888260000.0,,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,943.0,1682.0,5.0,893286600.0,,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [103]:
# Load the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Load the cosine similarity function
from sklearn.metrics.pairwise import cosine_similarity

# Content-Based Filtering using Movie Genres
def content_based_recommendation(user_id, df):
  # Assumes df has a default 0..n-1 index: index labels are used as row positions below.
  # Side effect: a 'genres_list_words' column is added to df in place.

  # Get the user's rated movies
  movie_rating_of_user = df[df['user_id'] == user_id]

  # Genre indicator columns start at position 8 of the merged data frame
  # (skip 'genres_list_words' if an earlier call already added it)
  genre_columns = df.columns[8:].drop('genres_list_words', errors='ignore')

  # Build a space-separated genre string per row, e.g. 'Mystery Thriller'
  df['genres_list_words'] = df[genre_columns].apply(lambda x: ' '.join(genre_columns[x == 1]), axis=1)

  # Instantiate the TF-IDF vectorizer
  # (the default tokenizer splits hyphenated genres such as 'Sci-Fi' into two tokens)
  genre_vectorizer = TfidfVectorizer()

  # Fit the vectorizer and transform the genre strings into a TF-IDF matrix
  genre_vectorized_data = genre_vectorizer.fit_transform(df['genres_list_words'])

  # Create a DataFrame from the TF-IDF data, labelling columns with the feature names
  tfidf_df_genres = pd.DataFrame(genre_vectorized_data.toarray(), columns=genre_vectorizer.get_feature_names_out())

  # Compute the cosine similarity between the genre vectors of every pair of rows
  cosine_similarity_array = cosine_similarity(tfidf_df_genres)

  # Wrap the array in a pandas DataFrame
  cosine_similarity_df = pd.DataFrame(cosine_similarity_array, columns=tfidf_df_genres.index, index=tfidf_df_genres.index)

  # For each row rated by user_id, collect the rows with the highest cosine
  # similarity; top_movies_df stacks these candidates across all rated rows
  top_movies_df = pd.DataFrame()

  # Iterate over the rows rated by 'user_id'
  for index, row in movie_rating_of_user.iterrows():

    # Sort all rows by descending cosine similarity to the current rated row
    similarity_cur_index = cosine_similarity_df.iloc[index,:].sort_values(ascending=False)

    # Take the next 10 rows, skipping position 0, which is normally the row itself
    # (score 1); rows with identical genres tie at 1, so this is not guaranteed
    index_sorted = similarity_cur_index.index[1:11]

    # From the input data frame df, select the rows with high similarity
    df_scores = df.iloc[index_sorted.to_list(),:]

    # Drop rows that were rated by user_id. A movie the user rated can still
    # appear through another user's row, and duplicates are left to the caller.
    df_scores = df_scores[df_scores['user_id'] != user_id].copy()

    # Add the cosine similarity score of each candidate row (aligned on the index)
    df_scores['similarity_score'] = similarity_cur_index

    # Vertically stack the data frames
    top_movies_df = pd.concat([top_movies_df, df_scores])

  return top_movies_df



In [221]:
# User id to test the recommendations
user_id_test = 7
# Number of rows to select in the original data set (selecting the entire set requires large memory in the system)
number_of_data_points = 200

# Select the subset of data from the data set.
# .copy() makes df_tmp an independent frame, so the column added inside
# content_based_recommendation does not trigger a SettingWithCopyWarning.
df_tmp = movie_rating_df.iloc[0:number_of_data_points,:].copy()

# Find the recommendations (Content-Based Filtering using Movie Genres) using content_based_recommendation
movies_ret = content_based_recommendation(user_id_test, df_tmp)
movies_recommended_with_score = movies_ret[['user_id','movie_id','similarity_score']]

# The top movies with similarity scores for each movie
movies_recommended_top = movies_recommended_with_score.sort_values(by='similarity_score', ascending=False)
top5 = movies_recommended_top['movie_id'].unique()[0:5].tolist()
print(f'Top five movies for user {user_id_test} is {top5}')

Top five movies for user 7 is [322, 1042, 328, 265, 144]


Testing the recommendation engine 'content_based_recommendation'!

In [234]:
# Top 10 recommendation rows: source user id, movie id, and similarity score
movies_recommended_top.head(10)

Unnamed: 0,user_id,movie_id,similarity_score
192,276,322,1.0
52,260,322,1.0
18,291,1042,1.0
190,119,328,0.777572
105,166,328,0.777572
195,38,328,0.777572
129,41,265,0.704371
114,58,144,0.704371
184,99,79,0.704371
22,299,144,0.704371


In [230]:
# The user id 7 movies in the data set (Note: Only a subset from the original data set is used)
df_tmp[['user_id', 'movie_id', 'genres_list_words']][df_tmp['user_id'] == 7]

Unnamed: 0,user_id,movie_id,genres_list_words
39,7,32,Documentary
139,7,479,Mystery Thriller
187,7,455,Action


In [224]:
# Recommended top five movies for user 7 is [322, 1042, 328, 265, 144]
# Check the genre description of movie id 322 (row rated by user id 276)
# movie_rating_df[(movie_rating_df['movie_id'] == 322) & (movie_rating_df['user_id'] == 276)]
df_tmp['genres_list_words'][(df_tmp['movie_id'] == 322) & (df_tmp['user_id'] == 276)]

Unnamed: 0,genres_list_words
192,Mystery Thriller


In [232]:
# Check the genre description of movie id 1042 (row rated by user id 291)
df_tmp['genres_list_words'][(df_tmp['movie_id'] == 1042) & (df_tmp['user_id'] == 291)]

Unnamed: 0,genres_list_words
18,Mystery Thriller


In [233]:
df_tmp['genres_list_words'][(df_tmp['movie_id'] == 328) & (df_tmp['user_id'] == 119)]
# movie_rating_df[(movie_rating_df['movie_id'] == 328) & (movie_rating_df['user_id'] == 119)]

Unnamed: 0,genres_list_words
190,Action Mystery Romance Thriller


**Discussion on the 'Content-Based Filtering using Movie Genres' recommendation engine:**
Movie 322 (Mystery, Thriller), which enters the subset through a rating by user 276, has the same genres as movie 479, which user 7 rated, so it scores a similarity of 1.0. The same holds for movie 1042 (Mystery, Thriller), rated by user 291. Movie 328 (Action, Mystery, Romance, Thriller) only partially overlaps with those genres, which explains its lower score of about 0.78. Judged by genre similarity alone, the recommendations make sense.

The key idea behind collaborative filtering is that users who have agreed in the past will likely agree in the future. Instead of relying on item attributes or user profiles, collaborative filtering identifies patterns of user behavior and item preferences from the interactions present in the data.

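**Types of Collaborative Filtering:**
There are two main types of collaborative filtering:

- **User-based:** finds users similar to the target user from their past interactions and recommends items those users liked that the target user has not yet interacted with.
- **Item-based:** finds items similar to those the target user already liked, based on user interactions, and recommends them.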

**Collaborative Filtering Process:**
The collaborative filtering process typically involves the following steps:

1. **Data Collection:**
   - Gather data on user-item interactions, such as movie ratings, product purchases, or article clicks.

2. **User-Item Matrix:**
   - Organize the data into a user-item matrix, where rows represent users, columns represent items, and the entries contain the users' interactions (e.g., ratings).

3. **Similarity Calculation:**
   - Calculate the similarity between users or items using similarity metrics such as cosine similarity, Pearson correlation, or Jaccard similarity.
   - For user-based collaborative filtering, user similarities are calculated, and for item-based collaborative filtering, item similarities are calculated.

4. **Neighborhood Selection:**
   - For each user or item, select the most similar users or items as the neighborhood.
   - The size of the neighborhood (the number of similar users or items to consider) is an important parameter to control the system's behavior.

5. **Prediction Generation:**
   - Predict the ratings for items that the target user has not yet interacted with by combining the ratings of neighboring users or items.

6. **Recommendation Generation:**
   - Recommend items with the highest predicted ratings to the target user.
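
The six steps can be sketched end to end on a small made-up interaction log. The similarity-weighted average used for prediction is one common choice rather than the only one, and `toy`, `matrix`, `predict_rating`, and the neighbourhood size `k` are hypothetical names and values.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Steps 1-2: hypothetical interactions pivoted into a user-item matrix (0 = unrated).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30, 10, 30],
    "rating":   [5, 4, 5, 4, 5, 1, 2],
})
matrix = toy.pivot_table(index="user_id", columns="movie_id", values="rating").fillna(0)

# Step 3: user-user cosine similarity.
sim = pd.DataFrame(cosine_similarity(matrix), index=matrix.index, columns=matrix.index)

def predict_rating(user_id, movie_id, k=2):
    # Step 4: the k most similar users who actually rated this movie.
    mask = ((matrix[movie_id] > 0) & (matrix.index != user_id)).to_numpy()
    neighbours = sim.loc[user_id, matrix.index[mask]].nlargest(k)
    if neighbours.sum() == 0:
        return np.nan  # no usable neighbours
    # Step 5: similarity-weighted average of the neighbours' ratings.
    weighted = (neighbours * matrix.loc[neighbours.index, movie_id]).sum()
    return float(weighted / neighbours.sum())

# Step 6: score the target user's unrated movies by predicted rating.
target = 1
unrated = matrix.columns[(matrix.loc[target] == 0).to_numpy()]
predictions = {int(m): predict_rating(target, m) for m in unrated}
print(predictions)
```

User 1's tastes match user 2's much more closely than user 3's, so the prediction for movie 30 lands nearer to user 2's rating of 5 than to user 3's rating of 2.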

**Advantages of Collaborative Filtering using User-Item Interactions:**
- Collaborative filtering is based solely on user interactions and does not require knowledge of item attributes, making it useful for cases where item data is sparse or unavailable.
- It can provide serendipitous recommendations, suggesting items that users may not have discovered on their own.
- Collaborative filtering can be applied in various domains, including e-commerce, music, movie, and content recommendations.

**Limitations of Collaborative Filtering:**
- The cold-start problem: Collaborative filtering struggles to recommend to new users or items with no or limited interaction history.
- It may suffer from sparsity when data is limited or when users have only interacted with a small subset of items.
- Scalability issues can arise with large datasets and an increasing number of users or items.
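
Sparsity can be quantified as the fraction of user-item pairs with no recorded interaction. The helper below is a small sketch; the function name `interaction_sparsity` and the toy log are hypothetical.

```python
import pandas as pd

def interaction_sparsity(df, user_col="user_id", item_col="movie_id"):
    """Fraction of user-item pairs that have no recorded interaction."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    n_observed = len(df.drop_duplicates([user_col, item_col]))
    return 1.0 - n_observed / (n_users * n_items)

# Hypothetical toy log: 3 users x 4 movies = 12 possible pairs, 5 observed.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 30, 40],
})
print(interaction_sparsity(toy))  # 1 - 5/12, about 0.583
```

Applying the same formula to the counts printed when the data was downloaded (943 users, 1,682 items, 100,000 ratings) gives a sparsity of roughly 0.94, meaning only about 6% of the user-item matrix is filled.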

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return collaborative filtering recommendations for this user based on a user-item interaction matrix. Here are steps you can take:

  A. Create the user-item matrix using Pandas' [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).

  B. Fill missing values with zeros in this matrix.

  C. Calculate user-user similarity matrix using cosine similarity.

  D. Get the array of similarity scores of the target user with all other users from the similarity matrix.

  E. Extract, say, the top 5 most similar users (excluding the target user).

  F. Generate movie recommendations based on the most similar users.

  G. Remove duplicate movie recommendations.

In [14]:
# Collaborative Filtering using User-Item Interactions
def collaborative_filtering_recommendation(user_id, df, N):
  # Create the user-item matrix (pivot assumes at most one rating per user-movie pair)
  movie_rating_small_table = df.pivot(index='user_id', columns='movie_id', values='rating')

  # Fill missing values with 0 (indicating no rating)
  movie_rating_small_table = movie_rating_small_table.fillna(0)

  # Calculate user-user similarity matrix using cosine similarity & convert to DataFrame
  cosine_similarities = cosine_similarity(movie_rating_small_table)
  cosine_similarities_df = pd.DataFrame(cosine_similarities, columns=movie_rating_small_table.index, index=movie_rating_small_table.index)

  # Get the similarity scores of the target user with all other users
  cosine_similarity_of_user = cosine_similarities_df.loc[user_id]

  # Find the top N most similar users (excluding the target user)
  n_users = movie_rating_small_table.shape[0]
  assert n_users > N, f'number of users {n_users} must exceed the requested number of top users {N}'
  # After sorting, position 0 is normally the target user itself (similarity 1), so skip it
  cosine_similarity_of_user_ordered = cosine_similarity_of_user.sort_values(ascending=False).iloc[1:N+1]

  # Generate movie recommendations based on the most similar users
  top_movies_df = df[df['user_id'].isin(cosine_similarity_of_user_ordered.index)]

  # Exclude movies the target user has already rated
  # (duplicates across neighbours are left for the caller to remove, e.g. with unique())
  # First get the rows of df rated by 'user_id'
  movie_ids_of_user_df = df[df['user_id'] == user_id]
  # Second get the list of unique movie_id that the user rated
  unique_movie_ids_of_user = movie_ids_of_user_df['movie_id'].unique()
  # Third get the rows of top_movies_df excluding the movie_id values the user rated
  top_movies_excluding_own_rated = top_movies_df[~top_movies_df['movie_id'].isin(unique_movie_ids_of_user)]

  return top_movies_excluding_own_rated

In [15]:
# Number of rows to select in the original data set (selecting the entire set requires large memory in the system)
number_of_rows = 1000
# Select the subset of data from the data set
movie_rating_small_df = movie_rating_df.iloc[0:number_of_rows,:]
# Call the Collaborative Filtering using User-Item Interactions, i.e., collaborative_filtering_recommendation
user_id_test = 7 # user id of the user to give the recommendations
N = 5 # number of top users
df_ret = collaborative_filtering_recommendation(user_id_test, movie_rating_small_df, N)
# Get the unique recommended movie ids, in order of appearance
top_movie_list = df_ret['movie_id'].unique().tolist()
print(f'Top recommended movies for user id {user_id_test} is { top_movie_list }')

Top recommended movies for user id 7 is [515, 98, 174, 347, 813, 11, 1086, 294, 721, 900, 289, 259, 655, 1198, 25, 506, 774, 258, 22, 589, 923, 9]


Testing the recommendation engine 'collaborative_filtering_recommendation'!

Examine the recommended movie ids and the user ids those recommendations came from.

In [16]:
# List the top 10 entries of the recommendations
df_ret[['user_id','movie_id', 'rating']].head(10)

Unnamed: 0,user_id,movie_id,rating
59,292,515,4
164,90,98,5
209,292,174,5
223,90,347,4
280,264,813,4
289,292,11,5
360,90,1086,4
502,264,294,3
543,264,721,5
557,90,900,4


In [26]:
# Create the pivot table from the movie_rating_small_df
movie_rating_small_pivot_table = movie_rating_small_df.pivot(index='user_id', columns='movie_id', values='rating')
movie_rating_small_pivot_table = movie_rating_small_pivot_table.fillna(0)

In [27]:
# cosine similarity of user id 7 and 156
cosine_similarity(movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 7)] , movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 156)])

array([[0.31497039]])

In [28]:
# cosine similarity of user id 7 and 118
cosine_similarity(movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 7)] , movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 118)])

array([[0.2227177]])

In [29]:
# cosine similarity of user id 7 and 292
movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 7) | (movie_rating_small_pivot_table.index == 292)]
cosine_similarity(movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 7)] , movie_rating_small_pivot_table[(movie_rating_small_pivot_table.index == 292)])

array([[0.14199962]])

In [21]:
# movie_id, user_id and rating for user_id 7 and 156
movie_rating_small_df[['user_id','movie_id', 'rating']][(movie_rating_small_df['user_id'] == 7) | (movie_rating_small_df['user_id'] == 156)]

Unnamed: 0,user_id,movie_id,rating
39,7,32,4
139,7,479,4
187,7,455,4
279,7,382,4
409,7,163,4
490,7,430,3
647,7,497,4
712,7,492,5
763,156,661,4
775,7,661,5


Movie 661 has been rated by both user 7 and user 156, with ratings of 5 and 4 respectively (both high), so the two users have similar interests. It therefore makes sense to recommend movies highly rated by user 156 to user 7. However, user 156 has no other movies in this subset to recommend to user 7.

In [23]:
# movie_id, user_id and rating for user_id 7 and 118
movie_rating_small_df[['user_id','movie_id', 'rating']][(movie_rating_small_df['user_id'] == 7) | (movie_rating_small_df['user_id'] == 118)]

Unnamed: 0,user_id,movie_id,rating
39,7,32,4
139,7,479,4
187,7,455,4
279,7,382,4
409,7,163,4
490,7,430,3
617,118,200,5
647,7,497,4
712,7,492,5
774,118,774,5


Movie 200 has been rated 5 by both user 7 and user 118, so these two users have similar interests. Movie 774 is recommended to user 7 because user 118 rated it highly.

In [22]:
# movie_id, user_id and rating for user_id 7 and 292
movie_rating_small_df[['user_id','movie_id', 'rating']][(movie_rating_small_df['user_id'] == 7) | (movie_rating_small_df['user_id'] == 292)]

Unnamed: 0,user_id,movie_id,rating
39,7,32,4
59,292,515,4
139,7,479,4
187,7,455,4
209,292,174,5
279,7,382,4
289,292,11,5
409,7,163,4
490,7,430,3
647,7,497,4




Movie 661 has been rated highly (rating 5) by both user 7 and user 292, so these two users have similar interests. Movies 515, 174, 11, 589, and 9, which user 292 rated highly (4 or 5), are recommended to user 7 because of this similarity.

In [31]:
# Find the highest cosine similarity user id list for user id 7
cosine_similarities = cosine_similarity(movie_rating_small_pivot_table)
cosine_similarities_df = pd.DataFrame(cosine_similarities, columns=movie_rating_small_pivot_table.index, index=movie_rating_small_pivot_table.index)
cosine_similarity_of_user = cosine_similarities_df.loc[7]
cosine_similarity_of_user_ordered = cosine_similarity_of_user.sort_values(ascending=False)
cosine_similarity_of_user_ordered.head(10)

user_id,7
7,1.0
156,0.31497
118,0.222718
264,0.165089
90,0.15873
292,0.142
271,0.135857
125,0.135857
119,0.106101
254,0.100728


**Discussion on the 'Collaborative Filtering using User-Item Interactions' recommendation engine:**
The engine recommends movies that were rated highly by the users most similar to the target user. Because that similarity is computed from how the users rated the same movies, the recommendations make sense.