### Chapter 3: (ML) Collaborative Filtering

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import ______ # Import cosine_similarity
from sklearn.metrics import ______ # Import mean_squared_error
from scipy.sparse.linalg import svds
from sklearn.decomposition import ______

In [2]:
# Reading ratings file
ratings = pd.read_csv('data-1m/ratings.csv', 
                    sep='\t', #Note that the separator here is "\t"
                    encoding='latin-1',
                    engine='python',
                    index_col=0
                     ) 

# Reading users file
users = pd.read_csv('data-1m/users.csv', 
                    sep='\t', #Note that the separator here is "\t"
                    encoding='latin-1',
                    engine='python',
                    index_col=0
                     )

# Reading movies file
movies = pd.read_csv('data-1m/movies.csv', 
                    sep='\t', #Note that the separator here is "\t"
                    encoding='latin-1',
                    engine='python',
                    index_col=0
                     )

#Reading the combined file (already prepared for your convenience)
combined = pd.read_csv('data-1m/dataset_combined.csv')

In [3]:
combined.______()

### Data Preparation

Let's first create a **User x Movies matrix**

<img src="img/Screenshot 2024-11-16 at 10.05.41 PM.png" width="750">
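
As a rough illustration of what such a matrix looks like, the sketch below builds one from a handful of made-up `(user, movie, rating)` triples in plain Python. The IDs, ratings, and the helper name `build_rating_matrix` are assumptions for this example only; unrated pairs are filled with 0, mirroring the `fillna(0)` used in the next cell.

```python
# Toy (user_id, movie_id, rating) triples; all values are made up for illustration.
toy_ratings = [
    (1, 10, 5.0),
    (1, 20, 3.0),
    (2, 10, 4.0),
    (3, 30, 2.0),
]


def build_rating_matrix(triples):
    """Return (user_ids, movie_ids, matrix); unrated pairs are stored as 0.0."""
    user_ids = sorted({u for u, _, _ in triples})
    movie_ids = sorted({m for _, m, _ in triples})
    row = {u: i for i, u in enumerate(user_ids)}   # user_id -> row position
    col = {m: j for j, m in enumerate(movie_ids)}  # movie_id -> column position
    matrix = [[0.0] * len(movie_ids) for _ in user_ids]
    for u, m, r in triples:
        matrix[row[u]][col[m]] = r
    return user_ids, movie_ids, matrix


user_ids, movie_ids, toy_matrix = build_rating_matrix(toy_ratings)
for u, ratings_row in zip(user_ids, toy_matrix):
    print(u, ratings_row)
```

Each row is one user and each column is one movie, which is the layout the similarity computations later in this chapter rely on.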

In [4]:
#Create user-movie rating matrix
rating_matrix = combined.pivot(index='______', columns='______', values='______').fillna(0)

In [5]:
#Let's print the head of the ratings matrix
______.______()

### User-User Collaborative Filtering

Here we find look-alike users based on similarity and recommend movies that the target user's look-alikes have chosen in the past. This approach is effective but expensive in time and resources, because it requires computing a similarity score for every pair of users. For platforms with a large user base, it is therefore hard to implement without a strongly parallelizable system.
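
To make the idea concrete, here is a minimal sketch on made-up data, written in plain Python so that each step is visible. The user names, the ratings, and the helpers `cosine` and `predict_for` are assumptions for illustration, not part of the dataset or of any library. It scores users by cosine similarity and then predicts ratings as a similarity-weighted average of the neighbours' ratings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Rows are users, positions are movies, 0 means "not rated". Values are made up.
toy = {
    "alice": [5, 4, 0, 0],
    "bob":   [5, 5, 4, 0],
    "carol": [0, 0, 5, 4],
}


def predict_for(target, ratings, k=2):
    """Similarity-weighted average of the k most similar users' ratings."""
    sims = {
        other: cosine(ratings[target], row)
        for other, row in ratings.items()
        if other != target
    }
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    total_weight = sum(sims[n] for n in neighbours)
    n_items = len(ratings[target])
    if total_weight == 0:
        return [0.0] * n_items
    return [
        sum(sims[n] * ratings[n][j] for n in neighbours) / total_weight
        for j in range(n_items)
    ]


scores = predict_for("alice", toy)
# Only movies alice has not rated are candidates for recommendation.
unseen = [j for j, r in enumerate(toy["alice"]) if r == 0]
best = max(unseen, key=lambda j: scores[j])
print("recommend movie at position", best, "with score", round(scores[best], 2))
```

In this toy data, carol shares no rated movie with alice, so her similarity is zero and the prediction is driven entirely by bob.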

<img src="img/ub.png" width="300">

In [9]:
# Calculate user-user similarity matrix
user_similarity = ______(rating_matrix)

In [10]:
# ______ DataFrame for user similarity
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=rating_matrix.index,
    columns=rating_matrix.index
)

In [11]:
______.______()

In [12]:
# Find similar users
n_similar_users = ______
user_id = 1
similar_users = user_similarity_df[user_id].sort_values(ascending=False)[1:n_similar_users+1]

# Get ratings from similar users
similar_user_ratings = rating_matrix.loc[similar_users.index]

# Calculate weighted average of ratings
weights = similar_users.values.reshape(-1, 1)
weighted_ratings = (similar_user_ratings * weights).______(axis=0)
norm_weights = weights.______()

predicted_ratings = weighted_ratings / norm_weights

In [13]:
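# Drop movies already rated by the target user.
# predicted_ratings is a Series indexed by movie_id, so drop by index label
# (passing columns= to Series.drop leaves the Series unchanged).
rated_movie_ids = ratings[ratings['user_id'] == user_id]['movie_id'].values
predicted_ratings = predicted_ratings.drop(index=rated_movie_ids, errors='ignore')
predicted_ratings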

In [14]:
# Get top recommendations
n_recommendations = ______
top_recommendations = predicted_ratings.sort_values(ascending=False)[:n_recommendations]

print("-" * 70)
print(f"{'Movie Title':<50} Predicted Rating")
print("-" * 70)

for movie_id, pred_rating in top_recommendations.items():
    # Look up the title once, then left-align it so the score column lines up
    title = movies[movies['movie_id'] == movie_id]['title'].values[0]
    print(f"{title:<50} {pred_rating:.2f}")

### Item-Item Collaborative Filtering

Here we look for each movie's look-alikes instead. Once we have the movie-movie similarity matrix, we can easily recommend similar movies to any user who has rated a movie in the dataset. This algorithm consumes far fewer resources than user-user collaborative filtering. For a new user, it takes far less time because we don't need similarity scores between all pairs of users. And with a fixed number of movies, the movie-movie similarity matrix stays fixed over time.
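
The same idea can be sketched on a tiny made-up matrix: transpose the user x movie matrix so that each movie becomes a vector of user ratings, then compare movies pairwise. The titles, the ratings, and the helpers `cosine` and `most_similar` are hypothetical and exist only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


movie_titles = ["Movie A", "Movie B", "Movie C"]  # hypothetical titles

# Rows are users, columns are movies, 0 means "not rated". Values are made up.
toy_matrix = [
    [5, 4, 0],
    [4, 5, 0],
    [0, 1, 5],
]

# Transpose: each movie is now described by the ratings it received.
columns = [list(col) for col in zip(*toy_matrix)]

# Movie x movie similarity matrix.
item_sim = [[cosine(a, b) for b in columns] for a in columns]


def most_similar(j, k=1):
    """Return the k movies most similar to movie j, excluding j itself."""
    ranked = sorted(
        (i for i in range(len(columns)) if i != j),
        key=lambda i: item_sim[j][i],
        reverse=True,
    )
    return [(movie_titles[i], item_sim[j][i]) for i in ranked[:k]]


print(most_similar(0))
```

Because the matrix depends only on the movie columns, it can be computed once and reused for every user, which is the cost advantage described above.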

<img src="img/ib.png" width="300">

In [6]:
# Calculate item-item similarity matrix using ______ similarity
item_similarity = ______(rating_matrix.T)

# Create DataFrame for item similarity
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=rating_matrix.columns,
    columns=rating_matrix.columns
)

In [7]:
______.______()

In [8]:
# Get similarity scores for the movie
n_similar = ______
movie_id = ______
similar_scores = item_similarity_df[movie_id]

# Sort similarities in ______ order (excluding the movie itself)
similar_movies = similar_scores.sort_values(ascending=False)[1:n_similar+1]

# Print header
print(f"\nMovies similar to '{movies[movies['movie_id'] == movie_id]['title'].values[0]}':")
print("-" * 60)
print(f"{'Movie Title':<50} Similarity")
print("-" * 60)

# Print each similar movie (the loop variable is named similar_id so movie_id is not overwritten)
for similar_id, similarity in similar_movies.items():
    title = movies[movies['movie_id'] == similar_id]['title'].values[0]
    print(f"{title:<50} {similarity:.3f}")

### Matrix Factorization

<img src="img/Screenshot 2024-11-16 at 10.06.05 PM.png" width="750">
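
In rough terms, matrix factorization approximates the user x movie rating matrix by the product of two much smaller matrices. The notation below is a sketch chosen to match the variable names `R`, `W`, `H` and `n_components` used in the next cell; $m$, $n$ and $k$ are illustrative symbols for the number of users, the number of movies and the number of latent factors.

$$
R \approx W H, \qquad R \in \mathbb{R}^{m \times n}, \quad W \in \mathbb{R}^{m \times k}, \quad H \in \mathbb{R}^{k \times n}, \quad k \ll \min(m, n)
$$

$$
\hat{r}_{ui} = \sum_{f=1}^{k} W_{uf} \, H_{fi}
$$

Each predicted rating $\hat{r}_{ui}$ is therefore the dot product of user $u$'s row of $W$ and movie $i$'s column of $H$, and $k$ corresponds to `n_components`.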

In [18]:
# Fill missing values with ______
R = rating_matrix.fillna(______).values
n_components = 50

# Initialize and fit ______
model = ______(n_components=n_components, init='random', random_state=0)

# Fit the model
# W: user latent factor matrix
# H: item latent factor matrix
W = model.______(R)
H = model.______

# Reconstruct rating matrix
R_pred = np.______(W, H)

# Convert to DataFrame for easier handling
predicted_ratings = pd.DataFrame(
    R_pred,
    index=rating_matrix.index,
    columns=rating_matrix.columns
)

In [19]:
predicted_ratings = predicted_ratings.loc[1]

top_recommendations = (
    predicted_ratings
    .drop(ratings[ratings['user_id'] == 1]['movie_id'].values)
    .sort_values(ascending=______)[:n_recommendations]
)

print("-" * 70)
print(f"{'Movie Title':<50} Predicted Rating")
print("-" * 70)

for movie_id, pred_rating in top_recommendations.items():
    # Look up the title once, then left-align it so the score column lines up
    title = movies[movies['movie_id'] == movie_id]['title'].values[0]
    print(f"{title:<50} {pred_rating:.2f}")