# Building Recommender Systems for Movie Rating Prediction

In this assignment, we will build a recommender system that predicts movie ratings. [MovieLens](https://grouplens.org/datasets/movielens/) currently has 25 million user-movie ratings. Since the entire dataset is too big, we use a 1 million ratings subset, [MovieLens 1M](https://www.kaggle.com/odedgolden/movielens-1m-dataset), which we reformatted to make it more convenient to use.

In [1]:
from dataclasses import asdict

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix, diags
from scipy.spatial.distance import jaccard, cosine, pdist, squareform
from sklearn.metrics.pairwise import cosine_similarity
from pytest import approx

In [2]:
MV_users = pd.read_csv('data/users_m3.csv')
MV_movies = pd.read_csv('data/movies_m3.csv')
train = pd.read_csv('data/train_m3.csv')
test = pd.read_csv('data/test_m3.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

### Starter code
Now we will build a recommender system that uses various techniques to predict ratings. 
The `RecSys` class has baseline prediction methods (such as predicting every rating as 3, or as each user's average rating) and other utility functions. The `ContentBased` and `Collaborative` classes inherit from `RecSys` and add methods that calculate the item-item similarity matrix. You will complete those functions using what we learned about content-based filtering and collaborative filtering.

`RecSys`'s `rating_matrix` method converts the (user id, movie id, rating) triplets from the train data (whose ratings are known) into a utility matrix for 6040 users and 3883 movies.    
Here, we create the utility matrix in dense (numpy.array) format for convenience. With real-world data, where hundreds of millions of users and items may exist, we could not create the utility matrix in dense format (if you are curious why, check the size of the dense matrix with `self.Mr.nbytes`). In that case, we would use sparse matrix operations as much as possible, and distributed file systems and distributed computing would be needed. Fortunately, our data is small enough to fit in laptop/PC memory. We will also use numpy and scipy.sparse, which allow significantly faster calculations than operating on pandas.DataFrame objects.    
In the `rating_matrix` method, pay attention to the index mapping, because user IDs and movie IDs are not the same as array indices.
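
To make the index mapping concrete, here is a minimal sketch on made-up data. The names `toy_users`, `toy_movies`, and `toy_train` are hypothetical stand-ins for the real DataFrames, and the IDs are deliberately non-contiguous so that an ID cannot be used directly as an array index.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical toy data: IDs are deliberately non-contiguous.
toy_users = pd.DataFrame({"uID": [10, 20, 35]})
toy_movies = pd.DataFrame({"mID": [101, 205, 307, 999]})
toy_train = pd.DataFrame({
    "uID": [10, 10, 35],
    "mID": [205, 999, 101],
    "rating": [4, 2, 5],
})

# Map each ID to its row/column position in the utility matrix.
uid2idx = {uid: i for i, uid in enumerate(toy_users.uID)}
mid2idx = {mid: j for j, mid in enumerate(toy_movies.mID)}

rows = [uid2idx[u] for u in toy_train.uID]
cols = [mid2idx[m] for m in toy_train.mID]

# Build the matrix sparsely, then densify for convenience.
sparse_Mr = coo_matrix(
    (toy_train.rating.to_numpy(), (rows, cols)),
    shape=(len(toy_users), len(toy_movies)),
)
dense_Mr = sparse_Mr.toarray()

print(dense_Mr)
print("dense bytes:", dense_Mr.nbytes)    # nbytes is an attribute, not a method
print("stored ratings:", sparse_Mr.nnz)   # only the known ratings are stored
```

The dense array stores every cell, rated or not. For the full 6040 x 3883 matrix, assuming 8-byte entries, that is 6040 * 3883 * 8 = 187,626,560 bytes (about 188 MB), which is why the dense format stops being practical as the number of users and items grows.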

In [4]:
class RecSys():
    def __init__(self, data):
        self.data = data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID, list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID, list(range(len(self.data.users)))))
        self.Mr = self.rating_matrix()
        self.Mm = None
        self.sim = np.zeros((len(self.allmovies), len(self.allmovies)))

    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID]
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)

        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)),
                                   shape=(len(self.allusers), len(self.allmovies))).toarray())

    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        return np.ones(len(self.data.test)) * 3

    def predict_to_user_average(self):
        """
        Predict each test rating as the user's average rating (over rated movies only).
        Returns numpy array of shape (# of rows in test data,)
        """
        user_averages = np.zeros(len(self.data.users))
        for i, uid in enumerate(self.data.users.uID):
            user_ratings = self.Mr[self.uid2idx[uid], :]
            rated = user_ratings > 0  # only rated movies count toward the average
            if rated.sum() > 0:
                user_averages[i] = user_ratings.sum() / rated.sum()
            else:
                user_averages[i] = 3

        # Now map test users to their averages
        test_predictions = np.zeros(len(self.data.test))
        for i, uid in enumerate(self.data.test.uID):
            user_idx = self.uid2idx[uid]
            test_predictions[i] = user_averages[user_idx]

        return test_predictions

    def predict_from_sim(self, uid, mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        user_idx = self.uid2idx[uid]
        movie_idx = self.mid2idx[mid]
        user_ratings = self.Mr[user_idx]
        movie_sim = self.sim[movie_idx]

        user_average = 3

        unweighted_ratings = user_ratings @ movie_sim
        rated_movies = user_ratings > 0
        weighting = np.dot(movie_sim, rated_movies)
        if weighting == 0:
            # No similarity weight on any rated movie: fall back to the user's average (or 3)
            if rated_movies.sum() > 0:
                user_average = user_ratings.sum() / rated_movies.sum()

            user_pred = user_average
        else:
            user_pred = unweighted_ratings / weighting

        return user_pred

    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        predicted_ratings = np.zeros(len(self.data.test))
        for i in range(len(self.data.test)):
            uid = self.data.test.uID[i]
            mid = self.data.test.mID[i]
            predicted_ratings[i] = self.predict_from_sim(uid, mid)
        return predicted_ratings

    def rmse(self, yp):
        yp = np.where(np.isnan(yp), 3, yp)  # Impute any NaN predictions to 3 without modifying the caller's array
        yt = np.array(self.data.test.rating)
        return np.sqrt(((yt - yp) ** 2).mean())


class ContentBased(RecSys):
    def __init__(self, data):
        super().__init__(data)
        self.data = data
        self.Mm = self.calc_movie_feature_matrix()

    def calc_movie_feature_matrix(self):
        """
        Create movie feature matrix in a numpy array of shape (#allmovies, #genres) 
        """
        # Genre indicator columns start at column index 3 (assumes the first three are mID, title, year)

        return np.asarray(self.data.movies[self.data.movies.columns[3:]])

    def calc_item_item_similarity(self):
        #Create item-item similarity using Jaccard similarity

        # Method 1: Using dense matrix
        start = time.perf_counter_ns()
        jaccard_dist = pdist(self.Mm, metric='jaccard')
        dense_sim = 1 - squareform(jaccard_dist)
        end = time.perf_counter_ns()
        print(f"Method 1 with dense matrix took {(end - start)/1000000:,.0f} milli seconds")

        # Method 2: Sparse implementation
        start = time.perf_counter_ns()
        # Convert to sparse
        Mm_sparse = csr_matrix(self.Mm, dtype=np.float32)
        # Calculate intersection matrix
        intersection = Mm_sparse @ Mm_sparse.T
        # Get movie sizes (number of genres)
        movie_sizes = np.array(Mm_sparse.sum(axis=1)).flatten()
        # Broadcast to create union matrix
        union = movie_sizes[:, None] + movie_sizes[None, :] - intersection.toarray()
        # Avoid division by zero
        union = np.maximum(union, 1)
        # Calculate Jaccard similarity
        sparse_sim = intersection.toarray() / union
        # Ensure diagonal is 1
        np.fill_diagonal(sparse_sim, 1.0)
        end = time.perf_counter_ns()
        print(f"Method 2 with sparse matrix took {(end - start)/1000000:,.0f} milli seconds")

        # Verification: Check if both methods produce identical results
        if np.allclose(dense_sim, sparse_sim, rtol=1e-7, atol=1e-7):
            print("✓ Dense and sparse methods produce the same result with rtol=1e-7 and atol=1e-7!")
        else:
            print("✗ Warning: Dense and sparse methods produce different results!")
            print(f"Max difference: {np.max(np.abs(dense_sim - sparse_sim))}")
            print(f"Mean difference: {np.mean(np.abs(dense_sim - sparse_sim))}")

        # Use the sparse result (more efficient)
        self.sim = sparse_sim


class Collaborative(RecSys):
    def __init__(self, data):
        super().__init__(data)

    def calc_item_item_similarity(self, simfunction, *X):
        """
        Create item-item similarity using similarity function.
        X is an optional transformed matrix of Mr
        """
        # General function that calculates item-item similarity based on the sim function and data passed in
        if len(X) == 0:
            self.sim = simfunction()
        else:
            self.sim = simfunction(X[0])  # *X arrives as a tuple (X,), so X[0] is the actual transformed matrix

    def cossim(self):
        """
        Calculates item-item similarity for all pairs of items using cosine similarity (values from 0 to 1) on utility matrix
        Returns a cosine similarity matrix of size (#all movies, #all movies)
        """
        # Return a sim matrix by calculating item-item similarity for all pairs of items using cosine similarity
        # Cosine Similarity: C(A, B) = (A.B) / (||A||.||B||)

        #####
        # Method 1 - using dense matrix (including centering and normalisation)
        start_time = time.perf_counter()
        Xd = self.Mr.copy().astype(np.float64)     #(users, movies)
        # calculate user averages
        user_means = np.zeros(Xd.shape[0])
        for i in range(Xd.shape[0]):
            rated_movies = Xd[i,:] > 0
            if rated_movies.sum() > 0:
                user_means[i] = Xd[i,rated_movies].mean()
            else:
                user_means[i] = 3.0
        # replace 0s with user averages
        for i in range(Xd.shape[0]):
            zero_mask = Xd[i,:] == 0
            Xd[i,zero_mask] = user_means[i]
        # Centre the data
        Xd = Xd - user_means[:,np.newaxis]  # converts user means to column vector of shape (users, 1) before centering
        # Transpose to get movies as rows and users as columns
        Xd = Xd.T     #(movies, users)
        # Normalise the data
        norms = np.linalg.norm(Xd, axis=1, keepdims=True)
        norms = np.maximum(norms, 1e-10)  # Avoid divide by zero
        Xd_norm = Xd / norms
        # Calculate similarity matrix
        sim_dense = Xd_norm @ Xd_norm.T     #Cosine similarity = (A.B) / (||A||.||B||)
        # Rescale to be 0~1
        sim_dense = (sim_dense + 1) / 2
        # Handle NaNs and ensure diagonal is 1
        sim_dense = np.nan_to_num(sim_dense, nan=0.5)
        np.fill_diagonal(sim_dense, 1.0)
        dense_time = time.perf_counter() - start_time
        print(f"Dense matrix method took {dense_time:.4f} seconds")

        ###
        # Method 2 - using sparse matrix (including centering and normalisation)
        # I tried a sparse matrix implementation, but it was slower than the dense matrix implementation.
        # As this is very new to me and I didn't fully understand the implementation, I decided to delete it.
        # For reference, see `cossim_ultra_sparse` in the appendix. Note: that function is not my original work; I made heavy use of AI there.

        return sim_dense


    def jacsim(self, Xr):
        """
        Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1)
        Xr is the transformed rating matrix. Shape: (users, movies)
        """
        n = Xr.shape[1]  # Number of movies
        maxr = int(Xr.max())  # Maximum rating value

        if maxr > 1: # Multi-category case
            intersection = np.zeros((n, n)).astype(int)
            union = np.zeros((n, n)).astype(int)

            for i in range(1, maxr + 1): # intersections and unions for each rating level
                rating_level = (Xr == i).astype(int)
                csr = csr_matrix(rating_level)
                # Intersection for this rating level
                level_intersection = np.array(csr.T.dot(csr).toarray()).astype(int)
                intersection += level_intersection
                # Union for this rating level (users who gave rating i to either movie)
                rowsum = rating_level.sum(axis=0)  # Movies with rating i
                level_union = rowsum[:, None] + rowsum[None, :] - level_intersection
                union += level_union

        else: # Binary case
            csr0 = csr_matrix((Xr > 0).astype(int))
            intersection = np.array(csr0.T.dot(csr0).toarray()).astype(int)
            A = (Xr > 0).astype(bool)
            rowsum = A.sum(axis=0)
            union = rowsum[:, None] + rowsum[None, :] - intersection

        # Calculate Jaccard similarity
        union = np.maximum(union, 1)
        similarity = intersection.astype(float) / union.astype(float)

        # Boundary checks
        similarity = np.nan_to_num(similarity, nan=0.0, posinf=0.0, neginf=0.0)
        np.fill_diagonal(similarity, 1.0)

        return similarity


# Q1. Baseline models [15 pts]

### 1a. Complete the function `predict_everything_to_3` in the class `RecSys`  [5 pts]

In [5]:
# Creating Sample test data
np.random.seed(42)
sample_train = train[:30000]
sample_test = test[:30000]

sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]

sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)

In [6]:
# Sample tests predict_everything_to_3 in class RecSys

sample_rs = RecSys(sample_data)
sample_yp = sample_rs.predict_everything_to_3()
print(sample_rs.rmse(sample_yp))
assert sample_rs.rmse(sample_yp)==approx(1.2642784503423288, abs=1e-3), "Did you predict everything to 3 for the test data?"

1.2642784503423288


In [7]:
# Hidden tests predict_everything_to_3 in class RecSys
rs = RecSys(data)
yp = rs.predict_everything_to_3()
print(rs.rmse(yp))

1.2585510334053043


In [8]:
# build dictionary for comparison of RMSE values of all approaches
dict_compare = {}

rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Baseline 3'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}
print(dict_compare)

{'Baseline 3': {'sample rmse': np.float64(1.2642784503423288), 'data rmse': np.float64(1.2585510334053043)}}


### 1b. Complete the function predict_to_user_average in the class RecSys [10 pts]
Hint: Include rated items only when averaging
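
As an illustration of the hint, the sketch below computes per-user averages over rated entries only, on a made-up matrix `Mr` (hypothetical values, with 0 meaning unrated). It is a vectorized alternative to a per-user loop, not the required implementation.

```python
import numpy as np

# Hypothetical utility matrix: rows are users, columns are movies, 0 = unrated.
Mr = np.array([
    [5, 0, 3, 0],
    [0, 0, 0, 0],   # user with no ratings at all
    [1, 2, 0, 3],
])

rated = Mr > 0                 # boolean mask of rated entries
counts = rated.sum(axis=1)     # number of movies each user rated
sums = Mr.sum(axis=1)          # zeros do not affect the sum

# Divide only where a user has at least one rating; fall back to 3 otherwise.
user_avg = np.divide(sums, counts, out=np.full(len(Mr), 3.0), where=counts > 0)
print(user_avg)
```

Dividing by the number of rated movies rather than the total number of movies matters: averaging over all columns would pull every user's mean toward 0.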

In [9]:
# Sample tests predict_to_user_average in the class RecSys
sample_yp = sample_rs.predict_to_user_average()
print(sample_rs.rmse(sample_yp))
assert sample_rs.rmse(sample_yp)==approx(1.1429596846619763, abs=1e-3), "Check predict_to_user_average in the RecSys class. Did you predict to average rating for the user?"

1.1429596846619763


In [10]:
# Hidden tests predict_to_user_average in the class RecSys
yp = rs.predict_to_user_average()
print(rs.rmse(yp))

1.0352910334228647


In [11]:
rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Baseline User Average'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

# Q2. Content-Based model [25 pts]

### 2a. Complete the function calc_movie_feature_matrix in the class ContentBased [5 pts]

In [12]:
cb = ContentBased(data)

In [13]:
# tests calc_movie_feature_matrix in the class ContentBased 
assert(cb.Mm.shape==(3883, 18))

### 2b. Complete the function calc_item_item_similarity in the class ContentBased [10 pts]
This function updates `self.sim` and does not return a value.    
Some factors to think about:     
1. The movie feature matrix has binary elements. Which similarity metric should be used?
2. What is the computational complexity (time complexity) of the similarity calculation?      
Hint: You may use functions in the `scipy.spatial.distance` module on the dense matrix, but it is quite slow (think about the time complexity). If you want to speed up, you may try using functions in the `scipy.sparse` module.
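
The sketch below compares the two routes on a tiny made-up genre matrix (`Mm` here is hypothetical, 3 movies by 4 genres). It relies on the fact that, for binary rows, the intersection size is a dot product and the union size follows from inclusion-exclusion.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform

# Hypothetical binary genre matrix: 3 movies x 4 genres.
Mm = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
])

# Reference: pairwise Jaccard distance on the dense matrix, converted to similarity.
sim_ref = 1 - squareform(pdist(Mm.astype(bool), metric="jaccard"))

# Matrix form: |A ∩ B| is a dot product of binary rows,
# and |A ∪ B| = |A| + |B| - |A ∩ B|.
S = csr_matrix(Mm)
inter = (S @ S.T).toarray()
sizes = Mm.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - inter
sim_fast = inter / np.maximum(union, 1)   # guard against division by zero

print(sim_fast)
print(np.allclose(sim_ref, sim_fast))
```

Both routes compare every pair of movies, so the number of similarity values grows quadratically with the number of movies; the matrix form simply computes all pairs in one product.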

In [14]:
cb.calc_item_item_similarity()

Method 1 with dense matrix took 187 milli seconds
Method 2 with sparse matrix took 50 milli seconds
✓ Dense and sparse methods produce the same result with rtol=1e-7 and atol=1e-7!


In [15]:
# Sample tests calc_item_item_similarity in ContentBased class 

sample_cb = ContentBased(sample_data)
sample_cb.calc_item_item_similarity() 

# print(np.trace(sample_cb.sim))
# print(sample_cb.sim[10:13,10:13])
assert(sample_cb.sim.sum() > 0), "Check calc_item_item_similarity."
assert(np.trace(sample_cb.sim) == 3152), "Check calc_item_item_similarity. What do you think np.trace(cb.sim) should be?"


ans = np.array([[1, 0.25, 0.],[0.25, 1, 0.],[0., 0., 1]])
for pred, true in zip(sample_cb.sim[10:13, 10:13], ans):
    assert approx(pred, 0.01) == true, "Check calc_item_item_similarity. Look at cb.sim"

Method 1 with dense matrix took 110 milli seconds
Method 2 with sparse matrix took 30 milli seconds
✓ Dense and sparse methods produce the same result with rtol=1e-7 and atol=1e-7!


In [16]:
# tests calc_item_item_similarity in ContentBased class

In [17]:
# additional tests for calc_item_item_similarity in ContentBased class

In [18]:
# additional tests for calc_item_item_similarity in ContentBased class

In [19]:
# additional tests for calc_item_item_similarity in ContentBased class

In [20]:
# additional tests for calc_item_item_similarity in ContentBased class

### 2c. Complete the function predict_from_sim in the class RecSys [5 pts]
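
As written in the `RecSys` class above, `predict_from_sim` returns a similarity-weighted average of the user's known ratings, falling back to the user's average (or 3) when no rated movie has any similarity to the target. A minimal sketch of that arithmetic for one user and one target movie, with made-up numbers:

```python
import numpy as np

# Hypothetical data for one user and one target movie.
user_ratings = np.array([5, 0, 3, 0])          # 0 = unrated
movie_sim = np.array([0.5, 1.0, 0.25, 0.8])    # similarity of the target movie to each movie

rated = user_ratings > 0
numerator = user_ratings @ movie_sim           # unrated movies contribute 0
denominator = movie_sim[rated].sum()           # similarity summed over rated movies only

if denominator == 0:
    # No similar rated movie: use the user's mean, or 3 if they rated nothing.
    pred = user_ratings[rated].mean() if rated.any() else 3.0
else:
    pred = numerator / denominator

print(pred)
```

Here the numerator is 5 * 0.5 + 3 * 0.25 = 3.25 and the denominator is 0.5 + 0.25 = 0.75, so the prediction is about 4.33. The large similarities to the two unrated movies play no role.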

In [21]:
# for a, b in zip(sample_MV_users.uID, sample_MV_movies.mID):
#     print(a, b, sample_cb.predict_from_sim(a,b))

# Sample tests for predict_from_sim in RecSys class 
assert(sample_cb.predict_from_sim(245,276)==approx(2.5128205128205128,abs=1e-2)), "Check predict_from_sim. Look at how you predicted a user rating on a movie given UserID and movieID."
assert(sample_cb.predict_from_sim(2026,2436)==approx(2.785714285714286,abs=1e-2)), "Check predict_from_sim. Look at how you predicted a user rating on a movie given UserID and movieID."

In [22]:
# tests for predict_from_sim in RecSys class

### 2d. Complete the function predict in the class RecSys [5 pts]
After completing the predict method in the RecSys class, run the cell below to calculate rating prediction and RMSE. How much does the performance increase compared to the baseline results from above?

In [23]:
# Sample tests method predict in the RecSys class 

sample_yp = sample_cb.predict()
sample_rmse = sample_cb.rmse(sample_yp)
print(sample_rmse)

assert(sample_rmse==approx(1.1962537249116723, abs=1e-2)), "Check method predict in the RecSys class."

1.1966092788717029


In [24]:
# Hidden tests method predict in the RecSys class 

yp = cb.predict()
rmse = cb.rmse(yp)
print(rmse)

1.0125028201623478


In [25]:
rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Content Based Item-Item'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

In [26]:
# tests method predict in the RecSys class

# Q3. Collaborative Filtering

### 3a. Complete the function cossim in the class Collaborative [10 pts]
**To Do:**    
1. Impute the unrated entries in self.Mr with the user's average rating, then subtract the user mean. Call this matrix X.   
2. Calculate cosine similarity for all item-item pairs. Don't forget to rescale the cosine similarity to the range 0~1.    
You might encounter a divide-by-zero warning (numpy will fill in a nan value for that entry). In that case, you can fill those entries with appropriate values.    

Hint: Suppose a movie has not been rated by anyone. Its vector is then $\vec{0}$=[0,0,0,....,0]. When you normalize this vector, you'll get a divide-by-zero warning, and it will produce nan values in the self.sim matrix. Theoretically, what should the similarity value be for $\vec{x}_i \cdot \vec{x}_i$ when $\vec{x}_i = \vec{0}$? What about $\vec{x}_i \cdot \vec{x}_j$ when $\vec{x}_i = \vec{0}$ and $\vec{x}_j$ is any vector?     

Hint: You may use `scipy.spatial.distance.cosine`, but it will be slow because it operates on one pair of vectors at a time, whereas you can implement a matrix-matrix operation in numpy to calculate all cosines at once (it can be 100 times faster than the vector-vector approach on our data). Also pay attention to the definition: `scipy.spatial.distance` provides distance, not similarity. 

3. Run the cell below, which calculates yp and RMSE.
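
The steps above can be sketched on a small made-up matrix (`Mr` below is hypothetical, and its last movie has no ratings). This is one possible way to handle the zero vector, clamping the norm so no NaN appears; it is an illustration, not the reference solution.

```python
import numpy as np

# Hypothetical utility matrix: 3 users x 3 movies, 0 = unrated.
# The last movie has no ratings.
Mr = np.array([
    [5.0, 1.0, 0.0],
    [4.0, 2.0, 0.0],
    [0.0, 5.0, 0.0],
])

rated = Mr > 0
counts = rated.sum(axis=1)
means = np.divide(Mr.sum(axis=1), counts, out=np.full(len(Mr), 3.0), where=counts > 0)

# Imputing unrated entries with the user mean and then centering
# leaves exactly 0 in every unrated position.
X = np.where(rated, Mr - means[:, None], 0.0)

items = X.T                                          # (movies, users)
norms = np.linalg.norm(items, axis=1, keepdims=True)
unit = items / np.maximum(norms, 1e-10)              # zero rows stay zero, no NaN
cos = unit @ unit.T                                  # all pairs at once, values in [-1, 1]

sim = (cos + 1) / 2                                  # rescale to [0, 1]
np.fill_diagonal(sim, 1.0)
print(np.round(sim, 3))
```

In this example the first two movies are rated in exactly opposite directions, so their rescaled similarity is 0, while the unrated movie has cosine 0 with everything and lands at 0.5 after rescaling.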

In [27]:
# Sample tests cossim method in the Collaborative class

sample_cf = Collaborative(sample_data)
sample_cf.calc_item_item_similarity(sample_cf.cossim)
sample_yp = sample_cf.predict()
sample_rmse = sample_cf.rmse(sample_yp)

assert(np.trace(sample_cf.sim)==3152), "Check cossim method in the Collaborative class. What should np.trace(cf.sim) equal?"
assert(sample_rmse==approx(1.1429596846619763, abs=5e-3)), "Check cossim method in the Collaborative class. rmse result is not as expected."
assert(sample_cf.sim[0,:3]==approx([1., 0.5, 0.5],abs=1e-2)), "Check cossim method in the Collaborative class. cf.sim isn't giving the expected results."

Dense matrix method took 0.3585 seconds


In [28]:
# Hidden tests cossim method in the Collaborative class

cf = Collaborative(data)
cf.calc_item_item_similarity(cf.cossim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)

Dense matrix method took 0.5234 seconds
1.0263081874204125


In [29]:
rmse_val_sample = sample_rs.rmse(sample_yp)
rmse_val_data = rs.rmse(yp)
dict_compare['Collaborative Cosine Similarity'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

In [30]:
# tests cossim method in the Collaborative class

In [31]:
# additional tests for cossim method in the Collaborative class

In [32]:
# additional tests for cossim method in the Collaborative class

In [33]:
# additional tests for cossim method in the Collaborative class

In [34]:
# additional tests for cossim method in the Collaborative class

In [35]:
# additional tests for cossim method in the Collaborative class

### 3b. Complete the function jacsim in the class Collaborative [15 pts]
**3b [15 pts] = 3b-i) [5 pts]+3b-ii) [5 pts]+ 3b-iii) [5 pts]**

Function `jacsim` calculates Jaccard similarity between items using the collaborative filtering method. In the rating matrix `self.Mr`, the entries are 0 to 5 (0: unrated, 1-5: rating). We want to see which thresholding method works better when we use Jaccard similarity in collaborative filtering.    
We may treat any rating of 3 or above as 1, and negative ratings (below 3) and no rating as 0. Or, we may treat movies with any rating as 1 and those with no rating as 0. In this question, we will complete a function jacsim that takes a transformed rating matrix X and calculates and returns a Jaccard similarity matrix.     
Let's consider these input cases for the utility matrix $M_r$ with ratings 1-5 and 0s for no-rating.    
1. $M_r \geq 3$ 
2. $M_r \geq 1$ 
3. $M_r$, no transform.

Things to think about: 
- Cases 1 and 2 are straightforward to calculate Jaccard for, but what does Jaccard mean for multicategory data?
- Time complexity: The matrix $M_r$ is much bigger than the item feature matrix $M_m$, so the calculation will take a very long time on a dense matrix. Hint: Use a sparse matrix.
- Which method will give the best performance?
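
One possible reading of Jaccard for multicategory data, and the one the `jacsim` method above follows, is to treat each rating level separately: two movies "intersect" on a user only when that user gave both the same rating, and the per-level intersections and unions are summed before dividing. A minimal sketch on made-up ratings (`Xr` below is hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings: 4 users x 2 movies, 0 = unrated.
Xr = np.array([
    [5, 5],
    [3, 4],
    [0, 2],
    [1, 1],
])

n = Xr.shape[1]
inter = np.zeros((n, n), dtype=np.int64)
union = np.zeros((n, n), dtype=np.int64)

for level in range(1, int(Xr.max()) + 1):
    B = csr_matrix((Xr == level).astype(np.int64))   # users who gave exactly this rating
    level_inter = (B.T @ B).toarray()                # same user gave `level` to both movies
    counts = np.asarray(B.sum(axis=0)).ravel()       # per-movie count of this rating
    inter += level_inter
    union += counts[:, None] + counts[None, :] - level_inter

sim = inter / np.maximum(union, 1)                   # guard against division by zero
print(sim)
```

The two movies agree on two users (ratings 5 and 1) out of five distinct (user, rating) memberships, giving a similarity of 0.4. Under this reading, a 3 versus a 4 counts as a complete mismatch, which is a design choice worth questioning in the discussion.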

### 3b-i)  When $M_r\geq3$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.99.

In [36]:
cf = Collaborative(data)
Xr = cf.Mr>=3
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.99)

similarity calculation time 0.6451340829953551
0.9819500134453443


In [37]:
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['Jacsim - Rating > 3'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

In [38]:
# tests RMSE for jacsim implementation

In [39]:
# additional tests for RMSE for jacsim implementation

In [40]:
# additional tests for jacsim implementation

In [41]:
# additional tests for jacsim implementation

### 3b-ii)  When $M_r\geq1$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 1.0.

In [42]:
cf = Collaborative(data)
Xr = cf.Mr>=1
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<1.0)

similarity calculation time 0.7642687910120003
0.9913535741888877


In [43]:
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['Jacsim - Rating > 1'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

In [44]:
# tests RMSE for jacsim implementation

In [45]:
# tests RMSE for jacsim implementation

In [46]:
# tests jacsim implementation

In [47]:
# tests performance of jacsim implementation

### 3b-iii)  When $M_r$; no transform [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.96

In [48]:
cf = Collaborative(data)
Xr = cf.Mr.astype(int)
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.96)

similarity calculation time 1.9543431669590063
0.9516555951928918


In [49]:
rmse_val_sample = None
rmse_val_data = rs.rmse(yp)
dict_compare['Jacsim - Multiclass'] = {'sample rmse':rmse_val_sample, 'data rmse':rmse_val_data}

In [50]:
# tests jacsim implementation RMSE

In [51]:
# tests jacsim implementation RMSE

In [52]:
# tests jacsim implementation

In [53]:
# tests jacsim implementation performance

### 3.C Discussion [Peer Review]
Answer the questions below in this week's Peer Review assignment. <br>
1. Summarize the methods and performances: Below is a template/example.

|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| |
|Baseline, $Y_p=\mu_u$| |
|Content based, item-item| |
|Collaborative, cosine| |
|Collaborative, jaccard, $M_r\geq 3$|  |
|Collaborative, jaccard, $M_r\geq 1$|  |
|Collaborative, jaccard, $M_r$|  |

2. Discuss which method(s) work better than others and why.

In [54]:
df_compare = pd.DataFrame(dict_compare).T.reset_index()
df_compare.columns.name = None  # Remove the column name 'index'
df_compare = df_compare.rename(columns={'index': 'Method'})
display(df_compare)

| |Method|sample rmse|data rmse|
|:----|:----|:--------:|:--------:|
|0|Baseline 3|1.264278|1.258551|
|1|Baseline User Average|1.14296|1.035291|
|2|Content Based Item-Item|1.196609|1.012503|
|3|Collaborative Cosine Similarity|1.142693|1.026308|
|4|Jacsim - Rating > 3| |0.98195|
|5|Jacsim - Rating > 1| |0.991354|
|6|Jacsim - Multiclass| |0.951656|


## Appendix

Contents
- RMSE in a recommender context - AI generated explanation
- AI-generated ultra-sparse implementation of cossim (I didn't fully understand it, so I didn't use it)
- Old versions of jacsim function for reference only.



## What is RMSE?

**RMSE (Root Mean Squared Error)** is a metric that measures how well your recommender system predicts movie ratings compared to the actual ratings users gave.

Looking at your code:

```python
def rmse(self, yp):
    yp[np.isnan(yp)] = 3  # In case there is nan values in prediction, it will impute to 3.
    yt = np.array(self.data.test.rating)  # Actual ratings from test data
    return np.sqrt(((yt - yp) ** 2).mean())  # RMSE calculation
```


## Breaking Down the Formula:

1. **`yt - yp`**: Difference between actual ratings and predicted ratings (prediction errors)
2. **`(yt - yp) ** 2`**: Square each error (makes all errors positive and penalizes large errors more)
3. **`((yt - yp) ** 2).mean()`**: Mean of all squared errors (MSE - Mean Squared Error)
4. **`np.sqrt(...)`**: Take square root to get back to original rating scale (RMSE)
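
Written as a single formula, with $y_i$ standing for the actual rating (`yt`), $\hat{y}_i$ for the predicted rating (`yp`), and $N$ for the number of rows in the test data (these symbols are introduced here for illustration), the four steps combine to:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$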

## What RMSE Means in Practice:

- **Scale**: RMSE is in the same units as your ratings (1-5 stars)
- **Lower = Better**: Smaller RMSE means better predictions
- **Interpretation**: RMSE ≈ 1.0 means your predictions are typically off by about 1 star

## Example:

```python
import numpy as np

# Actual ratings from test data
actual_ratings = np.array([4, 3, 5, 2, 1])

# Your model's predictions
predicted_ratings = np.array([3.8, 3.2, 4.5, 2.3, 1.5])

# Errors for each prediction: [0.2, -0.2, 0.5, -0.3, -0.5]
errors = actual_ratings - predicted_ratings

# Squared errors: [0.04, 0.04, 0.25, 0.09, 0.25]
squared_errors = errors ** 2

# Mean squared error: 0.67 / 5 = 0.134
mse = squared_errors.mean()

# RMSE: sqrt(0.134) ≈ 0.366
rmse = np.sqrt(mse)
```


This means your predictions are typically off by about 0.37 stars.

## In Your Recommender System Context:

- **RMSE ≈ 0.5**: Excellent predictions (half-star accuracy)
- **RMSE ≈ 1.0**: Good predictions (one-star accuracy)
- **RMSE ≈ 1.5**: Moderate predictions
- **RMSE ≈ 2.0+**: Poor predictions

## Why RMSE is Used:

1. **Interpretable**: Easy to understand in rating scale terms
2. **Penalizes large errors**: Squaring makes big mistakes much worse than small ones
3. **Standard metric**: Commonly used for comparing recommendation algorithms
4. **Smooth**: Differentiable for optimization algorithms

## Your Baseline Results Context:

When you run:
```python
rs = RecSys(data)
yp = rs.predict_everything_to_3()
print(rs.rmse(yp))  # Shows ~1.26
```


This tells you that predicting everything as 3 stars gives you an RMSE of about 1.26, meaning you're typically off by 1.26 stars. Any sophisticated algorithm should beat this baseline!

The goal of your content-based and collaborative filtering methods is to achieve a much lower RMSE by making smarter, personalized predictions.

In [55]:
def cossim_ultra_sparse(self):
    """
    Ultra-optimized sparse cosine similarity implementation
    """
    from scipy.sparse import diags

    # Convert to sparse and work entirely in sparse domain
    Xs = csr_matrix(self.Mr.astype(np.float64))  # (users, movies)

    # Calculate user means efficiently
    user_sums = np.array(Xs.sum(axis=1)).flatten()
    user_counts = np.array((Xs > 0).sum(axis=1)).flatten()
    user_means = np.divide(user_sums, user_counts,
                          out=np.full_like(user_sums, 3.0),
                          where=user_counts!=0)

    # Create mean-filled matrix more efficiently
    # This is still the challenging part with pure sparse operations
    X_dense = Xs.toarray()  # Convert to dense for centering

    # Fill zeros with user means
    for i in range(X_dense.shape[0]):
        zero_mask = X_dense[i, :] == 0
        X_dense[i, zero_mask] = user_means[i]

    # Center and transpose
    X_centered = (X_dense - user_means[:, np.newaxis]).T  # (movies, users)

    # Convert back to sparse
    X_sparse = csr_matrix(X_centered)

    # Efficient normalization using sparse operations
    norms = np.sqrt(np.array(X_sparse.multiply(X_sparse).sum(axis=1)).flatten())
    norms = np.maximum(norms, 1e-10)

    # Use sparse diagonal matrix for normalization
    norm_diag = diags(1.0 / norms, format='csr')
    X_normalized = norm_diag @ X_sparse

    # Calculate similarity
    similarity = X_normalized @ X_normalized.T
    similarity_dense = similarity.toarray()

    # Final processing
    similarity_dense = (similarity_dense + 1) / 2
    similarity_dense = np.nan_to_num(similarity_dense, nan=0.5)
    np.fill_diagonal(similarity_dense, 1.0)

    return similarity_dense


In [56]:
def jacsim_old_v1(self, Xr):
    """
    Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1).
    Binary data uses sparse matrix products; multi-category data uses the weighted
    form sum(min) / sum(max), computed on a dense copy one movie at a time.
    Xr is the transformed rating matrix.
    """
    # Convert to sparse matrix and transpose to get (movies, users)
    Mr_sparse = csr_matrix(Xr.T, dtype=np.float32)

    # Detect if data is binary or multicategory
    max_value = Mr_sparse.max()
    is_binary = max_value <= 1.0

    if is_binary:
        print("Binary data detected.")
        # Binary Jaccard: J(A,B) = |A∩B| / |A∪B|
        # For binary data: intersection = A@B, union = |A| + |B| - |A∩B|
        # Calculate intersection matrix (dot product for binary data), densified once
        intersection = (Mr_sparse @ Mr_sparse.T).toarray()
        # Get number of 1s for each movie (row sums)
        movie_counts = np.array(Mr_sparse.sum(axis=1)).flatten()
        # Calculate union using inclusion-exclusion principle
        union = movie_counts[:, None] + movie_counts[None, :] - intersection
        # Avoid division by zero
        union = np.maximum(union, 1)
        # Calculate Jaccard similarity
        similarity_matrix = intersection / union

    else:
        print("Multi-category data detected.")
        # Multi-category Jaccard: J(A,B) = Σmin(Ai,Bi) / Σmax(Ai,Bi)
        # Need to use broadcasting for min/max operations
        # For large sparse matrices, we can optimize by only computing for non-zero pairs
        # But for moderate sizes, dense broadcasting is efficient
        Mr_dense = Mr_sparse.toarray()
        n_movies = Mr_dense.shape[0]
        similarity_matrix = np.zeros((n_movies, n_movies), dtype=np.float32)

        # Compute similarities one movie at a time so that only two
        # (movies, users) temporaries exist at once, not a (movies, movies, users) array
        for i in range(n_movies):
            if i % 500 == 0:  # Progress indicator
                print(f"Processing movie {i}/{n_movies}")

            # Get current movie's ratings
            movie_i = Mr_dense[i:i+1]  # Shape: (1, users)

            # Compute min and max with all other movies
            min_vals = np.minimum(movie_i, Mr_dense)  # Shape: (movies, users)
            max_vals = np.maximum(movie_i, Mr_dense)  # Shape: (movies, users)

            # Sum across users
            intersection = np.sum(min_vals, axis=1)  # Shape: (movies,)
            union = np.sum(max_vals, axis=1)        # Shape: (movies,)

            # Calculate similarities for this row
            similarities = np.divide(intersection, union,
                                   out=np.zeros_like(intersection),
                                   where=union!=0)

            similarity_matrix[i] = similarities


    # Ensure diagonal is 1 for both cases
    np.fill_diagonal(similarity_matrix, 1.0)

    return similarity_matrix

def jacsim_v1(self, Xr):
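    """
    Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1).
    Binary Xr: users who rated both movies / users who rated either movie.
    Multicategory Xr: users who gave both movies the same rating / users who rated either movie.
    Xr is the transformed rating matrix with shape (users, movies); 0 means not rated.
    """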
    n = Xr.shape[1]  # Number of movies
    maxr = int(Xr.max())  # Maximum rating value

    # Binary "rated" indicator matrix and pairwise co-rating counts
    csr0 = csr_matrix((Xr > 0).astype(int))
    nz_inter = np.array(csr0.T.dot(csr0).toarray()).astype(int)

    if maxr > 1:
        # Multicategory case: sum intersections for each rating level
        intersection = np.zeros((n, n)).astype(int)
        for i in range(1, maxr + 1):
            # Binary matrix for rating level i
            csr = csr_matrix((Xr == i).astype(int))
            # Add intersection for this rating level
            intersection = intersection + np.array(csr.T.dot(csr).toarray()).astype(int)
    else:
        # Binary case: the intersection is the co-rating count itself
        intersection = nz_inter

    # Calculate union using inclusion-exclusion principle: |A| + |B| - |A ∩ B|
    rowsum = np.asarray((Xr > 0).sum(axis=0)).ravel()  # Number of users who rated each movie
    union = rowsum[:, None] + rowsum[None, :] - nz_inter

    # Calculate Jaccard similarity: intersection / union
    # Avoid division by zero
    union = np.maximum(union, 1)
    similarity = intersection.astype(float) / union.astype(float)

    # 1. Set NaNs and infs to 0
    similarity = np.nan_to_num(similarity, nan=0.0, posinf=0.0, neginf=0.0)

    # 2. Set diagonal to 1 (perfect self-similarity)
    np.fill_diagonal(similarity, 1.0)

    return similarity
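
The two functions above use different multi-category definitions: `jacsim_old_v1` uses the weighted form (sum of minima over sum of maxima), while `jacsim_v1` counts users who gave both movies the same rating. A minimal sketch on a hypothetical 4-user by 3-movie matrix (`Xr_toy`, not the assignment data) shows how the binary, matching-rating, and weighted variants differ for the same pair of movies.

```python
import numpy as np

# Hypothetical 4-user x 3-movie rating matrix; 0 means "not rated"
Xr_toy = np.array([
    [5, 5, 0],
    [3, 4, 3],
    [0, 2, 2],
    [1, 0, 1],
])

B = (Xr_toy > 0).astype(int)   # who rated what
co_rated = B.T @ B             # |A ∩ B| for every movie pair
counts = B.sum(axis=0)         # |A| per movie
union = counts[:, None] + counts[None, :] - co_rated

# Binary Jaccard: users who rated both / users who rated either
jac_binary = co_rated / np.maximum(union, 1)

# Matching-rating variant (as in jacsim_v1): users who gave both movies
# the same rating / users who rated either movie
same_rating = sum(
    (Xr_toy == r).astype(int).T @ (Xr_toy == r).astype(int)
    for r in range(1, int(Xr_toy.max()) + 1)
)
jac_multi = same_rating / np.maximum(union, 1)

# Weighted variant (as in jacsim_old_v1) for movies 0 and 1 only
a, b = Xr_toy[:, 0], Xr_toy[:, 1]
jac_weighted_01 = np.minimum(a, b).sum() / np.maximum(a, b).sum()

# Cross-check the binary value for movies 0 and 1 with plain Python sets
raters_0 = set(np.flatnonzero(B[:, 0]))
raters_1 = set(np.flatnonzero(B[:, 1]))
ref_01 = len(raters_0 & raters_1) / len(raters_0 | raters_1)

print(jac_binary[0, 1], jac_multi[0, 1], jac_weighted_01, ref_01)
```

The matching-rating variant can never exceed the binary one, since users who agree on a rating are a subset of users who rated both movies. Which variant works best for rating prediction is an empirical question to check against the RMSE baselines.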
