# Building Recommender Systems for Movie Rating Prediction

In this assignment, we will build recommender systems that predict movie ratings. [MovieLens](https://grouplens.org/datasets/movielens/) currently has 25 million user-movie ratings. Since the full dataset is too big, we use a 1 million rating subset, [MovieLens 1M](https://www.kaggle.com/odedgolden/movielens-1m-dataset), which we reformatted to make it more convenient to use.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx

In [2]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

### Starter code
Now we will build a recommender system that uses several techniques to predict ratings. 
The `class RecSys` has baseline prediction methods (such as predicting every rating as 3, or as each user's average rating) and other utility functions. `class ContentBased` and `class Collaborative` inherit from `class RecSys` and add methods that calculate the item-item similarity matrix. You will complete those functions using what we learned about content-based filtering and collaborative filtering.

`RecSys`'s `rating_matrix` method converts the (user id, movie id, rating) triplets from the train data (whose ratings are known) into a utility matrix for 6040 users and 3883 movies.    
Here, we create the utility matrix in dense format (numpy.array) for convenience. But with real-world data, where hundreds of millions of users and items may exist, we won't be able to create the utility matrix in dense format (for those who are curious why, try checking the size of the dense matrix `self.Mr` with its `.nbytes` attribute). In that case, we would use sparse matrix operations as much as possible, and distributed file systems and distributed computing would be needed. Fortunately, our data is small enough to fit in laptop/PC memory. Also, we will use numpy and scipy.sparse, which allow significantly faster calculations than working on pandas.DataFrame objects.    
In the `rating_matrix` method, pay attention to the index mapping, since user IDs and movie IDs are not the same as array indices.
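
The index mapping can be sketched on toy data as follows. This is a minimal illustration, not the starter code itself: the IDs and ratings are made up, and the variable names are only for this example. IDs are mapped to contiguous row/column positions, the triplets are assembled with `scipy.sparse.coo_matrix`, and the dense and sparse memory footprints are compared.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: IDs are not contiguous and do not start at 0.
user_ids = [10, 20, 40]
movie_ids = [101, 105, 230, 999]
triplets = [(10, 101, 5), (10, 230, 3), (20, 105, 4), (40, 999, 1)]  # (uID, mID, rating)

# Map each ID to its array position.
uid2idx = {u: i for i, u in enumerate(user_ids)}
mid2idx = {m: j for j, m in enumerate(movie_ids)}

rows = [uid2idx[u] for u, _, _ in triplets]
cols = [mid2idx[m] for _, m, _ in triplets]
vals = [r for _, _, r in triplets]

sparse = coo_matrix((vals, (rows, cols)),
                    shape=(len(user_ids), len(movie_ids))).tocsr()
dense = sparse.toarray()

print(dense)
print("dense bytes :", dense.nbytes)
# CSR stores only the nonzero values plus two index arrays.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print("sparse bytes:", sparse_bytes)
```

On a matrix this small the sparse format may not save anything. The gap only shows at scale: assuming 8-byte entries, a dense 6040 × 3883 matrix takes 6040 × 3883 × 8 bytes, which is about 188 MB.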

In [7]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)  # ratings from this instance's train data
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())


    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        # YOUR CODE HERE
        n_test = self.data.test.shape[0]  # number of rows in the test data
        return np.full(n_test, 3)

        
    def predict_to_user_average(self):
        """
        Predict to average rating for the user.
        Returns numpy array of shape (# of rows in testdata,)
        """
        yp = np.zeros(self.data.test.shape[0])
        matrix = self.Mr
        
        for i, uid in enumerate(self.data.test['uID']):
            index = self.uid2idx[uid]                                    # maps the user ID to its row index
            arr = matrix[index][matrix[index] != 0]                      # keeps the rated (nonzero) entries only
            avg = arr.mean()                                             # find avg
            yp[i] = avg                                                  # fills in yp with avg
        return yp
    
    
    def predict_from_sim(self,uid,mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        user_index = self.uid2idx[uid]
        movie_index = self.mid2idx[mid]
        user = self.Mr[user_index]
        c = self.sim[movie_index]
        weighted = c * user                  # similarity-weighted ratings (0 where unrated)
        weighted_ratings = weighted.sum()
        sc = c[weighted > 0].sum()           # total similarity over the movies that contribute
        return weighted_ratings / sc


    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        yp = np.zeros(self.data.test.shape[0])
        for i in range(len(self.data.test)):
            pred = self.predict_from_sim(self.data.test['uID'][i], self.data.test['mID'][i])
            yp[i] = pred
            
        return yp

    def rmse(self,yp):
        yp[np.isnan(yp)]=3 # If there are nan values in the prediction, impute them to 3.
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())

    
class ContentBased(RecSys):
    def __init__(self,data):
        super().__init__(data)
        self.data=data
        self.Mm = self.calc_movie_feature_matrix()
       
        
    def calc_movie_feature_matrix(self):
        """
        Create movie feature matrix in a numpy array of shape (#allmovies, #genres) 
        """
        # YOUR CODE HERE
        matrix = np.empty((len(self.data.movies),len(self.genres)))

        for i in range(len(self.data.movies)):
            movie_row = self.data.movies.iloc[i]
            genre_scores = movie_row[self.genres].to_numpy()
            matrix[i] = genre_scores
        return matrix

    def calc_item_item_similarity(self):
        """
        Create item-item similarity using Jaccard similarity
        """
#         start = time.perf_counter()
#         intersection = self.Mm @ self.Mm.T
#         union = np.zeros(intersection.shape)
#         for i, v in enumerate(self.Mm):
#             for j, v2 in enumerate(self.Mm):
#                 union[i][j] = sum((self.Mm[i] + self.Mm[j]) >= 1)
        
#         self.sim = intersection / union
#         end = time.perf_counter()
#         print(end-start)

        start = time.perf_counter()
        a = self.Mm
        b = self.Mm.T
        
        intersection = a @ b

        a_mag = a.sum(axis=1)
        a_mag= a_mag.reshape((len(a_mag),1))
        b_mag = b.sum(axis=0)

        union = (a_mag + b_mag) - intersection

        self.sim =  intersection / union
        end = time.perf_counter()
        print(end-start)
    
    
class Collaborative(RecSys):    
    def __init__(self,data):
        super().__init__(data)
        
    def calc_item_item_similarity(self, simfunction, *X):  
        """
        Create item-item similarity using similarity function. 
        X is an optional transformed matrix of Mr
        """    
        if len(X)==0:
            self.sim = simfunction()            
        else:
            self.sim = simfunction(X[0]) # *X arrives as a tuple (X,), so X[0] is the actual transformed matrix
            
    def cossim(self):    
        """
        Calculates item-item similarity for all pairs of items using cosine similarity (values from 0 to 1) on utility matrix
        Returns a cosine similarity matrix of size (#all movies, #all movies)
        """
        
        x = np.array(self.Mr.T, dtype=float)  # float copy, so the mean imputation below is not truncated to int

        n_x = np.zeros((x.shape))
        for i, row in enumerate(x):
            if x[i].sum() > 0:
                arr = np.delete(x[i], np.where(x[i] == 0))
                avg = arr.mean()
                x[i][x[i] == 0] = avg
                x[i] = x[i] - avg
            n_x[i] = x[i]

        a = n_x
        b = a.T
        
        numerator = a @ b
#         print("numerator: ", numerator)
        a_mag = np.sqrt((a**2).sum(axis=1))
        a_mag = a_mag.reshape((len(a_mag),1))
        b_mag = np.sqrt((b**2).sum(axis=0))

        denominator = a_mag * b_mag
#         print("Denominator: ", denominator)
        cos = np.divide(numerator, denominator, out=np.zeros_like(numerator), where=denominator!=0)
#         print("before normalizing: ", cos)
        sim_cos = 0.5 + 0.5 * cos
#         print("sim_cos:", sim_cos)
#         self.sim = sim_cos
        return sim_cos
   
    def jacsim(self,Xr):
        """
        Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1)
        Xr is the transformed rating matrix.
        """     
        # YOUR CODE HERE
#         raise NotImplementedError()
        pass
    
    # YOUR CODE HERE
#     raise NotImplementedError()

In [8]:
# YOUR CODE HERE
cf = Collaborative(data)
cf.calc_item_item_similarity(cf.cossim)

In [6]:
### BEGIN HIDDEN TEST
# assert(cf.sim.sum()!=0.)# filtering out no-answer case
# n = len(cf.allmovies)
# assert(cf.sim.shape==(n,n))
# ### END HIDDEN TEST

# Q1. Baseline models [15 pts]

### 1a. Complete the function `predict_everything_to_3` in the class `RecSys`  [5 pts]

In [None]:
rs = RecSys(data)
yp = rs.predict_everything_to_3()
print(rs.rmse(yp))
### BEGIN HIDDEN TEST
assert(rs.rmse(yp)==approx(1.2585510334053043, abs=1e-2))
### END HIDDEN TEST

### 1b. Complete the function `predict_to_user_average` in the class `RecSys` [10 pts]
Hint: Include only rated items when averaging.
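
One way to follow this hint without a Python loop is sketched below on a toy matrix. The matrix and the test-row indices are made up for illustration. Users with no ratings get NaN here, which the `rmse` method above then imputes to 3.

```python
import numpy as np

# Toy utility matrix (rows: users, columns: movies); 0 means "not rated".
Mr = np.array([[5, 0, 1, 3],
               [4, 0, 3, 1],
               [0, 0, 0, 0]], dtype=float)

rated = Mr > 0
n_rated = rated.sum(axis=1)

# Divide each row sum by the number of rated items; users with no ratings stay NaN.
user_mean = np.divide(Mr.sum(axis=1), n_rated,
                      out=np.full(Mr.shape[0], np.nan), where=n_rated > 0)
print(user_mean)

# Hypothetical user row index for each test row; the prediction is a simple lookup.
test_user_idx = np.array([0, 1, 1, 2])
yp = user_mean[test_user_idx]
print(yp)
```

Averaging over all columns instead would pull every mean toward 0, because unrated entries are stored as 0.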

In [None]:
yp = rs.predict_to_user_average()
print(rs.rmse(yp))
### BEGIN HIDDEN TEST
assert(rs.rmse(yp)==approx(1.0352910334228647, abs=1e-2))
### END HIDDEN TEST

In [None]:
X = np.array([[5,0,1,3], [4,0,3,1], [2,0,0,0]], dtype=float)  # float, so the row averages are not truncated

for i, row in enumerate(X):
    if X[i].sum() > 0:
        arr = np.delete(X[i], np.where(X[i] == 0))
        print("arr w/o 0s", arr)
        avg = arr.mean()
        print("avg = ",avg)
        X[i][X[i] == 0] = avg
        print("X[i] 0-> avg: ", X[i])
        X[i] = X[i] - avg
        print(X[i])
# print(X)

In [None]:
# print(rs.Mr.T)
# print(rs.Mr.T.shape)
u, i = np.unique(rs.Mr.T[0], return_counts=True)
print(rs.Mr.T[0].shape)
print(u)
print(i)

In [None]:
# a = np.array([[1,0,1,1], [1,0,0,1], [1,0,0,0]])
# x = np.array([[5,0,1,4,3], [2,3,5,0,3], [5,4,5,4,0], [3,2,0,2,1]], dtype=float)
# x = np.array([[5,0,1,4,3], [2,0,5,0,3], [5,0,5,4,0], [0,0,0,0,0]], dtype=float)
x = np.array(rs.Mr.T, dtype=float)
# x = np.array([[5,0,0,3], [0,0,0,0], [1,0,1,1]])
print(x)
n_x = np.zeros((x.shape))
for i, row in enumerate(x):
    if x[i].sum() > 0:
        arr = np.delete(x[i], np.where(x[i] == 0))
        avg = arr.mean()
        x[i][x[i] == 0] = avg
        x[i] = x[i] - avg
    n_x[i] = x[i]
print(n_x)

In [None]:
a = n_x
b = n_x.T
numerator = a@b

numerator

In [None]:

a_mag = np.sqrt((a**2).sum(axis=1))
a_mag = a_mag.reshape((len(a_mag),1))
b_mag = np.sqrt((b**2).sum(axis=0))

denominator = a_mag * b_mag
print(denominator)

In [None]:
cos_matrix = np.divide(numerator, denominator, out=np.zeros_like(numerator), where=denominator!=0)
# cos_matrix = 0.5 + 0.5 * cos_matrix
print("\n", cos_matrix)

In [None]:
np.diag(cos_matrix)
u, i = np.unique(np.diag(cos_matrix), return_counts=True)
print(rs.Mr.T[0].shape)
print(u)
print(i)
print(np.trace(cos_matrix))

# Q2. Content-Based model [25 pts]

### 2a. Complete the function calc_movie_feature_matrix in the class ContentBased [5 pts]

In [None]:
cb = ContentBased(data)

In [None]:
assert(cb.Mm.shape==(3883, 18))

In [None]:
### BEGIN HIDDEN TEST
assert((cb.Mm[440]==np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]))).all()
assert((cb.Mm[0]==np.array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]))).all()
assert((cb.Mm[2336]==np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]))).all()
### END HIDDEN TEST

In [None]:
print(f'Data test Shape: {data.test.shape}')
print(f'Data train Shape: {data.train.shape}')
print(f'rs.Mr shape (#allusers, #allmovies): {rs.Mr.shape}')
print(f'all users length: {len(rs.allusers)}')
print(f'all movies length: {len(rs.allmovies)}')
print(f'all genres length: {len(rs.genres)}')
print(f'mid2idx length: {len(rs.mid2idx)}')
print(f'uid2idx length: {len(rs.uid2idx)}')
print(f'sim shape: {rs.sim.shape}')

In [None]:
# rs.Mr
# data

In [None]:
# rs.uid2idx[2233]
# rs.mid2idx[440]
# # count = 0
# yp = np.zeros(data.test.shape[0])
# for i in range(len(data.test)):
#     prediction = rating(data.test['uID'][i], data.test['mID'][i])
# #     print(count)
# #     print(f'{data.test.uID[i]} {data.test.mID[i]} Prediction: {prediction}')
# #     count +=1
#     yp[i] = prediction
# print(yp)

In [None]:
# data.test['uID']
# rs.Mr

In [None]:
# c = np.array([0.0,0.33,1.0,0.45,0.6,0.0])
# user = np.array([0.0,0.0,0.0,5.0,3.2,4.5])

# c = np.array([0.8,.1,1])
# user = np.array([5,2,0])
# print(c*user)
# print(c)
# sc = 0
# for v1,v2 in zip(b,c):
#     if v1 > 0:
#         sc += v2
# print(sc)
# sC = c
# sC[2] = 0
# print(f'weighted_ratings: {sum(c*user)}')
# print(sum(c*user) / sum(sC))

In [None]:
# def rating(Uid, Mid):
#     user_index = rs.uid2idx[Uid]
#     movie_index = rs.mid2idx[Mid]
#     user = rs.Mr[user_index]
#     c = cb.sim[movie_index]
#     weighted_ratings = sum(c*user)
#     sc = 0
#     for v1,v2 in zip(c*user,c):
#         if v1 > 0:
#             sc += v2
#     return weighted_ratings / sc

In [None]:
# u_index = rs.uid2idx[2233]
# m_index = rs.mid2idx[440]

# user = rs.Mr[u_index]
# c = cb.sim[m_index]


### 2b. Complete the function calc_item_item_similarity in the class ContentBased [10 pts]
This function updates `self.sim` and does not return a value.    
Some factors to think about:     
1. The movie feature matrix has binary elements. Which similarity metric should be used?
2. What is the computational (time) complexity of the similarity calculation?      
Hint: You may use functions in the `scipy.spatial.distance` module on the dense matrix, but calling them pair by pair is quite slow (think about the time complexity). To speed things up, you may try using functions in the `scipy.sparse` module. 
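
As a rough illustration of the difference, the sketch below computes Jaccard similarity for all pairs with one matrix product and checks it against SciPy's pairwise Jaccard distance. The toy feature matrix is made up, and treating a pair of empty rows as similarity 0 is an assumption of this sketch.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy binary feature matrix (rows: movies, columns: genres).
Mm = np.array([[0, 0, 1, 1],
               [1, 0, 0, 1],
               [1, 0, 0, 0]])

# |A ∩ B| for every pair is the dot product of the binary rows.
intersection = Mm @ Mm.T
# |A ∪ B| = |A| + |B| - |A ∩ B|
sizes = Mm.sum(axis=1)
union = sizes[:, None] + sizes[None, :] - intersection
# Assumption: pairs whose union is empty keep similarity 0.
sim = np.divide(intersection, union, out=np.zeros(union.shape), where=union > 0)

# Reference: SciPy returns Jaccard *distance*, so similarity is 1 - distance.
ref = 1.0 - squareform(pdist(Mm.astype(bool), metric="jaccard"))
print(sim)
print(np.allclose(sim, ref))
```

Both approaches do the same amount of arithmetic asymptotically, but a pair-by-pair loop over 3883 movies makes about 7.5 million Python-level calls, while the matrix product runs in compiled code.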

In [None]:
cb.calc_item_item_similarity() 

In [None]:
### BEGIN HIDDEN TEST
assert(cb.sim.sum()>0)
assert(np.trace(cb.sim)==3883)
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cb.sim.sum()>0)
assert(cb.sim.min()==0)
assert(cb.sim.max()==1)
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cb.sim.sum()>0)
n = len(cb.allmovies)     
assert(cb.sim.shape==(n,n))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cb.sim.sum()>0)
assert(cb.sim==cb.sim.T).all()
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cb.sim.sum()>0)
assert(cb.sim[:3,:3]==approx(np.array([[1,0.2,0.25],[0.2,1,0],[0.25,0,1]])))
### END HIDDEN TEST

### 2c. Complete the function predict_from_sim in the class RecSys [5 pts]
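
The prediction used here is a similarity-weighted average: the user's rating on each rated movie is weighted by that movie's similarity to the target movie, and the total is divided by the sum of those similarities. A minimal sketch with made-up numbers, assuming non-negative similarities:

```python
import numpy as np

sim_row = np.array([0.8, 0.1, 1.0])   # similarity of the target movie to movies 0..2
user = np.array([5.0, 2.0, 0.0])      # the user's ratings; 0 means "not rated"

rated = user > 0
# Weighted average of the user's ratings, using only the movies the user rated.
pred = (sim_row[rated] * user[rated]).sum() / sim_row[rated].sum()
print(pred)
```

Movie 2 has the highest similarity but is ignored because the user has not rated it, so the prediction is (0.8·5 + 0.1·2) / (0.8 + 0.1).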

In [None]:
### BEGIN HIDDEN TEST
assert(cb.predict_from_sim(2233,440)==approx(3.2010178117048347,abs=1e-2))
assert(cb.predict_from_sim(2868,2336)==approx(3.8280907095830283,abs=1e-2))
### END HIDDEN TEST

### 2d. Complete the function predict in the class RecSys [5 pts]
After completing it, run the cell below to calculate the rating predictions and RMSE. How much does the performance improve compared to the baseline results above? 

In [None]:
# YOUR CODE HERE
start = time.perf_counter()
yp = cb.predict()
end = time.perf_counter()
print((end-start)/ 60)

In [None]:
### BEGIN HIDDEN TEST
assert(cb.rmse(yp)==approx(1.0128116783754684, abs=1e-2))
### END HIDDEN TEST

In [None]:
# ### BEGIN HIDDEN TEST
# assert(rmse==approx(1.0128116783754684, abs=1e-2))
# ### END HIDDEN TEST

# Q3. Collaborative Filtering

### 3a. Complete the function cossim in the class Collaborative [10 pts]
**To Do:**    
1. Impute the unrated entries in `self.Mr` with the user's average rating, then subtract the user mean. Call this matrix X.   
2. Calculate cosine similarity for all item-item pairs. Don't forget to rescale the cosine similarity to the range 0 to 1.    
You might encounter a divide-by-zero warning (numpy will put a nan value in that entry). In that case, you can fill those entries with appropriate values.    
3. Run the cells below, which calculate yp and the RMSE. 

Hint: Suppose a movie has not been rated by anyone. Its column in X is $\vec{0}$=[0,0,0,....,0]. When you normalize this vector, you'll get a divide-by-zero warning, and it will produce nan values in the `self.sim` matrix. Theoretically, what should the similarity value be for $\vec{x}_i \cdot \vec{x}_i$ when $\vec{x}_i = \vec{0}$? What about $\vec{x}_i \cdot \vec{x}_j$ when $\vec{x}_i = \vec{0}$ and $\vec{x}_j$ is any vector?     

Hint: You may use `scipy.spatial.distance.cosine`, but it will be slow because it operates on one pair of vectors at a time, whereas you can use numpy matrix-matrix operations to calculate all cosines at once (this can be 100 times faster than the vector-vector approach on our data). Also pay attention to the definition: `scipy.spatial.distance` provides a distance, not a similarity. 
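
The steps can be sketched on a toy matrix as follows. This is one possible reading of the task, not a reference solution: the data is made up, and setting the diagonal to 1 for unrated movies is an assumption about how the zero-vector case should be filled.

```python
import numpy as np

# Toy utility matrix (rows: users, columns: movies); 0 means "not rated".
# Movie 3 has no ratings at all.
Mr = np.array([[5, 3, 1, 0],
               [4, 0, 2, 0],
               [1, 5, 0, 0]], dtype=float)

rated = Mr > 0
n_rated = rated.sum(axis=1)
user_mean = np.divide(Mr.sum(axis=1), n_rated,
                      out=np.zeros(Mr.shape[0]), where=n_rated > 0)

# Imputing unrated entries with the user mean and then subtracting that mean
# leaves 0 in every unrated position.
X = np.where(rated, Mr - user_mean[:, None], 0.0)

# Cosine between every pair of movie columns from one matrix product.
norms = np.sqrt((X ** 2).sum(axis=0))
denom = norms[:, None] * norms[None, :]
cos = np.divide(X.T @ X, denom,
                out=np.zeros((X.shape[1], X.shape[1])), where=denom != 0)

sim = 0.5 + 0.5 * cos          # rescale from [-1, 1] to [0, 1]
# Assumption: a movie is treated as fully similar to itself, even if unrated.
np.fill_diagonal(sim, 1.0)
print(np.round(sim, 3))
```

With this choice, a zero vector has cosine 0 with every other movie, which maps to a neutral similarity of 0.5 after rescaling.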

In [9]:
# YOUR CODE HERE
cf = Collaborative(data)
cf.calc_item_item_similarity(cf.cossim)

In [None]:
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim.sum()!=0.)# filtering out no-answer case
n = len(cf.allmovies)
assert(cf.sim.shape==(n,n))
### END HIDDEN TEST

In [None]:
np.trace(cos_matrix)

In [None]:
### BEGIN HIDDEN TEST
assert(np.trace(cf.sim)==3883)
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim.sum()!=0.) 
assert(cf.sim==cf.sim.T).all()
### END HIDDEN TEST

In [None]:
cos_matrix.max()

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim.min()==0)
assert(cf.sim.max()==1)
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim[0,:3]==approx([1., 0.48022892, 0.48356793],abs=1e-2))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(rmse==approx(1.0263081874204125, abs=5e-3))
### END HIDDEN TEST

### 3b. Complete the function jacsim in the class Collaborative [15 pts]
**3b [15 pts] = 3b-i) [5 pts]+3b-ii) [5 pts]+ 3b-iii) [5 pts]**

Function `jacsim` calculates Jaccard similarity between items using the collaborative filtering method. In the rating matrix `self.Mr`, the entries are 0 to 5 (0: unrated, 1-5: rating). We are interested in which thresholding method works better when we use Jaccard similarity in collaborative filtering.    
We may treat any rating of 3 or above as 1, and negative ratings (below 3) and no-rating as 0. Or, we may treat movies with any rating as 1 and those with no rating as 0. In this question, we will complete a function `jacsim` that takes a transformed rating matrix X and calculates and returns a Jaccard similarity matrix.     
Let's consider these input cases for the utility matrix $M_r$ with ratings 1-5 and 0s for no-rating.    
1. $M_r \geq 3$ 
2. $M_r \geq 1$ 
3. $M_r$, no transform.

Things to think about: 
- Cases 1 and 2 are straightforward for calculating Jaccard, but what does Jaccard mean for multicategory data?
- Time complexity: The matrix $M_r$ is much bigger than the item feature matrix $M_m$, so it will take a very long time if we calculate on the dense matrix.     
Hint: Use a sparse matrix.
- Which method will give the best performance?
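
A possible sparse formulation is sketched below. The helper `jaccard_item_similarity` is hypothetical and not part of the starter code. It treats each item as a set of (user, value) pairs, which is one reading of Jaccard for multicategory data and reduces to the usual set definition for binary input. It has not been checked against the expected values in the hidden tests.

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_item_similarity(X):
    """Hypothetical helper: item-item Jaccard for a (users x items) matrix.

    Each item is viewed as the set of (user, value) pairs with a nonzero value.
    Two items intersect on a user only when that user gave both the same value.
    """
    X = np.asarray(X)
    n_items = X.shape[1]
    intersection = np.zeros((n_items, n_items))
    for k in np.unique(X[X != 0]):
        Bk = csr_matrix((X == k).astype(np.float64))   # indicator of "value equals k"
        intersection += (Bk.T @ Bk).toarray()
    sizes = (X != 0).sum(axis=0).astype(float)         # set size of each item
    union = sizes[:, None] + sizes[None, :] - intersection
    # Assumption: pairs with an empty union keep similarity 0.
    return np.divide(intersection, union,
                     out=np.zeros_like(intersection), where=union > 0)

# Toy utility matrix (rows: users, columns: movies); movie 2 is unrated.
Mr = np.array([[5, 5, 0],
               [3, 4, 0],
               [0, 2, 0]])

sim_ge3 = jaccard_item_similarity(Mr >= 3)
sim_ge1 = jaccard_item_similarity(Mr >= 1)
sim_raw = jaccard_item_similarity(Mr)
print(sim_ge3[0, 1], sim_ge1[0, 1], sim_raw[0, 1])
```

On the toy data the three transforms give different similarities for the same pair of movies, because the untransformed case only counts users who gave both movies exactly the same rating.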

### 3b-i)  When $M_r\geq3$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.99. 

In [None]:
cf = Collaborative(data)
Xr = cf.Mr>=3
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.99)

In [None]:
### BEGIN HIDDEN TEST
assert(rmse<1.02) # a small mistake in the jaccard calculation gives a slightly higher rmse value
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(rmse==approx(0.9819058692126349, abs=5e-3))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim[0,:3]==approx([1., 0.10952085, 0.05501618],abs=1e-2))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(time_sim<30) 
### END HIDDEN TEST

### 3b-ii)  When $M_r\geq1$ [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 1.0. 

In [None]:
cf = Collaborative(data)
Xr = cf.Mr>=1
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<1.0)

In [None]:
### BEGIN HIDDEN TEST
assert(rmse<1.03) # a small mistake in the jaccard calculation gives a slightly higher rmse value
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(rmse==approx(0.991363571262366, abs=5e-3))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim[0,:3]==approx([1., 0.13426737, 0.07757066],abs=1e-2))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(time_sim<30) 
### END HIDDEN TEST

### 3b-iii)  When $M_r$; no transform [5 pts]
After you've implemented the jacsim function, run the code below. If implemented correctly, you'll have RMSE below 0.96

In [None]:
cf = Collaborative(data)
Xr = cf.Mr.astype(int)
t0=time.perf_counter()
cf.calc_item_item_similarity(cf.jacsim,Xr)
t1=time.perf_counter()
time_sim = t1-t0
print('similarity calculation time',time_sim)
yp = cf.predict()
rmse = cf.rmse(yp)
print(rmse)
assert(rmse<0.96)

In [None]:
### BEGIN HIDDEN TEST
assert(rmse<1.0) # a small mistake in the jaccard calculation gives a slightly higher rmse value
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(rmse==approx(0.9509126236828654, abs=5e-3))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(cf.sim[0,:3]==approx([1., 3.03561004e-02, 1.62357186e-02],abs=1e-2))
### END HIDDEN TEST

In [None]:
### BEGIN HIDDEN TEST
assert(time_sim<30) 
### END HIDDEN TEST

### 3c. Discussion [5 pts]
1. Summarize the methods and performances: Below is a template/example.

|Method|RMSE|
|:----|:--------:|
|Baseline, $Y_p$=3| |
|Baseline, $Y_p=\mu_u$| |
|Content based, item-item| |
|Collaborative, cosine| |
|Collaborative, jaccard, $M_r\geq 3$|  |
|Collaborative, jaccard, $M_r\geq 1$|  |
|Collaborative, jaccard, $M_r$|  |

2. Discuss which method(s) work better than others and why.

YOUR ANSWER HERE

In [None]:
# user1 = rs.rating_matrix()[0]
# print(np.unique(rs.rating_matrix()[0]))
# print(rs.predict_to_user_average())
# arr_new = np.delete(user1, np.where(user1 == 0))
# np.array(data.test.rating).mean()
# print(data.train)
# print(data.test)
# print(arr_new.mean())
# rs.rating_matrix().shape
print(rs.predict_to_user_average()[0])

In [None]:
import time
shape = x@x.T
x2 = np.zeros(shape.shape)
start = time.perf_counter()
for i in range(len(x)):
    for j in range(len(x)):
        x2[i][j] = sum((x[i] + x[j]) >= 1)
end = time.perf_counter()
print(end-start)
print(shape / x2)

In [None]:
start = time.perf_counter()
a = np.array([[0,0,1,1], [1,0,0,1], [1,0,0,0]])
b = a.T
intersection = a @ b

a_mag = a.sum(axis=1)
a_mag = a_mag.reshape((len(a_mag),1))
b_mag = b.sum(axis=0)

union = (a_mag + b_mag) - intersection

print(intersection / union)
end = time.perf_counter()
print(end-start)

In [None]:
rs.Mr

In [None]:
# movie_feature_matrix = np.empty((len(self.data.movies),len(self.genres)))
movie_feature_matrix = rs.Mr.astype(float)  # float copy, so rs.Mr is not modified in place

# for i in range(len(self.data.movies)):
#     movie_row = self.data.movies.iloc[i]
#     genre_scores = movie_row[self.genres].to_numpy()
#     movie_feature_matrix[i] = genre_scores
    
for i, row in enumerate(movie_feature_matrix):
    arr = np.delete(movie_feature_matrix[i], np.where(movie_feature_matrix[i] == 0))
    avg = arr.mean()
    movie_feature_matrix[i][movie_feature_matrix[i] == 0] = avg
    movie_feature_matrix[i] = movie_feature_matrix[i] - avg
    
a = movie_feature_matrix
b = movie_feature_matrix.T

numerator = a @ b

a_mag = np.sqrt((a**2).sum(axis=1))
a_mag = a_mag.reshape((len(a_mag),1))
b_mag = np.sqrt((b**2).sum(axis=0))

denominator = a_mag * b_mag
cos = numerator / denominator
sim_cos = (cos *0.5) + 0.5
# self.sim = sim_cos  # self is only defined inside a class method

In [None]:
# a = np.array([[1,0,1,1], [1,0,0,1], [1,0,0,0]])
x = np.array([[5,0,1,4,3], [2,3,5,0,3], [5,4,5,4,0], [3,2,0,2,1]], dtype=float)
# x = np.array([[5,0,1,4,3], [2,0,5,0,3], [5,0,5,4,0], [0,0,0,0,0]], dtype=float)
x = np.array(rs.Mr.T, dtype=float)  # float copy, so rs.Mr is not modified in place


for i, row in enumerate(x):
    arr = np.delete(x[i], np.where(x[i] == 0))
    avg = arr.mean()
    x[i][x[i] == 0] = avg
    x[i] = x[i] - avg

a = x
b = x.T
numerator = a@a.T

a_mag = np.sqrt((a**2).sum(axis=1))
a_mag = a_mag.reshape((len(a_mag),1))
b_mag = np.sqrt((b**2).sum(axis=0))

denominator = a_mag * b_mag
denominator[denominator == 0] = 1
cos_matrix = numerator / denominator
# cos_matrix = (cos_matrix*0.5) + 0.5
print("\n", cos_matrix)

In [None]:
print(rs.Mr.T.shape)
print(rs.Mr.shape)

In [None]:
# start = time.perf_counter()
# a = self.Mm
# b = self.Mm.T

# intersection = a @ b

# a_mag = a.sum(axis=1)
# a_mag= a_mag.reshape((len(a_mag),1))
# b_mag = b.sum(axis=0)

# union = (a_mag + b_mag) - intersection

# self.sim =  intersection / union
# end = time.perf_counter()
# print(end-start)


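a = cb.Mm      # self is not defined outside the class, so use the ContentBased instance
b = cb.Mm.T

numerator = a @ b

a_mag = np.sqrt((a**2).sum(axis=1))
a_mag = a_mag.reshape((len(a_mag),1))
b_mag = np.sqrt((b**2).sum(axis=0))

denominator = a_mag * b_mag

cos_genre = numerator / denominator  # cosine similarity between genre vectors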

In [None]:
a = np.array([[0,0,1,1], [1,0,0,1], [1,0,0,0]])
# a = np.array([[1,0,1,1,0], [0,1,1,0,0], [1,0,0,1,1]])
b = a.T
print("matrix x:")
print(a)

print("\n||A||^2")
print(a.sum(axis=1))  

print("\n||b||^2")
print(b.sum(axis=0))
# print(x[0].dot(x[1]))
# print(sum((x[0] + x[2]) >= 1))
print("\nintersection : ")
print(a@a.T)


# print(x[0].sum(axis=0) + + x[2].sum(axis=0))



print("\n matrix x.T:")
print(a.T)


print(a.sum(axis=1) + b.sum(axis=0))

In [None]:
x = np.array([3,2,3])
x = x.reshape((3,1))
x + b.sum(axis=0)

In [None]:
print(f'Data test Shape: {data.test.shape}')
print(f'Data train Shape: {data.train.shape}')
print(f'rs.Mr shape (#allusers, #allmovies): {rs.Mr.shape}')
print(f'all users length: {len(rs.allusers)}')
print(f'all movies length: {len(rs.allmovies)}')
print(f'all genres length: {len(rs.genres)}')
print(f'mid2idx length: {len(rs.mid2idx)}')
print(f'uid2idx length: {len(rs.uid2idx)}')
print(f'sim shape: {rs.sim.shape}')

In [None]:
x =np.array([[0,0,1,1], [1,0,0,1], [1,0,0,0]])
y =np.array([[2,2,1,1], [1,2,0,2], [1,0,2,0]])
print(x@x.T)
print(x.T@x)