## Collaborative Filtering

Learning objective: implement, evaluate, and improve upon traditional collaborative filtering recommenders.

We will implement collaborative filtering on a subset of the Netflix Prize dataset. The provided sample contains only about 2,000 items and about 28,000 users.


## Load Netflix Data

The dataset is a subset of the movie ratings data from the Netflix Prize Challenge. Download the dataset from Piazza. It contains a train set, a test set, a movie file, and a README file. The last two are the original files from the Netflix Prize; in this homework, however, you will work with the train and test files, both of which are subsets of the Netflix training data. Each line of the train and test files has this format: MovieID,UserID,Rating.

Our job is to predict a rating in the test set using those provided in the training set.
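
To make the file format concrete, the following minimal sketch parses a few lines in the MovieID,UserID,Rating layout. The three ratings and their IDs are made up for illustration and are not taken from the dataset; the column names mirror the ones used later in this notebook.

```python
import io
import pandas as pd

# Three made-up rating lines in the MovieID,UserID,Rating layout (hypothetical values).
sample = io.StringIO("1,101,3.0\n1,102,5.0\n2,101,4.0\n")

# The files have no header row, so the column names are supplied explicitly.
ratings = pd.read_csv(sample, names=["movId", "UserId", "Ratings"], header=None)

n_ratings = len(ratings)                 # one rating per line
n_movies = ratings["movId"].nunique()    # distinct movies in the sample
n_users = ratings["UserId"].nunique()    # distinct users in the sample
print(n_ratings, n_movies, n_users)
```

The same `read_csv` call works on the real train and test files once the in-memory sample is replaced by a file path.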

In [2]:
data_dir = 'C:\\Users\\ronak\\netflixdataset\\netflix-dataset'

# Count the ratings (one per line) in the test file.
with open(data_dir + '\\TestingRatings.txt', 'r') as file:
    lines = sum(1 for _ in file)
print('data in test set', lines)

# Count the ratings (one per line) in the training file.
with open(data_dir + '\\TrainingRatings.txt', 'r') as file:
    lines = sum(1 for _ in file)
print('data in train set', lines)

data in test set 100478
data in train set 3255352


## Implement Collaborative Filtering

In this part, we will implement the basic collaborative filtering algorithm. We consider the first 5,000 users with their associated items in the test set. 
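
As a rough illustration of the basic algorithm, the sketch below predicts one rating as the active user's mean plus a correlation-weighted sum of other users' deviations from their own means. It is a toy example under stated assumptions: the matrix `toy` and the helper `predict_rating` are hypothetical, and the weighted sum is normalized by the sum of absolute correlations, which is one common choice and differs from the fixed constant used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy ratings: rows are movies, columns are users, NaN means "not rated".
toy = pd.DataFrame(
    {
        "u1": [5.0, 4.0, np.nan, 1.0],
        "u2": [4.0, 5.0, 2.0, 1.0],
        "u3": [1.0, 2.0, 5.0, 4.0],
    },
    index=["m1", "m2", "m3", "m4"],
)

def predict_rating(ratings, user, movie):
    """Mean-centered, correlation-weighted prediction for one (user, movie) pair."""
    user_means = ratings.mean()                      # per-user mean over rated movies
    weights = ratings.corr()[user].drop(user)        # Pearson correlation with other users
    rated = ratings.loc[movie].drop(user).dropna()   # other users who rated this movie
    weights = weights[rated.index].fillna(0.0)       # undefined correlation counts as 0
    norm = weights.abs().sum()
    if norm == 0:
        return user_means[user]                      # no usable neighbours: fall back to the mean
    deviations = rated - user_means[rated.index]
    return user_means[user] + (weights * deviations).sum() / norm

pred = predict_rating(toy, "u1", "m3")
print(pred)
```

Here `u1` agrees with `u2` (who rated `m3` low) and disagrees with `u3` (who rated it high), so the prediction falls below `u1`'s mean rating.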

In [3]:
import pandas as pd
import numpy as np

colnames=['movId','UserId','Ratings']

data = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\TrainingRatings.txt',names = colnames, header = None) 
data = data.pivot(index='movId', columns='UserId', values='Ratings')


In [4]:
corr_mat = data.corr()

In [None]:
#corr_mat[]

In [None]:
# trying with corr() function to get matrix directly
colnames = ['movId','UserId','actualRatings']
test = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\TestingRatings.txt',names = colnames, header = None)
k = 0.0001  # fixed normalizing constant applied to the weighted sum
user_means = data.mean()  # per-user mean rating, computed once instead of inside the loop
for i in test.index.tolist():
    summ = 0
    mov = test.loc[i,'movId']
    user1 = test.loc[i,'UserId']
    # Only users who rated this movie contribute to the weighted sum.
    for user2 in data.columns:
        if not np.isnan(data.loc[mov,user2]):
            corr = corr_mat.loc[user1,user2]
            if np.isnan(corr):
                corr = 0  # an undefined correlation is treated as no similarity
            summ = summ + corr*(data.loc[mov,user2]-user_means[user2])
    test.loc[i,'predicted'] = user_means[user1] + k*summ

In [None]:
# colnames = ['movId','UserId','actualRatings']
# test = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\test_try.txt',names = colnames, header = None)
# #test = test.pivot(index='UserId', columns='movId', values='actualRatings')
# #print(test)
# k = 0.0001
# #print(test.loc[0,'movId'])
# for i in test.index.tolist():
#     summ = 0
#     mov = test.loc[i,'movId']
#     #print(mov)
#     user1 = test.loc[i,'UserId']
#     #print(user1)
#     for user2 in data.columns:
#         #print(data.loc[mov,user2])
#         #print(data.loc[user2,'mean'])
#         if not np.isnan(data.loc[mov,user2]):
#             corr = data[user1].corr(data[user2])
#             if np.isnan(corr):
#                 corr = 0
#             summ = summ + corr*(data.loc[mov,user2]-data[user2].mean())
#     test.loc[i,'predicted'] = data[user1].mean() + k*summ
        
# print (test)


## 2.3 Evaluation 

You should evaluate your predictions using Mean Absolute Error and Root Mean Squared Error. 
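
As a quick reference for the two metrics, here is a minimal NumPy sketch on made-up values (the arrays `actual` and `predicted` are hypothetical). Note that only RMSE involves a square root; MAE is the plain mean of the absolute errors.

```python
import numpy as np

# Hypothetical ratings, for illustration only.
actual = np.array([4.0, 3.0, 5.0, 1.0])
predicted = np.array([3.5, 3.0, 4.0, 2.0])

errors = predicted - actual
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
print(mae, rmse)
```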

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
colnames = ['no','movId','UserId','actualRatings','predicted']
ans = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\ans_5000.txt',names = colnames,sep = r'\s+')
# RMSE is the square root of the MSE; MAE is reported as-is, without a square root.
rmse = np.sqrt(mean_squared_error(ans['actualRatings'].values, ans['predicted'].values))
mae = mean_absolute_error(ans['actualRatings'].values, ans['predicted'].values)
print('I ran it for the first 5000 users on the hprc cluster, saved the result in a csv file similar to the one shown in the next cell,')
print('and then loaded it into a df to calculate the error')
print('root mean squared error is ',rmse)
print('mean absolute error is ',mae)

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
colnames = ['no','movId','UserId','actualRatings','predicted']
ans = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\results.txt',names = colnames,sep = r'\s+')
# RMSE is the square root of the MSE; MAE is reported as-is, without a square root.
rmse = np.sqrt(mean_squared_error(ans['actualRatings'].values, ans['predicted'].values))
mae = mean_absolute_error(ans['actualRatings'].values, ans['predicted'].values)
print('I ran it for the first 10000 users on the hprc cluster, saved the result in a csv file similar to the one shown in the next cell,')
print('and then loaded it into a df to calculate the error')
print('root mean squared error is ',rmse)
print('mean absolute error is ',mae)

## 2.4 Extensions

Given your results in the previous part, can you do better? For this last part you should report on your best attempt at improving MAE and RMSE. Provide code, results, plus a brief discussion on your approach.

In [None]:
# matrix factorisation
'''
I have used matrix factorisation to get the results. The number of latent factors (k) was varied from 5 up to (but not 
including) 75 in steps of 5, and the errors were plotted to find the number of latent factors at which the MSE and MAE are lowest.
As it turns out, the number of latent factors that gives the minimum MAE and MSE is 15. 

At a high level, SVD is an algorithm that decomposes a matrix into the best lower rank (i.e. smaller/simpler) approximation 
of the original matrix. Mathematically, it decomposes into two unitary matrices and a diagonal matrix.

R = U Σ Vt

where  R is user ratings matrix, U is the user “features” matrix,  Σ is the diagonal matrix of singular values (essentially 
weights), and Vt is the movie “features” matrix.

To get the lower rank approximation, we take these matrices and keep only the top  k features, which we think of as the  
most important underlying taste and preference vectors.

With MF, the root mean squared error (RMSE) improves to 0.926,
and the square root of the mean absolute error (MAE) improves to 0.854.
'''
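
The rank-k truncation described above can be sketched on a small dense matrix. This is an illustrative example rather than the notebook's pipeline: the toy matrix `R` and the helper `rank_k_approximation` are hypothetical, and `np.linalg.svd` stands in for `scipy.sparse.linalg.svds`, which is the better fit for large matrices.

```python
import numpy as np

# Hypothetical 4x4 ratings matrix (rows: users, columns: movies).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def rank_k_approximation(matrix, k):
    """Keep only the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)  # s is sorted in descending order
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

R_1 = rank_k_approximation(R, 1)
R_2 = rank_k_approximation(R, 2)

# Frobenius-norm reconstruction error shrinks as more factors are kept.
err_1 = np.linalg.norm(R - R_1)
err_2 = np.linalg.norm(R - R_2)
print(err_1, err_2)
```

Keeping more factors always reconstructs the training matrix more closely, which is why the number of factors is chosen by error on held-out ratings rather than on the matrix being factorised.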

In [None]:
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

colnames=['movId','UserId','Ratings']
data = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\TrainingRatings.txt',names = colnames, header = None) 
data = data.pivot(index='UserId', columns='movId', values='Ratings')

# Fill each movie's missing ratings with that movie's mean rating.
data = data.fillna(data.mean())

In [None]:
# Map each user id to its row position in the ratings matrix.
user_dict = {user: pos for pos, user in enumerate(data.index.values)}

# Inverse mapping: row position back to user id.
user_dict_ = dict((v,k) for k,v in user_dict.items())

In [None]:
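R = data.to_numpy()  # as_matrix() was removed in pandas 1.0
# Subtract each user's mean rating so the factorisation models deviations from it.
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)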

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
test = pd.read_csv('C:\\Users\\ronak\\netflixdataset\\netflix-dataset\\try.txt',names = colnames, header = None)
mae = []
mse = []
start = 5
end = 75
for m in range(start,end,5):
    U, sigma, Vt = svds(R_demeaned, k = m)
    sigma = np.diag(sigma)

    all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    preds_df = pd.DataFrame(all_user_predicted_ratings, columns = data.columns)
    

    for i in test.index.tolist():
        mov = test.loc[i,'movId']
        user = test.loc[i,'UserId']
        test.loc[i,'predicted'] = preds_df.loc[user_dict[user],mov]

    # mse_temp holds the RMSE (square root of the MSE); the MAE needs no square root.
    mse_temp = np.sqrt(mean_squared_error(test['Ratings'].values, test['predicted'].values))
    mae_temp = mean_absolute_error(test['Ratings'].values, test['predicted'].values)

    mse.append(mse_temp)
    mae.append(mae_temp)
    print('root mean squared error with MF is ',mse_temp)
    print('mean absolute error with MF is ',mae_temp)
    
import matplotlib.pyplot as plt
k = np.arange(start, end, 5)

#plt.plot(range(5,25,5), mse, 'ro')
#plt.axis([0, 6, 0, 20])
#plt.show()

plt.plot(k, mse, 'r--', k, mae, 'bs')

plt.xlabel('no. of latent factors')
plt.ylabel('RMSE and MAE')

plt.show()
