### Load MovieLens 1M and transform it into a Users x Movies matrix

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Every line of the .dat files holds one record with fields separated by "::".
def read_dat(path):
    # Use a context manager so the file handle is closed after reading.
    with open(path, 'r') as f:
        return [line.strip().split("::") for line in f]

ratings_list = read_dat('ml-1m/ml-1m/ratings.dat')
users_list = read_dat('ml-1m/ml-1m/users.dat')
movies_list = read_dat('ml-1m/ml-1m/movies.dat')
# Turn the three lists into pandas DataFrames. Every ratings field is an integer.
ratings_df = pd.DataFrame(ratings_list, columns=['UserID', 'MovieID', 'Rating', 'Timestamp']).astype(int)
movies_df = pd.DataFrame(movies_list, columns=['MovieID', 'Title', 'Genres'])
users_df = pd.DataFrame(users_list, columns=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'])
# Convert the MovieID column to numeric
movies_df['MovieID'] = pd.to_numeric(movies_df['MovieID'])

# Pivot ratings_df from a very tall frame (one row per rating) into r_df,
# a shorter but very wide frame: one row per user, one column per movie.
r_df = ratings_df.pivot(index='UserID', columns='MovieID', values='Rating')
r_df.fillna(0, inplace=True)  # movies a user has not rated become 0
print(ratings_df.shape)

r = r_df.values  # plain NumPy array, shape (6040, 3706)
r.shape
# Per-user row mean, shape (6040,). The zero-filled (unrated) entries are included in this mean.
user_ratings_mean = np.mean(r, axis=1)
user_ratings_mean.shape

# Normalise the data by each user's mean.
# reshape(-1, 1) turns the 1-D vector of means into a column so it broadcasts across each row.
r_demeaned = r - user_ratings_mean.reshape(-1, 1)

# r_demeaned: each row holds one user's ratings, and each entry = rating - that user's mean rating.



(1000209, 4)
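
To make the demeaning step concrete, here is a minimal sketch on a made-up 3 x 4 matrix (the values are illustrative, not MovieLens data). Because unrated movies were filled with 0, `np.mean` over a row averages over every movie, not only the rated ones. The `rated_mean` variant is a hypothetical alternative shown for comparison; it is not what the cell above computes.

```python
import numpy as np

# Toy users x movies matrix; 0 stands for "not rated", as after fillna(0).
toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 3.0],
])

# Same operation as in the notebook cell: row mean over ALL columns, zeros included.
row_mean = np.mean(toy, axis=1)               # 2.0, 1.0, 1.5
toy_demeaned = toy - row_mean.reshape(-1, 1)  # each row now sums to zero

# Hypothetical alternative (an assumption, not the notebook's method):
# average only over the entries that were actually rated.
rated = toy > 0
rated_mean = toy.sum(axis=1) / rated.sum(axis=1)  # 4.0, 4.0, 2.0

print(row_mean)
print(rated_mean)
```

The gap between the two means (2.0 versus 4.0 for the first user) shows how strongly the zero fill pulls the per-user mean down when the matrix is sparse.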


### Basic usage of Surprise

In [3]:
from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
from surprise import Reader
from surprise import accuracy



### Loading the data with Surprise.Reader, and setting k-folds

In [7]:
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(ratings_df[['UserID', 'MovieID', 'Rating']], reader)
data.split(n_folds = 4)
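
Conceptually, splitting with `n_folds = 4` partitions the ratings into four disjoint parts; each part is used once as the test set while the model trains on the other three. The sketch below illustrates that idea in plain Python. The helper `k_fold_indices` is hypothetical and is not Surprise's implementation, whose shuffling and handling of remainders may differ.

```python
import random

def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs that cover range(n_items) in n_folds folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # shuffle so folds are not ordered by user
    for k in range(n_folds):
        test_idx = indices[k::n_folds]    # every n_folds-th item, starting at offset k
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx

folds = list(k_fold_indices(n_items=10, n_folds=4))
for train_idx, test_idx in folds:
    print("%d train / %d test" % (len(train_idx), len(test_idx)))
```

Every rating lands in exactly one test fold, so each of the four reported scores is computed on data the model did not see during that fold's training.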



# Prediction algorithms (KNN, SVD)

How should the scores be interpreted?

**Mean absolute error (MAE)**

The MAE measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables. In words, the MAE is the average, over the verification sample, of the absolute values of the differences between each forecast and the corresponding observation: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|$, where $\hat{y}_i$ is the forecast and $y_i$ the observation. The MAE is a linear score, which means that all the individual differences are weighted equally in the average.



**Root mean squared error (RMSE)**

The RMSE is a quadratic scoring rule that also measures the average magnitude of the error. In words, the differences between forecasts and the corresponding observed values are each squared and then averaged over the sample, and finally the square root of that average is taken: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$, where $\hat{y}_i$ is the forecast and $y_i$ the observation. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This makes the RMSE most useful when large errors are particularly undesirable.



The MAE and the RMSE can be used together to diagnose the variation in the errors in a set of forecasts. The RMSE is always greater than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample. If RMSE = MAE, then all the errors have the same magnitude.
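
The relationship between the two scores can be checked with a small sketch. The ratings below are made up for illustration, and `mae` and `rmse` are hand-written helpers, not the functions in `surprise.accuracy`.

```python
import math

def mae(predicted, actual):
    """Mean absolute error: average of |prediction - observation|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / float(len(actual))

def rmse(predicted, actual):
    """Root mean squared error: square root of the average squared difference."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / float(len(actual)))

actual = [4.0, 3.0, 5.0, 2.0]

# Every error has magnitude 1, so RMSE equals MAE.
uniform = [5.0, 2.0, 4.0, 3.0]
# Same total absolute error (4), but concentrated in one prediction: RMSE grows.
spiky = [4.0, 3.0, 5.0, 6.0]

print("uniform: MAE=%.1f RMSE=%.1f" % (mae(uniform, actual), rmse(uniform, actual)))  # MAE=1.0 RMSE=1.0
print("spiky:   MAE=%.1f RMSE=%.1f" % (mae(spiky, actual), rmse(spiky, actual)))      # MAE=1.0 RMSE=2.0
```

Both prediction sets have the same MAE, but the one with a single large miss has twice the RMSE, which is the "high weight to large errors" behaviour described above.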

### Let's try with KNN.

In [11]:
from surprise import KNNBasic
trainset = data.build_full_trainset()

algo = KNNBasic()

algo.train(trainset)



Computing the msd similarity matrix...
Done computing similarity matrix.


In [12]:
evaluate(algo,data)

Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9252
MAE:  0.7301
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9267
MAE:  0.7302
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9257
MAE:  0.7302
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9259
MAE:  0.7299
------------
------------
Mean RMSE: 0.9259
Mean MAE : 0.7301
------------
------------


CaseInsensitiveDefaultDict(list,
                           {u'mae': [0.73010309254636685,
                             0.73016702384421495,
                             0.73020945841487706,
                             0.72988458437116954],
                            u'rmse': [0.92515635019810027,
                             0.92666893878278478,
                             0.92572206018581238,
                             0.92591154225677008]})
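
The log lines above mention an "msd similarity matrix". As a rough sketch of the idea, a basic neighbourhood method scores how similar two users are from the mean squared difference (MSD) of their ratings on commonly rated movies, then predicts a rating as the similarity-weighted average of the neighbours' ratings. The functions and data below are hypothetical and simplified; they are not Surprise's code and ignore details such as how many neighbours are used.

```python
def msd_similarity(ratings_u, ratings_v):
    """Similarity derived from the mean squared difference on commonly rated items."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / float(len(common))
    return 1.0 / (msd + 1.0)  # identical ratings give 1.0, larger differences approach 0

def knn_predict(target, neighbours, item):
    """Similarity-weighted average of the neighbours' ratings for `item`."""
    weighted = [(msd_similarity(target, n), n[item]) for n in neighbours if item in n]
    total_sim = sum(s for s, _ in weighted)
    if total_sim == 0:
        return None  # no usable neighbour rated this item
    return sum(s * r for s, r in weighted) / total_sim

# Made-up ratings in the form {movie_id: rating}.
alice = {1: 5, 2: 3}
bob = {1: 5, 2: 3, 3: 4}    # agrees with alice on both common movies: similarity 1.0
carol = {1: 1, 2: 3, 3: 2}  # msd = (16 + 0) / 2 = 8, so similarity 1/9

pred = knn_predict(alice, [bob, carol], item=3)
print("%.2f" % pred)  # 3.80
```

The prediction sits close to bob's rating of 4 because bob's similarity to alice is nine times carol's.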

### Let's try with SVD

In [8]:
algo = SVD(n_factors = 10)
evaluate(algo, data)

# An RMSE of about 0.88 is the root-mean-square prediction error, in stars. The average absolute error is the MAE (about 0.69).


Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8768
MAE:  0.6913
------------
Fold 2
RMSE: 0.8777
MAE:  0.6909
------------
Fold 3
RMSE: 0.8813
MAE:  0.6950
------------
Fold 4
RMSE: 0.8758
MAE:  0.6894
------------
------------
Mean RMSE: 0.8779
Mean MAE : 0.6917
------------
------------


CaseInsensitiveDefaultDict(list,
                           {u'mae': [0.69127108019084837,
                             0.69094866618322071,
                             0.69500832346157515,
                             0.6894471773834564],
                            u'rmse': [0.87684642640120536,
                             0.8777020306766451,
                             0.88126465640351348,
                             0.87581871344637008]})
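
`n_factors` sets the length of the latent vector learned for each user and each movie. In the usual biased matrix-factorization formulation, which is how Surprise's documentation describes its `SVD` algorithm, a rating is estimated as the global mean plus a user bias, an item bias, and the dot product of the two latent vectors. The sketch below only evaluates that rule with made-up numbers; every value and variable name is illustrative, and no training takes place.

```python
import numpy as np

n_factors = 10
rng = np.random.RandomState(0)

mu = 3.58                             # made-up global mean rating
b_u, b_i = 0.2, -0.1                  # made-up user and item biases
p_u = rng.normal(0, 0.1, n_factors)   # user latent vector (random, untrained)
q_i = rng.normal(0, 0.1, n_factors)   # item latent vector (random, untrained)

# Prediction rule: global mean + biases + interaction of the latent factors.
estimate = mu + b_u + b_i + np.dot(q_i, p_u)
# Keep the estimate inside the (1, 5) rating scale given to the Reader.
estimate = min(5.0, max(1.0, estimate))
print("%.3f" % estimate)
```

A larger `n_factors` gives the dot product more dimensions to capture taste patterns, at the cost of more parameters to fit.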

In [13]:
algo = SVD(n_factors = 50)
evaluate(algo, data)


Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8733
MAE:  0.6867
------------
Fold 2
RMSE: 0.8756
MAE:  0.6879
------------
Fold 3
RMSE: 0.8768
MAE:  0.6891
------------
Fold 4
RMSE: 0.8753
MAE:  0.6876
------------
------------
Mean RMSE: 0.8752
Mean MAE : 0.6878
------------
------------


CaseInsensitiveDefaultDict(list,
                           {u'mae': [0.68672967804118079,
                             0.68793592546616544,
                             0.68906590707554838,
                             0.68757470182470737],
                            u'rmse': [0.87330951052976979,
                             0.87557203666490457,
                             0.87676581865223235,
                             0.87531121539261991]})
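
To compare the three runs side by side, the per-fold RMSE values printed above can be averaged directly. This is a small bookkeeping sketch; the numbers are copied from the outputs in this notebook, and the dictionary keys are just labels chosen here.

```python
# Per-fold RMSE values copied from the evaluate() outputs above.
rmse_by_model = {
    "KNNBasic": [0.92515635019810027, 0.92666893878278478,
                 0.92572206018581238, 0.92591154225677008],
    "SVD(n_factors=10)": [0.87684642640120536, 0.8777020306766451,
                          0.88126465640351348, 0.87581871344637008],
    "SVD(n_factors=50)": [0.87330951052976979, 0.87557203666490457,
                          0.87676581865223235, 0.87531121539261991],
}

def mean(values):
    return sum(values) / float(len(values))

baseline = mean(rmse_by_model["KNNBasic"])
for name in sorted(rmse_by_model):
    m = mean(rmse_by_model[name])
    change = 100.0 * (baseline - m) / baseline  # percent reduction relative to KNNBasic
    print("%-18s mean RMSE %.4f (%.1f%% lower than KNNBasic)" % (name, m, change))
```

On these four folds both SVD runs beat KNNBasic by roughly five percent, while going from 10 to 50 factors changes the mean RMSE only from 0.8779 to 0.8752.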