# MovieLens Neighborhood Based Recommendation with GraphLab

We first explore an item-similarity, neighborhood-based approach. To make recommendations for the target movies, the top k movies for each user were obtained, with two similarity functions considered: Pearson correlation and cosine similarity.
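
As a rough illustration of how the two similarity functions differ, here is a minimal NumPy sketch. It is not GraphLab's implementation: the function names and the two hypothetical rating vectors are assumptions made for this example, and every user is assumed to have rated both movies.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def pearson_similarity(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return cosine_similarity(a - a.mean(), b - b.mean())

# Hypothetical ratings of two movies by the same four users.
movie_a = [1.0, 2.0, 3.0, 4.0]
movie_b = [2.0, 3.0, 4.0, 5.0]  # same pattern, shifted up by one star

cos_ab = cosine_similarity(movie_a, movie_b)
pear_ab = pearson_similarity(movie_a, movie_b)
print(cos_ab, pear_ab)
```

Because Pearson correlation centers each vector first, the one-star shift does not affect it, while cosine similarity treats the shifted vector as slightly different.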

In [7]:
import numpy as np
import pandas as pd
import graphlab
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cross_validation import train_test_split,KFold

#prepare the data
r_cols = ['movie_id','title','genres','user_id', 'rating', 'unix_timestamp']
ratings=pd.read_csv('ratings.csv', sep=',', encoding='latin-1')
movies=graphlab.SFrame.read_csv('movies.csv')
tags=graphlab.SFrame.read_csv('tags.csv')

#split the data into training and validation sets
train, test = train_test_split(ratings, test_size=0.2)
train=graphlab.SFrame(train)
test=graphlab.SFrame(test)


#train the Recommender Model
itemSimModel_pearson = graphlab.item_similarity_recommender.create(train, user_id='userId', item_id='movieId', target='rating', similarity_type='pearson')
itemSimModel_cosine = graphlab.item_similarity_recommender.create(train, user_id='userId', item_id='movieId', target='rating', similarity_type='cosine')
itemSimModel_pearson.evaluate_rmse(test,target='rating')
graphlab.recommender.util.compare_models(test,[itemSimModel_pearson])



------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |  0.00149476831091 | 2.29964355525e-05 |
|   2    | 0.000747384155456 | 2.29964355525e-05 |
|   3    | 0.000498256103637 | 2.29964355525e-05 |
|   4    | 0.000373692077728 | 2.29964355525e-05 |
|   5    | 0.000298953662182 | 2.29964355525e-05 |
|   6    | 0.000249128051819 | 2.29964355525e-05 |
|   7    |  0.00021353833013 | 2.29964355525e-05 |
|   8    | 0.000186846038864 | 2.29964355525e-05 |
|   9    | 0.000166085367879 | 2.29964355525e-05 |
|   10   | 0.000149476831091 | 2.29964355525e-05 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

('\nOverall RMSE: ', 1.1601208132503962)

Per User RMSE (best)
+--------+-------+----------------+
| userId | count |      rmse      |
+--------+-------+----------------+
|  341   |   

[{'precision_recall_by_user': Columns:
  	userId	int
  	cutoff	int
  	precision	float
  	recall	float
  	count	int
  
  Rows: 12042
  
  Data:
  +--------+--------+-----------+--------+-------+
  | userId | cutoff | precision | recall | count |
  +--------+--------+-----------+--------+-------+
  |   1    |   1    |    0.0    |  0.0   |   3   |
  |   1    |   2    |    0.0    |  0.0   |   3   |
  |   1    |   3    |    0.0    |  0.0   |   3   |
  |   1    |   4    |    0.0    |  0.0   |   3   |
  |   1    |   5    |    0.0    |  0.0   |   3   |
  |   1    |   6    |    0.0    |  0.0   |   3   |
  |   1    |   7    |    0.0    |  0.0   |   3   |
  |   1    |   8    |    0.0    |  0.0   |   3   |
  |   1    |   9    |    0.0    |  0.0   |   3   |
  |   1    |   10   |    0.0    |  0.0   |   3   |
  +--------+--------+-----------+--------+-------+
  [12042 rows x 5 columns]
  Note: Only the head of the SFrame is printed.
  You can use print_rows(num_rows=m, num_columns=n) to print more ro

In [None]:
#print a sample from the model: top 10 recommendations for users 1 through 10
itemSimModel_pearson.recommend(users=range(1,11),k=10)


# Evaluation Metrics & Interactive View of Model 

We can see below that the Pearson correlation similarity yields a smaller RMSE than cosine similarity, since the Pearson correlation coefficient takes into account the differences in users' rating scales.
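
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it with NumPy on a few made-up ratings; the function name `rmse` and the sample values are assumptions for illustration and are not taken from the models above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up ratings, used only to show the arithmetic.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

error = rmse(actual, predicted)
print(error)
```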

In [None]:
pearson_eval = itemSimModel_pearson.evaluate(test)


In [None]:
cosine_eval = itemSimModel_cosine.evaluate(test) 


The following visualization provides an interactive view of the model: it recommends movies to an individual user and lists movies that may be liked given a movie of interest.

In [None]:
view = itemSimModel_pearson.views.overview(
        validation_set=test,
        item_data=movies)

view.show()


# Ranking Factorization

A RankingFactorizationRecommender "learns latent factors for each user and item and uses them to rank recommended items according to the likelihood of observing those (user, item) pairs. This is commonly desired when performing collaborative filtering for implicit feedback datasets or datasets with explicit ratings for which ranking prediction is desired."
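
To make the idea of latent factors concrete, the sketch below scores items for one user as the dot product of a user vector and each item vector, then ranks by score. It is a simplified illustration rather than GraphLab's model (which, among other things, also learns bias terms); the random factor matrices and the helper `top_k_items` are hypothetical.

```python
import numpy as np

# Hypothetical, randomly initialised factors standing in for learned ones.
rng = np.random.RandomState(0)
n_users, n_items, n_factors = 4, 6, 3
user_factors = rng.normal(size=(n_users, n_factors))
item_factors = rng.normal(size=(n_items, n_factors))

def top_k_items(user_index, k, exclude=()):
    """Rank items for one user by the dot product of latent factors."""
    scores = item_factors.dot(user_factors[user_index])
    for item_index in exclude:
        # Items the user has already rated are never recommended.
        scores[item_index] = -np.inf
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order[:k]]

# Top 3 items for user 0, skipping items 1 and 2 as "already rated".
recommendations = top_k_items(0, k=3, exclude=(1, 2))
print(recommendations)
```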

In [9]:
#Naive factorization Model
m1 = graphlab.ranking_factorization_recommender.create(train,  user_id='userId', item_id='movieId', target='rating')

m1.recommend(users=range(1,11),k=10)

#The following produces predicted ratings for the user-item pairs in the test set using the ranking factorization recommender
#m1.predict(test)

userId,movieId,score,rank
1,608,4.95162849699,1
1,539,4.87007754122,2
1,594,4.86656134402,3
1,141,4.82945977722,4
1,720,4.80027138268,5
1,1148,4.79107623373,6
1,17,4.73685633217,7
1,1500,4.71193914686,8
1,357,4.7090868858,9
1,515,4.66685851489,10


In [10]:
#Model with movie information
m2 = graphlab.ranking_factorization_recommender.create(train,  user_id='userId', item_id='movieId', item_data=movies, target='rating')
m2.recommend(users=range(1,11))

#The following produces predicted ratings for the test set using the ranking factorization recommender that adds movie genres as a side feature
#m2.predict(test)


userId,movieId,score,rank
1,899,4.6849898306,1
1,260,4.24750616227,2
1,50,4.23813507532,3
1,608,4.23454047752,4
1,838,4.23394960736,5
1,1028,4.17189442192,6
1,17,4.14456035768,7
1,593,4.13577984099,8
1,910,4.06997211848,9
1,497,4.06237465714,10


In [11]:
#Model with tag information
m3 = graphlab.ranking_factorization_recommender.create(train,  user_id='userId', item_id='movieId', item_data=tags, target='rating')
m3.recommend(users=range(1,11),k=10)

#The following produces predicted ratings for the test set using the ranking factorization recommender that adds movie tags as a side feature
#m3.predict(test)

userId,movieId,score,rank
1,37729,6.7399329727,1
1,34162,6.47691071487,2
1,45950,6.30934178751,3
1,858,5.9776759635,4
1,7153,5.87264213806,5
1,4993,5.83330133201,6
1,35957,5.78989033252,7
1,2571,5.74160243013,8
1,318,5.70079463608,9
1,5952,5.65537261732,10


In [12]:

#Model that pushes predicted ratings of unobserved user-item pairs toward 1 or below with movie genres as side feature
m4=  graphlab.ranking_factorization_recommender.create(train,  user_id='userId', item_id='movieId', item_data=movies, target='rating', unobserved_rating_value = 1)

m4.recommend(users=range(1,11),k=10)

#The following produces predicted ratings for the test set using the ranking factorization recommender
#m4.predict(test)

userId,movieId,score,rank
1,260,5.0640205202,1
1,1196,4.6197308061,2
1,296,4.61905713392,3
1,1028,4.36236293067,4
1,589,4.35540384625,5
1,908,4.35508050595,6
1,1035,4.27489547164,7
1,594,4.27439259564,8
1,1210,4.26430707251,9
1,541,4.24750422512,10


In [13]:

#Model that pushes predicted ratings of unobserved user-item pairs toward 1 or below with tags as side feature
m5=  graphlab.ranking_factorization_recommender.create(train,  user_id='userId', item_id='movieId', item_data=tags, target='rating', unobserved_rating_value = 1)

m5.recommend(users=range(1,11),k=10)

#The following produces predicted ratings for the test set using the ranking factorization recommender
#m5.predict(test)

userId,movieId,score,rank
1,260,4.11746645225,1
1,50,4.11475411468,2
1,608,3.95126862166,3
1,296,3.92786297155,4
1,318,3.80972273693,5
1,1089,3.79754195841,6
1,111,3.78791366017,7
1,527,3.74968918239,8
1,364,3.73783869328,9
1,1219,3.73299343723,10


In [20]:
#m1.evaluate_rmse(test,target='rating')  #'rmse_overall': 1.1997219768416447

#m2.evaluate_rmse(test,target='rating') #'rmse_overall': 1.5102909960735822

#m3.evaluate_rmse(test,target='rating') #'rmse_overall': 1.1110787691981734

#m4.evaluate_rmse(test,target='rating') #'rmse_overall': 1.467256102591114

#m5.evaluate_rmse(test,target='rating') #'rmse_overall': 1.0224394708220514

The model is trained with stochastic gradient descent and, by default, applies a ranking regularization term of 0.25. According to Turi, "When ranking_regularization is larger than zero, the model samples a small set of unobserved user-item pairs and attempts to drive their rating predictions below the value specified with unobserved_rating_value. This has the effect of improving the precision-recall performance of recommended items." However, given our explicit ratings, this term should be minimized.
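
One way to picture the trade-off is a loss with two parts: a squared-error fit on observed ratings, plus a penalty on sampled unobserved pairs whose predictions exceed `unobserved_rating_value`, weighted by `ranking_regularization`. The sketch below is only an illustration of that idea under these assumptions; it is not GraphLab's actual objective, and the function `sketch_loss` and its inputs are hypothetical.

```python
import numpy as np

def sketch_loss(pred_observed, ratings, pred_unobserved,
                unobserved_rating_value=1.0, ranking_regularization=0.25):
    """Illustrative loss: rating fit plus a weighted ranking penalty."""
    pred_observed = np.asarray(pred_observed, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    pred_unobserved = np.asarray(pred_unobserved, dtype=float)

    # Fit term: squared error on the ratings we actually observed.
    fit = np.mean((pred_observed - ratings) ** 2)

    # Ranking term: penalise only unobserved pairs predicted above the target.
    excess = np.maximum(0.0, pred_unobserved - unobserved_rating_value)
    rank_penalty = np.mean(excess ** 2)

    return float(fit + ranking_regularization * rank_penalty)

# Perfect fit on observed ratings, but one unobserved pair is predicted at 3.
with_ranking = sketch_loss([4.0, 3.0], [4.0, 3.0], [3.0, 0.5])
without_ranking = sketch_loss([4.0, 3.0], [4.0, 3.0], [3.0, 0.5],
                              ranking_regularization=0.0)
print(with_ranking, without_ranking)
```

In this toy setting, a nonzero weight pulls the optimizer away from a pure rating fit, which is consistent with the intuition that it can hurt RMSE on explicit ratings.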

Varying this parameter over [0, 1] while keeping all else equal, and using m5 as the base model since it minimizes RMSE, we obtain the following plot, which supports this intuition. We therefore conclude with the final adjusted ranking factorization model, m_star.
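
A sweep like this could be organised as a small loop, sketched below. The helper names `sweep_ranking_regularization` and `graphlab_rmse` are hypothetical; the GraphLab calls inside `graphlab_rmse` mirror those used elsewhere in this notebook, and it is assumed that `evaluate_rmse` returns a dictionary with an `rmse_overall` key, as the commented results above suggest. The demo at the end uses a stand-in scorer so the sketch runs without GraphLab installed.

```python
def sweep_ranking_regularization(values, fit_and_score):
    """Return (value, score) pairs, calling fit_and_score once per value."""
    return [(value, fit_and_score(value)) for value in values]

def graphlab_rmse(reg, train, test, tags):
    """Hypothetical scorer: train the m5 configuration with one setting."""
    import graphlab  # requires GraphLab Create; imported lazily on purpose
    model = graphlab.ranking_factorization_recommender.create(
        train, user_id='userId', item_id='movieId', item_data=tags,
        target='rating', unobserved_rating_value=1,
        ranking_regularization=reg)
    return model.evaluate_rmse(test, target='rating')['rmse_overall']

# Stand-in scorer (not real results), used only to show the call pattern.
demo = sweep_ranking_regularization([0.0, 0.5, 1.0], lambda reg: 1.0 + reg)
print(demo)
```

With GraphLab available, the scorer could be passed as `lambda reg: graphlab_rmse(reg, train, test, tags)`.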

In [None]:
x=np.array([0,.01,.1,.25,.5,.75,.9,.99])
y=np.array([.016460668430915367,.05088695106015607,.16003760731422528,0.27976936252929974,.4133214077837332,.5950865397988541,.6098885076702547,.6931611640547327])
plt.figure(figsize=(10,8))
plt.plot(x,y,'c')

plt.title('Tuning Ranking Regularization on Explicit Target Ratings')
plt.xlabel('Ranking Regularization')
plt.ylabel('RMSE')
    
plt.show()


In [23]:
m_star=  graphlab.ranking_factorization_recommender.create(train,  user_id='userId', item_id='movieId', item_data=tags, target='rating', unobserved_rating_value = 1, ranking_regularization=0)
m_star.evaluate_rmse(test,target='rating') # 'rmse_overall': 1.064844763280207

view = m_star.views.overview(
        validation_set=test,
        item_data=movies)

view.show()