**Introduction**

In this notebook, we build several recommender models and compare their performance. For this we use a dataset provided by MovieLens.

MovieLens provides several datasets; here we use the 100K dataset.

This dataset consists of:
 
• 100,000 ratings (1-5) from 943 users on 1682 movies.  
• Each user has rated at least 20 movies.  
• Simple demographic info for the users (age, gender, occupation, zip) 

Download the "u.data" file. To view this file you can use a spreadsheet application such as Microsoft Excel. It has the following tab-separated format: user id | item id | rating | timestamp. The timestamps are Unix epoch seconds, that is, seconds since 1/1/1970 UTC.
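
To make the file layout concrete, here is a minimal Python 3 sketch that parses one tab-separated line and converts the timestamp to a UTC datetime. The helper name `parse_rating_line` and the sample line are made up for illustration; they are not part of the dataset or of any library.

```python
from datetime import datetime, timezone


def parse_rating_line(line):
    """Split one tab-separated u.data line into typed fields."""
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    return {
        "user_id": int(user_id),
        "item_id": int(item_id),
        "rating": int(rating),
        # Unix seconds -> timezone-aware UTC datetime
        "rated_at": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }


# Hypothetical sample line in the u.data layout (values chosen for illustration).
sample = "1\t10\t4\t86400\n"
record = parse_rating_line(sample)
print(record["user_id"], record["item_id"], record["rating"],
      record["rated_at"].isoformat())
```

The notebook itself drops the timestamp column, so this conversion is only needed if you want to inspect when ratings were made.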

**Data Preparation**

In [1]:
import matplotlib as mpl
mpl.use('TkAgg')
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

**Working with Data sets**

In [2]:
col_names = ["user_id", "item_id", "rating", "timestamp"]
data = pd.read_table("u.data", names=col_names)
data = data.drop("timestamp", axis=1)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
user_id    100000 non-null int64
item_id    100000 non-null int64
rating     100000 non-null int64
dtypes: int64(3)
memory usage: 2.3 MB


**Plot data into a histogram**

In [3]:
plt.hist(data["rating"])
plt.ylabel('Number of Items')
plt.xlabel('Ratings')
plt.title('What does data look like?')
plt.show()


<img src="Graph1.PNG" width =500 height =500>

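**The dataset sparsity can be calculated as**

Note that the value computed below is the percentage of user-movie pairs that have a rating (the density of the rating matrix). The sparsity in the strict sense is 100 minus this value.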

In [4]:
Number_Ratings = float(len(data))
Number_Movies = float(len(np.unique(data["item_id"])))
Number_Users = float(len(np.unique(data["user_id"])))
Sparsity = (Number_Ratings/(Number_Movies*Number_Users))*100.0
print "Sparsity of Dataset is", Sparsity, "Percent"

Sparsity of Dataset is 6.30466936422 Percent
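
As a cross-check, the same arithmetic can be reproduced in Python 3 from the counts quoted in the introduction (100,000 ratings, 943 users, 1682 movies). The function name `density_percent` is hypothetical; the sketch only restates the formula used in the cell above.

```python
def density_percent(n_ratings, n_users, n_items):
    """Percentage of user-item cells that contain a rating."""
    return 100.0 * n_ratings / (n_users * n_items)


# Counts quoted in the dataset description.
density = density_percent(100000, 943, 1682)
sparsity = 100.0 - density  # share of user-item cells with no rating
print(round(density, 4), round(sparsity, 4))
```

Roughly 6.3 percent of the cells are filled, so about 93.7 percent of the rating matrix is empty.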


**Sub-setting the data**

The criterion we currently use is to exclude any user who has fewer than 50 ratings. This value can be changed through the RATINGS_CUTOFF variable.

In [5]:
users = data["user_id"]
ratings_count = {}
for user in users:
    if user in ratings_count:
        ratings_count[user] += 1
    else:
        ratings_count[user] = 1
RATINGS_CUTOFF = 50
remove_users = []
for user,num_ratings in ratings_count.iteritems():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
data = data.loc[~data['user_id'].isin(remove_users)]
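
The counting loop above can also be expressed with `collections.Counter`. The following Python 3 sketch works on plain tuples rather than a DataFrame so that it stays self-contained; `filter_active_users` and the toy rows are made up for illustration.

```python
from collections import Counter


def filter_active_users(rows, cutoff):
    """Keep only rows whose user has at least `cutoff` ratings.

    `rows` is a list of (user_id, item_id, rating) tuples.
    """
    counts = Counter(user_id for user_id, _, _ in rows)
    return [row for row in rows if counts[row[0]] >= cutoff]


# Toy data (made up): user 1 has three ratings, user 2 has one.
toy_rows = [(1, 10, 4), (1, 11, 5), (2, 10, 3), (1, 12, 2)]
kept = filter_active_users(toy_rows, cutoff=2)
print(kept)
```

In pandas, `data["user_id"].value_counts()` gives the same per-user counts without an explicit loop.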

**Recalculate sparsity**

In [6]:
Number_Ratings = float(len(data))
Number_Movies = float(len(np.unique(data["item_id"])))
Number_Users = float(len(np.unique(data["user_id"])))
Sparsity = (Number_Ratings/(Number_Movies*Number_Users))*100.0
print "Sparsity of Dataset is", Sparsity, "Percent"

Sparsity of Dataset is 9.26584192843 Percent


**Preparing for GraphLab**

Note: before using GraphLab you need to register with GraphLab and get a key to install it. Link: https://turi.com/download/academic.html

I used Anaconda with Turi (GraphLab Create Launcher), which worked well.

**What is GraphLab Create Launcher?**

The GraphLab Create Launcher is a desktop application that makes it easy to get an Anaconda Python environment and GraphLab Create on your machine. After the installation is complete, it provides an icon to start a terminal or IPython Notebook (Jupyter) session where you can use GraphLab Create.

In [7]:
import graphlab as gl
sf = gl.SFrame(data)

This non-commercial license of GraphLab Create for academic use is assigned to sgmoorthy@gmail.com and will expire on November 13, 2018.


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\Guruc\AppData\Local\Temp\graphlab_server_1510942204.log.0


If the GraphLab installation reports an error, use the code below to check for missing dependencies and get them installed.


gl.get_dependencies()

**Splitting Data Randomly (Train/Test)**

SFrame.random_split(fraction, seed=None) : 

Randomly split the rows of an SFrame into two SFrames. The first SFrame contains M rows, sampled uniformly (without replacement) from the original SFrame. M is approximately the fraction times the original number of rows. The second SFrame contains the remaining rows of the original SFrame.
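
To illustrate why the first part contains only approximately `fraction` of the rows, here is a minimal Python 3 sketch in which every row is assigned independently with probability `fraction`. This is an assumption made for illustration, not a description of how GraphLab implements `random_split`; the function below is hypothetical and works on plain lists.

```python
import random


def random_split(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))
train, test = random_split(rows, 0.75, seed=5)
# The two parts are disjoint and together contain every row exactly once.
print(len(train), len(test), len(train) + len(test))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same test set.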

In [8]:
sf_train, sf_test = sf.random_split(.75, seed=5)
print "Train :",len(sf_train), "Test ",len(sf_test) ,"Total" ,(len(sf_train)+len(sf_test)) 


Train : 66476 Test  21995 Total 88471


In [9]:
sf_train, sf_validate = sf_train.random_split(.75)

**Popularity Recommender**

In [10]:
popularity_recommender = gl.recommender.popularity_recommender.create(sf_train,target='rating')
popularity_recommender.evaluate_rmse(sf_test,'rating')

{'rmse_by_item': Columns:
 	item_id	int
 	count	int
 	rmse	float
 
 Rows: 1447
 
 Data:
 +---------+-------+----------------+
 | item_id | count |      rmse      |
 +---------+-------+----------------+
 |   118   |   55  | 1.08391323518  |
 |   1029  |   3   | 1.51877143332  |
 |   435   |   39  | 0.798324248985 |
 |   1517  |   1   |      2.0       |
 |   537   |   8   | 1.40215882935  |
 |   526   |   35  | 0.934595841931 |
 |   232   |   21  | 0.627809382545 |
 |   310   |   28  | 0.756420634585 |
 |    49   |   19  | 0.796875237668 |
 |    13   |   37  | 0.918592725376 |
 +---------+-------+----------------+
 [1447 rows x 3 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
 'rmse_by_user': Columns:
 	user_id	int
 	count	int
 	rmse	float
 
 Rows: 568
 
 Data:
 +---------+-------+----------------+
 | user_id | count |      rmse      |
 +---------+-------+----------------+
 |   118   |   13  | 1.

**Collaborative Filtering**

FactorizationRecommender trains a model capable of predicting a score for each possible combination of users and items. The internal coefficients of the model are learned from known scores of users and items. Recommendations are then based on these scores.

We try the regularization values {1e-05, 0.0001, 0.001, 0.01, 0.1} and keep the one that gives the lowest RMSE on the validation set.
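
The idea of learning internal coefficients from known scores can be sketched in a few lines of Python 3. The sketch below fits one small factor vector per user and per item with stochastic gradient descent on squared error plus an L2 penalty. It is a simplified illustration under my own assumptions (global mean only, fixed learning rate, made-up toy ratings), not GraphLab's implementation; all names are hypothetical.

```python
import random


def train_factorization(ratings, n_factors=2, regularization=0.01,
                        learning_rate=0.05, n_epochs=200, seed=0):
    """Fit user/item factors with SGD; return a predict(user, item) function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for every factor.
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    mean = sum(r for _, _, r in ratings) / float(len(ratings))

    def predict(u, i):
        return mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += learning_rate * (err * qi - regularization * pu)
                Q[i][f] += learning_rate * (err * pu - regularization * qi)
    return predict


# Toy (user_id, item_id, rating) triples, made up for illustration.
toy_ratings = [(1, 10, 5), (1, 11, 1), (2, 10, 4), (2, 11, 2), (3, 10, 5)]
predict = train_factorization(toy_ratings)
for u, i, r in toy_ratings:
    print(u, i, r, round(predict(u, i), 2))
print("unseen pair (3, 11):", round(predict(3, 11), 2))
```

A larger `regularization` value pulls the factors toward zero, which is the trade-off the grid search in the next cell explores.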

In [11]:
regularization_terms = [10**-5,10**-4,10**-3,10**-2,10**-1]
best_regularization_term=0
best_RMSE = np.inf
for regularization_term in regularization_terms:
    factorization_recommender = gl.recommender.factorization_recommender.create(sf_train,
                                                                                      target='rating',
                                                                                      regularization=regularization_term)
    evaluation = factorization_recommender.evaluate_rmse(sf_validate,'rating')
    if evaluation['rmse_overall'] < best_RMSE:
        best_RMSE = evaluation['rmse_overall']
        best_regularization_term = regularization_term
print "Best Regularization Term", best_regularization_term
print "Best Validation RMSE Achieved", best_RMSE

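# Show a summary of the model in GraphLab Canvas.
# Note: at this point factorization_recommender is the model from the last
# loop iteration (regularization 0.1), not necessarily the best one.
factorization_recommender.show()

# Show an interactive evaluation view on the validation set.
view = factorization_recommender.views.evaluate(sf_validate)
view.show()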

Best Regularization Term 0.001
Best Validation RMSE Achieved 0.935978294474
Canvas is accessible via web browser at the URL: http://localhost:51939/index.html
Opening Canvas in default web browser.



<img src="Chart1.PNG" width =500 height =500>

<img src="Chart2.PNG" width =500 height =500>

<img src="Data1.PNG" width =500 height =500>

**Collaborative Filtering - Test set**

In [12]:
factorization_recommender = gl.recommender.factorization_recommender.create(sf_train,
                                                                                  target='rating',
                                                                                  regularization=best_regularization_term)
print "Test RMSE on best model", factorization_recommender.evaluate_rmse(sf_test,'rating')['rmse_overall']


Test RMSE on best model 0.940223937639
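
For reference, the RMSE reported above is the square root of the mean squared difference between actual and predicted ratings. A minimal Python 3 sketch with made-up numbers (the function `rmse` is a hypothetical helper, not the GraphLab call):

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))


# Made-up ratings and predictions for illustration.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
print(rmse(actual, predicted))  # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.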


**Item-Item Similarity Recommender**

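This model first computes the similarity between items using the observations of users who have interacted with both items. An item is then scored for a user based on how similar it is to the items that user has already interacted with. As the output below shows, the predicted scores are small similarity-based values rather than values on the 1-5 rating scale, which explains the large RMSE of this model.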
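
As an illustration of item-to-item similarity, the Python 3 sketch below computes the Jaccard similarity of two items from the sets of users who rated them. Jaccard is assumed here as one common choice; which measure GraphLab applies depends on the model settings. The helper names and the toy rows are made up.

```python
def item_user_sets(rows):
    """Map each item_id to the set of user_ids that rated it."""
    sets = {}
    for user_id, item_id, _ in rows:
        sets.setdefault(item_id, set()).add(user_id)
    return sets


def jaccard_similarity(users_a, users_b):
    """Shared users divided by all users who rated either item."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / float(len(union))


# Toy (user_id, item_id, rating) triples, made up for illustration.
toy_rows = [(1, 10, 4), (2, 10, 5), (3, 10, 3),
            (1, 11, 4), (2, 11, 2), (4, 11, 5)]
sets = item_user_sets(toy_rows)
sim = jaccard_similarity(sets[10], sets[11])
print(sim)  # users {1, 2} are shared out of {1, 2, 3, 4}
```

Note that this measure ignores the rating values and only looks at who interacted with which item.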

In [13]:
item_similarity_recommender = gl.recommender.item_similarity_recommender.create(sf_train,target='rating')
print "Test RMSE on model", item_similarity_recommender.evaluate_rmse(sf_test,'rating')['rmse_overall'] 

#Return a score prediction for the user ids and item ids in the provided data set.
item_similarity_recommender.predict(sf_test, new_observation_data=None, new_user_data=None, new_item_data=None)


Test RMSE on model 3.65552081794


dtype: float
Rows: 21995
[0.017236971464313446, 0.009476357612057009, 0.01802758553932453, 0.023346427522721837, 0.0029839809856487317, 0.0, 0.03946353712974236, 0.04077584894610123, 0.0, 0.07792791800621228, 0.0011947430590147612, 0.0078731369972229, 0.014337611198425294, 0.05699111971744271, 0.02318447910317587, 0.006497731195628973, 0.019313114881515502, 0.01754558310613363, 0.05419876317531742, 0.018670507217651088, 0.008293634653091431, 0.0, 0.012870118424699113, 0.01915279472315753, 0.024100208628004875, 0.045156659889806265, 0.06575820256363261, 0.0, 0.09463245808323727, 0.09022698737680912, 0.019285941627663627, 0.0018903595813806505, 0.010341822554212097, 0.008492468162016435, 0.0, 0.004521899270695566, 0.0008060800101228816, 0.004160178024959202, 0.05336611120255439, 0.03746527704325589, 0.04402656961419729, 0.006594951857220043, 0.0013480392156862745, 0.010793223772963432, 0.025391126016400895, 0.003909168695961988, 0.0028300979839903967, 0.009112366392642637, 0.002478433742

**Top-K Recommendations**

In [14]:
k=5
popularity_top_k = popularity_recommender.recommend(k=k)
factorization_top_k = factorization_recommender.recommend(k=k)
item_similarity_top_k = item_similarity_recommender.recommend(k=k)
print factorization_top_k

+---------+---------+---------------+------+
| user_id | item_id |     score     | rank |
+---------+---------+---------------+------+
|   115   |   408   | 4.72377209859 |  1   |
|   115   |   169   | 4.68151967721 |  2   |
|   115   |   483   | 4.67288379865 |  3   |
|   115   |   114   | 4.65567391114 |  4   |
|   115   |    64   | 4.60892932611 |  5   |
|   253   |   408   | 4.75145004997 |  1   |
|   253   |   169   | 4.70919762859 |  2   |
|   253   |   114   | 4.68335186252 |  3   |
|   253   |   318   | 4.60510693798 |  4   |
|   253   |   513   | 4.60478090057 |  5   |
+---------+---------+---------------+------+
[2840 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
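
Conceptually, a top-k recommendation for one user means ranking the candidate items by predicted score and keeping the k best, usually after skipping items the user has already rated. A minimal Python 3 sketch of that selection step follows; the function `top_k` and the toy scores (loosely modeled on the table above) are made up, and this is not how GraphLab implements `recommend`.

```python
import heapq


def top_k(scores, k, known_items=()):
    """Return the k highest-scored (item_id, score) pairs, skipping known items."""
    candidates = ((item, score) for item, score in scores.items()
                  if item not in known_items)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Toy predicted scores for a single user (made up for illustration).
scores = {408: 4.72, 169: 4.68, 483: 4.67, 114: 4.66, 64: 4.61, 50: 4.50}
best = top_k(scores, k=3, known_items={169})
print(best)
```

`heapq.nlargest` avoids sorting the full candidate list, which helps when there are many more items than k.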


# Evaluation: Precision/Recall, Confusion Matrix

**Precision/Recall comparison between the three models**

Compare the prediction or recommendation performance of recommender models on a common test dataset.

Models that are trained to predict ratings are compared separately from models that are trained without target ratings. The ratings prediction models are compared on root-mean-squared error, and the rest are compared on precision-recall.
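
To clarify what the precision-recall comparison measures, here is a minimal Python 3 sketch for a single user and a single cutoff k. Precision is the share of the top-k recommendations that appear in the user's test items, and recall is the share of the user's test items that appear in the top k. The function name and the toy data are made up; the averaging over users that GraphLab reports is not shown.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


# Toy data: a ranked recommendation list and the items the user rated
# in the test set (both made up for illustration).
recommended = [408, 169, 483, 114, 64]
relevant = {169, 64, 300}
p5, r5 = precision_recall_at_k(recommended, relevant, 5)
p2, r2 = precision_recall_at_k(recommended, relevant, 2)
print(p5, r5)
print(p2, r2)
```

Raising the cutoff can only keep recall the same or increase it, while precision may go either way.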

In [15]:
models = [popularity_recommender,factorization_recommender,item_similarity_recommender]
model_names = ['popularity_recommender','factorization_recommender',
               'item_similarity_recommender']
precision_recall = gl.recommender.util.compare_models(sf_test,
                                                      models,metric='precision_recall',
                                                      model_names=model_names)

PROGRESS: Evaluate model popularity_recommender

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    |        0.0        |        0.0        |
|   4    |        0.0        |        0.0        |
|   5    | 0.000352112676056 | 1.41980917765e-05 |
|   6    | 0.000293427230047 | 1.41980917765e-05 |
|   7    | 0.000251509054326 | 1.41980917765e-05 |
|   8    | 0.000220070422535 | 1.41980917765e-05 |
|   9    | 0.000782472613459 | 6.46199346946e-05 |
|   10   | 0.000704225352113 | 6.46199346946e-05 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model factorization_recommender

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  me

In [16]:
rmse_comparison = gl.recommender.util.compare_models(sf_test,
                                                     models,metric='rmse',
                                                     model_names=model_names)

PROGRESS: Evaluate model popularity_recommender
('\nOverall RMSE: ', 1.0285619131890746)

Per User RMSE (best)
+---------+-------+----------------+
| user_id | count |      rmse      |
+---------+-------+----------------+
|    25   |   16  | 0.451728529612 |
+---------+-------+----------------+
[1 rows x 3 columns]


Per User RMSE (worst)
+---------+-------+---------------+
| user_id | count |      rmse     |
+---------+-------+---------------+
|   405   |  189  | 2.02569465546 |
+---------+-------+---------------+
[1 rows x 3 columns]


Per Item RMSE (best)
+---------+-------+------+
| item_id | count | rmse |
+---------+-------+------+
|   1545  |   1   | 0.0  |
+---------+-------+------+
[1 rows x 3 columns]


Per Item RMSE (worst)
+---------+-------+------+
| item_id | count | rmse |
+---------+-------+------+
|   1306  |   1   | 4.0  |
+---------+-------+------+
[1 rows x 3 columns]

PROGRESS: Evaluate model factorization_recommender
('\nOverall RMSE: ', 0.9402239376390233)

Per U

In [17]:
#targets = sf_test
#predictions = item_similarity_recommender.predict(sf_test, new_observation_data=None, new_user_data=None, new_item_data=None)

#gl.evaluation.confusion_matrix(targets, predictions)

**Conclusion**

In this session we used three recommendation models on the given dataset and compared them using graphlab.recommender.util.compare_models(). We found that the **factorization model** performed better than the other two models and produced the lowest **RMSE** for the given dataset.


<img src="RMSE.PNG" width=300 height =300>