# Recommender with GraphLab Create

## 0. Register, install and launch

* Register an account with [GraphLab](https://turi.com/)
* Follow the instructions in the email you receive to install GraphLab Create
* Launch GraphLab Create

In [1]:
import numpy as np
import graphlab
import pandas as pd
import matplotlib.pyplot as plt

## 1. Load your data into Dato's SFrame type.

In [2]:
df = pd.read_table('data/u.data',
                   names=["user", "movie", "rating", "timestamp"])
sf = graphlab.SFrame(df[['user', 'movie', 'rating']])

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\songy\AppData\Local\Temp\graphlab_server_1498840243.log.0


In [3]:
sf.head(3)

user,movie,rating
196,242,3
186,302,3
22,377,1


## 2. Create a matrix factorization model.



In [4]:
rec = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False)

## 3. Call the `predict` method on your input data to get the predicted rating for user 1 of movie 100.

In [5]:
one_datapoint_sf = graphlab.SFrame({'user': [1], 'movie': [100]})

In [7]:
one_datapoint_sf

movie,user
100,1


In [8]:
print "rating:", rec.predict(one_datapoint_sf)[0]

rating: 4.7274451059


## 4. On the returned model object, call the `list_fields` method to see what kind of data is stored for your model.

In [9]:
rec.list_fields()

['adagrad_momentum_weighting',
 'additional_iterations_if_unhealthy',
 'binary_target',
 'coefficients',
 'data_load_time',
 'init_random_sigma',
 'item_id',
 'item_side_data_column_names',
 'item_side_data_column_types',
 'linear_regularization',
 'max_iterations',
 'model_name',
 'nmf',
 'num_factors',
 'num_features',
 'num_item_side_features',
 'num_items',
 'num_observations',
 'num_tempering_iterations',
 'num_user_side_features',
 'num_users',
 'observation_data_column_names',
 'random_seed',
 'regularization',
 'regularization_type',
 'sgd_convergence_interval',
 'sgd_convergence_threshold',
 'sgd_max_trial_iterations',
 'sgd_sampling_block_size',
 'sgd_step_adjustment_interval',
 'sgd_step_size',
 'sgd_trial_sample_minimum_size',
 'sgd_trial_sample_proportion',
 'side_data_factorization',
 'solver',
 'step_size_decrease_rate',
 'target',
 'tempering_regularization_start_value',
 'track_exact_loss',
 'training_rmse',
 'training_stats',
 'training_time',
 'user_id',
 'user_side_

## 5. Inspect the output of `get('coefficients')` to see what information your model uses.

In [10]:
rec['coefficients'] 

{'intercept': 3.5298599999999993, 'movie': Columns:
 	movie	int
 	linear_terms	float
 	factors	array
 
 Rows: 1682
 
 Data:
 +-------+--------------+-------------------------------+
 | movie | linear_terms |            factors            |
 +-------+--------------+-------------------------------+
 |  242  |     0.0      | [-0.0904000401497, -0.0860... |
 |  302  |     0.0      | [-0.00110040407162, -0.007... |
 |  377  |     0.0      | [-0.0872582793236, -0.0173... |
 |   51  |     0.0      | [-0.0411782637239, -0.0575... |
 |  346  |     0.0      | [-0.152245894074, -0.06872... |
 |  474  |     0.0      | [-0.0142985815182, -0.0249... |
 |  265  |     0.0      | [0.0623504184186, -0.03908... |
 |  465  |     0.0      | [0.0120684178546, -0.07836... |
 |  451  |     0.0      | [0.0150575470179, -0.07730... |
 |   86  |     0.0      | [-0.0385060533881, -0.0901... |
 +-------+--------------+-------------------------------+
 [1682 rows x 3 columns]
 Note: Only the head of the SFrame is p

## 6. There should be a `movie` and a `user` array in the coefficients. What are the dimensions of this data?

In [18]:
movie_sf = rec['coefficients']['movie']
print movie_sf
print len(movie_sf)
print movie_sf['factors'][0]
print len(movie_sf['factors'][0])


user_sf = rec['coefficients']['user']
print user_sf
print len(user_sf)
print len(user_sf['factors'][0])

+-------+--------------+-------------------------------+
| movie | linear_terms |            factors            |
+-------+--------------+-------------------------------+
|  242  |     0.0      | [-0.0904000401497, -0.0860... |
|  302  |     0.0      | [-0.00110040407162, -0.007... |
|  377  |     0.0      | [-0.0872582793236, -0.0173... |
|   51  |     0.0      | [-0.0411782637239, -0.0575... |
|  346  |     0.0      | [-0.152245894074, -0.06872... |
|  474  |     0.0      | [-0.0142985815182, -0.0249... |
|  265  |     0.0      | [0.0623504184186, -0.03908... |
|  465  |     0.0      | [0.0120684178546, -0.07836... |
|  451  |     0.0      | [0.0150575470179, -0.07730... |
|   86  |     0.0      | [-0.0385060533881, -0.0901... |
+-------+--------------+-------------------------------+
[1682 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
1682
[array('d', [-0.09040004014968872, -0.0860123

## 7. Without using the `predict` method, compute the predicted rating of movie 100 for user 1.

In [28]:
movie_array = movie_sf[movie_sf['movie'] == 100]['factors'][0]
user_array = user_sf[user_sf['user'] == 1]['factors'][0]
intercept = rec['coefficients']['intercept']
print "rating:", np.dot(movie_array, user_array) + intercept    # matches rec.predict above up to floating-point rounding

rating: 4.72744503168


## 8. What is the intercept term? Can you reproduce the calculation of this value on your own?

*The intercept term is the global bias: the baseline rating that every prediction starts from before the user and movie terms are added. We can reproduce it by taking the average of all the ratings in the original dataset.*

In [29]:
print "intercept:", intercept
print "average:", np.average(sf['rating'])

intercept: 3.52986
average: 3.52986


## 9. Call the `predict` method on your input data to get the predicted ratings, and verify that the RMSE reported by the model diagnostics is correct.

In [30]:
sf

user,movie,rating
196,242,3
186,302,3
22,377,1
244,51,2
166,346,1
298,474,4
115,265,2
253,465,5
305,451,3
6,86,3


In [31]:
from sklearn.metrics import mean_squared_error

predictions = rec.predict(sf)
rmse = np.sqrt(mean_squared_error(sf['rating'], predictions))

print "graphlab's reported rmse:", rec['training_rmse']
print "calculated rmse:", rmse

graphlab's reported rmse: 0.722516471207
calculated rmse: 0.722516471207
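For reference, the RMSE is simply the square root of the mean of the squared differences between actual and predicted ratings. A minimal standard-library sketch of the same calculation, using a few made-up ratings rather than the model's predictions (the helper name `rmse` is an assumption for illustration):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Hypothetical ratings: every prediction is off by exactly 0.5.
actual = [3, 3, 1, 2]
predicted = [3.5, 2.5, 1.5, 2.5]
print("rmse: %.3f" % rmse(actual, predicted))
```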


## 10. Compare the summary statistics of the original data with your predictions. (Use `pd.Series(ratings).describe()` to do this.)

Does anything stand out about the min/max?

In [32]:
pd.Series(sf['rating']).describe()

count    100000.000000
mean          3.529860
std           1.125674
min           1.000000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
dtype: float64
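The original ratings are bounded between 1 and 5, but nothing in a factorization model constrains its predictions to that range, so the min/max of the predictions may fall outside it. The sketch below uses made-up prediction values (not output from `rec.predict`) to show how to spot this and one possible way to handle it. Whether clipping is appropriate depends on how the predictions will be used.

```python
def clip(values, low, high):
    """Force every value into the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]

# Hypothetical predictions; two of them fall outside the 1-5 rating scale.
raw_predictions = [0.4, 2.7, 4.1, 5.6]
clipped = clip(raw_predictions, 1.0, 5.0)

print("raw min/max: %s / %s" % (min(raw_predictions), max(raw_predictions)))
print("clipped min/max: %s / %s" % (min(clipped), max(clipped)))
```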

## 11. Regularization - GraphLab provides two regularization parameters.

The parameter `regularization` controls the value of lambda. Using what you know about regularization from linear regression, what effect would you expect this to have on solutions? What would you expect to see in the difference of training RMSE between setting this parameter to 0 or 0.1? Try it.

In [33]:
random_seed = 0
rec2 = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False,
            regularization=0,
            random_seed=random_seed)
print "training rmse with regularization 0:", rec2['training_rmse']   # 0.725

regularization_param = 1e-4
rec3 = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False,
            regularization=regularization_param,
            random_seed=random_seed)
print "training rmse with regularization %s:"%regularization_param, rec3['training_rmse']

training rmse with regularization 0: 0.724137738732


training rmse with regularization 0.0001: 0.80989106191


In [34]:
regularization_param = 0.1
rec3 = graphlab.recommender.factorization_recommender.create(
            sf,
            user_id='user',
            item_id='movie',
            target='rating',
            solver='als',
            side_data_factorization=False,
            regularization=regularization_param,
            random_seed=random_seed)
print "training rmse with regularization %s:"%regularization_param, rec3['training_rmse']

training rmse with regularization 0.1: 1.12566797076
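The training RMSE rises from about 0.724 to 0.810 to 1.126 as the regularization grows, which is what L2 regularization in linear regression would lead us to expect: the penalty shrinks the coefficients toward zero, so the fit to the training data gets worse. The run with `regularization=0.1` has a training RMSE almost identical to the standard deviation of the ratings reported in step 10 (1.1257), which suggests the factors have shrunk to nearly zero and the model predicts close to the global mean. The sketch below shows the same shrinkage on a one-variable ridge regression with toy data. The helper names and numbers are assumptions for illustration, and this is not GraphLab's solver.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def training_rmse(xs, ys, w):
    """RMSE of the predictions w*x on the training points."""
    n = float(len(xs))
    return (sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5

# Toy data generated by y = 2x, so the unregularized slope is exactly 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in [0.0, 1.0, 100.0]:
    w = fit_ridge_slope(xs, ys, lam)
    print("lambda=%s  w=%.3f  training rmse=%.3f" % (lam, w, training_rmse(xs, ys, w)))
```

As lambda grows, the slope moves toward zero and the training error increases. The payoff of regularization, if any, shows up on held-out data rather than on the training set.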


## Extra Point #1. Tune your model to find the best parameters. 

What parameters are being tuned by this procedure?

In [35]:
kfolds = graphlab.cross_validation.KFold(sf, 5)
params = dict(user_id='user', 
              item_id='movie', 
              target='rating',
              solver='als', 
              side_data_factorization=False)
paramsearch = graphlab.model_parameter_search.create(
                    kfolds,
                    graphlab.recommender.factorization_recommender.create,
                    params)

[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.job: Creating a LocalAsync environment called 'async'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Jun-30-2017-10-27-2100000' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Jun-30-2017-10-27-2100000' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: A job with name 'Model-Parameter-Search-Jun-30-2017-10-27-2100000' already exists. Renaming the job to 'Model-Parameter-Search-Jun-30-2017-10-27-2100000-71fb6'.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Jun-30-2017-10-27-2100000-71fb6' ready for execution
[INFO] graphlab.deploy.map_job: Job: 'Model-Parameter-Search-Jun-30-2017-10-27-2100000-71fb6' scheduled.
[INFO] graphlab.deploy.job: Validating job.
[INFO] graphlab.deploy.map_job: Validation complete. Job: 'Model-Parameter-Search-Jun-30-2017-10-27-2100001' ready for execution


In [36]:
paramsearch.get_status()

{'Canceled': 0, 'Completed': 0, 'Failed': 0, 'Pending': 0, 'Running': 50}

#### Best models by different metrics

In [37]:
from pprint import pprint

print "best params by recall@5:"
pprint(paramsearch.get_best_params('mean_validation_recall@5'))
print

print "best params by precision@5:"
pprint(paramsearch.get_best_params('mean_validation_precision@5'))
print

print "best params by rmse:"
pprint(paramsearch.get_best_params('mean_validation_rmse'))

best params by recall@5:
{'item_id': 'movie',
 'linear_regularization': 1e-07,
 'max_iterations': 50,
 'num_factors': 64,
 'regularization': 0.0001,
 'side_data_factorization': False,
 'solver': 'als',
 'target': 'rating',
 'user_id': 'user'}

best params by precision@5:
{'item_id': 'movie',
 'linear_regularization': 1e-07,
 'max_iterations': 50,
 'num_factors': 64,
 'regularization': 0.0001,
 'side_data_factorization': False,
 'solver': 'als',
 'target': 'rating',
 'user_id': 'user'}

best params by rmse:
{'item_id': 'movie',
 'linear_regularization': 1e-05,
 'max_iterations': 50,
 'num_factors': 16,
 'regularization': 1e-09,
 'side_data_factorization': False,
 'solver': 'als',
 'target': 'rating',
 'user_id': 'user'}
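In the results above, the entries that differ between the best parameter sets are `linear_regularization`, `num_factors` and `regularization`. Each candidate is scored on the five folds produced by `KFold(sf, 5)`, and the `mean_validation_*` metrics are averages over those folds. The sketch below shows the idea behind a k-fold split using plain Python. The round-robin assignment and the name `kfold_indices` are simplifying assumptions, and how GraphLab actually assigns rows to folds is not shown in this notebook.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    # Simplifying assumption: rows are assigned to folds round-robin.
    fold_of_row = [i % n_folds for i in range(n_rows)]
    for fold in range(n_folds):
        train = [i for i in range(n_rows) if fold_of_row[i] != fold]
        valid = [i for i in range(n_rows) if fold_of_row[i] == fold]
        yield train, valid

folds = list(kfold_indices(10, 5))
for train, valid in folds:
    print("train=%s valid=%s" % (train, valid))
```

Every row lands in the validation set exactly once, so each parameter set is evaluated on data it was not trained on.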


## What are the latent features?

In [38]:
lf_df = df.set_index(['user', 'movie'])[['rating']].unstack().fillna(0)
lf_df

user \ movie,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,5.0,0.0,0.0,5.0,5.0,5.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,4.0,0.0,0.0,4.0,0.0,0.0,4.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
lf_df.columns

MultiIndex(levels=[[u'rating'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 

In [44]:
from scipy.spatial.distance import cdist

lf_df = df.set_index(['user', 'movie'])[['rating']].unstack().fillna(0)
user_df = user_sf[['user', 'factors']].sort('user').unpack('factors').to_dataframe()
# Note: cdist's 'correlation' metric is a distance (1 - Pearson r), not the correlation itself.
corr = cdist(lf_df.values.T, user_df.values.T, 'correlation')
corr_df = pd.DataFrame(corr)
corr_df.index = lf_df.columns.get_loc_level('rating')[1]

movies = pd.read_table('data/u.item', sep='|', index_col=0, header=None,
                       names=['movie id', 'movie title', 'release date',
                              'video release date', 'imdb url', 'unknown',
                              'action', 'adventure', 'animation',
                              'children\'s', 'comedy', 'crime',
                              'documentary', 'drama', 'fantasy',
                              'film-noir', 'horror', 'musical', 'mystery',
                              'romance', 'sci-fi', 'thriller', 'war',
                              'western'])
movies_with_corr = pd.concat([movies, corr_df], axis=1)

for i in xrange(1, 9):
    print "TOP MOVIES FOR FACTOR {0}:".format(i)
    top_five_movies = movies_with_corr.sort_values(by=i, ascending=False)['movie title'][:5]
    print '    ' + '\n    '.join(top_five_movies)
    print

TOP MOVIES FOR FACTOR 1:
    Trainspotting (1996)
    Boogie Nights (1997)
    Chasing Amy (1997)
    English Patient, The (1996)
    Kolya (1996)

TOP MOVIES FOR FACTOR 2:
    Sense and Sensibility (1995)
    English Patient, The (1996)
    Evita (1996)
    Mr. Holland's Opus (1995)
    Anna Karenina (1997)

TOP MOVIES FOR FACTOR 3:
    The Innocent (1994)
    Man from Down Under, The (1943)
    Quartier Mozart (1992)
    Lashou shentan (1992)
    Symphonie pastorale, La (1946)

TOP MOVIES FOR FACTOR 4:
    Careful (1992)
    I, Worst of All (Yo, la peor de todas) (1990)
    Hostile Intentions (1994)
    Tigrero: A Film That Was Never Made (1994)
    Eye of Vichy, The (Oeil de Vichy, L') (1993)

TOP MOVIES FOR FACTOR 5:
    Twelve Monkeys (1995)
    Seven (Se7en) (1995)
    Trainspotting (1996)
    Scream (1996)
    Fargo (1996)

TOP MOVIES FOR FACTOR 6:
    Brazil (1985)
    Woman in Question, The (1950)
    Yankee Zulu (1994)
    Hostile Intentions (1994)
    The Courtyard (1995)

T



In [50]:
corr_df

movie,0,1,2,3,4,5,6,7,8
1,0.984534,0.903329,1.117066,0.918638,1.071789,1.061918,0.971362,1.028649,1.115279
2,0.957805,0.932939,0.995835,0.937575,1.028568,1.036035,0.991368,1.044315,1.087666
3,1.013820,1.073627,0.984958,0.947655,1.024219,1.116703,1.002719,1.069009,1.048956
4,1.023088,1.000132,0.996084,0.948219,0.970763,1.127082,1.053105,1.014609,1.089215
5,1.028357,0.966646,0.948531,0.976337,1.070833,1.062106,1.020686,0.960413,1.045418
6,1.022585,1.042769,0.979856,0.981120,0.903542,0.961277,1.077269,1.033274,0.970165
7,1.081155,1.087594,1.005814,0.893142,1.012686,1.181903,1.077578,1.101697,1.139220
8,1.042985,0.955531,1.063520,0.992091,0.971959,1.016892,1.064356,0.985099,1.045352
9,1.049508,1.091803,1.068178,0.978173,0.855845,1.077530,1.015133,0.961991,0.972399
10,1.026896,1.096171,1.027011,0.964624,0.923890,1.041566,1.075551,1.067005,0.969838


In [49]:
corr.shape

(1682L, 9L)
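The shape has one row per movie and nine columns. Since `user_df` keeps the `user` column alongside the unpacked factors, column 0 presumably corresponds to the user id and columns 1 to 8 to the latent factors, which would explain why the loops start at index 1. The values themselves are correlation distances, defined as 1 minus the Pearson correlation: 0 means perfectly correlated, 1 means uncorrelated, and 2 means perfectly anti-correlated. Under that reading, sorting in descending order ranks movies by the largest distance. A minimal standard-library sketch of the metric (the helper name `correlation_distance` and the toy vectors are assumptions for illustration):

```python
import math

def correlation_distance(u, v):
    """1 - Pearson correlation between two equal-length sequences."""
    n = float(len(u))
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    du = [x - mean_u for x in u]
    dv = [x - mean_v for x in v]
    cov = sum(a * b for a, b in zip(du, dv))
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return 1.0 - cov / norm

ratings = [1.0, 2.0, 3.0, 4.0]
same_direction = correlation_distance(ratings, [2.0, 4.0, 6.0, 8.0])   # perfectly correlated
opposite = correlation_distance(ratings, [4.0, 3.0, 2.0, 1.0])         # perfectly anti-correlated
print("same direction: %.3f" % same_direction)
print("opposite: %.3f" % opposite)
```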

## Top topics for each latent feature

In [47]:
from collections import Counter

print "TOP TOPICS FOR EACH FACTOR:"
for i in xrange(1, 9):
    scores = Counter()
    for topic in ['action', 'adventure', 'animation', 'children\'s',
                  'comedy', 'crime', 'documentary', 'drama', 'fantasy',
                  'film-noir', 'horror', 'musical', 'mystery', 'romance',
                  'sci-fi', 'thriller', 'war', 'western']:
        scores[topic] = np.dot(movies_with_corr[i], movies_with_corr[topic]) / np.sum(movies_with_corr[topic])
    top_topics = [topic for topic, score in scores.most_common(3)]
    print "    FACTOR {0}:  {1}".format(i, ', '.join(top_topics))

TOP TOPICS FOR EACH FACTOR:
    FACTOR 1:  crime, film-noir, documentary
    FACTOR 2:  musical, animation, children's
    FACTOR 3:  children's, drama, romance
    FACTOR 4:  fantasy, children's, adventure
    FACTOR 5:  crime, sci-fi, thriller
    FACTOR 6:  film-noir, musical, documentary
    FACTOR 7:  sci-fi, animation, war
    FACTOR 8:  sci-fi, action, adventure
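The score computed for each topic is the dot product of a factor's column with the genre's 0/1 indicator column, divided by the number of movies in that genre. In other words, it is the average of the factor column over the movies tagged with that genre. A small sketch with made-up numbers (the name `topic_score` and all values are hypothetical):

```python
def topic_score(factor_scores, genre_flags):
    """Average of factor_scores over the movies flagged (1) for the genre."""
    weighted = sum(s * g for s, g in zip(factor_scores, genre_flags))
    return weighted / float(sum(genre_flags))

# Hypothetical per-movie values for one factor, plus 0/1 genre indicators.
factor_scores = [0.9, 1.1, 1.0, 1.2]
is_comedy = [1, 0, 1, 0]
is_drama = [0, 1, 0, 1]

print("comedy: %.2f" % topic_score(factor_scores, is_comedy))
print("drama: %.2f" % topic_score(factor_scores, is_drama))
```

Dividing by the genre count keeps large genres from dominating simply because they contain more movies.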
