In [6]:
import graphlab

In [9]:
import os
if os.path.exists('stack_overflow'):
    sf = graphlab.SFrame('stack_overflow')
else:
    sf = graphlab.SFrame('https://static.turi.com/datasets/stack_overflow')
    sf.save('stack_overflow')

Visually explore the above data using GraphLab Canvas:

Create a new column called *Tags* where each element is a list of all the tags used for that question:

In [10]:
sf = sf.pack_columns(column_prefix='Tag', new_column_name='Tags')

Reduce your SFrame to just the OwnerUserId column and the Tags column you created in the previous step.

In [13]:
sf = sf[['OwnerUserId', 'Tags']]

In [23]:
graphlab.canvas.set_target('ipynb')
sf.show()

Use the following Python function to remove the empty strings from each list in the Tags column.

In [14]:
def remove_empty(tags):
    return [tag for tag in tags if tag != '']

In [15]:
sf['Tags'] = sf['Tags'].apply(remove_empty)
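To make the effect of packing and cleaning concrete, here is a minimal plain-Python sketch on a single toy row. The column names `Tag1` to `Tag3` and the values are hypothetical, and `pack_tags` only mimics the idea behind `pack_columns`; it does not call GraphLab.

```python
# Toy row standing in for one question; names and values are made up.
row = {'OwnerUserId': '42', 'Tag1': 'python', 'Tag2': '', 'Tag3': 'pandas'}

def pack_tags(row, prefix='Tag'):
    """Collect the values of every column whose name starts with `prefix`."""
    return [value for name, value in sorted(row.items())
            if name.startswith(prefix)]

def remove_empty(tags):
    """Drop empty strings, as the notebook's helper does."""
    return [tag for tag in tags if tag != '']

packed = pack_tags(row)          # still contains the empty slot
cleaned = remove_empty(packed)   # only the tags that were actually used
print(packed)
print(cleaned)
```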

Create a new SFrame called *user_tag* that has a row for every (user, tag) pair.

In [16]:
user_tag = sf.stack(column_name='Tags', new_column_name='Tag')
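Conceptually, stacking turns each list element into its own row. The plain-Python sketch below illustrates that idea on made-up data; `stack_tags` is a hypothetical helper, not part of GraphLab, and it simply skips users whose tag list is empty.

```python
def stack_tags(rows):
    """Return one (user, tag) pair for every tag in every user's list."""
    pairs = []
    for user, tags in rows:
        for tag in tags:
            pairs.append((user, tag))
    return pairs

# Made-up (user, tags) rows for illustration only.
rows = [('42', ['python', 'pandas']), ('7', ['javascript']), ('9', [])]
pairs = stack_tags(rows)
print(pairs)
```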

In [25]:
graphlab.canvas.set_target('ipynb')
user_tag.show()

Create a new SFrame called `user_tag_count` that has three columns:
- OwnerUserId
- Tag
- Count
where `Count` contains the number of times the given `Tag` was used by that particular `OwnerUserId`. 

In [19]:
user_tag_count = user_tag.groupby(['OwnerUserId', 'Tag'], graphlab.aggregate.COUNT)
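The grouping step amounts to counting how often each (user, tag) pair occurs. A minimal plain-Python sketch of that counting, using made-up pairs and the standard library's `collections.Counter` instead of GraphLab:

```python
from collections import Counter

# Made-up (user, tag) pairs; a repeated pair means the user reused the tag.
pairs = [('42', 'python'), ('42', 'python'), ('42', 'pandas'),
         ('7', 'javascript')]

# Count occurrences of each distinct (user, tag) pair.
pair_counts = Counter(pairs)

# Flatten into rows shaped like (OwnerUserId, Tag, Count).
count_rows = sorted((user, tag, n) for (user, tag), n in pair_counts.items())
print(count_rows)
```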

Visually explore this summarized version of your data set with GraphLab Canvas.

In [22]:
graphlab.canvas.set_target('ipynb')
user_tag_count.show()

Use `graphlab.recommender.create()` to create a model that can be used to recommend tags to each user.

In [26]:
m = graphlab.recommender.create(user_tag_count, user_id='OwnerUserId', item_id='Tag')

In [27]:
m

Class                            : ItemSimilarityRecommender

Schema
------
User ID                          : OwnerUserId
Item ID                          : Tag
Target                           : None
Additional observation features  : 0
User side features               : []
Item side features               : []

Statistics
----------
Number of observations           : 6352858
Number of users                  : 656102
Number of items                  : 44149

Training summary
----------------
Training time                    : 7.0618

Model Parameters
----------------
Model class                      : ItemSimilarityRecommender
threshold                        : 0.001
similarity_type                  : jaccard
training_method                  : auto

Other Settings
--------------
degree_approximation_threshold   : 4096
sparse_density_estimation_sample_size : 4096
max_data_passes                  : 4096
target_memory_usage              : 8589934592
seed_item_set_size               : 50
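The summary reports `similarity_type : jaccard`. As a rough illustration of what that measure means for two tags, the sketch below computes the Jaccard similarity of the sets of users who used each tag. The user IDs are made up, and the function is a simplified stand-in for, not a copy of, the library's internal computation.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))

# Hypothetical users who used each tag.
python_users = ['1', '2', '3']
django_users = ['2', '3', '4']
similarity = jaccard(python_users, django_users)
print(similarity)
```

Two tags score close to 1 when nearly the same users have used both, and 0 when no user has used both.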

Get all unique users from the first 10000 observations and save them as a variable called `users`.

In [28]:
users = user_tag_count.head(10000)['OwnerUserId'].unique()

Get 20 recommendations for each user in your list of users. Save these as a new SFrame called `recs`.

In [29]:
recs = m.recommend(users, k=20)

When people use recommendation systems for online commerce, it's often useful to be able to recommend products from a single category of items, e.g. recommending shoes to somebody who typically buys shirts.

To illustrate how this can be done with GraphLab Create, suppose we have a JavaScript user who is trying to learn Python. Below we will take just the JavaScript users and see which Python tags to recommend to them.

Create a variable called `javascript_users` that contains all unique users who have used the `javascript` tag.

In [31]:
javascript_users = user_tag_count['OwnerUserId'][user_tag_count['Tag'] == 'javascript'].unique()

In [35]:
javascript_users

dtype: str
Rows: 92719
['736642', '1002584', '991532', '382463', '1491856', '894788', '1041292', '946195', '663525', '534447', '120399', '1355611', '146471', '792206', '1503072', '1113987', '375106', '624305', '1197893', '130840', '1447743', '727074', '279481', '342303', '239240', '400514', '365440', '1338642', '203808', '1496062', '201244', '1236135', '71809', '367087', '1388052', '203839', '1260278', '1409843', '423103', '1165519', '675283', '99573', '739851', '557015', '307123', '488809', '514667', '1365038', '1486054', '1197815', '221548', '10286', '175878', '1595758', '38070', '1189880', '1354130', '576066', '251250', '1535196', '596983', '1034221', '678973', '601635', '733751', '1234376', '831694', '763349', '541070', '1359887', '57068', '1156058', '1296123', '1166862', '824954', '49086', '386584', '949627', '1670740', '77782', '186909', '1068754', '631015', '1479652', '514748', '859263', '945370', '1405151', '395650', '119271', '1427052', '409126', '1214590', '1319849', '611073'

Use the model you created above to find the 20 most similar items to the tag "python". Create a variable called `python_items` containing just these similar items.

In [36]:
python_items = m.get_similar_items(['python'], k=20)
python_items = python_items['similar']

For each user in `javascript_users`, make 5 recommendations among the items in `python_items`.

In [38]:
python_recs = m.recommend(users=javascript_users, items=python_items, k=5)
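Restricting recommendations to a candidate set can be pictured as filtering before ranking. The sketch below is a simplified illustration under assumed inputs: `scores` is a made-up mapping of item to score for one user, and `recommend_from` is a hypothetical helper that also skips items the user has already used.

```python
def recommend_from(scores, allowed_items, already_used, k=5):
    """Rank only the allowed, not-yet-used items and keep the top k."""
    candidates = [(item, score) for item, score in scores.items()
                  if item in allowed_items and item not in already_used]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Made-up scores for a single user.
scores = {'django': 0.9, 'numpy': 0.4, 'jquery': 0.8, 'flask': 0.6}
allowed = {'django', 'numpy', 'flask'}   # e.g. items similar to "python"
used = {'flask'}                          # tags the user already has
top = recommend_from(scores, allowed, used, k=2)
print(top)
```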

Use GraphLab Canvas to find out the 10 most often recommended items.

In [49]:
# graphlab.canvas.set_target('browser', port = 30000)
graphlab.canvas.set_target('ipynb')
python_recs.head()

OwnerUserId,Tag,score,rank
736642,css,0.216511654854,1
736642,mysql,0.16825915575,2
736642,sql,0.114513540268,3
736642,regex,0.105026602745,4
736642,database,0.0936569333076,5
1002584,html,0.0302163749128,1
1002584,sql,0.0272529044667,2
1002584,database,0.0259853198722,3
1002584,jquery,0.0259540467649,4
1002584,css,0.0256228898023,5
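Without Canvas, the most often recommended items can also be found by counting the `Tag` column. The sketch below does this in plain Python for only the ten rows shown above, so the counts are illustrative and say nothing about the full `python_recs` table.

```python
from collections import Counter

# Tag column of the ten rows printed above.
recommended_tags = ['css', 'mysql', 'sql', 'regex', 'database',
                    'html', 'sql', 'database', 'jquery', 'css']

tag_counts = Counter(recommended_tags)

# The most frequent tags first; 10 would be used on the full table.
top_tags = tag_counts.most_common(3)
print(top_tags)
```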


Save your model to a file.

In [42]:
m.save('my_model')

## Experimenting with new models

Create a train/test split of the user_tag_count data from the section above. Hint: Use random_split_by_user.

In [43]:
train, test = graphlab.recommender.util.random_split_by_user(user_tag_count,
                                                             user_id='OwnerUserId',
                                                             item_id='Tag')
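The idea behind a per-user split is that held-out rows come only from a sample of users, so that every test user still has some training data. The following is a simplified plain-Python sketch of that idea, not GraphLab's actual implementation; `split_by_user` and its parameters `test_fraction`, `max_users`, and `seed` are hypothetical names chosen for illustration.

```python
import random

def split_by_user(rows, test_fraction=0.2, max_users=1000, seed=0):
    """Hold out a random share of rows, but only for a sample of users."""
    rng = random.Random(seed)
    users = sorted({row[0] for row in rows})
    sampled = set(rng.sample(users, min(max_users, len(users))))
    train, test = [], []
    for row in rows:
        # Rows of non-sampled users always stay in the training set.
        if row[0] in sampled and rng.random() < test_fraction:
            test.append(row)
        else:
            train.append(row)
    return train, test

# Made-up (OwnerUserId, Tag, Count) rows.
rows = [('u1', 'python', 3), ('u1', 'pandas', 1), ('u2', 'javascript', 5),
        ('u2', 'css', 2), ('u3', 'sql', 1)]
toy_train, toy_test = split_by_user(rows, test_fraction=0.5, max_users=2)
print(len(toy_train), len(toy_test))
```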

As above, create a recommender model, but train it only on the training set.

In [44]:
m1 = graphlab.recommender.create(train, user_id='OwnerUserId', item_id='Tag')

Create a matrix factorization model that is better at ranking by setting the `ranking_regularization` argument to 1.

In [45]:
m2 = graphlab.ranking_factorization_recommender.create(train,
                                                       user_id='OwnerUserId',
                                                       item_id='Tag',
                                                       target='Count',
                                                       ranking_regularization=1)

Retrieve the coefficients for each user that were learned by this algorithm.

In [46]:
m2['coefficients']['OwnerUserId']

OwnerUserId,linear_terms,factors
661140,-0.414406687021,"[0.551463365555, -0.0973705947399, ..."
739736,0.375382274389,"[-0.214054390788, 0.0523217916489, ..."
515160,0.656349599361,"[-0.762495756149, 0.26028418541, ..."
786238,-0.213030233979,"[0.454403668642, -0.609466075897, ..."
657536,-1.34143877029,"[0.239265173674, 0.515760600567, ..."
740293,-0.420180112123,"[-0.130076482892, -0.103303998709, ..."
155005,-0.27473115921,"[0.0576725900173, -0.373588114977, ..."
722508,-0.659019708633,"[0.181462213397, 0.718444824219, ..."
1632716,-0.606574058533,"[-0.572928726673, -0.286513179541, ..."
1294775,-1.15090548992,"[0.212015375495, 0.40648111701, ..."
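The `linear_terms` and `factors` columns are the per-user pieces of a factorization model's score. As a rough sketch, assuming the usual form of such models (a global intercept, plus a linear term for the user and for the item, plus the dot product of their factor vectors), a predicted score could be assembled as follows. All numbers are made up and `predict_score` is a hypothetical helper, not a GraphLab function.

```python
def predict_score(intercept, user_linear, item_linear,
                  user_factors, item_factors):
    """Intercept + linear terms + dot product of the latent factors."""
    dot = sum(u * v for u, v in zip(user_factors, item_factors))
    return intercept + user_linear + item_linear + dot

# Made-up coefficients for one user and one tag.
score = predict_score(intercept=0.5,
                      user_linear=-0.25,
                      item_linear=0.25,
                      user_factors=[1.0, 2.0],
                      item_factors=[0.5, -0.25])
print(score)
```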


Compare the predictive performance of the two models. Given the ability to make 10 recommendations, which model predicted the highest proportion of items in the test set (on average)?
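Precision at a cutoff is the share of the top-k recommendations that appear in the user's test items, and recall is the share of the user's test items that appear in the top k. A minimal plain-Python sketch for a single user, with made-up recommendations and test items:

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Made-up ranked recommendations and held-out test items.
recommended = ['css', 'sql', 'html', 'regex']
relevant = {'sql', 'regex', 'django'}
precision, recall = precision_recall_at_k(recommended, relevant, k=2)
print(precision, recall)
```

The tables that follow report these two quantities averaged over users, which is why recall can only grow or stay flat as the cutoff increases.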

In [47]:
results = graphlab.recommender.util.compare_models(test, [m1, m2],
                                                   metric='precision_recall')

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+----------------+
| cutoff |  mean_precision |  mean_recall   |
+--------+-----------------+----------------+
|   1    |  0.127889060092 | 0.057584358032 |
|   2    |  0.106317411402 | 0.097191224936 |
|   3    | 0.0883410374936 | 0.121982907201 |
|   4    | 0.0793528505393 | 0.137084639141 |
|   5    | 0.0699537750385 | 0.148733090526 |
|   6    | 0.0649717514124 | 0.16522260928  |
|   7    |  0.060092449923 | 0.17699772359  |
|   8    | 0.0558551617874 | 0.191185308966 |
|   9    | 0.0529019003595 | 0.20007385257  |
|   10   | 0.0499229583975 | 0.207194577398 |
+--------+-----------------+----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.106317411402 | 0.046

In [51]:
model_performance = graphlab.compare(test, [m1, m2], user_sample=0.05)
graphlab.show_comparison(model_performance, [m1, m2])

compare_models: using 32 users to estimate model performance
PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |      0.1875     | 0.0693539029536 |
|   2    |     0.171875    |  0.141311972574 |
|   3    |  0.114583333333 |  0.141311972574 |
|   4    |    0.1015625    |  0.172957542194 |
|   5    |     0.09375     |  0.188978111814 |
|   6    | 0.0885416666667 |  0.228040611814 |
|   7    | 0.0803571428571 |  0.228436181435 |
|   8    |     0.078125    |  0.230081751055 |
|   9    | 0.0763888888889 |  0.236540084388 |
|   10   |     0.078125    |  0.278602320675 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-----------------+----------------+
| cutoff |  mean_precision |  mean_recall   |
+-------