In [4]:
import graphlab as gl
gl.canvas.set_target('ipynb')

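The following code snippet loads the books data provided for this training. If saved SFrames already exist they are loaded directly; otherwise the CSV files are parsed and the resulting SFrames are saved for next time.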

In [5]:
import os
if os.path.exists('data/books/ratings'):
    ratings = gl.SFrame('data/books/ratings')
    items = gl.SFrame('data/books/items')
    users = gl.SFrame('data/books/users')
else:
    ratings = gl.SFrame.read_csv('data/books/book-ratings.csv')
    ratings.save('data/books/ratings')
    items = gl.SFrame.read_csv('data/books/book-data.csv')
    items.save('data/books/items')
    users = gl.SFrame.read_csv('data/books/user-data.csv')
    users.save('data/books/users')

Visually explore the above data using GraphLab Canvas.

In [6]:
ratings.show()

## Recommendation systems

In this section we will build a model that can be used to recommend new books to users.

### Creating a Model

Use `gl.recommender.create()` to create a model that can be used to recommend books to each user.

In [7]:
m = gl.recommender.create(ratings, user_id='name', item_id='book')

PROGRESS: Recsys training: model = item_similarity
PROGRESS:     To use this column as the target, set target = "rating" and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 98077 observations with 31856 users and 11121 items.
PROGRESS:     Data prepared in: 0.214488s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 11121 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 2.03572         |
PROGRESS: | 2000            | 2.1044          |
PROGRESS: | 3000            | 2.16984         |
PROGRESS: | 4000            | 2.23021         |
PROGRESS: | 5000            | 2.28726         |
PROGRESS: | 6000            | 2.34162         |
PROGRESS: | 7000            | 2.42102         |
PROGRESS: | 8000            | 2.50034         |
PROGRESS: | 9000            | 2.58374         |

Print a summary of the model by simply entering the name of the object.

In [8]:
m

Class                           : ItemSimilarityRecommender

Schema
------
User ID                         : name
Item ID                         : book
Target                          : None
Additional observation features : 0
Number of user side features    : 0
Number of item side features    : 0

Statistics
----------
Number of observations          : 98077
Number of users                 : 31856
Number of items                 : 11121

Training summary
----------------
Training time                   : 3.1733

Model Parameters
----------------
Model class                     : ItemSimilarityRecommender
only_top_k                      : 100
threshold                       : 0.001
similarity_type                 : jaccard
training_method                 : auto
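
The summary above reports `similarity_type : jaccard`. As a rough illustration of what that means, here is a minimal pure-Python sketch of Jaccard similarity between items: the number of users who rated both items divided by the number who rated either. The function name `jaccard_item_similarity` and the toy observations are made up for this example; this is not GraphLab's implementation, which also applies `only_top_k` and `threshold`.

```python
from collections import defaultdict


def jaccard_item_similarity(observations):
    """Return {(item_a, item_b): score} for item pairs that share at least one user.

    `observations` is an iterable of (user, item) pairs. This is a hypothetical
    helper for illustration only, not part of GraphLab Create.
    """
    users_by_item = defaultdict(set)
    for user, item in observations:
        users_by_item[item].add(user)

    items = sorted(users_by_item)
    sims = {}
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = users_by_item[a] & users_by_item[b]
            if shared:
                union = users_by_item[a] | users_by_item[b]
                # Jaccard = |users who rated both| / |users who rated either|
                sims[(a, b)] = float(len(shared)) / len(union)
    return sims


# Toy data (invented): three users, three books.
toy_observations = [
    ("ann", "Life of Pi"),
    ("ann", "Wild Animus"),
    ("bob", "Life of Pi"),
    ("bob", "Empire Falls"),
    ("cat", "Life of Pi"),
    ("cat", "Wild Animus"),
]
toy_sims = jaccard_item_similarity(toy_observations)
for pair in sorted(toy_sims):
    print(pair, round(toy_sims[pair], 3))
```

Because the score depends only on which users rated each item, the `rating` column is not needed here, which matches `Target : None` in the summary.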

Get all unique users from the first 10000 observations and save them as a variable called `users`. Note that this overwrites the `users` SFrame loaded earlier.

In [9]:
users = ratings.head(10000)['name'].unique()

Get 20 recommendations for each user in your list of users. Save these as a new SFrame called `recs`.

In [10]:
recs = m.recommend(users, k=20)

PROGRESS: recommendations finished on 1000/6572 queries. users per second: 1158.56
PROGRESS: recommendations finished on 2000/6572 queries. users per second: 1215.11
PROGRESS: recommendations finished on 3000/6572 queries. users per second: 1147.53
PROGRESS: recommendations finished on 4000/6572 queries. users per second: 1138.16
PROGRESS: recommendations finished on 5000/6572 queries. users per second: 1175.74
PROGRESS: recommendations finished on 6000/6572 queries. users per second: 1191.36


### Inspecting your model

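Get an SFrame of the most similar items for each observed item. With no arguments, `get_similar_items()` returns 10 similar items per item, as the `rank` column in the output further down shows.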

In [11]:
sims = m.get_similar_items()

PROGRESS: Getting similar items completed in 0.109987


This dataset has multiple rows corresponding to the same book, e.g., where reprintings were done by different publishers in different years.

For each unique value of `book` in the `items` SFrame, select one of the available values for `author`, `publisher`, and `year`. Hint: Try using [`SFrame.groupby`](http://turi.com/products/create/docs/graphlab.data_structures.html#module-graphlab.aggregate) and [`gl.aggregate.SELECT_ONE`](http://turi.com/products/create/docs/graphlab.data_structures.html#graphlab.aggregate.SELECT_ONE).

In [12]:
items = items.groupby('book', {k: gl.aggregate.SELECT_ONE(k) for k in ['author', 'publisher', 'year']})

Compute the number of times each book was rated, and add a column containing these counts to the `items` SFrame using `SFrame.join`.

In [13]:
num_ratings_per_book = ratings.groupby('book', gl.aggregate.COUNT)
items = items.join(num_ratings_per_book, on='book')
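
For readers without GraphLab at hand, the count-then-join step can be sketched in plain Python. The toy rows below are invented, and the sketch assumes the join behaves as an inner join (books with no ratings are dropped), which is my understanding of the `SFrame.join` default.

```python
from collections import Counter

# Toy stand-ins for the `ratings` and `items` SFrames (invented data).
toy_ratings = [
    {"name": "ann", "book": "Life of Pi", "rating": 5},
    {"name": "bob", "book": "Life of Pi", "rating": 4},
    {"name": "cat", "book": "Life of Pi", "rating": 3},
    {"name": "ann", "book": "Wild Animus", "rating": 1},
]
toy_items = [
    {"book": "Wild Animus", "author": "Rich Shapero"},
    {"book": "Life of Pi", "author": "Yann Martel"},
    {"book": "Unrated Book", "author": "Nobody"},  # hypothetical book with no ratings
]

# Equivalent of groupby('book', COUNT): number of rating rows per book.
counts = Counter(row["book"] for row in toy_ratings)

# Equivalent of an inner join on 'book': keep only books that have a count.
joined = [
    dict(row, Count=counts[row["book"]])
    for row in toy_items
    if row["book"] in counts
]

# Most frequently rated books first.
joined.sort(key=lambda row: row["Count"], reverse=True)
for row in joined:
    print(row["book"], row["Count"])
```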

Print the first few books, sorted by the number of times they have been rated. Do these values make sense?

In [14]:
items.sort('Count', ascending=False)

book,publisher,year,author,Count
Wild Animus,Too Far,2004,Rich Shapero,581
The Da Vinci Code,Doubleday,2003,Dan Brown,488
The Secret Life of Bees,Penguin Highbridge,2002,Sue Monk Kidd,406
Bridget Jones's Diary,Picador,2001,Helen Fielding,377
Life of Pi,Pub Group West,2004,Yann Martel,336
The Summons,Random House Large Print Publishing ...,2002,John Grisham,309
A Painted House,Random House Audio Publishing Group ...,2001,John Grisham,284
The Girls' Guide to Hunting and Fishing ...,Penguin Books Ltd,2000,Melissa Bank,259
Good in Bed,Atria,2001,Jennifer Weiner,247
The Five People You Meet in Heaven ...,Hyperion,2003,Mitch Albom,244


Now print the most similar items per item, with the most frequently rated books first. Hint: Join the two SFrames you created above.

In [15]:
sims = sims.join(items[['book', 'Count']], on='book')
sims = sims.sort(['Count', 'book', 'rank'], ascending=False)
sims.print_rows(1000, max_row_width=150)

+-------------------------------+--------------------------------+------------------+------+-------+
|              book             |            similar             |      score       | rank | Count |
+-------------------------------+--------------------------------+------------------+------+-------+
|          Wild Animus          |    A Prayer for Owen Meany     | 0.00925925925926 |  10  |  581  |
|          Wild Animus          |          Empire Falls          | 0.00972222222222 |  9   |  581  |
|          Wild Animus          |      When the Wind Blows       | 0.00980392156863 |  8   |  581  |
|          Wild Animus          |   The Bonesetter's Daughter    | 0.0108991825613  |  7   |  581  |
|          Wild Animus          |           Life of Pi           | 0.0110375275938  |  6   |  581  |
|          Wild Animus          |          The Alienist          |  0.011315417256  |  5   |  581  |
|          Wild Animus          |        A Painted House         | 0.0116959064327  |  4   

### Experimenting with other models

Create a dataset called `implicit` that contains only ratings data where `rating` was 4 or greater.

In [16]:
implicit = ratings[ratings['rating'] >= 4]

Create a train/test split of the `implicit` data created above. Hint: Use [random_split_by_user](http://graphlab.com/products/create/docs/generated/graphlab.recommender.random_split_by_user.html#graphlab.recommender.random_split_by_user).

In [17]:
train, test = gl.recommender.util.random_split_by_user(implicit, user_id='name', item_id='book')
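
The idea behind a per-user split is that each user keeps some observations in the training set while a portion of their observations is held out for testing, so the model can be evaluated on users it has already seen. The following is a simplified pure-Python sketch of that idea. The helper `split_by_user`, its `test_fraction` parameter, and the toy rows are assumptions for illustration; GraphLab's own function differs in detail (for example, as I understand it, it holds out data only for a sample of users).

```python
import random
from collections import defaultdict


def split_by_user(rows, user_key, test_fraction=0.2, seed=0):
    """Hold out roughly `test_fraction` of each user's rows for testing.

    Hypothetical helper: a simplification of the per-user split idea,
    not GraphLab's implementation.
    """
    rng = random.Random(seed)
    rows_by_user = defaultdict(list)
    for row in rows:
        rows_by_user[row[user_key]].append(row)

    train, test = [], []
    for user in sorted(rows_by_user):
        user_rows = list(rows_by_user[user])
        rng.shuffle(user_rows)
        n_test = int(len(user_rows) * test_fraction)
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test


# Toy data (invented): two users with ten rated books each.
toy_rows = [
    {"name": user, "book": "book-%d" % i}
    for user in ("ann", "bob")
    for i in range(10)
]
toy_train, toy_test = split_by_user(toy_rows, "name")
print(len(toy_train), len(toy_test))
```

Every user in the toy test set also appears in the toy training set, which is the property that makes ranking metrics such as precision and recall meaningful later on.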

Print the first 5 rows of the training set.

In [18]:
train.head(5)

name,book,rating
Channon,Dave Barry's Bad Habits a 100% Fact-Free Book ...,5
Boe,It's Not About the Bike: My Journey Back to Life ...,4
Raul,The Hero and the Crown,4
Sarah,One Night of Scandal,4
Brooklynn,Fat Ollie's Book: A Novel of the 87th ...,4


Create a `ranking_factorization_recommender` model using just the training set and 20 factors.

In [19]:
m = gl.ranking_factorization_recommender.create(train, 'name', 'book', target='rating', num_factors=20)

PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 57079 observations with 21029 users and 10508 items.
PROGRESS:     Data prepared in: 0.184047s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 20       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | sgd      |
PROGRESS: | linear_regularization          | L2 Regularization on L

Evaluate how well this model recommends items that were seen in the test set you created above. Hint: Check out `m.evaluate_precision_recall()`.

In [20]:
m.evaluate_precision_recall(test, cutoffs=[50])['precision_recall_overall']

cutoff,precision,recall
50,0.00298780487805,0.108027355436
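
To make the two columns above concrete, here is a minimal sketch of precision and recall at a cutoff for a single user. The function `precision_recall_at_k` and the toy lists are invented for this example; the library's overall figures aggregate across users, and the exact aggregation is not shown here.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    `recommended` is a ranked list of items; `relevant` is the set of items
    the user actually has in the test set. Hypothetical helper for illustration.
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = float(hits) / k
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example (invented): one of the top two recommendations is relevant,
# and the user has two relevant books in total.
toy_recommended = ["a", "b", "c", "d"]
toy_relevant = {"b", "e"}
toy_precision, toy_recall = precision_recall_at_k(toy_recommended, toy_relevant, k=2)
print(toy_precision, toy_recall)
```

Read this way, a precision of about 0.003 at cutoff 50 corresponds to roughly 0.15 held-out books per 50 recommendations, while a recall of about 0.108 means that around a tenth of the held-out books appear somewhere in the top 50.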


Create an SFrame containing only one observation, where the user 'Me' has rated 'Animal Farm' with score 5.0.

In [21]:
new_observation_data = gl.SFrame({'name': ['Me'], 'book': ['Animal Farm'], 'rating': [5.0]})

Use this data when querying for recommendations.

In [22]:
m.recommend(users=['Me'], new_observation_data=new_observation_data)

name,book,score,rank
Me,The Da Vinci Code,4.73822578309,1
Me,A Prayer for Owen Meany,4.67328762529,2
Me,The Secret Life of Bees,4.62619789956,3
Me,The Five People You Meet in Heaven ...,4.59953253982,4
Me,Bridget Jones's Diary,4.58555614469,5
Me,Life of Pi,4.58295535562,6
Me,Suzanne's Diary for Nicholas ...,4.54191135881,7
Me,"A Child Called ""It"": One Child's Courage to ...",4.50687871394,8
Me,The Horse Whisperer,4.499653424,9
Me,The Handmaid's Tale,4.49429212031,10
