# Recommender Systems
https://apple.github.io/turicreate/docs/userguide/recommender/
 
A recommender system allows you to provide personalized recommendations to users. With this toolkit, you can create a model based on past interaction data and use that model to make recommendations.

## Turi Create and GPU Setup

In [None]:
!apt install libnvrtc8.0
!pip uninstall -y mxnet-cu80 && pip install mxnet-cu80==1.1.0
!pip install turicreate

## Google Drive Access

You will be asked to click a link that generates a secret key for accessing your Google Drive.

Copy the secret key and paste it into the space provided in the notebook.

In [None]:
import os.path
from google.colab import drive

# mount Google Drive to /content/drive/My Drive/
if os.path.isdir("/content/drive/My Drive"):
  print("Google Drive already mounted")
else:
  drive.mount('/content/drive')

## Fetch Example Data

Creating a recommender model typically requires a data set to use for training the model, with columns that contain the user IDs, the item IDs, and (optionally) the ratings. For this example, we use the MovieLens 20M dataset:
https://grouplens.org/datasets/movielens/20m/

It is a stable benchmark dataset: 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. It also includes tag genome data with 12 million relevance scores across 1,100 tags.

In [None]:
import os.path
import urllib.request
import tarfile
import zipfile
import gzip
from shutil import copy

def fetch_remote_datafile(filename, remote_url):
  if os.path.isfile("./" + filename):
    print("already have " + filename + " in workspace")
    return
  print("fetching " + filename + " from " + remote_url + "...")
  urllib.request.urlretrieve(remote_url, "./" + filename)

def cache_datafile_in_drive(filename):
  if not os.path.isfile("./" + filename):
    print("cannot cache " + filename + ", it is not in workspace")
    return

  data_drive_path = "/content/drive/My Drive/Colab Notebooks/data/"
  if os.path.isfile(data_drive_path + filename):
    print(filename + " has already been stored in Google Drive")
  else:
    print("copying " + filename + " to " + data_drive_path)
    copy("./" + filename, data_drive_path)
  

def load_datafile_from_drive(filename, remote_url=None):
  data_drive_path = "/content/drive/My Drive/Colab Notebooks/data/"
  if os.path.isfile("./" + filename):
    print("already have " + filename + " in workspace")
  elif os.path.isfile(data_drive_path + filename):
    print("have " + filename + " in Google Drive, copying to workspace...")
    copy(data_drive_path + filename, ".")
  elif remote_url is not None:
    fetch_remote_datafile(filename, remote_url)
  else:
    print("error: you need to manually download " + filename + " and put in drive")
    
def extract_datafile(filename, expected_extract_artifact=None):
  if expected_extract_artifact is not None and (os.path.isfile(expected_extract_artifact) or os.path.isdir(expected_extract_artifact)):
    print("files in " + filename + " have already been extracted")
  elif not os.path.isfile("./" + filename):
    print("error: cannot extract " + filename + ", it is not in the workspace")
  else:
    extension = filename.split('.')[-1]
    if extension == "zip":
      print("extracting " + filename + "...")
      # the context manager closes the archive even if extraction fails
      with zipfile.ZipFile(filename) as z:
        for name in z.namelist():
          print("    extracting file", name)
          z.extract(name, "./")
    elif extension == "gz":
      print("extracting " + filename + "...")
      if filename.split('.')[-2] == "tar":
        tar = tarfile.open(filename)
        tar.extractall()
        tar.close()
      else:
        data_zip_file = gzip.GzipFile(filename, 'rb')
        data = data_zip_file.read()
        data_zip_file.close()
        extracted_file = open('.'.join(filename.split('.')[0:-1]), 'wb')
        extracted_file.write(data)
        extracted_file.close()
    elif extension == "tar":
      print("extracting " + filename + "...")
      tar = tarfile.open(filename)
      tar.extractall()
      tar.close()
    elif extension == "csv":
      print("do not need to extract csv")
    else:
      print("cannot extract " + filename)
      
def load_cache_extract_datafile(filename, expected_extract_artifact=None, remote_url=None):
  load_datafile_from_drive(filename, remote_url)
  extract_datafile(filename, expected_extract_artifact)
  cache_datafile_in_drive(filename)
  

In [None]:
load_cache_extract_datafile("ml-20m.zip", "ml-20m", "http://files.grouplens.org/datasets/movielens/ml-20m.zip")

already have ml-20m.zip in workspace
files in ml-20m.zip have already been extracted
ml-20m.zip has already been stored in Google Drive


## Setup Turi Create

In [None]:
import mxnet as mx
import turicreate as tc

In [None]:
# Use all GPUs (default)
tc.config.set_num_gpus(-1)

# Use only 1 GPU
#tc.config.set_num_gpus(1)

# Use CPU
#tc.config.set_num_gpus(0)

## Recommender Example - Movies

In [None]:
actions = tc.SFrame.read_csv('./ml-20m/ratings.csv')
actions.head()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940
1,260,4.0,1112484826


In [None]:
actions.groupby('userId', [tc.aggregate.COUNT]).sort("Count", ascending = False)

userId,Count
118205,9254
8405,7515
82418,5646
121535,5520
125794,5491
74142,5447
34576,5356
131904,5330
83090,5169
59477,4988


In [None]:
items = tc.SFrame.read_csv('./ml-20m/movies.csv')
items.head()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Child ren|Comedy|Fantasy ...
2,Jumanji (1995),Adventure|Children|Fantas y ...
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995) ...,Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


## Creating a model

There are a variety of machine learning techniques that can be used to build a recommender model. Turi Create provides a method `turicreate.recommender.create` that will automatically choose an appropriate model for your data set. More details on choosing a model: https://apple.github.io/turicreate/docs/userguide/recommender/choosing-a-model.html

First we create a random split of the data to produce a validation set that can be used to evaluate the model.

In [None]:
training_data, validation_data = tc.recommender.util.random_split_by_user(actions, 'userId', 'movieId')
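
To make the idea of a per-user split concrete, here is a minimal plain-Python sketch that holds out a fraction of each user's rows for validation. It is an illustration only, not Turi Create's implementation: the function name `split_by_user`, its parameters, and the toy rows are all assumptions, and the library's own sampling details may differ.

```python
import random
from collections import defaultdict

def split_by_user(rows, user_key, item_test_proportion=0.2, seed=0):
    """Hold out a fraction of each user's rows for validation (illustrative)."""
    rng = random.Random(seed)

    # group the interaction rows by user
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_key]].append(row)

    train, validation = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * item_test_proportion)
        validation.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, validation

# toy data: two users with ten interactions each
rows = [{"userId": u, "movieId": m} for u in (1, 2) for m in range(10)]
train, validation = split_by_user(rows, "userId")
print(len(train), len(validation))
```

Splitting within each user, rather than over all rows at once, keeps every held-out user represented in the training data, so the model can still be asked to rank items for them.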

In [None]:
model = tc.recommender.create(training_data, 'userId', 'movieId')

Now that you have a model, you can make recommendations.

## Making recommendations

Once a model is created, you can now make recommendations of new items for users. To do so, call model.recommend() with an SArray of user ids. If users is set to None, then model.recommend() will make recommendations for all the users seen during model creation, automatically excluding the items that are observed for each user. In other words, if data contains a row "Alice, The Great Gatsby", then model.recommend() will not recommend "The Great Gatsby" for user "Alice". It will return at most k new items for each user, sorted by their rank. It will return fewer than k items if there are not enough items that the user has not already rated or seen.
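
The behavior described above (skip observed items, sort by score, return at most k) can be sketched in plain Python. This is a conceptual illustration under assumed names (`recommend_top_k`, `alice_scores`, `alice_seen`) with made-up scores; it is not how Turi Create computes recommendations internally.

```python
def recommend_top_k(scores, seen, k=10):
    """Rank unseen items by score and keep at most k of them (illustrative)."""
    candidates = [(item, s) for item, s in scores.items() if item not in seen]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [
        {"item": item, "score": s, "rank": rank}
        for rank, (item, s) in enumerate(candidates[:k], start=1)
    ]

# made-up scores for a single user
alice_scores = {"The Great Gatsby": 0.9, "Moby Dick": 0.7, "Dracula": 0.4}
alice_seen = {"The Great Gatsby"}

recs = recommend_top_k(alice_scores, alice_seen, k=5)
for rec in recs:
    print(rec)
```

Only two rows come back even though k is 5, because Alice has just two items she has not already seen.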

The score column of the output contains the unnormalized prediction scores for each user-item pair. The semantic meanings of these scores may differ between models. For the linear regression model, for instance, a higher average score for a user means that the model thinks that this user is generally more enthusiastic than others.

There are a number of ways to make recommendations: for known users or new users, with new observation data or side information, and with different ways to explicitly control item inclusion or exclusion. Let's walk through these options together.

### Making recommendations for all users

By default, calling model.recommend() without any arguments returns the top 10 recommendations for all users seen during model creation. It automatically excludes items that each user interacted with during model creation, so all generated recommendations are for items that the user has not already seen.

In [None]:
results = model.recommend()


In [None]:
print(results)

+--------+---------+---------------------+------+
| userId | movieId |        score        | rank |
+--------+---------+---------------------+------+
|   1    |   1682  | 0.08997867686407907 |  1   |
|   1    |   2115  | 0.08676330430167062 |  2   |
|   1    |   1580  | 0.08470123972211566 |  3   |
|   1    |   1265  | 0.08374577590397426 |  4   |
|   1    |   3578  | 0.08367959431239537 |  5   |
|   1    |   1527  | 0.08159969432013375 |  6   |
|   1    |   2571  | 0.07945986543382917 |  7   |
|   1    |   1270  | 0.07668366568429129 |  8   |
|   1    |   3793  | 0.07137173175811767 |  9   |
|   1    |   5349  | 0.07035016877310617 |  10  |
+--------+---------+---------------------+------+
[1384930 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [None]:
for i in range(10):
  print(items[results["movieId"][i]==items["movieId"]])


+---------+-------------------------+---------------------+
| movieId |          title          |        genres       |
+---------+-------------------------+---------------------+
|   1682  | Truman Show, The (1998) | Comedy|Drama|Sci-Fi |
+---------+-------------------------+---------------------+
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+---------+-------------------------------+--------------------------+
| movieId |             title             |          genres          |
+---------+-------------------------------+--------------------------+
|   2115  | Indiana Jones and the Temp... | Action|Adventure|Fantasy |
+---------+-------------------------------+--------------------------+
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+---------+------------------------

In [None]:
def print_favorite_movie(row):
  if row["rating"] > 4.0:
    print(items[row["movieId"]==items["movieId"]])

def print_movie(row):
  print(items[row["movieId"]==items["movieId"]])

In [None]:
test_user_id = 7

In [None]:
test_user_actions = actions[actions["userId"]==test_user_id].sort("rating", ascending = False)
print(test_user_actions.head())
test_user_actions.apply(print_favorite_movie)

+--------+---------+--------+------------+
| userId | movieId | rating | timestamp  |
+--------+---------+--------+------------+
|   7    |   2028  |  5.0   | 1011205481 |
|   7    |   480   |  5.0   | 1011206779 |
|   7    |   1721  |  5.0   | 1011207965 |
|   7    |   3479  |  5.0   | 1011207916 |
|   7    |   912   |  5.0   | 1011204596 |
|   7    |   587   |  5.0   | 1011208220 |
|   7    |   589   |  5.0   | 1011206456 |
|   7    |   2942  |  5.0   | 1011206321 |
|   7    |   3417  |  5.0   | 1011206698 |
|   7    |   1077  |  5.0   | 1011206898 |
+--------+---------+--------+------------+
[10 rows x 4 columns]

+---------+----------------------------+------------------+
| movieId |           title            |      genres      |
+---------+----------------------------+------------------+
|   2028  | Saving Private Ryan (1998) | Action|Drama|War |
+---------+----------------------------+------------------+
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFr

dtype: float
Rows: 276
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, ... ]

In [None]:
test_user_recommendations = model.recommend(users=[test_user_id])
test_user_recommendations.apply(print_movie)

+---------+------------------------+------------------------+
| movieId |         title          |         genres         |
+---------+------------------------+------------------------+
|   1240  | Terminator, The (1984) | Action|Sci-Fi|Thriller |
+---------+------------------------+------------------------+
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+---------+-------------------------------+------------------+
| movieId |             title             |      genres      |
+---------+-------------------------------+------------------+
|   1291  | Indiana Jones and the Last... | Action|Adventure |
+---------+-------------------------------+------------------+
[? rows x 3 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+---------+-------------------------------+----------------------

dtype: float
Rows: 10
[None, None, None, None, None, None, None, None, None, None]

## Saving and loading models

The model can be saved for later use. The saved model sits in its own directory, and can be loaded back in later to make more predictions.

In [None]:
model.save("Recommender.model")

In [None]:
model = tc.load_model("Recommender.model")

In [None]:
# Export for use in Core ML
model.export_coreml('Recommender.mlmodel')

This model is exported as a custom Core ML model. In order to use it in your
application, you must also include "libRecommender.dylib". For additional
details see:
https://apple.github.io/turicreate/docs/userguide/recommender/coreml-deployment.html


In [None]:
# download mlmodel locally
from google.colab import files
files.download("Recommender.mlmodel")

In [None]:
# copy the exported Core ML model file to Google Drive
from shutil import copy
copy("/content/Recommender.mlmodel", "/content/drive/My Drive/Colab Notebooks/data/models/Recommender.mlmodel")

In [None]:
# copy the saved Turi Create model directory to Google Drive
from shutil import copytree
copytree("/content/Recommender.model", "/content/drive/My Drive/Colab Notebooks/data/models/Recommender.model")

## Using Models

A FactorizationRecommender learns latent factors for each user and item and uses them to make rating predictions.
https://apple.github.io/turicreate/docs/api/generated/turicreate.recommender.factorization_recommender.create.html
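
As a rough sketch of what "latent factors" means, a common formulation of matrix factorization predicts a rating as a sum of bias terms plus the dot product of a user vector and an item vector. The function `predict_rating` and every number below are made up for illustration; Turi Create's actual model has additional terms and training details, which are described in the API documentation linked above.

```python
def predict_rating(user_factors, item_factors,
                   global_bias=0.0, user_bias=0.0, item_bias=0.0):
    """Bias terms plus the dot product of the two latent factor vectors."""
    dot = sum(u * v for u, v in zip(user_factors, item_factors))
    return global_bias + user_bias + item_bias + dot

# hypothetical 2-dimensional factors for one user and one item
ann_factors = [0.5, 1.0]
item3_factors = [1.0, 0.5]

score = predict_rating(ann_factors, item3_factors,
                       global_bias=3.0, user_bias=-0.5, item_bias=0.25)
print(score)
```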

In [None]:
data = tc.SFrame({'user_id': ["Ann", "Ann", "Ann", "Brian", "Brian", "Brian"],
                  'item_id': ["Item1", "Item2", "Item4", "Item2", "Item3", "Item5"],
                  'rating': [1, 3, 2, 5, 4, 2]})
data

item_id,rating,user_id
Item1,1,Ann
Item2,3,Ann
Item4,2,Ann
Item2,5,Brian
Item3,4,Brian
Item5,2,Brian


In [None]:
m = tc.factorization_recommender.create(data, target='rating')

In [None]:
recommendations = m.recommend()
print(recommendations)

+---------+---------+----------------------+------+
| user_id | item_id |        score         | rank |
+---------+---------+----------------------+------+
|   Ann   |  Item3  |  2.0560873647530875  |  1   |
|   Ann   |  Item5  | 0.055848081906636704 |  2   |
|  Brian  |  Item4  |  3.9441206951936087  |  1   |
|  Brian  |  Item1  |  2.9440365036328635  |  2   |
+---------+---------+----------------------+------+
[4 rows x 4 columns]



### Making recommendations for specific users

If you specify a list or SArray of users, recommend() returns recommendations for only those user(s). The user names must correspond to strings in the user_id column in the training data.

In [None]:
recommendations = m.recommend(users=['Brian'])
print(recommendations)

+---------+---------+--------------------+------+
| user_id | item_id |       score        | rank |
+---------+---------+--------------------+------+
|  Brian  |  Item4  | 3.9441206951936087 |  1   |
|  Brian  |  Item1  | 2.9440365036328635 |  2   |
+---------+---------+--------------------+------+
[2 rows x 4 columns]



### Making recommendations for specific users and items

In situations where you build a model for all of your users and items, you may wish to limit the recommendations for particular users based on known item attributes. For example, for US-based customers you may want to limit recommendations to US-based products. The following code sample restricts recommendations to a subset of users and items -- specifically those users and items whose value in the 'country' column is equal to "United States".

In [None]:
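# Hypothetical tables with a 'country' column, for illustration only; named so
# that `items` (the MovieLens movies SFrame) is not overwritten.
user_table = tc.SFrame({'user_id': ['Ann', 'Brian'], 'country': ['United States', 'Canada']})
item_table = tc.SFrame({'item_id': ['Item1', 'Item2', 'Item3', 'Item4', 'Item5'],
                        'country': ['United States'] * 3 + ['Canada'] * 2})
country = 'United States'
m.recommend(users=user_table['user_id'][user_table['country']==country].unique(),
            items=item_table['item_id'][item_table['country']==country])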

### Making recommendations for new users

Making recommendations for a user with no interaction history is known as the "cold-start" problem. The recommend() function works seamlessly with new users. If the model has never seen the user, then it defaults to recommending popular items:

In [None]:
m.recommend(['Charlie'])

user_id,item_id,score,rank
Charlie,Item2,4.166840751965841,1
Charlie,Item3,3.1991079946359,2
Charlie,Item4,3.134204169114431,3
Charlie,Item1,2.1341262062390647,4
Charlie,Item5,1.1988665660222373,5


Here 'Charlie' is a new user that does not appear in the training data. Also note that you don't need to explicitly write down users=; Python automatically assumes that arguments are provided in order, so the first unnamed argument to recommend() is taken to be the user list.
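
The fallback idea can be sketched in plain Python: use personalized results when the user is known, otherwise rank items by overall popularity. Everything here is an assumption made for illustration, including the function name `recommend_with_fallback` and the choice of interaction count as the popularity measure. The factorization model above ranks by predicted score instead, so its ordering for Charlie differs from this sketch.

```python
from collections import Counter

def recommend_with_fallback(user, personalized, interactions, k=5):
    """Return personalized items for known users, popular items otherwise."""
    if user in personalized:
        return personalized[user][:k]
    counts = Counter(item for _, item in interactions)
    # most interactions first; ties broken by item id for a stable order
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:k]

# the same (user, item) pairs as the toy data set above
interactions = [("Ann", "Item1"), ("Ann", "Item2"), ("Ann", "Item4"),
                ("Brian", "Item2"), ("Brian", "Item3"), ("Brian", "Item5")]
# hypothetical personalized results for the known users
personalized = {"Ann": ["Item3", "Item5"], "Brian": ["Item4", "Item1"]}

charlie_recs = recommend_with_fallback("Charlie", personalized, interactions)
print(charlie_recs)
```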

### Incorporating information about a new user

To improve recommendations for new users, it helps to have side information or new observation data for the user.

#### Incorporating new side information

To incorporate side information, you must have already created a recommender model that knows how to incorporate side features. This can be done by passing in side information to create(). For example:

In [None]:
# the package was imported above as `tc`
user_info = tc.SFrame({'user_id': ['Ann', 'Brian'],
                       'age_category': ['2', '3']})
m_side_info = tc.factorization_recommender.create(data, target='rating',
                                                  user_data=user_info)

Now, we can add side information for the new user at recommendation time. The new side information must contain a column with the same name as the column in the training data that's designated as the 'user_id'. (For more details, please see the API documentation for turicreate.recommender.create.)

In [None]:
new_user_info = tc.SFrame({'user_id': ['Charlie'],
                           'age_category': ['2']})
recommendations = m_side_info.recommend(['Charlie'],
                                        new_user_data=new_user_info)

Given Charlie's age category, the model can incorporate what it knows about the importance of age categories for item recommendations. Currently, the following models can take side information into account when making recommendations: LinearRegressionModel, FactorizationRecommender. LinearRegressionModel is the simpler model, and FactorizationRecommender the more powerful. For more details on how each model makes use of side information, please refer to the model definition sections in the individual models' API documentation.

#### Incorporating new observation data

recommend() accepts new observation data. Currently, the ItemSimilarityModel makes the best use of this information.

In [None]:
m_item_sim = tc.item_similarity_recommender.create(data)
new_obs_data = tc.SFrame({'user_id': ['Charlie', 'Charlie'],
                          'item_id': ['Item1', 'Item5']})
recommendations = m_item_sim.recommend(['Charlie'], new_observation_data=new_obs_data)

### Controlling the number of recommendations

The input parameter k controls how many items to recommend for each user.

In [None]:
recommendations = m.recommend(k = 5)

### Excluding specific items from recommendation

Suppose you make some recommendations to a user and they ignore them, so you now want to offer different recommendations. You can do this by explicitly excluding the undesirable items via the exclude keyword argument.

In [None]:
exclude_pairs = tc.SFrame({'user_id': ['Ann'],
                           'item_id': ['Item3']})

recommendations = m.recommend(['Ann'], k=5, exclude=exclude_pairs)

By default, recommend() excludes items seen during training, so that it would not recommend items that the user has already seen. To change this behavior, you can specify exclude_known=False.

In [None]:
recommendations = m.recommend(exclude_known = False)
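
One way to picture how the two exclusion mechanisms combine is as a set of blocked items per user. The sketch below is plain Python with hypothetical names (`candidate_items`, `known`); it assumes that explicit exclusions and known-item exclusions are applied independently, and it does not reflect Turi Create's internals.

```python
def candidate_items(user, all_items, known, exclude_pairs=(), exclude_known=True):
    """Return the items that remain eligible for one user (illustrative)."""
    # explicit (user, item) exclusions
    blocked = {item for u, item in exclude_pairs if u == user}
    # optionally also block items observed during training
    if exclude_known:
        blocked |= known.get(user, set())
    return [item for item in all_items if item not in blocked]

all_items = ["Item1", "Item2", "Item3", "Item4", "Item5"]
known = {"Ann": {"Item1", "Item2", "Item4"}}
exclude_pairs = [("Ann", "Item3")]

only_new = candidate_items("Ann", all_items, known, exclude_pairs)
everything = candidate_items("Ann", all_items, known, exclude_known=False)
print(only_new)
print(everything)
```

With both mechanisms active, only Item5 remains for Ann. With `exclude_known=False` and no explicit pairs, every item is eligible again.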

### Including specific items in recommendation

Suppose you want to see only recommendations within a subset of items. This can be done via the items keyword argument. The input must be an SArray of items.

In [None]:
item_subset = tc.SArray(["Item3", "Item5", "Item2"])
recommendations = m.recommend(['Ann'], items=item_subset)

### Finding Similar Items

Many of the above models make recommendations based on some notion of similarity between a pair of items. Querying for similar items can help you understand the model's behavior on your data.

We have made this process very easy with the get_similar_items function:

In [None]:
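my_list_of_items = tc.SArray([1682, 2115])  # placeholder: any item IDs present in the training data
similar_items = model.get_similar_items(my_list_of_items, k=20)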

The above will return an SFrame containing the 20 nearest items for every item in my_list_of_items. The definition of "nearest" depends on the type of similarity used by the model. For instance, "jaccard" similarity measures the overlap between the sets of users who interacted with each of the two items. The 'score' column contains a similarity score ranging between 0 and 1, where larger values indicate greater similarity. The mathematical formula used for each type of similarity can be found in the API documentation for ItemSimilarityRecommender.
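
For intuition, Jaccard similarity between two items is the size of the intersection of their user sets divided by the size of the union. The plain-Python sketch below computes it on the small Ann/Brian data set used earlier; the helper name `jaccard_item_similarity` is hypothetical, and this is the textbook definition rather than Turi Create's own code.

```python
from collections import defaultdict

def jaccard_item_similarity(interactions):
    """Build a function that returns the Jaccard similarity of two items."""
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    def similarity(a, b):
        users_a, users_b = users_by_item[a], users_by_item[b]
        union = users_a | users_b
        # shared users divided by all users of either item
        return len(users_a & users_b) / len(union) if union else 0.0

    return similarity

interactions = [("Ann", "Item1"), ("Ann", "Item2"), ("Ann", "Item4"),
                ("Brian", "Item2"), ("Brian", "Item3"), ("Brian", "Item5")]
sim = jaccard_item_similarity(interactions)

print(sim("Item1", "Item2"))  # Ann is shared; Ann and Brian form the union
print(sim("Item1", "Item4"))  # both items were rated only by Ann
print(sim("Item1", "Item3"))  # no users in common
```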

For a factorization-based model, the similarity used is the Euclidean distance between the two items' latent factors, which can be obtained using m['coefficients'].

## Choosing a Model

https://apple.github.io/turicreate/docs/userguide/recommender/choosing-a-model.html