# Introduction

In this lab, we will be using the Surprise library. 

Surprise is an easy-to-use Python scikit for recommender systems that supports both item-item and user-user collaborative filtering

Documentation: https://surprise.readthedocs.io/en/stable/index.html

# Installation

To install the Surprise, you require a C++ compiler (e.g. Visual studio C++2014) as the library runs on Cython.

Simply input into your anaconda terminal:

```python
pip install scikit-surprise
```

#### UPDATE

If you are using anaconda, you can do this as well:

```python
conda install -c conda-forge scikit-surprise
```

# Importing Files

In [1]:
import numpy as np
import pandas as pd
import random
from surprise import SVD
from surprise import NormalPredictor
from surprise import Dataset, Reader, accuracy
from surprise import get_dataset_dir, dump
from surprise import KNNBasic, KNNWithMeans, KNNBaseline, KNNWithZScore
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from collections import defaultdict
import io

In [2]:
#setting seed for reproducable results
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

# Loading data

Surprise requires three columns in order to work, 

1. User (raw) ids
2. Item (raw) ids
3. Ratings

You will also require the range of the ratings (E.g. 1-5).

Surprise supports the pandas dataframe datatype, so we will perform all processing using pandas.

In [3]:
#load data from csv
df = pd.read_csv("user_data.csv", index_col = 0)
print(df.columns)

#we don't require timestamp, so drop it
df2 = df.drop("timestamp", axis = 1)
df2.head()

Index(['user_id', 'item_id', 'rating', 'timestamp'], dtype='object')


Unnamed: 0,user_id,item_id,rating
0,0,50,5
1,0,172,5
2,0,133,1
3,196,242,3
4,186,302,3


In [4]:
#find range of values for rating
print(np.sort(df2.rating.unique(), axis = -1))

#We require a Reader object, with the rating_scale parameter specified
reader = Reader(line_format='user item rating', rating_scale=(1, 5))

#Finally, to fully load the dataframe into a Surprise object, we use the load_from_df() method
data = Dataset.load_from_df(df2[['user_id', 'item_id', 'rating']], reader)
type(data)

[1 2 3 4 5]


surprise.dataset.DatasetAutoFolds

# Item to Item collaborative filtering

Surprise contains various algorithms, including SVD, Non-Negative Matrix Factorization and more, but the k-NNs are the only ones that support item-item.
    
There are a few other parameters we can adjust under our similarity options, but the most important for our purposes is the “user_based” flag. This determines whether we’re computing similarity between users or items. To use the item to item collaborative filtering, set “user_based” to “False”.

Source: https://medium.com/@jmcneilkeller/item-item-recommendation-with-surprise-4bf365355d96

MSD (Mean Squared Difference similarity) is Surprise’s default similarity option, however we will be using Pearson correlation coefficient instead.

Other similarity methods: https://surprise.readthedocs.io/en/stable/similarities.html

The “min_support” parameter establishes the minimum number of common users necessary for the similarity to not be zero. 

In [21]:
sim_options = {'name': 'pearson',
               'min_support': 5,
               'user_based': False} #change this to true for user-user

#split your data into train and test
train, test = train_test_split(data, test_size=.2)

#load your model
base1 = KNNBaseline(k=30,sim_options=sim_options)

#train
base1.fit(train)

#predict
base1_preds = base1.test(test)

#score
accuracy.rmse(base1_preds)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.9456


0.9456008481349131

In [9]:
# We can perform cross-validation on our results as well

#5 fold CV
cv_results = cross_validate(base1, data , cv=5, measures=['RMSE'])
cv_results

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.94421092, 0.94972995, 0.93699652, 0.94479174, 0.9328588 ]),
 'fit_time': (5.430090427398682,
  5.983448028564453,
  5.276017665863037,
  5.41839861869812,
  9.007898807525635),
 'test_time': (9.39434289932251,
  8.587968826293945,
  7.9939563274383545,
  13.728183031082153,
  9.03931736946106)}

In [10]:
cv_results['test_rmse'].mean()

0.941717585317052

The prediction results returns us a bunch of parameters:
    
uid – The (raw) user id <br>
iid – The (raw) item id <br>
r_ui (float) – The true rating rui<br>
est (float) – The estimated rating r^ui<br>
details (dict) – Stores additional details about the prediction that might be useful for later analysis

In [22]:
base1_preds[0:3]

[Prediction(uid=803, iid=748, r_ui=1.0, est=2.4486785494098595, details={'actual_k': 22, 'was_impossible': False}),
 Prediction(uid=228, iid=56, r_ui=2.0, est=3.4100294220574052, details={'actual_k': 11, 'was_impossible': False}),
 Prediction(uid=13, iid=358, r_ui=3.0, est=1.9768019520380848, details={'actual_k': 30, 'was_impossible': False})]

We can also predict individual ratings by directly calling the predict() method. <br>

Let’s say you’re interested in user 0 and item 50 (make sure they’re in the trainset!), and you know that the true rating rui=5

In [None]:
df2.head(1)

In [16]:
#for this example, ill just build the training set from the full data, to make sure user 0 and item 50 are inside

sim_options = {'name': 'pearson',
               'user_based': False}

trainset = data.build_full_trainset()
base2 = KNNBaseline(k=30,sim_options=sim_options)
#train
base2.fit(trainset)

uid = '0'  # raw user id (as in the ratings file). They need to be **strings**!
iid = '50'  # raw item id (as in the ratings file). They need to be **strings**!
print(trainset.knows_user(0))
print(trainset.knows_item(182))

# get a prediction for specific users and items.
pred = base2.predict(uid, iid, r_ui=5, verbose=True)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
True
True
user: 0          item: 50         r_ui = 5.00   est = 3.53   {'was_impossible': False}


# User to user collaborative filtering

You could change the user_based option in the KNN algorithim to True, or you can use matrix factorization algorithims to perform User to user collaborative filtering

In [23]:
#split your data into train and test
train, test = train_test_split(data, test_size=.2)

#load your model
base3 = SVD()

#train
base3.fit(train)

#predict
base3_preds = base3.test(test)

#score
accuracy.rmse(base3_preds)

RMSE: 0.9311


0.9311209154993982

# Useful Functions

### How to get the top-N recommendations for each user

In [41]:
def get_top_n(predictions, n):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm
data = Dataset.load_from_df(df2[['user_id', 'item_id', 'rating']], reader)
train, test = train_test_split(data, test_size=.2)
base4 = SVD()
base4.fit(train)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
test = train.build_anti_testset()
predictions = base4.test(test)

top_n = get_top_n(predictions, n=3)

top_n_list = [[uid, [iid for (iid, _) in user_ratings]] for uid, user_ratings in top_n.items()]

In [42]:
movie_titles = pd.read_csv('Movie_Id_Titles.csv', index_col = 0)
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [47]:
top_n[0]

[(114, 4.623994833610856), (408, 4.608792707825801), (169, 4.568051500644625)]

In [48]:
def get_movie_titles(predictions_dict,movie_titles_df,user_id):
    
    '''Convert the raw id of movie titles to the actual name

    Args:
        predictions_dict: A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
        
        movie_titles_df: a dataframe containing the item_id's and titles of movies
        
        user_id: a numerical value, U, used to denote an unique user
        
    Returns:
    A dataframe containing the top N titles and scores pertaining to user U
    '''
    
    top_recommendations = predictions_dict[user_id]
    raw_titles_index = [i[0]-1 for i in top_recommendations]
    scores = [i[1] for i in top_recommendations]
    temp = movie_titles_df.iloc[raw_titles_index,]
    temp['scores'] = scores
    #temp = temp.reset_index()
    return ('user {} recommendations'.format(user_id),temp)

## User 0 ratings

In [49]:
id0_df = df2[df2['user_id'] == 0]
pd.merge(id0_df, movie_titles, on="item_id")

Unnamed: 0,user_id,item_id,rating,title
0,0,50,5,Star Wars (1977)
1,0,172,5,"Empire Strikes Back, The (1980)"
2,0,133,1,Gone with the Wind (1939)


In [53]:
#return top n items from user 0 
get_movie_titles(top_n,movie_titles,0)

#warning = ignore, as usual

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


('user 0 recommendations',
      item_id                                              title    scores
 113      114  Wallace & Gromit: The Best of Aardman Animatio...  4.623995
 407      408                              Close Shave, A (1995)  4.608793
 168      169                         Wrong Trousers, The (1993)  4.568052)

## How to get the k nearest neighbors of a user (or item)

You can use the get_neighbors() methods of the algorithm object. This is only relevant for algorithms that use a similarity measure, such as the k-NN algorithms.

You have to convert between raw and inner Ids as Surprise stores them differently. Different algorithims return different Ids as well (some return raw, some return inner). More information here:

https://surprise.readthedocs.io/en/stable/FAQ.html#raw-inner-note

In [59]:
sim_options = {'name': 'pearson_baseline',
               'user_based': False}

base5 = KNNBaseline(k=5,sim_options=sim_options)
train = data.build_full_trainset()
base5.fit(train)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x1a24a8ea5c0>

In [61]:
#raw item id 1 is toy story
#raw item id 181 is return of the jedi
#convert raw id into inner id
inner_id = train.to_inner_iid(1)

#get the nearest 
raw_nearest_id = [train.to_raw_iid(i) for i in base5.get_neighbors(inner_id,10)]

#get the indexs for df locating
raw_titles_index = [i-1 for i in raw_nearest_id]

movie_titles.iloc[raw_titles_index,]

Unnamed: 0,item_id,title
587,588,Beauty and the Beast (1991)
173,174,Raiders of the Lost Ark (1981)
844,845,That Thing You Do! (1996)
70,71,"Lion King, The (1994)"
927,928,"Craft, The (1996)"
293,294,Liar Liar (1997)
94,95,Aladdin (1992)
522,523,Cool Hand Luke (1967)
968,969,Winnie the Pooh and the Blustery Day (1968)
209,210,Indiana Jones and the Last Crusade (1989)
