# GA Data Science 10 (DAT10) - Lab 14

### Recommendation Systems

Francesco Mosconi, Justin Breucop

### Today

1. Simple similarity based recommendation system
2. Recsys

## Similarity based Recommendation System: Beers


Let's build a recommendation system that suggests beers based on user reviews.

Usual imports (numpy, pandas)

In [None]:
import pandas as pd
import numpy as np

First of all let's get the data

In [None]:
! curl -O https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz

Import the data into a pandas DataFrame called "allbeers". Use the compression keyword.

In [None]:
allbeers = pd.read_csv("beer_reviews.tar.gz", compression='gzip')

Let's look at the data

In [None]:
allbeers.head()

Let's restrict this to the top 250 beers. Use the value_counts() method to select the 250 most-reviewed beers.
Assign the selected rows to a DataFrame called df.

In [None]:
n = 250
top_n = allbeers.beer_name.value_counts().index[:n]
df = allbeers[allbeers.beer_name.isin(top_n)]
df.head()
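
To see why value_counts() works for this selection, here is a minimal sketch on a made-up review log (the beer names and scores below are invented for illustration). value_counts() returns the counts sorted from most to least frequent, so slicing its index keeps the most-reviewed names.

```python
import pandas as pd

# Made-up review log: one row per review.
reviews = pd.DataFrame({
    "beer_name": ["A", "B", "A", "C", "B", "A", "D"],
    "review_overall": [4.0, 3.5, 4.5, 2.0, 3.0, 5.0, 4.0],
})

n = 2
# value_counts() sorts by frequency, most frequent first,
# so the first n index entries are the n most-reviewed beers.
top_n = reviews.beer_name.value_counts().index[:n]

# Keep only the reviews that belong to those beers.
subset = reviews[reviews.beer_name.isin(top_n)]
print(sorted(subset.beer_name.unique()))  # ['A', 'B']
```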

How big is this dataset?

In [None]:
df.info()

### Pivot Table

Aggregate the data with the pivot_table method: compute the mean review_overall for each combination of beer_name and review_profilename. Use the mean as aggregator.

In [None]:
df_pivot = pd.pivot_table(df, values=["review_overall"],
        columns=["beer_name", "review_profilename"],
        aggfunc=np.mean)
# pivot_table converts to a multi-index Series.
# unstack(-1) converts it to a DataFrame where the last index level becomes the column header.
df_wide = df_pivot.unstack(-1)
df_wide
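
As a rough illustration of the shape we are after, the sketch below builds a beers-by-users table on made-up data. It is an alternative formulation, not the cell above: it passes index and columns explicitly, so the result has a single-level index and is not a drop-in replacement for the cells that follow.

```python
import pandas as pd

# Made-up reviews; "u1" reviewed beer "A" twice to show the averaging.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B"],
    "review_profilename": ["u1", "u1", "u2", "u1", "u3"],
    "review_overall": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Rows are beers, columns are users, cells are mean ratings.
# Pairs with no review show up as NaN.
toy_wide = toy.pivot_table(values="review_overall",
                           index="beer_name",
                           columns="review_profilename",
                           aggfunc="mean")
print(toy_wide)
```

Most cells in such a table are empty, because each user reviews only a few beers.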

Display the head of the pivot table, but only for 5 users (columns are users)

In [None]:
df_wide.iloc[0:5, 0:5]  # positional slicing; .ix is deprecated

### Discussion: what do you notice in this table?

#### Data munging
Set NaNs to zero

In [None]:
df_wide = df_wide.fillna(0)

Check that columns are users

In [None]:
df_wide.columns[:10]

Check that rows are beers

In [None]:
df_wide.index.levels[0]
beer_names = df_wide.index.levels[1]

### Calculate distance between beers

We're going to use cosine_similarity from scikit-learn to compare all pairs of beers. Strictly speaking it returns a similarity rather than a distance, so higher values mean more similar beers.

Imports

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import euclidean_distances

Apply cosine similarity to df_wide to compare every pair of beers (rows)

In [None]:
dists = cosine_similarity(df_wide)
dists
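
For reference, the cosine similarity of two rating vectors $a$ and $b$ is $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$. The sketch below computes it by hand with NumPy on a small made-up matrix (rows are beers, columns are users). It is only meant to show what the number means, and it assumes no row is all zeros.

```python
import numpy as np

# Made-up ratings: rows are beers, columns are users, 0 means "no review".
X = np.array([
    [5.0, 0.0, 4.0],
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
])

# Scale every row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
sim = X_unit.dot(X_unit.T)

print(np.round(sim, 3))
```

The first two beers were rated by the same users with similar scores, so their similarity is close to 1. The third beer shares no reviewers with them, so its similarity to both is 0.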

### Discussion: what type of object is dists?

Convert dists to a pandas DataFrame and use the beer names as both the row index and the column labels (the matrix is square).

In [None]:
dists = pd.DataFrame(dists)
dists.columns = beer_names
dists.index = beer_names
dists.iloc[0:10, 0:10]  # positional slicing; .ix is deprecated

Select some beers and look at their distances to other beers

In [None]:
beers_i_like = ['Sierra Nevada Pale Ale', '120 Minute IPA', 'Allagash White']
dists[beers_i_like].head()

Sum the values for my favourite beers row by row, to get a single score for each beer in the sample

In [None]:
beers_summed = dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)
#beers_summed = np.sum(dists[beers_i_like], axis=1)

#### Performance

Optional: which one is faster? Use ```%timeit``` to check

In [None]:
%timeit dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)

In [None]:
%timeit np.sum(dists[beers_i_like], axis=1)
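
Before timing them, it is worth checking that the two forms give the same answer. The sketch below does so on a tiny made-up table; the names toy, row_by_row and vectorized are just for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up table: rows are beers, columns are the liked beers.
toy = pd.DataFrame([[1.0, 0.2], [0.2, 1.0], [0.5, 0.7]],
                   index=["A", "B", "C"], columns=["A", "B"])

# apply(..., axis=1) calls the lambda once per row, in Python.
row_by_row = toy.apply(lambda row: np.sum(row), axis=1)

# sum(axis=1) does the same reduction in a single vectorized call.
vectorized = toy.sum(axis=1)

print(np.allclose(row_by_row, vectorized))
```

The per-row Python call is usually what makes the apply version slower on large tables.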

#### Ranking

Sort summed beers from best to worst

In [None]:
beers_summed = beers_summed.sort_values(ascending=False)  # Series.order() was removed from pandas
beers_summed

Filter out the beers used as input and transform to list

In [None]:
ranked_beers = beers_summed.index[~beers_summed.index.isin(beers_i_like)]
ranked_beers = ranked_beers.tolist()
ranked_beers[:5]

### Pair Programming!

Define a function that does what we just did for an arbitrary input list of beers. It should also accept the maximum number of beers requested, n, as an optional parameter.

Test your function. Find the 10 beers most similar to "120 Minute IPA"

Try again with the 10 beers most similar to ["Coors Light", "Bud Light", "Amstel Light"]
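
If you get stuck, here is one possible shape for such a function, shown on a made-up 4x4 similarity table. The function name recommend_beers and the toy values are hypothetical; treat it as a sketch to compare against your own solution rather than the reference answer.

```python
import pandas as pd

def recommend_beers(dists, liked, n=5):
    """Return up to n beer names ranked by summed similarity to `liked`."""
    if isinstance(liked, str):
        liked = [liked]  # accept a single beer name as well as a list
    summed = dists[liked].sum(axis=1)   # one score per beer
    summed = summed.drop(liked)         # do not recommend the inputs
    return summed.sort_values(ascending=False).index[:n].tolist()

# Made-up symmetric similarity table, for illustration only.
names = ["A", "B", "C", "D"]
toy_dists = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.4],
     [0.9, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.8],
     [0.4, 0.5, 0.8, 1.0]],
    index=names, columns=names)

print(recommend_beers(toy_dists, "A", n=2))      # ['B', 'D']
print(recommend_beers(toy_dists, ["C", "D"]))    # ['B', 'A']
```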

Optional: register an account on yhat and deploy your model following the instructions [here](https://docs.yhathq.com/python/examples/beer-recommender) and [here](http://nbviewer.ipython.org/gist/glamp/20a18d52c539b87de2af)

## Recsys

python-recsys is a Python library for implementing a recommender system. If you'd like to, I recommend you explore this project. It's an efficient way to get a recommendation engine off the ground. The example below uses SVD.

In [None]:
"""
##install python-recsys

### first install dependencies

pip install csc-pysparse networkx divisi2

### then install recsys
git clone https://github.com/python-recsys/python-recsys.git
cd python-recsys/

python setup.py install
"""

Load recsys.algorithm, set VERBOSE = True, and import the SVD class

In [1]:
import recsys.algorithm
recsys.algorithm.VERBOSE = True
from recsys.algorithm.factorize import SVD

Let's look at the files

In [None]:
! ls movielens

Import 'movies.dat' to a 'movies' pandas dataframe. Make sure you name the columns, use the correct separator and define the index.

In [None]:
# A multi-character separator such as '::' needs the Python parsing engine
movies = pd.read_table('movielens/movies.dat', sep='::', engine='python', names=['ITEMID', 'Title', 'Genres'], index_col='ITEMID')

In [None]:
movies.head()
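
If you do not have the movielens folder at hand, the sketch below parses two sample lines in the same "::"-separated layout from an in-memory string. The variable names raw and toy_movies are only for illustration.

```python
import io
import pandas as pd

# Two sample lines in the "ITEMID::Title::Genres" layout of movies.dat.
raw = (u"1::Toy Story (1995)::Animation|Children's|Comedy\n"
       u"2::Jumanji (1995)::Adventure|Children's|Fantasy\n")

# The file has no header row, so the column names are passed explicitly.
# engine="python" is needed because the separator is longer than one character.
toy_movies = pd.read_csv(io.StringIO(raw), sep="::", engine="python",
                         names=["ITEMID", "Title", "Genres"],
                         index_col="ITEMID")

print(toy_movies.loc[1, "Title"])  # Toy Story (1995)
```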

Import 'ratings.dat' to a 'ratings' pandas dataframe. Make sure you name the columns, use the correct separator.

In [None]:
# A multi-character separator such as '::' needs the Python parsing engine
ratings = pd.read_table('movielens/ratings.dat', sep='::', engine='python', names=['UserID', 'MovieID', 'Rating', 'Timestamp'])

In [None]:
ratings.head()

Initialize an SVD instance

In [None]:
svd = SVD()

Populate it with the data from the ratings dataset, using the built in load_data method

In [None]:
svd.load_data(filename='./movielens/ratings.dat', sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})

Compute SVD

$M=U \Sigma V^T$:

In [None]:
k = 100
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True)
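
To get a feel for what keeping only k factors does, here is a small NumPy sketch of a truncated SVD on a made-up ratings matrix. It is independent of recsys and skips the options used above (min_values, mean centering, normalization), so the numbers are not comparable with the library's output.

```python
import numpy as np

# Made-up ratings: rows are movies, columns are users.
M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Full decomposition: M = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values to get a rank-k approximation.
k = 2
M_k = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])

print(np.round(M_k, 2))
# The Frobenius error equals the energy in the discarded singular values.
print(np.linalg.norm(M - M_k), np.sqrt(np.sum(s[k:] ** 2)))
```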

You can also save the output SVD model (in a zip file)

In [None]:
# svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/movielens')

Reload a saved model:

In [None]:
# svd2 = SVD(filename='/tmp/movielens')

Find the ITEMID number for "Toy Story (1995)"

In [None]:
movies[movies.Title == "Toy Story (1995)"]

Find the ITEMID number for "Bug's Life, A (1998)"

In [None]:
movies[movies.Title == "Bug's Life, A (1998)"]

Compute similarity between the two movies

In [None]:
ITEMID1 = 1    # Toy Story (1995)
ITEMID2 = 2355 # Bug's Life, A (1998)
print(svd.similarity(ITEMID1, ITEMID2))
# print(svd2.similarity(ITEMID1, ITEMID2)) to check the reloaded model

Get movies similar to Toy Story

In [None]:
svd.similar(ITEMID1)
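
One common way to compare items after an SVD is to take the cosine between their vectors in the k-dimensional latent space. The NumPy sketch below does this on a made-up matrix; it is an illustration of the idea under that assumption, not necessarily what recsys computes internally, and latent_similarity is a hypothetical helper.

```python
import numpy as np

# Made-up ratings: rows are movies, columns are users.
M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Each row is one movie expressed in the k-dimensional latent space.
item_factors = U[:, :k] * s[:k]

def latent_similarity(i, j):
    """Cosine similarity between movies i and j in the latent space."""
    a, b = item_factors[i], item_factors[j]
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(latent_similarity(0, 1))  # movies rated alike by the same users
print(latent_similarity(0, 2))  # movies liked by different users
```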

Predict rating for a given user and movie, $\hat{r}_{ui}$

In [None]:
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING)
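
A simple way to think about $\hat{r}_{ui}$ is as the matching entry of the rank-k reconstruction, clipped to the allowed rating range. The sketch below shows that idea with NumPy on a made-up matrix. It assumes no mean centering or normalization, so it is a simplification rather than a reproduction of recsys, and predict_rating is a hypothetical helper.

```python
import numpy as np

MIN_RATING, MAX_RATING = 0.0, 5.0

# Made-up ratings: rows are movies, columns are users.
M = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat = (U[:, :k] * s[:k]).dot(Vt[:k, :])  # rank-k reconstruction

def predict_rating(item, user):
    """Reconstructed rating, clipped to [MIN_RATING, MAX_RATING]."""
    return float(np.clip(M_hat[item, user], MIN_RATING, MAX_RATING))

# Predicted value next to the value stored in the input matrix.
print(predict_rating(0, 0), M[0, 0])
```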

In [None]:
# Actual rating stored in the input matrix, to compare with the prediction above
svd.get_matrix().value(ITEMID, USERID)

Recommend unrated movies to a user

In [None]:
svd.recommend(USERID, is_row=False)

Which users should see Toy Story? (i.e. which users who have not rated Toy Story would give it a high rating?)

In [None]:
svd.recommend(ITEMID)

Find out more here: [https://github.com/ocelma/python-recsys](https://github.com/ocelma/python-recsys)