# Recommendation Systems




### Okay, so what is it?

"A classification system that looks at LOTS of transactional data and comes up tailored recommendations for a specific user -- where recommendations are relevant to the user's preferences"


### Goal of this mini-session:
- Build a minimal recommendation engine
- Know key terms associated with buildinga reco engine
- Know the common problems that a reco engine works to solve
- Create a starting point for learning about Recommendation systems


# Key Concepts
- Item: What's being recommended. 
    - Examples
        - Songs
        - Movie
        - Service
- User: A person who rates items and receives recommendations for new items.
    - The population of interested
- Rating: Some sort of associated preference of an item by a user. 
    - Examples:
        - 1 - 5
        - like / Dislike

Approaches
- Content-Based Filtering: Two parts, 1. A history of your item preferences. 2. An attributes table of all items. 
- Collaborative Filtering: Ask people who are similar to you what you like
- Hybrid Filtering: Many ways of combining content and collaborative filtering

Evaluation
- Root Mean Squared Error: "you can usually expect 68% of the y values to be within one r.m.s. error, and 95% to be within two r.m.s. errors of the predicted values. These approximations assume that the data set is football-shape"
    - [Review](http://statweb.stanford.edu/~susan/courses/s60/split/node60.html)


Distance Measures:
- Euclidian
$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum (x - y)^2}}$$
- Cosine
$$ sim(x,y) = \frac{(x . y)}{\sqrt{(x . x) (y . y)}} $$


<center>
**Cosine vs Euclidian**
![title](http://semanticvoid.com/images/cosine_similarity.png)
When to use each? One is based on angles, another is based on length. The difference is in magnitude (euclidean) versus direction (cosine).



# Today's Datasets
1. Movie Lens [Description](https://grouplens.org/datasets/movielens/)

Stable benchmark dataset. 100,000 ratings from 1000 users on 1700 movies. Released 4/1998.


2. Jester [Download](http://eigentaste.berkeley.edu/dataset/) [Joke Tool](http://eigentaste.berkeley.edu/index.html)

    - Data files are in .zip format, when unzipped, they are in Excel (.xls) format
    - Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").
    - One row per user.The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
    - The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
    
# Part 1: Reading the data
- Only a subset!

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

movielens_train = pd.read_csv('data/movielens_train.csv', index_col=0,encoding='latin-1')
movielens_test = pd.read_csv('data/movielens_test.csv', index_col=0,encoding='latin-1')
users = pd.read_table('data/users.dat',
                      sep='::', header=None,
                      names=['user_id', 'gender', 'age', 'occupation', 'zip'])




In [9]:
# Group data exploration

## Part 2a: Evaluation

In [34]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [35]:
movielens_test = movielens_test.iloc[1:10,:]

In [36]:
def score_reco(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)

## Part 2b: Similarity Functions:
To get you started, here are two similarity measures. The  higher the value, the more similar.



In [12]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

#pearson
#jaccard
#binjaccard


## Part 3: Recommendation Systems


### Always Recommend a 3

In [37]:
def always3(user_id, movie_id):
    return 3.0

In [38]:
print('RMSE for my estimate function: %s' % score_reco(always3))

RMSE for my estimate function: 1.20185042515


### Average Movie score

In [99]:
# write a movie mean test

def movie_mean(user_id, movie_id):
    movies = movielens_train[movielens_train.movie_id == movie_id]
    if len(movies) == 0:
        return 3
    else:
        return movies.rating.mean()

In [98]:
score_reco(movie_mean)

0.92624522083554384

### Collaborative Filtering: 

In [100]:
class collabo_group(object):
    
    def __init__(self,input):
        self.group = input
        
    def learn(self):
        self.means_by_group = movielens_train.pivot_table('rating', index='movie_id', columns=self.group)          
        
    def estimate(self, user_id, movie_id):
        
        if movie_id not in self.means_by_group.index: 
            return 3.0
        
        user_group = users.ix[user_id, self.group]
        if ~np.isnan(self.means_by_group.ix[movie_id, user_group]):
            return self.means_by_group.ix[movie_id, user_group]
        else:
            return self.means_by_group.ix[movie_id].mean()

In [101]:
collabo_age = collabo_group('age')
collabo_age.learn()
print('RMSE for gen: %s' % score_reco(collabo_age.estimate))

RMSE for gen: 0.996908802496


In [102]:
collabo_gen = collabo_group('gender')
collabo_gen.learn()
print('RMSE for gen: %s' % score_reco(collabo_gen.estimate))

RMSE for gen: 1.05883991959


In [108]:
class CollaboCustom:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        
        self.all_user_profiles = movielens_train.pivot_table('rating', index='movie_id', columns='user_id')

    def euclidean(self):
        self.func = euclidean

    def cosine(self):
        self.func = cosine

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        
        user_condition = movielens_train.user_id != user_id
        movie_condition = movielens_train.movie_id == movie_id
        
        ratings_by_others = movielens_train.loc[user_condition & movie_condition]
        
        if ratings_by_others.empty: 
            return 3.0
        
        ratings_by_others.set_index('user_id', inplace=True)
        
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        
        sims = their_profiles.apply(lambda profile: self.func(profile, user_profile), axis=0)
        
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)

In [110]:
reco_cus = CollaboCustom()
reco_cus.learn()
reco_cus.cosine()
print('RMSE for CollaboCustom: %s' % score_reco(reco_cus.estimate))

RMSE for CollaboCustom: 0.926245220836


# Homework:
- Either download the larger version of movie lens or jester datasets and create an content-based filter. Remember, a content based filter is simply using a user's own history of items. We look at the item's attributes and predict based off of that.

- Now try to combine content and collaborative filtering techniques.

- Are there any extensions that you can think of? Think of some of the techniques that we've used in past classes. Can these supervised or unsupervised methods be used in combination with what we've learned today?


### Additional resources + inspiration for this session:
- [Datasets](http://shuaizhang.tech/2017/03/15/Datasets-For-Recommender-System/)
- [Pandora’s Music Recommender](https://courses.cs.washington.edu/courses/csep521/07wi/prj/michael.pdf)
- [Text Book](http://www.springer.com/gp/book/9780387858203#aboutBook)
- [3 hour Tutorial](https://www.youtube.com/watch?v=F6gWjOc1FUs)
- [4 Hour Tutorial](https://www.slideshare.net/xamat/recommender-systems-machine-learning-summer-school-2014-cmu)
