# Recommendation System for Beer Advocate

In [60]:
import os
import urllib.request

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [61]:
file_path = 'https://www.dropbox.com/s/dzg4j2jolmpc8tb/beer_reviews.csv?dl=1'

df = pd.read_csv(file_path)

In [62]:
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586614 entries, 0 to 1586613
Data columns (total 13 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   brewery_id          1586614 non-null  int64  
 1   brewery_name        1586599 non-null  object 
 2   review_time         1586614 non-null  int64  
 3   review_overall      1586614 non-null  float64
 4   review_aroma        1586614 non-null  float64
 5   review_appearance   1586614 non-null  float64
 6   review_profilename  1586266 non-null  object 
 7   beer_style          1586614 non-null  object 
 8   review_palate       1586614 non-null  float64
 9   review_taste        1586614 non-null  float64
 10  beer_name           1586614 non-null  object 
 11  beer_abv            1518829 non-null  float64
 12  beer_beerid         1586614 non-null  int64  
dtypes: float64(6), int64(3), object(4)
memory usage: 157.4+ MB


In [64]:
print(f'unique beer styles: {len(df.beer_style.unique())}')
print(f'unique beers: {len(df.beer_beerid.unique())}')
print(f'unique breweries: {len(df.brewery_id.unique())}')
print(f'unique users: {len(df.review_profilename.unique())}')

unique beer styles: 104
unique beers: 66055
unique breweries: 5840
unique users: 33388


In [65]:
# Let us add a user_id column that maps to every user's review_profilename

df = df.assign(user_id=df['review_profilename'].astype('category').cat.codes)
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid,user_id
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986,30566
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213,30566
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215,30566
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969,30566
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883,23008


In [66]:
# Total reviews for each category
df.review_overall.value_counts()

4.0    582764
4.5    324385
3.5    301817
3.0    165644
5.0     91320
2.5     58523
2.0     38225
1.5     12975
1.0     10954
0.0         7
Name: review_overall, dtype: int64

In [67]:
# Total reviews for each type of beer
df.beer_name.value_counts()

90 Minute IPA                          3290
India Pale Ale                         3130
Old Rasputin Russian Imperial Stout    3111
Sierra Nevada Celebration Ale          3000
Two Hearted Ale                        2728
                                       ... 
Panther India Pale Ale                    1
Brewmaster's Special Brown Ale            1
Imperial Doppel Bock                      1
Wiltse's Paul Bunyan Ale                  1
HGH Part Duh                              1
Name: beer_name, Length: 56857, dtype: int64

In [68]:
# Remove users who have not rated more than once
# df = df.groupby('user_id').filter(lambda x: len(x) > 1)

# remove beers that have not been reviewed more than once
# df = df.groupby('beer_beerid').filter(lambda x: len(x) > 1)

In [69]:
train_len = int(df.shape[0] * 0.8)
train = df[:train_len]
test = df[train_len:]

# Metrics

RMSE = $\sqrt{\frac{\sum(\hat y - y^{2})}n}$

We plan to go ahead with RMSE. The lesser the error, the more we are sure about how a user would rate a particular beer.


The rating is between 0 to 5. So in a worst case scenario, our RMSE will be 5 (Ex: We consistently predict a rating of 0 and the actual rating is 5 for all users). The best case scenario, it will be zero (our rating and the actual rating is the same).

In [70]:
def rmse(y_pred, y_true):
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [71]:
def evaluate(estimate_f):    
    ids_to_estimate = zip(test.user_id, test.beer_beerid)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = test.review_overall
    return rmse(estimated, real)

## General notations in order to understand Content-based and CF formulas

`U is the set of users in our domain. It's size is |U|
I is the set of items in out domain. It's size is |I|
I(u) is the set of items that user u has rated.
-I(u) is the complement of I(u) i.e., the set of items user u has not rated.
U(i) is the set of users that have rated item i.
-U(i) is the complement of U(i). It's the set of users who have not rated for item i.
S(u, i) is a function that measures the utility of item i for user u.`

### Goal of recommendation systems:

$i^{*} = argmax_{i\in-I(u)} S(u,i), \forall u \in U$

Recommendation system can be summarized as finding that item that the user has not seen (-I(u)) which optimizes the notion of utility (metric) S(u, i)

# Content-based filtering


Recommend based on user's rating history. For example, if you want to know what I thought about the beer "Sausa Weizen", you'd look up my past ratings for all other beers, or may be for beers of that particular category or a group of categories and then come up with what you think would be my prediction for that beer.

For a baseline model, in order to predict the rating of a particular item ${i^{'}}$, we will look up all the items that the user has rated $I(u)$, evaluate the aggregation function, which in our case for the baseline model is the 'mean' and predict this value as the rating a user might give for any new beer. Not very intutive, but makes for a good baseline model.

$r_{u, i} = aggr_{i'\in I(u)}[r_{u, i'}]$

A simple example using mean as an aggregation function:

$r_{u,i} = \bar r_u = \frac {\sum_{i'\in I(u)} r_{u, i'}} {|I(u)|}$

In [75]:
# # Simple Content based filtering using mean ratings (baseline model)

def content_mean(user_id, beer_id):    
    user_condition = train.user_id == user_id
    ratings_by_user = train.loc[user_condition]
    
    return ratings_by_user.review_overall.mean()

print(f'RMSE for content based filtering: {evaluate(content_mean)}')

RMSE for content based filtering: 0.7269122601676697


In [76]:
# Simple Content based filtering using mean ratings

def content_mean_with_default_rating(user_id, beer_id):    
    user_condition = train.user_id == user_id
    ratings_by_user = train.loc[user_condition]
    
    if ratings_by_user.empty:
        return 3.0
    else:
        return ratings_by_user.review_overall.mean()

print(f'RMSE for content based filtering with default mean: {evaluate(content_mean_with_default_rating)}')

RMSE for content based filtering with default mean: 0.7359312674816878


# Collaborative Filtering

Recommend based on other user's rating histories. For Example: If you want to know how I might rate a beer that I have not tried, you could guess by seeing the ratings of other people who have tried that beer. Presumably people who are similar to me.
Here, the rating is based on aggregating other people's ratings (u') for a particular item.

$r_{u', i} = aggr_{u' \in U(i)}[r_{u', i}]$

A simple example using mean as an aggregation function:

$r_{u,i} = \bar r_i = \frac {\sum_{u' \in U(i)}}{|U(i)|} [r_{u',i}]$

In [74]:
# Simple collaborative filtering using mean ratings

def cf_mean(user_id, beer_id):
    user_condition = train.user_id != user_id
    item_condition = train.beer_id == beer_id
    ratings_by_others = train[user_condition & item_condition, 'review_overall'].mean()
    
    if ratings_by_others.empty:
        return 3.0
    else:
        ratings_by_others.review_overall.mean()
    
print(f'RMSE for collaborative filtering: {evaluate(content_mean)}')

RMSE for collaborative filtering: 0.7359312674816878


**Questions:** 

1) Cold-start - what if there are no previous ratings by the user? Is it ok to assume the rating is 0 or 3? <br />
2) Is it a good idea to delete all users who have not rated at all or those who have not rated more than once? Similarly, is it a good idea to delete all beers not rated or rated by more than one user? Because, in both the above cases, there are chances that user_id/beer_id might appear only in either train or test datasets. I tested by removing both, but RMSE increased from 0.72691226 to 0.72698229 <br />
3) Any tips to speed this up? It takes a lot of time to run. <br />
4) How to deal with imbalanced data in recommendation systems? For instance, only a subset of the users would have reviewed most of the beers.


References:

https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093 <br />
https://unatainc.github.io/pycon2015/