# Collaborative-Based Model

#### The code implements two models, a baseline that uses the NormalPredictor algorithm and an improved model using Singular Value Decomposition (SVD) algorithm from the Surprise package.

#### Models 
The first part of the code starts by importing the necessary packages, loading a CSV file containing beer reviews into a pandas data frame and displaying the first 5 rows of the data.

The second part of the code creates a baseline model of the beer data using the NormalPredictor algorithm from the Surprise package. 

I first used the NormalPredictor algorithm as it is a simple and commonly used benchmark to see how well the data performs. It is then ran using the surprise package and a train-test split with the default training data as 75% and the testing data as 25%. 

I then perform the famous SVD algorithm, which also runs the default training and testing data. Which gives us the best mean average error and mean squared error scores. This model was then tested for overfitting with a cross-validation that ultimately stated a consistant model with each validation. I lastly run this model using a GridSearchCV as well as a RandomizedSearchCV to give us the best parameters possible. Each of their outputs were:

##### GridSearchCV 
{'n_factors': 50, 'n_epochs': 60, 'lr_all': 0.002, 'reg_all': 0.4}

##### RandomizedSearchCV
{'n_factors': 150, 'n_epochs': 40, 'lr_all': 0.005, 'reg_all': 0.4}


Performing a last model run of the SVD algorithm using these parameters, it was improved, but not by a great amount.

#### Recommendations
The first recommendation is performed for a user if they were to review a certain beer and provide them with 10 beers based off their rating of the beer provided. 

The next recommendation is performed for a user if they were to review a certain beer from a certain brewery. In this case, if they were to give it the best rating, their recommendations would include different beers from that brewery. 

In [1]:
#import necessary packages
import pandas as pd

In [2]:
# show all columns 
pd.set_option('display.max_columns', None)

In [3]:
# open beer_df.csv
beer_df = pd.read_csv('data/beer_df.csv', low_memory=False)
beer_df.head()

Unnamed: 0,address,categories,city,country,key,lat,long,brewery_name,phones,postalCode,province,websites,index,brewery_id,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid,review_year,review_month,beer_type
0,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1495017,735,2011-03-01 00:49:43,3.5,3.5,4.0,illidurit,American Double / Imperial IPA,3.5,3.5,21 Rock,9.7,66190,2011,3,IPA
1,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1495350,735,2008-12-04 19:03:15,4.0,4.0,4.0,magictrokini,American IPA,3.0,4.0,Harvest Moon,6.4,45648,2008,12,IPA
2,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1495733,735,2010-01-23 20:55:46,4.0,4.0,3.5,HapWifeHapLife,American IPA,4.0,4.0,21st Amendment IPA,7.0,20781,2010,1,IPA
3,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1501253,735,2010-04-08 18:58:54,4.0,3.5,4.5,pwoody11,Belgian Strong Dark Ale,4.0,4.0,Monk's Blood,8.3,52510,2010,4,Ale
4,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1501262,735,2010-03-14 16:30:10,4.0,3.5,4.0,metter98,Belgian Strong Dark Ale,4.0,4.5,Monk's Blood,8.3,52510,2010,3,Ale


## Baseline Model

In [4]:
# create baseline model of beer_df with normal_predictor and train test split from surprise
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

trainset, testset = train_test_split(data, test_size=.25)

model = NormalPredictor()
model.fit(trainset)
predictions = model.test(testset)

accuracy.mae(predictions)
accuracy.mse(predictions)

MAE:  0.6844
MSE: 0.7597


0.7597470573552234

## Performing model with Surprise Package

After running a baseline model, we will now run an SVD algoritm to improve our rmse and mae. 

In [5]:
# Running a model with SVD
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

trainset, testset = train_test_split(data, test_size=.25)

model = SVD()
model.fit(trainset)
predictions = model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.3151
MAE:  0.4176


0.41757312252042794

In [6]:
#Running a model with KNNBasic
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

trainset, testset = train_test_split(data, test_size=.25)

sim_options = {'name': 'cosine', 'user_based': True}
model = KNNBasic(sim_options=sim_options)
model.fit(trainset)
predictions = model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.3384
MAE:  0.4332


0.43323730472956196

## Tuned Model running GridSearchCV and RandomizedSearchCV

In [7]:
# run model svd using gridsearchcv
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

reader = Reader(rating_scale=(1.0, 5.0))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [20, 40, 60], 'lr_all': [0.002, 0.005, 0.008], 'reg_all': [0.4, 0.6, 0.8]}
gs = GridSearchCV(SVD, param_grid, measures=['mse', 'mae'], cv=3)

gs.fit(data)

print(gs.best_score['mse'])
print(gs.best_params['mse'])

print(gs.best_score['mae'])
print(gs.best_score['mae'])

0.3159573698039144
{'n_factors': 50, 'n_epochs': 60, 'lr_all': 0.002, 'reg_all': 0.4}
0.415345839982136
0.415345839982136


In [8]:
# Running a tuned SVD model with tuned parameters
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

trainset, testset = train_test_split(data, test_size=.25)

model = SVD(n_factors= 50, n_epochs= 60, lr_all= 0.002, reg_all= 0.4)
model.fit(trainset)
predictions = model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.3121
MAE:  0.4132


0.4131585497563155

In [9]:
# run model svd using randomizedsearchcv
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import RandomizedSearchCV

reader = Reader(rating_scale=(1.0, 5.0))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [20, 40, 60], 'lr_all': [0.002, 0.005, 0.008], 'reg_all': [0.4, 0.6, 0.8]}
gs = RandomizedSearchCV(SVD, param_grid, measures=['mse', 'mae'], cv=3)

gs.fit(data)

print(gs.best_score['mse'])
print(gs.best_params['mse'])

print(gs.best_score['mae'])
print(gs.best_score['mae'])

0.3157555897186146
{'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.4}
0.4151521015030181
0.4151521015030181


In [10]:
# Running a tuned SVD model with tuned parameters
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

trainset, testset = train_test_split(data, test_size=.25)

model = SVD(n_factors= 150, n_epochs= 40, lr_all= 0.005, reg_all= 0.4)
model.fit(trainset)
predictions = model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.3189
MAE:  0.4157


0.41574105211247253

There was not much improvement when using parameters provided from both the GridSearchCV and the RandomizedSearchCV algorithms. 

In [11]:
# Will perform a Cross Validation on the SVD model with the best parameters 
from surprise.model_selection import cross_validate

cross_validate(model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.5643  0.5597  0.5624  0.5513  0.5657  0.5607  0.0051  
MAE (testset)     0.4166  0.4155  0.4157  0.4100  0.4176  0.4151  0.0026  
Fit time          16.88   16.57   16.60   16.57   16.52   16.63   0.13    
Test time         0.18    0.18    0.12    0.19    0.12    0.16    0.03    


{'test_rmse': array([0.56428866, 0.55972212, 0.56238386, 0.55130214, 0.56565225]),
 'test_mae': array([0.41663443, 0.41550068, 0.41568534, 0.41000138, 0.41760135]),
 'fit_time': (16.877008199691772,
  16.569684982299805,
  16.595422983169556,
  16.568709135055542,
  16.52328896522522),
 'test_time': (0.18189382553100586,
  0.18473482131958008,
  0.12114882469177246,
  0.18572616577148438,
  0.12013792991638184)}

Cross-Validation of the SVD algorithm shows that the model is not overfitting the data. It also is showing that the model is performing accurate and consistantly with the data.

## Using the best model to create recommendation system

In [12]:
# use model to create recommendation system for user 
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_overall']], reader)

trainset, testset = train_test_split(data, test_size=.25)

algo = SVD(n_factors= 150, n_epochs= 40, lr_all= 0.005, reg_all= 0.4)
algo.fit(trainset)
predictions = algo.test(testset)

In [13]:
# create function to recommend beers to a user based on a specific beer 
def recommend_beers_from_beer(user_id, beer_name, n_recommendations, algo):
    user_ratings = beer_df[beer_df['review_profilename'] == user_id]
    user_beers = user_ratings['beer_name'].unique()
    beers_to_recommend = beer_df[~beer_df['beer_name'].isin(user_beers)]
    
    recommendations = beers_to_recommend.groupby('beer_name').agg({'review_overall': 'mean'}).sort_values('review_overall', ascending=False).head(n_recommendations)
    return recommendations

# get recommendations for user 'northyorksammy' based on the beer 'Sierra Nevada Pale Ale'
recommend_beers_from_beer('northyorksammy', 'Sierra Nevada Pale Ale', 10, algo)

Unnamed: 0_level_0,review_overall
beer_name,Unnamed: 1_level_1
Au Ciel,5.0
Best Bitter Ale With Cascade And Chinook Dry Hops,5.0
St. Patrick O'Sullivan's Irish Red,5.0
Stone Old Guardian Barley Wine Style Ale 1999,5.0
Kaldi Kreme,5.0
Sparnfarkel Smoked Porter,5.0
Bourbon Barley Wine,5.0
Cauldron Brew,5.0
Kona Belgian Triple,5.0
Anniversary Ale 2003,5.0


The user_rating column is provided to show the user has not given ratings for those beers yet. 

In [14]:
# create function to predict the rating of a beer for a user
def predict_rating(user_id, beer_name, algo):
    beer_id = beer_df[beer_df['beer_name'] == beer_name]['beer_beerid'].unique()[0]
    user_id = beer_df[beer_df['review_profilename'] == user_id]['review_profilename'].unique()[0]
    return algo.predict(user_id, beer_id).est

# print both the predicted rating and the actual rating for user 'northyorksammy' for beer 'Sierra Nevada Pale Ale'
print('Predicted Rating: ')
print(predict_rating('northyorksammy', 'Sierra Nevada Pale Ale', algo))
print(' ')
print('Actual Rating: ')
print(beer_df[(beer_df['review_profilename'] == 'northyorksammy') & (beer_df['beer_name'] == 'Sierra Nevada Pale Ale')]['review_overall'].unique()[0])


Predicted Rating: 
3.9099779957951526
 
Actual Rating: 
4.0


In [15]:
# create a function to recommend beers to a user only from a specific brewery
def recommend_beers_from_brewery(user_id, brewery_name, n_recommendations, algo):
    user_ratings = beer_df[beer_df['review_profilename'] == user_id]
    user_beers = user_ratings['beer_name'].unique()
    beers_to_recommend = beer_df[~beer_df['beer_name'].isin(user_beers)]
    beers_from_brewery = beers_to_recommend[beers_to_recommend['brewery_name'] == brewery_name]
    
    recommendations = beers_from_brewery.groupby('beer_name').agg({'review_overall': 'mean'}).sort_values('review_overall', ascending=False).head(n_recommendations)
    return recommendations

# get recommendations for user 'northyorksammy' based on the brewery 'Sierra Nevada Brewing Co.'
recommend_beers_from_brewery('northyorksammy', 'Sierra Nevada Brewing Co.', 10, algo)

Unnamed: 0_level_0,review_overall
beer_name,Unnamed: 1_level_1
Best Bitter Ale With Cascade And Chinook Dry Hops,5.0
Rhymes Wit - Beer Camp #20,4.666667
22 Bines Blonde IPA - Beer Camp #9,4.5
Red Perle Red Ale - Beer Camp #11,4.5
Liquid Sourdough (LSD) - Beer Camp #41,4.5
Knightro – Celtic Festival Beer,4.5
Knightro ESB - Beer Camp #23,4.5
"Que Syrah, Syrah!",4.5
Sierra Nevada Kölsch Style Ale,4.444444
Brewer's Blackbird Black IPA - Beer Camp #27,4.423077
