# Content-Based Model / Ensemble Model

#### The following code uses the KNNBasic algorithm from the Surprise package in Python for content filtering.

First, the required packages are imported: pandas, and from surprise the KNNBasic, Dataset, Reader, accuracy and train_test_split components. The beer_df.csv dataset is then loaded into a pandas DataFrame called beer_df with pd.read_csv. It contains details of beer reviews, including brewery name, beer name, review time, reviewer's name, location, and ratings for aroma, taste, appearance, palate and overall impression. The first five rows of the dataset are then displayed.

Content filtering is carried out four times, once for each of the review features in the dataset:
- review_aroma
- review_appearance
- review_palate
- review_taste

In each instance, a Reader object is created with a rating scale of 1 to 5, and a KNNBasic model is instantiated with cosine similarity. Setting user_based to False makes the algorithm compute similarities between beers (item-based) rather than between users. The dataset is then split into 75% training data and 25% test data. The KNNBasic model is trained on the training set and used to predict ratings for the test set.
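To make the item-based idea concrete, here is a minimal pure-Python sketch of cosine similarity between beers and a similarity-weighted prediction. It is an illustration only, not Surprise's implementation: the toy ratings, the helper names `cosine_sim` and `predict`, and the choice of `k=2` are all assumptions made for this example.

```python
from math import sqrt

# Toy ratings (invented for illustration): user -> {beer: aroma score}
ratings = {
    "ann": {"IPA A": 4.0, "IPA B": 4.5, "Lager C": 2.0},
    "bob": {"IPA A": 3.5, "IPA B": 4.0, "Lager C": 3.0},
    "cat": {"IPA A": 4.5, "Lager C": 2.5},
}

def cosine_sim(item_x, item_y):
    """Cosine similarity between two beers over the users who rated both."""
    common = [u for u, r in ratings.items() if item_x in r and item_y in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_x] * ratings[u][item_y] for u in common)
    norm_x = sqrt(sum(ratings[u][item_x] ** 2 for u in common))
    norm_y = sqrt(sum(ratings[u][item_y] ** 2 for u in common))
    return dot / (norm_x * norm_y)

def predict(user, item, k=2):
    """Similarity-weighted mean of the user's ratings on the k most similar beers."""
    neighbours = sorted(
        ((cosine_sim(item, other), score)
         for other, score in ratings[user].items() if other != item),
        reverse=True,
    )[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None  # no usable neighbours for this user/beer pair
    return sum(sim * score for sim, score in neighbours) / total_sim

# "cat" has not rated "IPA B"; estimate it from the beers she did rate
estimate = predict("cat", "IPA B")
print(round(estimate, 3))
```

Because "IPA B" is rated more like "IPA A" than like "Lager C" by the users who tried both, the estimate is pulled towards the score "cat" gave to "IPA A".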

The accuracy of each model is measured with the mean squared error (MSE) and mean absolute error (MAE), computed by the accuracy module from the surprise package. Both values are printed for each model. The features are then combined in an Ensemble Model, which averages the four models' predictions with equal weights (0.25 each) and compares the result to the mean of the reviewer's four actual scores.
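As a quick reference for the two metrics, the following sketch computes MSE and MAE by hand. The (predicted, actual) pairs are invented for illustration, and the function names are local helpers, not the Surprise API.

```python
def mse(pairs):
    """Mean squared error over (predicted, actual) pairs."""
    return sum((est - actual) ** 2 for est, actual in pairs) / len(pairs)

def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(est - actual) for est, actual in pairs) / len(pairs)

# Invented example: three predictions against an actual score of 4.0
pairs = [(3.5, 4.0), (4.0, 4.0), (3.0, 4.0)]

print("MSE:", mse(pairs))  # errors 0.5, 0.0, 1.0 -> squared 0.25, 0.0, 1.0
print("MAE:", mae(pairs))  # mean of 0.5, 0.0, 1.0
```

MSE penalizes the single large miss more heavily than MAE does, which is why the two metrics can rank models differently.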

#### Perform a Content-Based Model using the only feature that is not a review: ABV

Lastly, a final Content-Based Model is run using the ABV feature, which is the only feature that is not a review. This model is more accurate than the review-feature models, with the lowest MSE and MAE values. Its results are then compared to the Ensemble Model, which combines all four review features. The Ensemble Model has a slightly higher error (an average absolute difference of 0.299, versus an MAE of 0.273 for the KNNBasic ABV model), but is still a highly accurate model.

The ABV feature was chosen because it is numerical and also correlates with the type of beer. For example, a beer with a higher ABV is likely to be an IPA, while a beer with a lower ABV is likely to be a lager. This feature is included in the dataset and shown in the visualizations.

In [1]:
#import necessary packages
import pandas as pd

In [2]:
# show all columns 
pd.set_option('display.max_columns', None)

In [3]:
# open beer_df.csv
beer_df = pd.read_csv('data/beer_df.csv', low_memory=False)
beer_df.head()

Unnamed: 0,address,categories,city,country,key,lat,long,brewery_name,phones,postalCode,province,websites,index,brewery_id,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid,review_year,review_month,beer_type
0,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1495017,735,2011-03-01 00:49:43,3.5,3.5,4.0,illidurit,American Double / Imperial IPA,3.5,3.5,21 Rock,9.7,66190,2011,3,IPA
1,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1495350,735,2008-12-04 19:03:15,4.0,4.0,4.0,magictrokini,American IPA,3.0,4.0,Harvest Moon,6.4,45648,2008,12,IPA
2,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1495733,735,2010-01-23 20:55:46,4.0,4.0,3.5,HapWifeHapLife,American IPA,4.0,4.0,21st Amendment IPA,7.0,20781,2010,1,IPA
3,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1501253,735,2010-04-08 18:58:54,4.0,3.5,4.5,pwoody11,Belgian Strong Dark Ale,4.0,4.0,Monk's Blood,8.3,52510,2010,4,Ale
4,2010 Williams St,Brewery,San Leandro,US,us/ca/sanleandro/2010williamsst,37.711807,-122.177658,21st Amendment Brewery,5105952111,94577,CA,http://21st-amendment.com,1501262,735,2010-03-14 16:30:10,4.0,3.5,4.0,metter98,Belgian Strong Dark Ale,4.0,4.5,Monk's Blood,8.3,52510,2010,3,Ale


In [4]:
#Running a model with KNNBasic and review_aroma as the only feature
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_aroma']], reader)

trainset, testset = train_test_split(data, test_size=.25)

sim_options = {'name': 'cosine', 'user_based': False}
aroma_model = KNNBasic(sim_options=sim_options)
aroma_model.fit(trainset)
predictions = aroma_model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.3445
MAE:  0.4412


0.44121241956569396

In [5]:
#Running a model with KNNBasic and review_appearance as the only feature
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_appearance']], reader)

trainset, testset = train_test_split(data, test_size=.25)

sim_options = {'name': 'cosine', 'user_based': False}
appear_model = KNNBasic(sim_options=sim_options)
appear_model.fit(trainset)
predictions = appear_model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.2629
MAE:  0.3791


0.3790653146179316

Of the four review features, review_appearance gave the best-performing model, with the lowest MSE (0.2629) and MAE (0.3791).

In [6]:
#Running a model with KNNBasic and review_palate as the only feature
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_palate']], reader)

trainset, testset = train_test_split(data, test_size=.25)

sim_options = {'name': 'cosine', 'user_based': False}
palate_model = KNNBasic(sim_options=sim_options)
palate_model.fit(trainset)
predictions = palate_model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.3273
MAE:  0.4225


0.4224549064633733

In [7]:
#Running a model with KNNBasic and review_taste as the only feature
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'review_taste']], reader)

trainset, testset = train_test_split(data, test_size=.25)

sim_options = {'name': 'cosine', 'user_based': False}
taste_model = KNNBasic(sim_options=sim_options)
taste_model.fit(trainset)
predictions = taste_model.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.3926
MAE:  0.4637


0.4636980886869683

In [8]:
# Combine the four single-feature models into one hybrid (ensemble) model
def hybrid_model(username, beer):
    # Get predictions from all four models
    aroma_prediction = aroma_model.predict(username, beer).est
    appear_prediction = appear_model.predict(username, beer).est
    palate_prediction = palate_model.predict(username, beer).est
    taste_prediction = taste_model.predict(username, beer).est

    prediction = (aroma_prediction * 0.25) + (appear_prediction * 0.25) + (palate_prediction * 0.25) + (taste_prediction * 0.25)

    return prediction

In [9]:
# Get a prediction using the hybrid model
username = "magictrokini"
beer = "Harvest Moon"
prediction = hybrid_model(username, beer)
mask = (beer_df['review_profilename'] == username) & (beer_df['beer_name'] == beer)
row = beer_df.loc[mask]

actual_rating = row[['review_aroma', 'review_appearance', 'review_palate', 'review_taste']].mean(axis=1).values[0]

print("Predicted rating:", prediction)
print("Actual rating:", actual_rating)

Predicted rating: 3.690625
Actual rating: 3.75
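The arithmetic behind this comparison can be checked by hand. The sketch below uses the four actual scores for this reviewer and beer from the second row of the table above, together with the predicted value printed by the cell. The helper `weighted_ensemble` is a hypothetical generalization of the 0.25-weighted sum, written for this example only.

```python
def weighted_ensemble(estimates, weights=None):
    """Combine per-feature values; equal weights reproduce the plain mean."""
    if weights is None:
        weights = {name: 1 / len(estimates) for name in estimates}
    return sum(estimates[name] * weights[name] for name in estimates)

# Actual scores by magictrokini for Harvest Moon (row 1 of beer_df.head())
actual_scores = {"aroma": 4.0, "appearance": 4.0, "palate": 3.0, "taste": 4.0}
actual_rating = weighted_ensemble(actual_scores)

# Value printed by the hybrid model in the cell above
predicted_rating = 3.690625

abs_error = abs(predicted_rating - actual_rating)
print("Actual rating:", actual_rating)
print("Absolute error:", round(abs_error, 6))
```

Passing a different `weights` dictionary would let a stronger single-feature model count for more than 0.25. Whether that helps here is untested.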


In [10]:
# Average absolute difference between hybrid predictions and actual mean ratings
differences = []

# Loop through each row in beer_df (this includes rows the four models were trained on)
for index, row in beer_df.iterrows():
    username = row['review_profilename']
    beer = row['beer_name']
    prediction = hybrid_model(username, beer)
    actual_rating = row[['review_aroma', 'review_appearance', 'review_palate', 'review_taste']].mean()
    difference = abs(prediction - actual_rating)
    differences.append(difference)

avg_difference = sum(differences) / len(differences)

print("Average difference between predicted & actual ratings:", avg_difference)

Average difference between predicted & actual ratings: 0.29899211082265914


# Creating a Content-Based Model with beer_abv

ABV stands for Alcohol by Volume.

This metric gives the amount of alcohol in the beer. Many reviewers rated the taste of high-ABV beers lower than that of low-ABV beers, as the taste of alcohol in a beer can make it unpleasant to many.

#### An ABV range column is created with the following bins:
- 1 (Low ABV) is 0-4.9%
- 2 (Medium ABV) is 5-9.9%
- 3 (High ABV) is 10-14.9%
- 4 (Very High ABV) is 15-20%
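One detail worth knowing about these bins: pd.cut closes each interval on the right by default, so with edges 0, 4.9, 9.9, 14.9 and 20 the bins are (0, 4.9], (4.9, 9.9], (9.9, 14.9] and (14.9, 20]. An ABV of exactly 0, or anything above 20, gets no label. The sketch below mirrors that rule with the standard library. It is a hand-written stand-in for illustration, not pandas itself, and `abv_range` is a hypothetical helper name.

```python
from bisect import bisect_left

# Same edges as the bins listed above
EDGES = [0, 4.9, 9.9, 14.9, 20]

def abv_range(abv):
    """Return the bin label 1-4 for an ABV in (0, 20], or None outside it."""
    idx = bisect_left(EDGES, abv)
    if idx == 0 or idx == len(EDGES):
        return None  # at or below the first edge, or above the last edge
    return idx

# A few boundary cases; 9.7 is the ABV of "21 Rock" in the table above
examples = {abv: abv_range(abv) for abv in (0.0, 4.9, 4.95, 9.7, 20.0, 20.5)}
print(examples)
```

Note that a value such as 4.95 lands in bin 2, since the gap between 4.9 and 5 in the listing belongs to the Medium range under this rule.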

In [11]:
# create new column of abv ranges with 0-4.9 being 1, 5-9.9 being 2, 10-14.9 being 3, 15-20 being 4
beer_df['abv_range'] = pd.cut(beer_df['beer_abv'], bins=[0, 4.9, 9.9, 14.9, 20], labels=[1, 2, 3, 4])

In [12]:
beer_df['abv_range'].value_counts()

2    128481
3     19995
1     11924
4        51
Name: abv_range, dtype: int64

In [13]:
# Running a model with NormalPredictor 
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

trainset, testset = train_test_split(data, test_size=.25)

abv_base = NormalPredictor()
abv_base.fit(trainset)
predictions = abv_base.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.3922
MAE:  0.4821


0.4820854582579

In [14]:
#Running a model with SVD
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

trainset, testset = train_test_split(data, test_size=.25)

svd_abv = SVD()
svd_abv.fit(trainset)
predictions = svd_abv.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.0074
MAE:  0.0397


0.03966755976104625

In [15]:
#Running a model with KNNBasic
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

trainset, testset = train_test_split(data, test_size=.25)

sim_options = {'name': 'cosine', 'user_based': False}
knn_abv = KNNBasic(sim_options=sim_options)
knn_abv.fit(trainset)
predictions = knn_abv.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.2084
MAE:  0.2731


0.27310354429958966

In [16]:
# Perform a grid search to find the best parameters for SVD
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import GridSearchCV

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [20, 40, 60], 'lr_all': [0.002, 0.005, 0.008], 'reg_all': [0.4, 0.6, 0.8]}
gs = GridSearchCV(SVD, param_grid, measures=['mse', 'mae'], cv=3)

gs.fit(data)

# Print the best parameters
print(gs.best_params['mse'])
print(gs.best_score['mse'])
print(gs.best_params['mae'])
print(gs.best_score['mae'])

{'n_factors': 50, 'n_epochs': 60, 'lr_all': 0.005, 'reg_all': 0.4}
0.021008812027913725
{'n_factors': 50, 'n_epochs': 60, 'lr_all': 0.002, 'reg_all': 0.4}
0.0801464956771105


In [17]:
# Refit SVD with the best parameters (by MSE) found by the grid search
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

trainset, testset = train_test_split(data, test_size=.25)

svd_abv = SVD(n_factors= 50, n_epochs= 60, lr_all= 0.005, reg_all= 0.4)
svd_abv.fit(trainset)
predictions = svd_abv.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.0205
MAE:  0.0799


0.07989296676245598

In [18]:
# Perform a randomized search to find the best parameters for SVD
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import RandomizedSearchCV

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [20, 40, 60], 'lr_all': [0.002, 0.005, 0.008], 'reg_all': [0.4, 0.6, 0.8]}
rs = RandomizedSearchCV(SVD, param_grid, measures=['mse', 'mae'], cv=3)

rs.fit(data)

# Print the best parameters
print(rs.best_params['mse'])
print(rs.best_score['mse'])
print(rs.best_params['mae'])
print(rs.best_score['mae'])

{'n_factors': 150, 'n_epochs': 60, 'lr_all': 0.008, 'reg_all': 0.4}
0.021678054735405575
{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.4}
0.08089650695328027


In [19]:
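# Refit SVD with n_factors=50 and n_epochs=20; note that these values differ
# from both best_params sets reported by the randomized search above
from surprise import SVD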
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

trainset, testset = train_test_split(data, test_size=.25)

svd_abv = SVD(n_factors= 50, n_epochs= 20, lr_all= 0.005, reg_all= 0.4)
svd_abv.fit(trainset)
predictions = svd_abv.test(testset)

accuracy.mse(predictions)
accuracy.mae(predictions)

MSE: 0.0223
MAE:  0.0805


0.08051025500699642

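Tuning the parameters did not improve accuracy. The grid-search parameters gave an MAE of 0.0799 and the second tuned configuration gave 0.0805, while SVD with default parameters gave the lowest MAE, 0.0397. All reg_all values in the search grid (0.4 to 0.8) are far larger than Surprise's default of 0.02, so the default configuration was never among the candidates.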
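The two searches also differ in how much of the parameter space they cover. The sketch below counts the work involved, using the param_grid from the cells above. It assumes the randomized search was run with its documented default of 10 sampled candidates (n_iter was not set in the cell), and the sampling shown here is only a stand-in for what the library does internally.

```python
from itertools import product
import random

param_grid = {
    "n_factors": [50, 100, 150],
    "n_epochs": [20, 40, 60],
    "lr_all": [0.002, 0.005, 0.008],
    "reg_all": [0.4, 0.6, 0.8],
}
cv = 3

# Grid search: every combination of the listed values
names = list(param_grid)
all_combos = [dict(zip(names, values)) for values in product(*param_grid.values())]
grid_fits = len(all_combos) * cv

# Randomized search: only a sample of the combinations (n_iter is an assumption)
n_iter = 10
rng = random.Random(0)
sampled = rng.sample(all_combos, n_iter)
random_fits = n_iter * cv

print("Grid search:", len(all_combos), "candidates,", grid_fits, "fits")
print("Randomized search:", len(sampled), "candidates,", random_fits, "fits")
```

Since the randomized search sees only a fraction of the grid, it is not surprising that it reports a different, slightly worse best combination than the exhaustive search.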

In [20]:
# Final check for overfitting: run a 5-fold cross-validation
from surprise.model_selection import cross_validate

cross_validate(svd_abv, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.1486  0.1492  0.1474  0.1461  0.1461  0.1475  0.0012  
MAE (testset)     0.0796  0.0805  0.0790  0.0788  0.0787  0.0793  0.0007  
Fit time          3.92    3.76    3.78    3.86    3.80    3.82    0.06    
Test time         0.28    0.11    0.12    0.27    0.13    0.18    0.08    


{'test_rmse': array([0.14857856, 0.14915392, 0.14742349, 0.14610562, 0.14610905]),
 'test_mae': array([0.07958769, 0.08052997, 0.07904603, 0.07878913, 0.07868731]),
 'fit_time': (3.9161670207977295,
  3.7605748176574707,
  3.7778780460357666,
  3.8578970432281494,
  3.798410177230835),
 'test_time': (0.2810943126678467,
  0.11377096176147461,
  0.11512494087219238,
  0.2718632221221924,
  0.13295793533325195)}
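The cross-validation reports RMSE, while the earlier cells reported MSE, so the two need to be put on the same scale before comparing. A small worked check, using only numbers printed above (the hold-out MSE of 0.0223 from the previous SVD fit and the five fold RMSE values):

```python
from math import sqrt

holdout_mse = 0.0223  # MSE printed by the preceding SVD hold-out run
cv_rmse = [0.1486, 0.1492, 0.1474, 0.1461, 0.1461]  # per-fold RMSE from the table

holdout_rmse = sqrt(holdout_mse)            # convert MSE to RMSE
cv_mean_rmse = sum(cv_rmse) / len(cv_rmse)  # mean RMSE over the five folds
gap = abs(holdout_rmse - cv_mean_rmse)

print("Hold-out RMSE:", round(holdout_rmse, 4))
print("CV mean RMSE: ", round(cv_mean_rmse, 4))
print("Gap:          ", round(gap, 4))
```

The hold-out result sits within about 0.002 of the cross-validated mean, and the folds themselves vary very little, which suggests the single train/test split was not an unusually favorable one.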

#### Recommendation System

In [21]:
# use model to create recommendation system for user 
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(beer_df[['review_profilename', 'beer_name', 'abv_range']], reader)

trainset, testset = train_test_split(data, test_size=.25)

algo = SVD(n_factors= 150, n_epochs= 40, lr_all= 0.005, reg_all= 0.4)
algo.fit(trainset)
predictions = algo.test(testset)

In [22]:
# modify function to print results as a dataframe
def recommend_beers(beer_name):
    beer_id = beer_df[beer_df['beer_name'] == beer_name]['beer_beerid'].iloc[0]
    beer_ratings = beer_df[beer_df['beer_beerid'] == beer_id][['review_profilename', 'beer_name', 'abv_range']]
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(beer_ratings, reader)
    
    trainset = data.build_full_trainset()
    algo = SVD(n_factors= 150, n_epochs= 40, lr_all= 0.005, reg_all= 0.4)
    algo.fit(trainset)
    
    # Get list of all beer names
    all_beer_names = beer_df['beer_name'].unique()
    
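    # Create a list of tuples that contains beer names and predicted ratings.
    # Caveat: predict() expects raw ids, but here it receives an inner uid and a
    # numeric beer_beerid, while the trainset was built on beer_name. Both ids are
    # unknown to the model, so every beer gets the same default estimate and the
    # stable sort below keeps the original order of all_beer_names.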
    predicted_ratings = []
    for name in all_beer_names:
        iid = beer_df[beer_df['beer_name'] == name]['beer_beerid'].iloc[0]
        uid = trainset.to_inner_uid(beer_df['review_profilename'].iloc[0])
        prediction = algo.predict(uid, iid, verbose=False)
        predicted_ratings.append((name, prediction.est))
    
    predicted_ratings.sort(key=lambda x: x[1], reverse=True)
    
    recommended_beers = [beer[0] for beer in predicted_ratings[:10]]
    return pd.DataFrame(recommended_beers, columns=['Recommended Beers'])

recommend_beers('Sierra Nevada Pale Ale')

Unnamed: 0,Recommended Beers
0,21 Rock
1,Harvest Moon
2,21st Amendment IPA
3,Monk's Blood
4,Hell Or High Watermelon Wheat Beer
5,Bitter American
6,Ugly Sweater IPA
7,Lost Sailor Imperial Pilsner
8,Beer School
9,Blind Lust
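A ranking like this is only informative when the predicted scores differ between beers. In Surprise, the predict method takes raw user and item ids (the values as they appear in the DataFrame columns used to build the trainset), and ids the model has not seen fall back to a default estimate. The sketch below shows the general top-N pattern independently of any library. The function `top_n`, the toy `scores` table and `toy_predict` are hypothetical and stand in for a fitted model's prediction call.

```python
def top_n(predict_fn, user, candidates, already_rated, n=10):
    """Rank unseen candidate items for one user by predicted score."""
    scored = [
        (predict_fn(user, item), item)
        for item in candidates
        if item not in already_rated  # do not recommend what the user has rated
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:n]]

# Toy stand-in for a fitted model: fixed scores per (user, beer), invented values
scores = {
    ("ann", "Beer A"): 4.2,
    ("ann", "Beer B"): 3.1,
    ("ann", "Beer C"): 4.8,
    ("ann", "Beer D"): 2.5,
}

def toy_predict(user, item):
    # Unknown pairs fall back to a neutral default, as a real model might
    return scores.get((user, item), 3.0)

recommended = top_n(
    toy_predict, "ann",
    candidates=["Beer A", "Beer B", "Beer C", "Beer D"],
    already_rated={"Beer A"},
    n=2,
)
print(recommended)
```

With a fitted Surprise model, `predict_fn` could wrap a call that passes the username and beer name directly and returns the estimate, provided the model was trained on ratings covering many users and beers.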
