# Beer Hops Data: Purpose Classification Model with XG-Boost

**Data Files:** *cln_hops_profile.csv, cln_hops_brewvalues.csv*

**Original Source:** *https://beermaverick.com/hops/*  (Data retrieved via web-scraping)

------------------------------------------------------------

### Setup

**Objective:** Import the modules needed for the machine-learning models and read the CSV files into local dataframes for easier access.

In [1]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [2]:
# Read the cleaned CSV data into local dataframes
CLEAN_HOPS_PATH = './clean_data/cln_hops_brewvalues.csv'
CLEAN_HOPS_PROFILE_PATH = './clean_data/cln_hops_profile.csv'
hop_values_df = pd.read_csv(CLEAN_HOPS_PATH, index_col='Hop Name')
hop_profile_df = pd.read_csv(CLEAN_HOPS_PROFILE_PATH, index_col='Hop Name')

# Create a master dataframe indexed on hop name
master_df = hop_values_df.merge(hop_profile_df, left_index=True, right_index=True)

master_df.head(2)

Hop Name,Alpha Acid % - Min,Alpha Acid % - Max,Alpha Acid % - Avg,Beta Acid % - Min,Beta Acid % - Max,Beta Acid % - Avg,Alpha-Beta Ratio - Min,Alpha-Beta Ratio - Max,Alpha-Beta Ratio - Avg,Co-Humulone as % of Alpha - Min,...,violet,watermelon,whiskey,white_grape,white_wine,wild,wine,woody,yogurt,zest
Astra,7.0,10.0,8.5,4.0,6.0,5.0,1.0,3.0,2.0,26.0,...,False,False,False,False,True,False,False,False,False,False
Eclipse,15.7,19.0,17.4,5.9,9.0,7.5,2.0,3.0,2.0,33.0,...,False,False,False,False,False,False,False,False,False,False


### Pre-Processing & Feature-Engineering

**Objective:** Engineer features and prepare the dataframe, with the chosen predictor variables in the required formats, for the tree-based models.

Derive a Region column from Country. In this notebook it is dummy-encoded further down and used as a predictor of purpose.

In [3]:
# Create Region column from Country (used as a predictor after dummy encoding)
regions = {
'Australia': 'Australia',
'Canada': 'North America',
'China': 'Asia',
'Czech Republic': 'Europe',
'France': 'Europe',
'Germany': 'Europe',
'Japan': 'Asia',
'New Zealand': 'Australia',
'Poland': 'Europe',
'Slovenia': 'Europe',
'South Africa': 'Africa',
'Ukraine': 'Europe',
'United Kingdom': 'Europe',
'United States of America': 'North America'
}
master_df['Region'] = master_df.Country.map(lambda x: regions[x])
master_df.drop(columns=['Country'], inplace=True)

# Remove Asia records (EDA script showed few hops from China/Japan relative to other countries)
master_df = master_df[master_df['Region'] != 'Asia']

# Treat infinite values as missing, then drop records with any missing value
master_df = master_df.replace(float("inf"), np.nan).dropna()
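
The dictionary lookup `regions[x]` raises a `KeyError` as soon as a country is missing from the mapping. A minimal sketch of one way to surface unmapped countries first, using a hypothetical toy frame (`toy_df` and the country 'Belgium' are illustrative and not taken from the hops data):

```python
import pandas as pd

# Hypothetical subset of the mapping, for illustration only
regions = {
    'Germany': 'Europe',
    'United States of America': 'North America',
}

# Toy data (not from the hops dataset); 'Belgium' is deliberately unmapped
toy_df = pd.DataFrame(
    {'Country': ['Germany', 'Belgium', 'United States of America']},
    index=['Hop A', 'Hop B', 'Hop C'],
)

# Series.map with a dict yields NaN for keys that are missing from the dict
toy_df['Region'] = toy_df['Country'].map(regions)

# Collect the countries that did not receive a region
unmapped = sorted(toy_df.loc[toy_df['Region'].isna(), 'Country'].unique())
print(unmapped)
```

Listing the unmapped countries makes it an explicit choice whether to extend the dictionary or drop those rows.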

Separate X and Y data.

In [4]:
# Define array of dependent variable (outcome to predict)
Y_data = master_df.Purpose.copy()
Y_data.unique()

array(['Dual', 'Aroma', 'Bittering'], dtype=object)

In [5]:
# Define dataframe of independent variables to serve as potential predictors
X_data = master_df.copy()
X_data.drop(columns=['Purpose'], inplace=True)  # remove Y

Convert X and Y into the formats the models require.

In [6]:
# Label encode the multi-class outcome variable
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y_data)
Y_data = label_encoder.transform(Y_data)
Y_data[0:25]  # check encoding

array([2, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 2,
       0, 0, 0])
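
`LabelEncoder` stores its classes in sorted order, so the integer codes above correspond to alphabetical positions (which is why 'Dual' appears as 2). A small sketch, on a standalone toy list rather than `Y_data`, of recovering the code-to-label mapping:

```python
from sklearn.preprocessing import LabelEncoder

# The three purpose labels that appear in the notebook output
labels = ['Dual', 'Aroma', 'Bittering', 'Dual']

encoder = LabelEncoder().fit(labels)

# classes_ is sorted, so the integer code is the position in this array
code_to_label = dict(enumerate(encoder.classes_))
print(code_to_label)

# inverse_transform maps integer codes back to the original strings
decoded = encoder.inverse_transform([2, 0, 1])
print(list(decoded))
```

Decoding through the fitted encoder avoids keeping a separate hard-coded list of class names in sync with the encoding.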

In [7]:
# Create dummy variables for remaining categorical variables
X_data = pd.get_dummies(X_data, prefix=None, prefix_sep='_', dummy_na=False, columns=['Region'], sparse=False)
X_data.columns.tolist()[-3:]  # check dummy columns were created

['Region_Australia', 'Region_Europe', 'Region_North America']

In [8]:
# Convert boolean columns into ints 
bool_cols = [i for i in hop_profile_df.columns if i not in ['Country', 'Purpose']]
for col in bool_cols:
    X_data[col] = X_data[col].astype('bool')
    X_data[col] = X_data[col].astype('int')

Choose which columns to drop (that is, exclude as predictors).

In [9]:
# Drop out unwanted brew values columns (based on EDA from step3)
X_data.drop(columns=[
#     'Region',
    'Alpha Acid % - Min',
    'Alpha Acid % - Max',
#     'Alpha Acid % - Avg',
    'Beta Acid % - Min',
    'Beta Acid % - Max',
#     'Beta Acid % - Avg',
    'Alpha-Beta Ratio - Min',
    'Alpha-Beta Ratio - Max',
#     'Alpha-Beta Ratio - Avg',
    'Co-Humulone as % of Alpha - Min',
    'Co-Humulone as % of Alpha - Max',
#     'Co-Humulone as % of Alpha - Avg',
    'Total Oils (mL/100g) - Min',
    'Total Oils (mL/100g) - Max',
#     'Total Oils (mL/100g) - Avg',
    'Myrcene - Min',
    'Myrcene - Max',
#     'Myrcene - Avg',
    'Humulene - Min',
    'Humulene - Max',
#     'Humulene - Avg',
    'Caryophyllene - Min',
    'Caryophyllene - Max',
#     'Caryophyllene - Avg',
    'Farnesene - Min',
    'Farnesene - Max',
#     'Farnesene - Avg',
    'Other Oils - Min',
    'Other Oils - Max'
], inplace=True)

### Data-Partitioning

**Objective:** Split the dataset into a training set for fitting the models and a testing set for evaluating their performance.

In [10]:
# Hold out 30% of the records for testing (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, test_size=0.3, random_state=123)

print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))

127
55
127
55
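
With only 182 records and three classes, a purely random split can leave the test set with class proportions that differ from the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A sketch on toy labels (the 60/30/10 class mix is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (illustrative, not the hops data): 60/30/10 split
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# Per-class counts in the training and testing partitions
print(np.bincount(y_tr), np.bincount(y_te))
```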


### Model Fitting: Random Forest

**Objective:** Train a tree-based bagging ensemble and predict the hop purpose (the categorical outcome) on the test dataset.

In [11]:
# Instantiate classifier object with desired parameters
rf_model = RandomForestClassifier(
    max_depth=15, 
    random_state=123
)

# Fit the training data
rf_model.fit(X_train, y_train)

RandomForestClassifier(max_depth=15, random_state=123)

### Model Evaluation: Random Forest

**Objective:** Evaluate model based on the test data set results.

In [12]:
# Apply model on test set to make purpose predictions
y_pred = rf_model.predict(X_test)
predictions = [value for value in y_pred]

In [13]:
# Evaluate predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 74.55%
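
A single accuracy figure does not show which purposes get confused with each other. A possible follow-up is a confusion matrix and per-class report. The sketch below uses made-up toy labels, not the notebook's `y_test` and `y_pred`:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy encoded labels for illustration (0=Aroma, 1=Bittering, 2=Dual)
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 2, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall and F1 with readable class names
print(classification_report(
    y_true_toy, y_pred_toy,
    labels=[0, 1, 2],
    target_names=['Aroma', 'Bittering', 'Dual'],
    zero_division=0,
))
```

In the notebook, the same calls could be applied to `y_test` and `y_pred` directly.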


With the Random Forest algorithm, the accuracy is about 75% (74.55%). To see whether this can be improved, we fit a boosted algorithm next.

### Model Fitting: XG-Boost

**Objective:** Train a tree-based boosting ensemble and predict the hop purpose (the categorical outcome) on the test dataset.

In [14]:
# Instantiate classifier object with desired parameters
xg_model = XGBClassifier(
    base_score=0.5, 
    booster='gbtree', 
    colsample_bylevel=1,
    colsample_bynode=1, 
    colsample_bytree=1, 
    enable_categorical=False,
    eval_metric='logloss',
    gamma=0, 
    gpu_id=-1, 
    importance_type=None,
    interaction_constraints='', 
    learning_rate=0.3,
    max_delta_step=0,
    max_depth=15, 
    min_child_weight=1, 
    missing=np.nan,
    monotone_constraints='()', 
    n_estimators=200, 
    n_jobs=12,
    num_parallel_tree=1, 
    objective='multi:softprob', 
    predictor='auto',
    random_state=123, 
    reg_alpha=0, 
    reg_lambda=1, 
    scale_pos_weight=None,
    subsample=1, 
    tree_method='exact', 
    use_label_encoder=False,
    validate_parameters=1, 
    verbosity=None
)

# Fit the training data
xg_model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.3, max_delta_step=0,
              max_depth=15, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=12,
              num_parallel_tree=1, objective='multi:softprob', predictor='auto',
              random_state=123, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=None, subsample=1, tree_method='exact',
              use_label_encoder=False, validate_parameters=1, ...)

### Model Evaluation: XG-Boost

**Objective:** Evaluate model based on the test data set results.

In [15]:
# Apply model on test set to make purpose predictions
y_pred = xg_model.predict(X_test)
predictions = [value for value in y_pred]

In [16]:
# Evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 58.18%


### Model Analysis:

EDA showed strong relationships between hop purpose and the brew values. To get a better sense of this relationship, models were built to classify purpose using the remaining attributes as predictors. On the held-out test set, the Random Forest reached 74.55% accuracy and XG-Boost reached 58.18%, so boosting with these settings did not improve on bagging. A corresponding tool is built below.
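
Because the test set holds only 55 records, a comparison based on one split can be noisy. A rough sketch of k-fold cross-validation as a steadier estimate, assuming synthetic data from `make_classification` as a stand-in for `X_data` and `Y_data` (the scores it prints say nothing about the hops data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for X_data / Y_data (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=180, n_features=12, n_informative=6,
    n_classes=3, random_state=123,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
model = RandomForestClassifier(max_depth=15, random_state=123)

# One accuracy score per fold; the spread hints at how noisy a single split is
scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring='accuracy')
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

The same pattern should work with any scikit-learn-compatible classifier passed as `model`.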

### Hop Classification Tool
**Objective:** Design a tool that takes brew values, aroma tags, and region as input and outputs a predicted purpose classification.

#### How to use this tool:

1) Go to Beer Hops database webpage: https://beermaverick.com/hops/
2) Click on the hop whose purpose we want to predict.
3) Run the following cell-block and enter the corresponding brew values, aroma tags, and region, as shown in the example below:
- 17.4
- 7.5
- 12
- 35
- 2.3
- 42
- 1
- 9
- 0.5
- citrus, pine, mandarin
- Australia
4) Then execute the remaining cell-blocks to get a predicted purpose.
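
The entries in step 3 arrive as raw text, so they need to be cleaned before they reach the model. A minimal sketch of two hypothetical helpers (`parse_brew_value` and `parse_aromas` are illustrative names, not part of the notebook), assuming that 'NA' or a blank entry means unknown and that aroma tags are lower-case as in the column names:

```python
import math

def parse_brew_value(raw):
    """Convert a raw text entry to float; 'NA' (any case) or blank becomes NaN."""
    text = raw.strip()
    if text == '' or text.upper() == 'NA':
        return math.nan
    return float(text)

def parse_aromas(raw):
    """Split a comma-separated aroma string into clean, lower-case tags."""
    return [tag.strip().lower() for tag in raw.split(',') if tag.strip()]

print(parse_brew_value(' 17.4 '))
print(parse_aromas('citrus, pine, mandarin'))
```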

In [17]:
# Take in user input of brew values
print("Please input the following values. (Enter 'NA' if unknown)")
aa = input("Please enter the Alpha Acid % - Avg: ")
ba = input("Please enter the Beta Acid % - Avg: ")
abr = input("Please enter the Alpha-Beta Ratio - Avg: ")
ch = input("Please enter the Co-Humulone as % of Alpha - Avg: ")
to = input("Please enter the Total Oils (mL/100g) - Avg: ")
myr = input("Please enter the Myrcene - Avg: ")
hum = input("Please enter the Humulene - Avg: ")
car = input("Please enter the Caryophyllene - Avg: ")
far = input("Please enter the Farnesene - Avg: ")

# Take in user input for aromas
aroma_input = input("List all aromas in comma-separated format: ")

# Take in user input for hop region
region_input = input("Please enter the region: ")

Please input the following values. (Enter 'NA' if unknown)


Please enter the Alpha Acid % - Avg:  17.4
Please enter the Beta Acid % - Avg:  7.5
Please enter the Alpha-Beta Ratio - Avg:  12
Please enter the Co-Humulone as % of Alpha - Avg:  35
Please enter the Total Oils (mL/100g) - Avg:  2.3
Please enter the Myrcene - Avg:  42
Please enter the Humulene - Avg:  1
Please enter the Caryophyllene - Avg:  9
Please enter the Farnesene - Avg:  0.5
List all aromas in comma-separated format:  citrus, pine, mandarin
Please enter the region:  Australia


In [20]:
# Create dataframe from input data
input_val_dict = {
    'Alpha Acid % - Avg': aa, 
    'Beta Acid % - Avg': ba, 
    'Alpha-Beta Ratio - Avg': abr,
    'Co-Humulone as % of Alpha - Avg': ch, 
    'Total Oils (mL/100g) - Avg': to, 
    'Myrcene - Avg': myr, 
    'Humulene - Avg': hum, 
    'Caryophyllene - Avg': car, 
    'Farnesene - Avg': far
}

input_df = pd.DataFrame()
for col in input_val_dict.keys():
    if input_val_dict[col] != 'NA':
        input_df[col] = [float(input_val_dict[col])]
    else:
        input_df[col] = [np.nan]  

for col in X_data.columns.tolist():
    if col not in input_df.columns:
        input_df[col] = 0

for col in aroma_input.split(','):
    col = col.strip()  # tolerate any whitespace around each aroma tag
    if col:  # skip empty entries such as a trailing comma
        input_df[col] = 1

input_df['Region_' + region_input] = 1
    
input_df

,Alpha Acid % - Avg,Beta Acid % - Avg,Alpha-Beta Ratio - Avg,Co-Humulone as % of Alpha - Avg,Total Oils (mL/100g) - Avg,Myrcene - Avg,Humulene - Avg,Caryophyllene - Avg,Farnesene - Avg,alfalfa,...,white_wine,wild,wine,woody,yogurt,zest,Region_Africa,Region_Australia,Region_Europe,Region_North America
0,17.4,7.5,12.0,35.0,2.3,42.0,1.0,9.0,0.5,0,...,0,0,0,0,0,0,0,1,0,0
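
An aroma tag or region name that does not match a training column would add a new column to `input_df`, and the fitted models would then see a different feature set than they were trained on. A sketch of one way to guard against this with `DataFrame.reindex`, using a hypothetical short column list in place of `X_data.columns` and a deliberately misspelled tag:

```python
import pandas as pd

# Hypothetical training columns standing in for X_data.columns
train_columns = ['Alpha Acid % - Avg', 'citrus', 'pine', 'Region_Australia', 'Region_Europe']

# A single-row input with a misspelled aroma ('pinee') and a shuffled order
raw_input = pd.DataFrame([{
    'Region_Australia': 1,
    'pinee': 1,
    'citrus': 1,
    'Alpha Acid % - Avg': 17.4,
}])

# Columns the model has never seen; worth reporting rather than silently dropping
unknown = [c for c in raw_input.columns if c not in train_columns]

# reindex keeps only training columns, in training order, filling absent ones with 0
aligned = raw_input.reindex(columns=train_columns, fill_value=0)

print(unknown)
print(aligned)
```

Reporting `unknown` gives the user a chance to correct a typo before a prediction is made.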


In [21]:
# Random Forest scored higher on the test set, so it is used when the input is complete.
# XG-Boost handles missing values natively (missing=np.nan), so it is used when any input is NaN.
def purpose_classifier(input_df):
    purpose = ['Aroma', 'Bittering', 'Dual']  # order matches the label encoding
    if input_df.isna().sum().sum() > 0:
        pred = xg_model.predict(input_df)
    else:
        pred = rf_model.predict(input_df)
    predicted_purpose = purpose[pred[0]]
    print("Predicted Purpose: ", predicted_purpose)
    
purpose_classifier(input_df)

Predicted Purpose:  Dual
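
A bare label hides how confident the model is. As a possible extension, `predict_proba` can report a probability per class. The sketch below trains a small Random Forest on made-up one-feature data, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data (illustrative only): one feature, three encoded classes
X_toy = np.array([[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]])
y_toy = np.array([0, 0, 1, 1, 2, 2])
class_names = ['Aroma', 'Bittering', 'Dual']  # same order as the sorted label encoding

clf = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_toy, y_toy)

# predict_proba returns one row per sample, one column per class in clf.classes_
proba = clf.predict_proba(np.array([[5.1]]))[0]
for name, p in zip(class_names, proba):
    print("%-10s %.2f" % (name, p))
```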


This tool can come in handy for trying to figure out the purpose of a hop and can be customized in future implementations to further filter for only necessary parameters.