# Coffee Data: Classification model training ☕☕☕

🕵🏿‍♀️ To break the problem down we are going to look at a classification problem: predicting if a give coffee sample will have a `total_cup_points` of over 85 given the green and processing data.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from data_prep import handle_na_values, split_data, convert_bag_weight, total_points_over_85
from train import train_logistic_regression, train_decision_tree, train_random_forest, check_feature_importance
from evaluate import validate_model, print_model_evaluation, model_evaluation

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
df = pd.read_csv("../data/merged_data_cleaned.csv",  index_col=0)

## Data preparation

- make column names consistant
- replace some na values and remove rows with other na values
- convert bag weight
- define numerical and categorical features
- split data into train, validation, test sets and extract target value

In [3]:
df.columns = df.columns.str.lower().str.replace(".", "_")

  df.columns = df.columns.str.lower().str.replace(".", "_")


In [4]:
df = handle_na_values(df)

In [5]:
df["bag_weight"] = df["bag_weight"].apply(lambda weight_str: convert_bag_weight(weight_str))

In [6]:
numerical_features = [
    "moisture",
    "category_one_defects",
    "quakers",
    "category_two_defects",
    "altitude_mean_meters",
    "bag_weight"
]

categorical_features = [
    "color",
    "species",
    "owner",
    "country_of_origin",
    "farm_name",
    "mill",
    "company",
    "region",
    "producer",
    "in_country_partner",
    "harvest_year",
    "owner_1",
    "variety",
    "processing_method"
]

In [7]:
features = numerical_features + categorical_features

In [8]:
df_train, df_val, df_test, y_train, y_val, y_test, df_full_train = split_data(df, features)

length of training set: 536, validation set: 179, test set: 179


In [9]:
# set target y values to 1 if above 85 and 0 if 85 or lower total cup score
y_train = total_points_over_85(y_train)
y_val = total_points_over_85(y_val)
y_test = total_points_over_85(y_test)

## Model training

- train three different models and evaluate a base line evaluation to which performs best with ROC AUC, RSME, Accuracy
- see how each feature affects the base score for that model

In [10]:
feature_set = numerical_features + categorical_features

### Logistic Regression

In [11]:
dv_lr, model_lr = train_logistic_regression(df_train, y_train)

In [12]:
y_pred_lr = validate_model(df_val, y_val, dv_lr, model_lr)

print_model_evaluation(y_val, y_pred_lr)

Accuracy: 0.94
Roc Auc: 0.55
Rsme: 0.24


In [13]:
scores_df = check_feature_importance(feature_set, df_train, y_train, df_val, y_val, model_type="LR")
scores_df.sort_values(by=["accuracy", "auc", "rsme"], ascending=[0, 0, 1])

Unnamed: 0,feature_removed,accuracy,auc,rsme
16,harvest_year,0.955307,0.555556,0.211407
1,category_one_defects,0.949721,0.552614,0.224231
2,quakers,0.949721,0.552614,0.224231
5,bag_weight,0.949721,0.552614,0.224231
6,color,0.949721,0.552614,0.224231
8,owner,0.949721,0.552614,0.224231
9,country_of_origin,0.949721,0.552614,0.224231
10,farm_name,0.949721,0.552614,0.224231
13,region,0.949721,0.552614,0.224231
14,producer,0.949721,0.552614,0.224231


### Decision Tree

In [14]:
dv_dt, model_dt = train_decision_tree(df_train, y_train)

In [15]:
y_pred_dt = validate_model(df_val, y_val, dv_dt, model_dt)

print_model_evaluation(y_val, y_pred_dt)

Accuracy: 0.95
Roc Auc: 0.50
Rsme: 0.22


In [16]:
scores_df = check_feature_importance(feature_set, df_train, y_train, df_val, y_val, model_type="DT")
scores_df.sort_values(by=["accuracy", "auc", "rsme"], ascending=[0, 0, 1])

Unnamed: 0,feature_removed,accuracy,auc,rsme
0,moisture,0.949721,0.5,0.224231
1,category_one_defects,0.949721,0.5,0.224231
2,quakers,0.949721,0.5,0.224231
3,category_two_defects,0.949721,0.5,0.224231
4,altitude_mean_meters,0.949721,0.5,0.224231
5,bag_weight,0.949721,0.5,0.224231
6,color,0.949721,0.5,0.224231
7,species,0.949721,0.5,0.224231
8,owner,0.949721,0.5,0.224231
9,country_of_origin,0.949721,0.5,0.224231


### Random Forest

In [17]:
dv_rf, model_rf = train_random_forest(df_train, y_train)

In [18]:
y_pred_rf = validate_model(df_val, y_val, dv_dt, model_dt)

print_model_evaluation(y_val, y_pred_rf)

Accuracy: 0.95
Roc Auc: 0.50
Rsme: 0.22


In [19]:
scores_df = check_feature_importance(feature_set, df_train, y_train, df_val, y_val, model_type="RF")
scores_df.sort_values(by=["accuracy", "auc", "rsme"], ascending=[0, 0, 1])

Unnamed: 0,feature_removed,accuracy,auc,rsme
1,category_one_defects,0.949721,0.552614,0.224231
8,owner,0.949721,0.552614,0.224231
17,owner_1,0.949721,0.552614,0.224231
15,in_country_partner,0.944134,0.549673,0.23636
0,moisture,0.944134,0.497059,0.23636
3,category_two_defects,0.944134,0.497059,0.23636
4,altitude_mean_meters,0.944134,0.497059,0.23636
5,bag_weight,0.944134,0.497059,0.23636
9,country_of_origin,0.944134,0.497059,0.23636
13,region,0.944134,0.497059,0.23636


The model and selection of features that performed the best was the Logistic Regression with the Harvest year feature removed.

In [20]:
features_subset = [
    "moisture",
     "category_one_defects",
     "quakers",
     "category_two_defects",
     "altitude_mean_meters",
     "bag_weight",
     "color",
     "species",
     "owner",
     "country_of_origin",
     "farm_name",
     "mill",
     "company",
     "region",
     "producer",
     "in_country_partner",
     "owner_1",
     "variety",
     "processing_method"
]

In [21]:
dv_lr2, model_lr2 = train_logistic_regression(df_train[features_subset], y_train)
y_pred_lr2 = validate_model(df_val, y_val, dv_lr2, model_lr2)

print_model_evaluation(y_val, y_pred_lr2)

Accuracy: 0.96
Roc Auc: 0.56
Rsme: 0.21


## Parameter Tuning

Taking the model we decided upon in the previous step we now look to see which parameters for that model can be tuned.

In [22]:
import sys
import warnings
import os

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore" # Also affect subprocesses

In [23]:
dv_lr, model_lr = train_logistic_regression(df_train, y_train)

In [24]:
scores = []

for random_state in [2, 10, 32, 42]:
    for c in [0.01, 0.1, 1, 10]:
        for no_iter in [100, 500, 1000, 10000]:
            dv_lr, model_lr = train_logistic_regression(df_train, y_train, c=c, max_iter=no_iter, random_state=random_state)

            y_pred_lr = validate_model(df_val, y_val, dv_lr, model_lr)

            scores.append((random_state, c, no_iter, *model_evaluation(y_val, y_pred_lr)))

cols = ["random_state", "c_value", "max_iter", "accuracy", "auc", "rsme"]

parameter_tuning_scores_df = pd.DataFrame(scores, columns=cols)
parameter_tuning_scores_df.sort_values(by=["accuracy", "auc", "rsme"], ascending=[0, 0, 1])

Unnamed: 0,random_state,c_value,max_iter,accuracy,auc,rsme
8,2,1.00,100,0.949721,0.552614,0.224231
24,10,1.00,100,0.949721,0.552614,0.224231
40,32,1.00,100,0.949721,0.552614,0.224231
56,42,1.00,100,0.949721,0.552614,0.224231
0,2,0.01,100,0.949721,0.500000,0.224231
...,...,...,...,...,...,...
59,42,1.00,10000,0.944134,0.549673,0.236360
60,42,10.00,100,0.944134,0.549673,0.236360
61,42,10.00,500,0.944134,0.549673,0.236360
62,42,10.00,1000,0.944134,0.549673,0.236360


## Check model with test data

Now we have our model `Logistic Regression`and final parameters `c=1, max_iter=100, random_state=2`we train the final model with our full training set and validate it against the test data set.

In [25]:
dv_final, model_final = train_logistic_regression(df_full_train[features_subset], total_points_over_85(df_full_train["total_cup_points"].values),c=1, max_iter=100, random_state=2)

y_pred_final = validate_model(df_test, y_test, dv_final, model_final)

print_model_evaluation(y_val, y_pred_final)

Accuracy: 0.94
Roc Auc: 0.49
Rsme: 0.25


This model is compareable with the train set tested against the validation set though the performance has been reduced slightly which suggests some overfitting. A gridsearch looking at different folds (splits) of the training and test set might reveal more.

## Using the model

Let's test the model with a single input.

In [26]:
sample = df_full_train[features_subset].iloc[1]
sample

moisture                                        0.11
category_one_defects                               0
quakers                                          0.0
category_two_defects                               2
altitude_mean_meters                          1350.0
bag_weight                                         1
color                                          Green
species                                      Arabica
owner                                        cadexsa
country_of_origin                           Honduras
farm_name                                     bethel
mill                                         cadexsa
company                                      cadexsa
region                                       marcala
producer                                 Omar Acosta
in_country_partner      Instituto Hondureño del Café
owner_1                                      CADEXSA
variety                                       Catuai
processing_method                       Washed

In [27]:
X_val = dv_final.transform([sample])
y_pred = model_final.predict(X_val)

In [28]:
print(f"sample predicted to score over 85 points: {str(y_pred[0] == 1)}")

sample predicted to score over 85 points: False
