# Overview

**GENERAL THOUGHTS:**
Use AutoML (AutoGluon.Tabular) to investigate which algorithms, pre-processing steps, and feature-engineering options are well suited to the given task, and to estimate the achievable performance across a large variety of configurations of those options.
The notebook includes multiple scenarios of using AutoML:
- including and excluding custom data pre-processing (see below)
- including auto pre-processing by AutoGluon.Tabular
- including auto feature engineering by AutoGluon.Tabular
  (see https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html)
- including multiple classifiers by using:
  - multiple ML algorithms
  - "standard" HPO for each algorithm defined by AutoGluon.Tabular
  - ensembles of algorithms (bagging and stacking)

**DATA PREPROCESSING:**

Imbalanced data:
- oversampling for imbalanced data
- cost-sensitive learning for imbalanced data
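
As a rough illustration of the cost-sensitive option, scikit-learn's `compute_sample_weight` (already imported in this notebook) gives each row a weight inversely proportional to the frequency of its class. The toy labels below are made up for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels (illustrative only): class "a" is four times as frequent as class "b"
y_toy = np.array(["a", "a", "a", "a", "b"])

# 'balanced' weights follow n_samples / (n_classes * count(class))
weights = compute_sample_weight(class_weight="balanced", y=y_toy)

# Print one weight per class; rare classes receive larger weights
for cls in np.unique(y_toy):
    print(cls, weights[y_toy == cls][0])
```

With these weights, every class contributes the same total weight to the loss, which is the idea behind the `balance_weight` strategy used with AutoGluon further down.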

Categorical data:
- Ordinal Data: The categories have an inherent order
- Nominal Data: The categories do not have an inherent order
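
The distinction matters for encoding. A minimal sketch with scikit-learn, assuming a toy frame whose `size` and `colour` columns are hypothetical and not part of the dataset: an ordinal feature gets integer codes in an explicitly stated order, while a nominal feature gets one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame (illustrative only; these columns are not in the dataset)
df_toy = pd.DataFrame({
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
    "colour": ["red", "blue", "red"],      # nominal: no inherent order
})

# Ordinal feature: state the order explicitly, otherwise categories are sorted alphabetically
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal_encoder.fit_transform(df_toy[["size"]])

# Nominal feature: one indicator column per category (columns follow sorted category order)
one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
colour_encoded = one_hot_encoder.fit_transform(df_toy[["colour"]]).toarray()

print(size_encoded.ravel().tolist())
print(colour_encoded.tolist())
```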



**MULTI-CLASS CLASSIFIERS:**
- Overview of models to be considered using AutoML (AutoGluon.Tabular):  
  - [X] RandomForest
  - [X] ExtraTrees
  - [X] XGBoost
  - [X] LightGBM
  - [X] KNeighbors
  - [X] CatBoost
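
For reference, AutoGluon addresses these model families through short keys. The sketch below only builds the list of keys; the variable names are illustrative, and passing the list via `included_model_types` is shown as a comment because the cells in this notebook leave that argument commented out.

```python
# Mapping from the algorithms listed above to AutoGluon's short model-type keys
MODEL_TYPE_KEYS = {
    "RandomForest": "RF",
    "ExtraTrees": "XT",
    "XGBoost": "XGB",
    "LightGBM": "GBM",
    "KNeighbors": "KNN",
    "CatBoost": "CAT",
}

included_model_types = sorted(MODEL_TYPE_KEYS.values())
print(included_model_types)

# Hypothetical usage (not executed here):
# TabularPredictor(label=label).fit(train_data, included_model_types=included_model_types)
```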

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mlflow

import os
from datetime import datetime
import yaml
import json

In [2]:
import sklearn
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import GridSearchCV
# from sklearn.experimental import enable_halving_search_cv
# from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.utils import class_weight
from sklearn.utils.class_weight import compute_sample_weight

# from sklearn.dummy import DummyClassifier
# from sklearn.base import BaseEstimator, ClassifierMixin

# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# import xgboost as xgb
# import lightgbm as lgbm

import optuna
# from optuna.samplers import TPESampler

import imblearn
from imblearn.over_sampling import RandomOverSampler

from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform




In [3]:
from autogluon.tabular import TabularDataset, TabularPredictor

In [4]:
SEED = 42

clf_name = "dt_clf"

# Get current date and time
now = datetime.now()
# Format date and time
formatted_date_time = now.strftime("%Y-%m-%d_%H:%M:%S")
print(formatted_date_time)

2024-01-18_18:44:27


# Load and prepare data

In [5]:
df = pd.read_csv('../../data/output/df_ml.csv', sep='\t')

df['material_number'] = df['material_number'].astype('object')

df_sub = df[[
    'material_number',
    'brand',
    'product_area',
    'core_segment',
    'component',
    'manufactoring_location',
    'characteristic_value',
    'material_weight', 
    'packaging_code',
    'packaging_category',
]]

# AutoML: without custom pre-processing; restricted selection of models including HPO and model ensembling

## Split data into train and test

In [6]:
# Define features and target
X = df_sub.iloc[:, :-1]
y = df_sub.iloc[:, -1]  # the last column is the target

In [7]:
# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y,
    random_state=SEED
)
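
The `stratify` argument keeps the class proportions of the full data in both splits, which matters for a target as imbalanced as this one. A small sketch on made-up toy data (the names `X_toy` and `y_toy` are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target (illustrative only): 80 rows of class "a", 20 rows of class "b"
y_toy = pd.Series(["a"] * 80 + ["b"] * 20, name="target")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both splits keep the 80/20 class ratio of the full data
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```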

## Transform to AutoML data format

In [8]:
df_train = pd.concat([X_train, y_train], axis=1)

In [9]:
train_data = TabularDataset(df_train)

## AutoML training pipeline

In [11]:
label = 'packaging_category'
automl_predictor = TabularPredictor(
    label=label,
    problem_type='multiclass',
    eval_metric='f1_macro',
    sample_weight='balance_weight'
).fit(
    train_data=train_data,
    tuning_data=None, # If tuning_data = None, fit() will automatically hold out some random validation examples from train_data.
    holdout_frac=0.2, # Default value (if None) is selected based on the number of rows in the training data.
    time_limit=60*60,
    presets=['best_quality'], # default = ['medium_quality'], any user-specified arguments in fit() will override the values used by presets.
    # auto_stack=False, # Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy.
    # included_model_types=[],
    # excluded_model_types=['FASTAI', 'AG_AUTOMM'],
    # HPO is not performed unless hyperparameter_tune_kwargs is specified.
    # Search spaces are provided for some models, but not for all. Where no search space is provided,
    # a fixed set of hyperparameters is used (see /searchspace under each model:
    # https://github.com/autogluon/autogluon/tree/master/tabular/src/autogluon/tabular/models).
    hyperparameter_tune_kwargs={
        # 'num_trials': 15,  # try at most n different hyperparameter configurations for each type of model
        'scheduler': 'local',
        'searcher': 'auto',  # 'auto': Bayesian optimization for NN_TORCH and FASTAI models, random search for other models
    }  # Refer to the TabularPredictor.fit docstring for all valid values
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240118_174517"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 60 seconds.
Starting holdout-based sub-fit for dynamic stacking. Context path is: AutogluonModels/ag-20240118_174517/ds_sub_fit/sub_fit_ho.
Using predefined sample weighting strategy: balance_weight. Evaluation metrics will ignore sample weights, specify weight_evaluation=True to instead report weighted metrics.
Beginning AutoGluon training ... Time limi

In [12]:
# Evaluation of models on validation data (score_val), i.e. data held out internally during training
automl_predictor.leaderboard()

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,KNeighborsDist_BAG_L1,0.093594,f1_macro,0.002528,0.299576,0.002528,0.299576,1,True,2
1,WeightedEnsemble_L2,0.093594,f1_macro,0.023991,11.141783,0.021463,10.842207,2,True,3
2,WeightedEnsemble_L3,0.093594,f1_macro,0.02421,11.221712,0.021682,10.922136,3,True,4
3,KNeighborsUnif_BAG_L1,0.089275,f1_macro,0.001786,0.280633,0.001786,0.280633,1,True,1


In [14]:
# Evaluation of models on test data
df_test = pd.concat([X_test, y_test], axis=1)
test_data = TabularDataset(df_test)

automl_predictor.leaderboard(test_data)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,KNeighborsDist_BAG_L1,0.188669,0.093594,f1_macro,0.067607,0.002528,0.299576,0.067607,0.002528,0.299576,1,True,2
1,WeightedEnsemble_L3,0.188669,0.093594,f1_macro,0.069288,0.02421,11.221712,0.001681,0.021682,10.922136,3,True,4
2,WeightedEnsemble_L2,0.188669,0.093594,f1_macro,0.069789,0.023991,11.141783,0.002182,0.021463,10.842207,2,True,3
3,KNeighborsUnif_BAG_L1,0.181568,0.089275,f1_macro,0.056656,0.001786,0.280633,0.056656,0.001786,0.280633,1,True,1


In [15]:
# For a single specified model: make predictions and perform detailed evaluation on hold out test data
i = 3  # index of model to use
model_to_use = automl_predictor.model_names()[i]
preds_y_test = automl_predictor.predict(X_test, model=model_to_use)
print("Predictions:  ", list(preds_y_test)[:5])

print(classification_report(y_test, preds_y_test))

Predictions:   ['Blister and Insert Card', 'Corrugated carton', 'Blister and Insert Card', 'Blister and Insert Card', 'Corrugated carton']
                            precision    recall  f1-score   support

   Blister and Insert Card       0.16      0.93      0.28      1749
  Blister and sealed blist       0.39      0.30      0.34      1582
            Book packaging       0.00      0.00      0.00         2
Cardb. Sleeve w - w/o Shr.       0.04      0.24      0.06       135
  Cardboard hanger w/o bag       0.94      0.19      0.31        80
    Carton cover (Lid box)       0.23      0.23      0.23       130
   Carton tube with or w/o       0.00      0.00      0.00         9
                      Case       0.13      0.16      0.15        97
         Corrugated carton       0.49      0.53      0.51       774
        Countertop display       0.14      0.07      0.09        30
                  Envelope       0.80      0.27      0.41        59
          Fabric packaging       0.00      0
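
Precision is undefined for a class that the model never predicts, and scikit-learn emits an `UndefinedMetricWarning` in that case while reporting the score as 0. A minimal sketch of making that choice explicit through the `zero_division` argument of `classification_report`, on made-up toy labels:

```python
from sklearn.metrics import classification_report

# Toy labels (illustrative only): class "c" is never predicted
y_true = ["a", "b", "c", "a"]
y_pred = ["a", "b", "a", "a"]

# zero_division=0 reports 0.0 for undefined precision without emitting a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["c"]["precision"], report["c"]["recall"])
```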



# AutoML: custom pre-processing; restricted selection of models including HPO and model ensembling

## Define features and target, perform oversampling, split data into train and test

In [None]:
# Define features and target
X = df_sub.iloc[:, :-1]
y = df_sub.iloc[:, -1]  # the last column is the target

In [None]:
distribution_classes = y.value_counts()
print('Class distribution before oversampling')
print(distribution_classes.to_dict())

# NOTE: Oversample so that each class has at least 100 samples, to allow proper CV and evaluation
dict_oversampling = {
    'Metal Cassette': 100,
    'Carton tube with or w/o': 100,
    'Wooden box': 100,
    'Fabric packaging': 100,
    'Book packaging': 100
}
# define oversampling strategy
oversampler = RandomOverSampler(sampling_strategy=dict_oversampling, random_state=SEED)
# fit and apply the transform
X_oversample, y_oversample = oversampler.fit_resample(X, y)

distribution_classes = y_oversample.value_counts()
print('\n')
print('Class distribution after oversampling')
print(distribution_classes.to_dict())

Class distribution before oversampling
{'Hanger/ Clip': 13543, 'Tube': 11687, 'Blister and Insert Card': 8744, 'TightPack': 8296, 'Folding carton': 8219, 'Blister and sealed blist': 7912, 'Corrugated carton': 3872, 'Paperboard pouch': 3478, 'Trap Folding Card': 2188, 'Plastic Pouch': 1904, 'Plastic bag with header': 1850, 'Plastic Cassette': 1708, 'Shrink film and insert o': 1499, 'Plastic Box': 1491, 'Unpacked': 1415, 'Skincard': 1143, 'Trap Card': 804, 'Cardb. Sleeve w - w/o Shr.': 676, 'Carton cover (Lid box)': 652, 'Case': 485, 'Tray Packer': 431, 'Cardboard hanger w/o bag': 400, 'Envelope': 295, 'Countertop display': 150, 'Metal Cassette': 50, 'Carton tube with or w/o': 44, 'Wooden box': 16, 'Fabric packaging': 15, 'Book packaging': 10}




Class distribution after oversampling
{'Hanger/ Clip': 13543, 'Tube': 11687, 'Blister and Insert Card': 8744, 'TightPack': 8296, 'Folding carton': 8219, 'Blister and sealed blist': 7912, 'Corrugated carton': 3872, 'Paperboard pouch': 3478, 'Trap Folding Card': 2188, 'Plastic Pouch': 1904, 'Plastic bag with header': 1850, 'Plastic Cassette': 1708, 'Shrink film and insert o': 1499, 'Plastic Box': 1491, 'Unpacked': 1415, 'Skincard': 1143, 'Trap Card': 804, 'Cardb. Sleeve w - w/o Shr.': 676, 'Carton cover (Lid box)': 652, 'Case': 485, 'Tray Packer': 431, 'Cardboard hanger w/o bag': 400, 'Envelope': 295, 'Countertop display': 150, 'Carton tube with or w/o': 100, 'Wooden box': 100, 'Metal Cassette': 100, 'Book packaging': 100, 'Fabric packaging': 100}


In [None]:
# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_oversample, y_oversample, test_size=0.2, stratify=y_oversample,
    random_state=SEED
)

In [9]:
# DEFINE & EXECUTE PIPELINE

# define feature processing pipeline
# define numerical feature processing
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
# print(f'There are {len(numerical_features)} numerical features:', '\n')
# print(numerical_features)
numeric_feature_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('power_transform', PowerTransformer()),  # Yeo-Johnson by default (not a plain log transform); output is standardized
    # ('scale', MinMaxScaler())
])
# define categorical feature processing
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
# print(f'There are {len(categorical_features)} categorical features:', '\n')
# print(categorical_features)
categorical_feature_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    # ('one_hot', OneHotEncoder(handle_unknown='ignore', max_categories=None, sparse_output=False))
])
# apply both pipelines to separate columns using "ColumnTransformer"
preprocess_pipeline = ColumnTransformer(transformers=[
    ('number', numeric_feature_pipeline, numerical_features),
    ('category', categorical_feature_pipeline, categorical_features)
]).set_output(transform="pandas")
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

# encode target variable
label_encoder = LabelEncoder()
y_train_transformed = label_encoder.fit_transform(y_train)
y_train_transformed = pd.Series(data=y_train_transformed, index=y_train.index, name=y_train.name)
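
The target is label-encoded here and decoded again before the classification report further down. A small sketch of that round trip on made-up labels (`label_encoder_toy` and `y_toy` are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target (illustrative only)
y_toy = pd.Series(["Tube", "Case", "Tube", "Envelope"], name="packaging_category")

label_encoder_toy = LabelEncoder()
y_encoded = label_encoder_toy.fit_transform(y_toy)           # classes are sorted alphabetically
y_decoded = label_encoder_toy.inverse_transform(y_encoded)   # back to the original strings

print(list(label_encoder_toy.classes_))
print(y_encoded.tolist())
print(list(y_decoded))
```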

## Transform to AutoML data format

In [10]:
df_train = pd.concat([X_train_transformed, y_train_transformed], axis=1)

In [11]:
train_data = TabularDataset(df_train)

## AutoML training pipeline

In [12]:
label = 'packaging_category'
automl_predictor = TabularPredictor(
    label=label,
    problem_type='multiclass',
    eval_metric='f1_macro',
    sample_weight='balance_weight'
).fit(
    train_data=train_data,
    tuning_data=None, # If tuning_data = None, fit() will automatically hold out some random validation examples from train_data.
    holdout_frac=0.2, # Default value (if None) is selected based on the number of rows in the training data.
    time_limit=60*60,
    presets=['best_quality'], # default = ['medium_quality'], any user-specified arguments in fit() will override the values used by presets.
    # auto_stack=False, # Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy.
    # included_model_types=[],
    # excluded_model_types=['FASTAI', 'AG_AUTOMM'],
    # HPO is not performed unless hyperparameter_tune_kwargs is specified.
    # Search spaces are provided for some models, but not for all. Where no search space is provided,
    # a fixed set of hyperparameters is used (see /searchspace under each model:
    # https://github.com/autogluon/autogluon/tree/master/tabular/src/autogluon/tabular/models).
    hyperparameter_tune_kwargs={
        # 'num_trials': 15,  # try at most n different hyperparameter configurations for each type of model
        'scheduler': 'local',
        'searcher': 'auto',  # 'auto': Bayesian optimization for NN_TORCH and FASTAI models, random search for other models
    }  # Refer to the TabularPredictor.fit docstring for all valid values
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240118_160631"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 3600 seconds.
Starting holdout-based sub-fit for dynamic stacking. Context path is: AutogluonModels/ag-20240118_160631/ds_sub_fit/sub_fit_ho.
Using predefined sample weighting strategy: balance_weight. Evaluation metrics will ignore sample weights, specify weight_evaluation=True to instead report weighted metrics.
Beginning AutoGluon training ... Time li

In [13]:
# Evaluation of models on validation data (score_val), i.e. data held out internally during training
automl_predictor.leaderboard()

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.791491,f1_macro,0.110476,186.699047,0.033493,83.048896,2,True,57
1,WeightedEnsemble_L3,0.791491,f1_macro,0.111575,196.164078,0.034592,92.513927,3,True,58
2,RandomForestGini_BAG_L1,0.782605,f1_macro,0.002573,5.345762,0.002573,5.345762,1,True,5
3,ExtraTrees_r197_BAG_L1,0.78198,f1_macro,0.003979,3.74699,0.003979,3.74699,1,True,48
4,RandomForest_r166_BAG_L1,0.781682,f1_macro,0.002812,4.97715,0.002812,4.97715,1,True,41
5,RandomForestEntr_BAG_L1,0.781401,f1_macro,0.004407,5.591437,0.004407,5.591437,1,True,6
6,ExtraTrees_r42_BAG_L1,0.781042,f1_macro,0.004615,2.980708,0.004615,2.980708,1,True,15
7,RandomForest_r16_BAG_L1,0.780871,f1_macro,0.00401,5.820938,0.00401,5.820938,1,True,49
8,RandomForest_r195_BAG_L1,0.780347,f1_macro,0.003703,5.85559,0.003703,5.85559,1,True,18
9,ExtraTreesGini_BAG_L1,0.778449,f1_macro,0.003007,3.188313,0.003007,3.188313,1,True,8


In [14]:
# Evaluation of models on test data

# process X_test for evaluation and predictions
X_test_transformed = preprocess_pipeline.transform(X_test)

# evaluate models on test data
y_test_transformed = label_encoder.transform(y_test)
y_test_transformed = pd.Series(data=y_test_transformed, index=y_test.index, name=y_test.name)
df_test = pd.concat([X_test_transformed, y_test_transformed], axis=1)
test_data = TabularDataset(df_test)

automl_predictor.leaderboard(test_data)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,ExtraTrees_r197_BAG_L1,0.753237,0.78198,f1_macro,0.635173,0.003979,3.74699,0.635173,0.003979,3.74699,1,True,48
1,WeightedEnsemble_L3,0.750393,0.791491,f1_macro,7.975973,0.111575,196.164078,0.036984,0.034592,92.513927,3,True,58
2,WeightedEnsemble_L2,0.750393,0.791491,f1_macro,7.984879,0.110476,186.699047,0.04589,0.033493,83.048896,2,True,57
3,ExtraTrees_r42_BAG_L1,0.748384,0.781042,f1_macro,0.619818,0.004615,2.980708,0.619818,0.004615,2.980708,1,True,15
4,ExtraTrees_r49_BAG_L1,0.741533,0.777724,f1_macro,0.748929,0.005675,3.384689,0.748929,0.005675,3.384689,1,True,31
5,ExtraTreesGini_BAG_L1,0.740493,0.778449,f1_macro,0.710735,0.003007,3.188313,0.710735,0.003007,3.188313,1,True,8
6,ExtraTreesEntr_BAG_L1,0.739568,0.777352,f1_macro,0.639133,0.003352,2.968069,0.639133,0.003352,2.968069,1,True,9
7,RandomForestEntr_BAG_L1,0.727815,0.781401,f1_macro,0.885521,0.004407,5.591437,0.885521,0.004407,5.591437,1,True,6
8,RandomForest_r166_BAG_L1,0.726957,0.781682,f1_macro,0.738727,0.002812,4.97715,0.738727,0.002812,4.97715,1,True,41
9,RandomForestGini_BAG_L1,0.726926,0.782605,f1_macro,1.293683,0.002573,5.345762,1.293683,0.002573,5.345762,1,True,5
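
One way to read the two leaderboards together is to look at the gap between validation and test score per model. The sketch below copies three rows from the leaderboards above into a small frame; the `gap` column is an illustrative addition, not an AutoGluon output.

```python
import pandas as pd

# Scores copied from the two leaderboards above (f1_macro)
scores = pd.DataFrame(
    {
        "model": ["ExtraTrees_r197_BAG_L1", "WeightedEnsemble_L2", "RandomForestGini_BAG_L1"],
        "score_val": [0.781980, 0.791491, 0.782605],
        "score_test": [0.753237, 0.750393, 0.726926],
    }
)

# Positive gap: the validation score is more optimistic than the test score
scores["gap"] = scores["score_val"] - scores["score_test"]
print(scores.sort_values("gap").to_string(index=False))
```

In this excerpt the model with the best validation score is not the one with the best test score, so ranking by `score_val` alone can be misleading.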


In [15]:
# For a single specified model: make predictions and perform detailed evaluation on hold out test data
i = 3  # index of model to use
model_to_use = automl_predictor.model_names()[i]
preds_y_test = automl_predictor.predict(X_test_transformed, model=model_to_use)
print("Predictions:  ", list(preds_y_test)[:5])

preds_y_test_inverse = label_encoder.inverse_transform(preds_y_test)

print(classification_report(y_test, preds_y_test_inverse))

Predictions:   [22, 26, 24, 8, 0]
                            precision    recall  f1-score   support

   Blister and Insert Card       0.73      0.45      0.56      1749
  Blister and sealed blist       0.73      0.65      0.69      1582
            Book packaging       0.54      1.00      0.70        20
Cardb. Sleeve w - w/o Shr.       0.18      0.64      0.28       135
  Cardboard hanger w/o bag       0.15      0.82      0.25        80
    Carton cover (Lid box)       0.28      0.80      0.41       130
   Carton tube with or w/o       0.10      1.00      0.18        20
                      Case       0.14      0.68      0.24        97
         Corrugated carton       0.79      0.60      0.69       774
        Countertop display       0.09      0.63      0.16        30
                  Envelope       0.43      0.88      0.58        59
          Fabric packaging       0.56      1.00      0.71        20
            Folding carton       0.90      0.33      0.49      1644
             