# Overview

**GENERAL THOUGHTS:**  
Use AutoML (AutoGluon.Tabular) to investigate which algorithms, pre-processing steps, and feature engineering options are well suited to the given task, and to estimate the achievable performance across a large variety of configurations of those options.
The notebook covers multiple scenarios of using AutoML:
- including and excluding custom data pre-processing (see below)
- including auto pre-processing by AutoGluon.Tabular
- including auto feature engineering by AutoGluon.Tabular (https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html)
- including multiple classifiers by using:
  - multiple ML algorithms
  - "standard" HPO for each algorithm defined by AutoGluon.Tabular
  - ensembles of algorithms (bagging and stacking, possibly with multiple layers)

**CUSTOM DATA PREPROCESSING:**

Imbalanced data:
- oversampling of the rarest classes (RandomOverSampler)
- cost-sensitive learning (sample_weight='balance_weight' in AutoGluon)
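
To make the idea of cost-sensitive learning concrete, the following is a minimal sketch of the common "balanced" weighting heuristic, where each class gets the weight `n_samples / (n_classes * class_count)`. It is an assumption that AutoGluon's `'balance_weight'` strategy behaves similarly in spirit; the function name and the toy counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels: class names borrowed from the data set, counts are made up
toy_labels = ["Tube"] * 6 + ["Case"] * 3 + ["Wooden box"] * 1
weights = balanced_class_weights(toy_labels)

# One weight per row; rare classes contribute as much in total as frequent ones
sample_weights = [weights[label] for label in toy_labels]
print(weights)
```

With these weights every class contributes the same total weight to the loss, so errors on rare classes cost more per row than errors on frequent ones.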

Numeric data:
- data imputation: SimpleImputer(strategy='median')
- data scaling: PowerTransformer() (Yeo-Johnson by default, not a plain log transform; the pipeline step is named 'log_transform')
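
Because `PowerTransformer()` defaults to the Yeo-Johnson transform, a short sketch of that formula may help show how it differs from a plain logarithm. This is a plain-Python illustration for a single value and a given `lmbda`; scikit-learn additionally estimates `lmbda` per feature and standardizes the result by default, which is not reproduced here.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value x for a fixed parameter lmbda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)  # limit case lmbda -> 0
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        return -math.log1p(-x)  # limit case lmbda -> 2
    return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# lmbda = 1 leaves values unchanged, lmbda = 0 gives log(1 + x) for x >= 0
for value in (0.0, 0.5, 10.0, 250.0):  # made-up example weights
    print(value, yeo_johnson(value, 1.0), yeo_johnson(value, 0.0))
```

Unlike a logarithm, the transform is defined for zero and negative inputs, which is why it can be applied to a column without shifting it first.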

Categorical data:
- data imputation: SimpleImputer(strategy='most_frequent')
- categorical data encoding: OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
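
The effect of `handle_unknown='use_encoded_value'` with `unknown_value=np.nan` can be illustrated with a small plain-Python sketch: categories seen during fitting are mapped to their sorted position, and anything unseen becomes NaN instead of raising an error. The helper names and brand values below are hypothetical and only mimic the behavior; they are not the scikit-learn implementation.

```python
import math


def fit_ordinal(values):
    """Map each category seen during fitting to its position in sorted order."""
    return {cat: float(i) for i, cat in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping):
    """Encode values; categories not seen during fitting become NaN."""
    return [mapping.get(v, math.nan) for v in values]


train_brands = ["B", "A", "C", "A"]  # made-up category values
mapping = fit_ordinal(train_brands)

encoded = transform_ordinal(["C", "Z"], mapping)  # "Z" was never seen in training
print(mapping, encoded)
```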

**AUTOML MULTI-CLASS CLASSIFIERS:**
- Overview of models to be considered using AutoML (AutoGluon.Tabular):  
  - [X] RandomForest
  - [X] ExtraTrees
  - [X] XGBoost
  - [X] LightGBM
  - [X] KNeighbors
  - [X] CatBoost

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mlflow

import os
from datetime import datetime
import yaml
import json

In [2]:
import sklearn
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import classification_report

import imblearn
from imblearn.over_sampling import RandomOverSampler

from autogluon.tabular import TabularDataset, TabularPredictor



In [5]:
# General settings within the data science workflow

pd.set_option('display.max_columns', None)

SEED = 42

# NOTE: for dev only
subsample = False
subsample_size = 100  # subsample subset of data for faster demo or development


# Get current date and time
now = datetime.now()
# Format date and time
formatted_date_time = now.strftime("%Y-%m-%d_%H:%M:%S")
print(formatted_date_time)

2024-01-23_11:54:28


# Load and prepare data

In [6]:
df = pd.read_csv('../../data/output/df_ml.csv', sep='\t')

df['material_number'] = df['material_number'].astype('object')

df_sub = df[[
    'material_number',
    'brand',
    'product_area',
    'core_segment',
    'component',
    'manufactoring_location',
    'characteristic_value',
    'material_weight', 
    'packaging_code',
    'packaging_category',
]]

# AutoML: without custom pre-processing; restricted selection of models including HPO and model ensembling

## Split data into train and test

In [7]:
# Define features and target
X = df_sub.iloc[:, :-1]
y = df_sub.iloc[:, -1]  # the last column is the target

In [8]:
# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y,
    random_state=SEED
)
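
The `stratify=y` argument keeps the class proportions roughly equal in both splits. A simplified plain-Python sketch of the idea is shown below; it is not scikit-learn's algorithm, and the function name and toy data are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with about test_size of every class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = ["Tube"] * 50 + ["Case"] * 10  # made-up class counts
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

For very small classes this leaves only a handful of test rows: a class with 10 rows contributes about 2 rows to a 20% test set, which makes its per-class metrics noisy.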

## Transform to AutoML data format

In [9]:
df_train = pd.concat([X_train, y_train], axis=1)

In [10]:
train_data = TabularDataset(df_train)
if subsample is True:
    train_data = train_data.sample(n=subsample_size, random_state=SEED)

## AutoML training pipeline

In [11]:
label = 'packaging_category'
automl_predictor = TabularPredictor(
    label=label,
    problem_type='multiclass',
    eval_metric='f1_macro',
    sample_weight='balance_weight'
).fit(
    train_data=train_data,
    tuning_data=None,  # if None, fit() automatically holds out random validation examples from train_data
    holdout_frac=0.2,  # if None, the default is chosen based on the number of rows in the training data
    time_limit=3*60*60,  # seconds (3 hours)
    presets=['best_quality'],  # default is ['medium_quality']; arguments passed to fit() override preset values
    # auto_stack=False,  # whether AutoGluon uses bagging and multi-layer stack ensembling to boost predictive accuracy
    # included_model_types=['LR', 'KNN', 'RF', 'XT', 'GBM', 'XGB', 'CAT', 'NN'],
    # excluded_model_types=['FASTAI', 'AG_AUTOMM'],
    # HPO is only performed if hyperparameter_tune_kwargs is specified. Search spaces are provided
    # for some models only; the others use a fixed set of hyperparameters (see /searchspace under
    # each model: https://github.com/autogluon/autogluon/tree/master/tabular/src/autogluon/tabular/models).
    hyperparameter_tune_kwargs={
        # 'num_trials': 15,  # try at most n different hyperparameter configurations per model type
        'scheduler': 'local',
        'searcher': 'auto',  # 'auto': Bayesian optimization for NN_TORCH and FASTAI models, random search for the others
    }  # refer to the TabularPredictor.fit docstring for all valid values
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240123_105428"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 10800 seconds.
Starting holdout-based sub-fit for dynamic stacking. Context path is: AutogluonModels/ag-20240123_105428/ds_sub_fit/sub_fit_ho.
Using predefined sample weighting strategy: balance_weight. Evaluation metrics will ignore sample weights, specify weight_evaluation=True to instead report weighted metrics.
Beginning AutoGluon training ... Time l

In [12]:
# Leaderboard with validation scores (score_val); no test data is passed here
automl_predictor.leaderboard()

| | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.826579 | f1_macro | 0.049325 | 260.106177 | 0.029240 | 78.230321 | 2 | True | 77 |
| 1 | XGBoost_r194_BAG_L1 | 0.793051 | f1_macro | 0.003568 | 56.608631 | 0.003568 | 56.608631 | 1 | True | 25 |
| 2 | RandomForest_r16_BAG_L1 | 0.728402 | f1_macro | 0.002482 | 11.737782 | 0.002482 | 11.737782 | 1 | True | 67 |
| 3 | RandomForest_r195_BAG_L1 | 0.716536 | f1_macro | 0.002227 | 8.542151 | 0.002227 | 8.542151 | 1 | True | 20 |
| 4 | ExtraTrees_r197_BAG_L1 | 0.708779 | f1_macro | 0.003885 | 5.920645 | 0.003885 | 5.920645 | 1 | True | 65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | NeuralNetFastAI_r187_BAG_L1 | 0.178089 | f1_macro | 0.003580 | 49.542804 | 0.003580 | 49.542804 | 1 | True | 73 |
| 73 | NeuralNetFastAI_r134_BAG_L1 | 0.172649 | f1_macro | 0.006366 | 51.739179 | 0.006366 | 51.739179 | 1 | True | 42 |
| 74 | NeuralNetFastAI_r65_BAG_L1 | 0.160335 | f1_macro | 0.003510 | 50.236237 | 0.003510 | 50.236237 | 1 | True | 48 |
| 75 | NeuralNetFastAI_r143_BAG_L1 | 0.129495 | f1_macro | 0.003627 | 52.369080 | 0.003627 | 52.369080 | 1 | True | 29 |


In [27]:
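# Evaluation of models on test data
# NOTE: this cell was executed out of order (In [27]), most likely after `automl_predictor` had been
# reassigned to the predictor trained on the transformed columns further down; that predictor
# expects the prefixed column names, which explains the KeyError below.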
df_test = pd.concat([X_test, y_test], axis=1)
test_data = TabularDataset(df_test)

automl_std_leaderboard_testdata = automl_predictor.leaderboard(test_data)
automl_std_leaderboard_testdata.head(10)

KeyError: "9 required columns are missing from the provided dataset to transform using AutoMLPipelineFeatureGenerator. 9 missing columns: ['number__material_weight', 'category__material_number', 'category__brand', 'category__product_area', 'category__core_segment', 'category__component', 'category__manufactoring_location', 'category__characteristic_value', 'category__packaging_code'] | 9 available columns: ['material_number', 'brand', 'product_area', 'core_segment', 'component', 'manufactoring_location', 'characteristic_value', 'material_weight', 'packaging_code']"
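
The missing column names in the error above carry `number__` and `category__` prefixes. That is the naming a `ColumnTransformer` with pandas output gives its columns (`<transformer name>__<column>`), so the predictor being queried was fitted on pre-processed data while the test frame still holds the raw columns. The sketch below mimics that naming in plain Python and adds a hypothetical `missing_columns` helper that could be used as a guard before calling `leaderboard()`; neither function is part of scikit-learn or AutoGluon.

```python
def prefixed_feature_names(transformers):
    """Mimic the '<transformer name>__<column>' naming of a ColumnTransformer."""
    return [f"{name}__{col}" for name, cols in transformers for col in cols]


def missing_columns(required, available):
    """Return the required columns that are absent from the available ones."""
    available_set = set(available)
    return [col for col in required if col not in available_set]


# Shortened column lists, taken from the notebook's feature names
raw_columns = ["material_weight", "brand", "packaging_code"]
required = prefixed_feature_names([
    ("number", ["material_weight"]),
    ("category", ["brand", "packaging_code"]),
])

# Raw test data lacks every prefixed column, as in the KeyError above
print(missing_columns(required, raw_columns))
```

Giving the two predictors distinct variable names, or running the notebook top to bottom, would avoid evaluating one predictor on data prepared for the other.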

In [14]:
# For a single specified model: make predictions and perform detailed evaluation on hold out test data
i = 1  # index of model to use
model_to_use = automl_predictor.model_names()[i]
preds_y_test = automl_predictor.predict(X_test, model=model_to_use)
print("Predictions:  ", list(preds_y_test)[:5])

print(classification_report(y_test, preds_y_test))

Predictions:   ['Blister and sealed blist', 'Cardboard hanger w/o bag', 'Skincard', 'Tray Packer', 'Shrink film and insert o']
                            precision    recall  f1-score   support

   Blister and Insert Card       0.66      0.39      0.50      1749
  Blister and sealed blist       0.62      0.61      0.61      1582
            Book packaging       0.00      0.00      0.00         2
Cardb. Sleeve w - w/o Shr.       0.15      0.13      0.14       135
  Cardboard hanger w/o bag       0.04      0.57      0.08        80
    Carton cover (Lid box)       0.36      0.28      0.32       130
   Carton tube with or w/o       0.02      0.56      0.03         9
                      Case       0.11      0.36      0.17        97
         Corrugated carton       0.26      0.08      0.12       774
        Countertop display       0.15      0.40      0.22        30
                  Envelope       0.15      0.25      0.19        59
          Fabric packaging       0.01      1.00      0.0

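
The predictor is tuned for `f1_macro`, the unweighted mean of the per-class F1 scores. The following plain-Python sketch (helper names are illustrative, toy labels are made up) shows why a rare class with an F1 of 0.00, such as 'Book packaging' in the report above, pulls the macro average down even when overall accuracy looks acceptable.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: F1}, using 0.0 when a class has no true or predicted rows."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# A classifier that always predicts the majority class
y_true = ["Tube"] * 8 + ["Book packaging"] * 2
y_pred = ["Tube"] * 10

per_class = f1_per_class(y_true, y_pred)
macro = f1_macro(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_class, macro, accuracy)
```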


# AutoML: custom pre-processing; restricted selection of models including HPO and model ensembling

## Define features and target, perform oversampling, split data into train and test

In [15]:
# Define features and target
X = df_sub.iloc[:, :-1]
y = df_sub.iloc[:, -1]  # the last column is the target

In [16]:
distribution_classes = y.value_counts()
print('Class distribution before oversampling')
print(distribution_classes.to_dict())

# NOTE: Oversample so that each class has at least 100 samples, to allow proper CV and evaluation.
# Caveat: oversampling before the train/test split lets copies of the same row end up in both
# sets, so test scores for the oversampled classes are likely optimistic.
dict_oversampling = {
    'Metal Cassette': 100,
    'Carton tube with or w/o': 100,
    'Wooden box': 100,
    'Fabric packaging': 100,
    'Book packaging': 100
}
# define oversampling strategy
oversampler = RandomOverSampler(sampling_strategy=dict_oversampling, random_state=SEED)
# fit and apply the transform
X_oversample, y_oversample = oversampler.fit_resample(X, y)

distribution_classes = y_oversample.value_counts()
print('\n')
print('Class distribution after oversampling')
print(distribution_classes.to_dict())

Class distribution before oversampling
{'Hanger/ Clip': 13543, 'Tube': 11687, 'Blister and Insert Card': 8744, 'TightPack': 8296, 'Folding carton': 8219, 'Blister and sealed blist': 7912, 'Corrugated carton': 3872, 'Paperboard pouch': 3478, 'Trap Folding Card': 2188, 'Plastic Pouch': 1904, 'Plastic bag with header': 1850, 'Plastic Cassette': 1708, 'Shrink film and insert o': 1499, 'Plastic Box': 1491, 'Unpacked': 1415, 'Skincard': 1143, 'Trap Card': 804, 'Cardb. Sleeve w - w/o Shr.': 676, 'Carton cover (Lid box)': 652, 'Case': 485, 'Tray Packer': 431, 'Cardboard hanger w/o bag': 400, 'Envelope': 295, 'Countertop display': 150, 'Metal Cassette': 50, 'Carton tube with or w/o': 44, 'Wooden box': 16, 'Fabric packaging': 15, 'Book packaging': 10}


Class distribution after oversampling
{'Hanger/ Clip': 13543, 'Tube': 11687, 'Blister and Insert Card': 8744, 'TightPack': 8296, 'Folding carton': 8219, 'Blister and sealed blist': 7912, 'Corrugated carton': 3872, 'Paperboard pouch': 3478, 'Trap 

In [17]:
# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_oversample, y_oversample, test_size=0.2, stratify=y_oversample,
    random_state=SEED
)
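
The cells above oversample the full data set and only then split it, so duplicated minority rows can appear in both the training and the test set. One possible alternative, sketched here in plain Python under the assumption that only the training split should be resampled, is to split first and oversample afterwards. This is not what the notebook does; `oversample_to_minimum` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_to_minimum(rows, labels, min_count, seed=42):
    """Duplicate random rows of each class until it has at least min_count rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        n_missing = min_count - len(members)
        if n_missing > 0:
            out_rows.extend(rng.choices(members, k=n_missing))  # sample with replacement
            out_labels.extend([label] * n_missing)
    return out_rows, out_labels


# Toy data that has already been split; row ids stand in for feature rows
train_rows = list(range(12))
train_labels = ["Tube"] * 10 + ["Wooden box"] * 2
test_rows = [100, 101, 102]
test_labels = ["Tube", "Tube", "Wooden box"]

# Only the training split is resampled; the test split keeps its original rows
bal_rows, bal_labels = oversample_to_minimum(train_rows, train_labels, min_count=5)
print(Counter(bal_labels))
```

The trade-off is that very rare classes then have only a few genuine test rows, so their test metrics remain noisy but are no longer inflated by duplicates.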

In [18]:
# DEFINE & EXECUTE PIPELINE

# define feature processing pipeline
# define numerical feature processing
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
# print(f'There are {len(numerical_features)} numerical features:', '\n')
# print(numerical_features)
numeric_feature_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('log_transform', PowerTransformer()),  # default method is Yeo-Johnson (with standardization), not a plain log
    # ('scale', MinMaxScaler())
])
# define categorical feature processing
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
# print(f'There are {len(categorical_features)} categorical features:', '\n')
# print(categorical_features)
categorical_feature_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    # ('one_hot', OneHotEncoder(handle_unknown='ignore', max_categories=None, sparse=False))
])
# apply both pipelines to separate columns using "ColumnTransformer"
preprocess_pipeline = ColumnTransformer(transformers=[
    ('number', numeric_feature_pipeline, numerical_features),
    ('category', categorical_feature_pipeline, categorical_features)
]).set_output(transform="pandas")
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

# encode target variable
label_encoder = LabelEncoder()
y_train_transformed = label_encoder.fit_transform(y_train)
y_train_transformed = pd.Series(data=y_train_transformed, index=y_train.index, name=y_train.name)

## Transform to AutoML data format

In [19]:
df_train = pd.concat([X_train_transformed, y_train_transformed], axis=1)

In [20]:
train_data = TabularDataset(df_train)
if subsample is True:
    train_data = train_data.sample(n=subsample_size, random_state=SEED)

## AutoML training pipeline

In [21]:
label = 'packaging_category'
automl_predictor = TabularPredictor(
    label=label,
    problem_type='multiclass',
    eval_metric='f1_macro',
    sample_weight='balance_weight'
).fit(
    train_data=train_data,
    tuning_data=None,  # if None, fit() automatically holds out random validation examples from train_data
    holdout_frac=0.2,  # if None, the default is chosen based on the number of rows in the training data
    time_limit=3*60*60,  # seconds (3 hours)
    presets=['best_quality'],  # default is ['medium_quality']; arguments passed to fit() override preset values
    # auto_stack=False,  # whether AutoGluon uses bagging and multi-layer stack ensembling to boost predictive accuracy
    # included_model_types=['LR', 'KNN', 'RF', 'XT', 'GBM', 'XGB', 'CAT', 'NN'],
    # excluded_model_types=['FASTAI', 'AG_AUTOMM'],
    # HPO is only performed if hyperparameter_tune_kwargs is specified. Search spaces are provided
    # for some models only; the others use a fixed set of hyperparameters (see /searchspace under
    # each model: https://github.com/autogluon/autogluon/tree/master/tabular/src/autogluon/tabular/models).
    hyperparameter_tune_kwargs={
        # 'num_trials': 15,  # try at most n different hyperparameter configurations per model type
        'scheduler': 'local',
        'searcher': 'auto',  # 'auto': Bayesian optimization for NN_TORCH and FASTAI models, random search for the others
    }  # refer to the TabularPredictor.fit docstring for all valid values
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240123_131906"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 10800 seconds.
Starting holdout-based sub-fit for dynamic stacking. Context path is: AutogluonModels/ag-20240123_131906/ds_sub_fit/sub_fit_ho.
Using predefined sample weighting strategy: balance_weight. Evaluation metrics will ignore sample weights, specify weight_evaluation=True to instead report weighted metrics.
Beginning AutoGluon training ... Time l

In [22]:
# Leaderboard with validation scores (score_val); no test data is passed here
automl_predictor.leaderboard()

| | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | 0.813188 | f1_macro | 0.493322 | 1947.320192 | 0.027990 | 85.902533 | 3 | True | 80 |
| 1 | ExtraTrees_r49_BAG_L2 | 0.812443 | f1_macro | 0.465332 | 1861.417659 | 0.003728 | 40.485833 | 2 | True | 77 |
| 2 | ExtraTreesGini_BAG_L2 | 0.811267 | f1_macro | 0.467912 | 1861.252752 | 0.006308 | 40.320926 | 2 | True | 75 |
| 3 | ExtraTrees_r126_BAG_L2 | 0.806452 | f1_macro | 0.468715 | 1862.012287 | 0.007111 | 41.080462 | 2 | True | 79 |
| 4 | ExtraTreesEntr_BAG_L2 | 0.804943 | f1_macro | 0.468610 | 1862.885374 | 0.007006 | 41.953548 | 2 | True | 76 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | LightGBM_r196_BAG_L1 | 0.223252 | f1_macro | 0.003974 | 33.015318 | 0.003974 | 33.015318 | 1 | True | 31 |
| 76 | LightGBM_r94_BAG_L1 | 0.205092 | f1_macro | 0.002622 | 31.969279 | 0.002622 | 31.969279 | 1 | True | 43 |
| 77 | NeuralNetFastAI_r156_BAG_L1 | 0.167062 | f1_macro | 0.006076 | 30.215514 | 0.006076 | 30.215514 | 1 | True | 30 |
| 78 | NeuralNetFastAI_r100_BAG_L1 | 0.112000 | f1_macro | 0.002836 | 28.997645 | 0.002836 | 28.997645 | 1 | True | 68 |


In [23]:
# Evaluation of models on test data

# process X_test for evaluation and predictions
X_test_transformed = preprocess_pipeline.transform(X_test)

# evaluate models on test data
y_test_transformed = label_encoder.transform(y_test)
y_test_transformed = pd.Series(data=y_test_transformed, index=y_test.index, name=y_test.name)
df_test = pd.concat([X_test_transformed, y_test_transformed], axis=1)
test_data = TabularDataset(df_test)

automl_custom_leaderboard_testdata = automl_predictor.leaderboard(test_data)
automl_custom_leaderboard_testdata.head(10)


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | 0.790286 | 0.813188 | f1_macro | 52.516910 | 0.493322 | 1947.320192 | 0.007701 | 0.027990 | 85.902533 | 3 | True | 80 |
| 1 | ExtraTrees_r49_BAG_L2 | 0.778305 | 0.812443 | f1_macro | 52.509209 | 0.465332 | 1861.417659 | 0.880033 | 0.003728 | 40.485833 | 2 | True | 77 |
| 2 | ExtraTreesGini_BAG_L2 | 0.777852 | 0.811267 | f1_macro | 52.552464 | 0.467912 | 1861.252752 | 0.923288 | 0.006308 | 40.320926 | 2 | True | 75 |
| 3 | WeightedEnsemble_L2 | 0.774638 | 0.789252 | f1_macro | 3.129605 | 0.059754 | 175.076740 | 0.013699 | 0.040087 | 92.626149 | 2 | True | 74 |
| 4 | ExtraTreesEntr_BAG_L2 | 0.774343 | 0.804943 | f1_macro | 52.573846 | 0.468610 | 1862.885374 | 0.944670 | 0.007006 | 41.953548 | 2 | True | 76 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | NeuralNetFastAI_r134_BAG_L1 | 0.252496 | 0.248504 | f1_macro | 1.331172 | 0.006697 | 31.817982 | 1.331172 | 0.006697 | 31.817982 | 1 | True | 41 |
| 76 | NeuralNetTorch_r36_BAG_L1/T3 | 0.243525 | 0.232895 | f1_macro | 6.083937 | 0.003300 | 25.926688 | 6.083937 | 0.003300 | 25.926688 | 1 | True | 67 |
| 77 | NeuralNetFastAI_r156_BAG_L1 | 0.184435 | 0.167062 | f1_macro | 1.010146 | 0.006076 | 30.215514 | 1.010146 | 0.006076 | 30.215514 | 1 | True | 30 |
| 78 | NeuralNetFastAI_r100_BAG_L1 | 0.114674 | 0.112000 | f1_macro | 1.697475 | 0.002836 | 28.997645 | 1.697475 | 0.002836 | 28.997645 | 1 | True | 68 |


In [24]:
# For a single specified model: make predictions and perform detailed evaluation on hold out test data
i = 1  # index of model to use
model_to_use = automl_predictor.model_names()[i]
preds_y_test = automl_predictor.predict(X_test_transformed, model=model_to_use)
print("Predictions:  ", list(preds_y_test)[:5])

preds_y_test_inverse = label_encoder.inverse_transform(preds_y_test)

print(classification_report(y_test, preds_y_test_inverse))

Predictions:   [23, 7, 1, 26, 26]
                            precision    recall  f1-score   support

   Blister and Insert Card       0.71      0.49      0.58      1749
  Blister and sealed blist       0.73      0.65      0.69      1582
            Book packaging       0.54      1.00      0.70        20
Cardb. Sleeve w - w/o Shr.       0.22      0.67      0.33       135
  Cardboard hanger w/o bag       0.17      0.81      0.28        80
    Carton cover (Lid box)       0.29      0.71      0.41       130
   Carton tube with or w/o       0.15      0.90      0.26        20
                      Case       0.16      0.85      0.26        97
         Corrugated carton       0.79      0.64      0.70       774
        Countertop display       0.10      0.77      0.18        30
                  Envelope       0.46      0.90      0.61        59
          Fabric packaging       0.69      1.00      0.82        20
            Folding carton       0.82      0.38      0.51      1644
             