# Overview

**GENERAL THOUGHTS:**  
Use AutoML (AutoGluon.Tabular) as a general way to investigate which algorithms, pre-processing steps, and feature engineering options are well suited to the given task, and to estimate the achievable performance across a large variety of configurations of those options.
The notebook includes multiple scenarios of using AutoML:
- including and excluding custom data pre-processing (see below)
- including auto pre-processing by AutoGluon.Tabular
- including auto feature engineering by AutoGluon.Tabular (see https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html)
- including multiple classifiers by using:
  - multiple ML algorithms
  - "standard" HPO for each algorithm defined by AutoGluon.Tabular
  - ensembles of algorithms (bagging and stacking, possibly with multiple layers)

**DATA PREPROCESSING:**  
Imbalanced data:
- over_sampling for imbalanced data.
- cost-sensitive learning for imbalanced data.
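As a rough illustration of cost-sensitive learning, the sketch below derives per-class weights that are inversely proportional to class frequency, using the formula behind scikit-learn's "balanced" heuristic. It is an assumption that this resembles what a predefined weighting strategy such as `balance_weight` does; the exact computation inside AutoGluon may differ. The class names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, heavily imbalanced target column.
toy_labels = ["box"] * 6 + ["bag"] * 3 + ["tube"] * 1

class_weights = balanced_class_weights(toy_labels)
# One weight per row, usable as a sample-weight vector.
sample_weights = [class_weights[label] for label in toy_labels]

print(class_weights)
```

Rare classes receive larger weights, so misclassifying them costs more during training, while the weights still sum to the number of rows.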

continuous data:
- Impute missing data: SimpleImputer(strategy='median').
- Standardize data: StandardScaler().

categorical data:
- Impute missing data: SimpleImputer(strategy='most_frequent').
- Ordinal & Nominal data encoding: OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).
- Encoding of unknown values and reordering of the ordinal encoding: custom encoder "OrdinalEncoderExtensionUnknowns()".

target data:
- target encoding: OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

**AUTOML MULTI-CLASS CLASSIFIERS:**
- Overview of models to be considered using AutoML (AutoGluon.Tabular):  
  - [X] RandomForest
  - [X] ExtraTrees
  - [X] XGBoost
  - [X] LightGBM
  - [X] KNeighbors
  - [X] CatBoost
  - [X] Multiple Neural Nets
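For orientation, the model families above can be related to the short model-type keys that appear elsewhere in this notebook (in the commented-out `included_model_types` / `excluded_model_types` arguments and in the `searcher` comment). The mapping below is an assumption based on those keys, not an authoritative list; check it against the installed AutoGluon version before relying on it.

```python
# Assumed mapping from the model families listed above to AutoGluon model-type keys.
model_type_keys = {
    "RandomForest": "RF",
    "ExtraTrees": "XT",
    "XGBoost": "XGB",
    "LightGBM": "GBM",
    "KNeighbors": "KNN",
    "CatBoost": "CAT",
    "Neural nets (PyTorch)": "NN_TORCH",
    "Neural nets (fastai)": "FASTAI",
}

# Example: build a list that leaves out the fastai-based networks.
included_model_types = [
    key for name, key in model_type_keys.items() if "fastai" not in name
]
print(included_model_types)
```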

**FINAL MODEL PERFORMANCE:**  
- Evaluation of the best model from AutoML, including experiment checkpointing.
- Loading the final model from the checkpoint, predicting on the test set, and evaluating with a classification report.
- Tracking the best model with MLflow to benchmark performance against other approaches (Baseline, PyCaret, PyTorch, ...) within the repository.
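The training call in this notebook optimizes `f1_macro`, which corresponds to the "macro avg" F1 row of a classification report. The sketch below computes it by hand on invented labels to show why it is stricter than accuracy on imbalanced data: every class contributes equally, regardless of how many rows it has. The function name is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented toy labels: three rows of a majority class, one row of a minority class.
y_true = ["box", "box", "box", "bag"]
y_pred = ["box", "box", "bag", "bag"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, macro F1={score:.3f}")
```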

In [1]:
colab = False
azure = True

In [2]:
if colab:
    # Import the library to mount Google Drive
    from google.colab import drive
    # Mount the Google Drive at /content/drive
    drive.mount('/content/drive')
    # Verify by listing the files in the drive
    # !ls /content/drive/My\ Drive/
    # current dir in colab
    !pwd

In [3]:
if colab:
    !pip install --upgrade autogluon.tabular
    !pip install --upgrade mlflow
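The training log further down reports `ModuleNotFoundError: No module named 'fastai'`, which suggests that not every optional model backend was installed in the environment used for this run. A small pre-flight check like the following can surface that before a long experiment starts. The list of module names is an assumption; adjust it to the model types you intend to train.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed list of optional backends; only top-level module names are checked.
optional_backends = ["lightgbm", "xgboost", "catboost", "torch", "fastai"]

missing = missing_modules(optional_backends)
if missing:
    print(f"Not installed: {missing}")
else:
    print("All listed backends are importable.")
```

If a backend is missing, either install it or exclude the corresponding model type from the run, for example through the `excluded_model_types` argument that is commented out in the training cell.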

In [4]:
import os
import sys
import yaml
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import classification_report
import imblearn
from imblearn.over_sampling import RandomOverSampler

from autogluon.tabular import TabularDataset, TabularPredictor

import mlflow

# ignore warnings
import warnings
warnings.filterwarnings('ignore')


In [6]:
# NOTE: if used in google colab, upload env_vars_colab.yml to current google colab directory!

# get config
if colab:
    config_path = './env_vars_colab.yml'
elif azure:
    config_path = '../env_vars_azureml_compute.yml'
else:
    config_path = '../env_vars.yml'

with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

# custom imports
sys.path.append(config['project_directory'])

# from src import utils

In [7]:
# General settings within the data science workflow

pd.set_option('display.max_columns', None)

SEED = 42

# NOTE: for dev only
subsample = False
subsample_size = 100  # subsample subset of data for faster demo or development

experiment_time_limit = 8*60*60  # seconds (8 hours); previously 3*60*60

# Get current date and time
now = datetime.datetime.now()
formatted_date_time = now.strftime("%Y-%m-%d_%H:%M:%S")
print(formatted_date_time)

2024-11-14_21:50:31


# Load and prepare data

In [8]:
df = pd.read_csv(f"{config['data_directory']}/output/df_ml.csv", sep='\t')

df['material_number'] = df['material_number'].astype('object')

df_sub = df[[
    'material_number',
    'brand',
    'product_area',
    'core_segment',
    'component',
    'manufactoring_location',
    'characteristic_value',
    'material_weight', 
    'packaging_code',
    'packaging_category',
]]

# AutoML: without custom pre-processing; restricted selection of models including HPO and model ensembling

## Split data into train and test

In [9]:
# Define features and target
X = df_sub.iloc[:, :-1]
y = df_sub.iloc[:, -1]  # the last column is the target

# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y,
    random_state=SEED
)
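To see what `stratify=y` does in the split above, the following sketch runs the same call on an invented 80/20 imbalanced target and compares class shares in both splits. The data and column names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented imbalanced target: 80 rows of one class, 20 of another.
toy_y = pd.Series(["common"] * 80 + ["rare"] * 20, name="target")
toy_X = pd.DataFrame({"feature": range(100)})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratification, both splits keep the original class proportions.
train_share = toy_y_train.value_counts(normalize=True)
test_share = toy_y_test.value_counts(normalize=True)
print(train_share.to_dict(), test_share.to_dict())
```

Note that stratified splitting raises a `ValueError` when a class has only a single member, so very rare classes may need to be merged or filtered beforehand.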

## Transform to AutoML data format

In [10]:
df_train = pd.concat([X_train, y_train], axis=1)

In [11]:
train_data = TabularDataset(df_train)
if subsample is True:
    train_data = train_data.sample(n=subsample_size, random_state=SEED)

## AutoML training pipeline

In [None]:
label = 'packaging_category'
automl_predictor = TabularPredictor(
    label=label,
    problem_type='multiclass',
    eval_metric='f1_macro',
    sample_weight='balance_weight'
).fit(
    train_data=train_data,
    # With tuning_data=None, fit() automatically holds out some random validation examples from train_data.
    tuning_data=None,
    # If None, the default value is selected based on the number of rows in the training data.
    holdout_frac=0.2,
    time_limit=experiment_time_limit,  # in seconds
    # Default is ['medium_quality']; any user-specified arguments in fit() override the values used by presets.
    presets=['high_quality'],
    # auto_stack=False,  # Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy.
    # included_model_types=['LR', 'KNN', 'RF', 'XT', 'GBM', 'XGB', 'CAT', 'NN'],
    # excluded_model_types=['FASTAI', 'AG_AUTOMM'],
    # HPO is not performed unless hyperparameter_tune_kwargs is specified. Search spaces are
    # provided for some models, but not for all. Where no search space is provided, a fixed
    # set of hyperparameters is used (see /searchspace under each model:
    # https://github.com/autogluon/autogluon/tree/master/tabular/src/autogluon/tabular/models).
    hyperparameter_tune_kwargs={
        # 'num_trials': 15,  # try at most n different hyperparameter configurations for each type of model
        'scheduler': 'local',
        # 'auto': Bayesian optimization search on NN_TORCH and FASTAI models, random search on other models.
        'searcher': 'auto',
    },  # Refer to the TabularPredictor.fit docstring for all valid values
)

No path specified. Models will be saved in: "AutogluonModels/ag-20241114_215127"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.11.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #82~20.04.1-Ubuntu SMP Tue Sep 3 12:27:43 UTC 2024
CPU Count:          4
Memory Avail:       28.42 GB / 31.34 GB (90.7%)
Disk Space Avail:   102399.87 GB / 102400.00 GB (100.0%)
Presets specified: ['high_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Note: `save_bag_folds=False`! This will greatly reduce peak disk usage during fit (by ~8x), but runs the risk of an out-of-memory error during model refit if memory is small relative to the data size.
	You can avoid this risk by setting `save_bag_folds=True`.
DyStack is enabled (dynamic_stacking=True). AutoGluon will 

	Running DyStack sub-fit in a ray process to avoid memory leakage. Enabling ray logging (enable_ray_logging=True). Specify `ds_args={'enable_ray_logging': False}` if you experience logging issues.
2024-11-14 21:51:30,321	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
		Context path: "AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho"
(_dystack pid=15333) Running DyStack sub-fit ...
(_dystack pid=15333) Using predefined sample weighting strategy: balance_weight. Evaluation metrics will ignore sample weights, specify weight_evaluation=True to instead report weighted metrics.
(_dystack pid=15333) Beginning AutoGluon training ... Time limit = 7197s
(_dystack pid=15333) AutoGluon will save models to "AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho"
(_dystack pid=15333) Train Data Rows:    59005
(_dystack pid=15333) Train Data Columns: 9
(_dystack p

(_dystack pid=15333) ╭───────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetFastAI_BAG_L1   │
(_dystack pid=15333) ├───────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator          │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler            │
(_dystack pid=15333) │ Number of trials                 1000                     │
(_dystack pid=15333) ╰───────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetFastAI_BAG_L1


(model_trial pid=15653) ray::_ray_fit() (pid=15701, ip=10.0.0.4)
(model_trial pid=15653) ModuleNotFoundError: No module named 'fastai'
(model_trial pid=15653)
(model_trial pid=15653) During handling of the above exception, another exception occurred:
(model_trial pid=15653)
(model_trial pid=15653) ray::_ray_fit() (pid=15701, ip=10.0.0.4)
(model_trial pid=15653)   File "/home/azureuser/miniforge3/envs/py_ml_packaging_classification/lib/python3.11/site-packages/autogluon/core/models/ensemble/fold_fitting_strategy.py", line 402, in _ray_fit
(model_trial pid=15653)     fold_model.fit(X=X_fold, y=y_fold, X_val=X_val_fold, y_val=y_val_fold, time_limit=time_limit_fold, **resources, **kwargs_fold)
(model_trial pid=15653)   File "/home/azureuser/miniforge3/envs/py_ml_packaging_classification/lib/python3.11/site-packages/autogluon/core/models/abstract/abstract_model.py", line 856, in fit
(

(_dystack pid=15333)


(_dystack pid=15333) Failed to fetch metrics for 8 trial(s):
(_dystack pid=15333) - dd4bfdaa: FileNotFoundError('Could not fetch metrics for dd4bfdaa: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetFastAI_BAG_L1/dd4bfdaa')
(_dystack pid=15333) - 3e8c7824: FileNotFoundError('Could not fetch metrics for 3e8c7824: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetFastAI_BAG_L1/3e8c7824')
(_dystack pid=15333) - b0fcf057: FileNotFoundError('Could not fetch metrics for b0fcf057: both result.json and progress.csv were not found at /mnt/batch/tasks/

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator         │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler           │
(_dystack pid=15333) │ Number of trials                 1000                    │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_BAG_L1


(_dystack pid=15333) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
(_dystack pid=15333) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
(_dystack pid=15333) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
(_dystack pid=15333) I0000 00:00:1731621444.511914   15475 chttp2_transport.cc:1161] ipv4:10.0.0.4:38731: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T21:57:24.511306387+00:00"}
(_ray_fit pid=21334)   self.model = torch.load(net_filename)

(_dystack pid=15333)


(_dystack pid=15333) Failed to fetch metrics for 2 trial(s):
(_dystack pid=15333) - 37ce378b: FileNotFoundError('Could not fetch metrics for 37ce378b: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_BAG_L1/37ce378b')
(_dystack pid=15333) - e1902f61: FileNotFoundError('Could not fetch metrics for e1902f61: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_BAG_L1/e1902f61')
(_dystack pid=15333) No model was trained during hyperparameter tuning NeuralNetTorch_BAG_L1... Skipping this model.
(_dystack pid=15333) Fitting model:

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_r79_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator             │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler               │
(_dystack pid=15333) │ Number of trials                 1000                        │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r79_BAG_L1


(_ray_fit pid=22923)   self.model = torch.load(net_filename)
(_ray_fit pid=23169)   self.model = torch.load(net_filename) [repeated 4x across cluster]
(model_trial pid=22878) I0000 00:00:1731621614.623844   22904 chttp2_transport.cc:1161] ipv4:10.0.0.4:36783: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T22:00:14.623816917+00:00"}
(model_trial pid=22878) Classification metrics can't handle a mix of unknown and multiclass targets
(model_trial pid=22878) Traceback (most recent call last):
(model_trial pid=22878)   File "/home/azureuser/miniforge3/envs/py_ml_packaging_classification/lib/python3.11/site-packages/autogluon/core/models/abstract/model_trial.py", line 37, in model_trial
(model_trial pid=22878)     model = fit_and_save_model(
(model_trial pid=22878)             ^^^^^^^^^^^^^^^^^^^
(model_

(_dystack pid=15333)


(_dystack pid=15333) Failed to fetch metrics for 2 trial(s):
(_dystack pid=15333) - 68235dcc: FileNotFoundError('Could not fetch metrics for 68235dcc: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r79_BAG_L1/68235dcc')
(_dystack pid=15333) - f2305926: FileNotFoundError('Could not fetch metrics for f2305926: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r79_BAG_L1/f2305926')
(_dystack pid=15333) No model was trained during hyperparameter tuning NeuralNetTorch_r79_BAG_L1... Skipping this model.
(_dystack pid=15333) Hy

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_r22_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator             │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler               │
(_dystack pid=15333) │ Number of trials                 1000                        │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r22_BAG_L1


(_ray_fit pid=25082)   self.model = torch.load(net_filename)
(_ray_fit pid=25312)   self.model = torch.load(net_filename) [repeated 4x across cluster]
(model_trial pid=25036) I0000 00:00:1731621795.320402   25080 chttp2_transport.cc:1161] ipv4:10.0.0.4:36059: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T22:03:15.320216187+00:00"}
(model_trial pid=25036) I0000 00:00:1731621795.473974   25061 chttp2_transport.cc:1161] ipv4:10.0.0.4:42961: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:03:15.473959231+00:00", http2_error:2, grpc_status:14}
(model_trial pid=25036) Classification metrics can't handle a mix of unknown and multiclass targets
(model_trial pid=25036) Traceback (most recent call last):
(model_trial pid=25036)   File "/home/az

(_dystack pid=15333)


(_dystack pid=15333) 	No hyperparameter search space specified for XGBoost_r33_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
(_dystack pid=15333) 	Fitting 5 child models (S1F1 - S1F5) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=1, gpus=0, memory=0.22%)
(_ray_fit pid=25327)   self.model = torch.load(net_filename) [repeated 3x across cluster]
(_dystack pid=15333) I0000 00:00:1731621819.149292   15477 chttp2_transport.cc:1161] ipv4:10.0.0.4:39677: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:03:39.149274237+00:00", http2_error:2, grpc_status:14}
(_dystack pid=15333) I0000 00:00:1731621837.707342   15483 chttp2_transport.cc:1161] ipv4:10.0.0.4:40985: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:03:57.707326234+00:00",

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_r30_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator             │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler               │
(_dystack pid=15333) │ Number of trials                 1000                        │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r30_BAG_L1


(_ray_fit pid=28472)   self.model = torch.load(net_filename)
(_ray_fit pid=28704)   self.model = torch.load(net_filename) [repeated 4x across cluster]
(model_trial pid=28409) I0000 00:00:1731622108.554741   28458 chttp2_transport.cc:1161] ipv4:10.0.0.4:44891: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:08:28.554709857+00:00", http2_error:2, grpc_status:14}
(model_trial pid=28409) I0000 00:00:1731622109.005375   28456 chttp2_transport.cc:1161] ipv4:10.0.0.4:35055: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T22:08:29.00536128+00:00"}
(model_trial pid=28409) I0000 00:00:1731622110.321526   28456 chttp2_transport.cc:1161] ipv4:10.0.0.4:36865: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:

(_dystack pid=15333)


(_dystack pid=15333) Hyperparameter tuning model: LightGBM_r130_BAG_L1 ... Tuning model for up to 39.24s of the 6175.59s of remaining time.
(_dystack pid=15333) 	No hyperparameter search space specified for LightGBM_r130_BAG_L1. Skipping HPO. Will train one model based on the provided hyperparameters.
(_dystack pid=15333) 	Fitting 5 child models (S1F1 - S1F5) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=1, gpus=0, memory=0.11%)
(raylet) I0000 00:00:1731622139.396096   15309 chttp2_transport.cc:1161] ipv4:10.0.0.4:41731: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:08:59.396074893+00:00", http2_error:2, grpc_status:14}
(_dystack pid=15333) I0000 00:00:1731622163.425690   15613 chttp2_transport.cc:1161] ipv4:10.0.0.4:40467: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_r86_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator             │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler               │
(_dystack pid=15333) │ Number of trials                 1000                        │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r86_BAG_L1


(_ray_fit pid=29553)   self.model = torch.load(net_filename)
(model_trial pid=29509) I0000 00:00:1731622185.668268   29532 chttp2_transport.cc:1161] ipv4:10.0.0.4:41029: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T22:09:45.668243268+00:00"}
(_ray_fit pid=29792)   self.model = torch.load(net_filename) [repeated 4x across cluster]
(model_trial pid=29509) I0000 00:00:1731622201.417430   29534 chttp2_transport.cc:1161] ipv4:10.0.0.4:43065: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:10:01.417412242+00:00", http2_error:2, grpc_status:14}
(model_trial pid=29509) Classification metrics can't handle a mix of unknown and multiclass targets
(model_trial pid=29509) Traceback (most recent call last):
(model_trial pid=29509)   File "/home/az

(_dystack pid=15333)


(_dystack pid=15333) Failed to fetch metrics for 2 trial(s):
(_dystack pid=15333) - 9fb30edb: FileNotFoundError('Could not fetch metrics for 9fb30edb: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r86_BAG_L1/9fb30edb')
(_dystack pid=15333) - 48f7ce84: FileNotFoundError('Could not fetch metrics for 48f7ce84: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r86_BAG_L1/48f7ce84')
(_dystack pid=15333) No model was trained during hyperparameter tuning NeuralNetTorch_r86_BAG_L1... Skipping this model.
(_dystack pid=15333) Hy

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_r14_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator             │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler               │
(_dystack pid=15333) │ Number of trials                 1000                        │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r14_BAG_L1


(_ray_fit pid=31764)   self.model = torch.load(net_filename)
(_ray_fit pid=32003)   self.model = torch.load(net_filename) [repeated 4x across cluster]
(model_trial pid=31719) Classification metrics can't handle a mix of unknown and multiclass targets
(model_trial pid=31719) Traceback (most recent call last):
(model_trial pid=31719)   File "/home/azureuser/miniforge3/envs/py_ml_packaging_classification/lib/python3.11/site-packages/autogluon/core/models/abstract/model_trial.py", line 37, in model_trial
(model_trial pid=31719)     model = fit_and_save_model(
(model_trial pid=31719)             ^^^^^^^^^^^^^^^^^^^
(model_trial pid=31719)   File "/home/azureuser/miniforge3/envs/py_ml_packaging_classification/lib/python3.11/site-packages/autogluon/core/models/abstract/model_trial.py", line 104, in fit_and_save_model
(model_trial pid=31719)     model.val_score = model.score_with_y_pred_proba(y=fit_args["

(_dystack pid=15333)


(_dystack pid=15333) Failed to fetch metrics for 2 trial(s):
(_dystack pid=15333) - f20f9b59: FileNotFoundError('Could not fetch metrics for f20f9b59: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r14_BAG_L1/f20f9b59')
(_dystack pid=15333) - c7306935: FileNotFoundError('Could not fetch metrics for c7306935: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r14_BAG_L1/c7306935')
(_dystack pid=15333) No model was trained during hyperparameter tuning NeuralNetTorch_r14_BAG_L1... Skipping this model.
(_dystack pid=15333) Hy

(_dystack pid=15333) ╭──────────────────────────────────────────────────────────────╮
(_dystack pid=15333) │ Configuration for experiment     NeuralNetTorch_r41_BAG_L1   │
(_dystack pid=15333) ├──────────────────────────────────────────────────────────────┤
(_dystack pid=15333) │ Search algorithm                 SearchGenerator             │
(_dystack pid=15333) │ Scheduler                        FIFOScheduler               │
(_dystack pid=15333) │ Number of trials                 1000                        │
(_dystack pid=15333) ╰──────────────────────────────────────────────────────────────╯
(_dystack pid=15333)
(_dystack pid=15333) View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r41_BAG_L1


[36m(_dystack pid=15333)[0m I0000 00:00:1731622613.668391   15477 chttp2_transport.cc:1161] ipv4:10.0.0.4:44125: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T22:16:53.668375477+00:00"}
[36m(_ray_fit pid=34819)[0m   self.model = torch.load(net_filename)
[36m(_ray_fit pid=35054)[0m   self.model = torch.load(net_filename)[32m [repeated 4x across cluster][0m
[36m(model_trial pid=34775)[0m I0000 00:00:1731622647.819526   34786 chttp2_transport.cc:1161] ipv4:10.0.0.4:33801: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {grpc_status:14, http2_error:2, created_time:"2024-11-14T22:17:27.81950196+00:00"}
[36m(model_trial pid=34775)[0m Classification metrics can't handle a mix of unknown and multiclass targets
[36m(model_trial pid=34775)[0m Traceback (most recent call last):
[36m(model_trial pid=34775)[0m   File "/home/azureu

[36m(_dystack pid=15333)[0m 


[36m(_dystack pid=15333)[0m Failed to fetch metrics for 2 trial(s):
[36m(_dystack pid=15333)[0m - 12619ac9: FileNotFoundError('Could not fetch metrics for 12619ac9: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r41_BAG_L1/12619ac9')
[36m(_dystack pid=15333)[0m - 97dbf835: FileNotFoundError('Could not fetch metrics for 97dbf835: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r41_BAG_L1/97dbf835')
[36m(_dystack pid=15333)[0m No model was trained during hyperparameter tuning NeuralNetTorch_r41_BAG_L1... Skipping this model.
[36m(_dystack pid=15333)[0m Hy

[36m(_dystack pid=15333)[0m ╭───────────────────────────────────────────────────────────────╮
[36m(_dystack pid=15333)[0m │ Configuration for experiment     NeuralNetTorch_r158_BAG_L1   │
[36m(_dystack pid=15333)[0m ├───────────────────────────────────────────────────────────────┤
[36m(_dystack pid=15333)[0m │ Search algorithm                 SearchGenerator              │
[36m(_dystack pid=15333)[0m │ Scheduler                        FIFOScheduler                │
[36m(_dystack pid=15333)[0m │ Number of trials                 1000                         │
[36m(_dystack pid=15333)[0m ╰───────────────────────────────────────────────────────────────╯
[36m(_dystack pid=15333)[0m 
[36m(_dystack pid=15333)[0m View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r158_BAG_L1


[36m(_dystack pid=15333)[0m I0000 00:00:1731622736.510150   15613 chttp2_transport.cc:1161] ipv4:10.0.0.4:46781: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:18:56.510127004+00:00", http2_error:2, grpc_status:14}
[36m(_ray_fit pid=36251)[0m   self.model = torch.load(net_filename)
[36m(_ray_fit pid=36489)[0m   self.model = torch.load(net_filename)[32m [repeated 4x across cluster][0m
[36m(model_trial pid=36207)[0m I0000 00:00:1731622771.199640   36232 chttp2_transport.cc:1161] ipv4:10.0.0.4:46339: Got goaway [2] err=UNAVAILABLE:GOAWAY received; Error code: 2; Debug Text: Cancelling all calls {created_time:"2024-11-14T22:19:31.19961515+00:00", http2_error:2, grpc_status:14}
[36m(model_trial pid=36207)[0m Classification metrics can't handle a mix of unknown and multiclass targets
[36m(model_trial pid=36207)[0m Traceback (most recent call last):
[36m(model_trial pid=36207)[0m   File "/home/azureu

[36m(_dystack pid=15333)[0m 


[36m(_dystack pid=15333)[0m Reached timeout of 39.24433434375741 seconds. Stopping all trials.
[36m(_dystack pid=15333)[0m Wrote the latest version of all result files and experiment state to '/mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r158_BAG_L1' in 0.0744s.
[36m(_dystack pid=15333)[0m Failed to fetch metrics for 3 trial(s):
[36m(_dystack pid=15333)[0m - bea413ae: FileNotFoundError('Could not fetch metrics for bea413ae: both result.json and progress.csv were not found at /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r158_BAG_L1/bea413ae')
[36m(_dystack pid=15333)[0m - c02fb007: FileNotFoundError('Could not fetch metrics for c02fb007: both result.j

[36m(_dystack pid=15333)[0m ╭───────────────────────────────────────────────────────────────╮
[36m(_dystack pid=15333)[0m │ Configuration for experiment     NeuralNetTorch_r197_BAG_L1   │
[36m(_dystack pid=15333)[0m ├───────────────────────────────────────────────────────────────┤
[36m(_dystack pid=15333)[0m │ Search algorithm                 SearchGenerator              │
[36m(_dystack pid=15333)[0m │ Scheduler                        FIFOScheduler                │
[36m(_dystack pid=15333)[0m │ Number of trials                 1000                         │
[36m(_dystack pid=15333)[0m ╰───────────────────────────────────────────────────────────────╯
[36m(_dystack pid=15333)[0m 
[36m(_dystack pid=15333)[0m View detailed results here: /mnt/batch/tasks/shared/LS_root/mounts/clusters/packaginge4dsv5/code/Users/david.tiefenthaler/ml_packaging_classification/notebooks/AutogluonModels/ag-20241114_215127/ds_sub_fit/sub_fit_ho/models/NeuralNetTorch_r197_BAG_L1


[36m(_dystack pid=15333)[0m Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
[36m(_dystack pid=15333)[0m Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
[36m(_dystack pid=15333)[0m Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
[36m(_ray_fit pid=37746)[0m   self.model = torch.load(net_filename)


In [None]:
# Leaderboard ranked by validation score (no test data is passed, so only score_val is reported)
automl_predictor.leaderboard()

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L3,0.859076,f1_macro,1188.566702,6820.678373,0.046121,42.044546,3,True,108
1,ExtraTreesGini_BAG_L2,0.819919,f1_macro,1140.665690,5161.254293,40.693184,80.217303,2,True,64
2,ExtraTrees_r126_BAG_L2,0.817493,f1_macro,1143.839039,5163.883235,43.866533,82.846245,2,True,103
3,XGBoost_r194_BAG_L2,0.816081,f1_macro,1103.829203,5284.588265,3.856698,203.551275,2,True,77
4,ExtraTrees_r49_BAG_L2,0.813140,f1_macro,1130.381731,5138.845068,30.409225,57.808078,2,True,87
...,...,...,...,...,...,...,...,...,...,...
103,CatBoost_r60_BAG_L2,0.111131,f1_macro,1101.568627,5237.745740,1.596121,156.708750,2,True,97
104,CatBoost_r137_BAG_L2,0.086546,f1_macro,1101.468116,5202.956067,1.495610,121.919077,2,True,71
105,CatBoost_r50_BAG_L2,0.080389,f1_macro,1101.629409,5218.065786,1.656903,137.028796,2,True,76
106,CatBoost_r6_BAG_L2,0.080389,f1_macro,1101.632979,5209.795149,1.660473,128.758159,2,True,100


## Evaluate AutoML experiment and best model

In [None]:
# Evaluation of models on test data
df_test = pd.concat([X_test, y_test], axis=1)
test_data = TabularDataset(df_test)

automl_std_leaderboard_testdata = automl_predictor.leaderboard(test_data)
automl_std_leaderboard_testdata.head(10)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L3,0.79575,0.859076,f1_macro,280.82339,1188.566702,6820.678373,0.064283,0.046121,42.044546,3,True,108
1,LightGBMLarge_BAG_L2,0.793289,0.770915,f1_macro,232.414696,1102.842265,5402.890837,5.511735,2.869759,321.853848,2,True,66
2,ExtraTrees_r49_BAG_L2,0.785937,0.81314,f1_macro,228.219033,1130.381731,5138.845068,1.316072,30.409225,57.808078,2,True,87
3,ExtraTreesGini_BAG_L2,0.784021,0.819919,f1_macro,228.352076,1140.66569,5161.254293,1.449115,40.693184,80.217303,2,True,64
4,ExtraTrees_r126_BAG_L2,0.779704,0.817493,f1_macro,228.390524,1143.839039,5163.883235,1.487563,43.866533,82.846245,2,True,103
5,LightGBM_r161_BAG_L2,0.772103,0.758902,f1_macro,230.255658,1102.791656,5266.127891,3.352697,2.81915,185.090901,2,True,79
6,LightGBM_r143_BAG_L2,0.771631,0.758595,f1_macro,230.610279,1102.714723,5274.867135,3.707318,2.742217,193.830145,2,True,88
7,XGBoost_r194_BAG_L2,0.769743,0.816081,f1_macro,229.520288,1103.829203,5284.588265,2.617327,3.856698,203.551275,2,True,77
8,XGBoost_r98_BAG_L2,0.767747,0.791508,f1_macro,229.617231,1103.124689,5296.424059,2.71427,3.152184,215.387069,2,True,83
9,ExtraTreesEntr_BAG_L2,0.764335,0.789718,f1_macro,229.280277,1170.114186,5202.825697,2.377316,70.14168,121.788707,2,True,65
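
The difference between `score_val` and `score_test` in the leaderboard above can be quantified per model. The following is a minimal sketch in plain Python that uses a few rows copied from the table; the helper name `generalization_gap` is illustrative and not part of AutoGluon.

```python
# (model, score_val, score_test) copied from the leaderboard above
rows = [
    ("WeightedEnsemble_L3", 0.859076, 0.79575),
    ("LightGBMLarge_BAG_L2", 0.770915, 0.793289),
    ("ExtraTrees_r49_BAG_L2", 0.81314, 0.785937),
    ("XGBoost_r194_BAG_L2", 0.816081, 0.769743),
]


def generalization_gap(score_val, score_test):
    """Positive gap: the model scored higher on validation than on test."""
    return score_val - score_test


gaps = {model: round(generalization_gap(v, t), 4) for model, v, t in rows}

# Print models from the largest to the smallest gap
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<24} gap = {gap:+.4f}")
```

Among these rows, the stacked ensemble shows the largest positive gap, while `LightGBMLarge_BAG_L2` scores higher on the test data than on the validation data.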


In [None]:
# For a single specified model: make predictions and perform detailed evaluation on hold out test data
# i = -1  # index of model to use
# model_to_use = automl_predictor.model_names()[i]
model_to_use = automl_std_leaderboard_testdata.iloc[0, 0] # use best model from leaderboard
print(f"Model to be evaluated: {model_to_use}")
preds_y_test = automl_predictor.predict(X_test, model=model_to_use)
print("Predictions:  ", list(preds_y_test)[:5])

print(classification_report(y_test, preds_y_test))

Model to be evaluated: WeightedEnsemble_L3
Predictions:   ['Blister and Insert Card', 'Corrugated carton', 'Plastic bag with header', 'Tube', 'Shrink film and insert o']
                            precision    recall  f1-score   support

   Blister and Insert Card       0.90      0.88      0.89      1749
  Blister and sealed blist       0.87      0.83      0.85      1582
            Book packaging       0.00      0.00      0.00         2
Cardb. Sleeve w - w/o Shr.       0.75      0.71      0.73       135
  Cardboard hanger w/o bag       1.00      0.84      0.91        80
    Carton cover (Lid box)       0.63      0.63      0.63       130
   Carton tube with or w/o       1.00      0.67      0.80         9
                      Case       0.72      0.92      0.81        97
         Corrugated carton       0.81      0.84      0.82       774
        Countertop display       1.00      0.97      0.98        30
                  Envelope       0.95      0.98      0.97        59
          Fab

# AutoML: custom pre-processing; restricted selection of models including HPO and model ensembling

## Define features and target, perform oversampling, split data into train and test

In [None]:
# Define features and target
X = df_sub.iloc[:, :-1]
y = df_sub.iloc[:, -1]  # the last column is the target

# Oversampling
distribution_classes = y.value_counts()
print('Class distribution before oversampling')
print(distribution_classes.to_dict())
# NOTE: Oversample so that each class has at least 100 samples, to properly apply CV and evaluation.
# NOTE: Oversampling happens before the train/test split, so duplicated rows of the rare classes
#       can end up in both sets; test scores for these classes are therefore optimistic.
dict_oversampling = {
    'Metal Cassette': 100,
    'Carton tube with or w/o': 100,
    'Wooden box': 100,
    'Fabric packaging': 100,
    'Book packaging': 100
}
# define oversampling strategy
oversampler = RandomOverSampler(sampling_strategy=dict_oversampling, random_state=SEED)
# fit and apply the transform
X_oversample, y_oversample = oversampler.fit_resample(X, y)
print('Class distribution after oversampling')
print(y_oversample.value_counts().to_dict())

# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_oversample, y_oversample, test_size=0.2, stratify=y_oversample,
    random_state=SEED
)

Class distribution before oversampling
{'Hanger/ Clip': 13543, 'Tube': 11687, 'Blister and Insert Card': 8744, 'TightPack': 8296, 'Folding carton': 8219, 'Blister and sealed blist': 7912, 'Corrugated carton': 3872, 'Paperboard pouch': 3478, 'Trap Folding Card': 2188, 'Plastic Pouch': 1904, 'Plastic bag with header': 1850, 'Plastic Cassette': 1708, 'Shrink film and insert o': 1499, 'Plastic Box': 1491, 'Unpacked': 1415, 'Skincard': 1143, 'Trap Card': 804, 'Cardb. Sleeve w - w/o Shr.': 676, 'Carton cover (Lid box)': 652, 'Case': 485, 'Tray Packer': 431, 'Cardboard hanger w/o bag': 400, 'Envelope': 295, 'Countertop display': 150, 'Metal Cassette': 50, 'Carton tube with or w/o': 44, 'Wooden box': 16, 'Fabric packaging': 15, 'Book packaging': 10}
Class distribution after oversampling
{'Hanger/ Clip': 13543, 'Tube': 11687, 'Blister and Insert Card': 8744, 'TightPack': 8296, 'Folding carton': 8219, 'Blister and sealed blist': 7912, 'Corrugated carton': 3872, 'Paperboard pouch': 3478, 'Trap Fo
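
Because the cell above oversamples before splitting, copies of the same rare-class row can land in both the training and the test set. One possible alternative is to split first and oversample only the training part. The sketch below illustrates that ordering in plain Python on toy labels; it does not use `RandomOverSampler` or `train_test_split`, and the helper names `stratified_split` and `oversample_to_minimum` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with roughly test_size of each class in test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def oversample_to_minimum(indices, labels, min_count, seed):
    """Duplicate random training indices until every class has >= min_count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    resampled = list(indices)
    for members in by_class.values():
        missing = min_count - len(members)
        if missing > 0:
            resampled.extend(rng.choices(members, k=missing))
    return resampled


# Toy labels; the two rare class sizes follow the distribution printed above
labels = ["Tube"] * 200 + ["Wooden box"] * 16 + ["Book packaging"] * 10

# 1) split first, 2) oversample the training indices only
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
train_idx_over = oversample_to_minimum(train_idx, labels, min_count=100, seed=42)

train_counts = Counter(labels[i] for i in train_idx_over)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With this ordering the test set keeps the original class frequencies and contains no row that was also used for training.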

In [None]:
# DEFINE & EXECUTE PIPELINE
# Define pipeline
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
numeric_feature_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('power_transform', PowerTransformer()),  # Yeo-Johnson by default (not a log transform); also standardizes the output
    # ('scale', MinMaxScaler())
])
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
categorical_feature_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
])
preprocess_pipeline = ColumnTransformer(
    transformers=[
        ('number', numeric_feature_pipeline, numerical_features),
        ('category', categorical_feature_pipeline, categorical_features)
    ],
    verbose_feature_names_out=False
).set_output(transform="pandas")
# transform data
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

# encode target variable
label_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, encoded_missing_value=-1)
y_train_transformed = label_encoder.fit_transform(y_train.to_frame())
y_train_transformed = pd.DataFrame(data=y_train_transformed, index=y_train.index, columns=[y_train.name])
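
The pipeline above is fitted on the training data only and later reused for the test data, so categories that occur only in the test data must be handled by the encoder. The sketch below shows how `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)` treats such a category; the toy column and its category names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column; the category names are illustrative only
train_col = np.array([["carton"], ["plastic"], ["carton"]], dtype=object)
test_col = np.array([["plastic"], ["metal"]], dtype=object)  # "metal" is unseen

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_col)                      # learn the categories from train only
encoded_test = encoder.transform(test_col)  # the unseen category is encoded as -1

print(encoder.categories_)
print(encoded_test.ravel().tolist())
```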

## Transform to AutoML data format

In [None]:
df_train = pd.concat([X_train_transformed, y_train_transformed], axis=1)

In [None]:
train_data = TabularDataset(df_train)
if subsample is True:
    train_data = train_data.sample(n=subsample_size, random_state=SEED)

## AutoML training pipeline

In [None]:
label = 'packaging_category'
automl_predictor = TabularPredictor(
    label=label,
    problem_type='multiclass',
    eval_metric='f1_macro',
    sample_weight='balance_weight'
).fit(
    train_data=train_data,
    tuning_data=None, # If tuning_data = None, fit() will automatically hold out some random validation examples from train_data.
    holdout_frac=0.2, # Default value (if None) is selected based on the number of rows in the training data.
    time_limit=experiment_time_limit, # 3*60*60
    presets=['high_quality'], # ['high_quality'] # default = ['medium_quality'], any user-specified arguments in fit() will override the values used by presets.
    # auto_stack=False, # Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy.
    # included_model_types=['LR', 'KNN', 'RF', 'XT', 'GBM', 'XGB', 'CAT', 'NN'], 
    # excluded_model_types=['FASTAI', 'AG_AUTOMM'],
    hyperparameter_tune_kwargs = {  # HPO is not performed unless hyperparameter_tune_kwargs is specified. Searchspaces are provided for some models, but not for all. Where no searchspace is provided, a fixed set of hyper-parameters is defined. (see /searchspace under each model: https://github.com/autogluon/autogluon/tree/master/tabular/src/autogluon/tabular/models).
        # 'num_trials': 15, # try at most n different hyperparameter configurations for each type of model
        'scheduler' : 'local',
        'searcher': 'auto', # ‘auto’: Perform bayesian optimization search on NN_TORCH and FASTAI models. Perform random search on other models.
    }  # Refer to TabularPredictor.fit docstring for all valid values
)

No model was trained during hyperparameter tuning NeuralNetTorch... Skipping this model.
Fitting model: LightGBMLarge ... Training model for up to 1993.81s of the 17758.43s of remaining time.
	0.7914	 = Validation score   (f1_macro)
	144.01s	 = Training   runtime
	55.27s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 2879.95s of the 17532.49s of remaining time.
	Ensemble Weights: {'LightGBMLarge': 0.357, 'KNeighborsDist': 0.214, 'RandomForestGini': 0.143, 'RandomForestEntr': 0.143, 'KNeighborsUnif': 0.071, 'ExtraTreesGini': 0.071}
	0.7983	 = Validation score   (f1_macro)
	3.17s	 = Training   runtime
	0.22s	 = Validation runtime
AutoGluon training complete, total runtime = 11271.22s ... Best model: WeightedEnsemble_L2
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20241113_225037")
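
The predictor above is configured with `sample_weight='balance_weight'` to counter the class imbalance. The general idea behind such balancing is inverse-frequency weighting, sketched below in plain Python. This is an assumption about the principle, not AutoGluon's internal code, and its exact normalization may differ; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels with a 9:1 imbalance
labels = ["Tube"] * 90 + ["Wooden box"] * 10
weights = balanced_class_weights(labels)

# With these weights, every class contributes the same total weight
total_weight = {cls: weights[cls] * cnt for cls, cnt in Counter(labels).items()}
print(weights)
print(total_weight)
```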


In [None]:
# Leaderboard ranked by validation score (no test data is passed, so only score_val is reported)
automl_predictor.leaderboard()

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.798341,f1_macro,63.077952,194.487381,0.218747,3.174761,2,True,8
1,LightGBMLarge,0.791435,f1_macro,55.271506,144.005281,55.271506,144.005281,1,True,7
2,RandomForestGini,0.763345,f1_macro,2.218809,15.697407,2.218809,15.697407,1,True,3
3,RandomForestEntr,0.761439,f1_macro,2.849804,26.464896,2.849804,26.464896,1,True,4
4,ExtraTreesEntr,0.759531,f1_macro,1.834237,5.276078,1.834237,5.276078,1,True,6
5,ExtraTreesGini,0.758739,f1_macro,1.936817,4.777079,1.936817,4.777079,1,True,5
6,KNeighborsDist,0.437977,f1_macro,0.298646,0.236018,0.298646,0.236018,1,True,2
7,KNeighborsUnif,0.321687,f1_macro,0.283623,0.131939,0.283623,0.131939,1,True,1
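
The fit log reports ensemble weights for `WeightedEnsemble_L2`. A weighted ensemble of this kind typically combines the base models by a weighted average of their predicted class probabilities. The sketch below illustrates that general technique with the reported weights and made-up probabilities for a single sample with three classes; it is not AutoGluon's implementation, and `weighted_average_proba` is a hypothetical helper.

```python
# Ensemble weights as reported in the fit log above
ensemble_weights = {
    "LightGBMLarge": 0.357,
    "KNeighborsDist": 0.214,
    "RandomForestGini": 0.143,
    "RandomForestEntr": 0.143,
    "KNeighborsUnif": 0.071,
    "ExtraTreesGini": 0.071,
}

# Made-up class probabilities of each base model for one sample (3 classes)
probas = {
    "LightGBMLarge": [0.7, 0.2, 0.1],
    "KNeighborsDist": [0.2, 0.6, 0.2],
    "RandomForestGini": [0.5, 0.3, 0.2],
    "RandomForestEntr": [0.6, 0.3, 0.1],
    "KNeighborsUnif": [0.1, 0.7, 0.2],
    "ExtraTreesGini": [0.5, 0.4, 0.1],
}


def weighted_average_proba(probas, weights):
    """Blend per-model probability vectors; weights are renormalized to sum to 1."""
    total = sum(weights.values())
    n_classes = len(next(iter(probas.values())))
    return [
        sum(weights[m] * probas[m][c] for m in weights) / total
        for c in range(n_classes)
    ]


blended = weighted_average_proba(probas, ensemble_weights)
predicted_class = blended.index(max(blended))
print([round(p, 3) for p in blended], "-> class", predicted_class)
```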


## Evaluate AutoML experiment and best model

In [None]:
# Evaluation of models on test data

# NOTE: Optionally load a TabularPredictor previously saved by fit(); otherwise reuse the predictor in memory.
try:
    # NOTE: set the directory of the saved model
    specific_path = None # Default: None ; Path format: 'AutogluonModels/ag-20241113_002120'
    if specific_path:
        automl_predictor = TabularPredictor.load(f"{config['autogluon_exp_storage_directory']}/{specific_path}")
    print(f"Model loaded from: {automl_predictor.path}")
except Exception as e:
    print(f"Model could not be loaded. An error occurred: {e}")

# process X_test for evaluation and predictions
X_test_transformed = preprocess_pipeline.transform(X_test)

# evaluate models on test data
y_test_transformed = label_encoder.transform(y_test.to_frame())
y_test_transformed = pd.DataFrame(data=y_test_transformed, index=y_test.index, columns=[y_test.name])
df_test = pd.concat([X_test_transformed, y_test_transformed], axis=1)
test_data = TabularDataset(df_test)

automl_custom_leaderboard_testdata = automl_predictor.leaderboard(test_data)
automl_custom_leaderboard_testdata.head(10)

Model loaded from: AutogluonModels/ag-20241113_225037


Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.740478,0.798341,f1_macro,91.967577,63.077952,194.487381,0.152463,0.218747,3.174761,2,True,8
1,ExtraTreesGini,0.73912,0.758739,f1_macro,3.753227,1.936817,4.777079,3.753227,1.936817,4.777079,1,True,5
2,ExtraTreesEntr,0.734841,0.759531,f1_macro,3.846254,1.834237,5.276078,3.846254,1.834237,5.276078,1,True,6
3,LightGBMLarge,0.73464,0.791435,f1_macro,76.845919,55.271506,144.005281,76.845919,55.271506,144.005281,1,True,7
4,RandomForestEntr,0.728181,0.761439,f1_macro,6.205621,2.849804,26.464896,6.205621,2.849804,26.464896,1,True,4
5,RandomForestGini,0.72546,0.763345,f1_macro,4.805619,2.218809,15.697407,4.805619,2.218809,15.697407,1,True,3
6,KNeighborsDist,0.405158,0.437977,f1_macro,0.104866,0.298646,0.236018,0.104866,0.298646,0.236018,1,True,2
7,KNeighborsUnif,0.301214,0.321687,f1_macro,0.099862,0.283623,0.131939,0.099862,0.283623,0.131939,1,True,1


In [None]:
automl_custom_leaderboard_testdata.model.unique()

array(['WeightedEnsemble_L2', 'ExtraTreesGini', 'ExtraTreesEntr',
       'LightGBMLarge', 'RandomForestEntr', 'RandomForestGini',
       'KNeighborsDist', 'KNeighborsUnif'], dtype=object)

In [None]:
automl_custom_leaderboard_testdata[automl_custom_leaderboard_testdata['model'].str.contains('ExtraTreesGini')]

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
1,ExtraTreesGini,0.739069,0.758777,f1_macro,3.120127,3.023791,9.571872,3.120127,3.023791,9.571872,1,True,5


In [None]:
# For a single specified model: make predictions and perform detailed evaluation on hold out test data
# i = -1  # index of model to use
# model_to_use = automl_predictor.model_names()[i]
# model_to_use = automl_custom_leaderboard_testdata.iloc[0, 0] # use best model from leaderboard
model_to_use = automl_predictor.model_best
print(f"Model to be evaluated: {model_to_use}")
preds_y_test = automl_predictor.predict(X_test_transformed, model=model_to_use)
print("Predictions:  ", list(preds_y_test)[:5])

preds_y_test_inverse = label_encoder.inverse_transform(preds_y_test.to_frame()).ravel()  # flatten (n, 1) to 1-D labels

# print classification report for holdout test data
print(classification_report(y_test, preds_y_test_inverse))
report = classification_report(y_test, preds_y_test_inverse, output_dict=True)
f1_score = report['accuracy']  # NOTE: overall accuracy (equals micro-averaged F1 for single-label multiclass), not macro F1
f1_macro = report['macro avg']['f1-score']

# get best model parameters for mlflow tracking
trainer = automl_predictor._trainer
best_model = trainer.load_model(trainer.model_best)

Model to be evaluated: WeightedEnsemble_L2
Predictions:   [23.0, 0.0, 1.0, 26.0, 26.0]
                            precision    recall  f1-score   support

   Blister and Insert Card       0.77      0.85      0.81      1749
  Blister and sealed blist       0.78      0.77      0.78      1582
            Book packaging       0.91      1.00      0.95        20
Cardb. Sleeve w - w/o Shr.       0.59      0.41      0.49       135
  Cardboard hanger w/o bag       0.49      0.39      0.43        80
    Carton cover (Lid box)       0.55      0.54      0.54       130
   Carton tube with or w/o       0.68      0.85      0.76        20
                      Case       0.65      0.53      0.58        97
         Corrugated carton       0.77      0.75      0.76       774
        Countertop display       0.81      0.73      0.77        30
                  Envelope       0.94      0.86      0.90        59
          Fabric packaging       0.95      1.00      0.98        20
            Folding carton  
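
The classification report yields two different summary numbers: `report['accuracy']` and `report['macro avg']['f1-score']`. They can diverge strongly on imbalanced data, because macro F1 gives every class the same weight regardless of its support. The sketch below reproduces both quantities in plain Python on toy labels; the helper `per_class_f1` is illustrative and not taken from scikit-learn.

```python
def per_class_f1(y_true, y_pred):
    """F1 score per class, computed from true/false positives and false negatives."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[cls] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy data: the rare class "Case" is never predicted
y_true = ["Tube"] * 8 + ["Case"] * 2
y_pred = ["Tube"] * 10

f1 = per_class_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(f1)

print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

Here the accuracy stays high while macro F1 is roughly halved by the single class that is never predicted.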

## Track performance using MLflow

In [None]:
# NOTE: Change to a meaningful name
EXPERIMENT_NAME = "AutoPackagingCategories"
RUN_NAME = "run_AutoML_AutoGluonTabular"

with open('../env_vars.yml', 'r') as file:
    env_vars = yaml.safe_load(file)

project_dir = env_vars['project_directory']
os.makedirs(project_dir + '/mlruns', exist_ok=True)

mlflow.set_tracking_uri("file://" + project_dir + "/mlruns")

try:
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    EXPERIMENT_ID = experiment.experiment_id
except AttributeError:
    EXPERIMENT_ID = mlflow.create_experiment(
        EXPERIMENT_NAME,
        # mlflow.set_artifact_uri("file://" + project_dir + "/artifacts/")
    )

current_time = datetime.datetime.now()
time_stamp = str(current_time)
# NOTE: Change to a meaningful name for the single trial
# exp_run_name = f"run_MeaningfulTrialName_{time_stamp}"
exp_run_name = f"{RUN_NAME}_{time_stamp}"

# Start MLflow
with mlflow.start_run(experiment_id=EXPERIMENT_ID, run_name=exp_run_name) as run:

    # Retrieve run id
    RUN_ID = run.info.run_id

    # Track parameters
    # track pipeline configs: preprocessing_pipeline
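    pipeline_config = {'oversampler': type(oversampler), 'label_encoder': type(label_encoder)} | dict(preprocess_pipeline.named_transformers_)
    mlflow.log_dict(
        {name: str(obj) for name, obj in pipeline_config.items()},  # string representations keep the dict JSON-serializable
        "preprocessing_pipeline.json"
    )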

    # model-specific parameters
    mlflow.log_param('model', f'{type(best_model)}: {best_model.base_model_names}')
    mlflow.log_param('model_configs', best_model.get_trained_params())

    # Track metrics
    mlflow.log_dict(report, "classification_report.json")
    mlflow.log_metric("Report_Test_f1_score", f1_score)  # NOTE: f1_score holds report['accuracy']
    mlflow.log_metric("Report_Test_f1_macro", f1_macro)
    
    # Track model
    # mlflow.sklearn.log_model(clf, "classifier")

In [None]:
import time

def keep_alive_with_cpu_activity(duration_hours=1):
    """
    Keeps the compute instance alive by running a periodic CPU task for a specified duration.
    """
    start_time = time.time()
    end_time = start_time + duration_hours * 3600  # convert hours to seconds
    print(f"Keeping the instance alive for {duration_hours} hours with periodic CPU activity.")
    print("To stop the function, create an empty file named stop_signal.txt in the same directory as your notebook.")
    print(f"Current working directory: {os.getcwd()}")

    while time.time() < end_time:
        # Check if stop signal file exists
        if os.path.isfile("stop_signal.txt"):
            print("Stop signal received. Exiting the loop.")
            break

        # Perform a small computation to generate CPU activity
        _ = np.random.rand(1000, 1000).dot(np.random.rand(1000, 1000))  # 1000 x 1000 fits in memory; 1,000,000 x 1,000,000 would need terabytes
        time.sleep(60)  # Sleep for 60 seconds

    print("Finished keeping the instance alive.")

# Run for the desired duration (here: 8 hours)
keep_alive_with_cpu_activity(duration_hours=8)