# Overview

**GENERAL THOUGHTS:**



**DATA PREPROCESSING:**

Imbalanced data:
- over_sampling for imbalanced data
- cost-sensitive learning for imbalanced data

continuous data:
- Impute missing data: SimpleImputer(strategy='median')
- Standardize data: StandardScaler()

categorical data:
- Impute missing data: SimpleImputer(strategy='most_frequent')
- Ordinal & Nominal data encoding: OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
- Unknown values encoding: custom encoder "OrdinalEncoderExtensionUnknowns()"

target data:
- target encoding: OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
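
The preprocessing steps listed above could be wired together roughly as follows. This is a minimal sketch with scikit-learn's `Pipeline` and `ColumnTransformer`, not the notebook's actual data module: the toy values are made up, and only two of the notebook's columns are used for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data (hypothetical values); column names borrowed from this notebook.
df_train = pd.DataFrame({
    "material_weight": [1.0, 2.0, np.nan, 4.0],
    "brand": ["a", "b", "a", np.nan],
})
df_new = pd.DataFrame({"material_weight": [3.0], "brand": ["unseen"]})

# Continuous features: median imputation, then standardization
continuous_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical features: most-frequent imputation, then ordinal codes
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocessor = ColumnTransformer([
    ("cont", continuous_pipe, ["material_weight"]),
    ("cat", categorical_pipe, ["brand"]),
])

X_train = preprocessor.fit_transform(df_train)  # fit on training data only
X_new = preprocessor.transform(df_new)          # unseen category is encoded as -1
print(X_train)
print(X_new)
```

Note that a code of -1 is not a valid index for an embedding layer, so unseen categories need to be remapped before they reach the model.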

**MULTI-CLASS CLASSIFIER:**
- Overview of models to be considered:  
Models:
  - [X] Neural Net: MLP with categorical variable embedding (embMLP)


In [None]:
colab = True

In [None]:
if colab:
  # Import the library to mount Google Drive
  from google.colab import drive
  # Mount the Google Drive at /content/drive
  drive.mount('/content/drive')
  # Verify by listing the files in the drive
  # !ls /content/drive/My\ Drive/
  # current dir in colab
  !pwd

In [None]:
if colab:
    !pip install optuna==3.5.0
    # !pip install optuna.integration
    !pip install lightning

In [None]:
# import os
import sys
import yaml

# import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

import optuna
from optuna.integration import PyTorchLightningPruningCallback

import lightning as L
from lightning.pytorch.tuner import Tuner
import torch
from torch import nn
import torch.nn.functional as F
from lightning.pytorch.loggers import CSVLogger
from lightning.pytorch.callbacks import EarlyStopping

from typing import Dict, Iterable, List, Optional, Tuple, Union, Literal

In [None]:
# NOTE: if used in google colab, upload env_vars_colab.yml to current google colab directory!

# get config
config_path = './env_vars_colab.yml' if colab else '../env_vars.yml'
with open(config_path, 'r') as file:
    config = yaml.safe_load(file)

# custom imports
sys.path.append(config['project_directory'])

from src.tabular_lightning import TabularDataModuleClassificationPACKAGING, MulticlassTabularLightningModule
from src import tabular_lightning_utils as tl_utils

In [None]:
SEED = 42 # Ensure same data split as in other notebooks

# Datasets, DataLoaders & LightningDataModule
- Logic of using `Datasets`, `DataLoaders` & `LightningDataModule` for **Tabular Data**  
    Since we are using tabular data and want to apply non-tensor-based processing to our data, we use Datasets, DataLoaders & LightningDataModule differently than we would when applying tensor operations (incl. tensor-based preprocessing) only.  
    - `LightningDataModule` 
        Our LightningDataModule builds the wrapper around this process, with the following intentions:
        - `def prepare_data`  
        Data operations that should be performed only once on the data (and should not be performed in a distributed manner). Prepares the data for pre-processing, e.g. with sklearn.  
            - Loading the data from .csv into a pd.DataFrame
            - General dataset preparation like feature selection and data type correction
        - `def setup`  
        First, data operations (like shuffling, splitting, categorical encoding, normalization, etc.) that any dataframe should undergo before being fed into the dataloader are performed here. Since we use sklearn functionality for our tabular pre-processing pipeline, the input and output of the pre-processing are in tabular format (dataframe), not tensor format.
        Second, the outcome of `def setup` is one `Dataset` object per split (e.g. train, val, test, pred). Each wraps the pre-processed tabular data and provides the samples and their corresponding labels (ideally in tensor format) in a form that is compatible with the model(s).
            - `Dataset`  
            Dataset provides the samples and their corresponding labels in a form that is compatible with the model(s). We define the input for our tabular use case as a DataFrame, while the output should generally be a tensor. In our case the output is a tuple of a flattened tensor representing all features and a tensor for the target variable (label). This aligns with the input of a simple MLP model. For more complex models, e.g. ones that handle continuous and categorical variables differently, this should be adapted.  
            The class performs the data type corrections needed for neural networks (e.g. ensuring that all outputs are numeric values of the correct type, depending on whether they are categorical or continuous).
        - `def train/val/test/prediction_dataloader`  
        Creates our dataloaders within the LightningDataModule. See usage below.
            - `DataLoader` 
            DataLoader wraps an iterable around the Dataset to enable easy access to the samples during training and inference. The Dataset defines the format of the data that is passed to our model; the DataLoader retrieves the dataset’s features and labels. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval, all of which is handled by the DataLoader. Input and output are tensors. https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

- Excursion: Classical approach of using `Datasets`, `DataLoaders` & `LightningDataModule`, e.g. for images, ...  
The main difference is the use of tensors instead of a dataframe for efficient GPU usage.
    - `LightningDataModule` 
        Our LightningDataModule builds the wrapper around this process. It encapsulates training, validation, testing, and prediction dataloaders, as well as any necessary steps for data processing, downloads, and transformations. https://lightning.ai/docs/pytorch/stable/data/datamodule.html
        - `def prepare_data`  
        Loads the data and does general processing before transforming it to a tensor, so that efficient tensor operations can be used in `setup`.
        - `def setup`  
        Efficient tensor operations (like shuffling, splitting, categorical encoding, normalization, etc.) that any dataset should undergo before being fed into the dataloader.
        - `def train/val/test_dataloader`  
        Creates our dataloaders within the LightningDataModule.
    - `Dataset`
    Class to create a tabular dataset that follows PyTorch Lightning conventions (even though we are working on tabular data), where the data for the feature variables and the target variable are often already provided together (e.g. in contrast to images and their corresponding labels). In "classical" approaches, a Dataset class is often used at the start of the machine learning pipeline to bring the data into a common format (e.g. to combine images and their corresponding labels, which are typically not provided in the same file) for further processing and training.
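
To make the DataFrame-in, tensor-out idea concrete, here is a minimal sketch of a `Dataset` wrapped by a `DataLoader`. The class name `SketchTabularDataset` and the toy values are hypothetical. The sketch assumes a dict layout with the keys `"continuous"` and `"categorical"`, which is what the embedding MLP later in this notebook consumes in its `forward`; the project's actual dataset class in `src.tabular_lightning` may differ.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class SketchTabularDataset(Dataset):
    """Hypothetical wrapper: pre-processed DataFrame in, tensors out."""

    def __init__(self, df, continuous_cols, categorical_cols, target_col):
        # Continuous features as float32, categorical codes and labels as int64
        self.cont = torch.tensor(df[continuous_cols].to_numpy(), dtype=torch.float32)
        self.cat = torch.tensor(df[categorical_cols].to_numpy(), dtype=torch.long)
        self.y = torch.tensor(df[target_col].to_numpy(), dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        features = {"continuous": self.cont[idx], "categorical": self.cat[idx]}
        return features, self.y[idx]


# Already pre-processed toy data: scaled weights and ordinal-encoded categories
df = pd.DataFrame({
    "material_weight": [0.1, -0.3, 1.2, 0.0],
    "brand": [0, 1, 0, 2],
    "component": [1, 1, 0, 0],
    "packaging_category": [0, 2, 1, 0],
})
dataset = SketchTabularDataset(df, ["material_weight"], ["brand", "component"], "packaging_category")
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# The default collate function stacks the dict entries into batched tensors
features, labels = next(iter(loader))
print(features["continuous"].shape, features["categorical"].shape, labels.shape)
```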

# Load and Pre-process Data

In [None]:
dm = TabularDataModuleClassificationPACKAGING(
    data_dir=f"{config['data_directory']}/output/df_ml.csv",
    continuous_cols=['material_weight'],
    categorical_cols=[
        'material_number',
        'brand',
        'product_area',
        'core_segment',
        'component',
        'manufactoring_location',
        'characteristic_value',
        'packaging_code'
    ],
    target=['packaging_category'],
    oversampling=True,
    test_size=0.2,
    val_size=0.2,
    batch_size=64,
    num_workers_train=0, # (os.cpu_count() / 2)
    num_workers_inference=0, # (os.cpu_count() / 2)
    SEED=SEED
)

In [None]:
# NOTE: Run dm.prepare_data() and dm.setup() to get information from the dataset to build the model.
dm.prepare_data()
dm.setup(stage='fit')
# dm.data.info()
# dm.data.head()

In [None]:
dm.train_dataset.get_dataframe.head(5)

In [None]:
tl_utils.check_data_consitancy(dm)

In [None]:
tl_utils.check_dataloader_output(dm, next(iter(dm.train_dataloader())))

In [None]:
tl_utils.print_dataloader_output(dm)

# Models and Training/HPO

## Embedding MLP without HPO

In [None]:
tabular_data_full = pd.concat([dm.train_dataset.get_dataframe, dm.val_dataset.get_dataframe, dm.test_dataset.get_dataframe], axis=0, ignore_index=True)
embedding_sizes_cat_features = tl_utils.get_cat_feature_embedding_sizes(tabular_data_full, categorical_cols=dm.categorical_cols)
embedding_sizes_cat_features
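
The implementation of `tl_utils.get_cat_feature_embedding_sizes` is not shown in this notebook. As an illustration only, the sketch below shows one common rule of thumb that produces the same `Dict[str, Tuple[int, int]]` layout the model expects. The function name `sketch_embedding_sizes`, the cap `max_dim`, and the rule itself are assumptions, not the project's actual logic.

```python
from typing import Dict, List, Tuple

import pandas as pd


def sketch_embedding_sizes(
    df: pd.DataFrame, categorical_cols: List[str], max_dim: int = 50
) -> Dict[str, Tuple[int, int]]:
    """Hypothetical heuristic: (number of codes, embedding dimension) per column."""
    sizes = {}
    for col in categorical_cols:
        # Codes are assumed to be integers 0..max, so the table needs max + 1 rows
        n_codes = int(df[col].max()) + 1
        # Rule of thumb: roughly half the cardinality, capped at max_dim
        sizes[col] = (n_codes, min(max_dim, (n_codes + 1) // 2))
    return sizes


df = pd.DataFrame({"brand": [0, 1, 2, 1], "component": list(range(4))})
sizes = sketch_embedding_sizes(df, ["brand", "component"])
print(sizes)
```

Computing the sizes on the concatenation of all splits, as the cell above does, ensures that every code occurring in validation or test data has a row in the embedding table.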

In [None]:
tl_utils.print_embbeding_input_output(dm)

In [None]:
class MulticlassTabularCatEmbeddingMLP(torch.nn.Module):
    def __init__(
        self,
        continuous_cols: List[str] = None,
        categorical_cols: List[str] = None,
        output_size: int = None,
        # embedding_dim: int = None,
        hidden_size: int = None,
        n_hidden_layers: int = None,
        activation_class: type[nn.Module] = nn.ReLU,
        dropout: float = None,
        norm: bool = True,
        embedding_sizes: (Dict[str, Tuple[int, int]]) = None,
    ):
        """Embedding Multi Layer Perceptron (embMLP) with embedding for categorical features for multiclass classification for tabular data.
        Args:
            continuous_cols (List[str]): order of continuous variables in tensor passed to forward function.
            categorical_cols (List[str]): order of categorical variables in tensor passed to forward function.
            output_size (int): Number of output classes.
            hidden_size (int): Number of neurons in hidden layers.
            n_hidden_layers (int): Number of hidden layers.
            activation_class (Type[nn.Module]): Activation module class, instantiated once per layer (e.g. nn.ReLU).
            dropout (float): Dropout rate.
            norm (bool): Whether to use layer normalization.
            embedding_sizes (Dict[str, Tuple[int, int]]): Dictionary of embedding sizes for each categorical feature.
        """
        super().__init__()
        self.continuous_cols = continuous_cols
        self.categorical_cols = categorical_cols
        self.output_size = output_size
        self.embedding_sizes = embedding_sizes
        # self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.n_hidden_layers = n_hidden_layers
        self.activation_class = activation_class
        self.dropout = dropout
        self.norm = norm

        ### define the Embedding MLP ###
        ## embedding layers
        # cont features
        self.cont_normalizing = nn.BatchNorm1d(len(self.continuous_cols))
        # cat features
        self.cat_embeddings = nn.ModuleDict()
        for name in embedding_sizes.keys():
            self.cat_embeddings[name] = nn.Embedding(
                embedding_sizes[name][0],
                embedding_sizes[name][1],
            )
        ## input layer mlp
        mlp_input_size = sum(value[1] for value in embedding_sizes.values()) + len(self.continuous_cols)
        module_list = [nn.Linear(mlp_input_size, hidden_size), activation_class()]
        if dropout is not None:
            module_list.append(nn.Dropout(dropout))
        if norm:
            module_list.append(nn.LayerNorm(hidden_size))
        ## hidden layers
        for _ in range(n_hidden_layers):
            module_list.extend([nn.Linear(hidden_size, hidden_size), activation_class()])
            if dropout is not None:
                module_list.append(nn.Dropout(dropout))
            if norm:
                module_list.append(nn.LayerNorm(hidden_size))
        ## output layer
        module_list.append(nn.Linear(hidden_size, output_size))

        self.mlp_layers = nn.Sequential(*module_list)

    def forward(self, x: Dict[str, torch.Tensor]) -> torch.Tensor:
        """Forward pass through the embMLP."""

        assert "continuous" in x or "categorical" in x, "x must contain either continuous or categorical features"

        ### forward embedding layers ###
        embed_vectors = []
        # cont features: batch-normalize if present
        if len(self.continuous_cols) > 0:
            embed_vectors.append(self.cont_normalizing(x["continuous"]))
        # cat features: one embedding lookup per categorical column
        if len(self.categorical_cols) > 0:
            for idx, emb in enumerate(self.cat_embeddings.values()):
                embed_vectors.append(emb(x["categorical"][:, idx]))
        # concatenate continuous and embedded categorical features
        output_vector_embed = torch.cat(embed_vectors, dim=1)

        ### forward hidden layers ###
        return self.mlp_layers(output_vector_embed)
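
The input width of the MLP follows from the embedding step: the sum of all embedding dimensions plus the number of continuous features. The standalone sketch below reproduces just that step with made-up sizes and a random batch so the shape arithmetic can be checked in isolation. The values in `embedding_sizes` are hypothetical.

```python
import torch
from torch import nn

# Hypothetical sizes: (number of codes, embedding dimension) per categorical column
embedding_sizes = {"brand": (5, 3), "component": (4, 2)}
n_continuous = 1

embeddings = nn.ModuleDict(
    {name: nn.Embedding(n_codes, dim) for name, (n_codes, dim) in embedding_sizes.items()}
)

batch = {
    "continuous": torch.randn(8, n_continuous),
    # one integer code per categorical column, in the same order as embedding_sizes
    "categorical": torch.stack(
        [torch.randint(0, n_codes, (8,)) for n_codes, _ in embedding_sizes.values()], dim=1
    ),
}

# Look up each column's embedding, then concatenate along the feature axis
embedded = [emb(batch["categorical"][:, idx]) for idx, emb in enumerate(embeddings.values())]
mlp_input = torch.cat([batch["continuous"], *embedded], dim=1)

expected_width = n_continuous + sum(dim for _, dim in embedding_sizes.values())
print(mlp_input.shape, expected_width)
```

Because columns are addressed by position, the order of the keys in `embedding_sizes` has to match the column order of the categorical tensor.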

In [None]:
# NOTE: Run dm.prepare_data() and dm.setup() to get information from the dataset to build your model.
multiclass_embMLP = MulticlassTabularCatEmbeddingMLP(
    continuous_cols=dm.continuous_cols,
    categorical_cols=dm.categorical_cols,
    output_size=dm.n_classes,
    hidden_size=64,
    n_hidden_layers=3,
    dropout=0.2,
    norm=True,
    embedding_sizes=embedding_sizes_cat_features
)

In [None]:
tl_utils.print_model_summary(multiclass_embMLP)

In [None]:
lightningmodel = MulticlassTabularLightningModule(
    n_classes=dm.n_classes,
    model=multiclass_embMLP,
    learning_rate=0.001,
)

In [None]:
empmlp_experiment_name = "embMLP-v0"

trainer = L.Trainer(
    devices="auto", # (os.cpu_count() / 2)
    callbacks=[
        EarlyStopping(monitor='val_loss', min_delta=0.00, patience=5),
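        # ModelCheckpoint(  # requires: from lightning.pytorch.callbacks import ModelCheckpoint
        #     monitor="val_loss",
        #     mode="min",
        #     save_top_k=1,
        #     dirpath=f"{save_path}/checkpoints",  # `save_path` must be defined first
        #     filename="best_model",
        # ),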
    ],
    logger=CSVLogger(save_dir="logs/", name=empmlp_experiment_name),
    max_epochs=2,
    precision='bf16-mixed',
    default_root_dir="lightning_logs/",
)

# create a Tuner
tuner = Tuner(trainer)

# finds learning rate automatically, update hparams.lr to that learning rate
lr_finder = tuner.lr_find(lightningmodel, datamodule=dm)
fig = lr_finder.plot(suggest=True)
#fig.savefig("lr_suggest.pdf")
# get suggestion
new_lr = lr_finder.suggestion()
print(new_lr)
# update hparams of the model
lightningmodel.learning_rate = new_lr

In [None]:
# training
trainer.fit(model=lightningmodel, train_dataloaders=dm.train_dataloader(), val_dataloaders=dm.val_dataloader()) # stage of the dataloader to use

In [None]:
metrics = pd.read_csv(f"{trainer.logger.log_dir}/metrics.csv")

tl_utils.plot_training_metrics(metrics)

In [None]:
score = trainer.test(model=lightningmodel, dataloaders=dm.test_dataloader())

In [None]:
score[0]['test_F1_macro_weighted']

In [None]:
# make predictions on test data and evaluate
preds_y_test = torch.cat(trainer.predict(model=lightningmodel, dataloaders=dm.test_dataloader()))
# inverse transform to get back to original labels
preds_y_test = dm.label_encoder_target.inverse_transform(preds_y_test.reshape(-1, 1))
y_test = dm.label_encoder_target.inverse_transform(dm.test_dataset.get_dataframe.iloc[:, -1].values.reshape(-1, 1))
# calculate classification report
print(classification_report(y_test, preds_y_test))
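
The decoding step above can be tried in isolation. The sketch below assumes the target encoder is a plain `OrdinalEncoder` and that predictions arrive as class indices (the `reshape(-1, 1)` in the cell above suggests so). The labels and logits are made up, and the `argmax` step is only needed if the model returns logits.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical class labels and model outputs
labels = np.array([["box"], ["bag"], ["tube"], ["box"]])
encoder = OrdinalEncoder().fit(labels)  # categories are sorted: bag=0, box=1, tube=2

logits = np.array([
    [0.1, 2.0, 0.3],   # class 1 (box)
    [1.5, 0.2, 0.1],   # class 0 (bag)
    [0.0, 0.1, 3.0],   # class 2 (tube)
    [0.9, 0.8, 0.1],   # class 0 (bag), a misclassification
])
pred_codes = logits.argmax(axis=1)

# inverse_transform expects a 2D array of shape (n_samples, 1)
pred_labels = encoder.inverse_transform(pred_codes.reshape(-1, 1)).ravel()
true_labels = labels.ravel()
print(classification_report(true_labels, pred_labels, zero_division=0))
```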

## Embedding MLP with HPO

### Perform HPO

In [None]:
class OptunaObjective(object):
    """Optuna objective for hyperparameter tuning."""
    def __init__(self, dm: TabularDataModuleClassificationPACKAGING = None):
        self.dm = dm
        self.dm.prepare_data()
        self.dm.setup(stage='fit')

        tabular_data_full = pd.concat([self.dm.train_dataset.get_dataframe, self.dm.val_dataset.get_dataframe, self.dm.test_dataset.get_dataframe], axis=0, ignore_index=True)
        self.embedding_sizes_cat_features = tl_utils.get_cat_feature_embedding_sizes(tabular_data_full, categorical_cols=self.dm.categorical_cols)

    def __call__(self, trial: optuna.Trial) -> float:
        
        # joblib.dump(study, 'study.pkl')

        # Define the hyperparameter search space
        hp_space_optuna = {
            'hidden_size': trial.suggest_categorical('hidden_size', [8, 16, 32, 64, 128]), # number of neurons in each layer
            'n_hidden_layers': trial.suggest_int("n_hidden_layers", 1, 6), # number of layers
            'batch_size': trial.suggest_categorical("batch_size", [16, 32, 64]), # number of samples per batch
            'dropout': trial.suggest_categorical("dropout", [0.0, 0.1, 0.2, 0.4]), # dropout rate
        }
        self.dm.batch_size = hp_space_optuna['batch_size']
        # Create a model
        model = MulticlassTabularCatEmbeddingMLP(
            continuous_cols=self.dm.continuous_cols,
            categorical_cols=self.dm.categorical_cols,
            output_size=self.dm.n_classes,
            hidden_size=hp_space_optuna['hidden_size'],
            n_hidden_layers=hp_space_optuna['n_hidden_layers'],
            dropout=hp_space_optuna['dropout'],
            embedding_sizes=self.embedding_sizes_cat_features
        )
        # Create a Lightning Module
        lightningmodel = MulticlassTabularLightningModule(
            n_classes=self.dm.n_classes,
            model=model,
            learning_rate=0.001,
        )
        # Create a Lightning Trainer
        trainer = L.Trainer(
            devices="auto", # (os.cpu_count() / 2)
            callbacks=[
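                # monitor the same metric the objective returns, so pruning matches direction="maximize"
                PyTorchLightningPruningCallback(trial, monitor="val_F1_macro_weighted"),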
                # EarlyStopping(monitor='val_loss', min_delta=0.00, patience=5),
            ],
            # logger=CSVLogger(save_dir="logs/", name="my-model"),
            max_epochs=2,
            precision='bf16-mixed',
            default_root_dir="lightning_logs/",
        )
        # Create a Tuner
        tuner = Tuner(trainer)
        lr_finder = tuner.lr_find(lightningmodel, datamodule=self.dm) # finds learning rate automatically
        new_lr = lr_finder.suggestion()
        lightningmodel.learning_rate = new_lr # update hparams of the model
        # Train the model
        trainer.fit(
            model=lightningmodel,
            train_dataloaders=self.dm.train_dataloader(),
            val_dataloaders=self.dm.val_dataloader()
        )

        # score = trainer.test(model=lightningmodel, dataloaders=self.dm.test_dataloader())
        # score[0]['test_F1_macro_weighted']

        return trainer.callback_metrics["val_F1_macro_weighted"].item()

In [None]:
# define hyper-parameter space, model + training, optimization metric via Objective
objective = OptunaObjective(
    dm=TabularDataModuleClassificationPACKAGING(
        data_dir=f"{config['data_directory']}/output/df_ml.csv",
        continuous_cols=['material_weight'],
        categorical_cols=[
            'material_number',
            'brand',
            'product_area',
            'core_segment',
            'component',
            'manufactoring_location',
            'characteristic_value',
            'packaging_code'
        ],
        target=['packaging_category'],
        oversampling=True,
        test_size=0.2,
        val_size=0.2,
        batch_size=64,
        SEED=SEED
    )
)

# define and run study for optimization
storage_name = f"sqlite:///{config['optuna_storage_directory']}/{empmlp_experiment_name}.db"
study = optuna.create_study(
    study_name=empmlp_experiment_name,
    storage=storage_name,
    load_if_exists=True,
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=SEED),
    pruner=optuna.pruners.MedianPruner()
)

# define the duration of the optimization via n_trials and/or timeout
study.optimize(
    objective,
    n_trials=1,
    # timeout=600,
    show_progress_bar=True
)

In [None]:
# print optimization results
print(f"Number of finished trials: {len(study.trials)}")
print("Best trial:")
best_trial = study.best_trial
print("  Performance: ", best_trial.value)
print('  Params: ', best_trial.params)
# print("  Params: ")
# for key, value in trial.params.items():
#     print(f"    {key}: {value}")

### Analyse Optuna study

In [None]:
# history of all trials
hist = study.trials_dataframe()
hist.head()

In [None]:
# plot performance of all trials
optuna.visualization.plot_optimization_history(study)

In [None]:
# plot each hyperparameter against the objective value
optuna.visualization.plot_slice(study)

In [None]:
# interactive parallel-coordinate plot of the high-dimensional parameter relationships
optuna.visualization.plot_parallel_coordinate(study)

In [None]:
# interactive contour plot of the objective value over pairs of hyperparameters
optuna.visualization.plot_contour(study)

### Evaluate best model

In [None]:
best_trial.params

In [None]:
# Define best model

best_params = best_trial.params

# Evaluate best model on test data again
def eval_best_model(best_params):
    # datamodule
    dm=TabularDataModuleClassificationPACKAGING(
        data_dir=f"{config['data_directory']}/output/df_ml.csv",
        continuous_cols=['material_weight'],
        categorical_cols=[
            'material_number',
            'brand',
            'product_area',
            'core_segment',
            'component',
            'manufactoring_location',
            'characteristic_value',
            'packaging_code'
        ],
        target=['packaging_category'],
        oversampling=True,
        test_size=0.2,
        val_size=0.2,
        batch_size=best_params['batch_size'],
        SEED=SEED
    )
    dm.prepare_data()
    dm.setup(stage='fit')
    # model
    tabular_data_full = pd.concat([dm.train_dataset.get_dataframe, dm.val_dataset.get_dataframe, dm.test_dataset.get_dataframe], axis=0, ignore_index=True)
    embedding_sizes_cat_features = tl_utils.get_cat_feature_embedding_sizes(tabular_data_full, categorical_cols=dm.categorical_cols)
    best_model = MulticlassTabularCatEmbeddingMLP(
            continuous_cols=dm.continuous_cols,
            categorical_cols=dm.categorical_cols,
            output_size=dm.n_classes,
            hidden_size=best_params['hidden_size'],
            n_hidden_layers=best_params['n_hidden_layers'],
            dropout=best_params['dropout'],
            norm=True,
            embedding_sizes=embedding_sizes_cat_features
        )
    # lightningmodel
    lightningmodel = MulticlassTabularLightningModule(
        n_classes=dm.n_classes,
        model=best_model,
        learning_rate=0.001,
    )
    # trainer
    trainer = L.Trainer(
        devices="auto", # (os.cpu_count() / 2)
        callbacks=[
            EarlyStopping(monitor='val_loss', min_delta=0.00, patience=5),
        ],
        logger=CSVLogger(save_dir="logs/", name="MLP-Tuning"),
        max_epochs=3,
        precision='bf16-mixed',
        default_root_dir="lightning_logs/",
    )
    # find learning rate
    tuner = Tuner(trainer)
    lr_finder = tuner.lr_find(lightningmodel, datamodule=dm) # finds learning rate automatically
    new_lr = lr_finder.suggestion()
    lightningmodel.learning_rate = new_lr # update hparams of the model
    # train model
    trainer.fit(
        model=lightningmodel,
        train_dataloaders=dm.train_dataloader(),
        val_dataloaders=dm.val_dataloader()
    )
    # evaluate model on test data
    score = trainer.test(model=lightningmodel, dataloaders=dm.test_dataloader())
    print(f"test_F1_macro_weighted: {score[0]['test_F1_macro_weighted']}")
    # make predictions on test data and evaluate
    preds_y_test = torch.cat(trainer.predict(model=lightningmodel, dataloaders=dm.test_dataloader()))
    preds_y_test = dm.label_encoder_target.inverse_transform(preds_y_test.reshape(-1, 1))
    y_test = dm.label_encoder_target.inverse_transform(dm.test_dataset.get_dataframe.iloc[:, -1].values.reshape(-1, 1))
    # calculate classification report
    print(classification_report(y_test, preds_y_test))

eval_best_model(best_params)