# **ALL TOOLS ON DECK**

---
AutoGluon is an AutoML tool for automating machine learning on tabular datasets. The objective is to investigate how the predictive performance of several other AutoML tools (Auto-sklearn, FLAML, TPOT, and MLJAR) compares with AutoGluon's.

While AutoGluon typically excels in longer training durations, our focus lies in examining its performance within shorter training times, specifically less than a minute. We aim to contrast this with an AutoGluon predictor trained for an equivalent total duration.

# Overview of AutoML Tools
## 1. AutoGluon
Description: AutoGluon is an open-source AutoML framework from Amazon that focuses on ease of use and high performance. It automates machine learning workflows, including feature engineering, model selection, and hyperparameter tuning.
### Strengths:
*   Versatility: Supports multiple data types and tasks (e.g., regression, classification).
*   Ensemble Learning: Automatically builds ensembles of models.
*   Efficiency: Optimized for performance with multi-threading and GPU support.

## 2. MLJAR
Description: MLJAR is a Python library that automates the machine learning pipeline with a focus on simplicity and interpretability. It also supports multiple types of data and tasks.
### Strengths:
*   Easy-to-Use Interface: Simplified API for quick model training.
*   Ensemble Learning: Combines multiple models to improve performance.
*   Feature Importance: Provides insights into feature importance.

## 3. TPOT
Description: TPOT (Tree-based Pipeline Optimization Tool) is an AutoML tool that uses genetic algorithms to optimize machine learning pipelines. It's part of the scikit-learn ecosystem.
### Strengths:
*   Pipeline Optimization: Automatically designs and optimizes machine learning pipelines.
*   Genetic Algorithms: Uses evolutionary algorithms to find the best models.
*   Customization: Allows for detailed control over the optimization process.

## 4. Auto-sklearn
Description: Auto-sklearn is an open-source AutoML tool built on top of the scikit-learn library. It uses Bayesian optimization to automate the process of model selection and hyperparameter tuning.
### Strengths:
*   Bayesian Optimization: Efficiently searches the hyperparameter space.
*   Meta-Learning: Leverages prior knowledge to warm-start the optimization.
*   Ensemble Learning: Automatically constructs ensembles of models to improve performance.
*   Extensible: Integrates well with the scikit-learn ecosystem, allowing for custom pipelines and preprocessing.

## 5. FLAML
Description: FLAML (Fast and Lightweight AutoML) is a lightweight open-source library developed by Microsoft Research. It focuses on efficient and fast AutoML for both classification and regression tasks.
### Strengths:
*   Efficiency: Optimized for fast performance and low computational cost.
*   Simplicity: Easy-to-use interface with minimal setup.
*   Customization: Allows users to specify constraints and customize the search space.
*   Versatility: Supports a range of machine learning tasks including time series forecasting.





### Install all required packages and import dependencies

In [None]:
!pip install pandas numpy auto-sklearn

In [None]:
import pandas as pd
import numpy as np
import pickle

### Parameters Settings

Change `random_seed` to draw different samples from the dataset; the default value is 42. To change the number of samples loaded for the practice sets, edit the loading cell three cells below. It is currently set to 20,000 to mimic the size of the exam set.

Change the time each model is allowed to train with the `time` parameter in the cell below. The default value is 45 seconds, which has proven to be a competitive trade-off between training time and performance.

In [None]:
random_seed = 42
time = 45

# Load the dataset that matches the experiment

In [None]:
# Final dataset

base_path = '../../data/exam_dataset/1'

X_train = pd.read_parquet(f'{base_path}/X_train.parquet')
y_train = pd.read_parquet(f'{base_path}/y_train.parquet')
train_dataset = pd.concat([X_train, y_train], axis=1)
#test = train_dataset.sample(frac=0.2, replace=False, random_state=random_seed)
exam_X_values = pd.read_parquet(f'{base_path}/X_test.parquet')

# Also instantiate the target column
label = 'price'



print(train_dataset.info())

In [None]:
# Change path respective to the dataset you are testing:
#  * Bike_Sharing_Demand (361099) - label: 'count'
#  * Brazilian Houses (361098) - label: 'total_(BRL)'
#  * y_prop_4_1 (361092) - label: 'oz252'
# Because the exam dataset is ~20,000 entries, we should sample around this as well maybe?

# Set the base path for the dataset
base_path = '../../data/361099'

# Initialize variables for training data
X_train = None
y_train = None

# Loop through each fold of the dataset
for fold_number in range(1, 11):
    # Read the X_train and y_train data for the current fold
    x_fold = pd.read_parquet(f'{base_path}/{fold_number}/X_train.parquet')
    y_fold = pd.read_parquet(f'{base_path}/{fold_number}/y_train.parquet')

    # Concatenate the data to the existing training data
    if X_train is None:
        X_train = x_fold
        y_train = y_fold
    else:
        X_train = pd.concat([X_train, x_fold])
        y_train = pd.concat([y_train, y_fold])

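# Sample the training data to around 20,000 entries.
# Reset the index first: the fold files may repeat index labels, and a
# unique index lets us pick the matching target rows by label.
X_train = X_train.reset_index(drop=True).sample(n=20000, random_state=random_seed)
y_train = y_train.reset_index(drop=True).loc[X_train.index]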

# Create a concatenated dataset for gluon
train_dataset = pd.concat([X_train, y_train], axis=1)

# Initialize variable for test data
test = None

# Loop through each fold of the dataset
for fold_number in range(1, 11):
    # Read the X_test and y_test data for the current fold
    x_fold = pd.read_parquet(f'{base_path}/{fold_number}/X_test.parquet')
    y_fold = pd.read_parquet(f'{base_path}/{fold_number}/y_test.parquet')

    # Concatenate the data to the existing test data
    concat_fold = pd.concat([x_fold, y_fold], axis=1)
    if test is None:
        test = concat_fold
    else:
        test = pd.concat([test, concat_fold])

# Set the label for the dataset
label = 'count'

# Print the information of the training dataset
print(train_dataset.info())
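
The sampling step above only works if features and targets are drawn at the same row positions. A minimal sketch of that idea in plain Python, with toy lists standing in for the parquet data; the helper `sample_aligned` is hypothetical and not part of pandas:

```python
import random


def sample_aligned(features, targets, n, seed):
    """Draw n rows without replacement, using the same positions for both lists."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    rng = random.Random(seed)
    # Draw the row positions once and reuse them for features and targets
    positions = rng.sample(range(len(features)), n)
    return [features[i] for i in positions], [targets[i] for i in positions]


# Toy data: the target is always ten times the single feature value
features = [[i] for i in range(100)]
targets = [10 * i for i in range(100)]

X_sample, y_sample = sample_aligned(features, targets, n=20, seed=42)
```

Drawing the positions once removes any reliance on two separate sampling calls happening to agree.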

## Run the Auto-sklearn cell below first, then switch away from the conda kernel and run the notebook again, skipping that cell

# AutoSKLearn Training

Auto-sklearn automates model selection and hyperparameter tuning to optimize machine learning pipelines. The predictions on the exam set are saved.

In [None]:
# Import the autosklearn.regression module
import autosklearn.regression

# Fit the AutoSklearnRegressor model with the given time budget.
# The model gets its own name so that it does not shadow the autosklearn module.
autosklearn_model = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=time, n_jobs=-1).fit(X_train, y_train)

# Make predictions on the exam dataset
autosklearn_pred = autosklearn_model.predict(exam_X_values)

# Save the predictions to a pickle file
with open('autosklearn_pred_exam.pkl', 'wb') as f:
    pickle.dump(autosklearn_pred, f)
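
Because the predictions are produced under one kernel and read back under another, they travel through a pickle file. A minimal round-trip sketch, assuming a plain list of made-up values and a hypothetical file name in a temporary directory:

```python
import os
import pickle
import tempfile

# Illustrative values only, not real model output
predictions = [1.5, 2.5, 3.5]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_pred.pkl")  # hypothetical file name

    # Write the predictions in binary mode
    with open(path, "wb") as f:
        pickle.dump(predictions, f)

    # Read them back, as a later cell or another kernel would
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.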

## Switch kernel to python default before running this
### Need to install all required packages and run imports and variables again

Install all required packages.

In [None]:
!pip install pandas numpy scikit-learn scikit-optimize autogluon flaml tpot mljar-supervised

In [None]:
from __future__ import annotations  # must be the first statement in the cell

import os
import pickle
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch

from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder

random_seed = 42
time = 45

# FLAML Training

Train a FLAML model, which uses efficient hyperparameter optimization to search over a fixed list of learners within the time budget. The predictions of the best model on the exam set are saved.

In [None]:
from flaml import AutoML

# Instantiate the AutoML object
flaml = AutoML()

# Define the list of learners to be used
learners = ['lgbm', 'rf', 'catboost', 'extra_tree', 'kneighbor']

# Fit the AutoML model
flaml.fit(
    np.array(X_train),  # Training data features
    np.ravel(y_train),  # Training data labels, flattened to a 1-D array
    task="regression",  # Task type is regression
    time_budget=time,  # Time budget for the AutoML search
    estimator_list=learners,  # List of learners to be used
    verbose=1  # Set verbosity level to 1 for progress updates
)

# Make predictions on the exam dataset
flaml_pred = flaml.predict(exam_X_values)

# Save the predictions to a pickle file
with open('flaml_pred_exam.pkl', 'wb') as f:
    pickle.dump(flaml_pred, f)
#flaml_score = r2_score(test[label], flaml_pred)


# AutoGluon Training

Train an AutoGluon model. AutoGluon is an AutoML library designed to simplify training and optimizing machine learning models. The predictions are saved.

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor

# Fit the AutoGluon model with the given time budget
gluon = TabularPredictor(label=label, problem_type='regression', eval_metric='r2').fit(train_dataset, time_limit=time, presets='medium_quality', verbosity=1)

# Make predictions on the exam dataset
gluon_pred = gluon.predict(exam_X_values)

# Save the predictions to a pickle file
with open('gluon_predictions_EXAM.pkl', 'wb') as f:
    pickle.dump(gluon_pred, f)

# TPOT Training
Train a TPOT model which uses genetic algorithms to optimize machine learning pipelines. The best pipeline is saved and then used for making predictions. TPOT focuses on pipeline optimization and offers a high degree of automation.

In [None]:
# Import the necessary libraries
from tpot import TPOTRegressor
import joblib

# Instantiate the TPOTRegressor
tpot = TPOTRegressor(
    random_state=random_seed,
    n_jobs=-1,              # Utilize all CPU cores
    max_time_mins=1,        # Max total time in minutes
    max_eval_time_mins=0.2  # Max time per pipeline in minutes
)

# Fit the TPOTRegressor on the training data
tpot.fit(X_train, y_train)

# Save the fitted pipeline
joblib.dump(tpot.fitted_pipeline_, "tpot_pipeline.joblib")

# At prediction time, load the saved pipeline
loaded_pipeline = joblib.load("tpot_pipeline.joblib")

# Make predictions using the loaded pipeline
tpot_predictions = loaded_pipeline.predict(test.drop(columns=[label]))

# Calculate the R2 score of TPOT predictions
tpot_score = r2_score(test[label], tpot_predictions)

# Print the TPOT R2 score
print("TPOT R2 score:", tpot_score)

TPOT R2 score: 0.8568653401098989
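
The R2 score reported above is the coefficient of determination, 1 - SS_res / SS_tot. A minimal pure-Python sketch on made-up numbers (the helper `r2` is a stand-in for `sklearn.metrics.r2_score`, not a replacement for it):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2(y_true, y_true)                 # exact predictions
baseline = r2(y_true, [6.0, 6.0, 6.0, 6.0])  # always predicting the mean
worse = r2(y_true, [9.0, 7.0, 5.0, 3.0])     # worse than predicting the mean
```

Here `perfect` is 1.0, `baseline` is 0.0, and `worse` is -3.0, which shows that the score has no lower bound and can be negative.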


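# MLJAR Training
Train an MLJAR model on the full training dataset. MLJAR performs automated machine learning with a focus on an easy-to-use interface and interpretability. The fitted model and its predictions on the exam set are saved. Note that this run uses a fixed time limit of 90 seconds rather than the `time` parameter.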

In [None]:
import pickle
from supervised import AutoML

# Prepare data
X_train = train_dataset.drop(columns=[label])
y_train = train_dataset[label]

# Initialize AutoML for regression
mljar_automl_regressor = AutoML(
    mode="Compete",
    total_time_limit=90,
    random_state=random_seed,
    n_jobs=-1
)

# Fit AutoML
mljar_automl_regressor.fit(X_train, y_train)

# Make predictions on the exam dataset
mljar_predict = mljar_automl_regressor.predict(exam_X_values)

# Save the model as a pickle file
with open('mljar_model_EXAM.pkl', 'wb') as f:
    pickle.dump(mljar_automl_regressor, f)

# Save the predictions to a pickle file
with open('mljar_predictions_EXAM.pkl', 'wb') as f:
    pickle.dump(mljar_predict, f)

# Print completion messages
print("Model training completed and model saved as 'mljar_model_EXAM.pkl'.")
print("Predictions saved as 'mljar_predictions_EXAM.pkl'.")


# Load all predictions

In [None]:
# Load the mljar pred
with open('mljar_predictions_EXAM.pkl', 'rb') as f:
    mljar_pred = pickle.load(f)

#mljar_score = r2_score(test[label][0:len(mljar_pred)], mljar_pred)
print(mljar_pred, len(mljar_pred))

[13.67054082 12.99571632 12.26738305 ... 12.68263049 12.77947423
 12.12948934] 2162


In [None]:
# Load the FLAML predictions
with open('flaml_pred_exam.pkl', 'rb') as f:
    flaml_pred = pickle.load(f)

#flaml_score = r2_score(test[label], flaml_pred)
print(flaml_pred)

0.980278110859552


In [None]:
# Load the Auto-sklearn predictions
with open('autosklearn_pred_exam.pkl', 'rb') as f:
    autosklearn_pred = pickle.load(f)

#autosklearn_score = r2_score(test[label], autosklearn_pred)
print(autosklearn_pred, len(autosklearn_pred))

[13.77257299 12.98191023 12.3235178  ... 12.63088846 12.71451759
 12.15282154] 2162


In [None]:
# Load the gluon predictions
with open('gluon_predictions_EXAM.pkl', 'rb') as f:
    gluon_pred = pickle.load(f)

#gluon_score = r2_score(test[label], gluon_pred)
print(gluon_pred, len(gluon_pred))

# Optimize a weighted ensemble prediction

### TPOT is excluded due to poor performance measures

In [None]:
import numpy as np
from sklearn.metrics import r2_score
from skopt import gp_minimize
from skopt.space import Real

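# R2 scores of the individual models on the held-out test set.
# NOTE: this cell assumes `test` is loaded (practice datasets) and that the
# *_pred variables hold predictions for test.drop(columns=[label]),
# not for the exam set.
flaml_score = r2_score(test[label], flaml_pred)
autosklearn_score = r2_score(test[label], autosklearn_pred)
gluon_score = r2_score(test[label], gluon_pred)
mljar_score = r2_score(test[label], mljar_pred)

# Initial R2 scores, clipped at zero so that no weight is negative
r2_flaml = max(flaml_score, 0.0)
r2_autosklearn = max(autosklearn_score, 0.0)
r2_gluon = max(gluon_score, 0.0)
r2_mljar = max(mljar_score, 0.0)

# Normalize R2 scores to use as initial weights
total_r2 = r2_flaml + r2_autosklearn + r2_gluon + r2_mljar
initial_weights = [r2_flaml/total_r2, r2_autosklearn/total_r2, r2_gluon/total_r2, r2_mljar/total_r2]

def objective(weights):
    """
    Objective function for optimization.
    Calculates the ensemble prediction using the given weights and returns the negative R2 score.
    """
    w1, w2, w3, w4 = weights
    ensemble_pred = (w1 * flaml_pred + w2 * autosklearn_pred + w3 * gluon_pred +
                     w4 * mljar_pred) / (w1 + w2 + w3 + w4)
    return -r2_score(test[label], ensemble_pred)

# Define the search space: each weight may move up to 0.3 away from its initial value
space = [Real(max(0, w-0.3), min(1, w+0.3), name=f'w{i+1}') for i, w in enumerate(initial_weights)]

# Run the optimization, starting from the initial weights
res = gp_minimize(objective, space, n_calls=50, random_state=random_seed, x0=initial_weights)

# Normalize the best weights so that they sum to 1
best_weights = res.x
total = sum(best_weights)
normalized_weights = [w/total for w in best_weights]
w1, w2, w3, w4 = normalized_weights

# The normalized weights already sum to 1, so no further division is needed
optimal_ensemble = (w1 * flaml_pred + w2 * autosklearn_pred + w3 * gluon_pred +
                    w4 * mljar_pred)
optimal_r2 = r2_score(test[label], optimal_ensemble)

print("Normalized weights:", normalized_weights)
print("Ensemble R2 score:", optimal_r2)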
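
The weighting scheme above can be illustrated on toy data. The sketch below is an assumption-laden miniature: two made-up prediction lists instead of four models, a hand-written `r2` helper instead of `sklearn.metrics.r2_score`, and a coarse grid search standing in for `gp_minimize`. The helper names are hypothetical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def weighted_ensemble(predictions, weights):
    """Weighted average of several prediction lists, divided by the weight sum."""
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(len(predictions[0]))
    ]


y_true = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.5, 2.5, 3.5, 4.5]  # toy model that is always 0.5 too high
pred_b = [0.5, 1.5, 2.5, 3.5]  # toy model that is always 0.5 too low

# Initial weights: R2 scores clipped at zero, then normalized to sum to 1
scores = [max(r2(y_true, p), 0.0) for p in (pred_a, pred_b)]
initial_weights = [s / sum(scores) for s in scores]

# Coarse grid search over weight pairs (a stand-in for gp_minimize)
candidates = [(w / 10, 1 - w / 10) for w in range(11)]
best_weights = max(
    candidates,
    key=lambda w: r2(y_true, weighted_ensemble([pred_a, pred_b], w)),
)
best_r2 = r2(y_true, weighted_ensemble([pred_a, pred_b], best_weights))
```

Each toy model scores an R2 of 0.8 on its own, while the equally weighted ensemble cancels the opposite errors and reaches 1.0. Real models rarely err this symmetrically, so the gain is usually much smaller.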


# Make the final ensemble-prediction

In [None]:
# Predictions of the final exam X dataset

# Ensemble to use from testing: AutoSklearn: 0.067, Gluon: 0.584, TPOT: 0.0000, MLJAR: 0.34891

final_exam_pred = (0.067 * autosklearn_pred + 0.584 * gluon_pred + 0.349 * mljar_pred)

# Convert to a numpy array (works whether the sum is a pandas Series or an ndarray)
predictions_array = np.asarray(final_exam_pred)

np.save('final_test_preds.npy', predictions_array)

loaded_preds = np.load('final_test_preds.npy')
print(loaded_preds)

In [None]:
# Compare it with a longer gluon run (benchmark)

gluon_benchmark = TabularPredictor(label=label, problem_type='regression', eval_metric='r2').fit(train_dataset, time_limit=180, presets='medium_quality', verbosity=1)
benchmark_pred = gluon_benchmark.predict(test.drop(columns=[label]))
eval_benchmark = gluon_benchmark.evaluate(test)

print('A 180 sec autogluon run on same dataset.', eval_benchmark['r2'])
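
One way to read the 180-second benchmark limit is as the summed time budgets of the three tools that carry weight in the final ensemble. That reading is an assumption; the notebook does not state it. The limits below are copied from the training cells (`time = 45` for Auto-sklearn and AutoGluon, `total_time_limit=90` for MLJAR):

```python
# Time limits in seconds, taken from the training cells of this notebook
budgets = {"auto-sklearn": 45, "autogluon": 45, "mljar": 90}

# Combined training time of the tools used in the final ensemble
ensemble_total = sum(budgets.values())

# Time limit given to the single AutoGluon benchmark run
benchmark_limit = 180

print(f"Ensemble budget: {ensemble_total} s, benchmark budget: {benchmark_limit} s")
```

If `time` or the MLJAR limit is changed, the benchmark limit would need to change with it for the comparison to stay on an equal total budget.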