# Automatic ML



## Hyperparameter tuning

| Library          | Method                     | Key Features                                                                 |
|------------------|----------------------------|------------------------------------------------------------------------------|
| Hyperopt         | Bayesian optimization (TPE) | Uses Tree-structured Parzen Estimator (TPE) to model promising parameters.    |
| Optuna           | Bayesian optimization (TPE/CMA-ES) | Supports pruning, parallel trials, and dynamic search spaces.           |
| Scikit-Optimize  | Bayesian optimization (GP) | Uses Gaussian Processes (GP) as a surrogate model.                            |

## CMA-ES (Covariance Matrix Adaptation Evolution Strategy)

1. Samples Points: At each iteration, it generates a population of candidate solutions (points) from a multivariate Gaussian distribution (defined by a mean and covariance matrix).
2. Evaluates Fitness: Ranks the points by their objective value.
3. Updates Distribution:
    * Mean: Moves toward better-performing points.
    * Covariance Matrix: Adapts to "stretch" the distribution toward promising directions (learning correlations between variables).
    * Step Size: Adjusts globally (large steps early, smaller steps as it converges).

4. Repeat
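
The loop above can be sketched in plain Python. This is a deliberately simplified Gaussian evolution strategy, not full CMA-ES: it re-estimates the mean and covariance from the best samples and leaves out evolution paths, weighted recombination, and explicit step-size control. The function names and the test objective `bowl` are made up for the example.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor of a small symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / low[j][j]
    return low


def gaussian_es_sketch(fitness, mean, sigma=1.0, popsize=20, n_elite=5, iters=80, seed=0):
    rng = random.Random(seed)
    n = len(mean)
    cov = [[sigma**2 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        low = cholesky(cov)
        # 1. Sample a population from N(mean, cov).
        population = []
        for _ in range(popsize):
            z = [rng.gauss(0.0, 1.0) for _ in range(n)]
            population.append(
                [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
            )
        # 2. Rank by fitness (lower is better) and keep the elite.
        elite = sorted(population, key=fitness)[:n_elite]
        # 3. Update the covariance from elite steps relative to the OLD mean,
        #    so successful directions get stretched, then move the mean.
        cov = [[0.0] * n for _ in range(n)]
        for point in elite:
            d = [point[i] - mean[i] for i in range(n)]
            for i in range(n):
                for j in range(n):
                    cov[i][j] += d[i] * d[j] / n_elite
        for i in range(n):
            cov[i][i] += 1e-12  # jitter keeps the matrix positive definite
        mean = [sum(point[i] for point in elite) / n_elite for i in range(n)]
    return mean


def bowl(p):
    # Correlated quadratic with its minimum at (3, -1).
    a, b = p[0] - 3.0, p[1] + 1.0
    return a * a + a * b + b * b


best = gaussian_es_sketch(bowl, mean=[0.0, 0.0])
print("estimated minimum:", [round(v, 3) for v in best])
```

For real tuning, Optuna's `CmaEsSampler` is a better choice than this sketch.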
 
## TPE (Tree-structured Parzen Estimator)
> Remember the class on kernel density estimation?

1. Samples Points Randomly (First Phase).
2. Builds two groups: "Good" (e.g. the 10% of points with the lowest loss) and "Bad" (all remaining points).
3. Estimates a probability density over the parameters for each group: $p(x \mid \text{good})$ and $p(x \mid \text{bad})$.
4. Picks the next point by maximizing the "Promisingness" Ratio: $\frac{p(x \mid \text{good})}{p(x \mid \text{bad})}$
5. Repeat
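
A toy one-dimensional version of these steps, using only the standard library. It assumes a fixed kernel bandwidth and a single continuous parameter; Hyperopt and Optuna use adaptive bandwidths, priors, and tree-structured spaces. All names here (`kde`, `tpe_sketch`) are hypothetical.

```python
import math
import random


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from 1-D points."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_sketch(loss, low, high, n_trials=60, n_startup=10, gamma=0.2,
               n_candidates=24, seed=0):
    rng = random.Random(seed)
    bandwidth = (high - low) / 20.0  # fixed bandwidth: a simplification
    history = []  # (loss, x) pairs

    for trial in range(n_trials):
        if trial < n_startup:
            # 1. First phase: plain random sampling.
            x = rng.uniform(low, high)
        else:
            # 2. Split the history into "good" (best gamma fraction) and "bad" (the rest).
            ranked = [x_i for _, x_i in sorted(history)]
            n_good = max(1, math.ceil(gamma * len(ranked)))
            good, bad = ranked[:n_good], ranked[n_good:]
            # 3. Estimate one density per group.
            good_density, bad_density = kde(good, bandwidth), kde(bad, bandwidth)
            # 4. Draw candidates from the "good" density (pick a kernel centre,
            #    add Gaussian noise) and keep the one with the highest ratio.
            candidates = [
                min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                for _ in range(n_candidates)
            ]
            x = max(candidates, key=lambda c: good_density(c) / (bad_density(c) + 1e-12))
        history.append((loss(x), x))

    return min(history)  # (best loss, best x)


best_loss, best_x = tpe_sketch(lambda x: (x - 2.0) ** 2, low=-10.0, high=10.0)
print(f"best x = {best_x:.3f}, loss = {best_loss:.4f}")
```

Candidates are drawn from the "good" density rather than from the whole range because the ratio can only be large where that density is non-negligible.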

In [None]:
%load_ext jupyter_black

In [None]:
import optuna
import lightgbm as lgb
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn import datasets

from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


from optuna.visualization import (
    plot_optimization_history,
    plot_param_importances,
    plot_slice,
    plot_parallel_coordinate,
)

# plt.rcParams["figure.figsize"] = (12, 6)

In [None]:
X, y = datasets.fetch_california_housing(return_X_y=True, as_frame=True)
X = X[:5000]
y = y[:5000]
feature_names = X.columns
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Copy before adding the target so the feature frame X stays target-free
df_data = X.copy()
df_data["target"] = y
X.head(2)

In [None]:
def objective(trial):
    params = {
        "objective": "regression",
        "metric": "rmse",
        "boosting_type": "gbdt",
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 100),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "subsample_freq": 1,  # row subsampling has no effect while the bagging frequency is 0
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-9, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-9, 10.0, log=True),
        "verbosity": -1,
    }

    model = lgb.LGBMRegressor(**params, n_estimators=1000)
    score = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    return score

In [None]:
study = optuna.create_study(direction="maximize", storage="sqlite:///optuna.db")
study.optimize(objective, n_trials=20, show_progress_bar=True)
print("Best trial score (negative RMSE):", study.best_value)
print("Best params:", study.best_params)

In [None]:
plot_optimization_history(study)

In [None]:
plot_slice(study)

In [None]:
plot_parallel_coordinate(study)

In [None]:
# Optuna-dashboard
# !pip install optuna-dashboard
# optuna-dashboard sqlite:///optuna.db

## AutoML
After participating in several competitions, you may have noticed that you keep repeating the same steps: reading the dataset, filtering, converting data types, splitting into train-val-test, initializing a model, training on the train set (or even on KFold splits), tuning hyperparameters, generating predictions, saving results... These tasks are so routine that they should definitely be automated.


### H2O

In [None]:
import h2o
from h2o.automl import H2OAutoML

In [None]:
h2o.init()

In [None]:
data = h2o.H2OFrame(df_data)
aml = H2OAutoML(max_models=10, seed=42, max_runtime_secs=120)
aml.train(x=data.columns[:-1], y="target", training_frame=data)

In [None]:
lb = aml.leaderboard
print(lb.head())

### LightAutoML


**LightAutoML**, a Sber project, was created for exactly this purpose. Here's what it offers:  
- **Automatic hyperparameter tuning and data processing**  
- **Automatic type detection and feature selection**  
- **Efficient time management**  
- **Automated report generation**  
- **A modular, user-friendly framework for building custom pipelines**  

To give you a taste of its power:  
[**0.77 accuracy on Titanic with just 5 lines of code (excluding imports)**](https://www.kaggle.com/code/alexryzhkov/lightautoml-extreme-short-titanic-solution)  


In [None]:
# !pip install lightautoml

[Cannot run in Colab: python3.10 not supported](https://github.com/sb-ai-lab/LightAutoML/issues/87#issuecomment-1527406824)

In [None]:
import os
import torch
import requests

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# LightAutoML presets, task and report generation
from lightautoml.report.report_deco import ReportDeco
from lightautoml.tasks import Task
from lightautoml.automl.presets.tabular_presets import (
    TabularAutoML,
    TabularUtilizedAutoML,
)

In [None]:
N_THREADS = 6
N_FOLDS = 5

RANDOM_STATE = 261
TEST_SIZE = 0.2
TIMEOUT = 300

TARGET_NAME = "TARGET"

#### Data

In [None]:
DATASET_DIR = "./data/"
DATASET_NAME = "sampled_app_train.csv"
DATASET_FULLNAME = os.path.join(DATASET_DIR, DATASET_NAME)
DATASET_URL = "https://raw.githubusercontent.com/AILab-MLTools/LightAutoML/master/examples/data/sampled_app_train.csv"

In [None]:
os.makedirs(DATASET_DIR, exist_ok=True)

In [None]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

In [None]:
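if not os.path.exists(DATASET_FULLNAME):
    # Download the sample dataset once and cache it locally
    os.makedirs(DATASET_DIR, exist_ok=True)
    response = requests.get(DATASET_URL, timeout=60)
    response.raise_for_status()  # fail loudly instead of saving an error page as CSV

    with open(DATASET_FULLNAME, "w", encoding="utf-8") as output:
        output.write(response.text)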

In [None]:
data = pd.read_csv(DATASET_FULLNAME)
data.head()

In [None]:
data.shape

In [None]:
train_data, test_data = train_test_split(
    data, test_size=TEST_SIZE, stratify=data[TARGET_NAME], random_state=RANDOM_STATE
)

print(
    f"Data is split. Part sizes: train_data = {train_data.shape}, test_data = {test_data.shape}"
)

train_data.head()

The following task types are available:

* 'binary' - for binary classification.
* 'reg' - for regression.
* 'multiclass' - for multiclass classification.
* 'multi:reg' - for multiple regression.
* 'multilabel' - for multi-label classification.

More [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task)

In [None]:
task = Task("binary")

The library infers the feature types (roles) automatically, but they can be overridden:

* 'numeric' - numerical feature
* 'category' - categorical feature
* 'text' - text data
* 'datetime' - features with date and time
* 'date' - features with date only
* 'group' - features by which the data can be divided into groups and which can be taken into account for group k-fold validation (so the same group is not represented in both testing and training sets)
* 'drop' - features to drop, they will not be used in model building
* 'weights' - object weights for the loss and metric
* 'path' - image file paths (for CV tasks)
* 'treatment' - object group in uplift modelling tasks: treatment or control
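
A role override is a plain dictionary from role name to column name(s). The sketch below adds a small sanity check before the dictionary is passed on. `check_roles` is a hypothetical helper, not part of LightAutoML, and `AMT_CREDIT` and `NAME_CONTRACT_TYPE` are assumed column names used only for illustration.

```python
KNOWN_ROLES = {
    "target", "numeric", "category", "text", "datetime", "date",
    "group", "drop", "weights", "path", "treatment",
}


def check_roles(roles, columns):
    """Validate a role-name -> column(s) mapping before handing it to the library."""
    seen = {}
    for role, cols in roles.items():
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown role: {role!r}")
        # A role may map to a single column name or to a list of names.
        for col in [cols] if isinstance(cols, str) else cols:
            if col not in columns:
                raise ValueError(f"column {col!r} (role {role!r}) is not in the data")
            if col in seen:
                raise ValueError(f"column {col!r} has two roles: {seen[col]!r} and {role!r}")
            seen[col] = role
    return roles


# Assumed column names, for illustration only.
example_columns = ["SK_ID_CURR", "TARGET", "AMT_CREDIT", "NAME_CONTRACT_TYPE"]
example_roles = check_roles(
    {
        "target": "TARGET",
        "drop": ["SK_ID_CURR"],
        "numeric": ["AMT_CREDIT"],
        "category": ["NAME_CONTRACT_TYPE"],
    },
    example_columns,
)
print(example_roles)
```

Columns that are not listed keep their automatically inferred role.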

In [None]:
roles = {"target": TARGET_NAME, "drop": ["SK_ID_CURR"]}

Pipeline:
![img](https://raw.githubusercontent.com/sb-ai-lab/LightAutoML/master/imgs/tutorial_1_laml_big.png)

Sample stage:
![](https://raw.githubusercontent.com/sb-ai-lab/LightAutoML/ac3c1b38873437eb74354fb44e68a449a0200aa6/imgs/tutorial_blackbox_pipeline.png)

In [None]:
automl = TabularAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=N_THREADS,
    reader_params={"n_jobs": N_THREADS, "cv": N_FOLDS, "random_state": RANDOM_STATE},
)

In [None]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles=roles, verbose=1)

In [None]:
%%time

test_predictions = automl.predict(test_data)
print(
    f"Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}"
)

In [None]:
print(
    f"OOF score: {roc_auc_score(train_data[TARGET_NAME].values, out_of_fold_predictions.data[:, 0])}"
)
print(
    f"HOLDOUT score: {roc_auc_score(test_data[TARGET_NAME].values, test_predictions.data[:, 0])}"
)

In [None]:
print(automl.create_model_str_desc())

Reports:

In [None]:
RD = ReportDeco(output_path="tabularAutoML_model_report")

automl_rd = RD(
    TabularAutoML(
        task=task,
        timeout=60,
        cpu_limit=N_THREADS,
        reader_params={
            "n_jobs": N_THREADS,
            "cv": N_FOLDS,
            "random_state": RANDOM_STATE,
        },
    )
)

In [None]:
%%time
out_of_fold_predictions = automl_rd.fit_predict(train_data, roles=roles, verbose=1)

In [None]:
! ls tabularAutoML_model_report

In [None]:
%%time

test_predictions = automl_rd.predict(test_data)
print(
    f"Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}"
)

In [None]:
print(
    f"OOF score: {roc_auc_score(train_data[TARGET_NAME].values, out_of_fold_predictions.data[:, 0])}"
)
print(
    f"HOLDOUT score: {roc_auc_score(test_data[TARGET_NAME].values, test_predictions.data[:, 0])}"
)