# III.b Model Selection: The big ugly Hammer

After achieving only unsatisfactory F1 scores with deep neural networks, it is time to bring out the big ugly hammer. Ensembles of classical models are a very powerful way to boost model performance, and many ML competitions have been won by a well-crafted ensemble of ML models. Let's see if we can do the same for this problem.

In [1]:
import os
import pandas as pd
from dotenv import load_dotenv
from pathlib import Path
import ray
from ray import tune, train
from ray.tune.schedulers import ASHAScheduler
from evaluators import evaluate_model_ray
from ray.tune.search.hebo import HEBOSearch
from ray.train import CheckpointConfig


load_dotenv()




True

## 1. Fetching the Data

As discovered in the feature engineering section of this example workspace, most transformations of the dataset do not positively affect the RandomForest and similar classifiers. Technically, polynomial features sharply improve the performance of linear models, at the cost of longer compute time. However, due to compute and time constraints, we omit them at this stage.

In [2]:
# Load the encoded features and the targets from the directory named by the DATA environment variable
data_dir = os.getenv("DATA")
encoded_df = pd.read_csv(Path(data_dir) / "encoded_df.csv")
targets_df = pd.read_csv(Path(data_dir) / "target.csv")
encoded_df.head()

Unnamed: 0,id,Geschlecht,Alter,Fahrerlaubnis,Vorversicherung,Alter_Fzg,Vorschaden,Jahresbeitrag,Kundentreue,Regional_Code_0,...,Vertriebskanal_152.0,Vertriebskanal_153.0,Vertriebskanal_154.0,Vertriebskanal_155.0,Vertriebskanal_156.0,Vertriebskanal_157.0,Vertriebskanal_158.0,Vertriebskanal_159.0,Vertriebskanal_160.0,Vertriebskanal_163.0
0,1.0,1.0,44.0,1.0,0.0,2.0,1.0,40454.0,217.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,1.0,76.0,1.0,0.0,0.0,0.0,33536.0,183.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.0,1.0,47.0,1.0,0.0,2.0,1.0,38294.0,27.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,1.0,21.0,1.0,1.0,1.0,0.0,28619.0,203.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5.0,0.0,29.0,1.0,1.0,1.0,0.0,27496.0,39.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
sub_encoded_df = encoded_df.sample(n=10000, random_state=42)

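# Scale every column except "id". The labels are taken from the same frame that
# is scaled: Index.difference() returns a sorted index and would mislabel columns.
features_df = sub_encoded_df.drop(columns="id")
normal_df = pd.DataFrame(scaler.fit_transform(features_df),
                         columns=features_df.columns)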

In [4]:
from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial features (degree 2 is a common starting point).
# poly_df is built for reference only; the ensemble below is trained on normal_df.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(sub_encoded_df)
cols = poly.get_feature_names_out(sub_encoded_df.columns)
poly_df = pd.DataFrame(poly.transform(sub_encoded_df.copy()), columns=cols)
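
To make the compute cost of the degree-2 expansion concrete, the number of generated columns can be counted in closed form. The sketch below is only an illustration: the helper name `n_polynomial_features` is hypothetical, and the input width of 217 is an assumption inferred from this notebook (`X` has 216 columns after dropping `id`, and the expansion above keeps `id`).

```python
from math import comb


def n_polynomial_features(n_inputs: int, degree: int, include_bias: bool = False) -> int:
    """Count the monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)  # this count includes the constant term
    return total if include_bias else total - 1


# Assumed input width: 216 feature columns plus "id".
print(n_polynomial_features(217, 2))  # 23870
```

With 10,000 sampled rows this is already a dense matrix of roughly 239 million values, which is why the polynomial features are left out of the tuning run.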

In [5]:
ids = sub_encoded_df.id
y = targets_df.set_index("id").loc[ids]["Interesse"].values
X = normal_df.values
X.shape

(10000, 216)

## 2. Tuning the Voting Ensemble

### Decision Tree Classifier
Builds a model in the form of a tree to break down a dataset into smaller subsets.
| Parameter                | Range/Choices               |
|--------------------------|-----------------------------|
| use_dt                   | True, False                 |
| dt.max_depth             | 5, 10, 15, 20, None         |
| dt.min_samples_split     | 2, 5, 10                    |

### Support Vector Machine Classifier
A powerful classifier that finds the decision boundary with the largest margin between classes.
| Parameter         | Range/Choices                                |
|-------------------|----------------------------------------------|
| use_svm           | True, False                                  |
| svm.C             | loguniform(0.1, 10)                          |
| svm.kernel        | 'linear', 'poly', 'rbf', 'sigmoid'           |
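
Several parameters in these tables are drawn from a log-uniform range. As a rough illustration of what that means (a sketch of the distribution itself, not of Ray Tune's internal implementation), the hypothetical helper below samples values whose logarithm is uniform, so each order of magnitude is equally likely.

```python
import math
import random


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
draws = [sample_loguniform(0.1, 10, rng) for _ in range(10_000)]

# For the range (0.1, 10), about half of the draws should fall below 1.0,
# because 1.0 is the midpoint on a logarithmic scale.
share_below_one = sum(d < 1.0 for d in draws) / len(draws)
print(f"share of draws below 1.0: {share_below_one:.2f}")
```

A plain uniform draw over the same range would put only about 9% of the samples below 1.0, which would leave small values of `C` almost unexplored.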

### Random Forest Classifier
Builds multiple decision trees to improve accuracy and stability.
| Parameter               | Range/Choices               |
|-------------------------|-----------------------------|
| use_rf                  | True, False                 |
| rf.n_estimators         | 10, 50, 100, 200            |
| rf.max_depth            | 5, 10, 15, 20, None         |
| rf.min_samples_split    | 2, 5, 10                    |

### Gaussian Naive Bayes
Supports continuous data and assumes independence among predictors.
| Parameter         | Range/Choices                  |
|-------------------|--------------------------------|
| use_gnb           | True, False                    |
| gnb.var_smoothing | loguniform(1e-10, 1e-2)        |

### Logistic Regression
Models a binary dependent variable using a logistic function.
| Parameter      | Range/Choices                                        |
|----------------|------------------------------------------------------|
| use_lr         | True, False                                          |
| lr.C           | loguniform(0.1, 10)                                  |
| lr.penalty     | 'l2', 'l1', 'elasticnet', None                       |
| lr.solver      | 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'     |
| lr.l1_ratio    | loguniform(0.1, 1.0)                                 |
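
Not every solver in this table accepts every penalty, so sampled combinations need to be reconciled before the model is built. The sketch below shows one way to do that, assuming the solver/penalty support listed in the scikit-learn `LogisticRegression` documentation; the names `SUPPORTED_PENALTIES` and `resolve_lr_params` are hypothetical, and falling back to `'l2'` is a design choice for illustration, not the only option.

```python
# Penalties each solver accepts, per the scikit-learn LogisticRegression docs.
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", None},
    "newton-cg": {"l2", None},
    "sag": {"l2", None},
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet", None},
}


def resolve_lr_params(solver, penalty, l1_ratio):
    """Return a (penalty, l1_ratio) pair that `solver` accepts.

    Unsupported penalties fall back to 'l2'; l1_ratio is kept only for
    the elastic-net penalty, the one case where it is used.
    """
    if penalty not in SUPPORTED_PENALTIES[solver]:
        penalty = "l2"
    if penalty != "elasticnet":
        l1_ratio = None
    return penalty, l1_ratio


print(resolve_lr_params("sag", "l1", 0.78))           # ('l2', None)
print(resolve_lr_params("saga", "elasticnet", 0.35))  # ('elasticnet', 0.35)
```

One consequence of such a fallback is that the penalty recorded in a trial's config may differ from the penalty the model was actually trained with.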

### AdaBoost Classifier
Improves performance by combining multiple weak models.
| Parameter               | Range/Choices               |
|-------------------------|-----------------------------|
| use_adaboost            | True, False                 |
| adaboost.n_estimators   | 50, 100, 200                |
| adaboost.learning_rate  | loguniform(0.01, 1)         |


In [6]:
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier

In [8]:
X_id = ray.put(X)
y_id = ray.put(y)


def train_ensemble(config):
    X = ray.get(X_id)
    y = ray.get(y_id)
    estimators = []

    # Decision Tree
    if config['use_dt']:
        dt = DecisionTreeClassifier(max_depth=config['dt.max_depth'],
                                    min_samples_split=config['dt.min_samples_split'])
        estimators.append(('dt', dt))

    # SVM
    if config['use_svm']:
        svm = SVC(C=config['svm.C'], kernel=config['svm.kernel'], probability=True)
        estimators.append(('svm', svm))

    # Random Forest
    if config['use_rf']:
        rf = RandomForestClassifier(n_estimators=config['rf.n_estimators'],
                                    max_depth=config['rf.max_depth'],
                                    min_samples_split=config['rf.min_samples_split'])
        estimators.append(('rf', rf))

    # Gaussian Naive Bayes
    if config['use_gnb']:
        gnb = GaussianNB(var_smoothing=config['gnb.var_smoothing'])
        estimators.append(('gnb', gnb))

    # Logistic Regression
    if config['use_lr']:
        # https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html
        solver = config['lr.solver']
        penalty = config['lr.penalty']
        l1_ratio = config['lr.l1_ratio']
        # Not every solver supports every penalty; fall back to 'l2' otherwise.
        if solver == 'liblinear' and penalty not in ['l1', 'l2']:
            # liblinear supports only 'l1' and 'l2'
            penalty = 'l2'
        elif solver in ['lbfgs', 'newton-cg', 'sag'] and penalty not in ['l2', None]:
            penalty = 'l2'
        # l1_ratio is only used with the elastic-net penalty (supported by saga only)
        if penalty != 'elasticnet':
            l1_ratio = None
        lr = LogisticRegression(C=config['lr.C'],
                                penalty=penalty,
                                solver=solver,
                                l1_ratio=l1_ratio)
        estimators.append(('lr', lr))

    # AdaBoost
    if config['use_adaboost']:
        adaboost = AdaBoostClassifier(n_estimators=config['adaboost.n_estimators'],
                                      learning_rate=config['adaboost.learning_rate'])
        estimators.append(('adaboost', adaboost))

    # Create ensemble model
    if estimators:
        ensemble = VotingClassifier(estimators=estimators, voting='soft')
        ensemble.fit(X, y)
        # evaluate_model_ray comes from the local `evaluators` module; it is expected
        # to score the ensemble and report roc_auc, pr_auc and f1_score to Tune
        evaluate_model_ray(X, y, ensemble)
    else:
        # Handle case where no classifiers are enabled
        train.report({
            "roc_auc": 0,
            "pr_auc": 0,
            "f1_score": 0
        })


def train_model(config):
    train_ensemble(config)


search_space = {
    # Decision Tree Classifier
    "use_dt": tune.choice([True, False]),
    "dt.max_depth": tune.choice([5, 10, 15, 20, None]),
    "dt.min_samples_split": tune.choice([2, 5, 10]),

    # Support Vector Machine Classifier
    "use_svm": tune.choice([True, False]),
    "svm.C": tune.loguniform(0.1, 10),
    "svm.kernel": tune.choice(['linear', 'poly', 'rbf', 'sigmoid']),

    # Random Forest Classifier
    "use_rf": tune.choice([True, False]),
    "rf.n_estimators": tune.choice([10, 50, 100, 200]),
    "rf.max_depth": tune.choice([5, 10, 15, 20, None]),
    "rf.min_samples_split": tune.choice([2, 5, 10]),

    # Gaussian Naive Bayes
    "use_gnb": tune.choice([True, False]),
    "gnb.var_smoothing": tune.loguniform(1e-10, 1e-2),

    # Logistic Regression
    "use_lr": tune.choice([True, False]),
    "lr.C": tune.loguniform(0.1, 10),
    "lr.penalty": tune.choice(['l2', 'l1', 'elasticnet', None]),
    "lr.solver": tune.choice(['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']),
    "lr.l1_ratio": tune.loguniform(0.1, 1.0),

    # AdaBoost Classifier
    "use_adaboost": tune.choice([True, False]),
    "adaboost.n_estimators": tune.choice([50, 100, 200]),
    "adaboost.learning_rate": tune.loguniform(0.01, 1),
}


ray.init(num_cpus=8, ignore_reinit_error=True)

analysis = tune.run(train_model,
                    name="garbadge_ensemble",
                    config=search_space,
                    num_samples=300,
                    storage_path=Path(os.getenv("WORKINGDIR"), "artifacts"),
                    max_failures=300,
                    search_alg=HEBOSearch(metric="f1_score", mode="max"),
                    scheduler=ASHAScheduler(metric="f1_score", mode="max"),
                    time_budget_s=300,
                    checkpoint_config=CheckpointConfig(
                        num_to_keep=4,
                        checkpoint_score_attribute="f1_score",
                        checkpoint_score_order='max',
                    ),
                    resources_per_trial={"cpu": 1})

best_config = analysis.get_best_config(metric="f1_score", mode="max")
best_trial = analysis.get_best_trial(metric="f1_score", mode="max")

print("Best config:", best_config)
print("Best F1 Score:", best_trial.last_result["f1_score"])


2024-10-01 06:38:57,545	INFO worker.py:1619 -- Calling ray.init() again after it has already been called.
2024-10-01 06:38:57,547	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949


Current time:,2024-10-01 06:43:57
Running for:,00:05:00.20
Memory:,20.1/47.0 GiB

Trial name,status,loc,adaboost.learning_rate,adaboost.n_estimators,dt.max_depth,dt.min_samples_split,gnb.var_smoothing,lr.C,lr.l1_ratio,lr.penalty,lr.solver,rf.max_depth,rf.min_samples_split,rf.n_estimators,svm.C,svm.kernel,use_adaboost,use_dt,use_gnb,use_lr,use_rf,use_svm,iter,total time (s),roc_auc,pr_auc,f1_score
train_model_c98f5173,TERMINATED,172.17.0.2:90035,0.420663,100,,10,5.97246e-10,8.68226,0.354853,elasticnet,sag,20.0,10,200,6.14041,sigmoid,False,True,False,True,True,False,1.0,27.8548,0.836344,0.345029,0.233015
train_model_7b7a833a,TERMINATED,172.17.0.2:90098,0.054002,50,5.0,2,0.0015638,0.188651,0.150271,l2,lbfgs,10.0,5,10,0.137258,poly,True,False,True,False,False,True,1.0,189.212,0.820769,0.330322,0.446564
train_model_8a3981a0,TERMINATED,172.17.0.2:90153,0.0112187,200,20.0,5,7.1414e-08,1.7001,0.691597,l1,saga,20.0,5,50,0.350211,linear,True,False,True,True,False,True,1.0,210.435,0.839977,0.369724,0.252314
train_model_809773f9,TERMINATED,172.17.0.2:90214,0.143362,100,10.0,5,1.50687e-05,0.35821,0.290419,,liblinear,10.0,2,100,2.40729,rbf,False,True,False,False,True,False,1.0,5.46608,0.842321,0.342468,0.0598771
train_model_0842c4f0,TERMINATED,172.17.0.2:90280,0.286593,100,15.0,5,2.79868e-07,0.171649,0.185339,elasticnet,lbfgs,,5,100,1.7647,poly,True,True,False,False,False,True,1.0,202.675,0.818435,0.324391,0.164837
train_model_2a942f10,TERMINATED,172.17.0.2:90349,0.0208686,50,15.0,5,3.32037e-06,3.48463,0.784862,l1,sag,5.0,10,50,0.684406,rbf,False,False,True,True,True,False,1.0,20.7924,0.837255,0.368973,0.465465
train_model_c4c217be,TERMINATED,172.17.0.2:90412,0.0794944,200,20.0,10,2.69642e-09,0.925193,0.132559,l1,newton-cg,15.0,2,10,0.309845,rbf,False,False,True,False,True,False,1.0,1.3334,0.826206,0.35663,0.260078
train_model_0437a60d,TERMINATED,172.17.0.2:90480,0.576291,100,10.0,2,0.000397079,1.80247,0.556646,elasticnet,liblinear,15.0,5,200,3.89908,poly,True,True,False,True,False,True,1.0,199.524,0.846746,0.352591,0.0298798
train_model_d3fc4929,TERMINATED,172.17.0.2:90543,0.03712,200,15.0,5,1.42618e-08,0.683367,0.931253,l1,liblinear,5.0,2,200,5.37439,rbf,True,True,True,False,False,True,1.0,223.363,0.814535,0.325562,0.341344
train_model_938c903d,TERMINATED,172.17.0.2:90607,0.509741,100,15.0,5,6.52306e-05,2.44586,0.223896,,saga,,5,10,0.205383,linear,False,False,False,True,True,False,1.0,20.3182,0.845435,0.36502,0.113298


Trial name,f1_score,pr_auc,roc_auc
train_model_0437a60d,0.0298798,0.352591,0.846746
train_model_0842c4f0,0.164837,0.324391,0.818435
train_model_1a72806d,0.461053,0.344335,0.831564
train_model_2a942f10,0.465465,0.368973,0.837255
train_model_7b7a833a,0.446564,0.330322,0.820769
train_model_809773f9,0.0598771,0.342468,0.842321
train_model_8a3981a0,0.252314,0.369724,0.839977
train_model_938c903d,0.113298,0.36502,0.845435
train_model_c4c217be,0.260078,0.35663,0.826206
train_model_c98f5173,0.233015,0.345029,0.836344


2024-10-01 06:43:57,755	INFO timeout.py:54 -- Reached timeout of 300 seconds. Stopping all trials.
2024-10-01 06:43:57,773	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/workspaces/example-01-ml-engineering-for-regression/artifacts/garbadge_ensemble' in 0.0138s.
2024-10-01 06:44:08,305	INFO tune.py:1041 -- Total run time: 310.76 seconds (300.19 seconds for the tuning loop).


Best config: {'use_dt': False, 'dt.max_depth': 15, 'dt.min_samples_split': 5, 'use_svm': False, 'svm.C': 0.6844062640968933, 'svm.kernel': 'rbf', 'use_rf': True, 'rf.n_estimators': 50, 'rf.max_depth': 5, 'rf.min_samples_split': 10, 'use_gnb': True, 'gnb.var_smoothing': 3.3203726553935036e-06, 'use_lr': True, 'lr.C': 3.4846330706636603, 'lr.penalty': 'l1', 'lr.solver': 'sag', 'lr.l1_ratio': 0.7848621599458883, 'use_adaboost': False, 'adaboost.n_estimators': 50, 'adaboost.learning_rate': 0.020868594694202942}
Best F1 Score: 0.4654651721777631


In [9]:
analysis.get_best_config(metric="f1_score", mode="max")

{'use_dt': False,
 'dt.max_depth': 15,
 'dt.min_samples_split': 5,
 'use_svm': False,
 'svm.C': 0.6844062640968933,
 'svm.kernel': 'rbf',
 'use_rf': True,
 'rf.n_estimators': 50,
 'rf.max_depth': 5,
 'rf.min_samples_split': 10,
 'use_gnb': True,
 'gnb.var_smoothing': 3.3203726553935036e-06,
 'use_lr': True,
 'lr.C': 3.4846330706636603,
 'lr.penalty': 'l1',
 'lr.solver': 'sag',
 'lr.l1_ratio': 0.7848621599458883,
 'use_adaboost': False,
 'adaboost.n_estimators': 50,
 'adaboost.learning_rate': 0.020868594694202942}

In [10]:
analysis.get_best_config(metric="roc_auc", mode="max")

{'use_dt': False,
 'dt.max_depth': 20,
 'dt.min_samples_split': 10,
 'use_svm': False,
 'svm.C': 0.9340025461180567,
 'svm.kernel': 'poly',
 'use_rf': True,
 'rf.n_estimators': 50,
 'rf.max_depth': 15,
 'rf.min_samples_split': 5,
 'use_gnb': False,
 'gnb.var_smoothing': 1.7055524445861225e-10,
 'use_lr': False,
 'lr.C': 0.13201681503831053,
 'lr.penalty': 'elasticnet',
 'lr.solver': 'lbfgs',
 'lr.l1_ratio': 0.43658669656779736,
 'use_adaboost': False,
 'adaboost.n_estimators': 100,
 'adaboost.learning_rate': 0.10239965431295568}

In [11]:
analysis.get_best_config(metric="pr_auc", mode="max")

{'use_dt': False,
 'dt.max_depth': 20,
 'dt.min_samples_split': 10,
 'use_svm': False,
 'svm.C': 0.9340025461180567,
 'svm.kernel': 'poly',
 'use_rf': True,
 'rf.n_estimators': 50,
 'rf.max_depth': 15,
 'rf.min_samples_split': 5,
 'use_gnb': False,
 'gnb.var_smoothing': 1.7055524445861225e-10,
 'use_lr': False,
 'lr.C': 0.13201681503831053,
 'lr.penalty': 'elasticnet',
 'lr.solver': 'lbfgs',
 'lr.l1_ratio': 0.43658669656779736,
 'use_adaboost': False,
 'adaboost.n_estimators': 100,
 'adaboost.learning_rate': 0.10239965431295568}