# Final training

Now we will train the candidate models, select the one with the best results, and optimize its hyperparameters so that we end up with the best model in its best configuration.

To do so we will follow the next logic:

* From the full set of candidate models, we will select and train those with the following characteristics:
  * Maximum interpretability
  * Lowest inference cost

* To optimize their parameters, we distinguish two kinds of parameters, each with its own approach:
  * Model weights
    * Fitted automatically by each estimator's own solver (gradient-based for some models) when `fit` is called in scikit-learn
  * Hyperparameters
    * Random search: when the search space is small but still too large to try every combination, sampling random combinations (optionally with early stopping) can plausibly find a good set of hyperparameters
    * Bayesian approach: use a probabilistic model of the results so far to pick the most promising set of hyperparameters to try next

Gradient descent and similar methods do not work on hyperparameters, many of which take discrete or categorical values (for example `max_depth` or `solver`), since a continuous function is needed in order to differentiate.
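To make the distinction between weights and hyperparameters concrete, here is a minimal sketch on toy data (not the ECG data). The names `X_toy`, `y_toy`, `weak_reg` and `strong_reg` are illustrative assumptions. The hyperparameter `C` is fixed by us before training, while the weights in `coef_` are found by the solver inside `fit`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data, unrelated to the ECG dataset (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# C is a hyperparameter: we choose it before training starts.
weak_reg = LogisticRegression(C=10.0, max_iter=1_000).fit(X_toy, y_toy)
strong_reg = LogisticRegression(C=0.01, max_iter=1_000).fit(X_toy, y_toy)

# coef_ holds the model weights: the solver learns them inside fit().
weak_norm = np.linalg.norm(weak_reg.coef_)
strong_norm = np.linalg.norm(strong_reg.coef_)
print(f"C=10   -> weight norm {weak_norm:.3f}")
print(f"C=0.01 -> weight norm {strong_norm:.3f}")
```

A smaller `C` means stronger regularization, so the learned weights shrink. Nothing in `fit` changes `C` itself, which is why a separate search procedure is needed for it.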

In this notebook we assume that all the data has already been pre-processed. The input is a matrix X containing all ECGs plus demographic data and a vector Y containing the final labels, which we then divide into train and test sets.

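The following cell generates synthetic data so that the rest of the code can run. The labels are drawn at random, independently of the features, so no model is expected to beat chance on it.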


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Parameters for synthetic data
n_samples = 500
n_ecg_features = 100  # e.g., ECG_0, ECG_1, ..., ECG_99

# Create synthetic numeric columns
np.random.seed(42)
df = pd.DataFrame({
    'Height': np.random.normal(170, 10, n_samples),
    'Weight': np.random.normal(70, 15, n_samples),
    'BMI': np.random.normal(24, 4, n_samples),
})

# Add synthetic ECG features
for i in range(n_ecg_features):
    df[f'ECG_{i}'] = np.random.normal(0, 1, n_samples)

# Add categorical/boolean columns
df['Gender'] = np.random.choice(['Male', 'Female'], n_samples)
df['Smoker'] = np.random.choice(['Yes', 'No'], n_samples)
df['HTA'] = np.random.choice(['Yes', 'No'], n_samples)
df['DM'] = np.random.choice(['Yes', 'No'], n_samples)
df['DLP'] = np.random.choice(['Yes', 'No'], n_samples)
df['COPD'] = np.random.choice(['Yes', 'No'], n_samples)
df['Sleep_apnea'] = np.random.choice(['Yes', 'No'], n_samples)

# Add a binary label column
df['Label'] = np.random.choice([0, 1], n_samples)

# Separate the labels, then scale the numeric features and one-hot encode the categorical ones
Y = df['Label'].values

numeric_cols = ['Height', 'Weight', 'BMI'] + [col for col in df.columns if col.startswith('ECG_')]
X_numeric = df[numeric_cols].copy()

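# Note: for simplicity the scaler is fitted on all rows before the split;
# with real data, fit it on the training split only to avoid leakage.
scaler = MinMaxScaler()
X_numeric_scaled = pd.DataFrame(scaler.fit_transform(X_numeric), columns=X_numeric.columns, index=X_numeric.index)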

categorical_cols = ['Gender', 'Smoker', 'HTA', 'DM', 'DLP', 'COPD', 'Sleep_apnea']
X_categorical = df[categorical_cols].copy()
X_categorical_encoded = pd.get_dummies(X_categorical, drop_first=True)

X = pd.concat([X_numeric_scaled, X_categorical_encoded], axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

X_train shape: (400, 110)
X_test shape: (100, 110)
Y_train shape: (400,)
Y_test shape: (100,)




With the approach laid out, let's define the subset of models we have chosen.

In [22]:
from sklearn.linear_model    import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes     import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.tree            import DecisionTreeClassifier
from sklearn.ensemble        import RandomForestClassifier
from imodels                 import RuleFitClassifier   # or any RIPPER implementation
from sklearn.neighbors       import KNeighborsClassifier, RadiusNeighborsClassifier, NearestCentroid

models = {
    "logreg": LogisticRegression(solver="saga", max_iter=5_000),
    "lda":    LinearDiscriminantAnalysis(),
    "gnb":    GaussianNB(),
    "mnb":    MultinomialNB(),
    "bnb":    BernoulliNB(),
    "dt":     DecisionTreeClassifier(),
    "rf":     RandomForestClassifier(),
    "rulefit": RuleFitClassifier(),
    "knn":    KNeighborsClassifier(),
    "rnn":    RadiusNeighborsClassifier(),
    "nc":     NearestCentroid(),
}

# Brute-force parameter approach

Here we set up the first idea: trying every parameter combination by brute force. `GridSearchCV` does exactly that: given a list of candidate values for each parameter, it evaluates every combination with cross-validation and keeps the best one.

This is feasible if we predefine a small set of candidate values for each parameter, so that the number of combinations stays small enough to evaluate all of them in a reasonable computation time.
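As a quick feasibility check, the number of combinations can be counted before anything is fitted. The sketch below uses scikit-learn's `ParameterGrid` on three of the grids from the next cell, copied here so the block is self-contained. The names `example_grids`, `cv_folds` and `grid_sizes` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Three grids copied from the notebook's param_grids cell.
example_grids = {
    "logreg": {
        "penalty": ["l1", "l2", "elasticnet"],
        "C": [0.01, 0.1, 1, 10],
        "l1_ratio": [0.0, 0.5, 1.0],
    },
    "dt": {
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 5, 10],
    },
    "knn": {
        "n_neighbors": [3, 5, 10],
        "weights": ["uniform", "distance"],
        "p": [1, 2],
    },
}

cv_folds = 5
grid_sizes = {}
for name, grid in example_grids.items():
    # The grid size is the product of the lengths of all value lists.
    n_candidates = len(ParameterGrid(grid))
    grid_sizes[name] = n_candidates
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv_folds} fits")
```

The totals grow multiplicatively with every parameter added, which is what makes exhaustive search impractical for larger spaces.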

In [23]:
param_grids = {
    "logreg": {
        "penalty": ["l1","l2","elasticnet"],
        "C":       [0.01, 0.1, 1, 10],
        "l1_ratio":[0.0, 0.5, 1.0],  # only for elasticnet
    },
    "lda": {
        "solver":   ["svd","lsqr","eigen"],
        "shrinkage":[None, 0.1, 0.5, 1.0],
    },
    "gnb": {"var_smoothing": np.logspace(-9, -6, 5)},
    "mnb": {"alpha": [0.1, 0.5, 1.0]},
    "bnb": {"alpha": [0.1, 0.5, 1.0]},
    "dt": {
        "max_depth":        [3,5,10,None],
        "min_samples_leaf": [1,5,10]
    },
    "rf": {
        "n_estimators": [50,100,200],
        "max_depth":    [5,10,None],
        "max_features": ["auto","sqrt","log2"]
    },
    "rulefit": {
        "max_rules":        [10, 20, 50],
        "tree_size":        [3, 5, 7]
    },
    "knn": {
        "n_neighbors": [3,5,10],
        "weights":     ["uniform","distance"],
        "p":           [1,2]
    },
    "rnn": {
        "radius":       [0.5, 1.0, 2.0],                 # how far to look
        "outlier_label":[ "most_frequent", 0, 1 ],      # how to label isolated points
        "weights":      [ "uniform", "distance" ]       # voting scheme
    },
    "nc":  {"shrink_threshold": [None, 0.0, 0.1]}
}
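
Two of the grids above contain combinations that scikit-learn rejects, which shows up as failed fits in the log further down: LDA's `svd` solver does not support `shrinkage`, and the random forest failures are most likely caused by `max_features="auto"`, which recent releases no longer accept. One possible way to avoid this, sketched under those assumptions, is to pass a list of sub-grids so that incompatible values are never combined. `lda_grid` and `rf_grid` are hypothetical replacements, not the grids used for the results in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical replacement grids (illustrative names).
lda_grid = [
    {"solver": ["svd"]},  # svd is searched without shrinkage
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, 0.1, 0.5, 1.0]},
]
rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],  # "auto" dropped (assumed to be the invalid value)
}

lda_candidates = list(ParameterGrid(lda_grid))
# Check that no candidate pairs svd with a shrinkage value.
invalid = [
    c for c in lda_candidates
    if c["solver"] == "svd" and c.get("shrinkage") is not None
]
rf_size = len(ParameterGrid(rf_grid))
print(len(lda_candidates), "LDA candidates,", len(invalid), "invalid")
print(rf_size, "random forest candidates")
```

`GridSearchCV` accepts such a list directly as `param_grid`, so the rest of the pipeline would not need to change.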


In [24]:
from sklearn.metrics import roc_auc_score, accuracy_score

def safe_score(estimator, X_test, y_test):
    """
    Return ROC-AUC if we can get continuous scores, else fall
    back to accuracy on hard predictions. Catches both ValueError
    and AttributeError so no model can crash us.
    """
    try:
        # Prefer probabilities
        if hasattr(estimator, "predict_proba"):
            probs = estimator.predict_proba(X_test)[:, 1]
            return roc_auc_score(y_test, probs)
        
        # Next, decision_function if available
        if hasattr(estimator, "decision_function"):
            scores = estimator.decision_function(X_test)
            return roc_auc_score(y_test, scores)
        
        # Finally, fall back to hard predictions → accuracy
        preds = estimator.predict(X_test)
        return accuracy_score(y_test, preds)
    
    except (ValueError, AttributeError) as e:
        print(f"⚠️  Warning scoring {estimator.__class__.__name__}: {e}")
        return float("nan")



from sklearn.model_selection import GridSearchCV

def train_and_evaluate(model_name,
                       model,
                       param_grid,
                       X_train, y_train,
                       X_test,  y_test,
                       cv=5,
                       scoring="roc_auc"):
    # 1) Hyper‐parameter search
    search = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=cv,
        scoring=scoring,
        n_jobs=-1,
        verbose=1,
        refit=True
    )
    search.fit(X_train, y_train)
    
    best = search.best_estimator_
    
    # 2) Safe test‐set scoring
    test_score = safe_score(best, X_test, y_test)
    
    return {
        "model":       model_name,
        "best_params": search.best_params_,
        "cv_score":    search.best_score_,
        "test_score":  test_score,
        "estimator":   best
    }
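
Note that `scoring="roc_auc"` needs continuous scores during cross-validation, so estimators that expose neither `predict_proba` nor `decision_function` (such as `NearestCentroid` in the scikit-learn version used here, as the log further down shows) get NaN for every candidate. One possible workaround is to pass a callable scorer with the signature `(estimator, X, y)`. The following is a sketch on toy data; `auc_or_accuracy` is a hypothetical helper that mirrors the fallback logic of `safe_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid


def auc_or_accuracy(estimator, X, y):
    """Hypothetical scorer: ROC-AUC when scores exist, accuracy otherwise."""
    if hasattr(estimator, "predict_proba"):
        return roc_auc_score(y, estimator.predict_proba(X)[:, 1])
    if hasattr(estimator, "decision_function"):
        return roc_auc_score(y, estimator.decision_function(X))
    return accuracy_score(y, estimator.predict(X))


# Toy data, unrelated to the ECG dataset (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    NearestCentroid(),
    param_grid={"shrink_threshold": [None, 0.1]},
    cv=5,
    scoring=auc_or_accuracy,  # a callable is accepted in place of a metric name
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Accuracy and ROC-AUC are not on the same scale, so a model scored through the accuracy fallback should not be ranked directly against models scored by AUC.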



In [25]:
results = []

for name, mdl in models.items():
    print(f"\n=== Tuning & evaluating: {name} ===")
    grid = param_grids.get(name, {})  # empty dict → no tuning, just default
    res  = train_and_evaluate(name, mdl, grid,
                              X_train, Y_train,
                              X_test,  Y_test,
                              cv=5, scoring="roc_auc")
    results.append(res)

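# Drop failed scores (NaN makes the sort order unreliable), then sort by test performance
results = [r for r in results if not np.isnan(r["test_score"])]
results = sorted(results, key=lambda r: r["test_score"], reverse=True)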



=== Tuning & evaluating: logreg ===
Fitting 5 folds for each of 36 candidates, totalling 180 fits


15 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/discriminant_analysis.py", line 621, in fit
    raise NotImplementedError("shrinkage not supported with 'svd' solver.")
NotImplementedError: 


=== Tuning & evaluating: lda ===
Fitting 5 folds for each of 12 candidates, totalling 60 fits

=== Tuning & evaluating: gnb ===
Fitting 5 folds for each of 5 candidates, totalling 25 fits

=== Tuning & evaluating: mnb ===
Fitting 5 folds for each of 3 candidates, totalling 15 fits

=== Tuning & evaluating: bnb ===
Fitting 5 folds for each of 3 candidates, totalling 15 fits

=== Tuning & evaluating: dt ===
Fitting 5 folds for each of 12 candidates, totalling 60 fits

=== Tuning & evaluating: rf ===
Fitting 5 folds for each of 27 candidates, totalling 135 fits


45 fits failed out of a total of 135.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
29 fits failed with the following error:
Traceback (most recent call last):
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklea


=== Tuning & evaluating: rulefit ===
Fitting 5 folds for each of 9 candidates, totalling 45 fits

=== Tuning & evaluating: knn ===
Fitting 5 folds for each of 12 candidates, totalling 60 fits

=== Tuning & evaluating: rnn ===
Fitting 5 folds for each of 18 candidates, totalling 90 fits

=== Tuning & evaluating: nc ===
Fitting 5 folds for each of 3 candidates, totalling 15 fits


Traceback (most recent call last):
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 459, in _score
    y_pred = method_caller(clf, "decision_function", X, pos_label=pos_label)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/utils/_response.py", line 181, in _get_response_values
    prediction_method = _check_response_method(estimator, response_method)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/utils/validation.py", line 1939, in _check_response_method
    raise AttributeError(
AttributeError: NearestCentroid has none of the following attributes: decision_function.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/marc

In [21]:
# Show top 3
for top in results[:3]:
    print(f"{top['model']:10s}  Test AUC: {top['test_score']:.3f}  CV AUC: {top['cv_score']:.3f}")
    print("  Params:", top["best_params"])

gnb         Test AUC: 0.572  CV AUC: 0.502
  Params: {'var_smoothing': 1e-09}
mnb         Test AUC: 0.505  CV AUC: 0.453
  Params: {'alpha': 0.1}
lda         Test AUC: 0.503  CV AUC: 0.494
  Params: {'shrinkage': None, 'solver': 'svd'}


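These are the three best models by test-set score. Because the synthetic labels are random, scores close to 0.5 are expected; the ranking only shows that the pipeline runs end to end and says nothing about how the models will rank on real data.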
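
To judge how much of a score like 0.57 is noise, one can simulate the AUC obtained by scores that carry no signal on a test set of the same size. This is a rough sketch, assuming 100 test samples with balanced classes (the split above is stratified on roughly balanced labels); `y_null` and `null_aucs` are illustrative names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_test = 100       # same size as the test split in this notebook
n_repeats = 1_000  # number of simulated "models"

# Balanced labels; the scores are pure noise, so the true AUC is 0.5.
y_null = np.repeat([0, 1], n_test // 2)
null_aucs = np.array([
    roc_auc_score(y_null, rng.random(n_test)) for _ in range(n_repeats)
])

print(f"mean AUC under the null: {null_aucs.mean():.3f}")
print(f"std  AUC under the null: {null_aucs.std():.3f}")
```

For 50 positives and 50 negatives the Mann-Whitney variance formula gives a standard deviation of about 0.06, so differences of a few hundredths between models on this test set are within what chance alone produces.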

# Randomized search

Why RandomizedSearch?

When your hyper-parameter space is large (many parameters, each with multiple possible values), exhaustive Grid Search becomes prohibitively expensive (combinatorial explosion).

`RandomizedSearchCV` lets you specify a total budget of trials (`n_iter`). It samples parameter combinations at random and often finds nearly optimal settings in far fewer trials than an exhaustive grid.
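
The fixed budget can be seen directly with `ParameterSampler`, the scikit-learn helper that draws candidates from a mix of lists and `scipy.stats` distributions. The `space` below is an illustrative mix of parameters and is not tied to a single estimator.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Illustrative search space: two distributions and one list.
space = {
    "C": loguniform(1e-3, 1e1),   # continuous, log-uniform between 0.001 and 10
    "max_depth": randint(1, 20),  # integers 1..19 (upper bound is exclusive)
    "weights": ["uniform", "distance"],
}

# n_iter fixes how many candidates are drawn, whatever the size of the space.
draws = list(ParameterSampler(space, n_iter=50, random_state=42))
print(len(draws), "candidates drawn")
for candidate in draws[:3]:
    print(candidate)
```

Because `C` is continuous, the space is infinite, yet the cost of the search is still exactly `n_iter` candidates times the number of folds.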

In [26]:
from scipy.stats import loguniform, randint, uniform

param_dists = {
    # 1. Logistic Regression
    "logreg": {
        "penalty":   ["l1", "l2", "elasticnet"],
        "C":         loguniform(1e-3, 1e1),    # 0.001 → 10 (log-uniform)
        "l1_ratio":  uniform(0, 1)              # only used if penalty='elasticnet'
    },

    # 2. LDA
    "lda": {
        "solver":    ["svd", "lsqr", "eigen"],
        "shrinkage": uniform(0, 1)              # mix between 0 (none) and 1 (full)
    },

    # 3. Gaussian Naive Bayes
    "gnb": {
        "var_smoothing": loguniform(1e-12, 1e-6)
    },

    # 4. Multinomial NB
    "mnb": {
        "alpha": loguniform(1e-2, 10)           # smoothing from 0.01 → 10
    },

    # 5. Bernoulli NB
    "bnb": {
        "alpha": loguniform(1e-2, 10)
    },

    # 6. Decision Tree
    "dt": {
        "max_depth":        randint(1, 20),     # integer 1 → 19
        "min_samples_leaf": randint(1, 20),
        "criterion":        ["gini", "entropy"]
    },

    # 7. Random Forest
    "rf": {
        "n_estimators": randint(50, 300),
        "max_depth":    randint(3, 20),
        "max_features": ["auto", "sqrt", "log2"]
    },

    # 8. RuleFit (or RIPPER)
    "rulefit": {
        "max_rules": randint(5, 100),
        "tree_size": randint(2, 10)
    },

    # 9. k-Nearest Neighbors
    "knn": {
        "n_neighbors": randint(1, 30),
        "weights":     ["uniform", "distance"],
        "p":           [1, 2]                    # L1 vs L2
    },

    # 10. Radius Neighbors
    "rnn": {
        "radius":       uniform(0.1, 5.0),       # 0.1 → 5.1
        "weights":      ["uniform", "distance"],
        "outlier_label": ["most_frequent", 0, 1]
    },

    # 11. Nearest Centroid
    "nc": {
        "shrink_threshold": uniform(0, 1)        # amount of centroid shrinkage
    }
}
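
As with the grids, sampling `shrinkage` together with `solver="svd"` produces fits that LDA rejects. `RandomizedSearchCV` also accepts a list of dictionaries, in which case it first picks one dictionary and then samples from it. Below is a minimal sketch on toy data; `lda_dists` is a hypothetical alternative to the `"lda"` entry above, and `error_score="raise"` is set so that any invalid combination would stop the run.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RandomizedSearchCV

# Toy data, unrelated to the ECG dataset (assumption for illustration).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

# Hypothetical conditional space: shrinkage is only sampled for lsqr/eigen.
lda_dists = [
    {"solver": ["svd"]},
    {"solver": ["lsqr", "eigen"], "shrinkage": uniform(0, 1)},
]

search = RandomizedSearchCV(
    LinearDiscriminantAnalysis(),
    param_distributions=lda_dists,
    n_iter=10,
    cv=3,
    scoring="roc_auc",
    random_state=42,
    error_score="raise",  # fail loudly instead of silently recording NaN
)
search.fit(X_toy, y_toy)
sampled = search.cv_results_["params"]
print(search.best_params_)
```

With this layout every one of the `n_iter` trials is a valid configuration, so none of the budget is spent on fits that are bound to fail.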


In [27]:
from sklearn.model_selection import RandomizedSearchCV

def train_with_random_search(model_name,
                             model,
                             param_dist,
                             X_train, y_train,
                             X_test,  y_test,
                             cv=5,
                             scoring="roc_auc",
                             n_iter=50,         # total random trials
                             random_state=42):
    """
    1) Runs RandomizedSearchCV on (model, param_dist)
    2) Refits best model on full X_train
    3) Scores on X_test via safe_score
    """
    search = RandomizedSearchCV(
        estimator     = model,
        param_distributions = param_dist,
        n_iter        = n_iter,
        cv            = cv,
        scoring       = scoring,
        n_jobs        = -1,
        verbose       = 1,
        refit         = True,
        random_state  = random_state
    )
    search.fit(X_train, y_train)
    
    best = search.best_estimator_
    test_score = safe_score(best, X_test, y_test)
    
    return {
        "model":       model_name,
        "best_params": search.best_params_,
        "cv_score":    search.best_score_,
        "test_score":  test_score,
        "estimator":   best
    }


In [28]:
results = []

for name, mdl in models.items():
    print(f"\n>>> Random search for: {name}")
    
    # Look up the search distributions for this model (empty dict if none are defined)
    dist = param_dists.get(name, {})
    
    # if there's nothing to search, just train and score default
    if not dist:
        res = train_and_evaluate(name, mdl, {}, 
                                 X_train, Y_train, X_test, Y_test)
    else:
        res = train_with_random_search(name, mdl, dist,
                                       X_train, Y_train, X_test, Y_test,
                                       cv=5, scoring="roc_auc",
                                       n_iter=50, random_state=42)
    results.append(res)

# Filter out failures and sort by test score
results = [r for r in results if not np.isnan(r["test_score"])]
results.sort(key=lambda r: r["test_score"], reverse=True)




>>> Random search for: logreg
Fitting 5 folds for each of 50 candidates, totalling 250 fits





>>> Random search for: lda
Fitting 5 folds for each of 50 candidates, totalling 250 fits


90 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/discriminant_analysis.py", line 621, in fit
    raise NotImplementedError("shrinkage not supported with 'svd' solver.")
NotImplementedError:


>>> Random search for: gnb
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: mnb
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: bnb
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: dt
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: rf
Fitting 5 folds for each of 50 candidates, totalling 250 fits


80 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
26 fits failed with the following error:
Traceback (most recent call last):
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklea


>>> Random search for: rulefit
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: knn
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: rnn
Fitting 5 folds for each of 50 candidates, totalling 250 fits

>>> Random search for: nc
Fitting 5 folds for each of 50 candidates, totalling 250 fits


Traceback (most recent call last):
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 459, in _score
    y_pred = method_caller(clf, "decision_function", X, pos_label=pos_label)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 86, in _cached_call
    result, _ = _get_response_values(
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/utils/_response.py", line 181, in _get_response_values
    prediction_method = _check_response_method(estimator, response_method)
  File "/home/marc/miniconda3/envs/CompBioMed25/lib/python3.8/site-packages/sklearn/utils/validation.py", line 1939, in _check_response_method
    raise AttributeError(
AttributeError: NearestCentroid has none of the following attributes: decision_function.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/marc

In [29]:
# Show top 3
for top in results[:3]:
    print(f"{top['model']:10s}  Test AUC: {top['test_score']:.3f}  Params: {top['best_params']}")


gnb         Test AUC: 0.572  Params: {'var_smoothing': 1.7670169402947945e-10}
dt          Test AUC: 0.549  Params: {'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 13}
lda         Test AUC: 0.517  Params: {'shrinkage': 0.3951502360018144, 'solver': 'lsqr'}
