# Data Preprocessing and Model Exploration

The dataset contains the following columns:

*   **Patient ID** - Unique identifier for each patient
*   **Age** - Age of the patient
*   **Sex** - Gender of the patient (Male/Female)
*   **Cholesterol** - Cholesterol levels of the patient
*   **Blood Pressure** - Blood pressure of the patient (systolic/diastolic)
*   **Heart Rate** - Heart rate of the patient
*   **Diabetes** - Whether the patient has diabetes (1: Yes, 0: No)
*   **Family History** - Family history of heart-related problems (1: Yes, 0: No)
*   **Smoking** - Smoking status of the patient (1: Smoker, 0: Non-smoker)
*   **Obesity** - Obesity status of the patient (1: Obese, 0: Not obese)
*   **Alcohol Consumption** - Whether the patient consumes alcohol (1: Yes, 0: No)
*   **Exercise Hours Per Week** - Number of exercise hours per week
*   **Diet** - Dietary habits of the patient (Healthy/Average/Unhealthy)
*   **Previous Heart Problems** - Previous heart problems of the patient (1: Yes, 0: No)
*   **Medication Use** - Medication usage by the patient (1: Yes, 0: No)
*   **Stress Level** - Stress level reported by the patient (1-10)
*   **Sedentary Hours Per Day** - Hours of sedentary activity per day
*   **Income** - Income level of the patient
*   **BMI** - Body Mass Index (BMI) of the patient
*   **Triglycerides** - Triglyceride levels of the patient
*   **Physical Activity Days Per Week** - Days of physical activity per week
*   **Sleep Hours Per Day** - Hours of sleep per day
*   **Country** - Country of the patient
*   **Continent** - Continent where the patient resides
*   **Hemisphere** - Hemisphere where the patient resides
*   **Heart Attack Risk** - Presence of heart attack risk (1: Yes, 0: No)

## Imports

In [1]:
import warnings

import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

In [2]:
# auto reload libs
%load_ext autoreload
%autoreload 2

## Paths

In [3]:
DATASET = "../data/heart_attack_prediction_dataset.csv"

## Load Data

In [4]:
org_df = pd.read_csv(DATASET)
org_df.head()

|  | Patient ID | Age | Sex | Cholesterol | Blood Pressure | Heart Rate | Diabetes | Family History | Smoking | Obesity | ... | Sedentary Hours Per Day | Income | BMI | Triglycerides | Physical Activity Days Per Week | Sleep Hours Per Day | Country | Continent | Hemisphere | Heart Attack Risk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BMW7812 | 67 | Male | 208 | 158/88 | 72 | 0 | 0 | 1 | 0 | ... | 6.615001 | 261404 | 31.251233 | 286 | 0 | 6 | Argentina | South America | Southern Hemisphere | 0 |
| 1 | CZE1114 | 21 | Male | 389 | 165/93 | 98 | 1 | 1 | 1 | 1 | ... | 4.963459 | 285768 | 27.194973 | 235 | 1 | 7 | Canada | North America | Northern Hemisphere | 0 |
| 2 | BNI9906 | 21 | Female | 324 | 174/99 | 72 | 1 | 0 | 0 | 0 | ... | 9.463426 | 235282 | 28.176571 | 587 | 4 | 4 | France | Europe | Northern Hemisphere | 0 |
| 3 | JLN3497 | 84 | Male | 383 | 163/100 | 73 | 1 | 1 | 1 | 0 | ... | 7.648981 | 125640 | 36.464704 | 378 | 3 | 4 | Canada | North America | Northern Hemisphere | 0 |
| 4 | GFO8847 | 66 | Male | 318 | 91/88 | 93 | 1 | 1 | 1 | 1 | ... | 1.514821 | 160555 | 21.809144 | 231 | 1 | 5 | Thailand | Asia | Northern Hemisphere | 0 |


In [5]:
org_df.columns

Index(['Patient ID', 'Age', 'Sex', 'Cholesterol', 'Blood Pressure',
       'Heart Rate', 'Diabetes', 'Family History', 'Smoking', 'Obesity',
       'Alcohol Consumption', 'Exercise Hours Per Week', 'Diet',
       'Previous Heart Problems', 'Medication Use', 'Stress Level',
       'Sedentary Hours Per Day', 'Income', 'BMI', 'Triglycerides',
       'Physical Activity Days Per Week', 'Sleep Hours Per Day', 'Country',
       'Continent', 'Hemisphere', 'Heart Attack Risk'],
      dtype='object')

### Train, Validation, Test Split

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X = org_df.drop("Heart Attack Risk", axis=1)
y = org_df["Heart Attack Risk"]

In [8]:
# Hold out 10% as the test set; the split is stratified on Sex, not on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=X.Sex, random_state=42)
# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=X_train.Sex, random_state=42)

In [9]:
X_train.shape  # , X_val.shape, X_test.shape

(7886, 25)
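
The split above stratifies on `Sex`. A common alternative, sketched here on made-up labels rather than the real dataset, is to stratify on the target so that both partitions keep the same class balance. The arrays `X_toy` and `y_toy` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 64% class 0, 36% class 1
y_toy = np.array([0] * 640 + [1] * 360)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=42
)

# Share of positives in the train and test partitions
print(y_tr.mean(), y_te.mean())
```

Whether to stratify on the target or on a feature such as `Sex` depends on which distribution matters more for the evaluation.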

## Data Cleaning and Preprocessing

In [10]:
cols_to_drop = ["Patient ID", "Blood Pressure", "Country", "Continent", "Hemisphere", "Income"]

### Processing Blood Pressure column

In [11]:
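source_df = org_df.copy()

# Split "158/88" style strings into two integer columns
bp_split = source_df["Blood Pressure"].str.split("/", expand=True).astype(int)
bp_split.columns = ["Systolic", "Diastolic"]
# Drop the columns listed in cols_to_drop, then append the parsed blood-pressure columns
source_df.drop(cols_to_drop, axis=1, inplace=True)
source_df = pd.concat([source_df, bp_split], axis=1)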
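
As a small illustration of the blood-pressure parsing step, the sketch below applies the same `str.split` call to a toy frame. The frame `toy` is hypothetical; its three values are copied from the first rows shown earlier.

```python
import pandas as pd

# Toy frame with the same "systolic/diastolic" string layout as the dataset
toy = pd.DataFrame({"Blood Pressure": ["158/88", "165/93", "91/88"]})

# expand=True returns one column per piece; astype(int) makes them numeric
bp = toy["Blood Pressure"].str.split("/", expand=True).astype(int)
bp.columns = ["Systolic", "Diastolic"]

print(bp)
```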

### Processing Categorical columns

In [12]:
source_df.Sex = source_df.Sex.map({"Male": 1, "Female": 0})
source_df.head()

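|  | Age | Sex | Cholesterol | Heart Rate | Diabetes | Family History | Smoking | Obesity | Alcohol Consumption | Exercise Hours Per Week | ... | Medication Use | Stress Level | Sedentary Hours Per Day | BMI | Triglycerides | Physical Activity Days Per Week | Sleep Hours Per Day | Heart Attack Risk | Systolic | Diastolic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 67 | 1 | 208 | 72 | 0 | 0 | 1 | 0 | 0 | 4.168189 | ... | 0 | 9 | 6.615001 | 31.251233 | 286 | 0 | 6 | 0 | 158 | 88 |
| 1 | 21 | 1 | 389 | 98 | 1 | 1 | 1 | 1 | 1 | 1.813242 | ... | 0 | 1 | 4.963459 | 27.194973 | 235 | 1 | 7 | 0 | 165 | 93 |
| 2 | 21 | 0 | 324 | 72 | 1 | 0 | 0 | 0 | 0 | 2.078353 | ... | 1 | 9 | 9.463426 | 28.176571 | 587 | 4 | 4 | 0 | 174 | 99 |
| 3 | 84 | 1 | 383 | 73 | 1 | 1 | 1 | 0 | 1 | 9.82813 | ... | 0 | 9 | 7.648981 | 36.464704 | 378 | 3 | 4 | 0 | 163 | 100 |
| 4 | 66 | 1 | 318 | 93 | 1 | 1 | 1 | 1 | 0 | 5.804299 | ... | 0 | 6 | 1.514821 | 21.809144 | 231 | 1 | 5 | 0 | 91 | 88 |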


In [13]:
source_df.Diet = source_df.Diet.map({"Average": 0, "Healthy": 1, "Unhealthy": 2})
source_df.Diet[:5]

0    0
1    2
2    1
3    0
4    2
Name: Diet, dtype: int64

### Correlation Matrix calculation

In [14]:
corr_matrix = source_df.corr()
corr_matrix["Heart Attack Risk"].sort_values(ascending=False)

Heart Attack Risk                  1.000000
Cholesterol                        0.019340
Systolic                           0.018585
Diabetes                           0.017225
Exercise Hours Per Week            0.011133
Triglycerides                      0.010471
Age                                0.006403
Diet                               0.004540
Sex                                0.003095
Medication Use                     0.002234
Previous Heart Problems            0.000274
BMI                                0.000020
Family History                    -0.001652
Smoking                           -0.004051
Stress Level                      -0.004111
Heart Rate                        -0.004251
Physical Activity Days Per Week   -0.005014
Sedentary Hours Per Day           -0.005613
Diastolic                         -0.007509
Obesity                           -0.013318
Alcohol Consumption               -0.013778
Sleep Hours Per Day               -0.018528
Name: Heart Attack Risk, dtype: float64

### Column selection

In [15]:
source_df["Sleep Hours Per Day"].min(), source_df["Sleep Hours Per Day"].max()

(np.int64(4), np.int64(10))

In [16]:
drop_less_corr_cols = [
    "Diet",
    "Medication Use",
    "BMI",
    "Physical Activity Days Per Week",
    # "Heart Rate",
    # "Exercise Hours Per Week",
    # "Sleep Hours Per Day",
    # "Sedentary Hours Per Day",
]
source_df.drop(drop_less_corr_cols, axis=1, inplace=True)
source_df.head()

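|  | Age | Sex | Cholesterol | Heart Rate | Diabetes | Family History | Smoking | Obesity | Alcohol Consumption | Exercise Hours Per Week | Previous Heart Problems | Stress Level | Sedentary Hours Per Day | Triglycerides | Sleep Hours Per Day | Heart Attack Risk | Systolic | Diastolic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 67 | 1 | 208 | 72 | 0 | 0 | 1 | 0 | 0 | 4.168189 | 0 | 9 | 6.615001 | 286 | 6 | 0 | 158 | 88 |
| 1 | 21 | 1 | 389 | 98 | 1 | 1 | 1 | 1 | 1 | 1.813242 | 1 | 1 | 4.963459 | 235 | 7 | 0 | 165 | 93 |
| 2 | 21 | 0 | 324 | 72 | 1 | 0 | 0 | 0 | 0 | 2.078353 | 1 | 9 | 9.463426 | 587 | 4 | 0 | 174 | 99 |
| 3 | 84 | 1 | 383 | 73 | 1 | 1 | 1 | 0 | 1 | 9.82813 | 1 | 9 | 7.648981 | 378 | 4 | 0 | 163 | 100 |
| 4 | 66 | 1 | 318 | 93 | 1 | 1 | 1 | 1 | 0 | 5.804299 | 1 | 6 | 1.514821 | 231 | 5 | 0 | 91 | 88 |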


In [17]:
len(source_df.columns)

18

In [18]:
source_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      8763 non-null   int64  
 1   Sex                      8763 non-null   int64  
 2   Cholesterol              8763 non-null   int64  
 3   Heart Rate               8763 non-null   int64  
 4   Diabetes                 8763 non-null   int64  
 5   Family History           8763 non-null   int64  
 6   Smoking                  8763 non-null   int64  
 7   Obesity                  8763 non-null   int64  
 8   Alcohol Consumption      8763 non-null   int64  
 9   Exercise Hours Per Week  8763 non-null   float64
 10  Previous Heart Problems  8763 non-null   int64  
 11  Stress Level             8763 non-null   int64  
 12  Sedentary Hours Per Day  8763 non-null   float64
 13  Triglycerides            8763 non-null   int64  
 14  Sleep Hours Per Day     

In [19]:
source_df["Stress Level"].min(), source_df["Stress Level"].max()

(np.int64(1), np.int64(10))

In [20]:
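def cols_preprocessor(source_df: pd.DataFrame, drop_cols: list) -> pd.DataFrame:
    """Split "Blood Pressure" into Systolic/Diastolic, drop `drop_cols`, and encode Sex as 0/1."""
    # "158/88" -> Systolic=158, Diastolic=88
    bp_split = source_df["Blood Pressure"].str.split("/", expand=True).astype(int)
    bp_split.columns = ["Systolic", "Diastolic"]
    # drop() returns a new frame, so the caller's DataFrame is not modified
    source_df = source_df.drop(drop_cols, axis=1)
    source_df = pd.concat([source_df, bp_split], axis=1)
    source_df["Sex"] = source_df["Sex"].map({"Male": 1, "Female": 0})
    return source_df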

In [21]:
cols_to_drop.extend(drop_less_corr_cols)
cols_to_drop

['Patient ID',
 'Blood Pressure',
 'Country',
 'Continent',
 'Hemisphere',
 'Income',
 'Diet',
 'Medication Use',
 'BMI',
 'Physical Activity Days Per Week']

In [22]:
X_train = cols_preprocessor(X_train, cols_to_drop)
# X_val = cols_preprocessor(X_val, cols_to_drop)

In [23]:
X_train.columns

Index(['Age', 'Sex', 'Cholesterol', 'Heart Rate', 'Diabetes', 'Family History',
       'Smoking', 'Obesity', 'Alcohol Consumption', 'Exercise Hours Per Week',
       'Previous Heart Problems', 'Stress Level', 'Sedentary Hours Per Day',
       'Triglycerides', 'Sleep Hours Per Day', 'Systolic', 'Diastolic'],
      dtype='object')

In [24]:
len(X_train.columns)

17

### Data Preprocessor

In [25]:
categorical_cols = ["Stress Level"]  # ordinal variable: a categorical variable whose values have a natural order

In [26]:
X_train.head()

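|  | Age | Sex | Cholesterol | Heart Rate | Diabetes | Family History | Smoking | Obesity | Alcohol Consumption | Exercise Hours Per Week | Previous Heart Problems | Stress Level | Sedentary Hours Per Day | Triglycerides | Sleep Hours Per Day | Systolic | Diastolic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6047 | 18 | 1 | 168 | 53 | 0 | 0 | 1 | 0 | 1 | 1.852128 | 1 | 9 | 0.636354 | 306 | 10 | 141 | 90 |
| 2582 | 71 | 1 | 199 | 63 | 1 | 0 | 1 | 0 | 0 | 15.143869 | 1 | 6 | 9.037672 | 157 | 6 | 154 | 84 |
| 5673 | 47 | 1 | 297 | 78 | 1 | 1 | 1 | 0 | 0 | 12.572641 | 0 | 6 | 2.797046 | 210 | 9 | 130 | 93 |
| 6451 | 43 | 0 | 163 | 85 | 1 | 0 | 1 | 1 | 1 | 11.634507 | 1 | 1 | 10.110641 | 66 | 5 | 127 | 78 |
| 4889 | 75 | 1 | 135 | 54 | 0 | 0 | 1 | 0 | 0 | 15.432331 | 0 | 8 | 4.340246 | 433 | 5 | 153 | 88 |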


In [27]:
X_train["Alcohol Consumption"].min(), X_train["Alcohol Consumption"].max()

(np.int64(0), np.int64(1))

In [28]:
from sklearn.preprocessing import OrdinalEncoder

X_train_temp = X_train.copy()
oe = OrdinalEncoder()
X_train_temp[categorical_cols] = oe.fit_transform(X_train_temp[categorical_cols])
X_train_temp.head()

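|  | Age | Sex | Cholesterol | Heart Rate | Diabetes | Family History | Smoking | Obesity | Alcohol Consumption | Exercise Hours Per Week | Previous Heart Problems | Stress Level | Sedentary Hours Per Day | Triglycerides | Sleep Hours Per Day | Systolic | Diastolic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6047 | 18 | 1 | 168 | 53 | 0 | 0 | 1 | 0 | 1 | 1.852128 | 1 | 8.0 | 0.636354 | 306 | 10 | 141 | 90 |
| 2582 | 71 | 1 | 199 | 63 | 1 | 0 | 1 | 0 | 0 | 15.143869 | 1 | 5.0 | 9.037672 | 157 | 6 | 154 | 84 |
| 5673 | 47 | 1 | 297 | 78 | 1 | 1 | 1 | 0 | 0 | 12.572641 | 0 | 5.0 | 2.797046 | 210 | 9 | 130 | 93 |
| 6451 | 43 | 0 | 163 | 85 | 1 | 0 | 1 | 1 | 1 | 11.634507 | 1 | 0.0 | 10.110641 | 66 | 5 | 127 | 78 |
| 4889 | 75 | 1 | 135 | 54 | 0 | 0 | 1 | 0 | 0 | 15.432331 | 0 | 7.0 | 4.340246 | 433 | 5 | 153 | 88 |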


In [29]:
oe.categories_

[array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])]
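
Because all ten stress levels occur in the training data, the encoder learned the categories 1 to 10 and maps level `k` to code `k - 1`. The sketch below, on a hypothetical three-row sample, shows that the inferred mapping depends on which levels are present, and that passing an explicit `categories` list is one way to pin it down. The array `stress` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample in which only three of the ten stress levels occur
stress = np.array([[9], [6], [1]])

# Categories inferred from the data: the codes depend on which levels were seen
inferred = OrdinalEncoder().fit_transform(stress)

# Categories fixed up front: level k always maps to code k - 1
fixed = OrdinalEncoder(categories=[list(range(1, 11))]).fit_transform(stress)

print(inferred.ravel(), fixed.ravel())
```

With the categories fixed, an encoder fitted on any subset of the data produces the same codes.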

In [30]:
from sklearn.preprocessing import MinMaxScaler

# Every column except the categorical ones
all_cols = [col for col in X_train.columns if col not in categorical_cols]
all_cols

['Age',
 'Sex',
 'Cholesterol',
 'Heart Rate',
 'Diabetes',
 'Family History',
 'Smoking',
 'Obesity',
 'Alcohol Consumption',
 'Exercise Hours Per Week',
 'Previous Heart Problems',
 'Sedentary Hours Per Day',
 'Triglycerides',
 'Sleep Hours Per Day',
 'Systolic',
 'Diastolic']

In [31]:
# Columns to min-max scale; "Sedentary Hours Per Day" is not listed, so it stays unscaled
continues_col = ["Age", "Cholesterol", "Exercise Hours Per Week", "Sleep Hours Per Day", "Heart Rate", "Triglycerides", "Systolic", "Diastolic"]
# continues_col = ["Age", "Cholesterol", "Heart Rate", "Exercise Hours Per Week", "Triglycerides", "Sedentary Hours Per Day", "Systolic", "Diastolic"]
# bin_cols = ["Sex", "Diabetes", "Family History", "Smoking", "Obesity", "Alcohol Consumption", "Previous Heart Problems"]

In [32]:
mm = MinMaxScaler()
X_train_temp[continues_col] = mm.fit_transform(X_train_temp[continues_col])
X_train_temp.head()

|  | Age | Sex | Cholesterol | Heart Rate | Diabetes | Family History | Smoking | Obesity | Alcohol Consumption | Exercise Hours Per Week | Previous Heart Problems | Stress Level | Sedentary Hours Per Day | Triglycerides | Sleep Hours Per Day | Systolic | Diastolic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6047 | 0.0 | 1 | 0.171429 | 0.185714 | 0 | 0 | 1 | 0 | 1 | 0.092502 | 1 | 8.0 | 0.636354 | 0.358442 | 1.0 | 0.566667 | 0.6 |
| 2582 | 0.736111 | 1 | 0.282143 | 0.328571 | 1 | 0 | 1 | 0 | 0 | 0.757213 | 1 | 5.0 | 9.037672 | 0.164935 | 0.333333 | 0.711111 | 0.48 |
| 5673 | 0.402778 | 1 | 0.632143 | 0.542857 | 1 | 1 | 1 | 0 | 0 | 0.628627 | 0 | 5.0 | 2.797046 | 0.233766 | 0.833333 | 0.444444 | 0.66 |
| 6451 | 0.347222 | 0 | 0.153571 | 0.642857 | 1 | 0 | 1 | 1 | 1 | 0.581712 | 1 | 0.0 | 10.110641 | 0.046753 | 0.166667 | 0.411111 | 0.36 |
| 4889 | 0.791667 | 1 | 0.053571 | 0.2 | 0 | 0 | 1 | 0 | 0 | 0.771638 | 0 | 7.0 | 4.340246 | 0.523377 | 0.166667 | 0.7 | 0.56 |


In [33]:
X_train_temp.shape

(7886, 17)
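
`MinMaxScaler` with its default range maps each value to `(x - min) / (max - min)`. The scaled ages shown earlier (18 becomes 0.0, 71 becomes 0.736111) are consistent with a training-set age range of 18 to 90. That range is an inference from the output, not something read from the data, and the helper `min_max_scale` is a hypothetical stand-in for the scaler.

```python
def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Map value from [lo, hi] onto [0, 1], as MinMaxScaler does with its default range."""
    return (value - lo) / (hi - lo)


# Assumed training-set age range (inferred from the scaled output, not read from the data)
AGE_MIN, AGE_MAX = 18, 90

for age in (18, 71, 47):
    print(age, round(min_max_scale(age, AGE_MIN, AGE_MAX), 6))
```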

In [34]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder


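def create_preprocessor(categorical_cols: list, continuous_cols: list) -> ColumnTransformer:
    """Ordinal-encode `categorical_cols`, min-max scale `continuous_cols`, and pass the other columns through."""
    categorical_xformer = OrdinalEncoder()
    continuous_xformer = MinMaxScaler()

    # Output column order: categorical, then continuous, then the passthrough remainder
    preprocessor = ColumnTransformer(
        transformers=[("cat", categorical_xformer, categorical_cols), ("cont", continuous_xformer, continuous_cols)],
        remainder="passthrough",  # includes the rest of the columns
    )
    return preprocessor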

In [35]:
preprocessor = create_preprocessor(categorical_cols, continues_col)
preprocessor.fit(X_train)

In [36]:
X_train_processed = preprocessor.transform(X_train)

X_test = cols_preprocessor(X_test, cols_to_drop)
X_test_processed = preprocessor.transform(X_test)

In [37]:
X_train_processed.shape

(7886, 17)

### Initial model training

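Class weight calculation: each class is weighted by `total_samples / (2 * class_count)`, so the minority class (1) receives the larger weight.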

In [38]:
total_0 = y_train[y_train == 0].shape[0]
total_1 = y_train[y_train == 1].shape[0]
total_samples = y_train.shape[0]
weight_for_0 = total_samples / (2 * total_0)
weight_for_1 = total_samples / (2 * total_1)

class_weights = {0: weight_for_0, 1: weight_for_1}
class_weights

{0: 0.7823412698412698, 1: 1.3854532677442024}
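
The formula used here is the same heuristic scikit-learn applies for `class_weight="balanced"`: `n_samples / (n_classes * class_count)`. The sketch below reproduces the printed weights from class counts of 5040 and 2846. Those counts are an assumption, back-computed from the weights and the 7886 training rows, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Class counts assumed for illustration; they are back-computed from the printed weights
weights = balanced_class_weights({0: 5040, 1: 2846})
print(weights)
```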

In [39]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier

SCORING = "roc_auc"


def find_initial_best_model():
    """Cross-validate each candidate, score it on the test split, and return the best fitted pipeline."""
    models = [
        ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
        ("AdaBoost Classifier", AdaBoostClassifier(random_state=42)),
        ("Random Forest", RandomForestClassifier(random_state=42, class_weight=class_weights)),
        ("XGBoost Classifier", XGBClassifier(random_state=42)),
        ("Support Vector Machine", SVC(random_state=42)),
        ("Naive Bayes Classifier", GaussianNB()),
    ]

    best_model = None
    best_score = 0.0
    # Use the same folds for every model so the scores are comparable
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Iterate over the models and evaluate their performance
    for name, model in models:
        # create a pipeline for each model
        pipeline = Pipeline([("model", model)])

        # perform cross validation
        mean_roc_auc = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring=SCORING, n_jobs=-1).mean()

        # fit the pipeline on the training data
        pipeline.fit(X_train_processed, y_train)

        # ROC AUC ranks samples, so score with probabilities (or decision values) rather than hard labels
        if hasattr(model, "predict_proba"):
            y_score = pipeline.predict_proba(X_test_processed)[:, 1]
        else:
            y_score = pipeline.decision_function(X_test_processed)
        score = roc_auc_score(y_test, y_score)

        # print the performance metrics
        print("Model", name)
        print(f"Cross Validation {SCORING}: ", mean_roc_auc)
        print("roc_auc_score: ", score)
        print()

        # keep the model with the best test ROC AUC
        if score > best_score:
            best_score = score
            best_model = pipeline

    # Report and return the best model
    print("Best Model: ", best_model)
    return best_model
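
ROC AUC measures how well a model ranks positives above negatives, so it is usually computed from probabilities or decision values rather than from thresholded labels. The sketch below uses made-up scores to show how much information thresholding can discard; `y_true`, `y_score`, and `y_label` are hypothetical.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]
# Hypothetical scores: every positive outranks every negative, but all sit below 0.5
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
# Thresholding at 0.5 collapses them to a single predicted class
y_label = [int(s >= 0.5) for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)  # ranking information is lost
print(auc_from_scores, auc_from_labels)
```

The same model can therefore look perfect or no better than chance depending on what is passed to the metric.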

### MLflow hyperparameter tuning

In [40]:
import mlflow
import optuna

optuna.logging.set_verbosity(optuna.logging.ERROR)

# mlflow.set_tracking_uri("http://localhost:5000")


def get_or_create_experiment(experiment_name) -> str:
    """
    Retrieve the ID of an existing MLflow experiment or create a new one if it doesn't exist.

    This function checks if an experiment with the given name exists within MLflow.
    If it does, the function returns its ID. If not, it creates a new experiment
    with the provided name and returns its ID.

    Parameters:
    - experiment_name (str): Name of the MLflow experiment.

    Returns:
    - str: ID of the existing or newly created MLflow experiment.
    """

    if experiment := mlflow.get_experiment_by_name(experiment_name):
        return experiment.experiment_id
    else:
        return mlflow.create_experiment(experiment_name)


experiment_id = get_or_create_experiment("Finding the claassifier model")
mlflow.set_experiment(experiment_id=experiment_id)
experiment_id

'876583567520604079'

In [41]:
# https://mlflow.org/docs/latest/traditional-ml/hyperparameter-tuning-with-child-runs/notebooks/hyperparameter-tuning-with-child-runs#configure-the-tracking-server-uri

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def champion_callback(study, frozen_trial) -> None:
    """
    Logging callback that will report when a new trial iteration improves upon existing
    best trial values.

    Note: This callback is not intended for use in distributed computing systems such as Spark
    or Ray due to the micro-batch iterative implementation for distributing trials to a cluster's
    workers or agents.
    The race conditions with file system state management for distributed trials will render
    inconsistent values with this callback.
    """

    winner = study.user_attrs.get("winner", None)

    if study.best_value and winner != study.best_value:
        study.set_user_attr("winner", study.best_value)
        if winner:
            improvement_percent = (abs(winner - study.best_value) / study.best_value) * 100
            print(f"Trial {frozen_trial.number} achieved value: {frozen_trial.value} with {improvement_percent: .4f}% improvement")
        else:
            print(f"Initial trial {frozen_trial.number} achieved value: {frozen_trial.value}")


# https://www.youtube.com/watch?v=E2b3SKMw934
# def objective(trial) -> float:
#     with mlflow.start_run(nested=True):
#         # choose algorithm to tune
#         ADABOOST = "AdaBoost Classifier"
#         RANDOM_FOREST = "Random Forest"
#         SCORING = "roc_auc"
#         classifier = trial.suggest_categorical("classifier", [ADABOOST, RANDOM_FOREST])
#         params = {}

#         params["n_estimators"] = trial.suggest_int("n_estimators", 50, 500, 50)
#         if classifier == ADABOOST:
#             params["learning_rate"] = trial.suggest_float("learning_rate", 0.001, 1.0, log=True)
#             model = AdaBoostClassifier(**params, random_state=42)
#         elif classifier == RANDOM_FOREST:
#             params["criterion"] = trial.suggest_categorical("criterion", ["gini", "entropy", "log_loss"])
#             params["class_weight"] = trial.suggest_categorical("class_weight", ["balanced_subsample", class_weights])
#             params["max_depth"] = trial.suggest_int("max_depth", 3, 15)
#             params["min_samples_split"] = trial.suggest_int("min_samples_split", 2, 20)
#             params["min_samples_leaf"] = trial.suggest_int("min_samples_leaf", 1, 20)
#             params["bootstrap"] = trial.suggest_categorical("bootstrap", [True, False])
#             params["max_features"] = trial.suggest_categorical("max_features", ["sqrt", "log2", None])

#             model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
#         score = cross_val_score(model, X_train_processed, y_train, cv=3, scoring=SCORING, n_jobs=-1).mean()

#         params["classifier"] = classifier
#         mlflow.log_params(params)
#         mlflow.log_metric(SCORING, score)

#         # log classification report
#         y_pred = model.fit(X_train_processed, y_train).predict(X_test_processed)
#         report = classification_report(y_test, y_pred, output_dict=True)
#         mlflow.log_dict(report, "classification_report")


#     return score


def objective(trial) -> float:
    with mlflow.start_run(nested=True):
        # Pick the classifier family to tune in this trial
        classifier = trial.suggest_categorical("classifier", ["XGBoost", "AdaBoost", "RandomForest"])

        if classifier == "XGBoost":
            params = {
                "n_estimators": trial.suggest_int("n_estimators", 100, 1000, step=50),
                "max_depth": trial.suggest_int("max_depth", 3, 30),
                "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log=True),
                "subsample": trial.suggest_float("subsample", 0.6, 1.0),
                "scale_pos_weight": trial.suggest_float("scale_pos_weight", 1, 10),  # For imbalance
            }
            model = xgb.XGBClassifier(**params, random_state=42)

        elif classifier == "AdaBoost":
            params = {}
            params["n_estimators"] = trial.suggest_int("n_estimators", 100, 1000, step=50)
            params["learning_rate"] = trial.suggest_float("learning_rate", 0.001, 0.3, log=True)
            model = AdaBoostClassifier(**params, random_state=42)

        elif classifier == "RandomForest":
            params = {}
            params["criterion"] = trial.suggest_categorical("criterion", ["gini", "entropy", "log_loss"])
            params["class_weight"] = trial.suggest_categorical("class_weight", ["balanced_subsample", class_weights])
            params["max_depth"] = trial.suggest_int("max_depth", 3, 30)
            params["min_samples_split"] = trial.suggest_int("min_samples_split", 2, 25)
            params["min_samples_leaf"] = trial.suggest_int("min_samples_leaf", 1, 25)
            params["bootstrap"] = trial.suggest_categorical("bootstrap", [True, False])
            params["max_features"] = trial.suggest_categorical("max_features", ["sqrt", "log2", None])

            model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)

        # stratified K-fold for imbalance
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        score = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring=SCORING, n_jobs=-1).mean()
        params["classifier"] = classifier
        mlflow.log_params(params)
        mlflow.log_metric(SCORING, score)

        # log classification report
        y_pred = model.fit(X_train_processed, y_train).predict(X_test_processed)
        report = classification_report(y_test, y_pred, output_dict=True)
        mlflow.log_dict(report, "classification_report")

    return score

In [42]:
run_name = "Find the best model"
# Initiate the parent run and call the hyperparameter tuning child run logic
with mlflow.start_run(run_name=run_name, nested=True):
    # Initialize the Optuna study
    study = optuna.create_study(direction="maximize")

    # Execute the hyperparameter optimization trials.
    study.optimize(objective, n_trials=250, n_jobs=-1, show_progress_bar=True)
    # study.optimize(objective, n_trials=250, callbacks=[champion_callback], n_jobs=-1, show_progress_bar=True)

    mlflow.log_params(study.best_params)
    mlflow.log_metric(SCORING, study.best_value)

Best trial: 151. Best value: 0.522189:  51%|█████     | 255/500 [06:48<06:32,  1.60s/it]


KeyboardInterrupt: 

In [None]:
# Retrieve the best trial
best_trial = study.best_trial
print("Best trial parameters:", best_trial.params)
print("Best trial ROC AUC score:", best_trial.value)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

random_search = {
    "n_estimators": [100, 200, 300, 400, 500, 600, 800, 1000],  # More variety, higher values for stability
    "criterion": ["gini", "entropy", "log_loss"],  # log_loss for probabilistic outputs (sklearn>=1.1)
    "max_depth": [None, 7, 10, 15, 20],  # None lets trees grow until all leaves are pure
    "min_samples_split": [2, 5, 10, 15, 20],  # Controls overfitting
    "min_samples_leaf": [4, 6, 8, 10],  # Controls overfitting
    "max_features": ["sqrt", "log2", None],  # Feature subset for splitting
    "bootstrap": [True, False],  # Use bootstrapping or not
    "class_weight": ["balanced_subsample", class_weights],  # For imbalanced datasets
}

rf = RandomForestClassifier()

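# Sample 10 parameter combinations, scored with the same metric as the rest of the notebook
rf_random = RandomizedSearchCV(rf, random_search, n_iter=10, cv=3, scoring=SCORING, verbose=2, random_state=42, n_jobs=-1)
# Fit the search; the cells below need a fitted object (best_params_, predict)
rf_random.fit(X_train_processed, y_train)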

In [None]:
best_params = rf_random.best_params_

print("Best Hyperparameters:")
print(best_params)

In [None]:
# save model

import pickle

filename = "rf_model.pkl"

with open(filename, "wb") as f:
    pickle.dump(rf_random, f)

In [None]:
# loading back pickle file
with open(filename, "rb") as f:
    rf_load = pickle.load(f)

## Evaluation

In [None]:
y_pred = rf_load.predict(X_test_processed)
print(classification_report(y_test, y_pred))