### SMOTEENN Resampling:

Reason: Addressing class imbalance (highly imbalanced classes in target variable) by oversampling minority classes and undersampling majority classes simultaneously.
Benefit: Helps in improving model performance by making the classes more balanced, thereby reducing bias towards majority classes and improving predictive accuracy.

### RobustScaler:

Reason: Scaling features using RobustScaler because it is less prone to outliers compared to standard scaling methods like StandardScaler.
Benefit: Ensures that features are on the same scale, which is crucial for models like Balanced Random Forest and XGBoost that rely on distance metrics or gradient-based optimization.

### Model Selection:
Balanced Random Forest Classifier:

Reason: Chosen due to its ability to handle imbalanced datasets naturally through class weighting and sampling techniques.
Benefit: Provides a balanced approach to classification tasks, maintaining robustness against imbalanced classes without requiring extensive data preprocessing.
XGBoost Classifier:

Reason: A powerful gradient boosting algorithm known for its high performance on structured datasets and ability to capture complex interactions in data.
Benefit: Effective in improving predictive accuracy and handling large datasets with high dimensionality, often outperforming traditional ensemble methods.

### Hyperparameter Tuning:
RandomizedSearchCV and GridSearchCV:
Reason: Used to optimize model performance by searching through a specified parameter space and selecting the best hyperparameters.
Benefit: Ensures that the models are fine-tuned to achieve optimal performance on the validation set, improving generalization and reducing overfitting.

### Evaluation Metrics:
Confusion Matrix, Classification Report, ROC AUC Score:
Reason: Metrics chosen to comprehensively evaluate model performance across multiple aspects such as precision, recall, F1-score, and ROC AUC.
Benefit: Provides a balanced view of how well the models classify each class, detect true positives and negatives, and handle class imbalance and uncertainty.

### Conclusion:
#### Objective: The goal was to develop robust classifiers capable of accurately predicting the target variable in a dataset with significant class imbalance.
#### Approach: By combining SMOTEENN resampling for handling imbalance, robust feature scaling, and leveraging ensemble models like Balanced Random Forest and XGBoost, we aimed to maximize predictive accuracy and generalization.
#### Outcome: The models were evaluated using comprehensive metrics to assess their performance, leading to the selection of the best-performing model for deployment.

In [15]:
# Standard Libraries
import joblib

# Data Manipulation and Model Evaluation
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, StratifiedKFold
from scipy.stats import randint

# Model Algorithms
from imblearn.combine import SMOTEENN
from imblearn.ensemble import BalancedRandomForestClassifier
from xgboost import XGBClassifier

In [16]:
def apply_smoteenn(X_train, y_train):
    """
    Applies SMOTEENN resampling technique to balance the training data.

    Parameters:
    - X_train (DataFrame or ndarray): Input features for training.
    - y_train (Series or ndarray): Target labels for training.

    Returns:
    - X_resampled (DataFrame or ndarray): Resampled features.
    - y_resampled (Series or ndarray): Resampled target labels.
    """
    smote_enn = SMOTEENN(random_state=42)
    X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
    return X_resampled, y_resampled

In [28]:
def train_balanced_random_forest(X_train, y_train, X_test, y_test):
    """
    Trains a Balanced Random Forest classifier on the given data.

    Parameters:
    - X_train (DataFrame or ndarray): Training features.
    - y_train (Series or ndarray): Training target labels.
    - X_test (DataFrame or ndarray): Test features.
    - y_test (Series or ndarray): Test target labels.

    Returns:
    - brf (BalancedRandomForestClassifier): Trained classifier.
    - y_pred (Series or ndarray): Predictions on test data.
    """
    # Apply SMOTEENN resampling
    X_resampled, y_resampled = apply_smoteenn(X_train, y_train)
    
    # Scale the data
    scaler = RobustScaler()
    X_train_scaled = scaler.fit_transform(X_resampled)
    X_test_scaled = scaler.transform(X_test)
    
    # Train the model
    brf = BalancedRandomForestClassifier(random_state=42, class_weight='balanced')
    brf.fit(X_train_scaled, y_resampled)
    
    # Predict on test set
    y_pred = brf.predict(X_test_scaled)
    
    # Print evaluation metrics
    print("Balanced Random Forest Classifier:")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nROC AUC Score:")
    print(roc_auc_score(y_test, brf.predict_proba(X_test_scaled), multi_class='ovr'))
    
    return (brf,y_pred,scaler)

In [29]:
def train_xgboost(X_train, y_train, X_test, y_test):
    """
    Trains an XGBoost classifier on the given data.

    Parameters:
    - X_train (DataFrame or ndarray): Training features.
    - y_train (Series or ndarray): Training target labels.
    - X_test (DataFrame or ndarray): Test features.
    - y_test (Series or ndarray): Test target labels.

    Returns:
    - xgb_model (XGBClassifier): Trained classifier.
    - y_pred (Series or ndarray): Predictions on test data.
    """
    # Apply SMOTEENN resampling
    X_resampled, y_resampled = apply_smoteenn(X_train, y_train)
    
    # Scale the data
    scaler = RobustScaler()
    X_train_scaled = scaler.fit_transform(X_resampled)
    X_test_scaled = scaler.transform(X_test)
    
    # Train the model
    xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)
    xgb_model.fit(X_train_scaled, y_resampled)
    
    # Predict on test set
    y_pred = xgb_model.predict(X_test_scaled)
    
    # Print evaluation metrics
    print("\nXGBoost Classifier:")
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nROC AUC Score:")
    print(roc_auc_score(y_test, xgb_model.predict_proba(X_test_scaled), multi_class='ovr'))
    
    return xgb_model, y_pred

In [38]:
def tune_model(X_train, y_train,X_test,y_test):
    """
    Performs hyperparameter tuning using RandomizedSearchCV and GridSearchCV on Balanced Random Forest.

    Parameters:
    - X_train (DataFrame or ndarray): Training features.
    - y_train (Series or ndarray): Training target labels.

    Returns:
    - best_brf (BalancedRandomForestClassifier): Best tuned classifier.
    """
    # Define the parameter distribution for RandomizedSearchCV
    param_dist = {
        'n_estimators': randint(100, 300),
        'max_features': ['auto', 'sqrt', 'log2'],
        'max_depth': [10, 20, 30, None],
        'min_samples_split': randint(2, 10),
        'min_samples_leaf': randint(1, 4)
    }
    
    # Apply SMOTEENN resampling
    X_resampled, y_resampled = apply_smoteenn(X_train, y_train)
    
    # Scale the data
    scaler = RobustScaler()
    X_train_scaled = scaler.fit_transform(X_resampled)
    X_test_scaled = scaler.transform(X_test)
    
    # Create StratifiedKFold object
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    
    # Initialize the classifier
    brf = BalancedRandomForestClassifier(random_state=42, class_weight='balanced')
    
    # Initialize RandomizedSearchCV
    random_search = RandomizedSearchCV(estimator=brf, param_distributions=param_dist, n_iter=50, cv=skf, 
                                       scoring='f1_weighted', n_jobs=-1, random_state=42, verbose=3)
    
    # Fit RandomizedSearchCV
    random_search.fit(X_train_scaled, y_resampled)
    
    # Best parameters from RandomizedSearchCV
    best_params_random = random_search.best_params_
    print("\nBest parameters from RandomizedSearchCV: ", best_params_random)
    
    # Define the refined parameter grid for GridSearchCV
    param_grid_fine = {  
        'n_estimators': [best_params_random['n_estimators']-50, best_params_random['n_estimators'], best_params_random['n_estimators']+50],  
        'max_features': [best_params_random['max_features']],  
        'max_depth': [best_params_random['max_depth']-5, best_params_random['max_depth'], best_params_random['max_depth']+5] if best_params_random['max_depth'] else [None],  
        'min_samples_split': [best_params_random['min_samples_split']-1, best_params_random['min_samples_split'], best_params_random['min_samples_split']+1],  
        'min_samples_leaf': [best_params_random['min_samples_leaf']-1, best_params_random['min_samples_leaf'], best_params_random['min_samples_leaf']+1]  
    }
    
    # Initialize GridSearchCV for refined tuning
    grid_search_fine = GridSearchCV(estimator=brf, param_grid=param_grid_fine, cv=skf, 
                                    scoring='f1_weighted', n_jobs=-1, verbose=3)
    
    # Fit GridSearchCV
    grid_search_fine.fit(X_train_scaled, y_resampled)
    
    # Best parameters and estimator from GridSearchCV
    best_params_fine = grid_search_fine.best_params_
    print("\nBest parameters from GridSearchCV: ", best_params_fine)
    best_brf = grid_search_fine.best_estimator_
    
    return best_brf,scaler,X_test_scaled

In [39]:
def save_model_and_scaler(model, scaler, model_filename='best_model.pkl', scaler_filename='scaler.pkl'):
    """
    Saves the trained model and scaler to disk.

    Parameters:
    - model (object): Trained model object.
    - scaler (object): Scaler object used for preprocessing.
    - model_filename (str): Filename for saving the model (default: 'best_model.pkl').
    - scaler_filename (str): Filename for saving the scaler (default: 'scaler.pkl').
    """
    joblib.dump(model, model_filename)
    joblib.dump(scaler, scaler_filename)
    print(f"Model saved as {model_filename}")
    print(f"Scaler saved as {scaler_filename}")

In [40]:
def load_model_and_scaler(model_filename='best_model.pkl', scaler_filename='scaler.pkl'):
    """
    Loads a trained model and scaler from disk.

    Parameters:
    - model_filename (str): Filename of the saved model (default: 'best_model.pkl').
    - scaler_filename (str): Filename of the saved scaler (default: 'scaler.pkl').

    Returns:
    - model (object): Loaded model object.
    - scaler (object): Loaded scaler object.
    """
    model = joblib.load(model_filename)
    scaler = joblib.load(scaler_filename)
    print(f"Loaded model from {model_filename}")
    print(f"Loaded scaler from {scaler_filename}")
    return model, scaler

In [41]:
def evaluate_model(model, X_test_scaled, y_test):
    """Evaluate models: print confusion matrix, classification report, and ROC AUC score."""
    y_pred = model.predict(X_test_scaled)
    
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    if hasattr(model, "predict_proba"):
        print("\nROC AUC Score:")
        print(roc_auc_score(y_test, model.predict_proba(X_test_scaled), multi_class='ovr'))


In [42]:
def main():
    # Read X_train, y_train, X_test, y_test from saved files
    # Reading X_train, X_test
    X_train = pd.read_csv('X_train.csv')
    X_test = pd.read_csv('X_test.csv')

    # Reading y_train, y_test (assuming headers are present)
    y_train = pd.read_csv('y_train.csv') # Read as Series
    y_test = pd.read_csv('y_test.csv')
    # Train models and perform evaluations
    brf, y_pred_brf,scaler = train_balanced_random_forest(X_train, y_train, X_test, y_test)
    xgb_model, y_pred_xgb = train_xgboost(X_train, y_train, X_test, y_test)
    # Hyperparameter tuning
    best_brf,scaler,X_test_scaled = tune_model(X_train, y_train,X_test,y_test)
    
    # Save the best model and scaler
    save_model_and_scaler(best_brf, scaler)
    
    # Optionally, load the model and scaler
    loaded_model, loaded_scaler = load_model_and_scaler()
    evaluate_model(loaded_model,X_test_scaled,y_test)
    print("\nPredictions using loaded model:")

In [43]:
if __name__ == "__main__":
    main()

  warn(
  warn(
  warn(
  return fit_method(estimator, *args, **kwargs)


Balanced Random Forest Classifier:
Confusion Matrix:
[[4627  219    1  102   51]
 [  26   20    0    3    2]
 [   2    0   28    0    0]
 [  17    7    0    1    1]
 [   8    5    0    2    0]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.93      0.96      5000
           1       0.08      0.39      0.13        51
           2       0.97      0.93      0.95        30
           3       0.01      0.04      0.01        26
           4       0.00      0.00      0.00        15

    accuracy                           0.91      5122
   macro avg       0.41      0.46      0.41      5122
weighted avg       0.97      0.91      0.94      5122


ROC AUC Score:
0.8227358345324424


Parameters: { "use_label_encoder" } are not used.




XGBoost Classifier:
Confusion Matrix:
[[4501  264    6  150   79]
 [  21   20    1    6    3]
 [   1    0   28    0    1]
 [  14    7    0    3    2]
 [   7    5    0    2    1]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.90      0.94      5000
           1       0.07      0.39      0.12        51
           2       0.80      0.93      0.86        30
           3       0.02      0.12      0.03        26
           4       0.01      0.07      0.02        15

    accuracy                           0.89      5122
   macro avg       0.38      0.48      0.39      5122
weighted avg       0.97      0.89      0.93      5122


ROC AUC Score:
0.8193223876786785
Fitting 3 folds for each of 50 candidates, totalling 150 fits


81 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
23 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\sklearn\base.py", line 1145, in wrapper
    estimator._validate_params()
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\imblearn\base.py", line 42, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\imblearn\utils\_param_validation.py",


Best parameters from RandomizedSearchCV:  {'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 149}
Fitting 3 folds for each of 81 candidates, totalling 243 fits


81 fits failed out of a total of 243.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
81 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\sklearn\base.py", line 1145, in wrapper
    estimator._validate_params()
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\imblearn\base.py", line 42, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\E009819\AppData\Local\miniconda3\lib\site-packages\imblearn\utils\_param_validation.py",


Best parameters from GridSearchCV:  {'max_depth': 25, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 99}
Model saved as best_model.pkl
Scaler saved as scaler.pkl
Loaded model from best_model.pkl
Loaded scaler from scaler.pkl
Confusion Matrix:
[[4618  225    1  101   55]
 [  29   18    0    2    2]
 [   3    0   27    0    0]
 [  17    6    0    2    1]
 [   8    3    0    2    2]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.92      0.95      5000
           1       0.07      0.35      0.12        51
           2       0.96      0.90      0.93        30
           3       0.02      0.08      0.03        26
           4       0.03      0.13      0.05        15

    accuracy                           0.91      5122
   macro avg       0.42      0.48      0.42      5122
weighted avg       0.97      0.91      0.94      5122


ROC AUC Score:
0.8247766033213548

Predictions using loaded m