<a href="https://colab.research.google.com/github/tsholofelo-mokheleli/ACIS-2023-New-Zealand/blob/main/Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Imbalance Techniques Experiment + Cross Validation**

**Cross-validation**

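* Assesses the model's performance on several train/validation splits rather than a single one.
* Helps detect overfitting, because every fold is scored on data the model was not trained on.
* Provides a more reliable estimate of the model's accuracy on unseen data.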

When performing cross-validation, the main results you obtain are the mean cross-validation accuracy and the standard deviation of cross-validation accuracy. These two metrics provide valuable information about the model's performance and its consistency across different folds.

1. **Mean Cross-Validation Accuracy:** This value represents the average accuracy of the model across all the cross-validation folds. It gives you an estimate of how well the model is likely to perform on unseen data.

2. **Standard Deviation of Cross-Validation Accuracy:** The standard deviation provides a measure of how much the accuracy scores vary across the different folds. A smaller standard deviation indicates that the model's performance is relatively consistent across folds, while a larger standard deviation may suggest that the model's performance is more sensitive to the particular training and validation splits.

In summary, these two metrics together give you an understanding of the model's average performance and its consistency, which is important in assessing the model's generalization capabilities and making informed decisions about its deployment.
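To make the two summary numbers concrete, here is a minimal sketch using only the standard library. The fold scores are hypothetical values chosen for illustration, not results from this notebook. Note that `np.std` computes the population standard deviation by default (`ddof=0`), which `statistics.pstdev` mirrors; the sample version (`ddof=1`) is somewhat larger when there are only five folds.

```python
import statistics

# Hypothetical per-fold accuracies, for illustration only
fold_scores = [0.90, 0.88, 0.91, 0.87, 0.89]

# Average accuracy across the folds
mean_acc = statistics.fmean(fold_scores)

# Population standard deviation: matches np.std(fold_scores) with its default ddof=0
pop_std = statistics.pstdev(fold_scores)

# Sample standard deviation: matches np.std(fold_scores, ddof=1)
sample_std = statistics.stdev(fold_scores)

print(f"mean={mean_acc:.3f}  pop_std={pop_std:.4f}  sample_std={sample_std:.4f}")
```

Whichever version is reported, it should be stated explicitly, since the two differ noticeably for a small number of folds.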

In [42]:
# Load the libraries

import pandas  as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import psutil
import os

# Warning filter

import warnings
warnings.filterwarnings('ignore')
cmap=sns.color_palette('Blues_r')

# Metrics

from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score, confusion_matrix, balanced_accuracy_score
from imblearn.metrics import geometric_mean_score

# Preprocessing

from sklearn.preprocessing import LabelEncoder

# Algorithms

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from xgboost import XGBClassifier
import lightgbm as lgb

# Ensemble Methods

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

# Class imbalance

from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

from sklearn.utils import resample

# Plot Theme

sns.set_theme(style="darkgrid")
plt.style.use("ggplot")

**Load Data**

In [43]:
data = pd.read_csv("KNN Imputation Dataset.csv")

**Label Encode, Drop null from target var, and Convert data to Int**

In [44]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Lowercase country names so that case variants map to the same label
data['country'] = data['country'].apply(lambda x: x.lower() if isinstance(x, str) else x)

# Fit and transform the countries data
data["country"] = label_encoder.fit_transform(data['country'])

# Drop rows that still contain missing values
data = data.dropna()

# Convert all columns to int data type
data = data.astype('int64')

In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2189 entries, 0 to 3268
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   self_employed             2189 non-null   int64
 1   no_employees              2189 non-null   int64
 2   tech_company              2189 non-null   int64
 3   company_role              2189 non-null   int64
 4   benefits                  2189 non-null   int64
 5   care_options              2189 non-null   int64
 6   wellness_program          2189 non-null   int64
 7   seek_help                 2189 non-null   int64
 8   anonymity                 2189 non-null   int64
 9   leave                     2189 non-null   int64
 10  mental_importance         2189 non-null   int64
 11  neg_consequence_coworker  2189 non-null   int64
 12  discuss_mh                2189 non-null   int64
 13  work_interfere            2189 non-null   int64
 14  coworkers                 2189 non-null 

## **Split Dataset**

In [46]:
X = data.drop(["mental_health_diagnosed"], axis=1)
y = data['mental_health_diagnosed']

**Perform cross-validation**

In [47]:
# Set the number of folds for cross-validation
num_folds = 5

# Create a cross-validation object (KFold)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=42)
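
`KFold` splits rows without looking at the labels, so on an imbalanced target the class ratio can drift from fold to fold. scikit-learn also provides `StratifiedKFold`, which keeps that ratio roughly constant. The following is a simplified, standard-library sketch of the idea on assumed toy labels; it is not scikit-learn's implementation, it does no shuffling, and the function name `stratified_fold_ids` is hypothetical.

```python
from collections import defaultdict


def stratified_fold_ids(labels, n_folds):
    """Assign each sample a fold id so every fold keeps roughly the overall class ratio."""
    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the samples of each class round-robin across the folds
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


# Hypothetical labels with a 3:1 imbalance
labels = [1] * 15 + [0] * 5
folds = stratified_fold_ids(labels, n_folds=5)

per_fold_counts = []
for k in range(5):
    members = [labels[i] for i in range(len(labels)) if folds[i] == k]
    per_fold_counts.append((members.count(1), members.count(0)))
    print(f"fold {k}: class 1 = {members.count(1)}, class 0 = {members.count(0)}")
```

The results reported in this notebook use the plain `KFold` object defined above; the sketch only illustrates an alternative splitting strategy.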

## **Baseline Model**

In [48]:
# Define the baseline classifiers (name -> estimator)
classifiers = {
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression()
}

In [49]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in classifiers.items():
  # Perform cross-validation and get the evaluation scores for each fold
  cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- Support Vector Machine ---
Mean Cross-Validation Accuracy: 0.664
Standard Deviation of Cross-Validation Accuracy: 0.01


--- Naive Bayes ---
Mean Cross-Validation Accuracy: 0.888
Standard Deviation of Cross-Validation Accuracy: 0.021


--- Decision Tree ---
Mean Cross-Validation Accuracy: 0.874
Standard Deviation of Cross-Validation Accuracy: 0.011


--- Logistic Regression ---
Mean Cross-Validation Accuracy: 0.89
Standard Deviation of Cross-Validation Accuracy: 0.009
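
Because the target is imbalanced, accuracy is easier to interpret next to what a trivial majority-class predictor would score. The sketch below uses hypothetical labels and hand-written helper functions (not the scikit-learn metrics imported earlier) to contrast plain accuracy with balanced accuracy, which is the mean of per-class recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical data: 80% of samples belong to class 1
y_true = [1] * 8 + [0] * 2

# A trivial model that always predicts the majority class
y_pred = [1] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

A model that ignores the minority class entirely still reaches high accuracy here, while its balanced accuracy sits at chance level.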




## **Apply Imbalance Techniques**

### **Resampling Methods**

##### **1. Synthetic Minority Over-sampling Technique (SMOTE)**

In [50]:
# Initialize SMOTE (the classifiers were defined in the baseline section)
smote = SMOTE(random_state=42)

# Create a function for cross-validation with SMOTE
def cross_val_with_smote(model, X, y, cv):
    cv_scores = []
    for train_idx, test_idx in cv.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Apply SMOTE to the training data
        X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

        # Fit the model on the resampled training data
        model.fit(X_train_resampled, y_train_resampled)

        # Evaluate the model on the test data
        score = model.score(X_test, y_test)
        cv_scores.append(score)

    return cv_scores
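
SMOTE builds each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority-class neighbours. The following is a minimal sketch of that single interpolation step, with made-up points and a hypothetical helper name `smote_point`; the neighbour search and all sampling logic of the `imblearn` implementation are omitted.

```python
import random


def smote_point(sample, neighbor, rng):
    """Return a point on the line segment between a minority sample and its neighbour."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(42)

# Hypothetical minority-class sample and one of its minority-class neighbours
minority_sample = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = smote_point(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Since every feature in this dataset is stored as an integer code, interpolated values will generally be fractional; that is worth keeping in mind when interpreting the resampled training folds.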

In [51]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in classifiers.items():
  # Perform cross-validation with SMOTE and get the evaluation scores for each fold
  cv_scores = cross_val_with_smote(clf, X, y, cv=kfold)

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- Support Vector Machine ---
Mean Cross-Validation Accuracy: 0.753
Standard Deviation of Cross-Validation Accuracy: 0.012


--- Naive Bayes ---
Mean Cross-Validation Accuracy: 0.884
Standard Deviation of Cross-Validation Accuracy: 0.018


--- Decision Tree ---
Mean Cross-Validation Accuracy: 0.858
Standard Deviation of Cross-Validation Accuracy: 0.01


--- Logistic Regression ---
Mean Cross-Validation Accuracy: 0.886
Standard Deviation of Cross-Validation Accuracy: 0.012




##### **2. Adaptive Synthetic Sampling (ADASYN)**

In [52]:
# Initialize ADASYN (the classifiers were defined in the baseline section)
adasyn = ADASYN(random_state=42)

# Create a function for cross-validation with ADASYN
def cross_val_with_adasyn(model, X, y, cv):
    cv_scores = []
    for train_idx, test_idx in cv.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Apply ADASYN to the training data
        X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)

        # Fit the model on the resampled training data
        model.fit(X_train_resampled, y_train_resampled)

        # Evaluate the model on the test data
        score = model.score(X_test, y_test)
        cv_scores.append(score)

    return cv_scores
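
ADASYN differs from SMOTE mainly in where it places synthetic points: minority samples whose neighbourhoods contain more majority-class samples receive more of them. The sketch below shows only that allocation step, assuming the neighbour counts are already known; the counts, the helper name `adasyn_allocation`, and the rounding rule are illustrative assumptions rather than the `imblearn` code.

```python
def adasyn_allocation(majority_neighbor_counts, k, n_synthetic):
    """Split n_synthetic new samples across minority points by local difficulty."""
    # Share of majority-class samples among each point's k nearest neighbours
    ratios = [count / k for count in majority_neighbor_counts]
    total = sum(ratios)
    if total == 0:
        # No minority point has majority neighbours, so nothing is prioritised
        return [0] * len(ratios)
    # Normalise the ratios and scale them to the number of samples to generate
    return [round(n_synthetic * r / total) for r in ratios]


# Hypothetical: three minority points with 0, 1 and 4 majority neighbours out of k=5
counts = [0, 1, 4]
allocation = adasyn_allocation(counts, k=5, n_synthetic=10)
print(allocation)
```

The point surrounded mostly by majority samples receives the largest share, which is the "adaptive" part of the method.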

In [53]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in classifiers.items():
  # Perform cross-validation with ADASYN and get the evaluation scores for each fold
  cv_scores = cross_val_with_adasyn(clf, X, y, cv=kfold)

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- Support Vector Machine ---
Mean Cross-Validation Accuracy: 0.744
Standard Deviation of Cross-Validation Accuracy: 0.01


--- Naive Bayes ---
Mean Cross-Validation Accuracy: 0.881
Standard Deviation of Cross-Validation Accuracy: 0.016


--- Decision Tree ---
Mean Cross-Validation Accuracy: 0.859
Standard Deviation of Cross-Validation Accuracy: 0.012


--- Logistic Regression ---
Mean Cross-Validation Accuracy: 0.886
Standard Deviation of Cross-Validation Accuracy: 0.013




##### **3. Tomek Links Under Sampling (TLUS)**

In [54]:
# Initialize TomekLinks (the classifiers were defined in the baseline section)
tl = TomekLinks()

# Create a function for cross-validation with TomekLinks
def cross_val_with_tomek(model, X, y, cv):
    cv_scores = []
    for train_idx, test_idx in cv.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Apply TomekLinks to the training data
        X_train_resampled, y_train_resampled = tl.fit_resample(X_train, y_train)

        # Fit the model on the resampled training data
        model.fit(X_train_resampled, y_train_resampled)

        # Evaluate the model on the test data
        score = model.score(X_test, y_test)
        cv_scores.append(score)

    return cv_scores
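
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; under-sampling then removes the majority-class member of each pair. The following brute-force sketch finds such pairs on a handful of made-up points. The helper names are hypothetical and this is not how `imblearn` implements the search.

```python
import math


def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        # i < j keeps each pair from being reported twice
        if labels[i] != labels[j] and nearest(j, points) == i and i < j:
            links.append((i, j))
    return links


# Hypothetical 2-D points and class labels
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 1]

links = tomek_links(points, labels)
print(links)
```

Only the first two points form a link: they are mutual nearest neighbours and carry different labels. Tomek-link removal typically deletes few samples, so it cleans the class boundary rather than balancing the classes.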

In [55]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in classifiers.items():
  # Perform cross-validation with TomekLinks and get the evaluation scores for each fold
  cv_scores = cross_val_with_tomek(clf, X, y, cv=kfold)

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- Support Vector Machine ---
Mean Cross-Validation Accuracy: 0.672
Standard Deviation of Cross-Validation Accuracy: 0.015


--- Naive Bayes ---
Mean Cross-Validation Accuracy: 0.884
Standard Deviation of Cross-Validation Accuracy: 0.024


--- Decision Tree ---
Mean Cross-Validation Accuracy: 0.862
Standard Deviation of Cross-Validation Accuracy: 0.006


--- Logistic Regression ---
Mean Cross-Validation Accuracy: 0.889
Standard Deviation of Cross-Validation Accuracy: 0.01




##### **4. Near Miss Under Sampling (NMUS)**

In [56]:
# Initialize NearMiss (the classifiers were defined in the baseline section)
nm = NearMiss()

# Create a function for cross-validation with NearMiss
def cross_val_with_nearmiss(model, X, y, cv):
    cv_scores = []
    for train_idx, test_idx in cv.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Apply NearMiss to the training data
        X_train_resampled, y_train_resampled = nm.fit_resample(X_train, y_train)

        # Fit the model on the resampled training data
        model.fit(X_train_resampled, y_train_resampled)

        # Evaluate the model on the test data
        score = model.score(X_test, y_test)
        cv_scores.append(score)

    return cv_scores
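
NearMiss keeps only the majority-class samples that lie close to the minority class. A rough sketch of the NearMiss-1 rule follows: rank each majority sample by its average distance to its `k` closest minority samples and keep the closest ones. The points and the helper name `near_miss_1` are assumptions for illustration, not the `imblearn` implementation.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance to their k nearest minority points."""

    def avg_dist_to_closest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        closest = dists[:k]
        return sum(closest) / len(closest)

    ranked = sorted(majority, key=avg_dist_to_closest_minority)
    return ranked[:n_keep]


# Hypothetical points on a line
majority = [(0.0, 0.0), (1.0, 0.0), (4.5, 0.0), (9.0, 0.0)]
minority = [(2.0, 0.0), (3.0, 0.0)]

# Under-sample the majority class down to the size of the minority class
kept = near_miss_1(majority, minority, n_keep=len(minority), k=2)
print(kept)
```

Because this rule discards majority samples far from the boundary, it can throw away a large part of the training data, which is one plausible reason for the lower accuracies in the outputs that follow.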

In [57]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in classifiers.items():
  # Perform cross-validation with Near Miss and get the evaluation scores for each fold
  cv_scores = cross_val_with_nearmiss(clf, X, y, cv=kfold)

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- Support Vector Machine ---
Mean Cross-Validation Accuracy: 0.648
Standard Deviation of Cross-Validation Accuracy: 0.014


--- Naive Bayes ---
Mean Cross-Validation Accuracy: 0.81
Standard Deviation of Cross-Validation Accuracy: 0.019


--- Decision Tree ---
Mean Cross-Validation Accuracy: 0.762
Standard Deviation of Cross-Validation Accuracy: 0.008


--- Logistic Regression ---
Mean Cross-Validation Accuracy: 0.828
Standard Deviation of Cross-Validation Accuracy: 0.009




### **Ensemble Techniques**

1. **Random Forest**

*Ensemble methods such as Random Forest combine many decision trees, each trained on a bootstrap sample, into a single model. Averaging over many trees reduces variance and usually makes the model less prone to overfitting than a single tree, which can help on the minority class. A standard Random Forest does not reweight the classes unless asked to, for example through its `class_weight` parameter.*

2. **Boosting Algorithms**

*Boosting algorithms train weak learners sequentially and give more weight to the instances that earlier learners misclassified. Minority-class samples are often among the misclassified ones, so this can shift attention toward them. One popular boosting algorithm is AdaBoost (Adaptive Boosting).*

**Random Forest can cope with class imbalance for the following reasons:**

1. **Bootstrap Aggregating (Bagging):** Random Forest trains each tree on a bootstrap sample, drawn from the training data with replacement. The class mix varies from sample to sample, so some trees see a larger share of the minority class than others, which adds diversity to the ensemble.

2. **Feature Randomness:** Random Forest considers only a random subset of features at each split. This decorrelates the trees and keeps a few strong features from dominating every tree, which can improve performance on the minority class.

3. **Voting Ensemble:** At prediction time, the trees in the forest vote on the final class. Because each tree learned from a different bootstrap sample, the vote is less sensitive to the quirks of any single sample.

4. **Out-of-Bag (OOB) Samples:** Each tree leaves out the training samples that were not drawn into its bootstrap sample. Scoring every sample only with the trees that never saw it gives a built-in performance estimate without a separate validation set.

Together, these mechanisms make Random Forest a reasonable candidate for imbalanced problems, although its performance on the minority class should still be checked with metrics other than accuracy.
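
The bootstrap and out-of-bag ideas can be sketched in a few lines. This is a standard-library illustration with an assumed sample count, not scikit-learn's implementation: draw as many row indices as there are rows, with replacement, and see which rows were never drawn.

```python
import random

rng = random.Random(0)
n_samples = 1000  # hypothetical training-set size

# Bootstrap sample for one tree: n_samples draws with replacement
bootstrap = [rng.randrange(n_samples) for _ in range(n_samples)]

# Out-of-bag rows for that tree: indices that were never drawn
out_of_bag = set(range(n_samples)) - set(bootstrap)

oob_fraction = len(out_of_bag) / n_samples
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large samples the out-of-bag fraction approaches 1/e, roughly 37%, so each tree is evaluated on a substantial share of rows it never trained on.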

**AdaBoost can cope with class imbalance for the following reasons:**

1. **Weighted Voting:** Each weak learner (typically a shallow decision tree) receives a vote weight based on its weighted error on the training data, so more accurate learners count for more. Separately, misclassified instances receive higher sample weights, which makes subsequent weak learners focus on correcting those errors.

2. **Iterative Learning:** AdaBoost trains weak learners one after another, and at each iteration it gives more attention to the instances misclassified in the previous iteration. This way, difficult-to-classify samples receive more emphasis during the learning process.

3. **Ensemble Aggregation:** The final prediction is made by aggregating the predictions of all weak learners, with more weight given to the ones that performed better on the training data.

4. **Robustness:** By repeatedly adapting to difficult samples, AdaBoost can improve on minority-class samples that earlier learners missed, although the same mechanism makes it sensitive to noisy labels and outliers.

Overall, AdaBoost addresses class imbalance indirectly, by focusing on misclassified instances rather than by reweighting classes. In this experiment it is evaluated on the same cross-validation folds as the baseline classifiers.
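
One round of the reweighting described above can be worked through by hand. The sketch follows the classic binary AdaBoost update with five equally weighted toy samples; it is an illustration of the textbook rule under assumed inputs, not the exact variant implemented by `AdaBoostClassifier`.

```python
import math

# Five hypothetical training samples, initially weighted equally
weights = [0.2] * 5

# Suppose the current weak learner misclassifies only the third sample
misclassified = [False, False, True, False, False]

# Weighted error of the weak learner
error = sum(w for w, m in zip(weights, misclassified) if m)

# Vote weight of the learner: larger when its weighted error is smaller
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weights of misclassified samples, lower the rest, then normalise
updated = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
total = sum(updated)
weights = [w / total for w in updated]

print(f"error={error:.2f}  alpha={alpha:.3f}")
print([round(w, 3) for w in weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner cannot afford to ignore it.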

In [58]:
# Define the ensemble classifiers (name -> estimator)
EnsembleClassifiers = {
    "Adaptive Boosting": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

In [59]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in EnsembleClassifiers.items():
  # Perform cross-validation and get the evaluation scores for each fold
  cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- Adaptive Boosting ---
Mean Cross-Validation Accuracy: 0.915
Standard Deviation of Cross-Validation Accuracy: 0.011


--- Random Forest ---
Mean Cross-Validation Accuracy: 0.919
Standard Deviation of Cross-Validation Accuracy: 0.009




### **Algorithm-specific Methods**

XGBoost (Extreme Gradient Boosting) is another boosting algorithm, a regularized implementation of gradient boosting that exposes parameters aimed specifically at class imbalance.

XGBoost and LightGBM are grouped under algorithm-specific methods because each provides a parameter for class imbalance: `scale_pos_weight` in XGBoost and `is_unbalance` in LightGBM.



**XGBoost can address class imbalance for the following reasons:**

**scale_pos_weight:** This parameter scales the weight of the positive class (label 1) during boosting. When the positive class is the minority, a value above 1 gives it more influence and counteracts the bias towards the majority class. The XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value.

**Regularization:** XGBoost uses L1 and L2 regularization to limit overfitting, which can help on imbalanced data by discouraging trees that fit the majority class too closely.

**Gradient-based Optimization:** Each new tree is fit to the gradients of the loss, which are largest for the samples the current ensemble predicts poorly, so difficult-to-classify samples receive more attention during boosting.

Overall, XGBoost is a reasonable choice for imbalanced data because it offers a built-in way to reweight the classes while retaining the usual strengths of gradient boosting.
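
As an illustration of the `scale_pos_weight` heuristic, the ratio can be derived from the label counts instead of being typed in by hand. The sketch below is a standard-library example on hypothetical labels; the helper name `suggested_scale_pos_weight` and the assumption that label 1 is the positive class are mine, not something the notebook defines.

```python
from collections import Counter


def suggested_scale_pos_weight(labels, positive_label=1):
    """Return n_negative / n_positive, the heuristic ratio for scale_pos_weight."""
    counts = Counter(labels)
    n_pos = counts[positive_label]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos


# Hypothetical labels: the positive class (1) is the minority
minority_positive = [0] * 30 + [1] * 10
ratio_minority = suggested_scale_pos_weight(minority_positive)
print(ratio_minority)  # above 1: the positive class is up-weighted

# Hypothetical labels: the positive class (1) is the majority
majority_positive = [0] * 10 + [1] * 30
ratio_majority = suggested_scale_pos_weight(majority_positive)
print(ratio_majority)  # below 1: the positive class is down-weighted
```

Computing the ratio from the training labels of each fold, rather than from the full dataset, would also keep validation data from influencing the weight.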

In [60]:
# Create and set up the XGBoost classifier
# scale_pos_weight scales the weight of the positive class (label 1) during boosting
# The XGBoost documentation suggests sum(negative instances) / sum(positive instances)
# The class counts below are hard-coded; a ratio below 1 down-weights the positive class,
# which matches that suggestion only if label 1 is the majority class

scale_pos_weight = 592 / 1159

# Create and set up the LightGBM classifier
# You can use the is_unbalance parameter to handle class imbalance by automatically setting the positive (minority) class weight


AlgorithmSpecific = {
    "XGBoost": XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=42),
    "LightGBM": lgb.LGBMClassifier(is_unbalance=True, random_state=42)
}

In [61]:
# Dictionary to store evaluation metrics
results = {}

# Loop through each classifier
for name, clf in AlgorithmSpecific.items():
  # Perform cross-validation and get the evaluation scores for each fold
  cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')

  # Calculate the mean and standard deviation of the cross-validation scores
  mean_cv_score = np.mean(cv_scores)
  std_cv_score = np.std(cv_scores)

  results[name] = {
      "MeanCVAccuracy": mean_cv_score,
      "StdCVAccuracy": std_cv_score
  }


# Display the results
for name, metrics in results.items():
  print(f"--- {name} ---")
  print("Mean Cross-Validation Accuracy:", round(metrics["MeanCVAccuracy"], 3))
  print("Standard Deviation of Cross-Validation Accuracy:", round(metrics["StdCVAccuracy"], 3))
  print("\n")

--- XGBoost ---
Mean Cross-Validation Accuracy: 0.91
Standard Deviation of Cross-Validation Accuracy: 0.011


--- LightGBM ---
Mean Cross-Validation Accuracy: 0.918
Standard Deviation of Cross-Validation Accuracy: 0.009


