# Uncovering Healthcare Inefficiencies - Model Building and Evaluation

This notebook focuses on building, training, and evaluating various models to determine the best performing model for our dataset.

The models included in this notebook are:

1. **Logistic Regression**: Used as the baseline model.
2. **Recurrent Neural Network (RNN)**: For capturing temporal dependencies.
3. **Convolutional Neural Network (CNN)**: For capturing spatial hierarchies.
4. **DBSCAN**: Unsupervised clustering to identify clusters and noise.

Each model undergoes the following steps:

1. **Data Preprocessing**: Standardizing and preparing data.
2. **Model Building**: Constructing model architecture.
3. **Model Training**: Training the model.
4. **Model Evaluation**: Assessing performance.
5. **Results Analysis**: Comparing results to determine the best model.


The objective is to identify the model that yields the best results in terms of accuracy and other relevant metrics. 

## Import Libraries

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, silhouette_score
from sklearn.preprocessing import PowerTransformer, RobustScaler, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.base import BaseEstimator, ClusterMixin
from xgboost import XGBClassifier

import joblib

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # suppress FutureWarning messages

In [2]:
# Check current working directory
current_directory = os.getcwd()
print("Current working directory:", current_directory)

Current working directory: /Users/amyou/Desktop/ADS 599 Capstone/healthcare-market-saturation-fraud


## Import data from preprocessing notebook

In [3]:
# Read in data
# all the training/validation/test dataframes
x_train = pd.read_csv('data/x_train.csv') 
x_train_scaled = pd.read_csv('data/x_train_scaled.csv')
x_train_pca = pd.read_csv('data/x_train_pca.csv')
x_train_scaled_pca = pd.read_csv('data/x_train_scaled_pca.csv')

x_val = pd.read_csv('data/x_val.csv') 
x_val_scaled = pd.read_csv('data/x_val_scaled.csv')
x_val_pca = pd.read_csv('data/x_val_pca.csv')
x_val_scaled_pca = pd.read_csv('data/x_val_scaled_pca.csv')

x_test = pd.read_csv('data/x_test.csv')
x_test_scaled = pd.read_csv('data/x_test_scaled.csv')
x_test_pca = pd.read_csv('data/x_test_pca.csv')
x_test_scaled_pca = pd.read_csv('data/x_test_scaled_pca.csv')


# all the labels
y_train = np.ravel(pd.read_csv('data/y_train.csv'))
y_val = np.ravel(pd.read_csv('data/y_val.csv'))
y_test = np.ravel(pd.read_csv('data/y_test.csv'))

## Data Transformation

### Yeo-Johnson transformation of data

We created additional versions of the data to see whether they change modeling performance. The first applies a Yeo-Johnson transformation to the numeric columns; the second applies the same transformation followed by scaling.
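As a minimal sketch of what the Yeo-Johnson transformation does, the example below applies scikit-learn's `PowerTransformer` to synthetic right-skewed data. The lognormal samples and the variable names are illustrative assumptions, not the project's data. The transformer is fitted on the training split only and the fitted parameters are reused on the validation split.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed data standing in for one numeric feature (assumption)
rng = np.random.default_rng(42)
train_col = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))
val_col = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Fit on the training split only, then reuse the fitted parameters on validation
pt = PowerTransformer(method='yeo-johnson')
train_trans = pt.fit_transform(train_col)
val_trans = pt.transform(val_col)

# Skewness should move toward zero after the transformation
skew_before = skew(train_col.ravel())
skew_after = skew(train_trans.ravel())
print(f'Skewness before: {skew_before:.2f}, after: {skew_after:.2f}')
```

Comparing skewness before and after is one quick way to check whether the transformation is doing anything useful for a given column.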

In [4]:
# transformed data
# create copy of df 
x_train_transformed = x_train.copy()
x_val_transformed = x_val.copy()
x_test_transformed = x_test.copy()

# get numeric columns
numeric_columns = x_train_transformed.select_dtypes(include=['float']).columns

# Fit the Yeo-Johnson transformer on the training data only, then apply the
# same fitted parameters to the validation and test sets. Fitting a separate
# transformer on each split would give each split a different transformation.
# PowerTransformer estimates one lambda per column, so all numeric columns
# can be transformed in a single call.
pt = PowerTransformer(method='yeo-johnson')
x_train_transformed[numeric_columns] = pt.fit_transform(x_train_transformed[numeric_columns])
x_val_transformed[numeric_columns] = pt.transform(x_val_transformed[numeric_columns])
x_test_transformed[numeric_columns] = pt.transform(x_test_transformed[numeric_columns])




### Yeo-Johnson transformed + scaled data

In [5]:
x_train_trans_scaled = x_train_transformed.copy()
x_val_trans_scaled = x_val_transformed.copy()
x_test_trans_scaled = x_test_transformed.copy()

scaler = RobustScaler()
x_train_trans_scaled[numeric_columns] = scaler.fit_transform(x_train_trans_scaled[numeric_columns])
x_val_trans_scaled[numeric_columns] = scaler.transform(x_val_trans_scaled[numeric_columns])
x_test_trans_scaled[numeric_columns] = scaler.transform(x_test_trans_scaled[numeric_columns])

## Baseline Model Selection - Logistic Regression

We start by fitting a baseline model for comparison against the other models. The confusion matrix and classification report are used to decide which dataframe is fed into each subsequent machine learning model. The following dataframes are available for the logistic regression model:

* The preprocessed data - x_train
* The transformed data - x_train_transformed
* The scaled data - x_train_scaled
* The transformed + scaled data - x_train_trans_scaled
* The PCA-transformed data - x_train_pca
* The scaled data + PCA - x_train_scaled_pca

Based on the results of the baseline regression model, we can choose a dataframe to carry through the modeling process.
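The comparison across dataframes can be sketched as one helper applied to each variant. The snippet below is a minimal illustration on synthetic data: `evaluate_logreg`, the `datasets` dictionary, and `max_iter=1000` are assumptions introduced here, not part of the notebook, which fits a default `LogisticRegression()` in separate cells.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler


def evaluate_logreg(x_tr, y_tr, x_va, y_va):
    """Fit a logistic regression and return validation metrics (hypothetical helper)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(x_tr, y_tr)
    pred = clf.predict(x_va)
    return {
        'accuracy': accuracy_score(y_va, pred),
        'f1_class_1': f1_score(y_va, pred, pos_label=1, zero_division=0),
    }


# Synthetic stand-in data (assumption); in the notebook these would be the
# x_train / x_val variants loaded above
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
x_tr, x_va, y_tr, y_va = X[:1500], X[1500:], y[:1500], y[1500:]
scaler = StandardScaler().fit(x_tr)

datasets = {
    'raw': (x_tr, x_va),
    'scaled': (scaler.transform(x_tr), scaler.transform(x_va)),
}

# One row of metrics per dataset variant
results = {name: evaluate_logreg(tr, y_tr, va, y_va)
           for name, (tr, va) in datasets.items()}
for name, metrics in results.items():
    print(name, metrics)
```

Collecting the metrics in one dictionary makes it easier to compare variants side by side than reading six separate classification reports.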

### Create and train Logistic Regression Model for unscaled data

This is the first model, trained on data that has been preprocessed but neither scaled nor transformed for normality. The accuracy was poor (about 25%): the model predicted class 1 for every sample, so precision, recall, and F-score for class 0 were all zero.

In [6]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)



Validation Accuracy: 0.25420515406663136
Validation Confusion Matrix:
[[     0 116831]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00    116831
           1       0.25      1.00      0.41     39822

    accuracy                           0.25    156653
   macro avg       0.13      0.50      0.20    156653
weighted avg       0.06      0.25      0.10    156653

Test Accuracy: 0.25420353134934315
Test Confusion Matrix:
[[     0 116832]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00    116832
           1       0.25      1.00      0.41     39822

    accuracy                           0.25    156654
   macro avg       0.13      0.50      0.20    156654
weighted avg       0.06      0.25      0.10    156654
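To put the 25% accuracy in context, the short calculation below compares it with a trivial majority-class baseline, using only the validation confusion matrix reported above. It is a worked arithmetic example, not part of the original notebook.

```python
import numpy as np

# Validation confusion matrix reported above
# (rows: true class, columns: predicted class)
val_cm = np.array([[0, 116831],
                   [0, 39822]])

total = val_cm.sum()

# Accuracy = correct predictions (diagonal) / all predictions
model_accuracy = np.trace(val_cm) / total

# Accuracy of always predicting the most frequent true class (class 0)
majority_baseline = val_cm.sum(axis=1).max() / total

print(f'Model accuracy:    {model_accuracy:.4f}')
print(f'Majority baseline: {majority_baseline:.4f}')
```

Always predicting class 0 would score roughly 75%, so this model performs well below the trivial baseline.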





### Create and train Logistic Regression Model for the scaled data

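This is the second model, trained on data that has been preprocessed and scaled but not transformed for normality. The accuracy was 100% on both the validation and test sets, leading us to believe that the model is overfit. A perfect score on held-out data can also indicate that a feature leaks the label, so this result should not be taken at face value.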

In [7]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_scaled, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Validation Accuracy: 1.0
Validation Confusion Matrix:
[[116831      0]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116831
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156653
   macro avg       1.00      1.00      1.00    156653
weighted avg       1.00      1.00      1.00    156653

Test Accuracy: 1.0
Test Confusion Matrix:
[[116832      0]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116832
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156654
   macro avg       1.00      1.00      1.00    156654
weighted avg       1.00      1.00      1.00    156654
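A perfect score on held-out data is worth a quick sanity check for label leakage. The sketch below flags features that are almost perfectly correlated with the target. The function name `flag_suspicious_features`, the 0.95 threshold, and the synthetic data are all assumptions made for illustration; this check was not run on the project's data.

```python
import numpy as np
import pandas as pd


def flag_suspicious_features(x, y, threshold=0.95):
    """Return features whose absolute Pearson correlation with y exceeds threshold."""
    y_series = pd.Series(np.asarray(y), index=x.index)
    corr = x.apply(lambda col: col.corr(y_series)).abs()
    return corr[corr > threshold].sort_values(ascending=False)


# Synthetic illustration (assumption): 'leaky' is a noisy copy of the label
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=1000)
x_demo = pd.DataFrame({
    'noise': rng.normal(size=1000),
    'leaky': y_demo + rng.normal(scale=0.01, size=1000),
})

suspicious = flag_suspicious_features(x_demo, y_demo)
print(suspicious)
```

A correlation check only catches simple, linear leakage; a feature derived from the label in a less direct way would need a closer look at how it was constructed.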



### Create and train Logistic Regression Model for yeo-johnson transformed data

This is the third model, trained on data that has been preprocessed and transformed but not scaled. The accuracy was again 100%, leading us to believe that this model is also overfit.

In [8]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_transformed, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_transformed)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_transformed)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 1.0
Validation Confusion Matrix:
[[116831      0]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116831
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156653
   macro avg       1.00      1.00      1.00    156653
weighted avg       1.00      1.00      1.00    156653

Test Accuracy: 1.0
Test Confusion Matrix:
[[116832      0]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116832
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156654
   macro avg       1.00      1.00      1.00    156654
weighted avg       1.00      1.00      1.00    156654



### Create and train Logistic Regression Model for yeo-johnson transformed and scaled data

This is the fourth model, trained on data that has been preprocessed, transformed for normality, and scaled. The accuracy was again 100%, leading us to believe that this model is also overfit.

In [9]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_trans_scaled, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_trans_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_trans_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 1.0
Validation Confusion Matrix:
[[116831      0]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116831
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156653
   macro avg       1.00      1.00      1.00    156653
weighted avg       1.00      1.00      1.00    156653

Test Accuracy: 1.0
Test Confusion Matrix:
[[116832      0]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116832
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156654
   macro avg       1.00      1.00      1.00    156654
weighted avg       1.00      1.00      1.00    156654



### Create and train Logistic Regression Model for the PCA transformed data (orig)

This is the fifth model, trained on PCA components of the preprocessed data, which was neither scaled nor transformed for normality. The accuracy was about 81%, the most credible result so far, although recall for class 1 is only 0.30.

In [10]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_pca, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_pca)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_pca)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 0.8094706133939343
Validation Confusion Matrix:
[[114853   1978]
 [ 27869  11953]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.98      0.89    116831
           1       0.86      0.30      0.44     39822

    accuracy                           0.81    156653
   macro avg       0.83      0.64      0.66    156653
weighted avg       0.82      0.81      0.77    156653

Test Accuracy: 0.8097080189462127
Test Confusion Matrix:
[[114859   1973]
 [ 27837  11985]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.98      0.89    116832
           1       0.86      0.30      0.45     39822

    accuracy                           0.81    156654
   macro avg       0.83      0.64      0.67    156654
weighted avg       0.82      0.81      0.77    156654



### Create and train Logistic Regression Model for the PCA transformed data (scaled)

This is the sixth model, trained on PCA components of the scaled data, without a normality transformation. The accuracy was about 82%, slightly higher than the previous model, and recall for class 1 improved from 0.30 to 0.53.

In [11]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_scaled_pca, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_scaled_pca)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_scaled_pca)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 0.8236739800706019
Validation Confusion Matrix:
[[107864   8967]
 [ 18655  21167]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.89    116831
           1       0.70      0.53      0.61     39822

    accuracy                           0.82    156653
   macro avg       0.78      0.73      0.75    156653
weighted avg       0.81      0.82      0.81    156653

Test Accuracy: 0.82516245994357
Test Confusion Matrix:
[[108066   8766]
 [ 18623  21199]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.89    116832
           1       0.71      0.53      0.61     39822

    accuracy                           0.83    156654
   macro avg       0.78      0.73      0.75    156654
weighted avg       0.82      0.83      0.82    156654
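The difference between the two PCA models is clearer in class 1 precision and recall than in accuracy. The worked example below recomputes both from the validation confusion matrices reported above; the helper name `class1_precision_recall` is introduced here for illustration.

```python
import numpy as np


def class1_precision_recall(cm):
    """Precision and recall for class 1 from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention)."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    return tp / (tp + fp), tp / (tp + fn)


# Validation confusion matrices reported above
cm_pca = [[114853, 1978], [27869, 11953]]
cm_scaled_pca = [[107864, 8967], [18655, 21167]]

prec_pca, rec_pca = class1_precision_recall(cm_pca)
prec_scaled, rec_scaled = class1_precision_recall(cm_scaled_pca)

print(f'PCA (orig):   precision={prec_pca:.2f}, recall={rec_pca:.2f}')
print(f'PCA (scaled): precision={prec_scaled:.2f}, recall={rec_scaled:.2f}')
```

Scaling before PCA trades some precision for a large gain in recall, meaning far fewer class 1 cases are missed.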



We opted to use PCA-transformed and scaled data for creating and training our Logistic Regression model for several compelling reasons:

1. **Dimensionality Reduction**:
   - **Principal Component Analysis (PCA)** reduces the dimensionality of our dataset while retaining most of its variance. This helps eliminate redundant and less informative features, leading to a more compact model, although the resulting components are harder to interpret than the original features.

2. **Feature Scaling**:
   - Scaling our data ensures that all features contribute equally to the model. Logistic Regression, like many machine learning algorithms, performs better when the data is normalized, preventing features with larger scales from dominating the model training process.

3. **Model Performance**:
   - The Logistic Regression model trained on PCA-transformed and scaled data achieved a validation accuracy of about 82%, compared with about 81% for PCA on the unscaled data. The larger gain is in recall for class 1, which rose from 0.30 to 0.53. Setting aside the models with suspicious 100% accuracy, this is currently our best-performing baseline.

4. **Overfitting Reduction**:
   - By reducing the number of features, PCA helps in minimizing the risk of overfitting. Overfitting occurs when the model is too complex and captures noise in the data, rather than the actual underlying pattern. PCA helps in addressing this by simplifying the feature set.

5. **Computational Efficiency**:
   - With fewer features after PCA, the computational cost of training the Logistic Regression model decreases. This makes the model training process faster and more resource-efficient, which is particularly beneficial when dealing with large datasets.

Using PCA-transformed and scaled data gave a modest gain in accuracy and a clear gain in class 1 recall, justifying our decision to incorporate these preprocessing steps in our modeling pipeline. The 82% accuracy supports the effectiveness of this approach.
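The scale, then PCA, then logistic regression sequence can be expressed as a single scikit-learn `Pipeline`, which keeps every step fitted on training data only. The sketch below is a minimal illustration on synthetic data. `StandardScaler`, `n_components=0.95`, and `max_iter=1000` are assumptions, since the actual scaling and PCA settings live in the preprocessing notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=42)
x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),                   # scale before PCA
    ('pca', PCA(n_components=0.95)),                # keep 95% of the variance (assumed)
    ('logreg', LogisticRegression(max_iter=1000)),  # higher max_iter to aid convergence
])

pipeline.fit(x_tr, y_tr)
val_accuracy = pipeline.score(x_va, y_va)
n_components_kept = pipeline.named_steps['pca'].n_components_

print(f'Components kept: {n_components_kept}, validation accuracy: {val_accuracy:.3f}')
```

Wrapping the steps in a pipeline also means the whole sequence can be passed to `GridSearchCV` or saved with `joblib` as one object.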


## XGBoost 

In [13]:
# define the parameter grid
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}


# initialize the XGBoost classifier
xgb_model = XGBClassifier()

# setup GridSearchCV
grid_search_xgb = GridSearchCV(estimator=xgb_model, 
                               param_grid=param_grid_xgb, 
                               scoring='accuracy', 
                               cv=3, n_jobs=2, verbose=2)
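The size of this search can be checked before running it. The sketch below counts the candidates in the grid above with scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in the cell above
param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
}

# Five parameters with three values each: 3 ** 5 combinations
n_candidates = len(ParameterGrid(param_grid_xgb))

# Each candidate is fitted once per cross-validation fold
cv_folds = 3
n_fits = n_candidates * cv_folds

print(f'{n_candidates} candidates x {cv_folds} folds = {n_fits} fits')
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data after the search.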

In [14]:
# fit the model
grid_search_xgb.fit(x_train_scaled_pca, y_train)

Fitting 3 folds for each of 243 candidates, totalling 729 fits
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=   1.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=   1.1s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.9; total time=   1.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.9; total time=   1.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.9; total time=   1.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=1.0; total time=   1.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=1.0; total time=   1.1s
[CV] END 



[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=0.8; total time=   3.9s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=0.8; total time=   3.8s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=0.9; total time=   3.9s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=0.9; total time=   3.9s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=0.9; total time=   3.8s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=1.0; total time=   3.5s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estimators=300, subsample=1.0; total time=   3.5s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, n_estimators=100, subsample=0.8; total time=   1.5s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=4, n_estima

In [15]:
# best parameters and best score
best_params_xgb = grid_search_xgb.best_params_
best_score_xgb = grid_search_xgb.best_score_

print(f'Best Parameters for XGBoost: {best_params_xgb}')
print(f'Best Cross-Validation Score: {best_score_xgb}')

Best Parameters for XGBoost: {'colsample_bytree': 1.0, 'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 300, 'subsample': 0.9}
Best Cross-Validation Score: 0.9018251649361441


In [16]:
# evaluate on validation set
xgb_best_model = grid_search_xgb.best_estimator_
y_val_pred = xgb_best_model.predict(x_val_scaled_pca)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'XGBoost Validation Accuracy: {val_accuracy}')
print('XGBoost Validation Confusion Matrix:')
print(val_confusion_matrix)
print('XGBoost Validation Classification Report:')
print(val_classification_report)

XGBoost Validation Accuracy: 0.8796831212935596
XGBoost Validation Confusion Matrix:
[[101218  15613]
 [  3235  36587]]
XGBoost Validation Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.87      0.91    116831
           1       0.70      0.92      0.80     39822

    accuracy                           0.88    156653
   macro avg       0.83      0.89      0.86    156653
weighted avg       0.90      0.88      0.88    156653



In [17]:
# evaluate on test set
y_test_pred = xgb_best_model.predict(x_test_scaled_pca)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'XGBoost Test Accuracy: {test_accuracy}')
print('XGBoost Test Confusion Matrix:')
print(test_confusion_matrix)
print('XGBoost Test Classification Report:')
print(test_classification_report)

XGBoost Test Accuracy: 0.8801818019329223
XGBoost Test Confusion Matrix:
[[101264  15568]
 [  3202  36620]]
XGBoost Test Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.87      0.92    116832
           1       0.70      0.92      0.80     39822

    accuracy                           0.88    156654
   macro avg       0.84      0.89      0.86    156654
weighted avg       0.90      0.88      0.88    156654



In [18]:
# ensure the models folder exists
os.makedirs('models', exist_ok=True)

# save tuned XGBoost model
xgb_best_model_path = 'models/xgb_best_model.pkl'
joblib.dump(xgb_best_model, xgb_best_model_path)

print(f'Tuned XGBoost model saved at: {xgb_best_model_path}')

Tuned XGBoost model saved at: models/xgb_best_model.pkl
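A saved model can later be reloaded with `joblib.load` and used for prediction without retraining. The sketch below shows the round trip with a small stand-in model and a temporary directory; the data, the model, and the file name `demo_model.pkl` are hypothetical and unrelated to the tuned XGBoost model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on synthetic data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), reloaded.predict(X))
print(f'Predictions match after reload: {same_predictions}')
```

Pickled models generally need to be loaded with the same library versions that created them.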


## Agglomerative Clustering

In [19]:
# reduce dataset size for agglomerative clustering and DBSCAN

# function to reduce dataset size
def reduce_dataset(x, y, sample_size):
    np.random.seed(42)
    indices = np.random.choice(x.shape[0], size=sample_size, replace=False)
    return x[indices], y[indices]

# convert data to np.float32
x_train_np = x_train_scaled_pca.astype(np.float32).to_numpy()
x_val_np = x_val_scaled_pca.astype(np.float32).to_numpy()
x_test_np = x_test_scaled_pca.astype(np.float32).to_numpy()

# sample the data
sample_size = 5000
x_train_sampled, y_train_sampled = reduce_dataset(x_train_np, y_train, sample_size)
x_val_sampled, y_val_sampled = reduce_dataset(x_val_np, y_val, sample_size)
x_test_sampled, y_test_sampled = reduce_dataset(x_test_np, y_test, sample_size)

In [31]:
# define the parameter grid for the number of clusters
param_grid_agg = {'n_clusters': [2, 3, 4, 5, 6]}

# initialize the AgglomerativeClustering model
agg_model = AgglomerativeClustering()

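# define a custom scoring function using silhouette score
# (AgglomerativeClustering has no predict method, so each fold is re-clustered with fit_predict)
def silhouette_scorer(estimator, X, y=None):
    labels = estimator.fit_predict(X)
    score = silhouette_score(X, labels)
    return score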

# setup GridSearchCV with silhouette scorer
grid_search_agg = GridSearchCV(estimator=agg_model, param_grid=param_grid_agg, 
                               scoring=silhouette_scorer, cv=3, n_jobs=-2, verbose=2)


In [32]:
# fit the model
grid_search_agg.fit(x_train_sampled)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] END .......................................n_clusters=2; total time=   0.3s
[CV] END .......................................n_clusters=2; total time=   0.3s
[CV] END .......................................n_clusters=3; total time=   0.3s
[CV] END .......................................n_clusters=3; total time=   0.3s
[CV] END .......................................n_clusters=3; total time=   0.3s
[CV] END .......................................n_clusters=2; total time=   0.3s
[CV] END .......................................n_clusters=4; total time=   0.4s
[CV] END .......................................n_clusters=4; total time=   0.4s
[CV] END .......................................n_clusters=5; total time=   0.4s
[CV] END .......................................n_clusters=4; total time=   0.4s
[CV] END .......................................n_clusters=6; total time=   0.3s
[CV] END .......................................n_

In [33]:
# best parameters and best score
best_params_agg = grid_search_agg.best_params_
best_score_agg = grid_search_agg.best_score_

print(f'Best Parameters for Agglomerative Clustering: {best_params_agg}')
print(f'Best Cross-Validation Score: {best_score_agg}')

Best Parameters for Agglomerative Clustering: {'n_clusters': 2}
Best Cross-Validation Score: 0.9482576449712118


In [34]:
# evaluate on validation set
agg_best_model = grid_search_agg.best_estimator_
y_val_pred = agg_best_model.fit_predict(x_val_sampled)
val_accuracy = accuracy_score(y_val_sampled, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val_sampled, y_val_pred)
val_classification_report = classification_report(y_val_sampled, y_val_pred)

print(f'Agglomerative Clustering Validation Accuracy: {val_accuracy}')
print('Agglomerative Clustering Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Agglomerative Clustering Validation Classification Report:')
print(val_classification_report)

Agglomerative Clustering Validation Accuracy: 0.247
Agglomerative Clustering Validation Confusion Matrix:
[[  25 3710]
 [  55 1210]]
Agglomerative Clustering Validation Classification Report:
              precision    recall  f1-score   support

           0       0.31      0.01      0.01      3735
           1       0.25      0.96      0.39      1265

    accuracy                           0.25      5000
   macro avg       0.28      0.48      0.20      5000
weighted avg       0.30      0.25      0.11      5000



In [35]:
# evaluate on test set
y_test_pred = agg_best_model.fit_predict(x_test_sampled)
test_accuracy = accuracy_score(y_test_sampled, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test_sampled, y_test_pred)
test_classification_report = classification_report(y_test_sampled, y_test_pred)

print(f'Agglomerative Clustering Test Accuracy: {test_accuracy}')
print('Agglomerative Clustering Test Confusion Matrix:')
print(test_confusion_matrix)
print('Agglomerative Clustering Test Classification Report:')
print(test_classification_report)

Agglomerative Clustering Test Accuracy: 0.733
Agglomerative Clustering Test Confusion Matrix:
[[3660    2]
 [1333    5]]
Agglomerative Clustering Test Classification Report:
              precision    recall  f1-score   support

           0       0.73      1.00      0.85      3662
           1       0.71      0.00      0.01      1338

    accuracy                           0.73      5000
   macro avg       0.72      0.50      0.43      5000
weighted avg       0.73      0.73      0.62      5000
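The jump from 0.247 validation accuracy to 0.733 test accuracy is largely an artifact of cluster numbering: the ids 0 and 1 assigned by a clustering algorithm are arbitrary and need not line up with the class labels. A minimal sketch of one way to handle this is to match clusters to classes before scoring, here with SciPy's `linear_sum_assignment` applied to the confusion matrices reported above. The helper name `aligned_accuracy_from_cm` is an assumption introduced for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def aligned_accuracy_from_cm(cm):
    """Accuracy after matching each cluster id to the class it overlaps most,
    using a one-to-one assignment over the confusion matrix."""
    cm = np.asarray(cm)
    # linear_sum_assignment minimises cost, so negate counts to maximise matches
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()


# Confusion matrices reported above (rows: true class, columns: cluster id)
val_cm = [[25, 3710], [55, 1210]]
test_cm = [[3660, 2], [1333, 5]]

val_aligned = aligned_accuracy_from_cm(val_cm)
test_aligned = aligned_accuracy_from_cm(test_cm)

print(f'Aligned validation accuracy: {val_aligned:.3f}')
print(f'Aligned test accuracy:       {test_aligned:.3f}')
```

After alignment both splits score close to the share of class 0 in each sample, which suggests the two clusters do not separate the classes: nearly all points fall into a single cluster.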



In [36]:
# save tuned Agglomerative Clustering model
agg_best_model_path = 'models/agg_best_model.pkl'
joblib.dump(agg_best_model, agg_best_model_path)

print(f'Tuned Agglomerative Clustering model saved at: {agg_best_model_path}')

Tuned Agglomerative Clustering model saved at: models/agg_best_model.pkl


## Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

In [51]:
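# define silhouette scorer
# note: the scorer ignores the fold passed in by GridSearchCV and always
# clusters the full dataset given at construction, so every fold yields the same score
class SilhouetteScorer:
    def __init__(self, X):
        self.X = X

    def __call__(self, estimator, X=None, y=None):
        labels = estimator.fit_predict(self.X)
        # number of clusters, excluding the noise label (-1)
        num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if num_clusters > 1:
            return silhouette_score(self.X, labels)
        else:
            return -1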
        
# define the parameter grid for DBSCAN
param_grid_dbscan = {
    'eps': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    'min_samples': [3, 5, 7, 10, 15]
}

# initialize the DBSCAN model
dbscan_model = DBSCAN()

# create the silhouette scorer
silhouette_scorer = SilhouetteScorer(X=x_train_sampled)

# setup GridSearchCV
grid_search_dbscan = GridSearchCV(estimator=dbscan_model, param_grid=param_grid_dbscan, 
                                  scoring=silhouette_scorer, cv=3, n_jobs=-1, verbose=2)
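
Because the scorer above re-fits on the full sample and ignores the fold it is handed, the three folds add run time without adding information. A minimal sketch of an equivalent search is a plain loop over the parameter grid, scoring each fit once. The synthetic `X_demo` from `make_blobs` and the smaller `demo_grid` are assumptions standing in for `x_train_sampled` and `param_grid_dbscan`.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# synthetic stand-in for x_train_sampled (assumption: not the notebook's data)
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.4, random_state=42)

# reduced grid for illustration
demo_grid = ParameterGrid({'eps': [0.3, 0.5, 0.8], 'min_samples': [5, 10]})

results = []
for params in demo_grid:
    labels = DBSCAN(**params).fit_predict(X_demo)
    # count clusters, excluding the noise label (-1)
    num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    # silhouette needs at least two clusters; mirror the scorer's -1 fallback
    score = silhouette_score(X_demo, labels) if num_clusters > 1 else -1.0
    results.append((score, params))

# keep the parameter combination with the highest silhouette score
best_score, best_params = max(results, key=lambda item: item[0])
print(f'Best parameters: {best_params}')
print(f'Best silhouette score: {best_score:.3f}')
```

As in the scorer above, noise points (label -1) are treated as one more group when the silhouette is computed, which may lower the score for parameter settings that mark many points as noise.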

In [52]:
# fit the model
grid_search_dbscan.fit(x_train_sampled)

Fitting 3 folds for each of 35 candidates, totalling 105 fits
[CV] END .............................eps=0.2, min_samples=3; total time=   0.6s
[CV] END .............................eps=0.2, min_samples=3; total time=   0.6s
[CV] END .............................eps=0.2, min_samples=5; total time=   0.6s
[CV] END .............................eps=0.2, min_samples=7; total time=   0.6s
[CV] END .............................eps=0.2, min_samples=5; total time=   0.7s
[CV] END .............................eps=0.2, min_samples=7; total time=   0.7s
[CV] END .............................eps=0.2, min_samples=3; total time=   0.8s
[CV] END .............................eps=0.2, min_samples=5; total time=   0.8s
[CV] END ............................eps=0.2, min_samples=10; total time=   0.6s
[CV] END .............................eps=0.2, min_samples=7; total time=   0.7s
[CV] END ............................eps=0.2, min_samples=10; total time=   0.7s
[CV] END ............................eps=0.2, m

In [53]:
# best parameters and best score
best_params_dbscan = grid_search_dbscan.best_params_
best_score_dbscan = grid_search_dbscan.best_score_

print(f'Best Parameters for DBSCAN: {best_params_dbscan}')
print(f'Best Cross-Validation Score: {best_score_dbscan}')

Best Parameters for DBSCAN: {'eps': 0.8, 'min_samples': 10}
Best Cross-Validation Score: 0.5735435485839844


In [54]:
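# evaluate on validation set
# note: DBSCAN has no predict(), so fit_predict re-clusters the validation sample;
# the labels are cluster ids (-1 = noise), not class predictions
dbscan_best_model = grid_search_dbscan.best_estimator_
y_val_pred = dbscan_best_model.fit_predict(x_val_sampled)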
val_accuracy = accuracy_score(y_val_sampled, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val_sampled, y_val_pred)
val_classification_report = classification_report(y_val_sampled, y_val_pred)

print(f'DBSCAN Validation Accuracy: {val_accuracy}')
print('DBSCAN Validation Confusion Matrix:')
print(val_confusion_matrix)
print('DBSCAN Validation Classification Report:')
print(val_classification_report)

DBSCAN Validation Accuracy: 0.7356
DBSCAN Validation Confusion Matrix:
[[   0    0    0    0    0]
 [  71 3664    0    0    0]
 [ 363  866   14   12   10]
 [   0    0    0    0    0]
 [   0    0    0    0    0]]
DBSCAN Validation Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         0
           0       0.81      0.98      0.89      3735
           1       1.00      0.01      0.02      1265
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         0

    accuracy                           0.74      5000
   macro avg       0.36      0.20      0.18      5000
weighted avg       0.86      0.74      0.67      5000





In [55]:
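# evaluate on test set
# note: fit_predict re-clusters the test sample; the labels are cluster ids
# (-1 = noise), not class predictions
y_test_pred = dbscan_best_model.fit_predict(x_test_sampled)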
test_accuracy = accuracy_score(y_test_sampled, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test_sampled, y_test_pred)
test_classification_report = classification_report(y_test_sampled, y_test_pred)

print(f'DBSCAN Test Accuracy: {test_accuracy}')
print('DBSCAN Test Confusion Matrix:')
print(test_confusion_matrix)
print('DBSCAN Test Classification Report:')
print(test_classification_report)

DBSCAN Test Accuracy: 0.7196
DBSCAN Test Confusion Matrix:
[[   0    0    0]
 [  68 3591    3]
 [ 375  956    7]]
DBSCAN Test Classification Report:
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         0
           0       0.79      0.98      0.87      3662
           1       0.70      0.01      0.01      1338

    accuracy                           0.72      5000
   macro avg       0.50      0.33      0.30      5000
weighted avg       0.77      0.72      0.64      5000
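
To make the accuracy figure above easier to interpret, the short check below recomputes it from the printed confusion matrix. Accuracy here only counts points where the cluster id happens to equal the class label (cluster 0 with class 0, cluster 1 with class 1), and every point labelled as noise (-1) counts as an error. The variable names are illustrative; the matrix values are copied from the test output above.

```python
import numpy as np

# DBSCAN test confusion matrix copied from the output above
# rows = true label, columns = DBSCAN label, both ordered (-1, 0, 1)
cm = np.array([[0, 0, 0],
               [68, 3591, 3],
               [375, 956, 7]])

# accuracy = matching (diagonal) counts divided by all points
accuracy = np.trace(cm) / cm.sum()

# share of points that DBSCAN labelled as noise (first column)
noise_fraction = cm[:, 0].sum() / cm.sum()

print(f'Accuracy: {accuracy:.4f}')
print(f'Noise fraction: {noise_fraction:.4f}')
```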





In [56]:
# save tuned DBSCAN model
dbscan_best_model_path = 'models/dbscan_best_model.pkl'
joblib.dump(dbscan_best_model, dbscan_best_model_path)

print(f'Tuned DBSCAN model saved at: {dbscan_best_model_path}')

Tuned DBSCAN model saved at: models/dbscan_best_model.pkl
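
Since DBSCAN has no `predict` method, the saved model only holds labels for the data it was last fitted on. A minimal sketch of one common workaround is to give each new point the label of its nearest core sample, or -1 when no core sample lies within `eps`. The helper `dbscan_assign` and the toy arrays are hypothetical and are not part of the notebook or of scikit-learn.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def dbscan_assign(fitted_dbscan, X_new):
    """Label new points by their nearest core sample, or -1 if none is within eps.

    Assumes the fitted model found at least one core sample.
    """
    core_labels = fitted_dbscan.labels_[fitted_dbscan.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(fitted_dbscan.components_)
    distances, indices = nn.kneighbors(X_new)
    assigned = core_labels[indices[:, 0]]
    # points farther than eps from every core sample are treated as noise
    assigned[distances[:, 0] > fitted_dbscan.eps] = -1
    return assigned


# toy data (hypothetical): two tight groups of three points each
X_fit = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
demo_dbscan = DBSCAN(eps=0.5, min_samples=3).fit(X_fit)

# two points near the groups and one far from both
X_new = np.array([[0.05, 0.05], [5.05, 5.05], [10.0, 10.0]])
new_labels = dbscan_assign(demo_dbscan, X_new)
print(new_labels)
```

With an approach like this, the clustering could be fitted once on the training sample and then applied to the validation and test samples without re-clustering them.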


## AdaBoost

In [12]:
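# `model` is assumed to be the base classifier defined in an earlier cell
# note: scikit-learn >= 1.2 names this parameter `estimator`; older releases use `base_estimator`
adaboost_model = AdaBoostClassifier(estimator=model, n_estimators=50, random_state=42)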

# Train the model on the PCA-transformed and scaled training data
adaboost_model.fit(x_train_scaled_pca, y_train)

# Predict on the PCA-transformed and scaled training data
y_train_pred = adaboost_model.predict(x_train_scaled_pca)

# Predict on the PCA-transformed and scaled validation data
y_val_pred = adaboost_model.predict(x_val_scaled_pca)

# Predict on the PCA-transformed and scaled test data
y_test_pred = adaboost_model.predict(x_test_scaled_pca)

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f'Training Accuracy: {train_accuracy}')
print(f'Validation Accuracy: {val_accuracy}')
print(f'Test Accuracy: {test_accuracy}')

print("\nClassification Report (Validation):")
print(classification_report(y_val, y_val_pred))

print("\nClassification Report (Test):")
print(classification_report(y_test, y_test_pred))

Training Accuracy: 0.5350203866026181
Validation Accuracy: 0.7596279675461051
Test Accuracy: 0.7604338223090378

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.76      0.99      0.86    116831
           1       0.75      0.08      0.15     39822

    accuracy                           0.76    156653
   macro avg       0.76      0.54      0.50    156653
weighted avg       0.76      0.76      0.68    156653


Classification Report (Test):
              precision    recall  f1-score   support

           0       0.76      0.99      0.86    116832
           1       0.77      0.08      0.15     39822

    accuracy                           0.76    156654
   macro avg       0.76      0.54      0.50    156654
weighted avg       0.76      0.76      0.68    156654
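
To put the test accuracy in context, the short calculation below compares it with a majority-class baseline, that is, the accuracy of always predicting class 0. The support counts and the test accuracy are copied from the report above; the variable names are illustrative.

```python
# class support from the test classification report above
support = {0: 116832, 1: 39822}
total = sum(support.values())

# accuracy of a model that always predicts the majority class
majority_baseline = max(support.values()) / total

# AdaBoost test accuracy printed above
adaboost_test_accuracy = 0.7604338223090378

# how far the model sits above the baseline
lift = adaboost_test_accuracy - majority_baseline

print(f'Majority-class baseline: {majority_baseline:.4f}')
print(f'AdaBoost test accuracy:  {adaboost_test_accuracy:.4f}')
print(f'Lift over baseline:      {lift:.4f}')
```

A small lift over this baseline, together with the low recall for class 1 in the report, suggests that accuracy alone is not a sufficient basis for comparing models on this dataset.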

