# Uncovering Healthcare Inefficiencies - Model Building and Evaluation

This notebook focuses on building, training, and evaluating various models to determine the best performing model for our dataset.

The models included in this notebook are:

1. **Logistic Regression**: Used as the baseline model.
2. **Recurrent Neural Network (RNN)**: For capturing temporal dependencies.
3. **Convolutional Neural Network (CNN)**: For capturing spatial hierarchies.
4. **DBSCAN**: Unsupervised clustering to identify clusters and noise.

Each model undergoes the following steps:

1. **Data Preprocessing**: Standardizing and preparing data.
2. **Model Building**: Constructing model architecture.
3. **Model Training**: Training the model.
4. **Model Evaluation**: Assessing performance.
5. **Results Analysis**: Comparing results to determine the best model.


The objective is to identify the model that yields the best results in terms of accuracy and other relevant metrics. 

## Import Libraries

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import PowerTransformer, RobustScaler

import tensorflow as tf
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Embedding, SimpleRNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow import keras
import keras_tuner as kt

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # suppress FutureWarning messages

In [2]:
# Check current working directory
current_directory = os.getcwd()
print("Current working directory:", current_directory)

Current working directory: /Users/samantharivas/Documents/UNIVERSITY OF SAN DIEGO/ADS599/healthcare-market-saturation-fraud


## Import data from preprocessing notebook

In [3]:
# Read in data
# all the training/validation/test dataframes
x_train = pd.read_csv('data/x_train.csv') 
x_train_scaled = pd.read_csv('data/x_train_scaled.csv')
x_train_pca = pd.read_csv('data/x_train_pca.csv')
x_train_scaled_pca = pd.read_csv('data/x_train_scaled_pca.csv')

x_val = pd.read_csv('data/x_val.csv') 
x_val_scaled = pd.read_csv('data/x_val_scaled.csv')
x_val_pca = pd.read_csv('data/x_val_pca.csv')
x_val_scaled_pca = pd.read_csv('data/x_val_scaled_pca.csv')

x_test = pd.read_csv('data/x_test.csv')
x_test_scaled = pd.read_csv('data/x_test_scaled.csv')
x_test_pca = pd.read_csv('data/x_test_pca.csv')
x_test_scaled_pca = pd.read_csv('data/x_test_scaled_pca.csv')


# all the labels
y_train = np.ravel(pd.read_csv('data/y_train.csv'))
y_val = np.ravel(pd.read_csv('data/y_val.csv'))
y_test = np.ravel(pd.read_csv('data/y_test.csv'))

## Data Transformation

### Yeo Johnson transformation of data

We added further variants of the data to see whether they change modeling performance. The first applies a Yeo-Johnson transformation to the numeric columns; the second applies the same transformation followed by scaling.

In [4]:
# transformed data
# create copy of df 
x_train_transformed = x_train.copy()
x_val_transformed = x_val.copy()
x_test_transformed = x_test.copy()

# get numeric columns
numeric_columns = x_train_transformed.select_dtypes(include=['float']).columns

def yeo_johnson_transform(column):
    # Create an instance of PowerTransformer with Yeo-Johnson method
    pt = PowerTransformer(method='yeo-johnson')
    
    # Reshape column for PowerTransformer which expects 2D input
    column_reshaped = column.values.reshape(-1, 1)
    
    # Fit and transform the column
    transformed_col = pt.fit_transform(column_reshaped)
    
    # Flatten the result to match original column shape
    return transformed_col.flatten()

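# Apply the Yeo-Johnson transformation to each numeric column
# (note: a separate transformer is fitted on each of the train, validation, and test splits)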
for col in numeric_columns:
    x_train_transformed[col] = yeo_johnson_transform(x_train_transformed[col])
    x_val_transformed[col] = yeo_johnson_transform(x_val_transformed[col])
    x_test_transformed[col] = yeo_johnson_transform(x_test_transformed[col])
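The loop above fits a separate transformer on each split. A common alternative, sketched here under the assumption that the validation and test sets should be treated as unseen data, is to fit `PowerTransformer` on the training columns only and reuse the fitted parameters on the other splits. The helper name `fit_apply_yeo_johnson` and the toy data are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer


def fit_apply_yeo_johnson(train_df, other_dfs, columns):
    """Fit Yeo-Johnson on the training columns, then apply it to the other splits."""
    columns = list(columns)
    pt = PowerTransformer(method='yeo-johnson')  # standardize=True by default

    train_out = train_df.copy()
    train_out[columns] = pt.fit_transform(train_df[columns])

    transformed = []
    for df in other_dfs:
        out = df.copy()
        # reuse the lambdas learned on the training data
        out[columns] = pt.transform(df[columns])
        transformed.append(out)
    return train_out, transformed, pt


# toy data standing in for x_train / x_val (hypothetical values)
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({'a': rng.exponential(2.0, 500), 'b': rng.normal(0.0, 1.0, 500)})
toy_val = pd.DataFrame({'a': rng.exponential(2.0, 200), 'b': rng.normal(0.0, 1.0, 200)})

toy_train_t, (toy_val_t,), fitted_pt = fit_apply_yeo_johnson(toy_train, [toy_val], ['a', 'b'])
print(fitted_pt.lambdas_)  # one fitted lambda per column
```

With this layout the validation and test data never influence the fitted lambdas, which keeps the three splits on the same transformed scale.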


### Yeo-Johnson transformed + scaled data

In [5]:
x_train_trans_scaled = x_train_transformed.copy()
x_val_trans_scaled = x_val_transformed.copy()
x_test_trans_scaled = x_test_transformed.copy()

scaler = RobustScaler()
x_train_trans_scaled[numeric_columns] = scaler.fit_transform(x_train_trans_scaled[numeric_columns])
x_val_trans_scaled[numeric_columns] = scaler.transform(x_val_trans_scaled[numeric_columns])
x_test_trans_scaled[numeric_columns] = scaler.transform(x_test_trans_scaled[numeric_columns])

## Baseline Model Selection - Logistic Regression

We start by fitting a baseline model for comparison against the other models. The confusion matrix and the related metrics from each fit are used to decide which dataframe is carried into the remaining machine learning models. The following dataframes are available to feed into the logistic regression model:

* The preprocessed data - x_train
* The transformed data - x_train_transformed
* The scaled data - x_train_scaled
* The transformed + scaled data - x_train_trans_scaled
* The pca transformed data - x_train_pca
* The scaled data + pca - x_train_scaled_pca

Based on the results of the baseline regression model, we can choose a dataframe to carry through the modeling process.
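The six fits that follow repeat the same steps on different inputs. As a rough sketch of how that comparison could be collected in one table, the helper below loops over named data variants. The name `compare_logreg`, the `max_iter=1000` setting, and the synthetic data are assumptions for illustration; the notebook itself uses `LogisticRegression()` with default settings on the project dataframes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def compare_logreg(variants, y_train, y_val):
    """Fit one LogisticRegression per data variant and collect validation metrics."""
    rows = []
    for name, (x_tr, x_va) in variants.items():
        clf = LogisticRegression(max_iter=1000)  # assumed setting, not the notebook's default
        clf.fit(x_tr, y_train)
        pred = clf.predict(x_va)
        rows.append({'variant': name,
                     'val_accuracy': accuracy_score(y_val, pred),
                     'val_f1_class1': f1_score(y_val, pred)})
    return pd.DataFrame(rows).set_index('variant')


# synthetic stand-in data (hypothetical)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_tr)
variants = {'raw': (X_tr, X_va),
            'scaled': (scaler.transform(X_tr), scaler.transform(X_va))}

summary = compare_logreg(variants, y_tr, y_va)
print(summary)
```

In the notebook, the dictionary would map names such as `'scaled_pca'` to pairs like `(x_train_scaled_pca, x_val_scaled_pca)`.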

### Create and train Logistic Regression Model for unscaled data

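This is the first model, fitted on data that has been preprocessed but neither scaled nor transformed for normality. The solver hit its iteration limit before converging, and performance was poor: accuracy was about 62%, and for class 1 the precision was 0.39 and the F1-score 0.56, despite a recall of 0.94.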

In [6]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Validation Accuracy: 0.6181815860532515
Validation Confusion Matrix:
[[59423 57408]
 [ 2405 37417]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.51      0.67    116831
           1       0.39      0.94      0.56     39822

    accuracy                           0.62    156653
   macro avg       0.68      0.72      0.61    156653
weighted avg       0.82      0.62      0.64    156653

Test Accuracy: 0.6179414505853664
Test Confusion Matrix:
[[59410 57422]
 [ 2429 37393]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.51      0.67    116832
           1       0.39      0.94      0.56     39822

    accuracy                           0.62    156654
   macro avg       0.68      0.72      0.61    156654
weighted avg       0.82      0.62      0.64    156654
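The convergence warning in the output above suggests raising `max_iter` or scaling the data. A minimal sketch of both remedies together, assuming a `StandardScaler` and `max_iter=1000` (neither is taken from the notebook) and using synthetic features with deliberately mismatched scales:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data with very different feature scales (hypothetical)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X = X * np.logspace(0, 5, num=10)  # column i is multiplied by 10**(5*i/9)

pipe = Pipeline([
    ('scale', StandardScaler()),                 # put features on a comparable scale
    ('clf', LogisticRegression(max_iter=1000)),  # allow more solver iterations
])
pipe.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
iterations_used = int(pipe.named_steps['clf'].n_iter_[0])
print('Solver iterations used:', iterations_used)
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.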



### Create and train Logistic Regression Model for the scaled data

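This is the second model, fitted on data that has been preprocessed and scaled but not transformed for normality. The accuracy was 100% on both the validation and test sets, leading us to believe that the model is overfit. A perfect score on held-out data can also indicate that a feature leaks the target, so this result should not be taken at face value.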

In [7]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_scaled, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 1.0
Validation Confusion Matrix:
[[116831      0]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116831
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156653
   macro avg       1.00      1.00      1.00    156653
weighted avg       1.00      1.00      1.00    156653

Test Accuracy: 1.0
Test Confusion Matrix:
[[116832      0]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116832
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156654
   macro avg       1.00      1.00      1.00    156654
weighted avg       1.00      1.00      1.00    156654



### Create and train Logistic Regression Model for yeo-johnson transformed data

This is the third model, fitted on data that has been preprocessed and Yeo-Johnson transformed but not scaled. The accuracy was again 100%, leading us to believe that this model is also overfit.

In [8]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_transformed, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_transformed)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_transformed)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 1.0
Validation Confusion Matrix:
[[116831      0]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116831
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156653
   macro avg       1.00      1.00      1.00    156653
weighted avg       1.00      1.00      1.00    156653

Test Accuracy: 1.0
Test Confusion Matrix:
[[116832      0]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116832
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156654
   macro avg       1.00      1.00      1.00    156654
weighted avg       1.00      1.00      1.00    156654



### Create and train Logistic Regression Model for yeo-johnson transformed and scaled data

This is the fourth model, fitted on data that has been preprocessed, Yeo-Johnson transformed, and scaled. The accuracy was again 100%, leading us to believe that this model is also overfit.

In [9]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_trans_scaled, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_trans_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_trans_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 1.0
Validation Confusion Matrix:
[[116831      0]
 [     0  39822]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116831
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156653
   macro avg       1.00      1.00      1.00    156653
weighted avg       1.00      1.00      1.00    156653

Test Accuracy: 1.0
Test Confusion Matrix:
[[116832      0]
 [     0  39822]]
Test Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    116832
           1       1.00      1.00      1.00     39822

    accuracy                           1.00    156654
   macro avg       1.00      1.00      1.00    156654
weighted avg       1.00      1.00      1.00    156654
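Three of the fits above reach 100% accuracy on held-out data. One way to investigate such a result, sketched here as an assumption rather than a step the notebook performs, is to rank features by their absolute correlation with the label and inspect any that track it almost perfectly. The helper name and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_features_by_target_correlation(x_df, y, top_n=5):
    """Return the features with the largest absolute Pearson correlation with y."""
    target = pd.Series(np.asarray(y, dtype=float), index=x_df.index)
    corr = x_df.corrwith(target).abs().dropna()  # constant columns yield NaN and are dropped
    return corr.sort_values(ascending=False).head(top_n)


# toy example: one column is a deterministic function of the label
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=1000)
toy_x = pd.DataFrame({
    'noise_1': rng.normal(size=1000),
    'noise_2': rng.normal(size=1000),
    'leaky': toy_y * 3.0 + 1.0,
})

ranking = rank_features_by_target_correlation(toy_x, toy_y)
print(ranking)
```

A correlation close to 1 does not prove leakage, but it identifies the columns worth checking against how the label was constructed.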



### Create and train Logistic Regression Model for the PCA transformed data (orig)

This is the fifth model, fitted on PCA components computed from data that has been preprocessed but neither scaled nor transformed for normality. The accuracy was about 89%, the best result so far among the models that did not score 100%.

In [10]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_pca, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_pca)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_pca)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 0.8902414891511813
Validation Confusion Matrix:
[[114461   2370]
 [ 14824  24998]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.98      0.93    116831
           1       0.91      0.63      0.74     39822

    accuracy                           0.89    156653
   macro avg       0.90      0.80      0.84    156653
weighted avg       0.89      0.89      0.88    156653

Test Accuracy: 0.8902166558147254
Test Confusion Matrix:
[[114458   2374]
 [ 14824  24998]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.98      0.93    116832
           1       0.91      0.63      0.74     39822

    accuracy                           0.89    156654
   macro avg       0.90      0.80      0.84    156654
weighted avg       0.89      0.89      0.88    156654



### Create and train Logistic Regression Model for the PCA transformed data (scaled)

This is the sixth model, fitted on PCA components computed from data that has been preprocessed and scaled but not transformed for normality. The accuracy was about 82%, which is lower than the roughly 89% of the previous (unscaled PCA) model; class 1 recall also dropped from 0.63 to 0.53.

In [11]:
# logreg model
model = LogisticRegression()

# Train the model
model.fit(x_train_scaled_pca, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict(x_val_scaled_pca)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)

print(f'Validation Accuracy: {val_accuracy}')
print('Validation Confusion Matrix:')
print(val_confusion_matrix)
print('Validation Classification Report:')
print(val_classification_report)


# Evaluate the model on the test set
y_test_pred = model.predict(x_test_scaled_pca)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

Validation Accuracy: 0.8236739800706019
Validation Confusion Matrix:
[[107864   8967]
 [ 18655  21167]]
Validation Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.89    116831
           1       0.70      0.53      0.61     39822

    accuracy                           0.82    156653
   macro avg       0.78      0.73      0.75    156653
weighted avg       0.81      0.82      0.81    156653

Test Accuracy: 0.82516245994357
Test Confusion Matrix:
[[108066   8766]
 [ 18623  21199]]
Test Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.92      0.89    116832
           1       0.71      0.53      0.61     39822

    accuracy                           0.83    156654
   macro avg       0.78      0.73      0.75    156654
weighted avg       0.82      0.83      0.82    156654
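The validation confusion matrices printed for the two PCA models can be summarized with a little arithmetic. The sketch below copies those matrices from the outputs above and derives accuracy, class 1 precision and recall, and the share of the majority class (the accuracy of always predicting class 0). The function name `summarize` is illustrative.

```python
import numpy as np

# validation confusion matrices copied from the outputs above
val_cms = {
    'pca_unscaled': np.array([[114461, 2370], [14824, 24998]]),
    'pca_scaled': np.array([[107864, 8967], [18655, 21167]]),
}


def summarize(cm):
    """Derive basic metrics from a 2x2 confusion matrix laid out as [[tn, fp], [fn, tp]]."""
    tn, fp, fn, tp = cm.ravel()
    total = cm.sum()
    return {
        'accuracy': (tn + tp) / total,
        'precision_class1': tp / (tp + fp),
        'recall_class1': tp / (tp + fn),
        'majority_baseline': max(tn + fp, fn + tp) / total,
    }


summaries = {name: summarize(cm) for name, cm in val_cms.items()}
for name, s in summaries.items():
    print(name, {k: round(float(v), 4) for k, v in s.items()})
```

Because class 0 makes up roughly three quarters of the validation set, accuracy alone is a weak yardstick; the class 1 recall shows how many positive cases each model misses.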



We opted to use PCA-transformed and scaled data for creating and training our Logistic Regression model for several compelling reasons:

1. **Dimensionality Reduction**:
   - **Principal Component Analysis (PCA)** is a technique for reducing the dimensionality of our dataset while retaining most of its variance. This helps in eliminating redundant and less informative features, leading to a more compact model, although each principal component is a linear combination of the original features and is harder to interpret directly.

2. **Feature Scaling**:
   - Scaling our data ensures that all features contribute equally to the model. Logistic Regression, like many machine learning algorithms, performs better when the data is normalized, preventing features with larger scales from dominating the model training process.

3. **Model Performance**:
   - The Logistic Regression model trained on PCA-transformed and scaled data achieved an accuracy of about 82% on both the validation and test sets. This is well above the 62% of the raw, unscaled model, although it is below the roughly 89% of the PCA model fitted on unscaled data. The models that scored 100% are left out of this comparison because we suspect they are overfit.

4. **Overfitting Reduction**:
   - By reducing the number of features, PCA helps in minimizing the risk of overfitting. Overfitting occurs when the model is too complex and captures noise in the data, rather than the actual underlying pattern. PCA helps in addressing this by simplifying the feature set.

5. **Computational Efficiency**:
   - With fewer features after PCA, the computational cost of training the Logistic Regression model decreases. This makes the model training process faster and more resource-efficient, which is particularly beneficial when dealing with large datasets.

Using PCA-transformed and scaled data gives a clear improvement over the raw baseline (about 82% versus 62% accuracy) while avoiding the suspicious 100% scores, which is why we carry these preprocessing steps through the rest of our modeling pipeline. Note that the unscaled PCA data scored higher on accuracy (about 89%).
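The scaling and PCA steps discussed above were done in the preprocessing notebook. As a hedged sketch of how the same chain could be expressed in one object, the pipeline below scales, projects onto three components (the count mentioned later in this notebook), and fits a logistic regression. The choice of `StandardScaler`, the `max_iter` value, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (hypothetical)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),     # assumed scaler; fitted on training data only
    ('pca', PCA(n_components=3)),    # keep three principal components
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

explained = pipe.named_steps['pca'].explained_variance_ratio_
val_acc = pipe.score(X_va, y_va)
print('Explained variance ratio per component:', explained)
print('Validation accuracy:', val_acc)
```

The explained variance ratio shows how much of the scaled data's variance the retained components keep, which helps judge whether three components are enough.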


## Recurrent Neural Network (RNN)

In [12]:
# convert df to NumPy array
x_train_np = x_train_scaled_pca.to_numpy()
x_val_np = x_val_scaled_pca.to_numpy()
x_test_np = x_test_scaled_pca.to_numpy()

In [13]:
# sample a fraction of the data

# define the sample size
sample_size = 100000

# set the random seed so the sampled indices are reproducible
np.random.seed(42)

# sample the training data
sample_indices_train = np.random.choice(x_train_np.shape[0], size=sample_size, replace=False)
x_train_sampled = x_train_np[sample_indices_train]
y_train_sampled = y_train[sample_indices_train]

# sample the validation data independently
sample_indices_val = np.random.choice(x_val_np.shape[0], size=sample_size, replace=False)
x_val_sampled = x_val_np[sample_indices_val]
y_val_sampled = y_val[sample_indices_val]

# sample the test data independently
sample_indices_test = np.random.choice(x_test_np.shape[0], size=sample_size, replace=False)
x_test_sampled = x_test_np[sample_indices_test]
y_test_sampled = y_test[sample_indices_test]
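The cell above draws uniform random subsamples, so the class balance of each subsample can drift slightly from the full split. A possible alternative, shown as a sketch and not as the notebook's method, is a stratified subsample via `train_test_split`. The helper name `stratified_subsample` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_subsample(x, y, n, seed=42):
    """Draw n rows from (x, y) while preserving the class proportions of y."""
    x_sub, _, y_sub, _ = train_test_split(
        x, y, train_size=n, stratify=y, random_state=seed
    )
    return x_sub, y_sub


# toy arrays with roughly 25% positives (hypothetical)
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(10000, 3))
toy_y = (rng.random(10000) < 0.25).astype(int)

x_sub, y_sub = stratified_subsample(toy_x, toy_y, n=2000)
print('Positive share, full vs. subsample:', toy_y.mean(), y_sub.mean())
```

Note that `n` must be smaller than the number of available rows, since the remaining rows form the discarded part of the split.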

In [14]:
# reshape data for RNN: (samples, time_steps, features)
# each sample is treated as a single time step whose features are the principal components
x_train_rnn = x_train_sampled.reshape((x_train_sampled.shape[0], 1, x_train_sampled.shape[1]))
x_val_rnn = x_val_sampled.reshape((x_val_sampled.shape[0], 1, x_val_sampled.shape[1]))
x_test_rnn = x_test_sampled.reshape((x_test_sampled.shape[0], 1, x_test_sampled.shape[1]))

In [15]:
from keras.layers import Input, SimpleRNN, Dense
# define RNN model-building function
def build_rnn_model(hp):
    model = Sequential([
        Input(shape=(x_train_rnn.shape[1], x_train_rnn.shape[2])),
        SimpleRNN(units=hp.Int('rnn_units', min_value=50, max_value=100, step=50), activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [16]:
from keras.callbacks import EarlyStopping

# initialize Keras Tuner
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    tuner_rnn = kt.Hyperband(build_rnn_model,
                             objective='val_accuracy',
                             max_epochs=5,
                             factor=3,
                             directory='models',
                             project_name='rnn')

# define EarlyStopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Reloading Tuner from models/rnn/tuner0.json


In [17]:
# hyperparameter tuning with early stopping
tuner_rnn.search(x_train_rnn, y_train_sampled,
                 epochs=5,  
                 validation_data=(x_val_rnn, y_val_sampled),
                 callbacks=[early_stopping])

In [18]:
# retrieve the best hyperparameters found during tuning
best_hps_rnn = tuner_rnn.get_best_hyperparameters()[0]
print(f'Best Hyperparameters: {best_hps_rnn.values}')

Best Hyperparameters: {'rnn_units': 50, 'tuner/epochs': 2, 'tuner/initial_epoch': 0, 'tuner/bracket': 1, 'tuner/round': 0}


In [19]:
# build the model with the best hyperparameters
rnn_model = tuner_rnn.hypermodel.build(best_hps_rnn)

In [None]:
history_rnn = rnn_model.fit(x_train_rnn, y_train_sampled,
                            epochs=10, batch_size=16,
                            validation_data=(x_val_rnn, y_val_sampled),
                            callbacks=[early_stopping])

Epoch 1/10


In [None]:
# evaluate model on test data 
y_test_pred = rnn_model.predict(x_test_rnn)
test_predictions = (y_test_pred > 0.5).astype(int)
test_accuracy = accuracy_score(y_test_sampled, test_predictions)
test_confusion_matrix = confusion_matrix(y_test_sampled, test_predictions)
test_classification_report = classification_report(y_test_sampled, test_predictions)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

## Convolutional Neural Network (CNN)

In [None]:
# reshape data for Conv1D: (samples, steps, channels)
# we use a single step and 3 channels (the principal components)
x_train_cnn = x_train_sampled.reshape((x_train_sampled.shape[0], 1, 3))
x_val_cnn = x_val_sampled.reshape((x_val_sampled.shape[0], 1, 3))
x_test_cnn = x_test_sampled.reshape((x_test_sampled.shape[0], 1, 3))

In [None]:
# define model-building function
def build_cnn_model(hp):
    model = Sequential()
    # padding='same' keeps kernel sizes 2 and 3 valid on a length-1 sequence
    model.add(Conv1D(filters=hp.Int('filters', min_value=32, max_value=128, step=32),
                     kernel_size=hp.Choice('kernel_size', values=[1, 2, 3]),
                     padding='same',
                     activation='relu',
                     input_shape=(1, 3)))
    model.add(MaxPooling1D(pool_size=1))
    model.add(Flatten())
    model.add(Dense(units=hp.Int('units', min_value=32, max_value=128, step=32), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer=hp.Choice('optimizer', ['adam', 'sgd']),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

In [None]:
# initialize and run KerasTuner
tuner_cnn = kt.Hyperband(build_cnn_model,
                         objective='val_accuracy',
                         max_epochs=5,
                         factor=3,
                         directory='models',
                         project_name='cnn')

tuner_cnn.search(x_train_cnn, y_train_sampled,
                 epochs=10,
                 validation_data=(x_val_cnn, y_val_sampled))

In [None]:
# retrieve the best hyperparameters found during tuning
best_hps = tuner_cnn.get_best_hyperparameters()[0]
print(f'Best Hyperparameters: {best_hps.values}')

# train model
cnn_model = tuner_cnn.hypermodel.build(best_hps)
history = cnn_model.fit(x_train_cnn, y_train_sampled,
                        epochs=10, batch_size=32,
                        validation_data=(x_val_cnn, y_val_sampled))

In [None]:
# evaluate model on test set 
y_test_pred = cnn_model.predict(x_test_cnn)
test_predictions = (y_test_pred > 0.5).astype(int)
test_accuracy = accuracy_score(y_test_sampled, test_predictions)
test_confusion_matrix = confusion_matrix(y_test_sampled, test_predictions)
test_classification_report = classification_report(y_test_sampled, test_predictions)

print(f'Test Accuracy: {test_accuracy}')
print('Test Confusion Matrix:')
print(test_confusion_matrix)
print('Test Classification Report:')
print(test_classification_report)

## Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

In [None]:
import os
import json
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid
from joblib import Parallel, delayed

# define parameter grid for DBSCAN
param_grid_dbscan = {
    'eps': [0.3, 0.5, 0.7],
    'min_samples': [5, 10, 15]
}

best_score = -1
best_params = {}

In [None]:
# directory to save best parameters
directory = 'models/dbscan'
os.makedirs(directory, exist_ok=True)
    
best_params_file = os.path.join(directory, 'best_params.json')

In [None]:
def evaluate_params(params):
    try:
        model_dbscan = DBSCAN(eps=params['eps'], min_samples=params['min_samples'])
        y_train_pred = model_dbscan.fit_predict(x_train_np)

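        # score only when DBSCAN finds more than one cluster label and flags no noise points (-1);
        # every other outcome falls through to the default score of -1
        if len(set(y_train_pred)) > 1 and -1 not in set(y_train_pred):
            score = silhouette_score(x_train_np, y_train_pred)
            return (params, score)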
    except Exception as e:
        print(f"Error with parameters {params}: {e}")
    return (params, -1)

results = Parallel(n_jobs=-1)(delayed(evaluate_params)(params) for params in ParameterGrid(param_grid_dbscan))

for params, score in results:
    if score > best_score:
        best_score = score
        best_params = params
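The search above rejects any parameter set that produces noise points and computes the silhouette on every training row, which scales poorly with dataset size. A variation, sketched here under assumptions and not taken from the notebook, scores only the clustered points and uses the `sample_size` argument of `silhouette_score` to bound the cost. The function name `score_dbscan`, the `max_points` cap, and the synthetic blobs are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid


def score_dbscan(X, params, max_points=2000, seed=42):
    """Return (silhouette over non-noise points, noise fraction) for one parameter set."""
    labels = DBSCAN(**params).fit_predict(X)
    clustered = labels != -1                      # mask out noise points
    noise_fraction = float(1.0 - clustered.mean())
    if len(set(labels[clustered])) < 2:           # silhouette needs at least two clusters
        return -1.0, noise_fraction
    n_clustered = int(clustered.sum())
    score = silhouette_score(X[clustered], labels[clustered],
                             sample_size=min(max_points, n_clustered),
                             random_state=seed)
    return float(score), noise_fraction


# synthetic, well-separated blobs standing in for the PCA features (hypothetical)
X_demo, _ = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [-5, 5]],
                       cluster_std=0.4, random_state=0)

grid = list(ParameterGrid({'eps': [0.3, 0.5, 0.7], 'min_samples': [5, 10]}))
scored = [(params, *score_dbscan(X_demo, params)) for params in grid]
demo_best_params, demo_best_score, demo_noise = max(scored, key=lambda item: item[1])
print(demo_best_params, demo_best_score, demo_noise)
```

Reporting the noise fraction next to the silhouette helps avoid picking parameters that look well separated only because most points were discarded as noise.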

In [None]:
# save the best parameters and score
with open(best_params_file, 'w') as f:
    json.dump({'best_params': best_params, 'best_score': best_score}, f)

print(f'Best Parameters for DBSCAN: {best_params}')
print(f'Best Silhouette Score: {best_score}')

In [None]:
# function to load the best parameters and score
def load_best_params(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data['best_params'], data['best_score']

# loading the best parameters
loaded_params, loaded_score = load_best_params(best_params_file)
print(f'Loaded Best Parameters: {loaded_params}')
print(f'Loaded Best Silhouette Score: {loaded_score}')

In [None]:
# use the best parameters to fit DBSCAN on the full training data
dbscan_model = DBSCAN(eps=loaded_params['eps'], min_samples=loaded_params['min_samples'])
y_train_best_pred = dbscan_model.fit_predict(x_train_np)

labels_file = os.path.join(directory, 'dbscan_labels.npy')
np.save(labels_file, y_train_best_pred)

loaded_labels = np.load(labels_file)
print(f'Loaded DBSCAN Labels: {loaded_labels}')