
In [None]:
from google.colab import drive
drive.mount('/content/drive')

# <a id='toc1_'></a>[**Loan Default Prediction using Deep Learning**](#toc0_)

-----------------------------
## <a id='toc1_1_'></a>[**Project Context**](#toc0_)
-----------------------------

Safe and secure lending depends on learning from past loan outcomes. In this project, you will build a deep learning model that uses historical data to predict the probability of default for future loans. As you will see, the dataset is highly imbalanced and contains a large number of features, both of which make the problem more challenging.

-----------------------------
## <a id='toc1_2_'></a>[**Project Objectives**](#toc0_)
-----------------------------

The main objective of this project is to create a deep learning model that can accurately predict whether an applicant will be able to repay a loan based on historical data. This involves:

1. Analyzing and preprocessing the given dataset
2. Handling imbalanced data
3. Building and training a deep learning model
4. Evaluating the model using appropriate metrics

-----------------------------
## <a id='toc1_3_'></a>[**Project Dataset Description**](#toc0_)
-----------------------------

The dataset contains historical loan application data. It includes various features about loan applicants and a target variable indicating whether the loan was repaid or defaulted. The data is highly imbalanced, which presents an additional challenge for model training and evaluation.

-----------------------------------
## <a id='toc1_4_'></a>[**Project Analysis Steps To Perform**](#toc0_)
-----------------------------------

1. Load the dataset
2. Check for null values in the dataset
3. Analyze the distribution of the target variable (loan default rate)
4. Balance the dataset
5. Visualize the balanced/imbalanced data
6. Preprocess and encode the features
7. Build and train a deep learning model
8. Evaluate the model using Sensitivity and ROC AUC metrics



## <a id='toc1_5_'></a>[**1.0 Load The Dataset**](#toc0_)

### <a id='toc1_5_1_'></a>[**1.1 Setup: Import Necessary Libraries**](#toc0_)

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn tensorflow imbalanced-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import roc_auc_score, confusion_matrix

# Set random seed for reproducibility
np.random.seed(42)
import tensorflow as tf
tf.random.set_seed(42)

### <a id='toc1_5_2_'></a>[**1.2 Loading The Dataset**](#toc0_)

In [None]:
# Load the dataset (replace 'loan_data.csv' with your actual filename)
df = pd.read_csv('loan_data.csv')

# Display the first few rows and basic information about the dataset
print(df.head())
print(df.info())

## <a id='toc1_6_'></a>[**2.0 Check for null values in the dataset**](#toc0_)

In [None]:
# Check for null values
print("\nNull values in the dataset:")
print(df.isnull().sum())

#### <a id='toc1_6_1_'></a>[Explanations](#toc0_)

This code checks for and displays the count of null values in each column of the dataset.

#### <a id='toc1_6_2_'></a>[Why it's important:](#toc0_)

Identifying missing data is crucial as it can significantly impact the model's performance and may require specific handling techniques.

#### <a id='toc1_6_3_'></a>[Observations](#toc0_)

- Columns with null values (if any)
- The extent of missing data in affected columns

#### <a id='toc1_6_4_'></a>[Conclusions](#toc0_)

Understanding the presence and extent of missing data guides our data preprocessing strategies.

#### <a id='toc1_6_5_'></a>[Recommendations](#toc0_)

- For columns with a small number of nulls, consider imputation techniques
- For columns with a large number of nulls, consider dropping the column or using advanced imputation methods
- If no nulls are present, proceed with the analysis
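
The recommendations above can be sketched as a small helper. This is a minimal illustration rather than the project's prescribed method: the function name `handle_nulls`, the 50% drop threshold, the median/mode fill rules, and the demo columns are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

def handle_nulls(frame, drop_threshold=0.5):
    """Drop columns that are mostly null, then impute what remains (hypothetical helper)."""
    # Fraction of missing values per column
    null_fraction = frame.isnull().mean()
    # Columns above the threshold are dropped entirely
    to_drop = null_fraction[null_fraction > drop_threshold].index
    cleaned = frame.drop(columns=to_drop)
    for column in cleaned.columns:
        if cleaned[column].isnull().any():
            if pd.api.types.is_numeric_dtype(cleaned[column]):
                fill_value = cleaned[column].median()        # numeric: median
            else:
                fill_value = cleaned[column].mode().iloc[0]  # categorical: most frequent value
            cleaned[column] = cleaned[column].fillna(fill_value)
    return cleaned

# Tiny made-up frame to show the behaviour
demo = pd.DataFrame({
    "income": [50.0, np.nan, 70.0, 60.0],
    "contract": ["cash", "cash", None, "revolving"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})
cleaned_demo = handle_nulls(demo)
print(cleaned_demo)
```

On the real data, the same helper would be called as `df = handle_nulls(df)` before any modelling step; the threshold is worth tuning once the null counts from the cell above are known.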

## <a id='toc1_7_'></a>[**3.0 Analyze the distribution of the target variable (loan default rate)**](#toc0_)

In [None]:

# Calculate and print the percentage of defaults
default_rate = df['TARGET'].mean() * 100
print(f"\nPercentage of defaults: {default_rate:.2f}%")

# Visualize the class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='TARGET', data=df)
plt.title('Distribution of Loan Defaults')
plt.show()

#### <a id='toc1_7_1_'></a>[Explanations](#toc0_)

This code calculates the percentage of loan defaults and visualizes the distribution of the target variable.

#### <a id='toc1_7_2_'></a>[Why it's important:](#toc0_)

Understanding the class distribution is crucial for binary classification problems, as imbalanced datasets can lead to biased models.

#### <a id='toc1_7_3_'></a>[Observations](#toc0_)

- The percentage of defaults in the dataset
- Visual representation of the class imbalance

#### <a id='toc1_7_4_'></a>[Conclusions](#toc0_)

This analysis reveals whether we're dealing with a balanced or imbalanced dataset.

#### <a id='toc1_7_5_'></a>[Recommendations](#toc0_)

- If the dataset is heavily imbalanced, consider using techniques like SMOTE, undersampling, or adjusting class weights
- If relatively balanced, proceed with caution and monitor for potential bias in the model
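
As an alternative to resampling, class weights can be derived directly from the label counts. The sketch below uses scikit-learn's `compute_class_weight` on a made-up label vector (`y_demo`, 92 non-defaults and 8 defaults, is an assumption for illustration, not the project's data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 92 repaid loans (0) and 8 defaults (1)
y_demo = np.array([0] * 92 + [1] * 8)

classes = np.unique(y_demo)
# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, so that errors on the rare default class count more heavily in the loss.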

## <a id='toc1_8_'></a>[**4.0 Balance The Dataset**](#toc0_)

In [None]:
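from imblearn.over_sampling import SMOTE

# Separate features and target
X = df.drop('TARGET', axis=1)
y = df['TARGET']

# SMOTE needs numeric, NaN-free input: integer-code text columns, then median-fill (one simple option)
for column in X.select_dtypes(include=['object']).columns:
    X[column] = X[column].astype('category').cat.codes
X = X.fillna(X.median())

# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)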

#### <a id='toc1_8_1_'></a>[Explanations](#toc0_)

This code applies the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset by creating synthetic examples of the minority class.
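
To make "synthetic examples" concrete, the sketch below shows the core interpolation idea behind SMOTE: pick a minority row, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. It is an illustration only, not imbalanced-learn's implementation; the function name `smote_like_samples` and the random demo data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Interpolate between minority rows and their neighbours (needs at least two rows)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minority) - 1)
    # Ask for k + 1 neighbours because each row is typically its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)
    # Random base rows, and for each one a random neighbour from columns 1..k
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Random position between the base row (0) and the neighbour (1)
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Made-up minority class: 30 rows, 2 features
X_min_demo = np.random.default_rng(0).normal(size=(30, 2))
synthetic = smote_like_samples(X_min_demo, n_new=20)
print(synthetic.shape)
```

Because every synthetic row lies between two real minority rows, the new points stay inside the region the minority class already occupies instead of being exact duplicates.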

#### <a id='toc1_8_2_'></a>[Why it's important:](#toc0_)

Balancing the dataset helps prevent the model from being biased towards the majority class, which is crucial for fair and accurate predictions, especially in loan default scenarios.

#### <a id='toc1_8_3_'></a>[Observations](#toc0_)

- The change in the number of samples after applying SMOTE
- The new ratio of default to non-default cases

#### <a id='toc1_8_4_'></a>[Conclusions](#toc0_)

SMOTE has created a balanced dataset, which should help in training a fairer and more accurate model.

#### <a id='toc1_8_5_'></a>[Recommendations](#toc0_)

- Proceed with caution: validate the model not only on balanced data but also on a test set that keeps the original imbalanced distribution, since that is what the model will see in practice
- Consider experimenting with other balancing techniques if needed
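
One way to obtain an imbalanced test set, sketched under stated assumptions, is to split first and rebalance only the training portion. The example uses synthetic data from `make_classification` and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, so it runs without extra packages; all variable names here are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data with roughly 8% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.92, 0.08], random_state=42
)

# Split first, so the test set keeps the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Rebalance the training rows only (random oversampling of the minority class)
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print("train positives:", y_tr_bal.mean(), "test positives:", y_te.mean())
```

With imbalanced-learn, the oversampling step could instead be `SMOTE(random_state=42).fit_resample(X_tr, y_tr)`. Either way, no resampled row can leak into the test set.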

## <a id='toc1_9_'></a>[**5.0 Visualize The Balanced/Imbalanced Data**](#toc0_)

In [None]:
# Visualize the balanced data
plt.figure(figsize=(8, 6))
sns.countplot(x=y_resampled)
plt.title('Distribution of Loan Defaults After SMOTE')
plt.show()

#### <a id='toc1_8_1_1_'></a>[Explanations](#toc0_)

This code visualizes the distribution of the target variable after applying SMOTE.

#### <a id='toc1_8_1_2_'></a>[Why it's important:](#toc0_)

Visualizing the balanced dataset confirms the effectiveness of the SMOTE technique and provides a clear comparison with the original imbalanced distribution.

#### <a id='toc1_8_1_3_'></a>[Observations](#toc0_)

- The new distribution of default and non-default cases
- Comparison with the original imbalanced distribution

#### <a id='toc1_8_1_4_'></a>[Conclusions](#toc0_)

The visualization confirms that SMOTE has successfully balanced the dataset.

#### <a id='toc1_8_1_5_'></a>[Recommendations](#toc0_)

- Use this balanced dataset for model training
- Keep the original imbalanced distribution in mind when interpreting model performance on real-world data

## <a id='toc1_9_'></a>[**6.0 Pre-process and encode the features**](#toc0_)

### <a id='toc1_9_1_'></a>[**Part_6_1**](#toc0_)

In [None]:
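# Encode any categorical columns still present in the resampled features,
# since X_resampled (not X) is what the model is trained on
from sklearn.preprocessing import LabelEncoder
for column in X_resampled.select_dtypes(include=['object']).columns:
    # A fresh encoder per column keeps each column's label mapping separate
    X_resampled[column] = LabelEncoder().fit_transform(X_resampled[column].astype(str))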

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#### <a id='toc1_9_1_'></a>[Explanations](#toc0_)

This code preprocesses the data by encoding categorical variables, splitting the data into training and testing sets, and scaling the features.

#### <a id='toc1_9_2_'></a>[Why it's important:](#toc0_)

Proper preprocessing ensures that the data is in a suitable format for the deep learning model and that the model's performance can be accurately evaluated.

#### <a id='toc1_9_3_'></a>[Observations](#toc0_)

- The transformation of categorical variables into numerical format
- The split of data into training and testing sets
- The standardization of features to zero mean and unit variance

#### <a id='toc1_9_4_'></a>[Conclusions](#toc0_)

The data is now properly encoded, split, and scaled, ready for model training.

#### <a id='toc1_9_5_'></a>[Recommendations](#toc0_)

- Ensure that the same preprocessing steps are applied to any new data used for predictions
- Consider using cross-validation for a more robust evaluation of the model's performance
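
Both recommendations can be sketched with a scikit-learn pipeline. Wrapping the scaler and the estimator together means the scaler is refit on each training fold and applied unchanged to the held-out fold and to any new data. This is a minimal illustration on synthetic data; `LogisticRegression` is only a lightweight stand-in for the Keras model, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Scaling lives inside the pipeline, so it is learned from training folds only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```

For new applications, the fitted `scaler` from the cell above plays the same role: call `scaler.transform(...)` on new rows, never `fit_transform`.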

## <a id='toc1_10_'></a>[**7.0 Build and train a deep learning model**](#toc0_)

### <a id='toc1_10_1_'></a>[**Part_7_1**](#toc0_)

In [None]:
# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=1)

#### <a id='toc1_10_1_'></a>[Explanations](#toc0_)

This code defines a deep learning model architecture, compiles the model with appropriate loss function and optimizer, and trains the model on the preprocessed data.

#### <a id='toc1_10_2_'></a>[Why it's important:](#toc0_)

Building and training the model is the core of the project, where the patterns in the data are learned to make predictions on loan defaults.

#### <a id='toc1_10_3_'></a>[Observations](#toc0_)

- The model architecture (number of layers, neurons, activation functions)
- The training process (number of epochs, batch size)
- The training and validation accuracy/loss over epochs

#### <a id='toc1_10_4_'></a>[Conclusions](#toc0_)

The model has been trained on the balanced dataset. How well it predicts loan defaults still has to be confirmed on the held-out test set.

#### <a id='toc1_10_5_'></a>[Recommendations](#toc0_)

- Monitor the training process for signs of overfitting or underfitting
- Experiment with different architectures or hyperparameters if the performance is not satisfactory
- Consider using techniques like early stopping to prevent overfitting
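
Early stopping can be sketched with Keras's `EarlyStopping` callback. The example trains a deliberately tiny model on synthetic data so that it runs quickly; the data, the layer sizes, and the `patience=3` setting are assumptions for illustration rather than tuned values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

tf.random.set_seed(42)
rng = np.random.default_rng(42)

# Synthetic stand-in data: 500 rows, 10 features, binary target
X_demo = rng.normal(size=(500, 10)).astype("float32")
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype("float32")

demo_model = Sequential([
    Input(shape=(10,)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])
demo_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once validation loss has not improved for 3 epochs and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

demo_history = demo_model.fit(
    X_demo, y_demo,
    epochs=20, batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=0,
)
epochs_run = len(demo_history.history["loss"])
print(f"Training ran for {epochs_run} of 20 epochs")
```

In the training cell above, the same callback would be added by passing `callbacks=[early_stop]` to `model.fit(...)`; `epochs=50` then acts as an upper bound rather than a fixed count.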