<a href="https://colab.research.google.com/github/toddwalters/pgaiml-python-coding-examples/blob/main/deep-learning/projects/automatingPortOperations/1714053668_ToddWalters_project_automating_port_operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# <a id='toc1_'></a>[**Loan Default Prediction using Deep Learning**](#toc0_)

-----------------------------
## <a id='toc1_1_'></a>[**Project Context**](#toc0_)
-----------------------------

For a safe and secure lending experience, it is important to analyze past lending data. In this project, you will build a deep learning model that predicts the chance of default for future loans from historical data. As you will see, the dataset is highly imbalanced and has a large number of features, both of which make the problem more challenging.

-----------------------------
## <a id='toc1_2_'></a>[**Project Objectives**](#toc0_)
-----------------------------

The main objective of this project is to create a deep learning model that can accurately predict whether an applicant will be able to repay a loan based on historical data. This involves:

1. Analyzing and preprocessing the given dataset
2. Handling imbalanced data
3. Building and training a deep learning model
4. Evaluating the model using appropriate metrics

-----------------------------
## <a id='toc1_3_'></a>[**Project Dataset Description**](#toc0_)
-----------------------------

The dataset contains historical loan application data. It includes various features about loan applicants and a target variable indicating whether the loan was repaid or defaulted. The data is highly imbalanced, which presents an additional challenge for model training and evaluation.

-----------------------------------
## <a id='toc1_4_'></a>[**Project Analysis Steps To Perform**](#toc0_)
-----------------------------------

1. Load the dataset
2. Check for null values in the dataset
3. Analyze the distribution of the target variable (loan default rate)
4. Balance the dataset
5. Visualize the balanced/imbalanced data
6. Preprocess and encode the features
7. Build and train a deep learning model
8. Evaluate the model using Sensitivity and ROC AUC metrics
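
As a rough illustration of the two metrics named in step 8, the sketch below computes sensitivity (recall on the default class) from a confusion matrix and ROC AUC from predicted probabilities. The labels and scores are made-up toy values, not results from this dataset, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and predicted default probabilities (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.8, 0.3, 0.1, 0.9])

# Threshold the probabilities to get hard class predictions (0.5 is an assumption)
y_pred = (y_prob >= 0.5).astype(int)

# Sensitivity = TP / (TP + FN): the share of actual defaults the model catches
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)

# ROC AUC is threshold-free and uses the raw probabilities
auc = roc_auc_score(y_true, y_prob)

print(f"Sensitivity: {sensitivity:.3f}, ROC AUC: {auc:.3f}")
```

Sensitivity matters here because a missed default (false negative) is usually the costly error in lending, while ROC AUC summarizes ranking quality across all thresholds.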



## <a id='toc1_5_'></a>[**1.0 Load The Dataset**](#toc0_)

### <a id='toc1_5_1_'></a>[**1.1 Setup: Import Necessary Libraries**](#toc0_)

In [None]:
%pip install pandas numpy matplotlib seaborn scikit-learn tensorflow imbalanced-learn

In [None]:
import math  # used for the subplot grid calculation in section 2.3.2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE  # used to balance the dataset in section 4.0
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout
# from tensorflow.keras.optimizers import Adam
from sklearn.metrics import roc_auc_score, confusion_matrix

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

### <a id='toc1_5_2_'></a>[**1.2 Loading The Dataset**](#toc0_)

In [None]:
# Load the dataset (replace 'loan_data.csv' with your actual filename)
df = pd.read_csv('/Users/toddwalters/Development/data/1688644938_dataset/1688644938_loan_data.csv')

In [None]:
# Display the first few rows and basic information about the dataset
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
with pd.option_context('display.max_rows', None):
    print(df.dtypes)

In [None]:
# print(f"\nThe df.describe output is:\n")
# with pd.option_context('display.max_rows', None):
#     print(df.describe().transpose())

print(f"\nThe df.describe output is:\n")
print(df.describe().transpose())

## <a id='toc1_6_'></a>[**2.0  Check for null values in the dataset**](#toc0_)

In [None]:
# Check for null values
print("\nNull values in the dataset:")
with pd.option_context('display.max_rows', None):
    print(df.isnull().sum().sort_values(ascending=False))

### <a id='toc1_6_1_'></a>[**2.1  Drop Features With More Than 100K Null Values**](#toc0_)

In [None]:
# Drop features with large amounts of missing data (more than 100K null values)
df.drop(columns=[
    'COMMONAREA_MEDI', 'COMMONAREA_AVG', 'COMMONAREA_MODE',
    'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI',
    'FONDKAPREMONT_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MEDI',
    'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
    'YEARS_BUILD_MEDI', 'YEARS_BUILD_MODE', 'YEARS_BUILD_AVG',
    'OWN_CAR_AGE', 'LANDAREA_MEDI', 'LANDAREA_MODE', 'LANDAREA_AVG',
    'BASEMENTAREA_MEDI', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE',
    'EXT_SOURCE_1', 'NONLIVINGAREA_MODE', 'NONLIVINGAREA_AVG', 'NONLIVINGAREA_MEDI',
    'ELEVATORS_MEDI', 'ELEVATORS_AVG', 'ELEVATORS_MODE',
    'WALLSMATERIAL_MODE', 'APARTMENTS_MEDI', 'APARTMENTS_AVG', 'APARTMENTS_MODE',
    'ENTRANCES_MEDI', 'ENTRANCES_AVG', 'ENTRANCES_MODE',
    'LIVINGAREA_AVG', 'LIVINGAREA_MODE', 'LIVINGAREA_MEDI',
    'HOUSETYPE_MODE', 'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'FLOORSMAX_AVG',
    'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_AVG',
    'TOTALAREA_MODE', 'EMERGENCYSTATE_MODE'
], inplace=True)
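
The hard-coded list above could also be derived from a null-count threshold. The following is a minimal sketch on a toy frame; `toy`, its column names, and `null_threshold` are hypothetical stand-ins, and on the real data the cut-off would be 100,000.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (hypothetical columns)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1, 2, 3, 4],
    "one_missing": [1.0, np.nan, 3.0, 4.0],
})

null_threshold = 2  # the notebook's cut-off is 100,000 on the full dataset

# Collect the columns whose null count exceeds the threshold
null_counts = toy.isnull().sum()
cols_to_drop = null_counts[null_counts > null_threshold].index.tolist()

# Drop them without mutating the original frame
toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Deriving the list this way keeps the drop step in sync with the data if the input file changes, at the cost of being less explicit than a written-out list.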

### <a id='toc1_6_3_'></a>[**2.3 KNN Imputation of Numerical Features**](#toc0_)

1. **`create_availability_flags` Function**
  
    - **Input**: DataFrame and a list of column names
    - **Process**:
      - For each specified column, creates a new column with the suffix "_FLAG"
      - Flag is 1 if the original value is not null, 0 if null
    - **Purpose**: Tracks which values were originally missing before imputation

2. **Creation of Flags for Credit Bureau Features**

    - Defines a list of credit bureau features
    - Calls `create_availability_flags` to create flag columns for these features
    - Displays the resulting flag columns using `df.info()`

3. **`knn_impute_features` Function**

    - Scales input features using StandardScaler
    - Performs KNN imputation using sklearn's KNNImputer
    - Scales imputed values back to original range
    - Updates original DataFrame with imputed values

4. **KNN Imputation Process**

    - Creates a list of all features to impute (credit bureau features and others)
    - Displays DataFrame info before imputation
    - Performs KNN imputation using `knn_impute_features` function
    - Saves imputed DataFrame to CSV file
    - Displays DataFrame info after imputation

5. **Summary Statistics**

   - For each imputed feature, prints summary statistics using `describe()`

6. **Visualization**

    - Creates a grid of subplots, one for each imputed feature
    - Calculates number of rows and columns based on feature count
    - For each feature:
      - Plots a histogram showing distribution after imputation
    - Removes any unused subplots
    - Displays the resulting plot

**Overall Approach**

This code provides a comprehensive method for handling missing data:

1. Flags originally missing values (Section 2.3.1)
2. Imputes missing values using KNN
3. Provides summary statistics of imputed data
4. Visualizes distribution of imputed features

This approach ensures transparency in the imputation process and aids in understanding the impact of imputation on data distribution.
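
To make the flag, scale, impute, and unscale sequence concrete, here is a small sketch on a toy frame. The column names `a` and `b`, the values, and `n_neighbors=2` are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data with one gap in each column (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# 1. Flag which values were present before imputation
toy["a_FLAG"] = toy["a"].notnull().astype(int)

# 2. Scale (NaNs are ignored when fitting and kept when transforming)
features = ["a", "b"]
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[features])

# 3. Impute in the scaled space so both columns contribute comparably to distances
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)

# 4. Map the values back to the original units
toy[features] = scaler.inverse_transform(imputed)
print(toy)
```

Scaling first matters because KNN distances would otherwise be dominated by whichever feature has the largest numeric range.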


#### <a id='toc1_6_3_1_'></a>[**2.3.1 Create Feature Flag Categories**](#toc0_)

In [None]:
def create_availability_flags(df, columns):
    """
    Create binary flags indicating data availability for specified columns.
    
    :param df: pandas DataFrame
    :param columns: list of column names to create flags for
    :return: DataFrame with added flag columns
    """
    for col in columns:
        flag_col_name = f"{col}_FLAG"
        df[flag_col_name] = (~df[col].isnull()).astype(int)
    return df

# List of AMT_REQ_CREDIT_BUREAU features
credit_bureau_features = [
    'AMT_REQ_CREDIT_BUREAU_HOUR',
    'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_MON',
    'AMT_REQ_CREDIT_BUREAU_QRT',
    'AMT_REQ_CREDIT_BUREAU_YEAR'
]
# Create availability flags
df = create_availability_flags(df, credit_bureau_features)

# Display info for flag columns
print("\nFlag columns:")
print(df[[col + '_FLAG' for col in credit_bureau_features]].info())

#### <a id='toc1_6_3_2_'></a>[**2.3.2 KNN Feature Imputation**](#toc0_)

In [None]:
def knn_impute_features(df, features, n_neighbors=5):
    """
    Perform KNN imputation on specified columns.
    
    :param df: pandas DataFrame
    :param features: list of column names to impute
    :param n_neighbors: number of neighbors to use for KNN imputation
    :return: DataFrame with imputed values
    """
    # Prepare data for KNN imputation
    imputer = KNNImputer(n_neighbors=n_neighbors)
    scaler = StandardScaler()
    
    # Scale the features
    scaled_features = scaler.fit_transform(df[features])
    
    # Perform KNN imputation
    imputed_features = imputer.fit_transform(scaled_features)
    
    # Convert back to original scale
    imputed_features = scaler.inverse_transform(imputed_features)
    
    # Update the DataFrame with imputed values
    for i, col in enumerate(features):
        df[col] = imputed_features[:, i]
    
    return df

# List of all features that need KNN imputation
all_features_to_impute = credit_bureau_features + [
    'EXT_SOURCE_3',
    'OBS_30_CNT_SOCIAL_CIRCLE',
    'DEF_30_CNT_SOCIAL_CIRCLE',
    'OBS_60_CNT_SOCIAL_CIRCLE',
    'DEF_60_CNT_SOCIAL_CIRCLE',
    'EXT_SOURCE_2'
    # Add any other features that need imputation here
]

# Display info before imputation
print("Before imputation:")
print(df[all_features_to_impute].info())

# Perform KNN imputation
df = knn_impute_features(df, all_features_to_impute)

# Save the imputed dataframe if needed
df.to_csv('/Users/toddwalters/Development/data/1688644938_dataset/loan_data_imputed.csv', index=False)

print("\nImputation complete. The specified features have been updated in the main dataframe.\n")

# Display info after imputation
print("\nAfter imputation:")
print(df[all_features_to_impute].info())

# Optional: Print summary statistics for each imputed feature
for feature in all_features_to_impute:
    print(f"\nSummary statistics for {feature}:")
    print(df[feature].describe())

# Calculate the number of rows and columns needed for the subplots
n_features = len(all_features_to_impute)
n_cols = min(3, n_features)  # Max 3 columns
n_rows = math.ceil(n_features / n_cols)

# Create subplots
fig, axs = plt.subplots(n_rows, n_cols, figsize=(5*n_cols, 4*n_rows))
axs = axs.ravel() if n_features > 1 else [axs]

# Plot histograms for each feature
for i, feature in enumerate(all_features_to_impute):
    axs[i].hist(df[feature], bins=50)
    axs[i].set_title(f'Distribution of {feature}\nafter imputation')
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')

# Remove any unused subplots
for j in range(i+1, len(axs)):
    fig.delaxes(axs[j])

plt.tight_layout()
plt.show()


### <a id='toc1_6_4_'></a>[**2.4 Imputation of Missing Categorical Data**](#toc0_)

In [None]:
def impute_categorical(df, column_name, strategy='proportional', missing_value=None, new_category_name='Unknown', distribute_ratio=0.8):
    """
    Impute missing values in a categorical column using various strategies.
    
    Parameters:
    - df: pandas DataFrame
    - column_name: str, name of the column to impute
    - strategy: str, one of 'proportional', 'new_category', 'mode'
    - missing_value: value to be considered as missing (if None, will use pd.isnull())
    - new_category_name: str, name of the new category if using 'new_category' strategy
    - distribute_ratio: float, ratio of missing values to distribute when using 'proportional' strategy
    
    Returns:
    - DataFrame with imputed values
    """
    # Create a copy of the dataframe to avoid modifying the original
    df = df.copy()
    
    # Identify missing values
    if missing_value is None:
        missing_mask = df[column_name].isnull()
    else:
        missing_mask = df[column_name] == missing_value
    
    missing_count = missing_mask.sum()
    
    if strategy == 'proportional':
        # Calculate the distribution of non-missing values
        value_counts = df[~missing_mask][column_name].value_counts()
        total_non_missing = value_counts.sum()
        
        # Calculate the number of values to distribute
        distribute_count = int(missing_count * distribute_ratio)
        other_count = missing_count - distribute_count
        
        # Calculate the number of new values for each category
        new_values = (value_counts / total_non_missing * distribute_count).round().astype(int)
        
        # Adjust to ensure we have exactly the right number
        while new_values.sum() + other_count < missing_count:
            new_values[new_values.idxmax()] += 1
        while new_values.sum() + other_count > missing_count:
            new_values[new_values.idxmax()] -= 1
        
        # Create a list of all new values
        all_new_values = []
        for category, count in new_values.items():
            all_new_values.extend([category] * count)
        
        # Add the new category for remaining values
        all_new_values.extend([new_category_name] * other_count)
        
        # Shuffle and assign new values
        np.random.shuffle(all_new_values)
        df.loc[missing_mask, column_name] = all_new_values
        
    elif strategy == 'new_category':
        # Simply replace all missing values with the new category name
        df.loc[missing_mask, column_name] = new_category_name
        
    elif strategy == 'mode':
        # Replace missing values with the most frequent category
        mode_value = df[~missing_mask][column_name].mode().iloc[0]
        df.loc[missing_mask, column_name] = mode_value
    
    else:
        raise ValueError("Invalid strategy. Choose 'proportional', 'new_category', or 'mode'.")
    
    return df

#### <a id='toc1_6_4_1'></a>[**2.4.1 Identify Categorical Features With Missing Values**](#toc0_)

In [None]:
# Check for null values
print("\nFeatures with non-zero Null values and their data types:")
null_counts = df.isnull().sum()
non_zero_nulls = null_counts[null_counts > 0].sort_values(ascending=False)

# Get data types for features with non-zero null values
data_types = df.dtypes[non_zero_nulls.index]

with pd.option_context('display.max_rows', None):
    # Combine the non-zero nulls with their data types for display
    combined_info = pd.DataFrame({'Null Counts': non_zero_nulls, 'Data Type': data_types})
    print(combined_info)

In [None]:
print(f"List of Features of Type Object:\n")
print(df.select_dtypes(include=['object']).columns)

# print(f"\nUnique values in OCCUPATION_TYPE: {df['OCCUPATION_TYPE'].unique()}")
# print(f"\nUnique values in NAME_TYPE_SUITE: {df['NAME_TYPE_SUITE'].unique()}")

for column in df.select_dtypes(include=['object']).columns:
    print(f"\nUnique values in {column}: {df[column].unique()}")

#### <a id='toc1_6_4_2'></a>[**2.4.2 Investigate Categorical Features With Missing Values**](#toc0_)

In [None]:
# Label missing values as 'Missing' so they appear in the count plots below
# and can be targeted by the imputation steps in 2.4.3 and 2.4.4
df = df.copy()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].fillna('Missing')
df['NAME_TYPE_SUITE'] = df['NAME_TYPE_SUITE'].fillna('Missing')


# List of columns to plot
columns_to_plot = ['OCCUPATION_TYPE', 'NAME_TYPE_SUITE']

# Plotting
for column in columns_to_plot:
    plt.figure(figsize=(15, 9))  # Adjust the size as needed
    ax = sns.countplot(data=df, x=column, order=df[column].value_counts().index, hue=column, palette='viridis')
    plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees
    # Annotate each bar with its total
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=11, color='black', rotation=0, xytext=(0, 10), textcoords='offset points')




#### <a id='toc1_6_4_3'></a>[**2.4.3 Imputation of Missing Categorical Data Within OCCUPATION_TYPE Feature**](#toc0_)

Given this distribution, here's an approach to distribute the missing values across the other categories while maintaining the overall structure of the data:

1. We'll use a proportional distribution method, but with a slight modification to account for the large number of missing values.
2. Instead of directly distributing all missing values, we'll distribute a portion of them (e.g., 80%) across the existing categories based on their current proportions. This helps maintain the general distribution while not overly inflating the existing categories.
3. The remaining portion (e.g., 20%) will be assigned to a new category called "Other" or "Unspecified". This accounts for the possibility that some of these missing values might genuinely be unknown or not fit into existing categories.
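
The arithmetic behind this split can be sketched with made-up numbers. The category counts and the missing-row count below are hypothetical; only the 80/20 ratio comes from the text.

```python
import pandas as pd

# Hypothetical counts of known categories and of missing rows
value_counts = pd.Series({"Laborers": 600, "Sales staff": 300, "Drivers": 100})
missing_count = 200
distribute_ratio = 0.8

# 80% of the missing rows are spread over the known categories ...
distribute_count = int(missing_count * distribute_ratio)
# ... and the remainder goes to the new "Other" category
other_count = missing_count - distribute_count

# Each known category receives a share proportional to its current frequency
shares = (value_counts / value_counts.sum() * distribute_count).round().astype(int)
print(shares.to_dict(), other_count)
```

With these numbers, 160 rows are distributed 96/48/16 across the three known categories and 40 rows become "Other", so the relative sizes of the existing categories are preserved.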

In [None]:
# Before imputation
print("Before imputation:")
print(df['OCCUPATION_TYPE'].value_counts(dropna=False))

# Apply the imputation for OCCUPATION_TYPE
df = impute_categorical(df, 'OCCUPATION_TYPE', strategy='proportional', 
                        missing_value='Missing', new_category_name='Other', 
                        distribute_ratio=0.8)

# After imputation
print("\nAfter imputation:")
print(df['OCCUPATION_TYPE'].value_counts(dropna=False))

# Visualize the new distribution
columns_to_plot = ['OCCUPATION_TYPE']

for column in columns_to_plot:
    plt.figure(figsize=(15, 9))  # Adjust the size as needed
    ax = sns.countplot(data=df, x=column, order=df[column].value_counts().index, hue=column, palette='viridis')
    plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees
    # Annotate each bar with its total
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=11, color='black', rotation=0, xytext=(0, 10), textcoords='offset points')

#### <a id='toc1_6_4_4'></a>[**2.4.4 Imputation of Missing Categorical Data Within NAME_TYPE_SUITE Feature**](#toc0_)

1. Using the 'proportional' strategy to distribute the missing values across existing categories based on their current proportions.
2. Setting `distribute_ratio=0.95`, which means 95% of the missing values will be distributed proportionally among existing categories, and only 5% will be assigned to the new 'Other_C' category.
3. This approach will:

    - Maintain the overall distribution of the data.
    - Assign most of the missing values to 'Unaccompanied', reflecting the dominant trend in the data.
    - Still allow for some diversity by assigning smaller portions to other categories.
    - Create a small 'Other_C' category to account for any truly unknown or unique cases.

In [None]:
# Before imputation
print("Before imputation:")
print(df['NAME_TYPE_SUITE'].value_counts(dropna=False))

# Apply the imputation
df = impute_categorical(df, 'NAME_TYPE_SUITE', strategy='proportional', 
                        missing_value='Missing', new_category_name='Other_C', 
                        distribute_ratio=0.95)

# After imputation
print("\nAfter imputation:")
print(df['NAME_TYPE_SUITE'].value_counts(dropna=False))

# Visualize the new distribution
columns_to_plot = ['NAME_TYPE_SUITE']

for column in columns_to_plot:
    plt.figure(figsize=(15, 9))  # Adjust the size as needed
    ax = sns.countplot(data=df, x=column, order=df[column].value_counts().index, hue=column, palette='viridis')
    plt.xticks(rotation=45)  # Rotate x-axis labels by 45 degrees
    # Annotate each bar with its total
    for p in ax.patches:
        ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=11, color='black', rotation=0, xytext=(0, 10), textcoords='offset points')

### <a id='toc1_6_5_'></a>[**2.5 Mean/Median Imputation for Numerical Features**](#toc0_)

1. Defines a function impute_mean_median that:

    - Takes a DataFrame and a dictionary specifying features and their imputation methods.
    - Imputes each feature using the specified method (mean or median).
    - Keeps track of the imputation process for each feature.
    - Returns the imputed DataFrame and a summary of the imputation process.

2. Specifies the features to impute and their respective methods in a dictionary.
3. Performs the imputation using the `impute_mean_median` function.
4. Displays a summary of the imputation process, showing:

    - The feature name
    - The imputation method used
    - The value used for imputation
    - The number of null values before and after imputation

5. Creates a visualization showing the distribution of each imputed feature after imputation.
6. Optionally prints summary statistics for each imputed feature.

This approach allows for flexibility in choosing the imputation method for each feature while providing transparency about the imputation process. The visualization and summary statistics help in understanding the impact of imputation on the data distribution.
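
The choice between mean and median matters most for skewed features. The sketch below uses a hypothetical series with one extreme value to show how far the two fill values can diverge.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed amounts with one extreme value and one gap
amounts = pd.Series([100.0, 120.0, 130.0, 150.0, 10_000.0, np.nan])

mean_value = amounts.mean()      # pulled upward by the outlier
median_value = amounts.median()  # stays near the typical value

# Fill the gap with each candidate value
filled_mean = amounts.fillna(mean_value)
filled_median = amounts.fillna(median_value)

print(f"mean fill: {mean_value}, median fill: {median_value}")
```

Here the mean fill is 2100.0 while the median fill is 130.0, which is one reason the median is often preferred for monetary amounts with long right tails.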

In [None]:
def impute_mean_median(df, features_dict):
    """
    Impute specified features using either mean or median.
    
    :param df: pandas DataFrame
    :param features_dict: dictionary with feature names as keys and 'mean' or 'median' as values
    :return: DataFrame with imputed values and a summary of imputation
    """
    summary = []
    
    for feature, method in features_dict.items():
        original_null_count = df[feature].isnull().sum()
        
        if method == 'mean':
            impute_value = df[feature].mean()
        elif method == 'median':
            impute_value = df[feature].median()
        else:
            raise ValueError(f"Invalid method for {feature}. Use 'mean' or 'median'.")
        
        # Assign back rather than using inplace=True on a column selection
        df[feature] = df[feature].fillna(impute_value)
        
        summary.append({
            'Feature': feature,
            'Method': method,
            'Imputed Value': impute_value,
            'Null Count Before': original_null_count,
            'Null Count After': df[feature].isnull().sum()
        })
    
    return df, pd.DataFrame(summary)

# Specify features and imputation methods
features_to_impute = {
    'AMT_GOODS_PRICE': 'median',
    'AMT_ANNUITY': 'median',
    'CNT_FAM_MEMBERS': 'median',
    'DAYS_LAST_PHONE_CHANGE': 'median'
}

# Perform imputation
df, imputation_summary = impute_mean_median(df, features_to_impute)

# Display imputation summary
print("Imputation Summary:")
print(imputation_summary)

# Visualize distributions after imputation
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs = axs.ravel()

for i, feature in enumerate(features_to_impute.keys()):
    axs[i].hist(df[feature], bins=50, alpha=0.7, label='After Imputation')
    axs[i].set_title(f'Distribution of {feature}')
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')
    axs[i].legend()

plt.tight_layout()
plt.show()

# Optional: Print summary statistics for each imputed feature
for feature in features_to_impute.keys():
    print(f"\nSummary statistics for {feature}:")
    print(df[feature].describe())


### <a id='toc1_6_6_'></a>[**2.6 Re-Check For Null Values In The Dataset**](#toc0_)

In [None]:
# Check for null values
print("\nFeatures with non-zero Null values and their data types:")
null_counts = df.isnull().sum()
non_zero_nulls = null_counts[null_counts > 0].sort_values(ascending=False)

# Get data types for features with non-zero null values
data_types = df.dtypes[non_zero_nulls.index]

with pd.option_context('display.max_rows', None):
    # Combine the non-zero nulls with their data types for display
    combined_info = pd.DataFrame({'Null Counts': non_zero_nulls, 'Data Type': data_types})
    print(combined_info)

#### <a id='toc1_6_1_'></a>[Explanations](#toc0_)

This code checks for and displays the count of null values in each column of the dataset.

#### <a id='toc1_6_2_'></a>[Why it's important:](#toc0_)

Understanding the extent and distribution of missing data is crucial because:

1. It affects the quality and reliability of our analysis and model predictions.
2. It guides our data preprocessing strategy, including decisions on imputation or feature dropping.
3. Missing data patterns might provide insights into data collection processes or inherent characteristics of certain variables.

#### <a id='toc1_6_3_'></a>[Observations](#toc0_)

1. Many features have no missing values, including key variables like **TARGET** (***our prediction goal***) and basic applicant information.
2. Several features have a significant number of missing values:

    - **OWN_CAR_AGE**: 202,929 missing values
    - **OCCUPATION_TYPE**: 96,391 missing values
    - **EXT_SOURCE_1**: 173,378 missing values
    - **EXT_SOURCE_3**: 60,965 missing values

3. Most features related to building characteristics (e.g., **APARTMENTS_AVG, BASEMENTAREA_AVG**, etc.) have a large number of missing values, ranging from about 150,000 to 215,000.
4. Credit bureau request features (**AMT_REQ_CREDIT_BUREAU_***) all have 41,519 missing values each.
5. Some potentially important features like **AMT_ANNUITY** (*12 missing*) and **AMT_GOODS_PRICE** have only a small number of null values.

#### <a id='toc1_6_4_'></a>[Conclusions](#toc0_)

1. The dataset has a mixed pattern of missing values, with some features being complete and others having significant gaps.
2. The high number of missing values in building-related features suggests these might not be applicable or available for all loan applications.
3. The consistent number of missing values in credit bureau request features indicates a systematic reason for their absence, possibly related to data availability or collection processes.
4. Core features for loan assessment (e.g., income, credit amount, target variable) are largely complete, which is positive for our analysis.

#### <a id='toc1_6_5_'></a>[Recommendations](#toc0_)

1. For features with a small number of missing values (e.g., **AMT_ANNUITY, AMT_GOODS_PRICE**): consider using imputation techniques like mean, median, or advanced methods like KNN imputation.
2. For features with a large number of missing values:
    1. If they're deemed crucial (like **EXT_SOURCE** variables), consider advanced imputation techniques or creating a "missing" category.
    2. If they're less important or redundant (like many of the building-related features), consider dropping them or creating aggregate features that combine information from related columns.
3. For credit bureau request features: consider creating a binary flag indicating whether this information was available, in addition to any imputation strategy.
4. Analyze the impact of missing values on the target variable to ensure that dropping or imputing doesn't introduce bias into the model.
5. Document all decisions made regarding handling of missing data, as this will be crucial for model interpretation and future data processing.
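
Recommendation 4 can be checked with a simple group comparison. The sketch below compares the default rate between rows where a feature is missing and rows where it is present; the toy values are hypothetical and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "EXT_SOURCE_3": [0.5, np.nan, 0.7, np.nan, 0.2, np.nan, 0.9, 0.4],
    "TARGET":       [0,   1,      0,   1,      0,   0,      0,   1],
})

# Label each row by whether the feature is missing
status = toy["EXT_SOURCE_3"].isnull().map({True: "missing", False: "present"})

# Default rate within each group
rate_by_status = toy.groupby(status)["TARGET"].mean()
print(rate_by_status)
```

A large gap between the two rates would suggest that missingness itself carries signal, which supports keeping an availability flag alongside any imputed value.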

#### <a id='toc1_6_5_'></a>[Decisions](#toc0_)

1.  I used median imputation on the following features with null values:

    > 'AMT_GOODS_PRICE', 'AMT_ANNUITY', 'CNT_FAM_MEMBERS', 'DAYS_LAST_PHONE_CHANGE'

2. I used a proportional imputation technique, combined with the creation of an "Other" category, on the following categorical features with missing data:

    > 'OCCUPATION_TYPE', 'NAME_TYPE_SUITE'

3.  I dropped any feature with more than 100K null values:

    > 'COMMONAREA_MEDI', 'COMMONAREA_AVG', 'COMMONAREA_MODE',
    > 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MEDI',
    > 'FONDKAPREMONT_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MEDI',
    > 'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI',
    > 'YEARS_BUILD_MEDI', 'YEARS_BUILD_MODE', 'YEARS_BUILD_AVG',
    > 'OWN_CAR_AGE', 'LANDAREA_MEDI', 'LANDAREA_MODE', 'LANDAREA_AVG',
    > 'BASEMENTAREA_MEDI', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE',
    > 'EXT_SOURCE_1', 'NONLIVINGAREA_MODE', 'NONLIVINGAREA_AVG', 'NONLIVINGAREA_MEDI',
    > 'ELEVATORS_MEDI', 'ELEVATORS_AVG', 'ELEVATORS_MODE',
    > 'WALLSMATERIAL_MODE', 'APARTMENTS_MEDI', 'APARTMENTS_AVG', 'APARTMENTS_MODE',
    > 'ENTRANCES_MEDI', 'ENTRANCES_AVG', 'ENTRANCES_MODE',
    > 'LIVINGAREA_AVG', 'LIVINGAREA_MODE', 'LIVINGAREA_MEDI',
    > 'HOUSETYPE_MODE', 'FLOORSMAX_MODE', 'FLOORSMAX_MEDI', 'FLOORSMAX_AVG',
    > 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_AVG',
    > 'TOTALAREA_MODE', 'EMERGENCYSTATE_MODE'

4. I created a flag column for each credit bureau request feature that indicates whether this information was available.

5. I used KNN imputation on the following features that have null values:

    > 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'EXT_SOURCE_2'


## <a id='toc1_7_'></a>[**3.0 Analyze the distribution of the target variable (loan default rate)**](#toc0_)

In [None]:

# Calculate and print the percentage of defaults
default_rate = df['TARGET'].mean() * 100
print(f"\nPercentage of defaults: {default_rate:.2f}%")

# Visualize the class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='TARGET', data=df)
plt.title('Distribution of Loan Defaults')
plt.show()

#### <a id='toc1_7_1_'></a>[Explanations](#toc0_)

This code calculates the percentage of loan defaults and visualizes the distribution of the target variable.

#### <a id='toc1_7_2_'></a>[Why it's important:](#toc0_)

Understanding the class distribution is crucial for binary classification problems, as imbalanced datasets can lead to biased models.

#### <a id='toc1_7_3_'></a>[Observations](#toc0_)

- The percentage of defaults in the dataset
- Visual representation of the class imbalance

#### <a id='toc1_7_4_'></a>[Conclusions](#toc0_)

This analysis reveals whether we're dealing with a balanced or imbalanced dataset.

#### <a id='toc1_7_5_'></a>[Recommendations](#toc0_)

- If the dataset is heavily imbalanced, consider using techniques like SMOTE, undersampling, or adjusting class weights
- If relatively balanced, proceed with caution and monitor for potential bias in the model
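
Of the options above, class weighting is the one that leaves the data untouched. The following sketch computes "balanced" weights with scikit-learn on hypothetical labels; the resulting dictionary has the shape Keras accepts through the `class_weight` argument of `model.fit`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 9 non-defaults, 1 default
y_toy = np.array([0] * 9 + [1])

# "balanced" weights are n_samples / (n_classes * count_per_class)
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Convert to a plain dict keyed by class label
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

With these labels the minority class receives a weight of 5.0 and the majority class about 0.56, so each misclassified default counts roughly nine times as much in the loss.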

## <a id='toc1_8_'></a>[**4.0 Balance The Dataset**](#toc0_)

In [None]:
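# Separate features and target
X = df.drop('TARGET', axis=1)
y = df['TARGET']

# SMOTE needs numeric input, so label-encode the categorical columns first
for column in X.select_dtypes(include=['object']).columns:
    X[column] = LabelEncoder().fit_transform(X[column])

# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)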

#### <a id='toc1_8_1_'></a>[Explanations](#toc0_)

This code applies the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset by creating synthetic examples of the minority class.
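
The core idea behind a synthetic example is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea using NumPy and scikit-learn's `NearestNeighbors`; it is not the imbalanced-learn implementation, and the points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Hypothetical minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

# Find each point's nearest minority neighbours (the first hit is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

# Build one synthetic point: pick a sample, pick one of its neighbours,
# then interpolate at a random position on the segment between them
i = 0
j = rng.choice(neighbor_idx[i, 1:])
lam = rng.random()
synthetic = minority[i] + lam * (minority[j] - minority[i])

print(synthetic)
```

Because each synthetic point lies on a line segment between two real minority samples, the new rows stay inside the region already occupied by the minority class rather than being exact duplicates.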

#### <a id='toc1_8_2_'></a>[Why it's important:](#toc0_)

Balancing the dataset helps prevent the model from being biased towards the majority class, which is crucial for fair and accurate predictions, especially in loan default scenarios.

#### <a id='toc1_8_3_'></a>[Observations](#toc0_)

- The change in the number of samples after applying SMOTE
- The new ratio of default to non-default cases

#### <a id='toc1_8_4_'></a>[Conclusions](#toc0_)

SMOTE has created a balanced dataset, which should help in training a more fair and accurate model.

#### <a id='toc1_8_5_'></a>[Recommendations](#toc0_)

- Proceed with caution and validate the model's performance on both balanced and imbalanced test sets
- Consider experimenting with other balancing techniques if needed

## <a id='toc1_9_'></a>[**5.0 Visualize The Balanced/Imbalanced Data**](#toc0_)

In [None]:
# Visualize the balanced data
plt.figure(figsize=(8, 6))
sns.countplot(x=y_resampled)
plt.title('Distribution of Loan Defaults After SMOTE')
plt.show()

#### <a id='toc1_8_1_1_'></a>[Explanations](#toc0_)

This code visualizes the distribution of the target variable after applying SMOTE.

#### <a id='toc1_8_1_2_'></a>[Why it's important:](#toc0_)

Plotting the resampled target confirms that SMOTE equalized the class counts and provides a clear comparison with the original imbalanced distribution.

#### <a id='toc1_8_1_3_'></a>[Observations](#toc0_)

- The new distribution of default and non-default cases
- Comparison with the original imbalanced distribution

#### <a id='toc1_8_1_4_'></a>[Conclusions](#toc0_)

The visualization confirms that SMOTE has successfully balanced the dataset.

#### <a id='toc1_8_1_5_'></a>[Recommendations](#toc0_)

- Use this balanced dataset for model training
- Keep the original imbalanced distribution in mind when interpreting model performance on real-world data

## <a id='toc1_9_'></a>[**6.0 Pre-process and encode the features**](#toc0_)

### <a id='toc1_9_1_'></a>[**Part_6_1**](#toc0_)

In [None]:
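from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encode any categorical columns still present in the resampled features
# (the split below uses X_resampled, so encoding X alone would have no effect)
for column in X_resampled.select_dtypes(include=['object']).columns:
    X_resampled[column] = LabelEncoder().fit_transform(X_resampled[column])

# Split the data into training and testing sets, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled
)

# Scale the features: fit on the training set only, then apply to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)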

#### <a id='toc1_9_1_'></a>[Explanations](#toc0_)

This code preprocesses the data by encoding categorical variables, splitting the data into training and testing sets, and scaling the features.

#### <a id='toc1_9_2_'></a>[Why it's important:](#toc0_)

Proper preprocessing ensures that the data is in a suitable format for the deep learning model and that the model's performance can be accurately evaluated.

#### <a id='toc1_9_3_'></a>[Observations](#toc0_)

- The transformation of categorical variables into numerical format
- The split of data into training and testing sets
- The standardization of features to zero mean and unit variance

#### <a id='toc1_9_4_'></a>[Conclusions](#toc0_)

The data is now properly encoded, split, and scaled, ready for model training.

#### <a id='toc1_9_5_'></a>[Recommendations](#toc0_)

- Ensure that the same preprocessing steps are applied to any new data used for predictions
- Consider using cross-validation for a more robust evaluation of the model's performance
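
Both recommendations can be combined by wrapping the scaler and the estimator in a single scikit-learn pipeline and scoring it with stratified cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_classification` and a `LogisticRegression` as a lightweight stand-in for the neural network, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Bundling the scaler with the estimator means the scaler is re-fit on each
# training fold and only applied to the matching validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Once fitted, the same pipeline object applies identical scaling to new applications through `pipeline.predict_proba(new_rows)`, which avoids re-implementing the preprocessing by hand at prediction time.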

## <a id='toc1_10_'></a>[**7.0 Build and train a deep learning model**](#toc0_)

### <a id='toc1_10_1_'></a>[**Part_7_1**](#toc0_)

In [None]:
# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=1)

#### <a id='toc1_10_1_'></a>[Explanations](#toc0_)

This code defines a deep learning model architecture, compiles the model with appropriate loss function and optimizer, and trains the model on the preprocessed data.
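
For intuition about the loss being minimized, the sketch below computes binary cross-entropy by hand for a few made-up predictions. The sigmoid output is read as the predicted probability of default, and the loss penalizes confident wrong answers far more than confident right ones. The helper name and the clipping constant `eps` are assumptions for this illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for fully confident predictions.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Hypothetical labels and predicted default probabilities.
y_true_demo = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_crossentropy(y_true_demo, confident_right)
loss_wrong = binary_crossentropy(y_true_demo, confident_wrong)
print(loss_right, loss_wrong)
```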

#### <a id='toc1_10_2_'></a>[Why it's important:](#toc0_)

Building and training the model is the core of the project, where the patterns in the data are learned to make predictions on loan defaults.

#### <a id='toc1_10_3_'></a>[Observations](#toc0_)

- The model architecture (number of layers, neurons, activation functions)
- The training process (number of epochs, batch size)
- The training and validation accuracy/loss over epochs

#### <a id='toc1_10_4_'></a>[Conclusions](#toc0_)

The model has been trained on the balanced dataset and should be capable of predicting loan defaults.

#### <a id='toc1_10_5_'></a>[Recommendations](#toc0_)

- Monitor the training process for signs of overfitting or underfitting
- Experiment with different architectures or hyperparameters if the performance is not satisfactory
- Consider using techniques like early stopping to prevent overfitting
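
The logic behind early stopping can be sketched without any framework: track the best validation loss seen so far and stop once it has not improved for a set number of epochs. The function name, the `patience` value, and the loss values below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving, then drifting upward (overfitting).
val_losses = [0.60, 0.52, 0.47, 0.45, 0.46, 0.48, 0.50, 0.53]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In Keras the same behavior is available through the built-in callback, for example `EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)` from `tensorflow.keras.callbacks`, passed in the `callbacks` list of `model.fit`.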