- Building on the data exploration, we now clean and preprocess the dataset to prepare it for modeling.

In [None]:
# Importing Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Building the path to the dataset file (data_dir is assumed to be defined earlier in the notebook)
dataset_path = os.path.join(data_dir, "UCI_Credit_Card.csv")

# Loading the CSV file into a DataFrame
df = pd.read_csv(dataset_path)

# Displaying the first few rows to confirm loading
df.head()

# **1. Data Cleaning**

## **1.1 Handling Missing Values**
- Based on the observations and assumptions from the data exploration phase, we treat the value 0 in `Education_Level` and `Marital_Status` as missing. We first count how often it occurs:

In [None]:
# Counting 0 values (treated as missing) in Education_Level and Marital_Status
print(f"Count of 0 values in Education_Level: {len(df[df['Education_Level'] == 0])}")
print(f"Count of 0 values in Marital_Status: {len(df[df['Marital_Status'] == 0])}")

- There are very few missing values, so we replace them with appropriate values:
    - **Education_Level**: Replace 0 with 5 (unknown).
    - **Marital_Status**: Replace 0 with the mode of Marital_Status.

In [None]:
# Calculating mode for Marital_Status
marital_mode = df['Marital_Status'].mode()[0]
print(f"\nMode of Marital_Status: {marital_mode}")

# Replacing 0s in Marital_Status with mode
df['Marital_Status'] = df['Marital_Status'].replace(0, marital_mode)

# Replacing 0s in Education_Level with 5 (unknown)
df['Education_Level'] = df['Education_Level'].replace(0, 5)

# Verifying correction
print("\nCount of 0 values after handling:")
print(f"Education_Level: {len(df[df['Education_Level'] == 0])}")
print(f"Marital_Status: {len(df[df['Marital_Status'] == 0])}")

- We replaced 0s in Education_Level with 5 (unknown).
- We replaced 0s in Marital_Status with the mode of Marital_Status.
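
The replacement logic can be checked on a small toy frame. This is only a sketch: the values below are invented for illustration, and only the column names come from the notebook. It also computes the mode over non-zero values only, a small variation that guarantees the missing-value code 0 can never be picked as the mode.

```python
import pandas as pd

# Toy data (hypothetical values): 0 is the code treated as missing
toy = pd.DataFrame({
    "Education_Level": [1, 2, 0, 3, 2],
    "Marital_Status": [1, 2, 2, 0, 2],
})

# Compute the mode over non-zero values only, so 0 can never be chosen
valid_marital = toy.loc[toy["Marital_Status"] != 0, "Marital_Status"]
marital_mode = valid_marital.mode()[0]

# Apply the same replacements as in the cell above
toy["Marital_Status"] = toy["Marital_Status"].replace(0, marital_mode)
toy["Education_Level"] = toy["Education_Level"].replace(0, 5)

print(toy)
```

On the full dataset the two ways of computing the mode should agree whenever 0 is rare, which is the case described above.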

## **1.2 Handling Inconsistencies in Categorical Variables**

In [None]:
# Displaying counts of the two 'unknown' codes (5 and 6) in Education_Level
print(f"Education_Level 5: {len(df[df['Education_Level'] == 5])}")
print(f"Education_Level 6: {len(df[df['Education_Level'] == 6])}")

# Replacing 6 in Education_Level with 5 (unknown)
df['Education_Level'] = df['Education_Level'].replace(6, 5)

# Display the updated counts
print(f"Updated Education_Level 5: {len(df[df['Education_Level'] == 5])}")
print(f"Updated Education_Level 6: {len(df[df['Education_Level'] == 6])}")

- We replaced 6 (unknown) in Education_Level with 5 (unknown). The documentation labels both codes as unknown, so we assume they are equivalent and merge them for consistency.

In [None]:
# Expected value ranges based on documentation and assumptions
expected_ranges = {
    'Gender': [1, 2],
    'Education_Level': [1, 2, 3, 4, 5],
    'Marital_Status': [1, 2, 3],
    'Sept_Pay': list(range(-2, 10)),
    'Aug_Pay': list(range(-2, 10)),
    'July_Pay': list(range(-2, 10)),
    'June_Pay': list(range(-2, 10)),
    'May_Pay': list(range(-2, 10)),
    'Apr_Pay': list(range(-2, 10)),
    'default_payment_next_month': [0, 1]
}

# Checking every column listed in expected_ranges for unexpected values
print("Checking for inconsistencies in categorical variables:")
for col, allowed in expected_ranges.items():
    unique_values = df[col].unique()
    unexpected = [x for x in unique_values if x not in allowed]
    if unexpected:
        print(f"{col}: Unexpected values found - {unexpected}")
        print(f"Count of unexpected values: {df[col].isin(unexpected).sum()}")
    else:
        print(f"{col}: All values within expected ranges.")

- All values are within expected ranges based on the documentation and assumptions.

## **1.3 Handling Duplicate Rows**
- While the `ID` column is present, no two rows can be identical, because `ID` is unique for each row.
- We will drop the `ID` column, since it is only an identifier and does not contribute to the predictive power of the model.
- After dropping the `ID` column, we will check for duplicates again to ensure data integrity.

In [None]:
# Dropping the ID column
df.drop(columns=['ID'], inplace=True)

# Checking for duplicates after dropping ID column
df.duplicated().sum()

- Without the `ID` column, the dataset contains 35 duplicate rows. This is a very small share of the 30,000 total rows, so we will drop them.
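
To see why duplicates only appear once `ID` is removed, here is a minimal sketch on invented data (the values are hypothetical, the column names follow the notebook):

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical except for their ID
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Credit_Limit": [20000, 50000, 20000, 90000],
    "Age": [24, 37, 24, 29],
})

# With ID present every row is unique, so nothing is flagged
dups_with_id = int(toy.duplicated().sum())

# Without ID, the second occurrence of the repeated row is flagged
dups_without_id = int(toy.drop(columns=["ID"]).duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row
deduped = toy.drop(columns=["ID"]).drop_duplicates().reset_index(drop=True)

print(dups_with_id, dups_without_id, deduped.shape)
```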

In [None]:
# Removing duplicate rows
df = df.drop_duplicates()

# Checking the shape of the DataFrame after removing duplicates and dropping ID column
df.shape

- After cleaning, the dataset has 29,965 rows and 24 columns.

In [None]:
# Saving the cleaned DataFrame to a new CSV file
cleaned_data_path = os.path.join(data_dir, "cleaned_credit_card_data.csv")
df.to_csv(cleaned_data_path, index=False)

## **Summary of Data Cleaning**
- **Missing Values**: 
    - We treated 0s in Education_Level and Marital_Status as missing values and replaced them with 5 (unknown) and the mode, respectively.
- **Handling Outliers**:
    - We decided not to remove outliers as it would result in losing more than 50% of the data. Instead, we can apply log transformation for skewed numerical variables or use tree-based models which are robust to skewed distributions and outliers.
- **Inconsistencies**:
    - We replaced 6 (unknown) in Education_Level with 5 (unknown). The documentation labels both codes as unknown, so we assume they are equivalent and merge them for consistency.
- **Handling Duplicate Rows**:
    - We dropped the `ID` column as it is not needed for modeling and will not contribute to the predictive power of the model.
    - After dropping the `ID` column, we checked for duplicates and found 35 duplicate rows, a very small share of the 30,000 total rows, so we dropped them.
- **Data Integrity**:
    - The dataset is now cleaned and ready for analysis, with no missing values, inconsistencies, or unexpected values in categorical variables.

# **2. Handling Outliers**
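
The cell in this section calls `detect_outliers` and uses `numerical_columns`, both of which are defined outside this section. As a rough sketch of what such a helper could look like, assuming the common 1.5 × IQR rule (the original definition may differ), with invented toy values:

```python
import pandas as pd

def detect_outliers(df, col, k=1.5):
    """Hypothetical IQR-based helper: returns (outlier_count, lower, upper)."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Count values that fall outside the [lower, upper] fence
    outlier_count = int(((df[col] < lower) | (df[col] > upper)).sum())
    return outlier_count, lower, upper

# Toy column with one obviously extreme value
toy = pd.DataFrame({"Credit_Limit": [10, 12, 11, 13, 12, 11, 100]})
count, lower, upper = detect_outliers(toy, "Credit_Limit")
print(count, lower, upper)
```

Because the loop filters the DataFrame column by column, the bounds for each later column are computed on already-reduced data, so the row losses compound.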

In [None]:
# Removing outliers in numerical variables

# Creating a copy of df to avoid modifying the original DataFrame
df1 = df.copy()

for col in numerical_columns:
    outlier_count, lower, upper = detect_outliers(df1, col)
    df1 = df1[(df1[col] >= lower) & (df1[col] <= upper)]
df1.reset_index(drop=True, inplace=True)

# Printing the shape of df1 after removing outliers
df1.shape

- Removing outliers would discard more than 50% of the data, so we keep them. Instead, we can apply a log transformation to skewed numerical variables or use tree-based models, which are robust to skewed distributions and outliers.
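
As an illustration of the log-transformation option, here is a minimal sketch. It assumes a sign-preserving variant, since a plain logarithm is undefined for zero or negative amounts (bill amounts may be negative, for example when an account is in credit). The helper name `signed_log1p` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def signed_log1p(series):
    """Sign-preserving log transform: sign(x) * log(1 + |x|)."""
    return np.sign(series) * np.log1p(np.abs(series))

# Toy values spanning negative, zero, small and very large amounts
toy = pd.Series([-500.0, 0.0, 99.0, 250000.0], name="Sept_Bill_Amt")
transformed = signed_log1p(toy)

print(transformed.round(3))
```

The transform keeps the ordering of the values and maps 0 to 0, while compressing the long right tail.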

# **3. Feature Engineering**

In [None]:
# Reloading the cleaned data into df1 (this replaces the outlier-filtered copy created above)
df1 = pd.read_csv(cleaned_data_path)
df1.head()

## **3.1 Binning Age Variable**
- We will bin the `Age` variable into five-year age groups, which may reduce noise and makes default rates easier to compare across ages.

In [None]:
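# Binning the Age variable into age groups
# Intervals are left-closed (right=False), e.g. '25-30' covers ages 25 to 29
# The bin edges start at 20 so that the 9 labels match the 9 intervals
df1['Age_Groups'] = pd.cut(df1['Age'],
                           bins=[20, 25, 30, 35, 40, 45, 50, 55, 60, np.inf],
                           labels=['20-25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-60', '60+'],
                           right=False)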

# Dropping the Age column as it is no longer needed
df1.drop(columns=['Age'], inplace=True)

# Displaying the count of each age group
df1['Age_Groups'].value_counts().sort_index()

- We grouped the Age variable into age groups to capture patterns by age band instead of using it as a continuous variable.
- **Insights**: Most credit card owners are in the 25-30 age group, followed by 30-35 and 35-40. Very few credit card owners are in the 60+ group.
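
How `pd.cut` assigns boundary ages depends on `right=False`, which makes each interval closed on the left and open on the right. The sketch below uses invented ages and assumes bin edges that start at 20, so that the nine labels correspond to nine intervals:

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on or next to bin boundaries
ages = pd.Series([21, 24, 25, 29, 30, 59, 60, 79])

edges = [20, 25, 30, 35, 40, 45, 50, 55, 60, np.inf]
labels = ['20-25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-60', '60+']

# pd.cut requires exactly one label per interval
assert len(labels) == len(edges) - 1

# right=False: an age of 25 falls in '25-30', not '20-25'
groups = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "Age_Groups": groups}))
```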

In [None]:
# Bar plot for Age Groups vs Default Payment
plt.figure(figsize=(12, 6))
sns.countplot(data=df1, x='Age_Groups', hue='default_payment_next_month', palette=['lightgreen', 'salmon'])
plt.title('Age Groups vs Default Payment Next Month', fontsize=16, fontweight='bold')
plt.xlabel('Age Groups', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.legend(title='Default Payment Next Month', loc='upper right', labels=['No Default', 'Default'])
plt.tight_layout()
plt.savefig(os.path.join(visualization_outputs, 'age_groups_vs_default_payment.png'))
plt.show()

- The plot confirms that the majority of credit card owners are in the 25-30 age group, followed by 30-35 and 35-40.

In [None]:
# Calculating the percentage of default vs. no default within each age group
target_column = 'default_payment_next_month'
age_ratios = df1.groupby('Age_Groups')[target_column].value_counts(normalize=True).unstack().fillna(0) * 100
age_ratios = age_ratios.rename(columns={0: 'No Default', 1: 'Default'})
# Adding the number of customers per group (aligned on the age group index)
age_ratios['Total_Customers'] = df1['Age_Groups'].value_counts()

# Displaying the ratios
age_ratios.round(2)

- The table above shows that customers aged 60 and above have the highest chance of defaulting on payment next month (around 30%), followed by the 20-25 age group (around 27%). Customers aged 25-50 have a lower chance of defaulting, with the lowest chance in the 30-35 age group (around 20%).
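
The same kind of table can also be built with `pd.crosstab`, which is named here as an alternative to the `groupby` approach rather than what the notebook uses. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data (hypothetical): 1 = default, 0 = no default
toy = pd.DataFrame({
    "Age_Groups": ["20-25", "20-25", "25-30", "25-30", "25-30", "25-30", "60+", "60+"],
    "default_payment_next_month": [1, 0, 0, 0, 0, 1, 1, 0],
})

# normalize="index" turns each row into shares that sum to 1 within the age group
rates = pd.crosstab(toy["Age_Groups"], toy["default_payment_next_month"], normalize="index") * 100
rates = rates.rename(columns={0: "No Default", 1: "Default"})

# Group sizes give context: a high rate in a tiny group is less reliable
rates["Total_Customers"] = toy["Age_Groups"].value_counts()

print(rates.round(2))
```

Reading the rate next to `Total_Customers` matters for small groups such as 60+, where a few customers can move the percentage noticeably.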

## **3.2 Creating New Features**
- We will create new features based on the existing features to improve the predictive power of the model.

In [None]:
# Creating 5 new features based on existing data

# Calculating average bill amount over 6 months
df1['Avg_Bill_Amt'] = df1[['Sept_Bill_Amt', 'Aug_Bill_Amt', 'July_Bill_Amt', 'June_Bill_Amt', 'May_Bill_Amt', 'Apr_Bill_Amt']].mean(axis=1).round(2)

# Calculating average payment amount over 6 months
df1['Avg_Pay_Amt'] = df1[['Sept_Pay_Amt', 'Aug_Pay_Amt', 'July_Pay_Amt', 'June_Pay_Amt', 'May_Pay_Amt', 'Apr_Pay_Amt']].mean(axis=1).round(2)

# Calculating payment-to-bill ratio (average payment / average bill, set to 0 where the average bill is 0)
df1['Pay_to_Bill_Ratio'] = np.where(df1['Avg_Bill_Amt'] != 0, df1['Avg_Pay_Amt'] / df1['Avg_Bill_Amt'], 0).round(2)

# Calculating average payment delay score (average of payment status)
df1['Avg_Delay_Score'] = df1[['Sept_Pay', 'Aug_Pay', 'July_Pay', 'June_Pay', 'May_Pay', 'Apr_Pay']].mean(axis=1).round(2)

# Calculating credit utilization ratio (average bill / credit limit)
df1['Credit_Utilization'] = np.where(df1['Credit_Limit'] != 0, df1['Avg_Bill_Amt'] / df1['Credit_Limit'], 0).round(2)

# Displaying random sample rows of new features created
new_features = ['Avg_Bill_Amt', 'Avg_Pay_Amt', 'Pay_to_Bill_Ratio', 'Avg_Delay_Score', 'Credit_Utilization']
df1[new_features].sample(10)

- We created 5 new features based on existing variables:
    - **Avg_Bill_Amt**: Average of all bill amounts over the six months.
    - **Avg_Pay_Amt**: Average of all payment amounts over the six months.
    - **Pay_to_Bill_Ratio**: Ratio of average payment amount to average bill amount, indicating how much of the billed amount is paid.
    - **Avg_Delay_Score**: Average of the six monthly payment status values, indicating overall payment behavior.
    - **Credit_Utilization**: Ratio of average bill amount to credit limit, indicating how much of the credit limit is utilized.
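
The two ratio features need a guard against division by zero. A minimal sketch of one way to do this, using a hypothetical helper `safe_ratio` and invented values (the notebook itself uses `np.where`):

```python
import pandas as pd

# Toy data: the second customer has no bills and no payments
toy = pd.DataFrame({
    "Credit_Limit": [100000.0, 50000.0],
    "Avg_Bill_Amt": [20000.0, 0.0],
    "Avg_Pay_Amt": [5000.0, 0.0],
})

def safe_ratio(numerator, denominator):
    """Divide element-wise, returning 0 where the denominator is 0."""
    result = pd.Series(0.0, index=numerator.index)
    nonzero = denominator != 0
    # Only divide where it is safe, so no inf or NaN values are produced
    result.loc[nonzero] = numerator[nonzero] / denominator[nonzero]
    return result.round(2)

toy["Pay_to_Bill_Ratio"] = safe_ratio(toy["Avg_Pay_Amt"], toy["Avg_Bill_Amt"])
toy["Credit_Utilization"] = safe_ratio(toy["Avg_Bill_Amt"], toy["Credit_Limit"])

print(toy)
```

Mapping a zero denominator to 0 is a modeling choice: it treats "nothing billed" the same as "nothing paid", which may or may not be appropriate.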

In [None]:
# Displaying the first few rows of the updated DataFrame with new features
df1.head()

In [None]:
# Checking the shape of the DataFrame after adding new features
df1.shape

- Our new dataset now has 29,965 rows and 29 columns after adding the new features.

### **Summary of Newly Created Features**
- We grouped the Age variable into age groups to capture more meaningful patterns instead of using it as a continuous variable.
- We created 5 new features based on existing variables:
    - **Avg_Bill_Amt**: Average of all bill amounts over the six months.
    - **Avg_Pay_Amt**: Average of all payment amounts over the six months.
    - **Pay_to_Bill_Ratio**: Ratio of average payment amount to average bill amount, indicating how much of the billed amount is paid.
    - **Avg_Delay_Score**: Average of the six monthly payment status values, indicating overall payment behavior.
    - **Credit_Utilization**: Ratio of average bill amount to credit limit, indicating how much of the credit limit is utilized.

- **Insights**
    - Most credit card owners are in the 25-30 age group, followed by 30-35 and 35-40. Very few credit card owners are in the 60+ group.
    - Customers aged 60 and above have the highest chance of defaulting on payment next month (around 30%), followed by the 20-25 age group (around 27%). Customers aged 25-50 have a lower chance of defaulting, with the lowest chance in the 30-35 age group (around 20%).

# **4. Feature Scaling**
- We will scale the numerical features using StandardScaler to ensure that all features are on the same scale, which is important for many machine learning algorithms that rely on distance calculations.
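
Standardization maps each value to `z = (x - mean) / std`. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. A minimal sketch with invented values, written in plain NumPy:

```python
import numpy as np

# Toy credit limits (hypothetical values)
x = np.array([10000.0, 20000.0, 30000.0, 40000.0])

# np.std defaults to ddof=0, i.e. the population standard deviation
z = (x - x.mean()) / x.std()

print(z.round(4))
print(z.mean().round(4), z.std().round(4))
```

After scaling, the column has mean 0 and standard deviation 1, so features measured in very different units become comparable.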

In [None]:
# Feature scaling using StandardScaler
from sklearn.preprocessing import StandardScaler

# Initializing the StandardScaler
scaler = StandardScaler()

# Selecting numerical columns for scaling
numerical_columns = ['Credit_Limit', 'Avg_Bill_Amt', 'Avg_Pay_Amt', 'Pay_to_Bill_Ratio', 'Avg_Delay_Score', 'Credit_Utilization']

# Fitting the scaler to the numerical columns and transforming them
df1[numerical_columns] = scaler.fit_transform(df1[numerical_columns])

# Displaying the first few rows of the DataFrame after scaling
df1.head()

# **5. Encoding Categorical Variables**
- We will encode ordinal categorical variables using OrdinalEncoder and nominal categorical variables using OneHotEncoder to prepare them for modeling.

In [None]:
# Using OrdinalEncoder for Age Category
from sklearn.preprocessing import OrdinalEncoder

# Define the categories in the correct order (youngest to oldest)
age_groups_ordered = [['18-20', '20-25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-60', '60+']]

# Initializing the OrdinalEncoder with the specified category order
ordinal_encoder = OrdinalEncoder(categories=age_groups_ordered)

# Applying ordinal encoding to the Age_Groups column
df1['Age_Groups'] = ordinal_encoder.fit_transform(df1[['Age_Groups']])

# Converting to integer type for cleaner display
df1['Age_Groups'] = df1['Age_Groups'].astype(int)

# Displaying the mapping to verify
for i, category in enumerate(ordinal_encoder.categories_[0]):
    print(f"{category}: {i}")

# Displaying the updated Age_Groups column
df1['Age_Groups'].value_counts().sort_index()

- We mapped the Age_Groups categories to ordered integer values using OrdinalEncoder for modeling.
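
For comparison, the same ordered mapping can be sketched with a pandas ordered `Categorical`. This is an alternative to `OrdinalEncoder`, not what the notebook uses, and the sample values are invented:

```python
import pandas as pd

# Category order from youngest to oldest, as in the cell above
order = ['18-20', '20-25', '25-30', '30-35', '35-40', '40-45', '45-50', '50-55', '55-60', '60+']

# Toy column of age groups
groups = pd.Series(['25-30', '60+', '20-25', '25-30'])

# .codes returns each value's position in `order`
codes = pd.Categorical(groups, categories=order, ordered=True).codes

print(list(codes))
```

Categories listed in `order` but absent from the data (such as '18-20') simply leave a gap in the integer codes.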

In [None]:
# Nominal categories for categorical variables
nominal_categories = ['Gender', 'Marital_Status', 'Education_Level', 'Sept_Pay', 'Aug_Pay', 'July_Pay', 'June_Pay', 'May_Pay', 'Apr_Pay']

# Transforming nominal categorical variables into numerical values using one-hot encoding
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Applying one-hot encoding to the nominal categorical variables
encoded_features = encoder.fit_transform(df1[nominal_categories])

# Creating a DataFrame from the encoded features (reusing df1's index so the concat below aligns row by row)
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(nominal_categories), index=df1.index)

# Concatenating the encoded DataFrame with the original DataFrame (excluding the original nominal columns)
df1 = pd.concat([df1.drop(columns=nominal_categories), encoded_df], axis=1)

# Displaying the first few rows of the updated DataFrame with one-hot encoded features
df1.head()

- We transformed all nominal categorical variables into binary columns using one-hot encoding, so they can be used directly in our machine learning models.
- For each variable, we dropped the first category (`drop='first'`) to avoid the dummy variable trap: the dropped category is implied when all the other columns for that variable are 0.
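
The effect of dropping the first category can be seen on a tiny example. The sketch below uses `pd.get_dummies(drop_first=True)` as a lightweight stand-in for `OneHotEncoder(drop='first')`, with invented values:

```python
import pandas as pd

# Toy column with three categories (hypothetical values)
toy = pd.DataFrame({"Marital_Status": [1, 2, 3, 2]})

# Three categories become two columns; category 1 is the implicit baseline
dummies = pd.get_dummies(toy["Marital_Status"], prefix="Marital_Status", drop_first=True, dtype=int)

print(dummies)
```

A row of all zeros means the baseline category, so no information is lost while one redundant column per variable is removed.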

In [None]:
# Moving default_payment_next_month column to the last position for better clarity in analysis
cols = [col for col in df1.columns if col != 'default_payment_next_month']
cols.append('default_payment_next_month')
df1 = df1[cols]

# Display the DataFrame to confirm the change
df1.head()

In [None]:
df1.shape

- After encoding the categorical variables, our dataset has 29,965 rows and 85 columns.

In [None]:
# Saving the final scaled and encoded DataFrame to a new CSV file
final_data_path = os.path.join(data_dir, "final_data.csv")
df1.to_csv(final_data_path, index=False)

## **Summary of Feature Engineering**
- **New Features Created**:
    - We created 5 new features based on existing variables:
        - **Avg_Bill_Amt**: Average of all bill amounts over the six months.
        - **Avg_Pay_Amt**: Average of all payment amounts over the six months.
        - **Pay_to_Bill_Ratio**: Ratio of average payment amount to average bill amount, indicating how much of the billed amount is paid.
        - **Avg_Delay_Score**: Average of the six monthly payment status values, indicating overall payment behavior.
        - **Credit_Utilization**: Ratio of average bill amount to credit limit, indicating how much of the credit limit is utilized.
- **Age Grouping**:
    - We grouped the Age variable into age groups to capture more meaningful patterns instead of using it as a continuous variable.
    - **Insights from Age Groups**
        - Most credit card owners are in the 25-30 age group, followed by 30-35 and 35-40. Very few credit card owners are in the 60+ group.
        - Customers aged 60 and above have the highest chance of defaulting on payment next month (around 30%), followed by the 20-25 age group (around 27%). Customers aged 25-50 have a lower chance of defaulting, with the lowest chance in the 30-35 age group (around 20%).
- **Feature Transformation**:
    - We scaled selected numerical features using StandardScaler, then encoded ordinal categorical variables using OrdinalEncoder and nominal categorical variables using OneHotEncoder to prepare them for modeling.
- **Final Dataset**:
    - The final dataset has 29,965 rows and 85 columns after adding new features and encoding categorical variables, ready for modeling.