<H1> Default Credit Card Clients </h1>

<h4> Explain the Variables </h4>

The dataset has 25 variables. They are a mix of each client's demographic information, credit and repayment history, bill and payment amounts, and whether the client defaulted the next month.




| Column Name                  | Description                                                                                                |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `ID`                         | Unique identifier for each client                                                                          |
| `LIMIT_BAL`                  | Amount of credit given to the client, in New Taiwan (NT) dollars                                           |
| `SEX`                        | Gender of the client (1 = male, 2 = female)                                                                |
| `EDUCATION`                  | Education level of the client (1 = graduate school, 2 = university, 3 = high school, 4 = others)           |
| `MARRIAGE`                   | Marital status (1 = married, 2 = single, 3 = others)                                                       |
| `AGE`                        | Age in years                                                                                               |
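| `PAY_0`, `PAY_2` to `PAY_6`  | Repayment status by month, from September (`PAY_0`) back to April (`PAY_6`) (e.g., -1 = pay duly, 0 = pay on time, 1 = one month delay, etc.) |
| `BILL_AMT1` to `BILL_AMT6`   | Amount of bill statement by month, from September (`BILL_AMT1`) back to April (`BILL_AMT6`), in NT dollars |
| `PAY_AMT1` to `PAY_AMT6`     | Amount of the previous payment by month, from September (`PAY_AMT1`) back to April (`PAY_AMT6`), in NT dollars |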
| `default payment next month` | Target variable: whether the client defaulted the next month (1 = yes, 0 = no)                             |


In [2]:
#Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [3]:
#Convert CSV Into Dataframe To Use In Notebook
df = pd.read_csv('/Users/Ryan/Code/Data645/DefaultCreditCardClients.csv')

<h2> Data Cleaning </h2>

In [4]:
# Rename PAY_0 to PAY_1 to match the naming of the other PAY_* columns
df.rename(columns={'PAY_0': 'PAY_1'}, inplace=True)

In [5]:
#Get the mean of the education

df['EDUCATION'].mean()

1.8531333333333333

In [6]:
# Codes 0, 5, and 6 are not among the documented categories; replace them with 2 (the mean, 1.85, rounded to the nearest code)
df['EDUCATION'] = df['EDUCATION'].replace([0, 5, 6], 2)

<h2> Data Reduction </h2>

In [8]:
# List of selected numeric features
features_for_pca = [
    'LIMIT_BAL', 'AGE',
    'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'
]

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features_for_pca])

# Apply PCA (keep 95% of variance)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Check results
print(f"Original shape: {X_scaled.shape}")
print(f"Reduced shape: {X_pca.shape}")
print("Explained variance ratios:", pca.explained_variance_ratio_)

Original shape: (30000, 20)
Reduced shape: (30000, 12)
Explained variance ratios: [0.32691551 0.20310814 0.07543899 0.05014194 0.0461459  0.04440639
 0.04368688 0.03918403 0.03663652 0.03502348 0.02957272 0.02021896]


In [9]:
# Create a DataFrame for PCA components
pca_df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(X_pca.shape[1])])

# Add the target variable back
pca_df['Default'] = df['default payment next month']

The 20 numeric features were first standardized with `StandardScaler` so that large-valued columns such as `LIMIT_BAL` would not dominate small-valued ones such as `AGE`. After scaling, every column has a mean of 0 and a standard deviation of 1.

PCA was then set to keep just enough components to explain 95% of the variance. This cut the 20 features down to 12 principal components.

Lastly, the target column was attached to the components in a new DataFrame called `pca_df`.
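
To see why a 95% threshold lands on 12 components, the cumulative sum of the explained variance ratios can be checked directly. The sketch below copies the ratios from the printed output rather than refitting the PCA; the names `ratios`, `cumulative`, and `n_needed` are introduced here for illustration only.

```python
import numpy as np

# Explained variance ratios copied from the PCA output in this notebook
ratios = np.array([
    0.32691551, 0.20310814, 0.07543899, 0.05014194, 0.0461459, 0.04440639,
    0.04368688, 0.03918403, 0.03663652, 0.03502348, 0.02957272, 0.02021896,
])

# Running total of variance explained as components are added
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(f"Variance kept by 11 components: {cumulative[10]:.4f}")
print(f"Variance kept by 12 components: {cumulative[11]:.4f}")
print(f"Components needed for {threshold:.0%}: {n_needed}")
```

Eleven components explain about 93% of the variance, so the twelfth is the one that pushes the total past 95%.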

<h2> Data Transformation </h2>

In [10]:
# Create a new feature, Utilization_Ratio: the mean of the six bill amounts divided by the credit limit
df['Utilization_Ratio'] = df[['BILL_AMT1','BILL_AMT2','BILL_AMT3',
                              'BILL_AMT4','BILL_AMT5','BILL_AMT6']].mean(axis=1) / df['LIMIT_BAL']

In [11]:
#Add new feature to the PCA DF
pca_df['Utilization_Ratio'] = df['Utilization_Ratio']

In [12]:
#Ensure that features were added
pca_df

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,Default,Utilization_Ratio
0,-1.882545,0.950519,-0.635311,-1.301333,0.446332,-0.030903,-0.011523,-0.151913,-0.168605,3.028839,0.668630,0.207799,1,0.064200
1,-0.750441,2.091268,0.489198,-0.781045,0.079028,0.120256,0.179398,0.152436,-0.155184,-0.949828,0.023566,-0.440040,1,0.023718
2,-0.831802,1.043824,0.021472,-0.118971,0.003329,-0.101834,0.103799,0.028856,-0.061490,-0.116303,-0.286394,0.010800,0,0.188246
3,-0.191578,0.863092,-0.328582,0.051712,-0.241338,-0.035969,-0.087962,-0.083516,-0.016847,0.056527,-0.659508,-0.024124,0,0.771113
4,-0.881239,0.054430,0.834988,2.019853,-1.176848,0.169491,-0.572443,-0.363739,-1.059021,-0.127608,-1.332602,-0.181264,0,0.364463
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,2.465407,-0.812901,-0.795730,0.222874,-0.505435,-0.767524,-0.391536,-0.440540,-0.192230,0.084692,0.122086,-0.383555,0,0.549507
29996,-1.826196,-0.166504,0.071634,0.758088,-0.592182,0.140179,-0.085445,0.423197,-0.110729,-0.783052,-0.518512,0.298650,0,0.023536
29997,0.322151,3.238358,0.695109,0.391501,0.334243,0.567246,0.224318,0.835937,-0.125658,3.435828,1.041586,0.556153,1,0.391644
29998,0.591726,-0.853788,2.443873,0.346852,0.362364,-1.172280,-2.204289,-1.232226,4.294981,1.027712,-0.661387,1.063695,1,0.555440


The first data transformation was the standardization applied with `StandardScaler` before the PCA.

The second was to create a new feature, `Utilization_Ratio` (the mean of the six bill amounts divided by the credit limit), and add it to `pca_df`.
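
As a small worked example of the utilization ratio, the sketch below applies the same formula to three hypothetical clients. The values in `toy` are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]

# Hypothetical clients (illustrative values only, not taken from the dataset)
toy = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 50000, 10000],
        "BILL_AMT1": [3000, 45000, -500],
        "BILL_AMT2": [3000, 40000, -500],
        "BILL_AMT3": [1200, 35000, -500],
        "BILL_AMT4": [0, 40000, -500],
        "BILL_AMT5": [0, 45000, -500],
        "BILL_AMT6": [0, 35000, -500],
    }
)

# Same formula as the notebook: mean bill amount divided by the credit limit
toy["Utilization_Ratio"] = toy[bill_cols].mean(axis=1) / toy["LIMIT_BAL"]

print(toy[["LIMIT_BAL", "Utilization_Ratio"]])
```

The first client uses about 6% of the limit on average and the second about 80%. The third shows that a negative bill amount (a credit balance) makes the ratio negative, which is worth keeping in mind when interpreting the feature.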

<h2> Analysis </h2>

In [13]:
pca_df.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,Default,Utilization_Ratio
0,-1.882545,0.950519,-0.635311,-1.301333,0.446332,-0.030903,-0.011523,-0.151913,-0.168605,3.028839,0.66863,0.207799,1,0.0642
1,-0.750441,2.091268,0.489198,-0.781045,0.079028,0.120256,0.179398,0.152436,-0.155184,-0.949828,0.023566,-0.44004,1,0.023718
2,-0.831802,1.043824,0.021472,-0.118971,0.003329,-0.101834,0.103799,0.028856,-0.06149,-0.116303,-0.286394,0.0108,0,0.188246
3,-0.191578,0.863092,-0.328582,0.051712,-0.241338,-0.035969,-0.087962,-0.083516,-0.016847,0.056527,-0.659508,-0.024124,0,0.771113
4,-0.881239,0.05443,0.834988,2.019853,-1.176848,0.169491,-0.572443,-0.363739,-1.059021,-0.127608,-1.332602,-0.181264,0,0.364463


In [14]:
# Split features and target
X = pca_df.drop('Default', axis=1)
y = pca_df['Default']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [15]:
# Initialize and fit the model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

In [16]:
# Predict on test set
y_pred = log_reg.predict(X_test)



In [23]:
print("Predicted vs Actual for first 20 clients:")
for i in range(20):
    print(f"Client {i+1}: Predicted = {y_pred[i]}, Actual = {y_test.iloc[i]}")

Predicted vs Actual for first 20 clients:
Client 1: Predicted = 0, Actual = 0
Client 2: Predicted = 0, Actual = 0
Client 3: Predicted = 0, Actual = 0
Client 4: Predicted = 0, Actual = 0
Client 5: Predicted = 0, Actual = 1
Client 6: Predicted = 0, Actual = 0
Client 7: Predicted = 0, Actual = 1
Client 8: Predicted = 0, Actual = 0
Client 9: Predicted = 0, Actual = 0
Client 10: Predicted = 0, Actual = 0
Client 11: Predicted = 1, Actual = 0
Client 12: Predicted = 0, Actual = 0
Client 13: Predicted = 0, Actual = 0
Client 14: Predicted = 1, Actual = 1
Client 15: Predicted = 0, Actual = 0
Client 16: Predicted = 0, Actual = 0
Client 17: Predicted = 0, Actual = 0
Client 18: Predicted = 0, Actual = 0
Client 19: Predicted = 0, Actual = 0
Client 20: Predicted = 1, Actual = 1


In [24]:
# Classification results
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Model coefficients
coefficients = pd.Series(log_reg.coef_[0], index=X.columns)
print("Logistic Regression Coefficients:")
print(coefficients.sort_values(key=abs, ascending=False))


Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      7040
           1       0.68      0.22      0.34      1960

    accuracy                           0.81      9000
   macro avg       0.75      0.60      0.61      9000
weighted avg       0.79      0.81      0.77      9000

Confusion Matrix:
[[6833  207]
 [1522  438]]
Logistic Regression Coefficients:
PC2                  0.417629
PC10                 0.390343
PC12                 0.331109
Utilization_Ratio   -0.249432
PC3                 -0.160695
PC4                  0.125244
PC11                 0.113329
PC1                  0.094844
PC5                  0.081137
PC6                  0.080223
PC7                  0.045642
PC9                  0.037287
PC8                  0.022380
dtype: float64


In [25]:
# Get the loadings (aka weights of each original feature on each PC)
loadings = pd.DataFrame(pca.components_.T,
                        columns=[f'PC{i+1}' for i in range(pca.n_components_)],
                        index=features_for_pca)

# Top features contributing to PC2
print("Top contributors to PC2:")
print(loadings['PC2'].sort_values(key=np.abs, ascending=False).head(5))

# Top features for PC10 (if exists)
if 'PC10' in loadings.columns:
    print("\nTop contributors to PC10:")
    print(loadings['PC10'].sort_values(key=np.abs, ascending=False).head(5))

# Top features for PC12 (if exists)
if 'PC12' in loadings.columns:
    print("\nTop contributors to PC12:")
    print(loadings['PC12'].sort_values(key=np.abs, ascending=False).head(5))

Top contributors to PC2:
PAY_4        0.348129
PAY_3        0.347142
PAY_2        0.337700
PAY_5        0.335298
LIMIT_BAL   -0.313392
Name: PC2, dtype: float64

Top contributors to PC10:
PAY_1        0.566203
PAY_6       -0.408870
PAY_5       -0.375204
LIMIT_BAL   -0.347309
PAY_2        0.343854
Name: PC10, dtype: float64

Top contributors to PC12:
PAY_1    0.607735
PAY_3   -0.541943
PAY_6    0.362860
PAY_2   -0.320703
PAY_5    0.175830
Name: PC12, dtype: float64


<h2> Conclusion </h2>

<h3> Final Model Results </h3>

<h3> Precision </h3>

When the model predicts that a client will not default (0), it is right 82% of the time.

When the model predicts that a client will default (1), it is right 68% of the time.

<h3> Recall </h3>

The model correctly identifies 97% of the clients who did not default, which is very strong.

The model catches only 22% of the clients who defaulted.

This means the model is very good at identifying clients who will not default but very poor at identifying clients who will.
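
The precision and recall figures can be reproduced by hand from the confusion matrix. This is a minimal sketch that hard-codes the matrix printed in the analysis output (rows are actual classes, columns are predicted classes); the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output: rows = actual, columns = predicted
cm = np.array([[6833, 207],
               [1522, 438]])
tn, fp, fn, tp = cm.ravel()

# Class 1 (default)
precision_default = tp / (tp + fp)   # of predicted defaults, how many were real
recall_default = tp / (tp + fn)      # of real defaults, how many were caught

# Class 0 (no default)
precision_no_default = tn / (tn + fn)
recall_no_default = tn / (tn + fp)

print(f"Default:    precision={precision_default:.2f}, recall={recall_default:.2f}")
print(f"No default: precision={precision_no_default:.2f}, recall={recall_no_default:.2f}")
```

Of the 1,960 clients who actually defaulted, the model flagged 438 and missed 1,522, which is where the 22% recall comes from.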

<h3> Accuracy </h3>

81% of all predictions were correct. However, most of those correct predictions are for clients who did not default, who make up the larger share of the dataset (7,040 of the 9,000 test clients).
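
One way to put the accuracy in context is to compare it with a baseline that always predicts "no default". The sketch below uses only the counts from the confusion matrix and classification report; the always-predict-0 baseline is an illustration added here, not something the notebook fit.

```python
# Counts copied from the confusion matrix and classification report
total = 9000
correct = 6833 + 438        # true negatives + true positives
actual_no_default = 7040    # support for class 0

model_accuracy = correct / total
baseline_accuracy = actual_no_default / total  # always predicting 0

print(f"Model accuracy:    {model_accuracy:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Improvement:       {model_accuracy - baseline_accuracy:.4f}")
```

Under these numbers the model is only about 2.6 percentage points more accurate than the trivial baseline, which is why accuracy alone overstates how well it performs on this data.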

<h3> Findings Based On Results </h3>

The repayment status features (`PAY_1` to `PAY_6`) and the credit limit (`LIMIT_BAL`) played the largest role in determining whether a client would default. The repayment status features dominate PC2, PC10, and PC12, the three components with the largest logistic regression coefficients, and `LIMIT_BAL` is among the top contributors to PC2 and PC10. This suggests that whether a client repaid on time in previous months, combined with their overall credit limit, can help predict whether they will default.

Clients who pay late and have lower credit limits have a higher tendency to default.

Clients who pay on time or early and have higher credit limits have a lower tendency to default.
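
The direction of these effects can be read off by multiplying a component's coefficient by each feature's loading on that component. The sketch below does this for PC2 only, using the values printed in the outputs. It ignores the other 11 components and `Utilization_Ratio`, so it is a rough illustration of sign and relative size, not a full back-transformation of the model.

```python
import pandas as pd

# Logistic regression coefficient for PC2, copied from the output
coef_pc2 = 0.417629

# Top PC2 loadings, copied from the output
loadings_pc2 = pd.Series(
    {
        "PAY_4": 0.348129,
        "PAY_3": 0.347142,
        "PAY_2": 0.337700,
        "PAY_5": 0.335298,
        "LIMIT_BAL": -0.313392,
    }
)

# Approximate change in log-odds of default, through PC2 alone,
# for a one-standard-deviation increase in each (scaled) feature
contribution = coef_pc2 * loadings_pc2
print(contribution.round(4))
```

The `PAY_*` contributions are positive and the `LIMIT_BAL` contribution is negative, which is consistent with later payments raising, and a higher credit limit lowering, the predicted chance of default.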


<h3> Limitations </h3>

The data focuses on payment history and credit limits but lacks key information such as credit score and income, which can be extremely useful in determining whether a client will default.

Some variables, such as marital status and education, were messy and needed to be cleaned, which made the results harder to interpret.

The dataset is imbalanced: more clients did not default than defaulted (7,040 versus 1,960 in the test set).

The `default payment next month` target does not capture true risk, because the amount a client defaults on greatly affects how much risk they pose. A client who misses a small payment is treated the same as one who misses a large one.




<h3> Improvement Areas </h3>

Include additional financial features such as credit score and income.

Quantify the default amount so that the severity of a default can be determined.

Collect a more balanced dataset, or resample the existing one, so that the bias toward non-default clients is reduced.
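
As one possible way to resample, the sketch below downsamples the majority class on a small hypothetical DataFrame so that both classes have the same number of rows. The frame `toy` and its `feature` column are made up for illustration. In practice this would most likely be applied to the training split only, and other options exist (for example, `LogisticRegression(class_weight='balanced')` in scikit-learn reweights the classes instead of dropping rows).

```python
import pandas as pd

# Hypothetical toy data with an 80/20 split between non-default and default
toy = pd.DataFrame({"feature": range(100), "Default": [0] * 80 + [1] * 20})

minority = toy[toy["Default"] == 1]
majority = toy[toy["Default"] == 0]

# Randomly keep as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine and shuffle so the classes are interleaved
balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Default"].value_counts())
```

Downsampling discards majority-class rows, so it trades some information about non-default clients for a model that is less inclined to predict "no default" by default.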