<h2>Credit Card Holder Defaulter Classification</h2>

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

### 1. Understanding the problem
<hr>

In [4]:
df_carddata = pd.read_csv("data/UCI_Credit_Card.csv")
df_carddata.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


In [6]:
df_carddata.shape

(30000, 25)

The goal of this project is to predict whether a credit card client will default on next month's payment. The column that records this is <b>default.payment.next.month</b>. It is the <b>target variable</b>, taking the value 1 (will default) or 0 (will not default), so this is a <b>binary classification problem</b>.

The dataset is of <b>moderate size, with 30,000 examples</b>, and <b>low-dimensional, with 25 columns</b>: the ID, 23 features, and the target.
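
As a quick sanity check of the binary-classification framing, the sketch below confirms that a target column contains only 0 and 1. The small `toy` frame is a hypothetical stand-in for `df_carddata` (its five target values mirror the rows shown by `head()`), so the snippet runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df_carddata; only the target column matters here.
toy = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0, 50000.0],
    "default.payment.next.month": [1, 1, 0, 0, 0],
})

target = "default.payment.next.month"

# Collect the distinct target values and check they are a subset of {0, 1}.
unique_values = sorted(toy[target].unique().tolist())
is_binary = set(unique_values) <= {0, 1}

print(unique_values, is_binary)
```

Running the same two lines on the full frame would be a cheap guard against unexpected labels before any modelling.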

### 2. Data Splitting
<hr>

In [8]:
train_data, test_data = train_test_split(df_carddata, test_size=0.2, random_state=123)

In [9]:
train_data.shape

(24000, 25)

In [10]:
test_data.shape

(6000, 25)
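
The split above is purely random. Because the target classes are imbalanced, one alternative (not what this notebook does) is to pass `stratify=` to `train_test_split` so that both parts keep the same class ratio. A minimal sketch on synthetic data, where the `toy` frame and its column names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 22% positives.
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": [1] * 220 + [0] * 780,
})

# stratify= keeps the class proportions equal in the train and test parts.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["target"]
)

train_rate = train_toy["target"].mean()
test_rate = test_toy["target"].mean()
print(train_toy.shape, test_toy.shape)
print(train_rate, test_rate)
```

With a plain random split the two rates would usually be close but not identical; stratifying removes that source of variation from the evaluation.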

### 3. EDA
<hr>

In [13]:
# Separate the features from the target; ID is an identifier, not a predictor.
target = "default.payment.next.month"
X_train, y_train = train_data.drop(columns=["ID", target]), train_data[target]
X_test, y_test = test_data.drop(columns=["ID", target]), test_data[target]

In [16]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24000 entries, 19682 to 19966
Data columns (total 23 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   LIMIT_BAL  24000 non-null  float64
 1   SEX        24000 non-null  int64  
 2   EDUCATION  24000 non-null  int64  
 3   MARRIAGE   24000 non-null  int64  
 4   AGE        24000 non-null  int64  
 5   PAY_0      24000 non-null  int64  
 6   PAY_2      24000 non-null  int64  
 7   PAY_3      24000 non-null  int64  
 8   PAY_4      24000 non-null  int64  
 9   PAY_5      24000 non-null  int64  
 10  PAY_6      24000 non-null  int64  
 11  BILL_AMT1  24000 non-null  float64
 12  BILL_AMT2  24000 non-null  float64
 13  BILL_AMT3  24000 non-null  float64
 14  BILL_AMT4  24000 non-null  float64
 15  BILL_AMT5  24000 non-null  float64
 16  BILL_AMT6  24000 non-null  float64
 17  PAY_AMT1   24000 non-null  float64
 18  PAY_AMT2   24000 non-null  float64
 19  PAY_AMT3   24000 non-null  float64
 20  PA

There are <b>no missing values</b>, and all the columns are <b>encoded as numeric</b>. However, SEX, EDUCATION, and MARRIAGE are integer codes that most likely represent categorical features.
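
If SEX, EDUCATION, and MARRIAGE are indeed categorical, one common way to handle them is one-hot encoding, leaving the remaining columns untouched. The sketch below is one possible approach rather than the notebook's own preprocessing; `toy_X` is a hypothetical miniature of `X_train` with made-up values.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of X_train; the values are illustrative only.
toy_X = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
    "AGE": [24, 26, 34, 37],
    "SEX": [2, 2, 2, 1],
    "EDUCATION": [2, 2, 1, 3],
    "MARRIAGE": [1, 2, 2, 1],
})

categorical = ["SEX", "EDUCATION", "MARRIAGE"]

# One-hot encode the integer-coded columns; pass the others through unchanged.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical),
    remainder="passthrough",
)

encoded = preprocessor.fit_transform(toy_X)
print(encoded.shape)
```

Here the three coded columns expand to 2 + 3 + 2 indicator columns, which together with the two passed-through columns gives nine columns in total.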

In [18]:
y_train.value_counts(normalize=True)

0    0.777833
1    0.222167
Name: default.payment.next.month, dtype: float64

The target classes are split 77.8% (no default) to 22.2% (default), so the classes are <b>imbalanced</b>. Both classes matter here. Precision matters because a false positive (flagging a client who would not default) leads to customer dissatisfaction. Recall matters because a false negative (missing an actual defaulter) leads to financial loss for the credit card company. We therefore need a metric that balances precision and recall, and we choose the <b>macro-averaged F1 score</b> as the evaluation metric.
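
To illustrate why macro-averaged F1 is more informative than accuracy on imbalanced classes, the sketch below compares two sets of predictions with the same accuracy. The labels are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 10 clients, 2 of them actual defaulters (class 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

predictions = {
    # Always predicts "no default".
    "majority": [0] * 10,
    # Catches one defaulter, misses one, and raises one false alarm.
    "model": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
}

results = {}
for name, y_pred in predictions.items():
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro averaging computes F1 per class, then takes the unweighted mean.
    macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    results[name] = (accuracy, macro_f1)
    print(name, round(accuracy, 3), round(macro_f1, 3))
```

Both prediction sets reach 0.8 accuracy, but the macro F1 is about 0.444 for the always-negative baseline and 0.6875 for the second set, because the baseline scores zero F1 on the default class.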