In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Problem Statement

**I. Background & Problem Statement**

Credit cards are one of the major consumer lending products in the market. When a credit card account defaults, the bank can try to sell the loan. If the bank finds that it cannot sell the loan, it writes it off, which results in significant financial losses. Accurately predicting which customers are most likely to default therefore represents a significant business opportunity for banks, and it also helps them set up proactive default-prevention guidelines to improve their bottom line.

This kernel focuses on Taiwanese borrowers, who constitute a portion of the Taiwan loan market. Our goal is to build a suitable model to predict whether a customer will default on a credit card payment (Y) based on variables such as personal information and historical payment status (X).

**II. Data Description**

Our dataset consists of 30,000 borrowers who held at least one loan that entered repayment between April and September 2005. The borrower’s personal information is provided in the form of 5 categorical attributes, together with 12 billing and payment history variables.

Click [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) for the source of the data from the `UCI Machine Learning Repository` [1].

**III. Design & Findings**
We started with some data cleaning and wrangling to better visualize our data and prepare it for model training. Through frequency plots and histograms, we observed that many of the monetary variables are heavily right-skewed, so we applied a log function to make their distributions more symmetric. We also observed that the dataset has more females than males and that borrowers are mostly young rather than elderly. This may partly explain why the credit limit is right-skewed, since younger borrowers tend to have lower credit limits. Across the past 6 months, there are also more delayed payments than on-time payments.

We then carried out exploratory data analysis to identify the features most strongly correlated with our target class, so that we could select useful features. From the heat map, pay status 1-6 appear to be potentially useful predictors of default status. Borrowers who defaulted in October also tended to delay payments in the past 6 months. In our data, the elderly defaulted more than the young, males defaulted more than females, and borrowers with higher education defaulted less than those with lower education. A delay in the first month often leads to delays in the subsequent months. The features that we found to have high predictive value for default are PAY_X (i.e. the repayment status in previous months), LIMIT_BAL and PAY_AMTX (the amount paid in previous months).

We also observed that our dataset was highly imbalanced; only ~22% of the target variable belonged to the “default” class. We therefore explored several methods such as upsampling of the minority class, downsampling of the majority class and synthetic construction of data by the SMOTE algorithm. 

We then proceeded to build a classification model. We tried both logistic regression and SVM, which are known for their effectiveness in classification problems. The logistic regression and SVM achieved accuracies of 73% and 75% on the test set respectively. More time and more extensive model experiments are required to fine-tune the models for higher accuracy.

For the purpose of this project, we start with data cleaning and wrangling, followed by exploratory data analysis to understand the potential relationships between variables. Finally, we fit logistic regression and kernel SVM [2] models, the latter to explore non-linearity, for the prediction of our target variable.

In [None]:
#import Library
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline

In [None]:
data = pd.read_csv("../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv")

In [None]:
data.head()

# 2. DATA WRANGLING

In [None]:
#Adjusting rows and columns
data.set_index('ID',inplace=True)
data.rename(columns = {"SEX":"GENDER","MARRIAGE":'MARITAL_STATUS', 'PAY_0' : 'PAY_1', "LIMIT_BAL" : "CREDIT_LIMIT", "default.payment.next.month" : "DEFAULT_STATUS"}, inplace = True)

In [None]:
print("For Data Description and Explanation of Variables, please refer to `APPENDIX`")
print()
data.info()

In [None]:
# Create a copy in case we need the variables in numeric form for analysis
data_original=data.copy()

In [None]:
data.describe()

#### General Description of the Data Set

>- The data set has 30,000 samples
>- The average credit limit is 167,484
>- The majority of the samples are female, and the average age is 35 years
>- The average default rate is 22.1%

We noticed that some categorical variables do not match the data description, for example:
>- `Education` has a maximum value of 6, which is not in the data description
>- `Marital Status` has a minimum value of 0, which is not in the data description

So let's take a look at the unique values of the categorical variables

> **a. Gender**

In [None]:
data.GENDER.unique()

>According to the appendix, we get the following categories:
>
>1. Male
>2. Female

In [None]:
data["GENDER"]=data["GENDER"].apply(lambda x: "Male" if x==1 else "Female")

**b. Education**


In [None]:
data.EDUCATION.unique()

>Let's combine 0, 4, 5 and 6 into an `"Others"` category, which gives the following categories:
>
>1. Graduate School
>2. University
>3. High School
>4. Others


In [None]:
data["EDUCATION"] = data["EDUCATION"].apply(lambda x: "Graduate School" if x==1 else ("University" if x==2 else ("High School" if x==3 else "Others")))

**c. Marital Status**

In [None]:
data.MARITAL_STATUS.unique()

>Let's combine 0 and 3 into an `"Others"` category, which gives the following categories:
>
>1. Married
>2. Single
>3. Others
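
The categories above are listed, but the recoding step itself is not shown in this notebook. A minimal sketch of one way to apply it, assuming the numeric codes from the appendix (1 = married, 2 = single) and using a small hypothetical frame `demo` in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for the MARITAL_STATUS column of `data`
demo = pd.DataFrame({"MARITAL_STATUS": [1, 2, 3, 0, 2]})

# Codes 1 and 2 keep their meaning; anything else (0 and 3) becomes "Others"
marital_labels = {1: "Married", 2: "Single"}
demo["MARITAL_STATUS"] = demo["MARITAL_STATUS"].map(marital_labels).fillna("Others")

print(demo["MARITAL_STATUS"].tolist())
```

If a step like this were applied to `data`, marital status would become a text column, so any later dummy encoding would also expand it into indicator columns. Whether that is wanted is a modelling choice.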

**d. Historical payment status**

In [None]:
data.PAY_1.unique()

>Let's combine the unique values into the following categories:
>
>1. No delay (negative status codes)
>2. Delay (status codes of 0 or above)


In [None]:
for i in data.iloc[:,5:11]:
    data[i] = data[i].apply(lambda x: "No Delay" if x<0 else "Delay")

# 3. EDA

## Univariate Analysis

In [None]:
#Generating Frequency Distribution histograms of Numerical Variables

hist_num = data.iloc[:,:-1].hist(figsize=(20,15))
plt.suptitle('Frequency Distribution of Numerical Variables', x=0.5, y=1.05, ha='center', fontsize='xx-large')
plt.tight_layout()

>`Interpretation`
>- Some variables have strongly right-skewed distribution(eg: Credit Limit, Pay_amt)

#### Let's apply a log function to bring them closer to a bell curve

In [None]:
# Log the Credit Limit in new column
data["Log_limit"] = np.log(data.CREDIT_LIMIT)

# Log the Payment Amounts in new columns
index=1
for i in data.iloc[:,-8:-2]:
    log_pay_amt="Log_pay_amt"+str(index)
    data[log_pay_amt] = np.log(data[i]+1)
    index=index+1
    
# New data_log_transformed dataframe 
data_log_transformed = data_original.copy()
data_log_transformed["CREDIT_LIMIT"] = data["Log_limit"]
# Log the Payment Amounts in new columns
index=1
for i in data_log_transformed.iloc[:,-7:-1]:
    log_pay_amt="Log_pay_amt"+str(index)
    data_log_transformed[log_pay_amt] = np.log(data_log_transformed[i]+1)
    index=index+1
    
# Log the Bill Amounts in new columns (non-positive bill amounts are mapped to 0)
index=1
for i in data_log_transformed[['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6']]:
    log_bill_amt="Log_bill_amt"+str(index)
    data_log_transformed[log_bill_amt] = data_log_transformed[i].apply( lambda x: np.log1p(x) if (x>0) else 0 )
    index += 1
    
    
data_log_transformed = data_log_transformed.drop(
    columns=['BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6',
             'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3','PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
            'CREDIT_LIMIT'])

#Check the result of logged distribution of Credit Limit and Payment Amount
hist_num = data.iloc[:,-7:].hist(figsize=(20,15))
plt.suptitle("Frequency Distribution of Log (Credit Limit) and Log(Payment Amount + 1) variables", x=0.5, y=1.05, ha='center', fontsize='xx-large')
plt.tight_layout()

> `Observation:`
>
> The distribution of the logged PAY_AMT variables is closer to a bell curve than the original distribution
>
> Apart from the spike of "zero payments", the remaining data are approximately normally distributed
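
To illustrate why `log(x + 1)` is used for the payment amounts, here is a small sketch on synthetic data. The `amounts` series is hypothetical (log-normal values with some exact zeros), not the real `PAY_AMT` columns; it only shows that zeros stay finite and that the skewness of the non-zero part drops sharply after the transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed payment amounts, with some exact zeros mixed in
amounts = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))
amounts.iloc[:500] = 0.0

# log1p(x) = log(1 + x) keeps zero payments finite (they map to 0)
logged = np.log1p(amounts)

raw_skew = amounts.skew()
log_skew_nonzero = logged[amounts > 0].skew()
print(f"skewness of raw amounts: {raw_skew:.2f}")
print(f"skewness of log1p amounts (non-zero only): {log_skew_nonzero:.2f}")
```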

#### Generate Frequency Distribution Histogram of Categorical Variables

In [None]:
fig, axes = plt.subplots(figsize=(15,15),nrows=3, ncols=3)
plt.suptitle('Frequency Distribution of Categorical Variables', x=0.5, y=1.05, ha='center', fontsize='x-large')
data['GENDER'].value_counts().plot.bar(ax=axes[0,0],title="Gender")
data['EDUCATION'].value_counts().plot.bar(ax=axes[0,1],title="Education")
data['MARITAL_STATUS'].value_counts().plot.bar(ax=axes[0,2],title="Marital_Status")
i_row=1
i_col=0
count=1
for i in data.iloc[:,5:11]:
    data[i].value_counts().plot.bar(ax=axes[i_row,i_col],title="Pay_Status_"+str(count))
    count=count+1
    i_col=i_col+1
    if count>=4:
        i_row=2
    if count==4:
        i_col=0
# adjust subplot spacing and display the figure
plt.tight_layout()
plt.show()


#### Generate Frequency Distribution Histogram of Target Value

In [None]:
plt.suptitle('Frequency Distribution of Target Value', x=0.5, y=1.05, ha='center', fontsize='x-large')
data['DEFAULT_STATUS'].value_counts().plot.bar(title="Defaulted or not")

In [None]:
# Convert to a percentage before rounding to avoid floating-point noise in the printed value
ave_default_rate = round(np.mean(data["DEFAULT_STATUS"]) * 100, 1)
print("The average default rate of the dataset is " + str(ave_default_rate) +"%")

## Multivariate Analysis

#### Correlation Matrix

In [None]:
f, ax = plt.subplots(figsize = (20, 20))   
# this is to set fig size

correlational_matrix = data_original.corr()
#calculate correlation matrix and assign it

# np.bool was removed from recent NumPy releases; the built-in bool is equivalent here
mask = np.triu(np.ones_like(correlational_matrix, dtype = bool))


cmap = sns.diverging_palette(500, 10, as_cmap = True) 
# this is just to set color range

# Range of correlational coefficients: -1 through 1

sns.heatmap(correlational_matrix,        
            mask = mask,
            cmap = cmap,                 #set color range
            vmax = 1,                    #affects the color range
            center = 0,                  #affects the color range
            square = True,               
            annot= True,                 #add annotation
            fmt=".1f",                   #set decimal place
            linewidths = 1,              #set line width
            cbar_kws = {"shrink": 0.5})  #shrink color bar by 0.5 times

>`Interpretation`
>- There is a positive correlation between `Default Status` and `PAY_1` to `PAY_6`. However, the correlation is still considered low, because it is only between 0.2 and 0.3.
>
>- There is a negative correlation between `Default Status` and `Credit Limit`. However, the -0.2 correlation is considered low.
>
>- There is a strong relationship between variables within the same category, such as "Bill_AMT" and "Payment History".

Since our dependent variable "DEFAULT_STATUS" is categorical, we can split the distributions by "default/not-default" to look at the characteristics of the two groups in relation to our independent variables. We want to see how well each individual independent variable separates the two groups.
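
As a complement to reading values off the heat map, the correlations with the target can be ranked numerically. The sketch below uses a hypothetical synthetic frame `demo` (not `data_original`), in which the toy target is constructed to depend on `PAY_1`, purely to show the ranking pattern:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Hypothetical numeric frame standing in for `data_original`
pay_1 = rng.integers(-2, 4, size=n)
credit_limit = rng.uniform(10_000, 500_000, size=n)
age = rng.integers(21, 70, size=n)
# Make the toy target depend on pay_1 so that a correlation is present
default_status = (pay_1 + rng.normal(0, 1.5, size=n) > 1.5).astype(int)

demo = pd.DataFrame({
    "PAY_1": pay_1,
    "CREDIT_LIMIT": credit_limit,
    "AGE": age,
    "DEFAULT_STATUS": default_status,
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = demo.corr()["DEFAULT_STATUS"].drop("DEFAULT_STATUS")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations (such as the one reported for the credit limit) near the top of the list.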

In [None]:
data_log_transformed.head()

In [None]:
# Visualization of Credit Limit and Default Count using histogram
fig, ax = plt.subplots(figsize = (10, 6))
plt.hist(data.Log_limit, bins = 40, alpha = 1, color = "yellow", label="Total")
plt.hist(data.query('DEFAULT_STATUS == 0').Log_limit, bins = 40, alpha = 1, color = "orange", label = "Not Default")
plt.hist(data.query('DEFAULT_STATUS == 1').Log_limit, bins = 40, alpha = 1, color = "red", label = "Default")
plt.xlabel("Log of Credit Limit")
plt.ylabel("Count")
plt.title("Log value of Credit Limit VS Number of Defaults")
plt.grid()
plt.legend()
plt.show()

>`Interpretation`
>
>Cards with lower Logged `Credit Limit` (below 11) tend to have higher defaults.

In [None]:
subset = data[['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 
               'PAY_5', 'PAY_6', 'DEFAULT_STATUS']]

f, axes = plt.subplots(2, 3, figsize=(15, 13), facecolor='white')
f.suptitle('FREQUENCY OF DEFAULT (BY HISTORY OF DEFAULT)')

ax1 = sns.countplot(x="PAY_1", hue="DEFAULT_STATUS", data=subset, palette="Greens", ax=axes[0,0])
ax2 = sns.countplot(x="PAY_2", hue="DEFAULT_STATUS", data=subset, palette="Blues", ax=axes[0,1])
ax3 = sns.countplot(x="PAY_3", hue="DEFAULT_STATUS", data=subset, palette="Greens", ax=axes[0,2])
ax4 = sns.countplot(x="PAY_4", hue="DEFAULT_STATUS", data=subset, palette="Blues", ax=axes[1,0])
ax5 = sns.countplot(x="PAY_5", hue="DEFAULT_STATUS", data=subset, palette="Greens", ax=axes[1,1])
ax6 = sns.countplot(x="PAY_6", hue="DEFAULT_STATUS", data=subset, palette="Blues", ax=axes[1,2]);

>`Interpretation`
>
>Customers who defaulted in Oct (target value) more often have a history of delayed payments in the past 6 months.

In [None]:
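print("Histogram by Age")

plt.figure(figsize = (8 , 6))
# sns.distplot is deprecated; histplot with kde=True and stat="density" is the replacement
sns.histplot(data.query('DEFAULT_STATUS == 1').AGE, bins = 20, color="green", kde=True, stat="density")
mean_age = data.AGE.mean()
plt.axvline(mean_age,0,1, color = "blue")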

>`Interpretation:`
>   
>- The majority of the credit card holders are between 20 and 30 years old

#### Let's categorize the age into the following age groups
* age < 40 `young`
* age >= 40 and age < 60 `middle`
* age >= 60 `old`
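
The binning rule above can also be sketched with `pd.cut` as an alternative to a hand-written function. This is only an illustration on a hypothetical `ages` series; the bin edges follow the three groups listed above, with age 60 itself assumed to fall in the oldest group:

```python
import numpy as np
import pandas as pd

ages = pd.Series([21, 39, 40, 59, 60, 75])

# right=False makes each bin closed on the left: [0, 40), [40, 60), [60, inf)
age_group = pd.cut(
    ages,
    bins=[0, 40, 60, np.inf],
    right=False,
    labels=["Young", "Middle", "Old"],
)
print(age_group.tolist())
```

`pd.cut` returns an ordered categorical, which can be convenient when plots should keep the groups in age order rather than alphabetical order.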

In [None]:
#define a function to categorize age group

def get_group(age):
    if age < 40:
        return "Young"
    elif age >= 60:
        return "Old"
    else:
        return "Middle"

In [None]:
# apply to "AGE" and create a new column
data["Age_group"] = data["AGE"].apply(get_group)

In [None]:
print("Age Group VS Default Rate")

plt.figure(figsize = (5 , 5))
sns.barplot(x = 'Age_group', y = "DEFAULT_STATUS", data = data)
plt.show()

>`Interpretation:`
>   
>- Older age groups have higher default rates

> Default Rate Ranking by `Age Group`:
> 
> `Old Age` > `Middle Age` > `Young Age`

In [None]:
# Define a function to draw a bar plot of default rate by a categorical variable
def plot_cat(categorical_variable):
    # Create the figure first so that figsize applies to the bar plot
    plt.figure(figsize=(10,6))
    sns.barplot(x = categorical_variable, y="DEFAULT_STATUS", data=data)
    plt.show()

In [None]:
print("Default Rate VS Gender")
plot_cat("GENDER")

>`Interpretation:`
   > 
>- Males have a higher default rate than females

In [None]:
print("Default Rate VS Education Background")
plot_cat("EDUCATION")

>`Interpretation:`
    >
>- Customers with a higher education background have a lower default rate

> Default Rate Ranking by `Education` Background:
> 
> `High School` > `University` > `Graduate School` > `Others`

In [None]:
print("Default Rate VS Marital Status")
plot_cat("MARITAL_STATUS")

>`Interpretation:`
>    
>- Customers with `Single` marital status have a lower default rate
>- Customers with `Others` marital status have a higher default rate

> Default Rate Ranking by `Marital Status`:
> 
> `Others` > `Married` > `Single`

In [None]:
print("Default Rate VS History of Delaying Payment in previous month (SEP)")
plot_cat("PAY_1")

>`Interpretation:`
 >   
>- Customers with a payment delay in Sep (`PAY_1`) have a higher chance of default in Oct (`Target Value`)

In [None]:
print("Grouped age and history payment status vs Default Rate")

sns.barplot(x = 'Age_group', y = "DEFAULT_STATUS", data = data, hue = "PAY_1")
plt.show()

>`Interpretation`
>
>Customers with a history of payment delay in Sep (`PAY_1`) have a higher default rate in Oct (`Target Value`),
>and this pattern is consistent across all age groups

In [None]:
print("Grouped age and education background vs Default Rate")

sns.barplot(x = 'Age_group', y = "DEFAULT_STATUS", data = data, hue = "EDUCATION")
plt.show()

>`Interpretation`
>
>- The default rate of customers with a `High School` education background is very similar across all age groups
>- The default rate of customers with a `University` or `Graduate School` education background is high for the `Old` age group, and relatively low for the `Young` and `Middle` age groups

In [None]:
print("Plotting log of PAY_AMT Histogram")

plt.figure(figsize = (8 , 6))
# sns.distplot is deprecated; histplot with kde=True and stat="density" is the replacement
sns.histplot(data.query('DEFAULT_STATUS == 1').Log_pay_amt1, bins = 20, color="green", kde=True, stat="density")
mean_amt1 = data.Log_pay_amt1.mean()
plt.axvline(mean_amt1,0,1, color = "blue")

>`Interpretation`
>
>- The peak on the left represents customers with zero payment in Sep (`PAY_AMT1`)
>- Apart from the zero-payment customers, the logged payment amounts are close to a normal distribution.

#### History of Payment Amount VS Default Status
Use a function to categorize the log of the payment amount in Sep (`PAY_AMT1`) into the following groups:
* log of the payment amount in September (Log_pay_amt1) < 2.5 `low`
* Log_pay_amt1 >= 2.5 and Log_pay_amt1 < 9 `medium`
* Log_pay_amt1 >= 9 `high`

In [None]:
# Define our function 
def log_amt(x):
    if x<2.5:
        return "low"
    elif x>=2.5 and x<9:
        return "medium"
    else:
        return "high"

In [None]:
# Apply log_amt function to SEP (Log_pay_amt1) and create new column
data["log_amt1_group"]=data["Log_pay_amt1"].apply(lambda x:log_amt(x))

In [None]:
print("Default Status VS Payment Amount in Sep by Group")
plot_cat("log_amt1_group")

>`Interpretation`
>
>- `Low` Payment Amount in `Sep` has `higher` Default Rate in Target Month
>- `High` Payment Amount in `Sep` has `lower` Default Rate in Target Month


In [None]:
# Apply log_amt function to AUG (Log_pay_amt2) and create new column
data["log_amt2_group"]=data["Log_pay_amt2"].apply(lambda x:log_amt(x))

In [None]:
print("Default Status VS Payment Amount in Aug by Group")
plot_cat("log_amt2_group")

>`Interpretation`
>
>- `Low` Payment Amount in `Aug` has `higher` Default Rate in Target Month
>- `High` Payment Amount in `Aug` has `lower` Default Rate in Target Month
>- This is consistent with our previous observation

In [None]:
# Apply the same formula to other months
data["log_amt3_group"]=data["Log_pay_amt3"].apply(lambda x:log_amt(x))
data["log_amt4_group"]=data["Log_pay_amt4"].apply(lambda x:log_amt(x))
data["log_amt5_group"]=data["Log_pay_amt5"].apply(lambda x:log_amt(x))
data["log_amt6_group"]=data["Log_pay_amt6"].apply(lambda x:log_amt(x))

In [None]:
f, axes = plt.subplots(2, 2, figsize=(15, 13), facecolor='white')

ax1 = sns.barplot(x = "log_amt3_group", y="DEFAULT_STATUS", data=data, ax=axes[0,0])
ax2 = sns.barplot(x = "log_amt4_group", y="DEFAULT_STATUS", data=data, ax=axes[0,1])
ax3 = sns.barplot(x = "log_amt5_group", y="DEFAULT_STATUS", data=data, ax=axes[1,0])
ax4 = sns.barplot(x = "log_amt6_group", y="DEFAULT_STATUS", data=data, ax=axes[1,1])


>`Interpretation`
>
>- `Low` Payment Amount in the past tends to have a higher Default Rate in Target Month
>- `High` Payment Amount in the past tends to have a lower Default Rate in Target Month
>- This pattern is consistent throughout the April to Sep data
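
The bar heights in plots like these are group means of the 0/1 target, so the same comparison can be tabulated directly. A minimal sketch on a hypothetical eight-row frame `demo` (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature version of `data` with one grouping column
demo = pd.DataFrame({
    "log_amt1_group": ["low", "low", "low", "medium", "medium", "high", "high", "high"],
    "DEFAULT_STATUS": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 target per group is the group's default rate;
# the count shows how many customers each rate is based on
rates = (
    demo.groupby("log_amt1_group")["DEFAULT_STATUS"]
    .agg(default_rate="mean", customers="count")
    .sort_values("default_rate", ascending=False)
)
print(rates)
```

Reporting the group size next to the rate helps judge how much weight a difference between small groups deserves.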

# 4. Modelling

In [None]:
import sklearn
from sklearn.model_selection import train_test_split

In [None]:
import statsmodels.api as sm

In [None]:
data_original.info()

### Training logistic regression model with all independent variables

In [None]:
y = data_original['DEFAULT_STATUS']
X = data_original.drop(columns=['DEFAULT_STATUS'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
log_reg = sm.Logit(y_train, X_train).fit() 

In [None]:
print(log_reg.summary()) 

In [None]:
yhat = log_reg.predict(X_test) 
prediction = list(map(round, yhat)) 

from sklearn.metrics import (confusion_matrix,  
                           accuracy_score) 
  
# confusion matrix 
cm = confusion_matrix(y_test, prediction)  
print ("Confusion Matrix : \n", cm)  
  
# accuracy score of the model 
print('Test accuracy = ', accuracy_score(y_test, prediction))


### Training logistic regression model with chosen independent variables

In [None]:
X_train=X_train[["CREDIT_LIMIT", "GENDER", "EDUCATION","MARITAL_STATUS","PAY_1", "PAY_2", "PAY_3", "BILL_AMT1", "PAY_AMT1", "PAY_AMT2"]]
X_test=X_test[["CREDIT_LIMIT", "GENDER", "EDUCATION","MARITAL_STATUS","PAY_1", "PAY_2", "PAY_3", "BILL_AMT1", "PAY_AMT1", "PAY_AMT2"]]

In [None]:
log_reg1 = sm.Logit(y_train, X_train).fit() 

In [None]:
print(log_reg1.summary()) 

In [None]:
yhat = log_reg1.predict(X_test) 
prediction = list(map(round, yhat)) 

from sklearn.metrics import (confusion_matrix,  
                           accuracy_score) 
  
# confusion matrix 
cm = confusion_matrix(y_test, prediction)  
print ("Confusion Matrix : \n", cm)  
  
# accuracy score of the model 
print('Test accuracy = ', accuracy_score(y_test, prediction))


### Implementing dummy variables for modeling with categorical variables
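
Dummy (indicator) variables turn each category level into its own 0/1 column. The sketch below uses a hypothetical three-row frame `demo` to compare full one-hot encoding with `drop_first=True`. Dropping one level per variable is a common way to avoid perfectly collinear columns in a regression; it is shown here as an option, not as what this notebook does:

```python
import pandas as pd

# Hypothetical mix of categorical and numeric columns
demo = pd.DataFrame({
    "GENDER": ["Male", "Female", "Female"],
    "EDUCATION": ["University", "High School", "Others"],
    "BILL_AMT1": [3913, 2682, 29239],
})

# Full one-hot encoding: one indicator column per category level
full = pd.get_dummies(demo, dtype=int)

# drop_first=True removes one level per variable, which avoids perfectly
# collinear indicator columns in a regression design matrix
reduced = pd.get_dummies(demo, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```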

In [None]:
data.info()

In [None]:
data_d=data.drop(columns=['CREDIT_LIMIT','AGE','PAY_AMT1','PAY_AMT2','PAY_AMT3',
                          'PAY_AMT4','PAY_AMT5','PAY_AMT6','Log_pay_amt1','Log_pay_amt2',
                          'Log_pay_amt3','Log_pay_amt4','Log_pay_amt5','Log_pay_amt6'])

In [None]:
# dtype=int keeps the indicator columns numeric (newer pandas versions default to bool)
data_d=pd.get_dummies(data_d, dtype=int)
y = data_d['DEFAULT_STATUS']
X = data_d.drop(columns=['DEFAULT_STATUS'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
log_reg2 = sm.Logit(y_train, X_train).fit() 

In [None]:
print(log_reg2.summary()) 

In [None]:
yhat = log_reg2.predict(X_test) 
prediction = list(map(round, yhat)) 

from sklearn.metrics import (confusion_matrix,  
                           accuracy_score) 
  
# confusion matrix 
cm = confusion_matrix(y_test, prediction)  
print ("Confusion Matrix : \n", cm)  
  
# accuracy score of the model 
print('Test accuracy = ', accuracy_score(y_test, prediction))

### Using just log CREDIT_LIMIT as independent variable


In [None]:
X_train = X_train['Log_limit']
X_test = X_test['Log_limit']

In [None]:
log_reg3 = sm.Logit(y_train, X_train).fit()
print(log_reg3.summary())

In [None]:
yhat = log_reg3.predict(X_test) 
prediction = list(map(round, yhat)) 

from sklearn.metrics import (confusion_matrix,  
                           accuracy_score) 
  
# confusion matrix 
cm = confusion_matrix(y_test, prediction)  
print ("Confusion Matrix : \n", cm)  
  
# accuracy score of the model 
print('Test accuracy = ', accuracy_score(y_test, prediction))

>`Interpretation`
>
> We can see here that the target class is imbalanced. Defaults make up only ~22% of the data, so a naive baseline model that predicts non-default for every customer is already right ~78% of the time. Hence, we need to explore some data sampling methods in order to build a more robust model that is reliable on unseen data.
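
To make the baseline argument concrete, here is a small sketch with hypothetical labels drawn at roughly the same class balance (~22% defaults). It shows that an always-"no default" predictor scores well on accuracy while identifying none of the defaulters:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labels with roughly the same class balance as the dataset (~22% defaults)
y_true = (rng.random(10_000) < 0.22).astype(int)

# Naive model: always predict "no default"
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()

# Recall on the default class: share of actual defaulters that were flagged
true_positives = np.sum((y_true == 1) & (y_pred == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.3f}, recall on defaults = {recall:.3f}")
```

This is why accuracy alone can be misleading here, and why it can help to read the confusion matrix alongside it.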

### Sample of data using over-sampling, under-sampling and SMOTE algorithm

In [None]:
from sklearn.utils import resample

In [None]:
data.columns

In [None]:
data_majority = data[data.DEFAULT_STATUS==0]
data_minority = data[data.DEFAULT_STATUS==1]

print(data_majority.DEFAULT_STATUS.count())
print("-----------")
print(data_minority.DEFAULT_STATUS.count())
print("-----------")
print(data.DEFAULT_STATUS.value_counts())

In [None]:
# Upsample minority class
data_minority_upsampled = resample(data_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=23364,    # to match majority class
                                 random_state=777) # reproducible results

# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled])
# Display new class counts
data_upsampled.DEFAULT_STATUS.value_counts()

In [None]:
# Downsample majority class
data_majority_downsampled = resample(data_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=6636,     # to match minority class
                                 random_state=777) # reproducible results

# Combine minority class with downsampled majority class
data_downsampled = pd.concat([data_majority_downsampled, data_minority])
# Display new class counts
data_downsampled.DEFAULT_STATUS.value_counts()

### Pros and Cons of Up/down Sampling 

Upsampling the minority class (DEFAULT_STATUS=1) carries a risk of overfitting, since it duplicates existing minority-class rows. Nevertheless, because no information is discarded, it is expected to perform better than downsampling [3].

Downsampling can discard potentially useful information and can bias the sample, but it helps improve run time.
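
The two resampling strategies can be sketched with plain pandas. This is an illustration on a hypothetical ten-row training frame `train`, using `DataFrame.sample` instead of `sklearn.utils.resample`. A common practice, assumed here, is to resample only the training portion so that the test set keeps its original class balance:

```python
import pandas as pd

# Hypothetical training frame: 8 non-defaults and 2 defaults
train = pd.DataFrame({
    "x": range(10),
    "DEFAULT_STATUS": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
majority = train[train["DEFAULT_STATUS"] == 0]
minority = train[train["DEFAULT_STATUS"] == 1]

# Upsampling: draw minority rows with replacement until the classes match
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=777),
])

# Downsampling: keep a random subset of majority rows, without replacement
downsampled = pd.concat([
    majority.sample(n=len(minority), replace=False, random_state=777),
    minority,
])

print(upsampled["DEFAULT_STATUS"].value_counts().to_dict())
print(downsampled["DEFAULT_STATUS"].value_counts().to_dict())
```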

A third method is to create a synthetic sample using the [SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) (Synthetic Minority Oversampling TEchnique) [4] algorithm, an oversampling method that creates synthetic samples from the minority class instead of creating copies. It selects two or more similar instances and perturbs them one at a time by a random amount.

Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.
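
The three steps just described can be sketched in NumPy for a single synthetic point. This is an illustrative toy on hypothetical 2-D data, not the implementation used by `imbalanced-learn`. The names `minority`, `k` and `gap` are introduced only for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples in a 2-D feature space
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
k = 5

# 1. Pick a random minority example
i = rng.integers(len(minority))
x = minority[i]

# 2. Find its k nearest minority neighbours (Euclidean distance, excluding itself)
distances = np.linalg.norm(minority - x, axis=1)
neighbour_idx = np.argsort(distances)[1:k + 1]

# 3. Pick one neighbour at random and interpolate between the two points
neighbour = minority[rng.choice(neighbour_idx)]
gap = rng.random()  # uniform in [0, 1)
synthetic = x + gap * (neighbour - x)

print("original :", x)
print("neighbour:", neighbour)
print("synthetic:", synthetic)
```

Because the new point lies on the line segment between two existing minority samples, it adds variety without exactly duplicating a row.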

In [None]:
## remember to pip install imbalanced-learn

from imblearn.over_sampling import SMOTE

In [None]:
y = data_log_transformed['DEFAULT_STATUS']
X = data_log_transformed.drop(columns=['DEFAULT_STATUS'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
smote = SMOTE(random_state=123)
# fit_sample was removed from recent imbalanced-learn releases; fit_resample is the current name
X_SMOTE, y_SMOTE = smote.fit_resample(X_train, y_train)
print(len(y_SMOTE))
print(y_SMOTE.sum())

In [None]:
y_SMOTE.value_counts()

In [None]:
log_reg_smote = sm.Logit(y_SMOTE, X_SMOTE).fit()
print(log_reg_smote.summary())

In [None]:
#-------------- 
# logistic regression 
#--------------
yhat = log_reg_smote.predict(X_test) 
prediction = list(map(round, yhat)) 

from sklearn.metrics import (confusion_matrix,  
                           accuracy_score) 
  
# confusion matrix 
cm = confusion_matrix(y_test, prediction)  
print ("Confusion Matrix : \n", cm)  
  
# accuracy score of the model 
print('Test accuracy = ', accuracy_score(y_test, prediction))

In [None]:
#-------------- 
# kernel SVM 
#--------------
from sklearn.svm import SVC
classifier1 = SVC(kernel="rbf")
classifier1.fit( X_SMOTE, y_SMOTE )
y_pred = classifier1.predict( X_test )

cm = confusion_matrix( y_test, y_pred )
print("Accuracy on Test Set for kernel-SVM = %.2f" % ((cm[0,0] + cm[1,1] )/len(X_test)))
# confusion matrix 
print ("Confusion Matrix : \n", cm) 

# 5. Conclusion

We started with some initial data cleaning and wrangling to better visualize our data and prepare it for model training.

From our initial data visualization, we found that many of the monetary variables are heavily right-skewed, so we applied a log function to make their distributions more symmetric. From our exploratory data analysis, we identified the features most strongly correlated with our target class and used them to select useful features.

A second issue was our highly imbalanced target class: the "default" class made up only ~22% of the total counts. We therefore explored several methods, such as upsampling the minority class, downsampling the majority class, and synthetic construction of data with the SMOTE algorithm.

The features that we found to have high predictive value for default are PAY_X (i.e. the repayment status in previous months), LIMIT_BAL and PAY_AMTX (the amount paid in previous months).

We tried both logistic regression and kernel SVM, which are known for their effectiveness in classification problems, explored several feature selection approaches, and compared the results. We concluded that more time and more extensive model experiments are required to fine-tune the models for higher accuracy.

# 6. References

[1] Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

[2] Zoltan, C. (2018, November 13) SVM and Kernel SVM https://towardsdatascience.com/svm-and-kernel-svm-fed02bef1200

[3] Weiss, G. M., McCarthy, K., & Zabar, B. (2007). Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? DMIN, 7, 35-41.

[4] Brownlee, J. (2020, January 17). SMOTE for Imbalanced Classification with Python https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

# 7. Appendix

#### Data Description



| CODE |  REPLACED NAME | DESCRIPTION | UNIT |
|---|---|---|---|
|X1|CREDIT_LIMIT|Amount of the given credit| NT dollar|
|X2|GENDER|Gender|1 = male; 2 = female|
|X3|EDUCATION|Education level|1 = graduate school; 2 = university; 3 = high school; 4 = others|
|X4|MARITAL_STATUS|Marital status|1 = married; 2 = single; 3 = others|
|X5|AGE|Age |year|
|X6|PAY_1|Repayment status - Sep|-1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months|
|X7|PAY_2|Repayment status - Aug|-1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months|
|X8|PAY_3|Repayment status - Jul|-1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months|
|X9|PAY_4|Repayment status - Jun|-1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months|
|X10|PAY_5|Repayment status - May|-1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months|
|X11|PAY_6|Repayment status - Apr|-1 = pay duly; 1 = payment delay for 1 month; 2 = payment delay for 2 months|
|X12|BILL_AMT1|Amount of bill statement - Sep| NT dollar|
|X13|BILL_AMT2|Amount of bill statement - Aug| NT dollar|
|X14|BILL_AMT3|Amount of bill statement - Jul| NT dollar|
|X15|BILL_AMT4|Amount of bill statement - Jun| NT dollar|
|X16|BILL_AMT5|Amount of bill statement - May| NT dollar|
|X17|BILL_AMT6|Amount of bill statement - Apr| NT dollar|
|X18|PAY_AMT1|Amount paid in Sep| NT dollar|
|X19|PAY_AMT2|Amount paid in Aug| NT dollar|
|X20|PAY_AMT3|Amount paid in Jul| NT dollar|
|X21|PAY_AMT4|Amount paid in Jun| NT dollar|
|X22|PAY_AMT5|Amount paid in May| NT dollar|
|X23|PAY_AMT6|Amount paid in Apr| NT dollar|
|Y|DEFAULT_STATUS|Default payment in Oct|default = 1, not default = 0|


In [None]:
exit()