# **Credit Default Project**: Solved with a Logit Model

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this project, we will predict the probability of default on a credit card account. From a risk management perspective, the predicted probability of default is a valuable indicator of whether a customer should be granted credit or not. Learning from a data set where some customers have defaulted on their debt, we implement a simple statistical model.

You are already familiar with the customer default payments dataset from the Demo, so feel free to refer back to it.

<img src="https://greendayonline.com/wp-content/uploads/2017/03/Recovering-From-Student-Loan-Default.jpg" width="500" height="500" align="center"/>


Image source: https://greendayonline.com/wp-content/uploads/2017/03/Recovering-From-Student-Loan-Default.jpg

### The Credit Card Default Dataset 

We will try to predict the probability of defaulting on a credit card account at a Taiwanese bank. A credit card default happens when a customer fails to pay the minimum due on a credit card bill for more than 6 months. 

We will use a dataset from a Taiwanese bank with 30,000 observations (Source: *Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.*). Each observation represents an account at the bank at the end of October 2005. We renamed the variable `default_payment_next_month` to `customer_default`. The target variable to predict is `customer_default`, i.e., whether the customer will default in the following month (1 = Yes, 0 = No). The dataset also includes 23 other explanatory features.

Variables are defined as follows:

| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| customer_default | Binary        | 1 = default in following month; 0 = no default 
| LIMIT_BAL        | Continuous    | Credit limit   
| SEX              | Categorical   | 1 = male; 2 = female
| EDUCATION        | Categorical   | 1 = graduate school; 2 = university; 3 = high school; 4 = others
| MARRIAGE         | Categorical   | 0 = unknown; 1 = married; 2 = single; 3 = others
| AGE              | Continuous    | Age in years  
| PAY_1            | Categorical   | Repayment status in September, 2005 
| PAY_2            | Categorical   | Repayment status in August, 2005 
| PAY_3            | Categorical   | Repayment status in July, 2005 
| PAY_4            | Categorical   | Repayment status in June, 2005 
| PAY_5            | Categorical   | Repayment status in May, 2005 
| PAY_6            | Categorical   | Repayment status in April, 2005 
| BILL_AMT1        | Continuous    | Balance in September, 2005  
| BILL_AMT2        | Continuous    | Balance in August, 2005  
| BILL_AMT3        | Continuous    | Balance in July, 2005  
| BILL_AMT4        | Continuous    | Balance in June, 2005 
| BILL_AMT5        | Continuous    | Balance in May, 2005  
| BILL_AMT6        | Continuous    | Balance in April, 2005  
| PAY_AMT1         | Continuous    | Amount paid in September, 2005
| PAY_AMT2         | Continuous    | Amount paid in August, 2005
| PAY_AMT3         | Continuous    | Amount paid in July, 2005
| PAY_AMT4         | Continuous    | Amount paid in June, 2005
| PAY_AMT5         | Continuous    | Amount paid in May, 2005
| PAY_AMT6         | Continuous    | Amount paid in April, 2005

The measurement scale for repayment status is:   

    -2 = payment two months in advance   
    -1 = payment one month in advance   
    0 = pay duly   
    1 = payment delay for one month   
    2 = payment delay for two months   
    3 = payment delay for three months   
    4 = payment delay for four months   
    5 = payment delay for five months   
    6 = payment delay for six months   
    7 = payment delay for seven months   
    8 = payment delay for eight months   
    9 = payment delay for nine months or more  

As you might remember, we tried to predict customer defaults using logistic regression. We validated our results with a simple hold-out scheme, splitting the data into a train and a test set, and we evaluated predicted class labels instead of class probabilities. Here, we again implement a logistic regression, but we add a more statistically robust validation scheme and a more appropriate performance metric.



The problem is divided into several parts. For each part, you will have time to work on the question yourself. Feel free to go back to the Demo, use Google/Stackoverflow and work with your neighbour. Together, we will review and discuss the solution to each part.

-------------

## **Part 0**: Setup

In [None]:
# Import all packages 

# Use short-hand for standard packages
import pandas            as pd
import numpy             as np
import seaborn           as sns
import matplotlib.pyplot as plt

# Import individual functions from sklearn 
from sklearn.preprocessing   import StandardScaler
from sklearn.dummy           import DummyClassifier
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.pipeline        import Pipeline

# Special code to ignore unimportant warnings 
import warnings
warnings.filterwarnings('ignore')

# Ensure that output of plotting commands is displayed inline
%matplotlib inline 


In [None]:
# Define all constants

SEED = 1  # seed for the random number generator, so that results are reproducible

## **Part 1**: Data Preprocessing and EDA

First, we would like to understand the main characteristics of the dataset. We might need to transform and clean some features before we can specify a statistical model.

**Q 1:** Load the credit data. Which features need to be one-hot encoded and why? Should you do that for all categorical features? Perform the required one-hot encoding and save your preprocessed data in a new data frame.

Hint: Use the *get_dummies()* function in Pandas.

In [None]:
# Load data
data = pd.read_csv('credit_data.csv')

# One hot encoding of SEX and MARRIAGE status
cols_to_transform = ['SEX', 'MARRIAGE']

# CODE HERE
# One hot encode the two columns SEX and MARRIAGE with the get_dummies() function in Pandas
data_with_dummies = None

# Assert OK to proceed 
assert data_with_dummies is not None, 'HINT: you need to complete the code to proceed.'
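To make the hint concrete, here is a minimal sketch of one-hot encoding with *get_dummies()* on a small made-up data frame. The frame `toy` and its values are assumptions for illustration only, not rows from the credit data. Each category of the listed columns becomes its own 0/1 indicator column, while the columns that are not listed are passed through unchanged.

```python
import pandas as pd

# Hypothetical toy frame (not the real credit data)
toy = pd.DataFrame({
    'SEX':      [1, 2, 2, 1],
    'MARRIAGE': [1, 2, 3, 0],
    'AGE':      [24, 35, 41, 29],
})

# Encode only the nominal columns; AGE is left as it is
toy_with_dummies = pd.get_dummies(toy, columns=['SEX', 'MARRIAGE'])

print(toy_with_dummies.columns.tolist())
```

A nominal feature such as `MARRIAGE` has no natural order, so feeding its integer codes directly into a linear model would impose an arbitrary ranking. Whether a feature with an arguably ordered scale needs the same treatment is a judgment call that this sketch does not settle.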

**Q 2:** Compute the correlation of different features with respect to the target feature, using Pearson correlation. What are the most correlated features? Why do we not just use the most correlated features in the prediction task?

Hint: Store the feature name and correlation in a dictionary structure. To compute the Pearson correlation of two dataframe columns A and B, run the following:

*df['A'].corr(df['B'])*

In [None]:
# Visualize the correlation of different features with respect to target feature
# Compute the absolute correlation

corr_dic = {}

for column in data_with_dummies.columns:
    
    correlation = data_with_dummies[column].corr(data_with_dummies['customer_default'])
    corr_dic[column] = abs(correlation)
    
# Sort by descending correlation
# Change dictionary to a data frame
corr_df = pd.DataFrame.from_dict(corr_dic, orient = 'index', columns=['correlation'])
corr_df.sort_values(by='correlation', ascending = False)


## **Part 2**: Choosing Performance Metric and Baseline

**Q 1:** What could be an appropriate performance metric for this problem? Discuss different options, starting with simple accuracy.

Hint: Consider the case where the classes are not balanced. Also consider the cut-off threshold used to separate the positive and negative classes. Find a popular metric that is insensitive to both the class distribution and the cut-off threshold, and use it for the rest of your analysis.

In [None]:
# Define a toy data set
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.3, 0.8])

# Compute the ROC score 
roc_auc_score(y_true, y_scores)
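The sketch below illustrates two points from the hint, using made-up labels only. First, the ROC AUC can be read as the share of (positive, negative) pairs in which the positive example receives the higher score, which we check by hand on the toy data from the cell above. Second, on an imbalanced sample a model that always predicts the majority class reaches high accuracy while its ROC AUC stays at chance level. The arrays `y_imbalanced` and `always_zero` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Same toy data as in the cell above
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.3, 0.8])

# Pairwise reading of the AUC: a pair counts 1 if the positive scores higher, 0.5 for a tie
pos = y_scores[y_true == 1]
neg = y_scores[y_true == 0]
pairwise_auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
sklearn_auc = roc_auc_score(y_true, y_scores)
print(pairwise_auc, sklearn_auc)

# Hypothetical imbalanced sample: 9 non-defaults, 1 default
y_imbalanced = np.array([0] * 9 + [1])
always_zero = np.zeros(10)

acc = accuracy_score(y_imbalanced, always_zero.astype(int))  # looks good
auc = roc_auc_score(y_imbalanced, always_zero)               # constant scores, no ranking ability
print(acc, auc)
```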

**Q 2:** What could be an appropriate baseline model? What's the baseline performance? What performance do you expect? We will then compare this baseline to more complex models and see how they perform. 

Hint: Divide the data into train and test sets using the *train_test_split()* function in *sklearn.model_selection* package. Then fit an object from *DummyClassifier* class in *sklearn.dummy* package to the train set and evaluate the performance on the test set. What's the default value of the *strategy* parameter in the *DummyClassifier* function?

In [None]:
# Separate target and features

features = data_with_dummies.drop(columns=['customer_default','ID'])
target = data_with_dummies['customer_default']

# CODE HERE
# Create train and test sets (don't forget the seed!)
X_train, X_test, y_train, y_test = None, None, None, None

# Assert OK to proceed 
assert X_train is not None, 'HINT: you need to complete the code to proceed.'

# Define the baseline classifier
baseline_clf = DummyClassifier(strategy = 'stratified', random_state = SEED)

# Fit the dummy classifier 
baseline_clf.fit(X_train, y_train)

# Predict target probabilities of belonging to positive class
y_pred = baseline_clf.predict_proba(X_test)

# Compute area under the ROC curve
score = roc_auc_score(y_test, y_pred[:,1])

round(score, 4)
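As a rough sanity check of what a baseline should score, the following sketch fits two *DummyClassifier* strategies on synthetic data. The data, the assumed 20% positive rate and the names `X_toy`, `y_toy` and `aucs` are illustrative assumptions, not taken from the credit data. Because a dummy classifier ignores the features, we would expect a ROC AUC close to 0.5 in both cases.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 1
rng = np.random.default_rng(SEED)

# Synthetic features and labels (assumed 20% positives), unrelated to each other
X_toy = rng.normal(size=(5000, 3))
y_toy = (rng.random(5000) < 0.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=SEED)

aucs = {}
for strategy in ['stratified', 'prior']:
    clf = DummyClassifier(strategy=strategy, random_state=SEED)
    clf.fit(X_tr, y_tr)
    # Column 1 holds the predicted probability of the positive class
    aucs[strategy] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(aucs)
```

The `prior` strategy returns the same class probabilities for every row, so its ROC AUC is exactly 0.5. The `stratified` strategy draws random labels and therefore fluctuates around 0.5.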

## **Part 3**: Prediction using Logistic Regression

**Q 1:** Predict the probability of default using logistic regression. Is standardization of data needed? What is the performance of your model?

Hint: Use *LogisticRegression* class from *sklearn.linear_model* package.

In [None]:
# CODE HERE
# Define a logistic regression without regularization (don't forget the seed!)
lr_clf = None

# Assert OK to proceed 
assert lr_clf is not None, 'HINT: you need to complete the code to proceed.'

# Fit the logistic regression classifier 
lr_clf.fit(X_train, y_train)

# Predict target probabilities of belonging to the positive class
y_pred = lr_clf.predict_proba(X_test)

# Compute area under the ROC curve
score = roc_auc_score(y_test, y_pred[:,1])

round(score, 4)
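One possible way to approximate a logistic regression without regularization is to set *C* to a very large value, so that the penalty term becomes negligible. The sketch below does this on synthetic data generated from a known logistic model. The data, the true coefficients (2 and -1) and the names `X_toy`, `y_toy` and `toy_clf` are assumptions for illustration. It is not necessarily the intended solution to the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 1
rng = np.random.default_rng(SEED)

# Synthetic data from a known logistic model: logit = 2 * x1 - 1 * x2
X_toy = rng.normal(size=(2000, 2))
logits = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=SEED)

# A very large C means a very weak penalty (C is the inverse regularization strength)
toy_clf = LogisticRegression(C=1e10, max_iter=1000, random_state=SEED)
toy_clf.fit(X_tr, y_tr)

toy_auc = roc_auc_score(y_te, toy_clf.predict_proba(X_te)[:, 1])
print(toy_clf.coef_[0], round(toy_auc, 4))
```

With enough data, the estimated coefficients should land near the true values used to generate the labels.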

**Q 2:** Try to think conceptually about this: what is the correct way to estimate performance of a classifier if we need to tune hyper-parameters? What issues should we avoid? Keep it simple. 

Hint: Think about the problem that might arise if we tune hyper-parameters using the train and test sets.

In [None]:
# Split features and target into training, validation and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = SEED)
X_train_train, X_train_val, y_train_train, y_train_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = SEED)

print('Dimensions of training set: {}'.format(X_train_train.shape))
print('Dimensions of validation set: {}'.format(X_train_val.shape))
print('Dimensions of testing set: {}'.format(X_test.shape))
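The two nested splits above produce a 60/20/20 partition: 20% of all rows go to the test set, and 25% of the remaining 80% (again 20% of the total) go to the validation set. The sketch below checks this arithmetic on a plain index array with the 30,000 rows stated in the dataset description. The array `idx` stands in for the real data frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 1
idx = np.arange(30000)  # stand-in for the 30,000 observations

# First split: 80% for model building, 20% held out for testing
rest, test = train_test_split(idx, test_size=0.2, random_state=SEED)

# Second split: 25% of the remaining 80% becomes the validation set
train, val = train_test_split(rest, test_size=0.25, random_state=SEED)

print(len(train), len(val), len(test))  # 18000 6000 6000
```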

**Q 3:** Tune the parameter *C* in the *LogisticRegression()* function and use the *l2* penalty. Note that regularized linear models need standardized features, because the penalty depends on the scale of each coefficient. What is the correct method to implement the standardization? Obtain the performance of the model using the best value of *C*.

Hint: Think about the possible problem that might arise if we standardize all data (training and test) and then fit a classifier to it. Read about *Pipeline* class from *sklearn.pipeline* package and see how taking advantage of *Pipeline* can avoid the mentioned problem. Also, keep in mind that *C* is the inverse of regularization strength.

In [None]:
# Standardize features and estimate the logistic regression in a single pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lr_clf', LogisticRegression(random_state=SEED)))
pipeline = Pipeline(estimators)
pipeline.set_params(lr_clf__penalty = 'l2')

# Finding the best value of C using the validation set
scores = []
Cs = []

for C in np.logspace(-4, 5, 10):
    
    pipeline.set_params(lr_clf__C = C) 
    pipeline.fit(X_train_train, y_train_train)
    y_val_pred = pipeline.predict_proba(X_train_val)
    score = roc_auc_score(y_train_val, y_val_pred[:,1])
    
    scores.append(score)
    Cs.append(C)

best_idx = scores.index(max(scores))  # position of the highest validation score
best_C = Cs[best_idx]

print('Best C = {} with ROC AUC score = {}'.format(best_C, round(max(scores), 4)))
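To see why a *Pipeline* helps against leakage, the sketch below checks which rows the scaler actually learns from. The tiny arrays, including the extreme test value of 100, are made up for illustration. When we call `fit` on the pipeline, the *StandardScaler* step only sees the training rows, so the test data cannot influence the mean and standard deviation used for scaling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, with an extreme value in the test set
X_tr = np.array([[0.0], [2.0], [4.0], [6.0]])
y_tr = np.array([0, 0, 1, 1])
X_te = np.array([[100.0]])

pipe = Pipeline([
    ('standardize', StandardScaler()),
    ('lr_clf', LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# Mean learned inside the pipeline: training rows only
scaler_mean = pipe.named_steps['standardize'].mean_[0]

# Mean we would get by (wrongly) standardizing train and test together
leaky_mean = np.vstack([X_tr, X_te]).mean()

print(scaler_mean, leaky_mean)  # 3.0 22.4
```

At prediction time the pipeline applies the training mean and scale to the new rows, which mirrors how the model would be used on unseen customers.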


In [None]:
# Performance of the tuned model on test set 
pipeline.set_params(lr_clf__C = best_C)
pipeline.fit(X_train, y_train)
y_pred_lr = pipeline.predict_proba(X_test)
score = roc_auc_score(y_test, y_pred_lr[:,1])

print('LR classifier ROC AUC with l2 regularization = {}'.format(round(score, 4)))

**Q 4:** Check the coefficients and intercept of your tuned model under *l2* regularization. How do the coefficients compare to the correlations computed earlier?

Hint: Extract the logistic regression from your pipeline and use a dictionary structure to store the feature names and coefficient values.

In [None]:
# Extract coefficients
lr_clf = pipeline.named_steps['lr_clf']
coefficients = lr_clf.coef_[0]

coef_dic = {}

for i, col_name in enumerate(X_train_train.columns):
    
    coef = round(coefficients[i], 4)
    coef_dic[col_name] = coef
    
# Sort by descending absolute coefficient value
# Change dictionary to a data frame
coef_df = pd.DataFrame.from_dict(coef_dic, orient = 'index', columns=['coefficient'])
coef_df['abs_coefficient'] = abs(coef_df['coefficient'])
coef_df.sort_values(by='abs_coefficient', ascending = False)


In [None]:
print('Intercept: {}'.format(lr_clf.intercept_))

**Q 5:** Draw the ROC curve of your tuned model under *l2* regularization.

Hint: Use *roc_curve* function from *sklearn.metrics* package.

In [None]:
# Draw ROC curve

y_pred_lr = pipeline.predict_proba(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_lr[:, 1])

plt.figure(figsize = (12, 12))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lr, tpr_lr, label='Logistic Regression')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

**Q 6:** Fit a logistic regression (with neither l1 nor l2 regularization) to the top 6 most important features, as ranked by the logistic regression coefficients. How does the performance of this model compare to that of the logistic regression model trained on all features?

In [None]:
top6_features = ['PAY_1','BILL_AMT1','PAY_AMT1','PAY_AMT2','BILL_AMT2','PAY_2']

X_train_top = X_train[top6_features] 
X_test_top = X_test[top6_features]

top6_features_clf = LogisticRegression(C=10e9, random_state=SEED)  # very large C makes the l2 penalty negligible
top6_features_clf.fit(X_train_top, y_train)
y_pred = top6_features_clf.predict_proba(X_test_top)
score = roc_auc_score(y_test, y_pred[:,1])

round(score, 4)

## **Part 4**: Improving the Prediction with a Better Model 

**Q 1:** If model A performs better than all other models for our problem, does it mean that it is in general a better model? Why?

Hint: Google *no-free-lunch* theorem.

**Q 2:** How can you make your performance results statistically more reliable?

Hint: Consider selection bias in both hyperparameter tuning and testing steps.

In [None]:
# Define toy data 
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([4, 4, 4, 2, 2, 4])

# CODE HERE
# Divide data into different folds without preserving target distribution (k = 2) (use the KFold function)
kf = None

# Assert OK to proceed 
assert kf is not None, 'HINT: you need to complete the code to proceed.'

for train_index, test_index in kf.split(X):
    
    train = [y[i] for i in train_index] 
    test = [y[i] for i in test_index] 
    
    print('Without stratification:', 'TRAIN:', train , 'TEST:', test)


In [None]:
# Divide data into different folds while preserving target distribution (k = 2)
skf = StratifiedKFold(n_splits = 2)

for train_index, test_index in skf.split(X = X, y = y):
    
    train = [y[i] for i in train_index] 
    test = [y[i] for i in test_index] 
    
    print('With stratification:', 'TRAIN:', train, 'TEST:', test)
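Building on the fold splitters above, one way to obtain a more reliable performance estimate is to average the ROC AUC over several stratified folds instead of relying on a single train/test split. The sketch below does this with *cross_val_score* on synthetic data. The data and the names `X_toy`, `y_toy`, `pipe` and `cv_scores` are assumptions for illustration, and the number of folds (5) is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 1
rng = np.random.default_rng(SEED)

# Synthetic data from a known logistic model (illustrative only)
X_toy = rng.normal(size=(1000, 2))
logits = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Scaling sits inside the pipeline, so it is refit on the training part of every fold
pipe = Pipeline([
    ('standardize', StandardScaler()),
    ('lr_clf', LogisticRegression(random_state=SEED)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
cv_scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring='roc_auc')

print('ROC AUC per fold:', np.round(cv_scores, 4))
print('Mean = {:.4f}, std = {:.4f}'.format(cv_scores.mean(), cv_scores.std()))
```

The spread across folds gives a rough sense of how much a single split could mislead us. If hyper-parameters are tuned as well, the tuning would need its own inner cross-validation loop so that the outer folds stay untouched.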
    

## **Bonus**: Further Reading

- Extensive notebook using the same dataset: https://www.kaggle.com/lucabasa/credit-card-default-a-very-pedagogical-notebook
- Linking data science back to business objectives with a "Profit Curve": http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/ClassificationProcessCreditCardDefault.html