# Capstone Project Proposal 1 - Credit Card Client Default

### The Problem To Be Solved

The goal of this capstone project is to build a model that predicts the probability of default for credit card clients. With a robust and reliable model of client default probability, credit card companies would be better equipped to manage the risk of their portfolios.

### Who Is My Hypothetical Client & Why Would They Care About This Problem? 

My hypothetical client could be any credit card company.

Credit card companies could use this model for the following:
1.  To create more reliable stochastic cash flow forecasts for the company; and
2.  To make better-informed decisions about allowing clients to increase their credit limits.

### Data Source

The data for this project is a sample of Taiwanese credit card clients. This anonymized dataset was used by Dr. I-Cheng Yeh and C. H. Lien in their 2009 paper "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients" [1]. The dataset is publicly available on the UC Irvine Machine Learning Repository website (see http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

The dataset contains 30,000 observations and 24 attributes.  

The response variable is a binary variable for default payment in the following month.

There are 23 explanatory variables included in the dataset.  These include:

- History of past payment;
- Amount of bill statement;
- Amount of previous payment;
- Marital status;
- Education;
- Age; and
- Gender.
    
The dataset posted on the UC Irvine Machine Learning Repository website is in .xls format.

I have converted this data to .csv format. Below, I show the first 10 observations in the dataset.

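```python
import pandas as pd

# Use the second row of the file as the column names (header=1)
# and the first column (ID) as the index.
df = pd.read_csv('default of credit card clients.csv',
                 header=1,
                 index_col=0)
df.head(10)
```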

| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
| 6 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 |
| 7 | 500000 | 1 | 1 | 2 | 29 | 0 | 0 | 0 | 0 | 0 | ... | 542653 | 483003 | 473944 | 55000 | 40000 | 38000 | 20239 | 13750 | 13770 | 0 |
| 8 | 100000 | 2 | 2 | 2 | 23 | 0 | -1 | -1 | 0 | 0 | ... | 221 | -159 | 567 | 380 | 601 | 0 | 581 | 1687 | 1542 | 0 |
| 9 | 140000 | 2 | 3 | 1 | 28 | 0 | 0 | 2 | 0 | 0 | ... | 12211 | 11793 | 3719 | 3329 | 0 | 432 | 1000 | 1000 | 1000 | 0 |
| 10 | 20000 | 1 | 3 | 2 | 35 | -2 | -2 | -2 | -2 | -1 | ... | 0 | 13007 | 13912 | 0 | 0 | 0 | 13007 | 1122 | 0 | 0 |


The data is fairly clean, but I will need to deal with missing data (possibly by imputing missing values).  
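As a rough illustration of what dealing with missing data could look like, the sketch below counts missing values per column and fills them with each column's median. The toy frame and its gaps are made up for this example (they are not taken from the dataset), and median imputation is only one assumed option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the NaN values are inserted purely for illustration
# and do not reflect gaps known to exist in the real dataset.
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, np.nan, 50000],
    "AGE": [24, np.nan, 34, 37],
})

# Count the missing values in each column.
missing_counts = toy.isna().sum()

# Replace each missing value with the median of its own column
# (an assumed strategy; mean or model-based imputation are alternatives).
imputed = toy.fillna(toy.median())

print(missing_counts)
print(imputed)
```

Whether imputation is appropriate at all would depend on how many values turn out to be missing and in which columns.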

I also plan to experiment with feature engineering (*i.e.*, transforming variables or creating combinations of multiple variables) with the goal of exposing meaningful relationships in the data.
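To make the idea of combining variables concrete, here is a small sketch that derives two ratio features from three rows copied from the preview of the first 10 observations (IDs 3, 4, and 6). The function name `add_ratio_features` and the feature names `UTILIZATION_4` and `PAY_RATIO_4` are hypothetical, and whether these particular ratios are predictive is an untested assumption.

```python
import pandas as pd

# Three rows copied from the preview of the first 10 observations.
sample = pd.DataFrame(
    {
        "LIMIT_BAL": [90000, 50000, 50000],
        "BILL_AMT4": [14331, 28314, 19394],
        "PAY_AMT4": [1000, 1100, 1000],
    },
    index=pd.Index([3, 4, 6], name="ID"),
)


def add_ratio_features(df):
    """Return a copy of df with two illustrative ratio features added."""
    out = df.copy()
    # Bill amount as a share of the credit limit.
    out["UTILIZATION_4"] = out["BILL_AMT4"] / out["LIMIT_BAL"]
    # Payment relative to the bill; a zero bill is masked to NaN first,
    # so the ratio becomes NaN rather than an infinite value.
    nonzero_bill = out["BILL_AMT4"].where(out["BILL_AMT4"] != 0)
    out["PAY_RATIO_4"] = out["PAY_AMT4"] / nonzero_bill
    return out


features = add_ratio_features(sample)
print(features)
```

Pairing `PAY_AMT4` with `BILL_AMT4` is a simplification; the correct alignment of payment and statement months would need to be checked against the dataset documentation.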

### Approach To Solving The Problem

To predict the probability of default, I plan on using one or more of the following types of models: 
1.  Logistic Regression;
2.  Naïve Bayes;
3.  K-Nearest Neighbors;
4.  Random Forest; and
5.  Gradient Boosting.



This list may change as the project progresses.
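The sketch below shows how the five candidate model types could each be fit with scikit-learn and asked for a predicted probability of the positive class. It runs on synthetic stand-in data generated by `make_classification`, not on the credit card dataset, and the hyperparameters, the choice of `GaussianNB` as the naive Bayes variant, and the 75/25 split are all placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the credit card dataset), kept small to run quickly.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One estimator per candidate model type; all settings are placeholders.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

default_probabilities = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Column 1 of predict_proba holds the estimated probability of class 1.
    default_probabilities[name] = model.predict_proba(X_test)[:, 1]

for name, probs in default_probabilities.items():
    print(f"{name}: mean predicted probability = {probs.mean():.3f}")
```

On the real data, class 1 would correspond to a default payment in the following month, and the models would need to be compared with a proper evaluation metric rather than by mean predicted probability.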

### Project Deliverables

The deliverables for this project will include:
1.  A Jupyter notebook that contains explanations of each step of the analysis and the code used to perform that analysis;
2.  A paper summarizing my analysis (in .PDF format);
3.  A slide deck summarizing my analysis; and
4.  A GitHub repository containing the above-mentioned Jupyter notebook, paper, and slide deck.

In addition, I plan to compare the results of my analysis with those of Yeh & Lien (2009) [1].


#### Citations:

[1]  Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.