# Credit Card Default Prediction

The goal is to predict the likelihood of a client defaulting on their credit loans by creating a credit score prediction model.

By the end of this notebook exercise, we hope to have answered the following questions:
1. How does the probability of default payment vary by categories of different demographic variables?
2. Which variables are the strongest predictors of default payments?

# Dataset

The dataset contains 25 variables:

- **ID**: ID of each client
- **LIMIT_BAL**: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX**: Gender (1=male, 2=female)
- **EDUCATION**: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE**: Marital status (1=married, 2=single, 3=others)
- **AGE**: Age in years
- **PAY_0**: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- **PAY_2**: Repayment status in August, 2005 (scale same as above)
- **PAY_3**: Repayment status in July, 2005 (scale same as above)
- **PAY_4**: Repayment status in June, 2005 (scale same as above)
- **PAY_5**: Repayment status in May, 2005 (scale same as above)
- **PAY_6**: Repayment status in April, 2005 (scale same as above)
- **BILL_AMT1**: Amount of bill statement in September, 2005 (NT dollar)
- **BILL_AMT2**: Amount of bill statement in August, 2005 (NT dollar)
- **BILL_AMT3**: Amount of bill statement in July, 2005 (NT dollar)
- **BILL_AMT4**: Amount of bill statement in June, 2005 (NT dollar)
- **BILL_AMT5**: Amount of bill statement in May, 2005 (NT dollar)
- **BILL_AMT6**: Amount of bill statement in April, 2005 (NT dollar)
- **PAY_AMT1**: Amount of previous payment in September, 2005 (NT dollar)
- **PAY_AMT2**: Amount of previous payment in August, 2005 (NT dollar)
- **PAY_AMT3**: Amount of previous payment in July, 2005 (NT dollar)
- **PAY_AMT4**: Amount of previous payment in June, 2005 (NT dollar)
- **PAY_AMT5**: Amount of previous payment in May, 2005 (NT dollar)
- **PAY_AMT6**: Amount of previous payment in April, 2005 (NT dollar)
- **default.payment.next.month**: Default payment (1=yes, 0=no)

# Exploratory Data Analysis
The goal of EDA is to uncover patterns, relationships, anomalies, and trends within the dataset. These discoveries provide insights that guide further analysis and decision-making.

## Understand the data
- Get a good understanding of the data, such as the number of observations, features, and data types.
- Identify the target variable (the variable we want to predict) and understand its significance.

### Import Libraries

In [2]:
import pandas as pd 
import numpy as np

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go



In [3]:
import importlib
import sys
sys.path.append("notebooks")

In [4]:
from helpers import functions

In [5]:
# read in the dataset
df = pd.read_csv(r"../data/raw/UCI_Credit_Card.csv")

### Data Size
How many observations does the dataset contain?


In [6]:
# check the shape of the dataset
df.shape # return the number of rows and columns as a tuple
print(f"The dataset has - {df.shape[0]} rows and {df.shape[1]} columns.")

The dataset has - 30000 rows and 25 columns.


### Data Preview
What does the dataset look like? 

In [7]:
# show the first five observations
df.head() # displays the first 5 rows of the dataset by default. 

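| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |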


### Data Types
What type of information is stored in each column? 

In [8]:
# return the data type information of each column
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

The dataset is composed mainly of int64 and float64 columns, and the dataframe uses about 5.7 MB of memory. We can reduce this by choosing a smaller data type for each column, based on the range of values the column actually holds.


In [9]:
# rename the target column, and rename PAY_0 to PAY_1 so the repayment columns are numbered consistently
df = df.rename(columns={
    "default.payment.next.month":"def_pay",
    "PAY_0":"PAY_1"
})

In [10]:
# look at the memory used by each column
df.memory_usage(deep=True)

Index           132
ID           240000
LIMIT_BAL    240000
SEX          240000
EDUCATION    240000
MARRIAGE     240000
AGE          240000
PAY_1        240000
PAY_2        240000
PAY_3        240000
PAY_4        240000
PAY_5        240000
PAY_6        240000
BILL_AMT1    240000
BILL_AMT2    240000
BILL_AMT3    240000
BILL_AMT4    240000
BILL_AMT5    240000
BILL_AMT6    240000
PAY_AMT1     240000
PAY_AMT2     240000
PAY_AMT3     240000
PAY_AMT4     240000
PAY_AMT5     240000
PAY_AMT6     240000
def_pay      240000
dtype: int64

In [11]:
# get the statistical distribution of the dataset.
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 30000.0 | 15000.5 | 8660.398374 | 1.0 | 7500.75 | 15000.5 | 22500.25 | 30000.0 |
| LIMIT_BAL | 30000.0 | 167484.322667 | 129747.661567 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 |
| SEX | 30000.0 | 1.603733 | 0.489129 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| EDUCATION | 30000.0 | 1.853133 | 0.790349 | 0.0 | 1.0 | 2.0 | 2.0 | 6.0 |
| MARRIAGE | 30000.0 | 1.551867 | 0.52197 | 0.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| AGE | 30000.0 | 35.4855 | 9.217904 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 |
| PAY_1 | 30000.0 | -0.0167 | 1.123802 | -2.0 | -1.0 | 0.0 | 0.0 | 8.0 |
| PAY_2 | 30000.0 | -0.133767 | 1.197186 | -2.0 | -1.0 | 0.0 | 0.0 | 8.0 |
| PAY_3 | 30000.0 | -0.1662 | 1.196868 | -2.0 | -1.0 | 0.0 | 0.0 | 8.0 |
| PAY_4 | 30000.0 | -0.220667 | 1.169139 | -2.0 | -1.0 | 0.0 | 0.0 | 8.0 |


From the output above, the majority of numeric columns do not need to be stored as int64. Why? Their values are well within the range of much smaller integer types, so we can store them in a data type closer to the range they actually use. Also, some columns contain no negative values, so an unsigned type is sufficient for them.
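As a rough illustration of the range check described above, the sketch below picks the narrowest integer type that can hold a column's minimum and maximum. The helper `smallest_int_dtype` and the toy values are assumptions made for this example; they are not part of the notebook's `helpers` module.

```python
import numpy as np
import pandas as pd


def smallest_int_dtype(series: pd.Series) -> str:
    """Return the name of the narrowest integer dtype that holds every value (hypothetical helper)."""
    lo, hi = int(series.min()), int(series.max())
    # Unsigned types are only candidates when the column has no negative values.
    candidates = ["uint8", "uint16", "uint32", "uint64"] if lo >= 0 else []
    candidates += ["int8", "int16", "int32", "int64"]
    for name in candidates:
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    raise ValueError("values do not fit in a 64-bit integer")


# Toy columns that mirror the min/max values reported by df.describe() above.
toy = pd.DataFrame({
    "AGE": [21, 34, 79],
    "PAY_1": [-2, 0, 8],
    "ID": [1, 15000, 30000],
})
suggested = {col: smallest_int_dtype(toy[col]) for col in toy.columns}
print(suggested)
```

The check only looks at the values currently in the column, so a type chosen this way could overflow if new data with a wider range is appended later.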

In [12]:
# change to appropriate data type - uint8
df["AGE"] = df["AGE"].astype("uint8")
df["SEX"] = df["SEX"].astype("uint8")
df["EDUCATION"] = df["EDUCATION"].astype("uint8")
df["MARRIAGE"] = df["MARRIAGE"].astype("uint8")
df["def_pay"] = df["def_pay"].astype("uint8")

In [13]:
# change to appropriate data type - int8 (signed, because repayment status can be negative)
temp_list = [1, 2, 3, 4, 5, 6]
for i in temp_list:
    df[f"PAY_{i}"] = df[f"PAY_{i}"].astype("int8")

In [14]:
# change to appropriate data type - float32
temp_list = [1, 2, 3, 4, 5, 6]
for i in temp_list:
    df[f"PAY_AMT{i}"] = df[f"PAY_AMT{i}"].astype("float32")

In [15]:
# check memory usage after datatype changes
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         30000 non-null  int64  
 1   LIMIT_BAL  30000 non-null  float64
 2   SEX        30000 non-null  uint8  
 3   EDUCATION  30000 non-null  uint8  
 4   MARRIAGE   30000 non-null  uint8  
 5   AGE        30000 non-null  uint8  
 6   PAY_1      30000 non-null  int8   
 7   PAY_2      30000 non-null  int8   
 8   PAY_3      30000 non-null  int8   
 9   PAY_4      30000 non-null  int8   
 10  PAY_5      30000 non-null  int8   
 11  PAY_6      30000 non-null  int8   
 12  BILL_AMT1  30000 non-null  float64
 13  BILL_AMT2  30000 non-null  float64
 14  BILL_AMT3  30000 non-null  float64
 15  BILL_AMT4  30000 non-null  float64
 16  BILL_AMT5  30000 non-null  float64
 17  BILL_AMT6  30000 non-null  float64
 18  PAY_AMT1   30000 non-null  float32
 19  PAY_AMT2   30000 non-null  float32
 20  PAY_AM

- Managed to reduce the memory used by the dataframe from 5.7 MB to 2.8 MB, a reduction of roughly 51%.
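As a cross-check of the memory figures, the arithmetic can be reproduced by hand. The sketch below assumes 30,000 rows, that all 25 columns were 8 bytes wide before the conversion (as the per-column `memory_usage` output suggests), and that the column counts per dtype afterwards follow the casting cells above. It ignores the small index overhead.

```python
ROWS = 30_000

# Bytes per value for each dtype involved.
BYTES = {"int64": 8, "float64": 8, "float32": 4, "int8": 1, "uint8": 1}

# Before: 25 columns, each 8 bytes wide (int64 or float64).
before_bytes = 25 * 8 * ROWS

# After: ID stays int64; LIMIT_BAL and the six BILL_AMT columns stay float64;
# five columns become uint8, six become int8, six become float32.
after_columns = {"int64": 1, "float64": 7, "uint8": 5, "int8": 6, "float32": 6}
after_bytes = sum(BYTES[dtype] * n for dtype, n in after_columns.items()) * ROWS

reduction = 1 - after_bytes / before_bytes
print(f"before: {before_bytes / 2**20:.1f} MB, after: {after_bytes / 2**20:.1f} MB")
print(f"reduction: {reduction:.1%}")
```

Under these assumptions the dataframe shrinks to about half its original size.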

### Missing Values
Check for any missing or duplicate values in the dataset.

In [16]:
# check for missing values in the dataset
df.isna().sum()

ID           0
LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_1        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
def_pay      0
dtype: int64

### Create a copy of dataset

In [17]:
df_mod = df.copy()

In [18]:
# check for data duplicates
print(f"Number of duplicate values {df_mod.duplicated().sum()}.")

Number of duplicate values 0.


### Correlation Analysis
An important check: how does each variable in the dataset relate to the other variables?

**NB** 

The default metric used to calculate the correlation matrix is the **Pearson correlation**. A more appropriate metric could be considered for the following variable combinations:
- continuous data vs categorical data (numerically encoded)

Pearson's correlation coefficient describes the linear relationship between two quantitative variables and makes the following assumptions about the data:
- Both variables are on either an interval or ratio scale.
- The data is normally distributed and has no outliers.
- There is a linear relationship between the variables.
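One rank-based alternative that relaxes the linearity and normality assumptions is the Spearman correlation, which pandas exposes through the `method` argument of `DataFrame.corr`. The sketch below uses made-up toy data (not the credit card dataset) to show how the two metrics differ on a monotonic but non-linear relationship. It is offered as an option to consider, not as the metric this notebook uses.

```python
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association; Spearman measures monotonic association on ranks.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, the Spearman coefficient is 1, while the Pearson coefficient is lower.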

In [19]:
# calculate the correlation matrix
corr = df_mod.corr()
# extract the target variable - def_pay
next_month_default_corr = corr["def_pay"].sort_values()
next_month_default_corr

LIMIT_BAL   -0.153520
PAY_AMT1    -0.072929
PAY_AMT2    -0.058579
PAY_AMT4    -0.056827
PAY_AMT3    -0.056250
PAY_AMT5    -0.055124
PAY_AMT6    -0.053183
SEX         -0.039961
MARRIAGE    -0.024339
BILL_AMT1   -0.019644
BILL_AMT2   -0.014193
BILL_AMT3   -0.014076
ID          -0.013952
BILL_AMT4   -0.010156
BILL_AMT5   -0.006760
BILL_AMT6   -0.005372
AGE          0.013890
EDUCATION    0.028006
PAY_6        0.186866
PAY_5        0.204149
PAY_4        0.216614
PAY_3        0.235253
PAY_2        0.263551
PAY_1        0.324794
def_pay      1.000000
Name: def_pay, dtype: float64

In [20]:
# plot the correlation heatmap
fig = px.imshow(corr, text_auto=True)
fig.update_layout(
    height=900,
    width=950
)
fig.show()

Extract the relationship between each independent variable and the target variable. 
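The correlations with the target can also be extracted by name rather than by position. The sketch below uses `DataFrame.corrwith` on a small made-up frame; the column names follow the dataset, but the values are illustrative only.

```python
import pandas as pd

# Toy frame with two features and the target column.
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 50000, 30000],
    "PAY_1": [2, -1, 0, 0, -1, 3],
    "def_pay": [1, 1, 0, 0, 0, 1],
})

# Correlate every feature with the target; the target is dropped by name,
# so the result does not depend on where it ends up after sorting.
target_corr = (
    toy.drop(columns="def_pay")
       .corrwith(toy["def_pay"])
       .sort_values()
)
print(target_corr)
```

Dropping `def_pay` explicitly avoids relying on the target always sorting last, which is what a positional slice such as `[:-1]` assumes.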

In [21]:
fig = go.Figure()
fig.add_trace(
    go.Bar(x=next_month_default_corr[:-1].index, y= next_month_default_corr[:-1].values, text=next_month_default_corr[:-1].values)
)
fig.update_layout(
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Default Correlation Distribution <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      xaxis_title="Independent Variables",
      showlegend=False
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

The following are derived from the correlation plot above:
- The repayment status variables (`PAY_1` to `PAY_6`) have the strongest correlation with the target variable, ranging from `0.19` to `0.32`. The correlation decreases as we move further back in time, so whether a client defaults seems most closely related to their most recent repayment status.

- The limit balance has the strongest negative correlation with the client's default status (`-0.15`).


## Univariate Graphical Analysis

- examination and exploration of the individual variables in dataset.
- generate summary statistics, visualisations to understand the distribution and characteristics of specific variables.

### Count Plot and Histograms
- count the occurrence of each category in a categorical variable.

**Default Status**

In [22]:
# create default_payment series
next_month_default = df_mod["def_pay"].value_counts()
functions.create_count_plot(next_month_default, title_text="Default Next Month Distribution <br>", xaxis_title="Target Variable <br> 0=No Default, 1=Default")

In [23]:
# calculate the percentage of defaulters in the dataset
default_percent = (next_month_default[1]/(next_month_default.sum()))*100
print(f"The percentage of defaulters - {default_percent}%")

The percentage of defaulters - 22.12%


From the result above, the dataset is imbalanced: `78%` of the instances are **non-defaulters**.
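Both class shares can be read off in one step with `value_counts(normalize=True)`. The sketch below rebuilds a target column from the class totals implied by the grouped counts later in this notebook (23,364 non-defaulters and 6,636 defaulters). The reconstructed series is an assumption used only for illustration.

```python
import pandas as pd

# Rebuild a target column with the same class totals as the dataset.
def_pay = pd.Series([0] * 23364 + [1] * 6636, name="def_pay")

# normalize=True returns proportions instead of raw counts.
class_share = def_pay.value_counts(normalize=True).mul(100).round(2)
print(class_share)
```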

**Gender**

In [24]:
# get the SEX variable and count the number of each category in the column - there are only 2 - Male and Female
# in the dataset
sex_count = df_mod["SEX"].value_counts()
functions.create_count_plot(sex_count, title_text="Sex Distribution <br>", xaxis_title="Sex <br> 1=Male 2=Female")

The figure above shows that there are **more females than males** in this dataset: `60%` of the clients are female.

**Education**

The education variable has six coded categories (from the data description above):
- 1: graduate school
- 2: university
- 3: high school
- 4: others
- 5: unknown
- 6: unknown

In [25]:
# count the instances of each category in this variable

edu_count = df_mod["EDUCATION"].value_counts()
functions.create_count_plot(edu_count, title_text="Education Distribution <br>", xaxis_title="Education <br>1=graduate school,2=university,3=high school,<br> 4=others 5=unknown, 6 = unknown")

The education variable has two classes that represent the same thing (5 and 6, both unknown), so we will combine them: all values of 6 will be changed to 5. There is also a 0 class, and since we don't know what it represents, it will also be assigned to unknown (5).

In [26]:
# change all values that are 6 to 5
df_mod["EDUCATION"] = df_mod["EDUCATION"].where(df_mod["EDUCATION"]!=6, other=5)
# change all values that are 0 to 5 
df_mod["EDUCATION"] = df_mod["EDUCATION"].where(df_mod["EDUCATION"]!=0, other=5)
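
The same recoding could also be written as a single mapping with `Series.replace`. This is only an alternative sketch on a toy series, not the approach the notebook uses.

```python
import pandas as pd

# Toy series containing every EDUCATION code seen in the data, including 0 and 6.
education = pd.Series([1, 2, 3, 4, 5, 6, 0], dtype="uint8")

# Map both undocumented/duplicate codes onto 5 (unknown) in one call.
education_clean = education.replace({0: 5, 6: 5})
print(education_clean.tolist())
```

A dictionary keeps all the recoding rules in one place, which can be easier to audit than a sequence of `where` calls.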


In [27]:
edu_count = df_mod["EDUCATION"].value_counts()
functions.create_count_plot(edu_count, title_text="Education Distribution <br>", xaxis_title="Education <br>1=graduate school,2=university,3=high school,<br> 4=others 5=unknown")

The largest group of clients in this dataset has a university background (`47%`), followed by clients with a graduate school background (`35%`); together these two groups make up `82%` of the dataset. Clients with only a high school background account for `16%`. The classes **others and unknown** together make up less than `2%` of the dataset.

**Marriage**

The marriage variable has three categories (from the data description above):
- 1: married
- 2: single
- 3: other

However, we see `0` in the dataset, so we will reassign all occurrences of `0` to `3`.

In [28]:
# change all values of 0 to 3
df_mod["MARRIAGE"] = df_mod["MARRIAGE"].where(df_mod["MARRIAGE"]!=0, other=3 )

In [29]:
# count categories in the Marriage column
marriage_count = df_mod["MARRIAGE"].value_counts()
functions.create_count_plot(marriage_count, title_text="Marriage Distribution <br>", xaxis_title="Marriage <br> 1=Married, 2=Single, 3=Other")

The marriage status of the clients is composed as follows:
- Married - `46%`
- Single - `53%`
- Others - `1%`

**Repayment Status**

These columns indicate the clients' repayment status in each month between April and September 2005.

To visualise this, we will create subplots by iterating over the columns.

In [30]:
# count the occurrences of each status level per repayment column (PAY_1 = September ... PAY_6 = April)
repayment_status_list = [df_mod[f"PAY_{num}"].value_counts() for num in range(1, 7)]


In [31]:
repayment_status_list[1].value_counts().index

Index([15730, 6050, 3927, 3782, 326, 99, 28, 25, 20, 12, 1], dtype='int64', name='count')

In [32]:
# create subplots
fig = make_subplots(2,3,
subplot_titles=(
    "Repayment Status <br> September",
    "Repayment Status August",
    "Repayment Status July",
    "Repayment Status June",
    "Repayment Status May",
    "Repayment Status April"
))
# create months array
months = np.array(["September","August","July","June","May","April"])
c=-1 # variable used to iterate through the repayment status list
for row_ in [1,2]:
    for col_ in [1,2,3]:
        c+=1
        fig.add_trace(
            go.Bar(
                name=f"Repayment Status in - {months[c]}",
                x=repayment_status_list[c].index,
                y=repayment_status_list[c].values,
                marker_color = 'steelblue',
            ),
            row=row_,
            col=col_,
        )
fig.update_layout(
      barmode='relative',
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Repayment Status between April and September 2005, on a scale of payment delay in months. <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      xaxis_title="Repayment Status",
      showlegend=True
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

From the client repayment status visualisation above, the following observations are made:
- The majority of the clients have a repayment status level of 0 from April to September, on average `53%`.
- As expected, the majority of the clients are not behind on their payments. **Remember, the dataset is imbalanced towards non-defaulters**.
- There is a `10%` reduction in clients with a 0 level between April and September.
- `September` was the month with the highest number of clients who were behind on their payments.

However, `0` and `-2` are not included in the data description.

We can do some processing and make the following changes:
- change observations with -2 to -1
- change observations with 0 to -1
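The two proposed changes could be sketched as follows. The notebook as shown does not apply this recoding (the later aggregations still contain `-2` and `0`), so this is only an illustration on a toy frame. The column filter is an assumption that the repayment status columns are the `PAY_` columns that are not `PAY_AMT` columns.

```python
import pandas as pd

# Toy frame: two repayment status columns and one payment amount column.
toy = pd.DataFrame({
    "PAY_1": [-2, -1, 0, 1, 2],
    "PAY_2": [0, 0, -2, 3, -1],
    "PAY_AMT1": [0, 689, 0, 1000, 0],
})

# Select only the repayment status columns, leaving the PAY_AMT columns untouched.
pay_cols = [c for c in toy.columns if c.startswith("PAY_") and not c.startswith("PAY_AMT")]

recoded = toy.copy()
# Fold the undocumented levels -2 and 0 into -1 (pay duly).
recoded[pay_cols] = recoded[pay_cols].replace({-2: -1, 0: -1})
print(recoded)
```

Merging the three levels discards whatever distinction the original coding carried, so it is worth comparing results with and without the recoding before committing to it.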

 


**Age**

In [33]:
# get the age of each client and store as a pandas Series
age_count = df_mod["AGE"]
# plot the age distribution with a histogram
functions.create_histogram_plot(age_count, title_text="Age Distribution <br>", xaxis_title="Age")

The distribution of the clients' ages shows that the majority of the clientele is not older than 45. The youngest age is `21` and the oldest is `79`.

**Limit Balance**

In [34]:
# get the limit balances of the clients 
lim_balance = df_mod["LIMIT_BAL"]
# plot the distribution of the limit balance
functions.create_histogram_plot(lim_balance, title_text="Limit Balance Distribution")

The limit balance distribution shows that the majority of the clients have a maximum credit amount of up to 0.4M New Taiwan Dollars. The largest limit in the dataset is 1M New Taiwan Dollars.

**Bill Statement Amount**

In [35]:

# create list to store each bill statement amount column
bill_amt_list = [df_mod[f"BILL_AMT{num}"] for num in np.array([1,2,3,4,5,6])] # list comprehension


In [36]:
# see the format of the series
bill_amt_list[1]

0          3102.0
1          1725.0
2         14027.0
3         48233.0
4          5670.0
           ...   
29995    192815.0
29996      1828.0
29997      3356.0
29998     78379.0
29999     48905.0
Name: BILL_AMT2, Length: 30000, dtype: float64

*I want to create a 2 × 3 grid of subplots (2 rows and 3 columns), and to do this I have to iterate through the list created above. The plots need to be positioned as follows:*

- list item 0, row 1, col 1
- list item 1, row 1, col 2
- list item 2, row 1, col 3
- ...
- list item 5, row 2, col 3

*I am using a nested for loop and an incrementing counter, `c`, which starts at -1, to accomplish this.*
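An equivalent way to get the same positions, without a separate counter, is to derive the row and column from the list index with `divmod`. This is an alternative sketch, not the approach used in the cells below.

```python
n_cols = 3
months = ["September", "August", "July", "June", "May", "April"]

positions = []
for i, month in enumerate(months):
    # divmod maps the flat index to zero-based (row, col); subplot grids are 1-based.
    row, col = divmod(i, n_cols)
    positions.append((month, row + 1, col + 1))

print(positions)
```

Each tuple gives the `row` and `col` arguments that would be passed to `fig.add_trace` for the corresponding list item.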

In [37]:
# create subplots
fig = make_subplots(2,3,
subplot_titles=(
    "Bill Amount September",
    "Bill Amount August",
    "Bill Amount July",
    "Bill Amount June",
    "Bill Amount May",
    "Bill Amount April"
))
# create months array
months = np.array(["September","August","July","June","May","April"])
c=-1 # variable used to iterate through the bill amount list
for row_ in [1,2]:
    for col_ in [1,2,3]:
        c+=1
        fig.add_trace(
            go.Histogram(
                name=f"Bill Amount in - {months[c]}",
                x=bill_amt_list[c].values,
                marker_color = "steelblue",
            ),
            row=row_,
            col=col_,
        )
fig.update_layout(
      barmode="relative",
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Bill Statement Amount between April and September 2005. <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      showlegend=False
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

The distribution of the bill amount on each client's statement is similar for each month between April and September.
- In June, May, and April, the credit issuer owes some clients money (negative bill amounts), but by September it seems that these clients have used up the negative balance.

**Pay Amount**

In [38]:

# create list to store the previous payment amounts between April and September
pay_amt_list = [df_mod[f"PAY_AMT{num}"] for num in np.array([1,2,3,4,5,6])] # list comprehension


In [39]:
# create subplots
fig = make_subplots(2,3,
subplot_titles=(
    "Previous Payment Amount - September",
    "Previous Payment Amount - August",
    "Previous Payment Amount - July",
    "Previous Payment Amount - June",
    "Previous Payment Amount - May",
    "Previous Payment Amount - April"
))
# create months array
months = np.array(["September","August","July","June","May","April"])
c=-1 # variable used to iterate through the payment amount list
for row_ in [1,2]:
    for col_ in [1,2,3]:
        c+=1
        fig.add_trace(
            go.Histogram(
                name=f"Payment in - {months[c]}",
                x=pay_amt_list[c].values,
                marker_color = "steelblue",
            ),
            row=row_,
            col=col_,
        )
fig.update_layout(
      barmode='relative',
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Previous Payment Amount Between April and September 2005. <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      showlegend=False
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

## Multivariate Graphical Analysis

The following plots visualise how the target variable relates to each of the other variables.


This section will be using chaining syntax. Visit the link below to learn more about Pandas Chaining.

[Pandas Chaining](https://practicaldatascience.co.uk/data-science/how-to-use-method-chaining-in-pandas)
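As a small illustration of the chaining syntax, the sketch below performs the same aggregation twice on a made-up frame: once with intermediate variables and once as a single chain wrapped in parentheses. The toy values are assumptions; only the column names follow the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "SEX": [1, 1, 2, 2, 2, 1],
    "def_pay": [0, 1, 0, 0, 1, 0],
})

# Step by step, with intermediate variables.
grouped = toy.groupby(["SEX", "def_pay"])["ID"]
counts = grouped.count()
stepwise = counts.unstack()

# The same operations as one chain; the outer parentheses allow one method per line.
chained = (
    toy
    .groupby(["SEX", "def_pay"])["ID"]
    .count()
    .unstack()
)
print(chained)
```

Both versions produce the same table; the chained form avoids naming intermediate results that are never reused.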

### Count Plots


**Gender**

In [40]:
# group dataset by gender and default status
gender_target = (df_mod
                 .groupby(["SEX","def_pay"])["ID"]
                 .count()
                 .unstack()
                 )
# display aggregation
gender_target

| SEX | def_pay = 0 | def_pay = 1 |
|---|---|---|
| 1 | 9015 | 2873 |
| 2 | 14349 | 3763 |


In [41]:
# create multivariate count plot to show relationship between gender and default status
functions.create_multivariate_count_plot(gender_target, title_text="Relationship between Gender and Default Status", xaxis_title="Sex <br> 1=Male, 2=Female")

From the figure above, the following are derived:

- `21%` of female clients have defaulted on their credit loans.
- `24%` of male clients have defaulted on their credit loans.

The majority of defaulters are female; however, note that this dataset has **more female clients than male clients**, and the default rate is actually higher among male clients.
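The within-group default rates can be computed directly from the grouped counts by dividing each row by its total. The sketch below re-enters the gender counts from the table above by hand, so it runs on its own.

```python
import pandas as pd

# Counts copied from the gender aggregation above (rows: SEX, columns: def_pay).
gender_counts = pd.DataFrame(
    {0: [9015, 14349], 1: [2873, 3763]},
    index=pd.Index([1, 2], name="SEX"),
)
gender_counts.columns.name = "def_pay"

# Divide each row by its row total to get the share of each outcome within the group.
gender_rates = gender_counts.div(gender_counts.sum(axis=1), axis=0).mul(100).round(1)
print(gender_rates)
```

The same row-wise division applies to any of the unstacked count tables in this section, such as `marriage_target` or `education_target`.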

**Marriage**

In [42]:
# group dataset by marriage and default status
marriage_target = (df_mod
                 .groupby(["MARRIAGE","def_pay"])["ID"]
                 .count()
                 .unstack()
                 )
# display aggregation
marriage_target

| MARRIAGE | def_pay = 0 | def_pay = 1 |
|---|---|---|
| 1 | 10453 | 3206 |
| 2 | 12623 | 3341 |
| 3 | 288 | 89 |


In [43]:
# create multivariate count plot to show relationship between marriage and default status
functions.create_multivariate_count_plot(marriage_target, title_text="Relationship between Marriage and Default Status", xaxis_title="Marriage Status <br> 1=Married, 2=Single, 3=Others")

From the figure above, the following are derived:
- `24%` of the clients who are neither single nor married are defaulters.
- `21%` of the single clients are defaulters - this is the largest class in the `marriage status` variable.
- `23%` of the married clients are defaulters.

Although clients who are neither single nor married have the largest defaulting percentage, it is more likely to come across a single client who is a defaulter because single clients are the largest sub-group in this categorical variable.


**Education**

In [44]:
# group dataset by education and default status
education_target = (df_mod
                 .groupby(["EDUCATION","def_pay"])["ID"]
                 .count()
                 .unstack()
                 )
# display aggregation
education_target

| EDUCATION | def_pay = 0 | def_pay = 1 |
|---|---|---|
| 1 | 8549 | 2036 |
| 2 | 10700 | 3330 |
| 3 | 3680 | 1237 |
| 4 | 116 | 7 |
| 5 | 319 | 26 |


In [45]:
# create multivariate count plot to show relationship between education and default status
functions.create_multivariate_count_plot(education_target, title_text="Relationship between Education and Default Status", xaxis_title="Education <br>1=graduate school,2=university,3=high school,<br> 4=others 5=unknown")

From the figure above, the following are derived:
- `25%` of clients with a high school education were defaulters.
- `24%` of clients with a university education (presumably undergraduate) were defaulters.
- `19%` of clients with a graduate school education were defaulters.

The two remaining classes, which provide no information about the client's education level, are as follows:
- `others` - makes up `0.4%` of the clients in the dataset and `6%` were defaulters.
- `unknown` - makes up `1%` of the clients and `8%` were defaulters.

**Age**

To have a more granular look at the relationship between age and default status, age bins will be created and then we will visualise how age relates to default status. 


In [46]:
# create bins for the age
age_bins = [20, 30, 40, 50, 60, 70, 80]
# represent each bin
age_bins_str_list = ["21-30","31-40","41-50","51-60","61-70","71-80"]

# create a column in the dataframe that bins the age values into the discrete values from age_bins
df_mod["AGE_BINS"] = pd.cut(x=df_mod["AGE"], bins=age_bins, labels=age_bins_str_list, right=True)
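
To make the bin boundaries concrete, the sketch below applies the same `pd.cut` call to a few hand-picked ages. With `right=True` each interval includes its right edge, so 30 falls in `21-30` and 40 falls in `31-40`. The sample ages are chosen only for illustration.

```python
import pandas as pd

ages = pd.Series([21, 30, 31, 40, 79])
age_bins = [20, 30, 40, 50, 60, 70, 80]
labels = ["21-30", "31-40", "41-50", "51-60", "61-70", "71-80"]

# right=True produces the intervals (20, 30], (30, 40], ..., (70, 80].
binned = pd.cut(ages, bins=age_bins, labels=labels, right=True)
print(binned.tolist())
```

Starting the first bin at 20 rather than 21 matters: the left edge is excluded, so a first edge of 21 would leave 21-year-old clients unbinned.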

In [47]:
# group dataset by client age and default status
age_target = (df_mod
              .groupby(["AGE_BINS","def_pay"])["ID"]
              .count()
              .unstack()
              )
# display result of aggregation
age_target





| AGE_BINS | def_pay = 0 | def_pay = 1 |
|---|---|---|
| 21-30 | 8542 | 2471 |
| 31-40 | 8524 | 2189 |
| 41-50 | 4606 | 1399 |
| 51-60 | 1493 | 504 |
| 61-70 | 189 | 68 |
| 71-80 | 10 | 5 |


In [48]:
# create multivariate count plot to show relationship between age and default status
functions.create_multivariate_count_plot(age_target, title_text="Relationship between Age and Default Status", xaxis_title="Age")

From the figure above, the majority of the bank's clients are in two age groups, `21-30` and `31-40`, and this is where most of the defaulters are found. Interestingly, however, the age group with the fewest clients, `71-80`, has the highest percentage of defaulters, `33%`, although this group contains only 15 clients. Broadly, the percentage of defaulters decreases as we go from the least represented to the most represented age groups.
The proportion of defaulters for each age group is as follows:
- `21-30`: `22%`
- `31-40`: `20%`
- `41-50`: `23%`
- `51-60`: `25%`
- `61-70`: `26%`
- `71-80`: `33%`


**Limit Balance**

The same binning process used for age is applied to the limit balance.

In [49]:
# create bins for the limit balance
lim_bal_bins = [5000,10000, 50000, 100000, 150000, 200000, 500000, 1000000]
# bin the limit balance values into discrete intervals
df_mod["LIMIT_BINS"] = pd.cut(x= df_mod["LIMIT_BAL"], bins = lim_bal_bins, right=True)

# df_mod["LIMIT_BINS"] # uncomment to view data format before conversion

# convert to str type
df_mod["LIMIT_BINS"] = df_mod["LIMIT_BINS"].astype("str")

In [50]:
# group limit balance by default status
lim_bal_target = (df_mod
.groupby(["LIMIT_BINS","def_pay"])["ID"]
.count()
.unstack()
)
# reorder the index
lim_bins_order = ["(5000, 10000]", "(10000, 50000]","(50000, 100000]","(100000, 150000]","(150000, 200000]","(200000, 500000]","(500000, 1000000]"]
# change the order of the index
lim_bal_target = lim_bal_target.reindex(index=lim_bins_order)



In [51]:
# create multivariate count plot to show relationship between limit balance and default status
functions.create_multivariate_count_plot(lim_bal_target, title_text="Relationship between Limit Balance and Default Status", xaxis_title="Limit Balance Bins")

The figure above shows the default status distribution of the bank's clients per credit limit balance group. The smallest credit balance group, `5000 to 10000`, has the highest proportion of defaulters, `40%`. The largest share of the bank's clients, `31%`, falls in the `200,000 - 500,000` credit limit group, and `15%` of that group are defaulters. The largest number of defaulters belongs to the group with a credit balance of `10,000 - 50,000`.

**Repayment Status**

In [52]:
# create a list to store each repayment status column aggregation
repay_status_list = []
# aggregate each repayment status column by default_status
for month in np.array([1,2,3,4,5,6]):
    # perform aggregation
    temp_df = (df_mod.
    groupby([f"PAY_{month}","def_pay"])["ID"]
    .count()
    .unstack()
    )
    repay_status_list.append(temp_df)

In [53]:
repay_status_list[1]

| PAY_2 | def_pay = 0 | def_pay = 1 |
|---|---|---|
| -2 | 3091.0 | 691.0 |
| -1 | 5084.0 | 966.0 |
| 0 | 13227.0 | 2503.0 |
| 1 | 23.0 | 5.0 |
| 2 | 1743.0 | 2184.0 |
| 3 | 125.0 | 201.0 |
| 4 | 49.0 | 50.0 |
| 5 | 10.0 | 15.0 |
| 6 | 3.0 | 9.0 |
| 7 | 8.0 | 12.0 |

In [54]:
fig = make_subplots(2,3,
                   subplot_titles =(
                    "Repayment Status - September",
                    "Repayment Status - August",
                    "Repayment Status - July",
                    "Repayment Status - June",
                    "Repayment Status - May",
                    "Repayment Status - April",
                   )
)
# create colours dictionary
colors = {'A':'steelblue',
        'B':'firebrick'}
months = np.array(["September","August","July","June","May","April"])
c=-1
for row_ in [1,2]:
    for col_ in [1,2,3]:
        c+=1
        fig.add_trace(
            go.Bar(name=f"Repayment Status- {months[c]} - no default", x=repay_status_list[c].index, y=repay_status_list[c][0].values, marker_color=colors["A"]), row=row_, col=col_
        )
        fig.add_trace(
            go.Bar(name=f"Repayment Status - {months[c]} - default", x=repay_status_list[c].index, y=repay_status_list[c][1].values, marker_color=colors["B"]), row=row_, col=col_
        )
fig.update_layout(
      barmode='relative',
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Repayment Status Between April and September 2005 (scale: months of payment delay). <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      showlegend=False
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

The figure above shows the relationship between repayment status and default status across the levels used to describe clients' repayment behaviour. The following observations can be derived from the figure:
- For each month, status level `0` has the largest number of defaulters.
- The distribution is similar across all months.
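Raw counts and default rates can tell different stories, so it may help to look at both. The sketch below copies the August (`PAY_2`) counts from the table printed earlier and computes the share of defaulters within each status level. The names `counts` and `default_rate` are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the PAY_2 (August) table shown earlier in this section.
counts = pd.DataFrame(
    {
        0: [3091, 5084, 13227, 23, 1743, 125, 49, 10, 3, 8],  # no default
        1: [691, 966, 2503, 5, 2184, 201, 50, 15, 9, 12],     # default
    },
    index=pd.Index([-2, -1, 0, 1, 2, 3, 4, 5, 6, 7], name="PAY_2"),
)

# Share of defaulters within each status level (row-wise proportion).
default_rate = counts[1] / counts.sum(axis=1)
print(default_rate.round(3))
```

For August, status `0` holds the most defaulters in absolute terms, but only about 16% of its clients default, whereas roughly 56% of clients with status `2` do.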



**Bill Statement Amount**

In [55]:
# create a list to store each bill statement amount column aggregation
bill_amount_list = []
# aggregate each bill statement amount column by default_status
for month in np.array([1,2,3,4,5,6]):
    # perform aggregation
    temp_df = (df_mod.
    groupby([f"BILL_AMT{month}","def_pay"])["ID"]
    .count()
    .unstack()
    )
    bill_amount_list.append(temp_df)

In [56]:
bill_amount_list[1]

def_pay,0,1
BILL_AMT2,Unnamed: 1_level_1,Unnamed: 2_level_1
-69777.0,1.0,
-67526.0,1.0,
-33350.0,1.0,
-30000.0,1.0,
-26214.0,1.0,
...,...,...
624475.0,1.0,
646770.0,1.0,
671563.0,1.0,
743970.0,1.0,


In [57]:
fig = make_subplots(2,3,
                   subplot_titles =(
                    "Bill Statement Amount - September",
                    "Bill Statement Amount - August",
                    "Bill Statement Amount - July",
                    "Bill Statement Amount - June",
                    "Bill Statement Amount - May",
                    "Bill Statement Amount - April",
                   )
)
# create colours dictionary
colors = {'A':'steelblue',
        'B':'firebrick'}
months = np.array(["September","August","July","June","May","April"])
c=-1
for row_ in [1,2]:
    for col_ in [1,2,3]:
        c+=1
        # plot the bill amounts themselves; BILL_AMT columns are numbered 1-6, hence c+1
        fig.add_trace(
            go.Histogram(name=f"Bill Statement Amount - {months[c]} - no default", x=df_mod.loc[df_mod["def_pay"] == 0, f"BILL_AMT{c+1}"], marker_color=colors["A"]), row=row_, col=col_
        )
        fig.add_trace(
            go.Histogram(name=f"Bill Statement Amount - {months[c]} - default", x=df_mod.loc[df_mod["def_pay"] == 1, f"BILL_AMT{c+1}"], marker_color=colors["B"]), row=row_, col=col_
        )
fig.update_layout(
      barmode='relative',
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Bill Statement Amount Between April and September 2005. <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      showlegend=False
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

**Previous Payment**

In [58]:
# create a list to store each previous payment amount column aggregation
prev_pay_list = []
# aggregate each previous payment amount column by default_status
for month in np.array([1,2,3,4,5,6]):
    # perform aggregation
    temp_df = (df_mod.
    groupby([f"PAY_AMT{month}","def_pay"])["ID"]
    .count()
    .unstack()
    )
    prev_pay_list.append(temp_df)

In [59]:
fig = make_subplots(2,3,
                   subplot_titles =(
                    "Previous Payment Amount - September",
                    "Previous Payment Amount - August",
                    "Previous Payment Amount - July",
                    "Previous Payment Amount - June",
                    "Previous Payment Amount - May",
                    "Previous Payment Amount - April",
                   )
)
# create colours dictionary
colors = {'A':'steelblue',
        'B':'firebrick'}
months = np.array(["September","August","July","June","May","April"])
c=-1
for row_ in [1,2]:
    for col_ in [1,2,3]:
        c+=1
        # plot the payment amounts themselves; PAY_AMT columns are numbered 1-6, hence c+1
        fig.add_trace(
            go.Histogram(name=f"Previous Payment - {months[c]} - no default", x=df_mod.loc[df_mod["def_pay"] == 0, f"PAY_AMT{c+1}"], marker_color=colors["A"]), row=row_, col=col_
        )
        fig.add_trace(
            go.Histogram(name=f"Previous Payment - {months[c]} - default", x=df_mod.loc[df_mod["def_pay"] == 1, f"PAY_AMT{c+1}"], marker_color=colors["B"]), row=row_, col=col_
        )
fig.update_layout(
      barmode='relative',
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Previous Payment Amount Between April and September 2005. <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      showlegend=False
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

## Modelling

This is a `binary classification problem`: the aim is to build a model that assigns each client to one of two distinct groups, `defaulter or non-defaulter`. Because the model is trained on a dataset that contains the ground truth for each client, it is also a `supervised machine learning problem`.

The two distinct groups can be reworded as follows:
- `positive` outcome: defaulter
- `negative` outcome: non-defaulter

### Evaluation of binary classification models
The following four outcomes describe how a model's prediction compares with the ground truth:

- **True Positive (TP)**: successful classification of a defaulting client as a defaulter.
- **True Negative (TN)**: successful classification of a non-defaulting client as a non-defaulter. 
- **False Positive (FP)**: incorrect classification of a non-defaulting client as a defaulter.
- **False Negative (FN)**: incorrect classification of a defaulting client as a non-defaulter. 

There are several methods that perform binary classification and the most common are:
- Support vector machines
- Naive Bayes
- Nearest Neighbour
- Decision Trees
- Logistic Regression
- Neural Networks

This notebook compares the performance of three of the methods listed above. Remember that we have an **imbalanced** dataset, so for most of the methods we have to employ techniques that help our model generalize well when it encounters the under-represented category, `defaulters`.

Before jumping into the modelling head first, we must understand that this is a classic imbalanced dataset. The following is a rough outline of how to approach imbalanced datasets ([Handling Imbalanced Data](https://www.svds.com/learning-imbalanced-classes/)):
- Do nothing and use the dataset's natural distribution.
- Balance the training set using either oversampling or undersampling techniques.
- Use an anomaly detection framework.
- Adjust the class weights or the decision threshold.
- Construct a new algorithm that performs well on imbalanced data.
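Two of the options above, class weights and the decision threshold, can be sketched in a few lines. This is a minimal illustration on synthetic data generated with `make_classification`, not the notebook's dataset; the 80/20 class ratio and the `0.3` threshold are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (about 80% negatives), a stand-in for the real features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Option 1: reweight classes inversely to their frequency.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: keep the default weights but lower the decision threshold.
plain = LogisticRegression(max_iter=1000).fit(X, y)
proba = plain.predict_proba(X)[:, 1]          # estimated probability of class 1
pred_default = (proba >= 0.5).astype(int)     # the usual 0.5 cut-off
pred_lowered = (proba >= 0.3).astype(int)     # assumed, more permissive cut-off

# Number of samples flagged as positive under each approach.
print(pred_default.sum(), pred_lowered.sum(), weighted.predict(X).sum())
```

Lowering the threshold can only add positive predictions, which typically trades precision for recall on the minority class.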

In [60]:
# create a copy of the modified_dataset
model_df = df_mod.copy()

In [61]:
model_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 27 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   ID          30000 non-null  int64   
 1   LIMIT_BAL   30000 non-null  float64 
 2   SEX         30000 non-null  uint8   
 3   EDUCATION   30000 non-null  uint8   
 4   MARRIAGE    30000 non-null  uint8   
 5   AGE         30000 non-null  uint8   
 6   PAY_1       30000 non-null  int8    
 7   PAY_2       30000 non-null  int8    
 8   PAY_3       30000 non-null  int8    
 9   PAY_4       30000 non-null  int8    
 10  PAY_5       30000 non-null  int8    
 11  PAY_6       30000 non-null  int8    
 12  BILL_AMT1   30000 non-null  float64 
 13  BILL_AMT2   30000 non-null  float64 
 14  BILL_AMT3   30000 non-null  float64 
 15  BILL_AMT4   30000 non-null  float64 
 16  BILL_AMT5   30000 non-null  float64 
 17  BILL_AMT6   30000 non-null  float64 
 18  PAY_AMT1    30000 non-null  float32 
 19  PAY_

### Data Split
The chosen split is as follows (10% is held out for testing, then 20% of the remaining 90% is used for validation):
- Training split - 72%
- Validation split - 18%
- Test split - 10%

In [62]:
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn import svm, tree
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

In [63]:
# select the features
features = model_df.drop(
    labels = ["ID", "AGE_BINS","LIMIT_BINS","def_pay"],
    axis=1
)
# show features first 5 observations
features.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,20000.0,2,2,1,24,2,2,-1,-1,-2,...,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0
1,120000.0,2,2,2,26,-1,2,0,0,0,...,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0
2,90000.0,2,2,2,34,0,0,0,0,0,...,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0
3,50000.0,2,2,1,37,0,0,0,0,0,...,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0
4,50000.0,1,2,1,57,-1,0,-1,0,0,...,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0


In [64]:
# set target variable
target = model_df["def_pay"]
# show first five observations
target.head()

0    1
1    1
2    0
3    0
4    0
Name: def_pay, dtype: uint8

Before splitting the data, let us look at the value range of each feature in the features dataset. There is quite a diverse range of values, and the features with larger magnitudes might influence the results of the machine learning method.

In [65]:
features.describe().transpose()[["min","max"]]

Unnamed: 0,min,max
LIMIT_BAL,10000.0,1000000.0
SEX,1.0,2.0
EDUCATION,1.0,5.0
MARRIAGE,1.0,3.0
AGE,21.0,79.0
PAY_1,-2.0,8.0
PAY_2,-2.0,8.0
PAY_3,-2.0,8.0
PAY_4,-2.0,8.0
PAY_5,-2.0,8.0


In [66]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, stratify=target, test_size=0.1, random_state=100)


The code above splits the features and target into training and test sets. `stratify` ensures that the target variable has the same class proportions in both sets, so the model is trained and evaluated on the same distribution. `test_size=0.1` sets aside `10%` of the dataset for testing.
The next step is to split this training set into the actual training set and a validation set.

In [67]:
# create another split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2, random_state=100)
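Because the second split takes 20% of the remaining 90%, the effective proportions are 72% training, 18% validation and 10% test. The following sketch reproduces the two calls on synthetic labels to check the resulting sizes; the array `y_synth` and its roughly 22% positive rate are assumptions used only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 30000
rng = np.random.default_rng(0)
X_synth = np.arange(n).reshape(-1, 1)            # placeholder feature column
y_synth = (rng.random(n) < 0.22).astype(int)     # assumed ~22% positives

# First split: hold out 10% for testing.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_synth, y_synth, stratify=y_synth, test_size=0.1, random_state=100
)
# Second split: 20% of the remaining 90% becomes the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.2, random_state=100
)

sizes = {"train": len(X_tr), "validation": len(X_va), "test": len(X_te)}
print({name: size / n for name, size in sizes.items()})
# Stratification keeps the positive rate nearly identical in every split.
print([round(part.mean(), 3) for part in (y_tr, y_va, y_te)])
```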

Plot the target distribution for each data set.

In [68]:
fig = make_subplots(3,1)
model_targets = [y_train, y_val, y_test]
model_name = ["Train", "Validation", "Test"]
plot_x = model_targets[0].value_counts().index
for i in range(len(model_targets)):
  plot_y = model_targets[i].value_counts()
  fig.add_trace(
      go.Bar(x=plot_x,y=plot_y, name=f"{model_name[i]} Target Distribution"), row=i+1, col=1
  )
fig.update_layout(
      barmode='relative',
      plot_bgcolor="white",
      height=900,
      width=950,
      margin={
          "l":25,
          "r":25,
          "b":25
      },
      title_text="Target Distribution. <br>",
      title={
          "x":0.5,
          "xanchor":"center",
          "font":{
              "size":14
          }
      },
      xaxis_title="Default Status (0=No, 1=Yes)",
      showlegend=True
  )
fig.update_xaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.update_yaxes(
      showline = True,
      linewidth=1,
      linecolor="black"
  )
fig.show()

From the plot above, the stratification of the target variable worked, since the class proportions are the same across all three sets.

### Logistic Regression

In [None]:
# instantiate a logistic regression model
log_reg = LogisticRegression(random_state=11)
# fit the model to the training data
log_reg.fit(X_train, y_train)

In [70]:
# make predictions on the validation set
log_reg_pred_val = log_reg.predict(X_val)
print(f"Accuracy of logistic regression classifier on validation set: {log_reg.score(X_val,y_val):.2f}")

Accuracy of logistic regression classifier on validation set: 0.78


In [71]:
# create the confusion matrix for this model
conf_matrix_log_reg_val = metrics.confusion_matrix(y_val, log_reg_pred_val)
print(conf_matrix_log_reg_val)

[[4204    2]
 [1193    1]]


In [72]:
functions.plot_confusion_matrix(conf_matrix_log_reg_val, title="Validation Confusion Matrix for Logistic Regression")

From the confusion matrix plot above, out of the 5400 samples, the predictions are:
- 4204 predictions were **true negatives**
- 1 prediction was a **true positive**
- 1193 predictions were **false negatives**
- 2 predictions were **false positives**

#### Evaluation Metrics

We have mentioned the four base units used to calculate the various metrics used to evaluate machine learning predictions. Let us now discuss the confusion matrix and the metrics that will be shown in the classification report.
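As a worked example, the main metrics can be computed by hand from the validation confusion matrix printed above. The formulas are the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of precision and recall.

```python
import numpy as np

# Validation confusion matrix from above: rows = actual class, columns = predicted class.
cm = np.array([[4204, 2], [1193, 1]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of all predicted defaulters, how many really defaulted
recall = tp / (tp + fn)       # of all actual defaulters, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Only one of the 1194 actual defaulters is identified, so recall rounds to `0.00` even though accuracy is `0.78`.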



In [73]:
print(metrics.classification_report(y_val,log_reg_pred_val))
print(f"Precision Score (positive label(Default)): {metrics.precision_score(y_val,log_reg_pred_val):.2f}")
print(f"Recall Score (positive label(Default)): {metrics.recall_score(y_val,log_reg_pred_val):.2f}")

              precision    recall  f1-score   support

           0       0.78      1.00      0.88      4206
           1       0.33      0.00      0.00      1194

    accuracy                           0.78      5400
   macro avg       0.56      0.50      0.44      5400
weighted avg       0.68      0.78      0.68      5400

Precision Score (positive label(Default)): 0.33
Recall Score (positive label(Default)): 0.00


The classifier achieved an accuracy of `78%`, but as expected, it classifies non-defaulting customers well and almost never identifies defaulting customers (recall of `0.00` for the positive class). This is because of the dataset imbalance.
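To see why accuracy is misleading here, compare it with a baseline that always predicts the majority class. The sketch below rebuilds a label vector from the validation support counts in the classification report (4206 non-defaulters, 1194 defaulters) and scores scikit-learn's `DummyClassifier` on it; the placeholder feature matrix `X_dummy` is an assumption, since the baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the validation support counts in the report above.
y_val_like = np.array([0] * 4206 + [1] * 1194)
X_dummy = np.zeros((len(y_val_like), 1))  # placeholder, not used by the baseline

# Always predict the most frequent class (non-defaulter).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_val_like)
baseline_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_val_like, baseline_pred)
baseline_recall = recall_score(y_val_like, baseline_pred)
print(f"accuracy={baseline_accuracy:.2f} recall={baseline_recall:.2f}")
```

A model that never predicts a default reaches the same `0.78` accuracy, which is why precision, recall and F1 on the positive class are more informative for this problem.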

In [None]:
p_grid =  {
    "solver":["lbfgs","liblinear","newton-cg"],
    "class_weight":[{0:x, 1:1-x} for x in np.linspace(0,0.99,15)]
}
# fit the grid search to the training data using 5 stratified folds
grid_search_log_reg = GridSearchCV(
    estimator=log_reg,
    param_grid=p_grid,
    cv = StratifiedKFold(),
    n_jobs = -1,
    scoring="f1",
    verbose=2
).fit(X_train, y_train)

In [109]:
log_reg_grid = grid_search_log_reg.best_estimator_

# check best score
grid_search_log_reg.best_score_

0.511588310125048

In [110]:
# predict on validation
log_reg_grid_val = log_reg_grid.predict(X_val)


In [114]:
# create the confusion matrix
conf_matrix_log_reg_grid = metrics.confusion_matrix(y_val, log_reg_grid_val)
print(conf_matrix_log_reg_grid)
# plot confusion matrix 
functions.plot_confusion_matrix(conf_matrix_log_reg_grid, title="Validation Confusion Matrix for Logistic Regression using Grid Search")


[[3706  500]
 [ 630  564]]


In [113]:
print(metrics.classification_report(y_val, log_reg_grid_val))


              precision    recall  f1-score   support

           0       0.85      0.88      0.87      4206
           1       0.53      0.47      0.50      1194

    accuracy                           0.79      5400
   macro avg       0.69      0.68      0.68      5400
weighted avg       0.78      0.79      0.79      5400



### Decision Tree

In [74]:
# create a decision tree model
dec_tree = tree.DecisionTreeClassifier(random_state=11)
# fit the data to the decision tree
dec_tree = dec_tree.fit(X_train, y_train)
# predict on the validation set
y_pred_tree_val = dec_tree.predict(X_val)

In [75]:
# create the confusion matrix for this model
conf_matrix_tree_val = metrics.confusion_matrix(y_val, y_pred_tree_val)
print(conf_matrix_tree_val)

[[3407  799]
 [ 703  491]]


In [76]:
functions.plot_confusion_matrix(conf_matrix_tree_val, title="Validation Confusion Matrix for Decision Tree")

From the decision tree's confusion matrix, we can see that this model is significantly better at classifying the minority class, defaulting customers, than the untuned logistic regression. The results are as follows:
- 3407 predictions were **true negatives**
- 491 predictions were **true positives**
- 703 predictions were **false negatives**
- 799 predictions were **false positives**

In [77]:
# print classification report using decision tree
print(metrics.classification_report(y_val, y_pred_tree_val))


              precision    recall  f1-score   support

           0       0.83      0.81      0.82      4206
           1       0.38      0.41      0.40      1194

    accuracy                           0.72      5400
   macro avg       0.60      0.61      0.61      5400
weighted avg       0.73      0.72      0.73      5400



In [118]:
# decision tree using GridSearch
p_grid_tree = {
    "criterion": ["gini","entropy", "log_loss"],
    "class_weight": [{0:x,1:1-x} for x in np.linspace(0,0.99,15)],
    "splitter": ["random","best"],
    "max_depth":[1,5,10,15,20,10]
}
# fit the grid search to the training data using 5 stratified folds
grid_search_tree = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=11),
    param_grid=p_grid_tree,
    cv = StratifiedKFold(),
    n_jobs = -1,
    scoring="f1",
    verbose=2
).fit(X_train, y_train)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


In [119]:
# print best score and best estimator
grid_search_tree.best_estimator_
grid_search_tree.best_score_

0.5217794322624143

In [120]:
# get best estimator
tree_grid = grid_search_tree.best_estimator_
# predict with val
tree_grid_val = tree_grid.predict(X_val)

In [121]:
# create confusion matrix for this model
conf_matrix_tree_grid = metrics.confusion_matrix(y_val, tree_grid_val)
# print the conf matrix
print(conf_matrix_tree_grid)

[[3633  573]
 [ 577  617]]


In [122]:
functions.plot_confusion_matrix(conf_matrix_tree_grid, title="Validation Confusion Matrix for Decision Tree using Grid Search")


### Random Forest

In [85]:
# create random forest classifier - default number of estimators
rdm_forest = RandomForestClassifier(n_estimators=100,random_state=11,criterion="entropy")
# train or fit the model with training set
rdm_forest  = rdm_forest.fit(X_train, y_train)
# predict on validation
y_rdm_pred_val = rdm_forest.predict(X_val)

In [79]:
# create the confusion matrix for this model
conf_matrix_rdm_val = metrics.confusion_matrix(y_val, y_rdm_pred_val)
print(conf_matrix_rdm_val)

[[3970  236]
 [ 751  443]]


In [80]:
functions.plot_confusion_matrix(conf_matrix_rdm_val, title="Validation Confusion Matrix for Random Forest Classifier")

In [81]:
print(metrics.classification_report(y_val, y_rdm_pred_val))


              precision    recall  f1-score   support

           0       0.84      0.94      0.89      4206
           1       0.65      0.37      0.47      1194

    accuracy                           0.82      5400
   macro avg       0.75      0.66      0.68      5400
weighted avg       0.80      0.82      0.80      5400




### Ensemble Techniques - Bagging

Bagging is an ensemble meta-estimator that trains base models on random subsets of the original dataset and then aggregates their individual predictions, either by voting or by averaging, to form a final prediction.
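The idea can be illustrated by writing the procedure out by hand. The sketch below is a simplified version under stated assumptions (synthetic data, shallow decision trees as base models, ten bootstrap samples); it is not the `BaggingClassifier` implementation, only the same principle of bootstrap sampling followed by majority voting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, a stand-in for the real training set.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(10):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    base = DecisionTreeClassifier(max_depth=3, random_state=0)
    models.append(base.fit(X_demo[idx], y_demo[idx]))

# One row of predictions per base model, shape (10, 500).
votes = np.stack([m.predict(X_demo) for m in models])
# Majority vote across models; ties are resolved in favour of class 1.
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print(votes.shape, bagged_pred[:10])
```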

In [84]:
# create bagging ensemble model
bagging_model = BaggingClassifier(
    estimator=RandomForestClassifier(
        n_estimators=100,
        criterion="entropy",
        random_state=11
    ),
    n_estimators=10,
    random_state=11
)
# fit 
                                  