## Default of Credit Card Clients Dataset
Default Payments of Credit Card Clients in Taiwan from 2005

ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

SEX: Gender (1=male, 2=female)

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

MARRIAGE: Marital status (1=married, 2=single, 3=others)

AGE: Age in years

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay 
for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)

In [1]:
# sort features

target = "default.payment.next.month"
categorical_features = ["SEX", "EDUCATION", "MARRIAGE"]
ordinal_features = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
numerical_features = ["LIMIT_BAL", "AGE", "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", 
                      "BILL_AMT4", "BILL_AMT5", "BILL_AMT6"]

In [None]:
# setup

# 3rd party
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# stdlib
import os

pd.options.display.max_columns = None

filename = "UCI_Credit_Card.csv"

In [None]:
# utility functions 

def correlation_heatmap(corr_df):
    color = "white"
    plt.rcParams['text.color'] = color
    plt.rcParams['axes.labelcolor'] = color
    plt.rcParams['xtick.color'] = color
    plt.rcParams['ytick.color'] = color
    plt.figure(figsize=(30,35))
    sns.heatmap(corr_df, annot=True, cmap=plt.cm.Reds)
    # plt.show()
    #####
    # fix for mpl bug that cuts off top/bottom of seaborn viz
    b, t = plt.ylim() # discover the values for bottom and top
    b += 0.5 # Add 0.5 to the bottom
    t -= 0.5 # Subtract 0.5 from the top
    plt.ylim(b, t) # update the ylim(bottom, top) values
    plt.show() 
    

def find_correlations(corr_df, threshold):
    relevant = {}
    for column in corr_df.columns:
        # drop the self-correlation (always 1) without modifying corr_df
        row = corr_df.loc[column].drop(column)
        series_above_th = row[row.abs() > threshold]
        for index, value in series_above_th.items():
            if (index+"---"+column) not in relevant: # avoid redundant entries
                relevant[column+"---"+index] = value
    return relevant


def show_correlations(df, correlations):
    for element in correlations:
        (first,second)=element.split("---")
        print(f"{first} and {second} have a pearson correlation of {correlations[element]}")
        plt.figure(figsize=(15,20))
        plt.scatter(df[first], df[second])
        plt.show()
        

def analyze_correlations(df, threshold=0.9):
    corr_df = df.corr(method="pearson")
    correlation_heatmap(corr_df)
    print("\n\n*********************************************")
    print(f"Showing correlations above {threshold} (absolute value)")
    print("*********************************************\n\n")
    show_correlations(df, find_correlations(corr_df, threshold))

In [None]:
# ingest

os.chdir("..")
filepath = os.path.join(os.getcwd(), "data", filename) # portable, avoids the invalid "\d" escape

df = pd.read_csv(filepath)
df.head(3)

In [None]:
# check for NaNs and data types

df.info()

In [None]:
# check for std = 0 and similar

df.describe()
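
The two checks above can also be run programmatically instead of reading the `info()` and `describe()` output by eye. A minimal sketch, assuming a hypothetical helper `quick_checks` and a toy frame standing in for `df` (neither is part of the original notebook):

```python
import pandas as pd


def quick_checks(frame):
    """Return (NaN count per column, list of constant numeric columns)."""
    nan_counts = frame.isna().sum()
    # a numeric column with at most one distinct value has zero standard deviation
    numeric = frame.select_dtypes(include="number")
    constant_cols = [col for col in numeric.columns
                     if numeric[col].nunique(dropna=True) <= 1]
    return nan_counts, constant_cols


# toy data for illustration only, not taken from UCI_Credit_Card.csv
toy = pd.DataFrame({"a": [1.0, 2.0, None], "b": [5, 5, 5], "c": [1, 2, 3]})
nan_counts, constant_cols = quick_checks(toy)
print(nan_counts[nan_counts > 0])  # columns that contain NaNs
print(constant_cols)               # columns with std = 0
```

On the real data the call would be `quick_checks(df)`.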

## Visualization - Setup

In [None]:
color = "white"
plt.rcParams["text.color"] = color
plt.rcParams["axes.labelcolor"] = color
plt.rcParams["xtick.color"] = color
plt.rcParams["ytick.color"] = color

## Visualization - Categorical

In [None]:
df["SEX"].value_counts().plot(kind="bar")

# SEX: Gender (1=male, 2=female)

In [None]:
df["EDUCATION"].value_counts().plot(kind="bar")

# EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

In [None]:
df["MARRIAGE"].value_counts().plot(kind="bar")

# MARRIAGE: Marital status (1=married, 2=single, 3=others)

In [None]:
df[categorical_features].describe()

## Visualization Results - Categorical

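Todos:

MARRIAGE: Collapse the undocumented value 0 and 3 (others) into one category

EDUCATION: Collapse 5 and 6 (both unknown) and the undocumented value 0 into one category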
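
The two todos above could be implemented with `Series.replace`. This is only a sketch: the helper name `collapse_codes` is hypothetical, and the choice of target codes (3 for MARRIAGE, 5 for EDUCATION) is an assumption, since the notes do not say which code the merged category should keep.

```python
import pandas as pd


def collapse_codes(frame):
    """Return a copy with the rare/undocumented category codes merged."""
    out = frame.copy()
    # assumption: the merged EDUCATION category keeps code 5 (unknown)
    out["EDUCATION"] = out["EDUCATION"].replace({0: 5, 6: 5})
    # assumption: the merged MARRIAGE category keeps code 3 (others)
    out["MARRIAGE"] = out["MARRIAGE"].replace({0: 3})
    return out


# toy data for illustration only, not taken from UCI_Credit_Card.csv
toy = pd.DataFrame({"EDUCATION": [0, 1, 2, 5, 6], "MARRIAGE": [0, 1, 2, 3, 1]})
collapsed = collapse_codes(toy)
print(collapsed["EDUCATION"].tolist())
print(collapsed["MARRIAGE"].tolist())
```

Returning a copy keeps the raw `df` available for comparison with the cleaned version.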


## Visualization - Ordinal

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay 
for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)

In [None]:
ordinal_df = df[ordinal_features]
# let pandas create the subplot grid; passing a single ax triggers a warning and clears the figure
ordinal_df.hist(bins=50, figsize=(15, 20))

In [None]:
ordinal_df.describe()

## Visualization Results - Ordinal

Todos:

Clean up the values 0 and -2, which occur in the PAY_* columns but are not explained in the data description
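
Before deciding how to clean these values, it may help to see how often they occur. A minimal sketch, assuming a hypothetical helper `undocumented_counts` and treating only the codes listed in the data description (-1 and 1 to 9) as documented:

```python
import pandas as pd

# codes explained in the data description: -1 and 1..9
DOCUMENTED_PAY_CODES = {-1, *range(1, 10)}


def undocumented_counts(frame, columns):
    """Count, per column, the values that are not documented codes."""
    counts = {}
    for col in columns:
        mask = ~frame[col].isin(DOCUMENTED_PAY_CODES)
        counts[col] = frame.loc[mask, col].value_counts().sort_index().to_dict()
    return counts


# toy data for illustration only, not taken from UCI_Credit_Card.csv
toy = pd.DataFrame({"PAY_0": [-2, -1, 0, 0, 2], "PAY_2": [-1, 1, 2, 3, 9]})
counts = undocumented_counts(toy, ["PAY_0", "PAY_2"])
print(counts)
```

On the real data the call would be `undocumented_counts(df, ordinal_features)`.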


## Visualization - Numerical


In [None]:
numerical_df = df[numerical_features]
# let pandas create the subplot grid; passing a single ax triggers a warning and clears the figure
numerical_df.hist(bins=30, figsize=(15, 20))

In [None]:
numerical_df.describe()

## Correlation Analysis

In [None]:

analyze_correlations(df)

## Correlation Analysis - Results

No surprises: the lagged monthly features are strongly correlated with each other, and all correlations with the target are weak.
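
The correlations with the target can also be read off as a ranked list rather than from the heatmap. A minimal sketch, assuming a hypothetical helper `target_correlations` and a toy frame in place of `df`:

```python
import pandas as pd


def target_correlations(frame, target_col):
    """Absolute Pearson correlation of every other column with the target, strongest first."""
    corr = frame.corr(method="pearson")[target_col].drop(target_col)
    return corr.abs().sort_values(ascending=False)


# toy data for illustration only, not taken from UCI_Credit_Card.csv
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "noise": [3, 1, 2, 3, 1, 2],
    "y": [0, 0, 0, 1, 1, 1],
})
ranked = target_correlations(toy, "y")
print(ranked)
```

On the real data the call would be `target_correlations(df, target)`.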