## Client Default Prediction Using Machine Learning
In this notebook, the goal is to predict whether a customer is likely to default after receiving a credit card loan from a bank. We build a classification machine learning model that estimates the chance of credit card default for a given customer.

## Importing Packages & Libraries for Computation
The first step is to load all libraries that we need for computation. 

In [None]:
# importing machine learning libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os, time, re, tqdm, math # utility libraries for computation
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import KNNImputer

In [None]:
# installing the dtale library for performing EDA
!pip install dtale

In [None]:
# listing the datasets we have
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

We are given **4** files: a training file, a test file, a file containing the labels for the training data, and a sample submission file.

## Loading the Dataset
First, we need to load the dataset into memory to perform computation. The dataset is large, more than 18 GB in size. Because we are working in a limited computational environment, we'll load only a chunk of the data for now and run our computations on that chunk. To start, we'll load the first 20,000 rows.
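To make the chunking idea concrete, here is a minimal sketch on a tiny in-memory CSV. The toy data and column values are made up for illustration; only `chunksize` and `nrows` are real `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Toy CSV standing in for the real training file (hypothetical data)
rows = [f"c{i // 3},{i * 0.1:.1f}" for i in range(10)]
csv_text = "customer_ID,P_2\n" + "\n".join(rows)

# With chunksize, read_csv returns a lazy iterator of DataFrames
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
first_chunk = next(reader)  # only the first 4 rows are parsed here
print(first_chunk.shape)

# nrows is an alternative when only the leading rows are needed
head_only = pd.read_csv(io.StringIO(csv_text), nrows=4)
print(head_only.equals(first_chunk))
```

Either option keeps memory usage bounded by the number of rows requested rather than by the file size.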

In [None]:
start = time.time()

# define path variables
train_path = '../input/amex-default-prediction/train_data.csv'
train_labels_path = '../input/amex-default-prediction/train_labels.csv'

# define chunk size
chunk_size = 20000
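# create lazy chunk readers; no rows are parsed until a chunk is requested,
# so the time printed below covers only the creation of the readers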
train_data = pd.read_csv(train_path, low_memory=False, chunksize=chunk_size)
train_labels = pd.read_csv(train_labels_path, chunksize=chunk_size)

end = time.time()
print('Time Taken: %.3f seconds' % (end-start))

## EDA (Exploratory Data Analysis)
The first step after loading the dataset is the EDA or exploratory data analysis step. In this step, we critically analyse the dataset, draw useful insights, make decisions for modelling and check the dataset for any missing, null, NaN values, and so on. 
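As a rough illustration of the missing-value check mentioned above, the sketch below summarises how many values are missing per column. The helper name `missing_summary` and the toy frame are our own assumptions, not part of the competition data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and share of missing values per column, worst first (hypothetical helper)."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    # keep only the columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame with gaps in one numeric and one string column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "z", "w"],
})
report = missing_summary(toy)
print(report)
```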

In [None]:
# take the first chunk from each reader as a pandas dataframe
train_data = next(train_data)
train_labels = next(train_labels)

In [None]:
train_labels.head(4)

In [None]:
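# customer_ID is kept in the labels because it is the join key for the merge below
train_labels['customer_ID'].nunique()

In [None]:
train_labels.shape

In [None]:
train_labels[:4]

In [None]:
# merging labels with dataset features on customer_ID; a positional concat
# would misalign rows, since train_data holds several statements per customer
train_data = train_data.merge(train_labels, on='customer_ID', how='left')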
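Because the feature table can hold several rows per customer while the label table holds one row per customer, a key-based join is the safer way to attach labels. Here is a minimal sketch on toy data (the values are made up):

```python
import pandas as pd

# Several statement rows per customer (toy values)
features = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b"],
    "P_2": [0.9, 0.8, 0.2, 0.3, 0.1],
})
# Exactly one label per customer
labels = pd.DataFrame({"customer_ID": ["a", "b"], "target": [0, 1]})

# validate="many_to_one" raises if a customer has more than one label row
merged = features.merge(labels, on="customer_ID", how="left", validate="many_to_one")
print(merged)
```

Every statement row of a customer receives that customer's label, regardless of row order in either table.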

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
# check the shape of data
train_data.shape

After loading **20,000** examples, the dataset has **20,000** rows and **191** columns (the features plus the target).

In [None]:
train_data.head(5)

In [None]:
# export the dataframe
train_data.to_csv('train_data_20k.csv')

In [None]:
# check the dtype of S_2 feature
train_data['S_2'].dtype

In [None]:
categorical = [i for i in train_data.columns if train_data[i].dtype == object]
numeric = [i for i in train_data.columns if train_data[i].dtype != object]

In [None]:
train_data.columns

In [None]:
len(categorical), len(numeric)

Most attributes in our dataset are numerical; only the columns with `object` dtype are treated as categorical here.
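An alternative way to make the same split is `select_dtypes`. The sketch below uses a made-up toy frame; note that integer-coded categorical columns would still land in the numeric group under this rule.

```python
import pandas as pd

# Toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    "customer_ID": ["a", "b", "c"],
    "P_2": [0.9, 0.2, 0.5],
    "flag": [0, 1, 0],
    "category": ["x", "y", "x"],
})

# Split columns by whether pandas considers their dtype numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```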

### Preprocessing & Feature Engineering the Datetime Attribute

In [None]:
train_data.rename(columns={'S_2': 'Date'}, 
                  inplace=True)

In [None]:
# check the data again
train_data.head(2)

In [None]:
# convert the Date column from strings to datetime; infer_datetime_format is
# deprecated in pandas 2.0, so pandas is left to infer the format itself
train_data['Date'] = pd.to_datetime(train_data['Date'])

In [None]:
train_data.head(4)

In [None]:
train_data['Date'].dtype
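Once the column is a datetime, calendar parts can be pulled out with the `.dt` accessor. This is a small sketch on toy dates; the ISO-style `%Y-%m-%d` format is an assumption for the toy data, and the real column should be inspected before fixing a format.

```python
import pandas as pd

# Toy dates (assumed ISO format for this illustration)
toy = pd.DataFrame({"Date": ["2017-03-09", "2017-04-07", "2018-01-31"]})
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Derive simple calendar features from the datetime column
toy["year"] = toy["Date"].dt.year
toy["month"] = toy["Date"].dt.month
toy["day"] = toy["Date"].dt.day
print(toy)
```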

### Analyzing a Particular Customer's ID
For now, we'll analyze the transaction history of a particular customer. We'll fetch that customer's rows from the dataset and then run computations on the resulting data frame.
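The general pattern is a boolean mask on the ID column followed by a sort on the date, sketched here on toy data (IDs and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "b", "a", "b", "a"],
    "Date": pd.to_datetime(
        ["2017-05-01", "2017-03-01", "2017-03-01", "2017-04-01", "2017-04-01"]
    ),
    "P_2": [0.7, 0.2, 0.9, 0.3, 0.8],
})

# Keep one customer's rows and put them in chronological order
history = toy[toy["customer_ID"] == "a"].sort_values("Date").reset_index(drop=True)
# Statement-to-statement change of a feature (first row has no predecessor)
history["P_2_change"] = history["P_2"].diff()
print(history)
```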

In [None]:
train_data['customer_ID'].value_counts()

In [None]:
random_customer_id = '0089bf123391cdddcdc34a8ea239de5188c8fc7e4a5974e0cd3c2461c8d3dc0b'

In [None]:
# fetch the data from dataset
random_customer_data = train_data[train_data['customer_ID'] == random_customer_id]

In [None]:
random_customer_data.head(4)

In [None]:
random_customer_data.shape

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})

sns.lineplot(x='Date', y='P_2', 
             data=train_data)

In [None]:
random_customer_data['Date'].min(), random_customer_data['Date'].max()

In [None]:
sns.lineplot(x='Date', y='P_2', 
             data=random_customer_data)

In [None]:
# viewing the plot of risk factor of a random customer
sns.lineplot(x='Date', y='R_2', 
             data=random_customer_data)

In [None]:
plt.scatter(x='Date', y='P_2', 
             data=random_customer_data)

In [None]:
sns.countplot(x=train_data['target']).set_title('Class distribution of Target Feature')

In [None]:
categorical

In [None]:
# checking the distribution of target column
plt.figure(figsize=(10, 8))
circle = plt.Circle((0, 0), 0.7, color='white')
# sort_index() puts class 0 first so the counts line up with the labels
counts = train_data['target'].value_counts().sort_index()
plt.pie(counts, labels=['No Default', 'Default'], colors=['green', 'red'])
p = plt.gcf()
p.gca().add_artist(circle)

In [None]:
index = 0
for column in random_customer_data.columns:
    # S_2 was renamed to Date above, so both names are skipped
    if column in ["S_2", "Date", "customer_ID", "target"] + categorical:
        continue
    
    if index % 4 == 0:
        plt.figure(figsize=(16, 4))
    plt.subplot(1, 4, index % 4 + 1)
    
    sns.histplot(data=random_customer_data, x=column, hue="target", bins=20)
    plt.ylabel("")
    
    if index % 4 == 3:
        plt.show()
    
    index += 1

In [None]:
X_col = [
    "B_2", "B_7", "B_18", "B_23", "B_32", "D_48",
    "D_55", "D_61", "D_121", "P_2", "S_11",
    
]

In [None]:
%%time
# define chunk size again for making the machine learning model
chunk_size = 2000000

# load dataset with given chunksize
data = pd.read_csv(train_path, low_memory=False, chunksize=chunk_size, usecols=['customer_ID'] + X_col)
labels = pd.read_csv(train_labels_path, chunksize=200000)

In [None]:
%%time
# take the first chunk as a dataframe
data = next(data)

In [None]:
%%time
labels = next(labels)

In [None]:
data_mean = data.groupby("customer_ID")[X_col].mean().reset_index()
data_last = data.groupby("customer_ID")[X_col].last().reset_index()

In [None]:
# labels.drop('customer_ID', axis=1, inplace=True)

In [None]:
# merging the dataset by customer ID
new_data = pd.merge(
    left=data_mean, 
    right=data_last, 
    how="inner",
    on="customer_ID",
    suffixes=("_mean", "_last"),
)
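The mean and last aggregations can also be produced in a single `groupby` pass. The sketch below shows this on toy data; the flattened column names follow the same `_mean` / `_last` suffix idea, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "b", "b"],
    "P_2": [0.9, 0.7, 0.2, 0.4, 0.3],
    "B_2": [1.0, 1.0, 0.0, 0.5, 1.0],
})

# One row per customer; groupby "last" returns the last non-null value per group
agg = toy.groupby("customer_ID")[["P_2", "B_2"]].agg(["mean", "last"])
# Flatten the two-level column index into names such as P_2_mean
agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
agg = agg.reset_index()
print(agg)
```

Since "last" depends on row order, the rows of each customer should already be in chronological order before aggregating.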

In [None]:
labels.shape

In [None]:
new_data.shape

In [None]:
new_data = pd.merge(new_data, 
                    labels, 
                    on="customer_ID", 
                    how="left")

In [None]:
new_data.shape

In [None]:
new_data.head(4)

In [None]:
new_data.columns

In [None]:
# converting datetime column
'''
new_data['S_2'] = pd.to_datetime(new_data['S_2'], 
                                         infer_datetime_format=True, format='%Y/%m/%d %H:%M:%S')
                                         '''

In [None]:
# new_data['S_2'].dtype

In [None]:
# creating additional features
'''
new_data['Year of Transaction'] = new_data['S_2'].dt.year
new_data['Month of Transaction'] = new_data['S_2'].dt.month
new_data['Day of Transaction'] = new_data['S_2'].dt.day
'''

In [None]:
# new_data.drop('S_2', axis=1, inplace=True)

In [None]:
# new_data.head(4)

In [None]:
cat_cols = [i for i in new_data.columns if new_data[i].dtype==object]
num_cols = [i for i in new_data.columns if new_data[i].dtype!=object]

In [None]:
cat_cols
print(len(num_cols))

In [None]:
cat_cols

In [None]:
# new_data['D_63'].value_counts()  # D_63 is not among the columns loaded via X_col

In [None]:
new_data.isnull().sum().to_numpy()

### Defining Imputation Functions For Dealing with Missing, NaN & Null Values

In [None]:
new_data.sample()

In [None]:
# making an imputation function
def random_imputation(x):
    random_sample = new_data[x].dropna().sample(new_data[x].isna().sum(), replace=True)
    random_sample.index = new_data[new_data[x].isnull()].index
    new_data.loc[new_data[x].isnull(), x] = random_sample

# define imputation mode    
def imputation_mode(x):
    mode = new_data[x].mode()[0]
    new_data[x] = new_data[x].fillna(mode)
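For comparison, here is a sketch of the same two ideas written as functions that take a Series and return a new one instead of modifying a global data frame. The names `impute_random` and `impute_mode` and the `seed` parameter are our own assumptions.

```python
import numpy as np
import pandas as pd

def impute_random(series, seed=0):
    """Fill NaNs with values drawn (with replacement) from the observed ones."""
    missing = series.isna()
    observed = series.dropna()
    if not missing.any() or observed.empty:
        return series.copy()  # nothing to fill, or nothing to draw from
    draws = observed.sample(n=int(missing.sum()), replace=True, random_state=seed)
    filled = series.copy()
    filled[missing] = draws.to_numpy()
    return filled

def impute_mode(series):
    """Fill NaNs with the most frequent observed value."""
    modes = series.mode()
    return series.fillna(modes.iloc[0]) if not modes.empty else series.copy()

s = pd.Series([1.0, np.nan, 3.0, np.nan])
filled_random = impute_random(s)
filled_mode = impute_mode(pd.Series(["x", None, "x", "y"]))
print(filled_random.tolist(), filled_mode.tolist())
```

Random-sample imputation roughly preserves the observed distribution of a column, while mode imputation is the usual simple choice for categorical columns.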

In [None]:
%%time
# apply function to columns (the target is a label, not a feature, so it is skipped)
for c in num_cols:
    if c != 'target':
        random_imputation(c)

In [None]:
new_data.isnull().sum().to_numpy()

In [None]:
new_data.shape

In [None]:
# check for null or missing values
new_data[cat_cols].isna().sum().sort_values(ascending=False)

In [None]:
# random_imputation('D_64')  # D_64 is not among the columns loaded via X_col

for c in cat_cols:
    imputation_mode(c)

In [None]:
new_data[cat_cols].isna().sum().sort_values(ascending=False)

In [None]:
new_data.isnull().sum().to_numpy()

In [None]:
new_data.shape

## Converting Categorical Columns to Integers

In [None]:
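# D_63 and D_64 are one-hot encoded only if they were loaded; X_col above does not include them
D_63 = pd.get_dummies(new_data[['D_63']]) if 'D_63' in new_data.columns else pd.DataFrame(index=new_data.index)

In [None]:
D_64 = pd.get_dummies(new_data[['D_64']]) if 'D_64' in new_data.columns else pd.DataFrame(index=new_data.index)

In [None]:
# errors='ignore' skips columns that are absent from new_data
new_data.drop(['D_63', 'D_64', 'customer_ID'], axis=1, inplace=True, errors='ignore')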

In [None]:
final_data = pd.concat([new_data, D_63, D_64], axis=1)

In [None]:
final_data.head(4)

In [None]:
final_data.shape

In [None]:
final_data.target

In [None]:
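# X = final_data.drop('target', axis=1)

# errors='ignore' keeps this cell working whether or not customer_ID was already dropped
X = new_data.drop(['customer_ID', 'target'], axis=1, errors='ignore')
Y = new_data['target']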

In [None]:
X.shape, Y.shape

In [None]:
new_data.columns[1:-1]

## Data Splitting (Splitting into Train & Validation Sets)

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X, 
                                                  Y, 
                                                  test_size=0.2, 
                                                  random_state=123)

print(X_train.shape)
print(X_val.shape)
print(Y_train.shape)
print(Y_val.shape)
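When the classes are imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in both parts. A minimal sketch on synthetic data (the 25% positive rate is an arbitrary choice for the illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (assumption)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.array([1] * 250 + [0] * 750)

# stratify=y_toy preserves the share of each class in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)
print(y_tr.mean(), y_va.mean())
```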

## Applying Machine Learning Models

In [None]:
%%time
rf = RandomForestClassifier()
rf.fit(X_train, Y_train)

In [None]:
rf.get_params()

In [None]:
print('Accuracy of Random Forest on training data: %.3f' % rf.score(X_train, Y_train))
print('Accuracy of Random Forest on validation data: %.3f' % rf.score(X_val, Y_val))

In [None]:
plt.figure(figsize=(10, 5))
plt.title('Feature Importance of Random Forest')
imp = pd.Series(rf.feature_importances_, index=X.columns)
imp.nlargest(20).plot(kind='barh')
plt.show()

We obtained an accuracy of about 99% on the training data but only 74% on the validation data, a gap that points to overfitting. We need to improve the validation accuracy. The work is in progress!
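One common thing to try when a random forest scores much higher on training data than on held-out data is to constrain the trees. The sketch below compares the train/validation gap for default and constrained settings on synthetic noisy data; the data and the hyperparameter values are illustrative assumptions, not tuned results for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy data standing in for the real features (assumption)
X_syn, y_syn = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=0, stratify=y_syn)

# Illustrative settings only; these values are not tuned
settings = {
    "default": {},
    "constrained": {"max_depth": 6, "min_samples_leaf": 20},
}

gaps = {}
for name, params in settings.items():
    model = RandomForestClassifier(n_estimators=100, random_state=0, **params)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    val_acc = model.score(X_va, y_va)
    gaps[name] = train_acc - val_acc
    print(f"{name}: train={train_acc:.3f} val={val_acc:.3f} gap={gaps[name]:.3f}")
```

A smaller gap does not by itself mean a better model, so the validation score should be tracked alongside it.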

In [None]:
def get_scores(clf):
    # fit on the training split and predict on the validation split
    model = clf.fit(X_train, Y_train)
    y_pred = model.predict(X_val)
    
    print('============================================================================')
    print('Classification Results of Classification Model After Training')
    print('============================================================================')
    print('')
    
    print('Accuracy of Classifier on training dataset: %.3f' % model.score(X_train, Y_train))
    print('Accuracy of Classifier on validation dataset: %.3f' % model.score(X_val, Y_val))
    print('Precision of classifier: %.3f' % precision_score(Y_val, y_pred, average='weighted'))
    print('Recall of classifier: %.3f' % recall_score(Y_val, y_pred, average='weighted'))
    print('F1 score of classifier: %.3f' % f1_score(Y_val, y_pred, average='weighted'))
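To see how these metrics relate to each other, here is a small worked sketch on hand-made labels. The arrays are invented for illustration, and the scores are computed for the positive class (the default `average='binary'`) rather than the weighted average.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hand-made labels: 6 negatives and 4 positives (toy data)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
print(cm)
print(precision, recall, f1)
```

Here 7 of 10 predictions are correct, yet only half of the positives are found, which is why accuracy alone can mislead on an imbalanced target.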

In [None]:
# getting the scores of random forest
# get_scores(rf)

In [None]:
sns.countplot(x=new_data['target'])

In [None]:
# checking the distribution of target column in the aggregated data
plt.figure(figsize=(10, 8))
circle = plt.Circle((0, 0), 0.7, color='white')
# sort_index() puts class 0 first so the counts line up with the labels
counts = new_data['target'].value_counts().sort_index()
plt.pie(counts, labels=['No Default', 'Default'], colors=['green', 'red'])
p = plt.gcf()
p.gca().add_artist(circle)

In [None]:
from sklearn.model_selection import GridSearchCV