# INTRODUCTION

### Task Details
An organization wants to predict which customers are likely to default on its consumer loan product. It has data on historic customer behavior, based on what it has observed. When it acquires new customers, it wants to predict who is risky and who is not.

### What do you have to do?
You are required to use the training dataset to identify patterns that predict “potential” defaulters.

### Expected Submission
Submissions should be made in the same format as the Sample Notebook provided. Train/Test split should be 80% for training & 20% for testing.

### Evaluation
Submissions will be evaluated on the basis of roc_auc_score on 20% of train_dataset.

<img src="https://www.onlygfx.com/wp-content/uploads/2020/05/alert-stamp-3.png" width="600" height="200" />

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### Import libraries and set options

In [None]:
import seaborn as sns
pd.set_option('display.width', 100)
pd.set_option('display.max_columns', 20)
sns.set_theme(color_codes=True, style='darkgrid', 
              palette='deep', font='sans-serif')

# IMPORT DATA AND DATA CLEANSING

In [None]:
df_train = pd.read_csv ( '/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv' )
df_train.drop ('Id', axis = 1, inplace = True )
df_train.head().style.set_properties(**{'background-color':'black',
                                     'color': 'white'})

In [None]:
df_train.isnull().sum()

###### There are no missing values in the training dataset.

In [None]:
df_train.info()

In [None]:
# set target variable as category
df_train['Risk_Flag']=df_train['Risk_Flag'].astype('category')

# EDA

In [None]:
# Import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots 
colors = ['#ffa07a','#00b2ff']
sns.set(palette=colors, font='Serif', style='white', rc={'axes.facecolor':'#f1f1f1', 'figure.facecolor':'#f1f1f1'})
sns.palplot(colors)

In [None]:
df_train.describe().T.style.background_gradient(subset=['mean','std','50%','count'], cmap='tab10')

In [None]:
# Let's check the target feature first
fig = plt.figure(figsize=(10,6))
ax=sns.countplot(data=df_train, x='Risk_Flag')
for i in ax.patches:
    ax.text(x=i.get_x()+i.get_width()/2, y=i.get_height()/7, s=f"{np.round(i.get_height()/len(df_train)*100,0)}%", ha='center', size=40, weight='bold', rotation=360, color='white')
plt.title("Risk_Flag Feature", size=20, weight='bold')
plt.annotate(text="No potential default on loans", xytext=(0.5,150000),xy=(0.2,120000), arrowprops =dict(arrowstyle="->", color='black', connectionstyle="angle3,angleA=0,angleB=90"), color='black')
plt.annotate(text="Potential default on loans", xytext=(0.8,130000),xy=(1,30000), arrowprops =dict(arrowstyle="->", color='black',  connectionstyle="angle3,angleA=0,angleB=90"), color='black')
plt.show()

###### The classes are heavily skewed. We address this later with SMOTE (Synthetic Minority Oversampling TEchnique).

###### Class 0 represents about 88% of the dataset, while class 1 accounts for only about 12%.
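The class shares quoted above can also be computed directly rather than read off the plot. The snippet below is a minimal sketch on a hypothetical toy series (`risk_flag` is an illustrative stand-in, not the real `Risk_Flag` column); the same call should work on `df_train['Risk_Flag']`.

```python
import pandas as pd

# Hypothetical toy target with an 88/12 imbalance (illustrative stand-in)
risk_flag = pd.Series([0] * 88 + [1] * 12, name="Risk_Flag")

# normalize=True returns each class's share of the rows instead of raw counts
class_share = risk_flag.value_counts(normalize=True).sort_index()
print((class_share * 100).round(2))
```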

In [None]:
g = sns.PairGrid(df_train)
g.map(sns.scatterplot)
plt.show()

# PREPROCESSING

#### Management of binary categorical data

In [None]:
from sklearn.preprocessing import LabelEncoder

binary_class = ['Married/Single', 'Car_Ownership']
for column in binary_class:
    print ( '\nBefore:', df_train [column].unique () )
    lab_enc = LabelEncoder()
    df_train [column] = lab_enc.fit_transform ( df_train [column].values )
    print ('')
    print ( 'After:\n', df_train [column] )
    print ( '*' * 50 )

# rename column Single
df_train.rename(columns = { 'Married/Single' : 'Single' }, inplace = True)
df_train['Single']=df_train['Single'].astype('category')
df_train['Car_Ownership']=df_train['Car_Ownership'].astype('category')

#### Management of categorical data

**One-hot encoding** is used for the other categorical columns. Encoding them as integers would be a common mistake: the classification algorithm would assume that the professions, states or cities have a numeric order.

In [None]:
one_hot_class = ['House_Ownership', 'CITY', 'STATE', 'Profession']
for column in one_hot_class:
    one_hot = pd.get_dummies ( df_train [column] ,
                drop_first = True)
    df_train = pd.concat([df_train, one_hot], axis=1)
    df_train.drop (column, axis = 1, inplace = True )
    
df_train.head().style.set_properties(**{'background-color':'black',
                                     'color': 'white'})
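To make the difference between integer codes and one-hot columns concrete, here is a small sketch on a hypothetical toy column (the profession values are illustrative only). It uses the same `pd.get_dummies` call with `drop_first = True` as the cell above; `dtype=int` is added here only so the output prints as 0/1.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only
toy = pd.DataFrame({"Profession": ["Engineer", "Doctor", "Lawyer", "Doctor"]})

# Integer codes impose an arbitrary order (Doctor=0 < Engineer=1 < Lawyer=2)
codes = toy["Profession"].astype("category").cat.codes

# One-hot encoding gives each category its own 0/1 column instead;
# drop_first=True drops the first category (Doctor), which becomes the baseline
dummies = pd.get_dummies(toy["Profession"], drop_first=True, dtype=int)

print(codes.tolist())
print(dummies)
```

A row with zeros in every remaining column belongs to the dropped baseline category.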

In [None]:
# info() prints its summary itself and returns None, so no print() is needed
df_train.info()

#### Train and test split

In [None]:
from sklearn.model_selection import train_test_split
X, y = df_train.drop ('Risk_Flag', axis=1).values , df_train.Risk_Flag.values
X_train, X_test, y_train, y_test = train_test_split ( X, y,
                                                     test_size = 0.2,  # 80/20 split, as the task requires
                                                     random_state = 1,
                                                     stratify = y)
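The `stratify = y` argument in the split above keeps the class proportions the same in both subsets, which matters with a minority class this small. A minimal sketch on hypothetical toy arrays (`X_toy` and `y_toy` are assumptions, not the loan data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 12% positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 120 + [0] * 880)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Share of positives in each subset; both should stay close to 12%
print(y_tr.mean(), y_te.mean())
```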

#### Minority class oversampling in the training dataset (SMOTE)

In [None]:
from imblearn.over_sampling import SMOTE

print ('Number of observations in the target variable before oversampling of the minority class:', np.bincount (y_train) )

smt = SMOTE ()
X_train, y_train = smt.fit_resample (X_train, y_train)

print ('\nNumber of observations in the target variable after oversampling of the minority class:', np.bincount (y_train) )
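As a rough intuition for what SMOTE does, each synthetic observation is placed on the line segment between a minority-class point and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step on hypothetical 2-D points; it is a simplification and not the `imblearn` implementation, which also handles the neighbor search and the sampling strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

def synthetic_point(x, neighbor, rng):
    """Return a random point on the segment between x and a neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return x + lam * (neighbor - x)

# Interpolate between the first point and an (assumed) neighbor of it
new_point = synthetic_point(minority[0], minority[1], rng)
print(new_point)
```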

#### Standardization of variables

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform ( X_train )
X_test_std = std_scaler.transform ( X_test )

# MODEL SELECTION AND EVALUATION OF PERFORMANCE

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier ( random_state = 1 )
tree.fit ( X_train_std, y_train )
y_pred = tree.predict ( X_test_std )
print ( 'Accuracy score: %.2f' %accuracy_score ( y_test, y_pred ) )
print ( 'Roc_Auc score: %.2f' %roc_auc_score ( y_test, y_pred ) )
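`roc_auc_score` also accepts continuous scores, such as the class-1 probabilities returned by `predict_proba`. With hard 0/1 predictions, as above, the ROC curve has a single operating point and the score reduces to the average of sensitivity and specificity. The sketch below compares the two on synthetic stand-in data; the dataset, `max_depth=4` and the variable names are illustrative assumptions, and the numbers it prints say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the loan dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# max_depth is an illustrative choice so that leaves hold mixed classes
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)

# Same model, scored once on hard labels and once on class-1 probabilities
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"labels: {auc_from_labels:.3f}  probabilities: {auc_from_scores:.3f}")
```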

#### HYPERPARAMETERS OPTIMIZATION

In [None]:
from sklearn.model_selection import GridSearchCV
# range of parameter values
split_range = [ 8, 10 ]
# parameters grid
grid_param = [
    { 'criterion' : [ 'entropy' ],
     'splitter' : [ 'best', 'random' ],
     'min_samples_split' : split_range }
]
gs = GridSearchCV ( estimator = tree,
                   param_grid = grid_param,
                   scoring = 'roc_auc',
                   cv = 3,
                   refit = True,
                   n_jobs = 4
                   )

# search on the standardized features, consistent with the other cells
gs = gs.fit ( X_train_std, y_train )

print ( 'Best hyperparameter:', gs.best_params_ )

print ( 'Best score: %.3f' %gs.best_score_ )

gs = gs.best_estimator_

In [None]:
gs.fit ( X_train_std, y_train )
y_pred_gs = gs.predict ( X_test_std )
print ( 'Accuracy score: %.2f' %accuracy_score ( y_test, y_pred_gs ) )
print ( 'Roc_Auc score: %.2f' %roc_auc_score ( y_test, y_pred_gs ) )

Following model optimization:

- accuracy has improved (0.87 --> 0.88)

- the roc_auc score is unchanged at 0.85.

Now let's see in detail what errors the model makes on the test data through the confusion matrix.

#### CONFUSION MATRIX

In [None]:
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix (  y_test, y_pred_gs )

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
#plot 1
sns.heatmap(conf_matrix,ax=axes[0],annot=True, cmap='Blues', cbar=False, fmt='d')
axes[0].set_xlabel('\nPredicted label', size = 14)
axes[0].set_ylabel('True label\n', size = 14)

# plot 2
sns.heatmap(conf_matrix/np.sum(conf_matrix),ax=axes[1], annot=True, 
            fmt='.2%', cmap='Blues', cbar=False)
axes[1].set_xlabel('\nPredicted label', size = 14)
axes[1].set_ylabel('True label\n', size = 14)
axes[1].yaxis.tick_left()
plt.show()


From the confusion matrices it can be deduced that:

- in 2.30% of the test observations, the model classifies an actual potential defaulter as a non-defaulter

- the more frequent error is the opposite one: flagging as potential defaulters customers who in reality are not (9.88% of the test observations)
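The shares read off the confusion matrices can also be computed directly from the four cells. The sketch below uses hypothetical labels and predictions (`y_true` and `y_hat` are illustrative, not the notebook's `y_test` and `y_pred_gs`) and relies on scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, sklearn orders the matrix [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

total = tn + fp + fn + tp
print(f"false negatives: {fn / total:.2%} of all observations")
print(f"false positives: {fp / total:.2%} of all observations")
print(f"recall on class 1: {tp / (tp + fn):.2%}")
```

Recall on class 1 is often the more informative figure here, since it is measured against the defaulters only rather than against all observations.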

# CONCLUSIONS AND FINAL CONSIDERATIONS

**Based on the requirements of the task in question, we can conclude that the trained tree model achieved a good roc_auc_score of 0.85.**

**For a more in-depth analysis, it is advisable to test other classification algorithms, possibly better-performing ones, or some ensemble methods.**

**I await comments and / or suggestions.**