## INTRODUCTION
***
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

## GOAL
***
Predict the probability that a customer will fail to repay a loan within the next two years.

## METHODS
***
### Feature Engineering
- Weight of Evidence (WOE)
- P-value for Feature Selection
### Algorithms
- Logistic Regression (Baseline Model)
- Random Forest
### Evaluation
- Precision, Recall, F1-score, roc_auc

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df_train = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-training.csv', index_col = 0)
df_test = pd.read_csv('/kaggle/input/GiveMeSomeCredit/cs-test.csv', index_col = 0)
df_entry = pd.read_csv('/kaggle/input/GiveMeSomeCredit/sampleEntry.csv', index_col = 0)

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train.info()

In [None]:
df_test.info()

# Observations
- The training dataset has 150,000 records<br>
- The test dataset has 101,503 records<br>
- There are 10 Numeric Independent Variables<br>
- SeriousDlqin2yrs is the Dependent Variable<br>

# MISSING VALUES

In [None]:
pd.DataFrame({'count':df_train.isnull().sum().values, 'ratio': df_train.isnull().mean() * 100})

In [None]:
pd.DataFrame({'count':df_test.isnull().sum().values, 'ratio': df_test.isnull().mean() * 100})

## Note
- MonthlyIncome and NumberOfDependents have ~20% and ~2.6% missing values respectively in both the training and test datasets
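
The two cells above build the same table twice. A minimal sketch of a reusable helper is shown here on a made-up toy frame; the function name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the number and percentage of missing values per column."""
    summary = pd.DataFrame({
        "count": df.isnull().sum(),
        "ratio": df.isnull().mean() * 100,
    })
    # Keep only columns that contain missing values, largest share first.
    return summary[summary["count"] > 0].sort_values("ratio", ascending=False)

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3200.0, np.nan, 7100.0],
    "NumberOfDependents": [0.0, np.nan, 2.0, 1.0, 0.0],
    "DebtRatio": [0.3, 1500.0, 0.5, 2100.0, 0.2],
})
report = missing_summary(toy)
print(report)
```

The same call could be applied to `df_train` and `df_test` to compare the two datasets side by side.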

In [None]:
df_train[df_train['MonthlyIncome'].isnull()][['NumberOfDependents', 'DebtRatio']].describe()

In [None]:
df_train[df_train['NumberOfDependents'].isnull()][['MonthlyIncome', 'DebtRatio']].describe()

In [None]:
df_train[df_train['DebtRatio']>100]['MonthlyIncome'].isnull().sum()/len(df_train)*100, df_test[df_test['DebtRatio']>100]['MonthlyIncome'].isnull().sum()/len(df_test)*100

In [None]:
df_train[df_train['MonthlyIncome'].isnull()]['NumberOfDependents'].isnull().sum()/len(df_train)*100, df_test[df_test['MonthlyIncome'].isnull()]['NumberOfDependents'].isnull().sum()/len(df_test)*100

In [None]:
df_train[(df_train['DebtRatio']>100) & (df_train['MonthlyIncome'].notnull())]['MonthlyIncome'].describe()

In [None]:
df_train[(df_train['DebtRatio']<100) & (df_train['MonthlyIncome'].notnull())]['MonthlyIncome'].describe()

# Handling Missing Monthly Income
* Records with missing Monthly Income have a high Debt Ratio (median 1159)
* Summary statistics for Borrowers with a high Debt Ratio and a reported Monthly Income show that their Monthly Income is 0
* This could mean that Borrowers with missing Monthly Income deliberately left the field blank because they are casual workers who do not earn a monthly income
* The best way to handle these missing values is to replace them with 0

# Handling Missing Number of Dependents
* Records with missing Number of Dependents also have missing MonthlyIncome (i.e. they share the same index)
* This shows that the same set of Borrowers who left Monthly Income blank also left the Number of Dependents field blank
* Summary statistics for Borrowers with missing Monthly Income reveal that they have no dependents
* It is plausible that Borrowers with little to no income have no dependents
* Thus, the best way to handle these missing values is to replace them with 0, which is also consistent with the range of this variable
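
The co-occurrence claim above can be checked directly. The following is a minimal sketch on a made-up toy frame (the values and variable names are assumptions); it tests whether every row that lacks Number of Dependents also lacks Monthly Income, and then fills both columns with 0.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 3 miss both fields, row 4 misses only MonthlyIncome.
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3200.0, np.nan, np.nan],
    "NumberOfDependents": [0.0, np.nan, 2.0, np.nan, 1.0],
})

income_missing = toy["MonthlyIncome"].isnull()
dependents_missing = toy["NumberOfDependents"].isnull()

# True when every row lacking NumberOfDependents also lacks MonthlyIncome.
dependents_subset_of_income = bool((dependents_missing & ~income_missing).sum() == 0)

# Number of rows where both fields are missing at once.
both_missing = int((income_missing & dependents_missing).sum())

# Fill both columns with 0 and assign the result to a new frame.
filled = toy.fillna({"MonthlyIncome": 0, "NumberOfDependents": 0})
```

Assigning the result of `fillna` avoids relying on `inplace=True` applied to a column selection, which newer pandas versions warn about.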

In [None]:
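# Assign the result back: inplace replacement on a column selection is not reliable in recent pandas versions
df_train['MonthlyIncome'] = df_train['MonthlyIncome'].fillna(0)
df_test['MonthlyIncome'] = df_test['MonthlyIncome'].fillna(0)
df_train['NumberOfDependents'] = df_train['NumberOfDependents'].fillna(0)
df_test['NumberOfDependents'] = df_test['NumberOfDependents'].fillna(0)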

## Imbalanced Dataset

In [None]:
df_train['SeriousDlqin2yrs'].value_counts()/len(df_train)

In [None]:
sns.countplot(x = 'SeriousDlqin2yrs', data = df_train)

## Note
- The target class (SeriousDlqin2yrs) is highly imbalanced (14 : 1)
- Because the dataset is biased towards one class (0), Precision, Recall, F1-score and AUC are the metrics used to evaluate our predictive models
- Resampling techniques such as SMOTE and Tomek Links will be employed to improve our model
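
To illustrate why accuracy is left out of the metric list, here is a small sketch with made-up labels at the same 14 : 1 ratio. A model that predicts "no default" for everyone looks accurate but never finds a defaulter.

```python
import numpy as np

# Made-up labels: 14 non-defaulters (0) for every defaulter (1).
y_true = np.array([0] * 14 + [1])

# A useless model that predicts "no default" for every borrower.
y_pred = np.zeros_like(y_true)

accuracy = float((y_true == y_pred).mean())

# Recall: share of actual defaulters that the model flags.
true_positives = int(((y_true == 1) & (y_pred == 1)).sum())
actual_positives = int((y_true == 1).sum())
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The accuracy is 14/15 while the recall is 0, which is why recall, precision, F1-score and AUC are more informative on this dataset.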

# EXPLORATORY DATA ANALYSIS
***

### REVOLVING CREDIT UTILIZATION RATIO

In [None]:
df_train['RevolvingUtilizationOfUnsecuredLines'].describe().to_frame().T

In [None]:
df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > df_train['RevolvingUtilizationOfUnsecuredLines'].quantile(0.99)]['RevolvingUtilizationOfUnsecuredLines'].describe()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(x = np.array(df_train['RevolvingUtilizationOfUnsecuredLines']), bins = 50, kde = True,
             ax = axes[0])
axes[0].set_title('Histogram Plot of RevolvingUtilizationOfUnsecuredLines')
sns.boxplot(x = df_train['RevolvingUtilizationOfUnsecuredLines'], ax = axes[1])
axes[1].set_title('Box Plot of RevolvingUtilizationOfUnsecuredLines')

Not much sense can be made of the plots because of the high level of skewness. The summary statistics show that the mean is about 40 times larger than the median, and the values change sharply beyond the 99th percentile. There are notable extreme outliers.

In [None]:
below_1 = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] < 1]['RevolvingUtilizationOfUnsecuredLines'].count()*100/len(df_train)
bet_1_10 = df_train[(df_train['RevolvingUtilizationOfUnsecuredLines'] > 1) &
        (df_train['RevolvingUtilizationOfUnsecuredLines'] < 10)]['RevolvingUtilizationOfUnsecuredLines'].count() * 100/len(df_train)
beyond_10 = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 10]['RevolvingUtilizationOfUnsecuredLines'].count()*100/len(df_train)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.boxplot(x = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] < 1]['RevolvingUtilizationOfUnsecuredLines'],
            ax = axes[0])
axes[0].set_title('{}% of Train_Dataset'.format(round(below_1, 0)))
sns.boxplot(x = df_train[(df_train['RevolvingUtilizationOfUnsecuredLines'] > 1) &
                        (df_train['RevolvingUtilizationOfUnsecuredLines'] < 10)]['RevolvingUtilizationOfUnsecuredLines'],
            ax = axes[1])
axes[1].set_title('{}% of Train_Dataset'.format(round(bet_1_10, 0)))

In [None]:
df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 10]['RevolvingUtilizationOfUnsecuredLines'].count()/len(df_train)*100, df_test[df_test['RevolvingUtilizationOfUnsecuredLines'] > 10]['RevolvingUtilizationOfUnsecuredLines'].count()/len(df_test)*100

### Note
Approximately 98% of the values of this variable are between 0 and 1, with a well-defined right-skewed distribution. Credit utilization is generally expected to be within this region (0 - 1), although Borrowers can sometimes spend beyond their credit limit. Values between 1 and 10 make up 2% of the dataset. Values beyond 10 are extremely large and make up less than 0.5% of our data; these rows are identified below as candidates for dropping so that they do not distort our model.
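
The range breakdown above can be wrapped in a small helper. This is a sketch only: the name `share_by_range` and the toy values are assumptions, and the bins here are closed on the right (a value of exactly 1 counts as "<= 1").

```python
import pandas as pd

def share_by_range(values, low=1, high=10):
    """Percentage of observations that are <= low, in (low, high], and > high."""
    values = pd.Series(values)
    return pd.Series({
        f"<= {low}": (values <= low).mean() * 100,
        f"{low} - {high}": ((values > low) & (values <= high)).mean() * 100,
        f"> {high}": (values > high).mean() * 100,
    })

# Made-up utilization ratios, including two extreme values.
toy = [0.0, 0.2, 0.5, 0.9, 1.0, 1.3, 4.0, 10.0, 250.0, 5000.0]
shares = share_by_range(toy)
print(shares)
```

Because the three conditions cover every non-missing value exactly once, the shares always add up to 100 when there are no missing values.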

In [None]:
to_drop_train = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 10].index.values
#to_drop_test = df_test[df_test['RevolvingUtilizationOfUnsecuredLines'] > 10].index.values

In [None]:
#df_train.drop(to_drop_train, axis = 0, inplace = True)
#df_test.drop(to_drop_test, axis = 0, inplace = True)

### DEBT RATIO
***

In [None]:
df_train['DebtRatio'].describe().to_frame().T

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(x = np.array(df_train['DebtRatio']), bins = 50, kde = True,
             ax = axes[0])
axes[0].set_title('Histogram Plot of Debt Ratio')
sns.boxplot(x = df_train['DebtRatio'], ax = axes[1])
axes[1].set_title('Box Plot of Debt Ratio')

In [None]:
pd.DataFrame({'below 1': df_train[df_train['DebtRatio'] <= 1]['DebtRatio'].count()*100/len(df_train),
             'between 1 - 10': df_train[(df_train['DebtRatio'] > 1) &
                                        (df_train['DebtRatio'] <=10)]['DebtRatio'].count()*100/len(df_train),
             'beyond 10': df_train[df_train['DebtRatio'] > 10]['DebtRatio'].count()*100/len(df_train)}, index = [1])

In [None]:
df_train[(df_train['DebtRatio'] > 1) & (df_train['DebtRatio'] <=10)]['DebtRatio'].describe().to_frame().T

In [None]:
df_train[df_train['DebtRatio'] > 10]['DebtRatio'].describe().to_frame().T

### Notes
* 76% of values in this variable are between 0 - 1
* 4% are between 1 - 10
* The remaining 20% have high values (median of 2166). These outliers are responsible for skewing the variable
* These outliers won't be discarded, as we established earlier that they are a special case of Borrowers

## AGE
***

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.boxplot(x= df_train['age'], ax = axes[0])
axes[0].set_title('Train_Dataset')
sns.boxplot(x= df_test['age'], ax = axes[1])
axes[1].set_title('Test_Dataset')

### Note
Age has a reasonable distribution. There is a suspicious number of centenarians, but that is plausible. The only certainly incorrect record is one person in the dataset with age 0. Because infants are not legally permitted to take out loans, we replace that value with 18 in the cell below.

In [None]:
df_train['age'] = df_train['age'].replace(0, 18)

## NUMBER OF OPEN CREDIT LINES
***

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train['NumberOfOpenCreditLinesAndLoans'], binwidth=1, ax = axes[0])
sns.histplot(x = df_test['NumberOfOpenCreditLinesAndLoans'], binwidth=1, ax = axes[1])

### Note
This variable is right-skewed with no extreme values. As further preprocessing, similar categories (fine classes) will be aggregated into coarse classes during WOE Feature Engineering and Data Preprocessing.

## NUMBER OF REAL ESTATE LOANS AND LINES
***

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train['NumberRealEstateLoansOrLines'], binwidth=1, ax = axes[0])
sns.histplot(x = df_test['NumberRealEstateLoansOrLines'], binwidth=1, ax = axes[1])

In [None]:
df_train['NumberRealEstateLoansOrLines'].value_counts()

### Note
This variable is highly skewed to the right. The majority of Borrowers have between 0 and 2 mortgage loans. As further preprocessing, similar categories (fine classes) will be aggregated into coarse classes during WOE Feature Engineering and Data Preprocessing.

## NUMBER OF DEPENDENTS
***

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train['NumberOfDependents'], binwidth=1, ax = axes[0])
sns.histplot(x = df_test['NumberOfDependents'], binwidth=1, ax = axes[1])

### Note
This variable is right-skewed. The majority of Borrowers have between 0 and 3 dependents. As further preprocessing, similar categories (fine classes) will be aggregated into coarse classes during WOE Feature Engineering and Data Preprocessing.

### NUMBER OF DAYS PAST DUE
***

In [None]:
# Rename each count Series directly, so the column labels do not depend on the pandas version
due_30_59 = df_train['NumberOfTime30-59DaysPastDueNotWorse'].value_counts().rename('30-59days')
due_60_89 = df_train['NumberOfTime60-89DaysPastDueNotWorse'].value_counts().rename('60-89days')
due_90 = df_train['NumberOfTimes90DaysLate'].value_counts().rename('90days')
pd.concat([due_30_59, due_60_89, due_90], axis = 1)

In [None]:
df_train[df_train['NumberOfTime30-59DaysPastDueNotWorse'] > 17][['NumberOfTime30-59DaysPastDueNotWorse',
                                                                'NumberOfTime60-89DaysPastDueNotWorse',
                                                                'NumberOfTimes90DaysLate']]

In [None]:
df_train[df_train['NumberOfTime30-59DaysPastDueNotWorse'] > 17]['SeriousDlqin2yrs'].mean()*100

### Note
These features have similar distributions. There are two unusual values (98 and 96): it is impossible for a borrower to be delinquent 98 or 96 times in the space of 2 years. It can also be observed that these values share the same index across the three features, which might indicate a data entry error. However, they can't be dropped because of the information they carry for identifying defaulting Borrowers: 55% of Borrowers in this category defaulted, compared with a 6% global default rate. It's best to keep them and assign a separate class to these values.
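
One way to keep such values while treating them as their own class is an indicator column. The sketch below uses made-up rows; the column name `PastDue30-59:sentinel` and the constant `SENTINEL_CODES` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

SENTINEL_CODES = [96, 98]  # values the note above treats as a special class

# Made-up rows: three of them carry a sentinel code.
toy = pd.DataFrame({
    "NumberOfTime30-59DaysPastDueNotWorse": [0, 1, 98, 2, 96, 0, 98, 0],
    "SeriousDlqin2yrs": [0, 0, 1, 0, 1, 0, 0, 0],
})

is_sentinel = toy["NumberOfTime30-59DaysPastDueNotWorse"].isin(SENTINEL_CODES)
toy["PastDue30-59:sentinel"] = is_sentinel.astype(int)

# Default rate (in %) for sentinel rows versus all rows.
sentinel_rate = toy.loc[is_sentinel, "SeriousDlqin2yrs"].mean() * 100
overall_rate = toy["SeriousDlqin2yrs"].mean() * 100
```

Comparing `sentinel_rate` with `overall_rate` mirrors the 55% versus 6% comparison made in the note.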

## BASELINE MODELS 
***

In [None]:
#import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
#ROC curve function
# The false positive rate is plotted on the x-axis and the true positive rate on the y-axis.
# The dashed diagonal is the reference line of a random classifier.
def plot_roc(y_valid, y_pred_proba):
    fpr, tpr, thresholds = roc_curve(y_valid, y_pred_proba)
    plt.plot(fpr, tpr)
    plt.plot(fpr, fpr, linestyle = '--', color = 'k')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')

In [None]:
df_train.reset_index(drop = True, inplace = True)
#df_test.reset_index(drop = True, inplace = True)

In [None]:
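# .copy() so that the dummy columns added later do not trigger chained-assignment warnings
df_train_inputs = df_train.loc[:, df_train.columns.values[1:]].copy()
df_test_inputs = df_test.loc[:, df_train.columns.values[1:]].copy()
df_train_target = df_train.loc[:, df_train.columns.values[0]].to_frame()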

In [None]:
#stratified split (the target is flattened to 1-D, which is the shape scikit-learn expects for y)
X_train, X_valid, y_train, y_valid = train_test_split(np.array(df_train_inputs), np.array(df_train_target).ravel(),
                                                      test_size = 0.2, random_state = 42, stratify = np.array(df_train_target).ravel())

In [None]:
#logistic regression object
lr = LogisticRegression(max_iter=300, solver = 'liblinear')

In [None]:
#fit logistic regression
lr.fit(X_train, y_train)

In [None]:
#predictions 1 or 0
y_pred = lr.predict(X_valid)

In [None]:
#predicted probabilities of the positive class (column 1)
y_pred_proba = lr.predict_proba(X_valid)[:, 1]

In [None]:
#confusion matrix
cm = metrics.confusion_matrix(y_valid, y_pred)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square =True, cmap = 'Blues_r');
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Confusion Matrix'
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
#classification report: recall, precision, f1-score, accuracy
print(classification_report(y_valid, y_pred))

In [None]:
#ROC_curve
plot_roc(y_valid, y_pred_proba)

In [None]:
#AUC score
roc_auc_score(y_valid, y_pred_proba)

In [None]:
y_proba_base = lr.predict_proba(np.array(df_test_inputs))
lr_baseline_model = pd.DataFrame({'Id': df_test.index.values,
                                 'Probability': y_proba_base[:, 1]})
lr_baseline_model.set_index(keys = 'Id', inplace = True)
lr_baseline_model

### Note
Our baseline model achieves an AUC score of 0.8014. This isn't skillful enough; we need to improve it.
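
For intuition about what the AUC score measures, it can be read as the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher predicted probability. The sketch below computes it that way on made-up scores; `pairwise_auc` is an illustrative helper, not the scikit-learn implementation, and it compares all pairs at once, so it is only meant for small inputs.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
auc_value = pairwise_auc(y_true, y_score)
print(auc_value)
```

Here 7 of the 8 pairs are ordered correctly, so the value is 0.875. A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.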

## WEIGHT OF EVIDENCE
***
### Background
This method is commonly used alongside Logistic Regression for modelling Probability of Default. WOE assesses the amount of information each attribute (category) of an independent variable carries for predicting the class of the target variable. Mathematically, it is the natural log of the ratio between the share of all non-defaulting customers that fall in a category and the share of all defaulting customers that fall in it.
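
As a worked example of this definition, the sketch below computes WOE by hand for three made-up categories. The counts and category names are invented for illustration only.

```python
import math

# Made-up counts of good (non-defaulting) and bad (defaulting) borrowers per category.
counts = {
    "0 times late": {"good": 900, "bad": 30},
    "1 time late": {"good": 80, "bad": 40},
    "2+ times late": {"good": 20, "bad": 30},
}

total_good = sum(c["good"] for c in counts.values())
total_bad = sum(c["bad"] for c in counts.values())

woe = {}
for category, c in counts.items():
    share_good = c["good"] / total_good  # share of all good borrowers in this category
    share_bad = c["bad"] / total_bad     # share of all bad borrowers in this category
    woe[category] = math.log(share_good / share_bad)

for category, value in woe.items():
    print(f"{category}: WoE = {value:.3f}")
```

A positive WOE means the category holds relatively more good borrowers than bad ones, and a negative WOE means the opposite. A category with no bad (or no good) borrowers would make the ratio undefined, so such categories need to be merged or handled separately.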

### Steps
* Fine Classing: All continuous variables are binned into several categories based on their distribution. Any variable with more than 50 unique values is considered continuous. Other numerical variables with fewer than 50 unique values have each value as a separate category
* Coarse Classing: Categories with similar WOE values are binned together. The percentage of observations in each category also influences coarse classing
* A dummy variable is created for each coarse class
* Each variable has a reference category, which gets no dummy variable, to avoid the dummy variable trap
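
The fine classing rule in the first step can be sketched as follows. The helper `is_continuous`, the choice of 10 equal-width bins and the toy series are assumptions for illustration; the notebook itself chooses bins per variable.

```python
import numpy as np
import pandas as pd

def is_continuous(series, max_unique=50):
    """Rule from the steps above: more than 50 unique values means continuous."""
    return series.nunique() > max_unique

# Deterministic toy data: 70 distinct ages and 5 distinct dependent counts.
age = pd.Series(np.tile(np.arange(21, 91), 5), name="age")
dependents = pd.Series(np.tile(np.arange(0, 5), 70), name="NumberOfDependents")

# Fine classing: a continuous variable is cut into bins,
# a discrete one keeps each value as its own category.
fine_age = pd.cut(age, bins=10) if is_continuous(age) else age
fine_dependents = pd.cut(dependents, bins=10) if is_continuous(dependents) else dependents

print(fine_age.value_counts().sort_index())
```

Coarse classing then merges fine classes with similar WOE, which is done by hand in the sections that follow.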

### Information Value and P-Value
Information Value shows the strength of a variable in predicting the target class. It is the sum, over the categories of a variable, of the product of WOE and the difference between the proportion of good customers and the proportion of bad customers. The P-value assesses the statistical significance of each variable, alongside the other variables, in predicting the target class. We are going to use P-values to select statistically significant variables.
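
A minimal sketch of the Information Value calculation on made-up counts is given below. The column names follow the ones used by the WOE functions in this notebook; `IV_part` and `information_value` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up counts of good and bad borrowers per category.
table = pd.DataFrame(
    {"n_good": [900, 80, 20], "n_bad": [30, 40, 30]},
    index=["0 times late", "1 time late", "2+ times late"],
)

table["prop_n_good"] = table["n_good"] / table["n_good"].sum()
table["prop_n_bad"] = table["n_bad"] / table["n_bad"].sum()
table["WoE"] = np.log(table["prop_n_good"] / table["prop_n_bad"])

# Contribution of each category, then the sum over categories.
table["IV_part"] = (table["prop_n_good"] - table["prop_n_bad"]) * table["WoE"]
information_value = float(table["IV_part"].sum())

print(table[["WoE", "IV_part"]])
print("Information Value:", round(information_value, 4))
```

Each contribution is non-negative, because the difference in proportions and the WOE always share the same sign.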


## FEATURE ENGINEERING AND DATA PREPROCESSING
***


In [None]:
def woe_discrete(df, discrete_variable_name, good_bad_variable_df):
    # Builds a per-category table (counts, bad rate, WoE, IV contribution) sorted by WoE.
    df = pd.concat([df[discrete_variable_name], good_bad_variable_df], axis = 1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_bad']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_bad'] = df['prop_bad'] * df['n_obs']
    df['n_good'] = (1 - df['prop_bad']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
    df = df.sort_values(['WoE'])
    df = df.reset_index(drop = True)
    df['diff_prop_good'] = (1 - df['prop_bad']).diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    #df['IV'] = df['IV'].replace([np.inf, -np.inf], np.nan).sum()
    return df

In [None]:
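def woe_continuous(df, continuous_variable_name, good_bad_variable_df):
    # Same table as woe_discrete, but the bins keep their natural order instead of being sorted by WoE.
    df = pd.concat([df[continuous_variable_name], good_bad_variable_df], axis = 1)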
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_bad']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_bad'] = df['prop_bad'] * df['n_obs']
    df['n_good'] = (1 - df['prop_bad']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
    #df = df.sort_values(['WoE'])
    #df = df.reset_index(drop = True)
    df['diff_prop_good'] = (1 - df['prop_bad']).diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    #df['IV'] = df['IV'].replace([np.inf, -np.inf], np.nan).sum()
    return df

In [None]:
# Plots the Weight of Evidence of each category of a variable.
# df_WoE: table with the categories in the column with index 0 and a 'WoE' column.
# rotation_of_x_axis_labels: rotation of the x-axis labels in degrees (default 0).
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
    # Category labels are converted to strings so they are plotted in table order.
    x = np.array(df_WoE.iloc[:, 0].apply(str))
    y = df_WoE['WoE']
    plt.figure(figsize=(18, 6))
    # Black dashed line with a circle marker for each category.
    plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
    plt.xlabel(df_WoE.columns[0])
    plt.ylabel('Weight of Evidence')
    plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
    plt.xticks(rotation = rotation_of_x_axis_labels)

### FINE CLASSING AND COARSE CLASSING NUMBER OF DAYS PAST DUE 30 - 59

In [None]:
df_temp = woe_discrete(df_train_inputs, 'NumberOfTime30-59DaysPastDueNotWorse', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp)

In [None]:
df_train_inputs['PastDue30-59:11-13-96-10'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([11,13,96,10]), 1, 0)
df_train_inputs['PastDue30-59:98-6-7-12'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([98,6,7,12]), 1, 0)
df_train_inputs['PastDue30-59:5-4'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([5,4]), 1, 0)
df_train_inputs['PastDue30-59:3'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([3]), 1, 0)
df_train_inputs['PastDue30-59:9-8'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([9,8]), 1, 0)
df_train_inputs['PastDue30-59:2'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([2]), 1, 0)
df_train_inputs['PastDue30-59:1'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([1]), 1, 0)
# Reference category (0 times past due) gets no dummy column:
#df_train_inputs['PastDue30-59:0'] = np.where(df_train_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([0]), 1, 0)

In [None]:
df_test_inputs['PastDue30-59:11-13-96-10'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([11,13,96,10]), 1, 0)
df_test_inputs['PastDue30-59:98-6-7-12'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([98,6,7,12]), 1, 0)
df_test_inputs['PastDue30-59:5-4'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([5,4]), 1, 0)
df_test_inputs['PastDue30-59:3'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([3]), 1, 0)
df_test_inputs['PastDue30-59:9-8'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([9,8]), 1, 0)
df_test_inputs['PastDue30-59:2'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([2]), 1, 0)
df_test_inputs['PastDue30-59:1'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([1]), 1, 0)
# Reference category (0 times past due) gets no dummy column:
#df_test_inputs['PastDue30-59:0'] = np.where(df_test_inputs['NumberOfTime30-59DaysPastDueNotWorse'].isin([0]), 1, 0)
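
The train and test cells above repeat the same grouping line by line, which makes it easy for the two to drift apart. A possible alternative is sketched here: the helper `add_coarse_dummies` and the grouping `past_due_groups` are hypothetical, and the toy frames stand in for `df_train_inputs` and `df_test_inputs`.

```python
import numpy as np
import pandas as pd

def add_coarse_dummies(df, column, prefix, groups):
    """Add one 0/1 column per coarse class; `groups` maps a label to its member values."""
    for label, members in groups.items():
        df[f"{prefix}:{label}"] = np.where(df[column].isin(members), 1, 0)
    return df

# Hypothetical grouping for illustration; 0 is left out as the reference category.
past_due_groups = {"1": [1], "2": [2], "3+": [3, 4, 5]}

toy_train = pd.DataFrame({"NumberOfTime30-59DaysPastDueNotWorse": [0, 1, 2, 5, 0]})
toy_test = pd.DataFrame({"NumberOfTime30-59DaysPastDueNotWorse": [3, 0, 1]})

# The same mapping is applied to both frames, so their columns always match.
for frame in (toy_train, toy_test):
    add_coarse_dummies(frame, "NumberOfTime30-59DaysPastDueNotWorse", "PastDue30-59", past_due_groups)
```

Keeping the grouping in one dictionary means a change to a coarse class only has to be made once.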

### FINE CLASSING AND COARSE CLASSING NUMBER OF DAYS PAST DUE 60 - 89

In [None]:
df_temp = woe_discrete(df_train_inputs, 'NumberOfTime60-89DaysPastDueNotWorse', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp)

In [None]:
df_train_inputs['PastDue60-89:11-96-6-9'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([11,96,6,9]), 1, 0)
df_train_inputs['PastDue60-89:4-5'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([4,5]), 1, 0)
df_train_inputs['PastDue60-89:3-98'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([3, 98]), 1, 0)
df_train_inputs['PastDue60-89:7-8'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([7, 8]), 1, 0)
df_train_inputs['PastDue60-89:2'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([2]), 1, 0)
df_train_inputs['PastDue60-89:1'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([1]), 1, 0)
#df_train_inputs['PastDue60-89:0'] = np.where(df_train_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([0]), 1, 0)

In [None]:
df_test_inputs['PastDue60-89:11-96-6-9'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([11,96,6,9]), 1, 0)
df_test_inputs['PastDue60-89:4-5'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([4,5]), 1, 0)
df_test_inputs['PastDue60-89:3-98'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([3,98]), 1, 0)
df_test_inputs['PastDue60-89:7-8'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([7,8]), 1, 0)
df_test_inputs['PastDue60-89:2'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([2]), 1, 0)
df_test_inputs['PastDue60-89:1'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([1]), 1, 0)
# Reference category (0 times past due) gets no dummy column:
#df_test_inputs['PastDue60-89:0'] = np.where(df_test_inputs['NumberOfTime60-89DaysPastDueNotWorse'].isin([0]), 1, 0)

### FINE CLASSING AND COARSE CLASSING NUMBER OF DAYS PAST DUE 90

In [None]:
df_temp = woe_discrete(df_train_inputs, 'NumberOfTimes90DaysLate', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp)

In [None]:
df_train_inputs['PastDue90:9-96-7-17-15-8'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([9,96,7,17,15,8]), 1, 0)
df_train_inputs['PastDue90:4-5'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([4,5]), 1, 0)
df_train_inputs['PastDue90:6-10-11'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([6,10,11]), 1, 0)
df_train_inputs['PastDue90:3-98'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([3,98]), 1, 0)
df_train_inputs['PastDue90:12-13-14'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([12,13,14]), 1, 0)
df_train_inputs['PastDue90:2'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([2]), 1, 0)
df_train_inputs['PastDue90:1'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([1]), 1, 0)
#df_train_inputs['PastDue90:0'] = np.where(df_train_inputs['NumberOfTimes90DaysLate'].isin([0]), 1, 0)

In [None]:
df_test_inputs['PastDue90:9-96-7-17-15-8'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([9,96,7,17,15,8]), 1, 0)
df_test_inputs['PastDue90:4-5'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([4,5]), 1, 0)
df_test_inputs['PastDue90:6-10-11'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([6,10,11]), 1, 0)
df_test_inputs['PastDue90:3-98'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([3,98]), 1, 0)
df_test_inputs['PastDue90:12-13-14'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([12,13,14]), 1, 0)
df_test_inputs['PastDue90:2'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([2]), 1, 0)
df_test_inputs['PastDue90:1'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([1]), 1, 0)
# Reference category (0 times late) gets no dummy column:
#df_test_inputs['PastDue90:0'] = np.where(df_test_inputs['NumberOfTimes90DaysLate'].isin([0]), 1, 0)

## NUMBER OF DEPENDENTS

In [None]:
df_temp = woe_discrete(df_train_inputs, 'NumberOfDependents', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp)

In [None]:
df_train_inputs['NumberOfDependents:>9'] = np.where(df_train_inputs['NumberOfDependents'].isin([9,10,13,20]), 1, 0)
df_train_inputs['NumberOfDependents:6'] = np.where(df_train_inputs['NumberOfDependents'].isin([6]), 1, 0)
df_train_inputs['NumberOfDependents:4'] = np.where(df_train_inputs['NumberOfDependents'].isin([4]), 1, 0)
df_train_inputs['NumberOfDependents:1'] = np.where(df_train_inputs['NumberOfDependents'].isin([1]), 1, 0)
df_train_inputs['NumberOfDependents:2'] = np.where(df_train_inputs['NumberOfDependents'].isin([2]), 1, 0)
df_train_inputs['NumberOfDependents:7'] = np.where(df_train_inputs['NumberOfDependents'].isin([7]), 1, 0)
df_train_inputs['NumberOfDependents:5'] = np.where(df_train_inputs['NumberOfDependents'].isin([5]), 1, 0)
df_train_inputs['NumberOfDependents:3'] = np.where(df_train_inputs['NumberOfDependents'].isin([3]), 1, 0)
df_train_inputs['NumberOfDependents:8'] = np.where(df_train_inputs['NumberOfDependents'].isin([8]), 1, 0)
#df_train_inputs['NumberOfDependents:0'] = np.where(df_train_inputs['NumberOfDependents'].isin([0]), 1, 0)

In [None]:
df_test_inputs['NumberOfDependents:>9'] = np.where(df_test_inputs['NumberOfDependents'].isin([9,10,13,20]), 1, 0)
df_test_inputs['NumberOfDependents:6'] = np.where(df_test_inputs['NumberOfDependents'].isin([6]), 1, 0)
df_test_inputs['NumberOfDependents:4'] = np.where(df_test_inputs['NumberOfDependents'].isin([4]), 1, 0)
df_test_inputs['NumberOfDependents:1'] = np.where(df_test_inputs['NumberOfDependents'].isin([1]), 1, 0)
df_test_inputs['NumberOfDependents:2'] = np.where(df_test_inputs['NumberOfDependents'].isin([2]), 1, 0)
df_test_inputs['NumberOfDependents:7'] = np.where(df_test_inputs['NumberOfDependents'].isin([7]), 1, 0)
df_test_inputs['NumberOfDependents:5'] = np.where(df_test_inputs['NumberOfDependents'].isin([5]), 1, 0)
df_test_inputs['NumberOfDependents:3'] = np.where(df_test_inputs['NumberOfDependents'].isin([3]), 1, 0)
df_test_inputs['NumberOfDependents:8'] = np.where(df_test_inputs['NumberOfDependents'].isin([8]), 1, 0)
# Reference category (0 dependents) gets no dummy column:
#df_test_inputs['NumberOfDependents:0'] = np.where(df_test_inputs['NumberOfDependents'].isin([0]), 1, 0)

## MONTHLY INCOME

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18,6))
sns.histplot(x = df_train[df_train['MonthlyIncome'] < 1000]['MonthlyIncome'], ax = axes[0,0])
sns.histplot(x = df_train[(df_train['MonthlyIncome'] > 1000) & 
                         (df_train['MonthlyIncome'] <= 10000)]['MonthlyIncome'], ax = axes[0,1])
sns.histplot(x = df_train[(df_train['MonthlyIncome'] > 10000) & 
                         (df_train['MonthlyIncome'] <= 20000)]['MonthlyIncome'], ax = axes[1,0])
sns.histplot(x = df_train[(df_train['MonthlyIncome'] > 20000) & 
                         (df_train['MonthlyIncome'] <= 50000)]['MonthlyIncome'], ax = axes[1,1])

In [None]:
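# closed='both' so that an income of exactly 0 (including the imputed zeros) falls inside the first bin
bins = pd.IntervalIndex.from_tuples([(0, 1000)], closed='both')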
bins3 = pd.IntervalIndex.from_tuples([(10000, 12000), (12000, 14000), (14000, 16000), (16000, 20000)])
bins4 = pd.IntervalIndex.from_tuples([(20000, 30000), (30000, 50000)])
box1 = pd.cut(df_train[df_train['MonthlyIncome'] <= 1000]['MonthlyIncome'], bins)
box2 = pd.qcut(df_train[(df_train['MonthlyIncome'] > 1000) & 
                         (df_train['MonthlyIncome'] <= 10000)]['MonthlyIncome'], 4)
box3 = pd.cut(df_train[(df_train['MonthlyIncome'] > 10000) & 
                         (df_train['MonthlyIncome'] <= 20000)]['MonthlyIncome'], bins3)
box4 = pd.cut(df_train[(df_train['MonthlyIncome'] > 20000) & 
                         (df_train['MonthlyIncome'] <= 50000)]['MonthlyIncome'], bins4)
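
Interval edges matter when cutting with a `pd.IntervalIndex`. Intervals built by `from_tuples` are closed on the right by default, so a value equal to the left edge is not assigned to the bin. The sketch below shows the difference on three made-up incomes; the variable names are illustrative.

```python
import pandas as pd

incomes = pd.Series([0, 500, 1000])

right_closed = pd.IntervalIndex.from_tuples([(0, 1000)])                # (0, 1000]
both_closed = pd.IntervalIndex.from_tuples([(0, 1000)], closed="both")  # [0, 1000]

cut_right = pd.cut(incomes, right_closed)
cut_both = pd.cut(incomes, both_closed)

# Number of values that fall outside the bin and therefore become NaN.
n_unbinned_right = int(cut_right.isnull().sum())
n_unbinned_both = int(cut_both.isnull().sum())

print(n_unbinned_right, n_unbinned_both)
```

With the right-closed bin the income of 0 is left unassigned, which matters here because missing incomes were replaced with 0 earlier.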

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18,6))
sns.histplot(x = df_train[(df_train['MonthlyIncome'] > 50000) & 
                         (df_train['MonthlyIncome'] <= 100000)]['MonthlyIncome'], ax = axes[0,0])
sns.histplot(x = df_train[(df_train['MonthlyIncome'] > 100000) & 
                         (df_train['MonthlyIncome'] <= 200000)]['MonthlyIncome'], ax = axes[0,1])
sns.histplot(x = df_train[(df_train['MonthlyIncome'] > 200000) & 
                         (df_train['MonthlyIncome'] <= 500000)]['MonthlyIncome'], ax = axes[1,0])
sns.histplot(x = df_train[df_train['MonthlyIncome'] > 500000]['MonthlyIncome'], ax = axes[1,1])

In [None]:
bins5 = pd.IntervalIndex.from_tuples([(50000, 70000), (70000,100000), (100000, 140000), (140000, 200000), (200000, 500000),
                                     (500000, 3500000)])
box5 = pd.cut(df_train[df_train['MonthlyIncome'] > 50000]['MonthlyIncome'], bins5)

In [None]:
df_train_inputs['MonthlyIncome_x'] = df_train_inputs['MonthlyIncome'].values

In [None]:
df_train_inputs.loc[box1.index.values, 'MonthlyIncome_x'] = box1.values
df_train_inputs.loc[box2.index.values, 'MonthlyIncome_x'] = box2.values
df_train_inputs.loc[box3.index.values, 'MonthlyIncome_x'] = box3.values
df_train_inputs.loc[box4.index.values, 'MonthlyIncome_x'] = box4.values
df_train_inputs.loc[box5.index.values, 'MonthlyIncome_x'] = box5.values

In [None]:
df_temp = woe_continuous(df_train_inputs, 'MonthlyIncome_x', df_train_target)
df_temp

In [None]:
df_train_inputs['MonthlyIncome:0-200'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(0, 200)), 1, 0)
df_train_inputs['MonthlyIncome:200-1000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(200, 1000)), 1, 0)
df_train_inputs['MonthlyIncome:1000-3500'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(1000, 3500)), 1, 0)
df_train_inputs['MonthlyIncome:3500-5000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(3500, 5000)), 1, 0)
#df_train_inputs['MonthlyIncome:5000-6850'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(5000, 6850)), 1, 0)
df_train_inputs['MonthlyIncome:6850-10000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(6850, 10000)), 1, 0)
df_train_inputs['MonthlyIncome:10000-12000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(10000, 12000)), 1, 0)
df_train_inputs['MonthlyIncome:12000-16000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(12000, 16000)), 1, 0)
df_train_inputs['MonthlyIncome:16000-30000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(16000, 30000)), 1, 0)
df_train_inputs['MonthlyIncome:30000-70000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(30000, 70000)), 1, 0)
df_train_inputs['MonthlyIncome:70000-100000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(70000, 100000)), 1, 0)
df_train_inputs['MonthlyIncome:100000-140000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(100000, 140000)), 1, 0)
df_train_inputs['MonthlyIncome:140000-500000'] = np.where(df_train_inputs['MonthlyIncome'].isin(range(140000, 500000)), 1, 0)
# >= comparison so the maximum income is included (range() excludes its upper bound)
df_train_inputs['MonthlyIncome:>500000'] = np.where(df_train_inputs['MonthlyIncome'] >= 500000, 1, 0)

In [None]:
df_test_inputs['MonthlyIncome:0-200'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(0, 200)), 1, 0)
df_test_inputs['MonthlyIncome:200-1000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(200, 1000)), 1, 0)
df_test_inputs['MonthlyIncome:1000-3500'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(1000, 3500)), 1, 0)
df_test_inputs['MonthlyIncome:3500-5000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(3500, 5000)), 1, 0)
#df_test_inputs['MonthlyIncome:5000-6850'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(5000, 6850)), 1, 0)
df_test_inputs['MonthlyIncome:6850-10000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(6850, 10000)), 1, 0)
df_test_inputs['MonthlyIncome:10000-12000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(10000, 12000)), 1, 0)
df_test_inputs['MonthlyIncome:12000-16000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(12000, 16000)), 1, 0)
df_test_inputs['MonthlyIncome:16000-30000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(16000, 30000)), 1, 0)
df_test_inputs['MonthlyIncome:30000-70000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(30000, 70000)), 1, 0)
df_test_inputs['MonthlyIncome:70000-100000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(70000, 100000)), 1, 0)
df_test_inputs['MonthlyIncome:100000-140000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(100000, 140000)), 1, 0)
df_test_inputs['MonthlyIncome:140000-500000'] = np.where(df_test_inputs['MonthlyIncome'].isin(range(140000, 500000)), 1, 0)
# >= comparison so the maximum income is included (range() excludes its upper bound)
df_test_inputs['MonthlyIncome:>500000'] = np.where(df_test_inputs['MonthlyIncome'] >= 500000, 1, 0)
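
The `isin(range(...))` pattern above only matches whole-number incomes, so a fractional value (for example, one produced by mean imputation) would fall into no bin. A minimal sketch of an alternative, assuming half-open intervals `[low, high)`; the helper name `interval_dummy` and the toy frame are illustrative and not part of the notebook:

```python
import pandas as pd

def interval_dummy(series, low=None, high=None):
    """Return 1 where low <= value < high, else 0 (hypothetical helper).

    Passing None for a bound leaves that side open.
    """
    mask = pd.Series(True, index=series.index)
    if low is not None:
        mask &= series >= low
    if high is not None:
        mask &= series < high
    return mask.astype(int)

# Toy data standing in for df_train_inputs['MonthlyIncome']
toy = pd.DataFrame({'MonthlyIncome': [150.0, 200.0, 999.5, 1000.0, 600000.0]})
toy['MonthlyIncome:0-200'] = interval_dummy(toy['MonthlyIncome'], 0, 200)
toy['MonthlyIncome:200-1000'] = interval_dummy(toy['MonthlyIncome'], 200, 1000)
toy['MonthlyIncome:>500000'] = interval_dummy(toy['MonthlyIncome'], low=500000)
print(toy)
```

The same helper could be applied to the train and test frames so that both are always binned with identical boundaries.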

## DEBT RATIO

In [None]:
#Train dataset histograms
fig, axes = plt.subplots(2, 2, figsize=(18,6))
sns.histplot(x = df_train[df_train['DebtRatio'] < 1]['DebtRatio'],
            ax = axes[0,0])
sns.histplot(x = df_train[(df_train['DebtRatio'] > 1) & 
                        (df_train['DebtRatio'] <= 10)]['DebtRatio'],
            ax = axes[0,1])
sns.histplot(x = df_train[(df_train['DebtRatio'] > 10) & 
                        (df_train['DebtRatio'] <= 100)]['DebtRatio'],
            ax = axes[1,0])
sns.histplot(x = df_train[(df_train['DebtRatio'] > 100) & 
                        (df_train['DebtRatio'] <= 1000)]['DebtRatio'],
            ax = axes[1,1])

In [None]:
sns.histplot(x = df_train[(df_train['DebtRatio'] > 1000) & 
                        (df_train['DebtRatio'] <= 10000)]['DebtRatio'])

In [None]:
bins = pd.IntervalIndex.from_tuples([(1, 10), (10, 100), (100, 1000), (1000, int(df_train_inputs['DebtRatio'].max()))])
box1 = pd.qcut(df_train[df_train['DebtRatio'] <= 1]['DebtRatio'], 10)
box2 = pd.cut(df_train[df_train['DebtRatio'] > 1]['DebtRatio'], bins)

In [None]:
df_train_inputs['DebtRatio_x'] = df_train_inputs['DebtRatio'].values

In [None]:
df_train_inputs.loc[box1.index.values, 'DebtRatio_x'] = box1.values
df_train_inputs.loc[box2.index.values, 'DebtRatio_x'] = box2.values

In [None]:
df_temp = woe_continuous(df_train_inputs, 'DebtRatio_x', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp, 90)
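
`woe_continuous` is defined outside this section and its exact implementation is not repeated here. As a reminder of what a Weight of Evidence table typically contains, here is a minimal sketch assuming the target is coded 1 = default and using the common convention WoE = ln(share of goods / share of bads). The function name `woe_table` and the toy data are illustrative only:

```python
import numpy as np
import pandas as pd

def woe_table(binned, target):
    """Per-bin WoE and total Information Value for a binary target (1 = default)."""
    df = pd.DataFrame({'bin': binned, 'bad': target})
    grouped = df.groupby('bin', observed=True)['bad'].agg(n_obs='count', n_bad='sum')
    grouped['n_good'] = grouped['n_obs'] - grouped['n_bad']
    # Share of all goods / all bads that fall into each bin
    grouped['dist_good'] = grouped['n_good'] / grouped['n_good'].sum()
    grouped['dist_bad'] = grouped['n_bad'] / grouped['n_bad'].sum()
    grouped['WoE'] = np.log(grouped['dist_good'] / grouped['dist_bad'])
    iv = ((grouped['dist_good'] - grouped['dist_bad']) * grouped['WoE']).sum()
    return grouped, iv

# Toy example: two bins, four observations each
toy_bins = ['low', 'low', 'low', 'low', 'high', 'high', 'high', 'high']
toy_target = [1, 1, 1, 0, 0, 0, 0, 1]
table, iv = woe_table(toy_bins, toy_target)
print(table[['n_obs', 'n_good', 'n_bad', 'WoE']])
print('IV:', iv)
```

Bins with no goods or no bads produce infinite WoE values in this sketch, which is one reason sparse fine classes are usually merged before coarse classing.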

In [None]:
# Half-open bins [low, high); comparisons avoid the float-equality mismatches of isin(np.arange(...))
dr_train = df_train_inputs['DebtRatio']
df_train_inputs['DebtRatio:<0.0129'] = np.where(dr_train < 0.0129, 1, 0)
#df_train_inputs['DebtRatio:<0.159'] = np.where((dr_train >= 0.0129) & (dr_train < 0.159), 1, 0)
df_train_inputs['DebtRatio:<0.218'] = np.where((dr_train >= 0.159) & (dr_train < 0.218), 1, 0)
df_train_inputs['DebtRatio:<0.333'] = np.where((dr_train >= 0.218) & (dr_train < 0.333), 1, 0)
df_train_inputs['DebtRatio:<0.483'] = np.where((dr_train >= 0.333) & (dr_train < 0.483), 1, 0)
df_train_inputs['DebtRatio:<0.621'] = np.where((dr_train >= 0.483) & (dr_train < 0.621), 1, 0)
df_train_inputs['DebtRatio:<1'] = np.where((dr_train >= 0.621) & (dr_train < 1), 1, 0)
df_train_inputs['DebtRatio:<10'] = np.where((dr_train >= 1) & (dr_train < 10), 1, 0)
df_train_inputs['DebtRatio:<100'] = np.where((dr_train >= 10) & (dr_train < 100), 1, 0)
df_train_inputs['DebtRatio:<1000'] = np.where((dr_train >= 100) & (dr_train < 1000), 1, 0)
df_train_inputs['DebtRatio:>1000'] = np.where(dr_train >= 1000, 1, 0)

In [None]:
# Same half-open bins as the training set
dr_test = df_test_inputs['DebtRatio']
df_test_inputs['DebtRatio:<0.0129'] = np.where(dr_test < 0.0129, 1, 0)
#df_test_inputs['DebtRatio:<0.159'] = np.where((dr_test >= 0.0129) & (dr_test < 0.159), 1, 0)
df_test_inputs['DebtRatio:<0.218'] = np.where((dr_test >= 0.159) & (dr_test < 0.218), 1, 0)
df_test_inputs['DebtRatio:<0.333'] = np.where((dr_test >= 0.218) & (dr_test < 0.333), 1, 0)
df_test_inputs['DebtRatio:<0.483'] = np.where((dr_test >= 0.333) & (dr_test < 0.483), 1, 0)
df_test_inputs['DebtRatio:<0.621'] = np.where((dr_test >= 0.483) & (dr_test < 0.621), 1, 0)
df_test_inputs['DebtRatio:<1'] = np.where((dr_test >= 0.621) & (dr_test < 1), 1, 0)
df_test_inputs['DebtRatio:<10'] = np.where((dr_test >= 1) & (dr_test < 10), 1, 0)
df_test_inputs['DebtRatio:<100'] = np.where((dr_test >= 10) & (dr_test < 100), 1, 0)
df_test_inputs['DebtRatio:<1000'] = np.where((dr_test >= 100) & (dr_test < 1000), 1, 0)
df_test_inputs['DebtRatio:>1000'] = np.where(dr_test >= 1000, 1, 0)

## CREDIT UTILIZATION RATIO

In [None]:
#Train dataset histograms
fig, axes = plt.subplots(1, 2, figsize=(18,6))
sns.histplot(x = df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] < 1]['RevolvingUtilizationOfUnsecuredLines'],
            ax = axes[0])
sns.histplot(x = df_train[(df_train['RevolvingUtilizationOfUnsecuredLines'] > 1) & 
                        (df_train['RevolvingUtilizationOfUnsecuredLines'] <= 10)]['RevolvingUtilizationOfUnsecuredLines'],
            ax = axes[1])

In [None]:
#bins = pd.IntervalIndex.from_tuples([(1, 10), (10, 100), (100, 1000), (1000, int(df_train_inputs['RevolvingUtilizationOfUnsecuredLines'].max()))])
box1 = pd.cut(df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] <= 1]['RevolvingUtilizationOfUnsecuredLines'], 50)
box2 = pd.cut(df_train[df_train['RevolvingUtilizationOfUnsecuredLines'] > 1]['RevolvingUtilizationOfUnsecuredLines'], 10)

In [None]:
df_train_inputs['RevolvingUtilizationOfUnsecuredLines_x'] = df_train_inputs['RevolvingUtilizationOfUnsecuredLines'].values

In [None]:
df_train_inputs.loc[box1.index.values, 'RevolvingUtilizationOfUnsecuredLines_x'] = box1.values
df_train_inputs.loc[box2.index.values, 'RevolvingUtilizationOfUnsecuredLines_x'] = box2.values

In [None]:
df_temp = woe_continuous(df_train_inputs, 'RevolvingUtilizationOfUnsecuredLines_x', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp, 90)

In [None]:
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:<0.0004'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] < 0.0004), 1, 0)
#df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.0004-0.05_REF'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] >= 0.0004) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.05) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.05-0.1'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.05) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.1) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.1-0.2'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.1) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.2) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.2-0.3'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.2) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.3) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.3-0.4'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.3) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.4) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.4-0.6'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.4) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.6) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.6-0.8'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.6) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.8) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:0.8-1.0'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.8) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 1.0) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:1-10'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 1) & (df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 10) , 1, 0)
df_train_inputs['RevolvingUtilizationOfUnsecuredLines:>10'] = np.where((df_train_inputs['RevolvingUtilizationOfUnsecuredLines'] > 10) , 1, 0)

In [None]:
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:<0.0004'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] < 0.0004), 1, 0)
#df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.0004-0.05_REF'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] >= 0.0004) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.05) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.05-0.1'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.05) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.1) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.1-0.2'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.1) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.2) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.2-0.3'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.2) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.3) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.3-0.4'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.3) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.4) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.4-0.6'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.4) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.6) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.6-0.8'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.6) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 0.8) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:0.8-1.0'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 0.8) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 1.0) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:1-10'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 1) & (df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] <= 10) , 1, 0)
df_test_inputs['RevolvingUtilizationOfUnsecuredLines:>10'] = np.where((df_test_inputs['RevolvingUtilizationOfUnsecuredLines'] > 10) , 1, 0)

## NUMBER OF OPEN CREDIT LINES AND LOANS

In [None]:
df_temp = woe_continuous(df_train_inputs, 'NumberOfOpenCreditLinesAndLoans', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp)

In [None]:
df_train_inputs['NumberOfOpenCreditLinesAndLoans:0'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin([0]), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:1'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin([1]), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:2'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin([2]), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:3'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin([3]), 1, 0)
# range() excludes its upper bound, so each end point is one past the last value in the bin;
# 8 and 18 are now covered by the '6-8' and '14-18' bins instead of falling into no dummy
df_train_inputs['NumberOfOpenCreditLinesAndLoans:4-6'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(4, 6)), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:6-8'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(6, 9)), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:9-13'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(9, 13)), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:13'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin([13]), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:14-18'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(14, 19)), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:19'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin([19]), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:20-24'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(20, 24)), 1, 0)
df_train_inputs['NumberOfOpenCreditLinesAndLoans:24-26'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(24, 27)), 1, 0)
#df_train_inputs['NumberOfOpenCreditLinesAndLoans:>26_REF'] = np.where(df_train_inputs['NumberOfOpenCreditLinesAndLoans'] > 26, 1, 0)

In [None]:
df_test_inputs['NumberOfOpenCreditLinesAndLoans:0'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin([0]), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:1'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin([1]), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:2'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin([2]), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:3'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin([3]), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:4-6'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(4, 6)), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:6-8'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(6, 9)), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:9-13'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(9, 13)), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:13'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin([13]), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:14-18'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(14, 19)), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:19'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin([19]), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:20-24'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(20, 24)), 1, 0)
df_test_inputs['NumberOfOpenCreditLinesAndLoans:24-26'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'].isin(range(24, 27)), 1, 0)
#df_test_inputs['NumberOfOpenCreditLinesAndLoans:>26_REF'] = np.where(df_test_inputs['NumberOfOpenCreditLinesAndLoans'] > 26, 1, 0)

## NUMBER OF REAL ESTATE LOANS AND LINES

In [None]:
df_temp = woe_continuous(df_train_inputs, 'NumberRealEstateLoansOrLines', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp)

In [None]:
df_train_inputs['NumberRealEstateLoansOrLines:0'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([0]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:1'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([1]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:2'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([2]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:3'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([3]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:4'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([4]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:5'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([5]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:6'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([6]), 1, 0)
df_train_inputs['NumberRealEstateLoansOrLines:7'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin([7]), 1, 0)
#df_train_inputs['NumberRealEstateLoansOrLines:>7_REF'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin(range(8, int(df_train_inputs['NumberRealEstateLoansOrLines'].max()))), 1, 0)

In [None]:
df_test_inputs['NumberRealEstateLoansOrLines:0'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([0]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:1'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([1]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:2'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([2]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:3'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([3]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:4'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([4]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:5'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([5]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:6'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([6]), 1, 0)
df_test_inputs['NumberRealEstateLoansOrLines:7'] = np.where(df_test_inputs['NumberRealEstateLoansOrLines'].isin([7]), 1, 0)
#df_train_inputs['NumberRealEstateLoansOrLines:>7_REF'] = np.where(df_train_inputs['NumberRealEstateLoansOrLines'].isin(range(8, int(df_test_inputs['NumberRealEstateLoansOrLines'].max()))), 1, 0)

## AGE

In [None]:
#fine classing age feature into 30 categories
bins=np.linspace(df_train_inputs['age'].min(), df_train_inputs['age'].max()+1, 30)
df_train_inputs['age_x'] = pd.cut(df_train_inputs['age'], bins=bins, include_lowest=True, precision=0)

In [None]:
df_temp = woe_continuous(df_train_inputs, 'age_x', df_train_target)
df_temp

In [None]:
plot_by_woe(df_temp, 90)

In [None]:
df_train_inputs['age:<24'] = np.where(df_train_inputs['age'].isin(range(24)), 1, 0)
df_train_inputs['age:24-33'] = np.where(df_train_inputs['age'].isin(range(24, 33)), 1, 0)
df_train_inputs['age:33-36'] = np.where(df_train_inputs['age'].isin(range(33, 36)), 1, 0)
df_train_inputs['age:36-42'] = np.where(df_train_inputs['age'].isin(range(36, 42)), 1, 0)
df_train_inputs['age:42-55'] = np.where(df_train_inputs['age'].isin(range(42, 55)), 1, 0)
df_train_inputs['age:55-58'] = np.where(df_train_inputs['age'].isin(range(55, 58)), 1, 0)
df_train_inputs['age:58-64'] = np.where(df_train_inputs['age'].isin(range(58, 64)), 1, 0)
df_train_inputs['age:64-67'] = np.where(df_train_inputs['age'].isin(range(64, 67)), 1, 0)
df_train_inputs['age:67-70'] = np.where(df_train_inputs['age'].isin(range(67, 70)), 1, 0)
df_train_inputs['age:70-73'] = np.where(df_train_inputs['age'].isin(range(70, 73)), 1, 0)
df_train_inputs['age:73-89'] = np.where(df_train_inputs['age'].isin(range(73, 89)), 1, 0)
#df_train_inputs['age:>89_REF'] = np.where(df_train_inputs['age'] >= 89, 1, 0)

In [None]:
df_test_inputs['age:<24'] = np.where(df_test_inputs['age'].isin(range(24)), 1, 0)
df_test_inputs['age:24-33'] = np.where(df_test_inputs['age'].isin(range(24, 33)), 1, 0)
df_test_inputs['age:33-36'] = np.where(df_test_inputs['age'].isin(range(33, 36)), 1, 0)
df_test_inputs['age:36-42'] = np.where(df_test_inputs['age'].isin(range(36, 42)), 1, 0)
df_test_inputs['age:42-55'] = np.where(df_test_inputs['age'].isin(range(42, 55)), 1, 0)
df_test_inputs['age:55-58'] = np.where(df_test_inputs['age'].isin(range(55, 58)), 1, 0)
df_test_inputs['age:58-64'] = np.where(df_test_inputs['age'].isin(range(58, 64)), 1, 0)
df_test_inputs['age:64-67'] = np.where(df_test_inputs['age'].isin(range(64, 67)), 1, 0)
df_test_inputs['age:67-70'] = np.where(df_test_inputs['age'].isin(range(67, 70)), 1, 0)
df_test_inputs['age:70-73'] = np.where(df_test_inputs['age'].isin(range(70, 73)), 1, 0)
df_test_inputs['age:73-89'] = np.where(df_test_inputs['age'].isin(range(73, 89)), 1, 0)
#df_test_inputs['age:>89_REF'] = np.where(df_test_inputs['age'].isin(range(89, int(df_test_inputs['age'].max()))), 1, 0)

In [None]:
#original feature categories in a list
original_features = ['RevolvingUtilizationOfUnsecuredLines', 'age', 
                     'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 
                     'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 
                     'NumberOfTimes90DaysLate','NumberRealEstateLoansOrLines',
                     'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']

In [None]:
woe_train_inputs = df_train_inputs.copy()
woe_test_inputs = df_test_inputs.copy()

In [None]:
#keep only the original features
df_train_inputs.drop(columns=[col for col in df_train_inputs.columns if col not in original_features], inplace=True)

In [None]:
#keep only the original features
df_test_inputs.drop(columns=[col for col in df_test_inputs.columns if col not in original_features], inplace=True)

In [None]:
fine_class = ['MonthlyIncome_x', 'DebtRatio_x', 'RevolvingUtilizationOfUnsecuredLines_x', 'age_x']
other_ref_columns = ['MonthlyIncome:5000-6850', 'DebtRatio:<0.159', 'PastDue30-59:0', 'PastDue60-89:0',
                    'PastDue90:0', 'NumberOfDependents:0', 'RevolvingUtilizationOfUnsecuredLines:0.0004-0.05_REF',
                    'NumberOfOpenCreditLinesAndLoans:>26_REF', 'NumberRealEstateLoansOrLines:>7_REF', 'age:>89_REF']
to_drop = original_features + fine_class + other_ref_columns

In [None]:
#drop every listed column that is present; missing ones are ignored
woe_train_inputs.drop(columns=to_drop, errors='ignore', inplace=True)

In [None]:
#drop every listed column that is present; missing ones are ignored
woe_test_inputs.drop(columns=to_drop, errors='ignore', inplace=True)

In [None]:
#stratified split
X_train, X_valid, y_train, y_valid = train_test_split(woe_train_inputs.values, df_train_target.values,
                                                      test_size = 0.2, random_state = 42, stratify = df_train_target.values)
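
When the positive class is rare, `stratify` keeps its share the same in the training and validation splits. A small sketch on synthetic labels; the 7% positive rate and the names `X_toy` and `y_toy` are assumptions for illustration, not values taken from this dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y_toy = (rng.random(10000) < 0.07).astype(int)   # rare positive class
X_toy = rng.normal(size=(10000, 3))

X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2,
                                      random_state=42, stratify=y_toy)

# The positive rate is (almost) identical in the full data and in both splits
print(y_toy.mean(), y_a.mean(), y_b.mean())
```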

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

In [None]:
lr_woe = LogisticRegression(max_iter=300, solver = 'liblinear')

In [None]:
#fit logistic regression
lr_woe.fit(X_train, y_train.ravel())

In [None]:
#predictions 1 or 0
y_pred = lr_woe.predict(X_valid)

In [None]:
#predicted probabilities of the positive class
y_pred_proba = lr_woe.predict_proba(X_valid)
y_pred_proba = y_pred_proba[: , 1]

In [None]:
#confusion matrix
cm = metrics.confusion_matrix(y_valid, y_pred)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='Blues_r')  # counts shown as integers
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Confusion Matrix'
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
#classification report: recall, precision, f1-score, accuracy
print(classification_report(y_valid, y_pred))

In [None]:
#ROC_curve
plot_roc(y_valid, y_pred_proba)

In [None]:
#AUC score
roc_auc_score(y_valid, y_pred_proba)

### Observation
* WOE encoding has improved both precision (0.58) and recall (0.20), and achieves a good AUC score of 0.862.
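
The recall figure reflects the default 0.5 probability cut-off used by `predict`. A minimal sketch on synthetic data (not the competition data) showing how lowering the threshold trades precision for recall; the 0.2 threshold and the class weights are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic problem standing in for the credit data
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.93, 0.07],
                           random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=42)

clf = LogisticRegression(max_iter=300).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]

results = {}
for threshold in (0.5, 0.2):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_va, pred, zero_division=0),
                          recall_score(y_va, pred))
    print(f'threshold={threshold}: precision={results[threshold][0]:.2f}, '
          f'recall={results[threshold][1]:.2f}')
```

AUC is unaffected by the threshold, which is why it is the more stable number to compare across models here.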

### Let's further investigate important features

In [None]:
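#logistic regression with p-values
from sklearn import linear_model
import scipy.stats as stat

class LogisticRegression_with_p_values:
    
    def __init__(self,*args,**kwargs):
        # default to 300 iterations, but let the caller override max_iter without a duplicate-keyword error
        kwargs.setdefault('max_iter', 300)
        self.model = linear_model.LogisticRegression(*args,**kwargs)

    def fit(self,X,y):
        self.model.fit(X,y)
        # 1 / (2 * (1 + cosh(z))) equals p * (1 - p) for the predicted probability p
        denom = (2.0 * (1.0 + np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom,(X.shape[1],1)).T
        # Fisher information matrix of the coefficients (the intercept is not included)
        F_ij = np.dot((X / denom).T,X)
        # its inverse is the Cramer-Rao lower bound on the coefficient covariance
        Cramer_Rao = np.linalg.inv(F_ij)
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0] / sigma_estimates
        # two-sided p-values from the standard normal distribution
        p_values = [stat.norm.sf(abs(x)) * 2 for x in z_scores]
        self.coef_ = self.model.coef_
        self.intercept_ = self.model.intercept_
        self.p_values = p_values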
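
The `denom` term in `fit` can be read as follows. This is a short derivation, with $z_i$ for the decision-function value of row $i$ and $p_i$ for its predicted probability; both symbols are introduced here for illustration:

$$
p_i = \frac{1}{1 + e^{-z_i}}, \qquad
p_i\,(1 - p_i) = \frac{1}{e^{z_i} + 2 + e^{-z_i}} = \frac{1}{2\,(1 + \cosh z_i)}
$$

Dividing each row of $X$ by `denom` therefore weights it by $p_i(1 - p_i)$, so `F_ij` is $X^\top W X$ with $W = \mathrm{diag}\big(p_i(1 - p_i)\big)$, the Fisher information of an unregularized logistic model without an intercept column. The standard errors are the square roots of the diagonal of its inverse, and each z-score is a coefficient divided by its standard error. Because scikit-learn applies L2 regularization by default and the intercept is left out of $X$ here, the resulting p-values are best treated as approximate.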

In [None]:
#logistic regression object
lr_p = LogisticRegression_with_p_values()

In [None]:
lr_p.fit(X_train, y_train.ravel())

In [None]:
#creating a dataframe with feature name, p_values and logistic regression coefficient
summary_table = pd.DataFrame(columns = ['Feature name'], data = woe_train_inputs.columns.values)
summary_table['Coefficients'] = np.transpose(lr_p.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', lr_p.intercept_[0]]
summary_table = summary_table.sort_index()
p_values = lr_p.p_values
p_values = np.append(np.nan, np.array(p_values))
summary_table['p_values'] = p_values

In [None]:
#pd.options.display.max_rows = None
summary_table.head(5)

In [None]:
summary_table[summary_table['p_values'] > 0.05]

### RELEVANT FEATURES
* A 5% significance level is used.
* An original feature is dropped only if all of its dummy variables have p-values above 5%.
* All features have at least one statistically significant dummy variable.
* However, features such as Age, Number of Dependents, Monthly Income, Number of Open Credit Lines and Number of Real Estate Loans have many statistically insignificant dummy variables, which suggests they have low predictive power.
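
The rule in the second bullet can be checked mechanically. A minimal sketch, assuming dummy columns follow the `feature:bin` naming used above; `toy_summary` and its p-values are made up for illustration and are not the notebook's results:

```python
import numpy as np
import pandas as pd

# Toy stand-in for summary_table (the intercept has no p-value)
toy_summary = pd.DataFrame({
    'Feature name': ['Intercept', 'age:<24', 'age:24-33', 'DebtRatio:<1', 'DebtRatio:<10'],
    'p_values': [np.nan, 0.40, 0.21, 0.03, 0.60],
})

rows = toy_summary.dropna(subset=['p_values']).copy()
# The original feature is the part of the dummy name before the colon
rows['original'] = rows['Feature name'].str.split(':').str[0]

# A feature is a drop candidate only if even its smallest p-value exceeds 5%
min_p = rows.groupby('original')['p_values'].min()
features_to_drop = min_p[min_p > 0.05].index.tolist()
print(features_to_drop)
```

Running the same grouping on the real `summary_table` would show which original features, if any, meet the drop criterion.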



In [None]:
dummy_drop = list(woe_train_inputs.filter(regex='Depend').columns) + list(woe_train_inputs.filter(regex='age').columns) 

In [None]:
woe_train_inputs_copy = woe_train_inputs.copy()
woe_test_inputs_copy = woe_test_inputs.copy()

In [None]:
woe_train_inputs_copy.drop(dummy_drop, axis = 1, inplace = True)

In [None]:
woe_test_inputs_copy.drop(dummy_drop, axis = 1, inplace = True)

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(woe_train_inputs_copy.values, df_train_target.values,
                                                      test_size = 0.2, random_state = 42, stratify = df_train_target.values)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

In [None]:
lr_woe = LogisticRegression(max_iter=300, solver = 'liblinear')

In [None]:
#fit logistic regression
lr_woe.fit(X_train, y_train.ravel())

In [None]:
#predictions 1 or 0
y_pred = lr_woe.predict(X_valid)

In [None]:
#predicted probabilities of the positive class
y_pred_proba = lr_woe.predict_proba(X_valid)
y_pred_proba = y_pred_proba[: , 1]

In [None]:
#confusion matrix
cm = metrics.confusion_matrix(y_valid, y_pred)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='Blues_r')  # counts shown as integers
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Confusion Matrix'
plt.title(all_sample_title, size = 15)
plt.show()

In [None]:
#classification report: recall, precision, f1-score, accuracy
print(classification_report(y_valid, y_pred))

In [None]:
#ROC_curve
plot_roc(y_valid, y_pred_proba)

In [None]:
#AUC score
roc_auc_score(y_valid, y_pred_proba)

## RANDOM FOREST CLASSIFIER
***

In [None]:
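def rf_func(target, *data):
    # Fits a random forest on each dataset in `data` and returns class-1 precision, recall, F1 and AUC.
    # Expects exactly three datasets, matching the three column labels of the returned frame.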
    precision = []
    recall = []
    f1_score_ = []
    auc_ = []
    for df in data:
        X_train, X_valid, y_train, y_valid = train_test_split(df.values, target.values,
                                                      test_size = 0.2, random_state = 42, stratify = target.values)
        
        rf = RandomForestClassifier(n_estimators=500)
        rf.fit(X_train, y_train.ravel())
        rf_pred = rf.predict(X_valid)
        rf_pred_proba = rf.predict_proba(X_valid)
        rf_pred_proba = rf_pred_proba[: , 1]
        precision.append(round(precision_score(y_valid, rf_pred, average=None)[1], 2))
        recall.append(round(recall_score(y_valid, rf_pred, average=None)[1], 2))
        f1_score_.append(round(f1_score(y_valid, rf_pred, average=None)[1], 2))
        auc_.append(round(roc_auc_score(y_valid, rf_pred_proba), 3))
    return pd.DataFrame([precision, recall, f1_score_, auc_], index = ['Precision', 'recall',
                                                                      'f1_score', 'auc'],
                       columns = ['Original Features', 'WOE_Features', 'WOE_Features_Trimmed'])

In [None]:
df_ = rf_func(df_train_target, df_train_inputs, woe_train_inputs, woe_train_inputs_copy)
df_.T

## Observations
- Comparing the Random Forest classifier across the three datasets shows that it performs best on the original features.
- This result is comparable to the one achieved using Logistic Regression on the WOE-engineered features.
- This is consistent with the basic principle of Weight of Evidence: it groups the values of a variable into classes with similar informative power on the target variable.
- That is similar to the algorithmic principle of Random Forest: a collection of decision trees, each of which looks for the most informative split point on the best feature in order to reach pure leaves with as few hierarchical questions as possible.
- From here on, the original dataset is used to train the RF classifier, while the WOE dummy features are used to train the Logistic Regression model.
- We shall perform hyperparameter tuning to find the combination of hyperparameters that optimizes the performance of both classifiers.
- We shall also compare the most important features from both models.
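
To make the Weight of Evidence principle above concrete, here is a minimal sketch on toy data. It assumes the convention WoE = ln(share of non-events / share of events) per bin; the sign convention used earlier in this notebook may differ. The names `woe_table`, `toy_bins` and `toy_target` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def woe_table(feature_bins, target):
    """Weight of Evidence per bin, assuming WoE = ln(%non-events / %events)."""
    df = pd.DataFrame({'bin': feature_bins.values, 'target': target.values})
    grouped = df.groupby('bin')['target'].agg(n_obs='count', n_event='sum')
    grouped['n_non_event'] = grouped['n_obs'] - grouped['n_event']
    # Share of all events / non-events that fall into each bin
    grouped['share_event'] = grouped['n_event'] / grouped['n_event'].sum()
    grouped['share_non_event'] = grouped['n_non_event'] / grouped['n_non_event'].sum()
    grouped['woe'] = np.log(grouped['share_non_event'] / grouped['share_event'])
    return grouped

# Toy data (not from the competition): 'low' holds mostly non-events, 'high' mostly events
toy_bins = pd.Series(['low', 'low', 'low', 'low', 'high', 'high', 'high', 'high'])
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

woe_toy = woe_table(toy_bins, toy_target)
print(woe_toy[['n_obs', 'n_event', 'woe']])
```

Under this convention a bin with a positive WoE holds relatively more non-defaulters than the population, and a bin with a negative WoE holds relatively more defaulters. Bins with similar WoE carry similar information about the target, which is why they can be merged.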

In [None]:
#import libraries
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV , StratifiedShuffleSplit

In [None]:

#Construct some pipelines

pipe_lr = Pipeline([('clf', LogisticRegression(random_state=42, max_iter=300))])

# max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
pipe_rf = Pipeline([('clf', RandomForestClassifier(max_features = 'sqrt', random_state=0, n_estimators=500, n_jobs=-1))])

#Set grid search params

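# lbfgs supports only the l2 penalty, so l1 is paired with liblinear alone
grid_params_lr = [{'clf__penalty': ['l1','l2'],
            'clf__C': [0.1, 0.2, 1, 2],
            'clf__solver': ['liblinear']},
           {'clf__penalty': ['l2'],
            'clf__C': [0.1, 0.2, 1, 2],
            'clf__solver': ['lbfgs']}]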

grid_params_rf = [{'clf__min_samples_leaf': [2,5],
                'clf__max_depth': [5,10],
                'clf__min_samples_split': [3,5]}]

#Construct grid searches

gs_lr = GridSearchCV(estimator=pipe_lr,
            param_grid=grid_params_lr,
            scoring='roc_auc',
            cv = StratifiedShuffleSplit(n_splits=3,test_size=0.2,random_state = 0), 
            n_jobs=-1)

gs_rf = GridSearchCV(estimator=pipe_rf,
            param_grid=grid_params_rf,
            scoring='roc_auc',
            cv = StratifiedShuffleSplit(n_splits=3,test_size=0.2,random_state = 0),
            n_jobs=-1)


#List of pipelines for ease of iteration
grids = [gs_lr, gs_rf]

#Dictionary of pipelines and classifier types for ease of reference
grid_dict = {0: 'Logistic Regression', 1: 'Random Forest'}


#Fit the grid search objects
print('Performing model optimizations...')

# Each estimator is tuned on its own feature set: WOE features for
# Logistic Regression, original features for Random Forest
grid_inputs = [woe_train_inputs, df_train_inputs]

for idx, (gs, inputs) in enumerate(zip(grids, grid_inputs)):
    print('\nEstimator: %s' % grid_dict[idx])
    # Fit grid search; ravel() passes the target as a 1-D array
    gs.fit(inputs.values, df_train_target.values.ravel())
    # Best params
    print('Best params: %s' % gs.best_params_)
    # Best Score
    print('Best AUC score: %.4f' % gs.best_score_)

In [None]:
#import libraries
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [None]:
#Instantiate both models with the best parameters found by the grid search
lr = LogisticRegression(C=0.1, max_iter=300, solver='liblinear', penalty='l2')
rf = RandomForestClassifier(n_estimators=500, random_state=10 , min_samples_leaf=5, max_depth=10,
                            min_samples_split=3, n_jobs=-1)

In [None]:
#stratified kfold
scoring = 'roc_auc'
models = []
models.append(('LR', lr))
models.append(('RFG', rf))
names = []
results = []
# Each model is cross-validated on its own feature set
model_inputs = {'LR': woe_train_inputs, 'RFG': df_train_inputs}
kfold = StratifiedKFold(n_splits=3, shuffle=True , random_state = 47)
for name, model in models:
    # ravel() passes the target as a 1-D array
    cv_results = cross_val_score(model, model_inputs[name].values, df_train_target.values.ravel(),
                                 cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
# Compare Algorithms
fig = plt.figure(figsize=(10,8))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

## PRECISION, RECALL TRADEOFF
***

In [None]:
def trade_off(clf, train_input, target):
    # Fit clf on a stratified 80/20 split and score the validation set at several thresholds
    X_train, X_valid, y_train, y_valid = train_test_split(train_input.values, target.values.ravel(),
                                                     test_size = 0.2, random_state = 42,
                                                      stratify = target.values.ravel())
    clf.fit(X_train, y_train)
    # Probability of the positive class (default)
    clf_pred_proba = clf.predict_proba(X_valid)[:, 1]
    # AUC depends only on the probabilities, so it is identical at every threshold
    auc_score = round(roc_auc_score(y_valid, clf_pred_proba), 3)
    thresholds = [0.3, 0.4, 0.5, 0.6]
    # Initialise with floats so assigning the rounded scores keeps the column dtype
    precision_recall_df = pd.DataFrame(0.0, index = ['threshold:%s' % t for t in thresholds],
                                       columns = ['precision', 'recall', 'f1_score', 'auc_score'])
    for threshold, idx in zip(thresholds, precision_recall_df.index.values):
        class_temp = np.where(clf_pred_proba > threshold, 1, 0)
        precision_recall_df.loc[idx, 'precision'] = round(precision_score(y_valid, class_temp,
                                                                      average=None)[1], 2)
        precision_recall_df.loc[idx, 'recall'] = round(recall_score(y_valid, class_temp,
                                                                average=None)[1], 2)
        precision_recall_df.loc[idx, 'f1_score'] = round(f1_score(y_valid, class_temp,
                                                              average=None)[1], 2)
        precision_recall_df.loc[idx, 'auc_score'] = auc_score
    return precision_recall_df
    

In [None]:
trade_off(rf, df_train_inputs, df_train_target)

In [None]:
trade_off(lr, woe_train_inputs, df_train_target)

> There is always a trade-off between model precision and recall, and the right balance depends on the nature of the business problem. A precision-focused model is a cautious model that puts more emphasis on lowering false positives. It is strict about what it assigns to the positive class (event) and uses a higher threshold (probability) value to do so. Raising the threshold generally raises the precision of a model, at the cost of recall.<br>
On the other hand, a recall-focused model is oriented towards lowering false negatives. This type of model does not want a positive data point to go unnoticed. A lower threshold value means the model is less strict in discriminating between the two classes and would assign any data point with an inkling of a positive attribute to the positive class.<br>
When giving out loans, it is often better to deny a potentially good customer than to approve a high-risk borrower. Hence, our model would be recall-focused, although a model that is more skillful overall is generally preferred.<br>
Comparing the scores from both classifiers, Random Forest is more skillful, with an AUC score of 0.868 against Logistic Regression's 0.864. Relative to the other thresholds, a threshold of 0.3 gives the best balance between precision (0.46) and recall (0.40).
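
The effect of the threshold described above can be reproduced on synthetic data. The following is a minimal sketch, not the notebook's own pipeline: the imbalanced dataset comes from `make_classification`, and `demo_clf`, `demo_proba` and `demo_recalls` are hypothetical names used only here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 7% positives); an assumption for illustration
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     weights=[0.93, 0.07], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

demo_clf = LogisticRegression(max_iter=300).fit(X_tr, y_tr)
demo_proba = demo_clf.predict_proba(X_va)[:, 1]

demo_recalls = []
for threshold in [0.1, 0.3, 0.5, 0.7]:
    # A data point is assigned to the positive class only above the threshold
    pred = (demo_proba > threshold).astype(int)
    precision = precision_score(y_va, pred, zero_division=0)
    recall = recall_score(y_va, pred)
    demo_recalls.append(recall)
    print('threshold %.1f: precision %.2f, recall %.2f' % (threshold, precision, recall))
```

Recall can only stay flat or fall as the threshold rises, because fewer points are flagged as positive. Precision usually moves in the opposite direction, although it is not guaranteed to increase at every step.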

In [None]:
lr.fit(woe_train_inputs.values, df_train_target.values.ravel())
lr_pred = lr.predict(woe_test_inputs.values)
lr_pred_proba = lr.predict_proba(woe_test_inputs.values)
lr_pred_proba = lr_pred_proba[:, 1]

In [None]:
lr_woe_model = pd.DataFrame({'Id': df_test.index.values,
                                 'Probability': lr_pred_proba})
lr_woe_model.set_index(keys = 'Id', inplace = True)
lr_woe_model

In [None]:
rf.fit(df_train_inputs.values, df_train_target.values.ravel())

In [None]:
rf_pred = rf.predict(df_test_inputs.values)
rf_pred_proba = rf.predict_proba(df_test_inputs.values)
rf_pred_proba = rf_pred_proba[:, 1]

In [None]:
rf_model = pd.DataFrame({'Id': df_test.index.values,
                                 'Probability': rf_pred_proba})
rf_model.set_index(keys = 'Id', inplace = True)
rf_model.head(10)

In [None]:
importance = lr.coef_[0]
feat_importances = pd.DataFrame(importance, index=woe_train_inputs.columns.values, columns=['Score'])
feat_importances = feat_importances.sort_values(by='Score',ascending=True)
feat_importances.plot(kind='bar', title='Features Importance',legend=False, figsize=(14,8))
#plt.xlabel('Importance Score')
plt.ylabel('Coefficient')
plt.show()

### COEFFICIENTS OF LOGISTIC REGRESSION
* The coefficient of a feature in a Logistic Regression indicates its predictive strength and the class it favours.
* The further a coefficient is from zero, the higher the predictive power.
* Coefficients greater than zero are indicative of the event class (target: 1), while negative coefficients tend towards the no-event class (target: 0). Here, event means default and no-event means non-default.
* Features with coefficients close to zero are indifferent to either class.
* The graph of coefficients above is bidirectional: bars extend below zero for the no-event class and above zero for the event class.
* Features on the far right are indicative of a defaulting borrower.
* Revolving Credit Utilization and Number of Days past due are important features in predicting the likelihood of a borrower defaulting.
* Borrowers with Credit Utilization beyond 0.5 are high risk.
* Borrowers who were 90 days or more past due at least twice are high risk.
* Borrowers who were 60-89 days past due more than 3 times are high risk.
* Borrowers who were 30-59 days past due more than 4 times should also be considered high risk.
* While Age, Number of Dependents, Monthly Income, Debt Ratio and the other features not listed above are generally class indifferent, some categories are worth noting.
* Older customers (>65) are less likely to default.
* Borrowers with a Debt Ratio above 10 are high risk.
* Borrowers with at least 6 dependents are high risk.
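
One way to read coefficients like the ones above is to exponentiate them into odds ratios: `exp(coefficient)` is the factor by which the odds of the event are multiplied when a dummy feature switches from 0 to 1, holding the other features fixed. The sketch below uses simulated data with assumed effect sizes; `high_utilization`, `age_over_65` and `toy_lr` are hypothetical names, not columns or objects from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Two simulated dummy features (assumed for illustration only)
high_utilization = rng.integers(0, 2, n)
age_over_65 = rng.integers(0, 2, n)

# Assumed true log-odds of default: utilization raises risk, age lowers it
log_odds = -2.0 + 1.5 * high_utilization - 1.0 * age_over_65
y_toy = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

X_toy = pd.DataFrame({'high_utilization': high_utilization,
                      'age_over_65': age_over_65})
toy_lr = LogisticRegression(max_iter=300).fit(X_toy.values, y_toy)

# Odds ratio > 1 points towards the event class, < 1 towards the no-event class
odds_ratios = pd.Series(np.exp(toy_lr.coef_[0]), index=X_toy.columns)
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio near 1 corresponds to a coefficient near zero, which matches the "class indifferent" reading in the list above.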

In [None]:
def plot_feature_importance(importance,names,model_type):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

    #Define size of bar plot
    plt.figure(figsize=(10,8))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels (note the space so the model name and title do not run together)
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
plot_feature_importance(rf.feature_importances_,df_train_inputs.columns,'RANDOM FOREST')

### RANDOM FOREST FEATURE IMPORTANCE
The values assigned to Random Forest features measure their predictive power, but they are not class indicative. They tell how informative each feature is in splitting the target variable into distinct classes. The above graph further tells us that Number of Days past due (30, 60 and 90) and Revolving Credit Utilization are the most important features for our PD model.
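
The impurity-based scores in `feature_importances_` are computed from the training data and can favour features with many distinct values, so a second opinion is sometimes useful. The sketch below swaps in a different technique, permutation importance on a held-out split, and runs it on synthetic data rather than on this notebook's features. The names `forest`, `perm` and `importance_comparison` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are informative, the last 3 are noise
X_syn, y_syn = make_classification(n_samples=2000, n_features=6, n_informative=3,
                                   n_redundant=0, shuffle=False, random_state=0)
feature_names = ['informative_%d' % i for i in range(3)] + ['noise_%d' % i for i in range(3)]
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.2,
                                          stratify=y_syn, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Drop in validation AUC when each column is shuffled, averaged over 5 repeats
perm = permutation_importance(forest, X_va, y_va, scoring='roc_auc',
                              n_repeats=5, random_state=0)

importance_comparison = pd.DataFrame({'impurity_based': forest.feature_importances_,
                                      'permutation': perm.importances_mean},
                                     index=feature_names)
print(importance_comparison.sort_values('permutation', ascending=False))
```

If both columns rank the same features at the top, that is some reassurance that the ranking is not an artefact of how the impurity-based score is calculated. Like the impurity-based score, permutation importance is not class indicative.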

In [None]:
rf_model.to_csv('submission.csv')