<center>
<h1> Credit Risk Analytics </h1>
</center>

###### Definition of Target and Outcome Window:
A leading bank wants to predict, at the time a customer applies for a loan, whether that customer is likely to turn out to be a bad customer (a defaulter). Models of this kind are called PD (Probability of Default) models.


###### Data Pre-Processing
    - Missing Values Treatment - Numerical (Mean/Median imputation) and Categorical (Separate Missing Category or Merging)
    - Univariate Analysis - Outlier and Frequency Analysis
###### Exploratory Data Analysis
    - Bivariate Analysis - Numeric (t-test) and Categorical (Chi-square)
    - Bivariate Analysis - Visualization
    - Variable Selection - p-value based selection
    - Variable Transformation - Bucketing / Binning for numerical variables and dummy variables for categorical variables
    - Variable Reduction - IV / Somers' D
    - Variable Reduction - Multicollinearity
###### Model Build and Model Diagnostics
    - Train and Test split
    - Significance of each Variable
    - Gini and ROC / Concordance analysis - Rank Ordering
    - Classification Table Analysis - Accuracy

###### Model Validation
    - Out-of-sample (OOS) validation - p-value and sign testing for the model coefficients
    - Diagnostics should remain similar to those of the training model
    - Bootstrapping, if necessary
###### Model Interpretation
    - Inference to find the most important contributors
    - Prediction of risk and proactive prevention by targeting segments of the population

### Import packages:

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt


from patsy import dmatrices

from statsmodels.stats.outliers_influence import variance_inflation_factor  # for multicollinearity

import scipy.stats as stats # for hypothesis testing

import statsmodels.formula.api as smf  # for model defining

from sklearn.model_selection import train_test_split # for train test split
 
from sklearn.linear_model import LogisticRegression  # for logistic regression

from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, accuracy_score

### Import dataset:

In [3]:
# raw string so the backslashes in the Windows path are not read as escape sequences
bankloans = pd.read_csv(r"D:\ALabs\Python for Data Science\STUDY MATERIAL\Bankloans.csv")
bankloans

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1.0
1,27,1,10,6,31,17.3,1.362202,4.000798,0.0
2,40,1,15,14,55,5.5,0.856075,2.168925,0.0
3,41,1,15,14,120,2.9,2.658720,0.821280,0.0
4,24,2,2,0,28,17.3,1.787436,3.056564,1.0
...,...,...,...,...,...,...,...,...,...
845,34,1,12,15,32,2.7,0.239328,0.624672,
846,32,2,12,11,116,5.7,4.026708,2.585292,
847,48,1,13,11,38,10.8,0.722304,3.381696,
848,35,2,1,11,24,7.8,0.417456,1.454544,


### UDF for continuous variables:

In [4]:
def continuous_var_summary( x ):
    
    # freq and missings
    n_total = x.shape[0]
    n_miss = x.isna().sum()
    perc_miss = n_miss * 100 / n_total
    
    # outliers - iqr
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    lc_iqr = q1 - 1.5 * iqr
    uc_iqr = q3 + 1.5 * iqr
    
    
    return pd.Series( [ x.dtype, x.nunique(), n_total, x.count(), n_miss, perc_miss,
                       x.sum(), x.mean(), x.std(), x.var(), 
                       lc_iqr, uc_iqr, 
                       x.min(), x.quantile(0.01), x.quantile(0.05), x.quantile(0.10), 
                       x.quantile(0.25), x.quantile(0.5), x.quantile(0.75), 
                       x.quantile(0.90), x.quantile(0.95), x.quantile(0.99), x.max() ], 
                     
                    index = ['dtype', 'cardinality', 'n_tot', 'n', 'nmiss', 'perc_miss',
                             'sum', 'mean', 'std', 'var',
                        'lc_iqr', 'uc_iqr',
                        'min', 'p1', 'p5', 'p10', 'p25', 'p50', 'p75', 'p90', 'p95', 'p99', 'max']) 
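
The `lc_iqr` and `uc_iqr` entries returned by the function are the usual 1.5 × IQR fences. As a small worked illustration, the sketch below repeats that calculation on a toy Series; the values are made up for the example and are not taken from the bank data.

```python
import pandas as pd

# Toy values (hypothetical, not from Bankloans.csv); 100 is an obvious outlier
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = x.quantile(0.25)      # 3.25
q3 = x.quantile(0.75)      # 7.75
iqr = q3 - q1              # 4.5
lc_iqr = q1 - 1.5 * iqr    # lower fence: -3.5
uc_iqr = q3 + 1.5 * iqr    # upper fence: 14.5

# Values outside the fences are flagged as potential outliers
outliers = x[(x < lc_iqr) | (x > uc_iqr)]
print(outliers.tolist())   # [100]
```

Comparing `min` and `max` in the summary against these fences is a quick way to see which variables have extreme values.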

### UDF for Categorical variables:

In [5]:
def categorical_var_summary(x):
    Mode = x.value_counts().sort_values(ascending = False)[0:1].reset_index()
    return pd.Series([x.count(), x.isnull().sum(), Mode.iloc[0, 0], Mode.iloc[0, 1], 
                          round(Mode.iloc[0, 1] * 100/x.count(), 2)], 
                  index = ['N', 'NMISS', 'MODE', 'FREQ', 'PERCENT'])

In [6]:
# Missing value imputation for continuous variables (mean or median);
# non-numeric columns are returned unchanged

def missing_imputation(x, stats = 'mean'):
    if (x.dtypes == 'float64') | (x.dtypes == 'int64'):
        x = x.fillna(x.mean()) if stats == 'mean' else x.fillna(x.median())
    return x

### EDA | DATA PROFILING | DATA INSPECTION:

In [7]:
bankloans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       850 non-null    int64  
 1   ed        850 non-null    int64  
 2   employ    850 non-null    int64  
 3   address   850 non-null    int64  
 4   income    850 non-null    int64  
 5   debtinc   850 non-null    float64
 6   creddebt  850 non-null    float64
 7   othdebt   850 non-null    float64
 8   default   700 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 59.9 KB


* **y variable / target variable (the variable to be predicted) = default.**

In [8]:
bankloans.default.nunique()

2

#### Percentage of records in each "default" class:

In [9]:
bankloans.default.value_counts()

0.0    517
1.0    183
Name: default, dtype: int64

In [10]:
bankloans.default.value_counts() / bankloans.default.count() # count() doesn't include missing values

0.0    0.738571
1.0    0.261429
Name: default, dtype: float64
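
The same proportions can probably be obtained more directly with the `normalize=True` argument of `value_counts()`, which also ignores missing values by default. A minimal sketch on a toy Series (hypothetical values, not the bank data):

```python
import numpy as np
import pandas as pd

# Toy target with one missing value (hypothetical data)
toy = pd.Series([0.0, 0.0, 0.0, 1.0, np.nan], name="default")

# Manual version, as in the cell above: counts divided by the non-missing total
manual = toy.value_counts() / toy.count()

# Built-in version: value_counts drops NaN by default and normalizes the counts
builtin = toy.value_counts(normalize=True)

print(manual.sort_index().tolist())   # [0.75, 0.25]
print(builtin.sort_index().tolist())  # [0.75, 0.25]
```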

* **Why is the above step done?**
    
    - To check whether the classes are BALANCED or not.

##### Example:
            
             0 : 50
             1 : 50
                
                data is balanced
                
            0 : 85
            1 : 15
                 
                data is imbalanced. 
         
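- To balance the data, we can use:
            
            i. Oversampling (over balancing)
            ii. Undersampling (under balancing)
            iii. Combined over- and undersampling (over-under balancing)
           
           - In Python, this can be done by drawing random row positions, e.g. with np.random.randint().
           - ML libraries offer several further options.
           
            
            
         0: 85
         1: 15
         
    - Oversampling:
            
            0: 85
            1: 85
                    
                    - records from class 1 are randomly drawn with replacement until the class grows from 15 to 85 records.
                    
    
   - Undersampling: 
           
           0: 15
           1: 15
                   
                   - a random subset of class 0 is kept, shrinking the class from 85 to 15 records.
    
   - Combined over- and undersampling:
   
           0: 45
           1: 55
                   
                   - class 0 is undersampled and class 1 is oversampled so that the two classes end up at similar sizes.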
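
Oversampling and undersampling (the over and under balancing described above) can be sketched on the 85/15 illustration as follows. This is only one possible approach: it uses pandas' `DataFrame.sample` rather than `np.random.randint()`, and the toy frame, the variable names and the `random_state` value are assumptions made for the example.

```python
import pandas as pd

# Toy data with the 85/15 class split from the illustration (hypothetical rows)
toy = pd.DataFrame({"default": [0] * 85 + [1] * 15})

majority = toy[toy["default"] == 0]
minority = toy[toy["default"] == 1]

# Oversampling: draw minority rows WITH replacement until they match the majority
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])

# Undersampling: keep a random subset of majority rows, WITHOUT replacement
undersampled = pd.concat([
    majority.sample(n=len(minority), replace=False, random_state=42),
    minority,
])

print(oversampled["default"].value_counts().to_dict())   # 85 rows in each class
print(undersampled["default"].value_counts().to_dict())  # 15 rows in each class
```

Oversampling repeats minority records, while undersampling discards majority records, so each trades off differently between duplicated information and lost information.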

### Separating the data on the basis of 'default' variable:

In [12]:
bankloans_existing = bankloans[bankloans.default.notna()]
bankloans_existing

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
0,41,3,17,12,176,9.3,11.359392,5.008608,1.0
1,27,1,10,6,31,17.3,1.362202,4.000798,0.0
2,40,1,15,14,55,5.5,0.856075,2.168925,0.0
3,41,1,15,14,120,2.9,2.658720,0.821280,0.0
4,24,2,2,0,28,17.3,1.787436,3.056564,1.0
...,...,...,...,...,...,...,...,...,...
695,36,2,6,15,27,4.6,0.262062,0.979938,1.0
696,29,2,6,4,21,11.5,0.369495,2.045505,0.0
697,33,1,15,3,32,7.6,0.491264,1.940736,0.0
698,45,1,19,22,77,8.4,2.302608,4.165392,0.0


In [13]:
bankloans_new = bankloans[ bankloans.default.isna() ]
bankloans_new

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
700,36,1,16,13,32,10.9,0.544128,2.943872,
701,50,1,6,27,21,12.9,1.316574,1.392426,
702,40,1,9,9,33,17.0,4.880700,0.729300,
703,31,1,5,7,23,2.0,0.046000,0.414000,
704,29,1,4,0,24,7.8,0.866736,1.005264,
...,...,...,...,...,...,...,...,...,...
845,34,1,12,15,32,2.7,0.239328,0.624672,
846,32,2,12,11,116,5.7,4.026708,2.585292,
847,48,1,13,11,38,10.8,0.722304,3.381696,
848,35,2,1,11,24,7.8,0.417456,1.454544,


### Summary of the existing customers' data:

In [14]:
bankloans_existing.apply(continuous_var_summary)

Unnamed: 0,age,ed,employ,address,income,debtinc,creddebt,othdebt,default
dtype,int64,int64,int64,int64,int64,float64,float64,float64,float64
cardinality,37,5,32,31,114,231,695,699,2
n_tot,700,700,700,700,700,700,700,700,700
n,700,700,700,700,700,700,700,700,700
nmiss,0,0,0,0,0,0,0,0,0
perc_miss,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sum,24402,1206,5872,5795,31921,7182.4,1087.486972,2140.746028,183.0
mean,34.86,1.722857,8.388571,8.278571,45.601429,10.260571,1.553553,3.058209,0.261429
std,7.997342,0.928206,6.658039,6.824877,36.814226,6.827234,2.117197,3.287555,0.439727
var,63.957482,0.861566,44.329483,46.578939,1355.287265,46.611118,4.482523,10.808015,0.19336


### Data Cleaning:
        
         The summary above shows no missing values (nmiss = 0 for every column), so only outliers need to be treated.
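
One common way to treat outliers is to cap each variable at chosen percentiles. The sketch below assumes capping at the 1st and 99th percentiles (the `p1` and `p99` values reported by `continuous_var_summary`); the cut-offs, the helper name `cap_outliers` and the toy data are illustrative assumptions, not necessarily the treatment used in this notebook.

```python
import pandas as pd

def cap_outliers(x, lower_q=0.01, upper_q=0.99):
    """Clip a numeric Series to its lower_q and upper_q quantiles (hypothetical helper)."""
    return x.clip(lower=x.quantile(lower_q), upper=x.quantile(upper_q))

# Toy values (hypothetical); 100.0 is far above the rest
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
capped = cap_outliers(toy)

# The extreme value is pulled down to the 99th percentile of the original data
print(toy.max(), capped.max())
```

On a DataFrame, the same helper could be applied column by column, for example with `df.apply(cap_outliers)`, leaving the target column out.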