Now that we have variables that show some promise as predictors of Risk Rating, let us see whether these variables can indeed be used to predict the Risk Rating

Unfortunately, the installed version of Statsmodels does not support ordinal regression, so we change our approach and predict the Self Exclusion flag instead, which is a binary variable

The purpose of this notebook is to use classifiers to classify clients as risky or non-risky

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [3]:
import statsmodels.api as sm

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [4]:
df = pd.read_csv('df_client.csv')

In [5]:
df.head()

Unnamed: 0,Country_Code,BR Code,Period,Client,risk_rating,Self_exclude_flag,Variable_1_Y0,Variable_1_Y1,Variable_1_Y2,Variable_1_Y3,...,Variable_28_Y1,Variable_28_Y2,Variable_28_Y3,Variable_29_Y0,Variable_29_Y1,Variable_29_Y2,Variable_30_Y0,Variable_30_Y1,Variable_30_Y2,Variable_30_Y3
0,0,0,2017Q2,0,7,1,581103.4591,612122.5165,589483.6484,608043.5063,...,572312.4225,601762.9316,574251.413,577170.3096,594024.8975,616177.8226,588163.8327,623659.1015,608794.9055,574860.551
1,0,0,2016Q1,0,7,1,608189.3682,581513.6158,609292.15,,...,608263.6088,605605.1646,,581951.0166,608354.2362,623470.1198,591055.8212,592011.4052,572734.0028,
2,0,0,2015Q4,0,7,1,626775.445,620338.8464,,,...,621396.294,,,590490.362,620329.2616,,626221.0887,572241.0321,,
3,0,0,2015Q2,0,7,1,613152.4469,595630.8819,,,...,589714.2432,,,580633.8747,576235.2813,,619098.6619,578761.7137,,
4,0,1,2019Q1,1,9,0,615840.2415,603501.2067,587601.9393,610071.5454,...,607400.3547,570273.9177,573434.8221,572413.5987,618435.4264,587802.7283,,,,


We go back to the Correlation Matrix and take the absolute correlation of Risk Rating vs. all variables. 

We then select candidates for use in our classification exercise

Preference was given to the Y0 variables, since they have fewer null values and their correlation with Risk Rating is comparable to that of the variables from prior years
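The ranking step described above can be sketched as follows. This is a minimal illustration rather than the code used in the earlier notebook: the helper `rank_by_abs_corr` and the synthetic frame `demo` are hypothetical, and only the column naming follows `df_client.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_client.csv; the column names mirror the notebook,
# the values are made up for illustration only.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "risk_rating": rng.integers(1, 18, size=n),
    "Variable_1_Y0": rng.normal(size=n),
    "Variable_2_Y0": rng.normal(size=n),
})
# Make one column depend on the target so that it ranks first.
demo["Variable_3_Y0"] = -0.5 * demo["risk_rating"] + rng.normal(size=n)


def rank_by_abs_corr(frame, target):
    """Return |Pearson correlation| of each numeric column with `target`, largest first."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


ranking = rank_by_abs_corr(demo, "risk_rating")
print(ranking)
```

Taking the absolute value matters here because a strongly negative correlation is as useful for prediction as a strongly positive one.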

In [58]:
#cols = ['Variable_16_Y0','Variable_17_Y0', 'Variable_22_Y0','Variable_3_Y0', 
cols = ['Variable_16_Y0','Variable_3_Y0', 
        'risk_rating', 'Self_exclude_flag']

Variables 16, 17, 22 and 3 were selected for the initial modeling

Variables 17 and 22 were subsequently dropped

In [59]:
df1 = df.loc[:,cols].dropna()

In [60]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23235 entries, 0 to 28223
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Variable_16_Y0     23235 non-null  float64
 1   Variable_3_Y0      23235 non-null  float64
 2   risk_rating        23235 non-null  int64  
 3   Self_exclude_flag  23235 non-null  int64  
dtypes: float64(2), int64(2)
memory usage: 907.6 KB


In [61]:
df1['Self_exclude_flag'].value_counts()

1    19930
0     3305
Name: Self_exclude_flag, dtype: int64

The Self Exclusion flag is imbalanced: 19,930 rows (about 86%) have value 1 and 3,305 rows (about 14%) have value 0
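With counts this uneven, it helps to know what accuracy a trivial classifier would reach. The short calculation below uses only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts output above.
counts = {1: 19930, 0: 3305}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained on this flag should be judged against this baseline rather than against 0.5.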

In [62]:
df1['risk_rating'].value_counts()

7     4816
6     4604
8     4216
5     3312
4     2143
9     1660
10     718
3      652
11     437
12     199
2      178
13     139
17     109
14      27
15      16
1        9
Name: risk_rating, dtype: int64

The frequency of Risk Rating shows that some ratings are very rare: rating 1 has only 9 rows, and ratings 14 and 15 have 27 and 16 rows

For modeling, we will break the dataset into Train and Test datasets

The Train dataset is used to train the model and the Test dataset is used to test the model's accuracy

In [63]:
# split into inputs and outputs
X, y = df1.loc[:,cols[:-2]], df1.loc[:,cols[-1]]
print(X.shape, y.shape)

(23235, 2) (23235,)


In [64]:
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(15567, 2) (7668, 2) (15567,) (7668,)
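Because the flag is imbalanced, one possible variant of this split is a stratified one, which keeps the share of 1s and 0s the same in both partitions. The sketch below is not what this notebook does; it runs on synthetic data (`X_demo`, `y_demo` are made up) and only shows the effect of the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same 86/14 class balance as the flag;
# the feature values are random and purely illustrative.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 2))
y_demo = np.array([1] * 860 + [0] * 140)

# stratify=y_demo keeps the class proportions the same in both partitions.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1, stratify=y_demo
)

print("share of 1s in train:", ys_train.mean())
print("share of 1s in test: ", ys_test.mean())
```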


In [65]:
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=1)

In [66]:
# make predictions
yhat = model.predict(X_test)

In [67]:
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Accuracy: 0.845


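The model's accuracy was high (about 0.85) with all four variables. Two variables were subsequently removed, which reduced the accuracy somewhat; the 0.845 above is for the remaining two. For comparison, always predicting 1 would score 6596 / 7668 = 0.860 on this test set, so accuracy alone overstates the model's usefulness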

In [68]:
cm = confusion_matrix(y_test, yhat)
print("Confusion Matrix : \n", cm)

Confusion Matrix : 
 [[  41 1031]
 [ 158 6438]]


Confusion matrix shows the distribution of Actual vs. Predicted results. 

6438 of the 6596 actual 1s (97.6%) were predicted correctly

Only 41 of the 1072 actual 0s (3.8%) were predicted correctly

So the model is far more successful at predicting the 1s (non-self excluded) than the 0s (self excluded)
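The per-class figures can be derived directly from the confusion matrix. The sketch below copies the matrix printed above into a new array (`cm_values` is an illustrative name) and relies on the scikit-learn convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above.
# scikit-learn convention: rows are actual classes, columns are predicted classes.
cm_values = np.array([[41, 1031],
                      [158, 6438]])

# Recall per class: correctly predicted rows divided by actual rows of that class.
recall_per_class = cm_values.diagonal() / cm_values.sum(axis=1)
accuracy = cm_values.diagonal().sum() / cm_values.sum()
# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = recall_per_class.mean()

print("recall for class 0: %.3f" % recall_per_class[0])
print("recall for class 1: %.3f" % recall_per_class[1])
print("accuracy:           %.3f" % accuracy)
print("balanced accuracy:  %.3f" % balanced_accuracy)
```

A balanced accuracy close to 0.5 indicates that, once both classes are weighted equally, the model does little better than guessing.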

In [70]:
unique, counts = np.unique(y_test, return_counts=True)

print("y_test : \n", np.asarray((unique, counts)).T)

y_test : 
 [[   0 1072]
 [   1 6596]]


In [69]:
unique, counts = np.unique(yhat, return_counts=True)

print("yhat : \n", np.asarray((unique, counts)).T)

yhat : 
 [[   0  199]
 [   1 7469]]


In [27]:
df1.corr()

Unnamed: 0,Variable_16_Y0,Variable_17_Y0,Variable_22_Y0,Variable_3_Y0,risk_rating,Self_exclude_flag
Variable_16_Y0,1.0,0.997606,0.001025,-0.006374,-0.068312,0.028444
Variable_17_Y0,0.997606,1.0,0.001026,-0.006371,-0.068122,0.028178
Variable_22_Y0,0.001025,0.001026,1.0,-0.398988,-0.00946,0.000914
Variable_3_Y0,-0.006374,-0.006371,-0.398988,1.0,-0.006262,0.005107
risk_rating,-0.068312,-0.068122,-0.00946,-0.006262,1.0,-0.685828
Self_exclude_flag,0.028444,0.028178,0.000914,0.005107,-0.685828,1.0


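The correlation matrix for the four candidate variables shows that 16 and 17 are almost perfectly correlated (0.998), so they carry nearly the same information, while 22 and 3 are moderately negatively correlated (-0.399)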

In [53]:
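def get_significance(cols):
    """Fit a logit of Self_exclude_flag on the given columns and print the summary."""
    # Note: no intercept is added here; sm.Logit uses the columns exactly as given.
    logit_mod = sm.Logit(df1.loc[:, 'Self_exclude_flag'],
                         df1.loc[:, cols])

    logit_res = logit_mod.fit()

    print(logit_res.summary())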

The above function takes a set of columns, fits a logistic regression of the Self Exclusion flag on them, and prints a summary of the regression results, which we can then use to select the variables for our model

In [54]:
get_significance(['Variable_16_Y0','Variable_22_Y0'])

Optimization terminated successfully.
         Current function value: 0.449731
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:      Self_exclude_flag   No. Observations:                22620
Model:                          Logit   Df Residuals:                    22618
Method:                           MLE   Df Model:                            1
Date:                Wed, 24 Aug 2022   Pseudo R-squ.:                -0.08976
Time:                        09:41:07   Log-Likelihood:                -10173.
converged:                       True   LL-Null:                       -9335.0
Covariance Type:            nonrobust   LLR p-value:                     1.000
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Variable_16_Y0  4.492e-10   5.87e-11      7.652      0.000    3.34e-10    5.64e-10
Variable_22_Y0  2

In [55]:
get_significance(['Variable_16_Y0','Variable_3_Y0'])

Optimization terminated successfully.
         Current function value: 0.420900
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:      Self_exclude_flag   No. Observations:                22620
Model:                          Logit   Df Residuals:                    22618
Method:                           MLE   Df Model:                            1
Date:                Wed, 24 Aug 2022   Pseudo R-squ.:                -0.01990
Time:                        09:41:36   Log-Likelihood:                -9520.8
converged:                       True   LL-Null:                       -9335.0
Covariance Type:            nonrobust   LLR p-value:                     1.000
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Variable_16_Y0  3.018e-10    5.9e-11      5.114      0.000    1.86e-10    4.17e-10
Variable_3_Y0   2

In [56]:
get_significance(['Variable_17_Y0','Variable_22_Y0'])

Optimization terminated successfully.
         Current function value: 0.449746
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:      Self_exclude_flag   No. Observations:                22620
Model:                          Logit   Df Residuals:                    22618
Method:                           MLE   Df Model:                            1
Date:                Wed, 24 Aug 2022   Pseudo R-squ.:                -0.08980
Time:                        09:42:02   Log-Likelihood:                -10173.
converged:                       True   LL-Null:                       -9335.0
Covariance Type:            nonrobust   LLR p-value:                     1.000
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Variable_17_Y0  4.465e-10   5.86e-11      7.615      0.000    3.32e-10    5.61e-10
Variable_22_Y0  2

In [57]:
get_significance(['Variable_17_Y0','Variable_3_Y0'])

Optimization terminated successfully.
         Current function value: 0.420910
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:      Self_exclude_flag   No. Observations:                22620
Model:                          Logit   Df Residuals:                    22618
Method:                           MLE   Df Model:                            1
Date:                Wed, 24 Aug 2022   Pseudo R-squ.:                -0.01992
Time:                        09:42:16   Log-Likelihood:                -9521.0
converged:                       True   LL-Null:                       -9335.0
Covariance Type:            nonrobust   LLR p-value:                     1.000
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Variable_17_Y0   2.99e-10   5.89e-11      5.073      0.000    1.84e-10    4.15e-10
Variable_3_Y0   2

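Log-likelihood is a measure of goodness of fit: for models fitted to the same data, the higher the value, the better the fit. The highest log-likelihood (-9520.8) occurs for variables 16 and 3, so these variables are used for modeling from here on. It is only marginally higher than the value for 17 and 3 (-9521.0), as expected given that 16 and 17 are almost perfectly correlated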
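The comparison can be made explicit with McFadden's pseudo R-squared, 1 - LL / LL-Null, which is the "Pseudo R-squ." that Statsmodels reports. The sketch below uses only the rounded log-likelihoods printed in the four summaries, so the last digit can differ slightly from the summaries; the dictionary and function names are illustrative.

```python
# Log-likelihoods copied (as rounded in the summaries above) for each variable pair.
ll_null = -9335.0
ll_model = {
    ("Variable_16_Y0", "Variable_22_Y0"): -10173.0,
    ("Variable_16_Y0", "Variable_3_Y0"): -9520.8,
    ("Variable_17_Y0", "Variable_22_Y0"): -10173.0,
    ("Variable_17_Y0", "Variable_3_Y0"): -9521.0,
}


def mcfadden_r2(ll, ll_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - ll / ll_null


pseudo_r2 = {pair: mcfadden_r2(ll, ll_null) for pair, ll in ll_model.items()}
best_pair = max(ll_model, key=ll_model.get)

for pair, value in pseudo_r2.items():
    print(pair, "pseudo R2 = %.5f" % value)
print("highest log-likelihood:", best_pair)
```

All four values are negative because every fitted log-likelihood is below LL-Null, the log-likelihood of an intercept-only model. One likely reason is that `sm.Logit` does not add a constant term by itself, so the models above are fitted without an intercept; adding one with `sm.add_constant` would be a natural next check.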