Previously we built a model to predict risk rating 8 vs. not 8.

The purpose of this notebook is to train a classifier that predicts risk rating 9 vs. not 9.

In a similar manner, we can build a model for each risk rating from 1 to 17.
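
As a minimal sketch of that idea, the loop below derives one binary target column per rating from a `risk_rating` column. The toy DataFrame and the `rr_<k>` naming (extended from the `rr_9` column used later in this notebook) are assumptions for illustration, not the client data.

```python
import pandas as pd

# Toy stand-in for the client data; only the risk_rating column matters here.
toy = pd.DataFrame({'risk_rating': [7, 9, 9, 8, 17, 1]})

# One binary "rating k vs. not k" target per rating 1-17.
for k in range(1, 18):
    toy[f'rr_{k}'] = (toy['risk_rating'] == k).astype(int)

print(toy[['risk_rating', 'rr_8', 'rr_9']])
```

Each `rr_<k>` column could then serve as the target of its own classifier, in the same way `rr_9` is used in the cells that follow.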

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle

In [2]:
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [3]:
df = pd.read_csv('df_client.csv')

In [4]:
df.head()

Unnamed: 0,Country_Code,BR Code,Period,Client,risk_rating,Self_exclude_flag,Variable_1_Y0,Variable_1_Y1,Variable_1_Y2,Variable_1_Y3,...,Variable_28_Y1,Variable_28_Y2,Variable_28_Y3,Variable_29_Y0,Variable_29_Y1,Variable_29_Y2,Variable_30_Y0,Variable_30_Y1,Variable_30_Y2,Variable_30_Y3
0,0,0,2017Q2,0,7,1,581103.4591,612122.5165,589483.6484,608043.5063,...,572312.4225,601762.9316,574251.413,577170.3096,594024.8975,616177.8226,588163.8327,623659.1015,608794.9055,574860.551
1,0,0,2016Q1,0,7,1,608189.3682,581513.6158,609292.15,,...,608263.6088,605605.1646,,581951.0166,608354.2362,623470.1198,591055.8212,592011.4052,572734.0028,
2,0,0,2015Q4,0,7,1,626775.445,620338.8464,,,...,621396.294,,,590490.362,620329.2616,,626221.0887,572241.0321,,
3,0,0,2015Q2,0,7,1,613152.4469,595630.8819,,,...,589714.2432,,,580633.8747,576235.2813,,619098.6619,578761.7137,,
4,0,1,2019Q1,1,9,0,615840.2415,603501.2067,587601.9393,610071.5454,...,607400.3547,570273.9177,573434.8221,572413.5987,618435.4264,587802.7283,,,,


In [5]:
# Binary target: 1 if the risk rating is 9, else 0
df['rr_9'] = (df['risk_rating'] == 9).astype(int)

In [6]:
#cols = ['Variable_16_Y0','Variable_17_Y0', 'Variable_22_Y0','Variable_3_Y0', 
# Two features plus the target; the target must stay last because X and y are split by position
cols = ['Variable_16_Y0','Variable_3_Y0', 'rr_9']

In [7]:
df1 = df.loc[:,cols].dropna().reset_index(drop=True)

In [8]:
df1.head()

Unnamed: 0,Variable_16_Y0,Variable_3_Y0,rr_9
0,626840.4657,592374.2988,0
1,606359.1289,595587.6726,0
2,627671.8775,576546.0494,0
3,580507.2233,575206.3258,0
4,622216.4303,598898.0024,0


In [10]:
# Class balance of the target (the top-level pd.value_counts is deprecated in recent pandas)
df1['rr_9'].value_counts()

0    21575
1     1660
Name: rr_9, dtype: int64

In [11]:
# split into inputs and outputs
X, y = df1.loc[:,cols[:-1]], df1.loc[:,cols[-1]]
print(X.shape, y.shape)

(23235, 2) (23235,)


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(15567, 2) (7668, 2) (15567,) (7668,)
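
The data holds several rows per client (one per `Period`), so a plain random split can place rows from the same client on both sides of the split. If that is a concern, a grouped split keeps each client entirely in train or entirely in test. The sketch below uses the already-imported `GroupShuffleSplit` on synthetic data; the toy arrays and the `groups` variable are assumptions for illustration, and this is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(1)

# Synthetic stand-in: 20 clients with 5 rows each, 2 features.
groups = np.repeat(np.arange(20), 5)      # hypothetical client id per row
X_toy = rng.normal(size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=1)
train_idx, test_idx = next(gss.split(X_toy, y_toy, groups=groups))

# No client appears on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(len(train_idx), len(test_idx), shared)
```

Note that `test_size` here applies to the number of groups, not the number of rows, so the row-level proportions can differ from 0.33 when clients have unequal row counts.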


In [13]:
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=1)

In [14]:
# make predictions
yhat = model.predict(X_test)

In [15]:
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Accuracy: 0.927


In [16]:
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, yhat)
print("Confusion Matrix : \n", cm)

Confusion Matrix : 
 [[7104   20]
 [ 542    2]]


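The model correctly classified 7104 of the 7124 class-0 rows (not rating 9).

It correctly classified only 2 of the 544 class-1 rows (rating 9). The 0.927 accuracy therefore mostly reflects the class imbalance: predicting 0 for every test row would score 7124/7668 ≈ 0.929.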
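
The counts in the confusion matrix can be turned into per-class metrics that are more informative than accuracy when one class is rare. The sketch below is plain arithmetic on the four counts printed above; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above:
# rows are true classes, columns are predicted classes.
tn, fp = 7104, 20
fn, tp = 542, 2

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)          # share of actual 1s that were found
precision = tp / (tp + fp)       # share of predicted 1s that were right
baseline = (tn + fp) / total     # accuracy of always predicting 0

print(f'accuracy={accuracy:.3f} recall={recall:.3f} '
      f'precision={precision:.3f} baseline={baseline:.3f}')
# accuracy=0.927 recall=0.004 precision=0.091 baseline=0.929
```

The same figures should be obtainable from `sklearn.metrics.recall_score` and `precision_score` applied to `y_test` and `yhat`.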

In [17]:
unique, counts = np.unique(y_test, return_counts=True)

print("y_test : \n", np.asarray((unique, counts)).T)

y_test : 
 [[   0 7124]
 [   1  544]]


In [18]:
unique, counts = np.unique(yhat, return_counts=True)

print("yhat : \n", np.asarray((unique, counts)).T)

yhat : 
 [[   0 7646]
 [   1   22]]
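
The model predicts class 1 for only 22 of the 7668 test rows. One option sometimes used with imbalanced classes is to threshold the class-1 probability from `predict_proba` at a lower value instead of taking the most probable class. The sketch below shows only the mechanics, on synthetic data; the toy data, the 0.2 threshold, and the variable names are assumptions for illustration, and any real threshold would need to be chosen on validation data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic imbalanced data: roughly 10% positives, weakly related to feature 0.
X_toy = rng.normal(size=(1000, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=2.0, size=1000) > 2.8).astype(int)

clf = RandomForestClassifier(random_state=1)
clf.fit(X_toy[:700], y_toy[:700])

# Column 1 of predict_proba is the estimated probability of class 1.
proba_1 = clf.predict_proba(X_toy[700:])[:, 1]

pred_default = (proba_1 > 0.5).astype(int)
pred_lowered = (proba_1 >= 0.2).astype(int)   # hypothetical threshold

print(pred_default.sum(), pred_lowered.sum())
```

Lowering the threshold can only increase the number of predicted positives, which typically trades precision for recall.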


In [19]:
filename = 'rr_9_model.sav'
# Use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)
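
A saved model can be read back with `pickle.load`. The sketch below round-trips a tiny stand-in model through a temporary file so that it runs on its own; the toy data and the `toy_model.sav` file name are assumptions for illustration. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model so the example is self-contained.
X_toy = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y_toy = [0, 1, 0, 1]
toy_model = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), 'toy_model.sav')  # hypothetical file name

with open(path, 'wb') as f:
    pickle.dump(toy_model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model gives the same predictions as the original.
same = (restored.predict(X_toy) == toy_model.predict(X_toy)).all()
print(same)
```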