<h1> Financial Risk - Machine Learning</h1>


The initial goal of this study is to explore the dataset and develop predictive models that assess the risk of providing auto insurance. To that end, it assumes that the insurer wants to grow its business in scale and profitability and wants insights drawn from the historical data. The ML prediction is done first for the variable TARGET_FLAG and then for TARGET_AMT.


In [1]:
# Import libraries

# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import missingno

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [2]:
insurance_df = pd.read_csv('insurance_data.csv')

Machine Learning Preprocessing

In [3]:
insurance_df.isnull().sum()

INDEX            0
TARGET_FLAG      0
TARGET_AMT       0
KIDSDRIV         0
AGE              6
HOMEKIDS         0
YOJ            454
INCOME         445
PARENT1          0
HOME_VAL       464
MSTATUS          0
SEX              0
EDUCATION        0
JOB            526
TRAVTIME         0
CAR_USE          0
BLUEBOOK         0
TIF              0
CAR_TYPE         0
RED_CAR          0
OLDCLAIM         0
CLM_FREQ         0
REVOKED          0
MVR_PTS          0
CAR_AGE        510
URBANICITY       0
dtype: int64

Removing the INDEX column

In [4]:
insurance_df = insurance_df.drop(['INDEX'], axis=1)

Handling null values

In [5]:
insurance_df[insurance_df.AGE.isnull()].AGE 

239    NaN
1042   NaN
1314   NaN
2970   NaN
3459   NaN
4155   NaN
Name: AGE, dtype: float64

In [6]:
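# Fill the missing ages with the most frequent age; assigning the result back
# avoids relying on an in-place fill through a column selection
insurance_df['AGE'] = insurance_df['AGE'].fillna(insurance_df['AGE'].mode()[0])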

In [7]:
insurance_df[insurance_df.AGE.isnull()].AGE

Series([], Name: AGE, dtype: float64)


Vehicle age with a negative value


In [8]:
# mode() returns a Series, so take its first element to assign a scalar
insurance_df.loc[insurance_df['CAR_AGE'] == -3, 'CAR_AGE'] = insurance_df['CAR_AGE'].mode()[0]
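The replacement of an invalid vehicle age by the most frequent value can be sketched on a small, made-up frame. The `toy` data below is hypothetical and only illustrates that `mode()` returns a Series, so a scalar has to be extracted before the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one negative age and one missing age
toy = pd.DataFrame({"CAR_AGE": [1.0, 1.0, 8.0, -3.0, np.nan]})

# mode() ignores NaN by default and returns a Series; take the first mode
mode_value = toy["CAR_AGE"].mode()[0]

# Replace every negative age with that scalar
toy.loc[toy["CAR_AGE"] < 0, "CAR_AGE"] = mode_value
print(toy)
```

The missing value is left untouched here, since this step only targets the negative entry.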

In [9]:
columns = ['INCOME','HOME_VAL','BLUEBOOK','OLDCLAIM']
insurance_df[insurance_df['INCOME'].isnull()]['INCOME']

3       NaN
28      NaN
53      NaN
60      NaN
63      NaN
       ... 
8084    NaN
8101    NaN
8104    NaN
8107    NaN
8136    NaN
Name: INCOME, Length: 445, dtype: object

In [10]:
for i in columns:
    insurance_df.drop(insurance_df[insurance_df[i].isnull()].index, axis=0, inplace=True)

In [11]:
insurance_df[insurance_df['INCOME'].isnull()]['INCOME']

Series([], Name: INCOME, dtype: object)

Cleaning currency-formatted values in the numeric fields: the $ sign and the commas must be removed.

In [12]:
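for i in columns:
    # regex=False treats "$" as a literal character, not as a regex end-of-string anchor
    insurance_df[i] = insurance_df[i].str.replace("$", "", regex=False)
    insurance_df[i] = insurance_df[i].str.replace(",", "", regex=False).astype(int)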
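As an alternative sketch of the same cleanup, both symbols can be stripped with a single character-class pattern and the result converted with `pd.to_numeric`, which turns missing entries into NaN instead of raising. The `raw` values below are made up for illustration.

```python
import pandas as pd

# Hypothetical currency strings, including a missing entry
raw = pd.Series(["$67,349", "$0", None, "$1,234,567"])

# Remove "$" and "," in one pass with a regex character class
cleaned = raw.str.replace(r"[$,]", "", regex=True)

# errors="coerce" converts anything unparseable (here the missing value) to NaN
amounts = pd.to_numeric(cleaned, errors="coerce")
print(amounts)
```

With this variant the rows with missing amounts would not need to be dropped before the conversion, at the cost of a float column rather than an integer one.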

Removing rows with null values in the remaining columns

In [13]:
columns2 = ['YOJ','JOB','CAR_AGE']
for i in columns2:
    insurance_df.drop(insurance_df[insurance_df[i].isnull()].index, axis=0, inplace=True)

Normalizing category labels that carry a stray prefix, such as z_F and z_No

In [14]:
insurance_df.loc[insurance_df['MSTATUS'] == 'z_No', ['MSTATUS']] = 'No'
insurance_df.loc[insurance_df['SEX'] == 'z_F', ['SEX']] = 'F'
insurance_df.loc[insurance_df['EDUCATION'] == 'z_High School', ['EDUCATION']] = 'High School'
insurance_df.loc[insurance_df['EDUCATION'] == '<High School', ['EDUCATION']] = 'High School'
insurance_df.loc[insurance_df['JOB'] == 'z_Blue Collar', ['JOB']] = 'Blue Collar'
insurance_df.loc[insurance_df['CAR_TYPE'] == 'z_SUV', ['CAR_TYPE']]  = 'SUV'
insurance_df.loc[insurance_df['URBANICITY'] == 'z_Highly Rural/ Rural', ['URBANICITY']]  = 'Highly Rural/ Rural'
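The `z_` prefix can also be stripped from several columns at once with an anchored regular expression. This is a minimal sketch on a hypothetical `toy` frame. Unlike the cell above, it does not merge `<High School` into `High School`.

```python
import pandas as pd

# Hypothetical frame with prefixed and unprefixed labels
toy = pd.DataFrame({
    "SEX": ["M", "z_F"],
    "CAR_TYPE": ["z_SUV", "Minivan"],
})

for col in ["SEX", "CAR_TYPE"]:
    # "^z_" matches the prefix only at the start of the label
    toy[col] = toy[col].str.replace(r"^z_", "", regex=True)

print(toy)
```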

Converting the categorical variables to numeric.

In [15]:
insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6047 entries, 0 to 8160
Data columns (total 25 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   TARGET_FLAG  6047 non-null   int64  
 1   TARGET_AMT   6047 non-null   float64
 2   KIDSDRIV     6047 non-null   int64  
 3   AGE          6047 non-null   float64
 4   HOMEKIDS     6047 non-null   int64  
 5   YOJ          6047 non-null   float64
 6   INCOME       6047 non-null   int64  
 7   PARENT1      6047 non-null   object 
 8   HOME_VAL     6047 non-null   int64  
 9   MSTATUS      6047 non-null   object 
 10  SEX          6047 non-null   object 
 11  EDUCATION    6047 non-null   object 
 12  JOB          6047 non-null   object 
 13  TRAVTIME     6047 non-null   int64  
 14  CAR_USE      6047 non-null   object 
 15  BLUEBOOK     6047 non-null   int64  
 16  TIF          6047 non-null   int64  
 17  CAR_TYPE     6047 non-null   object 
 18  RED_CAR      6047 non-null   object 
 19  OLDCLA

In [16]:
labelencoder1 = LabelEncoder()
insurance_df['PARENT1'] = labelencoder1.fit_transform(insurance_df['PARENT1'])

labelencoder2 = LabelEncoder()
insurance_df['MSTATUS'] = labelencoder2.fit_transform(insurance_df['MSTATUS'])

labelencoder3 = LabelEncoder()
insurance_df['SEX'] = labelencoder3.fit_transform(insurance_df['SEX'])

labelencoder4 = LabelEncoder()
insurance_df['EDUCATION'] = labelencoder4.fit_transform(insurance_df['EDUCATION'])

labelencoder5 = LabelEncoder()
insurance_df['JOB'] = labelencoder5.fit_transform(insurance_df['JOB'])

labelencoder6 = LabelEncoder()
insurance_df['CAR_USE'] = labelencoder6.fit_transform(insurance_df['CAR_USE'])

labelencoder7 = LabelEncoder()
insurance_df['CAR_TYPE'] = labelencoder7.fit_transform(insurance_df['CAR_TYPE'])

labelencoder8 = LabelEncoder()
insurance_df['RED_CAR'] = labelencoder8.fit_transform(insurance_df['RED_CAR'] )

labelencoder9 = LabelEncoder()
insurance_df['REVOKED'] = labelencoder9.fit_transform(insurance_df['REVOKED'])

labelencoder10 = LabelEncoder()
insurance_df['URBANICITY'] = labelencoder10.fit_transform(insurance_df['URBANICITY'])
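The ten encoder cells follow one pattern, which could be written as a loop that keeps each fitted encoder in a dictionary so the integer codes can be mapped back to labels later. The `toy` frame and the `encoders` dictionary are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns
toy = pd.DataFrame({
    "PARENT1": ["No", "Yes", "No"],
    "CAR_USE": ["Private", "Commercial", "Private"],
})

encoders = {}
for col in ["PARENT1", "CAR_USE"]:
    # One encoder per column; classes are numbered in sorted order
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
# Recover the original labels from the integer codes
print(encoders["PARENT1"].inverse_transform([0, 1]))
```

Note that the integer codes impose an arbitrary order on nominal categories such as JOB or CAR_TYPE, which distance-based and linear models may interpret as meaningful.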

Model 1 for TARGET_FLAG: uses the features found most influential in the exploratory analysis.

In [17]:
columns = ['HOMEKIDS', 'OLDCLAIM', 'CLM_FREQ', 'MVR_PTS', 'PARENT1', 'MSTATUS', 'EDUCATION', 'JOB', 'CAR_USE', 'CAR_TYPE', 'REVOKED', 'URBANICITY']
insurance_ml_df = insurance_df[columns]

In [18]:
x = insurance_ml_df
y = insurance_df['TARGET_FLAG']

In [19]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

Model 2 for TARGET_FLAG: uses all features.

In [20]:
x2 = insurance_df.drop(['TARGET_FLAG','TARGET_AMT'], axis=1)
y2 = insurance_df['TARGET_FLAG']

In [21]:
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size = 0.3)
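Because the split is random, each run of the notebook yields different rows in each partition. A hedged sketch of a reproducible split that also preserves the class ratio is shown below on synthetic data. The `random_state` and `stratify` settings are suggestions, not what the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 25% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([1] * 50 + [0] * 150)

# random_state fixes the shuffle; stratify keeps the 25% ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```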

<h1>Machine Learning - Target Flag 1 and 2</h1>

RandomForestClassifier

In [22]:
# Note: each model here is fit and scored on the same test split, so the
# accuracy is in-sample and overstates performance on unseen data.
model = RandomForestClassifier()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

# Actual and predicted labels on a 50-row slice of the test set
t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = RandomForestClassifier()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

# Result: predicted positives as a percentage of actual positives in the slice
print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.97631
Accuracy 2: 1.0
Result 1: 94.11765%
Result 2: 100.0%
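For an estimate of how a classifier behaves on unseen policies, the usual pattern is to fit on the training split and score on the held-out split. The sketch below uses synthetic data from `make_classification` as a stand-in for the insurance features. The sample size, class weights and seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem with 12 features
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.73, 0.27], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)  # learn only from the training split

train_acc = model.score(X_train, y_train)  # in-sample accuracy
test_acc = model.score(X_test, y_test)     # held-out accuracy
print(f"Train accuracy: {train_acc:.5f}")
print(f"Test accuracy:  {test_acc:.5f}")
```

The gap between the two numbers gives a rough indication of overfitting. Tree ensembles typically score close to 1.0 on the data they were fit on.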


LinearSVC

In [84]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = LinearSVC()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = LinearSVC()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.73719
Accuracy 2: 0.73333
Result 1: 8.33333%
Result 2: 0.0%


LogisticRegression

In [83]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = LogisticRegression()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = LogisticRegression()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.7449
Accuracy 2: 0.7416
Result 1: 58.33333%
Result 2: 0.0%


ExtraTreesClassifier

In [82]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = ExtraTreesClassifier()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = ExtraTreesClassifier()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.97466
Accuracy 2: 1.0
Result 1: 83.33333%
Result 2: 100.0%


KNeighborsClassifier

In [77]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = KNeighborsClassifier()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = KNeighborsClassifier()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.80606
Accuracy 2: 0.79008
Result 1: 58.33333%
Result 2: 85.71429%


GaussianNB

In [78]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = GaussianNB()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = GaussianNB()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.73994
Accuracy 2: 0.7405
Result 1: 83.33333%
Result 2: 71.42857%


Perceptron

In [79]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = Perceptron()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = Perceptron()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.73333
Accuracy 2: 0.71625
Result 1: 16.66667%
Result 2: 85.71429%


SGDClassifier

In [80]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = SGDClassifier()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = SGDClassifier()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.73609
Accuracy 2: 0.73994
Result 1: 25.0%
Result 2: 0.0%


DecisionTreeClassifier

In [81]:
# Fit and scored on the same test split, so the accuracy is in-sample
model = DecisionTreeClassifier()
model.fit(x_test, y_test)
result = model.score(x_test, y_test)
print(f"Accuracy 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

model = DecisionTreeClassifier()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"Accuracy 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

Accuracy 1: 0.97466
Accuracy 2: 1.0
Result 1: 83.33333%
Result 2: 100.0%
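The second figure printed for each classifier divides the number of predicted positives by the number of actual positives in a 50-row slice. It compares counts only, so it can reach 100% even when no individual prediction is right. The hand-made labels below (hypothetical, not taken from the dataset) illustrate this and contrast the ratio with recall and precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: two actual positives, two predicted positives,
# but none of the predicted positives coincide with the actual ones
y_true = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0])

# Same formula as the notebook: predicted count over actual count
count_ratio = y_pred.sum() / y_true.sum() * 100

# Per-row agreement on the positive class
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

print(f"Count ratio: {count_ratio}%")
print(f"Recall: {recall}, precision: {precision}")
```

Here the count ratio is 100% while both recall and precision are 0, which is why the ratio is best read alongside per-row metrics.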


<h1>Machine Learning - Target Amt 1 and 2</h1>

Model 1 for TARGET_AMT: uses the features found most influential in the exploratory analysis.

In [32]:
columns = ['HOMEKIDS', 'OLDCLAIM', 'CLM_FREQ', 'MVR_PTS', 'PARENT1', 'MSTATUS', 'EDUCATION', 'JOB', 'CAR_USE', 'CAR_TYPE', 'REVOKED', 'URBANICITY']
insurance_ml_df = insurance_df[columns]

In [33]:
x = insurance_ml_df
y = insurance_df['TARGET_AMT']

In [34]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

Model 2 for TARGET_AMT: uses all features.

In [26]:
x2 = insurance_df.drop(['TARGET_FLAG','TARGET_AMT'], axis=1)
y2 = insurance_df['TARGET_AMT']

In [27]:
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size = 0.3)

Ridge Regression 

In [29]:
from sklearn.linear_model import Ridge

In [44]:
# For a regressor, score() returns the coefficient of determination (R²),
# not an accuracy
model = Ridge()
model.fit(x_train, y_train)
result = model.score(x_test, y_test)
print(f"R² 1: {round(result,5)}")

t1 = np.array(y_test[205:255])
t2 = model.predict(x_test[205:255])

# Result: sum of predicted amounts as a percentage of the sum of actual amounts
print(f'Result 1: {round((t2.sum())/(t1.sum())*100,5)}%')
predicted_price = model.predict(x_test[205:255])
print(predicted_price)

print('')
# Model 2 is fit and scored on the same test split, so its R² is in-sample
model = Ridge()
model.fit(x2_test, y2_test)
result = model.score(x2_test, y2_test)
print(f"R² 2: {round(result,5)}")

t11 = np.array(y2_test[205:255])
t22 = model.predict(x2_test[205:255])

print(f'Result 2: {round((t22.sum())/(t11.sum())*100,5)}%')

predicted_price2 = model.predict(x2_test[205:255])
print(predicted_price2)

R² 1: 0.04529
Result 1: 72.76542%
[ -636.83935694   657.9691175   1132.85782347  1508.34390522
   772.38800375  3249.68480422   699.1813017   2163.25686385
  2117.23912477  -793.27191113  2539.78418832  2591.66380539
   886.20640378   552.51480419   371.25328942  1765.86915118
  1186.26051796  2067.31406643   584.76241545  2213.52580413
  2753.20063236  1533.1906426   1326.3468519    874.34256831
   854.16001902  2303.03314039  1070.64433879  1487.64017909
   205.02792936  3375.39409532   364.78793289  1021.76795608
   979.02401774  2295.36532478  -636.83935694 -1173.53364259
   271.52256034  1163.46195275  2107.75820601   101.96642944
  2439.9580912   -676.11239642  1338.63591726  3713.41557207
   674.41503282  2616.19499512  1571.51857554  1073.27501185
  1585.26933969  -119.20357359]

R² 2: 0.08732
Result 2: 42.36925%
[  815.34842753 -1268.31305499  1147.27599032   332.88343275
  2304.62589295  3523.7532236    319.44170156  2530.60660048
  3016.56214695  2231.18476
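The value returned by `Ridge.score` is the coefficient of determination R², so values near 0 mean the model explains little of the variance in the claim amount. A sketch on synthetic data (the coefficients, noise level and seeds are assumptions) shows R² next to the mean absolute error, which is expressed in the units of the target.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear target with a small amount of noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = Ridge(alpha=1.0).fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)              # same value as model.score(X_test, y_test)
mae = mean_absolute_error(y_test, pred)  # average absolute error in target units
print(f"R²: {r2:.5f}")
print(f"MAE: {mae:.5f}")
```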

LSTM

MLP

In [73]:
from imblearn.over_sampling import SMOTE 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_auc_score

In [70]:
# Oversample the minority class of the training data with SMOTE
model = SMOTE()
x_smote1, y_smote1 = model.fit_resample(x_train, y_train)

neural_network = MLPClassifier(
    max_iter=1500, verbose=False, solver='adam', activation='relu',
    n_iter_no_change=40, learning_rate_init=0.0001,
    hidden_layer_sizes=(100, 100))

neural_network.fit(x_smote1, y_smote1)

# Predict on the held-out test features
previsoes_v1 = neural_network.predict(x_test)

In [81]:
previsoes_v1 = previsoes_v1[:-1]

In [79]:
neural_network.score(x_train, y_train)

0.46526465028355385

In [82]:
# scikit-learn metrics take the true labels first, then the predictions
accuracy_score(y_test, previsoes_v1)

0.45865490628445427

In [83]:
precision_score(y_test, previsoes_v1)

0.27841409691629954

In [84]:
recall_score(y_test, previsoes_v1)

0.6597077244258872

In [85]:
roc_auc_score(y_test, previsoes_v1)

0.5231122891792357
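scikit-learn's metric functions expect the true labels as the first argument and the predictions as the second. Accuracy is unaffected by the order, but for binary labels swapping the arguments exchanges precision and recall. The labels below are hypothetical and serve only to illustrate this.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical binary labels: three actual positives, one predicted positive
y_true = np.array([1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # TP / predicted positives = 1 / 1
recall = recall_score(y_true, y_pred)        # TP / actual positives = 1 / 3

# With the arguments swapped, "precision" yields the recall value
swapped = precision_score(y_pred, y_true)

print(precision, recall, swapped)
```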

