# Objective:
Predict whether customers will pay their next bill and design communication strategies for customers with a high
probability of not paying their next bills.


# Q 2.1 - What type of problem are we facing, and which technique would you use to solve it?
This is a payment default problem, more specifically a binary classification problem. I would use a machine learning classification technique. In this case, the Support Vector Machine algorithm was used.

# Q 2.2 - If you could only send communication to 10% of the customers because of the high cost, which of the customers below would you contact?
The 10% of customers with the highest probability of not paying their next bills. This notebook shows how to identify them and who they are.
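The selection rule can be sketched in a few lines. This is only a minimal illustration with made-up numbers: the `top_fraction` helper, the customer ids and the probabilities are hypothetical and do not come from the dataset.

```python
def top_fraction(prob_by_customer, fraction=0.10):
    """Return the ids with the highest probability, limited to `fraction` of all customers."""
    n_selected = int(len(prob_by_customer) * fraction)  # truncates, like int(n * 0.10)
    ranked = sorted(prob_by_customer, key=prob_by_customer.get, reverse=True)
    return ranked[:n_selected]

# Hypothetical probabilities of not paying the next bill for 10 customers
probs = {101: 0.12, 102: 0.91, 103: 0.40, 104: 0.77, 105: 0.05,
         106: 0.33, 107: 0.58, 108: 0.21, 109: 0.64, 110: 0.09}

selected = top_fraction(probs)  # 10% of 10 customers -> 1 customer
print(selected)
```

Truncating with `int` means the selection never exceeds the 10% budget, at the cost of occasionally contacting one customer fewer.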

# Columns
* default_payment - _**Dependent variable**_ : Payment of the next bill (1: Yes, 0: No)


* LIMIT_BAL - _**Numeric variable**_ : Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.


* SEX - _**Nominal categorical variable**_ : Gender (1 = male; 2 = female).


* EDUCATION - _**Nominal categorical variable**_ : Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).


* MARRIAGE - _**Nominal categorical variable**_ : Marital status (1 = married; 2 = single; 3 = others).


* AGE - _**Numeric variable**_ : Age (year).


* PAY_0, PAY_2 ~ PAY_6 - _**Numeric variable**_ : History of past payment. The past monthly payment records (from April to September, 2019) are tracked as follows: PAY_0 = the repayment status in September, 2019; PAY_2 = the repayment status in August, 2019; ... ; PAY_6 = the repayment status in April, 2019. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ... ; 8 = payment delay for eight months; 9 = payment delay for nine months and above.


* BILL_AMT1 ~ BILL_AMT6 - _**Numeric variable**_ : Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2019; BILL_AMT2 = amount of bill statement in August, 2019; ... ; BILL_AMT6 = amount of bill statement in April, 2019.


* PAY_AMT1 ~ PAY_AMT6 - _**Numeric variable**_ : Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2019; PAY_AMT2 = amount paid in August, 2019; ... ; PAY_AMT6 = amount paid in April, 2019.

In [301]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler,MinMaxScaler,MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV, cross_val_score
from sklearn.metrics import classification_report,confusion_matrix,f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC,SVC
from imblearn.over_sampling import SMOTE


# 1 - Initial analysis

In [287]:
train = pd.read_csv('data/questao2_creditcard.csv',sep=';',skiprows=1)
test = pd.read_csv('data/questao22_creditcard_clientes.csv',sep=';',skiprows=1)

In [75]:
train.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_payment
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [76]:
test.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,70000,2,2,2,26,2,0,0,2,2,...,45020,44006,46905,46012,2007,3582,0,3601,0,1820
1,230000,2,1,2,27,-1,-1,-1,-1,-1,...,13266,15339,14307,36923,17270,13281,15339,14307,37292,0
2,50000,1,2,2,33,2,0,0,0,0,...,22102,22734,23217,23680,1718,1500,1000,1000,1000,716
3,50000,1,1,2,29,2,2,2,2,2,...,26591,25865,27667,28264,0,2700,0,2225,1200,0
4,10000,1,2,1,56,2,2,2,0,0,...,3978,4062,4196,4326,2300,0,150,200,200,160


In [273]:
all(train.columns[:-1] == test.columns)

True

In [78]:
print(train.isnull().sum(),'\n')
print(test.isnull().sum())

LIMIT_BAL          0
SEX                0
EDUCATION          0
MARRIAGE           0
AGE                0
PAY_0              0
PAY_2              0
PAY_3              0
PAY_4              0
PAY_5              0
PAY_6              0
BILL_AMT1          0
BILL_AMT2          0
BILL_AMT3          0
BILL_AMT4          0
BILL_AMT5          0
BILL_AMT6          0
PAY_AMT1           0
PAY_AMT2           0
PAY_AMT3           0
PAY_AMT4           0
PAY_AMT5           0
PAY_AMT6           0
default_payment    0
dtype: int64 

LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_0        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
dtype: int64


In [258]:
# Imbalanced data
train['default_payment'].value_counts()

0    21219
1     6058
Name: default_payment, dtype: int64
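To make the imbalance concrete, the counts above can be turned into class shares. The snippet below is a small sketch that only reuses the two counts printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts reported by value_counts() above
counts = {0: 21219, 1: 6058}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts[0] / counts[1]  # majority size divided by minority size

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Class 1 accounts for roughly 22% of the rows, which is why a resampling step is applied later.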

# 2 - Variable preprocessing

In [288]:
# Categorical
categorical = ['SEX','EDUCATION','MARRIAGE']

train_cat = pd.concat([pd.get_dummies(train[categorical[0]],prefix=categorical[0]), 
                     pd.get_dummies(train[categorical[1]],prefix=categorical[1]),
                     pd.get_dummies(train[categorical[2]],prefix=categorical[2])
                        ],axis=1)

test_cat = pd.concat([pd.get_dummies(test[categorical[0]],prefix=categorical[0]), 
                     pd.get_dummies(test[categorical[1]],prefix=categorical[1]),
                     pd.get_dummies(test[categorical[2]],prefix=categorical[2])
                        ],axis=1)

In [275]:
all(train_cat.columns == test_cat.columns) 

True
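The check passes here because both files contain the same categories. If a category were missing from one file, `pd.get_dummies` would produce different columns. A possible safeguard, sketched below with hypothetical toy frames (`train_toy`, `test_toy`), is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Hypothetical toy frames: category 3 is missing from the second one
train_toy = pd.DataFrame({"MARRIAGE": [1, 2, 3, 1]})
test_toy = pd.DataFrame({"MARRIAGE": [2, 1, 2]})

train_dummies = pd.get_dummies(train_toy["MARRIAGE"], prefix="MARRIAGE")
test_dummies = pd.get_dummies(test_toy["MARRIAGE"], prefix="MARRIAGE")

# Give the test frame exactly the training columns; absent categories become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```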

In [289]:
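# Numerical
# A list keeps the column order deterministic; recent pandas versions reject a set as a column indexer
numerical = [col for col in test.columns if col not in categorical]

train_num,test_num = train[numerical],test[numerical]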

In [290]:
scaler = StandardScaler()

train_num = scaler.fit_transform(train_num)
test_num = scaler.transform(test_num)

train_num = pd.DataFrame(train_num,columns = numerical,index = train.index)
test_num = pd.DataFrame(test_num,columns = numerical,index = test.index)
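What `fit_transform` on the training data followed by `transform` on the test data amounts to can be written out by hand. The sketch below uses hypothetical values for a single column; the point is that the test column reuses the mean and standard deviation estimated on the training column.

```python
import numpy as np

# Hypothetical values standing in for one numeric column
train_col = np.array([20000.0, 120000.0, 90000.0, 50000.0])
test_col = np.array([70000.0, 230000.0])

# Statistics are estimated on the training data only
mean = train_col.mean()
std = train_col.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std  # reuses the training mean and std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```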

# 3 - Train/test split and oversampling

In [291]:
X = pd.concat([train_cat,train_num],axis=1)
y = train['default_payment']

In [292]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [293]:
y_train.value_counts()

0    14221
1     4054
Name: default_payment, dtype: int64

In [294]:
over_sampler = SMOTE(random_state=42)
X_train_res, y_train_res = over_sampler.fit_resample(X_train, y_train)

In [296]:
y_train_res.value_counts()

1    14221
0    14221
Name: default_payment, dtype: int64
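SMOTE balances the classes by creating synthetic minority samples rather than duplicating rows. The snippet below is a simplified illustration of the interpolation idea, not the `imblearn` implementation; the points and the `synthetic_sample` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]])

def synthetic_sample(points, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbors."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbors = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

new_points = np.array([synthetic_sample(minority) for _ in range(5)])
print(new_points)
```

Each synthetic point lies on the segment between two existing minority points, so it stays inside the region the minority class already occupies.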

# 4 - Performance of classification models

In [305]:
dict_classifiers = {"Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(solver = "liblinear",random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Ada Boost": AdaBoostClassifier(),
    "Linear SVM": LinearSVC(dual=False,random_state=0),
    "SVM": SVC(probability=True,random_state=0)
}

classifiers_names = list(dict_classifiers.keys())
classifiers_values = list(dict_classifiers.values())


In [303]:
def train_model(model):
    # Fit on the SMOTE-resampled training split and time the fit
    t0 = time.time()
    model.fit(X_train_res,y_train_res)
    print(f'Trained in {time.time()-t0:.3f} seconds')
    
    # Evaluate on the untouched hold-out split; average=None returns one F1 per class
    pred = model.predict(X_test)
    f1_0,f1_1 = f1_score(y_test,pred,average=None)
    print(f"f1_0 = {f1_0}\nf1_1 = {f1_1}\n")

In [299]:
for key,value in zip(classifiers_names,classifiers_values):
    print('--'*9+f'{key}'+'--'*9)
    train_model(value)

------------------Random Forest------------------
Trained in 6.131 seconds
f1_0 = 0.8680964395850855
f1_1 = 0.496252676659529

------------------K-Nearest Neighbors------------------
Trained in 0.112 seconds
f1_0 = 0.7591752249024761
f1_1 = 0.4442403086533162

------------------Logistic Regression------------------
Trained in 0.434 seconds
f1_0 = 0.8239610632721828
f1_1 = 0.49429984942998495

------------------Gradient Boosting------------------
Trained in 11.364 seconds
f1_0 = 0.8689616092931008
f1_1 = 0.5239320638188368

------------------Ada Boost------------------
Trained in 2.512 seconds
f1_0 = 0.8399177559112939
f1_1 = 0.5029639762881898

------------------Linear SVM------------------
Trained in 0.489 seconds
f1_0 = 0.8315773853687491
f1_1 = 0.49845338046840476

------------------SVM------------------
Trained in 300.623 seconds
f1_0 = 0.8661147587781902
f1_1 = 0.5300601202404809


# 5 - Support Vector Machine Classifier

In [307]:
# Custom scorer with the (estimator, X, y_true) signature accepted by cross_val_score; returns the F1 of class 1
def f1(model,X_test,y_test):
    pred = model.predict(X_test)
    f1_0,f1_1 = f1_score(y_test,pred,average=None)    
    return f1_1

In [317]:
model = SVC(probability=True,random_state=0)
model.fit(X_train_res,y_train_res)
pred = model.predict(X_test)

In [320]:
scores = cross_val_score(model,X,y, cv = 5, scoring=f1,n_jobs=4)
scores

array([0.41444444, 0.43124312, 0.48912467, 0.48938547, 0.45330296])

In [321]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87      6998
           1       0.53      0.53      0.53      2004

    accuracy                           0.79      9002
   macro avg       0.70      0.70      0.70      9002
weighted avg       0.79      0.79      0.79      9002



In [322]:
conf_matrix=confusion_matrix(y_test, pred)
conf_matrix

array([[6068,  930],
       [ 946, 1058]])
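The figures in the classification report can be recomputed from the confusion matrix above, which helps when reading it. The sketch below treats class 1 as the positive class and follows the scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
conf = [[6068, 930],
        [946, 1058]]

tn, fp = conf[0]
fn, tp = conf[1]

precision_1 = tp / (tp + fp)  # of the predicted 1s, how many are truly 1
recall_1 = tp / (tp + fn)     # of the true 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f} accuracy={accuracy:.2f}")
```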

In [323]:
n_test = y_test.shape[0]
print(f'True negatives (share of test set) = {conf_matrix[0,0] / n_test:.1%}')
print(f'True positives (share of test set) = {conf_matrix[1,1] / n_test:.1%}')
print(f'False positives, Type I error (share of test set) = {conf_matrix[0,1] / n_test:.1%}')
print(f'False negatives, Type II error (share of test set) = {conf_matrix[1,0] / n_test:.1%}')

True negatives (share of test set) = 67.4%
True positives (share of test set) = 11.8%
False positives, Type I error (share of test set) = 10.3%
False negatives, Type II error (share of test set) = 10.5%


#### Despite using SMOTE to work around the class imbalance, the errors remain relatively high.

# 6 - Customers most likely to default

In [324]:
# test_cat and test_num come from "questao22_creditcard_clientes.csv"
Xtest = pd.concat([test_cat,test_num],axis=1)

In [325]:
pred = model.predict(Xtest)
prob = model.predict_proba(Xtest)

In [326]:
model.classes_
# that is, class 0 is column 0 and class 1 is column 1 of prob

array([0, 1])
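Instead of relying on the column order by inspection, the column for a given class can be looked up from `classes_`. The arrays below are hypothetical stand-ins for `model.classes_` and for the output of `predict_proba`; this is only a sketch of the lookup.

```python
import numpy as np

# Hypothetical stand-ins for model.classes_ and the output of predict_proba
classes = np.array([0, 1])
proba = np.array([[0.90, 0.10],
                  [0.25, 0.75],
                  [0.60, 0.40]])

target_class = 0
column = int(np.flatnonzero(classes == target_class)[0])  # position of the class in classes_
prob_target = proba[:, column]

print(column, prob_target)
```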

In [327]:
# Not paying = class 0

prob0 = pd.DataFrame(prob[:,0],columns=['Probability of not paying the next bills'],index=Xtest.index)
prob0 = prob0.sort_values(by='Probability of not paying the next bills',ascending=False)

In [328]:
# 10% of customers with the highest probability of not paying the next bills.
percent = int(Xtest.shape[0] * 0.10)
prob0.head(percent)

Unnamed: 0,Probability of not paying the next bills
1052,0.911386
594,0.900472
263,0.883391
2620,0.882952
1453,0.879276
...,...
1449,0.775648
1957,0.775541
777,0.775498
515,0.775477


# 7 - Answer to Q 2.2

In [329]:
print('Customers who will receive the communication:\n',list(prob0.index[:percent]))

Customers who will receive the communication:
 [1052, 594, 263, 2620, 1453, 798, 2294, 1142, 318, 1749, 47, 538, 1069, 2225, 2556, 1818, 2139, 2174, 842, 2577, 2214, 1110, 1888, 587, 618, 354, 2259, 333, 2258, 1202, 1504, 2601, 2614, 814, 1103, 958, 2423, 2602, 2495, 957, 2196, 2321, 1617, 2199, 562, 1981, 2351, 1729, 1946, 112, 852, 2405, 2061, 1300, 2166, 45, 266, 2383, 649, 2541, 2040, 1213, 2356, 39, 403, 937, 17, 298, 865, 2096, 1542, 1745, 2354, 2481, 1372, 997, 1382, 872, 1466, 108, 2484, 2478, 1402, 2567, 911, 843, 305, 1702, 2508, 1815, 1987, 1908, 1270, 462, 1675, 449, 821, 671, 1212, 190, 816, 1458, 636, 244, 1521, 785, 2411, 583, 1891, 545, 577, 1875, 1600, 338, 2342, 106, 2018, 282, 1837, 2455, 2084, 588, 733, 2448, 973, 907, 2016, 2291, 1269, 2628, 2041, 2352, 2538, 2333, 625, 65, 418, 1678, 2666, 2471, 815, 531, 1438, 2388, 766, 2113, 2346, 711, 2606, 1778, 948, 1400, 1810, 1182, 1043, 376, 1944, 2201, 528, 1807, 1621, 1423, 1123, 771, 371, 1368, 717, 2518, 2615, 1640, 281, 258

In [330]:
prob0.to_csv('prob_clientes.csv',index_label='Cliente')