# Objective:
Predict whether each customer will pay their next invoice, and design communication strategies for customers with a high
probability of not paying their next invoices.


# Q 2.1 - What type of problem are we facing, and which technique would you use to solve it?
This is a default (non-payment) problem, more specifically a binary classification problem. I would use a machine learning classification technique; in this case, the Random Forest algorithm was used.

# Q 2.2 - If you could only send communications to 10% of customers because of the high cost, which of the customers below would you contact?
The 10% of customers with the highest probability of not paying their next invoices. This notebook shows how to arrive at those 10% and who they are.

# Columns
* default_payment - _**Dependent variable**_ : Payment of the next invoice (1: Yes, 0: No).

* LIMIT_BAL - _**Numeric variable**_ : Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

* SEX - _**Nominal categorical variable**_ : Gender (1 = male; 2 = female).

* EDUCATION - _**Nominal categorical variable**_ : Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

* MARRIAGE - _**Nominal categorical variable**_ : Marital status (1 = married; 2 = single; 3 = others).

* AGE - _**Numeric variable**_ : Age (year).

* PAY_0, PAY_2 ~ PAY_6 - _**Numeric variable**_ : History of past payment. We tracked the past monthly payment records (from April to September, 2019) as follows: PAY_0 = the repayment status in September, 2019; PAY_2 = the repayment status in August, 2019; ... ; PAY_6 = the repayment status in April, 2019. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ... ; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

* BILL_AMT1 ~ BILL_AMT6 - _**Numeric variable**_ : Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2019; BILL_AMT2 = amount of bill statement in August, 2019; ... ; BILL_AMT6 = amount of bill statement in April, 2019.

* PAY_AMT1 ~ PAY_AMT6 - _**Numeric variable**_ : Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2019; PAY_AMT2 = amount paid in August, 2019; ... ; PAY_AMT6 = amount paid in April, 2019.

In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler,MinMaxScaler,MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import classification_report,confusion_matrix,f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC,SVC
from imblearn.over_sampling import RandomOverSampler

# 1 - Initial analysis

In [74]:
train = pd.read_csv('data/questao2_creditcard.csv',sep=';',skiprows=1)
test = pd.read_csv('data/questao22_creditcard_clientes.csv',sep=';',skiprows=1)

In [75]:
train.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default_payment
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [76]:
test.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,70000,2,2,2,26,2,0,0,2,2,...,45020,44006,46905,46012,2007,3582,0,3601,0,1820
1,230000,2,1,2,27,-1,-1,-1,-1,-1,...,13266,15339,14307,36923,17270,13281,15339,14307,37292,0
2,50000,1,2,2,33,2,0,0,0,0,...,22102,22734,23217,23680,1718,1500,1000,1000,1000,716
3,50000,1,1,2,29,2,2,2,2,2,...,26591,25865,27667,28264,0,2700,0,2225,1200,0
4,10000,1,2,1,56,2,2,2,0,0,...,3978,4062,4196,4326,2300,0,150,200,200,160


In [77]:
all(train.columns[:-1] == test.columns)

True
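
The elementwise `==` check above needs both column indexes to have the same length. As a hedged alternative, the sketch below uses `Index.equals` and `Index.difference`, which also report which columns differ. The column names are a shortened, illustrative subset of the real ones.

```python
import pandas as pd

# Illustrative subset of the real column names (not the full schema).
train_cols = pd.Index(["LIMIT_BAL", "SEX", "AGE", "default_payment"])
test_cols = pd.Index(["LIMIT_BAL", "SEX", "AGE"])

# True when the feature columns match in both name and order.
same_features = train_cols.drop("default_payment").equals(test_cols)

# Columns present in train but absent from test.
missing_in_test = train_cols.difference(test_cols).tolist()

print(same_features, missing_in_test)
```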

In [78]:
print(train.isnull().sum(),'\n')
print(test.isnull().sum())

LIMIT_BAL          0
SEX                0
EDUCATION          0
MARRIAGE           0
AGE                0
PAY_0              0
PAY_2              0
PAY_3              0
PAY_4              0
PAY_5              0
PAY_6              0
BILL_AMT1          0
BILL_AMT2          0
BILL_AMT3          0
BILL_AMT4          0
BILL_AMT5          0
BILL_AMT6          0
PAY_AMT1           0
PAY_AMT2           0
PAY_AMT3           0
PAY_AMT4           0
PAY_AMT5           0
PAY_AMT6           0
default_payment    0
dtype: int64 

LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_0        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
dtype: int64


In [80]:
# Imbalanced data
train['default_payment'].value_counts()

0    21219
1     6058
Name: default_payment, dtype: int64
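
To put the imbalance in numbers, the short sketch below turns the counts printed above into class shares and a majority-to-minority ratio. The counts are copied from that output; the variable names are only illustrative. The split is roughly 78% for class 0 and 22% for class 1.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 21219, 1: 6058}, name="default_payment")

shares = counts / counts.sum()            # fraction of rows in each class
imbalance_ratio = counts[0] / counts[1]   # majority rows per minority row

print(shares.round(3))
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```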

# 2 - Feature preprocessing

In [81]:
# Categorical: one-hot encode each column (dummy names are prefixed with the column name)
categorical = ['SEX','EDUCATION','MARRIAGE']

train_cat = pd.get_dummies(train[categorical],columns=categorical)

test_cat = pd.get_dummies(test[categorical],columns=categorical)

In [82]:
all(train_cat.columns == test_cat.columns) 

True
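
The check above passes here, but it would fail if some category appeared in only one of the two files. A minimal sketch of one way to guard against that, assuming the training columns are the reference: reindex the test dummies to the training columns. The two small frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames: category 3 of MARRIAGE never appears in the test data.
train_demo = pd.DataFrame({"MARRIAGE": [1, 2, 3]})
test_demo = pd.DataFrame({"MARRIAGE": [1, 2, 2]})

train_dummies = pd.get_dummies(train_demo["MARRIAGE"], prefix="MARRIAGE")
test_dummies = pd.get_dummies(test_demo["MARRIAGE"], prefix="MARRIAGE")

# Align test to the training columns; categories absent from test become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```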

In [83]:
# Numeric: a list (rather than a set) keeps the column order deterministic
numerical = [col for col in test.columns if col not in categorical]

train_num,test_num = train[numerical],test[numerical]

In [84]:
scaler = StandardScaler()
train_num = scaler.fit_transform(train_num)
test_num = scaler.transform(test_num)

train_num = pd.DataFrame(train_num,columns = numerical,index = train.index)
test_num = pd.DataFrame(test_num,columns = numerical,index = test.index)
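
`StandardScaler` learns each column's mean and standard deviation in `fit` and reuses those same statistics in `transform`, so the client file is scaled with the training statistics. The sketch below checks this by hand on a single hypothetical column (the values are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical credit limits, one column each for train and test.
train_col = np.array([[10000.0], [50000.0], [90000.0]])
test_col = np.array([[50000.0], [130000.0]])

demo_scaler = StandardScaler().fit(train_col)

# Manual z-score using the TRAIN mean and population standard deviation (ddof=0).
manual = (test_col - train_col.mean(axis=0)) / train_col.std(axis=0)

print(np.allclose(demo_scaler.transform(test_col), manual))
```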

# 3 - Oversampling and train/test split

In [85]:
X = pd.concat([train_cat,train_num],axis=1)
y = train['default_payment']

In [None]:
over_sampler = RandomOverSampler(random_state=42)
X_res, y_res = over_sampler.fit_resample(X, y)

In [86]:
y_res.value_counts()

1    21219
0    21219
Name: default_payment, dtype: int64

In [87]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.33, random_state=42)
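
Random oversampling works by duplicating minority-class rows. When the split happens after resampling, copies of the same row can end up on both sides of the split, which tends to make held-out scores look better than they would on unseen customers. The sketch below shows the alternative order: split first, then oversample only the training part. It runs on hypothetical data and uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler`, so it is an illustration rather than the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 80 rows of class 0 and 20 rows of class 1.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = pd.Series([0] * 80 + [1] * 20, name="default_payment")

# 1) Split first, stratifying so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# 2) Oversample the minority class in the training part only.
minority = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=42)

X_tr_bal = pd.concat([X_tr, extra])
y_tr_bal = pd.concat([y_tr, pd.Series(1, index=extra.index, name=y_tr.name)])

print(y_tr_bal.value_counts())
```

The test part keeps its original class distribution, so metrics computed on it reflect the imbalance the model would face in practice.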

# 4 - Performance of classification models

In [88]:
dict_classifiers = {"Random Forest": RandomForestClassifier(random_state=0),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(solver = "liblinear",random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Ada Boost": AdaBoostClassifier(),
    "SVM Linear": LinearSVC(dual=False,random_state=0),
    "SVM": SVC(probability=True,random_state=0)
}

classifiers_names = list(dict_classifiers.keys())
classifiers_values=list(dict_classifiers.values())


In [89]:
def train_model(model):
    """Fit the model on the training split and print the per-class F1 on the test split."""
    t0 = time.time()
    model.fit(X_train,y_train)
    print(f'Trained in {time.time()-t0:.3f} seconds ')
    
    pred = model.predict(X_test)
    # average=None returns one F1 score per class, in label order (0, 1)
    f1_0,f1_1 = f1_score(y_test,pred,average=None)
    print(f"f1_0 = {f1_0}\nf1_1 = {f1_1}\n")

In [90]:
for key,value in zip(classifiers_names,classifiers_values):
    print('--'*9+f'{key}'+'--'*9)
    train_model(value)

------------------Random Forest------------------
Trained in 5.188 seconds 
f1_0 = 0.9185672514619884
f1_1 = 0.9222609909281229

------------------Nearest Neighbors------------------
Trained in 0.125 seconds 
f1_0 = 0.7237133899212358
f1_1 = 0.7580526351034621

------------------Logistic Regression------------------
Trained in 0.337 seconds 
f1_0 = 0.6884537662880635
f1_1 = 0.6726700344095468

------------------Gradient Boosting------------------
Trained in 9.445 seconds 
f1_0 = 0.7494371606409747
f1_1 = 0.706848466067555

------------------Ada Boost------------------
Trained in 2.142 seconds 
f1_0 = 0.7416365428853104
f1_1 = 0.6927706135209066

------------------SVM Linear------------------
Trained in 0.418 seconds 
f1_0 = 0.692520011040574
f1_1 = 0.6703654386743602

------------------SVM------------------
Trained in 333.820 seconds 
f1_0 = 0.7519763480943505
f1_1 = 0.6900650550156614


# 5 - Random Forest

In [91]:
model = RandomForestClassifier(random_state=0)
model.fit(X_train,y_train)
pred = model.predict(X_test)

In [92]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.94      0.89      0.92      7030
           1       0.90      0.95      0.92      6975

    accuracy                           0.92     14005
   macro avg       0.92      0.92      0.92     14005
weighted avg       0.92      0.92      0.92     14005



In [96]:
model.classes_
# i.e., class 0 is column 0 and class 1 is column 1 of the predict_proba output

array([0, 1])
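
The columns of `predict_proba` follow the order of `model.classes_`, so the column for a given label can be looked up instead of hard-coded. The sketch below illustrates this on a small hypothetical dataset; `NO_PAYMENT_LABEL` is an illustrative name, and its value of 0 follows the column description at the top of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data, only to obtain a fitted classifier.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = np.array([0, 1] * 20)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Assumption: label 0 means "does not pay", as stated in the column description.
NO_PAYMENT_LABEL = 0

# Find the predict_proba column that corresponds to that label.
col = list(clf.classes_).index(NO_PAYMENT_LABEL)
p_no_payment = clf.predict_proba(X_demo)[:, col]

print(col, p_no_payment[:5])
```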

# 6 - Customers most likely to default

In [93]:
# test_cat and test_num come from "questao22_creditcard_clientes.csv"
Xtest = pd.concat([test_cat,test_num],axis=1)

In [94]:
pred = model.predict(Xtest)
prob = model.predict_proba(Xtest)

In [97]:
# Non-payment = class 0, so take column 0 of prob

prob0 = pd.DataFrame(prob[:,0],columns=['Probability of not paying the next invoices'],index=Xtest.index)
prob0 = prob0.sort_values(by='Probability of not paying the next invoices',ascending=False)

In [104]:
# 10% of customers with the highest probability of not paying their next invoices.
percent = int(Xtest.shape[0] * 0.10)
prob0.head(percent)

Unnamed: 0,Probability of not paying the next invoices
285,1.00
594,1.00
1891,1.00
2512,0.99
1754,0.99
...,...
753,0.91
684,0.91
1237,0.91
1608,0.91
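
A random forest's probabilities are averages over a finite number of trees, so many customers can share the same value, as the repeated 0.91 at the bottom of the table suggests. `head(percent)` then cuts through a group of tied customers arbitrarily. The sketch below, on hypothetical probabilities, compares a strict cut with `nlargest(..., keep="all")`, which keeps every customer tied at the cutoff.

```python
import pandas as pd

# Hypothetical non-payment probabilities for 20 customers (index = customer id).
probs = pd.Series(
    [0.95, 0.91, 0.91, 0.91, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30,
     0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02, 0.01]
)

budget = int(len(probs) * 0.10)  # 10% of the customers

strict = probs.nlargest(budget)                  # exactly `budget` customers
with_ties = probs.nlargest(budget, keep="all")   # also keeps ties at the cutoff

print(len(strict), len(with_ties))
```

Whether to stay strictly within the budget or to include the tied customers is a business decision; the sketch only makes the size of the tie visible.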


# 7 - Answer to question 2.2

In [99]:
print('Customers who will receive the communication:\n',list(prob0.index[:percent]))

Customers who will receive the communication:
 [285, 594, 1891, 2512, 1754, 1303, 1069, 2615, 1399, 68, 865, 1777, 1814, 1077, 169, 2533, 1947, 210, 1442, 1436, 2068, 1389, 2248, 282, 2310, 35, 1982, 1270, 2112, 2117, 2061, 2459, 2079, 79, 2660, 2096, 1790, 485, 2111, 2628, 1063, 527, 257, 2335, 2309, 1683, 1744, 383, 863, 902, 1243, 108, 2455, 1686, 135, 913, 1752, 742, 1765, 1576, 569, 2523, 702, 1972, 843, 274, 449, 2364, 2428, 2095, 497, 2225, 2188, 2708, 2059, 2640, 215, 2540, 281, 129, 391, 804, 347, 314, 766, 1751, 140, 113, 778, 535, 470, 734, 1784, 234, 2080, 1803, 699, 2488, 1808, 645, 1908, 1913, 184, 1948, 610, 482, 305, 1388, 1480, 1432, 97, 1103, 1654, 874, 2605, 1005, 911, 957, 1101, 1400, 318, 2060, 2304, 266, 2355, 1791, 2345, 163, 1489, 287, 332, 1810, 2002, 2342, 1836, 1837, 1860, 657, 1875, 2686, 17, 181, 2443, 315, 1189, 2179, 1070, 831, 23, 36, 1263, 1294, 2108, 1181, 2556, 779, 249, 455, 2199, 2529, 370, 1309, 765, 1106, 1598, 2368, 255, 2223, 1128, 2446, 2078, 1330, 58

In [100]:
prob0.to_csv('prob_clientes.csv',index_label='Cliente')