## Predicting Santander Customer Satisfaction

Customer satisfaction is a key measure of success. Dissatisfied customers cancel their services and rarely voice their dissatisfaction before leaving. Satisfied customers, on the other hand, become brand advocates!

![image](img/santander_custsat_red.png)

Data source: https://www.kaggle.com/c/santander-customer-satisfaction/overview

#### Importing the packages required for this project

In [211]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score, auc, accuracy_score
import imblearn
import pickle

#### Loading the dataset

In [154]:
# Reading the CSV file
df_train = pd.read_csv("data/train.csv")

In [155]:
df_train.shape

(76020, 371)

The training dataset has 76,020 rows and 371 columns.

In [156]:
# First 5 rows of the dataset
df_train.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


#### Exploratory analysis

Checking the distribution of the target variable.

In [157]:
df_train['TARGET'].value_counts()

0    73012
1     3008
Name: TARGET, dtype: int64

The data are imbalanced with respect to the dependent variable: class 0 (satisfied customers) accounts for approximately 96% of the dataset, while class 1 (dissatisfied customers) accounts for approximately 4%. As a consequence, if no balancing technique is applied, the algorithm will learn far more about satisfied customers than about dissatisfied ones, which is undesirable because the dissatisfied customers are the ones we want to detect. This is a rare-class problem, so a technique for balancing the data should be used.
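
To make the imbalance concrete, the class shares can be computed directly from the two counts printed above. The sketch below uses only those counts; the variable names are illustrative. It also suggests why plain accuracy can mislead here: a hypothetical model that always predicts class 0 would already score about 96%.

```python
# Class counts taken from the value_counts() output above
counts = {0: 73012, 1: 3008}
total = sum(counts.values())

# Share of each class in the training set
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a hypothetical baseline that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"class 0: {shares[0]:.2%}, class 1: {shares[1]:.2%}")
print(f"always-predict-{majority_label} accuracy: {baseline_accuracy:.2%}")
```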

In [158]:
# Checking the column types
df_train.dtypes.head()

ID                           int64
var3                         int64
var15                        int64
imp_ent_var16_ult1         float64
imp_op_var39_comer_ult1    float64
dtype: object

This dataset has more than 300 columns. To check whether any column has a type other than int or float, the function below counts the int64 and float64 columns separately. If the two counts add up to 371 (the number of columns), the dataset contains only int64 and float64 columns.

In [159]:
def verifica_tipos():
    var_int = 0
    var_float = 0
    for i in df_train.columns:
        if df_train[i].dtype == "int64":
            var_int += 1
        elif df_train[i].dtype == "float64":
            var_float += 1
    return var_int, var_float

In [160]:
verifica_tipos()

(260, 111)

The two values above add up to 371, so every column in this dataset is of type int64 or float64.
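
The same kind of check can be done without an explicit loop. The following is a minimal sketch on a small hypothetical DataFrame (`toy`, `with_text`, and the `name` column are illustrative and not part of the dataset), using `dtypes.value_counts()` and `select_dtypes`:

```python
import pandas as pd

# Small hypothetical frame standing in for df_train
toy = pd.DataFrame({
    "ID": [1, 3, 4],
    "var15": [23, 34, 23],
    "var38": [39205.17, 49278.03, 67333.77],
})

# Number of columns per dtype, equivalent to counting them in a loop
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Columns that are not numeric; an empty list means every column is numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Adding a text column makes it show up in the non-numeric list
with_text = toy.assign(name=["a", "b", "c"])
non_numeric_with_text = with_text.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_with_text)
```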

Given the large number of variables, we need to check for missing values, and inspecting each variable individually would not be practical. The function in the next cell checks every variable and returns the name of each variable that contains missing values along with its missing-value count, plus the number of columns checked.

In [161]:
def verifica_nan():
    # Collects [column name, missing count] for every column with missing values
    is_nan = []
    count = 0  # number of columns checked
    for col in df_train.columns:
        count += 1
        value = df_train[col].isnull().sum()
        if value > 0:
            is_nan.append([col, value])
    return is_nan, count

In [162]:
verifica_nan()

([], 371)

This dataset contains no missing values.
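
For reference, pandas can produce the same per-column report in a vectorized way. This is a minimal sketch on a hypothetical frame with a few missing values planted on purpose (`toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values planted in two columns
toy = pd.DataFrame({
    "var3": [2, 2, 2, 2],
    "var15": [23.0, np.nan, 23.0, np.nan],
    "var38": [39205.17, 49278.03, np.nan, 64007.97],
})

# Missing-value count per column, keeping only the columns that have any
missing_per_column = toy.isnull().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

An empty result would indicate that no column has missing values.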

#### Preprocessing: Balancing the data with the SMOTE over-sampling technique

In [163]:
# Instantiating SMOTE
smote = imblearn.over_sampling.SMOTE()

In [164]:
# Balancing the data: fit_resample returns the resampled features and target
df_train_aux1 = smote.fit_resample(df_train.drop(['ID','TARGET'], axis = 1), df_train['TARGET'])
# Reattaching the target as the last column of the resampled features
df_train_aux2 = df_train_aux1[0].assign(TARGET = df_train_aux1[1])

In [165]:
df_train_aux2.head()

Unnamed: 0,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [166]:
# The target variable (TARGET) now has the same number of occurrences per class
df_train_aux2['TARGET'].value_counts()

1    73012
0    73012
Name: TARGET, dtype: int64

In [167]:
df_train_aux2.shape

(146024, 370)

The data are now balanced, but a few more techniques still need to be applied to this dataset before training the algorithm.
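
For intuition about where the extra class 1 rows come from: SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that interpolation step on hypothetical 2-D points. The function `smote_like_sample` and its parameters are illustrative and are not imbalanced-learn's implementation.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a random minority sample and
    one of its k nearest minority neighbours (simplified illustration)."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class points
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point lies on the segment between two existing minority points, so it stays inside the region the minority class already occupies.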

#### Preprocessing: Applying Principal Component Analysis (PCA) for dimensionality reduction
PCA is an unsupervised algorithm that finds a more meaningful basis, or coordinate system, for the data. It works from the covariance matrix to find the directions along which the sample varies the most.
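
As a rough illustration of the covariance-based view, the sketch below performs PCA by hand with NumPy on hypothetical toy data: center the data, compute the covariance matrix, take its eigenvectors, and project onto the leading ones. It is meant only to illustrate the idea and is not how scikit-learn's `PCA` computes its result (that class offers SVD-based solvers such as the `randomized` one used in this notebook).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, the first two strongly correlated
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

# 1) Center each feature
X_centered = X - X.mean(axis=0)

# 2) Covariance matrix of the features (3 x 3)
cov = np.cov(X_centered, rowvar=False)

# 3) Eigen-decomposition; eigh returns eigenvalues in ascending order, so reorder
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Fraction of total variance captured by each component
explained_ratio = eigvals / eigvals.sum()

# 5) Project onto the two leading components
X_proj = X_centered @ eigvecs[:, :2]

print(explained_ratio)
print(X_proj.shape)
```

The cumulative sum of `explained_ratio` is one common way to judge how many components are worth keeping.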

In [168]:
# Creating the PCA model with 150 components
pca = PCA(n_components = 150,
   whiten = True,
   svd_solver = 'randomized')

In [169]:
target = df_train_aux2['TARGET']
df_train_aux3 = df_train_aux2.drop('TARGET', axis=1)

In [170]:
# Fitting the PCA model on the training data
pca.fit(df_train_aux3)

PCA(n_components=150, svd_solver='randomized', whiten=True)

In [171]:
# Applying PCA to the training data
df_train_pca = pca.transform(df_train_aux3)

In [172]:
# Dimensions
print(df_train_pca.shape)

(146024, 150)


In [173]:
type(df_train_pca)

numpy.ndarray

In [174]:
df_train_pca = pd.DataFrame(df_train_pca)

In [175]:
df_train_pca.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,140,141,142,143,144,145,146,147,148,149
0,-0.058616,-0.028168,-0.019534,-0.014647,0.002633,-0.001602,-0.004694,-0.000412,-0.000356,-4e-06,...,-0.027458,-0.454927,0.08781,-0.03806,0.038301,0.029376,0.009164,0.050143,-0.031718,-0.187917
1,-0.058616,-0.028168,-0.019534,-0.014647,0.002633,-0.001602,-0.004694,-0.000412,-0.000356,-4e-06,...,0.00537,0.643953,-0.201164,-0.004949,-2.324486,2.757665,0.884664,0.254928,-0.774843,-0.577177
2,-0.058616,-0.028168,-0.019534,-0.014647,0.002633,-0.001602,-0.004694,-0.000412,-0.000356,-4e-06,...,0.108744,1.492631,-0.390037,0.156402,-0.631833,-0.411571,0.205334,0.349015,0.323742,-0.720549
3,-0.058616,-0.028168,-0.019534,-0.014647,0.002633,-0.001602,-0.004694,-0.000412,-0.000356,-4e-06,...,0.369208,-1.379974,0.221488,-1.618071,-3.102211,-1.96452,0.729817,1.573627,-0.560539,0.61125
4,-0.058616,-0.028168,-0.019534,-0.014647,0.002633,-0.001602,-0.004694,-0.000412,-0.000356,-4e-06,...,-0.085253,3.223924,-0.518141,-0.629103,0.604696,0.039952,-0.297549,-0.246195,0.799958,-0.582738


#### Splitting the data into training and test sets

In [177]:
from sklearn.model_selection import train_test_split

In [178]:
X_treino, X_teste, y_treino, y_teste = train_test_split(df_train_pca, target, train_size = 0.7)

### Models
A series of classification algorithms will be tested, and each one will be evaluated on its scores and on the computational time needed to train it.

##### Model 1 - SVM

In [179]:
from sklearn import svm

In [180]:
modelo = svm.SVC()

In [182]:
%%time 
modelo.fit(X_treino, y_treino)

Wall time: 35min 21s


SVC()

In [183]:
# Making predictions on the test data, whose labels are known
predicts = modelo.predict(X_teste)

In [184]:
# Confusion matrix
confusion_matrix(y_teste, predicts)

array([[19770,  2127],
       [ 2771, 19140]], dtype=int64)

In [185]:
accuracy_score(y_teste, predicts)

0.8881939371804237

In [186]:
precision_score(y_teste, predicts)

0.8999858936380307

In [187]:
f1_score(y_teste, predicts)

0.8865626013247487

In [188]:
recall_score(y_teste, predicts)

0.8735338414495003
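
All four scores above can be derived from the confusion matrix alone. The sketch below recomputes them from the SVM matrix reported above using only the standard library, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and class 1 as the positive class; it should reproduce the values printed by the scikit-learn functions.

```python
# Confusion matrix reported above for the SVM
tn, fp = 19770, 2127   # true class 0: predicted 0, predicted 1
fn, tp = 2771, 19140   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the rows predicted as class 1, the share that truly are
recall = tp / (tp + fn)      # of the true class 1 rows, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```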

In [212]:
# Saving the trained SVM model to disk
filename = 'svc_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(modelo, f)
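
A saved model is only useful if it can be loaded back. The sketch below shows a save/load round trip with `pickle`, using a plain dictionary as a stand-in for the fitted model and a temporary directory instead of the working directory (both are assumptions for illustration). Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with a compatible library version.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model (e.g. `modelo`)
model_stand_in = {"name": "svc", "params": {"C": 1.0, "kernel": "rbf"}}

# Temporary location so the example does not touch the working directory
path = os.path.join(tempfile.mkdtemp(), "svc_model.sav")

# Save: the context manager closes the file even if dump raises
with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Load the object back
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)
```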

##### Model 2 - Decision Tree

In [189]:
from sklearn.tree import DecisionTreeClassifier

In [190]:
tree = DecisionTreeClassifier()

In [191]:
%%time
tree.fit(X_treino, y_treino)

Wall time: 34.8 s


DecisionTreeClassifier()

In [192]:
# Making predictions on the test data, whose labels are known
predicts = tree.predict(X_teste)

In [193]:
# Confusion matrix
confusion_matrix(y_teste, predicts)

array([[19376,  2521],
       [ 1738, 20173]], dtype=int64)

In [194]:
accuracy_score(y_teste, predicts)

0.9027803140978816

In [195]:
precision_score(y_teste, predicts)

0.8889133691724685

In [196]:
recall_score(y_teste, predicts)

0.9206791109488385

In [197]:
f1_score(y_teste, predicts)

0.9045174307813025

In [198]:
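# Importing metrics here, since roc_curve and auc are accessed through it
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_teste, predicts, pos_label=1)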

In [199]:
metrics.auc(fpr, tpr)

0.902774592237446
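
Because `predicts` holds hard 0/1 labels rather than scores, `roc_curve` sees a single operating point, and the resulting area reduces to the average of the true-positive rate and the true-negative rate (the balanced accuracy). The sketch below checks this against the Decision Tree confusion matrix reported above, using only the standard library. For a fuller ROC curve, one would usually pass class probabilities, for example from `predict_proba`, instead of hard labels.

```python
# Decision Tree confusion matrix reported above (rows: true class, columns: predicted)
tn, fp = 19376, 2521
fn, tp = 1738, 20173

tpr = tp / (tp + fn)   # true-positive rate (recall of class 1)
fpr = fp / (fp + tn)   # false-positive rate

# With one threshold the ROC curve is (0, 0) -> (fpr, tpr) -> (1, 1);
# its area is computed here with the trapezoidal rule
points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
auc_single_point = sum(
    (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
)

# Balanced accuracy: mean of true-positive rate and true-negative rate
balanced_accuracy = (tpr + (1 - fpr)) / 2

print(f"{auc_single_point:.6f} {balanced_accuracy:.6f}")
```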

In [213]:
# Saving the trained Decision Tree model to disk
filename = 'tree_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(tree, f)

##### Model 3 - Naive Bayes

In [200]:
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

In [201]:
gaussian = GaussianNB()

In [202]:
%%time
gaussian.fit(X_treino, y_treino)

Wall time: 381 ms


GaussianNB()

In [203]:
# Making predictions on the test data, whose labels are known
predicts = gaussian.predict(X_teste)

In [204]:
# Confusion matrix
confusion_matrix(y_teste, predicts)

array([[ 2348, 19549],
       [  780, 21131]], dtype=int64)

In [205]:
accuracy_score(y_teste, predicts)

0.5359523374726077

In [206]:
precision_score(y_teste, predicts)

0.5194444444444445

In [207]:
recall_score(y_teste, predicts)

0.9644014421979827

In [208]:
f1_score(y_teste, predicts)

0.675208895847646

In [209]:
fpr, tpr, thresholds = metrics.roc_curve(y_teste, predicts, pos_label=1)

In [210]:
metrics.auc(fpr, tpr)

0.5358153715077232

In [214]:
# Saving the trained Gaussian Naive Bayes model to disk
filename = 'gaussian_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(gaussian, f)

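#### Conclusion: 
Of the models trained, the Decision Tree Classifier had the highest accuracy, recall, and F1 (each between about 0.90 and 0.92) and trained in 34.8 s, far faster than the SVM, which took 35 min 21 s to reach an accuracy of about 0.89. Gaussian Naive Bayes trained fastest (381 ms) but performed poorly, with an accuracy of about 0.54.
For the Decision Tree, accuracy, precision, recall, F1, and AUC all fell between roughly 0.89 and 0.92 on the test split.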
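
The comparison can be tabulated from the confusion matrices and wall times reported in the model sections. The sketch below is a standard-library summary; the dictionary layout and the helper `scores` are illustrative.

```python
# Confusion matrices as (tn, fp, fn, tp) and training times in seconds, as reported above
results = {
    "SVM": {"cm": (19770, 2127, 2771, 19140), "train_s": 35 * 60 + 21},
    "Decision Tree": {"cm": (19376, 2521, 1738, 20173), "train_s": 34.8},
    "Naive Bayes": {"cm": (2348, 19549, 780, 21131), "train_s": 0.381},
}

def scores(tn, fp, fn, tp):
    """Accuracy and F1 (class 1 as positive) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# name -> (accuracy, f1, training time in seconds)
summary = {name: scores(*r["cm"]) + (r["train_s"],) for name, r in results.items()}

for name, (acc, f1, secs) in summary.items():
    print(f"{name:<14} accuracy={acc:.3f}  f1={f1:.3f}  train_time={secs:.1f}s")

best_f1 = max(summary, key=lambda n: summary[n][1])
fastest = min(summary, key=lambda n: summary[n][2])
print("best F1:", best_f1, "| fastest to train:", fastest)
```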

##### End