## 1 - What Is the Business Problem

http://www.kaggle.com/mlg-ulb/creditcardfraud



*   The dataset contains **transactions** made with credit cards in September 2013 by European cardholders.

*   This dataset covers transactions that occurred over two days, with **492 frauds out of 284,807 transactions**. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

*   It contains only numeric input variables, which are the result of a **PCA transformation**. Unfortunately, due to **confidentiality** issues, the original features and further background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with **PCA**; the only features not transformed with PCA are 'Time' and 'Amount'.

*   The 'Time' feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The 'Amount' feature is the transaction amount; it can be used, for example, in further analyses.

*   The **'Class' feature is the TARGET variable** and takes the value 1 in case of fraud and 0 otherwise.
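
To make the imbalance concrete, the short calculation below uses only the two counts quoted above. The "always legitimate" baseline is a hypothetical model introduced for illustration; it shows why accuracy alone says little on this dataset.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# Hypothetical baseline that labels every transaction as legitimate:
# it is right on every non-fraud row and wrong on every fraud row.
baseline_accuracy = (n_transactions - n_frauds) / n_transactions
baseline_recall = 0 / n_frauds  # it never catches a single fraud

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

A model that detects nothing still scores above 99.8% accuracy, which is why metrics such as recall, precision and F1 are more informative here.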


## 2 - Exploratory Data Analysis

In [2]:
# Libraries
import pandas as pd
from pycaret.classification import (compare_models, create_model, finalize_model,
                                    predict_model, save_model, setup, tune_model)

In [3]:
# Load the data
origem = './creditcard.csv'
df = pd.read_csv(origem, low_memory=False)

In [4]:
# Preview the DataFrame
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
# Check for missing values
df.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [6]:
# Check the column data types
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

## 3 - Data Preprocessing

*   The dataset has no missing values

*   The variables are clean and numeric: `V1` to `V28` are already PCA-transformed, while `Time` and `Amount` keep their original scale

*   There is no need to apply a feature engineering step

*   The split into X and y will be done in the model-building step
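
PyCaret's `setup` handles the X/y separation and the train/test split internally. Purely as an illustration of what that step involves, here is a standard-library sketch; the helper names, the toy rows and the 70/30 ratio are assumptions for the example, not PyCaret internals.

```python
import random

def split_features_target(rows, target="Class"):
    """Split a list of row dicts into feature dicts (X) and labels (y)."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y

def stratified_split(X, y, test_fraction=0.3, seed=0):
    """Hold out test_fraction of each class so both parts keep the class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]

    def take(indices):
        return [X[i] for i in indices], [y[i] for i in indices]

    return take(train_idx), take(test_idx)

# Toy rows (hypothetical values): 1000 transactions, 10 of them fraud.
rows = [{"Amount": float(i), "Class": 1 if i < 10 else 0} for i in range(1000)]
X, y = split_features_target(rows)
(X_train, y_train), (X_test, y_test) = stratified_split(X, y)
print(f"train: {sum(y_train)} frauds in {len(y_train)} rows")
print(f"test:  {sum(y_test)} frauds in {len(y_test)} rows")
```

Stratifying matters with so few frauds: a purely random split could leave the held-out part with almost no positive cases.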

## 4 - Model Building

*   This step aims to choose the model that best represents the solution to the business problem.

*   The dataset will be passed to an automated machine learning function, which trains several models with several configurations and presents them ranked by their results.

*   The best models will be chosen to build a **"BAGGING"** combination that averages the models' votes

*   Multicollinear features and features with low variance will be removed.

*   Feature selection will be performed automatically
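
As a rough idea of what removing low-variance and multicollinear features means, the sketch below drops near-constant columns and then one column from each pair whose absolute Pearson correlation exceeds 0.95 (the threshold used in the setup cell). The function names, the variance cutoff and the toy columns are assumptions for illustration; this does not reproduce PyCaret's own rules.

```python
from statistics import fmean, pvariance

def pearson(a, b):
    """Pearson correlation between two equal-length numeric lists."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    norm_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (norm_a * norm_b)

def filter_features(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Drop near-constant columns, then one column of each highly correlated pair."""
    kept = {}
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # near-constant column carries no signal
        if any(abs(pearson(values, other)) > max_abs_corr for other in kept.values()):
            continue  # almost a copy of a column that was already kept
        kept[name] = values
    return list(kept)

# Hypothetical toy columns.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "d": [7.0, 7.0, 7.0, 7.0, 7.0],  # constant
}
selected = filter_features(columns)
print(selected)
```

Because V1 to V28 come from a PCA, they should already be close to uncorrelated with one another, so a filter like this would be expected to have little effect on them.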

In [5]:
# Create the classification setup
classificacao = setup(
    data = df,
    target = 'Class',
    ignore_low_variance = True,
    remove_multicollinearity = True,
    multicollinearity_threshold = 0.95,
    n_jobs = 2,
    fix_imbalance = True
)

,Description,Value
0,session_id,2456
1,Target,Class
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(284807, 31)"
5,Missing Values,False
6,Numeric Features,30
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False
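
According to the PyCaret documentation, `fix_imbalance = True` oversamples the minority class of the training data with SMOTE by default. The sketch below shows only the core idea, interpolating between a minority point and one of its nearest minority neighbours; the function name and the toy points are assumptions, and it is a simplification rather than the library's implementation.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D fraud points; the real rows have 30 features.
frauds = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(frauds, n_new=6)
print(len(frauds) + len(new_points), "minority rows after oversampling")
```

Oversampling should touch only the training folds; synthetic rows in the evaluation data would distort the reported metrics.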


In [6]:
# Compare models to choose one
avaliacao = compare_models(sort = 'F1')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9996,0.9792,0.8353,0.9136,0.8717,0.8715,0.8729,11.74
rf,Random Forest Classifier,0.9995,0.9736,0.8265,0.8874,0.8546,0.8544,0.8555,75.354
lightgbm,Light Gradient Boosting Machine,0.9992,0.9708,0.8382,0.7389,0.7841,0.7837,0.7859,3.099
dt,Decision Tree Classifier,0.998,0.8948,0.7912,0.4573,0.5785,0.5776,0.6001,13.597
gbc,Gradient Boosting Classifier,0.995,0.9803,0.8676,0.24,0.375,0.3733,0.454,225.195
nb,Naive Bayes,0.9925,0.9718,0.7735,0.1566,0.2603,0.2582,0.3458,0.208
ada,Ada Boost Classifier,0.9902,0.9783,0.8824,0.137,0.2369,0.2346,0.345,42.698
ridge,Ridge Classifier,0.9872,0.0,0.8324,0.1023,0.1821,0.1796,0.289,0.28
lda,Linear Discriminant Analysis,0.9872,0.9757,0.8324,0.1023,0.1821,0.1796,0.289,1.163
lr,Logistic Regression,0.9807,0.9756,0.9059,0.0786,0.1441,0.1414,0.2621,2.66
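
The table is sorted by F1, the harmonic mean of precision and recall. The snippet below recomputes it from the precision and recall columns for three rows, which shows why models with high recall but very low precision end up at the bottom. The results differ slightly from the F1 column because the table averages per-fold F1 values, whereas this uses the averaged precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the comparison table above.
table = {
    "et": (0.9136, 0.8353),
    "gbc": (0.2400, 0.8676),
    "lr": (0.0786, 0.9059),
}
for model, (precision, recall) in table.items():
    print(f"{model}: F1 ~ {f1_score(precision, recall):.3f}")
```

Logistic Regression has the highest recall in the table, yet fewer than one in ten of its alerts is a real fraud, which drags its F1 down.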


In [10]:
# Tune the Extra Trees classifier
extra_tree_tuned = tune_model(avaliacao, n_iter=50)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9993,0.9929,0.9118,0.7561,0.8267,0.8263,0.83
1,0.999,0.9725,0.8235,0.6829,0.7467,0.7462,0.7495
2,0.999,0.9829,0.7941,0.6923,0.7397,0.7393,0.741
3,0.9994,0.9866,0.8235,0.8235,0.8235,0.8232,0.8232
4,0.9993,0.9731,0.8529,0.7838,0.8169,0.8166,0.8173
5,0.9993,0.9731,0.7647,0.8125,0.7879,0.7875,0.7879
6,0.9992,0.9778,0.8529,0.7436,0.7945,0.7941,0.796
7,0.9994,0.99,0.9412,0.7619,0.8421,0.8418,0.8465
8,0.9993,0.968,0.7647,0.8387,0.8,0.7997,0.8005
9,0.9991,0.9966,0.8529,0.6905,0.7632,0.7627,0.767
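
The per-fold table above has no summary row, so the tuned model is hard to compare with the untuned one at a glance. The snippet below averages the ten F1 values listed above and sets the result against the Extra Trees F1 from the comparison table.

```python
from statistics import mean, stdev

# F1 column of the per-fold tuning results above.
f1_per_fold = [0.8267, 0.7467, 0.7397, 0.8235, 0.8169,
               0.7879, 0.7945, 0.8421, 0.8000, 0.7632]
tuned_mean = mean(f1_per_fold)
tuned_sd = stdev(f1_per_fold)  # sample standard deviation
untuned_f1 = 0.8717  # Extra Trees row in the comparison table

print(f"tuned F1:   {tuned_mean:.4f} (sample SD {tuned_sd:.4f})")
print(f"untuned F1: {untuned_f1:.4f}")
print("tuning helped" if tuned_mean > untuned_f1 else "tuning did not help")
```

On these numbers the tuned model averages a lower cross-validated F1 than the default Extra Trees configuration.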


In [24]:
# Test with ET
# modelo = create_model('et')
# Note: df includes the rows used for training, so these scores are optimistic
predict_model(avaliacao, data = df);

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.9998,0.9892,0.939,0.9726,0.9555,0.9555,0.9556


In [25]:
# Test with the tuned ET
predict_model(extra_tree_tuned, data = df);

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.9992,0.9781,0.8252,0.7505,0.7861,0.7857,0.7866
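
Both test cells score the models on `df`, which contains the rows the models were trained on. That probably explains why the untuned model's F1 here (0.9555) is higher than its cross-validated value (0.8717). The toy experiment below illustrates the effect with entirely hypothetical data: a nearest-neighbour "memoriser" on random labels looks perfect on rows it has seen and near chance on rows it has not.

```python
import random

rng = random.Random(0)

# Hypothetical data: one random feature and a random label, so there is
# nothing real to learn and honest accuracy should sit near 50%.
points = [(rng.random(), rng.randint(0, 1)) for _ in range(1000)]
train, test = points[:700], points[700:]

def predict_1nn(x):
    """Return the label of the closest training point (pure memorisation)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(rows):
    """Share of rows whose predicted label matches the true label."""
    return sum(predict_1nn(x) == label for x, label in rows) / len(rows)

train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(f"seen rows: {train_accuracy:.2%}, unseen rows: {test_accuracy:.2%}")
```

For an honest estimate, score only rows that were held out from training.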


## 5 - Training and Saving the Model

*   The Extra Trees Classifier was chosen for its better performance, without the bagging option.
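
The vote-averaging combination planned in section 4 was not used in the end. For reference, a minimal sketch of averaging votes (often called soft voting) is shown below; the function name and the probabilities are hypothetical and do not come from the models trained in this notebook.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average each row's fraud probability across models, then threshold."""
    n_models = len(probabilities_per_model)
    averaged = [sum(row) / n_models for row in zip(*probabilities_per_model)]
    labels = [int(p >= threshold) for p in averaged]
    return labels, averaged

# Hypothetical fraud probabilities from three models for four transactions.
et_probs = [0.10, 0.80, 0.40, 0.95]
rf_probs = [0.20, 0.70, 0.60, 0.90]
lgbm_probs = [0.05, 0.90, 0.35, 0.99]

labels, averaged = soft_vote([et_probs, rf_probs, lgbm_probs])
print(labels)
```

The third transaction shows the effect: one model votes for fraud (0.60), but the average of 0.45 stays under the threshold.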

In [30]:
modelo_final = finalize_model(avaliacao)
save_model(modelo_final, './et_fraudscore_v1')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Class',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                  ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0,
                                       class_weight=None, criterion='gini',
                                       max_depth=None, max_features='auto',
                                       max_leaf_nod
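
Once saved, the pipeline can be loaded back to score new transactions. The helper below is a sketch: `score_transactions` is a hypothetical name, and it assumes the input is a pandas DataFrame with the same feature columns as the training data. It relies on PyCaret's `load_model` and `predict_model`, imported inside the function so the helper can be defined even where PyCaret is not installed.

```python
def score_transactions(transactions, model_path="./et_fraudscore_v1"):
    """Load the saved pipeline and return the input rows with predictions added.

    `transactions` is assumed to be a pandas DataFrame holding the same
    feature columns that were used for training.
    """
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import load_model, predict_model

    pipeline = load_model(model_path)  # PyCaret appends the .pkl extension
    return predict_model(pipeline, data=transactions)
```

The names of the prediction columns that `predict_model` adds depend on the PyCaret version, so inspect the returned DataFrame before relying on them.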