## [Master's] 4.3 Statistical analyses using real instances (sklearn, autoencoder, and autoencoder+LSTM) without feature extraction

## Libraries and settings

In [1]:
# Trick to measure the total run time of the Jupyter notebook
from datetime import datetime
start_time = datetime.now()

In [2]:
import pandas as pd
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare
import warnings
warnings.filterwarnings("ignore")

The qualification committee requested that the statistical tests be paired, that is, that every classifier be compared on the same set of instances.

In [3]:
# Mean scores of the sklearn classifiers
sklearn_scores = pd.read_csv(r'./results/4-0_anomaly_detection_scores_reais.csv', index_col=0)

# Mean scores of the Autoencoder classifiers
autoencoder_scores = pd.read_csv(r'./results/4-1_anomaly_detection_scores_reais.csv', index_col=0)

# Mean scores of the Autoencoder LSTM classifiers
autoencoder_lstm_scores = pd.read_csv(r'./results/4-2_anomaly_detection_scores_reais.csv', index_col=0)

# Combine all results into a single table
scores = pd.concat([sklearn_scores, autoencoder_scores, autoencoder_lstm_scores])

mean_score_table = scores.groupby('CLASSIFICADOR').mean().sort_values(by=['F1'], ascending=False)
mean_score_table

CLASSIFICADOR,PRECISAO,REVOGACAO,F1,TREINAMENTO [s],TESTE [s]
Local Outlier Factor,0.858796,0.858796,0.858796,0.006446,0.006224
Envelope Eliptico MCD,0.650463,0.650463,0.650463,11.911614,0.01686
AutoEncoder LSTM,0.627315,0.627315,0.627315,9.693188,1.448803
Floresta de Isolamento,0.615741,0.615741,0.615741,0.087819,0.022803
AutoEncoder,0.578704,0.578704,0.578704,3.239772,0.042441
One Class SVM,0.550926,0.550926,0.550926,0.006918,0.006889
Dummy,0.5,0.5,0.5,0.000139,0.000222


In [4]:
clfs_names = list(mean_score_table.index)
f1s = [scores.loc[scores['CLASSIFICADOR']==cn, 'F1'].values for cn in clfs_names]
clfs_names

['Local Outlier Factor',
 'Envelope Eliptico MCD',
 'AutoEncoder LSTM',
 'Floresta de Isolamento',
 'AutoEncoder',
 'One Class SVM',
 'Dummy']
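Both the Friedman and the Wilcoxon tests are paired: position `j` of every array in `f1s` must refer to the same instance. The sketch below shows one way to enforce that alignment before testing. It uses synthetic data, and the `INSTANCIA` column is a hypothetical name (the source does not show which column identifies an instance), so treat it as an illustration rather than the notebook's actual procedure.

```python
import numpy as np
import pandas as pd

# Synthetic long-format scores. "INSTANCIA" is a HYPOTHETICAL column name
# identifying the instance each F1 value was measured on.
rng = np.random.default_rng(0)
instances = [f"inst_{i}" for i in range(5)]
classifiers = ["A", "B", "C"]
long_scores = pd.DataFrame(
    [(inst, clf, rng.random()) for clf in classifiers for inst in instances],
    columns=["INSTANCIA", "CLASSIFICADOR", "F1"],
)

# One row per instance, one column per classifier: row j of every column
# now refers to the same instance, which is what a paired test needs.
wide = long_scores.pivot(index="INSTANCIA", columns="CLASSIFICADOR", values="F1")

# Fail early if any classifier is missing a score for some instance.
assert not wide.isna().any().any(), "unpaired or missing measurements"

# Aligned arrays, ready to be passed to a paired test.
paired_f1s = [wide[clf].to_numpy() for clf in classifiers]
```

Pivoting makes a missing measurement visible as a `NaN` instead of silently shifting the arrays out of alignment.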

In [5]:
# Statistical test (Friedman)

stat, p_value = friedmanchisquare(*(f1s))
print(f'p_value: {p_value}')

p_value: 3.812133787383094e-19
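To make the test above less of a black box, here is a minimal sketch of how the Friedman statistic can be computed by hand and compared with `friedmanchisquare`. It runs on synthetic data without ties; with tied scores SciPy additionally applies a tie correction, which this sketch omits. The sizes `n` and `k` are arbitrary illustration values.

```python
import numpy as np
from scipy.stats import chi2, friedmanchisquare, rankdata

# Synthetic paired data: n instances (rows) scored by k classifiers (columns).
rng = np.random.default_rng(42)
n, k = 12, 4
data = rng.random((n, k))

# Rank the classifiers within each instance, then sum the ranks per classifier.
ranks = np.apply_along_axis(rankdata, 1, data)
rank_sums = ranks.sum(axis=0)

# Friedman statistic without tie correction:
# chi2_F = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1)
stat_manual = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums**2) - 3 * n * (k + 1)

# Under the null hypothesis the statistic is approximately chi-squared
# distributed with k - 1 degrees of freedom.
p_manual = chi2.sf(stat_manual, df=k - 1)

# SciPy expects one array per classifier, hence the transpose.
stat_scipy, p_scipy = friedmanchisquare(*data.T)
print(stat_manual, stat_scipy)
```

A small p-value only says that at least one classifier differs from the others; identifying which pairs differ requires a post-hoc test.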


In [6]:
p_values = sp.posthoc_wilcoxon(f1s,
                               val_col=None,
                               group_col=None,
                               zero_method='wilcox',
                               correction=False,
                               p_adjust='bonferroni',
                               sort=False)

p_values.columns = clfs_names
p_values.index = clfs_names

In [7]:
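df = pd.DataFrame(p_values)

# Centered header cells plus a styled caption. Both rules go into a single
# set_table_styles call, because a second call replaces the first by default.
table_styles = [
    dict(selector="th", props=[('text-align', 'center')]),
    dict(selector="caption", props=[
        ('color', 'black'),
        ('text-align', 'center'),
        ('font-size', '16px')
    ]),
]

# Green marks adjusted p-values at or below 0.05; everything else is gray.
# Note: pandas >= 2.1 renames Styler.applymap to Styler.map.
df.style.set_properties(**{'width':'10em', 'text-align':'center'})\
        .set_table_styles(table_styles)\
        .applymap(lambda x: 'color: green' if -1 < x <= 0.05 else 'color: gray')\
        .set_caption("p-values of the Wilcoxon tests with Bonferroni adjustment")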

,Local Outlier Factor,Envelope Eliptico MCD,AutoEncoder LSTM,Floresta de Isolamento,AutoEncoder,One Class SVM,Dummy
Local Outlier Factor,1.0,1.6e-05,7e-06,1.3e-05,7e-06,1e-05,3e-06
Envelope Eliptico MCD,1.6e-05,1.0,1.0,1.0,0.629214,0.296349,0.00048
AutoEncoder LSTM,7e-06,1.0,1.0,1.0,0.734375,1.0,0.018093
Floresta de Isolamento,1.3e-05,1.0,1.0,1.0,1.0,1.0,0.0075
AutoEncoder,7e-06,0.629214,0.734375,1.0,1.0,1.0,0.100479
One Class SVM,1e-05,0.296349,1.0,1.0,1.0,1.0,1.0
Dummy,3e-06,0.00048,0.018093,0.0075,0.100479,1.0,1.0
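Many entries in the table are exactly 1.0 because the Bonferroni adjustment multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below reproduces that idea with plain SciPy on synthetic data; the group names and sample size are illustrative assumptions, and it is meant as an approximation of what `posthoc_wilcoxon` with `p_adjust='bonferroni'` does, not as its exact implementation. With 7 classifiers there are 21 pairwise comparisons.

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon

# Synthetic paired samples for three hypothetical classifiers.
rng = np.random.default_rng(1)
n = 20
samples = {
    "A": rng.random(n),
    "B": rng.random(n) + 0.5,  # shifted upward relative to A and C
    "C": rng.random(n),
}

# Every unordered pair of classifiers is one comparison.
pairs = list(combinations(samples, 2))
m = len(pairs)

adjusted = {}
for a, b in pairs:
    # Paired Wilcoxon signed-rank test on the per-instance differences.
    _, p_raw = wilcoxon(samples[a], samples[b],
                        zero_method="wilcox", correction=False)
    # Bonferroni: scale by the number of comparisons and cap at 1.
    adjusted[(a, b)] = min(p_raw * m, 1.0)

for pair, p_adj in adjusted.items():
    print(pair, p_adj)
```

Bonferroni is conservative: it controls the family-wise error rate at the cost of statistical power, so a non-significant adjusted p-value is not evidence that two classifiers perform equally.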


## Conclusions

Given the results above, the "LOF" (Local Outlier Factor) classifier achieves the best performance in terms of mean F1 score (0.859, against 0.650 for the runner-up).

 - The "LOF" classifier yields a mean F1 score that is statistically different from that of every other classifier: all of its Bonferroni-adjusted p-values are at most 1.6e-05, well below the 0.05 threshold.
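The conclusion can be checked programmatically against the adjusted p-value table. The sketch below re-enters the values shown in the table above and lists every pair at or below a significance level of 0.05 (the same threshold used for the green highlighting); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

names = ["Local Outlier Factor", "Envelope Eliptico MCD", "AutoEncoder LSTM",
         "Floresta de Isolamento", "AutoEncoder", "One Class SVM", "Dummy"]

# Adjusted p-values copied from the table above (symmetric matrix).
p = pd.DataFrame(
    [
        [1.0, 1.6e-05, 7e-06, 1.3e-05, 7e-06, 1e-05, 3e-06],
        [1.6e-05, 1.0, 1.0, 1.0, 0.629214, 0.296349, 0.00048],
        [7e-06, 1.0, 1.0, 1.0, 0.734375, 1.0, 0.018093],
        [1.3e-05, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0075],
        [7e-06, 0.629214, 0.734375, 1.0, 1.0, 1.0, 0.100479],
        [1e-05, 0.296349, 1.0, 1.0, 1.0, 1.0, 1.0],
        [3e-06, 0.00048, 0.018093, 0.0075, 0.100479, 1.0, 1.0],
    ],
    index=names, columns=names,
)

alpha = 0.05

# Only look above the diagonal so each pair is reported once.
upper = np.triu(np.ones(p.shape, dtype=bool), k=1)
significant = [
    (names[i], names[j], p.iat[i, j])
    for i, j in zip(*np.where(upper & (p.to_numpy() <= alpha)))
]

# Does LOF differ from every other classifier at this level?
lof = "Local Outlier Factor"
lof_differs_from_all = bool((p.loc[lof].drop(lof) <= alpha).all())

for a, b, value in significant:
    print(f"{a} vs {b}: p = {value}")
print("LOF differs from all others:", lof_differs_from_all)
```

Besides the LOF comparisons, this listing also shows which classifiers can be distinguished from the Dummy baseline at the same level.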

In [8]:
# Compute the total run time of the Jupyter notebook
print(f'Total execution time (hh:mm:ss.ms): {datetime.now() - start_time}')

Total execution time (hh:mm:ss.ms): 0:00:01.784604
