> **Comparing Classifiers**
<br>` python  3.7.13    scikit-learn  1.0.2 `
<br>`numpy   1.19.5          pandas  1.3.5`

#### Using the Nemenyi test 
A quick demonstration of why the Nemenyi test is only appropriate for All.vs.All comparisons.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# add parent folder path where lib folder is
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")

In [4]:
# nonparametric tests for one performance measure (e.g., AUC)
#from P_HAKN import run_friedman, ph_pvals, filtr_ap2h0, filtr_psig 
#from P_HAKN import compare_avgranks, compare_avgranks_lf, sort_dict_byval

from P_HAKN import run_friedman, filtr_ap2h0
from P_HAKN.do_not import nym_permit, nym_psig

In [5]:
alpha = 0.05   # Set this to change the default significance level

#### Read in our data

In [6]:
# In the csv, the columns are classifiers and the rows are datasets 
full_df = pd.read_csv("../datasets/4d_auc.csv", index_col=0)
#full_df = pd.read_csv("../datasets/4d_gmean.csv", index_col=0)
#full_df = pd.read_csv("../datasets/4d_eloss.csv", index_col=0)

In [7]:
# All_vs_One: list of "control" classifiers or none for All.vs.All
#avo = []
avo = ['RF']
#avo = ['RF','XGB']

In [8]:
if len(avo) == 0:
    oname = "All_Models"
    df = full_df
else:
    baseclf = tuple(avo)
    # e.g. ['RF', 'XGB'] -> "RF_XGB_Models"
    oname = "_".join(avo) + "_Models"
    df = full_df.loc[:, full_df.columns.str.startswith(baseclf)]

#### Friedman test 
Checks whether any classifier's performance differs significantly from the others.<br>
If we reject H0 (no difference), we use post-hoc tests to find out which pairwise differences are significant.
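
As a rough illustration of what a Friedman test computes, the sketch below ranks the classifiers within each dataset (rank 1 = highest score), averages the ranks per classifier, and forms the Friedman chi-square statistic. This is not the `run_friedman` implementation from `P_HAKN`, whose internals are not shown here. The toy scores and the helper names `average_ranks` and `friedman_statistic` are made up for the example, and the tie correction for the statistic is omitted.

```python
import math

def average_ranks(rows):
    """Average rank per classifier; rows are per-dataset score lists (higher is better)."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            # Extend j over a group of tied scores so they share an average rank.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                totals[order[t]] += shared_rank
            i = j + 1
    return [total / len(rows) for total in totals]

def friedman_statistic(rows):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(rows), len(rows[0])
    ranks = average_ranks(rows)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)

# Hypothetical AUC scores: 4 datasets (rows) x 3 classifiers (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.60],
    [0.88, 0.82, 0.75],
    [0.91, 0.80, 0.79],
]

avg = average_ranks(scores)
stat = friedman_statistic(scores)
# With k = 3 the reference chi-square distribution has 2 degrees of freedom,
# whose survival function is exp(-x / 2); other k need a chi-square routine.
p_value = math.exp(-stat / 2)
print(avg, stat, round(p_value, 4))
```

With so few datasets the chi-square approximation is crude, so the p-value here only shows the mechanics: a small p-value would lead to rejecting H0 and moving on to the post-hoc tests.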

In [9]:
reject, rptstr, rankings, avg_ranks = run_friedman(df)

print(oname,":",rptstr)
# continue only if H0 was rejected
if not reject:
    raise Exception("Failed to reject H0 for Friedman Test")

RF_Models : Freidman Test
H0: there is no difference in the means at the 95.0% confidence level 
Reject: Ready to continue with post-hoc tests


#### Nemenyi Test - Used properly
For All.vs.All, the Nemenyi adjusted p-value is consistently smaller than its single-step counterparts, so H0 can be rejected for more tests. In other words, in this setting Nemenyi is more powerful than Sidak, and Sidak is more powerful than Bonferroni-Dunn.
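
To make the single-step adjustments concrete, here is a minimal sketch of the Bonferroni and Sidak formulas applied to the strongest All.vs.All comparison (`RF+1FF // RF+3FF+RUS`). It assumes that the `lookup` column holds a z statistic and that `p_noadj` is its two-sided normal p-value; the notebook does not say so explicitly, but the reported numbers are consistent with it. The function names are illustrative and are not part of `P_HAKN`.

```python
import math

def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni(p, m):
    """Bonferroni adjusted p-value for a family of m tests."""
    return min(1.0, p * m)

def sidak(p, m):
    """Sidak adjusted p-value for a family of m tests."""
    return 1.0 - (1.0 - p) ** m

k = 9                    # classifiers in the All.vs.All comparison
m = k * (k - 1) // 2     # number of pairwise tests
z = 3.356586             # assumed z statistic for RF+1FF // RF+3FF+RUS

p = two_sided_p(z)
print(m, round(p, 6), round(bonferroni(p, m), 6), round(sidak(p, m), 6))
```

The printed values should land close to the `p_noadj`, `ap_BDun`, and `ap_Sdak` entries for that pair, with the Sidak value slightly below the Bonferroni one.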

In [10]:
print(oname,": Nemenyi Test - All.vs.All")
nym_ap_df=nym_permit(rankings)

RF_Models : Nemenyi Test - All.vs.All
Classifiers: 9    Tests: 36


In [11]:
aptop = nym_ap_df.loc[nym_ap_df['ap_BDun'] < (alpha*2)]
aptop

| | lookup | p_noadj | ap_Nymi | ap_BDun | ap_Sdak |
|---|---|---|---|---|---|
| RF+1FF // RF+3FF+RUS | 3.356586 | 0.000789 | 0.022443 | 0.028408 | 0.028019 |
| RF // RF+3FF+RUS | 3.292036 | 0.000995 | 0.027714 | 0.035807 | 0.035191 |
| RF+1FF // RF+RUS | 3.098387 | 0.001946 | 0.050507 | 0.070048 | 0.067714 |
| RF+3FF // RF+3FF+RUS | 3.098387 | 0.001946 | 0.050507 | 0.070048 | 0.067714 |
| RF // RF+RUS | 3.033837 | 0.002415 | 0.061166 | 0.086927 | 0.083353 |


In [12]:
nym_rfa_ho = filtr_ap2h0(nym_ap_df)
nym_rfa_ho.head(len(aptop))

| | p_noadj | H0: Nymi | H0: BDun | H0: Sdak |
|---|---|---|---|---|
| RF+1FF // RF+3FF+RUS | False | False | False | False |
| RF // RF+3FF+RUS | False | False | False | False |
| RF+1FF // RF+RUS | False | True | True | True |
| RF+3FF // RF+3FF+RUS | False | True | True | True |
| RF // RF+RUS | False | True | True | True |


#### Nemenyi Test - Used on a subset (Don't do this)
Here the Nemenyi adjusted p-value is consistently larger than its single-step counterparts, because the Nemenyi adjustment is based on _the full list of classifiers being tested_, whereas the others adjust based on _the number of pairs of classifiers_ actually tested.
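
The mismatch can be seen by counting the comparisons each correction accounts for. The sketch below uses the classifier and test counts reported by the notebook outputs in this section; the helper `all_pairs` and the `cases` dictionary are illustrative names, not part of `P_HAKN`.

```python
def all_pairs(k):
    """Number of pairwise comparisons among k classifiers."""
    return k * (k - 1) // 2

# (classifiers, tests actually run), as reported in the notebook outputs.
cases = {
    "All.vs.All": (9, 36),
    "significant subset": (8, 12),
}

for name, (k, m) in cases.items():
    # Nemenyi corrects for every pair among k classifiers;
    # Bonferroni-Dunn and Sidak correct only for the m tests supplied.
    print(f"{name}: Nemenyi family = {all_pairs(k)} pairs, single-step family = {m} tests")
```

For All.vs.All the two family sizes agree, but for the subset Nemenyi still corrects for all pairs among the 8 remaining classifiers while the single-step methods correct for only the 12 tests that were kept.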

In [13]:
print(oname,"subset: unadjusted p_value is significant")
nym_sig_df = nym_psig(nym_ap_df)

RF_Models subset: unadjusted p_value is significant
Significant: 12    Not: 24
Classifiers: 8    Tests: 12


In [14]:
sotop = nym_sig_df.loc[nym_sig_df['ap_BDun'] < (alpha*2)]
sotop

| | lookup | p_noadj | ap_Nymi | ap_BDun | ap_Sdak |
|---|---|---|---|---|---|
| RF+1FF // RF+3FF+RUS | 3.356586 | 0.000789 | 0.017981 | 0.009469 | 0.009428 |
| RF // RF+3FF+RUS | 3.292036 | 0.000995 | 0.022253 | 0.011936 | 0.011871 |
| RF+1FF // RF+RUS | 3.098387 | 0.001946 | 0.040951 | 0.023349 | 0.023101 |
| RF+3FF // RF+3FF+RUS | 3.098387 | 0.001946 | 0.040951 | 0.023349 | 0.023101 |
| RF // RF+RUS | 3.033837 | 0.002415 | 0.049571 | 0.028976 | 0.028594 |
| RF+1FF // RF+3FF+SMOTE | 2.904738 | 0.003676 | 0.071882 | 0.044107 | 0.043227 |
| RF // RF+3FF+SMOTE | 2.840188 | 0.004509 | 0.085639 | 0.054104 | 0.052783 |
| RF+3FF // RF+RUS | 2.840188 | 0.004509 | 0.085639 | 0.054104 | 0.052783 |
| RF+3FF // RF+3FF+SMOTE | 2.646539 | 0.008132 | 0.139159 | 0.097584 | 0.093336 |


In [15]:
nym_sig_ho = filtr_ap2h0(nym_sig_df)
nym_sig_ho.head(len(sotop))

| | p_noadj | H0: Nymi | H0: BDun | H0: Sdak |
|---|---|---|---|---|
| RF+1FF // RF+3FF+RUS | False | False | False | False |
| RF // RF+3FF+RUS | False | False | False | False |
| RF+1FF // RF+RUS | False | False | False | False |
| RF+3FF // RF+3FF+RUS | False | False | False | False |
| RF // RF+RUS | False | False | False | False |
| RF+1FF // RF+3FF+SMOTE | False | True | False | False |
| RF // RF+3FF+SMOTE | False | True | True | True |
| RF+3FF // RF+RUS | False | True | True | True |
| RF+3FF // RF+3FF+SMOTE | False | True | True | True |


#### Nemenyi Test - One.vs.All (Don't do this)
Demšar (2006, p. 12)
> "When all classifiers are compared with a control classifier, we can instead of the Nemenyi test use one of the general procedures for controlling the family-wise error in multiple hypothesis testing, such as the Bonferroni correction or similar procedures. Although these methods are generally conservative and can have little power, they are in this specific case more powerful than the Nemenyi test, since the latter adjusts the critical value for making `k(k − 1)/2` comparisons while when comparing with a control we only make `k − 1` comparisons."

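In other words, the Nemenyi adjustment accounts for every pair among _the full list of classifiers being tested_ (`k(k − 1)/2` comparisons), whereas the others adjust only for _the number of pairs of classifiers_ actually compared (`k − 1` when comparing against a control).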
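
The effect can be sketched numerically for the three strongest comparisons against `RF`. The example assumes that the `lookup` column is a z statistic with a two-sided normal p-value, and it takes the Nemenyi critical value for `k = 9` at `alpha = 0.05` (3.102) from the table of critical values in Demšar (2006). The variable and function names are illustrative and are not part of `P_HAKN`.

```python
import math

ALPHA = 0.05
K = 9               # classifiers, including the control
Q_NEMENYI = 3.102   # Demsar (2006) critical value for k = 9, alpha = 0.05

# Assumed z statistics, copied from the `lookup` column of the One.vs.All output.
z_stats = {
    "RF // RF+3FF+RUS": 3.292036,
    "RF // RF+RUS": 3.033837,
    "RF // RF+3FF+SMOTE": 2.840188,
}

def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Nemenyi: the threshold reflects all k(k - 1)/2 pairs.
nemenyi_reject = {pair: z > Q_NEMENYI for pair, z in z_stats.items()}

# Bonferroni: the p-value is scaled by only the k - 1 comparisons with the control.
bonferroni_reject = {
    pair: min(1.0, two_sided_p(z) * (K - 1)) < ALPHA for pair, z in z_stats.items()
}

for pair in z_stats:
    print(pair, "| Nemenyi rejects:", nemenyi_reject[pair],
          "| Bonferroni rejects:", bonferroni_reject[pair])
```

Under these assumptions Nemenyi rejects H0 only for the first pair, while the Bonferroni adjustment over `k − 1 = 8` comparisons rejects it for all three, which is the loss of power the quote describes.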

In [16]:
print(oname,": Nemenyi Test - One.vs.All")
nym_rf_df=nym_permit(rankings,control='RF')

RF_Models : Nemenyi Test - One.vs.All
Classifiers: 9    Tests: 8


In [17]:
rftop = nym_rf_df.loc[nym_rf_df['ap_BDun'] < (alpha*2)]
rftop

| | lookup | p_noadj | ap_Nymi | ap_BDun | ap_Sdak |
|---|---|---|---|---|---|
| RF // RF+3FF+RUS | 3.292036 | 0.000995 | 0.027714 | 0.007957 | 0.00793 |
| RF // RF+RUS | 3.033837 | 0.002415 | 0.061166 | 0.019317 | 0.019155 |
| RF // RF+3FF+SMOTE | 2.840188 | 0.004509 | 0.103887 | 0.03607 | 0.035505 |


In [18]:
nym_rfo_ho = filtr_ap2h0(nym_rf_df)
nym_rfo_ho.head(len(rftop))

| | p_noadj | H0: Nymi | H0: BDun | H0: Sdak |
|---|---|---|---|---|
| RF // RF+3FF+RUS | False | False | False | False |
| RF // RF+RUS | False | True | False | False |
| RF // RF+3FF+SMOTE | False | True | False | False |
