> **Comparing Classifiers**
<br>` python  3.7.13    scikit-learn  1.0.2 `
<br>`numpy   1.19.5          pandas  1.3.5`

In [1]:
import pandas as pd
import numpy as np

In [2]:
# add parent folder path where lib folder is
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")

In [3]:
# nonparametric tests for one performance measure (e.g., AUC)
from P_HAKN import run_friedman, ph_pvals, filtr_ap2h0, filtr_psig 
from P_HAKN import compare_avgranks, compare_avgranks_lf, sort_dict_byval

In [4]:
alpha = 0.05   # Set this to change the default significance level

#### Read in our data

In [21]:
# In the csv, the columns are classifiers and the rows are datasets 
full_df = pd.read_csv("../datasets/4d_auc.csv", index_col=0)
#full_df = pd.read_csv("../datasets/4d_gmean.csv", index_col=0)
#full_df = pd.read_csv("../datasets/4d_eloss.csv", index_col=0)

In [6]:
# list of classifiers in the dataset or none for all
## generally it is a bad idea to base things on a naming convention ...

#avo = []
#avo = ['RF']
avo = ['RF','XGB']

In [7]:
if len(avo) == 0:
    oname = "All_Models"
    df = full_df
else:
    baseclf = tuple(avo)
    # e.g. ['RF', 'XGB'] -> "RF_XGB_Models"
    oname = "_".join(avo) + "_Models"
    # keep only the columns whose names start with one of the chosen prefixes
    df = full_df.loc[:, full_df.columns.str.startswith(baseclf)]

#### Friedman test 
Checks whether there is a significant difference in performance among the classifiers.<br>
If H0 (no difference) is rejected, we use post-hoc tests to find out which differences are significant.
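
As a rough illustration of the test itself, the sketch below runs SciPy's `friedmanchisquare` on a small table of made-up scores. It is not the implementation behind `run_friedman` (which is not shown here); the scores and the three classifier columns are hypothetical.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical scores: rows are datasets, columns are classifiers A, B, C.
scores = np.array([
    [0.90, 0.85, 0.80],
    [0.88, 0.84, 0.79],
    [0.93, 0.90, 0.86],
    [0.91, 0.87, 0.82],
])

# friedmanchisquare expects one sequence of measurements per classifier,
# so we pass the columns as separate arguments.
stat, p_value = friedmanchisquare(*scores.T)

alpha = 0.05
reject_h0 = p_value < alpha  # True means at least one classifier differs
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Here every dataset ranks the classifiers in the same order, so H0 is rejected even with only four datasets.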

In [8]:
reject, rptstr, rankings, avg_ranks = run_friedman(df)

print(oname,":",rptstr)
# continue only if H0 was rejected
if not reject:
    raise Exception("Failed to reject H0 for the Friedman test")

RF_XGB_Models : Freidman Test
H0: there is no difference in the means at the 95.0% confidence level 
Reject: Ready to continue with post-hoc tests


Essentially, the Friedman test tells us to reject H0 if the average ranks are spread further apart than chance alone would explain, and nothing else. It does not say which classifiers differ.

In [9]:
# lower numbers performed better
print("Friedman test: Average Ranks")
sort_dict_byval(avg_ranks)

Friedman test: Average Ranks


{'RF+1FF': 2.375,
 'RF': 3.25,
 'RF+3FF': 3.5,
 'RF+ROS': 6.0,
 'RF+SMOTE': 7.0,
 'XGB+3FF': 8.0,
 'XGB+3FF+SMOTE': 8.0,
 'XGB': 8.25,
 'XGB+1FF': 8.625,
 'XGB+SMOTE': 9.0,
 'RF+3FF+ROS': 9.25,
 'RF+3FF+SMOTE': 12.25,
 'XGB+3FF+ROS': 12.75,
 'XGB+ROS': 12.75,
 'RF+RUS': 14.375,
 'XGB+3FF+RUS': 14.5,
 'RF+3FF+RUS': 14.875,
 'XGB+RUS': 16.25}
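
For readers who want to see where such average ranks come from, here is a minimal sketch using plain pandas. It assumes a higher score is better (as with AUC) and uses hypothetical scores; it is not taken from `P_HAKN`.

```python
import pandas as pd

# Hypothetical AUC scores: rows are datasets, columns are classifiers.
auc = pd.DataFrame(
    {"A": [0.90, 0.88, 0.93, 0.91],
     "B": [0.85, 0.89, 0.90, 0.87],
     "C": [0.80, 0.79, 0.90, 0.82]},
    index=["d1", "d2", "d3", "d4"],
)

# Rank the classifiers within each dataset (row). The highest score gets
# rank 1, and tied scores share the average of the ranks they span.
ranks = auc.rank(axis=1, ascending=False, method="average")

# Average each classifier's rank over all datasets; lower is better.
avg_ranks = ranks.mean(axis=0).sort_values()
print(avg_ranks.to_dict())
```

For a measure where lower is better (an error or loss), `ascending=True` would be the natural choice instead.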

#### All vs. All Tests
Compare every classifier to every other one, using the rankings returned from the Friedman test.<br>
The general case shows p-values adjusted for multiple tests using a range of methods. The Nemenyi and Shaffer options show p-values adjusted with those methods alongside a few comparable ones.<br>
The dataframe of adjusted p-values can be quickly converted to show whether the null hypothesis (H0: no significant difference) should be retained (True) or rejected (False).<br>Note that for technical reasons, the Shaffer method should not be used for more than 18 classifiers.

In [10]:
# possibly useful
# pd.set_option('display.max_rows', 110)
# pd.set_option("display.max_rows", max_rows, "display.max_columns", max_cols)

In [11]:
print(oname,": Range of Tests")
gen_pvals_df = ph_pvals(rankings)
gen_pvals_df

RF_XGB_Models : Range of Tests
Classifiers: 18    Tests: 153


| | p_noadj | ap_BDun | ap_Sdak | ap_Holm | ap_Finr | ap_Hoch | ap_Li |
|---|---|---|---|---|---|---|---|
| RF+1FF // XGB+RUS | 0.000237 | 0.036309 | 0.035662 | 0.036309 | 0.035662 | 0.036309 | 1.0 |
| RF // XGB+RUS | 0.000574 | 0.087766 | 0.084048 | 0.087192 | 0.042946 | 0.087192 | 1.0 |
| RF+3FF // XGB+RUS | 0.000731 | 0.111893 | 0.105897 | 0.110430 | 0.042946 | 0.110430 | 1.0 |
| RF+1FF // RF+3FF+RUS | 0.000929 | 0.142066 | 0.132493 | 0.139280 | 0.042946 | 0.139280 | 1.0 |
| RF+1FF // XGB+3FF+RUS | 0.001318 | 0.201682 | 0.182754 | 0.196410 | 0.042946 | 0.196410 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| XGB // XGB+3FF | 0.947197 | 1.000000 | 1.000000 | 1.000000 | 0.953171 | 1.000000 | 1.0 |
| XGB // XGB+3FF+SMOTE | 0.947197 | 1.000000 | 1.000000 | 1.000000 | 0.953171 | 1.000000 | 1.0 |
| RF+RUS // XGB+3FF+RUS | 0.973584 | 1.000000 | 1.000000 | 1.000000 | 0.974826 | 1.000000 | 1.0 |
| XGB+3FF // XGB+3FF+SMOTE | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 |
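
The unadjusted p-values in this output are consistent with the usual z-test on the difference between two average ranks. The sketch below reproduces the first row under the assumption that there are four datasets (the notebook does not state this directly); `pairwise_p` is a hypothetical helper, not a `P_HAKN` function.

```python
import math
from scipy.stats import norm

def pairwise_p(rank_a, rank_b, n_classifiers, n_datasets):
    """Two-sided p-value for the difference between two average ranks."""
    # Standard error of a difference between two average ranks.
    se = math.sqrt(n_classifiers * (n_classifiers + 1) / (6.0 * n_datasets))
    z = abs(rank_a - rank_b) / se
    return 2.0 * norm.sf(z)

# Average ranks of RF+1FF and XGB+RUS from the Friedman output.
# n_datasets=4 is an assumption, not a value stated in the notebook.
p = pairwise_p(2.375, 16.25, n_classifiers=18, n_datasets=4)

n_tests = 18 * 17 // 2          # 153 pairwise comparisons
p_bonf = min(p * n_tests, 1.0)  # Bonferroni-Dunn: multiply by the number of tests
print(f"p_noadj = {p:.6f}, Bonferroni-Dunn = {p_bonf:.6f}")
```

Under that assumption the result agrees with the `p_noadj` and `ap_BDun` values in the first row to the precision shown.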


In [12]:
gen_ho_df = filtr_ap2h0(gen_pvals_df)
gen_ho_df

| | p_noadj | H0: BDun | H0: Sdak | H0: Holm | H0: Finr | H0: Hoch | H0: Li |
|---|---|---|---|---|---|---|---|
| RF+1FF // XGB+RUS | False | False | False | False | False | False | True |
| RF // XGB+RUS | False | True | True | True | False | True | True |
| RF+3FF // XGB+RUS | False | True | True | True | False | True | True |
| RF+1FF // RF+3FF+RUS | False | True | True | True | False | True | True |
| RF+1FF // XGB+3FF+RUS | False | True | True | True | False | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... |
| XGB // XGB+3FF | True | True | True | True | True | True | True |
| XGB // XGB+3FF+SMOTE | True | True | True | True | True | True | True |
| RF+RUS // XGB+3FF+RUS | True | True | True | True | True | True | True |
| XGB+3FF // XGB+3FF+SMOTE | True | True | True | True | True | True | True |
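
The True/False conversion can be approximated with a single pandas comparison. This is only a sketch of the idea, assuming H0 is retained when the p-value is at least `alpha`; the actual logic of `filtr_ap2h0` is not shown in this notebook.

```python
import pandas as pd

alpha = 0.05

# Three rows and three columns copied from the adjusted p-value output.
adj_p = pd.DataFrame(
    {"p_noadj": [0.000237, 0.000574, 0.947197],
     "ap_Holm": [0.036309, 0.087192, 1.000000],
     "ap_Finr": [0.035662, 0.042946, 0.953171]},
    index=["RF+1FF // XGB+RUS", "RF // XGB+RUS", "XGB // XGB+3FF"],
)

# True  -> p >= alpha, H0 is retained (no significant difference)
# False -> p <  alpha, H0 is rejected
h0_retained = adj_p >= alpha
print(h0_retained)
```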


In [13]:
print(oname,": Nemenyi Test")
nym_ap_df=ph_pvals(rankings,nmyi=True)
nym_ap_df

RF_XGB_Models : Nemenyi Test
Classifiers: 18    Tests: 153


| | p_noadj | ap_Nymi | ap_BDun | ap_Sdak |
|---|---|---|---|---|
| RF+1FF // XGB+RUS | 0.000237 | 0.026680 | 0.036309 | 0.035662 |
| RF // XGB+RUS | 0.000574 | 0.057931 | 0.087766 | 0.084048 |
| RF+3FF // XGB+RUS | 0.000731 | 0.071463 | 0.111893 | 0.105897 |
| RF+1FF // RF+3FF+RUS | 0.000929 | 0.087135 | 0.142066 | 0.132493 |
| RF+1FF // XGB+3FF+RUS | 0.001318 | 0.115677 | 0.201682 | 0.182754 |
| ... | ... | ... | ... | ... |
| XGB // XGB+3FF | 0.947197 | 0.900000 | 1.000000 | 1.000000 |
| XGB // XGB+3FF+SMOTE | 0.947197 | 0.900000 | 1.000000 | 1.000000 |
| RF+RUS // XGB+3FF+RUS | 0.973584 | 0.900000 | 1.000000 | 1.000000 |
| XGB+3FF // XGB+3FF+SMOTE | 1.000000 | 0.900000 | 1.000000 | 1.000000 |


For analysis, it may be useful to consider just the pairs with significant differences.<br>For statistical reasons, the Nemenyi test should not be used for this.

In [15]:
print(oname,"subset: unadjusted p_value is significant")
gen_sig_df = filtr_psig(gen_pvals_df)
gen_sig_df

RF_XGB_Models subset: unadjusted p_value is significant
Significant: 32    Not: 121
Classifiers: 16    Tests: 32


| | p_noadj | ap_BDun | ap_Sdak | ap_Holm | ap_Finr | ap_Hoch | ap_Li |
|---|---|---|---|---|---|---|---|
| RF+1FF // XGB+RUS | 0.000237 | 0.007594 | 0.007566 | 0.007594 | 0.007566 | 0.007594 | 0.000249 |
| RF // XGB+RUS | 0.000574 | 0.018356 | 0.018194 | 0.017783 | 0.009139 | 0.017783 | 0.000602 |
| RF+3FF // XGB+RUS | 0.000731 | 0.023402 | 0.023139 | 0.02194 | 0.009139 | 0.02194 | 0.000767 |
| RF+1FF // RF+3FF+RUS | 0.000929 | 0.029713 | 0.029289 | 0.026927 | 0.009139 | 0.026927 | 0.000973 |
| RF+1FF // XGB+3FF+RUS | 0.001318 | 0.042182 | 0.041331 | 0.036909 | 0.009139 | 0.036909 | 0.001381 |
| RF+1FF // RF+RUS | 0.001478 | 0.047311 | 0.046243 | 0.039919 | 0.009139 | 0.039919 | 0.001549 |
| RF // RF+3FF+RUS | 0.002073 | 0.066343 | 0.064255 | 0.053904 | 0.009443 | 0.046945 | 0.002171 |
| RF+3FF // RF+3FF+RUS | 0.002584 | 0.082692 | 0.079464 | 0.064603 | 0.010297 | 0.046945 | 0.002704 |
| RF // XGB+3FF+RUS | 0.002881 | 0.09218 | 0.08818 | 0.069135 | 0.010297 | 0.046945 | 0.003013 |
| RF // RF+RUS | 0.003208 | 0.102651 | 0.097707 | 0.073781 | 0.010297 | 0.046945 | 0.003355 |
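
The adjusted values in this subset are smaller than in the full output, which is consistent with the corrections being recomputed over the 32 retained tests rather than all 153. A minimal sketch of that filter-then-adjust step, using only the Bonferroni-Dunn correction and a handful of p-values copied from the outputs (the internals of `filtr_psig` are an assumption here):

```python
import pandas as pd

alpha = 0.05

# A few unadjusted p-values copied from the all-vs-all output.
p_noadj = pd.Series({
    "RF+1FF // XGB+RUS": 0.000237,
    "RF // XGB+RUS": 0.000574,
    "XGB // XGB+3FF": 0.947197,
    "RF+RUS // XGB+3FF+RUS": 0.973584,
})

# Keep only the pairs whose unadjusted p-value is significant.
significant = p_noadj[p_noadj < alpha]

# The full run kept 32 pairs, so that count (not the 2 kept in this toy
# sample) is used as the number of tests for the correction.
n_kept = 32
ap_bdun = (significant * n_kept).clip(upper=1.0)
print(ap_bdun)
```

The results differ from the `ap_BDun` column only in the last digits, because the inputs here are the rounded values from the display.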


Essentially, the post-hoc tests tell us which pairs showed a significant difference, and nothing else. To find out which classifier in a pair performed better, we need to go back to the average ranks (`avg_ranks`) from the Friedman test.
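
The lookup described above can be sketched in a few lines. `describe_pair` is a hypothetical helper written for illustration, not the implementation of `compare_avgranks`; it assumes pair labels of the form `"A // B"` and a dictionary of average ranks.

```python
def describe_pair(pair, avg_ranks):
    """Format a pair with the better (lower) average rank on the left."""
    # Pair labels look like "RF+1FF // XGB+RUS".
    first, second = [name.strip() for name in pair.split("//")]
    better, worse = sorted((first, second), key=lambda name: avg_ranks[name])
    return f"{better} {avg_ranks[better]} // {avg_ranks[worse]} {worse}"

# Three average ranks copied from the Friedman output.
avg_ranks = {"RF+1FF": 2.375, "RF": 3.25, "XGB+RUS": 16.25}

print(describe_pair("RF+1FF // XGB+RUS", avg_ranks))
print(describe_pair("XGB+RUS // RF", avg_ranks))
```

The order of the names in the label does not matter; the classifier with the lower average rank always ends up on the left.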

In [16]:
# use the significance level set at the top of the notebook
ca = compare_avgranks(gen_sig_df, avg_ranks, alpha=alpha)
ca

Note: differences Down the Columns are NOT significant
      only differences Across the Rows ARE significant


['RF 3.25 // 12.25 RF+3FF+SMOTE',
 'RF 3.25 // 12.75 XGB+3FF+ROS',
 'RF 3.25 // 12.75 XGB+ROS',
 'RF 3.25 // 14.375 RF+RUS',
 'RF 3.25 // 14.5 XGB+3FF+RUS',
 'RF 3.25 // 14.875 RF+3FF+RUS',
 'RF 3.25 // 16.25 XGB+RUS',
 'RF+1FF 2.375 // 12.25 RF+3FF+SMOTE',
 'RF+1FF 2.375 // 12.75 XGB+3FF+ROS',
 'RF+1FF 2.375 // 12.75 XGB+ROS',
 'RF+1FF 2.375 // 14.375 RF+RUS',
 'RF+1FF 2.375 // 14.5 XGB+3FF+RUS',
 'RF+1FF 2.375 // 14.875 RF+3FF+RUS',
 'RF+1FF 2.375 // 16.25 XGB+RUS',
 'RF+3FF 3.5 // 12.25 RF+3FF+SMOTE',
 'RF+3FF 3.5 // 12.75 XGB+3FF+ROS',
 'RF+3FF 3.5 // 12.75 XGB+ROS',
 'RF+3FF 3.5 // 14.375 RF+RUS',
 'RF+3FF 3.5 // 14.5 XGB+3FF+RUS',
 'RF+3FF 3.5 // 14.875 RF+3FF+RUS',
 'RF+3FF 3.5 // 16.25 XGB+RUS',
 'RF+ROS 6.0 // 14.375 RF+RUS',
 'RF+ROS 6.0 // 14.5 XGB+3FF+RUS',
 'RF+ROS 6.0 // 14.875 RF+3FF+RUS',
 'RF+ROS 6.0 // 16.25 XGB+RUS',
 'RF+SMOTE 7.0 // 14.5 XGB+3FF+RUS',
 'RF+SMOTE 7.0 // 14.875 RF+3FF+RUS',
 'RF+SMOTE 7.0 // 16.25 XGB+RUS',
 'XGB 8.25 // 16.25 XGB+RUS',
 'XGB+1FF 8.6

In [17]:
compare_avgranks_lf(ca)

sorted by last field


['RF 3.25 // 14.875 RF+3FF+RUS',
 'RF+1FF 2.375 // 14.875 RF+3FF+RUS',
 'RF+3FF 3.5 // 14.875 RF+3FF+RUS',
 'RF+ROS 6.0 // 14.875 RF+3FF+RUS',
 'RF+SMOTE 7.0 // 14.875 RF+3FF+RUS',
 'RF 3.25 // 12.25 RF+3FF+SMOTE',
 'RF+1FF 2.375 // 12.25 RF+3FF+SMOTE',
 'RF+3FF 3.5 // 12.25 RF+3FF+SMOTE',
 'RF 3.25 // 14.375 RF+RUS',
 'RF+1FF 2.375 // 14.375 RF+RUS',
 'RF+3FF 3.5 // 14.375 RF+RUS',
 'RF+ROS 6.0 // 14.375 RF+RUS',
 'RF 3.25 // 12.75 XGB+3FF+ROS',
 'RF+1FF 2.375 // 12.75 XGB+3FF+ROS',
 'RF+3FF 3.5 // 12.75 XGB+3FF+ROS',
 'RF 3.25 // 14.5 XGB+3FF+RUS',
 'RF+1FF 2.375 // 14.5 XGB+3FF+RUS',
 'RF+3FF 3.5 // 14.5 XGB+3FF+RUS',
 'RF+ROS 6.0 // 14.5 XGB+3FF+RUS',
 'RF+SMOTE 7.0 // 14.5 XGB+3FF+RUS',
 'RF 3.25 // 12.75 XGB+ROS',
 'RF+1FF 2.375 // 12.75 XGB+ROS',
 'RF+3FF 3.5 // 12.75 XGB+ROS',
 'RF 3.25 // 16.25 XGB+RUS',
 'RF+1FF 2.375 // 16.25 XGB+RUS',
 'RF+3FF 3.5 // 16.25 XGB+RUS',
 'RF+ROS 6.0 // 16.25 XGB+RUS',
 'RF+SMOTE 7.0 // 16.25 XGB+RUS',
 'XGB 8.25 // 16.25 XGB+RUS',
 'XGB+1FF 8.6

#### One vs All (Control vs Treatment) 
In some cases, we do not care about all pairwise comparisons as we only propose a single method, or just need to compare to a baseline method. In this case we designate a control method, and compare all others to it.<br>
For statistical reasons the Nemenyi test should not be used for this.

In [18]:
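# the control method here is 'RF' (despite the variable name); every other classifier is compared to it
xgb_ap_df = ph_pvals(rankings,control='RF')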
xgb_ap_df

Classifiers: 18    Tests: 17


| | p_noadj | ap_BDun | ap_Sdak | ap_Holm | ap_Finr | ap_Hoch | ap_Li |
|---|---|---|---|---|---|---|---|
| RF // XGB+RUS | 0.000574 | 0.009752 | 0.009707 | 0.009752 | 0.009707 | 0.009752 | 0.010747 |
| RF // RF+3FF+RUS | 0.002073 | 0.035245 | 0.034666 | 0.033172 | 0.017486 | 0.033172 | 0.03778 |
| RF // XGB+3FF+RUS | 0.002881 | 0.048971 | 0.047858 | 0.043209 | 0.017486 | 0.043209 | 0.051732 |
| RF // RF+RUS | 0.003208 | 0.054534 | 0.053156 | 0.04491 | 0.017486 | 0.04491 | 0.057272 |
| RF // XGB+ROS | 0.011849 | 0.201432 | 0.183424 | 0.154036 | 0.039717 | 0.142187 | 0.183274 |
| RF // XGB+3FF+ROS | 0.011849 | 0.201432 | 0.183424 | 0.154036 | 0.039717 | 0.142187 | 0.183274 |
| RF // RF+3FF+SMOTE | 0.017118 | 0.29101 | 0.254373 | 0.188301 | 0.041066 | 0.188301 | 0.244823 |
| RF // RF+3FF+ROS | 0.111961 | 1.0 | 0.867156 | 1.0 | 0.223006 | 0.947197 | 0.679526 |
| RF // XGB+SMOTE | 0.127706 | 1.0 | 0.90199 | 1.0 | 0.227463 | 0.947197 | 0.707478 |
| RF // XGB+1FF | 0.154483 | 1.0 | 0.942313 | 1.0 | 0.248191 | 0.947197 | 0.745267 |
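
To make one of these adjustments concrete, here is a minimal sketch of Holm's step-down procedure applied to the four smallest p-values of the control comparison. `holm_adjust` and its `n_tests` argument are hypothetical names written for this illustration, not part of `P_HAKN`.

```python
import numpy as np

def holm_adjust(pvals, n_tests=None):
    """Holm step-down adjusted p-values.

    n_tests lets us pass only the smallest p-values of a larger family;
    it defaults to len(pvals) when the full family is given.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p) if n_tests is None else n_tests
    order = np.argsort(p)
    # The smallest p-value is multiplied by m, the next by m - 1, and so on.
    multipliers = m - np.arange(len(p))
    # Enforce monotonicity: an adjusted value never drops below an earlier one.
    stepped = np.maximum.accumulate(p[order] * multipliers)
    adjusted = np.empty_like(p)
    adjusted[order] = np.minimum(stepped, 1.0)
    return adjusted

# Four smallest unadjusted p-values from the RF-as-control output (17 tests).
smallest = [0.000574, 0.002073, 0.002881, 0.003208]
print(holm_adjust(smallest, n_tests=17).round(6))
```

The results match the `ap_Holm` column to within the rounding of the displayed inputs. Passing a partial list is only valid when it contains the smallest p-values of the family, as it does here.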


In [20]:
## IMPORTANT:
## The Shaffer test uses a recursive function, and
## Python's internal multithreading becomes heavily oversubscribed
## when there are more than 18 classifiers to compare

shaf_ap_xgb_df = ph_pvals(rankings,shaf=True,control='RF')
shaf_ap_xgb_df

Classifiers: 18    Tests: 17


| | p_noadj | ap_Shaf | ap_Holm | ap_Finr |
|---|---|---|---|---|
| RF // XGB+RUS | 0.000574 | 0.008605 | 0.009752 | 0.009707 |
| RF // RF+3FF+RUS | 0.002073 | 0.031098 | 0.033172 | 0.017486 |
| RF // XGB+3FF+RUS | 0.002881 | 0.043209 | 0.043209 | 0.017486 |
| RF // RF+RUS | 0.003208 | 0.043209 | 0.04491 | 0.017486 |
| RF // XGB+ROS | 0.011849 | 0.118489 | 0.154036 | 0.039717 |
| RF // XGB+3FF+ROS | 0.011849 | 0.118489 | 0.154036 | 0.039717 |
| RF // RF+3FF+SMOTE | 0.017118 | 0.171182 | 0.188301 | 0.041066 |
| RF // RF+3FF+ROS | 0.111961 | 1.0 | 1.0 | 0.223006 |
| RF // XGB+SMOTE | 0.127706 | 1.0 | 1.0 | 0.227463 |
| RF // XGB+1FF | 0.154483 | 1.0 | 1.0 | 0.248191 |
