### Task#1

We would like to compare the properties of four classifiers: C4.5 and three modifications of it, one that tunes the hyperparameter m, one that tunes the hyperparameter cf, and one that tunes both. We recorded the performance of these four versions of the classifier across 14 datasets. Performance is measured with AUC; the following file contains the AUC of each classifier on each dataset.

Using the Wilcoxon signed-rank test, compare every pair of classifiers. Choose the two classifiers with the most statistically significant difference in AUC.

In [1]:
import itertools

import numpy as np
import pandas as pd
import scipy as sc
import scipy.stats  # load the stats submodule explicitly so that sc.stats.wilcoxon resolves

# multipletests lives in the public multitest module; the sandbox path is a legacy location
from statsmodels.stats.multitest import multipletests

# AUC of each classifier (columns) on each dataset (rows)
df = pd.read_csv('AUCs.txt', sep='\t')
df


Unnamed: 0.1,Unnamed: 0,C4.5,C4.5+m,C4.5+cf,C4.5+m+cf
0,adult (sample),0.763,0.768,0.771,0.798
1,breast cancer,0.599,0.591,0.59,0.569
2,breast cancer wisconsin,0.954,0.971,0.968,0.967
3,cmc,0.628,0.661,0.654,0.657
4,ionosphere,0.882,0.888,0.886,0.898
5,iris,0.936,0.931,0.916,0.931
6,liver disorders,0.661,0.668,0.609,0.685
7,lung cancer,0.583,0.583,0.563,0.625
8,lymphography,0.775,0.838,0.866,0.875
9,mushroom,1.0,1.0,1.0,1.0


In [2]:
l = []  # p-values, one per pair, in the order produced by itertools.combinations
col = ['C4.5', 'C4.5+m', 'C4.5+cf', 'C4.5+m+cf']
# Wilcoxon signed-rank test on the paired AUCs of every pair of classifiers
for a, b in itertools.combinations(col, 2):
    res = sc.stats.wilcoxon(df[a], df[b]).pvalue
    l.append(res)
    print(f"Signed rank test {a} and {b}:", res)

Signed rank test C4.5 and C4.5+m: 0.01075713311978963
Signed rank test C4.5 and C4.5+cf: 0.861262330095348
Signed rank test C4.5 and C4.5+m+cf: 0.015874359307532084
Signed rank test C4.5+m and C4.5+cf: 0.05432871367198416
Signed rank test C4.5+m and C4.5+m+cf: 0.3278256758446406
Signed rank test C4.5+cf and C4.5+m+cf: 0.025313519968766574




**Answer: C4.5 and C4.5+m**

### Task#2

How many differences that are statistically significant at the 0.05 level did you find?

**Answer: 3**
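
The answers to Tasks 1 and 2 can be read off the six p-values programmatically. The following is only a minimal sketch: it hard-codes the p-values printed above (rounded) instead of recomputing them, and the names `p_values`, `best_pair` and `significant` are made up for this illustration.

```python
# p-values copied (rounded) from the output of the pairwise signed-rank tests above
p_values = {
    ('C4.5', 'C4.5+m'): 0.01076,
    ('C4.5', 'C4.5+cf'): 0.8613,
    ('C4.5', 'C4.5+m+cf'): 0.01587,
    ('C4.5+m', 'C4.5+cf'): 0.05433,
    ('C4.5+m', 'C4.5+m+cf'): 0.3278,
    ('C4.5+cf', 'C4.5+m+cf'): 0.02531,
}

# Task 1: the pair with the smallest p-value
best_pair = min(p_values, key=p_values.get)

# Task 2: pairs significant at the uncorrected 0.05 level
significant = [pair for pair, p in p_values.items() if p < 0.05]

print(best_pair, len(significant))
```

Note that these counts use uncorrected p-values, so they do not account for the fact that six hypotheses are tested at once.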

### Task#3

Based on the results from the previous question, tuning which hyperparameter gives the more statistically significant increase in classifier performance?

**Answer: m**

### Task#4

With pairwise comparisons of 4 classifiers, we tested 6 hypotheses. Let's apply a multiple testing correction. Start with Holm's method: how many hypotheses can you reject while controlling the FWER at 0.05 with this method?

In [3]:
_, p_adjusted_holm, _, _ = multipletests(pd.Series(l), alpha = 0.05, method = 'holm') 
print('Number of significant differences with FWER <= 0.05:', sum(p_adjusted_holm <= 0.05))

Number of significant differences with FWER <= 0.05: 0
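
To see why nothing survives the correction, Holm's step-down adjustment can be written out by hand. This is a minimal sketch, assuming the six p-values printed earlier are copied in directly; the variable names are illustrative and this is not the statsmodels implementation.

```python
import numpy as np

# p-values copied from the output of the pairwise signed-rank tests above
p = np.array([0.01075713311978963, 0.861262330095348, 0.015874359307532084,
              0.05432871367198416, 0.3278256758446406, 0.025313519968766574])
m = len(p)

order = np.argsort(p)            # indices that sort the p-values in ascending order
factors = m - np.arange(m)       # multipliers m, m-1, ..., 1 for the sorted p-values

# Holm: scale the sorted p-values, enforce monotonicity with a running maximum, cap at 1
adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(p[order] * factors))

holm_adjusted = np.empty(m)
holm_adjusted[order] = adjusted_sorted   # restore the original ordering

n_rejected_holm = int(np.sum(holm_adjusted <= 0.05))
print(holm_adjusted, n_rejected_holm)
```

The smallest p-value is multiplied by 6, which already pushes it above 0.05, and the running maximum keeps every later adjusted value at least as large.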


### Task#5

How many hypotheses can you reject while controlling the FDR at 0.05 with the Benjamini-Hochberg method?

In [4]:
_, p_adjusted_bh, _, _ = multipletests(pd.Series(l), alpha=0.05, method='fdr_bh')
print('Number of significant differences with FDR <= 0.05:', sum(p_adjusted_bh <= 0.05))

Number of significant differences with FDR <= 0.05: 2
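
The Benjamini-Hochberg step-up adjustment can be written out by hand in the same way. Again, this is a minimal sketch that assumes the six p-values printed earlier are copied in directly; the names are illustrative.

```python
import numpy as np

# p-values copied from the output of the pairwise signed-rank tests above
p = np.array([0.01075713311978963, 0.861262330095348, 0.015874359307532084,
              0.05432871367198416, 0.3278256758446406, 0.025313519968766574])
m = len(p)

order = np.argsort(p)            # indices that sort the p-values in ascending order
ranks = np.arange(1, m + 1)      # rank 1 for the smallest p-value, m for the largest
scaled = p[order] * m / ranks    # p_(i) * m / i

# Step-up: running minimum taken from the largest p-value downwards, capped at 1
adjusted_sorted = np.minimum(1.0, np.minimum.accumulate(scaled[::-1])[::-1])

bh_adjusted = np.empty(m)
bh_adjusted[order] = adjusted_sorted   # restore the original ordering

n_rejected_bh = int(np.sum(bh_adjusted <= 0.05))
print(bh_adjusted, n_rejected_bh)
```

The second-smallest p-value gives 0.0159 * 6 / 2, which is about 0.0476, and the running minimum passes that value down to the smallest p-value as well, so two hypotheses are rejected.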


### Task#6

How appropriate do you think it is to use the Benjamini-Hochberg method in this problem?

**Answer: Incorrect - statistics for different hypotheses here are dependent, so the assumptions of Benjamini-Hochberg's method do not hold**
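
When dependence between the test statistics is a concern, one commonly used alternative is the Benjamini-Yekutieli procedure, which controls the FDR under arbitrary dependence by multiplying the Benjamini-Hochberg adjustment by c(m) = 1 + 1/2 + ... + 1/m. It is not part of the original assignment. A minimal sketch, assuming the same six p-values copied from the earlier output (statsmodels exposes this correction as `method='fdr_by'` in `multipletests`):

```python
import numpy as np

# p-values copied from the output of the pairwise signed-rank tests above
p = np.array([0.01075713311978963, 0.861262330095348, 0.015874359307532084,
              0.05432871367198416, 0.3278256758446406, 0.025313519968766574])
m = len(p)

order = np.argsort(p)
ranks = np.arange(1, m + 1)
c_m = np.sum(1.0 / ranks)        # harmonic factor, 2.45 for m = 6

# Benjamini-Hochberg scaling times c(m), then the usual step-up running minimum
scaled = p[order] * m / ranks * c_m
adjusted_sorted = np.minimum(1.0, np.minimum.accumulate(scaled[::-1])[::-1])

by_adjusted = np.empty(m)
by_adjusted[order] = adjusted_sorted   # restore the original ordering

n_rejected_by = int(np.sum(by_adjusted <= 0.05))
print(c_m, by_adjusted, n_rejected_by)
```

The extra factor makes the procedure more conservative, so on these p-values it rejects no hypotheses at the 0.05 level.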

### Task#7

We suspect that the classifiers we tried actually differ from each other more than we were able to show. What can we do to increase the probability of detecting differences between the classifiers, if they exist?

**Answer: Apply classifiers to more datasets**
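
The effect of the number of datasets on the chance of detection can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the paired AUC differences are drawn from a normal distribution with a hypothetical mean of 0.02 and standard deviation of 0.04, and the function name `estimated_power` is invented for this example.

```python
import numpy as np
from scipy import stats


def estimated_power(n_datasets, n_sim=300, mean_diff=0.02, sd_diff=0.04,
                    alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the signed-rank test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        # Hypothetical paired AUC differences between two classifiers
        diffs = rng.normal(mean_diff, sd_diff, size=n_datasets)
        # One-sample form of the test: is the distribution of differences centred at zero?
        if stats.wilcoxon(diffs).pvalue < alpha:
            rejections += 1
    return rejections / n_sim


power_14 = estimated_power(14)   # same number of datasets as in this study
power_56 = estimated_power(56)   # four times as many datasets
print(power_14, power_56)
```

Under these assumptions, the estimated rejection rate should be noticeably higher with 56 datasets than with 14, which is the intuition behind the answer above.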