# Assignement V: Evaluation Methodologies, Model Selection & Comparison of Models

## Comparison of Models

**Download the data set https://archive.ics.uci.edu/ml/datasets/HCV+data and consider the class Blood-Donor and Non-Blood-Donor to be predicted.**

In [1]:
#Disable warning
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# libraries
#numeric
import numpy as np
import pandas as pd
# graphics
import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 


Download the data set https://archive.ics.uci.edu/ml/datasets/HCV+data

In [3]:
df = pd.read_csv("hcvdat0.csv",header = 0)
df.shape

(615, 14)

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


In [5]:
df.drop('Unnamed: 0',axis=1,inplace=True)
df.describe()

Unnamed: 0,Age,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
count,615.0,614.0,597.0,614.0,615.0,615.0,615.0,605.0,615.0,615.0,614.0
mean,47.40813,41.620195,68.28392,28.450814,34.786341,11.396748,8.196634,5.368099,81.287805,39.533171,72.044137
std,10.055105,5.780629,26.028315,25.469689,33.09069,19.67315,2.205657,1.132728,49.756166,54.661071,5.402636
min,19.0,14.9,11.3,0.9,10.6,0.8,1.42,1.43,8.0,4.5,44.8
25%,39.0,38.8,52.5,16.4,21.6,5.3,6.935,4.61,67.0,15.7,69.3
50%,47.0,41.95,66.2,23.0,25.9,7.3,8.26,5.3,77.0,23.3,72.2
75%,54.0,45.2,80.1,33.075,32.9,11.2,9.59,6.06,88.0,40.2,75.4
max,77.0,82.2,416.6,325.3,324.0,254.0,16.41,9.67,1079.1,650.9,90.0


### 1 Eliminating samples or features with missing values
###### Deleting samples of the data frame

In [6]:
df.dropna();

###### Deleting all features with NaN

In [7]:
df.dropna(axis=1);

### 2 perform imputation

In [8]:
from sklearn.impute import SimpleImputer
#print(df.shape)
## numpy array
values = df.values[:,3:13]
#print(df.values[:,3:13].shape)

# type of imputation
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')# = 'median', ...
# Calculation (transform the dataset)
ImpDataValues = imputer.fit_transform(values)
# count the number of NaN values in each column
print('Missing: %d' % np.isnan(ImpDataValues).sum())
df.iloc[:,3:13] = ImpDataValues

Missing: 0


### 3.  Consider that the data is organized into two groups (Blood-Donor and Non-Blood-Donor)
#### encode labels and group all the non blood-donor into the same group (or skip the encode...)
**group all the non blood-donor into the same group**

In [9]:
print(df.Category.unique())
print(df.Category.value_counts())
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder() 
y = class_le.fit_transform(df['Category'].values)

df['Category'] = y 
print(df.Category.unique())
df['Category'].replace(to_replace=[1,2,3,4], value=1, inplace=True)
print(df.Category.unique())

['0=Blood Donor' '0s=suspect Blood Donor' '1=Hepatitis' '2=Fibrosis'
 '3=Cirrhosis']
0=Blood Donor             533
3=Cirrhosis                30
1=Hepatitis                24
2=Fibrosis                 21
0s=suspect Blood Donor      7
Name: Category, dtype: int64
[0 1 2 3 4]
[0 1]


# 2.Programming Exercises

## 2.4 Read section 4.3 of https://arxiv.org/pdf/1811.12808.pdf
## Compare
##  • SVM (linear) versus SVM(RBF kernel)
##  • perceptron versus multilayer feedforward neural network.

### 4 Split the data

In [10]:
from sklearn.model_selection import train_test_split

X = df.values[:,3:13]
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

### 5 Scaling

In [11]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## 2.4.1 SVM (linear) versus SVM(RBF kernel)

In [12]:
from sklearn.svm import SVC
SVM_lin = SVC(kernel='linear', random_state=42)
SVM_lin.fit(X_train_scaled, y_train)
pred_SVM_lin = SVM_lin.predict(X_test_scaled)

SVM_rbf = SVC(kernel='rbf', random_state=42)
SVM_rbf.fit(X_train_scaled, y_train)
pred_SVM_rbf = SVM_rbf.predict(X_test_scaled)

In [13]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_SVM_lin, 
                   y_model2 = pred_SVM_rbf)

print(tb)

chi2, p = mcnemar(ary=tb, corrected=True)
# It is highly recommended to use exact=True for sample sizes < 25 since chi-squared is not well-approximated by the chi-squared distribution!

print('chi-squared:', chi2)
print('p-value:', p)

if p > 0.05:
    print('As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the two predictive models')
else:
    print('At a sig. level of 5%%, we can reject the H0 that both models perform equally well on this dataset, since the p-value (%.3f) is smaller than α' %(p))




[[167   0]
 [  4  14]]
chi-squared: 2.25
p-value: 0.13361440253771584
As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the two predictive models


## 2.4.2  perceptron versus multilayer feedforward neural network

In [14]:
from sklearn.linear_model import Perceptron
perc = Perceptron(random_state=42)
perc.fit(X_train_scaled, y_train)
pred_perc = perc.predict(X_test_scaled)

from sklearn.neural_network import MLPClassifier
nn =  MLPClassifier(activation='tanh', hidden_layer_sizes=(4,2), max_iter=5000, random_state=42)
nn.fit(X_train_scaled, y_train)
pred_nn = nn.predict(X_test_scaled)

In [15]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_perc, 
                   y_model2 = pred_nn)

print(tb)

chi2, p = mcnemar(ary=tb, corrected=True)
# It is highly recommended to use exact=True for sample sizes < 25 since chi-squared is not well-approximated by the chi-squared distribution!

print('chi-squared:', chi2)
print('p-value:', p)

if p > 0.05:
    print('As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the two predictive models')
else:
    print('At a sig. level of 5%%, we can reject the H0 that both models perform equally well on this dataset, since the p-value (%.3f) is smaller than α' %(p))



[[164   3]
 [  3  15]]
chi-squared: 0.16666666666666666
p-value: 0.6830913983096086
As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the two predictive models


# 2.Programming Exercises

## 2.5 Read section 4.7 of the paper https://arxiv.org/pdf/1811.12808.pdf
## Implement/adapt the test to compare more than two models, including at least 3 classifiersof the list:
##  • SVM (linear)
##  • SVM (non-linear kernel)
##  • perceptron
##  • logistic regression
##  • LDA
##  • KNN

In [16]:
SVM_poly = SVC(C=1.0,kernel='poly', degree=2, random_state=42)
SVM_poly.fit(X_train_scaled, y_train)
pred_SVM_poly = SVM_poly.predict(X_test_scaled)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
#Learning
lda.fit(X_train_scaled, y_train)
pred_lda = lda.predict(X_test_scaled)

from sklearn.linear_model import SGDClassifier
# With loss="log" a logistic regression is be appllied
lr = SGDClassifier(loss="log", random_state=42)
lr.fit(X_train_scaled, y_train)
pred_lr = lr.predict(X_test_scaled)

from sklearn.neighbors import KNeighborsClassifier
kNN =  KNeighborsClassifier(n_neighbors=3)#, metric = 'manhattan')
kNN.fit(X_train_scaled, y_train)
pred_kNN = kNN.predict(X_test_scaled)


In [17]:
from mlxtend.evaluate import ftest
f, p_value = ftest(y_test, 
               pred_SVM_lin, 
               pred_SVM_rbf,
               pred_perc, 
               pred_nn, 
               pred_SVM_poly,
               pred_lda,
               pred_lr,
               pred_kNN)

print('F: %.3f' %f)
print('p-value:', p_value)

if p_value > 0.05:
    print('As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the accucary of the predictive models')
else:
    print('Since the p-value (%.3f) is smaller than α, we can reject the H0 and conclude that there is a difference between the classification accuracies' %p_value)
    print('--> perform multiple post hoc pair-wise tests') 
          

F: 0.740
p-value: 0.6380806969601052
As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the accucary of the predictive models


# --------------------------------------------------------------------------------
# --------------------------------------------------------------------------------
# with generated data
### to show  differences between the performance of multiple classifiers
# --------------------------------------------------------------------------------
# --------------------------------------------------------------------------------

In [18]:
from sklearn.datasets import make_moons

X_moon,y_moon= make_moons(n_samples=1000, shuffle=True, noise=None, random_state=None)
X=X_moon
y=y_moon
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

In [19]:
#scaling not crutial as the features have similar scales
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [20]:
from sklearn.svm import SVC
SVM_lin = SVC(kernel='linear', random_state=42)
SVM_lin.fit(X_train_scaled, y_train)
pred_SVM_lin = SVM_lin.predict(X_test_scaled)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
#Learning
lda.fit(X_train_scaled, y_train)
pred_lda = lda.predict(X_test_scaled)

from sklearn.linear_model import SGDClassifier
# With loss="log" a logistic regression is be appllied
lr = SGDClassifier(loss="log", random_state=42)
lr.fit(X_train_scaled, y_train)
pred_lr = lr.predict(X_test_scaled)

## instance-based
from sklearn.neighbors import KNeighborsClassifier
kNN =  KNeighborsClassifier(n_neighbors=3)#, metric = 'manhattan')
kNN.fit(X_train_scaled, y_train)
pred_kNN = kNN.predict(X_test_scaled)

## non-linear

SVM_rbf = SVC(kernel='rbf', random_state=42)
SVM_rbf.fit(X_train_scaled, y_train)
pred_SVM_rbf = SVM_rbf.predict(X_test_scaled)

from sklearn.linear_model import Perceptron
perc = Perceptron(random_state=42)
perc.fit(X_train_scaled, y_train)
pred_perc = perc.predict(X_test_scaled)

from sklearn.neural_network import MLPClassifier
nn =  MLPClassifier(activation='tanh', hidden_layer_sizes=(4,2), max_iter=5000, random_state=42)
nn.fit(X_train_scaled, y_train)
pred_nn = nn.predict(X_test_scaled)

SVM_poly = SVC(C=1.0,kernel='poly', degree=2, random_state=42)
SVM_poly.fit(X_train_scaled, y_train)
pred_SVM_poly = SVM_poly.predict(X_test_scaled)

In [21]:
from mlxtend.evaluate import ftest
f, p_value = ftest(y_test, 
               pred_SVM_lin, 
               pred_SVM_rbf,
               pred_perc, 
               pred_nn)

print('F: %.3f' %f)
print('p-value:', p_value)

if p_value > 0.05:
    print('As p-value is larger than sig. threshold (α=0.05) -> we cannot reject our null hypothesis and assume that there is no significant difference between the accucary of the predictive models')
else:
    print('Since the p-value (%.3f) is smaller than α, we can reject the H0 and conclude that there is a difference between the classification accuracies' %p_value)
    print('--> perform multiple post hoc pair-wise tests') 
          

F: 26.280
p-value: 2.67239263207248e-16
Since the p-value (0.000) is smaller than α, we can reject the H0 and conclude that there is a difference between the classification accuracies
--> perform multiple post hoc pair-wise tests


## Perform multiple post hoc pair-wise tests to determine which pairs have different population proportions.
### McNemar tests with a Bonferroni correction

In [22]:
numbModels = 4
from scipy.special import comb
numComb = comb(numbModels, 2);
print('Number of Multiple Tests = ', numComb)



Number of Multiple Tests =  6.0


### SVM lin vs RBF rbf

In [23]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_SVM_lin, 
                   y_model2 = pred_SVM_rbf)
chi2, p = mcnemar(ary=tb, corrected=True)
print(p)
if p > 0.05/numComb:
    print('p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)')
else:
    print('p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally')

4.1854369440036287e-10
p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally


### SVM lin vs perceptron

In [24]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_SVM_lin, 
                   y_model2 = pred_perc)
chi2, p = mcnemar(ary=tb, corrected=True)
print(p)
if p > 0.05/numComb:
    print('p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)')
else:
    print('p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally')

0.3408032468860819
p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)


### SVM lin vs NN

In [25]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_SVM_lin, 
                   y_model2 = pred_nn)
chi2, p = mcnemar(ary=tb, corrected=True)
print(p)
if p > 0.05/numComb:
    print('p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)')
else:
    print('p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally')

0.24821307898992026
p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)


### RBF rbf vs Perceptron

In [26]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_SVM_rbf, 
                   y_model2 = pred_perc)
chi2, p = mcnemar(ary=tb, corrected=True)
print(p)
if p > 0.05/numComb:
    print('p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)')
else:
    print('p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally')

7.025137193458291e-12
p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally


### RBF rbf vs NN

In [27]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_SVM_rbf, 
                   y_model2 = pred_nn)
chi2, p = mcnemar(ary=tb, corrected=True)
print(p)
if p > 0.05/numComb:
    print('p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)')
else:
    print('p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally')

1.9467060601806844e-09
p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally


### Perceptron vs NN

In [28]:
from mlxtend.evaluate import mcnemar_table, mcnemar
tb = mcnemar_table(y_target = y_test, 
                   y_model1 = pred_perc, 
                   y_model2 = pred_nn)
chi2, p = mcnemar(ary=tb, corrected=True)
print(p)
if p > 0.05/numComb:
    print('p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)')
else:
    print('p-value < 0.05/NumbMultiplTests, we can reject the H0 that both models perform equally')

0.16142946236707922
p-value > 0.05/NumbMultiplTests, we cannot reject H0 (there is no significant difference between the 2 models)
