# Algorithm Comparisons

In this notebook we compare the performance of two algorithms, one without relabeling and one with relabeling, through their test-sample F1 score and the norms of the weights of the hidden and output layers. For each metric, the null hypothesis is that the new methodology does not **improve** performance, i.e. the results are the same. The alternative hypothesis is that the second algorithm **generates an improvement** in the metric under test.

## Read the CSV file into a data frame

In [22]:
data <- read.csv('result_moons.csv')

In [23]:
head(data)

X,f1_score_before,f1_score_after,norm_1_before,norm_1_after,norm_2_before,norm_2_after
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,0.8971234,0.938943,44.54125,22.11043,27.98088,25.18419
1,0.875,0.913775,37.09127,17.21556,36.5373,19.96331
2,0.9280156,0.9462151,41.68607,15.10151,35.73296,15.94568
3,0.9126984,0.9412356,26.14099,21.61062,25.75867,24.3053
4,0.8964435,0.9469027,50.95062,23.04784,43.35037,24.08169
5,0.865019,0.9259826,46.08482,15.48541,44.53597,17.53767


In [80]:
f1_score_before <- unlist(data['f1_score_before'], use.names=FALSE)
f1_score_after <- unlist(data['f1_score_after'], use.names=FALSE)
norm_1_before <- unlist(data['norm_1_before'], use.names=FALSE)
norm_1_after <- unlist(data['norm_1_after'], use.names=FALSE)
norm_2_before <- unlist(data['norm_2_before'], use.names=FALSE)
norm_2_after <- unlist(data['norm_2_after'], use.names=FALSE)

## Test Normality

### Test f1 score

In [82]:
shapiro.test(f1_score_before)


	Shapiro-Wilk normality test

data:  f1_score_before
W = 0.98859, p-value = 0.982


In [83]:
shapiro.test(f1_score_after)


### Test weights norms

In [84]:
shapiro.test(norm_1_before)


	Shapiro-Wilk normality test

data:  norm_1_before
W = 0.97713, p-value = 0.7451


In [85]:
shapiro.test(norm_1_after)


	Shapiro-Wilk normality test

data:  norm_1_after
W = 0.97887, p-value = 0.7949


In [86]:
shapiro.test(norm_2_before)


	Shapiro-Wilk normality test

data:  norm_2_before
W = 0.9593, p-value = 0.2973


In [87]:
shapiro.test(norm_2_after)


	Shapiro-Wilk normality test

data:  norm_2_after
W = 0.97843, p-value = 0.7824


As can be seen, for every sample tested above, the p-value of the Shapiro-Wilk test is higher than 0.05, the significance level adopted here. This means that the null hypothesis that the data came from a normal distribution cannot be rejected.
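The individual Shapiro-Wilk calls can also be run in a single pass over the columns. The following is a minimal sketch, assuming a data frame with the same column names as `result_moons.csv`; the data frame `sim` holds simulated placeholder values (hypothetical means and standard deviations), not the experimental results.

```r
# Simulated stand-in for the real results (hypothetical values, 30 runs).
set.seed(1)
sim <- data.frame(
  f1_score_before = rnorm(30, mean = 0.90, sd = 0.02),
  f1_score_after  = rnorm(30, mean = 0.94, sd = 0.02),
  norm_1_before   = rnorm(30, mean = 39, sd = 7),
  norm_1_after    = rnorm(30, mean = 22, sd = 5)
)

# Apply the Shapiro-Wilk test to every column and keep only the p-values.
p_values <- sapply(sim, function(x) shapiro.test(x)$p.value)

# Names of the columns for which normality is rejected at the 5% level.
rejected <- names(p_values)[p_values < 0.05]

print(p_values)
print(rejected)
```

Looping over the columns makes it harder to test one sample twice by accident and leave another one untested.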

## Test F1 Score

To test whether the proposed algorithm improved the F1 score on the test sample, a t-test can be performed, given that normality of the samples was not rejected. We are interested in detecting differences greater than 2% of the baseline mean in order to consider the change an improvement. So the effect size of the test, $\delta^*$, is calculated as:

\begin{equation}
\delta^* = 0.02 \times E[\text{f1_score_before}]
\end{equation}

and the value is:

In [88]:
0.02*mean(f1_score_before)

[1] 0.0180998

The power of the test, then, is:

In [89]:
power.t.test(sig.level = 0.05, 
             n=30, 
             sd = sd(f1_score_before), 
             delta=0.02*mean(f1_score_before), 
             type='two.sample', 
             alternative='one.sided'
            )


     Two-sample t test power calculation 

              n = 30
          delta = 0.0180998
             sd = 0.02355011
      sig.level = 0.05
          power = 0.9026465
    alternative = one.sided

NOTE: n is number in *each* group
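To make the power figure less of a black box, the sketch below recomputes it directly from the noncentral t distribution and compares it with `power.t.test`. The helper name `manual_power` is hypothetical, and the inputs are the rounded values printed above rather than the raw sample statistics.

```r
# Power of a one-sided, two-sample t-test with n observations per group,
# computed from the noncentral t distribution (equal variances assumed).
manual_power <- function(n, delta, sd, sig.level = 0.05) {
  df     <- 2 * (n - 1)               # pooled degrees of freedom
  ncp    <- sqrt(n / 2) * delta / sd  # noncentrality parameter
  t_crit <- qt(sig.level, df, lower.tail = FALSE)
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE)
}

# Inputs copied from the printed output above (rounded values).
p_manual <- manual_power(n = 30, delta = 0.0180998, sd = 0.02355011)

p_builtin <- power.t.test(n = 30, delta = 0.0180998, sd = 0.02355011,
                          sig.level = 0.05, type = "two.sample",
                          alternative = "one.sided")$power

print(c(manual = p_manual, builtin = p_builtin))
```

Both values should agree with the power reported above up to rounding of the inputs.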


Applying Welch's t-test to the F1 score samples from before and after the adjustment gives:

In [90]:
t.test(f1_score_before, f1_score_after, alternative='less', conf.level=0.95)


	Welch Two Sample t-test

data:  f1_score_before and f1_score_after
t = -6.3554, df = 51.308, p-value = 2.77e-08
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
        -Inf -0.02439342
sample estimates:
mean of x mean of y 
0.9049898 0.9381137 


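The p-value allows us to reject the null hypothesis of equal performance, so the second algorithm is better than the first one with respect to the F1 score. Moreover, the upper bound of the 95% confidence interval for the difference in means (-0.0244) is larger in magnitude than $\delta^*$ (0.0181), so the improvement exceeds the 2% threshold set above.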
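The test above uses a null difference of zero. Since the stated goal is an improvement larger than $\delta^*$, one possible variant is to shift the null hypothesis by $\delta^*$ through the `mu` argument of `t.test`. The sketch below illustrates this on simulated data; the vectors `before` and `after` and their means and standard deviations are hypothetical, not the experimental samples.

```r
set.seed(42)
before <- rnorm(30, mean = 0.905, sd = 0.024)  # simulated, hypothetical
after  <- rnorm(30, mean = 0.938, sd = 0.024)  # simulated, hypothetical

delta_star <- 0.02 * mean(before)

# H0: mean(before) - mean(after) >= 0, as in the cell above.
plain <- t.test(before, after, alternative = "less", conf.level = 0.95)

# H0: mean(before) - mean(after) >= -delta_star,
# i.e. the improvement is at most delta_star.
shifted <- t.test(before, after, alternative = "less",
                  mu = -delta_star, conf.level = 0.95)

print(c(plain = plain$p.value, shifted = shifted$p.value))
```

Shifting the null makes the test stricter, so its p-value is never smaller than that of the plain test on the same data.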

## Testing norms

### Testing the norm of the weight from the hidden layer

For the norm, we have 30 samples from the runs performed and want a power of at least 0.8. Given these parameters, the resulting effect size, which is the smallest difference the test can reliably detect, is:

In [113]:
delta <- power.t.test(sig.level = 0.05, 
             n=30, 
             sd = sd(norm_1_before), 
             power=0.8, 
             type='two.sample', 
             alternative='one.sided'
            )$delta

cat("Delta value = ", delta / mean(norm_1_before) * 100, "%")

Delta value =  12.62151 %
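When `delta` is left unspecified, `power.t.test` solves for it numerically. The sketch below checks this round trip and shows that the detectable difference scales with the standard deviation. The value `sd_assumed` is a hypothetical standard deviation chosen only for illustration.

```r
sd_assumed <- 7  # hypothetical standard deviation, for illustration only

# Leave `delta` unspecified so that power.t.test solves for it.
delta_min <- power.t.test(n = 30, sd = sd_assumed, power = 0.8,
                          sig.level = 0.05, type = "two.sample",
                          alternative = "one.sided")$delta

# Feeding the solved delta back in should return the requested power.
power_back <- power.t.test(n = 30, sd = sd_assumed, delta = delta_min,
                           sig.level = 0.05, type = "two.sample",
                           alternative = "one.sided")$power

# The detectable difference scales linearly with the standard deviation.
delta_double <- power.t.test(n = 30, sd = 2 * sd_assumed, power = 0.8,
                             sig.level = 0.05, type = "two.sample",
                             alternative = "one.sided")$delta

print(c(delta = delta_min, power = power_back,
        ratio = delta_double / delta_min))
```

Because of this scaling, expressing the detectable difference as a percentage of the baseline mean, as done above, is mainly a readability choice.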

In [98]:
t.test(norm_1_before, norm_1_after, alternative='greater', conf.level=0.95, var.equal=FALSE)


	Welch Two Sample t-test

data:  norm_1_before and norm_1_after
t = 10.148, df = 50.366, p-value = 4.494e-14
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 13.99586      Inf
sample estimates:
mean of x mean of y 
 38.81299  22.04913 


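The p-value allows us to reject the null hypothesis of equal norms for the two algorithms, indicating at the 95% confidence level that the second algorithm yields a smaller hidden-layer weight norm than the first one. The lower bound of the confidence interval for the difference (13.996) is also well above the minimum detectable difference computed above (12.6% of the baseline mean, about 4.9).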

### Testing the norm of the weights from the output layer

In [114]:
delta <- power.t.test(sig.level = 0.05, 
             n=30, 
             sd = sd(norm_2_before), 
             power=0.8, 
             type='two.sample', 
             alternative='one.sided'
            )$delta

cat("Delta value = ", delta / mean(norm_2_before) * 100, "%")

Delta value =  13.9734 %

In [115]:
t.test(norm_2_before, norm_2_after, alternative='greater', conf.level=0.95, var.equal=FALSE)


	Welch Two Sample t-test

data:  norm_2_before and norm_2_after
t = 5.8182, df = 52.757, p-value = 1.784e-07
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 6.706916      Inf
sample estimates:
mean of x mean of y 
 33.42141  24.00473 


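As with the hidden-layer weights, the p-value allows us to conclude, at the 95% confidence level, that the second algorithm yields a smaller output-layer weight norm than the first one. The lower bound of the confidence interval for the difference (6.707) again exceeds the minimum detectable difference (13.97% of the baseline mean, about 4.7).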

## Conclusion

The proposed algorithm **improves** the general performance of the MLP, as shown by the improvement in the F1 score. It also helps to reduce model complexity by reducing the weight norms of all layers.
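To put the three results on a common scale, the sketch below computes the relative change of each metric from the sample means printed in the t-test outputs. The data frame `means` and its column names are introduced here only for illustration.

```r
# Sample means copied from the t-test outputs above.
means <- data.frame(
  metric = c("f1_score", "norm_1", "norm_2"),
  before = c(0.9049898, 38.81299, 33.42141),
  after  = c(0.9381137, 22.04913, 24.00473)
)

# Relative change of each metric, in percent of the "before" mean.
means$change_pct <- 100 * (means$after - means$before) / means$before

print(means)
```

With these means, the F1 score rises by roughly 3.7%, while the hidden-layer and output-layer norms fall by roughly 43% and 28%, respectively.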