## Wilcoxon Signed-Rank Test 

To test whether the multimodal model performs significantly better than the baselines (visual or textual features only), a one-sided Wilcoxon signed-rank test is conducted on the paired test accuracies.

**Null hypothesis 1:** The test accuracy samples obtained by training a classifier on joint visual-textual features and the accuracy samples obtained by training a classifier on visual features only come from the same distribution (there is no significant difference between them).

**Null hypothesis 2:** The test accuracy samples obtained by training a classifier on joint visual-textual features and the accuracy samples obtained by training a classifier on textual features only come from the same distribution (there is no significant difference between them).

In [2]:
import scipy.stats as stats
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [12]:
# load data
#group1 = [456, 564, 54, 554, 54, 51, 1, 12, 45, 5]  #represents the test accuracies of the multimodal model
#group2 = [65, 87, 456, 564, 456, 564, 564, 6, 4, 564] # baseline model

# test and validation accuracies (for all epochs) for visual features
test_accuracies_visual = pd.read_csv("/kaggle/input/results-visual-features-seeds/visual_results.csv")
test_acc_visual = test_accuracies_visual['test_accuracy'].tolist()
print(test_acc_visual)

val_accuracies_all_epochs_visual = pd.read_csv("/kaggle/input/results-visual-features-seeds/visual_all_val_accuracies.csv")
val_accuracies_all_epochs_visual = val_accuracies_all_epochs_visual.drop(val_accuracies_all_epochs_visual.columns[0],  axis=1)
val_accuracies_all_epochs_visual.head()

# test and validation accuracies (for all epochs) for joint features
test_accuracies_joint = pd.read_csv("/kaggle/input/results-joint-features-seeds/joint_results.csv")
test_acc_joint = test_accuracies_joint['test_accuracy'].tolist()
print(test_acc_joint)

val_accuracies_all_epochs_joint = pd.read_csv("/kaggle/input/results-joint-features-seeds/joint_all_val_accuracies.csv")
val_accuracies_all_epochs_joint = val_accuracies_all_epochs_joint.drop(val_accuracies_all_epochs_joint.columns[0],  axis=1)
val_accuracies_all_epochs_joint.head()

# test and validation accuracies (for all epochs) for textual features

[0.4382022619247436, 0.4719101190567016, 0.449438214302063, 0.4157303273677826, 0.3595505654811859, 0.3932584226131439, 0.4269662797451019, 0.4044943749904632, 0.4606741666793823, 0.449438214302063]
[0.6629213690757751, 0.6067415475845337, 0.6404494643211365, 0.6404494643211365, 0.6516854166984558, 0.6516854166984558, 0.6741573214530945, 0.6853932738304138, 0.5730336904525757, 0.7078651785850525]


| | epoch 1 | epoch 2 | epoch 3 | epoch 4 | epoch 5 | epoch 6 | epoch 7 | epoch 8 | epoch 9 | epoch 10 | epoch 11 | epoch 12 | epoch 13 | epoch 14 | epoch 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.101124 | 0.235955 | 0.280899 | 0.382022 | 0.460674 | 0.550562 | 0.629214 | 0.629214 | 0.662921 | 0.629214 | 0.662921 | 0.707865 | 0.685393 | 0.617977 | 0.629214 |
| 1 | 0.157303 | 0.269663 | 0.370787 | 0.348315 | 0.404494 | 0.494382 | 0.651685 | 0.58427 | 0.595506 | 0.651685 | 0.685393 | 0.685393 | 0.707865 | 0.696629 | 0.640449 |
| 2 | 0.191011 | 0.303371 | 0.303371 | 0.494382 | 0.539326 | 0.58427 | 0.52809 | 0.595506 | 0.674157 | 0.651685 | 0.640449 | 0.674157 | 0.662921 | NaN | NaN |
| 3 | 0.224719 | 0.280899 | 0.235955 | 0.337079 | 0.47191 | 0.573034 | 0.651685 | 0.719101 | 0.707865 | 0.696629 | 0.719101 | 0.719101 | 0.662921 | 0.696629 | 0.696629 |
| 4 | 0.067416 | 0.146067 | 0.247191 | 0.314607 | 0.460674 | 0.539326 | 0.629214 | 0.662921 | 0.707865 | 0.674157 | 0.696629 | 0.662921 | 0.674157 | 0.707865 | 0.685393 |


In [14]:
# conduct significance testing for visual baseline

#stat, p_value = stats.wilcoxon(group1, group2, alternative='greater') 

stat, p_value = stats.wilcoxon(test_acc_joint, test_acc_visual, alternative='greater') 

print(f'Wilcoxon Signed-Rank Test: stat={stat}, p-value={p_value}')

if p_value < 0.05:
    print("We reject the null hypothesis 1. The multimodal model performs significantly better than the visual baseline.")
else:
    print("We cannot reject the null hypothesis 1. The multimodal model does not perform significantly better than the visual baseline.")


Wilcoxon Signed-Rank Test: stat=55.0, p-value=0.0009765625
We reject the null hypothesis 1. The multimodal model performs significantly better than the visual baseline.
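
The reported values can be checked by hand. For a one-sided alternative, the statistic returned by `stats.wilcoxon` is the sum of the ranks of the positive paired differences. Here the joint accuracy exceeds the visual accuracy in all 10 pairs, so the statistic is 1 + 2 + ... + 10 = 55, and under the null hypothesis only one of the 2^10 equally likely sign assignments reaches that value, which gives p = 1/1024. The pure-Python sketch below reproduces this from the printed accuracies; `average_ranks` is a helper introduced here for illustration and is not part of the notebook, and the sketch assumes there are no zero differences (true for this data).

```python
from itertools import product

# Test accuracies copied from the printed output above (10 values each).
visual = [0.4382022619247436, 0.4719101190567016, 0.449438214302063,
          0.4157303273677826, 0.3595505654811859, 0.3932584226131439,
          0.4269662797451019, 0.4044943749904632, 0.4606741666793823,
          0.449438214302063]
joint = [0.6629213690757751, 0.6067415475845337, 0.6404494643211365,
         0.6404494643211365, 0.6516854166984558, 0.6516854166984558,
         0.6741573214530945, 0.6853932738304138, 0.5730336904525757,
         0.7078651785850525]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


# Paired differences (joint minus visual) and ranks of their magnitudes.
diffs = [j - v for j, v in zip(joint, visual)]
ranks = average_ranks([abs(d) for d in diffs])

# W+ is the sum of ranks belonging to positive differences.
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

# Exact one-sided p-value: fraction of all sign assignments whose
# positive-rank sum is at least as large as the observed W+.
count = sum(
    1
    for signs in product([0, 1], repeat=len(ranks))
    if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
)
p_exact = count / 2 ** len(ranks)

print(f"W+ = {w_plus}, exact one-sided p = {p_exact}")
```

With only 10 pairs, 1/1024 is the smallest p-value this one-sided test can produce, so the result says that every pair favoured the joint model but carries no information about how large the improvement is.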


In [None]:
# conduct significance testing for textual baseline

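# NOTE: group1/group2 are only defined in commented-out code above, so they would raise a NameError here.
# test_acc_textual must first be loaded from the textual-features results,
# analogous to test_acc_visual and test_acc_joint.
stat, p_value = stats.wilcoxon(test_acc_joint, test_acc_textual, alternative='greater')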

print(f'Wilcoxon Signed-Rank Test: stat={stat}, p-value={p_value}')

if p_value < 0.05:
    print("We reject the null hypothesis 2. The multimodal model performs significantly better than the textual baseline.")
else:
    print("We cannot reject the null hypothesis 2. The multimodal model does not perform significantly better than the textual baseline.")

In [None]:
# plot overall results for test accuracies


In [None]:
# plot overall results for validation accuracies
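
Before the validation curves can be plotted, the per-epoch values have to be aggregated across runs, and the table printed earlier contains missing entries (row 2 has no values for epochs 14 and 15). The sketch below averages each epoch over the runs that have a value for it. It uses only the five rows shown in the `head()` output, presumably the joint-feature results since that is the last expression in the cell; `mean_per_epoch` is a hypothetical helper, not part of the notebook.

```python
# First five rows of the validation-accuracy table printed earlier
# (None marks the two missing entries in row 2).
val_acc = [
    [0.101124, 0.235955, 0.280899, 0.382022, 0.460674, 0.550562, 0.629214,
     0.629214, 0.662921, 0.629214, 0.662921, 0.707865, 0.685393, 0.617977,
     0.629214],
    [0.157303, 0.269663, 0.370787, 0.348315, 0.404494, 0.494382, 0.651685,
     0.58427, 0.595506, 0.651685, 0.685393, 0.685393, 0.707865, 0.696629,
     0.640449],
    [0.191011, 0.303371, 0.303371, 0.494382, 0.539326, 0.58427, 0.52809,
     0.595506, 0.674157, 0.651685, 0.640449, 0.674157, 0.662921, None,
     None],
    [0.224719, 0.280899, 0.235955, 0.337079, 0.47191, 0.573034, 0.651685,
     0.719101, 0.707865, 0.696629, 0.719101, 0.719101, 0.662921, 0.696629,
     0.696629],
    [0.067416, 0.146067, 0.247191, 0.314607, 0.460674, 0.539326, 0.629214,
     0.662921, 0.707865, 0.674157, 0.696629, 0.662921, 0.674157, 0.707865,
     0.685393],
]


def mean_per_epoch(rows):
    """Average each epoch column over the runs that have a value for it."""
    means, counts = [], []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        counts.append(len(present))
        means.append(sum(present) / len(present))
    return means, counts


means, counts = mean_per_epoch(val_acc)
for epoch, (m, n) in enumerate(zip(means, counts), start=1):
    print(f"epoch {epoch:2d}: mean val acc = {m:.4f} (n = {n})")
```

Reporting the number of runs per epoch alongside the mean makes it visible that the last two points rest on fewer runs than the others.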