### Check Normality Assumption

H0: The data are normally distributed

H1: The data are not normally distributed

In [1]:
import numpy as np
import scipy.stats as stats

def check_normality(data):
    test_stat_normality, p_value_normality=stats.shapiro(data)   #test name
    print("p value:%.4f" % p_value_normality)
    if p_value_normality <0.05:
        print("Reject null hypothesis >> The data is not normally distributed")
    else:
        print("Fail to reject null hypothesis >> The data is normally distributed")

This is Shapiro-Wilk’s W test, but we can also use Kolmogorov-Smirnov and D’Agostino and Pearson’s test

### Check Variance Assumption

H0: The variance of the samples are same

H1: The variance of the samples are different

In [2]:
def check_variance_homogeneity(group1, group2):
    test_stat_var, p_value_var= stats.levene(group1,group2)
    print("p value:%.4f" % p_value_var)
    if p_value_var <0.05:
        print("Reject null hypothesis >> The variances of the samples are different.")
    else:
        print("Fail to reject null hypothesis >> The variances of the samples are same.")

It tests the null hypothesis that the population variances are equal (called homogeneity of variance or homoscedasticity). Suppose the resulting p-value of Levene’s test is less than the significance level (typically 0.05). In that case, the obtained differences in sample variances are unlikely to have occurred based on random sampling from a population with equal variances.

For checking variance homogeneity, preferred Levene’s test but you can also check Bartlett’s test.

1. Defining Hypothesis

2. Assumption Check

3. Selecting the Proper Test

4. Decision and Conclusion

### Q1: t-test independent

S-> Live attend

A-> Recordings

H₀: μₛ ≤ μₐ

H₁: μₛ > μₐ

In [3]:
sync = np.array([94. , 84.9, 82.6, 69.5, 80.1, 79.6, 81.4, 77.8, 81.7, 78.8, 73.2, 87.9, 87.9, 93.5, 82.3, 79.3, 78.3, 71.6, 88.6, 74.6, 74.1, 80.6])
asyncr = np.array([77.1, 71.7, 91. , 72.2, 74.8, 85.1, 67.6, 69.9, 75.3, 71.7, 65.7, 72.6, 71.5, 78.2])
check_normality(sync)
check_normality(asyncr)

p value:0.6556
Fail to reject null hypothesis >> The data is normally distributed
p value:0.0803
Fail to reject null hypothesis >> The data is normally distributed


In [4]:
check_variance_homogeneity(sync, asyncr)

p value:0.8149
Fail to reject null hypothesis >> The variances of the samples are same.


### Test Hypothesis

Since assumptions are satisfied, we can perform the parametric version of the test for 2 groups and unpaired data.(independent t-test)

In [5]:
ttest,p_value = stats.ttest_ind(sync,asyncr)
print("p value:%.8f" % p_value)
print("since the hypothesis is one sided >> use p_value/2 >> p_value_one_sided:%.4f" %(p_value/2))
if p_value/2 <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.00753598
since the hypothesis is one sided >> use p_value/2 >> p_value_one_sided:0.0038
Reject null hypothesis


Since the p value (0.0038) < 0.05 we reject H0 at 0.05 significance level. Therefore, we have enough evidence
at 5% level of significance to conclude that average grade of the students who follow the course synchronously is higher than the students who follow the course asynchronously.

### ======================================================================================

### Q2: ANOVA

H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.

H₁: At least one of them is different.

In [6]:
only_breast = np.array([794.1, 716.9, 993. , 724.7, 760.9, 908.2, 659.3 , 690.8, 768.7, 717.3 , 630.7, 729.5, 714.1, 810.3, 583.5, 679.9, 865.1])
only_formula = np.array([898.8, 881.2, 940.2, 966.2, 957.5, 1061.7, 1046.2, 980.4, 895.6, 919.7, 1074.1, 952.5, 796.3, 859.6, 871.1 , 1047.5, 919.1 , 1160.5, 996.9])
both = np.array([976.4, 656.4, 861.2, 706.8, 718.5, 717.1, 759.8, 894.6, 867.6, 805.6, 765.4, 800.3, 789.9, 875.3, 740. , 799.4, 790.3, 795.2 , 823.6, 818.7, 926.8, 791.7, 948.3])

In [7]:
check_normality(only_breast)
check_normality(only_breast)
check_normality(both)

p value:0.4694
Fail to reject null hypothesis >> The data is normally distributed
p value:0.4694
Fail to reject null hypothesis >> The data is normally distributed
p value:0.7973
Fail to reject null hypothesis >> The data is normally distributed


In [8]:
stat, pvalue_levene= stats.levene(only_breast,only_formula,both)
print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.7673
Fail to reject null hypothesis >> The variances of the samples are same.


### Test

Since assumptions are satisfied, we can perform the parametric version of the test for more than 2 groups and unpaired data.(ANOVA)

In [9]:
F, p_value = stats.f_oneway(only_breast,only_formula,both)
print("p value:%.6f" % p_value)
if p_value <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.000000
Reject null hypothesis


Since the p value (0.0000) < 0.05 we reject H0 at 0.05 significance level. Therefore, it can be concluded that at least one of the groups has a different average monthly weight gain.

To find which group or groups cause the difference, we need to perform a posthoc test/pairwise comparison.

In [10]:
pip install scikit-posthocs

Note: you may need to restart the kernel to use updated packages.


In [11]:
# pip install scikit-posthocs
# Pairwise T test for multiple comparisons of independent groups. May be used after a parametric ANOVA to do pairwise comparisons.

import scikit_posthocs as sp
posthoc_df= sp.posthoc_ttest([only_breast,only_formula,both], equal_var=True, p_adjust="bonferroni")  #****

group_names= ["only breast", "only formula","both"]  #****
posthoc_df.columns= group_names
posthoc_df.index= group_names
posthoc_df.style.applymap(lambda x: "background-color:violet" if x<0.05 else "background-color: white")


Unnamed: 0,only breast,only formula,both
only breast,1.0,0.0,0.129454
only formula,0.0,1.0,4e-06
both,0.129454,4e-06,1.0


It can be concluded that:

“only breast” is different than “only formula”

“only formula” is different than both “only breast” and “both”

“both” is different than “only formula”

### ======================================================================================

### Q3. Mann Whitney U

H₀: μ₁≤μ₂

H₁: μ₁>μ₂

In [12]:
test_team=np.array([6.2, 7.1, 1.5, 2,3 , 2, 1.5, 6.1, 2.4, 2.3, 12.4, 1.8, 5.3, 3.1, 9.4, 2.3, 4.1])
developer_team=np.array([2.3, 2.1, 1.4, 2.0, 8.7, 2.2, 3.1, 4.2, 3.6, 2.5, 3.1, 6.2, 12.1, 3.9, 2.2, 1.2 ,3.4])

check_normality(test_team)
check_normality(developer_team)
check_variance_homogeneity(test_team, developer_team)

p value:0.0046
Reject null hypothesis >> The data is not normally distributed
p value:0.0005
Reject null hypothesis >> The data is not normally distributed
p value:0.5410
Fail to reject null hypothesis >> The variances of the samples are same.


There are two groups, and data is collected from different individuals, so it is not paired. However, the normality assumption is not satisfied; therefore, we need to use the nonparametric version of 2 group comparison for unpaired data: the Mann-Whitney U Test.

In [13]:
ttest,pvalue = stats.mannwhitneyu(test_team,developer_team, alternative="two-sided")
print("p-value:%.4f" % pvalue)
if pvalue <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p-value:0.8226
Fail to reject null hypothesis


Since the p value (0.8226) > 0.05 we do not reject H0 at 0.05 significance level. Therefore, it can be concluded that there is no significant difference between the average overwork time of the two teams.

### ======================================================================================

### Q4. Kruskal-Wallis

H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.

H₁: At least one of them is different.

In [14]:
youtube=np.array([1913, 1879, 1939, 2146, 2040, 2127, 2122, 2156, 2036, 1974, 1956, 2146, 2151, 1943, 2125])
instagram =np.array([2305., 2355., 2203., 2231., 2185., 2420., 2386., 2410., 2340., 2349., 2241., 2396., 2244., 
2267., 2281.])
facebook =np.array([2133., 2522., 2124., 2551., 2293., 2367., 2460., 2311., 2178., 2113., 2048., 2443., 2265., 
2095., 2528.])

In [15]:
check_normality(youtube)
check_normality(instagram)
check_normality(facebook)

p value:0.0285
Reject null hypothesis >> The data is not normally distributed
p value:0.4156
Fail to reject null hypothesis >> The data is normally distributed
p value:0.1716
Fail to reject null hypothesis >> The data is normally distributed


In [16]:
stat, pvalue_levene= stats.levene(youtube, instagram, facebook)

print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.0012
Reject null hypothesis >> The variances of the samples are different.


The normality and variance homogeneity assumptions are not satisfied, therefore we need to use the nonparametric version of ANOVA for unpaired data (the data is collected from different sources).

In [17]:
F, p_value = stats.kruskal(youtube, instagram, facebook)
print("p value:%.6f" % p_value)
if p_value <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.000015
Reject null hypothesis


Since the p value (0.000015) < 0.05 we reject H0 at 0.05 significance level. Therefore, it can be concluded that at least one of the average customer acquisition number is different.

Since the data is not normal, the nonparametric version of posthoc test is used.

In [18]:
import scikit_posthocs as sp
posthoc_df= sp.posthoc_mannwhitney([youtube, instagram, facebook], p_adjust="bonferroni")  #non para->sp.posthoc_mannwhitney

group_names= ["youtube", "instagram","facebook"]
posthoc_df.columns= group_names
posthoc_df.index= group_names
posthoc_df.style.applymap(lambda x: "background-color:violet" if x<0.05 else "background-color: white")


Unnamed: 0,youtube,instagram,facebook
youtube,1.0,1e-05,0.002337
instagram,1e-05,1.0,1.0
facebook,0.002337,1.0,1.0


The average number of customers coming from YouTube is different than the other. 

### ======================================================================================

### Q5. t-test dependent

• The dependent variable must be continuous (interval/ratio)

• The observations are independent of one another.

• The dependent variable should be approximately normally distributed.

H₀: μd>=0 or The true mean difference is equal to or bigger than zero.             μ(after)-μ(before)

H₁: μd<0 or The true mean difference is smaller than zero.

In [19]:
test_results_before_diet=np.array([224, 235, 223, 253, 253, 224, 244, 225, 259, 220, 242, 240, 239, 229, 276, 254, 237, 227])
test_results_after_diet=np.array([198, 195, 213, 190, 246, 206, 225, 199, 214, 210, 188, 205, 200, 220, 190, 199, 191, 218])

In [20]:
check_normality(test_results_before_diet)
check_normality(test_results_after_diet)

p value:0.1635
Fail to reject null hypothesis >> The data is normally distributed
p value:0.1003
Fail to reject null hypothesis >> The data is normally distributed


The data is paired since data is collected from the same individuals and assumptions are satisfied, then we can use the dependent t-test.

In [21]:
test_stat, p_value_paired = stats.ttest_rel(test_results_before_diet,test_results_after_diet)
print("p value:%.6f" % p_value_paired , "one tailed p value:%.6f" %(p_value_paired/2))
if p_value_paired <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.000008 one tailed p value:0.000004
Reject null hypothesis


Since the p value (0.000008) < 0.05 we reject H0 at 0.05 significance level. Therefore, it can be concluded that there is enough evidence to conclude mean cholesterol level of patients has decreased after the diet.

### ======================================================================================

### Q6. Wilcoxon signed-rank test

Since the performance scores are obtained from the same files, the data is paired.

H₀: μd>=0 or The true mean difference is equal to or bigger than zero.  (end-pie)

H₁: μd<0 or The true mean difference is smaller than zero.

In [22]:
piedpiper=np.array([4.57, 4.55, 5.47, 4.67, 5.41, 5.55, 5.53, 5.63, 3.86, 3.97, 5.44, 3.93, 5.31, 5.17, 4.39, 4.28, 5.25])
endframe=np.array([4.27, 3.93, 4.01, 4.07, 3.87, 4. , 4. , 3.72, 4.16, 4.1 , 3.9 , 3.97, 4.08, 3.96, 3.96, 3.77, 4.09])

In [23]:
check_normality(piedpiper)
check_normality(endframe)

p value:0.0304
Reject null hypothesis >> The data is not normally distributed
p value:0.9587
Fail to reject null hypothesis >> The data is normally distributed


The normality assumption is not satisfied; therefore, we need to use the nonparametric version of the paired test, namely the Wilcoxon Signed Rank test.

In [24]:
test,pvalue = stats.wilcoxon(endframe,piedpiper) ##alternative default two sided
print("p-value:%.6f" %pvalue, ">> one_tailed_pval:%.6f" %(pvalue/2))

test,one_sided_pvalue = stats.wilcoxon(endframe,piedpiper, alternative="less")
print("one sided pvalue:%.6f" %(one_sided_pvalue))
if pvalue <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to recejt null hypothesis")

p-value:0.000214 >> one_tailed_pval:0.000107
one sided pvalue:0.000107
Reject null hypothesis


Since the p value (0.000107) < 0.05 we reject H0 at 0.05 significance level. Therefore, it can be concluded there is enough evidence to conclude that the performance of the PiedPaper is better than the EndFrame.

In [25]:
#my method
import numpy as np

piedpiper_median = np.median(piedpiper)
endframe_median = np.median(endframe)

print('Median performance score of PiedPiper:', piedpiper_median)
print('Median performance score of EndFrame:', endframe_median)

Median performance score of PiedPiper: 5.17
Median performance score of EndFrame: 4.0


The median performance score of PiedPiper is higher than the median performance score of EndFrame, which suggests that PiedPiper may be the better method for data compression without loss of quality.

### ======================================================================================

### Q7. Friedman Chi-Square

H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.

H₁: At least one of them is different.

In [26]:
method_A=np.array([89.8,89.9,88.6,88.7,89.6,89.7,89.2,89.3])
method_B=np.array([90.0,90.1,88.8,88.9,89.9,90.0,89.0,89.2])
method_C=np.array([91.5,90.7,90.3,90.4,90.2,90.3,90.2,90.3])

In [27]:
check_normality(method_A)
check_normality(method_B)
check_normality(method_C)

p value:0.3076
Fail to reject null hypothesis >> The data is normally distributed
p value:0.0515
Fail to reject null hypothesis >> The data is normally distributed
p value:0.0016
Reject null hypothesis >> The data is not normally distributed


In [34]:
stat, pvalue_levene= stats.levene(method_A, method_B, method_C)

print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.1953
Fail to reject null hypothesis >> The variances of the samples are same.


There are three groups, but the normality assumption is violated. So, we need to use the nonparametric version of ANOVA for paired data since the accuracy scores are obtained from the same test sets.

In [29]:
test_stat,p_value = stats.friedmanchisquare(method_A,method_B, method_C)
print("p value:%.4f" % p_value)
if p_value <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
    
print(np.round(np.mean(method_A),2), np.round(np.mean(method_B),2), np.round(np.mean(method_C),2))


p value:0.0015
Reject null hypothesis
89.35 89.49 90.49


At this significance level, at least one of the methods has a different performance.

Since the data is not normal, the nonparametric version of the posthoc test is used.

In [30]:
data = np.array([method_A, method_B, method_C]) 
posthoc_df=sp.posthoc_wilcoxon(data, p_adjust="holm")
# posthoc_df = sp.posthoc_nemenyi_friedman(data.T) ## another option for the posthoc test

group_names= ["Method A", "Method B","Method C"]
posthoc_df.columns= group_names
posthoc_df.index= group_names
posthoc_df.style.applymap(lambda x: "background-color:violet" if x<0.05 else "background-color: white")

Unnamed: 0,Method A,Method B,Method C
Method A,1.0,0.078125,0.023438
Method B,0.078125,1.0,0.023438
Method C,0.023438,0.023438,1.0


Method C outperformed others and achieved better accuracy scores than the others.

### ======================================================================================

### Q8. The goodness of Fit

H₀: Gender and risk appetite are independent.

H₁: Gender and risk appetite are dependent.

chi2 test should be used for this question. This test is known as the goodness-of-fit test. It implies that if the observed data are very close to the expected data. The assumption of this test every Ei ≥ 5 (in at least 80% of the cells) is satisfied.

In [31]:
from scipy.stats import chi2_contingency

obs =np.array([[53, 23, 30, 36, 88],[71, 48, 51, 57, 203]])
chi2, p, dof, ex = chi2_contingency(obs, correction=False)

print("expected frequencies:\n ", np.round(ex,2))
print("degrees of freedom:", dof)
print("test stat :%.4f" % chi2)
print("p value:%.4f" % p)

expected frequencies:
  [[ 43.21  24.74  28.23  32.41 101.41]
 [ 80.79  46.26  52.77  60.59 189.59]]
degrees of freedom: 4
test stat :7.0942
p value:0.1310


In [32]:
from scipy.stats import chi2
## calculate critical stat

alpha = 0.01
df = (5-1)*(2-1)
critical_stat = chi2.ppf((1-alpha), df)
print("critical stat:%.4f" % critical_stat)

critical stat:13.2767


Since p value is larger than α=0.01 ( or calculated statistic=7.0942 is smaller than the critical statistic=13.28) >> Fail to Reject H0. At this significance level, it can be concluded that gender and risk appetite are independent.