# question 1

# Assumptions of ANOVA

Normality: The data within each group should follow a normal distribution. This assumption can be checked using a histogram or a normal probability plot.

Homogeneity of Variance: The variances of each group should be equal. This assumption can be checked using a plot of the group means versus the group standard deviations or by performing a Levene's test.

Independence: The observations within each group should be independent of each other. This assumption is usually met if the groups are randomly sampled.

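Absence of Outliers: ANOVA is sensitive to extreme values, so outliers should be identified and then handled, for example by correcting data-entry errors, transforming the data, or removing the points when removal can be justified.

If these assumptions are violated, the F-test can give misleading results, such as an inflated Type I error rate or reduced power.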
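
The normality and equal-variance checks described above can be sketched with SciPy. The three groups below are made-up illustrative data, and the Shapiro-Wilk test is used here as one possible formal normality check alongside the plots mentioned above; it is an assumption of this sketch, not part of the original answer.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three groups drawn from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10, 12, 11)]

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_pvalues = []
for g in groups:
    _, p = stats.shapiro(g)
    shapiro_pvalues.append(p)

# Homogeneity of variance: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_pvalues, start=1):
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene's test p-value: {levene_p:.3f}")
```

A small p-value in either test would suggest that the corresponding assumption is questionable for the data at hand.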

# question 2

The three types of ANOVA are:

1. One-Way ANOVA: There is one factor with at least two levels, and the levels are independent of each other.

2. Repeated Measures ANOVA: There is one factor with at least two levels, and the levels are dependent on each other (for example, the same subjects are measured under each condition).

3. Factorial ANOVA: There are two or more factors, each with at least two levels. The levels may be independent or dependent.


# question 3

Partitioning of variance in ANOVA refers to the process of dividing the total variance in a data set into different components, each of which can be attributed to a particular source of variation. ANOVA achieves this by comparing the variation within groups to the variation between groups.

The total variation in a data set can be broken down into two main components: variation within groups and variation between groups. Variation within groups is the variation that exists among the observations within each group or treatment. Variation between groups is the variation that exists among the means of the groups or treatments.

By partitioning the variance, ANOVA can determine whether the observed differences between group means are statistically significant or not. This helps researchers to determine whether any differences observed between groups are likely due to chance or whether they are the result of a systematic effect of the independent variable(s) being studied.

Understanding the concept of partitioning of variance is important in ANOVA because it allows researchers to assess the significance of the effects of different factors on the outcome variable. It also provides a way to identify potential sources of variation that may be contributing to differences between groups. This information can be useful in designing experiments and interpreting results.
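
The decomposition described above can be written compactly. The notation is introduced here for illustration and is not from the original answer: $k$ is the number of groups, $n_i$ the size of group $i$, $N$ the total number of observations, $\bar{x}_i$ the mean of group $i$, and $\bar{x}$ the grand mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{SS_{\text{total}}}
= \underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SS_{\text{between}}}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SS_{\text{within}}}

F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```

The F-statistic is the ratio of the two mean squares, so a large value indicates that the group means vary more than the within-group variation alone would suggest.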

# question 4


In [2]:

import numpy as np
from scipy import stats

# Create some sample data
group1 = np.array([4, 7, 6, 8, 5])
group2 = np.array([9, 11, 13, 10, 12])
group3 = np.array([16, 14, 17, 15, 18])

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate the mean of the combined data
grand_mean = np.mean(data)

# Calculate the total sum of squares (SST)
SST = np.sum((data - grand_mean)**2)

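# Calculate the sum of squares explained by the groups (SSE, the between-group sum of squares)
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
n_per_group = len(group1)  # all three groups have the same size
SSE = np.sum((group_means - grand_mean)**2 * n_per_group)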

# Calculate the residual sum of squares (SSR)
SSR = SST - SSE

print("Total sum of squares (SST):", SST)
print("Explained sum of squares (SSE):", SSE)
print("Residual sum of squares (SSR):", SSR)

Total sum of squares (SST): 280.0
Explained sum of squares (SSE): 250.0
Residual sum of squares (SSR): 30.0
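
As a follow-up sketch, the residual sum of squares can also be computed directly from the within-group deviations instead of by subtraction, and the two sums of squares then give the F-statistic. The names `ss_between` and `ss_within` are chosen here for illustration; the data are the same three groups as above.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([4, 7, 6, 8, 5]),
    np.array([9, 11, 13, 10, 12]),
    np.array([16, 14, 17, 15, 18]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the between-group and within-group mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_scipy = stats.f_oneway(*groups)

print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
```

With these data the between-group mean square is 250 / 2 = 125 and the within-group mean square is 30 / 12 = 2.5, so both calculations should give an F-statistic of 50 up to floating-point rounding.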


# question 5

In [14]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate random data
np.random.seed(123)
n = 50
factor_a = np.random.choice(['A1', 'A2'], size=n)
factor_b = np.random.choice(['B1', 'B2', 'B3'], size=n)
response = np.random.normal(loc=10, scale=2, size=n)

# Create a pandas DataFrame
data = pd.DataFrame({'factor_a': factor_a, 'factor_b': factor_b, 'response': response})

# Fit the ANOVA model
model = ols('response ~ C(factor_a) + C(factor_b) + C(factor_a):C(factor_b)', data=data).fit()

# Print the ANOVA table
print(sm.stats.anova_lm(model, typ=2))


                             sum_sq    df         F    PR(>F)
C(factor_a)                0.016209   1.0  0.003944  0.950210
C(factor_b)                2.470619   2.0  0.300575  0.741901
C(factor_a):C(factor_b)    6.946764   2.0  0.845142  0.436349
Residual                 180.832172  44.0       NaN       NaN

Interpretation: none of the p-values is below 0.05, so neither main effect (`factor_a`, `factor_b`) nor their interaction is statistically significant. This is expected, because the response was generated independently of both factors.


# question 6

A one-way ANOVA tests the null hypothesis that there are no differences in means between the groups. The F-statistic is a measure of how much the variance between the group means differs from what we would expect based on the variance within the groups alone. The p-value tells us the probability of observing such a large F-statistic by chance, assuming that the null hypothesis is true.

In this case, the F-statistic of 5.23 means that the variability between the group means is about five times the variability within the groups. The p-value of 0.02 means that if the null hypothesis of no differences between the groups were true, there would be a 2% chance of observing an F-statistic of 5.23 or larger. This is a relatively low probability, so we can reject the null hypothesis at the 5% significance level and conclude that there are significant differences between the groups.

In other words, we can conclude that at least one of the groups has a different mean from the others. The ANOVA itself does not identify which groups differ; a post-hoc test is needed for that.
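
To make the link between the F-statistic and the p-value concrete, the sketch below computes the right-tail probability of the F distribution with SciPy. The question does not state the degrees of freedom, so the design used here (3 groups, 30 observations in total) is purely hypothetical, and the resulting p-value will not necessarily equal the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical design (an assumption, not given in the question):
# k = 3 groups and N = 30 observations in total
k, n_total = 3, 30
df_between = k - 1
df_within = n_total - k

# p-value = P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

alpha = 0.05
reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}, reject H0 at alpha={alpha}: {reject_null}")
```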

# question 7

One approach to handling missing data is to exclude cases with missing data from the analysis, commonly referred to as listwise deletion. However, this approach can reduce the sample size and statistical power of the analysis, potentially leading to biased or inaccurate results. In addition, listwise deletion assumes that the missing data are missing completely at random (MCAR), which may not be a realistic assumption in practice.

Another approach is to impute the missing data, which involves estimating the missing values based on the available data. There are several methods of imputation, such as mean imputation, regression imputation, and multiple imputation. Mean imputation replaces missing values with the mean of the observed values for that variable, while regression imputation uses a regression model to estimate the missing values based on the other variables in the data set. Multiple imputation involves creating several imputed data sets and then analyzing each of them separately before combining the results.

Imputation can improve the precision and validity of the analysis, but it also has limitations. Imputation assumes that the missing data are missing at random (MAR), which means that the probability of missing data depends only on the observed data and not on unobserved data. If the missing data are not MAR, imputation can lead to biased or inaccurate results.
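
The difference between listwise deletion and mean imputation can be sketched with pandas. The small data set below is hypothetical and exists only to show the mechanics; the column names `group` and `response` are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-group data set with two missing responses
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "response": [5.0, np.nan, 7.0, 9.0, 11.0, np.nan],
})

# Listwise deletion: drop every row that has a missing value
listwise = df.dropna()

# Mean imputation: replace each missing value with the overall observed mean
mean_imputed = df.copy()
mean_imputed["response"] = mean_imputed["response"].fillna(df["response"].mean())

print("Rows before and after deletion:", len(df), len(listwise))
print("Imputed responses:", mean_imputed["response"].tolist())
print("Variance after deletion:  ", listwise["response"].var())
print("Variance after imputation:", mean_imputed["response"].var())
```

In this sketch mean imputation keeps all six rows but shrinks the variance of the response, which illustrates one reason why multiple imputation is often preferred when it is feasible.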



# question 8

The most common post-hoc methods are:

Tukey's HSD (Honestly Significant Difference): Tukey's HSD is a widely used post-hoc test that compares all possible pairs of group means and controls the family-wise error rate. This test is appropriate when there are no specific a priori hypotheses about which groups may be different from each other.

Bonferroni correction: The Bonferroni correction is a conservative approach that adjusts the alpha level (i.e., the significance level) for multiple comparisons. It is appropriate when there are specific a priori hypotheses about which groups may be different from each other.

Scheffe's method: Scheffe's method is a conservative post-hoc test that controls the family-wise error rate. It is appropriate when complex contrasts (not only pairwise comparisons) are of interest, and it can be used when the sample sizes are unequal.

Games-Howell test: The Games-Howell test is a post-hoc test that does not assume equal variances or equal sample sizes among the groups. It is appropriate when the assumption of equal variances is violated.

An example of a situation where a post-hoc test might be necessary is in a study comparing the effects of three different treatments (Treatment A, B, and C) on a dependent variable. After conducting a one-way ANOVA, the researcher finds a significant main effect of treatment. To determine which treatments are significantly different from each other, a post-hoc test such as Tukey's HSD or Bonferroni correction can be used. For instance, Tukey's HSD can be used to test all possible pairs of group means, while Bonferroni correction can be used to test specific a priori hypotheses about which groups may be different from each other.
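
The Bonferroni approach from the example above can be sketched as pairwise t-tests evaluated against an adjusted significance level. The treatment data below are randomly generated for illustration only and do not come from a real study.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical outcome data for three treatments (illustrative only)
rng = np.random.default_rng(42)
treatments = {
    "A": rng.normal(loc=10, scale=2, size=20),
    "B": rng.normal(loc=12, scale=2, size=20),
    "C": rng.normal(loc=10, scale=2, size=20),
}

pairs = list(combinations(treatments, 2))
alpha = 0.05
# Bonferroni: divide the significance level by the number of comparisons
adjusted_alpha = alpha / len(pairs)

results = {}
for name1, name2 in pairs:
    t_stat, p_value = stats.ttest_ind(treatments[name1], treatments[name2])
    significant = bool(p_value < adjusted_alpha)
    results[(name1, name2)] = (p_value, significant)
    print(f"{name1} vs {name2}: p = {p_value:.4f}, "
          f"significant at adjusted alpha {adjusted_alpha:.4f}: {significant}")
```

With three pairwise comparisons, each test is judged against 0.05 / 3 rather than 0.05, which keeps the family-wise error rate at or below 5%.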

# question 9

In [4]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# create data
np.random.seed(123)
diet_a = np.random.normal(loc=5, scale=2, size=50)
diet_b = np.random.normal(loc=7, scale=2, size=50)
diet_c = np.random.normal(loc=4, scale=2, size=50)

# create DataFrame
df = pd.DataFrame({
    'diet': ['A']*50 + ['B']*50 + ['C']*50,
    'weight_loss': np.concatenate((diet_a, diet_b, diet_c))
})

# conduct one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# print results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)

# Since the F-statistic is large and the p-value is far below 0.05,
# we reject the null hypothesis that mean weight loss is the same for the three diets
# and conclude that at least one diet differs significantly from the others.

F-statistic:  22.245732108190538
p-value:  3.6294576915534843e-09


# question 10

In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# generate data
np.random.seed(123)
time_data = np.random.normal(loc=10, scale=2, size=90)
software = ['A', 'B', 'C'] * 30
experience = ['novice'] * 45 + ['experienced'] * 45

# create DataFrame
df = pd.DataFrame({
    'time': time_data,
    'software': software,
    'experience': experience
})

# conduct two-way ANOVA
model = ols('time ~ C(software) + C(experience) + C(software):C(experience)', data=df).fit()
table = sm.stats.anova_lm(model, typ=2)

# print results
print(table)

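# If the p-value for an effect is less than 0.05,
# we can reject the null hypothesis and conclude that the effect is statistically significant.
# Here all three p-values (0.188, 0.198, 0.165) exceed 0.05, so neither main effect
# nor the interaction is statistically significant.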

                               sum_sq    df         F    PR(>F)
C(software)                 17.386324   2.0  1.702921  0.188378
C(experience)                8.604884   1.0  1.685628  0.197732
C(software):C(experience)   18.813443   2.0  1.842701  0.164734
Residual                   428.807757  84.0       NaN       NaN


# question 11

In [9]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# generate data
np.random.seed(123)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=80, scale=10, size=50)

# conduct two-sample t-test
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)
print('t-statistic:', t_stat)
print('p-value:', p_value)

# conduct post-hoc test
df = pd.DataFrame({
    'score': np.concatenate([control_scores, experimental_scores]),
    'group': ['control'] * 50 + ['experimental'] * 50
})
posthoc = pairwise_tukeyhsd(df['score'], df['group'])
print(posthoc)


# The p-value is a measure of the strength of evidence against the null hypothesis.
# In this case, the p-value is very small (about 1.8e-05), which means that a difference
# in means this large would be very unlikely if the two groups had the same true mean.
# We therefore reject the null hypothesis and conclude that there is a significant
# difference in test scores between the control and experimental groups.
# With only two groups, Tukey's HSD tests the same single comparison as the t-test.



t-statistic: -4.508893097370603
p-value: 1.8080222016909692e-05
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower   upper  reject
---------------------------------------------------------
control experimental  10.2768   0.0 5.7537 14.7998   True
---------------------------------------------------------


# question 12



In [13]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison
from scipy import stats

# generate sales data (using the code provided in the previous answer)
np.random.seed(123)
sales_A = np.random.normal(50, 10, 30)
sales_B = np.random.normal(60, 15, 30)
sales_C = np.random.normal(70, 12, 30)
data = {'Store': ['A']*30 + ['B']*30 + ['C']*30,
        'Day': list(range(1,31))*3,
        'Sales': np.concatenate((sales_A, sales_B, sales_C))}
sales_data  = pd.DataFrame(data)

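# fit an OLS model with Store, Day, and their interaction using the formula notation
# (an approximation to a repeated measures design; Day is treated as a numeric covariate)
model = ols('Sales ~ Store + Day + Store:Day', data=sales_data).fit()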

# get ANOVA table and print results
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# conduct post-hoc test (Tukey HSD) if the Store effect is significant
if anova_table.loc['Store', 'PR(>F)'] < 0.05:
    print("There is a significant difference between at least one pair of stores")

    # Perform the Tukey's HSD post-hoc test
    mc = MultiComparison(sales_data['Sales'], sales_data['Store'])
    result = mc.tukeyhsd()
    print(result)
else:
    print("There is no significant difference between any pair of stores")


                 sum_sq    df          F    PR(>F)
Store       5305.706697   2.0  13.866440  0.000006
Day         1025.476816   1.0   5.360158  0.023041
Store:Day   1619.464303   2.0   4.232463  0.017729
Residual   16070.431896  84.0        NaN       NaN
There is a significant difference between at least one pair of stores
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B  11.6751 0.0077  2.6451 20.7051   True
     A      C  18.6068    0.0  9.5768 27.6368   True
     B      C   6.9317 0.1658 -2.0983 15.9617  False
----------------------------------------------------
