In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
Analysis of Variance (ANOVA) relies on the following main assumptions for its results to be valid:
    
1. Normality of sampling distribution of means: The data within each group should be normally 
distributed, resembling a bell-shaped curve with few outliers.

2. Homogeneity of variance : The variance of the data within each group should be equal. 

3. Absence of outliers: Outlying score need to be removed from dataset.

4. Independence : The observations within each group should be independent and random. This means the same person
shouldn't be measured multiple times. 

Violations:
    
1. Non-normality: Violating the normality assumption can increase the chance of a false positive result.
However, ANOVA is generally robust to moderate deviations from normality.
 
2. Unequal variances: Violating the homogeneity of variance assumption can be more impactful, especially
when sample sizes are unequal. This can lead to false positives or negatives.

3. Non-independence: Lack of independence can occur within or between samples. For example, if the 
same person is measured multiple times. 

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
One-way ANOVA: Used to compare one factor with atleast 2 levels and the levels are independent. 
For example, comparing the medication to decrease headache under 3 conditions(10mg, 20mg, 30mg).

Repeated measures ANOVA: Used to compare one factor with atleast 2 levels and the levels are dependent. 
For example, comparing the running finish times of a runner on different weekdays.

Factorial ANOVA: Used to examine 2 or more factors(each of which with atleast 2 levels) and levels 
canbe either dependent and independent.
For example, comparing the race finish times of runners on different weekdays and belonging to different gender.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
An ANOVA uses an F-test to evaluate whether the variance among the groups is greater than the 
variance within a group.

The total variation present in a set of data may be partitioned into a number of non-overlapping
components as per the nature of the classification. The systematic procedure to achieve this is 
called Analysis of Variance (ANOVA). With the help of such a partitioning, some testing of 
hypothesis may be performed.

Partitioning variance in Analysis of Variance (ANOVA) is important because it helps researchers understand
the relationships between different factors and their contributions to the variability of a dependent
variable. This information can help researchers make informed decisions based on their results. 

This partitioning can be done based on the nature of the classification and can be used to perform
hypothesis testing. For example, ANOVA can be used to determine if there are statistically significant
differences in mean test scores across different teaching methods.

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
1. Sum of Squares Total (SST): The sum of squared differences between individual data points (yi) and
the mean of the response variable (y).
SST = Σ(yi – y)2

2. Sum of Squares Error (SSE): The sum of squared differences between predicted data points (ŷi) and 
observed data points (yi).
SSE = Σ(ŷi – yi)2

3. Sum of Squares Regression (SSR): The sum of squared differences between predicted data points (ŷi) and
the mean of the response variable(y).
SSR = Σ(ŷi – ybar)2


In [5]:
# Example:-

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83, 88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

#define response variable
y = df['score']

#define predictor variable
x = df[['hours']]

#add constant to predictor variables, add a column of ones to an array.
x = sm.add_constant(x)

#fit linear regression model using Ordinary Least Squares(OLS)
model = sm.OLS(y, x).fit()

#calculate sse
sse = np.sum((model.fittedvalues - df.score)**2)
print('SSE:',sse)

#calculate ssr
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print('SSR:',ssr)

#calculate sst
sst = ssr + sse
print('SST:',sst)

SSE: 331.07488479262696
SSR: 917.4751152073725
SST: 1248.5499999999995


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
We can calculate the main effects & interaction effects using python.

For example, 

The factors are as follows: 
1. Water: how frequently each plant was watered- daily or weekly
2. Sun: how much sunlight exposure each plant received- low, medium, or high
3. Height: the height of each plant (in inches) after two months

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5, 6, 6, 7, 8, 7, 3, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})    
    
print(df)
model = ols('height ~ C(sun) + C(water) + C(sun):C(water)',data=df).fit()
result = sm.stats.anova_lm(model, type=2)
  
print(result)

     water   sun  height
0    daily   low       6
1    daily   low       6
2    daily   low       6
3    daily   low       5
4    daily   low       6
5    daily   med       5
6    daily   med       5
7    daily   med       6
8    daily   med       4
9    daily   med       5
10   daily  high       6
11   daily  high       6
12   daily  high       7
13   daily  high       8
14   daily  high       7
15  weekly   low       3
16  weekly   low       4
17  weekly   low       4
18  weekly   low       4
19  weekly   low       5
20  weekly   med       4
21  weekly   med       4
22  weekly   med       4
23  weekly   med       4
24  weekly   med       4
25  weekly  high       5
26  weekly  high       6
27  weekly  high       6
28  weekly  high       7
29  weekly  high       8
                   df     sum_sq    mean_sq        F    PR(>F)
C(sun)            2.0  24.866667  12.433333  23.3125  0.000002
C(water)          1.0   8.533333   8.533333  16.0000  0.000527
C(sun):C(water)   2.0   2.466667   1

In [None]:
significance value = 0.05

We can see the following p-values for each of the factors:
water: p-value = 0.000527
sun: p-value = 0.0000002
water*sun: p-value = 0.120667

Main effects: Since the p-values for water and sun are both less than 0.05, this means that both factors have a 
statistically significant effect on plant height.

Interaction effects: Since the p-value for the interaction effect (0.120667) is not less than 0.05, this tells
us that there is no significant interaction effect between sunlight exposure and watering frequency.

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
F-statistic = 5.23 > 1
This means variation between the samples is much greater than the variation within the samples.
A big F(5.23), with a small p-value(0.02), means that the null hypothesis is discredited(refuse to accept), and
we would assert that the means are significantly different.

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
When dealing with missing data, we can use two primary methods to solve the error:

The imputation method: substitutes reasonable guesses for missing data. It’s most useful when 
the percentage of missing data is low. If the portion of missing data is too high, the results
lack natural variation that could result in an effective model.
There are examples of single imputation methods for replacing missing data.

1. Mean, Median and Mode: In cases where there are a small number of missing observations, we can calculate
the mean or median of the existing observations and insert them in place of the missing observations.

2. Time-Series Specific Methods: of imputation assume the adjacent observations will be like the missing data. 

3. Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB): Every missing value
is replaced with either the last observed value or the next one.

4. Linear Interpolation: is often used to approximate a value of some function by using two known values
of that function at other points. This formula can also be understood as a weighted average. 

The other option is to remove data. When dealing with data that is missing at random(the data is 
not missing across all observations but only within sub-samples of the data.), the entire
data point that is missing information can be deleted to help reduce bias. Removing data may not
be the best option if there are not enough observations to result in a reliable analysis. In some 
situations, observation of specific events or factors may be required, even if incomplete.

Consequences: If we use proc glm to perform the analysis, it will omit observations listwise, meaning 
that if any of the observations for a subject are missing, the entire subject will be omitted from the analysis.

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
When a one-way ANOVA test leads to a significant result, it is common to then follow up with post-hoc
tests to see which particular groups are significantly different from each other. Post-hoc tests 
essentially involve carrying out multiple t -tests to test for differences between each pair of categories.

Some common post hoc tests include:
Tukey's HSD: A commonly used pairwise multiple comparison test
Scheffe's F: A commonly used pairwise multiple comparison test
Bonferroni: A simple post hoc test that tightens the criterion for accepting an effect as significant
Dunnett's: A multiple comparison test
Hsu's MCB: Useful when we don't know which group to compare to all the other groups
Tamhane's T2: A multiple comparison test that doesn't assume equal variances
Dunnett's T3: A multiple comparison test that doesn't assume equal variances
Games-Howell: A multiple comparison test that doesn't assume equal variances 

Post hoc tests are used when the null hypothesis of an ANOVA model is rejected. For example, if an ANOVA 
shows that three influencer types have different effectiveness, a post hoc test could reveal that Instagram 
influencers are more effective than TikTok and Facebook influencers.

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [5]:
import numpy as np
import scipy.stats as st

diet_A = np.random.normal(loc=47, scale=5,size = 50)
diet_B = np.random.normal(loc=50, scale=5,size = 50)
diet_C = np.random.normal(loc=55, scale=5,size = 50)

null_hypothesis =  'μA = μB = μC (It implies that the means of all the population are equal)'
alternate_hypothesis = 'as there will be at least one population mean that differs from the rest'
significance_value = 0.05

# Conduct the one-way ANOVA
F_statistic,p_value = st.f_oneway(diet_A,diet_B,diet_C)
print('F-statistic:',F_statistic,'and P-value:',p_value)

if p_value > significance_value:
    print('Accept the null hypothesis',null_hypothesis)
else:
    print('Reject the null hypothesis',alternate_hypothesis)

F-statistic: 26.402206764783106 and P-value: 1.5971692021748323e-10
Reject the null hypothesis as there will be at least one population mean that differs from the rest


In [None]:
Analysis of result: The F statistic and p-value turn out to be equal to 21.575416756474326 and 6.08280571508745e-09
respectively. Since the p-value is less than 0.05 hence we would reject the null hypothesis. This implies that we have
sufficient proof to say that the means of all the population are equal.

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [16]:
import numpy as np
import pandas as pd
import statsmodels.api as sm 
from statsmodels.formula.api import ols 

df = pd.DataFrame({'experience_level': np.random.choice(['Novice', 'Experienced'], size=90),
                   'programs': np.random.choice(['A', 'B', 'C'], size=90),
                   'time': np.random.normal(loc=20, scale=5, size=90)})
print(df)

# Performing two-way ANOVA 
model = ols('time ~ C(experience_level) + C(programs) + C(experience_level):C(programs)', data=df).fit() 
result = sm.stats.anova_lm(model, type=2) 
  
# Print the result 
print(result)

   experience_level programs       time
0            Novice        C  13.292774
1       Experienced        C  29.434141
2            Novice        C  20.919417
3            Novice        A  20.500477
4       Experienced        A  19.548862
..              ...      ...        ...
85      Experienced        A   9.150612
86           Novice        C  11.232653
87           Novice        B  17.275389
88      Experienced        A  24.859113
89           Novice        C  19.845209

[90 rows x 3 columns]
                                   df       sum_sq    mean_sq         F  \
C(experience_level)               1.0    98.772362  98.772362  3.687306   
C(programs)                       2.0     3.540949   1.770474  0.066094   
C(experience_level):C(programs)   2.0    99.247766  49.623883  1.852526   
Residual                         84.0  2250.119610  26.787138       NaN   

                                   PR(>F)  
C(experience_level)              0.058222  
C(programs)                      

In [None]:
Interpretation:

significance value = 0.05

We can see the following p-values for each of the factors:
experience_level: p-value = 0.880069
programs: p-value = 0.566799
experience_level*programs: p-value = 0.000078

Main effects: Since the p-values for experience_level and programs are both greater than 0.05, this means that both 
factors have not statistically significant effect on time taken to complete the task.

Interaction effects: Since the p-value for the interaction effect (0.000078) is less than 0.05, this tells
us that there is significant interaction effect between experience_level and programs.

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [1]:
# H0: μ(traditional teaching method) = μ(new teaching method)
# H1: μ(traditional teaching method) ≠ μ(new teaching method)

import numpy as np
import scipy.stats as st

traditional_method = np.random.normal(loc=80, scale=5,size = 100)
new_method = np.random.normal(loc=70, scale=2, size = 100)

significance_value = 0.05
t_test,p_value = st.ttest_ind(traditional_method, new_method)
print('t_test:',t_test,'p_value:',p_value)

if p_value > significance_value:
    print('accept the null hypothesis as there is no significant difference between the teaching methods')
else:
    print('reject the null hypothesis as there is significant difference between the teaching methods')

t_test: 18.35881164378464 p_value: 1.2920496069794455e-44
reject the null hypothesis as there is significant difference between the teaching methods


In [16]:
#Post-hoc test

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({'score': np.concatenate([traditional_method,new_method]), 'group': ['Traditional']*100 + ['New']*100})
# Perform the Tukey's HSD test
tukey_results = pairwise_tukeyhsd(df['score'], df['group'])
print(tukey_results)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1    group2   meandiff p-adj lower  upper  reject
------------------------------------------------------
   New Traditional  10.1714   0.0 9.0788 11.264   True
------------------------------------------------------


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences
in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine
which store(s) differ significantly from each other.

In [17]:
# H0: μ1 = μ2 = μ3
# H1: at least one of the means is different from the others

import numpy as np
import scipy.stats as stats


store_a=np.random.normal(loc=900, scale=200, size=30)
store_b=np.random.normal(loc=1000, scale=100, size=30)
store_c=np.random.normal(loc=800, scale=100, size=30)

# Conduct the repeated measures ANOVA 
F_statistic, p_value = stats.f_oneway(store_a, store_b, store_c)
print('F-statistic:',F_statistic,'and P-value:',p_value,'\n')

significance_value = 0.05
if p_value < significance_value:
    print('Reject the null hypothesis as there is significant difference in the average daily sales of 3 retail stores.')
else:
    print('Accept the null hypothesis as there is no significant difference in the average daily sales of 3 retail stores.')

F-statistic: 15.169539048126635 and P-value: 2.2301355860658083e-06 

Reject the null hypothesis as there is significant difference in the average daily sales of 3 retail stores.


In [18]:
#Post-hoc test

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({'sales': np.concatenate([store_a,store_b,store_c]), 'group': ['A']*30 + ['B']*30 + ['C']*30})
# Perform the Tukey's HSD test
tukey = pairwise_tukeyhsd(df['sales'], df['group'])
print(tukey)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj    lower     upper   reject
---------------------------------------------------------
     A      B   96.0246 0.0313    7.0159  185.0333   True
     A      C -109.4376 0.0119 -198.4463  -20.4288   True
     B      C -205.4622    0.0 -294.4709 -116.4535   True
---------------------------------------------------------
