In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Q2. What are the three types of ANOVA, and in what situations would each be used?

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
# Ans 1:
Assumptions of ANOVA:

1. Independence of observations: Observations within and across groups should be independent of each other. 
                                 Violations typically understate the standard errors, which inflates the Type I error rate. 

    Examples of violations include observations from related individuals or repeated measures from the same individual.

2. Normality: The residuals (equivalently, the data within each group) should be approximately normally distributed. 
              Violations can lead to inaccurate p-values and confidence intervals, particularly in small samples. 
    
    Examples of violations include strongly skewed distributions or extreme outliers.

3. Homogeneity of variances: The variance of the data should be approximately equal across all groups. 
                             Violations can distort the F-test, especially when the group sizes are also unequal. 
    
    Examples of violations include one group with a much wider spread than the others, or outliers concentrated in one group.

4. Absence of extreme outliers: This is not a formal assumption, but outliers distort the mean and variance estimates on which ANOVA relies. 
                        It is important to identify and address outliers before conducting the analysis, using visual inspection (e.g., box plots) or statistical tests such as Grubbs' test or Dixon's Q test.


Violations of these assumptions can impact the validity of ANOVA results, so they should be checked before the results are interpreted. 

If the assumptions are not met, alternatives include transforming the data, Welch's ANOVA (for unequal variances), or the non-parametric Kruskal-Wallis test.
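
The normality and equal-variance assumptions listed above can be screened with standard tests. The following is a minimal sketch, assuming three hypothetical groups of simulated data (the group names, means, and sample sizes are illustrative and not part of the assignment). It uses the Shapiro-Wilk test for normality and Levene's test for homogeneity of variances.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 simulated observations (illustrative only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10.0, scale=2.0, size=30),
    "B": rng.normal(loc=11.0, scale=2.0, size=30),
    "C": rng.normal(loc=12.0, scale=2.0, size=30),
}

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A p-value below the chosen significance level in either test is a signal to inspect the data more closely, not an automatic reason to abandon ANOVA.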

In [None]:
# Ans 2:
There are three types of ANOVA:

1. One-way ANOVA: This is used when there is only one factor or independent variable being studied. 
                  It is used to test for differences in means among two or more groups on a single dependent variable. 
    For example, it could be used to test whether there is a difference in average exam scores between students who received different types of instruction.

2. Repeated measures ANOVA: This is used when each subject is measured multiple times under different conditions. 
                            It is used to test whether there are significant differences between the conditions or across time on a dependent variable. 
    For example, it could be used to test whether there is a significant difference in blood pressure before and after exercise.

3. Factorial ANOVA: This is used when there are two or more independent variables or factors being studied. 
                    It is used to test for the main effects of each factor and the interaction effect between them on a dependent variable. 
    For example, it could be used to test whether there is a difference in exam scores between students who received different types of instruction and who had different levels of motivation.

In [None]:
# Ans 3:
Partitioning of variance is the process of dividing the total variation in a dataset into different sources of variation.
In ANOVA, the variation in the dependent variable is partitioned into different components based on the sources of variation in the independent variables.

'''
Here are some key points to understand about the partitioning of variance in ANOVA:

1. Total variation: The total variation in the dependent variable is the sum of the variation between groups (due to the independent variable) and the variation within groups (due to random error or individual differences).

2. Between-group variation: This is the variation in the dependent variable that is due to differences between the groups or levels of the independent variable. 
                            It is calculated from the squared deviation of each group mean from the grand mean, weighted by the sample size of that group.

3. Within-group variation: This is the variation in the dependent variable that is due to random error or individual differences within each group. 
                           It is calculated from the squared deviations of the individual scores from their own group mean, summed over all groups.

4. Degrees of freedom: With k groups and N observations, the between-group and within-group sums of squares have k - 1 and N - k degrees of freedom. 
                       Dividing each sum of squares by its degrees of freedom gives the mean squares, and their ratio is the F statistic used to judge statistical significance.
'''
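
The partition described above can be checked numerically. The sketch below assumes a small data set of three groups with five observations each (the same values used in Ans 4) and computes each component by hand, then compares the resulting F statistic with `scipy.stats.f_oneway`. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Small example data set: three groups of five observations each
groups = [
    np.array([10, 12, 14, 16, 18], dtype=float),
    np.array([8, 10, 12, 14, 16], dtype=float),
    np.array([6, 8, 10, 12, 14], dtype=float),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group, and within-group sums of squares
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F statistic is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)

print("SS total   =", ss_total)
print("SS between =", ss_between)
print("SS within  =", ss_within)
print("F =", f_stat, " p =", p_value)
```

The between-group and within-group sums of squares add up to the total, which is the identity that the F-test relies on.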

In [3]:
# Ans 4:
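# The total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR)
# of a one-way ANOVA can be computed directly from their definitions in plain Python.
# Note: this notebook follows the question's naming (SSE = explained, SSR = residual);
# some textbooks use the two abbreviations the other way round.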

# Define the data for the one-way ANOVA
group1 = [10, 12, 14, 16, 18]
group2 = [8, 10, 12, 14, 16]
group3 = [6, 8, 10, 12, 14]

# Concatenate the data from all groups into a single array
data = group1 + group2 + group3

# Calculate the total sum of squares (SST)
mean = sum(data) / len(data)
squared_errors = [(x - mean)**2 for x in data]
SST = sum(squared_errors)

# Calculate the explained (between-group) sum of squares (SSE):
# each squared deviation of a group mean from the grand mean is weighted by that group's size
groups = [group1, group2, group3]
group_means = [sum(group) / len(group) for group in groups]
SSE = sum(len(group) * (group_mean - mean)**2 for group, group_mean in zip(groups, group_means))

# Calculate the residual (within-group) sum of squares (SSR)
SSR = SST - SSE

# Print the results
print("SST =", SST)
print("SSE =", SSE)
print("SSR =", SSR)


SST = 160.0
SSE = 40.0
SSR = 120.0


In [12]:
# Ans 5:
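# In a two-way ANOVA, the main effects and the interaction effect can be estimated with the 'statsmodels' module:
# fit a linear model with ols() using a formula that contains both factors and their interaction,
# then pass the fitted model to sm.stats.anova_lm() to obtain the ANOVA table.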

import statsmodels.api as sm
from statsmodels.formula.api import ols

In [5]:
# Ans 6:
With an F-statistic of 5.23 and a p-value of 0.02, the result is statistically significant at the conventional 0.05 level. 
The p-value of 0.02 means that, if all group means were truly equal, an F-statistic at least this large would be observed only about 2% of the time. 

Therefore, we reject the null hypothesis of equal group means and conclude that at least one group mean differs from the others. 
The ANOVA itself does not show which groups differ; a post-hoc test is needed for that.
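
The decision rule can be sketched in code. The question does not give the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions chosen purely for illustration, which is why the computed p-value will not match the 0.02 quoted in the question.

```python
from scipy import stats

f_stat = 5.23      # F-statistic reported in the question
alpha = 0.05       # assumed significance level
df_between = 2     # hypothetical: 3 groups
df_within = 27     # hypothetical: 30 observations in total

# Probability of an F at least this large if the null hypothesis is true
p_value = stats.f.sf(f_stat, df_between, df_within)

# Equivalent decision via the critical value of the F distribution
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
reject_null = f_stat > f_critical

print("p-value:", p_value)
print("critical F:", f_critical)
print("reject null hypothesis:", reject_null)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value always lead to the same decision.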

In [6]:
# Ans 7:
'''
Some ways to handle missing data in a repeated measures ANOVA:

1. Complete case analysis: Only use participants who have data for all time points. 
                           This approach is easy to implement but discards data and may lead to biased results if the data are not missing completely at random.

2. Imputation: Estimate missing data using various imputation methods such as mean imputation or regression imputation. 
               This approach may improve the statistical power of the analysis but can lead to biased results if the imputation model is incorrect.

3. Maximum likelihood estimation: Use statistical software that allows for maximum likelihood estimation, which can include participants with incomplete data. 
                                  This approach can provide unbiased results if the data are missing at random.


>> The potential consequences of using different methods to handle missing data include:

1. Bias: Using complete case analysis or incorrect imputation methods can lead to biased results.
2. Reduced power: Complete case analysis can lead to a reduction in statistical power, as fewer participants are included in the analysis.
3. Misstated precision: Single imputation methods such as mean imputation understate the variability in the data, so standard errors can appear smaller than they should be.
4. Increased complexity: Using more advanced methods such as maximum likelihood estimation can be computationally intensive and require more expertise in statistical analysis.
'''
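
The first two strategies can be compared on a toy example. This is a minimal sketch, assuming a hypothetical wide-format table with one row per subject and one column per time point (all names and values are made up). It shows how complete case analysis drops subjects, while mean imputation keeps them but shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [5.0, 6.0, 5.5, 7.0, 6.5],
    "t2": [5.5, np.nan, 6.0, 7.5, 7.0],
    "t3": [6.0, 7.0, np.nan, 8.0, 7.5],
}).set_index("subject")

# Complete case analysis: keep only subjects observed at every time point
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept (complete case):", len(complete_cases))
print("Subjects kept (mean imputation):", len(mean_imputed))
print("t2 variance, observed values only:", wide["t2"].var())
print("t2 variance, after mean imputation:", mean_imputed["t2"].var())
```

The smaller variance after mean imputation illustrates why single imputation tends to make estimates look more precise than they are.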

In [7]:
# Ans 8:
'''
Post-hoc tests are used after an ANOVA to compare specific pairs of groups and identify which groups differ significantly from each other. 

Here are some common post-hoc tests and when to use them:

1. Tukey's HSD: This test compares all pairs of group means while controlling the family-wise error rate (FWER), 
                which is the probability of making at least one Type I error across all comparisons. 
                It was designed for equal sample sizes (the Tukey-Kramer variant handles unequal sizes) and is a good default when all pairwise comparisons are of interest.

2. Bonferroni: This test also controls the FWER, by dividing the significance level by the number of comparisons. 
               It is generally more conservative than Tukey's HSD when all pairs are compared, so it is most appropriate for a small number of planned comparisons.

3. Scheffe: This is the most conservative of these tests for pairwise comparisons. It is appropriate when complex contrasts (not only pairs of means) are tested, and it can be used with unequal sample sizes.

4. Dunnett's: This test is used to compare each group to a control group. 
              This test is appropriate when there is a control group that serves as a reference point.


An example of a situation where a post-hoc test might be necessary is a study on the effectiveness of four different weight-loss programs. 
  Suppose that an ANOVA shows a significant difference between the means of the four programs. 
  The ANOVA alone does not say which programs differ, so we would then conduct a post-hoc test to identify the pairs of programs that differ significantly and determine which program is most effective.
'''
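
The weight-loss scenario above can be sketched with `pairwise_tukeyhsd` from statsmodels. The data here are simulated under assumed group means (the program labels, means, and sample sizes are hypothetical), so the printed table only illustrates the shape of the output.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical simulated data: four programs, 15 participants each
rng = np.random.default_rng(42)
programs = ["P1", "P2", "P3", "P4"]
true_means = [2.0, 2.5, 3.0, 5.0]  # assumed mean weight loss per program (illustrative)
loss = np.concatenate([rng.normal(m, 1.0, size=15) for m in true_means])
labels = np.repeat(programs, 15)

# Tukey's HSD compares every pair of programs while controlling the FWER at alpha
tukey = pairwise_tukeyhsd(endog=loss, groups=labels, alpha=0.05)
print(tukey.summary())
```

With four groups there are six pairwise comparisons, and the `reject` column marks the pairs whose difference is significant at the chosen FWER.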

In [13]:
# Ans 9:
import pandas as pd
import scipy.stats as stats

# create a dataframe with the weight loss data
data = {'diet': ['A']*16 + ['B']*17 + ['C']*17,
        'weight_loss': [2.3, 1.7, 1.4, 2.8, 3.2, 3.4, 1.1, 1.9, 1.5, 2.7, 2.9, 2.2, 2.1, 1.8, 1.6, 1.9,
                        2.8, 2.2, 2.4, 2.1, 1.7, 1.4, 2.3, 2.5, 3.1, 1.8, 1.3, 2.0, 1.9, 1.6, 1.2, 1.5, 2.2,
                        2.5, 2.9, 1.8, 1.6, 1.7, 2.0, 1.4, 1.1, 2.3, 2.4, 2.7, 1.5, 1.9, 1.4, 2.2, 2.0, 2.3]}
df = pd.DataFrame(data)

# conduct one-way ANOVA
f_stat, p_value = stats.f_oneway(df[df['diet'] == 'A']['weight_loss'],
                                  df[df['diet'] == 'B']['weight_loss'],
                                  df[df['diet'] == 'C']['weight_loss'])

# print results
print('F-statistic:', f_stat)
print('p-value:', p_value)


F-statistic: 0.45930881565134246
p-value: 0.634525470518545

Interpretation: With F ≈ 0.46 and p ≈ 0.63, which is well above 0.05, we fail to reject the null hypothesis. 
These data show no significant difference in mean weight loss among diets A, B, and C.


In [17]:
# Ans 10:
import pandas as pd
import numpy as np

# Create dataframe with random data
np.random.seed(1)
data = {'Program': np.random.choice(['A', 'B', 'C'], size=30),
        'Experience': np.random.choice(['Novice', 'Experienced'], size=30),
        'Time': np.random.normal(loc=10, scale=2, size=30)}
df = pd.DataFrame(data)

# Save dataframe to CSV file
df.to_csv('data.csv', index=False)



In [18]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data from CSV file
df = pd.read_csv('data.csv')

# Fit two-way ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print ANOVA table with F-statistics and p-values
print(anova_table)


                              sum_sq    df         F    PR(>F)
C(Program)                 10.410107   2.0  0.977472  0.390745
C(Experience)               2.747881   1.0  0.516033  0.479475
C(Program):C(Experience)    6.377912   2.0  0.598863  0.557441
Residual                  127.800324  24.0       NaN       NaN

Interpretation: None of the p-values is below 0.05, so there is no significant main effect of Program (p ≈ 0.39), 
no significant main effect of Experience (p ≈ 0.48), and no significant interaction between them (p ≈ 0.56). 
This is expected here, because the data were generated at random without any built-in effect.


In [19]:
# Ans 11:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create sample data
control = pd.Series([80, 75, 85, 90, 65, 72, 68, 88, 92, 78, 
                     70, 84, 76, 83, 77, 71, 81, 86, 79, 73])
experimental = pd.Series([95, 88, 92, 85, 84, 89, 78, 87, 94, 91,
                          90, 86, 83, 82, 79, 81, 80, 84, 88, 89])

# conduct two-sample t-test
t_stat, p_val = ttest_ind(control, experimental)
print("Two-sample t-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_val)

# conduct post-hoc test (Tukey's HSD)
# note: with only two groups there is a single comparison, so this reproduces the t-test result
tukey_results = pairwise_tukeyhsd(np.concatenate((control, experimental)), 
                                  np.concatenate(([0]*len(control), [1]*len(experimental))))
print("\nPost-hoc test (Tukey's HSD) results:")
print(tukey_results)


Two-sample t-test results:
t-statistic: -3.7847176872693105
p-value: 0.0005318706661400818

Post-hoc test (Tukey's HSD) results:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
     0      1      7.6 0.0005 3.5349 11.6651   True
---------------------------------------------------

Interpretation: The t-test is significant (p ≈ 0.0005), so we reject the null hypothesis of equal mean test scores. 
The experimental group scored on average 7.6 points higher than the control group (95% CI 3.53 to 11.67). 
Because there are only two groups, the post-hoc comparison adds no new information beyond the t-test.


In [21]:
# Ans 12:
import pandas as pd

# create the dataframe
data = {'Store': ['A']*30 + ['B']*30 + ['C']*30,
        'Day': list(range(1, 31))*3,
        'Sales': [20, 25, 18, 22, 27, 21, 24, 28, 25, 19, 23, 26, 18, 20, 24, 29, 22, 27, 21, 25, 23, 19, 28, 26, 20, 22, 25, 21, 23, 27]*3}
df = pd.DataFrame(data)

# save the dataframe to a CSV file
df.to_csv('sales_data.csv', index=False)
 

In [26]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# load the data from the CSV file
df = pd.read_csv('sales_data.csv')

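# fit an ANOVA model for the sales data
# NOTE: this is an ordinary least squares ANOVA with Day as a numeric covariate, not a true
# repeated measures ANOVA (which would treat Day as the repeated "subject" factor).
# Because the same 30 sales values are reused for every store, the store means are identical,
# so the Store effect and all pairwise differences below are zero by construction.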
rm = ols('Sales ~ Store + Day + Store:Day', data=df).fit()
table = sm.stats.anova_lm(rm, typ=2)

# print the ANOVA table
print(table)

# perform post-hoc tests using Tukey's HSD method
posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
print(posthoc)


                 sum_sq    df             F    PR(>F)
Store      7.250191e-28   2.0  3.536011e-29  1.000000
Day        1.443737e+01   1.0  1.408258e+00  0.238691
Store:Day  2.531024e-27   2.0  1.234413e-28  1.000000
Residual   8.611626e+02  84.0           NaN       NaN
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
     A      B      0.0   1.0 -1.9532 1.9532  False
     A      C      0.0   1.0 -1.9532 1.9532  False
     B      C      0.0   1.0 -1.9532 1.9532  False
--------------------------------------------------
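
A repeated measures ANOVA in the strict sense can be run with `AnovaRM` from statsmodels, treating each day as the repeated "subject" and Store as the within-subject factor. The sketch below is an illustration under assumptions: the sales figures are simulated with hypothetical store means and a shared day-to-day effect, because the data above are identical across stores and leave no variation to test. No expected output is shown for that reason.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical simulated sales: 30 days, each observed once per store
rng = np.random.default_rng(7)
days = np.arange(1, 31)
day_effect = rng.normal(0.0, 3.0, size=30)        # fluctuation shared by all stores on a day
store_means = {"A": 22.0, "B": 24.0, "C": 23.0}   # assumed store means (illustrative)

rows = []
for store, mu in store_means.items():
    noise = rng.normal(0.0, 2.0, size=30)
    for day, d_eff, eps in zip(days, day_effect, noise):
        rows.append({"Store": store, "Day": int(day), "Sales": mu + d_eff + eps})
sales_demo = pd.DataFrame(rows)

# Day is the repeated unit ("subject"); Store is the within-subject factor
rm_result = AnovaRM(data=sales_demo, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_result.anova_table)
```

`AnovaRM` requires balanced data, that is, exactly one observation for every combination of day and store. If the Store effect is significant, pairwise paired comparisons with a multiple-comparison correction are a common follow-up.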
