In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.
ans:
ANOVA (Analysis of Variance) is a statistical method used to test for significant differences between the means of three or more groups. There are several
assumptions that must be met to use ANOVA properly. These assumptions include:

1. Independence: The observations within each group are independent of each other.

2. Normality: The distribution of the dependent variable is normal (i.e., approximately bell-shaped) within each group.

3. Homogeneity of variance: The variance of the dependent variable is equal across all groups.

4. Random sampling: The groups are formed from randomly selected samples from the population.

Violations of these assumptions can impact the validity of the ANOVA results. Here are some examples of how each assumption violation can impact the validity of the results:

Independence: Violations of independence occur when observations are related to one another, either within a group or across groups. This can occur when
participants are matched or when repeated measures are taken on the same participants. Ignoring this dependence makes the standard errors misleading and can increase the
risk of a Type I error (i.e., rejecting the null hypothesis when it is actually true).

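Normality: If the dependent variable is strongly skewed or heavy-tailed within the groups, the F-statistic may not follow the F-distribution, so the reported p-value
can be inaccurate, especially with small samples. The group means themselves remain unbiased; it is the validity of the test that is affected.

Homogeneity of variance: If one group is far more variable than the others (for example, one treatment group with a standard deviation several times larger), the pooled
within-group variance misrepresents the groups. The Type I error rate can then be inflated or deflated, particularly when the group sizes are unequal.

Random sampling: If the groups are convenience samples (for example, volunteers only), the results may not generalize to the population of interest.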
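
The normality and homogeneity-of-variance assumptions listed above can be checked before running ANOVA. The following is a minimal sketch, assuming three hypothetical simulated groups (the names g1, g2, g3 and their parameters are illustrative only) and using scipy's Shapiro-Wilk and Levene tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical groups: g3 is simulated with a much larger spread than g1 and g2
g1 = rng.normal(10, 1, 40)
g2 = rng.normal(10, 1, 40)
g3 = rng.normal(10, 5, 40)

# Shapiro-Wilk test per group: a small p-value suggests a departure from normality
for name, g in [("g1", g1), ("g2", g2), ("g3", g3)]:
    w_stat, p_norm = stats.shapiro(g)
    print(name, "Shapiro-Wilk p =", round(p_norm, 3))

# Levene's test: a small p-value suggests the group variances are not equal
lev_stat, p_lev = stats.levene(g1, g2, g3)
print("Levene p =", p_lev)
```

Here the Levene test should flag the unequal variances, since g3 was deliberately simulated with five times the standard deviation of the other groups.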

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?
ans:
One-Way ANOVA: This type of ANOVA is used when there is one independent variable (or factor) and one dependent variable. It is used to test for differences 
between the means of two or more groups. One-way ANOVA is appropriate when we want to compare the means of multiple groups based on a single factor or independent 
variable. For example, we might use one-way ANOVA to compare the mean test scores of students who studied under different teaching methods (such as online, in-person, or hybrid).

Two-Way ANOVA: This type of ANOVA is used when there are two independent variables (or factors) and one dependent variable. It is used to test for differences between 
the means of two or more groups, and to test for interactions between the two independent variables. Two-way ANOVA is appropriate when we want to compare the means of 
multiple groups based on two different factors or independent variables. For example, we might use two-way ANOVA to compare the mean test scores of students who studied
under different teaching methods (such as online, in-person, or hybrid) and who were from different age groups (such as teenagers and young adults).

MANOVA (Multivariate ANOVA): This extension of ANOVA is used when there are two or more dependent variables and one or more independent variables. It tests whether
the groups differ on the dependent variables considered jointly, taking the correlations among those variables into account. MANOVA is appropriate when we want to
compare multiple groups on several related outcomes at once rather than running a separate ANOVA for each outcome. For example, we might use MANOVA to compare
the mean scores of students who studied under different teaching methods (such as online, in-person, or hybrid) on multiple tests or exams.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
ans:
The partitioning of variance in ANOVA (Analysis of Variance) refers to the process of decomposing the total variance of a set of data into different components or 
sources of variation. ANOVA is a statistical technique used to test the hypothesis that the means of two or more groups are equal, by examining the variation within 
and between the groups.

The total variance of the data is partitioned into two components:

Between-group variance: This component measures the differences in means between the different groups being compared. It indicates how much the means of the groups 
differ from each other.

Within-group variance: This component measures the variability within each group. It indicates how much the individual data points within each group vary from their group mean.

The importance of understanding the partitioning of variance in ANOVA lies in the fact that it allows us to quantify the degree of variation in the data and to identify 
the sources of this variation. By partitioning the variance, we can determine whether the observed differences between the groups are statistically significant or simply due
to chance. This information is critical for making informed decisions and drawing valid conclusions from the data.
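
The decomposition described above can be verified by hand. The following is a minimal sketch using NumPy only, with three small hypothetical groups (the variable names are illustrative); it computes the between-group and within-group sums of squares and forms the F-statistic from them:

```python
import numpy as np

# Hypothetical data: three groups of three observations each
groups = [np.array([3.0, 5.0, 7.0]),
          np.array([4.0, 6.0, 8.0]),
          np.array([5.0, 7.0, 9.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: each observation against its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The F-statistic is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", ss_total)
print("SS between + SS within:", ss_between + ss_within)
print("F:", f_stat)
```

For these values the between-group and within-group pieces are 6 and 24, which add up to the total of 30, and the F-statistic is 0.75.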

In [15]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?
# ans:
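# To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way
# ANOVA using Python, we can use the statsmodels library. Here is an example. (Many textbooks swap these labels, using SSR for the
# regression/explained sum and SSE for the error sum; the names below follow the question.)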
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'y': [3, 5, 7, 4, 6, 8, 5, 7, 9]}
data = pd.DataFrame(data)

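# Fit the one-way ANOVA model; C(group) treats group as a categorical factor
# (a bare 'group' would fit a straight-line regression on the numeric codes instead)
model = ols('y ~ C(group)', data=data).fit()

# Calculate the total sum of squares (SST)
sst = ((data['y'] - data['y'].mean()) ** 2).sum()

# Calculate the explained sum of squares (SSE): fitted group means vs. the grand mean
sse = ((model.fittedvalues - data['y'].mean()) ** 2).sum()

# Calculate the residual sum of squares (SSR): observations vs. their fitted group means
ssr = ((data['y'] - model.fittedvalues) ** 2).sum()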

# Print the results
print("Total sum of squares (SST): ", sst)
print("Explained sum of squares (SSE): ", sse)
print("Residual sum of squares (SSR): ", ssr)

Total sum of squares (SST):  30.0
Explained sum of squares (SSE):  5.999999999999995
Residual sum of squares (SSR):  24.0


In [13]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
# ans:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate example data
data = {'x1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'x2': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'y': [3, 5, 7, 4, 6, 8, 5, 7, 9]}
data = pd.DataFrame(data)

# Fit the two-way ANOVA model with interaction terms
model = ols('y ~ C(x1) + C(x2) + C(x1):C(x2)', data=data).fit()

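# With one observation per cell this model is saturated (no residual degrees of freedom),
# so the effects can be read from the coefficients but cannot be F-tested without replicates.
# Select coefficients by name: interaction terms contain ':', main-effect terms do not.
# Each coefficient is a contrast against the reference levels x1 = 1 and x2 = 1.
effects = model.params.drop('Intercept')
main_effects = effects[[':' not in name for name in effects.index]]
interaction_effects = effects[[':' in name for name in effects.index]]

# Print the results
for name, value in main_effects.items():
    print(f"Main effect {name}: {value:.2f}")
print(f"Largest |interaction effect|: {interaction_effects.abs().max():.2f}")


Main effect C(x1)[T.2]: 1.00
Main effect C(x1)[T.3]: 2.00
Main effect C(x2)[T.2]: 2.00
Main effect C(x2)[T.3]: 4.00
Largest |interaction effect|: 0.00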


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?
ans:
If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there is a statistically significant difference 
between at least two of the groups being compared.
The p-value of 0.02 indicates the probability of obtaining an F-statistic as extreme as the observed one (or more extreme) under the null hypothesis, which assumes 
that there are no differences between the groups. In this case, the p-value is less than the conventional significance level of 0.05, which means that we can reject the 
null hypothesis and conclude that there are significant differences between the groups.

To interpret these results, we could perform post-hoc tests, such as Tukey's HSD test, to determine which specific groups differ from each other. Alternatively, if we had 
a priori hypotheses about which groups would differ, we could perform planned comparisons to test those hypotheses. 
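
The link between the F-statistic and the p-value can be sketched with scipy's F-distribution. The question does not give the degrees of freedom, so the values below (3 groups, 30 observations in total) are assumptions made purely for illustration; the resulting p-value will therefore not match the 0.02 quoted in the question exactly:

```python
from scipy import stats

f_observed = 5.23
# Hypothetical degrees of freedom (not given in the question): 3 groups, 30 observations
df_between = 2
df_within = 27

# Survival function: P(F >= f_observed) when the null hypothesis is true
p_value = stats.f.sf(f_observed, df_between, df_within)
# Critical value that the F-statistic must exceed at alpha = 0.05
f_critical = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("critical F at alpha = 0.05:", f_critical)
print("reject H0:", f_observed > f_critical)
```

Comparing the F-statistic with the critical value and comparing the p-value with the significance level are two views of the same decision.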


In [18]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?
ans:
In a repeated measures ANOVA, missing data can arise due to a variety of reasons, such as participant dropout, technical errors, or incomplete data collection. 
Handling missing data is important because it can affect the statistical power of the analysis and the accuracy of the estimates.

There are several methods for handling missing data in a repeated measures ANOVA, such as:

1. Complete case analysis: This method includes only participants who have complete data for all measurement occasions. It is simple, but it can result in a loss of
statistical power and in biased estimates if the data are not missing completely at random.

2. Pairwise deletion: This method uses every available observation for each individual comparison, so a participant contributes to the comparisons for which they have data.
This retains more data than complete case analysis, but different comparisons are then based on different subsets of participants, and the estimates can still be biased if
the data are not missing completely at random.

3. Mean imputation: This method involves replacing the missing values with the mean value of the non-missing data for that variable. However, this artificially reduces the
variability of the data, which leads to underestimated standard errors and can bias the estimates.

4. Multiple imputation: This method involves estimating the missing data using statistical models and creating multiple datasets with plausible values for the missing data.
The analysis is then performed on each dataset, and the results are combined to produce an overall estimate. This can reduce bias and properly reflects the uncertainty
caused by the missing values, but it can be computationally intensive and requires assumptions about the missing data mechanism.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA can vary depending on the nature and extent of the missing data. 
Generally, the more missing data there is, the greater the potential impact on the results.
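
The difference between two of these strategies can be seen on a small example. The following is a minimal sketch with pandas, assuming hypothetical wide-format data (one row per participant, one column per time point; the column names t1, t2, t3 are illustrative). It contrasts complete case analysis with mean imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 14.0, 10.0],
    "t3": [12.0, 14.0, np.nan, 15.0, 11.0],
})

# Complete case analysis: drop every participant with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: fill each column's gaps with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Participants kept (complete case):", len(complete_cases))
print("t2 variance, observed values only:", scores["t2"].var())
print("t2 variance after mean imputation:", mean_imputed["t2"].var())
```

Complete case analysis keeps only three of the five participants, while mean imputation keeps all five but shrinks the variance of t2, which is why it tends to understate standard errors.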

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.
ans:
Some common post-hoc tests used after ANOVA include Tukey's Honestly Significant Difference (HSD) test, Bonferroni correction, Scheffé's method, and 
Fisher's Least Significant Difference (LSD) test.

1. Tukey's HSD test compares every pair of group means while controlling the family-wise error rate. It is appropriate when all pairwise comparisons are of interest,
the group sizes are equal or nearly equal (the Tukey-Kramer variant handles unequal sizes), and the variances of the groups are approximately equal.

2. The Bonferroni correction adjusts the p-values (or the significance level) for the number of comparisons to control the family-wise error rate. It is a conservative
method that works best when the number of planned comparisons is small.

3. Scheffé's method controls the family-wise error rate across all possible contrasts, including complex ones such as one group against the average of two others. This
makes it more conservative than Tukey's HSD test for simple pairwise comparisons.

4. Fisher's LSD test performs ordinary pairwise t-tests using the pooled error variance, and only after a significant overall F-test. It is more powerful than Tukey's HSD
test but does not control the family-wise error rate once there are more than three groups. It still assumes equal variances; when the variances are unequal, the
Games-Howell test is a common alternative.

A post-hoc test is necessary when we conduct an ANOVA and obtain a significant F-statistic, indicating that there are significant differences between the groups. For
example, if a one-way ANOVA comparing the mean weight loss of three diets is significant, the F-test alone does not tell us which diets differ; a post-hoc test such as
Tukey's HSD compares each pair of diets to find out.
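
One of the corrections named above, Bonferroni, is simple enough to apply by hand. The following is a minimal sketch, assuming three hypothetical simulated groups labelled A, B, and C (group B is simulated with a higher mean); it runs pairwise t-tests with scipy and multiplies each raw p-value by the number of comparisons:

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical groups: B is simulated with a clearly higher mean than A and C
samples = {
    "A": rng.normal(5.0, 1.0, 30),
    "B": rng.normal(8.0, 1.0, 30),
    "C": rng.normal(5.0, 1.0, 30),
}

pairs = list(combinations(samples, 2))
n_comparisons = len(pairs)

adjusted = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_ind(samples[a], samples[b])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    adjusted[(a, b)] = min(p_raw * n_comparisons, 1.0)

for pair, p_adj in adjusted.items():
    print(pair, "adjusted p =", p_adj)
```

A pair is declared different only if its adjusted p-value falls below the chosen significance level, which keeps the overall chance of a false positive across all three comparisons at or below that level.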

In [19]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.
# ans:
import numpy as np
from scipy.stats import f_oneway

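# generate some sample data (simulated for illustration; this draws 50 observations per diet, 150 in total,
# whereas the question describes 50 participants overall)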
np.random.seed(123)
diet_A = np.random.normal(5, 1, 50)  # mean weight loss of 5 kg, SD of 1 kg
diet_B = np.random.normal(6, 1, 50)  # mean weight loss of 6 kg, SD of 1 kg
diet_C = np.random.normal(4, 1, 50)  # mean weight loss of 4 kg, SD of 1 kg

# conduct one-way ANOVA
f_stat, p_value = f_oneway(diet_A, diet_B, diet_C)

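# print the results
print("F-statistic =", f_stat)
print("p-value =", p_value)

# Interpretation: with an F-statistic of about 38.18 and a p-value of about 4.4e-14 (far below 0.05), we reject the null
# hypothesis that the three diets have equal mean weight loss. At least one diet differs from the others; a post-hoc
# test such as Tukey's HSD would show which pairs differ.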

F-statistic = 38.1814612681822
p-value = 4.4208876104953276e-14


In [25]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# generate some sample data (simulated; n = 30 per cell gives 180 rows, whereas the question describes 30 employees)
np.random.seed(123)
software = ["A", "B", "C"]
experience = ["Novice", "Experienced"]
n = 30
rows = []
for i in range(n):
    for program in software:
        for level in experience:
            # novices are simulated as slower on average than experienced employees
            base = 10 if level == "Novice" else 8
            mu = base + np.random.normal(0, 2)
            time = mu + np.random.normal(0, 1)
            rows.append({"Time": time, "Software": program, "Experience": level})

# build the DataFrame once from the collected rows (DataFrame.append was removed in pandas 2.0)
data = pd.DataFrame(rows)

# conduct two-way ANOVA
model = ols("Time ~ C(Software) + C(Experience) + C(Software):C(Experience)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

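# print the results
print(anova_table)

# Interpretation of the printed table: experience level has a significant main effect (F = 34.55, p = 2.07e-08), while the
# main effect of software (F = 0.53, p = 0.59) and the software-by-experience interaction (F = 0.83, p = 0.44) are not
# significant at the 0.05 level.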

                               sum_sq     df          F        PR(>F)
C(Software)                  5.390342    2.0   0.525086  5.924387e-01
C(Experience)              177.347273    1.0  34.551635  2.069970e-08
C(Software):C(Experience)    8.470111    2.0   0.825094  4.399010e-01
Residual                   893.110418  174.0        NaN           NaN


In [30]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# generate some example data (simulated; 100 scores per group, whereas the question describes 100 students in total)
np.random.seed(1)
control_scores = np.random.normal(70, 10, size=100)
experimental_scores = np.random.normal(75, 10, size=100)

# conduct two-sample t-test
t_stat, p_value = ttest_ind(control_scores, experimental_scores)
print("t-statistic:", t_stat)
print("p-value:", p_value)

# The t-statistic tells us how many standard errors the difference between the two group means is away from zero. In this case, the t-statistic is negative, which means
# that the control group has a lower mean score than the experimental group. The p-value is the probability of observing a t-statistic at least this extreme if there were no
# difference between the groups. Since the p-value is less than 0.05 (assuming a significance level of 0.05), we can conclude that there is a significant difference in test
# scores between the control and experimental groups.
# With only two groups, a significant t-test already identifies which groups differ, so no separate post-hoc test is needed; post-hoc tests matter when three or more groups
# are compared. The call below simply repeats the same control vs. experimental comparison, which is why it prints identical values:

t_stat, p_value = ttest_ind(control_scores, experimental_scores)
print('Control vs Experimental')
print("t-statistic:", t_stat)
print("p-value:", p_value)


t-statistic: -4.584315463985094
p-value: 8.059088190829134e-06
Control vs Experimental
t-statistic: -4.584315463985094
p-value: 8.059088190829134e-06


In [31]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a post-
# hoc test to determine which store(s) differ significantly from each other.

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd


np.random.seed(123)
store_a_sales = np.random.normal(loc=1000, scale=100, size=30)
store_b_sales = np.random.normal(loc=900, scale=120, size=30)
store_c_sales = np.random.normal(loc=1100, scale=80, size=30)

sales_data = np.concatenate([store_a_sales, store_b_sales, store_c_sales])
store_labels = np.repeat(['A', 'B', 'C'], 30)
day_labels = np.tile(np.arange(30), 3)

data = pd.DataFrame({'sales': sales_data, 'store': store_labels, 'day': day_labels})

# Repeated measures ANOVA: each day is the repeated "subject", measured once under every store,
# so store is the within-subject factor. (Passing subject='store' and within=['day'] instead
# would test for a day effect rather than a store effect.)
anova = AnovaRM(data, depvar='sales', subject='day', within=['store']).fit()
print(anova.summary())

# If the results of the ANOVA are significant, we can follow up with a post-hoc test, such as Tukey's HSD, to determine which store(s) differ significantly from each other.
# Note that Tukey's HSD treats the observations as independent and ignores the pairing by day:

tukey_results = pairwise_tukeyhsd(data['sales'], data['store'])
print(tukey_results)


  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
     A      B -87.4932 0.0161 -161.3721 -13.6142   True
     A      C  89.2216 0.0137   15.3427 163.1006   True
     B      C 176.7148    0.0  102.8359 250.5938   True
-------------------------------------------------------
