# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical method used to compare means across multiple groups or conditions. Its results are valid only if the following assumptions hold:

Independence: The observations within each group or condition should be independent of each other.

Normality: The distribution of the dependent variable (the variable being measured or compared) should be approximately normal within each group or condition.

Homogeneity of variances: The variability of the dependent variable should be roughly equal across all groups or conditions.

Interval or ratio-level data: ANOVA assumes that the dependent variable is measured on an interval or ratio scale.

Violations of these assumptions can impact the validity of ANOVA results. Some examples of violations and their consequences are:

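Violation of independence: If observations within groups are not independent, for example when the same subject is measured repeatedly or when the data have a hierarchical structure (such as students nested within classrooms), the independence assumption is violated. This can lead to biased estimates of the group differences and incorrect p-values.

Violation of normality: If the dependent variable is strongly skewed or contains extreme outliers within groups (for example, income or reaction-time data), the F-test can become unreliable, particularly when samples are small or unequal in size.

Violation of homogeneity of variances: If one group is much more variable than the others, the F-test can become too liberal or too conservative, especially when group sizes are unequal.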
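The normality and equal-variance assumptions can be screened before running the ANOVA. The following is a minimal sketch, assuming three hypothetical groups with made-up values; it uses the Shapiro-Wilk test (`scipy.stats.shapiro`) per group and Levene's test (`scipy.stats.levene`) across groups. Independence cannot be tested this way and has to be judged from the study design.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1]
group_b = [5.9, 6.2, 5.5, 6.8, 6.1, 5.7]
group_c = [4.9, 5.4, 5.2, 6.0, 5.6, 5.1]
groups = {"A": group_a, "B": group_b, "C": group_c}

# Shapiro-Wilk test per group: a small p-value suggests non-normality
for name, values in groups.items():
    w_stat, p_normal = stats.shapiro(values)
    print(f"Group {name}: W = {w_stat:.3f}, p = {p_normal:.3f}")

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {levene_stat:.3f}, p = {p_levene:.3f}")
```

With samples this small, both tests have little power, so a non-significant result should not be read as proof that the assumptions hold.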

# Q2. What are the three types of ANOVA, and in what situations would each be used?

There are three main types of ANOVA:

One-Way ANOVA: Used when you have one independent variable (also known as a factor) and want to compare the mean of a dependent variable across its levels or groups, typically three or more.

Two-Way ANOVA: Used when you have two independent variables (factors) and want to examine the main effect of each factor as well as their interaction effect on a dependent variable.

Repeated Measures ANOVA: Used when you have a within-subjects design, where the same participants are measured under different conditions or at multiple time points.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

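The partitioning of variance in ANOVA refers to the division of the total variation in a data set into components attributable to different sources. In a one-way ANOVA, the total variation splits into variation between groups (explained by the factor) and variation within groups (residual, or error).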

Understanding the partitioning of variance is important for several reasons:

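(1). Identifying significant differences: Comparing between-group variation with within-group variation (the F-ratio) shows whether group means differ by more than chance alone would explain.

(2). Assessing the effect size: The share of total variation explained by the factor (eta squared) indicates how large the effect is.

(3). Decision-making: Knowing which sources account for most of the variation helps decide which factors matter in practice.

(4). Experimental design: A large residual component suggests where blocking, additional factors, or larger samples could improve future studies.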
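As a compact statement of the partition, consider a one-way design with $k$ groups, $n_i$ observations in group $i$, and $N$ observations in total. The abbreviations below follow this notebook's usage (SSE for the explained and SSR for the residual sum of squares); many textbooks swap these two labels.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{\text{SST}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{\text{SSE (explained)}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{\text{SSR (residual)}}
$$

$$
F = \frac{\text{SSE}/(k-1)}{\text{SSR}/(N-k)}, \qquad \eta^2 = \frac{\text{SSE}}{\text{SST}}
$$

Here $\bar{y}_i$ is the mean of group $i$ and $\bar{y}$ is the overall mean.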

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
# Sample data for each group
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]

# Concatenate the data from all groups
data = group1 + group2 + group3

# Calculate the overall mean
overall_mean = sum(data) / len(data)

# Calculate the total sum of squares (SST)
sst = sum((x - overall_mean) ** 2 for x in data)

# Calculate the explained (between-group) sum of squares (SSE)
groups = [group1, group2, group3]
group_means = [sum(group) / len(group) for group in groups]
sse = sum(len(group) * (mean - overall_mean) ** 2 for group, mean in zip(groups, group_means))

# Calculate the residual (within-group) sum of squares (SSR) as the remainder
ssr = sst - sse

# Print the results
print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)


SST: 230.0
SSE: 90.0
SSR: 140.0
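The sums of squares are the ingredients of the F-statistic. As a rough cross-check, the sketch below recomputes the explained and residual sums of squares for the same three groups, forms F by hand, and compares it with `scipy.stats.f_oneway`. The variable names `explained` and `residual` are chosen here for readability and are not part of the original cell.

```python
from scipy import stats

group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]
groups = [group1, group2, group3]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
overall_mean = sum(sum(g) for g in groups) / n_total

# Explained (between-group) and residual (within-group) sums of squares
explained = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups)
residual = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# F is the ratio of the two mean squares
f_manual = (explained / (k - 1)) / (residual / (n_total - k))
f_scipy, p_value = stats.f_oneway(group1, group2, group3)

print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
print("p-value:   ", p_value)
```

Computing the residual term directly, rather than as SST minus SSE, also confirms that the two components add up to the total.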


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas DataFrame with your data
data = {
    'A': [10, 12, 8, 15, 9, 11],
    'B': [20, 22, 18, 25, 19, 21],
    'Y': [35, 40, 30, 45, 33, 38]
}

df = pd.DataFrame(data)

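# Fit the model (A and B are numeric, so they enter as continuous predictors;
# wrap a column in C(...) in the formula to treat it as a categorical factor)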
model = ols('Y ~ A + B + A:B', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Separate the effect rows (A, B, A:B) from the residual row
effects = anova_table[['sum_sq', 'df', 'mean_sq', 'F']].iloc[:-1]
residual = anova_table[['sum_sq', 'df', 'mean_sq', 'F']].iloc[-1]

# Print the results
print("Main and Interaction Effects:")
print(effects)
print("\nResidual:")
print(residual)


Main and Interaction Effects:
         sum_sq   df     mean_sq            F
A    140.563063  1.0  140.563063  1650.384643
B      0.005940  1.0    0.005940     0.069743
A:B    2.262411  1.0    2.262411    26.563514

Residual:
sum_sq     0.25551
df         3.00000
mean_sq    0.08517
F              NaN
Name: Residual, dtype: float64


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?


In the given scenario, conducting a one-way ANOVA resulted in an F-statistic of 5.23 and a p-value of 0.02. Based on these results, we can draw the following conclusions:

Differences between the groups: Because the p-value (0.02) is below the conventional 0.05 significance level, the F-statistic of 5.23 indicates a statistically significant difference: at least one group mean differs from the others. The ANOVA alone does not show which groups differ; a post-hoc test is needed for that.

Interpretation of the results: The p-value of 0.02 is the probability of observing an F-statistic of 5.23 or larger if the null hypothesis of no differences between the group means were true. If the p-value is below a predetermined significance level (such as 0.05), the result is considered statistically significant.

Therefore, at the 0.05 level we reject the null hypothesis that all group means are equal. Note that the same result would not be significant at the stricter 0.01 level.
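The link between the F-statistic and its p-value can be illustrated with the F distribution in `scipy.stats`. The question does not state the degrees of freedom, so the values below (3 groups, 18 observations) are purely hypothetical and were picked because they give a p-value of roughly 0.02.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> 2 between-group df; 18 observations -> 15 within-group df
df_between = 2
df_within = 15

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that F must exceed to be significant at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha={alpha}: {f_critical:.2f}")
print("Reject H0" if f_statistic > f_critical else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two equivalent ways of reaching the same decision.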

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In a repeated measures ANOVA, missing data can present challenges in terms of analysis and interpretation. There are different methods to handle missing data, each with its own implications.

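Listwise deletion: This method removes any participant with missing data from the analysis. It is simple, but it reduces the sample size and statistical power, and it can bias the results unless the data are missing completely at random.

Pairwise deletion: In this approach, only the cases with complete data for each pair of variables are used for analysis. It retains more data than listwise deletion, but different comparisons then rest on different subsets of participants, which can produce inconsistent results.

Mean substitution: This method replaces missing values with the mean value of the variable. It preserves the sample size but artificially reduces variability, which distorts standard errors and significance tests.

Multiple imputation: Multiple imputation creates several plausible values for each missing entry based on observed values and known relationships, analyzes each completed data set, and pools the results. It reflects the uncertainty caused by missingness and generally gives less biased estimates when data are missing at random, at the cost of added complexity.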
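The trade-off between the two simplest methods can be seen on a small example. This is a minimal sketch with hypothetical wide-format data (one row per participant, one column per time point); the column names `t1` to `t3` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 9.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 10.0, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 13.0, 15.0],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

# Listwise deletion: drop every participant with at least one missing value
listwise = scores.dropna()

# Mean substitution: replace each missing value with that column's mean
mean_filled = scores.fillna(scores.mean())

print("Participants kept (listwise):", len(listwise))
print("Std. dev. of t2, observed values:", scores["t2"].std())
print("Std. dev. of t2, after mean fill:", mean_filled["t2"].std())
```

Listwise deletion loses two of the five participants here, while mean substitution keeps everyone but shrinks the spread of `t2`, which is the variance distortion described above.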

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


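After an ANOVA yields a significant result, indicating that at least one group mean differs from the others, post-hoc tests are used to determine which specific group means differ. Common post-hoc tests include:

Tukey's Honestly Significant Difference (HSD) Test: Compares all possible pairs of group means while controlling the family-wise error rate. It is a common default when all pairwise comparisons are of interest.

Bonferroni Correction: Divides the significance level by the number of comparisons. It is simple and general, but conservative when many comparisons are made.

Scheffé's Test: Covers all possible contrasts, not only pairwise ones. It is the most conservative of these options and suits complex or unplanned comparisons.

Dunnett's Test: Compares each treatment group against a single control group.

Example: If a one-way ANOVA comparing the mean weight loss of three diets is significant, a post-hoc test such as Tukey's HSD is needed to identify which pairs of diets differ.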
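A Bonferroni correction is simple enough to apply by hand. The sketch below runs pairwise two-sample t-tests with `scipy.stats.ttest_ind` on three hypothetical groups (the values are illustrative only) and multiplies each raw p-value by the number of comparisons, which is equivalent to dividing the significance level.

```python
from itertools import combinations
from scipy import stats

# Hypothetical outcome values for three groups (illustrative only)
samples = {
    "A": [5.1, 4.8, 5.6, 5.0, 5.3],
    "B": [6.2, 6.0, 6.5, 5.9, 6.4],
    "C": [5.2, 5.5, 4.9, 5.4, 5.1],
}

pairs = list(combinations(samples, 2))  # (A, B), (A, C), (B, C)
adjusted = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(samples[first], samples[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)

for (first, second), p_adj in adjusted.items():
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f}")
```

An adjusted p-value below 0.05 then flags a pair as significantly different while keeping the family-wise error rate at or below 5%.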

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [None]:
import scipy.stats as stats

# Weight loss data for the three diets
diet_A = [1.2, 2.1, 0.8, 1.5, 1.9]  # Replace with actual data
diet_B = [1.8, 1.3, 1.0, 2.2, 1.6]  # Replace with actual data
diet_C = [0.9, 1.5, 1.7, 1.2, 0.5]  # Replace with actual data

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)


F-Statistic: 1.0433566433566435
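p-value: 0.3821460936215956

For the placeholder data above, F ≈ 1.04 and p ≈ 0.38. Since the p-value exceeds 0.05, we fail to reject the null hypothesis: these data provide no evidence that mean weight loss differs between the three diets.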


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data
data = pd.DataFrame({
    'Program': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice', 'Experienced'] * 15,
    'Time': [10, 12, 9, 11, 14, 13, 15, 16, 13, 12, 11, 10, 9, 10, 12, 11, 13, 11, 14, 13,
             15, 16, 13, 12, 11, 10, 9, 10, 12, 11]
})

# Perform the two-way ANOVA
model = ols('Time ~ Program * Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)


                        sum_sq    df         F    PR(>F)
Program               4.066667   2.0  0.438849  0.649849
Experience            0.133333   1.0  0.028777  0.866717
Program:Experience    0.466667   2.0  0.050360  0.950988
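Residual            111.200000  24.0       NaN       NaN

None of the effects is statistically significant at the 0.05 level: Program (F ≈ 0.44, p ≈ 0.65), Experience (F ≈ 0.03, p ≈ 0.87), and the Program × Experience interaction (F ≈ 0.05, p ≈ 0.95). For these sample data, there is no evidence that completion time depends on the software program, the experience level, or their combination.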
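A table of cell means often makes main effects and interactions easier to read than the ANOVA table alone. The sketch below rebuilds the same sample data and summarizes it with `groupby` and `unstack`; the names `cell_means` and `cell_counts` are introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the two-way ANOVA cell
data = pd.DataFrame({
    'Program': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice', 'Experienced'] * 15,
    'Time': [10, 12, 9, 11, 14, 13, 15, 16, 13, 12, 11, 10, 9, 10, 12, 11, 13, 11, 14, 13,
             15, 16, 13, 12, 11, 10, 9, 10, 12, 11]
})

# Mean completion time for every Program x Experience combination
cell_means = data.groupby(['Program', 'Experience'])['Time'].mean().unstack()

# Number of observations per cell (equal counts mean a balanced design)
cell_counts = data.groupby(['Program', 'Experience'])['Time'].count().unstack()

print(cell_means)
print(cell_counts)
```

Roughly parallel rows in the means table are what the absence of an interaction looks like; rows that cross or diverge would point to one.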


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.


In [5]:
from scipy import stats

# Test scores for the control group
control_scores = [75, 80, 85, 78, 92, 87, 83, 79, 88, 85]

# Test scores for the experimental group
experimental_scores = [80, 82, 88, 78, 95, 90, 84, 76, 89, 86]

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)

print("Two-sample t-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)



Two-sample t-test results:
t-statistic: -0.6418856341919262
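p-value: 0.5290379211880494

For these sample scores, p ≈ 0.53 is well above 0.05, so we fail to reject the null hypothesis: there is no evidence that the new teaching method changes test scores. No follow-up test is needed. With only two groups, a post-hoc test would not be required even for a significant result, because the t-test already compares the only pair of groups.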
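With exactly two groups, a one-way ANOVA and the equal-variance two-sample t-test are the same test: the F-statistic equals the squared t-statistic and the p-values coincide. The sketch below checks this on the same sample scores.

```python
from scipy import stats

control_scores = [75, 80, 85, 78, 92, 87, 83, 79, 88, 85]
experimental_scores = [80, 82, 88, 78, 95, 90, 84, 76, 89, 86]

# Equal-variance two-sample t-test (the default in ttest_ind)
t_stat, p_t = stats.ttest_ind(control_scores, experimental_scores)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(control_scores, experimental_scores)

print("t squared: ", t_stat ** 2)
print("F:         ", f_stat)
print("p (t-test):", p_t)
print("p (ANOVA): ", p_f)
```

This is why a post-hoc step adds nothing in the two-group case: there is only one pairwise comparison, and it has already been tested.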


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [12]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a dataframe with sales data
data = {
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Sales': np.random.randint(100, 1000, size=90)  # Unseeded random placeholder; replace with your actual sales data
}
df = pd.DataFrame(data)

# Convert Store column to categorical variable
df['Store'] = df['Store'].astype('category')

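# Perform a one-way ANOVA on Store (note: this treats all 90 observations as
# independent and does not model the repeated measurements on the same days)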
model = ols('Sales ~ Store', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Perform post hoc test (Tukey's HSD)
posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
print(posthoc)


                sum_sq    df         F    PR(>F)
Store     7.863976e+04   2.0  0.686886  0.505848
Residual  4.980197e+06  87.0       NaN       NaN
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
     A      B    -52.5 0.6732  -199.803   94.803  False
     A      C  16.9333 0.9594 -130.3697 164.2363  False
     B      C  69.4333  0.502  -77.8697 216.7363  False
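-------------------------------------------------------

For this random placeholder data, the store effect is not significant (F ≈ 0.69, p ≈ 0.51), and Tukey's HSD rejects none of the pairwise comparisons. Because the data are generated without a fixed seed, these numbers change on every run.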
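Because the three stores are observed on the same 30 days, the day can be treated as the repeated "subject" and the store as a within-subject factor. The following is a minimal sketch using `AnovaRM` from `statsmodels`; the `Day` column, the seeded random generator, and the sales values are assumptions added for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per (day, store) pair
rng = np.random.default_rng(0)  # seeded so the placeholder data is reproducible
days = np.repeat(np.arange(1, 31), 3)
stores = np.tile(["A", "B", "C"], 30)
sales = rng.integers(100, 1000, size=90).astype(float)  # placeholder values, not real sales
df_long = pd.DataFrame({"Day": days, "Store": stores, "Sales": sales})

# Day plays the role of the "subject"; Store is the within-subject factor
rm_result = AnovaRM(data=df_long, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

`AnovaRM` requires a balanced design with exactly one observation per day and store. Unlike the one-way model, it removes day-to-day variation from the error term, which usually makes the store comparison more sensitive when sales on the same day are correlated.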
