# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

### Assumptions for using ANOVA (Analysis of Variance) include:

Independence: The data points should be independent of each other. Violation example: Repeated measures on the same subject without accounting for the correlation.

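Normality: The dependent variable should be approximately normally distributed within each group (equivalently, the model residuals should be approximately normal). Violation example: The data are heavily skewed or heavy-tailed, which matters most when group sizes are small.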

Homogeneity of Variance: The variances of the dependent variable should be equal across groups. Violation example: One group has significantly larger variance than the others.

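Random Sampling: The data should be collected through random sampling from the population of interest. Violation example: Recruiting only volunteers (a convenience sample), so the groups may not represent the population.

Interval or Ratio Scale: The dependent variable should be measured on an interval or ratio scale. Violation example: Treating ranks or category labels as if they were numeric measurements.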

Violations of these assumptions can impact the validity of ANOVA results, leading to inflated Type I error rates, decreased statistical power, or otherwise inaccurate conclusions. For example, if the data are far from normally distributed or the group variances differ substantially, the F-test and its p-value may be unreliable. In such cases, it might be necessary to transform the data, use Welch's ANOVA (which does not assume equal variances), or use a non-parametric alternative such as the Kruskal-Wallis test.
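A minimal sketch of how the normality and equal-variance assumptions might be checked before running ANOVA, using `scipy.stats.shapiro` and `scipy.stats.levene`. The three groups below are made-up values for illustration only, not data from this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2])
group_b = np.array([4.3, 4.1, 4.5, 4.2, 4.4, 4.0])
group_c = np.array([6.2, 6.0, 6.4, 6.1, 6.3, 5.9])
groups = {"A": group_a, "B": group_b, "C": group_c}

# Shapiro-Wilk test per group: a small p-value suggests departure from normality
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

With samples this small, both tests have little power, so plots of the residuals are usually worth inspecting as well.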

# Q2. What are the three types of ANOVA, and in what situations would each be used?

- One-Way ANOVA: Used when there is one categorical independent variable (factor) with two or more levels (groups). It tests for differences in means across the groups. With exactly two groups it is equivalent to an independent two-sample t-test.

- Two-Way ANOVA: Used when there are two categorical independent variables (factors). It tests for the main effect of each factor and for the interaction effect between the two factors.

- N-Way ANOVA: Generalization of ANOVA to more than two independent variables, testing each main effect and the interactions among the factors.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

- Partitioning of variance in ANOVA refers to the decomposition of the total variability in the dependent variable (the total sum of squares) into different sources, such as between-group variability and within-group variability.

### In one-way ANOVA, the total sum of squares (SST) is partitioned into:

- Explained sum of squares (SSE): Variation attributed to the differences between the group means (between-group variation).
- Residual sum of squares (SSR): Variation not accounted for by the group means, also known as the error or within-group variation.

Note that some textbooks use the abbreviations the other way round (SSE for the error sum of squares and SSR for the regression sum of squares); this notebook uses SSE for explained and SSR for residual throughout.

Understanding this concept is crucial because it helps identify how much of the total variability is due to the effect of the independent variable (group means) and how much is due to random variation (error). It allows us to determine the proportion of variance explained by the model and evaluate the statistical significance of the effects.
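The decomposition can be written out explicitly. The notation below is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ for observation $j$ in group $i$, $\bar{y}_i$ for the mean of group $i$, and $\bar{y}$ for the overall mean. SSE denotes the explained (between-group) sum of squares and SSR the residual (within-group) sum of squares.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSE &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSR &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SST &= SSE + SSR \\
F &= \frac{SSE / (k - 1)}{SSR / (N - k)}
\end{aligned}
```

The F-statistic compares the explained variation per degree of freedom with the residual variation per degree of freedom, which is why the partition is the basis of the significance test.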

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import scipy.stats as stats

# Sample data for groups
group1 = np.array([10, 12, 14, 16])
group2 = np.array([20, 22, 24, 26])
group3 = np.array([30, 32, 34, 36])

# Combine data from all groups
all_data = np.concatenate((group1, group2, group3))

# Calculate the overall mean
overall_mean = np.mean(all_data)

# Calculate SST
SST = np.sum((all_data - overall_mean) ** 2)

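# Calculate SSE (explained, between-group): squared distance of each group mean
# from the overall mean, weighted by the group size
groups = [group1, group2, group3]
SSE = sum(len(g) * (np.mean(g) - overall_mean) ** 2 for g in groups)

# Calculate SSR (residual, within-group): squared deviations of each observation
# from its own group mean
SSR = sum(np.sum((g - np.mean(g)) ** 2) for g in groups)

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 860.0
Explained Sum of Squares (SSE): 800.0
Residual Sum of Squares (SSR): 60.0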


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {'Factor1': [1, 1, 2, 2, 3, 3],
        'Factor2': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Response': [5, 6, 7, 8, 9, 10]}

df = pd.DataFrame(data)

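model = ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=df).fit()

# Factor2 is categorical, so the formula interface names its coefficient after the
# non-reference level (for example 'Factor2[T.B]'); there is no plain 'Factor2' key.
# Selecting by name pattern avoids hard-coding those generated names.
params = model.params
main_effects = params[[name for name in params.index
                       if ':' not in name and name != 'Intercept']]
interaction_effect = params[[name for name in params.index if ':' in name]]

print("Main Effects:")
print(main_effects)
print("Interaction Effect:")
print(interaction_effect)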

# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In a one-way ANOVA, the F-statistic is used to test whether the group means differ. The null hypothesis is that all population group means are equal. The p-value associated with the F-statistic is the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis were true.

In this scenario, the obtained F-statistic is 5.23, and the p-value is 0.02.

Interpretation: With a significance level of 0.05, the p-value (0.02) is below the threshold, so we reject the null hypothesis. This is evidence that at least one group mean differs from the others. However, the F-test is an omnibus test: it does not say which groups differ, in which direction, or by how much. To understand the specific group differences, post-hoc tests or planned comparisons between group means would be necessary, ideally reported together with an effect size.
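As a sketch of where such a p-value comes from, the F-statistic can be compared against the F distribution with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups and 15 observations in total) are assumptions chosen purely for illustration.

```python
from scipy import stats

f_statistic = 5.23   # value reported in the question
df_between = 2       # assumption: 3 groups, so k - 1 = 2
df_within = 12       # assumption: 15 observations, so N - k = 12

# The p-value is the area under the F distribution to the right of the statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that the F-statistic must exceed at the 5% level
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {f_critical:.3f}")
print("Reject the null hypothesis:", f_statistic > f_critical)
```

Under these assumed degrees of freedom the p-value comes out at roughly 0.023, in line with the 0.02 quoted in the question; different group counts or sample sizes would give a different p-value for the same F-statistic.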

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

- In a repeated measures ANOVA, missing data are common because participants may drop out of the study, skip some measurement occasions, or have values lost for other reasons. Handling missing data appropriately is crucial to avoid biased results. Several methods can be used, each with its own potential consequences:

- Complete Case Analysis (Listwise Deletion): This method involves excluding any participant with missing data on any of the measured variables. The consequence of this approach is reduced sample size, potentially leading to loss of statistical power and biased results if missing data are not missing completely at random.

- Mean Imputation: Missing values are replaced with the mean value of the available data for that variable. This method can lead to an underestimation of variance and distorted standard errors, resulting in inflated Type I error rates.

- Last Observation Carried Forward (LOCF): Missing values are replaced with the last observed value for that participant. This method may introduce bias if the assumption that the last observed value remains constant is not valid.

- Multiple Imputation: A more sophisticated approach that involves creating multiple plausible imputations for the missing values based on the observed data's distribution. The analyses are performed on each imputed dataset, and results are combined to account for uncertainty. This method tends to produce more accurate estimates and valid statistical inferences.

- Maximum Likelihood Estimation (MLE): Instead of filling in missing values, the model parameters are estimated from all available observations, for example by fitting a linear mixed-effects model rather than a classical repeated measures ANOVA. Participants with partial data still contribute to the estimates. MLE provides valid parameter estimates under the assumption that data are missing at random.

- The choice of method depends on the underlying assumptions about the missing data mechanism and the extent of missingness in the dataset. It is essential to carefully consider the nature of the missing data and perform sensitivity analyses to assess the impact of different missing data methods on the results.
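The effect of the simpler methods can be seen on a small example. This is a minimal sketch using pandas; the scores for six participants at three time points (`t1`, `t2`, `t3`) are hypothetical values made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 15.0, np.nan, 10.0],
    "t3": [13.0, 14.0, np.nan, 17.0, 15.0, 12.0],
})

# Listwise deletion: keep only participants with no missing values
complete_cases = scores.dropna()

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward in time
locf = scores.ffill(axis=1)

print("Participants kept by listwise deletion:", len(complete_cases), "of", len(scores))
print("Variance of t2 (observed values only):", round(scores["t2"].var(), 3))
print("Variance of t2 after mean imputation:", round(mean_imputed["t2"].var(), 3))
```

Half of the participants are lost under listwise deletion, and mean imputation shrinks the variance of `t2` because the filled-in values sit exactly at the mean. Multiple imputation and likelihood-based methods are designed to avoid both problems.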

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

- After performing an ANOVA and finding a significant difference among the groups, post-hoc tests are used to determine which specific group means differ significantly from each other. Some common post-hoc tests include:

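- Tukey's Honestly Significant Difference (HSD): This test controls the family-wise error rate across all pairwise comparisons of means. It is the usual choice when you want to compare every pair of groups and have no specific hypotheses about which groups will differ. It was designed for equal group sizes; the Tukey-Kramer extension handles unequal group sizes.

- Bonferroni Correction: This is a general adjustment rather than a test of its own. It controls the family-wise error rate by dividing the desired alpha level (usually 0.05) by the number of comparisons. It is generally more conservative than Tukey's HSD when all pairwise comparisons are made, and is most useful when only a small number of planned comparisons are tested.

- Scheffe's Test: This test is more conservative than Tukey's HSD for pairwise comparisons. It is used when you want to test complex contrasts among means (for example, the average of two groups against a third), including contrasts chosen after looking at the data, and it can be used with unequal group sizes.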

- Dunnett's Test: This test is used when you have one control group and want to compare it to all other groups.

- Example Situation: Suppose a researcher conducts an experiment comparing the effectiveness of four different teaching methods (A, B, C, and D) on test scores. After running an ANOVA, they find a statistically significant difference among the teaching methods. Now, they want to determine which specific pairs of teaching methods are significantly different from each other. In this case, a post-hoc test, such as Tukey's HSD or Bonferroni correction, can be used to make pairwise comparisons and identify the significant differences.
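The teaching-method example can be sketched with Bonferroni-corrected pairwise t-tests. The scores below are hypothetical values made up for illustration, and the approach shown (independent t-tests with `scipy.stats.ttest_ind`, compared against alpha divided by the number of pairs) is only one of the options listed above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical test scores for four teaching methods (illustrative values only)
scores = {
    "A": np.array([72, 75, 78, 70, 74, 77]),
    "B": np.array([80, 83, 79, 85, 82, 81]),
    "C": np.array([73, 76, 71, 75, 74, 78]),
    "D": np.array([90, 88, 92, 87, 91, 89]),
}

# All pairwise comparisons: 4 groups give 6 pairs
pairs = list(combinations(scores, 2))

# Bonferroni: split the family-wise alpha evenly across the comparisons
alpha = 0.05
bonferroni_alpha = alpha / len(pairs)

results = {}
for first, second in pairs:
    _, p = ttest_ind(scores[first], scores[second])
    results[(first, second)] = (p, p < bonferroni_alpha)
    print(f"{first} vs {second}: p = {p:.4f}, significant = {p < bonferroni_alpha}")
```

Each comparison is judged against 0.05 / 6 rather than 0.05, which keeps the chance of any false positive across the six tests at or below 5%.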

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [5]:
#pip install scipy


In [6]:
import numpy as np
from scipy.stats import f_oneway

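# Weight loss data for each diet group (illustrative values with 30 observations
# per diet, rather than the 50 participants described in the question)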
diet_A = np.array([5.1, 4.8, 4.9, 4.7, 5.3, 5.5, 5.0, 4.6, 5.2, 5.4,
                   5.1, 4.8, 4.9, 5.2, 5.3, 5.0, 5.2, 4.7, 5.0, 5.3,
                   5.1, 5.5, 4.9, 4.8, 5.0, 5.2, 5.1, 5.3, 5.2, 5.1])

diet_B = np.array([4.3, 4.1, 4.2, 4.0, 4.5, 4.4, 4.2, 4.6, 4.1, 4.3,
                   4.2, 4.5, 4.4, 4.2, 4.3, 4.5, 4.6, 4.0, 4.4, 4.3,
                   4.2, 4.6, 4.1, 4.3, 4.2, 4.4, 4.3, 4.5, 4.4, 4.2])

diet_C = np.array([6.2, 6.0, 6.1, 5.9, 6.4, 6.3, 6.1, 6.5, 6.0, 6.2,
                   6.1, 6.4, 6.3, 6.1, 6.2, 6.4, 6.5, 5.9, 6.3, 6.2,
                   6.1, 6.5, 6.0, 6.2, 6.1, 6.3, 6.2, 6.4, 6.3, 6.1])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Report the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-Statistic: 724.5919214415851
p-value: 5.738634864256539e-55
There is a significant difference between the mean weight loss of the three diets.


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [8]:
#pip install statsmodels


In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

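# Sample data in a DataFrame: 3 programs x 2 experience levels, 5 employees per cell.
# Experience must vary within every program. If some Software/Experience cells are
# empty, the main effect of Experience cannot be estimated and its row shows NaN.
data = pd.DataFrame({
    'Software': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'Experience': (['Novice'] * 5 + ['Experienced'] * 5) * 3,
    'Time': [20, 25, 22, 19, 21, 24, 18, 23, 26, 20,
             30, 32, 28, 29, 31, 27, 33, 29, 31, 30,
             15, 18, 17, 16, 19, 20, 18, 21, 22, 19]
})

# Perform two-way ANOVA
model = ols('Time ~ Software + Experience + Software:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Report the results: one row per effect with its sum of squares, df, F and p-value
print(anova_table)

# Interpret each main effect and the interaction at the 5% level
alpha = 0.05
for effect, p in anova_table['PR(>F)'].dropna().items():
    verdict = "significant" if p < alpha else "not significant"
    print(f"{effect}: {verdict} (p = {p:.4f})")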


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [10]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import MultiComparison

# Test scores data for the control group (traditional teaching method)
control_group = np.array([75, 82, 88, 79, 90, 72, 85, 78, 80, 83,
                          77, 86, 81, 73, 79, 84, 77, 82, 80, 76,
                          75, 85, 81, 83, 79, 78, 74, 80, 88, 82])

# Test scores data for the experimental group (new teaching method)
experimental_group = np.array([85, 92, 78, 89, 94, 92, 87, 90, 96, 88,
                               82, 91, 84, 89, 90, 83, 88, 86, 93, 85,
                               88, 90, 89, 84, 92, 86, 85, 88, 90, 92])

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Report the results
print("T-Statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
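    # Follow up with post-hoc test (Tukey's HSD). With only two groups there is a
    # single pairwise comparison, so this repeats the t-test's conclusion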
    data = np.concatenate([control_group, experimental_group])
    group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)
    mc = MultiComparison(data, group_labels)
    result = mc.tukeyhsd()
    print(result)
else:
    print("There is no significant difference in test scores between the two groups.")


T-Statistic: -7.113582731185638
p-value: 1.8890074887517523e-09
There is a significant difference in test scores between the two groups.
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental      7.8   0.0 5.6051 9.9949   True
--------------------------------------------------------


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a posthoc test to determine which store(s) differ significantly from each other.

In [15]:
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import MultiComparison

# Sample data for daily sales of three stores (30 days each)
store_A_sales = np.array([100, 105, 98, 110, 115, 102, 108, 112, 105, 100,
                          90, 95, 92, 88, 85, 93, 99, 105, 102, 97,
                          120, 115, 112, 125, 130, 128, 132, 125, 120, 125])

store_B_sales = np.array([80, 85, 82, 88, 85, 78, 82, 90, 85, 80,
                          200, 198, 202, 210, 215, 212, 208, 205, 209, 198,
                          90, 95, 92, 88, 85, 82, 88, 85, 78, 82])

store_C_sales = np.array([120, 125, 130, 128, 132, 125, 120, 125, 130, 122,
                          90, 95, 92, 88, 85, 82, 88, 85, 78, 82,
                          100, 102, 98, 105, 110, 108, 112, 105, 100, 105])

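# Perform one-way ANOVA. Note: f_oneway treats the three stores as independent
# groups, so it ignores the pairing by day that a repeated measures ANOVA models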
f_statistic, p_value = f_oneway(store_A_sales, store_B_sales, store_C_sales)

# Report the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in daily sales between the three stores.")
    # Follow up with post-hoc test (Tukey's HSD)
    data = np.concatenate([store_A_sales, store_B_sales, store_C_sales])
    group_labels = ['Store A'] * len(store_A_sales) + ['Store B'] * len(store_B_sales) + ['Store C'] * len(store_C_sales)
    mc = MultiComparison(data, group_labels)
    result = mc.tukeyhsd()
    print(result)
else:
    print("There is no significant difference in daily sales between the three stores.")


F-Statistic: 2.722321779588673
p-value: 0.07132327308922878
There is no significant difference in daily sales between the three stores.
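Because the same days are observed for every store, the measurements are paired by day, which is what a repeated measures ANOVA is designed for. The following is a minimal sketch using `AnovaRM` from statsmodels, with the day acting as the "subject" and the store as the within-subject factor. The five days of sales below are hypothetical values made up for illustration, not the data from the cell above.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical sales in long format: one row per (day, store) pair
sales = pd.DataFrame({
    "day":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "store": ["A", "B", "C"] * 5,
    "sales": [100, 90, 120, 105, 92, 125, 98, 85, 118,
              110, 95, 130, 102, 91, 121],
})

# Repeated measures ANOVA: each day is measured once under every store
rm_result = AnovaRM(data=sales, depvar="sales", subject="day", within=["store"]).fit()
rm_table = rm_result.anova_table

print(rm_table)
```

`AnovaRM` expects balanced data, with exactly one observation per day and store. Removing the day-to-day variation from the error term can make the test more sensitive to store differences than a one-way ANOVA on the same numbers.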
