## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

### Assumptions in ANOVA and examples of potential violations:
1. Independence: Each observation should be independent of every other observation, both within and across groups. The assumption is violated when observations are dependent or correlated.

   Violation - Repeated measures: When measurements are taken from the same subjects multiple times, the observations within each subject are correlated.

2. Normality: The dependent variable within each group (equivalently, the model residuals) should follow an approximately normal distribution. Departures from normality can affect the accuracy of p-values and confidence intervals, especially in small samples.

   Violations include skewed or heavy-tailed distributions and extreme outliers.

3. Homogeneity of Variance: The variance of the dependent variable should be equal across all groups.

   Violation - Unequal variances (heteroscedasticity): The assumption is violated when the variance differs substantially between groups, for example when the spread of the dependent variable grows or shrinks systematically across the levels of the independent variable. The F-test is most sensitive to this when the group sizes are also unequal.
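
The normality and equal-variance assumptions can be screened with standard tests before running ANOVA. The sketch below is one possible approach, not part of the original answer: the three groups are hypothetical simulated data, and the choice of the Shapiro-Wilk and Levene tests is an assumption. Independence cannot be tested this way; it has to be justified by the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups (fixed seed for reproducibility)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=40),
    "B": rng.normal(loc=11, scale=2, size=40),
    "C": rng.normal(loc=12, scale=2, size=40),
}

# Normality: Shapiro-Wilk test per group (H0: the sample is normally distributed)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")

# Homogeneity of variance: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test signals a possible violation; with large samples these tests flag even trivial departures, so it helps to look at plots of the residuals as well.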


## Q2. What are the three types of ANOVA, and in what situations would each be used?

1. One-Way ANOVA:

Situation: One-Way ANOVA is used when you have one factor with two or more levels and you want to compare the means of a continuous dependent variable across those groups.
Example: Comparing the average test scores of students across different grade levels (e.g., 9th grade, 10th grade, 11th grade, and 12th grade).

2. Two-Way ANOVA:

Situation: Two-Way ANOVA is used when you have two factors and one continuous dependent variable. It allows you to examine the main effects of each independent variable and their interaction effect on the dependent variable.
Example: Investigating the effect of both gender (male vs. female) and treatment (drug A vs. drug B) on blood pressure levels.

3. Three-Way ANOVA:

Situation: Three-Way ANOVA is used when you have three categorical independent variables (factors) and one continuous dependent variable. It allows you to examine the main effects of each independent variable, as well as their interactions.
Example: Analyzing the impact of factors like temperature (low, medium, high), humidity (low, medium, high), and time of day (morning, afternoon, evening) on crop yield.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

1. The partitioning of variance in ANOVA refers to the decomposition of the total variation in the data into different sources of variation.
2. It involves dividing the total sum of squares (SST) into components, such as the sum of squares between groups (SSB) and the sum of squares within groups (SSW), so that SST = SSB + SSW in a one-way design.
3. Understanding this concept is important because the F-statistic is built from these components (the between-group mean square divided by the within-group mean square), so the partition shows where the variability comes from and how to interpret the ANOVA results.

Imagine you have a group of people, and you want to understand why their heights differ.

1. The "total sum of squares" represents all the variation in height among the people.
2. The "sum of squares between groups" tells you how much of the variation in height comes from differences between specific groups of people (like males and females).
3. The "sum of squares within groups" tells you how much of the variation in height comes from differences within each group (like individual differences within the male or female group).
4. By understanding this partitioning, you can see how much of the total height variation is due to differences between groups and how much is due to differences within groups.
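
Written out for a one-way design, the partition described above takes the following form. The notation is introduced here for illustration and is an assumption, not taken from the original answer: $k$ groups, $n_i$ observations in group $i$, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the overall mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{\text{SST}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{\text{SSB}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{\text{SSW}}
$$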

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

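1. Total Sum of Squares (SST):
    1. Formula: SST = Σ(y - ȳ)²
    2. Calculation: Sum the squared differences between each individual value (y) and the overall mean (ȳ) across all observations.

2. Explained Sum of Squares (SSE):
    1. Formula: SSE = Σ nᵢ(ȳᵢ - ȳ)²
    2. Calculation: For each group, calculate the squared difference between the group mean (ȳᵢ) and the overall mean (ȳ), and multiply it by the number of observations (nᵢ) in that group. Sum these values across all groups. This is the between-group sum of squares.

3. Residual Sum of Squares (SSR):
    1. Formula: SSR = SST - SSE
    2. Calculation: Subtract the explained sum of squares (SSE) from the total sum of squares (SST) to obtain the residual sum of squares (SSR). This equals the sum of squared differences between each observation and its own group mean (the within-group sum of squares).

Note on naming: the abbreviations follow the question. Many textbooks use them the other way around (SSE for the error sum of squares and SSR for the regression sum of squares), so check the definitions when comparing sources.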
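
The three formulas above can be sketched in Python with NumPy as follows. The three small groups are hypothetical data chosen so the arithmetic is easy to check by hand; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: three groups of three observations each
groups = {
    "A": np.array([4.0, 5.0, 6.0]),
    "B": np.array([7.0, 8.0, 9.0]),
    "C": np.array([1.0, 2.0, 3.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# SST: squared deviations of every observation from the overall mean
sst = np.sum((all_values - grand_mean) ** 2)

# SSE (explained / between groups): n_i * (group mean - overall mean)^2, summed over groups
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# SSR (residual / within groups): by subtraction, and directly as a cross-check
ssr = sst - sse
ssr_direct = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

print(f"SST = {sst:.1f}")                # SST = 60.0
print(f"SSE = {sse:.1f}")                # SSE = 54.0
print(f"SSR = {ssr:.1f}")                # SSR = 6.0
print(f"SSR direct = {ssr_direct:.1f}")  # SSR direct = 6.0
```

Computing SSR both ways is a cheap sanity check: the two values agree only if the partition SST = SSE + SSR holds.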

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

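```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data (one observation per factor1 x factor2 cell)
data = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
                     'values': [5, 8, 6, 9, 7, 10]})

# Fit the two-way model with both main-effect terms and their interaction
model = ols('values ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=data).fit()

# The coefficients use treatment coding with A and X as reference levels, so each
# one is a contrast against the reference cell rather than an ANOVA F-test.
# With one observation per cell the model is saturated (no residual degrees of
# freedom), so F-tests for the effects would need replicated data.
print("Factor 1 coefficient (B vs. A, at factor2 = X):")
print(model.params['C(factor1)[T.B]'])
print("Factor 2 coefficient (Y vs. X, at factor1 = A):")
print(model.params['C(factor2)[T.Y]'])

# Print the interaction coefficient
print("Interaction coefficient (B:Y):")
print(model.params['C(factor1)[T.B]:C(factor2)[T.Y]'])
```

```
Factor 1 coefficient (B vs. A, at factor2 = X):
1.0000000000000004
Factor 2 coefficient (Y vs. X, at factor1 = A):
2.9999999999999973
Interaction coefficient (B:Y):
5.075018748098689e-16
```

The interaction coefficient is zero up to floating-point error, which means the difference between Y and X is the same at level A and level B of factor1 in this data set.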


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

1. The F-statistic is the ratio of the between-group variability (mean square between) to the within-group variability (mean square within).
2. A larger F-statistic means that the between-group variability is large relative to the within-group variability.
3. Here, F = 5.23 means the between-group mean square is about 5.2 times the within-group mean square. Whether that is unusually large depends on the degrees of freedom, which the p-value takes into account.
4. A p-value of 0.02 means that, if all group means were truly equal, an F-statistic at least this large would occur about 2% of the time. This is below the conventional 0.05 level, so the null hypothesis of no differences is rejected (it would not be rejected at the stricter 0.01 level).
5. In general, with an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there are statistically significant differences between the groups: at least one group mean differs from the others. ANOVA does not say which groups differ or how large the differences are, so a post-hoc test and an effect size are needed to complete the interpretation.
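
The link between an F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not give the degrees of freedom, so the values below (3 groups and 15 observations, giving 2 and 12 degrees of freedom) are purely an assumption for illustration.

```python
from scipy import stats

f_statistic = 5.23

# Assumed design (not given in the question): k = 3 groups, N = 15 observations
df_between = 3 - 1   # k - 1
df_within = 15 - 3   # N - k

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at alpha = 0.05 for the same degrees of freedom
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}")
print(f"critical F (alpha = 0.05) = {critical_value:.2f}")
print("Reject H0" if f_statistic > critical_value else "Fail to reject H0")
```

With other degrees of freedom the same F-statistic gives a different p-value, which is why F alone is not enough to judge significance.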

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### Common approaches for dealing with missing data in a repeated measures ANOVA and their potential consequences:
1. Listwise deletion:
      1. Cases (subjects) with missing data on any variable involved in the analysis are removed from the dataset entirely.
      2. This reduces the sample size and therefore statistical power, and it can bias the results unless the data are missing completely at random (MCAR).
2. Pairwise deletion:
      1. It uses all available data for each pairwise comparison within the repeated measures design.
      2. It can lead to an unbalanced design with different sample sizes for each comparison.
      3. This can affect the precision of estimates and statistical power, and results from different comparisons are based on different subsets of subjects.
3. Mean imputation:
      1. Mean imputation replaces missing values with the mean of the observed values for that variable.
      2. It assumes that values are missing completely at random (MCAR); otherwise the imputed means are biased.
      3. It artificially reduces the variability of the data, leading to underestimated standard errors and potentially invalid inferences.
4. Multiple imputation:
      1. It generates several plausible values for each missing data point based on the observed data and their relationships, analyzes each completed dataset, and pools the results so that the uncertainty from imputation is reflected in the standard errors.
      2. It can be computationally intensive and requires assumptions about the missing data mechanism (typically that data are missing at random, MAR).
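
The effect of the two simplest approaches can be seen on a small example. This is a minimal sketch with pandas on hypothetical wide-format data (one row per subject, one column per time point); the column names `t1` to `t3` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 5 subjects measured at 3 time points
scores = pd.DataFrame({
    't1': [10.0, 12.0, 11.0, 14.0, 13.0],
    't2': [11.0, np.nan, 12.0, 15.0, 14.0],
    't3': [12.0, 14.0, np.nan, 16.0, 15.0],
})

# Listwise deletion: drop every subject with at least one missing value
listwise = scores.dropna()

# Mean imputation: replace each missing value with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Subjects before / after listwise deletion:", len(scores), "/", len(listwise))
print("Std of t2, observed values only:", round(scores['t2'].std(), 3))
print("Std of t2, after mean imputation:", round(mean_imputed['t2'].std(), 3))
```

Listwise deletion keeps only three of the five subjects, while mean imputation keeps all five but shrinks the standard deviation of `t2`, which illustrates the loss of power in the first case and the understated variability in the second.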

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### 1. Tukey's Honestly Significant Difference (HSD):
Tukey's HSD compares all possible pairs of group means while controlling the familywise error rate at the overall significance level. It is exact when the sample sizes across groups are equal; with unequal sample sizes the Tukey-Kramer variant is used and is slightly conservative. It assumes equal variances across groups.

Example: Let's say you conducted an ANOVA to compare the effectiveness of three different teaching methods on student performance. The ANOVA revealed a significant overall effect. To determine which specific pairs of teaching methods differ significantly, you can use Tukey's HSD.
#### 2. Bonferroni correction:
The Bonferroni correction is a conservative method that adjusts the significance level for multiple comparisons. It divides the desired significance level (e.g., α = 0.05) by the number of comparisons being made. It works well for a small number of planned comparisons; with many comparisons it becomes overly conservative and reduces power.

Example: Suppose you conducted an ANOVA to examine the effects of a medication on various symptoms. You have four treatment groups, and you want to compare each pair of groups to identify significant differences. Four groups give six pairwise comparisons, so each comparison is tested at α = 0.05 / 6 ≈ 0.0083.
#### 3. Scheffé's test:
Scheffé's test is a conservative post-hoc test that controls the familywise error rate across all possible contrasts, not only pairwise comparisons, and it can be used with unequal sample sizes. Its confidence intervals are therefore wider than Tukey's, so it has less power for purely pairwise comparisons and is most useful when complex contrasts (for example, the average of two groups versus a third) are of interest. Like Tukey's HSD, it assumes equal variances; when that assumption is violated, a procedure such as Games-Howell is more appropriate.

Example: Consider a study comparing the mean weight loss of four different diets. The ANOVA indicates a significant effect, and you want to compare specific diet pairs as well as combinations of diets (such as two low-carbohydrate diets together versus the other two). In this scenario, you can use Scheffé's test to make the comparisons.
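
The Bonferroni example can be sketched as pairwise t-tests with an adjusted threshold. The four treatment groups below are hypothetical simulated data, and the group names and effect sizes are assumptions made only for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical symptom scores for four treatment groups (fixed seed)
rng = np.random.default_rng(1)
groups = {
    'placebo': rng.normal(5.0, 1.0, 20),
    'low_dose': rng.normal(5.5, 1.0, 20),
    'medium_dose': rng.normal(6.0, 1.0, 20),
    'high_dose': rng.normal(7.0, 1.0, 20),
}

alpha = 0.05
pairs = list(combinations(groups, 2))   # all pairwise comparisons
adjusted_alpha = alpha / len(pairs)     # Bonferroni-adjusted threshold

results = {}
for name_a, name_b in pairs:
    t_stat, p_val = stats.ttest_ind(groups[name_a], groups[name_b])
    results[(name_a, name_b)] = p_val
    verdict = "significant" if p_val < adjusted_alpha else "not significant"
    print(f"{name_a} vs {name_b}: p = {p_val:.4f} ({verdict} at {adjusted_alpha:.4f})")
```

Comparing each raw p-value against α divided by the number of comparisons is equivalent to multiplying the p-values by that number and comparing them against α.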

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

```python
import numpy as np
import scipy.stats as stats

# No data are given, so weight loss is simulated with random numbers.
# Note: this simulates 50 participants per diet (150 in total).
# Set the random seed for reproducibility
np.random.seed(42)

# Generate random weight loss data for each diet
diet_A = np.random.normal(loc=1.5, scale=0.5, size=50)  # Mean: 1.5, Standard Deviation: 0.5
diet_B = np.random.normal(loc=2.0, scale=0.7, size=50)  # Mean: 2.0, Standard Deviation: 0.7
diet_C = np.random.normal(loc=1.8, scale=0.6, size=50)  # Mean: 1.8, Standard Deviation: 0.6

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)
```

```
F-Statistic: 15.513421604195475
p-value: 7.711529407310787e-07
```


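1. The F-statistic measures the ratio of the between-group variance to the within-group variance. Here F ≈ 15.51, so the variation between the diet means is about 15.5 times the variation within the diets.
2. The p-value (≈ 7.7 × 10⁻⁷) is far below the 0.05 significance level, so the null hypothesis of equal mean weight loss is rejected: at least one diet differs from the others. This is expected, because the data were simulated with different true means (1.5, 2.0, and 1.8).
3. The ANOVA does not show which diets differ; a post-hoc test such as Tukey's HSD is needed for that. Had the p-value been above the significance level, you would fail to reject the null hypothesis, meaning the data provide no evidence of a difference between the diets (which is not proof that the means are equal).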

## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random time data (independent of program and experience,
# so no true effect is built into this simulation)
time_data = np.random.normal(loc=10, scale=2, size=30)  # Mean: 10, Standard Deviation: 2

# Generate random software program assignments
program_data = np.random.choice(['A', 'B', 'C'], size=30)

# Generate random experience levels
experience_data = np.random.choice(['Novice', 'Experienced'], size=30)

# Create a DataFrame with the random data
data = pd.DataFrame({
    'Time': time_data,
    'Program': program_data,
    'Experience': experience_data
})

# Fit the two-way ANOVA model (string columns are treated as categorical)
model = ols('Time ~ Program + Experience + Program:Experience', data=data).fit()

# anova_lm defaults to Type I (sequential) sums of squares; with the unbalanced
# cells produced by random assignment, these depend on the order of the terms
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)
```

```
                      df     sum_sq    mean_sq         F    PR(>F)
Program              2.0   1.233374   0.616687  0.236857  0.790927
Experience           1.0  23.761035  23.761035  9.126114  0.005904
Program:Experience   2.0   6.479781   3.239890  1.244374  0.306054
Residual            24.0  62.487152   2.603631       NaN       NaN
```


To interpret the results:
1. Program: F = 0.24, p = 0.79 > 0.05, which is not significant. There is no evidence that the software program has a main effect on the average time to complete the task.
2. Experience: F = 9.13, p = 0.0059 < 0.05, which is significant and would indicate a main effect of experience level on completion time. Because the times here were simulated independently of experience, this particular result is a chance finding (a false positive), not a real effect.
3. Program:Experience: F = 1.24, p = 0.31 > 0.05, which is not significant. There is no evidence of an interaction between software program and experience level, meaning the effect of the program on completion time does not appear to differ between novice and experienced employees.

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set the random seed for reproducibility
np.random.seed(42)

# Note: this simulates 100 students per group (200 in total).
# Generate random test scores for the control group (mean: 75, standard deviation: 10)
control_group = np.random.normal(loc=75, scale=10, size=100)

# Generate random test scores for the experimental group (mean: 80, standard deviation: 12)
experimental_group = np.random.normal(loc=80, scale=12, size=100)

# Perform two-sample t-test (assumes equal variances by default;
# pass equal_var=False for Welch's t-test when the variances differ)
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the t-statistic and p-value
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Perform HSD test
data = np.concatenate((control_group, experimental_group))
group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)
hsd = pairwise_tukeyhsd(data, group_labels)

# Print the HSD test results
print(hsd)
```

```
t-statistic: -4.316398519082441
p-value: 2.5039591073846333e-05
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.3061   0.0 3.4251 9.1872   True
--------------------------------------------------------
```


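1. The t-test gives t ≈ -4.32 with p ≈ 2.5 × 10⁻⁵ < 0.05, which indicates a statistically significant difference in test scores between the two groups. The negative sign means the control group's mean is lower than the experimental group's mean.
2. With only two groups, the t-test already identifies which groups differ, so a post-hoc test adds no new information. The Tukey HSD output confirms the same conclusion: the "reject" column is True, and the experimental group scores about 6.31 points higher on average (95% confidence interval 3.43 to 9.19).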

## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random daily sales data for Store A, B, and C
store_a_sales = np.random.normal(loc=100, scale=20, size=30)
store_b_sales = np.random.normal(loc=110, scale=25, size=30)
store_c_sales = np.random.normal(loc=90, scale=15, size=30)

# Combine the sales data into a single DataFrame
data = pd.DataFrame({
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
    'Store': np.repeat(['A', 'B', 'C'], 30)
})

# Convert the Store column to categorical
data['Store'] = pd.Categorical(data['Store'])

# Fit a one-way ANOVA model with Store as the factor.
# Note: this treats all 90 observations as independent and does not model
# the pairing by day that a repeated measures ANOVA would use.
model = ols('Sales ~ Store', data=data).fit()

# Compute the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Perform post-hoc test (e.g., Tukey's HSD)
posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'])

# Print the post-hoc test results
print(posthoc)
```

```
                sum_sq    df         F    PR(>F)
Store      4332.335994   2.0  5.976977  0.003696
Residual  31530.424628  87.0       NaN       NaN
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  10.7339 0.0797  -0.9868 22.4546  False
     A      C  -6.0438 0.4391 -17.7645  5.6769  False
     B      C -16.7777 0.0028 -28.4984  -5.057   True
-----------------------------------------------------
```


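1. ANOVA on Store:
F = 5.98, p = 0.0037 < 0.05, which is significant, so you can conclude that average daily sales differ between at least two of the stores. Note that the fitted model is a one-way ANOVA that treats the 90 observations as independent. A true repeated measures ANOVA would use the day as the repeated unit and remove day-to-day variation from the error term.
2. Post-hoc Test (Tukey's HSD):
Only the comparison between Store B and Store C is significant (mean difference -16.78, adjusted p = 0.0028), so Store C's average daily sales are significantly lower than Store B's. The differences between A and B (adjusted p = 0.0797) and between A and C (adjusted p = 0.4391) are not significant.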
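
For a design where each day is measured at all three stores, a repeated measures ANOVA can be sketched with `AnovaRM` from statsmodels. The data below are hypothetical: a shared day effect is added as an assumption so that sales on the same day are correlated across stores, and the column names `Day`, `Store`, and `Sales` are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_days = 30

# Hypothetical data: a shared day effect correlates the three stores' sales
day_effect = rng.normal(0, 15, n_days)
sales_long = pd.DataFrame({
    'Day': np.tile(np.arange(n_days), 3),          # subject identifier (the day)
    'Store': np.repeat(['A', 'B', 'C'], n_days),   # within-subject factor
    'Sales': np.concatenate([
        100 + day_effect + rng.normal(0, 10, n_days),
        110 + day_effect + rng.normal(0, 10, n_days),
        90 + day_effect + rng.normal(0, 10, n_days),
    ]),
})

# Repeated measures ANOVA: each day contributes one observation per store
rm_result = AnovaRM(sales_long, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_result.anova_table)
```

The resulting table has 2 numerator and 58 denominator degrees of freedom, compared with 87 residual degrees of freedom in the between-subjects model. If the Store effect is significant, one common follow-up is paired t-tests between stores (for example with `scipy.stats.ttest_rel`) with a Bonferroni adjustment.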