### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical technique used to compare means between three or more groups. To ensure the validity of ANOVA results, certain assumptions must be met. These assumptions include:

1. **Independence**: Observations must be independent of one another, both within each group and across groups. The value of one data point should not be influenced by or related to any other data point.

2. **Normality**: The residuals (the differences between each observation and its group mean) should be approximately normally distributed, which is equivalent to the data within each group being approximately normal. ANOVA is fairly robust to violations of normality when sample sizes are large, so the assumption matters most for small samples.

3. **Homogeneity of Variances (Homoscedasticity)**: The variances of the populations being compared should be approximately equal. In other words, the spread of data points within each group should be similar across all groups. Violations of this assumption can lead to unreliable results, especially when sample sizes are unequal.

Examples of violations of these assumptions that could impact the validity of ANOVA results include:

1. **Non-independence**: In a study examining the effects of different teaching methods on student performance, students who share a classroom may have correlated scores because they are influenced by the same teacher and classroom environment. Treating those scores as independent observations understates the true variability.

2. **Non-normality**: Suppose you're conducting an ANOVA to compare the effectiveness of three pain relief medications. If the distribution of pain scores for one of the medications is heavily skewed, it may violate the normality assumption. For example, if the distribution is highly skewed to the right, with many participants reporting very low pain scores and few reporting high pain scores, the assumption of normality may not hold.

3. **Heterogeneity of Variances**: Consider a study comparing the yield of three different crop varieties. If one crop variety consistently produces highly variable yields across different fields, while the other varieties have more consistent yields, this violates the assumption of homogeneity of variances. The ANOVA results may be unreliable in this case, because the pooled within-group variance no longer describes every group and the F-test's Type I error rate can drift away from the nominal level, particularly when group sizes are unequal.
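
The normality and equal-variance assumptions can be checked before running the test. The sketch below is one possible approach, using made-up group values: it applies the Shapiro-Wilk test to the residuals and Levene's test to the groups. The data and variable names are illustrative assumptions, not part of the original questions.

```python
import numpy as np
from scipy import stats

# Hypothetical data for three groups (illustrative values only)
groups = {
    "A": np.array([10.0, 12.0, 15.0, 18.0, 20.0, 14.0]),
    "B": np.array([8.0, 11.0, 14.0, 16.0, 19.0, 13.0]),
    "C": np.array([9.0, 13.0, 16.0, 17.0, 21.0, 15.0]),
}

# Residuals: each observation minus its own group mean
residuals = np.concatenate([values - values.mean() for values in groups.values()])

# Normality of residuals (null hypothesis: residuals are normally distributed)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances (null hypothesis: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated. Independence cannot be checked this way; it has to follow from how the study was designed.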

### Q2. What are the three types of ANOVA, and in what situations would each be used?

Three commonly distinguished types of ANOVA are:

1. One-Way ANOVA:
   - One-way ANOVA is used when you have one independent variable (factor) with three or more levels (groups) and one continuous dependent variable.
   - It is used to determine whether there are any statistically significant differences between the means of the groups.
   - For example, it could be used to compare the effectiveness of three different types of fertilizer on plant growth.

2. Two-Way ANOVA:
   - Two-way ANOVA is used when you have two independent variables (factors) and one continuous dependent variable.
   - It is used to analyze the interaction effects between the two independent variables as well as their individual effects on the dependent variable.
   - For example, it could be used to study the effects of both genotype and treatment on the growth of plants.

3. MANOVA (Multivariate Analysis of Variance):
   - MANOVA is used when you have two or more dependent variables and one or more independent variables.
   - It extends the one-way or two-way ANOVA to multiple dependent variables.
   - It is used to determine whether there are any overall differences between the groups across all dependent variables.
   - For example, it could be used to examine the effects of different teaching methods on students' performance across multiple subjects.
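
As a rough illustration of the MANOVA case, the sketch below fits a model with two dependent variables and one factor using statsmodels. The column names (`method`, `math`, `reading`) and the simulated scores are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical scores in two subjects for three teaching methods (simulated)
rng = np.random.default_rng(0)
n_per_group = 20
df_scores = pd.DataFrame({
    "method": np.repeat(["A", "B", "C"], n_per_group),
    "math": rng.normal(loc=70, scale=8, size=3 * n_per_group),
    "reading": rng.normal(loc=75, scale=6, size=3 * n_per_group),
})

# Two dependent variables on the left-hand side, one factor on the right
manova = MANOVA.from_formula("math + reading ~ method", data=df_scores)
mv_results = manova.mv_test()
print(mv_results)
```

The printed summary reports multivariate statistics such as Wilks' lambda and Pillai's trace for the `method` term, which test for a group difference across both dependent variables at once.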

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In Analysis of Variance (ANOVA), the partitioning of variance refers to the process of breaking down the total variance observed in the data into different components, each of which represents the variance attributable to different sources or factors. Understanding this concept is crucial because it provides insights into how much of the total variation in the data can be explained by the factors being studied.

The partitioning of variance typically involves three main components:

1. Between-Group Variance:
   - This component represents the variation in the data that is attributable to differences between the group means.
   - It reflects the variation in the dependent variable that is explained by the independent variable(s) or factors being studied.
   - In ANOVA, this variance is compared to the within-group variance to determine if the group means are significantly different from each other.

2. Within-Group Variance (or Error Variance):
   - This component represents the variation in the data that is not explained by the independent variable(s) or factors being studied.
   - It reflects the random variability or "noise" in the data that cannot be attributed to the effects of the independent variable(s).
   - It is also known as error variance because it includes any random error or measurement error present in the data.

3. Total Variance:
   - This component represents the overall variation in the data, regardless of the groupings or factors being studied.
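   - In terms of sums of squares, it is the sum of the between-group and within-group components. This additivity holds for the sums of squares, not for the mean squares obtained by dividing each by its degrees of freedom.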
   - It provides a baseline against which the explained variance (between-group variance) can be compared to assess the significance of the factors being studied.

Understanding the partitioning of variance is important for several reasons:

- It helps researchers assess the impact of the independent variable(s) on the dependent variable by quantifying how much of the total variance can be attributed to these factors.
- It provides a basis for hypothesis testing in ANOVA, where the goal is to determine whether the differences between group means are statistically significant.
- It allows researchers to identify sources of variation in the data and potentially control for them in future studies.
- It aids in the interpretation of ANOVA results by clarifying the relative importance of different factors or variables in explaining the variation observed in the data.
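
For a one-way design with $k$ groups, $n_j$ observations in group $j$, and $N$ observations in total, the partition can be written as follows. The symbols $y_{ij}$ (observation $i$ in group $j$), $\bar{y}_j$ (group mean), and $\bar{y}$ (overall mean) are notation introduced here for illustration.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2}_{\text{total}}
= \underbrace{\sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2}_{\text{between groups}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2}_{\text{within groups}}

F = \frac{\text{between-group sum of squares} / (k - 1)}
         {\text{within-group sum of squares} / (N - k)}
```

The F-statistic is therefore a ratio of two mean squares: a large value means the group means vary more than the within-group noise alone would suggest.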

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
from scipy import stats

# Sample data (replace this with your own data)
group1 = np.array([10, 12, 15, 18, 20])
group2 = np.array([8, 11, 14, 16, 19])
group3 = np.array([9, 13, 16, 17, 21])

# Combine data from all groups
all_data = np.concatenate([group1, group2, group3])

# Compute the overall mean
overall_mean = np.mean(all_data)

# Compute the total sum of squares (SST)
SST = np.sum((all_data - overall_mean)**2)

# Compute the group means
group_means = [np.mean(group) for group in [group1, group2, group3]]

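# Compute the explained (between-group) sum of squares.
# The name SSE follows the question; many textbooks call this SSB or SSR
# and reserve SSE for the error (within-group) term.
SSE = np.sum([len(group) * (mean - overall_mean)**2 for group, mean in zip([group1, group2, group3], group_means)])

# Compute the residual (within-group) sum of squares by subtraction
SSR = SST - SSE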

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 229.6
Explained Sum of Squares (SSE): 7.6
Residual Sum of Squares (SSR): 222.0
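
To connect these sums of squares to the ANOVA test itself, the sketch below computes the within-group sum of squares directly (rather than by subtraction), forms the F-statistic from the two mean squares, and cross-checks it against `scipy.stats.f_oneway`. It reuses the sample data from the cell above; the variable names `ss_between` and `ss_within` are chosen here to avoid the SSE/SSR naming ambiguity.

```python
import numpy as np
from scipy import stats

# Same sample data as in the cell above
group1 = np.array([10, 12, 15, 18, 20])
group2 = np.array([8, 11, 14, 16, 19])
group3 = np.array([9, 13, 16, 17, 21])
groups = [group1, group2, group3]

all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```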


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import numpy as np
import pandas as pd

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# view first ten rows of data
df.head(10)

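| | water | sun | height |
|---|---|---|---|
| 0 | daily | low | 6 |
| 1 | daily | low | 6 |
| 2 | daily | low | 6 |
| 3 | daily | low | 5 |
| 4 | daily | low | 6 |
| 5 | daily | med | 5 |
| 6 | daily | med | 5 |
| 7 | daily | med | 6 |
| 8 | daily | med | 4 |
| 9 | daily | med | 5 |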


In [3]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

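| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(water) | 8.533333 | 1.0 | 16.0 | 0.000527 |
| C(sun) | 24.866667 | 2.0 | 23.3125 | 2e-06 |
| C(water):C(sun) | 2.466667 | 2.0 | 2.3125 | 0.120667 |
| Residual | 12.8 | 24.0 | | |

The `C(water)` and `C(sun)` rows are the main effects, and `C(water):C(sun)` is the interaction effect. Both main effects are significant at the 0.05 level (p = 0.000527 and p = 0.000002), while the interaction is not (p = 0.120667), so there is no evidence that the effect of sunlight on height depends on watering frequency.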
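
The ANOVA table says whether each effect is significant, but not its direction or size. One simple way to look at that is to compare group means, sketched here with the same plant data: marginal means for each factor describe the main effects, and the table of cell means describes the interaction. The variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

# Same plant-growth data as in the cells above
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# Main effects: compare the marginal means of each factor
water_means = df.groupby('water')['height'].mean()
sun_means = df.groupby('sun')['height'].mean()

# Interaction: compare cell means; roughly parallel rows suggest little interaction
cell_means = df.groupby(['water', 'sun'])['height'].mean().unstack()

print(water_means, sun_means, cell_means, sep="\n\n")
```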


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

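With F = 5.23 and p = 0.02, the result is statistically significant at the conventional α = 0.05 level, so we reject the null hypothesis that all group means are equal (it would not be significant at α = 0.01). The F-statistic says that the between-group mean square is about 5.23 times the within-group mean square. The test only indicates that at least one group mean differs from the others; it does not say which. To find out which groups differ, follow up with a post-hoc test such as Tukey's HSD (all pairwise comparisons) or Dunnett's test (comparisons against a control).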

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA requires careful consideration to maintain the validity of the analysis. There are several methods to handle missing data in this context, each with its own potential consequences:

1. **Complete Case Analysis (CCA)**:
   - In CCA (also called listwise deletion), any participant with a missing value at any time point is excluded from the analysis entirely.
   - Pros: Simple to implement.
   - Cons: Reduces sample size and statistical power, and introduces bias unless the data are missing completely at random (MCAR).

2. **Mean Imputation**:
   - Missing values are replaced with the mean of observed values for that variable.
   - Pros: Easy to implement and maintains the sample size.
   - Cons: Underestimates standard errors, distorts the distribution of the variable, and reduces variability, potentially leading to biased results.

3. **Last Observation Carried Forward (LOCF)**:
   - Missing values are replaced with the value of the last observed measurement.
   - Pros: Easy to implement and maintains the sample size.
   - Cons: Assumes that the last observed value is an accurate representation of the missing value, which may not always be true. Can distort longitudinal trends and underestimate variability.

4. **Multiple Imputation**:
   - Missing values are imputed multiple times based on a specified model, creating several complete datasets. Analyses are performed on each dataset, and results are combined.
   - Pros: Retains variability, handles missing data uncertainty, and provides more accurate standard errors and parameter estimates.
   - Cons: More computationally intensive, and it relies on a well-specified imputation model and, in standard implementations, on the missing at random (MAR) assumption. Results may vary depending on the imputation model chosen.

5. **Maximum Likelihood Estimation (MLE)**:
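   - Model parameters are estimated from all available observations by maximizing the likelihood of the observed data; the missing values themselves are not imputed. Linear mixed-effects models fitted to long-format data are a common example.
   - Pros: Provides unbiased parameter estimates under the assumption that data are missing at random (MAR).
   - Cons: Requires specifying a model (such as a mixed model) instead of the classical repeated measures ANOVA, and may be sensitive to violations of the MAR assumption.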

6. **Pattern-Mixture Models**:
   - Models are specified for different patterns of missingness, allowing for separate estimation of parameters for each pattern.
   - Pros: Allows for a more nuanced understanding of missing data mechanisms.
   - Cons: Requires strong assumptions about the missing data mechanism and can be complex to implement.

The choice of method should be based on the characteristics of the missing data, the assumptions that can be reasonably made about the missingness mechanism, and the goals of the analysis. It's essential to conduct sensitivity analyses to assess the robustness of results to different missing data handling approaches. Additionally, reporting the method used and potential limitations introduced by missing data handling is crucial for transparency and reproducibility.
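
To make the trade-offs of the two simplest methods concrete, the sketch below applies complete case analysis and mean imputation to a small, made-up wide-format dataset (one row per subject, one column per time point). The values and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 5.5, 6.5, 7.5],
    "t2": [5.5, np.nan, 7.5, 6.0, np.nan, 8.0],
    "t3": [6.0, 7.0, np.nan, 6.5, 7.5, 8.5],
}, index=[f"s{i}" for i in range(1, 7)])

# Complete case analysis: drop every subject with any missing time point
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the column mean of the observed values
mean_imputed = wide.fillna(wide.mean())

print("Subjects before/after listwise deletion:", len(wide), len(complete_cases))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
```

Listwise deletion discards half of the subjects here even though only three individual values are missing, and mean imputation shrinks the variance of the imputed column, which is why it tends to understate standard errors.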

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

After conducting an analysis of variance (ANOVA) and finding a significant difference among the means of three or more groups, post-hoc tests are often conducted to determine which specific group differences are significant. Here are some common post-hoc tests and when to use each one:

1. **Tukey's Honestly Significant Difference (HSD)**:
   - Use: Tukey's HSD is appropriate when you want to compare all possible pairs of group means while controlling the family-wise error rate. It was designed for equal sample sizes; the Tukey-Kramer variant extends it to unequal sizes.
   - Example: After conducting a one-way ANOVA to compare the mean scores of students from three different schools on a standardized test, Tukey's HSD can be used to determine which pairs of schools have significantly different mean scores.

2. **Bonferroni Correction**:
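   - Use: Bonferroni correction is a general-purpose adjustment that divides the significance level by the number of comparisons. It works with any set of tests and is most useful for a small number of planned comparisons, because it becomes very conservative as the number of comparisons grows.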
   - Example: If you conducted multiple t-tests to compare the means of several treatment groups with a control group, you might use Bonferroni correction to adjust the significance level for each comparison to maintain an overall significance level.

3. **Sidak Correction**:
   - Use: Similar to Bonferroni correction, Sidak correction adjusts the significance level for multiple comparisons. It is slightly less conservative than Bonferroni and assumes the comparisons are independent.
   - Example: Suppose you're comparing the mean effectiveness of three different advertising strategies on sales. After finding a significant difference with ANOVA, you could use Sidak correction for pairwise comparisons to determine which strategies differ significantly from each other.

4. **Dunnett's Test**:
   - Use: Dunnett's test is appropriate when comparing multiple treatment groups to a single control group.
   - Example: After conducting an ANOVA to compare the mean weights of rats subjected to different diets, with one group receiving a standard diet as the control, Dunnett's test can be used to compare the mean weights of rats on experimental diets with the control group.

5. **Scheffé Test**:
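   - Use: Scheffé's test is a conservative post-hoc test that controls the family-wise error rate for all possible contrasts, including complex ones such as comparing one group with the average of several others. It can be used with unequal sample sizes, but it still assumes equal variances; when group variances differ, a procedure such as Games-Howell is more appropriate.
   - Example: If you're comparing the mean scores of students from different grade levels on a standardized test and want to know whether the lowest grade differs from the average of all higher grades, you might use the Scheffé test for that contrast.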

The choice of post-hoc test depends on the specific research question, the nature of the data, and the assumptions underlying the analysis. Post-hoc tests help to identify significant group differences after detecting an overall significant effect in ANOVA, thereby providing more detailed insights into the relationships between variables.
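
As a small illustration of how the Bonferroni and Sidak adjustments behave, the sketch below runs unadjusted pairwise t-tests on simulated scores for three schools and then adjusts the p-values with `multipletests` from statsmodels. The school names, means, and sample sizes are assumptions invented for this example.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three schools (simulated)
rng = np.random.default_rng(42)
scores = {
    "School A": rng.normal(loc=70, scale=10, size=30),
    "School B": rng.normal(loc=75, scale=10, size=30),
    "School C": rng.normal(loc=82, scale=10, size=30),
}

# Unadjusted pairwise t-tests for every pair of groups
pairs = list(combinations(scores, 2))
raw_p = np.array([stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs])

# Adjust the p-values for the number of comparisons
_, p_bonferroni, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
_, p_sidak, _, _ = multipletests(raw_p, alpha=0.05, method="sidak")

for (a, b), p, pb, ps in zip(pairs, raw_p, p_bonferroni, p_sidak):
    print(f"{a} vs {b}: raw p = {p:.4f}, Bonferroni = {pb:.4f}, Sidak = {ps:.4f}")
```

The Sidak-adjusted p-values are never larger than the Bonferroni-adjusted ones, which is the sense in which Sidak is the less conservative of the two.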

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [5]:
import scipy.stats as stats

# Illustrative weight loss data, 5 values per diet
# (the study describes 50 participants; replace with the actual data)
diet_A = [2.5, 3.1, 2.8, 1.9, 2.2]
diet_B = [3.0, 2.7, 3.5, 2.4, 1.8]
diet_C = [1.7, 2.1, 2.3, 1.5, 2.0]

# Combine data into groups for ANOVA
data = [diet_A, diet_B, diet_C]

# Perform ANOVA 
f_statistic, p_value = stats.f_oneway(*data)

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a statistically significant difference (p < 0.05) between the mean weight loss of the three diets.")
else:
    print("There is not sufficient evidence (p >= 0.05) to conclude a significant difference between the mean weight loss of the three diets.")


F-statistic: 3.2234332425068124
p-value: 0.07577951993591635
There is not sufficient evidence (p >= 0.05) to conclude a significant difference between the mean weight loss of the three diets.


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate synthetic sample data
# (90 simulated observations for illustration; the scenario describes 30 employees)
np.random.seed(0)

# Software programs
software = np.random.choice(['A', 'B', 'C'], size=90)

# Employee experience level
experience = np.random.choice(['Novice', 'Experienced'], size=90)

# Task completion time
time = np.random.normal(loc=10, scale=2, size=90)  # Assuming normal distribution with mean 10 and standard deviation 2

# Create DataFrame
data = pd.DataFrame({'Software': software, 'Experience': experience, 'Time': time})

# Fit two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Report results
print(anova_table)


                               sum_sq    df         F    PR(>F)
C(Software)                  4.600606   2.0  0.532542  0.589080
C(Experience)                1.359515   1.0  0.314741  0.576279
C(Software):C(Experience)   15.102201   2.0  1.748150  0.180369
Residual                   362.836289  84.0       NaN       NaN

None of the effects is significant at the 0.05 level: software program (F = 0.53, p = 0.589), experience level (F = 0.31, p = 0.576), and their interaction (F = 1.75, p = 0.180). This is expected here, because the simulated completion times were generated independently of both factors.


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [10]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate synthetic sample data (100 simulated scores per group, for illustration)
np.random.seed(0)
control_group = np.random.normal(loc=75, scale=10, size=100)  # Control group (traditional teaching method)
experimental_group = np.random.normal(loc=80, scale=10, size=100)  # Experimental group (new teaching method)

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Report results of t-test
print("Two-sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Perform post-hoc test (Tukey's HSD) if the results are significant
if p_value < 0.05:
    print("\nPost-hoc test (Tukey's HSD):")
    data = np.concatenate([control_group, experimental_group])
    groups = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)
    tukey_results = pairwise_tukeyhsd(data, groups)
    print(tukey_results)
else:
    print("\nNo significant difference found between the two groups.")


Two-sample t-test:
t-statistic: -3.597192759749614
p-value: 0.0004062796020362504

Post-hoc test (Tukey's HSD):
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental    5.222 0.0004 2.3593 8.0848   True
---------------------------------------------------------

The difference is significant (t = -3.60, p = 0.0004): in this simulated data the experimental group scores about 5.2 points higher on average. With only two groups, a post-hoc test adds nothing beyond the t-test; Tukey's HSD reports the same single comparison with the same p-value.


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [21]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(0)
sales_store_A = np.random.randint(1000, 2000, size=30)  # Daily sales for Store A
sales_store_B = np.random.randint(900, 1900, size=30)   # Daily sales for Store B
sales_store_C = np.random.randint(1100, 2100, size=30)  # Daily sales for Store C

# Combine data into a DataFrame
data = pd.DataFrame({
    'Sales': np.concatenate([sales_store_A, sales_store_B, sales_store_C]),
    'Store': np.repeat(['A', 'B', 'C'], 30)
})

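# Perform one-way ANOVA.
# Note: this treats the stores as independent groups and ignores that all three
# stores were measured on the same 30 days, so it is not a repeated measures analysis.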
f_statistic, p_value = f_oneway(sales_store_A, sales_store_B, sales_store_C)

# Report results of one-way ANOVA
print("One-way ANOVA:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Perform post-hoc test (Tukey's HSD) if the results are significant
if p_value < 0.05:
    print("\nPost-hoc test (Tukey's HSD):")
    tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])
    print(tukey_results)
else:
    print("\nNo significant difference found between the three stores.")


One-way ANOVA:
F-statistic: 0.751995907371091
p-value: 0.474463832909494

No significant difference found between the three stores.
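
The cell above runs an ordinary one-way ANOVA. Because each store is observed on the same 30 days, a repeated measures analysis would treat the day as the "subject" and the store as the within-subject factor. The sketch below shows one way to do this with `AnovaRM` from statsmodels, on the same simulated sales; the `Day` column and the variable names are assumptions added for this example.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same simulated sales as in the cell above, in long format with a Day column
np.random.seed(0)
sales_store_A = np.random.randint(1000, 2000, size=30)
sales_store_B = np.random.randint(900, 1900, size=30)
sales_store_C = np.random.randint(1100, 2100, size=30)

long_data = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),   # identifies the repeated "subject"
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([sales_store_A, sales_store_B, sales_store_C]).astype(float),
})

# Repeated measures ANOVA: Store is the within-subject factor, Day is the subject
rm_results = AnovaRM(long_data, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_results)
```

If the `Store` effect were significant, paired comparisons between stores (for example, paired t-tests with a Bonferroni adjustment) would respect the repeated measures structure better than Tukey's HSD, which assumes independent groups.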
