## Q(1)

**Here are the assumptions required for ANOVA, along with examples of violations and their potential impacts:**

**1. Normality of Distributions:**

- **Assumption:** The dependent variable within each group should be approximately normally distributed.
- **Violation:** Skewed distributions, outliers, or heavy tails.
- **Impact:** p-values can become inaccurate (an inflated Type I error rate or reduced power), especially with small or unequal sample sizes. With large, balanced samples the F-test is fairly robust to moderate departures from normality.

**2. Homogeneity of Variances (Homoscedasticity):**

- **Assumption:** The variances of the dependent variable should be equal across all groups.
- **Violation:** Unequal variances (heteroscedasticity).
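- **Impact:** The F-test becomes less reliable, particularly when group sizes are unequal: the actual Type I error rate can be higher or lower than the nominal level, which distorts p-values and conclusions about group differences.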

**3. Independence of Observations:**

- **Assumption:** Observations within and between groups should be independent of each other.
- **Violation:** Repeated measures on the same subjects, clustered data, or hierarchical structures.
- **Impact:** Inflated Type I error rates, as the model assumes unrelated observations.

**4. Interval or Ratio Dependent Variable:**

- **Assumption:** The dependent variable must be continuous, measured on an interval or ratio scale.
- **Violation:** Ordinal or nominal dependent variables.
- **Impact:** ANOVA is not appropriate for non-continuous variables; alternatives such as the Kruskal-Wallis test (ordinal outcomes) or a chi-square test (nominal outcomes) are needed.

**Additional Considerations:**

- **Additive Factor Effects:** ANOVA assumes that the effects of independent variables on the dependent variable are additive, meaning they combine in a linear fashion.

**Addressing Violations:**

- **Transformations:** Normalize skewed distributions or stabilize variances.
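- **Non-Parametric Tests:** Use the Kruskal-Wallis test (independent groups) or Friedman's test (repeated measures) for non-normal data. These rank-based tests still assume similarly shaped distributions across groups, so they do not by themselves solve unequal variances.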
- **Robust ANOVA Methods:** Employ techniques less sensitive to violations, such as Welch's ANOVA for unequal variances.
- **Mixed Models:** Account for non-independence in repeated measures or hierarchical designs.
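
Two of the remedies above can be sketched with SciPy. This is a minimal illustration on simulated right-skewed data; the group names, sample sizes, and log-normal parameters are assumptions made for the example, not values from the text.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)

# Hypothetical right-skewed samples (log-normal), three groups of 20
g1 = rng.lognormal(mean=1.0, sigma=0.6, size=20)
g2 = rng.lognormal(mean=1.3, sigma=0.6, size=20)
g3 = rng.lognormal(mean=1.6, sigma=0.6, size=20)

# Remedy 1: log-transform the data, then run the usual one-way ANOVA
f_log, p_log = f_oneway(np.log(g1), np.log(g2), np.log(g3))

# Remedy 2: rank-based Kruskal-Wallis test on the raw values
h_stat, p_kw = kruskal(g1, g2, g3)

print(f"ANOVA on log scale: F = {f_log:.3f}, p = {p_log:.4f}")
print(f"Kruskal-Wallis:     H = {h_stat:.3f}, p = {p_kw:.4f}")
```

Because Kruskal-Wallis works on ranks, it gives the same statistic on the raw and the log-transformed values, whereas the ANOVA result depends on the scale chosen.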

**It's crucial to check these assumptions before conducting ANOVA and carefully consider appropriate remedies if violations occur to ensure the validity and trustworthiness of the results.**
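
As a rough sketch of such checks, the normality and equal-variance assumptions can be screened with the Shapiro-Wilk and Levene tests from SciPy. The data below are simulated and the group labels are hypothetical; independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy.stats import levene, shapiro

rng = np.random.default_rng(1)

# Hypothetical measurements for three groups of 30 observations each
groups = {
    "A": rng.normal(50, 5, 30),
    "B": rng.normal(52, 5, 30),
    "C": rng.normal(55, 5, 30),
}

# Normality within each group: a small p-value suggests non-normality
normality_p = {name: shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variances: a small p-value suggests unequal variances
levene_stat, levene_p = levene(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.4f}")
print(f"Levene: W = {levene_stat:.4f}, p = {levene_p:.4f}")
```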


## Q(2)

There are three main types of ANOVA (Analysis of Variance) based on the number of independent variables used in your analysis:

**1. One-Way ANOVA:**

- **Situation:** Compare the means of three or more groups on a single dependent variable.
- **Examples:** Studying the effect of different fertilizer types on plant growth, comparing average student scores in different teaching methods, or analyzing protein levels in three diet groups.

**2. Two-Way ANOVA:**

- **Situation:** Investigate the effects of two independent variables on a single dependent variable, potentially including their interaction effect.
- **Examples:** Examining the combined effect of exercise frequency and intensity on weight loss, analyzing the impact of both fertilizer type and soil acidity on crop yield, or studying the influence of both gender and age on job satisfaction.

**3. N-Way ANOVA (Multiple-way ANOVA):**

- **Situation:** Analyze the effects of three or more independent variables and their interactions on a single dependent variable.
- **Examples:** Exploring the combined influence of fertilizer type, water stress, and temperature on fruit quality, researching the effects of medication, dosage, and treatment duration on symptom reduction, or investigating the impact of teacher experience, class size, and student socioeconomic status on academic performance.

Choosing the right type of ANOVA depends on your research question and the number of independent variables you want to investigate.

Here are some additional points to consider:

- One-way ANOVA is the simplest form and suitable for basic comparisons between groups.
- Two-way ANOVA allows for a more nuanced analysis by revealing potential interactions between independent variables.
- N-way ANOVA is useful for complex research designs with multiple factors influencing the outcome, but it also requires larger data sets and more elaborate interpretation.

## Q(3)



**Concept:**

- **Partitioning of variance** is a fundamental concept in ANOVA that involves dividing the total variance in a dataset into different components to determine whether the observed differences between groups are statistically significant.

**Process:**

1. **Total Sum of Squares (SST):** Represents the total variability of all scores around the grand mean (mean of all scores).
2. **Between-Groups Sum of Squares (SSB):** Measures the variability attributed to differences between the group means.
3. **Within-Groups Sum of Squares (SSW):** Captures the variability within each group, often considered as "error" or "unexplained" variance.

**Relationship:**

- SST = SSB + SSW
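
Written out for a one-way design, the three sums of squares take the following form. The notation is introduced here for illustration and is not taken from the text above: $k$ groups, $n_j$ observations in group $j$, $x_{ij}$ the $i$-th score in group $j$, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the grand mean.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x} \right)^2 \\
SSB &= \sum_{j=1}^{k} n_j \left( \bar{x}_j - \bar{x} \right)^2 \\
SSW &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x}_j \right)^2 \\
SST &= SSB + SSW
\end{aligned}
```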

**Importance:**

1. **F-Test:** ANOVA's F-statistic is the ratio of the between-groups mean square (MSB = SSB / (k - 1)) to the within-groups mean square (MSW = SSW / (N - k)), where k is the number of groups and N the total number of observations. A large F-value indicates that the differences between group means are substantial relative to the variability within groups, suggesting a significant effect of the independent variable.
2. **Understanding Sources of Variation:** Partitioning variance helps identify whether differences in scores are due to the independent variable (group membership) or random error.
3. **Statistical Power:** Separating the variability explained by the independent variable from the within-group (error) variability yields a single overall test of all group means. The smaller the error variability relative to the between-group variability, the easier it is to detect an effect.
4. **Effect Size Calculations:** Partitioned variance components are used to calculate effect sizes like eta-squared (SSB / SST), providing a standardized measure of the strength of the relationship between variables.
5. **Model Assumptions:** The within-group deviations produced by the partition are the residuals on which assumption checks operate, for example comparing within-group variances to assess homogeneity of variances.

**In essence, understanding partitioning of variance is essential for:**

- **Assessing the significance of group differences in ANOVA.**
- **Interpreting ANOVA results meaningfully.**
- **Evaluating statistical power and effect sizes.**
- **Diagnosing potential violations of model assumptions.**
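
The link between the partition, the F-test, and eta-squared can be sketched in a few lines. The three samples below are hypothetical values chosen for illustration; the hand-computed F is cross-checked against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Hypothetical scores for three groups (illustrative values only)
samples = [
    np.array([4.1, 5.0, 4.6, 5.3]),
    np.array([5.9, 6.4, 6.1, 5.6]),
    np.array([7.2, 6.8, 7.5, 7.0]),
]

all_values = np.concatenate(samples)
grand_mean = all_values.mean()
k, n_total = len(samples), all_values.size

# Partition the total variability
ssb = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ssw = sum(((s - s.mean()) ** 2).sum() for s in samples)
sst = ((all_values - grand_mean) ** 2).sum()

# Mean squares, F-statistic, p-value, and eta-squared
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_stat = msb / msw
p_value = f.sf(f_stat, k - 1, n_total - k)
eta_squared = ssb / sst

print(f"F = {f_stat:.4f}, p = {p_value:.6f}, eta^2 = {eta_squared:.4f}")
print(f"scipy f_oneway F = {f_oneway(*samples).statistic:.4f}")
```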


## Q(4)

In [4]:
import numpy as np

# Example data (NumPy arrays, so the arithmetic below is element-wise)
group_a = np.array([75, 80, 85, 78, 82])
group_b = np.array([68, 72, 65, 74, 70])
group_c = np.array([88, 85, 90, 92, 87])

# Combine all groups into a single list
all_data = np.concatenate([group_a, group_b, group_c])

# Calculate overall mean (X̄total)
overall_mean = np.mean(all_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean)**2)

# Calculate group means
mean_a = np.mean(group_a)
mean_b = np.mean(group_b)
mean_c = np.mean(group_c)

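# Calculate Explained Sum of Squares (SSE): variability between the group means
# (this is the between-groups sum of squares, SSB, from Q(3))
sse = (
    len(group_a) * (mean_a - overall_mean)**2
    + len(group_b) * (mean_b - overall_mean)**2
    + len(group_c) * (mean_c - overall_mean)**2
)

# Calculate Residual Sum of Squares (SSR): variability within each group
# (this is the within-groups sum of squares, SSW, from Q(3))
ssr = np.sum((group_a - mean_a)**2) + np.sum((group_b - mean_b)**2) + np.sum((group_c - mean_c)**2)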

# Verify that SST equals the sum of SSE and SSR
assert np.isclose(sst, sse + ssr)

print(f"Total Sum of Squares (SST): {sst:.4f}")
print(f"Explained Sum of Squares (SSE): {sse:.4f}")
print(f"Residual Sum of Squares (SSR): {ssr:.4f}")


Total Sum of Squares (SST): 1003.6000
Explained Sum of Squares (SSE): 867.6000
Residual Sum of Squares (SSR): 136.0000


## Q(5)

In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {
    'Y': [10, 12, 14, 15, 8, 11, 13, 16, 9, 12, 18, 20, 16, 14, 22, 24],
    'FactorA': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
    'FactorB': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y']
}

df = pd.DataFrame(data)

formula = 'Y ~ FactorA + FactorB + FactorA:FactorB'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Mean square for each term: its sum of squares divided by its degrees of freedom.
# These are the numerators of the F-tests; the F and PR(>F) columns of anova_table
# show whether each main effect and the interaction are statistically significant.
mean_sq_A = anova_table['sum_sq']['FactorA'] / anova_table['df']['FactorA']
mean_sq_B = anova_table['sum_sq']['FactorB'] / anova_table['df']['FactorB']
mean_sq_AB = anova_table['sum_sq']['FactorA:FactorB'] / anova_table['df']['FactorA:FactorB']

print(f"Mean Square, Factor A: {mean_sq_A:.4f}")
print(f"Mean Square, Factor B: {mean_sq_B:.4f}")
print(f"Mean Square, Interaction: {mean_sq_AB:.4f}")


Mean Square, Factor A: 39.4167
Mean Square, Factor B: 156.2500
Mean Square, Interaction: 5.7500
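
The effects themselves can also be read as deviations of marginal and cell means from the grand mean. Here is a minimal NumPy sketch that reuses the Y values from the data above, arranged as a (FactorA level, FactorB level, replicate) array. The array layout and variable names are assumptions made for illustration, and the approach relies on the design being balanced.

```python
import numpy as np

# Y values from the example, indexed as [FactorA level, FactorB level, replicate]
y = np.array([
    [[10, 12], [14, 15]],   # FactorA = A: (X, Y)
    [[8, 11], [13, 16]],    # FactorA = B
    [[9, 12], [18, 20]],    # FactorA = C
    [[16, 14], [22, 24]],   # FactorA = D
], dtype=float)
n_a, n_b, n_rep = y.shape

grand_mean = y.mean()
cell_means = y.mean(axis=2)

# Main effects: marginal mean of each level minus the grand mean
a_effects = cell_means.mean(axis=1) - grand_mean
b_effects = cell_means.mean(axis=0) - grand_mean

# Interaction: what remains of each cell mean after the additive part is removed
interaction = cell_means - grand_mean - a_effects[:, None] - b_effects[None, :]

# Sums of squares for a balanced design
ss_a = n_b * n_rep * np.sum(a_effects ** 2)
ss_b = n_a * n_rep * np.sum(b_effects ** 2)
ss_ab = n_rep * np.sum(interaction ** 2)

print("FactorA effects:", a_effects)
print("FactorB effects:", b_effects)
print("Interaction effects:\n", interaction)
print(f"SS_A = {ss_a:.2f}, SS_B = {ss_b:.2f}, SS_AB = {ss_ab:.2f}")
```

Dividing each sum of squares by its degrees of freedom (3, 1, and 3 here) reproduces the three values printed above.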


## Q(6)

1. Significant Difference:

The obtained p-value of 0.02 is less than the conventional significance level of 0.05. This means we have enough evidence to reject the null hypothesis that all group means are equal.
We can conclude that there is a statistically significant difference between at least two of the group means.

2. Strength of Evidence:

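The F-statistic of 5.23 means that the variability between the group means (between-groups mean square) is about five times the variability within the groups (within-groups mean square). Note that F is a test statistic, not an effect size: its magnitude depends on the sample size and degrees of freedom, so it cannot by itself be read as a moderate or strong effect.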

3. Additional Information:

ANOVA does not specify which particular groups differ significantly from each other.
To identify the specific group differences, post-hoc tests such as Tukey's HSD or pairwise t-tests would be necessary.

4. Assumptions:

Before fully accepting these conclusions, it's essential to ensure that the assumptions of ANOVA (normality, homogeneity of variances, and independence of observations) have been met. Violations of these assumptions could impact the validity of the results.

5. Practical Significance:

While statistical significance is important, consider the practical significance of the differences as well. Assess the effect size using measures like eta-squared for the overall effect, or Cohen's d for individual pairwise comparisons, to determine the practical relevance of the findings.

Interpretation:

The results suggest that the independent variable (grouping factor) has a statistically significant effect on the dependent variable.
At least one group mean differs from the others; whether the differences are large enough to matter in practice depends on the effect size.
Further investigation using post-hoc tests can pinpoint the exact pairs of groups that exhibit significant differences.
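
To illustrate why the F-statistic alone does not measure effect size, eta-squared for a one-way ANOVA can be recovered from F and its degrees of freedom. The degrees of freedom below are hypothetical, since they are not stated above; the helper name `eta_squared_from_f` is likewise introduced only for this sketch.

```python
def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta-squared (SSB / SST) for a one-way ANOVA, derived from F and its df."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Same F = 5.23 in two hypothetical three-group designs
small_study = eta_squared_from_f(5.23, df_between=2, df_within=27)    # 30 observations
large_study = eta_squared_from_f(5.23, df_between=2, df_within=297)   # 300 observations

print(f"eta^2 with 30 observations:  {small_study:.3f}")
print(f"eta^2 with 300 observations: {large_study:.3f}")
```

The same F corresponds to a much smaller share of explained variance in the larger study, which is why an effect size should be reported alongside F and the p-value.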

## Q(7)

**Here's how to handle missing data in repeated measures ANOVA, along with the potential consequences of different methods:**

**Methods for Handling Missing Data:**

1. **Complete Case Analysis (Listwise Deletion):**
   - Removes any participant with missing data for any time point.
   - Consequences:
     - Can lead to substantial loss of data and reduced power, especially when a large share of observations is missing.
     - May introduce bias if missingness is not random.

2. **Pairwise Deletion:**
   - Uses all available data for each comparison, excluding only participants with missing data for the specific time points being compared.
   - Consequences:
     - Can lead to different sample sizes for different comparisons, complicating interpretation.
     - May still introduce bias if missingness is not random.

3. **Mean Imputation:**
   - Replaces missing values with the mean of the observed scores for that time point.
   - Consequences:
     - Can underestimate variability and distort relationships between variables.
     - May not be appropriate if missingness patterns are complex.

4. **Last Observation Carried Forward (LOCF):**
   - Imputes missing values with the last observed non-missing value for that participant.
   - Consequences:
     - Assumes no change over time, which may not be realistic.
     - Can lead to biased results if missingness is related to change over time.

5. **Mixed Models (Mixed Effects Models):**
   - A sophisticated statistical approach that can handle missing data more flexibly by incorporating all available data and modeling the covariance structure of the repeated measures.
   - Consequences:
     - Requires more complex statistical expertise and software.
     - Assumptions about the missing data mechanism need to be carefully considered.
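
The simpler methods above (listwise deletion, mean imputation, LOCF) can be sketched with NumPy. The score matrix is hypothetical, and the sketch only shows the mechanics; it does not endorse any of these methods over mixed models or multiple imputation.

```python
import numpy as np

# Hypothetical scores: 6 participants (rows) x 3 time points (columns); NaN = missing
scores = np.array([
    [10.0, 12.0, 14.0],
    [9.0, np.nan, 13.0],
    [11.0, 13.0, 15.0],
    [8.0, 10.0, np.nan],
    [12.0, 14.0, 16.0],
    [10.0, 11.0, 13.0],
])

# Listwise deletion: keep only participants with no missing value
complete_cases = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: fill each gap with the observed mean of that time point
time_point_means = np.nanmean(scores, axis=0)
mean_imputed = np.where(np.isnan(scores), time_point_means, scores)

# LOCF: carry each participant's previous value forward
# (assumes the first time point is always observed)
locf = scores.copy()
for t in range(1, locf.shape[1]):
    gap = np.isnan(locf[:, t])
    locf[gap, t] = locf[gap, t - 1]

print("Participants kept by listwise deletion:", complete_cases.shape[0])
print("Variance at time 2, observed only:", np.nanvar(scores[:, 1], ddof=1))
print("Variance at time 2, mean-imputed: ", mean_imputed[:, 1].var(ddof=1))
```

The last two lines show the variance shrinkage mentioned under mean imputation: the imputed values sit exactly at the mean, so they add observations without adding spread.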

**Recommendations:**

- **Prevention:** Design studies to minimize missing data, ensuring complete and accurate data collection.
- **Mechanism:** Understand the reasons for missingness to choose appropriate methods.
- **Multiple Imputation:** Consider multiple imputation techniques, which create multiple plausible datasets with imputed values and combine results for more robust inferences.
- **Mixed Models:** Explore mixed models for their flexibility in handling missing data and accounting for correlations between repeated measures.

**Key Considerations:**

- **Assumptions:** Each method has assumptions about the missing data mechanism (random or non-random) that should be assessed.
- **Bias:** Inappropriate methods can lead to biased results and incorrect conclusions.
- **Power:** Missing data can reduce statistical power, making it harder to detect significant effects.
- **Interpretation:** Carefully consider the impact of missing data handling on the interpretation of results.

**Consult with a statistician for guidance on the most appropriate method based on the specific study design and missing data patterns.**


## Q(8)

**Here are some common post-hoc tests used after ANOVA, along with their uses and an example:**

**1. Tukey's Honest Significant Difference (HSD):**

- **When to use:** For comparing all possible pairs of means. The original procedure assumes a balanced design (equal sample sizes in each group); the Tukey-Kramer variant extends it to unequal group sizes.
- **Features:** Controls the family-wise error rate (FWER), ensuring the overall probability of making at least one Type I error among all comparisons remains at the desired level (usually 0.05).
- **Example:** Comparing the effects of four different fertilizers on plant growth. If ANOVA indicates a significant difference, Tukey's HSD would determine which specific fertilizers lead to significantly different growth rates.

**2. Bonferroni Correction:**

- **When to use:** For a small number of planned comparisons, or when a simple, conservative adjustment is desired. It can be applied to any set of tests, whatever the group sizes.
- **Features:** Tests each of the m comparisons at α/m (equivalently, multiplies each p-value by m) to control the FWER.
- **Example:** Comparing the effectiveness of three teaching methods on student test scores with three pairwise t-tests, each evaluated at 0.05 / 3 ≈ 0.0167.

**3. Scheffé's Test:**

- **When to use:** For more complex comparisons beyond pairwise, such as comparing combinations of means or linear combinations of means.
- **Features:** Very conservative, controls the FWER for any possible comparison.
- **Example:** Comparing the effects of two drugs and a placebo on blood pressure, where you might want to compare the average effect of the drugs to the placebo, or compare the difference between the two drugs.

**4. Dunnett's Test:**

- **When to use:** For comparing multiple treatment groups to a single control group.
- **Features:** More powerful than Bonferroni or Scheffé when only comparing to a control.
- **Example:** Comparing the effectiveness of three new medications to a standard treatment for a disease. Dunnett's test would focus on identifying treatments that are significantly better than the control.

**5. Games-Howell Test:**

- **When to use:** For pairwise comparisons when variances are unequal (heteroscedasticity).
- **Features:** Robust to violations of homogeneity of variance assumption.
- **Example:** Comparing the salaries of employees in different job categories with varying levels of experience, where salary distributions might have different variances.

**Remember:**

- Post-hoc tests are typically conducted only if the overall ANOVA test is significant, indicating at least one difference among the means.
- The choice of post-hoc test depends on the specific research question, design, and assumptions of the data.
- It's essential to consider both statistical significance and practical significance when interpreting the results of post-hoc tests.
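
As one concrete case, the Bonferroni correction can be sketched by hand on top of pairwise t-tests. The four fertilizer groups are simulated and their names and parameters are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Hypothetical plant growth (cm) under four fertilizers, 15 plants each
groups = {
    "F1": rng.normal(20, 3, 15),
    "F2": rng.normal(22, 3, 15),
    "F3": rng.normal(25, 3, 15),
    "F4": rng.normal(21, 3, 15),
}

pairs = list(combinations(groups, 2))
m = len(pairs)  # number of pairwise comparisons

results = {}
for a, b in pairs:
    p_raw = ttest_ind(groups[a], groups[b]).pvalue
    # Bonferroni: multiply each p-value by the number of comparisons, capped at 1
    results[(a, b)] = (p_raw, min(1.0, p_raw * m))

for (a, b), (p_raw, p_adj) in results.items():
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, Bonferroni p = {p_adj:.4f}")
```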


## Q(9)

In [8]:
import numpy as np
from scipy.stats import f_oneway

# Example data
np.random.seed(42)  # for reproducibility
weight_loss_A = np.random.normal(5, 1, 50)  # mean=5, std=1
weight_loss_B = np.random.normal(6, 1, 50)  # mean=6, std=1
weight_loss_C = np.random.normal(7, 1, 50)  # mean=7, std=1

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

# Report results
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
if p_value < 0.05:
    print("There is significant evidence to reject the null hypothesis.")
else:
    print("There is not enough evidence to reject the null hypothesis.")


F-statistic: 67.6185
P-value: 0.0000
There is significant evidence to reject the null hypothesis.


## Q(10)

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
np.random.seed(42)  # for reproducibility

# Simulating data with two factors (Software Programs and Experience Level)
data = {
    'Time': np.random.normal(loc=10, scale=2, size=90),
    'Program': np.repeat(['A', 'B', 'C'], 30),
    'Experience': np.tile(['Novice', 'Experienced'], 45),
}

df = pd.DataFrame(data)

# Fit two-way ANOVA model
formula = 'Time ~ Program + Experience + Program:Experience'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Report results
print(anova_table)


                        sum_sq    df         F    PR(>F)
Program               2.514772   2.0  0.344485  0.709581
Experience            0.479063   1.0  0.131248  0.718051
Program:Experience    1.592393   2.0  0.218133  0.804472
Residual            306.603758  84.0       NaN       NaN


## Q(11)

In [14]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated data
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(70, 10, 100)  # mean=70, std=10
experimental_group = np.random.normal(75, 10, 100)  # mean=75, std=10

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Report results of the t-test
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Check for significance
if p_value < 0.05:
    print("There is a significant difference in test scores between the two groups.")
else:
    print("There is no significant difference in test scores between the two groups.")

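# Follow up with a post-hoc test (Tukey's HSD) if the results are significant.
# Note: with only two groups there is a single pairwise comparison, so Tukey's HSD
# adds nothing beyond the t-test; it becomes informative with three or more groups.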
if p_value < 0.05:
    # Combine data and labels
    all_scores = np.concatenate([control_group, experimental_group])
    group_labels = ['Control'] * 100 + ['Experimental'] * 100

    # Perform post-hoc test
    results = pairwise_tukeyhsd(all_scores, group_labels)
    print(results)


T-statistic: -4.7547
P-value: 0.0000
There is a significant difference in test scores between the two groups.
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2615   0.0 3.6645 8.8585   True
--------------------------------------------------------


## Q(12)

In [15]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated data
np.random.seed(42)  # for reproducibility

# Generate random sales data for three stores across 30 days
data = {
    'Day': np.repeat(range(1, 31), 3),
    'Store': np.tile(['Store A', 'Store B', 'Store C'], 30),
    'Sales': np.random.normal(loc=100, scale=20, size=90),  # mean=100, std=20
}

df = pd.DataFrame(data)

# Convert the 'Day' column to categorical
df['Day'] = pd.Categorical(df['Day'])

# Fit repeated measures ANOVA model: each Day is the repeated unit (subject)
# and Store is the within-subject factor
rm_anova = AnovaRM(df, 'Sales', 'Day', within=['Store'])
results = rm_anova.fit()

# Report results of repeated measures ANOVA
print(results)

# Follow up with post-hoc test (Tukey's HSD) if the results are significant.
# Note: Tukey's HSD treats the stores as independent samples and ignores the pairing by Day.
if results.anova_table['Pr > F']['Store'] < 0.05:
    posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print(posthoc)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  0.1808 2.0000 58.0000 0.8350
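
Tukey's HSD treats the three stores as independent samples. One pairing-aware alternative for a repeated measures follow-up is a set of paired t-tests with a Bonferroni correction. The sketch below uses freshly simulated sales (not the data frame above), and the store names and parameters are illustrative assumptions.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(3)
n_days = 30

# Hypothetical daily sales: one value per day for each store, aligned by day
sales = {
    "Store A": rng.normal(100, 20, n_days),
    "Store B": rng.normal(100, 20, n_days),
    "Store C": rng.normal(100, 20, n_days),
}

pairs = list(combinations(sales, 2))
adjusted = {}
for s1, s2 in pairs:
    # Paired test: compares the two stores day by day
    p_raw = ttest_rel(sales[s1], sales[s2]).pvalue
    # Bonferroni adjustment for the number of store pairs
    adjusted[(s1, s2)] = min(1.0, p_raw * len(pairs))

for (s1, s2), p_adj in adjusted.items():
    print(f"{s1} vs {s2}: Bonferroni-adjusted p = {p_adj:.4f}")
```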

