Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Analysis of Variance (ANOVA) is a statistical method used to compare means among multiple groups.
To use the ANOVA test, we make the following assumptions:
-> Each group sample is drawn from a normally distributed population.
-> All populations have a common variance.
-> All samples are drawn independently of each other.
-> Within each sample, the observations are sampled randomly and independently of each other.
-> Factor effects are additive.


Each group sample is drawn from a normally distributed population:

Violation Example: If the data within a group are not normally distributed, ANOVA results may be affected. For instance, a group with a strongly skewed or heavy-tailed distribution violates the normality assumption, and the effect is most serious when sample sizes are small.

All populations have a common variance:

Violation Example: Heterogeneous variances among groups violate this assumption. For instance, if one group has a much larger variance than the others, the F-test can reject too often or too rarely, especially when group sizes are unequal.

All samples are drawn independently of each other:

Violation Example: If the same subjects contribute observations to more than one group (e.g., repeated measures taken on the same subject over time), the samples are not independent of each other and a standard one-way ANOVA is not appropriate.

Within each sample, the observations are sampled randomly and independently of each other:

Violation Example: If the sampling within groups is not random, or if observations within a group are clustered, the assumption is violated. For instance, if certain subjects are more likely to be included in a specific group, or if several measurements come from the same classroom or family, the observations are no longer a random, independent sample.

Factors' effects are additive:

Violation Example: If there are interaction effects between factors, meaning the combined effect of two factors is not simply the sum of their individual effects, the additivity assumption is violated. This occurs when the effect of one factor depends on the level of another, and it complicates the interpretation of main effects.
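The normality and equal-variance assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming three small made-up groups: it uses `scipy.stats.shapiro` (Shapiro-Wilk) within each group and `scipy.stats.levene` across groups. The data values and the 0.05 threshold are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy import stats

# Hypothetical example data (illustrative values only)
groups = {
    "A": np.array([12.1, 14.3, 15.8, 18.2, 20.5, 16.4]),
    "B": np.array([8.4, 10.2, 12.9, 13.7, 16.1, 11.5]),
    "C": np.array([5.3, 7.8, 9.1, 11.6, 13.2, 8.9]),
}

# Shapiro-Wilk test per group: a small p-value suggests non-normality
normality_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in normality_p.items():
    verdict = "looks non-normal" if p < 0.05 else "no evidence against normality"
    print(f"Group {name}: Shapiro-Wilk p = {p:.3f} ({verdict})")
print(f"Levene p = {levene_p:.3f}")
```

With very small samples these tests have little power, so plots (histograms, Q-Q plots) are a useful complement.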

Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are:
i) One-Way ANOVA
ii) Repeated Measures ANOVA
iii) Factorial ANOVA

Situations when each could be used are: 

One-Way ANOVA:

Use Case: One-Way ANOVA is used when there is one independent variable with more than two levels or groups, and the researcher wants to determine whether there are any statistically significant differences in the means of these groups. It is often applied to compare means across different categories or levels of a single factor.

Example: Comparing the mean test scores of students from three different teaching methods (Method A, Method B, Method C) to determine if there is a significant difference in performance.

Repeated Measures ANOVA:

Use Case: Repeated Measures ANOVA is employed when the same subjects are used for each treatment or condition, and measurements are taken at multiple time points or under different conditions. It is particularly useful for studying changes within subjects over time or across different experimental conditions.

Example: Assessing the effect of a drug treatment on blood pressure by measuring blood pressure before treatment, during treatment, and after treatment for each participant.

Factorial ANOVA:

Use Case: Factorial ANOVA is used when there are two or more independent variables (factors), and the researcher wants to examine their main effects and interactions on the dependent variable. This type of ANOVA is beneficial for exploring how different factors, both independently and in combination, influence the outcome.

Example: Investigating the impact of both gender (Male/Female) and treatment type (A/B/C) on exam scores to examine not only the main effects of gender and treatment but also the interaction effect between gender and treatment.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA involves breaking down the total variability observed in the data into different components, specifically between-group variance and within-group variance. Here's an elaboration on the concept and its significance:

Between-Group Variance:

This component measures the extent to which the group means differ from each other. If the between-group variance is large relative to within-group variance, it suggests that there are significant differences among the group means. This is essential for understanding whether the independent variable (or factors) has a significant impact on the dependent variable.

Within-Group Variance:

Within-group variance represents the variability of individual data points within each group. It accounts for random variability and individual differences within groups. Understanding this component is crucial for capturing the natural variability that occurs within a group, which may be due to measurement error, individual differences, or other random factors.

Understanding the partitioning of variance is important for several reasons:

Hypothesis Testing:

ANOVA uses the partitioning of variance to test whether the observed differences among group means are statistically significant. If the between-group variance is significantly larger than the within-group variance, it indicates that the groups are likely different from each other.

Effectiveness of the Model:

Assessing the partitioning of variance helps researchers evaluate how well the model explains the observed data. A large between-group variance suggests that the independent variable(s) has a substantial impact on the dependent variable.

Identifying Sources of Variability:

Understanding the breakdown of variance helps identify the sources of variability in the data. This is crucial for drawing meaningful conclusions about the factors influencing the dependent variable.

Interpretation of F-ratio:

The ratio of between-group variance to within-group variance (F-ratio) is used to determine the statistical significance of group differences. A high F-ratio suggests that the group means are not equal, and the observed differences are likely not due to chance.
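The partition can be written compactly for a one-way design. The notation here is an assumption introduced for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{x}_j$, and grand mean $\bar{x}$.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

Each sum of squares is divided by its degrees of freedom to give a mean square, and the F-ratio compares the two mean squares.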

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import scipy.stats as stats

# Example data (replace this with your actual data)
group1 = np.array([12, 14, 16, 18, 20])
group2 = np.array([8, 10, 12, 14, 16])
group3 = np.array([5, 7, 9, 11, 13])

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Number of observations per group (all three groups have the same size here)
n = len(group1)

# Calculate the mean of the entire dataset
overall_mean = np.mean(data)

# Calculate total sum of squares (SST)
sst = np.sum((data - overall_mean)**2)

# Calculate group means
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])

# Calculate explained sum of squares (SSE)
sse = np.sum(n * (group_means - overall_mean)**2)

# Calculate residual sum of squares (SSR)
ssr = sst - sse

# Display the results
print(f"Total Sum of Squares (SST): {sst}")
print(f"Explained Sum of Squares (SSE): {sse}")
print(f"Residual Sum of Squares (SSR): {ssr}")


Total Sum of Squares (SST): 243.33333333333334
Explained Sum of Squares (SSE): 123.33333333333333
Residual Sum of Squares (SSR): 120.00000000000001
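The sums of squares can be turned into an F-statistic and cross-checked against `scipy.stats.f_oneway`. This is a minimal sketch that reuses the example data from the cell above; the variable names (`ss_between`, `ss_within`, and so on) are illustrative choices, and the residual sum of squares is computed directly from the within-group deviations rather than by subtraction.

```python
import numpy as np
from scipy import stats

# Same example data as in the cell above
group1 = np.array([12, 14, 16, 18, 20])
group2 = np.array([8, 10, 12, 14, 16])
group3 = np.array([5, 7, 9, 11, 13])
groups = [group1, group2, group3]

data = np.concatenate(groups)
k = len(groups)          # number of groups
N = data.size            # total number of observations
grand_mean = data.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, N - k)  # upper-tail probability

# Cross-check against SciPy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

Computing the within-group term directly also works when group sizes differ, which the single `n` in the cell above does not handle.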


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [3]:
from scipy.stats import f_oneway
import pandas as pd 
# Create a sample dataset
data = {'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'B': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'Value': [5, 8, 7, 10, 15, 12, 6, 9, 11]}

df = pd.DataFrame(data)

# Extract unique levels for factors A and B
levels_A = df['A'].unique()
levels_B = df['B'].unique()

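# NOTE: f_oneway runs separate one-way ANOVAs, so the tests for A and B below are
# marginal tests that ignore the other factor, not a true two-way ANOVA.
# With only one observation per (A, B) cell there is no within-cell variance,
# so the per-cell test used for the interaction returns nan.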
result_A = f_oneway(*[df['Value'][df['A'] == level] for level in levels_A])
result_B = f_oneway(*[df['Value'][df['B'] == level] for level in levels_B])
result_interaction = f_oneway(*[df['Value'][(df['A'] == level_A) & (df['B'] == level_B)] 
                                for level_A in levels_A for level_B in levels_B])

# Extract p-values for main effects and interaction effect
p_value_A = result_A.pvalue
p_value_B = result_B.pvalue
p_value_interaction = result_interaction.pvalue

# Output results
print(f"Main Effect A p-value: {p_value_A}")
print(f"Main Effect B p-value: {p_value_B}")
print(f"Interaction Effect p-value: {p_value_interaction}")


Main Effect A p-value: 0.0536231380568786
Main Effect B p-value: 0.3613861142296907
Interaction Effect p-value: nan




Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In a one-way ANOVA, the F-statistic is used to test whether there are significant differences among the means of three or more independent (unrelated) groups. The associated p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis, which assumes that all group means are equal, were true.

In your case:

- F-statistic: 5.23
- p-value: 0.02

### Interpretation:

1. **F-Statistic:**
   - The F-statistic measures the ratio of the variance between groups to the variance within groups. A higher F-statistic suggests greater variability between group means relative to within-group variability.

2. **P-Value:**
   - The p-value of 0.02 is below the commonly used significance level of 0.05. This indicates that there is evidence to reject the null hypothesis.

### Conclusion:

Given the obtained p-value of 0.02, you can conclude that there are statistically significant differences among the group means. In other words, at least one group mean differs from at least one of the others.

### Interpretation of the Results:

- **Rejection of the Null Hypothesis:**
  - With a p-value of 0.02, you can reject the null hypothesis that all group means are equal.

- **Practical Significance:**
  - While statistical significance suggests differences between groups, it's also important to consider the practical significance of these differences. A small p-value doesn't necessarily imply a large or practically significant difference.

- **Post-hoc Analysis:**
  - If you have more than two groups, further post-hoc tests (e.g., Tukey's HSD, Bonferroni) may be conducted to identify which specific groups differ from each other.

- **Effect Size:**
  - Consider examining effect size measures (e.g., eta-squared or omega-squared for the overall effect, or Cohen's d for a specific pair of groups) to quantify the magnitude of the observed differences.

In summary, you have evidence to suggest that there are significant differences among the group means based on the obtained F-statistic and p-value. Further analyses and exploration may be needed to understand the nature and implications of these differences.
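The link between an F-statistic and its p-value can be reproduced with `scipy.stats.f`. The question does not state the degrees of freedom, so the sketch below assumes 3 groups and 15 observations in total (hypothetical values chosen for illustration); with those assumptions the p-value comes out at roughly 0.02. It also derives eta-squared from F and the degrees of freedom.

```python
from scipy import stats

f_stat = 5.23
df_between = 2    # assumption: 3 groups, so k - 1 = 2
df_within = 12    # assumption: 15 observations in total, so N - k = 12

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_stat, df_between, df_within)

# Eta-squared recovered from F: SS_between / (SS_between + SS_within)
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"eta-squared: {eta_squared:.3f}")
```

Different degrees of freedom would give a different p-value for the same F, which is why both should be reported together.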

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is crucial for obtaining accurate and reliable results. The appropriate method for handling missing data depends on the nature of the missingness and the assumptions underlying the analysis. Here are some common approaches and their potential consequences:

1. **Complete Case Analysis (CCA):**
   - **Method:** Exclude cases with missing data from the analysis.
   - **Consequences:**
     - Reduces sample size.
     - May introduce bias if missing data is not completely random (i.e., if there's a systematic reason for missingness).

2. **Pairwise Deletion:**
   - **Method:** Analyze each pair of variables with available data.
   - **Consequences:**
     - Utilizes all available data but may lead to biased results if missing data is related to the outcome.

3. **Mean Imputation:**
   - **Method:** Replace missing values with the mean of observed values for the variable.
   - **Consequences:**
     - Preserves sample size but underestimates variability and may distort relationships if missingness is not completely random.

4. **Last Observation Carried Forward (LOCF):**
   - **Method:** Use the last available measurement for missing values.
   - **Consequences:**
     - May not accurately represent changes over time, especially if missingness is related to changes in the variable.

5. **Linear Interpolation:**
   - **Method:** Estimate missing values based on the observed values before and after the missing point.
   - **Consequences:**
     - Assumes a linear trend, which may not be appropriate in all cases.

6. **Multiple Imputation:**
   - **Method:** Generate multiple datasets with imputed values and combine results.
   - **Consequences:**
     - Preserves variability and provides more accurate estimates if assumptions of missing data mechanism are met. However, it requires more complex statistical procedures.

7. **Model-Based Imputation:**
   - **Method:** Impute missing values using a model (e.g., regression).
   - **Consequences:**
     - Can provide accurate estimates if the imputation model is correctly specified. However, model misspecification can introduce bias.

### Considerations:

- **Missing Data Mechanism:**
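  - Understanding the mechanism of missingness is essential. Complete case analysis and pairwise deletion are unbiased only when data are missing completely at random (MCAR). If data are missing at random (MAR), multiple imputation or model-based approaches are usually more appropriate; if data are missing not at random (MNAR), no standard method fully removes the bias.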

- **Sensitivity Analysis:**
  - Perform sensitivity analyses using different imputation methods to assess the robustness of results.

- **Consult Guidelines:**
  - Follow guidelines and recommendations in the literature or statistical software documentation for handling missing data in the specific context of repeated measures ANOVA.

- **Imputation Software:**
  - Various statistical software packages (e.g., R, Python with libraries like `pandas` and `scikit-learn`) provide functions for implementing imputation techniques.

Choosing the appropriate method requires careful consideration of the assumptions, potential biases, and the nature of the missing data. Multiple imputation is generally recommended when possible, as it accounts for uncertainty associated with imputed values and can provide more accurate results. However, it's important to ensure that the assumptions underlying imputation methods are met.
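Three of the simpler strategies above can be compared side by side in `pandas`. This is a minimal sketch on a small, made-up wide-format table (one row per subject, one column per time point); the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements for four subjects at three time points
wide = pd.DataFrame(
    {
        "t1": [120.0, 130.0, 125.0, 140.0],
        "t2": [118.0, np.nan, 121.0, 135.0],
        "t3": [115.0, 124.0, np.nan, 130.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Complete case analysis: drop every subject with any missing value
complete_cases = wide.dropna()

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward along the row
locf = wide.ffill(axis=1)

print(f"Subjects kept by complete case analysis: {len(complete_cases)} of {len(wide)}")
print(f"Variance of t2, observed values only: {wide['t2'].var():.2f}")
print(f"Variance of t2 after mean imputation: {mean_imputed['t2'].var():.2f}")
print(locf)
```

The drop in variance after mean imputation illustrates why that method understates variability.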

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are conducted after an analysis of variance (ANOVA) to further explore differences between specific groups when the overall ANOVA indicates a significant effect. Some common post-hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, Scheffé test, and Dunnett's test. The choice of post-hoc test depends on factors such as the assumptions, sample size, and research question. Here's an overview of these tests and when you might use each one:

1. **Tukey's Honestly Significant Difference (HSD):**
   - **When to use:**
     - Useful when you have more than two groups and you want to compare all possible pairs.
   - **Example:**
     - In a study comparing the mean scores of three different teaching methods, if the overall ANOVA indicates a significant difference, Tukey's HSD can be used to identify which specific pairs of teaching methods differ significantly.

2. **Bonferroni Correction:**
   - **When to use:**
     - Suitable when you have more than two groups and want to control the familywise error rate.
   - **Example:**
     - Suppose you are comparing mean scores of four different treatments. If the overall ANOVA is significant, and you want to conduct pairwise comparisons while controlling the overall Type I error rate, you might use Bonferroni correction.

3. **Scheffé Test:**
   - **When to use:**
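     - Appropriate when you want to test all possible contrasts, including complex ones (e.g., the average of two groups versus a third), while controlling the familywise error rate. For simple pairwise comparisons it is more conservative (less powerful) than Tukey's HSD.
   - **Example:**
     - If the overall ANOVA is significant and you want to compare the average of two teaching methods against a third, in addition to the pairwise comparisons, the Scheffé test controls the familywise error rate across all such contrasts.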

4. **Dunnett's Test:**
   - **When to use:**
     - Specifically designed for comparing multiple treatments to a control group.
   - **Example:**
     - In a clinical trial comparing the effectiveness of several drugs to a placebo, if the overall ANOVA is significant, Dunnett's test can be used to compare each drug group to the control (placebo) group.

### Example Scenario:

Let's consider an experiment where researchers are testing the impact of three different diets (A, B, and C) on weight loss. After conducting a one-way ANOVA, if the overall test indicates a significant difference among the diets, you might decide to perform post-hoc tests to identify specific pairs of diets that differ significantly in terms of weight loss.

- **Post-hoc Test Choice:**
  - If you want to compare all pairs of diets, you could use Tukey's HSD.
  - If you are particularly interested in comparing each diet to a control (e.g., a standard diet), Dunnett's test might be more appropriate.

Remember, the choice of post-hoc test should be based on your specific research question, the nature of your data, and the assumptions of the chosen test. Always consider the context and the goals of your analysis when selecting a post-hoc test.
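As an illustration of the Bonferroni approach, pairwise t-tests can be run for every pair of groups and their p-values adjusted with `statsmodels.stats.multitest.multipletests`. This is a minimal sketch on made-up weight-loss values for three diets; the data are hypothetical and the 0.05 level is an assumption.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical weight loss (kg) for three diets
diets = {
    "A": np.array([4.1, 5.3, 4.8, 6.0, 5.2, 4.6]),
    "B": np.array([6.9, 7.4, 8.1, 6.5, 7.8, 7.2]),
    "C": np.array([5.0, 6.2, 5.7, 6.6, 5.4, 6.1]),
}

# One t-test per pair of diets
pairs = list(combinations(diets, 2))
raw_p = [ttest_ind(diets[a], diets[b]).pvalue for a, b in pairs]

# Bonferroni: each p-value is multiplied by the number of comparisons (capped at 1)
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adjusted_p, reject):
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```

Bonferroni is simple and general but becomes conservative as the number of comparisons grows, which is one reason Tukey's HSD is often preferred for all-pairs comparisons.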

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [6]:
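import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Generate sample data
# NOTE: this simulates 50 participants per diet (150 in total), whereas the question
# describes 50 participants overall. No random seed is set, so the numbers printed
# below change from run to run.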
data = {'Diet': ['A']*50 + ['B']*50 + ['C']*50,
        'WeightLoss': [np.random.normal(loc=5, scale=2) for _ in range(50)] +
                      [np.random.normal(loc=7, scale=2) for _ in range(50)] +
                      [np.random.normal(loc=6, scale=2) for _ in range(50)]}

# Create DataFrame
df = pd.DataFrame(data)

# Perform one-way ANOVA
result = f_oneway(df['WeightLoss'][df['Diet'] == 'A'],
                  df['WeightLoss'][df['Diet'] == 'B'],
                  df['WeightLoss'][df['Diet'] == 'C'])

# Extract F-statistic and p-value
f_statistic = result.statistic
p_value = result.pvalue

# Output results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in mean weight loss between at least two diets.")
else:
    print("There is no significant difference in mean weight loss between the diets.")


F-statistic: 3.2675063109479887
P-value: 0.040886324427914844
There is a significant difference in mean weight loss between at least two diets.
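A variant closer to the question's wording uses 50 participants in total, randomly assigned to the three diets. This is a minimal sketch under assumptions: the group sizes (17, 17, 16), the true means, the standard deviation, and the seed are all hypothetical choices, and the simulated result will not match the output printed above.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)  # arbitrary seed, set only for reproducibility

# Randomly assign 50 participants to diets A, B, and C (17, 17, and 16 people)
diet_labels = rng.permutation(np.repeat(["A", "B", "C"], [17, 17, 16]))

# Assumed true mean weight loss per diet, with a common standard deviation of 2
true_means = {"A": 5.0, "B": 7.0, "C": 6.0}
weight_loss = np.array([rng.normal(true_means[d], 2.0) for d in diet_labels])

# One sample per diet, then the one-way ANOVA
samples = [weight_loss[diet_labels == d] for d in ("A", "B", "C")]
result = f_oneway(*samples)

print(f"F-statistic: {result.statistic:.3f}")
print(f"P-value: {result.pvalue:.4f}")
```

Fixing the seed makes the reported F-statistic and p-value reproducible, which matters when the interpretation is written up alongside the numbers.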


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [8]:
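import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate sample data
# NOTE: Software and Experience are not fully crossed in this simulated data: every
# Program A row is 'Novice' and every Program C row is 'Experienced'. With these
# empty cells the interaction model is rank-deficient, and statsmodels reports nan
# for the Experience main effect below, so the printed "no significant difference"
# message for Experience reflects the nan, not a test result.
# The data also has 90 rows, whereas the question describes 30 employees.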
data = {'Software': ['A']*30 + ['B']*30 + ['C']*30,
        'Experience': ['Novice']*45 + ['Experienced']*45,
        'CompletionTime': [np.random.normal(loc=20, scale=5) for _ in range(30)] +
                           [np.random.normal(loc=25, scale=5) for _ in range(30)] +
                           [np.random.normal(loc=22, scale=5) for _ in range(30)]}

# Create DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
formula = 'CompletionTime ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract F-statistics and p-values
f_statistic_software = anova_table['F']['C(Software)']
p_value_software = anova_table['PR(>F)']['C(Software)']

f_statistic_experience = anova_table['F']['C(Experience)']
p_value_experience = anova_table['PR(>F)']['C(Experience)']

f_statistic_interaction = anova_table['F']['C(Software):C(Experience)']
p_value_interaction = anova_table['PR(>F)']['C(Software):C(Experience)']

# Output results
print(f"F-statistic Software: {f_statistic_software}, p-value: {p_value_software}")
print(f"F-statistic Experience: {f_statistic_experience}, p-value: {p_value_experience}")
print(f"F-statistic Interaction: {f_statistic_interaction}, p-value: {p_value_interaction}")

# Interpretation
if p_value_software < 0.05:
    print("There is a significant difference in completion time between at least two software programs.")
else:
    print("There is no significant difference in completion time between the software programs.")

if p_value_experience < 0.05:
    print("There is a significant difference in completion time between novices and experienced users.")
else:
    print("There is no significant difference in completion time between novices and experienced users.")

if p_value_interaction < 0.05:
    print("There is a significant interaction effect between software programs and experience level.")
else:
    print("There is no significant interaction effect between software programs and experience level.")


F-statistic Software: 27.37306260085701, p-value: 1.1673147268134069e-06
F-statistic Experience: nan, p-value: nan
F-statistic Interaction: 3.7680603721319503, p-value: 0.05551514343466879
There is a significant difference in completion time between at least two software programs.
There is no significant difference in completion time between novices and experienced users.
There is no significant interaction effect between software programs and experience level.




Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [9]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(42)  # Setting seed for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Create DataFrame
data = pd.DataFrame({'Group': ['Control']*50 + ['Experimental']*50,
                     'TestScores': np.concatenate([control_group, experimental_group])})

# Conduct two-sample t-test
t_statistic, p_value = ttest_ind(data['TestScores'][data['Group'] == 'Control'],
                                 data['TestScores'][data['Group'] == 'Experimental'])

# Output t-test results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

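# Follow up with post-hoc test (Tukey's HSD)
# NOTE: with only two groups there is a single comparison, so this repeats the
# t-test result; post-hoc tests are informative when there are three or more groups.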
posthoc = pairwise_tukeyhsd(data['TestScores'], data['Group'], alpha=0.05)

# Display post-hoc results
print(posthoc)


T-statistic: -4.108723928204809
P-value: 8.261945608702611e-05
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [10]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate sample data
np.random.seed(42)  # Setting seed for reproducibility
store_a_sales = np.random.normal(loc=100, scale=20, size=30)
store_b_sales = np.random.normal(loc=110, scale=15, size=30)
store_c_sales = np.random.normal(loc=95, scale=25, size=30)

# Create DataFrame
data = pd.DataFrame({'Store': ['A']*30 + ['B']*30 + ['C']*30,
                     'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])})

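# Perform one-way ANOVA
# NOTE: this treats the three stores as independent groups. Because every store is
# measured on the same 30 days, a repeated measures ANOVA (with day as the subject)
# is the design the question describes.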
result = f_oneway(data['Sales'][data['Store'] == 'A'],
                  data['Sales'][data['Store'] == 'B'],
                  data['Sales'][data['Store'] == 'C'])

# Extract F-statistic and p-value
f_statistic = result.statistic
p_value = result.pvalue

# Output one-way ANOVA results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Follow up with post-hoc test (Tukey's HSD)
posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'], alpha=0.05)

# Display post-hoc results
print(posthoc)


F-statistic: 4.085968889697053
P-value: 0.02013491576502835
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  11.9455 0.0506   -0.025  23.916  False
     A      C  -0.9149 0.9819 -12.8854 11.0555  False
     B      C -12.8604 0.0322 -24.8309   -0.89   True
-----------------------------------------------------
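Since each store is observed on the same 30 days, the day can be treated as the "subject" in a repeated measures ANOVA. The following is a minimal sketch using `statsmodels.stats.anova.AnovaRM`; the simulated day effect, the store means, the noise level, and the seed are assumptions for illustration, so the numbers do not correspond to the output above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)  # arbitrary seed, set only for reproducibility
n_days = 30

# A shared day effect makes the three stores' sales correlated within a day
day_effect = rng.normal(0, 15, size=n_days)
store_means = {"A": 100.0, "B": 110.0, "C": 95.0}  # assumed true means

# Long format: one row per (day, store) combination
long_df = pd.DataFrame([
    {"Day": day, "Store": store, "Sales": mean + day_effect[day] + rng.normal(0, 10)}
    for day in range(n_days)
    for store, mean in store_means.items()
])

# Repeated measures ANOVA with Day as the subject and Store as the within factor
rm_result = AnovaRM(long_df, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_result.anova_table

print(rm_table)
```

`AnovaRM` requires a balanced design with exactly one observation per day and store. If the Store effect is significant, paired comparisons between stores (for example, paired t-tests with a Bonferroni adjustment) respect the repeated measures structure better than Tukey's HSD on independent groups.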
