## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.


### Assumptions in ANOVA

### Independence of Observations:
The observations within and between groups should be independent of each other. This means that the value of one observation should not be related to the value of another observation.

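Example of Violation: If the same subjects are measured under several conditions but the data are analysed as if the groups were separate, the observations are correlated, and any carryover effect from one condition to the next makes the dependence stronger. This violates the independence assumption; a repeated measures ANOVA is the appropriate model in that case.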

### Homogeneity of Variance:
The variances of the groups being compared should be roughly equal. In other words, the spread of data within each group should be approximately the same.

Example of Violation: If you are comparing the test scores of students from three different schools, and the variance of scores in one school is much larger than in the others, this violates the homogeneity of variance assumption.
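One common way to check this assumption is Levene's test, whose null hypothesis is that all groups have equal variances. The sketch below is only an illustration: the scores for the three schools are made-up values, and the variable names are my own.

```python
from scipy import stats

# Hypothetical test scores for three schools (illustrative values only)
school_a = [70, 72, 68, 71, 69, 73, 70, 72]
school_b = [65, 75, 60, 80, 55, 85, 62, 78]  # visibly more spread out
school_c = [71, 69, 72, 70, 68, 73, 71, 70]

# Levene's test: H0 = all groups have the same variance
levene_stat, levene_p = stats.levene(school_a, school_b, school_c)
print(f"Levene statistic = {levene_stat:.3f}, p-value = {levene_p:.4f}")

if levene_p < 0.05:
    print("Evidence of unequal variances: the homogeneity assumption is doubtful.")
else:
    print("No evidence against equal variances.")
```

A small p-value here suggests that the standard F-test should be interpreted with caution for these groups.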

### Normality:
The residuals (the differences between observed values and predicted values) for each group should follow a normal distribution. This assumption is particularly important for smaller sample sizes.

Example of Violation: If each group has only a handful of observations and the residuals in one group are strongly skewed or contain extreme outliers, the p-value from the F-test may be unreliable.
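A rough check of this assumption is to compute the residuals and run a Shapiro-Wilk test on them. The sketch below assumes a one-way design, where the predicted value for each observation is its group mean; the three small groups are hypothetical values chosen so that one group contains an extreme observation.

```python
import numpy as np
from scipy import stats

# Hypothetical small samples for three groups (illustrative values only)
groups = {
    "A": np.array([4.1, 3.9, 4.3, 4.0, 4.2, 3.8]),
    "B": np.array([5.0, 5.2, 4.9, 5.1, 5.3, 4.8]),
    "C": np.array([4.5, 4.6, 4.4, 4.7, 4.5, 9.0]),  # one extreme value
}

# In a one-way ANOVA the predicted value is the group mean,
# so each residual is the observation minus its own group mean.
residuals = np.concatenate([values - values.mean() for values in groups.values()])

# Shapiro-Wilk test: H0 = the residuals come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {shapiro_stat:.3f}, p-value = {shapiro_p:.4f}")
```

A Q-Q plot of the same residuals is a useful complement, since a formal test has little power with very small samples.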

### Independence of Groups:
The groups you are comparing should be independent of each other. This means that individuals in one group should not be related to individuals in another group.

Example of Violation: If you are comparing the salaries of employees in different departments of a company, and some employees are in multiple departments simultaneously, this violates the independence of groups assumption.

### Random Sampling:
The data should be obtained through random sampling or a well-controlled experimental design. This ensures that the results are generalizable to the broader population.

Example of Violation: If you only collect data from a convenience sample of individuals who are readily available, this may not represent the broader population.


## Q2. What are the three types of ANOVA, and in what situations would each be used?


### One-Way ANOVA:
One-Way ANOVA is used when you have one independent variable with more than two levels or groups, and you want to determine if there are any statistically significant differences among the means of these groups.

Example: Suppose you have three different teaching methods (Group A, Group B, and Group C) and you want to determine if there's a significant difference in student test scores among these groups.

### Two-Way ANOVA:
Two-Way ANOVA is used when you have two independent variables and you want to examine their combined effects on a dependent variable. It helps you determine if there are interactions between the two factors and if each factor individually has a significant effect.

Example: You're studying the effect of both a new drug treatment and gender on patient recovery time. Two-Way ANOVA allows you to analyze whether the drug, gender, or their interaction significantly affects recovery time.

### Repeated Measures ANOVA:
Repeated Measures ANOVA is used when you have a single group of participants, and each participant is measured under multiple conditions or at multiple time points. It helps assess whether there are significant differences across the repeated measures.

Example: You're studying the impact of a training program on individuals' performance, and you measure their performance before training, immediately after training, and one month after training. Repeated Measures ANOVA helps determine if there's a significant change over time.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


The partitioning of variance in Analysis of Variance (ANOVA) is the idea that the total variation in a dataset can be split into components, each with its own source. In a one-way ANOVA, the total sum of squares is split into two parts, a between-group part and a within-group part, so that SST = SSB + SSW:

Total Sum of Squares (SST): This represents the total variability in the dependent variable. It is calculated as the sum of the squared differences between each data point and the overall mean.

Between-Group Sum of Squares (SSB): This represents the variation between different groups or levels of the independent variable. It measures how much the group means differ from the overall mean, weighted by group size.

Within-Group Sum of Squares (SSW): This represents the variation within each group. It measures how much individual data points within each group vary from their group mean.

This matters because the F-statistic is built from the partition: it is the ratio of the between-group mean square, SSB / (k - 1), to the within-group mean square, SSW / (N - k), for k groups and N observations. A large ratio means the group means differ by more than the within-group noise would explain.
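Written out for a one-way design, the three quantities above satisfy the identity below. The notation is introduced here for illustration and is an assumption, not part of the original answer: $x_{ij}$ is observation $i$ in group $j$, $\bar{x}_j$ is the mean of group $j$, $\bar{x}$ is the overall mean, $k$ is the number of groups, and $n_j$ is the size of group $j$.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{\text{SST}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{\text{SSB}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{\text{SSW}}
$$

The cross term vanishes because the deviations $x_{ij}-\bar{x}_j$ sum to zero within every group.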

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?


In [15]:
import numpy as np
import pandas as pd
from scipy import stats

data = {
    'Group1': [45, 50, 55, 60],
    'Group2': [55, 60, 65, 70],
    'Group3': [65, 70, 75, 80],
}

# Convert the data into a DataFrame
df = pd.DataFrame(data)

# Calculate the overall mean
grand_mean = df.values.mean()

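# Calculate the Total Sum of Squares (SST): squared deviations of every value from the grand mean
sst = ((df - grand_mean) ** 2).sum().sum()

# Calculate the Explained Sum of Squares (SSE): between-group variation (SSB in Q3).
# len(df) is the number of observations per group, so this assumes equal group sizes.
group_means = df.mean()
sse = ((group_means - grand_mean) ** 2 * len(df)).sum()

# Calculate the Residual Sum of Squares (SSR): within-group variation (SSW in Q3)
ssr = sst - sse

print("SST:", sst)
print("SSE (explained):", sse)
print("SSR (residual):", ssr)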
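As a sanity check, the sums of squares can be turned into an F-statistic and compared with `scipy.stats.f_oneway`. The sketch below recomputes the three quantities with plain NumPy on the same data, this time computing the within-group part directly instead of by subtraction. The names `ss_total`, `ss_between`, and `ss_within` are my own and correspond to SST, the explained sum of squares, and the residual sum of squares.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([45, 50, 55, 60]),
    np.array([55, 60, 65, 70]),
    np.array([65, 70, 75, 80]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group (explained) and within-group (residual) sums of squares
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F-statistic built from them
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Reference value from SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SST, SSB, SSW:", ss_total, ss_between, ss_within)
print("F (manual), F (scipy), p:", f_manual, f_scipy, p_scipy)
```

For this data the partition is SST = 1175, explained = 800, residual = 375, and both routes give F = 9.6.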

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


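In a two-way ANOVA, each main effect and the interaction effect is tested with its own F-statistic, computed from the sums of squares (SS) and degrees of freedom (df) in the ANOVA table.

Let A and B be the two factors. Each mean square is MS = SS / df, and:

F(A) = MS(A) / MS(Residual)

F(B) = MS(B) / MS(Residual)

F(A:B) = MS(A:B) / MS(Residual)

In Python, statsmodels reports these sums of squares, F-statistics, and p-values through anova_lm, as in Q10.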
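To make the quantities concrete, here is a minimal sketch that computes the sums of squares, F-statistics, and p-values by hand with NumPy. It assumes a balanced design (the same number of replicates in every cell), which is what makes these simple formulas valid. The data are simulated, and the factor sizes and variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced data: 2 levels of A x 3 levels of B x 4 replicates per cell
rng = np.random.default_rng(42)
a_levels, b_levels, n_rep = 2, 3, 4
y = rng.normal(loc=10.0, scale=2.0, size=(a_levels, b_levels, n_rep))
y[1] += 3.0  # build in a main effect of factor A for illustration

grand = y.mean()
mean_a = y.mean(axis=(1, 2))  # one mean per level of A
mean_b = y.mean(axis=(0, 2))  # one mean per level of B
mean_cell = y.mean(axis=2)    # one mean per A x B cell

# Sums of squares for the two main effects, the interaction and the residual
ss_a = b_levels * n_rep * ((mean_a - grand) ** 2).sum()
ss_b = a_levels * n_rep * ((mean_b - grand) ** 2).sum()
ss_ab = n_rep * ((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
ss_resid = ((y - mean_cell[:, :, None]) ** 2).sum()
ss_total = ((y - grand) ** 2).sum()

df_a, df_b = a_levels - 1, b_levels - 1
df_ab = df_a * df_b
df_resid = a_levels * b_levels * (n_rep - 1)
ms_resid = ss_resid / df_resid

# F = MS(effect) / MS(Residual); the p-value is the upper tail of the F distribution
for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_resid
    p_value = stats.f.sf(f_value, df, df_resid)
    print(f"{name}: SS = {ss:.3f}, F = {f_value:.3f}, p = {p_value:.4f}")
```

With unbalanced data the sums of squares no longer add up this neatly, and a model-based routine such as the one used in Q10 is the safer choice.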

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

F-statistic: 5.23

p-value: 0.02

Null Hypothesis (H0): All group population means are equal.

Alternative Hypothesis (Ha): At least one group mean differs from the others.

At the conventional significance level of 0.05, the p-value of 0.02 is below the threshold, so we reject the null hypothesis. If the group means were truly equal, an F-statistic as large as 5.23 would be expected only about 2% of the time.
We therefore conclude that at least one group mean differs significantly from the others. The F-test is an omnibus test: it does not show which groups differ or by how much, so a post-hoc test (see Q8) is needed to locate the differences.
At a stricter significance level such as 0.01, the same result would not be significant.
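The link between the F-statistic and the p-value depends on the degrees of freedom, which the question does not give. The sketch below therefore uses hypothetical values (3 groups and 15 observations, so 2 and 12 degrees of freedom) purely to show how the p-value and the critical value are obtained from the F distribution.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # hypothetical: 3 groups
df_within = 12   # hypothetical: 15 observations in total

# The p-value of the ANOVA F-test is the upper-tail probability of the F distribution
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Smallest F that would be significant at alpha = 0.05 for these degrees of freedom
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F at alpha 0.05 = {f_critical:.3f}")
print("Reject H0" if f_statistic > f_critical else "Fail to reject H0")
```

Under these assumed degrees of freedom the p-value comes out at roughly 0.02, which matches the value in the question; other degrees of freedom would give a different p-value for the same F.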


## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


The following methods can be used to handle missing data.

Listwise Deletion (Complete Case Analysis): Any participant with a missing value at one or more time points is removed from the analysis. This is the simplest method, and classical repeated measures ANOVA needs complete data for every participant, so incomplete cases usually have to be dropped or imputed before the analysis. Dropping them can reduce the sample size substantially, which lowers statistical power, and it gives biased results if the data are not missing completely at random (MCAR).
Consequence: Reduced sample size, potential bias if data are not MCAR, loss of information.

Mean Imputation: Missing values are replaced with the mean of the available data for that variable. This retains the sample size, but every imputed value sits exactly at the mean, so the variance is underestimated and the correlations between time points are distorted. The means themselves are only unbiased if the data are MCAR.
Consequence: Potential bias, underestimation of variance, standard errors that are too small.

Last Observation Carried Forward (LOCF): In repeated measures data, missing values are imputed with the value from the previous time point. This assumes that values remain constant between time points and can be problematic if this assumption is not met.
Consequence: Potentially inaccurate representation of change over time, assumes constancy.

Interpolation or Extrapolation: You can estimate missing values using statistical techniques like linear interpolation or regression imputation. This approach tries to capture the underlying trends in the data but relies on assumptions about the data's functional form.
Consequence: Results may be sensitive to the chosen imputation model, and assumptions about data trends may not always hold.
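The four approaches above can be sketched with pandas as follows. The data frame is hypothetical (four subjects, three time points, two missing values), and the wide layout with one column per time point is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 9.0],
        "t2": [12.0, np.nan, 13.0, 10.0],
        "t3": [15.0, 16.0, np.nan, 12.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

listwise = scores.dropna()                   # drop every subject with a missing value
mean_imputed = scores.fillna(scores.mean())  # replace NaN with the time-point mean
locf = scores.ffill(axis=1)                  # carry the previous time point forward
interpolated = scores.interpolate(axis=1)    # linear interpolation across time points

print("Subjects kept by listwise deletion:", len(listwise))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
print(locf)
print(interpolated)
```

Comparing the two variances of `t2` shows the shrinkage caused by mean imputation, and the last two tables show how LOCF and interpolation fill the same gap for subject `s2` with different values.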

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


Tukey's Honestly Significant Difference (HSD) Test: Tukey's HSD compares every pair of group means. It controls the overall Type I error rate and provides simultaneous confidence intervals for all possible pairwise group comparisons. It was developed for equal sample sizes in each group; the Tukey-Kramer variant extends it to unequal sample sizes.

When to Use: Tukey's HSD is widely used and appropriate when you have no specific a priori hypotheses about which groups will differ and want all pairwise comparisons. With unequal sample sizes, use the Tukey-Kramer adjustment.

Bonferroni Correction: The Bonferroni correction is a conservative method that adjusts the significance level for each pairwise comparison to control the familywise error rate. It's suitable when you want to control the overall Type I error rate, but it can be overly conservative when you have many comparisons.

When to Use: Bonferroni correction is useful when you want to perform multiple pairwise comparisons, and you want to maintain a low overall Type I error rate.

Sidak Correction: The Sidak correction is similar to Bonferroni but often less conservative. It adjusts the significance level for each comparison, considering the number of comparisons being made.

When to Use: Sidak correction is suitable when you have multiple comparisons but want to be less conservative than Bonferroni.

Dunnett's Test: Dunnett's test is used when you have a control group and you want to compare all other groups to the control group. It controls the Type I error rate for these specific comparisons.

When to Use: Dunnett's test is appropriate when you have a control group and you want to identify which treatment groups differ significantly from the control group.

Scheffé's Test: Scheffé's test is a very conservative but flexible method that can be used when sample sizes are unequal. It controls the Type I error rate across all possible contrasts, not only pairwise comparisons, so its confidence intervals for pairwise comparisons are wider than Tukey's.

When to Use: Scheffé's test is suitable when you have unequal sample sizes or want to test complex contrasts (for example, the average of two groups against a third) while controlling the overall Type I error rate.

Example Scenario:
Suppose you're conducting a medical study to compare the effectiveness of four different drugs (A, B, C, D) in reducing blood pressure. You use ANOVA to test if there are any significant differences among the drug treatments. After finding a significant difference, you want to know which specific drug(s) are different from each other in terms of their effects on blood pressure. In this case, you would use post-hoc tests, such as Tukey's HSD or Scheffé's test, to perform pairwise comparisons between the drugs and identify where the significant differences exist.
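A minimal sketch of this workflow is shown below. For simplicity it uses Bonferroni-corrected pairwise t-tests rather than Tukey's HSD or Scheffé's test, since these need only `scipy.stats.ttest_ind`. The blood-pressure reductions are simulated, and the group means, spread, and sample size of 20 per drug are assumptions.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated blood-pressure reductions (mmHg) for four hypothetical drugs
rng = np.random.default_rng(0)
drugs = {
    "A": rng.normal(loc=10.0, scale=3.0, size=20),
    "B": rng.normal(loc=10.5, scale=3.0, size=20),
    "C": rng.normal(loc=14.0, scale=3.0, size=20),
    "D": rng.normal(loc=8.0, scale=3.0, size=20),
}

# Step 1: omnibus one-way ANOVA across all four drugs
f_stat, p_omnibus = stats.f_oneway(*drugs.values())
print(f"Omnibus ANOVA: F = {f_stat:.2f}, p = {p_omnibus:.4g}")

# Step 2: pairwise t-tests with a Bonferroni correction
pairs = list(combinations(drugs, 2))  # 6 pairwise comparisons for 4 groups
adjusted = {}
for first, second in pairs:
    _, p_raw = stats.ttest_ind(drugs[first], drugs[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)

for (first, second), p_adj in adjusted.items():
    verdict = "differ" if p_adj < 0.05 else "no evidence of a difference"
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f} ({verdict})")
```

The post-hoc step is normally run only when the omnibus test is significant; it is run unconditionally here to keep the example short.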

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [2]:
import numpy as np
import scipy.stats as stats

# Illustrative sample data with 25 values per diet (the question's 50 participants are not provided; replace with your actual data)
diet_A = [2.1, 1.8, 1.5, 2.0, 2.2, 1.9, 2.3, 1.7, 2.4, 1.6, 2.0, 1.8, 2.1, 1.9, 2.2, 2.0, 1.8, 2.3, 1.7, 2.4, 1.6, 2.0, 1.8, 2.1, 1.9]
diet_B = [1.5, 1.7, 1.8, 1.6, 1.9, 1.8, 1.6, 1.7, 1.5, 1.7, 1.9, 1.6, 1.8, 1.7, 1.6, 1.5, 1.7, 1.8, 1.6, 1.7, 1.5, 1.9, 1.8, 1.6, 1.7]
diet_C = [1.3, 1.2, 1.5, 1.4, 1.6, 1.3, 1.7, 1.2, 1.5, 1.3, 1.4, 1.6, 1.3, 1.2, 1.5, 1.4, 1.6, 1.3, 1.7, 1.2, 1.5, 1.3, 1.4, 1.6, 1.3]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
alpha = 0.05  # Significance level

print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
    print("There is a significant difference in mean weight loss between at least two of the diets.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is no significant difference in mean weight loss between the diets.")


F-statistic: 54.61950286806887
p-value: 3.689009004580535e-15
Reject the null hypothesis.
There is a significant difference in mean weight loss between at least two of the diets.


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [5]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
significance_level = 0.05

np.random.seed(0)  
n = 30 
programs = ['A', 'B', 'C']
experience_levels = ['Novice', 'Experienced']

# Step 1: Generate random task completion times (simulated data)
data = {'Program': np.random.choice(programs, n),
        'ExperienceLevel': np.random.choice(experience_levels, n),
        'Time': np.random.normal(loc=15, scale=3, size=n)}
df = pd.DataFrame(data)  # ols() looks up the formula's variables as columns of this DataFrame

# Step 2: Fit the two-way ANOVA model with both main effects and their interaction
formula = 'Time ~ Program + ExperienceLevel + Program:ExperienceLevel'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Step 3: Print the ANOVA table
print(anova_table)

                           df      sum_sq    mean_sq         F    PR(>F)
Program                   2.0   25.439518  12.719759  2.145100  0.138964
ExperienceLevel           1.0    4.729822   4.729822  0.797652  0.380665
Program:ExperienceLevel   2.0   13.529836   6.764918  1.140857  0.336272
Residual                 24.0  142.312322   5.929680       NaN       NaN


### Interpretation
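The p-values for Program (0.139), ExperienceLevel (0.381), and their interaction (0.336) are all greater than the significance level of 0.05. We therefore fail to reject the null hypotheses: in this simulated data there is no statistically significant main effect of software program or experience level on completion time, and no significant interaction between them.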

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.
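The question's test scores are not provided, so the following is only a sketch on simulated data. The 50/50 split of the 100 students, the group means, and the standard deviation are all assumptions. It uses Welch's version of the two-sample t-test (`equal_var=False`), which does not require equal variances in the two groups.

```python
import numpy as np
from scipy import stats

# Simulated test scores: 50 students per group (assumed split of the 100 students)
rng = np.random.default_rng(1)
control = rng.normal(loc=70.0, scale=10.0, size=50)
experimental = rng.normal(loc=75.0, scale=10.0, size=50)

# Independent two-sample t-test; equal_var=False selects Welch's t-test
t_statistic, p_value = stats.ttest_ind(experimental, control, equal_var=False)
mean_difference = experimental.mean() - control.mean()

print(f"t = {t_statistic:.3f}, p = {p_value:.4f}, mean difference = {mean_difference:.2f}")
if p_value < 0.05:
    print("Reject H0: the mean test scores of the two groups differ.")
else:
    print("Fail to reject H0: no significant difference detected.")
```

With only two groups there is a single pairwise comparison, so a significant t-test already identifies which groups differ and no separate post-hoc test is needed; the sign of the mean difference shows the direction.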


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a posthoc test to determine which store(s) differ significantly from each other.
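The question's sales figures are not provided, so the following is only a sketch on simulated data. It treats each of the 30 days as a "subject" measured under three conditions (the stores) and computes the repeated measures F-test by hand with NumPy. The store means, the noise levels, and the Bonferroni-corrected paired t-tests used as the post-hoc step are all assumptions, and sphericity is not checked.

```python
import numpy as np
from scipy import stats

# Simulated daily sales: 30 days (rows) x 3 stores (columns); values are made up
rng = np.random.default_rng(2)
n_days, n_stores = 30, 3
day_effect = rng.normal(loc=0.0, scale=50.0, size=(n_days, 1))  # shared by all stores on a day
store_means = np.array([1000.0, 1020.0, 1100.0])                # stores A, B, C
sales = store_means + day_effect + rng.normal(0.0, 40.0, size=(n_days, n_stores))

# Partition: total = between stores + between days + error
grand = sales.mean()
ss_total = ((sales - grand) ** 2).sum()
ss_stores = n_days * ((sales.mean(axis=0) - grand) ** 2).sum()
ss_days = n_stores * ((sales.mean(axis=1) - grand) ** 2).sum()
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_statistic = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_statistic, df_stores, df_error)
print(f"F({df_stores}, {df_error}) = {f_statistic:.2f}, p = {p_value:.4g}")

# Post-hoc: paired t-tests between stores with a Bonferroni correction
names = ["A", "B", "C"]
pairs = [(0, 1), (0, 2), (1, 2)]
for i, j in pairs:
    _, p_raw = stats.ttest_rel(sales[:, i], sales[:, j])
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"Store {names[i]} vs Store {names[j]}: adjusted p = {p_adj:.4f}")
```

Removing the day-to-day variation (`ss_days`) from the error term is what distinguishes this from a one-way ANOVA on the same numbers.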