## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

The assumptions required to use ANOVA are:

Normality: The residuals (each observation minus its group mean) should be approximately normally distributed within each group. This means that the distribution should be roughly bell-shaped, with most of the data points clustered around the mean and fewer data points towards the tails.
Homogeneity of variance: The variances of the data for each group must be approximately equal. This means that the spread of the data points around the mean should be about the same for each group.
Independence: The observations must be independent of each other, both within and between groups. This means that knowing the value of one data point gives no information about the value of another.

Violations of these assumptions can impact the validity of the results of an ANOVA test. For example, if the data are strongly skewed or heavy-tailed, the F-statistic no longer follows the F-distribution closely, so the p-value can be inaccurate, especially with small samples. Similarly, if the variances are not equal, particularly when the group sizes also differ, the actual Type I error rate can be well above or below the nominal level. Finally, if the data points are not independent, the within-group variance is typically underestimated, which inflates the F-statistic and produces too many false positives.

Here are some examples of violations that could impact the validity of the results of an ANOVA test, together with ways to address them:

Non-normality: Reaction times and incomes are often strongly right-skewed. You can try a transformation (for example, a log transform) to make the data more normal. If the data remain severely non-normal, you may need a non-parametric test such as the Kruskal-Wallis test instead of ANOVA.
Unequal variances: One treatment group may respond far more variably than the control group. In that case you can use a robust alternative such as Welch's ANOVA, which is less sensitive to violations of this assumption than the classical F-test.
Dependence: Measuring the same participants several times, or sampling students from the same classroom, produces correlated observations. In that case you may need a repeated measures ANOVA or a mixed ANOVA, which model the dependence explicitly.

It is important to check the assumptions of ANOVA before you run the test. If you find that one or more of the assumptions are violated, you may need to take steps to address the violation before you can interpret the results of the test.
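
The normality and equal-variance checks can be sketched with `scipy.stats`. This is a minimal illustration, not a full diagnostic workflow: the three groups are simulated (hypothetical data with a fixed seed), Shapiro-Wilk is applied to the pooled residuals, and Levene's test compares the group variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical groups: same spread, different means
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (10, 12, 14)]

# Normality: Shapiro-Wilk on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated. Independence cannot be tested this way; it has to be judged from how the data were collected.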

## Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are:

One-Way ANOVA:
One-Way ANOVA is used when you have one categorical independent variable (also known as a factor) and one continuous dependent variable. It is typically used when the categorical variable has three or more levels (groups); with exactly two levels it is equivalent to an independent two-sample t-test. It is used to determine whether there are any significant differences in the means of the dependent variable across the different groups. For example, you could use a One-Way ANOVA to analyze if there are differences in test scores among students from three different schools.

Two-Way ANOVA:
Two-Way ANOVA is an extension of One-Way ANOVA, but it deals with two independent categorical variables (factors) and one continuous dependent variable. It allows you to examine the main effects of each independent variable as well as their interaction effect on the dependent variable. This type of ANOVA is suitable when you want to investigate how two independent variables influence the same dependent variable. For instance, you could use a Two-Way ANOVA to explore the effects of both gender and different teaching methods on exam scores.

Repeated Measures ANOVA:
Repeated Measures ANOVA, also known as Within-Subjects ANOVA, is used when you have a single group of subjects that have been measured multiple times under different conditions or at different time points. In this type of ANOVA, the same participants are measured under all the conditions, which makes it appropriate for studying the effect of a treatment or intervention within the same group over time. For example, if you are testing the effectiveness of three different drugs on patients' pain levels, and each patient receives all three drugs at different times, you would use Repeated Measures ANOVA.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the division of the total variance observed in the data into different components that can be attributed to specific sources or factors. Understanding this concept is fundamental to ANOVA because it allows researchers to assess the contributions of different factors and sources of variation to the overall variability in the dependent variable. This partitioning is crucial for hypothesis testing and drawing meaningful conclusions about group differences.

In ANOVA, the total variability in the data is broken down into two components, which together make up the total:

Between-Groups Variance: This component of variance represents the variability in the dependent variable that is due to differences between the groups being compared. It reflects the differences in means among the different groups or levels of the independent variable. The larger the between-groups variance relative to the within-groups variance, the stronger the evidence for significant differences among the groups.

Within-Groups Variance: Also known as the error variance or residual variance, this component represents the variability in the dependent variable that cannot be attributed to the independent variable. It includes random variation, measurement errors, and any other factors not accounted for in the model. The within-groups variance is an estimate of the variability of individual scores within each group.

Total Variance: The total variance is the overall variability observed in the dependent variable across all data points. The additive relationship holds for the sums of squares rather than for the variances (mean squares) themselves: Total Sum of Squares = Between-Groups Sum of Squares + Within-Groups Sum of Squares.

By partitioning the variance into these components, ANOVA enables researchers to test whether the observed between-groups differences are statistically significant or if they could be due to random fluctuations (within-groups variance). This is accomplished by comparing the variability between groups to the variability within groups and calculating the F-statistic, which is used to test the null hypothesis that there are no significant differences among the group means.
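
Written out for a one-way design, the partition and the resulting F-statistic look as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{y}_i$ and grand mean $\bar{y}$.

```latex
\begin{aligned}
SS_{\text{total}}   &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SS_{\text{between}} &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SS_{\text{within}}  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SS_{\text{total}}   &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\end{aligned}
```

Dividing each sum of squares by its degrees of freedom gives the mean squares, and the F-statistic is their ratio.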

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np

# Sample data (replace this with your actual data)
y = np.array([15, 18, 20, 22, 25, 28, 12, 16, 19, 23, 26, 14, 17, 21, 24, 27])
group = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])

# Calculate the overall mean (grand mean)
grand_mean = np.mean(y)

# Calculate the total sum of squares (SST)
sst = np.sum((y - grand_mean) ** 2)

# Calculate the mean and the size of each group
labels = np.unique(group)
group_means = np.array([np.mean(y[group == g]) for g in labels])
group_sizes = np.array([np.sum(group == g) for g in labels])

# Calculate the explained (between-group) sum of squares (SSE);
# each squared deviation is weighted by the size of its own group
sse = np.sum(group_sizes * (group_means - grand_mean) ** 2)

# Calculate the residual (within-group) sum of squares (SSR)
ssr = sst - sse

print(f"Total Sum of Squares (SST): {sst:.4f}")
print(f"Explained Sum of Squares (SSE): {sse:.4f}")
print(f"Residual Sum of Squares (SSR): {ssr:.4f}")


Total Sum of Squares (SST): 355.9375
Explained Sum of Squares (SSE): 12.6042
Residual Sum of Squares (SSR): 343.3333
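
As a cross-check on the decomposition, the F-statistic built by hand from the sums of squares can be compared with `scipy.stats.f_oneway`. The sketch below reuses the sample data from the cell above and computes the within-group sum of squares directly rather than by subtraction; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

y = np.array([15, 18, 20, 22, 25, 28, 12, 16, 19, 23, 26, 14, 17, 21, 24, 27])
group = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])

labels = np.unique(group)
grand_mean = y.mean()

# Between-group SS: squared deviation of each group mean, weighted by group size
ss_between = sum(
    np.sum(group == g) * (y[group == g].mean() - grand_mean) ** 2 for g in labels
)
# Within-group SS: squared deviations of each value from its own group mean
ss_within = sum(np.sum((y[group == g] - y[group == g].mean()) ** 2) for g in labels)

df_between = len(labels) - 1
df_within = len(y) - len(labels)
f_manual = (ss_between / df_between) / (ss_within / df_within)

f_scipy, p_scipy = f_oneway(*(y[group == g] for g in labels))
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

If the two F values agree and the between and within sums of squares add up to the total, the decomposition is consistent.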


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import numpy as np

def two_way_anova(data):
  """Sums of squares for a balanced two-way design.

  Columns of data: response, level of factor 1, level of factor 2.
  """
  y, f1, f2 = data[:, 0], data[:, 1], data[:, 2]
  grand_mean = np.mean(y)

  def between_ss(labels):
    # Squared deviation of each level mean from the grand mean, weighted by level size.
    return sum(np.sum(labels == lvl) * (np.mean(y[labels == lvl]) - grand_mean) ** 2
               for lvl in np.unique(labels))

  # Calculate the main effect of the first independent variable.
  bgss_1 = between_ss(f1)

  # Calculate the main effect of the second independent variable.
  bgss_2 = between_ss(f2)

  # Label each cell (combination of the two factors) and calculate the between-cell sum of squares.
  cells = np.array([f"{a}_{b}" for a, b in zip(f1, f2)])

  # Calculate the interaction effect: between-cell sum of squares minus both
  # main effects (this subtraction is only valid for a balanced design).
  bgss_int = between_ss(cells) - bgss_1 - bgss_2

  main_effects = [float(bgss_1), float(bgss_2)]
  interaction_effect = float(bgss_int)

  return main_effects, interaction_effect

if __name__ == "__main__":
  # Create some sample data: a balanced 2x2 design with two observations per cell.
  # Columns: response, level of factor 1, level of factor 2.
  data = np.array([[10, 1, 1], [12, 1, 1], [14, 1, 2], [16, 1, 2],
                   [18, 2, 1], [20, 2, 1], [30, 2, 2], [32, 2, 2]])

  # Calculate the main effects and interaction effects.
  main_effects, interaction_effect = two_way_anova(data)

  # Print the results.
  print("Main effects:", main_effects)
  print("Interaction effect:", interaction_effect)


Main effects: [288.0, 128.0]
Interaction effect: 32.0


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

With an F-statistic of 5.23 and a p-value of 0.02 from a one-way ANOVA, the results can be read as follows:

Conclusion: The F-statistic is the ratio of the between-group variance to the within-group variance. The larger it is, the stronger the evidence against the null hypothesis that all group means are equal. Here the F-statistic of 5.23 has a p-value of 0.02: if the null hypothesis were true, there would be only a 2% chance of obtaining an F-statistic at least this large.
Interpretation: The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. Because 0.02 is below the conventional significance level of 0.05, we reject the null hypothesis and conclude that at least one group mean differs from the others. At a stricter level such as 0.01, the result would not be significant.

In other words, the results of the ANOVA suggest that the group means are not all equal. Whether this difference can be attributed to the independent variable depends on the study design, for example on whether participants were randomly assigned to the groups.

It is important to note that the ANOVA only tells us that there is a difference somewhere among the groups. It does not tell us which groups differ or by how much. To answer these questions, we would need to conduct post-hoc tests and report an effect size.
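
The link between an F-statistic and its p-value can be sketched with the F-distribution's survival function. The question does not give the degrees of freedom, so the values below (3 groups, 18 observations) are assumptions chosen only for illustration; with other degrees of freedom the same F-statistic gives a different p-value.

```python
from scipy import stats

f_statistic = 5.23

# Assumed degrees of freedom (not given in the question):
# 3 groups -> 2 between-group df; 18 observations -> 15 within-group df
df_between = 2
df_within = 15

# Survival function: P(F >= observed value) if the null hypothesis is true
p_value = stats.f.sf(f_statistic, df_between, df_within)

alpha = 0.05
print(f"p = {p_value:.3f}; reject H0 at alpha = {alpha}: {p_value < alpha}")
```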

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


There are several ways to handle missing data in a repeated measures ANOVA. The most common methods are:

Listwise deletion: This method removes every subject that has a missing value in any condition. It is the simplest method, and a standard repeated measures ANOVA needs complete data for every subject. It reduces the sample size and therefore the power of the test, and it can bias the results unless the data are missing completely at random.
Pairwise deletion: This method uses every subject available for each individual calculation (for example, each pairwise comparison of conditions), so different calculations can be based on different subsets of subjects. It keeps more data than listwise deletion, but the results may be inconsistent with one another.
Mean imputation: This method replaces each missing value with the mean of the observed values for that variable. It is simple, but it shrinks the variance and weakens the correlations between repeated measurements, which biases standard errors and test statistics.
Model-based imputation: This method uses a statistical model (for example, multiple imputation or maximum likelihood estimation) to fill in or account for missing values. It is more complex, but it gives less biased results when the data are missing at random.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA depend on the amount of missing data, the reason the data are missing, and the method that is used. Deletion methods mainly cost power, while simple imputation keeps the sample size but can distort variances and p-values.

It is important to choose a method for handling missing data that is appropriate for the specific data set. If only a few values are missing completely at random, listwise deletion may be acceptable. With small data sets or substantial missingness it discards too much information, and model-based imputation or a linear mixed-effects model, which uses all available observations, is usually a better option.

Here are some additional considerations when handling missing data in a repeated measures ANOVA:

The number of missing values: The more missing values there are, the more likely it is that the results will be affected.
The pattern of missing values: If the missing values are clustered, for example in later time points because participants dropped out, this can also affect the results.
The mechanism of missingness: Values that are missing completely at random mainly cost power, whereas values whose absence depends on the unobserved outcome can bias any of the methods above.

It is important to carefully consider all of these factors when choosing a method for handling missing data in a repeated measures ANOVA.
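
The effect of two of these methods on a data set can be sketched with pandas. The wide-format table below is hypothetical (one row per subject, one column per condition, with names `t1` to `t3` invented for this example); it compares how many subjects survive listwise deletion and how mean imputation changes the variance of each condition.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
wide = pd.DataFrame({
    "t1": [5.0, 6.0, np.nan, 7.0, 8.0],
    "t2": [6.0, np.nan, 7.0, 8.0, 9.0],
    "t3": [7.0, 8.0, 9.0, 9.0, 10.0],
})

# Listwise deletion: keep only subjects observed under every condition
listwise = wide.dropna()

# Mean imputation: fill each gap with the observed mean of that condition
imputed = wide.fillna(wide.mean())

print("subjects kept by listwise deletion:", len(listwise), "of", len(wide))
print("variance per condition, observed:", wide.var().round(3).to_dict())
print("variance per condition, imputed: ", imputed.var().round(3).to_dict())
```

In this sketch listwise deletion drops two of the five subjects, while mean imputation keeps all five but lowers the variance of the conditions that had gaps.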

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


There are many different post-hoc tests that can be used after ANOVA. Some of the most common ones include:

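Tukey's HSD test: This test compares all pairs of group means while controlling the family-wise error rate. It is a good choice when every pairwise comparison is of interest and the group variances are similar, because it is easy to interpret and has good power in that setting.
Bonferroni test: This test divides the significance level by the number of comparisons. It can be applied to any set of tests, but it is more conservative than Tukey's HSD test when many comparisons are made, so it is less likely to find a significant difference between groups even if there is one. It is best suited to a small number of planned comparisons.
Sidak test: This test is similar to the Bonferroni test, but it is slightly less conservative and therefore slightly more powerful. It still controls the family-wise error rate when the comparisons are independent.
Holm-Bonferroni test: This test is a step-down modification of the Bonferroni test that evaluates the p-values in order from smallest to largest. It controls the family-wise error rate under the same conditions as the Bonferroni test while being at least as powerful, so it is generally preferred over the plain Bonferroni correction.

The choice of which post-hoc test to use depends on a number of factors, including the number of groups being compared, whether all pairwise comparisons or only a few planned ones are of interest, and the power of the test.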

Here is an example of a situation where a post-hoc test might be necessary:

A researcher is interested in the effects of different teaching methods on student test scores. The researcher conducts a one-way ANOVA and finds that there is a significant difference between the groups. However, the ANOVA does not tell the researcher which groups are different or how much different they are. To answer these questions, the researcher would need to conduct a post-hoc test.
In this example, the researcher could use any of the post-hoc tests that were mentioned above. The choice of which test to use would depend on the number of groups being compared and the level of significance that is desired.
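
The teaching-method scenario can be sketched in code. The scores below are hypothetical, and the group names are placeholders. The sketch runs Tukey's HSD with `statsmodels` and, as an alternative, pairwise t-tests with a Holm correction via `multipletests`.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three teaching methods
scores = {
    "lecture": np.array([70, 72, 68, 75, 71, 69]),
    "online": np.array([78, 80, 77, 82, 79, 81]),
    "hybrid": np.array([85, 88, 84, 90, 86, 87]),
}

values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey)

# Alternative: pairwise t-tests with a Holm correction
pairs = [("lecture", "online"), ("lecture", "hybrid"), ("online", "hybrid")]
raw_p = [ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: Holm-adjusted p = {p:.4g}, reject = {r}")
```

Passing `method="bonferroni"` or `method="sidak"` to `multipletests` applies the other corrections listed above to the same raw p-values.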

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Sample data (replace this with your actual data)
diet_A = np.array([3, 5, 4, 6, 7, 8, 4, 5, 6, 3, 2, 4, 6, 5, 3, 4, 5, 6, 7, 4, 5, 6, 4, 3, 5])
diet_B = np.array([2, 3, 1, 4, 5, 3, 2, 4, 3, 2, 1, 3, 5, 4, 3, 2, 4, 5, 3, 2, 3, 4, 2, 1, 3])
diet_C = np.array([6, 7, 5, 8, 6, 9, 5, 6, 8, 7, 6, 9, 8, 7, 6, 5, 9, 7, 6, 8, 7, 6, 5, 8, 7])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", f_statistic)
print("p-value:", p_value)


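F-statistic: 53.67701076630778
p-value: 5.374890081389282e-15

Interpretation: The p-value is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet differs from the others; a post-hoc test is needed to identify which ones.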


In [4]:
import numpy as np
import scipy.stats as stats

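# Alternative: simulate the data instead of typing it in.
# No random seed is set, so the values printed below change on every run.
# Note that this draws 50 values per diet, not 50 participants in total.
diet_A = np.random.normal(10, 5, 50)
diet_B = np.random.normal(15, 5, 50)
diet_C = np.random.normal(20, 5, 50)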

# Conduct the ANOVA.
F, p = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results.
print("F-statistic:", F)
print("p-value:", p)


F-statistic: 34.59575587540308
p-value: 4.866135313221344e-13
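
The F-test indicates whether the means differ, not how large the difference is. One common effect size is eta squared, the share of the total sum of squares explained by group membership. A minimal sketch using the fixed sample arrays from cell In [3]:

```python
import numpy as np

diet_A = np.array([3, 5, 4, 6, 7, 8, 4, 5, 6, 3, 2, 4, 6, 5, 3, 4, 5, 6, 7, 4, 5, 6, 4, 3, 5])
diet_B = np.array([2, 3, 1, 4, 5, 3, 2, 4, 3, 2, 1, 3, 5, 4, 3, 2, 4, 5, 3, 2, 3, 4, 2, 1, 3])
diet_C = np.array([6, 7, 5, 8, 6, 9, 5, 6, 8, 7, 6, 9, 8, 7, 6, 5, 9, 7, 6, 8, 7, 6, 5, 8, 7])

groups = [diet_A, diet_B, diet_C]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and total sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = np.sum((all_values - grand_mean) ** 2)

# Eta squared: proportion of total variability explained by diet
eta_squared = ss_between / ss_total
print(f"eta squared = {eta_squared:.3f}")
```

For this sample data roughly 60% of the variability in weight loss is associated with the diet.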


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace this with your actual data)
data = {
    'task_time': [15, 20, 18, 22, 17, 21, 23, 16, 19, 25, 14, 18, 20, 22, 15, 19, 24, 16, 20, 23, 21, 18, 20, 22, 17, 20, 24, 16, 21, 25],
    'software_program': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'experience_level': ['Novice', 'Experienced'] * 15
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
formula = 'task_time ~ C(software_program) + C(experience_level) + C(software_program):C(experience_level)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract F-statistics and p-values
F_program = anova_table.loc['C(software_program)', 'F']
p_program = anova_table.loc['C(software_program)', 'PR(>F)']

F_experience = anova_table.loc['C(experience_level)', 'F']
p_experience = anova_table.loc['C(experience_level)', 'PR(>F)']

F_interaction = anova_table.loc['C(software_program):C(experience_level)', 'F']
p_interaction = anova_table.loc['C(software_program):C(experience_level)', 'PR(>F)']

print("F-statistic for Software Programs:", F_program)
print("p-value for Software Programs:", p_program)

print("F-statistic for Experience Level:", F_experience)
print("p-value for Experience Level:", p_experience)

print("F-statistic for Interaction:", F_interaction)
print("p-value for Interaction:", p_interaction)


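F-statistic for Software Programs: 0.0010526315789477012
p-value for Software Programs: 0.9989479683601158
F-statistic for Experience Level: 0.7052631578947439
p-value for Experience Level: 0.4093089898972744
F-statistic for Interaction: 0.826315789473685
p-value for Interaction: 0.44972995330416093

Interpretation: None of the three p-values is below 0.05. With this sample data there is no evidence of a main effect of software program, a main effect of experience level, or an interaction between the two on task completion time.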
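
A table of cell means is a useful companion to the ANOVA table, because an interaction shows up as differences between experience levels that change from one program to another. A minimal sketch with pandas, reusing the sample data from the cell above:

```python
import pandas as pd

df = pd.DataFrame({
    "task_time": [15, 20, 18, 22, 17, 21, 23, 16, 19, 25, 14, 18, 20, 22, 15,
                  19, 24, 16, 20, 23, 21, 18, 20, 22, 17, 20, 24, 16, 21, 25],
    "software_program": ["A"] * 6 + ["B"] * 6 + ["C"] * 6 + ["A"] * 3 + ["B"] * 6 + ["C"] * 3,
    "experience_level": ["Novice", "Experienced"] * 15,
})

# Mean task time for every combination of program and experience level
cell_means = df.pivot_table(
    values="task_time",
    index="software_program",
    columns="experience_level",
    aggfunc="mean",
)
print(cell_means.round(2))
```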


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [10]:
import numpy as np
from scipy.stats import ttest_ind

# Sample data (replace this with your actual data)
control_group_scores = np.array([75, 80, 85, 70, 78, 82, 76, 72, 80, 77, 79, 81, 75, 78, 73, 79, 82, 74, 81, 80])
experimental_group_scores = np.array([85, 90, 88, 92, 86, 91, 87, 89, 93, 90, 88, 91, 86, 89, 92, 87, 90, 85, 89, 91])

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group_scores, experimental_group_scores)

print("t-statistic:", t_statistic)
print("p-value:", p_value)


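t-statistic: -11.020532658534306
p-value: 2.1422942532549697e-13

Interpretation: The p-value is far below 0.05, so the difference between the groups is significant. The negative t-statistic reflects that the control group mean (77.85) is lower than the experimental group mean (88.95). With only two groups, the t-test already identifies which groups differ, so a post-hoc test adds no new information; the Tukey HSD test below is included because the question asks for it and leads to the same conclusion.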


In [11]:
import pandas as pd
import statsmodels.stats.multicomp as mc

# Combine the data into one DataFrame
data = pd.DataFrame({'score': np.concatenate([control_group_scores, experimental_group_scores]),
                     'group': ['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_scores)})

# Perform Tukey's HSD test
posthoc = mc.MultiComparison(data['score'], data['group'])
result = posthoc.tukeyhsd()

print(result)


  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1    group2    meandiff p-adj lower upper  reject
-------------------------------------------------------
Control Experimental     11.1 0.001 9.061 13.139   True
-------------------------------------------------------
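
Two follow-ups are often more informative than a post-hoc test in the two-group case: Welch's t-test, which does not assume equal variances, and an effect size such as Cohen's d. The sketch below reuses the sample scores from cell In [10]; the shortened variable names are introduced here for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

control = np.array([75, 80, 85, 70, 78, 82, 76, 72, 80, 77,
                    79, 81, 75, 78, 73, 79, 82, 74, 81, 80])
experimental = np.array([85, 90, 88, 92, 86, 91, 87, 89, 93, 90,
                         88, 91, 86, 89, 92, 87, 90, 85, 89, 91])

# Welch's t-test: does not assume the two groups share a variance
t_welch, p_welch = ttest_ind(control, experimental, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
n1, n2 = len(control), len(experimental)
pooled_sd = np.sqrt(
    ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * experimental.var(ddof=1))
    / (n1 + n2 - 2)
)
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"Welch t = {t_welch:.3f}, p = {p_welch:.3g}, Cohen's d = {cohens_d:.2f}")
```

Because the two groups have the same size, the Welch t-statistic equals the pooled one here; only the degrees of freedom, and therefore the p-value, differ slightly.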


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [12]:
import numpy as np
from scipy.stats import f_oneway

# Sample data (replace this with your actual data)
store_A_sales = np.array([100, 110, 105, 95, 98, 105, 102, 108, 103, 100, 98, 105, 102, 108, 103, 100, 98, 105, 102, 108, 103, 100, 98, 105, 102, 108, 103, 100, 98, 105])
store_B_sales = np.array([90, 95, 100, 92, 88, 92, 94, 98, 95, 90, 92, 92, 94, 98, 95, 90, 92, 92, 94, 98, 95, 90, 92, 92, 94, 98, 95, 90, 92, 92])
store_C_sales = np.array([120, 115, 110, 112, 118, 120, 125, 115, 110, 112, 118, 120, 125, 115, 110, 112, 118, 120, 125, 115, 110, 112, 118, 120, 125, 115, 110, 112, 118, 120])

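# Perform a one-way ANOVA. Note: this treats the three stores as independent
# samples and ignores that all stores were recorded on the same 30 days,
# which a repeated measures ANOVA would take into account.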
f_statistic, p_value = f_oneway(store_A_sales, store_B_sales, store_C_sales)

print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 261.91029466157113
p-value: 1.5199977499031654e-37
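
Because every store is observed on the same 30 days, the days play the role of subjects and the store is the within-subject factor. A minimal sketch of a repeated measures ANOVA with `statsmodels.stats.anova.AnovaRM`, followed by Bonferroni-corrected paired t-tests as the post-hoc step. It reuses the sample sales figures from the cell above; the long-format column names (`day`, `store`, `sales`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

sales = {
    "A": np.array([100, 110, 105, 95, 98, 105, 102, 108, 103, 100, 98, 105, 102, 108, 103,
                   100, 98, 105, 102, 108, 103, 100, 98, 105, 102, 108, 103, 100, 98, 105]),
    "B": np.array([90, 95, 100, 92, 88, 92, 94, 98, 95, 90, 92, 92, 94, 98, 95,
                   90, 92, 92, 94, 98, 95, 90, 92, 92, 94, 98, 95, 90, 92, 92]),
    "C": np.array([120, 115, 110, 112, 118, 120, 125, 115, 110, 112, 118, 120, 125, 115, 110,
                   112, 118, 120, 125, 115, 110, 112, 118, 120, 125, 115, 110, 112, 118, 120]),
}
n_days = 30

# Long format: one row per (day, store) combination
long = pd.DataFrame({
    "day": np.tile(np.arange(n_days), len(sales)),
    "store": np.repeat(list(sales.keys()), n_days),
    "sales": np.concatenate(list(sales.values())),
})

# Repeated measures ANOVA: day is the subject, store is the within factor
rm = AnovaRM(long, depvar="sales", subject="day", within=["store"]).fit()
print(rm.anova_table)

# Post-hoc: paired t-tests between stores with a Bonferroni correction
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
raw_p = [ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"Store {a} vs Store {b}: adjusted p = {p:.3g}, reject = {r}")
```

`AnovaRM` needs exactly one observation per day and store, which holds for this data. With missing days, a mixed-effects model would be the more flexible choice.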
