# 1] Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

## Assumptions
## 1) Normality of the sampling distribution of the mean
### => The distribution of the sample means is normally distributed.
## 2) Absence of outliers
### => Outlying scores distort the group means and variances, so they need to be identified and removed from the dataset.
## 3) Homogeneity of variance
### => The population variances at the different levels of each independent variable are equal.
## 4) Samples are independent and random
### => Observations are sampled at random, and no observation influences another.
## Violations
## 1) Non-Normality:
### => If the data in each group or population is not normally distributed, the ANOVA results may be inaccurate. For example, if the data is skewed or has outliers, the assumption of normality may not be met.

## 2) Non-Homogeneity of Variance:
### => If the variances of the groups are not equal, the ANOVA results may be unreliable. For example, if one group has a much larger variance than the others, the F-test can report too many false positives, especially when the group sizes are unequal.

## 3) Dependence:
### => If the observations within each group or population are not independent of each other, the ANOVA results may be biased. For example, if the same subjects are measured under different conditions, the assumption of independence may not be met.
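
### => The normality and equal-variance assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming hypothetical simulated data for three groups; it uses the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and Levene's test (`scipy.stats.levene`) for homogeneity of variance. The group names, sizes, and parameters are illustrative assumptions, not data from this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups, simulated only for illustration.
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=12, scale=2, size=30),
    "C": rng.normal(loc=11, scale=6, size=30),  # deliberately larger spread
}

# Shapiro-Wilk: the null hypothesis is that a sample comes from a normal distribution.
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Levene: the null hypothesis is that all groups have equal variances.
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

### => A small p-value in either test suggests that the corresponding assumption may be violated. Independence cannot be tested this way; it has to follow from the study design.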

# 2] What are the three types of ANOVA, and in what situations would each be used?

## 1) One-way ANOVA: 
### => This type of ANOVA is used when there is only one independent variable, such as a treatment or a condition, that has three or more levels. One-way ANOVA compares the means of the groups to determine whether there are any significant differences between them.

## 2) Two-way ANOVA: 
### => This type of ANOVA is used when there are two independent variables that are being studied simultaneously, and both variables have two or more levels. Two-way ANOVA tests for the main effects of each independent variable as well as their interaction effect.
## 3) N-way ANOVA:
### => This type of ANOVA is used when there are more than two independent variables (for example, a three-way ANOVA with three factors). It tests for the main effect of each independent variable as well as the interaction effects among them.

# 3] What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

### => The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of breaking down the total variation in the data into different sources of variation. In ANOVA, variance is partitioned into two types: within-group variance and between-group variance.

### 1) Within-group variance is the amount of variation in the data that is due to differences within each group or condition being compared. It reflects the degree of variability among the individual data points within each group.

### 2) Between-group variance is the amount of variation in the data that is due to differences between the groups or conditions being compared. It reflects the degree of difference among the means of each group.

### => Understanding the partitioning of variance is important because it allows us to determine the relative contributions of different factors to the overall variation in the data. This information can help us to identify which factors are most important in explaining the differences between groups or conditions, and to determine whether any observed differences are statistically significant. It also helps us to interpret the results of ANOVA and draw valid conclusions about the effects of the factors being studied.
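
### => The partition described above can be verified by hand. The following is a minimal sketch in plain Python, assuming a small illustrative dataset of three groups with two observations each. It shows that the total sum of squares equals the between-group plus the within-group sum of squares, and that the F-statistic is the ratio of the two mean squares.

```python
# Small illustrative dataset (an assumption for this sketch): three groups of two values.
groups = {"A": [1, 2], "B": [3, 4], "C": [5, 6]}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Total variation: squared distance of every observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group variation: squared distance of each group mean from the
# grand mean, weighted by the group size.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group variation: squared distance of each observation from its own group mean.
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

k = len(groups)      # number of groups
n = len(all_values)  # total number of observations
f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))

print(ss_total, ss_between, ss_within)  # 17.5 16.0 1.5
print(f_statistic)                      # 16.0
```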






# 4] How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'], 'value': [1, 2, 3, 4, 5, 6]})
model = ols('value ~ group', data=df).fit()

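# SST: total variation of the observations around the grand mean
sst = ((df['value'] - df['value'].mean()) ** 2).sum()
# SSE: variation explained by the model (fitted group means around the grand mean)
sse = ((model.fittedvalues - df['value'].mean()) ** 2).sum()
# SSR: residual variation (observations around their fitted group means)
ssr = ((df['value'] - model.fittedvalues) ** 2).sum()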

# 5] In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [1, 2, 1, 2, 1, 2], 'value': [4, 5, 6, 7, 8, 9]})
model = ols('value ~ A + B + A:B', data=df).fit()

In [3]:
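# A and B are numeric columns here, so these are regression coefficients
# (slopes) from the fitted model rather than ANOVA F-tests
main_effect_A = model.params['A']
main_effect_B = model.params['B']
interaction_effect = model.params['A:B']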

# 6] Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

### => The p-value tells us how likely an F-statistic at least this large would be if all group means were equal. Here the p-value of 0.02 is below the usual significance level of 0.05, so we reject the null hypothesis and conclude that at least one group mean differs from the others. The F-statistic of 5.23 means that the between-group mean square is about 5.2 times the within-group mean square. The test does not show which groups differ, so a post-hoc test is needed to locate the differences. Had the p-value been greater than the significance level, we would have failed to reject the null hypothesis and concluded that there is insufficient evidence of a difference between the group means.

# 7] In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

## 1) Pairwise deletion:
### => This method uses all available data for each individual comparison: a case is excluded only from the calculations that involve its missing variable. This keeps more of the sample than listwise deletion, but each comparison may be based on a different subset of cases, which can lead to biased or inconsistent estimates.

## 2) Listwise deletion: 
### => This method involves deleting all cases with any missing data for any variable. This method is more conservative than pairwise deletion but can result in a substantial loss of power, especially if there are many missing cases.

## 3) Imputation: 
### => This method involves estimating missing values based on other observed data. There are various methods of imputation, including mean imputation, regression imputation, and multiple imputation. Imputation can improve the accuracy of the estimates and preserve the sample size, but the choice of method can affect the results.

## 4) Maximum likelihood estimation: 
### => This method involves using all available data to estimate the model parameters. It is a more sophisticated method that can handle missing data without deleting cases or imputing values.
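
### => Two of the methods above, listwise deletion and mean imputation, can be sketched in plain Python. The data below are hypothetical: one row per subject, one score per condition, with `None` marking a missing measurement. This is an illustration under those assumptions, not a recommended workflow.

```python
# Hypothetical repeated-measures data: subject -> scores in three conditions.
scores = {
    "s1": [10.0, 12.0, 14.0],
    "s2": [11.0, None, 15.0],
    "s3": [9.0, 11.0, None],
    "s4": [12.0, 13.0, 16.0],
}

# Listwise deletion: keep only subjects with a complete set of measurements.
complete_cases = {s: row for s, row in scores.items() if None not in row}

# Mean imputation: replace each missing value with the mean of the observed
# values in the same condition.
n_conditions = 3
condition_means = []
for j in range(n_conditions):
    observed = [row[j] for row in scores.values() if row[j] is not None]
    condition_means.append(sum(observed) / len(observed))

imputed = {
    s: [condition_means[j] if v is None else v for j, v in enumerate(row)]
    for s, row in scores.items()
}

print(len(complete_cases))  # 2 subjects remain after listwise deletion
print(imputed["s2"])        # [11.0, 12.0, 15.0]
```

### => Listwise deletion halves the sample in this example, which reduces power. Mean imputation keeps all four subjects but pulls the filled-in values toward the condition mean, which understates the variability in the data.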

# 8] What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

## 1) Tukey's Honestly Significant Difference (HSD) test: 
### => This test is used to compare all possible pairs of group means and control the overall Type I error rate at a desired level (usually 0.05). This test is appropriate when there are a moderate number of groups (e.g., 3-5) and when the assumption of homogeneity of variances is met.

## 2) Bonferroni correction: 
### => This is a more conservative approach that adjusts the significance level for each pairwise comparison to control the overall Type I error rate. Specifically, the significance level for each comparison is divided by the number of comparisons being made. It is appropriate when only a small number of comparisons are planned; with many groups the correction becomes very strict and power drops. It does not by itself address unequal variances.

## 3) Dunnett's test: 
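### => This test is used to compare each group mean to a control or reference group. This test is appropriate when there is a single control group and the research question is focused on comparing the other groups to the control group.

### => Example: if a one-way ANOVA comparing three diets gives a significant result, it only shows that at least one mean differs. A post-hoc test such as Tukey's HSD is then needed to find out which pairs of diets differ.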
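
### => The Bonferroni adjustment described above is simple enough to sketch in plain Python. The group labels and the unadjusted p-values below are hypothetical placeholders; in practice they would come from pairwise tests on real data.

```python
from itertools import combinations

groups = ["A", "B", "C"]
alpha = 0.05

# Every pair of groups is one comparison: k * (k - 1) / 2 in total.
pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

# Bonferroni: divide the significance level by the number of comparisons.
adjusted_alpha = alpha / n_comparisons

# Hypothetical unadjusted p-values from pairwise tests (placeholders only).
raw_p_values = {("A", "B"): 0.010, ("A", "C"): 0.030, ("B", "C"): 0.200}

significant = {pair: raw_p_values[pair] < adjusted_alpha for pair in pairs}

print(n_comparisons)   # 3
print(significant)     # only ("A", "B") stays significant after the correction
```

### => The pair with p = 0.030 would be significant at 0.05 on its own, but not at the adjusted level of about 0.0167. This is the conservatism that the correction trades for control of the overall Type I error rate.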

# 9] A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [4]:
import numpy as np
from scipy.stats import f_oneway

diet_a = np.array([5.2, 6.8, 7.5, 6.1, 5.9, 4.8, 7.2, 6.5, 7.1, 5.6,
                   6.4, 4.9, 5.3, 6.9, 7.3, 6.2, 5.4, 6.0, 6.6, 5.7,
                   6.3, 5.5, 6.7, 5.8, 7.0])
diet_b = np.array([4.2, 2.8, 3.5, 5.1, 4.9, 5.8, 3.2, 4.5, 3.1, 5.6,
                   4.4, 6.9, 4.3, 3.1, 2.7, 4.2, 3.4, 4.0, 3.4, 4.3,
                   3.7, 4.5, 2.7, 4.8, 3.0])
diet_c = np.array([1.2, 2.8, 1.5, 3.1, 1.9, 2.8, 2.2, 2.5, 3.1, 2.6,
                   1.4, 2.9, 3.3, 2.1, 1.7, 3.2, 2.4, 1.0, 2.4, 3.3,
                   1.7, 2.5, 1.7, 3.8, 2.0])

f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 122.22789609312501
p-value: 7.123609487999246e-24


### => The F-statistic is approximately 122.23 and the p-value (about 7.1e-24) is far below 0.05, indicating that there are significant differences between the mean weight loss of the three diets. We reject the null hypothesis that the mean weight loss is equal across the three diets. A post-hoc test would be needed to determine which diets differ from each other.

# 10] A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

### => Assuming that you have a CSV file named 'task_completion_times.csv' that contains the following columns:

## 'software': 
### => the software program used (A, B, or C)
## 'experience': 
### => the employee experience level (novice or experienced)
## 'time': 
### => the time it took each employee to complete the task, in seconds.
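
## Code

data = pd.read_csv('task_completion_times.csv')

# fit a two-way model with both main effects and their interaction
model = ols('time ~ C(software) + C(experience) + C(software):C(experience)', data=data).fit()

# Type II ANOVA table: one F-statistic and p-value per main effect and for the interaction
table = sm.stats.anova_lm(model, typ=2)

print(table)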

# 11] An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [5]:
import scipy.stats as stats

control_scores = np.random.normal(loc=70, scale=10, size=50)

experimental_scores = np.random.normal(loc=75, scale=10, size=50)

In [6]:
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)
print(t_statistic,",", p_value)

-2.5801762481002166 , 0.011358580330239457


In [7]:
import statsmodels.stats.multicomp as mc

tukey_results = mc.pairwise_tukeyhsd(np.concatenate([control_scores, experimental_scores]), 
                                      np.concatenate([np.repeat('control', 50), np.repeat('experimental', 50)]))

print(tukey_results)


  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower upper  reject
--------------------------------------------------------
control experimental   4.4657 0.0114 1.031 7.9004   True
--------------------------------------------------------


# 12] A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [8]:
store_a_sales = np.random.normal(loc=100, scale=10, size=30)
store_b_sales = np.random.normal(loc=110, scale=10, size=30)
store_c_sales = np.random.normal(loc=120, scale=10, size=30)

# combine the sales data from all three stores
sales_data = np.concatenate([store_a_sales, store_b_sales, store_c_sales])

# create a grouping variable for the stores
store_groups = np.concatenate([np.repeat('A', 30), np.repeat('B', 30), np.repeat('C', 30)])

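# perform one-way ANOVA (note: this treats the stores as independent groups
# and ignores that the same 30 days are measured for every store)
f_statistic, p_value = stats.f_oneway(store_a_sales, store_b_sales, store_c_sales)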


In [9]:
tukey_results = mc.pairwise_tukeyhsd(sales_data, store_groups)

# print results
print(tukey_results)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B    9.364 0.0014  3.1994 15.5286   True
     A      C  18.4225    0.0 12.2579 24.5871   True
     B      C   9.0584 0.0021  2.8938  15.223   True
----------------------------------------------------
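
### => Because the same 30 days are observed for every store, a repeated measures ANOVA removes the day-to-day variation from the error term. The following is a minimal sketch that computes it by hand with NumPy and `scipy.stats.f`, assuming freshly simulated data with the same means and spread as the cell above (the seed and the array layout are assumptions of this sketch, so the numbers will not match the output shown earlier). statsmodels also provides an `AnovaRM` class for this analysis.

```python
import numpy as np
from scipy import stats

# Simulated data: 30 days (rows) by 3 stores (columns), for illustration only.
rng = np.random.default_rng(0)
sales = np.column_stack([
    rng.normal(loc=100, scale=10, size=30),  # Store A
    rng.normal(loc=110, scale=10, size=30),  # Store B
    rng.normal(loc=120, scale=10, size=30),  # Store C
])
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total variation into stores (the effect of interest),
# days (the repeated "subjects"), and the remaining error.
ss_total = ((sales - grand_mean) ** 2).sum()
ss_stores = n_days * ((sales.mean(axis=0) - grand_mean) ** 2).sum()
ss_days = n_stores * ((sales.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

f_statistic = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_statistic, df_stores, df_error)  # upper-tail probability

print("F-statistic:", f_statistic)
print("p-value:", p_value)
```

### => This sketch assumes a balanced design with no missing days and does not test or correct for sphericity. For a repeated measures design, paired comparisons with a multiple-comparison correction are a more natural follow-up than Tukey's HSD, which assumes independent groups.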
