### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

***ANOVA assumptions:***

* The observations are independent and identically distributed.
* The populations have equal variances.
* The populations are normally distributed.
    
**Examples of violations that could impact the validity of ANOVA results are:**
    
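* **Violation of independence:** if the observations are not independent (for example, several measurements from the same subject are treated as separate observations), the Type I error rate can be inflated.
* **Violation of equal variances:** if the group variances differ, especially when the group sizes are also unequal, the F-test can reject too often or too rarely, so its p-value is no longer trustworthy.
* **Violation of normality:** if the populations are strongly skewed or contain outliers, particularly with small samples, the p-values and confidence intervals can be inaccurate. With large samples the F-test is fairly robust to mild non-normality.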
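
The normality and equal-variance assumptions can be checked before running the ANOVA. The sketch below is only an illustration: the three samples are simulated (they are not data from this assignment), and it assumes SciPy's `shapiro` and `levene` tests are an acceptable way to screen the data. Independence cannot be tested this way; it has to follow from the study design.

```python
import numpy as np
from scipy import stats

# Simulated, made-up samples used only to demonstrate the checks
rng = np.random.default_rng(0)
samples = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=11, scale=2, size=30),
    "C": rng.normal(loc=12, scale=2, size=30),
}

# Normality: Shapiro-Wilk per group (H0: the sample comes from a normal distribution)
normality_p = {}
for name, values in samples.items():
    w_stat, p_norm = stats.shapiro(values)
    normality_p[name] = p_norm
    print(f"Shapiro-Wilk, group {name}: W = {w_stat:.3f}, p = {p_norm:.3f}")

# Equal variances: Levene's test (H0: all groups have the same variance)
lev_stat, p_var = stats.levene(*samples.values())
print(f"Levene: W = {lev_stat:.3f}, p = {p_var:.3f}")
```

A small p-value in either test is a warning sign that the corresponding assumption may not hold; it is a screening aid, not proof.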

### Q2. What are the three types of ANOVA, and in what situations would each be used?

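The three types of ANOVA are:

1. One-way ANOVA: used to compare the means of three or more groups defined by a single independent variable (factor). With only two groups it is equivalent to an equal-variance two-sample t-test.
2. Two-way ANOVA: used to analyze the effects of two independent variables on a dependent variable, including whether the two variables interact.
3. Repeated measures ANOVA: used when the same subjects are measured multiple times or under different conditions, so the measurements are not independent.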

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

* The partitioning of variance in ANOVA means splitting the total variability in the data (the total sum of squares) into components attributable to different sources: differences between groups, interaction effects (when there is more than one factor), and residual error within groups.

* Understanding this concept is important because it helps to identify the sources of variation in the data and determine the extent to which group differences are responsible for the observed effects.
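
For a one-way design the partition can be written out explicitly. The notation here is an assumption introduced for illustration: $x_{ij}$ is observation $i$ in group $j$, $n_j$ is the size of group $j$, $\bar{x}_j$ is the mean of group $j$, $\bar{x}$ is the grand mean, $k$ is the number of groups, and $N$ is the total number of observations.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{\text{total}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{\text{within groups}}
$$

The F-statistic compares the two components after dividing each by its degrees of freedom:

$$
F = \frac{\text{SS}_{\text{between}}/(k-1)}{\text{SS}_{\text{within}}/(N-k)}
$$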

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import numpy as np

group1 = [1, 2, 3, 4]
group2 = [2, 4, 6, 8]
group3 = [3, 6, 9, 12]
groups = [np.array(g, dtype=float) for g in (group1, group2, group3)]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Total sum of squares: squared deviations of every observation from the grand mean
sst = np.sum((all_data - grand_mean) ** 2)

# Explained (between-group) sum of squares: group size times the squared
# deviation of each group mean from the grand mean
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Residual (within-group) sum of squares: squared deviations of each
# observation from its own group mean
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print("SST =", sst)
print("SSE =", sse)
print("SSR =", ssr)


SST = 120.0
SSE = 50.0
SSR = 70.0
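
The sums of squares lead directly to the F-statistic once each is divided by its degrees of freedom. The sketch below recomputes the between-group and within-group sums of squares for the same three groups and forms the F-ratio; the helper name `one_way_anova_table` is hypothetical and introduced only for this illustration.

```python
import numpy as np

def one_way_anova_table(*samples):
    """Return (ss_between, ss_within, df_between, df_within, f_stat)."""
    groups = [np.asarray(s, dtype=float) for s in samples]
    all_data = np.concatenate(groups)
    grand_mean = all_data.mean()

    # Between-group (explained) and within-group (residual) sums of squares
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    # Degrees of freedom: k - 1 between groups, N - k within groups
    df_between = len(groups) - 1
    df_within = len(all_data) - len(groups)

    # F is the ratio of the two mean squares
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_b, ss_w, df_b, df_w, f_stat = one_way_anova_table(
    [1, 2, 3, 4], [2, 4, 6, 8], [3, 6, 9, 12]
)
print(f"SS_between = {ss_b}, SS_within = {ss_w}")
print(f"df = ({df_b}, {df_w}), F = {f_stat:.4f}")
```

Calling `scipy.stats.f_oneway` on the same three groups should give the same F-statistic, which is a convenient cross-check on the manual calculation.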


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [3]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
# Example data
data = {'y': [1, 2, 3, 4, 5, 6], 'A': ['a1', 'a1', 'a2', 'a2', 'a3', 'a3'], 'B': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2']}
df = pd.DataFrame(data)

# Fit the model
model = ols('y ~ C(A) + C(B) + C(A):C(B)', data=df).fit()

# Perform ANOVA and print results
anova_table = sm.stats.anova_lm(model)
print(anova_table)

            df        sum_sq       mean_sq    F  PR(>F)
C(A)       2.0  1.600000e+01  8.000000e+00  0.0     NaN
C(B)       1.0  1.500000e+00  1.500000e+00  0.0     NaN
C(A):C(B)  2.0  1.232595e-30  6.162976e-31  0.0     NaN
Residual   0.0  1.972152e-30           inf  NaN     NaN


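Note: this example has only one observation per combination of A and B, so the model with the interaction term leaves 0 residual degrees of freedom. The residual mean square is therefore undefined, and the F-statistics and p-values in the table (0.0 and NaN) cannot be interpreted. At least two observations per cell are needed to test both main effects and the interaction.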
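
To show where the main-effect and interaction sums of squares come from, here is a minimal sketch that computes them by hand for a balanced design with replication. The data are made up (2 levels of A, 2 levels of B, 2 replicates per cell), and the formulas used apply to balanced designs only.

```python
import numpy as np

# Made-up balanced data: y[i, j, r] is replicate r in cell (A level i, B level j)
y = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[5.0, 6.0], [9.0, 10.0]]])
a, b, r = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # mean of each level of A
mean_b = y.mean(axis=(0, 2))   # mean of each level of B
cell = y.mean(axis=2)          # mean of each A-by-B cell

# Main effects: deviations of the marginal means from the grand mean
ss_a = b * r * np.sum((mean_a - grand) ** 2)
ss_b = a * r * np.sum((mean_b - grand) ** 2)

# Interaction: what is left of the cell means after removing both main effects
ss_ab = r * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)

# Residual: variation of the replicates around their own cell mean
ss_resid = np.sum((y - cell[:, :, None]) ** 2)

df_a, df_b, df_ab = a - 1, b - 1, (a - 1) * (b - 1)
df_resid = a * b * (r - 1)
ms_resid = ss_resid / df_resid

f_a = (ss_a / df_a) / ms_resid
f_b = (ss_b / df_b) / ms_resid
f_ab = (ss_ab / df_ab) / ms_resid

print(f"A:        SS = {ss_a}, F = {f_a}")
print(f"B:        SS = {ss_b}, F = {f_b}")
print(f"A:B:      SS = {ss_ab}, F = {f_ab}")
print(f"Residual: SS = {ss_resid}, df = {df_resid}")
```

Because every cell has two replicates, the residual has 4 degrees of freedom and all three F-ratios are defined.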


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

With an F-statistic of 5.23 and a p-value of 0.02, the result is significant at the usual 0.05 level, so we reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. The p-value means that, if the null hypothesis were true, an F-statistic at least as large as 5.23 would be expected in only about 2% of samples; it is not the probability that the null hypothesis is true. The ANOVA itself does not show which specific groups differ, so a post-hoc test is needed to find out.

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

* How to handle missing data in a repeated measures ANOVA depends on why the data are missing.
* If the data are missing completely at random (MCAR), one option is listwise deletion: drop every subject that has a missing measurement. This does not bias the estimates under MCAR, but it shrinks the sample and reduces power.
* Another option is to impute the missing values using methods such as mean imputation or regression imputation. Mean imputation keeps the sample size but artificially reduces the variability in the data, which can make effects look more precise than they are.
* If the data are missing not at random (MNAR), meaning the chance of a value being missing depends on the unobserved value itself (for example, participants with poor outcomes drop out), then deleting or naively imputing the missing data can bias the analysis.
* Because each method rests on different assumptions, the results may differ between methods, leading to different conclusions about the effect of the independent variable on the outcome variable.
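
The difference between listwise deletion and mean imputation can be seen on a small example. The sketch below uses made-up scores in wide format (rows are subjects, columns are conditions, `np.nan` marks a missing measurement); it only illustrates the mechanics and is not a recommendation of either method.

```python
import numpy as np

# Made-up wide-format data: rows = subjects, columns = conditions
scores = np.array([
    [5.0, 6.0, 7.0],
    [4.0, np.nan, 6.0],
    [6.0, 7.0, 8.0],
    [5.0, 5.0, np.nan],
])

# Listwise deletion: keep only subjects with a complete set of measurements
complete = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: replace each missing value with the mean of its condition
col_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), col_means, scores)

print("Subjects kept after listwise deletion:", complete.shape[0])
print("Condition means used for imputation:", col_means)
print("Imputed data:")
print(imputed)
```

Listwise deletion keeps only two of the four subjects here, while mean imputation keeps all four but fills the gaps with values that sit exactly at the condition mean, which understates the true spread.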

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


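* Common post-hoc tests used after ANOVA include Tukey's Honestly Significant Difference (HSD), the Bonferroni correction, and Scheffe's method.
* Tukey's HSD tests all possible pairwise differences between group means while controlling the family-wise error rate. It was designed for equal sample sizes; the Tukey-Kramer variant handles unequal sizes.
* The Bonferroni correction is a conservative method that divides the significance level by the number of comparisons. It works best when only a small number of comparisons were planned in advance.
* Scheffe's method is more conservative than Tukey's HSD for pairwise comparisons, but it covers any contrast among the means (for example, one group against the average of two others), not just pairs.
* Post-hoc tests are necessary when the ANOVA indicates that there are significant differences between groups but does not identify which specific groups are different. For example, if a one-way ANOVA comparing three diets is significant, a post-hoc test shows whether A differs from B, A from C, or B from C.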
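
The Bonferroni correction is simple enough to apply by hand. In the sketch below the three raw p-values are hypothetical numbers made up for illustration (they are not results from this assignment); the point is only to show how the per-comparison threshold and the adjusted p-values are obtained.

```python
from itertools import combinations

groups = ["A", "B", "C"]
alpha = 0.05

# All pairwise comparisons among the groups
pairs = list(combinations(groups, 2))
m = len(pairs)

# Bonferroni: each comparison is tested at alpha / m
alpha_adj = alpha / m

# Hypothetical raw p-values from pairwise t-tests (made up for illustration)
raw_p = {("A", "B"): 0.030, ("A", "C"): 0.004, ("B", "C"): 0.200}

results = {}
for pair in pairs:
    p = raw_p[pair]
    p_adj = min(1.0, p * m)      # equivalent view: inflate p instead of shrinking alpha
    reject = p < alpha_adj
    results[pair] = (p_adj, reject)
    print(f"{pair[0]} vs {pair[1]}: raw p = {p:.3f}, adjusted p = {p_adj:.3f}, reject = {reject}")
```

With three comparisons the threshold drops from 0.05 to about 0.0167, so the A vs B comparison (raw p = 0.030) is no longer declared significant.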

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [4]:
from scipy.stats import f_oneway
import numpy as np

# Define the data for each diet
diet_A = np.array([1.2, 1.5, 1.8, 2.1, 1.3, 1.6, 1.9, 2.2, 1.4, 1.7, 2.0, 2.3, 1.5, 1.8, 2.1, 2.4, 1.6, 1.9, 2.2, 2.5, 1.7, 2.0, 2.3, 2.6, 1.8])
diet_B = np.array([1.4, 1.7, 2.0, 2.3, 1.5, 1.8, 2.1, 2.4, 1.6, 1.9, 2.2, 2.5, 1.7, 2.0, 2.3, 2.6, 1.8, 2.1, 2.4, 2.7, 1.9, 2.2, 2.5, 2.8, 2.0])
diet_C = np.array([1.6, 1.9, 2.2, 2.5, 1.7, 2.0, 2.3, 2.6, 1.8, 2.1, 2.4, 2.7, 1.9, 2.2, 2.5, 2.8, 2.0, 2.3, 2.6, 2.9, 2.1, 2.4, 2.7, 3.0, 2.2])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)

F-statistic:  7.038948850305019
p-value:  0.0016138769339523271

Interpretation: with an F-statistic of about 7.04 and a p-value of about 0.0016, which is below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet differs from the others, but the ANOVA does not say which; a post-hoc test such as Tukey's HSD would be needed for that. Note that the example data contain 25 values per diet (75 in total) rather than the 50 participants described in the question.


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs.experienced). Report the F-statistics and p-values, and interpret the results.

In [5]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

# create a sample data frame with 3 software programs and 2 experience levels
data = pd.DataFrame({
    'software_program': ['A', 'B', 'C'] * 20,
    'experience_level': ['novice', 'experienced'] * 30,
    'time': [15.6, 12.7, 14.2, 13.3, 12.1, 13.7, 14.5, 12.9, 13.8, 13.2,
             15.6, 12.7, 14.2, 13.3, 12.1, 13.7, 14.5, 12.9, 13.8, 13.2,
             15.1, 14.5, 13.9, 13.2, 15.8, 16.1, 15.2, 14.8, 16.4, 16.6,
             11.9, 10.7, 11.5, 11.8, 11.2, 11.6, 12.3, 11.1, 12.4, 12.6,
             13.7, 14.2, 13.3, 14.1, 14.5, 13.8, 14.6, 13.4, 15.1, 14.9,
             15.1, 14.5, 13.9, 13.7, 14.2, 13.3, 15.2, 14.8, 16.4, 16.6]
})

# fit the ANOVA model with software_program and experience_level as factors
model = ols('time ~ C(software_program) + C(experience_level) + C(software_program):C(experience_level)', data=data).fit()
anova_results = sm.stats.anova_lm(model)

print(anova_results)

                                           df      sum_sq   mean_sq         F  \
C(software_program)                       2.0    0.320333  0.160167  0.075818   
C(experience_level)                       1.0    3.360667  3.360667  1.590834   
C(software_program):C(experience_level)   2.0    2.336333  1.168167  0.552973   
Residual                                 54.0  114.076000  2.112519       NaN   

                                           PR(>F)  
C(software_program)                      0.927084  
C(experience_level)                      0.212629  
C(software_program):C(experience_level)  0.578459  
Residual                                      NaN  

Interpretation: none of the three effects is significant at the 0.05 level. The main effect of software program (F = 0.08, p = 0.93), the main effect of experience level (F = 1.59, p = 0.21), and their interaction (F = 0.55, p = 0.58) all have p-values well above 0.05, so these data give no evidence that completion time depends on the program, on experience, or on a combination of the two. Note that the example data frame has 60 rows (10 per cell) rather than the 30 employees described in the question.


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [6]:
import numpy as np
from scipy.stats import ttest_ind

# Generate random data for the control and experimental groups
# (no random seed is set, so the numbers below will differ from run to run)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

# Compute the t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

# Print the results
print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

t-statistic:  -3.67625761123462
p-value:  0.00038638058667233476


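The p-value of about 0.0004 is below 0.05, so the mean test scores of the two groups differ significantly in this simulated sample. With only two groups, a post-hoc test is not strictly needed, because the t-test already compares the only possible pair; Tukey's HSD is shown here for completeness and gives the same p-value. The pairwise_tukeyhsd() function takes the combined scores, the group labels, and the significance level (0.05 in this case), and returns a table showing which groups differ significantly from each other.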

In [7]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine the scores for the two groups
all_scores = np.concatenate([control_scores, experimental_scores])

# Generate a group variable that indicates which group each score belongs to
groups = np.array(["control"] * 50 + ["experimental"] * 50)

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(all_scores, groups, 0.05)

# Print the results
print(tukey_results)


   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
control experimental   6.9999 0.0004 3.2213 10.7785   True
----------------------------------------------------------


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [8]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

# Create a DataFrame with simulated sales data
# (no random seed is set, so the numbers below will differ from run to run)
data = {'Store': ['A']*30 + ['B']*30 + ['C']*30,
        'Sales': np.concatenate([np.random.normal(50, 10, 30), 
                                  np.random.normal(60, 10, 30), 
                                  np.random.normal(70, 10, 30)])}

df = pd.DataFrame(data)

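# Fit a one-way ANOVA model. Note: this treats the three stores as independent
# groups and does not account for the same 30 days being measured at every store,
# so it is not a true repeated measures ANOVA.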
model = ols('Sales ~ Store', data=df).fit()

# Print the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

               sum_sq    df          F        PR(>F)
Store     5169.946981   2.0  25.665314  1.734360e-09
Residual  8762.514852  87.0        NaN           NaN


In the above example, the sales data for the three stores are simulated; replace them with your own data.

The p-value for the "Store" factor (about 1.7e-09) is far below 0.05, so we reject the null hypothesis and conclude that mean sales differ between the three stores. Keep in mind that the model above is a between-groups one-way ANOVA; a repeated measures analysis would additionally treat each day as a subject measured at all three stores.

To determine which stores differ from each other, follow up with Tukey's HSD test. In the output below, every pair has reject = True, so all three stores differ from one another at the 0.05 level:

In [9]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(df['Sales'], df['Store'], 0.05)

# Print the results
print(tukey_results)

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B  11.0459 0.0001  4.8671 17.2247   True
     A      C  18.4453    0.0 12.2666 24.6241   True
     B      C   7.3994 0.0147  1.2207 13.5782   True
----------------------------------------------------
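
A repeated measures analysis treats each day as a "subject" that is observed at all three stores, and removes the day-to-day variation from the error term. The sketch below does this by hand on made-up sales for 4 days and 3 stores; the numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up sales: rows = days (the repeated "subjects"), columns = stores A, B, C
sales = np.array([
    [10.0, 12.0, 14.0],
    [12.0, 13.0, 17.0],
    [9.0, 11.0, 13.0],
    [13.0, 16.0, 16.0],
])
n_days, n_stores = sales.shape
grand = sales.mean()

# Variation between stores (the effect of interest)
ss_store = n_days * np.sum((sales.mean(axis=0) - grand) ** 2)

# Variation between days (removed from the error term in a repeated measures design)
ss_day = n_stores * np.sum((sales.mean(axis=1) - grand) ** 2)

# What remains is the error used to test the store effect
ss_total = np.sum((sales - grand) ** 2)
ss_error = ss_total - ss_store - ss_day

df_store = n_stores - 1
df_error = (n_days - 1) * (n_stores - 1)
f_rm = (ss_store / df_store) / (ss_error / df_error)

# For comparison: a one-way ANOVA that ignores the day structure
ss_within = ss_day + ss_error
df_within = n_stores * (n_days - 1)
f_oneway_ignoring_days = (ss_store / df_store) / (ss_within / df_within)

print(f"Repeated measures F({df_store}, {df_error}) = {f_rm:.3f}")
print(f"One-way F({df_store}, {df_within}) ignoring days = {f_oneway_ignoring_days:.3f}")
```

Removing the day-to-day variation makes the test of the store effect much more sensitive in this example. The p-value can be obtained from the F distribution, for instance with `scipy.stats.f.sf(f_rm, df_store, df_error)`, and statsmodels also provides an `AnovaRM` class for this kind of design.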
