## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans = ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups or treatments to determine if there are any significant differences among them.

To use ANOVA and ensure the validity of its results, certain assumptions need to be met. These assumptions are :

**Independence** : The observations are independent of one another, both within each group and across groups. This means that the value of one observation should not be influenced by or related to the value of any other observation.

**Normality** : The data in each group should follow a normal distribution. The normality assumption is particularly important when the sample sizes are small, as ANOVA tends to be robust to violations when the sample sizes are large.

**Homogeneity of Variance (Homoscedasticity)** : The variability of the data (variance) within each group should be roughly equal. In other words, the spread of the data points around the group means should be similar across all groups.

**Absence of outliers** : There should be no extreme outliers in the data, because outliers can distort the group means and inflate the variances.


Examples of Violations:

**Violation of Independence**: In some experimental designs, the independence assumption may be violated if there is a hierarchical or nested structure in the data. For example, if you measure the performance of students within different classrooms, the students within the same classroom may not be independent of each other due to shared characteristics or teaching styles.

**Violation of Normality**: If the data in any of the groups deviates significantly from a normal distribution, it can impact the validity of ANOVA results. For instance, if the data is strongly skewed or has heavy tails, the normality assumption may not hold.

**Violation of Homoscedasticity**: Unequal variances among groups can make the F-test unreliable, typically by inflating the Type I error rate, and the problem is worse when group sizes are also unequal. For example, if the variability of test scores in one group is much larger than that in another group, the assumption of homogeneity of variance may not be met.
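
The normality and equal-variance assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming simulated data and assuming the Shapiro-Wilk and Levene tests as the checks (neither is prescribed by the text above); the group names and sizes are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 40 simulated observations each
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(50, 5, 40),
    "B": rng.normal(52, 5, 40),
    "C": rng.normal(55, 5, 40),
}

# Shapiro-Wilk test per group (null hypothesis: the sample is normally distributed)
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Levene's test across groups (null hypothesis: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test signals a possible violation; independence cannot be tested this way and has to be judged from the study design.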

## Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans= The three types of ANOVA (Analysis of Variance) are:

**One-Way ANOVA**:

<br>One-Way ANOVA is used when there is one categorical independent variable (also known as a factor) with three or more levels or groups, and we want to compare the means of a continuous dependent variable across these groups.
<br>It is suitable for situations where we have one factor and want to determine if there are any significant differences in the means of the dependent variable across the different levels of that factor.
<br>Example: Suppose we want to compare the average test scores of students in three different teaching methods (A, B, and C) to see if there is a significant difference in performance.

**Two-Way ANOVA**:

<br>Two-Way ANOVA is used when there are two categorical independent variables (factors), and we want to examine the interaction between these two factors and their effects on a continuous dependent variable.
<br>It is suitable for situations where we have two factors, and we want to investigate how the means of the dependent variable vary across the combinations of levels of both factors.
<br>Example: Suppose we want to analyze the effect of two factors, gender (male and female) and teaching method (A, B, and C), on the performance of students in a test.

**Three-Way ANOVA**:

<br>Three-Way ANOVA is an extension of the two-way ANOVA and is used when there are three categorical independent variables (factors).
<br>It is suitable for situations where we have three factors, and we want to examine their individual effects and their interactions on a continuous dependent variable.
<br>Example: Suppose we want to study the effects of three factors: age group (young, middle-aged, and old), treatment type (A, B, and C), and location (urban and rural) on the response time of participants in a cognitive test.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans = Partitioning of variance in ANOVA refers to the division of the total variability in the data into different components that are associated with various sources of variation.

In ANOVA, the total variability in the data is broken down into two components, between-group and within-group, which together add up to the total:

**Between-Group Variance (Between-Treatments Variance)**:
This component represents the variability in the dependent variable that can be attributed to the differences between the groups or treatments (levels of the independent variable).
It measures the effect of the factors (independent variables) on the dependent variable. A large between-group variance indicates that the groups have different means, suggesting that the factors have a significant effect on the outcome.

**Within-Group Variance (Within-Treatments Variance or Residual Variance)**:
This component represents the variability in the dependent variable that cannot be explained by the differences between the groups. It accounts for the random or unexplained variation within each group.
It measures the variability of data points within each group around the group mean. A large within-group variance suggests that there is considerable variability within the groups, making it harder to detect significant differences between the group means.

**Total Variance**:
This is the overall variability in the data. Expressed as sums of squares, it equals the between-group sum of squares plus the within-group sum of squares.
It represents the total variation in the dependent variable across all groups and treatments.



The importance of understanding the partitioning of variance in ANOVA lies in its ability to provide valuable insights into the significance and effects of the factors being studied. By breaking down the total variance into these components, ANOVA helps researchers assess the proportion of variability in the dependent variable that can be attributed to the factors of interest. This allows us to determine whether there are statistically significant differences between the group means and whether the factors have a significant impact on the outcome.
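
For a one-way design the partition can be written compactly. The notation is an assumption introduced here, not taken from the text: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{\text{total (SST)}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{\text{within groups}}

F = \frac{\text{between-group sum of squares}/(k-1)}{\text{within-group sum of squares}/(N-k)}
```

The F-statistic compares the two components after each is divided by its degrees of freedom.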

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
#Creating the Data

import pandas as pd

#create pandas DataFrame
df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

#view first five rows of DataFrame
df.head()

   hours  score
0      1     68
1      1     76
2      1     74
3      2     80
4      2     76


In [2]:
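#Fit a regression model
#(hours is treated as a numeric predictor here; a one-way ANOVA would treat it as a categorical factor)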
import statsmodels.api as sm

#define response variable
y = df['score']

#define predictor variable
x = df[['hours']]

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

In [3]:
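#Calculate SSE, SSR, SST
#Naming note: in this cell SSE is the sum of squared errors (the residual sum of squares)
#and SSR is the regression (explained) sum of squares, which is the reverse of the
#abbreviations used in the question above.
import numpy as np

#calculate sse: squared differences between fitted and observed values (unexplained part)
sse = np.sum((model.fittedvalues - df.score)**2)
print("SSE : ", sse)


#calculate ssr: squared differences between fitted values and the overall mean (explained part)
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print("SSR : ", ssr)


#calculate sst: total sum of squares, equal to ssr + sse for a least-squares fit with an intercept
sst = ssr + sse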
print("SST : ", sst)

SSE :  331.07488479262696
SSR :  917.4751152073725
SST :  1248.5499999999995
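
The cell above obtains the sums of squares from a regression fit. For a one-way ANOVA with explicit groups, the same partition can be computed directly from the group means. This is a minimal sketch on a small hypothetical dataset (three groups of three values), chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical data: three groups with means 2, 5 and 8 (grand mean 5)
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained) sum of squares: group size times squared distance
# of each group mean from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group (residual) sum of squares: squared distances from each group's own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total sum of squares: squared distances of every value from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# F-statistic: ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_between, ss_within, ss_total)  # 54.0 6.0 60.0
print(f_stat)                           # 27.0
```

The total equals the sum of the two components (54 + 6 = 60), which is the partition the question asks for.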


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [4]:
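# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
# Note: Fertilizer and Watering are built with the same np.repeat call, so the two
# columns are identical. The factors are perfectly confounded, which means their main
# effects and interaction cannot be separated (hence the huge Cond. No. in the summary).
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})


# Performing two-way ANOVA
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',data=dataframe).fit()
# anova_lm takes the keyword `typ`, not `type`. The original `type=2` was ignored, so the
# table shown is the default Type I (sequential) table; typ=1 makes that explicit.
result = sm.stats.anova_lm(model, typ=1)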

# Print the result
print(result)

model.summary()

                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.000369  0.000369  0.000133  0.990865
C(Fertilizer):C(Watering)   1.0   0.040866  0.040866  0.014796  0.904053
Residual                   28.0  77.333333  2.761905       NaN       NaN


Dep. Variable:              height   R-squared:               0.000
Model:                         OLS   Adj. R-squared:         -0.035
Method:              Least Squares   F-statistic:           0.01207
Date:             Sun, 01 Dec 2024   Prob (F-statistic):      0.913
Time:                     06:15:50   Log-Likelihood:        -56.772
No. Observations:               30   AIC:                     117.5
Df Residuals:                   28   BIC:                     120.3
Df Model:                        1
Covariance Type:         nonrobust

                                                   coef  std err       t  P>|t|  [0.025  0.975]
Intercept                                       14.8000    0.429  34.491  0.000  13.921  15.679
C(Fertilizer)[T.weekly]                         -0.0222    0.202  -0.110  0.913  -0.437   0.392
C(Watering)[T.weekly]                           -0.0222    0.202  -0.110  0.913  -0.437   0.392
C(Fertilizer)[T.weekly]:C(Watering)[T.weekly]   -0.0222    0.202  -0.110  0.913  -0.437   0.392

Omnibus:          0.177   Durbin-Watson:        0.916
Prob(Omnibus):    0.915   Jarque-Bera (JB):     0.011
Skew:             0.029   Prob(JB):             0.995
Kurtosis:         2.929   Cond. No.          2.61e+32


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans = In a one-way ANOVA, the F-statistic is used to test whether the means of three or more groups differ. The null hypothesis states that all group means are equal.

The F-statistic:

The F-statistic is the ratio of the variability between groups to the variability within groups. A value well above 1 suggests that the group means differ by more than would be expected from within-group variation alone.

The p-value:

The p-value is the probability of obtaining an F-statistic at least as large as the observed one if the null hypothesis were true. A smaller p-value means the observed differences are less likely to have arisen by chance alone.

Interpretation:

Since the p-value (0.02) is less than the significance level (α) of 0.05 (the conventional choice, assumed here because none is stated), we reject the null hypothesis.
This means that there is sufficient evidence to conclude that at least one group mean differs from the others. The F-test alone does not identify which groups differ; a post-hoc test is needed for that.
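
The decision rule can be sketched in code. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical, and the resulting p-value is not expected to match the reported 0.02 exactly.

```python
from scipy import stats

f_stat = 5.23   # F-statistic from the question
alpha = 0.05    # assumed significance level

# Hypothetical degrees of freedom: 3 groups and 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_stat, df_between, df_within)

# Equivalent view: compare F with the critical value at the chosen alpha
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)
reject = f_stat > f_crit

print(f"p-value: {p_value:.4f}, critical F: {f_crit:.3f}, reject H0: {reject}")
```

Both views give the same decision: the null hypothesis is rejected whenever the p-value is below alpha, which is the same as F exceeding the critical value.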

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans = In a repeated measures ANOVA, missing data can occur when participants have incomplete responses or when data is lost during data collection or processing. Handling missing data appropriately is crucial because it can impact the validity and reliability of the results.

There are several methods to handle missing data in a repeated measures ANOVA:

**Complete Case Analysis (Listwise Deletion)**: This method involves excluding any case with missing data in any of the variables being analyzed. While it is the simplest approach, it can lead to biased results if the data is not missing completely at random. It can also reduce the sample size and statistical power, potentially leading to less reliable results.

**Mean Imputation**: In this method, missing data in a variable are replaced by the mean of that variable from the observed cases. While this is a straightforward approach, it may distort the distribution of the variable and underestimate the standard error, leading to overly optimistic statistical significance.

**Last Observation Carried Forward (LOCF)**: LOCF imputes missing data with the value of the last observed data point. This method assumes that a participant's value stays unchanged after the last observation, which can bias the estimates when the outcome changes over time.

**Multiple Imputation**: Multiple imputation involves creating multiple plausible imputations for the missing data, incorporating uncertainty in the imputation process. This approach can provide more reliable estimates and standard errors. However, it can be computationally intensive and may require making assumptions about the missing data mechanism.

**Maximum Likelihood Estimation (MLE)**: MLE is a statistical approach that estimates parameters by maximizing the likelihood function. In the context of missing data, it allows for the use of all available data and provides unbiased estimates under the assumption that data is missing at random.

**Pattern-Mixture Models**: These models involve considering different patterns of missingness and fitting separate models for each pattern. This approach can be complex but may provide more accurate estimates when the missing data mechanism is related to the outcome.
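
The three simplest methods above can be compared on a toy example. This is a minimal sketch assuming a hypothetical wide-format table with one row per subject and one column per time point (`t1` to `t3`); multiple imputation and likelihood-based methods need dedicated modelling tools and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: 4 subjects, 3 time points, two missing values
wide = pd.DataFrame(
    {"t1": [10.0, 12.0, 11.0, 13.0],
     "t2": [11.0, np.nan, 12.0, 14.0],
     "t3": [12.0, 14.0, np.nan, 15.0]},
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: keep only subjects observed at every time point
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward along the time axis
locf = wide.ffill(axis=1)

print(complete_cases)
print(mean_imputed)
print(locf)
```

Listwise deletion halves the sample here (two of four subjects remain), while the two imputation methods keep all subjects but fill in values that carry no real information, which is why they tend to understate variability.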

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans = Post-hoc tests are used in ANOVA to compare specific pairs of groups after a significant main effect or interaction effect has been found.

Some common post-hoc tests are:

**Tukey's Honestly Significant Difference (HSD)**: Tukey's HSD test is widely used when comparing all possible pairs of group means. It controls the familywise error rate, keeping the chance of any false positive across all comparisons at the desired level (typically 0.05). This test is appropriate when you have equal group sizes and homogeneity of variances.

**Bonferroni Correction**: The Bonferroni correction is a simple method to adjust the significance level for multiple comparisons. It divides the desired alpha level (usually 0.05) by the number of comparisons being made. This method is more conservative but can be applied to any set of comparisons.

**Sidak Correction**: Similar to Bonferroni, the Sidak correction is another way to adjust the significance level for multiple comparisons. It is considered slightly more powerful than Bonferroni.

**Dunnett's Test**: Dunnett's test is used when you have one control group and you want to compare it to multiple treatment groups. It controls the Type I error rate for this set of comparisons against the control, rather than for all possible pairs.

**Scheffé Test**: The Scheffé test is more conservative than Tukey's HSD. It is suitable when group sizes are unequal and when complex contrasts, not only pairwise comparisons, are of interest. It still assumes homogeneous variances.

**Fisher's Least Significant Difference (LSD)**: Fisher's LSD test is less conservative than Tukey's HSD. It does not control the familywise error rate on its own, so it is usually applied only after a significant overall F-test and when there are few groups.

The choice of post-hoc test depends on the research question, the number of groups, and the prior knowledge about which groups are likely to differ. A post-hoc test might be necessary when an ANOVA indicates a significant difference between groups but does not identify which specific groups differ.

For example, a researcher might conduct an ANOVA to examine the effect of different instructional methods on student achievement. If the ANOVA shows a significant main effect of instructional method, the researcher might use a post-hoc test to compare the mean scores of each instructional method to identify which methods are significantly different from each other.
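
The Bonferroni and Sidak adjustments described above are simple enough to compute by hand. This is a small sketch assuming three groups (hence three pairwise comparisons) and an overall alpha of 0.05; both numbers are illustrative.

```python
from math import comb

alpha = 0.05       # desired familywise error rate
n_groups = 3       # assumed number of groups

# Number of pairwise comparisons among the groups
n_comparisons = comb(n_groups, 2)

# Bonferroni: divide alpha by the number of comparisons
bonferroni_alpha = alpha / n_comparisons

# Sidak: per-comparison level that gives the familywise rate exactly
# when the comparisons are independent
sidak_alpha = 1 - (1 - alpha) ** (1 / n_comparisons)

print(f"comparisons: {n_comparisons}")
print(f"Bonferroni per-comparison alpha: {bonferroni_alpha:.6f}")
print(f"Sidak per-comparison alpha:      {sidak_alpha:.6f}")
```

The Sidak threshold is slightly larger than the Bonferroni threshold, which is the sense in which it is slightly more powerful.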

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.



In [5]:
import numpy as np
import scipy.stats as stats

# Generate simulated data assuming normal distribution with same variance
# (this simulation draws 50 values per diet; the question describes 50 participants in total)
np.random.seed(20)
diet_A = np.random.normal(5, 1, 50)
diet_B = np.random.normal(4, 1, 50)
diet_C = np.random.normal(3, 1, 50)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Set significance level
alpha = 0.05

# Null hypothesis: The mean weight loss is the same for all three diets.
# Alternative hypothesis: The mean weight loss is different for at least one diet.
null_hypothesis = "The mean weight loss is the same for all three diets."
alternate_hypothesis = "The mean weight loss is different for at least one diet."

print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("We reject the null hypothesis.")
    print(f"Final Conclusion : {alternate_hypothesis}")
else:
    print("We fail to reject the null hypothesis.")
    print("Final Conclusion : There is not enough evidence that the mean weight loss differs between the diets.")

F-statistic: 41.80444706032352
p-value: 4.2309140010930765e-15
We reject the null hypothesis.
Final Conclusion : The mean weight loss is different for at least one diet.


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate random data for the example (replace this with your actual data)
# Note: 90 simulated observations are used here, whereas the question mentions 30 employees
np.random.seed(42)

# Software programs: A, B, C
software_programs = np.random.choice(['A', 'B', 'C'], size=90)

# Employee experience level: Novice, Experienced
experience_level = np.random.choice(['Novice', 'Experienced'], size=90)

# Random time data, drawn independently of program and experience level (so no true effects exist)
time_to_complete_task = np.random.normal(loc=20, scale=5, size=90)

# Create a DataFrame
data = pd.DataFrame({'Software': software_programs,
                     'ExperienceLevel': experience_level,
                     'Time': time_to_complete_task})

# Perform the two-way ANOVA (anova_lm returns Type I, i.e. sequential, sums of squares by default)
model = ols('Time ~ C(Software) + C(ExperienceLevel) + C(Software):C(ExperienceLevel)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Report the results
print(anova_table)


                                  df       sum_sq    mean_sq         F  \
C(Software)                      2.0     9.309580   4.654790  0.216246   
C(ExperienceLevel)               1.0    31.851905  31.851905  1.479736   
C(Software):C(ExperienceLevel)   2.0    52.479686  26.239843  1.219018   
Residual                        84.0  1808.132913  21.525392       NaN   

                                  PR(>F)  
C(Software)                     0.805984  
C(ExperienceLevel)              0.227223  
C(Software):C(ExperienceLevel)  0.300694  
Residual                             NaN  

Interpretation: none of the p-values is below 0.05 (Software: F = 0.216, p = 0.806; ExperienceLevel: F = 1.480, p = 0.227; interaction: F = 1.219, p = 0.301), so we fail to reject the null hypotheses of no main effects and no interaction effect. This is expected here, because the simulated times were generated independently of both factors.


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [7]:
import numpy as np
import scipy.stats as stats

# Generate random data for the example (replace this with your actual data)
np.random.seed(42)

control_group = np.random.normal(loc=75, scale=5, size=100)
experimental_group = np.random.normal(loc=80, scale=6, size=100)

# Perform two-sample t-test
# (ttest_ind assumes equal variances by default; pass equal_var=False for Welch's t-test)
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Report the results
print("Two-sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")

Two-sample t-test:
t-statistic: -7.738786904885968
p-value: 5.026085102727666e-13
There is a significant difference in test scores between the control and experimental groups.


In [8]:
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine the test scores and group information into a DataFrame
data = pd.DataFrame({'Test_Score': np.concatenate([control_group, experimental_group]),
                     'Group': ['Control'] * 100 + ['Experimental'] * 100})

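# Perform Tukey's HSD post-hoc test
# (with only two groups there is a single comparison, so this repeats the t-test conclusion;
# a post-hoc test only adds information when three or more groups are compared)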
tukey_results = pairwise_tukeyhsd(data['Test_Score'], data['Group'])

# Report the results
print("\nTukey's HSD post-hoc test:")
print(tukey_results)


Tukey's HSD post-hoc test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   5.6531   0.0 4.2125 7.0936   True
--------------------------------------------------------


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [9]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Generate random data for the example (replace this with your actual data)
np.random.seed(42)

store_A_sales = np.random.normal(loc=1000, scale=100, size=30)
store_B_sales = np.random.normal(loc=950, scale=90, size=30)
store_C_sales = np.random.normal(loc=1100, scale=110, size=30)

# Combine the sales data and group information into a DataFrame
data = pd.DataFrame({'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales]),
                     'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30})

# Perform one-way ANOVA
# Note: stats.f_oneway is a between-groups one-way ANOVA that treats the three samples as
# independent. It is not a repeated measures ANOVA, because it ignores that the same 30 days
# were observed for every store.
F_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)

# Report the results
print("One-way ANOVA (between groups):")
print("F-statistic:", F_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There is a significant difference in average daily sales between the three stores.")
else:
    print("There is no significant difference in average daily sales between the three stores.")

One-way ANOVA (between groups):
F-statistic: 23.62763182315457
p-value: 6.369054894762179e-09
There is a significant difference in average daily sales between the three stores.


In [10]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD post-hoc test
tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])

# Report the results
print("\nTukey's HSD post-hoc test:")
print(tukey_results)


Tukey's HSD post-hoc test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
     A      B -42.0899 0.2045 -100.5291  16.3492  False
     A      C  120.232    0.0   61.7929 178.6712   True
     B      C 162.3219    0.0  103.8828 220.7611   True
-------------------------------------------------------
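
Because the same 30 days are observed for all three stores, a repeated measures ANOVA treats the day as the subject and the store as the within-subject factor. The sketch below uses `AnovaRM` from statsmodels on freshly simulated data; the shared `day_effect` term, the seed, and the column names are assumptions made for illustration, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
n_days = 30

# Hypothetical day-level effect shared by all stores (e.g. weekday, weather),
# which makes sales on the same day correlated across stores
day_effect = rng.normal(0, 50, n_days)

# Long format: one row per (day, store) observation
sales_long = pd.DataFrame({
    "Day": np.tile(np.arange(n_days), 3),
    "Store": np.repeat(["A", "B", "C"], n_days),
    "Sales": np.concatenate([
        1000 + day_effect + rng.normal(0, 100, n_days),
        950 + day_effect + rng.normal(0, 90, n_days),
        1100 + day_effect + rng.normal(0, 110, n_days),
    ]),
})

# Repeated measures ANOVA: Day is the subject, Store is the within-subject factor
rm_result = AnovaRM(sales_long, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

`AnovaRM` requires balanced data, meaning every day must have exactly one observation for every store. A natural follow-up for a significant result would be paired comparisons between stores with a multiple-comparison correction.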
