In [None]:
#Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
#Ans.
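#ANOVA (Analysis of Variance) is a statistical test used to compare the means of two or more groups.
#It relies on three assumptions:
#Normality: the residuals (the observations within each group) are approximately normally distributed.
#Homogeneity of variances: all groups have roughly the same population variance.
#Independence: observations are independent of each other, both within and across groups.

#Examples of violations that could impact the validity of ANOVA results are:
#Non-normality: strongly skewed data, such as incomes or reaction times, especially with small samples.
#Unequal variances: one group is far more spread out than the others, which distorts the Type I error rate when group sizes also differ.
#Correlated observations: repeated measurements on the same subject treated as if they were independent.
#Outliers: a few extreme values that shift group means and inflate the within-group variance.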
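
#A minimal sketch of how the normality and equal-variance assumptions could be checked before running ANOVA.
#The three groups below are hypothetical values chosen only for illustration; Shapiro-Wilk and Levene's test are one common choice, not the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.0, 5.4, 5.7])
group_b = np.array([6.2, 6.8, 6.1, 6.5, 6.9, 6.3, 6.6, 6.4])
group_c = np.array([5.9, 5.5, 6.0, 5.7, 5.8, 6.1, 5.6, 5.4])
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

#A small p-value in either test (for example below 0.05) suggests the corresponding assumption may not hold. Independence cannot be tested this way; it has to follow from the study design.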

In [None]:
#Q2. What are the three types of ANOVA, and in what situations would each be used?
#Ans.
#The three types of ANOVA covered here are:
#One-way ANOVA: used when there is one independent variable (factor) and one dependent variable, e.g. comparing mean weight loss across three diets.
#Two-way ANOVA: used when there are two independent variables (factors); it tests each main effect and their interaction, e.g. software program and experience level.
#MANOVA (Multivariate ANOVA): an extension of ANOVA used when two or more dependent variables are analysed together.

In [None]:
#Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
#Ans.
#Partitioning of variance in ANOVA refers to breaking down the total variance in a dependent variable into separate components that can be attributed to different sources of variation.
#In a one-way ANOVA the total sum of squares splits into a between-group part (explained by the factor) and a within-group part (residual error): SS_total = SS_between + SS_within.
#Understanding this partitioning is important because the F-statistic is the ratio of the between-group mean square to the within-group mean square, so it shows which sources contribute most to the differences in the dependent variable.
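
#A minimal sketch of the partition computed by hand with NumPy, assuming a one-way layout.
#The values are the same small data set used in the Q4 cell; the variable names are illustrative.

```python
import numpy as np

# Three groups of three observations each
groups = {
    "A": np.array([5.0, 8.0, 7.0]),
    "B": np.array([9.0, 11.0, 12.0]),
    "C": np.array([6.0, 8.0, 10.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variation: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Within-group variation: every observation against its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

print(f"Total:   {ss_total:.4f}")
print(f"Between: {ss_between:.4f}")
print(f"Within:  {ss_within:.4f}")
```

#The between and within components add up to the total, which is the identity the ANOVA table is built on.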

In [1]:
#Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?
#Ans.
#In a one-way ANOVA, the total sum of squares (SST) represents the total amount of variation in the dependent variable, the explained sum of squares (SSE) represents the variation that can be attributed to the treatment or group differences, and the residual sum of squares (SSR) represents the variation that cannot be explained by the treatment or group differences.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas DataFrame with the data
data = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [5, 8, 7, 9, 11, 12, 6, 8, 10]
})

# Fit a one-way ANOVA model
model = ols('Value ~ Group', data=data).fit()

# The ANOVA table has one row per source of variation: 'Group' and 'Residual'
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained sum of squares (SSE): variation between the group means
sse = anova_table.loc['Group', 'sum_sq']

# Residual sum of squares (SSR): variation within the groups
ssr = anova_table.loc['Residual', 'sum_sq']

# Total sum of squares (SST): explained plus residual variation
sst = sse + ssr

print(f'SST: {sst:.4f}')
print(f'SSE: {sse:.4f}')
print(f'SSR: {ssr:.4f}')


SST: 42.2222
SSE: 24.8889
SSR: 17.3333


In [None]:
#Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
#Ans.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data into a DataFrame
# ('data.csv' is a placeholder; the file needs columns y, factor1 and factor2)
data = pd.read_csv('data.csv')

# Define the formula for the model
formula = 'y ~ C(factor1) + C(factor2) + C(factor1):C(factor2)'

# Fit the model
model = ols(formula, data).fit()

# A single ANOVA table reports the main effects and the interaction together:
# rows C(factor1) and C(factor2) are the main effects,
# row C(factor1):C(factor2) is the interaction effect
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the results
print(anova_table)


In [None]:
#Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
#Ans.
#With an F-statistic of 5.23 and a p-value of 0.02, we conclude that there is a statistically significant difference between at least two of the group means.
#The F-statistic is the ratio of the between-group mean square to the within-group mean square.
#Here the variability between groups is about 5.23 times the variability within groups, which is more than chance alone would typically produce.

#The p-value of 0.02 means that, if the null hypothesis of equal group means were true, an F-statistic of 5.23 or larger would occur only 2% of the time.
#Since the p-value is less than the significance level of 0.05, we reject the null hypothesis. ANOVA does not identify which groups differ, so a post-hoc test is needed to locate the differences.
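
#A short sketch of how a p-value follows from an F-statistic.
#The question does not give the degrees of freedom, so the values below (3 groups, 15 observations) are an assumption made only for illustration; the resulting p-value is therefore close to, but not necessarily equal to, the reported 0.02.

```python
from scipy.stats import f

f_stat = 5.23

# Assumed degrees of freedom (not stated in the question)
df_between = 2   # number of groups - 1
df_within = 12   # number of observations - number of groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(f_stat, df_between, df_within)

# Critical value at the 5% significance level for comparison
critical_value = f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_value:.3f}")
print("reject H0:", f_stat > critical_value)
```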

In [None]:
#Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
#Ans.
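#Handling missing data in a repeated measures ANOVA is challenging because the analysis needs a complete set of measurements for every subject, and the observations for each subject are not independent.
#Listwise deletion drops every subject with any missing value. It is simple, but it reduces the sample size and statistical power, and it biases the results unless the data are missing completely at random.
#Pairwise deletion, which uses all available data for each analysis, can also give biased results if the missingness is related to the outcome or predictor variables.
#Imputation replaces missing values with estimates based on other available data. Single (e.g. mean) imputation understates the variability in the data, while multiple imputation accounts for the extra uncertainty.
#The choice of method can therefore change the estimated effects, their standard errors, and the conclusions drawn.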
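
#A minimal sketch contrasting listwise deletion with mean imputation in pandas.
#The subjects, time points and scores are hypothetical, and mean imputation is shown only because it is easy to follow, not because it is recommended.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for five subjects measured at three time points
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
        "t3": [13.0, 14.0, np.nan, 17.0, 15.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: keep only subjects with a complete set of measurements
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept after listwise deletion:", len(complete_cases))
print("Variance per time point, observed values only:")
print(wide.var())
print("Variance per time point, after mean imputation:")
print(mean_imputed.var())
```

#Listwise deletion loses two of the five subjects here, while mean imputation keeps all five but shrinks the variance of the imputed columns, which is how it can make effects look more precise than they are.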

In [None]:
#Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
#Ans.
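#Post-hoc tests are used after ANOVA to determine which specific group means differ significantly from each other when an overall significant difference has been found.
#There are several common post-hoc tests, including:
#Tukey's Honestly Significant Difference (HSD) test: used to compare every pair of group means while controlling the family-wise error rate.
#Bonferroni correction: used when only a small number of comparisons is planned; each comparison is tested at alpha divided by the number of comparisons, which makes it conservative.
#Scheffe's test: used for complex contrasts (e.g. one group against the average of two others); it is the most conservative of the four.
#Dunnett's test: used when each treatment group is compared with a single control group.
#Example: if a one-way ANOVA on three diets is significant, it only shows that at least one diet differs; Tukey's HSD is then needed to find out which pairs of diets differ.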
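
#A minimal sketch of following a significant one-way ANOVA with Tukey's HSD, using scipy.stats.tukey_hsd (available in SciPy 1.8 and later).
#The weight-loss values are hypothetical and were chosen so that diet B clearly differs from the other two.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# Hypothetical weight loss (kg) for three diets, illustrative values only
diet_a = np.array([3.1, 2.8, 3.6, 3.3, 2.9, 3.4])
diet_b = np.array([1.4, 1.9, 1.2, 1.7, 1.5, 1.6])
diet_c = np.array([3.0, 3.5, 3.2, 2.9, 3.4, 3.3])

# Step 1: overall test for any difference between the group means
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Step 2: only if the overall test is significant, compare every pair of groups
if p_value < 0.05:
    result = tukey_hsd(diet_a, diet_b, diet_c)
    # result.pvalue[i, j] is the adjusted p-value for group i versus group j
    print(result)
```

#In the printed table, groups are numbered 0, 1, 2 in the order they were passed in, so the row for (0 - 1) compares diet A with diet B.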

In [4]:
#Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.
#Ans.
import numpy as np
from scipy.stats import f_oneway

# Data for the three diets (illustrative values, 50 observations per diet)
diet_a = np.array([2, 1, 3, 4, 2, 3, 1, 4, 3, 2, 1, 2, 4, 2, 3, 1, 2, 3, 1, 4, 
                   3, 2, 3, 4, 1, 2, 3, 2, 4, 3, 1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 
                   4, 2, 3, 1, 4, 2, 3, 1, 2, 3])
diet_b = np.array([1, 0, 2, 3, 1, 2, 0, 3, 2, 1, 0, 1, 3, 1, 2, 0, 1, 2, 0, 3, 
                   2, 1, 2, 3, 0, 1, 2, 1, 3, 2, 0, 1, 2, 3, 0, 1, 2, 3, 1, 0, 
                   3, 1, 2, 0, 3, 1, 2, 0, 1, 2])
diet_c = np.array([0, 1, 1, 2, 0, 2, 1, 3, 2, 0, 1, 2, 3, 0, 2, 1, 0, 2, 1, 3, 
                   2, 0, 2, 3, 1, 0, 2, 1, 3, 1, 0, 2, 1, 3, 0, 2, 1, 3, 0, 1, 
                   3, 1, 2, 0, 3, 1, 2, 0, 1, 2])

# Perform one-way ANOVA
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)

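print("F-statistic:", f_stat)
print("p-value:", p_value)

# Interpretation: F is about 16.0 and the p-value is about 5.2e-07, far below 0.05,
# so we reject the null hypothesis that the three diets give the same mean weight loss.
# The test does not say which diets differ; a post-hoc test such as Tukey's HSD is needed for that.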


F-statistic: 16.00233357897323
p-value: 5.155776167236295e-07


In [5]:
#Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.
#Ans.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas dataframe with the data (a small illustrative sample of 9 observations, not the full 30)
data = {'Program': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'Experience': ['Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice'],
        'Time': [10, 12, 14, 11, 13, 12, 9, 10, 11]}
df = pd.DataFrame(data)

# Fit a two-way ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

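# Print the ANOVA table
print(anova_table)

# Interpretation: none of the effects is significant at the 0.05 level
# (Program: F = 1.21, p = 0.41; Experience: F = 0.14, p = 0.73; interaction: F = 0.14, p = 0.87).
# With only 9 observations and 3 residual degrees of freedom the test has very little power,
# so this is weak evidence that the programs or experience levels do not differ.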


                          sum_sq   df         F    PR(>F)
C(Program)                  8.45  2.0  1.207143  0.412449
C(Experience)               0.50  1.0  0.142857  0.730615
C(Program):C(Experience)    1.00  2.0  0.142857  0.872443
Residual                   10.50  3.0       NaN       NaN


In [None]:
#Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.
#Ans.
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# load the data into a pandas dataframe
# ('test_scores.csv' is a placeholder; the file needs columns 'group' and 'score')
data = pd.read_csv('test_scores.csv')

# separate the control and experimental groups into separate dataframes
control = data[data['group'] == 'control']
experimental = data[data['group'] == 'experimental']

# conduct the two-sample t-test
t_stat, p_value = ttest_ind(control['score'], experimental['score'])

print("Two-sample t-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)

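# conduct the post-hoc test using Tukey's HSD, but only after a significant t-test
# (with just two groups it repeats the same single comparison, so the t-test already shows which groups differ)
if p_value < 0.05:
    tukey_results = pairwise_tukeyhsd(data['score'], data['group'])
    print("Post-hoc test results:")
    print(tukey_results)
else:
    print("No significant difference between the groups; no post-hoc test is needed.")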


In [2]:
#Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.
#Ans.
import pandas as pd
import pingouin as pg

# create a dataframe with the sales data
data = {'Day': ['Day 1', 'Day 2', 'Day 3', 'Day 4', 'Day 5', 'Day 6', 'Day 7', 'Day 8', 'Day 9', 'Day 10',
                'Day 11', 'Day 12', 'Day 13', 'Day 14', 'Day 15', 'Day 16', 'Day 17', 'Day 18', 'Day 19', 'Day 20',
                'Day 21', 'Day 22', 'Day 23', 'Day 24', 'Day 25', 'Day 26', 'Day 27', 'Day 28', 'Day 29', 'Day 30'],
        'Store A': [45, 47, 50, 52, 53, 48, 51, 49, 46, 48,
                    50, 52, 53, 54, 55, 56, 55, 54, 52, 51,
                    49, 48, 50, 53, 54, 55, 56, 55, 54, 52],
        'Store B': [42, 44, 46, 48, 50, 52, 53, 55, 53, 52,
                    50, 48, 46, 44, 43, 42, 41, 43, 44, 46,
                    47, 49, 51, 52, 54, 53, 52, 50, 48, 46],
        'Store C': [50, 53, 55, 58, 60, 62, 64, 66, 65, 63,
                    61, 59, 57, 55, 53, 52, 51, 50, 49, 47,
                    45, 44, 43, 44, 45, 47, 49, 50, 52, 54]}
df = pd.DataFrame(data)

# reshape the dataframe to long format
df_long = pd.melt(df, id_vars=['Day'], value_vars=['Store A', 'Store B', 'Store C'], var_name='Store', value_name='Sales')

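# run the repeated measures ANOVA (each day acts as the "subject" measured under all three stores)
pg.rm_anova(dv='Sales', within='Store', subject='Day', data=df_long)

# Interpretation: F(2, 58) = 8.91 with p = 0.0004, so mean sales differ between the stores.
# Mauchly's test (p-spher = 0.043) indicates that sphericity is violated, so the
# Greenhouse-Geisser corrected p-value (0.001) should be reported; it is still significant.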


  Source  ddof1  ddof2        F     p-unc  p-GG-corr       ng2       eps  sphericity   W-spher   p-spher
0  Store      2     58  8.91187  0.000422   0.001006  0.171948  0.832033       False  0.798125  0.042559
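
#The question also asks for a post-hoc test. A minimal sketch, assuming paired t-tests with a Bonferroni correction are an acceptable follow-up.
#It uses scipy.stats.ttest_rel on the same sales figures as above; the names sales, pairs and posthoc_df are illustrative.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_rel

# Same daily sales as in the cell above, one column per store
sales = pd.DataFrame({
    'Store A': [45, 47, 50, 52, 53, 48, 51, 49, 46, 48,
                50, 52, 53, 54, 55, 56, 55, 54, 52, 51,
                49, 48, 50, 53, 54, 55, 56, 55, 54, 52],
    'Store B': [42, 44, 46, 48, 50, 52, 53, 55, 53, 52,
                50, 48, 46, 44, 43, 42, 41, 43, 44, 46,
                47, 49, 51, 52, 54, 53, 52, 50, 48, 46],
    'Store C': [50, 53, 55, 58, 60, 62, 64, 66, 65, 63,
                61, 59, 57, 55, 53, 52, 51, 50, 49, 47,
                45, 44, 43, 44, 45, 47, 49, 50, 52, 54],
})

# Every pair of stores is compared on the same 30 days (paired design)
pairs = list(combinations(sales.columns, 2))

rows = []
for first, second in pairs:
    t_stat, p_unc = ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply each p-value by the number of comparisons, capped at 1
    p_bonf = min(p_unc * len(pairs), 1.0)
    rows.append({'A': first, 'B': second, 't': t_stat,
                 'p-unc': p_unc, 'p-bonf': p_bonf})

posthoc_df = pd.DataFrame(rows)
print(posthoc_df)
```

#Pairs whose p-bonf value is below 0.05 are the stores whose mean daily sales differ significantly.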
