In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
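=>Independence: The observations within each group, and across groups, must be independent of each other.
=>Normality: The data within each group (equivalently, the residuals) should follow an approximately normal distribution.
=>Homogeneity of Variance (Homoscedasticity): The variance of the data should be approximately equal across all groups.
=>Random Sampling: The data should be obtained through random sampling from the population of interest.
Examples of violations:-
Non-Independence: If the data points are not independent, the F-test can give biased results.
For example, in a study comparing the performance of students across multiple schools, some students are counted in
 more than one school, so their scores are correlated and the independence assumption is violated.
Non-Normality: Strongly skewed data within a group (for example, household income) can distort the p-value, especially
 with small samples.
Unequal Variances: If one group's scores are far more spread out than the others, the F-test tends to produce too many
 false positives, particularly when the group sizes differ.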
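
The normality and equal-variance assumptions can be screened before running ANOVA. The sketch below is one possible check, not a required procedure: it uses SciPy's Shapiro-Wilk and Levene tests on simulated, hypothetical samples (the group names, means, and spreads are assumptions chosen only for illustration).

```python
import numpy as np
from scipy import stats

# Hypothetical samples, simulated only for illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=55, scale=15, size=30)  # deliberately larger spread

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normal data)
for name, sample in [("A", group_a), ("B", group_b), ("C", group_c)]:
    w_stat, p_norm = stats.shapiro(sample)
    print(f"Group {name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variance: Levene's test across all groups
# (a small p-value suggests the group variances differ)
levene_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene statistic = {levene_stat:.3f}, p = {p_levene:.4f}")
```

Independence cannot be tested this way; it has to be ensured by the study design.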

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
=>One-Way ANOVA:
One-Way ANOVA is used when you have one categorical independent variable (typically with three or more levels) and a
continuous dependent variable, for example comparing mean test scores across three teaching methods.
=>Two-Way ANOVA:
Two-Way ANOVA involves two categorical independent variables (factors) and one continuous dependent variable.
It is used to test the main effect of each factor and whether the two factors interact.
=>Repeated Measures ANOVA:
Repeated Measures ANOVA is used when the same participants are measured under multiple conditions or at multiple time points.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
The partitioning of variance in ANOVA refers to the process of breaking down the total variance observed in the data into
different components, each associated with a specific source of variation. In a one-way ANOVA, the total sum of squares
splits into a between-group part (explained by the factor) and a within-group part (residual error).

Understanding this concept is important because it lets you:
Assess the significance of the independent variable(s) in explaining the variation in the dependent variable.
Interpret the relative importance of different factors and interactions in influencing the outcome.
Identify potential sources of error or noise in the data.
Make informed decisions about the experimental design and data collection process.
Provide insights into the relationships between variables and inform further investigations.
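
As a rough illustration of the partition, the sketch below splits the total sum of squares of a small made-up data set into between-group and within-group parts and reports the share explained by the grouping (eta squared). The three groups and all variable names are hypothetical.

```python
import numpy as np

# Made-up data for three groups (illustration only)
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([10.0, 11.0, 12.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

# Between-group variation: group means against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: observations against their own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Share of the total variation attributable to the grouping factor
eta_squared = ss_between / ss_total

print("SS total:", ss_total)
print("SS between + SS within:", ss_between + ss_within)
print("Eta squared:", eta_squared)
```

The two printed sums of squares agree, which is the partition itself: nothing in the total is left unaccounted for.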

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
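import numpy as np

# Sample data for three groups (NumPy arrays so that arithmetic is element-wise)
group1 = np.array([10, 12, 14, 15, 18])
group2 = np.array([8, 9, 11, 13, 16])
group3 = np.array([7, 9, 10, 12, 15])

# Pool all observations and compute the grand mean
all_data = np.concatenate([group1, group2, group3])

grand_mean = np.mean(all_data)

# Group sizes
n_group1 = len(group1)
n_group2 = len(group2)
n_group3 = len(group3)

# Group means
mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)

# Total sum of squares: every observation against the grand mean
sst = np.sum((all_data - grand_mean) ** 2)

# Explained (between-group) sum of squares: group means against the grand mean
sse = n_group1 * (mean_group1 - grand_mean) ** 2 + \
      n_group2 * (mean_group2 - grand_mean) ** 2 + \
      n_group3 * (mean_group3 - grand_mean) ** 2

# Residual (within-group) sum of squares: observations against their own group mean
ssr = np.sum((group1 - mean_group1) ** 2) + \
      np.sum((group2 - mean_group2) ** 2) + \
      np.sum((group3 - mean_group3) ** 2)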

# Print the results
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 142.9333333333333
Explained Sum of Squares (SSE): 27.73333333333335
Residual Sum of Squares (SSR): 115.2
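
The sums of squares lead directly to the F-statistic. The sketch below, using the same three sample groups, divides each sum of squares by its degrees of freedom and cross-checks the result against `scipy.stats.f_oneway`. It follows the naming used in the question (SSE = explained, SSR = residual); the names `f_manual` and `p_manual` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

groups = [np.array([10, 12, 14, 15, 18]),
          np.array([8, 9, 11, 13, 16]),
          np.array([7, 9, 10, 12, 15])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Explained (between-group) and residual (within-group) sums of squares
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the two mean squares
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check with SciPy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```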


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

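# Sample data for the two-way ANOVA (replace these with your own data).
# Each row is one Factor1/Factor2 combination; Group1-Group3 hold its observations.
data = {
    'Group1': [10, 12, 14, 15, 18],
    'Group2': [8, 9, 11, 13, 16],
    'Group3': [7, 9, 10, 12, 15],
    # Every Factor1 x Factor2 cell needs data, otherwise the interaction
    # cannot be estimated, so this sample uses two levels per factor.
    'Factor1': ['A', 'A', 'B', 'B', 'A'],  # Factor 1 levels
    'Factor2': ['X', 'Y', 'X', 'Y', 'X'],  # Factor 2 levels
}

# Create a DataFrame and reshape it to long form with a single 'Value' column,
# which is the response variable the formula below refers to
df = pd.DataFrame(data).melt(
    id_vars=['Factor1', 'Factor2'],
    value_vars=['Group1', 'Group2', 'Group3'],
    var_name='Group',
    value_name='Value',
)

# Fit the two-way ANOVA model
formula = 'Value ~ C(Factor1) + C(Factor2) + C(Factor1):C(Factor2)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# The full table also reports the F-statistic and p-value for each effect
print(anova_table)

# Extract the sum of squares for each main effect and for the interaction
main_effect_factor1 = anova_table.loc['C(Factor1)', 'sum_sq']
main_effect_factor2 = anova_table.loc['C(Factor2)', 'sum_sq']
interaction_effect = anova_table.loc['C(Factor1):C(Factor2)', 'sum_sq']

# Print the results
print("Main Effect for Factor 1 (sum of squares):", main_effect_factor1)
print("Main Effect for Factor 2 (sum of squares):", main_effect_factor2)
print("Interaction Effect (sum of squares):", interaction_effect)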


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
The F-statistic (5.23) is the ratio of the between-group variance to the within-group variance; a value well above 1
suggests that the group means differ by more than chance alone would explain.
The p-value associated with the F-statistic is 0.02.
Since the p-value (0.02) is less than the commonly chosen significance level of 0.05, we reject the null hypothesis
that all group means are equal, in favor of the alternative hypothesis
that at least one group mean differs from the others.

However, these results do not tell us which specific groups differ from each other. Because the overall ANOVA test is
significant, post-hoc tests or pairwise comparisons are necessary to determine which groups are significantly different.
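
To show how the F-statistic and the p-value relate, the sketch below recomputes the p-value from the F distribution. The question does not give the degrees of freedom, so the design used here (3 groups, 15 observations in total) is purely an assumption for illustration.

```python
from scipy import stats

f_statistic = 5.23

# Assumed design (not given in the question): 3 groups, 15 observations in total
df_between = 3 - 1
df_within = 15 - 3

# The p-value is the upper-tail area of the F distribution beyond the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at the 0.05 significance level for the same degrees of freedom
f_critical = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", round(p_value, 3))
print("Critical F at alpha = 0.05:", round(f_critical, 3))
print("Reject H0:", f_statistic > f_critical)
```

Comparing the p-value with 0.05 and comparing F with the critical value are two views of the same decision.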

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
Several methods are available for handling missing data:
=>Complete Case Analysis (Listwise Deletion):
This method involves removing any participant with missing data from the analysis. 
=>Mean Imputation:
Mean imputation involves replacing missing values with the mean value of the observed data for that variable.
=>Last Observation Carried Forward (LOCF):
LOCF involves replacing missing values with the last observed value for that participant. 

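Using different methods to handle missing data can lead to varying results and conclusions.
Complete case analysis reduces the sample size and statistical power, and can introduce bias when the data are not
missing completely at random. Mean imputation shrinks the variance of the data and underestimates the standard errors,
while LOCF assumes that a participant's value does not change after the last observation, which can bias the estimated
time or condition effect.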
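
The three approaches can be sketched in pandas as follows. The data frame is a small hypothetical example in wide format (one row per participant, one column per time point); the column and index names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 9.0, 11.0],
        "t2": [11.0, np.nan, 10.0, 12.0],
        "t3": [13.0, 14.0, np.nan, 12.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Complete case analysis: drop every participant with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print("Complete cases:\n", complete_cases)
print("Mean imputed:\n", mean_imputed)
print("LOCF:\n", locf)
```

Half of the participants are lost under complete case analysis here, while the two imputation methods keep all four but fill the gaps with different values.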

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
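Some common post-hoc tests used after ANOVA include:

=>Tukey's Honestly Significant Difference (HSD) Test:
It compares every pair of group means while controlling the familywise error rate, and is a widely used post-hoc test
when the sample sizes are equal across groups (the Tukey-Kramer variant handles unequal sizes).
=>Bonferroni Correction:
The Bonferroni correction adjusts the significance level for each pairwise comparison to control the familywise error rate.
It suits a small number of planned comparisons, because it becomes conservative as the number of comparisons grows.
=>Scheffe's Test:
It can be used when sample sizes are unequal across groups and covers complex contrasts as well as pairwise comparisons,
at the cost of being more conservative.
Example situation:
Suppose you conducted a study to compare the effectiveness of three different teaching methods (A, B, and C) on students'
test scores. After performing a one-way ANOVA, you found that there is a significant difference in the means of the three
teaching methods (p < 0.05).
The ANOVA alone does not say which methods differ, so a post-hoc test such as Tukey's HSD is needed to compare
A vs. B, A vs. C, and B vs. C.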
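
A minimal sketch of such a follow-up with Tukey's HSD from statsmodels is given below. The scores are simulated, hypothetical data (the means, spread, and 20 students per method are assumptions), so the output only illustrates the shape of the result.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated, hypothetical test scores for three teaching methods (20 students each)
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(70, 5, 20),  # method A
    rng.normal(72, 5, 20),  # method B
    rng.normal(85, 5, 20),  # method C
])
methods = np.repeat(["A", "B", "C"], 20)

# Compare all pairs of methods while controlling the familywise error rate at 0.05
tukey_result = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)

# One row per pair: mean difference, adjusted p-value, confidence interval, reject flag
print(tukey_result)
```

The `reject` column of the printed table marks the pairs whose means differ significantly.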

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [None]:
import numpy as np
import scipy.stats as stats

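# Weight loss for each diet. The trailing ... is a placeholder for the remaining
# participants and must be replaced with actual data before the code is run.
diet_A = [2.5, 3.1, 1.8, 2.9, 3.5, ...]
diet_B = [1.7, 2.2, 1.5, 2.1, 2.8, ...]
diet_C = [1.0, 1.5, 1.2, 0.8, 2.0, ...]

# Pooled values and matching group labels (useful for a later post-hoc test)
all_weight_loss = np.concatenate([diet_A, diet_B, diet_C])
group_labels = ['Diet A'] * len(diet_A) + ['Diet B'] * len(diet_B) + ['Diet C'] * len(diet_C)

# One-way ANOVA: tests whether all three diets have the same mean weight loss
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print("F-Statistic:", f_statistic)
print("P-value:", p_value)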


In [None]:
If the p-value is less than the chosen significance level (commonly 0.05), you can reject the null hypothesis, indicating
that at least one diet's mean weight loss differs from the others. Otherwise, you fail to reject the null hypothesis and
the data do not show a significant difference between the three diets.

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

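# One row per employee: 30 employees, 5 in each Software x Experience cell.
# All three columns must have the same length (30).
data = {
    'Software': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    # The trailing ... is a placeholder; replace it with the remaining task times
    'Time': [12, 15, 14, 18, 16, ...],
}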


df = pd.DataFrame(data)


formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)


print(anova_table)


In [None]:
In the ANOVA table, you will see two main effects, one for the software programs and one for the employee experience levels,
plus one interaction effect between the two factors and a residual row.
For each effect, a p-value below the chosen significance level (commonly 0.05) indicates a significant effect. A significant
interaction means that the differences between the software programs depend on the experience level, so the main effects
should be interpreted with that in mind.

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data for test scores for control and experimental groups (replace these with your own data)
control_group = [78, 82, 85, 76, 80, ...]  # Test scores for control group
experimental_group = [86, 90, 88, 92, 85, ...]  # Test scores for experimental group

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the results of the t-test
print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Follow up with post-hoc test (Tukey's HSD) if the results are significant (p-value < 0.05)
if p_value < 0.05:
    # Combine all the test scores into a single array
    all_test_scores = np.concatenate([control_group, experimental_group])
    
    # Create corresponding group labels for each group
    group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)
    
    # Create a DataFrame from the data
    df = pd.DataFrame({'Test Scores': all_test_scores, 'Group': group_labels})
    
    # Perform Tukey's HSD post-hoc test
    tukey_result = pairwise_tukeyhsd(df['Test Scores'], df['Group'])
    print(tukey_result)


In [None]:
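Replace the control_group and experimental_group lists with your actual test score data for each group.
The code will then perform the two-sample t-test and provide the t-statistic and p-value for the comparison.
Because only two groups are compared, a significant t-test already shows that these two groups differ; the Tukey HSD
step simply repeats this single comparison.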
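
With only two groups, the pooled-variance t-test and a one-way ANOVA are the same test, which is why no separate post-hoc step is needed. The sketch below checks this on simulated, hypothetical scores (50 students per group; the means and spread are assumptions).

```python
import numpy as np
from scipy import stats

# Simulated, hypothetical test scores (50 students per group)
rng = np.random.default_rng(7)
control = rng.normal(78, 8, 50)
experimental = rng.normal(84, 8, 50)

# Two-sample t-test with pooled variance (SciPy's default, equal_var=True)
t_stat, p_t = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(control, experimental)

# For two groups, F equals t squared and the p-values coincide
print("t squared:", t_stat ** 2, " F:", f_stat)
print("p (t-test):", p_t, " p (ANOVA):", p_f)
```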

In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data for daily sales for each store (replace these with your own data).
# The trailing ... is a placeholder; each list needs one value per recorded day.
store_A_sales = [1000, 950, 1050, 1100, 900, ...]  # Daily sales for Store A
store_B_sales = [900, 850, 950, 1000, 800, ...]    # Daily sales for Store B
store_C_sales = [1200, 1150, 1100, 1250, 1300, ...] # Daily sales for Store C

# Combine all the sales data into a single array
all_sales = np.concatenate([store_A_sales, store_B_sales, store_C_sales])

# Create corresponding group labels for each store
group_labels = ['Store A'] * len(store_A_sales) + ['Store B'] * len(store_B_sales) + ['Store C'] * len(store_C_sales)

# Every day is observed at all three stores, so the day is the repeated "subject"
n_days = len(store_A_sales)
day_labels = np.tile(np.arange(n_days), 3)

# Create a long-format DataFrame from the data
df = pd.DataFrame({'Sales': all_sales, 'Store': group_labels, 'Day': day_labels})

# Fit the one-way repeated measures ANOVA model (Store is the within-subject factor)
rm_result = AnovaRM(df, depvar='Sales', subject='Day', within=['Store']).fit()
anova_table = rm_result.anova_table

# Print the results of the repeated measures ANOVA
print(anova_table)

# Follow up with post-hoc test (Tukey's HSD) if the results are significant (p-value < 0.05)
if anova_table['Pr > F'].iloc[0] < 0.05:
    # Note: Tukey's HSD treats the stores as independent groups; paired comparisons
    # with a multiplicity correction follow the repeated measures design more closely
    tukey_result = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print(tukey_result)


In [None]:
Replace the store_A_sales, store_B_sales, and store_C_sales lists with your actual daily sales data for each store. 
The code will then perform the repeated measures ANOVA and provide the F-statistic and p-value for the analysis.
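
Because the same days are observed at every store, one common alternative for the follow-up is a set of paired t-tests with a Bonferroni-adjusted significance level. The sketch below uses simulated, hypothetical sales (a shared day-to-day effect plus store-specific noise); the figures and names are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated, hypothetical daily sales for 30 days
rng = np.random.default_rng(3)
n_days = 30
day_effect = rng.normal(0, 50, n_days)  # variation shared by all stores on a given day
sales = {
    "Store A": 1000 + day_effect + rng.normal(0, 30, n_days),
    "Store B": 900 + day_effect + rng.normal(0, 30, n_days),
    "Store C": 1200 + day_effect + rng.normal(0, 30, n_days),
}

# Bonferroni correction: divide the significance level by the number of comparisons
pairs = list(combinations(sales, 2))
alpha = 0.05 / len(pairs)

results = {}
for first, second in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_value = stats.ttest_rel(sales[first], sales[second])
    results[(first, second)] = (t_stat, p_value, p_value < alpha)
    print(f"{first} vs {second}: t = {t_stat:.2f}, p = {p_value:.4g}, "
          f"significant = {p_value < alpha}")
```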