In [None]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact 
# the validity of the results.

In [None]:
# ANOVA, or analysis of variance, is a statistical test used to determine whether there are significant differences between the means of three 
# or more groups. In order to use ANOVA, several assumptions must be met:

# Independence: The observations must be independent of one another, both within each group and across groups. 
# This means that the value of one observation should not be influenced by any other observation.

# Normality: The data within each group (equivalently, the residuals) should be approximately normally distributed. 
# This means that a histogram of each group's values should look roughly bell-shaped.

# Homogeneity of variances: The variances of the data within each group should be roughly equal. 
# This means that the spread of the data should be similar across all groups.

# Violations of these assumptions can impact the validity of ANOVA results. For example:

# Violation of independence: If observations are related to one another, the within-group variability is misestimated, 
# which often makes the test too liberal. For example, if a study measures the same individuals over time 
# and treats those measurements as separate groups, the observations are no longer independent.

# Violation of normality: If the data within each group are far from normally distributed, the p-value may be inaccurate, 
# especially in small samples. For example, heavily skewed data or extreme outliers violate the assumption of normality.

# Violation of homogeneity of variances: If the variances within the groups are not equal, the F-test may be too liberal or
# too conservative, particularly when the group sizes are also unequal. For example, if one group has a much larger variance 
# than the other groups, the assumption of homogeneity of variances is violated.

# In general, violations of these assumptions can increase the risk of type I or type II errors,
# which undermines the validity of ANOVA results. Therefore, it is important to assess whether 
# these assumptions are met before interpreting ANOVA results. If they are not met, alternatives such as Welch's ANOVA 
# (for unequal variances) or the Kruskal-Wallis test (for non-normal data) may be more appropriate.
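
# A minimal sketch of how the normality and equal-variance assumptions might be checked in Python. The three groups are 
# simulated (hypothetical) data, and the Shapiro-Wilk and Levene tests are one common choice of checks, not the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: three groups of 30 observations each
groups = [
    rng.normal(loc=10, scale=2, size=30),
    rng.normal(loc=12, scale=2, size=30),
    rng.normal(loc=8, scale=2, size=30),
]

# Normality: Shapiro-Wilk test within each group
# (a small p-value suggests a departure from normality)
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variances: Levene's test across the groups
# (a small p-value suggests the variances are not equal)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

# Independence cannot be tested this way; it has to be judged from how the data were collected.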

In [None]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
# There are three types of ANOVA, and they are used in different situations depending on the research question and the data being analyzed.
# The three types of ANOVA are:

# One-way ANOVA: This type of ANOVA is used when there is one categorical independent variable with three or more levels, 
# and one continuous dependent variable. One-way ANOVA is used to test whether there are significant differences between the means of the groups.
# For example, a one-way ANOVA could be used to test whether there is a difference in average test scores among students from different schools.

# Two-way ANOVA: This type of ANOVA is used when there are two categorical independent variables, 
# with two or more levels each, and one continuous dependent variable. 
# Two-way ANOVA is used to test whether there are significant main effects of each independent variable, 
# as well as whether there is an interaction effect between the two independent variables on the dependent variable.
# For example, a two-way ANOVA could be used to test whether there is a difference in average test scores among students from different schools, 
# as well as whether the effect of school varies by gender.

# Repeated measures ANOVA: This type of ANOVA is used when there is one categorical independent variable with two or more levels, 
# and one continuous dependent variable, but the dependent variable is measured multiple times for each individual. 
# Repeated measures ANOVA is used to test whether the mean of the dependent variable differs across the repeated conditions 
# or time points, while accounting for the fact that measurements from the same individual are correlated. 
# For example, a repeated measures ANOVA could be used to test whether average heart rate differs before, during, and after a stress test.

# Overall, ANOVA is a powerful statistical tool for comparing means across groups,
# and choosing the appropriate type of ANOVA depends on the research question and the nature of the data being analyzed.

In [None]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
# Partitioning of variance in ANOVA refers to the process of dividing the total variability in the data into different sources of variation. 
# In a one-way ANOVA, the total variability (the total sum of squares) is divided into two parts: variability between groups 
# and variability within groups.

# The variance between groups represents the amount of variation in the dependent variable that can be explained by the differences between 
# the groups being compared. This variance is what ANOVA tests for, to determine whether there are significant differences in means between 
# the groups.

# The variance within groups represents the amount of variation in the dependent variable that is not explained by the differences between 
# the groups being compared. This variance can arise from random variability, measurement error, or other sources of variation that are 
# not related to the independent variable.

# Understanding the concept of partitioning of variance is important for several reasons. First, it allows researchers to assess the relative 
# importance of different sources of variation in the data. For example, if the variance between groups is much larger than the variance 
# within groups, this suggests that the independent variable is an important predictor of the dependent variable.

# Second, partitioning of variance allows researchers to estimate effect sizes, which are important for interpreting the practical 
# significance of the results. For example, eta-squared measures the proportion of the total variability explained by group membership, 
# and is calculated by dividing the sum of squares between groups by the total sum of squares.

# Finally, understanding the concept of partitioning of variance is important for interpreting the output of ANOVA analyses. 
# ANOVA output typically includes the sum of squares, degrees of freedom, and mean square for each source, along with the F-statistic, 
# which is calculated by dividing the mean square between groups by the mean square within groups (each mean square is a sum of 
# squares divided by its degrees of freedom). By understanding these components, 
# researchers can more accurately interpret the results of their analyses and draw appropriate conclusions.
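
# The partitioning can be checked by hand on a tiny example. This is a sketch with made-up numbers (three hypothetical 
# groups of three observations); the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical data: three small groups
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group, and within-group sums of squares
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom for each source
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of mean squares; eta-squared is the ratio of sums of squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / ss_total

# Cross-check the hand-computed F against scipy
f_scipy, p_scipy = f_oneway(*groups)

print(ss_total, ss_between, ss_within)  # 60.0 54.0 6.0
print(f_stat, f_scipy, eta_squared)
```

# Here the between and within sums of squares (54 and 6) add up to the total (60), and the hand-computed F matches f_oneway.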

In [None]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual 
# sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
# In a one-way ANOVA, the total sum of squares (SST) represents the total amount of variability in the dependent variable. 
# The explained sum of squares (SSE) represents the amount of variability in the dependent variable that is explained by 
# the independent variable (i.e., the group means). The residual sum of squares (SSR) represents the amount of variability 
# in the dependent variable that is not explained by the independent variable and is due to random variability or measurement error.

# In Python, you can calculate SST, SSE, and SSR using the following steps:

# Calculate the grand mean of the dependent variable (i.e., the mean of all observations).

# Calculate the deviation of each observation from the grand mean.

# Calculate the sum of squares for SST by squaring each deviation and summing them.

# Calculate the sum of squares for SSE (explained) by summing, over all observations, the squared deviation of each 
# observation's group mean from the grand mean.

# Calculate the sum of squares for SSR (residual) by summing the squared deviations of each observation from its group mean.
# As a check, SSE + SSR should equal SST.

import pandas as pd

# Load data into a pandas dataframe (the file name and column names are placeholders)
df = pd.read_csv('data.csv')

# Calculate the grand mean
grand_mean = df['dependent_variable'].mean()

# Calculate the sum of squares for SST
sst = ((df['dependent_variable'] - grand_mean) ** 2).sum()

# Map each observation to the mean of its own group
group_means = df.groupby('group_variable')['dependent_variable'].mean()
obs_group_mean = df['group_variable'].map(group_means)

# Calculate the sum of squares for SSE (explained, between groups)
sse = ((obs_group_mean - grand_mean) ** 2).sum()

# Calculate the sum of squares for SSR (residual, within groups)
ssr = ((df['dependent_variable'] - obs_group_mean) ** 2).sum()


In [None]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
# In a two-way ANOVA, there are two main effects (one for each independent variable) and one interaction effect.
# A main effect describes the effect of one independent variable on the dependent variable, averaged over the levels of the other.
# The interaction effect describes whether the effect of one independent variable depends on the level of the other.

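# To test the main effects and the interaction effect in a two-way ANOVA using Python, you can fit a linear model with the 
# statsmodels library and read the effects from its ANOVA table. Here's an example (the file name and column names are placeholders):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load data into a pandas dataframe
df = pd.read_csv('data.csv')

# Fit the two-way ANOVA model; C() marks a variable as categorical and * adds both main effects and the interaction
model = smf.ols('dependent_variable ~ C(independent_variable_1) * C(independent_variable_2)', data=df).fit()

# Build the ANOVA table (sum of squares, degrees of freedom, F, and p-value for each effect)
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the rows for the main effects
main_effect_1 = anova_table.loc['C(independent_variable_1)']
main_effect_2 = anova_table.loc['C(independent_variable_2)']

# Extract the row for the interaction effect
interaction_effect = anova_table.loc['C(independent_variable_1):C(independent_variable_2)']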


In [None]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. 
# What can you conclude about the differences between the groups, and how would you interpret these 
# results?

In [None]:
# If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there is 
# a statistically significant difference between the means of the groups. The F-statistic indicates the ratio of the variability 
# between groups to the variability within groups, and a larger F-statistic indicates a larger difference between the means of 
# the groups relative to the variability within the groups. The p-value is the probability of obtaining an F-statistic 
# at least as large as the one observed if the null hypothesis (i.e., the means of the groups are equal) were true.

# In this case, the p-value of 0.02 means that an F-statistic of 5.23 or larger would occur only about 2% of the time 
# if the null hypothesis were true, which is below the commonly used significance level of 5%. Therefore, 
# we can reject the null hypothesis and conclude that at least one group mean differs significantly from the others.

# To interpret these results, you would need to examine the means of the groups and conduct post-hoc tests 
# (e.g., Tukey's HSD test) to determine which groups differ significantly from each other. An effect size 
# such as eta-squared (for the overall effect) or Cohen's d (for a specific pair of groups) can also be calculated 
# to assess the practical significance of the differences.

In [None]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential 
# consequences of using different methods to handle missing data?

In [None]:
# In a repeated measures ANOVA, missing data can occur when participants do not provide responses for some of the repeated measures. 
# There are different methods to handle missing data, and the method chosen can affect the results of the analysis.

# One common method for handling missing data in repeated measures ANOVA is to use listwise deletion, also known as complete case analysis. 
# With this method, any participant with missing data on any of the repeated measures is excluded from the analysis. 
# The advantage of this method is that it is simple to implement and ensures that the analysis only includes complete cases. 
# However, listwise deletion reduces the sample size and statistical power, and it can bias the results 
# unless the data are missing completely at random.

# Another method for handling missing data is to use imputation, which involves replacing missing values with estimated values based on 
# the available data. There are different types of imputation methods, including mean imputation, regression imputation, 
# and multiple imputation. The advantage of imputation is that it can preserve sample size and reduce potential bias due to missingness. 
# However, the choice of imputation method can affect the results of the analysis. Single imputation methods such as mean imputation 
# understate the variability in the data, which leads to standard errors that are too small and biased hypothesis tests; 
# multiple imputation is designed to reflect the uncertainty about the missing values.
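
# A small sketch of listwise deletion and mean imputation in pandas. The participants, time points, and values are all 
# hypothetical, and the data are in wide format (one row per participant, one column per measurement occasion).

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data with two missing measurements
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: keep only participants with no missing values
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of the observed values in that column
mean_imputed = wide.fillna(wide.mean())

print("Participants kept after listwise deletion:", len(complete_cases))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
```

# Listwise deletion keeps only two of the four participants, while mean imputation keeps all four but shrinks the variance of t2.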

In [None]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide 
# an example of a situation where a post-hoc test might be necessary.

In [None]:
# After conducting an ANOVA, post-hoc tests can be used to compare specific pairs of groups to identify which groups differ significantly 
# from each other. Some common post-hoc tests include:

# Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of means and controls the overall type 
# I error rate at a specified level (usually 0.05). It is often used when there are equal sample sizes and equal variances across groups.

# Bonferroni correction: This method divides the significance level by the number of pairwise comparisons to control the overall 
# type I error rate. It is simple and general, but it becomes very conservative when there are many comparisons.

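# Scheffe's test: This test is more conservative than Tukey's HSD test and is used when complex contrasts (not just pairwise 
# comparisons) are of interest; it can also be applied with unequal sample sizes. It controls the overall type I error rate 
# at a specified level, but like ANOVA it assumes equal variances across groups.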

# Dunnett's test: This test is used to compare each group with a control group. It controls the overall type I error rate at a specified level.

# An example of a situation where a post-hoc test might be necessary is in a study comparing the effectiveness of four different treatments 
# for depression. The ANOVA might show that there is a significant difference in mean depression scores among the four treatment groups. 
# However, the ANOVA does not tell us which specific pairs of groups differ significantly. A post-hoc test such as Tukey's
# HSD test could be used to compare all possible pairs of treatment groups and identify which pairs differ significantly from each other. 
# This information could be useful for clinicians in deciding which treatments to recommend to their patients.
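
# A sketch of Tukey's HSD test for the four-treatment scenario, using pairwise_tukeyhsd from statsmodels. The scores are 
# simulated, and the treatment labels, group means, standard deviation, and group size are assumptions made for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Hypothetical depression scores: four treatments, 25 participants each
treatments = ["T1", "T2", "T3", "T4"]
assumed_means = [20, 18, 15, 15]  # assumed population means, for illustration only
scores = np.concatenate([rng.normal(m, 4, 25) for m in assumed_means])
labels = np.repeat(treatments, 25)

# Compare all pairs of treatments while controlling the family-wise error rate at 0.05
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

# The summary has one row per pair of treatments (six pairs here) with the mean difference, adjusted p-value, and a reject flag.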

In [None]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python 
# to determine if there are any significant differences between the mean weight loss of the three diets. 
# Report the F-statistic and p-value, and interpret the results.

In [None]:
import numpy as np
from scipy.stats import f_oneway

# create example weight loss data for each diet
# Simulated data: 50 observations per diet, with mean weight losses of 10, 12, and 8,
# respectively. The standard deviation for each diet is set to 2.
diet_a = np.random.normal(10, 2, 50)
diet_b = np.random.normal(12, 2, 50)
diet_c = np.random.normal(8, 2, 50)

# conduct one-way ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# print results
print("F-statistic: ", f_stat)
print("p-value: ", p_val)


In [None]:
# Since the p-value (about 1.45e-18 in the run recorded here; the data are simulated without a fixed seed, so the exact value 
# changes from run to run) is far below 0.05, we can reject 
# the null hypothesis that there is no difference in mean weight loss between the three diets. 
# We can conclude that there is strong evidence that at least one of the diets produces 
# different mean weight loss than the others. However, we cannot determine which specific diets are different from 
# each other using only the ANOVA. A post-hoc test, such as Tukey's HSD test, would be necessary to make these pairwise comparisons.

In [None]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to 
# complete a task using three different software programs: Program A, Program B, and Program C. They 
# randomly assign 30 employees to one of the programs and record the time it takes each employee to 
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or 
# interaction effects between the software programs and employee experience level (novice vs. 
# experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Generate example data
programs = ["Program A", "Program B", "Program C"]
experience_levels = ["Novice", "Experienced"]
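n = 30  # simulated observations per Program x Experience combination

# Build one DataFrame per combination and concatenate them
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
for program in programs:
    for experience_level in experience_levels:
        times = np.random.normal(loc=20, scale=5, size=n)
        frames.append(pd.DataFrame({
            "Program": program,
            "Experience": experience_level,
            "Time": times
        }))

data = pd.concat(frames, ignore_index=True)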

# Perform two-way ANOVA
model = ols("Time ~ Program + Experience + Program:Experience", data).fit()
anova_table = anova_lm(model, typ=2)

print(anova_table)

# Interpretation: each row of the table reports the F-statistic and p-value for one effect (Program, Experience, and the
# Program:Experience interaction). An effect is significant at the 5% level if its p-value is below 0.05.
# Because every combination here is simulated from the same distribution (mean 20, standard deviation 5),
# any significant effect in this output would be due to chance.


In [None]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test 
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the 
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a 
# two-sample t-test using Python to determine if there are any significant differences in test scores 
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which 
# group(s) differ significantly from each other.

In [None]:
import numpy as np
from scipy.stats import ttest_ind

# generate sample data: 100 students split evenly between the two groups (simulated scores)
control_scores = np.random.normal(loc=75, scale=10, size=50)
experimental_scores = np.random.normal(loc=80, scale=10, size=50)

# conduct two-sample t-test
t_stat, p_val = ttest_ind(control_scores, experimental_scores)

# print results
print("t-statistic: ", t_stat)
print("p-value: ", p_val)


In [None]:
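# If the p-value is less than our chosen significance level (e.g., 0.05), we can reject the null hypothesis and 
# conclude that there is a significant difference in test scores between the control and experimental groups.
# Because only two groups are compared, no post-hoc test is needed: comparing the two sample means shows which group scored higher.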

In [None]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three 
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store 
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any 
# significant differences in sales between the three stores. If the results are significant,
# follow up with a posthoc test to determine which store(s) differ significantly from each other.
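
# One possible approach, sketched with simulated data: the 30 days act as the "subjects" and store is the within-subject factor, 
# so AnovaRM from statsmodels can be used. The sales figures, store means, and noise levels are assumptions, not real data.
# The post-hoc step uses paired t-tests with a Bonferroni correction, which is one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
n_days = 30
stores = ["Store A", "Store B", "Store C"]
assumed_means = {"Store A": 1000, "Store B": 1100, "Store C": 950}  # assumed, for illustration

# A shared day effect makes sales on the same day correlated across stores
day_effect = rng.normal(0, 50, n_days)
frames = []
for store in stores:
    sales = assumed_means[store] + day_effect + rng.normal(0, 80, n_days)
    frames.append(pd.DataFrame({"Day": np.arange(n_days), "Store": store, "Sales": sales}))
data = pd.concat(frames, ignore_index=True)

# Repeated measures ANOVA with Day as the subject and Store as the within-subject factor
rm_results = AnovaRM(data, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_results.anova_table)

# Post-hoc: paired t-tests for each pair of stores, Bonferroni-adjusted significance level
pairs = list(combinations(stores, 2))
alpha = 0.05 / len(pairs)
wide = data.pivot(index="Day", columns="Store", values="Sales")
for a, b in pairs:
    t_stat, p_val = ttest_rel(wide[a], wide[b])
    print(a, "vs", b, "t =", t_stat, "p =", p_val, "significant:", p_val < alpha)
```

# If the p-value in the ANOVA table is below 0.05, the stores differ in mean daily sales, and the paired comparisons show which pairs differ.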