In [1]:
#Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
"""
The key assumptions of ANOVA are:
1. Normality: The data (equivalently, the residuals) should be approximately normally distributed within each group. Violations, especially with
   small samples, can distort the p-value of the F-test.
2. Homogeneity of variance: The variance of the data within each group should be approximately equal. Violations can increase the risk of
   Type I errors (false positives) or Type II errors (false negatives), particularly when the group sizes are unequal.
3. Independence: The observations should be independent of each other, both within and across groups. Violations typically increase the
   risk of Type I errors.

Examples of violations that could impact the validity of ANOVA results include:
1. Non-normality: If the data within a group is strongly skewed or contains extreme outliers, the F-test may report differences that are not real
   or miss differences that are. For example, a few very large values in one group can shift that group's mean and inflate its variance.
2. Heterogeneity of variance: If the group variances are not equal, the pooled within-group variance used by the F-test misrepresents some groups.
   For example, if one small group has a much larger variance than the others, ANOVA may incorrectly identify significant differences.
3. Dependence: If the observations are not independent of each other, the within-group variation is usually underestimated.
   For example, if students are sampled from the same few classrooms (clustering), ANOVA may incorrectly identify significant differences.
"""
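
# A minimal sketch of how the normality and equal-variance assumptions above might be checked, assuming simulated data for three
# hypothetical groups (the group names, sizes, and distribution parameters are illustrative and not part of the original answer).
# Shapiro-Wilk is used here for normality and Levene's test for homogeneity of variance; other tests could be swapped in.

```python
import numpy as np
from scipy import stats

# Simulated data for three hypothetical groups (assumption: not real measurements)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=12, scale=2, size=30),
    "C": rng.normal(loc=11, scale=2, size=30),
}

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

# Independence cannot be tested this way; it has to be judged from how the data were collected.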

In [None]:
#Q2. What are the three types of ANOVA, and in what situations would each be used?
"""
There are three main types of ANOVA:
1. One-way ANOVA: This type of ANOVA is used when there is only one independent variable (factor) being tested with two or more levels (categories). 
   For example, if a researcher wants to test whether there is a significant difference in the mean scores of three different groups of students who 
   were given different types of instruction.
2. Repeated Measures ANOVA: This type of ANOVA is used when there is one independent variable (factor) with two or more levels and the same 
   subjects are measured at every level, so the observations are dependent. For example, measuring the distance covered by the same runners on 
   several different days.
3. Two-way ANOVA: This type of ANOVA is used when there are two independent variables (factors) being tested with two or more levels each.
   For example, if a researcher wants to test whether there is a significant difference in the mean scores of two different groups of students who 
   were given different types of instruction, and whether this difference is affected by the gender of the students.
"""

In [None]:
#Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
"""
The partitioning of variance in ANOVA refers to the decomposition of the total variation in the data into different sources of variation.

The total variation in the data can be divided into two main components:
1. Between-group variation, which is the variation in the means of different groups or treatments. This component measures the extent to which the 
   means of the different groups differ from each other.
2. Within-group variation, which is the variation within each group or treatment. This component measures the amount of variation among individual 
   observations within each group.

The partitioning of variance in ANOVA is important because it helps us to understand the sources of variation in the data and determine whether any 
differences between the groups are statistically significant or due to chance.
"""
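
# A small worked sketch of the partition described above, using the same nine values that appear in the Q4 cell. The variable names
# (ss_between, ss_within, ss_total) are chosen here for illustration. It checks that total variation = between-group + within-group.

```python
import numpy as np

# Three groups of three observations each (same values as the Q4 example)
groups = [
    np.array([4.0, 6.0, 5.0]),
    np.array([7.0, 9.0, 8.0]),
    np.array([3.0, 2.0, 4.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group variation: squared distance of each group mean from the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: squared distance of each observation from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total variation: squared distance of each observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# The F-statistic is the ratio of the two mean squares
k, n = len(groups), len(all_values)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print("between:", ss_between, "within:", ss_within, "total:", ss_total)
print("partition holds:", np.isclose(ss_total, ss_between + ss_within))
print("F:", f_stat)
```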

In [1]:
#Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA 
#    using Python?

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#Next, create a pandas DataFrame with the data for the one-way ANOVA:
data = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                     'value': [4, 6, 5, 7, 9, 8, 3, 2, 4]})

#Then, use the ols function to fit the ANOVA model and calculate the sums of squares:
model = ols('value ~ group', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

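# Explained (between-group) and residual (within-group) sums of squares come straight from the ANOVA table
SSE = anova_table['sum_sq']['group']
SSR = anova_table['sum_sq']['Residual']
# The total sum of squares is the sum of the two components
SST = SSE + SSR

print('SST:', SST, 'SSE:', SSE, 'SSR:', SSR)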


In [None]:
#Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.read_csv('data.csv')

# 'a * b' expands to both main effects plus the a:b interaction term
model = ols('dependent_variable ~ independent_variable_1 * independent_variable_2', data=data).fit()

# compute the ANOVA table once and derive the mean squares (sum of squares / degrees of freedom)
anova_table = sm.stats.anova_lm(model, typ=2)
mean_sq = anova_table['sum_sq'] / anova_table['df']

main_effect_1 = mean_sq['independent_variable_1']
main_effect_2 = mean_sq['independent_variable_2']
interaction_effect = mean_sq['independent_variable_1:independent_variable_2']

# the F-statistic and p-value of each effect are in the 'F' and 'PR(>F)' columns
print(anova_table)


In [None]:
#Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences 
#    between the groups, and how would you interpret these results?
"""
When conducting a one-way ANOVA, an F-statistic is calculated to test whether the means of the groups (typically three or more) differ 
significantly.
The null hypothesis states that all group means are equal.
The alternative hypothesis states that at least one mean differs from the others.

In this case, the F-statistic is 5.23 and the associated p-value is 0.02. Since the p-value is less than the significance level of 0.05, 
we can reject the null hypothesis and conclude that there are significant differences between the means of the groups.

However, the ANOVA test does not tell us which specific group or groups differ from the others.
"""

In [None]:
#Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle 
#    missing data?
"""
Handling missing data in a repeated measures ANOVA means deciding what to do with subjects who are missing one or more measurements, because 
the classical analysis needs a complete set of observations for every subject.
1. Remove the observations with missing data: This reduces the sample size. The approach is called "complete case analysis" or "listwise 
   deletion." 
   Consequences: While this method is straightforward, it can produce biased estimates if the data are not missing completely at random (MCAR),
                 and it reduces statistical power.
  
2. Estimate the missing values from the observed data: There are various methods for imputing missing data, such as mean imputation, regression
   imputation, and multiple imputation.
   Consequences: Single imputation (e.g., mean imputation) understates variability, so estimates can look more precise than they are and confidence
                 intervals can have poor coverage. Multiple imputation is designed to reflect that extra uncertainty.
"""
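
# A minimal sketch contrasting the two approaches above, assuming a tiny hypothetical wide-format dataset (one row per subject, one
# column per time point; the subject labels and values are made up for illustration). It shows how listwise deletion shrinks the
# sample and how mean imputation shrinks the spread of the imputed column.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements with two missing values
scores = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: drop every subject with at least one missing measurement
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = scores.fillna(scores.mean())

print("Subjects kept after listwise deletion:", len(complete_cases))
print("Std of t2, observed values only:", scores["t2"].std())
print("Std of t2, after mean imputation:", mean_imputed["t2"].std())
```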

In [None]:
#Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test 
#    might be necessary.
"""
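Some of the commonly used post-hoc tests are:
1. Tukey's Honestly Significant Difference (HSD): used to compare every pair of group means while controlling the family-wise error rate.
2. Scheffé test: used when arbitrary contrasts (not only pairwise comparisons) are of interest; it is more conservative than Tukey's HSD for 
   pairwise comparisons.
3. Bonferroni correction: used for a small number of pre-planned comparisons, dividing the significance level by the number of tests.
4. Dunnett's test: used when each treatment group is compared against a single control group.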

An example of a situation where a post-hoc test might be necessary is in a study comparing the effectiveness of three different medications for 
treating a particular condition. After conducting an ANOVA test, it is found that there is a significant difference in the effectiveness of the 
medications. However, the ANOVA test does not tell us which medication is more effective than the others. To determine this, a post-hoc test, 
such as Tukey's HSD would be conducted to compare each medication with the others and determine which pairs are significantly different from each 
other. This would provide more detailed information about the differences between the medications and help to guide treatment decisions.
"""
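
# A sketch of the medication example above using Tukey's HSD, assuming simulated improvement scores for three hypothetical medications
# (the labels med_A, med_B, med_C, the group size of 20, and the means are illustrative assumptions, not real trial data).

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)

# Simulated scores: 20 patients per hypothetical medication
scores = np.concatenate([
    rng.normal(loc=10, scale=2, size=20),
    rng.normal(loc=12, scale=2, size=20),
    rng.normal(loc=15, scale=2, size=20),
])
labels = np.repeat(["med_A", "med_B", "med_C"], 20)

# Compare every pair of medications while controlling the family-wise error rate at 5%
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

# Each row of the summary is one pair of medications; the "reject" column marks the pairs whose means differ significantly.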

In [3]:
#Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned
#   to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of 
#   the three diets. Report the F-statistic and p-value, and interpret the results.

import pandas as pd
import scipy.stats as stats

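# NOTE: the `...` entries below are placeholders for the elided observations; replace them with the full
# 50-participant dataset before running, otherwise the ANOVA below will not run correctly
data = pd.DataFrame({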
    'diet': ['A', 'A', 'A', ..., 'C', 'C', 'C'],
    'weight_loss': [1.2, 2.0, 3.5, ..., 4.2, 2.9, 3.8]
})

f_statistic, p_value = stats.f_oneway(data[data['diet'] == 'A']['weight_loss'],
                                      data[data['diet'] == 'B']['weight_loss'],
                                      data[data['diet'] == 'C']['weight_loss'])

print('F-statistic: ', f_statistic)
print('p-value: ', p_value)
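
# Since the dataset above is elided, here is a runnable sketch of the same analysis on simulated data, including the interpretation
# step. The group sizes (17/17/16), means, and the names sim, f_sim, p_sim are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(0)

# Simulated weight loss for 50 participants split across three diets (made-up parameters)
sim = pd.DataFrame({
    "diet": ["A"] * 17 + ["B"] * 17 + ["C"] * 16,
    "weight_loss": np.concatenate([
        rng.normal(2.0, 1.0, 17),
        rng.normal(3.0, 1.0, 17),
        rng.normal(3.5, 1.0, 16),
    ]),
})

# One array of weight-loss values per diet
samples = [g["weight_loss"].to_numpy() for _, g in sim.groupby("diet")]
f_sim, p_sim = stats.f_oneway(*samples)

print("F-statistic:", f_sim)
print("p-value:", p_sim)

# Interpretation at the 5% significance level
alpha = 0.05
if p_sim < alpha:
    print("Reject the null hypothesis: at least one diet has a different mean weight loss.")
else:
    print("Fail to reject the null hypothesis: no evidence of a difference in mean weight loss.")
```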

In [5]:
#Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software 
#     programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each 
#     employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between 
#     the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Read in the data. Assume it is stored in a CSV file called "task_times.csv" with columns "program" (A, B, or C),
# "experience" (novice or experienced), and "time" (task completion time)
data = pd.read_csv("task_times.csv")

# define the ANOVA model
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()

# print the ANOVA table
table = sm.stats.anova_lm(model, typ=2)
print(table)

In [6]:
#Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to 
#     either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the 
#     semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. 
#     If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

import numpy as np
from scipy.stats import ttest_ind

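# generate simulated data (note: this draws 100 scores per group, i.e. 200 in total, whereas the question describes 100 students overall)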
np.random.seed(1)
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=10, size=100)

# conduct two-sample t-test
t_stat, p_val = ttest_ind(control_scores, experimental_scores)

print('t-statistic: ', t_stat)
print('p-value: ', p_val)

# If the p-value is less than the significance level (e.g., 0.05), we can reject the null hypothesis and conclude that there is a significant 
# difference in test scores between the two groups. With only two groups the t-test already identifies which groups differ, so a post-hoc test 
# is not strictly necessary; Tukey's HSD is run below only because the question asks for it, and it repeats the same single comparison.

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# combine the data and group labels
all_scores = np.concatenate([control_scores, experimental_scores])
group_labels = ['control'] * 100 + ['experimental'] * 100

# conduct Tukey's HSD test
tukey_results = pairwise_tukeyhsd(all_scores, group_labels)

print(tukey_results)


t-statistic:  -4.584315463985094
p-value:  8.059088190829134e-06
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
control experimental   5.9221   0.0 3.3746 8.4696   True
--------------------------------------------------------


In [None]:
#Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and 
#     Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to 
#     determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a 
#     post-hoc test to determine which store(s) differ significantly from each other.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

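# Assuming the sales data is stored in a CSV file called "sales_data.csv" in wide format: a "Store" column
# plus one column per day (this layout is inferred from the melt call below)

sales_data = pd.read_csv("sales_data.csv")
sales_data_long = pd.melt(sales_data, id_vars="Store", var_name="Day", value_name="Sales")

# Each day is observed for all three stores, so Day is the repeated "subject" and Store is the within factor
model = AnovaRM(sales_data_long, depvar="Sales", subject="Day", within=["Store"])
results = model.fit()
print(results.summary())

# Pairwise follow-up; note that Tukey's HSD treats the stores as independent groups and ignores the pairing by day
tukey_results = pairwise_tukeyhsd(sales_data_long["Sales"], sales_data_long["Store"])
print(tukey_results.summary())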
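
# Because "sales_data.csv" is not available here, the following sketch runs a repeated measures ANOVA on simulated long-format data.
# The column names (day, store, sales), the sales distribution, and the rm_* names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(2)

# Simulated sales: 30 days, each observed once for stores A, B, and C
rm_data = pd.DataFrame({
    "day": np.repeat(np.arange(1, 31), 3),
    "store": np.tile(["A", "B", "C"], 30),
    "sales": rng.normal(loc=1000, scale=100, size=90),
})

# Day is the repeated unit (subject); store is the within-subject factor
rm_results = AnovaRM(rm_data, depvar="sales", subject="day", within=["store"]).fit()

# Table with the F value, degrees of freedom, and p-value for the store effect
print(rm_results.anova_table)
```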