In [None]:
""
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.
""

In [None]:
""
ANOVA (Analysis of Variance) is a statistical technique used to test whether there is a significant difference between the means of three or more groups. Its results are valid only if certain assumptions hold. The assumptions required for ANOVA are:

Independence: The observations are independent of each other, both within and between groups.
Normality: The scores within each group (equivalently, the model residuals) are approximately normally distributed.
Homogeneity of variance: The variance of scores is approximately equal across groups.
If these assumptions are violated, the validity of the ANOVA results may be compromised. Examples of violations that could impact the validity of the results include:

Non-independence: If observations are correlated with each other, such as repeated measurements on the same individual or clustered data (for example, students nested within classrooms), the within-group variability is underestimated and the false-positive rate is inflated.
Non-normality: If the distribution of scores within a group is heavily skewed or contains extreme outliers, the F-test can be distorted, especially with small or unequal group sizes. With large, balanced samples ANOVA is fairly robust to moderate departures from normality.
Heterogeneity of variance: If one or more groups have much larger variances than the others, the F-test can be too liberal or too conservative. The problem is most serious when the group sizes are also unequal.
In such situations, alternative methods may be needed. For example, a nonparametric test such as the Kruskal-Wallis test may be used when the normality assumption is violated, Welch's ANOVA may be used when the variances are unequal, and a repeated measures or mixed-effects model may be used when the observations are not independent.

""
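The normality and equal-variance assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming SciPy is available; the three groups are simulated, hypothetical data rather than data from this notebook, and the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 scores (simulated, not from the notebook)
rng = np.random.default_rng(0)
groups = [
    rng.normal(50, 5, 30),
    rng.normal(52, 5, 30),
    rng.normal(55, 5, 30),
]

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

# Nonparametric alternative when normality looks doubtful: Kruskal-Wallis
kw_stat, kw_p = stats.kruskal(*groups)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kw_p)
```

Independence cannot be tested this way; it has to be judged from how the data were collected.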

In [None]:
#Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
""
The three types of ANOVA are:

One-Way ANOVA: Used when there is one independent variable (factor) with two or more levels, most commonly three or more, since with exactly two levels it gives the same result as an independent-samples t-test. It tests whether the group means differ. For example, a One-Way ANOVA can be used to compare the mean scores of students in different classes or different schools.

Two-Way ANOVA: Used when there are two independent variables, each with two or more levels. It tests the main effect of each independent variable and the interaction effect between them. For example, a Two-Way ANOVA can be used to examine whether mean scores differ between classes (first independent variable) and whether that difference depends on gender (second independent variable).

Repeated Measures ANOVA: Used when the same individuals are measured on the same variable under different conditions or at different time points. It tests whether the means differ across the conditions or time points. For example, a Repeated Measures ANOVA can be used to compare the mean scores of patients at several time points during a treatment.

In short, the choice depends on the research question and the study design: one factor with independent groups calls for One-Way ANOVA, two factors call for Two-Way ANOVA, and repeated measurements on the same individuals call for Repeated Measures ANOVA.
""

In [None]:
#Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
""
The partitioning of variance in ANOVA refers to the division of the total variance in the data into different sources of variation that can be attributed to different factors or variables. This is done by decomposing the total variance in the data into two or more components: the variance between groups (also known as the "treatment" or "factor" variance) and the variance within groups (also known as the "error" or "residual" variance).

The between-groups variance represents the variation in the data that can be attributed to the differences between the groups being compared, while the within-groups variance represents the variation that cannot be attributed to group differences and instead reflects the natural variability of the data within each group. In terms of sums of squares, the total sum of squares equals the between-groups sum of squares plus the within-groups sum of squares. Dividing each component by its degrees of freedom gives the mean squares, and the F-statistic is the ratio of the between-groups mean square to the within-groups mean square.

Understanding the partitioning of variance is important because it explains what the F-test actually compares: a large F means the groups differ by more than would be expected from within-group variability alone. It also shows the relative importance of different sources of variation in the data, which can help inform future research and experimental designs. Finally, the components are reused elsewhere: effect sizes such as eta squared are the ratio of the between-groups sum of squares to the total sum of squares, and post-hoc tests such as Tukey's HSD use the within-groups mean square as their error estimate.





""
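The partition described above can be verified numerically. The following is a small sketch using NumPy on hypothetical scores for three groups (the numbers are invented for illustration); it checks that the total sum of squares equals the between-groups plus within-groups sums of squares.

```python
import numpy as np

# Hypothetical scores for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Total: squared deviations of every score from the grand mean
ss_total = ((all_scores - grand_mean) ** 2).sum()

# Between: group size times squared deviation of each group mean from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within: squared deviations of each score from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", ss_total)      # 60.0
print("SS between:", ss_between)  # 54.0
print("SS within:", ss_within)    # 6.0
print("F:", f_stat)               # 27.0
```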

In [None]:
""
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

""

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data into a pandas dataframe
data = pd.read_csv('data.csv')

# Fit the one-way ANOVA model; C() treats group as a categorical factor
model = ols('score ~ C(group)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares, called SSE in the question
ss_explained = anova_table.loc['C(group)', 'sum_sq']

# Residual (within-group) sum of squares, called SSR in the question
ss_residual = anova_table.loc['Residual', 'sum_sq']

# Total sum of squares (SST) is the sum of the two components
ss_total = ss_explained + ss_residual

print('Total sum of squares (SST):', ss_total)
print('Explained sum of squares (SSE):', ss_explained)
print('Residual sum of squares (SSR):', ss_residual)


In [None]:
#Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data into a pandas dataframe
data = pd.read_csv('data.csv')

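# Fit the two-way ANOVA model; C() marks each factor as categorical
model = ols('score ~ C(group1) + C(group2) + C(group1):C(group2)', data=data).fit()

# Build the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects: one F-test per factor
main_effects = anova_table.loc[['C(group1)', 'C(group2)']]

# Interaction effect: F-test for the group1 x group2 term
interaction_effect = anova_table.loc[['C(group1):C(group2)']]

print('Main effects:\n', main_effects)
print('Interaction effect:\n', interaction_effect)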


In [None]:
""
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?
""

In [None]:

""
If a one-way ANOVA produces an F-statistic of 5.23 and a p-value of 0.02, the null hypothesis of equal means across all groups can be rejected at the 0.05 significance level (but not at the stricter 0.01 level, since 0.02 > 0.01). The conclusion is that at least one group mean differs from the others.

The F-statistic is the ratio of between-group variability (mean square between) to within-group variability (mean square within). A value of 5.23 means that the variation between the group means is about five times what would be expected from within-group variation alone. The F-statistic is not an effect size, however: it depends on the sample size, so it does not by itself show how large the differences are.

The p-value is the probability of observing an F-statistic at least this large if the null hypothesis were true. A p-value of 0.02 means that an F-statistic of 5.23 or larger would occur about 2% of the time under the null hypothesis.

In summary, the result indicates a statistically significant difference among the group means at the 0.05 level. It does not identify which groups differ, so a post-hoc test (such as Tukey's HSD) is needed for that, and an effect size (such as eta squared) is needed to judge how large the differences are.

""
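To make the link between F, the p-value, and effect size concrete, here is a rough sketch using SciPy. The question does not give the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions chosen for illustration; the resulting p-value will therefore not match 0.02 exactly.

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # assumption: 3 groups
df_within = 27   # assumption: 30 observations in total

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)

# Eta squared (SS_between / SS_total) recovered from F and the degrees of freedom
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print("p-value under the assumed degrees of freedom:", p_value)
print("eta squared under the assumed degrees of freedom:", eta_squared)
```

The same F-statistic corresponds to a smaller eta squared as `df_within` grows, which is why F alone should not be read as the size of the effect.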


In [None]:
""
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

""

In [None]:
""
In a repeated measures ANOVA, missing data can be handled using various methods, depending on the nature of the missingness and the assumptions made about the data. Here are some common methods for handling missing data in repeated measures ANOVA:

Listwise deletion: This method involves deleting any cases that have missing values on any of the variables used in the analysis. This method is easy to implement but can lead to loss of statistical power and biased results if the missingness is not completely at random.

Pairwise deletion: This method involves using all available data for each comparison, even if some cases have missing values on some of the variables. This method can be more efficient than listwise deletion, but can also lead to biased results if the missingness is not completely at random.

Imputation: This method involves estimating the missing values based on the observed values of the other variables. There are various methods of imputation, such as mean imputation, regression imputation, and multiple imputation. Imputation can help retain statistical power and reduce bias, but it also relies on assumptions about the data and the imputation method used.

The potential consequences of these choices are as follows:

Listwise deletion shrinks the sample, because a subject with a single missing time point is dropped entirely. This reduces statistical power and biases the estimates unless the data are missing completely at random (MCAR).

Pairwise deletion bases each comparison on a different subset of subjects, so results can be inconsistent across comparisons as well as biased when the data are not MCAR.

Single imputation (for example, mean imputation) keeps the full sample but understates the variability of the data, which makes standard errors too small. Multiple imputation accounts for the uncertainty in the imputed values, but its validity depends on the imputation model and on the assumptions made about why the data are missing.

Therefore, it is important to carefully evaluate the nature of the missing data and choose an appropriate method of handling missing data that is consistent with the assumptions made about the data. It is also important to report the method used to handle missing data in the analysis to ensure transparency and reproducibility of the results.

""
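The difference between listwise deletion and mean imputation can be seen on a small example. The sketch below uses pandas with hypothetical wide-format data (one row per subject, one column per time point); the values and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 9.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 10.0, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 13.0, 15.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: drop every subject with at least one missing time point
listwise = wide.dropna()

# Mean imputation: fill each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept after listwise deletion:", len(listwise))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
```

Listwise deletion keeps three of the five subjects here, while mean imputation keeps all five but reduces the variance of `t2`, which illustrates why imputed data can make standard errors look smaller than they should be.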

In [None]:
""
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.
""

In [None]:
""
Post-hoc tests are used after conducting an ANOVA to determine which specific groups have significant differences in their means. Here are some common post-hoc tests used in ANOVA, along with examples of situations where they might be used:

Tukey's HSD (honestly significant difference): This test compares the mean difference between all pairs of groups and adjusts for multiple comparisons. It is often used when there are more than two groups and the research question involves identifying which specific groups differ significantly from each other.
Example: A researcher is conducting an experiment to compare the effectiveness of four different medications for treating a particular condition. After conducting an ANOVA, the researcher wants to determine which specific medications have significantly different mean effectiveness scores.

Bonferroni correction: This procedure divides the significance level by the number of comparisons (or, equivalently, multiplies each p-value by that number) to control the family-wise error rate. It is often used when a small number of planned pairwise comparisons are being made, because it becomes very conservative as the number of comparisons grows.
Example: A researcher is conducting an experiment to compare the effectiveness of three different teaching methods for improving student performance on a particular exam. After conducting an ANOVA, the researcher wants to determine whether each of the three teaching methods produces significantly different mean exam scores.

Dunnett's test: This test compares the mean difference between each treatment group and a control group. It is often used when the research question involves comparing several treatment groups to a single control group.
Example: A researcher is conducting an experiment to compare the effects of three different diets on weight loss, with a control group that does not follow any specific diet. After conducting an ANOVA, the researcher wants to determine whether each of the three diets produces significantly different mean weight loss compared to the control group.

Scheffe's test: This test controls the family-wise error rate for all possible contrasts among the group means, including complex contrasts that combine several groups, not just pairwise differences. It is more conservative than Tukey's HSD for pairwise comparisons, so it is mainly used when complex or unplanned contrasts are of interest.
Example: A researcher is conducting an experiment to compare the effects of four different exercise programs on improving cardiovascular health. After conducting an ANOVA, the researcher wants to test whether two of the programs, taken together, produce a different mean cardiovascular health score than the other two programs combined.

In summary, post-hoc tests are used after conducting ANOVA to determine which specific groups have significant differences in their means. The choice of post-hoc test depends on the research question and the number of groups being compared. A post-hoc test might be necessary when the ANOVA shows that there is a significant difference among the groups, but does not specify which specific groups are different from each other.





""
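As an illustration of a post-hoc procedure, the sketch below applies Bonferroni-corrected pairwise t-tests with SciPy. The three teaching-method groups are simulated, hypothetical data, and the group names are placeholders; this is one possible way to carry out the correction, not the only one.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical exam scores for three teaching methods (simulated data)
rng = np.random.default_rng(1)
scores = {
    "method_1": rng.normal(70, 8, 25),
    "method_2": rng.normal(75, 8, 25),
    "method_3": rng.normal(82, 8, 25),
}

# All pairwise comparisons between the groups
pairs = list(combinations(scores, 2))

# Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
adjusted = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_ind(scores[a], scores[b])
    adjusted[(a, b)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    print(pair, "Bonferroni-adjusted p-value:", round(p_adj, 4))
```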

In [None]:
""
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

""

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Generate random weight loss data for each diet
# (simulated data with 50 observations per diet, i.e. 150 in total)
np.random.seed(123)
diet_a = np.random.normal(5, 1, 50)
diet_b = np.random.normal(4.5, 1, 50)
diet_c = np.random.normal(6, 1, 50)

# Conduct one-way ANOVA
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)

# Print results
print("F-statistic:", f_stat)
print("p-value:", p_value)


F-statistic: 27.089667116788075
p-value: 9.648208034964242e-11
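Interpretation of the output above: the p-value (about 9.6e-11) is far below 0.05, so the null hypothesis of equal mean weight loss across the three diets is rejected. The test does not say which diets differ; a post-hoc test would be needed for that.

The question describes 50 participants in total, whereas the cell above simulates 50 per diet. A variant under the 50-in-total reading might look like the sketch below. The group sizes (17, 17, 16), means, and standard deviations are assumptions, and the data are simulated, so the numbers it prints are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway

# Assumption: 50 participants split as evenly as possible across three diets
rng = np.random.default_rng(123)
diet_a = rng.normal(5, 1, 17)
diet_b = rng.normal(4.5, 1, 17)
diet_c = rng.normal(6, 1, 16)

# Conduct one-way ANOVA
f_stat, p_value = f_oneway(diet_a, diet_b, diet_c)

# Report the decision at the conventional 0.05 level
alpha = 0.05
print("F-statistic:", f_stat)
print("p-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis: at least one diet's mean weight loss differs.")
else:
    print("Fail to reject the null hypothesis: no evidence of a difference in means.")
```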


In [None]:
""

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

""

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe with the task completion time data.
# Experience alternates in blocks of 5 within each program so that every
# program x experience combination has observations; assigning the first 15
# rows to 'Novice' and the last 15 to 'Experienced' would leave two
# combinations empty and make the two-way model inestimable (NaN in the table).
data = {'time': [12, 11, 10, 14, 13, 12, 18, 17, 16, 20, 19, 18, 24, 23, 22,
                 15, 14, 13, 16, 15, 14, 20, 19, 18, 22, 21, 20, 26, 25, 24],
        'program': ['A']*10 + ['B']*10 + ['C']*10,
        'experience': (['Novice']*5 + ['Experienced']*5)*3}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model; C() marks both factors as categorical
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table (F-statistics and p-values for both main effects
# and the interaction)
print(anova_table)
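Before fitting a two-way model, it can help to confirm that every combination of the two factors has observations, because empty cells lead to NaN entries in the ANOVA table. The sketch below uses pandas to compare two hypothetical labelings of 30 employees: one where experience is assigned in two blocks of 15, and one where it alternates in blocks of 5 within each program.

```python
import pandas as pd

program = ["A"] * 10 + ["B"] * 10 + ["C"] * 10

# Labeling 1: first 15 rows novice, last 15 experienced
blocked = pd.DataFrame({
    "program": program,
    "experience": ["Novice"] * 15 + ["Experienced"] * 15,
})

# Labeling 2: 5 novice then 5 experienced within each program
crossed = pd.DataFrame({
    "program": program,
    "experience": (["Novice"] * 5 + ["Experienced"] * 5) * 3,
})

# Count observations in every program x experience cell
blocked_counts = pd.crosstab(blocked["program"], blocked["experience"])
crossed_counts = pd.crosstab(crossed["program"], crossed["experience"])

print(blocked_counts)
print(crossed_counts)

# Number of empty cells in each design
empty_blocked = int((blocked_counts == 0).sum().sum())
empty_crossed = int((crossed_counts == 0).sum().sum())
print("Empty cells (blocked):", empty_blocked)  # 2
print("Empty cells (crossed):", empty_crossed)  # 0
```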


In [None]:
""
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.
""

In [None]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# load data
data = pd.read_csv("test_scores.csv")

# conduct two-sample t-test
control_scores = data[data["group"] == "control"]["score"]
experimental_scores = data[data["group"] == "experimental"]["score"]
t_stat, p_val = stats.ttest_ind(control_scores, experimental_scores)

print("Two-sample t-test results:")
print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_val:.4f}")

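# conduct post-hoc test only if the t-test is significant
# (with only two groups, Tukey's HSD repeats the same single comparison)
if p_val < 0.05:
    tukey_results = pairwise_tukeyhsd(data["score"], data["group"])

    print("\nPost-hoc test results:")
    print(tukey_results.summary())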
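Since the file test_scores.csv is not included here, the following sketch runs the same kind of comparison on simulated, hypothetical scores (50 students per group, with assumed means and standard deviations). It uses Welch's t-test, which does not assume equal variances, and adds Cohen's d as an effect size; both are additions for illustration rather than part of the original analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores: 50 students per group (simulated data)
rng = np.random.default_rng(42)
control = rng.normal(70, 10, 50)
experimental = rng.normal(75, 10, 50)

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_val = stats.ttest_ind(experimental, control, equal_var=False)

# Cohen's d with a pooled standard deviation (valid here because group sizes are equal)
pooled_sd = np.sqrt((control.var(ddof=1) + experimental.var(ddof=1)) / 2)
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_val:.4f}")
print(f"Cohen's d: {cohens_d:.2f}")
```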


In [None]:
""
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

""

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a dataframe with sales data
sales_data = pd.DataFrame({
    'Store': ['A']*30 + ['B']*30 + ['C']*30,
    'Day': list(range(1, 31))*3,
    'Sales': [10, 11, 9, 12, 10, 11, 13, 12, 14, 11, 10, 9, 10, 12, 11, 13, 11, 10, 12, 13, 12, 14, 13, 11, 10, 12, 11, 10, 9, 8] + 
             [8, 9, 7, 10, 8, 9, 11, 10, 12, 9, 8, 7, 8, 10, 9, 11, 9, 8, 10, 11, 10, 12, 11, 9, 8, 10, 9, 8, 7, 6] +
             [6, 7, 5, 8, 6, 7, 9, 8, 10, 7, 6, 5, 6, 8, 7, 9, 7, 6, 8, 9, 8, 10, 9, 7, 6, 8, 7, 6, 5, 4]
})

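# conduct repeated measures ANOVA, treating each day as the repeated "subject"
# and store as the within-subject factor
from statsmodels.stats.anova import AnovaRM
rm_results = AnovaRM(sales_data, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_results)

# Note: in this toy data each store's sales are a constant offset from Store A's,
# so the Store-by-Day error variation is zero and the F-statistic is expected to
# be degenerate (effectively a division by zero); real sales data would contain noise.

# conduct post-hoc tests (Tukey HSD)
# Caveat: Tukey's HSD treats the stores as independent groups and ignores the
# pairing of observations by day
from statsmodels.stats.multicomp import MultiComparison
mc = MultiComparison(sales_data['Sales'], sales_data['Store'])
tukey_result = mc.tukeyhsd()
print(tukey_result)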
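A post-hoc approach that respects the repeated measures structure is to run paired t-tests between stores, matched by day, and correct for multiple comparisons. The sketch below does this with SciPy and a Bonferroni correction. The sales figures are simulated, hypothetical data with a shared day-to-day component plus noise, and are not the values used in the cell above.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical daily sales for three stores over the same 30 days (simulated)
rng = np.random.default_rng(7)
day_effect = rng.normal(0, 2, 30)  # variation shared by all stores on a given day
sales = {
    "A": 12 + day_effect + rng.normal(0, 1, 30),
    "B": 10 + day_effect + rng.normal(0, 1, 30),
    "C": 8 + day_effect + rng.normal(0, 1, 30),
}

# Paired t-test for every pair of stores, matched by day
pairs = list(combinations(sales, 2))
results = {}
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[a], sales[b])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    results[(a, b)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in results.items():
    print(pair, "Bonferroni-adjusted p-value:", p_adj)
```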
