In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Analysis of Variance (ANOVA) is a statistical method used to compare means between two or more groups. It is an extension of the t-test to more than two groups. However, ANOVA has certain assumptions that must be met for its results to be valid. Here are the key assumptions required for ANOVA and examples of violations that could impact the validity of the results:

### Assumptions of ANOVA:

1. **Normality**: The dependent variable should be approximately normally distributed within each group. This assumption is particularly important when sample sizes are small (e.g., <30 per group).
   
   - **Violation Example**: If the data within a group are heavily skewed or contain extreme outliers (for instance, income data with a few very high earners), the p-values from the F-test may be unreliable, especially when samples are small.

2. **Homogeneity of Variance (Homoscedasticity)**: The variance of the dependent variable should be equal across all groups. In other words, the spread of scores in each group should be similar.
   
   - **Violation Example**: If the variance is unequal across groups, the F-test can have an inflated Type I error rate (i.e., falsely detecting differences) or reduced power (i.e., failing to detect differences when they exist), particularly when group sizes are also unequal. For example, if one small group has a much larger variance than the others, the pooled within-group variance understates that group's spread and the test rejects the null hypothesis too often.

3. **Independence of Observations**: Observations within each group should be independent of each other. This assumption ensures that the observations are not influenced by each other.
   
   - **Violation Example**: If there is dependence among observations (e.g., repeated measures within subjects, or clustering within groups), it can violate the independence assumption. Violations of independence can lead to biased estimates of group differences and inflated Type I error rates.

### Examples of Violations:

- **Outliers**: Extreme values in the data can skew the distribution and violate the normality assumption.
- **Heteroscedasticity**: Unequal variances across groups can violate the homogeneity of variance assumption.
- **Non-Independence**: In longitudinal or repeated measures designs, where measurements are taken from the same individuals over time, observations may not be independent, violating the assumption of independence.
- **Non-Normality**: If the data are not normally distributed within each group, it can violate the normality assumption.

It's important to assess these assumptions before conducting ANOVA and consider alternative analyses if the assumptions are violated. Additionally, techniques like transformations or non-parametric tests may be used to address violations of these assumptions.
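
As a minimal sketch of how the normality and equal-variance checks might look in Python, the snippet below applies the Shapiro-Wilk test to each group and Levene's test across groups. The data are randomly generated for illustration (an assumption, not from any real study), and independence cannot be tested this way; it has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three groups drawn from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=25) for m in (10, 12, 14)]

# Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Levene's test across groups (H0: all groups have equal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

Small p-values in either test suggest the corresponding assumption may not hold, in which case a transformation, Welch's ANOVA, or a non-parametric test such as Kruskal-Wallis could be considered.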

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) can be categorized into three main types based on the design of the study and the number of factors involved:

1. **One-Way ANOVA**: This type of ANOVA is used when there is only one categorical independent variable (factor) with two or more levels (groups). It tests for differences in means among the groups.

   - **Example**: A researcher wants to compare the mean exam scores of students who were taught using three different teaching methods (e.g., lecture, group discussion, and online tutorials).

2. **Two-Way ANOVA (Factorial ANOVA)**: This type of ANOVA is used when there are two categorical independent variables (factors) and their interaction effect on the dependent variable needs to be assessed. It allows examining the main effects of each factor as well as their interaction effect.

   - **Example**: A researcher wants to investigate the effects of both gender (male vs. female) and treatment type (drug A vs. drug B) on blood pressure. Two-way ANOVA can assess whether there are differences in blood pressure due to gender, treatment type, and whether these effects interact with each other.

3. **Repeated Measures ANOVA**: This type of ANOVA is used when measurements are taken on the same subjects at multiple time points or under different conditions. It is also known as within-subjects ANOVA or ANOVA for correlated samples.

   - **Example**: A researcher wants to examine the effect of time (pre-test, post-test) on the anxiety levels of participants after receiving a therapy intervention. Repeated measures ANOVA assesses whether anxiety levels change significantly over time within the same participants. (Comparing those changes across several different therapy interventions would add a between-subjects factor, which makes the design a mixed ANOVA.)

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance in ANOVA refers to the division of total variance in the data into different components that can be attributed to various sources or factors. Understanding this concept is essential because it provides insights into the sources of variability in the data and helps in assessing the significance of these sources in explaining the variation in the dependent variable.

In a one-way ANOVA, variability is measured with sums of squares, and the total is decomposed into two components so that SST = SSB + SSW. Three quantities are involved:

1. **Between-Groups Sum of Squares (SSB)**: This component represents the variability in the dependent variable that can be attributed to differences between the group means. It is the basis for assessing whether the group means differ significantly.

2. **Within-Groups Sum of Squares (SSW)**: Also known as the error sum of squares, this component represents the variability in the dependent variable that cannot be explained by differences between group means. It reflects the random variation or noise within each group.

3. **Total Sum of Squares (SST)**: This is the overall variability in the dependent variable across all observations. It is the sum of the between-groups and within-groups sums of squares.

The partitioning of variance is important for several reasons:

- **Understanding Group Differences**: By decomposing the total variance into between-groups and within-groups components, ANOVA helps identify whether there are significant differences in the means of the groups being compared.

- **Assessing Model Fit**: Partitioning of variance allows researchers to assess how well the model (ANOVA) fits the data. A larger proportion of variance explained by between-groups variance indicates a better fit of the model.

- **Interpreting Results**: Understanding the sources of variance helps in interpreting the results of ANOVA. It provides insights into the relative contributions of different factors or treatments to the variability in the dependent variable.

- **Identifying Sources of Error**: By quantifying within-groups variance, ANOVA helps identify sources of error or variability that are not accounted for by the factors under investigation.
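
The partition can be checked numerically. The sketch below uses small made-up groups (hypothetical values chosen for easy arithmetic) to confirm that the total sum of squares equals the between-groups plus within-groups sums of squares, and reports eta squared, the proportion of total variability attributable to group differences.

```python
import numpy as np

# Hypothetical data: three small groups
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([10.0, 11.0, 12.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-groups: group size times squared distance of group mean from grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-groups: squared distances of observations from their own group mean
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Total: squared distances of all observations from the grand mean
sst = np.sum((all_values - grand_mean) ** 2)

eta_squared = ssb / sst  # share of total variability explained by group membership

print("SSB:", ssb, "SSW:", ssw, "SST:", sst)
print("Eta squared:", eta_squared)
```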

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

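In a one-way ANOVA, the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) can be calculated using Python by following the steps below. Note that this question uses SSE for the *explained* (between-groups) sum of squares and SSR for the *residual* (within-groups) sum of squares, so they correspond to SSB and SSW in Q3; some textbooks use these two abbreviations the other way around.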

1. Calculate the total sum of squares (SST):
   \[ SST = \sum (y_i - \bar{y})^2 \]
   where \( y_i \) are the individual observations, and \( \bar{y} \) is the overall mean of the data.

2. Calculate the explained sum of squares (SSE):
   \[ SSE = \sum n_i (\bar{y}_i - \bar{y})^2 \]
   where \( n_i \) is the number of observations in the \( i^{th} \) group, \( \bar{y}_i \) is the mean of the \( i^{th} \) group, and \( \bar{y} \) is the overall mean of the data.

3. Calculate the residual sum of squares (SSR):
   \[ SSR = SST - SSE \]

Let's implement these calculations in Python:

import numpy as np

# Sample data (example)
group_1 = np.array([10, 12, 14, 15, 16])
group_2 = np.array([20, 22, 24, 25, 26])
group_3 = np.array([30, 32, 34, 35, 36])

# Calculate overall mean
overall_mean = np.mean(np.concatenate([group_1, group_2, group_3]))

# Calculate total sum of squares (SST)
squared_deviations_total = np.sum((np.concatenate([group_1, group_2, group_3]) - overall_mean) ** 2)
SST = squared_deviations_total

# Calculate group means
group_means = np.array([np.mean(group_1), np.mean(group_2), np.mean(group_3)])

# Calculate explained sum of squares (SSE)
squared_deviations_explained = np.sum([len(group_1) * (group_means[0] - overall_mean) ** 2,
                                       len(group_2) * (group_means[1] - overall_mean) ** 2,
                                       len(group_3) * (group_means[2] - overall_mean) ** 2])
SSE = squared_deviations_explained

# Calculate residual sum of squares (SSR)
SSR = SST - SSE

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)

This code calculates the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) for a one-way ANOVA using Python. It uses sample data for three groups (`group_1`, `group_2`, `group_3`), calculates the overall mean, group means, and then performs the necessary calculations. Finally, it prints the results.
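
As a sanity check, the sums of squares can be turned into an F-statistic and compared with SciPy's result. The sketch below reuses the same three sample groups; the variable names (`sse`, `ssr`, `f_manual`) are illustrative choices, not part of any library.

```python
import numpy as np
from scipy import stats

# Same sample groups as in the example above
groups = [np.array([10, 12, 14, 15, 16]),
          np.array([20, 22, 24, 25, 26]),
          np.array([30, 32, 34, 35, 36])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

sst = np.sum((all_values - grand_mean) ** 2)                      # total
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # explained (between)
ssr = sst - sse                                                   # residual (within)

# Mean squares divide each sum of squares by its degrees of freedom
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (sse / df_between) / (ssr / df_within)

# SciPy's one-way ANOVA should give the same F-statistic
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Manual F:", f_manual)
print("SciPy F:", f_scipy, "p-value:", p_scipy)
```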

In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In a two-way ANOVA, we can calculate the main effects and interaction effects by decomposing the total sum of squares (SST) into components that represent the variation explained by each factor and their interaction. Let's break down the calculation process:

1. **Total Sum of Squares (SST)**: This is the total variation in the dependent variable.

2. **Main Effects**:
   - **Main Effect of Factor A**: This represents the variation explained by differences between the levels of Factor A, regardless of the levels of Factor B.
   - **Main Effect of Factor B**: This represents the variation explained by differences between the levels of Factor B, regardless of the levels of Factor A.
   
3. **Interaction Effect**: This represents the additional variation explained by the interaction between Factor A and Factor B, beyond what would be expected from the main effects alone.

The sums of squares for the main effects and the interaction can be calculated as follows:

- **Main Effect of Factor A (SSA)**: For each level of Factor A, multiply the number of observations at that level by the squared difference between the level mean and the overall mean, then sum over the levels.

- **Main Effect of Factor B (SSB)**: The same calculation, using the levels of Factor B.

- **Interaction Effect (SSAB)**: The variation between the cell means (one cell per combination of A and B) that remains after subtracting SSA and SSB. This simple subtraction holds for balanced designs.

Each sum of squares is then divided by its degrees of freedom (df) to give a mean square, and the F-statistic for each effect is its mean square divided by the error mean square.

Let's demonstrate how to calculate these effects using Python with sample data:

```python
import pandas as pd

# Sample data (example): 3 x 3 design with one observation per cell
data = {
    'Factor_A': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A3', 'A3', 'A3'],
    'Factor_B': ['B1', 'B2', 'B3', 'B1', 'B2', 'B3', 'B1', 'B2', 'B3'],
    'Dependent_Variable': [10, 12, 14, 20, 22, 24, 30, 32, 34]
}

df = pd.DataFrame(data)
y = df['Dependent_Variable']
grand_mean = y.mean()

# Total sum of squares
SST = ((y - grand_mean) ** 2).sum()

def between_ss(keys):
    """Sum of n * (group mean - grand mean)^2 over the groups defined by keys."""
    g = df.groupby(keys)['Dependent_Variable'].agg(['mean', 'count'])
    return (g['count'] * (g['mean'] - grand_mean) ** 2).sum()

# Main effects
SSA = between_ss('Factor_A')
SSB = between_ss('Factor_B')

# Interaction: variation between cell means not explained by the main effects
# (this subtraction is valid for balanced designs)
SS_cells = between_ss(['Factor_A', 'Factor_B'])
SSAB = SS_cells - SSA - SSB

# Residual (within-cell) sum of squares
SS_error = SST - SS_cells

# Degrees of freedom
n = len(df)
a = df['Factor_A'].nunique()
b = df['Factor_B'].nunique()
df_A = a - 1
df_B = b - 1
df_AB = df_A * df_B
df_error = n - a * b  # 0 here: one observation per cell leaves no residual df

print("Main Effect of Factor A (SSA):", SSA)
print("Main Effect of Factor B (SSB):", SSB)
print("Interaction Effect (SSAB):", SSAB)
print("Residual (SS_error):", SS_error, "with df =", df_error)
```

In this Python code:
- We compute the total sum of squares (SST) around the overall mean.
- We compute the main effects of Factor A and Factor B by weighting the squared deviation of each level mean from the overall mean by the number of observations at that level.
- We compute the interaction effect as the variation between cell means minus the two main-effect sums of squares.
- We calculate the degrees of freedom for each effect. Because this sample data has only one observation per cell, there are no residual degrees of freedom, so F-statistics cannot be formed from it; testing the interaction requires at least two observations per cell. In practice, `statsmodels` (`ols` with `anova_lm`) produces the full table, including F-statistics and p-values.

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In a one-way ANOVA, the F-statistic is used to test whether there are significant differences between the means of the groups. The p-value associated with the F-statistic is the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis (i.e., no differences between group means) were true.

Given an F-statistic of 5.23 and a p-value of 0.02:

1. **F-Statistic**: The F-statistic compares the variance between the group means to the variance within the groups. A larger F-statistic indicates a greater difference between the group means relative to the variability within each group.

2. **p-value**: The p-value represents the probability of obtaining the observed F-statistic (or more extreme) if the null hypothesis were true. A low p-value (typically below the chosen significance level, e.g., 0.05) suggests that the observed differences between the group means are unlikely to be due to random chance alone.

Based on these results:

- Since the p-value (0.02) is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis.
- Therefore, we can conclude that there are statistically significant differences between the group means.

Interpretation:
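- The F-statistic of 5.23 means that the between-groups mean square is about 5.2 times the within-groups mean square; together with p = 0.02, this indicates that at least one group mean differs from the others.
- The p-value of 0.02 means that, if all group means were truly equal, an F-statistic this large or larger would be expected only about 2% of the time.
- The ANOVA does not show which groups differ. A post-hoc test (e.g., Tukey's HSD) is needed to identify the specific pairs, and an effect size such as eta squared indicates how large the differences are.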
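
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not give. The sketch below therefore assumes hypothetical values (3 groups and 30 observations, so df = 2 and 27) purely to illustrate the calculation; it is not meant to reproduce the reported p-value of 0.02.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> df_between = 2; 30 observations -> df_within = 27
df_between, df_within = 2, 27

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the F-statistic that would be needed for significance at alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", f_critical)
print("Reject H0:", f_stat > f_critical)
```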

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA requires careful consideration, as missing data can introduce bias and reduce the power of the analysis. Here are some common methods for handling missing data in repeated measures ANOVA, along with their potential consequences:

1. **Complete Case Analysis (Listwise Deletion)**:
   - This approach involves excluding any participant with missing data on any variable included in the analysis.
   - **Consequences**: 
     - Reduces sample size, potentially leading to loss of statistical power.
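     - May introduce bias unless the data are missing completely at random (MCAR). If the data are only missing at random (MAR) or missing not at random (MNAR), the complete cases are no longer representative of the full sample, which can bias estimates of treatment effects.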

2. **Mean Imputation**:
   - Missing values are replaced with the mean of the observed values for that variable.
   - **Consequences**:
     - Can artificially reduce variability and bias the estimated treatment effects towards the mean.
     - Underestimates standard errors, leading to inflated Type I error rates.

3. **Last Observation Carried Forward (LOCF)**:
   - Missing values are replaced with the value from the last observed time point for each participant.
   - **Consequences**:
     - Assumes that the missing values remain constant over time, which may not be valid.
     - Can underestimate variability and distort treatment effects, especially if the missing values are not missing completely at random.

4. **Linear Interpolation**:
   - Missing values are replaced with values interpolated based on neighboring time points.
   - **Consequences**:
     - Assumes a linear relationship between time points, which may not be appropriate for all data.
     - Can introduce bias if the true trajectory between time points is not linear, and cannot fill gaps at the first or last time point without extrapolating.

5. **Multiple Imputation**:
   - Missing values are imputed multiple times based on observed data and uncertainty in the imputation process.
   - **Consequences**:
     - Requires assumptions about the missing data mechanism, such as missing at random (MAR).
     - Provides unbiased estimates if the imputation model is correctly specified, but can be computationally intensive.
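
The simpler strategies above can be sketched with pandas on wide-format data (one row per participant, one column per time point). The table below is hypothetical, and multiple imputation is left out because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are participants, columns are time points
wide = pd.DataFrame({
    't1': [10.0, 12.0, 11.0, 13.0],
    't2': [11.0, np.nan, 12.0, 14.0],
    't3': [12.0, 14.0, np.nan, 15.0],
    't4': [13.0, 15.0, 16.0, 16.0],
}, index=['p1', 'p2', 'p3', 'p4'])

# 1. Complete case analysis: drop any participant with a missing value
complete_cases = wide.dropna()

# 2. Mean imputation: replace each gap with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

# 3. LOCF: carry each participant's previous observation forward along the row
locf = wide.ffill(axis=1)

# 4. Linear interpolation between each participant's neighboring time points
interpolated = wide.interpolate(axis=1)

print("Participants kept by listwise deletion:", list(complete_cases.index))
print("LOCF value for p2 at t2:", locf.loc['p2', 't2'])
print("Interpolated value for p2 at t2:", interpolated.loc['p2', 't2'])
```

Comparing the filled-in values across methods on the same gap makes it easier to see how much the choice of method can shift the data that the ANOVA then analyzes.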

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are conducted after finding a significant result in an analysis of variance (ANOVA) to determine which specific group differences are significant. There are several common post-hoc tests used in conjunction with ANOVA, each with its own strengths and applicability:

1. **Tukey's Honestly Significant Difference (HSD)**:
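   - Tukey's HSD test is widely used when all pairwise comparisons between group means are of interest. It assumes equal variances across groups; with unequal sample sizes, the Tukey-Kramer variant is used.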
   - It controls for the familywise error rate, maintaining the overall Type I error rate at the desired level.
   - Example: Suppose you conducted a one-way ANOVA comparing the exam scores of students in three different teaching methods. After finding a significant overall difference, you use Tukey's HSD to identify which specific pairs of teaching methods have significantly different mean scores.

2. **Bonferroni Correction**:
   - The Bonferroni correction is a conservative method that adjusts the significance level for each pairwise comparison to maintain the overall familywise error rate.
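   - It can be applied to any set of comparisons, regardless of sample sizes, because it only adjusts the significance level. It does not by itself correct for unequal variances; a test such as Games-Howell is more suitable in that case.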
   - Example: In a clinical trial comparing the effectiveness of three different treatments, you find a significant overall difference in treatment outcomes. To determine which treatments differ significantly from each other, you apply Bonferroni correction to adjust the significance level for pairwise comparisons.

3. **Sidak Correction**:
   - Similar to the Bonferroni correction, the Sidak correction adjusts the significance level for multiple comparisons to control for the familywise error rate.
   - It is slightly less conservative than the Bonferroni correction and is exact when the comparisons are independent, so it may be preferable when conducting a large number of comparisons.
   - Example: In a marketing study comparing the effectiveness of multiple advertising strategies, you conduct several pairwise comparisons to identify significant differences in brand awareness. You apply Sidak correction to adjust the significance level for these comparisons.

4. **Dunnett's Test**:
   - Dunnett's test is used when comparing multiple treatment groups to a control group.
   - It is suitable for situations where one group serves as a reference or control, and the objective is to determine whether other groups differ significantly from the control group.
   - Example: In a pharmaceutical study comparing the efficacy of several drug treatments to a placebo, you use Dunnett's test to identify which drug treatments have significantly different effects compared to the placebo.

These post-hoc tests help avoid Type I errors that can occur when conducting multiple pairwise comparisons following a significant ANOVA result. The choice of post-hoc test depends on factors such as the research design, sample sizes, and the number of comparisons being made. It's essential to select a test that is appropriate for the specific analysis and ensures valid and reliable interpretation of the results.
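
To make the Bonferroni and Sidak adjustments concrete, the sketch below computes the per-comparison significance level for all pairwise comparisons among three groups. The group count and the familywise alpha of 0.05 are illustrative assumptions.

```python
from math import comb

alpha = 0.05   # desired familywise error rate
k = 3          # number of groups (illustrative)
m = comb(k, 2) # number of pairwise comparisons among k groups

# Bonferroni: divide alpha evenly across the comparisons
alpha_bonferroni = alpha / m

# Sidak: per-comparison level that gives exactly alpha overall
# when the comparisons are independent
alpha_sidak = 1 - (1 - alpha) ** (1 / m)

print("Number of comparisons:", m)
print("Bonferroni per-comparison alpha:", alpha_bonferroni)
print("Sidak per-comparison alpha:", alpha_sidak)
```

The Sidak threshold is always slightly larger than the Bonferroni one, which is why it is described as less conservative.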

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

To conduct a one-way ANOVA in Python to compare the mean weight loss of three diets (A, B, and C), you can use the `scipy.stats` module. Below is an example code to perform the analysis and interpret the results:

import numpy as np
from scipy import stats

# Illustrative sample data: weight loss for each diet
# (50 values per diet here; the study itself describes 50 participants in total)
diet_A = np.array([2.1, 1.9, 2.0, 2.2, 2.5, 2.3, 1.8, 2.4, 2.1, 2.2,
                   2.3, 2.0, 2.1, 2.4, 2.2, 2.3, 2.5, 2.0, 2.1, 2.2,
                   2.3, 2.1, 2.4, 2.0, 2.3, 2.2, 2.1, 2.5, 2.0, 2.3,
                   2.4, 2.1, 2.3, 2.2, 2.0, 2.1, 2.4, 2.5, 2.2, 2.3,
                   2.0, 2.1, 2.4, 2.3, 2.2, 2.1, 2.0, 2.5, 2.2, 2.3])

diet_B = np.array([1.8, 2.0, 1.9, 1.7, 2.1, 2.0, 1.8, 1.9, 1.7, 2.0,
                   2.1, 1.8, 2.0, 1.7, 1.9, 2.1, 2.0, 1.8, 1.9, 1.7,
                   2.0, 2.1, 1.8, 1.9, 2.0, 1.7, 1.9, 2.1, 2.0, 1.8,
                   1.7, 2.0, 2.1, 1.8, 1.9, 1.7, 2.0, 2.1, 1.8, 1.9,
                   2.0, 1.7, 1.9, 2.1, 2.0, 1.8, 1.7, 2.0, 2.1, 1.9])

diet_C = np.array([1.5, 1.7, 1.6, 1.8, 1.9, 1.7, 1.6, 1.8, 1.7, 1.9,
                   1.6, 1.8, 1.7, 1.9, 1.6, 1.8, 1.7, 1.9, 1.6, 1.8,
                   1.7, 1.9, 1.6, 1.8, 1.7, 1.9, 1.6, 1.8, 1.7, 1.9,
                   1.6, 1.8, 1.7, 1.9, 1.6, 1.8, 1.7, 1.9, 1.6, 1.8,
                   1.7, 1.9, 1.6, 1.8, 1.7, 1.9, 1.6, 1.8, 1.7, 1.9])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Interpret results
alpha = 0.05
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.")

In this code:
- We first define the weight loss data for each diet (A, B, and C).
- Then, we perform a one-way ANOVA using `stats.f_oneway()` function from SciPy.
- Finally, we interpret the results by comparing the obtained p-value with the chosen significance level (alpha). If the p-value is less than alpha, we reject the null hypothesis and conclude that there is a significant difference between the mean weight loss of the three diets. Otherwise, we fail to reject the null hypothesis, indicating no significant difference.
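
If the ANOVA is significant, a post-hoc test can show which diets differ. The sketch below uses `scipy.stats.tukey_hsd`, which is assumed to be available (it was added in SciPy 1.8), on the first ten values of each diet from the sample data above; it is one possible follow-up, not the only one.

```python
import numpy as np
from scipy import stats

# First ten sample values of each diet from the example above
diet_A = np.array([2.1, 1.9, 2.0, 2.2, 2.5, 2.3, 1.8, 2.4, 2.1, 2.2])
diet_B = np.array([1.8, 2.0, 1.9, 1.7, 2.1, 2.0, 1.8, 1.9, 1.7, 2.0])
diet_C = np.array([1.5, 1.7, 1.6, 1.8, 1.9, 1.7, 1.6, 1.8, 1.7, 1.9])

# Tukey's HSD compares every pair of groups while controlling the familywise error rate
result = stats.tukey_hsd(diet_A, diet_B, diet_C)

# result.pvalue[i, j] holds the adjusted p-value for group i versus group j
labels = ['A', 'B', 'C']
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"Diet {labels[i]} vs Diet {labels[j]}: p = {result.pvalue[i, j]:.4f}")
```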

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

To conduct a two-way ANOVA in Python to analyze the effects of software programs and employee experience level on task completion time, you can use the `statsmodels` library, which provides a convenient way to perform ANOVA. Below is an example code to perform the analysis and interpret the results:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (example): 30 employees, 5 per program x experience combination.
# All three columns must have the same length for the DataFrame to be built.
data = {
    'Software_Program': ['A', 'B', 'C'] * 10,
    'Employee_Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Task_Completion_Time': [10, 12, 11, 13, 15, 14, 9, 11, 10, 12,
                              13, 14, 15, 16, 11, 12, 10, 13, 14, 15,
                              16, 17, 12, 13, 11, 14, 15, 16, 17, 18]
}

df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('Task_Completion_Time ~ C(Software_Program) + C(Employee_Experience) + C(Software_Program):C(Employee_Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Interpret results
print(anova_table)

In this code:
- We first create a DataFrame `df` containing the sample data with columns for software program, employee experience level, and task completion time.
- Then, we use the `ols()` function from `statsmodels.formula.api` to specify the model formula for the two-way ANOVA. The formula includes both main effects and the interaction effect between software programs and employee experience levels.
- We fit the model using the `fit()` method and perform the ANOVA using the `anova_lm()` function from `statsmodels.api`.
- Finally, we print the ANOVA table, which contains the F-statistics and p-values for the main effects and interaction effect.

Interpreting the results of the ANOVA table will allow us to determine if there are any significant main effects (software programs and employee experience) or interaction effects between them on task completion time. The p-values associated with each factor will indicate whether they have a significant effect on task completion time, and the interaction term will indicate whether the effects of software programs depend on employee experience level, or vice versa.
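
A table of cell means often makes an interaction easier to read than the ANOVA table alone. The sketch below builds one with pandas from a small hypothetical dataset (the times are invented for illustration) and compares the novice-versus-experienced gap across programs.

```python
import pandas as pd

# Hypothetical data: two employees per program x experience combination
df = pd.DataFrame({
    'Software_Program': ['A', 'A', 'B', 'B', 'C', 'C'] * 2,
    'Employee_Experience': ['Novice'] * 6 + ['Experienced'] * 6,
    'Task_Completion_Time': [20, 22, 18, 19, 25, 27,
                             12, 13, 11, 12, 14, 15],
})

# Mean completion time for every program x experience cell
cell_means = df.pivot_table(index='Software_Program',
                            columns='Employee_Experience',
                            values='Task_Completion_Time',
                            aggfunc='mean')

# If there were no interaction, this gap would be about the same for every program
gap = cell_means['Novice'] - cell_means['Experienced']

print(cell_means)
print("\nNovice minus Experienced, by program:")
print(gap)
```

Gaps that differ noticeably between programs hint at an interaction, which the interaction row of the ANOVA table then tests formally.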

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

To conduct a two-sample t-test in Python to determine if there are significant differences in test scores between the control group (traditional teaching method) and the experimental group (new teaching method), and then follow up with a post-hoc test if the results are significant, you can use the `scipy.stats` module. Below is an example code to perform the analysis:

import numpy as np
from scipy import stats

# Test scores for the control group (traditional teaching method)
control_group = np.array([85, 78, 92, 88, 80, 79, 90, 85, 82, 87,
                          83, 81, 86, 88, 84, 87, 82, 89, 85, 81,
                          87, 83, 90, 86, 88, 84, 89, 82, 80, 85])

# Test scores for the experimental group (new teaching method)
experimental_group = np.array([88, 82, 94, 90, 85, 84, 92, 87, 84, 89,
                               86, 83, 88, 90, 86, 89, 83, 91, 88, 84,
                               90, 85, 93, 88, 91, 86, 90, 83, 82, 87])

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Interpret results
alpha = 0.05
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in test scores between the control and experimental groups.")
    # With only two groups, the t-test already compares the single possible pair,
    # so no post-hoc test is needed; the group means show the direction of the difference.
    print("\nMean score (control):", np.mean(control_group))
    print("Mean score (experimental):", np.mean(experimental_group))
else:
    print("Fail to reject the null hypothesis. There is no significant difference in test scores between the control and experimental groups.")

In this code:
- We first define the test scores for the control group and the experimental group.
- Then, we perform a two-sample t-test using `stats.ttest_ind()` function from SciPy.
- We interpret the results by comparing the obtained p-value with the chosen significance level (alpha). If the p-value is less than alpha, we reject the null hypothesis and conclude that there is a significant difference in test scores between the two groups. Because there are only two groups, a significant t-test already tells us that these two groups differ, and comparing the group means shows which one scored higher; a post-hoc test such as Tukey's HSD only becomes necessary when three or more groups are compared. Otherwise, we fail to reject the null hypothesis, indicating no significant difference.
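
Two common companions to the t-test are Welch's version, which does not assume equal variances, and an effect size such as Cohen's d. The sketch below shows both on the first ten scores of each group from the sample data above; the pooled-standard-deviation form of Cohen's d is one of several conventions.

```python
import numpy as np
from scipy import stats

# First ten sample scores of each group from the example above
control = np.array([85, 78, 92, 88, 80, 79, 90, 85, 82, 87], dtype=float)
experimental = np.array([88, 82, 94, 90, 85, 84, 92, 87, 84, 89], dtype=float)

# Welch's t-test: does not assume the two groups share the same variance
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)

# Cohen's d with a pooled standard deviation (sample variances, ddof=1)
n1, n2 = len(control), len(experimental)
pooled_sd = np.sqrt(((n1 - 1) * control.var(ddof=1) +
                     (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print("Welch t-statistic:", t_welch)
print("Welch p-value:", p_welch)
print("Cohen's d:", cohens_d)
```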

In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.

To conduct a repeated measures ANOVA followed by post-hoc tests in Python, you can use the `pingouin` library, which provides easy-to-use functions for conducting various statistical analyses, including ANOVA and post-hoc tests. First, you need to install the `pingouin` library if you haven't already:

pip install pingouin


Then, you can use the following code to perform the repeated measures ANOVA and post-hoc tests:

import pandas as pd
import pingouin as pg

# Sample data
data = {
    'Day': [1, 2, 3, 4, 5] * 3,  # Example days (repeated measures)
    'Store': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,  # Store labels
    'Sales': [100, 110, 105, 115, 120, 90, 95, 100, 105, 110, 80, 85, 90, 95, 100]  # Sales data
}

df = pd.DataFrame(data)

# Repeated measures ANOVA: each day is observed for all three stores,
# so Store is the within factor and Day identifies the repeated unit ("subject")
aov = pg.rm_anova(dv='Sales', within='Store', subject='Day', data=df)

# Post-hoc tests (pairwise comparisons between stores, Bonferroni-adjusted)
# Note: older pingouin versions name this function pairwise_ttests
posthoc = pg.pairwise_tests(dv='Sales', within='Store', subject='Day', data=df, padjust='bonf')

# Print ANOVA results
print("Repeated Measures ANOVA Results:")
print(aov)

# Print post-hoc test results
print("\nPost-hoc Tests (Pairwise Comparisons):")
print(posthoc)

In this code:
- We first create a DataFrame `df` containing the sales data for each store on each day.
- We then perform a repeated measures ANOVA using `pg.rm_anova()` from the `pingouin` library, with `Store` as the within factor and `Day` as the repeated unit, because the same days are measured for every store.
- Next, we conduct Bonferroni-adjusted post-hoc pairwise comparisons using `pg.pairwise_tests()` to determine which stores differ significantly from each other.
- Finally, we print the results of the ANOVA and post-hoc tests to interpret the findings.

Make sure to replace the example sales data with your actual data before running the code. This code will provide you with the necessary statistical results to determine if there are any significant differences in sales between the three stores and which stores differ significantly from each other.
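
For readers who want to see what the repeated measures ANOVA computes, here is a sketch that needs only NumPy and SciPy. It uses the sample sales figures from the code above, treating days as the repeated units and stores as the conditions, and it does not check the sphericity assumption.

```python
import numpy as np
from scipy import stats

# Rows are days (the repeated units), columns are stores A, B, C
sales = np.array([[100, 90, 80],
                  [110, 95, 85],
                  [105, 100, 90],
                  [115, 105, 95],
                  [120, 110, 100]], dtype=float)

n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Variation between stores (the effect of interest)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)

# Variation between days (removed from the error term in a repeated measures design)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)

# Remaining variation is the error term
ss_total = np.sum((sales - grand_mean) ** 2)
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

f_stat = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_stores, df_error)

print("F-statistic:", f_stat)
print("p-value:", p_value)
```

Removing the day-to-day variation from the error term is what distinguishes this from a one-way ANOVA on the same numbers, and it is why the repeated measures design can be more sensitive to store differences.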