**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**

**Assumptions of ANOVA:**

1. **Independence:** The observations in each group are independent of each other.
  * **Violation:** Observations within a group are correlated.
2. **Normality:** The residuals (equivalently, the observations within each group) are approximately normally distributed.
  * **Violation:** The data within a group is strongly skewed or contains extreme outliers.
3. **Homogeneity of variances:** The population variances of the groups are equal.
  * **Violation:** One group's variance is several times larger than another's.

**Examples of violations that could impact the validity of the results:**

* **Independence:** If the observations are not independent, the within-group variance is usually underestimated, which inflates the F-statistic and makes false positives more likely. For example, if every student in a treatment group comes from the same class, those students share a teacher and a classroom, so their scores are more similar to each other than to the scores of students from other classes.
* **Normality:** ANOVA is fairly robust to mild non-normality when the groups are large and of similar size, but with small samples strong skew or outliers can distort the F-test. For example, a single extreme outlier in a small group can inflate that group's variance and hide a real difference, or shift its mean and create a spurious one.
* **Homogeneity of variances:** If the variances of the groups are not equal, the pooled within-group variance no longer represents every group and the actual Type I error rate drifts away from the nominal level, especially when group sizes differ. For example, if the smallest group has a much larger variance than the others, the F-test rejects the null hypothesis too often; if the largest group has the largest variance, the test becomes overly conservative.

**Impact of violations on the validity of the results:**

Violations of the assumptions affect the validity of the results mainly through the Type I error rate and the power of the F-test. Non-independence is usually the most serious problem, because it inflates the F-statistic and cannot be repaired after the fact within a standard ANOVA. Non-normality and unequal variances matter most when samples are small or unbalanced, where they can make the reported p-values too small or too large and so lead to incorrect conclusions about significance.

It is important to note that violations of the assumptions of ANOVA do not necessarily mean that the results of the ANOVA are invalid. However, violations of the assumptions can increase the likelihood of obtaining biased results or incorrect conclusions. Therefore, it is important to be aware of the assumptions of ANOVA and to take steps to minimize the impact of violations of these assumptions.
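
The normality and equal-variance assumptions can be checked before running the ANOVA. The following is a minimal sketch, assuming hypothetical simulated data in a list called `groups`; it uses the Shapiro-Wilk test on the residuals and Levene's test across the groups. Independence cannot be tested this way, since it depends on how the data was collected.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three groups of 30 observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=5, size=30) for mu in (50, 52, 55)]

# Normality: Shapiro-Wilk test on the residuals
# (each observation minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test (for example below 0.05) suggests the corresponding assumption may be violated and that a transformation, Welch's ANOVA, or a non-parametric test may be worth considering.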


**Q2. What are the three types of ANOVA, and in what situations would each be used?**

**Three types of ANOVA:**

1. **One-way ANOVA:** Compares the means of two or more independent groups.
    * **Situation:** Used when there is one independent variable with two or more levels.
2. **Two-way ANOVA:** Compares group means across two independent variables (factors) at the same time, testing the main effect of each factor and the interaction between them.
    * **Situation:** Used when there are two independent variables, each with two or more levels.
3. **Repeated measures ANOVA:** Compares the means of two or more related groups.
    * **Situation:** Used when there is one independent variable with two or more levels, and the same subjects are measured on each level.

**Examples of situations where each type of ANOVA would be used:**

* **One-way ANOVA:**
    * Comparing the mean test scores of three different schools.
    * Comparing the mean heights of men and women.
* **Two-way ANOVA:**
    * Comparing the mean test scores of students in different schools and different genders.
    * Comparing the mean heights of men and women of different ages.
* **Repeated measures ANOVA:**
    * Comparing the mean reaction times of subjects on two different tasks.
    * Comparing the mean scores of students on a pre-test and a post-test.


**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

The partitioning of variance in ANOVA is the process of breaking down the total variance in the data into its component parts. This is important to understand because it allows us to determine the amount of variance that is due to the independent variable(s) and the amount of variance that is due to error.

In a one-way ANOVA, the total variation in the data (the total sum of squares) can be partitioned into two components:

* **Between-groups variance:** The variation of the group means around the overall mean.
* **Within-groups (error) variance:** The variation of the individual observations around their own group mean.

The between-groups variance is the variation that is explained by the independent variable. The within-groups variance is the variation that is not explained by the independent variable; it is attributed to random error, so "within-groups variance" and "error variance" refer to the same quantity. In designs with more than one factor, the between-groups part is split further into main effects and interaction effects. The F-statistic is the ratio of the between-groups mean square to the within-groups mean square.

Understanding the partitioning of variance in ANOVA is important because it allows us to:

* Determine the significance of the independent variable(s).
* Estimate the effect size of the independent variable(s).
* Make predictions about the dependent variable.

The partitioning of variance in ANOVA is a fundamental concept that is essential for understanding how ANOVA works.
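
The partition can be verified numerically. Below is a small sketch, assuming a hypothetical data set of three groups with three observations each; the variable names are illustrative only. It computes the total, between-groups, and within-groups sums of squares and confirms that the first equals the sum of the other two.

```python
import numpy as np

# Hypothetical data: three groups, three observations each
groups = {
    "A": np.array([1.0, 2.0, 3.0]),
    "B": np.array([2.0, 3.0, 4.0]),
    "C": np.array([6.0, 7.0, 8.0]),
}
all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)
# Between-groups variation: group means against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
# Within-groups variation: observations against their own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / ss_total  # proportion of variation explained

print(ss_total, ss_between, ss_within)  # 48.0 42.0 6.0
print(f_stat)                           # 21.0
```

Here 42 of the 48 units of total variation lie between the groups, so the effect size (eta squared) is 0.875.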


**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**

In [None]:
import pandas as pd
import numpy as np

# Assuming df is your DataFrame and 'group' is your categorical column and 'values' is your data column
df = pd.DataFrame({
    'group': np.repeat(['A', 'B', 'C'], 20),
    'values': np.random.rand(60)
})

# Calculate the overall mean
overall_mean = df['values'].mean()

# Calculate SST
sst = np.sum((df['values'] - overall_mean)**2)

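# Calculate the mean of the data column for each group
group_means = df.groupby('group')['values'].mean()

# Calculate SSE (explained, between-group sum of squares):
# squared deviation of each observation's group mean from the overall mean
sse = np.sum((df['group'].map(group_means) - overall_mean)**2)

# Calculate SSR (residual, within-group sum of squares)
ssr = sst - sse

print(f'SST: {sst}')
print(f'SSE: {sse}')
print(f'SSR: {ssr}')


The sample data is random and unseeded, so the printed values change on every run. SST = SSE + SSR always holds, and because all three groups are drawn from the same distribution, SSE is typically small relative to SST.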


**Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

In [None]:
import pandas as pd
import numpy as np

# Create some sample data
data = {'factor1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'value': [25, 34, 12, 20, 32, 41, 28, 18]}

# Create a pandas dataframe
df = pd.DataFrame(data)

def calculate_effects(data):
  """
  This function calculates the main effects and interaction effect for a two-way ANOVA.

  Args:
      data: A pandas DataFrame containing the data for the ANOVA.

  Returns:
      A dictionary containing the main effects for each factor, the interaction effect,
      and the overall mean.
  """
  factor1_means = data.groupby('factor1')['value'].mean()
  factor2_means = data.groupby('factor2')['value'].mean()
  interaction_means = data.groupby(['factor1', 'factor2'])['value'].mean()
  overall_mean = data['value'].mean()

  # Calculate main effects by subtracting the overall mean from factor means
  main_effect_factor1 = factor1_means - overall_mean
  main_effect_factor2 = factor2_means - overall_mean

  # Calculate the interaction effect for each cell:
  # cell mean - overall mean - factor1 main effect - factor2 main effect.
  # unstack() puts factor1 on the rows and factor2 on the columns, so each
  # main effect has to be subtracted along its own axis.
  cell_means = interaction_means.unstack()
  interaction_effect = (cell_means
                        .sub(main_effect_factor1, axis=0)
                        .sub(main_effect_factor2, axis=1)
                        - overall_mean)

  return {
      'main_effect_factor1': main_effect_factor1,
      'main_effect_factor2': main_effect_factor2,
      'interaction_effect': interaction_effect,
      'overall_mean': overall_mean
  }

# Print the results
results = calculate_effects(df.copy())
print("Main effect of factor 1:", results['main_effect_factor1'])
print("Main effect of factor 2:", results['main_effect_factor2'])
print("Interaction effect:", results['interaction_effect'])
print("Overall mean:", results['overall_mean'])


Main effect of factor 1: factor1
A    6.75
B   -6.75
Name: value, dtype: float64
Main effect of factor 2: factor2
X   -2.0
Y    2.0
Name: value, dtype: float64
Interaction effect: factor2    X    Y
factor1          
A       -2.5  2.5
B        2.5 -2.5
Overall mean: 26.25
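
The effects above are descriptive; testing whether they are statistically significant requires F-tests. One possible way to obtain them, sketched here under the assumption that `statsmodels` is available, is to fit a linear model with an interaction term on the same sample data and build the ANOVA table with `anova_lm`. The names `model` and `anova_table` are illustrative.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same sample data as in the cell above
df = pd.DataFrame({
    'factor1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'value': [25, 34, 12, 20, 32, 41, 28, 18],
})

# 'C(factor1) * C(factor2)' expands to both main effects plus their interaction
model = ols('value ~ C(factor1) * C(factor2)', data=df).fit()
anova_table = anova_lm(model, typ=2)
print(anova_table)
```

Each row of the table reports the sum of squares, degrees of freedom, F-statistic, and p-value for one main effect or for the interaction.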


**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?**

Based on the results of your one-way ANOVA, you can conclude that there is a statistically significant difference between the means of at least two groups. Here's how to interpret these findings:

F-statistic (5.23): This value represents the ratio of variance between the groups compared to the variance within the groups. A higher F-statistic indicates a greater difference between group means relative to the variation within each group.

p-value (0.02): This value signifies the probability of observing an F-statistic as extreme or more extreme than 5.23, assuming the null hypothesis (all group means are equal) is true. A p-value of 0.02 is less than a common significance level (e.g., 0.05), suggesting we can reject the null hypothesis.

Interpretation: The statistically significant F-statistic (5.23) and low p-value (0.02) provide evidence that at least one group mean differs from the others in your one-way ANOVA. However, this test doesn't tell you which specific groups are different.
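
The p-value is the upper-tail area of the F-distribution beyond the observed statistic, and it depends on the degrees of freedom. The question does not state them, so the sketch below assumes a hypothetical design of 3 groups and 30 observations purely for illustration; the resulting p-value therefore need not equal the 0.02 quoted above.

```python
from scipy import stats

f_statistic = 5.23

# Assumed design (not given in the question): 3 groups, 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# Upper-tail probability of the F-distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Smallest F that would be significant at alpha = 0.05 for these degrees of freedom
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_critical:.2f}")
```

Rejecting the null hypothesis because the p-value is below 0.05 is equivalent to observing an F-statistic above the critical value.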

**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**

There are several methods for handling missing data in a repeated measures ANOVA:

1. **Listwise deletion:** This method removes every subject that has a missing value on any of the repeated measures. It always reduces the sample size and therefore the power, and it gives biased estimates unless the data is missing completely at random.
2. **Pairwise deletion:** This method uses all available data for each comparison, so a subject is dropped only from the comparisons that involve its missing measurement. Different comparisons are then based on different subsets of subjects, which can bias the results if the data is not missing completely at random.
3. **Imputation:** This method replaces missing data with estimated values. There are a variety of imputation methods, such as mean imputation, median imputation, and regression imputation. Simple methods such as mean imputation shrink the variance and can bias the F-test, but imputation keeps every subject in the analysis and so limits the loss of power.

The choice of which method to use depends on the amount of missing data, the pattern of missing data, and the assumptions of the ANOVA model.

**Potential consequences of using different methods to handle missing data:**

* **Listwise deletion:** Smaller sample, wider confidence intervals, and lower power; the remaining subjects may no longer represent the population if those who dropped out differ systematically from those who stayed.
* **Pairwise deletion:** Sample sizes and degrees of freedom differ between comparisons, which makes the overall analysis harder to interpret.
* **Imputation:** The full sample size is preserved, but treating imputed values as if they were observed understates the uncertainty, so p-values can come out too small.

It is important to choose a method for handling missing data that is appropriate for the specific data set and research question.
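
The difference between listwise deletion and mean imputation can be illustrated with pandas. This is a small sketch on hypothetical wide-format data (one row per subject, one column per time point); the column and index names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    't1': [10.0, 12.0, 11.0, 14.0, 9.0],
    't2': [11.0, np.nan, 12.0, 15.0, 10.0],
    't3': [13.0, 14.0, np.nan, 17.0, 12.0],
}, index=['s1', 's2', 's3', 's4', 's5'])

# Listwise deletion: drop every subject with at least one missing value
listwise = scores.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = scores.fillna(scores.mean())

print("Subjects before and after listwise deletion:", len(scores), len(listwise))
print("Variance per time point (observed values only):")
print(scores.var().round(2))
print("Variance per time point (after mean imputation):")
print(mean_imputed.var().round(2))
```

Listwise deletion discards two of the five subjects, while mean imputation keeps all five but lowers the variance of the columns that had missing values.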

**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**

**Common post-hoc tests used after ANOVA:**

* **Tukey's HSD (Honestly Significant Difference) test:** This test compares all possible pairs of group means while controlling the family-wise Type I error rate (the probability of at least one false rejection). Because it is designed specifically for all pairwise comparisons, it is usually more powerful for that purpose than the Bonferroni correction or Scheffé's test.
* **Scheffé's test:** This test controls the family-wise error rate across all possible contrasts among the group means, not just pairwise comparisons. That generality makes it the most conservative of the three for pairwise comparisons: it has the lowest probability of a Type I error but also the least power to detect a real difference.
* **Bonferroni correction:** This method adjusts for multiple comparisons by multiplying each p-value by the number of comparisons (or, equivalently, dividing the significance level by that number). It is simple and works with any test, but it becomes very conservative as the number of comparisons grows.

**When to use each post-hoc test:**

* **Tukey's HSD test:** This test is a good choice when the researcher wants to compare every pair of groups, particularly when the group sizes are equal or similar.
* **Scheffé's test:** This test is a good choice when the researcher wants to test complex contrasts (e.g., the average of two groups against a third) or comparisons that were chosen only after looking at the data.
* **Bonferroni correction:** This method is a good choice when the researcher makes a small number of comparisons that were planned in advance.

**Example of a situation where a post-hoc test might be necessary:**

A researcher is conducting a one-way ANOVA to compare the mean test scores of three different groups of students. The ANOVA results show that there is a statistically significant difference between the groups. However, the ANOVA results do not tell the researcher which specific groups are different. In order to determine which groups are different, the researcher would need to conduct a post-hoc test.
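
The scenario above can be sketched in code. The example below assumes hypothetical test scores for three groups and runs two of the post-hoc approaches: Tukey's HSD via `pairwise_tukeyhsd`, and pairwise t-tests with a Bonferroni correction via `multipletests`.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three groups of students
scores = {
    'A': [70, 72, 68, 71, 69],
    'B': [71, 73, 69, 72, 70],
    'C': [85, 87, 83, 86, 84],
}
values = np.concatenate([scores[g] for g in scores]).astype(float)
labels = np.repeat(list(scores), [len(v) for v in scores.values()])

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey)

# Bonferroni: pairwise t-tests, then adjust the p-values for three comparisons
pairs = list(combinations(scores, 2))
raw_p = [stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject = {r}")
```

In this data set both methods find that group C differs from groups A and B, while A and B do not differ from each other.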


**Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.**

In [None]:
import pandas as pd
from scipy import stats

# Create a DataFrame
df = pd.DataFrame({
    'diet': ['A'] * 17 + ['B'] * 17 + ['C'] * 16,
    'weight_loss': [10, 8, 12, 9, 11, 10, 8, 12, 9, 11 , 14, 13, 10, 7, 9, 12, 11, 13, 15, 14, 10, 8, 12, 9, 11, 14, 13, 10, 7, 9, 12, 11, 13, 15, 14, 10, 8, 12, 9, 11, 14, 13, 10, 7, 9, 12, 11, 13, 15, 14]
})

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(df['weight_loss'][df['diet'] == 'A'],
                                     df['weight_loss'][df['diet'] == 'B'],
                                     df['weight_loss'][df['diet'] == 'C'])

# Print the results
print('F-statistic:', f_statistic)
print('p-value:', p_value)

# Interpret the results
if p_value < 0.05:
    print('There is a statistically significant difference between the mean weight loss of at least two diets.')
else:
    print('There is no statistically significant difference between the mean weight loss of the three diets.')


F-statistic: 1.3645836166924268
p-value: 0.26542191414273025
There is no statistically significant difference between the mean weight loss of the three diets.


**Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.**

In [None]:
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample data (replace with your actual data)
data = {
    'program': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],  # Program used (A, B, or C)
    'experience': ['Novice', 'Experienced', 'Novice','Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced'],  # Employee experience level
    'time': [20, 15, 25,20, 15, 25, 18, 25, 18]  # Time taken to complete the task (minutes)
}

df = pd.DataFrame(data)

# Define the model formula (main effects only; this formula has no interaction term)
model = ols('time ~ C(program) + C(experience)', data=df).fit()

# Perform ANOVA
anova_results = anova_lm(model)

# Print the ANOVA table
print(anova_results)


                df      sum_sq     mean_sq          F    PR(>F)
C(program)     2.0    0.222222    0.111111   0.022124  0.978214
C(experience)  1.0  107.555556  107.555556  21.415929  0.005695
Residual       5.0   25.111111    5.022222        NaN       NaN
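
The table above comes from a model with main effects only: experience level has a significant effect on completion time (p ≈ 0.006), while the software program does not (p ≈ 0.978). To also test the interaction that the question asks about, the formula needs an interaction term. The following is a sketch on the same 9-row sample data, using `*` in the formula and a Type II table because the cells are unbalanced; the names `interaction_model` and `interaction_table` are illustrative.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same sample data as in the cell above (replace with the real 30 observations)
df = pd.DataFrame({
    'program': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'experience': ['Novice', 'Experienced', 'Novice', 'Novice', 'Experienced',
                   'Novice', 'Experienced', 'Novice', 'Experienced'],
    'time': [20, 15, 25, 20, 15, 25, 18, 25, 18],
})

# 'C(program) * C(experience)' adds the program-by-experience interaction
interaction_model = ols('time ~ C(program) * C(experience)', data=df).fit()
interaction_table = anova_lm(interaction_model, typ=2)
print(interaction_table)
```

With only 9 observations spread over 6 cells, just 3 residual degrees of freedom remain, so the test has very little power; the full data set of 30 employees would give a more reliable answer.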


**Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**

In [None]:
import numpy as np
from scipy import stats
import statsmodels.stats.multicomp as mc

# Generate sample data (replace this with your actual data)
np.random.seed(0)  # for reproducibility
control_group_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_group_scores = np.random.normal(loc=75, scale=10, size=100)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Output the results
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check if the results are significant
alpha = 0.05
if p_value < alpha:
    print("The difference in test scores between the two groups is statistically significant.")
    # Perform Tukey's HSD test for post-hoc analysis
    data = np.concatenate([control_group_scores, experimental_group_scores])
    group_labels = ['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_scores)
    tukey_results = mc.MultiComparison(data, group_labels).tukeyhsd()

    # Output the post-hoc results
    print("\nPost-hoc analysis (Tukey's HSD):")
    print(tukey_results)

else:
    print("There is no significant difference in test scores between the two groups.")


Two-sample t-test results:
t-statistic: -3.597192759749614
p-value: 0.0004062796020362504
The difference in test scores between the two groups is statistically significant.

Post-hoc analysis (Tukey's HSD):
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental    5.222 0.0004 2.3593 8.0848   True
---------------------------------------------------------

With only two groups there is a single pairwise comparison, so the post-hoc test adds no new information: Tukey's HSD reproduces the t-test result (p = 0.0004 in both cases). The experimental group scored about 5.2 points higher on average than the control group. A post-hoc test becomes necessary only when three or more groups are compared.


**Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.**

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Generate random data for the example (replace this with your actual data)
np.random.seed(42)

store_A_sales = np.random.normal(loc=1000, scale=100, size=30)
store_B_sales = np.random.normal(loc=950, scale=90, size=30)
store_C_sales = np.random.normal(loc=1100, scale=110, size=30)

# Combine the sales data and group information into a DataFrame
data = pd.DataFrame({'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales]),
                     'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30})

# Perform a one-way ANOVA.
# Note: stats.f_oneway is an ordinary between-groups ANOVA that treats the
# three samples as independent; it does not model the fact that the same
# 30 days are measured for every store.
F_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)

# Report the results
print("One-way ANOVA (independent groups):")
print("F-statistic:", F_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There is a significant difference in average daily sales between the three stores.")
else:
    print("There is no significant difference in average daily sales between the three stores.")

One-way ANOVA (independent groups):
F-statistic: 23.62763182315457
p-value: 6.369054894762179e-09
There is a significant difference in average daily sales between the three stores.


In [2]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD post-hoc test
tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])

# Report the results
print("\nTukey's HSD post-hoc test:")
print(tukey_results)


Tukey's HSD post-hoc test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
     A      B -42.0899 0.2045 -100.5291  16.3492  False
     A      C  120.232    0.0   61.7929 178.6712   True
     B      C 162.3219    0.0  103.8828 220.7611   True
-------------------------------------------------------
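
Because the same 30 days are recorded for every store, a repeated measures analysis treats each day as the "subject" and the store as the within-subject factor. One possible way to run it, sketched here with the `AnovaRM` class from `statsmodels` on the same simulated sales, is shown below; the `Day` column and the name `long_data` are assumptions added for the example. Paired t-tests with a Bonferroni correction serve as a follow-up that respects the pairing by day.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Same simulated sales as above
np.random.seed(42)
store_A_sales = np.random.normal(loc=1000, scale=100, size=30)
store_B_sales = np.random.normal(loc=950, scale=90, size=30)
store_C_sales = np.random.normal(loc=1100, scale=110, size=30)

# Long format: one row per (day, store) pair; 'Day' identifies the repeated unit
long_data = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales]),
})

# Repeated measures ANOVA with Store as the within-subject factor
rm_result = AnovaRM(long_data, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_result.anova_table)

# Follow-up: paired t-tests between stores, Bonferroni-adjusted for 3 comparisons
wide = long_data.pivot(index='Day', columns='Store', values='Sales')
for a, b in [('A', 'B'), ('A', 'C'), ('B', 'C')]:
    p = stats.ttest_rel(wide[a], wide[b]).pvalue
    print(f"{a} vs {b}: Bonferroni-adjusted p = {min(p * 3, 1.0):.4f}")
```

The simulated sales are generated independently for each store, so this data has no real day-to-day correlation; with actual sales records the repeated measures test would usually be more sensitive than the independent-groups version.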
