Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact 
the validity of the results.
Ans)

Assumptions of ANOVA:
1. Independence of Observations:
    1. Assumption: Each observation should be independent of all others.
    2. Violation Example: If the data includes repeated measurements from the same subjects without accounting for it (e.g., measurements from the same subject over time), it violates the independence assumption.
2. Normality
    1. Assumption: The data within each group should be approximately normally distributed.
    2. Violation Example: If the data in any group is heavily skewed or contains outliers, the normality assumption is violated. This is especially critical for smaller sample sizes.
3. Homogeneity of Variances (Homoscedasticity)
    1. Assumption: The variances among the groups should be approximately equal.
    2. Violation Example: If one group has a much larger variance than the others (heteroscedasticity), this assumption is violated. This could occur if the variability in responses is much greater in one group due to uncontrolled factors.

Impact of Violations:
1. Independence Violation:
    Violation of this assumption can lead to underestimated standard errors and consequently overly optimistic test results, increasing the risk of Type I errors (false positives).
2. Normality Violation:
    ANOVA is robust to minor violations of normality, particularly with larger sample sizes due to the Central Limit Theorem. However, significant deviations from normality in small samples can lead to incorrect conclusions. For instance, skewed data can inflate the error term, leading to incorrect F-ratios and p-values.
3. Homogeneity of Variances Violation:
    When variances are unequal (heteroscedasticity), it can lead to biased F-statistics, making the test either too conservative or too liberal. This could result in either failing to detect a true difference (Type II error) or finding a difference that does not exist (Type I error).
    
Examples of Violations:
1. Independence Violation Example:
    In a study comparing the effectiveness of different teaching methods, if students are grouped by class and responses within classes are correlated (e.g., due to class-specific factors), independence is violated.
2. Normality Violation Example:
    In an experiment comparing the weight loss effect of different diets, if one diet group includes a few individuals with extreme weight loss that heavily skew the distribution, the normality assumption is violated.
3. Homogeneity of Variances Violation Example:
    In a clinical trial comparing different drug treatments, if the response variability is much higher in the placebo group compared to the treatment groups due to uncontrolled health factors, the homogeneity of variances assumption is violated.
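Two of these assumptions can be screened numerically before running ANOVA. The following is a minimal sketch, assuming SciPy is available and using simulated groups (the data, the group names, and the 0.05 threshold are illustrative assumptions, not part of the question). It applies the Shapiro-Wilk test for normality within each group and Levene's test for equal variances. Independence cannot be checked this way; it has to come from the study design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data (assumption): group C has a much larger spread than A and B
groups = {
    "A": rng.normal(loc=10, scale=1.0, size=30),
    "B": rng.normal(loc=10, scale=1.0, size=30),
    "C": rng.normal(loc=10, scale=5.0, size=30),
}

# Shapiro-Wilk test per group: a small p-value suggests non-normal data
for name, values in groups.items():
    w_stat, p_normal = stats.shapiro(values)
    print(f"Group {name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_normal:.3f}")

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, p_levene = stats.levene(*groups.values())
print(f"Levene statistic = {levene_stat:.3f}, p = {p_levene:.4f}")

if p_levene < 0.05:
    print("Variances look unequal; consider Welch's ANOVA or a transformation.")
```

Formal tests like these are sensitive to sample size, so they are usually read alongside plots (histograms, Q-Q plots, residual plots) rather than used as the only check.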

Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans)

ANOVA comes in several types, each designed for specific experimental designs and types of data.
    1. One-Way ANOVA: 
        Definition: 
            One-Way ANOVA is used to compare the means of three or more independent (unrelated) groups based on one independent variable (factor).
            
        When to Use:
            1. When you have a single factor with multiple levels and you want to see if there is a significant difference in the means across these levels.
            2. Example: Comparing the mean test scores of students from three different teaching methods (Method A, Method B, Method C).
            
        Example Situation:
            A researcher wants to compare the effectiveness of three different diets on weight loss. Participants are randomly assigned to one of the three diets, and the weight loss of participants in each diet group is compared using One-Way ANOVA.
            
    2. Two-Way ANOVA:
    
        Definition: Two-Way ANOVA is used to examine the effect of two independent variables (factors) on a dependent variable, including the interaction effect between the two factors.
        When to Use:
            1. When you have two factors and you want to understand both their individual effects and their combined interaction effect on the dependent variable.
            2. Example: Investigating the effect of teaching method (Method A, Method B) and study time (1 hour, 2 hours) on test scores.
        Example Situation:
            A researcher wants to study the impact of exercise type (cardio, strength training) and diet type (low-carb, low-fat) on weight loss. By using Two-Way ANOVA, the researcher can determine if there is an interaction effect between exercise type and diet type on weight loss.
            
    3. Repeated Measures ANOVA:
    
        Definition: Repeated Measures ANOVA is used when the same subjects are used for each treatment (i.e., each subject is measured multiple times under different conditions).
        When to Use:
            1. When you have correlated groups (e.g., measurements taken from the same subjects over time or under different conditions) and you want to account for within-subject variability.
            2. Example: Comparing the effect of a drug on blood pressure measured at multiple time points (baseline, 1 month, 3 months).
        Example Situation:
            A researcher wants to assess the effect of a new drug on blood pressure. The blood pressure of participants is measured at baseline, after 1 month of treatment, and after 3 months of treatment. Repeated Measures ANOVA is used to compare the mean blood pressure at these different time points.
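The practical difference between a one-way and a repeated measures design shows up in the error term. The sketch below uses made-up blood pressure readings (rows are hypothetical participants, columns are baseline, 1 month, and 3 months) and computes the repeated measures F-statistic by hand with NumPy. It then compares the result with `scipy.stats.f_oneway`, which treats the three time points as if they were independent groups. It assumes a complete, balanced data set and does not check sphericity.

```python
import numpy as np
from scipy import stats

# Made-up readings: one row per participant, columns = baseline, 1 month, 3 months
bp = np.array([
    [150.0, 145.0, 140.0],
    [130.0, 126.0, 121.0],
    [160.0, 154.0, 150.0],
    [140.0, 137.0, 131.0],
    [120.0, 116.0, 110.0],
])
n_subjects, n_conditions = bp.shape
grand_mean = bp.mean()

# Partition the total variability into time-point, participant, and error parts
ss_total = ((bp - grand_mean) ** 2).sum()
ss_conditions = n_subjects * ((bp.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = n_conditions * ((bp.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_conditions - ss_subjects

# Repeated measures F: participant-to-participant variability is removed from the error
df_conditions = n_conditions - 1
df_error = (n_conditions - 1) * (n_subjects - 1)
f_rm = (ss_conditions / df_conditions) / (ss_error / df_error)
p_rm = stats.f.sf(f_rm, df_conditions, df_error)

# One-way ANOVA on the same numbers, ignoring that rows are the same participants
f_ind, p_ind = stats.f_oneway(*bp.T)

print(f"Repeated measures: F = {f_rm:.2f}, p = {p_rm:.4g}")
print(f"One-way (independence wrongly assumed): F = {f_ind:.2f}, p = {p_ind:.4g}")
```

Because participants differ a lot from one another in these made-up readings, the one-way analysis buries the time effect in between-participant noise, while the repeated measures analysis isolates it.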

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans)

The partitioning of variance in ANOVA (Analysis of Variance) refers to the process of breaking down the total variability in the data into components attributable to different sources. It allows researchers to determine the contributions of different factors or treatments to the overall variability.

Partitioning of Variance:
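    In ANOVA, the total variability in the data (the total sum of squares, SST) is partitioned into two main components, so that SST = SSB + SSW:
    1. Between-Groups Variability (SSB or SSbetween): variation of the group means around the grand mean.
    2. Within-Groups Variability (SSW or SSwithin): variation of individual observations around their own group mean.
    
Importance of Understanding Partitioning of Variance: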
1. Identifying Sources of Variation:
    By partitioning the variance, researchers can identify how much of the total variability in the data is due to differences between groups (treatments, factors) versus variability within groups (random error).
2. Hypothesis Testing:
    ANOVA tests the null hypothesis that all group means are equal. By comparing the between-groups variance (which reflects differences due to the treatment effect) to the within-groups variance (which reflects random error), researchers can determine if the observed differences between groups are statistically significant.
3. F-Statistic Calculation:
    The F-statistic is calculated as the ratio of the between-groups mean square (SSB divided by its degrees of freedom, k - 1 for k groups) to the within-groups mean square (SSW divided by N - k for N observations). This ratio helps in determining the significance of the treatment effect.
4. Assessing Effect Size:
    Understanding the partitioning of variance allows researchers to calculate effect sizes, such as eta-squared, which provide a measure of the proportion of total variance that is attributable to the factor of interest.
5. Model Comparison and Improvement:
    Partitioning variance helps in comparing different models and understanding which factors or interactions contribute most to explaining the variability in the data. This can guide improvements in experimental design and analysis.
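The partition can be checked numerically. The following is a minimal sketch with made-up data for three groups (the values are assumptions chosen only for illustration). It computes SSB, SSW, and the total sum of squares with NumPy, derives the F-statistic and eta-squared from them, and compares the F-statistic with the one returned by `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Made-up observations for three groups
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 9.0, 8.0]),
    np.array([9.0, 11.0, 10.0, 12.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-groups: group means around the grand mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-groups: observations around their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total: observations around the grand mean
sst = ((all_values - grand_mean) ** 2).sum()

k = len(groups)          # number of groups
n = len(all_values)      # total number of observations
f_stat = (ssb / (k - 1)) / (ssw / (n - k))
eta_squared = ssb / sst  # share of total variability explained by the factor

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SSB = {ssb:.3f}, SSW = {ssw:.3f}, SST = {sst:.3f}")
print(f"F (by hand) = {f_stat:.3f}, F (SciPy) = {f_scipy:.3f}, p = {p_scipy:.4g}")
print(f"eta-squared = {eta_squared:.3f}")
```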

In [2]:
"""Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual 
sum of squares (SSR) in a one-way ANOVA using Python?

Ans)
To calculate the Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of Squares (SSR) in a one-way ANOVA using Python, you can follow these steps."""

# Sample example using Python

import numpy as np
import pandas as pd

# Example data
data = {
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [5, 6, 7, 8, 9, 10, 10, 11, 12]
}

df = pd.DataFrame(data)

# Calculate the grand mean
grand_mean = df['Value'].mean()

# Calculate group means
group_means = df.groupby('Group')['Value'].mean()

# Total Sum of Squares (SST)
df['SST'] = (df['Value'] - grand_mean) ** 2
SST = df['SST'].sum()

# Explained Sum of Squares (SSE): every row contributes the squared deviation of its
# group mean from the grand mean, so summing over rows weights each group by its size
df['SSE'] = (df['Group'].map(group_means) - grand_mean) ** 2
SSE = df['SSE'].sum()

# Residual Sum of Squares (SSR): squared deviation of each value from its group mean
df['SSR'] = (df['Value'] - df['Group'].map(group_means)) ** 2
SSR = df['SSR'].sum()

# Output results; the partition SST = SSE + SSR should hold
print(f"Total Sum of Squares (SST): {SST:.2f}")
print(f"Explained Sum of Squares (SSE): {SSE:.2f}")
print(f"Residual Sum of Squares (SSR): {SSR:.2f}")



Total Sum of Squares (SST): 44.00
Explained Sum of Squares (SSE): 38.00
Residual Sum of Squares (SSR): 6.00


In [3]:
"""Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Ans)

In a two-way ANOVA, you calculate the main effects and interaction effects by analyzing the impact of two independent variables on a dependent variable, as well as the interaction between the two independent variables.
"""

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A3', 'A3', 'A3'],
    'FactorB': ['B1', 'B1', 'B2', 'B2', 'B2', 'B3', 'B3', 'B3', 'B1'],
    'Value': [5, 6, 7, 8, 9, 10, 10, 11, 12]
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Value ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)


                       sum_sq   df     F    PR(>F)
C(FactorA)               48.0  2.0  48.0  0.006165
C(FactorB)                3.0  2.0   3.0  0.181690
C(FactorA):C(FactorB)    22.0  4.0  11.0  0.039802
Residual                  1.5  3.0   NaN       NaN




In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. 
What can you conclude about the differences between the groups, and how would you interpret these 
results?

Ans)

Given inputs:
    1. F-statistic: 5.23
    2. p-value: 0.02
    3. Significance Level (α): Typically 0.05 (unless otherwise specified)
    
Interpretation:
        1. Null Hypothesis: The null hypothesis in a one-way ANOVA states that all group means are equal.
        2. Alternative Hypothesis: The alternative hypothesis states that at least one group mean is different from the others.
F-Statistic:
    The F-statistic of 5.23 is the ratio of the variance between the group means to the variance within the groups. A higher F-statistic indicates more variation between the groups relative to the variation within them.
    
P-Value:
    The p-value of 0.02 is the probability of observing an F-statistic of 5.23 or larger if the null hypothesis were true.
Decision:
    1. Compare the p-value to the significance level: 0.02 < 0.05.
    2. Since the p-value is less than the significance level, we reject the null hypothesis.
Conclusion:
    Reject the Null Hypothesis: There is sufficient evidence to conclude that there are statistically significant differences between the group means.
    Implication: At least one of the group means is significantly different from the others.
    
Practical Interpretation:
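    1. Significance: The result indicates that the independent variable (the factor dividing the groups) has a statistically significant effect on the dependent variable (the measured outcome). Differences in means this large would be unlikely if all group means were truly equal. Statistical significance alone does not show that the effect is large; an effect size such as eta-squared is needed for that.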
    2. Further Analysis: While the one-way ANOVA tells us that there is a significant difference, it does not specify which groups are different. To determine which specific groups' means differ, post-hoc tests such as Tukey's HSD (Honestly Significant Difference) can be performed.
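The link between the F-statistic and the p-value can be reproduced with the F distribution. The question does not state the degrees of freedom, so the sketch below assumes 3 groups and 17 observations in total (degrees of freedom 2 and 14), a hypothetical choice made because it gives a p-value close to the quoted 0.02.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question): 3 groups, 17 observations
df_between = 3 - 1
df_within = 17 - 3

# p-value: probability of an F at least this large when the null hypothesis is true
p_value = stats.f.sf(f_stat, df_between, df_within)
# Critical value: the smallest F that would be significant at alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, critical F = {f_critical:.2f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision rule, so they always agree.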

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential 
consequences of using different methods to handle missing data?

Ans)
Handling missing data in a repeated measures ANOVA is crucial because it can impact the validity and power of your analysis. The following are a few methods to handle it.
1. Listwise Deletion: 
    Description: Removes every participant (case) that has a missing value at any time point or condition, so only complete cases are analyzed.
    Consequences:
        1. Advantages: Simple to implement and easy to understand.
        2. Disadvantages: Reduces sample size, which can decrease statistical power. Can introduce bias if the data are not missing completely at random (MCAR).
2. Pairwise Deletion: 
    Description: Uses all available data for each analysis, without removing entire cases if some data points are missing.
    Consequences:
        1. Advantages: Retains more data compared to listwise deletion.
        2. Disadvantages: Can lead to inconsistent sample sizes across analyses and may complicate interpretation. Can introduce bias if data are not MCAR.
3. Mean Imputation: 
    Description: Replaces missing values with the mean of the available data for that variable.
    Consequences:
        1. Advantages: Simple and retains all cases.
        2. Disadvantages: Reduces variability, underestimates standard errors, and can lead to biased parameter estimates.
4. Last Observation Carried Forward (LOCF):  
    Description: Replaces missing values with the last observed value for that participant.
    Consequences:
        1. Advantages: Simple and retains all cases.
        2. Disadvantages: Can introduce bias if the data trend over time. Assumes that the missing data would have stayed the same as the last observed value, which may not be valid.
    
5. Multiple Imputation: 
    Description: Replaces missing data with multiple sets of simulated values to reflect the uncertainty around the true value.
    Consequences:
        1. Advantages: Produces valid statistical inferences by accounting for the uncertainty due to missing data. Retains variability and reduces bias.
        2. Disadvantages: Computationally intensive and more complex to implement and interpret.
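The three simplest methods above can be sketched with pandas. The example assumes wide-format data (one row per participant, one column per time point) with made-up blood pressure values. The column names and participant labels are hypothetical. Multiple imputation is not shown because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Made-up wide-format data with two missing measurements
wide = pd.DataFrame(
    {
        "baseline": [150.0, 130.0, 160.0, 140.0],
        "month_1": [145.0, np.nan, 154.0, 137.0],
        "month_3": [140.0, 121.0, np.nan, 131.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: keep only participants with complete data
listwise = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = wide.ffill(axis=1)

print(f"Participants kept by listwise deletion: {len(listwise)} of {len(wide)}")
print(mean_imputed)
print(locf)
```

In this toy data listwise deletion halves the sample, which illustrates the loss of power, while LOCF fills participant s2's 1-month value with the baseline reading, which hides any change that happened in between.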

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide 
an example of a situation where a post-hoc test might be necessary.

Ans)

Common post-hoc tests and when to use each:
1. Tukey's Honestly Significant Difference (HSD) Test
    1. When to Use:
        a. Best for pairwise comparisons when the sample sizes are equal or nearly equal.
        b. Controls the family-wise error rate.
    2. Example Situation: Comparing the effectiveness of three different teaching methods (A, B, C) on student test scores.
        
2. Bonferroni Correction
    1. When to Use:
        a. When you want a conservative method to control the family-wise error rate.
        b. Suitable for multiple comparisons where type I error needs to be tightly controlled.
    2. Example Situation: Comparing the effectiveness of multiple drugs on lowering blood pressure.
    
3. Scheffé's Test
    1. When to Use:
        a. Useful for complex comparisons beyond pairwise (e.g., comparing combinations of groups).
        b. More flexible but less powerful for pairwise comparisons compared to Tukey's HSD
    2. Example situation: Comparing the mean satisfaction scores from four different customer service approaches.
4. Dunnett's Test
    1. When to use:
        a. When comparing multiple treatment groups against a single control group.
    2. Example situation: Comparing different dosages of a drug to a placebo group.
5. Fisher's Least Significant Difference (LSD) Test
    1. When to use:
        a. When you want to perform multiple pairwise comparisons without adjusting for multiple testing.
        b. More powerful but increases the risk of type I error.
    2. Example Situation: Comparing the mean growth rates of plants under different light conditions.
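As an illustration of how a post-hoc procedure works, the Bonferroni correction can be applied to pairwise t-tests in a few lines. This is a minimal sketch on simulated test scores for three hypothetical teaching methods, where method C is simulated with a higher mean. The data and the 0.05 level are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated scores (assumption): methods A and B share a mean, C is higher
groups = {
    "A": rng.normal(loc=70, scale=5, size=20),
    "B": rng.normal(loc=70, scale=5, size=20),
    "C": rng.normal(loc=85, scale=5, size=20),
}
alpha = 0.05
pairs = list(combinations(groups, 2))

results = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[g1], groups[g2])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(g1, g2)] = (p_raw, p_adj, p_adj < alpha)
    print(f"{g1} vs {g2}: raw p = {p_raw:.4g}, adjusted p = {p_adj:.4g}, "
          f"reject = {p_adj < alpha}")
```

The adjusted p-value is never smaller than the raw one, which is what makes the method conservative. Dropping the adjustment step gives the unadjusted pairwise comparisons that Fisher's LSD relies on.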


In [6]:
'''Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python 
to determine if there are any significant differences between the mean weight loss of the three diets. 
Report the F-statistic and p-value, and interpret the results.

Ans)
'''
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set the random seed for reproducibility
np.random.seed(0)

# Generate random weight loss data for each diet group
weight_loss_A = np.random.normal(loc=5, scale=1.5, size=17)
weight_loss_B = np.random.normal(loc=7, scale=1.5, size=17)
weight_loss_C = np.random.normal(loc=6, scale=1.5, size=16)

# Combine the data into a single DataFrame
data = {
    'Diet': ['A'] * 17 + ['B'] * 17 + ['C'] * 16,
    'WeightLoss': np.concatenate([weight_loss_A, weight_loss_B, weight_loss_C])
}
df = pd.DataFrame(data)

# Fit the one-way ANOVA model
model = ols('WeightLoss ~ C(Diet)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)


              sum_sq    df         F    PR(>F)
C(Diet)    13.662495   2.0  2.612308  0.083991
Residual  122.906107  47.0       NaN       NaN


In [None]:
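Conclusion results:
    1. F-Statistic: 2.61: This value is the ratio of the variance explained by the diets to the variance within the diets.
    2. P-Value: 0.084: This value is the probability of observing an F-statistic at least this large under the null hypothesis (no difference between the diets).
    3. Interpretation: Since 0.084 > 0.05, we fail to reject the null hypothesis; these simulated data do not show a statistically significant difference in mean weight loss between the three diets.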

In [8]:
'''Q10. A company wants to know if there are any significant differences in the average time it takes to 
complete a task using three different software programs: Program A, Program B, and Program C. They 
randomly assign 30 employees to one of the programs and record the time it takes each employee to 
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or 
interaction effects between the software programs and employee experience level (novice vs. 
experienced). Report the F-statistics and p-values, and interpret the results.
'''

#Ans) 
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set the random seed for reproducibility
np.random.seed(0)

# Define the number of samples for each group
n_per_group = 5  # Adjust as needed for a total of 30 employees

# Generate data for each combination of software program and experience level
software_programs = ['A', 'B', 'C']
experience_levels = ['Novice', 'Experienced']

# Create empty lists to store data
programs = []
experience = []
completion_times = []

# Populate data for each combination
for program in software_programs:
    for exp in experience_levels:
        # Generate random completion times
        times = np.random.normal(loc=[30, 25, 20][software_programs.index(program)], scale=5, size=n_per_group)
        programs.extend([program] * n_per_group)
        experience.extend([exp] * n_per_group)
        completion_times.extend(times)

# Create a DataFrame
df = pd.DataFrame({
    'Program': programs,
    'Experience': experience,
    'CompletionTime': completion_times
})

# Fit the two-way ANOVA model
model = ols('CompletionTime ~ C(Program) * C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)



                              sum_sq    df          F    PR(>F)
C(Program)                812.284508   2.0  13.882400  0.000099
C(Experience)              54.031595   1.0   1.846861  0.186784
C(Program):C(Experience)   83.377473   2.0   1.424968  0.260145
Residual                  702.141835  24.0        NaN       NaN


In [None]:
End results:
    1. Software Program: Significant main effect (F = 13.88, p = 0.000099); mean completion times differ between the software programs.
    2. Experience Level: No significant main effect (F = 1.85, p = 0.187); there is no evidence that completion time differs between novice and experienced employees.
    3. Interaction: No significant interaction effect (F = 1.42, p = 0.260), so there is no evidence that the effect of the software program depends on experience level.
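A table of cell means is a useful companion to the ANOVA table when interpreting main and interaction effects. The sketch below regenerates simulated completion times with the same design (3 programs x 2 experience levels x 5 employees), but it uses a separate random generator, so the numbers differ from the output above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated data (assumption): program means of 30, 25, and 20 with no experience effect
program_means = {"A": 30, "B": 25, "C": 20}
rows = []
for program, mean_time in program_means.items():
    for exp in ["Novice", "Experienced"]:
        for t in rng.normal(loc=mean_time, scale=5, size=5):
            rows.append({"Program": program, "Experience": exp, "CompletionTime": t})
df = pd.DataFrame(rows)

# Mean completion time for every Program x Experience cell
cell_means = df.groupby(["Program", "Experience"])["CompletionTime"].mean().unstack()

# Main effect of program: marginal means across experience levels
program_effect = df.groupby("Program")["CompletionTime"].mean()

# Novice minus experienced within each program; similar gaps across programs
# point to little or no interaction
gap = cell_means["Novice"] - cell_means["Experienced"]

print(cell_means.round(2))
print(program_effect.round(2))
print(gap.round(2))
```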

In [9]:
"""Q11. An educational researcher is interested in whether a new teaching method improves student test 
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the 
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a 
two-sample t-test using Python to determine if there are any significant differences in test scores 
between the two groups. If the results are significant, follow up with a post-hoc test to determine which 
group(s) differ significantly from each other."""

#Ans) 

import pandas as pd
import numpy as np
from scipy import stats

# Set the random seed for reproducibility
np.random.seed(0)

# Generate test scores
n_students = 100
n_control = n_students // 2
n_experimental = n_students - n_control

# Random test scores for control and experimental groups
control_scores = np.random.normal(loc=75, scale=10, size=n_control)  # Traditional method
experimental_scores = np.random.normal(loc=80, scale=10, size=n_experimental)  # New method

# Create DataFrame
data = {
    'Group': ['Control'] * n_control + ['Experimental'] * n_experimental,
    'Score': np.concatenate([control_scores, experimental_scores])
}
df = pd.DataFrame(data)

# Extract scores for each group
control_scores = df[df['Group'] == 'Control']['Score']
experimental_scores = df[df['Group'] == 'Experimental']['Score']

# Perform the two-sample t-test
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)  # Welch's t-test
print(f"T-Statistic: {t_stat:.3f}")
print(f"P-Value: {p_value:.3f}")



T-Statistic: -1.668
P-Value: 0.099


In [None]:
End results:
    1. T-Statistic: -1.668, P-Value: 0.099.
    2. Since 0.099 > 0.05, we fail to reject the null hypothesis; these simulated scores do not show a statistically significant difference between the control and experimental groups.
    3. Post-hoc test: Not needed here. With only two groups, the t-test itself is the single pairwise comparison, so even a significant result would leave no further comparisons to make.


In [10]:
"""Q12. A researcher wants to know if there are any significant differences in the average daily sales of three 
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store 
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any 
significant differences in sales between the three stores. If the results are significant, follow up with a posthoc test to determine which store(s) differ significantly from each other"""

#Ans)

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set the random seed for reproducibility
np.random.seed(0)

# Number of days
n_days = 30

# Generate sales data for each store
store_a_sales = np.random.normal(loc=500, scale=50, size=n_days)
store_b_sales = np.random.normal(loc=520, scale=50, size=n_days)
store_c_sales = np.random.normal(loc=510, scale=50, size=n_days)

# Create a DataFrame
data = {
    'Day': np.tile(np.arange(1, n_days + 1), 3),
    'Store': np.repeat(['A', 'B', 'C'], n_days),
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])
}
df = pd.DataFrame(data)

# Perform the repeated measures ANOVA
anova_rm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
anova_results = anova_rm.fit()

print(anova_results)

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df['Sales'], groups=df['Store'], alpha=0.05)
print(tukey)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  1.2190 2.0000 58.0000 0.3030

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B -16.6189 0.4035 -47.2886 14.0508  False
     A      C -18.8289 0.3133 -49.4986 11.8408  False
     B      C    -2.21 0.9839 -32.8797 28.4598  False
-----------------------------------------------------


In [None]:
End results:
    1. Repeated Measures ANOVA: F(2, 58) = 1.219, p = 0.303. Since 0.303 > 0.05, we fail to reject the null hypothesis; these simulated data show no statistically significant difference in mean daily sales between the three stores.
    2. Post-hoc test: None of the Tukey HSD comparisons rejects the null hypothesis (all adjusted p-values are above 0.05), which is consistent with the ANOVA result. A post-hoc test would normally be run only after a significant ANOVA.
    3. Caveat: pairwise_tukeyhsd treats the three stores as independent groups and ignores that the same days were measured for every store; paired comparisons with a multiplicity correction would match the repeated measures design more closely.
