Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Ans:
    
    Analysis of Variance (ANOVA) is a statistical technique used to compare means among multiple groups. To ensure the
    validity of ANOVA results, certain assumptions must be met. These assumptions include:

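    Normality: The data within each group (equivalently, the model residuals) should be approximately normally
        distributed. ANOVA is fairly robust to moderate violations of normality, especially when sample sizes are large.

    Violation Example: If the data within one or more groups are strongly skewed or contain extreme outliers, the
        p-values from ANOVA can be inaccurate. This can be checked with a normality test such as Shapiro-Wilk or by
        visual inspection of Q-Q plots.
        
    Homogeneity of Variances (Homoscedasticity): The variances of the groups should be roughly equal. This assumption
        matters most when group sizes are unequal.

    Violation Example: If the variances differ substantially between groups, the Type I error rate can be inflated.
        A common way to check homogeneity of variances is to use Levene's test or Bartlett's test.

    Independence: The observations should be independent of one another, both within and between groups.

    Violation Example: If the same subjects are measured more than once, or if observations are clustered (for example,
        students from the same classroom), the within-group variability is understated and the F-test becomes too
        liberal. This is addressed through the study design or by using a repeated measures or mixed model.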
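
    The normality and equal-variance checks mentioned above can be sketched as follows. This is a minimal
    illustration on simulated data, not a full diagnostic workflow; the group means, spread, sample size, and seed
    are assumptions chosen only for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical sample data: three groups drawn from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (5.0, 6.0, 7.0)]

# Shapiro-Wilk test per group (H0: the group comes from a normal distribution)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Levene's test across groups (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 4))
print("Levene p-value:", round(float(levene_p), 4))
```

    A small p-value in either test is evidence against the corresponding assumption; with large samples these tests
    flag even trivial departures, so plots are a useful complement.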
    

Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans:
    
    Analysis of Variance (ANOVA) is a statistical technique used to compare means among multiple groups. There are three
    main types of ANOVA, each designed for different situations:

    One-Way ANOVA:

    Use Case: When comparing means across two or more independent groups.
    Situation: You have one independent variable with two or more levels (groups) and a continuous dependent variable.
    Example: Investigating whether there is a significant difference in test scores among students taught by different 
        teachers.
        
    Two-Way ANOVA:

    Use Case: When there are two independent variables and their combined effect on the dependent variable needs to be
        assessed.
    Situation: You have two independent variables, and both are categorical. The interaction between these variables is
        also examined.
    Example: Studying the effects of both teaching method and gender on test scores. This allows you to investigate not
        only the main effects of teaching method and gender but also whether there is an interaction effect.
        
    Repeated Measures ANOVA:

    Use Case: When measurements are taken on the same group or individual at multiple points in time or under different 
        conditions.
    Situation: You have a single group of subjects, and each subject is measured under multiple conditions or at different
        time points.
    Example: Assessing the effectiveness of a drug by measuring patients' blood pressure before treatment, during treatment,
        and after treatment.


Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans:
    
    
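    The partitioning of variance in Analysis of Variance (ANOVA) refers to the decomposition of the total variability observed in the data into components, each associated with a specific source of variation. In a one-way ANOVA, the total sum of squares splits into a between-group sum of squares (variation of the group means around the grand mean) and a within-group sum of squares (variation of observations around their own group mean). Understanding this concept is crucial for interpreting ANOVA results and for judging how much each source contributes to the overall variability in the data.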

    Understanding the partitioning of variance is important for several reasons:

    Hypothesis Testing: ANOVA tests the null hypothesis that all group means are equal. By partitioning the variance, ANOVA helps assess whether the observed differences among group means are statistically significant or if they could occur due to random chance.

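    Effect Size: The F-statistic is the ratio of the between-group mean square to the within-group mean square. It is a test statistic rather than an effect size, because it grows with sample size. The partition does give an effect size directly: eta-squared, the between-group sum of squares divided by the total sum of squares, is the proportion of total variability explained by the independent variable.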

    Model Understanding: Partitioning variance helps researchers understand the relative importance of different factors or treatments in explaining the overall variability in the data.
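
    The decomposition can be checked by hand on a small example. The sketch below uses made-up scores for three
    groups (the data and variable names are illustrative assumptions) and computes the total, between-group, and
    within-group sums of squares directly with NumPy.

```python
import numpy as np

# Hypothetical scores for three groups
groups = {
    "g1": np.array([4.0, 5.0, 6.0]),
    "g2": np.array([7.0, 8.0, 9.0]),
    "g3": np.array([10.0, 11.0, 12.0]),
}
all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variability of every observation around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group: group means around the grand mean, weighted by group size
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
# Within-group: observations around their own group mean
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())

k, n = len(groups), len(all_values)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
eta_squared = ss_between / ss_total

print(ss_total, ss_between, ss_within)  # 60.0 54.0 6.0
print(f_stat, eta_squared)
```

    Here the between-group and within-group parts add up to the total (54 + 6 = 60), which is the identity that the
    F-test relies on.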


Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace this with your own dataset)
data = {
    'Group1': [5, 8, 7, 4, 6],
    'Group2': [10, 12, 8, 11, 9],
    'Group3': [15, 14, 13, 17, 16]
}

df = pd.DataFrame(data)

# Reshape the data to long format: one row per observation
stacked_data = df.melt(var_name='Group', value_name='Value')

# Fit the one-way ANOVA model
model = ols('Value ~ C(Group)', data=stacked_data).fit()

# Extract ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract sum of squares values from the ANOVA table by row label
SSE = anova_table.loc['C(Group)', 'sum_sq']  # Explained (between-group) sum of squares
SSR = anova_table.loc['Residual', 'sum_sq']  # Residual (within-group) sum of squares
SST = SSE + SSR                              # Total sum of squares

# Display the results
print(f"Total Sum of Squares (SST): {SST:.4f}")
print(f"Explained Sum of Squares (SSE): {SSE:.4f}")
print(f"Residual Sum of Squares (SSR): {SSR:.4f}")


Total Sum of Squares (SST): 233.3333
Explained Sum of Squares (SSE): 203.3333
Residual Sum of Squares (SSR): 30.0000

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {'A': ['A1']*5 + ['A2']*5 + ['A3']*5,
        'B': ['B1', 'B2', 'B1', 'B2', 'B1']*3,
        'Value': [10, 12, 14, 16, 18, 25, 30, 28, 35, 32, 45, 40, 38, 42, 48]}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
formula = 'Value ~ A + B + A:B'
model = ols(formula, df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the sum of squares and F-statistic for each main effect and the interaction
# (the corresponding p-values are in the 'PR(>F)' column of anova_table)
effects = {'Main Effect of A': 'A', 'Main Effect of B': 'B', 'Interaction Effect (A:B)': 'A:B'}

# Print the results
for label, term in effects.items():
    ss = anova_table.loc[term, 'sum_sq']
    f_value = anova_table.loc[term, 'F']
    print(f"{label}: sum_sq = {ss:.2f}, F = {f_value:.2f}")


Main Effect of A: sum_sq = 2054.53, F = 70.13
Main Effect of B: sum_sq = 0.90, F = 0.06
Interaction Effect (A:B): sum_sq = 28.47, F = 0.97


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Ans:
    
    In a one-way ANOVA, the F-statistic is used to test whether there are significant differences among the means of three
    or more groups. The associated p-value helps determine the statistical significance of these differences. Here's how
    to interpret the results:

    F-Statistic: The F-statistic is a ratio of the variance between group means to the variance within groups. In your case,
        the F-statistic is 5.23.

    P-value: The p-value associated with the F-statistic indicates the probability of obtaining such a result 
        (or more extreme) if the null hypothesis is true. In your case, the p-value is 0.02.

    Now, let's interpret the results:

    Null Hypothesis (H0): All group population means are equal.

    Alternative Hypothesis (H1): At least one group mean differs from the others.

    Interpretation:

    If the p-value is less than the chosen significance level (commonly 0.05), you would reject the null hypothesis.
    If the p-value is greater than the significance level, you would fail to reject the null hypothesis.
    In your case:

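    The p-value is 0.02, which is less than the common significance level of 0.05.
    Therefore, you would reject the null hypothesis and conclude that at least one group mean differs from the others.
    An F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square.
    The ANOVA does not show which groups differ or how large the differences are, so a post-hoc test (such as
    Tukey's HSD) and an effect size are needed to complete the interpretation.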
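
    The decision rule can also be expressed through the F distribution. The question does not state the degrees of
    freedom, so the sketch below assumes 3 groups and 30 observations purely for illustration; with these
    hypothetical values the computed p-value will not be exactly 0.02.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05
# Hypothetical degrees of freedom: 3 groups, 30 observations in total
df_between, df_within = 2, 27

# Critical value: the F value that cuts off the upper 5% of the null distribution
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
# p-value: probability of an F at least this large if the null hypothesis is true
p_value = stats.f.sf(f_observed, df_between, df_within)
reject_null = f_observed > f_critical

print(f"Critical F: {f_critical:.3f}")
print(f"p-value: {p_value:.4f}")
print("Reject H0:", reject_null)
```

    Comparing F with the critical value and comparing the p-value with alpha are equivalent ways of making the same
    decision.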

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Ans:
   
    Pairwise Deletion:

    Method: Use all available data points for each comparison, so that sample sizes vary across time points or conditions.
    Consequences:
    Easy to implement.
    May lead to biased results if the missing data is not missing completely at random (MCAR).
    Reduction in statistical power, especially when the proportion of missing data is substantial.
    
    Mean Imputation:

    Method: Replace missing values with the mean of the observed values for that variable.
    Consequences:
    Preserves the sample size.
    Can introduce bias, especially if the missing data is related to specific conditions or time points.
    Reduces variability, potentially leading to an underestimation of standard errors and inflated Type I error rates.

    Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):

    Method: Impute missing values with the last (or next) observed value for that subject.
    Consequences:
    Assumes a constant trajectory between observed time points.
    Can be inappropriate if the assumption of a constant trajectory is violated.
    May introduce bias, especially if the last (or next) observation is not a good representation of the missing value.
    
    Linear Interpolation:

    Method: Estimate missing values based on the linear interpolation between adjacent observed time points.
    Consequences:
    Assumes a linear trend between observed time points.
    Appropriate for continuous data with a linear trend but may not be suitable for all types of data.
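
    Several of these strategies map onto one-line pandas operations. The sketch below uses a small made-up
    wide-format table (one row per subject, one column per time point; all names and values are assumptions). It
    also shows complete-case (listwise) deletion for comparison, since that is the simplest way to obtain the
    balanced data that a classical repeated measures ANOVA expects.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
df = pd.DataFrame(
    {
        "t1": [120.0, 130.0, 125.0, 140.0],
        "t2": [118.0, np.nan, np.nan, 135.0],
        "t3": [115.0, 124.0, 119.0, 131.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

complete_cases = df.dropna()           # complete-case deletion: drop subjects with any gap
mean_imputed = df.fillna(df.mean())    # replace each gap with the mean of its time point
locf = df.ffill(axis=1)                # carry the last observed value forward within a subject
interpolated = df.interpolate(axis=1)  # linear interpolation between neighbouring time points

print(complete_cases.index.tolist())   # ['s1', 's4']
print(mean_imputed.loc["s2", "t2"])    # 126.5
print(locf.loc["s2", "t2"])            # 130.0
print(interpolated.loc["s2", "t2"])    # 127.0
```

    The four results differ for the same missing cell, which is why the choice of method should be reported along
    with the analysis.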
   

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Ans:
    
    Post-hoc tests are used in analysis of variance (ANOVA) when the omnibus ANOVA test indicates a significant
    difference among groups. These tests help identify specific group differences when there are more than two groups.
    Here are some common post-hoc tests and situations where each might be appropriate:

    Tukey's Honestly Significant Difference (HSD):

    Use Case: When there are multiple groups, and you want to compare all possible pairs of means.
    Example Situation: In a study comparing the effectiveness of three different teaching methods, Tukey's HSD can be used
        to identify which pairs of teaching methods have significantly different mean scores.

    Bonferroni Correction:

    Use Case: When conducting multiple pairwise comparisons and aiming to control the familywise error rate.
    Example Situation: In a medical trial with multiple treatment groups, Bonferroni correction can be applied to adjust 
        the significance level for multiple comparisons, reducing the risk of Type I errors.
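
    Scheffé's Test:

    Use Case: When you want to test complex contrasts (for example, the average of two groups against a third) in
        addition to pairwise comparisons, while controlling the familywise error rate. It works with unequal group
        sizes but is more conservative than Tukey's HSD when only pairwise comparisons are needed.
    Example Situation: In a study comparing the performance of different software programs with unequal sample sizes,
        Scheffé's test can be used to compare two programs combined against a third, as well as individual pairs.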

    Dunnett's Test:

    Use Case: When there is a control group and you want to compare the treatment groups to the control group.
    Example Situation: In a clinical trial with a control group and multiple experimental treatments, Dunnett's test can
        be used to compare each treatment group to the control group while controlling for the overall Type I error rate.
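
    As an illustration of the simplest of these procedures, the sketch below applies a Bonferroni correction to all
    pairwise t-tests among three groups. The data are simulated and the group names are hypothetical; the
    adjustment is done by hand (multiplying each p-value by the number of comparisons and capping at 1).

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical test scores under three teaching methods
samples = {
    "method_1": rng.normal(70, 5, 25),
    "method_2": rng.normal(72, 5, 25),
    "method_3": rng.normal(80, 5, 25),
}

pairs = list(combinations(samples, 2))
adjusted = {}
for a, b in pairs:
    _, p = stats.ttest_ind(samples[a], samples[b])
    # Bonferroni: multiply by the number of comparisons, cap at 1
    adjusted[(a, b)] = min(p * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    print(pair, f"adjusted p = {p_adj:.4f}")
```

    A pair is declared different when its adjusted p-value is below the chosen significance level. Bonferroni is
    easy to apply but becomes conservative as the number of comparisons grows.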
   

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [3]:
import scipy.stats as stats
import numpy as np

# Example data (simulated; assumes 50 participants per diet, 150 in total)
np.random.seed(42)  # for reproducibility
weight_loss_A = np.random.normal(5, 2, 50)  # mean=5, std=2
weight_loss_B = np.random.normal(6, 2, 50)  # mean=6, std=2
weight_loss_C = np.random.normal(4, 2, 50)  # mean=4, std=2

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

# Print the results
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 16.5742
P-value: 0.0000
Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
np.random.seed(42)  # for reproducibility
data = {'Time': np.random.normal(loc=10, scale=2, size=90),
        'Program': np.repeat(['A', 'B', 'C'], 30),
        'Experience': np.tile(['Novice', 'Experienced'], 45)}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
formula = 'Time ~ Program + Experience + Program:Experience'
model = ols(formula, df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the results
print(anova_table)


                        sum_sq    df         F    PR(>F)
Program               2.514772   2.0  0.344485  0.709581
Experience            0.479063   1.0  0.131248  0.718051
Program:Experience    1.592393   2.0  0.218133  0.804472
Residual            306.603758  84.0       NaN       NaN

Interpretation:

    None of the effects is statistically significant at the 0.05 level: Program (F = 0.34, p = 0.71), Experience
    (F = 0.13, p = 0.72), and the Program:Experience interaction (F = 0.22, p = 0.80). This is expected, because the
    example times were simulated from a single normal distribution regardless of program or experience level.


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the results of the t-test
print("Two-Sample t-Test:")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Check if the results are significant (using a common alpha level of 0.05)
if p_value < 0.05:
    print("The difference in test scores between the two groups is statistically significant.")
    
    # Perform post-hoc test (Tukey's HSD)
    all_data = np.concatenate([control_group, experimental_group])
    group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)

    df_posthoc = pd.DataFrame({'Data': all_data, 'Group': group_labels})
    posthoc_result = pairwise_tukeyhsd(df_posthoc['Data'], df_posthoc['Group'])

    # Print post-hoc results
    print("\nPost-Hoc (Tukey's HSD) Test:")
    print(posthoc_result)
else:
    print("There is no significant difference in test scores between the two groups.")


Two-Sample t-Test:
T-statistic: -4.1087
P-value: 0.0001
The difference in test scores between the two groups is statistically significant.

Post-Hoc (Tukey's HSD) Test:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import numpy as np
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
np.random.seed(42)  # for reproducibility
sales_A = np.random.normal(loc=500, scale=50, size=30)
sales_B = np.random.normal(loc=550, scale=50, size=30)
sales_C = np.random.normal(loc=480, scale=50, size=30)

# Combine data into a single array
all_sales = np.concatenate([sales_A, sales_B, sales_C])

# Create group labels
group_labels = ['A'] * 30 + ['B'] * 30 + ['C'] * 30

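# Perform one-way ANOVA
# Note: f_oneway treats the three stores as independent groups. Because the same 30 days are
# observed for every store, a repeated measures analysis would also model the day as the subject.
f_statistic, p_value = stats.f_oneway(sales_A, sales_B, sales_C)

# Print the results of the ANOVA
print("One-Way ANOVA:")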
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Check if the results are significant (using a common alpha level of 0.05)
if p_value < 0.05:
    print("The difference in daily sales between the three stores is statistically significant.")
    
    # Perform post-hoc test (Tukey's HSD)
    posthoc_result = pairwise_tukeyhsd(all_sales, group_labels)

    # Print post-hoc results
    print("\nPost-Hoc (Tukey's HSD) Test:")
    print(posthoc_result)
else:
    print("There is no significant difference in daily sales between the three stores.")


One-Way ANOVA:
F-statistic: 15.6747
P-value: 0.0000
The difference in daily sales between the three stores is statistically significant.

Post-Hoc (Tukey's HSD) Test:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
     A      B  53.3492 0.0001  24.3572  82.3413   True
     A      C  -9.9484 0.6928 -38.9405  19.0437  False
     B      C -63.2976    0.0 -92.2897 -34.3056   True
------------------------------------------------------
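
The cell above runs a one-way ANOVA, which ignores that all three stores are observed on the same days. A repeated
measures version can be sketched with statsmodels' AnovaRM, treating each day as the subject and the store as the
within-subject factor. The data below are simulated separately from the cell above (the shared day effect, noise
levels, and column names are assumptions for illustration), so the numbers will differ from the output shown.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_days = 30

# Hypothetical long-format data: each day (the "subject") has one sales figure per store
day_effect = rng.normal(0, 30, n_days)  # variation shared by all stores on a given day
records = []
for store, mean in (("A", 500), ("B", 550), ("C", 480)):
    sales = mean + day_effect + rng.normal(0, 40, n_days)
    records.extend({"Day": d, "Store": store, "Sales": s} for d, s in enumerate(sales))
long_df = pd.DataFrame(records)

# Repeated measures ANOVA: Store is the within-subject factor, Day identifies the subject
rm_result = AnovaRM(long_df, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

AnovaRM requires balanced data (exactly one observation per day and store). If the store effect is significant,
paired comparisons between stores with a multiplicity correction are a natural follow-up.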
