In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine if they are significantly 
different from each other. To use ANOVA reliably, several assumptions must be met. Here are the key assumptions of ANOVA and examples of violations 
that could impact the validity of the results:

Independence of Observations:
Assumption: The observations are independent of each other, both within and across groups.
Violation Example: Repeated measures on the same subjects over time, or students clustered within the same classroom, produce correlated
observations. The F-test then tends to understate the error variance and inflate the Type I error rate.
Normality:
Assumption: The residuals (the differences between the observed values and the group means) are normally distributed.
Violation Example: Heavily skewed data or extreme outliers can lead to incorrect p-values and confidence intervals. The F-test is fairly robust to
moderate non-normality when groups are large, so this violation matters most when the sample size is small.
Homogeneity of Variance (Homoscedasticity):
Assumption: The variances of the populations from which the samples are drawn are equal.
Violation Example: If one group has a much larger variance than the others, the F-test can show an inflated Type I error rate or decreased power,
particularly when the group sizes are also unequal.
Homogeneity of Regression Slopes (ANCOVA only):
Assumption: This applies to ANCOVA rather than to ANOVA itself: the relationship between the covariate and the dependent variable is the same in
every group.
Violation Example: If the regression line relating the covariate to the outcome has a different slope in each group, the covariate adjustment is not
meaningful. In a two-way ANOVA, an effect of one factor that differs across levels of the other factor is not a violation; it is what the interaction
term tests.
Random Sampling:
Assumption: The samples are randomly selected from the population of interest.
Violation Example: If the samples are not randomly selected, such as with convenience sampling, the estimates can be biased and the results may not
generalize to the population.
Equal Group Sizes (not a formal assumption):
Note: ANOVA does not require equal group sizes, but a balanced design makes the F-test more robust.
Violation Example: Unequal group sizes reduce the power of the test and make it more sensitive to unequal variances.
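
The normality and equal-variance assumptions listed above can be checked before running the F-test. The sketch below is one possible way to do it, not a required procedure: it uses simulated groups (the means, spread, and group size are assumptions made up for illustration) and applies SciPy's Shapiro-Wilk test to the residuals and Levene's test to the groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 20 observations (illustration only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=20) for mu in (10, 12, 15)]

# Residuals are the deviations of each observation from its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: a small p-value suggests the group variances are not equal
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

Independence and random sampling cannot be tested this way; they depend on how the study was designed.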

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
One-Way ANOVA is used when there is one categorical independent variable (factor), for example comparing mean test scores across three teaching methods.
Two-Way ANOVA is used when there are two categorical independent variables, for example teaching method and class size. It tests each main effect
and the interaction between the two factors.
Three-Way ANOVA (more generally, N-way or factorial ANOVA) is used when there are three or more categorical independent variables.
Each type tests whether the mean of a dependent variable differs across the levels of the factor(s); designs with two or more factors can also test
whether the factors interact. The choice of ANOVA type depends on the specific research design and the number of factors
being studied.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
The partitioning of variance in ANOVA refers to the process of breaking down the total variance observed in the dependent variable into different 
components attributed to various sources or factors. Understanding this concept is crucial because it helps researchers identify and quantify the
sources of variability in the data, which in turn allows them to assess the significance of these sources and draw meaningful conclusions about the
relationships between variables.

In a one-way ANOVA, the total variability observed in the dependent variable is decomposed into two components, between-group and within-group,
which add up to the total:

Between-Group Variance:
This component of variance represents the variability in the dependent variable that can be attributed to differences between the group means.
It reflects the extent to which the means of the different groups are spread out from each other.
Significant between-group variance suggests that the independent variable(s) have an effect on the dependent variable.
Within-Group Variance (or Error Variance):
This component of variance represents the variability in the dependent variable that is not accounted for by differences between group means.
It reflects the variability within each group, including random variability and measurement error.
It serves as a baseline level of variability against which the between-group differences are compared.
Total Variance:
This is the overall variability observed in the dependent variable across all observations.
Expressed as sums of squares, it equals the between-group sum of squares plus the within-group sum of squares.
Understanding the partitioning of variance allows researchers to assess how much of the variability in the dependent variable each source explains.
The F-statistic compares the between-group mean square with the within-group mean square; a large ratio indicates that the differences observed
between groups are unlikely to be due to chance alone. This helps in evaluating the strength of the relationship between the independent and
dependent variables and in drawing valid conclusions about the effects of the independent variables. A large within-group component can also point
to measurement error or uncontrolled factors, which can inform improvements in experimental design and data collection procedures.


In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np

# Example data (replace with your actual data)
group1 = [10, 12, 14]
group2 = [15, 18, 20]
group3 = [8, 11, 13]

# Combine data from all groups
data = np.concatenate([group1, group2, group3])

# Calculate overall mean
overall_mean = np.mean(data)

# Calculate SST
SST = np.sum((data - overall_mean)**2)

# Calculate group means
group_means = [np.mean(group) for group in [group1, group2, group3]]

# Calculate SSE
SSE = np.sum([len(group) * (mean - overall_mean)**2 for group, mean in zip([group1, group2, group3], group_means)])

# Calculate SSR
SSR = np.sum([(x - mean)**2 for group, mean in zip([group1, group2, group3], group_means) for x in group])

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 116.22222222222221
Explained Sum of Squares (SSE): 82.88888888888893
Residual Sum of Squares (SSR): 33.333333333333336
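
As a cross-check on the numbers above, the following sketch verifies that the explained and residual sums of squares add up to the total, then builds the F-statistic from them and compares it with `scipy.stats.f_oneway`. The variable names are illustrative; the data are the same three example groups.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [np.array([10, 12, 14]), np.array([15, 18, 20]), np.array([8, 11, 13])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Sums of squares: total, between groups (explained), within groups (residual)
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom and the F ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

f_scipy, p_scipy = f_oneway(*groups)

print("SST equals SSE + SSR:", np.isclose(ss_total, ss_between + ss_within))
print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
```

Note that some textbooks use SSE for the error (residual) sum of squares and SSR for the regression (explained) one, the reverse of the naming in this question.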


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (replace with your actual data).
# Each A x B cell needs at least two observations. With a single observation
# per cell the interaction model is saturated, leaving zero residual degrees
# of freedom, so no F-test can be computed.
levels_A = ['A1', 'A2', 'A3']
levels_B = ['B1', 'B2', 'B3']
data = pd.DataFrame({
    'A': [a for a in levels_A for _ in levels_B for _ in range(2)],
    'B': [b for _ in levels_A for b in levels_B for _ in range(2)],
    'Y': [10, 11, 12, 13, 14, 15,
          15, 16, 18, 17, 20, 21,
          8, 9, 11, 10, 13, 14]
})

# Perform two-way ANOVA with both main effects and their interaction
model = ols('Y ~ C(A) + C(B) + C(A):C(B)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Each row's F and PR(>F) test that effect: C(A) and C(B) are the main effects,
# C(A):C(B) is the interaction. Ratios of sum_sq to the residual sum_sq are not
# effect estimates, so the table is reported directly.
print(anova_table)

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:

In a one-way ANOVA, the F-statistic is used to test whether the means of three or more groups are significantly different from each other. 
The p-value associated with the F-statistic indicates the probability of observing such an extreme F-value (or more extreme) under the null hypothesis 
of no difference between group means.

In this case, you obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how to interpret these results:

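Significance of the F-statistic:
The F-statistic is the ratio of the between-group mean square to the within-group mean square. A value of 5.23 means the variability between the
group means is about five times larger than would be expected from within-group variability alone.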
Significance of the p-value:
The p-value of 0.02 is less than the significance level (often chosen as 0.05).
Since the p-value is less than the significance level, we reject the null hypothesis.
Therefore, we conclude that there are statistically significant differences between the group means.
Interpretation:
With a p-value of 0.02, we have evidence to suggest that the differences observed between the group means are unlikely to be due to random chance 
alone.
This suggests that at least one of the groups has a mean that is significantly different from the others.
However, the ANOVA itself does not tell us which specific group(s) differ from each other; additional post-hoc tests (e.g., Tukey's HSD, Bonferroni,
etc.) may be conducted to determine pairwise differences between groups.
In summary, based on the obtained F-statistic and p-value, we conclude that there are statistically significant differences between the group means,
and further investigation may be warranted to determine the nature of these differences.
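
To show where such a p-value comes from, the sketch below evaluates the upper tail of the F distribution at 5.23. The question does not state the degrees of freedom, so the three groups and 15 total observations used here are assumptions chosen only for illustration.

```python
from scipy.stats import f

f_statistic = 5.23
k_groups = 3   # assumed number of groups (not given in the question)
n_total = 15   # assumed total sample size (not given in the question)

df_between = k_groups - 1
df_within = n_total - k_groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value for a 5% significance level
critical_value = f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_value:.3f}")
print("reject H0:", f_statistic > critical_value)
```

With different degrees of freedom the same F-statistic would give a different p-value, which is why both should be reported together.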


In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
Handling missing data in repeated measures ANOVA is important to ensure the validity and reliability of the analysis. There are several methods 
to handle missing data in repeated measures ANOVA, each with its own potential consequences:

Complete Case Analysis (Listwise Deletion):
This approach involves excluding any cases with missing data on any variable involved in the analysis.
Pros: Simple to implement.
Cons: Reduces sample size and statistical power, potentially leading to biased results if missingness is not completely random 
(i.e., if missing data are related to the outcome or predictors).
Pairwise Deletion:
This approach uses all available data for each pairwise comparison, excluding cases with missing data only for the specific comparison being made.
Pros: Retains more data compared to complete case analysis.
Cons: May lead to biased estimates if missingness is related to the outcome or predictors. Produces inconsistent estimates of covariance parameters,
potentially leading to inaccurate hypothesis testing.
Mean Imputation:
This approach replaces missing values with the mean of the observed values for the variable.
Pros: Keeps every case in the analysis. Simple to implement.
Cons: Shrinks the variance of the imputed variable, so standard errors are underestimated. Can distort correlations between time points and gives
biased estimates unless the data are missing completely at random.
Multiple Imputation:
This approach involves generating multiple plausible imputed datasets, each with missing values replaced by estimates based on the observed data and
random variation.
Pros: Provides unbiased estimates and valid standard errors if the imputation model is correctly specified. Accounts for uncertainty due to missing
data.
Cons: More complex to implement than other methods. Requires assumptions about the missing data mechanism and may be sensitive to model
misspecification.
Model-Based Methods:
This approach fits a model, such as a linear mixed-effects model, to all available observations using maximum likelihood or Bayesian estimation,
without imputing values or dropping subjects.
Pros: Provides valid estimates and standard errors when the data are missing at random and the model is correctly specified. Uses every observed
measurement.
Cons: More complex to specify and fit, and results depend on the assumed covariance structure. If the data are not missing at random, the missing
data mechanism itself must be modeled, which is difficult to verify.
The choice of method for handling missing data in repeated measures ANOVA should be based on considerations such as the amount and pattern of
missingness, the assumptions underlying the analysis, and the goals of the research study. It is important to conduct sensitivity analyses
to assess the robustness of the results to different missing data handling methods. Additionally, researchers should clearly document the methods
used for handling missing data and justify their choice based on the specific characteristics of the dataset and research question.
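
The effect of the two simplest methods can be seen on a small example. The sketch below uses a made-up wide-format table (one row per subject, one column per time point; all values are hypothetical) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 5 subjects, 3 time points
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [5.0, 6.0, 7.0, 8.0, 9.0],
    "t2": [6.0, np.nan, 8.0, 9.0, 10.0],
    "t3": [7.0, 8.0, np.nan, 10.0, 11.0],
}).set_index("subject")

# Listwise deletion: drop every subject with any missing time point
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the column mean of the observed values
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept by listwise deletion:", len(complete_cases), "of", len(wide))
print("Variance of t2 (observed values):  ", wide["t2"].var())
print("Variance of t2 (after imputation): ", mean_imputed["t2"].var())
```

Listwise deletion loses two of the five subjects here, while mean imputation keeps all five but makes `t2` look less variable than the observed data suggest.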



In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:

Post-hoc tests are used in analysis of variance (ANOVA) to determine which specific group means differ from each other after finding a significant 
omnibus F-test result. Some common post-hoc tests include:

Tukey's Honestly Significant Difference (HSD):
Tukey's HSD test compares all possible pairs of group means and provides confidence intervals for the difference between each pair.
It is used when all pairwise comparisons are of interest. It controls the familywise error rate, and the Tukey-Kramer variant handles unequal
sample sizes.
Bonferroni Correction:
The Bonferroni correction adjusts the significance level for multiple comparisons by dividing the desired significance level (e.g., 0.05) by the
number of comparisons.
It is best suited to a small number of planned comparisons, because it becomes very conservative as the number of comparisons grows.
Duncan's New Multiple Range Test:
Duncan's test compares all possible pairs of group means and arranges them into homogeneous subsets based on significance levels.
It is more powerful than Tukey's HSD but does not control the familywise error rate, so it carries a higher risk of false positives.
Scheffé's Test:
Scheffé's test provides simultaneous confidence intervals for all possible contrasts among the group means, not only pairwise differences.
It is the most conservative option for pairwise comparisons and is most useful when complex contrasts are tested. It remains valid with unequal
sample sizes.
Fisher's Least Significant Difference (LSD):
Fisher's LSD test compares all possible pairs of group means with t-tests that use the pooled error variance from the ANOVA.
It makes no adjustment for multiple comparisons, so it is the least conservative option and is best limited to designs with three groups and a
significant overall F-test.
Example Situation:
Suppose a researcher conducts a study to compare the effectiveness of four different treatments (T1, T2, T3, T4) on pain relief. 
After performing a one-way ANOVA, the researcher obtains a significant F-statistic indicating that there are differences between
the treatment groups.

In this situation, the researcher would use a post-hoc test to determine which specific treatment groups differ from each other. For example, 
they might use Tukey's HSD test to compare all possible pairs of treatment means and identify significant differences between treatments. 
This would help the researcher to understand which treatments are more effective than others and make appropriate recommendations for 
clinical practice.
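
A sketch of how the follow-up for this four-treatment example might look in Python is shown below. The pain-relief scores are simulated (the group means, spread, and 15 patients per treatment are assumptions for illustration only). It runs Tukey's HSD with statsmodels and, for comparison, Bonferroni-adjusted pairwise t-tests.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical pain-relief scores for four treatments (illustration only)
rng = np.random.default_rng(42)
treatments = ["T1", "T2", "T3", "T4"]
assumed_means = [5.0, 5.5, 7.0, 7.5]
n_per_group = 15
samples = {name: rng.normal(mean, 1.0, n_per_group)
           for name, mean in zip(treatments, assumed_means)}

# Tukey's HSD: all pairwise comparisons with familywise error control
scores = np.concatenate([samples[name] for name in treatments])
labels = np.repeat(treatments, n_per_group)
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())

# Bonferroni: pairwise t-tests with p-values multiplied by the number of tests
pairs = list(combinations(treatments, 2))
raw_p = [ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p_adj, rej in zip(pairs, adjusted_p, reject):
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f}, reject = {rej}")
```

Four groups give six pairwise comparisons, which is why an unadjusted series of t-tests would inflate the overall Type I error rate.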

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Example data (replace with your actual data).
# Placeholder values only: the groups hold 49, 50, and 50 values, which does not
# match the 50 participants described in the question.
diet_A = list(range(2, 51))   # 2, 3, ..., 50
diet_B = list(range(3, 53))   # 3, 4, ..., 52
diet_C = list(range(4, 54))   # 4, 5, ..., 53

# Perform one-way ANOVA
F_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print results
print("F-statistic:", F_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("The p-value is less than 0.05, so we reject the null hypothesis.")
    print("There is significant evidence to suggest that there are differences between the mean weight loss of the three diets.")
else:
    print("The p-value is greater than or equal to 0.05, so we fail to reject the null hypothesis.")
    print("There is no significant evidence to suggest that there are differences between the mean weight loss of the three diets.")


F-statistic: 0.3731488837145597
p-value: 0.6892174730227125
The p-value is greater than or equal to 0.05, so we fail to reject the null hypothesis.
There is no significant evidence to suggest that there are differences between the mean weight loss of the three diets.


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

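# Example data (replace with your actual data).
# Note: 90 simulated observations are used here rather than the 30 employees in
# the question, and the completion times are generated independently of program
# and experience, so the simulated data contain no true effect.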
np.random.seed(0)

# Generate data for employee experience level (novice vs. experienced)
experience = np.random.choice(['novice', 'experienced'], size=90)

# Generate data for software programs (Program A, Program B, Program C)
programs = np.random.choice(['A', 'B', 'C'], size=90)

# Generate random task completion time data
task_completion_time = np.random.normal(loc=10, scale=2, size=90)

# Create DataFrame
data = pd.DataFrame({'Experience': experience, 'Program': programs, 'Time': task_completion_time})

# Perform two-way ANOVA
model = ols('Time ~ C(Experience) + C(Program) + C(Experience):C(Program)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print results
print(anova_table)


                              sum_sq    df         F    PR(>F)
C(Experience)               4.397995   1.0  0.976664  0.325862
C(Program)                 29.446179   2.0  3.269560  0.042915
C(Experience):C(Program)    7.962674   2.0  0.884137  0.416879
Residual                  378.258657  84.0       NaN       NaN
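
The question also asks for an interpretation. The sketch below applies a 0.05 significance level to each row, using the F and p-values copied from the table printed above (the 0.05 threshold is an assumption; the question does not specify one).

```python
import pandas as pd

# F-statistics and p-values copied from the ANOVA table printed above
anova_table = pd.DataFrame(
    {"F": [0.976664, 3.269560, 0.884137],
     "PR(>F)": [0.325862, 0.042915, 0.416879]},
    index=["C(Experience)", "C(Program)", "C(Experience):C(Program)"],
)

alpha = 0.05  # assumed significance level
significant = anova_table["PR(>F)"] < alpha

for effect, row in anova_table.iterrows():
    verdict = "significant" if row["PR(>F)"] < alpha else "not significant"
    print(f"{effect}: F = {row['F']:.2f}, p = {row['PR(>F)']:.3f} ({verdict})")
```

Only the main effect of Program falls below 0.05; the main effect of Experience and the interaction do not. Because the completion times in this example were drawn at random without reference to program or experience, the Program result is a chance finding, which illustrates that a p-value just under 0.05 can still be a false positive.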


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [5]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data (replace with your actual data)
np.random.seed(0)

# Generate test scores for control group (traditional teaching method)
control_group_scores = np.random.normal(loc=70, scale=10, size=100)

# Generate test scores for experimental group (new teaching method)
experimental_group_scores = np.random.normal(loc=75, scale=10, size=100)

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group_scores, experimental_group_scores)

# Print results
print("Two-Sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check if results are significant
if p_value < 0.05:
    print("The difference in test scores between the two groups is statistically significant.")
    print("Performing post-hoc test...")
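    # Perform post-hoc test (e.g., Tukey's HSD).
    # With only two groups there is a single pairwise comparison, so this step is
    # redundant with the t-test above and reports the same p-value.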
    all_scores = np.concatenate([control_group_scores, experimental_group_scores])
    groups = ['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_scores)
    tukey_results = pairwise_tukeyhsd(all_scores, groups)
    print(tukey_results)
else:
    print("There is no significant difference in test scores between the two groups.")


Two-Sample t-test:
t-statistic: -3.597192759749614
p-value: 0.0004062796020362504
The difference in test scores between the two groups is statistically significant.
Performing post-hoc test...
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental    5.222 0.0004 2.3593 8.0848   True
---------------------------------------------------------
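
The reason a post-hoc test adds nothing with two groups can be checked directly: for two groups, a one-way ANOVA is equivalent to the pooled-variance t-test, with F equal to t squared and identical p-values. The sketch below uses simulated scores (50 students per group, with assumed means and spread) to show this.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Hypothetical test scores: 50 students per group (illustration only)
rng = np.random.default_rng(0)
control = rng.normal(70, 10, 50)
experimental = rng.normal(75, 10, 50)

# Pooled-variance two-sample t-test (the default of ttest_ind)
t_stat, t_p = ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, f_p = f_oneway(control, experimental)

print("t squared:", t_stat ** 2)
print("F:        ", f_stat)
print("p-values equal:", np.isclose(t_p, f_p))
```

Post-hoc tests become informative only with three or more groups, where a significant overall result leaves open which pairs differ.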


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data (replace with your actual data)
np.random.seed(0)

# Generate sales data for Store A, Store B, and Store C
sales_store_A = np.random.normal(loc=1000, scale=100, size=30)
sales_store_B = np.random.normal(loc=1100, scale=100, size=30)
sales_store_C = np.random.normal(loc=1200, scale=100, size=30)

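# Perform one-way ANOVA.
# Note: f_oneway treats the three stores as independent samples and ignores that the
# same 30 days are measured for every store. A repeated measures ANOVA (for example,
# statsmodels' AnovaRM with day as the subject) would account for that pairing.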
F_statistic, p_value = f_oneway(sales_store_A, sales_store_B, sales_store_C)

# Print results
print("One-way ANOVA:")
print("F-statistic:", F_statistic)
print("p-value:", p_value)

# Check if results are significant
if p_value < 0.05:
    print("The differences in average daily sales between the three stores are statistically significant.")
    print("Performing post-hoc test...")
    # Perform post-hoc test (e.g., Tukey's HSD)
    all_sales = np.concatenate([sales_store_A, sales_store_B, sales_store_C])
    groups = ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30
    tukey_results = pairwise_tukeyhsd(all_sales, groups)
    print(tukey_results)
else:
    print("There is no significant difference in average daily sales between the three stores.")


One-way ANOVA:
F-statistic: 17.295761534833975
p-value: 4.740170938397587e-07
The differences in average daily sales between the three stores are statistically significant.
Performing post-hoc test...
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1  group2 meandiff p-adj   lower    upper   reject
--------------------------------------------------------
Store A Store B  26.7622 0.5536 -34.5773  88.1016  False
Store A Store C 142.3423    0.0  81.0028 203.6817   True
Store B Store C 115.5801 0.0001  54.2406 176.9195   True
--------------------------------------------------------
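
Since the same 30 days are observed for every store, a repeated measures ANOVA treats each day as the subject. The sketch below is one way to do that with statsmodels' `AnovaRM`; the sales figures are simulated, and the store means, the day-to-day effect, and the noise level are all assumptions for illustration, so its output will not match the one-way results above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical sales: each day has its own level shared by all three stores
rng = np.random.default_rng(0)
n_days = 30
store_means = {"Store A": 1000, "Store B": 1100, "Store C": 1200}
day_effect = rng.normal(0, 50, n_days)

# Long format: one row per (day, store) combination
rows = []
for day in range(n_days):
    for store, mean in store_means.items():
        sales = mean + day_effect[day] + rng.normal(0, 100)
        rows.append({"day": day, "store": store, "sales": sales})
long_df = pd.DataFrame(rows)

# Repeated measures ANOVA: day is the subject, store is the within-subject factor
rm_result = AnovaRM(long_df, depvar="sales", subject="day", within=["store"]).fit()
print(rm_result.anova_table)
```

`AnovaRM` requires balanced data, meaning exactly one observation per day and store. For a follow-up, paired comparisons between stores with a multiple-comparison correction are a common choice, because Tukey's HSD as used above assumes independent groups.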
