#Q1
Analysis of Variance (ANOVA) is a statistical technique used to compare means across two or more groups. It assesses whether there are statistically significant differences between the means of these groups while accounting for variability within and between the groups. ANOVA is based on several assumptions, and violating these assumptions can impact the validity of the results. The key assumptions for ANOVA are:

Independence: The observations within each group should be independent of each other. This means that the values of one observation should not be influenced by the values of other observations within the same group.

Normality: The distribution of the residuals (the differences between observed values and their group means) should be approximately normal within each group. The assumption concerns the underlying population distribution, so modest departures from normality in a sample do not by themselves indicate a violation.

Homogeneity of Variance (Homoscedasticity): The variance of the residuals should be approximately equal across all groups. This means that the spread of data points around the mean should be consistent across groups.

Homogeneity of Regression Slopes (ANCOVA only): This assumption is often listed with the ANOVA assumptions, but it belongs to analysis of covariance (ANCOVA), where a continuous covariate is added to the model. It requires that the relationship between the covariate and the dependent variable is the same at every level of the factor. A standard two-way ANOVA does not make this assumption; it estimates the interaction between its two factors directly.

Examples of violations that could impact the validity of ANOVA results:

Non-Independence: Violation of the independence assumption can occur where there is a natural structure or order among observations, such as time series data or repeated measures on the same subjects. Ignoring this can understate the standard errors and inflate the Type I error rate.

Non-Normality: If the residuals do not follow a normal distribution within each group, the p-values and confidence intervals generated by ANOVA may be inaccurate. This can happen when the data are skewed or contain extreme outliers, and it matters most when samples are small.

Heteroscedasticity: Unequal variances among groups can distort the significance tests and confidence intervals, especially when group sizes are also unequal. If the assumption is violated, it can lead to incorrect conclusions about the differences between group means.

Unequal Regression Slopes (ANCOVA): If the relationship between the covariate and the dependent variable differs across the levels of the factor, the homogeneity of regression slopes assumption is violated and covariate-adjusted group means become hard to interpret. An interaction between the two factors of a two-way ANOVA, by contrast, is not an assumption violation: the model includes an interaction term, and a significant interaction means the effect of one factor depends on the level of the other, so the main effects should be interpreted with care.

When these assumptions are violated, the ANOVA results may be misleading or incorrect. In such cases, it might be necessary to explore alternative statistical techniques or consider transformations on the data to mitigate the violations. Additionally, non-parametric tests (which make fewer assumptions) can be used when the assumptions of ANOVA are seriously violated.
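
As a rough illustration of how the normality and equal-variance assumptions are often screened, the sketch below runs a Shapiro-Wilk test on each group's residuals and Levene's test across groups using SciPy. The three groups are simulated placeholder data, not results from any study, and the choice of these two tests is an assumption rather than the only option. Independence cannot be checked this way; it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups standing in for real measurements
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=5.0, scale=1.0, size=40),
    "B": rng.normal(loc=4.0, scale=1.0, size=40),
    "C": rng.normal(loc=3.0, scale=1.0, size=40),
}

# Normality: Shapiro-Wilk on the residuals (each value minus its group mean)
normality_p = {}
for name, values in groups.items():
    residuals = values - values.mean()
    normality_p[name] = stats.shapiro(residuals).pvalue

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value from either test suggests the corresponding assumption may not hold, at which point a transformation or a non-parametric test is worth considering.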

#Q2
There are three main types of Analysis of Variance (ANOVA): one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. Each type is used in specific situations to analyze the variance between groups and determine if there are significant differences among group means. Here's an overview of each type and when they would be used:

One-Way ANOVA:

Number of Factors: One categorical independent variable (factor) with two or more levels or groups. With exactly two groups it gives the same result as an independent-samples t-test, so it is most often used with three or more.
Use Case: When you want to compare means across multiple independent groups. For example, you might use one-way ANOVA to analyze the effects of different teaching methods (e.g., lecture, discussion, hands-on) on student test scores.

Two-Way ANOVA:

Number of Factors: Two categorical independent variables (factors), often referred to as "factor A" and "factor B," and their interaction.
Use Case: When you want to investigate the separate and combined effects of two independent variables on the dependent variable. For example, a two-way ANOVA could be used to examine how both gender and treatment type impact patient recovery time.

Repeated Measures ANOVA:

Number of Factors: One within-subjects factor with two or more related measures or time points.
Use Case: When you have collected measurements from the same subjects or items at multiple time points or under multiple conditions. Repeated measures ANOVA is commonly used in longitudinal studies or when analyzing data with a within-subjects design. An example is studying the effects of a training program on individuals' performance measured before, during, and after the training.

#Q3
In analysis of variance (ANOVA), the partitioning of variance refers to the process of decomposing the total variance observed in a dataset into different components associated with different sources of variation. ANOVA is a statistical technique used to compare means among different groups and determine if there are statistically significant differences between those groups. The partitioning of variance is crucial in ANOVA as it helps us understand the relative contributions of various factors to the overall variability observed in the data.

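In a typical one-way ANOVA scenario, the total variability in the data is divided into two components: variability between groups and variability within groups. The decomposition is exact for sums of squares:

Total Sum of Squares (SST) = Between-Group Sum of Squares (SSB) + Within-Group Sum of Squares (SSW)

Dividing SSB and SSW by their degrees of freedom gives the between-group and within-group variance estimates (mean squares), and the ratio of those two mean squares is the F-statistic.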

Variance Between Groups: This component represents the variability of the group means with respect to the overall mean. It indicates how much the group means differ from each other. If this component is large relative to the within-group variance, it suggests that there are significant differences between the group means.

Variance Within Groups: This component represents the variability of the individual data points within each group around their respective group means. It captures the random variability within each group that is not attributed to the differences between the group means.

By comparing the sizes of these two components, ANOVA helps us determine whether the observed differences between group means are statistically significant or if they could have occurred due to random chance. If the variance between groups is significantly larger than the variance within groups, it suggests that there are real differences between the groups' means.
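
To make the decomposition concrete, here is a minimal sketch that computes the between-group, within-group, and total sums of squares by hand in plain Python. The three small groups are made-up toy numbers chosen so the arithmetic is easy to follow.

```python
# Hypothetical toy data: three groups of three observations each
groups = {
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [8, 9, 10],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: group size times the squared distance
# of each group mean from the grand mean
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group sum of squares: squared distance of each value from its own group mean
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

# Total sum of squares: squared distance of each value from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# F-statistic: ratio of the between-group and within-group mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(ss_between, ss_within, ss_total)  # 24.0 6.0 30.0
print(f_statistic)                      # 12.0
```

Here 24 + 6 = 30, so the two components add up to the total, and 24 / 30 = 0.8 of the variability lies between groups.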

Understanding the partitioning of variance is important for several reasons:

Interpretation of Group Differences: ANOVA allows us to assess whether the differences between groups are likely due to the effect of the independent variable or if they could be attributed to random variability.

Hypothesis Testing: ANOVA provides a statistical framework for hypothesis testing regarding the equality of means across groups. It helps researchers determine if these differences are statistically significant.

Effect Size Estimation: By quantifying the proportion of variance explained by the between-group differences (eta squared, the between-group sum of squares divided by the total sum of squares), ANOVA provides a way to estimate the effect size of the independent variable.

Experimental Design: Understanding how variance is partitioned can inform the design of future experiments. For instance, if most of the variability is within groups, it might suggest that the experimental conditions need to be refined to achieve clearer between-group differences.

Generalization: A better understanding of variance components can lead to more robust and accurate generalizations about the population from which the sample was drawn.

In [1]:
#Q4
import pandas as pd
from statsmodels.formula.api import ols
import seaborn as sns
from statsmodels.stats.anova import anova_lm

df_iris = sns.load_dataset('iris')
print('Top 5 rows of IRIS dataset : ')
print(df_iris.head())

model = ols('sepal_length ~ species', data=df_iris).fit()

print('Values for Sepal Length vs Species:')
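# Naming note: in statsmodels, `ess` is the explained (between-group) sum of squares
# and `ssr` is the sum of squared residuals (within-group). The labels printed below
# therefore mean "SSE" = explained and "SSR" = residual.
SSE = model.ess  # explained (between-group) sum of squares
SSR = model.ssr  # residual (within-group) sum of squares
SST = SSE + SSR  # total sum of squares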

print('SSE:', round(SSE,4))
print('SSR:', round(SSR,4))
print('SST:', round(SST,4))

print(anova_lm(model))

Top 5 rows of IRIS dataset : 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Values for Sepal Length vs Species:
SSE: 63.2121
SSR: 38.9562
SST: 102.1683
             df     sum_sq    mean_sq           F        PR(>F)
species     2.0  63.212133  31.606067  119.264502  1.669669e-31
Residual  147.0  38.956200   0.265008         NaN           NaN


In [2]:
#Q5
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = sm.datasets.get_rdataset("ToothGrowth", "datasets").data

print('Top 5 rows of Tooth Growth Dataset')
print(data.head())

model_formula = "len ~ C(supp) + C(dose) + C(supp):C(dose)"

model = ols(model_formula, data).fit()

# Type II ANOVA table, computed once and reused below
anova_table = sm.stats.anova_lm(model, typ=2)

# Rows 0-1 are the two main effects, row 2 is the interaction
main_effects = anova_table['sum_sq'].iloc[:2]
interaction_effect = anova_table['sum_sq'].iloc[2:3]

print("Main effects:")
print(main_effects)
print("Interaction effect:")
print(interaction_effect)
print("ANOVA Table:")
print(anova_table)

Top 5 rows of Tooth Growth Dataset
    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5
Main effects:
C(supp)     205.350000
C(dose)    2426.434333
Name: sum_sq, dtype: float64
Interaction effect:
C(supp):C(dose)    108.319
Name: sum_sq, dtype: float64
ANOVA Table:
                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN


#Q6

In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of the groups are equal against the alternative hypothesis that at least one group mean is different from the others. The p-value associated with the F-statistic indicates the probability of observing the data, or more extreme data, under the assumption that the null hypothesis is true. A small p-value suggests that the observed differences between group means are statistically significant, and you would reject the null hypothesis.

Given an F-statistic of 5.23 and a p-value of 0.02, the results can be interpreted as follows:

F-Statistic (5.23): The F-statistic is the ratio of the variability between groups to the variability within groups. Here the between-group variability is about 5.23 times the within-group variability. Whether that ratio is large enough to count as evidence depends on the degrees of freedom, which is what the p-value takes into account.

P-Value (0.02): The p-value is the probability of obtaining an F-statistic at least as large as the one observed, assuming that the null hypothesis (equal group means) is true. A p-value of 0.02 means that if the null hypothesis were true, you would expect to see data this extreme only about 2% of the time. This is a relatively low probability, suggesting that the observed differences between the groups' means are unlikely to have occurred by random chance alone.

Conclusion: At the conventional 0.05 significance level, a p-value of 0.02 is small enough to reject the null hypothesis, although it would not be at a stricter 0.01 level. This indicates that at least one group mean differs from the others. The test does not say which groups differ; that requires a post-hoc test.
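
The link between an F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical, and the resulting p-value is not expected to reproduce the 0.02 quoted above.

```python
from scipy import stats

f_statistic = 5.23  # value quoted in the text
df_between = 2      # assumption: 3 groups, so 3 - 1
df_within = 27      # assumption: 30 observations in total, so 30 - 3

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)
print(f"p-value: {p_value:.4f}")
```

Changing the assumed degrees of freedom changes the p-value, which is why an F-statistic cannot be judged in isolation.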

#Q7
Handling missing data in a repeated measures ANOVA is crucial to ensure the validity and reliability of your results. Missing data can occur for various reasons, such as participant dropout, equipment malfunction, or incomplete responses. There are several methods to handle missing data, each with its own advantages and potential consequences. Here are some common methods and their potential consequences:

Listwise Deletion (Complete Case Analysis): This method removes every case that has any missing value from the analysis. It is straightforward, but it discards all of a participant's data even when only one measurement is missing.

Consequences:

Reduced sample size, which may decrease the power to detect effects.
Potential bias if the missing data are related to the outcome or predictors, leading to non-representative results.
Ignoring potentially valuable information if missingness is related to meaningful patterns.

Pairwise Deletion (Available Case Analysis): This method uses available data for each specific analysis, so cases with missing data for specific variables are excluded only from the analyses involving those variables.

Consequences:

Inconsistency in sample size across analyses, which can complicate interpretation.
Biased estimates and unreliable significance tests if the data are not missing completely at random.

Mean Imputation: Missing values are replaced with the mean value of the non-missing data for that variable.

Consequences:

Underestimation of standard errors, which can affect hypothesis tests and confidence intervals.
Reduction of variance in the imputed variable, potentially affecting the ability to detect real effects.

Last Observation Carried Forward (LOCF): Missing data are replaced with the last observed value for that participant. This method assumes that the participant's status remains unchanged from the last observed time point until the next observation.

Consequences:

Can lead to biased estimates if participants' statuses change over time.
May not accurately represent the true trajectory of the variable.

Multiple Imputation: This more sophisticated approach involves creating multiple imputed datasets based on the observed data and their relationships. The analyses are then performed on each imputed dataset, and results are combined to obtain estimates and standard errors.

Consequences:

Can provide more accurate estimates and valid statistical inference if assumptions about the missing data mechanism are met.
Requires careful consideration of the imputation model and potential bias introduced if assumptions are violated.
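
The three simplest strategies above can be sketched with pandas on a small wide-format table (one row per subject, one column per time point). The subject labels and values are invented for illustration, and multiple imputation is left out because it needs a dedicated imputation model.

```python
import pandas as pd

nan = float("nan")

# Hypothetical repeated-measures data: 4 subjects measured at 3 time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 9.0, 11.0],
        "t2": [11.0, nan, 10.0, 12.0],
        "t3": [12.0, 14.0, nan, 13.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: keep only subjects with a value at every time point
listwise = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point's observed values
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward along the time axis
locf = scores.ffill(axis=1)

print("Subjects kept after listwise deletion:", list(listwise.index))
print(mean_imputed)
print(locf)
```

In this toy table listwise deletion keeps only two of the four subjects, which shows how quickly the sample can shrink.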


#Q8
Post-hoc tests are used after conducting an ANOVA to determine which specific group differences are statistically significant when a significant main effect or interaction is detected. ANOVA can tell you that there are differences between groups, but it doesn't specify which groups are different from each other. Post-hoc tests help you identify these differences. Here are some common post-hoc tests and when to use each one:

Tukey's Honestly Significant Difference (HSD):

Use when you have conducted a one-way ANOVA and you have three or more groups.
It controls the familywise error rate, providing a good balance between Type I error control and statistical power.
Appropriate for situations where you want to test all possible pairwise group comparisons.
Example: You conducted an experiment to compare the effectiveness of three different teaching methods on student exam scores. The ANOVA showed a significant difference among the teaching methods. To identify which pairs of teaching methods are significantly different from each other, you can use Tukey's HSD.

Bonferroni Correction:

Use when conducting multiple pairwise comparisons after an ANOVA.
It's a conservative approach that controls the familywise error rate by adjusting the significance level for each comparison.
Appropriate when you want to maintain a strict control over the overall Type I error rate.
Example: In the same teaching methods experiment, you are comparing all possible pairs of teaching methods. Since you're conducting several comparisons, you might choose to apply the Bonferroni correction to adjust the p-values to a more stringent level.

Sidak Correction:

Similar to the Bonferroni correction, it's used to control the familywise error rate for multiple comparisons.
The Sidak correction is slightly less conservative than Bonferroni, but it assumes the comparisons are independent, and the practical difference between the two is usually small.
Example: If you have a large number of pairwise comparisons to make, such as in a genetic study with multiple variables, you might consider using the Sidak correction to control for the increased Type I error risk.

Dunn's Test:

Use when you have conducted a non-parametric alternative to one-way ANOVA (e.g., the Kruskal-Wallis test) and need to perform post-hoc pairwise comparisons.
It is a rank-based pairwise procedure, usually combined with a p-value adjustment such as Bonferroni or Holm, and plays the role that Tukey's HSD plays after a parametric ANOVA.
Example: You conducted a Kruskal-Wallis test to compare multiple groups and found a significant result. Since this is a non-parametric test, you can use Dunn's test for pairwise comparisons to identify which groups differ from each other.

Holm's Method:

A stepwise procedure that adjusts p-values for multiple comparisons in a less conservative way compared to Bonferroni.
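It sorts the p-values from smallest to largest and compares the smallest against alpha divided by the number of comparisons, the next against alpha divided by one fewer, and so on. Testing stops at the first comparison that is not significant, and all remaining comparisons are declared non-significant.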
Example: You conducted a two-way ANOVA and want to perform pairwise comparisons for both main effects and interaction effects. Holm's method can be useful to control the familywise error rate while exploring multiple comparisons.

Post-hoc tests are important to avoid making false conclusions about group differences. However, it's important to choose a post-hoc test that is appropriate for your specific data and research questions, and to interpret the results with consideration of the chosen method's assumptions and adjustments.
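
The p-value adjustments behind three of these methods are simple enough to write out directly. The sketch below implements Bonferroni, Sidak, and Holm adjustments in plain Python; the three raw p-values are made up for illustration, and the function names are hypothetical helpers rather than a library API.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def sidak(p_values):
    """Sidak adjustment: 1 - (1 - p)^m, which assumes independent comparisons."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted p-values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons
raw = [0.01, 0.04, 0.03]
print("Bonferroni:", bonferroni(raw))
print("Sidak:     ", sidak(raw))
print("Holm:      ", holm(raw))
```

For these inputs Holm's adjusted values are never larger than Bonferroni's, which is why it is described as less conservative.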

In [3]:
#Q9
import numpy as np
from scipy.stats import f_oneway

np.random.seed(1)
diet_A = np.random.normal(5, 1, 50)
diet_B = np.random.normal(4, 1, 50)
diet_C = np.random.normal(3, 1, 50)

f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

alpha = 0.05

null_hypothesis = "The mean weight loss is the same for all three diets."
alternate_hypothesis = "The mean weight loss is different for at least one diet."

print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("We reject the null hypothesis.")
    print(f"Conclusion : {alternate_hypothesis}")
else:
    print("We fail to reject the null hypothesis.")
    print(f"Conclusion : {null_hypothesis}")

F-statistic: 57.06379442059458
p-value: 4.5619061215783055e-19
We reject the null hypothesis.
Conclusion : The mean weight loss is different for at least one diet.


In [4]:
#Q10
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(123)

time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)

data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})

print('Simulated Data example :')
print(data.head())


model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)

alpha = 0.05

print(table)
print('\n')
# Look up each p-value by its row label rather than by position
p_values = table['PR(>F)']

if p_values.loc['C(Software)'] < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")

if p_values.loc['C(Experience)'] < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")

if p_values.loc['C(Software):C(Experience)'] < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")

Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799
                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


Here are the interpretations of the three conclusions:
"There is a significant main effect of software": Taken at face value, this means completion time differs across the software programs. In this simulation, however, times were generated from experience level only, and the data layout assigns Software A exclusively to novices and Software C exclusively to experienced employees. Because software enters the sequential (Type I) table first, it absorbs part of the experience difference, so this effect should not be read as evidence that the software itself matters.

"There is a significant main effect of experience": This means that completion time differs between experience levels after accounting for software. The simulation drew novice times around a mean of 15 and experienced times around a mean of 10, so experienced employees are the faster group. This kind of finding can help a company identify the best employees for a given task and provide appropriate training for new employees.

"There is NO significant interaction effect between software and experience": The data give no evidence that the effect of software depends on the experience level of the employees, or vice versa. A non-significant test is not proof that no interaction exists, and in this layout two of the six software-by-experience combinations have no observations, so the interaction cannot be fully assessed.
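
One check worth running before interpreting a two-way table is a count of observations per cell. The sketch below rebuilds only the factor labels from the Q10 simulation (no outcome values) and cross-tabulates them; the variable names `layout` and `cell_counts` are introduced here for illustration.

```python
import pandas as pd

# Same factor layout as the simulated data in Q10 (labels only)
layout = pd.DataFrame({
    "Software": ["A"] * 20 + ["B"] * 20 + ["C"] * 20,
    "Experience": ["Novice"] * 30 + ["Experienced"] * 30,
})

# Number of observations in each Software x Experience combination
cell_counts = pd.crosstab(layout["Software"], layout["Experience"])
print(cell_counts)

# Flag combinations that have no observations at all
empty_cells = [
    (software, experience)
    for software in cell_counts.index
    for experience in cell_counts.columns
    if cell_counts.loc[software, experience] == 0
]
print("Empty cells:", empty_cells)
```

A fully crossed design would show a non-zero count in every cell. One way to get that in a simulation is to generate the same number of observations for each software and experience combination.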

In [5]:
#Q11
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

np.random.seed(45)

test_score_control = np.random.normal(loc=70, scale=3, size=50)
test_score_experimental = np.random.normal(loc=85, scale=3, size=50)

df = pd.DataFrame({'test_score':list(test_score_control)+list(test_score_experimental),
                   'group':['control']*50 + ['experimental']*50})

print('Simulated data for test_scores:')
print(df.head())

null_hypothesis = "There is NO difference in test scores between the control and experimental groups."
alt_hypothesis = "There is SIGNIFICANT difference in test scores between the control and experimental groups."

control_scores = df[df['group'] == 'control']['test_score']
experimental_scores = df[df['group'] == 'experimental']['test_score']
t_stat, p_val = ttest_ind(control_scores, experimental_scores, equal_var=True)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_val}")
print('\n')

alpha = 0.05
if p_val<alpha:
    print('Reject the Null Hypothesis')
    print(f'Conclusion : {alt_hypothesis}')
else:
    print('Failed to reject the Null Hypothesis')
    print(f'Conclusion : {null_hypothesis}')

Simulated data for test_scores:
   test_score    group
0   70.079124  control
1   70.780965  control
2   68.814563  control
3   69.387097  control
4   66.185102  control
t-statistic: -28.5074, p-value: 3.096206271894725e-49


Reject the Null Hypothesis
Conclusion : There is SIGNIFICANT difference in test scores between the control and experimental groups.
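
Since this document is about ANOVA, it may help to see that a pooled-variance two-sample t-test and a one-way ANOVA on the same two groups are the same test: the F-statistic equals the squared t-statistic and the p-values agree. The sketch below checks this with SciPy on two small made-up samples, not the simulated scores above.

```python
import math
from scipy.stats import f_oneway, ttest_ind

# Hypothetical test scores for two small groups
control = [70, 72, 68, 71, 69]
experimental = [85, 83, 86, 84, 87]

# Pooled-variance t-test and one-way ANOVA on the same two groups
t_stat, t_p = ttest_ind(control, experimental, equal_var=True)
f_stat, f_p = f_oneway(control, experimental)

print(f"t squared:   {t_stat ** 2:.4f}")
print(f"F-statistic: {f_stat:.4f}")
print("Statistics match:", math.isclose(f_stat, t_stat ** 2, rel_tol=1e-9))
```

The equivalence holds only for the equal-variance form of the t-test, which is the `equal_var=True` setting used in the cell above.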


In [None]:
#Q12
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# set random seed for reproducibility
np.random.seed(456)

# generate sales data for Store A, B, and C
sales_a = np.random.normal(loc=1000, scale=100, size=(30,))
sales_b = np.random.normal(loc=1050, scale=150, size=(30,))
sales_c = np.random.normal(loc=800, scale=80, size=(30,))

# create a DataFrame to store the sales data
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})

# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melted.columns = ['Day', 'Store', 'Sales']

print('Generated data top 5 rows : ')
print(sales_melted.head())


rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)

if rm_results.anova_table['Pr > F'].iloc[0] < 0.05:
    print('Reject the Null Hypothesis : \nConclusion: At least one of the groups has a different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')