### Q1

# Analysis of Variance (ANOVA) is a statistical test used to compare means between two or more groups to determine if there are statistically significant differences among the group means. ANOVA is based on several assumptions, and violations of these assumptions can impact the validity of the results. The main assumptions for ANOVA are:

# Independence: The observations within and between groups should be independent of each other. In other words, the value of one observation should not depend on the value of any other observation. Violations of independence could occur in longitudinal or repeated-measures designs, where measurements on the same subjects are correlated.

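# Normality: The data within each group (equivalently, the model residuals) should follow an approximately normal distribution. Deviations from normality, such as skewed or heavy-tailed distributions, can lead to inaccurate p-values and confidence intervals, although ANOVA is fairly robust to mild departures when group sizes are large and roughly equal. You can check normality using diagnostic plots (e.g., a Q-Q plot of the residuals) or normality tests like the Shapiro-Wilk test.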

# Homogeneity of Variance (Homoscedasticity): The variances of the different groups should be approximately equal. If the variances are not equal, it can distort the Type I error rate and the test's power, especially when group sizes are also unequal. Unequal variances can be detected using statistical tests such as Levene's or Bartlett's test, or by visual inspection of side-by-side boxplots or residual plots.

# Independence of Errors: The residuals (differences between the observed values and the predicted values) should be independent of each other and exhibit no systematic patterns. Serial correlation in the residuals can be a violation of this assumption, potentially affecting the validity of the test.

# Examples of Violations and Their Impact:

# Non-Normality: If the data within groups do not follow a normal distribution, ANOVA results may not be valid. The impact can include incorrect p-values and confidence intervals, leading to the risk of false positives or false negatives. For example, if the data is heavily skewed, ANOVA may incorrectly indicate significant differences.

# Heteroscedasticity: When the variances in different groups are unequal, ANOVA may be less powerful in detecting true group differences or may erroneously detect differences that do not exist. This can lead to Type I or Type II errors. For instance, if one group has much larger variances than others, ANOVA may incorrectly conclude that there are significant group differences.

# Violations of Independence: In longitudinal or repeated-measures designs, where observations within the same subject are correlated, the independence assumption is violated. This can result in inflated Type I errors or an increased likelihood of detecting significant differences even when they don't exist.

# To address these violations, there are alternative tests and methods available. For example, non-parametric tests like the Kruskal-Wallis test can be used when the assumption of normality is violated, and Welch's ANOVA can be employed when homogeneity of variances is not met. Additionally, transformations or robust ANOVA methods can sometimes mitigate the impact of these violations. It's important to assess the assumptions and choose the appropriate test based on the characteristics of your data to ensure the validity of your results.
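
# The checks named above can be sketched with SciPy. This is a minimal illustration on made-up data: the three groups, their means, and the random seed are assumptions, not part of the original question. It runs Shapiro-Wilk on the pooled residuals, Levene's test across groups, and the Kruskal-Wallis test as a rank-based fallback.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(50, 5, 20)
group_b = rng.normal(52, 5, 20)
group_c = rng.normal(55, 5, 20)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk on the residuals (each value minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test (median-centered by default in SciPy)
levene_stat, levene_p = stats.levene(*groups)

# Rank-based alternative when normality is doubtful: Kruskal-Wallis
kw_stat, kw_p = stats.kruskal(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p = {levene_p:.3f}")
print(f"Kruskal-Wallis p = {kw_p:.3f}")
```

# A small p-value from Shapiro-Wilk or Levene's test is evidence against the corresponding assumption; a large one does not prove the assumption holds, so the plots mentioned above remain worth a look.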

### Q2

# Analysis of Variance (ANOVA) is a statistical technique used to compare means across two or more groups to determine if there are significant differences. There are three main types of ANOVA, each suited for different situations:

# One-Way ANOVA:

# Use Case: One-Way ANOVA is used when you have one categorical independent variable (typically with three or more levels or groups; with exactly two levels it is equivalent to an independent two-sample t-test) and a continuous dependent variable.
# Example: You want to compare the mean test scores of students from three different schools (School A, School B, and School C) to determine if there are significant differences in the academic performance between the schools. Here, the independent variable is the school, and the dependent variable is the test score.
# Two-Way ANOVA:

# Use Case: Two-Way ANOVA is used when you have two categorical independent variables (factors) and one continuous dependent variable. It allows you to assess the interaction effects between the two independent variables.
# Example: You want to determine if both the type of diet (Factor A: Diet A, Diet B) and the gender of participants (Factor B: Male, Female) have significant effects on weight loss (the dependent variable). Two-Way ANOVA will help you analyze the main effects of diet and gender as well as their interaction effect.
# Repeated Measures ANOVA:

# Use Case: Repeated Measures ANOVA is used when you have one group of participants and you measure the same dependent variable under multiple conditions or at different time points. It is essentially an extension of One-Way ANOVA for within-subject designs.
# Example: You want to evaluate the impact of a new drug on patients' blood pressure. You measure their blood pressure before taking the drug, immediately after taking it, and then at 30-minute intervals for the next two hours. Repeated Measures ANOVA is appropriate because the same participants are measured under different conditions over time.

### Q3

# The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps us understand the sources of variation in a dataset. It decomposes the total variation in the data into different components, allowing us to assess the relative contributions of these components and make inferences about group means. Understanding this concept is essential in ANOVA for several reasons:

# Identification of Sources of Variation:

# Partitioning of variance allows us to identify and separate the different sources of variation in the data. This includes variation within groups and variation between groups (or factors). By understanding where the variation comes from, we can assess the impact of each source on the outcome variable.
# Hypothesis Testing:

# ANOVA involves testing hypotheses about group means. The partitioning of variance is critical for hypothesis testing, as it provides a way to compare the variation between groups (treatment effect) with the variation within groups (random variability). This comparison is used to calculate the F-statistic and assess the statistical significance of group differences.
# Estimation of Group Means:

# The group means themselves are estimated by the sample means; the partition supplies the pooled within-group variance (the mean square error), which is used to attach standard errors and confidence intervals to those estimates. This is useful for understanding the central tendencies of groups and making comparisons between them.
# Understanding Group Differences:

# ANOVA helps us answer questions about whether the means of the groups are significantly different from each other. The partitioning of variance helps us understand whether observed differences are likely due to the treatment or factors being studied or if they could have occurred by chance.
# Assessing the Model Fit:

# Partitioning of variance can be used to assess how well the model (ANOVA model) fits the data. It helps in evaluating the goodness of fit by comparing the explained variance (variation between groups) with the unexplained variance (variation within groups).
# Model Diagnostics:

# Understanding the partitioning of variance is useful for model diagnostics. The within-group (residual) component is what researchers examine to identify issues with the assumptions of the ANOVA, such as violations of homogeneity of variances or normality.
# Interpretation and Reporting:

# When presenting ANOVA results, understanding the partitioning of variance is essential for proper interpretation. Researchers can report the proportion of variance explained by the factors or treatments (for example, eta-squared, the between-group sum of squares divided by the total sum of squares), which is important for communicating the practical significance of the results.
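
# To make the link between the partition and the F-statistic concrete, here is a minimal sketch on hypothetical scores (the three small arrays are invented for illustration). It builds the between-group and within-group sums of squares by hand, forms F from the two mean squares, and cross-checks the result against scipy.stats.f_oneway.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups (illustration only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group variation: group means around the grand mean, weighted by group size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total variation: observations around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# F is the ratio of the between-group mean square to the within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
eta_squared = ss_between / ss_total

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"SS_between={ss_between:.2f}, SS_within={ss_within:.2f}, SS_total={ss_total:.2f}")
print(f"F (manual)={f_manual:.2f}, F (scipy)={f_scipy:.2f}, eta^2={eta_squared:.3f}")
```

# For these numbers the between-group and within-group sums of squares add up to the total, and the hand-computed F agrees with SciPy.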

### Q4

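import numpy as np

# Group data (replace with your data and groupings)
group_1 = [45, 52, 48]
group_2 = [55, 50, 58]
group_3 = [51, 54, 49, 47]

# Calculate the group means
mean_group_1 = np.mean(group_1)
mean_group_2 = np.mean(group_2)
mean_group_3 = np.mean(group_3)

# Calculate the overall (grand) mean across all observations
all_values = group_1 + group_2 + group_3
overall_mean = np.mean(all_values)

# Calculate the total sum of squares (SST)
SST = sum((x - overall_mean)**2 for x in all_values)

# Calculate the explained (between-group) sum of squares (SSE)
# Note: some texts use "SSE" for the error sum of squares; here it means "explained"
SSE = len(group_1) * (mean_group_1 - overall_mean)**2 + len(group_2) * (mean_group_2 - overall_mean)**2 + len(group_3) * (mean_group_3 - overall_mean)**2

# Calculate the residual (within-group) sum of squares (SSR)
SSR = sum((x - mean_group_1)**2 for x in group_1) + sum((x - mean_group_2)**2 for x in group_2) + sum((x - mean_group_3)**2 for x in group_3)

print(f"SST: {SST:.2f}, SSE: {SSE:.2f}, SSR: {SSR:.2f}")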

### Q5

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Example dataset
data = {
    'Factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X'],
    'DV': [23, 45, 67, 34, 56, 78, 89, 12, 34]
}

df = pd.DataFrame(data)

# Fit the model using OLS (Ordinary Least Squares)
model = ols('DV ~ Factor1 + Factor2 + Factor1:Factor2', data=df).fit()

# Perform ANOVA (anova_lm defaults to Type I, i.e. sequential, sums of squares)
anova_results = anova_lm(model)

print(anova_results)


                  df  sum_sq  mean_sq         F    PR(>F)
Factor1          2.0   242.0    121.0  0.105263  0.903270
Factor2          1.0   544.5    544.5  0.473684  0.540741
Factor1:Factor2  2.0  1089.0    544.5  0.473684  0.662553
Residual         3.0  3448.5   1149.5       NaN       NaN


### Q6

# In a one-way Analysis of Variance (ANOVA), the F-statistic and its associated p-value are used to determine whether there are significant differences between the group means. In your scenario, you obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how to interpret these results:

# The F-Statistic (5.23):

# The F-statistic is a measure of the ratio of the variance between groups (explained variance) to the variance within groups (unexplained variance). A higher F-statistic suggests a larger difference between the group means relative to the variability within each group.
# The P-Value (0.02):

# The p-value represents the probability of obtaining an F-statistic as extreme as the one observed (or more extreme) if there were no real differences between the groups. In other words, it assesses whether the observed differences are statistically significant.
# Interpretation:

# Based on the F-statistic and p-value:

# Statistical Significance: The p-value of 0.02 is less than the chosen significance level (alpha), which is typically set at 0.05. This indicates that the differences between the groups are statistically significant.

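# Conclusions: You can reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others in the population from which the samples were drawn. The ANOVA alone does not tell you which groups differ.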

# Post-hoc Tests: If your one-way ANOVA indicates significant group differences, you may want to perform post-hoc tests (e.g., Tukey's HSD, Bonferroni, or Dunnett's test) to identify which specific group means are different from each other. These post-hoc tests can provide more detailed information about pairwise group comparisons.

# Effect Size: While the F-statistic and p-value indicate statistical significance, it's also valuable to assess the practical significance or effect size. An effect size measure (e.g., eta-squared or omega-squared for the overall effect, or Cohen's d for a specific pairwise comparison) can help quantify the magnitude of the differences between the groups, providing a more meaningful interpretation of the results.
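
# As a rough sketch of how the p-value and an effect size follow from the F-statistic, the snippet below uses the reported F of 5.23 together with hypothetical degrees of freedom (3 groups and 30 observations, so 2 and 27). The question does not state the degrees of freedom, so the p-value computed here will not reproduce the reported 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom (not given in the question): 3 groups, 30 observations
df_between = 2    # k - 1
df_within = 27    # N - k

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Eta-squared recovered from F and the degrees of freedom:
# SS_between / SS_total = (F * df_between) / (F * df_between + df_within)
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"eta-squared: {eta_squared:.3f}")
```

# Under these assumed degrees of freedom the result is still significant at alpha = 0.05, and eta-squared gives the share of total variance attributable to group membership.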

### Q7

# Handling missing data in a repeated measures ANOVA is an important aspect of data analysis. Repeated measures ANOVA, which is used when the same subjects are measured under multiple conditions or time points, can be sensitive to missing data. Here's how you can handle missing data and the potential consequences of using different methods:

# Listwise Deletion (Complete Case Analysis):

# Handling: In listwise deletion, any subject with missing data on any of the measured conditions or time points is excluded from the analysis.
# Consequences:
# Pros: Simple and straightforward.
# Cons: Reduces sample size, which can reduce statistical power and may introduce selection bias if missing data is not completely random.
# Imputation:

# Handling: Imputation involves filling in missing values with estimated values. Common imputation methods include mean imputation (replacing missing values with the group mean), linear interpolation, last observation carried forward, or using statistical imputation techniques such as multiple imputation.
# Consequences:
# Pros: Retains all subjects in the analysis, maintains sample size, and may provide less biased estimates.
# Cons: Can introduce measurement error or bias if the imputation method is not appropriate or if the assumption of data missing at random (MAR) is violated. Imputation methods may also obscure the true variability in the data.
# Maximum Likelihood Estimation (MLE):

# Handling: MLE is a statistical approach that estimates model parameters by maximizing the likelihood function. For repeated-measures data this is typically done by fitting a linear mixed-effects model in place of the classical repeated measures ANOVA, so that every subject contributes whatever observations they have.
# Consequences:
# Pros: Utilizes all available information and gives valid estimates when data are missing at random (MAR) and the model is correctly specified, which makes it a generally preferred method.
# Cons: May not perform well when data is not missing at random (NMAR) or when the assumptions of the model are violated. MLE can be computationally intensive and requires specialized software.
# Potential Consequences of Different Methods:

# Using listwise deletion can result in reduced statistical power and biased estimates if missing data is not completely random. It may lead to a loss of valuable information, especially in small sample sizes.

# Imputation can help maintain sample size and reduce bias if done correctly. However, it can introduce error if the imputation method is inappropriate, and it may not be valid if data is not missing at random.

# Maximum Likelihood Estimation is generally the preferred method when data are missing at random, as it utilizes all available information without discarding subjects. However, it relies on distributional assumptions (typically normality) and may be sensitive to model misspecification. It is also computationally demanding.

# The choice of how to handle missing data should be made carefully, taking into consideration the nature of the data, the assumptions of the analysis, and the potential consequences of the chosen method. It's important to document and justify the approach taken to handle missing data in your analysis.
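
# The first two strategies can be illustrated with pandas. This is a minimal sketch on an invented wide-format table (one row per subject, one column per time point; the column names t1, t2, t3 and all values are hypothetical). It contrasts listwise deletion with column-mean imputation and shows how mean imputation shrinks the spread of the imputed column.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "t1": [120.0, 130.0, 125.0, 140.0, 135.0],
    "t2": [118.0, np.nan, 122.0, 138.0, 131.0],
    "t3": [115.0, 127.0, np.nan, 136.0, 129.0],
})

# Listwise deletion: drop every subject with at least one missing time point
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = wide.fillna(wide.mean())

# Mean imputation understates variability: compare the SD before and after
sd_observed = wide["t2"].std()          # NaN values are skipped by default
sd_imputed = mean_imputed["t2"].std()

print(f"Subjects kept by listwise deletion: {len(complete_cases)} of {len(wide)}")
print(f"SD of t2, observed values only: {sd_observed:.2f}")
print(f"SD of t2, after mean imputation: {sd_imputed:.2f}")
```

# Listwise deletion discards two of the five subjects here, while mean imputation keeps all five but makes the data look less variable than they are, which is one reason multiple imputation or likelihood-based methods are usually preferred.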

### Q8

# Post-hoc tests are used in conjunction with Analysis of Variance (ANOVA) when you have three or more groups, and the ANOVA results indicate that there are significant differences between at least some of the group means. Post-hoc tests are performed to make pairwise comparisons between specific groups to identify which groups differ from each other. Some common post-hoc tests include:

# Tukey's Honestly Significant Difference (Tukey's HSD):

# Use Case: Tukey's HSD is widely used for all pairwise comparisons when variances are homogeneous; the Tukey-Kramer variant extends it to unequal group sizes. It controls the familywise error rate, making it suitable for exploratory analyses when you want to identify all significantly different pairs.
# Example: In a study comparing the effectiveness of four different teaching methods (A, B, C, D) on student test scores, you use Tukey's HSD to identify which teaching methods yield significantly different scores.
# Bonferroni Correction:

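# Use Case: Bonferroni is conservative and suitable when you want to control the familywise error rate across a set of comparisons. It divides the significance level by the number of comparisons (or, equivalently, multiplies each p-value by that number) and can be applied to any set of tests, including those with unequal group sizes. It works best when the number of comparisons is small, because it loses power quickly as that number grows.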
# Example: In a clinical trial, you want to compare the efficacy of four different drug treatments to a control group. To maintain an overall alpha level of 0.05, you use the Bonferroni correction to adjust individual comparison p-values.
# Sidak Correction:

# Use Case: Similar to Bonferroni, Sidak correction is used to control the familywise error rate when making multiple comparisons. It is slightly less conservative than Bonferroni and is exact when the comparisons are independent. Like Bonferroni, it adjusts the significance threshold rather than the test itself, so it can be applied regardless of group sizes.
# Example: In a market research study, you want to compare the average purchase amounts among several customer segments. You use Sidak correction to make pairwise comparisons between the segments while controlling the overall alpha level.
# Dunnett's Test:

# Use Case: Dunnett's test is used when you have a control group and you want to compare other groups to the control. It controls the familywise error rate and is commonly used in experimental and clinical trials.
# Example: In a drug trial, you have a control group and three experimental groups receiving different drug dosages. You use Dunnett's test to compare each experimental group to the control group while controlling for multiple comparisons.
# Fisher's Least Significant Difference (LSD):

# Use Case: Fisher's LSD is used when you have equal group sizes and homogeneity of variances. It's a relatively liberal test, and it's appropriate when you have a priori hypotheses about specific pairwise comparisons.
# Example: In an agricultural study, you have data on crop yields for five different fertilizers. You use Fisher's LSD to test specific pairwise comparisons you have a theoretical basis to investigate.
# Games-Howell:

# Use Case: Games-Howell is a parametric post-hoc test that does not assume equal variances or equal group sizes. It uses Welch-type standard errors and degrees of freedom, which makes it a natural companion to Welch's ANOVA.
# Example: In a psychological study, you want to compare the reaction times of participants across various experimental conditions. If the variances differ across conditions or the group sizes are unequal, you can use Games-Howell for post-hoc tests.
# The choice of a post-hoc test depends on the specific characteristics of your data, such as group sizes, homogeneity of variances, and your research objectives. It's essential to select a post-hoc test that is appropriate for your data and control the familywise error rate when making multiple comparisons to maintain the overall Type I error rate.
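
# The difference between the Bonferroni and Sidak adjustments is easy to see numerically. The sketch below assumes four groups labelled A to D (echoing the teaching-method example above) and all six pairwise comparisons at a familywise alpha of 0.05; the variable names are illustrative only.

```python
from itertools import combinations

alpha = 0.05
groups = ["A", "B", "C", "D"]  # hypothetical group labels

# All unordered pairs of groups: 4 choose 2 = 6 comparisons
pairs = list(combinations(groups, 2))
m = len(pairs)

# Bonferroni: split alpha evenly across the m comparisons
bonferroni_alpha = alpha / m

# Sidak: per-comparison level that yields a familywise rate of alpha
# when the m comparisons are independent
sidak_alpha = 1 - (1 - alpha) ** (1 / m)

print(f"Number of pairwise comparisons: {m}")
print(f"Bonferroni per-comparison alpha: {bonferroni_alpha:.6f}")
print(f"Sidak per-comparison alpha:      {sidak_alpha:.6f}")
```

# The Sidak threshold is always slightly larger than the Bonferroni one, which is why Sidak is described as marginally less conservative.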

### Q9

In [3]:
import scipy.stats as stats
import numpy as np

# Sample data for the three diets (replace with your actual data)
diet_A = [2.0, 3.0, 1.5, 2.5, 2.3, 3.1, 2.2, 1.8, 2.9, 1.7]
diet_B = [3.2, 2.7, 3.5, 3.0, 2.8, 3.3, 2.6, 3.4, 2.4, 3.1]
diet_C = [1.1, 0.9, 1.2, 1.0, 1.5, 1.3, 1.4, 1.7, 0.8, 1.6]

# Combine the data into a single array
data = np.concatenate([diet_A, diet_B, diet_C])

# Create a corresponding grouping variable
group_labels = ['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C)

# Perform the one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Report the results
print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("The one-way ANOVA is significant, indicating that there are significant differences in mean weight loss between at least two of the three diets.")
else:
    print("The one-way ANOVA is not significant, suggesting that there are no significant differences in mean weight loss between the three diets.")

F-statistic: 42.71
P-value: 0.0000
The one-way ANOVA is significant, indicating that there are significant differences in mean weight loss between at least two of the three diets.


### Q10

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Set random seed for reproducibility
np.random.seed(42)

# Create a simulated dataset
n = 30  # total number of employees
programs = ['A', 'B', 'C']
experience_levels = ['Novice', 'Experienced']

# Simulate data: Assume completion times are normally distributed with different means for each group
data = {
    'Program': np.tile(programs, n // 3),  # Repeat programs A, B, C for 10 employees each
    'Experience': np.tile(experience_levels, n // 2),  # Repeat novice, experienced for 15 employees each
    'Time': np.random.normal(50, 10, n)  # Simulate task completion times
}

# Build the DataFrame from the simulated data
df = pd.DataFrame(data)

# Adjust Time for different Program and Experience levels
df.loc[df['Program'] == 'A', 'Time'] += 5  # Program A takes 5 minutes more
df.loc[df['Program'] == 'B', 'Time'] += 10  # Program B takes 10 minutes more
df.loc[df['Experience'] == 'Novice', 'Time'] += 15  # Novices take 15 minutes longer

# Display first few rows
df.head()


  Program   Experience       Time
0       A       Novice  74.967142
1       B  Experienced  58.617357
2       C       Novice  71.476885
3       A  Experienced  70.230299
4       B       Novice  72.658466


In [6]:
# Fit the model using OLS (Ordinary Least Squares)
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()

# Perform the ANOVA
anova_results = anova_lm(model)

# Display ANOVA results
print(anova_results)


                      df       sum_sq      mean_sq          F    PR(>F)
Program              2.0   652.107322   326.053661   3.850094  0.035464
Experience           1.0  1751.737373  1751.737373  20.684795  0.000131
Program:Experience   2.0    14.344125     7.172062   0.084689  0.919071
Residual            24.0  2032.492766    84.687199        NaN       NaN


### Q11

In [7]:
import numpy as np
import pandas as pd
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

# Simulate data for two groups (control and experimental)
n = 100  # number of students per group
control_group_scores = np.random.normal(75, 10, n)  # Control group with mean 75 and std 10
experimental_group_scores = np.random.normal(80, 10, n)  # Experimental group with mean 80 and std 10

# Combine into a DataFrame for convenience
data = pd.DataFrame({
    'Group': ['Control']*n + ['Experimental']*n,
    'TestScore': np.concatenate([control_group_scores, experimental_group_scores])
})

# Display the first few rows
data.head()


     Group  TestScore
0  Control  79.967142
1  Control  73.617357
2  Control  81.476885
3  Control  90.230299
4  Control  72.658466


In [8]:
# Perform a two-sample t-test
control_scores = data[data['Group'] == 'Control']['TestScore']
experimental_scores = data[data['Group'] == 'Experimental']['TestScore']

# Two-sample t-test (assuming equal variances)
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores, equal_var=True)

# Display the results
print(f'T-statistic: {t_stat}')
print(f'P-value: {p_value}')


T-statistic: -4.754695943505282
P-value: 3.819135262679469e-06
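
# Since this notebook is mostly about ANOVA, it may help to see how the pooled-variance t-test relates to it: with exactly two groups, a one-way ANOVA gives F equal to t squared and the same p-value. The sketch below is an illustration on freshly simulated scores (the generator, seed, and variable names are assumptions and will not reproduce the numbers above). It also runs Welch's t-test, which drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical scores, simulated independently of the cell above
rng = np.random.default_rng(42)
control = rng.normal(75, 10, 100)
experimental = rng.normal(80, 10, 100)

# Pooled-variance two-sample t-test
t_stat, t_p = stats.ttest_ind(control, experimental, equal_var=True)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

# Welch's t-test: does not assume equal variances
welch_t, welch_p = stats.ttest_ind(control, experimental, equal_var=False)

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.3g}, ANOVA p = {f_p:.3g}, Welch p = {welch_p:.3g}")
```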


### Q12

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set random seed for reproducibility
np.random.seed(42)

# Simulate sales data for three stores (A, B, C) over 30 days
n_days = 30
store_A_sales = np.random.normal(2000, 300, n_days)  # Store A sales with mean 2000 and std 300
store_B_sales = np.random.normal(2100, 250, n_days)  # Store B sales with mean 2100 and std 250
store_C_sales = np.random.normal(2200, 350, n_days)  # Store C sales with mean 2200 and std 350

# Create a DataFrame with the sales data
data = pd.DataFrame({
    'Day': np.tile(np.arange(1, n_days + 1), 3),
    'Store': np.repeat(['A', 'B', 'C'], n_days),
    'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales])
})

# Display the first few rows of the data
data.head()


   Day Store        Sales
0    1     A  2149.014246
1    2     A  1958.520710
2    3     A  2194.306561
3    4     A  2456.908957
4    5     A  1929.753988


In [10]:
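# Fit an OLS model with Store, Day, and their interaction.
# Note: Day is numeric, so it enters as a linear covariate (1 df). This is not a true
# repeated measures ANOVA, which would treat each Day as a blocking (subject) unit.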
model = ols('Sales ~ Store + Day + Store:Day', data=data).fit()

# Perform the ANOVA
anova_results = anova_lm(model, typ=2)

# Display the ANOVA results
print(anova_results)


                 sum_sq    df         F    PR(>F)
Store      1.021827e+06   2.0  6.224303  0.003015
Day        2.778034e+04   1.0  0.338439  0.562289
Store:Day  2.584426e+05   2.0  1.574264  0.213210
Residual   6.895024e+06  84.0       NaN       NaN
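
# For comparison, a repeated measures ANOVA in the strict sense can be sketched with statsmodels' AnovaRM, treating each Day as the repeated "subject" and Store as the within-subject factor. This is an illustration rather than the analysis used above: it rebuilds the same simulated data so the block runs on its own, and the choice of Day as the subject is an assumption about the intended design.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Rebuild the simulated sales data so this block is self-contained
np.random.seed(42)
n_days = 30
data = pd.DataFrame({
    "Day": np.tile(np.arange(1, n_days + 1), 3),
    "Store": np.repeat(["A", "B", "C"], n_days),
    "Sales": np.concatenate([
        np.random.normal(2000, 300, n_days),
        np.random.normal(2100, 250, n_days),
        np.random.normal(2200, 350, n_days),
    ]),
})

# Each Day is measured once under every Store, so Day acts as the subject
# and Store is the within-subject factor (the design must be balanced)
rm_results = AnovaRM(data, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_results.anova_table

print(rm_table)
```

# With 30 days and 3 stores the Store effect is tested on 2 and 58 degrees of freedom, because day-to-day variation is removed from the error term instead of being modelled as a linear trend.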


In [11]:
# Perform Tukey's HSD post-hoc test
tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])

# Display the post-hoc results
print(tukey_results)


 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
     A      B 126.1535 0.2108 -50.7304 303.0373  False
     A      C 260.9537  0.002  84.0698 437.8376   True
     B      C 134.8003   0.17 -42.0836 311.6842  False
------------------------------------------------------
