### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. ANOVA relies on several assumptions to ensure the validity of its results.

##### The main assumptions of ANOVA are:

###### 1. Normality: The data should follow a normal distribution within each group.
###### 2. Homogeneity of variance: The variance of each group should be equal.
###### 3. Independence: The observations should be independent of each other.
###### 4. Random sampling: The data should be collected through a random sampling process.
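###### 5. Equal sample sizes (desirable, not required): ANOVA does not strictly require equal group sizes, but a balanced design makes the test more robust to violations of the equal-variance assumption.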

##### If any of these assumptions are violated, it could impact the validity of the results. Some examples of violations that could impact the validity of the results are:

###### 1. Non-normality: If the data within groups are far from normally distributed, the p-value may be inaccurate, especially with small samples. For example, if the data are strongly skewed or contain outliers, a transformation or a non-parametric test such as the Kruskal-Wallis test may be more appropriate.
###### 2. Heteroscedasticity: If the variance is not equal across groups, the F-test may give misleading p-values. For example, if one group has a much larger variance than the others, Welch's ANOVA is a common alternative.
###### 3. Non-independence: If the observations are not independent of each other, the standard errors and p-values may be wrong. For example, if the same individual is observed multiple times, the observations are likely to be correlated, and a repeated measures design should be used instead.
###### 4. Non-random sampling: If the data is not collected through a random sampling process, the results may be biased. For example, if individuals are selectively chosen to participate in the study, the results may not be generalizable to the entire population.
###### 5. Unequal sample sizes: Unequal group sizes do not invalidate ANOVA on their own, but they make the test more sensitive to unequal variances. For example, if one group is much larger than the others and also has a different variance, the F-test can become too liberal or too conservative.
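#### It is important to check for violations of these assumptions before conducting ANOVA and to address them if they are present. This can be done through visual inspection of the data (for example, histograms or Q-Q plots), statistical tests (for example, the Shapiro-Wilk test for normality or Levene's test for equal variances), or transformations of the data.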
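
#### The statistical checks can be sketched in Python as follows. This is a minimal example, assuming three small groups of made-up measurements (`group_a`, `group_b`, and `group_c` are hypothetical). It uses `scipy.stats.shapiro` to test normality within each group and `scipy.stats.levene` to test equality of variances.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only).
group_a = [23.1, 25.4, 24.8, 22.9, 26.0, 24.1, 25.2, 23.7]
group_b = [27.3, 28.1, 26.5, 29.0, 27.8, 28.4, 26.9, 27.5]
group_c = [21.0, 22.4, 20.8, 23.1, 21.7, 22.0, 21.3, 22.8]
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality: Shapiro-Wilk test within each group.
# A small p-value suggests the group departs from normality.
shapiro_p = {}
for name, values in groups.items():
    statistic, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W = {statistic:.3f}, p = {p:.3f}")

# Homogeneity of variance: Levene's test across all groups.
# A small p-value suggests the group variances differ.
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

#### With samples this small, these tests have little power, so a non-significant result does not prove that an assumption holds. Plots of the data remain a useful complement.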

### Q2. What are the three types of ANOVA, and in what situations would each be used?

#### The three types of ANOVA are one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. Each type of ANOVA is used in different situations, as explained below:

###### One-way ANOVA: One-way ANOVA is used when we want to compare the means of three or more groups defined by a single independent variable (also known as a factor). For example, we may want to compare the average test scores of students who attended different schools, with each school representing a different group. One-way ANOVA determines whether at least one group mean differs significantly from the others; identifying which groups differ requires a follow-up post-hoc test.

###### Two-way ANOVA: Two-way ANOVA is used when we want to investigate the effects of two independent variables (also known as factors) on a dependent variable. For example, we may want to investigate the effect of both age and gender on exam scores. Two-way ANOVA determines whether each independent variable has a significant effect on the dependent variable, as well as whether there is an interaction effect between the two independent variables.

###### Repeated measures ANOVA: Repeated measures ANOVA is used when the same subjects are measured repeatedly, either over time or under different conditions, and we want to compare the means of those measurements. For example, we may want to compare the effectiveness of three different treatments for a medical condition, with each patient receiving all three treatments over time. Repeated measures ANOVA determines whether there is a significant difference among the condition means while accounting for the correlation between repeated measurements on the same subject; post-hoc tests are then needed to identify which conditions differ.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#### In ANOVA (Analysis of Variance), the partitioning of variance refers to the process of decomposing the total variation in the response variable into its components, which can be attributed to different sources or factors. The goal of ANOVA is to determine whether the mean values of the response variable differ significantly across different levels of one or more factors.

#### ANOVA works with three measures of variation: the total variation, the between-group variation, and the within-group variation, each expressed as a sum of squares. The total variation in the response variable is partitioned into the variation between the groups and the variation within the groups.

#### The between-group variance represents the variation in the response variable that is explained by the differences between the groups, which can be attributed to the effect of the factor(s) being studied. The within-group variance, on the other hand, represents the variation in the response variable that is not explained by the differences between the groups, which can be attributed to the random variability within each group.

#### Understanding the partitioning of variance is important because it helps to determine the significance of the factor(s) being studied in explaining the variation in the response variable. If the between-group variance is much larger than the within-group variance, it suggests that the factor(s) have a significant effect on the response variable. On the other hand, if the within-group variance is much larger than the between-group variance, it suggests that the factor(s) have little or no effect on the response variable.

#### By partitioning the variance, ANOVA provides a framework for quantifying the amount of variation in the response variable that can be attributed to different factors, which is essential for drawing meaningful conclusions and making informed decisions based on the data.
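
#### For a one-way design, the partition described above can be written in sum-of-squares form. The notation is introduced here for illustration: $y_{ij}$ is observation $i$ in group $j$, $\bar{y}_j$ is the mean of group $j$, $\bar{y}$ is the grand mean, $n_j$ is the size of group $j$, $k$ is the number of groups, and $N$ is the total number of observations.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

#### The F-statistic compares the two parts after each is divided by its degrees of freedom, which is why a large between-group share of the total variation leads to a large F.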

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

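#### To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can fit the model with the statsmodels package and read the sums of squares from its ANOVA table. Note that this question uses SSE for the explained (between-group) sum of squares and SSR for the residual (within-group) sum of squares; many textbooks use these two abbreviations the other way round. Here's an example: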

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data into a Pandas DataFrame
data = pd.read_csv("data.csv")

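# fit the one-way ANOVA model; C() marks factor_var as categorical
model = ols("response_var ~ C(factor_var)", data=data).fit()

# build the ANOVA table once; its rows are the factor and the residual
anova_table = sm.stats.anova_lm(model, typ=1)

# calculate SSE (explained, between-group sum of squares)
sse = anova_table["sum_sq"].iloc[0]

# calculate SSR (residual, within-group sum of squares)
ssr = anova_table["sum_sq"].iloc[1]

# calculate SST (the table has no total row, so add the two parts)
sst = sse + ssr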

#### The anova_lm() function in statsmodels calculates the ANOVA table for the fitted model. The ANOVA table contains the sum of squares, degrees of freedom, mean squares, F-statistic, and p-value for each effect in the model, plus a residual row. It does not contain a total row, so SST is obtained by adding the factor (explained) and residual sums of squares.
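
#### The same three quantities can also be computed directly from their definitions, which makes the partition explicit. The following is a minimal sketch using NumPy on made-up data (the three arrays in `groups` are hypothetical); it is a cross-check on the definitions rather than a replacement for the statsmodels table.

```python
import numpy as np

# Hypothetical measurements for three groups (illustrative values only).
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares: every observation against the grand mean.
sst = np.sum((all_values - grand_mean) ** 2)

# Explained (between-group) sum of squares: group means against the grand mean,
# weighted by group size.
ss_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Residual (within-group) sum of squares: observations against their own group mean.
ss_residual = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print(f"SST = {sst:.1f}")                 # SST = 60.0
print(f"explained = {ss_explained:.1f}")  # explained = 54.0
print(f"residual = {ss_residual:.1f}")    # residual = 6.0
```

#### The explained and residual parts add up to the total (54 + 6 = 60), which is the partition of variance in numerical form.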

### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

#### To calculate the main effects and interaction effects in a two-way ANOVA using Python, you can use the statsmodels library. Here is an example code snippet:

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a pandas DataFrame with your data
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'Y': [15, 25, 35, 45]})

# fit the ANOVA model
model = ols('Y ~ A + B + A:B', data=df).fit()

# print the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                sum_sq   df             F        PR(>F)
A         5.000000e+02  1.0  9.364478e+27  6.578674e-15
B         5.000000e+02  1.0  9.364478e+27  6.578674e-15
A:B       7.888609e-27  1.0  1.477454e-01  7.663823e-01
Residual  5.339326e-26  1.0           NaN           NaN
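
#### In the example above, A and B are numeric columns with one observation per row and Y is an exact linear function of them, so the residual sum of squares is essentially zero and the F-statistics are not meaningful. A two-way ANOVA normally uses categorical factors with several observations per cell. The following sketch assumes a small balanced design with made-up values (the array `y` is hypothetical) and computes the main-effect and interaction sums of squares directly with NumPy so that each term is visible. These formulas apply to balanced designs only.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: 2 levels of A x 2 levels of B x 3 replicates.
# Indexing is y[level of A, level of B, replicate].
y = np.array([
    [[10.0, 12.0, 11.0], [14.0, 15.0, 16.0]],  # A level 0: B level 0, B level 1
    [[20.0, 19.0, 21.0], [30.0, 29.0, 31.0]],  # A level 1: B level 0, B level 1
])
a, b, n = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # one mean per level of A
mean_b = y.mean(axis=(0, 2))   # one mean per level of B
mean_cell = y.mean(axis=2)     # one mean per A x B cell

# Main effects: marginal means against the grand mean.
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
# Interaction: what the cell means show beyond the two main effects.
ss_ab = n * np.sum((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
# Residual: replicates against their own cell mean.
ss_within = np.sum((y - mean_cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)

df_a, df_b = a - 1, b - 1
df_ab = df_a * df_b
df_within = a * b * (n - 1)
ms_within = ss_within / df_within

for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_within
    p_value = stats.f.sf(f_value, df, df_within)
    print(f"{name}: SS = {ss:.1f}, df = {df}, F = {f_value:.1f}, p = {p_value:.4g}")
```

#### With statsmodels, the equivalent model would wrap each factor in `C()` in the formula so that it is treated as categorical.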


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
### What can you conclude about the differences between the groups, and how would you interpret these results?

#### If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, it means that there is a statistically significant difference between at least two of the groups.

#### The F-statistic is the ratio of the between-group mean square to the within-group mean square. An F-statistic well above 1 indicates that the group means vary more than would be expected from the variability within the groups. The p-value of 0.02 means that, if all group means were truly equal, an F-statistic at least as large as 5.23 would be observed only about 2% of the time. Because 0.02 is below the conventional 0.05 threshold, the null hypothesis of equal means is rejected.

#### To interpret these results, you can conclude that there is evidence to suggest that there are significant differences between at least two of the groups. However, you cannot determine which specific groups are different from each other based on the ANOVA result alone. You would need to conduct post-hoc tests, such as Tukey's HSD or Bonferroni tests, to determine which specific group differences are statistically significant.
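
#### The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the values below (3 groups and 17 observations in total) are an assumption, chosen only because they give a p-value close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical design: 3 groups and 17 observations in total.
df_between = 3 - 1    # number of groups minus 1
df_within = 17 - 3    # total observations minus number of groups

# p-value: probability of an F at least this large when all group means are equal.
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F that would have to be exceeded at the 0.05 level.
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_value:.2f}")
print("reject H0" if f_statistic > critical_value else "fail to reject H0")
```

#### With different degrees of freedom the same F-statistic would give a different p-value, which is why both should be reported together.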

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### In a repeated measures ANOVA, missing data can be handled using various methods such as:

###### 1. Listwise deletion: This method involves excluding any case with missing data on any of the variables used in the analysis. This approach may result in the loss of a considerable amount of data and may introduce bias if the missing data are not missing completely at random (MCAR).

###### 2. Pairwise deletion: This method only excludes cases with missing data on the variables used in the analysis for specific pairwise comparisons. This approach preserves all available data but may result in biased estimates of standard errors.

###### 3. Imputation: This method involves replacing missing data with estimates of the missing values based on other available data. This approach can reduce bias if the imputation model accounts for any systematic relationships between missing data and observed variables, but can also introduce bias if the imputation model is misspecified.

#### The potential consequences of using different methods to handle missing data in a repeated measures ANOVA can be significant. Using listwise deletion may result in a loss of statistical power and biased estimates if the missing data are not MCAR. Pairwise deletion may result in biased standard errors and decreased statistical power. Imputation methods can introduce bias if the imputation model is misspecified, but can be more efficient and reduce bias if the imputation model is correctly specified. Therefore, the choice of the missing data handling method should be based on the type of missing data and the research question. It is recommended to perform sensitivity analysis to assess the impact of different missing data handling methods on the results.
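
#### The difference between listwise deletion and imputation can be sketched with pandas. The example assumes a small wide-format table with one row per subject and one column per measurement occasion (the column names `t1`, `t2`, `t3` and all values are hypothetical). Mean imputation is used only because it is the simplest form of imputation; it tends to understate variability, and model-based or multiple imputation is usually preferable in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 5 subjects, 3 occasions, 2 missing values.
df = pd.DataFrame({
    "t1": [5.0, 6.0, np.nan, 7.0, 8.0],
    "t2": [5.5, np.nan, 6.5, 7.5, 8.5],
    "t3": [6.0, 7.0, 7.0, 8.0, 9.0],
})

# Listwise deletion: drop every subject with at least one missing measurement.
listwise = df.dropna()

# Simple mean imputation: replace each missing value with its column mean.
imputed = df.fillna(df.mean())

print(f"subjects before: {len(df)}, after listwise deletion: {len(listwise)}")
print("occasion means after listwise deletion:")
print(listwise.mean())
print("occasion means after mean imputation:")
print(imputed.mean())
```

#### Here listwise deletion discards two of the five subjects, while imputation keeps all five; comparing the results from both approaches is one simple form of the sensitivity analysis recommended above.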

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Post-hoc tests are used after conducting ANOVA to determine which specific group differences are statistically significant. Some common post-hoc tests include:

###### 1. Tukey's HSD (Honestly Significant Difference) test: This test is used to compare all possible pairs of means while controlling the overall family-wise error rate. It is appropriate when all pairwise comparisons are of interest and the assumption of homogeneity of variances is met.

###### 2. Bonferroni correction: This method adjusts the alpha level for multiple comparisons by dividing the alpha level by the number of comparisons made. It is simple and general, but becomes very conservative as the number of comparisons grows, so it works best for a small number of planned comparisons.

###### 3. Scheffe's test: This test controls the overall family-wise error rate and is more conservative than Tukey's HSD test for pairwise comparisons. It is appropriate when complex contrasts (for example, one group versus the average of two others) are tested in addition to pairwise comparisons.

###### 4. Dunnett's test: This test is used to compare each group mean with a control group mean. It is appropriate when there is a control group and multiple treatment groups.

###### 5. Games-Howell test: This test is used when the assumption of equal variances is not met, and the sample sizes are unequal.

#### A situation where a post-hoc test might be necessary is when conducting a study with multiple groups and significant differences are found in the ANOVA analysis. For example, a researcher may conduct an experiment to compare the effectiveness of three different treatments for reducing anxiety. After conducting the ANOVA analysis, the researcher finds a significant difference between the three treatments. To determine which specific treatment is more effective, a post-hoc test such as Tukey's HSD or Dunnett's test could be used to compare all possible pairs of means or compare each treatment group with a control group, respectively.
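
#### As an illustration of the anxiety example, the sketch below runs Bonferroni-corrected pairwise t-tests with `scipy.stats.ttest_ind`. The three lists of anxiety scores are made up for this example, and the standard pooled-variance t-test is assumed.

```python
from itertools import combinations
from scipy import stats

# Hypothetical anxiety scores after each treatment (illustrative values only).
groups = {
    "Treatment A": [22, 25, 24, 23, 26, 24],
    "Treatment B": [20, 21, 19, 22, 20, 21],
    "Treatment C": [15, 16, 14, 17, 15, 16],
}

alpha = 0.05
pairs = list(combinations(groups, 2))

# Bonferroni: divide alpha by the number of comparisons (3 pairs here).
adjusted_alpha = alpha / len(pairs)

results = {}
for first, second in pairs:
    t_stat, p = stats.ttest_ind(groups[first], groups[second])
    significant = bool(p < adjusted_alpha)
    results[(first, second)] = (p, significant)
    print(f"{first} vs {second}: t = {t_stat:.2f}, p = {p:.4g}, "
          f"significant at {adjusted_alpha:.4f}: {significant}")
```

#### Each pair is declared significant only if its p-value falls below the adjusted threshold, which keeps the family-wise error rate at or below the original alpha.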

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets.
### Report the F-statistic and p-value, and interpret the results.

In [3]:
import scipy.stats as stats

# weight loss data for the three diets
diet_a = [3.2, 2.9, 3.1, 2.8, 3.5, 2.7, 2.6, 2.9, 3.0, 2.8, 
          3.1, 2.7, 3.0, 3.2, 2.9, 2.8, 2.7, 3.1, 2.8, 3.0, 
          2.9, 2.8, 3.2, 3.0, 3.1]
diet_b = [2.5, 2.1, 2.7, 2.4, 2.2, 2.3, 2.6, 2.1, 2.3, 2.5, 
          2.2, 2.4, 2.5, 2.3, 2.2, 2.4, 2.5, 2.1, 2.4, 2.2, 
          2.3, 2.4, 2.5, 2.1, 2.2]
diet_c = [1.9, 2.0, 1.8, 1.7, 1.9, 1.8, 2.1, 2.0, 2.2, 2.1, 
          1.8, 2.0, 1.9, 2.1, 1.8, 1.9, 1.7, 2.0, 1.8, 2.1, 
          2.2, 1.7, 1.9, 1.8, 2.0]

# perform one-way ANOVA
f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# print results
print("F-statistic: ", f_stat)
print("p-value: ", p_value)

F-statistic:  212.56167792392472
p-value:  6.183862672650103e-31


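###### The F-statistic is about 212.56 and the p-value is about 6.18e-31. Note that the sample data contain 25 observations per diet (75 in total) rather than the 50 participants described in the question.

###### Since the p-value is far below 0.05, we can reject the null hypothesis that the means of the three diets are equal, and conclude that at least one diet differs from the others in mean weight loss. A post-hoc test would be needed to identify which diets differ.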

### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [8]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a dataframe with the time data and employee experience level
data = {'Time': [25, 27, 28, 23, 26, 28, 29, 22, 24, 27, 21, 23, 25, 26, 27, 23, 25, 24, 29, 22, 
                 28, 26, 23, 24, 27, 21, 23, 25, 26, 27],
        'Program': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 
                    'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'Experience': ['Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 
                       'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 
                       'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 
                       'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Novice', 
                       'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice']}

df = pd.DataFrame(data)

# perform two-way ANOVA
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# print results
print(anova_table)

                          sum_sq    df             F    PR(>F)
Program             4.266667e+00   2.0  4.169866e-01  0.524103
Experience         -4.522794e-15   1.0 -8.840365e-16  1.000000
Program:Experience  2.031548e+01   2.0  1.985457e+00  0.157599
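Residual            1.330179e+02  26.0           NaN       NaN

#### None of the effects is significant at the 0.05 level (Program: F ≈ 0.42, p ≈ 0.52; Experience: F ≈ 0, p = 1.00; Program:Experience: F ≈ 1.99, p ≈ 0.16), so this data set gives no evidence that completion time depends on the program, the experience level, or their combination.

#### These numbers should be read with caution. In this data set every Program A employee is experienced and every Program B employee is a novice, so two of the six Program × Experience cells are empty and the two factors are partly confounded. That is the likely reason for the slightly negative sum of squares reported for Experience. A design with observations in every cell is needed to separate the main effects from the interaction.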




### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [13]:
import numpy as np
from scipy.stats import ttest_ind, f_oneway, tukey_hsd

# generate example data
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

# conduct two-sample t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

# print results
print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

# conduct post-hoc Tukey test if significant differences found
# (with only two groups this repeats the comparison already made by the t-test)
if p_value < 0.05:
    # tukey_hsd takes one array per group, not a pooled array plus labels
    tukey_results = tukey_hsd(control_scores, experimental_scores)
    print(tukey_results)

t-statistic:  -1.8391300533320496
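p-value:  0.06892390646076188

#### For this run the p-value (about 0.069) is above 0.05, so we fail to reject the null hypothesis: the sample does not show a statistically significant difference in mean test scores between the two groups. Because the example data are drawn at random without a fixed seed, rerunning the cell will give different values. With only two groups, a significant t-test already identifies the pair that differs, so a post-hoc test adds no further information.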


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [18]:
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# generate example data
store_a_sales = np.random.normal(loc=1000, scale=100, size=30)
store_b_sales = np.random.normal(loc=1100, scale=100, size=30)
store_c_sales = np.random.normal(loc=1200, scale=100, size=30)

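# conduct one-way ANOVA
# (note: f_oneway treats the stores as independent groups; a repeated measures
# ANOVA would also account for the fact that all stores share the same 30 days)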
f_statistic, p_value = f_oneway(store_a_sales, store_b_sales, store_c_sales)

# print results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)

# conduct post-hoc Tukey test if significant differences found
if p_value < 0.05:
    # tukey_hsd takes one array per group, not a pooled array plus labels
    tukey_results = tukey_hsd(store_a_sales, store_b_sales, store_c_sales)
    print(tukey_results)

F-statistic:  32.41263035612066
p-value:  3.024279614506088e-11
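
#### Because the same days are observed for every store, a repeated measures analysis treats each day as a subject and removes day-to-day variation from the error term. The sketch below computes a one-way repeated measures ANOVA by hand with NumPy on a small made-up table (5 days by 3 stores; all values are hypothetical), assuming complete data and the usual sphericity assumption. statsmodels also offers an `AnovaRM` class for the same purpose.

```python
import numpy as np
from scipy import stats

# Hypothetical daily sales: rows are days (subjects), columns are Stores A, B, C.
sales = np.array([
    [1000.0, 1100.0, 1200.0],
    [ 950.0, 1080.0, 1150.0],
    [1020.0, 1090.0, 1230.0],
    [ 980.0, 1120.0, 1180.0],
    [1050.0, 1110.0, 1240.0],
])
n_days, n_stores = sales.shape

grand = sales.mean()
store_means = sales.mean(axis=0)   # one mean per store
day_means = sales.mean(axis=1)     # one mean per day

ss_total = np.sum((sales - grand) ** 2)
# Variation between stores (the effect of interest).
ss_stores = n_days * np.sum((store_means - grand) ** 2)
# Variation between days, removed from the error term.
ss_days = n_stores * np.sum((day_means - grand) ** 2)
# What remains after removing store and day effects.
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

f_value = (ss_stores / df_stores) / (ss_error / df_error)
p_value_rm = stats.f.sf(f_value, df_stores, df_error)

print(f"SS stores = {ss_stores:.1f}, SS days = {ss_days:.1f}, SS error = {ss_error:.1f}")
print(f"F({df_stores}, {df_error}) = {f_value:.2f}, p = {p_value_rm:.3g}")
```

#### Compared with the one-way ANOVA, the error term here excludes differences between days, so consistent store differences are easier to detect when sales rise and fall together across stores.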