In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Ans. ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. To use ANOVA, the following assumptions need to be met:

Independence: The observations must be independent of one another, both within each group and across groups.
Normality: The residuals (equivalently, the data within each group) should be approximately normally distributed.
Homogeneity of variance: The population variances of the groups should be equal.

Violations of these assumptions can impact the validity of the ANOVA results. Here are some examples of violations and their 
impact on the results:

Violation of independence: If the observations are not independent (for example, they are positively correlated), the within-group
variance will underestimate the true variability of the group means, leading to an inflated F-value and a higher chance of a 
Type I error. For example, if data is collected from siblings or from repeated measurements of the same person, the independence 
assumption may be violated, and the ANOVA results may not be valid.

Violation of normality: ANOVA is fairly robust to moderate departures from normality when the groups are large and of similar size,
but heavy skew or extreme outliers, particularly in small samples, can distort the Type I error rate of the F-test and reduce its 
power. For example, if the data has extreme outliers or is heavily skewed, the normality assumption may be violated.

Violation of homogeneity of variance: If the variance within each group is not equal, the pooled within-group variance no longer 
represents every group, so the actual Type I error rate can be higher or lower than the nominal level, especially when the group 
sizes are also unequal. For example, if the variance in one group is much larger than in the other groups, the homogeneity of 
variance assumption may be violated.

It's important to check for these assumptions before conducting ANOVA and to take steps to address violations if they are found. 
For example, if the normality assumption is violated, transformations such as log or square root transformations may be used to 
normalize the data. If the homogeneity of variance assumption is violated, alternative methods such as Welch's ANOVA or a non-parametric
test may be used.
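
The checks described above can be sketched in Python. This is a minimal illustration on synthetic data, not a prescribed procedure: the group names, means, and sample sizes are assumptions made up for the example. It uses the Shapiro-Wilk test for normality within each group and Levene's test for equality of variances.

```python
import numpy as np
from scipy import stats

# Synthetic data for three hypothetical groups (all values are assumptions)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(10.0, 2.0, 30),
    "B": rng.normal(11.0, 2.0, 30),
    "C": rng.normal(12.0, 2.0, 30),
}

# Normality: Shapiro-Wilk test within each group (small p suggests non-normality)
normality_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    normality_p[name] = p

# Homogeneity of variance: Levene's test across groups (small p suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups.values())

print("Shapiro-Wilk p-values:", normality_p)
print("Levene p-value:", levene_p)
```

Independence cannot be tested this way; it has to be judged from how the data were collected.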

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?
Ans. The three types of ANOVA are:

One-way ANOVA: This type of ANOVA is used when there is one independent variable (factor) with two or more groups, most commonly 
three or more, since two groups can also be compared with a t-test. It is used to test whether there are any significant differences 
between the means of the groups. One-way ANOVA is commonly used in experimental designs where one factor is manipulated to test 
its effect on the dependent variable.

Two-way ANOVA: This type of ANOVA is used when there are two independent variables, each with two or more levels. It is used to test
the main effects of each independent variable as well as their interaction effect on the dependent variable. Two-way ANOVA is commonly 
used in experimental designs to test the effects of two different factors on the dependent variable.

Mixed ANOVA: This type of ANOVA is used when there are two or more independent variables, with at least one being a within-subjects
factor (i.e., measured on the same subjects under different conditions) and at least one being a between-subjects factor 
(i.e., measured on different subjects in different conditions). Mixed ANOVA is commonly used in research designs where both 
within-subject and between-subject factors are manipulated, such as in repeated measures experiments.

In summary, one-way ANOVA is used when there is one independent variable with several groups (typically three or more), two-way 
ANOVA is used when there are two independent variables, each with two or more levels, and mixed ANOVA is used when there are two 
or more independent variables, with at least one being a within-subjects factor and at least one being a between-subjects factor.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Ans. Partitioning of variance in ANOVA refers to the division of the total variance in the data into different sources of variance, 
such as the variability within groups and the variability between groups. This division is essential because it allows us to
understand the contributions of different sources of variance to the overall variance in the data and to determine whether
any of the sources of variance are significant predictors of the outcome variable.

The partitioning of variance in ANOVA is done by decomposing the total sum of squares (SST) into two components: the sum of 
squares between groups (SSB) and the sum of squares within groups (SSW), so that SST = SSB + SSW. The sum of squares between groups 
measures the variation in the outcome variable that can be attributed to differences between the groups being compared. The sum of 
squares within groups measures the variation in the outcome variable that cannot be explained by differences between the groups.

Each sum of squares is divided by its degrees of freedom to give a mean square: MSB = SSB / (k - 1) and MSW = SSW / (N - k), where
k is the number of groups and N is the total number of observations. The ratio F = MSB / MSW, known as the F-statistic, is used to 
test whether the means of the groups differ. If the F-statistic is large enough that the associated p-value falls below the chosen 
significance level, we can conclude that at least one group mean differs from the others.

Understanding the concept of partitioning of variance in ANOVA is essential because it helps us determine whether the differences
observed between groups are due to chance or whether they are statistically significant. By partitioning the variance into different
sources, we can also identify which factors are driving the differences between groups, allowing us to draw more meaningful conclusions
from our data.
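
The decomposition can be verified numerically. The sketch below uses three small made-up groups (the values are assumptions for illustration) to compute SST, SSB, and SSW by hand, and compares the resulting F-statistic with the one returned by scipy.stats.f_oneway.

```python
import numpy as np
from scipy import stats

# Three small hypothetical groups (values are made up for illustration)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 6.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group, and within-group sums of squares
sst = ((all_values - grand_mean) ** 2).sum()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F-statistic from the mean squares
k, n = len(groups), len(all_values)
f_manual = (ssb / (k - 1)) / (ssw / (n - k))

# Reference value from SciPy
f_scipy, p_value = stats.f_oneway(*groups)

print("SST:", sst, "SSB + SSW:", ssb + ssw)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```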

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

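Ans. The code below assumes a file 'data.csv' with a numeric column 'response' and a categorical column 'group'. Following the
naming in the question, SSE is the explained (between-group) sum of squares and SSR is the residual (within-group) sum of squares.

import pandas as pd
from statsmodels.formula.api import ols

# Load data
data = pd.read_csv('data.csv')

# Fit one-way ANOVA model; C() ensures 'group' is treated as categorical
model = ols('response ~ C(group)', data=data).fit()
grand_mean = data['response'].mean()

# Calculate SST: squared deviations of each observation from the grand mean
SST = ((data['response'] - grand_mean) ** 2).sum()

# Calculate SSE: squared deviations of the fitted values (group means) from the grand mean
SSE = ((model.fittedvalues - grand_mean) ** 2).sum()

# Calculate SSR: squared deviations of each observation from its fitted value
SSR = ((data['response'] - model.fittedvalues) ** 2).sum()

# SST should equal SSE + SSR up to floating-point error
print('SST:', SST)
print('SSE:', SSE)
print('SSR:', SSR)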

In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
Ans. In a two-way ANOVA, the main effects and the interaction effect are tested through the ANOVA table of a model that contains 
both factors and their interaction. The code below assumes 'data.csv' has the columns 'response', 'factor1', and 'factor2'.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
data = pd.read_csv('data.csv')

# Fit two-way ANOVA model with both main effects and their interaction
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=data).fit()

# ANOVA table: one row per main effect, one for the interaction, one for the residual
anova_table = sm.stats.anova_lm(model, typ=2)

# Sum of squares, F-statistic, and p-value for each effect
columns = ['sum_sq', 'F', 'PR(>F)']
print('Main effect of factor 1:', anova_table.loc['C(factor1)', columns].to_dict())
print('Main effect of factor 2:', anova_table.loc['C(factor2)', columns].to_dict())
print('Interaction effect:', anova_table.loc['C(factor1):C(factor2)', columns].to_dict())

# The coefficients are contrasts against each factor's reference level,
# so the reference level itself has no entry in model.params
print(model.params)
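
To make the quantities behind the main and interaction effects concrete, here is a minimal sketch that computes the sums of squares by hand with pandas. It assumes a balanced design (the same number of observations in every cell); the 2x2 data set and its values are hypothetical. In a balanced design the interaction sum of squares is what remains of the between-cell variation after both main effects are removed.

```python
import pandas as pd

# Hypothetical balanced 2x2 design with two observations per cell
data = pd.DataFrame({
    "factor1": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "factor2": ["X", "X", "Y", "Y", "X", "X", "Y", "Y"],
    "response": [10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 30.0, 32.0],
})
grand_mean = data["response"].mean()


def between_ss(keys):
    """Sum of squares of the group means (grouped by `keys`) around the grand mean."""
    summary = data.groupby(keys)["response"].agg(["mean", "count"])
    return (summary["count"] * (summary["mean"] - grand_mean) ** 2).sum()


# Main effects: variation of each factor's level means around the grand mean
ss_factor1 = between_ss("factor1")
ss_factor2 = between_ss("factor2")

# Interaction: between-cell variation not explained by the two main effects
ss_cells = between_ss(["factor1", "factor2"])
ss_interaction = ss_cells - ss_factor1 - ss_factor2

# Residual: variation of observations around their own cell mean
cell_means = data.groupby(["factor1", "factor2"])["response"].transform("mean")
ss_within = ((data["response"] - cell_means) ** 2).sum()
ss_total = ((data["response"] - grand_mean) ** 2).sum()

print(ss_factor1, ss_factor2, ss_interaction, ss_within, ss_total)
```

For unbalanced data this simple subtraction no longer applies, which is one reason to rely on a fitted model and its ANOVA table instead.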

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Ans. If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude that, at the 
conventional 0.05 significance level, there is evidence of significant differences between the group means.

The F-statistic compares the variability between the groups to the variability within the groups. A high F-value indicates that
there is more variability between the groups than within the groups, which suggests that the group means are not all equal.
The p-value is the probability of obtaining an F-value at least as large as the one observed if there were no true differences 
between the groups. A low p-value (in this case, 0.02) indicates that differences this large would be unlikely under that assumption.

To interpret these results, we can say that there is evidence to reject the null hypothesis that the means of all the groups are equal.
However, the ANOVA does not tell us which specific groups are different from each other. To determine this, we would need to conduct 
post-hoc tests such as Tukey's HSD or pairwise comparisons with a Bonferroni correction.

In summary, an F-statistic of 5.23 with a p-value of 0.02 suggests that there are significant differences between the groups, but 
further analysis is needed to determine which specific groups are different from each other.
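
The link between the F-statistic and its p-value can be sketched with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (3 groups and 15 observations in total) are assumptions chosen for illustration; with them the p-value comes out close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumed: 3 groups - 1
df_within = 12   # assumed: 15 observations - 3 groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value for a test at the 0.05 significance level
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_value:.2f}")
print("reject null hypothesis:", f_statistic > critical_value)
```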

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Ans. In a repeated measures ANOVA, missing data can arise if some participants did not provide complete data for all time points
or conditions. There are different methods to handle missing data, and the appropriate method depends on the nature of the missingness
and the assumptions of the analysis.

One common method to handle missing data is to use listwise deletion, which involves excluding participants with any missing data
from the analysis. This method is straightforward, but it reduces the sample size and therefore the power of the analysis, and it 
can bias the results unless the data are missing completely at random (MCAR). If the data are missing at random (MAR) or missing 
not at random (MNAR), the remaining complete cases are no longer representative of the full sample.

Another method is to impute the missing data, which involves replacing missing values with estimated values based on observed data.
There are several ways to impute missing data, such as mean imputation, regression imputation, or multiple imputation. Imputation 
keeps all participants in the analysis and so preserves power, but every method relies on assumptions about the missing data 
mechanism. Single-imputation methods such as mean imputation understate the variability in the data and can make the results look 
more precise than they are, whereas multiple imputation carries the uncertainty about the missing values through to the final estimates.

The consequences of using different methods to handle missing data can be significant, as they can affect the validity and reliability
of the analysis. In general, it is important to carefully evaluate the nature of the missing data and choose an appropriate method 
that minimizes bias and maximizes power. Sensitivity analyses can also be used to assess the robustness of the results to different
methods of handling missing data.
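
The difference between listwise deletion and mean imputation can be illustrated with pandas. This is a small sketch on made-up data: the subject labels, the time-point columns t1 to t3, and the values are all assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: drop every subject with at least one missing value
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = wide.fillna(wide.mean())

# Mean imputation shrinks the variance of the imputed column
var_t2_observed = wide["t2"].var()
var_t2_imputed = mean_imputed["t2"].var()

print("subjects kept by listwise deletion:", len(complete_cases), "of", len(wide))
print("variance of t2 (observed only):", var_t2_observed)
print("variance of t2 (after mean imputation):", var_t2_imputed)
```

Here listwise deletion discards half of the subjects, while mean imputation keeps all of them at the cost of an artificially small variance.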

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Ans. Post-hoc tests are used in ANOVA to make pairwise comparisons between groups after a significant omnibus F-test. Some common 
post-hoc tests include Tukey's Honestly Significant Difference (HSD) test, Scheffe's test, Bonferroni correction, and Dunnett's test.

Tukey's HSD test compares every pair of group means while controlling the family-wise error rate (FWER) at a specific level. 
It was designed for equal sample sizes, and the Tukey-Kramer extension handles unequal sample sizes. It is a common default when 
all pairwise comparisons are of interest.

Scheffe's test is more flexible than Tukey's HSD test: it controls the FWER for all possible contrasts, not just pairwise 
comparisons, and it can be used when the sample sizes are unequal. Because it protects against such a large family of comparisons, 
it is the most conservative of these tests and is less powerful for simple pairwise comparisons. Like the ANOVA itself, it assumes 
homogeneous variances; when the variances are unequal, a procedure such as the Games-Howell test is more suitable.

Bonferroni correction is a conservative method that adjusts the significance level for each comparison by dividing the overall alpha 
level by the number of comparisons. It is simple and general, and it works well for a small number of planned comparisons, but it 
becomes overly conservative as the number of comparisons grows.

Dunnett's test is used to compare several treatment groups to a control group. It is useful when there is a clear control group and 
a smaller number of treatment groups. This test controls the Type I error rate when multiple comparisons are made.

An example of a situation where a post-hoc test might be necessary is in a study that compares the effectiveness of different types
of medication in reducing symptoms of depression. After conducting an ANOVA, if the overall F-test is significant, a post-hoc test
can be used to identify which medication groups differ significantly from each other. Tukey's HSD test can be used if all pairwise 
comparisons are of interest, Scheffe's test can be used if more complex contrasts are also needed, Bonferroni correction can be 
used if only a few comparisons were planned in advance, and Dunnett's test can be used if the treatments are compared to
a control group.
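
As a rough illustration of the Bonferroni approach, the sketch below runs all pairwise t-tests on synthetic data and compares each p-value with the adjusted significance level. The group names, means, and sample sizes are hypothetical and chosen only to make the example run.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Synthetic symptom scores for three hypothetical groups
rng = np.random.default_rng(42)
groups = {
    "Drug A": rng.normal(20.0, 4.0, 25),
    "Drug B": rng.normal(17.0, 4.0, 25),
    "Placebo": rng.normal(22.0, 4.0, 25),
}

alpha = 0.05
pairs = list(combinations(groups, 2))

# Bonferroni: divide the overall alpha by the number of comparisons
adjusted_alpha = alpha / len(pairs)

results = {}
for first, second in pairs:
    _, p_value = stats.ttest_ind(groups[first], groups[second])
    results[(first, second)] = (p_value, bool(p_value < adjusted_alpha))

for (first, second), (p_value, significant) in results.items():
    print(f"{first} vs {second}: p = {p_value:.4f}, significant = {significant}")
```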

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

Ans. To conduct a one-way ANOVA using Python, we first need to import the necessary libraries and load the data. Let's 
assume that the data is stored in a CSV file named "diet_data.csv" and the columns are labeled "Diet" and "Weight_Loss".
Here's how we can perform the analysis:

import pandas as pd
import scipy.stats as stats

# Load the data
data = pd.read_csv("diet_data.csv")

# Conduct the one-way ANOVA
f_stat, p_value = stats.f_oneway(data[data['Diet'] == 'A']['Weight_Loss'],
                                 data[data['Diet'] == 'B']['Weight_Loss'],
                                 data[data['Diet'] == 'C']['Weight_Loss'])

# Display the results
print("F-statistic:", f_stat)
print("p-value:", p_value)
This code first loads the data and then uses the f_oneway() function from the scipy.stats library to conduct the one-way ANOVA. 
The arguments to the function are the weight loss data for each diet, grouped by the "Diet" column in the data frame.

The output should display the F-statistic and p-value for the ANOVA. To interpret the results, we can compare the p-value to the
chosen significance level (typically 0.05). If the p-value is less than the significance level, we can reject the null hypothesis
that the mean weight loss is the same for all three diets, and conclude that there are significant differences between the diets. 
The F-statistic measures the ratio of the between-group variance to the within-group variance, and a larger F-statistic indicates
greater differences between the group means.

For example, if the output shows a significant p-value, we can conclude that at least one of the diets had a significantly different
mean weight loss than the others. We can then perform post-hoc tests (e.g., Tukey's HSD test) to determine which diets differ 
significantly from each other.
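
Because "diet_data.csv" is an assumed file, the same analysis can be tried on synthetic data. The sketch below generates 50 made-up weight-loss values (the group sizes, means, and spread are assumptions) and passes one sample per diet to f_oneway() using groupby, which avoids naming each diet by hand.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the assumed diet_data.csv (50 participants)
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "Diet": ["A"] * 17 + ["B"] * 17 + ["C"] * 16,
    "Weight_Loss": np.concatenate([
        rng.normal(3.0, 1.5, 17),
        rng.normal(4.0, 1.5, 17),
        rng.normal(5.0, 1.5, 16),
    ]),
})

# One array of weight-loss values per diet
samples = [group["Weight_Loss"].to_numpy() for _, group in data.groupby("Diet")]

f_stat, p_value = stats.f_oneway(*samples)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")
```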

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

Ans. To conduct a two-way ANOVA using Python, we first need to import the necessary libraries and load the data.
Let's assume that the data is stored in a CSV file named "task_completion_times.csv" and the columns are labeled "Program",
"Experience_Level", and "Completion_Time". Here's how we can perform the analysis:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data
data = pd.read_csv("task_completion_times.csv")

# Conduct the two-way ANOVA
model = ols("Completion_Time ~ Program + Experience_Level + Program*Experience_Level", data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the results
print(anova_table)
This code first loads the data and then constructs an ANOVA model using the ols() function from the statsmodels.formula.api library. 
The formula "Completion_Time ~ Program + Experience_Level + Program*Experience_Level" specifies that we want to test the effects
of the "Program" variable, the "Experience_Level" variable, and their interaction on the "Completion_Time" variable. In the formula
syntax, Program*Experience_Level already expands to both main effects plus their interaction, so the explicit main-effect terms 
are redundant but harmless. The fit() function fits the model to the data.

The anova_lm() function from the statsmodels.api library is then used to compute the ANOVA table. The argument typ=2 requests 
Type II sums of squares, in which each main effect is tested after adjusting for the other main effect but not for the interaction. 
For balanced designs, Type I, II, and III sums of squares give the same results. The output should display the ANOVA table with the 
sums of squares, degrees of freedom, F-statistics, and p-values for each main effect and for the interaction.

To interpret the results, we can look at the p-values for each main effect and for the interaction. If a p-value is less than the 
chosen significance level (typically 0.05), we can reject the null hypothesis that the corresponding term has no effect on the 
completion time.

For example, if the ANOVA table shows a significant main effect for the "Program" variable, we can conclude that there are
significant differences in completion time between the three software programs. If the table also shows a significant interaction
effect between "Program" and "Experience_Level", we can conclude that the effect of the software program on completion time depends 
on the employee's experience level. We can further explore these effects using post-hoc tests to determine which levels of each 
factor differ significantly from each other.
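
When an interaction is present, a table of cell means is a simple way to see what it looks like. The sketch below builds such a table with pandas; the data set is hypothetical, with two made-up completion times per combination of program and experience level.

```python
import pandas as pd

# Hypothetical completion times: two employees per Program x Experience_Level cell
data = pd.DataFrame({
    "Program": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "Experience_Level": ["Novice", "Novice", "Experienced", "Experienced"] * 3,
    "Completion_Time": [30.0, 32.0, 20.0, 22.0, 28.0, 26.0, 21.0, 19.0, 40.0, 42.0, 18.0, 20.0],
})

# Mean completion time for every combination of the two factors
cell_means = data.pivot_table(
    index="Program", columns="Experience_Level", values="Completion_Time", aggfunc="mean"
)

# How much slower novices are than experienced employees, per program
novice_gap = cell_means["Novice"] - cell_means["Experienced"]

print(cell_means)
print(novice_gap)
```

In this made-up data the gap between novices and experienced employees is much larger for Program C than for the other programs, which is the pattern an interaction effect describes.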

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

Ans. To conduct a two-sample t-test using Python, we first need to import the necessary libraries and load the data.
Let's assume that the test scores are stored in a CSV file named "test_scores.csv" and the columns are labeled "Group" and
"Score". Here's how we can perform the analysis:

import pandas as pd
import scipy.stats as stats

# Load the data
data = pd.read_csv("test_scores.csv")

# Conduct the two-sample t-test
control_scores = data.loc[data["Group"] == "Control", "Score"]
experimental_scores = data.loc[data["Group"] == "Experimental", "Score"]
t, p = stats.ttest_ind(control_scores, experimental_scores)

# Display the results
print("t-statistic: {:.2f}".format(t))
print("p-value: {:.4f}".format(p))
This code first separates the test scores into two groups based on their assigned group, then performs a two-sample t-test 
using the ttest_ind() function from the scipy.stats library. The output should display the t-statistic and p-value of the test.

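If the results of the t-test are significant (i.e., if the p-value is less than the chosen significance level, typically 0.05),
the two group means differ. Because there are only two groups, the t-test itself already identifies which groups differ, and the 
group means show which one scored higher; a post-hoc test is only needed when three or more groups are compared. For completeness, 
Tukey's HSD test can still be run, but with two groups it reports a single comparison that agrees with the t-test: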

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=data["Score"], groups=data["Group"], alpha=0.05)

# Display the results
print(tukey.summary())
This code performs Tukey's HSD test on the "Score" variable with the "Group" variable as the grouping factor. The "alpha=0.05"
argument specifies the family-wise significance level. The output should display a table with one row per pairwise comparison, 
showing the difference in means, the adjusted p-value, the confidence interval bounds, and whether the null hypothesis is rejected. 
With only two groups, this table contains a single row.
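
The relationship between the two-sample t-test and a one-way ANOVA on two groups can be checked numerically. In the sketch below, which uses synthetic scores (the means, spread, and sample sizes are assumptions), the F-statistic equals the square of the pooled-variance t-statistic and the p-values match. It also shows Welch's t-test via equal_var=False, an option when the two groups may have unequal variances.

```python
import numpy as np
from scipy import stats

# Synthetic test scores for two hypothetical groups of 50 students
rng = np.random.default_rng(7)
control = rng.normal(70.0, 10.0, 50)
experimental = rng.normal(75.0, 10.0, 50)

# Pooled-variance two-sample t-test and one-way ANOVA on the same two groups
t_stat, p_t = stats.ttest_ind(control, experimental)
f_stat, p_f = stats.f_oneway(control, experimental)

# Welch's t-test does not assume equal variances
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"p (t-test) = {p_t:.6f}, p (ANOVA) = {p_f:.6f}")
print(f"Welch t = {t_welch:.4f}, p = {p_welch:.6f}")
```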

In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

Ans. To conduct a repeated measures ANOVA using Python, we first need to import the necessary libraries and load the data. 
Let's assume that the data is stored in a CSV file named "sales_data.csv" and the columns are labeled "Store", "Day", and "Sales".
Here's how we can perform the analysis:

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Load the data (long format: one row per store per day)
data = pd.read_csv("sales_data.csv")

# Conduct the repeated measures ANOVA
# Each day acts as the "subject": it is measured once for every store
rm = AnovaRM(data, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm.anova_table)
This code fits a repeated measures ANOVA with "Store" as the within-subjects factor and "Day" as the subject identifier, because the same 30 days are measured for all three stores. AnovaRM requires a balanced design, so every day must have exactly one sales value for every store. The output should display the results of the ANOVA for the "Store" effect, including the F-statistic, the numerator and denominator degrees of freedom, and the p-value.

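If the results of the ANOVA are significant, we can follow up with a post-hoc test to determine which store(s) differ significantly from each other. For example, let's say we want to use Tukey's HSD test. Note that Tukey's HSD treats the three stores as independent groups and ignores the fact that the same days were measured for each store, so it is only an approximation here; paired comparisons with a correction for multiple testing respect the repeated measures structure.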

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=data["Sales"], groups=data["Store"], alpha=0.05)

# Display the results
print(tukey.summary())
This code performs Tukey's HSD test on the "Sales" variable with the "Store" variable as the grouping factor. The "alpha=0.05" 
argument specifies the family-wise significance level. The output should display a table with one row per pair of stores, showing 
the difference in means, the adjusted p-value, the confidence interval bounds, and whether the null hypothesis is rejected. We can 
use this table to determine which store(s) differ significantly from each other.
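
A post-hoc approach that keeps the pairing by day is to run paired t-tests between stores and apply a Bonferroni correction. The sketch below is one possible version on synthetic data: the store means, the shared day-to-day variation, and the noise levels are all assumptions. With real long-format data, a wide table like this one could be obtained with data.pivot(index="Day", columns="Store", values="Sales").

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic wide-format sales: one row per day, one column per store
rng = np.random.default_rng(3)
day_effect = rng.normal(0.0, 50.0, 30)  # variation shared by all stores on a given day
wide = pd.DataFrame({
    "Store A": 500.0 + day_effect + rng.normal(0.0, 20.0, 30),
    "Store B": 520.0 + day_effect + rng.normal(0.0, 20.0, 30),
    "Store C": 560.0 + day_effect + rng.normal(0.0, 20.0, 30),
})

pairs = list(combinations(wide.columns, 2))

# Paired t-test per pair of stores, with Bonferroni-adjusted p-values (capped at 1)
adjusted_p = {}
for first, second in pairs:
    _, p_value = stats.ttest_rel(wide[first], wide[second])
    adjusted_p[(first, second)] = min(p_value * len(pairs), 1.0)

for (first, second), p_adj in adjusted_p.items():
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f}")
```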