Q1. Explain the assumptions required to use ANOVA and give examples of violations that could affect the
validity of the results.

#Answer

ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups to determine if there are significant differences among them. To ensure the validity of ANOVA results, certain assumptions need to be met. Violations of these assumptions can affect the accuracy and reliability of the ANOVA analysis. The three main assumptions for ANOVA are:

1. Independence: The observations should be independent of one another, both within each group and across groups. This means that no data point should be influenced by or related to any other data point. Violations occur when there is dependency or correlation among the observations, such as in repeated measures or matched-pairs designs.

2. Normality: The residuals (differences between observed values and their group means) should be approximately normally distributed, which is equivalent to the data within each group following a normal distribution around the group mean. Violations occur when the data are strongly skewed or heavy-tailed, deviating from the bell-shaped curve. This can affect the accuracy of the p-values and confidence intervals obtained from ANOVA, particularly with small samples.

3. Homogeneity of Variance: The variances within each group should be approximately equal. This assumption is known as homoscedasticity. Violations of this assumption, called heteroscedasticity, occur when the variability in one group is substantially different from the variability in other groups. Heteroscedasticity distorts the F-test, so the actual Type I error rate can differ from the nominal significance level, especially when group sizes are unequal.

Examples of violations impacting the validity of ANOVA results:

1. Non-independence: If the observations within groups are not independent, such as when measurements are taken from the same individual over time, the independence assumption is violated. This leads to correlated errors, underestimated standard errors, and an inflated Type I error rate.

2. Non-normality: When the data within groups are not normally distributed, it can impact the validity of p-values and confidence intervals. For example, if the data are heavily skewed or have outliers, it can lead to incorrect conclusions regarding group differences.

3. Heteroscedasticity: When the variability within groups is not equal, it violates the assumption of homogeneity of variance. This violation can affect the precision of estimates and lead to incorrect inferences about group differences. For instance, if one group has significantly higher variance than others, it may have a larger influence on the ANOVA results.

It's important to assess these assumptions before conducting ANOVA. If they are violated, data transformations or alternative methods might be necessary: for example, Welch's ANOVA when variances are unequal, or a non-parametric test such as the Kruskal-Wallis test when normality is doubtful.
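As a rough illustration of how the normality and equal-variance assumptions might be checked, the sketch below runs a Shapiro-Wilk test on the residuals and Levene's test across groups using `scipy.stats`. The three groups are hypothetical values chosen for illustration, and independence is a property of the study design, so it is not tested here.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups (illustrative only)
group_a = [10, 12, 15, 13, 11]
group_b = [18, 20, 22, 19, 21]
group_c = [8, 9, 7, 11, 10]
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (value minus its group mean)
residuals = np.concatenate([np.asarray(g) - np.mean(g) for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test is evidence against the corresponding assumption; with samples this small, the tests have little power, so plots of the residuals are usually worth inspecting as well.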

                      -------------------------------------------------------------------

Q2. What are the three types of ANOVA, and in what situations would each be used?

#Answer

Analysis of Variance (ANOVA) is a statistical technique used to analyze the differences between group means. There are three main types of ANOVA:

1. One-Way ANOVA: This type of ANOVA is used when there is one independent variable (also called a factor) with two or more levels, and a continuous dependent variable. The one-way ANOVA is used to determine if there are any statistically significant differences between the means of the different levels of the independent variable. For example, you might use a one-way ANOVA to analyze the effect of different doses of a drug on blood pressure, where the independent variable is the drug dose (with levels such as low, medium, and high), and the dependent variable is the blood pressure measurement.

2. Two-Way ANOVA: In two-way ANOVA, there are two independent variables (factors), and the dependent variable is continuous. This type of ANOVA is used to analyze the effects of two independent variables and their interactions on the dependent variable. For example, you might use a two-way ANOVA to study the effects of both gender and treatment type on the outcome of a medical procedure. The independent variables would be gender (male/female) and treatment type (A/B), and the dependent variable would be the outcome measure.

3. Factorial ANOVA: Factorial ANOVA generalizes two-way ANOVA to any number of independent variables (factors) and their interactions. It is used when there are two or more independent variables, each with two or more levels, and a continuous dependent variable. Factorial ANOVA helps to examine how different factors, as well as their interactions, influence the dependent variable. For example, you might use factorial ANOVA to analyze the effects of age (young/old) and exercise type (A/B/C) on cardiovascular fitness. This 2 x 3 design is the simplest factorial case and coincides with a two-way ANOVA; each additional factor adds further main effects and higher-order interactions.

These three types of ANOVA provide different levels of complexity and flexibility in analyzing the relationships between variables in various experimental designs. The choice of ANOVA type depends on the specific research question, the number of independent variables, and the design of the study.

                      -------------------------------------------------------------------

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#Answer

The partitioning of variance in ANOVA refers to the division of the total variance observed in the data into different components that can be attributed to different sources. This partitioning allows for a systematic examination of the variability in the data and helps determine the relative contributions of various factors or sources of variation.

In ANOVA, the total variance in the data is decomposed into two main components: 

1. Between-group variance (or treatment effect): This component of variance represents the variability between different groups or levels of the independent variable(s). It indicates the extent to which the means of the groups differ from each other. If the between-group variance is large relative to the within-group variance, it suggests that there are significant differences between the groups.

2. Within-group variance (or error variance): This component of variance represents the variability within each group or level of the independent variable(s). It captures the random variability or noise within the groups that is not explained by the independent variable(s). It is also referred to as the error term because it represents the unexplained variability in the model.

Understanding the partitioning of variance is important for several reasons:

1. Identifying significant effects: By comparing the between-group variance with the within-group variance, ANOVA allows us to determine if there are statistically significant differences between the groups. If the between-group variance is large compared to the within-group variance, it suggests that the independent variable(s) has a significant effect on the dependent variable.

2. Quantifying effect sizes: The partitioning of variance provides information about the magnitude of the effects. Eta-squared is the ratio of the between-group sum of squares to the total sum of squares, and it indicates the proportion of the total variability accounted for by the independent variable. Partial eta-squared, used when there are several factors, divides an effect's sum of squares by the sum of that effect's and the error sum of squares.

3. Assessing the validity of the model: Understanding the partitioning of variance allows us to evaluate the adequacy of the ANOVA model. If the within-group variance is high relative to the between-group variance, it suggests that there may be unexplained variability or other factors influencing the dependent variable that are not accounted for in the model.

4. Designing future studies: Knowledge of the partitioning of variance can guide researchers in designing future studies. By understanding the relative contributions of different sources of variation, researchers can determine the sample size needed to detect significant effects or to estimate effect sizes accurately.

In summary, the partitioning of variance in ANOVA helps us understand the sources of variability in the data, identify significant effects, quantify effect sizes, assess model validity, and inform the design of future studies.
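The decomposition described above can be written out for a one-way design. The notation is introduced here for illustration and is not from the original text: there are $k$ groups, group $i$ has $n_i$ observations $y_{ij}$ with mean $\bar{y}_i$, the grand mean is $\bar{y}$, and $N$ is the total number of observations.

```latex
\begin{align}
SS_{\text{total}} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \\
&= \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2}_{SS_{\text{between}}}
 + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SS_{\text{within}}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
\end{align}
```

The F-statistic compares the two components after each is divided by its degrees of freedom, while eta-squared expresses the between-group component as a share of the total.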

                      -------------------------------------------------------------------

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [14]:
#Answer

import numpy as np
from scipy import stats

# Example data for three groups
group1 = [10, 12, 15, 13, 11]
group2 = [18, 20, 22, 19, 21]
group3 = [8, 9, 7, 11, 10]

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate the SST (Total Sum of Squares)
grand_mean = np.mean(data)
sst = np.sum((data - grand_mean) ** 2)

# Calculate the SSE (Explained Sum of Squares)
group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]
sse = np.sum([len(group) * (group_mean - grand_mean) ** 2 for group, group_mean in zip([group1, group2, group3], group_means)])

# Calculate the SSR (Residual Sum of Squares)
ssr = sst - sse

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)


SST: 354.93333333333334
SSE: 320.1333333333333
SSR: 34.80000000000001
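As a cross-check on the subtraction used above, the residual sum of squares can also be computed directly from the deviations of each value from its own group mean. The sketch below reuses the same three groups and the question's naming (SSE for explained, SSR for residual); the name `ssr_direct` and the F-statistic step are additions for illustration.

```python
import numpy as np

group1 = [10, 12, 15, 13, 11]
group2 = [18, 20, 22, 19, 21]
group3 = [8, 9, 7, 11, 10]
groups = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]
data = np.concatenate(groups)

grand_mean = data.mean()
sst = np.sum((data - grand_mean) ** 2)

# Explained (between-group) sum of squares
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Residual (within-group) sum of squares, computed directly
ssr_direct = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F-statistic: explained mean square over residual mean square
k, n_total = len(groups), len(data)
f_statistic = (sse / (k - 1)) / (ssr_direct / (n_total - k))

print("SSR (direct):", ssr_direct)
print("SSE + SSR equals SST:", np.isclose(sse + ssr_direct, sst))
print("F-statistic:", f_statistic)
```

The direct value agrees with `sst - sse`, which confirms that the two components add up to the total.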


                      -------------------------------------------------------------------

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data for two factors (A and B) and the response variable (Y)
data = {'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'B': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'Y': [4, 7, 9, 5, 8, 10, 6, 9, 12]}

df = pd.DataFrame(data)

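# Perform the two-way ANOVA.
# A and B are numeric here, so the formula treats them as continuous (linear)
# predictors with 1 degree of freedom each. Wrapping them as C(A) and C(B)
# would treat them as categorical factors, but that needs replicates in each
# cell; with one observation per cell no residual degrees of freedom remain.
model = ols('Y ~ A + B + A:B', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Mean square (sum of squares / degrees of freedom) for each main effect and
# for the interaction; anova_table also holds the F-statistics and p-values
main_effect_A = anova_table.loc['A', 'sum_sq'] / anova_table.loc['A', 'df']
main_effect_B = anova_table.loc['B', 'sum_sq'] / anova_table.loc['B', 'df']
interaction_effect = anova_table.loc['A:B', 'sum_sq'] / anova_table.loc['A:B', 'df']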

print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)


Main Effect A: 8.166666666666671
Main Effect B: 42.666666666666664
Interaction Effect: 0.24999999999999645
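For comparison, here is a minimal sketch of how the sums of squares for two categorical factors and their interaction could be computed by hand in a balanced design with replicates. The data array is hypothetical (2 levels of A, 3 levels of B, 2 replicates per cell), and the formulas below hold only when every cell has the same number of observations.

```python
import numpy as np

# Hypothetical balanced design: y[i, j, r] is replicate r in cell (A=i, B=j)
y = np.array([
    [[4.0, 5.0], [7.0, 8.0], [9.0, 10.0]],
    [[6.0, 5.0], [9.0, 8.0], [12.0, 13.0]],
])
a, b, n = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))  # marginal means for each level of A
mean_b = y.mean(axis=(0, 2))  # marginal means for each level of B
cell = y.mean(axis=2)         # mean of each A x B cell

ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((y - cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)

df_a, df_b = a - 1, b - 1
df_ab, df_error = (a - 1) * (b - 1), a * b * (n - 1)
ms_error = ss_error / df_error

print("F for A:", (ss_a / df_a) / ms_error)
print("F for B:", (ss_b / df_b) / ms_error)
print("F for A:B:", (ss_ab / df_ab) / ms_error)
```

Each F-statistic divides an effect's mean square by the error mean square, which is only defined here because the replicates leave residual degrees of freedom.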


                       -------------------------------------------------------------------

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In the scenario you described, conducting a one-way ANOVA resulted in an F-statistic of 5.23 and a p-value of 0.02. To interpret these results, we need to consider the significance of the F-statistic and the p-value.

1. Significance of the F-statistic:
The F-statistic is the ratio of two variance estimates: the between-group mean square divided by the within-group mean square. It measures how much the group means differ from each other relative to the variability within the groups. Under the null hypothesis this ratio is expected to be close to 1, so a value of 5.23 indicates that the group means vary more than within-group noise alone would predict. Whether that is statistically significant depends on the degrees of freedom, which the p-value takes into account.

2. Significance of the p-value:
The p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis is true. Here the null hypothesis states that all group means are equal. A p-value of 0.02 means that, if the null hypothesis were true, an F-statistic of 5.23 or larger would occur in about 2% of samples. It is not the probability that the null hypothesis is true.

Interpreting the results:

Given the obtained F-statistic and the p-value, we can draw the following conclusions:

1. Differences between the groups: Together, the F-statistic of 5.23 and its p-value of 0.02 indicate that at least one group mean differs from the others. They do not show which groups differ, nor the direction or magnitude of the differences.

2. Rejecting the null hypothesis: The p-value of 0.02 is less than the conventional significance level of 0.05. Therefore, we reject the null hypothesis that all group means are equal.

3. Practical significance: While the statistical test indicates that there are significant differences, it is also important to consider the practical significance or the real-world implications of these differences. The magnitude of the differences and their practical relevance should be evaluated in light of the specific context of the study.

In summary, based on the F-statistic and p-value obtained in the one-way ANOVA, you can conclude that there are statistically significant differences between the groups. However, further analysis and interpretation are necessary to understand the nature and practical significance of these differences.
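To show how an F-statistic is converted to a p-value, the sketch below uses the survival function of the F distribution in `scipy.stats`. The question does not state the degrees of freedom, so the values used here (3 groups and 18 observations) are hypothetical, chosen so that the p-value comes out close to the 0.02 reported.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom: k - 1 = 2 and N - k = 15
df_between, df_within = 2, 15

# P(F >= 5.23) under the null hypothesis that all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

alpha = 0.05
print("p-value:", p_value)
print("Reject H0 at alpha = 0.05:", p_value < alpha)
```

The same F-statistic gives a different p-value under other degrees of freedom, which is why the F value alone does not determine significance.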

                        -------------------------------------------------------------------

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA requires careful consideration to ensure valid and reliable results. There are several methods to handle missing data in this context, and the choice of method can have potential consequences. Here are some commonly used approaches:

1. Complete Case Analysis (Listwise deletion): This approach involves excluding any participant or case with missing data from the analysis. It only uses complete cases for the analysis, discarding incomplete cases. The main consequence of this method is a reduction in sample size, which can lead to decreased statistical power and potentially biased results if the missingness is related to the variables being analyzed.

2. Pairwise Deletion (Available Case Analysis): This method uses all available data for each comparison in the analysis. It analyzes each participant's available data points while excluding missing data points from specific comparisons. This approach maximizes the use of available data, but it can result in different sample sizes for different comparisons, which may affect the precision and reliability of the estimates.

3. Mean Substitution (Imputation): Mean imputation replaces missing data with the mean value of the variable. In repeated measures ANOVA, this typically means replacing a participant's missing values with that participant's mean score across the observed time points. This method assumes that the values are missing completely at random (MCAR). Even then it artificially reduces variability, which understates standard errors, and it introduces bias if the data are missing systematically.

4. Last Observation Carried Forward (LOCF): LOCF imputes missing data with the value of the last observed measurement. This approach assumes that the participant's missing value would be the same as the most recent measurement. However, LOCF may not accurately capture the true values and can lead to biased estimates, particularly if there is substantial variability within participants.

5. Multiple Imputation: Multiple imputation creates multiple plausible values for each missing data point based on the observed data. It accounts for the uncertainty associated with missing values and allows for valid statistical inference. Multiple imputation methods impute missing data based on patterns and relationships in the data, but it requires assumptions about the missing data mechanism.

The consequences of using different methods to handle missing data include potential biases in parameter estimates, standard errors, and hypothesis tests. The choice of method depends on the missing data pattern, the assumptions made about the missingness mechanism, and the specific goals of the analysis. It is crucial to consider the limitations and potential biases associated with each method and to perform sensitivity analyses to evaluate the robustness of the results across different missing data handling methods.
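The following sketch illustrates three of the approaches above (listwise deletion, participant-mean imputation, and LOCF) on a small hypothetical score matrix, with rows as participants and columns as time points. The data and variable names are made up for illustration; multiple imputation is omitted because it needs a dedicated modeling step.

```python
import numpy as np

# Hypothetical scores: rows are participants, columns are time points
scores = np.array([
    [10.0, 12.0, 14.0],
    [8.0, np.nan, 11.0],
    [9.0, 10.0, np.nan],
    [11.0, 13.0, 15.0],
])

# 1. Listwise deletion: keep only participants with no missing values
complete_cases = scores[~np.isnan(scores).any(axis=1)]

# 2. Mean imputation: fill each gap with that participant's observed mean
row_means = np.nanmean(scores, axis=1, keepdims=True)
mean_imputed = np.where(np.isnan(scores), row_means, scores)

# 3. LOCF: carry the previous time point forward into each gap
locf = scores.copy()
for t in range(1, locf.shape[1]):
    missing = np.isnan(locf[:, t])
    locf[missing, t] = locf[missing, t - 1]

print("Participants kept by listwise deletion:", complete_cases.shape[0])
print("Mean-imputed scores:\n", mean_imputed)
print("LOCF scores:\n", locf)
```

Listwise deletion halves the sample here, while the two imputation methods keep all four participants but fill the gaps with different values, so the resulting ANOVA can differ depending on the method.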

                        -------------------------------------------------------------------

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA and finding a statistically significant result, post-hoc tests are often used to determine which specific groups differ significantly from each other. Some common post-hoc tests used after ANOVA include:

1. Tukey's Honestly Significant Difference (HSD) Test: This test compares all possible pairs of group means and controls the familywise error rate. It assumes homogeneous variances across groups and was designed for equal sample sizes; the Tukey-Kramer variant extends it to unequal sample sizes.

2. Bonferroni Correction: Bonferroni correction adjusts the significance level of individual comparisons to maintain a desired familywise error rate. It is a conservative method that divides the overall significance level (e.g., 0.05) by the number of comparisons.

3. Dunnett's Test: Dunnett's test is used when you have one control group and want to compare it to multiple treatment groups. It controls the experimentwise error rate by comparing each treatment group to the control group.

4. Fisher's Least Significant Difference (LSD) Test: This test compares the means of all possible pairs of groups. It does not control the familywise error rate, making it less conservative than other tests. It is typically used for exploratory purposes or when there is a specific hypothesis about pairwise differences.

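5. Scheffe's Test: Scheffe's test is a conservative post-hoc test that can be used with unequal sample sizes, although it still assumes homogeneous variances across groups. It controls the familywise error rate for all possible contrasts among the group means, not only pairwise comparisons, which makes it less powerful than Tukey's HSD when only pairwise comparisons are of interest.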

Example scenario:
Let's say you conducted a study comparing the effectiveness of three different treatment approaches (A, B, and C) for reducing symptoms of a specific medical condition. After performing an ANOVA on the data, you found a significant difference among the treatment groups. In this case, you would need a post-hoc test to determine which specific treatment groups differ significantly from each other.

For example, you could use Tukey's HSD test to compare the means of all possible pairs of treatment groups. It would provide you with a set of confidence intervals and p-values for each comparison, allowing you to identify the specific pairs of treatments that are significantly different from each other. This information would help you understand which treatment approaches are more effective than others in reducing symptoms of the medical condition.

Overall, post-hoc tests are necessary to perform multiple pairwise comparisons after obtaining a significant result in an ANOVA, enabling a more detailed understanding of the specific group differences.
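As one concrete example of a post-hoc procedure, the sketch below applies a Bonferroni correction to pairwise t-tests between three treatments. The scores are hypothetical, and multiplying each raw p-value by the number of comparisons is equivalent to dividing the significance level by that number.

```python
from itertools import combinations
from scipy import stats

# Hypothetical symptom-reduction scores for treatments A, B, and C
treatments = {
    "A": [10, 12, 15, 13, 11],
    "B": [18, 20, 22, 19, 21],
    "C": [8, 9, 7, 11, 10],
}

pairs = list(combinations(treatments, 2))
alpha = 0.05
results = {}

for first, second in pairs:
    # Raw p-value from an independent two-sample t-test
    _, p_raw = stats.ttest_ind(treatments[first], treatments[second])
    # Bonferroni adjustment, capped at 1
    p_adj = min(1.0, p_raw * len(pairs))
    results[(first, second)] = p_adj
    print(first, "vs", second, "adjusted p =", round(p_adj, 4),
          "significant:", p_adj < alpha)
```

Because the adjustment grows with the number of comparisons, Bonferroni becomes increasingly conservative as more groups are added; Tukey's HSD is usually preferred when all pairwise comparisons are planned.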

                        -------------------------------------------------------------------

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [11]:
import numpy as np
from scipy import stats

# Example data for weight loss in three diets
diet_A = [2, 3, 4, 2, 5, 3, 4, 6, 2, 4, 3, 5, 4, 3, 2, 4, 5, 3, 4, 6, 2, 3, 4, 3, 5, 4, 2, 4, 3, 5, 3, 4, 2, 5, 3, 4, 6, 4, 3, 2, 4, 5, 4, 3, 2, 4, 5, 6, 4]
diet_B = [3, 4, 5, 3, 2, 4, 3, 5, 4, 2, 4, 3, 5, 4, 2, 3, 4, 5, 3, 4, 6, 2, 3, 4, 3, 5, 4, 2, 4, 3, 5, 3, 4, 2, 5, 3, 4, 6, 4, 3, 2, 4, 5, 4, 3, 2, 4, 5, 6, 4]
diet_C = [4, 5, 3, 4, 2, 3, 4, 5, 3, 2, 4, 3, 5, 4, 2, 3, 4, 5, 3, 4, 6, 2, 3, 4, 3, 5, 4, 2, 4, 3, 5, 3, 4, 2, 5, 3, 4, 6, 4, 3, 2, 4, 5, 4, 3, 2, 4, 5, 6, 4]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 0.0004079374570408328
p-value: 0.9995921468774791


To interpret the results:

If the p-value is less than the chosen significance level (e.g., 0.05), it suggests that at least one diet's mean weight loss differs from the others.

If the p-value is greater than the significance level, there is insufficient evidence to conclude that the mean weight loss differs between the diets.

For the example data, the analysis gives an F-statistic of about 0.0004 and a p-value of about 0.9996. Since the p-value is far greater than 0.05, we fail to reject the null hypothesis: these data provide no evidence of a difference in mean weight loss between diets A, B, and C.

                        -------------------------------------------------------------------

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [15]:
!pip install statsmodels

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset
data = pd.DataFrame({
    'Time': [12, 14, 15, 18, 10, 11, 9, 16, 14, 17, 10, 11, 13, 12, 15, 16, 9, 10, 11, 14, 13, 15, 19, 17, 13, 12, 11, 15, 16, 17],
    'Software': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'Experience': ['Novice', 'Experienced'] * 15
})

# Convert Experience column to categorical type
data['Experience'] = data['Experience'].astype('category')

# Perform two-way ANOVA
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(table)


                               sum_sq    df         F    PR(>F)
C(Software)                 20.095357   2.0  1.331308  0.282945
C(Experience)               22.123134   1.0  2.931295  0.099772
C(Software):C(Experience)    3.437977   2.0  0.227764  0.798013
Residual                   181.133333  24.0       NaN       NaN




We can interpret the results based on the p-values in the table, using a significance level of 0.05:

Main effect of Software (F = 1.33, p = 0.283): the p-value is above 0.05, so there is no significant difference in the average time to complete the task among the software programs.
Main effect of Experience (F = 2.93, p = 0.100): the p-value is above 0.05, so there is no significant difference in the average time to complete the task between novice and experienced employees.
Interaction effect (Software:Experience; F = 0.23, p = 0.798): the p-value is above 0.05, so there is no evidence that the effect of the software program on completion time depends on experience level. A significant interaction would have meant that the difference in average time among the software programs varies depending on whether the employee is a novice or experienced.
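The F column of the table can be reproduced from the other columns, which makes the link between sums of squares and F-statistics explicit. The sketch below copies the printed `sum_sq` and `df` values from the table; the dictionary names are introduced here for illustration.

```python
# Values copied from the printed ANOVA table
sum_sq = {
    "C(Software)": 20.095357,
    "C(Experience)": 22.123134,
    "C(Software):C(Experience)": 3.437977,
}
df = {
    "C(Software)": 2,
    "C(Experience)": 1,
    "C(Software):C(Experience)": 2,
}
ss_residual, df_residual = 181.133333, 24

# Each F is the effect's mean square divided by the residual mean square
ms_residual = ss_residual / df_residual
f_values = {term: (sum_sq[term] / df[term]) / ms_residual for term in sum_sq}

for term, f_value in f_values.items():
    print(term, "F =", round(f_value, 6))
```

The results match the F column of the table up to rounding of the printed inputs.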

                        -------------------------------------------------------------------

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [8]:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data for test scores in the control and experimental groups
control_group = [75, 80, 85, 90, 78, 82, 88, 92, 80, 85, 88, 95, 77, 82, 87, 93, 76, 81, 86, 92, 79, 84, 89, 96, 83, 88, 94, 97, 80, 85, 90, 98, 79, 84, 89, 94, 82, 87, 92, 100, 78, 83, 88, 95, 81, 86, 91, 98, 77, 82, 87, 94]

experimental_group = [80, 85, 90, 95, 79, 84, 89, 94, 81, 86, 91, 96, 78, 83, 88, 93, 77, 82, 87, 92, 76, 81, 86, 91, 75, 80, 85, 90, 74, 79, 84, 89, 73, 78, 83, 88, 72, 77, 82, 87, 71, 76, 81, 86, 70, 75, 80, 85, 69, 74, 79]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check if the results are significant (p-value < 0.05)
if p_value < 0.05:
    # Conduct post-hoc test (optional). With only two groups, Tukey's HSD
    # repeats the single comparison already made by the t-test.
    posthoc_tukey = pairwise_tukeyhsd(np.concatenate([control_group, experimental_group]), np.concatenate([['Control']*len(control_group), ['Experimental']*len(experimental_group)]))
    print(posthoc_tukey)


t-statistic: 3.1204197640191387
p-value: 0.0023546273994354727
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower   upper  reject
-----------------------------------------------------------
Control Experimental  -4.1063 0.0024 -6.7168 -1.4958   True
-----------------------------------------------------------
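The adjusted p-value in the Tukey table equals the t-test p-value after rounding because, with only two groups, the comparisons are equivalent. A related identity is that a one-way ANOVA on two groups gives F equal to the square of the pooled-variance t-statistic. The sketch below checks this on the first six scores of each group above; the short excerpts are used only to keep the example small.

```python
from scipy import stats

# First six scores from each group in the cell above
control = [75, 80, 85, 90, 78, 82]
experimental = [80, 85, 90, 95, 79, 84]

# Pooled-variance two-sample t-test (the default, equal_var=True)
t_statistic, p_t = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_statistic, p_f = stats.f_oneway(control, experimental)

print("t squared:", t_statistic ** 2)
print("F:", f_statistic)
print("p-values:", p_t, p_f)
```

This is why a follow-up post-hoc test adds no new information after a two-sample t-test: there is only one pair of groups to compare.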


                        -------------------------------------------------------------------

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data for daily sales in Store A, Store B, and Store C
store_A = [100, 105, 98, 102, 108, 105, 103, 100, 99, 105, 102, 106, 105, 108, 107, 110, 108, 103, 105, 109, 112, 115, 113, 110, 105, 108, 102, 100, 106, 109]
store_B = [98, 100, 103, 102, 101, 104, 102, 100, 99, 97, 98, 96, 98, 102, 100, 99, 101, 104, 102, 100, 99, 98, 102, 104, 106, 108, 105, 102, 101, 104]
store_C = [105, 102, 100, 106, 110, 112, 108, 105, 103, 102, 108, 106, 109, 105, 102, 108, 107, 106, 105, 102, 103, 108, 106, 110, 112, 108, 105, 103, 102, 108]

# Combine the data into a single DataFrame
data = pd.DataFrame({'Sales': np.concatenate([store_A, store_B, store_C]),
                     'Store': np.repeat(['A', 'B', 'C'], len(store_A)),
                     'Day': np.tile(range(len(store_A)), 3)})

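# Perform the ANOVA.
# Note: this formula fits an ordinary one-way (between-groups) ANOVA and
# ignores the Day column. A repeated measures analysis would treat Day as
# the subject, e.g. statsmodels.stats.anova.AnovaRM with depvar='Sales',
# subject='Day', within=['Store'].
model = ols('Sales ~ C(Store)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

# Check if the results are significant (p-value < 0.05)
if anova_table.loc['C(Store)', 'PR(>F)'] < 0.05:
    # Conduct post-hoc test (optional)
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'])
    print(posthoc)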


               sum_sq    df          F        PR(>F)
C(Store)   418.155556   2.0  17.577484  3.876618e-07
Residual  1034.833333  87.0        NaN           NaN
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     A      B  -4.4333    0.0 -6.5567  -2.31   True
     A      C   0.2667 0.9518 -1.8567   2.39  False
     B      C      4.7    0.0  2.5766 6.8234   True
---------------------------------------------------
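The table above comes from a between-groups model, so day-to-day variation stays in the residual. A repeated measures analysis removes that variation by treating each day as a subject measured under all three stores. The sketch below shows the calculation by hand on the first five days of the data above; it illustrates the method and is not a rerun of the full 30-day analysis.

```python
import numpy as np

# First five days of store_A, store_B, store_C (rows = days, columns = stores)
sales = np.array([
    [100, 98, 105],
    [105, 100, 102],
    [98, 103, 100],
    [102, 102, 106],
    [108, 101, 110],
], dtype=float)
n_days, n_stores = sales.shape

grand = sales.mean()
ss_total = np.sum((sales - grand) ** 2)

# Variation between stores (the effect of interest)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand) ** 2)

# Variation between days (removed from the error term)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand) ** 2)

# What remains is the error used to test the store effect
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_stores = (ss_stores / df_stores) / (ss_error / df_error)

print("F for Store:", f_stores, "on", df_stores, "and", df_error, "degrees of freedom")
```

When days differ systematically, subtracting `ss_days` shrinks the error term, so the repeated measures F can be larger than the between-groups F for the same data.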


                        -------------------------------------------------------------------