
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical technique used to compare the means of three or more groups to determine if at least one group mean is statistically significantly different from others. However, like all statistical techniques, ANOVA is built upon several critical assumptions. If these assumptions are violated, the validity of the ANOVA results can be compromised. Below are the primary assumptions of ANOVA, along with examples of violations and their potential impacts:

# 1. Independence of Observations
Assumption: Observations within each group must be independent of one another. This means that the data collected from one subject or group should not influence or be related to the data collected from another subject or group.

Examples of Violations:

* Repeated Measures: If the same subjects are measured multiple times (e.g., before and after treatment), the independence assumption is violated.
* Cluster Sampling: If samples are taken from naturally occurring groups (like classrooms), individuals within the same group may affect each other’s responses.
* Impact: If independence is violated, the results of ANOVA might lead to inflated Type I error rates (incorrectly rejecting the null hypothesis), as the true variability among the groups may be underestimated.

# 2. Normality of Residuals
Assumption: The residuals (the differences between observed and predicted values) should be normally distributed. While it’s not necessary for the original data to be normally distributed, the distribution of residuals should be normal, especially for small sample sizes.

Examples of Violations:

* Skewed Data: If the data is heavily skewed or contains outliers, the residuals will not follow a normal distribution.
* Transformation Issues: Applying a transformation that does not suit the data (for example, something other than a log or square-root transformation for right-skewed data) can leave the non-normality in place or make it worse.
* Impact: Violation of the normality assumption can lead to inaccurate p-values, resulting in a higher likelihood of Type I or Type II errors (failing to reject the null hypothesis when it is false).

# 3. Homogeneity of Variances (Homoscedasticity)
Assumption: The variances among the groups being compared should be roughly equal. This assumption can be assessed using tests like Levene's test or Bartlett's test.

Examples of Violations:

* Inconsistent Variability: If one group has much more variability in its data compared to another (for instance, group A has a variance of 2 and group B has a variance of 30), this leads to heteroscedasticity.
* Different Measurement Techniques: Using different methods to measure the same phenomenon could lead to different variances between groups.
* Impact: If the variances are significantly different (heteroscedasticity), it can lead to biased F-ratios and thus affect the validity of the ANOVA results. It may, for example, lead to underestimating or overestimating the significance of group distinctions.
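
The normality and equal-variance assumptions above can be checked numerically before running ANOVA. The following is a minimal sketch, assuming simulated data (the group means, spread, and sample sizes are hypothetical and not taken from any real study). It applies the Shapiro-Wilk test to the residuals and Levene's test to the groups, both from scipy.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn with the same spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=2.0, size=30) for mean in (10, 12, 15)]

# In one-way ANOVA the residual of an observation is its distance
# from its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: null hypothesis is that all groups share the same variance
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test is a warning sign for the corresponding assumption. With large samples these tests can flag trivial departures, so they are best read alongside a residual plot.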

# 4. Random Sampling
Assumption: The samples must be drawn randomly from the populations. This ensures that the sample is representative of the population.

Examples of Violations:

* Convenience Sampling: If the sample is taken from easily accessible subjects rather than a random selection, it may not represent the population well.
* Self-selection: Participants may choose whether or not to partake in a study, leading to biased groups.
* Impact: Non-random sampling could lead to systematic biases in results, impacting the generalizability of the findings and potentially misleading conclusions about differences between groups.

# Q2. What are the three types of ANOVA, and in what situations would each be used?

ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or more groups to determine if at least one group differs significantly from the others. There are three primary types of ANOVA, each suited for different experimental designs and situations:

# 1. One-Way ANOVA
Definition:
One-way ANOVA is used to compare the means of three or more independent groups that differ on one independent variable (factor). It tests the hypothesis that at least one of the group means is different from the others.

When to Use:

* When you have one categorical independent variable with two or more levels (groups) and a continuous dependent variable.
* For example, comparing the test scores among students from three different schools to see if the school affects performance.
* Situations where you want to test the effect of a single treatment or condition (e.g., comparing three different diets on weight loss).

Example:
A researcher wants to examine whether three different teaching methods (Lecture, Interactive, and Blended) have different effects on student performance.

# 2. Two-Way ANOVA
Definition:
Two-way ANOVA is used to examine the influence of two independent categorical variables (factors) on a continuous dependent variable, as well as to investigate if there is an interaction effect between the two factors.

When to Use:

* When analyzing the effects of two factors simultaneously while also assessing their interaction.
* For instance, evaluating the effects of both diet (e.g., Low-Carb vs. Regular) and exercise (e.g., Yes vs. No) on weight loss.

Example:
A study examines how two factors, type of fertilizer (Organic, Synthetic) and amount of sunlight (Full Sun, Partial Sun), affect plant growth.

# 3. Repeated Measures ANOVA (within-subjects ANOVA)
Definition:
Repeated measures ANOVA is used when the same subjects are measured multiple times under different conditions or at different time points. This type accounts for the fact that measurements are not independent due to repeated testing of the same participants.

When to Use:

* When the same participants are subjected to different conditions (e.g., measuring their performance under different drugs, treatments, or time points).
* Situations where you want to analyze how a dependent variable changes over time or across different conditions while controlling for inter-subject variability.

Example:
A psychologist tests the stress levels of the same group of participants at three different times: before a stressful event, during the event, and after the event.


# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance in ANOVA is a critical concept that refers to the process of breaking down the total variance observed in a dataset into its components. This breakdown helps researchers understand the sources of variability in the data and allows for meaningful comparisons among groups. The partitioning of variance entails distinguishing between the variability that can be attributed to the different groups being studied and the variability that arises from random error or individual differences within those groups.

# Understanding the Components of Variance
In ANOVA, the total variance in the dependent variable can be partitioned into two main components:

1. Between-Group Variance (Explained Variance):
This is the variance that is attributed to the differences between the means of the groups. It reflects how much the group means vary from the overall mean of all observations. A larger between-group variance indicates that the groups are more distinct or that the treatment or factor has a significant effect.

* Mathematically, it is measured by the between-group sum of squares, $SS_{\text{between}} = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2$, where $k$ is the number of groups, $n_i$ is the sample size of the i-th group, $\bar{X}_i$ is the mean of the i-th group, and $\bar{X}$ is the overall mean.

2. Within-Group Variance (Unexplained Variance):
This is the variance within each group that is not explained by the differences between the group means. It reflects the individual differences or random error within groups. A higher within-group variance indicates more variability among the individuals within the same group.

* Mathematically, it is measured by the within-group sum of squares, $SS_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$, where $X_{ij}$ represents the j-th observation in group i.

The total variability can then be expressed as:

$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}$$

Dividing each sum of squares by its degrees of freedom ($k - 1$ between groups and $N - k$ within groups, where $N$ is the total number of observations) turns it into a variance estimate, called a mean square.
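
Why the total splits exactly into a between-group part and a within-group part can be seen with a short derivation. This is the standard textbook argument, sketched here using the notation above ($X_{ij}$ for the j-th observation in group i, $\bar{X}_i$ for the group mean, $\bar{X}$ for the overall mean):

$$
\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij}-\bar{X})^2
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \big[(X_{ij}-\bar{X}_i) + (\bar{X}_i-\bar{X})\big]^2 \\
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij}-\bar{X}_i)^2
 + \sum_{i=1}^{k} n_i(\bar{X}_i-\bar{X})^2
 + 2\sum_{i=1}^{k} (\bar{X}_i-\bar{X}) \sum_{j=1}^{n_i} (X_{ij}-\bar{X}_i)
\end{aligned}
$$

The last term is zero because deviations from a group's own mean always sum to zero, $\sum_{j=1}^{n_i} (X_{ij}-\bar{X}_i) = 0$. What remains is the within-group term plus the between-group term.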

# Importance of Understanding Partitioning of Variance
1. Assessment of Treatment Effects:
By understanding how variance is partitioned, researchers can determine whether the groups are significantly different from each other. If the between-group variance is significantly larger than the within-group variance, this suggests that the treatment or categorical factor has a meaningful impact.

2. Calculation of the F-ratio:
The partitioning of variance is essential for calculating the F-statistic, which is the ratio of the between-group mean square to the within-group mean square. Each mean square is the corresponding sum of squares divided by its degrees of freedom, with $k$ groups and $N$ observations in total:

$$
F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
$$

A larger F-statistic indicates a greater likelihood that the group means are different due to the effects of the independent variable rather than random chance.

3. Guiding Experimental Design:
Understanding how variance is accounted for can inform researchers about how to structure their experiments. For instance, if the within-group variance is high, steps can be taken to control extraneous variables or increase sample sizes to improve the precision of estimates.

4. Interpreting Results:
Researchers need to interpret the results of ANOVA correctly. Recognizing where the variance is coming from (between versus within groups) allows for better understanding of the context of significant findings, including the actual impact of a treatment or condition.

5. Model Diagnostics and Assumptions:
The partitioning of variance can also provide insights into the assumptions underlying ANOVA, such as normality and homogeneity of variances. Researchers can check whether the variance structures meet the assumptions required to apply ANOVA and can take corrective actions if necessary.

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

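In a one-way ANOVA, you typically deal with three important sums of squares. The abbreviations below follow the question; be aware that many textbooks use them the other way round (SSR for the regression or explained sum and SSE for the error sum), so always check the definition rather than the label: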

1. Total Sum of Squares (SST): This measures the total variability in the dependent variable.
2. Explained Sum of Squares (SSE): This measures the variability that can be explained by the model (i.e., the differences between the group means).
3. Residual Sum of Squares (SSR): This measures the variability that cannot be explained by the model (i.e., the variability within the groups).
# Definitions
* Total Sum of Squares (SST):
$$\text{SST} = \sum_{i=1}^{N} (X_i - \bar{X})^2$$
where $X_i$ is each individual observation, $N$ is the total number of observations, and $\bar{X}$ is the overall mean of all observations.

* Explained Sum of Squares (SSE):
$$\text{SSE} = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$$
where $k$ is the number of groups, $n_j$ is the number of observations in group $j$, and $\bar{X}_j$ is the mean of group $j$.

* Residual Sum of Squares (SSR):
$$\text{SSR} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$
where $X_{ij}$ is the i-th observation in group $j$ and $\bar{X}_j$ is the mean of group $j$.


In [1]:
import numpy as np
import pandas as pd

# Sample data
data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'values': [5, 7, 6, 8, 9, 7, 10, 11, 10]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Calculate overall mean (X-bar)
overall_mean = np.mean(df['values'])

# Calculate Total Sum of Squares (SST)
SST = np.sum((df['values'] - overall_mean) ** 2)

# Calculate group means
group_means = df.groupby('group')['values'].mean()

# Calculate Explained Sum of Squares (SSE)
n = df['group'].value_counts()  # Number of observations per group
SSE = np.sum(n * (group_means - overall_mean) ** 2)

# Calculate Residual Sum of Squares (SSR)
SSR = SST - SSE

# Print results
print(f'Total Sum of Squares (SST): {SST:.2f}')
print(f'Explained Sum of Squares (SSE): {SSE:.2f}')
print(f'Residual Sum of Squares (SSR): {SSR:.2f}')

Total Sum of Squares (SST): 32.89
Explained Sum of Squares (SSE): 28.22
Residual Sum of Squares (SSR): 4.67
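
The sums of squares above lead directly to the F-statistic once each is divided by its degrees of freedom. The sketch below reuses the same nine sample values, computes the residual sum of squares directly from its definition rather than by subtraction, and cross-checks the hand-computed F against scipy's f_oneway. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Same sample values as in the cell above, organised by group
groups = {
    'A': np.array([5.0, 7.0, 6.0]),
    'B': np.array([8.0, 9.0, 7.0]),
    'C': np.array([10.0, 11.0, 10.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k = len(groups)          # number of groups
N = all_values.size      # total number of observations

# Explained (between-group) and residual (within-group) sums of squares
ss_explained = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_residual = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# Mean squares and the F-ratio
f_manual = (ss_explained / (k - 1)) / (ss_residual / (N - k))

# Cross-check with scipy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups.values())

print(f"F (manual): {f_manual:.4f}")
print(f"F (scipy):  {f_scipy:.4f}, p = {p_scipy:.4f}")
```

Computing the residual sum of squares from its definition is a useful check that the two components really add up to the total.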


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In a two-way ANOVA, you assess the impact of two independent categorical variables (also known as factors) on a continuous dependent variable. In addition to evaluating the individual effects of each factor (main effects), you can also examine whether there is an interaction between the two factors (interaction effect).

# Steps to Calculate Main Effects and Interaction Effects in Python
To perform a two-way ANOVA and calculate the main effects and interaction effects, you can use the statsmodels library in Python. Below is a step-by-step example using a sample dataset.

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data setup
data = {
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'FactorB': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2'],
    'Response': [5, 6, 7, 8, 9, 6, 7, 8, 9, 10, 11, 10]
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

   FactorA FactorB  Response
0       A1      B1         5
1       A1      B1         6
2       A1      B1         7
3       A2      B1         8
4       A2      B1         9
5       A2      B1         6
6       A1      B2         7
7       A1      B2         8
8       A1      B2         9
9       A2      B2        10
10      A2      B2        11
11      A2      B2        10


# Performing Two-Way ANOVA
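Define and fit the model: You can define the ANOVA model using the formula syntax, where Response is the dependent variable and FactorA, FactorB, and their interaction are the independent terms. In the formula, the * operator between the two factors expands to both main effects plus their interaction, which appears in the output as C(FactorA):C(FactorB).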

In [3]:
# Define the model
model = ols('Response ~ C(FactorA) * C(FactorB)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # Type II ANOVA

print(anova_table)

                          sum_sq   df          F    PR(>F)
C(FactorA)             12.000000  1.0  10.285714  0.012478
C(FactorB)             16.333333  1.0  14.000000  0.005692
C(FactorA):C(FactorB)   0.333333  1.0   0.285714  0.607511
Residual                9.333333  8.0        NaN       NaN
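
To see what the rows of the ANOVA table correspond to in the data, it can help to look at the cell means directly. The sketch below uses the same twelve observations and computes each main effect as a difference of marginal means and the interaction as a difference of differences. The variable names are illustrative, and this is a descriptive complement to the table, not a replacement for the F-tests.

```python
import pandas as pd

# Same data as in the cells above
df = pd.DataFrame({
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'FactorB': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2'],
    'Response': [5, 6, 7, 8, 9, 6, 7, 8, 9, 10, 11, 10],
})

# Mean response in each FactorA x FactorB cell
cell_means = df.pivot_table(index='FactorA', columns='FactorB',
                            values='Response', aggfunc='mean')

# Main effects: difference between the marginal means of each factor
mean_a = df.groupby('FactorA')['Response'].mean()
mean_b = df.groupby('FactorB')['Response'].mean()
effect_a = mean_a['A2'] - mean_a['A1']
effect_b = mean_b['B2'] - mean_b['B1']

# Interaction: does the A2 - A1 difference change between B1 and B2?
interaction = ((cell_means.loc['A2', 'B2'] - cell_means.loc['A1', 'B2'])
               - (cell_means.loc['A2', 'B1'] - cell_means.loc['A1', 'B1']))

print(cell_means)
print(f"Main effect of FactorA (A2 - A1): {effect_a:.3f}")
print(f"Main effect of FactorB (B2 - B1): {effect_b:.3f}")
print(f"Interaction contrast: {interaction:.3f}")
```

Read together with the ANOVA table, both main effects are significant at the 0.05 level (p = 0.012 for FactorA and p = 0.006 for FactorB). The interaction is not (p = 0.608), which is consistent with the small interaction contrast relative to the main effects.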


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

When conducting a one-way ANOVA, you are generally testing the null hypothesis that there are no differences in the means of the groups being compared. In your case, you obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how you can interpret these results:

# Interpretation of the F-statistic
1. F-statistic: The F-statistic provides a measure of the ratio of the variance between the groups to the variance within the groups. A higher F-statistic suggests that a larger proportion of the variance is attributable to the group differences rather than to random chance. In your case, an F-statistic of 5.23 indicates that there is some evidence suggesting that the means of the different groups are not all equal.

# Interpretation of the p-value
2. P-value: The p-value is the probability of observing an F-statistic at least as large as the one obtained (5.23) if the null hypothesis were true. A p-value of 0.02 means there is a 2% probability of obtaining such an F-statistic when there are no actual differences between the group means.

# Conclusion Based on the Results
Given that the p-value (0.02) is less than the common alpha level of 0.05, you would reject the null hypothesis. This suggests that at least one group mean differs significantly from the others.

Here's a structured conclusion:

* Statistical Significance: Since p < 0.05, we conclude that there is evidence to suggest that at least one group mean is significantly different from the others.

* Practical Interpretation: While the ANOVA tells us that not all group means are equal, it does not specify which groups are different from each other. To find out which specific groups differ, a post hoc test (like Tukey's HSD, Bonferroni, or Scheffé test) would need to be conducted.
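
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below therefore assumes hypothetical degrees of freedom of 2 and 12 (three groups, fifteen observations), chosen only because they give a p-value close to the reported 0.02. It uses the survival function of scipy's F distribution.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> k - 1 = 2, and 15 observations -> N - k = 12
df_between, df_within = 2, 12

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value that F must exceed to be significant at alpha = 0.05
critical_f = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {critical_f:.3f}")
print("Reject H0" if f_stat > critical_f else "Fail to reject H0")
```

Comparing the F-statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.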


# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in repeated measures ANOVA is crucial because it can impact the validity and interpretation of your results. Here are various methods to address missing data and the potential consequences associated with each approach:

# Methods for Handling Missing Data
1. Listwise Deletion (Complete Case Analysis):

* Description: Only the subjects with complete data across all time points are included in the analysis.
* Consequences:
* Pros: Simple to implement and interpret.
* Cons: This method can lead to significant loss of data, particularly if many subjects have missing values. If the missing data are not missing completely at random (MCAR), it can lead to biased estimates and reduced generalizability.
2. Pairwise Deletion:

* Description: Analyses are performed using all available data for each pair of groups being compared, allowing for different numbers of observations across comparisons.
* Consequences:
* Pros: Makes use of more data than listwise deletion.
* Cons: Can lead to inconsistencies in the sample size across comparisons and inflated Type I error rates. Results may also be less reliable because different subsets of data are used.
3. Mean Imputation:

* Description: Missing values for each participant are replaced with the mean value of that participant's other measurements.
* Consequences:
* Pros: Easy to implement and preserves sample size.
* Cons: Reduces variability in the dataset, can bias results due to artificially reducing standard deviations, and fails to consider the correlations between repeated measures.
4. Last Observation Carried Forward (LOCF):

* Description: The last observed value for each participant is used to fill in missing subsequent values.
* Consequences:
* Pros: Preserves the sample size and is straightforward to implement.
* Cons: Assumes that the last observation remains valid, which may not be true, especially in longitudinal data. It can lead to biased results and artificially stabilize trends.
5. Multiple Imputation:

* Description: Missing values are estimated multiple times to create several plausible datasets, which are then analyzed separately, and results are pooled.
* Consequences:
* Pros: Provides a statistically principled way of handling missing data, accounts for uncertainty, and is often seen as the best practice when dealing with missing data.
* Cons: More complex to implement and requires assumptions about the distribution of the missing data. The quality of imputations depends on the model specified.
6. Mixed-Effects Models (also called Hierarchical Models):

* Description: These models allow for missing data points and provide estimates that account for the correlation between repeated measures.
* Consequences:
* Pros: Flexible and can handle unbalanced data. It allows the inclusion of all available data and can give accurate estimates of effects.
* Cons: Complexity in model specification and interpretation, and requires careful consideration of random effects.

# Potential Consequences of Different Methods

* Bias: Techniques like mean imputation and LOCF can introduce bias in the estimates, especially if the data are not MCAR.
* Loss of Power: Methods that involve deletion (listwise or pairwise) can significantly reduce the sample size and thus the power of the analysis.
* Increased Type I Error: Inconsistent sample sizes across comparisons (as in pairwise deletion) can lead to increased Type I error rates.
* Reduced Variability: Imputation methods can reduce variability in the dataset, affecting the estimates of means and variances.
* Generalizability: The method chosen can affect the generalizability of findings; for example, if data are systematically missing for certain groups, results may not reflect the true population.
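
Three of the simpler methods above (listwise deletion, participant-mean imputation, and LOCF) can be illustrated with pandas on a tiny wide-format table. The data below are made up for illustration: four hypothetical subjects measured at three time points, with two missing values. Multiple imputation and mixed-effects models need dedicated modelling tools and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        't1': [5.0, 6.0, 7.0, 8.0],
        't2': [6.0, np.nan, 8.0, 9.0],
        't3': [7.0, 8.0, np.nan, 10.0],
    },
    index=['s1', 's2', 's3', 's4'],
)

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean imputation as described above: fill each gap with that
# participant's mean over their observed time points
mean_imputed = wide.apply(lambda row: row.fillna(row.mean()), axis=1)

# LOCF: carry each participant's last observed value forward in time
locf = wide.ffill(axis=1)

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print("Mean-imputed:\n", mean_imputed)
print("LOCF:\n", locf)
```

Even in this small table the trade-offs are visible: listwise deletion discards half the subjects, while the two imputation methods keep everyone but fill the gaps with values that flatten each participant's trajectory.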

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after conducting an ANOVA to determine which specific group means are significantly different from each other. They are necessary when the ANOVA indicates that there are significant differences among the groups, but does not specify where those differences lie. Here are some common post-hoc tests along with their typical applications:

# Common Post-Hoc Tests
1. Tukey's Honestly Significant Difference (HSD) Test:

* Use: Tukey's HSD is used when you want to compare all possible pairs of group means with a control of the family-wise error rate.
* When to Use: It is best suited for equal sample sizes across groups, although it can handle unequal sample sizes to some extent.
* Example: Comparing the effectiveness of three different teaching methods on student performance where you have three groups (e.g., Method A, Method B, Method C). If the ANOVA shows significant differences among the groups, Tukey's HSD can determine which methods differ from each other.
2. Bonferroni Correction:

* Use: The Bonferroni adjustment is straightforward and involves dividing the significance level (alpha) by the number of tests being conducted. It is applied when you want to keep the family-wise error rate under control.
* When to Use: It is useful when conducting a small number of comparisons.
* Example: If you want to compare four different diets on weight loss, and the ANOVA indicates significant differences, you may perform pairwise comparisons with the Bonferroni approach to see which specific diets (e.g., Diet 1 vs. Diet 2) differ, adjusting the alpha level to account for the multiple comparisons.
3. Scheffé's Test:

* Use: Allows for comparison of groups in a more flexible way, including complex contrasts that combine several groups, not just pairwise differences.
* When to Use: It is appropriate when you want to test many or unplanned contrasts and can accept a conservative test, since it controls the family-wise error rate across all possible contrasts. It also works with unequal sample sizes.
* Example: Useful in a study where you want to compare the average of two related treatments against a third treatment, in addition to the ordinary pairwise comparisons.
4. Dunnett's Test:

* Use: Specifically compares each experimental group with a control group.
* When to Use: When you have multiple treatment groups and want to assess them against a particular control group only.
* Example: In a clinical trial testing new medications for treating a disease, if you want to compare the effectiveness of three new drugs against a placebo group, you would use Dunnett's Test.
5. Newman-Keuls Test:

* Use: A stepwise test that compares means in a hierarchical manner. It is generally less conservative than Tukey's HSD.
* When to Use: When you have a larger number of groups and the most significant contrasts of means are of interest.
* Example: In a psychological study assessing stress reduction across various therapies, the Newman-Keuls test can help identify therapy pairs that have differing effects on stress levels.

# Example Situation Where a Post-Hoc Test Might Be Necessary

Suppose you conduct an experiment to evaluate the impact of four different fertilizers (Fertilizer A, Fertilizer B, Fertilizer C, and Fertilizer D) on plant growth. After performing a one-way ANOVA, you find a significant F-statistic indicating that at least one of the fertilizers leads to different growth levels among the plants.

Since the ANOVA does not specify which fertilizers differ, you proceed with a post-hoc test, such as Tukey's HSD. This will allow you to conduct pairwise comparisons between the fertilizers to determine whether the mean plant growth with Fertilizer A is different from that of Fertilizer B, C, and D, and so on.

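In summary, post-hoc tests are essential tools following ANOVA when you find significant differences among group means and need to identify where those differences lie. The choice of post-hoc test depends on your study design, for example whether you need all pairwise comparisons, comparisons against a single control group, or more complex contrasts.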
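
The fertilizer scenario above can be sketched with statsmodels' pairwise_tukeyhsd. The growth values are simulated, and the group means, spread, and sample size of 15 plants per fertilizer are hypothetical choices made only to produce something to compare.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)

# Hypothetical mean growth (cm) per fertilizer; 15 plants each
assumed_means = {'A': 20.0, 'B': 21.0, 'C': 25.0, 'D': 20.5}
n_per_group = 15

growth = np.concatenate(
    [rng.normal(loc=mean, scale=2.0, size=n_per_group) for mean in assumed_means.values()]
)
labels = np.repeat(list(assumed_means.keys()), n_per_group)

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(endog=growth, groups=labels, alpha=0.05)
print(tukey.summary())
```

With four groups there are six pairwise comparisons. The summary table lists the mean difference, an adjusted p-value, a confidence interval, and a reject flag for each pair.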


# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

To conduct a one-way ANOVA comparing the mean weight loss of three diets (A, B, and C) in Python, we first need data. No data set accompanies the question, so the code below simulates weight loss values (replace them with your actual measurements if you have them), fits the one-way ANOVA with statsmodels, and interprets the result. Note that the question specifies 50 participants, whereas the simulation draws 20 per diet (60 in total).

Python Code

In [4]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulating weight loss data for three diets
np.random.seed(42)  # For reproducibility

# Sample data: assume the weight loss (in pounds) for each diet
diet_a = np.random.normal(loc=10, scale=2, size=20)  # Diet A
diet_b = np.random.normal(loc=12, scale=2, size=20)  # Diet B
diet_c = np.random.normal(loc=15, scale=2, size=20)  # Diet C

# Combine data into a DataFrame
data = pd.DataFrame({
    'WeightLoss': np.concatenate([diet_a, diet_b, diet_c]),
    'Diet': ['A'] * 20 + ['B'] * 20 + ['C'] * 20
})

# Conducting one-way ANOVA
model = ols('WeightLoss ~ C(Diet)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

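# Extracting F-statistic and p-value (the first row is the Diet effect);
# .iloc selects by position, since the table is indexed by term name
f_statistic = anova_table['F'].iloc[0]
p_value = anova_table['PR(>F)'].iloc[0]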

# Display the results
print("ANOVA Results:")
print(f"F-statistic: {f_statistic:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpreting the results
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. There are significant differences in mean weight loss among the diets.")
else:
    print("Result: Fail to reject the null hypothesis. No significant differences in mean weight loss among the diets.")

ANOVA Results:
F-statistic: 42.7976
p-value: 0.0000
Result: Reject the null hypothesis. There are significant differences in mean weight loss among the diets.




# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

To conduct a two-way ANOVA in Python that examines the effects of software programs (Program A, Program B, Program C) and employee experience levels (novice vs. experienced) on the time taken to complete a task, we need a dataset that includes both factors. The code below simulates such data, carries out the two-way ANOVA, and reports which effects are significant. Note that the question specifies 30 employees, whereas the simulation draws 15 per program and experience combination (90 in total).

Python Code

In [5]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set random seed for reproducibility
np.random.seed(42)

# Simulating the data
programs = ['A', 'B', 'C']
experience_levels = ['Novice', 'Experienced']

# Generating sample data
data = []
for program in programs:
    for experience in experience_levels:
        if experience == 'Novice':
            # Random times for novices (mean = 30, std = 5)
            times = np.random.normal(loc=30, scale=5, size=15)
        else:
            # Random times for experienced (mean = 25, std = 5)
            times = np.random.normal(loc=25, scale=5, size=15)

        # Append to data list
        for time in times:
            data.append({'Time': time, 'Program': program, 'Experience': experience})

# Convert to DataFrame
df = pd.DataFrame(data)

# Conducting two-way ANOVA
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extracting F-statistics and p-values
f_stats = anova_table['F']
p_values = anova_table['PR(>F)']

# Display the results
print("Two-Way ANOVA Results:")
print(anova_table)

# Interpretation of results (the Residual row has no F-test, so it is skipped)
alpha = 0.05
for effect, p in p_values.dropna().items():
    if p < alpha:
        print(f"Result for {effect}: Reject the null hypothesis. Significant effect.")
    else:
        print(f"Result for {effect}: Fail to reject the null hypothesis. No significant effect.")

Two-Way ANOVA Results:
                               sum_sq    df          F    PR(>F)
C(Program)                  15.717327   2.0   0.350798  0.705152
C(Experience)              622.680166   1.0  27.795447  0.000001
C(Program):C(Experience)    45.903396   2.0   1.024527  0.363407
Residual                  1881.787816  84.0        NaN       NaN
Result for C(Program): Fail to reject the null hypothesis. No significant effect.
Result for C(Experience): Reject the null hypothesis. Significant effect.
Result for C(Program):C(Experience): Fail to reject the null hypothesis. No significant effect.

Interpretation: Experience level has a significant main effect on completion time (F = 27.80, p < 0.001). Neither the software program (F = 0.35, p = 0.71) nor the program-by-experience interaction (F = 1.02, p = 0.36) is significant. This matches how the data were simulated: novices and experienced employees were given different mean times, while all three programs shared the same means.




# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

To conduct a two-sample t-test comparing test scores between a control group (traditional teaching method) and an experimental group (new teaching method), we can use scipy. The code below simulates student test scores for both groups, performs the t-test, and interprets the results. Because only two groups are compared, a significant t-test already identifies which groups differ, so no separate post-hoc test is required. Comparing the two group means shows the direction of the difference.

In [8]:
import numpy as np
from scipy import stats

# Set a random seed for reproducibility
np.random.seed(42)

# Simulating test scores for the two groups
# Control group (traditional teaching method)
control_group = np.random.normal(loc=75, scale=10, size=50)  # mean = 75, std = 10, n = 50

# Experimental group (new teaching method)
experimental_group = np.random.normal(loc=80, scale=10, size=50)  # mean = 80, std = 10, n = 50

# Conducting a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Display the t-test results
print("Two-Sample T-Test Results:")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. There is a significant difference in test scores.")
    # With only two groups there is nothing left for a post-hoc test to separate;
    # the group means show which group scored higher.
    print(f"Control group mean: {control_group.mean():.2f}")
    print(f"Experimental group mean: {experimental_group.mean():.2f}")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference in test scores.")

# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

To conduct a repeated measures ANOVA to compare the average daily sales of three retail stores (Store A, Store B, and Store C), we can use the statsmodels library in Python. Repeated measures ANOVA is appropriate when the same subjects (days in this case) are used to test more than one condition (sales from different stores).

Here’s how you can simulate the sales data, conduct the repeated measures ANOVA, and follow up with a post-hoc test if necessary:

In [7]:
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Set random seed for reproducibility
np.random.seed(42)

# Simulating daily sales data for three stores over 30 days
days = 30
stores = ['Store A', 'Store B', 'Store C']
store_a_sales = np.random.normal(loc=200, scale=20, size=days)  # Store A
store_b_sales = np.random.normal(loc=220, scale=20, size=days)  # Store B
store_c_sales = np.random.normal(loc=250, scale=20, size=days)  # Store C

# Create a DataFrame with sales data (one row per day)
data = pd.DataFrame({
    'Day': np.arange(1, days+1),
    'Store A': store_a_sales,
    'Store B': store_b_sales,
    'Store C': store_c_sales
})

# Melt the DataFrame to long format for AnovaRM
data_long = data.melt(id_vars='Day', value_vars=stores,
                      var_name='Store', value_name='Sales')

# Conducting repeated measures ANOVA (Day is the repeated "subject")
anova_results = AnovaRM(data_long, 'Sales', 'Day', within=['Store']).fit()

# Display ANOVA results
print(anova_results)

# Check if ANOVA is significant
alpha = 0.05
if anova_results.anova_table['Pr > F'].iloc[0] < alpha:
    print("Result: Reject the null hypothesis. There are significant differences in sales between the stores.")

    # Post-hoc test: paired t-tests for each pair of stores,
    # with a Bonferroni adjustment for the number of comparisons
    pairs = list(combinations(stores, 2))
    print("\nPost-hoc Test Results (paired t-tests, Bonferroni-adjusted):")
    for store_1, store_2 in pairs:
        t_stat, p_raw = stats.ttest_rel(data[store_1], data[store_2])
        p_adj = min(p_raw * len(pairs), 1.0)
        print(f"{store_1} vs {store_2}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f}")
else:
    print("Result: Fail to reject the null hypothesis. No significant differences in sales.")