In [7]:
import numpy as np
import pandas as pd
import scipy.stats as stat

#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

- ANOVA relies on the following assumptions:

a) Each group sample is drawn from a normally distributed population.

b) All populations have a common variance, i.e., homogeneity of variance.

c) All samples are drawn independently of each other

d) Within each sample, the observations are sampled randomly and independently of each other

e) Factor effects are additive.


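Examples of violations:

- Lack of independence within a sample (e.g., repeated measurements on the same subject treated as separate observations)
- Lack of independence between samples (e.g., the same participants appearing in more than one group)
- Outliers: a few extreme data points that make an otherwise normal sample look non-normal
- Non-normality: the entire sample departs from a normal distribution (e.g., strongly skewed data)
- Unequal population variances
- Patterns in plots of the data or residuals, which reveal violated assumptions graphically
- Unbalanced sample sizes, which make the F-test more sensitive to unequal variances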
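
The normality and equal-variance assumptions can be checked before running the test. The following is a minimal sketch, assuming three small illustrative samples (the values are hypothetical) and using the Shapiro-Wilk and Levene tests from `scipy.stats`; small p-values would suggest that the corresponding assumption is violated.

```python
import scipy.stats as stat

# Three small illustrative samples (hypothetical values)
groups = [
    [25, 30, 28, 36, 29],
    [45, 55, 29, 56, 40],
    [30, 29, 33, 37, 27],
]

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal population
shapiro_pvalues = []
for g in groups:
    _, p = stat.shapiro(g)
    shapiro_pvalues.append(p)

# Levene: the null hypothesis is that all groups share a common variance
levene_stat, levene_p = stat.levene(*groups)

for idx, p in enumerate(shapiro_pvalues, start=1):
    print(f"Group {idx} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

With samples this small, both tests have little power, so plots of the data remain a useful complement.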

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans: There are three main types of ANOVA:

a) One-Way ANOVA:
A one-way ANOVA has a single independent variable (factor) with two or more levels. Use it to compare the means of groups defined by one factor, for example weight loss under three different diets.

b) Two-Way ANOVA:
A two-way ANOVA (a type of factorial ANOVA) uses two independent variables, each with two or more levels. Use it to test the main effect of each factor and their interaction, for example the effect of software program and experience level on task time.

c) N-Way ANOVA:
An N-way ANOVA has more than two independent variables (n being the number of independent variables), each of which can have multiple levels. Use it when several factors may jointly influence the response.

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans: Partitioning of variance divides the total variance in the data into its sources: the variance between the groups and the variance within the groups. This enables us to better understand the effects of the predictor variables on the response variable. When the group means differ, the data set as a whole shows more dispersion than the groups do individually, and this extra dispersion is the between-group component.

To partition the variance we use sums of squares:

SS(total) = SS(within) + SS(between)

Partitioning of variance is the foundation for ANOVA analysis and the general linear model of regression. 
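
Written out term by term, the identity above can be sketched as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_j$ observations in group $j$, $x_{ij}$ for observation $i$ of group $j$, $\bar{x}_j$ for the mean of group $j$, and $\bar{x}$ for the grand mean.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{SS(\text{total})}
=
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{SS(\text{within})}
+
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{SS(\text{between})}
```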

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

Ans: The variance is partitioned using sums of squares. The total sum of squares (SST) is the sum of the squared differences between each data point and the overall (grand) mean.

SS(total) = SS(within) + SS(between)

Here SST is the total sum of squares, SSE (explained) is the between-group sum of squares, and SSR (residual) is the within-group sum of squares.

- The within-group sum of squares measures how far the observations lie from their own group mean.
- The between-group sum of squares measures how far the group means lie from the grand mean.

In [8]:
group = [[25, 30, 28, 36, 29], [45, 55, 29, 56, 40], [30, 29, 33, 37, 27]]

In [9]:
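def avg(grp):
    # arithmetic mean of a list of numbers
    return float(sum(grp)) / len(grp)

def sum_of_squares(grp):
    # sum of squared deviations of each value from the mean of grp
    mean = avg(grp)
    return sum([(i - mean)**2 for i in grp])

# flatten the three groups into a single list of all observations
lst = [j for i in group for j in i]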

In [10]:
sst = sum_of_squares(lst)                      # total: every observation vs. the grand mean
ssr = sum([sum_of_squares(i) for i in group])  # residual: within-group sum of squares
sse = sst - ssr                                # explained: between-group sum of squares
print('Total sum of squares: ', sst)
print('between group sum of squares + within group sum of squares: ', sse, '+', ssr ,'=', sse + ssr)

Total sum of squares:  1344.9333333333334
between group sum of squares + within group sum of squares:  716.9333333333334 + 628.0 = 1344.9333333333334
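
The sums of squares lead directly to the F-statistic: each one is divided by its degrees of freedom to give a mean square, and F is the ratio of the between-group mean square to the within-group mean square. The sketch below repeats the calculation on the same three groups and compares it with `scipy.stats.f_oneway`; the variable names are chosen here for illustration.

```python
import scipy.stats as stat

group = [[25, 30, 28, 36, 29], [45, 55, 29, 56, 40], [30, 29, 33, 37, 27]]

def sum_of_squares(values):
    # sum of squared deviations from the mean of `values`
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

all_values = [v for g in group for v in g]
ss_total = sum_of_squares(all_values)
ss_within = sum(sum_of_squares(g) for g in group)
ss_between = ss_total - ss_within

k = len(group)        # number of groups
n = len(all_values)   # total number of observations
df_between = k - 1
df_within = n - k

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
# p-value: upper-tail probability of the F distribution
p_manual = stat.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stat.f_oneway(*group)
print('manual F:', f_manual, ' scipy F:', f_scipy)
print('manual p:', p_manual, ' scipy p:', p_scipy)
```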


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


Ans: A two-way ANOVA compares mean differences across two independent variables (also called factors). It also helps us understand how the two factors interact in their effect on the dependent variable, for example the interaction of degree type and educational level on salary.

- Main effect: the effect of one independent variable on the dependent variable, without taking the effect of the other independent variable into consideration.
- Interaction effect: the extent to which the effect of one independent variable on the dependent variable depends on the level of the other independent variable.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = pd.DataFrame({
    'Group1': [5, 7, 9, 6, 8, 10, 12, 9, 11, 13],
    'Group2': [2, 3, 1, 4, 2, 1, 3, 2, 4, 3],
    'DependentVariable': [15, 18, 12, 16, 14, 9, 11, 13, 10, 8]
})

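# Fit the two-way ANOVA model (two main effects plus their interaction).
# Group1 and Group2 are numeric columns here, so the formula treats them as
# continuous predictors; wrap them as C(Group1) and C(Group2) for categorical factors.
model = ols('DependentVariable ~ Group1 + Group2 + Group1:Group2', data=data).fit()
anova_table = sm.stats.anova_lm(model)  # default is Type I (sequential) sums of squares

# Extract the sum of squares attributed to each main effect and to the interaction
main_effect_group1 = anova_table.loc['Group1', 'sum_sq']
main_effect_group2 = anova_table.loc['Group2', 'sum_sq']
interaction_effect = anova_table.loc['Group1:Group2', 'sum_sq']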

print("Main Effect Group1:", main_effect_group1)
print("Main Effect Group2:", main_effect_group2)
print("Interaction Effect:", interaction_effect)

Main Effect Group1: 68.2666666666666
Main Effect Group2: 4.877103301384461
Interaction Effect: 0.28298127833582626


#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In [11]:
f_test = 5.23
p_value = 0.02
alpha_value = 0.05

if p_value > alpha_value: 
    print('We fail to reject the null hypothesis.')
else:
    print('we reject the null hypothesis.')

we reject the null hypothesis.


- Interpretation: Since the p-value (0.02) is below the significance level (0.05), we reject the null hypothesis. This means that at least one group mean differs from the others.

- Differences between Groups: The F-statistic of 5.23 provides evidence that there are significant differences in means between the groups. The F-statistic is calculated by dividing the between-group variability by the within-group variability. A higher F-statistic indicates that the differences between the group means are large relative to the variability within each group.

- In summary, based on the obtained F-statistic and p-value, we conclude that there are statistically significant differences between the groups. Further analysis or post hoc tests would be required to identify the specific group differences.
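
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below assumes 3 groups and 30 observations purely for illustration, so the p-value it computes need not equal the 0.02 given in the question. It shows the two equivalent decision rules: compare F with the critical value, or compare the p-value with alpha.

```python
import scipy.stats as stat

f_stat = 5.23
alpha = 0.05

# Assumed for illustration only: 3 groups and 30 observations in total
df_between, df_within = 2, 27

# Critical value: the F value that cuts off the upper alpha tail of the distribution
f_crit = stat.f.ppf(1 - alpha, df_between, df_within)
# p-value: probability of an F at least this large if the null hypothesis holds
p_from_f = stat.f.sf(f_stat, df_between, df_within)

print('critical F:', f_crit)
print('p-value for F = 5.23 under the assumed degrees of freedom:', p_from_f)
print('reject null:', f_stat > f_crit)
```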


#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


Ans - Handling missing data in a repeated measures ANOVA requires careful consideration to maintain the integrity and validity of the analysis. Here are a few methods commonly used to handle missing data in this context:

- Complete Case Analysis (Listwise deletion):

This approach involves excluding any participant or case with missing data from the analysis. Only cases with complete data across all time points are considered.
Potential consequences: This method may lead to a loss of statistical power and potential bias if the missingness is related to the dependent variable or other variables of interest. Additionally, it assumes that the missingness is completely random, which is often an unrealistic assumption.

- Pairwise Deletion:

In this approach, only the available data for each time point are used in the analysis. Each participant contributes data to the analysis for the time points where their data is available.
Potential consequences: Pairwise deletion retains more data than listwise deletion, but it may introduce bias if the missingness is related to the dependent variable or other variables. It also yields different sample sizes for different time points, which affects the precision of the estimates and makes results harder to compare across time points.

- Imputation:

Imputation involves replacing missing values with estimated values based on observed data. Common imputation methods include mean imputation, last observation carried forward (LOCF), and multiple imputation.
Potential consequences: The choice of imputation method can impact the results. Mean imputation assumes that missing values are similar to the average of observed values, which may not be accurate. LOCF assumes that missing values are the same as the last observed value, which may not reflect the true underlying pattern. Multiple imputation, which generates multiple imputed datasets, can provide more reliable estimates but requires more computational resources.

- Mixed-effects models:

Mixed-effects models, such as linear mixed-effects models, can handle missing data by using all available observations, including those from participants who are missing some time points. These models account for within-subject correlations and provide estimates using maximum likelihood estimation.
Potential consequences: Mixed-effects models can provide more efficient and less biased estimates than the deletion-based methods. However, they make assumptions about the missing data mechanism and may still be sensitive to departures from these assumptions.


The choice of how to handle missing data depends on the nature of the missingness, the assumptions made, and the research context. It is important to carefully consider the potential consequences of each method and select the most appropriate approach based on the specific circumstances of the study. Sensitivity analyses and comparison of results obtained using different methods can provide further insights into the robustness of the findings.
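
The simpler strategies above can be sketched with pandas. This example assumes a small hypothetical wide-format table (one row per subject, one column per time point; all names and values are made up for illustration) and shows listwise deletion, mean imputation, and LOCF side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        't1': [10.0, 12.0, 9.0, 11.0],
        't2': [11.0, np.nan, 10.0, 12.0],
        't3': [13.0, 14.0, np.nan, 12.5],
    },
    index=['s1', 's2', 's3', 's4'],
)

# Listwise deletion: keep only subjects with complete data
complete_cases = scores.dropna()

# Mean imputation: fill each time point's gaps with that column's mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = scores.ffill(axis=1)

print(len(scores), 'subjects originally;', len(complete_cases), 'after listwise deletion')
print(mean_imputed)
print(locf)
```

Comparing the analysis results across these versions of the data is one simple form of the sensitivity analysis mentioned above.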

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

A post hoc test is used after a statistically significant ANOVA result to determine where the differences came from. A significant result tells us that there is a difference in the parameter, but not which groups differ, and this is where post hoc tests come into play. For example, if a one-way ANOVA comparing three diets is significant, a post hoc test identifies which pairs of diets differ. Post hoc tests also control the family-wise (experiment-wise) error rate. There are many post hoc tests, but the most common ones are:

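- Bonferroni Procedure: A multiple-comparison correction that divides the significance level by the number of tests performed. It is simple and conservative, and suits a small number of planned comparisons.

- Duncan’s new multiple range test (MRT): Duncan’s Multiple Range Test identifies the pairs of means (from at least three) that differ. It is more powerful than Tukey’s test but gives weaker protection against false positives.

- Tukey’s Test: Tukey’s test determines which groups in your sample differ from each other. Every group mean is compared with every other group mean using the “Honest Significant Difference,” which represents how far apart the groups must be to count as different. Use it when all pairwise comparisons are of interest.

- Dunnett’s correction: This post hoc test compares means. In contrast to Tukey’s, it compares each mean only with a control mean, so use it when there is a single control group.

- Fisher’s Least Significant Difference (LSD): Determines whether two means are statistically different using pairwise t-tests without further correction. It is the least conservative option.

- Holm-Bonferroni Procedure: Holm’s sequential Bonferroni test compares the ordered p-values against progressively less strict thresholds, which makes it less conservative than the plain Bonferroni procedure while still controlling the family-wise error rate.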
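
As a rough illustration of the two Bonferroni-type procedures, the sketch below runs pairwise t-tests on the three samples from Q4 and applies both corrections by hand. The group labels `g1` to `g3` are made up for this example, and the Holm loop is a plain step-down implementation written here, not a library routine.

```python
from itertools import combinations
import scipy.stats as stat

# The three samples from Q4, with hypothetical labels
groups = {
    'g1': [25, 30, 28, 36, 29],
    'g2': [45, 55, 29, 56, 40],
    'g3': [30, 29, 33, 37, 27],
}
alpha = 0.05

# One independent two-sample t-test per pair of groups
pairs = list(combinations(groups, 2))
raw_p = [stat.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
m = len(raw_p)

# Bonferroni: every p-value is compared with alpha / m
bonferroni_reject = [p <= alpha / m for p in raw_p]

# Holm: sort p-values ascending, compare the smallest with alpha / m, the next
# with alpha / (m - 1), and so on; stop at the first comparison that fails
order = sorted(range(m), key=lambda i: raw_p[i])
holm_reject = [False] * m
for rank, idx in enumerate(order):
    if raw_p[idx] <= alpha / (m - rank):
        holm_reject[idx] = True
    else:
        break

for (a, b), p, bonf, holm in zip(pairs, raw_p, bonferroni_reject, holm_reject):
    print(f'{a} vs {b}: p = {p:.4f}, Bonferroni reject = {bonf}, Holm reject = {holm}')
```

Holm rejects every hypothesis that Bonferroni rejects and possibly more, which is why it is described as the less conservative of the two.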

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [12]:
# Creating a simulated dataset: 50 participants split across three diets (17 + 16 + 17)
import scipy.stats as stat
np.random.seed(42)
grp_a = np.random.randint(1,25, size = 17)
grp_b = np.random.randint(1,25, size = 16)
grp_c = np.random.randint(1,25, size = 17)

#### Step 01: Formulating the hypothesis.
#### Null: There is no difference in the mean weight loss of the three diets.
#### Alternative: At least one diet has a mean weight loss that differs from the others.

In [13]:
# Conducting the test and computing the test statistic.
alpha_value = 0.05
f_test, p_value = stat.f_oneway(grp_a, grp_b, grp_c)
print('f-test-value: ', f_test, ', p_value: ', p_value)

f-test-value:  0.05657652728891199 , p_value:  0.9450584211548315


In [15]:
if p_value < alpha_value:
    print("We reject the null hypothesis.")
else:
    print('We fail to reject the null hypothesis. This means there is no significant difference between the different weight loss diet.')

We fail to reject the null hypothesis. This means there is no significant difference between the different weight loss diet.


#### Interpretation: At the 5% significance level we fail to reject the null hypothesis (F ≈ 0.057, p ≈ 0.945). There is no significant difference between the mean weight loss of the three diets.

#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data
data = {
    'Program': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time': [12.5, 14.2, 15.3, 11.7, 13.8, 16.5, 13.1, 15.9, 17.2, 11.9,
             13.4, 16.1, 14.8, 15.6, 16.9, 12.2, 14.6, 16.3, 11.5, 13.7,
             15.4, 12.9, 15.8, 17.1, 10.8, 13.2, 15.1, 12.6, 15.4, 16.7]
}
df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

                       sum_sq    df          F        PR(>F)
Program             74.850667   2.0  37.506598  4.113561e-08
Experience           1.045333   1.0   1.047603  3.162664e-01
Program:Experience   0.754667   2.0   0.378153  6.891360e-01
Residual            23.948000  24.0        NaN           NaN


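Software Program (Program): The F-statistic for the program factor is 37.51 with a p-value of 4.11e-08. The p-value is far below 0.05, indicating a significant difference in the average time to complete the task between the three software programs.

Employee Experience Level (Experience): The F-statistic for the experience factor is 1.05 with a p-value of 0.316. The p-value is greater than 0.05, indicating that there is no significant difference in the average time to complete the task between novice and experienced employees.

Interaction Effect: The F-statistic for the interaction between software program and experience level is 0.38, and the associated p-value is 0.689. The p-value is greater than 0.05, so there is no evidence that the effect of the software program depends on experience level.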

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [17]:
# simulated test scores: random integers from 40 to 99 (the upper bound of randint is exclusive)
np.random.seed(42)
grp1 = np.random.randint(40,100, size = 50)
grp2 = np.random.randint(40, 100, size = 50)

#### Step 01: Formulating the hypothesis.
#### Null: There is no significant difference in test scores.
#### Alternative: There is significant difference in test scores.

In [18]:
# Step 02: Calculating test-statistics:
import scipy.stats as stat

alpha_level = 0.05
t_test, p_value = stat.ttest_ind(grp1, grp2)
print('test-value: ', t_test, ', p_value: ', p_value)

test-value:  1.2683350701791067 , p_value:  0.20768304905196328


In [19]:
# Step 03: Observing the results:

if p_value < alpha_level:
    print("We reject the null hypothesis.")
else:
    print('We fail to reject the null hypothesis. This means there is no significant difference between the different test scores.')

We fail to reject the null hypothesis. This means there is no significant difference between the different test scores.

Interpretation: At the 5% significance level we fail to reject the null hypothesis (t ≈ 1.27, p ≈ 0.208), so there is no significant difference in test scores between the two groups. Because the result is not significant and only two groups are compared, no post-hoc test is needed.


#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a posthoc test to determine which store(s) differ significantly from each other.

In [18]:
import numpy as np
import pandas as pd
import random
import scipy.stats as stat

In [10]:
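# generating sales data
random.seed(42)

def sample_data():
    # 30 simulated daily sales figures: random integers from 0 to 29
    return np.random.randint(0, 30, size=30)

# labels for the 30 recorded days (not used in the analysis below)
days = [f'day_{i}' for i in range(1, 31)]
days = random.choices(days, k=30)  # samples with replacement, so labels may repeat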

In [22]:
# assigning sales_data to stores

alpha_value = 0.05
np.random.seed(42)
store_a = sample_data()
store_b = sample_data()
store_c = sample_data()
df = pd.DataFrame(data = zip(store_a, store_b, store_c), columns=['store_a', 'store_b', 'store_c'])
df.head()

   store_a  store_b  store_c
0        6        0        6
1       19       11       17
2       28       25        3
3       14       21       24
4       10       28       27


In [9]:
# Step 01: Formulating the hypothesis.
print("""Null: There is no significant difference between the mean sales of the stores.
Alternative: At least one store's mean sales differ significantly from the others.""")

Null: There is no significant difference between the mean sales of the stores.
Alternative: At least one store's mean sales differ significantly from the others.


In [20]:
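# Step 02: performing the statistical test.
# Note: f_oneway runs a one-way ANOVA that treats the three stores as independent groups;
# it does not model the repeated measurements taken on the same days.

f_stats, p_value = stat.f_oneway(df.store_a, df.store_b, df.store_c)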

In [23]:
# Step03: interpreting the results and conclusion.

if p_value < alpha_value: 
    print('We reject the null hypothesis.')
    
else:
    print('We fail to reject the null hypothesis.')

We fail to reject the null hypothesis.


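Interpretation: At the 5% significance level we fail to reject the null hypothesis, which means the data show no significant difference between the mean sales of the three stores. Because the overall test is not significant, there is no need for a post hoc test to determine which stores differ from each other. Failing to reject the null hypothesis does not prove that it is true. Note also that `f_oneway` is a one-way ANOVA for independent groups, so it does not account for the same days being measured at every store.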
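
Since the question asks for a repeated measures ANOVA, a closer match is sketched below using `AnovaRM` from statsmodels. It assumes each day acts as the "subject" measured once at every store, and it simulates its own data; the column names `day`, `store`, and `sales` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated daily sales for three stores over 30 days (illustrative data)
rng = np.random.RandomState(42)
n_days = 30
wide = pd.DataFrame({
    'store_a': rng.randint(0, 30, size=n_days),
    'store_b': rng.randint(0, 30, size=n_days),
    'store_c': rng.randint(0, 30, size=n_days),
})
wide['day'] = np.arange(1, n_days + 1)

# AnovaRM expects long format: one row per (day, store) observation
long = wide.melt(id_vars='day', var_name='store', value_name='sales')
long['sales'] = long['sales'].astype(float)

# 'day' is the repeated unit; 'store' is the within-subject factor
result = AnovaRM(data=long, depvar='sales', subject='day', within=['store']).fit()
print(result.anova_table)
```

If the store effect were significant, pairwise paired t-tests with a Bonferroni or Holm correction would be one way to follow up.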