# Question - 1
ans - 

1). Independence of Observations:

*. Assumption: Observations in each group should be independent of each other. This means that the value of one observation should not be influenced by or dependent on the value of another observation.

*. Violation Example: If you are conducting an ANOVA on test scores of students in different classrooms, and students within the same classroom collaborate on their tests, their scores influence one another, which violates the independence assumption.

2). Homogeneity of Variances (Homoscedasticity):

*. Assumption: The variances of the different groups being compared should be approximately equal. In other words, the spread of the data points should be similar across all groups.

*. Violation Example: If you are comparing the yields of different types of crops, and one crop type consistently shows much greater variation in yield compared to others, this would violate the homogeneity of variances assumption.

3). Normality of Residuals:

*. Assumption: The residuals (the differences between the observed values and the group means) should follow a normal distribution. This assumption applies to the residuals, not necessarily to the original data.

*. Violation Example: If the residuals are strongly skewed or heavy-tailed (for example, reaction times with a few very slow responses), the p-values and confidence intervals from the ANOVA can be inaccurate, particularly in small samples.

4). Random Sampling:

*. Assumption: The samples selected from each group should be random and representative of the population from which they are drawn.

*. Violation Example: If you are conducting an ANOVA on income levels across different regions and your samples are not randomly selected but, for example, biased towards a certain income group, it can introduce bias into your results.

5). Independence of Groups:

*. Assumption: The groups or treatments being compared should be mutually exclusive, and individuals or items should belong to only one group.
*. Violation Example: If you are conducting an ANOVA to compare the effectiveness of three different advertising campaigns, and some individuals are exposed to multiple campaigns, it can violate this assumption.
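
The homogeneity-of-variance and normality assumptions above can be screened descriptively before any formal test. The sketch below uses only the standard library and made-up yield data (the `groups` dictionary and all of its values are illustrative assumptions, not real measurements). It computes each group's sample variance, the ratio of the largest to the smallest variance, and the residuals (observation minus group mean).

```python
from statistics import mean, variance

# Hypothetical crop yields; values are illustrative only.
groups = {
    "wheat": [20.0, 22.0, 21.0, 23.0],
    "corn": [30.0, 31.0, 29.0, 30.0],
    "rice": [10.0, 25.0, 40.0, 5.0],  # deliberately much more spread out
}

# Sample variance (n - 1 denominator) of each group.
group_variances = {name: variance(values) for name, values in groups.items()}

# Ratio of the largest to the smallest group variance.
variance_ratio = max(group_variances.values()) / min(group_variances.values())

# Residuals: each observation minus its own group mean.
residuals = [x - mean(values) for values in groups.values() for x in values]

print(group_variances)
print(f"largest/smallest variance ratio: {variance_ratio:.1f}")
print(f"number of residuals: {len(residuals)}")
```

This is a descriptive check only. Formal tests such as `scipy.stats.levene` (equal variances) and `scipy.stats.shapiro` (normality of the residuals) would be the usual next step.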

# Question - 2
ans - 

1). One-Way ANOVA:

*. Situation: One-Way ANOVA is used when you have one independent variable (factor) with three or more levels or groups, and you want to compare the means of those groups to determine if there are any statistically significant differences. (With only two groups it gives the same result as an independent-samples t-test with equal variances assumed.)

*. Example: Suppose you want to compare the test scores of students who have received three different types of tutoring (Group A, Group B, and Group C) to see if there are differences in their mean scores.


2). Two-Way ANOVA:

*. Situation: Two-Way ANOVA is used when you have two independent variables (factors) and you want to determine the effects of these two factors on a dependent variable. It examines not only the main effects of each factor but also their interaction effect.

*. Example: Suppose you are conducting a study to analyze the effects of both diet (Factor 1: Diet A, Diet B) and exercise (Factor 2: Exercise Yes/No) on weight loss. Two-Way ANOVA allows you to assess the impact of diet, exercise, and their interaction on weight loss.


3). Repeated Measures ANOVA:

*. Situation: Repeated Measures ANOVA is used when you have measured the same subjects or items under multiple conditions or time points. It tests whether the mean of the dependent variable differs across those conditions or time points, while accounting for the correlation between measurements taken on the same subject.

*. Example: Imagine a study where the same group of participants is tested for their reaction times under three different conditions: before, during, and after a training program. Repeated Measures ANOVA helps assess whether the reaction times change significantly across these time points.




>>. In summary:

a). One-Way ANOVA is used when you have one factor with multiple levels or groups.

b). Two-Way ANOVA is used when you have two factors, and you want to examine their main effects and interaction.

c). Repeated Measures ANOVA is used when you have measurements on the same subjects or items under multiple conditions or time points.

# Question - 3
ans - 

The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps explain the sources of variation in a dataset and how these sources contribute to the overall variability in the dependent variable. Understanding this concept is crucial in ANOVA because it allows researchers to assess the relative importance of different factors or sources of variation in a statistical analysis. It also helps in drawing conclusions about the significance of these factors and making informed decisions based on the results.

In ANOVA, the total variance in the dataset (the total variability in the dependent variable) is divided into several components, which are as follows:

1). Total Variance (Total Sum of Squares, SST):

The total variance represents the overall variability in the data. It measures how much the individual data points differ from the overall mean.


2). Between-Group Variance (Between-Group Sum of Squares, SSB):

This component of variance quantifies the variation between the different groups or treatments in your dataset. It measures how much the group means differ from the overall mean.


3). Within-Group Variance (Within-Group Sum of Squares, SSW):

Within-group variance measures the variation within each group or treatment. It represents how much individual data points within a group differ from their respective group mean.


>. The relationship between these components is expressed by the ANOVA identity:

SST=SSB+SSW
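
The identity can be checked numerically. The following is a minimal sketch on made-up data (the three groups and their values are assumptions chosen so the arithmetic comes out in whole numbers); it computes each sum of squares directly from its definition.

```python
from statistics import mean

# Hypothetical data: three groups of three observations each.
groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

# SST: squared deviations of every observation from the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

# SSB: squared deviations of each group mean from the grand mean,
# weighted by the group size.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# SSW: squared deviations of each observation from its own group mean.
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

print(f"SST = {sst}, SSB = {ssb}, SSW = {ssw}")
print(f"SSB + SSW equals SST: {ssb + ssw == sst}")
```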

***. Understanding this partitioning of variance is essential for several reasons:

a). Assessing Significance: By comparing the between-group variance to the within-group variance, ANOVA determines whether there are statistically significant differences among the groups. If the between-group variance is much larger than the within-group variance, it suggests that the groups are different, and the differences are unlikely to be due to random chance.

b). Identifying Influential Factors: It helps researchers identify which factors or groups are contributing the most to the observed differences in the dependent variable. This is valuable for understanding the factors that are driving the outcomes.

c). Model Interpretation: Understanding the partitioning of variance aids in interpreting the ANOVA model and its results. Researchers can explain how much of the total variation in the dependent variable is explained by the factors they are studying.

d). Model Comparison: Researchers can use the partitioning of variance to compare different models. For example, they can assess whether adding a new factor to the model significantly reduces the within-group variance and improves the model's explanatory power.

e). Effect Size: It allows for the calculation of effect size measures, such as eta-squared or partial eta-squared, which quantify the proportion of total variance attributable to the factors in the model.
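
Points (a) and (e) reduce to simple ratios once the sums of squares are known. A minimal sketch, assuming hypothetical values SSB = 54 and SSW = 6 from k = 3 groups and N = 9 observations (all four numbers are illustrative assumptions):

```python
# Hypothetical inputs (illustrative only).
ssb, ssw = 54.0, 6.0   # between-group and within-group sums of squares
k, n_total = 3, 9      # number of groups, total number of observations

df_between = k - 1
df_within = n_total - k

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_between = ssb / df_between
ms_within = ssw / df_within

# F-statistic: between-group variance relative to within-group variance.
f_stat = ms_between / ms_within

# Eta-squared: share of the total variation attributable to group membership.
eta_squared = ssb / (ssb + ssw)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, eta^2 = {eta_squared:.2f}")
```

A large F means the group means are spread out relative to the noise within groups. Eta-squared expresses the same comparison as a proportion of the total sum of squares.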

# Question - 4
ans - 

In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [15]:
df = pd.DataFrame({'group': ['A','A','B','B','C', 'C'],
                   'value':[10,12,15,18,14,16]})
df

model = ols('value ~ group' , data = df).fit()


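# SST: total sum of squares (deviations of each value from the grand mean)

sst = sum((df['value'] - df['value'].mean())**2)
print(f"total sum of squares is {sst}")

# SSE: explained (between-group) sum of squares, i.e. SSB
# (deviations of the fitted group means from the grand mean)

sse = sum((model.fittedvalues - df['value'].mean())**2)
print(f"explained sum of squares is {sse}")


# SSR: residual (within-group) sum of squares, i.e. SSW
# (deviations of each value from its fitted group mean)

ssr = sum((df['value'] - model.fittedvalues)**2)
print(f"residual sum of squares is {ssr}")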

total sum of squares is 40.833333333333336
explained sum of squares is 32.33333333333331
residual sum of squares is 8.5


# Question - 5
ans - 

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                     'factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
                     'value': [10, 12, 8, 9, 15, 14, 7, 6]})


# fit a Two way anova model

formula ='value ~ factor1 * factor2'
model = ols(formula , data = df).fit()


anova_table = sm.stats.anova_lm(model , typ=2)
print(anova_table)

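# Note: sum_sq / df is the mean square of each term. Whether an effect is
# significant is judged from the F and PR(>F) columns of the table.
maineffect_factor1 = anova_table['sum_sq']['factor1'] / anova_table['df']['factor1']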
print(f"main effect for factor1 is {maineffect_factor1}")


maineffect_factor2 = anova_table['sum_sq']['factor2'] / anova_table['df']['factor2']
print(f"main effect for factor2 is {maineffect_factor2}")


interaction_effect = anova_table['sum_sq']['factor1:factor2'] / anova_table['df']['factor1:factor2']
print(f"interaction effect is {interaction_effect}")
              

                 sum_sq   df          F    PR(>F)
factor1          55.125  1.0  11.307692  0.028234
factor2           0.125  1.0   0.025641  0.880541
factor1:factor2   0.125  1.0   0.025641  0.880541
Residual         19.500  4.0        NaN       NaN
main effect for factor1 is 55.12499999999986
main effect for factor2 is 0.12500000000000114
interaction effect is 0.12500000000000025
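
The F column in the table above can be reproduced by hand: each term's mean square (sum_sq / df) is divided by the residual mean square. A short sketch using the sum_sq and df values printed above (the dictionary layout and variable names are illustrative choices):

```python
# sum_sq and df values copied from the ANOVA table above.
table = {
    "factor1": (55.125, 1),
    "factor2": (0.125, 1),
    "factor1:factor2": (0.125, 1),
    "Residual": (19.5, 4),
}

# Residual mean square: the denominator of every F-ratio.
resid_ss, resid_df = table["Residual"]
ms_residual = resid_ss / resid_df

f_values = {}
for term, (ss, df) in table.items():
    if term == "Residual":
        continue
    # F = (term mean square) / (residual mean square)
    f_values[term] = (ss / df) / ms_residual
    print(f"{term}: F = {f_values[term]:.6f}")
```

Only factor1 has an F-value large enough to give a p-value below 0.05 in the table; factor2 and the interaction do not.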


# Question -6 
ans - 

In this scenario, we obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how to interpret these results:

>>. Null Hypothesis (H0): The null hypothesis in a one-way ANOVA is that all group (population) means are equal.

>>. Alternative Hypothesis (Ha): The alternative hypothesis is that at least one group mean differs from the others.

--. Based on the results:

a). F-Statistic (5.23): This is a test statistic that quantifies the ratio of the variation between group means to the variation within groups. In simpler terms, it measures whether the differences between the group means are larger than what you would expect due to random chance.

b). P-Value (0.02): The p-value is the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis were true. Here, a p-value of 0.02 means that, if all group means were truly equal, an F-statistic of 5.23 or larger would occur about 2% of the time.

--.  Interpretation:

With a p-value of 0.02, which is less than the common significance level of 0.05, you would reject the null hypothesis (H0). This suggests that at least one group mean differs from the others. The F-test alone does not identify which groups differ; a post-hoc test is needed for that.

# Question - 7
ans - 

In a repeated measures ANOVA (Analysis of Variance), handling missing data is an important consideration because missing data can potentially bias the results and reduce the power of the analysis. Here are some common methods for handling missing data in repeated measures ANOVA and their potential consequences:

>>.1 Complete Case Analysis (Listwise Deletion):

* . Method: In this approach, cases (participants) with any missing data on any variable are excluded from the analysis.

* . Consequences:

--. Pros: It's straightforward and does not require imputation methods.

--. Cons: Reduces the sample size, potentially leading to reduced statistical power. May introduce bias if the missing data are not missing completely at random (MCAR).


>>.2 Pairwise Deletion:

* . Method: In this approach, cases with missing data on a specific variable are excluded only from the analysis of that variable, allowing you to retain cases with partial data for other variables.

* . Consequences:

--. Pros: Retains more data compared to listwise deletion.

--. Cons: Can lead to different sample sizes for different variables, potentially affecting the interpretation of results. May also be problematic if data are not MCAR.


>>.3 Imputation Methods:

* . Methods: Various imputation methods can be used to estimate missing values, such as mean imputation, median imputation, regression imputation, or more advanced methods like multiple imputation.

* . Consequences:

--. Pros: Retains the full sample size and can reduce bias in parameter estimates.

--. Cons: Imputed values may introduce noise or bias if the imputation model is misspecified. The choice of imputation method can impact the results, and assumptions about the missing data mechanism (e.g., MCAR, MAR, MNAR) need to be made.


>>.4 Maximum Likelihood Estimation (MLE):

* . Method: MLE is a statistical method that estimates model parameters while accounting for missing data in the likelihood function.

* . Consequences:

--. Pros: Provides efficient estimates and standard errors, taking into account the uncertainty associated with missing data.

--. Cons: Requires more complex modeling and software that supports MLE. Assumptions about the missing data mechanism are still necessary.


>>.5 Sensitivity Analysis:

* . Method: Perform the analysis using different methods for handling missing data (e.g., listwise deletion, imputation) and compare results.

* . Consequences:

--. Pros: Helps assess the robustness of the findings to different missing data handling methods.

--. Cons: Requires additional analyses and may not definitively resolve the issue of missing data.
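
To make the trade-off between listwise deletion and mean imputation concrete, the sketch below applies both to a small made-up repeated-measures data set. The `scores` table, with `None` marking a missing value, is an illustrative assumption, and only the standard library is used.

```python
from statistics import mean, variance

# Hypothetical scores for five participants at three time points;
# None marks a missing measurement.
scores = [
    [10, 12, 14],
    [11, None, 15],
    [9, 11, 13],
    [12, 14, None],
    [10, 13, 15],
]

# Listwise deletion: keep only participants with no missing values.
complete_cases = [row for row in scores if None not in row]

# Mean imputation: replace each missing value with the mean of the
# observed values at the same time point (column).
n_times = len(scores[0])
column_means = [
    mean(row[t] for row in scores if row[t] is not None) for t in range(n_times)
]
imputed = [
    [column_means[t] if row[t] is None else row[t] for t in range(n_times)]
    for row in scores
]

print(f"participants kept by listwise deletion: {len(complete_cases)} of {len(scores)}")

# Variance at the second time point: observed values only vs. after imputation.
observed_t2 = [row[1] for row in scores if row[1] is not None]
imputed_t2 = [row[1] for row in imputed]
print(f"variance (observed): {variance(observed_t2):.3f}")
print(f"variance (imputed):  {variance(imputed_t2):.3f}")
```

Listwise deletion discards two of the five participants here. Mean imputation keeps all five but makes the second time point look less variable than the observed data, which is one reason multiple imputation or likelihood-based methods are usually preferred.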

# Question - 8
ans - 

1). Tukey's Honestly Significant Difference (HSD) Test:

* . When to use: Tukey's HSD is a widely used post-hoc test when you have multiple groups (more than two) and you want to compare all possible pairs of group means.

* . Example: In a study comparing the effectiveness of four different teaching methods, you find a significant difference among the methods using ANOVA. You can then use Tukey's HSD to identify which specific pairs of teaching methods have significantly different outcomes.


2). Bonferroni Correction:

* . When to use: Bonferroni correction is used when you want to control the familywise error rate (the probability of making at least one Type I error) in multiple comparisons.

* . Example: If you are conducting multiple pairwise comparisons (e.g., comparing the means of five different groups with each other), you might use Bonferroni correction to adjust the significance level to maintain an overall alpha level (e.g., 0.05) while conducting multiple tests.


3). Sidak Correction:

* . When to use: Similar to Bonferroni correction, Sidak correction is used to control the familywise error rate. It is slightly less conservative than Bonferroni and is exact when the comparisons are independent.

* . Example: If you are conducting a small number of pairwise comparisons (e.g., comparing three different treatments), you might use Sidak correction to adjust the significance level.


4). Dunnett's Test:

* . When to use: Dunnett's test is used when you have one control group and want to compare it with multiple treatment groups.

* . Example: In a drug trial, you have one control group receiving a placebo and multiple treatment groups receiving different doses of the drug. You can use Dunnett's test to compare each treatment group to the control group.


5). Holm-Bonferroni Method:

* . When to use: Holm-Bonferroni is a sequential (step-down) correction method that controls the familywise error rate. It is never less powerful than the plain Bonferroni correction, so it can be used wherever Bonferroni would be.

* . Example: If you are comparing the performance of several products in a consumer study, you might use Holm-Bonferroni correction to adjust for multiple comparisons and identify significant differences.


6). Games-Howell Test:

* . When to use: The Games-Howell test is used when the groups have unequal variances; it also accommodates unequal sample sizes.

* . Example: In a study comparing the exam scores of students in different schools, if the sample sizes and variances of the schools' scores are not equal, you might use the Games-Howell test to compare the schools.
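
The Bonferroni, Sidak, and Holm-Bonferroni adjustments described above can be written in a few lines. This is a minimal sketch in plain Python with made-up p-values; `p_values`, `alpha`, and the function names are illustrative assumptions, not a library API.

```python
def bonferroni_alpha(alpha, m):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / m


def sidak_alpha(alpha, m):
    """Per-comparison significance level under the Sidak correction."""
    return 1 - (1 - alpha) ** (1 / m)


def holm_reject(p_values, alpha):
    """Holm-Bonferroni step-down procedure; returns reject flags in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The p-value at position `rank` (0-based, ascending) is compared
        # with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


p_values = [0.01, 0.02, 0.04]  # hypothetical p-values from three comparisons
alpha = 0.05
m = len(p_values)

bonf = [p <= bonferroni_alpha(alpha, m) for p in p_values]
sidak = [p <= sidak_alpha(alpha, m) for p in p_values]
holm = holm_reject(p_values, alpha)

print(f"Bonferroni threshold: {bonferroni_alpha(alpha, m):.6f}, reject: {bonf}")
print(f"Sidak threshold:      {sidak_alpha(alpha, m):.6f}, reject: {sidak}")
print(f"Holm reject:          {holm}")
```

With these three p-values, Bonferroni and Sidak reject only the smallest one, while Holm rejects all three. In practice, `statsmodels.stats.multitest.multipletests` provides these corrections through its `method` argument.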

# Question - 9
ans -

In [7]:
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols


np.random.seed(0)

diet_a_group = np.random.normal(5,2,50)
diet_b_group = np.random.normal(6,2,50)
diet_c_group = np.random.normal(4,2,50)






f_statistics , p_value = stats.f_oneway(diet_a_group,diet_b_group,diet_c_group)

print(f"f statistics is {f_statistics}")
print(f"p value is {p_value}")


if p_value < 0.05:
    print("we reject the null hypothesis, there is a significant difference between the mean weight loss of the three diets")

else:
    print("failed to reject the null hypothesis, there is no significant difference between the mean weight loss of the three diets")

f statistics is 6.229115357571769
p value is 0.0025308938971832957
we reject the null hypothesis, there is a significant difference between the mean weight loss of the three diets


# Question - 10
ans- 

In [21]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np


np.random.seed(0) 
n_employees = 30
n_repeats = 3
n_total_observations = n_employees * n_repeats

data = pd.DataFrame({
    'Software': np.repeat(['A', 'B', 'C'], n_employees), 
    'Experience': np.tile(['Novice', 'Experienced'], n_total_observations // 2),  
    'Time': np.random.randint(5, 21, n_total_observations) })

# Perform a two-way ANOVA
formula1 = 'Time ~ C(Software) * C(Experience)'
# Fit with formula1 and pass model1 on (not `formula`/`model` left over from an earlier cell)
model1 = ols(formula1, data=data).fit()
anova_table1 = sm.stats.anova_lm(model1, typ=2)


print(anova_table1)


print("There are no significant main effects of software programs or employee experience level on task completion time. "
      "There is no significant interaction effect between software programs and employee experience level on task completion time.")

                                sum_sq    df         F    PR(>F)
C(Software)                   5.955556   2.0  0.140535  0.869097
C(Experience)                12.100000   1.0  0.571054  0.451954
C(Software):C(Experience)    39.466667   2.0  0.931306  0.398069
Residual                   1779.866667  84.0       NaN       NaN
There are no significant main effects of software programs or employee experience level on task completion time. There is no significant interaction effect between software programs and employee experience level on task completion time.
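
Rather than hard-coding the conclusion, it can be derived from the p-values. A minimal sketch, using the PR(>F) values copied from the table above and an assumed significance level of 0.05 (the `p_values` dictionary, its labels, and the `alpha` name are illustrative):

```python
alpha = 0.05  # assumed significance level

# PR(>F) values copied from the ANOVA table above.
p_values = {
    "Software": 0.869097,
    "Experience": 0.451954,
    "Software x Experience interaction": 0.398069,
}

conclusions = {}
for term, p in p_values.items():
    significant = p < alpha
    conclusions[term] = significant
    verdict = "significant" if significant else "not significant"
    print(f"{term}: p = {p:.4f} -> {verdict} at alpha = {alpha}")
```

All three p-values are well above 0.05, which is expected here because the completion times were generated at random, independently of software and experience.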


# Question - 11
ans - 

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison

alpha = 0.05

np.random.seed(100)

control_group_scores  = np.random.randint(90,100,100)
experimental_group_Scores = np.random.randint(88,100,100)


# Independent two-sample t-test (equal variances assumed by default)
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_Scores)
print(f"t statistic is {t_statistic}")
print(f"p value is {p_value}")

if p_value < alpha:
    print("we reject the null hypothesis, there is a significant difference between the groups")

else:
    print("failed to reject the null hypothesis, there is no significant difference between the groups")

t statistic is 2.293746243557139
p value is 0.022854674230857262
we reject the null hypothesis, there is a significant difference between the groups


In [7]:
all_scores = np.concatenate((control_group_scores , experimental_group_Scores))

group_labels  = np.array(['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_Scores))

multicomparison = MultiComparison(all_scores , group_labels)

result = multicomparison.tukeyhsd()

print(result)

    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower   upper  reject
-----------------------------------------------------------
Control Experimental    -0.98 0.0229 -1.8225 -0.1375   True
-----------------------------------------------------------


# Question - 12
ans - 

In [29]:
import numpy as np 
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison
from statsmodels.stats.multitest import multipletests

np.random.seed(30)
data = {'day': np.arange(1,31), 
       'store_A': np.random.randint(100,200,30), 
       'store_B': np.random.randint(90,210,30),
       'store_C': np.random.randint(80,190,30),
        
       }
df = pd.DataFrame(data)

store_A_sales = df['store_A']
store_B_sales = df['store_B']
store_C_sales = df['store_C']

df['sales'] = store_A_sales + store_B_sales + store_C_sales

df



   day  store_A  store_B  store_C  sales
0    1      137      105      134    376
1    2      137      113      149    399
2    3      145      190      185    520
3    4      145      103      170    418
4    5      112      140      106    358
5    6      123      123      158    404
6    7      102      145      171    418
7    8      153      118       99    370
8    9      117      203      110    430
9   10      146      148      129    423


In [7]:
pip install pingouin



In [18]:





import pandas as pd
import numpy as np


num_days = 30
num_stores = 3


store_labels = np.repeat(['A', 'B', 'C'], num_days)


np.random.seed(0)  
sales_data = np.random.randint(100, 1000, size=num_days * num_stores)


day_labels = np.tile([f'Day {i}' for i in range(1, num_days + 1)], num_stores)


df = pd.DataFrame({'Store': store_labels, 'Sales': sales_data, 'Day': day_labels})


import pingouin as pg
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.anova import AnovaRM

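# AnovaRM arguments are (data, depvar, subject, within=[...]).
# As written, each store is treated as a subject and Day is the within-subject
# factor, so the table below reports the effect of Day. To test for differences
# between stores with days as the repeated units, use subject='Day' and
# within=['Store'] instead.
aov = AnovaRM(df, depvar='Sales', subject='Store', within=['Day'])
result = aov.fit()

print(result)

# Use .iloc[0] for positional access to the first p-value in the table.
if result.anova_table['Pr > F'].iloc[0] < 0.05:
    posthoc = pg.pairwise_tukey(df, dv='Sales', between='Store')
    print(posthoc)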







              Anova
    F Value  Num DF  Den DF Pr > F
----------------------------------
Day  1.4945 29.0000 58.0000 0.0965
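
For reference, the F-ratio of a one-way repeated measures ANOVA can be computed by hand. The sketch below is an illustration on made-up data, not a re-analysis of the data above. It treats each day as the subject (row) and the store as the within-subject factor (column), and every number in `sales` is an assumption chosen to keep the arithmetic simple.

```python
from statistics import mean

# Hypothetical sales: one row per day (subject), one column per store (A, B, C).
sales = [
    [10, 12, 14],
    [11, 13, 15],
    [9, 12, 15],
    [10, 11, 12],
]

n_days = len(sales)
n_stores = len(sales[0])
grand_mean = mean(x for row in sales for x in row)

store_means = [mean(row[j] for row in sales) for j in range(n_stores)]
day_means = [mean(row) for row in sales]

# Variation between stores (the effect of interest).
ss_store = n_days * sum((m - grand_mean) ** 2 for m in store_means)
# Variation between days (subjects); removed from the error term.
ss_day = n_stores * sum((m - grand_mean) ** 2 for m in day_means)
# Total variation around the grand mean.
ss_total = sum((x - grand_mean) ** 2 for row in sales for x in row)
# What is left after removing the store and day effects.
ss_error = ss_total - ss_store - ss_day

df_store = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

f_stat = (ss_store / df_store) / (ss_error / df_error)
print(f"F({df_store}, {df_error}) = {f_stat:.2f}")
```

Because the day-to-day variation is subtracted before the error term is formed, the same store differences give a larger F here than they would in an ordinary one-way ANOVA on the same numbers.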

