# STAT 13 MARCH ASSIGNMENT ABOUT THE ANOVA 

In [5]:
import logging
logging.basicConfig(filename="13Marchinfo.log", level=logging.INFO, format="%(asctime)s %(name)s %(message)s")

# Answer1
### ANOVA 

Analysis of Variance (ANOVA) is a statistical technique that tests whether the means of two or more groups differ, by comparing the variation between the group means with the variation within the groups.


## ANOVA relies on the following assumptions for its results to be valid.

##### 1 Normality assumption: The data within each group should follow a normal distribution.

##### 2 Homogeneity of variance assumption: The variance of the data in each group should be equal.

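##### 3 Absence of outliers: Extreme scores can distort the group means and inflate the variances, so they should be investigated and handled (corrected, transformed, or removed with justification) before the analysis.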

##### 4 Independence assumption: Observations in each group should be independent of one another.

### If any of these assumptions are violated, the results of ANOVA may not be reliable.


## Some examples of violations that could impact the validity of the results are:

### 1 Violation of the normality assumption:

If the data in any of the groups are not normally distributed, the ANOVA results may not be accurate. For example, strongly skewed data or outliers can violate the normality assumption. In this case, a transformation of the data may be necessary, or a non-parametric test such as the Kruskal-Wallis test may be more appropriate.

### 2 Violation of the homogeneity of variance assumption:

If the group variances are not equal, the ANOVA results may be affected. For example, if the variances of the groups are very different, the F-test used in ANOVA may not accurately reflect the differences between the groups. In this case, Welch's ANOVA may be more appropriate.

### 3 Violation of the independence assumption:

If the observations in any of the groups are not independent, the ANOVA results may be unreliable. For example, if repeated measures are taken on the same individuals or if the observations are clustered, the independence assumption is not met. In this case, a repeated measures ANOVA or a mixed-effects model may be more appropriate.
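
The normality and equal-variance assumptions can be checked before running the ANOVA. The following is a minimal sketch, assuming three independent groups of made-up scores (the data, group names, and sample sizes are illustrative and are not part of the assignment). It uses the Shapiro-Wilk test for normality within each group and Levene's test for homogeneity of variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical scores for three independent groups (illustrative data only)
groups = {
    "A": rng.normal(70, 10, 30),
    "B": rng.normal(75, 10, 30),
    "C": rng.normal(80, 10, 30),
}

# Normality within each group: Shapiro-Wilk (null hypothesis: the data are normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests that the corresponding assumption is doubtful. Independence cannot be tested this way; it has to follow from how the data were collected.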



# Answer2
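## Three types of ANOVA

#### 1 ONE-WAY ANOVA:
One factor with at least two levels, where the levels are independent (each subject belongs to exactly one group).

#### 2 REPEATED MEASURES ANOVA:
One factor with at least two levels, where the levels are dependent (the same subjects are measured under every level, for example at several time points).

#### 3 FACTORIAL ANOVA (for example TWO-WAY ANOVA):
Two or more factors, each with at least two levels; the levels can be independent, dependent, or a mix of both. It tests the main effect of each factor and the interaction between them. MANOVA (Multivariate ANOVA) is a different extension that analyses several dependent variables at once.

## Situations in which each would be used

- ONE-WAY ANOVA is used to compare the means of independent groups defined by a single factor.
- REPEATED MEASURES ANOVA is used when the same subjects are measured under every level of the factor, so the levels are dependent.
- FACTORIAL ANOVA is used when two or more factors are studied together and their interaction is of interest; the levels can be independent, dependent, or mixed.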



In [1]:
# Answer3 
# The partitioning of variance in ANOVA.
"""
In Analysis of Variance (ANOVA), partitioning of variance refers to the decomposition of the total variation observed in a response variable into different components that can be attributed to different sources of variation.

"""

# The total variation in the data (SST) is decomposed into two components: SST = SSB + SSW

"""
Between-group variation (SSB): This represents the variation of the group (treatment) means around the grand mean. If it is large relative to the within-group variation, the F-statistic is large, which suggests that at least one group mean differs from the others.

Within-group variation (SSW), also called error or residual variation: This represents the variation of the observations around their own group mean. It is the random variation that the grouping cannot explain, including measurement error and natural variation between individuals.
"""
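
The partition can be verified numerically. This is a small sketch on made-up scores (the three groups and their values are hypothetical) showing that the total sum of squares equals the between-group plus the within-group sum of squares, and how the F-statistic follows from the two parts.

```python
# Hypothetical scores for three groups (illustrative data only)
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [6, 7, 8],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)


def mean(values):
    return sum(values) / len(values)


# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Within-group sum of squares: observations around their own group mean
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

# Total sum of squares: observations around the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(ss_between, ss_within, ss_total, f_statistic)
# prints: 42.0 6.0 48.0 21.0
```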

In [6]:
# Answer4 

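# SST SSE SSR
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(123)

# 20 scores per school; np.repeat keeps each label aligned with its own block of scores
data = pd.DataFrame({
    'School': np.repeat(['A', 'B', 'C'], 20),
    'Score': np.concatenate([np.random.normal(70, 10, 20),
                              np.random.normal(75, 8, 20),
                              np.random.normal(80, 12, 20)])})

# Fit the one-way ANOVA model and build the ANOVA table
model = ols('Score ~ School', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares: the 'School' row
ss_explained = anova_table.loc['School', 'sum_sq']

# Residual (within-group) sum of squares: the 'Residual' row
ss_residual = anova_table.loc['Residual', 'sum_sq']

# The total sum of squares is the sum of the two components
sst = ss_explained + ss_residual

print("SST (total):", sst)
print("Explained sum of squares (between groups):", ss_explained)
print("Residual sum of squares (within groups):", ss_residual)
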

In [None]:
# Answer5 

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas dataframe
data = pd.read_csv('data.csv')

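# Fit the two-way ANOVA model using the formula interface
# (A and B are assumed to be categorical columns of data.csv; if they hold numeric
# codes, wrap them as C(A) and C(B), which also changes the row labels used below)
model = ols('y ~ A + B + A:B', data=data).fit()

# Calculate the ANOVA table (Type II sums of squares)
table = sm.stats.anova_lm(model, typ=2)

# Extract the main effects and the interaction effect by row label
main_effects = table.loc[['A', 'B']]
interaction = table.loc['A:B']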

# Print the main effects and interaction effects
print("Main effects:")
print(main_effects)
print("Interaction effect:")
print(interaction)


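# Answer6

### A one-way ANOVA with an F-statistic of 5.23 and a p-value of 0.02 indicates a statistically significant difference between the group means on the dependent variable.

### Explanation:
##### The F-statistic is the ratio of the between-group mean square to the within-group mean square; a value of 5.23 means the between-group mean square is about five times the within-group mean square.
##### The p-value (0.02) is less than the significance level (0.05), so we reject the null hypothesis that all group means are equal.

### There is a significant effect of the independent variable on the dependent variable: at least one group mean differs from the others. A post-hoc test is needed to identify which groups differ.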
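
The p-value is the upper-tail probability of the F distribution at the observed F-statistic. The sketch below shows the calculation with `scipy.stats.f`. The question does not state the degrees of freedom, so the design used here (3 groups, 30 observations) is a hypothetical assumption, and the resulting p-value will not exactly match the 0.02 quoted above.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical design (not given in the question): 3 groups, 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical F value at the 5% significance level for the same degrees of freedom
critical_value = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_value:.2f}")
```

Rejecting the null hypothesis because the p-value is below 0.05 is the same decision as rejecting it because the F-statistic exceeds the critical value.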

# Answer 7 

### Handling missing data in a repeated measures ANOVA is challenging because each participant has multiple measurements of the dependent variable over time, and a classical repeated measures ANOVA needs a value for every participant at every time point.

## Here are a few common methods for handling missing data in a repeated measures ANOVA and their potential consequences:

- __Complete Case Analysis:__
  This method includes only participants with complete data. It is easy to implement, but it reduces the sample size and the statistical power, and it gives unbiased results only if the data are missing completely at random.

- __Pairwise Deletion:__
  This method uses every available measurement, so a participant is dropped only from the comparisons that involve their missing time points. It keeps more data than complete case analysis, but different comparisons are then based on different subsets of participants, which can make the results inconsistent and hard to interpret.

- __Imputation:__
  This method estimates the missing values from the available data and uses the estimates in the analysis. Options include mean imputation, regression imputation, and multiple imputation. Mean imputation understates the variability and can bias the F-test; multiple imputation preserves the uncertainty about the missing values but is more complex to carry out.
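
Two of these strategies can be sketched with pandas. The wide-format table below (one row per participant, one column per time point) and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 9.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 10.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 11.0, 15.0],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

# Complete case analysis: keep only participants with no missing measurements
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = scores.fillna(scores.mean())

print(f"Participants kept by complete case analysis: {len(complete_cases)} of {len(scores)}")
print(mean_imputed)
```

Here complete case analysis discards two of the five participants, while mean imputation keeps everyone but places the filled-in values exactly at the column mean, which shrinks the apparent variability.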


# Answer8

## Post-hoc tests:
##### Post-hoc tests are used after a significant ANOVA to determine which groups differ significantly from each other.

## When you would use each one:

- __Tukey's HSD (Honestly Significant Difference):__
  This test compares all pairs of means and controls the family-wise Type I error rate.

  It is appropriate when you want every pairwise comparison and the groups have equal variances and similar sample sizes.

- __Bonferroni correction:__

  This method divides the significance level (alpha) by the number of comparisons to control the family-wise Type I error rate.

  It is appropriate for a small number of planned comparisons; with many comparisons it becomes very conservative and loses power.

- __Scheffe's test:__
  This test is a conservative method that controls the family-wise error rate (the probability of making at least one Type I error in a set of comparisons) for all possible contrasts, not only pairwise ones.

  It is appropriate when you want to test complex contrasts, such as one group against the average of several others, or contrasts chosen after looking at the data.

- __Games-Howell test:__
  This test is a pairwise method that does not assume equal variances or equal sample sizes. It is not non-parametric: it still assumes approximately normal data within each group.

  It is appropriate when the groups have unequal variances or unequal sample sizes.
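
As an illustration of how a correction for multiple comparisons works, here is a minimal sketch of the Bonferroni adjustment in plain Python. The three unadjusted p-values are hypothetical; in practice they would come from the pairwise tests.

```python
# Hypothetical unadjusted p-values from three pairwise comparisons
raw_p_values = {"A vs B": 0.012, "A vs C": 0.030, "B vs C": 0.400}

alpha = 0.05
n_comparisons = len(raw_p_values)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1
adjusted = {pair: min(p * n_comparisons, 1.0) for pair, p in raw_p_values.items()}

for pair, p_adj in adjusted.items():
    decision = "reject H0" if p_adj < alpha else "fail to reject H0"
    print(f"{pair}: adjusted p = {p_adj:.3f} -> {decision}")
```

Comparing the adjusted p-values with alpha is equivalent to comparing the raw p-values with alpha divided by the number of comparisons. Note that "A vs C" would be significant without the correction but is not significant after it.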




In [7]:
# Answer 9
# State the hypotheses, then apply a one-way ANOVA test
"""
1 Null hypothesis:
The mean weight loss is the same for all three diets.

2 Alternative hypothesis:

At least one diet has a mean weight loss that differs from the others.
"""



logging.info("we are using the built-in F-test function f_oneway")
import numpy as np
from scipy.stats import f_oneway

# Generate random weight loss data for each diet (in pounds)
diet_a = np.random.normal(10, 2, 50)
diet_b = np.random.normal(8, 2, 50)
diet_c = np.random.normal(6, 2, 50)

alpha = 0.05  # significance level
# Conduct one-way ANOVA on the weight loss data
f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis: at least one diet has a significantly different mean weight loss.")
else:
    print("Fail to reject the null hypothesis: no significant difference in mean weight loss was detected.")

F-statistic: 40.29141845442496
p-value: 1.1170441959957356e-14
Reject the null hypothesis: at least one diet has a significantly different mean weight loss.


In [22]:
# Answer10
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Generate random data
np.random.seed(123)
n = 30
programs = ['A', 'B', 'C']
experience = ['novice', 'experienced']
data = pd.DataFrame({
    'program': np.random.choice(programs, n),
    'experience': np.random.choice(experience, n),
    'time': np.random.normal(10, 2, n)
})

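# One-way ANOVA across the six program-by-experience cells: each cell is treated as a
# separate group, so this does not separate the two main effects from their interaction
# the way a full two-way ANOVA would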
anova = data.groupby(['program', 'experience'])['time'].apply(list)
f_stat, p_val = f_oneway(*anova)

# Print results
print('F-statistic:', f_stat)
print('p-value:', p_val)
print([anova])




F-statistic: 0.6890997864579507
p-value: 0.6364496835319241
[program  experience 
A        experienced    [10.677178101999603, 11.957472011874692, 7.411...
         novice         [10.005691831793621, 10.059366460606661, 11.78...
B        experienced    [8.600245530804166, 10.567254647614583, 9.9763...
         novice         [6.544661011758786, 11.147611724810115, 10.825...
C        experienced    [9.652728634419569, 11.37644542220457, 8.24092...
         novice         [6.456933790980306, 11.854924863517166, 8.3892...
Name: time, dtype: object]


In [7]:
# answer11 
import pandas as pd
import scipy.stats as stats

# create a sample data frame
data = {'Group': ['Control']*50 + ['Experimental']*50,
        'Test Score': [65, 70, 71, 75, 72, 78, 81, 83, 84, 88,
                       70, 73, 75, 77, 79, 81, 83, 85, 88, 90,
                       62, 66, 69, 70, 71, 74, 76, 77, 80, 81,
                       70, 71, 73, 74, 75, 76, 77, 79, 81, 85,
                       64, 68, 70, 71, 73, 75, 76, 79, 82, 84,
                       73, 75, 77, 79, 80, 82, 83, 85, 86, 88,
                       68, 70, 72, 74, 76, 78, 80, 82, 84, 86,
                       71, 74, 77, 79, 80, 82, 83, 84, 86, 88,
                       63, 65, 68, 70, 71, 74, 76, 77, 78, 80,
                       66, 68, 70, 72, 74, 76, 78, 80, 82, 84]}


df = pd.DataFrame(data)

# conduct a two-sample t-test
control = df[df['Group'] == 'Control']['Test Score']
experimental = df[df['Group'] == 'Experimental']['Test Score']

t, p = stats.ttest_ind(control, experimental, equal_var=False)
print('t-statistic:', t)
print('p-value:', p)

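# conduct a post-hoc test (with only two groups, Tukey HSD repeats the single pairwise
# comparison; the t-test here is not significant, so this is shown for illustration only)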
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(df['Test Score'], df['Group'])
print(tukey)


t-statistic: -0.8896923721665605
p-value: 0.3758115643297868
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower  upper  reject
----------------------------------------------------------
Control Experimental     1.14 0.3758 -1.4028 3.6828  False
----------------------------------------------------------


In [19]:
# answer12 

import scipy.stats as stats
import pandas as pd
import statsmodels.stats.multicomp as mc

# create a dataframe with sales data for each store
sales = {'Store A': [100, 120, 90, 80, 110, 115, 95, 105, 125, 100, 120, 90, 80, 110, 115, 95, 105, 125, 100, 120, 90, 80, 110, 115, 95, 105, 125, 100, 120, 90],
         'Store B': [85, 95, 100, 110, 90, 120, 105, 100, 80, 85, 95, 100, 110, 90, 120, 105, 100, 80, 85, 95, 100, 110, 90, 120, 105, 100, 80, 85, 95, 100],
         'Store C': [75, 80, 85, 90, 100, 120, 110, 115, 100, 75, 80, 85, 90, 100, 120, 110, 115, 100, 75, 80, 85, 90, 100, 120, 110, 115, 100, 75, 80, 85]}
df = pd.DataFrame(sales)

# conduct one-way ANOVA
f_stat, p_value = stats.f_oneway(df['Store A'], df['Store B'], df['Store C'])
print("F-statistic:", f_stat)
print("p-value:", p_value)
alpha = 0.05

if(p_value<alpha):
    print("there is a significant difference in daily sales between at least one pair of stores")
else:
    print("there is no significant difference in mean daily sales among the stores")
    
    
# One common post-hoc test is the Tukey HSD test, which can be performed using the pairwise_tukeyhsd function from the statsmodels library.
print("\n")


# perform Tukey HSD post-hoc test
tukey_result = mc.pairwise_tukeyhsd(df.stack().values, df.stack().index.get_level_values(1))
print(tukey_result)


F-statistic: 3.326928926290175
p-value: 0.04052478918981419
there is a significant difference in daily sales between at least one pair of stores


  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
Store A Store B     -6.5 0.1654 -14.9629  1.9629  False
Store A Store C  -8.8333 0.0387 -17.2962 -0.3705   True
Store B Store C  -2.3333 0.7887 -10.7962  6.1295  False
-------------------------------------------------------
