## Q1

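Assumptions of ANOVA:

1. Normality of the sampling distribution of the mean:
    1. The sample means of each group are normally distributed. This holds when the data within each group are roughly normal or the samples are large.

2. Absence of outliers:
    1. Outlying scores need to be identified and removed (or otherwise handled), because they distort group means and variances.

3. Homogeneity of variance:
    1. The population variances at the different levels of each independent variable are equal.

4. Independence: samples are independent and randomly selected.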

Examples of violations:

Non-Normality: If the data within groups are not normally distributed, ANOVA may lead to incorrect conclusions. For example, if you're comparing the exam scores of students from different schools, and the scores within one group are skewed or not normally distributed, ANOVA results might be unreliable.
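The normality and equal-variance assumptions can be checked before running ANOVA. The sketch below uses hypothetical exam scores for three schools (the data, seed, and variable names are assumptions made for illustration), with the third school deliberately right-skewed. It applies the Shapiro-Wilk test per group and Levene's test across groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed; the scores below are made up

# Hypothetical exam scores: two roughly normal schools, one right-skewed school
school_a = rng.normal(70, 10, 200)
school_b = rng.normal(72, 10, 200)
school_c = 50 + rng.exponential(20, 200)

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed
shapiro_p = {}
for name, scores in [("A", school_a), ("B", school_b), ("C", school_c)]:
    _, p = stats.shapiro(scores)
    shapiro_p[name] = p
    print(f"School {name}: Shapiro-Wilk p = {p:.4f}")

# Levene: a small p-value suggests the group variances are not equal
lev_stat, lev_p = stats.levene(school_a, school_b, school_c)
print(f"Levene p = {lev_p:.4f}")
```

If either check fails, a transformation or a non-parametric alternative such as the Kruskal-Wallis test (`stats.kruskal`) may be more appropriate than a standard one-way ANOVA.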
    

## Q2

1. One-Way ANOVA: Use when you have one categorical independent variable (factor) and want to compare the means of three or more independent groups.

2. Two-Way ANOVA: Use when you have two categorical independent variables (factors) and want to examine their individual and interaction effects on a continuous dependent variable. It's particularly useful for studying how two factors might interact to influence the outcome.

3. Repeated Measures ANOVA: Use when you have a single group of subjects or items measured under multiple conditions or time points. This allows you to analyze changes within the same subjects over time or across different conditions.

## Q3

Partitioning of variance in Analysis of Variance (ANOVA) means splitting the total variability in a dataset (the total sum of squares) into a part explained by group membership (the between-group sum of squares) and a part left unexplained (the within-group, or residual, sum of squares).

Understanding this concept matters because the F-statistic is the ratio of the between-group mean square to the within-group mean square. The partition therefore shows which factors contribute to differences in the data and whether those differences are statistically significant.
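For a one-way design the partition can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $x_{ij}$ the $j$-th observation in group $i$, $\bar{x}_i$ the group mean, and $\bar{x}$ the grand mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{\text{total}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{\text{within groups}}
$$

$$
F = \frac{\text{between-group sum of squares}/(k-1)}{\text{within-group sum of squares}/(N-k)}
$$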

## Q4

In [1]:
import numpy as np

In [2]:
group1 = [12, 15, 18, 10, 14]
group2 = [9, 8, 11, 6, 10]
group3 = [17, 20, 22, 15, 19]

In [7]:
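# Pool all observations to get the grand mean
data = np.concatenate([group1,group2,group3])
Grand_mean = np.mean(data)

# Total sum of squares: squared deviations of every observation from the grand mean
sst = np.sum((data - Grand_mean)**2)

Group1_mean = np.mean(group1)
Group2_mean = np.mean(group2)
Group3_mean = np.mean(group3)

# Explained (between-group) sum of squares: group size times the squared
# deviation of each group mean from the grand mean.
# Note: "sse" means "explained" here; some texts use SSE for the error term instead.
sse_group1 = len(group1) * ((Group1_mean - Grand_mean)**2)
sse_group2 = len(group2) * ((Group2_mean - Grand_mean)**2)
sse_group3 = len(group3) * ((Group3_mean - Grand_mean)**2)
Total_sse = sse_group1 + sse_group2 + sse_group3

# Residual (within-group) sum of squares: squared deviations of each
# observation from its own group mean
ssr_group1 = np.sum((np.array(group1) - Group1_mean)**2)
ssr_group2 = np.sum((np.array(group2) - Group2_mean)**2)
ssr_group3 = np.sum((np.array(group3) - Group3_mean)**2)
Total_ssr = ssr_group1 + ssr_group2 + ssr_group3

print("Total Sum of squares: ",sst)
print("Explained Sum of squares: ",Total_sse)
print("Residual Sum of squares: ",Total_ssr)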

Total Sum of squares:  320.9333333333333
Explained Sum of squares:  240.13333333333338
Residual Sum of squares:  80.8
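The explained and residual sums of squares add up to the total (240.13 + 80.8 = 320.93), and dividing each by its degrees of freedom gives the F-statistic. The sketch below recomputes the sums of squares for the same three groups and cross-checks the hand-computed F against `scipy.stats.f_oneway`. The variable names are chosen for illustration.

```python
import numpy as np
from scipy import stats

group1 = [12, 15, 18, 10, 14]
group2 = [9, 8, 11, 6, 10]
group3 = [17, 20, 22, 15, 19]

groups = [np.array(g, dtype=float) for g in (group1, group2, group3)]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
print("p-value:   ", p_scipy)
```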


## Q5

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [11]:
data = pd.DataFrame({'A': [10, 15, 20, 25, 30],
                     'B': [5, 8, 10, 12, 15],
                     'Y': [35, 40, 45, 60, 65]})
data

    A   B   Y
0  10   5  35
1  15   8  40
2  20  10  45
3  25  12  60
4  30  15  65


In [12]:
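# A and B are numeric columns, so the formula treats each as a continuous
# predictor (1 df each) rather than as a categorical factor with levels.
# With only 5 observations, a single residual degree of freedom remains.
model = ols('Y ~ A + B + A:B', data=data).fit()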

In [13]:
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

             sum_sq   df      F    PR(>F)
A         27.586207  1.0  2.000  0.391827
B         10.000000  1.0  0.725  0.550962
A:B        6.206897  1.0  0.450  0.623839
Residual  13.793103  1.0    NaN       NaN


## Q6

In [14]:
significance_level = 0.05
F_statistics = 5.23
p_value = 0.02

# Reject H0 when the p-value falls below the significance level
if p_value < significance_level:
    print("Reject the Null Hypothesis")
else:
    print("Fail to reject the Null Hypothesis")

Reject the Null Hypothesis


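Because the p-value (0.02) is below the significance level (0.05), we reject the null hypothesis: there is a significant difference between the group means, meaning at least one group mean differs from the others.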

## Q7

1. Mean Imputation:
    1. Method: Replace missing values with the mean of the observed values for that variable (column).
    2. Consequences:
        1. Advantages: Easy to implement and does not reduce the sample size.
        2. Disadvantages: Underestimates the variance and standard errors, leading to p-values that are too small and confidence intervals that are too narrow. It gives unbiased estimates of the mean only when values are missing completely at random (MCAR) and introduces bias when this assumption is violated.
        
        
2. Interpolation and Extrapolation:
    1. Method: Estimate missing values based on patterns or relationships in the observed data.
    2. Consequences:
        1. Advantages: Can provide more accurate imputations if patterns exist in the data.
        2. Disadvantages: Requires more complex modeling and assumptions about data patterns. Can be sensitive to model misspecification.        
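The variance shrinkage caused by mean imputation can be seen on a small example. The sketch below uses a made-up series with two missing values (the data and names are assumptions for illustration) and compares mean imputation with linear interpolation in pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing values
scores = pd.Series([10.0, 12.0, np.nan, 15.0, np.nan, 20.0])

# Mean imputation: every gap receives the mean of the observed values
mean_filled = scores.fillna(scores.mean())

# Linear interpolation: each gap is filled from its neighbouring values
interpolated = scores.interpolate(method="linear")

print("Observed variance:     ", scores.var())       # NaNs are skipped
print("Mean-imputed variance: ", mean_filled.var())  # smaller: imputed points add no spread
print("Interpolated values:   ", interpolated.tolist())
```

Mean imputation leaves the sum of squared deviations unchanged while increasing the sample size, which is why the variance estimate drops.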

## Q8

Post-hoc tests are used in the context of analysis of variance (ANOVA) to make pairwise comparisons between group means when the ANOVA reveals a significant overall difference among groups.

There are several common post-hoc tests, and the choice of which one to use depends on factors like the number of groups, sample size, and assumptions about the data.

1. Tukey's Honestly Significant Difference (Tukey HSD):
    1. Use: Tukey's HSD is a widely used post-hoc test that is appropriate when you have three or more groups and you want to control the familywise error rate. It provides simultaneous confidence intervals for all pairwise group comparisons.
    2. Example: In a one-way ANOVA comparing the test scores of students from different schools, if the ANOVA indicates a significant overall difference, you could use Tukey's HSD to determine which schools have significantly different mean scores.
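A minimal sketch of the school example, using `pairwise_tukeyhsd` from statsmodels. The scores reuse the three small groups from Q4, and the school labels are hypothetical names added for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Scores from the three Q4 groups, relabelled as hypothetical schools
scores = np.array([12, 15, 18, 10, 14,
                   9, 8, 11, 6, 10,
                   17, 20, 22, 15, 19])
schools = np.repeat(["school1", "school2", "school3"], 5)

# All pairwise comparisons with the familywise error rate held at 5%
result = pairwise_tukeyhsd(endog=scores, groups=schools, alpha=0.05)
print(result)

# result.reject holds one True/False per pair, in the order printed above
print(result.reject)
```

Each row of the printed table gives the mean difference for one pair, its adjusted p-value, and a simultaneous confidence interval. A pair is flagged in the `reject` column when that interval excludes zero.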

## Q9

In [15]:
import scipy.stats as stat
import numpy as np

In [17]:
dietA = np.random.normal(2.0,0.5,50)
dietB = np.random.normal(1.8,1.2,50)
dietC = np.random.normal(2.23,0.98,50)

In [20]:
f_statistics , p_value = stat.f_oneway(dietA,dietB,dietC)
print("F statistics :  ",f_statistics)
print("P value : ",p_value)

F statistics :   0.6382367320045343
P value :  0.5296803286827185


1. If the p-value is less than your chosen significance level (e.g., 0.05), you would reject the null hypothesis.
2. If the p-value is greater than or equal to your chosen significance level, you would fail to reject the null hypothesis.


In [22]:
if p_value < 0.05:
    print("Reject the Null Hypothesis")
else:
    print("Fail to reject the Null Hypothesis")


Fail to reject the Null Hypothesis
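The p-value alone does not say how large the group differences are. One common companion measure is eta squared, the share of total variance explained by group membership. The sketch below computes it for simulated diet data drawn with the same parameters as above. The fixed seed and variable names are assumptions, so the numbers will not match the run shown earlier.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Simulated weight loss for three diets (same means and SDs as above)
diet_a = rng.normal(2.0, 0.5, 50)
diet_b = rng.normal(1.8, 1.2, 50)
diet_c = rng.normal(2.23, 0.98, 50)
groups = [diet_a, diet_b, diet_c]

f_stat, p_value = stats.f_oneway(*groups)

# Eta squared = between-group sum of squares / total sum of squares
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print("F statistic:", f_stat)
print("p-value:    ", p_value)
print("eta squared:", eta_squared)
```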


## Q10

In [24]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import random

In [25]:
data = {
    "Time" : [],
    "Program" : [],
    "Experience" : []
}

for i in range(30):
    
    time = random.uniform(10,60)
    program = random.choice(["A","B","C"])
    experience = random.choice(["Novice","Experience"])
    
    data["Time"].append(time)
    data["Program"].append(program)
    data["Experience"].append(experience)
    

df = pd.DataFrame(data)
df.head()


        Time Program  Experience
0  40.892854       A      Novice
1  57.171282       B      Novice
2  33.531470       A      Novice
3  21.767108       A  Experience
4  25.137024       A      Novice


In [26]:
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()

In [27]:
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                         sum_sq    df         F    PR(>F)
Program              201.056588   2.0  0.489689  0.618807
Experience           308.430946   1.0  1.502415  0.232192
Program:Experience   296.571257   2.0  0.722322  0.495882
Residual            4926.962278  24.0       NaN       NaN


1. If the p-value for the main effect of "Program" is less than your chosen significance level (e.g., 0.05), you would conclude that there is a significant main effect of software program on task completion time.

2. If the p-value for the main effect of "Experience" is less than your chosen significance level (e.g., 0.05), you would conclude that there is a significant main effect of employee experience level on task completion time.

3. If the p-value for the interaction effect (Program:Experience) is less than your chosen significance level (e.g., 0.05), you would conclude that there is a significant interaction effect between software program and employee experience level, indicating that the effect of one variable depends on the other.

In this run, all three p-values (0.619 for Program, 0.232 for Experience, and 0.496 for the interaction) are above 0.05, so neither main effect nor the interaction is significant. This is expected, because the completion times were generated at random, independently of program and experience.



## Q11

In [31]:
import numpy as np
import scipy.stats as stat

In [30]:
control_group = np.random.normal(80,10,100)
experimental_group = np.random.normal(85,10,100)

In [32]:
t_statistics, p_value = stat.ttest_ind(control_group, experimental_group)
print(t_statistics)
print(p_value)

-2.3311282991152447
0.020752776974165534


Interpret the results:

1. If the p-value is less than your chosen significance level (e.g., 0.05), you would conclude that there are statistically significant differences in test scores between the control and experimental groups.

2. If the results are significant (i.e., you reject the null hypothesis), you can proceed with a post-hoc test (e.g., Tukey's HSD) to determine which group(s) differ significantly from each other. However, please note that post-hoc tests like Tukey's HSD are typically used in the context of ANOVA, not t-tests. If you have more than two groups to compare, it would be appropriate to use ANOVA followed by Tukey's HSD for pairwise comparisons.

The question does not specify a significance level, so a significance level (alpha) of 0.05 is assumed.

In [33]:
if p_value < 0.05:
    print("Reject the Null Hypothesis")
else:
    print("Fail to reject the Null Hypothesis")

Reject the Null Hypothesis
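With only two groups, a post-hoc test adds nothing: there is a single pairwise comparison, and a one-way ANOVA on two groups is equivalent to the pooled-variance t-test (F equals t squared, with identical p-values). The sketch below illustrates this on seeded, simulated scores, so the numbers differ from the run above. The seed and variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # fixed seed; the scores are simulated

control = rng.normal(80, 10, 100)
experimental = rng.normal(85, 10, 100)

# Two-sample t-test (equal variances assumed, SciPy's default)
t_stat, t_p = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

print("t squared:", t_stat ** 2)
print("F:        ", f_stat)
print("p (t-test):", t_p)
print("p (ANOVA): ", f_p)
```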


## Q12

In [62]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [48]:
days = np.arange(1,31)

sales_A = np.random.randint(200,400,30)
sales_B = np.random.randint(180,380,30)
sales_C = np.random.randint(220,420,30)

data1 = {
    "Days" : np.tile(days,3),
    "Stores" : ["A"] *30 + ["B"]* 30 + ["C"] *30,
    "Sales" : np.concatenate([sales_A, sales_B, sales_C])
}

df = pd.DataFrame(data1)
df.head()

   Days Stores  Sales
0     1      A    239
1     2      A    281
2     3      A    310
3     4      A    252
4     5      A    223


In [57]:
f_statistic, p_value = stats.f_oneway(
    df['Sales'][df['Stores'] == 'A'],
    df['Sales'][df['Stores'] == 'B'],
    df['Sales'][df['Stores'] == 'C']
)


In [59]:
print("F-statistic:", f_statistic)
print("p-value:", p_value)

F-statistic: 4.4582055983411895
p-value: 0.01434648245968254


Interpret the results:

1. If the p-value is less than your chosen significance level (e.g., 0.05), you would conclude that there are statistically significant differences in average daily sales between the three stores.
2. If the results are significant (i.e., you reject the null hypothesis), you can follow up with a post-hoc test, such as Tukey's HSD, to determine which store(s) differ significantly from each other.

In [60]:
if p_value < 0.05:
    print("Reject the Null Hypothesis")
else:
    print("Fail to reject the Null Hypothesis")

Reject the Null Hypothesis


We can conclude that there are statistically significant differences in average daily sales between the three stores.

In [64]:

posthoc = pairwise_tukeyhsd(df['Sales'], df['Stores'], alpha=0.05)

print(posthoc)


 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B    -15.0  0.556 -49.5104 19.5104  False
     A      C     27.6 0.1428  -6.9104 62.1104  False
     B      C     42.6 0.0115   8.0896 77.1104   True
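-----------------------------------------------------

Only the B vs. C comparison is significant (p-adj = 0.0115): store C's mean daily sales exceed store B's by 42.6, and the confidence interval (8.09 to 77.11) excludes zero. The A vs. B and A vs. C differences are not statistically significant at the 0.05 level.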
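Because every store is observed on the same 30 days, an alternative is to treat day as the repeated unit and run a repeated measures ANOVA, which removes day-to-day variation from the error term. The sketch below uses `AnovaRM` from statsmodels on seeded, simulated sales. The seed and the choice to treat `Days` as the subject are assumptions, not part of the analysis above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(3)  # fixed seed; the sales figures are simulated

days = np.arange(1, 31)
df = pd.DataFrame({
    "Days": np.tile(days, 3),
    "Stores": ["A"] * 30 + ["B"] * 30 + ["C"] * 30,
    "Sales": np.concatenate([
        rng.integers(200, 400, 30),
        rng.integers(180, 380, 30),
        rng.integers(220, 420, 30),
    ]),
})

# Each day is a "subject" measured once under every store (within factor)
rm_result = AnovaRM(data=df, depvar="Sales", subject="Days", within=["Stores"]).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

`AnovaRM` requires exactly one observation per subject and condition, which the balanced layout above satisfies.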
