## F-Test:

#### Purpose: 

Compares variances of two or more <b>populations</b>.

#### Test statistic: 

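The F statistic is a ratio of two variance estimates. When two populations are compared, it is the ratio of the two sample variances; in ANOVA, it is the ratio of the variance between groups to the variance within groups.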

#### Null Hypothesis(H0):

The null hypothesis for the F-test depends on the specific application. When comparing variances, it states that the variances of the populations being compared are equal.

#### Alternative Hypothesis(H1):

The alternative hypothesis typically states that the variances of the populations being compared are not equal.

#### Assumptions:

1. The populations being compared should be approximately normally distributed.

2. Each observation should be independent of any other observation.

#### What Output represents:

The output indicates whether there is a significant difference in variances between the populations.
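
As a minimal sketch of the variance comparison described above, the F statistic for two samples is the ratio of their sample variances. The two samples are illustrative values only, and the helper name `variance_ratio` is hypothetical. It uses only the standard library and stops at the statistic and its degrees of freedom.

```python
from statistics import variance

def variance_ratio(sample_1, sample_2):
    """Return (F, dfn, dfd) with the larger sample variance in the numerator."""
    var_1, var_2 = variance(sample_1), variance(sample_2)  # unbiased (n - 1) variances
    if var_1 >= var_2:
        return var_1 / var_2, len(sample_1) - 1, len(sample_2) - 1
    return var_2 / var_1, len(sample_2) - 1, len(sample_1) - 1

# Illustrative values only, not measurements from a real study.
sample_1 = [80, 75, 90, 85]
sample_2 = [95, 92, 88, 90]

f_ratio, dfn, dfd = variance_ratio(sample_1, sample_2)
print(f"F = {f_ratio:.3f} with ({dfn}, {dfd}) degrees of freedom")
```

A p-value would then come from the upper tail of the F distribution with `(dfn, dfd)` degrees of freedom, for example `scipy.stats.f.sf(f_ratio, dfn, dfd)`, doubled (and capped at 1) for a two-sided test.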

### ANOVA (Analysis of Variance):

#### Purpose:

ANOVA is used to test whether the means of three or more independent groups differ by a statistically significant amount.

#### Test Statistic: 

F statistic.
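
For reference, the one-way ANOVA F statistic can be written as below. The symbols are notation chosen for this sketch, not taken from the notes: $k$ is the number of groups, $n_i$ and $\bar{x}_i$ are the size and mean of group $i$, $\bar{x}$ is the grand mean, and $N$ is the total number of observations.

```latex
F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\sum_{i=1}^{k} n_i \, (\bar{x}_i - \bar{x})^2 \,/\, (k - 1)}
         {\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N - k)}
```

Under H0 this ratio follows an F distribution with $(k - 1,\, N - k)$ degrees of freedom.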

#### Null Hypothesis (H0):

There is no significant difference among means of groups being compared.

#### Alternate hypothesis (H1):

At least one group mean differs significantly from the others.

#### Assumptions:

Independence of observations, normality within each group, and approximately equal variances across groups.

#### Output:

ANOVA indicates whether there are significant differences among the group means. It does not identify which groups differ.

##### Group -> 3 Groups A, B, C
##### Scores

In [3]:
import pandas as pd

In [4]:
data = {
    'Group':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C',],
    'Scores':[80, 75, 90, 85, 95, 92, 88, 90, 78, 82, 75, 80]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Group,Scores
0,A,80
1,A,75
2,A,90
3,A,85
4,B,95
5,B,92
6,B,88
7,B,90
8,C,78
9,C,82
10,C,75
11,C,80


In [5]:
(df[df['Group']=='A']['Scores']).mean()

82.5

In [6]:
(df[df['Group']=='B']['Scores']).mean()

91.25

In [7]:
(df[df['Group']=='C']['Scores']).mean()

78.75

Null (H0) -> The three groups have equal means.

Alternate (H1) -> At least one group has a different mean.

In [8]:
import scipy.stats as st

In [9]:
_, p_val = st.f_oneway(df[df['Group']=='A']['Scores'],
                   df[df['Group']=='B']['Scores'],
                   df[df['Group']=='C']['Scores'])

print('P-value:', p_val)

P-value: 0.009062918473029835


In [10]:
if p_val > 0.05:
    print('Fail to reject H0: no evidence that the group means differ.')
else:
    print('Reject H0: at least one group mean differs.')

Reject H0: at least one group mean differs.
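
The `f_oneway` result above can be reproduced by hand, which shows what the F statistic measures. This is a sketch using only the standard library, and the helper name `one_way_f` is hypothetical. The closed-form p-value is a shortcut that holds only when there are exactly three groups (2 numerator degrees of freedom); it is not a general replacement for the F distribution.

```python
from statistics import mean

def one_way_f(*groups):
    """Return (F, dfn, dfd) for a one-way ANOVA over the given samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    dfn = len(groups) - 1
    dfd = len(all_values) - len(groups)
    return (ss_between / dfn) / (ss_within / dfd), dfn, dfd

group_a = [80, 75, 90, 85]
group_b = [95, 92, 88, 90]
group_c = [78, 82, 75, 80]

f_stat, dfn, dfd = one_way_f(group_a, group_b, group_c)

# With exactly 2 numerator degrees of freedom (three groups), the upper-tail
# probability of the F distribution is (1 + 2F/dfd) ** (-dfd / 2).
p_value = (1 + 2 * f_stat / dfd) ** (-dfd / 2)
print(f"F = {f_stat:.4f}, p = {p_value:.6f}")
```

The p-value computed this way agrees with the one reported by `st.f_oneway` for the same three groups.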


###### A botanist investigates the effects of three different fertilizers (A, B, and C) on plant height. They randomly select 10 plants for each fertilizer and measure their final height after 6 weeks. Use one-way ANOVA (`f_oneway`) to test whether the mean height differs between fertilizers:

Null (H0): The mean heights of plants are equal across all fertilizer groups.

Alternate (H1): The mean height of plants differs for at least one fertilizer group.

In [9]:
heights_a = list(range(10, 101, 10))
heights_b = list(range(20, 201, 20))
heights_c = list(range(30, 301, 30))

heights_a

[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

In [10]:
heights_b

[20, 40, 60, 80, 100, 120, 140, 160, 180, 200]

In [11]:
heights_c

[30, 60, 90, 120, 150, 180, 210, 240, 270, 300]

In [12]:
len(heights_a), len(heights_b), len(heights_c)

(10, 10, 10)

In [13]:
f_stat, p_value = st.f_oneway(heights_a, heights_b, heights_c)

print(f"p-value: {p_value:.4f}")

p-value: 0.0034


In [14]:
f_stat

7.071428571428572

Since the p-value (0.0034) is below 0.05, we reject the null hypothesis: the mean plant height differs for at least one fertilizer group.

##### Question:

Does the type of fertilizer or the watering level significantly influence the heights of plants?

In [15]:
data = {
    'Fertilizer': ['Organic', 'Organic', 'Organic', 'Organic', 'InOrganic', 'InOrganic', 'InOrganic', 'InOrganic'],
    'Watering': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
    'Height': [20, 22, 21, 25, 28, 24, 18, 19] 
}

df = pd.DataFrame(data)
df

Unnamed: 0,Fertilizer,Watering,Height
0,Organic,Low,20
1,Organic,High,22
2,Organic,Low,21
3,Organic,High,25
4,InOrganic,Low,28
5,InOrganic,High,24
6,InOrganic,Low,18
7,InOrganic,High,19


For Fertilizer:
    
Null Hypothesis (H0): There is no significant difference in the mean height of plants treated with organic and inorganic fertilizers.

Alternate Hypothesis (H1): There is a significant difference in the mean height of plants treated with organic and inorganic fertilizers.

In [16]:
(df[df['Fertilizer']=='Organic']['Height']).mean()

22.0

In [17]:
(df[df['Fertilizer']=='InOrganic']['Height']).mean()

22.25

For watering:

Null Hypothesis: There is no significant difference in the mean height of plants with low and high watering levels.

Alternate Hypothesis: There is a significant difference in the mean height of plants with low and high watering levels.

In [18]:
(df[df['Watering']=='Low']['Height']).mean()

21.75

In [19]:
(df[df['Watering']=='High']['Height']).mean()

22.5

In [20]:
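# Run a separate one-way ANOVA for each factor. This is not a full two-way ANOVA:
# it ignores any interaction between fertilizer and watering.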

_, p_val_fert = st.f_oneway(df[df['Fertilizer']=='Organic']['Height'],
                           df[df['Fertilizer']=='InOrganic']['Height'])

_, p_val_watering = st.f_oneway(df[df['Watering']=='Low']['Height'],
                               df[df['Watering']=='High']['Height'])

In [21]:
print("P-value for fertilizer:", p_val_fert)
print("P-value for watering:", p_val_watering)

P-value for fertilizer: 0.9254362527404735
P-value for watering: 0.778192439842158


Both p-values are far above 0.05, so we fail to reject either null hypothesis: neither the type of fertilizer nor the watering level shows a significant effect on plant height in this analysis.
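
The analysis above runs one one-way ANOVA per factor. As a sketch of what a two-way ANOVA with an interaction term computes on the same eight observations, the following uses only the standard library. It assumes a balanced design (the same number of plants in every fertilizer/watering cell), and the function name `two_way_f` is hypothetical.

```python
from itertools import product
from statistics import mean

# Same observations as the fertilizer/watering table above.
rows = [
    ("Organic", "Low", 20), ("Organic", "High", 22),
    ("Organic", "Low", 21), ("Organic", "High", 25),
    ("InOrganic", "Low", 28), ("InOrganic", "High", 24),
    ("InOrganic", "Low", 18), ("InOrganic", "High", 19),
]

def two_way_f(rows):
    """F statistics for a balanced two-factor design with replication."""
    grand = mean(h for _, _, h in rows)
    a_levels = sorted({a for a, _, _ in rows})
    b_levels = sorted({b for _, b, _ in rows})

    def ss_main(index, levels):
        # Spread of the level means around the grand mean, weighted by level size.
        total = 0.0
        for level in levels:
            vals = [r[2] for r in rows if r[index] == level]
            total += len(vals) * (mean(vals) - grand) ** 2
        return total

    ss_a, ss_b = ss_main(0, a_levels), ss_main(1, b_levels)

    ss_cells = ss_error = 0.0
    for a, b in product(a_levels, b_levels):
        cell = [h for fa, fb, h in rows if fa == a and fb == b]
        cell_mean = mean(cell)
        ss_cells += len(cell) * (cell_mean - grand) ** 2
        ss_error += sum((h - cell_mean) ** 2 for h in cell)
    ss_ab = ss_cells - ss_a - ss_b  # interaction: what the main effects do not explain

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_error = len(rows) - len(a_levels) * len(b_levels)
    ms_error = ss_error / df_error
    return {
        "fertilizer": (ss_a / df_a) / ms_error,
        "watering": (ss_b / df_b) / ms_error,
        "interaction": (ss_ab / (df_a * df_b)) / ms_error,
    }

f_stats = two_way_f(rows)
print(f_stats)
```

Each statistic here has (1, 4) degrees of freedom, so a p-value could be obtained with, for example, `scipy.stats.f.sf(F, 1, 4)`. For this data all three F statistics are below 1, which agrees with the conclusion above that neither factor has a significant effect.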

### Chi-square Test:

#### Purpose:

The chi-square test is used to determine whether there is a significant association between two categorical variables.

#### Test Statistic: 

The chi-square statistic measures the discrepancy between the observed and expected frequencies in the data.
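
For a contingency table with $r$ rows and $c$ columns, the statistic can be written as below. The symbols are notation chosen for this sketch: $O_{ij}$ is the observed count in cell $(i, j)$, $E_{ij}$ is the count expected if the two variables were independent, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```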

#### Null Hypothesis:

There is no significant association between the categorical variables being analyzed.

#### Alternate Hypothesis:

There is a significant association between the categorical variables being analyzed.

#### Output:

The chi-square test indicates whether there is a significant association between the categorical variables being analyzed.

#### A larger chi-square value indicates a greater difference between the observed and expected frequencies.

In [22]:
from scipy.stats import chi2_contingency

data = [[50, 40, 20],
       [40, 50, 30]]

df = pd.DataFrame(data, columns=["No Parent", "One Parent", "Two Parents"],
                 index=["Graduated", "Not graduated"])

df

Unnamed: 0,No Parent,One Parent,Two Parents
Graduated,50,40,20
Not graduated,40,50,30


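##### Degrees of freedom: dof = (r - 1) * (c - 1), where r and c are the numbers of rows and columns. Here dof = (2 - 1) * (3 - 1) = 2.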

In [24]:
chi2, p_val, dof, exp_freq = chi2_contingency(df)

print("Chi square stat:", chi2)
print("Degrees of freedom:", dof)
print("P-value:", p_val)

#Significance level
alpha = 0.05

if p_val <= alpha:
    print("Reject null hypothesis: family structure is associated with graduation status")
else:
    print("Fail to reject null hypothesis: no evidence of association between family structure and graduation status")

Chi square stat: 3.7946127946127968
Degrees of freedom: 2
P-value: 0.14997204074297982
Fail to reject null hypothesis: no evidence of association between family structure and graduation status
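
The numbers reported by `chi2_contingency` for this table can be reproduced by hand. The sketch below uses only the standard library, and the helper names `chi_square` and `chi2_sf_even_dof` are hypothetical. The p-value helper relies on a closed form that holds only for an even number of degrees of freedom, which is enough for this table (dof = 2) but is not a general implementation.

```python
from math import exp, factorial

def chi_square(observed):
    """Return (chi2, dof, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variable; valid only for even dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("the closed form used here needs a positive even dof")
    half = x / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(dof // 2))

observed = [[50, 40, 20],   # Graduated
            [40, 50, 30]]   # Not graduated

chi2, dof, expected = chi_square(observed)
p_val = chi2_sf_even_dof(chi2, dof)
print("Chi square stat:", chi2)
print("Degrees of freedom:", dof)
print("P-value:", p_val)
```

One difference to keep in mind: for 2x2 tables (dof = 1), `chi2_contingency` applies Yates' continuity correction by default, which this sketch does not attempt.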


##### A marketing researcher is studying the preferences of consumers regarding three different flavors of a new energy drink: Lemon-Lime, Berry Blast, and Tropical Punch. The researcher randomly selects 300 consumers and asks them to taste all three flavors of the energy drink. Each consumer is asked to rate their preference for each flavor as either "Like", "Neutral", or "Dislike". The researcher records the following data:

Lemon-Lime: Like - 80, Neutral - 60, Dislike - 20

Berry Blast: Like - 70, Neutral - 50, Dislike - 30

Tropical Punch: Like - 50, Neutral - 40, Dislike - 60

Null Hypothesis: There is no significant association between consumers' preference and the flavor of the drink.

Alternate Hypothesis: There is a significant association between consumers' preference and the flavor of the drink.

In [25]:
observed = [[80, 60, 20],
           [70, 50, 30],
           [50, 40, 60]]

df = pd.DataFrame(observed, columns=["Like", "Neutral", "Dislike"],
                 index=["Lemon", "Berry", "Tropical"])

df

Unnamed: 0,Like,Neutral,Dislike
Lemon,80,60,20
Berry,70,50,30
Tropical,50,40,60


In [26]:
chi2, p, dof, exp = chi2_contingency(df)

In [27]:
print(chi2)
print('P-value:',p)

34.1979797979798
P-value: 6.786680097378574e-07


In [28]:
dof

4

The large chi-square value (about 34.2 with 4 degrees of freedom) reflects a substantial difference between the observed and expected frequencies.
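
To see where that difference comes from, each cell's contribution `(observed - expected)**2 / expected` can be listed separately. This is a sketch using only the standard library, and the dictionary layout and variable names are choices made for the example.

```python
observed = {
    "Lemon":    {"Like": 80, "Neutral": 60, "Dislike": 20},
    "Berry":    {"Like": 70, "Neutral": 50, "Dislike": 30},
    "Tropical": {"Like": 50, "Neutral": 40, "Dislike": 60},
}

ratings = ["Like", "Neutral", "Dislike"]
row_totals = {flavor: sum(counts.values()) for flavor, counts in observed.items()}
col_totals = {r: sum(observed[f][r] for f in observed) for r in ratings}
n = sum(row_totals.values())

# Contribution of each cell to the chi-square statistic: (O - E)^2 / E.
contributions = {}
for flavor in observed:
    for rating in ratings:
        expected = row_totals[flavor] * col_totals[rating] / n
        contributions[(flavor, rating)] = (observed[flavor][rating] - expected) ** 2 / expected

total = sum(contributions.values())
largest = max(contributions, key=contributions.get)
for cell, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(cell, round(value, 2))
print("chi-square:", round(total, 2), "largest cell:", largest)
```

The contributions sum to the chi-square statistic reported above. In this table the Tropical/Dislike cell contributes the most, followed by Lemon/Dislike.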

##### The p-value is far below 0.05, so we reject the null hypothesis and conclude that there is a significant association between the flavor of the energy drink and consumers' preference.