# **How do we get the 1.96 critical value for the rejection region of a two-tailed test under the normal distribution?**

In [None]:
from scipy.stats import norm

# Defining significance level (alpha)
# alpha = 1 - CI = 1 - 0.95 = 0.05
alpha = 0.05 # (5%)

# Two-tailed test, so we need to split the alpha values in half for both the tails
# alpha / 2 = 0.025
tail_area = alpha / 2

# Finding the critical z-score using the inverse cumulative distribution function (cdf)
z = norm.ppf(1 - tail_area)

# Print the critical z-score
print("Critical z-score for two-tailed test with alpha=0.05:", z)

# Explain norm.ppf
print("\n\n\nExplanation of norm.ppf:")
print("The norm.ppf function in scipy.stats stands for 'percent point function' of the standard normal distribution.")
print("It takes a probability value between 0 and 1 as input and returns the corresponding z-score such that the area less \nthan that z-score under the standard normal curve is equal to the input probability.\n")
print("In this case, we provide the probability (1 - tail_area) = 0.975, which is the area to the left of the upper critical value (everything except the upper rejection tail).")
print("The norm.ppf function then finds the z-score that cuts off the upper tail of area 0.025; by symmetry the lower critical value is its negative, so the rejection region is |z| > 1.96.")

Critical z-score for two-tailed test with alpha=0.05: 1.959963984540054



Explanation of norm.ppf:
The norm.ppf function in scipy.stats stands for 'percent point function' of the standard normal distribution.
It takes a probability value between 0 and 1 as input and returns the corresponding z-score such that the area less 
than that z-score under the standard normal curve is equal to the input probability.

In this case, we provide the probability (1 - tail_area) = 0.975, which is the area to the left of the upper critical value (everything except the upper rejection tail).
The norm.ppf function then finds the z-score that cuts off the upper tail of area 0.025; by symmetry the lower critical value is its negative, so the rejection region is |z| > 1.96.


**More about ppf and its uses:**

**ppf stands for Percent Point Function, also known as the inverse cumulative distribution function (CDF) or quantile function. It's a crucial tool in statistics and probability for translating between probabilities and values within a specific distribution.**

Calculation:

**Imagine you have a probability distribution, like the standard normal distribution (bell curve). The CDF tells you the probability that a randomly drawn value from that distribution will be less than a specific value (x). The ppf does the opposite: it takes a probability value and tells you the value (x) on the distribution that has that probability or less accumulated below it.**
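The inverse relationship between the CDF and the ppf can be checked with a short round trip. The sketch below uses `statistics.NormalDist` from the Python standard library (an assumption made for illustration; the notebook itself uses `scipy.stats.norm`), where `inv_cdf` plays the role of `ppf`.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# ppf direction: probability -> value on the distribution
prob = 0.975
z = std_normal.inv_cdf(prob)

# CDF direction: value -> accumulated probability (should give back prob)
round_trip = std_normal.cdf(z)

print(f"inv_cdf({prob}) = {z:.6f}")
print(f"cdf({z:.6f}) = {round_trip:.6f}")
```

Feeding the result of one function into the other returns the starting value, which is what "inverse" means here.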

<hr>

# **Question for the two-tailed t-test: We work at a biscuit company where each biscuit weighs 10 g on average. One day an employee notices a difference in the weights of a few biscuits and reports it to the quality assurance (QA) department. QA takes a sample and wants to find out whether the average weight in the sample is the same as that of the population.**

* **Null Hypothesis: The average weight is 10 grams**
* **Alternate Hypothesis: The average weight is not 10 grams**

In [None]:
from scipy.stats import ttest_1samp

In [None]:
import numpy as np
from scipy import stats

# Sample size
n = 30

# Population mean (assumed)
mu = 10

# Standard deviation (assumed)
sigma = 5

# Sample data: n weights drawn from a normal distribution
sample = np.random.normal(mu, sigma, n)

In [None]:
sample

array([10.18489778, 12.76180926, 10.41377501, 12.51421621,  6.11613099,
       11.08952291, 15.24909942, 11.87095279,  8.34116407,  5.0742481 ,
        8.09122169,  6.88460186,  9.77654964, 16.95942574, 20.44414112,
        6.75132732,  8.48149601, -3.00523945, 10.50918996,  9.22891319,
        8.61154847, 12.20551767,  8.25489474,  9.65202716, 12.37302968,
       10.3003658 ,  8.20054544,  7.88715092, 17.66276069,  7.11669945])

In [None]:
# Hypothesis test
# Null hypothesis: The average weight is the same as the population mean (mu)
# Alternative hypothesis: The average weight is different from the population mean (mu) (two-tailed)

# Perform two-tailed one-sample t-test
# (ttest_1samp compares one sample against a fixed mean; ttest_ind is meant for two independent samples)
t_statistic, p_value = stats.ttest_1samp(sample, mu, alternative='two-sided')

# Significance level
alpha = 0.05

# Decision
if p_value < alpha:
    print("Reject the null hypothesis. The average weight of the sample is statistically different from the population mean.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight of the sample is different from the population mean.")


print('\n-----------------------------\nHere are the parameter results', end = " ")
print(f'\n=======\nPopulation Mean: {mu},\n=======\nStandard Deviation: {sigma},\n=======\nSample Mean: {sample.mean()},\n=======\nT - Test Value: {t_statistic},\n=======\nP_value: {p_value},\n=======\nLevel of Significance: {alpha}')


Fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight of the sample is different from the population mean.

-----------------------------
Here are the parameter results 
Population Mean: 10,
Standard Deviation: 5,
Sample Mean: 10.000066121134088,
T - Test Value: 1.5177137607236951e-05,
P_value: 0.9999879943202007,
Level of Significance: 0.05
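The one-sample t statistic can also be worked out by hand as t = (x̄ − μ) / (s / √n). The sketch below applies this to the 30 weights printed above, using only the standard library; the critical value `t_crit` is a t-table value for df = 29 and is an assumption brought in for the comparison.

```python
import math
import statistics

# Sample printed above (30 biscuit weights, in grams)
sample = [
    10.18489778, 12.76180926, 10.41377501, 12.51421621, 6.11613099,
    11.08952291, 15.24909942, 11.87095279, 8.34116407, 5.0742481,
    8.09122169, 6.88460186, 9.77654964, 16.95942574, 20.44414112,
    6.75132732, 8.48149601, -3.00523945, 10.50918996, 9.22891319,
    8.61154847, 12.20551767, 8.25489474, 9.65202716, 12.37302968,
    10.3003658, 8.20054544, 7.88715092, 17.66276069, 7.11669945,
]
mu = 10  # hypothesized population mean

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)   # sample standard deviation (n - 1 in the denominator)
std_error = sample_sd / math.sqrt(n)

t_stat = (sample_mean - mu) / std_error

# Two-tailed critical value for alpha = 0.05 and df = n - 1 = 29, taken from a t-table
t_crit = 2.045

print("Sample mean:", sample_mean)
print("t statistic:", t_stat)
print("Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0")
```

For this sample the hand calculation gives a t statistic of roughly 8.45e-05. The 1.5e-05 shown in the output above appears to match the pooled two-sample formula, in which `mu` is treated as a second sample with a single observation. Either way the statistic is far inside ±2.045, so the decision is the same.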


# **Working with One tailed t-test (Greater side)**

In [None]:
import numpy as np
from scipy import stats

# Sample size
n = 100

# Population mean (assumed)
mu = 10

# Standard deviation (assumed)
sigma = 5

# Sample data
sample = np.random.normal(mu, sigma, n)

# Hypothesis test
# Null hypothesis: The average weight of the sample is equal to the population mean (mu)
# Alternative hypothesis: The average weight of the sample is greater than the population mean (mu) (one-tailed)

# Perform one-tailed t-test
t_statistic, p_value = stats.ttest_1samp(sample, mu, alternative='greater')

# Significance level
alpha = 0.05

# Decision
if p_value < alpha:
    print("Reject the null hypothesis. The average weight of the sample is statistically greater than the population mean.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight of the sample is greater than the population mean.")

# Alternatively, for a one-tailed test less than mu:
# t_statistic, p_value = stats.ttest_1samp(sample, mu, alternative='less')

print('\n-----------------------------\nHere are the parameter results', end = " ")
print(f'\n=======\nPopulation Mean: {mu},\n=======\nStandard Deviation: {sigma},\n=======\nSample Mean: {sample.mean()},\n=======\nT - Test Value: {t_statistic},\n=======\nP_value: {p_value},\n=======\nLevel of Significance: {alpha}')


Fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight of the sample is greater than the population mean.

-----------------------------
Here are the parameter results 
Population Mean: 10,
Standard Deviation: 5,
Sample Mean: 9.980949554445436,
T - Test Value: -0.04005791646634133,
P_value: 0.5159361874813231,
Level of Significance: 0.05


# **Working with One tailed t-test (Less than side)**

In [None]:
import numpy as np
from scipy import stats

# Sample size
n = 100

# Population mean (assumed)
mu = 10

# Standard deviation (assumed)
sigma = 5

# Sample data
sample = np.random.normal(mu, sigma, n)

# Hypothesis test
# Null hypothesis: The average weight of the sample is equal to the population mean (mu)
# Alternative hypothesis: The average weight of the sample is less than the population mean (mu) (one-tailed)

# Perform one-tailed t-test
t_statistic, p_value = stats.ttest_1samp(sample, mu, alternative='less')

# Significance level
alpha = 0.05

# Decision
if p_value < alpha:
    print("Reject the null hypothesis. The average weight of the sample is statistically less than the population mean.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight of the sample is less than the population mean.")

# Alternatively, for a one-tailed test greater than mu:
# t_statistic, p_value = stats.ttest_1samp(sample, mu, alternative='greater')

print('\n-----------------------------\nHere are the parameter results', end = " ")
print(f'\n=======\nPopulation Mean: {mu},\n=======\nStandard Deviation: {sigma},\n=======\nSample Mean: {sample.mean()},\n=======\nT - Test Value: {t_statistic},\n=======\nP_value: {p_value},\n=======\nLevel of Significance: {alpha}')


Fail to reject the null hypothesis. There is not enough evidence to conclude that the average weight of the sample is less than the population mean.

-----------------------------
Here are the parameter results 
Population Mean: 10,
Standard Deviation: 5,
Sample Mean: 10.009782741805155,
T - Test Value: 0.017921643736574473,
P_value: 0.5071312841010813,
Level of Significance: 0.05
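In a one-tailed test the whole of alpha sits in a single tail, so the critical value is not the two-tailed 1.96. The sketch below uses the standard normal distribution as a large-sample stand-in for the t distribution (an assumption: with n = 100, the t distribution with 99 degrees of freedom is close to normal).

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# 'greater' alternative: reject when the statistic exceeds the upper critical value
z_upper = std_normal.inv_cdf(1 - alpha)

# 'less' alternative: reject when the statistic falls below the lower critical value
z_lower = std_normal.inv_cdf(alpha)

# Two-tailed test for comparison: alpha is split across both tails
z_two_sided = std_normal.inv_cdf(1 - alpha / 2)

print(f"Greater-side critical value: {z_upper:.4f}")
print(f"Less-side critical value:    {z_lower:.4f}")
print(f"Two-tailed critical value:   +/-{z_two_sided:.4f}")
```

The statistics in the two outputs above (about -0.04 and 0.018) lie well inside these bounds, which agrees with p-values close to 0.5.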


<hr>

* **Z-test: Used when the population standard deviation (σ) is known.**
* **T-test: Used when the population standard deviation (σ) is unknown and has to be estimated from the sample, whether the sample is small or large.**

<hr>

# **Analysing the Z-test Manually**

# **Scenario:**
**The average height of boys in the country is 175 cm. Let's say we want to evaluate the heights of 10 students taken at random from a school.**

**Hypotheses**
* **Null Hypothesis (H0): The average height of students in the sample class is the same as the country's average height (μ = 175 cm).**
* **Alternative Hypothesis (H1): The average height of students in the sample class is not the same as the country's average height (μ ≠ 175 cm).**

**Assumptions**
* The data is normally distributed
* SD(population) is known
* The sample should ideally be large (n >= 30); here n = 10, so the test relies on the normality assumption and the known SD

**Step 1: Calculating the Z-statistic**


In [None]:
import math
data = [172, 168, 175, 170, 173, 178, 169, 171, 174, 176]
# Note: sample_mean is set by hand to 170; the mean of `data` above is 172.6
pop_mean, sample_mean = 175, 170
SD = 5
sample_size = 10

In [None]:
Z_stats = (sample_mean - pop_mean) / (SD / (math.sqrt(sample_size)))

In [None]:
Z_stats

-3.1622776601683795

In [None]:
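# For a two-tailed test at alpha = 0.05 the critical z-values are -1.96 and +1.96.
# Z_stats = -3.16 lies below -1.96, i.e. inside the rejection region, so the null hypothesis is rejected.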
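The comparison against the critical value, together with the matching two-tailed p-value, can be sketched in a few lines. This uses `statistics.NormalDist` from the standard library (an assumption for illustration) and the same hand-set values as the manual calculation above.

```python
import math
from statistics import NormalDist

pop_mean, sample_mean = 175, 170   # values used in the manual calculation above
SD = 5                             # known population standard deviation
sample_size = 10
alpha = 0.05

z_stats = (sample_mean - pop_mean) / (SD / math.sqrt(sample_size))

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)     # two-tailed critical value
p_value = 2 * std_normal.cdf(-abs(z_stats))    # area in both tails beyond |z_stats|

print("Z statistic:", z_stats)
print("Critical value: +/-", z_crit)
print("p-value:", p_value)
print("Reject H0" if abs(z_stats) > z_crit else "Fail to reject H0")
```

The critical-value rule and the p-value rule always agree: |z| exceeds the critical value exactly when the p-value is below alpha.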

**Implementation**

In [None]:
from statsmodels.stats.weightstats import ztest

# Sample data
heights = [172, 168, 175, 170, 173, 178, 169, 171, 174, 176]


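# Hypothesized population mean under the null hypothesis
mu_0 = 175

# Calculate Z-statistic and p-value.
# ztest estimates the standard deviation from the sample itself, so the statistic
# differs from the manual calculation, which used SD = 5 and sample_mean = 170.
z_stat, pval = ztest(heights, value = mu_0, ddof = 0, alternative = 'two-sided')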

# Print results
print("Test statistic:", z_stat)
print("p-value:", pval)

if pval > 0.05:
    print("Fail to reject null hypothesis. No significant difference in average heights.")
else:
    print("Reject null hypothesis. Average heights differ significantly.")



Test statistic: -2.496751135729443
p-value: 0.012533688376410744
Reject null hypothesis. Average heights differ significantly.
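The statistic printed above can be reproduced by hand, which shows where it differs from the manual Z calculation. The sketch assumes that `ztest` with `ddof = 0` divides the sample variance (computed with n in the denominator) by n to get the squared standard error; the printed output is consistent with that reading.

```python
import math
import statistics
from statistics import NormalDist

heights = [172, 168, 175, 170, 173, 178, 169, 171, 174, 176]
mu_0 = 175

n = len(heights)
sample_mean = statistics.mean(heights)           # 172.6
variance_ddof0 = statistics.pvariance(heights)   # divides by n (ddof = 0)
std_error = math.sqrt(variance_ddof0 / n)

z_stat = (sample_mean - mu_0) / std_error
pval = 2 * NormalDist().cdf(-abs(z_stat))        # two-sided p-value

print("Test statistic:", z_stat)
print("p-value:", pval)
```

The manual calculation used a known SD of 5 and a sample mean of 170, while this version uses the mean (172.6) and spread of the ten heights themselves, hence the different statistic.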


**Chi-square test**

Null Hypo: Each face is expected to appear 16.67 times in 100 rolls (1/6 of the rolls), i.e. our dice is fair

Alternate: The observed frequencies differ from 16.67, i.e. our dice is unfair

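In [None]:
from scipy.stats import chisquare
from scipy.stats import chi2_contingency as chi  # used further down for the independence test

In [None]:
observed_val = [18, 14, 20, 12, 19, 17]

# Expected count per face for a fair dice: 100 rolls / 6 faces (about 16.67).
# The expected counts must add up to the same total as the observed counts.
predicted_val = [sum(observed_val) / 6] * 6

In [None]:
# Perform the chi-square goodness-of-fit test.
# chi2_contingency is meant for contingency tables: for a 1-D list it returns
# chi2 = 0, p = 1 and dof = 0, so chisquare is used here instead.
chi2, p_val = chisquare(observed_val, f_exp=predicted_val)
dof = len(observed_val) - 1

In [None]:
# results
print('Chi stats: ', round(chi2, 2))
print('P_value: ', round(p_val, 3))
print('Degree of Freedom: ', dof)

Chi stats:  2.84
P_value:  0.725
Degree of Freedom:  5


In [None]:
if p_val > 0.05:
  print('Fail to reject the Null Hypo, no evidence that our dice is unfair')
else:
  print('Reject the Null Hypo, our dice isnt fair')

Fail to reject the Null Hypo, no evidence that our dice is unfair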
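The goodness-of-fit statistic for the dice counts can be computed directly from its definition, χ² = Σ (O − E)² / E, with k − 1 degrees of freedom for k faces. This is a minimal sketch in plain Python; `chi2_crit` is a chi-square table value for alpha = 0.05 and 5 degrees of freedom, brought in as an assumption for the comparison.

```python
observed_val = [18, 14, 20, 12, 19, 17]

total_rolls = sum(observed_val)                  # 100 rolls
expected = total_rolls / len(observed_val)       # same expected count for every face

# Chi-square statistic: sum over faces of (observed - expected)^2 / expected
chi2_stat = sum((obs - expected) ** 2 / expected for obs in observed_val)
dof = len(observed_val) - 1

# Critical value for alpha = 0.05 and dof = 5, taken from a chi-square table
chi2_crit = 11.07

print("Chi-square statistic:", round(chi2_stat, 2))
print("Degrees of freedom:", dof)
print("Reject H0" if chi2_stat > chi2_crit else "Fail to reject H0")
```

The statistic is small compared with the critical value, so these counts give no evidence against a fair dice.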


**Chi-square test for independence**

In [None]:
observed_val = [[20, 10], [50, 5], [10, 25]]

# calculating the expected frequencies
rows_total = [sum(row) for row in observed_val]
col_total = [sum(col) for col in zip(*observed_val)]

expected = []
for i in range(len(observed_val)):
  expected.append([rows_total[i] * col / sum(col_total) for col in col_total])

In [None]:
expected

[[20.0, 10.0],
 [36.666666666666664, 18.333333333333332],
 [23.333333333333332, 11.666666666666666]]

In [None]:
observed_val = [[20, 10], [50, 5], [10, 25]]
# -------
expected = [[20.0, 10.0],
 [36.666666666666664, 18.333333333333332],
 [23.333333333333332, 11.666666666666666]]

In [None]:
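# chi2_contingency computes the expected frequencies itself, so only the observed table is passed
# (its second positional argument is `correction`, not the expected table)
chi2, p_val, dof, expected_freq = chi(observed_val)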


In [None]:
# results
print('Chi stats: ', chi2)
print('P_value: ', p_val)
print('Degree of Freedom: ', dof)

Chi stats:  37.40259740259741
P_value:  7.553168436167824e-09
Degree of Freedom:  2


In [None]:
if p_val > 0.05:
  print('There is no evidence of a relationship between the two columns')
else:
  print('There is evidence of a relationship between the two columns')

There is evidence of a relationship between the two columns
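The independence statistic can be checked by hand from the observed and expected tables, with (rows − 1) × (columns − 1) degrees of freedom. The sketch below is plain Python; it relies on the fact that for exactly 2 degrees of freedom the chi-square upper-tail probability is exp(−x / 2), so no statistics library is needed for the p-value.

```python
import math

observed_val = [[20, 10], [50, 5], [10, 25]]

rows_total = [sum(row) for row in observed_val]
col_total = [sum(col) for col in zip(*observed_val)]
grand_total = sum(rows_total)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_total] for r in rows_total]

chi2_stat = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed_val, expected)
    for obs, exp in zip(obs_row, exp_row)
)
dof = (len(rows_total) - 1) * (len(col_total) - 1)

# Closed form of the upper-tail probability, valid only for dof == 2
p_val = math.exp(-chi2_stat / 2)

print("Chi stats: ", chi2_stat)
print("P_value: ", p_val)
print("Degree of Freedom: ", dof)
```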


**One Way ANOVA**

**Student Group A, B, C**

In [None]:
import pandas as pd

data = {
    'Group': ['A', 'A', 'A', 'A', 'B','B','B', 'B', 'C', 'C', 'C', 'C'],
    'Scores':[80, 75, 90, 85, 95, 92, 88, 90, 78, 82, 75, 80]

}

df = pd.DataFrame(data)
df


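| | Group | Scores |
|---|---|---|
| 0 | A | 80 |
| 1 | A | 75 |
| 2 | A | 90 |
| 3 | A | 85 |
| 4 | B | 95 |
| 5 | B | 92 |
| 6 | B | 88 |
| 7 | B | 90 |
| 8 | C | 78 |
| 9 | C | 82 |
| 10 | C | 75 |
| 11 | C | 80 |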


H0 --> The three groups have equal means

H1 --> At least one group mean differs from the others

In [None]:
import scipy.stats as st
_, p_val = st.f_oneway(df[df['Group']=='A']['Scores'],
                       df[df['Group']=='B']['Scores'],
                       df[df['Group']=='C']['Scores'])


print('Pvalue: ', p_val)

Pvalue:  0.009062918473029835


In [None]:
if p_val > 0.05:
  print('Fail to reject H0: no evidence that the group means differ')
else:
  print('Reject H0: at least one group mean differs from the others')

Reject H0: at least one group mean differs from the others
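The F statistic behind the one-way ANOVA is the ratio of between-group to within-group variance. The sketch below computes it by hand for the three student groups using only the standard library. The p-value line relies on a closed form of the F upper-tail probability that holds only when the numerator has exactly 2 degrees of freedom, as is the case with three groups.

```python
import statistics

groups = {
    'A': [80, 75, 90, 85],
    'B': [95, 92, 88, 90],
    'C': [78, 82, 75, 80],
}

all_scores = [s for scores in groups.values() for s in scores]
grand_mean = statistics.mean(all_scores)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())

# Within-group sum of squares: spread of the scores around their own group mean
ss_within = sum((s - statistics.mean(g)) ** 2 for g in groups.values() for s in g)

df_between = len(groups) - 1                 # 2
df_within = len(all_scores) - len(groups)    # 9

f_stat = (ss_between / df_between) / (ss_within / df_within)

# Closed form of P(F > f_stat), valid only for df_between == 2
p_val = (1 + df_between * f_stat / df_within) ** (-df_within / 2)

print("F statistic:", f_stat)
print("Pvalue: ", p_val)
```

The p-value agrees with the one returned by `f_oneway` above.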


**Two Way Test**

In [None]:
data = {
    'Fertilizer': ['Organic', 'Organic', 'Organic', 'Organic', 'InOrganic', 'InOrganic', 'InOrganic', 'InOrganic'],
    'Watering':['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
    'Height': [20, 22, 21, 25, 28, 24, 18, 19]
}

In [None]:
df= pd.DataFrame(data)

In [None]:
df

| | Fertilizer | Watering | Height |
|---|---|---|---|
| 0 | Organic | Low | 20 |
| 1 | Organic | High | 22 |
| 2 | Organic | Low | 21 |
| 3 | Organic | High | 25 |
| 4 | InOrganic | Low | 28 |
| 5 | InOrganic | High | 24 |
| 6 | InOrganic | Low | 18 |
| 7 | InOrganic | High | 19 |


In [None]:
_, p_val1 = st.f_oneway(df[df['Fertilizer']=='Organic']['Height'],
                    df[df['Fertilizer']=='InOrganic']['Height'])

_, p_val2 = st.f_oneway(df[df['Watering']=='Low']['Height'],
                    df[df['Watering']=='High']['Height'])

In [None]:
p_val1, p_val2

(0.9254362527404735, 0.778192439842158)
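The two p-values above come from a separate one-way ANOVA on each factor. A two-way ANOVA instead splits the variation into the two main effects, their interaction, and an error term in a single table. The following is a minimal hand-computed sketch for this balanced 2 x 2 design (two plants per cell), using only the standard library; `f_crit` is an F-table value for (1, 4) degrees of freedom at alpha = 0.05 and is an assumption brought in for the comparison.

```python
from statistics import mean

# (fertilizer, watering, height) rows from the table above
rows = [
    ('Organic', 'Low', 20), ('Organic', 'High', 22),
    ('Organic', 'Low', 21), ('Organic', 'High', 25),
    ('InOrganic', 'Low', 28), ('InOrganic', 'High', 24),
    ('InOrganic', 'Low', 18), ('InOrganic', 'High', 19),
]

grand_mean = mean(h for _, _, h in rows)

def level_stats(index):
    """Return (mean height, count) for each level of the factor at position `index`."""
    levels = {}
    for row in rows:
        levels.setdefault(row[index], []).append(row[2])
    return [(mean(vals), len(vals)) for vals in levels.values()]

# Main-effect sums of squares
ss_fert = sum(n * (m - grand_mean) ** 2 for m, n in level_stats(0))
ss_water = sum(n * (m - grand_mean) ** 2 for m, n in level_stats(1))

# Cell (fertilizer x watering) means give the interaction and error terms
cells = {}
for fert, water, h in rows:
    cells.setdefault((fert, water), []).append(h)
ss_cells = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in cells.values())
ss_interaction = ss_cells - ss_fert - ss_water   # valid because the design is balanced
ss_error = sum((h - mean(v)) ** 2 for v in cells.values() for h in v)

df_error = len(rows) - len(cells)   # 8 observations - 4 cells = 4
ms_error = ss_error / df_error

# Every effect has 1 degree of freedom in a 2 x 2 design, so MS = SS
f_fert = ss_fert / ms_error
f_water = ss_water / ms_error
f_interaction = ss_interaction / ms_error

f_crit = 7.71  # F-table value for alpha = 0.05 with (1, 4) degrees of freedom

for name, f in [('Fertilizer', f_fert), ('Watering', f_water), ('Interaction', f_interaction)]:
    print(f"{name}: F = {f:.4f} ->", "significant" if f > f_crit else "not significant")
```

None of the three F values comes close to the critical value, which is in line with the large p-values reported above.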