## Statistics Case Study - Unit II

Index :

**Z-Test for Hypothesis** :

1.   One Sample Z-Test (Unknown Variance)
2.   Two Sample Z-Test (Same population)
3.   Two Sample Z-Test (Different population)

**T-Test for Hypothesis** :


1.   One Sample T-Test (Unknown Variance)
2.   Two Sample T-Test (Same population)
3.   Two Sample T-Test (Different population)


**Proportion Test for Hypothesis**:

1. One Sample Proportion-Test (Unknown Variance)








### One Sample Z-Test (Unknown Variance) :

H0 (Null Hypothesis) : The average Birth Rate is more than 40.

H1 (Alternate Hypothesis) : The average Birth Rate is not more than 40.


Importing the necessary python modules

In [3]:
import pandas as pd
import numpy as np
from scipy import stats as ss
import statistics as st
import math
from sklearn.preprocessing import StandardScaler

Defining the One Sample Z-Test function

In [4]:
def one_sample_z_test(s_mean, μ, s_dev, s_size):
  den = ((s_dev) / math.sqrt(s_size-1))
  z_cal = abs(s_mean - μ) / den
  return round(z_cal, 5)

Loading and storing the dataset and performing random sampling on it

In [5]:
file_path = "world-data-2023-1.csv"
data = pd.read_csv(file_path)

alpha = 0.05 # Level of significance

pop_mean = 40.0 # Mean we are testing against (μ)

sample_size = 50

z_tabulated = 1.645 # Z_tab @ alpha = 0.05 and one-tailed test

sample = np.random.choice(data['Birth Rate'], size=sample_size, replace=False)

sample


array([14.  , 10.1 , 19.49, 19.97,  8.1 , 10.7 , 24.56, 34.12,  9.5 ,
       41.54, 35.39, 41.18, 16.1 , 36.22, 24.35, 21.6 , 27.1 , 12.43,
       24.28, 28.75, 12.5 , 21.28, 18.78, 26.81, 37.93, 10.65, 10.1 ,
       17.6 , 13.94,  8.9 , 10.3 , 29.41, 42.17,  8.11, 18.25, 13.92,
        7.4 , 10.  , 23.55, 21.75, 17.86, 13.47, 16.75, 14.88, 10.2 ,
        9.2 , 17.55, 41.18, 11.3 , 12.6 ])

Calculating necessary values and performing the test

In [6]:
mean = np.mean(sample)

std_dev = np.std(sample, ddof=1) # ddof=1 for sample standard deviation

z_calculated = one_sample_z_test(mean, pop_mean, std_dev, sample_size)

print("Z_Calculated Value : ", z_calculated)

print("Z_Tabulated Value : ", z_tabulated)

Z_Calculated Value :  13.84696
Z_Tabulated Value :  1.645


In [7]:
# Perform the hypothesis test
if abs(z_calculated) >= z_tabulated:
    result = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result = "Accept the null hypothesis (H0)"

In [8]:
print('Result : ' , result)

Result :  Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)


Testing with the confidence interval.


In [9]:
limit = z_tabulated * (std_dev/(math.sqrt(sample_size)))

lower_limit = mean - limit
upper_limit = mean + limit

print("Confidence Interval - (95%) : ")

print(lower_limit , " < μ < " , upper_limit)

Confidence Interval - (95%) : 
17.375657703995664  < μ <  22.137142296004335


### Two Sample Z-Test (Same population) :
*(Two Samples are drawn from the same population)*


H0 (Null Hypothesis) : The average Birth Rate both samples is same.

H1 (Alternate Hypothesis) : The average Birth Rate both sample is not same.


Defining Two Sample (Same Population) Z-Test function.

In [10]:
def two_sample_z_same_population(s_mean1, s_mean2, s_size1, s_size2, s_dev1, s_dev2):
    temp = (1/(s_size1-1)) + (1/(s_size2-1))
    df = s_size1 + s_size2 - 2
    var = (s_size1 - 1) * s_dev1*s_dev1 + (s_size2 - 1) * s_dev2*s_dev2
    sP = var/df # Pooled Variance
    z_cal = abs(s_mean1 - s_mean2) / (sP * math.sqrt(temp))
    return [round(z_cal, 5),sP]  # Round to 5 decimal places

Loading and storing the dataset and performing random sampling on it


In [11]:
# file_path = '/content/dataset_bank.csv'
data_m = pd.read_csv(file_path)

alpha_m = 0.10 # Level of significance

z_tabulated_m = 1.282 # Z_tab @ alpha = 0.10 and one-tailed test

#married_age_data = data_m.loc[data_m['marital'] == 'married', 'age'].to_numpy()
birth_rate = data['Birth Rate']

# Create two random samples with sizes greater than 30
msample1 = np.random.choice(birth_rate, size=40, replace=False)
msample2 = np.random.choice(birth_rate, size=35, replace=False)

# Display the two random samples
print("\nRandom Sample 1:")
print(msample1)
print("\nRandom Sample 2:")
print(msample2)

mean_1 = np.mean(msample1)
mean_2 = np.mean(msample2)

std_dev1 = np.std(msample1)
std_dev2 = np.std(msample2)

n1 = len(msample1)
n2 = len(msample2)



Random Sample 1:
[21.98 17.26 33.69 23.55  9.2  13.97  8.1  38.54 12.5  10.65 24.56 36.22
 13.97  9.   12.6  10.9  29.08 24.28 10.3  13.92  7.4  21.28 27.1  34.12
 21.6  18.07 17.55  8.9  11.3  24.35  9.6  13.47 24.82 35.35 29.41 17.86
 16.1  41.18 17.6  26.81]

Random Sample 2:
[18.18 10.9  34.12 29.08 12.   10.   17.86 26.81 10.    8.11 29.41 13.97
 13.97 10.1   7.4  19.97 35.39 28.75 17.02 13.99 10.3  21.77 24.35 11.3
 12.5  17.26 42.17  8.1  10.3  21.98 41.18 11.78 13.92 18.25 19.49]


Calculating necessary values and performing the test

In [12]:
res = two_sample_z_same_population(mean_1, mean_2, n1, n2, std_dev1, std_dev2)

z_calculated_m = res[0]

print("Z_Calculated Value : ", z_calculated_m)

print("Z_Tabulated Value : ", z_tabulated_m)

Z_Calculated Value :  0.0652
Z_Tabulated Value :  1.282


In [13]:
# Perform the hypothesis test
if abs(z_calculated_m) >= z_tabulated_m:
    result_m = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result_m = "Accept the null hypothesis (H0)"

In [14]:
print('Result : ' , result_m)

Result :  Accept the null hypothesis (H0)


Testing with the confidence interval.

In [15]:
sP = res[1] # Pooled Variance
m_mean = abs(mean_1 - mean_2)

temp = (1/(n1-1)) + (1/(n2-1))

m_limit = z_tabulated_m * (sP * math.sqrt(temp))

lower_limit = m_mean - limit
upper_limit = m_mean + limit

print("Confidence Interval - (90%) : ")

print(lower_limit , " < μ1 - μ2 < " , upper_limit)

Confidence Interval - (90%) : 
-1.0466708674329093  < μ1 - μ2 <  3.7148137245757633


### Two Sample Z-Test (Different population) :
*(Two Samples are drawn from the different population)*



H0 (Null Hypothesis) : The average of both samples is same.

H1 (Alternate Hypothesis) : The average of both sample is not same.

Defining Two Sample (*Different* Population) Z-Test function.

In [16]:
def two_sample_z_different_population(s_mean1, s_mean2, s_size1, s_size2, s_dev1, s_dev2):
  temp1 = ((s_dev1)**2 / (s_size1 - 1))
  temp2 = ((s_dev2)**2 / (s_size2 - 1))
  den = math.sqrt(temp1 + temp2)
  z_cal = abs(s_mean1 - s_mean2) / den
  return [round(z_cal, 5), den]  # Round to 5 decimal places

Loading and storing the dataset and performing random sampling on it

In [17]:
# file_path = '/content/dataset_bank.csv'
data_m = pd.read_csv(file_path)

alpha_m = 0.10 # Level of significance

z_tabulated_m = 1.282 # Z_tab @ alpha = 0.10 and one-tailed test

birth_rate = data['Birth Rate']
life_expectancy =data['Life expectancy']


# Create two random samples with sizes greater than 30
m_sample1 = np.random.choice(birth_rate, size=42, replace=False)
m_sample2 = np.random.choice(life_expectancy, size=36, replace=False)

# Display the two random samples
print("\nRandom Sample 1: (Birth)")
print(m_sample1)
print("\nRandom Sample 2: (Life)")
print(m_sample2)

mean_1 = np.mean(m_sample1)
mean_2 = np.mean(m_sample2)

std_dev1 = np.std(msample1)
std_dev2 = np.std(msample2)

n1 = len(m_sample1)
n2 = len(m_sample2)


Random Sample 1: (Birth)
[33.04 18.07 10.65 10.1  33.69 14.   41.18 35.74 40.73 10.   41.18 35.39
 17.6  26.81 13.92 18.78 12.   12.43 24.28  7.4  35.35 18.25 31.61 24.35
 17.55 32.86 23.55 14.88 37.93  9.6  11.78 10.   10.3  19.97 16.1  12.5
  8.11 21.98 33.24  9.   13.47 21.6 ]

Random Sample 2: (Life)
[80.  81.9 61.7 64.7 52.8 79.1 69.3 71.8 63.8 82.3 58.9 74.9 82.8 58.4
 63.7 84.2 77.3 74.4 80.1 73.8 74.5 82.5 71.5 53.7 63.7 78.9 75.4 74.9
 74.4 77.1 57.4 78.5 82.1 82.7 74.1 82.7]


Calculating necessary values and performing the test

In [18]:
t_res = two_sample_z_different_population(mean_1, mean_2, n1, n2, std_dev1, std_dev2)

z_calculated_ms = t_res[0]

print("Z_Calculated Value : ", z_calculated_ms)

print("Z_Tabulated Value : ", z_tabulated_m)

Z_Calculated Value :  23.98052
Z_Tabulated Value :  1.282


In [19]:
# Perform the hypothesis test
if abs(z_calculated_ms) >= z_tabulated_m:
    result_ms = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result_ms = "Accept the null hypothesis (H0)"

print('Result : ' , result_ms)

Result :  Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)


Testing with the confidence interval.

In [20]:
tent = t_res[1]

limit_ms = z_tabulated * (tent)

ms_mean = abs(mean_1 - mean_2)

lower_limit = ms_mean - limit_ms
upper_limit = ms_mean + limit_ms

print("Confidence Interval - (90%) : ")

print(lower_limit , " < μ1 - μ2 < " , upper_limit)

Confidence Interval - (90%) : 
48.02703876984952  < μ1 - μ2 <  55.10137392856314


### One Sample T-Test (Unknown Variance) :

H0 (Null Hypothesis) : The average CPIis more than 280 seconds.

H1 (Alternate Hypothesis) : The average CPI is not more than 280 seconds.


Defining One Sample Z-Test function.

In [21]:
def one_sample_t_test(s_mean, μ, s_dev, s_size):
  den = ((s_dev) / math.sqrt(s_size-1))
  t_cal = abs(s_mean - μ) / den
  return round(t_cal, 5)

Loading and storing the dataset and performing random sampling on it

In [22]:
# file_path = '/content/dataset_bank.csv'
data = pd.read_csv(file_path)

alpha = 0.05 # Level of significance

pop_duration = 280.0 # Mean we are testing against (μ)

sample_size = 18

t_tabulated = 1.7396 # Z_tab @ alpha = 0.05 and one-tailed test

sample = np.random.choice(data['CPI'], size=sample_size, replace=False)

print(sample)


[105.48 149.9  151.36 418.34 113.45 111.23 119.8  117.11 172.73 166.2
 116.48 223.13 133.61 109.82 141.54 117.7  261.73 186.86]


Calculating necessary values and performing the test

In [23]:
mean = np.mean(sample)

std_dev = np.std(sample, ddof=1) # ddof=1 for sample standard deviation

t_calculated = one_sample_t_test(mean, pop_mean, std_dev, sample_size)

df = sample_size - 1

print("T_Calculated Value : ", t_calculated)

print("T_Tabulated Value : ", t_tabulated)

T_Calculated Value :  6.5498
T_Tabulated Value :  1.7396


In [24]:
# Perform the hypothesis test
if abs(t_calculated) >= t_tabulated:
    result = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result = "Accept the null hypothesis (H0)"

print('Result : ' , result)

Result :  Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)


Testing using the confidence interval!

In [25]:
limit = t_tabulated * (std_dev/(math.sqrt(sample_size)))

lower_limit = mean - limit
upper_limit = mean + limit

print("Confidence Interval - (95%) : ")

print(lower_limit , " < μ < " , upper_limit)

Confidence Interval - (95%) : 
130.52962226995558  < μ <  193.5225999522667


### Two Sample T-Test (Same population) :
*(Two Samples are drawn from the same population)*


H0 (Null Hypothesis) : The average CPI of both samples is same.

H1 (Alternate Hypothesis) : The average CPI of both sample is not same.


Defining Two Sample (Same Population) Z-Test function.

In [26]:
def two_sample_t_same_population(s_mean1, s_mean2, s_size1, s_size2, s_dev1, s_dev2):
    temp = (1/(s_size1-1)) + (1/(s_size2-1))
    df = s_size1 + s_size2 - 2
    var = (s_size1 - 1) * s_dev1*s_dev1 + (s_size2 - 1) * s_dev2*s_dev2
    sP = var/df # Pooled Variance
    t_cal = abs(s_mean1 - s_mean2) / (sP * math.sqrt(temp))
    return [round(t_cal, 5),sP]  # Round to 5 decimal places

Loading and storing the dataset and performing random sampling on it.

In [27]:
# file_path = '/content/dataset_bank.csv'
data_m = pd.read_csv(file_path)

alpha_m = 0.10 # Level of significance

t_tabulated_m = 2.7874 # Z_tab @ alpha = 0.10 and one-tailed test @ d.f = 25

cpi = data['CPI']

# Create two random samples with sizes greater than 30
msample1 = np.random.choice(cpi, size=13, replace=False)
msample2 = np.random.choice(cpi, size=14, replace=False)

# Display the two random samples
print("\nRandom Sample 1:")
print(msample1)
print("\nRandom Sample 2:")
print(msample2)

mean_1 = np.mean(msample1)
mean_2 = np.mean(msample2)

std_dev1 = np.std(msample1)
std_dev2 = np.std(msample2)

n1 = len(msample1)
n2 = len(msample2)

df = n1 + n2 -2


Random Sample 1:
[106.58 124.35 133.61 115.09 108.15 142.92 101.87 126.6  111.65 167.4
 122.19 133.85 162.47]

Random Sample 2:
[184.33 125.6  124.35 108.73 121.46 155.68 124.74 128.85 117.7  149.75
 131.91 150.34 418.34 118.38]


Calculating necessary values and performing the test.

In [28]:
res = two_sample_t_same_population(mean_1, mean_2, n1, n2, std_dev1, std_dev2)

t_calculated_m = res[0]

print("T_Calculated Value : ", t_calculated_m)

print("T_Tabulated Value : ", t_tabulated_m)

T_Calculated Value :  0.0212
T_Tabulated Value :  2.7874


In [29]:
# Perform the hypothesis test
if abs(t_calculated_m) >= t_tabulated_m:
    result_m = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result_m = "Accept the null hypothesis (H0)"

print('Result : ' , result_m)

Result :  Accept the null hypothesis (H0)


Testing with the help of confidence interval.

In [30]:
sP = res[1] # Pooled Variance
m_mean = abs(mean_1 - mean_2)

temp = (1/(n1-1)) + (1/(n2-1))

m_limit = t_tabulated_m * (sP * math.sqrt(temp))

lower_limit = m_mean - limit
upper_limit = m_mean + limit

print("Confidence Interval - (90%) : ")

print(lower_limit , " < μ1 - μ2 < " , upper_limit)

Confidence Interval - (90%) : 
-4.64011521478194  < μ1 - μ2 <  58.35286246752919


### Two Sample T-Test (Different population) :
*(Two Samples are drawn from the different population)*

Population : Birth Rate and Life expectancy

H0 (Null Hypothesis) : The average  of both samples is same.

H1 (Alternate Hypothesis) : The average  of both sample is not same.

Defining Two Sample (*Different* Population) Z-Test function.

In [31]:
def two_sample_t_different_population(s_mean1, s_mean2, s_size1, s_size2, s_dev1, s_dev2):
  temp1 = ((s_dev1)**2 / (s_size1 - 1))
  temp2 = ((s_dev2)**2 / (s_size2 - 1))
  den = math.sqrt(temp1 + temp2)
  t_cal = abs(s_mean1 - s_mean2) / den
  return [round(t_cal, 5), den]  # Round to 5 decimal places

Loading and storing the dataset and performing random sampling on it

In [32]:
# file_path = '/content/dataset_bank.csv'
data_m = pd.read_csv(file_path)

alpha_m = 0.05 # Level of significance

t_tabulated_m = 2.0484 # Z_tab @ alpha = 0.10 and one-tailed test @ dof =28

birth_rate = data['Birth Rate']
life_expectancy = data['Life expectancy']

# Create two random samples with sizes greater than 30
m_sample1 = np.random.choice(birth_rate, size=17, replace=False)
m_sample2 = np.random.choice(life_expectancy, size=13, replace=False)

# Display the two random samples
print("\nRandom Sample 1: (Birth)")
print(m_sample1)
print("\nRandom Sample 2: (Life)")
print(m_sample2)

mean_1 = np.mean(m_sample1)
mean_2 = np.mean(m_sample2)

std_dev1 = np.std(msample1)
std_dev2 = np.std(msample2)

n1 = len(m_sample1)
n2 = len(m_sample2)

df = n1 + n2 - 2


Random Sample 1: (Birth)
[14.88 41.54 32.86 13.99 32.66 29.08 41.18 17.02 24.56 23.55 11.3  35.74
 10.9  33.69 18.18 35.13 18.78]

Random Sample 2: (Life)
[60.8 76.7 63.7 80.9 78.1 81.9 74.9 75.  60.4 70.5 71.8 73.8 64.7]


In [33]:
t_res = two_sample_z_different_population(mean_1, mean_2, n1, n2, std_dev1, std_dev2)

t_calculated_ms = t_res[0]

print("T_Calculated Value : ", t_calculated_ms)

print("T_Tabulated Value : ", t_tabulated_m)

T_Calculated Value :  2.06268
T_Tabulated Value :  2.0484


In [34]:
# Perform the hypothesis test
if abs(t_calculated_ms) >= t_tabulated_m:
    result_ms = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result_ms = "Accept the null hypothesis (H0)"

print('Result : ' , result_ms)

Result :  Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)


Testing with the help of confidence interval!

In [35]:
tent = t_res[1]

limit_ms = t_tabulated_m * (tent)

ms_mean = abs(mean_1 - mean_2)

lower_limit = ms_mean - limit_ms
upper_limit = ms_mean + limit_ms

print("Confidence Interval - (90%) : ")

print(lower_limit , " < μ1 - μ2 < " , upper_limit)

Confidence Interval - (90%) : 
0.31981273564353074  < μ1 - μ2 <  92.06824156299899


### One Sample Proportion-Test (Unknown Variance) :

H0 (Null Hypothesis) : The proportion of cpi is more than 0.40.

H1 (Alternate Hypothesis) : The proportion of cpi is less than 0.40.


Seperating married clients and calculating necessary parameters!!

In [36]:
# file_path = '/content/dataset_bank.csv'
dataset = pd.read_csv(file_path)

sample_size = 50

sample = np.random.choice(data['CPI'], size=sample_size, replace=False)

# Convert the sample to a Pandas DataFrame
sample_df = pd.DataFrame({'CPI': sample})


cpi = data['CPI']

print("No of Entities : ", len(cpi))

count_married = len(cpi)
count_total = len(sample)


No of Entities :  84


Defining proportion one sample test function!!

In [37]:
def calculate_z_proportion_one(p_sample,p_pop,s_size):
    q_pop = 1 - p_pop
    den = math.sqrt((p_pop * q_pop)/(s_size-1))
    z_cal = abs(p_sample-p_pop) / den
    return [round(z_cal, 5),den]

Calculating proportion parameters and testing the hypothesis!!

In [38]:
p_sample = count_married / count_total
p_pop = 0.40 # According to hypothesis

z_tab = 1.645 # @ alpha = 0.05 and one-tailed test

res = calculate_z_proportion_one(p_sample, p_pop, sample_size)

z_calculated = res[0]

den = res[1]

print("Z-Calculated : ", z_calculated)
print("Z-Tabulated : ", z_tab)

Z-Calculated :  18.28952
Z-Tabulated :  1.645


In [39]:
# Perform the hypothesis test
if abs(z_calculated) >= z_tab:
    result = "Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)"
else:
    result = "Accept the null hypothesis (H0)"

print('Result : ' , result)

Result :  Reject the null hypothesis (H0) (or) Accepting alternate hypothesis (H1)


Testing using the confidence intervals!!

In [40]:
limit = z_tab * (den)

lower_limit = p_sample - limit
upper_limit = p_sample + limit

print("Confidence Interval - (95%) : ")

print(lower_limit , " < p < " , upper_limit)

Confidence Interval - (95%) : 
1.5648739820891906  < p <  1.7951260179108093
