## Statistics Case Study - Unit III

Index:

1. Runs Test
2. Sign Test
3. Wilcoxon Rank Sum Test


### Runs Test :

H0 (Null Hypothesis) : The sequence of data points in the sample drawn from 'Life expectancy' is random

H1 (Alternate Hypothesis) : The sequence of data points in the sample drawn from 'Life expectancy' is not random


Importing the necessary Python modules

In [1]:
import pandas as pd
import numpy as np
from scipy import stats as ss
import statistics as st
import math
from sklearn.preprocessing import StandardScaler

In [2]:
file_path = 'world-data-2023-1.csv'
dataset = pd.read_csv(file_path)
dataset

Unnamed: 0,Country,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Capital,Co2-Emissions,CPI,Forested Area (%),GDP,Life expectancy,Minimum wage,Unemployment rate
0,Albania,58.10%,652230,323000,32.49,Kabul,8672,149.90,2.10%,"$19,101,353,833",64.5,$0.43,11.12%
1,Algeria,43.10%,28748,9000,11.78,Tirana,4536,119.05,28.10%,"$15,278,077,447",78.5,$1.12,12.33%
2,Andorra,17.40%,2381741,317000,24.28,Algiers,150006,151.36,0.80%,"$169,988,236,398",76.7,$0.95,11.70%
3,Antigua and Barbuda,47.50%,1246700,117000,40.73,Luanda,34693,261.73,46.30%,"$94,635,415,870",60.8,$0.71,6.89%
4,Armenia,54.30%,2780400,105000,17.02,Buenos Aires,201348,232.75,9.80%,"$449,663,446,954",76.5,$3.35,9.79%
...,...,...,...,...,...,...,...,...,...,...,...,...,...
79,Marshall Islands,32.40%,316,2000,9.20,Valletta,1342,113.45,1.10%,"$14,786,156,563",82.3,$5.07,3.47%
80,Mauritius,38.50%,1030700,21000,33.69,Nouakchott,2739,135.02,0.20%,"$7,593,752,450",64.7,$0.53,9.55%
81,Mexico,42.40%,2040,3000,10.20,Port Louis,4349,129.91,19.00%,"$14,180,444,557",74.4,$0.38,6.67%
82,Federated States of Micronesia,54.60%,1964375,336000,17.60,Mexico City,486406,141.54,33.90%,"$1,258,286,717,125",75.0,$0.49,3.42%


In [3]:
np.random.seed(15)
run_sample = np.random.choice(dataset['Life expectancy'], size = 25, replace = False)
run_sample

array([58.9, 64.3, 76.8, 74.4, 80.9, 64.7, 72.3, 71.8, 61.2, 74.1, 52.8,
       79. , 71.4, 75.7, 74.7, 60.8, 76.5, 61.7, 63.8, 73.2, 71.5, 75. ,
       63.7, 73.8, 77.3])

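Defining the runs test function. Each observation is coded 1 if it lies above the sample median and 0 if it lies below; values equal to the median are dropped. A run is a maximal stretch of identical codes.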

In [4]:
def runs(sample):
    median = np.median(sample)
    print("Median of the sample : ",  median)

    binary_data = []
    for num in sample:
        if num > median:
            binary_data.append(1)
        elif num < median:
            binary_data.append(0)

    binary_data = np.array(binary_data)

    print('Binary Data : ', binary_data)

    one_count, zero_count = np.sum(binary_data == 1), np.sum(binary_data == 0)
    print('n1 = ', one_count)
    print('n2 = ', zero_count)

    # Count the runs: start at 1 and add one for every change of symbol
    n_runs = 1
    for i in range(1, len(binary_data)):
        if binary_data[i] != binary_data[i-1]:
            n_runs += 1

    return n_runs

In [5]:
# For n1 = 12 and n2 = 12, the critical range for the number of runs is (7, 19)
c_lower, c_upper = 7, 19
rank = runs(run_sample)
print('No of runs = ', rank)
# Print the decision in both branches (the else branch previously printed nothing)
print('Accept H0' if c_lower <= rank <= c_upper else 'Reject H0, Accepting H1')

Median of the sample :  72.3
Binary Data :  [0 0 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0 1 1]
n1 =  12
n2 =  12
No of runs =  16
Accept H0
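
When n1 or n2 falls outside the range of the critical-value table, the decision is often cross-checked with the large-sample normal approximation to the runs test. The sketch below is an illustrative addition, not part of the original analysis: the function name `runs_z_statistic` is hypothetical, and it assumes the standard approximation with mean 2·n1·n2/(n1 + n2) + 1 and variance 2·n1·n2·(2·n1·n2 − n1 − n2) / ((n1 + n2)²·(n1 + n2 − 1)).

```python
import math

def runs_z_statistic(n_runs, n1, n2):
    """Z statistic for the runs test under the normal approximation."""
    n = n1 + n2
    mean_runs = 2 * n1 * n2 / n + 1
    var_runs = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (n_runs - mean_runs) / math.sqrt(var_runs)

# Values reported in the output above: 16 runs with n1 = n2 = 12
z = runs_z_statistic(16, 12, 12)
print('Z statistic:', round(z, 4))
# Two-tailed test at the 5% level of significance
print('Accept H0' if abs(z) <= 1.96 else 'Reject H0, Accepting H1')
```

With 16 runs the statistic is roughly 1.25, well inside ±1.96, so the approximation agrees with the table-based decision.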


### Sign Test :

H0 (Null Hypothesis) : The median of 'CPI' is equal to the specified value

H1 (Alternate Hypothesis) : The median of 'CPI' is not equal to the specified value


Extracting the 'CPI' column from the dataset.

In [6]:
import pandas as pd
import numpy as np

df = pd.read_csv(file_path)

cpi = df['CPI']

cpi


0     149.90
1     119.05
2     151.36
3     261.73
4     232.75
       ...  
79    113.45
80    135.02
81    129.91
82    141.54
83    166.20
Name: CPI, Length: 84, dtype: float64

Drawing a random sample of 25 CPI values without replacement.

In [7]:
np.random.seed(15)
sign_sample = np.random.choice(cpi, size = 25, replace = False)
sign_sample

array([108.73, 124.74, 124.14, 162.47, 112.85, 135.02, 179.68, 166.2 ,
       106.58, 142.92, 186.86, 116.48, 155.68, 167.4 , 116.86, 261.73,
       550.93, 172.73, 418.34, 182.75, 167.18, 141.54, 179.29, 116.22,
       104.9 ])

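Defining the sign test function. The specified value is taken to be the sample's own median; observations equal to it are counted separately as zeroes.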

In [8]:
def sign(sample):
    median = np.median(sample)
    print("Median of the sample:",  median)

    positive, negative, zero = np.sum(sample > median), np.sum(sample < median), np.sum(sample == median)
    print('No of Positives : ', positive)
    print('No of Negatives : ', negative)
    print('No of Zeroes : ', zero)

    # Zeroes (ties with the median) are dropped, so the effective sample size
    # is the number of non-zero signs
    sample_count = positive + negative

    sign_calculated = min(positive, negative)

    print("Sample Size : ", sample_count)

    return sign_calculated

Calculating the sign statistic and comparing it with the tabulated critical value to reach a conclusion.

In [9]:
sign_cal = sign(sign_sample)

sign_tabulated = 6 # alpha = 0.05, two-tailed test, n = 24

print("Calculated Sign Value : ", sign_cal)
print("Tabulated Sign Value : ", sign_tabulated)
print("\n")
if sign_cal > sign_tabulated:
    print("Accepting the H0 (Null Hypothesis)")
else:
    print("Rejecting the H0 (Null Hypothesis)")

Median of the sample: 155.68
No of Positives :  12
No of Negatives :  12
No of Zeroes :  1
Sample Size :  24
Calculated Sign Value :  12
Tabulated Sign Value :  6


Accepting the H0 (Null Hypothesis)
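
Because the sign statistic follows a Binomial(n, 0.5) distribution under H0, both the tabulated critical value and an exact p-value can be computed directly. The sketch below is an illustrative addition under stated assumptions: the helper names `sign_test` and `sign_critical_value` are hypothetical, and the hypothesised median `m0 = 150.0` is an arbitrary value chosen only to show a test against an externally specified median rather than the sample's own median.

```python
from math import comb

def sign_test(sample, m0):
    """Two-sided exact sign test of H0: median == m0."""
    positive = sum(x > m0 for x in sample)
    negative = sum(x < m0 for x in sample)
    n = positive + negative            # observations equal to m0 are dropped
    k = min(positive, negative)
    # Under H0 the number of positive signs is Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positive, negative, min(1.0, 2 * tail)

def sign_critical_value(n, alpha=0.05):
    """Largest k with 2 * P(X <= k) <= alpha, X ~ Binomial(n, 0.5); -1 if none."""
    cumulative, critical = 0, -1
    for k in range(n + 1):
        cumulative += comb(n, k)
        if 2 * cumulative / 2 ** n > alpha:
            break
        critical = k
    return critical

# The 25 CPI values sampled above
sign_sample = [108.73, 124.74, 124.14, 162.47, 112.85, 135.02, 179.68, 166.2,
               106.58, 142.92, 186.86, 116.48, 155.68, 167.4, 116.86, 261.73,
               550.93, 172.73, 418.34, 182.75, 167.18, 141.54, 179.29, 116.22,
               104.9]

m0 = 150.0  # hypothetical hypothesised median, for illustration only
pos, neg, p_value = sign_test(sign_sample, m0)
print('Positives:', pos, 'Negatives:', neg, 'p-value:', p_value)
print('Critical value for n = 24:', sign_critical_value(24))
```

H0 is rejected when the p-value falls below the chosen level of significance, or equivalently when the smaller sign count does not exceed the critical value.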


### Wilcoxon Rank Sum Test :

H0 : There is no significant difference between the two independent samples taken from 'CPI' and 'Birth Rate' respectively.

H1 : There is a significant difference between the two independent samples taken from 'CPI' and 'Birth Rate' respectively.

Drawing the two samples with replacement: 15 values from 'CPI' and 17 values from 'Birth Rate'.

In [10]:
np.random.seed(20)

w_sample_1 = np.random.choice(cpi, size = 15, replace = True)
w_sample_2 = np.random.choice(dataset['Birth Rate'], size = 17, replace = True)

print(w_sample_1)
print(w_sample_2)

[148.32 125.08 179.68 106.58 184.33 110.5  155.86 133.85 132.3  117.7
 166.2  104.9  105.48 104.9  156.32]
[12.6  42.17 36.22 32.66 29.08 35.35 40.73 10.3  32.66 16.1  16.75 10.1
 18.78  9.   10.65 12.6  32.66]


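Defining the Wilcoxon rank sum test function. The combined observations are ranked in descending order (rank 1 for the largest value), and the statistic R is the rank sum of the smaller sample. R is standardised with mean n_small * (n1 + n2 + 1) / 2 and standard deviation sqrt(n1 * n2 * (n1 + n2 + 1) / 12).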

In [11]:
from scipy.stats import rankdata
from scipy.stats import mannwhitneyu

def wilcoxon_rank_sum(s1, s2):
    n1, n2 = len(s1), len(s2)

    combine = np.concatenate([s1, s2])
    ranks = len(combine) + 1 - rankdata(combine)

    # Create a DataFrame to display the combined data and ranks
    data_table = pd.DataFrame({
        'Combined Data': combine,
        'Rank': ranks
    })

    # # Display the data table
    # print("Combined Data and Ranks:")
    # print(data_table)

    R_s1, R_s2 = ranks[:n1], ranks[n1:]
    min_count = min(n1, n2)

    R = np.sum(R_s1) if n1 <= n2 else np.sum(R_s2)

    mean_R = min_count * (n1 + n2 + 1) / 2

    std_R = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

    z_cal = np.round(np.abs(R - mean_R) / std_R, 4)
    # statistic, p_value = mannwhitneyu(s1, s2)
    # print(np.round(np.abs(statistic - mean_R) / std_R, 4))

    return [z_cal, data_table]

Calculating the Z statistic and comparing it with the tabulated value to reach a conclusion.

In [12]:
# For 5% level of significance Ztab = 1.9600
z_tab = 1.9600
result = wilcoxon_rank_sum(w_sample_1,w_sample_2)
z_cal = result[0]
rank_table = result[1]
print('Calculated Z statistic:', z_cal)
print('Tabulated Z statistic:', z_tab)
print('Accept H0' if z_cal <= z_tab else 'Reject H0')

Calculated Z statistic: 4.8148
Tabulated Z statistic: 1.96
Reject H0
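
The result can be cross-checked without any third-party library. The sketch below is an illustrative addition, not the original implementation: the helper names `midranks` and `rank_sum_z` are hypothetical, it ranks in ascending order with average ranks for ties, and like the function above it applies no tie correction to the variance. By symmetry, |Z| is the same whether ranks ascend or descend. It also reports the Mann-Whitney U statistic, which is related to the rank sum by U1 = R1 − n1(n1 + 1)/2.

```python
import math

def midranks(values):
    """Ascending 1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1      # mean of the positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def rank_sum_z(s1, s2):
    """Return (R1, U1, z) for sample s1 under the normal approximation."""
    n1, n2 = len(s1), len(s2)
    ranks = midranks(list(s1) + list(s2))
    r1 = sum(ranks[:n1])
    u1 = r1 - n1 * (n1 + 1) / 2                       # Mann-Whitney U for s1
    mean_r1 = n1 * (n1 + n2 + 1) / 2
    std_r1 = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    return r1, u1, (r1 - mean_r1) / std_r1

# The two samples printed above
w_sample_1 = [148.32, 125.08, 179.68, 106.58, 184.33, 110.5, 155.86, 133.85,
              132.3, 117.7, 166.2, 104.9, 105.48, 104.9, 156.32]
w_sample_2 = [12.6, 42.17, 36.22, 32.66, 29.08, 35.35, 40.73, 10.3, 32.66,
              16.1, 16.75, 10.1, 18.78, 9.0, 10.65, 12.6, 32.66]

r1, u1, z = rank_sum_z(w_sample_1, w_sample_2)
print('R1 =', r1, 'U1 =', u1, '|Z| =', round(abs(z), 4))
```

Every CPI value exceeds every birth-rate value, so U1 equals its maximum n1 * n2 = 255, which is why |Z| is so large.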


In [13]:
rank_table

Unnamed: 0,Combined Data,Rank
0,148.32,6.0
1,125.08,9.0
2,179.68,2.0
3,106.58,12.0
4,184.33,1.0
5,110.5,11.0
6,155.86,5.0
7,133.85,7.0
8,132.3,8.0
9,117.7,10.0
