# Inferential Statistics

Inferential statistics draws conclusions about a population from sample data.

For example: if 60% of the voters in a presidential-election sample chose candidate #1, can we conclude that 60% of all voters in Indonesia (the population) also chose #1?

- Confidence Interval
- Hypothesis Testing
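
As a rough sketch of the first tool, a 95% confidence interval for the 60% proportion in the example can be computed with the normal approximation. The sample size `n = 1000` is a made-up assumption; the example in the text does not state one.

```python
import math

# Assumed values: 60% of a hypothetical sample of 1000 voters chose #1
p_hat = 0.60
n = 1000

# 95% confidence interval via the normal (Wald) approximation
z = 1.96  # approximate critical value for 95% confidence
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - z * se, p_hat + z * se

print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

Under these assumptions the interval runs from roughly 57% to 63%, so the population share is plausibly close to, but not necessarily exactly, 60%.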


## Single Sample Test

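Compares the mean of a sample against a hypothesized value for the population mean.

"Is the population mean μ, as estimated from our sample, different from the hypothesized value μ₀?"

Null hypothesis H0: there is no difference (μ = μ₀),
Alternative hypothesis H1: there is a difference

The alternative hypothesis takes one of three forms (μ and μ₀ are written as u and u0 from here on):

- u < u0 (left-tailed)
- u > u0 (right-tailed)
- u != u0 (two-sided)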
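
In SciPy these three forms map onto the `alternative` argument of `stats.ttest_1samp` (available in SciPy 1.6 and later). A minimal sketch on synthetic data; the sample values and `u0 = 10` are made up for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic sample (assumption): 200 draws with true mean 9
rng = np.random.default_rng(0)
sample = rng.normal(loc=9.0, scale=2.0, size=200)
u0 = 10  # hypothesized mean

# One call per form of the alternative hypothesis
p_less = stats.ttest_1samp(sample, u0, alternative="less").pvalue        # H1: u < u0
p_greater = stats.ttest_1samp(sample, u0, alternative="greater").pvalue  # H1: u > u0
p_two = stats.ttest_1samp(sample, u0, alternative="two-sided").pvalue    # H1: u != u0

print(p_less, p_greater, p_two)
```

Because the synthetic mean sits below `u0`, only the `"less"` and `"two-sided"` tests should come out significant here.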


In [2]:
import numpy as np
import pandas as pd
from scipy import stats

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# import data
df_austin_weather = pd.read_csv("dataset/austin_weather.csv")

# convert data types to numeric for numeric columns
col_list = list(df_austin_weather.columns)
col_list.remove('Date')
col_list.remove('Events')

# convert every column that should be numeric; unparseable values become NaN
for col in col_list:
    df_austin_weather[col] = pd.to_numeric(
        df_austin_weather[col], errors='coerce')

# convert the Date column to datetime
df_austin_weather['Date'] = pd.to_datetime(df_austin_weather['Date'])
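
To see what `errors='coerce'` does in the loop above, here is a small sketch on a made-up column. The placeholder strings `'T'` and `'-'` are assumptions chosen for illustration, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical raw column mixing numeric strings with placeholders
raw = pd.Series(["5", "12", "T", "-"])

# Values that cannot be parsed become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())
print(clean.count())  # count() skips NaN
```

Coerced NaN values are skipped by `count()`, `mean()`, and `std()`, which matters when those summaries are used as the sample size and statistics later on.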

In [3]:
dwind = df_austin_weather['WindAvgMPH']

Q: "Is our average wind speed 12 mph?"

### one sample testing

H0: u = 12

H1: u != 12

\*This is a two-sided test because we only check whether the mean equals 12 or not.

Next, compute n, mean, and std from our sample data.


In [4]:
n = dwind.count()
mean = dwind.mean()
std = dwind.std()

print(n, mean, std)

1317 5.0083523158694 2.0864500115879125


In [17]:
data = dwind.dropna()

In [18]:
# hypothesized mean (mph)
u0 = 12

stats.ttest_1samp(data, u0)

TtestResult(statistic=-121.60864208555034, pvalue=0.0, df=1316)

pvalue < 0.05, so we can reject the null hypothesis H0. (The p-value prints as 0.0 because it is too small to represent in floating point.)

The sample mean (about 5.01 mph) differs significantly from the hypothesized value of 12 mph.
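
The statistic reported by `ttest_1samp` can be reproduced from the summary numbers printed earlier, using t = (mean - u0) / (std / sqrt(n)). A small sketch that hard-codes those printed values:

```python
import math
from scipy import stats

# Summary values copied from the output above
n = 1317
mean = 5.0083523158694
std = 2.0864500115879125
u0 = 12

# Standard error of the mean, then the t statistic
se = std / math.sqrt(n)
t_stat = (mean - u0) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(t_stat, p_value)
```

The result should agree with the `statistic` field of the `TtestResult` above up to rounding.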


Q: "Is the average wind speed above 6 mph?"

H0: u <= 6

H1: u > 6


In [19]:
n = dwind.count()
mean = dwind.mean()
std = dwind.std()

print(n, mean, std)

1317 5.0083523158694 2.0864500115879125


In [20]:
u0 = 6

stats.ttest_1samp(data, u0)

TtestResult(statistic=-17.24814146000529, pvalue=2.928230615778765e-60, df=1316)

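The two-sided p-value is below 0.05, so the mean differs significantly from 6 mph. That does not support H1: u > 6, however: the t statistic is negative because the sample mean (about 5.01 mph) is below 6. `ttest_1samp` runs a two-sided test by default; the one-sided test needs `alternative='greater'`, and with a t statistic this negative its p-value is close to 1. We therefore cannot reject H0: u <= 6.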
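
To check the one-sided question directly, the p-value for each form of the alternative can be derived from the reported t statistic and degrees of freedom. A sketch using the values printed above:

```python
from scipy import stats

# Values copied from the TtestResult above
t_stat = -17.24814146000529
df = 1316

p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # H1: u != 6
p_greater = stats.t.sf(t_stat, df)             # H1: u > 6
p_less = stats.t.cdf(t_stat, df)               # H1: u < 6

print(p_two_sided, p_greater, p_less)
```

`p_greater` comes out close to 1, so the data give no evidence that the mean is above 6 mph. The same result should come from `stats.ttest_1samp(data, 6, alternative='greater')`.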


## Two Sample Testing

Tests whether the means of two sets of data differ.

For example: do the average scores of female and male students differ?

H0: u1 = u2

H1: u1 != u2


In [22]:
df_student = pd.read_csv("dataset/StudentsPerformance.csv")
df_student.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [23]:
df_student['gender'].value_counts()

gender
female    518
male      482
Name: count, dtype: int64

In [27]:
df_student_f = df_student[df_student['gender'] == 'female']
df_student_m = df_student[df_student['gender'] == 'male']

math_f = df_student_f['math score']
math_m = df_student_m['math score']

In [28]:
# n mean std
nf = math_f.count()
meanf = math_f.mean()
stdf = math_f.std()
print("female students:", nf, meanf, stdf)

nm = math_m.count()
meanm = math_m.mean()
stdm = math_m.std()
print("male students:", nm, meanm, stdm)

female students: 518 63.633204633204635 15.49145324233953
male students: 482 68.72821576763485 14.35627719636238


In [29]:
stats.ttest_ind(math_f, math_m, equal_var=False)

TtestResult(statistic=-5.398000564160736, pvalue=8.420838109090415e-08, df=997.9840751727494)

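pvalue < 0.05, so we can reject the null hypothesis: the mean math scores of female and male students differ significantly. `equal_var=False` selects Welch's t-test, which does not assume that the two groups have equal variances.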
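
The same Welch test can be run from the summary statistics alone with `stats.ttest_ind_from_stats`. A sketch that hard-codes the n, mean, and std values printed above:

```python
from scipy import stats

# Summary values copied from the output above
nf, meanf, stdf = 518, 63.633204633204635, 15.49145324233953
nm, meanm, stdm = 482, 68.72821576763485, 14.35627719636238

# Welch's t-test from summary statistics (no equal-variance assumption)
result = stats.ttest_ind_from_stats(
    mean1=meanf, std1=stdf, nobs1=nf,
    mean2=meanm, std2=stdm, nobs2=nm,
    equal_var=False,
)

print(result.statistic, result.pvalue)
```

This is handy when only a summary table is available and the raw scores are not.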


## ANOVA

Compares the means of three or more groups.

For example, the scores of students in groups A, B, C, D, and E.

H0: all group means are equal

H1: at least one group mean differs


In [30]:
df_student['race/ethnicity'].value_counts()

race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

In [31]:
df_student_a = df_student[df_student['race/ethnicity'] == 'group A']
df_student_b = df_student[df_student['race/ethnicity'] == 'group B']
df_student_c = df_student[df_student['race/ethnicity'] == 'group C']
df_student_d = df_student[df_student['race/ethnicity'] == 'group D']
df_student_e = df_student[df_student['race/ethnicity'] == 'group E']

math_a = df_student_a['math score']
math_b = df_student_b['math score']
math_c = df_student_c['math score']
math_d = df_student_d['math score']
math_e = df_student_e['math score']

In [32]:
# n, mean, std
print("group A:", math_a.count(), math_a.mean(), math_a.std())
print("group B:", math_b.count(), math_b.mean(), math_b.std())
print("group C:", math_c.count(), math_c.mean(), math_c.std())
print("group D:", math_d.count(), math_d.mean(), math_d.std())
print("group E:", math_e.count(), math_e.mean(), math_e.std())

group A: 89 61.62921348314607 14.52300840865962
group B: 190 63.45263157894737 15.468191236472933
group C: 319 64.46394984326018 14.852665879253692
group D: 262 67.36259541984732 13.769385976609573
group E: 140 73.82142857142857 15.53425912548117


In [33]:
# ANOVA
stats.f_oneway(math_a, math_b, math_c, math_d, math_e)

F_onewayResult(statistic=14.593885166332635, pvalue=1.3732194030370688e-11)

Because pvalue < 0.05, we can reject the null hypothesis: at least one group's mean math score differs significantly from the others. ANOVA alone does not tell us which groups differ.
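
The F statistic is the ratio of the variance between group means to the variance within groups. As a sketch, it can be rebuilt from the per-group n, mean, and std values printed above:

```python
import numpy as np
from scipy import stats

# Group sizes, means and standard deviations copied from the output above
n = np.array([89, 190, 319, 262, 140])
mean = np.array([61.62921348314607, 63.45263157894737, 64.46394984326018,
                 67.36259541984732, 73.82142857142857])
std = np.array([14.52300840865962, 15.468191236472933, 14.852665879253692,
                13.769385976609573, 15.53425912548117])

k, N = len(n), n.sum()
grand_mean = (n * mean).sum() / N

# Between-group and within-group mean squares
ms_between = (n * (mean - grand_mean) ** 2).sum() / (k - 1)
ms_within = ((n - 1) * std ** 2).sum() / (N - k)

f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, k - 1, N - k)

print(f_stat, p_value)
```

The result should match the `F_onewayResult` above up to rounding.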


## Chi Square Test

For categorical data, we use this test. The null hypothesis is that the two categorical variables are independent.

For example, in presidential-election voting results (yes/no), is there a difference between men and women?

Similarly, using sleep-quality categories, is there a difference between men and women?

"Does sleep quality differ between men and women?"


In [34]:
df_sleep = pd.read_csv("dataset/sleep_dat.csv")
df_sleep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


In [35]:
df_sleep2 = df_sleep[['Gender', 'Quality of Sleep']]
df_sleep2.head()

|   | Gender | Quality of Sleep |
|---|--------|------------------|
| 0 | Male | 6 |
| 1 | Male | 6 |
| 2 | Male | 6 |
| 3 | Male | 4 |
| 4 | Male | 4 |


In [36]:
df_sleep2['Quality of Sleep'].value_counts()

Quality of Sleep
8    109
6    105
7     77
9     71
5      7
4      5
Name: count, dtype: int64

In [37]:
df_sleep2['Gender'].value_counts()

Gender
Male      189
Female    185
Name: count, dtype: int64

In [39]:
table = pd.crosstab(df_sleep2['Gender'], df_sleep2['Quality of Sleep'])
table

| Gender (rows) / Quality of Sleep (columns) | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|
| Female | 2 | 4 | 37 | 37 | 36 | 69 |
| Male | 3 | 3 | 68 | 40 | 73 | 2 |


In [42]:
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2)
print(p)
print(dof)
print(expected)

85.3640901482707
6.314315931960022e-17
5
[[ 2.47326203  3.46256684 51.93850267 38.08823529 53.9171123  35.12032086]
 [ 2.52673797  3.53743316 53.06149733 38.91176471 55.0828877  35.87967914]]


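pvalue < 0.05, so we reject the null hypothesis: sleep quality and gender are not independent, meaning sleep quality differs between men and women. Note that four cells have an expected count below 5, so the chi-square approximation is less reliable for those categories.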
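
The expected counts and the chi-square statistic can be rebuilt by hand from the observed table, which makes clear what `chi2_contingency` compares. A sketch using the counts from the crosstab above:

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab above (rows: Female, Male)
observed = np.array([
    [2, 4, 37, 37, 36, 69],
    [3, 3, 68, 40, 73, 2],
])

# Expected count per cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = stats.chi2.sf(chi2, dof)

print(chi2, p, dof)
print((expected < 5).sum(), "cells have an expected count below 5")
```

Most of the statistic comes from the quality-9 column, where the observed counts (69 vs. 2) are far from the expected counts of about 35 each.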


# Correlation Analysis

Checks whether two variables are correlated.

1. Pearson: measures a linear relationship; r ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation)
   - assumes a linear relationship
   - normally distributed data
   - continuous data

Example: are two continuous variables correlated or not?

2. Spearman: measures a monotonic relationship (consistently increasing or consistently decreasing)
   - does not require a normal distribution
   - works with ranked data
   - less sensitive to outliers
   - ordinal data (ordered categories)

Example: are students' rankings in exams and in sports correlated?

3. Kendall: similar to Spearman, but it compares the data pair by pair
   - small datasets
   - ordinal data (ordered categories)

Example: voters' ranked preferences among presidential candidates 1, 2, 3
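
The difference between a linear and a monotonic relationship can be seen on a small made-up dataset where y always rises with x but not along a straight line. A sketch using the corresponding `scipy.stats` functions:

```python
import numpy as np
from scipy import stats

# Made-up data: y always increases with x, but not along a straight line
x = np.arange(1, 11)
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print(pearson_r, spearman_rho, kendall_tau)
```

Spearman and Kendall both reach 1 because the ordering of x and y is identical, while Pearson stays below 1 because the relationship is curved.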


In [43]:
df_sleep.head()

|   | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | NaN |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |


In [45]:
# split "126/83" into two columns and convert the string parts to integers
df_sleep[['BP High', 'BP Low']] = df_sleep['Blood Pressure'].str.split(
    "/", expand=True).astype(int)
df_sleep.head()

|   | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder | BP High | BP Low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | NaN | 126 | 83 |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN | 125 | 80 |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN | 125 | 80 |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea | 140 | 90 |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea | 140 | 90 |
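
`str.split` returns strings, so the new columns are not numeric unless they are converted. A small sketch on made-up readings showing the conversion to integers before any numeric analysis:

```python
import pandas as pd

# Made-up blood pressure readings in "systolic/diastolic" form
bp = pd.Series(["126/83", "125/80", "140/90"])

# Splitting yields string columns, not numbers
parts = bp.str.split("/", expand=True)
print(parts.dtypes.tolist())

# Convert both columns to integers and give them descriptive names
parts = parts.astype(int)
parts.columns = ["BP High", "BP Low"]
print(parts["BP High"].mean())
```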


In [46]:
# is there a correlation between BP High and Heart Rate?
df_sleep['BP High'].corr(df_sleep['Heart Rate'], method='pearson')

0.2941429212564027

There is a weak positive correlation (r ≈ 0.29) between BP High and Heart Rate.


In [None]:
print(df_sleep['BP High'].corr(df_sleep['Heart Rate'], method='spearman'))
print(df_sleep['BP High'].corr(df_sleep['Heart Rate'], method='kendall'))

0.22260942439398276
0.19099422357447815


Spearman agrees: there is a weak positive correlation.

Kendall agrees as well. Its coefficient is usually smaller in magnitude than Spearman's, and it is usually a good fit when the dataset contains many tied values.
