# Basic Definition

| Statistic                | Symbol    | Formula                                                    |
| ------------------------ | --------- | ---------------------------------------------------------- |
| Mean                     | $\bar{x}$ | $\frac{1}{n} \sum x_i$                                     |
| Variance                 | $s^2$     | $\frac{1}{n-1} \sum (x_i - \bar{x})^2$                     |
| Standard deviation       | $s$       | $\sqrt{s^2}$                                               |
| Coefficient of variation | $CV$      | $\frac{s}{\bar{x}} \times 100\%$                           |
| 95% confidence interval  | CI        | $\bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}$ |
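
As a quick check of the formulas in the table, the sketch below evaluates each one with the Python standard library on the five scores used in the one-sample t-test example of this notebook. The critical value `t_crit` is an assumption: it is read from a t-table for df = 4 rather than computed.

```python
import math
import statistics

# Sample from the one-sample t-test example in this notebook.
scores = [60, 74, 69, 80, 72]
n = len(scores)

mean = statistics.mean(scores)          # x-bar = (1/n) * sum(x_i)
variance = statistics.variance(scores)  # s^2, uses the n - 1 denominator
sd = math.sqrt(variance)                # s = sqrt(s^2)
cv = sd / mean * 100                    # coefficient of variation, in percent

# Assumption: two-sided 95% critical value for df = n - 1 = 4,
# taken from a t-table instead of being computed here.
t_crit = 2.776445
half_width = t_crit * sd / math.sqrt(n)
ci = (mean - half_width, mean + half_width)

print(f"mean={mean}, variance={variance}, sd={sd:.4f}, CV={cv:.2f}%")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval should agree with the 95 percent confidence interval that R's `t.test` prints for the same scores.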


# Import

- python

In [70]:
import pandas as pd

- R

In [21]:
import rpy2

In [23]:
%load_ext rpy2.ipython

# Data

[ref](https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset)

| Variable | Description |
|--------|------|
| **Person ID** | Unique identifier for each individual. |
| **Gender** | Gender of the individual. <br>Values: `Male`, `Female` |
| **Age** | Age of the individual in years. |
| **Occupation** | Occupation or job type of the individual. |
| **Sleep Duration (hours)** | Average sleep time per day (unit: hours) |
| **Quality of Sleep (scale: 1-10)** | Sleep quality rated on a 1-10 scale. <br>1: very poor, 10: very good |
| **Physical Activity Level (minutes/day)** | Average physical activity time per day (unit: minutes) |
| **Stress Level (scale: 1-10)** | Stress level rated on a 1-10 scale. <br>1: very low, 10: very high |
| **BMI Category** | Classification by body mass index (BMI) <br>Example values: `Underweight`, `Normal`, `Overweight` |
| **Blood Pressure (systolic/diastolic)** | Blood pressure in `systolic/diastolic` format (e.g. `120/80`) |
| **Heart Rate (bpm)** | Resting heart rate (unit: bpm, beats per minute) |
| **Daily Steps** | Total number of steps taken per day |
| **Sleep Disorder** | Presence and type of sleep disorder <br>`None`: no sleep disorder <br>`Insomnia`: insomnia <br>`Sleep Apnea`: sleep apnea |

In [31]:
df = pd.read_csv('../../../../delete/Sleep_health_and_lifestyle_dataset.csv')

In [32]:
df

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
370,371,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
371,372,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
372,373,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea


# One sample t-test

$t = \dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}$

### ex) Null hypothesis: the mean of score (60, 74, 69, 80, 72) is 75.

$t = \dfrac{\bar{x}-75}{s/\sqrt{n}}$

`SAS`

```sas
data test;
   input score;
   datalines;
60
74
69
80
72
;
run;

proc ttest data=test h0=75; /* null hypothesis: H0 = 75 */
   var score;
run;
```

`python`

In [18]:
scores = [60, 74, 69, 80, 72]

In [71]:
import scipy.stats as stats
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)

In [20]:
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

T-statistic: -1.2172, P-value: 0.2904
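
To see where the reported statistic comes from, the following sketch evaluates $t = (\bar{x}-\mu_0)/(s/\sqrt{n})$ directly with the standard library. It reproduces only the statistic and the degrees of freedom; the p-value still needs the t distribution, which `scipy.stats` supplies above.

```python
import math
import statistics

scores = [60, 74, 69, 80, 72]
mu0 = 75  # hypothesized mean under H0

x_bar = statistics.mean(scores)
s = statistics.stdev(scores)        # sample SD (n - 1 denominator)
se = s / math.sqrt(len(scores))     # standard error of the mean
t_stat = (x_bar - mu0) / se
df = len(scores) - 1

print(f"t = {t_stat:.4f}, df = {df}")
```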


`R`

In [24]:
%%R
scores <- c(60, 74, 69, 80, 72)

t.test(scores, mu = 75)


	One Sample t-test

data:  scores
t = -1.2172, df = 4, p-value = 0.2904
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
 61.87567 80.12433
sample estimates:
mean of x 
       71 



### data ex) Null hypothesis: the mean of Sleep Duration is 7.

`SAS`

```sas
proc ttest data=df h0=7; /* null hypothesis: H0 = 7 */
   var 'Sleep Duration'n; /* name literal for a name containing a space; needs options validvarname=any */
run;
```

`python`

In [34]:
stats.ttest_1samp(df['Sleep Duration'], popmean=7)

Ttest_1sampResult(statistic=3.2104462758942, pvalue=0.0014402421900475528)

`R`

In [89]:
%R -i df 



In [41]:
%%R
t.test(df['Sleep Duration'], mu = 7)


	One Sample t-test

data:  df["Sleep Duration"]
t = 3.2104, df = 373, p-value = 0.00144
alternative hypothesis: true mean is not equal to 7
95 percent confidence interval:
 7.051185 7.212986
sample estimates:
mean of x 
 7.132086 



Conclusion: the p-value is below 0.05, so we reject the null hypothesis and conclude that the mean is not 7.

**What if the data do not pass a normality test?**

## Wilcoxon Signed Rank Test

`SAS`

```sas
data height;
   input height;
   datalines;
165
170
160
172
168
169
171
167
;
run;

proc univariate data=height mu0=168; /* test location against 168, as in the Python and R examples */
   var height;
   ods select TestsForLocation;
run;
```

`python`

In [87]:
from scipy.stats import wilcoxon

data = [165, 170, 160, 172, 168, 169, 171, 167]

diff = [x - 168 for x in data]

stat, p = wilcoxon(diff)
print(f"Wilcoxon stat: {stat}, p-value: {p:.4f}")

Wilcoxon stat: 13.0, p-value: 0.8653


`R`

In [88]:
%%R
data <- c(165, 170, 160, 172, 168, 169, 171, 167)

# Null hypothesis: median = 168
wilcox.test(data, mu = 168)


	Wilcoxon signed rank test with continuity correction

data:  data
V = 15, p-value = 0.9324
alternative hypothesis: true location is not equal to 168



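- The test compares the median rather than the mean, because a nonparametric test does not assume a specific distribution such as the normal.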
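
The Python and R outputs above differ in both the statistic (13 vs 15) and the p-value (0.8653 vs 0.9324). The sketch below recomputes the signed ranks by hand to suggest why: `scipy` appears to report the smaller of the two rank sums, R reports the positive rank sum `V`, and the two p-values match a normal approximation without and with a 0.5 continuity correction. The helper `average_ranks` and the function `two_sided_p` are illustrative names, not library functions.

```python
from statistics import NormalDist

data = [165, 170, 160, 172, 168, 169, 171, 167]
diffs = [x - 168 for x in data if x != 168]  # zero differences are dropped


def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


abs_diffs = [abs(d) for d in diffs]
ranks = average_ranks(abs_diffs)
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # R's V
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

n = len(diffs)
mean_w = n * (n + 1) / 4
# Variance with tie correction: subtract (t^3 - t) / 48 per tie group of size t.
tie_sizes = [abs_diffs.count(v) for v in set(abs_diffs)]
var_w = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
sd_w = var_w ** 0.5


def two_sided_p(correction):
    """Two-sided normal-approximation p-value with the given continuity correction."""
    z = (abs(w_plus - mean_w) - correction) / sd_w
    return 2 * (1 - NormalDist().cdf(z))


print(f"W+ = {w_plus}, W- = {w_minus}")
print(f"no correction:   p = {two_sided_p(0.0):.4f}")
print(f"with correction: p = {two_sided_p(0.5):.4f}")
```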

# Two sample t-test

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}, \quad s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

### ex) Comparing the means of groups A and B

`SAS`

```sas
data two_group;
   input group $ score;
   datalines;
A 85
A 88
A 90
B 80
B 78
B 82
;
run;

proc ttest data=two_group;
   class group;
   var score;
run;
```

- SAS reports both the result that assumes equal variances (Pooled) and the result that does not (Satterthwaite/Welch), so no separate option is needed.

`python`

In [25]:
group_a = [85, 88, 90]
group_b = [80, 78, 82]

- `equal_var` sets whether equal variances are assumed (`True`: pooled t-test, `False`: Welch's t-test).

In [72]:
import scipy.stats as stats
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)

In [27]:
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

T-statistic: 4.1309, P-value: 0.0145
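
The statistic above can be reproduced by hand from the pooled-variance form of the two-sample t-test, where the pooled variance is $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$ and the standard error is $\sqrt{s_p^2(1/n_1 + 1/n_2)}$. A minimal sketch with the standard library:

```python
import math
import statistics

group_a = [85, 88, 90]
group_b = [80, 78, 82]
n1, n2 = len(group_a), len(group_b)

var1 = statistics.variance(group_a)  # s_1^2
var2 = statistics.variance(group_b)  # s_2^2

# Pooled variance: weighted average of the two sample variances.
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t_stat = (statistics.mean(group_a) - statistics.mean(group_b)) / se
df = n1 + n2 - 2

print(f"pooled variance = {sp2:.4f}, t = {t_stat:.4f}, df = {df}")
```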


`R`

- `var.equal` sets whether equal variances are assumed (`TRUE`: pooled t-test, `FALSE`: Welch's t-test).

In [28]:
%%R
group_a <- c(85, 88, 90)
group_b <- c(80, 78, 82)

t.test(group_a, group_b, var.equal = TRUE)


	Two Sample t-test

data:  group_a and group_b
t = 4.1309, df = 4, p-value = 0.01448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.513803 12.819531
sample estimates:
mean of x mean of y 
 87.66667  80.00000 



### data ex) Comparing mean Sleep Duration by gender

`SAS`

```sas
proc ttest data=df;
   class Gender;
   var 'Sleep Duration'n; /* name literal for a name containing a space; needs options validvarname=any */
run;
```

`python`

In [49]:
stats.ttest_ind(df.query('Gender=="Male"')['Sleep Duration'], df.query('Gender=="Female"')['Sleep Duration'], equal_var=True)

Ttest_indResult(statistic=-2.3624469898393397, pvalue=0.018668859270607456)

`R`

In [69]:
%%R
t.test(df[df['Gender']=='Male',]['Sleep Duration'], df[df['Gender']=='Female',]['Sleep Duration'], var.equal = TRUE)


	Two Sample t-test

data:  df[df["Gender"] == "Male", ]["Sleep Duration"] and df[df["Gender"] == "Female", ]["Sleep Duration"]
t = -2.3624, df = 372, p-value = 0.01867
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.35404821 -0.03239537
sample estimates:
mean of x mean of y 
 7.036508  7.229730 



- Conclusion: the p-value is below 0.05, so we reject the null hypothesis and conclude that mean Sleep Duration differs between genders.

**What if the data do not pass a normality test?**

## Mann-Whitney U Test

`SAS`

```sas
data two_group;
   input group $ score;
   datalines;
A 85
A 88
A 90
B 80
B 78
B 82
;
run;

proc npar1way data=two_group wilcoxon;
   class group;
   var score;
run;
```

- `npar1way` stands for nonparametric tests (npar) with one factor (1way).

`python`

In [83]:
from scipy.stats import mannwhitneyu

group_a = [85, 88, 90]
group_b = [80, 78, 82]

stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat}, p-value: {p:.4f}")

Mann-Whitney U: 9.0, p-value: 0.1000
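
With only three observations per group, the exact p-value can be checked by brute force. The sketch below counts the U statistic directly and then enumerates every way of relabelling the six pooled values; `u_statistic` is an illustrative helper, not a `scipy` function.

```python
from itertools import combinations

group_a = [85, 88, 90]
group_b = [80, 78, 82]


def u_statistic(a, b):
    """Count pairs (x, y) with x > y; a tie counts as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


u_obs = u_statistic(group_a, group_b)

# Under H0 every choice of which 3 of the 6 pooled values belong to
# group A is equally likely, so enumerate all C(6, 3) = 20 labelings.
pooled = group_a + group_b
n_a, n_b = len(group_a), len(group_b)
center = n_a * n_b / 2  # expected U under H0
total = 0
extreme = 0
for idx in combinations(range(len(pooled)), n_a):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(u_statistic(a, b) - center) >= abs(u_obs - center):
        extreme += 1

p_exact = extreme / total
print(f"U = {u_obs}, exact two-sided p = {p_exact:.4f}")
```

Because group A lies entirely above group B, only the two most extreme labelings out of 20 qualify, which is why the p-value cannot drop below 0.1 at this sample size.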


`R`

In [85]:
%%R
group_a <- c(85, 88, 90)
group_b <- c(80, 78, 82)

wilcox.test(group_a, group_b, paired = FALSE)


	Wilcoxon rank sum exact test

data:  group_a and group_b
W = 9, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0



# Paired t-test

$t = \dfrac{\bar{d}}{s_d/\sqrt{n}}$

`SAS`

```sas
data bp;
   input before after;
   datalines;
130 125
128 126
135 132
133 130
129 124
;
run;

proc ttest data=bp;
   paired before*after;
run;
```

`python`

In [73]:
from scipy import stats

In [74]:
before = [130, 128, 135, 133, 129]
after  = [125, 126, 132, 130, 124]

In [76]:
t_stat, p_value = stats.ttest_rel(before, after)

In [77]:
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

T-statistic: 6.0000, P-value: 0.0039
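
The paired test is a one-sample t-test on the differences, so the statistic can be checked from $t = \bar{d}/(s_d/\sqrt{n})$. A minimal sketch with the standard library:

```python
import math
import statistics

before = [130, 128, 135, 133, 129]
after = [125, 126, 132, 130, 124]

diffs = [b - a for b, a in zip(before, after)]  # d_i = before_i - after_i
d_bar = statistics.mean(diffs)                  # mean difference
s_d = statistics.stdev(diffs)                   # SD of the differences
t_stat = d_bar / (s_d / math.sqrt(len(diffs)))

print(f"mean difference = {d_bar}, t = {t_stat:.4f}")
```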


`R`

In [78]:
%%R
before <- c(130, 128, 135, 133, 129)
after  <- c(125, 126, 132, 130, 124)

t.test(before, after, paired = TRUE)


	Paired t-test

data:  before and after
t = 6, df = 4, p-value = 0.003883
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.934133 5.265867
sample estimates:
mean difference 
            3.6 



- Conclusion: the p-value is below 0.05, so we reject the null hypothesis and conclude that the before and after measurements differ.

**What if the data do not pass a normality test?**

## Wilcoxon Signed Rank Test

`SAS`

```sas
data bp;
   input before after;
   diff = before - after;
   datalines;
130 125
128 126
135 132
133 130
129 124
;
run;

proc univariate data=bp;
   var diff;
   ods select TestsForLocation;
run;
```

`python`

In [80]:
from scipy.stats import wilcoxon

before = [130, 128, 135, 133, 129]
after  = [125, 126, 132, 130, 124]

stat, p = wilcoxon(before, after)
print(f"Wilcoxon stat: {stat}, p-value: {p:.4f}")

Wilcoxon stat: 0.0, p-value: 0.0625


`R`

In [82]:
%%R
before <- c(130, 128, 135, 133, 129)
after  <- c(125, 126, 132, 130, 124)

wilcox.test(before, after, paired = TRUE)


	Wilcoxon signed rank test with continuity correction

data:  before and after
V = 15, p-value = 0.05676
alternative hypothesis: true location shift is not equal to 0
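
The Python output (statistic 0.0, p = 0.0625) and the R output (V = 15, p = 0.05676) again describe the same data. The sketch below enumerates all sign patterns of the ranked differences to obtain the exact p-value, which matches the Python figure. R's value is consistent with the normal approximation that its output header names. `avg_rank` is an illustrative helper.

```python
from itertools import product

before = [130, 128, 135, 133, 129]
after = [125, 126, 132, 130, 124]
diffs = [b - a for b, a in zip(before, after)]

abs_diffs = [abs(d) for d in diffs]
sorted_abs = sorted(abs_diffs)


def avg_rank(v):
    """Average rank of v among the sorted absolute differences."""
    first = sorted_abs.index(v) + 1
    return first + (sorted_abs.count(v) - 1) / 2


ranks = [avg_rank(v) for v in abs_diffs]
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # R's V
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
stat = min(w_plus, w_minus)                              # value scipy prints

# Under H0 each difference is equally likely to be positive or negative,
# so enumerate all 2^n sign patterns for the observed ranks.
patterns = list(product([1, -1], repeat=len(ranks)))
extreme = 0
for signs in patterns:
    wp = sum(r for r, s in zip(ranks, signs) if s > 0)
    wm = sum(ranks) - wp
    if min(wp, wm) <= stat:
        extreme += 1

p_exact = extreme / len(patterns)
print(f"W+ = {w_plus}, W- = {w_minus}, exact two-sided p = {p_exact:.4f}")
```

With five pairs, the smallest achievable two-sided exact p-value is 2/32 = 0.0625, so this test cannot reach the 0.05 level here even though every difference is positive.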



# ANOVA

`SAS`

```sas
data mydata;
   input group $ value;
   datalines;
A 5
A 6
A 7
B 8
B 9
B 10
C 6
C 5
C 4
;
run;

proc glm data=mydata;
   class group;
   model value = group;
run;

```

`python`


In [90]:
import scipy.stats as stats

group_A = [5, 6, 7]
group_B = [8, 9, 10]
group_C = [6, 5, 4]

f_stat, p = stats.f_oneway(group_A, group_B, group_C)
print(f"ANOVA p-value: {p:.4f}")

ANOVA p-value: 0.0066
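
The F statistic behind this p-value is the ratio of the between-group mean square to the within-group mean square. The following sketch rebuilds the sums of squares by hand; they should agree with the `Sum Sq` and `F value` columns in R's `aov` summary for the same data.

```python
import statistics

groups = {"A": [5, 6, 7], "B": [8, 9, 10], "C": [6, 5, 4]}

all_values = [v for g in groups.values() for v in g]
grand_mean = statistics.mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total sample size

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(
    len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values()
)
# Within-group sum of squares: squared distance of each value from its group mean.
ss_within = sum(
    (v - statistics.mean(g)) ** 2 for g in groups.values() for v in g
)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}, F = {f_stat:.2f}")
```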


`R`

In [91]:
%%R
group <- c(rep("A",3), rep("B",3), rep("C",3))
value <- c(5,6,7, 8,9,10, 6,5,4)
df <- data.frame(group, value)

anova_result <- aov(value ~ group, data = df)
summary(anova_result)

            Df Sum Sq Mean Sq F value  Pr(>F)   
group        2     26      13      13 0.00659 **
Residuals    6      6       1                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


**What if normality holds but the equal-variance assumption does not?**

## Welch's ANOVA

`SAS`

```sas
proc glm data=mydata;
   class group;
   model value = group;
   means group / welch; /* Welch's variance-weighted one-way ANOVA */
run;
```

`python`

In [95]:
import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    'group': ['A']*3 + ['B']*3 + ['C']*3,
    'value': [5, 6, 7, 8, 9, 10, 6, 5, 4]
})

welch = pg.welch_anova(dv='value', between='group', data=df)
print(welch)

  Source  ddof1  ddof2          F     p-unc     np2
0  group      2    4.0  11.142857  0.023157  0.8125
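
For reference, the `F` and `ddof2` values above can be recomputed from the usual statement of Welch's formula, in which each group is weighted by $w_i = n_i / s_i^2$. This is a sketch of that textbook formula under the same data, not a description of how `pingouin` implements it internally.

```python
import statistics

groups = [[5, 6, 7], [8, 9, 10], [6, 5, 4]]
k = len(groups)

n = [len(g) for g in groups]
means = [statistics.mean(g) for g in groups]
w = [ni / statistics.variance(g) for ni, g in zip(n, groups)]  # w_i = n_i / s_i^2
w_sum = sum(w)
weighted_mean = sum(wi * mi for wi, mi in zip(w, means)) / w_sum

# Weighted between-group variability.
numerator = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, means)) / (k - 1)
# Correction term shared by the F denominator and the second degrees of freedom.
lam = sum((1 - wi / w_sum) ** 2 / (ni - 1) for wi, ni in zip(w, n))

f_welch = numerator / (1 + 2 * (k - 2) / (k**2 - 1) * lam)
df1 = k - 1
df2 = (k**2 - 1) / (3 * lam)

print(f"F = {f_welch:.4f}, df1 = {df1}, df2 = {df2:.1f}")
```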


`R`

In [96]:
%R -i df



In [97]:
%%R
oneway.test(value ~ group, data = df, var.equal = FALSE)


	One-way analysis of means (not assuming equal variances)

data:  value and group
F = 11.143, num df = 2, denom df = 4, p-value = 0.02316



**When normality is not satisfied**

## Kruskal-Wallis Test

`SAS`

```sas
proc npar1way data=mydata wilcoxon;
   class group;
   var value;
run;
```

`python`

In [98]:
h_stat, p = stats.kruskal(group_A, group_B, group_C)
print(f"Kruskal-Wallis p-value: {p:.4f}")

Kruskal-Wallis p-value: 0.0484
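
The H statistic is built from the rank sums of each group and then adjusted for ties. The sketch below recomputes it by hand. It relies on the fact that for 2 degrees of freedom the chi-squared upper tail is exactly $e^{-x/2}$, so no statistics library is needed for the p-value. `avg_rank` is an illustrative helper.

```python
import math

groups = [[5, 6, 7], [8, 9, 10], [6, 5, 4]]
pooled = sorted(v for g in groups for v in g)
n_total = len(pooled)


def avg_rank(v):
    """Average rank of v in the pooled, sorted sample."""
    first = pooled.index(v) + 1
    return first + (pooled.count(v) - 1) / 2


rank_sums = [sum(avg_rank(v) for v in g) for g in groups]

# Uncorrected H statistic.
h = (
    12 / (n_total * (n_total + 1))
    * sum(r**2 / len(g) for r, g in zip(rank_sums, groups))
    - 3 * (n_total + 1)
)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
tie_term = sum(pooled.count(v) ** 3 - pooled.count(v) for v in set(pooled))
h_corrected = h / (1 - tie_term / (n_total**3 - n_total))

# Chi-squared survival function for df = k - 1 = 2 is exp(-x / 2).
p_value = math.exp(-h_corrected / 2)

print(f"rank sums = {rank_sums}, H = {h_corrected:.4f}, p = {p_value:.4f}")
```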


`R`

In [99]:
%%R
kruskal.test(value ~ group, data = df)


	Kruskal-Wallis rank sum test

data:  value by group
Kruskal-Wallis chi-squared = 6.0565, df = 2, p-value = 0.0484



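Post-hoc test: run a t-test for each pair of groups, then compare each p-value against the adjusted alpha of 0.05 / 3 ≈ 0.0167 (three pairwise comparisons).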
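
A minimal sketch of that post-hoc procedure on the three-group example data, assuming a Bonferroni-style adjustment (0.05 divided by the number of pairs) and pooled-variance t-tests via `scipy.stats.ttest_ind`. The dictionary `results` is an illustrative name.

```python
from itertools import combinations

from scipy import stats

groups = {"A": [5, 6, 7], "B": [8, 9, 10], "C": [6, 5, 4]}

pairs = list(combinations(groups, 2))  # (A, B), (A, C), (B, C)
alpha_adj = 0.05 / len(pairs)          # adjusted alpha: 0.05 / 3

results = {}
for g1, g2 in pairs:
    t_stat, p = stats.ttest_ind(groups[g1], groups[g2], equal_var=True)
    results[(g1, g2)] = (p, p < alpha_adj)
    verdict = "significant" if p < alpha_adj else "not significant"
    print(f"{g1} vs {g2}: t = {t_stat:.3f}, p = {p:.4f} -> {verdict}")

print(f"adjusted alpha = {alpha_adj:.4f}")
```

Each pairwise p-value is judged against the adjusted alpha rather than 0.05, which keeps the overall chance of a false positive across the three comparisons at or below 0.05.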