# Regression Analysis: Numerical Variables

## Loading and Cleaning the Data
- Data cleaning: rows are excluded when
   - BMI is null
   - Smoking status is unknown
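The two cleaning rules above can be sketched as a small filter. This is only an illustration: the helper name `clean_stroke`, the toy data frame, and the assumption that the raw file encodes missing BMI as `"N/A"` and unknown smoking status as `"Unknown"` are not taken from this notebook.

```r
# Hypothetical helper: drop rows with missing BMI or unknown smoking status.
# The "N/A" and "Unknown" encodings are assumptions about the raw file.
clean_stroke <- function(raw) {
  # Non-numeric entries such as "N/A" become NA after conversion
  bmi_num <- suppressWarnings(as.numeric(as.character(raw$bmi)))
  keep <- !is.na(bmi_num) & !(raw$smoking_status %in% "Unknown")
  out <- raw[keep, , drop = FALSE]
  out$bmi <- bmi_num[keep]
  rownames(out) <- NULL
  out
}

# Toy data, made up for illustration (not rows from the real dataset)
toy <- data.frame(
  bmi = c("36.6", "N/A", "24.0", "29.0"),
  smoking_status = c("formerly smoked", "never smoked", "Unknown", "smokes"),
  stroke = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)
cleaned <- clean_stroke(toy)
print(cleaned)
```

Only the first and last toy rows survive, since the second has no BMI and the third has an unknown smoking status.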

In [1]:
# Load the cleaned dataset described above and preview the first rows
strokedata <- read.csv(file = '../data/healthcare-dataset-stroke-data-cleanbmi.csv')
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1
80,0,1,Yes,Private,105.92,32.5,never smoked,1
49,0,0,Yes,Private,171.23,34.4,smokes,1
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1
74,1,1,Yes,Private,70.09,27.4,never smoked,1


## Simple Models: A Single Explanatory Variable

### Age
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{age}}x_{\rm{age}}$

In [2]:
model_age = glm(stroke ~ age, data=strokedata, family=binomial(link="logit"))
summary(model_age)


Call:
glm(formula = stroke ~ age, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7433  -0.3532  -0.2094  -0.1101   3.1489  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.386557   0.413389  -17.87   <2e-16 ***
age          0.076116   0.006051   12.58   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1180.3  on 3424  degrees of freedom
AIC: 1184.3

Number of Fisher Scoring iterations: 7


![nume_model_age](./images/nume_model_age.png)
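To read the age coefficient on a more intuitive scale, it can be turned into an odds ratio and into predicted probabilities. The sketch below hard-codes the estimates from the `summary(model_age)` output; the ages 50 and 70 are arbitrary illustration points.

```r
# Coefficients copied from the summary(model_age) output
b0 <- -7.386557
b_age <- 0.076116

# Odds ratio for one additional year, and for ten additional years
or_1yr <- exp(b_age)
or_10yr <- exp(10 * b_age)

# Predicted stroke probability at two illustrative ages
# (plogis is the inverse of the logit link)
ages <- c(50, 70)
p_hat <- plogis(b0 + b_age * ages)

print(round(c(or_1yr = or_1yr, or_10yr = or_10yr), 3))
print(round(setNames(p_hat, paste0("age_", ages)), 3))
```

Under this model the odds of stroke rise by roughly 8% per year of age, which compounds to a bit more than a doubling per decade.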

### Average Glucose Level
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{glucose}}x_{\rm{glucose}}$

In [3]:
model_glucose = glm(stroke ~ avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model_glucose)


Call:
glm(formula = stroke ~ avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6635  -0.3162  -0.2829  -0.2613   2.6724  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -4.109807   0.188774 -21.771  < 2e-16 ***
avg_glucose_level  0.010115   0.001286   7.866 3.66e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1355.4  on 3424  degrees of freedom
AIC: 1359.4

Number of Fisher Scoring iterations: 6


![nume_model_glucose](./images/nume_model_glucose.png)

### BMI
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{bmi}}x_{\rm{bmi}}$

In [4]:
model_bmi = glm(stroke ~ bmi, data=strokedata, family=binomial(link="logit"))
summary(model_bmi)


Call:
glm(formula = stroke ~ bmi, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4043  -0.3322  -0.3266  -0.3220   2.4642  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.104671   0.321897  -9.645   <2e-16 ***
bmi          0.006975   0.010207   0.683    0.494    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1410.5  on 3424  degrees of freedom
AIC: 1414.5

Number of Fisher Scoring iterations: 5


![nume_model_bmi](./images/nume_model_bmi.png)

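- BMI carries little information for predicting stroke.
  - The estimate $\hat{\beta}_{\rm{bmi}} = 0.006975$ is very close to 0 and smaller than its standard error (0.010207).
  - The z-test p-value for $\hat{\beta}_{\rm{bmi}}$ is large (0.494).
- The next section uses the deviance to decide whether to drop bmi as an explanatory variable.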
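The z-test behind that p-value is a Wald test: the estimate divided by its standard error, compared against a standard normal. A minimal sketch using the numbers printed by `summary(model_bmi)`:

```r
# Estimate and standard error copied from summary(model_bmi)
est <- 0.006975
se <- 0.010207

z <- est / se                   # Wald z statistic
p_value <- 2 * pnorm(-abs(z))   # two-sided p-value under N(0, 1)

# Approximate 95% Wald confidence interval for the coefficient
ci <- est + c(-1, 1) * qnorm(0.975) * se

print(round(c(z = z, p_value = p_value), 3))
print(round(ci, 4))
```

The interval straddles 0, which is another way of seeing that the data do not pin down the sign of the BMI effect.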

## Model Selection: with/without bmi
- Full Model: $\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}} + \beta_{\rm{bmi}}x_{\rm{bmi}}$
- Reduced Model: $\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$

In [5]:
model_full = glm(stroke ~ age + avg_glucose_level + bmi, data=strokedata, family=binomial(link="logit"))
summary(model_full)


Call:
glm(formula = stroke ~ age + avg_glucose_level + bmi, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9252  -0.3444  -0.1970  -0.1050   3.2052  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -8.158626   0.620082 -13.157  < 2e-16 ***
age                0.073656   0.006365  11.572  < 2e-16 ***
avg_glucose_level  0.005208   0.001350   3.858 0.000114 ***
bmi                0.009182   0.012712   0.722 0.470102    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1163.3  on 3422  degrees of freedom
AIC: 1171.3

Number of Fisher Scoring iterations: 7


In [6]:
model_reduce = glm(stroke ~ age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model_reduce)


Call:
glm(formula = stroke ~ age + avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9514  -0.3432  -0.1975  -0.1068   3.1972  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.845862   0.434271  -18.07  < 2e-16 ***
age                0.072708   0.006176   11.77  < 2e-16 ***
avg_glucose_level  0.005442   0.001311    4.15 3.33e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1163.8  on 3423  degrees of freedom
AIC: 1169.8

Number of Fisher Scoring iterations: 7


### Likelihood-Ratio Test
- Comparing the Full Model with the Reduced Model amounts to testing $H_0: \beta_{\rm{bmi}} = 0$

In [7]:
anova(model_reduce, model_full, test="LRT")

Resid. Df,Resid. Dev,Df,Deviance,Pr(>Chi)
3423,1163.793,,,
3422,1163.279,1.0,0.5142007,0.4733261


- The deviance-based likelihood-ratio test gives a p-value of 0.4733.
    - We cannot reject $H_0$, so the final model drops the bmi variable.
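The p-value reported by `anova()` can be reproduced by hand: the drop in residual deviance between the two nested models is compared against a chi-squared distribution whose degrees of freedom equal the number of parameters removed. The sketch below uses the values from the table, so it runs without the data.

```r
# Residual deviances and degrees of freedom copied from the anova() table
dev_reduced <- 1163.793; df_reduced <- 3423
dev_full    <- 1163.279; df_full    <- 3422

lr_stat <- dev_reduced - dev_full   # likelihood-ratio statistic
lr_df   <- df_reduced - df_full     # number of parameters dropped (bmi only)
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(c(lr_stat = lr_stat, df = lr_df, p_value = p_value), 4))
```

The AIC values agree with this decision: the reduced model (1169.8) scores lower than the full model (1171.3).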

### Final Selection: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$

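## Converting Numerical Variables to Categorical
- To allow a simple classification in the combined model, we convert the numerical age and glucose variables into categorical ones.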

### Age

#### Data Processing

In [8]:
# Bin age into four groups with nested ifelse() calls
strokedata$age_group <- with(strokedata, ifelse(
    age < 50, '0-49', ifelse(
    age >= 50 & age < 60, '50-59', ifelse(
    age >= 60 & age < 70, '60-69', '>=70'))))
# Fix the level order so that '0-49' is the baseline group in glm()
strokedata$age_group <- factor(strokedata$age_group, levels= c('0-49', '50-59', '60-69', '>=70'))
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,age_group
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1,60-69
80,0,1,Yes,Private,105.92,32.5,never smoked,1,>=70
49,0,0,Yes,Private,171.23,34.4,smokes,1,0-49
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1,>=70
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1,>=70
74,1,1,Yes,Private,70.09,27.4,never smoked,1,>=70
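As an aside, the same binning can be written with `cut()`, which avoids the nested conditions. This is an alternative sketch rather than what the notebook runs; the `age` vector is made up to exercise the boundary values.

```r
# Illustrative ages, including the boundary values 50, 60 and 70
age <- c(49, 50, 59.9, 60, 69, 70, 81)

# Nested ifelse(), equivalent to the cell above
nested <- ifelse(age < 50, '0-49',
          ifelse(age < 60, '50-59',
          ifelse(age < 70, '60-69', '>=70')))

# cut() with left-closed intervals [a, b) yields the same groups
# and returns a factor whose levels follow the order of `labels`
age_group <- cut(age,
                 breaks = c(-Inf, 50, 60, 70, Inf),
                 labels = c('0-49', '50-59', '60-69', '>=70'),
                 right = FALSE)

print(data.frame(age, nested, age_group))
```

Setting `right = FALSE` matters here: with the default right-closed intervals, an age of exactly 50 would fall into the `0-49` group.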


#### Model: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{50-59}}x_{\rm{50-59}} + \beta_{\rm{60-69}}x_{\rm{60-69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}}$  
- Baseline group:  0-49

In [9]:
model_age_cate = glm(stroke ~ age_group, data=strokedata, family=binomial(link="logit"))
summary(model_age_cate)


Call:
glm(formula = stroke ~ age_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6114  -0.3870  -0.1327  -0.1327   3.0781  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.7286     0.2593 -18.234  < 2e-16 ***
age_group50-59   1.7841     0.3149   5.666 1.46e-08 ***
age_group60-69   2.1747     0.3131   6.945 3.78e-12 ***
age_group>=70    3.1463     0.2823  11.146  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1205.1  on 3422  degrees of freedom
AIC: 1213.1

Number of Fisher Scoring iterations: 7
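With a categorical predictor, each coefficient is a log odds ratio against the baseline group. The sketch below converts the estimates printed by `summary(model_age_cate)` into odds ratios and fitted probabilities; because the model contains only this one factor, the fitted probabilities coincide with the observed stroke proportion in each age group.

```r
# Coefficients copied from summary(model_age_cate); baseline group is 0-49
coefs <- c("(Intercept)" = -4.7286,
           "50-59" = 1.7841,
           "60-69" = 2.1747,
           ">=70"  = 3.1463)

# Odds ratio of each age group relative to the 0-49 baseline
odds_ratio <- exp(coefs[-1])

# Fitted stroke probability per group (baseline uses the intercept alone)
logit <- c("0-49" = unname(coefs[1]), coefs[-1] + coefs[1])
prob <- plogis(logit)

print(round(odds_ratio, 2))
print(round(prob, 4))
```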


#### Comparison: Numerical vs. Categorical
![age_nume_cate_model](./images/age_nume_cate_model.bgwhite.png)

- The numerical model still performs better, consistent with its lower AIC (1184.3 vs. 1213.1).
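The AIC values reported by `summary()` for the two age models can be checked by hand. For an ungrouped binary response the residual deviance $D$ equals $-2\log L$, so the AIC is the residual deviance plus twice the number of estimated coefficients $k$:

$$
\begin{aligned}
\mathrm{AIC} &= D + 2k \\
\text{numerical age } (k = 2) &: \quad 1180.3 + 2 \cdot 2 = 1184.3 \\
\text{categorical age } (k = 4) &: \quad 1205.1 + 2 \cdot 4 = 1213.1
\end{aligned}
$$

The categorical model loses on both terms: it has the larger residual deviance and also spends two more parameters.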

### Average Glucose Level

#### Data Processing

In [10]:
# Split average glucose level at 160; '<160' is the baseline group
strokedata$glc_group <- with(strokedata, ifelse(avg_glucose_level < 160, '<160', '>=160'))
strokedata$glc_group <- factor(strokedata$glc_group, levels= c('<160', '>=160'))
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,age_group,glc_group
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1,60-69,>=160
80,0,1,Yes,Private,105.92,32.5,never smoked,1,>=70,<160
49,0,0,Yes,Private,171.23,34.4,smokes,1,0-49,>=160
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1,>=70,>=160
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1,>=70,>=160
74,1,1,Yes,Private,70.09,27.4,never smoked,1,>=70,<160


#### Model: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$
- Baseline group: glc<160

In [11]:
model_glc_cate = glm(stroke ~ glc_group, data=strokedata, family=binomial(link="logit"))
summary(model_glc_cate)


Call:
glm(formula = stroke ~ glc_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5344  -0.2799  -0.2799  -0.2799   2.5531  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.21995    0.09636 -33.416   <2e-16 ***
glc_group>=160  1.34588    0.16201   8.307   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1350.5  on 3424  degrees of freedom
AIC: 1354.5

Number of Fisher Scoring iterations: 6
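The single coefficient of this model is the log odds ratio between the two glucose groups. A short sketch, using only the numbers printed by `summary(model_glc_cate)`, turns it into an odds ratio with an approximate 95% Wald interval and into the fitted probability of each group:

```r
# Estimates and standard error copied from summary(model_glc_cate)
b0 <- -3.21995
b_high <- 1.34588
se_high <- 0.16201

# Odds ratio of >=160 versus <160, with an approximate 95% Wald interval
or_high <- exp(b_high)
or_ci <- exp(b_high + c(-1, 1) * qnorm(0.975) * se_high)

# Fitted stroke probability in each glucose group
prob <- plogis(c("<160" = b0, ">=160" = b0 + b_high))

print(round(c(odds_ratio = or_high, lower = or_ci[1], upper = or_ci[2]), 2))
print(round(prob, 4))
```

Since the only predictor is binary, these two fitted probabilities are simply the observed stroke proportions in the two groups.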


#### Comparison: Numerical vs. Categorical
![glucose_nume_cate_model](./images/glucose_nume_cate_model.bgwhite.png)

- Surprisingly, the categorical avg_glucose_level fits better than the numerical one (AIC 1354.5 vs. 1359.4).

### Final Selection

$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{50-59}}x_{\rm{50-59}} + \beta_{\rm{60-69}}x_{\rm{60-69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}} + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$

In [12]:
model_cate_final = glm(stroke ~ age_group + glc_group, data=strokedata, family=binomial(link="logit"))
summary(model_cate_final)


Call:
glm(formula = stroke ~ age_group + glc_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7608  -0.3466  -0.1283  -0.1283   3.0998  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.7961     0.2602 -18.430  < 2e-16 ***
age_group50-59   1.6730     0.3167   5.283 1.27e-07 ***
age_group60-69   2.0137     0.3165   6.363 1.98e-10 ***
age_group>=70    2.9499     0.2868  10.286  < 2e-16 ***
glc_group>=160   0.7546     0.1700   4.439 9.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1186.4  on 3421  degrees of freedom
AIC: 1196.4

Number of Fisher Scoring iterations: 7


- Compared with the purely numerical final selection (AIC = 1169.8), this categorical model (AIC = 1196.4) still fits worse.
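To put all the fitted models side by side, the AIC values reported throughout this notebook can be collected into one table and sorted. The sketch below hard-codes those values so that it runs without the data; `aic_table` and its column names are illustrative choices.

```r
# AIC values copied from the summary() outputs in this notebook
aic_table <- data.frame(
  model = c("age", "avg_glucose_level", "bmi",
            "age + avg_glucose_level + bmi", "age + avg_glucose_level",
            "age_group", "glc_group", "age_group + glc_group"),
  n_params = c(2, 2, 2, 4, 3, 4, 2, 5),
  aic = c(1184.3, 1359.4, 1414.5, 1171.3, 1169.8, 1213.1, 1354.5, 1196.4),
  stringsAsFactors = FALSE
)

# Sort from best (lowest AIC) to worst
aic_table <- aic_table[order(aic_table$aic), ]
rownames(aic_table) <- NULL
print(aic_table)
```

With the fitted objects at hand, a call such as `AIC(model_reduce, model_cate_final)` returns the same comparison directly.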