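# Regression Analysis 2: Numerical Variables
In this second part of the regression analysis, we focus on three numerical variables: age, average glucose level, and BMI. The main goal is to find out which of these variables matter for predicting stroke. We also bin the numerical variables into groups (numerical to categorical) so that the next part can present a graphical analysis.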

## Loading the Data

In [1]:
strokedata <- read.csv(file = '../data/healthcare-dataset-stroke-data-cleanbmi.csv')
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1
80,0,1,Yes,Private,105.92,32.5,never smoked,1
49,0,0,Yes,Private,171.23,34.4,smokes,1
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1
74,1,1,Yes,Private,70.09,27.4,never smoked,1


## Simple Models: A Single Explanatory Variable

### Age
$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{age}}x_{\rm{age}}
$$

In [2]:
model_age = glm(stroke ~ age, data=strokedata, family=binomial(link="logit"))
summary(model_age)


Call:
glm(formula = stroke ~ age, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7433  -0.3532  -0.2094  -0.1101   3.1488  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.386079   0.413419  -17.87   <2e-16 ***
age          0.076109   0.006051   12.58   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1180.3  on 3423  degrees of freedom
AIC: 1184.3

Number of Fisher Scoring iterations: 7


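- <span style="color:blue"> The z-test p-value for $\beta_{\rm{age}}$ is very small (below 2e-16), so age is a significant predictor of stroke </span>
- <span style="color:blue"> $\beta_{\rm{age}}=0.076 > 0$: the probability of stroke increases with age </span>
- <span style="color:blue"> The blue line in the figure below visualizes this model; after about age 60, the probability of stroke starts to rise more quickly </span>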

```{image} ./images/nume_model_age.png
:alt: nume_model_age.png
:class: bg-primary mb-1
:width: 800px
:align: center
```
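
To make the coefficient easier to read, here is a minimal sketch that converts the fitted values into odds ratios and probabilities. It uses only the two estimates printed in the summary above; the ages chosen for evaluation are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(model_age) output above
b0    <- -7.386079
b_age <-  0.076109

# Odds ratio for one extra year and for ten extra years of age
or_year   <- exp(b_age)
or_decade <- exp(10 * b_age)

# Fitted stroke probability at a few illustrative ages (inverse logit)
ages  <- c(40, 50, 60, 70, 80)
p_hat <- plogis(b0 + b_age * ages)

print(round(c(per_year = or_year, per_decade = or_decade), 3))
print(data.frame(age = ages, p_stroke = round(p_hat, 4)))
```

With the fitted model object available, `predict(model_age, newdata = data.frame(age = ages), type = "response")` should give the same probabilities up to rounding of the coefficients.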

### Average Glucose Level
$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{glucose}}x_{\rm{glucose}}
$$

In [6]:
model_glucose = glm(stroke ~ avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model_glucose)


Call:
glm(formula = stroke ~ avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6637  -0.3162  -0.2829  -0.2614   2.6723  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -4.109552   0.188746 -21.773  < 2e-16 ***
avg_glucose_level  0.010116   0.001286   7.868  3.6e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1355.3  on 3423  degrees of freedom
AIC: 1359.3

Number of Fisher Scoring iterations: 6


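- <span style="color:blue"> The z-test p-value for $\beta_{\rm{glucose}}$ is very small (3.6e-15), so glucose level is a significant predictor of stroke </span>
- <span style="color:blue"> $\beta_{\rm{glucose}}=0.01 > 0$: the probability of stroke increases with glucose level </span>
- <span style="color:blue"> The blue line in the figure below visualizes this model. The probability of stroke rises with glucose level, but the curve is flatter than the one for age, which suggests that glucose has a weaker effect on stroke than age. </span>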

```{image} ./images/nume_model_glucose.png
:alt: nume_model_glucose.png
:class: bg-primary mb-1
:width: 800px
:align: center
```
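
Because age and glucose are measured in different units, their raw slopes are not directly comparable. One rough way to compare them is to look at how much the fitted probability rises across a plausible range of each variable. The sketch below does this with the coefficients from the two summaries above; the ranges (age 40 to 80, glucose 80 to 240) are illustrative assumptions, not values taken from the analysis.

```r
# Coefficients copied from the two simple-model summaries above
b_age <- c(intercept = -7.386079, slope = 0.076109)
b_glc <- c(intercept = -4.109552, slope = 0.010116)

# Fitted probability for a simple logistic model at value(s) x
prob <- function(b, x) plogis(b[["intercept"]] + b[["slope"]] * x)

# Illustrative ranges (assumptions, chosen only for this comparison)
age_range <- c(40, 80)
glc_range <- c(80, 240)

# Rise in fitted stroke probability across each range
rise_age <- diff(prob(b_age, age_range))
rise_glc <- diff(prob(b_glc, glc_range))

print(round(c(age = rise_age, glucose = rise_glc), 3))
```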

### BMI (Body Mass Index)
$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{bmi}}x_{\rm{bmi}}
$$

In [9]:
model_bmi = glm(stroke ~ bmi, data=strokedata, family=binomial(link="logit"))
summary(model_bmi)


Call:
glm(formula = stroke ~ bmi, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4039  -0.3322  -0.3266  -0.3221   2.4638  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.103070   0.321963  -9.638   <2e-16 ***
bmi          0.006932   0.010210   0.679    0.497    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1410.4  on 3423  degrees of freedom
AIC: 1414.4

Number of Fisher Scoring iterations: 5


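- <span style="color:blue"> $\beta_{\rm{bmi}}=0.007 > 0$: the estimated probability of stroke rises with BMI </span>
- <span style="color:blue"> The z-test p-value for $\beta_{\rm{bmi}}$ is large (0.497), so BMI is not a significant predictor of stroke; we therefore leave it out of the final model </span>
- <span style="color:blue"> The blue line in the figure below visualizes this model. The probability of stroke increases only very slowly with BMI </span>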

```{image} ./images/nume_model_bmi
:alt: nume_model_bmi
:class: bg-primary mb-1
:width: 800px
:align: center
```
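
As a cross-check on the z-tests, the three simple models can also be compared with the null model through the drop in deviance (a likelihood-ratio test with 1 degree of freedom). This is a sketch that uses only the deviances printed in the summaries above; it is an addition to the original analysis, not a step the notebook ran.

```r
# Deviances copied from the three summaries above
null_dev  <- 1410.9
resid_dev <- c(age = 1180.3, avg_glucose_level = 1355.3, bmi = 1410.4)

# Likelihood-ratio statistic: drop in deviance when the variable is added
lr_stat <- null_dev - resid_dev

# Upper-tail chi-square p-value; each model adds one parameter
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(data.frame(variable = names(resid_dev),
                 lr_stat  = round(as.numeric(lr_stat), 1),
                 p_value  = signif(as.numeric(p_value), 3)))
```

The ordering agrees with the z-tests: age gives by far the largest drop in deviance, glucose a smaller one, and BMI almost none.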

## Final Model
Based on the simple models above, we choose the following as the final model:
$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}
$$

In [11]:
final_model = glm(stroke ~ age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(final_model)


Call:
glm(formula = stroke ~ age + avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9513  -0.3432  -0.1975  -0.1068   3.1971  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.845280   0.434300  -18.06  < 2e-16 ***
age                0.072699   0.006177   11.77  < 2e-16 ***
avg_glucose_level  0.005443   0.001311    4.15 3.32e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1163.8  on 3422  degrees of freedom
AIC: 1169.8

Number of Fisher Scoring iterations: 7
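
The following sketch shows how the final model could be used to score an individual, and checks whether adding glucose improves on the age-only model. The coefficients and deviances are copied from the summaries above; the two example individuals and the helper `predict_stroke` are hypothetical and exist only for illustration.

```r
# Coefficients copied from the summary(final_model) output above
coefs <- c(intercept = -7.845280, age = 0.072699, avg_glucose_level = 0.005443)

# Hypothetical helper: fitted stroke probability for given age and glucose
predict_stroke <- function(age, glucose) {
  eta <- coefs[["intercept"]] +
    coefs[["age"]] * age +
    coefs[["avg_glucose_level"]] * glucose
  plogis(eta)
}

# Two illustrative 70-year-olds who differ only in glucose level
p_low  <- predict_stroke(age = 70, glucose = 100)
p_high <- predict_stroke(age = 70, glucose = 200)
print(round(c(glucose_100 = p_low, glucose_200 = p_high), 4))

# Nested comparison with the age-only model (residual deviance 1180.3)
lr_stat <- 1180.3 - 1163.8
p_lr    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(signif(p_lr, 3))
```

With the fitted object at hand, `predict(final_model, newdata = data.frame(age = 70, avg_glucose_level = 200), type = "response")` is the more direct route.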


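## Converting Numerical Variables to Categorical
- To allow a simple classification in the combined model, we convert the numerical variables age and glucose into categorical ones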

### Age

#### Data Processing

In [13]:
strokedata$age_group <- with(strokedata, ifelse(
    age < 50, '0-49', ifelse(
    age >= 50 & age < 60, '50-59', ifelse(
    age >= 60 & age < 70, '60-69', '>=70'))))
strokedata$age_group <- factor(strokedata$age_group, levels= c('0-49', '50-59', '60-69', '>=70'))
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,age_group
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1,60-69
80,0,1,Yes,Private,105.92,32.5,never smoked,1,>=70
49,0,0,Yes,Private,171.23,34.4,smokes,1,0-49
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1,>=70
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1,>=70
74,1,1,Yes,Private,70.09,27.4,never smoked,1,>=70
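
The same binning could also be written with `cut()`, which builds the ordered factor in one step. The sketch below is an alternative under that assumption and runs on a small hypothetical age vector (including the boundary values 50, 60, and 70) instead of `strokedata`, so that it is self-contained.

```r
# Toy ages (hypothetical), including the boundary values 50, 60 and 70
age <- c(49, 50, 59.9, 60, 69, 70, 81)

# Nested ifelse(), equivalent to the logic in the cell above
by_ifelse <- ifelse(age < 50, '0-49',
             ifelse(age < 60, '50-59',
             ifelse(age < 70, '60-69', '>=70')))

# cut() with right = FALSE builds the left-closed bins [50, 60), [60, 70), ...
by_cut <- cut(age,
              breaks = c(-Inf, 50, 60, 70, Inf),
              labels = c('0-49', '50-59', '60-69', '>=70'),
              right  = FALSE)

print(table(by_cut))

# Both approaches assign every age to the same group
stopifnot(identical(as.character(by_cut), by_ifelse))
```

`cut()` returns a factor whose levels follow the order of `labels`, so the separate `factor(..., levels = ...)` call is not needed in this variant.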


#### Model: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{50-59}}x_{\rm{50-59}} + \beta_{\rm{60-69}}x_{\rm{60-69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}}$  
- Baseline group:  0-49

In [15]:
model_age_cate = glm(stroke ~ age_group, data=strokedata, family=binomial(link="logit"))
summary(model_age_cate)


Call:
glm(formula = stroke ~ age_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6114  -0.3870  -0.1327  -0.1327   3.0781  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.7286     0.2593 -18.234  < 2e-16 ***
age_group50-59   1.7841     0.3149   5.666 1.46e-08 ***
age_group60-69   2.1747     0.3131   6.945 3.78e-12 ***
age_group>=70    3.1463     0.2823  11.146  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1205.1  on 3422  degrees of freedom
AIC: 1213.1

Number of Fisher Scoring iterations: 7
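
Since each coefficient is a log odds ratio relative to the baseline group 0-49, exponentiating makes the table easier to interpret. This sketch uses only the estimates printed above; the baseline is given a coefficient of 0 by construction.

```r
# Estimates copied from the summary(model_age_cate) output above
intercept <- -4.7286
b <- c("0-49" = 0, "50-59" = 1.7841, "60-69" = 2.1747, ">=70" = 3.1463)

# Odds ratio versus the 0-49 baseline, and fitted probability per group
res <- data.frame(
  age_group  = names(b),
  odds_ratio = round(exp(b), 2),
  p_stroke   = round(plogis(intercept + b), 4),
  row.names  = NULL
)
print(res)
```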


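#### Comparison: Numerical vs. Categorical
![age_nume_cate_model](./images/age_nume_cate_model.bgwhite.png)

- As the figure shows, the numerical model still fits better (AIC 1184.3 vs. 1213.1 for the categorical model)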

### Average Glucose Level

#### Data Processing

In [19]:
strokedata$glc_group <- with(strokedata, ifelse(avg_glucose_level < 160, '<160', '>=160'))
strokedata$glc_group <- factor(strokedata$glc_group, levels= c('<160', '>=160'))
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,age_group,glc_group
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1,60-69,>=160
80,0,1,Yes,Private,105.92,32.5,never smoked,1,>=70,<160
49,0,0,Yes,Private,171.23,34.4,smokes,1,0-49,>=160
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1,>=70,>=160
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1,>=70,>=160
74,1,1,Yes,Private,70.09,27.4,never smoked,1,>=70,<160


#### Model: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$
- Baseline group: glc<160

In [21]:
model_glc_cate = glm(stroke ~ glc_group, data=strokedata, family=binomial(link="logit"))
summary(model_glc_cate)


Call:
glm(formula = stroke ~ glc_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5344  -0.2799  -0.2799  -0.2799   2.5531  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.21995    0.09636 -33.416   <2e-16 ***
glc_group>=160  1.34588    0.16201   8.307   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1350.5  on 3424  degrees of freedom
AIC: 1354.5

Number of Fisher Scoring iterations: 6
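
For a two-level factor, the single coefficient translates directly into one odds ratio. The sketch below computes it together with an approximate 95% Wald confidence interval from the estimate and standard error printed above; the Wald interval is an assumption made here for simplicity and is not something the original analysis reports.

```r
# Estimate and standard error copied from the summary(model_glc_cate) output above
est <- 1.34588
se  <- 0.16201

# Odds ratio of stroke for glucose >= 160 versus < 160
or <- exp(est)

# Approximate 95% Wald interval, built on the log-odds scale
z  <- qnorm(0.975)
ci <- exp(est + c(-1, 1) * z * se)

print(round(c(odds_ratio = or, lower = ci[1], upper = ci[2]), 2))
```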


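#### Comparison: Numerical vs. Categorical
![glucose_nume_cate_model](./images/glucose_nume_cate_model.bgwhite.png)

- Surprisingly, the categorical avg_glucose_level model fits slightly better than the numerical one (AIC 1354.5 vs. 1359.3)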

### Final Selection

$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{50-59}}x_{\rm{50-59}} + \beta_{\rm{60-69}}x_{\rm{60-69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}} + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$

In [22]:
model_cate_final = glm(stroke ~ age_group + glc_group, data=strokedata, family=binomial(link="logit"))
summary(model_cate_final)


Call:
glm(formula = stroke ~ age_group + glc_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7608  -0.3466  -0.1283  -0.1283   3.0998  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.7961     0.2602 -18.430  < 2e-16 ***
age_group50-59   1.6730     0.3167   5.283 1.27e-07 ***
age_group60-69   2.0137     0.3165   6.363 1.98e-10 ***
age_group>=70    2.9499     0.2868  10.286  < 2e-16 ***
glc_group>=160   0.7546     0.1700   4.439 9.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1186.4  on 3421  degrees of freedom
AIC: 1196.4

Number of Fisher Scoring iterations: 7


- Compared with the purely numerical final model (AIC = 1169.8), this categorical model (AIC = 1196.4) still fits worse
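
To see all the fits side by side, the sketch below collects the residual deviance and AIC printed in each summary and ranks the models. For 0/1 responses such as these, the reported AIC equals the residual deviance plus twice the number of coefficients, which the code checks. One caveat: the categorical fits report 3425 null degrees of freedom against 3424 for the numerical ones, so they may have been fitted to a slightly different data frame, and the AIC ranking should be read as approximate.

```r
# Values copied from the seven model summaries in this section
models <- data.frame(
  model     = c("age", "avg_glucose_level", "bmi", "age + avg_glucose_level",
                "age_group", "glc_group", "age_group + glc_group"),
  resid_dev = c(1180.3, 1355.3, 1410.4, 1163.8, 1205.1, 1350.5, 1186.4),
  k         = c(2, 2, 2, 3, 4, 2, 5),   # number of coefficients incl. intercept
  aic       = c(1184.3, 1359.3, 1414.4, 1169.8, 1213.1, 1354.5, 1196.4)
)

# Consistency check: AIC = residual deviance + 2k for these binary-response fits
stopifnot(all(abs(models$resid_dev + 2 * models$k - models$aic) < 1e-6))

# Rank from best (lowest AIC) to worst
ranked <- models[order(models$aic), ]
print(ranked, row.names = FALSE)
```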