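# Chapter 11: Regression Analysis 3 - Combined Model
In this third part of the regression analysis, we combine the models from the previous two parts.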

## Loading the Data

In [1]:
strokedata <- read.csv(file = '../data/healthcare-dataset-stroke-data-cleanbmi.csv')
head(strokedata)

| age | hypertension | heart_disease | ever_married | work_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|
| 67 | 0 | 1 | Yes | Private | 228.69 | 36.6 | formerly smoked | 1 |
| 80 | 0 | 1 | Yes | Private | 105.92 | 32.5 | never smoked | 1 |
| 49 | 0 | 0 | Yes | Private | 171.23 | 34.4 | smokes | 1 |
| 79 | 1 | 0 | Yes | Self-employed | 174.12 | 24.0 | never smoked | 1 |
| 81 | 0 | 0 | Yes | Private | 186.21 | 29.0 | formerly smoked | 1 |
| 74 | 1 | 1 | Yes | Private | 70.09 | 27.4 | never smoked | 1 |


## Model 1: Categorical + Numerical

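- <span style="color:blue"> From the categorical analysis, we carry over hypertension, heart disease, and marital status </span>
- <span style="color:blue"> From the numerical analysis, we carry over age and blood glucose</span>  
- <span style="color:blue"> We therefore propose the following model</span>  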

$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}
$$

In [3]:
model1 = glm(stroke ~ hypertension + heart_disease + ever_married + age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model1)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married + 
    age + avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2018  -0.3344  -0.1915  -0.1090   3.2072  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.526184   0.470428 -15.999  < 2e-16 ***
hypertension       0.567110   0.181509   3.124 0.001782 ** 
heart_disease      0.449095   0.216933   2.070 0.038433 *  
ever_marriedYes   -0.153133   0.260976  -0.587 0.557358    
age                0.068184   0.006361  10.720  < 2e-16 ***
avg_glucose_level  0.004733   0.001336   3.542 0.000398 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1149.6  on 3419  degrees of freedom
AIC: 1161.6

Number of Fisher Scoring iterations: 

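- <span style="color:blue"> The $\beta$ estimates for hypertension, heart disease, age, and blood glucose are all greater than zero, so each of these is a risk factor for stroke, consistent with the earlier analyses</span>
- <span style="color:red"> $\beta_{\rm{marry}}<0$ suggests that being married lowers the probability of stroke, which contradicts the earlier analysis. We believe age is the confounding factor behind marital status; this is discussed later in this chapter</span>
- <span style="color:blue"> In addition, the z-test p-value for $\beta_{\rm{marry}}$ is large (0.557), so we cannot reject $H_0: \beta_{\rm{marry}} = 0$</span>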
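
The coefficient signs above can be easier to read on the odds scale. The following is a minimal sketch that hardcodes the estimates and standard errors from the `summary(model1)` output instead of refitting the model. The names `beta`, `se`, and `or_table` are illustrative, and the intervals are simple Wald intervals, which is an assumption made here and not a method the analysis itself uses.

```r
# Estimates and standard errors copied from the summary(model1) output.
beta <- c(hypertension      = 0.567110,
          heart_disease     = 0.449095,
          ever_marriedYes   = -0.153133,
          age               = 0.068184,
          avg_glucose_level = 0.004733)
se <- c(0.181509, 0.216933, 0.260976, 0.006361, 0.001336)

# Odds ratio per one-unit increase, with a Wald 95% interval on the odds scale.
z <- qnorm(0.975)
odds_ratio <- exp(beta)
ci_lower <- exp(beta - z * se)
ci_upper <- exp(beta + z * se)
or_table <- round(cbind(odds_ratio, ci_lower, ci_upper), 3)
print(or_table)

# Age is measured in years, so a 10-year difference multiplies
# the odds by exp(10 * beta_age).
or_age_10 <- exp(10 * beta[["age"]])
print(round(or_age_10, 2))
```

With these numbers the odds ratio for hypertension is about 1.76, while the one for `ever_marriedYes` is about 0.86 with an interval that spans 1, which matches the non-significant z-test.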

## Model 2
In Model 1 we could not reject $H_0: \beta_{\rm{marry}} = 0$, so we propose the following model, which drops marital status. We then use the deviance to compare Model 1 and Model 2.

$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}
$$

In [4]:
model2 = glm(stroke ~ hypertension + heart_disease + age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model2)


Call:
glm(formula = stroke ~ hypertension + heart_disease + age + avg_glucose_level, 
    family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1420  -0.3358  -0.1927  -0.1074   3.1975  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.632239   0.439507 -17.365  < 2e-16 ***
hypertension       0.568359   0.181384   3.133 0.001728 ** 
heart_disease      0.453716   0.216659   2.094 0.036247 *  
age                0.067765   0.006359  10.656  < 2e-16 ***
avg_glucose_level  0.004701   0.001334   3.524 0.000425 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1149.9  on 3420  degrees of freedom
AIC: 1159.9

Number of Fisher Scoring iterations: 7


### Likelihood-Ratio Test
- The difference between the Full Model (Model 1) and the Reduced Model (Model 2) corresponds to the hypothesis $H_0: \beta_{\rm{marry}} = 0$

In [5]:
anova(model1, model2, test="LRT")

| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 3419 | 1149.567 | | | |
| 3420 | 1149.903 | -1.0 | -0.3352499 | 0.562584 |


- The deviance-based test gives a p-value of 0.5626
    - <span style="color:blue">We cannot reject $H_0$, so we choose Model 2 as the Final Model</span>
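
The likelihood-ratio p-value can also be reproduced by hand from the two residual deviances. This is a minimal sketch using the rounded deviances printed above, so the result agrees with the `anova()` output only up to rounding. The variable names are illustrative, and the AIC check relies on the fact that for 0/1 response data the residual deviance equals minus twice the log-likelihood.

```r
# Residual deviances and degrees of freedom copied from the output above.
dev_full    <- 1149.567  # Model 1 (with ever_married), 3419 df
dev_reduced <- 1149.903  # Model 2 (without ever_married), 3420 df
df_diff     <- 3420 - 3419

# Likelihood-ratio statistic: the increase in deviance when the term is dropped.
lr_stat <- dev_reduced - dev_full

# Under H0 the statistic is approximately chi-square with df_diff degrees
# of freedom, so the p-value is the upper-tail probability.
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(lr_stat = lr_stat, p_value = p_value))

# AIC = deviance + 2 * (number of estimated coefficients) for these fits.
aic_full    <- dev_full + 2 * 6
aic_reduced <- dev_reduced + 2 * 5
print(c(aic_full = aic_full, aic_reduced = aic_reduced))
```

The reduced model also has the lower AIC (1159.9 versus 1161.6 in the summaries), which points to the same choice as the test.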

## Final Model
$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}
$$

In [6]:
summary(model2)


Call:
glm(formula = stroke ~ hypertension + heart_disease + age + avg_glucose_level, 
    family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1420  -0.3358  -0.1927  -0.1074   3.1975  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.632239   0.439507 -17.365  < 2e-16 ***
hypertension       0.568359   0.181384   3.133 0.001728 ** 
heart_disease      0.453716   0.216659   2.094 0.036247 *  
age                0.067765   0.006359  10.656  < 2e-16 ***
avg_glucose_level  0.004701   0.001334   3.524 0.000425 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1149.9  on 3420  degrees of freedom
AIC: 1159.9

Number of Fisher Scoring iterations: 7
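
Reading the estimates off the summary above, the fitted model can be written out explicitly. The coefficients are rounded to three significant figures, and the hat notation for fitted quantities is introduced here for illustration only.

$$
\log{\left(\frac{\hat{P}[\rm{stroke}=1]}{\hat{P}[\rm{stroke}=0]}\right)} = -7.63 + 0.568\,x_{\rm{hypert}} + 0.454\,x_{\rm{heartd}} + 0.0678\,x_{\rm{age}} + 0.00470\,x_{\rm{glucose}}
$$

Writing the right-hand side as $\hat{\eta}$, the fitted probability follows by inverting the logit:

$$
\hat{P}[\rm{stroke}=1] = \frac{1}{1 + e^{-\hat{\eta}}}
$$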


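### Final Model: Visualization and Discussion
The four plots below visualize the Final Model. They correspond to four blood glucose levels (80, 110, 150, 200). In each plot the horizontal axis is age, and the four lines correspond to the different combinations of hypertension and heart disease. 
- <span style="color:blue">The effect of age: the probability of stroke starts to rise noticeably after about age 50-60</span>
- <span style="color:blue">Hypertension and heart disease have similar effects. For older people, having both (red line) versus having neither (blue line) changes the probability of stroke by about 0.2</span>
- <span style="color:blue">Comparing the four plots, the effect of blood glucose is smaller. For older people, a glucose level of 250 versus 80 changes the probability of stroke by about 0.1</span>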
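
The approximate gaps quoted in the list above can be checked numerically. The sketch below hardcodes the Final Model estimates from the `summary(model2)` output. The helper `predict_stroke` is a hypothetical name, and taking age 80 to represent "older people" is an assumption made for this example.

```r
# Final Model estimates copied from the summary(model2) output.
b <- c(intercept = -7.632239, hypertension = 0.568359,
       heart_disease = 0.453716, age = 0.067765,
       avg_glucose_level = 0.004701)

# Hypothetical helper: predicted stroke probability from the fitted log-odds.
predict_stroke <- function(age, glucose, hypertension = 0, heart_disease = 0) {
  eta <- b[["intercept"]] + b[["hypertension"]] * hypertension +
    b[["heart_disease"]] * heart_disease + b[["age"]] * age +
    b[["avg_glucose_level"]] * glucose
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

# Age effect at glucose 80 with neither condition.
ages <- c(40, 50, 60, 70, 80)
p_by_age <- predict_stroke(ages, glucose = 80)
print(round(setNames(p_by_age, ages), 3))

# Both conditions versus neither, at age 80 and glucose 80.
gap_conditions <- predict_stroke(80, 80, 1, 1) - predict_stroke(80, 80, 0, 0)

# Glucose 250 versus 80, at age 80 with neither condition.
gap_glucose <- predict_stroke(80, 250) - predict_stroke(80, 80)

print(round(c(gap_conditions = gap_conditions, gap_glucose = gap_glucose), 3))
```

At age 80 the two gaps come out at roughly 0.17 and 0.12, in line with the values of about 0.2 and 0.1 read off the plots.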

```{image} ./images/regr_final_model_glu_80.png
:alt: regr_final_model_glu_80
:class: bg-primary mb-1
:width: 800px
:align: center
```

```{image} ./images/regr_final_model_glu_110.png
:alt: regr_final_model_glu_110
:class: bg-primary mb-1
:width: 800px
:align: center
```

```{image} ./images/regr_final_model_glu_150.png
:alt: regr_final_model_glu_150
:class: bg-primary mb-1
:width: 800px
:align: center
```

```{image} ./images/regr_final_model_glu_200.png
:alt: regr_final_model_glu_200
:class: bg-primary mb-1
:width: 800px
:align: center
```

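## Discussion: Marital Status and Age
The simple categorical model indicated that married people have a higher probability of stroke, with high significance. Model 1 above, however, says that married people have a lower probability of stroke, and the significance drops sharply. Our intuition is that marital status carries information about age.  
<span style="color:blue">We therefore fit the following four models, each pairing marital status with one other variable, to see which variable makes $\beta_{\rm{marry}}$ change sign</span>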

$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{hypert}}x_{\rm{hypert}}
$$

In [12]:
model_age_hypert = glm(stroke ~ ever_married + hypertension, data=strokedata, family=binomial(link="logit"))
summary(model_age_hypert)


Call:
glm(formula = stroke ~ ever_married + hypertension, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5655  -0.3119  -0.3119  -0.2091   2.7691  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.8122     0.2283 -16.695  < 2e-16 ***
ever_marriedYes   0.8133     0.2431   3.345 0.000822 ***
hypertension      1.2465     0.1714   7.271 3.58e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1344.7  on 3422  degrees of freedom
AIC: 1350.7

Number of Fisher Scoring iterations: 6


$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}
$$

In [13]:
model_age_heartd = glm(stroke ~ ever_married + heart_disease, data=strokedata, family=binomial(link="logit"))
summary(model_age_heartd)


Call:
glm(formula = stroke ~ ever_married + heart_disease, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6412  -0.3273  -0.3273  -0.2130   2.7560  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.7752     0.2277 -16.579  < 2e-16 ***
ever_marriedYes   0.8751     0.2422   3.614 0.000302 ***
heart_disease     1.4225     0.2036   6.987  2.8e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1351.2  on 3422  degrees of freedom
AIC: 1357.2

Number of Fisher Scoring iterations: 6


$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{age}}x_{\rm{age}}
$$

In [14]:
model_age_marry = glm(stroke ~ ever_married + age, data=strokedata, family=binomial(link="logit"))
summary(model_age_marry)


Call:
glm(formula = stroke ~ ever_married + age, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7898  -0.3501  -0.2070  -0.1125   3.1579  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -7.27687    0.44943 -16.191   <2e-16 ***
ever_marriedYes -0.15000    0.25746  -0.583     0.56    
age              0.07648    0.00605  12.642   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1180.0  on 3422  degrees of freedom
AIC: 1186

Number of Fisher Scoring iterations: 7


$$
\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{marry}}x_{\rm{marry}} + \beta_{\rm{glucose}}x_{\rm{glucose}}
$$

In [15]:
model_age_glc = glm(stroke ~ ever_married + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model_age_glc)


Call:
glm(formula = stroke ~ ever_married + avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6736  -0.3292  -0.2946  -0.2381   2.8574  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -4.685927   0.270481 -17.324  < 2e-16 ***
ever_marriedYes    0.801892   0.243289   3.296 0.000981 ***
avg_glucose_level  0.009397   0.001294   7.263  3.8e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1410.9  on 3424  degrees of freedom
Residual deviance: 1342.3  on 3422  degrees of freedom
AIC: 1348.3

Number of Fisher Scoring iterations: 6


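- <span style="color:blue">As we guessed, it is age that makes $\beta_{\rm{marry}}$ change sign and lose significance</span>
- <span style="color:blue">So when the simple model considers only marital status, age is the confounding factor behind it</span>
   - <span style="color:blue">Married people tend to be older and are therefore more prone to stroke; the reverse holds for younger people.</span>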
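
The confounding mechanism described above can be illustrated on simulated data. This is a sketch under stated assumptions, not a reanalysis of the stroke dataset: the sample size, the seed, and every coefficient in the simulation are made up, and stroke risk is constructed to depend on age only.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n <- 5000
age <- runif(n, min = 18, max = 85)

# Assumption: the chance of ever having married rises with age.
married <- rbinom(n, size = 1, prob = plogis(-3 + 0.08 * age))

# Assumption: stroke risk depends on age only, not on marriage.
stroke <- rbinom(n, size = 1, prob = plogis(-7.5 + 0.07 * age))

sim <- data.frame(age, married, stroke)

# Married respondents are older on average in the simulated data.
print(aggregate(age ~ married, data = sim, FUN = mean))

# With marriage alone, the coefficient absorbs the age effect.
fit_marry <- glm(stroke ~ married, data = sim,
                 family = binomial(link = "logit"))

# Once age is included, marriage has no effect of its own left to explain.
fit_both <- glm(stroke ~ married + age, data = sim,
                family = binomial(link = "logit"))

print(round(coef(summary(fit_marry)), 3))
print(round(coef(summary(fit_both)), 3))
```

In this setup the marriage coefficient should be clearly positive in the first fit and close to zero in the second, the same pattern as in the models above. On the real data, a call such as `aggregate(age ~ ever_married, data = strokedata, FUN = mean)` would show whether married respondents are in fact older.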
   
<p style="page-break-before: always">