# Regression Analysis: Combined Models

## Loading and Cleaning the Data
- Data cleaning covers two cases
   - BMI is Null
   - Smoking status is unknown
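
The two cleaning rules above could be applied along the following lines. This is only a sketch: the toy data frame, the `"N/A"` marker for a missing BMI, and the `"Unknown"` smoking label are assumptions made for illustration. They are not taken from whatever script actually produced `healthcare-dataset-stroke-data-cleanbmi.csv`.

```r
# Toy stand-in for the raw data (hypothetical values and encodings).
raw <- data.frame(
  age            = c(67, 61, 80, 49),
  bmi            = c("36.6", "N/A", "32.5", "34.4"),
  smoking_status = c("formerly smoked", "never smoked", "never smoked", "Unknown"),
  stringsAsFactors = FALSE
)

# Keep only rows with a recorded BMI and a known smoking status.
keep  <- raw$bmi != "N/A" & raw$smoking_status != "Unknown"
clean <- raw[keep, ]

# BMI was read as text because of the "N/A" marker; convert it to numeric.
clean$bmi <- as.numeric(clean$bmi)
print(clean)
```

Filtering before the numeric conversion avoids the coercion warning that `as.numeric("N/A")` would otherwise raise.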

In [1]:
strokedata <- read.csv(file = '../data/healthcare-dataset-stroke-data-cleanbmi.csv')
head(strokedata)

| age | hypertension | heart_disease | ever_married | work_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|
| 67 | 0 | 1 | Yes | Private | 228.69 | 36.6 | formerly smoked | 1 |
| 80 | 0 | 1 | Yes | Private | 105.92 | 32.5 | never smoked | 1 |
| 49 | 0 | 0 | Yes | Private | 171.23 | 34.4 | smokes | 1 |
| 79 | 1 | 0 | Yes | Self-employed | 174.12 | 24.0 | never smoked | 1 |
| 81 | 0 | 0 | Yes | Private | 186.21 | 29.0 | formerly smoked | 1 |
| 74 | 1 | 1 | Yes | Private | 70.09 | 27.4 | never smoked | 1 |


## Model 1: Categorical + Numerical

### Full Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$

In [2]:
model1_full = glm(stroke ~ hypertension + heart_disease + ever_married + age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model1_full)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married + 
    age + avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2017  -0.3344  -0.1915  -0.1090   3.2073  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.527041   0.470374 -16.002  < 2e-16 ***
hypertension       0.567135   0.181510   3.125 0.001781 ** 
heart_disease      0.449092   0.216934   2.070 0.038435 *  
ever_marriedYes   -0.152803   0.260974  -0.586 0.558205    
age                0.068193   0.006360  10.722  < 2e-16 ***
avg_glucose_level  0.004732   0.001336   3.541 0.000398 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1149.6  on 3420  degrees of freedom
AIC: 1161.6

Number of Fisher Scoring iterations: 

- The z-test p-value for $\hat{\beta}_{\rm{marry}}$ is large (0.558), so we cannot reject the following $H_0$
  - $H_0: \beta_{\rm{marry}} = 0$
- The next part uses the deviance to decide whether to drop $x_{\rm{marry}}$
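
As a quick check of where that p-value comes from, the Wald z statistic is the estimate divided by its standard error, and the two-sided p-value is the normal tail area beyond $|z|$. The sketch below recomputes both from the numbers printed in the `ever_marriedYes` row of the summary; the variable names are illustrative only.

```r
# Estimate and standard error of ever_marriedYes, copied from the summary above.
beta_hat <- -0.152803
se       <-  0.260974

# Wald statistic and two-sided p-value under H0: beta_marry = 0.
z <- beta_hat / se
p <- 2 * pnorm(-abs(z))

cat(sprintf("z = %.3f, p = %.4f\n", z, p))
```

The result should agree with the `z value` and `Pr(>|z|)` columns up to rounding of the printed inputs.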

### Reduced Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$

In [3]:
model1_reduce = glm(stroke ~ hypertension + heart_disease + age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model1_reduce)


Call:
glm(formula = stroke ~ hypertension + heart_disease + age + avg_glucose_level, 
    family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1420  -0.3357  -0.1927  -0.1072   3.1976  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.632810   0.439480 -17.368  < 2e-16 ***
hypertension       0.568379   0.181386   3.134 0.001727 ** 
heart_disease      0.453704   0.216660   2.094 0.036253 *  
age                0.067773   0.006359  10.659  < 2e-16 ***
avg_glucose_level  0.004701   0.001334   3.524 0.000426 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1149.9  on 3421  degrees of freedom
AIC: 1159.9

Number of Fisher Scoring iterations: 7


### Likelihood-Ratio Test
- The difference between the Full Model and the Reduced Model corresponds to testing $H_0: \beta_{\rm{marry}} = 0$

In [4]:
anova(model1_reduce, model1_full, test="LRT")

| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 3421 | 1149.914 | | | |
| 3420 | 1149.58 | 1.0 | 0.3338298 | 0.5634126 |


- The deviance-based p-value is 0.5634
    - We cannot reject $H_0$, so the final model drops the marital-status variable
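
The likelihood-ratio statistic reported by `anova()` is the drop in residual deviance between the two nested models, compared against a $\chi^2$ distribution whose degrees of freedom equal the number of dropped parameters (here 1). A minimal sketch that recomputes the test from the rounded values printed in the table, with illustrative variable names:

```r
# Residual deviances and degrees of freedom from the anova() table above.
dev_reduced <- 1149.914; df_reduced <- 3421
dev_full    <- 1149.580; df_full    <- 3420

# LRT statistic: drop in deviance when ever_married is added back.
lrt_stat <- dev_reduced - dev_full
lrt_df   <- df_reduced - df_full

# Upper-tail chi-square probability.
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
cat(sprintf("statistic = %.3f on %d df, p = %.4f\n", lrt_stat, lrt_df, p_value))
```

Because the inputs are rounded, the statistic and p-value match the `Deviance` and `Pr(>Chi)` columns only to about three decimal places.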

### Final Selection: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$
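
To make the selected formula concrete, the fitted log-odds can be converted to a probability with the inverse logit. The sketch below plugs in the reduced-model coefficients printed earlier and, purely as an illustration, the predictor values from the first row of `head(strokedata)`. The vector names are made up for this example; in practice `predict(model1_reduce, type = "response")` would do the same computation on the fitted object.

```r
# Coefficients of the reduced Model 1, copied from its summary output.
beta <- c(intercept = -7.632810, hypert = 0.568379, heartd = 0.453704,
          age = 0.067773, glucose = 0.004701)

# Illustrative profile: first row of head(strokedata).
x <- c(intercept = 1, hypert = 0, heartd = 1, age = 67, glucose = 228.69)

log_odds <- sum(beta * x)        # linear predictor
prob     <- plogis(log_odds)     # inverse logit: 1 / (1 + exp(-log_odds))

cat(sprintf("log-odds = %.3f, P[stroke = 1] = %.3f\n", log_odds, prob))
```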

## Model 2: Categorical

### Data Processing

In [5]:
strokedata$age_group <- with(strokedata, ifelse(
    age < 50, '0-49', ifelse(
    age >= 50 & age < 60, '50-59', ifelse(
    age >= 60 & age < 70, '60-69', '>=70'))))
strokedata$age_group <- factor(strokedata$age_group, levels= c('0-49', '50-59', '60-69', '>=70'))
strokedata$glc_group <- with(strokedata, ifelse(avg_glucose_level < 160, '<160', '>=160'))
strokedata$glc_group <- factor(strokedata$glc_group, levels= c('<160', '>=160'))
head(strokedata)

| age | hypertension | heart_disease | ever_married | work_type | avg_glucose_level | bmi | smoking_status | stroke | age_group | glc_group |
|---|---|---|---|---|---|---|---|---|---|---|
| 67 | 0 | 1 | Yes | Private | 228.69 | 36.6 | formerly smoked | 1 | 60-69 | >=160 |
| 80 | 0 | 1 | Yes | Private | 105.92 | 32.5 | never smoked | 1 | >=70 | <160 |
| 49 | 0 | 0 | Yes | Private | 171.23 | 34.4 | smokes | 1 | 0-49 | >=160 |
| 79 | 1 | 0 | Yes | Self-employed | 174.12 | 24.0 | never smoked | 1 | >=70 | >=160 |
| 81 | 0 | 0 | Yes | Private | 186.21 | 29.0 | formerly smoked | 1 | >=70 | >=160 |
| 74 | 1 | 1 | Yes | Private | 70.09 | 27.4 | never smoked | 1 | >=70 | <160 |
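
The same binning can also be written with base R's `cut()`, which returns an ordered set of factor levels in one step. This is an alternative sketch, not the code used in the notebook; the toy vectors below are made up to exercise the boundary values, and `right = FALSE` is needed so that the intervals are closed on the left, matching the `age >= 50 & age < 60` style conditions.

```r
# Toy ages and glucose levels, including values right at the cut points.
age <- c(49, 50, 59.5, 60, 69, 70, 80)
glc <- c(105.92, 159.99, 160, 228.69)

# Nested ifelse(), equivalent to the conditions used in the notebook cell.
age_ifelse <- ifelse(age < 50, '0-49',
              ifelse(age < 60, '50-59',
              ifelse(age < 70, '60-69', '>=70')))

# cut() with right = FALSE builds left-closed intervals [a, b).
age_cut <- cut(age, breaks = c(-Inf, 50, 60, 70, Inf), right = FALSE,
               labels = c('0-49', '50-59', '60-69', '>=70'))
glc_cut <- cut(glc, breaks = c(-Inf, 160, Inf), right = FALSE,
               labels = c('<160', '>=160'))

print(data.frame(age, age_ifelse, age_cut))
print(table(glc_cut))
```

With `cut()` the first label becomes the reference level automatically, so the separate `factor(..., levels = ...)` call is not needed.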


### Full Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{50<=age<=59}}~x_{\rm{50<=age<=59}}\\ + \beta_{\rm{60<=age<=69}}~x_{\rm{60<=age<=69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}} + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$

In [6]:
model2_full = glm(stroke ~ hypertension + heart_disease + ever_married + age_group + glc_group, data=strokedata, 
                  family=binomial(link="logit"))
summary(model2_full)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married + 
    age_group + glc_group, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0323  -0.3198  -0.1633  -0.1264   3.1093  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.80277    0.30513 -15.740  < 2e-16 ***
hypertension     0.55933    0.18171   3.078 0.002082 ** 
heart_disease    0.51525    0.21533   2.393 0.016718 *  
ever_marriedYes -0.02324    0.26447  -0.088 0.929972    
age_group50-59   1.56936    0.33027   4.752 2.02e-06 ***
age_group60-69   1.87867    0.33125   5.671 1.42e-08 ***
age_group>=70    2.73247    0.30564   8.940  < 2e-16 ***
glc_group>=160   0.64429    0.17358   3.712 0.000206 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual devia

- The z-test p-value for $\hat{\beta}_{\rm{marry}}$ is large (0.930), so we cannot reject the following $H_0$
  - $H_0: \beta_{\rm{marry}} = 0$
- The next part uses the deviance to decide whether to drop $x_{\rm{marry}}$

### Reduced Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{50<=age<=59}}~x_{\rm{50<=age<=59}}\\ + \beta_{\rm{60<=age<=69}}~x_{\rm{60<=age<=69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}} + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$

In [7]:
model2_reduce = glm(stroke ~ hypertension + heart_disease +  age_group + glc_group, data=strokedata, family=binomial(link="logit"))
summary(model2_reduce)


Call:
glm(formula = stroke ~ hypertension + heart_disease + age_group + 
    glc_group, family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0240  -0.3201  -0.1641  -0.1270   3.1064  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.8168     0.2603 -18.503  < 2e-16 ***
hypertension     0.5596     0.1817   3.080 0.002071 ** 
heart_disease    0.5158     0.2152   2.397 0.016537 *  
age_group50-59   1.5619     0.3191   4.895 9.83e-07 ***
age_group60-69   1.8712     0.3200   5.847 5.01e-09 ***
age_group>=70    2.7253     0.2943   9.259  < 2e-16 ***
glc_group>=160   0.6439     0.1735   3.711 0.000207 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1171.7  on 3419  degrees of freedom
AIC: 1185.7

Number of Fisher Scoring iterat

### Likelihood-Ratio Test
- The difference between the Full Model and the Reduced Model corresponds to testing $H_0: \beta_{\rm{marry}} = 0$

In [8]:
anova(model2_reduce, model2_full, test="LRT")

| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 3419 | 1171.707 | | | |
| 3418 | 1171.699 | 1.0 | 0.007693608 | 0.9301046 |


- The deviance-based p-value is 0.9301
    - We cannot reject $H_0$, so the final model drops the marital-status variable

### Final Selection: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{50<=age<=59}}~x_{\rm{50<=age<=59}}\\ + \beta_{\rm{60<=age<=69}}~x_{\rm{60<=age<=69}} + \beta_{\rm{age>=70}}x_{\rm{age>=70}} + \beta_{\rm{glc>=160}}x_{\rm{glc>=160}}$

```{image} ./images/regr_cate_model_final_plot.png
:alt: regr_cate_model_final_plot
:class: bg-primary mb-1
:width: 800px
:align: center
```
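
Because every predictor in this model is a 0/1 indicator, each coefficient is easiest to read as an odds ratio: $e^{\beta}$ is the multiplicative change in the odds of stroke relative to the reference level, with the other predictors held fixed. A short sketch using the reduced Model 2 coefficients printed in its summary (the vector `beta` is typed in by hand here; with the fitted object available, `exp(coef(model2_reduce))` gives the same quantities):

```r
# Coefficients of the reduced Model 2, copied from its summary output.
beta <- c("hypertension"   = 0.5596,
          "heart_disease"  = 0.5158,
          "age_group50-59" = 1.5619,
          "age_group60-69" = 1.8712,
          "age_group>=70"  = 2.7253,
          "glc_group>=160" = 0.6439)

# exp(beta) is the odds ratio relative to the reference level
# (no hypertension, no heart disease, age 0-49, glucose < 160).
odds_ratio <- exp(beta)
print(round(odds_ratio, 2))
```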