# Regression Analysis: Combined Models

## Loading and Cleaning the Data
- Data cleaning: rows matching either condition are dropped
   - BMI is null
   - Smoking status is unknown
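
The cleaning step itself is not shown in this notebook. The following is only a minimal sketch of how such a filter could look. It assumes that missing BMI values are read in as `NA` and that an unknown smoking status is stored as the string `"Unknown"`; the toy data frame `raw` is hypothetical and stands in for the raw file.

```r
# Hypothetical toy data standing in for the raw stroke file (assumption).
raw <- data.frame(
  bmi            = c(36.6, NA, 32.5, 28.1),
  smoking_status = c("formerly smoked", "never smoked", "Unknown", "smokes"),
  stroke         = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Keep only rows with a recorded BMI and a known smoking status.
clean <- raw[!is.na(raw$bmi) & raw$smoking_status != "Unknown", ]
nrow(clean)
```

Filtering with a single logical index keeps the remaining columns untouched, so the result can be written out with `write.csv()` and loaded as in the cell below.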

In [1]:
strokedata <- read.csv(file = '../data/healthcare-dataset-stroke-data-cleanbmi.csv')
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1
80,0,1,Yes,Private,105.92,32.5,never smoked,1
49,0,0,Yes,Private,171.23,34.4,smokes,1
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1
74,1,1,Yes,Private,70.09,27.4,never smoked,1


## Model 1: Categorical + Numerical

### Full Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$

In [2]:
model1_full = glm(stroke ~ hypertension + heart_disease + ever_married + age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model1_full)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married + 
    age + avg_glucose_level, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2017  -0.3344  -0.1915  -0.1090   3.2073  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.527041   0.470374 -16.002  < 2e-16 ***
hypertension       0.567135   0.181510   3.125 0.001781 ** 
heart_disease      0.449092   0.216934   2.070 0.038435 *  
ever_marriedYes   -0.152803   0.260974  -0.586 0.558205    
age                0.068193   0.006360  10.722  < 2e-16 ***
avg_glucose_level  0.004732   0.001336   3.541 0.000398 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1149.6  on 3420  degrees of freedom
AIC: 1161.6

Number of Fisher Scoring iterations: 

- The z-test p-value for $\hat{\beta}_{\rm{marry}}$ is large (0.558), so we cannot reject the following $H_0$
  - $H_0: \beta_{\rm{marry}} = 0$
- The next section uses the deviance to decide whether to drop $x_{\rm{marry}}$
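
As a quick check, the z value and p-value reported for `ever_marriedYes` can be reproduced by hand from the estimate and its standard error. The numbers below are copied from the `summary(model1_full)` output; the variable names are illustrative only.

```r
# Values copied from the summary(model1_full) output for ever_marriedYes.
estimate  <- -0.152803
std_error <-  0.260974

# Wald statistic and two-sided p-value under H0: beta_marry = 0.
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))
c(z = z_value, p = p_value)
```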

### Reduced Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$

In [3]:
model1_reduce = glm(stroke ~ hypertension + heart_disease + age + avg_glucose_level, data=strokedata, family=binomial(link="logit"))
summary(model1_reduce)


Call:
glm(formula = stroke ~ hypertension + heart_disease + age + avg_glucose_level, 
    family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1420  -0.3357  -0.1927  -0.1072   3.1976  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.632810   0.439480 -17.368  < 2e-16 ***
hypertension       0.568379   0.181386   3.134 0.001727 ** 
heart_disease      0.453704   0.216660   2.094 0.036253 *  
age                0.067773   0.006359  10.659  < 2e-16 ***
avg_glucose_level  0.004701   0.001334   3.524 0.000426 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1149.9  on 3421  degrees of freedom
AIC: 1159.9

Number of Fisher Scoring iterations: 7


### Likelihood-Ratio Test
- Comparing the Full Model with the Reduced Model amounts to testing $H_0: \beta_{\rm{marry}} = 0$

In [4]:
anova(model1_reduce, model1_full, test="LRT")

Resid. Df,Resid. Dev,Df,Deviance,Pr(>Chi)
3421,1149.914,,,
3420,1149.58,1.0,0.3338298,0.5634126


- The deviance-based (likelihood-ratio) test gives p-value = 0.5634
    - We cannot reject $H_0$, so the final model drops the marital-status variable
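
The test statistic behind `anova(..., test="LRT")` is the drop in residual deviance, compared against a chi-squared distribution whose degrees of freedom equal the number of dropped parameters. A small sketch that redoes the calculation from the printed table (values are rounded as shown there, so the result agrees only to about three decimals):

```r
# Residual deviances and degrees of freedom from the anova() table above.
dev_reduced <- 1149.914; df_reduced <- 3421
dev_full    <- 1149.580; df_full    <- 3420

lr_stat <- dev_reduced - dev_full   # likelihood-ratio statistic
lr_df   <- df_reduced - df_full     # number of dropped parameters

# Upper-tail chi-squared probability gives the p-value.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```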

### Final Selection: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age}}x_{\rm{age}} + \beta_{\rm{glucose}}x_{\rm{glucose}}$
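
To illustrate how the selected model is read, the sketch below plugs the first row of `head(strokedata)` into the fitted equation and converts the log-odds into a probability. Coefficients are copied from `summary(model1_reduce)`; the vector names are illustrative, and in practice `predict(model1_reduce, type = "response")` would do the same job.

```r
# Coefficients copied from summary(model1_reduce).
beta <- c(intercept = -7.632810, hypertension = 0.568379,
          heart_disease = 0.453704, age = 0.067773,
          avg_glucose_level = 0.004701)

# First row of head(strokedata): no hypertension, heart disease present,
# age 67, average glucose 228.69. The leading 1 multiplies the intercept.
x <- c(1, 0, 1, 67, 228.69)

log_odds <- sum(beta * x)
prob     <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(beta[-1])
prob
```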

## Model 2: Categorical

### Data Processing

In [5]:
strokedata$age_greater_50 <- ifelse(strokedata$age>50, 1, 0)
strokedata$age_greater_60 <- ifelse(strokedata$age>60, 1, 0)
strokedata$age_greater_70 <- ifelse(strokedata$age>70, 1, 0)
strokedata$age_greater_80 <- ifelse(strokedata$age>80, 1, 0)
strokedata$glc_greater_80 <- ifelse(strokedata$avg_glucose_level>80, 1, 0)
strokedata$glc_greater_110 <- ifelse(strokedata$avg_glucose_level>110, 1, 0)
strokedata$glc_greater_160 <- ifelse(strokedata$avg_glucose_level>160, 1, 0)
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,age_greater_50,age_greater_60,age_greater_70,age_greater_80,glc_greater_80,glc_greater_110,glc_greater_160
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1,1,1,0,0,1,1,1
80,0,1,Yes,Private,105.92,32.5,never smoked,1,1,1,1,0,1,0,0
49,0,0,Yes,Private,171.23,34.4,smokes,1,0,0,0,0,1,1,1
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1,1,1,1,0,1,1,1
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1,1,1,1,1,1,1,1
74,1,1,Yes,Private,70.09,27.4,never smoked,1,1,1,1,0,0,0,0
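
The seven indicator columns follow one pattern, so they could also be generated in a loop. This is only a sketch: the helper `add_threshold_flags` and the toy data frame are hypothetical, and `as.integer(x > thr)` is used in place of `ifelse(x > thr, 1, 0)`, which yields the same 0/1 coding.

```r
# Hypothetical helper: add one 0/1 column per threshold,
# named <prefix>_greater_<threshold>.
add_threshold_flags <- function(df, column, prefix, thresholds) {
  for (thr in thresholds) {
    df[[paste0(prefix, "_greater_", thr)]] <- as.integer(df[[column]] > thr)
  }
  df
}

# Toy rows taken from head(strokedata).
toy <- data.frame(age = c(67, 80, 49, 81),
                  avg_glucose_level = c(228.69, 105.92, 171.23, 186.21))
toy <- add_threshold_flags(toy, "age", "age", c(50, 60, 70, 80))
toy <- add_threshold_flags(toy, "avg_glucose_level", "glc", c(80, 110, 160))
toy
```

Because the comparison is strict (`>`), a value exactly at a threshold, such as age 80, is coded 0 for that indicator, matching the output above.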


### Full Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}+ \beta_{\rm{age>50}}x_{\rm{age>50}} + \beta_{\rm{age>70}}x_{\rm{age>70}}  + \beta_{\rm{glc>160}}x_{\rm{glc>160}}$

In [6]:
model2_full = glm(stroke ~ hypertension + heart_disease + ever_married + age_greater_50 + age_greater_70 + glc_greater_160, data=strokedata, 
                  family=binomial(link="logit"))
summary(model2_full)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married + 
    age_greater_50 + age_greater_70 + glc_greater_160, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0434  -0.3032  -0.1367  -0.1367   3.0692  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.70078    0.28990 -16.215  < 2e-16 ***
hypertension     0.59381    0.18186   3.265 0.001094 ** 
heart_disease    0.54279    0.21589   2.514 0.011931 *  
ever_marriedYes  0.03222    0.26356   0.122 0.902692    
age_greater_50   1.61172    0.28415   5.672 1.41e-08 ***
age_greater_70   0.94652    0.17330   5.462 4.71e-08 ***
glc_greater_160  0.65007    0.17385   3.739 0.000185 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1180.9  on 3419  degrees of freedom

- The z-test p-value for $\hat{\beta}_{\rm{marry}}$ is large (0.903), so we cannot reject the following $H_0$
  - $H_0: \beta_{\rm{marry}} = 0$
- The next section uses the deviance to decide whether to drop $x_{\rm{marry}}$

### Reduced Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age>50}}x_{\rm{age>50}} + \beta_{\rm{age>70}}x_{\rm{age>70}}  + \beta_{\rm{glc>160}}x_{\rm{glc>160}}$

In [7]:
model2_reduce = glm(stroke ~ hypertension + heart_disease + age_greater_50 + age_greater_70 + glc_greater_160, data=strokedata, family=binomial(link="logit"))
summary(model2_reduce)


Call:
glm(formula = stroke ~ hypertension + heart_disease + age_greater_50 + 
    age_greater_70 + glc_greater_160, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0421  -0.3029  -0.1359  -0.1359   3.0627  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -4.6807     0.2382 -19.654  < 2e-16 ***
hypertension      0.5935     0.1819   3.264 0.001100 ** 
heart_disease     0.5422     0.2158   2.512 0.012005 *  
age_greater_50    1.6216     0.2727   5.945 2.76e-09 ***
age_greater_70    0.9458     0.1732   5.461 4.74e-08 ***
glc_greater_160   0.6507     0.1738   3.744 0.000181 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1180.9  on 3420  degrees of freedom
AIC: 1192.9

Number of Fisher Scoring iterations: 7


### Likelihood-Ratio Test
- Comparing the Full Model with the Reduced Model amounts to testing $H_0: \beta_{\rm{marry}} = 0$

In [8]:
anova(model2_reduce, model2_full, test="LRT")

Resid. Df,Resid. Dev,Df,Deviance,Pr(>Chi)
3420,1180.918,,,
3419,1180.903,1.0,0.01502657,0.9024374


- The deviance-based (likelihood-ratio) test gives p-value = 0.9024
    - We cannot reject $H_0$, so the final model drops the marital-status variable

### Final Selection: 
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{age>50}}x_{\rm{age>50}} + \beta_{\rm{age>70}}x_{\rm{age>70}}  + \beta_{\rm{glc>160}}x_{\rm{glc>160}}$
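
Because the age indicators are cumulative (anyone older than 70 is also older than 50), their coefficients add up on the log-odds scale. A brief sketch of the implied odds ratios relative to people aged 50 or younger, using the estimates from `summary(model2_reduce)` and holding the other predictors fixed:

```r
# Coefficients copied from summary(model2_reduce).
b_age50 <- 1.6216
b_age70 <- 0.9458

# Someone older than 70 has both indicators equal to 1, so the effects add
# on the log-odds scale and multiply on the odds scale.
or_50_to_70 <- exp(b_age50)             # age in (50, 70] vs. age <= 50
or_over_70  <- exp(b_age50 + b_age70)   # age > 70 vs. age <= 50
c(or_50_to_70 = or_50_to_70, or_over_70 = or_over_70)
```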

```{image} ./images/regr_cate_model_final_plot.png
:alt: regr_cate_model_final_plot
:class: bg-primary mb-1
:width: 800px
:align: center
```