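# Regression Analysis: Categorical Variables
- The earlier contingency-table analysis found no association between stroke and either gender or residence type, so both variables are dropped up front.
- We start by fitting a separate logistic regression for each remaining categorical variable.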

## Loading and Cleaning the Data
- Data cleaning removes rows where:
   - BMI is null
   - Smoking status is unknown
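The cleaning step itself is not shown in this notebook, which loads an already cleaned file. A minimal sketch of how such a filter could look is given here on a toy data frame; the codes `"N/A"` and `"Unknown"` for missing values are assumptions about the raw file, and the object names `raw` and `clean` are hypothetical.

```r
# Toy stand-in for the raw file; the "N/A" and "Unknown" codes are assumptions.
raw <- data.frame(
  bmi            = c("36.6", "N/A", "32.5", "34.4"),
  smoking_status = c("formerly smoked", "never smoked", "Unknown", "smokes"),
  stroke         = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only rows with a recorded BMI and a known smoking status.
keep  <- raw$bmi != "N/A" & raw$smoking_status != "Unknown"
clean <- raw[keep, ]

# BMI was read as text because of the "N/A" entries, so convert it to numeric.
clean$bmi <- as.numeric(clean$bmi)

nrow(clean)
```

Only the first and last toy rows pass both conditions, so two rows remain.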

In [1]:
strokedata <- read.csv(file = '../data/healthcare-dataset-stroke-data-cleanbmi.csv')
head(strokedata)

age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke
67,0,1,Yes,Private,228.69,36.6,formerly smoked,1
80,0,1,Yes,Private,105.92,32.5,never smoked,1
49,0,0,Yes,Private,171.23,34.4,smokes,1
79,1,0,Yes,Self-employed,174.12,24.0,never smoked,1
81,0,0,Yes,Private,186.21,29.0,formerly smoked,1
74,1,1,Yes,Private,70.09,27.4,never smoked,1


- The cleaned data contain 3,426 observations.

In [2]:
nrow(strokedata)

## Simple Models: One Explanatory Variable Each

### Hypertension
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}$

In [3]:
model_hypert = glm(stroke ~ hypertension, data=strokedata, family=binomial(link="logit"))
summary(model_hypert)


Call:
glm(formula = stroke ~ hypertension, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5486  -0.2885  -0.2885  -0.2885   2.5299  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.15856    0.09206 -34.309  < 2e-16 ***
hypertension  1.34082    0.16991   7.892 2.99e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1358.2  on 3424  degrees of freedom
AIC: 1362.2

Number of Fisher Scoring iterations: 6
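To read the hypertension coefficient on a more familiar scale, it can be converted to an odds ratio and to fitted probabilities. The sketch below only re-uses the estimates printed in the summary above; the variable names are illustrative, and the interval is an approximate Wald interval rather than the profile-likelihood interval that `confint()` would report.

```r
# Estimates and standard error copied from the summary output above.
beta_0      <- -3.15856
beta_hypert <- 1.34082
se_hypert   <- 0.16991

# Odds ratio and an approximate 95% Wald interval on the odds-ratio scale.
odds_ratio <- exp(beta_hypert)
wald_ci    <- exp(beta_hypert + c(-1, 1) * qnorm(0.975) * se_hypert)

# Fitted stroke probabilities without and with hypertension.
p_without <- plogis(beta_0)
p_with    <- plogis(beta_0 + beta_hypert)

round(c(odds_ratio = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 2)
round(c(p_without = p_without, p_with = p_with), 3)
```

The odds of stroke are roughly 3.8 times higher with hypertension, which corresponds to a fitted probability of about 14% versus about 4%. The same conversion applies to the heart-disease and ever-married models.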


### Heart-disease
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{heartd}}x_{\rm{heartd}}$

In [4]:
model_heartd = glm(stroke ~ heart_disease, data=strokedata, family=binomial(link="logit"))
summary(model_heartd)


Call:
glm(formula = stroke ~ heart_disease, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6198  -0.3025  -0.3025  -0.3025   2.4929  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.06157    0.08525  -35.91  < 2e-16 ***
heart_disease  1.50929    0.20231    7.46 8.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1367.3  on 3424  degrees of freedom
AIC: 1371.3

Number of Fisher Scoring iterations: 5


### Ever-Married
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{marry}}x_{\rm{marry}}$

In [5]:
model_marry = glm(stroke ~ ever_married, data=strokedata, family=binomial(link="logit"))
summary(model_marry)


Call:
glm(formula = stroke ~ ever_married, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3565  -0.3565  -0.3565  -0.2213   2.7284  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.6976     0.2264 -16.335  < 2e-16 ***
ever_marriedYes   0.9734     0.2406   4.045 5.22e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1390.4  on 3424  degrees of freedom
AIC: 1394.4

Number of Fisher Scoring iterations: 6


### Work-type
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{GovJob}}x_{\rm{GovJob}} + \beta_{\rm{Neverworked}}x_{\rm{Neverworked}} + \beta_{\rm{Private}}x_{\rm{Private}} + \beta_{\rm{Selfemployed}}x_{\rm{Selfemployed}}$
- Baseline group: children

In [6]:
model_worktype = glm(stroke ~ work_type, data=strokedata, family=binomial(link="logit"))
summary(model_worktype)


Call:
glm(formula = stroke ~ work_type, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3985  -0.3187  -0.3187  -0.3187   2.4927  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)            -1.757e+01  4.798e+02  -0.037    0.971
work_typeGovt_job       1.451e+01  4.798e+02   0.030    0.976
work_typeNever_worked   4.643e-09  1.161e+03   0.000    1.000
work_typePrivate        1.461e+01  4.798e+02   0.030    0.976
work_typeSelf-employed  1.507e+01  4.798e+02   0.031    0.975

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1394.8  on 3421  degrees of freedom
AIC: 1404.8

Number of Fisher Scoring iterations: 16


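- The Wald z-test p-values are all large, so taken individually they fail to reject any of the following null hypotheses:  
$H_0: \beta_{\rm{GovJob}} = 0$   
$H_0: \beta_{\rm{Neverworked}} = 0$  
$H_0: \beta_{\rm{Private}} = 0$  
$H_0: \beta_{\rm{Selfemployed}} = 0$  
- These Wald tests should be read with caution. The intercept near $-17.6$, the standard errors near 480, and the 16 Fisher scoring iterations suggest that the baseline group (children) contains no stroke cases, and Wald tests are unreliable in that situation. The deviance still drops by $1411.0 - 1394.8 = 16.2$ on 4 degrees of freedom, which exceeds the 1% chi-square critical value of 13.28.
- On the basis of the Wald tests, work type is not carried into the models below; given the caveat above, this should not be read as evidence that work type is unrelated to stroke.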
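As a cross-check that does not depend on the Wald tests, the work-type model can be compared with the intercept-only model through the deviances printed above. This is a sketch under the usual chi-square approximation for the likelihood-ratio statistic, using only numbers from the summary; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from the summary output above.
null_dev  <- 1411.0; null_df  <- 3425
resid_dev <- 1394.8; resid_df <- 3421

# Likelihood-ratio statistic for H0: all four work_type coefficients are zero.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Fitted stroke probability in the baseline group implied by the intercept.
p_children <- plogis(-17.57)

c(lr_stat = lr_stat, df = lr_df, p_value = p_value, p_children = p_children)
```

The fitted baseline probability is numerically zero, which is the usual signature of a group with no events, and the likelihood-ratio p-value falls below 1%. Together these suggest that the large Wald p-values reflect the inflated standard errors, not an absence of signal.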

### Smoking Status
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{Neversmoke}}x_{\rm{Neversmoke}} + \beta_{\rm{smoke}}x_{\rm{smoke}}$
- Baseline group: 'formerly smoked'

In [7]:
model_smoke = glm(stroke ~ smoking_status, data=strokedata, family=binomial(link="logit"))
summary(model_smoke)


Call:
glm(formula = stroke ~ smoking_status, family = binomial(link = "logit"), 
    data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3756  -0.3297  -0.3047  -0.3047   2.4872  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -2.6162     0.1372 -19.068   <2e-16 ***
smoking_statusnever smoked  -0.4305     0.1769  -2.434   0.0149 *  
smoking_statussmokes        -0.2684     0.2142  -1.253   0.2102    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1405.3  on 3423  degrees of freedom
AIC: 1411.3

Number of Fisher Scoring iterations: 5


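### Section Summary
- The z-tests on $\beta$ in the simple models show that:
   - Hypertension, heart disease, and marital status are each significantly associated with stroke.
   - Smoking status is less clearly significant:
       - never smoked: p-value = 0.0149 < 5%, so this group differs significantly from the formerly smoked baseline and still carries some information about stroke.
       - smokes: p-value = 0.21, so this group cannot be distinguished from the formerly smoked baseline.
- To decide whether smoking status belongs in the final model, the next part carries out model selection.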
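The five simple models can also be put side by side using only the numbers printed in their summaries. The sketch below ranks them by AIC and adds an overall test for smoking status; it relies on the fact that for ungrouped binary data the AIC equals the residual deviance plus twice the number of coefficients, which matches every AIC printed above. The data frame `simple_models` is a hypothetical helper, not part of the original analysis.

```r
# Residual deviance and number of coefficients, copied from the five summaries.
simple_models <- data.frame(
  variable  = c("hypertension", "heart_disease", "ever_married",
                "work_type", "smoking_status"),
  resid_dev = c(1358.2, 1367.3, 1390.4, 1394.8, 1405.3),
  n_coef    = c(2, 2, 2, 5, 3)
)

# AIC for ungrouped binary data: residual deviance + 2 * number of coefficients.
simple_models$aic <- simple_models$resid_dev + 2 * simple_models$n_coef
simple_models[order(simple_models$aic), ]

# Overall likelihood-ratio test for smoking status (2 coefficients at once).
p_smoke <- pchisq(1411.0 - 1405.3, df = 2, lower.tail = FALSE)
p_smoke
```

Taken as a whole, smoking status sits just above the 5% level in its simple model, which is consistent with treating it as a borderline predictor.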

## Model Selection: With/Without Smoking Status
- Full Model:
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}} + \beta_{\rm{Neversmoke}}x_{\rm{Neversmoke}} + \beta_{\rm{smoke}}x_{\rm{smoke}}$

- Reduced Model: $\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}$

In [8]:
model_full = glm(stroke ~ hypertension + heart_disease + ever_married + smoking_status, data=strokedata, family=binomial(link="logit"))
summary(model_full)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married + 
    smoking_status, family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9456  -0.3194  -0.2799  -0.2226   2.8192  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -3.6853     0.2682 -13.742  < 2e-16 ***
hypertension                 1.1382     0.1748   6.510 7.49e-11 ***
heart_disease                1.2389     0.2099   5.902 3.58e-09 ***
ever_marriedYes              0.7351     0.2451   3.000   0.0027 ** 
smoking_statusnever smoked  -0.2698     0.1824  -1.479   0.1390    
smoking_statussmokes        -0.1920     0.2196  -0.874   0.3820    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411.0  on 3425  degrees of freedom
Residual deviance: 1311.8  on 3420  degrees of freedom
AIC: 132

In [9]:
model_reduce = glm(stroke ~ hypertension + heart_disease + ever_married, data=strokedata, family=binomial(link="logit"))
summary(model_reduce)


Call:
glm(formula = stroke ~ hypertension + heart_disease + ever_married, 
    family = binomial(link = "logit"), data = strokedata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8899  -0.2930  -0.2930  -0.2012   2.7965  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.8899     0.2308 -16.858  < 2e-16 ***
hypertension      1.1374     0.1746   6.513 7.34e-11 ***
heart_disease     1.2675     0.2085   6.080 1.20e-09 ***
ever_marriedYes   0.7631     0.2443   3.124  0.00178 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1411  on 3425  degrees of freedom
Residual deviance: 1314  on 3422  degrees of freedom
AIC: 1322

Number of Fisher Scoring iterations: 6


### Likelihood-Ratio Test
- The difference between the full and reduced models corresponds to testing $H_0: \beta_{\rm{Neversmoke}} = \beta_{\rm{smoke}}= 0$.

In [10]:
anova(model_reduce, model_full, test="LRT")

Resid. Df,Resid. Dev,Df,Deviance,Pr(>Chi)
3422,1313.986,,,
3420,1311.82,2.0,2.16591,0.3385936


- The deviance test gives a p-value of 0.3386.
    - We cannot reject $H_0$, so the final model drops the smoking variable.
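The p-value in the `anova()` table can be reproduced by hand from the two residual deviances, and the same numbers give an AIC comparison. This is a sketch that only re-uses values from the table above; the AIC formula (residual deviance plus twice the number of coefficients) is the one that holds for ungrouped binary data, and the variable names are illustrative.

```r
# Residual deviances and residual degrees of freedom from the anova() table.
dev_reduced <- 1313.986; df_reduced <- 3422
dev_full    <- 1311.820; df_full    <- 3420

# Likelihood-ratio test of H0: both smoking_status coefficients are zero.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC = residual deviance + 2 * number of coefficients (4 reduced, 6 full).
aic_reduced <- dev_reduced + 2 * 4
aic_full    <- dev_full + 2 * 6

round(c(lr_stat = lr_stat, p_value = p_value,
        aic_reduced = aic_reduced, aic_full = aic_full), 3)
```

Both criteria point the same way: the likelihood-ratio test does not reject $H_0$, and the reduced model has the lower AIC.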

## Final Model
$\log{\left(\frac{P[\rm{stroke}=1]}{P[\rm{stroke}=0]}\right)} = \beta_0 + \beta_{\rm{hypert}}x_{\rm{hypert}}+ \beta_{\rm{heartd}}x_{\rm{heartd}}+ \beta_{\rm{marry}}x_{\rm{marry}}$

### Interpreting the Model

```{image} ./cate_model_full_plot.png
:alt: glucose_stroke
:class: bg-primary mb-1
:width: 800px
:align: center
```

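- Hypertension, heart disease, and having ever been married are each associated with a higher probability of stroke.
   - The grey point on the far right, where all three conditions are present, has the highest stroke probability.
   - The blue point on the far left, where none of the three is present, has the lowest.
- When $x_{\rm{hypert}}+ x_{\rm{heartd}}+ x_{\rm{marry}}=1$ or $x_{\rm{hypert}}+ x_{\rm{heartd}}+ x_{\rm{marry}}=2$:
   - Marital status has a smaller effect than hypertension or heart disease, in line with its smaller coefficient (0.7631 versus 1.1374 and 1.2675).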
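The fitted probabilities behind this interpretation can be tabulated directly from the coefficients of the reduced model. The sketch below enumerates all eight combinations of the three indicators; the coefficient values are copied from the summary of `model_reduce`, while the names `coefs` and `grid` are illustrative.

```r
# Coefficients copied from the summary of the reduced (final) model.
coefs <- c(intercept = -3.8899, hypert = 1.1374, heartd = 1.2675, marry = 0.7631)

# All eight combinations of the three 0/1 indicators.
grid <- expand.grid(hypert = 0:1, heartd = 0:1, marry = 0:1)

# Linear predictor (log-odds) and fitted stroke probability per combination.
grid$log_odds <- coefs[["intercept"]] +
  coefs[["hypert"]] * grid$hypert +
  coefs[["heartd"]] * grid$heartd +
  coefs[["marry"]]  * grid$marry
grid$prob <- plogis(grid$log_odds)

grid[order(grid$prob), ]
```

With a fitted model object available, `predict(model_reduce, newdata = ..., type = "response")` would give the same probabilities without copying coefficients by hand.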