# Example questions and the answers obtained from the statistical analysis with a GLM (explanations of the answers follow below)

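`Question`: Is there a statistically significant dependence between `gender` & `heart attack risk`?
- `Answer`: Yes. The coefficient for males is negative and significant ($p = 0.001$): with the other predictors held fixed, the odds of a high risk (`target = 1`) for males are about $0.16$ times those for females in this dataset.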

`Question`: Is there a statistically significant dependence between `age` and `heart attack risk`?
- `Answer`: Intuitively, yes. Empirically, the data do not let us decide: the coefficient for `age` is not significantly different from zero ($p = 0.274$), so there may or may not be a dependence.

Similar questions can be answered reasonably well once we know how to interpret the output of such a statistical model.

`Hint`: A very useful tutorial on how to interpret GLM outputs can be found at http://connor-johnson.com/2014/02/18/linear-regression-with-python/

In [None]:
import statsmodels.api as sm           
import statsmodels.formula.api as smf 
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

# Define the logistic regression model (binomial GLM with a logit link)
fit = smf.glm(
    formula='target ~ age + C(sex) + C(cp) + trestbps + chol + C(fbs) + C(restecg) + thalach + C(exang) + oldpeak + C(slope) + C(ca) + C(thal)',
    data=df,
    # Logit is the default link of the Binomial family; it is stated explicitly here
    family=sm.families.Binomial(link=sm.families.links.Logit()),
).fit()
fit.summary()

In [None]:
# 95% confidence intervals for the predicted probabilities
ci = fit.get_prediction(df).summary_frame(alpha=0.05)
ci = pd.concat([df, ci], axis=1)
ci = ci.sort_values(by="mean")
index = np.arange(len(ci))

# Shade the band between the lower and upper confidence limits
plt.fill_between(index, ci["mean_ci_lower"], ci["mean_ci_upper"], alpha=0.5, color="red", label="95% confidence interval")
plt.plot(index, ci["mean"], label="Logistic prediction")
plt.xlabel("Subject (sorted by predicted probability)")
plt.ylabel("Probability of having a high risk of heart attack")
plt.legend()
plt.show()

# Interpretation of the regression analysis

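The model is a logistic regression. Each coefficient is the change in the log odds of `target = 1` (a log odds ratio) associated with that predictor, with the other predictors held fixed. Exponentiating a coefficient gives an odds ratio. Odds themselves, for example those computed from the full linear predictor of one subject, can be converted to a probability with the formula

$$P = \cfrac{\text{odds}}{1 + \text{odds}}$$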
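
A small sketch may help show how this formula interacts with a coefficient. It uses only the Python standard library; the helper name `log_odds_to_probability` and the baseline log odds are made up for illustration, and only the `C(sex)[T.1]` coefficient comes from the model summary. The same shift in log odds produces a different change in probability depending on the baseline.

```python
import math

def log_odds_to_probability(log_odds):
    """Convert log odds to a probability via P = odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

# Coefficient of C(sex)[T.1] as reported in the model summary
coef_sex = -1.8623

# Hypothetical baseline log odds (illustrative values, not taken from the data)
for baseline in (-1.0, 0.0, 2.0):
    p_ref = log_odds_to_probability(baseline)
    p_shifted = log_odds_to_probability(baseline + coef_sex)
    print(f"baseline P = {p_ref:.3f} -> shifted P = {p_shifted:.3f} "
          f"(difference {p_shifted - p_ref:+.3f})")
```

Because the probability difference changes with the baseline, a single coefficient is usually reported as an odds ratio rather than as a fixed percentage.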


## Question 1 Is there a statistically significant dependence between `gender` & `heart attack risk`?

```	
                  coef      std err     z         P>|z|   [0.025 0.975]
C(sex)[T.1]      -1.8623    0.571      -3.262    0.001   -2.981 -0.743
```

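How to interpret this result? The column `coef` has a value of `-1.8623`: with the other predictors held fixed, being male (`sex = 1`) changes the log odds of a high risk (`target = 1`) by `-1.8623` relative to being female. Exponentiating gives an odds ratio of $e^{-1.8623} \approx 0.155$, i.e., the odds for men are about $0.155$ times the odds for women in this dataset. This is a ratio of odds, not a difference in probabilities; the difference in probability depends on the values of the other predictors. The standard error of the coefficient is `std err = 0.571`.

The test statistic for this coefficient can be found in column `z = -3.262`, yielding a significant p value of `P>|z| = 0.001`.

The $0.95$ confidence interval of the coefficient is $[-2.981, -0.743]$ on the log-odds scale, or approximately $[0.051, 0.476]$ as an odds ratio. The interval excludes $0$ (equivalently, the odds-ratio interval excludes $1$), which is consistent with the significant p value.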
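
The numbers in this table row can be checked by hand. The following sketch, with illustrative variable names, recomputes the z statistic as `coef / std err`, derives a two-sided p value under the normal approximation, and exponentiates the coefficient and its interval bounds to express them as odds ratios. The inputs are the rounded values printed in the table, so small rounding differences are expected.

```python
import math

# Values copied from the C(sex)[T.1] row of the summary table
coef, std_err = -1.8623, 0.571
ci_low, ci_high = -2.981, -0.743   # 95% interval on the log-odds scale

# Wald z statistic and two-sided p value from the standard normal distribution
z = coef / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))

# Exponentiate to move from log odds ratios to odds ratios
odds_ratio = math.exp(coef)
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"z = {z:.3f}, p = {p_value:.4f}")
print(f"odds ratio = {odds_ratio:.3f}, 95% CI = [{or_ci[0]:.3f}, {or_ci[1]:.3f}]")
```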


## Question 2 Is there a dependence between `age` and `heart attack risk`? 

```
                  coef      std err     z         P>|z|   [0.025 0.975]
age               0.0278    0.025       1.094     0.274   -0.022 0.078
```

The coefficient in this case is $0.0278$ with a p value of $0.274$, which is not below the conventional threshold of $0.05$. We therefore cannot reject the null hypothesis that the coefficient is $0$; accordingly, the confidence interval $[-0.022, 0.078]$ includes $0$.
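
On the odds-ratio scale the same conclusion can be read off directly. The sketch below, again with illustrative variable names, exponentiates the `age` coefficient and its interval. The rescaling to a 10-year difference is an added illustration that assumes the effect is linear in the log odds, as the model specifies.

```python
import math

# Values copied from the age row of the summary table
coef_age = 0.0278
ci_low, ci_high = -0.022, 0.078    # 95% interval on the log-odds scale

# Odds ratio for a one-year increase in age
or_per_year = math.exp(coef_age)
ci_per_year = (math.exp(ci_low), math.exp(ci_high))

# Odds ratio for a ten-year increase (coefficient scales linearly in log odds)
or_per_decade = math.exp(10 * coef_age)
ci_per_decade = (math.exp(10 * ci_low), math.exp(10 * ci_high))

# An odds-ratio interval that contains 1 means "no effect" cannot be ruled out
includes_one = ci_per_year[0] < 1.0 < ci_per_year[1]

print(f"per year:   OR = {or_per_year:.3f}, CI = [{ci_per_year[0]:.3f}, {ci_per_year[1]:.3f}]")
print(f"per decade: OR = {or_per_decade:.3f}, CI = [{ci_per_decade[0]:.3f}, {ci_per_decade[1]:.3f}]")
print("interval includes 1:", includes_one)
```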