### In this lab exercise we will get to know regression analysis. We will learn how to carry out a descriptive regression analysis in Python using the statsmodels library.

### References:
- More information on the formula syntax: https://patsy.readthedocs.io/en/latest/formulas.html
- More information on statsmodels formulas: https://www.statsmodels.org/dev/example_formulas.html

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# set seed for consistency
np.random.seed(2)

# Data: a quick overview

We will use a dataset with information about patients with cardiovascular disease. The columns contain the following information.

Some variables are categorical:
- 'DEATH_EVENT': whether the patient died during the observation period
- 'sex': male/female (binary, male 1, female 0)
- 'anaemia': reduced red blood cell count (boolean, yes 1)
- 'smoking': whether the patient smokes (boolean, yes 1)
- 'diabetes': whether the patient has diabetes (boolean, yes 1)
- 'high_blood_pressure': whether the patient has high blood pressure (boolean, yes 1)


... and some are continuous/discrete:
- 'age': the patient's age
- 'creatinine_phosphokinase': level of the CPK enzyme in the blood (mcg/L)
- 'ejection_fraction': percentage of blood leaving the heart at each contraction
- 'platelets': platelets in the blood (kiloplatelets/mL)
- 'serum_creatinine': level of creatinine in the serum (mg/dL)
- 'serum_sodium': level of sodium in the serum (mEq/L)
- 'time': number of days in hospital

In [2]:
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
df

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


# Linear regression: Modeling days spent in hospital

- Using linear regression, we will model the number of days spent in hospital for a population of patients.


- To do that we need two components
   1. An equation describing the model
   2. The data
   

- Equations are specified using the [syntax](https://patsy.readthedocs.io/en/latest/formulas.html) of the patsy package:
    1. `~` : Separates the left-hand side from the right-hand side
    2. `+` : Forms the union of the predictors to be used
    3. `:` : Interaction term
    4. `*` : `a * b` is shorthand for `a + b + a:b`, and is often useful when we want to include all interactions among a set of variables
    
    
- Intercepts are added automatically
- Categorical variables are added using the term `C(a)`.
- For the data we can use pandas!
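
To see what the `*` operator expands to, one can build the design matrix directly with patsy. The following is a minimal sketch on a made-up toy DataFrame; the columns `a` and `b` are hypothetical and are not part of the heart-failure data.

```python
import pandas as pd
from patsy import dmatrix

# Toy data: two binary categorical variables (hypothetical, for illustration only)
toy = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})

# Only the right-hand side of a formula is needed to build a design matrix
short = dmatrix("C(a) * C(b)", toy)
long = dmatrix("C(a) + C(b) + C(a):C(b)", toy)

short_cols = short.design_info.column_names
long_cols = long.design_info.column_names

print(short_cols)               # includes an automatically added 'Intercept'
print(short_cols == long_cols)  # both formulas describe the same columns
```

The column names also show that the intercept is added without being requested, and that `C(...)` turns a 0/1 column into a dummy variable relative to a reference level.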


### Example

- Let's start with an example from our dataset. We are interested in two predictors: diabetes and high blood pressure. We want to use these two predictors to fit the outcome, the number of days spent in hospital, using linear regression.

- The model that does this is formulated as:
        time ~ C(diabetes) + C(high_blood_pressure)
        
- We can create this model with `smf.ols()`

- OLS stands for ordinary least squares

- The two components, the formula and the data, are stated explicitly.

- The terms in the formula are columns of the pandas DataFrame. Easy!

In [3]:
# Declares the model
mod = smf.ols(formula='time ~ C(diabetes) + C(high_blood_pressure)', data=df)

In [4]:
# Fits the model (finds the optimal coefficients; OLS has a closed-form solution, so the fit does not depend on a random seed)
res = mod.fit()

In [5]:
# Print the summary output provided by the library.
res.summary()

0,1,2,3
Dep. Variable:,time,R-squared:,0.04
Model:,OLS,Adj. R-squared:,0.033
Method:,Least Squares,F-statistic:,6.097
Date:,"Fri, 05 Nov 2021",Prob (F-statistic):,0.00254
Time:,11:11:45,Log-Likelihood:,-1718.9
No. Observations:,299,AIC:,3444.0
Df Residuals:,296,BIC:,3455.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,139.3851,6.658,20.934,0.000,126.282,152.489
C(diabetes)[T.1],4.9059,8.949,0.548,0.584,-12.706,22.518
C(high_blood_pressure)[T.1],-31.8228,9.247,-3.441,0.001,-50.021,-13.624

0,1,2,3
Omnibus:,159.508,Durbin-Watson:,0.076
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.166
Skew:,0.076,Prob(JB):,0.000114
Kurtosis:,1.802,Cond. No.,2.82


### A lot of useful information is provided by default.

- Dep. Variable: time (number of days in hospital)
- Method: the type of model that was fitted (OLS)
- No. Observations: the number of data points (299 patients)
- R-squared: the proportion of variance explained (0.04 here)
- The list of predictors
- For each predictor: the coefficient, the standard error of the coefficient, the p-value, and the 95% confidence interval. We see that only high blood pressure is a significant predictor (p = 0.001), while diabetes is not (p = 0.584).
- Warnings if there are numerical problems (hopefully not!)
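
The values listed above can also be read from the results object instead of the printed table. The sketch below uses synthetic stand-in data (the DataFrame `toy` and its effect sizes are invented, so the numbers will not match the table above); the attributes shown are part of the statsmodels results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the heart-failure dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "diabetes": rng.integers(0, 2, n),
    "high_blood_pressure": rng.integers(0, 2, n),
})
toy["time"] = 140 - 30 * toy["high_blood_pressure"] + rng.normal(0, 50, n)

toy_res = smf.ols("time ~ C(diabetes) + C(high_blood_pressure)", data=toy).fit()

print(int(toy_res.nobs))             # number of observations
print(toy_res.rsquared)              # proportion of variance explained
print(toy_res.params)                # coefficients, indexed by term name
print(toy_res.bse)                   # standard errors of the coefficients
print(toy_res.pvalues)               # p-values
print(toy_res.conf_int(alpha=0.05))  # 95% confidence intervals
```

Reading the values this way is convenient when results from several models need to be compared or plotted.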

### Now we can interpret the fitted model

$time = 139.3851 + 4.9059 \cdot diabetes - 31.8228 \cdot highBloodPressure$

- What is the expected time if the patient has neither diabetes nor high blood pressure?
- What if the patient does not have diabetes but does have high blood pressure?
- What if the patient has both?

1. People who have neither diabetes nor high blood pressure stay in hospital 139 days on average
2. People who have diabetes but not high blood pressure stay 139.4 + 4.9 ≈ 144 days
3. People who do not have diabetes but have high blood pressure stay 139.4 - 31.8 ≈ 108 days
4. People with both diabetes and high blood pressure stay 139.4 + 4.9 - 31.8 ≈ 112 days

- How else could we have computed this?
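
The four cases can also be worked out mechanically from the coefficients. This is a small sketch that hard-codes the values from the summary table above; the variable names are chosen here for illustration.

```python
from itertools import product

# Coefficients copied from the summary table above
intercept = 139.3851
b_diabetes = 4.9059
b_high_bp = -31.8228

# Expected days for every combination of the two binary predictors
predicted = {}
for diabetes, high_bp in product([0, 1], repeat=2):
    predicted[(diabetes, high_bp)] = (
        intercept + b_diabetes * diabetes + b_high_bp * high_bp
    )

for (diabetes, high_bp), days in predicted.items():
    print(f"diabetes={diabetes}, high_blood_pressure={high_bp}: {days:.1f} days")
```

With the fitted model at hand, passing a DataFrame with `diabetes` and `high_blood_pressure` columns to `res.predict(...)` should give the same numbers.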

In [6]:
selected = df.loc[(df['diabetes'] == 0) & (df["high_blood_pressure"] == 0)]

selected['time'].mean()

139.0
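
The group mean above is close to the intercept but not identical to it. One likely reason is that a model without an interaction term has only three parameters for four groups, so it cannot reproduce every group mean exactly, whereas a model with the interaction can. The sketch below illustrates this on synthetic data (the DataFrame `toy` and its effect sizes are invented for the illustration).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the heart-failure dataset)
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "diabetes": rng.integers(0, 2, n),
    "high_blood_pressure": rng.integers(0, 2, n),
})
toy["time"] = (
    140
    + 5 * toy["diabetes"]
    - 30 * toy["high_blood_pressure"]
    + 20 * toy["diabetes"] * toy["high_blood_pressure"]
    + rng.normal(0, 40, n)
)

# Mean of `time` in each of the four groups
group_means = toy.groupby(["diabetes", "high_blood_pressure"])["time"].mean()

additive = smf.ols("time ~ C(diabetes) + C(high_blood_pressure)", data=toy).fit()
saturated = smf.ols("time ~ C(diabetes) * C(high_blood_pressure)", data=toy).fit()

# One row per group, in the same order as group_means
cells = group_means.index.to_frame(index=False)
comparison = cells.assign(
    group_mean=group_means.to_numpy(),
    additive=np.asarray(additive.predict(cells)),
    saturated=np.asarray(saturated.predict(cells)),
)
print(comparison)
```

In this sketch the `saturated` column reproduces the group means, while the `additive` column only approximates them.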

- Isn't it strange that high blood pressure has a negative coefficient? It seems that patients with high blood pressure stay in hospital for fewer days, although we would expect the opposite. Does anyone know why that might be?

# Linear regression: Modeling days spent in hospital V2

- One reason why serious conditions can be associated with a shorter time spent in hospital is a third factor: death 💀. Patients with a serious condition might spend less time in hospital because they die.

- Let's get a better sense of what is going on by modeling the time spent in hospital with death as a predictor.

- This time we will add interaction features.

In [7]:
# we use a*b to add terms: a, b, a:b, and intercept

mod = smf.ols(formula='time ~ C(high_blood_pressure) * C(DEATH_EVENT,  Treatment(reference=0)) + C(diabetes)',
              data=df)

# C(DEATH_EVENT, Treatment(reference=0)) sets DEATH_EVENT == 0 (patients who did not die) as the reference level

res = mod.fit()
res.summary()

0,1,2,3
Dep. Variable:,time,R-squared:,0.303
Model:,OLS,Adj. R-squared:,0.293
Method:,Least Squares,F-statistic:,31.92
Date:,"Fri, 05 Nov 2021",Prob (F-statistic):,4.32e-22
Time:,11:11:45,Log-Likelihood:,-1671.0
No. Observations:,299,AIC:,3352.0
Df Residuals:,294,BIC:,3371.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,164.8348,6.476,25.452,0.000,152.089,177.581
C(high_blood_pressure)[T.1],-26.1462,9.781,-2.673,0.008,-45.395,-6.897
"C(DEATH_EVENT, Treatment(reference=0))[T.1]",-86.4520,10.286,-8.405,0.000,-106.696,-66.208
C(diabetes)[T.1],4.7903,7.655,0.626,0.532,-10.275,19.855
"C(high_blood_pressure)[T.1]:C(DEATH_EVENT, Treatment(reference=0))[T.1]",2.7778,16.725,0.166,0.868,-30.137,35.693

0,1,2,3
Omnibus:,34.161,Durbin-Watson:,0.484
Prob(Omnibus):,0.0,Jarque-Bera (JB):,11.463
Skew:,0.185,Prob(JB):,0.00324
Kurtosis:,2.115,Cond. No.,6.31


### Interpretation

- This model lets us see that death is associated with fewer days spent in hospital (about 86 fewer days).
- Notice how R-squared is much larger than in the previous model (0.303 vs. 0.04): more of the variance in the data is explained.
- Those who have high blood pressure stay a shorter time (26 fewer days on average). For those who have high blood pressure __and__ die, the interaction term adds about 2.8 days on top of the two separate effects, although this is not statistically significant (p = 0.868).
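
To make the role of the interaction term concrete, the expected number of days can be computed for each combination of high blood pressure and death. This sketch hard-codes the coefficients from the summary table above and assumes a patient without diabetes; the function name `expected_days` is introduced here for illustration.

```python
# Coefficients copied from the summary table above (diabetes held at 0)
intercept = 164.8348
b_high_bp = -26.1462
b_death = -86.4520
b_interaction = 2.7778


def expected_days(high_bp, death):
    """Expected days in hospital for a patient without diabetes."""
    return (
        intercept
        + b_high_bp * high_bp
        + b_death * death
        + b_interaction * high_bp * death  # only active when both are 1
    )


for high_bp in (0, 1):
    for death in (0, 1):
        print(
            f"high_blood_pressure={high_bp}, DEATH_EVENT={death}: "
            f"{expected_days(high_bp, death):.1f} days"
        )
```

The interaction coefficient only enters the last case, where both indicators equal 1.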

# Standardization

In [8]:
formula = 'time ~ age + C(high_blood_pressure)'
mod = smf.ols(formula=formula, data=df)
res = mod.fit()
res.summary()


0,1,2,3
Dep. Variable:,time,R-squared:,0.081
Model:,OLS,Adj. R-squared:,0.075
Method:,Least Squares,F-statistic:,13.1
Date:,"Fri, 05 Nov 2021",Prob (F-statistic):,3.55e-06
Time:,11:11:45,Log-Likelihood:,-1712.3
No. Observations:,299,AIC:,3431.0
Df Residuals:,296,BIC:,3442.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,222.7404,22.559,9.874,0.000,178.343,267.137
C(high_blood_pressure)[T.1],-28.7444,9.083,-3.165,0.002,-46.620,-10.869
age,-1.3543,0.365,-3.709,0.000,-2.073,-0.636

0,1,2,3
Omnibus:,99.161,Durbin-Watson:,0.158
Prob(Omnibus):,0.0,Jarque-Bera (JB):,15.705
Skew:,0.063,Prob(JB):,0.000389
Kurtosis:,1.884,Cond. No.,324.0


What can we conclude from the intercept? 

> that patients who do not have high blood pressure and are zero years old stay in hospital ~223 days, which is an extrapolation with no practical meaning

In [9]:
# how we standardize the continuous variables
columns_to_standardize = [
    "age",
    "creatinine_phosphokinase",
    "ejection_fraction",
    "platelets",
    "serum_creatinine",
    "serum_sodium"
]

for col in columns_to_standardize:
    df[col] = (df[col] - df[col].mean()) / df[col].std()  # standardize column


In [10]:
formula = 'time ~ age + C(high_blood_pressure)'
mod = smf.ols(formula=formula, data=df)
res = mod.fit()
res.summary()


0,1,2,3
Dep. Variable:,time,R-squared:,0.081
Model:,OLS,Adj. R-squared:,0.075
Method:,Least Squares,F-statistic:,13.1
Date:,"Fri, 05 Nov 2021",Prob (F-statistic):,3.55e-06
Time:,11:11:45,Log-Likelihood:,-1712.3
No. Observations:,299,AIC:,3431.0
Df Residuals:,296,BIC:,3442.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,140.3550,5.367,26.150,0.000,129.792,150.918
C(high_blood_pressure)[T.1],-28.7444,9.083,-3.165,0.002,-46.620,-10.869
age,-16.1088,4.343,-3.709,0.000,-24.656,-7.562

0,1,2,3
Omnibus:,99.161,Durbin-Watson:,0.158
Prob(Omnibus):,0.0,Jarque-Bera (JB):,15.705
Skew:,0.063,Prob(JB):,0.000389
Kurtosis:,1.884,Cond. No.,2.43


What can we now conclude from the intercept? 

> that patients who do not have high blood pressure and are of average age stay in hospital ~140 days
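
Standardizing a predictor does not change the fit (R-squared, t-values and p-values are identical in the two tables above); it only rescales the slope by the standard deviation and moves the intercept to the mean. In the tables above, -16.1088 / -1.3543 ≈ 11.9 and (222.7404 - 140.3550) / 1.3543 ≈ 60.8, which would correspond to the standard deviation and the mean of `age`. The sketch below checks the same relationship on synthetic data with a single predictor; the data and the helper `fit_line` are invented for the illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the heart-failure dataset)
rng = np.random.default_rng(2)
age = rng.uniform(40, 95, size=300)
time = 220 - 1.3 * age + rng.normal(0, 60, size=300)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]


a_raw, b_raw = fit_line(age, time)

mean, sd = age.mean(), age.std(ddof=1)  # ddof=1 matches the pandas default for .std()
age_std = (age - mean) / sd
a_std, b_std = fit_line(age_std, time)

# Slope is rescaled by the standard deviation
print(np.isclose(b_std, b_raw * sd))
# Intercept becomes the prediction at the mean age
print(np.isclose(a_std, a_raw + b_raw * mean))
```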

# Log transformation

In [11]:
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')

df["logtime"] = np.log(df["time"])
mod = smf.ols(formula='logtime ~ C(high_blood_pressure) * C(DEATH_EVENT,  Treatment(reference=0)) + C(diabetes)',
              data=df)
res = mod.fit()
res.summary()


0,1,2,3
Dep. Variable:,logtime,R-squared:,0.36
Model:,OLS,Adj. R-squared:,0.351
Method:,Least Squares,F-statistic:,41.29
Date:,"Fri, 05 Nov 2021",Prob (F-statistic):,1.86e-27
Time:,11:11:45,Log-Likelihood:,-325.33
No. Observations:,299,AIC:,660.7
Df Residuals:,294,BIC:,679.2
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.9759,0.072,69.206,0.000,4.834,5.117
C(high_blood_pressure)[T.1],-0.1872,0.109,-1.724,0.086,-0.401,0.026
"C(DEATH_EVENT, Treatment(reference=0))[T.1]",-1.0647,0.114,-9.323,0.000,-1.289,-0.840
C(diabetes)[T.1],0.0715,0.085,0.841,0.401,-0.096,0.239
"C(high_blood_pressure)[T.1]:C(DEATH_EVENT, Treatment(reference=0))[T.1]",-0.1131,0.186,-0.609,0.543,-0.479,0.252

0,1,2,3
Omnibus:,24.038,Durbin-Watson:,0.616
Prob(Omnibus):,0.0,Jarque-Bera (JB):,29.144
Skew:,-0.635,Prob(JB):,4.69e-07
Kurtosis:,3.851,Cond. No.,6.31


How can we interpret this result now?

In this case we can say that each unit increase in a variable $X_n$ multiplies the value of the dependent variable by $e^{\beta_n}$, the exponential of that variable's coefficient $\beta_n$.  
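
Where this multiplicative reading comes from can be sketched in two steps, writing $y$ for `time` and holding all other predictors fixed (strictly speaking, this describes the fitted value on the log scale):

$$\log y = \beta_0 + \beta_n X_n + \dots \quad\Longrightarrow\quad y = e^{\beta_0} \cdot e^{\beta_n X_n} \cdots$$

$$\frac{y(X_n + 1)}{y(X_n)} = \frac{e^{\beta_n (X_n + 1)}}{e^{\beta_n X_n}} = e^{\beta_n}$$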

Take the coefficient for diabetes as an example (0.0715). 

> $e^{0.0715} = 1.074118$ 

From this we can say that, all else being equal, if the value of diabetes increases by one unit (that is, comparing a patient with diabetes to one without), the patient will stay in hospital 1.07 times as long, i.e. 7% longer. Note that this coefficient is not statistically significant (p = 0.401).
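
The same conversion can be applied to several coefficients and their confidence bounds at once. This sketch hard-codes the values from the summary table above; the DataFrame `log_scale` and the shortened row labels are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients and 95% confidence bounds copied from the summary table above
log_scale = pd.DataFrame(
    {
        "coef": [-0.1872, -1.0647, 0.0715],
        "ci_low": [-0.401, -1.289, -0.096],
        "ci_high": [0.026, -0.840, 0.239],
    },
    index=["high_blood_pressure", "DEATH_EVENT", "diabetes"],
)

# Exponentiate to turn additive effects on log(time) into multiplicative effects on time
ratios = np.exp(log_scale)
# Percentage change relative to the reference group
ratios["pct_change"] = (ratios["coef"] - 1) * 100

print(ratios.round(3))
```

A confidence interval that contains 1 on this scale corresponds to one that contains 0 on the log scale. Because the model includes an interaction, the rows for `high_blood_pressure` and `DEATH_EVENT` describe the effect when the other variable is at its reference level.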