
In [1]:
# Access token redacted; supply your own credentials if the repository is private.
!git clone https://github.com/ingvft/py_portfolio.git

Cloning into 'py_portfolio'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 70 (delta 14), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (70/70), 1.73 MiB | 5.23 MiB/s, done.
Resolving deltas: 100% (14/14), done.


# Bayesian Logistic Regression

## Introduction

### Problem Statement

You are interested in studying the factors that influence the likelihood of heart disease among patients.

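You have a dataset of 303 patients, each with 14 variables: age (`age`), sex (`sex`), chest pain type (`cp`), resting blood pressure (`trestbps`), serum cholesterol (`chol`), fasting blood sugar (`fbs`), resting electrocardiographic results (`restecg`), maximum heart rate achieved (`thalach`), exercise-induced angina (`exang`), ST depression induced by exercise (`oldpeak`), slope of the peak exercise ST segment (`slope`), number of major vessels (`ca`), thalassemia (`thal`), and diagnosis of heart disease (`num`).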

You want to use Bayesian logistic regression to model the probability of heart disease (the outcome variable) as a function of some or all of the other variables (the predictor variables).
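Written out, a model of this kind typically takes the form below. The symbols ($y_i$ for the outcome of patient $i$, $x_{ij}$ for predictor $j$, $\alpha$, $\beta_j$, $\sigma$) and the normal priors are assumptions for illustration only; the priors actually used are a modeling choice.

$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i), \\
\operatorname{logit}(p_i) &= \log\frac{p_i}{1 - p_i} = \alpha + \sum_{j=1}^{K} \beta_j x_{ij}, \\
\alpha,\ \beta_j &\sim \mathcal{N}(0, \sigma^2).
\end{aligned}
$$

Each coefficient $\beta_j$ is the change in the log-odds of heart disease per unit change in predictor $j$, holding the other predictors fixed.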

You also want to compare different models and assess their fit and predictive performance.

## Bayesian Workflow

For this project, I will follow this workflow:

1. Data exploration: Explore the data using descriptive statistics and visualizations to get a sense of the distribution, range, and correlation of the variables. Identify any outliers, missing values, or potential errors in the data. Transform or standardize the variables if needed.
2. Model specification: Specify a probabilistic model that relates the outcome variable to the predictor variables using a logistic regression equation. Choose appropriate priors for the model parameters, such as normal, Student-t, or Cauchy distributions.
3. Model fitting: Fit the model using a sampling algorithm such as Hamiltonian Monte Carlo (HMC) or the No-U-Turn Sampler (NUTS). This notebook imports `pymc` for sampling; in Julia, the `DynamicHMC` or `Turing.jl` packages implement the same algorithms. Check the convergence and mixing of the chains using diagnostics such as trace plots. Here `arviz` is imported for that purpose; in Julia, you can use `MCMCDiagnostics` or the diagnostics included in `Turing.jl`.
4. Model checking: Check the fit and validity of the model using posterior predictive checks.
5. Model interpretation: Interpret the model results using summary statistics and visualizations of the posterior distribution, such as mean, median, standard deviation, credible intervals, density plots, and forest plots.
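
Steps 2, 3, and 5 above can be sketched end to end on synthetic data. The sketch below uses NumPy only and a plain random-walk Metropolis sampler as a stand-in for HMC/NUTS, so it is not how `pymc` fits a model. The function names `log_posterior` and `metropolis`, the Normal(0, 2) priors, the step size, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def log_posterior(theta, X, y, prior_sd=2.0):
    """Unnormalized log posterior: Bernoulli-logit likelihood plus
    independent Normal(0, prior_sd) priors (an assumed prior choice)."""
    eta = theta[0] + X @ theta[1:]  # linear predictor on the log-odds scale
    # Bernoulli log-likelihood, y*eta - log(1 + exp(eta)), computed stably
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    log_prior = -0.5 * np.sum((theta / prior_sd) ** 2)
    return log_lik + log_prior


def metropolis(X, y, n_draws=4000, step=0.15, seed=0):
    """Random-walk Metropolis (a simple stand-in for HMC/NUTS)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1] + 1)  # intercept followed by slopes
    current = log_posterior(theta, X, y)
    draws = np.empty((n_draws, theta.size))
    accepted = 0
    for i in range(n_draws):
        proposal = theta + step * rng.normal(size=theta.size)
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
            accepted += 1
        draws[i] = theta
    return draws, accepted / n_draws


# Synthetic data with a known intercept (-0.5) and slope (1.5)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 1))
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * X[:, 0])))
y = rng.binomial(1, p)

draws, acc_rate = metropolis(X, y)
posterior = draws[1000:]  # discard the first 1000 draws as warm-up

# Step 5: summarize with posterior means and 94% equal-tailed intervals
post_mean = posterior.mean(axis=0)
ci_low, ci_high = np.percentile(posterior, [3, 97], axis=0)
print("acceptance rate:", round(acc_rate, 2))
print("posterior mean [intercept, slope]:", post_mean.round(2))
print("94% interval low:", ci_low.round(2), "high:", ci_high.round(2))
```

With a well-behaved chain, the intervals should cover the values used to simulate the data. In practice a gradient-based sampler such as NUTS mixes far better than this random walk once the number of predictors grows.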

In [2]:
%matplotlib inline
import arviz as az
import pymc as pm
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
# Use this to render LaTeX in plots when running the notebook locally
# plt.rcParams['text.usetex'] = True

# use this if running notebook on Google Colab
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['font.family'] = 'STIXGeneral'

Load the processed Cleveland dataset from the cloned repository and inspect the first rows.

In [4]:
# load the data and print the header

data = pd.read_csv('/content/py_portfolio/data/processed_cleveland.csv')
print(data.head())


   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   1       145   233    1        2      150      0      2.3      3   
1   67    1   4       160   286    0        2      108      1      1.5      2   
2   67    1   4       120   229    0        2      129      1      2.6      2   
3   37    1   3       130   250    0        0      187      0      3.5      3   
4   41    0   2       130   204    0        2      172      0      1.4      1   

  ca thal  num  
0  0    6    0  
1  3    3    2  
2  2    7    1  
3  0    3    0  
4  0    3    0  
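
The header shows that `num` takes values other than 0 and 1, so it needs to be collapsed to a binary outcome before a logistic regression can use it. The sketch below shows one possible preparation on a small in-memory frame. It assumes that any `num` greater than 0 counts as disease and, hypothetically, that missing entries are coded as `?`; neither assumption is confirmed by the output above. The first five rows reuse values from the header, the sixth row is hypothetical, and the column names `disease`, `age_z`, and `chol_z` are made up for illustration.

```python
import pandas as pd

# Rows 0-4 copy values from the printed header; row 5 is HYPOTHETICAL and
# exists only to illustrate a '?' placeholder for a missing value.
sample = pd.DataFrame({
    "age": [63, 67, 67, 37, 41, 50],
    "chol": [233, 286, 229, 250, 204, 240],
    "ca": ["0", "3", "2", "0", "0", "?"],
    "num": [0, 2, 1, 0, 0, 1],
})

# Convert to numeric; anything unparseable (here '?') becomes NaN
sample["ca"] = pd.to_numeric(sample["ca"], errors="coerce")
n_missing = int(sample["ca"].isna().sum())

# Binary outcome: 1 if any disease grade was recorded (assumed rule)
sample["disease"] = (sample["num"] > 0).astype(int)

# Standardize continuous predictors to mean 0 and standard deviation 1
for col in ["age", "chol"]:
    sample[col + "_z"] = (sample[col] - sample[col].mean()) / sample[col].std()

print("missing ca values:", n_missing)
print(sample[["age_z", "chol_z", "ca", "disease"]])
```

Standardizing the continuous predictors puts the coefficients on a comparable scale, which makes a single weakly informative prior easier to justify across all of them.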
