# Part I

In [1]:
import pandas as pd
df = pd.read_csv("heart.csv")
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [3]:
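# n observations; p counts every column except the response (HeartDisease).
# Note: this count still includes the "Unnamed: 0" index column.
n, p = df.shape[0], df.shape[1] - 1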

In [4]:
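import numpy as np
import pymc as pm

# Placeholder design matrix and response with the right shapes;
# substitute the encoded predictors and HeartDisease labels from df.
X, y = np.zeros((n, p)), np.ones((n, 1))

with pm.Model() as logistic_model:
    # Independent standard-normal prior on each coefficient
    betas = pm.MvNormal('betas', mu=np.zeros((p,1)), cov=np.eye(p), shape=(p,1))
    linear_comb = pm.math.dot(X, betas)
    # Named prob so it does not shadow the predictor count p
    prob = pm.math.sigmoid(linear_comb)
    y_obs = pm.Bernoulli('y_obs', p=prob, observed=y)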
    
with logistic_model:
    idata = pm.sample()

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [betas]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 42 seconds.
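
The model above is fit to an all-zero `X` and an all-one `y`. One possible way to build a real design matrix from the data frame is sketched below. The helper name `build_design`, the choice of one-hot encoding with a dropped reference level, and the decision to discard the `Unnamed: 0` index column are assumptions for illustration, not something the notebook specifies. The demo frame simply re-types the five rows printed by `df.head()`.

```python
import numpy as np
import pandas as pd

# Stand-in for heart.csv: the five rows printed by df.head() above.
df_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Age": [40, 49, 37, 48, 54],
    "Sex": ["M", "F", "M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
    "RestingBP": [140, 160, 130, 138, 150],
    "Cholesterol": [289, 180, 283, 214, 195],
    "FastingBS": [0, 0, 0, 0, 0],
    "RestingECG": ["Normal", "Normal", "ST", "Normal", "Normal"],
    "MaxHR": [172, 156, 98, 108, 122],
    "ExerciseAngina": ["N", "N", "N", "Y", "N"],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0, 1, 0],
})


def build_design(df, response="HeartDisease"):
    """Hypothetical helper: return (X, y, column names) for a logistic model."""
    # Drop the response and the saved index column (if present).
    features = df.drop(columns=[response, "Unnamed: 0"], errors="ignore")
    # One-hot encode the string columns, dropping one reference level each.
    encoded = pd.get_dummies(features, drop_first=True).astype(float)
    return encoded.to_numpy(), df[response].to_numpy(), list(encoded.columns)


X_demo, y_demo, names = build_design(df_demo)
print(X_demo.shape)
print(names)
```

With a matrix built this way, the number of coefficients would be `X_demo.shape[1]` rather than `df.shape[1] - 1`, and `y_demo[:, None]` would match the `(n, 1)` response shape used in the model above.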


# Part II

### Normal Prior:

A normal prior for the coefficients $\beta$ assumes: $\beta_j \sim \mathcal{N}(0, \sigma^2)$

Given the likelihood function: $\mathcal{L}(\beta|X, y) \propto \exp\left(-\frac{1}{2\sigma^2} ||y - X\beta||_2^2\right)$

The log posterior distribution for $\beta$ with a normal prior becomes:

\begin{align*}
\log p(\beta|X, y) &= \log \mathcal{L}(\beta|X, y) + \log p(\beta) + \text{const} \\
&= -\frac{1}{2\sigma^2} ||y - X\beta||_2^2 - \frac{1}{2\sigma^2} \sum_{j=1}^p \beta_j^2 + \text{const} \\
&= -\frac{1}{2\sigma^2} \left( ||y - X\beta||_2^2 + \sum_{j=1}^p \beta_j^2 \right) + \text{const}
\end{align*}

This mirrors ridge regression: maximizing this log posterior is equivalent to minimizing the residual sum of squares plus a penalty proportional to the squared $L_2$ norm of $\beta$.
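
As a numerical check of the ridge connection, the sketch below uses simulated data (an assumption, unrelated to `heart.csv`) and the shared-variance setup above, where the prior and the noise both have variance $\sigma^2$. In that case the penalty weight is 1 and the posterior mode has the closed form $(X^\top X + I)^{-1} X^\top y$. The names `X_sim`, `y_sim`, and `neg_log_posterior` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_coef = 50, 3

# Simulated regression data (illustrative only).
X_sim = rng.normal(size=(n_obs, n_coef))
beta_true = np.array([1.0, -2.0, 0.5])
y_sim = X_sim @ beta_true + rng.normal(size=n_obs)


def neg_log_posterior(beta):
    """Negative log posterior, up to the factor 1/sigma^2 and a constant."""
    residual = y_sim - X_sim @ beta
    return 0.5 * residual @ residual + 0.5 * beta @ beta


# Ridge solution with penalty weight 1 (prior variance equal to noise variance).
beta_ridge = np.linalg.solve(X_sim.T @ X_sim + np.eye(n_coef), X_sim.T @ y_sim)

# The gradient of the negative log posterior should vanish at the ridge solution.
grad = -X_sim.T @ (y_sim - X_sim @ beta_ridge) + beta_ridge
print(np.allclose(grad, 0.0))
```

If the prior variance were a separate $\tau^2$, the same argument would give a penalty weight of $\sigma^2 / \tau^2$ in place of 1.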

### Laplace Prior:

A Laplace prior for the coefficients $\beta$ assumes: $p(\beta_j) \propto \exp\left(-\frac{|\beta_j|}{b}\right)$

Given the likelihood function: $\mathcal{L}(\beta|X, y) \propto \exp\left(-\frac{1}{2\sigma^2} ||y - X\beta||_2^2\right)$

The log posterior distribution for $\beta$ with a Laplace prior becomes:

\begin{align*}
\log p(\beta|X, y) &= \log \mathcal{L}(\beta|X, y) + \log p(\beta) + \text{const} \\
&= -\frac{1}{2\sigma^2} ||y - X\beta||_2^2 - \frac{1}{b} \sum_{j=1}^p |\beta_j| + \text{const}
\end{align*}

This is similar to the lasso regression expression, where we have a penalty term proportional to the $L_1$ norm of $\beta$.
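
To see why the $L_1$ penalty produces exact zeros, consider the one-coefficient case as a simplifying assumption. Multiplying the negative log posterior by $\sigma^2$ gives $\frac{1}{2}(z - \beta)^2 + \lambda|\beta|$ with $\lambda = \sigma^2 / b$, where $z$ stands for the unpenalized estimate. Its minimizer is the soft-thresholding rule, which the sketch below compares against a brute-force grid search. The function name `soft_threshold` and the numbers used are illustrative.

```python
import numpy as np


def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - beta)**2 + lam * abs(beta) over beta."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def grid_minimizer(z, lam):
    """Brute-force check of the same objective on a fine grid."""
    grid = np.linspace(-3.0, 3.0, 60001)  # step size 1e-4
    objective = 0.5 * (z - grid) ** 2 + lam * np.abs(grid)
    return grid[np.argmin(objective)]


lam_demo = 0.5
large = soft_threshold(2.0, lam_demo)   # shrunk toward zero by lam_demo
small = soft_threshold(0.3, lam_demo)   # set exactly to zero
check = grid_minimizer(2.0, lam_demo)
print(large, small, check)
```

Estimates smaller in magnitude than $\lambda$ are set exactly to zero, while a Gaussian prior only shrinks them toward zero.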

Bayesians do not optimize posterior distributions; they sample from them. Nonetheless, the posterior is a 'regularization' of the likelihood through the prior.

# Part III

In [2]:
import numpy as np
import pymc as pm
import arviz as az

np.random.seed(8)
n_samples = 100
X = np.random.normal(size=(n_samples, 1))
true_beta = 2.5
mu = X * true_beta
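w = 1   # noise standard deviation used to simulate y
nu = 4  # not used in the simulation; rebound to a random variable in the model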

y = np.random.normal(mu.flatten(), scale=w)

with pm.Model() as robust_model_with_lambda:
    # Priors
    beta = pm.Normal('beta', mu=0, sigma=10)
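    # Degrees of freedom: the shifted exponential keeps nu > 1
    nu = pm.Exponential('nu', 1/29) + 1
    # Per-observation weights; a small lambda_i flags a potential outlier
    lambda_i = pm.Gamma('lambda', alpha=nu/2, beta=nu/2, shape=n_samples)
    # Caution: in the Student-t scale mixture lambda_i acts as a precision, so the
    # standard deviation would be 1/sqrt(lambda_i); this cell uses 1/lambda_i.
    sigma_i = pm.Deterministic('sigma_i', 1 / (lambda_i))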

    # Likelihood
    likelihood = pm.Normal('y', mu=X[:,0]*beta, sigma=sigma_i, observed=y)

    # Sampling
    trace = pm.sample(1000, target_accept=0.9, return_inferencedata=True)


lambda_posterior = trace.posterior['lambda'].values
lambda_mean = np.mean(lambda_posterior, axis=(0, 1))
outlier_threshold = np.quantile(lambda_mean, 0.05)  # Lower 5% as potential outliers
outliers = np.where(lambda_mean < outlier_threshold)[0]
print(f"Outliers: {outliers}")

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, nu, lambda]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details


Outliers: [ 9 20 49 57 63]
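
The `Gamma(nu/2, nu/2)` prior on `lambda` suggests that this model is based on the scale-mixture representation of the Student-t distribution: if $\lambda \sim \text{Gamma}(\nu/2, \nu/2)$ (rate parameterization) and $y \mid \lambda \sim \mathcal{N}(0, 1/\lambda)$, then marginally $y \sim t_\nu$. The sketch below checks this by simulation with NumPy only. The value $\nu = 4$, the draw count, and the variable names are assumptions for illustration. Note that the standard deviation in this representation is $1/\sqrt{\lambda}$.

```python
import numpy as np

rng = np.random.default_rng(0)
nu_demo = 4.0
n_draws = 200_000

# NumPy's gamma takes a scale, so rate nu/2 becomes scale 2/nu.
lam = rng.gamma(shape=nu_demo / 2, scale=2 / nu_demo, size=n_draws)

# Normal draws whose precision is lam, i.e. standard deviation 1/sqrt(lam).
y_mix = rng.normal(loc=0.0, scale=1 / np.sqrt(lam))

# Direct Student-t draws for comparison.
y_t = rng.standard_t(df=nu_demo, size=n_draws)

probs = [0.05, 0.25, 0.5, 0.75, 0.95]
q_mix = np.quantile(y_mix, probs)
q_t = np.quantile(y_t, probs)
print(np.round(q_mix, 2))
print(np.round(q_t, 2))
```

The two sets of quantiles should agree up to Monte Carlo error. Observations with small `lambda` receive a large variance, which is why low posterior means of `lambda` are used above to flag potential outliers.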
