Hugo Danet, EURECOM, Advanced Statistical Inference [ASI], May 2021
# ASI Assessed exercise

## Santander Customer Transaction Prediction

## Question A
(code) Download and import the Santander dataset. The labels of the test data are not
publicly available, so create your own test set by randomly choosing half of the instances in
the original training set. [3]

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import warnings
import scipy.linalg
import scipy.stats
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.metrics import roc_auc_score

In [None]:
def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()

In [None]:
jitter = 1e-10

In [None]:
train_path = "../input/santander-customer-transaction-prediction/train.csv"

In [None]:
df = pd.read_csv(train_path)
df.drop('ID_code',inplace=True, axis=1)
#print(df.describe())
#print(df.head())

### Plotting target imbalance

In [None]:
ones = len(df[df['target'] == 1])
zeros = len(df[df['target'] == 0])

classes = ['Target 0','Target 1']

# a single bar plot with categorical x labels (a second overlaid bar call is not needed)
plt.bar(classes,[zeros,ones],color='blue',edgecolor='black')
plt.xlabel('Class', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title("Diagram of imbalance of the two classes")
plt.show()

class_imbalance = ones/len(df)

print("Class imbalance : ", class_imbalance*100, "% of positive targets")

### Plotting features distribution

In [None]:
#Features histograms

df.hist(figsize = (50,50))
plt.show()

### Undersampling training set

In [None]:
threshold = 0.112
df = df[(np.random.rand(df.shape[0]) < threshold) | (df["target"] == 1)]
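
The effect of this threshold on the class balance can be estimated with a short calculation. The sketch below is only illustrative: the function name is hypothetical, and it assumes a positive rate of about 10% before undersampling, as reported by the bar plot above.

```python
def positive_fraction_after_undersampling(pos_rate, keep_prob):
    """Expected share of positives when every positive row is kept and each
    negative row is kept independently with probability keep_prob."""
    kept_pos = pos_rate
    kept_neg = keep_prob * (1.0 - pos_rate)
    return kept_pos / (kept_pos + kept_neg)

# Assumed inputs: ~10% positives before undersampling, threshold = 0.112
expected_share = positive_fraction_after_undersampling(pos_rate=0.10, keep_prob=0.112)
print(round(expected_share, 3))
```

Under these assumptions the two classes end up roughly balanced, which is what the threshold of 0.112 is aiming for.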

In [None]:
ones = len(df[df['target'] == 1])
zeros = len(df[df['target'] == 0])

classes = ['Target 0','Target 1']

# a single bar plot with categorical x labels (a second overlaid bar call is not needed)
plt.bar(classes,[zeros,ones],color='blue',edgecolor='black')
plt.xlabel('Class', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title("Classes after undersampling")
plt.show()

print("Class undersampling : ",ones/len(df)*100, "% of positive targets")

### Dataset Split

In [None]:
numpy_data = df.to_numpy() 

targets = numpy_data[:,0]
# column 0 is the target, the remaining 200 columns are the features
features = numpy_data[:,1:]

ones_count = np.count_nonzero(targets == 1)
zeros_count = np.count_nonzero(targets == 0)
unbalanced = ones_count / (zeros_count + ones_count)

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.1, random_state=42)

## Question B
(text) Comment on the distribution of class labels and the dimensionality of the input and how
these may affect the analysis. [7]

<span style="color:black">
This dataset has a much higher input dimensionality than the one used in the labs (200 features versus 2). The models will therefore be harder to fit: convergence is slower and a high accuracy is more difficult to reach than in the labs.
<br/><br/>
The targets are highly imbalanced, as the bar plot above shows (about 10% of positive targets). To help the models learn, I undersampled the majority class (reducing the number of 0 targets in the dataset), which greatly improved the AUC scores of the three models. I monitor sensitivity, specificity and AUC, which are more informative than plain accuracy on imbalanced datasets. Indeed, a naive model that always predicts 0 on a 90/10 dataset reaches an accuracy of 0.9 but an AUC of only 0.5.
<br/><br/>
The features are approximately Gaussian distributed, as the feature histograms show. This is convenient for the logistic regressions, where we assume a Gaussian prior on the parameters, so no major feature transformation is needed.
</span>
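
The accuracy versus AUC remark can be checked on a toy example. The sketch below is illustrative only: the helpers `accuracy` and `auc` are not part of the notebook, and the AUC is computed by brute-force pair counting rather than with `roc_auc_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count for one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 90/10 dataset and a naive model that always predicts 0
y_true = [1] * 10 + [0] * 90
always_zero = [0] * 100

naive_accuracy = accuracy(y_true, always_zero)
naive_auc = auc(y_true, always_zero)
print(naive_accuracy, naive_auc)
```

The naive model looks good on accuracy alone, while the AUC shows that it cannot rank a positive above a negative any better than chance.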

## 1. Bayesian Linear Regression

## Question A
(code) Implement Bayesian linear regression (you should already have an
implementation from the lab sessions) [10]

In [None]:
# build the design matrix

def build_X(X):
    # prepend a column of ones (bias term) to the 200 feature columns
    return np.column_stack((np.ones(len(X)), X[:, :200]))

In [None]:
def compute_posterior(X, y, sigma2priorweights, sigma2noise):
            
    Sigma_inverse = X.T @ X * (1/sigma2noise) + np.linalg.inv(sigma2priorweights)
    
    posterior_Sigma = np.linalg.inv(Sigma_inverse)
        
    posterior_mu = (1/sigma2noise) * posterior_Sigma @ X.T @ y

    return posterior_mu, posterior_Sigma
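
For reference, the function above corresponds to the usual Gaussian posterior over the weights, assuming a zero-mean prior $w \sim \mathcal{N}(0, S)$ and Gaussian noise of variance $\sigma^2$. The symbols $S$ (for `sigma2priorweights`) and $\sigma^2$ (for `sigma2noise`) are notation introduced here for readability.

```latex
\Sigma_{\text{post}} = \left(\frac{1}{\sigma^2} X^\top X + S^{-1}\right)^{-1},
\qquad
\mu_{\text{post}} = \frac{1}{\sigma^2}\,\Sigma_{\text{post}}\,X^\top y
```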

## Question B
(text) Describe any pre-processing that you suggest for this data [5] 

<span style="color:black">
According to the feature histograms, there are no outliers in the dataset, so a standard scaler is sufficient for this work. Without scaling, I also ran into exploding gradients that led to a NaN loss with variational inference.
<br/><br/>
The standard scaler removes the mean of each feature and scales it to unit variance. If the variance of one feature is orders of magnitude larger than that of the others, it may dominate the objective function and prevent the estimator from learning correctly from the other features.
</span>
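
As a rough illustration of what standardization does, the sketch below scales a single toy feature column using statistics estimated on the training data only. The helper names and the toy values are assumptions made for this example; the notebook itself relies on `StandardScaler`.

```python
import statistics

def fit_scaler(train_column):
    """Estimate mean and (population) standard deviation on the training data."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    return mean, std

def transform(column, mean, std):
    """Apply the training statistics to any column (train or test)."""
    return [(x - mean) / std for x in column]

train_column = [10.0, 12.0, 14.0, 16.0]
test_column = [13.0, 20.0]

mean, std = fit_scaler(train_column)
train_scaled = transform(train_column, mean, std)
test_scaled = transform(test_column, mean, std)
print(train_scaled, test_scaled)
```

The test column is transformed with the training mean and standard deviation, so no information from the test set leaks into the preprocessing.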

## Data Scaling

In [None]:
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Question C
(code) Treat class labels as continuous and apply regression to the training data. Also,
calculate and report the posterior variance of the weights [10]

In [None]:
sigma2priorweights = 1
sigma2noise = 1

bigX = build_X(X_train)

w_posterior_mu, w_posterior_Sigma = compute_posterior(bigX, y_train, np.identity(201)*sigma2priorweights, sigma2noise)

### Posterior variance of the weights

In [None]:
print(w_posterior_Sigma)

fig, ax = plt.subplots()
ax.matshow(w_posterior_Sigma)
ax.grid(None)
plt.title("Posterior variance of the weights")
plt.show()

### Posteriors computation

In [None]:
def compute_predictive(Xnew, w_posterior_mu, w_posterior_Sigma, sigma2noise):
    
    Xnew = build_X(Xnew)
    
    y_posterior_mu = Xnew @ w_posterior_mu 
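    # the noise variance only contributes to the diagonal of the predictive covariance
    y_posterior_Sigma = sigma2noise * np.identity(len(Xnew)) + Xnew @ w_posterior_Sigma @ Xnew.T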
    
    return y_posterior_mu, y_posterior_Sigma
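
To make the posterior and predictive computations concrete, here is a minimal worked example with a single weight and no bias term, so that every quantity is a scalar. The function names and the toy data are hypothetical; the formulas are the scalar versions of `compute_posterior` and `compute_predictive` with a zero-mean prior.

```python
def scalar_posterior(xs, ys, prior_var, noise_var):
    """Posterior mean and variance of a single weight w with prior N(0, prior_var)."""
    precision = sum(x * x for x in xs) / noise_var + 1.0 / prior_var
    post_var = 1.0 / precision
    post_mean = post_var * sum(x * y for x, y in zip(xs, ys)) / noise_var
    return post_mean, post_var

def scalar_predictive(x_new, post_mean, post_var, noise_var):
    """Predictive mean and variance of y at a new input x_new."""
    return x_new * post_mean, noise_var + x_new * post_var * x_new

# Toy data lying exactly on the line y = x
post_mean, post_var = scalar_posterior([1.0, 2.0], [1.0, 2.0], prior_var=1.0, noise_var=1.0)
pred_mean, pred_var = scalar_predictive(3.0, post_mean, post_var, noise_var=1.0)
print(post_mean, post_var, pred_mean, pred_var)
```

The posterior mean is pulled slightly below 1 by the zero-mean prior, and the predictive variance grows with the magnitude of the new input.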

In [None]:
y_posterior_mu, y_posterior_Sigma = compute_predictive(X_test,w_posterior_mu,w_posterior_Sigma,1)

## Question D
(text) Suggest a way to discretize predictions and display the confusion matrix on the
test data and report accuracy [5]

<span style="color:black">
To discretize the predictions, we assign class 1 to all predictions greater than or equal to 0.5 and class 0 to all predictions below 0.5. It would also be possible to use an adaptive threshold estimated on a training dataset, but this is not necessary here since the classes have been rebalanced by undersampling.
<br/><br/>
With the code below, we can plot the confusion matrix together with useful metrics such as accuracy, sensitivity, specificity and AUC.
The Bayesian linear regression scores well, with an accuracy of 0.85 and an AUC of 0.77. I compared its results with those of the Bayesian linear regression of sklearn and they are identical. The only optimization I could still implement is to replace the matrix inversion algorithm to speed up the computations.
</span>
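
The discretization rule and the four confusion-matrix counts can be illustrated on a handful of made-up scores. This is a simplified sketch with hypothetical helper names, not the notebook's own functions.

```python
def discretize(scores, cutoff=0.5):
    """Map continuous predictions to class labels: below the cutoff -> 0, otherwise -> 1."""
    return [0 if s < cutoff else 1 for s in scores]

def confusion_counts(y_pred, y_true):
    """Count true/false positives and negatives; each sample falls in exactly one cell."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for p, t in zip(y_pred, y_true):
        if p == 1 and t == 1:
            counts["TP"] += 1
        elif p == 0 and t == 1:
            counts["FN"] += 1
        elif p == 1 and t == 0:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

scores = [0.9, 0.4, 0.6, 0.1, 0.5]   # made-up regression outputs
y_true = [1, 1, 0, 0, 1]

labels = discretize(scores)
counts = confusion_counts(labels, y_true)
toy_accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(labels, counts, toy_accuracy)
```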

In [None]:
def result_analysis(prediction,control):
    TP = 0
    FP = 0
    TN = 0
    FN = 0

    for i in range(len(prediction)):
        # each sample must fall in exactly one of the four cells
        if(prediction[i] == 1 and control[i] == 1):
            TP += 1
        elif(prediction[i] == 0 and control[i] == 1):
            FN += 1
        elif(prediction[i] == 1 and control[i] == 0):
            FP += 1
        else:
            TN += 1
            
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    se = TP / (TP + FN + jitter)
    sp = TN / (TN + FP + jitter)

    error_rate = 1 - accuracy
        
    auc_score = roc_auc_score(control, prediction)
        
    return TP, TN, FP, FN, accuracy, error_rate, se, sp, auc_score

In [None]:
def threshold(prediction,value):
    rounded_prediction = []
    for i in range(len(prediction)):
        if(prediction[i] < value):
            rounded_prediction.append(0)
        else:
            rounded_prediction.append(1)
    return rounded_prediction

### Bayesian Linear Regression Validation

In [None]:
def confusion_matrix(prediction,control,title):
    
    TP, TN, FP, FN, accuracy, error_rate, se, sp, auc = result_analysis(prediction,control)
            
    confusion_matrix = [[FP,TP],[TN,FN]]
    df_cm = pd.DataFrame(confusion_matrix, [1,0], [0,1])
    # plt.figure(figsize=(10,7))
    sns.set(font_scale=1.4) # for label size
    sns.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size
    
    plt.xlabel("True values")
    plt.ylabel("Predicted values")
    plt.title(title)
    
    plt.show()
    
    print("accuracy : ",accuracy)
    print("error rate : ", error_rate)
    print("Sensitivity : ", se)
    print("Specificity : ", sp)
    print("AUC : ", auc)

In [None]:
BLR_prediction = threshold(y_posterior_mu,0.5)

In [None]:
confusion_matrix(BLR_prediction,y_test,"Bayesian Linear Regression Confusion Matrix")

## 2. Logistic Regression
## Question A
(code) The goal is to implement a Bayesian logistic regression classifier; assume a
Gaussian prior on the parameters. As a first step, implement a Markov chain Monte
Carlo inference algorithm to infer parameters (you should already have an
implementation of the Metropolis-Hastings algorithm from the lab sessions). [10]

In [None]:
def MH_logistic(z):
    return 1/(1 + np.exp(-z))

In [None]:
class BernoulliLikelihood():
    def logdensity(self, y, p):
        value = np.sum(y*np.log(p + jitter) + (1-y) * np.log(1-p + jitter))
        return value

class NormalPrior():
    def __init__(self, sigma2x):
        self.sigma2x = sigma2x
        
    def logdensity(self, x):
        # log N(x | 0, Sigma), computed directly in log space to avoid underflow
        d = len(x)
        _, logdet = np.linalg.slogdet(self.sigma2x)
        quad = x.T @ np.linalg.solve(self.sigma2x, x)
        value = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
        return value

In [None]:
class MHSampler():
    @property
    def samples(self):
        # expose the chain as a NumPy array of shape (num_samples, num_parameters)
        return np.asarray(self._samples)
    
    def __init__(self, initial_sample, likelihood, prior):
        self.likelihood = likelihood
        self.prior = prior
        self._samples = [initial_sample]
    
    def unnormalized_logposterior(self, w, X, y):
        p = MH_logistic(X@w)
        log_likelihood = self.likelihood.logdensity(y,p)
        log_prior = self.prior.logdensity(w)
        return log_likelihood + log_prior

    def step(self, X, y, step_proposal):
        w_prev = self._samples[-1]
        w_proposal = step_proposal(w_prev)
        
        log_gw_prev = self.unnormalized_logposterior(w_prev, X, y)
        log_gw_proposal = self.unnormalized_logposterior(w_proposal, X, y)

        acceptance_ratio = np.exp(log_gw_proposal - log_gw_prev)
        
        if acceptance_ratio >= 1:
            self._samples.append(w_proposal)
        else:
            u = random.uniform(0.0,1.0)
            if u <= acceptance_ratio:
                self._samples.append(w_proposal)
            else:
                self._samples.append(w_prev)
        
        return min(acceptance_ratio, 1)

In [None]:
likelihood = BernoulliLikelihood()
prior = NormalPrior(np.identity(200))

#starting_point = np.random.randn(1, 200)
#sampler = MHSampler(starting_point[0,:],likelihood,prior)

#starting_point = np.zeros((200))
#sampler = MHSampler(starting_point,likelihood,prior)

sampler = MHSampler(X_train[0],likelihood,prior)

### Metropolis-Hastings training

In [None]:
def step_proposal(sample):
    new_sample = np.random.randn(1, 200) * 0.1 + sample
    return new_sample[0,:]

num_iterations = 5000
for step in range(num_iterations):
    acceptance = sampler.step(X_train,y_train,step_proposal)
    print('Metropolis Hastings Training : Epoch [{}/{}]'.format(step, num_iterations),end="\r")

### $\hat{R}$ - statistics

In [None]:
def _rhat_base(ary):
    """Compute the rhat for a 2d array."""
    _, num_samples = ary.shape

    # Calculate chain mean
    chain_mean = np.mean(ary, axis=1)
    # Calculate chain variance
    chain_var = np.var(ary, axis=1, ddof=1)
    # Calculate between-chain variance
    between_chain_variance = num_samples * np.var(chain_mean, axis=None, ddof=1)
    # Calculate within-chain variance
    within_chain_variance = np.mean(chain_var)
    # Estimate of marginal posterior variance
    rhat_value = np.sqrt(
        (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)
    )
    return rhat_value


def _rhat_rank(ary):
    """Compute the rank normalized rhat. 
    Computation follows https://arxiv.org/abs/1903.08008
    """
    
    def _z_scale(ary):
        rank = scipy.stats.rankdata(ary, method="average")
        z = scipy.stats.norm.ppf((rank - 0.5) / ary.size)
        return z.reshape(ary.shape)
    
    
    def _split_chains(ary):
        """Split and stack chains."""
        _, n_draw = ary.shape
        half = n_draw // 2
        return np.vstack((ary[:, :half], ary[:, -half:]))
    
    split_ary = _split_chains(ary)
    rhat_bulk = _rhat_base(_z_scale(split_ary))

    split_ary_folded = abs(split_ary - np.median(split_ary))
    rhat_tail = _rhat_base(_z_scale(split_ary_folded))

    rhat_rank = max(rhat_bulk, rhat_tail)
    return rhat_rank

def compute_rhat(samples):
    """Compute the rhat statistics from samples. Samples needs to be a tensor 
    with dimensions [num_of_chain, num_of_samples, num_of_variables]. """
    
    samples = np.atleast_3d(samples)
    return np.asarray([_rhat_rank(samples[...,i]) for i in range(samples.shape[-1]) ])

In [None]:
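# compute_rhat expects [num_of_chain, num_of_samples, num_of_variables]:
# add a leading axis for the single chain, which gives one rhat per parameter
rhat = compute_rhat(sampler.samples[np.newaxis, :, :])

print("max rhat of Metropolis Hastings : ", rhat.max())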

## Question B
(code) Implement the variational approximation we studied in the course to obtain an
approximation to the posterior over model parameters (you should already have an
implementation from the lab sessions). [10] 

In [None]:
%config InlineBackend.figure_format = 'retina'
import numpy as np
import scipy as scipy
import scipy.spatial
import time
import random

import matplotlib 
import matplotlib.font_manager
import matplotlib.pyplot as plt
import seaborn as sns
#matplotlib.rc_file('~/.config/matplotlib/matplotlibrc')
import warnings
import pandas as pd
import torch
import torch.nn as nn

warnings.filterwarnings("ignore")
def set_seed(seed: int=0):
    np.random.seed(seed)
    torch.manual_seed(seed)
    
def args_as_tensors(*index):
    """A simple decorator to convert numpy arrays to torch tensors"""
    def decorator(method):
        def wrapper(*args, **kwargs):
            converted_args = [torch.tensor(a).float() 
                              if i in index and type(a) is np.ndarray else a 
                              for i, a in enumerate(args)]
            return method(*converted_args, **kwargs)
        return wrapper  
    return decorator

In [None]:
class Distribution(nn.Module):  
    pass

class Bernoulli(Distribution):
    @args_as_tensors(1, 2)
    def logdensity(self, y, p):
        return y * torch.log(p + jitter) + (1-y) * torch.log(1 - p + jitter)

In [None]:
class NormalDiagonal(Distribution):

    @property
    def var(self):
        return self.logvar.exp()
    
    def extra_repr(self):
        return 'train=%s' % self.trainable
    
    def __init__(self, d, train=True):
        super(NormalDiagonal, self).__init__()
        # stored as `trainable` so that nn.Module.train() is not shadowed
        self.trainable = train
        self.d = d
        self.mean = torch.nn.Parameter(torch.zeros(d), requires_grad=train)
        self.logvar = torch.nn.Parameter(torch.zeros(d), requires_grad=train)
                                    
    def sample(self, n=1):
        # reparameterization trick: mean + eps * std with eps ~ N(0, I)
        eps = torch.randn(n, self.d)
        samples = self.mean + eps * torch.sqrt(self.var)

        return samples

In [None]:
from functools import total_ordering

_KL_REGISTRY = {}  # Source of truth mapping a few general (type, type) pairs to functions.
_KL_MEMOIZE = {}  # Memoized version mapping many specific (type, type) pairs to functions.

@total_ordering
class _Match(object):
    __slots__ = ['types']

    def __init__(self, *types):
        self.types = types

    def __eq__(self, other):
        return self.types == other.types

    def __le__(self, other):
        for x, y in zip(self.types, other.types):
            if not issubclass(x, y):
                return False
            if x is not y:
                break
        return True

def _dispatch_kl(type_q, type_p):
    matches = [(super_q, super_p) for super_q, super_p in _KL_REGISTRY
               if issubclass(type_q, super_q) and issubclass(type_p, super_p)]
    if not matches:
        return NotImplemented
    left_q, left_p = min(_Match(*m) for m in matches).types
    right_p, right_q = min(_Match(*reversed(m)) for m in matches).types
    left_fun = _KL_REGISTRY[left_q, left_p]
    right_fun = _KL_REGISTRY[right_q, right_p]
    if left_fun is not right_fun:
        # `warnings` is imported at the top of this cell; no logger is defined in this notebook
        warnings.warn('Ambiguous kl_divergence({}, {}). Please register_kl({}, {})'.format(
            type_q.__name__, type_p.__name__, left_q.__name__, right_p.__name__))
    return left_fun


def register_kl(type_q, type_p):
    """
    Decorator to register a pairwise function with kl_divergence.
    Usage:

        @register_kl(Normal, Normal)
        def kl_normal_normal(q, p):
            # insert implementation here
    """
    # both arguments must be subclasses of the Distribution base class defined above
    if not (isinstance(type_q, type) and issubclass(type_q, Distribution)):
        raise TypeError('Expected type_q to be a Distribution subclass but got {}'.format(type_q))
    if not (isinstance(type_p, type) and issubclass(type_p, Distribution)):
        raise TypeError('Expected type_p to be a Distribution subclass but got {}'.format(type_p))
    
    def decorator(fun):
        _KL_REGISTRY[type_q, type_p] = fun
        _KL_MEMOIZE.clear()  # reset since lookup order may have changed
        print('KL divergence between \'%s\' and \'%s\' registered.' % (type_q.__name__, type_p.__name__))
        return fun
    return decorator


def kl_divergence(q, p):
    r"""Compute Kullback-Leibler divergence KL(q||p) between two distributions."""
    try:
        fun = _KL_MEMOIZE[type(q), type(p)]
    except KeyError:
        fun = _dispatch_kl(type(q), type(p))
        _KL_MEMOIZE[type(q), type(p)] = fun
    if fun is NotImplemented:
        raise NotImplementedError('KL divergence for pair %s - %s not registered' % (type(q).__name__,
                                                                                     type(p).__name__))
    return fun(q, p)

In [None]:
@register_kl(NormalDiagonal, NormalDiagonal)
def _normaldiagonal_normaldiagonal(q, p):

    kl = torch.log(p.var / q.var) + (q.var + torch.square((q.mean - p.mean))) / p.var  - 1
    
    return 1/2 * torch.sum(kl)
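
The closed form used above can be sanity-checked without torch. The following sketch reimplements the same expression for plain Python lists; the function name is hypothetical and the inputs are toy values chosen so that the result is easy to verify by hand.

```python
import math

def kl_diag_gaussians(mean_q, var_q, mean_p, var_p):
    """KL(q || p) for two Gaussians with diagonal covariance, summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mean_q, var_q, mean_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total

# Identical distributions: the divergence must vanish
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Unit-variance Gaussians whose means differ by 1: KL = 0.5 * 1^2
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
print(kl_same, kl_shifted)
```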

In [None]:
def VI_logistic(z):
    return 1/(1 + torch.exp(-z))

class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegression, self).__init__()
        
        self.prior_w = NormalDiagonal(input_dim,False)
        self.posterior_w = NormalDiagonal(input_dim)

    @args_as_tensors(1)
    def predict_y(self, X, mc_samples=1):
        
        w_samples = self.posterior_w.sample(mc_samples)
        w_samples = torch.unsqueeze(w_samples, 2)
        y_samples = VI_logistic(X @ w_samples)
        
        return y_samples
    
    def predict_vector(self, X, mc_samples=1):
        
        w_samples = self.posterior_w.sample(mc_samples)
        y_samples = X.float() @ w_samples.T
        prediction = torch.sum(y_samples) / mc_samples
        prediction = VI_logistic(prediction.clone())
        
        return prediction

In [None]:
class VariationalObjective(nn.Module):    
    def __init__(self, model, likelihood, N, mc_samples=1):
        super(VariationalObjective, self).__init__()
        self.N = N
        self.model = model
        self.likelihood = likelihood
        self.mc_samples = mc_samples
        
    def expected_loglikelihood(self, Xbatch, ybatch):
        
        ypred = self.model.predict_y(Xbatch,self.mc_samples)
        
        logliks = self.likelihood.logdensity(ybatch,ypred)
        
        logliks = torch.sum(logliks.clone())
                
        return self.N/self.mc_samples * logliks
    
    def kl(self):
        return _normaldiagonal_normaldiagonal(self.model.posterior_w,self.model.prior_w)
    
    def compute_objective(self, Xbatch, ybatch):
        logliks = self.expected_loglikelihood(Xbatch,ybatch)
                
        kl = self.kl()
        
        result = - logliks + kl
        
        return result

In [None]:
class Dataset():
    def __init__(self, X, y, minibatch_size):
        self.X = X
        self.y = y 
        self.minibatch_size = min(minibatch_size, len(self.X))
        self._i = 0  
    def next_batch(self):  
        if len(self.X) <= self._i + self.minibatch_size:
            shuffle = np.random.permutation(len(self.X))
            self.X = self.X[shuffle]
            self.y = self.y[shuffle]
            Xbatch = self.X[self._i:]
            ybatch = self.y[self._i:]
            self._i = 0
            return Xbatch, ybatch

        Xbatch = self.X[self._i:self._i + self.minibatch_size]
        ybatch = self.y[self._i:self._i + self.minibatch_size]
        self._i += self.minibatch_size
        return Xbatch, ybatch

In [None]:
dataset = Dataset(X_train, y_train, minibatch_size=1)

In [None]:
likelihood = Bernoulli()
model = LogisticRegression(200)

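# N scales the minibatch log-likelihood; in stochastic VI it is usually the number of training points
nelbo = VariationalObjective(model, likelihood, N=200, mc_samples=200)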

In [None]:
class Summary:
    @property
    def data(self):
        data = pd.DataFrame(self._data, columns=['step', self.name, 'time'])
        data.time = data.time - data.time.iloc[0]
        return data
    
    def __init__(self, name):
        """A simple class to store some values during optimization"""
        self.name = str(name)
        self._data = []
    
    def append(self, step, value):
        #self._data.append([step, float(value.detach().numpy()), time.time()])
        self._data.append([step, float(value), time.time()])

### Variational Inference Training

In [None]:
nelbo_summary = Summary('nelbo')
nll_summary = Summary('expected_loglik')
kl_summary = Summary('kl')

optimizer = torch.optim.SGD(nelbo.parameters(), lr=1e-7, momentum=0.9)
#optimizer = torch.optim.Adam(nelbo.parameters())

num_iterations = 20000 #50000

for step in range(num_iterations):
    
    optimizer.zero_grad()
    
    Xbatch, ybatch = dataset.next_batch()
    loss = nelbo.compute_objective(Xbatch,ybatch)
        
    nelbo_summary.append(step, loss.detach().numpy())    
    nll_summary.append(step, loss.detach().numpy() - nelbo.kl().detach().numpy())
    kl_summary.append(step, nelbo.kl().detach().numpy())
    
    loss.backward()
    
    optimizer.step()
    
    print('Epoch [{}/{}], Loss: {:.4f}'.format(step, num_iterations, loss.item()),end="\r")

In [None]:
fig, axs = plt.subplots(1, 2, figsize=[10, 3])

nelbo_summary.data.plot(x='step', y='nelbo', ax=axs[0]);
nll_summary.data.plot(x='step', y='expected_loglik', ax=axs[1], c='C1');
kl_summary.data.plot(x='step', y='kl', ax=axs[1], c='C2');
axs[1].semilogy();
fig.suptitle('Optimization of the NELBO', y=1.02)
axs[0].margins(0, 0.05)

## Question C
(code) Based on samples from the posterior over model parameters, write a function
that computes the predictive distribution, and write the necessary functions to evaluate
classification metrics such as the log-likelihood on test data and error rate. [10]

## Classification metrics

In [None]:
def log_likelihood(y,p):
    value = 0
    for i in range(len(p)):
        value += y[i] * np.log(p[i] + jitter) + (1 - y[i]) * np.log(1 - p[i] + jitter)
    return value/len(p)

## Predictive distribution

### Metropolis-Hastings Prediction

In [None]:
def MH_predict(x_new, w_samples):
    p = 0
    for w in w_samples[-200:]:
        p += MH_logistic(w.T@x_new)
    return p/len(w_samples[-200:])
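
The function above averages the sigmoid over posterior samples, which is not the same as applying the sigmoid to the average weight vector. The toy sketch below illustrates the difference; the helper names and the two weight samples are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mc_predictive(x, weight_samples):
    """Monte Carlo estimate of p(y=1 | x): average the sigmoid over the samples."""
    return sum(sigmoid(dot(w, x)) for w in weight_samples) / len(weight_samples)

def plug_in_predictive(x, weight_samples):
    """Plug-in estimate: sigmoid evaluated at the mean weight vector."""
    mean_w = [sum(col) / len(weight_samples) for col in zip(*weight_samples)]
    return sigmoid(dot(mean_w, x))

weight_samples = [[0.0], [4.0]]   # two made-up posterior samples of a single weight
x = [1.0]

p_mc = mc_predictive(x, weight_samples)
p_plug = plug_in_predictive(x, weight_samples)
print(round(p_mc, 3), round(p_plug, 3))
```

When the samples disagree, the Monte Carlo average stays closer to 0.5 than the plug-in estimate, which is how posterior uncertainty on the weights shows up in the prediction.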

In [None]:
MH_prediction = []
for step in range(len(X_test)):
    MH_prediction.append(MH_predict(X_test[step],sampler.samples))
    print('Metropolis Hastings Validation : Epoch [{}/{}]'.format(step, len(X_test)),end="\r")

In [None]:
MH_rounded_prediction = threshold(MH_prediction,0.5)

In [None]:
confusion_matrix(MH_rounded_prediction,y_test,"Metropolis Hastings Confusion Matrix")
print("log-likelihood : ", log_likelihood(y_test,MH_prediction))

### Variational Inference Prediction

In [None]:
with torch.no_grad():
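    # average over the 200 Monte Carlo samples of the weights (axis 0) to get the predictive mean
    VI_prediction = nelbo.model.predict_y(X_test,200).detach().numpy().mean(axis=0)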

In [None]:
VI_rounded_prediction = threshold(VI_prediction,0.5)

In [None]:
confusion_matrix(VI_rounded_prediction,y_test, "Variational Inference Confusion Matrix")
print("log-likelihood : ", log_likelihood(y_test,VI_prediction)[0])

## Question D
(text) Comment on the tuning of the Metropolis-Hastings algorithm, and how to
guarantee that samples are representative of samples of the posterior over model
parameters. [5] 

<span style="color:black">
To tune the Metropolis-Hastings algorithm, I first slightly reduced the step size compared to the lab (from 0.5 to 0.1). The number of steps is not critical for my implementation, since it is quite fast and achieves a low $\hat R$ value in fewer than 1000 steps. The tuning mainly focused on the data preprocessing (undersampling + standardization) and on the starting point, which has a huge impact on the convergence speed.
<br/><br/>
I tried three different starting points: the zero vector, a random point drawn from a zero-mean Gaussian with unit variance on each parameter, and the first sample of the training set. The first sample of the training set gave the most consistent results with a small number of steps.
<br/><br/>
To guarantee that the samples are representative of the posterior over model parameters, we could run MH several times from different starting points and verify that it converges each time. I also use burn-in: the first samples are discarded, which gives the Markov chain time to reach its equilibrium distribution.
<br/><br/>
Another possible check is to run independent chains from different starting points and compare the distributions they produce. Because of the high dimensionality, visualizing the paths of the chains was not a practical way to check that the samples are representative.
<br/><br/>
Finally, I used the potential scale reduction factor $\hat R$ and verified that its value was below 1.05, which indicates that the chains have mixed well.
</span>
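
The multi-chain check described above can be sketched on a one-dimensional toy problem. The example below is a simplified illustration, not the notebook's sampler: the target is a standard normal, all names are hypothetical, the acceptance test is done in log space, and $\hat R$ is the basic (non rank-normalized) version.

```python
import math
import random

def log_target(w):
    """Unnormalized log density of the toy target N(0, 1)."""
    return -0.5 * w * w

def run_chain(start, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings with a Gaussian proposal."""
    chain = [start]
    for _ in range(n_steps):
        current = chain[-1]
        proposal = current + rng.gauss(0.0, step_size)
        log_ratio = log_target(proposal) - log_target(current)
        # accept with probability min(1, exp(log_ratio)); 1 - random() lies in (0, 1]
        if math.log(1.0 - rng.random()) < log_ratio:
            chain.append(proposal)
        else:
            chain.append(current)
    return chain

def basic_rhat(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    variances = [sum((x - m) ** 2 for x in c) / (n - 1) for c, m in zip(chains, means)]
    grand_mean = sum(means) / len(means)
    between = n * sum((m - grand_mean) ** 2 for m in means) / (len(chains) - 1)
    within = sum(variances) / len(variances)
    return math.sqrt(((n - 1) / n * within + between / n) / within)

rng = random.Random(0)
burn_in = 1000
starts = (-5.0, -2.0, 2.0, 5.0)   # deliberately dispersed starting points
chains = [run_chain(s, 6000, 1.0, rng)[burn_in:] for s in starts]
rhat_toy = basic_rhat(chains)
print(rhat_toy)
```

If the chains had not forgotten their starting points, the between-chain variance would dominate and the value would sit clearly above 1.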

## Question E
(text) Comment on the tuning of the variational inference algorithm, and discuss the
behavior of the optimization with respect to the choice of the optimizer/step-size. [5]

<span style="color:black">
As mentioned in the lab, the number of mc_samples and the batch size must be greatly reduced to scale variational inference to a large dataset. I finally used a batch size of 1 (which explains why the loss is noisy) and 200 mc_samples, which gives a good trade-off between convergence speed and accuracy.
<br/><br/>
Since we use the SGD optimizer, which unlike Adam has a fixed learning rate, the learning rate must be tuned carefully. After several tries, I chose 1e-7. Convergence is relatively slow, but larger learning rates made the gradients explode and led to a NaN loss.
</span>
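
For reference, the objective being minimized can be written as the following stochastic estimate, where $B$ is the minibatch, $S$ the number of Monte Carlo samples (`mc_samples`) and $w_s$ a sample from the approximate posterior $q(w)$. These symbols are notation introduced here; with a batch size of 1, $|B| = 1$ and the scaling factor reduces to the $N/S$ used in `expected_loglikelihood`.

```latex
\text{NELBO} \approx -\frac{N}{|B|\,S}\sum_{i \in B}\sum_{s=1}^{S} \log p(y_i \mid x_i, w_s)
+ \mathrm{KL}\left(q(w)\,\|\,p(w)\right),
\qquad w_s \sim q(w)
```

The first term is noisy because of both the minibatch and the weight samples, which is consistent with the noisy loss curve observed with a batch size of 1.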

## Question F
(text) Report the error metrics implemented in point 2.B. above and the confusion matrix
on the test data. Discuss logistic regression performance with respect to the
performance of Bayesian linear regression. [5]


|     | Bayesian Linear Regression | Metropolis-Hastings | Variational Inference |
| --- | --- | --- | --- |
| Accuracy | 0.85 | 0.82 | 0.75 |
| Sensitivity | 0.85 | 0.74 | 0.64 |
| Specificity | 0.85 | 0.86 | 0.80 |
| AUC | 0.77 | 0.73 | 0.63 |

<span style="color:black">
According to these results, Bayesian linear regression is slightly more accurate than the two logistic regressions. However, with longer training, Metropolis-Hastings and variational inference reach similar results (around 0.8 accuracy and 0.75 AUC). In terms of time and computational cost, Bayesian linear regression therefore seems the more attractive option on this dataset. With an even larger dataset and less Gaussian features, the logistic regressions could be more competitive.
<br/><br/>
The three confusion matrices show a satisfactory diagonal, with a limited number of false positives and false negatives.
</span>

## Question G
(text) Compare the uncertainties on predictions obtained by the Metropolis-Hastings
algorithm and variational inference. First, compare the log-likelihood on test data as a 
global metric to assess which inference method yields better uncertainty quantification.
Second, pick a few test points for which the mean of the predictive distribution is (a)
around 0.5 (b) giving a correct prediction (c) giving a wrong prediction, and visualize/
discuss what the predictive distribution looks like. Discuss the difference between the
Metropolis-Hastings algorithm and variational inference. [15]


|     | Metropolis-Hastings | Variational Inference |
| --- | --- | --- |
| log-likelihood | -0.94 | -2.51 |
    
<span style="color:black">
First of all, the more steps we run with either method, the lower the uncertainty. With the parameters I defined, the histograms I plotted suggest that Metropolis-Hastings is slightly more uncertain than variational inference, even though the test log-likelihoods are -0.94 versus -2.51. Metropolis-Hastings has the better log-likelihood because of its better accuracy. The distributions show that the logistic regressions predict values close to 0 and 1, unlike Bayesian linear regression, whose predictions are Gaussian-shaped around 0.5. Restricting the analysis to predictions around 0.5, I do not notice any difference between MH and VI in the distributions of correct and incorrect predictions. Perhaps differences would appear with more training for the two models.
<br/><br/>
More generally, variational inference appears more scalable, since a short training is enough to reach a reasonable result and tuning was overall easier than with Metropolis-Hastings. However, with the current settings, i.e. with slightly longer training, Metropolis-Hastings is more accurate, as seen in Question F. This dataset is probably not large enough to show the scalability advantage of variational inference. Another advantage of variational inference is that we can try many different families (here a Gaussian with diagonal covariance) to represent the posterior. With MCMC, given enough samples and at a high computational and time cost, any posterior distribution can be approximated.
</span>
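
To see why the test log-likelihood is used as a measure of uncertainty quality, consider two sets of made-up predictions with the same accuracy but different confidence. The sketch below uses a hypothetical helper and toy numbers; it is not computed from the notebook's models.

```python
import math

def mean_log_likelihood(y_true, probs, eps=1e-10):
    """Average Bernoulli log-likelihood of the labels under the predicted probabilities."""
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, probs))
    return total / len(y_true)

y_true = [1, 1, 0, 0]
# Both models are right on samples 1 and 3 and wrong on samples 2 and 4
hesitant = [0.6, 0.4, 0.4, 0.6]
overconfident = [0.99, 0.01, 0.01, 0.99]

ll_hesitant = mean_log_likelihood(y_true, hesitant)
ll_overconfident = mean_log_likelihood(y_true, overconfident)
print(round(ll_hesitant, 4), round(ll_overconfident, 4))
```

Both models have the same error rate, but the overconfident one is penalized much more heavily for its confident mistakes, so a lower log-likelihood can signal poorly calibrated uncertainty rather than only lower accuracy.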

In [None]:
def plot_predictive_distribution(prediction,title):
    plt.hist(prediction, bins = 100)
    
    plt.xlabel("Predicted values")
    plt.ylabel("Count")
    plt.title(title)
    
    plt.show()

In [None]:
plot_predictive_distribution(y_posterior_mu,"Bayesian Linear Regression predictive distribution")

In [None]:
plot_predictive_distribution(MH_prediction,"Metropolis Hastings predictive distribution")

In [None]:
plot_predictive_distribution(VI_prediction, "Variational Inference predictive distribution")

In [None]:
def uncertainties(prediction,rounded_prediction,control):
    
    correct = 0
    incorrect = 0
    
    correct_uncertainties = []
    incorrect_uncertainties = []
    
    for i in range(len(prediction)):
                
        if(abs(prediction[i] - 0.5) < 0.2):
            if(rounded_prediction[i] == control[i]):
                correct_uncertainties.append(prediction[i])
                correct += 1
            else:
                incorrect_uncertainties.append(prediction[i])
                incorrect += 1
        
    correct_rate = correct / (correct + incorrect)
    incorrect_rate = incorrect / (correct + incorrect)
    
        
    print("Correct uncertainties rate : ", correct_rate)
    print("Incorrect uncertainties rate : ", incorrect_rate)

    
    return correct_uncertainties, incorrect_uncertainties

In [None]:
correct_MH_uncertainties, incorrect_MH_uncertainties = uncertainties(MH_prediction,MH_rounded_prediction,y_test)

In [None]:
correct_VI_uncertainties, incorrect_VI_uncertainties = uncertainties(VI_prediction,VI_rounded_prediction,y_test)

In [None]:
def plot_double_predictive_distribution(correct_prediction,incorrect_prediction,title):
    plt.hist(correct_prediction, bins = 20, color="green", alpha=0.5, label="correct")
    plt.hist(incorrect_prediction, bins = 20, color="red", alpha=0.5, label="incorrect")
    plt.legend(loc='upper right')
    plt.xlabel("Predicted values")
    plt.ylabel("Count")
    plt.title(title)
    
    plt.show()

In [None]:
plot_double_predictive_distribution(np.array(correct_MH_uncertainties), np.array(incorrect_MH_uncertainties), "Metropolis-Hastings uncertainties distribution")

In [None]:
plot_double_predictive_distribution(np.array(correct_VI_uncertainties), np.array(incorrect_VI_uncertainties), "Variational Inference uncertainties distribution")