# Bayesian classification

In [None]:
# make sure the notebook reloads the module each time we modify it
%load_ext autoreload
%autoreload 2

# Interactive backend, so that you can zoom on plots (comment out the next line if you don't need it)
%matplotlib notebook

In [None]:
import classification_with_solutions as cl
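import sklearn as skl
# Import the submodules reached below as skl.<name>; `import sklearn` alone may not load them
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split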
import numpy as np
import pymc3 as pm

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")

## Preparing data and utilities

In [None]:
X, y = skl.datasets.load_wine(return_X_y=True)

In [None]:
sample_size, dimension = X.shape
print("sample size, dimension=", sample_size, dimension)
print("class labels are", np.unique(y))

**Question:** Perform a PCA, and plot the projection onto the first two PCs. Color your points by class. How easy does the classification task look?

In [None]:
# Quick PCA for visualization
cl.perform_and_visualize_PCA(X, y)
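
For reference, the projection step of a PCA can be sketched with a plain SVD. This is only an illustration under stated assumptions: `pca_project` is a hypothetical helper, not the function from the companion file, and the data below are a synthetic stand-in with the same shape as the wine dataset.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    # Fraction of the total variance carried by each retained component.
    explained_variance_ratio = (S**2 / np.sum(S**2))[:n_components]
    return scores, explained_variance_ratio

# Synthetic stand-in (assumption): 178 points, 13 features with unequal scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(178, 13)) * np.linspace(3.0, 0.5, 13)
scores, ratio = pca_project(X_demo)
print(scores.shape, np.round(ratio, 2))
```

With the notebook's `X` and `y`, one would then call something like `plt.scatter(scores[:, 0], scores[:, 1], c=y)` to color the projected points by class.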

This is a classification task with 3 classes. The dimension $d$ is fairly high for such a small sample size $n$, so the task naturally calls for a Bayesian approach. In the following, we'll fit a simple three-class logistic regression. More precisely, we'll take $b\in\mathbb{R}^3$, $\theta_k\in\mathbb{R}^d$ for $k=0,1,2$, and $y\vert x,\theta, b$ to be categorical (a multinomial with a single trial), with class probabilities that depend smoothly on the linear scores $b_k+\theta_k^T x$. Specifically, we consider
$$
p(y = k \vert x, b, \theta) \propto e^{b_k+\theta_k^T x}.
$$
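
To make the normalization explicit, here is a minimal NumPy sketch of these class probabilities (a softmax over the linear scores). The function name `class_probabilities` and the random inputs are assumptions made for illustration only.

```python
import numpy as np

def class_probabilities(X, b, theta):
    """Return an (n, K) array of p(y = k | x, b, theta).

    X is (n, d), b is (K,), theta is (K, d).
    """
    logits = b + X @ theta.T                              # b_k + theta_k^T x
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability only
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Random inputs (assumption), with K = 3 classes and d = 13 features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 13))
b_demo = rng.normal(size=3)
theta_demo = rng.normal(size=(3, 13))
probs = class_probabilities(X_demo, b_demo, theta_demo)
print(probs.sum(axis=1))

# Adding the same constant to every b_k leaves the probabilities unchanged.
probs_shifted = class_probabilities(X_demo, b_demo + 10.0, theta_demo)
```

The last line illustrates that the likelihood alone does not pin down the overall level of $b$, which is worth keeping in mind when choosing priors and when reading posterior marginals.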

**Question:** Give the DAG for the above multinomial regression model.

**Question:**  What loss function do you want to take?

Since we ultimately want to compare the posterior marginals of different components of $b$ and $\theta$, it is useful to standardize the features. 

In [None]:
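from sklearn import preprocessing

X = preprocessing.scale(X)
# Check all columns at once, rather than one arbitrary column for each statistic
print(
    "Now every column of X has mean", np.round(X.mean(axis=0), 2),
    "and variance", np.round(X.var(axis=0), 2)
)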

We will also want to predict some unknown labels, so let's set a test set aside. Beware that comparing average prediction errors on such a small dataset is not meaningful: with a test set this small, they are unlikely to be good estimators of the generalization error. We will thus just look at confusion matrices and check that our classifiers are not completely off.

In [None]:
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(
    X, y, test_size=.2, random_state=3)

## A simple MAP baseline

Try first with `scikit-learn`'s logistic regression. Check out the `multinomial` option, and note how $\ell_2$ regularization is applied by default, at least in the current version (0.22.2).

**Question:** The output of sklearn is thus a MAP estimate, but for what prior? What about the other possible values of the `penalty` option?

In [None]:
skl_intercept, skl_coeffs, skl_predictions = cl.get_sklearn_results(
    X_train, y_train, X_test)
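
For readers without the companion file, a function with the same return signature could look roughly like the sketch below. It is an assumption about what `get_sklearn_results` does, not its actual code: `sketch_sklearn_results` is a hypothetical name, and the model relies on the library defaults, with `C` the inverse regularization strength.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

def sketch_sklearn_results(X_train, y_train, X_test, C=1.0):
    """Fit a regularized logistic regression and predict test labels (hypothetical helper)."""
    # max_iter is raised so that the default solver has room to converge.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    return model.intercept_, model.coef_, model.predict(X_test)

# Same preparation as in the notebook: standardize, then hold out 20% of the points.
X_all, y_all = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    scale(X_all), y_all, test_size=.2, random_state=3)
intercept, coeffs, predictions = sketch_sklearn_results(X_tr, y_tr, X_te)
print(intercept.shape, coeffs.shape, predictions.shape)
```

Smaller values of `C` mean stronger regularization, so varying `C` is one way to see how strongly the penalty pulls the coefficients towards zero.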

In [None]:
# How good are the predictions?
confusion_matrix = skl.metrics.confusion_matrix(y_test, skl_predictions)
print(confusion_matrix)

In [None]:
# What's the MAP value for the parameters?
print(skl_intercept)
print(skl_coeffs)

## Now we go Bayesian

**Question:** What priors do you want to try? *Hint: make sure the MAP of sklearn is not outside the support of your prior.*

**Question:** Now write your DAG in PyMC3, and sample from the posterior using NUTS. Put your code in the companion Python file to make the following line work. Note that there are now two outputs: `trace` and `ppc`. For now, only `trace` matters: return any placeholder for `ppc`, we'll come back to it later on.

In [None]:
trace, ppc = cl.get_logistic_results(X_train, y_train, X_test)

**Question:** Check how well your chain has mixed.
*Hint: remember the three convergence diagnostics (visual inspection, Gelman-Rubin, Geweke).*
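
As a reminder of what the Gelman-Rubin diagnostic measures, here is a minimal NumPy sketch of the basic (non-split) statistic for one scalar parameter. The function name `gelman_rubin` and the synthetic chains are assumptions; PyMC3 and ArviZ ship their own, more refined versions, which are preferable in practice.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor.

    chains: (m, n) array holding m chains of n draws of one scalar parameter.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # average within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

# Synthetic chains (assumption): four chains on the same target,
# then the same chains with two of them shifted to mimic poor mixing.
rng = np.random.default_rng(2)
mixed = rng.normal(size=(4, 1000))
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 are compatible with good mixing, while values clearly above 1 indicate that the chains disagree with each other.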

**Question:** Plot credible intervals on the parameters. Compare what happens across the three classes.

## Making predictions
Now let us predict the labels of the held-out "test" dataset.

**Question:** How do we do this as good Bayesians, now that we have a posterior sample? *Hint: check your course notes for the keyword "posterior predictive"*.

**Question:** Implement the Bayes rule for predictions (actually an approximate Bayes rule, since you're going to use the MCMC sample in `trace`). You should complete the function `get_logistic_results` so that it outputs two `pymc3` traces: one targeting the posterior on the parameters, and one of labels targeting the posterior predictive. Once you have the posterior predictive sample, find the argmax for each test point, and print the confusion matrix.
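
The averaging behind the posterior predictive can also be sketched directly in NumPy, without going through PyMC3's own posterior predictive sampling. Everything below is an illustration under assumptions: the function name is hypothetical, and the posterior draws are faked with random numbers. With a real trace, arrays of the right shape would come from something like `trace["b"]` and `trace["theta"]`, where the variable names depend on how you wrote your model.

```python
import numpy as np

def posterior_predictive_probs(X_new, b_samples, theta_samples):
    """Monte Carlo estimate of p(y = k | x_new, data).

    X_new: (n, d); b_samples: (S, K); theta_samples: (S, K, d). Returns (n, K).
    """
    # Linear scores for every posterior draw s, test point n and class k.
    logits = b_samples[:, None, :] + np.einsum("nd,skd->snk", X_new, theta_samples)
    logits -= logits.max(axis=2, keepdims=True)   # numerical stability only
    probs = np.exp(logits)
    probs /= probs.sum(axis=2, keepdims=True)
    # Average the class probabilities over the posterior draws.
    return probs.mean(axis=0)

# Fake posterior draws (assumption): S = 200 draws, K = 3 classes, d = 13 features.
rng = np.random.default_rng(3)
b_fake = rng.normal(size=(200, 3))
theta_fake = rng.normal(size=(200, 3, 13))
X_new = rng.normal(size=(4, 13))
pp = posterior_predictive_probs(X_new, b_fake, theta_fake)
predicted_labels = pp.argmax(axis=1)
print(np.round(pp, 2), predicted_labels)
```

Taking the argmax of the averaged probabilities is the Bayes prediction under 0-1 loss; a different loss function would lead to a different decision rule on the same posterior predictive.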