# Homework 5: Conditional Probability Models


## 1. Introduction

In this homework we'll be investigating conditional probability models,
with a focus on various interpretations of logistic regression, with
and without regularization. Along the way we'll discuss the calibration
of probability predictions, both in the limit of infinite training
data and in a more bare-hands way. On the Bayesian side, we'll recreate
from scratch the Bayesian Gaussian linear regression example we discussed
in lecture. We'll also have several optional problems that work through
many basic concepts in Bayesian statistics via one of the simplest
problems there is: estimating the probability of heads in a coin flip.
Later we'll extend this to estimating click-through rates in mobile
advertising. In the process, we'll encounter empirical Bayes and
hierarchical models.

## 2. From Scores to Conditional Probabilities

2.1 Write $\mathbb{E}_{y}\left[\ell\left(yf(x)\right)\mid x\right]$ in terms
of $\pi(x)$, $\ell(-f(x))$, and $\ell\left(f(x)\right)$, where
$\pi(x)=\mathbb{P}\left(y=1\mid x\right)$. (Hint: Use the fact that
$y\in\left\{ -1,1\right\} $.)

2.2 Show that the Bayes prediction function $f^{*}(x)$ for the exponential
loss function $\ell\left(y,f(x)\right)=e^{-yf(x)}$ is given by 
$$
f^{*}(x)=\frac{1}{2}\ln\left(\frac{\pi(x)}{1-\pi(x)}\right),
$$
where we've assumed $\pi(x)\in\left(0,1\right)$. Also, show that
given the Bayes prediction function $f^{*}$, we can recover the conditional
probabilities by
$$
\pi(x)=\frac{1}{1+e^{-2f^{*}(x)}}.
$$
[Hint: Differentiate the expression in the previous problem with
respect to $f(x)$. To make things a little less confusing, and also
to write less, you may find it useful to change variables a bit: Fix
an $x\in\mathcal{X}$. Then write $p=\pi(x)$ and $\hat{y}=f(x)$. After substituting
these into the expression you had for the previous problem, you'll
want to find the $\hat{y}$ that minimizes the expression. Use differential
calculus. Once you've done it for a single $x$, it's easy to write
the solution as a function of $x$.]
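
The stated formula can be sanity-checked numerically before attempting the derivation. The sketch below is only a check, not a proof: it fixes a value $p=\pi(x)$, minimizes the conditional expected exponential loss over $\hat{y}$ with a golden-section search, and compares the result to $\frac{1}{2}\ln\left(\frac{p}{1-p}\right)$. The helper names `expected_exp_loss` and `minimize_unimodal` are illustrative and are not part of the provided support code.

```python
import math


def expected_exp_loss(p, y_hat):
    """Conditional expected exponential loss at a fixed x, where p = P(y = 1 | x)."""
    return p * math.exp(-y_hat) + (1.0 - p) * math.exp(y_hat)


def minimize_unimodal(g, lo=-20.0, hi=20.0, tol=1e-10):
    """Golden-section search for the minimizer of a unimodal function g on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if g(c) < g(d):
            b = d  # the minimizer lies in [a, d]
        else:
            a = c  # the minimizer lies in [c, b]
    return 0.5 * (a + b)


for p in (0.1, 0.5, 0.9):
    numeric = minimize_unimodal(lambda y_hat: expected_exp_loss(p, y_hat))
    closed_form = 0.5 * math.log(p / (1.0 - p))
    # Inverting the closed form should give back the probability we started from.
    recovered_p = 1.0 / (1.0 + math.exp(-2.0 * closed_form))
    print(f"p={p}: numeric={numeric:.6f}, closed form={closed_form:.6f}, "
          f"recovered p={recovered_p:.6f}")
```

The numeric and closed-form minimizers should agree to several decimal places. The search is limited by floating-point precision near the flat bottom of the objective, so exact equality is not expected.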

2.3 Show that the Bayes prediction function $f^{*}(x)$ for the logistic
loss function $\ell\left(y,f(x)\right)=\ln\left(1+e^{-yf(x)}\right)$
is given by
$$
f^{*}(x)=\ln\left(\frac{\pi(x)}{1-\pi(x)}\right)
$$
and the conditional probabilities are given by
$$
\pi(x)=\frac{1}{1+e^{-f^{*}(x)}}.
$$
Again, we may assume that $\pi(x)\in(0,1)$.
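
To make the practical consequence of the two results concrete, here is a minimal sketch that converts a score into a probability under each loss, using only the formulas stated in 2.2 and 2.3. The function names `score_to_prob` and `prob_to_score` are hypothetical helpers introduced for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_to_prob(score, loss):
    """Map a Bayes-optimal score f*(x) to pi(x) for the given loss."""
    if loss == "exponential":
        return sigmoid(2.0 * score)  # pi(x) = 1 / (1 + exp(-2 f*(x)))
    if loss == "logistic":
        return sigmoid(score)        # pi(x) = 1 / (1 + exp(-f*(x)))
    raise ValueError(f"unknown loss: {loss!r}")


def prob_to_score(p, loss):
    """Map pi(x) in (0, 1) to the Bayes-optimal score f*(x) for the given loss."""
    log_odds = math.log(p / (1.0 - p))
    if loss == "exponential":
        return 0.5 * log_odds
    if loss == "logistic":
        return log_odds
    raise ValueError(f"unknown loss: {loss!r}")


for loss in ("exponential", "logistic"):
    print(loss, score_to_prob(1.0, loss))
```

The same score of 1 corresponds to a probability of about 0.88 under the exponential loss and about 0.73 under the logistic loss, so a score can only be read as a probability once the loss it was trained with is known.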

## 3. Logistic Regression

## 4. Bayesian Logistic Regression with Gaussian Priors

Let's return to the setup described in Section 3.1 and, in particular, to the Bernoulli regression setting with the logistic transfer function. We had the following hypothesis space of conditional
probability functions:
$$
\mathcal{F}_{\text{prob}}=\left\{ x\mapsto\phi(w^{T}x)\mid w\in\mathbb{R}^{d}\right\} .
$$
Now let's consider the Bayesian setting, where we induce a prior on
$\mathcal{F}_{\text{prob}}$ by taking a prior $p(w)$ on the parameter $w\in\mathbb{R}^{d}$.
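
One way to see what "inducing a prior on the hypothesis space" means is to sample from it. The sketch below assumes, purely for illustration, a zero-mean isotropic Gaussian prior $w\sim\mathcal{N}(0,\sigma^{2}I)$ and takes $\phi$ to be the logistic function. Each draw of $w$ gives one conditional probability function $x\mapsto\phi(w^{T}x)$. The name `sample_prior_functions` and its parameters are hypothetical.

```python
import math
import random


def phi(z):
    """Logistic transfer function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_prior_functions(d, num_samples, prior_std=1.0, seed=0):
    """Draw w ~ N(0, prior_std^2 I) and return the induced functions x -> phi(w^T x).

    The isotropic zero-mean prior is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    functions = []
    for _ in range(num_samples):
        w = [rng.gauss(0.0, prior_std) for _ in range(d)]
        # Bind w as a default argument so each closure keeps its own draw.
        functions.append(lambda x, w=w: phi(sum(wi * xi for wi, xi in zip(w, x))))
    return functions


x = [1.0, -0.5]
probs = [f(x) for f in sample_prior_functions(d=2, num_samples=5)]
print(probs)  # five prior draws of P(y = 1 | x) at this particular x
```

Increasing `prior_std` spreads the sampled weights out, which tends to push the sampled probabilities toward 0 or 1. A small `prior_std` keeps them near 0.5.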

## 5. Bayesian Linear Regression - Implementation

In this problem, we will implement Bayesian Gaussian linear regression,
essentially reproducing the example [from lecture](https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/13a.bayesian-regression.pdf#page=12),
which in turn is based on the example in Figure 3.7 of Bishop's *Pattern
Recognition and Machine Learning* (page 155). We've provided plotting
functionality in `support_code.py`. Your task is to complete `problem.py`. The
implementation uses `np.matrix` objects, and you are welcome to use the `np.matrix.getI` method.
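
Since `np.matrix` behaves differently from the more common `np.ndarray`, a short refresher may help. The sketch below uses made-up values that are unrelated to the homework data, and shows the three operations you are most likely to need: matrix multiplication with `*`, inversion with `getI`, and transposition with `.T`.

```python
import numpy as np

A = np.matrix([[2.0, 1.0],
               [1.0, 3.0]])
b = np.matrix([[1.0],
               [2.0]])

# For np.matrix, `*` is matrix multiplication (for np.ndarray it is elementwise).
Ab = A * b

# getI() returns the multiplicative inverse; the property A.I is equivalent.
A_inv = A.getI()

# .T is the transpose. Results stay 2-D, so a quadratic form is a 1x1 matrix
# and the scalar has to be extracted with explicit indexing.
quad_form = b.T * A_inv * b
value = quad_form[0, 0]

print(Ab)
print(A_inv)
print(value)
```

Note that indexing a row or column of an `np.matrix` still returns a 2-D matrix, which is a common source of shape surprises when mixing matrices with ordinary arrays.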

5.1 Implement `likelihood_func`.

5.2 Implement `get_posterior_params`.

5.3 Implement `get_predictive_params`.

5.4 Run `python problem.py` from inside the Bayesian Regression directory
to do the regression and generate the plots. This runs through the
regression with three different settings for the prior covariance.
You may want to change the default behavior of `support_code.make_plots`
from calling `plt.show` to saving the plots, so that you can include
them in your homework submission.

5.5 Comment on your results. In particular, discuss how each of the following
changes with sample size and with the strength of the prior: (i) the
likelihood function, (ii) the posterior distribution, and (iii) the
posterior predictive distribution.

5.6 Our work above was very much "full Bayes", in that rather than
coming up with a single prediction function, we have a whole posterior
distribution over prediction functions. However, sometimes we want a
single prediction function, and a common approach is to use the MAP
estimate -- that is, choose the prediction function that has the
highest posterior density. As we discussed in class, for this setting,
we can get the MAP estimate using ridge regression. Use ridge regression
to get the MAP prediction function corresponding to the first prior
covariance ($\Sigma=\frac{1}{2}I$, per the support code). What value
did you use for the regularization coefficient? Why?
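
As a starting point for the ridge computation, here is a minimal closed-form ridge solver. It is a sketch under one particular convention, minimizing $\|Xw-y\|^{2}+\lambda\|w\|^{2}$ with an unnormalized sum of squares, and it uses plain arrays rather than `np.matrix`. The function name `ridge_closed_form`, the synthetic data, and the value `lam=1.0` are placeholders chosen for illustration. In particular, `lam=1.0` is not the answer to the question above.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 over w (sum-of-squares convention)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1)
    d = X.shape[1]
    # Solve (X^T X + lam I) w = X^T y instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


# Synthetic data, unrelated to the data generated by the support code.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_hat = ridge_closed_form(X, y, lam=1.0)  # lam is an arbitrary placeholder
print(w_hat)
```

If your ridge objective averages the squared errors instead of summing them, the regularization coefficient has to be rescaled accordingly, so state which convention you use when justifying your choice.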