In [None]:
# make sure the notebook reloads the module each time we modify it
%load_ext autoreload
%autoreload 2

# make sure the displays are nice
%matplotlib inline
#figsize(12,8)

In [None]:
import hierarchical_modelling as hm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")
import numpy as np
import numpy.linalg as npl

# Practical \#2: using a hierarchical model to predict GDP as a function of ruggedness

This practical will be marked and has to be done individually. **Deadline is 2 December, 23:59**. Please email me a zip file containing 
* this Jupyter notebook, containing your answers. Write each answer immediately below the corresponding question. An answer consists of a few sentences in plain English in a Markdown cell, with a snippet of code in a code cell if necessary.
* an html version of the same notebook with all cells executed.
* the companion Python file, which you will have filled in. I should be able to run your notebook with the Python file placed in the same folder.


## Preparing data and utilities

In [None]:
# I prepared a class that fetches data for you
PM = hm.PracticalMaterial()
X, y = PM.fetch_data()

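# Let us plot the data, one panel per group
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6), sharey=True)

# Left panel: non-African nations (x= and y= are passed by keyword,
# which also works on seaborn versions that reject positional x, y)
sns.scatterplot(x=X["rest"], y=y["rest"], ax=ax[0])
ax[0].set(xlabel="Ruggedness index", ylabel="log GDP", title="Non-African nations")

# Right panel: African nations
sns.scatterplot(x=X["African"], y=y["African"], ax=ax[1])
ax[1].set(xlabel="Ruggedness index", ylabel="log GDP", title="African nations")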

plt.show()

The same geographer as for Practical \#1 comes back to you. She has a new hypothesis: GDP only increases with ruggedness for African nations. We have the same 170 points, with $y_i$ the log GDP, and $x_i$ the ruggedness index, for $i=1,\dots,170$. But now we also have a new binary variable $z_i\in\{0,1\}$, $i=1,\dots,170$, that indicates whether the corresponding country is in Africa. 

## Writing down a probabilistic model

We stick with linear regression given how small the dataset is. But we need a model in which we can formalize the geographer's question, and compare the slope of the regression line for African nations to the slope for the rest of the nations. In particular, for $k\in\{\text{"African"}, \text{"rest"}\}$, let $N^{(k)}$ be the number of nations of class $k$ in the dataset, and let $(X^{(k)}_i, y^{(k)}_i)_{i=1,\dots,N^{(k)}}$ be the corresponding $N^{(k)}$ data points.

We consider the two likelihoods
$$ 
p(\mathbf y^{(k)} \vert X^{(k)}, \mu^{(k)}, \beta^{(k)}, \sigma^{(k)}) = \prod_{i=1}^{N^{(k)}} \mathcal{N}\left(y_i^{(k)}\vert \mu^{(k)}+ \beta^{(k)} X^{(k)}_i, (\sigma^{(k)})^2\right), \quad k\in\{\text{"African"}, \text{"rest"}\}.
$$

**Question:** Let $k\in\{\text{"African"}, \text{"rest"}\}$. Assume that the features are normalized, in the sense that $\sum_{i=1}^{N^{(k)}} X^{(k)}_i = 0$. Let us put an improper prior $p(\mu^{(k)})\propto 1$ on the intercept $\mu^{(k)}$. Show that we can integrate $\mu^{(k)}$ out, so that
$$
p(\mathbf y^{(k)} \vert X^{(k)},\beta^{(k)}, \sigma^{(k)}) = \int p(\mathbf y^{(k)},\mu^{(k)} \vert X^{(k)},\beta^{(k)}, \sigma^{(k)}) \mathrm{d}\mu^{(k)} = \prod_{i=1}^{N^{(k)}} \mathcal{N}\left(y_i^{(k)}- m^{(k)}\vert \beta^{(k)} X^{(k)}_i, (\sigma^{(k)})^2\right), \quad k\in\{\text{"African"}, \text{"rest"}\},
$$
where $m^{(k)} = \frac{1}{N^{(k)}}\sum_{i=1}^{N^{(k)}} y^{(k)}_i$ denotes the within-class average label.

**Solution:** TBC.

Henceforth, we will assume that the data are normalized, so that $\sum_{i=1}^{N^{(k)}} X^{(k)}_i = 0$ and $m^{(k)}=0$, for $k\in\{\text{"African"}, \text{"rest"}\}$. We thus need to include this preprocessing in our code, and I did it for you in the method `GibbsSampler.load_and_normalize_data()` of the class `GibbsSampler`, which we shall meet shortly. Before that, we need to finish specifying our prior over the remaining variables $\beta^{(k)},\sigma^{(k)}$. In order to apply Gibbs sampling, we need the conditionals to be available in closed form. We thus look for a conjugate prior for linear regression: the normal-inverse Gamma prior. I recommend that you pause and read Sections 7.6.1-7.6.3 of Murphy to get familiar with this prior.
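The within-class centering described above can be sketched as follows. This is only an illustration on synthetic data: the helper `center_within_group` and the group sizes are hypothetical, and the code is not the content of `GibbsSampler.load_and_normalize_data()`.

```python
import numpy as np

def center_within_group(x, y):
    """Subtract the group mean from the feature and from the label (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return x - x.mean(), y - y.mean()

# Synthetic stand-in data, NOT the dataset of the practical.
rng = np.random.default_rng(0)
groups = {
    "African": (rng.uniform(0, 6, size=40), rng.normal(7.0, 1.0, size=40)),
    "rest": (rng.uniform(0, 6, size=60), rng.normal(9.0, 1.0, size=60)),
}

# Center each class separately, so that sum_i X_i = 0 and m = 0 within each class.
centered = {k: center_within_group(x, y) for k, (x, y) in groups.items()}
for k, (xc, yc) in centered.items():
    print(k, "sum of features:", xc.sum(), "mean label:", yc.mean())
```

Both printed quantities should be zero up to floating-point error. Note that the centering is done per class, not on the pooled data.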

With Murphy's notation, we let
$$ 
\beta^{(k)}, \sigma^{(k)} \vert \overline{\beta} \sim \mathcal{N}-\mathcal{IG}(\overline{\beta}, V=1, a=1, b=1), \quad k\in\{\text{"African"}, \text{"rest"}\}. 
$$

Note how the only unknown parameter of this prior (a "hyperparameter") is $\overline{\beta}$, and that this parameter is common to both classes $k\in\{\text{"African"}, \text{"rest"}\}$. Thus, once $\overline{\beta}$ is itself given a prior, $\beta^{(\text{"African"})}$ and $\beta^{(\text{"rest"})}$ are not independent under the prior. We hope that this dependence will regularize the estimates of the individual slopes, discouraging them from being too different. This is particularly useful when you have a small dataset, or if you hope that there is something to learn from the data in the other class.
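To get a feel for this prior, here is a minimal sketch of how one could draw from a normal-inverse Gamma distribution for a fixed value of $\overline{\beta}$. It assumes the convention in which the inverse Gamma is placed on $\sigma^2$ and $\beta \vert \sigma^2 \sim \mathcal{N}(\overline{\beta}, \sigma^2 V)$; please check this against Murphy's definition. The function name `sample_nig` and the value $\overline{\beta}=0.5$ are arbitrary choices made for illustration.

```python
import numpy as np

def sample_nig(beta_bar, V=1.0, a=1.0, b=1.0, size=1, rng=None):
    """Draw (beta, sigma^2) pairs from a normal-inverse Gamma (hypothetical helper).

    Assumed convention: sigma^2 ~ IG(a, b), then beta | sigma^2 ~ N(beta_bar, sigma^2 * V).
    """
    rng = np.random.default_rng() if rng is None else rng
    # If G ~ Gamma(shape=a, rate=b), then 1/G ~ IG(a, b); numpy uses scale = 1/rate.
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=size)
    # The normal draw broadcasts over the array of variances.
    beta = rng.normal(loc=beta_bar, scale=np.sqrt(sigma2 * V))
    return beta, sigma2

beta, sigma2 = sample_nig(beta_bar=0.5, size=20000, rng=np.random.default_rng(1))
print("median of beta:", np.median(beta))
print("median of sigma^2:", np.median(sigma2))
```

With $a=1$ the draws of $\beta$ are heavy-tailed, so the sketch reports medians rather than means.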

**Question:** We need to choose a prior over $\overline{\beta}$. What can we choose if we want the conditional 
$$
\overline{\beta}\vert \beta^{(\text{"African"})}, \beta^{(\text{"rest"})}
$$
to be easy to sample? *Hint: think "conjugate prior".*

**Solution:** TBC.

Finally, we have specified a full joint model, with posterior
$$
p(\beta^{(\text{"African"})}, \beta^{(\text{"rest"})}, \sigma^{(\text{"African"})}, \sigma^{(\text{"rest"})}, \overline{\beta}\vert \mathbf y, X) \propto
p\left(\mathbf y^{(\text{"African"})}\vert  X^{(\text{"African"})}, \beta^{(\text{"African"})}, \sigma^{(\text{"African"})}\right) 
\times p\left(\mathbf y^{(\text{"rest"})}\vert X^{(\text{"rest"})}, \beta^{(\text{"rest"})}, \sigma^{(\text{"rest"})}\right) 
\times p(\beta^{(\text{"African"})}, \sigma^{(\text{"African"})} \vert \overline{\beta})
\times p(\beta^{(\text{"rest"})}, \sigma^{(\text{"rest"})} \vert \overline{\beta})
\times p(\overline{\beta}) \qquad (*)
$$

**Question:** Assuming you can compute any integral with respect to this joint law, how would you answer the geographer's question? What integral should be our goal?

**Solution:** TBC.

## Implementing a Gibbs sampler

To sample from the posterior (*) over our five parameters, we propose to use Gibbs sampling. First, given the conditional structure of our model (feel free to write the graphical model for extra points), we have
$$
p(\overline{\beta}\vert \text{other variables}) = p(\overline{\beta}\vert\beta^{(\text{"African"})}, \beta^{(\text{"rest"})}, \sigma^{(\text{"African"})}, \sigma^{(\text{"rest"})}) = \text{TBC}.
$$ 

**Question:** Complete the calculation.

Second, for $k\in\{\text{"African"}, \text{"rest"}\}$, we have
$$ 
p(\beta^{(k)}, \sigma^{(k)}\vert \text{other variables}) = p(\beta^{(k)}, \sigma^{(k)}\vert X^{(k)}, \mathbf{y}^{(k)}, \overline{\beta}) = \text{TBC}.
$$
**Question:** Complete the calculation. *Hint: Read Murphy's Section 7.6.3 again*.
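If the mechanics of Gibbs sampling are not fresh in your mind, the alternating structure of a sweep can be illustrated on a toy target that has nothing to do with our model: a bivariate normal with unit variances and correlation $\rho$, for which $x\vert y\sim\mathcal{N}(\rho y, 1-\rho^2)$ and symmetrically for $y\vert x$. The function name and its arguments below are made up for this sketch, and these conditionals are of course not the ones you are asked to derive.

```python
import numpy as np

def gibbs_bivariate_normal(rho, num_full_sweeps=1000, rng=None):
    """Toy Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    cond_std = np.sqrt(1.0 - rho**2)
    x, y = 0.0, 0.0  # arbitrary starting point
    samples = np.empty((num_full_sweeps, 2))
    for t in range(num_full_sweeps):
        # One full sweep: update each block given the current value of the other.
        x = rng.normal(rho * y, cond_std)
        y = rng.normal(rho * x, cond_std)
        samples[t] = x, y
    return samples

samples = gibbs_bivariate_normal(rho=0.8, num_full_sweeps=5000,
                                 rng=np.random.default_rng(0))
# Discard the first sweeps as burn-in and compare with the target correlation.
print("empirical correlation:", np.corrcoef(samples[500:].T)[0, 1])
```

In our model, the two blocks updated in each sweep would instead be $\overline{\beta}$ and the pairs $(\beta^{(k)}, \sigma^{(k)})$, each drawn from the conditionals above.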

Now we are ready to implement our Gibbs sampler. Check out the `GibbsSampler` class in the companion Python file. You should make the following cell run correctly.

In [None]:
gibbs = hm.GibbsSampler(num_full_sweeps=1000)
gibbs.load_and_normalize_data()
gibbs.run()

I've made a small function to plot the MCMC traces and visually check that they mix properly, i.e. that the time series do not exhibit any strong "non-independent" behaviour, like linear trends.

In [None]:
gibbs.plot_traces()
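
As a numerical complement to this visual check, one can look at the empirical autocorrelation of a trace at a few lags: values close to zero are what we hope for, while values close to one indicate a slowly moving chain. The sketch below uses two synthetic chains, since how `GibbsSampler` stores its traces is not assumed here; the helper `autocorrelation` is hypothetical.

```python
import numpy as np

def autocorrelation(chain, lag):
    """Empirical autocorrelation of a 1D chain at a given nonnegative lag."""
    chain = np.asarray(chain, dtype=float)
    centered = chain - chain.mean()
    if lag == 0:
        return 1.0
    return float(np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered))

rng = np.random.default_rng(0)

# A well-behaved chain: independent draws.
iid_chain = rng.normal(size=5000)

# A slowly mixing chain: AR(1) process with coefficient 0.95.
sticky_chain = np.empty(5000)
sticky_chain[0] = 0.0
for t in range(1, 5000):
    sticky_chain[t] = 0.95 * sticky_chain[t - 1] + rng.normal()

print("lag-1 autocorrelation, iid chain:", autocorrelation(iid_chain, 1))
print("lag-1 autocorrelation, sticky chain:", autocorrelation(sticky_chain, 1))
```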

To visualize the posterior, we can also look at the pairwise scatterplots of our posterior sample.

In [None]:
gibbs.plot_pairwise_marginals()

**Question:** Now that we have a posterior sample, what do we answer the geographer, who wanted to test whether GDP increased with ruggedness in African nations, and decreased with ruggedness in other places? Is your decision sensitive to the choices we made for the priors?

**Solution:** TBC

**Question:** Now the geographer is asking you to predict the GDP of a non-African nation that was not in the original dataset, with ruggedness index $4$. What can you tell her?

**Solution:** TBC

**Bonus question:** Implement the same model in `PyMC3`, replace the conjugate priors with nonconjugate ones, run Hamiltonian MC instead of Gibbs, and check that the answer you give the geographer is robust to this change in the functional form of the priors.