# Generative Models

Goals:
* Introduce probabilistic graphical models in the context of generating mock data
* Explore some simple practical aspects of simulating data
* Walk through a simple mock data generation example and its PGM

## Further reading

* Ivezic et al, Sections 3.3 and 3.7
* [Bishop, "Pattern Recognition and Machine Learning,"](https://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738) Sections 8.1 and 8.2

## Understanding by Modeling

<table><tr width=90%>
<td><img src="../graphics/tour_cluster_image.png" height=300></td>
<td><img src="../graphics/tour_cluster_image_zoom.png" height=300></td>
</tr></table>

Understanding the features in these images means _making a model that can "predict" them_

## When do we want to generate data?

* Inference: we generate noise-free data to compare with our noisy data

* Checking: to investigate whether our model is a good one or not

* Testing: does our analysis recover the input model?

## The Sampling Distribution

* The noisy, mock data are drawn from a PDF known as the "sampling distribution"

* In the case of an X-ray image pixel, containing mock counts $N_k$, this PDF is $P(N_k|\mu_k,H)$

* $\mu_k$ is a parameter, the expected number of counts in the $k^{th}$ pixel

* $H$ is the list of assumptions that defines our model

> The sampling distribution is sometimes called the error distribution, because it captures the statistical properties of the "random errors" in the data. Drawing from the sampling distribution "adds noise" to the simulation.
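
As a minimal sketch of what "adding noise" means, assume a tiny 3x3 image whose expected counts `mu` are invented here purely for illustration. Drawing one Poisson sample per pixel turns the noise-free prediction into a mock dataset:

```python
import numpy as np
import scipy.stats

# Hypothetical noise-free expected counts mu_k for a tiny 3x3 "image"
mu = np.array([[0.5, 1.0, 0.5],
               [1.0, 8.0, 1.0],
               [0.5, 1.0, 0.5]])

# Draw one noisy realisation N_k ~ Poisson(mu_k), pixel by pixel
N = scipy.stats.poisson.rvs(mu, size=mu.shape, random_state=42)

print(N)       # integer counts, scattered about mu
print(N - mu)  # the "random errors" in this particular realisation
```

Changing `random_state` gives a different realisation of the same noise-free image.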

## Conditional probability

* $P(N_k|\mu_k,H)$ is pronounced "the probability for $N_k$ given $\mu_k$ and $H$"

* This means that if you know the value of $\mu_k$, you can _draw_ a sample value of $N_k$ from $P$, since $H$ tells you the functional form of $P$.

## Sampling in practice

* In general, sampling from a PDF is difficult

* For certain standard distributions, however, there are fast algorithms

```python
import scipy.stats

# "Frozen" Poisson distribution with expected value mu = 3
P = scipy.stats.poisson(mu=3)
# Draw 10 random samples from it
P.rvs(size=10)
```

## Sampling in practice

* `numpy.random` and `scipy.stats` are two useful libraries for drawing samples from PDFs

* You may have encountered some of these routines as "random number generators"

* Sampling from a PDF means generating random numbers from that distribution
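
For comparison with `scipy.stats`, here is a small sketch of the same idea using `numpy.random`. The seed, the distribution parameters, and the variable names are arbitrary choices made for illustration:

```python
import numpy as np

# A seeded generator makes the "random" draws reproducible
rng = np.random.default_rng(seed=1)

counts = rng.poisson(lam=3.0, size=10)            # discrete: Poisson counts
values = rng.normal(loc=0.0, scale=1.0, size=10)  # continuous: Gaussian values

print(counts)
print(values)
```

Re-creating the generator with the same seed reproduces the same samples, which is handy when testing an analysis.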

## Choosing the input model parameters

* When testing, we often want to assert a set of **input model parameters** <font color='brown'>$\theta$</font>, and then see what they produce

* Testing at large scale might involve generating many datasets, with different inputs. In this case we might want to sample from a *plausible distribution of input parameters*

* In practice: choose a particular standard probability distribution for $\theta$ and sample from it
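
A sketch of what this might look like, assuming a two-parameter $\theta$ made up of a central amplitude `S0` and a core radius `rc`. Both the parameters and the distributions chosen for them are hypothetical, picked only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_datasets = 5

# Hypothetical input parameters theta = (S0, rc); the distributions below are
# illustrative assumptions, not values taken from the text
S0 = rng.lognormal(mean=np.log(10.0), sigma=0.5, size=n_datasets)  # central expected counts
rc = rng.uniform(low=2.0, high=6.0, size=n_datasets)               # core radius in pixels

theta_samples = np.column_stack([S0, rc])  # one row of inputs per mock dataset
print(theta_samples)
```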

## Deterministic relationships

* Often, our model provides a deterministic relationship between parameters: if you know $\theta$, then you know $\mu$

* In our X-ray image case, $\theta$ could be the set of parameters that describe the gas temperature and density profiles of a spherically symmetric cluster of galaxies with known centroid position

* $\mu(\theta)$ would then be some complicated function that took the cluster parameters $\theta$ as input, and predicted the expectation value of the counts at any pixel position.
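
To make the idea of a deterministic $\mu(\theta)$ concrete, here is a toy stand-in. It assumes a simple beta-model surface brightness profile plus a flat background, which is far simpler than a model built from temperature and density profiles. The function name, the parameterisation, and the numerical values are all hypothetical:

```python
import numpy as np

def expected_counts(theta, shape=(32, 32)):
    """Toy deterministic model mu(theta) on a pixel grid (hypothetical stand-in)."""
    S0, rc, beta, b = theta  # amplitude, core radius [pixels], slope, background
    ny, nx = shape
    y, x = np.indices(shape)
    # Distance of each pixel from the (assumed known) centroid at the image centre
    r = np.hypot(x - (nx - 1) / 2, y - (ny - 1) / 2)
    return S0 * (1.0 + (r / rc) ** 2) ** (-3.0 * beta + 0.5) + b

theta = (10.0, 4.0, 2.0 / 3.0, 0.1)  # illustrative values only
mu = expected_counts(theta)
print(mu.shape, mu.max())
```

Calling `expected_counts` twice with the same `theta` returns the same image: there is no randomness in this step.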

## Probabilistic Graphical Models

* This procedure for simulating a mock X-ray image dataset can be usefully illustrated with a _directed acyclic graph_ called a "Probabilistic Graphical Model"

* In the present context these provide something like a "flow diagram" showing how we might draw a sample mock dataset from our model

* Let's look at the PGM for a simple X-ray image simulation 


## One pixel, fixed inputs

<img src="../graphics/pgms_one_pixel_input_fixed.png">

## PGM pieces

* Each **node** in the graph represents a PDF, for the variable labeled inside it

* Each **edge (arrow)** in the graph represents a conditional dependence

* Deterministic relationships are indicated by "fixed" variables represented by solid points

> The fixed nodes are _also_ PDFs: fixing a parameter is the same as sampling it from a delta function PDF

## One pixel, sampled inputs

<img src="../graphics/pgms_one_pixel_input_sampled.png" width=70%>

## PGM interpretation

* Following the arrows, the network of conditional dependences shows you how to go about simulating mock data ($N_k$)

> For example: 
1. Draw a sample $\theta$ vector;
2. Compute $\mu_k$ from it;
3. Draw an $N_k$ given that $\mu_k$
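
The three steps above can be sketched for a single pixel as follows. The distributions for $\theta$ and the function `mu_of_theta` are hypothetical placeholders, and the Poisson sampling distribution is the one assumed in this example:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def mu_of_theta(theta, r_k=3.0):
    """Hypothetical deterministic model for the expected counts in pixel k."""
    S0, rc = theta
    return S0 / (1.0 + (r_k / rc) ** 2)

# 1. Draw a sample theta vector (assumed, illustrative distributions)
theta = (rng.lognormal(mean=np.log(10.0), sigma=0.5), rng.uniform(2.0, 6.0))
# 2. Compute mu_k from it
mu_k = mu_of_theta(theta)
# 3. Draw an N_k given that mu_k
N_k = rng.poisson(mu_k)

print(theta, mu_k, N_k)
```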

## PGM interpretation

* The graph is also an illustration of a particular factorisation of a joint probability distribution: the PDF for every variable in the model 

* For example: 

$\;\;\;\;\;\;\;P(N_k|\mu_k,H)\;P(\mu_k|\theta,H)\;P(\theta|H) =  P(N_k,\mu_k,\theta|H)$

> Note that the dependence of $N_k$ on $\theta$ has been dropped here: $N_k$ is only dependent on $\theta$ through $\mu_k$, or if you prefer, $N_k$ is _conditionally independent_ of $\theta$ given $\mu_k$
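
For comparison, the product rule of probability on its own, with no assumptions about the structure of the model, would give

$\;\;\;\;\;\;\;P(N_k,\mu_k,\theta|H) = P(N_k|\mu_k,\theta,H)\;P(\mu_k|\theta,H)\;P(\theta|H)$

The factorisation illustrated by the graph then follows from the conditional independence statement, written as

$\;\;\;\;\;\;\;P(N_k|\mu_k,\theta,H) = P(N_k|\mu_k,H)$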

#### What do you make of these two PGMs?

<table><tr>
<td><img src="../graphics/pgms_a-c-d.png"></td>
<td><img src="../graphics/pgms_c-y-d.png"></td>
</tr></table>

* On your own, write down the probability expressions illustrated by these two graphs. 
* Then, discuss their meaning with your neighbor, and prepare to report back to the class.

### my answers

First graph: A constant "c" is predicted from a distribution of model parameters. Then "c" and those same model parameters are fed into another distribution, and "d" is chosen from that distribution.

Second graph: A constant "c" is predicted from a distribution of model parameters. Then "c" is used to set another distribution, from which "d" is drawn. The value of "d" is used to refine the model parameters, and the cycle continues.

## Simulating the whole image

* Our model for the cluster allows us to predict all of the image pixel values, separately: each of the $\mu_k$ depends on $\theta$ alone

* Our simple Poisson model reflects an assumption about our detector, which is that the observed counts in the $k^{th}$ pixel are independent of the observed counts in all the other pixels. $N_k$ only depends on its pixel's expected counts $\mu_k$

* In this case, the sampling distribution for all the observed pixel values $\boldsymbol{N}$ factorizes and simplifies to

$\;\;\;\;\;\;\;\;\;\;P(\boldsymbol{N}|\boldsymbol{\mu},H) = \prod_k P(N_k|\mu_k,H)$
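
In code, this factorisation means that the whole image can be drawn with one independent Poisson sample per pixel, and that the log of the sampling distribution is a sum over pixels. A minimal sketch, assuming a hypothetical 4-pixel image with invented expected counts:

```python
import numpy as np
import scipy.stats

# Hypothetical expected counts mu_k for a 4-pixel "image"
mu = np.array([0.5, 2.0, 5.0, 1.0])

# Independent pixels: one Poisson draw per pixel gives the whole mock image N
N = scipy.stats.poisson.rvs(mu, size=mu.shape, random_state=0)

# The product over pixels becomes a sum of log-probabilities
log_P = scipy.stats.poisson.logpmf(N, mu).sum()
print(N, log_P)
```

Working with the sum of logs rather than the product itself avoids numerical underflow when the image has many pixels.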

## All pixels, fixed inputs

<img src="../graphics/pgms_all_pixels_input_fixed.png">

$\;\;\;\;\;\;\;P(\boldsymbol{N},\boldsymbol{\mu},\theta|H) = \left[ \prod_k P(N_k|\mu_k,H)  P(\mu_k|\theta,H) \right] P(\theta|H)$

## PGM pieces

* Each _node_ in the graph represents a PDF, for the variable labeled inside it

* Each _edge_ (arrow) in the graph represents a conditional dependence

* Deterministic relationships are indicated by "fixed" variables represented by solid points

* Plates contain conditionally independent variables

> Think of plates as illustrating a stack of layers, seen from above, that are only connected by the arrows coming from variables outside the plate.
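
One way to read a plate is as a loop: everything inside it is repeated once per pixel, while variables outside it are shared. The sketch below assumes fixed, hypothetical values for $\theta$, invented pixel radii, and a toy $\mu_k(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

theta = (10.0, 4.0)                       # outside the plate: shared by every pixel
r = np.array([0.0, 1.0, 2.0, 4.0, 8.0])   # hypothetical pixel radii

N = np.empty(r.size, dtype=int)
for k, r_k in enumerate(r):               # the plate: repeated once per pixel k
    S0, rc = theta
    mu_k = S0 / (1.0 + (r_k / rc) ** 2)   # toy mu_k(theta), an assumption
    N[k] = rng.poisson(mu_k)              # N_k depends only on its own mu_k
print(N)
```

In practice the loop would usually be replaced by a vectorised call, but the explicit version mirrors the graph more directly.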

## Take-home messages

* Generating data is the key function of a statistical model

* Noise-free mock datasets are central to inference; noisy mock datasets are vital for testing and model checking

* Generating a mock dataset means drawing a sample from the joint probability distribution for all variables in the model

* Probabilistic Graphical Models (PGMs) illustrate a particular factorization of this joint PDF, and can be viewed as "flow charts" for the data simulation process

* The sampling distribution captures the statistical uncertainties in the data; drawing from it "adds noise" to make the final mock dataset




## Coded Examples

In the `examples` folder, there are some notebooks that show:

* [A "catalog" dataset from the Sloan Digital Sky Survey](../examples/SDSScatalog/FirstLook.ipynb)

* [An attempt to generate a mock dataset](../examples/SDSScatalog/GalaxySizes.ipynb) that (literally) looks like the real one

* [How the `daft` python package can be used to draw PGMs](../examples/SDSScatalog/FirstPGM.ipynb)


You might (one day) find some of this code useful.