# Lab 3 - Markov Chain Monte Carlo    

The objective of this lab is to familiarize you with using Markov Chain Monte Carlo (MCMC) methods in PyMC to constrain model parameters. We will use a toy dataset to practice estimating an earthquake's location from arrival time measurements at seismic stations. The dataset has three features:

**x** - horizontal location of each seismic station. Values can range from 0 to 100 km in our 2D domain.         
**z** - vertical location of each seismic station. This value is always 0 because each seismic station is sitting at the surface of the earth.   
**t** - the time in seconds when the earthquake's seismic signal reached the station.       

We would like to solve for $x_{eq}$ and $z_{eq}$, the location of the earthquake. We will assume that $z$ is positive going downwards (away from the surface towards the center of the earth). Remember that we can relate the earthquake arrival time ($t$) to the station location ($x$ and $z$), earthquake location ($x_{eq}$ and $z_{eq}$), seismic velocity of the crust ($v$), and the earthquake initiation time ($t_0$) as follows:        

$t = \frac{\sqrt{(x - x_{eq})^2 + (z - z_{eq})^2}}{v} + t_0$   

Therefore, the unknown variables that we want to estimate in our model are $x_{eq}$, $z_{eq}$, $t_0$, and $v$.
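
As a quick sanity check on the arrival time equation, the sketch below evaluates it with NumPy for a few made-up station positions. The function name `arrival_time` and the station coordinates are illustrative assumptions, not part of the lab dataset, and the parameter values are chosen only to make the arithmetic easy to verify by hand.

```python
import numpy as np

def arrival_time(x, z, x_eq, z_eq, v, t0):
    """Predicted arrival time (s) at stations located at (x, z) in km.

    Hypothetical helper for illustration: straight-line travel at constant
    velocity v (km/s) from the source (x_eq, z_eq), starting at time t0 (s).
    """
    distance = np.sqrt((x - x_eq) ** 2 + (z - z_eq) ** 2)  # km
    return distance / v + t0

# Made-up surface stations (z = 0), not taken from the lab data
x_demo = np.array([27.0, 30.0, 33.0])
z_demo = np.zeros_like(x_demo)

t_demo = arrival_time(x_demo, z_demo, x_eq=30.0, z_eq=4.0, v=4.0, t0=300.0)
print(t_demo)
```

With these values, the station directly above the source is 4 km away and records an arrival at 301 s, while the two stations offset by 3 km are 5 km away and record 301.25 s.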

First, let's import the packages that we need to run our model.

In [None]:
import numpy as np
import pymc as pm  # PyMC package
import pandas as pd
import arviz as az  # ArviZ: plotting and summary statistics

Next we will import our observational dataset. Look through the data and make sure you understand the available features. PyMC works most smoothly with NumPy arrays, so in this code block, we convert the dataframe into three NumPy arrays, one for each observed feature.

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/rtculberg/ml_in_eas/main/data/EarthquakeLocation2.csv")
x = data['x'].to_numpy()
z = data['z'].to_numpy()
t = data['t'].to_numpy()

data.head()

In the code block below, complete the following tasks:     
(1) Define a new model object and write a `with` statement to wrap the subsequent specification of the model components.    
Inside the `with` statement:            
(2) Define the prior distributions for each of the four variables that you want to estimate with the model ($x_{eq}$, $z_{eq}$, $t_0$, and $v$). Remember that all of these variables should always be positive, so you should select a distribution function that only admits positive values.         
(3) Define new mutable data containers for each of your observed variables ($x$, $z$, and $t$). For each variable, set the dimensional coordinate system using `dims="idx"`. By defining a mutable data container, we could later swap out the values of these variables without changing the model, which is important if we want to predict outcomes on unseen data. We won't use this feature in the lab, but it will be very important for PS3!        
(4) Write the likelihood function. Remember that the likelihood function compares our observed arrival times to the theoretical arrival times that you would calculate from the arrival time equation using the model variables.      
(5) Sample from the parameter space. Use 2000 draws and 500 tuning samples.   

You may find the PyMC introduction notebook to be a helpful example as you work on your code: https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/pymc_overview.html#pymc-overview.

In [None]:
coords = {"idx": np.arange(0,x.size,1)}     # define a coordinate system for our observed features

View the `trace` variable that holds the results of your sampling. Look through all of the data groups. Make sure you and your lab group understand:    

* Where the values for the estimated model parameters are stored.   
* Where the original observed data is stored.   
* Where the predicted output variable `time` is stored.
* Where the metadata about the sampling run is stored.   

Use the `az.plot_trace()` function to plot the distributions of model parameters and the sampling chains. Data from chain #1 is shown in the solid blue line and data from chain #2 is shown in the dashed blue line. If your model has converged properly, you should see that the distributions of model parameters between chain #1 and #2 are very similar. You should also see that the sampling traces for both chains are pretty stationary, i.e. the values are bouncing around a constant mean.      
You may find the ArviZ documentation for `plot_trace()` helpful: https://python.arviz.org/en/latest/api/generated/arviz.plot_trace.html    
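
The visual checks described above can be complemented with a rough numerical comparison. The sketch below uses synthetic NumPy arrays in place of real sampler output (an assumption for illustration; the helper `chain_summary` and the arrays `good` and `bad` are hypothetical). It compares the mean of each chain, and of each half of each chain, against the overall spread of the samples.

```python
import numpy as np

def chain_summary(chains):
    """Summarize per-chain behavior for an array of shape (n_chains, n_draws)."""
    half = chains.shape[1] // 2
    return {
        "chain_means": chains.mean(axis=1),                   # similar across chains?
        "first_half_means": chains[:, :half].mean(axis=1),    # stationary within a chain?
        "second_half_means": chains[:, half:].mean(axis=1),
        "pooled_sd": chains.std(),
    }

rng = np.random.default_rng(0)
# Two well-mixed synthetic chains drawn around the same mean
good = rng.normal(loc=30.0, scale=1.0, size=(2, 2000))
# Two synthetic chains centered on different values, as in a failed run
bad = np.stack([rng.normal(28.0, 1.0, 2000), rng.normal(33.0, 1.0, 2000)])

for label, chains in [("good", good), ("bad", bad)]:
    s = chain_summary(chains)
    gap = s["chain_means"].max() - s["chain_means"].min()
    print(label, "gap between chain means:", round(float(gap), 2),
          "pooled sd:", round(float(s["pooled_sd"]), 2))
```

A gap between chain means that is small relative to the pooled standard deviation is consistent with the chains exploring the same distribution; a large gap is a warning sign. For real output, an array of this shape should be available for each parameter as, for example, `trace.posterior["x_eq"].values`, assuming the parameter was named `x_eq` in the model.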

Discuss with your lab group:   
Does the model appear to have converged? Why or why not?

Use the `az.plot_posterior()` function to plot the posterior distributions of model parameters, along with their means and 94% highest density intervals. You may find the documentation helpful: https://python.arviz.org/en/latest/api/generated/arviz.plot_posterior.html.      
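
To make the idea of a highest density interval concrete, the sketch below computes one directly from samples with NumPy: it is the narrowest interval that contains the requested fraction of the draws. The helper `hdi_from_samples` is a hypothetical illustration that assumes a unimodal distribution; it is not the ArviZ implementation, and the draws are synthetic.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    sorted_samples = np.sort(np.asarray(samples, dtype=float))
    n = sorted_samples.size
    n_inside = int(np.ceil(prob * n))  # number of draws the interval must hold
    # Width of every candidate interval spanning n_inside consecutive sorted draws
    widths = sorted_samples[n_inside - 1:] - sorted_samples[: n - n_inside + 1]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + n_inside - 1]

rng = np.random.default_rng(1)
draws = rng.normal(loc=4.0, scale=0.5, size=4000)  # synthetic posterior draws
low, high = hdi_from_samples(draws)
print(f"mean = {draws.mean():.2f}, 94% HDI = [{low:.2f}, {high:.2f}]")
```

For a symmetric distribution like this one, the interval is roughly centered on the mean. For a skewed posterior, the highest density interval shifts toward the region where the samples are most concentrated.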

Discuss with your lab group:    
The true values for the model parameters are:    
$x_{eq} = 30$ km       
$z_{eq} = 4$ km  
$t_0 = 300$ s         
$v = 4$ km/s     
How do the model parameter values predicted by MCMC compare to the true values? Were any model parameters particularly difficult to predict? Why do you think these parameters might be challenging?

**Bonus**: Suppose that we know the crustal velocity is 4 km/s. Update your model to take this information into account and rerun the model to predict $x_{eq}$, $z_{eq}$, and $t_0$. How much does this information improve your predictions of earthquake location and initiation time?