# Problem Set 4 - Bayesian Linear Regression

In this problem set, you will use Bayesian linear regression to model monthly ice sheet surface melt rates as a function of the number of positive degree days, based on data from the KAN-M weather station site on the Greenland Ice Sheet.

### The Positive Degree Day Model  
The positive degree day model assumes that the amount of snow or ice that melts on the surface of a glacier over the course of a month is a linear function of the sum of all daily mean air temperatures above 0 °C during that month. For example, if the average temperature was 5 °C on day 1 and 2 °C on day 2, then the positive degree day (PDD) value for those two days would be 7. By calibrating this relationship, we can estimate total snowmelt without needing to run a complicated physical model.
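
To make the definition concrete, here is a minimal sketch of the PDD sum using NumPy. The temperatures are made-up illustrative values (not KAN-M data), and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical daily mean air temperatures in degrees C (not from the KAN-M data).
daily_mean_temp_c = np.array([5.0, 2.0, -3.0, 0.0])

# Only temperatures above 0 C contribute; days at or below freezing count as zero.
pdd = np.clip(daily_mean_temp_c, 0.0, None).sum()

print(pdd)  # 7.0
```

The first two days reproduce the example above (5 + 2 = 7), and the two remaining days add nothing to the sum.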

### Data Features   
**pdd** - positive degree days in degrees Celsius  
**melt** - monthly melt rate in mm of water equivalent per month

We will use the [PyMC3](https://www.pymc.io/projects/examples/en/latest/gallery.html) package for Bayesian Linear Regression. You may also find the Arviz documentation on working with `InferenceData` objects helpful: https://python.arviz.org/en/stable/getting_started/WorkingWithInferenceData.html.

**[1]** Import numpy, pymc3, matplotlib, pandas, arviz, xarray, and the sklearn `StandardScaler`.

**[2] (2 pts)** Use pandas to load the data from https://raw.githubusercontent.com/rtculberg/ml_in_eas/main/data/KANM_Melt.csv into a dataframe. Print the first few rows of the dataframe.

**[3] (3 pts)** Normalize the data with the z-score transform using `StandardScaler`. Create a scatterplot of your normalized data.
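
As a reminder of what the z-score transform does, here is a minimal NumPy sketch on made-up numbers. It is not the `StandardScaler` call you are asked to write, but it uses the same convention: `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical values standing in for one feature column.
x = np.array([1.0, 4.0, 7.0, 10.0])

# Z-score: subtract the mean, then divide by the standard deviation.
# ddof=0 (population standard deviation) matches StandardScaler's convention.
mean, std = x.mean(), x.std(ddof=0)
z = (x - mean) / std

# After the transform, the column has mean ~0 and standard deviation ~1.
print(z.mean(), z.std())
```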

**[4] (10 pts)** Train a Bayesian linear regression model using the PyMC3 interface. A few hints:   
* Be sure to define mutable data containers for your x and y features so that later on we can change the x-values to make predictions on unseen data.
* Similarly, be sure to define a coordinate system for the observed data.   
* Remember to choose appropriate priors for each model parameter, given that you have normalized the data to a mean of 0 and a standard deviation of 1.
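
For reference, one common way to write down the model is shown below. It assumes a Gaussian likelihood, and the symbols $\alpha$ (intercept), $\beta$ (slope), and $\sigma$ (noise scale) are notation introduced here rather than names required by the assignment. The priors on these three parameters are deliberately left out, since choosing them is part of the exercise.

$$
\text{melt}_i \sim \mathcal{N}\left(\alpha + \beta \, \text{pdd}_i,\ \sigma^2\right)
$$

Here $\text{pdd}_i$ and $\text{melt}_i$ are the normalized values for month $i$.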

The following example notebooks may be helpful in writing your code:  
https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/GLM_linear.html      
https://github.com/pymc-devs/pymc-examples/blob/main/examples/introductory/api_quickstart.ipynb


**[5] (1 pt)** Print the `InferenceData` structure produced by `pm.sample()`. Look through the data structure and make sure that you understand where the inferred model parameters (slope, intercept, and sigma), model estimated melt volumes, and observed positive degree days and melt rates are stored.

**[6] (4 pts)** Use the `pm.plot_trace()` function to look at the posterior distributions of model parameters, as well as their traces.      

Answer the following question:    
Does the Markov Chain Monte Carlo algorithm appear to have converged to stable results for each parameter? Explain why or why not based on the plots.

**[7] (4 pts)** Use the Arviz `plot_posterior()` function to visualize the posterior distributions of the model parameters.

Answer the following question:        
How does the mean value of sigma compare to the observed standard deviation of melt volumes in your dataset?

**[8] (3 pts)** Use `pm.sample_posterior_predictive()` to generate predictive samples from the posterior model distributions over the whole range of positive degree day values in the training feature set.

**[9] (1 pt)** Print the new `InferenceData` structure. Make sure you understand where to find the new posterior predictive samples.

**[10] (5 pts)** Bayesian linear regression can be thought of as finding a family of linear models that fit your data, as well as estimating how probable each of those models is given the data. Use the Arviz `plot_lm()` function to plot your observed data, the distribution of predicted data from the posterior predictive distribution, and examples from the family of linear models estimated using MCMC.

Hints:    
The documentation for `plot_lm()` provides a helpful example of how to generate this type of plot: https://python.arviz.org/en/stable/api/generated/arviz.plot_lm.html.

**[11] (6 pts)** Use the Arviz `plot_ppc()` function to compare the posterior distribution of melt rates predicted by the model to the observed distribution of melt rates. This is a good sanity check to see how the data generated from our model may deviate from data generated from the true distribution. This helps us understand if our posterior distribution is actually approximating the true underlying distribution. 

Answer the following questions:      
How do the observed and posterior predicted distributions differ?      
How could we manipulate our observed data to reduce the discrepancies you observe?

**[12] (5 pts)** Now suppose that next year you measure 22 positive degree days during July at a different site near KAN-M. Use your Bayesian linear regression model to predict the total surface melt during July at that site.  

**Hint**: remember to set `predictions=True` in your call to `pm.sample_posterior_predictive()`. You may find the example code here to be helpful: https://www.pymc-labs.com/blog-posts/out-of-model-predictions-with-pymc/.
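
Because the model was trained on normalized data, a new PDD value has to be put on the same scale before it is passed to the model. The sketch below shows the arithmetic with NumPy on made-up training values; the variable names are hypothetical. With a fitted `StandardScaler`, the equivalent step is calling its `transform()` method rather than fitting a new scaler.

```python
import numpy as np

# Hypothetical training PDD values; the real ones come from the KAN-M dataframe.
pdd_train = np.array([0.0, 10.0, 20.0, 30.0])
pdd_mean, pdd_std = pdd_train.mean(), pdd_train.std(ddof=0)

# New observation in original units, scaled with the *training* mean and
# standard deviation (never with statistics computed from the new data).
pdd_new = 22.0
pdd_new_scaled = (pdd_new - pdd_mean) / pdd_std

print(pdd_new_scaled)
```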

**[13] (1 pt)** Print the newly updated `InferenceData` structure and make sure you understand where the new predictions are stored.

**[14] (5 pts)** Use the Arviz `extract()` function to pull out the predictions into a new variable where the chain and draw dimensions have been collapsed. Use matplotlib to plot a histogram of the predicted melt volumes.

Answer the following question:   
Approximately what is the most likely melt volume for our site in July?  
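
As a rough illustration of what collapsing the chain and draw dimensions means, and of how a histogram peak can be read off numerically, here is a minimal NumPy sketch. The samples are synthetic and the variable names are hypothetical; with real results you would work from the output of `extract()` instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for predictive samples with shape (chain, draw).
samples = rng.normal(loc=0.5, scale=1.0, size=(4, 500))

# Collapse chain and draw into a single sample dimension.
flat = samples.reshape(-1)  # shape (2000,)

# Bin the samples and take the center of the fullest bin as a rough
# estimate of the most likely value.
counts, edges = np.histogram(flat, bins=30)
i = np.argmax(counts)
peak = 0.5 * (edges[i] + edges[i + 1])

print(flat.shape, peak)
```

The estimated peak depends on the number of bins, so treat it as approximate.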

**Hint**: Don't forget to unnormalize the predictions!
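
Undoing a z-score transform means multiplying by the training standard deviation and adding back the training mean. The sketch below shows this with NumPy on made-up numbers and hypothetical variable names; with a fitted `StandardScaler`, the `inverse_transform()` method performs the same operation.

```python
import numpy as np

# Hypothetical training melt values in mm w.e. per month (not the KAN-M data).
melt_train = np.array([0.0, 50.0, 100.0, 250.0])
melt_mean, melt_std = melt_train.mean(), melt_train.std(ddof=0)

# Hypothetical predictions on the normalized scale.
pred_scaled = np.array([-0.5, 0.0, 1.2])

# Invert the z-score: multiply by the standard deviation, add the mean.
pred_mm = pred_scaled * melt_std + melt_mean

print(pred_mm)
```

A normalized prediction of 0 maps back to the training mean, which is a quick way to check that the inverse was applied with the right statistics.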