# Bayesian linear regression with R and Stan

We now carry out linear regression in a Bayesian framework. We use the [Stan probabilistic programming language](https://mc-stan.org/), which allows full Bayesian statistical inference. A Stan model is a block of text that can be written either in a separate file or as a string in the same script as the rest of the code. A model defined in its own file can then be called from any of Stan's interfaces: R, Python, Julia...

This tutorial uses the same case study as the OLR tutorials with Python and R. Here, Stan is used with R.

In the previous OLR tutorial, we selected a linear model with three predictors: $(T_i-T_e)$, $I_{sol}$ and $(T_i-T_s)$. The same model can be written in probability form: each data point $e_{hp,n}$ is normally distributed around the linear predictor, with a constant noise standard deviation $\sigma$:

$$ e_{hp,n} \sim N( \theta_1 (T_i-T_e)_n + \theta_2 I_{sol,n} + \theta_3 (T_i-T_s)_n, \sigma) $$
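
To make the probability form concrete, the following base R sketch simulates data from this model. All numbers (sample size, coefficients, noise level, predictor ranges) are hypothetical and chosen only for illustration; they are not taken from the case study.

```r
set.seed(1)                    # reproducible illustration

n     <- 100                   # hypothetical number of data points
theta <- c(2.0, -0.5, 1.0)     # hypothetical coefficients theta_1..theta_3
sigma <- 0.3                   # hypothetical noise standard deviation

# Hypothetical predictors standing in for (T_i - T_e), I_sol and (T_i - T_s)
X <- cbind(tite  = runif(n, 5, 20),
           i_sol = runif(n, 0, 500),
           tits  = runif(n, 0, 10))

mu   <- as.vector(X %*% theta)            # linear predictor, one value per data point
e_hp <- rnorm(n, mean = mu, sd = sigma)   # one normal draw per data point, centred on mu

head(cbind(mu, e_hp))
```

Fitting the model reverses this process: given the observed outcome and predictors, it infers which values of $\theta$ and $\sigma$ are plausible.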

In [1]:
library(tidyverse)
library(lubridate)
library(rstan)

df <- read_csv("data/linearregression.csv") %>% 
        transform(TIMESTAMP = ymd(TIMESTAMP)) %>%
        mutate(tite = ti - te,
                titg = ti - tg,
                tits = ti - ts,
                vtite = wind_speed * (ti-te))

Parsed with column specification:
cols(
  TIMESTAMP = col_date(format = ""),
  e_hp = col_double(),
  e_dhw = col_d

Of course, the Stan documentation has [an example of a linear regression model](https://mc-stan.org/docs/2_27/stan-users-guide/linear-regression.html). The following block defines a model with any number of predictors `K` and no intercept.

Then, a list called `model_data` is created, which maps each part of the data to the corresponding variable in the Stan model. This list must contain every variable declared in the `data` block of the model.
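
Since a mismatch between the list and the `data` block is a common source of errors, it can help to check the list before fitting. The sketch below uses a small, made-up data frame (`toy`) with the same column names as `df`; the values and the name `toy_data` are hypothetical.

```r
# Hypothetical toy data with the same column names as `df`
toy <- data.frame(
  tite  = c(10, 12, 8, 15),
  i_sol = c(100, 250, 0, 400),
  tits  = c(3, 4, 2, 5),
  e_hp  = c(1.2, 1.1, 1.0, 1.6)
)

predictors <- c("tite", "i_sol", "tits")
x <- as.matrix(toy[, predictors])   # explicit N x K numeric matrix

toy_data <- list(N = nrow(x), K = ncol(x), x = x, y = toy$e_hp)

# Consistency checks mirroring the declarations in the `data` block
stopifnot(
  is.matrix(toy_data$x),
  nrow(toy_data$x) == toy_data$N,     # matrix[N, K] x
  ncol(toy_data$x) == toy_data$K,
  length(toy_data$y) == toy_data$N,   # vector[N] y
  !anyNA(toy_data$x),                 # missing values are not accepted as Stan data
  !anyNA(toy_data$y)
)
```

Deriving `N` and `K` from the matrix, rather than typing them by hand, keeps the list consistent if the set of predictors changes.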

Once the model is specified and the data have been mapped to its variables, the model can be fitted by MCMC.

In [None]:
lr_model <- "
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  vector[K] theta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * theta, sigma);  // likelihood
}
"

model_data <- list(
  N = nrow(df),
  K = 3,                                                 # number of predictors
  x = df %>% select(tite, i_sol, tits) %>% as.matrix(),  # N x K predictor matrix
  y = df$e_hp
)

fit1 <- stan(
  model_code = lr_model,    # Stan program
  data = model_data,        # named list of data
  chains = 4,               # number of Markov chains
  warmup = 1000,            # number of warmup iterations per chain
  iter = 4000,              # total number of iterations per chain
  cores = 1                 # number of cores used to run the chains
)

Fitting may produce a number of warnings that point to possible problems: divergent transitions, large R-hat values, low effective sample size... Obtaining a fit without these warnings takes some practice, but it is essential for an unbiased interpretation of the inferred variables and predictions. A guide to Stan's warnings and how to address them [is available here](https://mc-stan.org/misc/warnings.html).
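
As a rough sketch of how these problems can be counted programmatically, the helper below wraps three diagnostic functions exported by recent versions of rstan. The helper name `sampler_problems` is hypothetical, and the functions it calls may not be available in older rstan releases.

```r
# Hypothetical helper: counts of sampler problems in a fitted stanfit object.
# Functions are called with the rstan:: prefix, so rstan is only needed when the helper is used.
sampler_problems <- function(fit) {
  c(
    divergent       = rstan::get_num_divergent(fit),              # divergent transitions
    max_treedepth   = rstan::get_num_max_treedepth(fit),          # iterations hitting the tree depth limit
    low_bfmi_chains = length(rstan::get_low_bfmi_chains(fit))     # chains with low energy BFMI
  )
}

# Usage, assuming `fit1` is the object returned by stan() above:
# sampler_problems(fit1)
#
# If divergent transitions are reported, one commonly suggested first step is to
# refit with a higher target acceptance rate:
# fit1b <- stan(model_code = lr_model, data = model_data,
#               control = list(adapt_delta = 0.95))
```

All three counts being zero is a necessary condition for a trustworthy fit, not a sufficient one; R-hat and effective sample size still need to be checked.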

Stan returns an object (called `fit1` above) [from which the distributions of outputs and parameters of the fitted model can be accessed](https://cran.r-project.org/web/packages/rstan/vignettes/stanfit-objects.html).
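
A minimal sketch of such access is given below. The two helper names (`theta_draws`, `interval_table`) are hypothetical; only `rstan::extract()` is part of rstan.

```r
# Hypothetical helper: draws of the coefficient vector `theta` from a stanfit object.
# extract() is called with its namespace because tidyr also exports a function named extract().
theta_draws <- function(fit) {
  rstan::extract(fit, pars = "theta")$theta   # matrix: one row per draw, one column per coefficient
}

# Hypothetical helper: posterior mean and 95% interval for each column of a matrix of draws.
interval_table <- function(draws) {
  t(apply(draws, 2, function(d) {
    c(mean = mean(d), quantile(d, probs = c(0.025, 0.975)))
  }))
}

# Usage, assuming `fit1` is the object returned by stan() above:
# interval_table(theta_draws(fit1))
```

Working with the raw draws in this way makes it straightforward to compute any other summary, such as the probability that a coefficient is positive.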

As a first validation step, it is useful to look at the parameter values estimated by the algorithm. Below, we use three diagnostic tools:

* The `print` method shows the table of parameters, much like the one displayed after an ordinary linear regression.
* `traceplot` shows the traces of the selected parameters. If the chains have converged, their traces overlap and fluctuate around a stable level, and the draws approximate the posterior distributions.
* `pairs` shows the pairwise relationships between parameters. Strong correlations between some parameters are an indication that the model should be re-parameterised.

In [None]:
print(fit1)
traceplot(fit1)
pairs(fit1)

The `n_eff` and `Rhat` columns of the printed table indicate that the chains have converged (see Sec. \@ref(computation)). We can therefore go on to interpret the results.
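
To give an intuition for what `Rhat` measures, here is a base R sketch of the classic potential scale reduction factor, which compares the variance between chains with the variance within chains. This is a simplified version: Stan reports a split-chain variant, so its values will differ slightly. The function name `rhat_classic` and the simulated chains are hypothetical.

```r
# Classic potential scale reduction factor for one parameter.
# `chains`: numeric matrix, one column per chain, one row per post-warmup draw.
rhat_classic <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))       # mean within-chain variance
  B <- n * var(colMeans(chains))         # between-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled estimate of the posterior variance
  sqrt(var_plus / W)
}

set.seed(42)
mixed <- matrix(rnorm(4000), ncol = 4)          # four chains sampling the same distribution
stuck <- sweep(mixed, 2, c(0, 0, 0, 3), "+")    # same draws, but the fourth chain is shifted by 3

rhat_classic(mixed)   # close to 1: the chains agree
rhat_classic(stuck)   # well above 1: the chains disagree
```

When all chains explore the same distribution, the between-chain term adds almost nothing and the ratio stays close to 1.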

There is strong correlation between some parameters. The numerical results are almost identical to those of the non-Bayesian model. This is not surprising: we used exactly the same model and specified no prior distribution on any parameter, which in Stan amounts to a flat (improper uniform) prior over each parameter's declared range.
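
The link with ordinary least squares can be checked on simulated data without running Stan. With flat priors, the posterior is proportional to the likelihood, so the posterior mode coincides with the maximiser of the likelihood. The sketch below finds that maximiser numerically and compares it with `lm()`; the data and all variable names are hypothetical.

```r
set.seed(123)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))    # hypothetical predictors
y <- as.vector(X %*% c(2, -1, 0.5)) + rnorm(n, sd = 0.5)   # hypothetical outcome

# Negative log-likelihood of the model; par = (theta_1, theta_2, theta_3, log(sigma)).
# Optimising over log(sigma) keeps sigma positive.
neg_loglik <- function(par) {
  theta <- par[1:3]
  sigma <- exp(par[4])
  -sum(dnorm(y, mean = as.vector(X %*% theta), sd = sigma, log = TRUE))
}

mode_fit   <- optim(rep(0, 4), neg_loglik, method = "BFGS")
theta_mode <- mode_fit$par[1:3]                 # likelihood maximiser for theta
theta_ols  <- unname(coef(lm(y ~ 0 + X)))       # least squares, no intercept

round(cbind(theta_mode, theta_ols), 3)
```

The two columns should agree up to the tolerance of the optimiser. Informative priors would pull the Bayesian estimates away from the least-squares solution, most visibly when the data are few or the predictors strongly correlated.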
