## Freni-Sterrantino et al 2017 - BYM2 connected, disconnected for Scotland Lip Cancer Dataset

The BYM2 model for areal data adds two components to a GLM:  an ICAR component, which accounts for the spatial structure of the data, and a random effects component.  See the Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) for details on the ICAR, BYM, and BYM2 models.  The implementation in that case study assumes that the spatial structure is a single, fully connected component, i.e., a graph where any node can be reached from any other node.

In [A note on intrinsic Conditional Autoregressive models for disconnected graphs](https://arxiv.org/abs/1705.04854), Freni-Sterrantino et al. show how to implement this model for disconnected graphs.  In this notebook, we present a Stan implementation of their proposal.

### Areal data:  the counties in Scotland, circa 1980

The canonical dataset used to test and compare different parameterizations of ICAR models is a study on the incidence of lip cancer in Scotland in the 1970s and 1980s.  The data, including the names and coordinates for the counties of Scotland, are available from the R package [SpatialEpi](https://cran.r-project.org/web/packages/SpatialEpi/SpatialEpi.pdf), dataset `scotland`.

Three of these counties are islands:  the Outer Hebrides (western.isles), Shetland, and Orkney.  In the canonical datasets, these islands are connected to the mainland, so that the adjacency graph consists of a single, fully connected component.  However, different maps are possible:  a map with 4 components, the mainland and the 3 islands; or a map with 3 components:  the mainland, a component consisting of Shetland and Orkney, and a singleton consisting of the Hebrides. The following plots demonstrate the differences:

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams

%matplotlib inline

# figure size in inches (optional)
rcParams['figure.figsize'] = (11, 8)

# read images
img_A = mpimg.imread('scot_connected.png')
img_B = mpimg.imread('scot_3_comp.png')
img_C = mpimg.imread('scot_islands.png')


# display images
fig, ax = plt.subplots(1,3)
ax[0].imshow(img_A);
ax[1].imshow(img_B);
ax[2].imshow(img_C);

### Areal data munging:  from spatial polygon to 2D array of edges

Inputs to the Stan model must match the set of variables declared in the `data` block.

The Stan implementation of the ICAR model computes with a 2D array of size 2 $\times$ J, where J is the number of edges in the graph.  Each column of this array represents one undirected edge in the graph: for each edge j, entries [1,j] and [2,j] index the two nodes connected by that edge.  Treating the two rows as parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.
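To make the parallel-array idea concrete, here is a minimal pure-Python sketch of the pairwise difference term, $-\frac{1}{2}\sum_{j}(\phi_{n_1(j)} - \phi_{n_2(j)})^2$. The function name `icar_pairwise_logdensity` and the toy graph are illustrative assumptions, not part of the notebook's helper files, and the Stan model expresses the same sum with vectorized multi-indexing rather than a Python loop.

```python
def icar_pairwise_logdensity(phi, edges):
    """Unnormalized ICAR log density: -0.5 * sum of squared pairwise differences.

    phi   : list with one spatial effect per node
    edges : 2 x J nested list of 1-based node indices, one column per edge
    """
    node1, node2 = edges  # the two rows act as parallel arrays
    total = 0.0
    for a, b in zip(node1, node2):
        diff = phi[a - 1] - phi[b - 1]  # convert 1-based indices to 0-based
        total += diff * diff
    return -0.5 * total


# Hypothetical toy graph: a path 1 - 2 - 3
toy_edges = [[1, 2],
             [2, 3]]
toy_phi = [0.5, 0.0, -0.5]
toy_lp = icar_pairwise_logdensity(toy_phi, toy_edges)
```

Each edge contributes once, so an undirected edge should appear in only one column of the array.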

The `scotland` data is a set of spatial polygons, i.e., a description of the shape of each county in terms of its latitude and longitude coordinates.  The R package [spdep](https://r-spatial.github.io/spdep/index.html) extracts the adjacency relations as an `nb` object.
We have written a set of helper functions that convert the `nb` object for each graph into the set of data structures needed by the Stan models; these are in the file `bym2_helpers.R`.  
The three versions of the Scotland spatial structure are in files `scotland_nbs.data.R`, `scotland_3_comp_nbs.data.R`, and `scotland_islands_nbs.data.R`.
The script `munge_scotland.R` munges the data, and the results have been saved as JSON data files.

## Fit connected graph on Scotland Lip cancer dataset with BYM2 model implemented in Stan.

In [None]:
from cmdstanpy import cmdstan_path, CmdStanModel, install_cmdstan
# install_cmdstan()  # uncomment if CmdStan is not yet installed; installs the latest release

The dataset `scot_connected.data.json` contains the cancer dataset together with the spatial structure.  The cancer study data is:

- `y`: observed outcome - number of cases of lip cancer
- `x`: single predictor - percent of population working in agriculture, forestry, or fisheries.
- `E`: exposure - expected number of cases, based on the population at risk

The spatial structure consists of:

- I: `int<lower = 0> I;  // number of nodes`
- J: `int<lower = 0> J;  // number of edges`
- edges: `int<lower = 1, upper = I> edges[2, J];  // node[1, j] adjacent to node[2, j]`
- tau: `real tau; // scaling factor`
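
The declarations above imply constraints that can be checked in Python before the data reaches Stan. The following is a minimal sketch; the function name `check_spatial_inputs` is a hypothetical helper, and it assumes the data has already been loaded into a dict with the keys `I`, `J`, and `edges`.

```python
def check_spatial_inputs(data):
    """Return a list of problems found in the spatial inputs (empty if none)."""
    problems = []
    n_nodes, n_edges, edges = data["I"], data["J"], data["edges"]
    if len(edges) != 2:
        # the Stan declaration is edges[2, J]: exactly two rows
        return ["edges must have exactly 2 rows"]
    for r, row in enumerate(edges, start=1):
        if len(row) != n_edges:
            problems.append(f"edges row {r} has {len(row)} entries, expected J={n_edges}")
        if any(not (1 <= node <= n_nodes) for node in row):
            problems.append(f"edges row {r} has a node index outside 1..{n_nodes}")
    return problems


# Hypothetical toy inputs: 3 nodes, 2 edges
good = {"I": 3, "J": 2, "edges": [[1, 2], [2, 3]]}
bad = {"I": 3, "J": 2, "edges": [[1, 2], [2, 4]]}
good_report = check_spatial_inputs(good)
bad_report = check_spatial_inputs(bad)
```

For the notebook's data, the dict would come from `json.load` on `scot_connected.data.json`.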

The helper function `nb_to_edge_array` takes the `nb` object and returns the 2 $\times$ J edge array; the helper function `scaling_factor` uses the edge array to compute the scaling factor `tau`, the geometric mean of the marginal variances implied by the corresponding adjacency matrix.
For the fully connected Scotland graph, the spatial data inputs are:

```
  "I": 56,
  "J": 132,

  "edges": [
    [ 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 7, 7, 7, 7, 9, 9, 9, 9, 9, 10, 10, 11, 13, 13, 14, 14, 14, 15, 15, 15, 16, 16, 16, 16, 17, 17, 18, 18, 18, 18, 18, 20, 21, 21, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 26, 26, 26, 27, 27, 27, 28, 28, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 31, 31, 32, 33, 33, 34, 34, 34, 34, 34, 34, 34, 35, 35, 36, 36, 36, 37, 37, 38, 38, 38, 38, 38, 39, 39, 40, 40, 40, 41, 41, 41, 42, 42, 44, 44, 45, 46, 46, 47, 47, 47, 48, 49, 49, 49, 51, 52, 55 ],
    [ 5, 9, 11, 19, 7, 10, 6, 12, 18, 20, 28, 11, 12, 13, 19, 8, 10, 13, 16, 17, 11, 17, 19, 23, 29, 16, 22, 12, 17, 19, 31, 32, 35, 25, 29, 50, 17, 21, 22, 29, 19, 29, 20, 28, 33, 55, 56, 55, 29, 50, 29, 34, 36, 37, 39, 27, 30, 31, 44, 47, 48, 55, 56, 26, 29, 29, 42, 43, 31, 32, 55, 33, 45, 34, 43, 50, 38, 42, 44, 45, 56, 32, 35, 46, 47, 35, 45, 56, 39, 40, 42, 43, 51, 52, 54, 37, 46, 37, 39, 41, 41, 46, 42, 44, 49, 51, 54, 40, 41, 41, 49, 52, 46, 49, 53, 43, 51, 48, 49, 56, 47, 53, 48, 49, 53, 49, 52, 53, 54, 54, 54, 56 ]
  ],

  "tau": 0.4853,
```
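
As a rough illustration of what a scaling factor of this kind measures, here is a pure-Python sketch. It assumes the scaling factor is the geometric mean of the diagonal of the generalized inverse of $Q = D - A$, where $A$ is the adjacency matrix and $D$ the diagonal matrix of neighbor counts. For a connected graph on $n$ nodes, that inverse can be obtained as $(Q + \frac{1}{n}\mathbf{1}\mathbf{1}^T)^{-1} - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. The function names are hypothetical, and the R helper `scaling_factor` may compute its result differently.

```python
import math


def laplacian(n_nodes, edges):
    """Q = D - A built from a 2 x J nested list of 1-based edges."""
    Q = [[0.0] * n_nodes for _ in range(n_nodes)]
    for a, b in zip(edges[0], edges[1]):
        i, j = a - 1, b - 1
        Q[i][i] += 1.0
        Q[j][j] += 1.0
        Q[i][j] -= 1.0
        Q[j][i] -= 1.0
    return Q


def invert(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(M)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def scaling_factor_sketch(n_nodes, edges):
    """Geometric mean of the marginal variances for one connected graph."""
    if n_nodes == 1:
        return 1.0  # assumption: a single isolated node is left unscaled
    shift = 1.0 / n_nodes
    shifted = [[q + shift for q in row] for row in laplacian(n_nodes, edges)]
    inv = invert(shifted)
    variances = [inv[i][i] - shift for i in range(n_nodes)]
    return math.exp(sum(math.log(v) for v in variances) / n_nodes)


# Hypothetical toy graph: two nodes joined by a single edge
two_node_tau = scaling_factor_sketch(2, [[1], [2]])
```

The sketch is only valid for a connected graph; on a disconnected graph it would have to be applied to each component separately.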

In [None]:
from cmdstanpy import cmdstan_path, CmdStanModel
bym2_model = CmdStanModel(stan_file='bym2.stan')
bym2_fit = bym2_model.sample(data='scot_connected.data.json')

In [None]:
bym2_fit.summary()

In [None]:
print(bym2_fit.diagnose())  # diagnose() returns a string; print it for readable output

## Fit disconnected graphs on Scotland Lip cancer dataset with BYM2 model implemented in Stan, following Freni-Sterrantino


All components of the BYM2 model remain the same.  The same 2 $\times$ J edges array encodes the spatial structure of the graph, and as before, the spatial prior `phi` is a vector with one element per node (areal unit).  To compute the elements of `phi` on a per-component basis, we use Stan's multi-indexing operators.  For each component, we provide a vector of indices into `phi` and a vector of indices into the edges array.  Because Stan doesn't have ragged arrays (yet), we construct two zero-padded rectangular matrices: each has one row per component (including singletons), and the number of columns is the number of nodes and the number of edges, respectively.  In addition, we compute a per-component scaling factor.
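
The padded-row idea can be illustrated outside Stan. The sketch below shows one way such rows could be consumed: the per-component count says how many leading entries of a row are real indices, and the rest is padding. The function name `component_slices` and the toy values are assumptions for illustration, and the Stan model performs the equivalent selection with multi-indexing.

```python
def component_slices(phi, node_idx_rows, node_counts):
    """Pick out each component's elements of phi from zero-padded index rows."""
    pieces = []
    for row, count in zip(node_idx_rows, node_counts):
        idxs = row[:count]  # keep the real indices, drop the zero padding
        pieces.append([phi[i - 1] for i in idxs])  # 1-based to 0-based
    return pieces


# Hypothetical toy graph: 4 nodes, component {1, 3, 4} and singleton {2}
toy_phi = [0.1, 0.2, 0.3, 0.4]
toy_rows = [[1, 3, 4, 0],
            [2, 0, 0, 0]]
toy_counts = [3, 1]
toy_pieces = component_slices(toy_phi, toy_rows, toy_counts)
```

Padding with 0 is safe here because 0 is never a valid 1-based index, so a padded entry cannot be mistaken for a node.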

The helper function `index_components` takes an `nb` object as input and returns the list of data structures needed by the `bym2_islands.stan` model.  We use the R package `jsonlite` to save this in JSON format.  For the Scotland map with 3 components, in file `scotland_3_comp_nbs.data.R`, the functions `index_components` and `write_json` produce the following input data, which is used in addition to the number of nodes, the number of edges, and the edge array:

```
    "K":3,
    "K_node_cts":[53,2,1],
    "K_edge_cts":[126,1,0],
    "K_node_idxs":[[1,2,3,4,5,7,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,0,0,0],
		   [6,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
		   [11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]],
    "K_edge_idxs":[[1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,0],
		   [13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
		   [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]],

  "tau": [0.4504, 0.25, 1],
```
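
For readers working in Python rather than R, index structures of the same shape can be derived from the number of nodes and the edge array alone. The following is a minimal sketch using union-find; the name `index_components_sketch` is hypothetical, and the ordering of components by smallest node index is an assumption of this sketch rather than a documented property of the R helper `index_components`.

```python
def index_components_sketch(n_nodes, edges):
    """Group nodes and edges by connected component and zero-pad the index rows.

    edges is a 2 x J nested list of 1-based node indices.
    """
    parent = list(range(n_nodes + 1))  # slot 0 unused so indices stay 1-based

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in zip(edges[0], edges[1]):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest node becomes the root

    roots = sorted({find(v) for v in range(1, n_nodes + 1)})
    comp_of = {root: k for k, root in enumerate(roots)}
    node_groups = [[] for _ in roots]
    edge_groups = [[] for _ in roots]
    for v in range(1, n_nodes + 1):
        node_groups[comp_of[find(v)]].append(v)
    for j, a in enumerate(edges[0], start=1):
        edge_groups[comp_of[find(a)]].append(j)  # both endpoints share a component

    n_edges = len(edges[0])

    def pad(xs, width):
        return xs + [0] * (width - len(xs))

    return {
        "K": len(roots),
        "K_node_cts": [len(g) for g in node_groups],
        "K_edge_cts": [len(g) for g in edge_groups],
        "K_node_idxs": [pad(g, n_nodes) for g in node_groups],
        "K_edge_idxs": [pad(g, n_edges) for g in edge_groups],
    }


# Hypothetical toy graph: 6 nodes, components {1, 2, 3}, {4, 5}, and singleton {6}
toy = index_components_sketch(6, [[1, 2, 4],
                                  [2, 3, 5]])
```

A singleton component has no edges, so its row in `K_edge_idxs` is all padding and its edge count is 0, as in the third row of the Scotland data above.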

In [None]:
from cmdstanpy import cmdstan_path, CmdStanModel
bym2_islands_model = CmdStanModel(stan_file='bym2_islands.stan')

In [None]:
print(bym2_islands_model.code())

In [None]:
import json
with open('scot_3_comp.data.json') as fd:
    scot_data = json.load(fd)
# print(scot_data)

In [None]:
bym2_islands_fit = bym2_islands_model.sample(data=scot_data, max_treedepth=11)
bym2_islands_fit.summary()

In [None]:
print(bym2_islands_fit.diagnose())  # diagnose() returns a string; print it for readable output