# Spatial modelling of probabilities in production: all GB

This document contains the code used to perform spatial modelling of probabilities as deployed for the whole of GB. It is the production version of [`sp_model_chip_probabilities`](sp_model_chip_probabilities).

The following models are fitted:

1. `argmax`: pick the top probability from those produced for the chip by the neural net
1. `logit`: fit a Logit model using the chip probabilities from the neural net
1. `logit_wx`: fit a Logit model using the chip probabilities _and_ the spatial lag of chip probabilities (i.e., also using the probabilities from neighboring chips)
1. `gbt`: fit a histogram-based gradient boosted tree model using the chip probabilities from the neural net
1. `gbt_wx`: fit a histogram-based gradient boosted tree model using the chip probabilities _and_ the spatial lag of chip probabilities (i.e., also using the probabilities from neighboring chips)
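
To make the difference between the plain and the `_wx` variants concrete, the sketch below computes the `argmax` class and a spatial lag for four toy chips. It assumes a hand-written, row-standardised contiguity matrix `W` and made-up probabilities with three classes; the weights used in production are built separately and may be defined differently.

```python
import numpy as np

# Toy probabilities for 4 chips and 3 classes (illustrative values only)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])

# `argmax`: index of the most probable class for each chip
argmax_class = probs.argmax(axis=1)

# Hypothetical neighbour structure: chips arranged in a line, 0-1-2-3
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-standardise so that every row of W sums to one
W = adjacency / adjacency.sum(axis=1, keepdims=True)

# Spatial lag: average probability vector of each chip's neighbours
lag = W @ probs

# Feature matrix for the `_wx` models: own probabilities plus their lag
X_wx = np.hstack([probs, lag])
```

The non-spatial models would be fitted on `probs` alone, while the `_wx` models would see `X_wx`, which has twice as many columns.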

These five models are fitted to neural net results for every combination of the following features:

- Chip size (`8`, `16`, `32`, `64`)
- Architecture (base image classification, slided image classification, multi-output regression)

Each combination contains three original files:

- `XXX.npy`: original chips (`N x S x S x 3` with `S` being chip size)
- `XXX_prediction.npy`: predicted probabilities (`N x 12`)
- `XXX_labels.parquet`: geo-table with all the chip geometries, their split (`nn_train`, `nn_val`, `ml_train`, `ml_val`) and the proportion of the chip assigned to each label

This notebook generates single-class predictions for each combination and stores them on disk (together with their geometries and true labels). The file name format is:

> `pred_SS_AAA_model.parquet`

- `SS` is size (`8`, `16`, `32`, `64`)
- `AAA` is the architecture (`bic`, `sic`, `mor`)
- `model` is the modelling approach used to generate the class prediction (`argmax`, `logit`, `logit_wx`, `gbt`, `gbt_wx`)
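
As a minimal sketch of this naming scheme, the snippet below enumerates every output file name. The helper `pred_file_name` is hypothetical, and it assumes sizes are written without zero-padding (`8`, not `08`), mirroring the input file names.

```python
from itertools import product

chip_sizes = [8, 16, 32, 64]
archs = ["bic", "sic", "mor"]
models = ["argmax", "logit", "logit_wx", "gbt", "gbt_wx"]


def pred_file_name(size, arch, model):
    """Build an output file name following `pred_SS_AAA_model.parquet`."""
    # Assumption: the size is not zero-padded
    return f"pred_{size}_{arch}_{model}.parquet"


# One file per (size, architecture, model) combination
pred_files = [
    pred_file_name(size, arch, model)
    for size, arch, model in product(chip_sizes, archs, models)
]
```

With four sizes, three architectures and five models, `pred_files` holds 60 names, from `pred_8_bic_argmax.parquet` to `pred_64_mor_gbt_wx.parquet`.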

To generate a single instance of the file above, we need to perform the following steps:

- Pull data for the instance
- Build spatial weights for training and validation
- Train model
- Use validation to get predictions
- Write them out to disk
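
The first step, pulling data for one instance, amounts to locating its three input files. The sketch below builds those paths with a hypothetical helper `input_paths`. The `v2` prefix and the `slided`/`multi` infixes are taken from the label file names in the data folder; the mapping from architecture code to infix (`bic` to none, `sic` to `slided`, `mor` to `multi`) is an assumption, as is the idea that the `.npy` files share the same stem.

```python
import os

# Assumed mapping from architecture code to the infix used in file names
ARCH_INFIX = {"bic": "", "sic": "_slided", "mor": "_multi"}


def input_paths(data_dir, size, arch, prefix="v2"):
    """Return the three input file paths for one size/architecture pair."""
    stem = f"{prefix}_{size}{ARCH_INFIX[arch]}"
    return {
        "chips": os.path.join(data_dir, f"{stem}.npy"),
        "probs": os.path.join(data_dir, f"{stem}_prediction.npy"),
        "labels": os.path.join(data_dir, f"{stem}_labels.parquet"),
    }


paths = input_paths("data", 16, "mor")
```

Here `paths["labels"]` points at `v2_16_multi_labels.parquet` inside the `data` folder.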


In [43]:
import os
import pandas
import geopandas
import numpy as np
import tools_chip_prob_modelling as tools

data_p = '/home/jovyan/data/spatial_signatures/chip_probs/prod_probs/'

chip_sizes = [8, 16, 32, 64]
archs = ['bic', 'sic', 'mor']
models = ['argmax', 'logit', 'logit_wx', 'gbt', 'gbt_wx']

In [36]:
# List the available label files, dropping the leading `v2_` prefix
pandas.Series(
    [
        i[3:]
        for i in os.listdir(data_p)
        if i.startswith('v2_') and i.endswith('.parquet')
    ]
).sort_values()

7            16_labels.parquet
3      16_multi_labels.parquet
0     16_slided_labels.parquet
2            32_labels.parquet
10     32_multi_labels.parquet
5     32_slided_labels.parquet
6            64_labels.parquet
11     64_multi_labels.parquet
1     64_slided_labels.parquet
8             8_labels.parquet
4       8_multi_labels.parquet
9      8_slided_labels.parquet
dtype: object

In [38]:
# `data_p` already ends with a slash, so no extra separator is needed
a1 = geopandas.read_parquet(
    data_p + 'v2_16_multi_labels.parquet'
)
a2 = geopandas.read_parquet(
    data_p + 'v2_16_slided_labels.parquet'
)

In [39]:
a1.shape

(126016, 22)

In [41]:
a2.shape

(1087827, 3)