# Spatial modelling of probabilities in production: all GB

This document contains the code used for the spatial modelling of chip probabilities as deployed for GB. It is the production version of [`sp_model_chip_probabilities`](sp_model_chip_probabilities).

The following models are fitted:

1. `maxprob`: pick the top probability from those produced for the chip by the neural net
1. `logite`: fit a Logit ensemble model using the chip probabilities from the neural net
1. `logite_wx`: fit a Logit ensemble model using the chip probabilities _and_ the spatial lag of chip probabilities (i.e., also using the probabilities from neighboring chips)
1. `gbt`: fit a histogram-based gradient boosted tree model using the chip probabilities from the neural net
1. `gbt_wx`: fit a histogram-based gradient boosted tree model using the chip probabilities _and_ the spatial lag of chip probabilities (i.e., also using the probabilities from neighboring chips)
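
As a minimal sketch of the simplest of these models, `maxprob` amounts to an `argmax` over the class probabilities of each chip. The probabilities and class names below are made up for illustration (the real arrays are `N x 12`), and the actual implementation lives in `tools`, which is not shown here.

```python
import numpy as np

# Hypothetical probabilities for 3 chips and 4 classes (illustration only)
probs = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
])
# Hypothetical class names, one per column of `probs`
class_names = np.array(["class_a", "class_b", "class_c", "class_d"])

# Index of the largest probability in each row (one row per chip)
top_class = probs.argmax(axis=1)
# Map column indices back to class names
maxprob_pred = class_names[top_class]
print(maxprob_pred)
```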

These five models are fitted to the neural net results for every combination of the following:

- Chip size (`8`, `16`, `32`, `64`)
- Architecture (base image classification, slided image classification, multi-output regression)

Each combination contains three original files:

- `XXX.npy`: original chips (`N x S x S x 3` with `S` being chip size)
- `XXX_prediction.npy`: predicted probabilities (`N x 12`)
- `XXX_labels.parquet`: geo-table with all the chip geometries, their split (`nn_train`, `nn_val`, `ml_train`, `ml_val`) and the proportion of the chip assigned to each label
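
To illustrate how a table like `XXX_labels.parquet` can be turned into modelling inputs, here is a rough sketch. The column names (`split`, `class_a`, ...) are hypothetical, since the real schema is not shown in this notebook, and taking the label with the largest proportion as the "true" class is an assumption rather than the documented behaviour of `tools`.

```python
import pandas

# Hypothetical label table: one row per chip, proportions per label
labels = pandas.DataFrame({
    "split": ["nn_train", "nn_val", "ml_train", "ml_val"],
    "class_a": [0.7, 0.1, 0.5, 0.0],
    "class_b": [0.3, 0.9, 0.2, 1.0],
    "class_c": [0.0, 0.0, 0.3, 0.0],
})
class_cols = ["class_a", "class_b", "class_c"]

# Assumption: the dominant label is the one covering most of the chip
labels["true_label"] = pandas.Categorical(
    labels[class_cols].idxmax(axis=1), categories=class_cols
)

# Subsets by split; the `ml_*` names suggest these feed the models here
ml_train = labels[labels["split"] == "ml_train"]
ml_val = labels[labels["split"] == "ml_val"]
```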

This notebook will generate single-class predictions for each combination and store them on disk (together with their geometries and true labels). The file name format will be:

> `pred_SS_AAA_model.parquet`

- `SS` is size (`8`, `16`, `32`, `64`)
- `AAA` is the architecture (`bic`, `sic`, `mor`)
- `model` is the modelling approach used to generate the class prediction (`maxprob`, `logite`, `logite_wx`, `gbt`, `gbt_wx`)
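
The naming convention can be expressed as a small helper. `pred_file_name` is a hypothetical function written for illustration; it is not part of `tools`, which handles the actual writing.

```python
VALID_SIZES = (8, 16, 32, 64)
VALID_ARCHS = ("bic", "sic", "mor")


def pred_file_name(chip_size, arch, model):
    """Build `pred_SS_AAA_model.parquet` (hypothetical helper)."""
    if chip_size not in VALID_SIZES:
        raise ValueError(f"Unknown chip size: {chip_size}")
    if arch not in VALID_ARCHS:
        raise ValueError(f"Unknown architecture: {arch}")
    return f"pred_{chip_size}_{arch}_{model}.parquet"


print(pred_file_name(8, "bic", "gbt"))
```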

To generate a single instance of the file above, we need to perform the following steps:

- Pull data for the instance
   - Read files
   - Convert labels to `Categorical` w/ actual names
   - Load/join probs
- Build spatial weights for training and validation and lag probabilities
- Train model
- Use validation to get predictions
- Write them out to disk
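
The "lag probabilities" step can be sketched with plain `numpy`. In the notebook this is handled inside `tools` using `libpysal` weights; the type of weights used is not stated here, so the row-standardised adjacency matrix below is an assumption, and the four-chip layout is made up.

```python
import numpy as np

# Hypothetical binary adjacency for 4 chips arranged in a line: 0-1-2-3
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-standardise so the lag is the average over each chip's neighbors
w = adjacency / adjacency.sum(axis=1, keepdims=True)

# Hypothetical probabilities for 2 classes (the real arrays have 12)
probs = np.array([
    [0.8, 0.2],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.0, 1.0],
])

# Spatial lag: average neighbor probability, per class
lagged = w @ probs

# The `_wx` models use both the original and the lagged probabilities
features_wx = np.hstack([probs, lagged])
```

With row-standardised weights, each lagged value remains a valid average of probabilities, so the lagged columns stay on the same scale as the original ones.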


In [5]:
import os
import warnings
import pandas
import geopandas
import numpy as np
import tools_chip_prob_modelling as tools
from datetime import datetime

from libpysal import weights

data_p = '/home/jovyan/data/spatial_signatures/chip_probs/prod_probs/'
out_p = '/home/jovyan/data/spatial_signatures/chip_probs/prod_model_outputs/'

chip_sizes = [8, 16, 32, 64]
archs = {'bic': '', 'sic': 'slided', 'mor': 'multi'}
archs_r = {archs[i]: i for i in archs}

cd2nm = tools.parse_nn_json(
    '/home/jovyan/data/spatial_signatures/chip_probs/efficientnet_pooling_256_12.json'
).rename(lambda i: i.replace('signature_type_', ''))

## Single instance

Used for testing only.

In [83]:
p_ex = data_p + 'v2_8_slided'
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    out, nm = tools.premodelling_process(p_ex, cd2nm)

In [55]:

log = tools.run_all_models(out, nm, out_p, True)

	### 32_bic ###

2022-08-17 21:52:12.912658 | <function run_maxprob at 0x7fc0982e09d0> completed successfully
Optimization terminated successfully.
         Current function value: 0.625027
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.506726
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.668005
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.665365
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.693088
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.686854
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.682072
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.692760
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692991
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692726
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692902
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.693002
  



2022-08-17 21:52:15.397445 | <function run_logite at 0x7fc0982a63a0> completed successfully
Optimization terminated successfully.
         Current function value: 0.574965
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.378409
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.647169
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.654048
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.693062
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.684661
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.677947
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.692598
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692931
         Ite

## Big loop

We divide the model runs into two steps so that we can work on the output results while the grid search for the GBT computes. The structure and code are the same, but we first run the `maxprob` and `logite` models and then run a big job with the boosted trees only.

### `maxprob`, `logite`

In [8]:
from importlib import reload
reload(tools);

In [None]:
%%time

log_p = 'big_run_log_maxprob_logite.txt'
! rm -f $log_p
log = f'{datetime.now()} |Log| Start\n'
with open(log_p, 'w') as l:
    l.write(log)
    
for chip_size in chip_sizes:
    for arch in archs:
        p = data_p + (
            f'v2_{chip_size}_{archs[arch]}.'
            .replace('_.', '') # bic has no keyword
            .strip('.')        # in data files
        )
        print(p)
        with open(log_p, 'a') as l:
            l.write(p + '\n')  # newline keeps each path on its own log line
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            db, name = tools.premodelling_process(p, cd2nm)
            models = ['maxprob', 'logite']
            log = tools.run_all_models(
                db, name, out_p, verbose=True, fo=log_p, models=models
            )
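
The path expression in the loop above relies on a `.replace`/`.strip` trick to drop the trailing underscore when the architecture has no keyword (`bic`). As a sketch, the same file stem can be built more explicitly; `instance_stem` is a hypothetical helper, checked here against the original expression.

```python
archs = {'bic': '', 'sic': 'slided', 'mor': 'multi'}


def stem_original(chip_size, arch):
    # Expression used in the loop above
    return f'v2_{chip_size}_{archs[arch]}.'.replace('_.', '').strip('.')


def instance_stem(chip_size, arch):
    # Hypothetical, more explicit alternative: only append the keyword if any
    keyword = archs[arch]
    parts = ['v2', str(chip_size)] + ([keyword] if keyword else [])
    return '_'.join(parts)


for chip_size in [8, 16, 32, 64]:
    for arch in archs:
        assert instance_stem(chip_size, arch) == stem_original(chip_size, arch)
```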

### `gbt`

In [None]:
%%time

log_p = 'big_run_log_gbt.txt'
! rm -f $log_p
log = f'{datetime.now()} |Log| Start\n'
with open(log_p, 'w') as l:
    l.write(log)
    
for chip_size in chip_sizes:
    for arch in archs:
        p = data_p + (
            f'v2_{chip_size}_{archs[arch]}.'
            .replace('_.', '') # bic has no keyword
            .strip('.')        # in data files
        )
        print(p)
        with open(log_p, 'a') as l:
            l.write(p + '\n')  # newline keeps each path on its own log line
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            db, name = tools.premodelling_process(p, cd2nm)
            models = ['gbt']
            log = tools.run_all_models(
                db, name, out_p, verbose=True, fo=log_p, models=models
            )

/home/jovyan/data/spatial_signatures/chip_probs/prod_probs/v2_8
	### 8_bic ###

