# Spatial modelling of probabilities in production: all GB

This document contains the code used for the spatial modelling of chip probabilities as deployed for the whole of GB. It is the production version of [`sp_model_chip_probabilities`](sp_model_chip_probabilities).

The following models are fitted:

1. `argmax`: pick the top probability from those produced for the chip by the neural net
1. `logite`: fit a Logit ensemble model using the chip probabilities from the neural net
1. `logite_wx`: fit a Logit ensemble model using the chip probabilities _and_ the spatial lag of chip probabilities (i.e., also using the probabilities from neighboring chips)
1. `gbt`: fit a histogram-based gradient boosted tree model using the chip probabilities from the neural net
1. `gbt_wx`: fit a histogram-based gradient boosted tree model using the chip probabilities _and_ the spatial lag of chip probabilities (i.e., also using the probabilities from neighboring chips)
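
As a point of reference for the simplest model in the list, the `argmax` rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code in `tools_chip_prob_modelling`; the class names and probabilities are made up, and `argmax_predict` is a hypothetical helper.

```python
import numpy as np

# Hypothetical class names and probabilities, for illustration only
class_names = np.array(["class_a", "class_b", "class_c"])
probs = np.array(
    [
        [0.2, 0.5, 0.3],
        [0.7, 0.1, 0.2],
        [0.1, 0.3, 0.6],
    ]
)


def argmax_predict(probs, class_names):
    """Return the name of the most probable class for each chip (row)."""
    return class_names[np.argmax(probs, axis=1)]


pred = argmax_predict(probs, class_names)
```

The other four models replace this fixed rule with a fitted classifier that takes the same probabilities (and, for the `_wx` variants, their spatial lag) as features.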

These five models will be fitted to the neural net outputs for every combination of the following:

- Chip size (`8`, `16`, `32`, `64`)
- Architecture (base image classification, slided image classification, multi-output regression)

Each combination contains three original files:

- `XXX.npy`: original chips (`N x S x S x 3` with `S` being chip size)
- `XXX_prediction.npy`: predicted probabilities (`N x 12`)
- `XXX_labels.parquet`: geo-table with all the chip geometries, their split (`nn_train`, `nn_val`, `ml_train`, `ml_val`) and the proportion of the chip assigned to each label

This notebook will generate single-class predictions for each combination and store them on disk (together with their geometries and true labels). The file name format will be:

> `pred_SS_AAA_model.parquet`

- `SS` is size (`8`, `16`, `32`, `64`)
- `AAA` is the architecture (`bic`, `sic`, `mor`)
- `model` is the modelling approach used to generate the class prediction (`argmax`, `logit`, `logit_wx`, `gbt`, `gbt_wx`)
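
The naming scheme above can be expressed as a small helper. This is only a sketch: `prediction_file_name` is a hypothetical function, not part of `tools`, and it assumes the size is written without zero-padding.

```python
def prediction_file_name(size, arch, model):
    """Build an output file name following the `pred_SS_AAA_model.parquet` pattern.

    Assumes `size` is written as-is (no zero-padding), which the
    original text does not specify.
    """
    return f"pred_{size}_{arch}_{model}.parquet"


example = prediction_file_name(32, "bic", "argmax")
```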

To generate a single instance of the file above, we need to perform the following steps:

- Pull data for the instance
   - Read files
   - Convert labels to `Categorical` w/ actual names
   - Load/join probs
- Build spatial weights for training and validation and lag probabilities
- Train model
- Use validation to get predictions
- Write them out to disk
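
The spatial-lag step is the least obvious of these, so here is a minimal sketch of the idea. It is an illustration under stated assumptions, not the production code: it uses a dense, row-standardised rook contiguity matrix on a toy regular grid with random probabilities, whereas the notebook builds weights with `libpysal` from the chip geometries. `rook_weights` is a hypothetical helper.

```python
import numpy as np


def rook_weights(n_rows, n_cols):
    """Dense, row-standardised rook contiguity matrix for a regular grid."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[i, rr * n_cols + cc] = 1.0
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows without neighbours (islands) are left as zeros, so their lag is zero
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)


# Toy data: a 3 x 3 grid of chips with 12 class probabilities each
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(12), size=9)

w = rook_weights(3, 3)
lagged = w @ probs                     # average probabilities of neighbouring chips
features = np.hstack([probs, lagged])  # inputs for the `_wx` model variants
```

With row-standardised weights, each lagged value is the mean of that class probability across the chip's neighbours, so the `_wx` models see twice as many features as their non-spatial counterparts.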


In [53]:
import os
import pandas
import geopandas
import numpy as np
import tools_chip_prob_modelling as tools

from libpysal import weights

data_p = '/home/jovyan/data/spatial_signatures/chip_probs/prod_probs/'
out_p = '/home/jovyan/data/spatial_signatures/chip_probs/prod_model_outputs/'

chip_sizes = [8, 16, 32, 64]
archs = {'bic': '', 'sic': 'slided', 'mor': 'multi'}
archs_r = {archs[i]: i for i in archs}

cd2nm = tools.parse_nn_json(
    '/home/jovyan/data/spatial_signatures/chip_probs/efficientnet_pooling_256_12.json'
).rename(lambda i: i.replace('signature_type_', ''))

## Single instance

In [20]:
p_ex = data_p + 'v2_32'
out, nm = tools.premodelling_process(p_ex, cd2nm)

 There are 23855 disconnected components.
 There are 9370 islands with ids: 17, 18, 19, 22, 25, 26, 40, 44, 45, 49, 51, 52, 58, 69, 70, 74, 78, 85, 109, 110, 119, 122, 139, 140, 147, 150, 151, 161, 169, 172, 173, 176, 183, 184, 185, 186, 187, 211, 214, 243, 244, 245, 246, 247, 251, 252, 257, 258, 266, 303, 320, 333, 334, 335, 351, 352, 353, 366, 369, 370, 380, 381, 382, 391, 392, 393, 394, 395, 396, 397, 400, 401, 402, 423, 424, 437, 446, 456, 464, 465, 470, 480, 494, 495, 525, 536, 562, 567, 605, 613, 622, 625, 632, 633, 647, 654, 685, 687, 688, 689, 710, 711, 716, 717, 718, 734, 740, 741, 743, 747, 748, 749, 756, 775, 776, 777, 794, 806, 807, 808, 809, 826, 827, 855, 864, 877, 880, 896, 897, 898, 915, 941, 968, 969, 974, 975, 1012, 1014, 1021, 1022, 1023, 1033, 1051, 1052, 1053, 1054, 1057, 1058, 1059, 1066, 1067, 1081, 1133, 1141, 1142, 1143, 1149, 1164, 1165, 1166, 1170, 1179, 1180, 1214, 1215, 1227, 1241, 1242, 1247, 1260, 1269, 1281, 1282, 1287, 1291, 1292, 1293, 1330, 1340, 1343

In [55]:

log = tools.run_all_models(out, nm, out_p, True)

	### 32_bic ###

2022-08-17 21:52:12.912658 | <function run_maxprob at 0x7fc0982e09d0> completed successfully
Optimization terminated successfully.
         Current function value: 0.625027
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.506726
         Iterations 7




Optimization terminated successfully.
         Current function value: 0.668005
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.665365
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.693088
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.686854
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.682072
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.692760
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692991
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692726
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692902
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.693002
  



2022-08-17 21:52:15.397445 | <function run_logite at 0x7fc0982a63a0> completed successfully
Optimization terminated successfully.
         Current function value: 0.574965
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.378409
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.647169
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.654048
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.693062
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.684661
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.677947
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.692598
         Iterations 3
Optimization terminated successfully.
         Current function value: 0.692931
         Ite

## Big loop
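
The loop in this section assembles each input path stem from the chip size and the architecture keyword, dropping the trailing underscore when the keyword is empty (`bic`). A simpler formulation that should yield the same stems is sketched below; `input_stem` is a hypothetical helper, not part of `tools`.

```python
archs = {"bic": "", "sic": "slided", "mor": "multi"}


def input_stem(chip_size, arch_keyword):
    """Return the file stem for a chip size and architecture keyword.

    An empty keyword (used for `bic`) leaves no trailing underscore.
    """
    return f"v2_{chip_size}_{arch_keyword}".rstrip("_")


stems = [input_stem(32, keyword) for keyword in archs.values()]
```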

In [77]:
from importlib import reload
reload(tools);

In [None]:
from datetime import datetime  # needed for the timestamp in the log header

log_p = 'big_run_log.txt'
! rm -f $log_p
log = f'{datetime.now()} |Log| Start\n'
with open(log_p, 'w') as l:
    l.write(log)

for chip_size in chip_sizes:
    for arch in archs:
        p = data_p + (
            f'v2_{chip_size}_{archs[arch]}.'
            .replace('_.', '') # bic has no keyword
            .strip('.')        # in data files
        )
        db, name = tools.premodelling_process(p, cd2nm)
        log += tools.run_all_models(db, name, out_p, verbose=True, fo=log_p)


 There are 57875 disconnected components.
 There are 20360 islands with ids: 12, 15, 16, 33, 65, 66, 71, 73, 74, 75, 76, 77, 78, 100, 116, 117, 118, 143, 145, 171, 172, 173, 199, 200, 201, 204, 211, 231, 249, 266, 267, 289, 290, 291, 318, 319, 322, 330, 332, 340, 351, 352, 356, 374, 375, 395, 396, 397, 398, 399, 414, 423, 428, 429, 430, 432, 434, 441, 442, 443, 444, 447, 448, 452, 453, 454, 455, 461, 488, 493, 494, 495, 496, 499, 505, 522, 523, 546, 554, 555, 556, 562, 578, 584, 589, 590, 604, 607, 608, 642, 646, 649, 661, 662, 663, 664, 665, 678, 679, 680, 684, 688, 689, 708, 717, 718, 719, 720, 721, 738, 748, 749, 750, 757, 772, 773, 774, 777, 778, 779, 780, 781, 794, 795, 796, 797, 810, 811, 819, 835, 843, 844, 845, 853, 856, 868, 871, 872, 873, 874, 882, 895, 915, 916, 917, 918, 959, 966, 967, 972, 975, 976, 984, 994, 995, 996, 997, 998, 1001, 1017, 1026, 1027, 1051, 1058, 1061, 1062, 1071, 1083, 1084, 1085, 1094, 1095, 1096, 1097, 1098, 1111, 1114, 1117, 1119, 1126, 1135, 1154, 11