# NDVI

This document shows how to compute Normalized Difference Vegetation Index (NDVI) scores for each of the enclosed tessellation (ET) cells in GB.

In [7]:
! echo "Run this notebook using version $GDS_ENV_VERSION of the gds_env"
SERVER_IP = open("../../SERVER_IP").read().strip("\n")

Run this notebook using version 6.0alpha of the gds_env


## Set up

Since we will run some computations on a Dask cluster, let's set it up first:

In [2]:
! cat ../../worker-spec.yml

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: darribas/gds_py:6.0alpha1
    imagePullPolicy: IfNotPresent
    args: [start.sh, dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 2GB, --death-timeout, '60']
    name: dask
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: "1"
        memory: 2G


In [1]:
from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da

# Set up cluster
cluster = KubeCluster.from_yaml('../../worker-spec.yml')
# Provision 50 worker pods
cluster.scale(50)
# Connect Dask to the cluster
client = Client(cluster)

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://138.253.73.24:37297
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.scheduler - INFO - Receive client connection: Client-106cf4b4-3296-11eb-8594-80e82cd20b5e
distributed.core - INFO - Starting established connection


In [2]:
cluster

distributed.scheduler - INFO - Register worker <Worker 'tcp://10.1.151.210:36721', name: 36, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.1.151.210:36721


distributed.core - INFO - Starting established connection


Bring in the other libraries we'll use:

In [3]:
import os
import fsspec
import pandas
import geopandas
import rioxarray, xarray
from dask import dataframe as dd
from numpy import percentile

## Connecting to the GHS Mosaic

The full mosaic is stored as a folder of cloud-optimized GeoTIFFs (COGs) served over HTTP and tied together by a VRT file. First, let's build the URL for the mosaic:

In [8]:
mosaic_url = f"http://{SERVER_IP}:8000/ghs_composite_s2/GHS-composite-S2.vrt"

We inspect the details of the mosaic to choose a chunk size:

In [7]:
! rio info $mosaic_url | python -m json.tool

{
    "blockxsize": 128,
    "blockysize": 128,
    "bounds": [
        -222823.73719089525,
        -213574.25107683009,
        996789.2497132053,
        1612237.380579703
    ],
    "colorinterp": [
        "gray",
        "undefined",
        "undefined",
        "undefined"
    ],
    "count": 4,
    "crs": "EPSG:27700",
    "descriptions": [
        null,
        null,
        null,
        null
    ],
    "driver": "VRT",
    "dtype": "uint16",
    "height": 182437,
    "indexes": [
        1,
        2,
        3,
        4
    ],
    "lnglat": [
        -2.211309842042783,
        56.18643258743896
    ],
    "mask_flags": [
        [
            "nodata"
        ],
        [
            "nodata"
        ],
        [
            "nodata"
        ],
        [
            "nodata"
        ]
    ],
    "nodata": 0.0,
    "res": [
        10.007902079383749,
        10.007902079383749
    ],
    "shape": [
        182437,
        121865
    ],
    "tiled": true,
    "transform": 

Since the mosaic is tiled in blocks of 128 by 128 pixels, we pick a chunk size that is ten times larger along each axis (1280 by 1280 pixels):

In [10]:
r = rioxarray.open_rasterio(mosaic_url,
                            chunks={"x": 1280, "y": 1280}
                           )
r

| | Array | Chunk |
|---|---|---|
| Bytes | 177.86 GB | 13.11 MB |
| Shape | (4, 182437, 121865) | (4, 1280, 1280) |
| Count | 13729 Tasks | 13728 Chunks |
| Type | uint16 | numpy.ndarray |


With this setting, each chunk in `r` spans ten blocks along each axis, that is, 100 blocks of 128 by 128 pixels per band.
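
As a quick sanity check, the figures in the array summary can be reproduced with plain arithmetic. The sketch below uses only numbers reported in the `rio info` output and the array summary; the variable names are illustrative and not part of the original notebook.

```python
import math

# Figures taken from the `rio info` output and the array summary above
n_bands, height, width = 4, 182437, 121865
block = 128          # block size reported by `rio info`
chunk = 1280         # Dask chunk size passed to `open_rasterio`
bytes_per_value = 2  # uint16

# Number of 128x128 blocks covered by one 1280x1280 chunk (per band)
blocks_per_chunk = (chunk // block) ** 2

# Chunks needed to cover the full raster (edge chunks are partial)
n_chunks = math.ceil(height / chunk) * math.ceil(width / chunk)

# Memory footprint of one full chunk and of the whole array
chunk_bytes = n_bands * chunk * chunk * bytes_per_value
total_bytes = n_bands * height * width * bytes_per_value

print(blocks_per_chunk)               # 100
print(n_chunks)                       # 13728
print(round(chunk_bytes / 1e6, 2))    # 13.11 (MB)
print(round(total_bytes / 1e9, 2))    # 177.86 (GB)
```

Larger chunks mean fewer tasks for the scheduler but more memory per task; the 2 GB worker limit in the pod specification is the constraint to keep in mind when changing `chunk`.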

## Enclosed Tessellation (ET) cells

The ET cells are stored as parquet files by geographic chunks. Let's read a random one (`#6`) for this illustration:

In [10]:
tst = geopandas.read_parquet("../../urbangrammar_samba/spatial_signatures/tessellation/tess_6.pq")
tst.head()

| | hindex | tessellation | buildings |
|---|---|---|---|
| 0 | c006e658591t0000 | POLYGON Z ((356038.093 215849.818 0.000, 35603... | POLYGON ((356035.210 215860.810, 356035.630 21... |
| 1 | c006e658591t0105 | MULTIPOLYGON Z (((355777.262 216080.707 0.000,... | POLYGON ((355762.480 216099.480, 355770.340 21... |
| 2 | c006e658591t0106 | POLYGON Z ((355789.951 215942.223 0.000, 35577... | POLYGON ((355774.500 216010.050, 355778.130 21... |
| 3 | c006e658591t0107 | POLYGON Z ((355804.850 215994.579 0.000, 35578... | POLYGON ((355785.950 216029.660, 355778.150 21... |
| 4 | c006e658591t0108 | POLYGON Z ((355791.991 216043.449 0.000, 35578... | POLYGON ((355787.380 216069.090, 355790.600 21... |


In [9]:
tst_sub = tst.cx[315891.95:330000, 213727.69:250000]
tst_sub.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2049 entries, 15006 to 258237
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   hindex        2049 non-null   object  
 1   tessellation  2049 non-null   geometry
 2   buildings     2013 non-null   geometry
dtypes: geometry(2), object(1)
memory usage: 64.0+ KB


## NDVI computation

We can now express the NDVI calculation. No computation takes place at this point, thanks to the lazy evaluation of `xarray`/Dask:

In [10]:
ndvi = (r.sel(band=4) - r.sel(band=1)) / (r.sel(band=4) + r.sel(band=1))
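
One caveat worth checking: the array summary reports the mosaic as `uint16`, and unsigned integer subtraction wraps around instead of going negative. The following is a minimal NumPy sketch with made-up band values (not taken from the mosaic) showing the effect, together with one possible guard: casting to float before the arithmetic. Whether this matters in practice depends on how often band 1 exceeds band 4 in the data.

```python
import numpy as np

# Hypothetical band values, stored as uint16 like the mosaic
band4 = np.array([3000, 500, 0], dtype="uint16")
band1 = np.array([1000, 1500, 0], dtype="uint16")

# Naive version: the subtraction is carried out in uint16,
# so 500 - 1500 wraps around to 64536 instead of -1000
with np.errstate(divide="ignore", invalid="ignore"):
    naive = (band4 - band1) / (band4 + band1)

# Guarded version: cast to float first, then compute the index
b4 = band4.astype("float64")
b1 = band1.astype("float64")
with np.errstate(divide="ignore", invalid="ignore"):
    safe = (b4 - b1) / (b4 + b1)

print(naive[1] > 1)  # True: the wrapped value falls outside [-1, 1]
print(safe[1])       # -0.5
print(safe[2])       # nan: 0/0 where both bands hold the nodata value 0
```

On the `xarray` objects used here, the equivalent guard would be calling `.astype("float64")` on each selected band before combining them. The sum in the denominator can also overflow in `uint16` when the two bands add up to more than 65535.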

## Distribute computation by ET cell

With the NDVI expressed, we write a function that takes a row of our `GeoDataFrame` and returns the average NDVI within its geometry:

In [4]:
def geom2ndvi(row, ndvi, geom="geometry"):
    val = ndvi.rio.clip([row[geom]])\
              .mean()\
              .values\
              .tolist()
    return val
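
Conceptually, clipping and averaging amounts to masking out the pixels outside the geometry and taking the mean of what remains, skipping missing values. The sketch below illustrates that idea with plain NumPy on a made-up 4 by 4 patch; it uses an axis-aligned rectangle and a simple pixel-centre test as stand-ins, which is an assumption for illustration and not a description of how `rio.clip` decides which pixels fall inside a polygon.

```python
import numpy as np

# Hypothetical NDVI patch; NaN marks a pixel without data
ndvi_patch = np.array([
    [0.1, 0.2, 0.3,    0.4],
    [0.5, 0.6, np.nan, 0.8],
    [0.2, 0.4, 0.6,    0.8],
    [0.9, 0.9, 0.9,    0.9],
])

# Hypothetical pixel-centre coordinates on a 10 m grid
xs = np.array([5.0, 15.0, 25.0, 35.0])
ys = np.array([35.0, 25.0, 15.0, 5.0])  # y decreases from the top row down

# Axis-aligned rectangle standing in for a tessellation cell
minx, miny, maxx, maxy = 10.0, 10.0, 30.0, 30.0

# Keep the pixels whose centre lies inside the rectangle
inside_x = (xs >= minx) & (xs <= maxx)
inside_y = (ys >= miny) & (ys <= maxy)
mask = inside_y[:, None] & inside_x[None, :]

# Mask everything else and average, ignoring NaN
clipped = np.where(mask, ndvi_patch, np.nan)
cell_mean = float(np.nanmean(clipped))

print(int(mask.sum()))      # 4 pixels fall inside the rectangle
print(round(cell_mean, 4))  # 0.5333, the mean of 0.6, 0.4 and 0.6
```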

For example, let's read in a section of the mosaic:

In [11]:
%time ndvi_sub = ndvi.rio.clip_box(*tst.total_bounds).load()
ndvi_sub

CPU times: user 3.08 s, sys: 1.3 s, total: 4.38 s
Wall time: 21.3 s


And calculate NDVI for a cell within that section:

In [14]:
%time geom2ndvi(tst_sub.iloc[0, :], ndvi_sub, geom="tessellation")

CPU times: user 511 ms, sys: 83.6 ms, total: 594 ms
Wall time: 571 ms


0.5895589197001319

This function can be applied sequentially to every row of a table, reading the NDVI values directly from the URL of the mosaic:

**NOTE** For this to work effectively, the section of the mosaic within `db.total_bounds` needs to fit comfortably in memory.

In [9]:
def ndvi_from_chunk(db_path,
                    mosaic_path=mosaic_url, 
                    geom="tessellation"
                   ):
    if isinstance(db_path, pandas.Series):
        db_path = db_path.iloc[0]
    with fsspec.open(db_path) as file:
        db = geopandas.read_parquet(file, 
                                    columns=["hindex", "tessellation"]
                                   ).set_index("hindex").head()  # NOTE: `.head()` keeps only the first five rows of each chunk
    r = rioxarray.open_rasterio(mosaic_path,
                                chunks={
                                    "x": 640, 
                                    "y": 640
                                }
                               )
    ndvi = (r.sel(band=4) - r.sel(band=1)) / \
           (r.sel(band=4) + r.sel(band=1))
    ndvi_vals = ndvi.rio.clip_box(*db.total_bounds).load()
    rower = lambda row: geom2ndvi(row, ndvi_vals, geom=geom)
    return db.apply(rower, axis=1)

For example:

In [61]:
%time ndvis = ndvi_from_chunk(tst_url, mosaic_url, geom="tessellation")

distributed.utils_perf - INFO - full garbage collection released 782.16 MB from 96619 reference cycles (threshold: 10.00 MB)


CPU times: user 10.5 s, sys: 2.1 s, total: 12.6 s
Wall time: 23.5 s


To distribute the computation above, we set up a Dask DataFrame with the URLs of all the files:

In [10]:
tess_url = f"http://{SERVER_IP}:8000/spatial_signatures/tessellation/"
tess_path = "../../urbangrammar_samba/spatial_signatures/tessellation/"
chunk_names = pandas.DataFrame({
    # `tess_url` already ends with a slash, so we append the file name directly
    "file": [f"{tess_url}{i}" for i in os.listdir(tess_path) if i.endswith(".pq")]
})
chunk_names = dd.from_pandas(chunk_names, chunksize=1)

Now we can map the computation of each chunk across the cluster:

In [11]:
%%time
ndvis = chunk_names["file"].map_partitions(ndvi_from_chunk, 
                                   meta=(None, 'f8')
                                  ).compute()

distributed.scheduler - INFO - Receive client connection: Client-worker-55b22b09-3296-11eb-8020-a664d571bf36
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-5698289a-3296-11eb-8020-26e77d64672b
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-56a81d51-3296-11eb-8020-da52d610c47f
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-56e002a9-3296-11eb-8021-46bbc0b220bc
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-57aebdd3-3296-11eb-8020-8a07bd078ccd
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-worker-57d9493f-3296-11eb-8020-32d454b4058c
distributed.core - INFO - Starting established connection

KilledWorker: ("('ndvi_from_chunk-d563b338d9a3ff805a9851789d5dd4c6', 89)", <Worker 'tcp://10.1.193.206:43207', name: 3, memory: 0, processing: 4>)

In [17]:
ndvis

hindex
c011e675448t0000    0.550900
c011e675448t0052    0.505567
c011e675448t0053    0.485811
c011e675448t0054    0.526588
c011e675448t0037    0.521815
c046e432985t0000    0.448482
c046e432985t0059    0.198398
c046e432985t0040    0.335527
c046e432985t0041    0.623561
c046e432985t0038    0.613240
c040e541154t0000    0.224681
c040e541154t0001    0.229765
c040e160280t0002    0.463784
c040e160280t0015    0.631372
c040e160280t0012    0.293753
c073e716445t0053    0.351165
c073e716445t0010    0.412447
c073e716445t0011    0.405551
c073e716445t0012    0.521756
c073e716445t0013    0.358216
dtype: float64

---

## [DEPRECATED] Alternative using `geocube`

The alternative involves [`geocube`'s zonal stats](https://corteva.github.io/geocube/stable/examples/zonal_statistics.html) and `make_geocube`. In this approach, we first rasterize our ET cells on a grid aligned with the mosaic, then calculate the NDVI. At present, this approach is discarded because the resolution of the mosaic (10 m) is too coarse to obtain an NDVI value for every cell.

In [57]:
from geocube.api.core import make_geocube

We use the same set of ET cells:

In [None]:
url = f"http://{SERVER_IP}:8000/spatial_signatures/tessellation/tess_6.pq"
tst = geopandas.read_parquet(url)


In [117]:
tst_sub = tst.cx[315891.95:330000, 213727.69:250000]
tst_sub.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 2049 entries, 15006 to 258237
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   uID          2049 non-null   int64   
 1   geometry     2049 non-null   geometry
 2   enclosureID  2049 non-null   int64   
dtypes: geometry(1), int64(2)
memory usage: 64.0 KB


Before rasterization, we need to select the segment of the mosaic that overlaps the cells (note that no data are read into memory at this point; the selection is evaluated lazily):

In [118]:
ndvi_segment = ndvi.rio.clip_box(*tst_sub.total_bounds)

We need to rasterize the features:

In [119]:
%%time
out_grid = make_geocube(
    vector_data = tst_sub,
    measurements=["uID"],
    like=ndvi_segment
)

CPU times: user 318 ms, sys: 5.81 ms, total: 324 ms
Wall time: 323 ms


This creates a `Dataset` object with a rasterized version of the tessellation cells in `tst_sub`. Now we append the NDVI:

In [120]:
out_grid["ndvi"] = ndvi_segment

And with both aligned, we can group by each `uID` and calculate average NDVI:

In [121]:
%%time
g = out_grid.drop("spatial_ref")\
            .groupby(out_grid["uID"])

CPU times: user 598 ms, sys: 2.23 ms, total: 601 ms
Wall time: 598 ms


And we can get the average easily:

In [127]:
%%time
ndvi_mean = g.mean()

CPU times: user 6.21 s, sys: 2.89 ms, total: 6.22 s
Wall time: 6.21 s
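
The group-by step boils down to a zonal mean: for every cell ID in the rasterized grid, average the NDVI pixels that carry that ID. The sketch below reproduces the idea with NumPy on two small made-up grids; the arrays and names are illustrative assumptions and do not use `geocube` or `xarray`.

```python
import numpy as np

# Hypothetical rasterized cell IDs and NDVI values on the same grid
ids = np.array([[1, 1, 2],
                [1, 2, 2],
                [3, 3, 2]])
ndvi_grid = np.array([[0.2, 0.4, 0.6],
                      [0.6, 0.8, np.nan],
                      [0.1, 0.3, 0.4]])

# Ignore pixels without a valid NDVI value
valid = ~np.isnan(ndvi_grid)

# Sum and count the valid pixels per ID, then divide to get the mean
sums = np.bincount(ids[valid], weights=ndvi_grid[valid])
counts = np.bincount(ids[valid])
zonal_mean = {int(i): float(sums[i] / counts[i]) for i in np.unique(ids[valid])}

for cell_id, value in zonal_mean.items():
    print(cell_id, round(value, 3))
```

A cell only receives a value if at least one pixel is assigned to it, which is where the pixel size of the grid relative to the size of the cells becomes the limiting factor.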


In [132]:
mn = ndvi_mean.to_dataframe()[["ndvi"]]


In [149]:
geom2ndvi(tst_sub.query("uID == 6712717").iloc[0], ndvi)


0.6968843539763994

In [153]:
%time out = tst_sub.head().apply(lambda r: geom2ndvi(r, ndvi), axis=1)


CPU times: user 1.6 s, sys: 14.6 ms, total: 1.62 s
Wall time: 1.6 s



In [144]:
tst_sub.query("uID == 6712717")

| | uID | geometry | enclosureID |
|---|---|---|---|
| 22988 | 6712717 | POLYGON Z ((329700.369 234529.467 0.000, 32945... | 658102 |


In [141]:
mn.head()

| uID | ndvi |
|---|---|
| 6712717.0 | NaN |
| 6712718.0 | 0.682241 |
| 6712719.0 | 0.675091 |
| 6712720.0 | NaN |
| 6712721.0 | 0.537544 |
