# (Distributed) areal interpolation

In this notebook, we compare the single-core version in `tobler.area_weighted.area_interpolate` with the distributed version in `tobler.area_weighted.area_interpolate_dask`. 

In [1]:
import os
os.environ['USE_PYGEOS'] = '1'

import geopandas
import dask_geopandas
import tobler

from dask.distributed import Client, LocalCluster

## Setup

We use the San Diego H3 dataset from the [GDS Book](https://geographicdata.science/book/data/h3_grid/build_sd_h3_grid.html):

In [2]:
h3 = geopandas.read_file((
    'https://geographicdata.science/book/'
    '_downloads/d740a1069144baa1302b9561c3d31afe/sd_h3_grid.gpkg'
)).to_crs(epsg=3310)

And the Census tracts dataset, from the same [source](https://geographicdata.science/book/data/sandiego/sandiego_tracts_cleaning.html):

In [3]:
tracts = (
    geopandas.read_file((
        'https://geographicdata.science/book/'
        '_downloads/f2341ee89163afe06b42fc5d5ed38060/sandiego_tracts.gpkg'
    ))
    .to_crs(epsg=3310)
    .clip(h3)
)

Note that in both cases we require a projected CRS, so we reproject both datasets to [NAD83 / California Albers](https://epsg.io/3310) (EPSG:3310).
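To see why a projected CRS matters for area weighting, the sketch below computes the true surface area of one-degree cells at different latitudes. It assumes a spherical Earth with a mean radius of 6371 km; the function name `cell_area_km2` is made up for this illustration and is not part of `tobler` or `geopandas`. Cells of identical size in degrees cover quite different areas on the ground, so areas measured directly in degrees would distort the weights.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def cell_area_km2(lat_south, lat_north, lon_width_deg=1.0):
    """Surface area (km^2) of a latitude/longitude cell on a sphere."""
    lon_width = math.radians(lon_width_deg)
    return EARTH_RADIUS_KM ** 2 * lon_width * (
        math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    )


# Three cells that are all 1 degree x 1 degree in geographic coordinates
area_equator = cell_area_km2(0, 1)
area_san_diego = cell_area_km2(32, 33)  # roughly San Diego's latitude
area_north = cell_area_km2(60, 61)

print(round(area_equator), round(area_san_diego), round(area_north))
```

The three values shrink as latitude increases, which is the distortion an equal-area projection such as California Albers is meant to avoid.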

We will set up a local Dask cluster:

In [4]:
client = Client(LocalCluster(n_workers=10))

Finally, for Dask, we need to provide `dask_geopandas.GeoDataFrame` objects with spatial partitions and categorical variables properly set up:

In [5]:
tracts['sub_30'] = tracts['sub_30'].astype('category')
tracts['tract'] = tracts['tract'].astype('category')

dtracts = (
    dask_geopandas.from_geopandas(tracts[
        ['geometry', 'sub_30', 'tract', 'total_pop', 'total_pop_white']
    ], npartitions=10)
    .spatial_shuffle(by='hilbert', shuffle="tasks")
)

dh3 = (
    dask_geopandas.from_geopandas(h3, npartitions=10)
    .spatial_shuffle(by='hilbert', shuffle="tasks")
)
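As a rough illustration of what shuffling `by='hilbert'` aims for, the sketch below orders the cells of a toy grid by their position along a Hilbert curve and then cuts the ordered list into equal chunks. This is the textbook grid-to-distance conversion, not the `dask_geopandas` implementation, and `hilbert_distance` is a name chosen for this example. The point is that cells close together along the curve are also close together in space, so each chunk ends up spatially compact.

```python
def hilbert_distance(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve
    covering a (2**order) x (2**order) grid."""
    n = 2 ** order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Toy "centroids" on a 4 x 4 grid
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda c: hilbert_distance(2, *c))

# Cut the Hilbert-ordered cells into 4 partitions of 4 cells each
partitions = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]
for part in partitions:
    print(part)
```

On this toy grid, each partition is one 2 x 2 quadrant, which keeps most overlay work local to a pair of partitions.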

---

**IMPORTANT** - At this point, the Dask version implements only *extensive* and *categorical* variables, so those are what we will test.

---

## Correctness

### Extensive

Here we transfer two population counts, `total_pop` and `total_pop_white`, from `tracts` to `h3`.
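For intuition, here is a minimal pure-Python sketch of area-weighted interpolation for an extensive variable. It uses axis-aligned rectangles instead of real polygons and assumes each source value is split in proportion to the share of the source's area that falls in each target. The helper names are hypothetical and this is not the `tobler` code path.

```python
def overlap_area(a, b):
    """Intersection area of two rectangles given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def interpolate_extensive(sources, targets):
    """Split each source value across targets by share of source area."""
    estimates = []
    for target in targets:
        total = 0.0
        for geom, value in sources:
            total += value * overlap_area(geom, target) / rect_area(geom)
        estimates.append(total)
    return estimates


# Two source zones with a count each (think: population per tract)
sources = [((0, 0, 2, 2), 100.0), ((2, 0, 4, 2), 300.0)]
# Three target zones that tile the same extent differently
targets = [(0, 0, 1, 2), (1, 0, 3, 2), (3, 0, 4, 2)]

estimates = interpolate_extensive(sources, targets)
print(estimates, sum(estimates))
```

Because the targets fully cover the sources, the estimates add up to the original total, which is the defining property of an extensive transfer.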

First, we transfer with the single-core approach:

In [6]:
ext_sc = tobler.area_weighted.area_interpolate(
    tracts, h3, extensive_variables=['total_pop', 'total_pop_white']
)


Then we perform the same operation using Dask:

In [7]:
ext_dk = tobler.area_weighted.area_interpolate_dask(
    dtracts, dh3, 'hex_id', extensive_variables=['total_pop', 'total_pop_white']
)

TypeError: _area_interpolate_binning() got an unexpected keyword argument 'spatial_index'
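When both approaches return a result, a correctness check amounts to matching estimates by target ID and comparing them within a numerical tolerance, since a spatially shuffled result need not come back in the original row order. The sketch below shows that comparison on plain dictionaries. The values are hypothetical (not output from this notebook) and `max_abs_difference` is a helper invented for the example.

```python
import math


def max_abs_difference(single_core, distributed):
    """Largest absolute gap between two {id: value} results, matched by ID."""
    if set(single_core) != set(distributed):
        raise ValueError("results cover different target IDs")
    return max(abs(single_core[key] - distributed[key]) for key in single_core)


# Hypothetical estimates keyed by a target ID such as `hex_id`
single_core = {"a": 50.0, "b": 200.0, "c": 150.0}
# Same estimates in a different order, with tiny floating-point noise
distributed = {"c": 150.0, "a": 50.0 + 1e-12, "b": 200.0}

gap = max_abs_difference(single_core, distributed)
matches = math.isclose(gap, 0.0, abs_tol=1e-9)
print(gap, matches)
```

Comparing with a tolerance rather than exact equality allows for summation-order differences between the two code paths.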

### Categorical
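For a categorical variable, the quantity being estimated is different: roughly, the share of each target's area covered by each category. The sketch below illustrates that idea with axis-aligned rectangles. The helper names and the category labels are hypothetical, and this simplified version is not the `tobler` implementation.

```python
def overlap_area(a, b):
    """Intersection area of two rectangles given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def rect_area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def interpolate_categorical(sources, targets):
    """For each target, the fraction of its area covered by each category."""
    results = []
    for target in targets:
        shares = {}
        for geom, category in sources:
            covered = overlap_area(geom, target) / rect_area(target)
            shares[category] = shares.get(category, 0.0) + covered
        results.append(shares)
    return results


# Two source zones, each labelled with a (made-up) category
sources = [((0, 0, 2, 2), "urban"), ((2, 0, 4, 2), "rural")]
# One target inside the first source, one straddling both
targets = [(0, 0, 1, 2), (1, 0, 3, 2)]

shares = interpolate_categorical(sources, targets)
print(shares)
```

A target fully inside one source gets a share of 1 for that category, while a target straddling two sources gets a split proportional to the overlap.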

Single-core:

In [8]:
# Pass the categorical columns through `categorical_variables`
cat_sc = tobler.area_weighted.area_interpolate(
    tracts, h3, categorical_variables=['sub_30', 'tract']
)

And through Dask:

In [9]:
# Store the categorical result under its own name instead of reusing `ext_dk`
cat_dk = tobler.area_weighted.area_interpolate_dask(
    dtracts, dh3, 'hex_id', categorical_variables=['sub_30', 'tract']
)

TypeError: _area_interpolate_binning() got an unexpected keyword argument 'spatial_index'

## Performance