# **GeoEnricher**

A geospatial processing pipeline for large-scale datasets, built on PySpark and Apache Sedona, with Kepler.gl for visualization.

*Suitable for spatial big data analyses, service accessibility modeling, and grid-based enrichment.*

## **Prerequisites**

### Select the Python kernel where `geoenricher` is installed.

#### Environment Variables Setup: **JAVA_HOME** (and **HADOOP_HOME** on Windows)

1. Download and install **Java** if it is not already installed.
  - Set `JAVA_HOME` to the installation directory, the one that contains directories such as *bin*, *lib*, and *legal*.
  - On Windows this is usually something like `C:\Program Files\Java\jre-1.8`.
  - Add `%JAVA_HOME%\bin` to the system `PATH`.

2. (Windows only) Download `winutils.exe` and `hadoop.dll` from [this repo](https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin).
   - Place `winutils.exe` in a directory such as `C:/Hadoop/bin`.
   - Place `hadoop.dll` in `C:/Windows/System32`.
   - Set `HADOOP_HOME` to the parent directory (`C:/Hadoop`, not `C:/Hadoop/bin`) and add `%HADOOP_HOME%\bin` to the system `PATH`.

3. Reload the notebook.
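
Before starting the cluster, it can help to confirm that the variables above are visible to the Python kernel. The following is a minimal sketch using only the standard library; `check_spark_env` is a hypothetical helper, not part of `geoenricher`, and it only checks for the directories and files named in the steps above.

```python
import os
import sys
from pathlib import Path


def check_spark_env(env=None, platform=None):
    """Return a list of human-readable problems with the Java/Hadoop setup.

    Hypothetical helper, not part of geoenricher.
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    problems = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not (Path(java_home) / "bin").is_dir():
        problems.append(f"JAVA_HOME has no 'bin' directory: {java_home}")

    # HADOOP_HOME and winutils.exe are only needed on Windows.
    if platform.startswith("win"):
        hadoop_home = env.get("HADOOP_HOME")
        if not hadoop_home:
            problems.append("HADOOP_HOME is not set")
        elif not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
            problems.append(f"winutils.exe not found under {hadoop_home}")

    return problems
```

Calling `check_spark_env()` with no arguments inspects the current process; an empty list means none of the checked problems were found.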

## **Set up the Spark cluster**

Pass the target CRS to the **Enricher** constructor.

Then call `setup_cluster()` with `which="sedona"` or `which="wherobots"` (beta).

The default directory tree is created automatically when you run `geoenricher` in the terminal.

If your data lives somewhere else, override the default path with `data_dir`.

`ex_mem` and `dr_mem` are the executor and driver memory, in GB.
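
To make the two memory parameters concrete, the sketch below maps GB values onto the standard Spark settings `spark.executor.memory` and `spark.driver.memory`. That `geoenricher` forwards `ex_mem` and `dr_mem` in exactly this way is an assumption; `spark_memory_conf` is a hypothetical helper for illustration only.

```python
def spark_memory_conf(ex_mem: int, dr_mem: int) -> dict[str, str]:
    """Map memory sizes in GB to the standard Spark memory settings.

    Hypothetical helper; how geoenricher applies these values is an assumption.
    """
    if ex_mem <= 0 or dr_mem <= 0:
        raise ValueError("memory sizes must be positive (GB)")
    return {
        "spark.executor.memory": f"{ex_mem}g",
        "spark.driver.memory": f"{dr_mem}g",
    }
```

Pick values that leave some RAM for the operating system and the notebook kernel itself.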



In [1]:
'''
Setup cluster

'''

from geoenricher import Enricher

obj = Enricher(crs="EPSG:3035")

data_dir = "./data"

obj.setup_cluster(
    data_dir=data_dir, 
    which="sedona", 
    ex_mem=26,  # change this
    dr_mem=24,  # change this
    log_level="ERROR"
)

25/03/15 00:27:21 WARN Utils: Your hostname, marvin resolves to a loopback address: 127.0.1.1; using 172.20.27.4 instead (on interface eth0)
25/03/15 00:27:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/data/homes_data/sudheer/benchmark_data/.venv/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /data/homes_data/sudheer/.ivy2/cache
The jars for the packages stored in: /data/homes_data/sudheer/.ivy2/jars
org.apache.sedona#sedona-spark-shaded-3.5_2.12 added as a dependency
org.datasyslab#geotools-wrapper added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-53acb0c2-7f47-4425-b4b4-abcaa3e15f25;1.0
	confs: [default]
	found org.apache.sedona#sedona-spark-shaded-3.5_2.12;1.7.0 in central
	found org.datasyslab#geotools-wrapper;1.7.0-28.5 in central
:: resolution report :: resolve 232ms :: artifacts dl 9ms
	:: modules in use:
	org.apache.sedona#sedona-spark-shaded-3.5_2.12;1.7.0 from central in [default]
	org.datasyslab#geotools-wrapper;1.7.0-28.5 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	------------------------------------------


## **First run (Load data from files)**

This loads data from the files in `data_dir`. It takes a while, since some essential transformations are applied to the datasets.

`parquet_all()` saves all datasets to disk, preserving any transformations applied.

On subsequent runs, you can load them directly with `load_from_parquets()` to save time.


In [None]:
'''
First Run

'''

from geoenricher import Enricher



# provide the data directory
data_dir = "./data"

# individual file paths:
path_com_EU = f"{data_dir}/data_EU/comuni_shp/"
path_contr = f"{data_dir}/data_EU/countries_shp/"
path_grids = f"{data_dir}/data_EU/census_grid_EU/grids_OG_corrected.parquet"
path_grids_new = f"{data_dir}/data_EU/census_grid_EU/grids_new.gpkg"
path_reg = f"{data_dir}/data_Italy/regioni/"
path_prov = f"{data_dir}/data_Italy/provinci"
path_com = f"{data_dir}/data_Italy/comuni/"
path_hlth = f"{data_dir}/data_EU/services/healthcare_dropna.gpkg"
path_edu = f"{data_dir}/data_EU/services/education_dropna.gpkg"
path_acc_health = f"{data_dir}/data_EU/accessibility/healthcare/grid_accessibility_health.geoparquet"
path_acc_edu = f"{data_dir}/data_EU/accessibility/education/grid_accessibility_educ.geoparquet"
path_NUTS = f"{data_dir}/NUTS.shp"
path_LAU = f"{data_dir}/LAU.shp"
path_DGURBA = f"{data_dir}/DGURBA"

# dataset names and their file formats:
# format: {dataset_name: (path, file_format), ...}

datasets: dict[str, tuple[str, str]] = {
    "comuni_EU": (path_com_EU, "shapefile"),
    "countries": (path_contr, "shapefile"),
    "pop_grids": (path_grids, "geoparquet"),
    # "pop_grids_new": (path_grids_new, "geopackage"),
    "regions_IT": (path_reg, "shapefile"),
    "provinces_IT": (path_prov, "shapefile"),
    "comuni_IT": (path_com, "shapefile"),
    "hospitals": (path_hlth, "geopackage"),
    # "education": (path_edu, "geopackage"),
    "accessibility_hosp": (path_acc_health, "geoparquet"),
    "accessibility_educ": (path_acc_edu, "geoparquet"),
    "NUTS": (path_NUTS, "shapefile"),
    "LAU": (path_LAU, "shapefile"),
    "DGURBAN": (path_DGURBA, "shapefile"),
}

obj = Enricher(crs="EPSG:3035")

obj.setup_cluster(
    data_dir=data_dir, 
    which="sedona", 
    ex_mem=26,  # change this
    dr_mem=24,  # change this
    log_level="ERROR"
)

# use "load()" to load all the datasets in {data_dir}, 
# according to the paths and file formats provided in "datasets{}"
obj.load(datasets, silent=True)
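
Because the first run is slow, a quick pre-flight check of the `datasets` dictionary can catch a mistyped path before Spark starts. This is a minimal sketch, not part of `geoenricher`: `find_dataset_problems` and `KNOWN_FORMATS` are hypothetical names, and the format list is simply the three formats used in the cell above, so the library may accept others.

```python
from pathlib import Path

# Formats that appear in this README; the library may support more (assumption).
KNOWN_FORMATS = {"shapefile", "geoparquet", "geopackage"}


def find_dataset_problems(datasets: dict[str, tuple[str, str]]) -> list[str]:
    """Return one message per unknown format or missing path."""
    problems = []
    for name, (path, file_format) in datasets.items():
        if file_format not in KNOWN_FORMATS:
            problems.append(f"{name}: unknown format '{file_format}'")
        if not Path(path).exists():
            problems.append(f"{name}: path not found: {path}")
    return problems
```

Run `find_dataset_problems(datasets)` before `obj.load(...)`; an empty list means every path exists and every format is one of the three listed.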


## **Data Prep and Fix**

1. Optionally, run `fix_geometries()` to repair invalid geometries, if any.
   To skip the check for some dataframes, pass their names in `skip`.

2. Inspect the partitions and data skew by running `inspect_partitions()`.
    > **Note:** This may raise a memory error and crash the kernel if the driver does not have enough memory.

3. Run `force_repartition()` to force the dataframes to be *repartitioned* to the number of available cores.
   Pass the names of the dataframes to skip in `skip`.

4. Run `transform_CRS()` to transform the loaded datasets to the CRS passed to the Enricher's constructor. `lazy=True` will not cache the dataframes.
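
To illustrate what "data skew" means in step 2, the sketch below computes a simple skew ratio from a list of per-partition row counts. It is a generic illustration, not a description of how `inspect_partitions()` works internally; `partition_skew` is a hypothetical helper.

```python
from statistics import mean


def partition_skew(sizes: list[int]) -> float:
    """Return max/mean partition size; 1.0 means perfectly balanced.

    Hypothetical helper. In PySpark, per-partition row counts could be
    gathered with, for example, df.groupBy(F.spark_partition_id()).count().
    """
    if not sizes or sum(sizes) == 0:
        return 1.0
    return max(sizes) / mean(sizes)
```

A ratio well above 1 means one partition holds far more rows than the average, which is the situation repartitioning in step 3 is meant to relieve.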

In [None]:
obj.fix_geometries(
    skip=['pop_grids', 'pop_grids_new']
)

obj.force_repartition(skip=['pop_grids'])

obj.transform_CRS(lazy=False)


obj.parquet_all(preserve_partitions=True)

## ***Pickle* the loaded dataframes for Quick Access in the subsequent runs**

Default directory: `./{data_dir}/pickle_parquets/dfs_list`.
You can change the directory where they are saved by passing it in `parquet_dir`.

For example: `parquet_dir = f"./{data_dir}/pickle_parquets/archive"`
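
A notebook can decide between the slow first run and the quick reload by checking whether the default parquet directory already has content. This is a minimal sketch under the assumption that the directory layout matches the default path stated above; `has_parquet_cache` is a hypothetical helper, not part of `geoenricher`.

```python
from pathlib import Path


def has_parquet_cache(data_dir: str, subdir: str = "pickle_parquets/dfs_list") -> bool:
    """Return True if the parquet directory exists and is not empty.

    Hypothetical helper; the default subdir mirrors the path in this README.
    """
    cache = Path(data_dir) / subdir
    return cache.is_dir() and any(cache.iterdir())
```

If `has_parquet_cache(data_dir)` is true, calling `obj.load_from_parquets()` should be enough; otherwise run the first-run cell and `parquet_all()`.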

In [2]:

from geoenricher import Enricher

'''
Load data from pickled parquets 

'''

data_dir = "./data"

obj = Enricher(crs="EPSG:3035")

obj.setup_cluster(
    data_dir=data_dir, 
    which="sedona", 
    ex_mem=26,  # change this
    dr_mem=24,  # change this
    log_level="ERROR"
)

# parquet_dir = f"./{data_dir}/pickle_parquets/archive"
obj.load_from_parquets()
# obj.inspect_partitions()



sedona initialized with 10 cores for parellelism.



                                                                                

Loaded dataframe 'hospitals'
Loaded dataframe 'com_X_pop_accssblty_hosps'
Loaded dataframe 'NUTS'
Loaded dataframe 'pop_grids_full'
Loaded dataframe 'LAU'
Loaded dataframe 'comuni_EU'
Loaded dataframe 'DGURBAN'


## **Interactive 3D Maps for Visualization**
##### Powered by kepler.gl

Pass `plot_this()` any of the following:
- Names of the loaded datasets
- Spark dataframes already in memory
- A list mixing both

Accepted type: `str | SparkDataFrame | list[str | SparkDataFrame]`

In [4]:

from pyspark.sql import functions as F
from keplergl.keplergl import KeplerGl
'''
Visualize the datasets

'''

map_1: KeplerGl = obj.plot_this(
    df=[
        obj.dfs_list['com_X_pop_accssblty_hosps'].filter(F.col('CNTR_ID') == 'IT'),
        # obj.dfs_list["dg_urban"].filter(F.col('CNTR_CODE') == 'IT'),
    ],
)

map_1


User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


                                                                                

Auto-detected geometry columns: ['geometry', 'centroid']


KeplerGl(data={'unnamed_0': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…

In [None]:

'''
Enrich by Spatial Join

'''

from pyspark.sql import functions as F

grids_IT_df = obj.enrich_sjoin(
    df1="pop_grids", 
    df2=obj.dfs_list['countries'], 
    enr_cols=["CNTR_ID", "CNTR_NAME"]
    ).filter(F.col('CNTR_ID').isin("IT"))

with obj.get_time("exporting"):
    obj.parquet_this("grids_IT", grids_IT_df, preserve_partitions=True)


comuni_IT_df = obj.dfs_list['comuni_EU'].filter(F.col('CNTR_ID') == 'IT')

with obj.get_time("exporting"):
    obj.parquet_this("comuni_IT", comuni_IT_df, preserve_partitions=True)


In [3]:

from geoenricher import EnricherGUIOverlay

'''
# GUI for Enrich by Overlay

'''
# pass the `Enricher` object (loaded with the datasets) to the EnricherGUIOverlay constructor
obj_ui = EnricherGUIOverlay(obj)


VBox(children=(HTML(value='<h1>Enrich with Overlay & Aggregation</h1>'), HTML(value="<div style='height: 5px;'…

In [None]:

'''
Save the map with the applied symbology as a .html file
'''

map_1.save_to_html(file_name="./map_1.html")
