![](https://wherobots.com/wp-content/uploads/2023/12/Inline-Blue_Black_onWhite@3x.png)

# Havasu out-db Raster Example

In this notebook, we'll demonstrate how to load a large GeoTIFF file stored on S3 as an out-db raster and split it into smaller tiles.

We'll also show how to run `RS_Value` on a large out-db raster using a DataFrame of points. Read more about [Havasu](https://docs.wherobots.com/latest/references/havasu/introduction/) and [WherobotsDB Raster support](https://docs.wherobots.com/latest/references/havasu/raster/raster-overview/) in the documentation.

# Define Sedona context

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col, lit
from sedona.spark import *

In [None]:
config = SedonaContext.builder().appName('havasu-iceberg-outdb-raster-etl')\
    .getOrCreate()
sedona = SedonaContext.create(config)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                

# Load Raster

We'll load the world population data, which contains the estimated total number of people per grid cell. The dataset is available for download in GeoTIFF format at a resolution of 30 arc-seconds (approximately 1 km at the equator). The projection is the Geographic Coordinate System, WGS84.

The original data can be retrieved from [here](https://hub.worldpop.org/geodata/summary?id=24777).
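
As a quick sanity check on the resolution quoted above, the arithmetic can be sketched in plain Python. The equatorial circumference used here is an assumed constant (roughly the WGS84 value), not something read from the dataset.

```python
# Sanity-check the "30 arc-seconds is roughly 1 km at the equator" figure.
# EQUATOR_KM is an assumed constant (approximate WGS84 equatorial
# circumference); it does not come from the dataset itself.
EQUATOR_KM = 40075.017
ARCSEC_PER_DEGREE = 3600

pixel_size_deg = 30 / ARCSEC_PER_DEGREE   # 30 arc-seconds expressed in degrees
km_per_degree = EQUATOR_KM / 360          # one degree of longitude at the equator
pixel_size_km = pixel_size_deg * km_per_degree

print(f"pixel size: {pixel_size_deg:.10f} degrees")
print(f"pixel size at the equator: {pixel_size_km:.3f} km")
```

The pixel size in degrees is the same number that shows up as the pixel scale when the raster metadata is inspected later in this notebook.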

In [62]:
raster_df = sedona.sql("SELECT RS_FromPath('s3://wherobots-examples/data/ppp_2020_1km_Aggregated.tif') as rast")
raster_df.show(5)

+--------------------+
|                rast|
+--------------------+
|LazyLoadOutDbGrid...|
+--------------------+



We can save this one large out-db raster as a Havasu table. The table will contain one row representing that large out-db raster.

In [63]:
sedona.sql("CREATE NAMESPACE IF NOT EXISTS wherobots.test_db")
sedona.sql("DROP TABLE IF EXISTS wherobots.test_db.world_pop")
raster_df.writeTo("wherobots.test_db.world_pop").create()

In [64]:
sedona.sql("SELECT RS_Metadata(rast) meta FROM wherobots.test_db.world_pop").show(5, False)

+-----------------------------------------------------------------------------------------------------------+
|meta                                                                                                       |
+-----------------------------------------------------------------------------------------------------------+
|{-180.001249265, 83.99958319871001, 43200, 18720, 0.0083333333, -0.0083333333, 0.0, 0.0, 4326, 1, 256, 256}|
+-----------------------------------------------------------------------------------------------------------+
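
The metadata row is easier to read with names attached. The sketch below copies the values from the output above into a named tuple and derives the geographic extent. The field names and their order are assumptions based on how the values line up (origin, size, pixel scale, skew, SRID, band count, tile size), so check the `RS_Metadata` reference before relying on them.

```python
from typing import NamedTuple


class RasterMeta(NamedTuple):
    # Hypothetical field names; the order is inferred from the printed values.
    upper_left_x: float
    upper_left_y: float
    width: int
    height: int
    scale_x: float
    scale_y: float
    skew_x: float
    skew_y: float
    srid: int
    num_bands: int
    tile_width: int
    tile_height: int


# Values copied from the RS_Metadata output above.
meta = RasterMeta(-180.001249265, 83.99958319871001, 43200, 18720,
                  0.0083333333, -0.0083333333, 0.0, 0.0, 4326, 1, 256, 256)

# Extent implied by the origin, the pixel scale and the raster size.
lon_max = meta.upper_left_x + meta.width * meta.scale_x
lat_min = meta.upper_left_y + meta.height * meta.scale_y

print(f"longitude: {meta.upper_left_x:.3f} to {lon_max:.3f}")
print(f"latitude:  {lat_min:.3f} to {meta.upper_left_y:.3f}")
```

Under these assumptions the raster spans the full 360 degrees of longitude and runs from roughly 84 degrees north to 72 degrees south.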



# Split raster into tiles

Large rasters may not be suitable for raster processing tasks that read all the pixel data. WherobotsDB provides the `RS_TileExplode` function for splitting a large raster into smaller tiles. When the input raster is an out-db raster, the generated tiles are out-db rasters referencing different parts of the same raster file. This is a pure geo-referencing metadata operation, so it is very fast.

The tiles generated by `RS_TileExplode` stay in their original partition, so all the tiles end up in one partition because the original DataFrame has only one row. This DataFrame needs to be repartitioned to distribute the tiles across multiple executors; otherwise, future processing on these tiles won't be parallelised.

In [70]:
tile_df = sedona.sql("SELECT RS_TileExplode(rast, 256, 256) AS (x, y, tile) FROM wherobots.test_db.world_pop").repartition(16)
tile_df.show(5)

+---+---+--------------------+
|  x|  y|                tile|
+---+---+--------------------+
|150| 55|OutDbGridCoverage...|
|139| 37|OutDbGridCoverage...|
|146| 31|OutDbGridCoverage...|
| 31| 14|OutDbGridCoverage...|
|131|  2|OutDbGridCoverage...|
+---+---+--------------------+
only showing top 5 rows
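
The number of tiles to expect can be worked out from the raster size and the tile size alone. The following is a minimal sketch, assuming partial tiles at the right and bottom edges are kept (hence the ceiling division); the raster dimensions come from the `RS_Metadata` output and the partition count from the `repartition(16)` call.

```python
import math

raster_width, raster_height = 43200, 18720   # pixels, from RS_Metadata
tile_width, tile_height = 256, 256           # arguments passed to RS_TileExplode
num_partitions = 16                          # argument passed to repartition()

# Ceiling division: a partially filled tile at an edge still counts as a tile.
tiles_x = math.ceil(raster_width / tile_width)
tiles_y = math.ceil(raster_height / tile_height)
total_tiles = tiles_x * tiles_y

print(f"{tiles_x} x {tiles_y} = {total_tiles} tiles")
print(f"about {total_tiles / num_partitions:.0f} tiles per partition")
```

The total agrees with the row count reported for the tile tables in this notebook, and the tile indices in the `x` and `y` columns stay below `tiles_x` and `tiles_y`.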



# Load raster as tiles (recommended)

WherobotsDB provides a `raster` data source for loading raster files and splitting them into tiles in one line of code. The loaded tiles are also repartitioned across all executors to distribute future raster processing workloads. Read more about the [Raster loader](https://docs.wherobots.com/latest/references/wherobotsdb/raster-data/Raster-loader/#loading-raster-using-the-raster-loader) in the documentation.

In [66]:
raster_df_tiled = sedona.read.format("raster").option("tileWidth", "256").option("tileHeight", "256").load("s3://wherobots-examples/data/ppp_2020_1km_Aggregated.tif")
raster_df_tiled.show(5)

+--------------------+---+---+
|                rast|  x|  y|
+--------------------+---+---+
|OutDbGridCoverage...| 22| 12|
|OutDbGridCoverage...|140|  0|
|OutDbGridCoverage...| 52| 62|
|OutDbGridCoverage...|129| 37|
|OutDbGridCoverage...| 83| 67|
+--------------------+---+---+
only showing top 5 rows



We'll rename the raster column from `rast` to `tile` before saving the DataFrame as a Havasu table.

In [67]:
tile_df = raster_df_tiled.select(col("rast").alias("tile"), "x", "y")

## Saving as out-db rasters

In [71]:
sedona.sql("DROP TABLE IF EXISTS wherobots.test_db.world_pop_tiles")
tile_df.writeTo("wherobots.test_db.world_pop_tiles").create()

                                                                                

In [72]:
sedona.table("wherobots.test_db.world_pop_tiles").count()

12506

## Saving tiles as in-db rasters

WherobotsDB provides an `RS_AsInDb` function for converting an out-db raster to an in-db raster. The conversion has to fetch all the band data from the raster file. The out-db tile dataset is already repartitioned (see the tiling steps above), so this conversion runs with high parallelism.

In [54]:
indb_tile_df = tile_df.withColumn("tile", expr("RS_AsInDb(tile)"))
indb_tile_df.show(5)

+--------------------+---+---+
|                tile|  x|  y|
+--------------------+---+---+
|GridCoverage2D["g...| 22| 12|
|GridCoverage2D["g...|140|  0|
|GridCoverage2D["g...| 52| 62|
|GridCoverage2D["g...|129| 37|
|GridCoverage2D["g...| 83| 67|
+--------------------+---+---+
only showing top 5 rows
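
To get a feel for how much data this conversion has to pull from the raster file, here is a rough back-of-the-envelope estimate. The 4 bytes per pixel is an assumption (a 32-bit float band); the notebook does not state the pixel type, and compression in the source file or in the table is ignored.

```python
# Raster dimensions and band count as reported by RS_Metadata.
width, height, bands = 43200, 18720, 1

# Assumption: each pixel is a 32-bit float. Adjust if the band type differs.
BYTES_PER_PIXEL = 4

total_pixels = width * height * bands
raw_bytes = total_pixels * BYTES_PER_PIXEL
full_tile_bytes = 256 * 256 * bands * BYTES_PER_PIXEL

print(f"total pixels: {total_pixels:,}")
print(f"uncompressed band data: {raw_bytes / 1024**3:.2f} GiB")
print(f"one full 256x256 tile: {full_tile_bytes / 1024:.0f} KiB")
```

Under that assumption the in-db table carries a few gigabytes of pixel data, which is why spreading the tiles over many partitions matters here while the out-db tables stay small.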



In [55]:
sedona.sql("DROP TABLE IF EXISTS wherobots.test_db.world_pop_indb_tiles")
indb_tile_df.writeTo("wherobots.test_db.world_pop_indb_tiles").create()

                                                                                

In [56]:
sedona.table("wherobots.test_db.world_pop_indb_tiles").count()

12506

## Visualize the tile boundaries on a map

In [57]:
sedona.table("wherobots.test_db.world_pop_indb_tiles").show()
tiledMap = SedonaKepler.create_map()
SedonaKepler.add_df(tiledMap, sedona.table("wherobots.test_db.world_pop_indb_tiles").withColumn("tile", expr("RS_Envelope(tile)")), name="tiles")
tiledMap

                                                                                

+--------------------+---+---+
|                tile|  x|  y|
+--------------------+---+---+
|GridCoverage2D["h...| 11|  6|
|GridCoverage2D["h...| 18| 37|
|GridCoverage2D["h...| 98| 58|
|GridCoverage2D["h...|124|  4|
|GridCoverage2D["h...|123| 61|
|GridCoverage2D["h...|139| 29|
|GridCoverage2D["h...| 34| 64|
|GridCoverage2D["h...|126| 47|
|GridCoverage2D["h...| 62| 39|
|GridCoverage2D["h...|116| 33|
|GridCoverage2D["h...|127| 32|
|GridCoverage2D["h...|102| 21|
|GridCoverage2D["h...|124| 31|
|GridCoverage2D["h...|122| 13|
|GridCoverage2D["h...|109| 73|
|GridCoverage2D["h...| 82| 66|
|GridCoverage2D["h...| 38| 41|
|GridCoverage2D["h...| 35| 31|
|GridCoverage2D["h...|108| 44|
|GridCoverage2D["h...|114| 53|
+--------------------+---+---+
only showing top 20 rows

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


                                                                                

KeplerGl(data={'tiles': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2…

# Population of POIs

We'll join the POI dataset with the population dataset to look up the population value at each POI.

## Load POI Dataset

In [58]:
spatialRdd = ShapefileReader.readToGeometryRDD(sedona.sparkContext, "s3://wherobots-examples/data/ne_50m_airports")
poi_df = Adapter.toDf(spatialRdd, sedona)
poi_df.show(5)

+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|            geometry|scalerank|featurecla| type|            name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|POINT (113.935016...|        2|   Airport|major| Hong Kong Int'l|   HKG|terminal|    VHHH|      HKG|http://en.wikiped...|  150.000|
|POINT (121.231370...|        2|   Airport|major|         Taoyuan|   TPE|terminal|    RCTP|      TPE|http://en.wikiped...|  150.000|
|POINT (4.76437693...|        2|   Airport|major|        Schiphol|   AMS|terminal|    EHAM|      AMS|http://en.wikiped...|  150.000|
|POINT (103.986413...|        2|   Airport|major|Singapore Changi|   SIN|terminal|    WSSS|      SIN|http://en.wikiped...|  150.000|
|POINT (-0.4531566...|        2|   Airport|major| London Heathrow|   

## Joining POIs with out-db raster

We can perform a Cartesian join with the single-row table holding the large out-db raster and evaluate the population value at each point.

In [59]:
res_df = poi_df.join(sedona.table("wherobots.test_db.world_pop")).withColumn("pop", expr("RS_Value(rast, geometry)")).drop("rast")
res_df.show(5)
res_df.where("pop > 100").count()

+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+------------------+
|            geometry|scalerank|featurecla| type|            name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|               pop|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+------------------+
|POINT (113.935016...|        2|   Airport|major| Hong Kong Int'l|   HKG|terminal|    VHHH|      HKG|http://en.wikiped...|  150.000| 1627.572998046875|
|POINT (121.231370...|        2|   Airport|major|         Taoyuan|   TPE|terminal|    RCTP|      TPE|http://en.wikiped...|  150.000|1459.4176025390625|
|POINT (4.76437693...|        2|   Airport|major|        Schiphol|   AMS|terminal|    EHAM|      AMS|http://en.wikiped...|  150.000|1093.3812255859375|
|POINT (103.986413...|        2|   Airport|major|Singapore Changi|   SIN|terminal|    WS

                                                                                

204

## Joining POIs with out-db tiles

We run a spatial join between the POI dataset and the out-db raster tile dataset and evaluate the population value at each point.

In [60]:
res_df = poi_df.join(sedona.table("wherobots.test_db.world_pop_tiles"), expr("RS_Intersects(tile, geometry)")).withColumn("pop", expr("RS_Value(tile, geometry)")).drop("tile")
res_df.show(5)
res_df.where("pop > 100").count()

                                                                                

+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+---+---+------------------+
|            geometry|scalerank|featurecla| type|            name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|  x|  y|               pop|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+---+---+------------------+
|POINT (113.935016...|        2|   Airport|major| Hong Kong Int'l|   HKG|terminal|    VHHH|      HKG|http://en.wikiped...|  150.000|137| 28| 1627.572998046875|
|POINT (121.231370...|        2|   Airport|major|         Taoyuan|   TPE|terminal|    RCTP|      TPE|http://en.wikiped...|  150.000|141| 27|1459.4176025390625|
|POINT (4.76437693...|        2|   Airport|major|        Schiphol|   AMS|terminal|    EHAM|      AMS|http://en.wikiped...|  150.000| 86| 14|1093.3812255859375|
|POINT (103.986413...|        2|   Airpo

                                                                                

204
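
The `x` and `y` columns in the joined result identify the tile that contains each point. The sketch below shows the underlying geometry in plain Python: convert a longitude/latitude pair to a pixel column and row using the raster's geo-referencing values, then divide by the tile size. It illustrates the idea only and is not how `RS_Intersects` is implemented; `tile_index` is a hypothetical helper, and the sample point uses rounded, assumed coordinates near Amsterdam.

```python
import math

# Geo-referencing values copied from the RS_Metadata output.
UPPER_LEFT_X, UPPER_LEFT_Y = -180.001249265, 83.99958319871001
SCALE_X, SCALE_Y = 0.0083333333, -0.0083333333
TILE_WIDTH, TILE_HEIGHT = 256, 256


def tile_index(lon: float, lat: float) -> tuple[int, int]:
    """Return the (x, y) index of the tile containing a point, assuming zero skew."""
    col = math.floor((lon - UPPER_LEFT_X) / SCALE_X)   # pixel column
    row = math.floor((lat - UPPER_LEFT_Y) / SCALE_Y)   # pixel row (SCALE_Y is negative)
    return col // TILE_WIDTH, row // TILE_HEIGHT


# Assumed, rounded coordinates for illustration; not taken from the POI table.
print(tile_index(4.76, 52.31))
```

For this sample point the helper returns the same tile indices that the join output lists for the Schiphol row, which suggests the tile columns are zero-based pixel offsets divided by the tile size.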

## Joining POIs with in-db tiles

In [61]:
res_df = poi_df.join(sedona.table("wherobots.test_db.world_pop_indb_tiles"), expr("RS_Intersects(tile, geometry)")).withColumn("pop", expr("RS_Value(tile, geometry)")).drop("tile")
res_df.show(5)
res_df.where("pop > 100").count()

                                                                                

+--------------------+---------+----------+-----+--------------------+------+--------+--------+---------+--------------------+---------+---+---+-----------------+
|            geometry|scalerank|featurecla| type|                name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|  x|  y|              pop|
+--------------------+---------+----------+-----+--------------------+------+--------+--------+---------+--------------------+---------+---+---+-----------------+
|POINT (-64.702774...|        4|   Airport|  mid|       Bermuda Int'l|   BDA|terminal|    TXKF|      BDA|http://en.wikiped...|   50.000| 54| 24|685.7862548828125|
|POINT (15.4465162...|        4|   Airport|  mid|Kinshasa N Djili ...|   FIH|terminal|    FZAA|      FIH|http://en.wikiped...|   50.000| 91| 41|994.3622436523438|
|POINT (-97.226769...|        4|   Airport|major|      Winnipeg Int'l|   YWG|terminal|    CYWG|      YWG|http://en.wikiped...|   50.000| 38| 15|2.445089340209961|
|POINT (80.1637759...|

                                                                                

204