# Geospatial Data Engineering Associate

**What you will learn**
This notebook will teach you to:

* Set up your environment and create a **WherobotsDB context**
* Load vector and raster datasets directly from **AWS S3** into **Apache Sedona DataFrames**
* Inspect and validate geometries for quality and consistency
* Standardize and transform **Coordinate Reference Systems (CRS)**
* Apply the **Bronze → Silver → Gold** data architecture pattern for spatial workflows
* Save and manage your first **Iceberg table** to prepare for scalable analysis

In [None]:
from sedona.spark import *
from pyspark.sql.functions import col, when, expr
from sedona.sql.st_functions import ST_IsValid, ST_IsValidReason, ST_MakeValid
from pyspark.sql import DataFrame
import urllib.request
import json

In [None]:
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

## Introduction to a Sedona DataFrame

A Sedona DataFrame is an extension of the standard Spark DataFrame that adds native support for geospatial data types — namely Vector and Raster.

- Vector data includes geometries such as points, lines, and polygons.
- Raster represents gridded or image-based data such as satellite imagery or elevation tiles.

Because these are built-in (native) data types, Sedona and Wherobots know how to handle them automatically.
This means you can:
- Run spatial and raster functions (like `ST_Contains`, `ST_Intersection`, `RS_Clip`, `RS_ZonalStats`) directly in your queries.
- Combine vector and raster data in the same workflow — for example, clipping imagery to a city boundary.
- Scale these operations easily across large datasets.

## Loading vector data into a Sedona DataFrame

Now that we know what a Sedona DataFrame is, let’s see how to load vector data into one.

In this section, we’ll cover:
- GeoParquet — the most common cloud-native geospatial format
- GeoJSON — flexible and human-readable
- CSV — raw text with WKT/WKB geometries
- Shapefile — the classic desktop GIS format



### GeoParquet

GeoParquet is the preferred format for storing and sharing vector data in the cloud.

It extends the standard Apache Parquet format with a small block of “geo” metadata that describes:
- which column contains the geometry,
- the geometry type (Point, Polygon, etc.),
- its coordinate reference system (CRS), and
- the bounding box of each geometry column.

Because GeoParquet is columnar, compressed, and splittable, it’s ideal for large-scale analytics on cloud object storage like S3.
It also supports spatial predicate push-down — meaning Wherobots can automatically skip reading irrelevant files and row-groups when performing spatial range queries.
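The pruning idea behind spatial predicate push-down can be sketched in plain Python. This is only an illustration of the concept, not how Wherobots implements it: the file names, bounding boxes, and the `files_to_scan` helper are all invented for the example.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


# Hypothetical per-file bounding boxes, as they might be recorded in file metadata.
file_bboxes = {
    "part-000.parquet": BBox(-74.05, 40.68, -73.95, 40.75),
    "part-001.parquet": BBox(-73.95, 40.75, -73.85, 40.85),
    "part-002.parquet": BBox(-122.45, 37.70, -122.35, 37.80),
}


def files_to_scan(query: BBox, bboxes: dict) -> list:
    """Keep only the files whose bounding box could contain matching rows."""
    return sorted(name for name, box in bboxes.items() if bbox_intersects(box, query))


query_window = BBox(-74.02, 40.70, -73.97, 40.74)
print(files_to_scan(query_window, file_bboxes))
```

Only files whose box touches the query window need to be opened; the others are skipped without reading any rows.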

In [None]:
geoparquet_path = "s3://wherobots-examples/data/nyc_buildings.parquet"

df = sedona.read.format("geoparquet").load(geoparquet_path)
df.show(5)

### GeoJSON

Wherobots supports reading GeoJSON files directly using the geojson data source.

This reader understands most GeoJSON variations, including:
- Standard Feature and FeatureCollection objects
- SpatioTemporal Asset Catalog (STAC) files
- GeoJSON files that span multiple lines for readability

When loaded, Wherobots automatically parses the geometry field into its internal Geometry type.

In [None]:
geojson_multi_path = "s3://wherobots-examples/data/noaa/storms/"

df = sedona.read.format("geojson")\
        .option("multiLine", "true")\
        .load(geojson_multi_path)\
        .selectExpr("explode(features) as features")\
        .select("features.*")

df.show(5)

### Shapefile

The Shapefile format has been around since the early days of GIS and is still used widely in desktop and government datasets.

Wherobots can load Shapefiles directly into a Sedona DataFrame using the shapefile data source.
This works whether you point to a single .shp file or to a directory containing multiple shapefiles.

Wherobots automatically reads the related files (.dbf, .shx, etc.) and converts the geometry column into a native Geometry type.

In [None]:
shp_path = "s3://wherobots-examples/data/austin_boundaries/"

df = sedona.read.format("shapefile").load(shp_path)
df.show(5)

When the input path is a directory, all shapefiles directly inside that directory will be loaded.
To include shapefiles in subdirectories, set recursiveFileLookup to true:

In [None]:
df = sedona.read.format("shapefile") \
    .option("recursiveFileLookup", "true") \
    .load("s3://wherobots-examples/data/examples/Global_Landslide_Catalog/")

df.show(5)

## Loading raster data into a Sedona DataFrame

Wherobots can load rasters as native raster columns, allowing you to tile, clip, resample, and compute statistics directly in Spark - just like you would with tabular data.

We’ll cover:
- Reading GeoTIFFs (COGs recommended)
- Reading from STAC APIs and collections

### GeoTIFFs

GeoTIFFs are raster image files that store both pixel values and geospatial metadata (like coordinate reference systems).
A Cloud-Optimized GeoTIFF (COG) is a GeoTIFF structured for fast, partial reads in cloud storage — ideal for distributed systems like Wherobots.

Wherobots' raster reader can load these directly into a Sedona DataFrame.
Each raster becomes one or more tiles, stored in a column of type raster.

In [None]:
geotiff_path = "s3://wherobots-examples/data/ghs_population/GHS_POP_E1975_GLOBE_R2023A_4326_3ss_V1_0.tif"

df = sedona.read.format("raster").load(geotiff_path)
df.show(5)

Each row represents a raster tile, and by default the raster reader automatically:
- Splits large rasters into tiles.
- Adds x and y columns to indicate each tile’s position.
- Reads file-level metadata (CRS, extent, etc.) internally.

To fine-tune the tiling behavior, you can specify options:

In [None]:
df = sedona.read.format("raster") \
    .option("tileWidth", "512") \
    .option("tileHeight", "512") \
    .option("retile", "true") \
    .load(geotiff_path)

df.show(5)
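To build intuition for what a 512 x 512 tiling produces, here is a small stand-alone sketch that enumerates a tile grid. The raster dimensions and the `tile_grid` helper are hypothetical, and truncating the edge tiles is an assumption made for the illustration; the reader's actual handling of edge tiles is not described here.

```python
import math


def tile_grid(width: int, height: int, tile_width: int = 512, tile_height: int = 512):
    """Yield (x, y, cols, rows) for each tile of a width x height pixel raster.

    x and y are tile indices; tiles on the right and bottom edges
    are truncated to the pixels that remain.
    """
    n_x = math.ceil(width / tile_width)
    n_y = math.ceil(height / tile_height)
    for y in range(n_y):
        for x in range(n_x):
            cols = min(tile_width, width - x * tile_width)
            rows = min(tile_height, height - y * tile_height)
            yield x, y, cols, rows


# A hypothetical 1200 x 700 pixel raster.
tiles = list(tile_grid(1200, 700))
print(len(tiles), "tiles; last tile:", tiles[-1])
```

Smaller tiles give more rows and more parallelism, at the cost of more per-tile overhead.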

> Tip - Use Cloud-Optimized GeoTIFFs (COGs) when possible — they’re optimized for partial reads in cloud storage, which makes distributed processing far more efficient.

### SpatioTemporal Asset Catalog (STAC)

A SpatioTemporal Asset Catalog (STAC) is a standard for describing and organizing geospatial assets — such as satellite imagery, aerial photos, and elevation data — across space and time.

Wherobots includes a built-in STAC Reader, which allows you to load STAC items and collections directly into a Sedona DataFrame. This gives you seamless access to large imagery archives hosted on cloud platforms like AWS, Planetary Computer, or Element84 — all without leaving your Spark environment.

The STAC Reader supports:
- Direct integration with HTTP, HTTPS, S3, or local STAC JSON sources.
- Unified geospatial analysis, so you can join imagery metadata with vector or raster datasets inside Spark.
- Spatial and temporal filter pushdown, meaning filters like `ST_Intersects` or datetime BETWEEN are pushed down to the STAC API, minimizing data transfer and improving query performance.
- Flexible configuration options for partitioning, request limits, and parallel loading — making it scalable for both exploration and production workflows.
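Filter pushdown means the spatial and temporal constraints travel with the request instead of being applied after download. The sketch below builds the query parameters defined by the STAC API item search (`collections`, `bbox`, `datetime`, `limit`). How the STAC Reader issues its requests internally is not shown in this notebook, so treat the `stac_search_query` helper and its example values as assumptions.

```python
from urllib.parse import parse_qs, urlencode


def stac_search_query(collection, bbox, start, end, limit=50):
    """Encode a bounding box and a time interval as STAC API search parameters."""
    params = {
        "collections": collection,
        "bbox": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "datetime": f"{start}/{end}",            # closed interval: start/end
        "limit": limit,                          # items per page
    }
    return urlencode(params)


query = stac_search_query(
    "sentinel-2-pre-c1-l2a",
    (-122.5, 47.4, -122.2, 47.8),
    "2024-06-01T00:00:00Z",
    "2024-06-30T23:59:59Z",
)
decoded = parse_qs(query)
print(decoded)
```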

You can connect to STAC sources via an HTTPS endpoint, an S3 path, or a local JSON file.

In [None]:
stac_df = sedona.read.format("stac").load(
    "https://earth-search.aws.element84.com/v1/collections/sentinel-2-pre-c1-l2a"
)

stac_df.printSchema()
stac_df.select("id", "datetime", "geometry", "collection").show(5, truncate=False)

You can control how many items to load, how requests are batched, and how partitions are distributed.

In [None]:
df = sedona.read.format("stac") \
            .option("itemsLimitMax", "1000")\
            .option("itemsLimitPerRequest", "50")\
            .option("itemsLoadProcessReportThreshold", "500000")\
            .load("https://earth-search.aws.element84.com/v1/collections/sentinel-2-pre-c1-l2a")

## Introduction to Managing Spatial Tables with Iceberg

Wherobots extends Apache Iceberg — a modern open table format — to natively support spatial data.
This allows you to manage vector and raster datasets just like any other analytical table, with the same reliability, scalability, and query optimization benefits.

For a data engineer, this means you can use familiar tools (SQL, Spark, Sedona) while gaining spatial awareness at the table level — no need for custom file handling or geospatial indexing setups.

### Why Iceberg Matters for Spatial Data

Iceberg brings all the essentials of modern data lake management — schema evolution, ACID transactions, partition pruning, and time travel.
Wherobots builds on this foundation to add spatial intelligence directly into the table layer.

1. Native Spatial Columns

    Geometry and raster columns are first-class types — not just blobs or strings.

    This means:
    - You can save Sedona DataFrames with geometry or raster columns directly to Iceberg tables.
    - Query them with familiar functions like `ST_Intersects`, `ST_Within`, or `RS_Clip`.
    - Work with both in-database rasters (pixel data stored in the table itself) and out-of-database rasters (linked to GeoTIFFs or COGs on S3).

2. Spatial Statistics and Pruning

    Each Iceberg data file stores spatial metadata — including minimum bounding rectangles (MBRs) for geometry and raster columns.

    This allows the query engine to:
    - Skip irrelevant files that fall outside your spatial filter (spatial pruning).
    - Push down bounding-box filters to the scan layer, reducing data read from storage.
    - Speed up spatial joins by comparing file-level extents before loading data.

3. Spatial Partitioning and Organization

    Spatial transformations (like tiling, grid partitioning, or bounding-box bucketing) can be used to organize data.
    This helps co-locate nearby geometries and tiles, reducing shuffle and improving performance in spatial joins or aggregations.

4. Query Optimization and Pushdown

    WherobotsDB uses Iceberg’s metadata to push filters and column selection down to the scan layer:
    - Spatial filters (`ST_Intersects`, `ST_Within`) are evaluated at the metadata level.
    - Raster metadata and specific bands can be selectively read (projection pushdown).

# Wherobots Fundamentals - Constructing Geometries

WherobotsDB provides a powerful set of functions to construct geometries. You can either create them from scratch using raw coordinate values (literals) or by parsing standard geospatial data formats like WKT and WKB.

---

### Creating from Coordinates

These functions build geometries directly from numerical inputs.

* `ST_MakePoint(x, y, [z], [m])`: Creates a **Point** geometry from its x and y coordinates. You can also optionally provide a z-coordinate (for elevation) and an m-coordinate (a measure value).

* `ST_MakeEnvelope(xmin, ymin, xmax, ymax)`: Creates a rectangular **Polygon** that represents a bounding box, or "envelope," from the coordinates of two opposing corners.

* `ST_LineStringFromText(text, delimiter)`: This function builds a LineString from a flat string of comma-separated coordinates, like `'x1, y1, x2, y2, ...'`. This provides a fast way to create line geometries directly from raw text data without needing the formal structure of WKT.

* `ST_PolygonFromText(text, delimiter)`: Similarly, the ST_PolygonFromText function creates a Polygon from a flat string of comma-separated coordinates. For a valid polygon, the sequence must form a closed ring by ensuring the last coordinate pair is identical to the first (e.g., `'x1, y1, x2, y2, x3, y3, x1, y1'`).
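The flat coordinate format and the closed-ring requirement can be checked with a few lines of plain Python. This is a sketch of the input convention only, not of how the functions are implemented; `parse_flat_coords` and `is_closed_ring` are hypothetical helpers.

```python
def parse_flat_coords(text: str, delimiter: str = ","):
    """Turn 'x1, y1, x2, y2, ...' into a list of (x, y) pairs."""
    values = [float(v) for v in text.split(delimiter)]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values (x, y pairs)")
    return list(zip(values[0::2], values[1::2]))


def is_closed_ring(coords) -> bool:
    """A polygon ring needs at least four points, with the last equal to the first."""
    return len(coords) >= 4 and coords[0] == coords[-1]


line = parse_flat_coords(
    "-122.349277, 47.620504, -122.350446, 47.620556, -122.348258, 47.621494"
)
ring = parse_flat_coords("0, 0, 4, 0, 4, 3, 0, 0")

print(is_closed_ring(line))  # three points, fine for a line but not for a polygon
print(is_closed_ring(ring))  # first and last pair match
```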

---

### Creating from Standard Formats

These functions parse common text-based or binary geospatial formats.

* `ST_GeomFromWKT(text)`: The primary function for constructing any geometry type from its **W**ell-**K**nown **T**ext (WKT) representation. This is one of the most common ways to ingest geometries.

* `ST_GeomFromWKB(binary)`: Creates a geometry from its **W**ell-**K**nown **B**inary (WKB) representation, which is a compact, machine-readable alternative to WKT.
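To see what the two formats look like for the same geometry, the sketch below encodes a 2D point as little-endian WKB using only the standard library and decodes it back into WKT. It covers only the Point case, and the helper names are made up for the example.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB.

    Layout: 1 byte for byte order (1 = little-endian),
    4 bytes for the geometry type (1 = Point), then x and y as float64.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_point_to_wkt(wkb: bytes) -> str:
    """Decode a little-endian WKB point back into WKT text."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return f"POINT ({x} {y})"


wkb = point_to_wkb(-122.349277, 47.620504)
print(len(wkb), "bytes:", wkb.hex())
print(wkb_point_to_wkt(wkb))
```

The binary form has a fixed size and needs no text parsing, which is why it is the usual choice for storage and transfer, while WKT is easier to read and type by hand.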



---

### Creating from Other Geometries

This function combines existing geometries into a single feature.

* `ST_Collect(geometry_array)`: Takes an array of geometries and combines them into a single multi-part geometry (e.g., `MultiPoint`, `MultiPolygon`) or a `GeometryCollection`. Direct spatial operations on arrays are often limited, so `ST_Collect` is how you consolidate related features into one object that can then be analyzed.

---

While these are common examples, they are not the only constructor functions available in WherobotsDB. We will now look at examples for `ST_MakePoint`, `ST_LineStringFromText`, `ST_Collect`, and `ST_MakeEnvelope`.

### ST_MakePoint()

In [None]:
points_df = sedona.sql("""

SELECT ST_MakePoint(-122.349277, 47.620504) as space_needle, ST_MakePoint(-122.350446, 47.620556) as glass_museum, ST_MakePoint(-122.348258, 47.621494) as pop_culture_museum

""")

points_df.show(1, False)

In [None]:
map_config_url = "https://raw.githubusercontent.com/wherobots/geospatial-data-engineering-associate/refs/heads/main/assets/week-1/conf/map_config.json"

with urllib.request.urlopen(map_config_url) as response:
    map_config = json.load(response)

# Named points_map rather than map to avoid shadowing Python's built-in map()
points_map = SedonaKepler.create_map(points_df, "Tourist spots", map_config)
points_map

### ST_LineStringFromText()

In [None]:
line_df = sedona.sql("""

SELECT ST_LineStringFromText('-122.349277, 47.620504, -122.350446, 47.620556, -122.348258, 47.621494', ',') as order_to_visit

""")
line_df.show(1, False)

### ST_Collect()

In [None]:
# Register the DataFrames as temporary views so they can be queried from SQL
points_df.createOrReplaceTempView("points")
line_df.createOrReplaceTempView("line")

collection_df = sedona.sql("""

SELECT 
    ST_Collect(Array(space_needle, glass_museum, pop_culture_museum, order_to_visit))
FROM points, line

""")

collection_df.show(1, False)

In [None]:
map_collection = SedonaKepler.create_map(collection_df, "Things to do", map_config)
map_collection

### ST_MakeEnvelope()

In [None]:
envelope_df = sedona.sql("""

SELECT ST_MakeEnvelope(-122.352848,47.619674,-122.346539,47.622451) AS tourist_location_bbox

""")

envelope_df.show(1, False)

# Wherobots Fundamentals - Spatial Predicates

Spatial predicates are functions that test the relationship between two geometries, returning `TRUE` or `FALSE`. They form the core of most spatial analysis, allowing you to filter data or create joins based on how geometries interact with each other. Understanding the exact logic of each predicate is key to performing accurate analysis.

In this section, we will explore some of the most essential predicate functions in detail.

-----

## ST_Intersects(A, B)

This is the most general-purpose spatial relationship, returning `TRUE` if two geometries **share any space at all**. This includes touching at a single point on their boundaries or overlapping in any way. It's the opposite of `ST_Disjoint`.

  * **Use Case:** Finding any parcels that intersect with a specific road.

-----

## ST_Contains(A, B) and ST_Within(A, B)

These two functions are inverses of each other and describe a "spatially-inside" relationship.

  * `ST_Contains(A, B)` returns `TRUE` if geometry **A** completely encloses geometry **B**. No part of B can be outside of A. Think of a cookie inside a cookie jar; the jar contains the cookie.
  * `ST_Within(A, B)` returns `TRUE` if geometry **A** is completely inside geometry **B**. It's the reverse of `ST_Contains`. The cookie is within the jar.

A key detail is that the boundaries of the geometries cannot simply touch; at least one point of the inner geometry's interior must fall inside the outer geometry's interior. For example, a line that lies perfectly on the boundary of a polygon is not *contained* by it.

  * **Use Case:** Finding all the schools inside a specific city district, written either as `ST_Within(school, district)` or as `ST_Contains(district, school)`.

-----

## ST_Overlaps(A, B)

This predicate is more specific than `ST_Intersects`. It returns `TRUE` only if two geometries **partially intersect** and are of the **same dimension**. For example, two overlapping polygons will return `TRUE`, but a line crossing a polygon will not. Critically, neither geometry can be completely contained within the other.

  * **Use Case:** Finding sales regions that have a partial overlap, which might indicate a territory dispute.
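The differences between these predicates are easiest to see on axis-aligned rectangles, where each test reduces to a few comparisons. The sketch below is a simplified model that assumes rectangles with positive area; real geometries need a full topology engine, and the `Box` type and example shapes are invented for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Box, b: Box) -> bool:
    """Any shared point counts, including a shared edge or corner."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def contains(a: Box, b: Box) -> bool:
    """No part of b lies outside a."""
    return (a.xmin <= b.xmin and a.ymin <= b.ymin
            and b.xmax <= a.xmax and b.ymax <= a.ymax)


def within(a: Box, b: Box) -> bool:
    """The same relationship as contains, with the arguments swapped."""
    return contains(b, a)


def overlaps(a: Box, b: Box) -> bool:
    """Interiors intersect, but neither box contains the other."""
    interiors_meet = (a.xmin < b.xmax and b.xmin < a.xmax
                      and a.ymin < b.ymax and b.ymin < a.ymax)
    return interiors_meet and not contains(a, b) and not contains(b, a)


district = Box(0, 0, 10, 10)
park = Box(2, 2, 4, 4)          # fully inside the district
neighbor = Box(10, 0, 20, 10)   # shares only an edge with the district
zone = Box(8, 8, 12, 12)        # partly inside, partly outside

for name, other in [("park", park), ("neighbor", neighbor), ("zone", zone)]:
    print(name, intersects(district, other), contains(district, other), overlaps(district, other))
```

All three shapes intersect the district, but only the park is contained and only the zone overlaps.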

-----

## ST_DWithin(A, B, distance, [useSpheroid])

Instead of just testing a direct spatial relationship, `ST_DWithin` checks for **proximity**. It returns `TRUE` if the two geometries are **within a specified distance** of each other. This is extremely powerful for "buffer" style queries and is highly optimized to use spatial indexes, making it much faster than calculating the exact distance for every pair of geometries.

### Distance Calculation: Spheroid vs. Euclidean

The optional `useSpheroid` flag is crucial as it controls how the distance is calculated:

* **`useSpheroid = true` (Spheroidal Distance):** This method should be used for geographic data (latitude/longitude). It calculates the geodesic distance along the Earth's curved surface, which is more accurate than a flat-plane calculation. When this is enabled, the distance unit is always **meters**, and the calculation is performed between the centroids of the two geometries.

* **`useSpheroid = false` (Euclidean Distance):** This is the default behavior. It performs a simpler, "flat-earth" distance calculation. The unit of the `distance` parameter in this case is the same as the unit of the data's Coordinate Reference System (CRS). For accurate results, you should first transform your data into a projected CRS appropriate for distance measurements (e.g., a UTM or State Plane system).

* **Use Case:** Finding all ATMs within 500 **meters** of a specific address using geographic coordinates (`useSpheroid = true`), or finding all competing stores within 2,500 **feet** of a location using a projected State Plane CRS where the units are in feet (`useSpheroid = false`).
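The unit problem is easy to demonstrate with two of the Seattle points used earlier. The sketch below compares a haversine distance in meters with a naive Euclidean distance on raw longitude/latitude values. Haversine assumes a perfect sphere, so it only approximates a spheroidal calculation and is not the formula `ST_DWithin` uses.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def euclidean(lon1, lat1, lon2, lat2):
    """Flat-plane distance in whatever units the coordinates use (degrees here)."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


space_needle = (-122.349277, 47.620504)
pop_culture_museum = (-122.348258, 47.621494)

meters = haversine_m(*space_needle, *pop_culture_museum)
degrees = euclidean(*space_needle, *pop_culture_museum)

print(f"spherical: {meters:.1f} m, euclidean: {degrees:.6f} degrees")
print("within 500 m:", meters <= 500)
```

Comparing the Euclidean result to a threshold of 500 would be meaningless, because that number is in degrees, not meters. This is why the Euclidean mode expects data in a projected CRS.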

-----

These are just a few of the many powerful predicate functions available in WherobotsDB. You can find the complete list in the [official documentation](https://docs.wherobots.com/latest/references/wherobotsdb/vector-data/Predicate/).

# Wherobots Fundamentals - Spatial Joins (Range Joins)

Now that you understand spatial predicates, you can use them to perform one of the most powerful operations in geospatial analysis: the **spatial join**. While a standard join uses a key like an ID to match rows (`tableA.id = tableB.id`), a spatial join combines data from two tables based on the spatial relationship between their geometries. This is a type of "range join" where the condition isn't simple equality but a spatial test, such as `ST_Intersects(A.geom, B.geom)`.

The primary goal of a spatial join is enrichment: adding attributes from one spatial dataset to another based on their shared location.

---

## Why Spatial Joins Matter

Spatial joins allow you to get insights from your data **as if location itself were a column you could join on**. They let you combine completely different datasets using their shared space as the common link, which unlocks powerful analytical capabilities.

### Contextual Enrichment

Imagine you have a table of customer addresses (points) and a separate table of county demographics (polygons). These tables have no common ID column. A spatial join lets you "enrich" your customer data by transferring the demographic information from the county polygon that each customer point falls within. You could then analyze customer behavior by county income level, population density, or any other demographic metric.
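Conceptually, this enrichment is a join whose match condition is a point-in-polygon test. The sketch below uses a naive nested loop with a ray-casting test to show the logic on a tiny scale. The customers, counties, and income figures are invented, and a distributed engine would rely on spatial indexing and partitioning rather than comparing every pair.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test against a closed ring; behavior exactly on the boundary is not defined."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # this edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical inputs with no shared ID column.
customers = [
    {"customer": "Ada", "x": 1.0, "y": 1.0},
    {"customer": "Grace", "x": 6.0, "y": 2.0},
    {"customer": "Linus", "x": 20.0, "y": 20.0},
]
counties = [
    {"county": "North", "median_income": 72000,
     "ring": [(0, 0), (5, 0), (5, 5), (0, 5), (0, 0)]},
    {"county": "East", "median_income": 58000,
     "ring": [(5, 0), (10, 0), (10, 5), (5, 5), (5, 0)]},
]

# Inner join: keep each (customer, county) pair where the point falls inside the polygon.
enriched = [
    (c["customer"], k["county"], k["median_income"])
    for c in customers
    for k in counties
    if point_in_polygon(c["x"], c["y"], k["ring"])
]
print(enriched)
```

Linus falls outside every county, so he drops out of the result, just as an unmatched row would in an inner join.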



### Answering Complex Questions

Ultimately, spatial joins are how you answer real-world questions that involve location. For example:
* Which of our stores are located in flood-prone areas?
* What is the average property value for parcels within 500 meters of a new transit line?
* How many competitors are within a 10-minute drive of each of our locations?

In [None]:
places_df = (
    sedona.sql("""

    WITH seattle_downtown AS (
        SELECT ST_GeomFromWKT('POLYGON ((-122.360916 47.590189, -122.299461 47.590189, -122.299461 47.641104, -122.360916 47.641104, -122.360916 47.590189))') AS geom
    ),

    -- This is to leverage spatial predicate pushdown
    places AS (
      SELECT *
      FROM
        wherobots_open_data.overture_maps_foundation.places_place places, seattle_downtown
      WHERE
        ST_Intersects(places.geometry, seattle_downtown.geom)
    ),

    -- This is to leverage spatial predicate pushdown
    buildings AS (
      SELECT *
      FROM
        wherobots_open_data.overture_maps_foundation.buildings_building buildings, seattle_downtown
      WHERE
        ST_Intersects(buildings.geometry, seattle_downtown.geom)
    )
    
    SELECT
      places.names.primary as place_name,
      places.categories.primary as place_type,
      element_at(places.addresses, 1) as place_address,
      ROUND(places.confidence * 100, 2) AS `place_confidence (%)`,
      places.geometry as place_geometry,
      buildings.geometry as building_geometry
    FROM
      places
    JOIN
      buildings
    ON
      ST_Intersects(places.geometry, buildings.geometry);
    
    """)
    # We are caching the result, as we will reuse it to visualize the data
        .cache()
)

places_df.show(10, False)

In [None]:
map_config_join_url = "https://raw.githubusercontent.com/wherobots/geospatial-data-engineering-associate/refs/heads/main/assets/week-1/conf/map_config_join.json"

with urllib.request.urlopen(map_config_join_url) as response:
    map_join_config = json.load(response)

map_places = SedonaKepler.create_map(places_df, "Places in buildings", map_join_config)
map_places

# Creating Havasu Tables from Sedona DataFrames

Now that you understand the importance of Apache Iceberg, let's see how simple it is to save a Sedona DataFrame as an Iceberg table. Wherobots uses an enhanced version of Iceberg called **Havasu**, which is purpose-built for high-performance geospatial analytics.

---

## What is Havasu?

Standard Apache Iceberg did not natively support geometry data types until its v3 specification. **Havasu** is Wherobots' enhanced implementation of Iceberg that adds first-class support for both **vector** and **raster** data. This allows you to combine all the benefits of the Iceberg format—like atomic transactions and schema evolution—with native, high-performance geospatial data handling.

---

## Saving a DataFrame to Havasu

Saving a Sedona DataFrame as a Havasu table is a straightforward, one-line command. The process is identical whether your DataFrame contains vector geometries, rasters, or no spatial data at all.

In [None]:
# Create a new Havasu (Iceberg) database

database = 'gde_bronze'

sedona.sql(f'CREATE DATABASE IF NOT EXISTS org_catalog.{database}')

In [None]:
geotiff_path = "s3://wherobots-examples/data/ghs_population/GHS_POP_E1975_GLOBE_R2023A_4326_3ss_V1_0.tif"

df = sedona.read.format("raster") \
    .option("tileWidth", "512") \
    .option("tileHeight", "512") \
    .option("retile", "true") \
    .load(geotiff_path)

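# writeTo() only builds a writer; nothing is written until create() or createOrReplace() is called.
# createOrReplace() keeps this cell safe to re-run if the table already exists.
df.writeTo(f"org_catalog.{database}.ghs_population_tiles").createOrReplace()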

# Data validity checks

Two of the most common issues with geospatial data are managing projections, or Coordinate Reference Systems (CRS), and ensuring geometries are valid.

- A geometry is invalid if it violates spatial rules like self-intersections, unclosed rings, misaligned holes, or overlapping parts—making it topologically incorrect.
- Spatial files generally contain a Coordinate Reference System or CRS that is defined by a Spatial Reference ID or SRID. This tells us how the data is projected from the round spheroid of the earth onto a flat surface.

To fix these issues and ensure our data is valid and in the correct format, we follow three steps:

1. Check the geometries for invalidity using `ST_IsValid` and `ST_IsValidReason`, and attempt to repair any invalid ones with `ST_MakeValid`
2. Remove or log any geometries that cannot be fixed
3. Standardize our geometries to a single CRS, in this case [EPSG:4326](https://epsg.io/4326) (WGS 84), which stores coordinates as latitude and longitude in degrees
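To make "invalid geometry" concrete, the following stand-alone sketch detects two of the problems mentioned above, an unclosed ring and a self-intersection, on plain coordinate lists. It is a simplified illustration only: `ST_IsValid` checks many more rules, and the `ring_problems` helper is hypothetical.

```python
def _orient(p, q, r):
    """Sign of the cross product: > 0 left turn, < 0 right turn, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def _segments_cross(a, b, c, d):
    """True when segments ab and cd cross at a point interior to both."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def ring_problems(ring):
    """Return a list of the problems found in a polygon ring."""
    problems = []
    if ring[0] != ring[-1]:
        problems.append("ring is not closed")
    edges = list(zip(ring, ring[1:]))
    # Edges that merely share an endpoint are not flagged by the strict crossing test.
    crossing = any(
        _segments_cross(*edges[i], *edges[j])
        for i in range(len(edges))
        for j in range(i + 1, len(edges))
    )
    if crossing:
        problems.append("self-intersection")
    return problems


square = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)]  # two edges cross at (1, 1)
open_ring = [(0, 0), (2, 0), (2, 2)]

for name, ring in [("square", square), ("bowtie", bowtie), ("open_ring", open_ring)]:
    print(name, ring_problems(ring))
```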

In [None]:
prefix = 's3://wherobots-examples/gdea-course-data/raw-data/'

In [None]:
def check_invalid_geometries(df: DataFrame, geom_col: str = "geom", reason_col: str = "why_invalid") -> int:
    """Count invalid geometries and print a sample of the reasons they are invalid."""
    invalid_df = (
        df.filter(~ST_IsValid(col(geom_col)))
          .withColumn(reason_col, ST_IsValidReason(col(geom_col)))
    )
    invalid_count = invalid_df.count()
    print(f"🔎 Checked geometries: found {invalid_count} invalid geometries.")
    if invalid_count > 0:
        # Show why the first few geometries failed validation
        invalid_df.select(reason_col).show(5, truncate=False)
    return invalid_count

def fix_invalid_geometries(df: DataFrame, invalid_count: int, geom_col: str = "geom") -> DataFrame:
    """Repair invalid geometries with ST_MakeValid; valid rows are left untouched."""
    if invalid_count > 0:
        print(f"🔧 Attempting to fix {invalid_count} invalid geometries...")
        return df.withColumn(
            geom_col,
            when(~ST_IsValid(col(geom_col)), ST_MakeValid(col(geom_col))).otherwise(col(geom_col))
        )
    else:
        print("⚡ No invalid geometries. Skipping automated fix.")
        return df

# --- driver program ---
def process_geometries(
    df: DataFrame,
    geom_col: str = "geom",
    attempt_fix: bool = True,
    split_on_fail: bool = True
):
    """
    Runs validity check -> optional repair -> optional split.
    Returns either:
      - {"df": corrected_df}  when all geometries valid after repair (or none invalid)
      - {"valid_df": ..., "invalid_df": ...} when some invalid remain and split_on_fail=True
    """
    # 1) Initial check
    invalid_count = check_invalid_geometries(df, geom_col=geom_col)

    if invalid_count == 0:
        print("✅ All geometries are valid.")
        return {"df": df}  # nothing to do

    # 2) Attempt repair (only rows with invalid geometries are changed)
    if attempt_fix:
        df_fixed = fix_invalid_geometries(df, invalid_count, geom_col=geom_col)
        remaining_invalid_count = df_fixed.filter(~ST_IsValid(col(geom_col))).count()
        print(f"🔎 After fixing, {remaining_invalid_count} invalid geometries remain.")
    
        if remaining_invalid_count == 0:
            print("✅ All geometries are valid after fixing.")
            return {"df": df_fixed}
        elif split_on_fail:
            print("⚠️ Some invalid geometries remain — splitting dataset.")
            valid_df = df_fixed.filter(ST_IsValid(col(geom_col)))
            invalid_df = df_fixed.filter(~ST_IsValid(col(geom_col)))
            print(f"✅ Split complete: {valid_df.count()} valid / {invalid_df.count()} invalid.")
            return {"valid_df": valid_df, "invalid_df": invalid_df}
        else:
            print("⚠️ Some invalid geometries remain, returning best-effort fixed DataFrame.")
            return {"df": df_fixed}
    
    # If no fix attempt, just split if requested
    if split_on_fail:
        print("⚠️ Skipping fix — splitting into valid and invalid.")
        valid_df = df.filter(ST_IsValid(col(geom_col)))
        invalid_df = df.filter(~ST_IsValid(col(geom_col)))
        print(f"✅ Split complete: {valid_df.count()} valid / {invalid_df.count()} invalid.")
        return {"valid_df": valid_df, "invalid_df": invalid_df}
    
    print("⚠️ Invalid geometries found but no fix or split requested. Returning original DataFrame.")
    return {"df": df}

In [None]:
# FEMA Flood Hazard Areas
fld_hazard_area = sedona.read.format('shapefile').load(f'{prefix}' + '53033C_20250330/S_FLD_HAZ_AR.shp')

In [None]:
result = process_geometries(fld_hazard_area, geom_col="geometry", attempt_fix=True, split_on_fail=True)

if "df" in result:
    df_final = result["df"]  # all valid (either already valid or successfully repaired)
else:
    valid_df = result["valid_df"]
    invalid_df = result["invalid_df"]
    df_final = valid_df  # continue with the valid rows only
    # handle invalid_df separately (e.g., export for manual review)

In [None]:
fld_hazard_area.writeTo(f"org_catalog.{database}.fema_flood_zones_bronze").createOrReplace()

## Transforming CRS

In [None]:
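# Check which SRID is currently stored on the geometries
sedona.sql(f'''
select st_srid(geometry) as srid from org_catalog.{database}.fema_flood_zones_bronze limit 1
''').show()

In [None]:
# Two-argument ST_Transform: the source CRS is taken from the geometry's own SRID
sedona.sql(f'''
select 
st_srid(st_transform(geometry, 'EPSG:4326')) as srid from org_catalog.{database}.fema_flood_zones_bronze limit 1
''').show()

In [None]:
# Three-argument ST_Transform: the source CRS is stated explicitly (EPSG:4269 is NAD83),
# which is useful when the geometry carries no SRID
sedona.sql(f'''
select 
st_srid(st_transform(geometry, 'EPSG:4269', 'EPSG:4326')) as srid from org_catalog.{database}.fema_flood_zones_bronze limit 1
''').show()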

# Loading datasets into WherobotsDB

In [None]:
# King County Generalized Land Use Data
gen_land_use = sedona.read.format('shapefile').load(f'{prefix}' + 'General_Land_Use_Final_Dataset/General_Land_Use_Final_Dataset.shp')

In [None]:
gen_land_use.writeTo(f"org_catalog.{database}.gen_land_use_bronze").createOrReplace()

In [None]:
# King County Sheriff Patrol Districts
sherrif_districts = sedona.read.format('shapefile').load(f'{prefix}' + 'King_County_Sheriff_Patrol_Districts___patrol_districts_area/King_County_Sheriff_Patrol_Districts___patrol_districts_area.shp')

In [None]:
sherrif_districts.writeTo(f"org_catalog.{database}.sherrif_districts_bronze").createOrReplace()

In [None]:
# King County Offense Reports
offense_reports = sedona.read.format('csv').load(f'{prefix}' + 'KCSO_Offense_Reports__2020_to_Present_20250923.csv')

In [None]:
offense_reports.writeTo(f"org_catalog.{database}.offense_reports_bronze").createOrReplace()

In [None]:
# King County Bike Lanes
bike_lanes = sedona.read.format('shapefile').load(f'{prefix}' + 'Metro_Transportation_Network_(TNET)_in_King_County_for_Bicycle_Mode___trans_network_bike_line/Metro_Transportation_Network_(TNET)_in_King_County_for_Bicycle_Mode___trans_network_bike_line.shp')

In [None]:
bike_lanes.writeTo(f"org_catalog.{database}.bike_lanes_bronze").createOrReplace()

In [None]:
# FEMA National Risk Index
fema_nri = sedona.read.format('shapefile').load(f'{prefix}' + 'NRI_Shapefile_CensusTracts/NRI_Shapefile_CensusTracts.shp')

In [None]:
fema_nri.writeTo(f"org_catalog.{database}.fema_nri_bronze").createOrReplace()

In [None]:
# King County School Sites
school_sites = sedona.read.format('shapefile').load(f'{prefix}' + 'School_Sites_in_King_County___schsite_point/School_Sites_in_King_County___schsite_point.shp')

In [None]:
school_sites.writeTo(f"org_catalog.{database}.school_sites_bronze").createOrReplace()

In [None]:
# Schools Report Card
report_card = sedona.read. \
    format('csv'). \
    load(f'{prefix}' + 'Report_Card_Growth_for_2024-25_20250923.csv')

In [None]:
report_card.writeTo(f"org_catalog.{database}.report_card_bronze").createOrReplace()

In [None]:
# Seismic Hazards
seismic_hazards = sedona.read. \
    format('shapefile'). \
    load(f'{prefix}' + 'Seismic_Hazards___seism_area/Seismic_Hazards___seism_area.shp')

In [None]:
seismic_hazards.writeTo(f"org_catalog.{database}.seismic_hazards_bronze").createOrReplace()

In [None]:
# Census Block Groups
block_groups = sedona.read. \
    format('shapefile'). \
    load(f'{prefix}' + 'tl_2024_53_bg/tl_2024_53_bg.shp')

In [None]:
block_groups.writeTo(f"org_catalog.{database}.block_groups_bronze").createOrReplace()

In [None]:
# Census CSVs
median_age = sedona.read. \
    format('csv'). \
    load(f'{prefix}' + 'ACSDT5Y2023.B01002_2025-09-19T105233/ACSDT5Y2023.B01002-Data.csv')

median_age.writeTo(f"org_catalog.{database}.median_age_bronze").createOrReplace()

total_pop = sedona.read. \
    format('csv'). \
    load(f'{prefix}' + 'ACSDT5Y2023.B01003_2025-09-19T105050/ACSDT5Y2023.B01003-Data.csv')

total_pop.writeTo(f"org_catalog.{database}.total_pop_bronze").createOrReplace()

median_income = sedona.read. \
    format('csv'). \
    load(f'{prefix}' + 'ACSDT5Y2023.B19013_2025-09-19T105253/ACSDT5Y2023.B19013-Data.csv')

median_income.writeTo(f"org_catalog.{database}.median_income_bronze").createOrReplace()

In [None]:
# Transit Routes
transit_routes = sedona.read. \
    format('shapefile'). \
    load(f'{prefix}' + 'Transit_Routes_for_King_County_Metro___transitroute_line/Transit_Routes_for_King_County_Metro___transitroute_line.shp')

In [None]:
transit_routes.writeTo(f"org_catalog.{database}.transit_routes_bronze").createOrReplace()

In [None]:
# Transit Stops
transit_stops = sedona.read. \
    format('shapefile'). \
    load(f'{prefix}' + 'Transit_Stops_for_King_County_Metro___transitstop_point/Transit_Stops_for_King_County_Metro___transitstop_point.shp')

In [None]:
transit_stops.writeTo(f"org_catalog.{database}.transit_stops_bronze").createOrReplace()

In [None]:
# Water Bodies
water_bodies = sedona.read. \
    format('shapefile'). \
    load(f'{prefix}' + 'Waterbodies_with_History_and_Jurisdictional_detail___wtrbdy_det_area/Waterbodies_with_History_and_Jurisdictional_detail___wtrbdy_det_area.shp')

In [None]:
water_bodies.writeTo(f"org_catalog.{database}.water_bodies_bronze").createOrReplace()

In [None]:
# Wildfire Polygons
wildfires = sedona.read. \
    format('shapefile'). \
    load(f'{prefix}' + 'Wildfires_1878_2019_Polygon_Data/Shapefile/US_Wildfires_1878_2019.shp')

In [None]:
wildfires.writeTo(f"org_catalog.{database}.wildfires_bronze").createOrReplace()

In [None]:
# Elevation

url = 's3://copernicus-dem-30m/*/*.tif'

elev_df = sedona.read.format("raster").option("retile", "true").load(url) \
.where(
    "RS_Intersects(rast, ST_GeomFromText('POLYGON((-125.0572 48.9964, -120.255 48.9964, -120.255 46.491, -125.0572 46.491, -125.0572 48.9964))'))"
)

# Uncomment the line below to load the global DEM instead of the filtered subset

# elev_df = sedona.read.format("raster").option("retile", "true").load(url)

In [None]:
elev_df.writeTo(f"org_catalog.{database}.elevation_bronze").createOrReplace()

In [None]:
# Geocoded Schools
schools = sedona.read. \
    format('geojson'). \
    option('mode', 'DROPMALFORMED'). \
    load(f'{prefix}' + 'Washington_State_Public_Schools_GeoCoded.geojson')

In [None]:
schools = schools \
    .withColumn("AYPCode", expr("properties['AYPCode']")) \
    .withColumn("CongressionalDistrict", expr("properties['CongressionalDistrict']")) \
    .withColumn("County", expr("properties['County']")) \
    .withColumn("ESDCode", expr("properties['ESDCode']")) \
    .withColumn("ESDName", expr("properties['ESDName']")) \
    .withColumn("Email", expr("properties['Email']")) \
    .withColumn("GeoCoded_X", expr("properties['GeoCoded_X']")) \
    .withColumn("GeoCoded_Y", expr("properties['GeoCoded_Y']")) \
    .withColumn("GradeCategory", expr("properties['GradeCategory']")) \
    .withColumn("HighestGrade", expr("properties['HighestGrade']")) \
    .withColumn("LEACode", expr("properties['LEACode']")) \
    .withColumn("LEAName", expr("properties['LEAName']")) \
    .withColumn("LegislativeDistrict", expr("properties['LegislativeDistrict']")) \
    .withColumn("LowestGrade", expr("properties['LowestGrade']")) \
    .withColumn("MailingAddress", expr("properties['MailingAddress']")) \
    .withColumn("NCES_X", expr("properties['NCES_X']")) \
    .withColumn("NCES_Y", expr("properties['NCES_Y']")) \
    .withColumn("Phone", expr("properties['Phone']")) \
    .withColumn("Principal", expr("properties['Principal']")) \
    .withColumn("School", expr("properties['School']")) \
    .withColumn("SchoolCategory", expr("properties['SchoolCategory']")) \
    .withColumn("SchoolCode", expr("properties['SchoolCode']")) \
    .withColumn("SingleAddress", expr("properties['SingleAddress']")) \
    .drop("properties", "type", "_corrupt_record")

In [None]:
schools.printSchema()

In [None]:
schools.writeTo(f"org_catalog.{database}.schools_bronze").createOrReplace()