# Performing Spatial Joins in Wherobots

This notebook will guide you through performing spatial joins in Wherobots using Python and the DataFrame API — giving you a hands-on understanding of how to combine datasets based on their spatial relationships.

## What you will learn

This notebook will teach you to:

* Perform **standard spatial joins** — identifying features within other geometries
* Execute **nearest neighbor joins** — finding the closest feature between datasets
* Calculate **zonal statistics** — summarizing values within geographic zones
* Apply optimization techniques like spatial partitioning with GeoHashes
* Visualize join results using interactive tools

> Spatial joins are a core operation in geospatial analysis, allowing you to merge datasets based on how their features relate in space.

This notebook focuses on practical workflows and scalable processing with Wherobots and Apache Sedona.


In [None]:
from sedona.spark import SedonaContext
from pyspark.sql.functions import expr

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# 📀 Load Spatial Data

Next, we load two spatial datasets from the `wherobots_open_data` catalog:
- **Polygons:** Administrative boundaries (US localities from the Overture Maps divisions data).
- **Points:** Facility locations (Foursquare places).

```python
# Load the polygon dataset (administrative boundaries) using a Spatial SQL Query
# Using sedona.sql, create a dataframe from the query
query = '''
SELECT 
    * 
FROM
    wherobots_open_data.overture_maps_foundation.divisions_division_area
WHERE
    subtype = 'locality'
    AND country = 'US'
'''

polygons_df = sedona.sql(query)
# (Alternatively, load from a file with sedona.read.format("geoparquet") if necessary)

# Load the points dataset (facilities) from the Foursquare places table
points_df = sedona.table("wherobots_open_data.foursquare.places")
# (Alternatively, load from a file with sedona.read.format("geoparquet"))

# Display a sample of the polygon dataset
print("🔹 Sample of the Polygon Dataset (Administrative Boundaries):")
polygons_df.show(5, truncate=False)

# Display a sample of the points dataset
print("🔹 Sample of the Points Dataset (Facilities):")
points_df.show(5, truncate=False)
```

*Detailed Explanation:*  
- We use `sedona.sql` to load the polygons with a Spatial SQL query and `sedona.table` to load the points directly from the Wherobots catalog.
- Two DataFrames are created: one for polygons and one for points.
- We then display the first five rows of each dataset to verify the contents. This helps ensure our data is loaded correctly and gives a preview of the schema.

In [None]:
query = '''
SELECT 
    * 
FROM
    wherobots_open_data.overture_maps_foundation.divisions_division_area
WHERE
    subtype = 'locality'
    AND country = 'US'
'''

In [None]:
polygons_df = sedona.sql(query)

In [None]:
points_df = sedona.table("wherobots_open_data.foursquare.places")

In [None]:
print("🔹 Sample of the Polygon Dataset (Administrative Boundaries):")
polygons_df.show(5)

In [None]:
print("🔹 Sample of the Points Dataset (Facilities):")
points_df.show(5, truncate=False)

# 🤝🏼 Standard Spatial Join (Pythonic Approach)

In a standard spatial join, we want to link points (facilities) with the polygons (administrative boundaries) that contain them. We use spatial predicates like `ST_Intersects`.

```python
# Alias the DataFrames for clarity
facilities = points_df.alias("f")
admin_boundaries = polygons_df.alias("poly")

# Perform a spatial join:
# Join the facilities and admin_boundaries DataFrames where the facility geometry
# intersects with the polygon geometry using the ST_Intersects predicate.
spatial_join_df = facilities.join(
    admin_boundaries,
    expr("ST_Intersects(poly.geometry, f.geom)")
)

# Show a few rows of the spatial join result
print("🔹 Standard Spatial Join Results (Facilities within Administrative Boundaries):")
spatial_join_df.show(10, truncate=False)
```

*Detailed Explanation:*  
- We alias the points and polygons DataFrames as "f" and "poly" for easier reference.
- The join condition uses the `ST_Intersects` function, which returns `true` if a point lies within (or touches) a polygon.
- The join operation returns combined rows from both DataFrames where the condition is met.
- We display the first 10 rows to inspect the join result.
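
To make the join semantics concrete, here is a minimal pure-Python sketch of what the join computes logically: every (facility, boundary) pair for which the point falls inside the polygon. The helper names (`point_in_polygon`, `naive_spatial_join`) and the sample data are hypothetical, and the ray-casting test does not treat points lying exactly on a boundary consistently. Sedona does not evaluate the join with a nested loop like this; the sketch only shows the result an inner spatial join is expected to produce.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring.

    `ring` is a list of (x, y) vertices; the last vertex may repeat the first.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def naive_spatial_join(points, polygons):
    """Return (point_id, polygon_id) pairs, like an inner join on the predicate."""
    return [
        (point_id, polygon_id)
        for point_id, (x, y) in points.items()
        for polygon_id, ring in polygons.items()
        if point_in_polygon(x, y, ring)
    ]


# Hypothetical sample data in planar coordinates
polygons = {
    "west": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "east": [(2, 0), (4, 0), (4, 2), (2, 2), (2, 0)],
}
points = {"cafe": (1.0, 1.0), "gym": (3.0, 0.5), "offshore": (9.0, 9.0)}

pairs = naive_spatial_join(points, polygons)
print(pairs)
```

The point `offshore` matches no polygon and therefore does not appear in the output, which mirrors how an inner join drops unmatched rows.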


In [None]:
facilities = points_df.alias("f")
admin_boundaries = polygons_df.alias("poly")

In [None]:
spatial_join_df = facilities.join(
    admin_boundaries,
    expr("ST_Intersects(poly.geometry, f.geom)")
)

In [None]:
%%time
spatial_join_df.count()

In [None]:
print("🔹 Standard Spatial Join Results (Facilities within Administrative Boundaries):")
spatial_join_df.show(1)

# 🔢 Efficiently Counting Points Within Each Polygon

In this step, we combine the spatial join and aggregation into one efficient operation. By directly applying the spatial predicate (`ST_Intersects`) during the join and then aggregating (grouping by polygon ID) to count the points, we allow Wherobots to optimize the query. This minimizes data shuffling and processing by filtering data at the source (e.g., using GeoParquet spatial filter pushdown). This method is particularly beneficial when working with large datasets.

```python
# Efficiently count the number of facilities (points) that fall inside each polygon.
# This approach directly aggregates the data after filtering with the spatial predicate.
points_count_efficient_df = polygons_df.alias("poly") \
    .join(points_df.alias("f"), expr("ST_Intersects(poly.geometry, f.geom)")) \
    .groupBy("poly.id") \
    .agg(expr("COUNT(*) as point_count"))

# Display the aggregated result
print("🔹 Efficient Count of Points in Each Polygon:")
points_count_efficient_df.show(10, truncate=False)
```

*Detailed Explanation:*  
- **Spatial Predicate in the Join:** Using `ST_Intersects` directly in the join condition lets the engine plan the operation as a spatial join instead of comparing every polygon with every point. With spatially optimized formats such as GeoParquet, spatial filters can also be pushed down to the data source so that less data is read. 🚀  
- **Single-step Aggregation:** We immediately group the joined result by the polygon's identifier (`poly.id`) and use the `COUNT(*)` aggregate function to determine how many points fall within each polygon. Because Spark evaluates DataFrames lazily, the join and the count run as one query plan, so the joined rows do not have to be collected or written out before counting.  
- **Performance Gains:** Combining filtering and aggregation reduces unnecessary data movement and computation, making the operation much more efficient on large datasets.

This method is a best practice when dealing with spatial queries in environments like Wherobots that are optimized for spatial predicates. Enjoy the performance improvements and cleaner code! 😊
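
One consequence of the inner join is worth keeping in mind: polygons that contain no points produce no joined rows, so they are missing from the counts rather than listed with a count of zero. The sketch below mimics this in plain Python on hypothetical (point, polygon) match pairs. It illustrates the grouping logic only and says nothing about how the engine executes the query.

```python
from collections import Counter

# Hypothetical output of a spatial join: one (point_id, polygon_id) row per match.
matches = [("p1", "A"), ("p2", "A"), ("p3", "B")]
all_polygon_ids = ["A", "B", "C"]  # polygon "C" contains no points

# Equivalent of groupBy("poly.id") with COUNT(*) on the inner-joined rows.
inner_counts = Counter(polygon_id for _, polygon_id in matches)

# Equivalent of starting from all polygons and counting matched points.
left_counts = {polygon_id: inner_counts.get(polygon_id, 0) for polygon_id in all_polygon_ids}

print(dict(inner_counts))  # {'A': 2, 'B': 1}
print(left_counts)         # {'A': 2, 'B': 1, 'C': 0}
```

If zero-count polygons matter for a map or report, the usual SQL pattern is a left join from the polygons with `COUNT(f.geom)` in place of `COUNT(*)`, since counting a column skips the nulls produced for unmatched polygons. Whether a left outer spatial join is optimized the same way as the inner join is not covered in this notebook.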

In [None]:
points_count_efficient_df = polygons_df.alias("poly") \
    .join(points_df.alias("f"), expr("ST_Intersects(poly.geometry, f.geom)")) \
    .groupBy("poly.id") \
    .agg(expr("COUNT(*) as point_count"))

In [None]:
print("🔹 Efficient Count of Points in Each Polygon:")
points_count_efficient_df.show(10)

# 🏘️ Nearest Neighbor Join

The nearest neighbor join finds, for each facility, the closest centroids of administrative areas (the four closest in this example). This can be useful for determining the nearest center point or service area.

```python
# Compute centroids for each polygon to represent the center of each administrative area.
centroids_df = polygons_df.selectExpr("id", "ST_Centroid(geometry) as centroid")

# Display a few centroid records
print("🔹 Centroids of Administrative Boundaries:")
centroids_df.show(5, truncate=False)
```

*Detailed Explanation:*  
- We create a new DataFrame `centroids_df` by selecting the `id` and computing the centroid of each polygon using the `ST_Centroid` function.
- These centroids will later serve as reference points for our nearest neighbor calculation.

In [None]:
centroids_df = polygons_df.selectExpr("id", "ST_Centroid(geometry) as centroid")

In [None]:
print("🔹 Centroids of Administrative Boundaries:")
centroids_df.show(1)

In this approach, we use the ST_AKNN function to directly obtain the k nearest neighbors for each query geometry. The function signature is:  

```
ST_AKNN(query_geom, object_geom, k, include_ties)
```

In our example, we assume the following:  
- **Queries DataFrame:** Our facilities DataFrame (`points_df`) represents the query geometries.  
- **Objects DataFrame:** Our centroids DataFrame (`centroids_df`), which was created earlier by computing the centroid of each polygon, represents the object geometries.

The SQL equivalent of our operation is:  

```
SELECT
    QUERIES.ID AS QUERY_ID,
    QUERIES.GEOMETRY AS QUERIES_GEOM,
    OBJECTS.GEOMETRY AS OBJECTS_GEOM
FROM QUERIES JOIN OBJECTS ON ST_AKNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 4, FALSE)
```

Below is the Pythonic implementation:

```python
# Use ST_AKNN to perform an approximate k-nearest neighbor join between the queries and objects.
# In our example, we join the facilities (points_df) with the centroids (centroids_df),
# returning the four nearest centroids for each facility. The "false" parameter indicates that ties are not included.
aknn_df = points_df.alias("q").join(
    centroids_df.alias("o"),
    expr("ST_AKNN(q.geom, o.centroid, 4, false)")
)

# Select and rename the columns for clarity.
# Here, we select the query's id and geometry as well as the object's geometry.
aknn_result_df = aknn_df.select(
    expr("q.fsq_place_id as query_id"),
    expr("q.geom as query_geom"),
    expr("o.centroid as object_geom")
)

# Display the result of the nearest neighbor join using ST_AKNN.
print("🔹 Nearest Neighbor Join using ST_AKNN:")
aknn_result_df.show(10, truncate=False)
```

*Detailed Explanation:*  
- **Purpose:**  
  This code uses the `ST_AKNN` function to find the four closest (k = 4) object geometries (here, centroids) for each query geometry (facilities). `ST_AKNN` performs an approximate k-nearest-neighbor join, so the engine can plan it as a dedicated join instead of comparing every facility with every centroid.
  
- **Process:**  
  1. **Aliasing:**  
     We alias `points_df` as `"q"` (representing the queries) and `centroids_df` as `"o"` (representing the objects) for easier reference.  
  2. **Joining with ST_AKNN:**  
     The join condition `expr("ST_AKNN(q.geom, o.centroid, 4, false)")` applies the ST_AKNN function to determine whether a given object is among the four nearest neighbors of a query geometry.  
  3. **Column Selection:**  
     After joining, we select and rename columns to clearly indicate the query ID, the query geometry, and the object geometry (centroid) for each match.  
  4. **Display:**  
     Finally, we display the first 10 rows of the result. No ordering is applied, so these are not a ranked top 10; they simply show which centroids were matched to each facility.

- **Efficiency:**  
  By using `ST_AKNN`, the engine performs an optimized nearest neighbor search without the need for an expensive cross join or manual windowing. This is especially beneficial when working with large datasets where performance is critical.

This approach provides a clean, efficient, and Pythonic solution for nearest neighbor joins using Wherobots and Apache Sedona. Enjoy the streamlined spatial analysis!
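
For intuition about the shape of the result, here is a minimal brute-force sketch of an exact k-nearest-neighbor join in plain Python. The function name `k_nearest` and the sample coordinates are hypothetical, distances are planar, and every query is compared with every object, which is exactly the cost `ST_AKNN` is meant to avoid. Because `ST_AKNN` is approximate, its matches may differ from an exact search like this one.

```python
import heapq
import math


def k_nearest(queries, objects, k):
    """Exact brute-force kNN: for each query, the k closest objects by planar distance."""
    rows = []
    for query_id, (qx, qy) in queries.items():
        nearest = heapq.nsmallest(
            k,
            objects.items(),
            key=lambda item: math.hypot(item[1][0] - qx, item[1][1] - qy),
        )
        # One output row per (query, neighbor) pair, closest neighbor first
        rows.extend((query_id, object_id) for object_id, _ in nearest)
    return rows


# Hypothetical facilities (queries) and centroids (objects) in planar coordinates
queries = {"q1": (0.0, 0.0), "q2": (10.0, 0.0)}
objects = {"c1": (1.0, 0.0), "c2": (2.0, 0.0), "c3": (9.0, 0.0), "c4": (20.0, 0.0)}

result = k_nearest(queries, objects, k=2)
print(result)
```

The output contains k rows per query, which is also the row count to expect from the `ST_AKNN` join when ties are excluded and at least k objects exist.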

In [None]:
aknn_df = points_df.alias("q").join(
    centroids_df.alias("o"),
    expr("ST_AKNN(q.geom, o.centroid, 4, false)")
)

In [None]:
aknn_result_df = aknn_df.select(
    expr("q.fsq_place_id as query_id"),
    expr("q.geom as query_geom"),
    expr("o.centroid as object_geom")
)

In [None]:
print("🔹 Nearest Neighbor Join using ST_AKNN:")
aknn_result_df.show(10, truncate=False)

# 🦾 Advanced Optimization Techniques

Optimizing spatial operations is critical for performance, especially with large datasets. One common strategy is to cluster the data by a spatial key, such as a geohash, so that nearby features are stored together. This improves data locality and can reduce the amount of data read and shuffled during joins.

## Cluster Data Using Geohash

```python
from pyspark.sql.functions import col

# Add a geohash column to each DataFrame with a precision of 6.
# The point geometry column is `geom`; the polygon geometry column is `geometry`.
points_df = points_df.withColumn("geohash", expr("ST_GeoHash(geom, 6)"))
polygons_df = polygons_df.withColumn("geohash", expr("ST_GeoHash(geometry, 6)"))

# Sort each DataFrame by geohash so that nearby features are stored together,
# then drop the helper column.
sorted_points = points_df.sort(col("geohash"))\
    .drop("geohash")

sorted_polys = polygons_df.sort(col("geohash"))\
    .drop("geohash")

print("🔹 DataFrames clustered by geohash for improved spatial join performance!")
```

*Detailed Explanation:*  
- The `ST_GeoHash` function converts each geometry into a geohash string. The precision parameter (here, 6) sets the length of the string and therefore the spatial resolution.
- Sorting by geohash before writing places spatially proximate features next to each other in storage, which can speed up join operations that read the data back.
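
To show why sorting on a geohash clusters nearby features, here is a minimal sketch of the standard geohash encoding for a single point, written in plain Python. The function name `geohash_encode` is hypothetical and this is not Sedona's implementation; in particular, how `ST_GeoHash` handles non-point geometries is not shown here. The sample coordinate is the commonly cited example point (57.64911, 10.40744), whose geohash begins with `u4pru`.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, 5)
fine = geohash_encode(57.64911, 10.40744, 6)
print(coarse, fine)
```

A longer geohash always extends the shorter one for the same point, so features that fall in the same cell share a prefix and end up adjacent when the data is sorted on the geohash string.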

In [None]:
%%time
spatial_join_df.count()

In [None]:
database_name = 'joins'
sedona.sql(f"CREATE DATABASE IF NOT EXISTS wherobots.{database_name}")

In [None]:
points_df = points_df.withColumn("geohash", expr("ST_GeoHash(geom, 6)"))
polygons_df = polygons_df.withColumn("geohash", expr("ST_GeoHash(geometry, 6)"))

In [None]:
from pyspark.sql.functions import col

In [None]:
sorted_points = points_df.sort(col("geohash"))\
    .drop("geohash")

sorted_polys = polygons_df.sort(col("geohash"))\
    .drop("geohash")

sorted_points.writeTo(f"wherobots.{database_name}.points").createOrReplace()
sorted_polys.writeTo(f"wherobots.{database_name}.polygons").createOrReplace()

print("🔹 DataFrames sorted by geohash for improved spatial join performance!")

In [None]:
# Alias the DataFrames for clarity
facilities = sedona.table(f"wherobots.{database_name}.points").alias("f")
admin_boundaries = sedona.table(f"wherobots.{database_name}.polygons").alias("poly")

In [None]:
spatial_join_df_partition = facilities.join(
    admin_boundaries,
    expr("ST_Intersects(poly.geometry, f.geom)")
)

In [None]:
%%time
spatial_join_df_partition.count()

# 💅🏼 Visualizing Spatial Join Results with SedonaKepler

Wherobots offers interactive visualization tools to help you explore your spatial data. We will use SedonaKepler and SedonaPyDeck to visualize our spatial join results and per-polygon point counts.


```python
# Import SedonaKepler for interactive mapping
from sedona.maps.SedonaKepler import SedonaKepler

# Create an interactive map from the spatial join DataFrame.
# The map will show facilities along with the administrative boundaries they fall within.
kepler_map = SedonaKepler.create_map(df=spatial_join_df, name="Facilities_Within_Zones")

# Display the interactive map by making it the last expression in a notebook cell
kepler_map
```

*Detailed Explanation:*  
- SedonaKepler integrates with KeplerGl to provide interactive spatial visualizations.
- The `create_map` function takes the spatial join DataFrame and renders an interactive map.
- This is especially useful for exploring the spatial relationships between facilities and boundaries visually.
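
The cells that follow restrict the join result to a bounding box, written as a WKT polygon, before mapping it, presumably to keep the interactive map responsive. As a small sketch, a WKT string of that form can be built from the box corners with a helper like the one below. The function name `bbox_to_wkt` is hypothetical, and the coordinates are the ones used in this notebook.

```python
def bbox_to_wkt(min_x, min_y, max_x, max_y):
    """Return a closed WKT POLYGON ring for an axis-aligned bounding box."""
    corners = [
        (min_x, max_y),  # top-left
        (max_x, max_y),  # top-right
        (max_x, min_y),  # bottom-right
        (min_x, min_y),  # bottom-left
        (min_x, max_y),  # repeat the first vertex to close the ring
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"


wkt_polygon = bbox_to_wkt(-84.656729, 33.562116, -84.109483, 33.983118)
print(wkt_polygon)
```

WKT coordinates are written as `x y`, which for geographic data means longitude first and latitude second.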


In [None]:
# Define the WKT polygon as a string
wkt_polygon = "POLYGON((-84.656729 33.983118, -84.109483 33.983118, -84.109483 33.562116, -84.656729 33.562116, -84.656729 33.983118))"

In [None]:
detailed_facilities_df = spatial_join_df.select(
    "f.fsq_place_id",    # Unique facility identifier
    "f.name",            # Facility name
    "f.address",         # Facility address
    "f.locality",        # Locality information
    "f.region",          # Region name
    "f.postcode",        # Postal code
    "f.admin_region",    # Administrative region
    "f.post_town",       # Post town
    "f.country",         # Country name
    "f.geom",            # Facility geometry
    "poly.names"         # Additional name information
).filter(
    expr(f"ST_Intersects(geometry, ST_GeomFromText('{wkt_polygon}'))")
).selectExpr("*", "names.primary") \
.drop("names")

# Count the filtered rows; this action triggers the join and the bounding-box filter
print("🔹 Number of detailed facility rows from spatial_join_df:")
detailed_facilities_df.count()

In [None]:
from sedona.maps.SedonaKepler import SedonaKepler

# Create an interactive map from the spatial join DataFrame.
# The map will show facilities along with the administrative boundaries they fall within.
kepler_map = SedonaKepler.create_map(df=detailed_facilities_df, name="Facilities_Within_Zones")

In [None]:
kepler_map

# 🖥️ Creating a Choropleth Map for Point in Polygon Join with SedonaPyDeck


```python
# Import SedonaPyDeck for creating choropleth maps
from sedona.maps.SedonaPyDeck import SedonaPyDeck

# Create a choropleth map from the per-polygon point counts.
# The DataFrame must keep the polygon geometry alongside the count,
# so group by both "poly.id" and "poly.geometry" when aggregating.
choropleth_map = SedonaPyDeck.create_choropleth_map(
    df=points_count_efficient_df,
    plot_col="point_count"  # This column drives the color intensity
)

# Display the choropleth map in your Jupyter Notebook
choropleth_map.show()
```

*Detailed Explanation:*  
- SedonaPyDeck leverages the pydeck library to create visually appealing maps.
- By passing `points_count_efficient_df` and specifying `plot_col="point_count"`, a choropleth map is created where the color intensity of each zone corresponds to the number of points it contains.
- This helps to quickly identify areas with many or few facilities.


In [None]:
points_count_efficient_df = polygons_df.alias("poly") \
    .filter(
        expr(f"ST_Intersects(geometry, ST_GeomFromText('{wkt_polygon}'))")
    ) \
    .join(points_df.alias("f"), expr("ST_Intersects(poly.geometry, f.geom)")) \
    .groupBy("poly.id", "poly.geometry") \
    .agg(expr("COUNT(*) as point_count"))

In [None]:
points_count_efficient_df.count()

In [None]:
from sedona.maps.SedonaPyDeck import SedonaPyDeck

# Create a choropleth map from the per-polygon point counts.
# The zones are colored based on the 'point_count' column, highlighting variations across regions.

choropleth_map = SedonaPyDeck.create_choropleth_map(
    df=points_count_efficient_df,
    plot_col="point_count"  # This column drives the color intensity
)

# Display the choropleth map in your Jupyter Notebook
choropleth_map.show()

# 🎁 Conclusion and Summary

### Summary of Key Steps:
- **Environment Setup:** We initialized Spark and Sedona for spatial processing.
- **Data Loading:** We loaded two spatial datasets—administrative boundaries (polygons) and facilities (points).
- **Standard Spatial Join:** We performed a join using the `ST_Intersects` predicate to link facilities with their containing administrative boundaries.
- **Nearest Neighbor Join:** We computed centroids for administrative areas and then used `ST_AKNN` to find the four nearest centroids for each facility.
- **Optimization:** We sorted both datasets by geohash, wrote them to tables, and reran the join on the clustered data.
- **Visualization:** We created interactive maps using SedonaKepler and SedonaPyDeck to visualize spatial join results and per-polygon point counts.

### Final Thoughts:
This notebook provides a detailed, Pythonic approach to handling spatial joins and related spatial operations in Wherobots using Apache Sedona. By leveraging Python’s DataFrame API, we maintain clean and readable code that is easy to modify and extend. Happy spatial data processing! 😊

For additional details and further learning:
- Check out the [Wherobots Documentation](https://docs.wherobots.com) for advanced topics.
- Visit the [Apache Sedona GitHub Repository](https://github.com/apache/sedona) for source code and examples.