![](https://wherobots.com/wp-content/uploads/2023/12/Inline-Blue_Black_onWhite@3x.png)

# Introduction to `KNN Join` for WherobotsDB

In this notebook we will demonstrate how to perform k-Nearest Neighbors (kNN) joins in WherobotsDB.


A geospatial k-Nearest Neighbors (kNN) join is a kNN join over geospatial data. For each spatial point or region in one dataset, it identifies the k nearest neighbors in another dataset by geographic proximity, typically using spatial coordinates and a suitable distance metric such as Euclidean or great-circle distance.
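
To make the definition concrete, here is a minimal brute-force sketch in plain Python. It is not how WherobotsDB evaluates the join; the function names, the sample points, and the Earth radius constant are assumptions chosen for illustration. It compares every query with every object using the haversine (great-circle) distance and keeps the k closest objects per query.

```python
import heapq
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def knn_join(queries, objects, k):
    """Return {query_id: [(distance_m, object_id), ...]} with the k closest objects per query."""
    return {
        qid: heapq.nsmallest(
            k, ((haversine_m(qpt, opt), oid) for oid, opt in objects.items())
        )
        for qid, qpt in queries.items()
    }


# Hypothetical sample data: (lon, lat) pairs
queries = {"q1": (-105.0, 42.0), "q2": (-110.0, 44.0)}
objects = {"a": (-105.1, 42.0), "b": (-109.9, 44.1), "c": (-107.5, 43.0), "d": (-104.9, 41.9)}

pairs = knn_join(queries, objects, k=2)
for qid, neighbors in pairs.items():
    print(qid, [(oid, round(d)) for d, oid in neighbors])
```

The result always has `len(queries) * k` rows when there are at least k objects. The brute-force cost grows with the product of the two table sizes, which is why the partitioned algorithms described next matter at scale.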

**Approximate kNN Join**

The Z-order based approximate algorithm uses Z-order (Morton order) encoding to process k-nearest neighbors (kNN) joins efficiently in spatial databases. The encoding maps multidimensional data to one dimension while preserving locality to a certain extent, so objects that are close in space tend to be close in Z-order. Because locality is only partially preserved, the join result is approximate.
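
The idea behind Z-order encoding can be sketched in a few lines of plain Python. This is an illustrative toy and not the WherobotsDB implementation: the grid resolution, the `window` parameter, and all function names are assumptions. Coordinates are quantized onto a grid, the bits of the two grid indices are interleaved into one integer, and candidate neighbors are taken from the objects whose codes sit next to the query's code in sorted order.

```python
import bisect


def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x fills the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y fills the odd bit positions
    return code


def z_value(lon, lat, bits=16):
    """Quantize a lon/lat pair onto a 2^bits by 2^bits grid and return its Z-order code."""
    max_cell = (1 << bits) - 1
    gx = int((lon + 180.0) / 360.0 * max_cell)
    gy = int((lat + 90.0) / 180.0 * max_cell)
    return interleave_bits(gx, gy, bits)


def approx_knn(query, objects, k, window=4, bits=16):
    """Approximate kNN: rank only the objects whose Z-values are adjacent to the query's."""
    ranked = sorted(objects, key=lambda p: z_value(p[0], p[1], bits))
    keys = [z_value(p[0], p[1], bits) for p in ranked]
    pos = bisect.bisect_left(keys, z_value(query[0], query[1], bits))
    candidates = ranked[max(0, pos - window): pos + window]
    # Re-rank the small candidate set by squared planar distance in degrees
    candidates.sort(key=lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2)
    return candidates[:k]


# Hypothetical sample data: (lon, lat) pairs
objects = [(-105.1, 42.0), (-109.9, 44.1), (-107.5, 43.0), (-104.9, 41.9)]
nearest = approx_knn((-105.0, 42.0), objects, k=2)
print(nearest)
```

Two points that are neighbors in space can land far apart on the Z-curve, so a window that is too small may miss true neighbors. That trade-off is what makes the method approximate.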
    

**Exact kNN Join**

The exact method starts from a quad-tree partitioning strategy. It partitions the dataset $R$ into balanced partitions using a quad-tree, which preserves spatial locality. It then builds an R-tree over a random sample of the other dataset $S$ and uses distance bounds to keep the local kNN joins efficient. By calculating distance bounds and running circle range queries, the method ensures that each subset $S_i$ contains all the points needed for accurate kNN results. The union of the local join results is the complete kNN join result for the datasets $R$ and $S$.
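
The role of the distance bounds can be shown with a simplified sketch. It makes several assumptions that differ from the method described above: it handles a single partition of $R$, replaces the quad-tree and R-tree with linear scans, uses planar Euclidean distance, and all names are hypothetical. The key fact is that a sample is a subset of $S$, so the distance from a point $r$ to its k-th nearest sample point is an upper bound on the distance to its true k-th nearest neighbor. Every point of $S$ inside that circle goes into $S_i$, and a local kNN over $S_i$ then returns the exact answer.

```python
import heapq
import math
import random


def dist(a, b):
    """Planar Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def knn(point, candidates, k):
    """The k candidates closest to `point`, nearest first."""
    return heapq.nsmallest(k, candidates, key=lambda s: dist(point, s))


def local_exact_knn(r_partition, s_all, k, sample_size, seed=0):
    """Exact kNN for one partition of R, using a sample of S to bound the search radius."""
    sample = random.Random(seed).sample(s_all, sample_size)
    # Upper bound per r: distance to its k-th nearest neighbor within the sample
    bounds = {r: dist(r, knn(r, sample, k)[-1]) for r in r_partition}
    # Circle range queries: keep every s that falls inside at least one bound circle
    s_i = [s for s in s_all if any(dist(r, s) <= bounds[r] for r in r_partition)]
    return {r: knn(r, s_i, k) for r in r_partition}, s_i


# Hypothetical data: S spread over the unit square, one partition of R near its centre
rng = random.Random(42)
s_all = [(rng.random(), rng.random()) for _ in range(500)]
r_part = [(rng.uniform(0.4, 0.6), rng.uniform(0.4, 0.6)) for _ in range(20)]

local, s_i = local_exact_knn(r_part, s_all, k=3, sample_size=50)
print(len(s_i), "of", len(s_all), "points of S are needed for this partition")
```

In a distributed setting only $S_i$ has to be shipped to the worker that holds the partition, which is where the savings come from.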


## Initial Configuration

Runtime Requirement: large 48x 2U, 294 cores

In [1]:
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)


                                                                                

## Define Inputs
 - Queries: This table contains the objects for which you want to find the nearest neighbors.
 - Objects: This table contains the objects that are potential neighbors to the objects in the Queries table.

## State Boundary

### Pick a state or other boundary
    
[Click here for boundaries of other states](https://gist.github.com/JoshuaCarroll/49630cbeeb254a49986e939a26672e9c)


In [2]:
# California boundary
# spatial_filter = "POLYGON((-124.4009 41.9983,-123.6237 42.0024,-123.1526 42.0126,-122.0073 42.0075,-121.2369 41.9962,-119.9982 41.9983,-120.0037 39.0021,-117.9575 37.5555,-116.3699 36.3594,-114.6368 35.0075,-114.6382 34.9659,-114.6286 34.9107,-114.6382 34.8758,-114.5970 34.8454,-114.5682 34.7890,-114.4968 34.7269,-114.4501 34.6648,-114.4597 34.6581,-114.4322 34.5869,-114.3787 34.5235,-114.3869 34.4601,-114.3361 34.4500,-114.3031 34.4375,-114.2674 34.4024,-114.1864 34.3559,-114.1383 34.3049,-114.1315 34.2561,-114.1651 34.2595,-114.2249 34.2044,-114.2221 34.1914,-114.2908 34.1720,-114.3237 34.1368,-114.3622 34.1186,-114.4089 34.1118,-114.4363 34.0856,-114.4336 34.0276,-114.4652 34.0117,-114.5119 33.9582,-114.5366 33.9308,-114.5091 33.9058,-114.5256 33.8613,-114.5215 33.8248,-114.5050 33.7597,-114.4940 33.7083,-114.5284 33.6832,-114.5242 33.6363,-114.5393 33.5895,-114.5242 33.5528,-114.5586 33.5311,-114.5778 33.5070,-114.6245 33.4418,-114.6506 33.4142,-114.7055 33.4039,-114.6973 33.3546,-114.7302 33.3041,-114.7206 33.2858,-114.6808 33.2754,-114.6698 33.2582,-114.6904 33.2467,-114.6794 33.1720,-114.7083 33.0904,-114.6918 33.0858,-114.6629 33.0328,-114.6451 33.0501,-114.6286 33.0305,-114.5888 33.0282,-114.5750 33.0351,-114.5174 33.0328,-114.4913 32.9718,-114.4775 32.9764,-114.4844 32.9372,-114.4679 32.8427,-114.5091 32.8161,-114.5311 32.7850,-114.5284 32.7573,-114.5641 32.7503,-114.6162 32.7353,-114.6986 32.7480,-114.7220 32.7191,-115.1944 32.6868,-117.3395 32.5121,-117.4823 32.7838,-117.5977 33.0501,-117.6814 33.2341,-118.0591 33.4578,-118.6290 33.5403,-118.7073 33.7928,-119.3706 33.9582,-120.0050 34.1925,-120.7164 34.2561,-120.9128 34.5360,-120.8427 34.9749,-121.1325 35.2131,-121.3220 35.5255,-121.8013 35.9691,-122.1446 36.2808,-122.1721 36.7268,-122.6871 37.2227,-122.8903 37.7783,-123.2378 37.8965,-123.3202 38.3449,-123.8338 38.7423,-123.9793 38.9946,-124.0329 39.3088,-124.0823 39.7642,-124.5314 40.1663,-124.6509 40.4658,-124.3144 41.0110,-124.3419 41.2386,-124.4545 41.7170,-124.4009 41.9983,-124.4009 41.9983))"

# Wyoming state boundary
spatial_filter = "POLYGON((-104.0556 41.0037,-104.0584 44.9949,-111.0539 44.9998,-111.0457 40.9986,-104.0556 41.0006,-104.0556 41.0037))"
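
The boundary string is later passed to `ST_GeomFromWKT` and used with `ST_Contains` to keep only the rows inside the polygon. As a rough illustration of what such a containment test does, the plain-Python sketch below parses a single-ring WKT polygon and applies a planar ray-casting test. The helper names and the two test cities are assumptions for illustration, and the sketch does not reflect how the spatial engine implements the predicate.

```python
def parse_wkt_polygon(wkt):
    """Parse a single-ring 'POLYGON((x y,x y,...))' string into a list of (x, y) floats."""
    body = wkt.strip()[len("POLYGON(("):-2]
    return [tuple(float(v) for v in pair.split()) for pair in body.split(",")]


def contains(ring, x, y):
    """Planar ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


wyoming = parse_wkt_polygon(
    "POLYGON((-104.0556 41.0037,-104.0584 44.9949,-111.0539 44.9998,"
    "-111.0457 40.9986,-104.0556 41.0006,-104.0556 41.0037))"
)

print(contains(wyoming, -104.8202, 41.1400))  # approximate location of Cheyenne, WY
print(contains(wyoming, -104.9903, 39.7392))  # approximate location of Denver, CO
```

The ring must be closed (first and last vertex identical), as it is in both boundary strings above, for the edge loop to cover every side of the polygon.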


### Queries Table: Weather Events

This table contains the objects for which you want to find the nearest neighbors.

In [3]:
from pyspark.sql.functions import col, monotonically_increasing_id

# load data
df_queries = sedona.table("wherobots_pro_data.weather.weather_events")
df_queries = df_queries.withColumn("id", monotonically_increasing_id())
df_queries = df_queries.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")


df_queries = df_queries.repartition(100)

df_queries.cache()

df_queries.createOrReplaceTempView("queries")

print(df_queries.rdd.getNumPartitions())
print(df_queries.count())


100




186657


                                                                                

### Objects Table: Flights

This table contains the objects that are potential neighbors to the objects in the Queries table.


In [4]:
# Load objects table
df_objects = sedona.read.format("geoparquet").load("s3a://wherobots-examples/data/examples/flights/2024_s2.parquet")
df_objects = df_objects.filter("ST_Contains(ST_GeomFromWKT('"+spatial_filter+"'), geometry) = true")
df_objects = df_objects.repartition(800)

df_objects.cache()

df_objects.createOrReplaceTempView("objects")

print(df_objects.rdd.getNumPartitions())
print(df_objects.count())


                                                                                

800




1431091


                                                                                

## KNN Join

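The spatial SQL below demonstrates the new SQL syntax for performing KNN joins on the Wherobots platform. In the join condition, `ST_KNN` takes the query geometry, the object geometry, the number of neighbors `k` (here 4), and a boolean flag that selects spheroid distance (here `FALSE`).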

In [5]:
%%time

aknn_join_df = sedona.sql("""
SELECT
    QUERIES.GEOMETRY AS QUERIES_GEOM,
    QUERIES.ID AS QID,
    OBJECTS.GEOMETRY AS OBJECTS_GEOM,
    ST_DISTANCESPHERE(QUERIES.GEOMETRY, OBJECTS.GEOMETRY) AS DISTANCE,
    ST_MAKELINE(QUERIES.GEOMETRY, OBJECTS.GEOMETRY) AS LINE
FROM QUERIES
JOIN OBJECTS ON ST_KNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 4, FALSE)
""")

# cache for further queries and visualization
aknn_join_df.cache()

total_count = aknn_join_df.count()
print(total_count)




746628
CPU times: user 17.6 ms, sys: 2.3 ms, total: 19.9 ms
Wall time: 9.29 s


                                                                                

In [23]:
# Keep one row per distinct query geometry
unique_qid_df = aknn_join_df.dropDuplicates(["QUERIES_GEOM"])

# Inner join to get all rows from aknn_join_df whose QID appears in unique_qid_df
related_rows_df = aknn_join_df.join(unique_qid_df, on="QID", how="inner").select(aknn_join_df["*"])

unique_qid_df.cache()
related_rows_df.cache()

related_rows_df.count()

                                                                                

220
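
The effect of the de-duplication and join-back step can be mimicked in plain Python. This is a toy sketch with made-up rows and a hypothetical helper name. It also keeps the first QID seen per geometry, whereas Spark's `dropDuplicates` keeps an arbitrary one. The point it illustrates is that queries sharing a location collapse to a single QID, and the join back then returns k rows for each remaining QID.

```python
def rows_for_unique_locations(join_rows):
    """Keep the join rows of one QID per distinct QUERIES_GEOM value."""
    first_qid_per_geom = {}
    for row in join_rows:
        # Remember only the first QID encountered for each geometry
        first_qid_per_geom.setdefault(row["QUERIES_GEOM"], row["QID"])
    kept_qids = set(first_qid_per_geom.values())
    return [row for row in join_rows if row["QID"] in kept_qids]


# Made-up kNN join output: three queries, two of them at the same location, k = 2
query_geoms = {0: "POINT (-105 42)", 1: "POINT (-105 42)", 2: "POINT (-110 44)"}
neighbor_geoms = ["POINT (-105.1 42)", "POINT (-109.9 44.1)"]
join_rows = [
    {"QID": qid, "QUERIES_GEOM": geom, "OBJECTS_GEOM": obj}
    for qid, geom in query_geoms.items()
    for obj in neighbor_geoms
]

related = rows_for_unique_locations(join_rows)
print(len(join_rows), "rows before,", len(related), "rows after")
```

Here two distinct locations with k = 2 leave four rows, which keeps the map below readable when many events share the same coordinates.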

In [24]:
# create map for the results
map_view = SedonaKepler.create_map(unique_qid_df.select('QUERIES_GEOM'), name="WEATHER EVENTS")
SedonaKepler.add_df(map_view, df=related_rows_df.select('OBJECTS_GEOM', 'DISTANCE').withColumnRenamed("OBJECTS_GEOM", "geometry"), name="FLIGHTS")
SedonaKepler.add_df(map_view, df=related_rows_df.select('LINE', 'DISTANCE').withColumnRenamed("LINE", "geometry"), name="KNN LINES")
map_view

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


                                                                                