# Geospatial ETL with Wherobots and Databricks Unity Catalog

This notebook builds a geospatial ETL pipeline using Wherobots and Databricks Unity Catalog.

Using a weather forecast dataset, you will:

* **Read** a Managed Delta table from Unity Catalog.
* **Transform** coordinates into spatial `POINT` geometries.
* **Enrich** the data by calculating each forecast's proximity to Tokyo to create a new "threat" feature.
* **Write** the results back to a new external Delta table, dropping the geometry column as it is not natively supported by Databricks.
    * This example writes to an External Delta table because Databricks prevents external platforms like Wherobots from writing to Managed tables. 

This notebook provides the building blocks for you to perform more complex spatial analysis and processing on your own data in Wherobots.

The exercises in this notebook use the `forecast_daily_calendar_imperial` table from the [Accuweather](https://marketplace.databricks.com/details/8c8ad63e-d96e-47d6-b56b-a42affbdb227/AccuWeather_Forecast-Weather-Data) dataset, which comes pre-loaded in your Databricks workspace.

> **Data Disclaimer:** Review the following about the dataset used in this notebook:
> * The weather data used in this demonstration originates from the `samples.accuweather.forecast_daily_calendar_imperial` dataset provided within Databricks.
> Wherobots is not responsible for the accuracy or completeness of this data.
> * This analysis is based on daily forecast data and does not represent real-time conditions.
> * For complete information about the dataset, go to [Forecast Weather Data](https://marketplace.databricks.com/details/8c8ad63e-d96e-47d6-b56b-a42affbdb227/AccuWeather_Forecast-Weather-Data) and click **Documentation** within the Product Links section.

## Prerequisites

To run this example notebook, you'll need:

- An existing **Databricks catalog and schema** governed by Unity Catalog.
    - Optionally, you can create a new [catalog](https://docs.databricks.com/aws/en/catalogs/create-catalog) and [schema](https://docs.databricks.com/aws/en/schemas/create-schema) in Databricks.
- A **Connection** between Wherobots and your Unity Catalog-governed schema and catalog.
    - For more information on connecting Unity Catalog to your Wherobots Organization, including the necessary Databricks catalog permissions, see [Connect to Unity Catalog](https://docs.wherobots.com/latest/get-started/initial-storage/connect-to-unity-catalog/).
    - If your Unity Catalog has been successfully connected to Wherobots, you will be able to see it in the [**Wherobots Data Hub**](https://cloud.wherobots.com/data-hub).
    - The permissions necessary to read and write Delta tables within a Databricks Unity Catalog.


> **Note:** Wherobots discovers Databricks catalogs only at your runtime's initialization.
> If you created a new Databricks catalog _after_ the Wherobots runtime was started, that catalog won't be visible until you restart the Wherobots runtime.
>
> To make a new catalog visible, complete the following steps to restart the runtime:
>
> 1. **Save active work:** Ensure any running jobs or SQL sessions are saved.
> 1. **Destroy runtime:** Stop the current Wherobots runtime in [Wherobots Cloud](https://cloud.wherobots.com/).
> 1. **Start a new runtime:** Start the runtime again.

## In a Databricks SQL editor

Create a table in your Unity Catalog that copies the data provided by Accuweather's `samples.accuweather.forecast_daily_calendar_imperial` dataset.
After copying this data into its own Delta table, you can query and modify it in Wherobots.

### Include your Databricks resources

Replace the `YOUR-CATALOG` and `YOUR-SCHEMA` placeholders in the statement below (keeping the backticks around each) with the catalog and schema in your Databricks environment where you have permission to create tables.

Run the following command in a **Databricks SQL editor** to create a new table containing a 10,000-row subset of the built-in Accuweather sample data.

```sql
CREATE OR REPLACE TABLE `YOUR-CATALOG`.`YOUR-SCHEMA`.`forecast_daily_calendar_imperial_wbc_demo`

USING DELTA
AS
SELECT *
FROM `samples`.`accuweather`.`forecast_daily_calendar_imperial`
LIMIT 10000;
```

Go to your Databricks SQL Workspace to confirm that a new Managed Delta table has been created in your intended location.

## In a Wherobots notebook

Run the following commands in this Wherobots notebook.

### Import libraries

In [None]:
from sedona.spark import *
from pyspark.sql.functions import col, expr, lit, when

### Create the SedonaContext

The following cell builds a Sedona configuration and uses it to create a `SedonaContext` object, which the rest of this notebook uses to read tables and run spatial SQL.

In [None]:
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

### Set up Wherobots notebook variables

Define the variables that identify the key resources for this ETL pipeline: the Databricks catalog, the schema, and the source and output tables that this example reads from and writes to.

Additionally, define the S3 path for the external table location and create fully qualified names (FQNs) for easier use in Spark SQL commands.
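
As a minimal sketch of how the fully qualified names are assembled, the hypothetical helpers below (`quote_identifier` and `build_fqn` are not part of Wherobots or Databricks) wrap each part in backticks, which Spark SQL requires when a name contains characters such as hyphens. The sketch assumes Spark SQL's convention of escaping a literal backtick inside a quoted identifier by doubling it.

```python
def quote_identifier(name: str) -> str:
    """Wrap a single identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_fqn(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema, and table into a three-part Unity Catalog name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Hyphenated names like the placeholders in this notebook need the backticks.
example_fqn = build_fqn("YOUR-CATALOG", "YOUR-SCHEMA", "forecast_daily_calendar_imperial_wbc_demo")
print(example_fqn)
```

The f-strings in the next cell produce the same result for names that contain no backticks.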

In [None]:
CATALOG = "YOUR-CATALOG" # Change this to your catalog
SCHEMA  = "YOUR-SCHEMA" # Change this to your schema name
SOURCE_TABLE = "forecast_daily_calendar_imperial_wbc_demo"
OUTPUT_TABLE = "transformed_forecast_daily_calendar_imperial"
OUTPUT_TABLE_EXTERNAL_LOCATION = 's3://your-bucket-name/path/to/external/location/' # Change this to your external Databricks location

# To find your external location's S3 path in Databricks:
# 1. Navigate to the 'Data' explorer in your Databricks workspace.
# 2. Select 'External Locations' from the left-hand menu.
# 3. Click on the name of the external location you want to use.
# 4. The 'URL' field on the details page contains the S3 path you need.

SOURCE_TABLE_FQN = f"`{CATALOG}`.`{SCHEMA}`.`{SOURCE_TABLE}`"
OUTPUT_TABLE_FQN = f"`{CATALOG}`.`{SCHEMA}`.`{OUTPUT_TABLE}`"

print("Target UC input Delta table:", SOURCE_TABLE_FQN)
print("Target UC output Delta table:", OUTPUT_TABLE_FQN)

### Confirm that you can read data from the Unity Catalog table in your Wherobots Notebook

Read the table and confirm that it returns a DataFrame of Accuweather data in which each city has several days of forecasts.

In [None]:
table_smoke_test = sedona.read.table(SOURCE_TABLE_FQN)
table_smoke_test.show(10)

## Running spatial operations

In this step, you will convert the `latitude` and `longitude` columns into a `POINT` geometry and add it to the DataFrame as a new column named `point`.

The cells in the following sections build that column with `ST_MakePoint` and tag it with SRID 4326 (WGS 84), the standard coordinate reference system for longitude/latitude data.
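
A common source of errors in this step is argument order: `ST_MakePoint` takes x (longitude) first and y (latitude) second. The plain-Python sketch below is only an illustration and not part of the pipeline. It shows that ordering along with a basic range check; `to_xy` is a hypothetical helper name.

```python
def to_xy(latitude: float, longitude: float) -> tuple[float, float]:
    """Return (x, y) = (longitude, latitude) after checking WGS 84 ranges."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    return (longitude, latitude)


# Tokyo: latitude 35.6895, longitude 139.6917
tokyo_xy = to_xy(latitude=35.6895, longitude=139.6917)
print(tokyo_xy)
```

Swapping the two values for a location like Tokyo fails the latitude check, which makes the mistake visible early instead of silently placing the point elsewhere.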

## Proximity analysis: calculate distances to key locations
In this section, you will perform a proximity analysis to calculate the distance from each weather forecast in your dataset to a specific point of interest. This allows you to filter data based on location and answer questions like, "Which of these weather events is closest to my operations center?"

### A practical example
Imagine your business has major operations or supply chain dependencies in the **Tokyo metropolitan area**, where severe weather can disrupt logistics and public safety. Your raw data contains thousands of forecasts across the region but lacks the context of which ones pose a direct threat to the city.

By defining **Tokyo's coordinates**, you can calculate the distance from every weather event to the city center, saving the result in a new column like `distance_to_tokyo_meters`.

With this new column, your data becomes an early-warning system. You can now easily ask critical business questions like:

> "Show me cities with **wind gusts over 40 mph** or **heavy precipitation** within a **500-kilometer radius** of Tokyo."

This analysis turns your spatial data into actionable intelligence, allowing you to focus only on the events that directly impact your operations.
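
To make the distance calculation concrete, here is a plain-Python sketch of the haversine (great-circle) formula, the kind of spherical distance that `ST_DistanceSphere` returns in meters. The Earth radius of 6,371,008 m and the Osaka coordinates are assumptions chosen for illustration; Sedona's result may differ slightly depending on the radius it uses.

```python
import math

EARTH_RADIUS_M = 6_371_008.0  # Assumed mean Earth radius in meters


def haversine_meters(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


TOKYO = (35.6895, 139.6917)  # (latitude, longitude), matching the WKT used in this notebook
OSAKA = (34.6937, 135.5023)  # Illustrative second city, not taken from the dataset

distance_m = haversine_meters(*TOKYO, *OSAKA)
print(f"Tokyo to Osaka: {distance_m / 1000:.0f} km")
print("Within 500 km of Tokyo:", distance_m <= 500 * 1000)
```

Note that the function returns meters, so a radius expressed in kilometers has to be multiplied by 1,000 before the two values are compared.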

In [None]:
# Load Data and Create Geometry
# Read the table from Unity Catalog and create the necessary geometry column for spatial analysis.

df = sedona.read.table(SOURCE_TABLE_FQN)

In [None]:
# Create a 'point' geometry column from the latitude and longitude columns.

df_w_geom = df.withColumn(
    "point",
    expr("ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)")
)

In [None]:
# Proximity Analysis: Calculate Distance to Tokyo
# This is the core spatial operation. We calculate the distance from every weather
# forecast point to our point of interest, Tokyo.

# Define the point of interest (Tokyo) as a WKT string.

tokyo_geom_wkt = "POINT (139.6917 35.6895)"

# Wherobots efficiently calculates the spherical distance in meters for every row.
df_with_distance = df_w_geom.withColumn(
    "distance_to_tokyo_meters",
    expr(f"ST_DistanceSphere(point, ST_SetSRID(ST_GeomFromWKT('{tokyo_geom_wkt}'), 4326))")
)

print("Calculated distance to Tokyo for each forecast.")
df_with_distance.select("distance_to_tokyo_meters").show(5)


In [None]:
# Define thresholds for our alerts.
# 40 mph is the minimum wind gust that this example treats as severe wind.
# 0.30 inches is the minimum precipitation that this example treats as heavy rain.
# All measurements are in imperial units as provided by Accuweather.

proximity_threshold_km = 500.0
severe_wind_mph = 40
heavy_precipitation_rate_inches = 0.30

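# Use a chained 'when' expression to build a descriptive alert string.
# The first matching condition wins, so the distance check comes first.
df_with_threats = df_with_distance.withColumn(
    "threat_description",
    # Distances are in meters, so convert the kilometer threshold before comparing.
    when(col("distance_to_tokyo_meters") > proximity_threshold_km * 1000, lit("No Threat (Distance exceeds proximity threshold)"))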
    .when(
        (col("wind_gust_max") >= severe_wind_mph) & (col("precipitation_lwe_max") >= heavy_precipitation_rate_inches),
        lit("High Wind & Flood Watch Near Tokyo")
    )
    .when(col("wind_gust_max") >= severe_wind_mph, lit("High Wind Warning Near Tokyo"))
    .when(col("precipitation_lwe_max") >= heavy_precipitation_rate_inches, lit("Flood Watch Near Tokyo"))
    .otherwise(lit("Normal Conditions Near Tokyo"))
)

print("Generated new 'threat_description' feature:")
df_with_threats.select("city_name", "wind_gust_max", "precipitation_lwe_max", "threat_description").show()
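
The branching logic above can be easier to reason about as a plain-Python function. The sketch below follows the same order of checks as the `when` chain for a single row and compares the distance in meters against the threshold converted from kilometers. `classify_threat` is a hypothetical helper for illustration, not part of the pipeline.

```python
PROXIMITY_THRESHOLD_KM = 500.0
SEVERE_WIND_MPH = 40
HEAVY_PRECIPITATION_INCHES = 0.30


def classify_threat(distance_m: float, wind_gust_mph: float, precipitation_in: float) -> str:
    """Return the alert string for one forecast row; the first matching rule wins."""
    if distance_m > PROXIMITY_THRESHOLD_KM * 1000:  # Compare meters with meters
        return "No Threat (Distance exceeds proximity threshold)"
    windy = wind_gust_mph >= SEVERE_WIND_MPH
    wet = precipitation_in >= HEAVY_PRECIPITATION_INCHES
    if windy and wet:
        return "High Wind & Flood Watch Near Tokyo"
    if windy:
        return "High Wind Warning Near Tokyo"
    if wet:
        return "Flood Watch Near Tokyo"
    return "Normal Conditions Near Tokyo"


# A forecast 120 km away with strong gusts and light rain
print(classify_threat(distance_m=120_000, wind_gust_mph=45, precipitation_in=0.1))
```

One difference to keep in mind: in Spark, a `when` condition that evaluates to null is treated as not matched, so rows with missing values fall through to later branches, whereas this function assumes every input is present.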

## Writing the results

In this step, you will write the results to a new external Delta table registered in Unity Catalog.

### Data preparation for Databricks

Before writing the data to Databricks, the geometry data needs preprocessing:

1. **Convert to WKT (optional):** If you need to keep the location in the output, convert the geometry column into its Well-Known Text (WKT) string representation.

1. **Drop original column:** The `point` column is dropped from the dataset because it contains the `geometry` data type.

The cell below performs only the second step: it selects the non-geometry columns, which leaves `point` out of the output. This procedure is required because Databricks does not offer native support for columns containing geometry data types.

> **Note:** A geometry data type is a special data type used in spatial databases to represent geographic features such as points (`POINT`), lines (`LINESTRING`), or polygons (`POLYGON`).
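
For readers unfamiliar with WKT, the plain-Python sketch below shows what the text form of a point looks like and how it maps back to coordinates. `point_to_wkt` and `wkt_to_point` are hypothetical helpers written for illustration; within Sedona, the conversion from a geometry column to WKT is typically done with the `ST_AsText` function.

```python
import re


def point_to_wkt(longitude: float, latitude: float) -> str:
    """Format a longitude/latitude pair as a WKT POINT string (x first, then y)."""
    return f"POINT ({longitude} {latitude})"


def wkt_to_point(wkt: str) -> tuple[float, float]:
    """Parse a simple 2D WKT POINT string back into (longitude, latitude)."""
    match = re.fullmatch(r"\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a 2D WKT POINT: {wkt!r}")
    return (float(match.group(1)), float(match.group(2)))


tokyo_wkt = point_to_wkt(139.6917, 35.6895)
print(tokyo_wkt)
print(wkt_to_point(tokyo_wkt))
```

Because WKT is plain text, a column holding it can be stored as an ordinary string and parsed back into a geometry later.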

In [None]:
final_df = df_with_threats.select(
    col("city_name"),
    col("date"),
    col("temperature_avg"),
    col("wind_gust_max"),
    col("precipitation_lwe_max"),
    col("distance_to_tokyo_meters"),
    col("threat_description")  # Our new, actionable feature: a string column, which Databricks supports
)

print("\nFinal schema to be written to Unity Catalog:")
final_df.printSchema()

# Write the feature back to Unity Catalog

final_df.createOrReplaceTempView("temp_final_df_view")

# Now, execute the SQL command to create the table from the temporary view.
sedona.sql(f"""
  CREATE OR REPLACE TABLE {OUTPUT_TABLE_FQN}
  USING delta
  LOCATION '{OUTPUT_TABLE_EXTERNAL_LOCATION}'
  AS SELECT * FROM temp_final_df_view
""")

print(f"\nSuccess! The feature has been written to the external Delta table {OUTPUT_TABLE_FQN}")