# **Optimized Notebook – Performance Optimization & Exporting Data in WherobotsDB 🚀**

This notebook **extracts, transforms, and loads (ETL)** large-scale **GDELT event data** into **WherobotsDB** before **exporting it as an optimized GeoParquet dataset**.

## **🔹 Workflow Overview**
1️⃣ **Create an optimized table in WherobotsDB** (Iceberg format).  
2️⃣ **Organize data by GeoHash** for spatial indexing.  
3️⃣ **Write the structured dataset** as a **GeoParquet file** for efficient querying.  

---

## **1️⃣ Creating an Optimized Iceberg Table for GDELT Data**
To store geospatial data efficiently, we create an **Iceberg table in WherobotsDB**, ensuring the dataset is well-structured for **fast queries**.

```python
# Create an optimized Iceberg table in WherobotsDB
sedona.sql(f'''
CREATE OR REPLACE TABLE wherobots.{name}.gdelt AS 
SELECT *, 
ST_SetSRID(
    ST_Point(ActionGeo_Long, ActionGeo_Lat),
4326) as geometry
FROM csv_df
''')
```

### **🛠️ What’s Happening?**
- **Converts CSV data (`csv_df`) into a structured Iceberg table.**
- **Creates a point geometry column** with `ST_Point(x, y)`, passing longitude (`ActionGeo_Long`) as x and latitude (`ActionGeo_Lat`) as y.
- **Tags each geometry with SRID 4326** (EPSG:4326, WGS 84) via `ST_SetSRID`.
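
The argument order matters: x is longitude and y is latitude. The pure-Python sketch below mirrors that ordering outside Spark, purely as an illustration. `to_lon_lat` is a hypothetical helper (not part of Sedona or this notebook), and it assumes the coordinate fields arrive as strings that may be empty or missing.

```python
from typing import Optional, Tuple


def to_lon_lat(lon_text: Optional[str], lat_text: Optional[str]) -> Optional[Tuple[float, float]]:
    """Parse raw coordinate strings into an (x, y) = (longitude, latitude) pair.

    Returns None when either value is missing, non-numeric, or out of range.
    """
    try:
        lon = float(lon_text)
        lat = float(lat_text)
    except (TypeError, ValueError):
        # None raises TypeError, empty or non-numeric text raises ValueError
        return None
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        return None
    return (lon, lat)


if __name__ == "__main__":
    print(to_lon_lat("-5.6", "42.6"))  # longitude first, then latitude
    print(to_lon_lat("", "42.6"))      # missing longitude
```

A check like this is one way to find rows whose coordinates would not produce a meaningful point before the table is built.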

---

## **2️⃣ Organizing Data by GeoHash for Faster Queries**
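A GeoHash encodes each location as a short string whose prefix identifies a grid cell. Sorting by that string places nearby events close together on disk, which **spatially organizes** the dataset and allows faster **geospatial filtering**.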

```python
# Organize data by GeoHash and compute bounding boxes (BBOX)
gdelt = sedona.sql(f'''
SELECT 
    *,
    ST_GeoHash(geometry, 15) AS geohash,
    struct(st_xmin(geometry) as xmin, st_ymin(geometry) as ymin, 
           st_xmax(geometry) as xmax, st_ymax(geometry) as ymax) as bbox
FROM wherobots.{name}.gdelt
''')
```

### **🛠️ What’s Happening?**
- **`ST_GeoHash(geometry, 15)`** → Assigns a GeoHash (precision: 15) to each row.
- **`bbox` struct** → Computes the **bounding box (BBOX)** for each event. For point geometries, the min and max values coincide.
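
To make the GeoHash idea concrete, here is a minimal pure-Python sketch of the standard geohash encoding (interleaved longitude/latitude bisection, base-32 output). It is not Sedona's implementation of `ST_GeoHash`, and `geohash_encode` is a name introduced only for this illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (omits a, i, l, o)


def geohash_encode(lon: float, lat: float, precision: int) -> str:
    """Encode a lon/lat pair as a geohash string with `precision` characters."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(-5.6, 42.6, 5))
    print(geohash_encode(-5.6, 42.6, 15))
```

A longer hash of the same point always starts with the shorter one, which is why sorting by the string keeps spatial neighbours together.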

---

## **3️⃣ Writing Optimized GeoParquet for Efficient Queries**
The **GeoParquet format** is ideal for storing spatial data in a compact and queryable format.

```python
%%time

gdelt.repartitionByRange(10, "geohash") \
    .sortWithinPartitions("geohash") \
    .drop("geohash") \
    .write \
    .format("geoparquet") \
    .option("geoparquet.version", "1.1.0") \
    .option("geoparquet.covering", "bbox") \
    .option("geoparquet.crs", projjson) \
    .save(user_uri + "gdelt-snappy", mode='overwrite', compression='snappy')
```

### **🛠️ What’s Happening?**
- **Repartitions data into 10 range partitions** keyed by **GeoHash** (the executed cell further down uses 30).
- **Sorts each partition by GeoHash** for better locality-based queries.
- **Drops the helper `geohash` column** after sorting, since it is only needed to order the rows.
- **Writes the final dataset in GeoParquet format** (`geoparquet.version = 1.1.0`).
- **Declares `bbox` as the covering column** so readers can skip data outside a query window.
- **Embeds the EPSG:4326 CRS definition (`projjson`)** in the GeoParquet metadata.
- **Compresses the output with Snappy**.
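
The effect of range partitioning followed by a per-partition sort can be sketched in plain Python. This is a simplification: Spark samples the data to pick range boundaries, whereas the hypothetical `range_partition` helper below simply sorts everything and cuts it into contiguous chunks of near-equal size.

```python
from typing import List


def range_partition(keys: List[str], num_partitions: int) -> List[List[str]]:
    """Split keys into contiguous, sorted ranges of near-equal size."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    ordered = sorted(keys)
    size, extra = divmod(len(ordered), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional key each
        end = start + size + (1 if i < extra else 0)
        partitions.append(ordered[start:end])
        start = end
    return partitions


if __name__ == "__main__":
    # Made-up geohash prefixes, for illustration only
    sample = ["u4pr", "ezs4", "9q8y", "ezs2", "u4pq", "9q8z"]
    for part in range_partition(sample, 3):
        print(part)
```

Each output partition covers one contiguous slice of the geohash ordering, so every written file ends up holding events from a limited set of grid cells.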

---

## **✅ Summary**
✔️ **Created a structured Iceberg table in WherobotsDB** for **GDELT event data**.  
✔️ **Added GeoHash and BBOX indexing** to **accelerate spatial queries**.  
✔️ **Exported the optimized dataset** to **GeoParquet (Snappy compression)** for **fast external use**.  

---

### **Next Steps**
- **Perform spatial queries on the GeoParquet dataset** in Spark or GIS tools.  
- **Integrate the dataset into dashboards or mapping applications.**  
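
As a rough picture of why the `bbox` column helps downstream spatial queries, the sketch below skips any file whose extent does not overlap the query window. The file names and extents are hypothetical, and real readers work from Parquet column statistics on the `bbox` fields rather than a Python dictionary.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: BBox, b: BBox) -> bool:
    """Return True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def files_to_scan(file_extents: Dict[str, BBox], query: BBox) -> List[str]:
    """Keep only the files whose extent overlaps the query window."""
    return [name for name, extent in file_extents.items() if intersects(extent, query)]


if __name__ == "__main__":
    # Hypothetical per-file extents, for illustration only
    extents = {
        "part-0.parquet": (-180.0, -90.0, -60.0, 90.0),
        "part-1.parquet": (-60.0, -90.0, 60.0, 90.0),
        "part-2.parquet": (60.0, -90.0, 180.0, 90.0),
    }
    print(files_to_scan(extents, (-10.0, 35.0, 5.0, 45.0)))
```

The tighter each file's extent, the more files a query can skip, which is the motivation for sorting by GeoHash before writing.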

# ⌨️ **Step 1: Convert GDELT CSV data to a Havasu table**

In [None]:
# Step 1: Import necessary libraries and connect Apache Sedona to the Wherobots runtime
from sedona.spark import SedonaContext

config = SedonaContext.builder() \
    .getOrCreate()

sedona = SedonaContext.create(config)

25/02/06 01:45:01 WARN UDTRegistration: Cannot register UDT for org.geotools.coverage.grid.GridCoverage2D, which is already registered.
25/02/06 01:45:01 WARN SimpleFunctionRegistry: The function rs_union_aggr replaced a previously registered function.
25/02/06 01:45:01 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
25/02/06 01:45:01 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.index.SpatialIndex, which is already registered.
25/02/06 01:45:01 WARN SimpleFunctionRegistry: The function st_union_aggr replaced a previously registered function.
25/02/06 01:45:01 WARN SimpleFunctionRegistry: The function st_envelope_aggr replaced a previously registered function.
25/02/06 01:45:01 WARN SimpleFunctionRegistry: The function st_intersection_aggr replaced a previously registered function.
25/02/06 01:45:01 WARN SimpleFunctionRegistry: The function st_analyze_aggr replaced a previously registered function.


In [51]:
# Glob pattern matching every GDELT events CSV file in the S3 bucket

csv_path = 's3://gdelt-open-data/events/*.*.csv'

In [52]:
# Read the tab-delimited CSV files

csv_df = sedona.read.format("csv") \
    .option("delimiter", "\\t") \
    .load(csv_path)

In [53]:
# We have to attach the headers to the CSV from a TXT file 🫠
import requests

# Fetch the header file from the URL
response = requests.get('https://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt')
response.raise_for_status()  # ensure we notice bad responses

# Assume the first line contains the header names and they're tab-separated
header_line = response.text.splitlines()[0].strip()
headers = header_line.split('\t')

In [54]:
# Attach the headers
csv_df = csv_df.toDF(*headers)
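
`toDF(*headers)` only succeeds when the number of names matches the number of columns in the DataFrame, so a quick check before renaming can give a clearer error. The sketch below is a hedged illustration: `parse_header_line` is a hypothetical helper, and in this notebook the expected count would come from `len(csv_df.columns)`.

```python
from typing import List


def parse_header_line(header_line: str, expected_columns: int) -> List[str]:
    """Split a tab-separated header line and check it against the column count."""
    names = [name.strip() for name in header_line.strip().split("\t")]
    if len(names) != expected_columns:
        raise ValueError(
            f"header has {len(names)} names but the data has {expected_columns} columns"
        )
    if len(set(names)) != len(names):
        raise ValueError("header contains duplicate column names")
    return names


if __name__ == "__main__":
    # In the notebook, expected_columns would be len(csv_df.columns)
    print(parse_header_line("GLOBALEVENTID\tSQLDATE\tMonthYear\n", 3))
```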

In [55]:
csv_df.printSchema()

root
 |-- GLOBALEVENTID: string (nullable = true)
 |-- SQLDATE: string (nullable = true)
 |-- MonthYear: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- FractionDate: string (nullable = true)
 |-- Actor1Code: string (nullable = true)
 |-- Actor1Name: string (nullable = true)
 |-- Actor1CountryCode: string (nullable = true)
 |-- Actor1KnownGroupCode: string (nullable = true)
 |-- Actor1EthnicCode: string (nullable = true)
 |-- Actor1Religion1Code: string (nullable = true)
 |-- Actor1Religion2Code: string (nullable = true)
 |-- Actor1Type1Code: string (nullable = true)
 |-- Actor1Type2Code: string (nullable = true)
 |-- Actor1Type3Code: string (nullable = true)
 |-- Actor2Code: string (nullable = true)
 |-- Actor2Name: string (nullable = true)
 |-- Actor2CountryCode: string (nullable = true)
 |-- Actor2KnownGroupCode: string (nullable = true)
 |-- Actor2EthnicCode: string (nullable = true)
 |-- Actor2Religion1Code: string (nullable = true)
 |-- Actor2Religion2Code: stri

In [56]:
# Count the total number of rows
csv_df.count()

383958288

In [57]:
# Create a temporary view to create our table from the DataFrame
csv_df.createOrReplaceTempView('csv_df')

In [60]:
name = 'matt'

In [None]:
# Create a Database
sedona.sql(f'''
CREATE DATABASE IF NOT EXISTS wherobots.{name}
''')

In [73]:
# Create the Havasu table and add a point geometry column
sedona.sql(f'''
CREATE OR REPLACE TABLE wherobots.{name}.gdelt AS 
SELECT *, 
ST_SetSRID(
    ST_Point(ActionGeo_Long, ActionGeo_Lat),
4326) as geometry
FROM csv_df
''')

DataFrame[]

In [74]:
# Define the PROJJSON for EPSG:4326, used as the CRS in the GeoParquet metadata

projjson = '''{
    "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
    "type": "GeographicCRS",
    "name": "WGS 84",
    "datum_ensemble": {
        "name": "World Geodetic System 1984 ensemble",
        "members": [
            {
                "name": "World Geodetic System 1984 (Transit)",
                "id": {
                    "authority": "EPSG",
                    "code": 1166
                }
            },
            {
                "name": "World Geodetic System 1984 (G730)",
                "id": {
                    "authority": "EPSG",
                    "code": 1152
                }
            },
            {
                "name": "World Geodetic System 1984 (G873)",
                "id": {
                    "authority": "EPSG",
                    "code": 1153
                }
            },
            {
                "name": "World Geodetic System 1984 (G1150)",
                "id": {
                    "authority": "EPSG",
                    "code": 1154
                }
            },
            {
                "name": "World Geodetic System 1984 (G1674)",
                "id": {
                    "authority": "EPSG",
                    "code": 1155
                }
            },
            {
                "name": "World Geodetic System 1984 (G1762)",
                "id": {
                    "authority": "EPSG",
                    "code": 1156
                }
            },
            {
                "name": "World Geodetic System 1984 (G2139)",
                "id": {
                    "authority": "EPSG",
                    "code": 1309
                }
            }
        ],
        "ellipsoid": {
            "name": "WGS 84",
            "semi_major_axis": 6378137,
            "inverse_flattening": 298.257223563
        },
        "accuracy": "2.0",
        "id": {
            "authority": "EPSG",
            "code": 6326
        }
    },
    "coordinate_system": {
        "subtype": "ellipsoidal",
        "axis": [
            {
                "name": "Geodetic latitude",
                "abbreviation": "Lat",
                "direction": "north",
                "unit": "degree"
            },
            {
                "name": "Geodetic longitude",
                "abbreviation": "Lon",
                "direction": "east",
                "unit": "degree"
            }
        ]
    },
    "scope": "Horizontal component of 3D system.",
    "area": "World.",
    "bbox": {
        "south_latitude": -90,
        "west_longitude": -180,
        "north_latitude": 90,
        "east_longitude": 180
    },
    "id": {
        "authority": "EPSG",
        "code": 4326
    }
}'''
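
Because the CRS is passed as a raw string, it can be worth confirming that it parses as JSON and carries the expected identifier before writing. The sketch below uses a trimmed stand-in document rather than the full definition above; `crs_authority_code` is a hypothetical helper, and in the notebook it would be called with `projjson`.

```python
import json


def crs_authority_code(projjson_text: str) -> str:
    """Return the top-level identifier of a PROJJSON document as 'AUTHORITY:CODE'."""
    crs = json.loads(projjson_text)  # raises a ValueError subclass on malformed JSON
    ident = crs["id"]
    return f'{ident["authority"]}:{ident["code"]}'


if __name__ == "__main__":
    # Trimmed stand-in for the full PROJJSON string, for illustration only
    sample = '{"type": "GeographicCRS", "name": "WGS 84", "id": {"authority": "EPSG", "code": 4326}}'
    print(crs_authority_code(sample))
```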

In [75]:
import os
user_uri = os.getenv("USER_S3_PATH")
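
The write step later concatenates `user_uri + "gdelt-snappy"`, which assumes the variable is set and ends with a slash. A more defensive way to build the output path might look like the sketch below; `dataset_uri` and `user_dataset_uri` are hypothetical helpers, not part of the Wherobots environment.

```python
import os
from typing import Optional


def dataset_uri(base_uri: Optional[str], name: str) -> str:
    """Join a base URI and a dataset name with exactly one slash between them."""
    if not base_uri:
        raise ValueError("base URI is not set")
    return base_uri.rstrip("/") + "/" + name.lstrip("/")


def user_dataset_uri(name: str, env_var: str = "USER_S3_PATH") -> str:
    """Build the dataset URI from an environment variable (unset raises ValueError)."""
    return dataset_uri(os.getenv(env_var), name)


if __name__ == "__main__":
    # Made-up bucket path, for illustration only
    print(dataset_uri("s3://example-bucket/user", "gdelt-snappy"))
```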

In [80]:
# Organize the table by the GeoHash for improved partitioning

gdelt = sedona.sql(f'''SELECT 
*,
ST_GeoHash(geometry, 15) AS geohash,
struct(st_xmin(geometry) as xmin, st_ymin(geometry) as ymin, st_xmax(geometry) as xmax, st_ymax(geometry) as ymax) as bbox
FROM wherobots.{name}.gdelt''')

In [84]:
%%time

gdelt.repartitionByRange(30, "geohash") \
    .sortWithinPartitions("geohash") \
    .drop("geohash") \
    .write \
    .format("geoparquet") \
    .option("geoparquet.version", "1.1.0") \
    .option("geoparquet.covering", "bbox") \
    .option("geoparquet.crs", projjson) \
    .save(user_uri + "gdelt-snappy", mode='overwrite', compression='snappy')

25/02/06 02:46:15 ERROR TaskSchedulerImpl: Lost executor 66 on 10.1.50.156:
The executor with id 66 exited with exit code -1(unexpected).

The API gave the following container statuses:

	 container name: spark-kubernetes-executor
	 container image: 329898491045.dkr.ecr.us-west-2.amazonaws.com/wherobots-spark:v1.5.0-db-12565648598
	 container state: running
	 container started at: 2025-02-05T22:36:12Z

25/02/06 02:46:56 ERROR TaskSchedulerImpl: Lost executor 89 on 10.1.48.146:
The executor with id 89 exited with exit code -1(unexpected).

The API gave the following container statuses:

	 container name: spark-kubernetes-executor
	 container image: 329898491045.dkr.ecr.us-west-2.amazonaws.com/wherobots-spark:v1.5.0-db-12565648598
	 container state: running
	 container started at: 2025-02-06T01:50:45Z

CPU times: user 79.2 ms, sys: 28.7 ms, total: 108 ms
Wall time: 3min 52s
