
## Goal

This notebook ingests GeoPackage data from an Azure storage account, an AWS S3 bucket, or a Unity Catalog volume, and loads it into Unity Catalog tables using Apache Sedona readers. A Databricks notebook widget (`cloud_provider`) selects the cloud provider, which in turn determines the dataset location.
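As a rough illustration of the widget-based parameterization, the sketch below registers a dropdown widget and reads it back with validation. The helper names `register_cloud_provider_widget` and `read_cloud_provider` are hypothetical and not part of this notebook; the `dbutils` object is passed in explicitly, on the assumption that the code runs where Databricks provides it.

```python
# Hypothetical helpers (not part of the original notebook).
SUPPORTED_PROVIDERS = ["azure", "aws", "volume"]


def register_cloud_provider_widget(dbutils, default="azure"):
    """Create a dropdown widget named 'cloud_provider' with the supported choices."""
    dbutils.widgets.dropdown(
        "cloud_provider", default, SUPPORTED_PROVIDERS, "Cloud provider"
    )


def read_cloud_provider(dbutils):
    """Read the widget value, normalize it, and fail fast on unsupported values."""
    value = dbutils.widgets.get("cloud_provider").strip().lower()
    if value not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"Unsupported cloud_provider {value!r}; expected one of {SUPPORTED_PROVIDERS}"
        )
    return value
```

In a notebook cell, this would be called as `register_cloud_provider_widget(dbutils)` followed by `cloud_provider = read_cloud_provider(dbutils)`. Validating at read time surfaces a mistyped provider before any data is read.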

In [0]:
%run ../get_user

In [0]:
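# Look up the current user's email and derive a username from it
# (get_username_from_email is defined in ../get_user, run in the previous cell)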
user_email = spark.sql("SELECT current_user()").collect()[0][0]
username = get_username_from_email(user_email)
print(username)
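
The helper `get_username_from_email` is defined in `../get_user`, which is not shown here. Purely as an assumption about what such a helper might do, the sketch below takes the local part of the email and replaces characters that are awkward in table names with underscores. The name `get_username_from_email_sketch` is hypothetical, and the real implementation may differ.

```python
import re


def get_username_from_email_sketch(email: str) -> str:
    """Hypothetical stand-in: derive a table-name-safe suffix from an email."""
    # Keep only the part before the '@'
    local_part = email.split("@", 1)[0]
    # Replace runs of non-alphanumeric characters with a single underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", local_part)
    # Trim leading/trailing underscores and lowercase the result
    return cleaned.strip("_").lower()
```

A suffix of this shape matters because the username is later appended to each table name, so it should contain only characters that are valid in an unquoted identifier.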

In [0]:
cloud_provider = dbutils.widgets.get("cloud_provider")
print(f"Cloud Provider: {cloud_provider}")

# Provider-specific dataset locations
if cloud_provider == "azure":
    dataset_storage_account_name = "melikadatabricksstorage"
    dataset_container_name = "geospatial-dataset"
    dataset_dir = "vector/uk"
elif cloud_provider == "aws":
    dataset_bucket_name = "revodata-databricks-geospatial"
    dataset_input_dir = "geospatial-dataset/vector/uk"
elif cloud_provider == "volume":
    schema_name = "inputs"
    volume_name = "geospatial_dataset"
    dataset_input_dir = "vector/uk"
else:
    # Fail early instead of hitting an undefined variable later on
    raise ValueError(f"Unsupported cloud_provider: {cloud_provider!r}")


catalog_name = "geospatial"



In [0]:
# This code initializes the Apache Sedona context for geospatial processing in Spark.
# It configures the required Sedona and GeoTools packages, then creates a SedonaContext object for reading and manipulating geospatial data in Spark DataFrames.

from sedona.spark import *
from pyspark.sql.functions import expr

config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.3_2.12:1.7.1,"
        "org.datasyslab:geotools-wrapper:1.7.1-28.5",
    )
    .getOrCreate()
)

sedona = SedonaContext.create(config)

In [0]:
# This code iterates through a dictionary of schemas, GeoPackage files, and layer names to ingest geospatial data from cloud storage (Azure, AWS, or Unity Catalog volume) using Apache Sedona. 
# For each layer in each GeoPackage file, it reads the data into a Spark DataFrame, then writes the DataFrame as a Unity Catalog table with a username-specific suffix, overwriting any existing table. 
# The storage path is dynamically constructed based on the selected cloud provider.

schema_tables = {
    "lookups": {
        "bdline_gb.gpkg": ["boundary_line_ceremonial_counties"],
    },
    "greenspaces": {
        "opgrsp_gb.gpkg": ["greenspace_site", "access_point"]
    },
    "networks": {
        "oproad_gb.gpkg": ["road_link", "road_node"],
    },
}


for schema, files in schema_tables.items():
    for gpkg_file, layers in files.items():
        for table_name in layers:
            # Build the source path for the selected provider
            if cloud_provider == "azure":
                path = f"abfss://{dataset_container_name}@{dataset_storage_account_name}.dfs.core.windows.net/{dataset_dir}/{gpkg_file}"
            elif cloud_provider == "aws":
                path = f"s3://{dataset_bucket_name}/{dataset_input_dir}/{gpkg_file}"
            elif cloud_provider == "volume":
                path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/{dataset_input_dir}/{gpkg_file}"
            else:
                raise ValueError(f"Unsupported cloud_provider: {cloud_provider!r}")

            # Read one GeoPackage layer and write it as a user-specific table
            df = sedona.read.format("geopackage").option("tableName", table_name).load(path)
            full_table_name = f"{catalog_name}.{schema}.{table_name}_{username}"
            df.write.mode("overwrite").saveAsTable(full_table_name)
            print(f"Table {full_table_name} is created, yay!")
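
The path construction and table naming used in the loop above can also be sketched as plain functions, which makes them checkable without a Spark session. The names `build_source_path` and `iter_ingest_jobs`, and the `settings` dictionary with its keys, are hypothetical and introduced only for illustration; the path formats mirror the ones used in this notebook.

```python
def build_source_path(provider: str, settings: dict, gpkg_file: str) -> str:
    """Return the GeoPackage path for a provider (settings keys are assumptions)."""
    if provider == "azure":
        return (
            f"abfss://{settings['container']}@{settings['storage_account']}"
            f".dfs.core.windows.net/{settings['dataset_dir']}/{gpkg_file}"
        )
    if provider == "aws":
        return f"s3://{settings['bucket']}/{settings['dataset_dir']}/{gpkg_file}"
    if provider == "volume":
        return (
            f"/Volumes/{settings['catalog']}/{settings['schema']}"
            f"/{settings['volume']}/{settings['dataset_dir']}/{gpkg_file}"
        )
    raise ValueError(f"Unsupported provider: {provider!r}")


def iter_ingest_jobs(schema_tables: dict, catalog: str, username: str):
    """Yield (gpkg_file, layer, target_table) for every layer in the nested mapping."""
    for schema, files in schema_tables.items():
        for gpkg_file, layers in files.items():
            for layer in layers:
                yield gpkg_file, layer, f"{catalog}.{schema}.{layer}_{username}"


# Example usage with values taken from this notebook
aws_settings = {
    "bucket": "revodata-databricks-geospatial",
    "dataset_dir": "geospatial-dataset/vector/uk",
}
example_path = build_source_path("aws", aws_settings, "oproad_gb.gpkg")
example_jobs = list(
    iter_ingest_jobs({"networks": {"oproad_gb.gpkg": ["road_link", "road_node"]}},
                     "geospatial", "jane_doe")
)
```

Separating path building from reading keeps the Sedona call identical for all providers, and listing the jobs up front allows a dry run before any table is overwritten.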