# 🎉 Welcome to Wherobots! 🚀

We are *thrilled* to have you here and can't wait to help you get started!


## 🌍 What You Will Learn in This Notebook

Welcome to this geospatial analysis notebook!

This notebook emphasizes hands-on geospatial analysis, combining the power of SQL, Python, and cloud-native data integration to unlock actionable insights from spatial datasets.

First, you'll learn the basics of geospatial analysis with Wherobots. Then, you'll apply that knowledge by analyzing real-world data about buildings near Central Park.

After completing this notebook, you'll be able to:

* **📂 Load raster and vector files into DataFrames**
    - Load geospatial data from AWS S3 into Sedona DataFrames.
    - Use Apache Sedona SQL to filter, query, and manipulate vector and raster data.
    
* **📊 Perform Zonal Statistics**
    - Leverage `RS_ZonalStats` to calculate statistics such as mean elevation over spatial geometries.
    - Integrate raster and vector data for advanced spatial analysis.

* **🔄 Transform and Analyze Data**
    - Use SQL queries to extract insights, such as identifying regions meeting specific criteria (e.g. building elevation).

* **📝 Work with Temporary Views**
    - Understand the use of temporary views in Apache Sedona to streamline complex geospatial workflows.

* **🗺️ Visualize and Interpret Results**
    - Learn how to visualize geospatial datasets using tools like SedonaKepler.
    - Explore insights derived from the data, such as building heights.

## Step 1: Choose Your Storage

Wherobots is a **cloud-native tool** and works best with data that's stored in a cloud storage bucket.

With Wherobots, you have two options for data storage:

1. **Use Wherobots S3 storage** (our fully managed solution).
2. **Connect your own S3 buckets** to integrate seamlessly with your existing data workflows.

> *Tip*: Storing data in the cloud makes it easier to scale, process, and analyze geospatial data efficiently! 🌩️

Don't worry, you don't need to make that decision to move ahead with this tutorial! **We have some data ready to use!**

If you want to use **Wherobots Managed Storage**, [click here for more information](https://docs.wherobots.com/latest/develop/storage-management/managed-storage/) about how to load your data.

## Step 2: Set up your Sedona context

The Sedona Context connects you to the Wherobots Cloud compute environment to make sure everything runs 🏎️ **fast and efficiently**, enabling functionalities like spatial indexing, querying, and analysis.

### 🧰 **The configuration**
This command sets up the basic configuration for your compute environment. You can include more options in this command for additional customization.

```python
config = SedonaContext.builder().getOrCreate()
```

### 🔌 **The context**
This command connects your configuration to the Wherobots Cloud environment and initializes the Sedona Context.

```python
sedona = SedonaContext.create(config)
```

> 📓 For more details on setting up your compute environment, check out our [documentation](https://docs.wherobots.com/latest/develop/notebook-management/notebook-instance-management/).
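
The builder accepts extra Spark options before `getOrCreate()` is called. The sketch below wraps that pattern in a small helper. The helper name `build_context` and the anonymous-credentials option are illustrative assumptions, not settings this notebook requires.

```python
def build_context(extra_conf=None):
    """Create a Sedona context, applying optional Spark config key/value pairs."""
    # Imported inside the function so the helper can be defined before Sedona is loaded
    from sedona.spark import SedonaContext

    builder = SedonaContext.builder()
    for key, value in (extra_conf or {}).items():
        builder = builder.config(key, value)  # standard Spark session builder option
    return SedonaContext.create(builder.getOrCreate())


# Assumed example option: read a public bucket without AWS credentials (Hadoop S3A setting)
PUBLIC_BUCKET_CONF = {
    "spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
}

# Usage inside a Wherobots notebook:
# sedona = build_context(PUBLIC_BUCKET_CONF)
```

Calling `build_context()` with no arguments should behave like the two-line setup shown above.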

In [None]:
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

## Step 3: Load a GeoParquet File Into a DataFrame

GeoParquet is a modern, efficient format for storing geospatial data. Here's how to load a GeoParquet file into a DataFrame:

### 📂 **Load the GeoParquet File**
The following code snippet loads a GeoParquet file from S3. You don't need any extra libraries to do so!

```python
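# Define the S3 URI of your GeoParquet file (this is the sample dataset used in this notebook)
geoparquet = 's3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet'

# Load the DataFrame from that URI
df = sedona.read.format("geoparquet").load(geoparquet)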

# First show the data schema using the .printSchema() function
df.printSchema()

# Then show the first 20 rows of the dataframe
df.show()

# ...or the first 5 rows
df.show(5)
```

> *Note*: GeoParquet files store geospatial data in an efficient, interoperable format. This makes them perfect for large-scale geospatial workflows! 🌍 Here is some [more information from our documentation on loading GeoParquet](https://docs.wherobots.com/latest/tutorials/wherobotsdb/vector-data/vector-load/?h=read+geopar#__tabbed_9_3).
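
After loading, it can help to check that the expected columns are present before previewing them. This is a minimal sketch: `preview` is a hypothetical helper, and the default column names `PROP_ADDR` and `geom` are the ones this notebook uses later for the buildings dataset.

```python
def preview(df, columns=("PROP_ADDR", "geom"), limit=5):
    """Show the first rows of selected columns, failing early if a column is missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"columns not found in DataFrame: {missing}")
    # truncate=False prints full cell values, which is useful for long geometry text
    df.select(*columns).show(limit, truncate=False)


# Usage with the DataFrame loaded above:
# preview(df, limit=3)
```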

In [None]:
# Here, we define a variable with the URI to our data in the S3 bucket
geoparquet = 's3://wherobots-examples/data/onboarding_1/nyc_buildings.parquet'

# Then, we can load that into a Sedona DataFrame
buildings = sedona.read.format("geoparquet").load(geoparquet)

## Step 4: 📂 Load Raster Data Into a DataFrame
The following code snippet loads a raster file (e.g., GeoTIFF) into a Sedona DataFrame:

```python
# Define the path to your raster file
raster_path = "s3://your-bucket-name/path/to/your/raster.tif"

# Load the raster file into a DataFrame using Spatial SQL
raster_df = sedona.sql(f'''SELECT RS_FromPath('{raster_path}') as rast''')

# Show the schema and some sample rows
raster_df.printSchema()
raster_df.show()

# Or load it using the Python API (expr comes from pyspark.sql.functions)
from pyspark.sql.functions import expr

df = sedona.read.format("binaryFile"). \
    load(raster_path). \
    drop("content").withColumn("rast", expr("RS_FromPath(path)"))
```

> *Note*: Make sure to adjust the `raster_path` to point to your specific file location. Sedona handles raster metadata and pixel data efficiently, making it ideal for spatial analysis. 🌐

---

### 🛠️ Further exploration

Use Sedona SQL to query and process the Central Park raster data, then combine it with vector data for advanced spatial analytics.
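
As a first step in exploring a raster, you can ask Sedona for its basic metadata. The sketch below assumes a DataFrame with a raster column named `rast`, as in this notebook. The helper name `summarize_raster` is hypothetical, and the `RS_*` functions are Sedona raster accessors.

```python
# SQL expressions evaluated against the raster column
RASTER_SUMMARY_EXPRS = [
    "RS_SRID(rast) AS srid",          # spatial reference ID of the raster
    "RS_NumBands(rast) AS bands",     # number of bands
    "RS_Width(rast) AS width_px",     # width in pixels
    "RS_Height(rast) AS height_px",   # height in pixels
]


def summarize_raster(raster_df):
    """Return a DataFrame with one metadata row per raster in raster_df."""
    return raster_df.selectExpr(*RASTER_SUMMARY_EXPRS)


# Usage with the raster DataFrame from this step:
# summarize_raster(elevation).show()
```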

---

In [50]:
# Here, we assign the S3 URI of our raster dataset to a variable
central_park = 's3://wherobots-examples/data/onboarding_1/CentralPark.tif'

# Here, we can use the Sedona Spatial SQL functions to load the data in using the RS_FromPath function
elevation = sedona.sql(f'''SELECT
RS_FromPath('{central_park}')
as rast''')

### **What is `elevation.createOrReplaceTempView('elevation')`?**

This line of code is used to register a DataFrame as a temporary SQL view in Apache Spark. Here's what it does:

- **`createOrReplaceTempView()`**: This method registers the DataFrame (in this case, `elevation`) as a temporary view.
- **`'elevation'`**: The name assigned to the SQL view. You can query this view using SQL commands in Spark SQL.

#### 🛠️ Why Use a Temporary View?

Temporary views allow you to interact with the DataFrame using SQL queries. For example, after creating the view, you can run the following query to inspect the raster it contains:

```python
# The elevation view has a single raster column, rast; RS_SRID returns its spatial reference ID
result = sedona.sql("SELECT RS_SRID(rast) AS srid FROM elevation")
result.show()
```

> *Note*: Temporary views only exist for the duration of the Spark session. Once the session ends, the view will no longer be available.
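
Because views are tied to the session, it is sometimes useful to check whether one is registered, or to drop it explicitly. The sketch below uses the standard Spark catalog API (`catalog.listTables` and `catalog.dropTempView`). The helper names are hypothetical, and `session` stands for the `sedona` context created in Step 2.

```python
def temp_view_exists(session, name):
    """Return True if a temporary view with this name is registered in the session."""
    return any(t.name == name and t.isTemporary for t in session.catalog.listTables())


def drop_temp_view(session, name):
    """Drop the temporary view if it exists; return True when something was dropped."""
    if temp_view_exists(session, name):
        session.catalog.dropTempView(name)
        return True
    return False


# Usage after elevation.createOrReplaceTempView('elevation'):
# temp_view_exists(sedona, 'elevation')
# drop_temp_view(sedona, 'elevation')
```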


In [5]:
# This creates a temporary view from our DataFrame so it can be used in Spatial SQL queries
elevation.createOrReplaceTempView('elevation')

## Step 5: 🍓 *I'm going to Strawberry Fields...*

The following query calculates the elevation from a raster file of Central Park at the specific location of Strawberry Fields:

```python
strawberry_fields = sedona.sql('''
select RS_Value(rast, ST_Point(-73.9751781, 40.7756813)) as elevation_in_feet
from elevation
''')
```

### 🔍 Explanation:

- **`RS_Value(rast, ST_Point(...))`**: This function retrieves the value (e.g., elevation) from the raster file at a specified point. Here, the point is defined by its longitude and latitude coordinates (-73.9751781, 40.7756813).
- **`as elevation_in_feet`**: Assigns an alias to the output column, making it easier to interpret the results.
- **`from elevation`**: Specifies the raster DataFrame (registered as a temporary view) as the source of the query.

### 📊 Practical Use:

This query allows you to extract elevation data or other raster-based values for specific locations, enabling precise spatial analysis. For example, the resulting DataFrame `strawberry_fields` will contain the elevation value in feet for the given coordinates.

```python
strawberry_fields.show()
```

> *Note*: This functionality is incredibly useful for point-specific raster queries, such as extracting elevation, temperature, or other environmental variables.
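
To reuse the same lookup for other coordinates, the query text can be generated from a longitude/latitude pair. This is a sketch under the notebook's assumptions (a view named `elevation` with a raster column `rast`); `point_value_sql` is a hypothetical helper, not part of Sedona.

```python
def point_value_sql(lon, lat, view="elevation", alias="elevation_in_feet"):
    """Build an RS_Value query for a single longitude/latitude point."""
    # Basic range check so clearly invalid coordinates fail before reaching Spark
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("expected longitude in [-180, 180] and latitude in [-90, 90]")
    return (
        f"SELECT RS_Value(rast, ST_Point({lon}, {lat})) AS {alias} "
        f"FROM {view}"
    )


# Coordinates of Strawberry Fields, taken from the query above
STRAWBERRY_FIELDS = (-73.9751781, 40.7756813)
query = point_value_sql(*STRAWBERRY_FIELDS)

# Usage inside the notebook:
# sedona.sql(query).show()
```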


In [6]:
# First we create the point geometry and then use the RS_Value function
# to get the raster value for that specific point location

strawberry_fields = sedona.sql('''
select RS_Value(rast, ST_Point(-73.9751781, 40.7756813)) as elevation_in_feet
from elevation
''')

In [None]:
# Then, we call .show() on the DataFrame to run the query and show the results
strawberry_fields.show()

## Step 6: 🖼️ Visualizing our data using `SedonaKepler`

After processing your spatial data using Sedona, you can visualize it with SedonaKepler. For example:

```python
from sedona.maps.SedonaKepler import SedonaKepler

# Initialize a SedonaKepler map
# (optionally pass a Kepler.gl configuration dictionary: SedonaKepler.create_map(config=...))
map = SedonaKepler.create_map()

# Add a DataFrame (e.g., NYC Buildings) to the map as a named layer
SedonaKepler.add_df(map, buildings, name="NYC Buildings")

# Render the map (in a notebook, leave this as the last expression in the cell)
map
```

> *Note*: The notebook cell for this step loads a map configuration from a JSON file and passes it to `create_map`, so you can see some sample styles right away.
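
If the configuration file is missing, opening it raises an error before the map is created. A small, hedged sketch of a loader that falls back to the default style is shown here. `load_map_config` is a hypothetical helper, and it assumes `create_map` treats `config=None` as "no custom configuration".

```python
import json
from pathlib import Path


def load_map_config(path):
    """Return the map configuration stored at path, or None if the file is absent."""
    config_file = Path(path)
    if not config_file.is_file():
        return None
    with config_file.open() as f:
        return json.load(f)


# Usage inside the notebook (path taken from the cell below):
# map_config = load_map_config('map-config/config.json')
# map = SedonaKepler.create_map(config=map_config)
```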


In [None]:
import json
from sedona.maps.SedonaKepler import SedonaKepler

# This code will load our map configuration so it can be read by SedonaKepler.
with open('map-config/config.json') as f:
    # Load the JSON data into a dictionary
    map_config = json.load(f)

# These lines create the map with our configuration, add the dataframe, and then render the map.
map = SedonaKepler.create_map(config=map_config)
SedonaKepler.add_df(map, buildings, 'NYC Buildings')
map

## Final Project: 🏢 Analyzing Building Elevations in Central Park

This section explains how the following query calculates the average elevation of New York City buildings using a 1-ft Central Park digital elevation model (DEM).

### Code Overview

```python
buildings_elevation = sedona.sql(f'''
with a as (
select
buildings.PROP_ADDR as name,
buildings.geom,
avg(RS_ZonalStats(elevation.rast, st_transform(buildings.geom, 'epsg:4326','epsg:2263'), 1, 'mean', true)) as elevation
from buildings
join elevation
on RS_Intersects(elevation.rast, st_transform(buildings.geom, 'epsg:4326','epsg:2263'))
group by buildings.PROP_ADDR, buildings.geom)

select * from a where elevation > 0
''')
```

### 📊 Key Steps and Concepts

#### 1. Inputs
- **`elevation`**: A 1-ft resolution DEM of Central Park, providing high-precision elevation data.
- **`buildings`**: A dataset of all buildings in New York City, including geometry (`geom`) and property address (`PROP_ADDR`).

#### 2. Coordinate Transformation
- The building geometries are transformed from EPSG:4326 (geographic coordinates) to EPSG:2263 (New York State Plane coordinates) using `st_transform`. This ensures compatibility with the DEM raster.

#### 3. Zonal Statistics Calculation
- **`RS_ZonalStats`** computes the mean elevation for each building geometry based on the DEM:
  - **Raster Input**: `elevation.rast` (DEM raster file).
  - **Vector Geometry**: Transformed building geometries.
  - **Band**: The first band of the raster is used.
  - **Statistic**: `mean` calculates the average elevation within the building footprint.
  - **Ignore NoData**: `true` ensures invalid or missing data in the raster is excluded.

#### 4. Spatial Join
- **`RS_Intersects`** ensures only buildings intersecting the DEM raster are included in the analysis.

#### 5. Filtering Results
- The query filters out buildings with non-positive elevation values using `where elevation > 0`.

#### 6. Aggregation
- Elevation values are grouped by building address (`PROP_ADDR`) and geometry to compute the average elevation for each unique building.
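
The steps above can be sketched as a reusable query builder in which the statistic is a parameter. This is an illustration, not the notebook's own code: `zonal_stats_sql` is a hypothetical helper, and `ALLOWED_STATS` is an assumed subset of the statistic names `RS_ZonalStats` accepts (check the function reference for the full list).

```python
# Assumed subset of statistic names; not an exhaustive list
ALLOWED_STATS = {"mean", "min", "max", "sum", "count"}


def zonal_stats_sql(stat="mean", src_crs="epsg:4326", dst_crs="epsg:2263"):
    """Build the building-elevation query for a chosen zonal statistic."""
    if stat not in ALLOWED_STATS:
        raise ValueError(f"unsupported statistic {stat!r}; choose from {sorted(ALLOWED_STATS)}")
    # Reproject building footprints into the raster's coordinate system
    geom = f"ST_Transform(buildings.geom, '{src_crs}', '{dst_crs}')"
    return f"""
        SELECT
            buildings.PROP_ADDR AS name,
            buildings.geom,
            AVG(RS_ZonalStats(elevation.rast, {geom}, 1, '{stat}', true)) AS elevation
        FROM buildings
        JOIN elevation
            ON RS_Intersects(elevation.rast, {geom})
        GROUP BY buildings.PROP_ADDR, buildings.geom
    """


# Usage inside the notebook, once both temporary views exist:
# highest_points = sedona.sql(zonal_stats_sql("max"))
```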

### 📋 Output
The resulting DataFrame, `buildings_elevation`, contains:
- **`name`**: The property address of the building.
- **`geom`**: The building geometry.
- **`elevation`**: The average elevation of the building footprint in feet.

### 🌟 Practical Use
This analysis combines raster (DEM) and vector (building footprints) data to derive meaningful insights about urban infrastructure. For example, it can be used for:
- Identifying buildings at risk of flooding based on elevation.
- Urban planning and construction in areas with varying terrain.
- Environmental impact studies within Central Park and surrounding areas.
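
As an example of the flood-risk use case, the result can be narrowed to the lowest-lying buildings. This sketch uses standard DataFrame methods (`filter`, `orderBy`, `limit`). The helper name and any threshold you pass are hypothetical and are not a recommended flood-risk cutoff.

```python
def lowest_buildings(buildings_elevation, max_elevation_ft, limit=20):
    """Return buildings whose mean footprint elevation is below a threshold, lowest first."""
    if max_elevation_ft <= 0:
        raise ValueError("max_elevation_ft must be positive; the query already drops values <= 0")
    return (
        buildings_elevation
        .filter(f"elevation < {float(max_elevation_ft)}")  # SQL-style filter expression
        .orderBy("elevation")                              # ascending, lowest first
        .limit(limit)
    )


# Usage with the query result (10 ft is an arbitrary illustration value):
# lowest_buildings(buildings_elevation, 10).show()
```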


## Calculate Average Building Elevation Around Central Park

In [21]:
# First, we'll create our temporary view of the data
buildings.createOrReplaceTempView('buildings')

In [None]:
# Here, we'll check the Spatial Reference ID (SRID) of the raster file
sedona.sql('SELECT RS_SRID(rast) FROM elevation limit 1').show()

In [33]:
# Below is the query that was explained in the "Final Project: Analyzing Building Elevations in Central Park" section

buildings_elevation = sedona.sql(f'''with a as (
select
buildings.PROP_ADDR as name,
buildings.geom,
avg(RS_ZonalStats(elevation.rast, st_transform(buildings.geom, 'epsg:4326','epsg:2263'), 1, 'mean', true)) as elevation
from buildings
join elevation
on RS_Intersects(elevation.rast, st_transform(buildings.geom, 'epsg:4326','epsg:2263'))
group by buildings.PROP_ADDR, buildings.geom)

select * from a where elevation > 0
''')

In [None]:
# A quick check of our data using the .show() method on the DataFrame with the query results
buildings_elevation.show()

In [None]:
# Then, load our map using the new map configuration file.

with open('map-config/central_park_config.json') as f:
    # Load the JSON data into a dictionary
    park_config = json.load(f)

map = SedonaKepler.create_map(config=park_config)
SedonaKepler.add_df(map, buildings_elevation, 'NYC Buildings')
map

## 🎯 Where to Go From Here

Congratulations on completing this notebook!

You’ve learned how to:
- Integrate raster and vector data for advanced geospatial analysis.
- Perform zonal statistics to derive meaningful insights from elevation data.
- Use Apache Sedona SQL to manipulate and query spatial data efficiently.

### 🛠️ Experiment with Different Data Sources
- Use additional raster datasets, such as vegetation indices or precipitation maps, to enhance your analysis.
- Incorporate demographic or socioeconomic vector datasets to explore spatial relationships.

> Did you know that Wherobots lets you run [NDVI analysis](https://docs.wherobots.com/latest/api/wherobots-compute/sql/Raster-map-algebra/?h=ndvi#ndvi) and use [Overture Maps data](https://docs.wherobots.com/latest/tutorials/opendata/introduction/?h=overture#open-data-catalogs) from Wherobots DB?

### 🔍 Try Advanced Apache Sedona Features
- Explore Sedona’s spatial join capabilities to analyze relationships between multiple vector datasets.
- Use Sedona’s advanced functions, like `ST_Buffer` or `ST_Within`, for proximity and containment analysis.

> Check out our full function reference for [Apache Sedona here](https://docs.wherobots.com/latest/references/wherobotsdb/vector-data/Overview/).
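
As a hedged illustration of proximity analysis, the sketch below builds a query for buildings that lie entirely within a given distance of a point. It assumes the `buildings` view from this notebook, reprojects to EPSG:2263 so the buffer radius is expressed in feet, and uses a hypothetical helper name and an arbitrary radius.

```python
def buildings_near_point_sql(lon, lat, radius_ft, src_crs="epsg:4326", dst_crs="epsg:2263"):
    """Build a query for buildings fully inside a circular buffer around a point."""
    if radius_ft <= 0:
        raise ValueError("radius_ft must be positive")
    # Reproject both geometries so ST_Buffer works in feet rather than degrees
    point = f"ST_Transform(ST_Point({lon}, {lat}), '{src_crs}', '{dst_crs}')"
    geom = f"ST_Transform(geom, '{src_crs}', '{dst_crs}')"
    return (
        f"SELECT PROP_ADDR AS name, geom FROM buildings "
        f"WHERE ST_Within({geom}, ST_Buffer({point}, {radius_ft}))"
    )


# Usage inside the notebook (Strawberry Fields coordinates; 500 ft is an arbitrary radius):
# sedona.sql(buildings_near_point_sql(-73.9751781, 40.7756813, 500)).show()
```

Swapping `ST_Within` for `ST_Intersects` would also return buildings that only partly overlap the buffer.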

## 📝 Coming Soon: Tips for Loading Raster and Vector Data

In the next notebook, we'll dive deeper into the best practices for efficiently loading raster and vector data into DataFrames for advanced geospatial analysis.

The next notebook will cover:

* 🌐 **Optimizing Vector DataFrames**
  - Step-by-step guides for loading vector data types into Apache Sedona DataFrames.
  - Steps to ensure your DataFrames are performant in Wherobots.

* 🗺️ **Optimizing Raster Data**
  - Learn how to use Out-of-Database Rasters stored in remote cloud storage buckets (e.g., AWS Open Earth Data).
  - You'll complete the following steps:
    1. Create a new raster dataset from a remote file.
    2. Explode and divide your Out-DB raster into tiles to optimize query performance.

* 🛠️ **Preparing for Spatial Joins**
  - We’ll cover the best strategies for combining raster and vector datasets to answer complex geospatial questions.

With this foundation, you’ll be fully equipped to manage and query large-scale spatial datasets in Wherobots.