## Intermediate Parallel Computing

### Segment 5 of 5

### PySpark SQL III: Spatial is Special!

#### In this segment we will learn:
* Apache Sedona
* Querying spatial data with PySpark SQL.


*Lesson Developer: Mohsen Ahmadkhani, ahmad178@umn.edu*

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Why Distributed Spatial Computing?

<center><img src=https://media.makeameme.org/created/why-even-bother-5c8eb2.jpg width=300></center>
In recent years, spatial technology has evolved tremendously, resulting in BIG spatial data. Examples of spatial big data include location-based services such as Uber, Lyft, and scooter-sharing companies; remote sensing data; data from social networks such as Twitter and Facebook; weather maps; transportation data; and countless others. Handling such a BIG load of spatial data calls for <b>faster</b> data management technologies, and parallel computing can deliver that speed!

By the way, if you are curious about spatial big data, we talked about it in the <a href="http://try.hourofci.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhourofci%2Flessons&urlpath=tree%2Flessons%2Fintermediate-lessons%2Fbig-data%2FWelcome.ipynb&branch=master">intermediate Big Data</a> lesson. So, take a look if you have not already.   


## Apache Sedona

So far, we have briefly seen how Apache Spark works. Like most data management technologies, Apache Spark was first developed for non-spatial data. Why? Well, because <b>spatial is special!</b> Due to the nature of spatial data types, they are much more complex and therefore harder to store and analyze. 
To support spatial data types, an extension to Spark named <b><a href="http://sedona.apache.org">Apache Sedona</a></b> was developed. 

Apache Sedona (formerly GeoSpark) is a powerful tool that extends RDDs to geospatial RDDs (aka SpatialRDD). In simple terms, Apache Sedona enables two major things: 
<ol>
    <li>
Distributing geospatial data between multiple computational cores 
    </li>
    <li>
Spatial functions and queries in SparkSQL
    </li>
</ol>

In this segment we touch on Apache Sedona and see how it works. 
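
To make the first point more concrete, here is a minimal pure-Python sketch of the idea behind spatial partitioning: features that are close together are grouped into the same partition, so that each core can work on its own chunk. This is only an illustration. The grid scheme, the helper name `grid_partition`, and the sample points are assumptions made for this example, not Sedona's actual implementation.

```python
from collections import defaultdict


def grid_partition(points, cell_size):
    """Group (lon, lat) points by the grid cell they fall in.

    Hypothetical helper for illustration only; Sedona uses its own
    partitioning schemes internally.
    """
    partitions = defaultdict(list)
    for lon, lat in points:
        # Floor division gives an integer cell index, also for negative values.
        cell = (int(lon // cell_size), int(lat // cell_size))
        partitions[cell].append((lon, lat))
    return dict(partitions)


# Made-up sample points: two near St Paul, one near Seattle, one near Chicago.
points = [(-93.09, 44.95), (-93.20, 44.98), (-122.33, 47.61), (-87.63, 41.88)]
partitions = grid_partition(points, cell_size=10)

# Each grid cell could now be handed to a different core.
for cell, members in sorted(partitions.items()):
    print(cell, len(members))
```

The two St Paul points end up in the same cell, so a query about that neighborhood would only need to look at one partition.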

First, we need to import `SparkSession` from Spark SQL, which is the entry point to the Spark framework. Then we need to import a few sub-modules from Sedona. But before that, we need to install the `apache-sedona` package. Let's do that in the next cells. 

In [None]:
%pip install apache-sedona --quiet

Now, click the "Restart Kernel" button below to restart the kernel so that the newly installed packages can be imported.

In [None]:
def restarter():
    display(HTML(
        '''
            <script>
                code_show = false;
                function restart_kernel(){
                    IPython.notebook.kernel.restart();
                }
            </script>
            <button onclick="restart_kernel()">Restart Kernel</button>
        '''
    ))
restarter()

In [None]:
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

Now let's import a couple of general packages for visualization. 

In [None]:
import geopandas as gpd
from ipyleaflet import Map, GeoData

Now, we create a spatially enabled Spark session using the imported Sedona sub-modules. Don't worry too much if it looks complicated! You can copy and paste this for your project :))

In [None]:
spark = SparkSession.\
    builder.\
    master("local[*]").\
    appName("Spatial Spark Demo").\
    config("spark.serializer", KryoSerializer.getName).\
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName) .\
    config("spark.jars.packages", "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating,org.datasyslab:geotools-wrapper:1.1.0-25.2") .\
    getOrCreate()

SedonaRegistrator.registerAll(spark)
sc = spark.sparkContext


## Downloading demo shapefiles to illustrate Apache Sedona

In this segment, we will use two shapefiles to demonstrate the functionality of Apache Sedona: a dataset of rivers and a dataset of US state boundaries.

Ok, let's download the dataset of rivers and lake centerlines for the entire world using `wget`. This is the same dataset we used <a href="http://try.hourofci.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhourofci%2Flessons&urlpath=tree%2Flessons%2Fbeginner-lessons%2Fgeospatial-data%2Fgd-example_1.ipynb&branch=master">here</a> in the beginner geospatial data lesson. You can learn more about this dataset there. 

Run the cell below to download the dataset as a zip file and then extract the shapefile using `unzip`. 

In [None]:
!wget -O ne_10m_rivers_lake_centerlines.zip https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/physical/ne_10m_rivers_lake_centerlines.zip 
!unzip -n ne_10m_rivers_lake_centerlines.zip

Let's read in the shapefile using geopandas and take a look at the first few rows. 

In [None]:
rivers = gpd.read_file('ne_10m_rivers_lake_centerlines.shp')
rivers = rivers[['featurecla', 'name', 'name_alt', 'rivernum', 'geometry']]
rivers.head()

Nice, there is a `geometry` column there too! That's what makes it a spatial dataset. Now let's plot it and see the line features of rivers and lake centerlines on the map. 


In [None]:
rivers_layer = GeoData(geo_dataframe = rivers, style={'color':'blue'})
mymap1 = Map(center=(40,10), zoom = 2)
mymap1.add_layer(rivers_layer)
mymap1

Interesting! 

Now we will do the same for the `US states` shapefile. This is a shapefile of the US states' boundaries downloaded from the <a href="https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html">US Census Bureau</a>.

In [None]:
!wget -O us_states.zip https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_20m.zip 
!unzip -n us_states.zip

In [None]:
states = gpd.read_file('us_states.zip')
states.head()

In [None]:
states_layer = GeoData(geo_dataframe = states, style={'color':'red'})
mymap2 = Map(center=(40,-100), zoom = 4)
mymap2.add_layer(states_layer)
mymap2

## Converting GeoPandas to Apache Sedona

So far, the shapefiles have been loaded as GeoPandas dataframes. To enable parallel computation, we need to convert them to Spark dataframes. This is easy! Just use Spark's `createDataFrame` method as shown below. 

Here we also print the dataframe's schema to see what columns and data types we have. 

In [None]:
states_spdf = spark.createDataFrame(states)
states_spdf.printSchema()


Notice that at the very bottom of the schema there is a `geometry` data type! That's what Sedona brought us. 

Next we use the `show` method to see the first few rows of the Spark dataframe. 

In [None]:
states_spdf.show(3)

And the same process applies to the rivers dataset. 

In [None]:
rivers_spdf = spark.createDataFrame(rivers)
rivers_spdf.printSchema()


In [None]:
rivers_spdf.show(3)

## Creating SQL Views (Virtual Tables)

As with non-spatial data, we need to create a SQL View for each of our dataframes. This step is required before we can query the data with SQL; it creates a virtual relation (table) for each dataframe. Below, we use the `createOrReplaceTempView` method to create two Views named `rivers` and `states`. We will query these two tables. 

In [None]:
rivers_spdf.createOrReplaceTempView("rivers")
states_spdf.createOrReplaceTempView("states")
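
If the idea of a "virtual table" feels abstract, the small sketch below may help. It uses Python's built-in `sqlite3` module purely as a stand-in for Spark SQL, so the table name `states_table`, the view name `midwest_states`, and the toy rows are all assumptions made for illustration. The point it tries to show is that a view is a named query that can be selected from like a table, without copying the underlying rows.

```python
import sqlite3

# In-memory database used only for this illustration (not Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states_table (name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO states_table VALUES (?, ?)",
    [("Minnesota", "Midwest"), ("Washington", "West"), ("Iowa", "Midwest")],
)

# A view stores the query itself, not a copy of the rows.
conn.execute(
    "CREATE VIEW midwest_states AS "
    "SELECT name FROM states_table WHERE region = 'Midwest'"
)

# The view can now be queried like any other table.
midwest = [row[0] for row in conn.execute(
    "SELECT name FROM midwest_states ORDER BY name"
)]
print(midwest)
conn.close()
```

In Spark, `createOrReplaceTempView` plays a similar role: it gives a dataframe a name that SQL queries can refer to.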


### A Demo Spatial Query 
As an example, suppose we want to see which rivers and lake centerlines cross the boundaries of Minnesota and Washington states. This is indeed a spatial query as we care about the geographical (topological) relations of features. 

One way to write this query is shown below. In this SQL query, we select the state and river names along with the rivers' geometry. 

Under the `WHERE` clause, we keep only the rows where the state name is either Minnesota or Washington and the river's geometry intersects the state's polygon. 

```sql
SELECT s.NAME, r.name river_name, r.geometry geom
FROM states s, rivers r
WHERE s.NAME IN ('Minnesota', 'Washington') and ST_INTERSECTS(r.geometry, s.geometry)
```
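
To build some intuition for what an "intersects" test involves, here is a rough pure-Python sketch for a line (such as a river) and a simple polygon (such as a state). It is a simplified illustration under stated assumptions: the helper names are made up for this example, the polygon has no holes, and real spatial engines use far more robust and optimized algorithms.

```python
def orientation(p, q, r):
    """Return 1, -1, or 0 depending on the turn direction p -> q -> r."""
    val = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (val > 0) - (val < 0)


def in_bbox(p, q, r):
    """True if point r lies inside the bounding box of segment pq."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if segment ab and segment cd share at least one point."""
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases where an endpoint lies on the other segment.
    return ((o1 == 0 and in_bbox(a, b, c)) or (o2 == 0 and in_bbox(a, b, d))
            or (o3 == 0 and in_bbox(c, d, a)) or (o4 == 0 and in_bbox(c, d, b)))


def point_in_polygon(pt, ring):
    """Ray-casting test; ring is a list of vertices, closed implicitly."""
    x, y = pt
    inside = False
    for i in range(len(ring)):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def line_intersects_polygon(line, ring):
    """True if the polyline crosses the boundary or lies inside the polygon."""
    edges = [(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring))]
    for a, b in zip(line, line[1:]):
        if any(segments_intersect(a, b, c, d) for c, d in edges):
            return True
    # No boundary crossing: the line is entirely inside or entirely outside.
    return point_in_polygon(line[0], ring)


# Toy data: a square "state" and three short "rivers".
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
crossing = line_intersects_polygon([(-1, 2), (2, 2)], square)
inside = line_intersects_polygon([(1, 1), (2, 2)], square)
outside = line_intersects_polygon([(5, 5), (6, 6)], square)
print(crossing, inside, outside)
```

A river that crosses the boundary and a river that lies fully inside the polygon both count as intersecting, while a river that stays outside does not.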

Let's execute this spatial query together:

In [None]:
mn_rivers = spark.sql("""
SELECT s.NAME, r.name river_name, r.geometry geom
FROM states s, rivers r
WHERE s.NAME IN ('Minnesota', 'Washington') and ST_INTERSECTS(r.geometry, s.geometry)
""")

mn_rivers.show()


### Note
Friends, spatial queries like this one can take much longer if the data is not partitioned across multiple cores. Please note that the benefit of partitioning is more significant for larger datasets, because "distributing" the data between multiple cores takes time in itself. Hence, for small datasets we usually do not use parallel computing. 
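
The trade-off in this note can be sketched with a toy cost model. All numbers below are made up for illustration, and the model assumes a fixed distribution overhead and a perfectly even split of work across cores, which real systems only approximate.

```python
def serial_time(n_features, seconds_per_feature):
    """Time to process every feature on a single core."""
    return n_features * seconds_per_feature


def parallel_time(n_features, seconds_per_feature, cores, overhead_seconds):
    """Fixed distribution overhead plus the work split evenly across cores."""
    return overhead_seconds + n_features * seconds_per_feature / cores


# Made-up numbers purely for illustration.
cores = 4
seconds_per_feature = 0.001
overhead_seconds = 3.0

for n in (100, 10_000, 1_000_000):
    s = serial_time(n, seconds_per_feature)
    p = parallel_time(n, seconds_per_feature, cores, overhead_seconds)
    faster = "parallel" if p < s else "serial"
    print(f"{n:>9} features: serial {s:8.2f}s, parallel {p:8.2f}s -> {faster}")

# Break-even size: overhead = n * t * (1 - 1/cores)
break_even = overhead_seconds / (seconds_per_feature * (1 - 1 / cores))
print(f"Parallel starts to pay off above roughly {break_even:.0f} features")
```

Under these assumed numbers, the overhead dominates for small inputs, and the parallel version only wins once the dataset grows past the break-even size.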

## Converting Apache Sedona to GeoPandas

Ok, we queried our data and got back all the rivers that meet the conditions we set. Now, to visualize them on a map, we need to convert the result back to a GeoPandas dataframe. To do this, we simply use the `toPandas` method and wrap the result in a `GeoDataFrame`. 

In [None]:
mn_rivers_df = mn_rivers.toPandas()
result = gpd.GeoDataFrame(mn_rivers_df, geometry="geom")
result

And here is the result on a map! 

In [None]:
wa_mn = states[(states['NAME']=='Minnesota') | (states['NAME']=='Washington')]
wa_mn_layer = GeoData(geo_dataframe = wa_mn, style={'color':'red'})
gdf_layer = GeoData(geo_dataframe = result, style={'color':'blue'})
gdf_map = Map(center=(40,-100), zoom = 4)
gdf_map.add_layer(gdf_layer)
gdf_map.add_layer(wa_mn_layer)
gdf_map

### One Last Example

>**QUESTION:** Which cafes in St Paul, MN are within 50 meters of the Mississippi River?

To answer this question we need two datasets and their corresponding SQL relations (i.e., Views): the rivers dataset and the coffee shop dataset. 

We have already created the rivers SQL View. Below, we download the point dataset of cafes in St Paul from OpenStreetMap using the OSMnx library and convert it to an Apache Sedona SQL View. 


In [None]:
import osmnx as ox 

place = 'St Paul, MN'
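# OSM tags are combined with OR: amenity=cafe or cuisine=coffee_shop
tags = {'amenity':'cafe', 'cuisine':'coffee_shop'}  
# Note: newer OSMnx releases rename this function to `features_from_place`
coffee_shops = ox.geometries_from_place(place, tags) 
coffee_shops = coffee_shops.to_crs('epsg:4326')[['name', 'geometry']]
# Keep only the point features
coffee_shops = coffee_shops[coffee_shops.geometry.geom_type == 'Point']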
coffee_shops.head()

In [None]:
coffee_shops_layer = GeoData(geo_dataframe = coffee_shops, point_style={'color': 'black'})
mymap3 = Map(center=(44.96,-93.13), zoom = 11)
mymap3.add_layer(coffee_shops_layer)
mymap3

Ok, now let's create the SQL View of the cafes dataset as follows. 

In [None]:
coffee_shops_spdf = spark.createDataFrame(coffee_shops)
coffee_shops_spdf.createOrReplaceTempView("cafes")


### Solution
One way to answer the question is the following spatial query. We first build a buffer of 50 meters around each cafe and then check which buffers intersect the Mississippi River. 

```sql
SELECT c.name cafe, c.geometry as geom
FROM cafes c, rivers r
WHERE r.name = 'Mississippi' and ST_INTERSECTS(ST_TRANSFORM(r.geometry, 'epsg:4326','epsg:2180'), ST_BUFFER(ST_TRANSFORM(c.geometry, 'epsg:4326','epsg:2180'), 50))
```


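In this query we used the `ST_TRANSFORM` function to reproject our data from geographic coordinates (`epsg:4326`, in degrees) to a projected coordinate system that uses meters as its length unit. Then we applied the `ST_BUFFER` function to build a buffer of 50 meters around each cafe. Finally, we used `ST_INTERSECTS` to check whether the Mississippi River intersects each buffer. Note that `epsg:2180` is a projected system defined for Poland, so distances computed with it in Minnesota can be heavily distorted; for accurate results, a local system such as `epsg:32615` (WGS 84 / UTM zone 15N) would be a better fit. Let's execute this query in the next cell. 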
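
To see why reprojecting to a meter-based system matters, the sketch below estimates how long one degree is on the ground near St Paul and which UTM zone covers the city. It is a rough illustration that assumes a spherical Earth; the helper names `meters_per_degree` and `utm_zone` are made up for this example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def meters_per_degree(lat_deg):
    """Approximate ground length of one degree of latitude and of longitude."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon


def utm_zone(lon_deg):
    """Standard 6-degree UTM zone number for a longitude."""
    return int((lon_deg + 180) // 6) + 1


lat, lon = 44.96, -93.13  # map center used for St Paul in this lesson
m_lat, m_lon = meters_per_degree(lat)
print(f"1 degree of latitude  ~ {m_lat:,.0f} m")
print(f"1 degree of longitude ~ {m_lon:,.0f} m")
print(f"50 m ~ {50 / m_lat:.6f} degrees north-south, "
      f"{50 / m_lon:.6f} degrees east-west")

zone = utm_zone(lon)
print(f"UTM zone {zone}N, e.g. EPSG:{32600 + zone} (WGS 84 / UTM)")
```

Because a degree of longitude is noticeably shorter than a degree of latitude at this location, a buffer expressed in degrees would not be a 50-meter circle on the ground, which is why the query transforms the geometries first.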

In [None]:
riverside_cafes = spark.sql("""
SELECT c.name cafe, c.geometry as geom
FROM cafes c, rivers r
WHERE r.name = 'Mississippi' and ST_INTERSECTS(ST_TRANSFORM(r.geometry, 'epsg:4326','epsg:2180'), ST_BUFFER(ST_TRANSFORM(c.geometry, 'epsg:4326','epsg:2180'), 50))
""")

riverside_cafes.show()

Sounds good! Now you know which cafe to go to when you crave a triple-shot espresso with a view of the Mississippi River in St Paul!!

Do you want to see them on the map?! Run the cell below. 

The queried cafes are colored red and the others remain black. Also, the rivers from our earlier query, including the Mississippi, are displayed in blue. 

In [None]:

riverside_cafesdf = riverside_cafes.toPandas()
riverside_cafesdf = gpd.GeoDataFrame(riverside_cafesdf, geometry="geom")
riverside_cafes_layer = GeoData(geo_dataframe = riverside_cafesdf, point_style={'color': 'red'})

coffee_shops_layer = GeoData(geo_dataframe = coffee_shops, point_style={'color': 'black'})

mymap4 = Map(center=(44.96,-93.13), zoom = 11)
mymap4.add_layer(coffee_shops_layer)
mymap4.add_layer(gdf_layer)
mymap4.add_layer(riverside_cafes_layer)
mymap4

## What Else?

Apache Sedona provides dozens of other spatial functions, such as centroid, distance, transformation, buffer, and many more, which we cannot cover here. 

You can see a list of all available spatial functions at https://sedona.apache.org/api/sql/Function. 
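
As a starting point for exploring, here is a small sketch of how a few of these functions might be tried against the `states` and `rivers` Views created earlier. The queries and the helper `run_examples` are hypothetical examples written for this lesson; `ST_Centroid`, `ST_Area`, and `ST_Length` are Sedona SQL functions, and because the geometries use longitude/latitude coordinates, the area and length values come out in degree-based units rather than meters.

```python
# Hypothetical example queries against the `states` and `rivers` Views.
example_queries = {
    "centroid": "SELECT NAME, ST_Centroid(geometry) AS center FROM states",
    "area": "SELECT NAME, ST_Area(geometry) AS area_sq_degrees FROM states",
    "length": "SELECT name, ST_Length(geometry) AS length_degrees FROM rivers",
}


def run_examples(spark, queries=example_queries, n_rows=3):
    """Run each query on a Sedona-enabled SparkSession and show a few rows."""
    for label, query in queries.items():
        print(label)
        spark.sql(query).show(n_rows)


# Usage, in a notebook where `spark` was created as in this lesson:
# run_examples(spark)
```

To get areas or lengths in meters, the geometries would first need to be reprojected with `ST_Transform`, as in the cafe example.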

# Congratulations!


**You have finished an Hour of CI!**


But, before you go ... 

1. Please fill out a very brief questionnaire to provide feedback and help us improve the Hour of CI lessons. It is fast and your feedback is very important to let us know what you learned and how we can improve the lessons in the future.
2. If you would like a certificate, then please type your name below and click "Create Certificate" and you will be presented with a PDF certificate.

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="https://forms.gle/JUUBm76rLB8iYppN7">Take the questionnaire and provide feedback</a></font>

In [None]:

# This code cell loads the Interact Textbox that will ask users for their name
# Once they click "Create Certificate" then it will add their name to the certificate template
# And present them a PDF certificate
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw

from ipywidgets import interact

def make_cert(learner_name, lesson_name):
    cert_filename = 'hourofci_certificate.pdf'

    img = Image.open("../../supplementary/hci-certificate-template.jpg")
    draw = ImageDraw.Draw(img)

    cert_font   = ImageFont.truetype('../../supplementary/cruft.ttf', 150)
    cert_fontsm = ImageFont.truetype('../../supplementary/cruft.ttf', 80)
    
    _,_,w,h = cert_font.getbbox(learner_name)  
    draw.text( xy = (1650-w/2,1100-h/2), text = learner_name, fill=(0,0,0),font=cert_font)
    
    _,_,w,h = cert_fontsm.getbbox(lesson_name)
    draw.text( xy = (1650-w/2,1100-h/2 + 750), text = lesson_name, fill=(0,0,0),font=cert_fontsm)
    
    img.save(cert_filename, "PDF", resolution=100.0)   
    return cert_filename


interact_cert=interact.options(manual=True, manual_name="Create Certificate")

@interact_cert(name="Your Name")
def f(name):
    print("Congratulations",name)
    filename = make_cert(name, 'Intermediate Parallel Computing')
    print("Download your certificate by clicking the link below.")


<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="hourofci_certificate.pdf?download=1" download="hourofci_certificate.pdf">Download your certificate</a></font>