## Intermediate Parallel Computing

### Segment 6 of 6

# Exploration: A Coffeeholic Tourist's Problem!

### After completing this exploration you will learn:
>1. when you should avoid parallelism. 
>2. which cafe to visit when you crave a triple-shot espresso with a view of the Mississippi River in St. Paul :))

*Lesson Developer: Mohsen Ahmadkhani, ahmad178@umn.edu*

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display, Latex, Markdown
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout, VBox, Button
import warnings
warnings.filterwarnings('ignore') # Hide warnings
import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

class Output:
    def __init__(self, name='cmd_out'):
        self.h = display(display_id=name)
        self.content = ''
        self.mime_type = None
        self.dic_kind = {
            'text': 'text/plain',
            'markdown': 'text/markdown',
            'html': 'text/html',
        }
        
    def display(self):
        self.h.display({'text/plain': ''}, raw=True)
        
    def _build_obj(self, content, kind, append, new_line):
        self.mime_type = self.dic_kind.get(kind)
        if not self.mime_type:
            return content, False
        if append:
            sep = '\n' if new_line else ''
            self.content = self.content + sep + content
        else:
            self.content = content
        return {self.mime_type: self.content}, True
        
    def update(self, content, kind=None, append=False, new_line=True):
        obj, raw = self._build_obj(content, kind, append, new_line)
        self.h.update(obj, raw=raw)
        

def cont(gpd_time, sedona_time):
    return f'''*** \n <font style="font-size: large; color: black;">
Comparing the two execution times, GeoPandas took <font size=5 color='red'>{round(gpd_time, 3)}</font> seconds while 
Apache Sedona did it in <font size=5 color='red'>{round(sedona_time, 3)}</font> seconds. It means parallelizing this operation with Spark SQL made 
the spatial operation <font size=5 color='red'>{round(sedona_time/gpd_time, 2)}</font> times <font size=5 color='gold'>SLOWER.</font> Think about it 
for a few moments and write your thoughts in the text area below.</font>
'''

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')



## Outline
In this segment, we will explore a case where our datasets are too small for parallelism to pay off. 

For this exploration, we will take the following steps: 
><ol>
    <li>
        Demo Problem 
    </li>
    <li>
        Data Collection
    </li>
    <li>
        Spatial Operation With <b>GeoPandas</b> (Not Parallelized)
    </li>
    <li>
        Spatial Operation With <b> Apache Sedona</b>  (Parallelized)
    </li>
    <li>
        Discussion and Conclusion 
    </li>
</ol>
    
Let's go!



# 1. Demo Problem:

>## Which cafe in St. Paul, MN is nearest to the Mississippi River?


First, let's import the required packages and build our spatially enabled Spark Session. 

In [None]:
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
import geopandas as gpd
from ipyleaflet import Map, GeoData, AwesomeIcon, Marker

In [None]:
spark = SparkSession.\
    builder.\
    master("local[*]").\
    appName("Spatial Spark Demo").\
    config("spark.serializer", KryoSerializer.getName).\
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName) .\
    config("spark.jars.packages", "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating,org.datasyslab:geotools-wrapper:1.1.0-25.2") .\
    getOrCreate()



SedonaRegistrator.registerAll(spark)
sc = spark.sparkContext


# 2. Data Collection

To solve this problem, we will need the following datasets: 
1. Rivers and lake centerlines
    * We already downloaded this dataset in the previous segment, so we'll just load it in.
2. Cafes in St. Paul
    * We download this dataset from OpenStreetMap using the `osmnx` library.
    
The code in the next few cells collects and prepares the required datasets. 

In [None]:
rivers = gpd.read_file('ne_10m_rivers_lake_centerlines.shp')
rivers = rivers[['name', 'geometry']].dropna(subset=['geometry']).to_crs('epsg:3857')
rivers.head()

Download the point dataset of coffee shops in the city of St. Paul using the `osmnx` package. According to its <a href='https://osmnx.readthedocs.io/en/stable/'>documentation</a>, 
><i>"OSMnx is a Python package that lets you download geospatial data from OpenStreetMap and model, project, visualize, and analyze real-world street networks and any other geospatial geometries".</i>

In [None]:
import osmnx as ox 

place = 'St Paul, MN'
tags = {'amenity':'cafe', 'cuisine':'coffee-shop'}  
coffee_shops = ox.geometries_from_place(place, tags) 
coffee_shops = coffee_shops[['name', 'geometry']].dropna(subset=['geometry']).to_crs('epsg:3857')
coffee_shops = coffee_shops[coffee_shops['geometry'].type == 'Point']
coffee_shops.head()

Let's look at the coffee shops.

In [None]:
coffee_shops_layer = GeoData(geo_dataframe = coffee_shops.to_crs(4326), point_style={'color': 'black'})
mymap3 = Map(center=(44.96,-93.13), zoom = 11)
mymap3.add_layer(coffee_shops_layer)
mymap3


# 3. Spatial Operation With GeoPandas (Not Parallelized)



In [None]:
import time
start_time = time.time()

# keep only the Mississippi River from the rivers dataset and merge its segments into a single geometry
mississipi = gpd.GeoSeries(rivers[rivers['name']=='Mississippi']['geometry']).unary_union 
dists = []
for c in coffee_shops['geometry']: # measure distance from the river to every cafe through a for loop
    dists.append(mississipi.distance(c)) # make a list of distances 
    

coffee_shops['dists'] = dists # add the distances to the original cafe's dataframe as a new column named "dists"

nearest_cafes = coffee_shops.sort_values(by='dists').iloc[0:1,:] # sort the table by distances and select the first row

gpd_time = time.time() - start_time
print(f"Execution time for GeoPandas: {gpd_time}")
nearest_cafes

# 4. Spatial Operation With Apache Sedona (Parallelized)

Here we follow a process similar to the one in the previous segment.

In [None]:
rivers_spdf = spark.createDataFrame(rivers)
rivers_spdf.printSchema()

In [None]:
coffee_shops_spdf = spark.createDataFrame(coffee_shops)
coffee_shops_spdf.printSchema()

### Creating SQL Views

Next, we create two views named `rivers` and `cafes`. We will run our query against these two views. 

In [None]:
rivers_spdf.createOrReplaceTempView("rivers")
coffee_shops_spdf.createOrReplaceTempView("cafes")

### Solution SQL Query
One way to answer the question is the following spatial query. 
>```sql
>SELECT c.name, ST_DISTANCE(r.geometry, c.geometry) AS dist, c.geometry
>FROM cafes AS c, rivers AS r
>WHERE r.name = 'Mississippi' 
>ORDER BY dist ASC
>LIMIT 1
>```


* In this query, we select the `name` of each cafe along with its distance to the Mississippi River and its geometry (coordinate information). To calculate the distances we apply the `ST_DISTANCE` spatial function and name the resulting column `dist`. 

* In the `WHERE` clause, we specified the name of the river of interest. 

* The first three lines of this query will return a table of all coffee shops with their distances from the river. 
* Using the `ORDER BY dist ASC` clause we sort the result by the `dist` column in ascending (`ASC`) order. 
* The very first row of this sorted table will be the cafe that has the shortest distance to the river (our answer). `LIMIT 1` returns this first row.
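
To make the logic of this query concrete, here is a minimal plain-Python sketch of the same three steps: compute a distance for every cafe, sort in ascending order, and keep the first row. It does not use Sedona or GeoPandas. The river coordinates, the cafe names, and the helper functions `point_segment_distance` and `point_line_distance` are all hypothetical and exist only for illustration.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all are (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b are the same point
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_line_distance(p, line):
    """Shortest distance from point p to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))

# Hypothetical data: a river made of two straight segments and three cafes.
river = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
cafes = {"Cafe A": (5.0, 3.0), "Cafe B": (12.0, 5.0), "Cafe C": (2.0, 8.0)}

# SELECT name, distance AS dist for every cafe
rows = [(name, point_line_distance(xy, river)) for name, xy in cafes.items()]
# ORDER BY dist ASC
rows.sort(key=lambda row: row[1])
# LIMIT 1
nearest = rows[0]
print(nearest)
```

With only a handful of rows, this is a tiny amount of work, which is worth keeping in mind when comparing execution times later.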

Let's execute this query and record the execution time in the next cell. 

In [None]:
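import time
start_time = time.time()

nearest_cafe = spark.sql("""
SELECT c.name, ST_DISTANCE(r.geometry, c.geometry) AS dist, c.geometry
FROM cafes AS c, rivers AS r
WHERE r.name = 'Mississippi' 
ORDER BY dist ASC
LIMIT 1
""")
# Spark evaluates lazily: spark.sql() only builds a query plan.
# collect() forces the query to run so that its execution is included in the timing.
nearest_cafe_rows = nearest_cafe.collect()
sedona_time = time.time() - start_time
print(f"Execution time for Apache Sedona: {sedona_time}")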


In [None]:
display(Markdown('<b>Did you expect this result regarding the execution times?</b>'))
o1='Yes'
o2='No'
widget1 = widgets.RadioButtons(
    options = [o1, o2],
    description = ' ', style={'description_width': 'initial'},
    value = None
)
execute = Button(
    description='Submit',
    disabled=False,
    button_style='success',
)

children1 = [widget1, execute]
vbox = VBox(children=children1)
out = Output(name='cmd_out')
def cmd(b):
    print('Submitted!')
    return out.update(Markdown(cont(gpd_time, sedona_time)))

execute.on_click(cmd)
display(vbox)


### Show the result of the query

In [None]:
nearest_cafe.show()

# 5. Discussion and Conclusion


In [None]:
out = Output(name='cmd_out')

out.display()


In [None]:
w = widgets.Textarea(
            value='',
            placeholder='Write your thoughts here...',
            description='',
            disabled=False,
            layout=Layout( height='200px', min_height='100px', width='900px')
            )


def out3():
    print('Great! Move on to see the answer.')
    
display(w)
hourofci.SubmitBtn2(w, out3)

As you might have guessed, parallelization did not run faster here because our datasets were too small. 

In this exploration we dealt with about 50 coffee shops and a single river, and performed a rather simple spatial operation: computing Euclidean distance. For a process this simple, GeoPandas can run much faster because it does not spend time partitioning the input data among multiple worker nodes. Parallelization and distributed computing show much more significant improvements when working with voluminous datasets. 
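
One way to reason about this trade-off is a toy cost model in which the parallel run pays a fixed overhead before splitting the work among workers. The sketch below is only an illustration: `PER_ITEM`, `OVERHEAD`, and `WORKERS` are made-up values, not measurements from this notebook, and the function names are hypothetical.

```python
def serial_time(n_items, per_item):
    """Time to process every item one after another."""
    return n_items * per_item

def parallel_time(n_items, per_item, workers, overhead):
    """Fixed overhead (scheduling, partitioning) plus the work split evenly across workers."""
    return overhead + (n_items * per_item) / workers

def break_even_items(per_item, workers, overhead):
    """Number of items at which the parallel and serial times are equal (needs workers > 1)."""
    return overhead / (per_item * (1 - 1 / workers))

# Hypothetical values chosen only for illustration.
PER_ITEM = 0.0001  # seconds per distance computation
OVERHEAD = 0.5     # seconds of fixed cost for the parallel run
WORKERS = 4

for n in (50, 5_000, 500_000):
    s = serial_time(n, PER_ITEM)
    p = parallel_time(n, PER_ITEM, WORKERS, OVERHEAD)
    winner = "serial" if s < p else "parallel"
    print(f"{n:>7} items: serial={s:.4f}s parallel={p:.4f}s -> {winner} is faster")

print(f"Break-even at roughly {break_even_items(PER_ITEM, WORKERS, OVERHEAD):.0f} items")
```

Under these assumed numbers, 50 items are far below the break-even point, so the fixed overhead dominates and the serial run wins.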

In the <a href="../../beginner-lesson/parallel-computing/pc-1.ipynb">beginner Parallel Computing lesson</a>, we also talked about "Amdahl's Law" and mentioned that having more worker nodes does not necessarily bring better performance. 
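
As a quick refresher, Amdahl's Law in its standard form says that if a fraction `p` of a program can be parallelized, the speedup with `n` workers is `1 / ((1 - p) + p / n)`. The sketch below evaluates that formula; the function name `amdahl_speedup` and the value `p = 0.5` are arbitrary choices for illustration.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup from Amdahl's Law for a given parallel fraction and worker count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Illustrative assumption: half of the program can run in parallel.
for workers in (1, 2, 4, 8, 1000):
    print(f"{workers:>4} workers -> speedup {amdahl_speedup(0.5, workers):.3f}")
```

With half of the work serial, the speedup can never reach 2, no matter how many workers are added.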



## [Optional] Visualization

Do you want to see the coffee shops, the river, and its closest cafe on a map?! Run the cell below. 

The resulting cafe appears in red, and the others remain black. 

In [None]:
riverside_cafesdf = gpd.GeoDataFrame(nearest_cafes.to_crs(4326), geometry="geometry")
riverside_cafes_layer = GeoData(geo_dataframe = riverside_cafesdf, point_style={'color': 'red'})

coffee_shops_layer = GeoData(geo_dataframe = coffee_shops.to_crs(4326), point_style={'color': 'black'})

mymap4 = Map(center=(44.96,-93.13), zoom = 11)
mymap4.add_layer(coffee_shops_layer)
rivers_layer = GeoData(geo_dataframe = rivers.to_crs(4326), style={'color':'blue'})
mymap4.add_layer(rivers_layer)
mymap4.add_layer(riverside_cafes_layer)
mymap4

## What Else?

Apache Sedona provides dozens of other spatial functions, such as centroid, distance, transformation, and buffer, more than we can cover here. 

You can see a list of all available spatial functions at https://sedona.apache.org/api/sql/Function. 

# Congratulations!


**You have finished an Hour of CI!**


But, before you go ... 

1. Please fill out a very brief questionnaire to provide feedback and help us improve the Hour of CI lessons. It is fast and your feedback is very important to let us know what you learned and how we can improve the lessons in the future.
2. If you would like a certificate, then please type your name below and click "Create Certificate" and you will be presented with a PDF certificate.

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="https://forms.gle/JUUBm76rLB8iYppN7">Take the questionnaire and provide feedback</a></font>

In [None]:

# This code cell loads the Interact Textbox that will ask users for their name
# Once they click "Create Certificate" then it will add their name to the certificate template
# And present them a PDF certificate
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw

from ipywidgets import interact

def make_cert(learner_name, lesson_name):
    cert_filename = 'hourofci_certificate.pdf'

    img = Image.open("../../supplementary/hci-certificate-template.jpg")
    draw = ImageDraw.Draw(img)

    cert_font   = ImageFont.truetype('../../supplementary/cruft.ttf', 150)
    cert_fontsm = ImageFont.truetype('../../supplementary/cruft.ttf', 80)
    
    _,_,w,h = cert_font.getbbox(learner_name)  
    draw.text( xy = (1650-w/2,1100-h/2), text = learner_name, fill=(0,0,0),font=cert_font)
    
    _,_,w,h = cert_fontsm.getbbox(lesson_name)
    draw.text( xy = (1650-w/2,1100-h/2 + 750), text = lesson_name, fill=(0,0,0),font=cert_fontsm)
    
    img.save(cert_filename, "PDF", resolution=100.0)   
    return cert_filename


interact_cert=interact.options(manual=True, manual_name="Create Certificate")

@interact_cert(name="Your Name")
def f(name):
    print("Congratulations",name)
    filename = make_cert(name, 'Intermediate Parallel Computing')
    print("Download your certificate by clicking the link below.")


<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="hourofci_certificate.pdf?download=1" download="hourofci_certificate.pdf">Download your certificate</a></font>