## Intermediate Parallel Computing

### Part 4 of 5

### PySpark SQL II: Non-Spatial SQL Query

### In this segment we will learn:
* Spark SQL and Spark DataFrames.
* Querying non-spatial data with PySpark SQL.

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

In [None]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("hourofci").setMaster("local[*]")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
sc

## Spark DataFrames

Spark DataFrames organize data in rows and named columns and distribute it across multiple computational cores. In other words, Spark DataFrames are very similar to Pandas DataFrames, except that they are distributed. Spark DataFrames are generally faster and more convenient to use than RDDs.

In the next cell, we use `spark.read` with the `csv` method, followed by the `toDF` method, to load our favorite films CSV file as a Spark DataFrame. Using the `option` method, we indicate that the first line of the CSV file is a header, so it is used for column names rather than treated as data. Please note that `toDF` then sets our own column names for the new DataFrame. These column names can differ from the header of the CSV file.

In [None]:
films = spark.read.option("delimiter", ",").option("header", "true").csv("supplementary/films.csv").\
toDF('index','Title','Year','Length','Subject','Popularity')
films.show()
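
To make the roles of the `header` option and `toDF` more concrete, here is a plain-Python sketch (not Spark code) of the same idea. The three-column sample and its header names are made up for illustration. The first line is consumed as column names, the columns are then renamed by position, and every value stays text because nothing has asked for the types to be inferred.

```python
import csv
import io

# Hypothetical sample standing in for films.csv (not the lesson's real data).
raw = "id,title,year\n0,Life of Brian,1979\n1,Casablanca,1942\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)   # the first line is used as column names, not as data
rows = list(reader)     # the remaining lines are the data rows

# Rename the columns by position, similar in spirit to toDF('index', 'Title', 'Year').
new_names = ["index", "Title", "Year"]
records = [dict(zip(new_names, row)) for row in rows]

print(header)       # ['id', 'title', 'year']
print(records[0])   # {'index': '0', 'Title': 'Life of Brian', 'Year': '1979'}
```

Note that `'1979'` is a string here. Spark's CSV reader behaves similarly unless it is asked to infer or apply a schema.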

## Back to Our Life-Finder Query!

In the previous segment, we used the `filter` method to retrieve all movies that have the word "Life" in their title from the films RDD. Here, we want to write and execute an actual SQL query that returns the same result. 

But remember from the <a href="http://try.hourofci.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhourofci%2Flessons&urlpath=tree%2Flessons%2Fintermediate-lessons%2Fgeospatial-data%2FWelcome.ipynb&branch=master">intermediate lesson on Geospatial Data</a>: to execute SQL code, we need relations (i.e., tables or views), which are different from DataFrames. 

Spark SQL provides a way to create SQL *views* (i.e., virtual tables) from Spark DataFrames. We can do this using `createOrReplaceTempView` as follows. We name our view `films_table`.


In [None]:
films.createOrReplaceTempView("films_table")


Now it is time to write and execute a SQL query that fetches all movies containing the word "Life" in their title. The following SQL query is one way to do this. 


>```sql
>SELECT *
>FROM films_table f
>WHERE f.Title LIKE '%Life%'
>```

In this query, we select all columns (`*`) of the rows in `films_table` (given the alias `f`) that meet the condition in the `WHERE` clause. The condition is that the title contains the word "Life", which we test with the `LIKE` operator. The percent (`%`) symbol is a wildcard that matches any sequence of characters, including none, so `'%Life%'` allows any text before and after the word of interest. 

In the cell below, we execute this query using the `sql` method and print the result using the `show()` method. 
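
To get a feel for how the `%` wildcard behaves, the following plain-Python sketch mimics `LIKE` matching with a regular expression. It is only an illustration, not how Spark implements `LIKE`: `like_to_regex` and `like` are hypothetical helpers, the titles are made up, and escape characters are ignored.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a regular expression (illustration only).

    `%` matches any run of characters (including none) and `_` matches
    exactly one character. Escape sequences are not handled.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value: str, pattern: str) -> bool:
    # LIKE must match the whole value, hence fullmatch rather than search.
    return like_to_regex(pattern).fullmatch(value) is not None

# Made-up titles for demonstration.
titles = ["Life of Brian", "It's a Wonderful Life", "Lifeboat", "Still life", "Casablanca"]
print([t for t in titles if like(t, "%Life%")])
# ['Life of Brian', "It's a Wonderful Life", 'Lifeboat']
```

Two things are worth noticing: "Lifeboat" matches because `%` also allows text directly attached to the word, and "Still life" does not match because the comparison is case-sensitive.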


In [None]:
film_result = spark.sql(
            """
            SELECT *
            FROM films_table f
            WHERE f.Title LIKE '%Life%'
            """)

film_result.show()

### One More Example 

Cool! To practice this more, let's look at the next query. 
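><b>Query:</b> Select all films that are shorter than 30 minutes and are highly popular (i.e., have a popularity index higher than 50).

<b>Solution:</b> 
Here is a SQL query that answers this question. 
>```sql
>SELECT *
>FROM films_table f
>WHERE f.length < 30 and f.popularity > 50
>```

Spark SQL resolves column names case-insensitively by default, so `f.length` and `f.popularity` refer to the `Length` and `Popularity` columns.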

Let's execute it!

In [None]:
short_films = spark.sql(
            """
            SELECT *
            FROM films_table f
            WHERE f.length < 30 and f.popularity > 50
            """)

short_films.show()
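
One detail is worth a closer look: the CSV file was read without a schema, so `Length` and `Popularity` hold text, and the numeric comparison only works because the values are converted to numbers before being compared. The plain-Python sketch below, using made-up rows, shows why that conversion matters and how the two conditions combine. If in doubt, the conversion can be made explicit in SQL, for example with `CAST(f.Length AS INT)`.

```python
# Made-up rows; every value is text, as in a CSV read without schema inference.
rows = [
    {"Title": "Short Hit", "Length": "25", "Popularity": "80"},
    {"Title": "Long Hit", "Length": "100", "Popularity": "90"},
    {"Title": "Short Flop", "Length": "12", "Popularity": "7"},
]

# Comparing text with text is alphabetical, so "100" sorts before "30".
print("100" < "30")  # True

# Converting to integers first gives the intended numeric comparison.
# Python's `and` plays the role of SQL's AND: both conditions must hold.
selected = [
    r["Title"]
    for r in rows
    if int(r["Length"]) < 30 and int(r["Popularity"]) > 50
]
print(selected)  # ['Short Hit']
```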

## Converting Spark DataFrames to Pandas 

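Once the query has been executed using multiple cores, we sometimes want to convert the results to a regular Pandas DataFrame so that we can use Pandas functionality. This is as simple as calling the `toPandas()` method, as follows. Keep in mind that `toPandas()` collects all rows into the memory of a single machine, so it is best suited to small results. 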

In [None]:
film_result_df = film_result.toPandas()
film_result_df


## Converting Pandas DataFrame to Spark DataFrame

You might wonder: what if we have a Pandas DataFrame and want to convert it to a Spark DataFrame to try parallel computing?

Well, Spark's `createDataFrame` method will do it. In the cell below, we convert our `film_result_df` DataFrame back to a Spark DataFrame. 

In [None]:
spark.createDataFrame(film_result_df).show()


## What About Spatial Datasets?
So far so good, but what if we want to load a shapefile as a Spark DataFrame? 

A natural first thought would be to load the shapefile into a GeoDataFrame and then convert it to a Spark DataFrame, right? 

Let's try it then!

In [None]:
import geopandas as gpd
gdf = gpd.read_file("supplementary/MN_points/POINT.shp")
try:
    points = spark.createDataFrame(gdf)
except ValueError as err:
    print("ValueError: ",err)
    


Oops! There is a ValueError! It seems that Spark is complaining about the `geometry` column and its data type.

>Why do you think this error occurred? <br>
What solution would you propose to fix it?

Let us know in the textbox below.

In [None]:
w = widgets.Textarea(
            value='',
            placeholder='Write your answers here',
            description='',
            disabled=False,
            layout=Layout( height='200px', min_height='100px', width='900px')
            )

def out1():
    print('Submitted! See one debugging approach below.')
    
display(w)
hourofci.SubmitBtn2(w, out1)

One way to fix this code is to get rid of the `geometry` data type. To do this, we can convert the `geometry` column to a string type using the `astype('string')` method and then try converting the GeoDataFrame to a Spark DataFrame again. 

In [None]:
gdf['geometry'] = gdf['geometry'].astype('string')

points = spark.createDataFrame(gdf)
points.show()

Well, it seems we could get the conversion to run, but the problem is that we no longer have the geometry data type. The `geometry` column now contains plain strings, which Spark cannot use for spatial operations. 

This is because PySpark does **not** natively support spatial data types. Therefore, we need to install and use a spatial extension for Spark. 
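
To see why a string column is a poor substitute for a geometry column, consider the plain-Python sketch below. It assumes the converted values look like WKT text such as `POINT (-93.26 44.98)`. The coordinates and the `parse_point` helper are made up for illustration. As text, the values only support text operations, and any spatial calculation first requires parsing the coordinates back out.

```python
import math
import re

# Assumed example values in a WKT-like text form (hypothetical coordinates).
a = "POINT (-93.26 44.98)"
b = "POINT (-92.10 46.79)"

def parse_point(text: str) -> tuple:
    """Hypothetical helper: extract x and y from a WKT-style POINT string."""
    match = re.fullmatch(r"POINT \(([-\d.]+) ([-\d.]+)\)", text)
    if match is None:
        raise ValueError(f"not a POINT string: {text!r}")
    return float(match.group(1)), float(match.group(2))

# As strings, the values can only be compared alphabetically,
# which says nothing about where the points are.
print(a < b)

# After parsing, the coordinates are numbers we can compute with.
(x1, y1), (x2, y2) = parse_point(a), parse_point(b)
print(math.hypot(x2 - x1, y2 - y1))  # straight-line distance in degrees
```

A spatial extension takes care of this kind of parsing and provides proper geometry types and spatial functions, so we do not have to write such helpers ourselves.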

In the next segment, we will see how to handle spatial data. 

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" 
href="pc-6.ipynb">Click here to go to the next notebook.</a></font>
