## Intermediate Parallel Computing
### Segment 3 of 6

### PySpark SQL I: Fundamentals

### In this segment we will answer:
* What are the fundamental functions used in PySpark SQL?
* What are the fundamental data structures in PySpark?
* How do we create RDDs and do simple operations on them?


*Lesson Developer: Mohsen Ahmadkhani, ahmad178@umn.edu*

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci


# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Spark SQL as a DataBase Management System (DBMS)


Recall the <a href="http://try.hourofci.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhourofci%2Flessons&urlpath=tree%2Flessons%2Fintermediate-lessons%2Fgeospatial-data%2FWelcome.ipynb&branch=master">intermediate lesson on Geospatial Data</a> where we mentioned that a DBMS is an interface that bridges users and the physical databases through a unified Structured Query Language (SQL). Spark SQL also uses this unified language for data query. Spark SQL is a software module that allows you to search, filter, analyze, and generally query your data. 

Spark SQL works like a DBMS, except that it partitions the data and can therefore run queries in parallel. We discussed the SQL language in the <a href="http://try.hourofci.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhourofci%2Flessons&urlpath=tree%2Flessons%2Fintermediate-lessons%2Fgeospatial-data%2FWelcome.ipynb&branch=master">intermediate lesson on Geospatial Data</a>, but don't worry if you have not completed it yet. We will provide an introduction to SQL in the next few slides. 



## SQL and Relational Databases

**Structured Query Language (SQL)** is a programming language developed to interact and communicate with a relational database. A **relational database** is a database where each dataset is stored in the form of a `Table` (a.k.a. relation). Each `Table` is made of rows (a.k.a. `records`) and columns (a.k.a. `fields`). Each record is an instance of an entity and each column records a specific attribute of that entity. The figure below shows the first 5 records of a table named `US_State`. 

We will use the `US_State` table to illustrate SQL in the next few slides. This table contains 56 records representing US states. Its 5 fields hold an index (`gid`), a FIPS code (`statefp`), a unique GeoID (`geoid`), an abbreviation (`stusps`), and a `name` for each US state. This dataset was downloaded from the <a href='https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html'> US Census Bureau</a> data center.

<img src = "supplementary/us_state.png" width = "700px">


## SQL Basic Keywords

SQL reads much like plain English. Its three main statements are `SELECT`, `FROM`, and `WHERE`. Next we will demonstrate these keywords by writing sample statements against our `US_State` table. 



### **`SELECT` statement:**
- Tells the DBMS which **fields** to select. 

The general **syntax*** for the **`SELECT` statement** is 

> ```mysql
> SELECT field1, field2, ..., fieldN
> ```


Syntax: `SELECT` followed by a **comma-separated** list of fields. <br>
**Note:** we use the asterisk symbol (`*`) to indicate *all fields* instead of listing them all. 

***Syntax** of a programming language is a set of punctuation rules that defines what the combination of symbols and characters means to the computer. 


### **`FROM` statement:**
- Tells the DBMS which table to look in. 

Here we select all fields from the `US_State` table. 

> ```mysql
> SELECT *
> FROM US_State
> ```


<img src = "supplementary/us_state2.png" width = "700px">
 
We just `selected` all fields `from` the `US_State` table. 

## `WHERE` Statement
- Tells the DBMS to select the records that meet a certain **condition**. 

We usually use comparison operators such as `=`, `>`, `<`, `>=`, `<=` , and `!=` (i.e., not equal to) to make WHERE-clauses. 

We can define multiple conditions using `AND` (if we want both conditions to be met at the same time) and `OR` (if we want at least one of the conditions to be met) logical operators. 

Here is the final syntax of a query made of the three statements: 
> ```mysql
> SELECT column1, column2, ..., columnN
> FROM table_name
> WHERE condition1 AND condition2
> ```


Let's look at a US_State example! 

## An SQL Example

**Problem:** Select the name, abbreviation, and the FIPS code of the state of Minnesota from the US_State table.

> ```mysql
> SELECT name, stusps, statefp
> FROM US_State
> WHERE name = 'Minnesota'
> ```

And here is what we get: 
<center><img src = "supplementary/mnquery.png" width = "600px"></center>
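
If you would like to try this query without a database server, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in DBMS. It is not Spark SQL, and it is only an illustration: the three sample records are typed in by hand, and the `gid` value given to Minnesota is hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in DBMS (assumption: this is not Spark SQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE US_State (gid INTEGER, statefp TEXT, geoid TEXT, stusps TEXT, name TEXT)"
)

# Three hand-typed sample records; the gid used for Minnesota is hypothetical.
sample_rows = [
    (1, "28", "28", "MS", "Mississippi"),
    (2, "37", "37", "NC", "North Carolina"),
    (3, "27", "27", "MN", "Minnesota"),
]
conn.executemany("INSERT INTO US_State VALUES (?, ?, ?, ?, ?)", sample_rows)

# The query from this slide: three fields, one table, one condition.
minnesota = conn.execute(
    "SELECT name, stusps, statefp FROM US_State WHERE name = 'Minnesota'"
).fetchall()
print(minnesota)

# Two conditions joined with OR: a record is kept if at least one of them holds.
either = conn.execute(
    "SELECT name FROM US_State WHERE stusps = 'MS' OR stusps = 'NC'"
).fetchall()
print(either)

conn.close()
```

Each result comes back as a list of tuples, one tuple per matching record, with the fields in the order listed after `SELECT`.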


## Getting Back to Spark SQL

Now that you are familiar with the basics of the SQL language, let's continue exploring the Spark SQL framework.

Later in this segment, we will go through the fundamental definitions and functions of PySpark SQL. 

First, we need to create a **Spark Context**. The `Spark Context` is the entry point, or gateway, to the PySpark framework. Although a Spark Context is not exactly the same as a Spark Session, it is safe to treat them as equivalent at this stage. 

In the next slide we create a Spark Session, take its Spark Context, and store it in a variable named `sc`.

In [None]:
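from pyspark import SparkConf
from pyspark.sql import SparkSession

# Run Spark locally with 4 worker threads under the application name "hourofci"
conf = SparkConf().setAppName("hourofci").setMaster("local[4]")

# Create the Spark Session (or reuse an existing one) and get its Spark Context
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
sc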

## Resilient Distributed Dataset (RDD)
Spark uses RDDs as its fundamental data structure. An RDD is an object that holds a collection of data values. We will skip most of the details of this concept here, but it is important to note that data stored as an RDD is **partitioned** and distributed across the nodes of the cluster, ready for parallel processing. The figure below shows how an original dataset is partitioned into an RDD. 

<img src = "supplementary/rdd.png" width = "700px">

In this figure `P` stands for `Partition`.

## Creating an RDD Object

The `SparkContext` class provides the following two methods for creating RDD objects from data in different formats. 
* `parallelize()`
* `textFile()`

We'll show an example of each in the upcoming slides. 

## `parallelize()`

This method creates an RDD from a Python collection such as a tuple, list, dictionary, or set, as well as from pandas and NumPy arrays.  

In the next slide, we create an `RDD` variable from a `list` variable containing seven elements.



In [None]:
sample_list = [1, 2, 'mouse', 5, 'cat', 0, 'dog']
rdd_list = sc.parallelize(sample_list)

rdd_list


## `collect()`
Ok, so far, we've created an RDD from a list. If you print out an RDD variable, it shows `ParallelCollectionRDD[<some identifier>]`. This is how Spark understands it, but what if we want to see the actual values stored in this variable?<br>
Here is where the `collect()` method comes in handy! <br>
In the cell below we use the `collect()` method to see the values inside the RDD variable we created. 

In [None]:
rdd_list.collect()


The values in the RDD are exactly the same as in the original list. This time, however, the seven elements have been split across partitions. Because we passed `local[4]` to `setMaster` at the beginning, Spark uses **4** partitions here. 

Here is how we can get the number of partitions: 

In [None]:
rdd_list.getNumPartitions()


Ok, this confirms that the `rdd_list` variable is partitioned into four parts. 

If you are curious what each partition contains, use the `glom()` method as follows: 

In [None]:
rdd_list.glom().collect()

Spark does its best to split the data evenly across the partitions. 

Since there are *seven* elements in this RDD and *four* partitions, the split cannot be perfectly even: three partitions hold two elements each, and one partition holds the single remaining element. 
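
To make the arithmetic concrete, here is a minimal plain-Python sketch of one way to cut a list into near-equal contiguous chunks. `split_evenly` is a hypothetical helper written for this illustration. It is not a Spark function, and Spark may place the single leftover element in a different partition than this sketch does.

```python
def split_evenly(items, num_partitions):
    """Cut a list into num_partitions contiguous chunks of near-equal size."""
    n = len(items)
    chunks = []
    for i in range(num_partitions):
        # Integer division keeps the chunk boundaries as even as possible.
        start = (i * n) // num_partitions
        end = ((i + 1) * n) // num_partitions
        chunks.append(items[start:end])
    return chunks

sample_list = [1, 2, 'mouse', 5, 'cat', 0, 'dog']
parts = split_evenly(sample_list, 4)

print(parts)                     # the four chunks
print([len(p) for p in parts])   # how many elements each chunk holds
```

Whatever the exact placement, seven elements in four chunks always means three chunks of two elements and one chunk of one.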

Next, we will create an RDD from a CSV file.



### CSV Files
CSV (comma-separated values) is a popular file format that stores tables through a specific syntax. 

For example consider the US_State table:<br>
<center><img src = "supplementary/us_state2.png" width = "500px"></center><br>
This table can be stored as a CSV file like below:

> ```csv
> gid, statefp, geoid, stusps, name
> 1, "28", "28", "MS", "Mississippi"
> 2, "37", "37", "NC", "North Carolina"
> ...
> ```

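To see how CSV text maps back to records and fields, here is a minimal sketch that reads the two sample rows above with Python's built-in `csv` module. This is plain Python, not Spark, and the variable names are chosen only for this illustration.

```python
import csv
import io

# The CSV text from this slide (header plus the first two data rows).
csv_text = '''gid, statefp, geoid, stusps, name
1, "28", "28", "MS", "Mississippi"
2, "37", "37", "NC", "North Carolina"
'''

# skipinitialspace=True ignores the blank that follows each comma in this sample.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
records = list(reader)

# Each record is a dictionary keyed by the field names from the header line.
for record in records:
    print(record["stusps"], record["name"])
```

The first line of the file supplies the field names, and every following line becomes one record.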

## Creating an RDD From a CSV

We will use a `films` sample CSV dataset to create an RDD. Please note that we slightly modified its <a href="https://perso.telecom-paristech.fr/eagan/class/igr204/data/film.csv">original file</a> for pedagogical purposes. 

This sample dataset has 10 columns and 1658 rows holding some information regarding movies and their properties like title, director and so on. 


We can load a CSV file as an RDD using the `textFile()` function. When we load a file using this function, Spark chooses the number of partitions based on the file's size. However, if we want to suggest a minimum number of partitions ourselves (for performance reasons), we can do so by passing the `minPartitions` argument.

In [None]:
rddFromCSV = sc.textFile("supplementary/films.csv", minPartitions = 4)

rddFromCSV.collect()


Please note that the `textFile()` function reads each row of the CSV file as a single `string` element and returns an RDD of strings. Depending on our analysis, we usually need the `map` function to turn these strings into a more suitable data structure. 

So, let's see how `map` works!

## `map` Function
`map` is not specific to Spark. It is a Python built-in function that applies an operation to every element of an iterable without an explicit `for` loop. `map` takes a **function** and an **iterable** as input and applies the function to each element of the iterable. 

In the next slide we use the map function to square the elements of a list variable. 

Please note that the output of the map function is a **`map object`** that we need to convert to a `list` to display it at the end. 

In [None]:
simple_list = [1, 2, 3, 4, 5]     # input iterable variable

def square(el):                   # define a function "square" that gets a number and returns its square
    return el ** 2

res = map(square, simple_list)    # apply the function "square" to all elements of "simple_list"
list(res)                         # convert the result to a list for visualization

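One detail about the map object is easy to miss: it produces its values on demand and can be read only once. The short sketch below illustrates this in plain Python and also shows an equivalent list comprehension. The variable names are chosen only for this illustration.

```python
def square(el):
    return el ** 2

squares = map(square, [1, 2, 3, 4, 5])

first_pass = list(squares)     # reading the map object consumes it
second_pass = list(squares)    # nothing is left for a second read

print(first_pass)
print(second_pass)

# A list comprehension gives the same values as the first read.
with_comprehension = [square(el) for el in [1, 2, 3, 4, 5]]
print(with_comprehension)
```

If you need the results more than once, store the list rather than the map object.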
## `map` Function in PySpark

PySpark implements `map` as a method of every RDD. It takes a function as input and applies it to all elements of the RDD. 

In the next slide, we define a splitter function that splits an input text element (here, one row of the films file) into a list of values separated by `,`. Then we apply it to all elements of our films RDD. 


In [None]:
def splitter(el):
    return el.split(",")

rddFromCSV.map(splitter).collect()

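Splitting at every comma is simple, but it can cut a value in two when a quoted field itself contains a comma. The sketch below compares the lesson's `splitter` with a hypothetical `csv_splitter` built on Python's `csv` module. The sample line is invented for illustration, so the real rows in `films.csv` may look different.

```python
import csv

def splitter(el):
    # Same approach as in the lesson: cut the string at every comma.
    return el.split(",")

def csv_splitter(el):
    # Hypothetical alternative: let the csv module keep quoted fields together.
    return next(csv.reader([el]))

# Invented sample line with a comma inside a quoted title.
line = '1990,"Life, Love and Laughter",Comedy'

naive_fields = splitter(line)
csv_fields = csv_splitter(line)

print(len(naive_fields), naive_fields)   # the title is cut into two pieces
print(len(csv_fields), csv_fields)       # the title stays in one piece
```

Assuming the data contains quoted fields like this, a function such as `csv_splitter` could be passed to an RDD's `map` method in the same way as `splitter`.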

## Filtering Data 

Often we want to make data selections and fetch elements that satisfy a given condition. Python also has a built-in function for this purpose named `filter`. Similar to `map`, `filter` gets a discriminator function and returns all elements of an input iterable item that pass the discriminator function. 

The discriminator function tests a given condition on each element and returns `True` or `False`. 

To clear this up, see an example in the next slide. 


In [None]:
def range_tester(el):          # gets a value and returns True if it is larger than 10, otherwise False
    return el > 10

num_list = [1, 23, 0, 3, 5, 11]
    
filtered = filter(range_tester, num_list) # apply range_tester function to elements of num_list
list(filtered)        # convert the result to a list to visualize

## `filter` Function in PySpark

Similarly, `filter` is also implemented in PySpark as an RDD method. In the example below, we use our films RDD to illustrate it by keeping only the rows that contain the word <i>"Life"</i>.

<i>*Remember this example. In the next segment we will see how we can do the same thing with Spark SQL!</i>

In [None]:
def life_finder(el):
    return 'Life' in el    # True if the row contains 'Life', otherwise False

rddFromCSV.filter(life_finder).collect()

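The same kind of test can be tried in plain Python, without Spark. The sketch below uses a few invented rows as a stand-in for the films RDD (they are not taken from `films.csv`) and shows that the check is a case-sensitive substring match.

```python
# Invented rows standing in for elements of the films RDD.
rows = [
    "The Good Life,Drama",
    "Lifeboat Story,Drama",
    "a quiet life,Comedy",
    "Another Title,Comedy",
]

def life_finder(el):
    # True when the text 'Life' appears anywhere in the row, otherwise False.
    return 'Life' in el

matches = list(filter(life_finder, rows))
print(matches)

# The same test written inline as a lambda (an anonymous function).
matches_lambda = list(filter(lambda el: 'Life' in el, rows))
print(matches_lambda)
```

Note that "Lifeboat Story" passes the test because `'Life'` matches inside a longer word, while "a quiet life" does not because of the lowercase `l`.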

Now you are acquainted with some basic functions of PySpark SQL. In the next segment, we will learn about Spark DataFrames and how to use PySpark SQL to run SQL queries on non-spatial data. Let's go!


<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" 
href="pc-5.ipynb">Click here to go to the next notebook.</a></font>