## Intermediate Parallel Computing
### Segment 3 of 5

### PySpark SQL I: Fundamentals

### In this segment we will answer:
* What are the fundamental functions used in PySpark SQL?
* What are the fundamental data structures in PySpark?
* How do we create RDDs and perform simple operations on them?


*Lesson Developer: Mohsen Ahmadkhani, ahmad178@umn.edu*

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci


# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Spark SQL


In the <a href="http://try.hourofci.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhourofci%2Flessons&urlpath=tree%2Flessons%2Fintermediate-lessons%2Fgeospatial-data%2FWelcome.ipynb&branch=master">intermediate lesson on Geospatial Data</a>, we saw how to use `SELECT`, `FROM`, and `WHERE` clauses to query our data. In this segment we introduce Spark SQL, a data querying tool that resembles a regular database but differs from it in one important way. Regular databases such as PostgreSQL and Oracle have two broad responsibilities: **(long-term) storing** and **processing** (i.e., querying and analyzing) data.

**Spark SQL**, however, is a framework that is mainly concerned with **processing** large datasets and **not with long-term** storage. Spark SQL parallelizes data queries, which can make them faster and more efficient, especially for large datasets.

In this segment, we will go through the fundamental definitions and functions in PySpark SQL. First, let's create a Spark session and get its Spark context in the next slide.

In [None]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("hourofci").setMaster("local[4]")

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
sc

## Resilient Distributed Dataset (RDD)
Spark uses RDDs as its fundamental data structure. An RDD is an immutable collection of elements. We will not go into the details of this concept here, but it is important to note that the data stored in an RDD is **partitioned** and distributed across the nodes of the cluster, ready for parallel processing.

## Creating an RDD Object

The `SparkContext` class provides the following two methods for creating RDD objects from data in different formats:
* `parallelize()`
* `textFile()`

We'll show an example of each in the upcoming slides.

## `parallelize()`

This method creates an RDD from a Python collection such as a tuple, list, dictionary, or set, as well as from pandas and NumPy arrays.

In the cell below, we create an RDD from a list containing seven elements.



In [None]:
sample_list = [1, 2, [3, 4], 5, 'cat', 0, 'dog']
rdd_list = sc.parallelize(sample_list)

rdd_list
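
The list above is only one of the collection types mentioned earlier, and the types do not all produce the same elements. The plain-Python sketch below assumes that a collection is consumed by ordinary iteration, as `list()` does. Under that assumption a dictionary contributes only its keys, and a set has no guaranteed order. The variable names are illustrative, and no Spark is involved.

```python
# Plain-Python sketch: which elements each collection yields when iterated.
# Assumption: the collection is consumed by simple iteration, like list().
sample_tuple = (1, 2, 3)
sample_dict = {"cat": 4, "dog": 2}
sample_set = {"cat", "dog", "cat"}   # duplicates collapse in a set

elements_from_tuple = list(sample_tuple)
elements_from_dict = list(sample_dict)            # keys only
elements_from_items = list(sample_dict.items())   # (key, value) pairs
elements_from_set = sorted(sample_set)            # sorted for a stable order

print(elements_from_tuple)
print(elements_from_dict)
print(elements_from_items)
print(elements_from_set)
```

If both keys and values are needed, converting the dictionary with `list(sample_dict.items())` before creating the RDD is one option.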


## `collect()`
So far, we have created an RDD from a list. If you display an RDD variable, it shows something like `ParallelCollectionRDD[<some identifier>]`. This is how Spark represents it, but what if we want to see the actual values stored in this variable?<br>
This is where the `collect()` method comes in handy! <br>
In the next cell we use the `collect()` method to see the values inside the RDD variable. Keep in mind that `collect()` returns every element to the driver program, so it is best suited to small datasets.

In [None]:
rdd_list.collect()


The values in the RDD are exactly the same as in the original list. This time, however, the seven elements have been split into partitions. Because we passed `local[4]` to `setMaster` at the beginning, Spark runs locally with four worker threads and, by default, creates four partitions.

Here is how we can get the number of partitions: 

In [None]:
rdd_list.getNumPartitions()


OK, this confirms that the `rdd_list` variable is split into four partitions.

If you are curious about what each partition contains, use the `glom()` method as follows:

In [None]:
rdd_list.glom().collect()

Spark does its best to distribute the data evenly across the partitions.

Since there are *seven* elements in this RDD variable and we have set *four* partitions, three partitions get two elements each and one partition gets the single remaining element.
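
The arithmetic behind this split can be sketched in plain Python. The helper `slice_into_partitions` below is hypothetical and is not a Spark API. It assumes the elements are cut into contiguous slices with boundaries at `i * n // k`. Which partition ends up with the single element depends on the slicing rule, so compare the result with the `glom()` output rather than treating it as authoritative.

```python
def slice_into_partitions(items, num_partitions):
    """Split items into contiguous slices of near-equal size (illustrative only)."""
    n = len(items)
    partitions = []
    for i in range(num_partitions):
        start = i * n // num_partitions        # first index of slice i
        end = (i + 1) * n // num_partitions    # one past the last index of slice i
        partitions.append(items[start:end])
    return partitions

sample_list = [1, 2, [3, 4], 5, 'cat', 0, 'dog']
parts = slice_into_partitions(sample_list, 4)
sizes = [len(p) for p in parts]

print(parts)
print(sizes)
```

Under this assumed rule, the slice sizes add up to seven and differ by at most one element, which is what an even distribution means here.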

Next, we will use a sample films dataset to create an RDD from a CSV file. Please note that we slightly modified its <a href="https://perso.telecom-paristech.fr/eagan/class/igr204/data/film.csv">original file</a> for pedagogical purposes.

This sample dataset has 10 columns and 1658 rows holding information about movies, such as title, director, and so on.




## Create an RDD from CSV
We can load a CSV file as an RDD using the `textFile()` method. When we load a file this way, Spark itself decides on the number of partitions based on the file's size. However, if we want to suggest the number of partitions ourselves (for performance reasons), we can do so by passing the `minPartitions` argument.

In [None]:
rddFromCSV = sc.textFile("supplementary/films.csv", minPartitions = 4)

rddFromCSV.collect()


Please note that the `textFile()` method reads each row of the CSV file as a single string element and returns an RDD of strings. Depending on the analysis, we therefore usually need the `map` function to turn each string into a more suitable data structure.

So, let's see how `map` works!

## `map` function
The `map` function is not specific to Spark. It is a Python built-in function that applies an operation to the elements of an iterable without an explicit `for` loop. It takes a function and an iterable as input and applies the function to each element of the iterable.

In the next cell we use the `map` function to square the elements of a list variable.


In [None]:
simple_list = [1, 2, 3, 4, 5]

def square(el):
    return el ** 2

res = map(square, simple_list)
list(res)
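
In Python 3, `map` returns a lazy iterator rather than a list, which is why the result is wrapped in `list()`. The short sketch below illustrates this and shows an equivalent list comprehension. The variable names are illustrative.

```python
def square(el):
    return el ** 2

simple_list = [1, 2, 3, 4, 5]

lazy_result = map(square, simple_list)   # an iterator; nothing is computed yet
first_pass = list(lazy_result)           # consumes the iterator
second_pass = list(lazy_result)          # the iterator is already exhausted

# The same result written as a list comprehension
with_comprehension = [square(el) for el in simple_list]

print(first_pass)
print(second_pass)
print(with_comprehension)
```

Because the iterator can be consumed only once, store the list if the values are needed more than once.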

## `map` Function in PySpark

PySpark implements `map` as a method of every RDD. It takes a function as input and applies it to all elements of the RDD.

In the cell below, we define a splitter function that splits the input text element into a list of fields separated by `,`. Then we apply it to all elements of our films RDD.


In [None]:
def splitter(el):
    return el.split(",")

rddFromCSV.map(splitter).collect()
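
Splitting on `,` works as long as no field contains a comma itself. The plain-Python sketch below compares it with Python's built-in `csv` module on a hypothetical row with a quoted field. The row is made up for illustration and is not taken from `films.csv`, and `csv_splitter` is an illustrative name.

```python
import csv

def naive_splitter(el):
    return el.split(",")

def csv_splitter(el):
    # csv.reader expects an iterable of lines, so wrap the single row in a list
    return next(csv.reader([el]))

# Hypothetical row with a comma inside a quoted field
row = '1990,"Life, Interrupted",Drama'

naive_fields = naive_splitter(row)
csv_fields = csv_splitter(row)

print(naive_fields)
print(csv_fields)
```

If the data turned out to contain quoted fields, a function like `csv_splitter` could in principle be passed to `map` in place of `splitter`.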


## Filtering Data 

Often we want to select data and fetch only the elements that satisfy a given condition. Python also has a built-in function for this purpose named `filter`. Similar to `map`, `filter` takes a discriminator function and an iterable, and returns the elements of the iterable that pass the discriminator function.

The discriminator function tests a given condition on each element and returns True or False.

To make this clearer, see the example in the next slide.


In [None]:
def range_tester(el):
    # Return True for values above 10 and False otherwise
    return el > 10


num_list = [1, 23, 0, 3, 5, 11]
    
list(filter(range_tester, num_list))
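
`filter` keeps an element whenever the function returns a truthy value, so a function that returns `None` for the failing case behaves like one that returns `False`. The sketch below compares the two styles. The function names are illustrative.

```python
def explicit_tester(el):
    return el > 10           # always returns True or False

def implicit_tester(el):
    if el > 10:
        return True          # otherwise falls through and returns None

num_list = [1, 23, 0, 3, 5, 11]

kept_explicit = list(filter(explicit_tester, num_list))
kept_implicit = list(filter(implicit_tester, num_list))
flags = [explicit_tester(el) for el in num_list]   # the decision per element

print(kept_explicit)
print(kept_implicit)
print(flags)
```

Returning the comparison directly is shorter and makes the True/False contract explicit.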


## `filter` Function in PySpark

Similarly, `filter` is also implemented in PySpark as an RDD method. In the example below, we use our films RDD to illustrate it by keeping every row that contains the word "Life".


In [None]:
def life_finder(el):
    # Each element is a whole CSV row, so this matches "Life" in any column
    return 'Life' in el

rddFromCSV.filter(life_finder).collect()
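
Because each RDD element here is a whole CSV row, testing `'Life' in el` matches the word in any column. The plain-Python sketch below restricts the test to one column. The rows and the `TITLE_INDEX` position are assumptions made for illustration, and the real `films.csv` layout may differ.

```python
# Hypothetical rows in the form "year,title,subject"; TITLE_INDEX is an assumption.
TITLE_INDEX = 1

rows = [
    "1990,Life Lessons,Drama",
    "1985,Night Train,Life Stories",
    "1979,Quiet Days,Comedy",
]

def life_anywhere(el):
    return "Life" in el

def life_in_title(el):
    # Split the row and test only the title field
    return "Life" in el.split(",")[TITLE_INDEX]

anywhere = list(filter(life_anywhere, rows))
title_only = list(filter(life_in_title, rows))

print(anywhere)
print(title_only)
```

Note that the test is case-sensitive and also matches longer words that contain "Life".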


## Lambda Function

One last thing to learn in this segment is the `lambda` function, which is used very often in PySpark.
A `lambda` function is a short, one-line form of a Python function that can contain only a single expression. See an example below:


In [None]:
square = lambda a: a**2

square(4)
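
A `lambda` is interchangeable with a small function defined using `def`, and it is often written inline as an argument to `map` or `filter`. The sketch below shows both uses. The names are illustrative.

```python
def square_def(a):
    return a ** 2

square_lambda = lambda a: a ** 2   # same behavior as square_def

numbers = [1, 2, 3, 4, 5]

# Lambdas written inline, without giving them a name
squares = list(map(lambda a: a ** 2, numbers))
evens = list(filter(lambda a: a % 2 == 0, numbers))

print(square_def(4), square_lambda(4))
print(squares)
print(evens)
```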

## Lambda Function in PySpark
In PySpark, we often pass a one-line `lambda` function to `map` so that it is applied to every RDD element. Below is the same example we saw earlier, which splits each row into multiple fields separated by `,`.

In [None]:
rddFromCSV.map(lambda el: el.split(",")).collect()
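
Lambdas also make it easy to chain several steps. The plain-Python sketch below splits each row, keeps the rows of one subject, and extracts the titles. The rows and column positions are hypothetical. On an RDD, the same chain would be written with the `map` and `filter` methods followed by `collect()`, which is not run here.

```python
# Hypothetical rows in the form "year,title,subject"
rows = [
    "1990,Life Lessons,Drama",
    "1985,Night Train,Mystery",
    "1979,Quiet Days,Drama",
]

split_rows = map(lambda el: el.split(","), rows)                  # row -> list of fields
dramas = filter(lambda fields: fields[2] == "Drama", split_rows)  # keep one subject
titles = list(map(lambda fields: fields[1], dramas))              # keep only the title

print(titles)
```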

Now you are acquainted with some basic functions of PySpark SQL. In the next segment, we will learn about Spark DataFrames and how to use PySpark SQL to execute SQL queries on non-spatial data. Let's go!


<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" 
href="pc-5.ipynb">Click here to go to the next notebook.</a></font>