# Chapter 2: Your first PySpark application

Most data-driven applications function as an Extract-Transform-Load (ETL) pipeline:

1. Ingest or read the data we wish to work with.
2. Transform the data via a few simple instructions or a very complex machine learning model.
3. Export the resulting data, either into a file to be fed into an app or by summarizing our findings into a visualization.

### `SparkSession` entry point

- `SparkSession` provides an entry point to Spark.
  - Wraps `SparkContext` and provides functionality for interacting with the data.
- Can be used as a normal object imported from a library in Python.
- `SparkSession` builder: a builder pattern with a set of methods to create a configurable object.

#### Creating a `SparkSession` entry point from scratch

In [6]:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of Pride and Prejudice.")
         .getOrCreate())

`sparkContext` can be accessed from the `SparkSession` object as shown below.

(Older code may present `sparkContext` as an `sc` variable.)

In [7]:
sc = spark.sparkContext
sc

### Setting the log level

- Spark defaults to `WARN`.
- Can change via `spark.sparkContext.setLogLevel(KEYWORD)`

#### Log level keywords

| Keyword | Description |
| --- | --- |
| `OFF` | No logging at all (not recommended). |
| `FATAL` | Only fatal errors. A fatal error will crash your Spark cluster. |
| `ERROR` | My personal favorite, will show `FATAL` as well as other useful (but recoverable) errors. |
| `WARN` | Add warnings (and there is quite a lot of them). |
| `INFO` | Will give you runtime information, such as repartitioning and data recovery (see chapter 1). |
| `DEBUG` | Will provide debug information on your jobs. |
| `TRACE` | Will trace your jobs (more verbose debug logs). Can be quite pedagogic, but very annoying. |
| `ALL` | Everything that PySpark can spit, it will spit. As useful as `OFF`. |

In [8]:
spark.sparkContext.setLogLevel('ERROR')

## Application Design

__Goal: What are the most popular words in Jane Austen's _Pride and Prejudice_?__

Steps:
1. Read: Read the input data (we’re assuming a plain text file)
2. Tokenize: Tokenize each word
3. Clean: Remove any punctuation and/or tokens that aren’t words.
4. Count: Count the frequency of each word present in the text
5. Answer: Return the top 10 (or 20, 50, 100)

## Data Exploration

PySpark provides two main structures for storing data when performing manipulations:

1. The Resilient Distributed Dataset (or RDD)
2. The data frame: a stricter version of the RDD. It makes heavy use of the concept of _columns_: you perform operations on columns instead of on records (as you would with an RDD).
   - More common than the RDD.
   - Syntax is similar to SQL.

#### RDD vs Dataframe
<img src="notes/img/rdd_df.png">

#### Reading a dataframe with `spark.read`
Reading data into a data frame is done through the `DataFrameReader` object, which we can access through `spark.read`.

The resulting data frame has a single column, `value`, of type `string`. Each row holds one line of the text file.

In [13]:
book = spark.read.text("data/Ch02/1342-0.txt")
book

DataFrame[value: string]

In [15]:
# Check schema (printSchema() prints directly and returns None, so it needs no display())
book.printSchema()

display(book.dtypes)

root
 |-- value: string (nullable = true)

[('value', 'string')]

#### Showing a data frame with `show()`

The `show()` method of a data frame takes three optional parameters.

1. `n` can be set to any positive integer, and will display that number of rows.
2. `truncate`, if set to `True`, will truncate the columns to display only 20 characters. Set to `False` to display the whole length, or any positive integer to truncate to a specific number of characters.
3. `vertical` takes a Boolean value and, when set to `True`, will display each record as a small table. If you need to check some records in detail, this is a very useful option.

In [23]:
# play with params
book.show(2, truncate=False, vertical=True)

-RECORD 0-------------------------------------------------------------------
 value | The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen 
-RECORD 1-------------------------------------------------------------------
 value |                                                                    
only showing top 2 rows



#### Lazy vs Eager Evaluation

- By default, you need to call `show()` to see a data frame's content. This follows Spark's lazy evaluation model: nothing is computed until an action requires it.
- Since Spark 2.4.0, you can configure the `SparkSession` object to evaluate and print data frames eagerly in the REPL. This may be helpful when learning:

```py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
                     .config("spark.sql.repl.eagerEval.enabled", "True")
                     .getOrCreate())
```

### Tokenizing sentences with `select()` and `split()`

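`select()` selects one or more columns, much like `SELECT` in SQL. A column can be referenced in several equivalent ways, some of which resemble pandas:

```py
from pyspark.sql.functions import col

book.select(book.value)     # attribute access
book.select(book["value"])  # bracket access
book.select(col("value"))   # col() function
book.select("value")        # column name as a string
```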

`split()` transforms a string column into an array column containing `n` string elements (i.e. tokens). Note that its pattern is a JVM (Java) regular expression, not a Python one.

`alias()` renames transformed columns for easier reference. When applied to a column, it takes a single string as an argument.

Another way to set an alias is to call `.withColumnRenamed()` on the data frame. If you just want to rename a column without changing the rest of the data frame, use `.withColumnRenamed()`.

In [45]:
from pyspark.sql.functions import col, split

# Read, tokenize and alias the column
lines = book.select(split(col('value'), " ").alias("line"))

display(lines)

lines.printSchema()

lines.show(5)

DataFrame[line: array<string>]

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



In [46]:
# Changing alias name using withColumnRenamed
alternative = lines.withColumnRenamed("line", 
                                      "here is an alternate alias")
alternative.printSchema()

root
 |-- here is an alternate alias: array (nullable = true)
 |    |-- element: string (containsNull = true)



### Reshaping data with `explode()`

When applied to a column containing a container-like data structure (such as an array), `explode()` will take each element and give it its own row.

![img](notes/img/explode.png)

In [49]:
# Explode column of arrays into rows of elements

from pyspark.sql.functions import explode, col

words = lines.select(explode(col("line")).alias("word"))
words.show(10)

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
+----------+
only showing top 10 rows



### String normalization

In [71]:
from pyspark.sql.functions import lower, regexp_extract

# Lowercase
words_lower = words.select(lower("word").alias("word_lower"))
words_lower.show()

# Naive punctuation normalization using regex
word_norm = words_lower.select(regexp_extract(col("word_lower"), "[a-z]*", 0).alias("word_normalized"))
word_norm.show()

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows

+---------------+
|word_normalized|
+---------------+
|            the|
|        project|
|      gutenberg|
|          ebook|
|             of|
|          pride|
|            and|
|      prejudice|
|             by|
|           jane|
|         austen|
|               |
|           this|
|          ebook|
|             is|
|            for|
|            the|
|            use|
|             of|
|         anyone|
+---------------+
only showing top 20 rows



### Filtering data

In [91]:
# Remove empty records

word_nonull = word_norm.filter(col("word_normalized") != "") \
                       .withColumnRenamed('word_normalized', 'word_nonull')
word_nonull.show()

+-----------+
|word_nonull|
+-----------+
|        the|
|    project|
|  gutenberg|
|      ebook|
|         of|
|      pride|
|        and|
|  prejudice|
|         by|
|       jane|
|     austen|
|       this|
|      ebook|
|         is|
|        for|
|        the|
|        use|
|         of|
|     anyone|
|   anywhere|
+-----------+
only showing top 20 rows



## Exercises

### 2.1
Rewrite the following code snippet, removing the withColumnRenamed method. Which version is clearer and easier to read?

```py
from pyspark.sql.functions import col, length

# The `length` function returns the number of characters in a string column.

ex21 = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(length(col("value")))
    .withColumnRenamed("length(value)", "number_of_char")
)
```

In [77]:
from pyspark.sql.functions import col, length
ex21 = (
    spark.read.text("./data/Ch02/1342-0.txt")
    .select(length(col("value")).alias('values'))
)
ex21.show(5)

+------+
|values|
+------+
|    66|
|     0|
|    64|
|    68|
|    67|
+------+
only showing top 5 rows



### 2.2
The following code block gives an error. What is the problem and how can you solve it?

```py
from pyspark.sql.functions import col, greatest

ex22.printSchema()
# root
#  |-- key: string (nullable = true)
#  |-- value1: long (nullable = true)
#  |-- value2: long (nullable = true)

# `greatest` will return the greatest value of the list of column names,
# skipping null value

# The following statement will return an error
ex22.select(
    greatest(col("value1"), col("value2")).alias("maximum_value")
).select(
    "key", "max_value"
)
```

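#### Answer

The first `select()` returns a data frame containing only the `maximum_value` column, so `key` no longer exists when the second `select()` asks for it. In addition, `max_value` does not match the alias `maximum_value`. To fix it, keep `key` in the first `select()` and refer to the new column by its actual alias.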

### 2.3

Let’s take our words_nonull data frame, available in listing 2.19. You can load the code in the repository (code/Ch02/end_of_chapter.py) into your REPL to get the data frame.

a) Remove all of the occurrences of the word "is"

b) (Challenge) Using the length function explained in exercise 2.1, keep only the words with more than 3 characters.

In [102]:
# 1. Remove all of the occurrences of the word "is",
# 2. Using the length function explained in exercise 2.1, keep only the words with more than 3 characters.
word_nonull.filter(col("word_nonull") != "is") \
           .filter(length(col("word_nonull")) > 3) \
           .withColumnRenamed('word_nonull', 'words_greater_than_3') \
           .show()

+--------------------+
|words_greater_than_3|
+--------------------+
|             project|
|           gutenberg|
|               ebook|
|               pride|
|           prejudice|
|                jane|
|              austen|
|                this|
|               ebook|
|              anyone|
|            anywhere|
|                cost|
|                with|
|              almost|
|        restrictions|
|          whatsoever|
|                copy|
|                give|
|                away|
|               under|
+--------------------+
only showing top 20 rows



### 2.4

Remove the words is, not, the and if from your list of words, using a single `where()` method on the words_nonull data frame (see exercise 2.3). Write the code to do so.

In [103]:
word_nonull.where(~col("word_nonull").isin(['is', 'not', 'the', 'if'])) \
           .show()

+-----------+
|word_nonull|
+-----------+
|    project|
|  gutenberg|
|      ebook|
|         of|
|      pride|
|        and|
|  prejudice|
|         by|
|       jane|
|     austen|
|       this|
|      ebook|
|        for|
|        use|
|         of|
|     anyone|
|   anywhere|
|         at|
|         no|
|       cost|
+-----------+
only showing top 20 rows



### 2.5

One of your friends comes to you with the following code. They have no idea why it doesn’t work. Can you diagnose the problem, explain why it is an error and provide a fix?

```py
from pyspark.sql.functions import col, split

book = spark.read.text("./data/ch02/1342-0.txt")

book = book.printSchema()

lines = book.select(split(book.value, " ").alias("line"))

words = lines.select(explode(col("line")).alias("word"))
```

#### Answer

`printSchema()` prints the schema and returns `None`. Assigning its result back to `book` overwrites the data frame with `None`, so the following `book.select(...)` fails with an `AttributeError`.

#### Solution

In [113]:
from pyspark.sql.functions import col, explode, split

book = spark.read.text("./data/ch02/1342-0.txt")

# Don't assign it back to `book`
book.printSchema()

lines = book.select(split(book.value, " ").alias("line"))

words = lines.select(explode(col("line")).alias("word"))

words.show()

root
 |-- value: string (nullable = true)

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
|    Austen|
|          |
|      This|
|     eBook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows

