<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Aggregation:-groupBy-and-count" data-toc-modified-id="Aggregation:-groupBy-and-count-1">Aggregation: <code>groupBy</code> and <code>count</code></a></span></li><li><span><a href="#Writing-to-file:-csv" data-toc-modified-id="Writing-to-file:-csv-2">Writing to file: <code>csv</code></a></span></li><li><span><a href="#Streamlining-the-code-by-chaining" data-toc-modified-id="Streamlining-the-code-by-chaining-3">Streamlining the code by chaining</a></span><ul class="toc-item"><li><span><a href="#Method-chaining" data-toc-modified-id="Method-chaining-3.1">Method chaining</a></span></li></ul></li><li><span><a href="#Submitting-code-in-batch-mode-using-spark-submit" data-toc-modified-id="Submitting-code-in-batch-mode-using-spark-submit-4">Submitting code in batch mode using <code>spark-submit</code></a></span></li><li><span><a href="#Exercises" data-toc-modified-id="Exercises-5">Exercises</a></span></li></ul></div>

# Chapter 3: Submitting and scaling your first PySpark program

In [12]:
# Set up
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

In [13]:
# Data frame setup: tokenize, lowercase, and clean the words
from pyspark.sql.functions import col, split, lower, explode, regexp_extract

book = spark.read.text("data/Ch02/1342-0.txt")
lines = book.select(split(col("value"), " ").alias("line"))
words = lines.select(explode(col("line")).alias("word"))
words_lower = words.select(lower("word").alias("word_lower"))
word_norm = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]*", 0).alias("word_normalized")
)
word_nonull = word_norm.filter(col("word_normalized") != "").withColumnRenamed(
    "word_normalized", "word_nonull"
)

## Aggregation: `groupBy` and `count`

- Use `groupBy` to count record occurrences, passing the columns you want to group by. The returned value is a `GroupedData` object, not a `DataFrame`.
- `GroupedData` lets you apply an aggregate function to each group. Once you apply one, such as `count()`, you get a `DataFrame` back.
    - Note that `groupby` and `groupBy` are aliases for the same method.
- You can sort the output with `orderBy`.
    - Note that `orderBy` exists only in camel case; there is no `orderby`.

In [16]:
groups = word_nonull.groupBy(col("word_nonull"))
display(groups)

results = groups.count().orderBy("count", ascending=False)
results.show()

<pyspark.sql.group.GroupedData at 0x11b77b910>

+-----------+-----+
|word_nonull|count|
+-----------+-----+
|        the| 4480|
|         to| 4218|
|         of| 3711|
|        and| 3504|
|        her| 2199|
|          a| 1982|
|         in| 1909|
|        was| 1838|
|          i| 1750|
|        she| 1668|
|       that| 1487|
|         it| 1482|
|        not| 1427|
|        you| 1301|
|         he| 1296|
|         be| 1257|
|        his| 1247|
|         as| 1174|
|        had| 1170|
|       with| 1092|
+-----------+-----+
only showing top 20 rows



## Writing to file: `csv`

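- A data frame has a `write` attribute (a `DataFrameWriter`), which can be chained with `csv` to write CSV output.
- By default, the output path is a directory containing one file per partition, plus a `_SUCCESS` file.
- Use `coalesce(1)` to reduce the data frame to a single partition, so that only one data file is written.
- Use `.mode('overwrite')` to replace any output that already exists at the path.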

> __TIP:__
Never assume that your data frame will keep the same ordering of records unless you explicitly ask for it via `orderBy()`.

In [21]:
# Write multiple partitions + success file
results.write.mode("overwrite").csv("./output/results")

# Reduce to 1 partition so that a single data file is written to disk
results.coalesce(1).write.mode("overwrite").csv("./output/result_single_partition")

## Streamlining the code by chaining

### Method chaining

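In PySpark, every transformation returns a new object instead of modifying the data frame in place, which is why each intermediate result had to be assigned to a variable. Because each method returns an object, the calls can also be chained one after another, which removes the need for intermediate variables.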

In [36]:
# qualified import; import the whole module
import pyspark.sql.functions as F

# chain methods together instead of multiple variables
results = (
    spark.read.text("./data/ch02/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
)

## Submitting code in batch mode using `spark-submit`

When writing a script to be executed with `spark-submit` rather than with the `pyspark` command, you need to create your `SparkSession` yourself first, because the `pyspark` shell is what normally creates it for you.

In [39]:
# This can be wrapped into a `word_counter.py` file and be executed
# using `spark-submit`

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

results = (
    spark.read.text("./data/ch02/*.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
    .orderBy("count", ascending=False)
)

results.show()

+----+-----+
|word|count|
+----+-----+
| the|38895|
| and|23919|
|  of|21199|
|  to|20526|
|   a|14464|
|   i|13973|
|  in|12777|
|that| 9623|
|  it| 9099|
| was| 8920|
| her| 7923|
|  my| 7385|
| his| 6642|
|with| 6575|
|  he| 6444|
|  as| 6439|
| you| 6295|
| had| 5718|
| she| 5617|
| for| 5425|
+----+-----+
only showing top 20 rows



## Exercises

See [chapter 3 code](./code/Ch03/word_count_submit.py)