<a id="top">

# A PySpark Tutorial

## Table of Contents

- [Installing "local" PySpark environment](#local)
- [Creating Spark Session](#session)
- [Reading a simple text file](#read_text)
- [Getting familiar with your data - `printSchema()`, `dtypes`, and `show()`](#schema)
- [Splitting words into Python lists](#split_words)
- [Different ways to select a dataframe column](#select_column)
- [Renaming column with `alias()`](#rename_column)
- [Exploding a column of arrays into rows of elements](#explode)
- [Lower the case of the words in the data frame](#lower_case)
- [Using `regexp_extract` to keep what looks like a word](#regexp_extract)
- [Filtering rows in your data frame using `where` or `filter`](#filtering)
- [Counting word frequencies using groupby() and count()](#word_freq)
- [Ordering / Sorting Dataframe by Single Column](#sorting)
- [Writing data from a dataframe as csv](#write)
- [Complete Word Count Script](#complete_word_count_script)
- [Simplifying your dependencies with PySpark’s import conventions](#imports)
- [Removing intermediate variables by chaining transformation methods](#method_chaining)

<a id="local">

## Installing "Local" PySpark on Windows 10

[[back to top]](#top)

1. Install Java 1.8 (Java 8) from the Java download [site](https://www.java.com/download/ie_manual.jsp). Add the folder that contains `java.exe` to your PATH environment variable.
2. Install Python - download binaries at python.org
3. Create pyspark_dev virtual environment: `python -m venv pyspark_dev`
4. Change directory into `pyspark_dev` folder: `cd pyspark_dev`.  Then activate "pyspark_dev" environment with `Scripts/activate.bat`
5. Update pip and then install necessary packages: `python -m pip install -U pip`, then `pip install wheel`, then `pip install pyspark ipykernel`
6. Install kernel: `python -m ipykernel install --user --name pyspark_dev --display-name "Python (pyspark_dev)"`
7. Set environment variables: `PYSPARK_PYTHON=[path_to_python.exe]` and `SPARK_HOME=[path_to_site_packages/pyspark folder]`
8. \* Download `winutils.exe`, `hadoop.dll`, and `hdfs.dll` from https://github.com/cdarlint/winutils and save them locally to a "hadoop/bin" folder.
9. Set `HADOOP_HOME` with `set HADOOP_HOME=[path_to_hadoop_folder]`, then append `%HADOOP_HOME%\bin` to PATH: `set PATH=%PATH%;%HADOOP_HOME%\bin`
10. Deactivate your pyspark_dev virtual environment, then activate the Python virtual environment that has jupyterlab installed.
11. Confirm that pyspark_dev is installed as a kernel by running `jupyter kernelspec list`. If you see it, launch jupyterlab via `jupyter lab`.
12. When opening a new jupyterlab notebook, choose the PySpark kernel that you installed in Step 6.

\* Hadoop version 2.7.3 has worked for me for PySpark 3.x series.  Hadoop versions 2.7+ _should_ work with PySpark 3.x series

See this article https://phoenixnap.com/kb/install-spark-on-windows-10
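
A missing environment variable is a common reason for a local setup to fail, so it can help to check the variables from the steps above before launching a notebook. The following is a minimal sketch using only the Python standard library. The helper name `missing_settings` and the list `WINDOWS_VARS` are hypothetical and not part of PySpark.

```python
import os
import shutil


def missing_settings(required_vars, environ=None):
    """Return the names in required_vars that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required_vars if not env.get(name)]


# Variable names taken from the Windows steps above (hypothetical list name).
WINDOWS_VARS = ["PYSPARK_PYTHON", "SPARK_HOME", "HADOOP_HOME"]

if __name__ == "__main__":
    for name in missing_settings(WINDOWS_VARS):
        print(f"{name} is not set")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("java") is None:
        print("java was not found on PATH")
```

Run it inside the activated pyspark_dev environment. If it prints nothing, the variables are set and `java` is on PATH; it does not verify that the paths themselves are correct.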

## Installing "Local" PySpark on Ubuntu Linux WSL via pip in virtual environment, NOT system level installation

1. Install Java 1.8: run `sudo apt-get update`, then `sudo apt-get install openjdk-8-jdk`, and set the JAVA_HOME environment variable, as an example: `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64`
2. Create "pyspark_dev" virtual environment: `python3 -m venv pyspark_dev`
3. Activate "pyspark_dev" environment, then: `python -m pip install -U pip`, then `pip install wheel`, then `PYSPARK_HADOOP_VERSION=2 pip install pyspark pandas ipykernel`
4. Install kernel: `python -m ipykernel install --user --name pyspark_dev --display-name "Python (pyspark_dev)"`
5. Add 2 environment variables (SPARK_HOME and PYSPARK_PYTHON), as an example: `export SPARK_HOME=/home/pybokeh/envs/pyspark_local/lib/python3.10/site-packages/pyspark` and `export PYSPARK_PYTHON=/home/pybokeh/envs/pyspark_local/bin/python`
6. Append `SPARK_HOME/bin` and `SPARK_HOME/sbin` to your PATH, as an example: `export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin`
7. `source ~/.bashrc` or `source ~/.profile`
8. Issue the "pyspark" command to check if everything was installed correctly
9. Deactivate your pyspark_dev virtual environment, then activate the Python virtual environment that has jupyterlab installed.
10. Confirm that pyspark_dev is installed as a kernel by running `jupyter kernelspec list`. If you see it, launch jupyterlab via `jupyter lab`.
11. When opening a new jupyterlab notebook, choose the PySpark kernel that you installed in Step 4.

Official installation [instructions](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) from spark documentation

<a id="session">

## We will be performing a simple word count from a text file

#### Creating `SparkSession`

[[back to top]](#top)

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = (SparkSession.builder
    # Enable eager, interactive mode - typically do not do this with production code
    .config("spark.sql.repl.eagerEval.enabled", "True")
    .getOrCreate()
)

23/03/24 13:42:18 WARN Utils: Your hostname, pybokeh-Lemur resolves to a loopback address: 127.0.1.1; using 192.168.1.101 instead (on interface wlp2s0)
23/03/24 13:42:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/24 13:42:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


<a id="read_text">

#### Reading a text file

[[back to top]](#top)

In [3]:
book = spark.read.text("data/gutenberg_books/1342-0.txt")

In [4]:
book

value
The Project Guten...
This eBook is for...
almost no restric...
re-use it under t...
with this eBook o...
Title: Pride and ...
Author: Jane Austen
Posting Date: Aug...
Release Date: Jun...
Last Updated: Mar...


<a id="schema">

#### Getting familiar with your data - `printSchema()`, `dtypes`, and `show()`

[[back to top]](#top)

In [5]:
book.printSchema()

root
 |-- value: string (nullable = true)



In [6]:
print(book.dtypes)

[('value', 'string')]


In [7]:
book.show()

+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
|with this eBook o...|
|                    |
|                    |
|Title: Pride and ...|
|                    |
| Author: Jane Austen|
|                    |
|Posting Date: Aug...|
|Release Date: Jun...|
|Last Updated: Mar...|
|                    |
|   Language: English|
|                    |
|Character set enc...|
|                    |
+--------------------+
only showing top 20 rows



<a id="split_words">

#### Splitting words into Python lists

[[back to top]](#top)

In [8]:
from pyspark.sql.functions import split

In [9]:
lines = book.select(split(book.value, " ").alias("line"))

In [10]:
lines

line
"[The, Project, Gu..."
[]
"[This, eBook, is,..."
"[almost, no, rest..."
"[re-use, it, unde..."
"[with, this, eBoo..."
[]
[]
"[Title:, Pride, a..."
[]
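
The rows displayed as `[]` above are not empty arrays: splitting an empty line on a single space yields an array holding one empty string. Python's `str.split(" ")` behaves the same way for these inputs, so a plain-Python sketch (no Spark involved, sample lines invented for illustration) can make this visible:

```python
sample_lines = [
    "The Project Gutenberg EBook",
    "",             # an empty line, like many lines in the book
    "two  spaces",  # two consecutive spaces
]

# Split each line on a single space, as split(book.value, " ") does per row.
tokens_per_line = [line.split(" ") for line in sample_lines]

for tokens in tokens_per_line:
    print(tokens)
```

The empty line becomes `['']` and the double space produces an empty token in the middle of the list. These empty strings are the ones that show up as blank rows after exploding.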


<a id="select_column">

### Different ways to select a dataframe column

[[back to top]](#top)

In [11]:
from pyspark.sql.functions import col

#### Dot/. notation `dataframe.column_name`

In [12]:
book.select(book.value)

value
The Project Guten...
This eBook is for...
almost no restric...
re-use it under t...
with this eBook o...
Title: Pride and ...
Author: Jane Austen
Posting Date: Aug...
Release Date: Jun...
Last Updated: Mar...


#### Bracket notation `dataframe["column_name"]`

In [13]:
book.select(book["value"])

value
The Project Guten...
This eBook is for...
almost no restric...
re-use it under t...
with this eBook o...
Title: Pride and ...
Author: Jane Austen
Posting Date: Aug...
Release Date: Jun...
Last Updated: Mar...


#### Using `col()` function - recommended

In [14]:
book.select(col("value"))

value
The Project Guten...
This eBook is for...
almost no restric...
re-use it under t...
with this eBook o...
Title: Pride and ...
Author: Jane Austen
Posting Date: Aug...
Release Date: Jun...
Last Updated: Mar...


#### Just the column name itself - NOT recommended

In [15]:
book.select("value")

value
The Project Guten...
This eBook is for...
almost no restric...
re-use it under t...
with this eBook o...
Title: Pride and ...
Author: Jane Austen
Posting Date: Aug...
Release Date: Jun...
Last Updated: Mar...


#### PySpark API documentation can be accessed [here](http://spark.apache.org/docs/latest/api/python/)


#### Default column name

In [17]:
book.select(split(col("value"), " ")).printSchema()

root
 |-- split(value,  , -1): array (nullable = true)
 |    |-- element: string (containsNull = false)



The default column name is `split(value,  , -1)`, which is awkward to reference in later steps.

<a id="rename_column">

#### Renaming column with `alias()`

[[back to top]](#top)

In [18]:
book.select(split(book.value, " ").alias("line")).printSchema()

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = false)



Another way to rename a column is `withColumnRenamed()`. It is rarely used in this situation because you have to know the current (auto-generated) column name.

In [19]:
book.select(split(col("value"), " ")).withColumnRenamed("split(value,  , -1)", "line").printSchema()

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = false)



<a id="explode">

#### Exploding a column of arrays into rows of elements

[[back to top]](#top)

In [20]:
from pyspark.sql.functions import explode, col

In [21]:
words = lines.select(explode(col("line")).alias("word"))

In [22]:
words.show()

+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
|    Austen|
|          |
|      This|
|     eBook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



Note that `Prejudice,` still has a trailing comma and that the cell between `Austen` and `This` contains an empty string.
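
Conceptually, `explode` flattens a column of arrays into one row per element. The plain-Python sketch below mimics that on a tiny invented sample; `explode_rows` is a hypothetical helper, not a PySpark function.

```python
def explode_rows(rows):
    """Flatten a list of token lists into one flat list of tokens."""
    return [token for tokens in rows for token in tokens]


lines_sample = [
    ["Jane", "Austen"],
    [""],              # what an empty line looks like after splitting
    [],                # a truly empty array contributes no rows
    ["This", "eBook"],
]

words_sample = explode_rows(lines_sample)
print(words_sample)
```

The single empty string survives the flattening, which is why a blank row appears between `Austen` and `This` in the output above.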

<a id="lower_case">

#### Lower the case of the words in the data frame

[[back to top]](#top)

In [23]:
from pyspark.sql.functions import lower

In [24]:
words_lower = words.select(lower(col("word")).alias("word_lower"))

In [25]:
words_lower.show()

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



<a id="regexp_extract">

#### Using regexp_extract to keep what looks like a word

[[back to top]](#top)

In [26]:
from pyspark.sql.functions import regexp_extract

In [27]:
words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

In [28]:
words_clean.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
+---------+
only showing top 20 rows
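
The pattern choice matters more than it may seem. This section uses `[a-z]+`, while the complete script later uses `[a-z']*`. The sketch below compares the two with Python's `re.search` as a stand-in for Spark's regex engine; for these simple character classes the two should agree, but that is an assumption. `extract_first` is a hypothetical helper and the tokens are invented.

```python
import re


def extract_first(pattern, text):
    """Return the first match of pattern in text, or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


tokens = ["prejudice,", '"my', "don't", "1813"]

# Map each token to (result with "[a-z]+", result with "[a-z']*").
comparison = {
    token: (extract_first(r"[a-z]+", token), extract_first(r"[a-z']*", token))
    for token in tokens
}

for token, (plus_result, star_result) in comparison.items():
    print(repr(token), repr(plus_result), repr(star_result))
```

Two differences stand out. `[a-z']*` keeps the apostrophe in `don't`, whereas `[a-z]+` cuts it to `don`. On the other hand, `*` can match zero characters, so a token that starts with a non-letter such as `"my` yields an empty string with `[a-z']*` but `my` with `[a-z]+`.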



<a id="filtering">

#### Filtering rows in your data frame using `where` or `filter`

[[back to top]](#top)

`where()` is an alias for `filter()`, so the two are interchangeable.

In [29]:
words_nonull = words_clean.filter(col("word") != "")

In [30]:
words_nonull.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
+---------+
only showing top 20 rows



<a id="word_freq">

#### Counting word frequencies using groupby() and count()

[[back to top]](#top)

In [31]:
groups = words_nonull.groupby(col("word"))

In [32]:
print(groups)

<pyspark.sql.group.GroupedData object at 0x000002A4FFFCB130>


In [33]:
results = words_nonull.groupby(col("word")).count()

In [34]:
print(results)

+-------------+-----+
|         word|count|
+-------------+-----+
|       online|    4|
|         some|  209|
|        still|   72|
|          few|   72|
|         hope|  122|
|        those|   60|
|     cautious|    4|
|    imitation|    1|
|          art|    3|
|      solaced|    1|
|       poetry|    2|
|    arguments|    5|
| premeditated|    1|
|      elevate|    1|
|       doubts|    2|
|    destitute|    1|
|    solemnity|    5|
|   lieutenant|    1|
|gratification|    1|
|    connected|   14|
+-------------+-----+
only showing top 20 rows
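
Grouping by a column and counting produces one row per distinct value. On a small sample, the same result can be reproduced in plain Python with `collections.Counter`, which is a handy way to sanity-check the logic. The sample words below are invented for illustration.

```python
from collections import Counter

words_sample = ["the", "use", "of", "the", "ebook", "of", "the"]

# One entry per distinct word, with the number of occurrences as the value.
counts = Counter(words_sample)

print(counts["the"], counts["of"], counts["use"])
```

As in the Spark output above, the grouped result carries no meaningful order until it is sorted explicitly.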



<a id="sorting">

#### Ordering / Sorting Dataframe by Single Column

[[back to top]](#top)

In [35]:
results.orderBy(col("count"), ascending=False).show(10)
# or
# results.orderBy(col("count").desc()).show(10)

+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
|   i| 2052|
|   a| 1997|
|  in| 1920|
| was| 1844|
| she| 1703|
+----+-----+
only showing top 10 rows
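
Sorting by `count` alone leaves the relative order of ties unspecified (for example, `still` and `few` both occur 72 times in the earlier output). Adding the word as a second sort key makes the order deterministic. The plain-Python sketch below shows the idea with counts copied from the outputs above; in PySpark the equivalent would be to pass a second column to `orderBy`.

```python
counts = {"the": 4496, "to": 4235, "of": 3719, "few": 72, "still": 72}

# Sort by count descending, then by word ascending to break ties.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for word, count in ranked:
    print(word, count)
```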



<a id="write">

#### Writing data from a dataframe as csv

[[back to top]](#top)

In [36]:
results.coalesce(1).write.option("header", "true").csv("data/simple_count_single_partition.csv")
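
Despite its `.csv` suffix, the path passed to `csv()` becomes a directory. Spark writes the data into one or more `part-*` files inside it (a single one here because of `coalesce(1)`), next to a `_SUCCESS` marker. Running the cell a second time fails because the path already exists, unless you add `.mode("overwrite")` to the writer. The sketch below locates the data files with the standard library; `find_part_files` is a hypothetical helper, and the demo builds a fake output directory rather than calling Spark.

```python
import tempfile
from pathlib import Path


def find_part_files(output_dir):
    """Return the part-*.csv files in a Spark CSV output directory, sorted by name."""
    return sorted(Path(output_dir).glob("part-*.csv"))


# Demo on a fake directory that imitates Spark's output layout.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "simple_count_single_partition.csv"
    out.mkdir()
    (out / "part-00000-example.csv").write_text("word,count\nthe,4496\n")
    (out / "_SUCCESS").write_text("")
    found = [path.name for path in find_part_files(out)]

print(found)
```

With the real output you would call `find_part_files("data/simple_count_single_partition.csv")` and read the returned file.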

<a id="complete_word_count_script">

#### Complete Word Count Script

[[back to top]](#top)

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    explode,
    lower,
    regexp_extract,
    split,
)
 
spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()
 
book = spark.read.text("./data/gutenberg_books/1342-0.txt")
 
lines = book.select(split(book.value, " ").alias("line"))
 
words = lines.select(explode(col("line")).alias("word"))
 
words_lower = words.select(lower(col("word")).alias("word"))
 
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
 
words_nonull = words_clean.where(col("word") != "")
 
results = words_nonull.groupby(col("word")).count()
 
results.orderBy("count", ascending=False).show(10)
 
results.coalesce(1).write.option("header", "true").csv("data/simple_count_single_partition.csv")

<a id="imports">

#### Simplifying your dependencies with PySpark’s import conventions

[[back to top]](#top)

In [None]:
# Before
from pyspark.sql.functions import col, explode, lower, regexp_extract, split
 
# After
import pyspark.sql.functions as F

<a id="method_chaining">

#### Removing intermediate variables by chaining transformation methods

[[back to top]](#top)

In [None]:
# Before
book = spark.read.text("./data/gutenberg_books/1342-0.txt")
 
lines = book.select(split(book.value, " ").alias("line"))
 
words = lines.select(explode(col("line")).alias("word"))
 
words_lower = words.select(lower(col("word")).alias("word"))
 
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
 
words_nonull = words_clean.where(col("word") != "")
 
results = words_nonull.groupby("word").count()
 
# After
import pyspark.sql.functions as F
 
results = (
    spark.read.text("./data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
)

#### Modified Complete Word Count Script using Method Chaining

[[back to top]](#top)

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
 
spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
 
results = (
    spark.read.text("./data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby("word")
    .count()
)
 
results.coalesce(1).write.option("header", "true").csv("data/simple_count_single_partition.csv")