### Spark notebook ###

This notebook only works in a Jupyter Notebook or JupyterLab session running on the cluster master node in the cloud.

Follow the instructions on the computing resources page to start a cluster and open this notebook.

**Steps**

1. Connect to the Windows server using Windows App.
2. Connect to Kubernetes.
3. Start Jupyter and open this notebook from Jupyter in order to connect to Spark.

In [1]:
# Run this cell to import pyspark and to define start_spark() and stop_spark()

import findspark

findspark.init()

import getpass
import pandas
import pyspark
import random
import re

from IPython.display import display, HTML
from pyspark import SparkContext
from pyspark.sql import SparkSession


# Constants used to interact with Azure Blob Storage using the hdfs command or Spark

# Strip any "@domain" suffix from the login name (module-level names are already global)

username = re.sub('@.*', '', getpass.getuser())


# Functions used below

def dict_to_html(d):
    """Convert a Python dictionary into a two column table for display.
    """

    html = []

    html.append(f'<table width="100%" style="width:100%; font-family: monospace;">')
    for k, v in d.items():
        html.append(f'<tr><td style="text-align:left;">{k}</td><td>{v}</td></tr>')
    html.append(f'</table>')

    return ''.join(html)


def show_as_html(df, n=20):
    """Leverage the existing pandas Jupyter integration to show a Spark DataFrame as HTML.
    
    Args:
        df (pyspark.sql.DataFrame): Spark DataFrame to show
        n (int): number of rows to show (default: 20)
    """

    display(df.limit(n).toPandas())

    
def display_spark():
    """Display the status of the active Spark session if one is currently running.
    """
    
    if 'spark' in globals() and 'sc' in globals():

        name = sc.getConf().get("spark.app.name")

        html = [
            f'<p><b>Spark</b></p>',
            f'<p>The spark session is <b><span style="color:green">active</span></b>, look for <code>{name}</code> under the running applications section in the Spark UI.</p>',
            f'<ul>',
            f'<li><a href="http://localhost:{sc.uiWebUrl.split(":")[-1]}" target="_blank">Spark Application UI</a></li>',
            f'</ul>',
            f'<p><b>Config</b></p>',
            dict_to_html(dict(sc.getConf().getAll())),
            f'<p><b>Notes</b></p>',
            f'<ul>',
            f'<li>The spark session <code>spark</code> and spark context <code>sc</code> global variables have been defined by <code>start_spark()</code>.</li>',
            f'<li>Please run <code>stop_spark()</code> before closing the notebook or restarting the kernel or kill <code>{name}</code> by hand using the link in the Spark UI.</li>',
            f'</ul>',
        ]
        display(HTML(''.join(html)))
        
    else:
        
        html = [
            f'<p><b>Spark</b></p>',
            f'<p>The spark session is <b><span style="color:red">stopped</span></b>, confirm that <code>{username} (notebook)</code> is under the completed applications section in the Spark UI.</p>',
            f'<ul>',
            f'<li><a href="http://mathmadslinux2p.canterbury.ac.nz:8080/" target="_blank">Spark UI</a></li>',
            f'</ul>',
        ]
        display(HTML(''.join(html)))


# Functions to start and stop spark

def start_spark(executor_instances=2, executor_cores=1, worker_memory=1, master_memory=1):
    """Start a new Spark session and define globals for SparkSession (spark) and SparkContext (sc).
    
    Args:
        executor_instances (int): number of executors (default: 2)
        executor_cores (int): number of cores per executor (default: 1)
        worker_memory (int): memory per executor in GB, passed to Spark as e.g. "1g" (default: 1)
        master_memory (int): driver memory in GB, passed to Spark as e.g. "1g" (default: 1)
    """

    global spark
    global sc

    cores = executor_instances * executor_cores
    partitions = cores * 4
    port = 4000 + random.randint(1, 999)

    spark = (
        SparkSession.builder
        .config("spark.driver.extraJavaOptions", f"-Dderby.system.home=/tmp/{username}/spark/")
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.instances", str(executor_instances))
        .config("spark.executor.cores", str(executor_cores))
        .config("spark.cores.max", str(cores))
        .config("spark.driver.memory", f'{master_memory}g')
        .config("spark.executor.memory", f'{worker_memory}g')
        .config("spark.driver.maxResultSize", "0")
        .config("spark.sql.shuffle.partitions", str(partitions))
        .config("spark.kubernetes.container.image", "madsregistry001.azurecr.io/hadoop-spark:v3.3.5-openjdk-8")
        .config("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
        .config("spark.kubernetes.memoryOverheadFactor", "0.3")
        .config("spark.memory.fraction", "0.1")
        .config("spark.app.name", f"{username} (notebook)")
        .getOrCreate()
    )
    sc = SparkContext.getOrCreate()
    
    display_spark()

    
def stop_spark():
    """Stop the active Spark session and delete globals for SparkSession (spark) and SparkContext (sc).
    """

    global spark
    global sc

    if 'spark' in globals() and 'sc' in globals():

        spark.stop()

        del spark
        del sc

    display_spark()


# Make css changes to improve spark output readability

html = [
    '<style>',
    'pre { white-space: pre !important; }',
    'table.dataframe td { white-space: nowrap !important; }',
    'table.dataframe thead th:first-child, table.dataframe tbody th { display: none; }',
    '</style>',
]
display(HTML(''.join(html)))

### DataFrame API ###

The code below demonstrates some common **transformations**, **actions**, and **functions** in the DataFrame API.

**Sections**

- [Data](#Data)
- [Transformations](#Transformations)
- [Null values](#Null-values)
- [Statistics](#Statistics)

**Key points**

- The datasets used in these examples are designed to have as much complexity as possible while still being small.
- The examples use `printSchema` and `show_as_html` frequently to show the contents of dataframes as they are transformed.

In [2]:
# Run this cell to start a spark session in this notebook

start_spark(executor_instances=2, executor_cores=1, worker_memory=1, master_memory=1)

25/08/16 17:25:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


Key,Value
spark.dynamicAllocation.enabled,false
spark.fs.azure.sas.uco-user.madsstorage002.blob.core.windows.net,"""sp=racwdl&st=2024-09-19T08:00:18Z&se=2025-09-19T16:00:18Z&spr=https&sv=2022-11-02&sr=c&sig=qtg6fCdoFz6k3EJLw7dA8D3D8wN0neAYw8yG4z4Lw2o%3D"""
spark.kubernetes.driver.pod.name,spark-master-driver
spark.kubernetes.executor.podNamePrefix,rsh224-notebook-b40e0298b1575f92
spark.app.name,rsh224 (notebook)
spark.fs.azure.sas.campus-user.madsstorage002.blob.core.windows.net,"""sp=racwdl&st=2024-09-19T08:03:31Z&se=2025-09-19T16:03:31Z&spr=https&sv=2022-11-02&sr=c&sig=kMP%2BsBsRzdVVR8rrg%2BNbDhkRBNs6Q98kYY695XMRFDU%3D"""
spark.kubernetes.container.image.pullPolicy,IfNotPresent
spark.kubernetes.namespace,rsh224
spark.driver.memory,1g
spark.executor.memory,1g


In [3]:
# We need to import the Row object and the functions and types defined in the pyspark.sql module

from pyspark.sql import Row, functions as F
from pyspark.sql.types import *

### Data ###

This code creates two datasets, `data` and `department_data`.

**Key points**

- These datasets are designed to have as much complexity as possible while still being small.
- The datasets are constructed in Python using pyspark Row objects, distributed to give an RDD, and then wrapped in a DataFrame (in the Scala API, DataFrame is an alias for Dataset[Row]).
- This code does not load any data from HDFS.

In [4]:
# Create, distribute, and wrap data by hand

schema = StructType([
    StructField("Name"       ,  StringType() , True),
    StructField("Department" ,  StringType() , True),
    StructField("Age"        , IntegerType() , True),
    StructField("Gender"     ,  StringType() , True),
    StructField("Salary"     ,  DoubleType() , True)
])
data = spark.createDataFrame(  # Finally, wrap the RDD with metadata by creating a DataFrame = Dataset[Row]
    sc.parallelize(  # Second, take that list of pyspark row objects, distribute them as Spark rows in an RDD[Row]
        [  # First, define a list of pyspark row objects (this is just a Python list in memory on the master node)
            Row("Alpha One"   , "X" , 28 , "M"  ,  80000.0),
            Row("Bravo Two"   , "X" , 25 , "M"  ,  70000.0),
            Row("Charlie"     , "X" , 23 , "M"  ,  80000.0),  # Charlie has no last name, duplicate salary in department X
            Row("Delta Four"  , "Y" , 30 , None , 100000.0),  # Gender is none
            Row("Echo Five"   , "Y" , 27 , "F"  , 120000.0),
            Row("Foxtrot Six" , "Z" , 20 , "F"  ,  90000.0),
            Row("Golf Seven"  , "Z" , 20 , "F"  ,  50000.0),  # Duplicate age in department Z
            Row("Hotel Eight" , "Z" , 38 , "F"  , 100000.0),
            Row("Indigo Nine" , "Z" , 50 , "M"  ,  70000.0),
            Row("Juliet Ten"  , "Z" , 18 , "F"  ,     None),  # Salary is none
        ]
    ), schema=schema)

print(type(data))
data.printSchema()
print(data)
show_as_html(data)

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: double (nullable = true)

DataFrame[Name: string, Department: string, Age: int, Gender: string, Salary: double]


                                                                                

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0
9,Juliet Ten,Z,18,F,


In [5]:
# Create, distribute, and wrap additional department data by hand

department_schema = StructType([
    StructField("Department", StringType(), True),
    StructField("Name", StringType(), True),
    StructField("Campus", StringType(), True)
])
department_data = spark.createDataFrame(
    sc.parallelize(
        [
            Row("X", "Xray",   "U"),
            Row("Y", "Yankee", "V"),
            Row("Z", "Zulu",   "W"),
        ]
    ), schema=department_schema)

print(type(department_data))
department_data.printSchema()
print(department_data)
show_as_html(department_data)

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- Department: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Campus: string (nullable = true)

DataFrame[Department: string, Name: string, Campus: string]


                                                                                

Unnamed: 0,Department,Name,Campus
0,X,Xray,U
1,Y,Yankee,V
2,Z,Zulu,W


### Transformations ###

The following are the most common untyped DataFrame transformations that you are likely to use regularly.

**Key points**

- Note that these cells do not modify `data` directly but rather define another variable e.g. `temp` so that each cell can be run in isolation successfully.

In [6]:
# Rename a column without changing the other columns

temp = data.withColumnRenamed("Name", "FullName")
temp.printSchema()
show_as_html(temp)

root
 |-- FullName: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: double (nullable = true)



Unnamed: 0,FullName,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0
9,Juliet Ten,Z,18,F,


In [7]:
# Add a new column while keeping all existing columns (note that the type of the new column is inferred from the expression)

temp = data.withColumn("Names", F.split(F.col("Name"), " "))
temp.printSchema()
show_as_html(temp)

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: double (nullable = true)
 |-- Names: array (nullable = true)
 |    |-- element: string (containsNull = false)



Unnamed: 0,Name,Department,Age,Gender,Salary,Names
0,Alpha One,X,28,M,80000.0,"[Alpha, One]"
1,Bravo Two,X,25,M,70000.0,"[Bravo, Two]"
2,Charlie,X,23,M,80000.0,[Charlie]
3,Delta Four,Y,30,,100000.0,"[Delta, Four]"
4,Echo Five,Y,27,F,120000.0,"[Echo, Five]"
5,Foxtrot Six,Z,20,F,90000.0,"[Foxtrot, Six]"
6,Golf Seven,Z,20,F,50000.0,"[Golf, Seven]"
7,Hotel Eight,Z,38,F,100000.0,"[Hotel, Eight]"
8,Indigo Nine,Z,50,M,70000.0,"[Indigo, Nine]"
9,Juliet Ten,Z,18,F,,"[Juliet, Ten]"


In [8]:
# Select a variety of columns or expressions involving columns (note use of F.col() to access methods such as .alias()) 

temp = data.select(
    F.col("Name").alias("FullName"),
    F.split(F.col("Name"), " ").alias("Names"),
    F.col("Department"),
    "Age",
    "Gender",
    "Salary",
)
temp.printSchema()
show_as_html(temp)

root
 |-- FullName: string (nullable = true)
 |-- Names: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- Department: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: double (nullable = true)



Unnamed: 0,FullName,Names,Department,Age,Gender,Salary
0,Alpha One,"[Alpha, One]",X,28,M,80000.0
1,Bravo Two,"[Bravo, Two]",X,25,M,70000.0
2,Charlie,[Charlie],X,23,M,80000.0
3,Delta Four,"[Delta, Four]",Y,30,,100000.0
4,Echo Five,"[Echo, Five]",Y,27,F,120000.0
5,Foxtrot Six,"[Foxtrot, Six]",Z,20,F,90000.0
6,Golf Seven,"[Golf, Seven]",Z,20,F,50000.0
7,Hotel Eight,"[Hotel, Eight]",Z,38,F,100000.0
8,Indigo Nine,"[Indigo, Nine]",Z,50,M,70000.0
9,Juliet Ten,"[Juliet, Ten]",Z,18,F,


In [9]:
# Sort on a subset of columns or column expressions

temp = data.sort(["Department", "Gender", F.col("Salary").desc()])
temp.printSchema()
show_as_html(temp)

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: double (nullable = true)



Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Charlie,X,23,M,80000.0
2,Bravo Two,X,25,M,70000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Hotel Eight,Z,38,F,100000.0
6,Foxtrot Six,Z,20,F,90000.0
7,Golf Seven,Z,20,F,50000.0
8,Juliet Ten,Z,18,F,
9,Indigo Nine,Z,50,M,70000.0
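
In the output above, Delta Four (null `Gender`) comes first within department Y, while Juliet Ten (null `Salary`) comes last within department Z and gender F. This matches Spark's default null ordering: nulls first for ascending keys and nulls last for descending keys. The following is a minimal pure-Python sketch of that ordering, not Spark code. The helper names `asc_nulls_first` and `desc_nulls_last` are hypothetical, and `None` stands in for a Spark null.

```python
# Pure-Python sketch (not Spark) of the default null ordering seen in the sorted output:
# ascending keys place None first, descending keys place None last.

rows = [
    ("Delta Four", "Y", None, 100000.0),
    ("Echo Five", "Y", "F", 120000.0),
    ("Foxtrot Six", "Z", "F", 90000.0),
    ("Golf Seven", "Z", "F", 50000.0),
    ("Hotel Eight", "Z", "F", 100000.0),
    ("Juliet Ten", "Z", "F", None),
]


def asc_nulls_first(value):
    # None sorts before every real value; real values keep their natural order
    return (0, "") if value is None else (1, value)


def desc_nulls_last(value):
    # Negate numbers so larger values come first; None sorts after every real value
    return (1, 0.0) if value is None else (0, -value)


def sort_key(row):
    name, department, gender, salary = row
    return (department, asc_nulls_first(gender), desc_nulls_last(salary))


sorted_names = [row[0] for row in sorted(rows, key=sort_key)]
print(sorted_names)
```

In PySpark the defaults can be overridden per column, for example with `F.col("Salary").desc_nulls_first()`.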


In [10]:
# Aggregate data on one or more columns using groupBy

department_salary_summary = (
    data
    .groupBy(["Department"])  # Returns a GroupedData object; you need to call .agg() before you can e.g. run .show()
    .agg(
        F.count(F.col("Name")).alias("Count"),
        F.sum(F.col("Salary")).alias("TotalSalary"),
    )
    .sort(["Department"])
)
department_salary_summary.printSchema()
show_as_html(department_salary_summary)

root
 |-- Department: string (nullable = true)
 |-- Count: long (nullable = false)
 |-- TotalSalary: double (nullable = true)



Unnamed: 0,Department,Count,TotalSalary
0,X,3,230000.0
1,Y,2,220000.0
2,Z,5,310000.0
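
Department Z reports a `Count` of 5 even though only four of its salaries are known, because `F.count` on a column counts the non-null values of that column (here `Name`) and `F.sum` skips nulls. The following is a minimal pure-Python sketch of those semantics, not Spark code. `None` stands in for a Spark null, and `SalaryCount` is a hypothetical extra column added only to show the difference.

```python
from collections import defaultdict

# (Name, Department, Salary) for the rows of `data`; None stands in for a Spark null
rows = [
    ("Alpha One", "X", 80000.0), ("Bravo Two", "X", 70000.0), ("Charlie", "X", 80000.0),
    ("Delta Four", "Y", 100000.0), ("Echo Five", "Y", 120000.0),
    ("Foxtrot Six", "Z", 90000.0), ("Golf Seven", "Z", 50000.0),
    ("Hotel Eight", "Z", 100000.0), ("Indigo Nine", "Z", 70000.0), ("Juliet Ten", "Z", None),
]

# Group the rows by department
groups = defaultdict(list)
for name, department, salary in rows:
    groups[department].append((name, salary))

summary = {}
for department, members in sorted(groups.items()):
    salaries = [salary for _, salary in members if salary is not None]
    summary[department] = {
        "Count": sum(1 for name, _ in members if name is not None),  # like F.count on Name
        "SalaryCount": len(salaries),  # like F.count on Salary, which skips the null
        "TotalSalary": sum(salaries),  # like F.sum on Salary, which also skips nulls
    }

for department, stats in summary.items():
    print(department, stats)
```

One difference to keep in mind: for a group with no non-null values, Spark's sum returns null, whereas Python's `sum` of an empty list returns 0.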


In [12]:
# Join data and department data (both have a Name column, so you could rename the department data columns before the join to avoid column name conflicts)

data_with_department_name = (
    data
    .join(department_data, on="Department", how="inner")
    .sort(["Department"])
)
data_with_department_name.printSchema()
show_as_html(data_with_department_name)

root
 |-- Department: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Salary: double (nullable = true)
 |-- Name: string (nullable = true)
 |-- Campus: string (nullable = true)



Unnamed: 0,Department,Name,Age,Gender,Salary,Name.1,Campus
0,X,Charlie,23,M,80000.0,Xray,U
1,X,Bravo Two,25,M,70000.0,Xray,U
2,X,Alpha One,28,M,80000.0,Xray,U
3,Y,Echo Five,27,F,120000.0,Yankee,V
4,Y,Delta Four,30,,100000.0,Yankee,V
5,Z,Juliet Ten,18,F,,Zulu,W
6,Z,Indigo Nine,50,M,70000.0,Zulu,W
7,Z,Hotel Eight,38,F,100000.0,Zulu,W
8,Z,Golf Seven,20,F,50000.0,Zulu,W
9,Z,Foxtrot Six,20,F,90000.0,Zulu,W
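
The joined schema above contains two columns called `Name`, so any later reference to `Name` is ambiguous. The following is a minimal pure-Python sketch of renaming the right-hand column before an inner join, not Spark code. The helpers `rename_column` and `inner_join` and the new column name `DepartmentName` are hypothetical. Unlike Spark, which only complains when the ambiguous column is referenced, this sketch raises as soon as it sees the conflict.

```python
# Pure-Python sketch (not Spark): rename the right-hand "Name" before an inner join on "Department"

people = [
    {"Name": "Alpha One", "Department": "X"},
    {"Name": "Echo Five", "Department": "Y"},
    {"Name": "Juliet Ten", "Department": "Z"},
]
departments = [
    {"Department": "X", "Name": "Xray", "Campus": "U"},
    {"Department": "Y", "Name": "Yankee", "Campus": "V"},
    {"Department": "Z", "Name": "Zulu", "Campus": "W"},
]


def rename_column(rows, old, new):
    """Return copies of the rows with the key `old` renamed to `new`."""
    return [{(new if key == old else key): value for key, value in row.items()} for row in rows]


def inner_join(left, right, on):
    """Pair every left row with every right row that has the same value for `on`."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[on] == right_row[on]:
                # Any shared key other than the join key would be ambiguous after the join
                overlap = (set(left_row) & set(right_row)) - {on}
                if overlap:
                    raise ValueError(f"ambiguous columns: {sorted(overlap)}")
                joined.append({**left_row, **right_row})
    return joined


renamed = rename_column(departments, "Name", "DepartmentName")  # hypothetical column name
joined = inner_join(people, renamed, on="Department")
for row in joined:
    print(row)
```

In PySpark the equivalent rename is `department_data.withColumnRenamed("Name", "DepartmentName")`, applied before the call to `.join()`.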


### Null values ###

Transformations that involve null values are available through the `na` attribute of a DataFrame, which returns a `DataFrameNaFunctions` object.

**Key points**

- `na.replace` also replaces non-null values with other values, as this is the same type of transformation anyway.

In [13]:
# First let's have another look at data before running each of the cells below

show_as_html(data)

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0
9,Juliet Ten,Z,18,F,


In [14]:
# Drop null values

show_as_html(data.na.drop())

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Echo Five,Y,27,F,120000.0
4,Foxtrot Six,Z,20,F,90000.0
5,Golf Seven,Z,20,F,50000.0
6,Hotel Eight,Z,38,F,100000.0
7,Indigo Nine,Z,50,M,70000.0


In [15]:
# Drop null values only on a subset of columns e.g. Salary

show_as_html(data.na.drop(subset=["Salary"]))

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0


In [16]:
# Fill numeric null values with 0 or 0.0

show_as_html(data.na.fill(0))

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0
9,Juliet Ten,Z,18,F,0.0


In [17]:
# Fill string null values with "Prefer not to say"

show_as_html(data.na.fill("Prefer not to say"))

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,X,28,M,80000.0
1,Bravo Two,X,25,M,70000.0
2,Charlie,X,23,M,80000.0
3,Delta Four,Y,30,Prefer not to say,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0
9,Juliet Ten,Z,18,F,
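
The two fill outputs above show that `na.fill` only touches columns whose type matches the replacement value: filling with `0` leaves the null `Gender` alone, and filling with a string leaves the null `Salary` alone. The following is a minimal pure-Python sketch of that type matching, not Spark code. The `schema` dictionary and the helper `fill_nulls` are hypothetical, and `None` stands in for a Spark null.

```python
# Pure-Python sketch (not Spark) of type-matched null filling

schema = {"Name": str, "Department": str, "Age": int, "Gender": str, "Salary": float}

rows = [
    {"Name": "Delta Four", "Department": "Y", "Age": 30, "Gender": None, "Salary": 100000.0},
    {"Name": "Juliet Ten", "Department": "Z", "Age": 18, "Gender": "F", "Salary": None},
]


def fill_nulls(rows, value):
    """Replace None only in columns whose declared type is compatible with `value`."""
    target_types = (str,) if isinstance(value, str) else (int, float)
    filled = []
    for row in rows:
        new_row = {}
        for column, cell in row.items():
            if cell is None and schema[column] in target_types:
                # Cast to the column type, so 0 becomes 0.0 in a float column
                new_row[column] = schema[column](value)
            else:
                new_row[column] = cell
        filled.append(new_row)
    return filled


numeric_filled = fill_nulls(rows, 0)
string_filled = fill_nulls(rows, "Prefer not to say")
print(numeric_filled)
print(string_filled)
```

To fill several columns of different types in one call, PySpark also accepts a dictionary, for example `data.na.fill({"Gender": "Prefer not to say", "Salary": 0.0})`.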


In [18]:
# Replace "X" with "Y" in the Department column only

show_as_html(data.na.replace("X", "Y", subset=["Department"]))

Unnamed: 0,Name,Department,Age,Gender,Salary
0,Alpha One,Y,28,M,80000.0
1,Bravo Two,Y,25,M,70000.0
2,Charlie,Y,23,M,80000.0
3,Delta Four,Y,30,,100000.0
4,Echo Five,Y,27,F,120000.0
5,Foxtrot Six,Z,20,F,90000.0
6,Golf Seven,Z,20,F,50000.0
7,Hotel Eight,Z,38,F,100000.0
8,Indigo Nine,Z,50,M,70000.0
9,Juliet Ten,Z,18,F,


### Statistics ###

Statistics and simple random sampling are available through the `stat` attribute of a DataFrame, which returns a `DataFrameStatFunctions` object.

**Key points**

- `stat` does not provide exact stratified random sampling. We will look at this in later examples when we talk about sampling for machine learning.

In [None]:
# First let's have another look at data before running each of the cells below

show_as_html(data)

In [None]:
# Simple random sampling per Department with specific sampling probabilities

show_as_html(data.stat.sampleBy("Department", {"X": 0.5, "Y": 0.5, "Z": 0.5}))
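
`sampleBy` keeps each row independently with the probability given for its stratum, so the number of rows returned per department varies from run to run instead of being exactly half. The following is a minimal pure-Python sketch of that behaviour, not Spark code. The helper `sample_by` and the seeds are hypothetical.

```python
import random

departments = ["X"] * 3 + ["Y"] * 2 + ["Z"] * 5  # Department column of `data`
fractions = {"X": 0.5, "Y": 0.5, "Z": 0.5}


def sample_by(values, fractions, seed):
    """Keep each value independently with the probability given for its stratum."""
    rng = random.Random(seed)
    # Strata missing from `fractions` get probability 0.0 and are never kept
    return [value for value in values if rng.random() < fractions.get(value, 0.0)]


# Repeat the sampling with different seeds to see how much the sample size varies
sample_sizes = [len(sample_by(departments, fractions, seed)) for seed in range(200)]
print(min(sample_sizes), max(sample_sizes))
```

This is why the cell above is only approximately stratified: the expected sample size per department is half of its rows, but any individual sample can be larger or smaller.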

In [None]:
# Approximate quantiles with different relative errors (note that a relative error of 0.0 gives you the exact quantiles)

probabilities = [0.0, 0.25, 0.5, 0.75, 1.0]  # [min, Q1, median, Q3, max]

relativeError = 0.1

approxQuantiles = data.stat.approxQuantile("Salary", probabilities, relativeError)

print([row["Salary"] for row in data.toLocalIterator()])
print("")
print(approxQuantiles)

In [None]:
# We can automate this to compare results for different relative errors 

probabilities = [0.0, 0.25, 0.5, 0.75, 1.0]

results = []
for relativeError in [1.0, 0.5, 0.1, 0.0]:
    approxQuantiles = data.stat.approxQuantile("Salary", probabilities, relativeError)
    results.append([relativeError] + approxQuantiles)

display(pandas.DataFrame(results, columns=["relativeError"] + probabilities))
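
The relative error bounds the rank of the returned value rather than the value itself. According to the Spark documentation, for N non-null values, probability p, and error err, the returned sample x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The following is a minimal pure-Python sketch that lists the salaries allowed by that bound, not Spark code. The helper `acceptable_values` is hypothetical, and it assumes ranks are counted from 1 and clamped to the range 1 to N.

```python
import math

# Non-null salaries from `data`, sorted (approxQuantile ignores null values)
salaries = sorted([80000.0, 70000.0, 80000.0, 100000.0, 120000.0,
                   90000.0, 50000.0, 100000.0, 70000.0])


def acceptable_values(sorted_values, p, err):
    """Return the values whose 1-based rank lies within the documented error bound."""
    n = len(sorted_values)
    lowest_rank = min(max(math.floor((p - err) * n), 1), n)
    highest_rank = min(max(math.ceil((p + err) * n), 1), n)
    return sorted_values[lowest_rank - 1:highest_rank]


for err in [1.0, 0.5, 0.1, 0.0]:
    print(err, acceptable_values(salaries, 0.5, err))
```

With an error of 1.0 every salary is an acceptable answer for the median, and the set narrows as the error shrinks, which is the pattern to look for when comparing results across relative errors.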

### Stop Spark ###

In [None]:
# Run this cell before closing the notebook or kill your spark application by hand using the link in the Spark UI

stop_spark()