<a id="top">

# Table of Contents

- [Reading from CSV file](#csv)
- [Sorting on a single column: sort() / orderBy()](#sort_single)
- [Sorting on multiple columns: sort() / orderBy()](#sort_multiple)
- [Removing duplicates across all columns](#distinct)
- [Removing duplicates across specific column(s)](#drop_duplicates)
- [Create new column based on values from another column](#new_column)
- [Using User-Defined Function UDF to Create New Column](#udf)
- [Using `when` to Replicate IF-ELSE Logic To Create New Column](#when)
- [Using SQL to create new column](#sql_when)
- [Re-Arrange Columns using select()](#re-order)
- [Data summarizations](#summarizations)
- [Running total](#running_total)
- [Joining or merging](#merging)
- [Filtering](#filtering)
- [Exploding arrays](#exploding_arrays)
- [Using regular expressions](#regex)
- [Connecting to relational databases](#databases)

## Input / Reading Data

[[Back to Top](#top)]

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

In [2]:
spark = (
    SparkSession.builder
    # Enable eager, interactive mode - typically do not do this with production code
    .config("spark.sql.repl.eagerEval.enabled", "True")
    .appName("PySpark Cheat Sheet")
    .getOrCreate()
)



Sample data can be obtained from [here](https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv).

<a id="csv">

#### Read CSV file without specifying a schema

[[Back to Top](#top)]

There is a sample data set located at `data/cars.csv`.  Let's first inspect it with the terminal command `head` to see whether it has a header row and how it is delimited:

In [3]:
# Terminal commands can be executed within a code cell by preceding it with an exclamation symbol
!head data/cars.csv

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US
Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US
Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US
AMC Rebel SST;16.0;8;304.0;150.0;3433.;12.0;70;US
Ford Torino;17.0;8;302.0;140.0;3449.;10.5;70;US
Ford Galaxie 500;15.0;8;429.0;198.0;4341.;10.0;70;US
Chevrolet Impala;14.0;8;454.0;220.0;4354.;9.0;70;US
Plymouth Fury iii;14.0;8;440.0;215.0;4312.;8.5;70;US
Pontiac Catalina;14.0;8;455.0;225.0;4425.;10.0;70;US


The file does have a header row, and its columns are delimited by semicolons.
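
The same check can be scripted instead of eyeballed. Below is a minimal sketch using the standard library's `csv.Sniffer`; the `sample` string simply pastes the first lines of the `head` output so the snippet runs on its own, and the list of candidate delimiters is an assumption made for this example.

```python
import csv

# First lines of data/cars.csv, copied from the `head` output above
sample = (
    "Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin\n"
    "Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US\n"
    "Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US\n"
    "Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US\n"
)

# Restrict the guess to a few common delimiters (an assumption for this sketch)
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
detected_sep = dialect.delimiter

# Parse the header row with the detected delimiter
header = next(csv.reader(sample.splitlines(), delimiter=detected_sep))
print(detected_sep, header)
```

The detected delimiter is what would be passed as `sep=` to `spark.read.csv`.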

In [4]:
cars = spark.read.csv('data/cars.csv', header=True, sep=";")

In [5]:
cars.show(5)

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0| 3504.|        12.0|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0| 3693.|        11.5|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0| 3436.|        11.0|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0| 3433.|        12.0|   70|    US|
|         Ford Torino|17.0|        8|       302.0|     140.0| 3449.|        10.5|   70|    US|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
only showing top 5 rows



In [6]:
cars.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: string (nullable = true)
 |-- Cylinders: string (nullable = true)
 |-- Displacement: string (nullable = true)
 |-- Horsepower: string (nullable = true)
 |-- Weight: string (nullable = true)
 |-- Acceleration: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Origin: string (nullable = true)

Every column comes back as `string` because no schema was supplied and `inferSchema` was not set. Passing `inferSchema=True` to `spark.read.csv` makes Spark guess the column types, at the cost of an extra pass over the data.



#### Explicitly defining schema

In [60]:
from pyspark.sql.types import *

In [61]:
schema = StructType([
    StructField("Car", StringType(),True),
    StructField("MPG", DoubleType(),True),
    StructField("Cylinders", IntegerType(),True),
    StructField("Displacement", DoubleType(), True),
    StructField("Horsepower", DoubleType(), True),
    StructField("Weight", DoubleType(), True),
    StructField("Acceleration", IntegerType(), True),
    StructField("Model", IntegerType(), True),
    StructField("Origin", StringType(), True)
  ])

In [62]:
cars = spark.read.csv('data/cars.csv', header=True, sep=";", schema=schema)

In [10]:
cars.printSchema()

root
 |-- Car: string (nullable = true)
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Displacement: double (nullable = true)
 |-- Horsepower: double (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Acceleration: integer (nullable = true)
 |-- Model: integer (nullable = true)
 |-- Origin: string (nullable = true)



In [12]:
cars.show(5)

+--------------------+----+---------+------------+----------+------+------------+-----+------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0|3504.0|        null|   70|    US|
|   Buick Skylark 320|15.0|        8|       350.0|     165.0|3693.0|        null|   70|    US|
|  Plymouth Satellite|18.0|        8|       318.0|     150.0|3436.0|        null|   70|    US|
|       AMC Rebel SST|16.0|        8|       304.0|     150.0|3433.0|        null|   70|    US|
|         Ford Torino|17.0|        8|       302.0|     140.0|3449.0|        null|   70|    US|
+--------------------+----+---------+------------+----------+------+------------+-----+------+
only showing top 5 rows

Note that `Acceleration` is now `null` in every row: the schema declares it as `IntegerType()`, but the file stores decimals such as `12.0`, which cannot be parsed as integers. Declaring the field as `DoubleType()` would preserve the values. The outputs in the rest of this notebook were produced with the schema above, so they show `null` for this column.



<a id="sort_single">

## Sorting on Single Column

[[Back to Top](#top)]

In [13]:
from pyspark.sql.functions import col

cars.sort(col("MPG").asc())

# Alternative ways to refer to a DataFrame column
# cars.sort(cars.MPG.asc())
# cars.sort(cars["MPG"].asc())  # bracket form also works when a column name contains a space

# orderBy() is an alias of sort(); this variant sorts descending and prints 10 rows
# cars.orderBy(['MPG'], ascending=[False]).show(10, truncate=False)

Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
Citroen DS-21 Pallas,0.0,4,133.0,115.0,3090.0,,70,Europe
Ford Mustang Boss...,0.0,8,302.0,140.0,3353.0,,70,US
Ford Torino (sw),0.0,8,351.0,153.0,4034.0,,70,US
Volkswagen Super ...,0.0,4,97.0,48.0,1978.0,,71,Europe
Saab 900s,0.0,4,121.0,110.0,2800.0,,81,Europe
Chevrolet Chevell...,0.0,8,350.0,165.0,4142.0,,70,US
Plymouth Satellit...,0.0,8,383.0,175.0,4166.0,,70,US
AMC Rebel SST (sw),0.0,8,360.0,175.0,3850.0,,70,US
Hi 1200D,9.0,8,304.0,193.0,4732.0,,70,US
Ford F250,10.0,8,360.0,215.0,4615.0,,70,US


<a id="sort_multiple">

## Sorting on Multiple Columns

[[Back to Top](#top)]

In [14]:
from pyspark.sql.functions import col

cars.sort(col("MPG").asc(), col("Displacement").desc())

# Alternative methods
# cars.sort(cars.MPG.asc(), cars.Displacement.desc())
# cars.orderBy(['MPG','Displacement'], ascending=[True, False]).show(10, truncate=False)

Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
Plymouth Satellit...,0.0,8,383.0,175.0,4166.0,,70,US
AMC Rebel SST (sw),0.0,8,360.0,175.0,3850.0,,70,US
Ford Torino (sw),0.0,8,351.0,153.0,4034.0,,70,US
Chevrolet Chevell...,0.0,8,350.0,165.0,4142.0,,70,US
Ford Mustang Boss...,0.0,8,302.0,140.0,3353.0,,70,US
Citroen DS-21 Pallas,0.0,4,133.0,115.0,3090.0,,70,Europe
Saab 900s,0.0,4,121.0,110.0,2800.0,,81,Europe
Volkswagen Super ...,0.0,4,97.0,48.0,1978.0,,71,Europe
Hi 1200D,9.0,8,304.0,193.0,4732.0,,70,US
Ford F250,10.0,8,360.0,215.0,4615.0,,70,US
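
To make the two-key ordering concrete, here is a plain-Python sketch (no Spark session needed) that sorts a few `(Car, MPG, Displacement)` rows taken from the output above by MPG ascending and, within ties, Displacement descending. It only mirrors the ordering rule; it says nothing about how Spark executes the sort.

```python
# (Car, MPG, Displacement) rows copied from the output above, deliberately shuffled
rows = [
    ("Hi 1200D", 9.0, 304.0),
    ("Ford Torino (sw)", 0.0, 351.0),
    ("Saab 900s", 0.0, 121.0),
    ("AMC Rebel SST (sw)", 0.0, 360.0),
]

# MPG ascending, then Displacement descending (negated so larger values come first)
ordered = sorted(rows, key=lambda r: (r[1], -r[2]))

for car, mpg, displacement in ordered:
    print(car, mpg, displacement)
```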


<a id="distinct">

## Removing Duplicates Across All Columns Using distinct()

[[Back to Top](#top)]

In [15]:
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4600),
    ("James", "Sales", 3000),
]
columns = ["employee_name", "department", "salary"]

# Create DataFrame
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)



                                                                                

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4600  |
|James        |Sales     |3000  |
+-------------+----------+------+



James appears twice, so let's remove the duplicate row.

In [16]:
distinctDF = df.distinct()
distinctDF.show(truncate=False)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4600  |
+-------------+----------+------+



<a id="drop_duplicates">

## Removing Duplicates Across Specific Columns Using drop_duplicates()

[[Back to Top](#top)]

In [17]:
dropDisDF = df.drop_duplicates(["department","salary"])
dropDisDF.show(truncate=False)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
+-------------+----------+------+
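
Robert disappears because he shares the `("Sales", 4600)` key with Michael. The plain-Python sketch below reproduces that result by keeping the first row seen for each `(department, salary)` pair. Keeping the *first* row is a choice made for this sketch; Spark does not document which row of each group survives, so do not rely on it being Michael.

```python
rows = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4600),
    ("James", "Sales", 3000),
]

# Keep one row per (department, salary) key; this sketch keeps the first one seen
kept = {}
for name, department, salary in rows:
    kept.setdefault((department, salary), (name, department, salary))

deduped = list(kept.values())
print(deduped)
```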



## Data Transformations

<a id="new_column">

#### Create new column based on values from another column

[[Back to Top](#top)]

In [18]:
from pyspark.sql.functions import split, col

In [19]:
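# Split "Car" on spaces: the first token becomes "Make", the second becomes "Model".
# Note: this overwrites the original integer "Model" (model year) column.
cars = (
    cars
    .withColumn("Make", split(col("Car"), " ").getItem(0))
    .withColumn("Model", split(col("Car"), " ").getItem(1))
)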

In [20]:
cars.show(5, truncate=False)

+-------------------------+----+---------+------------+----------+------+------------+---------+------+---------+
|Car                      |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model    |Origin|Make     |
+-------------------------+----+---------+------------+----------+------+------------+---------+------+---------+
|Chevrolet Chevelle Malibu|18.0|8        |307.0       |130.0     |3504.0|null        |Chevelle |US    |Chevrolet|
|Buick Skylark 320        |15.0|8        |350.0       |165.0     |3693.0|null        |Skylark  |US    |Buick    |
|Plymouth Satellite       |18.0|8        |318.0       |150.0     |3436.0|null        |Satellite|US    |Plymouth |
|AMC Rebel SST            |16.0|8        |304.0       |150.0     |3433.0|null        |Rebel    |US    |AMC      |
|Ford Torino              |17.0|8        |302.0       |140.0     |3449.0|null        |Torino   |US    |Ford     |
+-------------------------+----+---------+------------+----------+------+------------+---------+------+---------+
only showing top 5 rows

<a id="udf">

#### Using UDF to Create New Column

[[Back to Top](#top)]

In [21]:
data = [
    ("bacon", 4.0),
    ("pulled pork", 3.0),
    ("bacon", 12.0),
    ("pastrami", 6.0),
    ("corned beef", 7.5),
    ("bacon", 8.0),
    ("pastrami", 3.0),
    ("honey ham", 5.0),
    ("nova lox", 6.0),
  ]

In [22]:
schema = StructType([
    StructField("Food", StringType(),True),
    StructField("Ounces", DoubleType(),True),
])

In [23]:
df_food = spark.createDataFrame(data=data,schema=schema)

In [24]:
df_food.show()

+-----------+------+
|       Food|Ounces|
+-----------+------+
|      bacon|   4.0|
|pulled pork|   3.0|
|      bacon|  12.0|
|   pastrami|   6.0|
|corned beef|   7.5|
|      bacon|   8.0|
|   pastrami|   3.0|
|  honey ham|   5.0|
|   nova lox|   6.0|
+-----------+------+



In [25]:
from pyspark.sql.functions import udf

In [26]:
def food2animal(column):
    if column == 'bacon':
        return 'pig'
    elif column == 'pulled pork':
        return 'pig'
    elif column == 'pastrami':
        return 'cow'
    elif column == 'corned beef':
        return 'cow'
    elif column == 'honey ham':
        return 'pig'
    else:
        return 'salmon'

In [27]:
food2animal_udf = udf(food2animal, StringType())

In [28]:
df_food_with_animal = df_food.withColumn("animal", food2animal_udf("Food"))

In [29]:
df_food_with_animal.show()

                                                                                

+-----------+------+------+
|       Food|Ounces|animal|
+-----------+------+------+
|      bacon|   4.0|   pig|
|pulled pork|   3.0|   pig|
|      bacon|  12.0|   pig|
|   pastrami|   6.0|   cow|
|corned beef|   7.5|   cow|
|      bacon|   8.0|   pig|
|   pastrami|   3.0|   cow|
|  honey ham|   5.0|   pig|
|   nova lox|   6.0|salmon|
+-----------+------+------+
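
The `if`/`elif` chain in `food2animal` can also be written as a dictionary lookup. This is a sketch of one alternative, not the notebook's method; the names `FOOD_TO_ANIMAL` and `food2animal_lookup` are hypothetical.

```python
# Hypothetical lookup table replacing the if/elif chain in food2animal
FOOD_TO_ANIMAL = {
    "bacon": "pig",
    "pulled pork": "pig",
    "honey ham": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
}


def food2animal_lookup(food):
    """Return the source animal for a food, defaulting to 'salmon' like the original."""
    return FOOD_TO_ANIMAL.get(food, "salmon")


animals = [food2animal_lookup(f) for f in ["bacon", "pastrami", "nova lox"]]
print(animals)
```

It can be wrapped exactly like the original function: `udf(food2animal_lookup, StringType())`.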



<a id="when">

#### Using `when` to Replicate IF-ELSE Logic To Create New Column

[[Back to Top](#top)]

In [30]:
from pyspark.sql.functions import when

In [31]:
data = [
    ("bacon", 4.0),
    ("pulled pork", 3.0),
    ("bacon", 12.0),
    ("pastrami", 6.0),
    ("corned beef", 7.5),
    ("bacon", 8.0),
    ("pastrami", 3.0),
    ("honey ham", 5.0),
    ("nova lox", 6.0),
  ]

schema = StructType([
    StructField("Food", StringType(),True),
    StructField("Ounces", DoubleType(),True),
])

df_food = spark.createDataFrame(data=data,schema=schema)

In [32]:
df_food.show()

+-----------+------+
|       Food|Ounces|
+-----------+------+
|      bacon|   4.0|
|pulled pork|   3.0|
|      bacon|  12.0|
|   pastrami|   6.0|
|corned beef|   7.5|
|      bacon|   8.0|
|   pastrami|   3.0|
|  honey ham|   5.0|
|   nova lox|   6.0|
+-----------+------+



In [33]:
df_food.withColumn(
    'Animal',
    when(col("Food") == 'bacon', 'pig')
    .when(col("Food") == 'pulled pork', 'pig')
    .when(col("Food") == 'pastrami', 'cow')
    .when(col("Food") == 'corned beef', 'cow')
    .when(col("Food") == 'honey ham', 'pig')
    .otherwise('salmon')
)

Food,Ounces,Animal
bacon,4.0,pig
pulled pork,3.0,pig
bacon,12.0,pig
pastrami,6.0,cow
corned beef,7.5,cow
bacon,8.0,pig
pastrami,3.0,cow
honey ham,5.0,pig
nova lox,6.0,salmon


<a id="re-order">

<a id="sql_when">

#### Using SQL to Create New Column

[[Back to Top](#top)]

In [5]:
data = [
    ("bacon", 4.0),
    ("pulled pork", 3.0),
    ("bacon", 12.0),
    ("pastrami", 6.0),
    ("corned beef", 7.5),
    ("bacon", 8.0),
    ("pastrami", 3.0),
    ("honey ham", 5.0),
    ("nova lox", 6.0),
  ]

schema = StructType([
    StructField("Food", StringType(),True),
    StructField("Ounces", DoubleType(),True),
  ])

df_food = spark.createDataFrame(data=data,schema=schema)

In [6]:
df_food.show()

                                                                                

+-----------+------+
|       Food|Ounces|
+-----------+------+
|      bacon|   4.0|
|pulled pork|   3.0|
|      bacon|  12.0|
|   pastrami|   6.0|
|corned beef|   7.5|
|      bacon|   8.0|
|   pastrami|   3.0|
|  honey ham|   5.0|
|   nova lox|   6.0|
+-----------+------+



In [7]:
df_food.createOrReplaceTempView('food_table')
newDF = spark.sql(
    '''
    select *,
    case
    when Food = 'bacon' then 'pig'
    when Food = 'pulled pork' then 'pig'
    when Food = 'pastrami' then 'cow'
    when Food = 'corned beef' then 'cow'
    when Food = 'honey ham' then 'pig'
    else 'salmon' end as Animal from food_table
    '''
)
newDF.show()

+-----------+------+------+
|       Food|Ounces|Animal|
+-----------+------+------+
|      bacon|   4.0|   pig|
|pulled pork|   3.0|   pig|
|      bacon|  12.0|   pig|
|   pastrami|   6.0|   cow|
|corned beef|   7.5|   cow|
|      bacon|   8.0|   pig|
|   pastrami|   3.0|   cow|
|  honey ham|   5.0|   pig|
|   nova lox|   6.0|salmon|
+-----------+------+------+



**New in PySpark 3.4** - We no longer need to create a temp view to query a DataFrame directly with SQL:

In [10]:
spark.sql(
    "select * from {df}", df=newDF
)

Food,Ounces,Animal
bacon,4.0,pig
pulled pork,3.0,pig
bacon,12.0,pig
pastrami,6.0,cow
corned beef,7.5,cow
bacon,8.0,pig
pastrami,3.0,cow
honey ham,5.0,pig
nova lox,6.0,salmon


#### Re-Arrange Columns using `select()`

[[Back to Top](#top)]

In [36]:
cars.select('Make', 'Model', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Origin')

Make,Model,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin
Chevrolet,Chevelle,18.0,8,307.0,130.0,3504.0,,US
Buick,Skylark,15.0,8,350.0,165.0,3693.0,,US
Plymouth,Satellite,18.0,8,318.0,150.0,3436.0,,US
AMC,Rebel,16.0,8,304.0,150.0,3433.0,,US
Ford,Torino,17.0,8,302.0,140.0,3449.0,,US
Ford,Galaxie,15.0,8,429.0,198.0,4341.0,,US
Chevrolet,Impala,14.0,8,454.0,220.0,4354.0,,US
Plymouth,Fury,14.0,8,440.0,215.0,4312.0,,US
Pontiac,Catalina,14.0,8,455.0,225.0,4425.0,,US
AMC,Ambassador,15.0,8,390.0,190.0,3850.0,,US


<a id="summarizations">

## Data Summarizations

[[Back to Top](#top)]

In [37]:
cars.show(truncate=False)

+--------------------------------+----+---------+------------+----------+------+------------+----------+------+---------+
|Car                             |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model     |Origin|Make     |
+--------------------------------+----+---------+------------+----------+------+------------+----------+------+---------+
|Chevrolet Chevelle Malibu       |18.0|8        |307.0       |130.0     |3504.0|null        |Chevelle  |US    |Chevrolet|
|Buick Skylark 320               |15.0|8        |350.0       |165.0     |3693.0|null        |Skylark   |US    |Buick    |
|Plymouth Satellite              |18.0|8        |318.0       |150.0     |3436.0|null        |Satellite |US    |Plymouth |
|AMC Rebel SST                   |16.0|8        |304.0       |150.0     |3433.0|null        |Rebel     |US    |AMC      |
|Ford Torino                     |17.0|8        |302.0       |140.0     |3449.0|null        |Torino    |US    |Ford     |
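+--------------------------------+----+---------+------------+----------+------+------------+----------+------+---------+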

#### Count of rows

In [38]:
cars.count()

406

#### Counts by Groups within a Single Column

In [39]:
cars.groupBy('Origin').count().withColumnRenamed('count', 'Count')

Origin,Count
Europe,73
US,254
Japan,79


#### Aggregations

In [40]:
from pyspark.sql.functions import mean 

In [41]:
cars.groupBy(
    "Origin"
).agg(
    mean('MPG')
).show()

+------+------------------+
|Origin|          avg(MPG)|
+------+------------------+
|Europe|26.745205479452057|
|    US|19.688188976377948|
| Japan|30.450632911392397|
+------+------------------+



<a id="running_total">

## Running Total / Cumulative Sum

[[Back to Top](#top)]

`ChatGPT prompt:` Using PySpark 3.2, how do I obtain multiple running totals based on values on one column and the running totals are differentiated based on values from another column?

In [63]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import sum, col
from pyspark.sql.window import Window

In [73]:
windowSpec = Window.orderBy(col("Origin")).rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [74]:
cars2 = cars.withColumn("cumulative_sum", sum(col("MPG")).over(windowSpec))

In [75]:
cars2.show()

23/04/08 10:07:02 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

#### With partitioning on "Origin"

In [76]:
windowSpecOrigin = Window.partitionBy(col("Origin")).orderBy(col("MPG")).rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [77]:
cars3 = cars.withColumn("cumulative_sum", sum(col("MPG")).over(windowSpecOrigin))

In [78]:
cars3.show()

+--------------------+----+---------+------------+----------+------+------------+-----+------+--------------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|cumulative_sum|
+--------------------+----+---------+------------+----------+------+------------+-----+------+--------------+
|Citroen DS-21 Pallas| 0.0|        4|       133.0|     115.0|3090.0|        null|   70|Europe|           0.0|
|Volkswagen Super ...| 0.0|        4|        97.0|      48.0|1978.0|        null|   71|Europe|           0.0|
|           Saab 900s| 0.0|        4|       121.0|     110.0|2800.0|        null|   81|Europe|           0.0|
|       Peugeot 604sl|16.2|        6|       163.0|     133.0|3410.0|        null|   78|Europe|          16.2|
|  Mercedes-Benz 280s|16.5|        6|       168.0|     120.0|3820.0|        null|   76|Europe|          32.7|
|         Volvo 264gl|17.0|        6|       163.0|     125.0|3140.0|        null|   78|Europe|          49.7|
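+--------------------+----+---------+------------+----------+------+------------+-----+------+--------------+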

In [79]:
cars3.filter(col("Origin")=="US").show()

+--------------------+----+---------+------------+----------+------+------------+-----+------+--------------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|Model|Origin|cumulative_sum|
+--------------------+----+---------+------------+----------+------+------------+-----+------+--------------+
|Chevrolet Chevell...| 0.0|        8|       350.0|     165.0|4142.0|        null|   70|    US|           0.0|
|    Ford Torino (sw)| 0.0|        8|       351.0|     153.0|4034.0|        null|   70|    US|           0.0|
|Plymouth Satellit...| 0.0|        8|       383.0|     175.0|4166.0|        null|   70|    US|           0.0|
|  AMC Rebel SST (sw)| 0.0|        8|       360.0|     175.0|3850.0|        null|   70|    US|           0.0|
|Ford Mustang Boss...| 0.0|        8|       302.0|     140.0|3353.0|        null|   70|    US|           0.0|
|            Hi 1200D| 9.0|        8|       304.0|     193.0|4732.0|        null|   70|    US|           9.0|
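+--------------------+----+---------+------------+----------+------+------------+-----+------+--------------+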
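
To see what the partitioned window computes, here is a plain-Python sketch (no Spark session needed) of a running total that restarts for each `Origin`. The `(Origin, MPG)` pairs are a small subset taken from the outputs above. The sketch mirrors `partitionBy("Origin").orderBy("MPG")` with `rowsBetween(unboundedPreceding, currentRow)`, but it is only an illustration of the arithmetic.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (Origin, MPG) pairs taken from the outputs above
rows = [
    ("Europe", 0.0), ("Europe", 16.2), ("Europe", 16.5), ("Europe", 17.0),
    ("US", 0.0), ("US", 9.0),
]

# partitionBy("Origin") + orderBy("MPG"): order by group, then by value
rows.sort(key=itemgetter(0, 1))

running = []
for origin, group in groupby(rows, key=itemgetter(0)):
    mpgs = [mpg for _, mpg in group]
    # unboundedPreceding .. currentRow: sum of every row in the group up to this one
    for mpg, total in zip(mpgs, accumulate(mpgs)):
        running.append((origin, mpg, round(total, 1)))

for row in running:
    print(row)
```

The total starts again from zero when the group changes from `Europe` to `US`, which is the effect of partitioning the window.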

<a id="merging">

## Joining / Merging

[[Back to Top](#top)]

In [42]:
df_counts = cars.groupBy('Origin').count().withColumnRenamed('count', 'Count')

In [43]:
df_counts.show()

+------+-----+
|Origin|Count|
+------+-----+
|Europe|   73|
|    US|  254|
| Japan|   79|
+------+-----+



In [44]:
df_avgs = cars.groupBy(
    "Origin"
).agg(
    mean('MPG')
)

In [45]:
df_avgs.show()

+------+------------------+
|Origin|          avg(MPG)|
+------+------------------+
|Europe|26.745205479452057|
|    US|19.688188976377948|
| Japan|30.450632911392397|
+------+------------------+



In [46]:
df_avgs = df_avgs.withColumnRenamed('avg(MPG)', 'Avg')

In [47]:
df_avgs.show()

+------+------------------+
|Origin|               Avg|
+------+------------------+
|Europe|26.745205479452057|
|    US|19.688188976377948|
| Japan|30.450632911392397|
+------+------------------+



In [48]:
df_counts.join(df_avgs, df_counts.Origin == df_avgs.Origin, 'inner').select(df_counts.Origin, df_counts.Count, df_avgs.Avg)

Origin,Count,Avg
Europe,73,26.745205479452057
US,254,19.688188976377948
Japan,79,30.4506329113924


As you can see, we have done an inner join between two dataframes. The following joins are supported by PySpark:

- inner (default)
- cross
- outer, full, full_outer (aliases for a full outer join)
- left, left_outer (aliases)
- right, right_outer (aliases)
- left_semi
- left_anti
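
The difference between some of these join types is easier to see on a tiny example. The plain-Python sketch below uses dictionaries keyed by `Origin`; the counts come from the output above, the averages are rounded, and the `"Korea"` entry is made up so that one key has no match. It assumes each key appears at most once per side, which real joins do not require.

```python
# Counts and (rounded) averages per Origin; "Korea" is a made-up unmatched key
counts = {"Europe": 73, "US": 254, "Japan": 79, "Korea": 1}
avgs = {"Europe": 26.7, "US": 19.7, "Japan": 30.5}

# inner: keys present on both sides, columns from both
inner = {k: (counts[k], avgs[k]) for k in counts if k in avgs}

# left: every left key; a missing right value becomes None (null in Spark)
left = {k: (counts[k], avgs.get(k)) for k in counts}

# left_semi: left rows that have a match, left columns only
left_semi = {k: counts[k] for k in counts if k in avgs}

# left_anti: left rows that have no match
left_anti = {k: counts[k] for k in counts if k not in avgs}

print(sorted(inner), left["Korea"], left_anti)
```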

<a id="filtering">

## Filtering

[[Back to Top](#top)]

In [49]:
cars.show(truncate=False)

+--------------------------------+----+---------+------------+----------+------+------------+----------+------+---------+
|Car                             |MPG |Cylinders|Displacement|Horsepower|Weight|Acceleration|Model     |Origin|Make     |
+--------------------------------+----+---------+------------+----------+------+------------+----------+------+---------+
|Chevrolet Chevelle Malibu       |18.0|8        |307.0       |130.0     |3504.0|null        |Chevelle  |US    |Chevrolet|
|Buick Skylark 320               |15.0|8        |350.0       |165.0     |3693.0|null        |Skylark   |US    |Buick    |
|Plymouth Satellite              |18.0|8        |318.0       |150.0     |3436.0|null        |Satellite |US    |Plymouth |
|AMC Rebel SST                   |16.0|8        |304.0       |150.0     |3433.0|null        |Rebel     |US    |AMC      |
|Ford Torino                     |17.0|8        |302.0       |140.0     |3449.0|null        |Torino    |US    |Ford     |
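+--------------------------------+----+---------+------------+----------+------+------------+----------+------+---------+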

In [50]:
cars.filter(col('Make')=='Chevrolet').show(5)

+--------------------+----+---------+------------+----------+------+------------+--------+------+---------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|   Model|Origin|     Make|
+--------------------+----+---------+------------+----------+------+------------+--------+------+---------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0|3504.0|        null|Chevelle|    US|Chevrolet|
|    Chevrolet Impala|14.0|        8|       454.0|     220.0|4354.0|        null|  Impala|    US|Chevrolet|
|Chevrolet Chevell...| 0.0|        8|       350.0|     165.0|4142.0|        null|Chevelle|    US|Chevrolet|
|Chevrolet Monte C...|15.0|        8|       400.0|     150.0|3761.0|        null|   Monte|    US|Chevrolet|
| Chevrolet Vega 2300|28.0|        4|       140.0|      90.0|2264.0|        null|    Vega|    US|Chevrolet|
+--------------------+----+---------+------------+----------+------+------------+--------+------+---------+
only showing top 5 rows



In [51]:
cars.filter(col('Make').contains('Chev')).show(5)

+--------------------+----+---------+------------+----------+------+------------+--------+------+---------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|   Model|Origin|     Make|
+--------------------+----+---------+------------+----------+------+------------+--------+------+---------+
|Chevrolet Chevell...|18.0|        8|       307.0|     130.0|3504.0|        null|Chevelle|    US|Chevrolet|
|    Chevrolet Impala|14.0|        8|       454.0|     220.0|4354.0|        null|  Impala|    US|Chevrolet|
|Chevrolet Chevell...| 0.0|        8|       350.0|     165.0|4142.0|        null|Chevelle|    US|Chevrolet|
|Chevrolet Monte C...|15.0|        8|       400.0|     150.0|3761.0|        null|   Monte|    US|Chevrolet|
|           Chevy C20|10.0|        8|       307.0|     200.0|4376.0|        null|     C20|    US|    Chevy|
+--------------------+----+---------+------------+----------+------+------------+--------+------+---------+
only showing top 5 rows



In [52]:
cars.filter(
    (col('Make').contains('Chev')) &
    (col('Cylinders') < 8)
).show()

+--------------------+----+---------+------------+----------+------+------------+--------+------+----------+
|                 Car| MPG|Cylinders|Displacement|Horsepower|Weight|Acceleration|   Model|Origin|      Make|
+--------------------+----+---------+------------+----------+------+------------+--------+------+----------+
| Chevrolet Vega 2300|28.0|        4|       140.0|      90.0|2264.0|        null|    Vega|    US| Chevrolet|
|Chevrolet Chevell...|17.0|        6|       250.0|     100.0|3329.0|        null|Chevelle|    US| Chevrolet|
| Chevrolet Vega (sw)|22.0|        4|       140.0|      72.0|2408.0|        null|    Vega|    US| Chevrolet|
|      Chevrolet Vega|20.0|        4|       140.0|      90.0|2408.0|        null|    Vega|    US| Chevrolet|
|Chevrolet Nova Cu...|16.0|        6|       250.0|     100.0|3278.0|        null|    Nova|    US| Chevrolet|
|      Chevrolet Vega|21.0|        4|       140.0|      72.0|2401.0|        null|    Vega|    US| Chevrolet|
+--------------------+----+---------+------------+----------+------+------------+--------+------+----------+

In [53]:
spark.stop()

<a id="exploding_arrays">

## Exploding Arrays

[[Back to Top](#top)]

Data set: movies data set from movielens

Question: Which genre of movie is the most popular?

In [3]:
!head data/movies.csv

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action


As the sample above shows, a movie can have more than one genre, delimited by the pipe (`|`) symbol.  In this form it is difficult to count movies by genre, so we first need to transform the data.

Transformation plan:
- We will convert the genres into an array
- Then "explode" the genres such that each genre will be in its own row.  Consequently, this means the other columns will be repeating.
- Then perform a count by "genre"
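
The plan above can be sketched in plain Python on the nine sample rows before running it in Spark. This is only an illustration of the split, explode, and count steps; on the full data set the answer differs from what these nine rows give.

```python
from collections import Counter

# (title, genres) pairs copied from the `head` output above
sample = [
    ("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    ("Jumanji (1995)", "Adventure|Children|Fantasy"),
    ("Grumpier Old Men (1995)", "Comedy|Romance"),
    ("Waiting to Exhale (1995)", "Comedy|Drama|Romance"),
    ("Father of the Bride Part II (1995)", "Comedy"),
    ("Heat (1995)", "Action|Crime|Thriller"),
    ("Sabrina (1995)", "Comedy|Romance"),
    ("Tom and Huck (1995)", "Adventure|Children"),
    ("Sudden Death (1995)", "Action"),
]

# Split each genres string, then emit one (title, genre) row per genre ("explode")
exploded = [(title, genre) for title, genres in sample for genre in genres.split("|")]

# Count rows per genre
genre_counts = Counter(genre for _, genre in exploded)
print(genre_counts.most_common(1))  # [('Comedy', 5)] on these nine rows only
```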

In [4]:
movies = spark.read.csv('data/movies.csv', header=True, sep=",", inferSchema=True)

                                                                                

In [5]:
movies.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
+-------+--------------------+--------------------+

In [12]:
from pyspark.sql.functions import col, split, explode

#### Answer: Drama

In [22]:
(
    movies
    # Raw string: "\|" in a normal string is an invalid escape sequence in Python
    .withColumn("genres_array", split(col("genres"), r"\|"))
    .drop(col("genres"))
    .withColumn("genre", explode(col("genres_array")))
    .groupby(col("genre"))
    .count()
    .withColumnRenamed("genre", "Genre")
    .withColumnRenamed('count', 'Count')
    .sort(col("Count"), ascending=False)
    .filter(col("Genre") != '(no genres listed)')
).show(1)

+-----+-----+
|Genre|Count|
+-----+-----+
|Drama|25606|
+-----+-----+
only showing top 1 row



<a id="regex">

## Using Regular Expressions (RegEx)

[[Back to Top](#top)]

In [23]:
from pyspark.sql.functions import regexp_extract

In [29]:
(
    movies
    .withColumn("year", regexp_extract(col("title"), r"\((\d{4})\)", 1))
)

movieId,title,genres,year
1,Toy Story (1995),Adventure|Animati...,1995
2,Jumanji (1995),Adventure|Childre...,1995
3,Grumpier Old Men ...,Comedy|Romance,1995
4,Waiting to Exhale...,Comedy|Drama|Romance,1995
5,Father of the Bri...,Comedy,1995
6,Heat (1995),Action|Crime|Thri...,1995
7,Sabrina (1995),Comedy|Romance,1995
8,Tom and Huck (1995),Adventure|Children,1995
9,Sudden Death (1995),Action,1995
10,GoldenEye (1995),Action|Adventure|...,1995


## Partitioning

[[Back to Top](#top)]

Work in Progress

<a id="databases">

## Connecting to Relational Databases (JDBC)

[[Back to Top](#top)]

See the [pandas API on Spark documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/from_to_dbms.html) on reading from and writing to a DBMS.

The key to connecting to a relational database over JDBC is the `"spark.jars"` configuration: download the required database driver and/or connector .jar files and point `"spark.jars"` at their location.  There are other ways to connect to databases, but because Spark/PySpark runs on the JVM, JDBC drivers are recommended for best performance and stability.
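
Because a wrong jar path tends to surface only later, when the driver class cannot be loaded, it can help to validate the paths before building the session. The helper below, `spark_jars_value`, is a hypothetical name and not part of PySpark; it checks that each file exists and joins the paths with commas, which is the list format `"spark.jars"` accepts.

```python
from pathlib import Path

def spark_jars_value(*jar_paths):
    """Return a comma-separated string of jar paths for the "spark.jars" setting.

    Raises FileNotFoundError if any path does not point to an existing file.
    """
    resolved = []
    for jar in jar_paths:
        path = Path(jar)
        if not path.is_file():
            raise FileNotFoundError(f"JDBC driver jar not found: {path}")
        resolved.append(str(path))
    return ",".join(resolved)
```

The result would be passed as `.config("spark.jars", spark_jars_value(driver_path))`, where `driver_path` stands for wherever the driver was downloaded.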

**NOTE:** There are various ways to pass database server and credentials information to your Python application.  The examples below assume the use of a `config.ini` text file that contains database server and credentials information that you would parse using Python's standard library [configparser](https://docs.python.org/3/library/configparser.html).
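
As a sketch of what such a file might contain, the example below parses an inline string that uses the same section and key names as the PostgreSQL example; every value is a placeholder. `read_string` is used only to keep the example self-contained, whereas the cells below read a file from the path held in the `CONFIG_PATH` environment variable.

```python
import configparser

# Placeholder contents of a config.ini file; all values are made up
SAMPLE_CONFIG = """
[MY_POSTGRES_DB]
HOST = localhost
PORT = 5432
DB = my_database
USER = my_user
PWD = my_password
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

# Values always come back as strings, including the port
pg_host = config["MY_POSTGRES_DB"]["HOST"]
pg_port = config["MY_POSTGRES_DB"]["PORT"]
pg_db = config["MY_POSTGRES_DB"]["DB"]
print(f"jdbc:postgresql://{pg_host}:{pg_port}/{pg_db}")
```

Note that `configparser` treats option names such as `HOST` case-insensitively, while section names such as `MY_POSTGRES_DB` are case-sensitive.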

#### PostgreSQL

In [None]:
import configparser
import os
import pyspark
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.master("local[*]").appName("Postgres")\
    .config("spark.jars", "C:\\Users\\some_user\\drivers\\jdbc\\postgresql\\postgresql-42.2.23.jar")\
    .getOrCreate()

In [None]:
config_file = os.getenv("CONFIG_PATH")

In [None]:
config = configparser.ConfigParser()
# config.read() does not raise for a missing file; it returns the list of files it parsed
if not config_file or not config.read(config_file):
    print("config.ini file not found")

In [None]:
# Read in the Postgresql database credentials for DSN-less connection
PG_HOST = config["MY_POSTGRES_DB"]["HOST"]
PG_PORT = config["MY_POSTGRES_DB"]["PORT"]
PG_DB = config["MY_POSTGRES_DB"]["DB"]
PG_USER = config["MY_POSTGRES_DB"]["USER"]
PG_PWD = config["MY_POSTGRES_DB"]["PWD"]

In [None]:
url = f'jdbc:postgresql://{PG_HOST}:{PG_PORT}/{PG_DB}'
driver = 'org.postgresql.Driver'

In [None]:
query = "SELECT CURRENT_DATE"

In [None]:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", PG_USER) \
    .option("password", PG_PWD) \
    .option("query", query) \
    .load()

In [None]:
jdbcDF.show()

In [None]:
spark.stop()

#### IBM DB2 LUW

In [None]:
spark = SparkSession.builder.master("local[*]").appName("SCOODS")\
    .config("spark.jars", "C:\\Users\\some_user\\drivers\\jdbc\\mainframe\\db2jcc.jar")\
    .getOrCreate()

In [None]:
config_file = os.getenv("CONFIG_PATH")

In [None]:
config = configparser.ConfigParser()
# config.read() does not raise for a missing file; it returns the list of files it parsed
if not config_file or not config.read(config_file):
    print("config.ini file not found")

In [None]:
# Read in the DB2 LUW credentials for DSN-less connection
LUW_HOST = config["MY_DB2_LUW"]["HOST"]
LUW_PORT = config["MY_DB2_LUW"]["PORT"]
LUW_DB = config["MY_DB2_LUW"]["DB"]
LUW_USER = config["MY_DB2_LUW"]["USER"]
LUW_PWD = config["MY_DB2_LUW"]["PWD"]

In [None]:
url = f'jdbc:db2://{LUW_HOST}:{LUW_PORT}/{LUW_DB}:useJDBC4ColumnNameAndLabelSemantics=false;'
driver = 'com.ibm.db2.jcc.DB2Driver'

In [None]:
query = "SELECT CURRENT TIMESTAMP as DATETIME_NOW FROM SYSIBM.SYSDUMMY1"

In [None]:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", LUW_USER) \
    .option("password", LUW_PWD) \
    .option("query", query) \
    .load()

In [None]:
jdbcDF.show(truncate=False)

In [None]:
spark.stop()

#### Microsoft SQL Server

In [None]:
spark = SparkSession.builder.master("local[*]").appName("NAPS")\
    .config("spark.jars", "C:\\Users\\some_user\\drivers\\mssql_jdbc\\mssql-jdbc-9.4.0.jre8.jar")\
    .getOrCreate()

In [None]:
config_file = os.getenv("CONFIG_PATH")

In [None]:
config = configparser.ConfigParser()
# config.read() does not raise for a missing file; it returns the list of files it parsed
if not config_file or not config.read(config_file):
    print("config.ini file not found")

In [None]:
# Read in the SQL Server database credentials for DSN-less connection
SSQL_HOST = config["MY_SQL_SERVER"]["HOST"]
SSQL_PORT = config["MY_SQL_SERVER"]["PORT"]
SSQL_DB = config["MY_SQL_SERVER"]["DB"]

In [None]:
url = f'jdbc:sqlserver://{SSQL_HOST}:{SSQL_PORT};databaseName={SSQL_DB};integratedSecurity=true'
driver = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'

In [None]:
query = "SELECT * from DimCarrier"

In [None]:
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("query", query) \
    .load()

In [None]:
jdbcDF.show(truncate=False)

In [None]:
spark.stop()
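
The PostgreSQL, DB2 LUW, and SQL Server examples above pass the same handful of reader options, so they could share one small helper. `jdbc_options` below is a hypothetical function, not a PySpark API; it only assembles a dictionary and leaves out the credentials when none are given, as in the SQL Server case with integrated security. The connection details are placeholders.

```python
def jdbc_options(url, driver, query, user=None, password=None):
    """Build the option dictionary for spark.read.format("jdbc")."""
    options = {"url": url, "driver": driver, "query": query}
    # Only include credentials when both are supplied (integrated security needs neither)
    if user is not None and password is not None:
        options["user"] = user
        options["password"] = password
    return options

# Placeholder connection details for illustration
opts = jdbc_options(
    url="jdbc:postgresql://localhost:5432/my_database",
    driver="org.postgresql.Driver",
    query="SELECT CURRENT_DATE",
    user="my_user",
    password="my_password",
)
print(sorted(opts))  # ['driver', 'password', 'query', 'url', 'user']
```

With an active session, the dictionary can be unpacked into the reader: `spark.read.format("jdbc").options(**opts).load()`.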

#### IBM DB2 z/OS

Unfortunately, the IBM Java JRE does not work with Spark 3.x, so we cannot connect to the mainframe DB2 z/OS platform.  Mainframe is too OLD!  Let it rest in peace.

#### Connecting to Snowflake

[Link](https://docs.snowflake.com/en/user-guide/spark-connector-install.html) to Snowflake's documentation on working with the PySpark connector

In [None]:
from pathlib import Path
import configparser
import os
import pyspark
from pyspark.sql import SparkSession

In [None]:
config_file = os.getenv("CONFIG_PATH")

In [None]:
config = configparser.ConfigParser()
# config.read() does not raise for a missing file; it returns the list of files it parsed
if not config_file or not config.read(config_file):
    print("config.ini file not found")

The JDBC driver and the Snowflake Spark Connector can be downloaded [here](https://search.maven.org/search?q=g:net.snowflake).

In [None]:
sf_jdbc_driver = config['snowflake']['jdbc_driver_path']
sf_spark_driver = config['snowflake']['spark_driver_path']

In [None]:
sf_account = config['snowflake']['account']
sf_user = config['snowflake']['username']
sf_database = config['snowflake']['database']
sf_schema = config['snowflake']['schema']
sf_role = config['snowflake']['role']
sf_warehouse = config['snowflake']['warehouse']
sf_authenticator = config['snowflake']['authenticator']

In [None]:
spark = (
    SparkSession.builder.master("local[*]")
    .appName("Snowflake_JDBC")
    .config("spark.jars", f"{sf_jdbc_driver},{sf_spark_driver}")
    .getOrCreate()
)

In [None]:
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

In [None]:
# Snowflake connection parameters
sfparams = {
    "sfURL" : f"{sf_account}.snowflakecomputing.com",
    "sfUser" : sf_user,
    "sfPassword" : "your_password",  # Not applicable when using externalbrowser authenticator
    "sfDatabase" : sf_database,
    "sfSchema" : sf_schema,
    "sfRole" : sf_role,
    "sfWarehouse" : sf_warehouse,
    "sfAuthenticator" : sf_authenticator
}

In [None]:
query = "SELECT CURRENT_DATE as my_date"

Read the result of a Snowflake query into a PySpark DataFrame:

In [None]:
# Run a custom query
df = (
    spark.read.format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfparams)
    .option("query", query)
    .load()
)

In [None]:
df.show()

Write a DataFrame to a Snowflake table:

In [None]:
(df
 .select("my_date").write.format(SNOWFLAKE_SOURCE_NAME)
 .options(**sfparams)
 .option("dbtable", "my_table")
 .mode("overwrite")
 .save()
)

In [None]:
spark.stop()

#### Practical Scenarios

Let's say I have 10 CSV files, each containing the prices of one stock symbol.  I want to create a PySpark DataFrame for each CSV file, add a new column to each DataFrame holding the name of the CSV file without the file extension (".csv"), which happens to be the stock symbol, and then concatenate the DataFrames into a single DataFrame.  How do I do this with PySpark?

In [17]:
from pyspark.sql import SparkSession
from functools import reduce
from pathlib import Path
from pyspark.sql.functions import lit

# Initialize a SparkSession
spark = SparkSession.builder.appName("ConcatenateCSVFiles").getOrCreate()

# Directory containing the CSV files
csv_directory = Path("stocks/")

# Get a list of CSV file paths in the directory
csv_files = [file for file in csv_directory.iterdir() if file.suffix == ".csv"]

# Create an empty list to store DataFrames
dataframes = []

# Iterate through the CSV files
for csv_file in csv_files:
    # Load the CSV file into a DataFrame
    df = spark.read.csv(str(csv_file), header=True, inferSchema=True)
    
    # Extract the file name without the extension
    file_name = csv_file.stem
    
    # Add a new column with the file name
    df = df.withColumn("Stock Symbol", lit(file_name))
    
    # Append the DataFrame to the list
    dataframes.append(df)

# Concatenate the DataFrames into a single DataFrame
concatenated_df = reduce(lambda df1, df2: df1.unionAll(df2), dataframes)

# Show the result
concatenated_df.show()

# Save as CSV: coalesce(1) yields a single part file, but Spark still writes it inside a directory named concatenated_data.csv
concatenated_df.coalesce(1).write.mode('overwrite').option("header", "true").csv("stocks/concatenated_data.csv")

# Alternatively, collect to pandas (the data must fit in driver memory) to write one plain CSV file
pandas_df = concatenated_df.toPandas()
pandas_df.to_csv('stocks/pandas_df.csv', index=False)

# Stop the SparkSession when done
spark.stop()

+----------+------------------+------------------+------------------+------------------+-------------------+---------+------------+
|      Date|              Open|              High|               Low|             Close|          Adj Close|   Volume|Stock Symbol|
+----------+------------------+------------------+------------------+------------------+-------------------+---------+------------+
|1980-12-12|0.5133928656578064|          0.515625|0.5133928656578064|0.5133928656578064|0.40678155422210693|117258400|        AAPL|
|1980-12-15|0.4888392984867096|0.4888392984867096|0.4866071343421936|0.4866071343421936|  0.385558158159256| 43971200|        AAPL|
|1980-12-16|          0.453125|          0.453125|0.4508928656578064|0.4508928656578064| 0.3572602868080139| 26432000|        AAPL|
|1980-12-17|0.4620535671710968|0.4642857015132904|0.4620535671710968|0.4620535671710968| 0.3661033511161804| 21610400|        AAPL|
|1980-12-18|0.4754464328289032|0.4776785671710968|0.4754464328289032|0.47544