# Working with Different Types of Data

In this lab, we look at working with a variety of different kinds of data in Spark including:

    - Booleans
    - Numbers
    - Strings
    - Dates and timestamps
    - Handling null
    - Complex types
    - User-defined functions

In [79]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataTypesLab") \
    .getOrCreate()

spark

In [80]:
## Loading the data required for this lab:

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/Users/satkarkarki/spark_the_definitive_guide/data/retail-data/by-day/2010-12-01.csv")
df.printSchema()
df.createOrReplaceTempView("dfTable")

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



## Converting to Spark Types

    - The `lit` function converts a value from the host language (here, Python) into its corresponding Spark representation.
    - This concept will be used extensively throughout the lab.
   

In [81]:
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

## Working with Booleans

    - Booleans are the foundation for all filtering.
    - Boolean statements are built from four elements: `and`, `or`, `true`, and `false`.

In [82]:
# In Spark, it is a good practice to filter first then select because it optimizes our query

from pyspark.sql.functions import col
df.where(col("InvoiceNo") != 536365) \
    .select("InvoiceNo", "Description") \
    .show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows


In [83]:
# another cleaner option to pass the filter would be as follows:
df.where("InvoiceNo = 536365") \
    .show(5, False)

+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+-------------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |2010-12-01 08:26:00|2.55     |17850.0   |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |2010-12-01 08:26:00|2.75     |17850.0   |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |2010-12-01 08:26:00|3.39     |17850.0   |United Kingdom|
+---------+-----

In [84]:
# the same style of string filter, this time with the SQL not-equal operator (<>):
df.where("InvoiceNo <> 536365") \
    .show(5, False)

+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                  |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------+--------+-------------------+---------+----------+--------------+
|536366   |22633    |HAND WARMER UNION JACK       |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536366   |22632    |HAND WARMER RED POLKA DOT    |6       |2010-12-01 08:28:00|1.85     |17850.0   |United Kingdom|
|536367   |84879    |ASSORTED COLOUR BIRD ORNAMENT|32      |2010-12-01 08:34:00|1.69     |13047.0   |United Kingdom|
|536367   |22745    |POPPY'S PLAYHOUSE BEDROOM    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
|536367   |22748    |POPPY'S PLAYHOUSE KITCHEN    |6       |2010-12-01 08:34:00|2.1      |13047.0   |United Kingdom|
+---------+---------+-----------------------------+--------+----

In [85]:
# chaining filter expressions to apply multiple filters:
from pyspark.sql.functions import instr
priceFilter = col("UnitPrice") > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1
df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()

## In SQL:
## SELECT * 
## FROM dfTable 
## WHERE StockCode in ("DOT") AND (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      NULL|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      NULL|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
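The `>= 1` comparison works because `instr` returns a 1-based position and 0 when the substring is absent. This also explains why the 569.77 row survives: it fails the price test but passes the description test. A minimal plain-Python sketch of that logic follows; `py_instr` and `keep_row` are hypothetical helper names, the rows are copied by hand from the outputs above, and the sketch does not model Spark's null handling.

```python
def py_instr(text, substring):
    """Return the 1-based position of substring in text, or 0 if it is absent."""
    return text.find(substring) + 1


# A few rows copied by hand from the outputs above (illustration only).
rows = [
    {"StockCode": "DOT", "Description": "DOTCOM POSTAGE", "UnitPrice": 569.77},
    {"StockCode": "DOT", "Description": "DOTCOM POSTAGE", "UnitPrice": 607.49},
    {"StockCode": "22633", "Description": "HAND WARMER UNION JACK", "UnitPrice": 1.85},
]


def keep_row(row):
    """Mirror StockCode in ("DOT") AND (UnitPrice > 600 OR instr(...) >= 1)."""
    price_filter = row["UnitPrice"] > 600
    descrip_filter = py_instr(row["Description"], "POSTAGE") >= 1
    return row["StockCode"] == "DOT" and (price_filter or descrip_filter)


kept = [row for row in rows if keep_row(row)]
print([row["UnitPrice"] for row in kept])  # [569.77, 607.49]
```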



In [86]:
## To filter a DataFrame, the following example shows how we can specify a Boolean column:

from pyspark.sql.functions import instr

DOTCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1
df.withColumn("isExpensive", DOTCodeFilter & (priceFilter | descripFilter)) \
    .where("isExpensive") \
    .select("unitPrice", "isExpensive").show(5)

## In SQL:
## SELECT UnitPrice, (StockCode = 'DOT' AND
##    (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)) AS isExpensive
## FROM dfTable
## WHERE (StockCode = 'DOT' AND
##          (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1))


+---------+-----------+
|unitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In [87]:
from pyspark.sql.functions import expr

df.withColumn("isExpensive", expr("NOT UnitPrice <= 250")) \
    .where("isExpensive") \
    .select("Description", "UnitPrice").show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



## Working with Numbers

- The second most common task after filtering in big data is counting things.
- We do rounding
- We compute correlation and then summary statistics



In [88]:
## Uses functional expressions, everything is Pythonic

from pyspark.sql.functions import expr, pow

fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows


In [89]:
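## The same computation written as a SQL expression string, which is convenient for one-off computations.
## The fabricatedQuantity column expression from the previous cell, by contrast, can be reused later.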

df.selectExpr(
    "CustomerId",
    "(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)

## In SQL
## SELECT CustomerID, (POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity
## FROM dfTable

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows


In [90]:
## Another common numeric function is rounding:

from pyspark.sql.functions import lit, round, bround

df.select(round(lit("7.5")), bround(lit("3.5"))).show(5)

## In SQL:
## SELECT round(7.5), bround(3.5)

+-------------+--------------+
|round(7.5, 0)|bround(3.5, 0)|
+-------------+--------------+
|          8.0|           4.0|
|          8.0|           4.0|
|          8.0|           4.0|
|          8.0|           4.0|
|          8.0|           4.0|
+-------------+--------------+
only showing top 5 rows
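The two inputs above do not reveal how `round` and `bround` differ: `round` rounds ties away from zero (half up), while `bround` rounds ties to the nearest even number (half even), so they only disagree on values such as 2.5. A plain-Python sketch of the two rounding modes using the standard `decimal` module; `half_up` and `half_even` are hypothetical helper names, not Spark functions.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def half_up(value):
    """Round to a whole number, ties away from zero (the mode round uses)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def half_even(value):
    """Round to a whole number, ties to the nearest even digit (the mode bround uses)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


for text in ("2.5", "3.5", "7.5"):
    print(text, half_up(text), half_even(text))
# 2.5 3 2
# 3.5 4 4
# 7.5 8 8
```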


In [91]:
## Another common numeric function is correlation:

from pyspark.sql.functions import corr
df.stat.corr("Quantity", "UnitPrice")
df.select(corr("Quantity", "UnitPrice")).show()

## IN SQL:
## SELECT corr(Quantity, UnitPrice) FROM dfTable

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|     -0.04112314436835551|
+-------------------------+
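`corr` computes the Pearson correlation coefficient. As a rough illustration of what that number means, here is a plain-Python sketch of the formula on small made-up lists; `py_pearson` is a hypothetical helper name, and the sketch assumes both inputs have the same length and non-zero variance.

```python
from math import sqrt


def py_pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: a perfectly increasing and a perfectly decreasing relationship.
print(py_pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(py_pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```

A value near 0, like the -0.04 above, suggests almost no linear relationship between Quantity and UnitPrice.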



In [92]:
## Another common task is computing the summary statistics:

df.describe().show()

+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|summary|        InvoiceNo|         StockCode|         Description|          Quantity|         UnitPrice|        CustomerID|       Country|
+-------+-----------------+------------------+--------------------+------------------+------------------+------------------+--------------+
|  count|             3108|              3108|                3098|              3108|              3108|              1968|          3108|
|   mean| 536516.684944841|27834.304044117645|                NULL| 8.627413127413128| 4.151946589446603|15661.388719512195|          NULL|
| stddev|72.89447869788873|17407.897548583845|                NULL|26.371821677029203|15.638659854603892|1854.4496996893627|          NULL|
|    min|           536365|             10002| 4 PURPLE FLOCK D...|               -24|               0.0|           12431.0|     Australia|
|    max|          C

## Working with Strings



In [93]:
# the initcap function capitalizes the first letter of every word in a string, where words are separated by spaces
from pyspark.sql.functions import initcap
df.select(initcap(col("Description"))).show(5)

## IN SQL:
## SELECT initcap(Description) FROM dfTable

+--------------------+
|initcap(Description)|
+--------------------+
|White Hanging Hea...|
| White Metal Lantern|
|Cream Cupid Heart...|
|Knitted Union Fla...|
|Red Woolly Hottie...|
+--------------------+
only showing top 5 rows


In [94]:
# converting strings to lowercase and uppercase
from pyspark.sql.functions import lower, upper
df.select(col("Description"),
          lower(col("Description")),
          upper(lower(col("Description")))).show(2)

## IN SQL:
## SELECT Description, lower(Description), Upper(lower(Description)) FROM dfTable

+--------------------+--------------------+-------------------------+
|         Description|  lower(Description)|upper(lower(Description))|
+--------------------+--------------------+-------------------------+
|WHITE HANGING HEA...|white hanging hea...|     WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern|      WHITE METAL LANTERN|
+--------------------+--------------------+-------------------------+
only showing top 2 rows


#### Another simple task is adding or removing spaces around a string.
#### We can do this by using lpad, ltrim, rpad, rtrim, and trim

In [95]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(
    ltrim(lit(" HELLO ")).alias("ltrim"),
    rtrim(lit(" HELLO ")).alias("rtrim"),
    trim(lit(" HELLO ")).alias("trim"),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")).show(2)

# -- in SQL
# SELECT
#     ltrim(' HELLLOOOO '),
#     rtrim(' HELLLOOOO '),
#     trim(' HELLLOOOO '),
#     lpad('HELLOOOO ', 3, ' '),
#     rpad('HELLOOOO ', 10, ' ')
# FROM dfTable

+------+------+-----+---+----------+
| ltrim| rtrim| trim| lp|        rp|
+------+------+-----+---+----------+
|HELLO | HELLO|HELLO|HEL|HELLO     |
|HELLO | HELLO|HELLO|HEL|HELLO     |
+------+------+-----+---+----------+
only showing top 2 rows
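Note the `lp` column: when the target length is shorter than the string, `lpad` and `rpad` truncate instead of padding. A plain-Python sketch of that behavior; `py_lpad` and `py_rpad` are hypothetical helper names, and the sketch assumes a non-empty pad string.

```python
def py_lpad(text, length, pad):
    """Left-pad text to length with pad, or truncate it if it is already longer."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def py_rpad(text, length, pad):
    """Right-pad text to length with pad, or truncate it if it is already longer."""
    if len(text) >= length:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


print(repr(py_lpad("HELLO", 3, " ")))   # 'HEL'
print(repr(py_rpad("HELLO", 10, " ")))  # 'HELLO     '
```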


#### Regular Expressions

- In Spark, there are two key functions to perform regular expressions: regexp_extract and regexp_replace

In [96]:
# Replace color names with just "COLOR":

from pyspark.sql.functions import regexp_replace
regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
df.select(
    regexp_replace(col("Description"), regex_string, "COLOR").alias("color_clean"),
    col("Description")).show(2)

# -- in SQL
# SELECT
#    regexp_replace(Description, 'BLACK|WHITE|RED|GREEN|BLUE', 'COLOR') as
#    color_clean, Description
# FROM dfTable

+--------------------+--------------------+
|         color_clean|         Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
only showing top 2 rows
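The cell above covers `regexp_replace`; `regexp_extract` instead returns the text matched by a capture group, or an empty string when nothing matches. A plain-Python sketch of both ideas with the standard `re` module; `replace_colors` and `first_color` are hypothetical helper names, and Python's regex dialect is only an approximation of the Java regex syntax Spark uses.

```python
import re

COLOR_PATTERN = "BLACK|WHITE|RED|GREEN|BLUE"


def replace_colors(description):
    """Analogue of regexp_replace: swap every color name for COLOR."""
    return re.sub(COLOR_PATTERN, "COLOR", description)


def first_color(description):
    """Analogue of regexp_extract with group 1: first color found, else ''."""
    match = re.search("(" + COLOR_PATTERN + ")", description)
    return match.group(1) if match else ""


print(replace_colors("WHITE METAL LANTERN"))         # COLOR METAL LANTERN
print(first_color("RED WOOLLY HOTTIE WHITE HEART."))  # RED
print(repr(first_color("HAND WARMER UNION JACK")))    # ''
```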


In [97]:
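# Spark's translate function replaces characters one for one: each character in the matching string
# is swapped for the character at the same position in the replacement string (here L->1, E->3, T->7):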

from pyspark.sql.functions import translate
df.select(translate(col("Description"), "LEET", "1337" ), col("Description")) \
    .show(2)

## IN SQL:
## SELECT translate(Description, 'LEET', '1337')

+----------------------------------+--------------------+
|translate(Description, LEET, 1337)|         Description|
+----------------------------------+--------------------+
|              WHI73 HANGING H3A...|WHITE HANGING HEA...|
|               WHI73 M37A1 1AN73RN| WHITE METAL LANTERN|
+----------------------------------+--------------------+
only showing top 2 rows
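Python strings offer a close analogue of this character-by-character mapping, which may help when reasoning about the output. A minimal sketch using `str.maketrans`; it assumes both strings have the same length, as they do here.

```python
# Build a character mapping: L -> 1, E -> 3, T -> 7 (E appears twice, both map to 3).
leet_table = str.maketrans("LEET", "1337")

translated = "WHITE METAL LANTERN".translate(leet_table)
print(translated)  # WHI73 M37A1 1AN73RN
```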


In [98]:
## We can use the instr function to check for the existence of certain string values rather than extract them:
from pyspark.sql.functions import instr
containsBlack = instr(col("Description"), "BLACK") >= 1
containsWhite = instr(col("Description"), "WHITE") >= 1
df.withColumn("hasSimpleColor", containsBlack | containsWhite) \
    .where("hasSimpleColor") \
    .select("Description").show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows


## Working with Dates and Timestamps

**Spark** handles two time types:

- DateType --> only stores calendar date (e.g. 2025-07-14)
- TimestampType --> stores both date and exact time (e.g. 2025-07-14 13:22:00)

**Spark** tries to auto-detect dates using **inferSchema=True** (this works well **if the format is standard**).

**Print schema often** using *df.printSchema()* to stay on track

In [99]:
# This code creates a DataFrame with 10 rows and a single id column holding values from 0 to 9.
# It adds two columns with the current calendar date and the exact current timestamp, respectively.
# It then registers the DataFrame as a temp view and prints the schema:

from pyspark.sql.functions import current_date, current_timestamp
dateDF = spark.range(10) \
    .withColumn("today", current_date()) \
    .withColumn("now", current_timestamp()) 

dateDF.createOrReplaceTempView("dataTable")

dateDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)



In [100]:
dateDF.show(7)

+---+----------+--------------------+
| id|     today|                 now|
+---+----------+--------------------+
|  0|2025-07-15|2025-07-15 18:17:...|
|  1|2025-07-15|2025-07-15 18:17:...|
|  2|2025-07-15|2025-07-15 18:17:...|
|  3|2025-07-15|2025-07-15 18:17:...|
|  4|2025-07-15|2025-07-15 18:17:...|
|  5|2025-07-15|2025-07-15 18:17:...|
|  6|2025-07-15|2025-07-15 18:17:...|
+---+----------+--------------------+
only showing top 7 rows


#### Now let's add and subtract five days from today:

In [101]:
from pyspark.sql.functions import date_add, date_sub
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2025-07-10|        2025-07-20|
+------------------+------------------+
only showing top 1 row


#### Now let's calculate the number of days between two dates with the datediff() function, and the number of months with months_between():

In [102]:
from pyspark.sql.functions import datediff, months_between, to_date

print("Date Difference in Days")
dateDF.withColumn("week_ago", date_sub(col("today"), 7)) \
    .select(datediff(col("week_ago"), col("today"))).show(1)

print("Date Difference in Months")
dateDF.select(
    to_date(lit("2016-01-01")).alias("start"),
    to_date(lit("2017-05-22")).alias("end")) \
    .select(months_between(col("start"), col("end"))).show(1)

Date Difference in Days
+-------------------------+
|datediff(week_ago, today)|
+-------------------------+
|                       -7|
+-------------------------+
only showing top 1 row
Date Difference in Months
+--------------------------------+
|months_between(start, end, true)|
+--------------------------------+
|                    -16.67741935|
+--------------------------------+
only showing top 1 row
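Both results are negative because the first argument is the earlier date. The fractional part of `months_between` comes from dividing the day difference by 31. A plain-Python sketch that reproduces both numbers for pure dates; `py_datediff` and `py_months_between` are hypothetical helper names, and the sketch ignores time-of-day components.

```python
import calendar
from datetime import date


def py_datediff(end, start):
    """Number of days from start to end (negative if end is earlier)."""
    return (end - start).days


def py_months_between(date1, date2):
    """Months from date2 to date1, using a 31-day month for the fractional part."""
    months = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    last1 = date1.day == calendar.monthrange(date1.year, date1.month)[1]
    last2 = date2.day == calendar.monthrange(date2.year, date2.month)[1]
    # Same day of month, or both on the last day of their month: whole months.
    if date1.day == date2.day or (last1 and last2):
        return float(months)
    return round(months + (date1.day - date2.day) / 31, 8)


print(py_datediff(date(2025, 7, 8), date(2025, 7, 15)))         # -7
print(py_months_between(date(2016, 1, 1), date(2017, 5, 22)))   # -16.67741935
```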


## Working with Nulls in Data
- In Spark, null is the **preferred way to represent missing or empty data** in a DataFrame.
- Spark **optimizes better for** null **values** than for placeholders like empty strings "", -999, or "NA".

###### Coalesce

- returns the **first non-null value** from the list of columns **for each row**

In [103]:
from pyspark.sql.functions import col, coalesce

df.select(
    coalesce(col("Description").cast("string"), col("CustomerId").cast("string")).alias("fallback")
).show()


+--------------------+
|            fallback|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
|CREAM CUPID HEART...|
|KNITTED UNION FLA...|
|RED WOOLLY HOTTIE...|
|SET 7 BABUSHKA NE...|
|GLASS STAR FROSTE...|
|HAND WARMER UNION...|
|HAND WARMER RED P...|
|ASSORTED COLOUR B...|
|POPPY'S PLAYHOUSE...|
|POPPY'S PLAYHOUSE...|
|FELTCRAFT PRINCES...|
|IVORY KNITTED MUG...|
|BOX OF 6 ASSORTED...|
|BOX OF VINTAGE JI...|
|BOX OF VINTAGE AL...|
|HOME BUILDING BLO...|
|LOVE BUILDING BLO...|
|RECIPE BOX WITH M...|
+--------------------+
only showing top 20 rows
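Every row shown above has a Description, so the fallback to CustomerId never kicks in. A plain-Python sketch of the per-row rule, with `None` standing in for null; `py_coalesce` is a hypothetical helper name.

```python
def py_coalesce(*values):
    """Return the first value that is not None, or None if all of them are."""
    return next((value for value in values if value is not None), None)


print(py_coalesce("WHITE METAL LANTERN", "17850.0"))  # WHITE METAL LANTERN
print(py_coalesce(None, "17850.0"))                   # 17850.0
print(py_coalesce(None, None))                        # None
```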


##### ifnull, nullif, nvl, and nvl2 are SQL functions that can also be used to handle nulls
##### Another basic way of dealing with null values is to drop the rows that contain them using na.drop()

In [104]:
# "all" drops a row only if every value in it is null; "any" (the default) drops a row if any value is null:
df.na.drop("all")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [105]:
# we can also restrict the check to certain columns; here a row is dropped only if both StockCode and InvoiceNo are null:
df.na.drop("all", subset=["StockCode", "InvoiceNo"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
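The difference between the `"any"` and `"all"` modes of `na.drop` is easy to mix up. A plain-Python sketch of the rule on a few made-up rows, with `None` standing in for null; `py_drop_nulls` is a hypothetical helper name, not a Spark API.

```python
def py_drop_nulls(rows, how="any", subset=None):
    """Keep rows unless any (how='any') or all (how='all') checked values are None."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        is_null = [row[column] is None for column in columns]
        drop = any(is_null) if how == "any" else all(is_null)
        if not drop:
            kept.append(row)
    return kept


# Made-up rows for illustration.
rows = [
    {"StockCode": "A", "InvoiceNo": None},
    {"StockCode": None, "InvoiceNo": None},
    {"StockCode": "B", "InvoiceNo": "1"},
]

print(len(py_drop_nulls(rows, "any")))  # 1: only the fully populated row survives
print(len(py_drop_nulls(rows, "all", subset=["StockCode", "InvoiceNo"])))  # 2
```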

##### The fill function replaces nulls with a given value; a string value is applied to string columns only:

In [106]:
df.na.fill("All Null values become this string")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [107]:
## We can also pass a dict that maps column names to fill values:
fill_cols_vals = {"StockCode": 5, "Description": "No Value"}
df.na.fill(fill_cols_vals)

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

##### replace() is another option besides drop and fill: it swaps specific values in a column for others. Here, empty strings in Description become "UNKNOWN":

In [108]:
df.na.replace([""],["UNKNOWN"],"Description")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

## Working with Complex Types

### Structs
- Structs can be thought of as DataFrames within DataFrames.
- A struct groups multiple columns together into a single nested field.

In [109]:
df.selectExpr("(Description, InvoiceNo) as complex", "*")

df.selectExpr("struct(Description, InvoiceNo) as complex", "*")

DataFrame[complex: struct<Description:string,InvoiceNo:string>, InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In [110]:
from pyspark.sql.functions import struct
complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))
complexDF.createOrReplaceTempView("complexDF")

In [111]:
complexDF.show()

+--------------------+
|             complex|
+--------------------+
|{WHITE HANGING HE...|
|{WHITE METAL LANT...|
|{CREAM CUPID HEAR...|
|{KNITTED UNION FL...|
|{RED WOOLLY HOTTI...|
|{SET 7 BABUSHKA N...|
|{GLASS STAR FROST...|
|{HAND WARMER UNIO...|
|{HAND WARMER RED ...|
|{ASSORTED COLOUR ...|
|{POPPY'S PLAYHOUS...|
|{POPPY'S PLAYHOUS...|
|{FELTCRAFT PRINCE...|
|{IVORY KNITTED MU...|
|{BOX OF 6 ASSORTE...|
|{BOX OF VINTAGE J...|
|{BOX OF VINTAGE A...|
|{HOME BUILDING BL...|
|{LOVE BUILDING BL...|
|{RECIPE BOX WITH ...|
+--------------------+
only showing top 20 rows


In [112]:
complexDF.select("complex.Description")
complexDF.select(col("complex").getField("Description"))

DataFrame[complex.Description: string]

In [113]:
complexDF.select("complex.*")
# -- in SQL
# SELECT complex.* FROM complexDF

DataFrame[Description: string, InvoiceNo: string]

### Arrays

- In Spark, an **array** is a **complex column type** that can hold multiple values inside one cell.
- You can think of it as a list in Python.

Lab Objective: Take each sentence (product description) in the Description column and **split it into words**, turning it into an **array of strings**.

In [114]:
# split is used to break a string into an array using a delimiter:
from pyspark.sql.functions import split 

df.select(split(col("Description"), " ")).show(2)

## In SQL:
## SELECT split(Description, ' ') FROM dfTable

+-------------------------+
|split(Description,  , -1)|
+-------------------------+
|     [WHITE, HANGING, ...|
|     [WHITE, METAL, LA...|
+-------------------------+
only showing top 2 rows


In [115]:
## We can also query the values of the array using Python-like syntax which makes it more powerful:
## we are only selecting the first string off the array:
df.select(split(col("Description"), " ").alias("array_col")) \
    .selectExpr("array_col[0]").show(2)


+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows


### Array Length:

We can determine the array’s length by querying for its size:

In [116]:
from pyspark.sql.functions import size
df.select(size(split(col("Description"), " "))).show(2) 

+-------------------------------+
|size(split(Description,  , -1))|
+-------------------------------+
|                              5|
|                              3|
+-------------------------------+
only showing top 2 rows


In [117]:
## We can also see whether this array contains a value:

from pyspark.sql.functions import array_contains
df.select(array_contains(split(col("Description"), " "), "WHITE")).show(2)

## IN SQL:
## SELECT array_contains(split(Description, ' '), 'WHITE') FROM dfTable

+------------------------------------------------+
|array_contains(split(Description,  , -1), WHITE)|
+------------------------------------------------+
|                                            true|
|                                            true|
+------------------------------------------------+
only showing top 2 rows


#### explode() --> takes an array column and turns each element into a new row

- Useful in tokenizing text into words
- Breaking down multi-valued columns (easier to filter/count)
- Flattening nested structures

In [118]:
from pyspark.sql.functions import split, explode

df.withColumn("splitted", split(col("Description"), " ")) \
    .withColumn("exploded", explode(col("splitted"))) \
    .select("Description", "InvoiceNo", "exploded").show(10)

## IN SQL:
## SELECT Description, InvoiceNo, exploded
## FROM (SELECT *, split(Description, " ") as splitted FROM dfTable)
## LATERAL VIEW explode(splitted) as exploded

+--------------------+---------+--------+
|         Description|InvoiceNo|exploded|
+--------------------+---------+--------+
|WHITE HANGING HEA...|   536365|   WHITE|
|WHITE HANGING HEA...|   536365| HANGING|
|WHITE HANGING HEA...|   536365|   HEART|
|WHITE HANGING HEA...|   536365| T-LIGHT|
|WHITE HANGING HEA...|   536365|  HOLDER|
| WHITE METAL LANTERN|   536365|   WHITE|
| WHITE METAL LANTERN|   536365|   METAL|
| WHITE METAL LANTERN|   536365| LANTERN|
|CREAM CUPID HEART...|   536365|   CREAM|
|CREAM CUPID HEART...|   536365|   CUPID|
+--------------------+---------+--------+
only showing top 10 rows
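The split-then-explode pattern above is essentially a nested loop: one output row per word, with the other columns repeated. A plain-Python sketch on two hand-copied rows; note that `explode` produces no output rows for a null array, which the sketch mimics by skipping `None` descriptions.

```python
# (Description, InvoiceNo) pairs copied by hand; the last one is a made-up null case.
rows = [
    ("WHITE METAL LANTERN", "536365"),
    ("CREAM CUPID HEARTS COAT HANGER", "536365"),
    (None, "536366"),
]

exploded = [
    (description, invoice_no, word)
    for description, invoice_no in rows
    if description is not None
    for word in description.split(" ")
]

for row in exploded[:4]:
    print(row)
```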


#### Map() in Spark:

- A **map** is a **collection of key-value pairs**, like a Python dictionary.
- In Spark, we can use maps to store **column values as keys and values**, and query them later.
- Store related fields in one column and enable fast lookup by key.
- Explode them for flat transformations per use case

In [119]:
# Coupling Description and Invoice No. into a map

from pyspark.sql.functions import create_map
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")) \
    .show(5, truncate=False)

+-----------------------------------------------+
|complex_map                                    |
+-----------------------------------------------+
|{WHITE HANGING HEART T-LIGHT HOLDER -> 536365} |
|{WHITE METAL LANTERN -> 536365}                |
|{CREAM CUPID HEARTS COAT HANGER -> 536365}     |
|{KNITTED UNION FLAG HOT WATER BOTTLE -> 536365}|
|{RED WOOLLY HOTTIE WHITE HEART. -> 536365}     |
+-----------------------------------------------+
only showing top 5 rows


In [120]:
## Querying a map using proper key. A missing key returns null:

df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")) \
    .selectExpr("complex_map['WHITE METAL LANTERN']").show(2)

+--------------------------------+
|complex_map[WHITE METAL LANTERN]|
+--------------------------------+
|                            NULL|
|                          536365|
+--------------------------------+
only showing top 2 rows


In [121]:
## We can also explode map types, which turns each key-value pair into a row with key and value columns:

df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")) \
    .selectExpr("explode(complex_map)").show(5, truncate = False)

+-----------------------------------+------+
|key                                |value |
+-----------------------------------+------+
|WHITE HANGING HEART T-LIGHT HOLDER |536365|
|WHITE METAL LANTERN                |536365|
|CREAM CUPID HEARTS COAT HANGER     |536365|
|KNITTED UNION FLAG HOT WATER BOTTLE|536365|
|RED WOOLLY HOTTIE WHITE HEART.     |536365|
+-----------------------------------+------+
only showing top 5 rows
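Since each row holds its own one-entry map, the lookup and explode results above can be pictured with ordinary Python dictionaries. A minimal sketch, with one `dict` per row copied by hand from the output; this is an analogy only, not how Spark stores maps.

```python
# One map per row, as produced by create_map(Description, InvoiceNo).
row_maps = [
    {"WHITE HANGING HEART T-LIGHT HOLDER": "536365"},
    {"WHITE METAL LANTERN": "536365"},
]

# Lookup by key: a row whose map lacks the key yields None (null in Spark).
lookups = [row_map.get("WHITE METAL LANTERN") for row_map in row_maps]
print(lookups)  # [None, '536365']

# Explode: every key-value pair becomes its own (key, value) row.
exploded = [(key, value) for row_map in row_maps for key, value in row_map.items()]
print(exploded)
```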


## Working with JSON
- **JSON**: JavaScript Object Notation (presents data as **key-value pairs** and **ordered lists**).
- Spark offers unique support for working with JSON data.

In [122]:
# Let's begin by creating a JSON column:

jsonDF = spark.range(1).selectExpr("""
'{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString""")

In [123]:
jsonDF.show(truncate = False)

+-------------------------------------------+
|jsonString                                 |
+-------------------------------------------+
|{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}|
+-------------------------------------------+



- We can use *get_json_object* to query a JSON object inline, be it a dictionary or an array.
- We can use *json_tuple* if the object has only one level of nesting.


In [124]:
from pyspark.sql.functions import get_json_object, json_tuple
jsonDF.select(
    get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"),
    json_tuple(col("jsonString"), "myJSONKey")).show(2, truncate=False)

+------+-----------------------+
|column|c0                     |
+------+-----------------------+
|2     |{"myJSONValue":[1,2,3]}|
+------+-----------------------+
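The path `$.myJSONKey.myJSONValue[1]` walks into the object key by key and then takes the element at index 1 (zero-based). A plain-Python sketch of the same two queries with the standard `json` module; note that Spark returns both results as strings, whereas Python gives back native values.

```python
import json

json_string = '{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}'
parsed = json.loads(json_string)

# Analogue of get_json_object(..., "$.myJSONKey.myJSONValue[1]")
second_value = parsed["myJSONKey"]["myJSONValue"][1]
print(second_value)  # 2

# Analogue of json_tuple(..., "myJSONKey"): the nested object, re-serialized compactly
nested = json.dumps(parsed["myJSONKey"], separators=(",", ":"))
print(nested)  # {"myJSONValue":[1,2,3]}
```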



In [125]:
## We can also turn a StructType into a JSON string by using the to_json function:

from pyspark.sql.functions import to_json
df.selectExpr("(InvoiceNo, Description) as myStruct") \
    .select(to_json(col("MyStruct")))

DataFrame[to_json(MyStruct): string]

##### This code converts two columns (InvoiceNo, Description) into a JSON string, and then parses that JSON back into a struct using a defined schema. This is often used when:

- Storing structured data as JSON (e.g. for APIs, logs, or files)
- Reading/parsing nested JSON from strings in a DataFrame
- Testing or validating schema transformations

In [126]:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType
parseSchema = StructType((
    StructField("InvoiceNo",StringType(),True),
    StructField("Description",StringType(),True)))
df.selectExpr("(InvoiceNo, Description) as myStruct") \
    .select(to_json(col("myStruct")).alias("newJSON")) \
    .select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2, truncate=False)

+--------------------------------------------+-------------------------------------------------------------------------+
|from_json(newJSON)                          |newJSON                                                                  |
+--------------------------------------------+-------------------------------------------------------------------------+
|{536365, WHITE HANGING HEART T-LIGHT HOLDER}|{"InvoiceNo":"536365","Description":"WHITE HANGING HEART T-LIGHT HOLDER"}|
|{536365, WHITE METAL LANTERN}               |{"InvoiceNo":"536365","Description":"WHITE METAL LANTERN"}               |
+--------------------------------------------+-------------------------------------------------------------------------+
only showing top 2 rows


## User-Defined Functions (UDF):

- Spark allows you to define your own functions, which makes it possible to write custom transformations.
- You can write them in several programming languages (Python, Scala, Java, R).
- By default, UDFs are temporary functions to be used in that specific SparkSession or Context.

In [127]:
udfExampleDF = spark.range(5).toDF("num")
def power3(double_value):
    return double_value ** 3
power3(2.0)

8.0

##### Notes on UDFs:

- UDFs are slower than native Spark functions because Spark's optimizer cannot see inside them.
- Once a function is defined, Spark serializes it on the driver node (**the main program**) and sends it to the executors.
- If the UDF is not written in a JVM language, Spark serializes all the data from the JVM into a format Python understands, runs the function in a Python process, and returns the results to the JVM.
- **JVM** --> **Python** = **High Overhead**

In [128]:
## First, we register the function:
from pyspark.sql.functions import udf
power3udf = udf(power3)

In [130]:
## Then we can use it in our DataFrame code:
from pyspark.sql.functions import col
udfExampleDF.select(power3udf(col("num"))).show(2)

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
+-----------+
only showing top 2 rows


AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve routine `power3` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]. SQLSTATE: 42883; line 1 pos 0

##### The error above comes from calling power3 inside a SQL expression (selectExpr). Wrapping a function with udf() only makes it usable as a DataFrame function; to call it by name in SQL expressions, it must also be registered with spark.udf.register.