
# Multi-dimensional data frames: using PySpark with JSON data

In [184]:
# Set up
import os
import numpy as np
import json
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
import pyspark.sql.functions as F
import pyspark.sql.types as T


spark = SparkSession.builder.getOrCreate()

## Reading the Data

For this chapter, we use a JSON dump of the information about the TV Show Silicon Valley, from TV Maze.

### JSON params

- No delimiters to configure, unlike CSV
- Less type guessing: JSON values are already typed (strings, numbers, booleans, null)
- Contains __hierarchical data__, unlike CSVs
- Default (single-line) mode: __one JSON document, one line, one record__.
- Multi-line mode (`multiLine=True`): __one JSON document, one FILE, one record__.
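
To make the "one line vs. one file" distinction concrete, here is a minimal sketch that uses only Python's standard `json` module instead of Spark. The sample records and the `parses()` helper are made up for illustration; the point is only that a pretty-printed document cannot be parsed line by line.

```python
import json

# JSON Lines style: every line is a complete JSON document (one record each).
json_lines = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# A single pretty-printed document that spans several lines.
multi_line = json.dumps({"id": 3, "genres": ["Comedy"]}, indent=2)


def parses(text):
    """Return True if `text` is one complete JSON document (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Parsed line by line, the pretty-printed document breaks on its first line ("{").
first_line_ok = parses(multi_line.splitlines()[0])
# Parsed as a whole (what multiLine=True asks for), it is one valid record.
whole_document_ok = parses(multi_line)

print(len(records), first_line_ok, whole_document_ok)
```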

In [185]:
# Import a single JSON document
sv = "data/ch06/shows-silicon-valley.json"
shows = spark.read.json(sv)
display(shows.count())


# Read several files at once (glob pattern); multiLine=True treats each file as one JSON document
three_shows = spark.read.json("data/ch06/shows-*.json", multiLine=True)
display(three_shows.count())

1

3

In [186]:
# Inspect the schema
shows.printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: timestamp (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)
 ... (output truncated)

## Spark's complex column types: `array`, `map` and `struct`

### `array`

- Unlike JSON arrays, PySpark arrays hold values of a single type.
- __PySpark will not raise an error if you try to read an array-type column with multiple types__. Instead, it falls back to the lowest common denominator, usually a string.
- Many array functions are available from `pyspark.sql.functions`

In [187]:
# Selecting the name and genres columns of the shows dataframe

import pyspark.sql.functions as F

array_subset = shows.select("name", "genres")
array_subset.show(1, False)

+--------------+--------+
|name          |genres  |
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



In [188]:
# Multiple methods to extract the same array element

array_subset = array_subset.select(
    "name",
    array_subset.genres[0].alias("dot_and_index"),
    F.col("genres")[0].alias("col_and_index"),
    array_subset.genres.getItem(0).alias("dot_and_method"),
    F.col("genres").getItem(0).alias("col_and_method"),
)

array_subset.show()

+--------------+-------------+-------------+--------------+--------------+
|          name|dot_and_index|col_and_index|dot_and_method|col_and_method|
+--------------+-------------+-------------+--------------+--------------+
|Silicon Valley|       Comedy|       Comedy|        Comedy|        Comedy|
+--------------+-------------+-------------+--------------+--------------+



> WARNING: Although the square bracket approach looks very Pythonic, __you can’t use it as a slicing tool__. PySpark will accept only one integer as an index.

#### Creating an array column

1. Create three literal columns with `lit()`, then combine them with `array()` into an array of possible genres.
2. Use `array_repeat()` to create a column that repeats the "Comedy" string.

In [218]:
# 1. Create three literal (scalar) columns with lit(), then combine them
#    with array() into an array of possible genres.
# 2. Use array_repeat() to create a column repeating the "Comedy" string.

array_subset_repeated = array_subset.select(
    "name",
    F.lit("Comedy").alias("one"),
    F.lit("Horror").alias("two"),
    F.lit("Drama").alias("three"),
    F.col("dot_and_index"),
).select(
    "name",
    F.array("one", "two", "three").alias("Some_Genres"),
    F.array_repeat("dot_and_index", 5).alias("Repeated_Genres"),
)

array_subset_repeated.show(1, False)

+--------------+-----------------------+----------------------------------------+
|name          |Some_Genres            |Repeated_Genres                         |
+--------------+-----------------------+----------------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]|[Comedy, Comedy, Comedy, Comedy, Comedy]|
+--------------+-----------------------+----------------------------------------+



#### Use `F.size` to show the number of elements in an array 

In [190]:
array_subset_repeated.select(
    "name", F.size("Some_Genres"), F.size("Repeated_Genres")
).show()

+--------------+-----------------+---------------------+
|          name|size(Some_Genres)|size(Repeated_Genres)|
+--------------+-----------------+---------------------+
|Silicon Valley|                3|                    5|
+--------------+-----------------+---------------------+



#### Use `F.array_distinct()` to remove duplicates (like SQL)

In [191]:
array_subset_repeated.select(
    "name",
    F.array_distinct("Some_Genres"),
    F.array_distinct("Repeated_Genres")
).show(1, False)

+--------------+---------------------------+-------------------------------+
|name          |array_distinct(Some_Genres)|array_distinct(Repeated_Genres)|
+--------------+---------------------------+-------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]    |[Comedy]                       |
+--------------+---------------------------+-------------------------------+



#### Use `F.array_intersect` to show common values across arrays

In [192]:
array_subset_repeated = array_subset_repeated.select(
    "name", 
    F.array_intersect("Some_Genres", "Repeated_Genres").alias("Genres")
)

array_subset_repeated.show()

+--------------+--------+
|          name|  Genres|
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



#### Use `array_position()` to get the position of the item in an array if it exists

> WARNING: `array_position()` is 1-based, unlike Python lists or element extraction from arrays (e.g. `array_subset.genres[0]` or `getItem(0)`). It returns 0 when the value is not in the array.

In [193]:
# When using array_position(), the first item of the array has position 1, 
# not 0 like in python.
array_subset_repeated.select(
    "name",
    F.array_position("Genres", "Comedy").alias("Genres"),
).show()

+--------------+------+
|          name|Genres|
+--------------+------+
|Silicon Valley|     1|
+--------------+------+



### `map`

- Like a typed Python dictionary: a map holds key-value pairs.
- As with `array`, all keys must share one type and all values must share one type.
- Values can usually be null, but keys can’t (unlike Python, where `None` is a valid dictionary key)

In [194]:
# Creating a map from two arrays: one for the keys, one for the values. 
# This creates a hash-map within the column record.

# 1. Create two columns of arrays
columns = ["name", "language", "type"]
shows_map = shows.select(
    *[F.lit(column) for column in columns],
    F.array(*columns).alias("values")
)
shows_map = shows_map.select(F.array(*columns).alias("keys"), "values")
print("Two columns of arrays")
shows_map.show(1, False)

# 2. Map them together using one array for the keys and the other for the values
shows_map = shows_map.select(
    F.map_from_arrays("keys", "values").alias("mapped")
)
shows_map.printSchema()
print("1 column of map")
shows_map.show(1, False)

# 3. Three ways to select a key in a map column
print("3 ways to select a key in a map")
shows_map.select(
    F.col("mapped.name"),  # dot notation with col
    F.col("mapped")["name"],  # Python dictionary style
    shows_map.mapped["name"]  # dot notation to get the column + bracket
).show()


Two columns of arrays
+----------------------+-----------------------------------+
|keys                  |values                             |
+----------------------+-----------------------------------+
|[name, language, type]|[Silicon Valley, English, Scripted]|
+----------------------+-----------------------------------+

root
 |-- mapped: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

1 column of map
+---------------------------------------------------------------+
|mapped                                                         |
+---------------------------------------------------------------+
|[name -> Silicon Valley, language -> English, type -> Scripted]|
+---------------------------------------------------------------+

3 ways to select a key in a map
+--------------+--------------+--------------+
|          name|  mapped[name]|  mapped[name]|
+--------------+--------------+--------------+
|Silicon Valley|Silicon Valley|Silicon Valley|
+--------------+--------------+--------------+

### `struct`

- Similar to a JSON object: each key is a string, and each value can be of a different type.
- Unlike array & map, __the number of fields and their names are known ahead of time__


![](notes/img/struct.png)

In [195]:
# The "schedule" column contains an array of strings and a string
shows.select("schedule").printSchema()

root
 |-- schedule: struct (nullable = true)
 |    |-- days: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- time: string (nullable = true)



![](notes/img/embedded.png)

In [196]:
# A more complex struct
shows.select("_embedded").printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: timestamp (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)

Above `struct` visualized:
![](notes/img/embedded.png)

In [197]:
# Drop useless _embedded column and promote the fields within
shows_clean = shows.withColumn("episodes", F.col("_embedded.episodes")).drop(
    "_embedded"
)
shows_clean.select("episodes").printSchema()

root
 |-- episodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |-- airdate: string (nullable = true)
 |    |    |-- airstamp: timestamp (nullable = true)
 |    |    |-- airtime: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- image: struct (nullable = true)
 |    |    |    |-- medium: string (nullable = true)
 |    |    |    |-- original: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- number: long (nullable = true)
 |    |    |-- runtime: long (nullable = true)
 |    |    |-- season: long (nullable = true)
 |    |    |-- summary: string (nullable = true)
 |    |    |-- url: string (nullable = true)



#### Using `explode` to split arrays into rows

In [198]:
# "episodes.name" == array of strings
episodes_name = shows_clean.select(F.col("episodes.name"))
episodes_name.printSchema()

# Just showing episodes_name is messy, so explode the array to show the names
episodes_name.select(F.explode("name").alias("name")).show(3, False)

root
 |-- name: array (nullable = true)
 |    |-- element: string (containsNull = true)

+-------------------------+
|name                     |
+-------------------------+
|Minimum Viable Product   |
|The Cap Table            |
|Articles of Incorporation|
+-------------------------+
only showing top 3 rows



## How to define and use a schema with a PySpark data frame

- A schema can be built either 1) programmatically or 2) as a DDL-style string.
- The type objects used to build a schema live in `pyspark.sql.types`, usually imported as `T`.

Two kinds of objects in `pyspark.sql.types`:
1. Type objects: represent a column of a certain type (e.g. `LongType()`, `DecimalType(precision, scale)`, `ArrayType(StringType())`).
2. Field objects: `StructField()` represents one named field; a `StructType()` holds an arbitrary number of them.
  - 2 mandatory params, `name` (str) and `dataType` (type)

Putting it all together:
```
T.StructField("summary", T.StringType())
```

In [199]:
# For reference
shows.select("_embedded").printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: timestamp (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)

### Building the entire schema from scratch

In [200]:
# Full schema from scratch

# episode links
episode_links_schema = T.StructType(
    [T.StructField("self", T.StructType([T.StructField("href", T.StringType())]))]
)

# episode image
episode_image_schema = T.StructType(
    [
        T.StructField("medium", T.StringType()),
        T.StructField("original", T.StringType()),
    ]
)

# episode metadata
episode_schema = T.StructType(
    [
        T.StructField("_links", episode_links_schema),
        T.StructField("airdate", T.DateType()),
        T.StructField("airstamp", T.TimestampType()),
        T.StructField("airtime", T.StringType()),
        T.StructField("id", T.StringType()),
        T.StructField("image", episode_image_schema),
        T.StructField("name", T.StringType()),
        T.StructField("number", T.LongType()),
        T.StructField("runtime", T.LongType()),
        T.StructField("season", T.LongType()),
        T.StructField("summary", T.StringType()),
        T.StructField("url", T.StringType()),
    ]
)

# set top level array
embedded_schema = T.StructType([T.StructField("episodes", T.ArrayType(episode_schema))])

# network
network_schema = T.StructType(
    [
        T.StructField(
            "country",
            T.StructType(
                [
                    T.StructField("code", T.StringType()),
                    T.StructField("name", T.StringType()),
                    T.StructField("timezone", T.StringType()),
                ]
            ),
        ),
        T.StructField("id", T.LongType()),
        T.StructField("name", T.StringType()),
    ]
)

# shows (with embedded_schema and network_schema)
shows_schema = T.StructType(
    [
        T.StructField("_embedded", embedded_schema),
        T.StructField("language", T.StringType()),
        T.StructField("name", T.StringType()),
        T.StructField("network", network_schema),
        T.StructField("officialSite", T.StringType()),
        T.StructField("premiered", T.StringType()),
        T.StructField(
            "rating", T.StructType([T.StructField("average", T.DoubleType())])
        ),
        T.StructField("runtime", T.LongType()),
        T.StructField(
            "schedule",
            T.StructType(
                [
                    T.StructField("days", T.ArrayType(T.StringType())),
                    T.StructField("time", T.StringType()),
                ]
            ),
        ),
        T.StructField("status", T.StringType()),
        T.StructField("summary", T.StringType()),
        T.StructField("type", T.StringType()),
        T.StructField("updated", T.LongType()),
        T.StructField("url", T.StringType()),
        T.StructField("webChannel", T.StringType()),
        T.StructField("weight", T.LongType()),
    ]
)

## Reading JSON with a strict schema

Read the JSON file using the schema that we built up:
- `mode="FAILFAST"` makes the reader throw an error as soon as it meets a record that does not match the provided schema.
- If the data uses a non-standard date or timestamp format, pass the right pattern to `dateFormat` or `timestampFormat`.

> The default for the `mode` parameter is `PERMISSIVE`, which sets malformed records to `null`.

In [201]:
shows_with_schema = spark.read.json("data/ch06/shows-silicon-valley.json",
                                   schema=shows_schema,
                                   mode="FAILFAST")

# Check format for modified columns:
for column in ["airdate", "airstamp"]:
    shows_with_schema.select(f"_embedded.episodes.{column}") \
                     .select(F.explode(column)) \
                     .show(5, False)

+----------+
|col       |
+----------+
|2014-04-06|
|2014-04-13|
|2014-04-20|
|2014-04-27|
|2014-05-04|
+----------+
only showing top 5 rows

+-------------------+
|col                |
+-------------------+
|2014-04-06 22:00:00|
|2014-04-13 22:00:00|
|2014-04-20 22:00:00|
|2014-04-27 22:00:00|
|2014-05-04 22:00:00|
+-------------------+
only showing top 5 rows



Example of `FAILFAST` error due to conflicting schema

In [202]:
from py4j.protocol import Py4JJavaError

shows_schema2 = T.StructType(
    [
        T.StructField("_embedded", embedded_schema),
        T.StructField("language", T.StringType()),
        T.StructField("name", T.StringType()),
        T.StructField("network", network_schema),
        T.StructField("officialSite", T.StringType()),
        T.StructField("premiered", T.StringType()),
        T.StructField(
            "rating", T.StructType([T.StructField("average", T.DoubleType())])
        ),
        T.StructField("runtime", T.LongType()),
        T.StructField(
            "schedule",
            T.StructType(
                [
                    T.StructField("days", T.ArrayType(T.StringType())),
                    T.StructField("time", T.StringType()),
                ]
            ),
        ),
        T.StructField("status", T.StringType()),
        T.StructField("summary", T.StringType()),
        T.StructField("type", T.LongType()),         # switched to LongType (the data holds a string)
        T.StructField("updated", T.LongType()),      # unchanged: already LongType
        T.StructField("url", T.LongType()),          # switched to LongType (the data holds a string)
        T.StructField("webChannel", T.StringType()),
        T.StructField("weight", T.LongType()),
    ]
)

shows_with_schema_wrong = spark.read.json(
    "data/ch06/shows-silicon-valley.json", schema=shows_schema2, mode="FAILFAST",
)

try:
    shows_with_schema_wrong.show()
except Py4JJavaError:
    pass

# Huge Spark ERROR stacktrace, relevant bit:
#
# Caused by: java.lang.RuntimeException: Failed to parse a value for data type
#   bigint (current token: VALUE_STRING).

## Defining your schema in JSON

`StructType` comes with two methods for exporting its content into a JSON-esque format.
1. `json()` returns a string containing the JSON-formatted schema
2. `jsonValue()` returns the schema as a dictionary

In [203]:
from pprint import pprint

pprint(shows_with_schema.select('schedule').schema.jsonValue())

{'fields': [{'metadata': {},
             'name': 'schedule',
             'nullable': True,
             'type': {'fields': [{'metadata': {},
                                  'name': 'days',
                                  'nullable': True,
                                  'type': {'containsNull': True,
                                           'elementType': 'string',
                                           'type': 'array'}},
                                 {'metadata': {},
                                  'name': 'time',
                                  'nullable': True,
                                  'type': 'string'}],
                      'type': 'struct'}}],
 'type': 'struct'}


You can use `jsonValue` on complex schema to see its JSON representation. This is helpful when trying to remember a complex schema:

### Array types
  1. `containsNull`,
  2. `elementType`,
  3. `type` (always array)

In [204]:
pprint(T.StructField("array_example", T.ArrayType(T.StringType())).jsonValue())

{'metadata': {},
 'name': 'array_example',
 'nullable': True,
 'type': {'containsNull': True, 'elementType': 'string', 'type': 'array'}}


### Map types

1. `keyType`
2. `type` (always map)
3. `valueContainsNull`
4. `valueType`

In [205]:
# Example 1
pprint(
    T.StructField("map_example", T.MapType(T.StringType(), T.LongType())).jsonValue()
)

{'metadata': {},
 'name': 'map_example',
 'nullable': True,
 'type': {'keyType': 'string',
          'type': 'map',
          'valueContainsNull': True,
          'valueType': 'long'}}


In [206]:
# With both
pprint(
    T.StructType(
        [
            T.StructField("map_example", T.MapType(T.StringType(), T.LongType())),
            T.StructField("array_example", T.ArrayType(T.StringType())),
        ]
    ).jsonValue()
)

{'fields': [{'metadata': {},
             'name': 'map_example',
             'nullable': True,
             'type': {'keyType': 'string',
                      'type': 'map',
                      'valueContainsNull': True,
                      'valueType': 'long'}},
            {'metadata': {},
             'name': 'array_example',
             'nullable': True,
             'type': {'containsNull': True,
                      'elementType': 'string',
                      'type': 'array'}}],
 'type': 'struct'}


> Finally, we can close the loop by making sure that our JSON schema is consistent with the one currently being used. For this, we’ll export the schema of `shows_with_schema` as a JSON string, load it as a JSON object, and then use the `StructType.fromJson()` method to re-create the schema.

In [207]:
other_shows_schema = T.StructType.fromJson(json.loads(shows_with_schema.schema.json()))

print(other_shows_schema == shows_with_schema.schema)  # True

True


## Reducing duplicate data with complex data types

### Hierarchical vs 2-D row-column models

If we were to model the `shows` data frame in a traditional relational database, we could have a `shows` table linked to an `episodes` table using a star schema.

`shows` table

| show_id | name           |
|---------|----------------|
| 143     | Silicon Valley |

`episodes` table, joined to `shows` by `show_id`

| show_id | episode_id | name                      |
|---------|------------|---------------------------|
| 143     | 1          | Minimum Viable Product    |
| 143     | 2          | The Cap Table             |
| 143     | 3          | Articles of Incorporation |

`episodes` could be extended with more columns, but the show-level values then repeat on every row:

| show_id | episode_id | name                      | genre  | day    |
|---------|------------|---------------------------|--------|--------|
| 143     | 1          | Minimum Viable Product    | Comedy | Sunday |
| 143     | 2          | The Cap Table             | Comedy | Sunday |
| 143     | 3          | Articles of Incorporation | Comedy | Sunday |


In contrast, a hierarchical data frame contains complex columns with arrays and struct columns:
- each record represents a show;
- a show has multiple episodes (array of structs column);
- each episode has many fields (struct column within the array);
- each show can have multiple genres (array of string column);
- each show has a schedule (struct column);
- each schedule belonging to a show can have multiple days (array), but a single time (string).
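
The same contrast can be sketched in plain Python, with a nested dictionary standing in for one hierarchical record and a list of flat dictionaries standing in for the tabular model. The values are illustrative and mirror the tables above; no Spark is involved.

```python
# One hierarchical record: show-level values are stored once.
show = {
    "show_id": 143,
    "name": "Silicon Valley",
    "genres": ["Comedy"],
    "schedule": {"days": ["Sunday"], "time": "22:00"},
    "episodes": [
        {"episode_id": 1, "name": "Minimum Viable Product"},
        {"episode_id": 2, "name": "The Cap Table"},
        {"episode_id": 3, "name": "Articles of Incorporation"},
    ],
}

# Flattening to rows repeats the show-level values on every episode row.
flat_rows = [
    {
        "show_id": show["show_id"],
        "episode_id": episode["episode_id"],
        "name": episode["name"],
        "genre": genre,
        "day": day,
    }
    for episode in show["episodes"]
    for genre in show["genres"]
    for day in show["schedule"]["days"]
]

for row in flat_rows:
    print(row)
```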


#### `shows` data frame using a hierarchical model

![](./notes/img/hier_df.png)

## How to use `explode` and `collect` operations to go from hierarchical to tabular and back

> We will now revisit the exploding operation by generalizing it to maps, looking at its behavior when the data frame has multiple columns, and seeing the different options PySpark provides for exploding.

In [208]:
# Exploding _embedded.episodes
episodes = shows.select("id", F.explode("_embedded.episodes").alias("episodes"))
episodes.printSchema()
episodes.show(5)

root
 |-- id: long (nullable = true)
 |-- episodes: struct (nullable = true)
 |    |-- _links: struct (nullable = true)
 |    |    |-- self: struct (nullable = true)
 |    |    |    |-- href: string (nullable = true)
 |    |-- airdate: string (nullable = true)
 |    |-- airstamp: timestamp (nullable = true)
 |    |-- airtime: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- image: struct (nullable = true)
 |    |    |-- medium: string (nullable = true)
 |    |    |-- original: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- number: long (nullable = true)
 |    |-- runtime: long (nullable = true)
 |    |-- season: long (nullable = true)
 |    |-- summary: string (nullable = true)
 |    |-- url: string (nullable = true)

+---+--------------------+
| id|            episodes|
+---+--------------------+
|143|[[[http://api.tvm...|
|143|[[[http://api.tvm...|
|143|[[[http://api.tvm...|
|143|[[[http://api.tvm...|
|143|[[[http://api.tvm...|
+---+--------------------+
only showing top 5 rows

### Exploding a `map`

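- Exploding a map puts the keys and the values in two different fields.
- `posexplode`: explodes the column and also returns an additional column, placed before the data, that contains the element positions (`IntegerType`).
- `explode` / `posexplode` drop records whose array or map is null or empty; `explode_outer` / `posexplode_outer` keep them, with null values.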

In [209]:
episode_name_id = shows.select(
    F.map_from_arrays(
        F.col("_embedded.episodes.id"), F.col("_embedded.episodes.name")
    ).alias("name_id")
)

episode_name_id = episode_name_id.select(
    F.posexplode("name_id").alias("position", "id", "name")
)

episode_name_id.show(5, False)

+--------+-----+-------------------------+
|position|id   |name                     |
+--------+-----+-------------------------+
|0       |10897|Minimum Viable Product   |
|1       |10898|The Cap Table            |
|2       |10899|Articles of Incorporation|
|3       |10900|Fiduciary Duties         |
|4       |10901|Signaling Risk           |
+--------+-----+-------------------------+
only showing top 5 rows



### `collect`-ing records into a complex column

#### `collect_list()` and `collect_set()`

- Both take a column as argument and return an array column.
- `collect_list()`: one array element per record, duplicates included.
- `collect_set()`: one array element per distinct value (like a Python set).

In [210]:
collected = episodes.groupby("id").agg(F.collect_list("episodes").alias("episodes"))
print(collected.count())
collected.printSchema()

1
root
 |-- id: long (nullable = true)
 |-- episodes: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |-- airdate: string (nullable = true)
 |    |    |-- airstamp: timestamp (nullable = true)
 |    |    |-- airtime: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- image: struct (nullable = true)
 |    |    |    |-- medium: string (nullable = true)
 |    |    |    |-- original: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- number: long (nullable = true)
 |    |    |-- runtime: long (nullable = true)
 |    |    |-- season: long (nullable = true)
 |    |    |-- summary: string (nullable = true)
 |    |    |-- url: string (nullable = true)



### Building your own hierarchies with `struct()`

The `struct()` function takes columns as params and returns a struct column whose fields are the columns passed in.

In [211]:
# Creating a struct column

struct_ex = shows.select(
    F.struct(
        F.col("status"), F.col("weight"), F.lit(True).alias("has_watched")
    ).alias("info")
)

struct_ex.show(1, False)

struct_ex.printSchema()

+-----------------+
|info             |
+-----------------+
|[Ended, 96, true]|
+-----------------+

root
 |-- info: struct (nullable = false)
 |    |-- status: string (nullable = true)
 |    |-- weight: long (nullable = true)
 |    |-- has_watched: boolean (nullable = false)



In [212]:
shows.printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: timestamp (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)
 ... (output truncated)