### Complex Types in Spark

In this notebook we demonstrate complex column types in Spark SQL. As we saw with Hive, these complex types are particularly useful when working with semi-structured data sources (e.g., JSON or XML) and with text.

This notebook is based on material supplied by Cloudera under their Cloudera Academic Partner program and *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia. 

Topics
- Arrays
- Maps
- Structs 

See the [documentation for these complex types](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types) for more information and the available methods.

In [0]:
# Load raw ride data
rides = spark.read.csv("/mnt/cis442f-data/duocar/raw/rides/", header=True, inferSchema=True)

# Load raw driver data 
drivers = spark.read.csv("/mnt/cis442f-data/duocar/raw/drivers/", header=True, inferSchema=True)

# Load raw rider data 
riders = spark.read.csv("/mnt/cis442f-data/duocar/raw/riders/", header=True, inferSchema=True) 

#### Arrays

Arrays are similar to lists in Python, except that all elements must be of the same type. The `split` function creates an array of words when applied to text.
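
The notebook does not demonstrate `split` itself, so here is a rough plain-Python analogy rather than Spark code. It assumes a made-up sample string and a hypothetical helper named `split_words`; in Spark the corresponding call would be `split` from `pyspark.sql.functions`, which likewise takes a regular-expression pattern.

```python
import re

def split_words(text, pattern=r"\s+"):
    """Hypothetical stand-in for Spark's split: break text on a regex pattern."""
    return re.split(pattern, text)

# Made-up sample text, not taken from the DuoCar data
words = split_words("Subaru Outback Wagon")

print(words)       # a list of strings, analogous to an array<string> value
print(len(words))  # analogous to applying size to the array column
```

The resulting list plays the role of a single array value in a DataFrame column, and `len` plays the role of `size`.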

**Note:** We use 
- The [array](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array) function to create an array from multiple columns. 
- The [size](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size) function to get the length of the array.
- The [sort_array](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sort_array) function to sort the array.
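
One detail worth knowing about `sort_array` is how it treats nulls: per the Spark documentation, null elements are placed first in ascending order and last in descending order. The following is a minimal plain-Python sketch of that behavior, not Spark code; the helper name `sort_array_model` and the sample values are assumptions made for illustration.

```python
def sort_array_model(values, asc=True):
    """Hypothetical model of sort_array: nulls first when ascending, last when descending."""
    nulls = [v for v in values if v is None]
    present = sorted((v for v in values if v is not None), reverse=not asc)
    return nulls + present if asc else present + nulls

# Made-up array value containing a null element
vehicle_array = ["Subaru", None, "Outback"]

ascending = sort_array_model(vehicle_array, asc=True)
descending = sort_array_model(vehicle_array, asc=False)
print(ascending)
print(descending)
```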

In [0]:
# Use the array function to create an array from multiple columns
from pyspark.sql.functions import array
drivers_array = drivers.select( \
  "vehicle_make", \
  "vehicle_model", \
  array("vehicle_make", "vehicle_model").alias("vehicle_array"))
drivers_array.printSchema()
drivers_array.show(5, truncate=False)

# Note: `vehicle_array` is an array<string> column; its values come back as Python lists when rows are collected.

In [0]:
# Use the size function to get the length of the array
from pyspark.sql.functions import size
drivers_array \
  .select("vehicle_array", size("vehicle_array")) \
  .show(5, False)

# Use index notation to access elements of the array
drivers_array \
  .select("vehicle_array", drivers_array.vehicle_array[0]) \
  .show(5, False)

In [0]:
# Note: Some equivalent alternatives to the previous expression
drivers_array \
  .select("vehicle_array", drivers_array["vehicle_array"][0]) \
  .show(1, False)

from pyspark.sql.functions import col
drivers_array \
  .select("vehicle_array", col("vehicle_array")[0]) \
  .show(1, False)

from pyspark.sql.functions import expr
drivers_array \
  .select("vehicle_array", expr("vehicle_array[0]")) \
  .show(1, False)

drivers_array \
  .selectExpr("vehicle_array", "vehicle_array[0]") \
  .show(1, False)

In [0]:
# Use the sort_array function to sort the array
from pyspark.sql.functions import sort_array
drivers_array \
  .select("vehicle_array", sort_array("vehicle_array", asc=True)) \
  .show(5, False)

In [0]:
# Use the `array_contains` function to search the array
from pyspark.sql.functions import array_contains
drivers_array \
  .select("vehicle_array", array_contains("vehicle_array", "Subaru")) \
  .show(5, False)

In [0]:
# Use the `explode` function to explode the array
from pyspark.sql.functions import explode
drivers_array \
  .select("vehicle_array", explode("vehicle_array")) \
  .show(5, False)



In [0]:
# Also use the `posexplode` function to explode the array together with each element's position
# Note that you can pass multiple names to the `alias` method
from pyspark.sql.functions import posexplode
drivers_array \
  .select("vehicle_array", posexplode("vehicle_array").alias("position", "car_type")) \
  .show(5, False)
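
To make the row-level effect of `explode` and `posexplode` concrete, here is a minimal plain-Python sketch that models rows as dictionaries. It is not Spark code: the helper names `explode_rows` and `posexplode_rows` and the sample rows are assumptions, while the output names `col` and `pos` follow Spark's default column names for these functions.

```python
def explode_rows(rows, column):
    """Hypothetical model of explode: one output row per array element.
    Rows whose array is empty or null produce no output."""
    for row in rows:
        for value in row[column] or []:
            yield {column: row[column], "col": value}

def posexplode_rows(rows, column):
    """Hypothetical model of posexplode: like explode, plus the element's position."""
    for row in rows:
        for position, value in enumerate(row[column] or []):
            yield {column: row[column], "pos": position, "col": value}

# Made-up sample rows, including an empty and a null array
rows = [
    {"vehicle_array": ["Subaru", "Outback"]},
    {"vehicle_array": ["Ford", "Focus"]},
    {"vehicle_array": []},
    {"vehicle_array": None},
]

exploded = list(explode_rows(rows, "vehicle_array"))
positioned = list(posexplode_rows(rows, "vehicle_array"))

for row in positioned:
    print(row["pos"], row["col"])
```

Each two-element array turns into two output rows, while the empty and null arrays contribute none.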


#### Maps

Maps are key-value pairs nested in a DataFrame column.

Use
- the [`create_map`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.create_map) function to create a map
- the `size` function to get the length of the map
- dot notation to access the value by key
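
The `create_map` function takes its arguments as alternating key and value expressions (key1, value1, key2, value2, and so on). The following plain-Python sketch models that pairing with a dictionary. It is not Spark code, and the helper name `create_map_model` and the sample values are assumptions made for illustration.

```python
def create_map_model(*args):
    """Hypothetical model of create_map: pair up alternating key, value arguments."""
    if len(args) % 2 != 0:
        raise ValueError("expected key1, value1, key2, value2, ...")
    return dict(zip(args[0::2], args[1::2]))

# Made-up values standing in for one row's vehicle_make and vehicle_model
vehicle_map = create_map_model("make", "Subaru", "model", "Outback")

print(vehicle_map["make"])        # lookup by key
print(len(vehicle_map))           # analogous to applying size to the map column
print(list(vehicle_map.items()))  # key/value pairs, as explode would emit them
```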

In [0]:
# Use the `create_map` function to create a map
from pyspark.sql.functions import lit, create_map
drivers_map = drivers.select( \
  "vehicle_make", \
  "vehicle_model", \
   create_map(lit("make"), "vehicle_make", lit("model"), "vehicle_model").alias("vehicle_map"))
drivers_map.printSchema()
drivers_map.show(5, False) 

In [0]:
# Use the `size` function to get the length of the map:
drivers_map.select("vehicle_map", size("vehicle_map")).show(5, False)


In [0]:
# Use dot notation to access the value by key:
drivers_map.select("vehicle_map", drivers_map.vehicle_map.make).show(5, False)

In [0]:
# Use the `explode` and `posexplode` functions to explode the map:
drivers_map.select("vehicle_map", explode("vehicle_map")).show(5, False)
drivers_map.select("vehicle_map", posexplode("vehicle_map")).show(5, False)

#### Structs

Structs are a complex data type in which each element has its own field name. When rows are collected into Python, a struct value is represented as a `Row` object nested inside the enclosing `Row` object.

Use the [struct](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.struct) function to create a struct.
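
As a rough mental model of a struct nested inside a row, the sketch below uses `collections.namedtuple` in place of PySpark's `Row`. It is plain Python, not Spark code; the type names `Vehicle` and `DriverRow` and the sample values are assumptions. The compact JSON string imitates what `to_json` would be expected to produce for such a struct.

```python
import json
from collections import namedtuple

# Hypothetical stand-ins for the inner struct and the enclosing row
Vehicle = namedtuple("Vehicle", ["make", "model"])
DriverRow = namedtuple("DriverRow", ["vehicle_make", "vehicle_model", "vehicle_struct"])

# Made-up values for a single driver
row = DriverRow("Subaru", "Outback", Vehicle(make="Subaru", model="Outback"))

# Fields of the nested struct are reached by name, as with dot notation on a column
print(row.vehicle_struct.make)

# Serialize the nested struct to a compact JSON string
as_json = json.dumps(row.vehicle_struct._asdict(), separators=(",", ":"))
print(as_json)
```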

In [0]:
# Use the `struct` function to create a struct
from pyspark.sql.functions import struct
drivers_struct = drivers.select( \
  "vehicle_make", \
  "vehicle_model", \
  struct(drivers.vehicle_make.alias("make"), drivers.vehicle_model.alias("model")).alias("vehicle_struct"))
drivers_struct.printSchema()
drivers_struct.show(5, False)


drivers_struct.head(5) 

In [0]:
# Use dot notation to access struct items
drivers_struct \
  .select("vehicle_struct", col("vehicle_struct").make.alias("vehicle_make")) \
  .show(5, False)

# Note: Using `col` is a bit more concise in this case (than specifying `drivers_struct.vehicle_struct`)

In [0]:
# Use the `to_json` function to convert the struct to a JSON string
from pyspark.sql.functions import to_json
drivers_struct \
  .select("vehicle_struct", to_json("vehicle_struct")) \
  .show(5, False)

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Create an array called `home_array` that includes the driver's home latitude and longitude.

(2) Create a map called `name_map` that includes the driver's first and last name.

(3) Create a struct called `name_struct` that includes the driver's first and last name.



#### References

See the [pyspark function documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions) for more on `array`, `create_map`, and `struct`.