<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Complex Types

##### Methods
- DataFrame (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Scala</a>): `unionByName`
- Built-In Functions (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#module-pyspark.sql.functions" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html" target="_blank">Scala</a>):
  - Collection: `explode`, `array_contains`, `element_at`, `collect_set`
  - String: `split`

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) User Purchases
List all size and quality options purchased by each buyer.
1. Extract item details from purchases
2. Extract size and quality options from mattress purchases
3. Extract size and quality options from pillow purchases
4. Combine data for mattresses and pillows
5. List all size and quality options bought by each user

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
df = spark.read.parquet(salesPath)
display(df)

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
257437,kmunoz@powell-duran.com,1592194221828900,1,1995.0,1,"List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))"
282611,bmurillo@hotmail.com,1592504237604072,1,940.5,1,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))"
257448,bradley74@gmail.com,1592200438030141,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
257440,jameshardin@campbell-morris.biz,1592197217716495,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
283949,whardin@hotmail.com,1592510720760323,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"
257444,emily88@cobb.com,1592199040703476,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
257449,craig61@luna-oliver.com,1592200459769596,1,1195.0,1,"List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))"
257441,johnsonashley@mcclain.com,1592197729873798,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
264191,maxwelltara@edwards.com,1592306255847870,2,993.6,2,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1), List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1))"
286727,rojasjorge@yahoo.com,1592533048926949,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"


### 1. Extract item details from purchases
- Explode the **`items`** field in **`df`** so that each purchased item gets its own row
- Select the **`email`** and **`item.item_name`** fields
- Split **`item_name`** on spaces into an array of words and alias it as "details"

Assign the resulting DataFrame to **`detailsDF`**.
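
To make the two transformations concrete, here is a minimal plain-Python model of what exploding an array column and splitting a string do. It is a sketch of the semantics only, not Spark code: `explode_items` is a hypothetical helper, and the two orders are trimmed copies of rows from the `display(df)` output above.

```python
# Plain-Python model of explode + split; this is NOT the Spark API.
# Sample rows are trimmed from the display(df) output above.
orders = [
    {"email": "maxwelltara@edwards.com",
     "items": [{"item_name": "Standard Queen Mattress"},
               {"item_name": "Standard Foam Pillow"}]},
    {"email": "kmunoz@powell-duran.com",
     "items": [{"item_name": "Premium King Mattress"}]},
]

def explode_items(rows):
    """Yield one output row per element of each row's `items` array."""
    for row in rows:
        for item in row["items"]:
            yield {"email": row["email"], "item_name": item["item_name"]}

# Split each item name on single spaces, mirroring split(col("item_name"), " ")
details = [
    {**row, "details": row["item_name"].split(" ")}
    for row in explode_items(orders)
]

for row in details:
    print(row)
```

The order with two items produces two rows, which is why `maxwelltara@edwards.com` appears twice in the result.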

In [0]:
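# Import only the functions this notebook uses instead of a wildcard import
from pyspark.sql.functions import array_contains, col, collect_set, element_at, explode, split

detailsDF = (df.withColumn("items", explode("items"))  # one row per purchased item
  .select("email", "items.item_name")
  .withColumn("details", split(col("item_name"), " "))  # e.g. ["Premium", "King", "Mattress"]
)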
display(detailsDF)

email,item_name,details
kmunoz@powell-duran.com,Premium King Mattress,"List(Premium, King, Mattress)"
bmurillo@hotmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)"
bradley74@gmail.com,Standard Full Mattress,"List(Standard, Full, Mattress)"
jameshardin@campbell-morris.biz,Standard Queen Mattress,"List(Standard, Queen, Mattress)"
whardin@hotmail.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)"
emily88@cobb.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)"
craig61@luna-oliver.com,Standard King Mattress,"List(Standard, King, Mattress)"
johnsonashley@mcclain.com,Standard Full Mattress,"List(Standard, Full, Mattress)"
maxwelltara@edwards.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)"
maxwelltara@edwards.com,Standard Foam Pillow,"List(Standard, Foam, Pillow)"


### 2. Extract size and quality options from mattress purchases
- Filter **`detailsDF`** for records where **`details`** contains "Mattress"
- Add a **`size`** column by extracting the element at position 2 of **`details`** (positions are 1-based)
- Add a **`quality`** column by extracting the element at position 1 of **`details`**

Save result as **`mattressDF`**.
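
Because `element_at` counts array positions from 1 rather than 0, a small plain-Python model may help when checking the positions used for mattresses and pillows. This is a sketch of the indexing rule, not the Spark function: the local `element_at` is a hypothetical stand-in, and returning `None` for an out-of-range index is an assumption (Spark's behavior there depends on its ANSI setting).

```python
def element_at(values, index):
    """Model of 1-based array lookup; negative indexes count from the end."""
    if index == 0:
        raise ValueError("array positions start at 1, not 0")
    pos = index - 1 if index > 0 else len(values) + index
    if 0 <= pos < len(values):
        return values[pos]
    return None  # assumption: out-of-range lookups yield a null

mattress = ["Premium", "King", "Mattress"]
pillow = ["King", "Down", "Pillow"]

# Membership test, mirroring array_contains(col("details"), "Mattress")
is_mattress = "Mattress" in mattress

# Mattress names read "<quality> <size> Mattress"
mattress_quality = element_at(mattress, 1)
mattress_size = element_at(mattress, 2)

# Pillow names read "<size> <quality> Pillow"
pillow_size = element_at(pillow, 1)
pillow_quality = element_at(pillow, 2)

print(is_mattress, mattress_size, mattress_quality, pillow_size, pillow_quality)
```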

In [0]:
mattressDF = (detailsDF.filter(array_contains(col("details"), "Mattress"))
  .withColumn("size", element_at(col("details"), 2))
  .withColumn("quality", element_at(col("details"), 1))
)
display(mattressDF)

email,item_name,details,size,quality
kmunoz@powell-duran.com,Premium King Mattress,"List(Premium, King, Mattress)",King,Premium
bmurillo@hotmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen,Standard
bradley74@gmail.com,Standard Full Mattress,"List(Standard, Full, Mattress)",Full,Standard
jameshardin@campbell-morris.biz,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen,Standard
whardin@hotmail.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)",Twin,Standard
emily88@cobb.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen,Standard
craig61@luna-oliver.com,Standard King Mattress,"List(Standard, King, Mattress)",King,Standard
johnsonashley@mcclain.com,Standard Full Mattress,"List(Standard, Full, Mattress)",Full,Standard
maxwelltara@edwards.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen,Standard
rojasjorge@yahoo.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)",Twin,Standard


### 3. Extract size and quality options from pillow purchases
- Filter **`detailsDF`** for records where **`details`** contains "Pillow"
- Add a **`size`** column by extracting the element at position 1 of **`details`**
- Add a **`quality`** column by extracting the element at position 2 of **`details`**

Note that the positions of **`size`** and **`quality`** are swapped relative to mattresses: pillow names list the size first and the quality second.

Save result as **`pillowDF`**.

In [0]:
pillowDF = (detailsDF.filter(array_contains(col("details"), "Pillow"))
  .withColumn("size", element_at(col("details"), 1))
  .withColumn("quality", element_at(col("details"), 2))
)
display(pillowDF)

email,item_name,details,size,quality
maxwelltara@edwards.com,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam
marmstrong46@hotmail.com,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam
johnsonderrick@yahoo.com,King Down Pillow,"List(King, Down, Pillow)",King,Down
johnsonderrick@yahoo.com,Standard Down Pillow,"List(Standard, Down, Pillow)",Standard,Down
hilljoshua43@hotmail.com,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam
gayala@phillips.net,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam
andrew5297@hotmail.com,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam
owerner@yahoo.com,Standard Down Pillow,"List(Standard, Down, Pillow)",Standard,Down
racheljackson@gmail.com,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam
dstout@keith.net,Standard Foam Pillow,"List(Standard, Foam, Pillow)",Standard,Foam


### 4. Combine data for mattresses and pillows
- Union **`mattressDF`** and **`pillowDF`**, matching columns by name rather than by position
- Drop **`details`** column

Save result as **`unionDF`**.
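
The difference between a positional union and a union by name only shows up when the two inputs order their columns differently. In this notebook both DataFrames share the same column order, so the plain-Python sketch below uses a hypothetical reordered pillow table to make the difference visible. `union_by_position` and `union_by_name` are illustrative helpers, not Spark functions.

```python
def union_by_position(left_cols, left_rows, right_cols, right_rows):
    """Model of a positional union: right rows are appended unchanged."""
    return left_cols, left_rows + right_rows

def union_by_name(left_cols, left_rows, right_cols, right_rows):
    """Model of a union by name: right rows are reordered to the left column order."""
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered

mattress_cols = ("email", "size", "quality")
mattress_rows = [("kmunoz@powell-duran.com", "King", "Premium")]

# Hypothetical: the pillow table lists quality before size
pillow_cols = ("email", "quality", "size")
pillow_rows = [("johnsonderrick@yahoo.com", "Down", "King")]

_, by_position = union_by_position(mattress_cols, mattress_rows, pillow_cols, pillow_rows)
_, by_name = union_by_name(mattress_cols, mattress_rows, pillow_cols, pillow_rows)

print(by_position[1])  # "Down" lands under the size column
print(by_name[1])      # values line up with (email, size, quality)
```

Neither variant removes duplicate rows, so the combined result keeps every mattress and pillow purchase.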

In [0]:
unionDF = (mattressDF.unionByName(pillowDF)
  .drop("details"))
display(unionDF)

email,item_name,size,quality
kmunoz@powell-duran.com,Premium King Mattress,King,Premium
bmurillo@hotmail.com,Standard Queen Mattress,Queen,Standard
bradley74@gmail.com,Standard Full Mattress,Full,Standard
jameshardin@campbell-morris.biz,Standard Queen Mattress,Queen,Standard
whardin@hotmail.com,Standard Twin Mattress,Twin,Standard
emily88@cobb.com,Standard Queen Mattress,Queen,Standard
craig61@luna-oliver.com,Standard King Mattress,King,Standard
johnsonashley@mcclain.com,Standard Full Mattress,Full,Standard
maxwelltara@edwards.com,Standard Queen Mattress,Queen,Standard
rojasjorge@yahoo.com,Standard Twin Mattress,Twin,Standard


### 5. List all size and quality options bought by each user
- Group rows in **`unionDF`** by **`email`**
  - Collect the set of distinct **`size`** values for each user, aliased as "size options"
  - Collect the set of distinct **`quality`** values for each user, aliased as "quality options"
  
Save result as **`optionsDF`**.
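
Grouping by user and collecting a set amounts to building one deduplicated collection per key. The following plain-Python sketch models that with `defaultdict(set)`. It is not Spark code, and the rows are taken from the mattress and pillow outputs shown earlier in this notebook.

```python
from collections import defaultdict

# (email, size, quality) rows taken from the earlier outputs
rows = [
    ("johnsonderrick@yahoo.com", "King", "Down"),
    ("johnsonderrick@yahoo.com", "Standard", "Down"),
    ("maxwelltara@edwards.com", "Queen", "Standard"),
    ("maxwelltara@edwards.com", "Standard", "Foam"),
]

size_options = defaultdict(set)
quality_options = defaultdict(set)

# One pass over the rows: each set silently drops repeated values
for email, size, quality in rows:
    size_options[email].add(size)
    quality_options[email].add(quality)

# Sets are unordered, so sort only for stable printing
for email in sorted(size_options):
    print(email, sorted(size_options[email]), sorted(quality_options[email]))
```

As with these Python sets, the element order inside the arrays returned by `collect_set` is not guaranteed, so comparisons should not rely on it.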

In [0]:
optionsDF = (unionDF.groupBy("email")
  .agg(collect_set("size").alias("size options"),
       collect_set("quality").alias("quality options"))
)
display(optionsDF)

email,size options,quality options
aadkins@hill.biz,List(Twin),List(Standard)
aalexander@hotmail.com,List(King),List(Standard)
aallen43@hotmail.com,"List(Queen, Twin)","List(Premium, Standard)"
aallen@keith-taylor.com,List(Queen),List(Standard)
aalvarez4@gmail.com,List(Queen),List(Standard)
aalvarez@gmail.com,"List(Queen, Full)",List(Standard)
aanderson26@hotmail.com,List(Queen),List(Premium)
aaron01@hotmail.com,List(Queen),List(Standard)
aaron04@wolfe.com,List(Queen),List(Standard)
aaron05@hotmail.com,List(Twin),List(Premium)


### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
