
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Complex Types

Explore built-in functions for working with collections and strings.

##### Objectives
1. Apply collection functions to process arrays
1. Union DataFrames together

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html" target="_blank">DataFrame</a>: **`union`**, **`unionByName`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html" target="_blank">Built-In Functions</a>:
  - Aggregate: **`collect_set`**
  - Collection: **`array_contains`**, **`element_at`**, **`explode`**
  - String: **`split`**

In [0]:
%run ../Includes/Classroom-Setup

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03"

Validating the locally installed datasets...(4 seconds)

Predefined tables in "da_sergio_salgado_4613_asp":
  -none-

Predefined paths variables:
  DA.paths.user_db:     dbfs:/mnt/dbacademy-users/sergio.salgado@n.world/apache-spark-programming-with-databricks/database.db
  DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03
  DA.paths.working_dir: dbfs:/mnt/dbacademy-users/sergio.salgado@n.world/apache-spark-programming-with-databricks
  DA.paths.checkpoints: dbfs:/mnt/dbacademy-users/sergio.salgado@n.world/apache-spark-programming-with-databricks/_checkpoints
  DA.paths.sales:       dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03/ecommerce/sales/sales.delta
  DA.paths.users:       dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03/ecommerce/users/users.delta
  DA.paths.events:

In [0]:
from pyspark.sql.functions import *

In [0]:
df = spark.read.format("delta").load(DA.paths.sales)

display(df)

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
257437,kmunoz@powell-duran.com,1592194221828900,1,1995.0,1,"List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))"
282611,bmurillo@hotmail.com,1592504237604072,1,940.5,1,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))"
257448,bradley74@gmail.com,1592200438030141,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
257440,jameshardin@campbell-morris.biz,1592197217716495,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
283949,whardin@hotmail.com,1592510720760323,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"
257444,emily88@cobb.com,1592199040703476,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
257449,craig61@luna-oliver.com,1592200459769596,1,1195.0,1,"List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))"
257441,johnsonashley@mcclain.com,1592197729873798,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
264191,maxwelltara@edwards.com,1592306255847870,2,993.6,2,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1), List(NEWBED10, P_FOAM_S, Standard Foam Pillow, 53.1, 59.0, 1))"
286727,rojasjorge@yahoo.com,1592533048926949,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"


In [0]:
# You will need this DataFrame for a later exercise
details_df = (df
              .withColumn("items", explode("items"))
              .select("email", "items.item_name")
              .withColumn("details", split(col("item_name"), " "))
             )
display(details_df)

email,item_name,details
phillipmorgan@hotmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)"
derekreed@yahoo.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)"
jamespowell@gmail.com,King Foam Pillow,"List(King, Foam, Pillow)"
wbrown@gonzales-miranda.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)"
zavalamario@yahoo.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)"
meganhopkins@gmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)"
meganhopkins@gmail.com,King Foam Pillow,"List(King, Foam, Pillow)"
daniel054@gmail.com,Standard King Mattress,"List(Standard, King, Mattress)"
hoffmanjohn@chan.com,Standard King Mattress,"List(Standard, King, Mattress)"
anthony1587@yahoo.com,Standard Full Mattress,"List(Standard, Full, Mattress)"


### String Functions
Here are some of the built-in functions available for manipulating strings.

| Method | Description |
| --- | --- |
| translate | Translate any character in the src by a character in replaceString |
| regexp_replace | Replace all substrings of the specified string value that match regexp with rep |
| regexp_extract | Extract a specific group matched by a Java regex, from the specified string column |
| ltrim | Removes the leading space characters from the specified string column |
| lower | Converts a string column to lowercase |
| split | Splits str around matches of the given pattern |

For example, suppose we need to parse the **`email`** column. We can use the **`split`** function to separate the handle from the domain.

In [0]:
from pyspark.sql.functions import split

In [0]:
display(df.select(split(df.email, '@', 0).alias('email_handle')))

email_handle
"List(kmunoz, powell-duran.com)"
"List(bmurillo, hotmail.com)"
"List(bradley74, gmail.com)"
"List(jameshardin, campbell-morris.biz)"
"List(whardin, hotmail.com)"
"List(emily88, cobb.com)"
"List(craig61, luna-oliver.com)"
"List(johnsonashley, mcclain.com)"
"List(maxwelltara, edwards.com)"
"List(rojasjorge, yahoo.com)"


### Collection Functions

Here are some of the built-in functions available for working with arrays.

| Method | Description |
| --- | --- |
| array_contains | Returns null if the array is null, true if the array contains value, and false otherwise. |
| element_at | Returns element of array at given index. Array elements are numbered starting with **1**. |
| explode | Creates a new row for each element in the given array or map column. |
| collect_set | Returns a set of objects with duplicate elements eliminated. |

In [0]:
mattress_df = (details_df
               .filter(array_contains(col("details"), "Mattress"))
               .withColumn("size", element_at(col("details"), 2)))
display(mattress_df)

email,item_name,details,size
phillipmorgan@hotmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen
derekreed@yahoo.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)",Twin
wbrown@gonzales-miranda.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)",Twin
zavalamario@yahoo.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)",Twin
meganhopkins@gmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen
daniel054@gmail.com,Standard King Mattress,"List(Standard, King, Mattress)",King
hoffmanjohn@chan.com,Standard King Mattress,"List(Standard, King, Mattress)",King
anthony1587@yahoo.com,Standard Full Mattress,"List(Standard, Full, Mattress)",Full
camerontorres@gmail.com,Standard Queen Mattress,"List(Standard, Queen, Mattress)",Queen
camerontorres@gmail.com,Standard Twin Mattress,"List(Standard, Twin, Mattress)",Twin


### Aggregate Functions

Here are some of the built-in aggregate functions available for creating arrays, typically from GroupedData.

| Method | Description |
| --- | --- |
| collect_list | Returns an array consisting of all values within the group. |
| collect_set | Returns an array consisting of all unique values within the group. |

Suppose we want to see the mattress sizes ordered by each email address. For this, we can use the **`collect_set`** function.

In [0]:
size_df = mattress_df.groupBy("email").agg(collect_set("size").alias("size options"))

display(size_df)

email,size options
aadkins@hill.biz,List(Twin)
aalexander@hotmail.com,List(King)
aallen43@hotmail.com,"List(Queen, Twin)"
aallen@keith-taylor.com,List(Queen)
aalvarez4@gmail.com,List(Queen)
aalvarez@gmail.com,"List(Queen, Full)"
aanderson12@gmail.com,List(King)
aanderson26@hotmail.com,List(Queen)
aaron01@hotmail.com,List(Queen)
aaron04@wolfe.com,List(Queen)


## Union and unionByName
<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html" target="_blank">**`union`**</a> method resolves columns by position, as in standard SQL, and is equivalent to UNION ALL. You should use it only if the two DataFrames have exactly the same schema, including the column order. In contrast, the DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html" target="_blank">**`unionByName`**</a> method resolves columns by name. Neither method removes duplicates.

The check below tests whether the two DataFrames have matching schemas, in which case **`union`** would be appropriate.

In [0]:
mattress_df.schema==size_df.schema

Out[11]: False

If we make the two schemas match with a simple **`select`** statement, then we can use **`union`**.

In [0]:
union_count = mattress_df.select("email").union(size_df.select("email")).count()

mattress_count = mattress_df.count()
size_count = size_df.count()

mattress_count + size_count == union_count

Out[12]: True

### Clean up classroom

And lastly, we'll clean up the classroom.

In [0]:
DA.cleanup()

Resetting the learning environment...
...dropping the database "da_sergio_salgado_4613_asp"...(0 seconds)
...removing the working directory "dbfs:/mnt/dbacademy-users/sergio.salgado@n.world/apache-spark-programming-with-databricks"...(0 seconds)

Validating the locally installed datasets...(5 seconds)


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>