# Working with Different Types of Data

Let's start by first loading some data.

The variable `data` holds the location of the data. Modify it as needed.

In [1]:
data = "gs://is843/notebooks/jupyter/data/"

In [2]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load(data + "retail-data/by-day/2010-12-01.csv")

df.createOrReplaceTempView("dfTable")

df.printSchema()
df.show(5)

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365| 

## Converting to Spark Types

The `lit()` function converts a Python value to its corresponding Spark representation. Here’s how we can convert a couple of different kinds of Python values to their respective Spark types:

In [3]:
from pyspark.sql.functions import lit

df.select(lit(5), lit("five"), lit(5.0))

DataFrame[5: int, five: string, 5.0: double]

No conversion function is needed in SQL, although note that SQL types the literal `5.0` as `decimal(2,1)` rather than `double`:

In [4]:
spark.sql("""
SELECT 5, "five", 5.0
""")

DataFrame[5: int, five: string, 5.0: decimal(2,1)]

## Working with Booleans

Booleans are essential when it comes to data analysis because they are the foundation for all filtering. Boolean statements consist of four elements: *and*, *or*, *true*, and *false*. We use these simple structures to build logical statements that evaluate to either *true* or *false*. These statements are often used as conditional requirements for when a row of data must either pass the test (evaluate to true) or else it will be filtered out.

Let’s use our retail dataset to explore working with Booleans. We can specify equality as well as less-than or greater-than:

In [5]:
from pyspark.sql.functions import col

df.where(col("InvoiceNo") != 536365)\
  .select("InvoiceNo", "Description")\
  .show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



In [6]:
df.where("InvoiceNo <> 536365").select("InvoiceNo", "Description").show(5, False)

+---------+-----------------------------+
|InvoiceNo|Description                  |
+---------+-----------------------------+
|536366   |HAND WARMER UNION JACK       |
|536366   |HAND WARMER RED POLKA DOT    |
|536367   |ASSORTED COLOUR BIRD ORNAMENT|
|536367   |POPPY'S PLAYHOUSE BEDROOM    |
|536367   |POPPY'S PLAYHOUSE KITCHEN    |
+---------+-----------------------------+
only showing top 5 rows



Although you can combine conditions explicitly with *and* if you like, they’re often easier to understand and to read if you specify them serially as chained `where` calls, which are implicitly combined with *and*. *or* conditions, by contrast, need to be specified within the same expression:

In [7]:
from pyspark.sql.functions import instr

priceFilter = col("UnitPrice") > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1  # instr(): Locate the position of the first occurrence of substr column in the given string.

df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()

+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|   Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+
|   536544|      DOT|DOTCOM POSTAGE|       1|2010-12-01 14:32:00|   569.77|      null|United Kingdom|
|   536592|      DOT|DOTCOM POSTAGE|       1|2010-12-01 17:06:00|   607.49|      null|United Kingdom|
+---------+---------+--------------+--------+-------------------+---------+----------+--------------+



Equivalent SQL:

```sql
SELECT * FROM dfTable 
WHERE StockCode in ("DOT") AND 
    (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)
```

Boolean expressions are not reserved for filters. You can also compute a Boolean column and then filter on it:

In [8]:
from pyspark.sql.functions import instr

DOTCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1

df.withColumn("isExpensive", DOTCodeFilter & (priceFilter | descripFilter))\
  .where("isExpensive")\
  .select("unitPrice", "isExpensive").show(5)

+---------+-----------+
|unitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



In [9]:
spark.sql("""
SELECT 
  UnitPrice, 
  (StockCode = 'DOT' AND (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)) as isExpensive
FROM dfTable
WHERE (StockCode = 'DOT' AND (UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1))
""").show()

+---------+-----------+
|UnitPrice|isExpensive|
+---------+-----------+
|   569.77|       true|
|   607.49|       true|
+---------+-----------+



Notice how we did not need to specify our filter as an expression and how we could use a column name without any extra work.

If you’re coming from a SQL background, all of these statements should seem quite familiar. Indeed, all of them can be expressed as a where clause. In fact, it’s often easier to just express filters as SQL statements than using the programmatic DataFrame interface and Spark SQL allows us to do this without paying any performance penalty. For example, the following statement uses SQL commands within `expr()`:

In [10]:
from pyspark.sql.functions import expr

df.withColumn("isExpensive", expr("NOT UnitPrice <= 250"))\
  .where("isExpensive")\
  .select("Description", "UnitPrice").show(5)

+--------------+---------+
|   Description|UnitPrice|
+--------------+---------+
|DOTCOM POSTAGE|   569.77|
|DOTCOM POSTAGE|   607.49|
+--------------+---------+



**WARNING**

If there is a null in your data, you’ll need to treat things a bit differently. Here’s how you can ensure that you perform a null-safe equivalence test:

In [11]:
df.where(col("Description").eqNullSafe("hello")).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
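
The difference between an ordinary comparison and `eqNullSafe` can be sketched in plain Python, using `None` as a stand-in for a SQL null. This is only an illustration of the three-valued logic, not Spark code; the helper names `sql_equal` and `eq_null_safe` are made up for this example.

```python
def sql_equal(a, b):
    """Ordinary SQL equality: any comparison involving null yields null."""
    if a is None or b is None:
        return None
    return a == b


def eq_null_safe(a, b):
    """Null-safe equality (SQL <=>): always yields True or False."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


for left, right in [("hello", "hello"), ("hello", None), (None, None)]:
    print(left, right, sql_equal(left, right), eq_null_safe(left, right))
```

A filter keeps only the rows whose condition is true, so rows for which the ordinary comparison yields null are dropped along with the rows for which it is false.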



## Working with Numbers

When working with big data, the second most common task you will do after filtering things is counting things. For the most part, we simply need to express our computation, and that should be valid assuming that we’re working with numerical data types.

To fabricate a contrived example, let’s imagine that we mis-recorded the quantity in our retail dataset and that the true quantity is equal to $(Current\_Quantity * Unit\_Price)^2 + 5$. This introduces our first numerical function, `pow()`, which raises a column to the given power:

In [12]:
from pyspark.sql.functions import expr, pow

fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5

df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



Notice that we were able to multiply our columns together because they were both numerical. Naturally we can add and subtract as necessary, as well. In fact, we can do all of this as a SQL expression, as well:

In [13]:
df.selectExpr(
  "CustomerId",
  "(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)

+----------+------------------+
|CustomerId|      realQuantity|
+----------+------------------+
|   17850.0|239.08999999999997|
|   17850.0|          418.7156|
+----------+------------------+
only showing top 2 rows



In SQL:
```sql
SELECT customerId, (POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity
FROM dfTable
```

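Another common numerical task is rounding. If you only need a whole number, you can often cast the value to an integer, keeping in mind that a cast truncates the fractional part rather than rounding to the nearest value. Spark also has more detailed functions for rounding explicitly and to a certain level of precision. When a value lies exactly between two numbers, `round` rounds half up, whereas `bround` rounds half to even (banker’s rounding). In the following example, both functions round 2.5 to zero decimal places: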

In [14]:
from pyspark.sql.functions import lit, round, bround

df.select(round(lit(2.5)), bround(lit(2.5))).show(1)  # numeric literal instead of the string "2.5"

+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
|          3.0|           2.0|
+-------------+--------------+
only showing top 1 row



In SQL
```sql
SELECT round(2.5), bround(2.5)
```
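
`round` and `bround` differ only in how they break ties: `round` rounds half up, while `bround` rounds half to even. The difference can be sketched in plain Python with the `decimal` module. The helper names below are hypothetical, and the sketch illustrates the two rounding modes rather than Spark's implementation.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(value):
    """Ties move away from zero, as with Spark's round()."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def round_half_even(value):
    """Ties move to the nearest even digit, as with Spark's bround()."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)


for text in ["0.5", "1.5", "2.5", "3.5"]:
    print(text, round_half_up(text), round_half_even(text))
```

Half-even rounding avoids the systematic upward drift that half-up rounding introduces when many tied values are summed.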

Another numerical task is to compute the correlation of two columns. For example, we can see the Pearson correlation coefficient for two columns to see if cheaper things are typically bought in greater quantities. We can do this through a function as well as through the DataFrame statistic methods:

In [15]:
df.stat.corr("Quantity", "UnitPrice")

-0.04112314436835551

In [16]:
from pyspark.sql.functions import corr

df.select(corr("Quantity", "UnitPrice")).show()

+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
|     -0.04112314436835551|
+-------------------------+



In SQL
```sql
SELECT corr(Quantity, UnitPrice) FROM dfTable
```
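
For intuition about what `corr` computes, here is a minimal plain-Python sketch of the Pearson correlation coefficient. The function name `pearson` is made up for this example, and the five sample rows are copied from the outputs shown earlier (three rows of invoice 536365 and the two `DOT` rows), so the value it prints is not the correlation of the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


quantities = [6, 6, 8, 1, 1]
unit_prices = [2.55, 3.39, 2.75, 569.77, 607.49]
print(pearson(quantities, unit_prices))
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0, such as the one Spark reports for the full day of data, indicates almost none.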

Another common task is to compute summary statistics for a column or set of columns. We can use the `describe` method to achieve exactly this. This will take all numerical and string columns and calculate the count, mean, standard deviation, min, and max:

In [17]:
df.describe(['Quantity', 'UnitPrice', 'Country']).show()

+-------+------------------+------------------+--------------+
|summary|          Quantity|         UnitPrice|       Country|
+-------+------------------+------------------+--------------+
|  count|              3108|              3108|          3108|
|   mean| 8.627413127413128| 4.151946589446603|          null|
| stddev|26.371821677029203|15.638659854603892|          null|
|    min|               -24|               0.0|     Australia|
|    max|               600|            607.49|United Kingdom|
+-------+------------------+------------------+--------------+



If you need these exact numbers, you can also perform this as an aggregation yourself by importing the functions and applying them to the columns that you need:

In [18]:
from pyspark.sql.functions import count, mean, stddev_pop, min, max
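
The import cell above stops short of applying the functions. As a plain-Python sketch of the same five statistics, the block below uses the standard `statistics` module on the first five `Quantity` values shown in this notebook. One assumption worth verifying against your Spark version: `describe` reports the sample standard deviation, whereas `stddev_pop` is the population standard deviation, so the two will differ slightly.

```python
import statistics

quantities = [6, 6, 8, 6, 6]  # first five Quantity values from the dataset

summary = {
    "count": len(quantities),
    "mean": statistics.mean(quantities),
    "stddev_samp": statistics.stdev(quantities),   # divides by n - 1
    "stddev_pop": statistics.pstdev(quantities),   # divides by n
    "min": min(quantities),
    "max": max(quantities),
}
print(summary)
```

Note that importing `min` and `max` from `pyspark.sql.functions` shadows the Python built-ins of the same name for the rest of the notebook session.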

There are a number of statistical functions available in the `StatFunctions` Package (accessible using stat as we see in the code block below). These are DataFrame methods that you can use to calculate a variety of different things. For instance, you can calculate either exact or approximate quantiles of your data using the `approxQuantile` method:

In [19]:
colName = "UnitPrice"
quantileProbs = [0.5]
relError = 0.05

df.stat.approxQuantile(colName, quantileProbs, relError)

[2.51]
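
The relative error controls how far the returned value's rank may stray from the exact quantile. As documented for `approxQuantile`, for `N` rows, probability `p`, and error `err`, the rank of the result lies between `floor((p - err) * N)` and `ceil((p + err) * N)`. The sketch below just evaluates that bound; the function name and the clamping to `[0, N]` are assumptions added for illustration.

```python
import math


def acceptable_ranks(probability, relative_error, n):
    """Inclusive range of ranks an approximate quantile may fall in."""
    low = math.floor((probability - relative_error) * n)
    high = math.ceil((probability + relative_error) * n)
    return max(low, 0), min(high, n)


# 3108 is the row count reported by describe() above.
print(acceptable_ranks(0.5, 0.05, 3108))
```

Setting the relative error to 0 collapses the range to a single rank, which corresponds to computing the exact quantile at a higher cost.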

Finding frequent items for columns:

In [20]:
df.stat.freqItems(["StockCode", "Quantity"]).show()

+--------------------+--------------------+
| StockCode_freqItems|  Quantity_freqItems|
+--------------------+--------------------+
|[90214E, 20728, 2...|[200, 128, 23, 32...|
+--------------------+--------------------+



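As a last note, we can also add a unique ID to each row by using the function `monotonically_increasing_id`. This function generates a unique value for each row, starting with 0. The values are guaranteed to be unique and increasing, but not consecutive, because the partition index is encoded in the ID: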

In [21]:
from pyspark.sql.functions import monotonically_increasing_id

df.select(monotonically_increasing_id(), "StockCode", "Quantity", "UnitPrice").show(5)

+-----------------------------+---------+--------+---------+
|monotonically_increasing_id()|StockCode|Quantity|UnitPrice|
+-----------------------------+---------+--------+---------+
|                            0|   85123A|       6|     2.55|
|                            1|    71053|       6|     3.39|
|                            2|   84406B|       8|     2.75|
|                            3|   84029G|       6|     3.39|
|                            4|   84029E|       6|     3.39|
+-----------------------------+---------+--------+---------+
only showing top 5 rows
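
The IDs above look consecutive only because the rows shown come from the first partition. According to the function's documentation, the current implementation puts the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below reproduces that layout in plain Python; `sketch_row_id` is a hypothetical helper, not a Spark API.

```python
def sketch_row_id(partition_index, row_in_partition):
    """Combine a partition index and a per-partition counter into one ID."""
    return (partition_index << 33) + row_in_partition


print([sketch_row_id(0, i) for i in range(3)])  # first partition
print([sketch_row_id(1, i) for i in range(3)])  # second partition
```

Because of the large jump between partitions, these IDs are suitable as unique keys but not as row numbers.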



## Working with Strings

The `initcap` function will capitalize every word in a given string, with the first letter of each word in uppercase, all other letters in lowercase:

In [22]:
from pyspark.sql.functions import initcap

df.select(initcap(col("Description"))).show(5, False)

+-----------------------------------+
|initcap(Description)               |
+-----------------------------------+
|White Hanging Heart T-light Holder |
|White Metal Lantern                |
|Cream Cupid Hearts Coat Hanger     |
|Knitted Union Flag Hot Water Bottle|
|Red Woolly Hottie White Heart.     |
+-----------------------------------+
only showing top 5 rows
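
The output shows that `initcap` treats only whitespace as a word boundary, which is why `T-LIGHT` becomes `T-light`. A rough plain-Python equivalent, assuming words are separated by single spaces, is sketched below; `initcap_like` is a made-up name.

```python
def initcap_like(text):
    """Capitalize the first letter of each space-separated word."""
    return " ".join(word.capitalize() for word in text.split(" "))


description = "WHITE HANGING HEART T-LIGHT HOLDER"
print(initcap_like(description))
print(description.title())  # Python's title() also capitalizes after the hyphen
```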



You can also convert strings to uppercase and lowercase:

In [23]:
from pyspark.sql.functions import lower, upper

df.select(col("Description"),
    lower(col("Description")),
    upper(col("Description"))).show(2, False)

+----------------------------------+----------------------------------+----------------------------------+
|Description                       |lower(Description)                |upper(Description)                |
+----------------------------------+----------------------------------+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|white hanging heart t-light holder|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |white metal lantern               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+----------------------------------+
only showing top 2 rows



In SQL
```sql
SELECT Description, lower(Description), upper(Description) FROM dfTable
```

Another trivial task is adding or removing spaces around a string. You can do this by using `lpad`, `ltrim`, `rpad`, `rtrim`, and `trim`:

In [24]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

df.select(
    ltrim(lit("    HELLO    ")).alias("ltrim"),
    rtrim(lit("    HELLO    ")).alias("rtrim"),
    trim(lit("    HELLO    ")).alias("trim"),
    lpad(lit("HELLO"), 3, " ").alias("lp"),
    rpad(lit("HELLO"), 10, " ").alias("rp")).show(1)

+---------+---------+-----+---+----------+
|    ltrim|    rtrim| trim| lp|        rp|
+---------+---------+-----+---+----------+
|HELLO    |    HELLO|HELLO|HEL|HELLO     |
+---------+---------+-----+---+----------+
only showing top 1 row



In SQL
```sql
SELECT
  ltrim('    HELLLOOOO  '),
  rtrim('    HELLLOOOO  '),
  trim('    HELLLOOOO  '),
  lpad('HELLOOOO  ', 3, ' '),
  rpad('HELLOOOO  ', 10, ' ')
FROM dfTable
```

Note that if `lpad` or `rpad` is given a length smaller than the length of the string, it truncates the string, always removing characters from the right side.
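
The padding and truncation behavior can be sketched in plain Python. The helpers `lpad_like` and `rpad_like` are hypothetical names that mimic the behavior seen in the output above; they are not Spark functions.

```python
def lpad_like(text, length, pad):
    """Pad on the left up to `length`, or truncate from the right if too long."""
    if len(text) >= length:
        return text[:length]
    missing = length - len(text)
    return (pad * missing)[:missing] + text


def rpad_like(text, length, pad):
    """Pad on the right up to `length`, or truncate from the right if too long."""
    if len(text) >= length:
        return text[:length]
    missing = length - len(text)
    return text + (pad * missing)[:missing]


print(repr(lpad_like("HELLO", 3, " ")))
print(repr(rpad_like("HELLO", 10, " ")))
```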

## Regular Expressions

Probably one of the most frequently performed tasks is searching for the existence of one string in another or replacing all mentions of a string with another value. This is often done with a tool called *regular expressions* that exists in many programming languages.

Spark takes advantage of the complete power of Java regular expressions. There are two key functions in Spark that you’ll need in order to perform regular expression tasks: `regexp_extract` and `regexp_replace`. These functions extract values and replace values, respectively.

Let’s explore how to use the `regexp_replace` function to substitute color names in our description column:

In [25]:
from pyspark.sql.functions import regexp_replace

regex_string = "BLACK|WHITE|RED|GREEN|BLUE"

df.select(
  regexp_replace(col("Description"), regex_string, "COLOR").alias("color_clean"),
  col("Description")).show(2)

+--------------------+--------------------+
|         color_clean|         Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
only showing top 2 rows



In SQL
```sql
SELECT
  regexp_replace(Description, 'BLACK|WHITE|RED|GREEN|BLUE', 'COLOR') as
  color_clean, Description
FROM dfTable
```

Another task might be to replace given characters with other characters. Building this as a regular expression could be tedious, so Spark also provides the `translate` function to replace these values. This is done at the character level and will replace all instances of a character with the indexed character in the replacement string:

In [26]:
from pyspark.sql.functions import translate

df.select(translate(col("Description"), "LEET", "1337"),col("Description"))\
  .show(2, False)

+----------------------------------+----------------------------------+
|translate(Description, LEET, 1337)|Description                       |
+----------------------------------+----------------------------------+
|WHI73 HANGING H3AR7 7-1IGH7 HO1D3R|WHITE HANGING HEART T-LIGHT HOLDER|
|WHI73 M37A1 1AN73RN               |WHITE METAL LANTERN               |
+----------------------------------+----------------------------------+
only showing top 2 rows



In SQL
```sql
SELECT translate(Description, 'LEET', '1337'), Description FROM dfTable
```
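
Python strings offer a comparable character-level mapping through `str.maketrans` and `str.translate`, which makes the indexed replacement easy to see. This is a sketch for equal-length mapping strings only; `translate_like` is a made-up helper name.

```python
def translate_like(text, matching, replace):
    """Replace each character of `matching` with the character at the same index in `replace`."""
    # str.maketrans requires both strings to have the same length.
    return text.translate(str.maketrans(matching, replace))


print(translate_like("WHITE METAL LANTERN", "LEET", "1337"))
```

Here `L` maps to `1`, `E` to `3`, and `T` to `7`; every other character passes through unchanged.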

We can also perform something similar, like pulling out the first mentioned color:

In [27]:
from pyspark.sql.functions import regexp_extract

extract_str = "(BLACK|WHITE|RED|GREEN|BLUE)"

df.select(
     regexp_extract(col("Description"), extract_str, 1).alias("color_clean"),
     col("Description")).show(5, False)

+-----------+-----------------------------------+
|color_clean|Description                        |
+-----------+-----------------------------------+
|WHITE      |WHITE HANGING HEART T-LIGHT HOLDER |
|WHITE      |WHITE METAL LANTERN                |
|           |CREAM CUPID HEARTS COAT HANGER     |
|           |KNITTED UNION FLAG HOT WATER BOTTLE|
|RED        |RED WOOLLY HOTTIE WHITE HEART.     |
+-----------+-----------------------------------+
only showing top 5 rows



In SQL
```sql
SELECT regexp_extract(Description, '(BLACK|WHITE|RED|GREEN|BLUE)', 1),
  Description
FROM dfTable
```
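
The same two operations can be tried out quickly with Python's `re` module before running them on a cluster. Keep in mind that Spark uses Java regular expressions, whose syntax overlaps with Python's for simple patterns like this one but is not identical in general. The helper names are made up for this sketch.

```python
import re

COLOR_PATTERN = re.compile(r"(BLACK|WHITE|RED|GREEN|BLUE)")


def first_color(description):
    """Return the first color mentioned, or an empty string if there is none."""
    match = COLOR_PATTERN.search(description)
    return match.group(1) if match else ""


def mask_colors(description):
    """Replace every color mention with the word COLOR."""
    return COLOR_PATTERN.sub("COLOR", description)


print(first_color("RED WOOLLY HOTTIE WHITE HEART."))
print(mask_colors("WHITE METAL LANTERN"))
```

Because the pattern has no word boundaries, it also matches inside longer words that happen to contain a color name; adding `\b` on both sides of the group restricts it to whole words.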

Sometimes, rather than extracting values, we simply want to check for their existence. We can do this with the `instr` function, which returns the 1-based position of the first occurrence of a substring, or 0 if the substring is absent. Comparing that position with 1 yields a *Boolean* declaring whether the value you specify is in the column’s string:

In [28]:
from pyspark.sql.functions import instr

containsBlack = instr(col("Description"), "BLACK") >= 1
containsWhite = instr(col("Description"), "WHITE") >= 1

df.withColumn("hasSimpleColor", containsBlack | containsWhite)\
  .where("hasSimpleColor")\
  .select("Description").show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows



In SQL
```sql
SELECT Description FROM dfTable
WHERE instr(Description, 'BLACK') >= 1 OR instr(Description, 'WHITE') >= 1
```

This is trivial with just two values, but it becomes more complicated when there are more values.

Let’s work through this in a more rigorous way and take advantage of Spark’s ability to accept a dynamic number of arguments. When we convert a list of values into a set of arguments and pass them into a function, we use a language feature called varargs. Using this feature, we can effectively unravel an array of arbitrary length and pass it as arguments to a function. 

We can also do this quite easily in Python. In this case, we’re going to use a different function, `locate`, that returns the 1-based integer position of the substring (0 if it is absent). We then cast that to a Boolean, so that any nonzero position becomes true, and use it as the same basic feature:

In [29]:
from pyspark.sql.functions import expr, locate
simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
    return locate(color_string.upper(), column)\
        .cast("boolean")\
        .alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*")) # has to be a Column type

df.select(*selectedColumns).where(expr("is_white OR is_red"))\
  .select("Description").show(3, False)

+----------------------------------+
|Description                       |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN               |
|RED WOOLLY HOTTIE WHITE HEART.    |
+----------------------------------+
only showing top 3 rows



In [30]:
df.select(*selectedColumns)

DataFrame[is_black: boolean, is_white: boolean, is_red: boolean, is_green: boolean, is_blue: boolean, InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
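
The `*selectedColumns` syntax is ordinary Python argument unpacking, not something specific to Spark. The sketch below shows the mechanism with a stand-in function; `select_like` is hypothetical and simply returns the arguments it receives.

```python
def select_like(*columns):
    """Stand-in for a variadic API such as DataFrame.select."""
    return list(columns)


simple_colors = ["black", "white", "red", "green", "blue"]
flag_names = ["is_" + color for color in simple_colors]
flag_names.append("*")

# The leading * unpacks the list into separate positional arguments.
selected = select_like(*flag_names)
print(selected)

# Without the *, the whole list arrives as a single argument.
print(select_like(flag_names))
```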

## Working with Dates and Timestamps

Let’s begin with the basics and get the current date and the current timestamps:

In [31]:
from pyspark.sql.functions import current_date, current_timestamp

dateDF = spark.range(10)\
  .withColumn("today", current_date())\
  .withColumn("now", current_timestamp())

dateDF.createOrReplaceTempView("dateTable")

In [32]:
dateDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)



In [33]:
dateDF.show(5, False)

+---+----------+-----------------------+
|id |today     |now                    |
+---+----------+-----------------------+
|0  |2020-03-17|2020-03-17 18:31:34.946|
|1  |2020-03-17|2020-03-17 18:31:34.946|
|2  |2020-03-17|2020-03-17 18:31:34.946|
|3  |2020-03-17|2020-03-17 18:31:34.946|
|4  |2020-03-17|2020-03-17 18:31:34.946|
+---+----------+-----------------------+
only showing top 5 rows



Notice that `current_timestamp()` records the time at which the action (here, `show()`) runs, not the time at which the transformation that created `dateDF` was defined.

Also, notice that the date/time is being recorded as GMT. Always keep in mind which time zone is being used. If necessary, you can set a session-local time zone through the `spark.sql.session.timeZone` SQL configuration.

Now that we have a simple DataFrame to work with, let’s add and subtract five days from today. These functions take a column and then the number of days to either add or subtract as the arguments:

In [34]:
from pyspark.sql.functions import date_add, date_sub

dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)

+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
|        2020-03-12|        2020-03-22|
+------------------+------------------+
only showing top 1 row



In SQL
```sql
SELECT date_sub(today, 5), date_add(today, 5) FROM dateTable
```
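
For comparison, the same arithmetic in plain Python uses `datetime.timedelta`. The date below is the value of `today` shown in the output above, hard-coded so that the sketch is reproducible.

```python
from datetime import date, timedelta

today = date(2020, 3, 17)  # value of `today` in the notebook output
five_days_ago = today - timedelta(days=5)
five_days_ahead = today + timedelta(days=5)

print(five_days_ago, five_days_ahead)
```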

Another common task is to take a look at the difference between two dates. We can do this with the `datediff` function that will return the number of days in between two dates. Most often we just care about the days, and because the number of days varies from month to month, there also exists a function, `months_between`, that gives you the number of months between two dates:

In [35]:
from pyspark.sql.functions import datediff, months_between, to_date

dateDF.withColumn("week_ago", date_sub(col("today"), 7))\
  .select("week_ago", "today", datediff(col("week_ago"), col("today"))).show(1)

+----------+----------+-------------------------+
|  week_ago|     today|datediff(week_ago, today)|
+----------+----------+-------------------------+
|2020-03-10|2020-03-17|                       -7|
+----------+----------+-------------------------+
only showing top 1 row



In [36]:
dateDF.select(
    to_date(lit("2016-01-01")).alias("start"),
    to_date(lit("2017-05-22")).alias("end"))\
  .select("start", "end", months_between(col("start"), col("end"))).show(1)

+----------+----------+--------------------------+
|     start|       end|months_between(start, end)|
+----------+----------+--------------------------+
|2016-01-01|2017-05-22|              -16.67741935|
+----------+----------+--------------------------+
only showing top 1 row



In SQL
```sql
SELECT to_date('2016-01-01'), months_between('2016-01-01', '2017-01-01'),
datediff('2016-01-01', '2017-01-01')
FROM dateTable
```
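
The fractional result of `months_between` is easier to interpret once the rule behind it is written out. The sketch below follows the behavior described in Spark's documentation for date inputs: whole months when both dates fall on the same day of the month or are both month ends, and otherwise a difference based on 31-day months, rounded to 8 digits. The function names are made up, and the sketch ignores time-of-day components.

```python
import calendar
from datetime import date


def datediff_like(end, start):
    """Number of days from start to end (negative if end is earlier)."""
    return (end - start).days


def is_last_day_of_month(d):
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between_like(date1, date2):
    """Months from date2 to date1, assuming 31-day months for the remainder."""
    whole_months = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    same_day = date1.day == date2.day
    both_month_ends = is_last_day_of_month(date1) and is_last_day_of_month(date2)
    if same_day or both_month_ends:
        return float(whole_months)
    return round(whole_months + (date1.day - date2.day) / 31.0, 8)


print(datediff_like(date(2020, 3, 10), date(2020, 3, 17)))
print(months_between_like(date(2016, 1, 1), date(2017, 5, 22)))
```

In both functions the sign is negative when the first argument is the earlier date, which matches the `-7` and `-16.67741935` in the outputs above.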

Notice that we introduced a new function: the `to_date` function. The `to_date` function allows you to convert a string to a date, optionally with a specified format. We specify our format in the [Java SimpleDateFormat](https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) which will be important to reference if you use this function:

In [37]:
from pyspark.sql.functions import to_date, lit

spark.range(5).withColumn("date", lit("2017-01-01"))\
  .select(to_date(col("date"))).show(1)

+---------------+
|to_date(`date`)|
+---------------+
|     2017-01-01|
+---------------+
only showing top 1 row



Spark will not throw an error if it cannot parse the date; rather, it will just return null. This can be a bit tricky in larger pipelines because you might be expecting your data in one format and getting it in another. To illustrate, let’s take a look at the date format that has switched from year-month-day to year-day-month. Spark will fail to parse this date and silently return null instead:

In [38]:
dateDF.select(to_date(lit("2016-20-12")),to_date(lit("2017-12-11"))).show(1)

+---------------------+---------------------+
|to_date('2016-20-12')|to_date('2017-12-11')|
+---------------------+---------------------+
|                 null|           2017-12-11|
+---------------------+---------------------+
only showing top 1 row



We find this to be an especially tricky situation for bugs because some dates might match the correct format, whereas others do not. In the previous example, notice how the second date appears as December 11th instead of the correct day, November 12th. Spark doesn’t throw an error because it cannot know whether the days are mixed up or that specific row is incorrect.
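
Both failure modes, a value that cannot be parsed and a value that parses to the wrong date, can be reproduced in plain Python. The sketch uses `datetime.strptime`, whose `%Y-%m-%d` directives differ from the Java-style patterns Spark expects, and the helper `parse_date_or_none` is a made-up name that imitates Spark's habit of returning null instead of raising an error.

```python
from datetime import datetime


def parse_date_or_none(text, pattern):
    """Return a date, or None when the text does not fit the pattern."""
    try:
        return datetime.strptime(text, pattern).date()
    except ValueError:
        return None


YEAR_MONTH_DAY = "%Y-%m-%d"
YEAR_DAY_MONTH = "%Y-%d-%m"

# Month 20 does not exist, so parsing fails and we get None.
print(parse_date_or_none("2016-20-12", YEAR_MONTH_DAY))
# This parses without complaint, but as December 11 rather than November 12.
print(parse_date_or_none("2017-12-11", YEAR_MONTH_DAY))
# With the pattern the data was actually written in, the intended date appears.
print(parse_date_or_none("2017-12-11", YEAR_DAY_MONTH))
```

Counting the `None` results after parsing is a cheap check for the first failure mode; only an explicit, correct format protects against the second.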

Let’s fix this pipeline, step by step, and come up with a robust way to avoid these issues entirely. The first step is to remember that we need to specify our date format according to the Java SimpleDateFormat standard.

We will use two functions to fix this: `to_date` and `to_timestamp`. In PySpark both accept an optional format argument. Because our data does not follow the default layout, we pass the format explicitly:

In [39]:
from pyspark.sql.functions import to_date

dateFormat = "yyyy-dd-MM"

cleanDateDF = spark.range(1).select(
    to_date(lit("2017-12-11"), dateFormat).alias("date"),
    to_date(lit("2017-20-12"), dateFormat).alias("date2"))

cleanDateDF.show()

cleanDateDF.createOrReplaceTempView("dateTable2")

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+



In SQL
```sql
SELECT to_date(date, 'yyyy-dd-MM'), to_date(date2, 'yyyy-dd-MM'), to_date(date)
FROM dateTable2
```

Now let’s look at an example of `to_timestamp`, again with an explicit format:

In [40]:
from pyspark.sql.functions import to_timestamp

cleanDateDF.select(to_timestamp(col("date"), "yyyy-dd-MM")).show()

+----------------------------------+
|to_timestamp(`date`, 'yyyy-dd-MM')|
+----------------------------------+
|               2017-11-12 00:00:00|
+----------------------------------+



After we have our date or timestamp in the correct format and type, comparing between them is actually quite easy. We just need to be sure to either use a date/timestamp type or specify our string according to the right format of *yyyy-MM-dd* if we’re comparing a date:

In [41]:
cleanDateDF.where(col("date2") > lit("2017-12-12")).show()

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+



One minor point is that we can also set this as a string, which Spark parses to a literal:

In [42]:
cleanDateDF.filter(col("date2") > "2017-12-12").show()

+----------+----------+
|      date|     date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+



It is important to point out that a good practice is to parse the values explicitly instead of relying on implicit conversions.
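
The reason the *yyyy-MM-dd* layout matters for string comparisons is that only in that layout does alphabetical order coincide with chronological order. A small plain-Python sketch with made-up sample dates shows what goes wrong otherwise; `parse_day_first` is a hypothetical helper.

```python
from datetime import date

iso_strings = ["2017-12-20", "2017-02-03", "2016-12-31"]
day_first_strings = ["20-12-2017", "03-02-2017", "31-12-2016"]


def parse_day_first(text):
    """Parse a dd-MM-yyyy string into a date."""
    day, month, year = (int(part) for part in text.split("-"))
    return date(year, month, day)


print(sorted(iso_strings))                             # chronological
print(sorted(day_first_strings))                       # not chronological
print(sorted(day_first_strings, key=parse_day_first))  # chronological again
```

Parsing to a real date type first, as in the last line, removes the dependence on the string layout altogether.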

## Working with Nulls in Data

As a best practice, you should always use nulls to represent missing or empty data in your DataFrames. Spark can optimize working with null values more than it can if you use empty strings or other values. The primary way of interacting with null values, at DataFrame scale, is to use the `.na` subpackage on a DataFrame. There are also several functions for performing operations and explicitly specifying how Spark should handle null values.

**WARNING**

Nulls are a challenging part of all programming, and Spark is no exception. Being explicit is always better than being implicit when handling null values. When we declare a column as not nullable, that is not actually enforced. To reiterate, when you define a schema in which all columns are declared to not have null values, Spark will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize for handling that column. If you have null values in columns that should not have null values, you can get an incorrect result or see strange exceptions that can be difficult to debug.

There are two things you can do with null values: you can explicitly drop nulls or you can fill them with a value (globally or on a per-column basis). Let’s experiment with each of these now.

### Coalesce

Spark includes a function to allow you to select the first non-null value from a set of columns by using the `coalesce` function. In this case, there are no null values, so it simply returns the first column:

In [43]:
from pyspark.sql.functions import coalesce

df.select(coalesce(col("Description"), col("CustomerId"))).show()

+---------------------------------+
|coalesce(Description, CustomerId)|
+---------------------------------+
|             WHITE HANGING HEA...|
|              WHITE METAL LANTERN|
|             CREAM CUPID HEART...|
|             KNITTED UNION FLA...|
|             RED WOOLLY HOTTIE...|
|             SET 7 BABUSHKA NE...|
|             GLASS STAR FROSTE...|
|             HAND WARMER UNION...|
|             HAND WARMER RED P...|
|             ASSORTED COLOUR B...|
|             POPPY'S PLAYHOUSE...|
|             POPPY'S PLAYHOUSE...|
|             FELTCRAFT PRINCES...|
|             IVORY KNITTED MUG...|
|             BOX OF 6 ASSORTED...|
|             BOX OF VINTAGE JI...|
|             BOX OF VINTAGE AL...|
|             HOME BUILDING BLO...|
|             LOVE BUILDING BLO...|
|             RECIPE BOX WITH M...|
+---------------------------------+
only showing top 20 rows
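
The logic of `coalesce` for a single row can be sketched in a few lines of plain Python, with `None` standing in for null. `coalesce_like` is a made-up name for this illustration.

```python
def coalesce_like(*values):
    """Return the first value that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


print(coalesce_like("WHITE METAL LANTERN", 17850.0))  # first value is present
print(coalesce_like(None, 17850.0))                   # falls through to the second
print(coalesce_like(None, None))                      # nothing to fall back on
```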



### `ifnull`, `nullif`, `nvl`, and `nvl2`

There are several other SQL functions that you can use to achieve similar things. 

The `ifnull` and `nvl` functions are synonyms: ifnull(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

In [44]:
spark.sql("""
SELECT
  ifnull(null, 'expr2'),
  ifnull('expr1', 'expr2'),
  nvl(null, 'expr2')
FROM dfTable LIMIT 1
""").show()

+---------------------+------------------------+------------------+
|ifnull(NULL, 'expr2')|ifnull('expr1', 'expr2')|nvl(NULL, 'expr2')|
+---------------------+------------------------+------------------+
|                expr2|                   expr1|             expr2|
+---------------------+------------------------+------------------+



`nullif`: `nullif(expr1, expr2)` returns null if `expr1` equals `expr2`, or `expr1` otherwise.

In [45]:
spark.sql("""
SELECT
  nullif('expr1', 'expr1'),
  nullif('expr1', 'expr2')
FROM dfTable LIMIT 1
""").show()

+------------------------+------------------------+
|nullif('expr1', 'expr1')|nullif('expr1', 'expr2')|
+------------------------+------------------------+
|                    null|                   expr1|
+------------------------+------------------------+



`nvl2`: `nvl2(expr1, expr2, expr3)` returns `expr2` if `expr1` is not null, or `expr3` otherwise.

In [46]:
spark.sql("""
SELECT
  nvl2('expr1', 'expr2', "expr3"),
  nvl2(null, 'expr2', "expr3")
FROM dfTable LIMIT 1
""").show()

+-------------------------------+----------------------------+
|nvl2('expr1', 'expr2', 'expr3')|nvl2(NULL, 'expr2', 'expr3')|
+-------------------------------+----------------------------+
|                          expr2|                       expr3|
+-------------------------------+----------------------------+



Naturally, we can use these in select expressions on DataFrames, as well.
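
To keep the four definitions straight, here is a small plain-Python model of their semantics, again with `None` standing in for a SQL null. These helpers are an illustrative sketch written for this notebook, not Spark code; the calls mirror the literals used in the SQL cells above.

```python
def ifnull(expr1, expr2):
    """Return expr2 if expr1 is null, or expr1 otherwise."""
    return expr2 if expr1 is None else expr1


# nvl is a synonym for ifnull.
nvl = ifnull


def nullif(expr1, expr2):
    """Return null if expr1 equals expr2, or expr1 otherwise."""
    return None if expr1 == expr2 else expr1


def nvl2(expr1, expr2, expr3):
    """Return expr2 if expr1 is not null, or expr3 otherwise."""
    return expr2 if expr1 is not None else expr3


print(ifnull(None, "expr2"), ifnull("expr1", "expr2"), nvl(None, "expr2"))
print(nullif("expr1", "expr1"), nullif("expr1", "expr2"))
print(nvl2("expr1", "expr2", "expr3"), nvl2(None, "expr2", "expr3"))
```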

### `isNull`

`isNull()` is a column method that returns true if the value is null, or false otherwise.

In [47]:
df.where(col("UnitPrice").isNull()).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+



In [48]:
df.where(col("CustomerID").isNull()).show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536414|    22139|                null|      56|2010-12-01 11:52:00|      0.0|      null|United Kingdom|
|   536544|    21773|DECORATIVE ROSE B...|       1|2010-12-01 14:32:00|     2.51|      null|United Kingdom|
|   536544|    21774|DECORATIVE CATS B...|       2|2010-12-01 14:32:00|     2.51|      null|United Kingdom|
|   536544|    21786|  POLKADOT RAIN HAT |       4|2010-12-01 14:32:00|     0.85|      null|United Kingdom|
|   536544|    21787|RAIN PONCHO RETRO...|       2|2010-12-01 14:32:00|     1.66|      null|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



Let's count how many null values exist in each column, using a list comprehension and the `isNull` method. Note that this runs one Spark job per column:

In [49]:
[(c, df.where(col(c).isNull()).count()) for c in df.columns]

[('InvoiceNo', 0),
 ('StockCode', 0),
 ('Description', 10),
 ('Quantity', 0),
 ('InvoiceDate', 0),
 ('UnitPrice', 0),
 ('CustomerID', 1140),
 ('Country', 0)]

In [50]:
print("DataFrame df consists of {} records, out of which {} records have a missing CustomerID, {} have a missing Description, and {} are missing both!"\
      .format(df.count(),
             df.where(col("CustomerID").isNull()).count(),
             df.where(col("Description").isNull()).count(),
             df.where(col("CustomerID").isNull() & col("Description").isNull()).count()))

DataFrame df consists of 3108 records, out of which 1140 records have a missing CustomerID, 10 have a missing Description, and 10 are missing both!


In [51]:
df.where(col("CustomerID").isNull() & col("Description").isNull()).show()

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   536414|    22139|       null|      56|2010-12-01 11:52:00|      0.0|      null|United Kingdom|
|   536545|    21134|       null|       1|2010-12-01 14:32:00|      0.0|      null|United Kingdom|
|   536546|    22145|       null|       1|2010-12-01 14:33:00|      0.0|      null|United Kingdom|
|   536547|    37509|       null|       1|2010-12-01 14:33:00|      0.0|      null|United Kingdom|
|   536549|   85226A|       null|       1|2010-12-01 14:34:00|      0.0|      null|United Kingdom|
|   536550|    85044|       null|       1|2010-12-01 14:34:00|      0.0|      null|United Kingdom|
|   536552|    20950|       null|       1|2010-12-01 14:34:00|      0.0|      null|United Kingdom|
|   536553

### Negating conditions within a filter
We can use `~` within `where()` to negate a condition. For instance, combining it with `isNull()` keeps only the rows whose values are not null (the column method `isNotNull()` expresses the same thing directly):

In [52]:
df.where(~col("UnitPrice").isNull()).show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



### `drop`

The simplest function is `drop`, which removes rows that contain nulls. The default is to drop any row in which any value is null:

In [53]:
df.na.drop()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

This is equivalent to:
```python 
df.na.drop("any")
```

In SQL, we have to do this column by column:

```sql
SELECT * FROM dfTable WHERE Description IS NOT NULL
```

Specifying "`any`" as an argument drops a row if any of the values are null. 

Let's check and see how many rows were actually dropped:

In [54]:
df.count() - df.na.drop().count()

1140

This is what we expected based on the summary above: 1140 rows have a null CustomerID, and the 10 rows with a null Description are among them.

Using “all” drops the row only if all values are `null` or `NaN` for that row:

In [55]:
df.na.drop("all")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

Since we don't have any rows in which every column is null, this doesn't drop anything:

In [56]:
df.na.drop("all").count()

3108

We can also apply this to certain sets of columns by passing in a list of column names through `subset`:

In [57]:
df.na.drop("all", subset=["CustomerID", "Description"])

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

The following code shows that the above command will drop 10 records that have both *CustomerID* and *Description* missing.

In [58]:
df.count() - df.na.drop("all", subset=["CustomerID", "Description"]).count()

10
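
The interaction between `"any"`, `"all"`, and `subset` is easy to mix up, so here is a minimal plain-Python model of the rule. The `drop_nulls` helper and the three sample rows are assumptions made for illustration; they are not Spark's implementation and they ignore `NaN` handling.

```python
def drop_nulls(rows, how="any", subset=None):
    """Return the rows that survive a null check.

    how="any" drops a row if any inspected column is None;
    how="all" drops it only if every inspected column is None.
    subset limits which columns are inspected (default: all of them).
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        if not check(row[c] is None for c in columns):
            kept.append(row)
    return kept


# Hypothetical rows modelled on the retail data; None stands in for null.
rows = [
    {"InvoiceNo": "536365", "Description": "WHITE METAL LANTERN", "CustomerID": 17850.0},
    {"InvoiceNo": "536544", "Description": "POLKADOT RAIN HAT", "CustomerID": None},
    {"InvoiceNo": "536414", "Description": None, "CustomerID": None},
]

print(len(drop_nulls(rows)))                                        # "any" over all columns
print(len(drop_nulls(rows, "all")))                                 # InvoiceNo is never null
print(len(drop_nulls(rows, "all", ["CustomerID", "Description"])))  # both must be null to drop
```

In this toy data the three calls keep 1, 3, and 2 rows respectively, which follows the same pattern as the counts computed on `df` above.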

### `fill`

Using the `fill` function, you can fill the nulls in one or more columns with a set of values. This can be done by specifying a map, that is, a particular value for each of a set of columns.

For example, to fill all null values in columns of type String, you might specify a string:

In [59]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



In [60]:
df.where(col("Description").isNull())\
  .na.fill("No Value").show(5)

+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
|   536414|    22139|   No Value|      56|2010-12-01 11:52:00|      0.0|      null|United Kingdom|
|   536545|    21134|   No Value|       1|2010-12-01 14:32:00|      0.0|      null|United Kingdom|
|   536546|    22145|   No Value|       1|2010-12-01 14:33:00|      0.0|      null|United Kingdom|
|   536547|    37509|   No Value|       1|2010-12-01 14:33:00|      0.0|      null|United Kingdom|
|   536549|   85226A|   No Value|       1|2010-12-01 14:34:00|      0.0|      null|United Kingdom|
+---------+---------+-----------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



We could do the same for numeric columns by passing a number, such as `df.na.fill(5)` or `df.na.fill(5.0)`. To restrict the fill to specific columns, pass a list of column names through `subset`, as we did with `drop`. The following example fills with `0.0` without restricting the columns:

In [61]:
df.where(col("CustomerID").isNull())\
  .na.fill(0.0).show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536414|    22139|                null|      56|2010-12-01 11:52:00|      0.0|       0.0|United Kingdom|
|   536544|    21773|DECORATIVE ROSE B...|       1|2010-12-01 14:32:00|     2.51|       0.0|United Kingdom|
|   536544|    21774|DECORATIVE CATS B...|       2|2010-12-01 14:32:00|     2.51|       0.0|United Kingdom|
|   536544|    21786|  POLKADOT RAIN HAT |       4|2010-12-01 14:32:00|     0.85|       0.0|United Kingdom|
|   536544|    21787|RAIN PONCHO RETRO...|       2|2010-12-01 14:32:00|     1.66|       0.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



Please note that in the code above the filter is only there to show the affected rows; `fill(0.0)` is applied to every numeric column, not just CustomerID.

We can also do this with a Python dictionary, where the key is the column name and the value is the value we would like to use to fill null values:

In [62]:
fill_cols_vals = {"CustomerID": 0.0, "Description" : "No Value"}
df2 = df.na.fill(fill_cols_vals)

df2.where(col("CustomerID").isNull() | col("Description").isNull()).show()

+---------+---------+-----------+--------+-----------+---------+----------+-------+
|InvoiceNo|StockCode|Description|Quantity|InvoiceDate|UnitPrice|CustomerID|Country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+



We can see that in df2 all the null values in these two columns have been replaced with the non-null values we specified.
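
As a rough mental model of the dictionary form, each key names a column and each value is the replacement used only where that column is null. The plain-Python sketch below assumes rows are dictionaries and uses a hypothetical `fill_nulls` helper; it is an illustration of the rule, not how Spark executes it.

```python
def fill_nulls(rows, values):
    """Return copies of rows with None replaced by the per-column default in values."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, default in values.items():
            if column in new_row and new_row[column] is None:
                new_row[column] = default
        filled.append(new_row)
    return filled


# Hypothetical rows; None stands in for null.
rows = [
    {"Description": None, "CustomerID": None, "Quantity": 56},
    {"Description": "POLKADOT RAIN HAT", "CustomerID": None, "Quantity": 4},
]

fill_cols_vals = {"CustomerID": 0.0, "Description": "No Value"}
for row in fill_nulls(rows, fill_cols_vals):
    print(row)
```

Columns that are not listed in the dictionary, such as Quantity here, are left as they are.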

### `replace`

In addition to dropping or filling null values as we did with `drop` and `fill`, there are more flexible options that you can use with more than just null values. Probably the most common use case is to replace all values in a certain column according to their current value. The only requirement is that the new value be the same type as the original value:

In [63]:
df.na.replace([""], ["UNKNOWN"], "Description")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

In this particular example, Description contains no empty strings, so nothing was replaced with UNKNOWN:

In [64]:
df.na.replace([""], ["UNKNOWN"], "Description").where("Description = 'UNKNOWN'").count()

0

## Ordering

You can use `asc_nulls_first`, `desc_nulls_first`, `asc_nulls_last`, or `desc_nulls_last` to specify where you would like your null values to appear in an ordered DataFrame.

**Note**

This is a new feature in Spark 2.4 and is not yet working properly. The PySpark code for it should look something like this:

```python
from pyspark.sql.functions import desc_nulls_first
df.orderBy(col("Description").desc_nulls_first()).show()
```

However, the equivalent Spark SQL syntax works fine:

In [65]:
spark.sql("""
SELECT * FROM dfTable
ORDER BY Description DESC NULLS FIRST
LIMIT 15
""").show()

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   536545|    21134|                null|       1|2010-12-01 14:32:00|      0.0|      null|United Kingdom|
|   536414|    22139|                null|      56|2010-12-01 11:52:00|      0.0|      null|United Kingdom|
|   536546|    22145|                null|       1|2010-12-01 14:33:00|      0.0|      null|United Kingdom|
|   536547|    37509|                null|       1|2010-12-01 14:33:00|      0.0|      null|United Kingdom|
|   536549|   85226A|                null|       1|2010-12-01 14:34:00|      0.0|      null|United Kingdom|
|   536550|    85044|                null|       1|2010-12-01 14:34:00|      0.0|      null|United Kingdom|
|   536552|    20950|       
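
To make the four options concrete, here is a small plain-Python sketch that sorts a list while placing `None` values first or last, independently of the sort direction. The `order_by` helper is a hypothetical illustration of the ordering rule only and has nothing to do with how Spark sorts.

```python
def order_by(values, descending=False, nulls_first=True):
    """Sort values, placing None entries at the start or the end."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return nulls + non_null if nulls_first else non_null + nulls


# Hypothetical Description values; None stands in for null.
descriptions = ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER"]

print(order_by(descriptions, descending=True, nulls_first=True))   # like DESC NULLS FIRST
print(order_by(descriptions, descending=False, nulls_first=False)) # like ASC NULLS LAST
```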

## Working with Complex Types

Complex types can help you organize and structure your data in ways that make more sense for the problem that you are hoping to solve. There are three kinds of complex types: structs, arrays, and maps.

### Structs

You can think of structs as DataFrames within DataFrames. A worked example will illustrate this more clearly. We can create a struct by wrapping a set of columns in parentheses in a query:

In [66]:
df.selectExpr("(Description, InvoiceNo) as complex", "*").show(5)

+--------------------+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|             complex|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+--------------------+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|[WHITE HANGING HE...|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|     2.55|   17850.0|United Kingdom|
|[WHITE METAL LANT...|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|[CREAM CUPID HEAR...|   536365|   84406B|CREAM CUPID HEART...|       8|2010-12-01 08:26:00|     2.75|   17850.0|United Kingdom|
|[KNITTED UNION FL...|   536365|   84029G|KNITTED UNION FLA...|       6|2010-12-01 08:26:00|     3.39|   17850.0|United Kingdom|
|[RED WOOLLY HOTTI...|   536365|   84029E|RED WOOLLY HOTTIE...|       6|2010-12-01 08:26:00|     

Which is equivalent to:

```python
df.selectExpr("struct(Description, InvoiceNo) as complex", "*")
```

Let's make a new DataFrame by wrapping two columns, and including two other columns from our df:

In [67]:
from pyspark.sql.functions import struct

complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"), "StockCode", "CustomerID")
complexDF.createOrReplaceTempView("complexDF")

In [68]:
complexDF.show(5, False)

+---------------------------------------------+---------+----------+
|complex                                      |StockCode|CustomerID|
+---------------------------------------------+---------+----------+
|[WHITE HANGING HEART T-LIGHT HOLDER, 536365] |85123A   |17850.0   |
|[WHITE METAL LANTERN, 536365]                |71053    |17850.0   |
|[CREAM CUPID HEARTS COAT HANGER, 536365]     |84406B   |17850.0   |
|[KNITTED UNION FLAG HOT WATER BOTTLE, 536365]|84029G   |17850.0   |
|[RED WOOLLY HOTTIE WHITE HEART., 536365]     |84029E   |17850.0   |
+---------------------------------------------+---------+----------+
only showing top 5 rows



We now have a DataFrame with a column named `complex`. We can query it just as we might another DataFrame; the only difference is that we reach its fields with dot syntax or with the column method `getField`. With dot syntax, we can pull every field back out alongside the other columns:

In [69]:
complexDF.select("complex.Description", "complex.InvoiceNo", "StockCode", "CustomerID").show(5, False)

+-----------------------------------+---------+---------+----------+
|Description                        |InvoiceNo|StockCode|CustomerID|
+-----------------------------------+---------+---------+----------+
|WHITE HANGING HEART T-LIGHT HOLDER |536365   |85123A   |17850.0   |
|WHITE METAL LANTERN                |536365   |71053    |17850.0   |
|CREAM CUPID HEARTS COAT HANGER     |536365   |84406B   |17850.0   |
|KNITTED UNION FLAG HOT WATER BOTTLE|536365   |84029G   |17850.0   |
|RED WOOLLY HOTTIE WHITE HEART.     |536365   |84029E   |17850.0   |
+-----------------------------------+---------+---------+----------+
only showing top 5 rows



In [70]:
complexDF.select(col("complex").getField("Description")).show(2)

+--------------------+
| complex.Description|
+--------------------+
|WHITE HANGING HEA...|
| WHITE METAL LANTERN|
+--------------------+
only showing top 2 rows



In [71]:
complexDF.printSchema()

root
 |-- complex: struct (nullable = false)
 |    |-- Description: string (nullable = true)
 |    |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- CustomerID: double (nullable = true)



We can also query all values in the struct by using `*`. This brings up all the columns to the top-level DataFrame:

In [72]:
complexDF.select("complex.*").show(2)

+--------------------+---------+
|         Description|InvoiceNo|
+--------------------+---------+
|WHITE HANGING HEA...|   536365|
| WHITE METAL LANTERN|   536365|
+--------------------+---------+
only showing top 2 rows



In SQL
```sql
SELECT complex.* FROM complexDF
```

### Arrays

To define arrays, let’s work through a use case. With our current data, our objective is to take every single word in our Description column and convert that into a row in our DataFrame.

The first task is to turn our Description column into a complex type, an array.

### `split`

We do this by using the `split` function and specifying the delimiter:

In [73]:
from pyspark.sql.functions import split

df.select(split(col("Description"), " ")).show(2)

+---------------------+
|split(Description,  )|
+---------------------+
| [WHITE, HANGING, ...|
| [WHITE, METAL, LA...|
+---------------------+
only showing top 2 rows



In SQL
```sql
SELECT split(Description, ' ') FROM dfTable
```

This is quite powerful because Spark allows us to manipulate this complex type as another column. We can also query the values of the array using Python-like syntax:

In [74]:
df.select(split(col("Description"), " ").alias("array_col"))\
  .selectExpr("array_col[0]").show(2)

+------------+
|array_col[0]|
+------------+
|       WHITE|
|       WHITE|
+------------+
only showing top 2 rows



In SQL
```sql
SELECT split(Description, ' ')[0] FROM dfTable
```

### Array Length

We can determine the array’s length by querying for its size:

In [75]:
from pyspark.sql.functions import size

df.select(size(split(col("Description"), " "))).show(2)

+---------------------------+
|size(split(Description,  ))|
+---------------------------+
|                          5|
|                          3|
+---------------------------+
only showing top 2 rows



### `array_contains`

We can also see whether this array contains a value:

In [76]:
from pyspark.sql.functions import array_contains

df.select(array_contains(split(col("Description"), " "), "WHITE")).show(5)

+--------------------------------------------+
|array_contains(split(Description,  ), WHITE)|
+--------------------------------------------+
|                                        true|
|                                        true|
|                                       false|
|                                       false|
|                                        true|
+--------------------------------------------+
only showing top 5 rows



However, this does not solve our current problem. To convert a complex type into a set of rows (one per value in our array), we need to use the explode function.

### `explode`

The `explode` function takes a column that consists of arrays and creates one row (with the rest of the values duplicated) per value in the array. The figure below illustrates the process.

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/06-01-Exploding-a-column-of-text.png?raw=true" width="900" align="left"/>

In [77]:
from pyspark.sql.functions import split, explode

df.withColumn("splitted", split(col("Description"), " "))\
  .withColumn("exploded", explode(col("splitted")))\
  .select("Description", "InvoiceNo", "exploded").show(13, False)

+----------------------------------+---------+--------+
|Description                       |InvoiceNo|exploded|
+----------------------------------+---------+--------+
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |WHITE   |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HANGING |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HEART   |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |T-LIGHT |
|WHITE HANGING HEART T-LIGHT HOLDER|536365   |HOLDER  |
|WHITE METAL LANTERN               |536365   |WHITE   |
|WHITE METAL LANTERN               |536365   |METAL   |
|WHITE METAL LANTERN               |536365   |LANTERN |
|CREAM CUPID HEARTS COAT HANGER    |536365   |CREAM   |
|CREAM CUPID HEARTS COAT HANGER    |536365   |CUPID   |
|CREAM CUPID HEARTS COAT HANGER    |536365   |HEARTS  |
|CREAM CUPID HEARTS COAT HANGER    |536365   |COAT    |
|CREAM CUPID HEARTS COAT HANGER    |536365   |HANGER  |
+----------------------------------+---------+--------+
only showing top 13 rows
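
The same split-then-explode idea can be sketched in plain Python, which may help when reasoning about how many output rows to expect. The `explode_words` generator below is a hypothetical helper operating on a list of dictionaries; it is a model of the behavior, not Spark code. It also reflects the fact that `explode` emits no rows when the array is null.

```python
def explode_words(rows, column, delimiter=" "):
    """Yield one output row per token of row[column], duplicating the other values."""
    for row in rows:
        value = row[column]
        if value is None:
            continue  # a null array produces no output rows
        for token in value.split(delimiter):
            yield {**row, "exploded": token}


# Hypothetical rows modelled on the retail data; None stands in for null.
rows = [
    {"Description": "WHITE METAL LANTERN", "InvoiceNo": "536365"},
    {"Description": None, "InvoiceNo": "536414"},
]

for out in explode_words(rows, "Description"):
    print(out["Description"], out["InvoiceNo"], out["exploded"])
```

Here the first row becomes three rows (one per word) and the second row disappears, so filter or fill nulls first if those rows must be kept.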



### Maps

Maps are created by using the `create_map` function (`map` in SQL) and key-value pairs of columns. You then can select them just like you might select from an array:

In [78]:
from pyspark.sql.functions import create_map

df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(5, False)

+-----------------------------------------------+
|complex_map                                    |
+-----------------------------------------------+
|[WHITE HANGING HEART T-LIGHT HOLDER -> 536365] |
|[WHITE METAL LANTERN -> 536365]                |
|[CREAM CUPID HEARTS COAT HANGER -> 536365]     |
|[KNITTED UNION FLAG HOT WATER BOTTLE -> 536365]|
|[RED WOOLLY HOTTIE WHITE HEART. -> 536365]     |
+-----------------------------------------------+
only showing top 5 rows



In SQL
```sql
SELECT map(Description, InvoiceNo) as complex_map FROM dfTable
WHERE Description IS NOT NULL
```

In [79]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
  .selectExpr("*", "complex_map['WHITE METAL LANTERN']").show(5, False)

+-----------------------------------------------+--------------------------------+
|complex_map                                    |complex_map[WHITE METAL LANTERN]|
+-----------------------------------------------+--------------------------------+
|[WHITE HANGING HEART T-LIGHT HOLDER -> 536365] |null                            |
|[WHITE METAL LANTERN -> 536365]                |536365                          |
|[CREAM CUPID HEARTS COAT HANGER -> 536365]     |null                            |
|[KNITTED UNION FLAG HOT WATER BOTTLE -> 536365]|null                            |
|[RED WOOLLY HOTTIE WHITE HEART. -> 536365]     |null                            |
+-----------------------------------------------+--------------------------------+
only showing top 5 rows



You can also explode map types, which turns each map entry into a row with `key` and `value` columns:

In [80]:
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
  .selectExpr("explode(complex_map)").show(2, False)

+----------------------------------+------+
|key                               |value |
+----------------------------------+------+
|WHITE HANGING HEART T-LIGHT HOLDER|536365|
|WHITE METAL LANTERN               |536365|
+----------------------------------+------+
only showing top 2 rows



## User-Defined Functions

One of the most powerful things that you can do in Spark is define your own functions. These user-defined functions (UDFs) make it possible for you to write your own custom transformations using Python or Scala and even use external libraries. UDFs can take one or more columns as input and return one or more columns as output. Spark UDFs are incredibly powerful because you can write them in several different programming languages; you do not need to create them in an esoteric format or domain-specific language. They’re just functions that operate on the data, record by record. By default, these functions are registered as temporary functions to be used in that specific SparkSession or Context.

Although you can write UDFs in Scala, Python, or Java, there are performance considerations that you should be aware of. To illustrate this, we’re going to walk through exactly what happens when you create a UDF, pass that into Spark, and then execute code using that UDF.

The first step is the actual function. We’ll create a simple one for this example. Let’s write a `power3` function that takes a number and raises it to the power of three:

In [81]:
udfExampleDF = spark.range(5).toDF("num")
udfExampleDF.createOrReplaceTempView("udfExampleDFTable")

def power3(double_value):
    return double_value ** 3

power3(2.0)

8.0

In this trivial example, we can see that our function works as expected. We are able to provide an individual input and produce the expected result (with this simple test case). Thus far, our expectations for the input are high: it must be a specific type and cannot be a null value.
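
Real columns often do contain nulls, and a null reaches a Python UDF as `None`, for which `None ** 3` raises a `TypeError`. One defensive option, sketched here under the assumption that a null input should simply produce a null output, is to check for `None` first. The name `power3_safe` is hypothetical and not part of the original notebook.

```python
def power3_safe(value):
    """Cube a number, passing None through instead of raising TypeError."""
    if value is None:
        return None  # null in, null out
    return value ** 3


print(power3_safe(2.0))
print(power3_safe(None))
```

A function written this way can be wrapped with `udf` in the same manner as `power3`.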

Now that we’ve created this function and tested it, we need to register it with Spark so that we can use it on all of our worker machines. Spark will serialize the function on the driver and transfer it over the network to all executor processes. This happens regardless of language.

When you use the function, Spark starts a Python process on the worker, serializes all of the data to a format that Python can understand (remember, it was in the JVM earlier), executes the function row by row on that data in the Python process, and then finally returns the results of the row operations to the JVM and Spark. The figure below provides an overview of the process:

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/06-01-UDF.png?raw=true" width="700" align="center"/>

**Warning:** Starting this Python process is expensive, but the real cost is in serializing the data to Python.

We need to register the function to make it available as a DataFrame function:

In [82]:
from pyspark.sql.functions import udf

power3udf = udf(power3)

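Then, we can use it in our DataFrame code. Because we did not pass a return type, `udf` defaults to `StringType`, so the result column holds strings: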

In [83]:
from pyspark.sql.functions import col

udfExampleDF.select(power3udf(col("num"))).show(5)

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
|          8|
|         27|
|         64|
+-----------+



At this juncture, we can use this only as a DataFrame function. That is to say, we can’t use it within a string expression such as `selectExpr`, only on a Column expression. However, we can also register this UDF as a Spark SQL function. This is valuable because it makes it simple to use this function within SQL as well as across languages.

In [84]:
spark.udf.register("power3py", power3)

<function __main__.power3py>

In [85]:
udfExampleDF.selectExpr("power3py(num)").show()

+-------------+
|power3py(num)|
+-------------+
|            0|
|            1|
|            8|
|           27|
|           64|
+-------------+



In SQL:

In [86]:
spark.sql("""
SELECT num, power3py(num) from udfExampleDFTable
""").show()

+---+-------------+
|num|power3py(num)|
+---+-------------+
|  0|            0|
|  1|            1|
|  2|            8|
|  3|           27|
|  4|           64|
+---+-------------+



For a complete list of **pyspark.sql.functions** visit [Spark 2.4 documentation page](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#module-pyspark.sql.functions).