
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Advanced UDFs

Apache Spark&trade; and Databricks&reg; allow you to create your own User Defined Functions (UDFs) specific to the needs of your data.

## In this lesson you:
* Apply UDFs with multiple DataFrame column inputs
* Apply UDFs that return complex types
* Write vectorized UDFs using Python

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/46zerb33vk?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/46zerb33vk?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Complex Transformations
 
UDFs provide custom, generalizable code that you can apply to ETL workloads when Spark's built-in functions won't suffice.  
In the last lesson we covered a simple version of this: UDFs that take a single DataFrame column input and return a primitive value. Often a more advanced solution is needed.

UDFs can take multiple column inputs. While UDFs cannot return multiple columns, they can return complex, named types that are easily accessible. This approach is especially helpful in ETL workloads that need to clean complex and challenging data structures.

Another option is vectorized (pandas) UDFs, introduced in Spark 2.3. These allow for more performant UDFs written in Python.<br><br>

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/pandas-udfs.png" style="height: 400px; margin: 20px"/></div>

### UDFs with Multiple Columns

To begin making more complex UDFs, start by using multiple column inputs.  This is as simple as adding extra inputs to the function or lambda you convert to the UDF.

Run the cell below to mount the data.

In [7]:
%run "./Includes/Classroom-Setup"

Write a basic function that combines two columns.

In [9]:
def manual_add(x, y):
  return x + y

manual_add(1, 2)
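
Keep in mind that Spark passes SQL `NULL` values to a Python UDF as `None`, so `x + y` raises a `TypeError` if either column contains a null. The sketch below shows one way to guard against that. `null_safe_add` is a hypothetical name that is not part of this lesson's code, and returning `None` (which becomes a null in the result column) is only one possible policy.

```python
def null_safe_add(x, y):
    # Spark hands SQL NULLs to Python UDFs as None
    if x is None or y is None:
        return None  # becomes a null in the result column
    return x + y

print(null_safe_add(1, 2))     # 3
print(null_safe_add(None, 2))  # None
```

A function like this can be registered as a UDF in exactly the same way as `manual_add`.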

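Register the function as a UDF. `spark.udf.register` takes the name used to call the UDF from the SQL API, the function itself, and its return type. Binding the result to a Python variable makes the UDF available in the DataFrame API as well.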

In [11]:
from pyspark.sql.types import IntegerType

manualAddPythonUDF = spark.udf.register("manualAddSQLUDF", manual_add, IntegerType())

Create a dummy DataFrame to apply the UDF.

In [13]:
integerDF = (spark.createDataFrame([
  (1, 2),
  (3, 4),
  (5, 6)
], ["col1", "col2"]))

display(integerDF)

col1,col2
1,2
3,4
5,6


Apply the UDF to your DataFrame.

In [15]:
integerAddDF = integerDF.select("*",manualAddPythonUDF("col1","col2").alias("result"))

display(integerAddDF)

col1,col2,result
1,2,3
3,4,7
5,6,11
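
Because the UDF was also registered under the name `manualAddSQLUDF`, it can be called from SQL as well. The following is a minimal sketch, assuming the `spark` session and `integerDF` from the cells above. The helper name `add_columns_with_sql` and the view name `integers` are made up for illustration.

```python
def add_columns_with_sql(spark, df, view_name="integers"):
    """Call the UDF registered as manualAddSQLUDF from a SQL query."""
    # Expose the DataFrame to SQL under a temporary view (hypothetical name)
    df.createOrReplaceTempView(view_name)
    query = (
        "SELECT col1, col2, manualAddSQLUDF(col1, col2) AS result "
        "FROM {}".format(view_name)
    )
    return spark.sql(query)
```

Calling `display(add_columns_with_sql(spark, integerDF))` should produce the same three columns as the DataFrame API version above.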


### UDFs with Complex Output

Complex outputs are helpful when you need to return multiple values from your UDF. The design pattern is to return a single struct column and then drill down into it to pull out the desired fields.

Start by determining the desired output.  This will look like a schema with a top-level `StructType` containing several `StructField`s.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> For a refresher on this, see the lesson **Applying Schemas to JSON Data** in <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction/" target="_blank">ETL Part 1 course from Databricks Academy</a>.

In [18]:
from pyspark.sql.types import FloatType, StructType, StructField

mathOperationsSchema = StructType([
  StructField("sum", FloatType(), True), 
  StructField("multiplication", FloatType(), True), 
  StructField("division", FloatType(), True) 
])

Create a function that returns a tuple of your desired output.

In [20]:
def manual_math(x, y):
  return (float(x+y),float(x*y),x/float(y))

manual_math(1, 2)

Register your function as a UDF and apply it.  In this case, your return type is the schema you created.

In [22]:
manualMathPythonUDF = spark.udf.register("manualMathSQLUDF",manual_math,mathOperationsSchema)

display(integerDF.select("*", manualMathPythonUDF("col1", "col2").alias("sum")))

col1,col2,sum
1,2,"List(3.0, 2.0, 0.5)"
3,4,"List(7.0, 12.0, 0.75)"
5,6,"List(11.0, 30.0, 0.8333333)"
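
To drill down into the struct, refer to its fields with dotted paths in `select`. A minimal sketch follows. It assumes the struct column is aliased as `math` instead of `sum`, to avoid confusion with the field of the same name, and both helper names are hypothetical.

```python
def struct_field_paths(struct_col, field_names):
    """Build dotted paths such as 'math.sum' that select() resolves to struct fields."""
    return ["{}.{}".format(struct_col, name) for name in field_names]

def flatten_math_results(df, struct_col="math"):
    """Promote each field of the struct column to its own top-level column."""
    paths = struct_field_paths(struct_col, ["sum", "multiplication", "division"])
    return df.select("col1", "col2", *paths)
```

For example, build `mathDF = integerDF.select("*", manualMathPythonUDF("col1", "col2").alias("math"))` and then run `display(flatten_math_results(mathDF))`. Each selected field becomes a column named after the field.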


### Vectorized UDFs in Python

Starting in Spark 2.3, you can write vectorized UDFs in Python, known as pandas UDFs.  They alleviate some of the serialization and invocation overhead of conventional Python UDFs.  While there are several types of pandas UDFs, this walk-through focuses on scalar UDFs. They are a good fit for Data Scientists who need performant UDFs written in Python.
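
To build intuition for the invocation overhead, the sketch below counts how often a function is called when values are processed one at a time versus in batches. It is a plain-Python illustration only: lists stand in for pandas Series, the batch size of 5 is arbitrary (Spark chooses its own batch sizes), and all names are hypothetical. A real pandas UDF additionally benefits from Apache Arrow data transfer and vectorized pandas operations, which this sketch does not model.

```python
def plus_one(value):
    """Row-at-a-time style: called once per value."""
    return value + 1

def plus_one_batch(values):
    """Vectorized style: called once per batch of values."""
    return [v + 1 for v in values]

def count_calls(func):
    """Wrap func so the number of invocations can be inspected."""
    def wrapper(*args):
        wrapper.calls += 1
        return func(*args)
    wrapper.calls = 0
    return wrapper

data = list(range(10))
batch_size = 5  # arbitrary value for illustration

row_udf = count_calls(plus_one)
row_result = [row_udf(v) for v in data]

batch_udf = count_calls(plus_one_batch)
batch_result = []
for start in range(0, len(data), batch_size):
    batch_result.extend(batch_udf(data[start:start + batch_size]))

print(row_udf.calls, batch_udf.calls)  # 10 2
```

Both approaches produce the same values, but the batch version is invoked far fewer times.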

**Note:** Your cluster needs to run Spark 2.3 or later to execute the following code.

Use the decorator syntax to designate a pandas UDF.  The function receives a pandas Series and returns a pandas Series, with the return type declared as double.

In [25]:
%python
from pyspark.sql.functions import pandas_udf, PandasUDFType
# The decorator turns the function into a scalar pandas UDF that returns doubles
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v+1

Create a DataFrame to apply the UDF.

In [27]:
%python
from pyspark.sql.functions import col, rand

df = spark.range(0, 10 * 1000 * 1000)

display(df)

id
0
1
2
3
4
5
6
7
8
9


Apply the UDF.

In [29]:
%python
display(df.withColumn('id_transformed', pandas_plus_one("id")))

id,id_transformed
0,1.0
1,2.0
2,3.0
3,4.0
4,5.0
5,6.0
6,7.0
7,8.0
8,9.0
9,10.0


## Exercise 1: Multiple Column Inputs to Complex Type

Given a DataFrame of weather in various units, write a UDF that translates a column for temperature and a column for units into a complex type for temperature in three units:<br><br>

* fahrenheit
* celsius
* kelvin

### Step 1: Import and Explore the Data

Import the data sitting in `/mnt/training/weather/StationData/stationData.parquet` and save it to `weatherDF`.

In [32]:
# TODO
weatherDF = spark.read.parquet("/mnt/training/weather/StationData/stationData.parquet")

In [33]:
# TEST - Run this cell to test your solution
cols = set(weatherDF.columns)

dbTest("ET2-P-04-01-01", 2559, weatherDF.count())
dbTest("ET2-P-04-01-02", True, "TAVG" in cols and "UNIT" in cols)

print("Tests passed!")

### Step 2: Define Complex Output Type

Define the complex output type for your UDF.  This should look like the following:

| Field Name | Type |
|:-----------|:-----|
| fahrenheit | Float |
| celsius | Float |
| kelvin | Float |

In [35]:
from pyspark.sql.types import FloatType, StructType, StructField
schema = StructType([
  StructField("fahrenheit",FloatType(),True),
  StructField("celsius",FloatType(),True),
  StructField("kelvin",FloatType(),True)
])

In [36]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import FloatType
names = [i.name for i in schema.fields]

dbTest("ET2-P-04-02-01", 3, len(schema.fields))
dbTest("ET2-P-04-02-02", [FloatType(), FloatType(), FloatType()], [i.dataType for i in schema.fields])
dbTest("ET2-P-04-02-03", True, "fahrenheit" in names and "celsius" in names and "kelvin" in names)

print("Tests passed!")

### Step 3: Create the Function

Create a function that takes `temperature` as a Float and `unit` as a String.  `unit` will either be `F` for fahrenheit or `C` for celsius.  
Return a tuple of floats of that value as `(fahrenheit, celsius, kelvin)`.

Use the following equations:

| From | To Fahrenheit | To Celsius | To Kelvin |
|:-----|:--------------|:-----------|:-----------|
| Fahrenheit | F | (F - 32) * 5/9 | (F - 32) * 5/9 + 273.15 |
| Celsius | (C * 9/5) + 32 | C | C + 273.15 |
| Kelvin | (K - 273.15) * 9/5 + 32 | K - 273.15 | K |

In [39]:
# TODO
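def temperatureConverter(temperature, unit):
  # Convert a reading in Fahrenheit ("F") or Celsius ("C") to all three units
  temp = float(temperature)
  if unit == "F":
    fahrenheit = temp
    celsius = (temp - 32) * 5 / 9
    kelvin = (temp - 32) * (5 / 9) + 273.15
  elif unit == "C":
    fahrenheit = (temp * 9 / 5) + 32
    celsius = temp
    kelvin = temp + 273.15
  else:
    # Unknown unit: return None (a null struct) instead of raising UnboundLocalError
    return None
  return (fahrenheit, celsius, kelvin)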

In [40]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-04-03-01", (194.0, 90.0, 363.15), temperatureConverter(90, "C"))
dbTest("ET2-P-04-03-02", (0.0, -17.77777777777778, 255.3722222222222), temperatureConverter(0, "F"))

print("Tests passed!")

### Step 4: Register the UDF

Register the UDF as `temperatureConverterUDF`.

In [42]:
# TODO
temperatureConverterUDF = spark.udf.register("temperatureConverterUDF",temperatureConverter,schema)

In [43]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-04-04-01", (194.0, 90.0, 363.15), temperatureConverterUDF.func(90, "C"))
dbTest("ET2-P-04-04-02", (0.0, -17.77777777777778, 255.3722222222222), temperatureConverterUDF.func(0, "F"))

print("Tests passed!")

### Step 5: Apply your UDF

Create `weatherEnhancedDF` with a new column `TAVGAdjusted` that applies your UDF.

In [45]:
# TODO
weatherEnhancedDF = weatherDF.select("*", temperatureConverterUDF("TAVG", "UNIT").alias("TAVGAdjusted"))

display(weatherEnhancedDF)

NAME,STATION,LATITUDE,LONGITUDE,ELEVATION,DATE,UNIT,TAVG,TAVGAdjusted
"HAYWARD AIR TERMINAL, CA US",USW00093228,37.6542,-122.115,13.1,2018-05-27,F,61.0,"List(61.0, 16.11111, 289.2611)"
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-01-05,C,11.7,"List(53.06, 11.7, 284.85)"
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-02-24,C,8.3,"List(46.94, 8.3, 281.45)"
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-03-26,C,9.4,"List(48.92, 9.4, 282.55)"
"HOUSTON INTERCONTINENTAL AIRPORT, TX US",USW00012960,29.98,-95.36,29.0,2018-05-25,F,80.0,"List(80.0, 26.666666, 299.81668)"
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-05-16,C,11.1,"List(51.98, 11.1, 284.25)"
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-05-25,C,10.6,"List(51.08, 10.6, 283.75)"
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-05-21,C,11.7,"List(53.06, 11.7, 284.85)"
"WOODACRE CALIFORNIA, CA US",USR0000CWOO,37.9906,-122.6447,426.7,2018-05-26,F,53.0,"List(53.0, 11.666667, 284.81668)"
"BRIONES CALIFORNIA, CA US",USR0000CBRI,37.9442,-122.1178,442.0,2018-04-08,F,53.0,"List(53.0, 11.666667, 284.81668)"


In [46]:
# TEST - Run this cell to test your solution
result = weatherEnhancedDF.select("TAVGAdjusted").first()[0].asDict()

dbTest("ET2-P-04-05-01", {'fahrenheit': 61.0, 'celsius': 16.11111068725586, 'kelvin': 289.2611083984375}, result)
dbTest("ET2-P-04-05-02", 2559, weatherEnhancedDF.count())

print("Tests passed!")

## Review

**Question:** How do UDFs handle multiple column inputs and complex outputs?   
**Answer:** UDFs allow for multiple column inputs.  Complex outputs can be designated with a schema encapsulated in a `StructType()` or a Scala case class.

**Question:** How can I do vectorized UDFs in Python and are they as performant as built-in functions?   
**Answer:** Spark 2.3 introduced vectorized UDFs that use pandas syntax. Even though they are vectorized, these UDFs will not be as performant as built-in functions, though they will be more performant than non-vectorized Python UDFs.
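
For comparison, the column addition used earlier in this lesson needs no UDF at all. The sketch below expresses it with a built-in SQL expression through `selectExpr`, assuming a DataFrame with `col1` and `col2` such as `integerDF`. The helper name `add_with_builtin` is hypothetical.

```python
def add_with_builtin(df):
    """Add col1 and col2 with a built-in SQL expression instead of a Python UDF."""
    # The expression is evaluated by Spark itself, so no Python function is invoked per row
    return df.selectExpr("*", "col1 + col2 AS result")
```

When a built-in expression like this exists, it is generally the better choice; reserve UDFs for logic that built-ins cannot express.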

## Next Steps

Start the next lesson, [Joins and Lookup Tables]($./05-Joins-and-Lookup-Tables ).

## Additional Topics & Resources

**Q:** Where can I find out more about UDFs?  
**A:** Take a look at the <a href="https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html" target="_blank">Databricks documentation for more details</a>.

**Q:** Where can I find out more about vectorized UDFs in Python?  
**A:** Take a look at the <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank">Databricks blog for more details</a>.

**Q:** Where can I find out more about User Defined Aggregate Functions?  
**A:** Take a look at the <a href="https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html" target="_blank">Databricks documentation for more details</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>