
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# User Defined Functions

Apache Spark&trade; and Databricks&reg; allow you to create your own User Defined Functions (UDFs) specific to the needs of your data.

## In this lesson you:
* Write UDFs with a single DataFrame column as input
* Transform data with UDFs using both the DataFrame and SQL APIs
* Analyze the performance trade-offs between built-in functions and UDFs

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/fgp5h61gps?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/fgp5h61gps?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Custom Transformations with User Defined Functions

Spark's built-in functions provide a wide array of functionality, covering the vast majority of data transformation use cases. Often what differentiates strong Spark programmers is their ability to utilize built-in functions since Spark offers many highly optimized options to manipulate data. This matters for two reasons:<br><br>

- First, *built-in functions are finely tuned* so they run faster than less efficient code provided by the user.  
- Secondly, Spark (or, more specifically, Spark's optimization engine, the Catalyst Optimizer) knows the objective of built-in functions so it can *optimize the execution of your code by changing the order of your tasks.* 

In brief, use built-in functions whenever possible.

There are, however, many specific use cases not covered by built-in functions. **User Defined Functions (UDFs) are useful when you need to define logic specific to your use case and when you need to encapsulate that solution for reuse.** They should only be used when there is no clear way to accomplish a task using built-in functions.

UDFs are generally more performant in Scala than Python because, for Python, Spark has to spin up a Python interpreter on every executor to run the function. This causes a substantial performance bottleneck: each row must be serialized and passed between the JVM and the Python process, and Python itself executes more slowly.
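The hand-off cost described above can be pictured with a small, Spark-free analogy. This is only a sketch: it uses `pickle` to stand in for the serialization step, and the function names (`apply_directly`, `apply_via_round_trip`) are hypothetical. Spark's real transfer mechanism is more involved than this.

```python
import pickle

def apply_directly(values, fn):
    """Apply fn in-process, with no hand-off to another runtime."""
    return [fn(v) for v in values]

def apply_via_round_trip(values, fn):
    """Mimic a hand-off to a separate worker: serialize, apply, serialize back."""
    payload = pickle.dumps(values)        # sender serializes the batch
    received = pickle.loads(payload)      # worker deserializes it
    results = [fn(v) for v in received]   # the user's function runs row by row
    reply = pickle.dumps(results)         # results are serialized again
    return pickle.loads(reply)            # and deserialized by the caller

values = [0.25, 0.5, 0.75]
plus_one = lambda x: x + 1

# Both paths produce the same answer; the second simply does more work to get there.
print(apply_directly(values, plus_one) == apply_via_round_trip(values, plus_one))
```

The results are identical, but the round-trip version pays for two serialization passes on top of the function call itself, which is the kind of overhead a built-in function avoids.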

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/built-in-vs-udfs.png" style="height: 400px; margin: 20px"/></div>

### A Basic UDF

UDFs take a function or lambda and make it available for Spark to use.  Start by writing code in your language of choice that will operate on a single row of a single column in your DataFrame.

Run the cell below to mount the data.

In [7]:
%run "./Includes/Classroom-Setup"

Write a basic function that splits a string on an `e`.

In [9]:
def manual_split(x):
  return x.split("e")


manual_split("this is my example string")

Register the function as a UDF by designating the following:

* A name for access in Python (`manualSplitPythonUDF`)
* A name for access in SQL (`manualSplitSQLUDF`)
* The function itself (`manual_split`)
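* The return type for the function (`StringType`). Because `manual_split` returns a list, the resulting column holds a string rendering of that list; `ArrayType(StringType())` would keep it as an array of strings.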

In [11]:

from pyspark.sql.types import StringType

manualSplitPythonUDF = spark.udf.register("manualSplitSQLUDF", manual_split, StringType())


Create a DataFrame of roughly 10 million rows, each with a string to split. Generate the strings with a hash function, in this case `SHA-1`.

In [13]:
from pyspark.sql.functions import sha1, rand
randomDF = (spark.range(1, 10000 * 10 * 10 * 10)
  .withColumn("random_value", rand(seed=10).cast("string"))
  .withColumn("hash", sha1("random_value"))
  .drop("random_value")
)

display(randomDF)

id,hash
1,6f65a37773e1a173aefe023667fe23442efaf0e5
2,3c5f154011aa08151d2cc5b458a48b33ab656c67
3,28fc878d1f1942bd718cefa60424d9e18e8756db
4,9c54ebb1c11c92d4ce7c2d52a15d21c337e0ced5
5,46500976da0a25029a93419acd36935d06c86e95
6,9d8e41374ef6637e274e4b5487a6428b72b5fb2a
7,29d9fa515d557568abc85d08d1217b6b21c4841d
8,672da7735c57d6707a8307bd18b1fd4dc04821ee
9,e6fb1c43359cff2d70014e776bd47565bdbf40a2
10,8e3f2b2dfae9a25aa4a940067e34bf4cebde5926
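Each `hash` value is a 40-character lowercase hexadecimal string, so the letter `e` appears as an ordinary hex digit. That is what makes splitting on `e` a workable toy example. The plain-Python sketch below shows the same shape of value using `hashlib`. The helper name `sha1_hex` and the input string are made up for illustration, and the digest will not match any row above.

```python
import hashlib

def sha1_hex(text):
    """Return the SHA-1 digest of text as a lowercase hex string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

digest = sha1_hex("0.1234")  # arbitrary stand-in for a stringified random value

print(len(digest))                              # 40 hex characters (160 bits)
print(set(digest) <= set("0123456789abcdef"))   # True: only hex digits appear
```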


Apply the UDF by using it just like any other Spark function.

In [15]:
randomAugmentedDF = randomDF.select("id", "hash", manualSplitPythonUDF("hash").alias("augmented_col"))

display(randomAugmentedDF)

id,hash,augmented_col
1,6f65a37773e1a173aefe023667fe23442efaf0e5,"[6f65a37773, 1a173a, f, 023667f, 23442, faf0, 5]"
2,3c5f154011aa08151d2cc5b458a48b33ab656c67,[3c5f154011aa08151d2cc5b458a48b33ab656c67]
3,28fc878d1f1942bd718cefa60424d9e18e8756db,"[28fc878d1f1942bd718c, fa60424d9, 18, 8756db]"
4,9c54ebb1c11c92d4ce7c2d52a15d21c337e0ced5,"[9c54, bb1c11c92d4c, 7c2d52a15d21c337, 0c, d5]"
5,46500976da0a25029a93419acd36935d06c86e95,"[46500976da0a25029a93419acd36935d06c86, 95]"
6,9d8e41374ef6637e274e4b5487a6428b72b5fb2a,"[9d8, 41374, f6637, 274, 4b5487a6428b72b5fb2a]"
7,29d9fa515d557568abc85d08d1217b6b21c4841d,[29d9fa515d557568abc85d08d1217b6b21c4841d]
8,672da7735c57d6707a8307bd18b1fd4dc04821ee,"[672da7735c57d6707a8307bd18b1fd4dc04821, , ]"
9,e6fb1c43359cff2d70014e776bd47565bdbf40a2,"[, 6fb1c43359cff2d70014, 776bd47565bdbf40a2]"
10,8e3f2b2dfae9a25aa4a940067e34bf4cebde5926,"[8, 3f2b2dfa, 9a25aa4a940067, 34bf4c, bd, 5926]"
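The empty entries in rows 8 and 9 are not errors. They come from how Python's `str.split` treats a separator at the start or end of a string, or two separators in a row. The short check below reruns `manual_split` in plain Python on three of the hashes shown above:

```python
def manual_split(x):
    return x.split("e")

trailing = manual_split("672da7735c57d6707a8307bd18b1fd4dc04821ee")  # ends in "ee"
leading = manual_split("e6fb1c43359cff2d70014e776bd47565bdbf40a2")   # starts with "e"
no_match = manual_split("3c5f154011aa08151d2cc5b458a48b33ab656c67")  # contains no "e"

print(trailing)  # ['672da7735c57d6707a8307bd18b1fd4dc04821', '', '']
print(leading)   # ['', '6fb1c43359cff2d70014', '776bd47565bdbf40a2']
print(no_match)  # ['3c5f154011aa08151d2cc5b458a48b33ab656c67']
```

A hash with no `e` comes back as a single-element list, which matches rows 2 and 7.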


### DataFrame and SQL APIs

When you registered the UDF, you named it `manualSplitSQLUDF` for access in the SQL API. This gives you the same access to the UDF that you had in the Python DataFrame API.

Register `randomDF` to access it within SQL.

In [18]:
randomDF.createOrReplaceTempView("randomTable")

Now switch to the SQL API and use the same UDF.

In [20]:
%sql
SELECT id,
  hash,
  manualSplitSQLUDF(hash) as augmented_col
FROM
  randomTable

id,hash,augmented_col
1,6f65a37773e1a173aefe023667fe23442efaf0e5,"[6f65a37773, 1a173a, f, 023667f, 23442, faf0, 5]"
2,3c5f154011aa08151d2cc5b458a48b33ab656c67,[3c5f154011aa08151d2cc5b458a48b33ab656c67]
3,28fc878d1f1942bd718cefa60424d9e18e8756db,"[28fc878d1f1942bd718c, fa60424d9, 18, 8756db]"
4,9c54ebb1c11c92d4ce7c2d52a15d21c337e0ced5,"[9c54, bb1c11c92d4c, 7c2d52a15d21c337, 0c, d5]"
5,46500976da0a25029a93419acd36935d06c86e95,"[46500976da0a25029a93419acd36935d06c86, 95]"
6,9d8e41374ef6637e274e4b5487a6428b72b5fb2a,"[9d8, 41374, f6637, 274, 4b5487a6428b72b5fb2a]"
7,29d9fa515d557568abc85d08d1217b6b21c4841d,[29d9fa515d557568abc85d08d1217b6b21c4841d]
8,672da7735c57d6707a8307bd18b1fd4dc04821ee,"[672da7735c57d6707a8307bd18b1fd4dc04821, , ]"
9,e6fb1c43359cff2d70014e776bd47565bdbf40a2,"[, 6fb1c43359cff2d70014, 776bd47565bdbf40a2]"
10,8e3f2b2dfae9a25aa4a940067e34bf4cebde5926,"[8, 3f2b2dfa, 9a25aa4a940067, 34bf4c, bd, 5926]"


This is an easy way to generalize UDFs, allowing teams to share their code.

### Performance Trade-offs

The performance of custom UDFs normally trails far behind that of built-in functions.  Take a look at this other example to compare built-in functions to custom UDFs.

Create a large DataFrame of random values, cache the result in order to keep the DataFrame in memory, and perform a `.count()` to trigger the cache to take effect.

In [24]:
from pyspark.sql.functions import col, rand

randomFloatsDF = (spark.range(0, 100 * 1000 * 1000)
  .withColumn("id", (col("id") / 1000).cast("integer"))
  .withColumn("random_float", rand())
)

randomFloatsDF.cache()
randomFloatsDF.count()

display(randomFloatsDF)

id,random_float
0,0.8939360678095802
0,0.6773484429370513
0,0.42656132569368
0,0.2372522790018589
0,0.9577208132799496
0,0.5983428637712083
0,0.3844312735459718
0,0.6943676737920392
0,0.6739005106742604
0,0.3463602500162592


Register a new UDF that increments a column by 1.  Here, use a lambda instead of a function.

In [26]:
from pyspark.sql.types import FloatType
  
plusOneUDF = spark.udf.register("plusOneUDF", lambda x: x+1, FloatType())
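One detail worth noticing: `rand()` produces 64-bit doubles, while `FloatType` is a 32-bit single-precision type, so values returned through this UDF may lose some precision. The Spark-free sketch below uses `struct` to show what narrowing to 32 bits does to a Python float. The helper `to_float32` is hypothetical and only illustrates the rounding. If full precision matters, `DoubleType` is likely the closer match.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

value = 1.1
narrowed = to_float32(value)

print(narrowed == value)              # False: 1.1 has no exact 32-bit representation
print(abs(narrowed - value) < 1e-6)   # True: the error is small but nonzero
print(to_float32(1.5) == 1.5)         # True: 1.5 is exactly representable
```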

Compare the results using the `%timeit` magic command.  Run it a few times and examine the results.

In [28]:
%timeit randomFloatsDF.withColumn("incremented_float", plusOneUDF("random_float")).count()

In [29]:
%timeit randomFloatsDF.withColumn("incremented_float", col("random_float") + 1).count()

Which was faster, the UDF or the built-in functionality?  By how much?  The gap can differ depending on whether you work through this course in Scala (where UDFs are much faster) or Python.

In [31]:
# Built-in functions are faster than user-defined functions. In the run above, the built-in version was faster by 156 ms on average.

## Exercise 1: Converting IP Addresses to Decimals

Write a UDF that translates an IPv4 address string (e.g. `123.123.123.123`) into a numeric value.

### Step 1: Create a Function

IP addresses pose challenges for efficient database lookups.  One way of dealing with this is to store an IP address in numerical form.  Write the function `IPConvert` that satisfies the following:

Input: IP address as a string (e.g. `123.123.123.123`)  
Output: an integer representation of the IP address (e.g. `2071690107`)

If the input string is `1.2.3.4`, think of it like `A.B.C.D` where A is 1, B is 2, etc. Solve this with the following steps:

&nbsp;&nbsp;&nbsp; (A x 256^3) + (B x 256^2) + (C x 256) + D <br>
&nbsp;&nbsp;&nbsp; (1 x 256^3) + (2 x 256^2) + (3 x 256) + 4 <br>
&nbsp;&nbsp;&nbsp; 16777216 + 131072 + 768 + 4 <br>
&nbsp;&nbsp;&nbsp; 16909060

Make a function to implement this.

In [34]:
# TODO
def IPConvert(IPString):
  # Split "A.B.C.D" into its four octets and weight each by a power of 256
  a, b, c, d = (int(octet) for octet in IPString.split("."))
  return a * 256**3 + b * 256**2 + c * 256 + d

IPConvert("1.2.3.4") # should equal 16909060

In [35]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-03-01-01", 16909060, IPConvert("1.2.3.4"))
dbTest("ET2-P-03-01-02", 168430090, IPConvert("10.10.10.10"))
dbTest("ET2-P-03-01-03", 386744599, IPConvert("23.13.65.23"))

print("Tests passed!")
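As a cross-check on the arithmetic, Python's standard-library `ipaddress` module performs the same conversion. The sketch below compares it with a hand-written version of the formula. The function names `ip_to_int_manual` and `ip_to_int_stdlib` are illustrative and not part of the exercise.

```python
import ipaddress

def ip_to_int_manual(ip_string):
    """(A x 256^3) + (B x 256^2) + (C x 256) + D, as in the steps above."""
    a, b, c, d = (int(octet) for octet in ip_string.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d

def ip_to_int_stdlib(ip_string):
    """Let the standard library parse and convert the address."""
    return int(ipaddress.IPv4Address(ip_string))

for ip in ["1.2.3.4", "10.10.10.10", "23.13.65.23"]:
    assert ip_to_int_manual(ip) == ip_to_int_stdlib(ip)

print(ip_to_int_stdlib("1.2.3.4"))  # 16909060
```

Unlike the hand-written formula, `ipaddress.IPv4Address` also rejects malformed input such as `256.1.1.1` by raising an exception.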

### Step 2: Register a UDF

Register your function as `IPConvertUDF`.  Be sure to use `LongType` as your output type.

In [37]:
from pyspark.sql.types import LongType
IPConvertUDF = spark.udf.register("IPConvertUDF", IPConvert, LongType())

In [38]:
# TEST - Run this cell to test your solution
testDF = spark.createDataFrame((
  ("1.2.3.4", ),
  ("10.10.10.10", ),
  ("23.13.65.23", )
), ("ip",))
result = [i[0] for i in testDF.select(IPConvertUDF("ip")).collect()]

dbTest("ET2-P-03-02-01", 16909060, result[0])
dbTest("ET2-P-03-02-02", 168430090, result[1])
dbTest("ET2-P-03-02-03", 386744599, result[2])

print("Tests passed!")

### Step 3: Apply the UDF

Apply the UDF to the `ip` column of the DataFrame created below, creating the new column `parsedIP`.

In [40]:
# TODO
IPDF = spark.createDataFrame([["123.123.123.123"], ["1.2.3.4"], ["127.0.0.0"]], ['ip'])

IPDFWithParsedIP = IPDF.select("ip",IPConvertUDF("ip").alias("parsedIP"))

display(IPDFWithParsedIP)

ip,parsedIP
123.123.123.123,2071690107
1.2.3.4,16909060
127.0.0.0,2130706432


In [41]:
# TEST - Run this cell to test your solution
result2 = [i[1] for i in IPDFWithParsedIP.collect()]

dbTest("ET2-P-03-03-01", 2071690107, result2[0])
dbTest("ET2-P-03-03-02", 16909060, result2[1])
dbTest("ET2-P-03-03-03", 2130706432, result2[2])

print("Tests passed!")
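Real data often contains nulls or malformed addresses. Spark passes a null column value to a Python UDF as `None`, and a UDF that returns `None` yields a null in the result column. The sketch below is one possible defensive variant of the conversion. The name `ip_convert_safe` and the choice to return `None` for bad input are assumptions, not part of the exercise.

```python
def ip_convert_safe(ip_string):
    """Return the integer form of an IPv4 string, or None for unusable input."""
    if ip_string is None:
        return None
    parts = ip_string.split(".")
    # Require exactly four purely numeric parts
    if len(parts) != 4 or not all(p.isdecimal() for p in parts):
        return None
    octets = [int(p) for p in parts]
    # Each octet must fit in one byte
    if any(o > 255 for o in octets):
        return None
    a, b, c, d = octets
    return a * 256**3 + b * 256**2 + c * 256 + d

print(ip_convert_safe("127.0.0.0"))   # 2130706432
print(ip_convert_safe("256.1.1.1"))   # None
print(ip_convert_safe(None))          # None
```

A function like this could be registered in the same way as `IPConvert`, so that one bad row does not fail the whole job.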

## Review
**Question:** What are the performance trade-offs between UDFs and built-in functions?  When should I use each?  
**Answer:** Built-in functions are normally faster than UDFs and should be used when possible.  UDFs should be used when specific use cases arise that aren't addressed by built-in functions.

**Question:** How can I use UDFs?  
**Answer:** UDFs can be used in any Spark API. They can be registered for use in SQL and can otherwise be used in Scala, Python, R, and Java.

**Question:** Why are built-in functions faster?  
**Answer:** Reasons include:
* The Catalyst Optimizer knows how to optimize built-in functions
* They are written in highly optimized Scala
* There is no serialization cost at the time of running a built-in function

**Question:** Can UDFs have multiple column inputs and outputs?  
**Answer:** Yes, UDFs can have multiple column inputs and multiple complex outputs. This is covered in the following lesson.

## Next Steps

Start the next lesson, [Advanced UDFs]($./04-Advanced-UDFs ).

## Additional Topics & Resources

**Q:** Where can I find out more about UDFs?  
**A:** Take a look at the <a href="https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html" target="_blank">Databricks documentation for more details</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>