<a href="https://colab.research.google.com/github/vaniamv/dataprocessing/blob/main/spark/examples/07-udf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UDF
### Performance gaps with UDF

- Serialization and deserialization: each batch of rows is serialized in the JVM, sent to a Python worker process, deserialized, processed, and then sent back the same way. This round trip adds latency.
- Row-at-a-time execution: a standard Python UDF is called once per row inside the Python worker. Tasks still run in parallel across partitions, but the UDF cannot use the JVM-native, code-generated execution that built-in Spark functions get.
- Lack of optimization: Spark’s Catalyst optimizer treats a UDF as a black box. It cannot inspect or rewrite the logic inside, which can lead to inefficient execution plans.
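
The serialization round trip in the first bullet can be made concrete without a cluster. The sketch below is only a simulation: it uses the standard library's `pickle` as a stand-in for the JVM-to-Python transfer, and the helper name `apply_like_python_udf` is hypothetical. Spark's real wire format and batching are more involved.

```python
import pickle


def apply_like_python_udf(rows, func):
    """Mimic, in simplified form, the data path of a row-at-a-time Python UDF."""
    payload = pickle.dumps(rows)        # engine side: serialize a batch of values
    received = pickle.loads(payload)    # Python worker: deserialize the batch
    results = [func(value) for value in received]  # one Python call per row
    # The results are serialized again and sent back to the engine.
    return pickle.loads(pickle.dumps(results))


names = ["John Doe", "Chloe", "Amy"]
lengths = apply_like_python_udf(names, len)
print(lengths)  # [8, 5, 3]
```

Every value crosses the process boundary twice, which is work a built-in function never has to do.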

### When to use
- Use UDFs for logic that is difficult to express with built-in Apache Spark functions. Built-in Apache Spark functions are optimized for distributed processing and generally offer better performance at scale.

- Databricks recommends UDFs for ad hoc queries, manual data cleansing, exploratory data analysis, and operations on small to medium-sized datasets. Common use cases for UDFs include data encryption and decryption, hashing, JSON parsing, and validation.

- Use Apache Spark methods for operations on very large datasets and any workloads that are run regularly or continuously, including ETL jobs and streaming operations.

# Setting up PySpark

In [None]:
%pip install pyspark



In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Spark Course').config('spark.ui.port', '4050').getOrCreate()
sc = spark.sparkContext

# UDF

In [2]:
# UDF examples

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())

@udf
def to_upper(s):
    if s is not None:
        return s.upper()

@udf(returnType=IntegerType())
def add_one(x):
    if x is not None:
        return x + 1

df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")).show()

+----------+--------------+------------+
|slen(name)|to_upper(name)|add_one(age)|
+----------+--------------+------------+
|         8|      JOHN DOE|          22|
+----------+--------------+------------+
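
For comparison, the three UDFs above each have a built-in counterpart. The following is a minimal sketch, assuming a DataFrame with the same `name` and `age` columns as `df`. The helper name `select_with_builtins` is hypothetical, and the import sits inside the function only so the sketch can be loaded without a Spark installation.

```python
def select_with_builtins(df):
    """Return the same three columns as the UDF example, using built-ins only."""
    # Imported here so the module can be loaded even where PySpark is missing.
    from pyspark.sql import functions as F

    return df.select(
        F.length("name").alias("slen(name)"),       # replaces slen
        F.upper("name").alias("to_upper(name)"),    # replaces to_upper
        (F.col("age") + 1).alias("add_one(age)"),   # replaces add_one
    )


# Usage, assuming `df` from the cell above:
#   select_with_builtins(df).show()
```

The built-ins return null for null input, so the explicit `is not None` checks are not needed, and the whole expression stays inside the JVM.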



In [3]:
# Preparing the data
employee_data = [("101", "Chloe", 3),
            ("102", "Paul", 1),
            ("103", "John", 1),
            ("104", "Lisa", 2),
            ("105", "Evan", 3),
            ("106", "Amy", 3),
            ("107", "Jimmy", 5)]

employee_columns = ["id", "name", "dpto"]

employee = sc.parallelize(employee_data).toDF(employee_columns)

In [4]:

# Creating a UDF to identify the employees that will be fired

fired_employees = ["John", "Lisa", "Evan"]

@udf
def add_char_at_end(s, fired=fired_employees):
  if s in fired:
    return f"{s}#FIRED"
  else:
    return s

employee.select(employee["*"], add_char_at_end("name").alias("additional_info")).show()  # add_char_at_end takes a single column because its second argument has a fixed default

+---+-----+----+---------------+
| id| name|dpto|additional_info|
+---+-----+----+---------------+
|101|Chloe|   3|          Chloe|
|102| Paul|   1|           Paul|
|103| John|   1|     John#FIRED|
|104| Lisa|   2|     Lisa#FIRED|
|105| Evan|   3|     Evan#FIRED|
|106|  Amy|   3|            Amy|
|107|Jimmy|   5|          Jimmy|
+---+-----+----+---------------+
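
The default-argument trick above fixes the list when the function is defined. One alternative, shown here as a sketch rather than the notebook's method, is a small factory that captures the list in a closure. The name `make_fired_tagger` is hypothetical. The returned function is plain Python, so it can be checked without a Spark session.

```python
def make_fired_tagger(fired):
    """Return a function that appends '#FIRED' to names found in `fired`."""
    fired_set = frozenset(fired)  # set membership is cheap to test per row

    def tag(name):
        if name in fired_set:
            return f"{name}#FIRED"
        return name  # unchanged, including None for null input

    return tag


tag = make_fired_tagger(["John", "Lisa", "Evan"])
print(tag("John"))   # John#FIRED
print(tag("Chloe"))  # Chloe

# With a Spark session available, the same function can be wrapped as a UDF:
#   from pyspark.sql.functions import udf
#   tag_udf = udf(tag)  # the default return type is a string
#   employee.select("*", tag_udf("name").alias("additional_info")).show()
```

Keeping the logic in a plain function makes it easy to unit-test before it ever runs on an executor.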



In [None]:
# Can be used in SQL

slen = udf(lambda s: len(s), IntegerType())
spark.udf.register("slen", slen)  # a UDF must be registered before it can be called from SQL

qry2 = """SELECT slen("Data Engineering Course") AS length"""

spark.sql(qry2).show()

+------+
|length|
+------+
|    23|
+------+



# Question

In [5]:
# Q1
# Create a UDF to add the department name to the dataset based on dpto id
# mapping: {1: "Marketing", 2: "Sales", 3: "HR", 4: "Finance", 5: "IT"}

dic = {1: "Marketing", 2: "Sales", 3: "HR", 4: "Finance", 5: "IT"}

In [6]:
dic[1]

'Marketing'

In [7]:
employee.show()

+---+-----+----+
| id| name|dpto|
+---+-----+----+
|101|Chloe|   3|
|102| Paul|   1|
|103| John|   1|
|104| Lisa|   2|
|105| Evan|   3|
|106|  Amy|   3|
|107|Jimmy|   5|
+---+-----+----+



In [8]:

@udf
def get_dpto(s, mapping=dic):
  # Look up the department name; unknown or null ids yield null instead of raising KeyError
  return mapping.get(s)


employee.select(employee["*"], get_dpto("dpto").alias("additional_info")).show()


+---+-----+----+---------------+
| id| name|dpto|additional_info|
+---+-----+----+---------------+
|101|Chloe|   3|             HR|
|102| Paul|   1|      Marketing|
|103| John|   1|      Marketing|
|104| Lisa|   2|          Sales|
|105| Evan|   3|             HR|
|106|  Amy|   3|             HR|
|107|Jimmy|   5|             IT|
+---+-----+----+---------------+



In [10]:
# To use the UDF from SQL
spark.udf.register("get_dpto", get_dpto)  # a UDF must be registered before it can be called from SQL
employee.createOrReplaceTempView("employee")
spark.sql("select * , get_dpto(dpto) from employee").show()


+---+-----+----+--------------+
| id| name|dpto|get_dpto(dpto)|
+---+-----+----+--------------+
|101|Chloe|   3|            HR|
|102| Paul|   1|     Marketing|
|103| John|   1|     Marketing|
|104| Lisa|   2|         Sales|
|105| Evan|   3|            HR|
|106|  Amy|   3|            HR|
|107|Jimmy|   5|            IT|
+---+-----+----+--------------+
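
Because the mapping is small and static, the same lookup can also be written as a plain SQL `CASE` expression, which avoids the UDF entirely. The helper below is a sketch under some assumptions: `case_expression` is a hypothetical name, the keys are integers, the column name is trusted, and labels containing quotes or backslashes are rejected rather than escaped.

```python
def case_expression(column, mapping):
    """Build a Spark SQL CASE expression from an integer -> label mapping."""
    branches = []
    for key, label in sorted(mapping.items()):
        if "'" in label or "\\" in label:
            # Quoting rules are out of scope for this sketch.
            raise ValueError(f"unsupported character in label: {label!r}")
        branches.append(f"WHEN {int(key)} THEN '{label}'")
    # With no ELSE branch, unmatched ids evaluate to NULL.
    return f"CASE {column} " + " ".join(branches) + " END"


dic = {1: "Marketing", 2: "Sales", 3: "HR", 4: "Finance", 5: "IT"}
expr = case_expression("dpto", dic)
print(expr)
# CASE dpto WHEN 1 THEN 'Marketing' WHEN 2 THEN 'Sales' WHEN 3 THEN 'HR' WHEN 4 THEN 'Finance' WHEN 5 THEN 'IT' END

# With the `employee` temp view registered as above:
#   spark.sql(f"SELECT *, {expr} AS dpto_name FROM employee").show()
```

The generated expression is evaluated by Spark itself, so no rows are shipped to a Python worker.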

