## User-Defined Functions (UDFs)

User-Defined Functions (UDFs) let you write custom column-level functions in Python (or other languages) and apply them to Spark DataFrames. They are powerful, but prefer a built-in Spark SQL function whenever one exists: built-ins run inside Spark's engine and can be optimized, whereas a Python UDF is opaque to the optimizer and requires moving each row between the JVM and a Python worker.

Let's create a simple DataFrame to work with.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, IntegerType

# getOrCreate() reuses the active SparkSession if one exists
# (common in Jupyter) and creates a new one otherwise
spark = SparkSession.builder.appName("UDF Examples").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

Original DataFrame:
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+



### Example 1: Simple String Manipulation UDF

Let's define a UDF that converts the 'name' column to uppercase.

In [2]:
# 1. Define the Python function
def to_upper_case(s):
    # Spark passes SQL NULL as Python None, so guard against it
    if s is not None:
        return s.upper()
    else:
        return None

# 2. Register the Python function as a UDF
#    Specify the return type of the UDF
upper_case_udf = udf(to_upper_case, StringType())

# 3. Apply the UDF to the DataFrame column
df_upper = df.withColumn("name_upper", upper_case_udf(col("name")))

print("DataFrame with Uppercase Name:")
df_upper.show()

DataFrame with Uppercase Name:
+-------+---+----------+
|   name|age|name_upper|
+-------+---+----------+
|  Alice| 25|     ALICE|
|    Bob| 30|       BOB|
|Charlie| 35|   CHARLIE|
+-------+---+----------+



### Example 2: UDF with Multiple Columns

Now, let's create a UDF that takes the 'name' and 'age' columns and returns a descriptive string.

In [3]:
# 1. Define the Python function
def describe_person(name, age):
    return f"{name} is {age} years old."

# 2. Register the UDF
describe_person_udf = udf(describe_person, StringType())

# 3. Apply the UDF
df_described = df.withColumn("description", describe_person_udf(col("name"), col("age")))

print("DataFrame with Description:")
df_described.show()

DataFrame with Description:
+-------+---+--------------------+
|   name|age|         description|
+-------+---+--------------------+
|  Alice| 25|Alice is 25 years...|
|    Bob| 30|Bob is 30 years old.|
|Charlie| 35|Charlie is 35 yea...|
+-------+---+--------------------+



### Example 3: Using UDFs with `spark.sql` (SQL Expressions)

You can also register UDFs for use directly within Spark SQL queries.

In [4]:
# Register the first UDF for SQL use
spark.udf.register("SQL_UPPER", to_upper_case, StringType())

# Create a temporary view to run SQL queries
df.createOrReplaceTempView("people")

# Use the registered UDF in a SQL query
sql_result = spark.sql("SELECT SQL_UPPER(name) as upper_name, age FROM people WHERE age > 28")

print("Result from SQL query using UDF:")
sql_result.show()

# Remember to stop the SparkSession if you're done (optional in interactive sessions)
# spark.stop()

Result from SQL query using UDF:
+----------+---+
|upper_name|age|
+----------+---+
|       BOB| 30|
|   CHARLIE| 35|
+----------+---+



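### Approach 1: Using a UDF with a Dictionary

The next three approaches all add a category column by looking up each age in a Python dictionary.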

In [5]:
# Create a dictionary
age_category = {
    25: "Young Adult",
    30: "Adult",
    35: "Mid-30s"
}

# Define UDF that uses the dictionary
def get_age_category(age):
    return age_category.get(age, "Unknown")

# Register UDF
age_category_udf = udf(get_age_category, StringType())

# Apply UDF to create new column
df_with_category = df.withColumn("age_category", age_category_udf(col("age")))

# Show result
print("DataFrame with Category Column (UDF approach):")
df_with_category.show()

DataFrame with Category Column (UDF approach):
+-------+---+------------+
|   name|age|age_category|
+-------+---+------------+
|  Alice| 25| Young Adult|
|    Bob| 30|       Adult|
|Charlie| 35|     Mid-30s|
+-------+---+------------+



### Approach 2: Using a Map Expression (More Efficient)

In [6]:
from pyspark.sql.functions import create_map, lit
from itertools import chain

# Flatten the dictionary into [key1, value1, key2, value2, ...] and wrap
# each item in lit() so create_map() can build a Spark map column from it
mapping_expr = create_map([lit(x) for x in chain(*age_category.items())])

# Apply the mapping expression to create new column
df_with_map = df.withColumn("age_category", mapping_expr[df.age])

# Show result
print("DataFrame with Category Column (map approach):")
df_with_map.show()

DataFrame with Category Column (map approach):
+-------+---+------------+
|   name|age|age_category|
+-------+---+------------+
|  Alice| 25| Young Adult|
|    Bob| 30|       Adult|
|Charlie| 35|     Mid-30s|
+-------+---+------------+



### Approach 3: Creating a Mapping DataFrame and Joining

In [7]:
# Convert dictionary to DataFrame
mapping_data = [(k, v) for k, v in age_category.items()]
mapping_df = spark.createDataFrame(mapping_data, ["age", "category"])

# Join with original DataFrame
df_with_join = df.join(mapping_df, "age", "left")

# Show result
print("DataFrame with Category Column (join approach):")
df_with_join.show()

DataFrame with Category Column (join approach):
+---+-------+-----------+
|age|   name|   category|
+---+-------+-----------+
| 25|  Alice|Young Adult|
| 30|    Bob|      Adult|
| 35|Charlie|    Mid-30s|
+---+-------+-----------+



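The UDF approach is the simplest to write but tends to be the slowest on large datasets, because every row is serialized to a Python worker and back. The map expression is generally more efficient because the lookup stays inside Spark's execution engine. The join approach works well when the mapping table is large or is reused in several places. Note that the approaches treat unmatched ages differently: the UDF returns "Unknown", while the map lookup and the left join produce null by default.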