# Complex UDFs with Dictionary Return Types in PySpark

This notebook demonstrates how to create and use PySpark UDFs that read multiple input columns and return Python dictionaries containing various data types including:
- Integers
- Strings
- Nested objects/dictionaries

We'll cover:
1. Creating UDFs that return dictionaries
2. Converting Python dictionaries to structured data in Spark
3. Handling nested structures efficiently
4. Performance considerations

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, struct, to_json, from_json
from pyspark.sql.types import (
    StringType, IntegerType, DoubleType, BooleanType, 
    StructType, StructField, MapType, ArrayType
)

# Initialize SparkSession
spark = SparkSession.builder.appName("Dictionary UDF Examples").getOrCreate()

print("SparkSession initialized successfully!")


SparkSession initialized successfully!



## 1. Create Sample Data

Let's create a sample DataFrame with various data types to work with:

In [2]:
# Create a sample DataFrame
data = [
    (1, "John", 32, "New York", 75000.0),
    (2, "Alice", 28, "San Francisco", 95000.0),
    (3, "Bob", 45, "Chicago", 85000.0),
    (4, "Emma", 36, "Seattle", 92000.0),
    (5, "Michael", 52, "Boston", 120000.0)
]

columns = ["id", "name", "age", "city", "salary"]
df = spark.createDataFrame(data, columns)

print("Sample DataFrame:")
df.show()

Sample DataFrame:
+---+-------+---+-------------+--------+
| id|   name|age|         city|  salary|
+---+-------+---+-------------+--------+
|  1|   John| 32|     New York| 75000.0|
|  2|  Alice| 28|San Francisco| 95000.0|
|  3|    Bob| 45|      Chicago| 85000.0|
|  4|   Emma| 36|      Seattle| 92000.0|
|  5|Michael| 52|       Boston|120000.0|
+---+-------+---+-------------+--------+



## 2. UDF that Returns a Simple Dictionary

First, let's create a UDF that processes multiple columns and returns a simple dictionary containing string and integer values.

In [3]:
# Define a UDF that returns a simple dictionary
def create_person_dict(name, age, city):
    """
    Create a dictionary with person information
    """
    return {
        "full_name": name,
        "age_in_years": age,
        "residence": city,
        "age_group": "young" if age < 30 else "middle" if age < 50 else "senior"
    }

# Define the schema for the struct we want to return
person_struct_schema = StructType([
    StructField("full_name", StringType(), True),
    StructField("age_in_years", IntegerType(), True),
    StructField("residence", StringType(), True),
    StructField("age_group", StringType(), True)
])

# Register UDF with struct return type that can be converted to JSON
person_struct_udf = udf(create_person_dict, person_struct_schema)

# Apply the UDF and convert resulting struct to JSON
df_with_dict = df.withColumn(
    "person_info", 
    to_json(person_struct_udf(col("name"), col("age"), col("city")))
)

print("DataFrame with Dictionary UDF result:")
df_with_dict.show(truncate=False)

DataFrame with Dictionary UDF result:
+---+-------+---+-------------+--------+---------------------------------------------------------------------------------------+
|id |name   |age|city         |salary  |person_info                                                                            |
+---+-------+---+-------------+--------+---------------------------------------------------------------------------------------+
|1  |John   |32 |New York     |75000.0 |{"full_name":"John","age_in_years":32,"residence":"New York","age_group":"middle"}     |
|2  |Alice  |28 |San Francisco|95000.0 |{"full_name":"Alice","age_in_years":28,"residence":"San Francisco","age_group":"young"}|
|3  |Bob    |45 |Chicago      |85000.0 |{"full_name":"Bob","age_in_years":45,"residence":"Chicago","age_group":"middle"}       |
|4  |Emma   |36 |Seattle      |92000.0 |{"full_name":"Emma","age_in_years":36,"residence":"Seattle","age_group":"middle"}      |
|5  |Michael|52 |Boston       |120000.0|{"full_name":"Micha
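
One caveat with the function above: Spark passes a SQL `NULL` to a Python UDF as `None`, and `None < 30` raises a `TypeError` in Python 3. A minimal null-safe sketch of the same bucketing logic follows. The names `age_group_for` and `create_person_dict_safe` are hypothetical, and returning `None` for a missing age is an assumption about the desired behavior.

```python
from typing import Optional


def age_group_for(age: Optional[int]) -> Optional[str]:
    """Bucket an age the same way as create_person_dict, tolerating None."""
    if age is None:
        # Assumption: a missing age should produce a null age_group
        return None
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


def create_person_dict_safe(name, age, city):
    """Null-tolerant variant of create_person_dict (hypothetical name)."""
    return {
        "full_name": name,
        "age_in_years": age,
        "residence": city,
        "age_group": age_group_for(age),
    }


print(create_person_dict_safe("Bob", None, "Chicago"))
```

Since every field in `person_struct_schema` is declared nullable, a `None` value in the returned dictionary should map to a null struct field.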

### Alternative Approach: Direct JSON String

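Another approach is to serialize the dictionary with `json.dumps` inside the UDF and return a plain string, instead of returning a struct and calling `to_json`. This avoids declaring a return schema, at the cost of leaving the column as an untyped string until it is parsed: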

In [4]:
# Alternative approach: return a JSON string directly from the UDF
import json

def create_person_json(name, age, city):
    """
    Create a dictionary and convert it to a JSON string
    """
    person_dict = {
        "full_name": name,
        "age_in_years": age,
        "residence": city,
        "age_group": "young" if age < 30 else "middle" if age < 50 else "senior"
    }
    return json.dumps(person_dict)

# Register UDF with return type as string
person_json_udf = udf(create_person_json, StringType())

# Apply the UDF
df_with_json = df.withColumn(
    "person_info", 
    person_json_udf(col("name"), col("age"), col("city"))
)

print("DataFrame with JSON UDF result:")
df_with_json.show(truncate=False)

DataFrame with JSON UDF result:
+---+-------+---+-------------+--------+----------------------------------------------------------------------------------------------+
|id |name   |age|city         |salary  |person_info                                                                                   |
+---+-------+---+-------------+--------+----------------------------------------------------------------------------------------------+
|1  |John   |32 |New York     |75000.0 |{"full_name": "John", "age_in_years": 32, "residence": "New York", "age_group": "middle"}     |
|2  |Alice  |28 |San Francisco|95000.0 |{"full_name": "Alice", "age_in_years": 28, "residence": "San Francisco", "age_group": "young"}|
|3  |Bob    |45 |Chicago      |85000.0 |{"full_name": "Bob", "age_in_years": 45, "residence": "Chicago", "age_group": "middle"}       |
|4  |Emma   |36 |Seattle      |92000.0 |{"full_name": "Emma", "age_in_years": 36, "residence": "Seattle", "age_group": "middle"}      |
|5  |Michael|52 
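
The two outputs differ in spacing: `to_json` emits compact JSON, while `json.dumps` by default inserts a space after each `,` and `:`. If byte-identical strings matter (for example when comparing or hashing values), the standard-library `separators` argument produces the compact form. A small sketch using one row of the sample data:

```python
import json

person = {
    "full_name": "John",
    "age_in_years": 32,
    "residence": "New York",
    "age_group": "middle",
}

# Default separators are (", ", ": "), which adds a space after each delimiter
default_json = json.dumps(person)

# Compact separators drop those spaces, matching the layout of the to_json output
compact_json = json.dumps(person, separators=(",", ":"))

print(default_json)
print(compact_json)
```

Both strings parse to the same value, so `from_json` treats them identically. The difference only matters for string-level comparisons.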

## 3. Parsing the JSON Back to a Structured Format

Now, let's parse the JSON string back to a structured column using `from_json`:

In [5]:
# Define schema for the person_info JSON
person_schema = StructType([
    StructField("full_name", StringType(), True),
    StructField("age_in_years", IntegerType(), True),
    StructField("residence", StringType(), True),
    StructField("age_group", StringType(), True)
])

# Parse the JSON string back to a struct
df_parsed = df_with_json.withColumn(
    "person_struct",
    from_json(col("person_info"), person_schema)
)

# Access fields from the parsed struct
df_parsed_fields = df_parsed.select(
    "id",
    "salary",
    "person_struct.full_name",
    "person_struct.age_in_years",
    "person_struct.residence",
    "person_struct.age_group"
)

print("DataFrame with parsed JSON fields:")
df_parsed_fields.show()

DataFrame with parsed JSON fields:
+---+--------+---------+------------+-------------+---------+
| id|  salary|full_name|age_in_years|    residence|age_group|
+---+--------+---------+------------+-------------+---------+
|  1| 75000.0|     John|          32|     New York|   middle|
|  2| 95000.0|    Alice|          28|San Francisco|    young|
|  3| 85000.0|      Bob|          45|      Chicago|   middle|
|  4| 92000.0|     Emma|          36|      Seattle|   middle|
|  5|120000.0|  Michael|          52|       Boston|   senior|
+---+--------+---------+------------+-------------+---------+



## 4. Complex UDF with Nested Dictionaries

Let's create a more complex UDF that returns nested dictionaries:

In [6]:
def create_complex_profile(id, name, age, city, salary):
    """
    Create a complex dictionary with nested structures
    """
    # Calculate some derived values
    tax_rate = 0.20 if salary < 85000 else 0.30
    take_home = salary * (1 - tax_rate)
    retirement_years = 65 - age
    
    # Create nested dictionary
    profile = {
        "personal": {
            "id": id,
            "name": name,
            "age": age,
            "location": {
                "city": city,
                "country": "USA",  # Assuming USA for all
                "is_metro": city in ["New York", "San Francisco", "Chicago", "Boston"]
            }
        },
        "financial": {
            "income": {
                "gross_salary": salary,
                "tax_rate": tax_rate,
                "take_home": take_home
            },
            "retirement": {
                "years_to_retirement": retirement_years,
                "retirement_age": 65
            }
        },
        "summary": f"{name} from {city}, age {age}, earning ${salary:.2f}"
    }
    
    return json.dumps(profile)

# Register UDF
complex_profile_udf = udf(create_complex_profile, StringType())

# Apply the UDF
df_complex = df.withColumn(
    "profile",
    complex_profile_udf(
        col("id"), col("name"), col("age"), col("city"), col("salary")
    )
)

print("DataFrame with complex nested dictionary:")
df_complex.select("id", "profile").show(truncate=False)

DataFrame with complex nested dictionary:
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |profile                                                                                                                                                                                                                                                                                                                                                     |
+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 5. Parse Complex Nested Structure

Now, let's define a schema for our complex structure and parse it:

In [7]:
# Define schema for the complex profile
location_schema = StructType([
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("is_metro", BooleanType(), True)
])

personal_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("location", location_schema, True)
])

income_schema = StructType([
    StructField("gross_salary", DoubleType(), True),
    StructField("tax_rate", DoubleType(), True),
    StructField("take_home", DoubleType(), True)
])

retirement_schema = StructType([
    StructField("years_to_retirement", IntegerType(), True),
    StructField("retirement_age", IntegerType(), True)
])

financial_schema = StructType([
    StructField("income", income_schema, True),
    StructField("retirement", retirement_schema, True)
])

profile_schema = StructType([
    StructField("personal", personal_schema, True),
    StructField("financial", financial_schema, True),
    StructField("summary", StringType(), True)
])

# Parse the complex JSON
df_complex_parsed = df_complex.withColumn(
    "parsed_profile",
    from_json(col("profile"), profile_schema)
)

# Show schema of the parsed profile
print("Schema of parsed profile:")
df_complex_parsed.select("parsed_profile").printSchema()

Schema of parsed profile:
root
 |-- parsed_profile: struct (nullable = true)
 |    |-- personal: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: integer (nullable = true)
 |    |    |-- location: struct (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- is_metro: boolean (nullable = true)
 |    |-- financial: struct (nullable = true)
 |    |    |-- income: struct (nullable = true)
 |    |    |    |-- gross_salary: double (nullable = true)
 |    |    |    |-- tax_rate: double (nullable = true)
 |    |    |    |-- take_home: double (nullable = true)
 |    |    |-- retirement: struct (nullable = true)
 |    |    |    |-- years_to_retirement: integer (nullable = true)
 |    |    |    |-- retirement_age: integer (nullable = true)
 |    |-- summary: string (nullable = true)
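
The dotted paths usable in `select` mirror the nesting of the original dictionary. As a quick cross-check when writing a schema by hand, a small pure-Python helper can list those paths from a sample dictionary. The helper `dotted_paths` is hypothetical and not part of PySpark, and the sample below is a trimmed-down version of the profile.

```python
def dotted_paths(d: dict, prefix: str = "") -> list:
    """Return the leaf paths of a nested dict in 'a.b.c' form, in insertion order."""
    paths = []
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the prefix
            paths.extend(dotted_paths(value, path))
        else:
            paths.append(path)
    return paths


sample = {
    "personal": {"id": 1, "name": "John", "location": {"city": "New York"}},
    "financial": {"income": {"gross_salary": 75000.0}},
    "summary": "John from New York",
}

print(dotted_paths(sample))
```

Each returned path, prefixed with the struct column name (here `parsed_profile`), corresponds to one leaf in the `printSchema()` output.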



## 6. Extract Specific Fields from Nested Structure

Now let's extract and work with specific fields from our complex nested structure:

In [8]:
# Extract specific fields
df_extracted = df_complex_parsed.select(
    "parsed_profile.personal.name",
    "parsed_profile.personal.location.city",
    "parsed_profile.personal.location.is_metro",
    "parsed_profile.financial.income.gross_salary",
    "parsed_profile.financial.income.take_home",
    "parsed_profile.financial.retirement.years_to_retirement",
    "parsed_profile.summary"
)

print("Extracted fields from nested structure:")
df_extracted.show(truncate=False)

Extracted fields from nested structure:
+-------+-------------+--------+------------+-----------------+-------------------+---------------------------------------------------+
|name   |city         |is_metro|gross_salary|take_home        |years_to_retirement|summary                                            |
+-------+-------------+--------+------------+-----------------+-------------------+---------------------------------------------------+
|John   |New York     |true    |75000.0     |60000.0          |33                 |John from New York, age 32, earning $75000.00      |
|Alice  |San Francisco|true    |95000.0     |66500.0          |37                 |Alice from San Francisco, age 28, earning $95000.00|
|Bob    |Chicago      |true    |85000.0     |59499.99999999999|20                 |Bob from Chicago, age 45, earning $85000.00        |
|Emma   |Seattle      |false   |92000.0     |64399.99999999999|29                 |Emma from Seattle, age 36, earning $92000.00       |
|Michael
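
The `take_home` values such as `59499.99999999999` are ordinary binary floating-point artifacts of `salary * (1 - tax_rate)`. If two-decimal currency values are wanted, one option is to round inside the UDF before serializing. The sketch below reuses the tax thresholds from `create_complex_profile`. The name `compute_take_home` and the choice of two decimals are assumptions.

```python
def compute_take_home(salary: float, ndigits: int = 2) -> float:
    """Apply the notebook's tax rule and round the result."""
    tax_rate = 0.20 if salary < 85000 else 0.30
    return round(salary * (1 - tax_rate), ndigits)


for salary in (75000.0, 85000.0, 92000.0):
    print(salary, compute_take_home(salary))
```

Where exact decimal arithmetic is required, Python's `decimal.Decimal` paired with Spark's `DecimalType` is the usual alternative to rounding floats.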

## 7. Creating a MapType UDF

Another approach is to use a `MapType` return type instead of converting to and from JSON. The limitation is that all values in a map must share one type, so mixed values have to be cast to a common type (here, strings):

In [9]:
# Define a UDF that returns a map (key-value pairs)
def create_person_map(name, age):
    """
    Create a map of person attributes
    Note: All values must be of the same type when using MapType
    """
    # Convert all values to strings for consistency
    return {
        "name": name,
        "age": str(age),
        "age_group": "young" if age < 30 else "middle" if age < 50 else "senior"
    }

# Register UDF with MapType return type
# Note: Both keys and values must be of uniform types
person_map_udf = udf(create_person_map, MapType(StringType(), StringType()))

# Apply the UDF
df_with_map = df.withColumn(
    "person_map",
    person_map_udf(col("name"), col("age"))
)

print("DataFrame with MapType result:")
df_with_map.show(truncate=False)

DataFrame with MapType result:
+---+-------+---+-------------+--------+-------------------------------------------------+
|id |name   |age|city         |salary  |person_map                                       |
+---+-------+---+-------------+--------+-------------------------------------------------+
|1  |John   |32 |New York     |75000.0 |{name -> John, age_group -> middle, age -> 32}   |
|2  |Alice  |28 |San Francisco|95000.0 |{name -> Alice, age_group -> young, age -> 28}   |
|3  |Bob    |45 |Chicago      |85000.0 |{name -> Bob, age_group -> middle, age -> 45}    |
|4  |Emma   |36 |Seattle      |92000.0 |{name -> Emma, age_group -> middle, age -> 36}   |
|5  |Michael|52 |Boston       |120000.0|{name -> Michael, age_group -> senior, age -> 52}|
+---+-------+---+-------------+--------+-------------------------------------------------+



### Access individual values from the map

In [10]:
# Access a specific key from the map
df_map_access = df_with_map.select(
    "id",
    "name",
    "age",
    col("person_map")["name"].alias("map_name"),
    col("person_map")["age_group"].alias("map_age_group")
)

print("Accessing specific keys from map:")
df_map_access.show()

Accessing specific keys from map:
+---+-------+---+--------+-------------+
| id|   name|age|map_name|map_age_group|
+---+-------+---+--------+-------------+
|  1|   John| 32|    John|       middle|
|  2|  Alice| 28|   Alice|        young|
|  3|    Bob| 45|     Bob|       middle|
|  4|   Emma| 36|    Emma|       middle|
|  5|Michael| 52| Michael|       senior|
+---+-------+---+--------+-------------+
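
Because `MapType(StringType(), StringType())` requires every value to be a string, a dictionary with mixed values has to be normalized before the UDF returns it. A minimal sketch of such a normalizer follows. `to_string_map` is a hypothetical helper, and keeping `None` as `None` (so that the map value becomes null) is an assumption about the desired behavior.

```python
def to_string_map(values: dict) -> dict:
    """Stringify keys and values, preserving None as None."""
    return {
        str(key): (None if value is None else str(value))
        for key, value in values.items()
    }


print(to_string_map({"name": "John", "age": 32, "salary": 75000.0, "nickname": None}))
```

The trade-off is that numeric values come back as strings and need a cast when read, for example `col("person_map")["age"].cast("int")`.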



## 8. Performance Comparison

Let's look at the performance implications of these different approaches:

In [11]:
# Create a larger DataFrame for performance testing
from pyspark.sql.functions import rand, monotonically_increasing_id
import time

# Stack copies of the DataFrame to create more rows
df_large = df.union(df).union(df).union(df)  # 4 x 5 = 20 rows
for i in range(3):  # Double three times: 20 -> 40 -> 80 -> 160 rows
    df_large = df_large.union(df_large)

# Add some random variation
df_large = df_large.withColumn("salary", col("salary") * (rand() + 0.5))
df_large = df_large.withColumn("id", monotonically_increasing_id())

print(f"Created larger DataFrame with {df_large.count()} rows")



Created larger DataFrame with 160 rows


                                                                                


In [13]:
# Test JSON string approach
start_time = time.time()
df_large_json = df_large.withColumn(
    "profile",
    complex_profile_udf(
        col("id"), col("name"), col("age"), col("city"), col("salary")
    )
)
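# count() triggers a job, but Spark may prune the unused "profile" column,
# in which case the UDF is never evaluated and only job overhead is timed
df_large_json.select("id", "profile").count()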
json_time = time.time() - start_time

# Test MapType approach
start_time = time.time()
df_large_map = df_large.withColumn(
    "person_map",
    person_map_udf(col("name"), col("age"))
)
# Same caveat: count() may not evaluate the "person_map" UDF column
df_large_map.select("id", "person_map").count()
map_time = time.time() - start_time

print(f"Time for JSON string approach: {json_time:.2f} seconds")
print(f"Time for MapType approach: {map_time:.2f} seconds")



Time for JSON string approach: 2.70 seconds
Time for MapType approach: 2.64 seconds
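
A single wall-clock measurement on 160 rows is dominated by job scheduling and Python worker start-up, and the two UDFs do different amounts of work, so the two numbers above are not enough to rank the approaches. A more careful sketch repeats each action several times and reports the median. The helper below is plain Python: `time_action` is a hypothetical name, and the callable passed in is assumed to trigger a full Spark job.

```python
import statistics
import time


def time_action(action, repeats: int = 5) -> float:
    """Run `action` `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs without a Spark session
median_seconds = time_action(lambda: sum(range(100_000)), repeats=3)
print(f"median: {median_seconds:.4f} s")
```

With Spark 3.0 or later, the action could be something like `lambda: df_large_json.write.format("noop").mode("overwrite").save()`, which materializes every column. By contrast, `count()` alone may let the optimizer prune the UDF column so that the UDF is never evaluated.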



## 9. Summary and Best Practices

Based on our exploration, here are some best practices for working with dictionary-based UDFs in PySpark:

1. **For simple dictionaries with homogeneous value types:**
   - Use `MapType` for direct key access without a parsing step
   - Ensure all values are of the same type, casting where necessary (Section 7 converts `age` to a string)

2. **For complex nested dictionaries with heterogeneous types:**
   - Convert to JSON strings in the UDF, or return a `StructType` directly as in Section 2
   - Use `from_json` with an explicit schema to parse
   - Leverage dot notation to access nested fields

3. **Performance considerations:**
   - JSON serialization/deserialization adds overhead
   - The timings in Section 8 (2.70 s vs 2.64 s on 160 rows) are too close, and the dataset too small, to rank the approaches
   - Consider using Pandas UDFs for better performance with large datasets
   - If possible, restructure your data model to avoid complex nested structures

4. **Schema Management:**
   - Always define explicit schemas for JSON parsing
   - Use appropriate data types to avoid later conversions
   - Document your schema structure for future reference