# Flattening Complex Objects in PySpark with UDFs

This notebook demonstrates how to use User-Defined Functions (UDFs) to flatten complex nested data structures in PySpark. We'll cover three common scenarios, followed by a combined example that uses all three:

1. Flattening arrays into strings
2. Extracting data from nested structures
3. Flattening map/dictionary structures

Let's first initialize our SparkSession.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, explode, array_join
from pyspark.sql.types import StringType, ArrayType, StructType, StructField, MapType, IntegerType

# Initialize SparkSession
spark = SparkSession.builder.appName("Flatten Object Examples").getOrCreate()

print("SparkSession initialized successfully!")


SparkSession initialized successfully!



## 1. Flattening Arrays

Arrays are common data structures in Spark DataFrames. Sometimes you need to convert an array into a single string, for example for reporting or for exporting to CSV.

Let's create a DataFrame with an array column and flatten it using a UDF.

In [2]:
# Create a DataFrame with array column
array_data = [
    (1, ["apple", "banana", "cherry"]),
    (2, ["orange", "grape"]),
    (3, []),
    (4, None)  # Handle null case
]
array_df = spark.createDataFrame(array_data, ["id", "fruits"])

print("Original DataFrame with Array:")
array_df.show(truncate=False)

Original DataFrame with Array:
+---+-----------------------+
|id |fruits                 |
+---+-----------------------+
|1  |[apple, banana, cherry]|
|2  |[orange, grape]        |
|3  |[]                     |
|4  |NULL                   |
+---+-----------------------+



### UDF Approach to Flatten Array

Let's define a UDF that converts an array to a comma-separated string.

In [3]:
# Define UDF to flatten array to comma-separated string
@udf(StringType())
def flatten_array(arr):
    if arr is None:
        return None
    return ", ".join(arr)

# Apply the UDF
flattened_array_df = array_df.withColumn("flattened_fruits", flatten_array(col("fruits")))

print("DataFrame with Flattened Array:")
flattened_array_df.show(truncate=False)

DataFrame with Flattened Array:
+---+-----------------------+---------------------+
|id |fruits                 |flattened_fruits     |
+---+-----------------------+---------------------+
|1  |[apple, banana, cherry]|apple, banana, cherry|
|2  |[orange, grape]        |orange, grape        |
|3  |[]                     |                     |
|4  |NULL                   |NULL                 |
+---+-----------------------+---------------------+



### Built-in Function Alternative

While the UDF works, Spark's built-in `array_join()` function does the same job and is usually more efficient, because it runs inside the JVM and avoids passing each row to a Python worker.

In [4]:
# Using built-in function (more efficient)
array_df_built_in = array_df.withColumn(
    "flattened_fruits_built_in", 
    array_join(col("fruits"), ", ")
)

print("Using Built-in array_join Function:")
array_df_built_in.show(truncate=False)

Using Built-in array_join Function:
+---+-----------------------+-------------------------+
|id |fruits                 |flattened_fruits_built_in|
+---+-----------------------+-------------------------+
|1  |[apple, banana, cherry]|apple, banana, cherry    |
|2  |[orange, grape]        |orange, grape            |
|3  |[]                     |                         |
|4  |NULL                   |NULL                     |
+---+-----------------------+-------------------------+
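
One behavioral difference between the two approaches is worth noting: `", ".join(arr)` raises a `TypeError` if the array contains a null element, whereas `array_join` ignores null elements unless a replacement string is supplied. The sketch below is one possible null-tolerant version of the join logic, written as plain Python so it can be checked without a cluster. The name `join_non_null` is hypothetical and is not part of the notebook above.

```python
def join_non_null(arr, sep=", "):
    """Join array elements into one string, skipping null elements.

    Hypothetical helper: returns None for a null array, like flatten_array,
    but tolerates None entries inside the array.
    """
    if arr is None:
        return None
    # str() guards against non-string elements; None entries are dropped
    return sep.join(str(item) for item in arr if item is not None)


# Plain-Python checks of the edge cases
print(join_non_null(["apple", None, "cherry"]))
print(repr(join_non_null([])))
print(join_non_null(None))
```

In the notebook, this could be wrapped with `udf(join_non_null, StringType())` and applied the same way as `flatten_array`.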



## 2. Flattening Nested Structures

Nested structures (structs) are common in semi-structured data like JSON. Let's see how to extract and flatten data from deeply nested fields.

In [5]:
# Define schema with nested structure
nested_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("address", StructType([
            StructField("city", StringType(), True),
            StructField("zip", StringType(), True)
        ]), True)
    ]), True)
])

# Create data with nested structure
nested_data = [
    (1, {"name": "John", "address": {"city": "New York", "zip": "10001"}}),
    (2, {"name": "Alice", "address": {"city": "San Francisco", "zip": "94105"}}),
    (3, {"name": "Bob", "address": None}),
    (4, None)  # Handle completely null record
]
nested_df = spark.createDataFrame(nested_data, nested_schema)

print("Original DataFrame with Nested Structure:")
nested_df.printSchema()
nested_df.show(truncate=False)

Original DataFrame with Nested Structure:
root
 |-- id: integer (nullable = false)
 |-- person: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- zip: string (nullable = true)

+---+-------------------------------+
|id |person                         |
+---+-------------------------------+
|1  |{John, {New York, 10001}}      |
|2  |{Alice, {San Francisco, 94105}}|
|3  |{Bob, NULL}                    |
|4  |NULL                           |
+---+-------------------------------+



### UDF Approach for Nested Structures

Let's create a UDF that combines fields from a nested structure into a single string, with appropriate null handling.

In [6]:
# Define UDF to flatten the nested structure
@udf(StringType())
def flatten_address(person):
    if person is None:
        return "No person data"
    if person["address"] is None:
        return f"{person['name']} - No address"
    
    address = person["address"]
    return f"{person['name']} lives in {address['city']}, {address['zip']}"

# Apply the UDF
flattened_nested_df = nested_df.withColumn(
    "person_summary", 
    flatten_address(col("person"))
)

print("DataFrame with Flattened Nested Structure:")
flattened_nested_df.select("id", "person_summary").show(truncate=False)

DataFrame with Flattened Nested Structure:
+---+-----------------------------------+
|id |person_summary                     |
+---+-----------------------------------+
|1  |John lives in New York, 10001      |
|2  |Alice lives in San Francisco, 94105|
|3  |Bob - No address                   |
|4  |No person data                     |
+---+-----------------------------------+
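
The UDF above checks each level for `None` by hand, and with deeper nesting those checks multiply. A minimal sketch of a helper that walks a path of keys and returns a default at the first null or missing step is shown below. The name `get_nested` is hypothetical, and the example uses plain dicts in place of the `Row` objects a UDF would receive; both support `obj[key]` lookup.

```python
def get_nested(record, *keys, default=None):
    """Follow keys through nested records, returning default on any null step.

    Hypothetical helper. Dicts raise KeyError for a missing key; PySpark Row
    objects are assumed to raise ValueError (or KeyError) for an unknown field,
    so both are treated as "missing".
    """
    current = record
    for key in keys:
        if current is None:
            return default
        try:
            current = current[key]
        except (KeyError, ValueError):
            return default
    return default if current is None else current


people = [
    {"name": "John", "address": {"city": "New York", "zip": "10001"}},
    {"name": "Bob", "address": None},
    None,
]
for person in people:
    print(get_nested(person, "address", "city", default="unknown city"))
```

Inside a UDF body, a call such as `get_nested(person, "address", "zip", default="n/a")` could replace a chain of explicit `None` checks.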



### Direct Column Reference Alternative

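To pull out individual nested fields, Spark supports dot notation in column references. This is usually more efficient than a UDF because no data is passed to Python, and a null parent struct simply yields null for each of its fields.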

In [7]:
# Extracting fields with direct column references
direct_access_df = nested_df.select(
    "id",
    col("person.name").alias("name"),
    col("person.address.city").alias("city"),
    col("person.address.zip").alias("zip_code")
)

print("Using Direct Column References:")
direct_access_df.show(truncate=False)

Using Direct Column References:
+---+-----+-------------+--------+
|id |name |city         |zip_code|
+---+-----+-------------+--------+
|1  |John |New York     |10001   |
|2  |Alice|San Francisco|94105   |
|3  |Bob  |NULL         |NULL    |
|4  |NULL |NULL         |NULL    |
+---+-----+-------------+--------+
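
Listing each dotted path by hand gets tedious for wide schemas, so one option is to derive the paths from the schema itself. The sketch below works on the dict form of a schema, which is what `nested_df.schema.jsonValue()` returns. The helper name `leaf_paths` is hypothetical, and the `schema_json` literal is hand-written to mirror the schema printed earlier so that the example runs without Spark. It only descends into structs; arrays and maps are treated as leaves.

```python
def leaf_paths(struct_json, prefix=""):
    """Return dotted paths to every non-struct field of a struct schema dict."""
    paths = []
    for field in struct_json["fields"]:
        path = prefix + field["name"]
        field_type = field["type"]
        # Nested structs appear as dicts; simple types appear as strings
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            paths.extend(leaf_paths(field_type, path + "."))
        else:
            paths.append(path)
    return paths


# Hand-written stand-in for nested_df.schema.jsonValue()
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "person", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "name", "type": "string", "nullable": True, "metadata": {}},
                {"name": "address", "nullable": True, "metadata": {}, "type": {
                    "type": "struct",
                    "fields": [
                        {"name": "city", "type": "string", "nullable": True, "metadata": {}},
                        {"name": "zip", "type": "string", "nullable": True, "metadata": {}},
                    ],
                }},
            ],
        }},
    ],
}

print(leaf_paths(schema_json))
# ['id', 'person.name', 'person.address.city', 'person.address.zip']
```

With a live DataFrame, the result could feed a select such as `nested_df.select([col(p).alias(p.replace(".", "_")) for p in leaf_paths(nested_df.schema.jsonValue())])`.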



## 3. Flattening Maps/Dictionaries

Maps (key-value pairs) are often used for sparse data or attribute collections. Let's see how to extract and format map data using UDFs.

In [8]:
# Create a DataFrame with a map column
map_data = [
    (1, {"height": 180, "weight": 75, "age": 30}),
    (2, {"height": 165, "weight": 65}),  # Missing age key
    (3, {}),  # Empty map
    (4, None)  # Null map
]
map_df = spark.createDataFrame(map_data, ["id", "metrics"])

print("Original DataFrame with Map:")
map_df.show(truncate=False)

Original DataFrame with Map:
+---+----------------------------------------+
|id |metrics                                 |
+---+----------------------------------------+
|1  |{weight -> 75, age -> 30, height -> 180}|
|2  |{weight -> 65, height -> 165}           |
|3  |{}                                      |
|4  |NULL                                    |
+---+----------------------------------------+



### UDF Approach for Maps

Let's create a UDF that extracts specific values from the map and formats them with default handling.

In [9]:
# Define UDF to extract specific values with default handling
@udf(StringType())
def flatten_metrics(metrics):
    if metrics is None:
        return "No metrics data"
    if len(metrics) == 0:
        return "Empty metrics"

    # Append the unit only when a value is present, so a missing key
    # renders as "unknown" rather than "unknowny"
    def fmt(key, unit):
        value = metrics.get(key)
        return f"{value}{unit}" if value is not None else "unknown"

    return f"Height: {fmt('height', 'cm')}, Weight: {fmt('weight', 'kg')}, Age: {fmt('age', 'y')}"

# Apply the UDF
flattened_map_df = map_df.withColumn("metrics_summary", flatten_metrics(col("metrics")))

print("DataFrame with Flattened Map:")
flattened_map_df.show(truncate=False)

DataFrame with Flattened Map:
+---+----------------------------------------+-----------------------------------------+
|id |metrics                                 |metrics_summary                          |
+---+----------------------------------------+-----------------------------------------+
|1  |{weight -> 75, age -> 30, height -> 180}|Height: 180cm, Weight: 75kg, Age: 30y    |
|2  |{weight -> 65, height -> 165}           |Height: 165cm, Weight: 65kg, Age: unknown|
|3  |{}                                      |Empty metrics                            |
|4  |NULL                                    |No metrics data                          |
+---+----------------------------------------+-----------------------------------------+



### Explode Alternative

If you need to extract all keys and values, you can use the `explode` function to convert the map to rows.

In [10]:
from pyspark.sql.functions import explode, map_keys, col, size

# explode() already produces no rows for null or empty maps; this filter just makes that explicit
map_df_filtered = map_df.filter(col("metrics").isNotNull() & (size(map_keys(col("metrics"))) > 0))

# Explode the map into rows
exploded_map_df = map_df_filtered.select(
    "id",
    explode(col("metrics")).alias("metric_name", "metric_value")
)

print("Using explode to Convert Map to Rows:")
exploded_map_df.show(truncate=False)

Using explode to Convert Map to Rows:
+---+-----------+------------+
|id |metric_name|metric_value|
+---+-----------+------------+
|1  |weight     |75          |
|1  |age        |30          |
|1  |height     |180         |
|2  |weight     |65          |
|2  |height     |165         |
+---+-----------+------------+



## 4. Complex Example: Combining Multiple Complex Objects

In real-world scenarios, you often need to handle multiple complex types together. Let's create a more complex example.

In [11]:
# Define a complex schema
complex_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("user", StructType([
        StructField("name", StringType(), True),
        StructField("skills", ArrayType(StringType()), True),
        StructField("properties", MapType(StringType(), StringType()), True)
    ]), True)
])

# Create complex data
complex_data = [
    (1, {
        "name": "John", 
        "skills": ["Python", "SQL", "Spark"], 
        "properties": {"dept": "Data Science", "level": "Senior"}
    }),
    (2, {
        "name": "Mary", 
        "skills": ["Java", "Scala"], 
        "properties": {"dept": "Engineering", "location": "Remote"}
    }),
    (3, {
        "name": "Bob", 
        "skills": [], 
        "properties": {}
    })
]
complex_df = spark.createDataFrame(complex_data, complex_schema)

print("Complex DataFrame:")
complex_df.printSchema()
complex_df.show(truncate=False)

Complex DataFrame:
root
 |-- id: integer (nullable = false)
 |-- user: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- skills: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- properties: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------+
|id |user                                                                 |
+---+---------------------------------------------------------------------+
|1  |{John, [Python, SQL, Spark], {level -> Senior, dept -> Data Science}}|
|2  |{Mary, [Java, Scala], {location -> Remote, dept -> Engineering}}     |
|3  |{Bob, [], {}}                                                        |
+---+---------------------------------------------------------------------+



### Comprehensive UDF for Complex Flattening

Let's create a UDF that processes the entire complex structure to create a comprehensive profile string.

In [12]:
# Define comprehensive flattening UDF
@udf(StringType())
def create_user_profile(user):
    if user is None:
        return "No user data"
    
    # Extract basic info
    name = user["name"]
    
    # Process skills (array); a null or empty array is falsy
    skills = user["skills"]
    if skills:
        skills_str = ", ".join(skills)
    else:
        skills_str = "No skills listed"
    
    # Process properties (map); a null or empty map is falsy
    props = user["properties"]
    if props:
        # Create a formatted string from the map
        props_str = ", ".join(f"{k}: {v}" for k, v in props.items())
    else:
        props_str = "No properties listed"
    
    # Combine everything into a profile
    return f"User: {name} | Skills: {skills_str} | Properties: {props_str}"

# Apply the UDF
profile_df = complex_df.withColumn("user_profile", create_user_profile(col("user")))

print("DataFrame with Comprehensive User Profile:")
profile_df.select("id", "user_profile").show(truncate=False)

DataFrame with Comprehensive User Profile:
+---+---------------------------------------------------------------------------------------+
|id |user_profile                                                                           |
+---+---------------------------------------------------------------------------------------+
|1  |User: John | Skills: Python, SQL, Spark | Properties: level: Senior, dept: Data Science|
|2  |User: Mary | Skills: Java, Scala | Properties: location: Remote, dept: Engineering     |
|3  |User: Bob | Skills: No skills listed | Properties: No properties listed                |
+---+---------------------------------------------------------------------------------------+
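
Note that the property order in the output above (`level`, then `dept`) differs from the order in which the data was written (`dept`, then `level`), so the order of `props.items()` should not be relied on. If a stable string is needed, for example when comparing or deduplicating rows, sorting the keys is one simple option. The sketch below shows that idea in plain Python; `format_props` is a hypothetical helper name, not part of the notebook.

```python
def format_props(props, empty_text="No properties listed"):
    """Format a map as 'key: value' pairs in sorted key order."""
    if not props:  # covers both None and an empty map
        return empty_text
    # Keys are unique, so sorting the (key, value) pairs orders by key only
    return ", ".join(f"{key}: {value}" for key, value in sorted(props.items()))


print(format_props({"level": "Senior", "dept": "Data Science"}))
# dept: Data Science, level: Senior
print(format_props({}))
# No properties listed
```

Inside `create_user_profile`, the properties branch could call a helper like this instead of joining `props.items()` directly.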



## Summary: Best Practices for Flattening Complex Objects

1. **Choose the Right Approach:**
   - Use built-in functions when possible (like `array_join`, dot notation, `explode`)
   - Use UDFs when you need custom logic or need to handle multiple nested levels together

2. **Always Handle Nulls and Edge Cases:**
   - Check for `None` values at each level of nesting
   - Provide meaningful defaults or error messages
   - Handle empty collections (empty arrays, maps)

3. **Consider Performance:**
   - UDFs add serialization overhead between the JVM and Python, so avoid them for simple operations
   - For large-scale data, consider restructuring the data model if possible
   - For better performance with UDFs, consider Pandas UDFs (vectorized UDFs)

4. **Documentation:**
   - Document complex UDFs clearly with examples
   - Specify return types explicitly
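
As an illustration of the documentation point, the sketch below keeps the flattening logic in a plain Python function whose docstring carries runnable examples, checked with the standard-library `doctest` module. The function name `summarize_person` is hypothetical; its logic mirrors `flatten_address` from section 2, and plain dicts stand in for the `Row` objects a UDF would receive.

```python
import doctest


def summarize_person(person):
    """Summarize a nested person record as a single string.

    Intended to be wrapped as a UDF with an explicit return type,
    for example udf(summarize_person, StringType()).

    >>> summarize_person({"name": "John", "address": {"city": "New York", "zip": "10001"}})
    'John lives in New York, 10001'
    >>> summarize_person({"name": "Bob", "address": None})
    'Bob - No address'
    >>> summarize_person(None)
    'No person data'
    """
    if person is None:
        return "No person data"
    address = person["address"]
    if address is None:
        return f"{person['name']} - No address"
    return f"{person['name']} lives in {address['city']}, {address['zip']}"


# Runs the docstring examples; prints nothing when they all pass
doctest.run_docstring_examples(
    summarize_person, {"summarize_person": summarize_person}, name="summarize_person"
)
```

Keeping the logic in an undecorated function and wrapping it separately means the examples can be verified without starting a SparkSession.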