# Lesson 10 - Working with Complex and Nested Data

---

## PySpark Technical Notes: Working with Complex and Nested Data

**Introduction**

Modern data sources, particularly semi-structured formats like JSON and Avro, often contain complex data types. These include nested structures (structs), lists (arrays), and key-value pairs (maps). PySpark provides powerful capabilities to natively handle these complex types within its DataFrame API, allowing for sophisticated querying and manipulation without needing to manually parse text or rely heavily on User-Defined Functions (UDFs).

Understanding how to effectively read, query, transform, and restructure these nested formats is crucial for working with real-world data originating from web APIs, NoSQL databases, event streams, and configuration files. These notes explore PySpark's features for managing `StructType`, `ArrayType`, and `MapType` columns.

---

### 1. Working with JSON and StructType

**Theory**

JSON (JavaScript Object Notation) is a ubiquitous format for data interchange, often featuring nested objects and arrays. PySpark can directly ingest JSON data into DataFrames. When reading JSON, Spark can either:

1.  **Infer the Schema:** Spark scans the JSON data to determine the structure automatically, including nested objects (`StructType`), lists (`ArrayType`), and data types. By default it reads the full input; the `samplingRatio` option can restrict inference to a fraction of the records. While convenient for exploration, schema inference adds an extra pass over the data, can miss fields that appear only in unsampled records when sampling is enabled, and can infer a type that later data does not fit (e.g., `LongType` for a field that later carries fractional values and needs `DoubleType`).
2.  **Use an Explicit Schema:** Defining the schema explicitly using `StructType` and `StructField` provides robustness, performance benefits, and data integrity. Spark skips the inference step and parses each record against the declared types; how non-conforming records are handled (set to null, dropped, or raised as an error) depends on the read `mode` option. `StructType` represents a row or a nested object, containing a list of `StructField` objects. Each `StructField` defines the name, data type (`StringType`, `IntegerType`, `ArrayType(ElementType())`, `MapType(KeyType(), ValueType())`, `StructType(...)`, etc.), and nullability of a column or nested field.

Accessing fields within a `StructType` column is done using dot (`.`) notation or the `getField()` method within expressions.

**Code Examples**

Let's assume we have a JSON file (`data.json`) or a list of JSON strings:

```json
// Sample JSON data (imagine this in a file or RDD)
{"id": 1, "name": "Alice", "address": {"street": "123 Main St", "city": "Anytown"}, "roles": ["Admin", "Editor"], "attributes": {"level": 10, "status": "Active"}}
{"id": 2, "name": "Bob", "address": {"street": "456 Oak Ave", "city": "Otherville"}, "roles": ["Viewer"], "attributes": {"level": 5, "status": "Inactive", "validated": false}}
{"id": 3, "name": "Charlie", "address": null, "roles": ["Editor", "Viewer"], "attributes": null}
```

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType, BooleanType
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder.appName("ComplexJson").getOrCreate()

# Sample data as a list of strings (simulating reading lines from a file)
json_strings = [
    '{"id": 1, "name": "Alice", "address": {"street": "123 Main St", "city": "Anytown"}, "roles": ["Admin", "Editor"], "attributes": {"level": 10, "status": "Active"}}',
    '{"id": 2, "name": "Bob", "address": {"street": "456 Oak Ave", "city": "Otherville"}, "roles": ["Viewer"], "attributes": {"level": 5, "status": "Inactive", "validated": false}}',
    '{"id": 3, "name": "Charlie", "address": null, "roles": ["Editor", "Viewer"], "attributes": null}'
]
json_rdd = spark.sparkContext.parallelize(json_strings)

# ----- Method 1: Reading JSON with Schema Inference -----
print("Reading with Schema Inference:")
df_inferred = spark.read.json(json_rdd)
df_inferred.printSchema()
df_inferred.show(truncate=False)
# Explanation:
# 1. `spark.sparkContext.parallelize(json_strings)`: Creates an RDD from the list of JSON strings.
# 2. `spark.read.json(json_rdd)`: Reads the JSON data. Since no schema is provided, Spark infers it by scanning the data. It correctly identifies nested structures ('address', 'attributes') and arrays ('roles').
# 3. `df_inferred.printSchema()`: Displays the inferred schema structure. Notice `address` is `StructType`, `attributes` is `StructType`, and `roles` is `ArrayType`.

# ----- Method 2: Defining an Explicit Schema -----
print("\nDefining an Explicit Schema:")
explicit_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True),
    StructField("roles", ArrayType(StringType()), True),
    # First attempt: model 'attributes' as a map, although inference produced a struct.
    # MapType needs a single value type, but the sample data mixes int, string, and
    # boolean values, so they are all read as strings here and would need casting later.
    StructField("attributes", MapType(StringType(), StringType()), True)
])

# Second attempt (used for the read below): model 'attributes' as a struct,
# which matches the sample data more accurately.
explicit_schema_corrected = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True),
    StructField("roles", ArrayType(StringType()), True),
    StructField("attributes", StructType([ # More accurate than MapType for this specific data
         StructField("level", IntegerType(), True),
         StructField("status", StringType(), True),
         StructField("validated", BooleanType(), True) # Field only present in some records
    ]), True)
])


print("\nReading with Explicit Schema:")
df_explicit = spark.read.schema(explicit_schema_corrected).json(json_rdd)
df_explicit.printSchema()
df_explicit.show(truncate=False)
# Explanation:
# 1. `StructType([...])`: Defines the overall structure of the DataFrame.
# 2. `StructField("name", DataType(), nullable)`: Defines each field within a StructType.
# 3. `StructType([...])`: Used nestedly for the 'address' and 'attributes' fields.
# 4. `ArrayType(StringType())`: Defines the 'roles' field as an array of strings.
# 5. `MapType(StringType(), StringType())`: Appears only in the first-attempt schema (`explicit_schema`), which is not used for the read. `explicit_schema_corrected` models 'attributes' as a StructType instead, based on the data.
# 6. `spark.read.schema(explicit_schema_corrected).json(json_rdd)`: Reads the JSON using the *provided* schema. This is generally faster and safer. It enforces the structure; non-conforming data might result in nulls or errors depending on options.

# ----- Accessing Nested Fields -----
print("\nAccessing Nested Fields:")
df_selected = df_explicit.select(
    col("id"),
    col("name"),
    col("address.street").alias("street_address"), # Dot notation
    col("address").getField("city").alias("city_name"), # getField() method
    col("roles")[0].alias("first_role"), # Access array element by index
    col("attributes.level").alias("attribute_level"), # Dot notation for nested struct
    col("attributes")["status"].alias("attribute_status") # Map-like access for Struct field
)
df_selected.show(truncate=False)
# Explanation:
# 1. `col("address.street")`: Accesses the 'street' field within the 'address' struct using dot notation.
# 2. `col("address").getField("city")`: Achieves the same using the `getField()` method, which can be useful if field names contain dots or special characters.
# 3. `.alias(...)`: Renames the selected field for clarity in the resulting DataFrame.
# 4. `col("roles")[0]`: Accesses the first element (index 0) of the 'roles' array.
# 5. `col("attributes.level")`: Accesses the 'level' field within the nested 'attributes' struct.
# 6. `col("attributes")["status"]`: Uses bracket notation to retrieve the 'status' field within the 'attributes' struct. Both dot notation and bracket access work for struct fields. For MapType columns, bracket access (or `getItem`) is the usual way to look up a key.
```

**Practical Use Cases & Performance:**

*   **Explicit Schemas:** Strongly recommended for production pipelines. They prevent errors from schema drift, improve read performance (no inference step), and ensure data type consistency. Use schema inference primarily for initial exploration or when the schema is highly variable and cannot be predefined (though even then, reading as raw text and parsing might be safer).
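*   **Schema Evolution:** For evolving JSON schemas, note that the `mergeSchema` read option applies to Parquet and ORC sources, not to JSON; JSON inference already produces the union of the fields it sees across the input, at the cost of scanning it. Defining a superset schema or handling schema differences explicitly is often more robust. The `schema_of_json` function can derive a schema from a sample JSON string (it expects a literal string rather than an arbitrary column), which is useful together with `from_json` for JSON text already loaded in a DataFrame column.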
*   **Complex Queries:** Dot notation and `getField` allow complex filtering and transformations directly on nested fields (e.g., `df.filter(col("address.city") == "Anytown")`).

---

### 2. Flattening Nested Columns

**Theory**

While Spark handles nested structures well, sometimes it's necessary to "flatten" them. Flattening transforms nested fields into top-level columns. This might be required for:

*   Compatibility with systems or tools that don't support nested types (e.g., traditional relational databases, some BI tools).
*   Simplifying analysis or feature engineering for certain machine learning models.
*   Improving human readability of tabular data.

The most common way to flatten a `StructType` column is to select each nested field individually and assign it a unique top-level alias.

**Code Examples**

Using `df_explicit` from the previous section:

```python
from pyspark.sql.functions import col

print("Original Nested DataFrame:")
df_explicit.show(truncate=False)

# ----- Flattening the 'address' struct -----
print("\nFlattening 'address' struct:")
df_flat_address = df_explicit.select(
    col("id"),
    col("name"),
    col("address.street").alias("address_street"), # Select nested field and alias
    col("address.city").alias("address_city"),   # Select nested field and alias
    col("roles"),
    col("attributes")
)
df_flat_address.printSchema()
df_flat_address.show(truncate=False)
# Explanation:
# 1. `df_explicit.select(...)`: Selects specific columns to create the new DataFrame.
# 2. `col("address.street").alias("address_street")`: Selects the 'street' field from within 'address' and gives it the new top-level name 'address_street'.
# 3. `col("address.city").alias("address_city")`: Does the same for the 'city' field.
# 4. Other columns ('id', 'name', 'roles', 'attributes') are selected directly.

# ----- Flattening all nested structures ('address' and 'attributes') -----
print("\nFlattening all nested structures ('address', 'attributes'):")
# Be careful with potential name collisions if nested fields have same names.
df_fully_flattened = df_explicit.select(
    col("id"),
    col("name"),
    col("address.street").alias("address_street"),
    col("address.city").alias("address_city"),
    col("roles"),
    col("attributes.level").alias("attr_level"), # Alias to avoid potential future collisions
    col("attributes.status").alias("attr_status"),
    col("attributes.validated").alias("attr_validated")
    # Note: We deliberately skip the original 'address' and 'attributes' columns.
)
df_fully_flattened.printSchema()
df_fully_flattened.show(truncate=False)
# Explanation:
# 1. We now select fields from both 'address' and 'attributes' structs.
# 2. Aliases like 'attr_level' are used. Good practice to ensure unique and descriptive top-level names, especially if nested structures might share field names (e.g., 'id' inside 'address' and 'id' at top level).
# 3. The original struct columns ('address', 'attributes') are *not* included in the select list, effectively removing them and replacing them with their flattened contents.

# ----- Dynamic Flattening (Conceptual - Requires helper function usually) -----
# For schemas with many nested fields, manual selection is tedious.
# You might write a helper function to generate the select expressions dynamically.
def flatten_struct_cols(df, struct_col_name):
    """ Helper to generate select expressions for flattening one struct column """
    field_names = [f.name for f in df.schema[struct_col_name].dataType.fields]
    select_exprs = [f"{struct_col_name}.{name} as {struct_col_name}_{name}" for name in field_names]
    # Include other columns
    other_cols = [c for c in df.columns if c != struct_col_name]
    return df.selectExpr(other_cols + select_exprs)

print("\nFlattening 'address' dynamically (using a helper concept):")
df_flat_dynamic = flatten_struct_cols(df_explicit, "address")
df_flat_dynamic.show(truncate=False)
# Explanation:
# 1. This section demonstrates the *idea* of dynamic flattening. A real implementation would need more robust recursion for multi-level nesting and better name collision handling.
# 2. `df.schema[struct_col_name].dataType.fields`: Accesses the schema definition of the struct column to get its field names.
# 3. `selectExpr(...)`: Uses SQL-like expressions generated dynamically to perform the flattening.
```

**Practical Use Cases & Performance:**

*   **Data Warehousing:** Flattening is common when loading data into traditional DWHs with flat table structures.
*   **Feature Engineering:** Creating individual features from nested attributes for ML models.
*   **Performance:** Flattening itself is primarily a projection operation (`select`), which is generally efficient in Spark. However, it increases the number of columns, which might slightly impact the performance of subsequent operations and increase metadata size. Keeping data nested can sometimes be more efficient for queries that only need a subset of nested fields, especially with columnar storage formats like Parquet where entire structs can be skipped (column pruning). Choose based on downstream requirements.

---

### 3. Exploding Arrays and Maps

**Theory**

`ArrayType` and `MapType` columns store multiple values within a single row field. "Exploding" these columns transforms the DataFrame so that each element in the array or map gets its own row, duplicating the other column values for each new row.

*   `explode(col)`: For arrays, creates a new row for each element in the array, placing the element in a new column named `col` by default. For maps, creates a new row for each key-value pair, creating two new columns named `key` and `value` by default. If the array/map is null or empty, the original row disappears from the result (like an inner join).
*   `explode_outer(col)`: Similar to `explode`, but if the array/map is null or empty, it keeps the original row and produces `null` in the new column(s). (Like a left outer join).
*   `posexplode(col)`: Similar to `explode`, but adds an additional column (default name `pos`) indicating the position (index) of the element in the original array. It also accepts maps, in which case it produces `pos`, `key`, and `value` columns.
*   `posexplode_outer(col)`: Outer version of `posexplode`.

Exploding is essential when you need to work with individual array elements or map entries as separate records, for example, to join them with other data, perform aggregations per element, or filter based on individual element values.

**Code Examples**

The examples below use `df_explicit`, which contains an array column ('roles'), to demonstrate exploding arrays. Because 'attributes' was read as a struct rather than a map, a separate small DataFrame is created for the map examples.

```python
from pyspark.sql.functions import col, explode, explode_outer, posexplode

print("Original DataFrame:")
df_explicit.select("id", "name", "roles", "attributes").show(truncate=False)

# ----- Exploding the 'roles' array -----
print("\nExploding 'roles' array (explode):")
df_exploded_roles = df_explicit.withColumn("role", explode(col("roles")))
# or: df_exploded_roles = df_explicit.select("id", "name", "attributes", explode(col("roles")).alias("role"))
df_exploded_roles.select("id", "name", "role", "attributes").show(truncate=False)
# Explanation:
# 1. `explode(col("roles"))`: Takes the 'roles' array column as input.
# 2. `withColumn("role", ...)`: Adds a new column named 'role' containing the exploded element.
# 3. Notice how rows with id 1 and 3 are duplicated, one for each role in their original 'roles' array. The original 'roles' column is typically dropped or ignored afterwards.

print("\nExploding 'roles' array (explode_outer):")
# Let's add a row with null/empty roles to see the difference
data_with_empty_roles = json_strings + ['{"id": 4, "name": "Dave", "address": null, "roles": [], "attributes": null}']
rdd_with_empty = spark.sparkContext.parallelize(data_with_empty_roles)
df_with_empty = spark.read.schema(explicit_schema_corrected).json(rdd_with_empty)

df_exploded_outer_roles = df_with_empty.withColumn("role", explode_outer(col("roles")))
df_exploded_outer_roles.select("id", "name", "role", "attributes").show(truncate=False)
# Explanation:
# 1. `explode_outer(col("roles"))`: Works like explode, but for row id 4 where 'roles' is empty, the row is kept, and the new 'role' column has a null value. If 'roles' itself was null, the result would be the same.

print("\nExploding 'roles' array with position (posexplode):")
df_posexploded_roles = df_explicit.select(
    col("id"), col("name"), posexplode(col("roles")).alias("pos", "role") # Alias multiple output columns
)
df_posexploded_roles.show(truncate=False)
# Explanation:
# 1. `posexplode(col("roles"))`: Explodes the array and adds a position column.
# 2. `.alias("pos", "role")`: Renames the default output columns (`pos`, `col`) to `pos` and `role`.

# ----- Example with Maps -----
map_data = [(1, {"a": "apple", "b": "banana"}), (2, {"c": "carrot"}), (3, {})]
map_df = spark.createDataFrame(map_data, ["id", "data_map"])
map_df.show(truncate=False)

print("\nExploding a Map column:")
df_exploded_map = map_df.select(col("id"), explode(col("data_map"))) # Default columns: key, value
df_exploded_map.show()
# Explanation:
# 1. `explode(col("data_map"))`: Explodes the map column.
# 2. It creates two new columns, 'key' and 'value', by default. Row 1 is duplicated for keys 'a' and 'b'. Row 3 (empty map) disappears.

print("\nExploding a Map column (explode_outer):")
df_exploded_outer_map = map_df.select(col("id"), explode_outer(col("data_map")))
df_exploded_outer_map.show()
# Explanation:
# 1. `explode_outer(col("data_map"))`: Explodes the map, but keeps row 3, placing nulls in the 'key' and 'value' columns because the original map was empty.
```

**Practical Use Cases & Performance:**

*   **Unnesting Data:** Converting lists of items (e.g., products in an order, tags on an article) into individual rows for analysis or joining.
*   **Processing Key-Value Pairs:** Handling flexible attribute maps where keys represent attribute names and values represent their values.
*   **Performance:** Exploding significantly increases the number of rows in the DataFrame. This can dramatically increase the amount of data shuffled in subsequent operations like joins or aggregations. If the arrays/maps have highly variable sizes, exploding can lead to data skew (some tasks processing vastly more rows than others). Use `explode_outer` if you need to preserve rows with empty/null complex types. Consider whether the explosion is necessary or if functions operating directly on arrays/maps (`array_contains`, `transform`, `map_filter`, etc.) can achieve the goal more efficiently.

---

**Conclusion**

PySpark's built-in support for complex data types (`StructType`, `ArrayType`, `MapType`) is a cornerstone of its ability to handle diverse, real-world data. Understanding how to define schemas for, query, flatten, and explode these types using functions like `col`, `getField`, `explode`, and `posexplode` is essential for data engineers and analysts. While these operations provide immense flexibility, always consider the performance implications, especially regarding schema inference versus explicit schemas and the potential data volume increase caused by flattening or exploding nested structures. Choosing the right approach depends on the specific data characteristics and the requirements of the downstream processing steps.

---