# Processing JSON Data in PySpark

This notebook demonstrates how to read, parse, and extract fields from JSON data using PySpark. We'll cover:

1. Creating sample JSON data and extracting fields from JSON strings
2. Working with nested JSON objects
3. Handling JSON arrays and arrays of objects
4. Reading JSON files and converting DataFrames back to JSON
5. Schema inference and explicit schema definition

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json, to_json, json_tuple, get_json_object, size
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType, DoubleType

# Initialize SparkSession
spark = SparkSession.builder.appName("JSON Processing").getOrCreate()

print("SparkSession initialized successfully!")


SparkSession initialized successfully!



## 1. Creating Sample JSON Data

Let's start by creating some sample JSON data to work with. We'll create JSON strings directly in PySpark.

In [2]:
# Simple JSON data
simple_json_data = [
    (1, '{"name":"John", "age":30, "city":"New York"}'),
    (2, '{"name":"Alice", "age":25, "city":"Los Angeles"}'),
    (3, '{"name":"Bob", "age":35, "city":"Chicago"}')
]

simple_json_df = spark.createDataFrame(simple_json_data, ["id", "json_data"])

print("Simple JSON DataFrame:")
simple_json_df.show(truncate=False)

Simple JSON DataFrame:
+---+------------------------------------------------+
|id |json_data                                       |
+---+------------------------------------------------+
|1  |{"name":"John", "age":30, "city":"New York"}    |
|2  |{"name":"Alice", "age":25, "city":"Los Angeles"}|
|3  |{"name":"Bob", "age":35, "city":"Chicago"}      |
+---+------------------------------------------------+



## 2. Extracting Fields from JSON Strings

### Method 1: Using `json_tuple`

The `json_tuple` function extracts several top-level fields from a JSON string in a single call. Every extracted value is returned as a string.

In [3]:
# Extract fields using json_tuple
parsed_json_df1 = simple_json_df.select(
    "id",
    json_tuple(col("json_data"), "name", "age", "city").alias("name", "age", "city")
)

print("Parsed JSON using json_tuple:")
parsed_json_df1.show()

Parsed JSON using json_tuple:
+---+-----+---+-----------+
| id| name|age|       city|
+---+-----+---+-----------+
|  1| John| 30|   New York|
|  2|Alice| 25|Los Angeles|
|  3|  Bob| 35|    Chicago|
+---+-----+---+-----------+



### Method 2: Using `get_json_object`

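The `get_json_object` function extracts one value per call, addressed by a JSONPath-style expression such as `$.name`. The result is always a string, and a path that does not match yields NULL.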

In [4]:
# Extract fields using get_json_object
parsed_json_df2 = simple_json_df.select(
    "id",
    get_json_object(col("json_data"), "$.name").alias("name"),
    get_json_object(col("json_data"), "$.age").alias("age"),
    get_json_object(col("json_data"), "$.city").alias("city")
)

print("Parsed JSON using get_json_object:")
parsed_json_df2.show()

Parsed JSON using get_json_object:
+---+-----+---+-----------+
| id| name|age|       city|
+---+-----+---+-----------+
|  1| John| 30|   New York|
|  2|Alice| 25|Los Angeles|
|  3|  Bob| 35|    Chicago|
+---+-----+---+-----------+



### Method 3: Using `from_json` with Schema

For more control and type safety, use `from_json` with an explicit schema.

In [5]:
# Define schema for the JSON
simple_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

# Parse with from_json
parsed_json_df3 = simple_json_df.select(
    "id",
    from_json(col("json_data"), simple_schema).alias("parsed_data")
)

# Extract the struct fields
parsed_json_df3 = parsed_json_df3.select(
    "id",
    "parsed_data.name",
    "parsed_data.age",
    "parsed_data.city"
)

print("Parsed JSON using from_json with schema:")
parsed_json_df3.show()

Parsed JSON using from_json with schema:
+---+-----+---+-----------+
| id| name|age|       city|
+---+-----+---+-----------+
|  1| John| 30|   New York|
|  2|Alice| 25|Los Angeles|
|  3|  Bob| 35|    Chicago|
+---+-----+---+-----------+



## 3. Working with Nested JSON

Now let's handle more complex, nested JSON objects.

In [6]:
# Nested JSON data
nested_json_data = [
    (1, '{"name":"John", "contact":{"email":"john@example.com", "phone":"555-1234"}, "address":{"city":"New York", "zip":"10001"}}'),
    (2, '{"name":"Alice", "contact":{"email":"alice@example.com", "phone":"555-5678"}, "address":{"city":"San Francisco", "zip":"94105"}}'),
    (3, '{"name":"Bob", "contact":{"email":"bob@example.com"}, "address":{"city":"Chicago", "zip":"60601"}}')
]

nested_json_df = spark.createDataFrame(nested_json_data, ["id", "json_data"])

print("Nested JSON DataFrame:")
nested_json_df.show(truncate=False)

Nested JSON DataFrame:
+---+--------------------------------------------------------------------------------------------------------------------------------+
|id |json_data                                                                                                                       |
+---+--------------------------------------------------------------------------------------------------------------------------------+
|1  |{"name":"John", "contact":{"email":"john@example.com", "phone":"555-1234"}, "address":{"city":"New York", "zip":"10001"}}       |
|2  |{"name":"Alice", "contact":{"email":"alice@example.com", "phone":"555-5678"}, "address":{"city":"San Francisco", "zip":"94105"}}|
|3  |{"name":"Bob", "contact":{"email":"bob@example.com"}, "address":{"city":"Chicago", "zip":"60601"}}                              |
+---+--------------------------------------------------------------------------------------------------------------------------------+



### Using `get_json_object` for Nested Fields

In [7]:
# Extract nested fields using get_json_object
parsed_nested_df1 = nested_json_df.select(
    "id",
    get_json_object(col("json_data"), "$.name").alias("name"),
    get_json_object(col("json_data"), "$.contact.email").alias("email"),
    get_json_object(col("json_data"), "$.contact.phone").alias("phone"),
    get_json_object(col("json_data"), "$.address.city").alias("city"),
    get_json_object(col("json_data"), "$.address.zip").alias("zip")
)

print("Parsed nested JSON using get_json_object:")
parsed_nested_df1.show()

Parsed nested JSON using get_json_object:
+---+-----+-----------------+--------+-------------+-----+
| id| name|            email|   phone|         city|  zip|
+---+-----+-----------------+--------+-------------+-----+
|  1| John| john@example.com|555-1234|     New York|10001|
|  2|Alice|alice@example.com|555-5678|San Francisco|94105|
|  3|  Bob|  bob@example.com|    NULL|      Chicago|60601|
+---+-----+-----------------+--------+-------------+-----+



### Using `from_json` with Nested Schema

In [8]:
# Define nested schema
nested_schema = StructType([
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("email", StringType(), True),
        StructField("phone", StringType(), True)
    ]), True),
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("zip", StringType(), True)
    ]), True)
])

# Parse nested JSON with schema
parsed_nested_df2 = nested_json_df.select(
    "id",
    from_json(col("json_data"), nested_schema).alias("data")
)

# Flatten the nested structure
flattened_df = parsed_nested_df2.select(
    "id",
    "data.name",
    "data.contact.email",
    "data.contact.phone",
    "data.address.city",
    "data.address.zip"
)

print("Flattened nested JSON using from_json with schema:")
flattened_df.show()

Flattened nested JSON using from_json with schema:
+---+-----+-----------------+--------+-------------+-----+
| id| name|            email|   phone|         city|  zip|
+---+-----+-----------------+--------+-------------+-----+
|  1| John| john@example.com|555-1234|     New York|10001|
|  2|Alice|alice@example.com|555-5678|San Francisco|94105|
|  3|  Bob|  bob@example.com|    NULL|      Chicago|60601|
+---+-----+-----------------+--------+-------------+-----+



## 4. Handling JSON Arrays

JSON data often contains arrays. Let's see how to process them.

In [9]:
# JSON with arrays
array_json_data = [
    (1, '{"name":"John", "skills":["Java", "Python", "SQL"]}'),
    (2, '{"name":"Alice", "skills":["C++", "JavaScript"]}'),
    (3, '{"name":"Bob", "skills":[]}')
]

array_json_df = spark.createDataFrame(array_json_data, ["id", "json_data"])

print("JSON with Arrays:")
array_json_df.show(truncate=False)

JSON with Arrays:
+---+---------------------------------------------------+
|id |json_data                                          |
+---+---------------------------------------------------+
|1  |{"name":"John", "skills":["Java", "Python", "SQL"]}|
|2  |{"name":"Alice", "skills":["C++", "JavaScript"]}   |
|3  |{"name":"Bob", "skills":[]}                        |
+---+---------------------------------------------------+



### Parsing Arrays with a Schema

In [10]:
# Define schema with array
array_schema = StructType([
    StructField("name", StringType(), True),
    StructField("skills", ArrayType(StringType()), True)
])

# Parse JSON with array
parsed_array_df = array_json_df.select(
    "id",
    from_json(col("json_data"), array_schema).alias("data")
)

# Extract fields
parsed_array_df = parsed_array_df.select(
    "id",
    "data.name",
    "data.skills"
)

print("Parsed JSON with arrays:")
parsed_array_df.show(truncate=False)

Parsed JSON with arrays:
+---+-----+-------------------+
|id |name |skills             |
+---+-----+-------------------+
|1  |John |[Java, Python, SQL]|
|2  |Alice|[C++, JavaScript]  |
|3  |Bob  |[]                 |
+---+-----+-------------------+



### Exploding Arrays

We can use `explode` to convert array elements into separate rows.

In [11]:
# explode already drops rows whose array is null or empty, so this filter is
# optional; it only makes the intent explicit
non_empty_skills_df = parsed_array_df.filter(col("skills").isNotNull() & (size(col("skills")) > 0))

# Explode the skills array
exploded_skills_df = non_empty_skills_df.select(
    "id",
    "name",
    explode("skills").alias("skill")
)

print("Exploded skills array:")
exploded_skills_df.show()

Exploded skills array:
+---+-----+----------+
| id| name|     skill|
+---+-----+----------+
|  1| John|      Java|
|  1| John|    Python|
|  1| John|       SQL|
|  2|Alice|       C++|
|  2|Alice|JavaScript|
+---+-----+----------+



## 5. Complex JSON with Arrays of Objects

Let's handle even more complex JSON with arrays of objects.

In [12]:
# JSON with array of objects
complex_json_data = [
    (1, '{"name":"John", "courses":[{"name":"Python", "score":95}, {"name":"SQL", "score":87}]}'),
    (2, '{"name":"Alice", "courses":[{"name":"Java", "score":90}, {"name":"JavaScript", "score":85}]}'),
    (3, '{"name":"Bob", "courses":[]}')
]

complex_json_df = spark.createDataFrame(complex_json_data, ["id", "json_data"])

print("Complex JSON with arrays of objects:")
complex_json_df.show(truncate=False)

Complex JSON with arrays of objects:
+---+--------------------------------------------------------------------------------------------+
|id |json_data                                                                                   |
+---+--------------------------------------------------------------------------------------------+
|1  |{"name":"John", "courses":[{"name":"Python", "score":95}, {"name":"SQL", "score":87}]}      |
|2  |{"name":"Alice", "courses":[{"name":"Java", "score":90}, {"name":"JavaScript", "score":85}]}|
|3  |{"name":"Bob", "courses":[]}                                                                |
+---+--------------------------------------------------------------------------------------------+



### Parsing and Exploding Arrays of Objects

In [13]:
# Define complex schema
course_schema = StructType([
    StructField("name", StringType(), True),
    StructField("score", IntegerType(), True)
])

complex_schema = StructType([
    StructField("name", StringType(), True),
    StructField("courses", ArrayType(course_schema), True)
])

# Parse complex JSON
parsed_complex_df = complex_json_df.select(
    "id",
    from_json(col("json_data"), complex_schema).alias("data")
)

parsed_complex_df = parsed_complex_df.select(
    "id",
    "data.name",
    "data.courses"
)

print("Parsed complex JSON:")
parsed_complex_df.show(truncate=False)

Parsed complex JSON:
+---+-----+------------------------------+
|id |name |courses                       |
+---+-----+------------------------------+
|1  |John |[{Python, 95}, {SQL, 87}]     |
|2  |Alice|[{Java, 90}, {JavaScript, 85}]|
|3  |Bob  |[]                            |
+---+-----+------------------------------+



In [14]:
# Filter out empty arrays
non_empty_courses_df = parsed_complex_df.filter(col("courses").isNotNull() & (size(col("courses")) > 0))

# Explode courses array
exploded_courses_df = non_empty_courses_df.select(
    "id",
    "name",
    explode("courses").alias("course")
)

# Extract fields from the exploded struct
final_courses_df = exploded_courses_df.select(
    "id",
    "name",
    col("course.name").alias("course_name"),
    "course.score"
)

print("Final exploded courses:")
final_courses_df.show()

Final exploded courses:
+---+-----+-----------+-----+
| id| name|course_name|score|
+---+-----+-----------+-----+
|  1| John|     Python|   95|
|  1| John|        SQL|   87|
|  2|Alice|       Java|   90|
|  2|Alice| JavaScript|   85|
+---+-----+-----------+-----+



## 6. Reading JSON Files

In real applications, you often need to read JSON from files. Here's how to do it:

In [15]:
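# First, write some sample data as JSON. Spark creates a directory named
# /tmp/sample.json that holds part files in JSON Lines format (one object per line)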
simple_json_df.write.mode("overwrite").json("/tmp/sample.json")

# Reading JSON files
# With schema inference
json_file_df = spark.read.json("/tmp/sample.json")
print("JSON read from file with inferred schema:")
json_file_df.printSchema()
json_file_df.show()

JSON read from file with inferred schema:
root
 |-- id: long (nullable = true)
 |-- json_data: string (nullable = true)

+---+--------------------+
| id|           json_data|
+---+--------------------+
|  2|{"name":"Alice", ...|
|  1|{"name":"John", "...|
|  3|{"name":"Bob", "a...|
+---+--------------------+



In [16]:
# Reading with explicit schema
file_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("json_data", StringType(), True)
])

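# Spark wrote this data as JSON Lines (one object per line), so the default
# line-delimited reader applies; multiLine is only for records that span lines
json_file_df2 = spark.read.schema(file_schema).json("/tmp/sample.json")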
print("JSON read from file with explicit schema:")
json_file_df2.show()

JSON read from file with explicit schema:
+---+--------------------+
| id|           json_data|
+---+--------------------+
|  2|{"name":"Alice", ...|
|  1|{"name":"John", "...|
|  3|{"name":"Bob", "a...|
+---+--------------------+



## 7. Converting DataFrame to JSON

We can also convert DataFrames back to JSON format.

In [17]:
# Convert DataFrame to JSON string
from pyspark.sql.functions import struct, to_json

# Create a sample DataFrame
data_for_json = [
    (1, "John", 30, "New York"),
    (2, "Alice", 25, "San Francisco"),
    (3, "Bob", 35, "Chicago")
]
df_for_json = spark.createDataFrame(data_for_json, ["id", "name", "age", "city"])

# Convert to JSON string
json_output_df = df_for_json.select(
    "id",
    to_json(struct("name", "age", "city")).alias("person_json")
)

print("DataFrame converted to JSON strings:")
json_output_df.show(truncate=False)

DataFrame converted to JSON strings:
+---+------------------------------------------------+
|id |person_json                                     |
+---+------------------------------------------------+
|1  |{"name":"John","age":30,"city":"New York"}      |
|2  |{"name":"Alice","age":25,"city":"San Francisco"}|
|3  |{"name":"Bob","age":35,"city":"Chicago"}        |
+---+------------------------------------------------+



## 8. Schema Inference for JSON

PySpark can infer the schema from JSON data, which is useful for exploration. The `samplingRatio` option sets the fraction of input records Spark scans during inference.

In [18]:
# Using schema inference with samplingRatio
inferred_df = spark.read.option("samplingRatio", "0.8").json("/tmp/sample.json")

print("Inferred schema from JSON:")
inferred_df.printSchema()

Inferred schema from JSON:
root
 |-- id: long (nullable = true)
 |-- json_data: string (nullable = true)




## Summary: Best Practices for JSON Processing

1. **For simple JSON extraction:**
   - Use `json_tuple` for extracting multiple fields at once
   - Use `get_json_object` for extracting specific fields with JSONPath

2. **For complex JSON structures:**
   - Define explicit schemas with `StructType` and use `from_json`
   - Handle nested structures with dot notation
   - Use `explode` for arrays; it drops rows with null or empty arrays, so use `explode_outer` when those rows must be kept

3. **Performance considerations:**
   - Schema inference is convenient but can be slow on large datasets
   - Define explicit schemas for production workloads
   - Use appropriate data types to avoid type conversions

4. **File handling:**
   - Use the `multiLine` option only when a single JSON record spans several lines; the default reader expects JSON Lines (one object per line)
   - Consider partitioning for large JSON datasets
   - Use Parquet instead of JSON for better performance in analytical workloads