
# **Spark JSON Functions**
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#json-functions

#### JSON in Spark

**JSON as File:**  
A file of JSON records that Spark reads with `spark.read.json()`, inferring the schema and producing a structured DataFrame with typed columns.

**JSON as Column Value:**  
A string column storing JSON text, which requires explicit parsing using `from_json()` to access its data.


---
#### JSON vs Map in Spark
---
###### Quick Comparison
---
| Aspect | JSON Column | Map Column |
|--------|-------------|------------|
| **Storage** | String text | Native MapType |
| **Access** | Parse with `from_json()` | Direct key access |
| **Speed** | Slower (parsing needed) | Faster (ready-to-use) |
| **Memory** | Higher (text + parsing) | Lower (efficient) |
| **Schema** | Flexible, needs definition | Fixed key-value types |
| **Best For** | External data, APIs | Internal processing |
---
#### Spark Infers JSON Objects as Structs (Not Maps) by Default
---
###### Performance Matters
---
- **Structs**: Fixed schema = Faster access, better memory
- **Maps**: Variable keys = Slower lookups, inefficient storage
---
###### Data Quality
---
- **Structs**: Validate types and structure upfront
- **Maps**: Allow any keys = No quality checks
---
###### Query Power
---
- **Structs**: Dot notation (`user.name, user.age`); bracket notation (`user['name']`) also works
- **Maps**: Bracket notation only (`user['name'], user['age']`)
---
###### Optimization
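- **Structs**: Allow nested-column pruning and predicate pushdown on columnar sources such as Parquet
- **Structs**: Fixed layout gives the optimizer known field names and types to work with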


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-functions').getOrCreate()

In [None]:
js_string = '''{
  "user_id": 12345,
  "name": "John Doe",
  "email": "john.doe@example.com",
  "age": 30,
  "is_active": true,
  "skills": ["Python", "SQL", "Spark", "Java"],
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zipcode": "94105"
  },
  "salary": 75000.50,
  "hire_date": "2022-03-15",
  "tags": ["engineer", "backend", "senior"],
  "projects": [
    {"name": "Project A", "status": "completed"},
    {"name": "Project B", "status": "in_progress"}
  ]
}'''

In [None]:

sql = f'''
create or replace temp view input_view as
(
  select '{js_string}' as js_string
)
'''

spark.sql(sql)

sql = 'select js_string from input_view'
spark.sql(sql).show(truncate = False)


+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|js_string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
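The view above is built by interpolating the JSON text straight into a single-quoted SQL literal, which works for this sample but would break on JSON containing a single quote or a backslash. The following is a minimal sketch of an escaping helper, assuming Spark's default parser settings (backslash acts as the escape character inside string literals). The name `sql_string_literal` and the `risky_json` sample are hypothetical and not part of the notebook.

```python
def sql_string_literal(value: str) -> str:
    """Return `value` wrapped as a single-quoted SQL string literal.

    Assumption: Spark's default parser settings, where a backslash
    escapes the next character inside a string literal.
    """
    # Escape backslashes first so the backslashes added for quotes are not doubled.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


# Hypothetical sample containing both a single quote and a JSON-escaped backslash.
risky_json = '{"name": "O\'Brien", "path": "C:\\\\temp"}'

query = f"select {sql_string_literal(risky_json)} as js_string"
print(query)
```

With this helper the view definition would read `select {sql_string_literal(js_string)} as js_string`. Building the DataFrame with `spark.createDataFrame([(js_string,)], 'js_string string')` avoids the SQL literal altogether.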

In [None]:
sql = '''
select *
from input_view
'''
json_sample = spark.sql(sql).first()[0]
print(json_sample)

{
  "user_id": 12345,
  "name": "John Doe",
  "email": "john.doe@example.com",
  "age": 30,
  "is_active": true,
  "skills": ["Python", "SQL", "Spark", "Java"],
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zipcode": "94105"
  },
  "salary": 75000.50,
  "hire_date": "2022-03-15",
  "tags": ["engineer", "backend", "senior"],
  "projects": [
    {"name": "Project A", "status": "completed"},
    {"name": "Project B", "status": "in_progress"}
  ]
}


In [None]:
# schema_of_json : infer the schema of a JSON string (the argument must be a string literal, not a column)


sql = f'''
select schema_of_json('{json_sample}') as js_sch
'''
js_schema = spark.sql(sql).first()['js_sch']
print(js_schema)

STRUCT<address: STRUCT<city: STRING, state: STRING, street: STRING, zipcode: STRING>, age: BIGINT, email: STRING, hire_date: STRING, is_active: BOOLEAN, name: STRING, projects: ARRAY<STRUCT<name: STRING, status: STRING>>, salary: DOUBLE, skills: ARRAY<STRING>, tags: ARRAY<STRING>, user_id: BIGINT>


In [None]:
# from_json : converts a JSON string to a struct object

# -- parsing the JSON string

sql = f'''
create or replace temp view parsed_view1 AS
(
with cte as
(
select from_json(js_string,'{js_schema}') as js_struct
from input_view
)
select js_struct
from cte
)
'''
spark.sql(sql)

DataFrame[]

In [None]:
## -- show the parsed json

sql = '''
select js_struct
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

## -- see the type of parsed json

spark.sql(sql).printSchema()

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|js_struct                                                                                                                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{{San Francisco, CA, 123 Main St, 94105}, 30, john.doe@example.com, 2022-03-15, true, John Doe, [{Project A, completed}, {Project B, in_progress}], 75000.5, [Python, SQL, Spark, Java], [engineer, backend, senior], 12345}|
+-----------------------------------------------------------------------------------------------------------

In [None]:
## -- access the element from the parsed

# -- string values

sql = '''
select js_struct['name'] as `name`
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

# -- nested JSON (actually a struct)

sql = '''
select js_struct['address']['street'] as `street`
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

+--------+
|name    |
+--------+
|John Doe|
+--------+

+-----------+
|street     |
+-----------+
|123 Main St|
+-----------+



In [None]:
# from_json (TO ARRAY)

sql = '''
select
  from_json('["rahul","lathika","skylr"]','ARRAY<STRING>') as arrayParsed
'''
spark.sql(sql).show(truncate = False)
spark.sql(sql).printSchema()

+-----------------------+
|arrayParsed            |
+-----------------------+
|[rahul, lathika, skylr]|
+-----------------------+

root
 |-- arrayParsed: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [None]:
# from_json (TO MAP)

sql = '''
select
  from_json('{"name":"rahul","age":25}', 'MAP<STRING,STRING>') as mapParsed
'''
spark.sql(sql).show(truncate = False)
spark.sql(sql).printSchema()

+--------------------------+
|mapParsed                 |
+--------------------------+
|{name -> rahul, age -> 25}|
+--------------------------+

root
 |-- mapParsed: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



### Converting JSON Strings to STRUCT, MAP, and ARRAY using `from_json`

| Aspect | STRUCT | MAP | ARRAY |
|--------|--------|-----|-------|
| **Schema Syntax** | `'field1 TYPE, field2 TYPE, nested STRUCT<...>'` | `'MAP<KEY_TYPE, VALUE_TYPE>'` | `'ARRAY<ELEMENT_TYPE>'` |
| **JSON Example** | `'{"name":"John","age":30}'` | `'{"name":"John","age":30}'` | `'["apple","banana"]'` |
| **Output Type** | `struct<name:string, age:int>` | `map<string,string>` | `array<string>` |
| **Field Access** | `struct_col.name`<br>`struct_col.age` | `map_col['name']`<br>`map_col['age']` | `array_col[0]`<br>`array_col[1]` |
| **Schema Flexibility** | Fixed schema; missing fields become NULL, unlisted fields are ignored | Flexible keys, fixed value type | Fixed element type |
| **Type Safety** | Strong - each field has specific type | Weak - all values same type | Medium - all elements same type |
| **Performance** | Fastest (known schema) | Moderate | Fast |
| **Use Case** | Fixed schema objects | Dynamic key-value pairs | Ordered lists |

In [None]:
# -- nested array

sql = '''
select typeOf(js_struct['skills']) as tp
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

# -- access the nested array

sql = '''
select js_struct['skills'] as `skills`
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

# -- access the nested array and an element from it

sql = '''
select js_struct['skills'][0] as `firstSkill`
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

+-------------+
|tp           |
+-------------+
|array<string>|
+-------------+

+--------------------------+
|skills                    |
+--------------------------+
|[Python, SQL, Spark, Java]|
+--------------------------+

+----------+
|firstSkill|
+----------+
|Python    |
+----------+



In [None]:
# get_json_object :

# `get_json_object` extracts a value by JSON path directly from the JSON string, WITHOUT first parsing it into a struct.

sql = '''
select
get_json_object(js_string,'$.name') as `name`
from input_view
'''
spark.sql(sql).show(truncate = False)



+--------+
|name    |
+--------+
|John Doe|
+--------+



In [None]:
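# get_json_object :

# `get_json_object` extracts a value by JSON path directly from the JSON string, WITHOUT first parsing it into a struct.

# -- access nested JSON (a single path such as '$.address.street' returns the same street value as the nested calls below)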

sql = '''
select
get_json_object(js_string,'$.address') as `address`, -- first level
get_json_object(get_json_object(js_string,'$.address'),'$.street') as street -- second level extract
from input_view
'''
spark.sql(sql).show(truncate = False)
spark.sql(sql).printSchema()

+------------------------------------------------------------------------------+-----------+
|address                                                                       |street     |
+------------------------------------------------------------------------------+-----------+
|{"street":"123 Main St","city":"San Francisco","state":"CA","zipcode":"94105"}|123 Main St|
+------------------------------------------------------------------------------+-----------+

root
 |-- address: string (nullable = true)
 |-- street: string (nullable = true)



In [None]:
# to_json : convert a struct or map or array to a json string

sql = '''
select to_json(js_struct) as backToJSONString
from parsed_view1
'''
spark.sql(sql).show(truncate = False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|backToJSONString                                                                                                                                                                                                                                                                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---
#### Comparison: `from_json` vs `to_json`
---

| Aspect | from_json | to_json |
|--------|-----------|---------|
| **Purpose** | JSON Deserialization | JSON Serialization |
| **Input** | JSON String | Struct, Map, or Array |
| **Output** | Structured Data (Struct/Map/Array) | JSON String |
| **Direction** | JSON → Structured Data | Structured Data → JSON |
| **Schema Required** | Yes | No |
| **Primary Use Case** | Reading JSON data | Writing JSON data |

---
| | from_json | to_json |
|--|-----------|---------|
| **Syntax** | `from_json(jsonStr, schema[, options])` | `to_json(expr[, options])` |
| **Schema Parameter** | Required string defining structure | Not applicable |
| **Options Parameter** | Optional map of parsing options (e.g. `mode`, `timestampFormat`) | Optional map of formatting options (e.g. `timestampFormat`) |
---

| Function | Input Example | Output Example |
|----------|---------------|----------------|
| **from_json** | `'{"name":"John","age":30}'` | `{name: "John", age: 30}` |
| **to_json** | `named_struct('name', 'John', 'age', 30)` | `'{"name":"John","age":30}'` |
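A `from_json` then `to_json` round trip generally does not reproduce the input text byte for byte: whitespace is dropped, fields come out in schema order (alphabetical when the schema came from `schema_of_json`, as in the output above), and numbers such as `75000.50` are rendered as `75000.5`. A minimal plain-Python sketch of comparing the two strings by value rather than by text follows. The helper `same_json` is hypothetical, and `round_tripped` is a hand-written stand-in for what `to_json` might return, not captured Spark output.

```python
import json


def same_json(left: str, right: str) -> bool:
    """Compare two JSON documents by parsed value, ignoring key order and whitespace."""
    return json.loads(left) == json.loads(right)


original = '{"name": "John Doe", "age": 30, "salary": 75000.50}'

# Hand-written stand-in for a to_json result (assumption, not real Spark output).
round_tripped = '{"age":30,"name":"John Doe","salary":75000.5}'

print(original == round_tripped)           # text comparison
print(same_json(original, round_tripped))  # value comparison
```

The same check can be applied to strings collected from `input_view` and `parsed_view1` when validating a pipeline on a small sample.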

In [None]:
# get_json_object :

# `get_json_object` is used to get the json value with key WITHOUT parsing the json string into struct.

# -- access array value

sql = '''
select
get_json_object(js_string,'$.skills') as `skills`, -- first level
typeOf(get_json_object(js_string,'$.skills')) as returnType -- type of first level

from input_view
'''
spark.sql(sql).show(truncate = False)

+-------------------------------+----------+
|skills                         |returnType|
+-------------------------------+----------+
|["Python","SQL","Spark","Java"]|string    |
+-------------------------------+----------+



In [None]:
# get_json_object :

# `get_json_object` returns the array as a JSON string, so it must be parsed with `from_json` before it can be exploded.

# -- parse the extracted array string and explode it into rows

sql = '''
select
explode(from_json(get_json_object(js_string, '$.skills'),'array<string>')) as explodedDirectlyFromString
from input_view
order by explodedDirectlyFromString asc
'''
spark.sql(sql).show(truncate = False)

+--------------------------+
|explodedDirectlyFromString|
+--------------------------+
|Java                      |
|Python                    |
|SQL                       |
|Spark                     |
+--------------------------+



### Spark JSON Parsing: `from_json` vs `get_json_object`

| Feature | from_json() | get_json_object() |
|---------|-------------|-------------------|
| **Schema Required** | ✅ Mandatory | ❌ Not required |
| **Efficiency** | 🚀 High (each string parsed once, all fields available) | 🐌 Lower (string re-parsed on every call) |
| **Use Case** | Multiple field access, known structure | Quick exploration, unknown schema, few fields |
| **Memory Usage** | 🟢 Better (typed, structured values) | 🔴 Higher (every result is a new string) |
| **Type Safety** | ✅ Strong (enforced types) | ❌ Weak (always returns string) |
| **Null Handling** | 🟡 Configurable through the `mode` option | 🟢 Returns NULL if the path is missing or the JSON is invalid |
| **Complex JSON** | ✅ Handles nested objects/arrays well | ⚠️ One path per call; nested results come back as JSON strings |
| **Parallelism** | 🟢 Row-level expression, evaluated per partition | 🟢 Row-level expression, evaluated per partition |
| **Data Shuffling** | 🟢 None (narrow operation) | 🟢 None (narrow operation) |
| **Partitioning** | ✅ Preserves partitioning | ✅ Preserves partitioning |
| **Cluster Utilization** | 🟢 Parse cost paid once per row | 🟡 Parse cost repeated for each extracted field |
| **Performance at Scale** | 🚀 Cost grows with row count | 🐌 Cost grows with row count × number of extracted paths |
| **Production Readiness** | ✅ Recommended when several fields are needed | ⚠️ Best kept to one or two fields |

In [None]:
# explode function & posexplode function

sql = '''
select
explode(from_json(get_json_object(js_string, '$.skills'),'array<string>')) as explodedDirectlyFromString
from input_view
order by explodedDirectlyFromString asc
'''
spark.sql(sql).show(truncate = False)

#---

sql = '''
SELECT
  posexplode_outer(from_json(get_json_object(js_string, '$.skills'), 'array<string>')) as (pos, skill)
FROM input_view
ORDER BY pos ASC
'''
spark.sql(sql).show(truncate = False)

+--------------------------+
|explodedDirectlyFromString|
+--------------------------+
|Java                      |
|Python                    |
|SQL                       |
|Spark                     |
+--------------------------+

+---+------+
|pos|skill |
+---+------+
|0  |Python|
|1  |SQL   |
|2  |Spark |
|3  |Java  |
+---+------+



### explode vs posexplode

| Feature | `explode()` | `posexplode()` |
|---------|-------------|----------------|
| **Purpose** | Explodes arrays into rows | Explodes arrays into rows with position/index |
| **Output Columns** | 1 column (array element) | 2 columns (index + array element) |
| **Index Included** | ❌ No | ✅ Yes (starts from 0) |
| **Basic Syntax** | `explode(array_col)` | `posexplode(array_col)` |
| **Direct Usage** | `SELECT explode(arr) FROM table` | `SELECT posexplode(arr) FROM table` |
| **Direct Aliasing** | `SELECT explode(arr) as element` | `SELECT posexplode(arr) as (position, element)` |
| **LATERAL VIEW Syntax** | `LATERAL VIEW explode(arr) t AS elem` | `LATERAL VIEW posexplode(arr) t AS pos, elem` |
| **LATERAL VIEW Aliasing** | Single alias for element | **Two aliases required** (position + element) |
| **Column Reference** | Refer to single column | Refer to two columns separately |
| **Example Output** | `['A','B','C'] → 'A' 'B' 'C'` | `['A','B','C'] → (0,'A') (1,'B') (2,'C')` |
| **Performance** | Faster (less data) | Slightly slower (more data) |
| **Use Case** | When you only need values | When you need position/index + values |
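The cell above calls `posexplode_outer`, while the table covers only `explode` and `posexplode`. The `_outer` variants differ in how they treat NULL or empty arrays: the plain functions emit no row at all, whereas the `_outer` ones emit a single row of NULLs. The plain-Python generators below are a rough model of those semantics for one input row, not Spark code, with `None` standing in for NULL.

```python
def explode(arr):
    """Model of explode(): one output per element; NULL or empty input yields nothing."""
    yield from (arr or [])


def explode_outer(arr):
    """Model of explode_outer(): like explode(), but NULL or empty input yields one NULL."""
    if not arr:
        yield None
    else:
        yield from arr


def posexplode(arr):
    """Model of posexplode(): (position, element) pairs, positions starting at 0."""
    yield from enumerate(arr or [])


def posexplode_outer(arr):
    """Model of posexplode_outer(): NULL or empty input yields one (NULL, NULL) pair."""
    if not arr:
        yield (None, None)
    else:
        yield from enumerate(arr)


skills = ["Python", "SQL", "Spark", "Java"]
print(list(posexplode(skills)))
print(list(explode(None)), list(explode_outer(None)))
```

In a query, the practical consequence is that `explode` and `posexplode` silently drop rows whose array is NULL or empty, so the `_outer` variants are the safer choice when every source row must survive.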

In [None]:
# json_tuple : extract several top-level JSON fields in a single call

# works on a JSON STRING only; every returned column is a string

sql = '''
select json_tuple(js_string,'user_id','name','email','skills') as (user_id,name,email,skills)
from input_view
'''

spark.sql(sql).show(truncate = False)
spark.sql(sql).printSchema()

+-------+--------+--------------------+-------------------------------+
|user_id|name    |email               |skills                         |
+-------+--------+--------------------+-------------------------------+
|12345  |John Doe|john.doe@example.com|["Python","SQL","Spark","Java"]|
+-------+--------+--------------------+-------------------------------+

root
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- skills: string (nullable = true)



In [None]:
# json_array_length
# works on json STRING containing an array

sql = '''
SELECT json_array_length('[{"name":"rahul"},{"name":"lathika"}]') as array_length
'''
spark.sql(sql).show(truncate = False)

+------------+
|array_length|
+------------+
|2           |
+------------+



In [None]:
# json_object_keys

# works on json STRING, NOT on a parsed JSON

sql = '''
SELECT json_object_keys(js_string) as jsKeys
from input_view
'''
spark.sql(sql).show(truncate = False)
spark.sql(sql).printSchema()

+------------------------------------------------------------------------------------------+
|jsKeys                                                                                    |
+------------------------------------------------------------------------------------------+
|[user_id, name, email, age, is_active, skills, address, salary, hire_date, tags, projects]|
+------------------------------------------------------------------------------------------+

root
 |-- jsKeys: array (nullable = true)
 |    |-- element: string (containsNull = true)



---
#### JSON and Array Functions Comparison Table
---

| Function | Syntax | Purpose | Input | Output | Schema Required | Notes |
|----------|--------|---------|-------|--------|-----------------|--------|
| **schema_of_json** | `schema_of_json(json_str)` | Infers schema from JSON string | JSON string literal | Schema string (DDL) | No | Returns the schema as a DDL string; the input must be a literal, not a column |
| **from_json** | `from_json(json_str, schema[, options])` | Parses JSON string to structured data | JSON string | Struct/Map/Array | Yes | Converts JSON to Spark complex types |
| **to_json** | `to_json(expr[, options])` | Converts structured data to JSON string | Struct/Map/Array | JSON string | No | Serializes Spark types to JSON |
| **get_json_object** | `get_json_object(json_str, path)` | Extracts specific element using JSON path | JSON string, Path | String | No | Supports a subset of JSONPath syntax (e.g. `$.a.b`, `$.arr[0]`) |
| **json_tuple** | `json_tuple(json_str, field1, field2, ...)` | Extracts multiple fields as tuple | JSON string, Field names | Multiple string columns | No | Top-level fields only; more efficient than multiple get_json_object calls |
| **json_array_length** | `json_array_length(json_array_str)` | Returns length of JSON array | JSON array string | Integer | No | Returns NULL for input that is not a valid JSON array |
| **json_object_keys** | `json_object_keys(json_obj_str)` | Returns keys from JSON object | JSON object string | Array of strings | No | Extracts all keys from JSON object |
| **explode** | `explode(array_column)` | Creates new row for each array element | Array column | Multiple rows (same type as elements) | No | Increases row count; rows with NULL or empty arrays are dropped (use `explode_outer` to keep them) |
| **posexplode** | `posexplode(array_column)` | Creates new row for each array element with position | Array column | Multiple rows with (pos, element) | No | Includes array index (0-based) |