https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.json.html

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

from pyspark.sql.functions import *
from pyspark.sql.types import *

In [2]:
readComplexJSONDF = spark.read.option("multiLine","true").option("mode", "permissive").json('complexJSON.json')

In [3]:
readComplexJSONDF.printSchema()

root
 |-- nationality: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- user: struct (nullable = true)
 |    |    |    |-- BSN: string (nullable = true)
 |    |    |    |-- cell: string (nullable = true)
 |    |    |    |-- dob: long (nullable = true)
 |    |    |    |-- email: string (nullable = true)
 |    |    |    |-- gender: string (nullable = true)
 |    |    |    |-- location: struct (nullable = true)
 |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |    |-- state: string (nullable = true)
 |    |    |    |    |-- street: string (nullable = true)
 |    |    |    |    |-- zip: long (nullable = true)
 |    |    |    |-- md5: string (nullable = true)
 |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |-- first: string (nullable = true)
 |    |    |    |    |-- last: string (nullable = true)
 |    |    |    |    |-- title: string (nullable = true)
 |    |    |    

In [4]:
readComplexJSONDF.show()

+-----------+--------------------+-------+
|nationality|             results|version|
+-----------+--------------------+-------+
|         NL|[[[12059376, (727...|    0.8|
+-----------+--------------------+-------+



The “results” field is an array, so to read the content of its elements we first need to explode it. The code below reads the `location` and `name` structs from the input file above.

In [5]:
explodeArrDF = readComplexJSONDF.withColumn('Exp_RESULTS', explode(col('results'))).drop('results')

# Read location and name
dfReadSpecificStructure = explodeArrDF.select("Exp_RESULTS.user.location.*", "Exp_RESULTS.user.name.*")

dfReadSpecificStructure.show(truncate=False)

+------+----------+-------------+-----+-------+-----+-----+
|city  |state     |street       |zip  |first  |last |title|
+------+----------+-------------+-----+-------+-----+-----+
|gennep|overijssel|1510 vismarkt|35356|genelva|spits|ms   |
+------+----------+-------------+-----+-------+-----+-----+
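
For intuition, the effect of `explode` followed by a nested-field selection can be modeled in plain Python, without Spark. This is only a sketch: the record below is a trimmed, hand-built stand-in for `complexJSON.json` (values taken from the output above), and `explode_rows` is a hypothetical helper, not a PySpark API.

```python
# Plain-Python model of explode() + nested field selection (no Spark needed).
# ASSUMPTION: this record is a trimmed stand-in for complexJSON.json.
record = {
    "nationality": "NL",
    "results": [
        {"user": {
            "location": {"city": "gennep", "state": "overijssel",
                         "street": "1510 vismarkt", "zip": 35356},
            "name": {"first": "genelva", "last": "spits", "title": "ms"},
        }}
    ],
}


def explode_rows(row, array_key, new_key):
    """Yield one copy of `row` per array element, like explode() plus drop()."""
    for element in row[array_key] or []:   # empty or missing array -> no rows
        out = {k: v for k, v in row.items() if k != array_key}
        out[new_key] = element
        yield out


exploded = list(explode_rows(record, "results", "Exp_RESULTS"))

# Rough equivalent of select("Exp_RESULTS.user.location.*", "Exp_RESULTS.user.name.*")
flat = [{**r["Exp_RESULTS"]["user"]["location"], **r["Exp_RESULTS"]["user"]["name"]}
        for r in exploded]
print(flat)
```

Each array element becomes its own row, and the `.*` selections merge the fields of both structs into one flat row.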



#### Read from HTTP link

In [6]:
from urllib.request import Request, urlopen

# Online data source
onlineData = 'https://randomuser.me/api/0.8/?results=10'

# read the online data file
httpData = urlopen(onlineData).read().decode('utf-8')

# convert into RDD
rdd = spark.sparkContext.parallelize([httpData])

# create a Dataframe
jsonDF = spark.read.json(rdd)

# read all the users name:
readUser = jsonDF.withColumn('Exp_Results', explode('results')).select('Exp_Results.user.name.*')
readUser.show(truncate=False)

+---------+----------+-----+
|first    |last      |title|
+---------+----------+-----+
|maélie   |moulin    |mrs  |
|florent  |rodriguez |mr   |
|thiago   |caron     |mr   |
|eva      |perrin    |ms   |
|erwan    |rey       |mr   |
|eloïse   |morel     |ms   |
|bastien  |leroy     |mr   |
|eva      |adam      |miss |
|valentine|carpentier|ms   |
|tessa    |le gall   |ms   |
+---------+----------+-----+



https://ch-nabarun.medium.com/read-json-using-pyspark-f792bda95741
    
https://sparkbyexamples.com/pyspark/pyspark-parse-json-from-string-column-text-file/    

In [7]:
rawDF = spark.read.json("complexJSON2.json", multiLine = "true")

In [8]:
rawDF.show(truncate=False)

+---------------------------------------------------------+----+----+----+-----------------------------------------------------------------------------------------------------------------------------------------+-----+
|batters                                                  |id  |name|ppu |topping                                                                                                                                  |type |
+---------------------------------------------------------+----+----+----+-----------------------------------------------------------------------------------------------------------------------------------------+-----+
|[[[1001, Regular], [1002, Chocolate], [1003, Blueberry]]]|0001|Cake|0.55|[[5001, None], [5002, Glazed], [5005, Sugar], [5007, Powdered Sugar], [5006, Chocolate with Sprinkles], [5003, Chocolate], [5004, Maple]]|donut|
+---------------------------------------------------------+----+----+----+--------------------------------------------------

In [9]:
rawDF.printSchema()

root
 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)



Column `batters` is a struct containing an array of structs. Column `topping` is an array of structs. Columns `id`, `name`, `ppu`, and `type` are plain string, string, double, and string columns, respectively.
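
The input file itself is not shown. Judging only from the schema and the `show()` output above, `complexJSON2.json` presumably resembles the document built below; treat the exact layout and key order as an assumption. The snippet uses only the standard `json` module and prints the shape of each nested column.

```python
import json

# ASSUMPTION: content of complexJSON2.json, reconstructed from the output above.
doc = {
    "id": "0001",
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
            {"id": "1003", "type": "Blueberry"},
        ]
    },
    "topping": [
        {"id": "5001", "type": "None"},
        {"id": "5002", "type": "Glazed"},
        {"id": "5005", "type": "Sugar"},
        {"id": "5007", "type": "Powdered Sugar"},
        {"id": "5006", "type": "Chocolate with Sprinkles"},
        {"id": "5003", "type": "Chocolate"},
        {"id": "5004", "type": "Maple"},
    ],
}

# Pretty-printed JSON spans several lines, which is why multiLine is set when reading.
text = json.dumps(doc, indent=2)
parsed = json.loads(text)

# batters: struct -> array -> struct
print(type(parsed["batters"]).__name__,
      type(parsed["batters"]["batter"]).__name__,
      type(parsed["batters"]["batter"][0]).__name__)

# topping: array -> struct
print(type(parsed["topping"]).__name__, type(parsed["topping"][0]).__name__)
```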


#### Convert Nested “batters” to Structured DataFrame

First, let's rename the top-level “id” column, because “id” is also a field of the element struct under `batters`.

In [10]:
sampleDF = rawDF.withColumnRenamed("id", "key")

In [11]:
batDF = sampleDF.select("key", "batters.batter")

batDF.printSchema()

root
 |-- key: string (nullable = true)
 |-- batter: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)



In [12]:
batDF.show(2, False)

+----+-------------------------------------------------------+
|key |batter                                                 |
+----+-------------------------------------------------------+
|0001|[[1001, Regular], [1002, Chocolate], [1003, Blueberry]]|
+----+-------------------------------------------------------+



Let’s create a separate row for each element of “batter” array by exploding “batter” column.

In [13]:
bat2DF = batDF.select("key", explode("batter").alias("new_batter"))

bat2DF.show()

+----+-----------------+
| key|       new_batter|
+----+-----------------+
|0001|  [1001, Regular]|
|0001|[1002, Chocolate]|
|0001|[1003, Blueberry]|
+----+-----------------+



In [14]:
bat2DF.printSchema()

root
 |-- key: string (nullable = true)
 |-- new_batter: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- type: string (nullable = true)



Now we can extract the individual fields from the “new_batter” struct. The dot (“.”) operator selects a single field, and “.*” selects all of them.

In [15]:
bat2DF.select("key", "new_batter.*").show()

+----+----+---------+
| key|  id|     type|
+----+----+---------+
|0001|1001|  Regular|
|0001|1002|Chocolate|
|0001|1003|Blueberry|
+----+----+---------+



Let’s put together everything we discussed so far.

In [16]:
finalBatDF = (sampleDF
        .select("key", explode("batters.batter").alias("new_batter"))
        .select("key", "new_batter.*")
        .withColumnRenamed("id", "bat_id")
        .withColumnRenamed("type", "bat_type"))
finalBatDF.show()

+----+------+---------+
| key|bat_id| bat_type|
+----+------+---------+
|0001|  1001|  Regular|
|0001|  1002|Chocolate|
|0001|  1003|Blueberry|
+----+------+---------+



#### Convert Nested “topping” to Structured DataFrame

In [17]:
sampleDF.printSchema()

root
 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- key: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)



In [18]:
topDF = (sampleDF
        .select("key", explode("topping").alias("new_topping"))
        .select("key", "new_topping.*")
        .withColumnRenamed("id", "top_id")
        .withColumnRenamed("type", "top_type")
        )
topDF.show(10, False)

+----+------+------------------------+
|key |top_id|top_type                |
+----+------+------------------------+
|0001|5001  |None                    |
|0001|5002  |Glazed                  |
|0001|5005  |Sugar                   |
|0001|5007  |Powdered Sugar          |
|0001|5006  |Chocolate with Sprinkles|
|0001|5003  |Chocolate               |
|0001|5004  |Maple                   |
+----+------+------------------------+



### PySpark JSON Functions

#### from_json()

The `from_json()` function converts a JSON string into a Struct type or Map type.

`from_json(col, schema, options={})`

Parses a column containing a JSON string into a `MapType` with `StringType` keys, or into a `StructType` or `ArrayType` with the specified schema.

Returns null, in the case of an unparseable string.

__Parameters__ : __col__ : Column or str, string column in JSON format

__schema__ : DataType or str, a StructType or ArrayType of StructType to use when parsing the JSON column. A DDL-formatted string is also supported for the schema.

__options__ : dict, optional, options to control parsing. Accepts the same options as the JSON datasource.
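
As a rough mental model, and not how Spark implements it, `from_json` with a struct schema behaves like the plain-Python function below: parse the string, keep only the fields named in the schema, and return `None` (standing in for null) when the string cannot be parsed. `parse_with_schema` is a hypothetical helper; type coercion and parse modes are deliberately left out. The exact result for malformed input can differ by Spark version and parse mode, so the sketch simply follows the documented statement above.

```python
import json


def parse_with_schema(value, fields):
    """Toy stand-in for from_json(col, schema) with a struct schema."""
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return None                      # unparseable input -> null
    if not isinstance(obj, dict):
        return None                      # this toy only handles JSON objects
    # Fields missing from the JSON come back as None; extra keys are dropped.
    return {name: obj.get(name) for name in fields}


print(parse_with_schema('{"a": 1}', ["a"]))
print(parse_with_schema('{"a": 1, "b": 2}', ["a"]))
print(parse_with_schema('not json', ["a"]))
```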

In [19]:
data = [(1, '''{"a": 1}''')]

schema = StructType([StructField("a", IntegerType())])

df = spark.createDataFrame(data, ("key", "value"))

df.show()
df.printSchema()

df.select(from_json(df.value, schema).alias("json")).show()

+---+--------+
|key|   value|
+---+--------+
|  1|{"a": 1}|
+---+--------+

root
 |-- key: long (nullable = true)
 |-- value: string (nullable = true)

+----+
|json|
+----+
| [1]|
+----+



In [20]:
df.select(from_json(df.value, "a INT").alias("json")).show()

+----+
|json|
+----+
| [1]|
+----+



In [21]:
schema = ArrayType(StructType([StructField("a", IntegerType())]))

In [22]:
df.select(from_json(df.value, schema).alias("json")).show()

+-----+
| json|
+-----+
|[[1]]|
+-----+



In [23]:
jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""

df=spark.createDataFrame([(1,jsonString)],["id","value"])

df.show(truncate=False)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+



#### Convert to MapType using `from_json()`

In [24]:
df2=df.withColumn("value",from_json(df.value, MapType(StringType(), StringType() ) ) )

df2.printSchema()
df2.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
+---+---------------------------------------------------------------------------+



#### Convert to StructType using `from_json()`

In [25]:
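# Struct schema whose field names match the keys of the JSON string in df.value
schema = StructType([
    StructField("Zipcode", LongType()),
    StructField("ZipCodeType", StringType()),
    StructField("City", StringType()),
    StructField("State", StringType())])

df3 = df.withColumn("value", from_json(df.value, schema))

df3.printSchema()
df3.show(truncate=False)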

#### to_json()

The `to_json()` function converts a MapType or Struct type DataFrame column to a JSON string. It throws an exception for an unsupported type.

`to_json(col, options={})`

__Parameters__ : __col__ : Column or str, name of column containing a struct, an array or a map.

__options__ : dict, optional, options to control converting. Accepts the same options as the JSON datasource. Additionally, the function supports the `pretty` option, which enables pretty JSON generation.



In [26]:
df2.show(truncate=False)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
+---+---------------------------------------------------------------------------+



In [27]:
df2.withColumn("value",to_json(col("value"))) \
   .show(truncate=False)

+---+----------------------------------------------------------------------------+
|id |value                                                                       |
+---+----------------------------------------------------------------------------+
|1  |{"Zipcode":"704","ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+----------------------------------------------------------------------------+
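
Note that `Zipcode` is now the quoted string `"704"`, whereas the original JSON held the number `704`. The `MapType(StringType(), StringType())` schema used to build `df2` turned every value into a string, and `to_json()` serializes what it is given. The same effect can be reproduced in plain Python with the standard `json` module, as a sketch:

```python
import json

original = '{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}'

# Mimic MapType(StringType(), StringType()): every value becomes a string.
as_string_map = {k: str(v) for k, v in json.loads(original).items()}

# Mimic to_json(): serialize the map back to compact JSON.
round_tripped = json.dumps(as_string_map, separators=(",", ":"))
print(round_tripped)
```

If the numeric type matters downstream, a struct schema with a numeric `Zipcode` field avoids this loss.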



#### json_tuple()

The `json_tuple()` function extracts elements from a JSON column and returns them as new columns.

Creates a new row for a json column according to the given field names.

`json_tuple(col, *fields)`

__Parameters__ : __col__ : Column or str, string column in JSON format

__fields__ : str, fields to extract

In [28]:
df.show(truncate=False)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+



In [29]:
df.select(col("id"),json_tuple(col("value"), "Zipcode", "ZipCodeType", "City")) \
    .toDF("id","Zipcode","ZipCodeType","City") \
    .show(truncate=False)

+---+-------+-----------+-----------+
|id |Zipcode|ZipCodeType|City       |
+---+-------+-----------+-----------+
|1  |704    |STANDARD   |PARC PARQUE|
+---+-------+-----------+-----------+



#### get_json_object()

`get_json_object()` extracts a JSON object from a JSON string based on the specified JSON path and returns the extracted object as a JSON string. It returns null if the input JSON string is invalid.

__Parameters__ : __col__ : Column or str, string column in JSON format

__path__ : str, path to the JSON object to extract

In [30]:
df.show(truncate=False)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+



In [31]:
df.select(col("id"), get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")
                   , get_json_object(col("value"), "$.State").alias("State")) \
    .show(truncate=False)

+---+-----------+-----+
|id |ZipCodeType|State|
+---+-----------+-----+
|1  |STANDARD   |PR   |
+---+-----------+-----+
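
In the paths used above, `$` denotes the root of the JSON document and each `.name` step selects a field. To make that concrete, here is a minimal plain-Python sketch that resolves simple dotted paths such as `$.State`. `get_by_path` is a hypothetical helper: it ignores array indexes and wildcards, and it returns Python values rather than the JSON strings that Spark returns.

```python
import json


def get_by_path(json_string, path):
    """Resolve a simple '$.a.b' path; return None if the JSON or the path is invalid."""
    try:
        current = json.loads(json_string)
    except (TypeError, ValueError):
        return None                       # invalid JSON -> null
    if not path.startswith("$"):
        return None
    # "$.a.b" -> ["a", "b"]; a bare "$" yields no steps and returns the root.
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            return None                   # missing field -> null
        current = current[key]
    return current


value = '{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}'
print(get_by_path(value, "$.ZipCodeType"), get_by_path(value, "$.State"))
```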



#### schema_of_json()

`schema_of_json()` parses a JSON string and infers its schema in DDL format.

`schema_of_json(json, options={})`

__Parameters__: __json__ : Column or str, a JSON string or a foldable string column containing a JSON string.

__options__ : dict, optional, options to control parsing. Accepts the same options as the JSON datasource. Changed in version 3.0: It accepts the options parameter to control schema inferring.

In [32]:
schemaStr=spark.range(1) \
    .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
    .collect()[0][0]
print(schemaStr)

struct<City:string,State:string,ZipCodeType:string,Zipcode:bigint>
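
To see where such a string comes from, the sketch below infers a similar DDL-style description in plain Python. It is a toy model under loud assumptions: `infer_ddl` is a hypothetical helper, it covers only a handful of JSON types, and the alphabetical field order is copied from the output above rather than from any documented guarantee. The exact formatting of Spark's result may also vary between versions.

```python
import json


def infer_ddl(value):
    """Return a DDL-like type string for a parsed JSON value (toy model)."""
    if isinstance(value, bool):           # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        # ASSUMPTION: fields are listed in sorted key order, as in the output above.
        fields = ",".join(f"{key}:{infer_ddl(value[key])}" for key in sorted(value))
        return f"struct<{fields}>"
    raise TypeError(f"unsupported JSON value: {value!r}")


sample = '{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}'
inferred = infer_ddl(json.loads(sample))
print(inferred)
```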


#### Parse or read a JSON string from a TEXT/CSV file and convert it into DataFrame

Parse a JSON string from a text file and convert it to PySpark DataFrame columns using `from_json()`.