# Flattening JSON File Containing Arrayed Dictionaries

**BACKGROUND:** We have a JSON file with arrays of dictionaries under two keys: "digitalInsuranceBuildData.invoice.orders" and "digitalInsuranceBuildData.features".  The goal is to flatten all the keys into columns.  Because "digitalInsuranceBuildData.features" holds 6 elements (features) and "digitalInsuranceBuildData.invoice.orders" holds only 1, the resulting table or dataframe should have 1 x 6 = 6 rows of data.  Non-arrayed keys can be referenced with dot (.) notation, and explode() turns each element of an arrayed key into its own row.  Since Spark allows only one explode() within a single select(), we create one select query for the "digitalInsuranceBuildData.invoice.orders" key and a separate one for the "digitalInsuranceBuildData.features" key, then join both onto the main select query, which contains the rest of the columns created from the non-arrayed keys.
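
The expected row count can be sanity-checked before touching Spark. The sketch below uses only the standard library and a trimmed, hypothetical copy of the record (only `vin`, the single order, and the six feature codes are kept). It assumes each exploded array is joined back on `vin`, so the final row count is the product of the two array lengths.

```python
import json

# Trimmed, hypothetical copy of the sample record: only the fields needed
# to count rows are kept.
sample = '''
{
    "digitalInsuranceBuildData": {
        "vin": "1FBZX2ZMREDACTED",
        "invoice": {"orders": [{"orderCode": "X2Z", "orderType": "BODY"}]},
        "features": [
            {"featureWersCode": "A22AA"},
            {"featureWersCode": "EN-RM"},
            {"featureWersCode": "A4MAA"},
            {"featureWersCode": "TR-C3"},
            {"featureWersCode": "YZKAB"},
            {"featureWersCode": "DR--B"}
        ]
    }
}'''

build = json.loads(sample)["digitalInsuranceBuildData"]
n_orders = len(build["invoice"]["orders"])
n_features = len(build["features"])

# Joining both exploded arrays on vin pairs every order with every feature
expected_rows = n_orders * n_features
print(n_orders, n_features, expected_rows)  # prints: 1 6 6
```

If a vehicle had 2 orders instead of 1, the same arithmetic would give 12 rows, which may or may not be the shape you want downstream.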

In [1]:
from pyspark.sql import SparkSession
import pandas as pd
import pyspark.sql.functions as F

In [2]:
# create a SparkSession object
spark = SparkSession.builder.appName("JSONtoDF").getOrCreate()

In [3]:
# define the JSON string
json_string = '''
{
    "digitalInsuranceBuildData": {
        "vin": "1FBZX2ZMREDACTED",
        "buildDate": "2015-12-20",
        "year": 2016,
        "make": "FORD",
        "model": "TRANSIT",
        "vehicleEngineerDescription": "FORD  TRANSIT T350 WAGON LOW ROOF LONG WB 60/40 CARGO DR",
        "plantName": "KANSAS CITY ASSY",
        "trimLevel": "XLT",
        "dealer": {
            "countryCode": "USA"
        },
        "configuration": {
            "siriusXm": {
                "capable": true
            },
            "sync": {
                "capable": true
            }
        },
        "invoice": {
            "currencyCode": "USD",
            "orders": [
                {
                    "orderCode": "X2Z",
                    "orderType": "BODY"
                }
            ],
            "price": {
                "manufacturerSuggestedRetailPrice": 39180.0
            }
        },
        "features": [
            {
                "featureWersCode": "A22AA",
                "engineerDescription": "LESS D PILLAR ASSIST HANDLE",
                "familyEngineerDescription": "D PILLAR ASSIST HANDLE"
            },
            {
                "featureWersCode": "EN-RM",
                "engineerDescription": "3.7L 4V-DAMB PFI V6 NA GAS",
                "familyEngineerDescription": "ENGINE-CAR/LT TRK",
                "featureGroupType": "ENGINE"
            },
            {
                "featureWersCode": "A4MAA",
                "engineerDescription": "LESS DIESEL PARTICULATE FILTER",
                "familyEngineerDescription": "DIESEL PARTICULATE FILTERS"
            },
            {
                "featureWersCode": "TR-C3",
                "engineerDescription": "6 SPD AUTO TRANS (6R80)",
                "familyEngineerDescription": "TRANSMISSION-CAR/LT TRK",
                "featureGroupType": "TRANSMISSION"
            },
            {
                "featureWersCode": "YZKAB",
                "engineerDescription": "FLEET",
                "familyEngineerDescription": "FLEET"
            },
            {
                "featureWersCode": "DR--B",
                "engineerDescription": "2 WHL L/H REAR DRIVE",
                "familyEngineerDescription": "DRIVE-CAR/LT TRK",
                "featureGroupType": "DRIVETRAIN"
            }
        ]
    }
}'''

In [4]:
# When creating a DataFrame from a JSON string, need to parallelize it
df = spark.read.json(spark.sparkContext.parallelize([json_string]))
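
Before writing the select queries, it helps to confirm which keys are arrays of dictionaries and therefore need explode(). In Spark, `df.printSchema()` shows this directly. The stdlib-only sketch below does the same on the raw JSON; `find_array_paths` is a hypothetical helper written for this illustration, and the record is a trimmed copy of the sample. It does not descend into arrays nested inside other arrays.

```python
import json

def find_array_paths(node, prefix=""):
    """Return dotted paths of keys whose value is a non-empty list of dicts."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
                paths.append(path)  # this key would need explode()
            else:
                paths.extend(find_array_paths(value, path))
    return paths

# Trimmed, hypothetical copy of the sample record
sample = '''
{
    "digitalInsuranceBuildData": {
        "vin": "1FBZX2ZMREDACTED",
        "dealer": {"countryCode": "USA"},
        "invoice": {
            "currencyCode": "USD",
            "orders": [{"orderCode": "X2Z", "orderType": "BODY"}]
        },
        "features": [{"featureWersCode": "A22AA"}]
    }
}'''

array_paths = find_array_paths(json.loads(sample))
print(array_paths)
# ['digitalInsuranceBuildData.invoice.orders', 'digitalInsuranceBuildData.features']
```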

In [None]:
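# When creating a DataFrame from a JSON file, there is no need to parallelize it
# (pass multiLine=True if the file is pretty-printed across several lines, like the string above)
# df = spark.read.json("data/sample.json")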

In [5]:
# Create main dataframe from non-arrayed keys using dot/. notation
df_main = df.select(
    F.col("digitalInsuranceBuildData.vin").alias("vin"),
    F.col("digitalInsuranceBuildData.buildDate").alias("buildDate"),
    F.col("digitalInsuranceBuildData.year").alias("year"),
    F.col("digitalInsuranceBuildData.make").alias("make"),
    F.col("digitalInsuranceBuildData.model").alias("model"),
    F.col("digitalInsuranceBuildData.vehicleEngineerDescription").alias("vehicleEngineerDescription"),
    F.col("digitalInsuranceBuildData.plantName").alias("plantName"),
    F.col("digitalInsuranceBuildData.trimLevel").alias("trimLevel"),
    F.col("digitalInsuranceBuildData.dealer.countryCode").alias("countryCode"),
    F.col("digitalInsuranceBuildData.configuration.siriusXm.capable").alias("siriusXm_capable"),
    F.col("digitalInsuranceBuildData.configuration.sync.capable").alias("sync_capable"),
    F.col("digitalInsuranceBuildData.invoice.currencyCode").alias("currencyCode"),
    F.col("digitalInsuranceBuildData.invoice.price.manufacturerSuggestedRetailPrice").alias("manufacturerSuggestedRetailPrice"),
)

In [6]:
df_main.show()

+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+
|             vin| buildDate|year|make|  model|vehicleEngineerDescription|       plantName|trimLevel|countryCode|siriusXm_capable|sync_capable|currencyCode|manufacturerSuggestedRetailPrice|
+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+
|1FBZX2ZMREDACTED|2015-12-20|2016|FORD|TRANSIT|      FORD  TRANSIT T35...|KANSAS CITY ASSY|      XLT|        USA|            true|        true|         USD|                         39180.0|
+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+



#### Since our main dataframe doesn't include the "digitalInsuranceBuildData.invoice.orders" and "digitalInsuranceBuildData.features" columns, which need to be exploded, only 1 row is returned.  Next, we create separate dataframes for the exploded orders and features columns.

In [7]:
# create a subquery for orders since the "digitalInsuranceBuildData.invoice.orders" is mapped to an array of dictionaries
orders_subquery = df.select(
    F.col("digitalInsuranceBuildData.vin").alias("vin"),
    F.explode(F.col("digitalInsuranceBuildData.invoice.orders")).alias("orders")
)

# create a subquery for features since the "digitalInsuranceBuildData.features" is mapped to an array of dictionaries
features_subquery = df.select(
    F.col("digitalInsuranceBuildData.vin").alias("vin"),
    F.explode(F.col("digitalInsuranceBuildData.features")).alias("features")
)

#### Merge or join our exploded dataframes with our main dataframe

In [8]:
# join the subqueries back to the main DataFrame using a common key or unique identifier ("vin")
df_main = df_main.join(orders_subquery, on=["vin"], how="inner")
df_main = df_main.join(features_subquery, on=["vin"], how="inner")

In [9]:
df_main.show()

+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+-----------+--------------------+
|             vin| buildDate|year|make|  model|vehicleEngineerDescription|       plantName|trimLevel|countryCode|siriusXm_capable|sync_capable|currencyCode|manufacturerSuggestedRetailPrice|     orders|            features|
+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+-----------+--------------------+
|1FBZX2ZMREDACTED|2015-12-20|2016|FORD|TRANSIT|      FORD  TRANSIT T35...|KANSAS CITY ASSY|      XLT|        USA|            true|        true|         USD|                         39180.0|{X2Z, BODY}|{LESS D PILLAR AS...|
|1FBZX2ZMREDACTED|2015-12-20|2016|FORD|TRANSIT|      FORD  TRANSIT T35...|KANSAS CITY ASSY|      XLT|       

#### From above, our main dataframe went from 1 row of data to 6 rows, which is what we should expect after using explode() and joining.  However, each row of the orders column and of the features column still holds a dictionary (a struct, in Spark terms).  We will extract the individual dictionary values using dot (.) notation in our final dataframe below:

In [10]:
df_final = df_main.select(
    F.col("vin"),
    F.col("buildDate"),
    F.col("year"),
    F.col("make"),
    F.col("model"),
    F.col("vehicleEngineerDescription"),
    F.col("plantName"),
    F.col("trimLevel"),
    F.col("countryCode"),
    F.col("siriusXm_capable"),
    F.col("sync_capable"),
    F.col("currencyCode"),
    F.col("manufacturerSuggestedRetailPrice"),
    # Use dot/. notation to reference specific keys within "orders" key
    F.col("orders.orderCode").alias("orderCode"),
    F.col("orders.orderType").alias("orderType"),
    # Use dot/. notation to reference specific keys within "features" key
    F.col("features.featureWersCode").alias("featureWersCode"),
    F.col("features.engineerDescription").alias("engineerDescription"),
    F.col("features.familyEngineerDescription").alias("familyEngineerDescription"),
    F.col("features.featureGroupType").alias("featureGroupType"),
)

In [11]:
# show the final DataFrame
df_final.show()

+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+---------+---------+---------------+--------------------+-------------------------+----------------+
|             vin| buildDate|year|make|  model|vehicleEngineerDescription|       plantName|trimLevel|countryCode|siriusXm_capable|sync_capable|currencyCode|manufacturerSuggestedRetailPrice|orderCode|orderType|featureWersCode| engineerDescription|familyEngineerDescription|featureGroupType|
+----------------+----------+----+----+-------+--------------------------+----------------+---------+-----------+----------------+------------+------------+--------------------------------+---------+---------+---------------+--------------------+-------------------------+----------------+
|1FBZX2ZMREDACTED|2015-12-20|2016|FORD|TRANSIT|      FORD  TRANSIT T35...|KANSAS CITY ASSY|      XLT|        USA|            true|

Let's convert to a pandas dataframe for a more readable, HTML-rendered output:

In [12]:
df_pandas = df_final.toPandas()
df_pandas

,vin,buildDate,year,make,model,vehicleEngineerDescription,plantName,trimLevel,countryCode,siriusXm_capable,sync_capable,currencyCode,manufacturerSuggestedRetailPrice,orderCode,orderType,featureWersCode,engineerDescription,familyEngineerDescription,featureGroupType
0,1FBZX2ZMREDACTED,2015-12-20,2016,FORD,TRANSIT,FORD TRANSIT T350 WAGON LOW ROOF LONG WB 60/4...,KANSAS CITY ASSY,XLT,USA,True,True,USD,39180.0,X2Z,BODY,A22AA,LESS D PILLAR ASSIST HANDLE,D PILLAR ASSIST HANDLE,
1,1FBZX2ZMREDACTED,2015-12-20,2016,FORD,TRANSIT,FORD TRANSIT T350 WAGON LOW ROOF LONG WB 60/4...,KANSAS CITY ASSY,XLT,USA,True,True,USD,39180.0,X2Z,BODY,EN-RM,3.7L 4V-DAMB PFI V6 NA GAS,ENGINE-CAR/LT TRK,ENGINE
2,1FBZX2ZMREDACTED,2015-12-20,2016,FORD,TRANSIT,FORD TRANSIT T350 WAGON LOW ROOF LONG WB 60/4...,KANSAS CITY ASSY,XLT,USA,True,True,USD,39180.0,X2Z,BODY,A4MAA,LESS DIESEL PARTICULATE FILTER,DIESEL PARTICULATE FILTERS,
3,1FBZX2ZMREDACTED,2015-12-20,2016,FORD,TRANSIT,FORD TRANSIT T350 WAGON LOW ROOF LONG WB 60/4...,KANSAS CITY ASSY,XLT,USA,True,True,USD,39180.0,X2Z,BODY,TR-C3,6 SPD AUTO TRANS (6R80),TRANSMISSION-CAR/LT TRK,TRANSMISSION
4,1FBZX2ZMREDACTED,2015-12-20,2016,FORD,TRANSIT,FORD TRANSIT T350 WAGON LOW ROOF LONG WB 60/4...,KANSAS CITY ASSY,XLT,USA,True,True,USD,39180.0,X2Z,BODY,YZKAB,FLEET,FLEET,
5,1FBZX2ZMREDACTED,2015-12-20,2016,FORD,TRANSIT,FORD TRANSIT T350 WAGON LOW ROOF LONG WB 60/4...,KANSAS CITY ASSY,XLT,USA,True,True,USD,39180.0,X2Z,BODY,DR--B,2 WHL L/H REAR DRIVE,DRIVE-CAR/LT TRK,DRIVETRAIN
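
For a single small record like this one, a similar flattening can be done in pandas alone with `pd.json_normalize`. The sketch below is an aside under that assumption (data that fits in memory) and uses a trimmed, hypothetical copy of the record; it flattens only the features array and carries two parent keys along as metadata.

```python
import json
import pandas as pd

# Trimmed, hypothetical copy of the sample record
sample = '''
{
    "digitalInsuranceBuildData": {
        "vin": "1FBZX2ZMREDACTED",
        "dealer": {"countryCode": "USA"},
        "features": [
            {"featureWersCode": "A22AA"},
            {"featureWersCode": "EN-RM", "featureGroupType": "ENGINE"}
        ]
    }
}'''

build = json.loads(sample)["digitalInsuranceBuildData"]

# record_path names the array to turn into rows; meta lists parent keys to
# repeat on every row (a nested key is given as a list of path segments)
features_df = pd.json_normalize(
    build,
    record_path="features",
    meta=["vin", ["dealer", "countryCode"]],
)
print(features_df)
```

Nested metadata keys come out with dotted column names such as `dealer.countryCode`, and features that lack `featureGroupType` get a missing value in that column.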
