# Reading Data - JSON Files

**Technical Accomplishments:**
- Read data from:
  * JSON without a Schema
  * JSON with a Schema

## ➡️ Getting Started

Run the following cell to configure our notebook.

In [None]:
%run Utilities

## ➡️ Reading from JSON w/ InferSchema

Reading in JSON isn't much different from reading in CSV files.

Let's start with taking a look at all the different options that go along with reading in JSON files.

## ➡️ JSON Lines

Much like the CSV reader, the JSON reader also assumes...
* That there is one JSON object per line and...
* That each object is delimited by a newline.

This format is referred to as **JSON Lines** or **newline-delimited JSON**.

More information about this format can be found at <a href="http://jsonlines.org/" target="_blank">http://jsonlines.org</a>.
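
As a minimal sketch of what "one JSON object per line" means, the plain-Python snippet below parses a small JSON Lines string one line at a time. The records and their values are made up for illustration; only the field names `City`, `Zipcode`, and `geocoding` echo the data set used later in this notebook.

```python
import json

# Hypothetical sample in JSON Lines form: one complete JSON object per line.
json_lines_text = (
    '{"City": "EXAMPLEVILLE", "Zipcode": 10001, "geocoding": {"Lat": 1.5, "Long": -2.5}}\n'
    '{"City": "SAMPLETON", "Zipcode": 10002, "geocoding": {"Lat": 3.0, "Long": -4.0}}\n'
)

# Each non-empty line is a self-contained document, so lines can be
# parsed independently of one another.
records = [json.loads(line) for line in json_lines_text.splitlines() if line.strip()]

print(len(records))                    # number of parsed objects
print(records[1]["geocoding"]["Lat"])  # nested values are ordinary dict lookups
```

Because every line stands on its own, a reader never needs to see the whole file to parse any single record.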

## ➡️ The Data Source
* For this exercise, we will be using a JSON file containing some ZIP codes.
* Like we did with the CSV file, we can use **%%sh ls ...** to view the file in the lakehouse file system.

In [None]:
%%sh

ls /lakehouse/default/Files/sampledata/zipcodes_singlelines.json

Like we did with the CSV file, we can use **%%sh head ...** to peek at the first few lines of the JSON file.

In [None]:
%%sh

head /lakehouse/default/Files/sampledata/zipcodes_singlelines.json
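
If a shell cell is not available, the same peek can be sketched in plain Python. The helper name `head` and the throwaway temporary file below are assumptions made for illustration; in the notebook you would pass the lakehouse path shown above instead.

```python
import itertools
import os
import tempfile

def head(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]

# Build a small throwaway JSON Lines file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(f'{{"RecordNumber": {i}}}' for i in range(25)) + "\n")

first_lines = head(tmp.name, n=3)
os.remove(tmp.name)  # clean up the temporary file

for line in first_lines:
    print(line)
```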

## ➡️ Read The JSON File

The command to read in JSON looks very similar to that of CSV.

In addition to reading the JSON file, we will also print the resulting schema.

In [None]:
jsonFile = "Files/sampledata/zipcodes_singlelines.json"

zipcodesDF = (spark.read           # The DataFrameReader
    .option("inferSchema", "true")  # A CSV reader option; the JSON reader infers names & types whenever no schema is supplied
    .json(jsonFile)                 # Creates a DataFrame from JSON after reading in the file
)
zipcodesDF.printSchema()

With our DataFrame created, we can now take a peek at the data.

But to demonstrate a unique aspect of JSON data (or any data with embedded fields), we will first create a temporary view and then view the data via SQL:

In [None]:
# create a view called zipcodes
zipcodesDF.createOrReplaceTempView("zipcodes")

And now we can take a peek at the data with a simple SQL SELECT statement:

In [None]:
%%sql

SELECT * FROM zipcodes

Notice the **geocoding** column has embedded data.

You can expand the fields by clicking the right triangle in each row.

But we can also reference the sub-fields directly as we see in the following SQL statement:

In [None]:
%%sql

SELECT City, Location, geocoding.Lat, geocoding.Long, geocoding.Zaxis
FROM zipcodes
WHERE geocoding.Zaxis > 0.5
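
To make the dotted sub-field reference concrete, here is a plain-Python sketch of the same selection over nested records. The two rows are hypothetical and their values are invented; only the field names come from the query above.

```python
# Hypothetical records shaped like the zipcodes data; all values are made up.
rows = [
    {"City": "A", "Location": "loc-a", "geocoding": {"Lat": 10.0, "Long": -70.0, "Zaxis": 0.25}},
    {"City": "B", "Location": "loc-b", "geocoding": {"Lat": 40.0, "Long": -75.0, "Zaxis": 0.75}},
]

# Mirrors: SELECT City, Location, geocoding.Lat, geocoding.Long, geocoding.Zaxis
#          FROM zipcodes WHERE geocoding.Zaxis > 0.5
selected = [
    (r["City"], r["Location"], r["geocoding"]["Lat"], r["geocoding"]["Long"], r["geocoding"]["Zaxis"])
    for r in rows
    if r["geocoding"]["Zaxis"] > 0.5
]

print(selected)
```

In SQL, `geocoding.Zaxis` plays the role of the nested lookup `r["geocoding"]["Zaxis"]`.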

## ➡️ Reading from JSON w/ User-Defined Schema

To avoid the extra job that schema inference triggers, we can (just like we did with CSV) specify the schema for the `DataFrame`.

## ➡️ Step #1 - Create the Schema

Compared to our CSV example, the structure of this data is a little more complex.

Note that we can support complex data types as seen in the field `geocoding`.

In [None]:
from pyspark.sql.types import (
    BooleanType, DoubleType, LongType, StringType, StructField, StructType
)

jsonSchema = StructType([
    StructField("City", StringType(), True),
    StructField("Country", StringType(), True),
    StructField("Decommisioned", BooleanType(), True),
    StructField("EstimatedPopulation", LongType(), True),
    StructField("Location", StringType(), True),
    StructField("LocationText", StringType(), True),
    StructField("LocationType", StringType(), True),
    StructField("Notes", StringType(), True),
    StructField("RecordNumber", LongType(), True),
    StructField("State", StringType(), True),
    StructField("TaxReturnsFiled", LongType(), True),
    StructField("TotalWages", LongType(), True),
    StructField("WorldRegion", StringType(), True),
    StructField("ZipCodeType", StringType(), True),
    StructField("Zipcode", LongType(), True),
    StructField("geocoding", StructType([
        StructField("Lat", DoubleType(), True),
        StructField("Long", DoubleType(), True),
        StructField("Xaxis", DoubleType(), True),
        StructField("Yaxis", DoubleType(), True),
        StructField("Zaxis", DoubleType(), True)
    ]))
])

That was a lot of typing to get our schema!

For a small file, manually creating the schema may not be worth the effort.

However, for a large file, the time spent manually creating the schema may be worth it to avoid a very long schema-inference pass.
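
To see why inference gets more expensive as the file grows, consider the plain-Python sketch below. It is not Spark's actual inference algorithm, and the function name `infer_field_types` and the sample records are assumptions for illustration. It only shows that the full set of fields and types is not known until every record has been visited.

```python
import json

def infer_field_types(lines):
    """Collect the Python type names seen for each top-level field."""
    seen = {}
    for line in lines:  # every record has to be read once...
        record = json.loads(line)
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen         # ...before the complete set of fields is known

# Hypothetical sample: the second record carries a field the first one lacks.
sample = [
    '{"City": "A", "Zipcode": 1}',
    '{"City": "B", "Zipcode": 2, "Notes": "only some records carry this field"}',
]

inferred = infer_field_types(sample)
print(sorted(inferred))
```

Supplying the schema up front removes the need for this extra pass over the data.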

## ➡️ Step #2 - Read in the JSON

Next, we will read in the JSON file and once again print its schema.

In [None]:
(spark.read            # The DataFrameReader
  .schema(jsonSchema)  # Use the specified schema
  .json(jsonFile)      # Creates a DataFrame from JSON after reading in the file
  .printSchema()
)

## ➡️ Review: Reading from JSON w/ User-Defined Schema
* Just like CSV, providing the schema avoids the extra job.
* The schema lets us specify alternate data types. Unlike CSV, JSON fields are matched by name, so the schema cannot be used to rename columns.
* The schema can be arbitrarily complex in its structure.

Let's take a look at some of the other details of the `DataFrame` we just created for comparison's sake.

In [None]:
jsonDF = (spark.read
  .schema(jsonSchema)
  .json(jsonFile)    
)

print("Partitions: " + str(jsonDF.rdd.getNumPartitions()))

printRecordsPerPartition(jsonDF)

print("-"*80)

And of course we can view that data here:

In [None]:
display(jsonDF)