# Reading Data - JSON Files

**Technical Accomplishments:**
- Read data from:
  * JSON without a Schema
  * JSON with a Schema

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
%run "./Includes/Utility-Methods"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from JSON w/ InferSchema

Reading JSON isn't much different from reading CSV files.

Let's start with taking a look at all the different options that go along with reading in JSON files.

### JSON Lines

Much like the CSV reader, the JSON reader also assumes...
* That there is one JSON object per line and...
* That each object is delimited by a newline.

This format is referred to as **JSON Lines** or **newline-delimited JSON**.

More information about this format can be found at <a href="http://jsonlines.org/" target="_blank">http://jsonlines.org</a>.
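To make the format concrete, here is a minimal plain-Python sketch (independent of Spark) that parses JSON Lines text one line at a time. The helper name `parse_json_lines` and the two sample records are made up for illustration.

```python
import json

def parse_json_lines(text):
    """Parse JSON Lines text: one complete JSON object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

# Two hypothetical records, one JSON object per line
sample = '{"title": "Spark", "n": 3}\n{"title": "JSON", "n": 5}\n'
records = parse_json_lines(sample)
print(len(records), records[0]["title"])
```

Because every line is a complete object, a reader can split a large file at line boundaries and parse the pieces independently.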

**Note:** *Spark 2.2 was released on July 11th, 2017. With it came file IO improvements for CSV & JSON and, more importantly, **support for parsing multi-line JSON and CSV files**. You can read more about that (and other features in Spark 2.2) in the <a href="https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html" target="_blank">Databricks Blog</a>.*
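As a rough sketch of what reading multi-line JSON might look like, the helper below turns on the JSON reader's `multiLine` option. It assumes Spark 2.2 or later and a `SparkSession` passed in as `spark` (the one Databricks notebooks provide); the helper name `read_multiline_json` is hypothetical.

```python
def read_multiline_json(spark, path):
    """Read JSON files in which a single record may span several lines."""
    return (spark.read                # The DataFrameReader
        .option("multiLine", "true")  # Default is false, i.e. JSON Lines
        .json(path)                   # Creates a DataFrame from the JSON file(s)
    )
```

With `multiLine` enabled each file is parsed as a whole, so this is best kept for files that really are not in JSON Lines format.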

### The Data Source
* For this exercise, we will be using the file called **snapshot-2016-05-26.json** (<a href="https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org/rc" target="_blank">4 MB</a> file from Wikipedia).
* The data represents a set of edits to Wikipedia articles captured in May of 2016.
* It's located on the DBFS at **dbfs:/mnt/training/wikipedia/edits/snapshot-2016-05-26.json**
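* Like we did with the CSV file, we can use **&percnt;fs ls ...** to view the file on the DBFS.
* The cells below run against a different, much larger file from the Databricks datasets, **2015_2_clickstream.json**; the command for the original path is kept in a commented-out cell.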

In [0]:
%fs ls dbfs:/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json

path,name,size
dbfs:/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json,2015_2_clickstream.json,2815233011


In [0]:
# %fs ls dbfs:/mnt/training/wikipedia/edits/snapshot-2016-05-26.json


Like we did with the CSV file, we can use **&percnt;fs head ...** to peek at the first couple lines of the JSON file.

In [0]:
%fs head dbfs:/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json

### Read The JSON File

The command to read in JSON looks very similar to that of CSV.

In addition to reading the JSON file, we will also print the resulting schema.

In [0]:
jsonFile = "dbfs:/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

wikiEditsDF = (spark.read           # The DataFrameReader
    .option("inferSchema", "true")  # Has no effect for JSON: the schema is always inferred when none is supplied
    .json(jsonFile)                 # Creates a DataFrame from JSON after reading in the file
 )
wikiEditsDF.printSchema()


With our DataFrame created, we can now take a peek at the data.

But to demonstrate a unique aspect of JSON data (or any data with embedded fields), we will first create a temporary view and then view the data via SQL:

In [0]:
# create a view called wiki_edits
wikiEditsDF.createOrReplaceTempView("wiki_edits")

And now we can take a peek at the data with a simple SQL SELECT statement:

In [0]:
%sql

SELECT * FROM wiki_edits 

curr_id,curr_title,n,prev_id,prev_title,type
3632887.0,!!,121,,other-google,other
3632887.0,!!,93,,other-wikipedia,other
3632887.0,!!,46,,other-empty,other
3632887.0,!!,10,,other-other,other
3632887.0,!!,11,64486.0,!_(disambiguation),other
2556962.0,!!!_(album),19,2061699.0,Louden_Up_Now,link
2556962.0,!!!_(album),25,,other-empty,other
2556962.0,!!!_(album),16,,other-google,other
2556962.0,!!!_(album),44,,other-wikipedia,other
2556962.0,!!!_(album),15,64486.0,!_(disambiguation),link


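JSON data can contain embedded (nested) fields, such as the **geocoding** column of the Wikipedia edits snapshot.

When a result contains nested columns, you can expand the fields by clicking the right triangle in each row, and sub-fields can also be referenced directly in SQL.

The clickstream data shown above is flat, so the following SQL statement simply filters on one of its columns: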

In [0]:
%sql
SELECT *
FROM wiki_edits 
WHERE prev_id IS NOT NULL

curr_id,curr_title,n,prev_id,prev_title,type
3632887.0,!!,11,64486,!_(disambiguation),other
2556962.0,!!!_(album),19,2061699,Louden_Up_Now,link
2556962.0,!!!_(album),15,64486,!_(disambiguation),link
2556962.0,!!!_(album),297,600744,!!!,link
6893310.0,!Hero_(album),26,1921683,!Hero,link
22602473.0,!Oka_Tokat,16,8127304,Jericho_Rosales,link
22602473.0,!Oka_Tokat,20,35978874,List_of_telenovelas_of_ABS-CBN,link
22602473.0,!Oka_Tokat,10,7360687,Rica_Peralejo,link
22602473.0,!Oka_Tokat,11,37104582,Jeepney_TV,link
22602473.0,!Oka_Tokat,22,34376590,Oka_Tokat_(2012_TV_series),link


### Review: Reading from JSON w/ InferSchema

While there are similarities between reading CSV and JSON, there are some key differences:
* We only need one job even when inferring the schema.
* There is no header which is why there isn't a second job in this case - the column names are extracted from the JSON object's attributes.
* Unlike CSV which reads in 100% of the data, the JSON reader only samples the data.  
**Note:** In Spark 2.2 the behavior was changed to read in the entire JSON file.
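The fraction of JSON objects used for schema inference can be set with the reader's `samplingRatio` option. The sketch below is illustrative only: it assumes a `SparkSession` passed in as `spark`, and both the helper name `infer_json_schema` and the default ratio of 0.1 are made up. A smaller ratio may miss attributes or types that occur only in records that were not sampled.

```python
def infer_json_schema(spark, path, sampling_ratio=0.1):
    """Infer a schema from roughly `sampling_ratio` of the JSON objects."""
    return (spark.read
        .option("samplingRatio", sampling_ratio)  # Fraction of objects used for inference
        .json(path)                               # Triggers schema inference
        .schema                                   # The inferred StructType
    )
```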

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from JSON w/ User-Defined Schema

To avoid the extra job, we can (just like we did with CSV) specify the schema for the `DataFrame`.

### Step #1 - Create the Schema

Compared to our CSV example, the structure of this data is a little more complex.

Note that a schema can also describe complex (nested) data types, such as the `geocoding` field of the Wikipedia edits data, although the fields used below are all flat.

In [0]:
from pyspark.sql.types import *

jsonSchema = StructType([
  StructField("curr_id", IntegerType(), True),
  StructField("curr_title", StringType(), True),
  StructField("n", IntegerType(), True),
  StructField("prev_id", IntegerType(), True),
  StructField("prev_title", StringType(), True),
  StructField("type", StringType(), True)
])
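As a possible shortcut, the same six fields can be written as a DDL-formatted string, which `DataFrameReader.schema` also accepts (assuming Spark 2.3 or later). This is a sketch rather than the course's method: the names `fields`, `ddl_schema`, and `read_json_with_ddl` are hypothetical, and `spark` is expected to be an existing `SparkSession`.

```python
# The same six fields as jsonSchema, written as (name, DDL type) pairs
fields = [
    ("curr_id", "INT"),
    ("curr_title", "STRING"),
    ("n", "INT"),
    ("prev_id", "INT"),
    ("prev_title", "STRING"),
    ("type", "STRING"),
]

# Backticks quote each column name so that names like `type` parse safely
ddl_schema = ", ".join(f"`{name}` {ddl_type}" for name, ddl_type in fields)
print(ddl_schema)

def read_json_with_ddl(spark, path, ddl=ddl_schema):
    """Read JSON using a DDL-formatted schema string instead of a StructType."""
    return spark.read.schema(ddl).json(path)
```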

That was a lot of typing to get our schema!

For a small file, manually creating the schema may not be worth the effort.

However, for a large file, the time spent manually creating the schema may be worth it to avoid a very long schema-inference process.

### Step #2 - Read in the JSON

Next, we will read in the JSON file and once again print its schema.

In [0]:
(spark.read            # The DataFrameReader
  .schema(jsonSchema)  # Use the specified schema
  .json(jsonFile)      # Creates a DataFrame from JSON after reading in the file
  .printSchema()
)

### Review: Reading from JSON w/ User-Defined Schema
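* Just like CSV, providing the schema avoids the extra job.
* The schema allows us to specify alternate data types. Unlike CSV, the field names must match the JSON attribute names, so columns cannot be renamed through the schema.
* The schema can be arbitrarily complex in its structure.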

Let's take a look at some of the other details of the `DataFrame` we just created for comparison's sake.

In [0]:
jsonDF = (spark.read
  .schema(jsonSchema)
  .json(jsonFile)    
)
print("Partitions: " + str(jsonDF.rdd.getNumPartitions()))
printRecordsPerPartition(jsonDF)
print("-"*80)

And of course we can view that data here:

In [0]:
display(jsonDF)

curr_id,curr_title,n,prev_id,prev_title,type
,!!,,,other-google,other
,!!,,,other-wikipedia,other
,!!,,,other-empty,other
,!!,,,other-other,other
,!!,,,!_(disambiguation),other
,!!!_(album),,,Louden_Up_Now,link
,!!!_(album),,,other-empty,other
,!!!_(album),,,other-google,other
,!!!_(album),,,other-wikipedia,other
,!!!_(album),,,!_(disambiguation),link
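In the output above, `curr_id`, `n`, and `prev_id` are null in every displayed row. With a user-supplied schema, Spark's default `PERMISSIVE` mode sets a field to null when its value cannot be parsed as the declared type, so one plausible cause is that these values are not encoded as JSON integers. The plain-Python sketch below tallies how each attribute is actually encoded in a handful of raw lines. The helper `json_value_types` and the sample lines are hypothetical; on Databricks the lines could come from something like `[r.value for r in spark.read.text(jsonFile).limit(100).collect()]`.

```python
import json
from collections import Counter, defaultdict

def json_value_types(lines):
    """Count the Python type of every attribute value across JSON Lines records."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        for key, value in json.loads(line).items():
            counts[key][type(value).__name__] += 1
    return {key: dict(counter) for key, counter in counts.items()}

# Hypothetical sample: "id" is a number in one record and a string in the other
sample_lines = [
    '{"id": 1, "title": "a"}',
    '{"id": "2", "title": null}',
]
print(json_value_types(sample_lines))
```

If an attribute turns out to be mostly `str` or `float`, declaring it as `StringType` or `DoubleType` in the schema (and casting afterwards) may be a better fit than `IntegerType`.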


## Next steps

Start the next lesson, [Reading Data - Parquet]($./3.Reading%20Data%20-%20Parquet)