# Querying JSON & Hierarchical Data with DataFrames

Apache Spark&trade; and Azure Databricks&reg; make it easy to work with hierarchical data, such as nested JSON records.

### Getting Started

Run the following cell to configure our "classroom."

In [3]:
%run "./Includes/Classroom-Setup"

## Examining the Contents of a JSON file

JSON is a common file format used in big data applications and in data lakes (or large stores of diverse data).  File formats such as JSON arise out of a number of data needs.  For instance, what if:
<br>
* Your schema, or the structure of your data, changes over time?
* You need nested fields like an array with many values or an array of arrays?
* You don't know how you're going to use your data yet, so you don't want to spend time creating relational tables?

The popularity of JSON is largely due to the fact that JSON allows for nested, flexible schemas.

This lesson uses the `DatabricksBlog` table, which is backed by JSON file `dbfs:/mnt/training/databricks-blog.json`. If you examine the raw file, notice it contains compact JSON data. There's a single JSON object on each line of the file; each object corresponds to a row in the table. Each row represents a blog post on the <a href="https://databricks.com/blog" target="_blank">Databricks blog</a>, and the table contains all blog posts through August 9, 2017.
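
As a rough illustration of the one-object-per-line layout described above, the following plain-Python sketch (no Spark involved) parses a small line-delimited JSON string with the standard `json` module. The two sample records and all of their values are invented for illustration. Only the field names `title`, `authors`, `categories`, and `dates` come from this lesson.

```python
import json

# Two hypothetical records in "one JSON object per line" form.
# Field names follow this lesson; the values are invented for illustration.
raw_lines = "\n".join([
    '{"title": "Post A", "authors": ["Ann"], "categories": ["Product"], '
    '"dates": {"createdOn": "2013-01-01", "publishedOn": "2013-01-02", "tz": "UTC"}}',
    '{"title": "Post B", "authors": ["Ann", "Bob"], "categories": ["Product", "Streaming"], '
    '"dates": {"createdOn": "2014-05-01", "publishedOn": "2014-05-03", "tz": "UTC"}}',
])

# Each non-empty line is parsed on its own, so each line becomes one row.
rows = [json.loads(line) for line in raw_lines.splitlines() if line.strip()]

for row in rows:
    print(row["title"], row["authors"], row["dates"]["publishedOn"])
```

Because every line is a complete JSON object, the file can be split on line boundaries and parsed piece by piece, which is what makes this layout convenient for distributed readers.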

In [5]:
%fs head dbfs:/mnt/training/databricks-blog.json

Create a DataFrame using the syntax introduced in the previous lesson:

In [7]:
# The JSON reader infers the schema by default; "inferSchema" and "header" are CSV options and are not needed here.
databricksBlogDF = spark.read.json("/mnt/training/databricks-blog.json")

Take a look at the schema by invoking the `printSchema` method.

In [9]:
databricksBlogDF.printSchema()

Run a query to view the contents of the table.

Notice:
* The `authors` column is an array containing one or more author names.
* The `categories` column is an array of one or more blog post category names.
* The `dates` column contains nested fields `createdOn`, `publishedOn` and `tz`.

In [11]:
display(databricksBlogDF.select("authors","categories","dates","content"))

## Nested Data

Think of nested data as columns within columns. 

For instance, look at the `dates` column.

In [13]:
datesDF = databricksBlogDF.select("dates")
display(datesDF)

Pull out a specific subfield with `.` (object) notation.

In [15]:
display(databricksBlogDF.select("dates.createdOn", "dates.publishedOn"))
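
To make the "columns within columns" idea concrete, here is a minimal plain-Python sketch of what a dotted path such as `dates.publishedOn` means: each segment of the path selects one level of nesting. The helper name `get_path` and the sample row are hypothetical and are not part of Spark or of this lesson's data.

```python
from functools import reduce

# Hypothetical row shaped like the nested `dates` column in this lesson.
row = {
    "title": "Post A",
    "dates": {"createdOn": "2013-01-01", "publishedOn": "2013-01-02", "tz": "UTC"},
}

def get_path(record, path):
    """Follow a dotted path such as 'dates.publishedOn' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), record)

print(get_path(row, "dates.createdOn"))
print(get_path(row, "dates.publishedOn"))
```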

Create a DataFrame, `databricksBlog2DF`, that contains the original columns plus the new `publishedOn` column obtained by flattening the `dates` column.

In [17]:
from pyspark.sql.functions import col
databricksBlog2DF = databricksBlogDF.withColumn("publishedOn",col("dates.publishedOn"))

Apply the `printSchema` method to this DataFrame to check its schema.

In [19]:
databricksBlog2DF.printSchema()

Both `createdOn` and `publishedOn` are stored as strings.

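Reformat those values as dates:

In this case, use a single `select` method to:
0. Format `dates.publishedOn` as a `yyyy-MM-dd` date string (the `date_format` function casts its input to a timestamp before formatting it, and returns a string)
0. "Flatten" the `dates.publishedOn` column to just `publishedOn`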

In [21]:
from pyspark.sql.functions import date_format
display(databricksBlogDF.select("title",date_format("dates.publishedOn","yyyy-MM-dd").alias("publishedOn")))

Re-create the DataFrame `databricksBlog2DF` so that it contains the original columns plus the new `publishedOn` column, this time formatted as a `yyyy-MM-dd` date string.

In [23]:
databricksBlog2DF = databricksBlogDF.withColumn("publishedOn", date_format("dates.publishedOn","yyyy-MM-dd")) 
display(databricksBlog2DF)

Apply the `printSchema` method to this DataFrame to check its schema. Because `date_format` returns a string, `publishedOn` still appears as a `string` column.

In [25]:
databricksBlog2DF.printSchema()

Since `publishedOn` is a formatted string, convert it to a `date` data type in order to query for articles within certain date ranges (such as a list of all articles published in 2013). This is accomplished by using the `to_date` function in Scala or Python.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the Spark documentation on <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a>, for a long list of date-specific functions.

In [27]:
from pyspark.sql.functions import to_date, year, col

# publishedOn was formatted as yyyy-MM-dd above, so parse it with the same pattern.
resultDF = (databricksBlog2DF.select("title", to_date(col("publishedOn"), "yyyy-MM-dd").alias("date"), "link")
  .filter(year(col("date")) == 2013)  # year() returns an integer
  .orderBy(col("date"))
)

display(resultDF)
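
The reason for converting a date string to a real date type before filtering can be sketched in plain Python with the standard `datetime` module. This is only an analogy, not Spark code: the sample posts are invented, the helper `parse_date` is hypothetical, and `%Y-%m-%d` is Python's spelling of the `yyyy-MM-dd` pattern.

```python
from datetime import datetime

# Hypothetical posts; titles and dates are invented for illustration.
posts = [
    {"title": "Post A", "publishedOn": "2013-01-02"},
    {"title": "Post B", "publishedOn": "2014-05-03"},
    {"title": "Post C", "publishedOn": "2013-11-20"},
]

def parse_date(text):
    """Parse a yyyy-MM-dd style string into a datetime.date."""
    return datetime.strptime(text, "%Y-%m-%d").date()

# Convert first, then filter on the year and sort on the real date value.
dated = [{"title": p["title"], "date": parse_date(p["publishedOn"])} for p in posts]
posts_2013 = sorted((p for p in dated if p["date"].year == 2013), key=lambda p: p["date"])

for p in posts_2013:
    print(p["date"], p["title"])
```

Once the value is a date rather than a string, extracting the year or comparing against another date no longer depends on how the text happened to be formatted.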

## Array Data

The DataFrame also contains array columns. 

Easily determine the size of each array using the built-in `size(..)` function with array columns.

In [29]:
from pyspark.sql.functions import size
display(databricksBlogDF.select(size("authors"),"authors"))

Pull the first element from the array `authors` using an array subscript operator.

For example, in Scala, the 0th element of array `authors` is `authors(0)`
whereas, in Python, the 0th element of `authors` is `authors[0]`.

In [31]:
display(databricksBlogDF.select(col("authors")[0].alias("primaryAuthor")))

### Explode

The `explode` function allows you to split an array column into multiple rows, copying all the other columns into each new row.

For example, split the column `authors` into the column `author`, with one author per row.

In [33]:
from pyspark.sql.functions import explode
display(databricksBlogDF.select("title","authors",explode(col("authors")).alias("author"), "link"))
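
The row-multiplying behavior of `explode` can be sketched in plain Python. This is only an analogy for the semantics described above, not how Spark implements it. The generator `explode_rows` and the sample posts are hypothetical.

```python
# Hypothetical posts; titles and author names are invented for illustration.
posts = [
    {"title": "Post A", "authors": ["Ann"]},
    {"title": "Post B", "authors": ["Ann", "Bob"]},
]

def explode_rows(rows, array_key, new_key):
    """Yield one output row per array element, copying the other columns."""
    for row in rows:
        for item in row[array_key]:
            yield {**row, new_key: item}

exploded = list(explode_rows(posts, "authors", "author"))

for row in exploded:
    print(row["title"], row["authors"], row["author"])
```

A post with two authors produces two output rows that share the same `title` and `authors` values and differ only in `author`.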

The effect is more obvious when you restrict the output to articles that have multiple authors and then sort by title.

In [35]:
databricksBlog2DF = (databricksBlogDF 
  .select("title","authors",explode(col("authors")).alias("author"), "link") 
  .filter(size(col("authors")) > 1) 
  .orderBy("title")
)

display(databricksBlog2DF)

## Exercise 1

Identify all the articles written or co-written by Michael Armbrust.

### Step 1

Starting with the `databricksBlogDF` DataFrame, create a DataFrame called `articlesByMichaelDF` where:
0. Michael Armbrust is the author.
0. The data set contains the column `title` (it may contain others).
0. It contains only one record per article.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the Spark documentation on <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a>.  

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Include the column `authors` in your view to help you debug your solution.

In [38]:
# TODO
from pyspark.sql.functions import array_contains
articlesByMichaelDF = # FILL_IN

In [39]:
# TEST - Run this cell to test your solution.

from pyspark.sql import Row

resultsCount = articlesByMichaelDF.count()
dbTest("DF-L5-articlesByMichael-count", 3, resultsCount)  

results = articlesByMichaelDF.collect()

dbTest("DF-L5-articlesByMichael-0", Row(title=u'Spark SQL: Manipulating Structured Data Using Apache Spark'), results[0])
dbTest("DF-L5-articlesByMichael-1", Row(title=u'Exciting Performance Improvements on the Horizon for Spark SQL'), results[1])
dbTest("DF-L5-articlesByMichael-2", Row(title=u'Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform'), results[2])

print("Tests passed!")

### Step 2
Show the list of Michael Armbrust's articles in HTML format.

## Exercise 2

Identify the complete set of categories used in the Databricks blog articles.

### Step 1

Starting with the `databricksBlogDF` DataFrame, create another DataFrame called `uniqueCategoriesDF` where:
0. The data set contains a single column, `category` (and no others).
0. This list of categories should be unique.

In [43]:
# TODO
uniqueCategoriesDF = # FILL_IN

In [44]:
# TEST - Run this cell to test your solution.

resultsCount =  uniqueCategoriesDF.count()

dbTest("DF-L5-uniqueCategories-count", 12, resultsCount)

results = uniqueCategoriesDF.collect()

dbTest("DF-L5-uniqueCategories-0", Row(category=u'Announcements'), results[0])
dbTest("DF-L5-uniqueCategories-1", Row(category=u'Apache Spark'), results[1])
dbTest("DF-L5-uniqueCategories-2", Row(category=u'Company Blog'), results[2])

dbTest("DF-L5-uniqueCategories-9", Row(category=u'Platform'), results[9])
dbTest("DF-L5-uniqueCategories-10", Row(category=u'Product'), results[10])
dbTest("DF-L5-uniqueCategories-11", Row(category=u'Streaming'), results[11])

print("Tests passed!")

### Step 2
Show the complete list of categories.

In [46]:
# TODO

FILL_IN

## Exercise 3

Count how many times each category is referenced in the Databricks blog.

### Step 1

Starting with the `databricksBlogDF` DataFrame, create another DataFrame called `totalArticlesByCategoryDF` where:
0. The new DataFrame contains two columns, `category` and `total`.
0. The `category` column is a single, distinct category (similar to the last exercise).
0. The `total` column is the total number of articles in that category.
0. Order by `category`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Because articles can be tagged with multiple categories, the sum of the totals adds up to more than the total number of articles.

In [49]:
# TODO

from pyspark.sql.functions import count
totalArticlesByCategoryDF = # FILL_IN

In [50]:
# TEST - Run this cell to test your solution.

results = totalArticlesByCategoryDF.count()

dbTest("DF-L5-articlesByCategory-count", 12, results)

print("Tests passed!")

In [51]:
# TEST - Run this cell to test your solution.

results = totalArticlesByCategoryDF.collect()

dbTest("DF-L5-articlesByCategory-0", Row(category=u'Announcements', total=72), results[0])
dbTest("DF-L5-articlesByCategory-1", Row(category=u'Apache Spark', total=132), results[1])
dbTest("DF-L5-articlesByCategory-2", Row(category=u'Company Blog', total=224), results[2])

dbTest("DF-L5-articlesByCategory-9", Row(category=u'Platform', total=4), results[9])
dbTest("DF-L5-articlesByCategory-10", Row(category=u'Product', total=83), results[10])
dbTest("DF-L5-articlesByCategory-11", Row(category=u'Streaming', total=21), results[11])

print("Tests passed!")

### Step 2
Display the totals for each category in HTML format (ordered by `category`).

In [53]:
# TODO

FILL_IN

## Summary

* Spark DataFrames allow you to query and manipulate structured and semi-structured data.
* The built-in functions for Spark DataFrames provide powerful primitives for querying complex schemas.

## Review Questions
**Q:** What is the syntax for accessing nested columns?  
**A:** Use the dot notation:
`select("dates.publishedOn")`

**Q:** What is the syntax for accessing the first element in an array?  
**A:** Use the [subscript] notation:
`select(col("authors")[0])`

**Q:** What is the syntax for expanding an array into multiple rows?  
**A:** Use the `explode` function:  `select(explode(col("authors")).alias("Author"))`

## Next Steps

Start the next lesson, [Querying Data Lakes with DataFrames]($./06-Data-Lakes).

## Additional Topics & Resources

* <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">Spark SQL, DataFrames and Datasets Guide</a>