<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../resources/logo.png" alt="Intellinum Bootcamp" style="width: 400px; height: 200px">
</div>

# Querying JSON & Hierarchical Data with DataFrames

Apache Spark&trade; makes it easy to work with hierarchical data, such as nested JSON records.

## In this lesson you:
* Use DataFrames to query JSON data.
* Query nested structured data.
* Query data containing array columns.


In [None]:
#MODE = "LOCAL"
MODE = "CLUSTER"

import sys
import os
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel
from matplotlib import interactive
interactive(True)
import matplotlib.pyplot as plt
%matplotlib inline
import json
import math
import numbers
import numpy as np
import plotly
plotly.offline.init_notebook_mode(connected=True)

sys.path.insert(0,'../../src')
from settings import *

# Download the packaged Python environment only if it is not already available locally.
if not os.path.exists('../../libs/pyspark24_py36.zip'):
    !AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY} AWS_SECRET_ACCESS_KEY={AWS_SECRET_KEY} aws s3 cp s3://yuan.intellinum.co/bins/pyspark24_py36.zip ../../libs/pyspark24_py36.zip

try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession detected")
    print("Creating a new SparkSession")

SPARK_DRIVER_MEMORY= "1G"
SPARK_DRIVER_CORE = "1"
SPARK_EXECUTOR_MEMORY= "1G"
SPARK_EXECUTOR_CORE = "1"
SPARK_EXECUTOR_INSTANCES = 6



conf = None
if MODE == "LOCAL":
    os.environ["PYSPARK_PYTHON"] = "/home/yuan/anaconda3/envs/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_day03_querying_json").\
            setMaster('local[*]').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars', '../../libs/mysql-connector-java-5.1.45-bin.jar').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1')
else:
    os.environ["PYSPARK_PYTHON"] = "./MN/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_day03_querying_json").\
            setMaster('yarn-client').\
            set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
            set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
            set('spark.driver.cores', SPARK_DRIVER_CORE).\
            set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
            set("spark.executor.instances", SPARK_EXECUTOR_INSTANCES).\
            set('spark.sql.files.ignoreCorruptFiles', 'true').\
            set('spark.yarn.dist.archives', '../../libs/pyspark24_py36.zip#MN').\
            set('spark.sql.shuffle.partitions', '5000').\
            set('spark.default.parallelism', '5000').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1').\
            set('spark.jars', 's3://yuan.intellinum.co/bins/mysql-connector-java-5.1.45-bin.jar')
        

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()


sc = spark.sparkContext

sc.addPyFile('../../src/settings.py')

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", AWS_ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", AWS_SECRET_KEY)
hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

def display(df, limit=10):
    return df.limit(limit).toPandas()

def dfTest(id, expected, result):
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

## Examining the Contents of a JSON file

JSON is a common file format used in big data applications and in data lakes (or large stores of diverse data).  File formats such as JSON arise out of a number of data needs.  For instance, what if:
<br>
* Your schema, or the structure of your data, changes over time?
* You need nested fields like an array with many values or an array of arrays?
* You don't know how you're going to use your data yet, so you don't want to spend time creating relational tables?

The popularity of JSON is largely due to the fact that JSON allows for nested, flexible schemas.

This lesson uses the file `s3://data.intellinum.co/bootcamp/common/blog.json`. If you examine the raw file, you will notice that it contains compact JSON data: there is a single JSON object on each line of the file, and each object corresponds to a row in the table. Each row represents a blog post, and the JSON file contains all blog posts through August 9, 2017.
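
As a rough illustration of the one-object-per-line layout described above, here is a minimal plain-Python sketch (not Spark code). The two sample lines are hypothetical and are not taken from `blog.json`; only the field names `title`, `authors` and `dates` follow the columns used in this lesson.

```python
import json

# Hypothetical two-line sample in the same one-object-per-line layout;
# the values are made up and are not taken from blog.json.
sample = "\n".join([
    '{"title": "First post", "authors": ["A. Author"], "dates": {"publishedOn": "2014-01-02"}}',
    '{"title": "Second post", "authors": ["A. Author", "B. Writer"], "dates": {"publishedOn": "2015-03-04"}}',
])

# Each non-empty line is parsed independently into one record (one "row").
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

for record in records:
    print(record["title"], len(record["authors"]), record["dates"]["publishedOn"])
```

Because every line is a complete JSON object, a reader can split the file on line boundaries and parse the pieces independently.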

In [None]:
!AWS_ACCESS_KEY_ID={AWS_ACCESS_KEY} AWS_SECRET_ACCESS_KEY={AWS_SECRET_KEY} aws s3 cp s3://data.intellinum.co/bootcamp/common/blog.json - | head -n 2
    

Create a DataFrame using the syntax introduced in the previous lesson:

In [None]:
# inferSchema and header are CSV reader options; the JSON reader infers the schema by default.
blogDF = spark.read.json("s3a://data.intellinum.co/bootcamp/common/blog.json")

Take a look at the schema by invoking the `printSchema` method.

In [None]:
blogDF.printSchema()

Run a query to view the contents of the table.

Notice:
* The `authors` column is an array containing one or more author names.
* The `categories` column is an array of one or more blog post category names.
* The `dates` column contains nested fields `createdOn`, `publishedOn` and `tz`.

In [None]:
display(blogDF.select("authors","categories","dates","content"))

## Nested Data

Think of nested data as columns within columns. 

For instance, look at the `dates` column.

In [None]:
datesDF = blogDF.select("dates")
display(datesDF)

Pull out a specific subfield with `.` (object) notation.
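
To make the dot notation concrete, the following plain-Python sketch (an analogy, not Spark code) resolves a dotted path against a nested record. The record values and the helper `get_path` are hypothetical and exist only for illustration.

```python
from functools import reduce

# Hypothetical record mirroring the nested `dates` column; values are made up.
record = {
    "title": "Example post",
    "dates": {"createdOn": "2014-01-01", "publishedOn": "2014-01-02", "tz": "UTC"},
}

def get_path(rec, path):
    """Follow a dotted path such as 'dates.publishedOn' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), rec)

print(get_path(record, "dates.createdOn"))
print(get_path(record, "dates.publishedOn"))
```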

In [None]:
display(blogDF.select("dates.createdOn", "dates.publishedOn"))

Create a DataFrame, `blog2DF`, that contains the original columns plus the new `publishedOn` column obtained
by flattening the `dates` column.

In [None]:
from pyspark.sql.functions import col
blog2DF = blogDF.withColumn("publishedOn",col("dates.publishedOn"))

Apply the `printSchema` method to this DataFrame to check its schema.

In [None]:
blog2DF.printSchema()

Both `createdOn` and `publishedOn` are stored as strings.

Reformat those values as `yyyy-MM-dd` date strings.

In this case, use a single `select` method to:
0. Format `dates.publishedOn` with `date_format`, which implicitly casts the string to a timestamp and returns a `yyyy-MM-dd` string
0. "Flatten" the `dates.publishedOn` column to just `publishedOn`

In [None]:
from pyspark.sql.functions import date_format
display(blogDF.select("title",date_format("dates.publishedOn","yyyy-MM-dd").alias("publishedOn")))

Recreate `blog2DF` so that it contains the original columns plus the new `publishedOn` column, this time
formatted with `date_format`.

In [None]:
blog2DF = blogDF.withColumn("publishedOn", date_format("dates.publishedOn","yyyy-MM-dd")) 
display(blog2DF)

Apply the `printSchema` method to this DataFrame to check its schema and confirm that `publishedOn` is a top-level column of type string.

In [None]:
blog2DF.printSchema()

Because `date_format` returns a string, `publishedOn` should be converted to a `date` data type so that date functions such as `year` and `<` and `>`-type comparisons behave reliably when querying for articles within certain date ranges (such as a list of all articles published in 2013). This is accomplished by using the `to_date` function.

See the Spark documentation on <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a>, for a long list of date-specific functions.
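
As a plain-Python analogue of this step (not Spark code), the sketch below parses `yyyy-MM-dd` strings into real date values and then filters and sorts them by date range. The post titles and dates are hypothetical.

```python
from datetime import date, datetime

# Hypothetical (title, publishedOn) pairs with yyyy-MM-dd strings; values are made up.
posts = [
    ("Post A", "2012-11-30"),
    ("Post B", "2013-06-15"),
    ("Post C", "2013-01-05"),
    ("Post D", "2014-02-01"),
]

# Parse each string into a real date value (the role to_date plays in Spark).
parsed = [(title, datetime.strptime(day, "%Y-%m-%d").date()) for title, day in posts]

# Keep 2013 only, using range comparisons on date values, then sort by date.
in_2013 = sorted(
    (p for p in parsed if date(2013, 1, 1) <= p[1] < date(2014, 1, 1)),
    key=lambda p: p[1],
)
print(in_2013)
```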

In [None]:
from pyspark.sql.functions import to_date, year, col
          
# publishedOn was formatted as yyyy-MM-dd above, so parse it with the same pattern.
resultDF = (blog2DF.select("title", to_date(col("publishedOn"), "yyyy-MM-dd").alias("date"), "link")
  .filter(year(col("date")) == 2013)
  .orderBy(col("date"))
)

display(resultDF)

## Array Data

The DataFrame also contains array columns. 

Easily determine the size of each array using the built-in `size(..)` function with array columns.

In [None]:
from pyspark.sql.functions import size
display(blogDF.select(size("authors"),"authors"))

Pull the first element from the array `authors` using an array subscript operator.

For example, in Scala, the 0th element of array `authors` is `authors(0)`
whereas, in Python, the 0th element of `authors` is `authors[0]`.

In [None]:
display(blogDF.select(col("authors")[0].alias("primaryAuthor")))

### Explode

The `explode` method allows you to split an array column into multiple rows, copying all the other columns into each new row. 

For example, split the column `authors` into the column `author`, with one author per row.
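
The row-copying behavior can be sketched in plain Python (an analogy, not Spark code). The rows and the helper `explode_rows` are hypothetical and exist only for illustration.

```python
# Hypothetical rows with an array-valued `authors` field; values are made up.
rows = [
    {"title": "Solo post", "authors": ["A. Author"]},
    {"title": "Joint post", "authors": ["A. Author", "B. Writer"]},
]

def explode_rows(rows, array_key, new_key):
    """Yield one copy of each row per array element, adding the element as new_key."""
    for row in rows:
        for element in row[array_key]:
            yield {**row, new_key: element}

exploded = list(explode_rows(rows, "authors", "author"))
for row in exploded:
    print(row["title"], "|", row["author"])
```

In this sketch a row with an empty list produces no output rows, which matches how Spark's `explode` treats empty arrays.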

In [None]:
from pyspark.sql.functions import explode
display(blogDF.select("title","authors",explode(col("authors")).alias("author"), "link"))

The effect is more obvious when you restrict the output to articles that have multiple authors and then sort by title.

In [None]:
blog2DF = (blogDF 
  .select("title","authors",explode(col("authors")).alias("author"), "link") 
  .filter(size(col("authors")) > 1) 
  .orderBy("title")
)

display(blog2DF)

## Exercise 1

Identify all the articles written or co-written by Michael Armbrust.

### Step 1

Starting with the `blogDF` DataFrame, create a DataFrame called `articlesByMichaelDF` where:
0. Michael Armbrust is the author.
0. The data set contains the column `title` (it may contain others).
0. It contains only one record per article.

**Hint:** See the Spark documentation on <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a>.  

**Hint:** Include the column `authors` in your view to help you debug your solution.

In [None]:
# TODO
articlesByMichaelDF = ## FILL_IN

In [None]:
# TEST - Run this cell to test your solution.

from pyspark.sql import Row

resultsCount = articlesByMichaelDF.count()
dfTest("DF-L5-articlesByMichael-count", 3, resultsCount)  

results = articlesByMichaelDF.collect()

dfTest("DF-L5-articlesByMichael-0", Row(title=u'Spark SQL: Manipulating Structured Data Using Apache Spark'), results[0])
dfTest("DF-L5-articlesByMichael-1", Row(title=u'Exciting Performance Improvements on the Horizon for Spark SQL'), results[1])
dfTest("DF-L5-articlesByMichael-2", Row(title=u'Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform'), results[2])

print("Tests passed!")

### Step 2
Show the list of Michael Armbrust's articles in HTML format.

In [None]:
# TODO


## Exercise 2

Identify the complete set of categories used in the blog articles.

### Step 1

Starting with the `blogDF` DataFrame, create another DataFrame called `uniqueCategoriesDF` where:
0. The data set contains the one column `category` (and no others).
0. This list of categories should be unique.

In [None]:
# TODO

In [None]:
# TEST - Run this cell to test your solution.

resultsCount =  uniqueCategoriesDF.count()

dfTest("DF-L5-uniqueCategories-count", 12, resultsCount)

results = uniqueCategoriesDF.collect()

dfTest("DF-L5-uniqueCategories-0", Row(category=u'Announcements'), results[0])
dfTest("DF-L5-uniqueCategories-1", Row(category=u'Apache Spark'), results[1])
dfTest("DF-L5-uniqueCategories-2", Row(category=u'Company Blog'), results[2])

dfTest("DF-L5-uniqueCategories-9", Row(category=u'Platform'), results[9])
dfTest("DF-L5-uniqueCategories-10", Row(category=u'Product'), results[10])
dfTest("DF-L5-uniqueCategories-11", Row(category=u'Streaming'), results[11])

print("Tests passed!")

### Step 2
Show the complete list of categories.

In [None]:
# TODO

## Exercise 3

Count how many times each category is referenced in the blog.

### Step 1

Starting with the `blogDF` DataFrame, create another DataFrame called `totalArticlesByCategoryDF` where:
0. The new DataFrame contains two columns, `category` and `total`.
0. The `category` column is a single, distinct category (similar to the last exercise).
0. The `total` column is the total number of articles in that category.
0. Order by `category`.

Because articles can be tagged with multiple categories, the sum of the totals adds up to more than the total number of articles.

In [None]:
# TODO

In [None]:
# TEST - Run this cell to test your solution.

results = totalArticlesByCategoryDF.count()

dfTest("DF-L5-articlesByCategory-count", 12, results)

print("Tests passed!")

In [None]:
# TEST - Run this cell to test your solution.

results = totalArticlesByCategoryDF.collect()

dfTest("DF-L5-articlesByCategory-0", Row(category=u'Announcements', total=72), results[0])
dfTest("DF-L5-articlesByCategory-1", Row(category=u'Apache Spark', total=132), results[1])
dfTest("DF-L5-articlesByCategory-2", Row(category=u'Company Blog', total=224), results[2])

dfTest("DF-L5-articlesByCategory-9", Row(category=u'Platform', total=4), results[9])
dfTest("DF-L5-articlesByCategory-10", Row(category=u'Product', total=83), results[10])
dfTest("DF-L5-articlesByCategory-11", Row(category=u'Streaming', total=21), results[11])

print("Tests passed!")

### Step 2
Display the totals of each category in HTML format (should be ordered by `category`).

In [None]:
# TODO

## Summary

* Spark DataFrames allow you to query and manipulate structured and semi-structured data.
* The built-in functions for Spark DataFrames provide powerful primitives for querying complex schemas.

## Review Questions
**Q:** What is the syntax for accessing nested columns?  
**A:** Use the dot notation:
`select("dates.publishedOn")`

**Q:** What is the syntax for accessing the first element in an array?  
**A:** Use the [subscript] notation: 
`select(col("authors")[0])`

**Q:** What is the syntax for expanding an array into multiple rows?  
**A:** Use the explode method:  `select(explode(col("authors")).alias("Author"))`

## Additional Topics & Resources

* <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">Spark SQL, DataFrames and Datasets Guide</a>


&copy; 2019 [Intellinum Analytics, Inc](http://www.intellinum.co). All rights reserved.<br/>