
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Applying Schemas to JSON Data

Apache Spark&trade; and Databricks&reg; provide a number of ways to project structure onto semi-structured data, allowing for quick and easy access.

## In this lesson you:
* Infer the schema from JSON files
* Create and use a user-defined schema with primitive data types
* Use non-primitive data types such as `ArrayType` and `MapType` in a schema

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Software Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/dataframes" target="_blank">DataFrames course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/xninybx2e2?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/xninybx2e2?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Schemas

Schemas are at the heart of data structures in Spark.
**A schema describes the structure of your data by naming columns and declaring the type of data in that column.** 
Rigorously enforcing schemas leads to significant performance optimizations and reliability of code.

Why is open source Spark so fast, and why is [Databricks Runtime even faster?](https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html) While there are many reasons for these performance improvements, two key reasons are:<br><br>
* First and foremost, Spark processes data primarily in memory rather than repeatedly reading from and writing to disk.
* Second, using DataFrames allows Spark to optimize the execution of your queries because it knows what your data looks like.

Two pillars of computer science education are data structures (the organization and storage of data) and algorithms (the computational procedures on that data).  A rigorous understanding of computer science involves both of these domains. When you apply the most relevant data structures, the algorithms that carry out the computation become significantly more elegant.

In the road map for ETL, this is the **Apply Schema** step:

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-2.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### Schemas with Semi-Structured JSON Data

**Tabular data**, such as that found in CSV files or relational databases, has a formal structure where each observation, or row, of the data has a value (even if it's a NULL value) for each feature, or column, in the data set.  

**Semi-structured data** does not need to conform to a formal data model. Instead, a given feature may appear zero, once, or many times for a given observation.  

Semi-structured data storage works well with hierarchical data and with schemas that may evolve over time.  One of the most common forms of semi-structured data is JSON data, which consists of attribute-value pairs.
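
To make the contrast with tabular data concrete, here is a minimal sketch using only Python's standard `json` module. The three records and their field names (`name`, `phone`) are invented for illustration; they show a feature appearing zero, one, or many times.

```python
import json

# Hypothetical records: "phone" appears zero, one, and many times.
rawLines = [
    '{"name": "Ana"}',
    '{"name": "Ben", "phone": "555-0100"}',
    '{"name": "Cy", "phone": ["555-0101", "555-0102"]}',
]

def phones(record):
    """Normalize the optional, possibly repeated "phone" feature to a list."""
    value = record.get("phone", [])
    return value if isinstance(value, list) else [value]

records = [json.loads(line) for line in rawLines]
phoneCounts = {r["name"]: len(phones(r)) for r in records}
print(phoneCounts)  # {'Ana': 0, 'Ben': 1, 'Cy': 2}
```

A tabular format would need a fixed `phone` column (or several) for every row; here each record carries only what it has.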

<iframe  
src="//fast.wistia.net/embed/iframe/4e7wshp1ax?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/4e7wshp1ax?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [8]:
%run "./Includes/Classroom-Setup"

Print the first few lines of a JSON file holding ZIP Code data.

In [10]:
%fs head /mnt/training/zips.json

### Schema Inference

Import data as a DataFrame and view its schema with the `printSchema()` DataFrame method.

In [12]:
zipsDF = spark.read.json("/mnt/training/zips.json")
zipsDF.printSchema()

Store the schema as an object by calling `.schema` on a DataFrame. Schemas consist of a `StructType`, which is a collection of `StructField`s.  Each `StructField` gives a name and a type for a given field in the data.

In [14]:
zipsSchema = zipsDF.schema
print(type(zipsSchema))

[field for field in zipsSchema]

### User-Defined Schemas

Spark infers schemas from the data, as detailed in the example above.  Reasons to provide your own schema instead include:  
<br>
* Schema inference means Spark scans all of your data, creating an extra job, which can affect performance
* You may want to provide alternative data types (for example, change a `Long` to an `Integer`)
* You may want to drop certain fields so that Spark reads only the data of interest

To define schemas, build a `StructType` composed of `StructField`s.

<iframe  
src="//fast.wistia.net/embed/iframe/jizz3og20l?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/jizz3og20l?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Import the necessary types from the `types` module. Build a `StructType`, which takes a list of `StructField`s.  Each `StructField` takes three arguments: the name of the field, the type of data in it, and a `Boolean` for whether this field can be `Null`.

In [18]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

zipsSchema2 = StructType([
  StructField("city", StringType(), True), 
  StructField("pop", IntegerType(), True)
])

Apply the schema using the `.schema()` method. This `read` returns only the columns specified in the schema and changes the column `pop` from `LongType` (which was inferred above) to `IntegerType`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> A `LongType` is an 8-byte integer ranging up to 9,223,372,036,854,775,807 while `IntegerType` is a 4-byte integer ranging up to 2,147,483,647.  Since no American city has over two billion people, `IntegerType` is sufficient.

In [20]:
zipsDF2 = (spark.read
  .schema(zipsSchema2)
  .json("/mnt/training/zips.json")
)

display(zipsDF2)

city,pop
AGAWAM,15338
CUSHMAN,36963
BARRE,4546
BELCHERTOWN,10579
BLANDFORD,1240
BRIMFIELD,3706
CHESTER,1688
CHESTERFIELD,177
CHICOPEE,23396
CHICOPEE,31495
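
The ranges quoted in the side note above follow directly from the byte widths. Here is a quick check in plain Python (no Spark required); the helper name `maxSigned` is made up for this sketch.

```python
def maxSigned(numBytes):
    """Largest value of a signed two's-complement integer of the given width."""
    return 2 ** (8 * numBytes - 1) - 1

intMax = maxSigned(4)    # IntegerType: 4 bytes
longMax = maxSigned(8)   # LongType: 8 bytes

print(f"{intMax:,}")     # 2,147,483,647
print(f"{longMax:,}")    # 9,223,372,036,854,775,807
```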


### Primitive and Non-primitive Types

The Spark [`types` package](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types) provides the building blocks for constructing schemas.

A primitive type contains the data itself.  The most common primitive types include:

| Numeric | General | Time |
|-----|-----|-----|
| `FloatType` | `StringType` | `TimestampType` | 
| `IntegerType` | `BooleanType` | `DateType` | 
| `DoubleType` | `NullType` | |
| `LongType` | | |
| `ShortType` |  | |

Non-primitive types are sometimes called reference variables or composite types.  Technically, non-primitive types contain references to memory locations and not the data itself.  Non-primitive types are the composite of a number of primitive types such as an Array of the primitive type `Integer`.

The two most common composite types are `ArrayType` and `MapType`. These types allow for a given field to contain an arbitrary number of elements in either an Array/List or Map/Dictionary form.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the [Spark documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types) for a complete picture of types in Spark.

<iframe  
src="//fast.wistia.net/embed/iframe/qk2is6llgl?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/qk2is6llgl?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

The ZIP Code dataset contains an array with the latitude and longitude of the cities.  Use an `ArrayType`, which takes the data type of its elements as an argument (here, the primitive type `FloatType`).

In [24]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, FloatType

zipsSchema3 = StructType([
  StructField("city", StringType(), True), 
  StructField("loc", 
    ArrayType(FloatType(), True), True),
  StructField("pop", IntegerType(), True)
])

Apply the schema using the `.schema()` method and observe the results.  Expand the array values in the column `loc` to explore further.

In [26]:
zipsDF3 = (spark.read
  .schema(zipsSchema3)
  .json("/mnt/training/zips.json")
)
display(zipsDF3)

city,loc,pop
AGAWAM,"List(-72.62274, 42.070206)",15338
CUSHMAN,"List(-72.51565, 42.377018)",36963
BARRE,"List(-72.10835, 42.4097)",4546
BELCHERTOWN,"List(-72.41095, 42.275105)",10579
BLANDFORD,"List(-72.93611, 42.18295)",1240
BRIMFIELD,"List(-72.18845, 42.116543)",3706
CHESTER,"List(-72.98876, 42.279423)",1688
CHESTERFIELD,"List(-72.833305, 42.38167)",177
CHICOPEE,"List(-72.60796, 42.162045)",23396
CHICOPEE,"List(-72.57614, 42.17644)",31495


## Exercise 1: Exploring JSON Data

<a href="https://archive.ics.uci.edu/ml/datasets/UbiqLog+(smartphone+lifelogging)">Smartphone data from UCI Machine Learning Repository</a> is available under `/mnt/training/UbiqLog4UCI`. This is log data from the open source project [Ubiqlog](https://github.com/Rezar/Ubiqlog).

Import this data and define your own schema.

### Step 1: Import the Data

Import data from `/mnt/training/UbiqLog4UCI/14_F/log*`. (These are the log files from a given user.)

Look at the head of one file from the data set.  Use `/mnt/training/UbiqLog4UCI/14_F/log_1-6-2014.txt`.

In [30]:
%fs head  /mnt/training/UbiqLog4UCI/14_F/log_1-6-2014.txt

Read the data and save it to `smartphoneDF`. Read the logs using a `*` in your path like `/mnt/training/UbiqLog4UCI/14_F/log*`.

In [32]:
# TODO
smartphoneDF = spark.read.json("/mnt/training/UbiqLog4UCI/14_F/log*")

In [33]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import desc

cols = set(smartphoneDF.columns)
sample = smartphoneDF.orderBy(desc("Application")).first()[0][0]

dbTest("ET1-P-05-01-01", 25372, smartphoneDF.count())
dbTest("ET1-P-05-01-02", '12-9-2013 21:30:02', sample)

dbTest("ET1-P-05-01-03", True, "Location" in cols)
dbTest("ET1-P-05-01-04", True, "SMS" in cols)
dbTest("ET1-P-05-01-05", True, "WiFi" in cols)
dbTest("ET1-P-05-01-06", True, "_corrupt_record" in cols)
dbTest("ET1-P-05-01-07", True, "Application" in cols)
dbTest("ET1-P-05-01-08", True, "Call" in cols)
dbTest("ET1-P-05-01-09", True, "Bluetooth" in cols)

print("Tests passed!")

### Step 2: Explore the Inferred Schema

Print the schema to get a sense for the data.

In [35]:
# TODO
smartphoneDF.printSchema()

The schema shows:  

* Six categories of tracked data 
* Nested data structures
* A field showing corrupt records

## Exercise 2: Creating a User Defined Schema

### Step 1: Set Up Your Workflow

Often the hardest part of a coding challenge is setting up a workflow to get continuous feedback on what you develop.

Start with the import statements you need, including functions from two main packages:

| Package | Function |
|---------|---------|
| `pyspark.sql.types` | `StructType`, `StructField`, `StringType` |
| `pyspark.sql.functions` | `col` |

In [39]:
# TODO
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col

The **SMS** field needs to be parsed. Create a placeholder schema called `schema` that's a `StructType` with one `StructField` named **SMS** of type `StringType`. This imports the entire attribute (even though it contains nested entities) as a String.  

This is a way to get a sense for what's in the data and make a progressively more complex schema.

In [41]:
# TODO
schema = StructType([
  StructField("SMS",StringType(),True)
])

In [42]:
# TEST - Run this cell to test your solution
fields = schema.fields

dbTest("ET1-P-05-02-01", 1, len(fields))
dbTest("ET1-P-05-02-02", 'SMS', fields[0].name)

print("Tests passed!")

Apply the schema to the data and save the result as `SMSDF`. This closes the feedback loop, so you can iterate and develop an increasingly complex schema. The path to the data is `/mnt/training/UbiqLog4UCI/14_F/log*`. 

Include only records where the column `SMS` is not `Null`.

In [44]:
SMSDF = (spark.read
        .schema(schema)
        .json("/mnt/training/UbiqLog4UCI/14_F/log*")
        .dropna()
        )
display(SMSDF)

SMS
"{""Address"":""+98214428####"",""type"":""1"",""date"":""1-10-2014 11:30:05"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""""}}"
"{""Address"":""+985000406500####"",""type"":""1"",""date"":""1-10-2014 11:32:01"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""""}}"
"{""Address"":""+98214428####"",""type"":""1"",""date"":""1-10-2014 11:30:05"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""""}}"
"{""Address"":""+98939283####"",""type"":""1"",""date"":""1-9-2014 23:54:31"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""bahram""}}"
"{""Address"":""+98214428####"",""type"":""1"",""date"":""1-10-2014 12:15:19"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""""}}"
"{""Address"":""+98939283####"",""type"":""1"",""date"":""1-9-2014 23:54:31"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""bahram""}}"
"{""Address"":""+98935566####"",""type"":""2"",""date"":""1-10-2014 12:35:00"",""body"":""ANONYMIZED"",""Type"":""2"",""metadata"":{""name"":""u Kh sevda""}}"
"{""Address"":""+98214428####"",""type"":""1"",""date"":""1-10-2014 12:39:20"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""""}}"
"{""Address"":""+981000721670####"",""type"":""1"",""date"":""1-10-2014 12:41:57"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""""}}"
"{""Address"":""+98935566####"",""type"":""1"",""date"":""1-10-2014 12:45:28"",""body"":""ANONYMIZED"",""Type"":""1"",""metadata"":{""name"":""u Kh sevda""}}"


In [45]:
# TEST - Run this cell to test your solution
cols = SMSDF.columns

dbTest("ET1-P-05-03-01", 1147, SMSDF.count())
dbTest("ET1-P-05-03-02", ['SMS'], cols)

print("Tests passed!")
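
Because the placeholder schema keeps each **SMS** entry as a raw JSON string, one value can be inspected with Python's standard `json` module before writing the nested schema. This is a minimal sketch; the string is copied from the displayed output above, and the variable names are chosen for illustration.

```python
import json

# One SMS value copied from the displayed output.
smsString = ('{"Address":"+98939283####","type":"1","date":"1-9-2014 23:54:31",'
             '"body":"ANONYMIZED","Type":"1","metadata":{"name":"bahram"}}')

sms = json.loads(smsString)

# Top-level keys are candidate StructFields; a dict value signals a nested StructType.
for key, value in sms.items():
    print(key, type(value).__name__)

nestedKeys = [key for key, value in sms.items() if isinstance(value, dict)]
print(nestedKeys)  # ['metadata']
```

Listing the keys this way shows which attributes are plain strings and which need their own nested `StructType`.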

### Step 2: Create the Full Schema for SMS

Define the schema for the following fields in the `StructType` `SMS` and name it `schema2`.  Apply it to a new DataFrame `SMSDF2`:  
<br>
* `Address`
* `date`
* `metadata`
  * `name`
 
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Note there are both `Type` and `type` fields, which appear to hold redundant data.

In [47]:
from pyspark.sql.types import StructType, StructField, StringType

# SMS is a struct; metadata is a struct nested inside it.
schema2 = StructType([
  StructField("SMS", StructType([
    StructField("Address", StringType(), True),
    StructField("date", StringType(), True),
    StructField("metadata", StructType([
      StructField("name", StringType(), True)
    ]), True)
  ]), True)
])

# Apply the schema and keep only records where SMS is not null.
SMSDF2 = (spark.read
  .schema(schema2)
  .json("/mnt/training/UbiqLog4UCI/14_F/log*")
  .dropna()
)
display(SMSDF2)

SMS
"List(+98214428####, 1-10-2014 11:30:05, List())"
"List(+985000406500####, 1-10-2014 11:32:01, List())"
"List(+98214428####, 1-10-2014 11:30:05, List())"
"List(+98939283####, 1-9-2014 23:54:31, List(bahram))"
"List(+98214428####, 1-10-2014 12:15:19, List())"
"List(+98939283####, 1-9-2014 23:54:31, List(bahram))"
"List(+98935566####, 1-10-2014 12:35:00, List(u Kh sevda))"
"List(+98214428####, 1-10-2014 12:39:20, List())"
"List(+981000721670####, 1-10-2014 12:41:57, List())"
"List(+98935566####, 1-10-2014 12:45:28, List(u Kh sevda))"


In [48]:
# TEST - Run this cell to test your solution
cols = SMSDF2.columns
schemaJson = SMSDF2.schema.json()

dbTest("ET1-P-05-04-01", 1147, SMSDF2.count())
dbTest("ET1-P-05-04-02", ['SMS'], cols)
dbTest("ET1-P-05-04-03", True, 'Address' in schemaJson and 'date' in schemaJson)

print("Tests passed!")

### Step 3: Compare Solution Performance

Compare the default schema inference to applying a user-defined schema using the `%timeit` magic command.  Which completed faster?  Which triggered more jobs?  Why?

In [50]:
%timeit SMSDF = spark.read.schema(schema2).json("/mnt/training/UbiqLog4UCI/14_F/log*").count()

In [51]:
%timeit SMSDF = spark.read.json("/mnt/training/UbiqLog4UCI/14_F/log*").count()

Providing a schema increases performance two to three times, depending on the size of the cluster used. Since Spark doesn't infer the schema, it doesn't have to read through all of the data. This is also why there are fewer jobs when a schema is provided: Spark doesn't need one job for each partition of the data to infer the schema.
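
To see why inference costs an extra pass over the data, consider this toy sketch in plain Python. It is not Spark's implementation; the functions `inferSchema` and `applySchema` and the sample records (including the `state` field) are invented for illustration.

```python
import json

# Invented sample records; the second one has an extra field.
lines = [
    '{"city": "AGAWAM", "pop": 15338}',
    '{"city": "BARRE", "pop": 4546, "state": "MA"}',
]

def inferSchema(lines):
    """Toy inference: every record must be visited to discover all fields."""
    schema, recordsRead = {}, 0
    for line in lines:
        recordsRead += 1
        for key, value in json.loads(line).items():
            schema.setdefault(key, type(value).__name__)
    return schema, recordsRead

def applySchema(lines, fieldNames):
    """With a known schema, each record is parsed once and extra fields are dropped."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fieldNames})
    return rows

inferred, recordsRead = inferSchema(lines)   # extra full pass just to learn the fields
rows = applySchema(lines, ["city", "pop"])   # single pass, only the fields of interest
print(inferred, recordsRead)
print(rows)
```

In the toy version, the field `state` is only discovered on the last record, which is why inference cannot stop early.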

## Review

**Question:** What are two ways to obtain a schema from data?  
**Answer:** Allow Spark to infer a schema from your data or provide a user defined schema. Schema inference is the recommended first step; however, you can customize this schema to your use case with a user defined schema.

**Question:** Why should you define your own schema?  
**Answer:** Benefits of user defined schemas include:
* Avoiding the extra scan of your data needed to infer the schema
* Providing alternative data types
* Parsing only the fields you need

**Question:** Why is JSON a common format in big data pipelines?  
**Answer:** Semi-structured data works well with hierarchical data and where schemas need to evolve over time.  It also easily contains composite data types such as arrays and maps.

**Question:** By default, how are corrupt records dealt with using `spark.read.json()`?  
**Answer:** They appear in a column called `_corrupt_record`.  These are the records that Spark can't read (e.g. when characters are missing from a JSON string).
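
The idea behind `_corrupt_record` can be imitated in plain Python to make it concrete. This sketch only mimics the behavior described above; it is not how Spark parses JSON, and the function name `parseLine` and the sample lines are assumptions.

```python
import json

def parseLine(line):
    """Return the parsed record, or route the raw text to `_corrupt_record`."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return {"_corrupt_record": line}

sampleLines = [
    '{"city": "AGAWAM", "pop": 15338}',
    '{"city": "BARRE", "pop": 4546',      # missing closing brace
]

parsed = [parseLine(line) for line in sampleLines]
corrupt = [row for row in parsed if "_corrupt_record" in row]
print(len(corrupt))  # 1
```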

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [55]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Corrupt Record Handling]($./06-Corrupt-Record-Handling ).

## Additional Topics & Resources

**Q:** Where can I find more information on working with JSON data?  
**A:** Take a look at the <a href="http://files.training.databricks.com/courses/dataframes/" target="_blank">DataFrames course from Databricks Academy</a>.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>