d-sandbox

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Capstone Project: Parsing Nested Data

Mount JSON data using DBFS, define and apply a schema, parse fields, and save the cleaned results back to DBFS.

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Lesson: <a href="$./02-ETL-Process-Overview">ETL Process Overview</a> 
* Lesson: <a href="$./05-Applying-Schemas-to-JSON-Data">Applying Schemas to JSON Data</a> 

## Instructions

A common source of data in ETL pipelines is <a href="https://kafka.apache.org/" target="_blank">Apache Kafka</a>, or the managed alternative
<a href="https://aws.amazon.com/kinesis/" target="_blank">Kinesis</a>.
A common data type in these use cases is newline-separated JSON.

For this exercise, Tweets were streamed from the <a href="https://developer.twitter.com/en/docs" target="_blank">Twitter firehose API</a> into such an aggregation server and,
from there, dumped into the distributed file system.

Use these four exercises to perform ETL on the data in this bucket:  
<br>
1. Extracting and Exploring the Data
2. Defining and Applying a Schema
3. Creating the Tables
4. Loading the Results

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

## Exercise 1: Extracting and Exploring the Data

First, review the data.

### Step 1: Explore the Folder Structure

Explore the mount and review the directory structure. Use `%fs ls`.  The data is located in `/mnt/training/twitter/firehose/`

In [7]:
%fs ls /mnt/training/twitter/firehose/

path,name,size
dbfs:/mnt/training/twitter/firehose/2018/,2018/,0


### Step 2: Explore a Single File

> "Premature optimization is the root of all evil." -Sir Tony Hoare

There are a few gigabytes of Twitter data available in the directory. Hoare's law about premature optimization is applicable here.  Instead of building a schema for the entire data set and then trying it out, an iterative process is much less error prone and runs much faster. Start by working on a single file before you apply your proof of concept across the entire data set.

Read a single file.  Start with `twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4`. Find this in `/mnt/training/twitter/firehose/2018/01/08/18/`.  Save the results to the variable `df`.

In [10]:
# TODO
df = spark.read.json("/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4")

In [11]:
# TEST - Run this cell to test your solution
cols = df.columns

dbTest("ET1-P-08-02-01", 1744, df.count())
dbTest("ET1-P-08-02-02", True, "id" in cols)
dbTest("ET1-P-08-02-03", True, "text" in cols)

print("Tests passed!")

Display the schema.

In [13]:
df.printSchema()

Count the records in the file. Save the result to `dfCount`.

In [15]:
# TODO
dfCount = df.count()

In [16]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-08-03-01", 1744, dfCount)

print("Tests passed!")

## Exercise 2: Defining and Applying a Schema

Applying schemas is especially helpful for data with many fields to sort through. With a complex dataset like this, define a schema **that captures only the relevant fields**.

Capture the hashtags and dates from the data to get a sense for Twitter trends. Use the same file as above.

### Step 1: Understanding the Data Model

In order to apply structure to semi-structured data, you first must understand the data model.  

There are two forms of data models to employ: a relational or non-relational model.<br><br>
* **Relational models** are within the domain of traditional databases. [Normalization](https://en.wikipedia.org/wiki/Database_normalization) is the primary goal of the data model. <br>
* **Non-relational data models** prefer scalability, performance, or flexibility over normalized data.

Use the following relational model to define a number of tables to join together on different columns, in order to reconstitute the original data. Regardless of the data model, the ETL principles are roughly the same.

Compare the following [Entity-Relationship Diagram](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model) to the schema you printed out in the previous step to get a sense for how to populate the tables.

-sandbox
<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ER-diagram.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

-sandbox
### Step 2: Create a Schema for the `Tweet` Table

Create a schema for the JSON data to extract just the information that is needed for the `Tweet` table, parsing each of the following fields in the data model:

| Field | Type|
|-------|-----|
| tweet_id | integer |
| user_id | integer |
| language | string |
| text | string |
| created_at | string* |

*Note: Start with `created_at` as a string. Turn this into a timestamp later.

Save the schema to `tweetSchema`, use it to create a DataFrame named `tweetDF`, and use the same file used in the exercise above: `"/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You might need to reexamine the data schema. <br>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** [Import types from `pyspark.sql.types`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=pyspark%20sql%20types#module-pyspark.sql.types).

In [21]:
# TODO
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, LongType
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

tweetSchema = StructType()\
.add("created_at",StringType())\
.add("id",LongType())\
.add("user",StructType().add("user_id",LongType()))\
.add("lang",StringType())\
.add("text",StringType())

tweetDF1 = (spark.read
          .schema(tweetSchema)
          .json(path)
          )
tweetDF=tweetDF1.select("created_at","id","user.user_id","lang","text").withColumnRenamed("lang","language").withColumnRenamed("user_id","userID")
display(tweetDF)

created_at,id,userID,language,text
,,,,
,,,,
Mon Jan 08 18:47:59 +0000 2018,9.504389542720961e+17,,en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…
Mon Jan 08 18:47:59 +0000 2018,9.504389542889144e+17,,ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet
Mon Jan 08 18:47:59 +0000 2018,9.504389542764504e+17,,tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur
Mon Jan 08 18:47:59 +0000 2018,9.504389542804723e+17,,ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران
Mon Jan 08 18:47:59 +0000 2018,9.504389542888897e+17,,en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*
Mon Jan 08 18:47:59 +0000 2018,9.504389542806692e+17,,en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂
Mon Jan 08 18:47:59 +0000 2018,9.504389542764419e+17,,pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️"
Mon Jan 08 18:47:59 +0000 2018,9.504389542764787e+17,,en,I just want this all to be over


In [22]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

schema = tweetSchema.fieldNames()
schema.sort()
tweetCount = tweetDF.filter(col("id").isNotNull()).count()

dbTest("ET1-P-08-04-01", 'created_at', schema[0])
dbTest("ET1-P-08-04-02", 'id', schema[1])
dbTest("ET1-P-08-04-03", 1491, tweetCount)

assert schema[0] == 'created_at' and schema[1] == 'id'
assert tweetCount == 1491

print("Tests passed!")

### Step 3: Create a Schema for the Remaining Tables

Finish off the full schema, save it to `fullTweetSchema`, and use it to create the DataFrame `fullTweetDF`. Your schema should parse all the entities from the ER diagram above.  Remember, smart small, run your code, and then iterate.

In [24]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, LongType
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"

fullTweetSchema=StructType()\
.add("created_at",StringType())\
.add("entities",StructType().add("hashtags",ArrayType(StructType().add("element",StructType().add("text",StringType()))))\
.add("media",ArrayType(StructType().add("element",StructType().add("id",LongType())))).add("urls",ArrayType(StructType().add("element",StructType()\
.add("display_url",StringType()).add("expanded_url",StringType()).add("url",StringType())))))\
.add("user",StructType().add("user_id",LongType()).add("screen_name",StringType()).add("location",StringType()).add("friends_count",LongType())\
.add("followers_count",LongType()).add("description",StringType()))\
.add("id",LongType())

fullTweetDF = (spark.read
          .schema(fullTweetSchema)
          .json(path)
             
          )
display(fullTweetDF)

created_at,entities,user,id
,,,
,,,
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, smileifyou_love, null, 473, 160, •Psalm 34:18• Living life one day at a time ✌️)",9.504389542720961e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(List(null)), null, List())","List(null, bw198e18, null, 1641, 1285, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc)",9.504389542889144e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, marlascigarette, null, 214, 223, △)",9.504389542764504e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(List(null)), null, List(List(null)))","List(null, rebaab_1326, null, 45, 0, null)",9.504389542804723e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, puskine, Kampala, Uganda, 5008, 4916, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952)",9.504389542888897e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, xNina_Beana, the land , 1130, 1646, Prince Carter ❤️ && Messiah Carter Miles ❤️)",9.504389542806692e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, gbfranca22, cpx da congo🔞, 252, 632, mãe nunca te escutei, mas sempre te amarei❤)",9.504389542764419e+17
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, squeeqi, null, 213, 160, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below)",9.504389542764787e+17


In [25]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import col

schema = fullTweetSchema.fieldNames()
schema.sort()
tweetCount = fullTweetDF.filter(col("id").isNotNull()).count()

assert tweetCount == 1491

dbTest("ET1-P-08-05-01", "created_at", schema[0])
dbTest("ET1-P-08-05-02", "entities", schema[1])
dbTest("ET1-P-08-05-03", 1491, tweetCount)

print("Tests passed!")

## Exercise 3: Creating the Tables

Apply the schema you defined to create tables that match the relational data model.

### Step 1: Filtering Nulls

The Twitter data contains both deletions and tweets.  This is why some records appear as null values. Create a DataFrame called `fullTweetFilteredDF` that filters out the null values.

In [28]:
# TODO
from pyspark.sql.functions import col
fullTweetFilteredDF = (spark.read
          .schema(fullTweetSchema)
          .json(path)
          .filter(col("id").isNotNull())
          )
display(fullTweetFilteredDF)

created_at,entities,user,id
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, smileifyou_love, null, 473, 160, •Psalm 34:18• Living life one day at a time ✌️)",950438954272096257
Mon Jan 08 18:47:59 +0000 2018,"List(List(List(null)), null, List())","List(null, bw198e18, null, 1641, 1285, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc)",950438954288914432
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, marlascigarette, null, 214, 223, △)",950438954276450305
Mon Jan 08 18:47:59 +0000 2018,"List(List(List(null)), null, List(List(null)))","List(null, rebaab_1326, null, 45, 0, null)",950438954280472576
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, puskine, Kampala, Uganda, 5008, 4916, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952)",950438954288889856
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, xNina_Beana, the land , 1130, 1646, Prince Carter ❤️ && Messiah Carter Miles ❤️)",950438954280669184
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, gbfranca22, cpx da congo🔞, 252, 632, mãe nunca te escutei, mas sempre te amarei❤)",950438954276442113
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, squeeqi, null, 213, 160, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below)",950438954276478976
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, iiib53, null, 631, 427, null)",950438954289033216
Mon Jan 08 18:47:59 +0000 2018,"List(List(), null, List())","List(null, nappo_what, 🇫🇮🇺🇦, 297, 925, null)",950438954289033218


In [29]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-08-06-01", 1491, fullTweetFilteredDF.count())

print("Tests passed!")

-sandbox
### Step 2: Creating the `Tweet` Table

Twitter uses a non-standard timestamp format that Spark doesn't recognize. Currently the `created_at` column is formatted as a string. Create the `Tweet` table and save it as `tweetDF`. Parse the timestamp column using `unix_timestamp`, and cast the result as `TimestampType`. The timestamp format is `EEE MMM dd HH:mm:ss ZZZZZ yyyy`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Use `alias` to alias the name of your columns to the final name you want for them.  
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** `id` corresponds to `tweet_id` and `user.id` corresponds to `user_id`.

In [31]:
# TODO
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import TimestampType

timestampFormat = "EEE MMM dd HH:mm:ss ZZZZZ yyyy"
tweetDF = tweetDF.select(
unix_timestamp("created_at", "EEE MMM dd HH:mm:ss ZZZZZ yyyy").astype(TimestampType()).alias("createdAt"),
"id","userID","language","text"
).withColumnRenamed("id","tweetID")
display(tweetDF)

createdAt,tweetID,userID,language,text
,,,,
,,,,
2018-01-08T18:47:59.000+0000,9.504389542720961e+17,,en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…
2018-01-08T18:47:59.000+0000,9.504389542889144e+17,,ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet
2018-01-08T18:47:59.000+0000,9.504389542764504e+17,,tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur
2018-01-08T18:47:59.000+0000,9.504389542804723e+17,,ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران
2018-01-08T18:47:59.000+0000,9.504389542888897e+17,,en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*
2018-01-08T18:47:59.000+0000,9.504389542806692e+17,,en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂
2018-01-08T18:47:59.000+0000,9.504389542764419e+17,,pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️"
2018-01-08T18:47:59.000+0000,9.504389542764787e+17,,en,I just want this all to be over


In [32]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import TimestampType
t = tweetDF.select("createdAt").schema[0]

dbTest("ET1-P-08-07-01", TimestampType(), t.dataType)

print("Tests passed!")

### Step 3: Creating the Account Table

Save the account table as `accountDF`.

In [34]:
# TODO
accountDF2 = (spark.read
             .schema(fullTweetSchema)
             .json(path)
             .dropna()
             )


accountDF=accountDF2.select("user.*").withColumnRenamed("screen_name","screenName").withColumnRenamed("friends_count","friendsCount").withColumnRenamed("followers_count","followersCount").withColumnRenamed("user_id","userID")
display(accountDF)

userID,screenName,location,friendsCount,followersCount,description
,smileifyou_love,,473,160,•Psalm 34:18• Living life one day at a time ✌️
,bw198e18,,1641,1285,【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc
,marlascigarette,,214,223,△
,rebaab_1326,,45,0,
,puskine,"Kampala, Uganda",5008,4916,God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952
,xNina_Beana,the land,1130,1646,Prince Carter ❤️ && Messiah Carter Miles ❤️
,gbfranca22,cpx da congo🔞,252,632,"mãe nunca te escutei, mas sempre te amarei❤"
,squeeqi,,213,160,We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below
,iiib53,,631,427,
,nappo_what,🇫🇮🇺🇦,297,925,


In [35]:
# TEST - Run this cell to test your solution
cols = accountDF.columns

dbTest("ET1-P-08-08-01", True, "friendsCount" in cols)
dbTest("ET1-P-08-08-02", True, "screenName" in cols)
dbTest("ET1-P-08-08-03", 1491, accountDF.count())


print("Tests passed!")

-sandbox
### Step 4: Creating Hashtag and URL Tables Using `explode`

Each tweet in the data set contains zero, one, or many URLs and hashtags. Parse these using the `explode` function so that each URL or hashtag has its own row.

In this example, `explode` gives one row from the original column `hashtags` for each value in an array. All other columns are left untouched.

```
+---------------+--------------------+----------------+
|     screenName|            hashtags|explodedHashtags|
+---------------+--------------------+----------------+
|        zooeeen|[[Tea], [GoldenGl...|           [Tea]|
|        zooeeen|[[Tea], [GoldenGl...|  [GoldenGlobes]|
|mannydidthisone|[[beats], [90s], ...|         [beats]|
|mannydidthisone|[[beats], [90s], ...|           [90s]|
|mannydidthisone|[[beats], [90s], ...|     [90shiphop]|
|mannydidthisone|[[beats], [90s], ...|           [pac]|
|mannydidthisone|[[beats], [90s], ...|        [legend]|
|mannydidthisone|[[beats], [90s], ...|          [thug]|
|mannydidthisone|[[beats], [90s], ...|         [music]|
|mannydidthisone|[[beats], [90s], ...|     [westcoast]|
|mannydidthisone|[[beats], [90s], ...|        [eminem]|
|mannydidthisone|[[beats], [90s], ...|         [drdre]|
|mannydidthisone|[[beats], [90s], ...|          [trap]|
|  Satish0919995|[[BB11], [BiggBos...|          [BB11]|
|  Satish0919995|[[BB11], [BiggBos...|    [BiggBoss11]|
|  Satish0919995|[[BB11], [BiggBos...| [WeekendKaVaar]|
+---------------+--------------------+----------------+
```

The concept of `explode` is similar to `pivot`.

Create the rest of the tables and save them to the following DataFrames:<br><br>

* `hashtagDF`
* `urlDF`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode" target="_blank">Find the documentation for `explode` here</a>

In [37]:
# TODO
from pyspark.sql.functions import col,explode
hashtagDF2 = (spark.read
             .schema(fullTweetSchema)
             .json(path)
             .filter(col("entities.hashtags").isNotNull())
             )
urlDF2=(spark.read
             .schema(fullTweetSchema)
             .json(path)
             .filter(col("entities.urls").isNotNull())
             )

hashtagDF=hashtagDF2.select("entities.media.element.id",explode("entities.hashtags")).withColumnRenamed("col","hashtag")\
.withColumnRenamed("entities.media.element.id","tweetID")

urlDF=urlDF2.select("entities.media.element.id",explode("entities.urls")).withColumnRenamed("col","displayURL")\
.withColumnRenamed("entities.media.element.id","tweetID")

display(hashtagDF)

tweetID,hashtag
,List(null)
,List(null)
List(null),List(null)
List(null),List(null)
,List(null)
,List(null)
,List(null)
,List(null)
,List(null)
,List(null)


In [38]:
# TEST - Run this cell to test your solution
hashtagCols = hashtagDF.columns
urlCols = urlDF.columns
hashtagDFCounts = hashtagDF.count()
urlDFCounts = urlDF.count()

dbTest("ET1-P-08-09-01", True, "hashtag" in hashtagCols)
dbTest("ET1-P-08-09-02", True, "displayURL" in urlCols)
dbTest("ET1-P-08-09-03", 394, hashtagDFCounts)
dbTest("ET1-P-08-09-04", 368, urlDFCounts)

print("Tests passed!")

-sandbox
## Exercise 4: Loading the Results

Use DBFS as your target warehouse for your transformed data. Save the DataFrames in Parquet format to the following endpoints:  

| DataFrame    | Endpoint                                 |
|:-------------|:-----------------------------------------|
| `accountDF`  | `"/tmp/" + username + "/account.parquet"`|
| `tweetDF`    | `"/tmp/" + username + "/tweet.parquet"`  |
| `hashtagDF`  | `"/tmp/" + username + "/hashtag.parquet"`|
| `urlDF`      | `"/tmp/" + username + "/url.parquet"`    |

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If you run out of storage in `/tmp`, use `.limit(10)` to limit the size of your DataFrames to 10 records.

In [40]:
# TODO
accountDF.write.parquet("/tmp/" + username + "/account.parquet")
tweetDF.write.parquet("/tmp/" + username + "/tweet.parquet")

In [41]:
hashtagDF.write.parquet("/tmp/" + username + "/hashtag.parquet")
urlDF.write.parquet("/tmp/" + username + "/url.parquet")

In [42]:
# TEST - Run this cell to test your solution
from pyspark.sql.dataframe import DataFrame

accountDF = spark.read.parquet("/tmp/" + username + "/account.parquet")
tweetDF = spark.read.parquet("/tmp/" + username + "/tweet.parquet")
hashtagDF = spark.read.parquet("/tmp/" + username + "/hashtag.parquet")
urlDF = spark.read.parquet("/tmp/" + username + "/url.parquet")

dbTest("ET1-P-08-10-01", DataFrame, type(accountDF))
dbTest("ET1-P-08-10-02", DataFrame, type(tweetDF))
dbTest("ET1-P-08-10-03", DataFrame, type(hashtagDF))
dbTest("ET1-P-08-10-04", DataFrame, type(urlDF))

print("Tests passed!")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [44]:
%run ./Includes/Classroom-Cleanup

## IMPORTANT Next Steps
* Please complete the <a href="https://www.surveymonkey.com/r/WPD7YNV" target="_blank">short feedback survey</a>.  Your input is extremely important and shapes future course development.
* Congratulations, you have completed ETL Part 1!

-sandbox
&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>