
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Capstone Project: Custom Transformations, Aggregating and Loading

The goal of this project is to populate aggregate tables using Twitter data.  In the process, you write custom User Defined Functions (UDFs), aggregate the most trafficked domains by day, join new records to a lookup table, and load the results to a target database.

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome or Firefox
* Lesson: <a href="$./03-User-Defined-Functions">User Defined Functions</a> 
* Lesson: <a href="$./05-Joins-and-Lookup-Tables">Joins and Lookup Tables</a> 
* Lesson: <a href="$./06-Database-Writes">Database Writes</a> 

## Instructions

The capstone work for the previous course in this series (ETL: Part 1) defined a schema and created tables to populate a relational model. In this capstone project you take that work further.

In this project you ETL JSON Twitter data to build aggregate tables that monitor trending websites and hashtags and filter malicious users using historical data.  Use these four exercises to achieve this goal:<br><br>

1. **Parse tweeted URLs** using a custom UDF
2. **Compute aggregate statistics** of most tweeted websites and hashtags by day
3. **Join new data** to an existing dataset of malicious users
4. **Load records** into a target database

Run the following cell.

In [4]:
%run "./Includes/Classroom-Setup"

## Exercise 1: Parse Tweeted URLs

Some tweets in the dataset contain links to other websites.  Import and explore the dataset using the provided schema.  Then, parse the domain name from these URLs using a custom UDF.

### Step 1: Import and Explore

The following is the schema created as part of the capstone project for ETL Part 1.  
Run the following cell and then use this schema to import one file of the Twitter data.

In [7]:
from pyspark.sql.types import StructField, StructType, ArrayType, StringType, IntegerType, LongType
from pyspark.sql.functions import col

fullTweetSchema = StructType([
  StructField("id", LongType(), True),
  StructField("user", StructType([
    StructField("id", LongType(), True),
    StructField("screen_name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("friends_count", IntegerType(), True),
    StructField("followers_count", IntegerType(), True),
    StructField("description", StringType(), True)
  ]), True),
  StructField("entities", StructType([
    StructField("hashtags", ArrayType(
      StructType([
        StructField("text", StringType(), True)
      ]),
    ), True),
    StructField("urls", ArrayType(
      StructType([
        StructField("url", StringType(), True),
        StructField("expanded_url", StringType(), True),
        StructField("display_url", StringType(), True)
      ]),
    ), True)
  ]), True),
  StructField("lang", StringType(), True),
  StructField("text", StringType(), True),
  StructField("created_at", StringType(), True)
])

Import one file of the JSON data located at `/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4` using the schema.  Be sure to do the following:<br><br>

* Save the result to `tweetDF`
* Apply the schema `fullTweetSchema`
* Filter out null values from the `id` column

In [9]:
from pyspark.sql.functions import col
path = "/mnt/training/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"
tweetDF = (spark.read
          .schema(fullTweetSchema)
          .json(path)
          .filter(col("id").isNotNull()))
display(tweetDF)

id,user,entities,lang,text,created_at
950438954272096257,"List(371607576, smileifyou_love, null, 473, 160, •Psalm 34:18• Living life one day at a time ✌️)","List(List(), List())",en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,Mon Jan 08 18:47:59 +0000 2018
950438954288914432,"List(732417055, bw198e18, null, 1641, 1285, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc)","List(List(List(diet)), List())",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,Mon Jan 08 18:47:59 +0000 2018
950438954276450305,"List(235927210, marlascigarette, null, 214, 223, △)","List(List(), List())",tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,Mon Jan 08 18:47:59 +0000 2018
950438954280472576,"List(1564880654, rebaab_1326, null, 45, 0, null)","List(List(List(صاروخ_سعودي_يرعب_ايران)), List(List(https://t.co/j0RgDwS36n, https://www.youtube.com/watch?v=b4iz9nZPzAA, youtube.com/watch?v=b4iz9n…)))",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,Mon Jan 08 18:47:59 +0000 2018
950438954288889856,"List(349070364, puskine, Kampala, Uganda, 5008, 4916, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952)","List(List(), List())",en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,Mon Jan 08 18:47:59 +0000 2018
950438954280669184,"List(340482488, xNina_Beana, the land , 1130, 1646, Prince Carter ❤️ && Messiah Carter Miles ❤️)","List(List(), List())",en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,Mon Jan 08 18:47:59 +0000 2018
950438954276442113,"List(4354072997, gbfranca22, cpx da congo🔞, 252, 632, mãe nunca te escutei, mas sempre te amarei❤)","List(List(), List())",pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",Mon Jan 08 18:47:59 +0000 2018
950438954276478976,"List(738897225061912576, squeeqi, null, 213, 160, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below)","List(List(), List())",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018
950438954289033216,"List(273646363, iiib53, null, 631, 427, null)","List(List(), List())",ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا اروع للبطاله لكن وزارة النقل قررت ان لا تنظم السوق بحجه السوق الحرة !! اي حريه والشركتين يسحق…,Mon Jan 08 18:47:59 +0000 2018
950438954289033218,"List(1541143441, nappo_what, 🇫🇮🇺🇦, 297, 925, null)","List(List(), List())",ru,RT @craneswordboi: блять мне так смешно от слова срождество,Mon Jan 08 18:47:59 +0000 2018


In [10]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-08-01-01", 1491, tweetDF.count())
dbTest("ET2-P-08-01-02", True, "text" in tweetDF.columns and "id" in tweetDF.columns)

print("Tests passed!")

### Step 2: Write a UDF to Parse URLs

The Python regular expression library `re` allows you to define a set of rules for the strings you want to match. In this case, parse just the domain name from the URL of a link in a tweet. Take a look at the following example:

```
import re

URL = "https://www.databricks.com/"
pattern = re.compile(r"https?://(www\.)?([^/#?]+).*$")
match = pattern.search(URL)
print("The string {} matched {}".format(URL, match.group(2)))
```

This code prints `The string https://www.databricks.com/ matched databricks.com`. **Wrap this code into a function named `getDomain` that takes a parameter `URL` and returns the matched string.**

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="https://docs.python.org/3/howto/regex.html" target="_blank">You can find more on the `re` library here.</a>
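To make the two capture groups in the pattern concrete, here is a minimal sketch in plain Python. The helper name `describe_match` is hypothetical and only used for illustration; it is not part of the exercise. It shows that group 1 holds the optional `www.` prefix, group 2 holds the domain, and `search` returns `None` when the string is not an HTTP(S) URL.

```python
import re

# Same pattern as in the example above
pattern = re.compile(r"https?://(www\.)?([^/#?]+).*$")

def describe_match(url):
    """Return (www prefix or None, domain), or None when nothing matches."""
    match = pattern.search(url)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(describe_match("https://www.databricks.com/"))           # ('www.', 'databricks.com')
print(describe_match("http://spark.apache.org/docs/latest/"))  # (None, 'spark.apache.org')
print(describe_match("not a url"))                             # None
```

Because `search` can return `None`, a function that calls `match.group(2)` unconditionally raises an `AttributeError` on input that does not match.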

In [12]:
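import re

def getDomain(URL):
  # Return None for missing or non-matching input instead of raising AttributeError
  if URL is None:
    return None
  pattern = re.compile(r"https?://(www\.)?([^/#?]+).*$")
  match = pattern.search(URL)
  return match.group(2) if match else None

URL = "https://www.databricks.com/"
print("The string {} matched {}".format(URL, getDomain(URL)))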

In [13]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-08-02-01", "databricks.com",  getDomain("https://www.databricks.com/"))

print("Tests passed!")

### Step 3: Test and Register the UDF

Now that the function works with a single URL, confirm that it works on different URL formats.

Run the following cell to test your function further.

In [16]:
urls = [
  "https://www.databricks.com/",
  "https://databricks.com/",
  "https://databricks.com/training-overview/training-self-paced",
  "http://www.databricks.com/",
  "http://databricks.com/",
  "http://databricks.com/training-overview/training-self-paced",
  "http://www.apache.org/",
  "http://spark.apache.org/docs/latest/"
]

for url in urls:
  print(getDomain(url))

Register the UDF as `getDomainUDF`.

In [18]:
from pyspark.sql.types import StringType

# Register the Python function as a UDF that returns a string
getDomainUDF = spark.udf.register("getDomainUDF", getDomain, StringType())

In [19]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-08-03-01", True, bool(getDomainUDF))

print("Tests passed!")

### Step 4: Apply the UDF

Create a dataframe called `urlDF` that has three columns:<br><br>

1. `URL`: The URLs from `tweetDF` (located in `entities.urls.expanded_url`) 
2. `parsedURL`: The UDF applied to the column `URL`
3. `created_at`

There can be zero, one, or many URLs in any tweet.  For this step, use the `explode` function, which takes an array column such as the URLs and returns one row for each element in the array.
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode" target="_blank">See the documentation for details.</a>
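The effect of `explode` can be sketched in plain Python. This is only an illustration of the row-per-element behavior, not the Spark API; the helper `explode_rows` and the three sample rows are hypothetical (the URLs are taken from the dataset output). Note that a row with an empty array produces no output rows.

```python
def explode_rows(rows, array_key, out_key):
    """Yield one output row per element of each row's array field."""
    for row in rows:
        for value in row.get(array_key) or []:
            # Copy the other fields and replace the array with a single element
            out = {k: v for k, v in row.items() if k != array_key}
            out[out_key] = value
            yield out

# Hypothetical sample: tweets with zero, one, and two URLs
tweets = [
    {"id": 1, "urls": []},
    {"id": 2, "urls": ["https://www.youtube.com/watch?v=b4iz9nZPzAA"]},
    {"id": 3, "urls": ["http://bit.ly/OYlKII", "https://goo.gl/fb/atjACB"]},
]

exploded = list(explode_rows(tweets, "urls", "URL"))
for row in exploded:
    print(row)
```

The tweet with `id` 1 disappears from the result, and the tweet with `id` 3 appears twice, once per URL.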

In [21]:
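from pyspark.sql.functions import col, explode

# Explode first, then apply the UDF in a second select so that "URL" is an existing column
urlDF = (tweetDF
  .select(explode(col("entities.urls.expanded_url")).alias("URL"), "created_at")
  .select("URL", getDomainUDF(col("URL")).alias("parsedURL"), "created_at")
)
display(urlDF)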

URL,parsedURL,created_at
https://www.youtube.com/watch?v=b4iz9nZPzAA,youtube.com,Mon Jan 08 18:47:59 +0000 2018
https://twitter.com/i/web/status/950438954284797958,twitter.com,Mon Jan 08 18:47:59 +0000 2018
http://bit.ly/OYlKII,bit.ly,Mon Jan 08 18:47:59 +0000 2018
https://goo.gl/fb/atjACB,goo.gl,Mon Jan 08 18:47:59 +0000 2018
https://www.instagram.com/p/BdsvNFABXNL/,instagram.com,Mon Jan 08 18:47:59 +0000 2018
https://twitter.com/i/web/status/950438954301644800,twitter.com,Mon Jan 08 18:47:59 +0000 2018
http://alathkar.org,alathkar.org,Mon Jan 08 18:47:59 +0000 2018
https://open.spotify.com/album/60eldei5NAr7QjPGr3Ei1B,open.spotify.com,Mon Jan 08 18:47:59 +0000 2018
https://twitter.com/i/web/status/950438954301616131,twitter.com,Mon Jan 08 18:47:59 +0000 2018
https://twitter.com/frann_frann_/status/949080362218606592,twitter.com,Mon Jan 08 18:47:59 +0000 2018


In [22]:
# TEST - Run this cell to test your solution
cols = urlDF.columns
sample = urlDF.first()

dbTest("ET2-P-08-04-01", True, "URL" in cols and "parsedURL" in cols and "created_at" in cols)
dbTest("ET2-P-08-04-02", "https://www.youtube.com/watch?v=b4iz9nZPzAA", sample["URL"])
dbTest("ET2-P-08-04-03", "Mon Jan 08 18:47:59 +0000 2018", sample["created_at"])
dbTest("ET2-P-08-04-04", "youtube.com", sample["parsedURL"])

print("Tests passed!")

## Exercise 2: Compute Aggregate Statistics

Calculate the top 10 trending URLs by hour.

### Step 1: Parse the Timestamp

Create a DataFrame `urlWithTimestampDF` that includes the following columns:<br><br>

* `URL`
* `parsedURL`
* `timestamp`
* `hour`

Import `unix_timestamp` and `hour` from the `functions` module and `TimestampType` from the `types` module. To parse the `created_at` field, use `unix_timestamp` with the format `EEE MMM dd HH:mm:ss ZZZZZ yyyy`.
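As a way to check what the format string is expected to produce, the same `created_at` value can be parsed with Python's standard `datetime` module. This is a sketch for comparison only: the `strptime` directives below are a rough Python counterpart of the Java-style pattern above, not something Spark itself uses.

```python
from datetime import datetime

# A created_at value taken from the dataset output
created_at = "Mon Jan 08 18:47:59 +0000 2018"

# %a ~ EEE, %b ~ MMM, %d ~ dd, %H:%M:%S ~ HH:mm:ss, %z ~ the UTC offset, %Y ~ yyyy
parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

print(parsed.isoformat())  # 2018-01-08T18:47:59+00:00
print(parsed.hour)         # 18
```

The hour here is 18 because the value carries a `+0000` offset; in Spark, the extracted hour may also depend on the session time zone.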

In [25]:
from pyspark.sql.functions import col, hour, unix_timestamp
from pyspark.sql.types import TimestampType

timestampFormat = "EEE MMM dd HH:mm:ss ZZZZZ yyyy"

# Parse the string to epoch seconds, cast to a timestamp, then extract the hour
urlWithTimestampDF = (urlDF
  .withColumn("timestamp", unix_timestamp(col("created_at"), timestampFormat).cast(TimestampType()))
  .select("URL", "parsedURL", "timestamp", hour(col("timestamp")).alias("hour"))
)
display(urlWithTimestampDF)

URL,parsedURL,timestamp,hour
https://www.youtube.com/watch?v=b4iz9nZPzAA,youtube.com,2018-01-08T18:47:59.000+0000,18
https://twitter.com/i/web/status/950438954284797958,twitter.com,2018-01-08T18:47:59.000+0000,18
http://bit.ly/OYlKII,bit.ly,2018-01-08T18:47:59.000+0000,18
https://goo.gl/fb/atjACB,goo.gl,2018-01-08T18:47:59.000+0000,18
https://www.instagram.com/p/BdsvNFABXNL/,instagram.com,2018-01-08T18:47:59.000+0000,18
https://twitter.com/i/web/status/950438954301644800,twitter.com,2018-01-08T18:47:59.000+0000,18
http://alathkar.org,alathkar.org,2018-01-08T18:47:59.000+0000,18
https://open.spotify.com/album/60eldei5NAr7QjPGr3Ei1B,open.spotify.com,2018-01-08T18:47:59.000+0000,18
https://twitter.com/i/web/status/950438954301616131,twitter.com,2018-01-08T18:47:59.000+0000,18
https://twitter.com/frann_frann_/status/949080362218606592,twitter.com,2018-01-08T18:47:59.000+0000,18


In [26]:
# TEST - Run this cell to test your solution
cols = urlWithTimestampDF.columns
sample = urlWithTimestampDF.first()

dbTest("ET2-P-08-05-01", True, "URL" in cols and "parsedURL" in cols and "timestamp" in cols and "hour" in cols)
dbTest("ET2-P-08-05-02", 18, sample["hour"])

print("Tests passed!")

### Step 2: Calculate Trending URLs

Create a DataFrame `urlTrendsDF` that looks at the top 10 hourly counts of domain names and includes the following columns:<br><br>

* `hour`
* `parsedURL`
* `count`

The result should sort `hour` in ascending order and `count` in descending order.
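The group, count, and sort logic can be sketched in plain Python with `collections.Counter`. The `(hour, domain)` records below are hypothetical sample data, chosen only to show the ordering: hours ascending, and within each hour the counts descending.

```python
from collections import Counter

# Hypothetical (hour, domain) records
records = [
    (18, "twitter.com"), (18, "bit.ly"), (18, "twitter.com"),
    (17, "fb.me"), (18, "bit.ly"), (18, "twitter.com"),
]

# Count occurrences of each (hour, domain) pair
counts = Counter(records)

# Sort by hour ascending, then by count descending, and keep the top 10
trends = sorted(
    ((hour, domain, n) for (hour, domain), n in counts.items()),
    key=lambda t: (t[0], -t[2]),
)[:10]

for row in trends:
    print(row)
```

In the DataFrame version, the same ordering corresponds to sorting on `hour` first and on `count` in descending order second.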

In [28]:
from pyspark.sql.functions import col, desc

urlTrendsDF = (urlWithTimestampDF
  .groupBy("parsedURL", "hour")
  .count()                              # adds a column named "count"
  .orderBy(col("hour"), desc("count"))  # hour ascending, count descending
  .limit(10)
)
display(urlTrendsDF)

parsedURL,hour,count
twitter.com,18,159
bit.ly,18,25
fb.me,18,17
youtu.be,18,16
du3a.org,18,15
goo.gl,18,12
instagram.com,18,10
curiouscat.me,18,6
dlvr.it,18,4
youtube.com,18,3


In [29]:
# TEST - Run this cell to test your solution
cols = urlTrendsDF.columns
sample = urlTrendsDF.first()

dbTest("ET2-P-08-06-01", True, "hour" in cols and "parsedURL" in cols and "count" in cols)
dbTest("ET2-P-08-06-02", 18, sample["hour"])
dbTest("ET2-P-08-06-03", "twitter.com", sample["parsedURL"])
dbTest("ET2-P-08-06-04", 159, sample["count"])

print("Tests passed!")

## Exercise 3: Join New Data

Flag tweets from known bad users by joining the new data to a lookup table.

### Step 1: Import Table of Bad Actors

Create the DataFrame `badActorsDF`, a list of bad actors that sits in `/mnt/training/twitter/supplemental/badactors.parquet`.

In [32]:

badActorsDF = (spark.read
              .parquet("/mnt/training/twitter/supplemental/badactors.parquet")
              )
display(badActorsDF)

userID,screenName
4875602384,cris_silvag1
2641769580,therebel1789
932599236516024320,HiLeonyedek88
573733345,KristenSel
1458487634,ImPostMaIone
541253458,Santa_Palabra
1942419721,fidgazi
950292653207425024,marylinbrown272
2333871434,LryManon
3519647781,FeatuuedPromos


In [33]:
# TEST - Run this cell to test your solution
cols = badActorsDF.columns
sample = badActorsDF.first()

dbTest("ET2-P-08-07-01", True, "userID" in cols and "screenName" in cols)
dbTest("ET2-P-08-07-02", 4875602384, sample["userID"])
dbTest("ET2-P-08-07-03", "cris_silvag1", sample["screenName"])

print("Tests passed!")

### Step 2: Add a Column for Bad Actors

Add a new column to `tweetDF` called `maliciousAcct` with `true` if the user is in `badActorsDF`.  Save the results to `tweetWithMaliciousDF`.  Remember to do a left join of the malicious accounts on `tweetDF`.
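The idea behind the left join can be sketched in plain Python: every tweet is kept, and the lookup either finds a matching bad actor or yields nothing. The dictionary and list below are a hypothetical, simplified stand-in for `badActorsDF` and `tweetDF`, using IDs that appear in the cell output of this notebook.

```python
# Simplified stand-in for badActorsDF: userID -> screenName
bad_actors = {
    273646363: "iiib53",
    4875602384: "cris_silvag1",
}

# Simplified stand-in for tweetDF: tweet id plus the author's user id
tweets = [
    {"id": 950438954272096257, "user_id": 371607576},
    {"id": 950438954289033216, "user_id": 273646363},
]

flagged = []
for tweet in tweets:
    # Left join: keep every tweet; the lookup returns None when there is no match
    screen_name = bad_actors.get(tweet["user_id"])
    flagged.append(dict(tweet, screenName=screen_name,
                        maliciousAcct=screen_name is not None))

for row in flagged:
    print(row)
```

An inner join would instead drop the first tweet, which is why the exercise asks for a left join.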

In [35]:
from pyspark.sql.functions import col

# Left join keeps every tweet; screenName is null when the user is not a bad actor
tweetWithMaliciousDF = (tweetDF
  .join(badActorsDF, col("user.id") == col("userID"), "left_outer")
  .withColumn("maliciousAcct", col("screenName").isNotNull())
)

display(tweetWithMaliciousDF)

id,user,entities,lang,text,created_at,userID,screenName,maliciousAcct
950438954272096257,"List(371607576, smileifyou_love, null, 473, 160, •Psalm 34:18• Living life one day at a time ✌️)","List(List(), List())",en,RT @TheTinaVasquez: Quick facts for the know-nothings who will tweet me today: MS-13 began in Los Angeles. The U.S. helped fund El Salvador…,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954288914432,"List(732417055, bw198e18, null, 1641, 1285, 【期間限定】今なら無料！！ ただ今話題沸騰中の「ダイエットできるアプリ」こと「ヤセサポ」！！ 今だけ無料でダウンロードできます！この機会にぜひ！お試しあれ★ダウンロード⇒http://bit.ly/MxDjjc)","List(List(List(diet)), List())",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10センチ上にあげてキープ３．２の状態で足を振り上げる。※お腹が床から離れると× #diet,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954276450305,"List(235927210, marlascigarette, null, 214, 223, △)","List(List(), List())",tr,Ben bir beni bulup icine girip saklanirsam kim beni bulur,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954280472576,"List(1564880654, rebaab_1326, null, 45, 0, null)","List(List(List(صاروخ_سعودي_يرعب_ايران)), List(List(https://t.co/j0RgDwS36n, https://www.youtube.com/watch?v=b4iz9nZPzAA, youtube.com/watch?v=b4iz9n…)))",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني يشكر قناة الجزيرة https://t.co/j0RgDwS36n … #صاروخ_سعودي_يرعب_ايران,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954288889856,"List(349070364, puskine, Kampala, Uganda, 5008, 4916, God first . Football fun . Talk so much . Reader. Year of FINANCIAL BREAK THROUGH . Still learning how to love kibitram@gmail.com. +256779646952)","List(List(), List())",en,*Before you argue about your dirty house someone didn't clean or sweep -* *Think of the people who are living in the streets.*,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954280669184,"List(340482488, xNina_Beana, the land , 1130, 1646, Prince Carter ❤️ && Messiah Carter Miles ❤️)","List(List(), List())",en,RT @TippyLexx: Bruh you ever accidentally open a message and be like damn now I gotta reply 😂😂,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954276442113,"List(4354072997, gbfranca22, cpx da congo🔞, 252, 632, mãe nunca te escutei, mas sempre te amarei❤)","List(List(), List())",pt,"RT @MorraoTudo2: A liberdade é só questão de tempo, solta os faixa preta 🔐🔓⏳✌️",Mon Jan 08 18:47:59 +0000 2018,,,False
950438954276478976,"List(738897225061912576, squeeqi, null, 213, 160, We are two guys who have great knowledge in scripting. If you have 10k+ coins on CSGODouble we can help you triple that amount. Check out how in the link below)","List(List(), List())",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018,,,False
950438954289033216,"List(273646363, iiib53, null, 631, 427, null)","List(List(), List())",ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا اروع للبطاله لكن وزارة النقل قررت ان لا تنظم السوق بحجه السوق الحرة !! اي حريه والشركتين يسحق…,Mon Jan 08 18:47:59 +0000 2018,273646363.0,iiib53,True
950438954289033218,"List(1541143441, nappo_what, 🇫🇮🇺🇦, 297, 925, null)","List(List(), List())",ru,RT @craneswordboi: блять мне так смешно от слова срождество,Mon Jan 08 18:47:59 +0000 2018,,,False


In [36]:
# TEST - Run this cell to test your solution
cols = tweetWithMaliciousDF.columns
sample = tweetWithMaliciousDF.first()

dbTest("ET2-P-08-08-01", True, "maliciousAcct" in cols and "id" in cols)
dbTest("ET2-P-08-08-02", 950438954272096257, sample["id"])
dbTest("ET2-P-08-08-03", False, sample["maliciousAcct"])

print("Tests passed!")

## Exercise 4: Load Records

Repartition each of your two DataFrames into 4 partitions and save the results to the following endpoints:

| DataFrame              | Endpoint                                         |
|:-----------------------|:-------------------------------------------------|
| `urlTrendsDF`          | `userhome + "/tmp/urlTrends.parquet"`            |
| `tweetWithMaliciousDF` | `userhome + "/tmp/tweetWithMaliciousDF.parquet"` |

In [38]:
urlTrendsDF.repartition(4).write.mode("OVERWRITE").parquet(userhome + "/tmp/urlTrends.parquet")
tweetWithMaliciousDF.repartition(4).write.mode("OVERWRITE").parquet(userhome + "/tmp/tweetWithMaliciousDF.parquet")

In [39]:
# TEST - Run this cell to test your solution
urlTrendsDFTemp = spark.read.parquet(userhome + "/tmp/urlTrends.parquet")
tweetWithMaliciousDFTemp = spark.read.parquet(userhome + "/tmp/tweetWithMaliciousDF.parquet")

dbTest("ET2-P-08-09-01", 4, urlTrendsDFTemp.rdd.getNumPartitions())
dbTest("ET2-P-08-09-02", 4, tweetWithMaliciousDFTemp.rdd.getNumPartitions())

print("Tests passed!")

## IMPORTANT Next Steps
* Please complete the <a href="https://www.surveymonkey.com/r/VYGM9TD" target="_blank">short feedback survey</a>.  Your input is extremely important and shapes future course development.
* Congratulations, you have completed ETL Part 2!

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>