<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../resources/logo.png" alt="Intellinum Bootcamp" style="width: 600px; height: 163px">
</div>

# Capstone Project: Custom Transformations, Aggregating and Loading

The goal of this project is to populate aggregate tables using Twitter data.  In the process, you write a custom User Defined Function (UDF), aggregate the most trafficked domains over time, join new records to a lookup table, and load the results to a target database.


## Instructions

The Capstone work for the previous course in this series (ETL: Part 1) defined a schema and created tables to populate a relational model. In this capstone project you take the project further.

In this project you ETL JSON Twitter data to build aggregate tables that monitor trending websites and hashtags and filter malicious users using historical data.  Use these four exercises to achieve this goal:<br><br>

1. **Parse tweeted URLs** using a custom UDF
2. **Compute aggregate statistics** of most tweeted websites and hashtags by day
3. **Join new data** to an existing dataset of malicious users
4. **Load records** into a target database

Run the following cell to create the lab environment:

In [1]:
#MODE = "LOCAL"
MODE = "CLUSTER"

import os
import sys
from pyspark import SparkConf
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel
from matplotlib import interactive
interactive(True)
import matplotlib.pyplot as plt
%matplotlib inline
import json
import math
import numbers
import numpy as np
import plotly
plotly.offline.init_notebook_mode(connected=True)

sys.path.insert(0,'../../src')
from settings import *

# Download the packaged Python environment only if it is not already present
if not os.path.exists('../../libs/pyspark24_py36.zip'):
    !aws s3 cp s3://devops.intellinum.co/bins/pyspark24_py36.zip ../../libs/pyspark24_py36.zip

try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession detected")
    print("Creating a new SparkSession")

SPARK_DRIVER_MEMORY= "1G"
SPARK_DRIVER_CORE = "1"
SPARK_EXECUTOR_MEMORY= "1G"
SPARK_EXECUTOR_CORE = "1"
SPARK_EXECUTOR_INSTANCES = 12



conf = None
if MODE == "LOCAL":
    os.environ["PYSPARK_PYTHON"] = "/home/yuan/anaconda3/envs/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_etl_11-custom-etl-project").\
            setMaster('local[*]').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars', '../../libs/mysql-connector-java-5.1.45-bin.jar').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1')
else:
    os.environ["PYSPARK_PYTHON"] = "./MN/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_etl_11-custom-etl-project-database-writes-rj").\
            setMaster('yarn-client').\
            set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
            set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
            set('spark.driver.cores', SPARK_DRIVER_CORE).\
            set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
            set("spark.executor.instances", SPARK_EXECUTOR_INSTANCES).\
            set('spark.sql.files.ignoreCorruptFiles', 'true').\
            set('spark.yarn.dist.archives', '../../libs/pyspark24_py36.zip#MN').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1').\
            set('spark.jars', 's3://devops.intellinum.co/bins/mysql-connector-java-5.1.45-bin.jar')
        

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()


sc = spark.sparkContext

sc.addPyFile('../../src/settings.py')

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

def display(df, limit=10):
    return df.limit(limit).toPandas()

def dfTest(id, expected, result):
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

No existing SparkSession detected
Creating a new SparkSession


## Exercise 1: Parse Tweeted URLs

Some tweets in the dataset contain links to other websites.  Import and explore the dataset using the provided schema.  Then, parse the domain name from these URLs using a custom UDF.

### Step 1: Import and Explore

The following is the schema created as part of the capstone project for ETL Part 1.  
Run the following cell and then use this schema to import one file of the Twitter data.

In [2]:
from pyspark.sql.types import StructField, StructType, ArrayType, StringType, IntegerType, LongType
from pyspark.sql.functions import col

fullTweetSchema = StructType([
  StructField("id", LongType(), True),
  StructField("user", StructType([
    StructField("id", LongType(), True),
    StructField("screen_name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("friends_count", IntegerType(), True),
    StructField("followers_count", IntegerType(), True),
    StructField("description", StringType(), True)
  ]), True),
  StructField("entities", StructType([
    StructField("hashtags", ArrayType(
      StructType([
        StructField("text", StringType(), True)
      ]),
    ), True),
    StructField("urls", ArrayType(
      StructType([
        StructField("url", StringType(), True),
        StructField("expanded_url", StringType(), True),
        StructField("display_url", StringType(), True)
      ]),
    ), True)
  ]), True),
  StructField("lang", StringType(), True),
  StructField("text", StringType(), True),
  StructField("created_at", StringType(), True)
])

Import one file of the JSON data located at `s3a://data.intellinum.co/bootcamp/common/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4` using the schema.  Be sure to do the following:<br><br>

* Save the result to `tweetDF`
* Apply the schema `fullTweetSchema`
* Filter out null values from the `id` column
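
To make the last two bullets concrete, here is a minimal plain-Python sketch (not Spark) of what applying a nullable schema and then dropping null `id` values amounts to. The sample lines and the `parse_tweet` helper are invented for illustration; the assumption is that the raw stream mixes tweets with other records that carry no top-level `id`.

```python
import json

# Hypothetical sample lines: one tweet and one record without a top-level "id".
raw_lines = [
    '{"id": 1, "text": "hello", "lang": "en"}',
    '{"delete": {"status": {"id": 2}}}',
]

def parse_tweet(line, fields=("id", "text", "lang")):
    """Keep only the requested fields; missing fields become None, like a nullable schema."""
    record = json.loads(line)
    return {name: record.get(name) for name in fields}

parsed = [parse_tweet(line) for line in raw_lines]
# Same idea as dropna('any', subset=['id']): discard rows whose id is null.
tweets = [row for row in parsed if row["id"] is not None]
print(tweets)
```

Only the first record survives the filter, which is the behavior to expect from the null filter on `id` in the Spark version.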

In [3]:
# TODO
path = "s3a://data.intellinum.co/bootcamp/common/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"
tweetDF = spark.read.schema(fullTweetSchema).json(path).dropna('any', subset=['id'])

In [4]:
display(tweetDF, 100)

Unnamed: 0,id,user,entities,lang,text,created_at
0,950438954272096257,"(371607576, smileifyou_love, None, 473, 160, •...","([], [])",en,RT @TheTinaVasquez: Quick facts for the know-n...,Mon Jan 08 18:47:59 +0000 2018
1,950438954288914432,"(732417055, bw198e18, None, 1641, 1285, 【期間限定】...","([(diet,)], [])",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10セン...,Mon Jan 08 18:47:59 +0000 2018
2,950438954276450305,"(235927210, marlascigarette, None, 214, 223, △)","([], [])",tr,Ben bir beni bulup icine girip saklanirsam kim...,Mon Jan 08 18:47:59 +0000 2018
3,950438954280472576,"(1564880654, rebaab_1326, None, 45, 0, None)","([(صاروخ_سعودي_يرعب_ايران,)], [(https://t.co/j...",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني...,Mon Jan 08 18:47:59 +0000 2018
4,950438954288889856,"(349070364, puskine, Kampala, Uganda, 5008, 49...","([], [])",en,*Before you argue about your dirty house someo...,Mon Jan 08 18:47:59 +0000 2018
5,950438954280669184,"(340482488, xNina_Beana, the land , 1130, 1646...","([], [])",en,RT @TippyLexx: Bruh you ever accidentally open...,Mon Jan 08 18:47:59 +0000 2018
6,950438954276442113,"(4354072997, gbfranca22, cpx da congo🔞, 252, 6...","([], [])",pt,RT @MorraoTudo2: A liberdade é só questão de t...,Mon Jan 08 18:47:59 +0000 2018
7,950438954276478976,"(738897225061912576, squeeqi, None, 213, 160, ...","([], [])",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018
8,950438954289033216,"(273646363, iiib53, None, 631, 427, None)","([], [])",ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا ...,Mon Jan 08 18:47:59 +0000 2018
9,950438954289033218,"(1541143441, nappo_what, 🇫🇮🇺🇦, 297, 925, None)","([], [])",ru,RT @craneswordboi: блять мне так смешно от сло...,Mon Jan 08 18:47:59 +0000 2018


In [5]:
# TEST - Run this cell to test your solution
dfTest("ET2-P-08-01-01", 1491, tweetDF.count())
dfTest("ET2-P-08-01-02", True, "text" in tweetDF.columns and "id" in tweetDF.columns)

print("Tests passed!")

Tests passed!


### Step 2: Write a UDF to Parse URLs

The Python regular expression library `re` lets you define a pattern that describes the strings you want to match. In this case, extract just the domain name from the URL of a link in a tweet. Take a look at the following example:

```
import re

URL = "https://spark.apache.org/"
pattern = re.compile(r"https?://(www\.)?([^/#?]+).*$")
match = pattern.search(URL)
print("The string {} matched {}".format(URL, match.group(2)))
```

This code prints `The string https://spark.apache.org/ matched spark.apache.org`. **Wrap this code into a function named `getDomain` that takes a parameter `URL` and returns the matched string.**
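
Before wrapping the pattern, it can help to see what each capture group holds. The sketch below reuses the same pattern; the `explain` helper and the sample URLs are only for illustration. Group 1 is the optional `www.` prefix, group 2 is the domain, and `search` returns `None` when the string contains no `http://` or `https://` URL.

```python
import re

pattern = re.compile(r"https?://(www\.)?([^/#?]+).*$")

def explain(url):
    """Return (optional 'www.' prefix, domain), or None when the pattern does not match."""
    match = pattern.search(url)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(explain("https://www.apache.org/foundation"))  # ('www.', 'apache.org')
print(explain("https://spark.apache.org/"))          # (None, 'spark.apache.org')
print(explain("ftp://example.com/file"))             # None
```

The last case is worth keeping in mind: calling `.group(2)` on a failed match raises an `AttributeError`, so a function applied to arbitrary tweet data may want to handle it.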

<a href="https://docs.python.org/3/howto/regex.html" target="_blank">You can find more on the `re` library here.</a>

In [6]:
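# TODO
import re

def getDomain(URL):
    # Skip an optional "www." prefix, then capture everything up to the first "/", "#", or "?"
    pattern = re.compile(r"https?://(www\.)?([^/#?]+).*$")
    match = pattern.search(URL) if URL else None
    if match is None:
        return None  # null or non-HTTP input: nothing to parse
    print("The string {} matched {} ".format(URL, match.group(2)))
    return match.group(2)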

getDomain("https://t.co/2enH654iIr")
getDomain("https://spark.apache.org/")
getDomain("https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html")

The string https://t.co/2enH654iIr matched t.co 
The string https://spark.apache.org/ matched spark.apache.org 
The string https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html matched docs.conda.io 


'docs.conda.io'

In [7]:
# TEST - Run this cell to test your solution
dfTest("ET2-P-08-02-01", "spark.apache.org",  getDomain("https://spark.apache.org/"))

print("Tests passed!")

The string https://spark.apache.org/ matched spark.apache.org 
Tests passed!


### Step 3: Test and Register the UDF

Now that the function works with a single URL, confirm that it works on different URL formats.

In [8]:
# TEST - Run this cell to test your solution
dfTest("ET2-P-08-02-02", "intellinum.co",  getDomain("https://www.intellinum.co/"))
dfTest("ET2-P-08-02-02", "dzone.com",  getDomain("https://dzone.com/articles/spark-streaming-vs-structured-streaming"))
dfTest("ET2-P-08-02-02", "docs.conda.io",  getDomain("https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html"))
dfTest("ET2-P-08-02-02", "datacamp.com",  getDomain("https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python"))
dfTest("ET2-P-08-02-02", "spark.apache.org",  getDomain("https://spark.apache.org/docs/2.4.0/sql-pyspark-pandas-with-arrow.html"))
dfTest("ET2-P-08-02-02", "apache.org",  getDomain("http://www.apache.org/"))

print("Tests passed!")

The string https://www.intellinum.co/ matched intellinum.co 
The string https://dzone.com/articles/spark-streaming-vs-structured-streaming matched dzone.com 
The string https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html matched docs.conda.io 
The string https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python matched datacamp.com 
The string https://spark.apache.org/docs/2.4.0/sql-pyspark-pandas-with-arrow.html matched spark.apache.org 
The string http://www.apache.org/ matched apache.org 
Tests passed!


Register the UDF as `getDomainUDF`.

In [9]:
from pyspark.sql.types import StringType

getDomainUDF = spark.udf.register('getDomainUDFSQL', getDomain, StringType())

Run the following cell to test your function further.

In [10]:
# TEST - Run this cell to test your solution
dfTest("ET2-P-08-03-01", True, bool(getDomainUDF))

print("Tests passed!")

Tests passed!


### Step 4: Apply the UDF

Create a dataframe called `urlDF` that has three columns:<br><br>

1. `URL`: The URLs from `tweetDF` (located in `entities.urls.expanded_url`)
2. `parsedURL`: The UDF applied to the column `URL`
3. `created_at`

There can be zero, one, or many URLs in any tweet.  For this step, use the `explode` function, which takes an array column such as `entities.urls` and returns one row for each value in the array.
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.explode" target="_blank">See the documentation here for details.</a>
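
As a rough mental model of `explode`, here is a plain-Python sketch (not Spark). The rows and the `explode_urls` helper are hypothetical and only mimic the shape of the tweet data.

```python
# Hypothetical rows shaped like the tweet data: each has a list of URL structs.
rows = [
    {"created_at": "t1", "urls": []},
    {"created_at": "t2", "urls": [{"expanded_url": "https://a.example/x"}]},
    {"created_at": "t3", "urls": [{"expanded_url": "https://b.example/y"},
                                   {"expanded_url": "https://c.example/z"}]},
]

def explode_urls(rows):
    """Yield one output row per URL; rows with an empty list produce nothing."""
    for row in rows:
        for url in row["urls"]:
            yield {"URL": url["expanded_url"], "created_at": row["created_at"]}

exploded = list(explode_urls(rows))
print(len(exploded))
```

Three input rows become three output rows here, but not one-to-one: the tweet without URLs disappears and the tweet with two URLs appears twice, each time carrying its own `created_at`.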

In [11]:
urlSchema = StructType([
    StructField("id", LongType(), True),
    StructField('entities', StructType([
        StructField('urls', ArrayType(
            StructType([
                StructField('display_url', StringType(), True),
                StructField('expanded_url', StringType(), True),
                StructField('url', StringType(), True)
            ])), True)
    ])),
    StructField("created_at", StringType(), True)
])


In [12]:
urlDFPath = "s3a://data.intellinum.co/bootcamp/common/twitter/firehose/2018/01/08/18/twitterstream-1-2018-01-08-18-48-00-bcf3d615-9c04-44ec-aac9-25f966490aa4"
urlTempDF = spark.read.schema(urlSchema).json(urlDFPath).dropna('any', subset=['id']).drop('id')

In [13]:
display(urlTempDF)

Unnamed: 0,entities,created_at
0,"([],)",Mon Jan 08 18:47:59 +0000 2018
1,"([],)",Mon Jan 08 18:47:59 +0000 2018
2,"([],)",Mon Jan 08 18:47:59 +0000 2018
3,"([(youtube.com/watch?v=b4iz9n…, https://www.yo...",Mon Jan 08 18:47:59 +0000 2018
4,"([],)",Mon Jan 08 18:47:59 +0000 2018
5,"([],)",Mon Jan 08 18:47:59 +0000 2018
6,"([],)",Mon Jan 08 18:47:59 +0000 2018
7,"([],)",Mon Jan 08 18:47:59 +0000 2018
8,"([],)",Mon Jan 08 18:47:59 +0000 2018
9,"([],)",Mon Jan 08 18:47:59 +0000 2018


In [14]:
# TODO
# Explode the array of URL structs into one row per URL, then pull out the expanded URL
urldf = urlTempDF.select(
      F.explode('entities.urls').alias('element')
    , F.col('element.expanded_url').alias('URL')
    , F.col('created_at')
).drop('element')

urlDF = urldf.withColumn('parsedURL', getDomainUDF(urldf.URL))
display(urlDF)

Unnamed: 0,URL,created_at,parsedURL
0,https://www.youtube.com/watch?v=b4iz9nZPzAA,Mon Jan 08 18:47:59 +0000 2018,youtube.com
1,https://twitter.com/i/web/status/9504389542847...,Mon Jan 08 18:47:59 +0000 2018,twitter.com
2,http://bit.ly/OYlKII,Mon Jan 08 18:47:59 +0000 2018,bit.ly
3,https://goo.gl/fb/atjACB,Mon Jan 08 18:47:59 +0000 2018,goo.gl
4,https://www.instagram.com/p/BdsvNFABXNL/,Mon Jan 08 18:47:59 +0000 2018,instagram.com
5,https://twitter.com/i/web/status/9504389543016...,Mon Jan 08 18:47:59 +0000 2018,twitter.com
6,http://alathkar.org,Mon Jan 08 18:47:59 +0000 2018,alathkar.org
7,https://open.spotify.com/album/60eldei5NAr7QjP...,Mon Jan 08 18:47:59 +0000 2018,open.spotify.com
8,https://twitter.com/i/web/status/9504389543016...,Mon Jan 08 18:47:59 +0000 2018,twitter.com
9,https://twitter.com/frann_frann_/status/949080...,Mon Jan 08 18:47:59 +0000 2018,twitter.com


In [15]:
# TEST - Run this cell to test your solution
cols = urlDF.columns
sample = urlDF.first()

dfTest("ET2-P-08-04-01", True, "URL" in cols and "parsedURL" in cols and "created_at" in cols)
dfTest("ET2-P-08-04-02", "https://www.youtube.com/watch?v=b4iz9nZPzAA", sample["URL"])
dfTest("ET2-P-08-04-03", "Mon Jan 08 18:47:59 +0000 2018", sample["created_at"])
dfTest("ET2-P-08-04-04", "youtube.com", sample["parsedURL"])

print("Tests passed!")

Tests passed!


## Exercise 2: Compute Aggregate Statistics

Calculate the top 10 trending URLs by hour.

### Step 1: Parse the Timestamp

Create a DataFrame `urlWithTimestampDF` that includes the following columns:<br><br>

* `URL`
* `parsedURL`
* `timestamp`
* `hour`

Import `unix_timestamp` and `hour` from the `functions` module and `TimestampType` from the `types` module. To parse the `created_at` field, use `unix_timestamp` with the format `EEE MMM dd HH:mm:ss ZZZZZ yyyy`.
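
To check what the format string should produce, the same timestamp can be parsed with the standard library. This is a plain-Python sketch, not the Spark solution; `TWITTER_FORMAT` and `parse_created_at` are illustrative names, and the `strptime` directives are an approximate counterpart of the Spark pattern.

```python
from datetime import datetime

# strptime directives that mirror the Spark pattern "EEE MMM dd HH:mm:ss ZZZZZ yyyy".
# %a and %b are locale-dependent; this assumes an English locale.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(value):
    """Parse a Twitter-style created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_FORMAT)

ts = parse_created_at("Mon Jan 08 18:47:59 +0000 2018")
print(ts.isoformat(), ts.hour)
```

The parsed hour is 18 in UTC. Spark's `hour` function reports the hour in the session time zone, so the same value in Spark presumably assumes a UTC session.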

In [16]:
# TODO
from pyspark.sql.types import TimestampType

urlWithTimestampTempDF = urlDF.select(
     F.col('URL')
    ,F.col('parsedURL')
    ,F.unix_timestamp(F.col('created_at'), 'EEE MMM dd HH:mm:ss ZZZZZ yyyy').cast(TimestampType()).alias("timestamp")
)

urlWithTimestampDF = urlWithTimestampTempDF.withColumn('hour', F.hour('timestamp'))


In [17]:
display(urlWithTimestampDF, 100)

Unnamed: 0,URL,parsedURL,timestamp,hour
0,https://www.youtube.com/watch?v=b4iz9nZPzAA,youtube.com,2018-01-08 18:47:59,18
1,https://twitter.com/i/web/status/9504389542847...,twitter.com,2018-01-08 18:47:59,18
2,http://bit.ly/OYlKII,bit.ly,2018-01-08 18:47:59,18
3,https://goo.gl/fb/atjACB,goo.gl,2018-01-08 18:47:59,18
4,https://www.instagram.com/p/BdsvNFABXNL/,instagram.com,2018-01-08 18:47:59,18
5,https://twitter.com/i/web/status/9504389543016...,twitter.com,2018-01-08 18:47:59,18
6,http://alathkar.org,alathkar.org,2018-01-08 18:47:59,18
7,https://open.spotify.com/album/60eldei5NAr7QjP...,open.spotify.com,2018-01-08 18:47:59,18
8,https://twitter.com/i/web/status/9504389543016...,twitter.com,2018-01-08 18:47:59,18
9,https://twitter.com/frann_frann_/status/949080...,twitter.com,2018-01-08 18:47:59,18


In [18]:
# TEST - Run this cell to test your solution
cols = urlWithTimestampDF.columns
sample = urlWithTimestampDF.first()

dfTest("ET2-P-08-05-01", True, "URL" in cols and "parsedURL" in cols and "timestamp" in cols and "hour" in cols)
dfTest("ET2-P-08-05-02", 18, sample["hour"])

print("Tests passed!")

Tests passed!


### Step 2: Calculate Trending URLs

Create a DataFrame `urlTrendsDF` that looks at the top 10 hourly counts of domain names and includes the following columns:<br><br>

* `hour`
* `parsedURL`
* `count`

The result should sort `hour` in ascending order and `count` in descending order.
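
The grouping and ordering described above can be sketched in plain Python to clarify what "top 10 per hour" means. The observations and the `top_domains_by_hour` helper are hypothetical. In Spark, a per-hour limit is commonly expressed with a window function such as `row_number` partitioned by `hour`, while a plain `groupBy(...).count()` followed by `orderBy` returns every domain rather than only the top 10.

```python
from collections import Counter, defaultdict

# Hypothetical (hour, domain) pairs standing in for urlWithTimestampDF rows.
observations = [(18, "twitter.com"), (18, "bit.ly"), (18, "twitter.com"),
                (19, "fb.me"), (19, "fb.me"), (19, "bit.ly"), (18, "goo.gl")]

def top_domains_by_hour(pairs, n=10):
    """Return (hour, domain, count) rows: hours ascending, counts descending, at most n per hour."""
    per_hour = defaultdict(Counter)
    for hour, domain in pairs:
        per_hour[hour][domain] += 1
    rows = []
    for hour in sorted(per_hour):
        for domain, count in per_hour[hour].most_common(n):
            rows.append((hour, domain, count))
    return rows

result = top_domains_by_hour(observations, n=2)
for row in result:
    print(row)
```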

In [19]:
urlTrendsDF = urlWithTimestampDF.groupby('hour', 'parsedURL').count().orderBy(F.asc('hour'), F.desc('count'))

In [20]:
display(urlTrendsDF)

Unnamed: 0,hour,parsedURL,count
0,18,twitter.com,159
1,18,bit.ly,25
2,18,fb.me,17
3,18,youtu.be,16
4,18,du3a.org,15
5,18,goo.gl,12
6,18,instagram.com,10
7,18,curiouscat.me,6
8,18,dlvr.it,4
9,18,youtube.com,3


In [21]:
# TEST - Run this cell to test your solution
cols = urlTrendsDF.columns
sample = urlTrendsDF.first()

dfTest("ET2-P-08-06-01", True, "hour" in cols and "parsedURL" in cols and "count" in cols)
dfTest("ET2-P-08-06-02", 18, sample["hour"])
dfTest("ET2-P-08-06-03", "twitter.com", sample["parsedURL"])
dfTest("ET2-P-08-06-04", 159, sample["count"])

print("Tests passed!")

Tests passed!


## Exercise 3: Join New Data

Flag tweets posted by known bad actors.

### Step 1: Import Table of Bad Actors

Create the DataFrame `badActorsDF`, a list of bad actors that sits in `s3a://data.intellinum.co/bootcamp/common/twitter/supplemental/badactors.parquet`.

In [22]:
# TODO
badActorPath = "s3a://data.intellinum.co/bootcamp/common/twitter/supplemental/badactors.parquet"
badActorsDF = spark.read.parquet(badActorPath)

In [30]:
display(badActorsDF)
print(badActorsDF.count())

72


In [24]:
# TEST - Run this cell to test your solution
cols = badActorsDF.columns
sample = badActorsDF.first()

dfTest("ET2-P-08-07-01", True, "userID" in cols and "screenName" in cols)
dfTest("ET2-P-08-07-02", 4875602384, sample["userID"])
dfTest("ET2-P-08-07-03", "cris_silvag1", sample["screenName"])

print("Tests passed!")

Tests passed!


### Step 2: Add a Column for Bad Actors

Add a new column to `tweetDF` called `maliciousAcct` that is `true` if the user is in `badActorsDF` and `false` otherwise.  Save the results to `tweetWithMaliciousDF`.  Remember to do a left join of the malicious accounts on `tweetDF` so that no tweets are dropped.
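
The left-join-then-flag logic can be sketched in plain Python. The sample tweets, the ID set, and the `flag_malicious` helper are hypothetical; the real IDs come from `tweetDF` and `badActorsDF`.

```python
# Hypothetical sample data standing in for tweetDF rows and badActorsDF user IDs.
tweets = [
    {"id": 101, "user_id": 1},
    {"id": 102, "user_id": 2},
    {"id": 103, "user_id": 3},
]
bad_actor_ids = {2}

def flag_malicious(tweets, bad_ids):
    """Left-join semantics: every tweet is kept, and the flag records whether a match exists."""
    return [dict(tweet, maliciousAcct=tweet["user_id"] in bad_ids) for tweet in tweets]

flagged = flag_malicious(tweets, bad_actor_ids)
print([(t["id"], t["maliciousAcct"]) for t in flagged])
```

The row count is unchanged by the join, which is a useful sanity check: an inner join would silently drop every tweet from a user who is not a bad actor.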

In [31]:
# TODO
tweetJoinedDF = tweetDF.join(F.broadcast(badActorsDF), tweetDF.user.id == badActorsDF.userID, 'left')


In [43]:
tweetJoinedDF.printSchema()

root
 |-- id: long (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- screen_name: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- friends_count: integer (nullable = true)
 |    |-- followers_count: integer (nullable = true)
 |    |-- description: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- urls: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- display_url: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- userID: long (nullable = true)
 |-- screenName: string (nullable = true)



In [34]:
# tweetJoinedDF.first()
display(tweetJoinedDF, 100)

Unnamed: 0,id,user,entities,lang,text,created_at,userID,screenName
0,950438954272096257,"(371607576, smileifyou_love, None, 473, 160, •...","([], [])",en,RT @TheTinaVasquez: Quick facts for the know-n...,Mon Jan 08 18:47:59 +0000 2018,,
1,950438954288914432,"(732417055, bw198e18, None, 1641, 1285, 【期間限定】...","([(diet,)], [])",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10セン...,Mon Jan 08 18:47:59 +0000 2018,,
2,950438954276450305,"(235927210, marlascigarette, None, 214, 223, △)","([], [])",tr,Ben bir beni bulup icine girip saklanirsam kim...,Mon Jan 08 18:47:59 +0000 2018,,
3,950438954280472576,"(1564880654, rebaab_1326, None, 45, 0, None)","([(صاروخ_سعودي_يرعب_ايران,)], [(https://t.co/j...",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني...,Mon Jan 08 18:47:59 +0000 2018,,
4,950438954288889856,"(349070364, puskine, Kampala, Uganda, 5008, 49...","([], [])",en,*Before you argue about your dirty house someo...,Mon Jan 08 18:47:59 +0000 2018,,
5,950438954280669184,"(340482488, xNina_Beana, the land , 1130, 1646...","([], [])",en,RT @TippyLexx: Bruh you ever accidentally open...,Mon Jan 08 18:47:59 +0000 2018,,
6,950438954276442113,"(4354072997, gbfranca22, cpx da congo🔞, 252, 6...","([], [])",pt,RT @MorraoTudo2: A liberdade é só questão de t...,Mon Jan 08 18:47:59 +0000 2018,,
7,950438954276478976,"(738897225061912576, squeeqi, None, 213, 160, ...","([], [])",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018,,
8,950438954289033216,"(273646363, iiib53, None, 631, 427, None)","([], [])",ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا ...,Mon Jan 08 18:47:59 +0000 2018,273646363.0,iiib53
9,950438954289033218,"(1541143441, nappo_what, 🇫🇮🇺🇦, 297, 925, None)","([], [])",ru,RT @craneswordboi: блять мне так смешно от сло...,Mon Jan 08 18:47:59 +0000 2018,,


In [33]:
print(tweetJoinedDF.count(), tweetDF.count())

1491 1491


In [44]:
tweetWithMaliciousDF = tweetJoinedDF.withColumn('maliciousAcct', F.when(F.col('userID').isNotNull(), True)\
                                               .otherwise(False))

In [45]:
display(tweetWithMaliciousDF)

Unnamed: 0,id,user,entities,lang,text,created_at,userID,screenName,maliciousAcct
0,950438954272096257,"(371607576, smileifyou_love, None, 473, 160, •...","([], [])",en,RT @TheTinaVasquez: Quick facts for the know-n...,Mon Jan 08 18:47:59 +0000 2018,,,False
1,950438954288914432,"(732417055, bw198e18, None, 1641, 1285, 【期間限定】...","([(diet,)], [])",ja,【太ももを引きしめるエクササイズ】足あげ～１．うつぶせに寝る。２．片方の足先を床から10セン...,Mon Jan 08 18:47:59 +0000 2018,,,False
2,950438954276450305,"(235927210, marlascigarette, None, 214, 223, △)","([], [])",tr,Ben bir beni bulup icine girip saklanirsam kim...,Mon Jan 08 18:47:59 +0000 2018,,,False
3,950438954280472576,"(1564880654, rebaab_1326, None, 45, 0, None)","([(صاروخ_سعودي_يرعب_ايران,)], [(https://t.co/j...",ar,تواصل قالوا عن قطر المتحدث باسم الجيش الصهيوني...,Mon Jan 08 18:47:59 +0000 2018,,,False
4,950438954288889856,"(349070364, puskine, Kampala, Uganda, 5008, 49...","([], [])",en,*Before you argue about your dirty house someo...,Mon Jan 08 18:47:59 +0000 2018,,,False
5,950438954280669184,"(340482488, xNina_Beana, the land , 1130, 1646...","([], [])",en,RT @TippyLexx: Bruh you ever accidentally open...,Mon Jan 08 18:47:59 +0000 2018,,,False
6,950438954276442113,"(4354072997, gbfranca22, cpx da congo🔞, 252, 6...","([], [])",pt,RT @MorraoTudo2: A liberdade é só questão de t...,Mon Jan 08 18:47:59 +0000 2018,,,False
7,950438954276478976,"(738897225061912576, squeeqi, None, 213, 160, ...","([], [])",en,I just want this all to be over,Mon Jan 08 18:47:59 +0000 2018,,,False
8,950438954289033216,"(273646363, iiib53, None, 631, 427, None)","([], [])",ar,RT @Arab_original: للاسف قطاع كان ممكن حل ولا ...,Mon Jan 08 18:47:59 +0000 2018,273646363.0,iiib53,True
9,950438954289033218,"(1541143441, nappo_what, 🇫🇮🇺🇦, 297, 925, None)","([], [])",ru,RT @craneswordboi: блять мне так смешно от сло...,Mon Jan 08 18:47:59 +0000 2018,,,False


In [46]:
# TEST - Run this cell to test your solution
cols = tweetWithMaliciousDF.columns
sample = tweetWithMaliciousDF.first()

dfTest("ET2-P-08-08-01", True, "maliciousAcct" in cols and "id" in cols)
dfTest("ET2-P-08-08-02", 950438954272096257, sample["id"])
dfTest("ET2-P-08-08-03", False, sample["maliciousAcct"])

print("Tests passed!")

Tests passed!


## Exercise 4: Load Records

Repartition each of your two DataFrames into 4 partitions and save the results to the following endpoints:

| DataFrame              | Endpoint                                         |
|:-----------------------|:-------------------------------------------------|
| `urlTrendsDF`          | `userhome + "/tmp/urlTrends.parquet"`            |
| `tweetWithMaliciousDF` | `userhome + "/tmp/tweetWithMaliciousDF.parquet"` |
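
As a rough intuition for why writing 4 partitions yields 4 part files of similar size, here is a plain-Python sketch that deals rows into buckets in turn. The `round_robin_partitions` helper is hypothetical, and Spark's actual row assignment differs in detail; the point is only that rows are spread roughly evenly and each partition is written as its own file.

```python
def round_robin_partitions(rows, num_partitions):
    """Deal rows into num_partitions buckets in turn, so bucket sizes differ by at most one."""
    buckets = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        buckets[index % num_partitions].append(row)
    return buckets

parts = round_robin_partitions(list(range(10)), 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```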

In [47]:
# TODO
YOUR_FIRST_NAME = 'rajeev'
userhome = f"s3a://temp.intellinum.co/{YOUR_FIRST_NAME}"
urlTrendsDF.repartition(4).write.mode('OVERWRITE').parquet(userhome+'/tmp/urlTrends.parquet')
tweetWithMaliciousDF.repartition(4).write.mode('OVERWRITE').parquet(userhome+'/tmp/tweetWithMaliciousDF.parquet')

In [48]:
!aws s3 ls {userhome.replace('s3a','s3') + "/tmp/urlTrends.parquet"}/
!aws s3 ls {userhome.replace('s3a','s3') + "/tmp/tweetWithMaliciousDF.parquet"}/


2019-06-25 13:45:22          0 _SUCCESS
2019-06-25 13:45:21       1364 part-00000-f33cae13-09a1-4eec-a458-133cfaba4799-c000.snappy.parquet
2019-06-25 13:45:21       1360 part-00001-f33cae13-09a1-4eec-a458-133cfaba4799-c000.snappy.parquet
2019-06-25 13:45:21       1348 part-00002-f33cae13-09a1-4eec-a458-133cfaba4799-c000.snappy.parquet
2019-06-25 13:45:22       1466 part-00003-f33cae13-09a1-4eec-a458-133cfaba4799-c000.snappy.parquet
2019-06-25 13:45:25          0 _SUCCESS
2019-06-25 13:45:24      92076 part-00000-9b36ea02-c76c-4d78-86e7-55d101142c0b-c000.snappy.parquet
2019-06-25 13:45:24      88110 part-00001-9b36ea02-c76c-4d78-86e7-55d101142c0b-c000.snappy.parquet
2019-06-25 13:45:25      90022 part-00002-9b36ea02-c76c-4d78-86e7-55d101142c0b-c000.snappy.parquet
2019-06-25 13:45:25      87642 part-00003-9b36ea02-c76c-4d78-86e7-55d101142c0b-c000.snappy.parquet


In [49]:
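# TEST - Run this cell to test your solution
# Note: the partition count seen on read depends on file sizes and cluster parallelism,
# not only on the number of part files written, so the expected value is environment-specific.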
urlTrendsDFTemp = spark.read.parquet(userhome + "/tmp/urlTrends.parquet")
tweetWithMaliciousDFTemp = spark.read.parquet(userhome + "/tmp/tweetWithMaliciousDF.parquet")

dfTest("ET2-P-08-09-01", 2, urlTrendsDFTemp.rdd.getNumPartitions())
dfTest("ET2-P-08-09-02", 2, tweetWithMaliciousDFTemp.rdd.getNumPartitions())

print("Tests passed!")

Tests passed!


## IMPORTANT Next Steps
* Please complete the <a href="https://docs.google.com/forms/d/e/1FAIpQLSd5whqoFBjNEEMvgwW5KRr-PeMyv6Lsczxk1p0es9s3IigEYQ/viewform?vc=0&c=0&w=1" target="_blank">short feedback survey</a>.  Your input is extremely important and shapes future course development.
* Congratulations, you have completed ETL Part 2!

&copy; 2019 [Intellinum Analytics, Inc](http://www.intellinum.co). All rights reserved.<br/>