### Second session on Dataframes + introduction to Structured Streaming

* Dataframes
    * Custom schemas
    * Pivot
    * Window functions
    * Delta Lake

#### Custom schema

In [0]:
# Let's import a JSON dataset and have a look at the file

import requests
r = requests.get("https://think.cs.vt.edu/corgis/datasets/json/airlines/airlines.json")

print(r.json()[:2])

In [0]:
# creating a dataframe from the parsed JSON (a list of Python dictionaries) gives us many nested MapType columns
from pyspark.sql import Row

#airlines_df = spark.createDataFrame(r.json()) # another way of creating a df from dicts. Inferring the schema from Python dicts is deprecated, but this may still work
airlines_df = spark.createDataFrame([Row(**x) for x in r.json()]) # unpacking each JSON element into a Spark Row

display(airlines_df)

In [0]:
# let's test some functions on map columns

from pyspark.sql.functions import *

display(airlines_df
        .select(explode("Airport") # we can try explode - but this does not make sense for this kind of data
                #,col("Airport").getItem("Code").alias("AirportCode") # we can use getItem to fetch values per key - good for small map columns, not good for large/nested map columns
                #,col("Airport").getItem("Name").alias("AirportName")
                #,col("Airport.Code"),col("Airport.Name") # alternative, easier to call but may confuse as it looks like struct. But Airport.* does not work for map
                #,map_keys("Airport") # array of keys
                #,map_values("Airport") # array of values
                ,"*")
        )

In [0]:
# usually, a struct column is more useful and easier to access than a map column
# to get struct columns here, we need to define a custom schema when creating the DataFrame.
# this is also recommended for real-life data pipelines, especially with a large number of small data files
# potential downside: no automatic schema evolution

from pyspark.sql.types import *

# The outer part needs to be a StructType
# A StructType needs to consist of StructFields
# StructFields have 3 parameters: name, type, nullable

# Note: you can remove or add parts of schema
# Note2: name has to match the key/column name in the dataset.

airport_schema = StructType([
  StructField("Airport",StructType([
    StructField("Code", StringType(), True),
    StructField("Name", StringType(), True)
  ]),True), 
  StructField("Statistics",StructType([
    StructField("Carriers", StructType([
      StructField("Names", StringType(), True),
      StructField("Total", IntegerType(), True)
    ]), True),
    StructField("Minutes Delayed", StructType([
      StructField("Late Aircraft", LongType(), True),
      StructField("National Aviation System", LongType(), True),
      StructField("Weather", LongType(), True),
      StructField("Carrier", LongType(), True),
      StructField("Security", LongType(), True),
      StructField("Total", LongType(), True)
    ]), True),
    StructField("Flights", StructType([
      StructField("Delayed", LongType(), True),
      StructField("Diverted", LongType(), True),
      StructField("Cancelled", LongType(), True),
      StructField("On Time", LongType(), True),
      StructField("Total", LongType(), True)
    ]), True),
    StructField("# of Delays", StructType([
      StructField("Late Aircraft", LongType(), True),
      StructField("National Aviation System", LongType(), True),
      StructField("Weather", LongType(), True),
      StructField("Carrier", LongType(), True),
      StructField("Security", LongType(), True)
    ]), True)
  ]),True),
  StructField("Time",StructType([
    StructField("Label", StringType(), True),
    StructField("Month", IntegerType(), True),
    StructField("Year", IntegerType(), True),
    StructField("Month Name", StringType(), True)
  ]),True)
  #StructField("MyNullTimestamp", TimestampType(), True)
])

In [0]:
# we can now provide this schema as input in our dataframe creation

airport_schema_df = spark.createDataFrame(r.json(), schema=airport_schema) #spark is not inferring schema, so inputting dictionary works fine
display(airport_schema_df)

In [0]:
# another way to create a schema is using StructType's "add" method

airport_add_schema = (StructType()
                      .add("Airport", StructType()
                          .add("Code", StringType())
                          .add("Name", StringType())
                          )
                      .add("Time", StructType()
                          .add("Month",IntegerType())
                          .add("Year",IntegerType()))
                     )

airport_add_schema_df = spark.createDataFrame(r.json(), schema=airport_add_schema) #spark is not inferring schema, so inputting dictionary works fine
display(airport_add_schema_df)

In [0]:
# third method, raw string
airport_string_schema = "Airport STRUCT<Code: STRING, Name: STRING>, Time STRUCT<Month: INTEGER, Year: INTEGER, Date: INTEGER>" # date will be null

airport_string_schema_df = spark.createDataFrame(r.json(), schema=airport_string_schema) #spark is not inferring schema, so inputting dictionary works fine
display(airport_string_schema_df)
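
To make the effect of a trimmed or extended schema concrete, here is a plain-Python sketch (no Spark required) of projecting a nested record onto a schema: fields the schema does not list are dropped, and listed fields that are missing from the data come back as null. The `project` helper, the dict-based schema notation and the sample record are illustrative assumptions, not Spark's implementation.

```python
# Toy model of applying a struct schema to a nested dict.
# A schema is a dict: field name -> type name (leaf) or nested dict (struct).
def project(record, schema):
    """Keep only the fields named in the schema; missing fields become None."""
    if record is None:
        return None
    out = {}
    for name, field_type in schema.items():
        value = record.get(name)
        if isinstance(field_type, dict):
            out[name] = project(value, field_type)  # nested struct
        else:
            out[name] = value
    return out


# Hypothetical record shaped like one element of the airlines data
sample = {
    "Airport": {"Code": "ATL", "Name": "Atlanta, GA"},
    "Time": {"Label": "2003/06", "Month": 6, "Year": 2003},
}

# Same shape as the string schema above: Label is left out, Date does not exist in the data
schema = {
    "Airport": {"Code": "string", "Name": "string"},
    "Time": {"Month": "int", "Year": "int", "Date": "int"},
}

projected = project(sample, schema)
print(projected)
```

The `Label` field disappears because the schema does not mention it, while `Date` shows up as `None`, which mirrors the null `Date` column produced by the string schema.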

In [0]:
# It is now much easier to navigate and manipulate the fields.

display(airport_schema_df
        .select("Airport.*"
               ,"*"
               )
       )

#### Pivot

In [0]:
# pivot table - summarizing a more extensive table
# let's first load in a dataset

airbnb_df = spark.read.parquet("/mnt/training/airbnb/amsterdam-listings/amsterdam-listings-2018-12-06.parquet/*")
display(airbnb_df)

In [0]:
# the dataset has many columns. Let's say we are interested in the average prices per city and neighbourhood.
# we are also interested in the size of the place - how many people it accommodates

display(airbnb_df.select("city"
                       , "neighbourhood"
                       , "accommodates"
                       , "price")
       )

In [0]:
from pyspark.sql.functions import desc

display(airbnb_df
        .select("city", "neighbourhood", "accommodates", "price")
        .groupby("city", "neighbourhood") # "row" 
        .pivot("accommodates") # columns
        .mean("price") # data / values     # possible options: mean, sum, min, max, count
        #.na.fill(0) # for filling out null values. 0 for counts/sums
        #.orderBy(desc("2")) # for ordering
       )
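
As a rough mental model of what `groupBy(...).pivot(...).mean(...)` computes, the following plain-Python sketch builds the same kind of table by hand: one output row per group, one column per distinct pivot value, and the mean price in each cell. The sample rows are made up for illustration and the code is only a sketch of the idea, not of how Spark executes a pivot.

```python
from collections import defaultdict

# Hypothetical listings: (city, neighbourhood, accommodates, price)
rows = [
    ("Amsterdam", "Centrum", 2, 100.0),
    ("Amsterdam", "Centrum", 2, 140.0),
    ("Amsterdam", "Centrum", 4, 200.0),
    ("Amsterdam", "Noord", 2, 80.0),
]

# Collect the prices that fall into each (group, pivot value) cell
cells = defaultdict(list)
for city, neighbourhood, accommodates, price in rows:
    cells[((city, neighbourhood), accommodates)].append(price)

groups = sorted({group for group, _ in cells})            # the "rows"
pivot_values = sorted({value for _, value in cells})      # the "columns"


def mean_or_none(prices):
    """Mean of a cell, or None when no input row fell into it."""
    return sum(prices) / len(prices) if prices else None


pivoted = {
    group: {value: mean_or_none(cells.get((group, value), [])) for value in pivot_values}
    for group in groups
}

for group, columns in pivoted.items():
    print(group, columns)
```

Cells without any matching input rows stay empty (`None` here, null in Spark), which is why filling nulls is often applied after a pivot on counts or sums.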

In [0]:
# another pivot example

df_wiki = spark.read.parquet("/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/*")

display(df_wiki)

In [0]:
display(df_wiki
        .selectExpr("cast(timestamp as date) as date"
                   ,"hour(timestamp) as hour"
                   ,"site"
                   ,"requests")
        .groupBy("hour") #"date"
        .pivot("site")
        .sum("requests")
        .orderBy("hour") #"date" 
       )

#### SQL window functions

_Note: a "window" can be many different things. Here we talk about classical SQL window functions_

In [0]:
# let's load in a new dataset and have a look at it
healthcare_df = spark.read.parquet("/mnt/training/healthcare/tracker/health_profile_data.snappy.parquet")
display(healthcare_df)

In [0]:
# we need to import the Window API and instantiate a Window specification object
# let's also import pyspark.sql.functions for using aggregations and functions  

from pyspark.sql.functions import *
from pyspark.sql import Window

window_spec = Window.partitionBy("_id").orderBy("dte") 

display(healthcare_df
       .withColumn("row_num", row_number().over(window_spec)) # similarly can use rank and dense_rank
       )

In [0]:
# you can use multiple window specs in parallel
  
window_spec_by_hr = Window.partitionBy("_id").orderBy("resting_heartrate")

display(healthcare_df
       .withColumn("row_num", row_number().over(window_spec)) # similarly can use rank and dense_rank
       .withColumn("rank_num", rank().over(window_spec_by_hr))
       )
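
The difference between `row_number`, `rank` and `dense_rank` only shows up when the ordering column has ties. The plain-Python sketch below reproduces the three numbering schemes for one partition at a time; the readings are hypothetical and the helper name `rank_partition` is an assumption made for this example.

```python
from itertools import groupby

# Hypothetical (_id, resting_heartrate) readings
rows = [("a", 60), ("a", 62), ("a", 60), ("b", 70), ("a", 65)]


def rank_partition(values):
    """Return (value, row_number, rank, dense_rank) for one ordered partition."""
    out = []
    rank = 0
    dense = 0
    prev = object()  # sentinel that compares unequal to every value
    for row_num, value in enumerate(sorted(values), start=1):
        if value != prev:
            rank = row_num  # rank jumps to the current position after a tie
            dense += 1      # dense_rank always increases by exactly one
            prev = value
        out.append((value, row_num, rank, dense))
    return out


# partitionBy("_id"): sort by key, then handle each key's rows separately
ranked = {}
for key, group in groupby(sorted(rows), key=lambda r: r[0]):
    ranked[key] = rank_partition([value for _, value in group])

for key, result in ranked.items():
    print(key, result)
```

For partition `a` the two readings of 60 share rank 1; `rank` then continues at 3, `dense_rank` continues at 2, and `row_number` simply counts 1 to 4.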

In [0]:
# use lag / lead for viewing "back" or "ahead" within a partition's rows

display(healthcare_df
       .withColumn("lag", lag("resting_heartrate").over(window_spec)) # use for getting previous/next value in partition. 
       .withColumn("lead", lead("resting_heartrate", 5, 0).over(window_spec)) # Can define offset and default value
       #.withColumn("diffToPrev",expr("resting_heartrate - lag")) # useful for getting the increase/decrease
       )
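
For a single, already ordered partition, `lag` and `lead` amount to shifting the column by an offset and filling the gap with a default. A minimal plain-Python sketch, using made-up heart-rate values and helper functions written only for this illustration:

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before the current row, or `default` if there is none."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value `offset` rows after the current row, or `default` if there is none."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


# Hypothetical readings for one _id, already ordered by date
heartrates = [60, 62, 61, 65, 64, 63, 66]

lagged = lag(heartrates)      # previous reading, None for the first row
led = lead(heartrates, 5, 0)  # reading 5 rows ahead, 0 when past the end

# Difference to the previous reading, undefined for the first row
diff_to_prev = [None if prev is None else cur - prev
                for cur, prev in zip(heartrates, lagged)]

print(lagged)
print(led)
print(diff_to_prev)
```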


In [0]:
# rolling windows - useful for moving averages, rolling total, etc

window_spec_rolling = Window.partitionBy("_id").orderBy("dte").rowsBetween(Window.unboundedPreceding, Window.currentRow) # rolling aggregations - everything up to current row
#window_spec_rolling_last_week = Window.partitionBy("_id").orderBy("dte").rowsBetween(-6, Window.currentRow) # rolling aggregations, use negative integer for previous rows
#window_spec_rolling_plusmin2 = Window.partitionBy("_id").orderBy("dte").rowsBetween(-2, 2) # rolling aggregations, use positive integer for next rows

display(healthcare_df
       .withColumn("row_num", row_number().over(window_spec)) 
       .withColumn("rolling_avg", avg("resting_heartrate").over(window_spec_rolling)) # using aggregations with window functions
       #.withColumn("row_num_sum", sum("row_num").over(window_spec_rolling_plusmin2))
       #.withColumn("max_BMI_3", max("BMI").over(window_spec_rolling_plusmin2))
       )
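
The frame passed to `rowsBetween` decides which neighbouring rows feed each aggregate. The plain-Python sketch below mimics this for one ordered partition, with `None` standing in for an unbounded edge; the function `rows_between_avg` and the sample values are assumptions for illustration only.

```python
def rows_between_avg(values, start, end):
    """Average over a frame of rows given as offsets relative to the current row.

    Negative offsets point to earlier rows, positive offsets to later rows,
    and None means the frame is unbounded on that side.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is None else max(0, i + start)
        hi = n - 1 if end is None else min(n - 1, i + end)
        frame = values[lo:hi + 1]
        out.append(sum(frame) / len(frame))
    return out


# Hypothetical readings for one _id, already ordered by date
heartrates = [60, 62, 64, 66]

running_avg = rows_between_avg(heartrates, None, 0)  # unboundedPreceding .. currentRow
centered_avg = rows_between_avg(heartrates, -1, 1)   # one row back .. one row ahead

print(running_avg)
print(centered_avg)
```

Frames are clipped at the partition edges, so the first and last rows of the centered average use only two values instead of three.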


#### Delta Lake

In [0]:
# let's load in a small df for demonstration purposes

iso_df = (spark.read
          .option("header","true")
          .option("inferSchema","true")
          .csv("/mnt/training/countries/ISOCountryCodes/ISOCountryLookup.csv")
         )
display(iso_df)

In [0]:
# General issue with data lakes / Hive tables / HDFS storage
# Hard to do SQL / Data Warehouse-like updates
# No ACID compliance (newer Hive versions support ACID, but that is not inherently compatible with Spark)

# Delta Lake = open source
# ACID, time-travel, optimized performance, ...

iso_df.write.format("parquet").mode("overwrite").saveAsTable("hive_iso_t") # explicit parquet format, since newer Databricks runtimes default to delta
iso_df.write.format("delta").mode("overwrite").saveAsTable("delta_iso_t") #format "delta"

In [0]:
display(spark.table("hive_iso_t"))
#display(spark.table("delta_iso_t"))

In [0]:
# updating using spark sql statements
#spark.sql("UPDATE hive_iso_t SET independentTerritory = 'Yes' WHERE EnglishShortName = 'Antarctica'") # fails, update not supported
#spark.sql("UPDATE delta_iso_t SET independentTerritory = 'Yes' WHERE EnglishShortName = 'Antarctica'") # OK

display(spark.table("delta_iso_t"))

In [0]:
# another way is to use delta API, useful for more advanced usecases and simpler programmability

from delta.tables import *

delta_table = DeltaTable.forName(spark, "delta_iso_t") 

delta_table.update(
  condition = "EnglishShortName = 'Greenland'",
  set = { "independentTerritory": "'Yes'"}
)

In [0]:
# view history of table

display(delta_table.history()) 

In [0]:
# creating dataframe from previous state

display(spark.sql("DESCRIBE EXTENDED delta_iso_t")) # getting the table path

#timestamp_df = spark.read.format("delta").option("timestampAsOf", "2021-03-20 09:55:00").load("/user/hive/warehouse/delta_iso_t")
#display(timestamp_df)
#version_df = spark.read.format("delta").option("versionAsOf", 4).load("/user/hive/warehouse/delta_iso_t")
#display(version_df)

# restoring table to previous state

#delta_table.restoreToTimestamp('2021-03-20 10:20') # restore to a specific timestamp
#delta_table.restoreToVersion(0) # restore table to (oldest) version
#display(spark.table("delta_iso_t"))
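
One way to picture history, time travel and restore is a table that never overwrites its past: every change adds a new version, old versions stay readable, and a restore is itself a new version. The toy class below illustrates only that idea. It keeps a full snapshot per version, whereas Delta Lake records changes in a transaction log, and the class name, its methods and the sample rows are all hypothetical.

```python
import copy


class VersionedTable:
    """Toy table that keeps one full snapshot per version."""

    def __init__(self, rows):
        self._versions = [copy.deepcopy(rows)]

    def current(self):
        return copy.deepcopy(self._versions[-1])

    def update(self, condition, changes):
        """Write a new version with `changes` applied to rows matching `condition`."""
        new_rows = [
            {**row, **changes} if condition(row) else dict(row)
            for row in self._versions[-1]
        ]
        self._versions.append(new_rows)

    def as_of(self, version):
        """Read an older snapshot without modifying the table."""
        return copy.deepcopy(self._versions[version])

    def restore(self, version):
        """Restoring adds a new version on top; earlier history is kept."""
        self._versions.append(copy.deepcopy(self._versions[version]))

    def history(self):
        return list(range(len(self._versions)))


# Hypothetical rows, loosely modelled on the ISO country table
table = VersionedTable([
    {"EnglishShortName": "Greenland", "independentTerritory": "No"},
    {"EnglishShortName": "Antarctica", "independentTerritory": "No"},
])

table.update(lambda r: r["EnglishShortName"] == "Greenland",
             {"independentTerritory": "Yes"})   # creates version 1
before_update = table.as_of(0)                  # time travel to version 0
table.restore(0)                                # creates version 2, equal to version 0

print(table.history())
```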

### Structured streaming

Common input/output:
* Input sources
  * Kafka (and other distributed commit logs)
  * Files on a distributed system
  * TCP/IP sockets _note: not fault-tolerant_
* Output (sinks)
  * Kafka (etc)
  * File formats
  * Spark tables

In [0]:
# let's start by looking at reading and writing streams to files
# we need a schema when reading streams

events_schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"

streaming_events_df = (spark.readStream
  .schema(events_schema)
  .option("maxFilesPerTrigger", 1) # used for example purposes, reads in 1 file per trigger
  .parquet("/mnt/training/ecommerce/events/events.parquet")
)

In [0]:
# you can check if a dataframe has streaming sources
# if it does, some operations are unavailable (e.g. the count() action)

streaming_events_df.isStreaming

In [0]:
# let's create a new dataframe and write it into a file

email_df = (streaming_events_df
            .filter(col("traffic_source") == "email")
            .withColumn("mobile", col("device").isin(["iOS", "Android"]))
            .select("user_id", "event_timestamp", "mobile")
           )

checkpoint_path = "/tmp/email_t2/checkpoint" #"tmp/email_traffic/checkpoint"
output_path = "/tmp/email_t2/output"

devices_query = (email_df.writeStream
  .outputMode("append") # append = only new rows, complete = all rows written on every update, update = only updated rows (used in aggregations, otherwise same as append)
  .format("parquet")
  .queryName("email_traffic_p") # optional name
  .trigger(processingTime="10 second") # how often data is fetched from source
  .option("checkpointLocation", checkpoint_path) # used for fault-tolerance. Note: every query needs to have a unique checkpoint location
  .start(output_path) # location where the file will be written
)
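
The output modes differ most visibly for an aggregated stream. The plain-Python sketch below simulates three micro-batches feeding a running count per traffic source and records what a sink would receive in `complete` mode (the whole result table every time) versus `update` mode (only the keys whose count changed in that batch). The batches are invented and the loop is a simplification for illustration, not Spark's execution model.

```python
from collections import Counter

# Hypothetical micro-batches of traffic_source values
batches = [
    ["email", "google"],
    ["email"],
    ["facebook", "email"],
]

counts = Counter()   # running state kept across batches
complete_out = []    # what the sink sees per trigger in complete mode
update_out = []      # what the sink sees per trigger in update mode

for batch in batches:
    counts.update(batch)
    complete_out.append(dict(counts))                       # every row, every time
    update_out.append({key: counts[key] for key in set(batch)})  # changed rows only

for i, (complete, update) in enumerate(zip(complete_out, update_out)):
    print(f"batch {i}: complete={complete} update={update}")
```

For a stream without aggregations, such as the filtered `email_df` above, each input row appears in the result exactly once, so `append` mode simply writes the new rows of every batch.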

In [0]:
%fs ls /tmp/email_t2/output

In [0]:
# monitor the query

devices_query.id # unique per query, persisted when restarted from checkpoint 
#devices_query.name
#devices_query.status # isDataAvailable = new data available, isTriggerActive = trigger actively firing
#devices_query.awaitTermination(5) # timeout in seconds, can use to keep cluster awake, also useful for seeing if stream was quit or got an exception
#devices_query.stop() # note: while a stream is running, the cluster stays awake (processing is ongoing)

In [0]:
# fetching from TCP

logs_df = (spark.readStream
    .format("socket")
    .option("host", "server1.databricks.training")
    .option("port", 9001)
    .load()
)

display(logs_df)

In [0]:
# we can look at a count by using the agg function

running_count_df = logs_df.agg(count("*"))

display(running_count_df)

In [0]:
# let's see how comfortable it is manipulating the streaming dataframe
# we want to get all rows where there is an error

# Start by parsing out the timestamp and the log data
clean_df = (logs_df
            .withColumn("ts_string", col("value").substr(2, 23))
            .withColumn("epoch", unix_timestamp("ts_string", "yyyy/MM/dd HH:mm:ss.SSS"))
            .withColumn("capturedAt", col("epoch").cast("timestamp"))
            .withColumn("logData", regexp_extract("value", r"""^.*\]\s+(.*)$""", 1)) # raw string, so the backslashes reach the regex unchanged
           )

# Filter on the raw log line first, then keep only the columns we want
errors_df = (clean_df
             .filter(col("value").like("% (ERROR) %"))
             .select("capturedAt", "logData")
            )
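
The parsing steps can be tried on a single line in plain Python before running them on the stream. The log line below is a guess at the format, inferred from the `substr(2, 23)` call and the timestamp pattern; the real lines from the socket source may look different.

```python
import re
from datetime import datetime

# Hypothetical log line; the actual format of the stream is an assumption
line = "[2021/03/20 09:55:00.123] (ERROR) Disk quota exceeded"

# Spark's substr(2, 23) is 1-based: start at character 2 and take 23 characters
ts_string = line[1:24]

# Python equivalent of the pattern yyyy/MM/dd HH:mm:ss.SSS
# (the Spark version goes through epoch seconds, so it drops the milliseconds)
captured_at = datetime.strptime(ts_string, "%Y/%m/%d %H:%M:%S.%f")

# Same regular expression as in regexp_extract: everything after the last "]"
match = re.search(r"^.*\]\s+(.*)$", line)
log_data = match.group(1) if match else ""  # regexp_extract yields "" when nothing matches

# Same test as like("% (ERROR) %")
is_error = " (ERROR) " in line

print(ts_string, captured_at, log_data, is_error)
```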

In [0]:
# let's write the errors dataframe into a delta table

(errors_df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/events_stream/_checkpoints/")
  .table("errors_streaming"))

In [0]:
# we can view snapshots from this table

display(spark.table("errors_streaming")
        .sort(desc("capturedAt"))
       )

In [0]:
# we can also check the count of this table increasing

spark.sql("SELECT COUNT(*) FROM errors_streaming").first()[0]

In [0]:
# since it was a delta table, we can look at the history

errors_table = DeltaTable.forName(spark, "errors_streaming")

display(errors_table.history())

In [0]:
# use spark.streams.active to loop over all active streams
# remember to stop streams if not working on them anymore

for stream in spark.streams.active:
  print(stream.name)
  #stream.stop()

## Further reading

* Spark SQL Window functions
  * https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#window
* Delta Lake: 
  * https://delta.io/
* Structured Streaming: 
  * https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss.html
  * https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html
  * http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

### Optional anonymous feedback:

http://tinyurl.com/dfandssintro

### Tasks for session 3

#### Task 1

Employees' dataset: "/mnt/training/manufacturing-org/employees/employees.csv"

Using window functions, find the **employees** who have worked in each **department** for the *longest* and the *shortest* time.

Resulting dataframe should have 3 columns: employee_name, department, employment_duration

Employment_duration should have 2 possible values: 
* **longest**
  * The employee has worked in the department for the longest time. Based on column _active_record_start_
* **shortest**
  * The employee has worked in the department for the shortest time. Based on column _active_record_start_

Resulting dataframe should have a total of 6 rows.

Example df.take(3):<br>
<table>
  <tr>
    <th>employee_name</th>
    <th>department</th>
    <th>employment_duration</th>
  </tr>
  <tr>
    <td>CISNEROS JR, HERBERT</td>
    <td>OFFICE</td>
    <td>shortest</td>
  </tr>
  <tr>
    <td>CRAVEN, KEVIN J</td>
    <td>OFFICE</td>
    <td>longest</td>
  </tr>
  <tr>
    <td>WRIGHT, RONALD G</td>
    <td>PRODUCTION</td>
    <td>shortest</td>
  </tr>
</table>

Note: the highest points are awarded to solutions that use the fewest dataframe transformations.<br>
Should be doable with 2 window functions and 2 transformations.

In [0]:
# your answer

#### Task 2

Create a schema and a streaming dataframe for the JSON files in the following path:  
"/mnt/training/gaming_data/mobile_streaming_events_b/"
  
  
Use the following as basis for creating your schema:  
 |-- eventName: string (nullable = true)  
 |-- eventParams: struct (nullable = true)  
 |    |-- amount: double (nullable = true)  
 |    |-- app_name: string (nullable = true)  
 |    |-- app_version: string (nullable = true)  
 |    |-- client_event_time: string (nullable = true)  
 |    |-- device_id: string (nullable = true)  
 |    |-- game_keyword: string (nullable = true)  
 |    |-- platform: string (nullable = true)  
 |    |-- scoreAdjustment: long (nullable = true)  
  
Read in 2 files per trigger.
  
Create a new modified dataframe:
* keep only rows where eventName is "scoreAdjustment"
* select the *game_keyword*, *platform* and *scoreAdjustment* columns from the eventParams struct.  
* set trigger to run every 5 seconds.

Write the datastream to a delta table called score_adjustments.  
Check to make sure that the table has some data - this should also be visible in the cell results.  
Then stop the datastream.

In [0]:
# your answer