# Databricks Delta Architecture
Databricks&reg; Delta simplifies data pipelines and eliminates the need for the traditional Lambda architecture.

## Datasets Used
* Read Wikipedia edits in real time, across many different languages.
* Aggregate the anonymous edits by country to see who is editing the English Wikipedia over time.

## CAUTION
* Do not use <b>RunAll</b> mode (next to <b>Permissions</b>).

### Getting Started

Run the following cell to configure our "classroom."

In [3]:
%run "./Includes/Classroom-Setup"

## Lambda Architecture

The Lambda architecture is a big data processing architecture that combines both batch- and real-time processing methods.
It features an append-only, immutable data source that serves as the system of record. Timestamped events are appended to
existing events (nothing is overwritten). Data is implicitly ordered by time of arrival.

Notice how there are really two pipelines here, one batch and one streaming, hence the name <i>lambda</i> architecture.

Combining batch and real-time processing is very difficult, as the diagram below shows.


<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/lambda.png" style="height: 400px"/></div><br/>
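To make the two-pipeline idea concrete, here is a minimal plain-Python sketch (not Spark code). The event list, the `BATCH_CUTOFF` value and the function names are hypothetical and chosen only for illustration. It shows why every query in a Lambda design has to merge a batch view with a streaming ("speed") view.

```python
from collections import Counter

# Hypothetical append-only event log, ordered by arrival time.
events = [
    {"ts": 1, "country": "USA"},
    {"ts": 2, "country": "DEU"},
    {"ts": 3, "country": "USA"},
    {"ts": 4, "country": "FRA"},
]

BATCH_CUTOFF = 2  # assumption: the last batch run covered events with ts <= 2


def batch_view(log, cutoff):
    """Batch pipeline: recomputes counts over all history up to the cutoff."""
    return Counter(e["country"] for e in log if e["ts"] <= cutoff)


def speed_view(log, cutoff):
    """Streaming pipeline: counts only events the batch run has not seen yet."""
    return Counter(e["country"] for e in log if e["ts"] > cutoff)


def query(log, cutoff):
    """Serving layer: each query must merge the two separate views."""
    return batch_view(log, cutoff) + speed_view(log, cutoff)


totals = query(events, BATCH_CUTOFF)
```

The same counting logic lives in two places and must be kept in sync, which is the maintenance burden the rest of this notebook tries to avoid.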

## Databricks Delta Architecture

The Databricks Delta Architecture is a vast improvement upon the traditional Lambda architecture.

Text files, RDBMS data and streaming data are all collected into a <b>raw</b> table (also known as a "bronze" table at Databricks).

A raw table is then parsed into <b>query</b> tables (also known as "silver" tables at Databricks). These may be joined with dimension tables.

<b>Summary</b> tables (also known as "gold" tables at Databricks) are business-level aggregates often used for reporting and dashboarding.
These include aggregations such as daily active website users.

The end outputs are actionable insights, dashboards and reports of business metrics.

<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/delta.png" style="height: 350px"/></div><br/>
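As a rough illustration of the three tiers, the following plain-Python sketch walks a handful of records from raw to query to summary form. The sample records and field names are hypothetical, and ordinary lists and dicts stand in for Delta tables. The gold step computes daily active users, the example aggregate mentioned above.

```python
import json

# Bronze: raw records exactly as they arrived (hypothetical JSON strings).
bronze = [
    '{"user": "alice", "day": "2018-06-01", "page": "Spark"}',
    '{"user": "bob",   "day": "2018-06-01", "page": "Kafka"}',
    '{"user": "alice", "day": "2018-06-02", "page": "Delta"}',
    '{"user": "alice", "day": "2018-06-01", "page": "Kafka"}',
]

# Silver: parsed records, reduced to the columns that queries need.
silver = [
    {"user": rec["user"], "day": rec["day"]}
    for rec in map(json.loads, bronze)
]

# Gold: business-level aggregate, here the number of distinct users per day.
users_per_day = {}
for row in silver:
    users_per_day.setdefault(row["day"], set()).add(row["user"])
gold = {day: len(users) for day, users in users_per_day.items()}
```

Each tier is derived only from the one before it, so a dashboard can read `gold` without touching the raw strings.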

Set up relevant paths.

In [7]:
basePath       = userhome + "/wikipedia-streaming"
bronzePath     = basePath + "/wikipediaEditsRaw.delta"
silverPath     = basePath + "/wikipediaEdits.delta"
goldPath       = basePath + "/wikipediaEditsSummary.delta"
checkpointPath = basePath + "/checkpoints"

## Save to RAW table (aka "bronze table")

<b>Raw data</b> is unaltered data that is collected into a data lake, either via bulk upload or through streaming sources.

The following cells read the Wikipedia IRC channels that have been dumped into our Kafka server.

The Kafka server acts as a sort of "firehose" and dumps raw data into our data lake.

Since raw data coming in from a stream is transient, we'd like to save it to a more permanent data structure.

Below, the first step is to set up the schema. The fields we use further down in the notebook are commented.

In [9]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
from pyspark.sql.functions import from_json, unix_timestamp

schema = StructType([
  StructField("channel", StringType(), True),
  StructField("comment", StringType(), True),
  StructField("delta", IntegerType(), True),
  StructField("flag", StringType(), True),
  StructField("geocoding", StructType([                 # (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("countryCode2", StringType(), True),
    StructField("countryCode3", StringType(), True),
    StructField("stateProvince", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
  ]), True),
  StructField("isAnonymous", BooleanType(), True),      # (BOOLEAN): Whether or not the change was made by an anonymous user
  StructField("isNewPage", BooleanType(), True),
  StructField("isRobot", BooleanType(), True),
  StructField("isUnpatrolled", BooleanType(), True),
  StructField("namespace", StringType(), True),         # (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType(), True),              # (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType(), True),           # (STRING): URL of the page that was edited
  StructField("timestamp", StringType(), True),         # (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType(), True),
  StructField("user", StringType(), True),              # (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType(), True),
  StructField("wikipediaURL", StringType(), True),
  StructField("wikipedia", StringType(), True),         # (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
])
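Every field in the schema is declared nullable (the third argument is `True`), so a record that lacks a field can still be represented. The plain-Python sketch below is only an analogue of that idea, not Spark's parser: `FIELDS` is a hypothetical four-field subset of the schema and `parse_edit` is an illustrative helper. How Spark itself treats malformed records depends on its version and parser options.

```python
import json

# Hypothetical subset of the schema: field name -> expected Python type.
FIELDS = {"wikipedia": str, "isAnonymous": bool, "delta": int, "page": str}


def parse_edit(raw):
    """Return a dict with every expected field; missing or mistyped fields become None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    row = {}
    for name, expected in FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject True/False for int fields.
        if expected is int and isinstance(value, bool):
            value = None
        row[name] = value if isinstance(value, expected) else None
    return row


# "delta" is absent from this record, so it comes back as None.
row = parse_edit('{"wikipedia": "en", "isAnonymous": true, "page": "Apache Spark"}')
```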

Next, stream into the bronze Databricks Delta directory.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice how we are invoking the `.start(path)` method. 

This is so that the data is streamed into the path we want (and not a default directory).

In [11]:
from pyspark.sql.functions import from_json, col
(spark.readStream
  .format("kafka")  
  .option("kafka.bootstrap.servers", "server1.databricks.training:9092")
  .option("subscribe", "en")
  .load()
  .withColumn("json", from_json(col("value").cast("string"), schema))
  .select(col("timestamp").alias("kafka_timestamp"), col("json.*"))
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath + "/bronze")
  .outputMode("append")
  .start(bronzePath)
)
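Cells that read from the bronze path only work once the stream has started writing there. One way to avoid guessing is a small polling helper. The `wait_until` function below is a hypothetical utility written for this illustration, not part of Spark or Databricks; it takes any zero-argument callable as the readiness check.

```python
import time


def wait_until(is_ready, timeout_s=60.0, poll_s=0.5):
    """Poll is_ready() until it returns True or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    # One final check so a condition met right at the deadline is not missed.
    return is_ready()
```

If the handle returned by `.start(...)` were kept in a variable (the cell above does not do this), a plausible check would be `wait_until(lambda: len(query.recentProgress) > 0)`, assuming `query` is that handle.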

Wait until the stream is done initializing before running this cell.

In [13]:
spark.sql("DROP TABLE IF EXISTS WikipediaEditsRaw")

spark.sql("""
  CREATE TABLE WikipediaEditsRaw
  USING Delta
  LOCATION '{}'
""".format(bronzePath))

Take a look at the raw table.

In [15]:
%sql
SELECT * FROM WikipediaEditsRaw LIMIT 5

## Create QUERY tables (aka "silver tables")

Notice how `WikipediaEditsRaw` has JSON encoding. For example `{"city":null,"country":null,"countryCode2":null,"c..`

To parse the data into human-readable form, create query tables out of the raw data using the columns<br>
`wikipedia`, `isAnonymous`, `namespace`, `page`, `pageURL`, `geocoding`, `timestamp` and `user`.

Stream into a Databricks Delta query directory.

In [17]:
from pyspark.sql.functions import unix_timestamp

(spark.readStream
  .format("delta")
  .load(str(bronzePath))
  .select(col("wikipedia"),
          col("isAnonymous"),
          col("namespace"),
          col("page"),
          col("pageURL"),
          col("geocoding"),
          unix_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSX").cast("timestamp").alias("timestamp"),
          col("user"))
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath + "/silver")
  .outputMode("append")
  .start(silverPath)
)
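The timestamp conversion in the cell above goes through epoch seconds, so sub-second precision is dropped. The sketch below shows the equivalent steps in plain Python rather than Spark; `parse_edit_time` and the sample timestamp are hypothetical, and the `%z` directive accepting a trailing `Z` assumes Python 3.7 or later.

```python
from datetime import datetime, timezone


def parse_edit_time(ts):
    """Parse an ISO-8601 string such as '2018-06-01T12:34:56.789Z' to whole seconds in UTC."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
    # Go through integer epoch seconds, which discards the milliseconds.
    epoch_seconds = int(parsed.timestamp())
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


edit_time = parse_edit_time("2018-06-01T12:34:56.789Z")
```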

Wait until the stream is done initializing before running this cell.

In [19]:
spark.sql("DROP TABLE IF EXISTS WikipediaEdits")

spark.sql("""
  CREATE TABLE WikipediaEdits
  USING Delta
  LOCATION '{}'
""".format(silverPath))

Take a peek at the streaming query view.

In [21]:
%sql
SELECT * FROM WikipediaEdits

## Create SUMMARY (aka "gold") level data 

Summary queries can take a long time.

Instead of running the query below directly against `WikipediaEdits` each time, let's create a summary query.

We are interested in a breakdown of which countries anonymous edits are coming from.

In [23]:
from pyspark.sql.functions import col, desc, count

goldDF = (spark.readStream
  .format("delta")
  .load(str(silverPath))
  .withColumn("countryCode", col("geocoding.countryCode3"))
  .filter(col("namespace") == "article")
  .filter(col("countryCode").isNotNull())  # keep only edits that carry geocoding data
  .filter(col("isAnonymous") == True)
  .groupBy(col("countryCode"))
  .count() 
  .withColumnRenamed('count', 'total')
  .orderBy(col("total").desc())
)
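To see what the summary query computes, here is the same filter-group-count-sort logic in plain Python over a few hypothetical rows shaped like the silver columns. The sample data and the `country_totals` helper are made up for this illustration and do not come from the live stream.

```python
from collections import Counter

# Hypothetical silver rows; only the columns used by the summary are shown.
edits = [
    {"namespace": "article", "isAnonymous": True,  "geocoding": {"countryCode3": "USA"}},
    {"namespace": "article", "isAnonymous": True,  "geocoding": {"countryCode3": "DEU"}},
    {"namespace": "article", "isAnonymous": True,  "geocoding": {"countryCode3": "USA"}},
    {"namespace": "talk",    "isAnonymous": True,  "geocoding": {"countryCode3": "USA"}},
    {"namespace": "article", "isAnonymous": False, "geocoding": None},
    {"namespace": "article", "isAnonymous": True,  "geocoding": {"countryCode3": None}},
]


def country_totals(rows):
    """Count anonymous article edits per country code, largest total first."""
    counts = Counter()
    for row in rows:
        code = (row.get("geocoding") or {}).get("countryCode3")
        if row["namespace"] == "article" and row["isAnonymous"] and code is not None:
            counts[code] += 1
    return counts.most_common()


totals = country_totals(edits)
```

Only three of the six rows survive the filters: the talk-page edit, the logged-in edit and the edit without a country code are all excluded.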


## Creating Visualizations (aka "platinum" level) 

#### Mapping Anonymous Editors' Locations

Use that geocoding information to figure out the countries associated with the editors.

When you run the query, the default is a (live) HTML table.

In order to create a slick world map visualization of the data, you'll need to click on the item below.

<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/plot-options-1.png" style="height: 200px"/></div><br/>

Then go to <b>Plot Options...</b> and drag `countryCode` into the <b>Keys:</b> box and `total` into the <b>Values:</b> box and click <b>Apply</b>.

<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/plot-options-2.png" style="height: 200px"/></div><br/> 

By invoking a `display` action on a DataFrame created from a `readStream` transformation, we can generate a LIVE visualization!

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Keep an eye on the plot for a minute or two and watch the colors change.

In [25]:
display(goldDF)

Make sure all streams are stopped, then delete the files under `userhome`.

In [27]:
for s in spark.streams.active:
    s.stop()
    
dbutils.fs.rm(userhome, True)

## Summary

Use the Databricks Delta architecture to craft raw, query and summary tables to produce beautiful visualizations of key business metrics.

Use these concepts to implement a Delta architecture in the Capstone project.

## Review Questions
**Q:** What is the difference between the Lambda and Databricks Delta architectures?<br>
**A:** The principal difference is that with Databricks Delta architecture, output queries can be performed on streaming and historical data at the same time.

In Lambda architecture, streaming and historical data are treated as two separate branches feeding output queries.

**Q:** What is the role of raw (bronze) tables?<br>
**A:** Raw tables capture streaming and historical data into a permanent record (streaming data tends to disappear after a short while). However, raw tables are generally hard to query.

**Q:** What is the role of query (silver) tables?<br>
**A:** Query tables consist of normalized raw data that is easier to query.

**Q:** What is the role of summary (gold) tables?<br>
**A:** Summary tables contain aggregated key business metrics that are queried frequently and would take too long to compute from the silver tables on every query.

## Additional Topics & Resources

* <a href="http://lambda-architecture.net/#" target="_blank">Lambda Architecture</a>
* <a href="https://bennyaustin.wordpress.com/2010/05/02/kimball-and-inmon-dw-models/#" target="_blank">Data Warehouse Models</a>
* <a href="https://people.apache.org//~pwendell/spark-nightly/spark-branch-2.1-docs/latest/structured-streaming-kafka-integration.html#" target="_blank">Reading structured streams from Kafka</a>
* <a href="http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream#" target="_blank">Create a Kafka Source Stream</a>

**Extra Practice:** Apply what you learned in this module by completing the optional [Gain Actionable Insights from a Data Lake, Satisfy GDPR]($./Optional/Data-Lake-Insights) exercise.