# Gain Actionable Insights from a Data Lake, Satisfy GDPR

In this optional project, you use Databricks Delta to manage a data lake that combines a large volume of historical data with incoming streaming data.

A video gaming company stores historical data in a data lake, which is growing exponentially. 

The data isn't sorted in any particular way (actually, it's quite a mess).

It is proving to be _very_ difficult to query and manage this data because there is so much of it.

To further complicate matters, a regulatory agency has decreed that you must be able to identify and delete all data associated with a specific user (for example, under GDPR).

In other words, you must delete data associated with a specific `deviceId`.

## Instructions
0. Read streaming data into Databricks Delta raw tables
0. Create a Databricks Delta query table
0. Compute aggregate statistics about the data, i.e., create a summary table
0. Identify events associated with a specific `deviceId`
0. Clean up the data using advanced Databricks Delta features

## CAUTION
* Do not use <b>Run All</b> mode (next to <b>Permissions</b>).

### Getting Started

Run the following cell to configure our "classroom."

In [3]:
%run "../Includes/Classroom-Setup"

Set up relevant paths.

In [5]:
inputPath = "/mnt/training/gaming_data/mobile_streaming_events_b"
outputPathBronze = userhome + "/gaming/bronze"
outputPathSilver = userhome + "/gaming/silver"
outputPathGold = userhome + "/gaming/gold"

### Step 1: Prepare Schema and Read Streaming Data from input source

The input source is a folder containing files of around 100,000 bytes each and is set up to stream slowly.

Run this code to read streaming data in.

In [7]:
from pyspark.sql.types import StructType, StringType, IntegerType, TimestampType, DoubleType

eventSchema = ( StructType()
  .add('eventName', StringType()) 
  .add('eventParams', StructType() 
    .add('game_keyword', StringType()) 
    .add('app_name', StringType()) 
    .add('scoreAdjustment', IntegerType()) 
    .add('platform', StringType()) 
    .add('app_version', StringType()) 
    .add('device_id', StringType()) 
    .add('client_event_time', TimestampType()) 
    .add('amount', DoubleType()) 
  )     
)

gamingEventDF = (spark
  .readStream
  .schema(eventSchema) 
  .option('streamName','mobilestreaming_demo') 
  .option("maxFilesPerTrigger", 1)                # process one file per trigger
  .json(inputPath) 
)
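
To see what `maxFilesPerTrigger` does conceptually, the plain-Python sketch below splits a list of pending files into micro-batches of at most N files. It is only an illustration: the file names and the `micro_batches` helper are hypothetical, and Spark's actual scheduling is more involved.

```python
from typing import Iterator, List


def micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `max_files_per_trigger` files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names, for illustration only.
pending_files = ["part-0.json", "part-1.json", "part-2.json"]

# With a limit of 1, every file becomes its own micro-batch.
batches = list(micro_batches(pending_files, 1))
print(batches)
```

With a limit of 1, three pending files give three micro-batches, which is why the source appears to "stream slowly."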

### Step 2: Write Stream

The instructions here are to:

* Write the stream from `gamingEventDF` to the Databricks Delta data lake in path defined by `outputPathBronze`.
* Convert `client_event_time` to a date format and rename to `eventDate`
* Filter out null `eventDate`
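
The logic of the last two bullets can be sketched in plain Python before you express it with DataFrame operations. This is not the PySpark solution; the sample timestamps and the `to_event_date` helper are hypothetical and only show the intended behavior: truncate a timestamp to a date, then drop rows where the date is missing.

```python
from datetime import date, datetime
from typing import List, Optional

# Hypothetical sample values standing in for `client_event_time`.
client_event_times: List[Optional[datetime]] = [
    datetime(2018, 5, 20, 14, 30, 0),
    None,  # a record with no timestamp
    datetime(2018, 5, 21, 9, 5, 12),
]


def to_event_date(ts: Optional[datetime]) -> Optional[date]:
    """Drop the time-of-day part; a missing timestamp stays missing."""
    return ts.date() if ts is not None else None


# Convert, then keep only rows whose eventDate is not null.
with_dates = [(ts, to_event_date(ts)) for ts in client_event_times]
kept = [(ts, d) for ts, d in with_dates if d is not None]
print([d.isoformat() for _, d in kept])
```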

In [9]:
# TODO
from pyspark.sql.functions import to_date, col

eventsStream = (gamingEventDF
  .withColumn(FILL_IN)
  .filter(FILL_IN)

  FILL_IN

  .option('checkpointLocation', outputPathBronze + '/_checkpoint') 
  .start(outputPathBronze)
)

Wait until the stream is initialized, then create table `mobile_events_delta_raw`.

In [11]:
# TODO
spark.sql("""
   DROP TABLE IF EXISTS mobile_events_delta_raw
 """)
spark.sql("""
   CREATE TABLE mobile_events_delta_raw
   FILL_IN

In [12]:
# TEST - Run this cell to test your solution.
try:
  rawTableExists = (spark.table("mobile_events_delta_raw") is not None)
except Exception:
  rawTableExists = False
  
firstRow = spark.sql("SELECT * FROM mobile_events_delta_raw").take(1)

dbTest("Delta-08-rawTableExists", True, rawTableExists)  
dbTest("Delta-08-rowsExist", False, not firstRow) 

print("Tests passed!")

### Step 3a: Create a Databricks Delta table

Create `device_id_type_table` from data in `/mnt/training/gaming_data/dimensionData`.

This table associates `deviceId` with `deviceType` = `{android, ios}`.

In [14]:
# TODO
tablePath = 
spark.sql("""
   DROP TABLE IF EXISTS device_id_type_table
 """)
spark.sql("""
   CREATE TABLE device_id_type_table 
   FILL_IN

In [15]:
# TEST - Run this cell to test your solution.
try:
  tableExists = (spark.table("device_id_type_table") is not None)
except Exception:
  tableExists = False
  
dbTest("Delta-08-tableExists", True, tableExists)  

print("Tests passed!")

### Step 3b: Create a query table

Create table `mobile_events_delta_query` by joining `device_id_type_table` with `mobile_events_delta_raw` on `deviceId`.
* Your fields should be `eventName`, `deviceId`, `eventTime`, `eventDate` and `deviceType`.
* Make sure to `PARTITION BY (eventDate)`
* Write to `outputPathSilver`
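
As a rough mental model of this step, the plain-Python sketch below joins events to device types on `deviceId` and then groups the result into one directory per `eventDate`, using the Hive-style `column=value` naming that partitioned tables use on disk. The rows, event names, and device IDs are hypothetical, `eventTime` is omitted for brevity, and an inner join is assumed.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows; real data comes from the two Delta tables.
events = [
    {"eventName": "scoreAdjustment", "deviceId": "a1", "eventDate": date(2018, 5, 20)},
    {"eventName": "purchaseEvent", "deviceId": "b2", "eventDate": date(2018, 5, 21)},
    {"eventName": "scoreAdjustment", "deviceId": "zz", "eventDate": date(2018, 5, 21)},
]
device_types = {"a1": "android", "b2": "ios"}  # deviceId -> deviceType

# Inner join on deviceId (an assumption): events with no matching device are dropped.
joined = [
    {**e, "deviceType": device_types[e["deviceId"]]}
    for e in events
    if e["deviceId"] in device_types
]

# Partitioning by eventDate groups rows under one directory per date,
# named with the Hive-style `column=value` convention.
partitions = defaultdict(list)
for row in joined:
    partitions["eventDate={}".format(row["eventDate"].isoformat())].append(row)

print(sorted(partitions))
```

This layout is why a later test in this notebook looks inside a directory named `eventDate=2018-05-20` under `outputPathSilver`.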

In [17]:
# TODO
spark.sql("""
   DROP TABLE IF EXISTS mobile_events_delta_query
""")

spark.sql("""
    CREATE TABLE mobile_events_delta_query
    FILL_IN

In [18]:
# TEST - Run this cell to test your solution.
from pyspark.sql.types import StructField, StructType, StringType, TimestampType, DateType
schema = spark.table("mobile_events_delta_query").schema

expectedSchema = StructType([
   StructField("eventName", StringType(), True),
   StructField("deviceId", StringType(), True),
   StructField("eventTime", TimestampType(), True),
   StructField("eventDate", DateType(), True),
   StructField("deviceType", StringType(), True),
])

firstRowQuery = spark.sql("SELECT * FROM mobile_events_delta_query limit 1").collect()[0][0]

dbTest("Delta-08-querySchema", set(expectedSchema), set(schema))
dbTest("Delta-08-queryRowsExist", False, not firstRowQuery) 

print("Tests passed!")

### Step 4a: Create a Delta summary table out of query table

The company executives want to look at the number of active users by week.

Count the number of events per week.
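
One plausible way to define "week" is the ISO 8601 week number, which is an assumption here rather than something the exercise states. The plain-Python sketch below counts hypothetical event dates per ISO week; it illustrates the aggregation only and is not the SQL you are asked to write.

```python
from collections import Counter
from datetime import date

# Hypothetical event dates, for illustration only.
event_dates = [
    date(2018, 5, 19),  # Saturday
    date(2018, 5, 20),  # Sunday, same ISO week as the Saturday
    date(2018, 5, 21),  # Monday, start of the next ISO week
    date(2018, 5, 22),
]

# isocalendar() returns (ISO year, ISO week number, ISO weekday).
events_per_week = Counter(d.isocalendar()[1] for d in event_dates)

for week, count in sorted(events_per_week.items()):
    print(week, count)
```

Note that a bare week number merges the same week from different years; that is harmless only if the data covers a single year.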

In [20]:
# TODO
spark.sql("""
   DROP TABLE IF EXISTS mobile_events_delta_summary
""")

spark.sql("""
   CREATE TABLE mobile_events_delta_summary 
   FILL_IN

In [21]:
# TEST - Run this cell to test your solution.
from pyspark.sql.types import StructType, StructField, IntegerType, LongType

WAUactualSchema = spark.table("mobile_events_delta_summary").schema

WAUexpectedSchema = StructType([
   StructField("week",IntegerType(), True),
   StructField("WAU",LongType(), True),
])

dbTest("Delta-L8-WAUquerySchema", set(WAUexpectedSchema), set(WAUactualSchema))

print("Tests passed!")

### Step 4b: Visualization

The company executives are visual people: they like pretty charts.

Create a bar chart out of `mobile_events_delta_summary` where the horizontal axis is week and the vertical axis is WAU.

In [23]:
%sql
-- TODO
FILL_IN

### Step 5: Isolate a specific `deviceId`

Identify all the events associated with a specific user, roughly proxied by the first `deviceId` we encounter in our query.

Use the `mobile_events_delta_query` table.

The `deviceId` you come up with should be a string.

In [25]:
# TODO 
deviceId = str(spark.sql("FILL_IN").collect()[0][0])

In [26]:
# TEST - Run this cell to test your solution.
deviceIdexists = len(deviceId) > 0

dbTest("Delta-L8-lenDeviceId", True, deviceIdexists)

print("Tests passed!")

### Step 6: ZORDER 

The events are implicitly ordered by `eventTime`, so the data pertaining to any one `deviceId` is spread out all over the data lake (it's definitely _not_ co-located!).

Use `ZORDER` to re-cluster the table by `deviceId`.

Pass in the `deviceId` variable you defined above.
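
Why does co-location matter? The plain-Python sketch below models data skipping in a simplified way: each "file" records the minimum and maximum `deviceId` it holds, and a lookup only needs to open files whose range could contain the target. The data, file size, and helper names are hypothetical, and clustering on one column is approximated by a plain sort; real Z-ordering and Delta's file statistics are more sophisticated.

```python
from typing import List, Tuple

ROWS_PER_FILE = 20

# Hypothetical data: 20 devices emitting events round-robin, in time order.
device_ids = ["dev-{:02d}".format(i % 20) for i in range(200)]


def file_ranges(ids: List[str]) -> List[Tuple[str, str]]:
    """Split ids into fixed-size 'files' and record each file's (min, max)."""
    files = [ids[i:i + ROWS_PER_FILE] for i in range(0, len(ids), ROWS_PER_FILE)]
    return [(min(f), max(f)) for f in files]


def files_to_scan(ranges: List[Tuple[str, str]], target: str) -> int:
    """Count files whose min/max range could contain the target id."""
    return sum(1 for lo, hi in ranges if lo <= target <= hi)


target = "dev-07"
before = files_to_scan(file_ranges(device_ids), target)          # time order
after = files_to_scan(file_ranges(sorted(device_ids)), target)   # clustered

print(before, after)
```

In time order every file contains every device, so all 10 files must be scanned; after clustering, a single file covers the target.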

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> `ZORDER` may take a few minutes.

In [28]:
%sql
-- TODO
OPTIMIZE FILL_IN
ZORDER BY FILL_IN

### Step 7: Delete Specific `deviceId`

0. Delete rows with that particular `deviceId` from `mobile_events_delta_query`.
0. Make sure that `deviceId` is no longer in the table!

In [30]:
# TODO
spark.sql("FILL_IN".format(deviceId))
noDeviceId = spark.sql(("FILL_IN").format(deviceId)).collect()

In [31]:
# TEST - Run this cell to test your solution.
dbTest("Delta-08-noDeviceId", True , not noDeviceId)

print("Tests passed!")

dbutils.fs.rm(userhome, True)

### Step 8: Stop the streaming process

In [33]:
# TODO
for streamingQuery in spark.streams.active:
  FILL_IN

In [34]:
# TEST - Run this cell to test your solution.
numActiveStreams = len(spark.streams.active)
dbTest("Delta-08-numActiveStreams", 0, numActiveStreams)

print("Tests passed!")

### Step 9: Clean Up

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Do not use a retention of 0 hours in production, as this may affect queries that are currently in flight. 
By default this value is 7 days. 

We use 0 hours here for purposes of demonstration only.

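Recall that `VACUUM` removes data files that the table no longer references and that are older than the retention period. After the `OPTIMIZE` in Step 6, this should leave one file in each partition directory.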
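
The effect of the retention setting can be sketched in plain Python. This is a simplified model, not Delta's implementation: the file names, timestamps, and the `files_to_delete` helper are hypothetical. It assumes a file is removed only if the current table version no longer references it and it is at least as old as the retention window.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def files_to_delete(
    all_files: Dict[str, datetime],   # file name -> last modification time
    referenced: Set[str],             # files the current table version still uses
    now: datetime,
    retention: timedelta,
) -> List[str]:
    """Return unreferenced files that are older than the retention window."""
    cutoff = now - retention
    return sorted(
        name
        for name, modified in all_files.items()
        if name not in referenced and modified <= cutoff
    )


now = datetime(2018, 6, 1, 12, 0, 0)
# Hypothetical partition contents: two small files compacted into one.
all_files = {
    "small-1.parquet": now - timedelta(hours=2),
    "small-2.parquet": now - timedelta(hours=2),
    "compacted.parquet": now - timedelta(hours=1),
}
referenced = {"compacted.parquet"}

print(files_to_delete(all_files, referenced, now, timedelta(days=7)))   # 7-day window
print(files_to_delete(all_files, referenced, now, timedelta(hours=0)))  # this lab's setting
```

With a 7-day window nothing is removed yet, because the stale files are only two hours old; with 0 hours both stale files go immediately, leaving only the compacted file.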

In [36]:
%sql
-- TODO
VACUUM FILL_IN

If the `../eventDate=2018-05-20` directory does not exist, try a different `eventDate` directory.

In [38]:
# TEST - Run this cell to test your solution.
numFilesOne = len(dbutils.fs.ls(("{}/eventDate=2018-05-20").format(outputPathSilver)))

dbTest("Delta-08-numFilesOne", 1, numFilesOne)

print("Tests passed!")

Congratulations: ALL DONE!!