
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Ingest Data with Auto Loader

Databricks Auto Loader is the recommended method for streaming raw data from cloud object storage. Auto Loader provides efficient, incremental, idempotent processing of data files from cloud object storage locations. Several enhancements make Auto Loader strongly preferable to streaming directly from the file source with the open source Structured Streaming APIs.

For small datasets, the default **directory listing** execution mode provides exceptional performance and cost savings. As your data grows, the preferred execution mode is **file notification**. This mode requires configuring a connection to your storage queue service and event notifications, which allows Databricks to idempotently track and process all files as they arrive in your storage account.

In this notebook, we'll go through the basic configuration to ingest the log files for device MAC addresses from partner gyms.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_gym_logs.png" width="60%" />

## Learning Objectives
By the end of this lesson, students will be able to:
- Use Auto Loader to incrementally and idempotently load data from object storage into Delta tables
- Locate operation metrics using the **`DESCRIBE HISTORY`** command

## File Detection Modes

![](https://files.training.databricks.com/images/autoloader-detection-modes.png)

Auto Loader can be configured with two different file detection modes. While directory listing mode is the default and works well for small datasets and testing, file notification mode is preferred for large-scale production applications.

| **Directory Listing Mode** | **File Notification Mode** |
| --- | --- |
| Default mode | Requires some security permissions to other cloud services |
| Easily stream files from object storage without configuration | Uses cloud storage queue service and event notifications to track files |
| Creates file queue through parallel listing of input directory | Configurations handled automatically by Databricks |
| Good for smaller source directories | Scales well as data grows |

**NOTE**: Only directory listing mode will be shown in this notebook.

## Schema Inference and Evolution
Auto Loader has advanced support for working with data sources with unknown or changing schemas, including the ability to:
* Identify schema on stream initialization
* Auto-detect changes and evolve schema to capture new fields
* Add type hints for enforcement when schema is known
* Rescue data that does not meet expectations

Full documentation of this functionality is available <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html" target="_blank">here</a>.
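
The capabilities listed above map onto a handful of **`cloudFiles`** options. The sketch below only assembles an option map in plain Python; **`schema_options`** is a hypothetical helper, the path is a placeholder, and the hinted column names and types are assumptions based on the table output shown later in this notebook.

```python
# Evolution modes described in the Auto Loader schema documentation.
EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_options(schema_location, hints=None, evolution_mode="addNewColumns"):
    """Hypothetical helper: collect schema-related cloudFiles options."""
    if evolution_mode not in EVOLUTION_MODES:
        raise ValueError(f"Unknown schema evolution mode: {evolution_mode}")
    options = {
        # Where the inferred schema is stored between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Infer column types instead of reading every JSON field as a string.
        "cloudFiles.inferColumnTypes": "true",
        # How the stream reacts when new fields appear in the source.
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }
    if hints:
        # Type hints for columns whose types are already known.
        options["cloudFiles.schemaHints"] = hints
    return options


# Placeholder path and assumed column types, for illustration only.
opts = schema_options(
    "/tmp/example/schema",
    hints="first_timestamp DOUBLE, last_timestamp DOUBLE",
)
for key, value in opts.items():
    print(f"{key} = {value}")
```

In a notebook, a map like this could be passed to the reader with **`.options(**opts)`** alongside **`cloudFiles.format`**.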

## Setup

The notebook run below defines a function that lets us manually trigger the arrival of new data in our source container.

This will allow us to see Auto Loader in action.

In [0]:
%run ../Includes/Classroom-Setup-1.5

Our source directory contains a number of JSON files representing about a week of data.

In [0]:
files = dbutils.fs.ls(DA.paths.gym_mac_logs_json)
display(files)

path,name,size,modificationTime
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191201_1.json,20191201_1.json,98,1666781126000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191201_2.json,20191201_2.json,292,1666781126000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191201_3.json,20191201_3.json,195,1666781126000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191201_4.json,20191201_4.json,195,1666781127000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191201_5.json,20191201_5.json,486,1666781127000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191201_6.json,20191201_6.json,195,1666781127000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191202_2.json,20191202_2.json,389,1666781127000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191202_3.json,20191202_3.json,195,1666781127000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191202_4.json,20191202_4.json,389,1666781127000
dbfs:/user/odl_user_771624@databrickslabs.com/dbacademy/adewd/1.5/gym_mac_logs.json/20191202_5.json,20191202_5.json,389,1666781127000


## Using CloudFiles

Configuring Auto Loader requires using the **`cloudFiles`** format. 

The syntax for this format differs only slightly from a standard streaming read.

To configure we need to:
1. Specify the format as **`cloudFiles`**
2. Specify the file format via **`cloudFiles.format`** as **`json`**
3. Provide the location that Auto Loader will store the inferred schema via **`cloudFiles.schemaLocation`**
4. Optionally configure Auto Loader to use cloud notifications via **`cloudFiles.useNotifications`**<br/>
as opposed to listing the target directory. 

**Note:** The cloud-notifications feature does require additional cloud configuration.

See the <a href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html#configuration" target="_blank">Auto Loader documentation</a> for more information.
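
The four configuration steps above can be sketched as a small option builder. This is plain Python so it runs anywhere; **`auto_loader_options`** is a hypothetical helper and the schema path is a placeholder, not a value from this lesson.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Hypothetical helper: build the option map for a cloudFiles read."""
    options = {
        "cloudFiles.format": file_format,              # step 2: underlying file format
        "cloudFiles.schemaLocation": schema_location,  # step 3: where the inferred schema is stored
    }
    if use_notifications:
        # Step 4 (optional): file notification mode instead of directory listing.
        options["cloudFiles.useNotifications"] = "true"
    return options


# Placeholder path for illustration only.
listing = auto_loader_options("json", "/tmp/example/schema")
notification = auto_loader_options("json", "/tmp/example/schema", use_notifications=True)
print(listing)
print(notification)

# Step 1 happens on the reader itself. In a Databricks notebook this might look like:
#   spark.readStream.format("cloudFiles").options(**listing).load(source_path)
```

Leaving **`cloudFiles.useNotifications`** out entirely keeps the stream in the default directory listing mode.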

In [0]:
def load_gym_logs():
    query = (spark.readStream
                  .format("cloudFiles")
                  .option("cloudFiles.format", "json")
                  .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/gym_mac_logs_schema")
                  # .option("cloudFiles.useNotifications","true") # Set this option for file notification mode
                  .load(DA.paths.gym_mac_logs_json)
                  .writeStream
                  .format("delta")
                  .option("checkpointLocation", f"{DA.paths.checkpoints}/gym_mac_logs")
                  .trigger(availableNow=True)
                  .table("gym_mac_logs"))
    
    query.awaitTermination()

Note that we're using **trigger-available-now** logic for batch execution. Trigger-available-now is very similar to **trigger-once**, but it can run multiple batches until all available data is consumed instead of processing everything in one big batch. It was introduced in <a href="https://spark.apache.org/releases/spark-release-3-3-0.html" target="_blank">Spark 3.3.0</a> and <a href="https://docs.databricks.com/release-notes/runtime/10.4.html" target="_blank">Databricks Runtime 10.4 LTS</a>. While we may not have the latency requirements of a Structured Streaming workload, Auto Loader removes the need to implement our own change tracking on the file source, allowing us to simply trigger a cron job daily to process all new data that has arrived.
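
To make the difference between the two triggers concrete, here is a toy model in plain Python. It is not how Spark schedules work internally; **`plan_batches`** and the file names are hypothetical, and the batch size stands in for a rate limit such as **`cloudFiles.maxFilesPerTrigger`**.

```python
def plan_batches(pending_files, max_files_per_batch=None):
    """Toy model: split pending files into the batches a triggered run would process.

    max_files_per_batch=None mimics trigger-once (one big batch);
    a number mimics trigger-available-now with a per-batch rate limit.
    """
    if not pending_files:
        return []
    if max_files_per_batch is None:
        return [list(pending_files)]
    return [
        pending_files[start:start + max_files_per_batch]
        for start in range(0, len(pending_files), max_files_per_batch)
    ]


# Ten hypothetical files waiting in the source directory.
pending = [f"file_{i}.json" for i in range(10)]

once = plan_batches(pending)
available_now = plan_batches(pending, max_files_per_batch=4)

print([len(batch) for batch in once])           # [10]
print([len(batch) for batch in available_now])  # [4, 4, 2]
```

Either way, the run stops once every file that was pending at the start has been processed.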

In [0]:
load_gym_logs()

As always, each batch of newly processed data will create a new version of our table.

In [0]:
%sql 
DESCRIBE HISTORY gym_mac_logs

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
1,2022-10-26T10:47:22.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> cea2994a-322c-430a-b41c-4e5556a76214, epochId -> 0)",,List(4382774100892078),1024-143331-6vol2yy0,0.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 155, numOutputBytes -> 10867, numAddedFiles -> 4)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-10-26T10:47:17.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(4382774100892078),1024-143331-6vol2yy0,,WriteSerializable,True,Map(),,Databricks-Runtime/10.4.x-scala2.12


The helper method below will load an additional day of data.

In [0]:
DA.gym_mac_stream.load()

Auto Loader (the **`cloudFiles`** source) ignores previously processed files; only newly written files will be processed.
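
The idea of incremental, idempotent file processing can be illustrated with a toy tracker. This is a conceptual sketch only, not Auto Loader's actual implementation: **`ToyFileTracker`** is hypothetical and keeps its state in memory, whereas a real stream persists it in the checkpoint location.

```python
class ToyFileTracker:
    """Toy model: remember which files were already processed."""

    def __init__(self):
        self.processed = set()

    def new_files(self, listing):
        """Return only files not seen before, then mark them as processed."""
        fresh = sorted(set(listing) - self.processed)
        self.processed.update(fresh)
        return fresh


tracker = ToyFileTracker()

# First run: every file in the directory is new.
first_run = tracker.new_files(["20191201_1.json", "20191201_2.json"])

# Second run: one more file has landed; only that file is picked up.
second_run = tracker.new_files(
    ["20191201_1.json", "20191201_2.json", "20191202_2.json"]
)

# Third run with no new arrivals: nothing to do, so re-running is safe.
third_run = tracker.new_files(
    ["20191201_1.json", "20191201_2.json", "20191202_2.json"]
)

print(first_run, second_run, third_run)
```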

In [0]:
load_gym_logs()

In [0]:
%sql DESCRIBE HISTORY gym_mac_logs

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
2,2022-10-26T10:50:31.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> cea2994a-322c-430a-b41c-4e5556a76214, epochId -> 1)",,List(4382774100892078),1024-143331-6vol2yy0,1.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 23, numOutputBytes -> 5898, numAddedFiles -> 3)",,Databricks-Runtime/10.4.x-scala2.12
1,2022-10-26T10:47:22.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> cea2994a-322c-430a-b41c-4e5556a76214, epochId -> 0)",,List(4382774100892078),1024-143331-6vol2yy0,0.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 155, numOutputBytes -> 10867, numAddedFiles -> 4)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-10-26T10:47:17.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(4382774100892078),1024-143331-6vol2yy0,,WriteSerializable,True,Map(),,Databricks-Runtime/10.4.x-scala2.12


Run the cell below to process the remainder of the data provided.

In [0]:
DA.gym_mac_stream.load(continuous=True)
load_gym_logs()

In [0]:
%sql DESCRIBE HISTORY gym_mac_logs

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
3,2022-10-26T10:51:03.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> cea2994a-322c-430a-b41c-4e5556a76214, epochId -> 2)",,List(4382774100892078),1024-143331-6vol2yy0,2.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 136, numOutputBytes -> 10890, numAddedFiles -> 4)",,Databricks-Runtime/10.4.x-scala2.12
2,2022-10-26T10:50:31.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> cea2994a-322c-430a-b41c-4e5556a76214, epochId -> 1)",,List(4382774100892078),1024-143331-6vol2yy0,1.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 23, numOutputBytes -> 5898, numAddedFiles -> 3)",,Databricks-Runtime/10.4.x-scala2.12
1,2022-10-26T10:47:22.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> cea2994a-322c-430a-b41c-4e5556a76214, epochId -> 0)",,List(4382774100892078),1024-143331-6vol2yy0,0.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 155, numOutputBytes -> 10867, numAddedFiles -> 4)",,Databricks-Runtime/10.4.x-scala2.12
0,2022-10-26T10:47:17.000+0000,8453215174696142,odl_user_771624@databrickslabs.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(4382774100892078),1024-143331-6vol2yy0,,WriteSerializable,True,Map(),,Databricks-Runtime/10.4.x-scala2.12
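
The **`operationMetrics`** column is the quickest way to check how much each run wrote. The sketch below copies the metrics from the history output above into plain Python and totals the rows written; in a notebook, the same values could instead be read from the result of **`DESCRIBE HISTORY`**.

```python
# Metrics copied by hand from the DESCRIBE HISTORY output above.
history = [
    {"version": 3, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "136", "numAddedFiles": "4"}},
    {"version": 2, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "23", "numAddedFiles": "3"}},
    {"version": 1, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "155", "numAddedFiles": "4"}},
    {"version": 0, "operation": "CREATE TABLE", "operationMetrics": {}},
]

# Metric values are treated as strings here, so convert before summing.
rows_per_version = {
    entry["version"]: int(entry["operationMetrics"].get("numOutputRows", 0))
    for entry in history
}
total_rows = sum(rows_per_version.values())

print(rows_per_version)
print(f"Total rows written across all versions: {total_rows}")
```

Because every write here is an append with no removed files, the total should match the row count of the table after the last run.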


Now we can use SQL to examine our data.

In [0]:
%sql
SELECT * FROM gym_mac_logs

first_timestamp,gym,last_timestamp,mac,_rescued_data
1575720314.0,5,1575728284.0,a4:eb:49:d9:9b:1d,
1575703851.0,5,1575708506.0,1b:55:35:cb:d8:22,
1575704025.0,5,1575709692.0,00:6c:6c:53:51:ef,
1575727997.0,5,1575730807.0,db:c5:50:4d:55:90,
1575702929.0,5,1575709472.0,4c:c5:9f:cb:13:bd,
1575740507.0,5,1575745086.0,3f:7f:18:77:5c:34,
1575701813.0,5,1575705962.0,d0:30:3f:ef:7d:89,
1575739677.0,5,1575743814.0,35:d1:c5:f0:79:75,
1575366633.0,5,1575370147.0,a4:eb:49:d9:9b:1d,
1575357708.0,5,1575363687.0,1b:55:35:cb:d8:22,
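
The timestamp columns are easier to interpret once converted. The sketch below assumes that **`first_timestamp`** and **`last_timestamp`** are Unix epoch seconds in UTC, which the source data does not state explicitly, and uses the first row of the output above.

```python
from datetime import datetime, timezone

# Values copied from the first row of the query output above.
first_timestamp = 1575720314.0
last_timestamp = 1575728284.0

# Assumption: the values are Unix epoch seconds in UTC.
start = datetime.fromtimestamp(first_timestamp, tz=timezone.utc)
end = datetime.fromtimestamp(last_timestamp, tz=timezone.utc)

# Time between the first and last sighting of the device.
duration_seconds = (end - start).total_seconds()

print(f"First seen: {start.isoformat()}")
print(f"Last seen:  {end.isoformat()}")
print(f"Duration:   {duration_seconds / 60:.1f} minutes")
```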


Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>