
# What is Databricks Auto Loader?

<img src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/autoloader/autoloader-edited-anim.gif" style="float:right; margin-left: 10px" />

[Databricks Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) lets you scan a cloud storage folder (S3, ADLS, GCS) and ingest only the new data that has arrived since the previous run.

This is called **incremental ingestion**.

Auto Loader can be used in a near real-time stream or in a batch fashion, e.g., running every night to ingest daily data.

Auto Loader provides a strong guarantee when used with a Delta sink: each file's data is ingested exactly once.
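
The streaming and batch modes described above differ mainly in the trigger passed to the write stream. Here is a minimal sketch, assuming a runtime whose `DataStreamWriter.trigger` supports `availableNow` and an Auto Loader streaming DataFrame like the ones built later in this notebook. The helper names `trigger_options` and `write_incremental`, and the 10-second interval, are hypothetical and only there for illustration:

```python
def trigger_options(mode):
    """Return keyword arguments for DataStreamWriter.trigger().

    "batch" processes all files available now and then stops (e.g. a nightly job);
    "streaming" keeps the query running and checks for new files periodically.
    """
    if mode == "batch":
        return {"availableNow": True}
    if mode == "streaming":
        return {"processingTime": "10 seconds"}  # illustrative interval
    raise ValueError(f"unknown mode: {mode!r}")


def write_incremental(stream_df, table_name, checkpoint_path, mode="batch"):
    """Write an Auto Loader stream to a Delta table (hypothetical helper)."""
    return (stream_df.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_path)
            .trigger(**trigger_options(mode))
            .toTable(table_name))
```

In both modes the checkpoint location is what lets a run resume from where the previous one stopped, so it should stay the same across runs.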

## How Auto Loader simplifies data ingestion

Ingesting data from cloud storage at scale can be hard. Auto Loader makes it easier, offering these benefits:


* **Incremental** & **cost-efficient** ingestion (removes unnecessary listing or state handling)
* **Simple** and **resilient** operation: no tuning or manual code required
* Scalable to **billions of files**
  * Using incremental listing (recommended, relies on filename order)
  * Leveraging notification + message queue (when incremental listing can't be used)
* **Schema inference** and **schema evolution** are handled out of the box for most formats (csv, json, avro, images...)

In [0]:
%run ./_resources/00-setup $reset_all_data=false

In [0]:
# Explore our json raw data:
display(spark.read.text(volume_folder + "/user_json"))

## Auto Loader basics:

In [0]:
bronze_df = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .option("cloudFiles.maxFilesPerTrigger", "1")
             .schema("address STRING, creation_date STRING, firstname STRING, lastname STRING, id BIGINT")
             .load(volume_folder + "/user_json"))

display(bronze_df, get_chkp_folder())

## Schema inference:
Specifying the schema manually can be a challenge, especially with dynamic JSON.

* Schema inference has always been expensive and slow at scale, but not with Auto Loader. Auto Loader efficiently samples data to infer the schema and stores it under `cloudFiles.schemaLocation` in your bucket. 
* Additionally, `cloudFiles.inferColumnTypes` will determine the proper data type from your JSON.

*Notes:*
* *With Delta Live Tables you don't even have to set this option; the engine manages the schema location for you.*
* *Sampling size can be changed with `spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes`*

In [0]:
bronze_df = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .option("cloudFiles.schemaLocation", volume_folder + "/inferred_schema")
             .option("cloudFiles.inferColumnTypes", "true")
             .load(volume_folder + "/user_json"))

display(bronze_df, get_chkp_folder())
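
The sampling size mentioned in the notes above is a Spark configuration rather than a reader option, so it would be set on the session before the stream starts. A small sketch, assuming the setting accepts a byte string such as `"10gb"` (check the Auto Loader documentation for your runtime). The helper name `set_schema_sample_size` is hypothetical:

```python
SAMPLE_BYTES_CONF = "spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes"


def set_schema_sample_size(spark, size):
    """Set how much data Auto Loader samples for schema inference.

    `size` is assumed to be a byte string such as "10gb".
    Returns the value read back from the session configuration.
    """
    spark.conf.set(SAMPLE_BYTES_CONF, size)
    return spark.conf.get(SAMPLE_BYTES_CONF)

# Usage in a notebook, where `spark` is the active session:
# set_schema_sample_size(spark, "10gb")
```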

## Schema hints:
You might need to enforce a part of your schema, e.g., to convert a timestamp.

In [0]:
bronze_df = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .option("cloudFiles.schemaLocation", volume_folder + "/inferred_schema")
             .option("cloudFiles.inferColumnTypes", "true")
             .option("cloudFiles.schemaHints", "id BIGINT")
             .load(volume_folder + "/user_json"))

display(bronze_df, get_chkp_folder())
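
The hint string is a comma-separated list of `column TYPE` pairs, so the timestamp case mentioned above can be handled the same way as `id`. A minimal sketch that builds the string from a dictionary; the helper name `build_schema_hints` is hypothetical, and hinting `creation_date` as `TIMESTAMP` assumes its values are in a format Spark can parse:

```python
def build_schema_hints(hints):
    """Turn {"column": "SQL TYPE"} into the comma-separated string
    passed to the cloudFiles.schemaHints option."""
    return ", ".join(f"{column} {sql_type}" for column, sql_type in hints.items())


hints = build_schema_hints({"id": "BIGINT", "creation_date": "TIMESTAMP"})
print(hints)  # id BIGINT, creation_date TIMESTAMP
```

The result can then be passed as `.option("cloudFiles.schemaHints", hints)`. Columns not listed in the hints keep their inferred type.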

## Incorrect schema:
Auto Loader automatically recovers from an incorrect schema and conflicting types. It saves the mismatched data in the `_rescued_data` column.

In [0]:
# Adding an incorrect field ("id" as string instead of bigint):
from pyspark.sql import Row

data = [Row(email="quentin.ambard@databricks.com", firstname="Quentin", id="456455", lastname="Ambard")]
incorrect_data = spark.createDataFrame(data)

(incorrect_data.write
 .format("json")
 .mode("append")
 .save(volume_folder + "/user_json"))

In [0]:
def get_stream():
  return (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "json")
                .option("cloudFiles.schemaLocation", f"{volume_folder}/inferred_schema")
                .option("cloudFiles.inferColumnTypes", "true")
                .option("cloudFiles.schemaHints", "id BIGINT")
                .load(volume_folder + "/user_json"))

In [0]:
wait_for_rescued_data()

In [0]:
# Start the stream and filter on the rescued data column to see how the incorrect data is captured
display(get_stream().filter("_rescued_data IS NOT NULL"), get_chkp_folder())
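
The rescued data column holds a JSON document with the fields that did not match the schema, along with the path of the source file. Once rows are collected, that JSON can be unpacked with the standard library. A small sketch; the helper name `split_rescued` and the sample value (including its file path) are hypothetical, and the exact content depends on your data:

```python
import json


def split_rescued(rescued_json):
    """Separate the rescued fields from the source file path."""
    payload = json.loads(rescued_json)
    file_path = payload.pop("_file_path", None)
    return payload, file_path


# Hypothetical value, shaped like what the incorrect "id" row could produce.
sample = '{"id": "456455", "_file_path": "/path/to/user_json/part-0000.json"}'
fields, path = split_rescued(sample)
print(fields)  # {'id': '456455'}
print(path)    # /path/to/user_json/part-0000.json
```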

### Adding a new column
By default, the stream fails with an `UnknownFieldException` when a new column appears. Auto Loader records the new column in the schema location before failing, so you only have to restart the stream to include it.

*Notes*:
* *See `cloudFiles.schemaEvolutionMode` for different behaviors and more details.*
* *Don't forget to add `.writeStream.option("mergeSchema", "true")` to dynamically add columns when writing to a Delta table.*

In [0]:
# The existing stream will fail with: org.apache.spark.sql.catalyst.util.UnknownFieldException.
display(get_stream(), get_chkp_folder())

In [0]:
# Add 'new_column':
from pyspark.sql import Row
data = [Row(email="quentin.ambard@databricks.com", firstname="Quentin", id=456454, lastname="Ambard", new_column="test new column value")]
new_row = spark.createDataFrame(data)

(new_row.write
 .format("json")
 .mode("append")
 .save(volume_folder + "/user_json"))

In [0]:
# We just have to restart it to capture the new data.
display(get_stream().filter("new_column IS NOT NULL"), get_chkp_folder())
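
The `cloudFiles.schemaEvolutionMode` option mentioned in the notes above controls what happens when a new column shows up. The sketch below summarizes the documented modes and builds a reader options dictionary. The helper name `autoloader_options` is hypothetical, and the descriptions are paraphrased, so check the Auto Loader documentation for the exact behavior on your runtime:

```python
EVOLUTION_MODES = {
    "addNewColumns": "stream fails; new columns are added to the schema, so a restart picks them up",
    "rescue": "schema never evolves; new columns are recorded in the rescued data column",
    "failOnNewColumns": "stream fails and does not restart until the schema is updated",
    "none": "schema never evolves; new columns are ignored and the stream keeps running",
}


def autoloader_options(schema_location, evolution_mode="addNewColumns", file_format="json"):
    """Build Auto Loader reader options (hypothetical helper)."""
    if evolution_mode not in EVOLUTION_MODES:
        raise ValueError(f"unknown schema evolution mode: {evolution_mode!r}")
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }

# Usage in a notebook:
# options = autoloader_options(volume_folder + "/inferred_schema", "rescue")
# spark.readStream.format("cloudFiles").options(**options).load(volume_folder + "/user_json")
```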

## Ingesting a high volume of input files:
Scanning folders with many files to detect new data is an expensive operation, leading to ingestion challenges and higher cloud storage costs.

To solve this issue and support efficient listing, Databricks Auto Loader offers two modes:

- Incremental listing with `cloudFiles.useIncrementalListing` (recommended), which relies on the lexicographic order of file paths to scan only new data (e.g., `ingestion_path/YYYY-MM-DD`).
- Notification system with `cloudFiles.useNotifications`, which sets up a managed cloud notification service that sends new file names to a queue.

<img src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/autoloader-mode.png" width="700"/>

Use the incremental listing option whenever possible. Databricks Auto Loader will try to auto-detect and use the incremental approach when possible.
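
To see why incremental listing depends on file path order, here is a conceptual illustration in plain Python. It is not Auto Loader's actual implementation: it only shows that when paths sort lexicographically, everything new sits after the last processed path and can be found without rescanning older entries. The function name `files_after` and the sample paths are hypothetical:

```python
from bisect import bisect_right


def files_after(sorted_paths, last_processed):
    """Return the paths that sort strictly after the last processed one."""
    return sorted_paths[bisect_right(sorted_paths, last_processed):]


paths = sorted([
    "ingestion_path/2024-01-01/a.json",
    "ingestion_path/2024-01-02/a.json",
    "ingestion_path/2024-01-02/b.json",
    "ingestion_path/2024-01-03/a.json",
])
new_files = files_after(paths, "ingestion_path/2024-01-02/a.json")
print(new_files)
# ['ingestion_path/2024-01-02/b.json', 'ingestion_path/2024-01-03/a.json']
```

In this sketch, a late file written under `ingestion_path/2024-01-01/` would sort before the last processed path and be missed, which is why a date-ordered layout matters for this mode.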

## Support for images:
Databricks Auto Loader provides native support for images and binary files.

<img src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/autoloader-images.png" width="800" />

Just set the format accordingly and the engine will do the rest: `.option("cloudFiles.format", "binaryFile")`

Use-cases:

- ETL images into a Delta table using Auto Loader.
- Automatically ingest continuously arriving new images.
- Easily retrain ML models on new images.
- Perform distributed inference using a pandas UDF directly from Delta.
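
A minimal sketch of the first use case, reading images as a stream of binary files. The helper name `read_images` and the `*.png` filter are assumptions for illustration; `pathGlobFilter` is the generic Spark file source option for filtering by file name:

```python
def read_images(spark, source_path, glob="*.png"):
    """Return a streaming DataFrame of binary files (hypothetical helper).

    With the binaryFile format, each row carries the file's path,
    modificationTime, length, and raw content.
    """
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "binaryFile")
            .option("pathGlobFilter", glob)
            .load(source_path))

# Usage in a notebook, where `spark` is the active session:
# images_df = read_images(spark, volume_folder + "/images")
```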

## Deploying robust ingestion jobs in production

Let's see how to use Auto Loader to ingest JSON files, support schema evolution, and automatically restart when a new column is found.

If you need your job to be resilient with regard to an evolving schema, you have multiple options:

* Let the full job fail & configure Databricks Workflow to restart it automatically.
* Leverage Delta Live Tables to simplify all the setup (DLT handles everything for you out of the box).
* Wrap your call to restart the stream when the new column appears.

In [0]:
def start_stream_restart_on_schema_evolution():
    while True:
        try:
            stream = (spark.readStream
                            .format("cloudFiles")
                            .option("cloudFiles.format", "json")
                            .option("cloudFiles.schemaLocation", volume_folder + "/inferred_schema")
                            .option("cloudFiles.inferColumnTypes", "true")
                            .load(volume_folder + "/user_json")
                           .writeStream
                            .format("delta")
                            .option("checkpointLocation", volume_folder + "/checkpoint")
                            .option("mergeSchema", "true")
                            .table("autoloader_demo_output"))
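            stream.awaitTermination()
            # The stream was stopped cleanly (e.g. with stream.stop()): leave the loop.
            break
        except Exception as e:
            # Restart only when a new column was detected; re-raise anything else.
            if 'UnknownFieldException' not in str(e):
                raise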

In [0]:
DBDemos.stop_all_streams()