**Auto Loader** is Databricks’ cloud-native file ingestion engine for ingesting new files incrementally from object storage.<br>

Auto Loader incrementally processes only new files arriving in cloud storage, without reprocessing old files, and scales to millions of files efficiently.

Supported Sources:
- AWS S3
- Azure ADLS Gen2
- Google Cloud Storage (GCS)

Auto Loader detects new files using one of two modes:
- **Directory listing** - scans the storage path and compares the listing with the checkpoint to find files it has not processed yet
- **File notification** - subscribes to cloud storage events so files are processed as they arrive, even at large scale (recommended)

A checkpoint is like a bookmark that remembers how much work is already done, so Spark doesn’t start again from the beginning.

**Directory listing Mode**<br>
Auto Loader periodically scans the storage directory, lists all files, compares them with its checkpoint, and processes only the newly discovered files.<br>
1. Auto Loader scans the input directory
2. Lists all files present
3. Compares with checkpoint metadata
4. Identifies new files
5. Reads only new files
6. Updates checkpoint
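
The six steps above can be sketched in plain Python. This is only an illustration of the idea, not how Auto Loader is implemented: the helper names `load_checkpoint` and `run_once` and the JSON checkpoint file are assumptions made for this example.

```python
import json
import os
import tempfile


def load_checkpoint(checkpoint_path):
    """Return the set of file names already processed (empty on the first run)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))


def run_once(input_dir, checkpoint_path):
    """One listing pass: list, compare with the checkpoint, read only new files."""
    processed = load_checkpoint(checkpoint_path)
    listed = sorted(os.listdir(input_dir))                 # steps 1-2: scan and list
    new_files = [n for n in listed if n not in processed]  # steps 3-4: compare, identify
    for name in new_files:                                 # step 5: read only new files
        with open(os.path.join(input_dir, name)) as f:
            f.read()
    with open(checkpoint_path, "w") as f:                  # step 6: update checkpoint
        json.dump(sorted(processed | set(new_files)), f)
    return new_files


# Demo with temporary folders standing in for cloud storage.
with tempfile.TemporaryDirectory() as input_dir, tempfile.TemporaryDirectory() as state_dir:
    checkpoint_path = os.path.join(state_dir, "checkpoint.json")
    for name in ("file1.csv", "file2.csv"):
        with open(os.path.join(input_dir, name), "w") as f:
            f.write("id,amount\n1,10\n")
    first_run = run_once(input_dir, checkpoint_path)    # picks up file1 and file2
    with open(os.path.join(input_dir, "file3.csv"), "w") as f:
        f.write("id,amount\n2,20\n")
    second_run = run_once(input_dir, checkpoint_path)   # picks up only file3
    third_run = run_once(input_dir, checkpoint_path)    # nothing new to process
```

Each run still lists every file, which is why the cost of this mode grows with the number of files in the directory.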



**File notification Mode**<br>
Auto Loader does not scan directories.
Instead, it listens to cloud storage events that notify when a new file arrives.

1. File arrives in cloud storage
2. Cloud emits an event
3. Event is sent to a message queue
4. Auto Loader reads event metadata
5. Auto Loader reads only that file
6. Updates checkpoint

New files are discovered from events, so no repeated directory scanning is needed.
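
The event-driven flow can be sketched with an in-memory queue. This is a simplified illustration under stated assumptions: `on_file_arrival` and `drain_events` are hypothetical names, `queue.Queue` stands in for the cloud message queue, and a Python set stands in for the checkpoint.

```python
import queue


def on_file_arrival(events, path):
    """Steps 1-3: a file arrives and an event describing it lands on the queue."""
    events.put({"path": path})


def drain_events(events, processed):
    """Steps 4-6: read event metadata, handle only the named files, record progress."""
    handled = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        path = event["path"]
        if path not in processed:   # skip duplicate deliveries of the same event
            processed.add(path)     # stands in for the checkpoint update
            handled.append(path)
    return handled


events = queue.Queue()
processed = set()
on_file_arrival(events, "raw/file1.csv")
on_file_arrival(events, "raw/file2.csv")
on_file_arrival(events, "raw/file1.csv")   # same event delivered twice
first_batch = drain_events(events, processed)
second_batch = drain_events(events, processed)
```

The work per batch depends on the number of events, not on how many files already sit in the directory.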


#### 1. Directory listing
1. Spark lists the directory
2. Detects new files (CSV or any other supported format)
3. Infers the schema and evolves it if needed
4. Writes the data to the bronze layer
5. Updates the checkpoint (file1 is processed...)
6. Waits for the next job run to list the directory again and load any new files into bronze

| Part                   | Meaning                    |
| ---------------------- | -------------------------- |
| `spark`                | Spark session              |
| `readStream`           | Read data **continuously** |
| `format("cloudFiles")` | Use **Auto Loader**        |
<br>

**option("checkpointLocation")**<br>
`checkpointLocation` is a storage folder where the stream records its progress, that is, which data has already been processed. It is set on the write side (`writeStream`).<br>

**If you delete or change the checkpointLocation folder**

- Stream starts from scratch
- Files get reprocessed
- Duplicate data risk

-------

**option("cloudFiles.schemaLocation")**<br>
`cloudFiles.schemaLocation` is a folder where Auto Loader stores the inferred data structure (schema) and tracks how it changes over time.<br>

**What it stores**

- Column names
- Data types
- Schema versions (history)
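
To make the idea of schema versions concrete, here is a small plain-Python sketch. It only illustrates the concept of keeping a history and adding a version when new columns appear; the function `evolve_schema` and the list-of-dicts layout are assumptions for this example and do not reflect the actual files Auto Loader writes to the schema folder.

```python
def evolve_schema(schema_versions, incoming_columns):
    """Append a new schema version when unseen columns appear; return True if evolved."""
    current = schema_versions[-1] if schema_versions else {}
    added = {name: dtype for name, dtype in incoming_columns.items() if name not in current}
    if not added:
        return False
    # Existing columns keep their recorded types; only new columns are added.
    schema_versions.append({**current, **added})
    return True


schema_versions = []   # hypothetical stand-in for the schema folder
evolve_schema(schema_versions, {"id": "int", "amount": "double"})
unchanged = evolve_schema(schema_versions, {"id": "int", "amount": "double"})
evolved = evolve_schema(
    schema_versions, {"id": "int", "amount": "double", "region": "string"}
)
```

After these calls the history holds two versions: the original two columns, then the same columns plus `region`.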


In [0]:
# Directory listing is the default mode; add
# .option("cloudFiles.useNotifications", "true") to switch to file notification mode.
df1 = spark.readStream.format("cloudFiles")\
    .option("cloudFiles.format", "csv")\
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")\
    .option("cloudFiles.maxFilesPerTrigger", 5)\
    .option("cloudFiles.inferColumnTypes", True)\
    .option("cloudFiles.schemaLocation", "/Volumes/sales_project/sales/pipeline_1/_schema/")\
    .option("header", True)\
    .load("/Volumes/sales_project/sales/pipeline_1/AWS_Raw_Data/")  # source path; can also be S3/ADLS/GCS
# checkpointLocation is a writeStream option, so it is set on the write below.
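
For comparison, a file notification variant of the same read might look like the sketch below. It assumes a Databricks runtime where `spark` is available and where the workspace has permission to set up the cloud notification resources; the function name `read_with_notifications` and its parameters are hypothetical, and the paths passed in are placeholders.

```python
def read_with_notifications(spark, source_path, schema_path, file_format="csv"):
    """Build an Auto Loader reader that uses file notification mode."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        # Switch from the default directory listing to file notification mode.
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
    )
```

Wrapping the options in a function keeps the mode switch in one place, so the same pipeline can be pointed at either mode by changing a single option.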


availableNow=True tells Spark to process all data that is currently available and then stop the stream automatically.

In [0]:
# cloudFiles.schemaLocation is a reader option, so it is not repeated on the write.
df1.writeStream.format("delta")\
    .trigger(availableNow=True)\
    .option("checkpointLocation", "/Volumes/sales_project/sales/pipeline_1/_checkpoint/")\
    .start("/Volumes/sales_project/sales/pipeline_1/streamwrite1/")

In [0]:
df2 = spark.read.format("delta").load("/Volumes/sales_project/sales/pipeline_1/streamwrite1/")
display(df2)