![](/Workspace/Users/infoblisstech@gmail.com/databricks-code-repo/6_lakeflow_pipelines/autoloader_file_ingestion_usecase1.png)

### Auto Loader
**Auto Loader is Databricks' cloud-native file ingestion engine for ingesting new files incrementally from object storage.**

Supported Sources:
- AWS S3
- Azure ADLS Gen2
- Google Cloud Storage (GCS)

Modes:
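- **Directory listing** - Auto Loader scans the storage path on each trigger to detect new files. (This works in the Free Edition.)
- **File notification** - Auto Loader consumes file-create events from the cloud provider, so new files are picked up as they arrive without listing the directory, which scales better for large directories. (This does not work in the Free Edition, because the cloud storage event trigger can't control or trigger Databricks Lakeflow ingestion there.)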
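
As a rough illustration of how the two modes differ in configuration, the sketch below builds the reader options for each. The helper name `autoloader_options` is hypothetical; the option keys are the ones used in the notebook code in this section, and the sketch assumes directory listing is used whenever `cloudFiles.useNotifications` is not set.

```python
def autoloader_options(use_notifications: bool = False) -> dict:
    """Build a dict of Auto Loader reader options (hypothetical helper)."""
    options = {
        "cloudFiles.format": "csv",
        "header": "true",
    }
    if use_notifications:
        # File notification mode is opt-in; omitting the key keeps directory listing.
        options["cloudFiles.useNotifications"] = "true"
    return options


print(autoloader_options())
print(autoloader_options(use_notifications=True))
```

On a workspace, the resulting dict could be passed to the reader with `spark.readStream.format("cloudFiles").options(**opts)`.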

**Directory listing** (Databricks Lakeflow Ingestion - Autoloader - Directory Listing)
1. Spark lists the cloud directory (pull model).
2. Detects new files by comparing the listing with the files already recorded in the checkpoint **(incremental Auto Loader)**.
3. Infers the schema / **evolves** it if needed.
4. Copies the file(s) and stores the schema in the schema location, so the schema does not have to be inferred again on later runs.
5. After a file (for example, file1) is copied to the Bronze layer, updates the checkpoint, which tracks the files that have already been copied.
6. Waits for the next trigger of the Lakeflow pipeline, then repeats steps 1 to 5.
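
The directory-listing loop above can be imitated in plain Python to make the role of the checkpoint concrete. This is a simplified simulation, not Auto Loader's actual implementation: the JSON checkpoint file and the `load_checkpoint` / `run_trigger` helpers are assumptions made for illustration, and the copy to Bronze is only marked by a comment.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(checkpoint_file: Path) -> set:
    """Return the set of file names already recorded as processed."""
    if checkpoint_file.exists():
        return set(json.loads(checkpoint_file.read_text()))
    return set()


def run_trigger(source_dir: Path, checkpoint_file: Path) -> list:
    """One trigger: list the directory, pick unseen files, then commit them."""
    seen = load_checkpoint(checkpoint_file)
    # Step 1: list the source directory (pull model).
    listing = sorted(p.name for p in source_dir.iterdir() if p.is_file())
    # Step 2: keep only files the checkpoint has not recorded yet.
    new_files = [name for name in listing if name not in seen]
    # Steps 3-4 (schema inference and the copy to Bronze) would happen here.
    # Step 5: record the processed files so later triggers skip them.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source"
        source.mkdir()
        checkpoint = Path(tmp) / "checkpoint.json"

        (source / "file1.csv").write_text("city_id,city_name\n1,A\n")
        print(run_trigger(source, checkpoint))  # first trigger sees file1.csv

        (source / "file2.csv").write_text("city_id,city_name\n2,B\n")
        print(run_trigger(source, checkpoint))  # second trigger sees only file2.csv
```

The checkpoint lives outside the source directory on purpose, so it is never listed as an input file.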

**File Notification** (we will see this in the cloud Databricks version)
1. Cloud storage emits a file-create event (S3 event notification, ADLS via Event Grid, GCS via Pub/Sub).
2. The event is delivered to a queue that Auto Loader reads from.
3. Auto Loader receives the notification (push model).
4. The new file is registered.
5. Infers the schema / evolves it if needed.
6. Copies the file(s) and stores the schema in the schema location, so the schema does not have to be inferred again on later runs.
7. Updates the checkpoint (file1 is processed...).
8. The stream stays idle until the next event arrives.
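
To contrast with directory listing, the push model can be sketched with an in-memory queue standing in for the cloud event queue. This is only a simulation under that assumption: `drain_notifications` is a hypothetical helper, and the real service delivers events through the cloud provider rather than through Python's `queue` module.

```python
import queue


def drain_notifications(events: queue.Queue, registered: set) -> list:
    """Consume all pending file-create events and return newly registered paths."""
    new_files = []
    while True:
        try:
            path = events.get_nowait()
        except queue.Empty:
            break  # no more events: the stream would now stay idle
        if path not in registered:
            # Register each file once, even if its event is delivered twice.
            registered.add(path)
            new_files.append(path)
    return new_files


events = queue.Queue()
registered = set()
for path in ["file1.csv", "file2.csv", "file1.csv"]:  # duplicate event for file1
    events.put(path)

print(drain_notifications(events, registered))
print(drain_notifications(events, registered))
```

No directory is listed at any point; the set of registered files plays the part of the checkpoint.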

**Benefits of Auto Loader:**
- Incremental and Efficient File Ingestion: Auto Loader automatically detects and processes new files as they arrive in your source directory (e.g., S3 or Unity Catalog volume). This eliminates manual tracking and reprocessing, ensuring only new data is ingested each run.

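- Schema Evolution Support: With "cloudFiles.schemaEvolutionMode": "addNewColumns" on the read side and "mergeSchema": "true" on the write side, Auto Loader can handle changes in your data schema over time: new source columns are added to the tracked schema and to the target table instead of being dropped. Note that in a plain Structured Streaming job the stream stops when it first detects a new column and picks up the updated schema when it is restarted.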

- Scalability and Resource Optimization: Properties such as "cloudFiles.maxFilesPerTrigger" allow you to control how many files are processed per batch, helping manage resource usage and scale to large datasets.

- Checkpointing and Fault Tolerance: Auto Loader maintains checkpoints and schema locations, so it can resume from where it left off in case of failures, ensuring reliable and consistent data ingestion.

- Unified Streaming and Batch Processing: The same readStream and writeStream code serves both continuous ingestion and scheduled batch-style runs. For example, trigger(availableNow=True) processes all files available when the run starts and then stops, which suits scheduled ingestion.
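
The effect of "cloudFiles.maxFilesPerTrigger" from the list above can be pictured as slicing the backlog of pending files into fixed-size micro-batches. The generator below is a plain-Python sketch of that idea, not Auto Loader's scheduler; the name `micro_batches` and the sample file names are made up for illustration.

```python
def micro_batches(pending_files: list, max_files_per_trigger: int):
    """Yield successive groups of at most max_files_per_trigger files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(pending_files), max_files_per_trigger):
        yield pending_files[start:start + max_files_per_trigger]


backlog = ["f1.csv", "f2.csv", "f3.csv", "f4.csv", "f5.csv"]
for batch in micro_batches(backlog, 2):
    print(batch)
```

Every file is still processed eventually; the limit only caps how much work a single micro-batch takes on.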

**To perform schema evolution, we have to use the below properties:**<br>
**Read side:** <br>
.option("cloudFiles.schemaEvolutionMode","addNewColumns")<br>
**Write side:** <br>
.option("mergeSchema", "true")<br>
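
As a conceptual sketch of what "addNewColumns" means, the function below merges an incoming schema into a stored one: columns that are new get appended, while columns that already exist keep their recorded type. The dict-based schema representation and the `evolve_schema` name are assumptions for illustration; Auto Loader keeps its real schema in the schema location.

```python
def evolve_schema(stored: dict, incoming: dict) -> tuple:
    """Return (evolved_schema, added_columns) using add-new-columns semantics."""
    evolved = dict(stored)  # copy so the stored schema is left untouched
    added = []
    for name, dtype in incoming.items():
        if name not in evolved:
            evolved[name] = dtype
            added.append(name)
        # Existing columns keep their stored type, even if the incoming type differs.
    return evolved, added


stored = {"city_id": "int", "city_name": "string"}
incoming = {"city_id": "string", "city_name": "string", "population": "int"}

evolved, added = evolve_schema(stored, incoming)
print(evolved)
print(added)
```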

In [0]:
# Auto Loader: incremental ingestion from a cloud source, with schema evolution
cloudsrc = "/Volumes/catalog2_we47/schema2_we47/datalake/sourcepath/"  # source path (Unity Catalog volume; an s3/adls/gcs URI also works)
# cloudsrc = "gs://izsourcebucket/Master_City_List_hour1.csv"
# The source must be accessible from the AWS-based Databricks workspace, such as an S3 bucket or a Unity Catalog volume.
bronzetgt = "/Volumes/catalog2_we47/schema2_we47/datalake/bronze/streamwrite1/"
ckptlocation = "/Volumes/catalog2_we47/schema2_we47/datalake/bronze/streamwrite1/_checkpoint"  # records which files were copied, updated after each successful write
schemalocation = "/Volumes/catalog2_we47/schema2_we47/datalake/bronze/streamwrite1/_schema"  # stores the inferred schema of the source data

# checkpointLocation is a write-side option, so it is set on writeStream rather than here.
df1 = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.maxFilesPerTrigger", 1)  # files per micro-batch; limits resource usage (all files are processed eventually)
    .option("cloudFiles.inferColumnTypes", True)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.schemaLocation", schemalocation)
    .option("header", True)
    # .option("cloudFiles.useNotifications", "true")  # enables file notification mode; leave it out to use directory listing
    .load(cloudsrc)
)

In [0]:
# A real-time trigger is not possible on free serverless compute, so availableNow is used instead.
# writeStream pulls from df1 (lazily evaluated; nothing runs until start()) and writes to bronzetgt,
# using the schema produced by the reader and the file progress stored in the checkpoint.
(
    df1.writeStream.format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", ckptlocation)
    .option("mergeSchema", "true")  # lets new source columns be added to the Delta target
    .start(bronzetgt)
)

In [0]:
spark.read.format("delta").load(bronzetgt).orderBy("city_name").show(100)