
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Appendix - Python Auto Loader
### Extra material, not part of a live teach.

This demonstration introduces running Auto Loader in Python for incremental ingestion. In this example you will execute Auto Loader manually to incrementally ingest the data.


## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - In the drop-down, select **More**.

   - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.


## A. Classroom Setup

Run the following cell to configure your working environment for this notebook.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course in the lab environment.

In [0]:
%run ../Includes/Classroom-Setup-Auto-Loader

View the available file(s) in your `/Volumes/dbacademy/labuser/csv_files_autoloader_source` volume. Notice only one file exists.

In [0]:
%python
spark.sql(f'LIST "/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source"').display()
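
On Unity Catalog-enabled compute, volumes are also exposed as ordinary file paths, so the same check can be sketched with Python's standard library. The helper name `list_csv_files` is hypothetical and is not part of the course's `DA` utilities. The path in the usage comment assumes the lab layout used in this notebook.

```python
from pathlib import Path


def list_csv_files(directory: str) -> list[str]:
    """Return the sorted names of the .csv files directly inside `directory`."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )


# Usage in the lab (assumes the volume path used in this notebook):
# list_csv_files(f"/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source")
```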


[What is Auto Loader?](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/)

[Using Auto Loader with Unity Catalog](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/unity-catalog)

[Auto Loader options](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options#csv-options)

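The cell below uses Auto Loader to read the pipe-delimited CSV files from the source volume and write them to the **python_csv_autoloader** table. Both the stream checkpoint and the inferred schema are stored in the **auto_loader_files** volume, and `trigger(once=True)` processes the files that are currently available and then stops the stream.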

In [0]:
%python

## Create a volume to store the Auto Loader checkpoint files
spark.sql(f'CREATE VOLUME IF NOT EXISTS dbacademy.{DA.schema_name}.auto_loader_files')

## Set checkpoint location to the volume from above
checkpoint_file_location = f'/Volumes/dbacademy/{DA.schema_name}/auto_loader_files'

## Incrementally ingest (or stream) data using Auto Loader
(spark
 .readStream
   .format("cloudFiles")
   .option("cloudFiles.format", "csv")
   .option("header", "true")
   .option("sep", "|")
   .option("inferSchema", "true")
   .option("cloudFiles.schemaLocation", f"{checkpoint_file_location}")
   .load(f"/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source/")
 .writeStream
   .option("checkpointLocation", f"{checkpoint_file_location}")
   .trigger(once=True)
   .toTable(f"dbacademy.{DA.schema_name}.python_csv_autoloader")
)
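
Because this notebook runs the same reader and writer chain more than once, it can be convenient to wrap the chain in a function. The sketch below is one way to do that. The helper name `ingest_csv_with_auto_loader` is hypothetical. It uses only the options from the cell above and adds `awaitTermination()` so that the call returns after the triggered run has finished.

```python
def ingest_csv_with_auto_loader(spark, source_path, checkpoint_path, target_table):
    """Run one triggered Auto Loader pass from source_path into target_table.

    Hypothetical helper: it mirrors the options used in this notebook and
    expects the active SparkSession on Databricks to be passed in as `spark`.
    """
    query = (spark
        .readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .option("sep", "|")
            .option("inferSchema", "true")
            .option("cloudFiles.schemaLocation", checkpoint_path)
            .load(source_path)
        .writeStream
            .option("checkpointLocation", checkpoint_path)
            .trigger(once=True)
            .toTable(target_table)
    )
    query.awaitTermination()  # block until the triggered run has finished
    return query


# Usage in the lab (assumes the paths and table name used in this notebook):
# ingest_csv_with_auto_loader(
#     spark,
#     f"/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source/",
#     f"/Volumes/dbacademy/{DA.schema_name}/auto_loader_files",
#     f"dbacademy.{DA.schema_name}.python_csv_autoloader",
# )
```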

View the new table **python_csv_autoloader**. Notice the data was ingested into a table and contains 3,149 rows.

In [0]:
SELECT *
FROM python_csv_autoloader;

Add a CSV file to your `/Volumes/dbacademy/labuser/csv_files_autoloader_source/` volume.

In [0]:
%python
copy_files(copy_from = '/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/', copy_to = f"/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source/", n=2)
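
The `copy_files` helper comes from the classroom setup and its implementation is not shown in this notebook. As a rough stand-in for use outside the lab, the sketch below assumes the helper copies the first `n` files in name order. The function `copy_first_n_files` is a hypothetical name and may not match the course helper's actual behavior.

```python
import shutil
from pathlib import Path


def copy_first_n_files(copy_from: str, copy_to: str, n: int) -> list[str]:
    """Copy the first n files (sorted by name) from copy_from into copy_to."""
    destination = Path(copy_to)
    destination.mkdir(parents=True, exist_ok=True)
    sources = sorted(p for p in Path(copy_from).iterdir() if p.is_file())[:n]
    for src in sources:
        shutil.copy(src, destination / src.name)  # overwrites a same-named file
    return [src.name for src in sources]
```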

Confirm your volume contains 2 CSV files.

In [0]:
%python
spark.sql(f'LIST "/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source"').display()

Rerun your Auto Loader ingestion from above (pasted for you below) to incrementally ingest only the new file.

In [0]:
%python

checkpoint_file_location = f'/Volumes/dbacademy/{DA.schema_name}/auto_loader_files'

(spark
 .readStream
   .format("cloudFiles")
   .option("cloudFiles.format", "csv")
   .option("header", "true")
   .option("sep", "|")
   .option("inferSchema", "true")
   .option("cloudFiles.schemaLocation", f"{checkpoint_file_location}")
   .load(f"/Volumes/dbacademy/{DA.schema_name}/csv_files_autoloader_source/")
 .writeStream
   .option("checkpointLocation", f"{checkpoint_file_location}")
   .trigger(once=True)
   .toTable(f"dbacademy.{DA.schema_name}.python_csv_autoloader")
)
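
The rerun picks up only the new file because the checkpoint records which files have already been processed. The toy model below illustrates that idea in plain Python. It is not how Auto Loader is implemented. The function name `discover_new_files` and the JSON state file are assumptions made for this sketch.

```python
import json
from pathlib import Path


def discover_new_files(source_dir: str, state_file: str) -> list[str]:
    """Return file names in source_dir not seen on earlier calls, then record them."""
    state_path = Path(state_file)
    # Load the set of previously processed file names, if any.
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    current = {p.name for p in Path(source_dir).iterdir() if p.is_file()}
    new_files = sorted(current - seen)
    # Persist the updated state so the next call skips these files.
    state_path.write_text(json.dumps(sorted(seen | current)))
    return new_files
```

Calling the function twice without adding a file returns an empty list the second time, which mirrors a rerun of the stream with no new input.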

View the **python_csv_autoloader** table. Notice that it now contains 6,081 rows.

In [0]:
SELECT * 
FROM python_csv_autoloader;

View the history of the **python_csv_autoloader** table. Notice the two **STREAMING UPDATE** operations. In the **operationMetrics** column you can see how many rows were ingested in each streaming update. Notice that each run ingests only new files.

In [0]:
DESCRIBE HISTORY python_csv_autoloader;
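
The per-update row counts can also be extracted programmatically. The sketch below assumes the history rows have been collected into dictionaries with `version`, `operation`, and `operationMetrics` keys, and that streaming writes report a `numOutputRows` metric. The helper name `rows_per_streaming_update` is hypothetical. The sample row counts are derived from the totals quoted in this notebook (3,149 rows after the first run and 6,081 after the second). The version numbers and the non-streaming row are made up for illustration.

```python
def rows_per_streaming_update(history_rows):
    """Map table version -> rows written, for STREAMING UPDATE operations only."""
    result = {}
    for row in history_rows:
        if row["operation"] == "STREAMING UPDATE":
            metrics = row["operationMetrics"] or {}
            # Metric values are reported as strings in the history output.
            result[row["version"]] = int(metrics.get("numOutputRows", 0))
    return result


# Illustrative input; in the lab the rows could come from something like:
# [r.asDict() for r in spark.sql("DESCRIBE HISTORY python_csv_autoloader").collect()]
sample_history = [
    {"version": 2, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "2932"}},
    {"version": 1, "operation": "STREAMING UPDATE",
     "operationMetrics": {"numOutputRows": "3149"}},
    {"version": 0, "operation": "CREATE TABLE", "operationMetrics": {}},
]

per_update = rows_per_streaming_update(sample_history)
print(per_update)                 # rows written by each streaming update
print(sum(per_update.values()))   # total rows across both updates
```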

Drop the **python_csv_autoloader** table.

In [0]:
DROP TABLE IF EXISTS python_csv_autoloader;

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>