# Incremental loading: deep dive

## 1. Introduction

In this lesson, we will take a deep dive into how incremental loading works in `dlt`, and how you can customize it to fit your needs.

We will also learn:
- how to perform backfills
- how to handle late arriving data
- how to debug common incremental loading issues

## 2. How incremental loading works in `dlt`

There are two main components that work together to enable incremental loading in `dlt`:

- Cursor configuration
- Pipeline state

In short, we can configure a cursor to let `dlt` know how to identify new or updated rows in the source data. `dlt` will then store the last seen cursor value in the pipeline state, and use it to filter the source data on subsequent runs.

Notably, the filtering is performed **after** the data is extracted from the source and any [mapping functions](https://dlthub.com/docs/dlt-ecosystem/transformations/add-map) are applied, but **before** it is normalized and loaded into the destination.
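To make the interplay between cursor and state concrete, here is a minimal plain-Python sketch of two consecutive runs. It only mimics the idea described above and does not use `dlt`. The names `run_once`, `state`, and `source` are hypothetical, and the strict `>` comparison is a simplification chosen for this sketch.

```python
# Plain-Python illustration only; none of these names are part of the dlt API.

def run_once(source_rows, cursor_column, initial_value, state):
    """Simulate one pipeline run and return the rows that would be loaded."""
    # First run: nothing is stored yet, so fall back to the initial value.
    last_value = state.get("last_value", initial_value)
    # Filter the extracted rows before anything is loaded.
    # The strict ">" comparison is a simplification made for this sketch.
    new_rows = [row for row in source_rows if row[cursor_column] > last_value]
    if new_rows:
        # Remember the highest cursor value seen for the next run.
        state["last_value"] = max(row[cursor_column] for row in new_rows)
    return new_rows


state = {}  # stands in for the persisted pipeline state
source = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]

first = run_once(source, "updated_at", initial_value=0, state=state)

# A new row appears in the source between the two runs.
source.append({"id": 3, "updated_at": 30})
second = run_once(source, "updated_at", initial_value=0, state=state)

print([row["id"] for row in first])   # [1, 2]
print([row["id"] for row in second])  # [3]
print(state)                          # {'last_value': 30}
```

The second run re-reads the whole source but keeps only the row whose cursor value is newer than the one stored in the state.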

### Cursor configuration

The cursor configuration consists of the following components:

- Cursor column
- Cursor initial value
- Cursor end value (optional)

The latter two, combined with pipeline state, are used by `dlt` to determine the current cursor value at the start of each pipeline run.

For more information on how pipeline state is used in incremental loading, see the [Pipeline state](#pipeline-state) section below.

#### Cursor column

This is the column in the source data that `dlt` uses to determine which rows are new or updated since the last load. It's typically a timestamp or an incrementing integer.

#### Cursor value

This is the value of the cursor column from the last successful load. `dlt` uses this value to filter the source data. We can manually specify the initial and end values for specific purposes.

#### Cursor initial value

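This is the value that `dlt` uses as the cursor value on the first run of the pipeline, when the pipeline state holds no stored value yet. You can pass it as the `initial_value` argument of `dlt.sources.incremental` or set it in the `dlt` configuration file (`config.toml`).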

#### Cursor end value

This is used by `dlt` for backfills. Specifying this value tells `dlt` **not** to update the cursor value after the run, allowing you to re-extract data up to this value in subsequent runs.

### Pipeline state

This is persistent storage that `dlt` uses to store the cursor value and other metadata about the pipeline runs. It allows `dlt` to remember the state of the pipeline between runs.

Below is a diagram showcasing how pipeline state is used during a pipeline run:

![dlt incremental loads - pipeline state](../assets/dlt_incremental_loads_pipeline_state.png)

Notably, on the first execution, the pipeline will use the specified `cursor_initial_value` to filter the data:

![dlt incremental loads - pipeline state - first run](../assets/dlt_incremental_loads_pipeline_state__first_run.png)


### Cursor value types & timezones

## 3. Backfilling data

## 4. Handling late arriving data

## 5. Debugging incremental loading issues

### 5.1 Incorrect cursor column name

#### Specifying normalized name

Remember that incremental filtering is applied **after** the data is extracted and any mapping functions are applied, but **before** normalization and loading. At that point the rows still carry their original keys, so the cursor column must be specified by its non-normalized name:

Data yielded by the resource:

```json
{"MyColumn": "value"}
```

Incorrect incremental configuration (uses the normalized name):

```python
dlt.sources.incremental("my_column")
```

Correct incremental configuration (uses the name as it appears in the data):

```python
dlt.sources.incremental("MyColumn")
```
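The reason the normalized name fails can be shown without `dlt` at all. In the sketch below, `to_snake_case` is a hypothetical, simplified stand-in for the naming convention applied during normalization; it is not the function `dlt` uses.

```python
import re


def to_snake_case(name):
    """Simplified stand-in for a snake_case naming convention (sketch only)."""
    # Insert "_" between a lowercase letter or digit and an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


raw_row = {"MyColumn": "value"}  # what the resource yields

# The incremental filter sees the raw row, so only the original key exists.
print("MyColumn" in raw_row)   # True
print("my_column" in raw_row)  # False

# Normalization runs later and produces the name seen in the destination.
normalized_row = {to_snake_case(key): value for key, value in raw_row.items()}
print(normalized_row)  # {'my_column': 'value'}
```

The key `my_column` only exists after normalization, which is too late for the incremental filter to find it.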


#### Not escaping special characters

A common issue occurs when the resource yields data whose column name contains special characters, such as a space:

```json
{"My Column": "value"}
```

And then we try to specify the cursor column as:

```python
dlt.sources.incremental("My Column")
```

However, internally, `dlt` uses JSONPath to select the cursor value, with the specified column name as part of the path. A name containing special characters (here, a space) must therefore be escaped by wrapping it in `['...']`:

```python
dlt.sources.incremental("['My Column']")
```