# Incremental Loading and Write Dispositions

## 1. Introduction

In this section, we will learn how to use `dlt` efficiently by loading only new or modified data. To do so, we combine two `dlt` features: write dispositions and incremental loading.


### ELT patterns

There are two ideal data source types in terms of efficiency:
- an immutable source (e.g. logs), from which we're able to extract only the new records

  In this case, we can use incremental loading with the `append` write disposition to load data in the most efficient way.
- a mutable source (e.g. a database), from which we're able to extract only the new and modified records

  In this case, we can use incremental loading with the `merge` write disposition.

The diagram below shows the optimal ELT strategy, depending on how we're able to extract data from a data source.

![](https://thescalableway.com/img/uT145YgjSn-960.webp)

Credit: https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines/#efficiency
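
To make the two patterns concrete, here is a minimal plain-Python sketch of what each write disposition does to a destination table. It does not use `dlt`: `append_rows` and `merge_rows` are hypothetical helpers that only mimic the semantics, and `id` is assumed to be the primary key.

```python
from typing import Any

Row = dict[str, Any]


def append_rows(table: list[Row], new_rows: list[Row]) -> list[Row]:
    """Mimic `append`: add every incoming row, with no deduplication."""
    return table + new_rows


def merge_rows(table: list[Row], new_rows: list[Row], key: str = "id") -> list[Row]:
    """Mimic `merge`: upsert incoming rows on the (assumed) primary key."""
    merged = {row[key]: row for row in table}
    for row in new_rows:
        merged[row[key]] = row  # insert new keys, overwrite existing ones
    return list(merged.values())


# Immutable source: the extract contains only new log records.
logs = [{"id": 1, "page": "/home"}]
logs = append_rows(logs, [{"id": 2, "page": "/blog"}])

# Mutable source: the extract contains one modified and one new record.
users = [{"id": 1, "country": "USA"}, {"id": 2, "country": "Poland"}]
users = merge_rows(users, [{"id": 2, "country": "China"}, {"id": 3, "country": "USA"}])

print(logs)
print(users)
```

Appending is cheap but only safe when the extract never repeats a record; merging tolerates repeated or modified records at the cost of a key lookup per row.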

## A note on the scope of this notebook

While `dlt` supports handling deleted records in the `merge` write disposition, doing so depends on the upstream source managing these records in a specific way: we need a column that indicates whether a record has been deleted. This is not a common practice, and implementing it typically requires data engineering work at the data generation level (e.g. collaborating with database admins). Because this is an advanced scenario, we will not cover it in this notebook.

For now, assume that whenever records are deleted at the source, a full refresh must be performed.

For more information, see [dlt documentation](https://dlthub.com/docs/general-usage/incremental-loading#delete-records).
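
The reason a full refresh is needed can be seen in a small plain-Python sketch. This is not `dlt` code: `merge_rows` and `replace_rows` are hypothetical helpers that only mimic the semantics, and `id` is assumed to be the primary key. A merge only upserts the rows it receives, so a record that disappeared upstream stays in the destination, whereas replacing the whole table removes it.

```python
from typing import Any

Row = dict[str, Any]


def merge_rows(table: list[Row], new_rows: list[Row], key: str = "id") -> list[Row]:
    """Mimic a merge: upsert incoming rows, never remove existing ones."""
    merged = {row[key]: row for row in table}
    for row in new_rows:
        merged[row[key]] = row
    return list(merged.values())


def replace_rows(table: list[Row], new_rows: list[Row]) -> list[Row]:
    """Mimic a full refresh: discard the old table and keep only the new rows."""
    return list(new_rows)


destination = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]

# Record 2 was deleted upstream, so the next extract no longer contains it.
source_snapshot = [{"id": 1, "name": "Ann"}]

after_merge = merge_rows(destination, source_snapshot)
after_refresh = replace_rows(destination, source_snapshot)

print(after_merge)    # the deleted record 2 is still present
print(after_refresh)  # only the records that exist upstream remain
```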

## 2. Write Dispositions

## 3. Incremental loading

In [None]:
!rm -f _clickstream_last_id.txt

In [None]:
from collections.abc import Generator
from datetime import UTC, datetime
from pathlib import Path
from random import sample
from secrets import choice, randbelow
from typing import Any

import dlt
from dlt.pipeline.pipeline import Pipeline
from faker import Faker

fake = Faker()

n_users = 10
# n_clicks = 100000
n_clicks = 10000

def person() -> Generator[dict[str, Any], None, None]:
    """Simulate data from a mutable source.

    Two randomly chosen rows are "updated" (they get a fresh `updated_at`) each time
    the function is called, while all other rows keep a static timestamp.

    In a real source, `cursor.last_value` could be used to filter only new data at
    the extract stage (e.g. by passing it to a filtering parameter such as `since`
    in a REST API). This simplified version takes no `cursor` argument, so it
    always yields all rows.

    For more information on this usage, see
    https://dlthub.com/docs/general-usage/incremental-loading#incremental-loading-with-a-cursor-field.

    When a `cursor` argument is used, it is injected by the `edp_resource()`
    decorator.
    """
    ids = range(n_users)
    # Simulate updating two random rows.
    ids_to_update = sample(ids, 2)
    for _id in ids:
        yield {
            "id": _id,
            "name": fake.name(),
            "country": choice(["USA", "China", "Poland"]),
            "updated_at": datetime.now(UTC)
            if _id in ids_to_update
            else datetime(2024, 1, 1, 0, 0, 0, 0, UTC),
        }


def clickstream() -> Generator[dict[str, Any], None, None]:
    """Simulate clickstream data."""
    # Keep a cursor so we can simulate incremental loading.
    cursor_file_path = Path("_clickstream_last_id.txt")
    if cursor_file_path.exists():
        with cursor_file_path.open() as f:
            last_id = int(f.read())
    else:
        last_id = 0

    pages = ["/home", "/about", "/contact", "/pricing", "/blog"]
    # Yield the whole batch as a single list, with IDs continuing from the cursor.
    yield [
        {
            "id": click_id,
            "user_id": randbelow(n_users),
            "timestamp": fake.date_time_between(start_date="-1m", end_date="now"),
            "page": choice(pages),
        }
        for click_id in range(last_id + 1, last_id + 1 + n_clicks)
    ]

    new_last_id = last_id + n_clicks
    with cursor_file_path.open("w") as f:
        f.write(str(new_last_id))


pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination=dlt.destinations.duckdb("incremental.duckdb"),
    dataset_name="bronze",
)
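
The cursor-based filtering mentioned in the `person()` docstring can be sketched in plain Python, without `dlt`. Here `extract_since` is a hypothetical helper, not part of any library: it returns only the records whose `updated_at` is later than the last seen value, together with the advanced cursor value to keep for the next run.

```python
from datetime import datetime, timezone
from typing import Any

Row = dict[str, Any]


def extract_since(rows: list[Row], last_value: datetime) -> tuple[list[Row], datetime]:
    """Return rows modified after `last_value` and the advanced cursor value."""
    fresh = [row for row in rows if row["updated_at"] > last_value]
    new_last_value = max((row["updated_at"] for row in fresh), default=last_value)
    return fresh, new_last_value


source = [
    {"id": 0, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 1, "updated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

# First run: start from the earliest possible cursor, so everything is extracted.
cursor = datetime.min.replace(tzinfo=timezone.utc)
first_batch, cursor = extract_since(source, cursor)

# Row 0 is modified upstream between the two runs.
source[0] = {"id": 0, "updated_at": datetime(2024, 6, 1, tzinfo=timezone.utc)}

# Second run: only the modified row is extracted.
second_batch, cursor = extract_since(source, cursor)

print([row["id"] for row in first_batch])
print([row["id"] for row in second_batch])
```

Because the second batch repeats a row that is already in the destination, it would need to be loaded with a merge on `id` rather than appended.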

- Run the pipeline again and notice the duplicated data.
- TODO: show modified script, execute, show both tables (merge & append)

For more information regarding integrating `dlt` with Prefect, see [our article](https://thescalableway.com/blog/dlt-and-prefect-a-great-combo-for-streamlined-data-ingestion-pipelines) on the topic.