# Introduction to dlt: Resources, Sources, and Pipelines

## 1. Introduction

### What is dlt?

[dlt](https://dlthub.com/) is a Python data ingestion framework enabling data engineers to define connectors and pipelines as code. It offers a rich set of features for building best-practice pipelines and supports both built-in and custom connectors built with regular Python code.

For more information, see the [official docs](https://dlthub.com/docs/intro#what-is-dlt).

### How does it work?

dlt ingests data in three stages: extract, normalize, and load. The extract stage downloads source data to disk. The normalize stage applies light transformations to the data, such as column renaming or datetime parsing. The load stage loads the data into the destination system.

![how does dlt work diagram](https://dlthub.com/docs/assets/images/dlt-onepager-c61255330e30060ca8f2fa6d7b73b600.png)
Credit: [dlt documentation](https://dlthub.com/docs/reference/explainers/how-dlt-works)
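
The three stages can be illustrated with a toy example that uses only the Python standard library. This is a conceptual sketch and not how dlt is implemented: the function names, the JSON Lines intermediate file, the column-renaming rule, and the in-memory SQLite destination are all assumptions made for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def extract(records, workdir):
    """Extract: write the raw source records to a file on disk."""
    raw_path = Path(workdir) / "raw.jsonl"
    with raw_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return raw_path


def normalize(raw_path):
    """Normalize: apply a light transformation (here: snake_case column names)."""
    rows = []
    with raw_path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            rows.append(
                {key.strip().lower().replace(" ", "_"): value for key, value in record.items()}
            )
    return rows


def load(rows, connection, table_name):
    """Load: create the destination table and insert the normalized rows."""
    columns = list(rows[0])
    connection.execute(f"CREATE TABLE {table_name} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    connection.executemany(
        f"INSERT INTO {table_name} VALUES ({placeholders})",
        [tuple(row[column] for column in columns) for row in rows],
    )
    connection.commit()


# Hypothetical source data with "messy" column names.
source_records = [
    {"Id": "1", "Full Name": "Jack Ma"},
    {"Id": "2", "Full Name": "Rafal Brzoska"},
]

with tempfile.TemporaryDirectory() as workdir:
    raw_file = extract(source_records, workdir)
    normalized_rows = normalize(raw_file)

connection = sqlite3.connect(":memory:")
load(normalized_rows, connection, "person")

loaded = connection.execute("SELECT id, full_name FROM person ORDER BY id").fetchall()
print(loaded)
```

dlt performs these steps for us, including schema inference and type handling that this toy version ignores.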

## 2. Hello, dlt

Let's jump in and see dlt in action!



In [None]:
import dlt


# Sample data.
people = [
    {"id": "1", "name": "Warren Buffet", "country": "USA"},
    {"id": "2", "name": "Jack Ma", "country": "China"},
    {"id": "3", "name": "Rafal Brzoska", "country": "Poland"},
]

# Set pipeline name, destination, and dataset name.
pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination="duckdb",
    dataset_name="mydata",
)

load_info = pipeline.run(people, table_name="person")

print(load_info)

Let's see what was loaded into the database:

In [None]:
!echo "select table_catalog, table_schema, table_name from information_schema.tables;" | duckdb dummy_source_to_duckdb.duckdb

In [None]:
!echo "select * from mydata.person;" | duckdb dummy_source_to_duckdb.duckdb

In the next sections, we will learn more about dlt's features and ways of working with it as we build increasingly complex pipelines.

## 3. Sources, resources, and pipelines

In this section, we will:

- learn about the three key dlt concepts
- configure a source and two resources
- briefly showcase destinations

### 3.1 Resources

Resources represent the data that flows through a dlt pipeline. They give us access to dlt functionality such as incremental loading and let us specify parts of the ELT configuration, neither of which is possible when working with raw data.

In the script below, we use `dlt.resource()` to specify the name of the target table when defining the resource, rather than at pipeline runtime (in `pipeline.run()`).

In [None]:
import dlt


# We now describe the data source as a dlt resource rather than a Python list.
@dlt.resource(table_name="person")
def people():
    yield [
        {"id": "1", "name": "Warren Buffet", "country": "USA"},
        {"id": "2", "name": "Jack Ma", "country": "China"},
        {"id": "3", "name": "Rafal Brzoska", "country": "Poland"},
    ]


pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination="duckdb",
    dataset_name="mydata",
)

load_info = pipeline.run(people)

print(load_info)

### 3.2 Sources

Sources are groups of resources. They allow us to define the source of the data and the resources that will be loaded from that source. For example, a source could be an SQL database, while a resource would be a table in that database.

dlt offers several built-in standard sources such as databases, REST APIs, or cloud storage. We can also define custom sources by using the `dlt.source()` decorator.

In the script below, we define a custom source with two resources.

In [None]:
import dlt


@dlt.source
def dummy_data():
    @dlt.resource(table_name="person")
    def people():
        yield [
            {"id": "1", "name": "Warren Buffet", "country": "USA"},
            {"id": "2", "name": "Jack Ma", "country": "China"},
            {"id": "3", "name": "Rafal Brzoska", "country": "Poland"},
        ]

    @dlt.resource(table_name="country")
    def countries():
        yield [
            {"id": "1", "name": "USA", "population": 331449281},
            {"id": "2", "name": "China", "population": 1444216107},
            {"id": "3", "name": "Poland", "population": 37846611},
        ]

    # NOTE: We need to return the resources here.
    return people, countries


pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination="duckdb",
    dataset_name="mydata",
)

if __name__ == "__main__":
    # NOTE: We still provide resources as the data, since `dummy_data()` returns
    # the two resources.
    load_info = pipeline.run(data=dummy_data())

    # Print inside the guard so that `load_info` is always defined here.
    print(load_info)

##### A more real-life example

To show that this is just as easy with a built-in dlt source, let's take a short detour and load some data from a real, publicly accessible MySQL database:

In [None]:
%%capture

# Install required dependencies for the MySQL connector.
!uv add dlt --extra sql_database
!uv add pymysql

NOTE: executing this might take a minute.

In [None]:
import dlt
from dlt.sources.sql_database import sql_database


# NOTE: without .with_resources(), we'd be replicating the entire database, which is
# too resource-intensive for this tutorial.
source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
).with_resources("family", "genome")

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
)

load_info = pipeline.run(source)

print(load_info)

In [None]:
!echo "select * from sql_data.family limit 3;" | duckdb sql_database_example.duckdb

That's it! Just like that, we loaded real-life data from a MySQL database into our local DuckDB instance.

Note that this is a public database, so we didn't have to specify any credentials. We'll learn about those later in the workshop.

### 3.3 Pipelines

In `dlt`, a pipeline describes the flow of data from resource(s) to a destination. Each pipeline loads resources to a single destination.

Pipelines can be reused to ingest different resources each run. For example, we can have one “Postgres to S3” pipeline, but ingest each Postgres table separately due to different scheduling or configuration needs.

A pipeline definition contains destination configuration, specific either to the pipeline or to a single pipeline run, as well as settings for the load phase of the ingestion. Under the hood, a pipeline run (`pipeline.run()`) executes the three pipeline steps in order: extract (`pipeline.extract()`), normalize (`pipeline.normalize()`), and load (`pipeline.load()`).

Let's use this knowledge to better control our destination configuration. In this case, we'll set the name of the DuckDB database file where the data will be loaded.

In [None]:
import dlt


@dlt.source
def dummy_data():
    @dlt.resource(table_name="person")
    def people():
        yield [
            {"id": "1", "name": "Warren Buffet", "country": "USA"},
            {"id": "2", "name": "Jack Ma", "country": "China"},
            {"id": "3", "name": "Rafal Brzoska", "country": "Poland"},
        ]

    @dlt.resource(table_name="country")
    def countries():
        yield [
            {"id": "1", "name": "USA", "population": 331449281},
            {"id": "2", "name": "China", "population": 1444216107},
            {"id": "3", "name": "Poland", "population": 37846611},
        ]

    return people, countries


pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination=dlt.destinations.duckdb("db.duckdb"),  # NOTE: we renamed the db here.
    dataset_name="mydata",
)

if __name__ == "__main__":
    load_info = pipeline.run(data=dummy_data())

    # Print inside the guard so that `load_info` is always defined here.
    print(load_info)

#### Inspecting data in the destination

`dlt` offers two main built-in ways to inspect the data in the destination: the SQL client and datasets.

##### SQL client

In [None]:
%%capture

# Install required dependencies for the SQL client.
!uv add pandas

In [None]:
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM mydata.person") as cursor:
        data = cursor.df()

data

##### Dataset

In [None]:
dataset = pipeline.dataset(dataset_type="default")
dataset.country.df()

## 4. Exercise 1

Let's now use all of the knowledge we've gained to build a pipeline that loads data from two CSV files into a DuckDB database:

- define a source, `csvs`, that will contain CSV file resources
- use the two provided utility functions to define two resources, `iris` and `wine`
- define a pipeline that will load the resources into a `bronze` schema in a `csvs.duckdb` DuckDB database

In [None]:
# Your solution here.
import dlt
import pandas as pd


def read_iris():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    col_names = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"]
    return pd.read_csv(url, names=col_names).to_dict(orient="records")


def read_wine():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
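    # Column names follow the UCI Wine dataset description (`wine.names`):
    # `wine.data` has 14 columns, the first of which is the class label.
    col_names = [
        "class",
        "alcohol",
        "malic_acid",
        "ash",
        "alcalinity_of_ash",
        "magnesium",
        "total_phenols",
        "flavanoids",
        "nonflavanoid_phenols",
        "proanthocyanins",
        "color_intensity",
        "hue",
        "od280_od315_of_diluted_wines",
        "proline",
    ]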
    return pd.read_csv(url, names=col_names).to_dict(orient="records")


# Define the source and its two resources.
...

# Define the pipeline.
...

if __name__ == "__main__":
    # Run the pipeline and print load info.
    ...

#### Solution

For the solution to this exercise, see the solutions notebook (`1b_solutions.ipynb`).

If everything went well, we should be able to query the data now:

In [None]:
!echo "select * from bronze.iris limit 3;" | duckdb csvs.duckdb

In [None]:
!echo "select * from bronze.wine limit 3;" | duckdb csvs.duckdb

## 5. Summary

In this lesson, we've:

- learned what dlt is and got a feel for how it works
- learned about its fundamental concepts: sources, resources, and pipelines
- loaded sample and real-world data from various sources (a Python object, a MySQL database, CSV files) into a local DuckDB database
