# dlt - configuration & secrets

## 1. Introduction

There are several things we need to configure in a dlt project:
- the tool itself
- project-level default values
- secrets

### A note on the scope of this notebook

While dlt supports [various config providers](https://dlthub.com/docs/general-usage/credentials/setup#available-config-providers), to keep things brief, in this tutorial we will be focusing on only one of the available approaches: config files.

### Tool configuration

`dlt` is typically configured in the `config.toml` file. Here you can specify project-level configuration of internal settings such as loading behavior or memory optimizations.

### Project-level default values

We can specify these in the `config.toml` file as well. Project defaults can include things such as reusable destination configuration; for example, if using a data lake as the staging layer, you can specify the data lake config (such as the bucket and path to the base directory).

### Secrets

While the best-practice way of specifying secrets is via [vaults](https://dlthub.com/docs/general-usage/credentials/setup#vaults), this feature is not yet provided out-of-the-box by dlt and requires a custom implementation. Therefore, for the purposes of this tutorial, we will be using the `secrets.toml` file.

### Config/secret injection mechanism

But how do we use config and secrets in our pipelines?

For this purpose, `dlt` has a config injection mechanism that "magically" fetches config and secret values for us through two special default values, `dlt.secrets.value` and `dlt.config.value`. Don't worry if this is confusing right now; we will show it in practice in our pipelines later in this notebook.

## 2. Configuration

The best place to store common, non-secret configuration is the `config.toml` file.

Recall that when we introduced `dlt` in the previous notebook, we mentioned that a pipeline run consists of three stages (extract, normalize, and load), and that `dlt` stores extracted and normalized data locally before loading it into the destination.

The details of this process depend on the destination used. By default, for DuckDB, `dlt` stores the final data (called a "load package") as compressed INSERT files, which makes the data hard to inspect if you ever want to debug it before it is loaded into the destination:

In [None]:
# Clean up any existing files.
!rm -rf ~/.dlt/pipelines

In [None]:
# Reuse an example pipeline from intro notebook.

import dlt

people = [
    {"id": "1", "name": "Warren Buffet", "country": "USA"},
    {"id": "2", "name": "Jack Ma", "country": "China"},
    {"id": "3", "name": "Rafal Brzoska", "country": "Poland"},
]

pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination="duckdb",
    dataset_name="mydata",
)

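# Run only the first two stages so the load package stays on disk for inspection.
pipeline.extract(people, table_name="person")
normalize_info = pipeline.normalize()

print(normalize_info)

# Find the load file that was generated for the `person` table.
load_file_path = next(
    job.file_path
    for job in normalize_info.load_packages[0].jobs["new_jobs"]
    if job.job_file_info.table_name == "person"
)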
print()
print(f"Load file path: {load_file_path}")

This is what the load file looks like on disk:

In [None]:
!gzip -dc $load_file_path

Now, let's use dlt's config to change that behavior. We will make `dlt` store load packages as `parquet` files instead of `insert_values`.

Uncomment the `[normalize]` and `loader_file_format` lines in `.dlt/config.toml` (lines 7-8). This setting changes the load file format to `parquet`. Next, reload the notebook and execute all the cells above except the last command (`gzip`).

This is what the load file looks like now:

In [None]:
%%capture

# Install required dependency.
!uv add pandas

In [None]:
import pandas as pd

df = pd.read_parquet(load_file_path)
df

## 3. Secrets

In this section, we will show dlt's config injection mechanism in practice. Without further ado:

In [None]:
# Reuse an example pipeline from intro notebook.

import dlt


# NOTE: the last person's country is taken from the `secrets.toml` file.
@dlt.resource(table_name="person")
def people(secret_country: str = dlt.secrets.value):
    yield [
        {"id": "1", "name": "Warren Buffet", "country": "USA"},
        {"id": "2", "name": "Jack Ma", "country": "China"},
        {"id": "3", "name": "Rafal Brzoska", "country": secret_country},
    ]


pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination="duckdb",
    dataset_name="mydata",
)

load_info = pipeline.run(people)

print(load_info)

In [None]:
!echo "select * from mydata.person where name = 'Rafal Brzoska';" | duckdb dummy_source_to_duckdb.duckdb

### What happened?

`dlt` automatically fetched the value for the `secret_country` parameter using its [injection mechanism](https://dlthub.com/docs/general-usage/credentials/setup#naming-convention).
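To build some intuition for what "injection" means, here is a toy, plain-Python sketch of the idea. It is not how `dlt` is implemented: `SECRET_VALUE`, `SECRETS`, and `inject_secrets` are hypothetical names invented for this illustration, and the secret value `"Poland"` is made up. The sketch only shows the general pattern of replacing a sentinel default with a value looked up by parameter name.

```python
import inspect
from functools import wraps

# Sentinel standing in for `dlt.secrets.value` (hypothetical, not dlt's object).
SECRET_VALUE = object()

# Stand-in for the parsed contents of secrets.toml; the value is made up.
SECRETS = {"dummy_source_to_duckdb": {"secret_country": "Poland"}}


def inject_secrets(section: str):
    """Fill every argument left at the sentinel default from SECRETS[section]."""

    def decorator(func):
        signature = inspect.signature(func)

        @wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind_partial(*args, **kwargs)
            bound.apply_defaults()
            # Look up each argument that still holds the sentinel by its name.
            for name, value in list(bound.arguments.items()):
                if value is SECRET_VALUE:
                    bound.arguments[name] = SECRETS[section][name]
            return func(*bound.args, **bound.kwargs)

        return wrapper

    return decorator


@inject_secrets(section="dummy_source_to_duckdb")
def last_person(secret_country: str = SECRET_VALUE):
    return {"id": "3", "name": "Rafal Brzoska", "country": secret_country}


print(last_person())           # country comes from SECRETS
print(last_person("Germany"))  # an explicit argument wins over the lookup
```

The point to take away is that the function signature stays ordinary Python: callers can still pass the value explicitly, and the lookup only kicks in when they do not.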

In this case, we've specified our secret in the `<pipeline_name>` config section in `secrets.toml`. Alternatively, dlt offers [other ways of providing the secret config](https://dlthub.com/docs/general-usage/credentials/setup#naming-convention). A common pattern is to specify destination credentials like this:

```toml
[destination.destination_name.credentials]
some_credential = "some_value"
```

For example:

```toml
[destination.bigquery.credentials]
project_id = "project_id" # please set me up!
private_key = "private_key" # please set me up!
client_email = "client_email" # please set me up!
```
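The same section path can also be expressed as an environment variable, which is handy on machines where you would rather not ship a `secrets.toml` file. As far as we know, dlt's environment variable provider expects upper-case names with sections joined by double underscores; treat that convention as an assumption and check the linked docs. The helper `to_env_var_name` below is a hypothetical utility written for this example, not part of `dlt`.

```python
def to_env_var_name(*path: str) -> str:
    """Join TOML section/key names into an upper-case, double-underscore name."""
    return "__".join(part.upper() for part in path)


# The keys from the BigQuery example above, expressed as environment variable names.
for key in ("project_id", "private_key", "client_email"):
    print(to_env_var_name("destination", "bigquery", "credentials", key))
```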

### Preview: vault integration

As a bonus, let's show how you could go about using your company's vault (e.g., Google Cloud Secret Manager) for storing secrets used in `dlt` pipelines.

Essentially, we need to implement a utility function for fetching secrets, and then use it whenever we need to fetch a secret value within the resource/pipeline code.

In [None]:
# Reuse an example pipeline from intro notebook.

import dlt


def get_secret_from_gcsm(secret_name: str):
    """
    We'd normally fetch the value of the user-specified secret from the vault here.

    In the case of GCSM, OAuth can be used to authenticate the machine with Google
    Cloud before executing this code. In that case, we don't need to provide GCSM
    credentials anywhere in dlt.

    For other vaults that require credentials of their own, we can store those in
    environment variables on our production machines and read them with
    `dlt.secrets.value` through the environment variables config provider
    (e.g. credentials: GcpServiceAccountCredentials = dlt.secrets.value), together
    with the with_config() decorator, to use these vault credentials securely.
    """
    return f"{secret_name}'s secret_value"


# NOTE: the last person's country is fetched from the vault helper unless passed in explicitly.
@dlt.resource(table_name="person")
def people(secret_country: str | None = None):
    secret_country = secret_country or get_secret_from_gcsm("secret_country")
    yield [
        {"id": "1", "name": "Warren Buffet", "country": "USA"},
        {"id": "2", "name": "Jack Ma", "country": "China"},
        {"id": "3", "name": "Rafal Brzoska", "country": secret_country},
    ]


pipeline = dlt.pipeline(
    pipeline_name="dummy_source_to_duckdb",
    destination="duckdb",
    dataset_name="mydata",
)

load_info = pipeline.run(people)

print(load_info)

In [None]:
!echo "select * from mydata.person where name = 'Rafal Brzoska';" | duckdb dummy_source_to_duckdb.duckdb
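The placeholder `get_secret_from_gcsm` above only returns a dummy string. A real version could look roughly like the sketch below, assuming the `google-cloud-secret-manager` package is installed and the machine is already authenticated with Google Cloud. The project ID `my-gcp-project` and the helper `secret_version_path` are hypothetical and exist only for this illustration.

```python
def secret_version_path(project_id: str, secret_name: str, version: str = "latest") -> str:
    """Build the resource name that Secret Manager expects for a secret version."""
    return f"projects/{project_id}/secrets/{secret_name}/versions/{version}"


def get_secret_from_gcsm(secret_name: str, project_id: str = "my-gcp-project") -> str:
    """Fetch a secret value from Google Cloud Secret Manager (sketch, not production code)."""
    # Imported lazily so this module can be loaded without the client library.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_path(project_id, secret_name)}
    )
    # The payload is returned as bytes; decode it into text.
    return response.payload.data.decode("UTF-8")


print(secret_version_path("my-gcp-project", "secret_country"))
```

In a long-running pipeline you would probably also want to cache the fetched values so that each run does not call the vault repeatedly.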

## 4. Summary

In this lesson, we've learned:

- what needs configuring in a `dlt` project: `dlt` itself, project-level defaults, and secrets
- how to use `config.toml` and `secrets.toml`, and how dlt's injection mechanism works
- (bonus) how we could use an external vault such as Google Cloud Secret Manager for storing our secrets