<a href="https://colab.research.google.com/github/sg-56/DLT_Workshop/blob/main/2_Defining_Secrets_%26_Configs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

# **Defining Secrets & Configs** 🤫🔩

**Here, you will learn or brush up on how to:**
- Add values to `secrets.toml` or `config.toml`
- Use environment variables to handle both secrets & configs

---
##  **(0) Add values to `secrets.toml` or `config.toml`**


> Please note that Colab is not well-suited for using `secrets.toml` or `config.toml` files. As a result, these sections will provide instructions rather than code cells, detailing how to use them in a local environment. You should test this functionality on your own machine. For Colab, it is recommended to use environment variables instead.

The `secrets.toml` and `config.toml` files belong in a `.dlt` directory that sits next to your pipeline code:

```
/your_project_directory
│
├── .dlt
│   ├── secrets.toml
│   └── config.toml
│
└── my_pipeline.py
```
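
If you want to create this layout from a script, the following is a minimal sketch using only the standard library. The helper name `scaffold_dlt_project` is hypothetical (it is not part of `dlt`), and the demo writes into a temporary directory so that nothing in a real project is touched.

```python
import tempfile
from pathlib import Path


def scaffold_dlt_project(project_dir: Path) -> Path:
    """Create a .dlt folder holding empty secrets.toml and config.toml files."""
    dlt_dir = project_dir / ".dlt"
    dlt_dir.mkdir(parents=True, exist_ok=True)
    for name in ("secrets.toml", "config.toml"):
        # touch() creates the file if missing and keeps existing content
        (dlt_dir / name).touch(exist_ok=True)
    return dlt_dir


# Demo in a throwaway directory (assumption: you would pass your own project path)
with tempfile.TemporaryDirectory() as tmp:
    dlt_dir = scaffold_dlt_project(Path(tmp) / "your_project_directory")
    created = sorted(p.name for p in dlt_dir.iterdir())
    print(created)
```

Since `secrets.toml` holds credentials, it is usually worth keeping it out of version control.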

Let's say your `my_pipeline.py` looks something like this:

```python
import dlt
from dlt.sources.helpers import requests


def pagination(url):
    while True:
        response = requests.get(url)
        response.raise_for_status()
        yield response.json()

        # Get next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]


@dlt.resource(table_name="issues")
def get_issues():
    url = "https://api.github.com/repos/dlt-hub/dlt/issues?per_page=100"
    yield pagination(url)

# Rest of the pipeline code
```
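
The `pagination` helper relies on `response.links`, which is the parsed form of the HTTP `Link` response header. As a rough illustration of what that lookup does, here is a simplified sketch with a hypothetical `parse_link_header` function and a made-up header value; the real parsing is done by the `requests` library and handles more cases.

```python
import re

# Matches entries of the form: <URL>; rel="NAME"
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(header: str) -> dict:
    """Map each rel value in a Link header to {"url": ...} (simplified sketch)."""
    return {rel: {"url": url} for url, rel in LINK_PATTERN.findall(header)}


# Made-up header for illustration; a real API returns its own URLs
example_header = (
    '<https://api.example.com/issues?page=2>; rel="next", '
    '<https://api.example.com/issues?page=5>; rel="last"'
)
links = parse_link_header(example_header)

# Same check as in pagination(): stop when there is no "next" entry
has_next = "next" in links
next_url = links["next"]["url"] if has_next else None
print(next_url)
```

On the last page the header carries no `rel="next"` entry, which is why the loop in `pagination` breaks at that point.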

GitHub's primary rate limit for unauthenticated requests is 60 requests per hour, so sooner or later you will run into rate limit errors. Authenticating your API requests with a personal access token raises the limit to 5,000 requests per hour.

To do that, first add your access token to `secrets.toml`:

```
# .dlt/secrets.toml

access_token = "your_access_token"
```



Then let the `dlt` resource receive it by using `dlt.secrets.value` as the default value of the `access_token` argument:

```python
import dlt
from dlt.sources.helpers import requests


def pagination(url, access_token):
    while True:
        response = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
        response.raise_for_status()
        yield response.json()

        # Get next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]


@dlt.resource(table_name="issues")
def get_issues(
    access_token=dlt.secrets.value
):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues?per_page=100"
    yield pagination(url, access_token)

```

Configs are defined in a similar way but are accessed using `dlt.config.value`. However, since configuration variables are internally managed by `dlt`, it is unlikely that you would need to explicitly use `dlt.config.value` in most cases.

---
##  **(0) Use environment variables**


Install `dlt` with DuckDB as destination:

In [1]:
%%capture
!pip install "dlt[duckdb]"

Create an environment variable for your access token. If you have saved the token as a Colab secret named `ACCESS_TOKEN`, you can read it directly:

In [None]:
import os
from google.colab import userdata

# Get the access token from Colab's secrets
access_token = userdata.get('ACCESS_TOKEN')

# Set it as an environment variable
os.environ['ACCESS_TOKEN'] = access_token
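
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of the same idea for a local machine, assuming the variable has been exported in your shell beforehand, is shown below. The helper `require_env` and the variable `DEMO_ACCESS_TOKEN` are hypothetical names used only for this demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder set here only so the demo runs; never hard-code a real token
os.environ["DEMO_ACCESS_TOKEN"] = "your_access_token"
demo_token = require_env("DEMO_ACCESS_TOKEN")
print(len(demo_token) > 0)
```

In a real run you would export `ACCESS_TOKEN` in the shell and call `require_env("ACCESS_TOKEN")`, so a missing token fails early instead of producing an authentication error in the middle of a load.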

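Let's reduce the number of results per page to 1, so that every issue costs one request. If the run still succeeds with more than 60 requests, we can see that the unauthenticated rate limit of 60 requests per hour is no longer an issue, since we're using an access token: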

In [None]:
import dlt
from dlt.sources.helpers import requests


def pagination(url, access_token):
    while True:
        response = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
        response.raise_for_status()
        yield response.json()

        # Get next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]


@dlt.resource(table_name="issues")
def get_issues(
    access_token=dlt.secrets.value
):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues?per_page=1" # Set per page to 1
    yield pagination(url, access_token)

Create a pipeline:

In [None]:
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb"
)

Run the pipeline:

In [None]:
load_info = pipeline.run(
    get_issues
)
print(load_info)

Pipeline my_pipeline load step completed in 1.33 seconds
1 load package(s) were loaded to destination duckdb and into dataset my_pipeline_dataset
The duckdb destination used duckdb:////content/my_pipeline.duckdb location to store data
Load package 1726589501.8991365 is LOADED and contains no failed jobs


Check how many rows were loaded. With one issue per page, the row count of `issues` matches the number of requests made:

In [None]:
print(pipeline.last_trace.last_normalize_info)

Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- issues: 167 row(s)
- issues__labels: 122 row(s)
- issues__assignees: 62 row(s)
- issues__performed_via_github_app__events: 25 row(s)

Load package 1726589501.8991365 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs
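
To connect the row count to the rate limits, here is a small worked calculation. It assumes one request per page and takes the 167 `issues` rows reported above as the item count; the function name `requests_needed` is hypothetical.

```python
import math

UNAUTHENTICATED_LIMIT = 60   # requests per hour without a token
AUTHENTICATED_LIMIT = 5_000  # requests per hour with a personal access token


def requests_needed(total_items: int, per_page: int) -> int:
    """Number of paged requests needed to fetch total_items (at least one)."""
    return max(1, math.ceil(total_items / per_page))


issues_rows = 167  # taken from the normalize info shown above

one_per_page = requests_needed(issues_rows, per_page=1)
hundred_per_page = requests_needed(issues_rows, per_page=100)

print(one_per_page, one_per_page > UNAUTHENTICATED_LIMIT)
print(hundred_per_page, hundred_per_page > UNAUTHENTICATED_LIMIT)
```

With `per_page=1` the run needs more requests than the unauthenticated limit allows in an hour, while staying far below the authenticated limit.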


Alternatively, you can set the value directly on `dlt.secrets` in code instead of going through an environment variable:

In [None]:
dlt.secrets["access_token"] = userdata.get('ACCESS_TOKEN')

Redefine the resource:

In [None]:
import dlt
from dlt.sources.helpers import requests


def pagination(url, access_token):
    while True:
        response = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
        response.raise_for_status()
        yield response.json()

        # Get next page
        if "next" not in response.links:
            break
        url = response.links["next"]["url"]


@dlt.resource(table_name="issues")
def get_issues(
    access_token=dlt.secrets.value
):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues?per_page=1" # Set per page to 1
    yield pagination(url, access_token)

Run the pipeline again:

In [None]:
load_info = pipeline.run(
    get_issues
)
print(load_info)

Pipeline my_pipeline load step completed in 1.16 seconds
1 load package(s) were loaded to destination duckdb and into dataset my_pipeline_dataset
The duckdb destination used duckdb:////content/my_pipeline.duckdb location to store data
Load package 1726589568.2547753 is LOADED and contains no failed jobs


Check again how many rows were loaded, which with one issue per page matches the number of requests made:

In [None]:
print(pipeline.last_trace.last_normalize_info)

Normalized data for the following tables:
- issues: 184 row(s)
- issues__assignees: 78 row(s)
- issues__labels: 150 row(s)
- issues__performed_via_github_app__events: 25 row(s)

Load package 1726069850.8036563 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs


> The naming convention for environment variables in `dlt` follows a specific pattern: all names are upper-cased and sections are separated with double underscores (`__`). For example, the section `destination.filesystem` in your TOML file becomes the prefix `DESTINATION__FILESYSTEM` in your environment variables. Similarly, a nested key like `destination.filesystem.bucket_url` becomes `DESTINATION__FILESYSTEM__BUCKET_URL`.
>
>
> ```
> [destination.filesystem]
> bucket_url = 'random_url'
> ```
>
> ```
> DESTINATION__FILESYSTEM__BUCKET_URL
> ```
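
The naming rule can be expressed in a few lines of Python. This is only an illustration of the convention described above, not code taken from `dlt`; the function name `to_env_name` is hypothetical.

```python
def to_env_name(toml_path: str) -> str:
    """Convert a dotted TOML path into the matching environment variable name."""
    # Upper-case every part and join sections with double underscores
    return "__".join(part.upper() for part in toml_path.split("."))


bucket_env = to_env_name("destination.filesystem.bucket_url")
token_env = to_env_name("access_token")
print(bucket_env)
print(token_env)
```

The second call shows why the top-level `access_token` secret used earlier could be supplied through the `ACCESS_TOKEN` environment variable.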