# Bulk File Download with Asyncio

This notebook uses a worked example to show how asynchronous programming in Python can improve the throughput of downloading files from the TDP `datalake/retrieve` API in your application.

Suppose you need to download a collection of files from TDP, identified by a list of TDP fileIds. Python's asynchronous programming tools let you download these files concurrently, which speeds up retrieval. We download each file in two steps: first, we call the `datalake/retrieve` endpoint to fetch an S3 pre-signed URL; then we download the file from S3 using that URL. We take this approach because it is almost always faster than downloading the file from `datalake/retrieve` directly.

The table below shows the average time, across 3 runs, to download 100 files selected at random from a TetraScience TDP development environment. Combining asynchronous programming with the S3 pre-signed URL significantly improves download throughput. The `asyncio` benchmarks were run with 25 concurrent connections.

| Method | S3 or API | Time [s] |
| ------ | --------- | -------- |
| Synchronous | API | 58.17 |
| Synchronous | S3 | 54.62 |
| Async | API | 29.86 |
| Async | S3 | 4.08 |
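
To put the table in relative terms, the short calculation below divides the synchronous API time by each measured time. The numbers are copied from the table and nothing new is measured; the variable names are chosen for illustration only.

```python
# Average download times in seconds, copied from the table above.
timings = {
    ("Synchronous", "API"): 58.17,
    ("Synchronous", "S3"): 54.62,
    ("Async", "API"): 29.86,
    ("Async", "S3"): 4.08,
}

# Use the slowest configuration (synchronous, via the API) as the baseline.
baseline = timings[("Synchronous", "API")]

# Speedup of each method relative to the baseline.
speedups = {method: baseline / seconds for method, seconds in timings.items()}

for (mode, source), factor in speedups.items():
    print(f"{mode:<12} {source:<4} {factor:5.1f}x")
```

By this measure, the asynchronous S3 approach is roughly 14 times faster than the synchronous API baseline, while asynchronous requests through the API alone are about 2 times faster.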

## Environment Setup and Authorization

We import `asyncio` along with the third-party libraries `aiohttp` and `aiohttp-retry`, then set up our authentication headers.

In [11]:
import asyncio

import aiohttp
from aiohttp_retry import RetryClient, RandomRetry

In [21]:
TDP_BASE_URL = "https://api.tetrascience.com/v1/"
ORG_SLUG = "tetrascience"
AUTH_TOKEN = ""
AUTH_HEADER = {
    "x-org-slug": ORG_SLUG,
    "ts-auth-token": AUTH_TOKEN,
}

## Download Functions

We define helper functions to retrieve the S3 pre-signed URL from the `datalake/retrieve` endpoint, and to download a file from the S3 pre-signed URL.

In [13]:
async def retrieve_presigned_url(client: aiohttp.ClientSession, file_id: str) -> str:
    """Retrieve an S3 pre-signed URL from TDP

    Args:
        client: aiohttp client session (or a compatible retry client)
        file_id: TDP file identifier

    Returns:
        S3 pre-signed URL
    """
    url = TDP_BASE_URL + "datalake/retrieve"
    params = {"fileId": file_id, "getPresigned": "true"}
    async with client.get(url, params=params) as response:
        contents = await response.json()
        return contents["url"]

In [14]:
async def download_task(client: aiohttp.ClientSession, file_id: str) -> bytes:
    """Download a TDP file from S3

    Args:
        client: aiohttp client session (or a compatible retry client)
        file_id: TDP file identifier

    Returns:
        file contents as bytes
    """
    url = await retrieve_presigned_url(client, file_id)
    async with client.get(url) as response:
        return await response.read()

## Download Files

Now we can write our function to download a list of files from TDP. A few concepts in this code snippet are worth exploring further.

**Session Management**: We create a single HTTP session, the `RetryClient`, to manage connections for all requests. Under the hood, the client maintains a connection pool, so we do not need to open a new connection for every request. Using the client as a context manager ensures that we do not forget to `close()` the session. We also pass our authentication headers to the session, so we don't need to attach them to each request manually.

**Error Handling**: Sometimes a request is unsuccessful. For example, we could pass a non-existent file ID to the download function, or the upstream service could rate limit our client. When downloading a large number of files concurrently, running into some rate-limiting responses is fairly likely. When that happens, we want to retry the request automatically after a short delay instead of raising an error and aborting our progress. The `aiohttp-retry` package offers several retry strategies; here we use `RandomRetry`, configured to retry only responses with status code 502, up to 3 times, with a random delay between 0.1 and 3 seconds. If the service signals rate limiting with a different status code, such as 429, that code would need to be added to `statuses`.
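
To make the retry idea concrete, here is a minimal standard-library sketch of retrying with a random delay. It is not how `aiohttp-retry` is implemented; `TransientError`, `retry_with_random_delay`, and `demo` are hypothetical names introduced for illustration, and the simulated failure stands in for a retryable response such as a 502.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 502."""


async def retry_with_random_delay(make_call, attempts=3, min_timeout=0.1, max_timeout=3.0):
    """Await make_call() up to `attempts` times, sleeping a random delay between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            await asyncio.sleep(random.uniform(min_timeout, max_timeout))


async def demo():
    calls = 0

    async def flaky():
        # Fails twice, then succeeds on the third call.
        nonlocal calls
        calls += 1
        if calls < 3:
            raise TransientError("simulated 502")
        return "ok"

    # Short delays keep the demonstration fast.
    result = await retry_with_random_delay(flaky, min_timeout=0.0, max_timeout=0.01)
    return result, calls
```

Randomizing the delay spreads retries out in time, so many concurrent tasks that fail together are less likely to retry at the same instant.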

**Concurrency**: To prevent our client from sending too many requests in a short period of time, we limit the number of concurrent connections, here through the `limit` argument of `aiohttp.TCPConnector`. More concurrency does not always improve throughput, for example once the network bandwidth is saturated. A limit of 10 concurrent connections is a good place to start.
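
The code in this section caps concurrency at the connection level. A different, commonly used technique is to cap the number of tasks in flight with `asyncio.Semaphore`. The sketch below illustrates that alternative using only the standard library; `run_bounded`, `demo`, and `fake_download` are hypothetical names, and the sleep stands in for network I/O.

```python
import asyncio


async def run_bounded(coroutine_factories, limit):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with semaphore:  # waits here while `limit` tasks are already running
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in coroutine_factories))


async def demo(n_tasks=20, limit=5):
    in_flight = 0
    peak = 0

    async def fake_download():
        # Track how many simulated downloads overlap at any moment.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for network I/O
        in_flight -= 1
        return b"data"

    results = await run_bounded([fake_download] * n_tasks, limit)
    return results, peak
```

Running `asyncio.run(demo())` returns the 20 results together with the observed peak, which never exceeds the limit of 5.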

In [15]:
async def download_files(file_ids: list[str], concurrency: int) -> list[bytes]:
    """Download a collection of files from TDP

    Args:
        file_ids: list of TDP fileIds to download
        concurrency: maximum number of concurrent connections

    Returns:
        A list of file contents
    """
    retry_options = RandomRetry(
        attempts=3, statuses=[502], min_timeout=0.1, max_timeout=3
    )
    conn = aiohttp.TCPConnector(limit=concurrency)

    async with RetryClient(
        retry_options=retry_options,
        headers=AUTH_HEADER,
        connector=conn,
    ) as client:
        tasks = [download_task(client, fid) for fid in file_ids]
        return await asyncio.gather(*tasks)
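
With a plain `asyncio.gather` call, the first exception raised by any task propagates to the caller, so a single failing download can cost the results of the whole batch. One possible variation, sketched here with the standard library only, passes `return_exceptions=True` and separates successes from failures afterwards. `gather_with_failures` and `demo` are hypothetical names, and the simulated tasks stand in for `download_task` calls.

```python
import asyncio


async def gather_with_failures(coroutines):
    """Await all coroutines and split the outcomes into successes and failures by index."""
    # Exceptions are returned in place of results instead of being raised.
    outcomes = await asyncio.gather(*coroutines, return_exceptions=True)
    successes = {i: r for i, r in enumerate(outcomes) if not isinstance(r, BaseException)}
    failures = {i: r for i, r in enumerate(outcomes) if isinstance(r, BaseException)}
    return successes, failures


async def demo():
    async def ok(payload):
        await asyncio.sleep(0)
        return payload

    async def broken():
        raise ValueError("simulated missing file")

    return await gather_with_failures([ok(b"a"), broken(), ok(b"c")])
```

Because `asyncio.gather` returns outcomes in the order of its inputs, each index maps back to the corresponding entry of the input list, which makes it straightforward to report or retry the file IDs that failed.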


## Running the Application

Finally, we can run the function using asyncio. In the notebook, we can directly `await` the coroutine.

In [None]:
file_ids = [
    "d51abcdc-04fd-40f8-9556-27a7add9a342",
    "ff807173-587d-40c6-88e7-7ca31522b71b",
    "eb4f967f-9547-4575-810b-2d263244cd34",
]

files = await download_files(file_ids, concurrency=10)

However, in a standalone Python program, we need to use `asyncio.run` to run the `download_files` coroutine in an event loop, like so:

In [20]:
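# NB: This will not run in a notebook, because asyncio.run() cannot be
# called while another event loop is already running.
def main():
    file_ids = [
        "d51abcdc-04fd-40f8-9556-27a7add9a342",
        "ff807173-587d-40c6-88e7-7ca31522b71b",
        "eb4f967f-9547-4575-810b-2d263244cd34",
    ]

    # asyncio.run creates an event loop, runs the coroutine, and closes the loop
    files = asyncio.run(download_files(file_ids, concurrency=10))
    return files


if __name__ == "__main__":
    main()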
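
The downloaded contents are returned in memory as a list of `bytes`. As a possible follow-up step, the sketch below writes them to disk. Because `asyncio.gather` returns results in the order of its inputs, the contents can be paired with the file IDs using `zip`. The function `save_files` and the choice to name each file after its file ID are assumptions made for illustration, not part of the TDP API.

```python
from pathlib import Path


def save_files(file_ids, contents, out_dir):
    """Write each downloaded payload to out_dir/<file_id> and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Results arrive in the same order as the file IDs, so zip pairs them up.
    for file_id, data in zip(file_ids, contents):
        path = out_dir / file_id  # assumption: name each file after its file ID
        path.write_bytes(data)
        paths.append(path)
    return paths
```

For example, `save_files(file_ids, files, "downloads")` would write one file per ID into a hypothetical `downloads` directory.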