# ETL With DuckDB

At Quantum Signals we use ML models for short- and medium-term market prediction. One source of data we use is Limit Order Book data, which lists the ask and bid prices of a stock throughout the day.

Each record of this data is pretty small, but the cardinality is extremely high and the timestamp is in nanoseconds--and even then, many events can have the same timestamp.  Let's use Python and DuckDB to query a subset of this information over S3, and do a quick ETL transform to write the midpoint price of each transaction to a hive-partitioned (by symbol) set of parquet files.

# WARNING!

For purposes of an easy workshop, credentials for this read-only S3 bucket are provided below. They will be deleted after the workshop. Never store credentials in a repository; it's a very bad way to go. Doing so here greatly simplifies getting our data, and I thought it was worth it for the workshop itself. This is a personal S3 bucket, so please don't abuse it.

In [None]:
import duckdb

def create_connection():
    # let's set up our duckdb in memory connection
    conn=duckdb.connect(":memory:")
    # the in-memory connection now must be extended to support https access to S3
    conn.install_extension("httpfs")
    conn.load_extension("httpfs")
    # we'll also set up credentials.  Doing it this way is NOT recommended; never
    # store secrets in repositories in production!
    conn.execute("""
    CREATE SECRET secret (
      TYPE S3,
      KEY_ID 'DO801T8KVC4GP7XCU74A',
      SECRET 'lZVY1vZlGUYJRim+f1WRpVYmv7PtJvYffheKSW4iJOQ',
      REGION 'US',
      ENDPOINT 'lon1.digitaloceanspaces.com'
    )""")
    return conn

conn = create_connection()
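
Outside a workshop, one common alternative is to read the key and secret from environment variables at runtime. The sketch below only builds the `CREATE SECRET` statement; the variable names `WORKSHOP_S3_KEY_ID` and `WORKSHOP_S3_SECRET` are hypothetical, and the demonstration uses placeholder values, not real credentials.

```python
import os


def s3_secret_sql(env=None):
    """Build the CREATE SECRET statement from environment variables.

    WORKSHOP_S3_KEY_ID and WORKSHOP_S3_SECRET are hypothetical names,
    not something DuckDB or this workshop defines.
    """
    env = os.environ if env is None else env

    def sql_literal(value):
        # Double single quotes so the value stays a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return f"""
    CREATE SECRET secret (
      TYPE S3,
      KEY_ID {sql_literal(env['WORKSHOP_S3_KEY_ID'])},
      SECRET {sql_literal(env['WORKSHOP_S3_SECRET'])},
      REGION 'US',
      ENDPOINT 'lon1.digitaloceanspaces.com'
    )"""


# Demonstration with placeholder values instead of real credentials.
example_sql = s3_secret_sql({
    "WORKSHOP_S3_KEY_ID": "EXAMPLEKEY",
    "WORKSHOP_S3_SECRET": "not'a'real'secret",
})
print(example_sql)
```

Inside `create_connection`, the hard-coded statement could then become `conn.execute(s3_secret_sql())`, with the two variables set in the shell and never committed.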

A `DuckDBPyConnection` represents a connection to a database. Queries run through it, and it can be configured as necessary, as we did above with the extension and the secret. An *in-memory* DuckDB connection keeps in memory any data that isn't held in attached external storage, but it still makes a very effective tool for querying and transforming data that lives in external storage.

A query itself is represented as a `DuckDBPyRelation`, and below we're going to query a compressed CSV file over S3.  On my home network, this took about 30 seconds.

In [None]:
lstr_rel = conn.sql("SELECT * FROM read_csv_auto('s3://thingotron-qs1/downloads/databento/6_stocks_2023/xnas-itch-20230101-20231230.mbp-10.LSTR.csv.zst')")
print(lstr_rel.shape)

So about 9 million rows, each with 74 columns. 30 seconds doesn't sound like much, but this was a low-traded stock; the compressed CSV was only about 356 MB. It's slow because CSV is a row-oriented format that is sloppy with its data formats and must be read in its entirety just to count rows or get summary information.

Short form: CSV is terrible.  Never use CSV unless you cannot avoid it, and if you cannot avoid it, convert it to parquet or some other form when you can.  But while we have this relation, let's see what we can learn from it.

A relation can be requeried, and in this manner you can chain relations and queries quite a way.  We can DESCRIBE the data to learn about what columns are there...

In [None]:
lstr_rel.query("lstr", "DESCRIBE lstr")

The `query` method on a relation takes a reference name as its first argument, so you can refer to the relation by that name in the SQL that follows, and you can chain queries together:

In [None]:
lstr_rel.query("lstr1", "SELECT ts_event, ask_px_00, bid_px_00 FROM lstr1 LIMIT 10").query("lstr2", "DESCRIBE lstr2")

Let's move on. Luckily, at least one year of this data, containing four stocks, has been converted to parquet for us. Let's look at CSCO in 2021 the same way for comparison, with a fresh new connection.

In [None]:
conn = create_connection()

In [None]:
csco_rel = conn.sql("SELECT * FROM read_parquet(['s3://thingotron-qs1/artifacts/databento_1_CSCO_parquet:latest/CSCO.parquet'])")
print(csco_rel.shape)
print(csco_rel.query("csco", "DESCRIBE csco"))

This is a slightly different format (this is how we convert the raw databento data for our use at Quantum Signals), but I want to call out a couple of things:

1. Querying this parquet file is lightning fast compared to the compressed CSV, even though it has about 20 times the data.
2. The parquet file is only about 2.32 GB, compared to 356 MB for the previous CSV: roughly 6.5 times the size for about 20 times the data.
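
As a rough sanity check on the second point, the figures quoted in this notebook can be turned into bytes per row. This assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and uses the approximate 9 million row count for the CSV, so it is a ballpark only; the two files also have different schemas, which makes the comparison loose.

```python
# Figures quoted in this notebook; the CSV row count is approximate.
csv_bytes = 356 * 10**6          # compressed CSV, about 356 MB
csv_rows = 9_000_000             # "about 9 million rows"
parquet_bytes = 2.32 * 10**9     # parquet file, about 2.32 GB
parquet_rows = 194_025_498       # row count of the CSCO parquet data

csv_bytes_per_row = csv_bytes / csv_rows
parquet_bytes_per_row = parquet_bytes / parquet_rows

size_ratio = parquet_bytes / csv_bytes   # how much bigger the parquet file is
row_ratio = parquet_rows / csv_rows      # how many more rows it holds

print(f"CSV:     {csv_bytes_per_row:.1f} bytes/row")
print(f"Parquet: {parquet_bytes_per_row:.1f} bytes/row")
print(f"{size_ratio:.1f}x the size for {row_ratio:.1f}x the rows")
```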

For many, many, many cases parquet is a natural fit as a file format. It compresses well, allows columnar access so you can run complex queries without reading the full dataset and materializing it in memory, and is widely supported by libraries such as Arrow (http://arrow.apache.org) and very efficient dataframe libraries like pandas and polars.

Speaking of which, it is trivial to convert a DuckDB SQL query into a polars or pandas dataframe. We'll do both here, but we'll avoid materializing the whole dataframe, which could rapidly fill up memory and make it difficult to proceed; instead we'll take the first 1000 rows:

In [None]:
polars_df = csco_rel.limit(1000).pl()
pandas_df = csco_rel.limit(1000).df()
print(polars_df.head())
print(pandas_df.head())

Lastly, let's look at memory usage of the pandas dataframe:

In [None]:
# info() prints its report itself and returns None, so no print() wrapper is needed
pandas_df.info(memory_usage="deep")

If the first 1000 rows of our query took 705 KiB (705 × 1024 bytes) and the full dataset has 194,025,498 rows, about 194,000 times as many, we're looking at around 130 GiB to fully materialize this query, and that's just one stock over one year.
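
The estimate is a straight linear extrapolation from the sample, which can be written out as below. It assumes the first 1000 rows are representative of the rest, and the 705 KiB figure is the one reported by `info()` on my run, so yours may differ slightly.

```python
sample_rows = 1_000
sample_bytes = 705 * 1024        # memory reported for the 1000-row pandas sample
total_rows = 194_025_498         # rows in the full CSCO relation

# Scale the per-row cost of the sample up to the full row count.
bytes_per_row = sample_bytes / sample_rows
estimated_bytes = bytes_per_row * total_rows
estimated_gib = estimated_bytes / 1024**3

print(f"{bytes_per_row:.1f} bytes per row")
print(f"about {estimated_gib:.0f} GiB to materialize everything")
```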

To transform this, we'll need to be clever; proceed to part 3.