# ETL With DuckDB part 2: Basic ETL

We'd now like to tackle a more realistic task: querying a decent amount of data without materializing it into memory, and converting it into a local hive which we can query at our leisure.  As before, let's start by creating an in-memory duckdb connection, which will represent our sources of data as one table:

In [None]:
import duckdb

def create_connection():
    # let's set up our duckdb in memory connection
    conn=duckdb.connect(":memory:")
    # the in-memory connection now must be extended to support https access to S3
    conn.install_extension("httpfs")
    conn.load_extension("httpfs")
    # we'll also set up credentials.  Doing it this way is NOT recommended; never
    # store secrets in repositories in production!
    conn.execute("""
    CREATE SECRET secret (
      TYPE S3,
      KEY_ID 'DO801T8KVC4GP7XCU74A',
      SECRET 'lZVY1vZlGUYJRim+f1WRpVYmv7PtJvYffheKSW4iJOQ',
      REGION 'US',
      ENDPOINT 'lon1.digitaloceanspaces.com'
    )""")
    return conn

conn = create_connection()

But this time we'll build up our data source from multiple parquet files.  duckdb supports wildcards as well and, as we'll see later, can read a hive partition in a single call; for now we'll pass the files in as a list so that they behave as a single data source, which is another benefit of duckdb:

In [None]:
symbols = ["CSCO", "LSTR", "NFLX", "SHLS", "SOFI", "WING"]
files = [f"s3://thingotron-qs1/artifacts/databento_1_{symbol}_parquet:latest/{symbol}.parquet" for symbol in symbols]

rel=conn.sql(f"""SELECT * FROM read_parquet([{", ".join(["'"+file+"'" for file in files])}])""")
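The one-liner above packs the quoting and joining into a single f-string. As a sketch of the same idea spelled out, the helper below (the name `parquet_list_sql` is made up for illustration) builds the `read_parquet([...])` query text from a list of paths. It only constructs the string, so it runs without a connection; doubling embedded single quotes is an extra precaution the original does not take.

```python
def parquet_list_sql(paths):
    """Return a SELECT over read_parquet([...]) for the given file paths."""
    # Quote each path as a SQL string literal, doubling any embedded single quote.
    quoted = ", ".join("'" + path.replace("'", "''") + "'" for path in paths)
    return f"SELECT * FROM read_parquet([{quoted}])"


# Two of the symbols from above, using the same path pattern.
symbols = ["CSCO", "LSTR"]
files = [
    f"s3://thingotron-qs1/artifacts/databento_1_{symbol}_parquet:latest/{symbol}.parquet"
    for symbol in symbols
]

query = parquet_list_sql(files)
print(query)
```

The resulting string can be handed to `conn.sql(query)` exactly as in the cell above.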

In [None]:
print(rel.shape)

## Check: How would you list the columns?

In [None]:
# code here

Now, the databento folks use a sentinel value (9223372036854775807) to mark prices that aren't set.  For the next step we'll work with a slightly smaller dataset by selecting only those rows where level 0 is defined on both sides:

In [None]:
rel2 = rel.query("rel", "SELECT * FROM rel WHERE buyside_price_00 < 9.2e13 AND sellside_price_00 < 9.2e13")
print(rel2.shape)
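To see why a cutoff such as `9.2e13` removes the unset prices, note that the sentinel is the largest signed 64-bit integer, far above the threshold. The sketch below checks this in plain Python; the sample prices are hypothetical and only illustrate the comparison.

```python
SENTINEL = 9223372036854775807  # value quoted in the text above
THRESHOLD = 9.2e13              # cutoff used in the filter above

# The sentinel equals the maximum value of a signed 64-bit integer.
is_int64_max = SENTINEL == 2**63 - 1

# Hypothetical raw level-0 prices; the middle entry stands for "price not set".
raw_prices = [5_012_000, SENTINEL, 4_998_500]

# Keep only prices below the cutoff, mirroring the WHERE clause.
defined = [price for price in raw_prices if price < THRESHOLD]

print(is_int64_max, defined)
```

Any threshold between the largest real price and the sentinel would have the same effect; this assumes real prices in the data stay below the cutoff.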

# The ETL step

Suppose we don't care about much of the data in there and want to write an ETL step which takes only the timestamp, symbol, and first level of the Limit Order Book, represented by the following columns:

* `symbol`
* `time_stamp`
* `buyside_price_00`
* `sellside_price_00`

Now write some code that
* takes the above relation `rel`,
* selects the above columns from it,
* converts the prices by dividing them by 10^5 and renames them to `buyside_price` and `sellside_price`,
* and writes the result into a hive-partitioned set of parquet files in the directory `minilob`!

Hints:

1. In your select statement, you can rename and lightly manipulate columns like this:
```
SELECT 
  my_column * 2 AS twice_my_column
```

2. While you experiment, consider limiting the query (`rel.limit(10)`) to test the output.


In [None]:
# code here