# DuckDB

## Why DuckDB

DuckDB is a lot like SQLite, but with more features:
 - Open local databases and files (including SQLite databases)
 - Connect to remote databases or files hosted on AWS S3 and treat them as if they were local
 - Query pandas DataFrames directly by name via the `%sql` magic

## Using DuckDB

1. Install packages
    1. `duckdb-engine` provides the SQLAlchemy driver and pulls in `duckdb`, the database itself
    2. `jupysql` provides jupyter magic (**WARNING: may not play nice with `ipython-sql`**)
2. Load packages
    1. `import duckdb` instead of `sqlite3`
    2. `%load_ext sql` (NOTE: `jupysql` prints a single-line advertisement for its maintainer, Ploomber)
3. Set up the database
    1. `%sql duckdb://` instead of `sqlite://`
    2. Load data from CSV, SQLite, wherever
    3. Query away!

In [None]:
# Install the required packages (duckdb-engine, pandas, pyarrow, jupysql)
# -U is for upgrade
# -q is for quiet
%pip install -qU duckdb-engine pandas pyarrow jupysql

In [None]:
import os
import duckdb
import pandas as pd

%load_ext sql
%config SqlMagic.autopandas = True  # Return Pandas DataFrames instead of regular result sets
%config SqlMagic.displaycon = False # Don't show connection string after execute
%config SqlMagic.feedback = False   # Don't print number of rows affected

# Create a connection to the DuckDB database
%sql duckdb://

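# Load chinook.sqlite into the DuckDB database
# (comments sit on their own lines because a trailing "# ..." on a %sql line may be sent to the database as part of the SQL)
# Detach the database if it's already attached
%sql DETACH DATABASE IF EXISTS chinook
# Attach the SQLite file under the name 'chinook'
%sql ATTACH DATABASE 'chinook.sqlite' AS chinook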

In [None]:
# Test the attached SQLite database with a query
%sql SELECT * FROM chinook.albums LIMIT 5

In [None]:
# Test querying a pandas DataFrame by its variable name
df = pd.DataFrame({'a': ['this', 'a', 'data'], 'b': ['is', 'pandas', 'frame']})
%sql select * from df

In [None]:
## We can make 'chinook' the default database
## WARNING: chinook is a SQLite database, which doesn't support schemas, so this will break if we want to add schemas later
# %sql USE chinook
# %sql SELECT * FROM albums LIMIT 5

In [None]:
# We can copy all tables from chinook to the default schema for convenience
chinook_tables = %sql SELECT table_name FROM information_schema.tables WHERE table_catalog = 'chinook'
for table in chinook_tables.table_name:
    %sql DROP TABLE IF EXISTS {{table}}
    %sql CREATE TABLE {{table}} AS SELECT * FROM chinook.{{table}}

# Test the copied tables using a query
%sql SELECT * FROM albums LIMIT 5

In [None]:
# Load the F1 CSV files into the DuckDB database under the 'f1' schema

# Create a new schema for the F1 data, if it doesn't already exist
%sql CREATE SCHEMA IF NOT EXISTS f1

# Load the CSV files into the database
# NOTE: This requires the CSV files to be in the 'practice-f1/data' folder (run the practice notebook there first)
data_files = [f for f in os.listdir('practice-f1/data') if f.endswith('.csv')]
for f in data_files:
    table_name = f.split('.')[0]
    %sql DROP TABLE IF EXISTS f1.{{table_name}}
    %sql CREATE TABLE f1.{{table_name}} AS SELECT * FROM read_csv_auto('practice-f1/data/{{f}}')

# List the tables now in the 'f1' schema
%sql SELECT table_schema, table_name FROM information_schema.tables WHERE table_schema = 'f1'
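
The loop above derives each table name with `f.split('.')[0]`, which works as long as every file name is already a valid unquoted SQL identifier. If that is not guaranteed, a stricter mapping could look like the sketch below. The helper name `table_name_from_file` and the sample file names are hypothetical, and the helper keeps everything before the final extension, so it differs from `split('.')[0]` for names containing several dots.

```python
import re
from pathlib import Path


def table_name_from_file(filename: str) -> str:
    """Derive a SQL-friendly table name from a CSV file name (hypothetical helper)."""
    stem = Path(filename).stem  # drop only the final extension
    # Replace each run of non-word characters with "_" and lower-case the result
    name = re.sub(r"\W+", "_", stem).strip("_").lower()
    # Unquoted identifiers should not be empty or start with a digit
    if not name or name[0].isdigit():
        name = f"t_{name}"
    return name


# Hypothetical file names, chosen to exercise each rule
examples = ["circuits.csv", "lap-times.csv", "2023 results.v2.csv"]
names = [table_name_from_file(f) for f in examples]
print(names)
```

In the loading loop, `table_name = table_name_from_file(f)` would then take the place of `f.split('.')[0]`.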

In [None]:
# Preview the first rows of the 'circuits' table in the 'f1' schema
%sql SELECT * FROM f1.circuits LIMIT 5