
DuckDB tutorial by Data with Marc
- https://youtu.be/AjsB6lM2-zw 
- https://robust-dinosaur-2ef.notion.site/DuckDB-Tutorial-Getting-started-for-beginners-b80bf0de8d6142d6979e78e59ffbbefe
- [DuckDB vs SQLite](https://db-engines.com/en/system/DuckDB%3BSQLite)

GUI tool supporting DuckDB
- https://dbeaver.io/  Universal DB tool
- https://blog.ouseful.info/2022/02/11/sql-databases-in-the-browser-via-wasm-sqlite-and-duckdb/

In [71]:
import os
import glob
import time
from pathlib import Path
import pandas as pd
import duckdb

conn = duckdb.connect() # create an in-memory database

In [2]:
!cd

C:\Users\p2p2l\projects\wgong\py4kids\lesson-14.1-db\duckdb


In [8]:
data_dir = "../data/sales"

In [9]:
os.listdir(data_dir)

['Sales_April_2019.csv',
 'Sales_August_2019.csv',
 'Sales_December_2019.csv',
 'Sales_February_2019.csv',
 'Sales_January_2019.csv',
 'Sales_July_2019.csv',
 'Sales_June_2019.csv',
 'Sales_March_2019.csv',
 'Sales_May_2019.csv',
 'Sales_November_2019.csv',
 'Sales_October_2019.csv',
 'Sales_September_2019.csv']

## Read CSV files

- pandas: 0.321530 s
- duckdb: 0.267112 s

### with pandas

In [10]:

cur_time = time.time()
df = pd.concat([pd.read_csv(f) for f in glob.glob(f'{data_dir}/*.csv')])
print(f"time: {(time.time() - cur_time)}")
print(df.head(10))

time: 0.3215305805206299
  Order ID                     Product Quantity Ordered Price Each  \
0   176558        USB-C Charging Cable                2      11.95   
1      NaN                         NaN              NaN        NaN   
2   176559  Bose SoundSport Headphones                1      99.99   
3   176560                Google Phone                1        600   
4   176560            Wired Headphones                1      11.99   
5   176561            Wired Headphones                1      11.99   
6   176562        USB-C Charging Cable                1      11.95   
7   176563  Bose SoundSport Headphones                1      99.99   
8   176564        USB-C Charging Cable                1      11.95   
9   176565          Macbook Pro Laptop                1       1700   

       Order Date                        Purchase Address  
0  04/19/19 08:46            917 1st St, Dallas, TX 75001  
1             NaN                                     NaN  
2  04/07/19 22:30       

In [11]:
df.shape

(186850, 6)

### with duckdb

In [20]:
cur_time = time.time()
df = conn.execute(f"""
	SELECT *
	FROM '{data_dir}/*.csv'
	-- LIMIT 10
""").df()
print(f"time: {(time.time() - cur_time)}")
print(df.head(10))

time: 0.26711201667785645
    column0                     column1           column2     column3  \
0  Order ID                     Product  Quantity Ordered  Price Each   
1    176558        USB-C Charging Cable                 2       11.95   
2       NaN                         NaN               NaN         NaN   
3    176559  Bose SoundSport Headphones                 1       99.99   
4    176560                Google Phone                 1         600   
5    176560            Wired Headphones                 1       11.99   
6    176561            Wired Headphones                 1       11.99   
7    176562        USB-C Charging Cable                 1       11.95   
8    176563  Bose SoundSport Headphones                 1       99.99   
9    176564        USB-C Charging Cable                 1       11.95   

          column4                                 column5  
0      Order Date                        Purchase Address  
1  04/19/19 08:46            917 1st St, Dallas, T

In [21]:
df.shape

(186862, 6)

In [22]:
type(df)

pandas.core.frame.DataFrame

In [23]:
df.dtypes

column0    object
column1    object
column2    object
column3    object
column4    object
column5    object
dtype: object

register dataframe as view

### Register dataframe as DuckDB view

In [24]:
conn.register("df_view", df)
conn.execute("DESCRIBE df_view").df() # doesn't work if you don't register df as a virtual table

Unnamed: 0,column_name,column_type,null,key,default,extra
0,column0,VARCHAR,YES,,,
1,column1,VARCHAR,YES,,,
2,column2,VARCHAR,YES,,,
3,column3,VARCHAR,YES,,,
4,column4,VARCHAR,YES,,,
5,column5,VARCHAR,YES,,,


### Rename columns

replacing space with underscore

In [25]:
df.head(1)

Unnamed: 0,column0,column1,column2,column3,column4,column5
0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [34]:
col_map = {k:v.replace(" ","_")     for k,v in df.iloc[0].to_dict().items()}

In [35]:
col_map

{'column0': 'Order_ID',
 'column1': 'Product',
 'column2': 'Quantity_Ordered',
 'column3': 'Price_Each',
 'column4': 'Order_Date',
 'column5': 'Purchase_Address'}

In [36]:
df.rename(columns=col_map, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns=col_map, inplace=True)


In [40]:
df.drop(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(0, inplace=True)


In [41]:
df.head(4)

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
1,176558,USB-C Charging Cable,2,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"
3,176559,Bose SoundSport Headphones,1,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"
4,176560,Google Phone,1,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
5,176560,Wired Headphones,1,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


In [27]:
df.isnull().sum()

column0    545
column1    545
column2    545
column3    545
column4    545
column5    545
dtype: int64

In [28]:
df = df.dropna(how='all')

In [44]:
df.shape

(186316, 6)

- With DuckDB you can run SQL queries on top of Pandas dataframes
- use doubles quotes if your column name contains spaces

In [42]:
conn.execute("""
    SELECT * FROM df WHERE order_id='176560'
""").df()

Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,176560,Google Phone,1,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
1,176560,Wired Headphones,1,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


In [43]:
conn.execute("""
    SELECT COUNT(*) FROM df
""").df()

Unnamed: 0,count_star()
0,186316


## Create table

A View/Virtual Table is a SELECT statement. That statement is run every time the view is referenced in a query. Views are great for abstracting the complexity of the underlying tables they reference.

In [45]:
conn.execute("""
CREATE OR REPLACE TABLE sales AS
	SELECT
		"Order_ID"::INTEGER AS order_id,
		Product AS product,
		"Quantity_Ordered"::INTEGER AS quantity,
		"Price_Each"::DECIMAL AS price_each,
		strptime("Order_Date", '%m/%d/%Y %H:%M')::DATE as order_date,
		"Purchase_Address" AS purchase_address
	FROM df
	WHERE
		TRY_CAST("Order_ID" AS INTEGER) NOTNULL
""")

<duckdb.DuckDBPyConnection at 0x1d1a8378470>

In [46]:
conn.execute("""
    SELECT COUNT(*) FROM sales
""").df()

Unnamed: 0,count_star()
0,185950


In [48]:
conn.execute("DESCRIBE df_view").df()

Unnamed: 0,column_name,column_type,null,key,default,extra
0,column0,VARCHAR,YES,,,
1,column1,VARCHAR,YES,,,
2,column2,VARCHAR,YES,,,
3,column3,VARCHAR,YES,,,
4,column4,VARCHAR,YES,,,
5,column5,VARCHAR,YES,,,


In [49]:
conn.execute("DESCRIBE df").df()

CatalogException: Catalog Error: Table with name df does not exist!
Did you mean "df_view"?

In [47]:
conn.execute("DESCRIBE sales").df()

Unnamed: 0,column_name,column_type,null,key,default,extra
0,order_id,INTEGER,YES,,,
1,product,VARCHAR,YES,,,
2,quantity,INTEGER,YES,,,
3,price_each,"DECIMAL(18,3)",YES,,,
4,order_date,DATE,YES,,,
5,purchase_address,VARCHAR,YES,,,


### FROM

In [51]:
conn.execute("FROM sales limit 10").df()  # short-hand

Unnamed: 0,order_id,product,quantity,price_each,order_date,purchase_address
0,176558,USB-C Charging Cable,2,11.95,1772-12-15 22:43:41.128654848,"917 1st St, Dallas, TX 75001"
1,176559,Bose SoundSport Headphones,1,99.99,1772-12-03 22:43:41.128654848,"682 Chestnut St, Boston, MA 02215"
2,176560,Google Phone,1,600.0,1772-12-08 22:43:41.128654848,"669 Spruce St, Los Angeles, CA 90001"
3,176560,Wired Headphones,1,11.99,1772-12-08 22:43:41.128654848,"669 Spruce St, Los Angeles, CA 90001"
4,176561,Wired Headphones,1,11.99,1772-12-26 22:43:41.128654848,"333 8th St, Los Angeles, CA 90001"
5,176562,USB-C Charging Cable,1,11.95,1772-12-25 22:43:41.128654848,"381 Wilson St, San Francisco, CA 94016"
6,176563,Bose SoundSport Headphones,1,99.99,1772-11-28 22:43:41.128654848,"668 Center St, Seattle, WA 98101"
7,176564,USB-C Charging Cable,1,11.95,1772-12-08 22:43:41.128654848,"790 Ridge St, Atlanta, GA 30301"
8,176565,Macbook Pro Laptop,1,1700.0,1772-12-20 22:43:41.128654848,"915 Willow St, San Francisco, CA 94016"
9,176566,Wired Headphones,1,11.99,1772-12-04 22:43:41.128654848,"83 7th St, Boston, MA 02215"


### EXCLUDE

In [52]:
conn.execute("""
	SELECT 
		* EXCLUDE ( order_date, purchase_address)
	FROM sales
	""").df()

Unnamed: 0,order_id,product,quantity,price_each
0,176558,USB-C Charging Cable,2,11.95
1,176559,Bose SoundSport Headphones,1,99.99
2,176560,Google Phone,1,600.00
3,176560,Wired Headphones,1,11.99
4,176561,Wired Headphones,1,11.99
...,...,...,...,...
185945,259353,AAA Batteries (4-pack),3,2.99
185946,259354,iPhone,1,700.00
185947,259355,iPhone,1,700.00
185948,259356,34in Ultrawide Monitor,1,379.99


### COLUMNS expression

In [55]:
df_1 = conn.execute("""
	SELECT 
		MIN(COLUMNS(* EXCLUDE (product, purchase_address))),
		Max(COLUMNS(* EXCLUDE (product, purchase_address))) 
	FROM sales
	""").df()

In [56]:
df_1

Unnamed: 0,min(sales.order_id),min(sales.quantity),min(sales.price_each),min(sales.order_date),max(sales.order_id),max(sales.quantity),max(sales.price_each),max(sales.order_date)
0,141234,1,2.99,1772-08-29 22:43:41.128654848,319670,9,1700.0,1773-08-29 22:43:41.128654848


In [57]:
df_1.columns

Index(['min(sales.order_id)', 'min(sales.quantity)', 'min(sales.price_each)',
       'min(sales.order_date)', 'max(sales.order_id)', 'max(sales.quantity)',
       'max(sales.price_each)', 'max(sales.order_date)'],
      dtype='object')

In [58]:
type(df_1)

pandas.core.frame.DataFrame

## Create VIEW

Since VIEWS are recreated each time a query reference them, if new data is added to the sales table, the VIEW gets updated as well

In [63]:
conn.execute("""
	CREATE OR REPLACE VIEW aggregated_sales AS
	SELECT
		order_id,
		COUNT(1) as nb_orders,
		MONTH(order_date) as month,
		str_split(purchase_address, ',')[2] AS city,
		SUM(quantity * price_each) AS revenue
	FROM sales
	GROUP BY ALL
""")

conn.execute("""
from aggregated_sales limit 10
""").df()

Unnamed: 0,order_id,nb_orders,month,city,revenue
0,220468,1,6,Seattle,999.99
1,220485,1,6,Los Angeles,14.95
2,220486,1,6,San Francisco,149.99
3,220490,1,6,San Francisco,400.0
4,220495,1,6,San Francisco,999.99
5,220498,1,6,Boston,11.99
6,220514,1,6,Los Angeles,3.84
7,220517,1,6,San Francisco,700.0
8,220524,1,6,Atlanta,3.84
9,220562,1,6,San Francisco,11.99


In [64]:
conn.execute("""
    select * from sales where order_id in (220468,211058)
""").df()

Unnamed: 0,order_id,product,quantity,price_each,order_date,purchase_address
0,211058,Apple Airpods Headphones,1,150.0,1773-02-17 22:43:41.128654848,"438 13th St, Austin, TX 73301"
1,211058,Macbook Pro Laptop,1,1700.0,1773-02-17 22:43:41.128654848,"438 13th St, Austin, TX 73301"
2,220468,ThinkPad Laptop,1,999.99,1773-02-21 22:43:41.128654848,"487 Washington St, Seattle, WA 98101"


## Save and read parquet files

Querying Parquet files give much better performances than with CSV files

In [67]:
file_out = '../data/sales/agg_sales.parquet'

In [68]:
conn.execute(f"""
    COPY (FROM aggregated_sales) TO '{file_out}' (FORMAT 'parquet')
""")

<duckdb.DuckDBPyConnection at 0x1d1a8378470>

In [70]:
conn.execute(f"FROM '{file_out}'").df().head(10)

Unnamed: 0,order_id,nb_orders,month,city,revenue
0,220468,1,6,Seattle,999.99
1,220485,1,6,Los Angeles,14.95
2,220486,1,6,San Francisco,149.99
3,220490,1,6,San Francisco,400.0
4,220495,1,6,San Francisco,999.99
5,220498,1,6,Boston,11.99
6,220514,1,6,Los Angeles,3.84
7,220517,1,6,San Francisco,700.0
8,220524,1,6,Atlanta,3.84
9,220562,1,6,San Francisco,11.99
