# DuckDB tutorial

Date: November 2025

Group: 
- Charlotte Michon
- Duy Vu Dinh
- Valérian Wislez

This tutorial introduces DuckDB, a modern column-oriented DBMS. Its target audience is mainly bachelor students in Computer Science, or anyone with some experience of relational row-store databases such as Postgres or MariaDB who wants to learn how to use column-oriented systems.

After describing this technology, we will show how to get started quickly and easily using Python as the interface to DuckDB.

Then we will proceed with basic queries and describe the most important aspects of DuckDB's Postgres-like SQL dialect.

Next, we will see more advanced queries, suited to OLAP workloads.

Finally, we will compare it with SQLite, a popular row-store relational DBMS, to highlight the advantages and drawbacks of column-oriented systems.


## Table of contents

* [1. What is DuckDB?](#What-is-DuckDB?)
* [2. Why DuckDB?](#Why-DuckDB?)
* [3. Prerequisites](#Prerequisites)
* [4. Installation](#Installation)


# What is DuckDB?
[DuckDB](https://duckdb.org/) is a database management system originally developed by Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in the Netherlands and first released in 2019. It is:
- fast
- portable
- open-source
- in-process
- analytical
- column-oriented 
- relational

As of November 2025, its latest stable release is 1.4.1. It is written in C++ and released under the MIT license.


# Why DuckDB?

We will now motivate why using DuckDB is a great choice for modern OLAP data workloads.

## Workloads

Unlike traditional transactional database systems such as MySQL or PostgreSQL, which are optimized for [Online Transaction Processing](https://en.wikipedia.org/wiki/Online_transaction_processing) (OLTP) workloads involving many small reads and writes, DuckDB targets [Online Analytical Processing](https://en.wikipedia.org/wiki/Online_analytical_processing) (OLAP) workloads that require large scans, aggregations, and joins over big datasets. It uses a column-oriented engine with vectorized execution to process millions of rows at high speed, often outperforming both traditional databases and tools like pandas for analytical tasks. It is thus well suited to data science, machine learning, scientific computing, economics, and more generally any field that needs to process large amounts of mostly read-only data.


## DuckDB's strengths

DuckDB stores data in columnar format (instead of row-by-row). This allows it to read only the columns that are needed for a query, which makes it optimal for OLAP analyses.
DuckDB processes data in optimized batches (vectors), instead of row-by-row, which improves its speed for large datasets.
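
The difference between the two layouts can be sketched in plain Python. This is a toy illustration only, with made-up values and hypothetical column names; a real engine stores typed, compressed column segments rather than Python lists.

```python
# Toy data (hypothetical values): one dict per record, as a row store would keep it.
rows = [
    {"trip_id": 1, "distance": 0.97, "fare": 9.3},
    {"trip_id": 2, "distance": 1.10, "fare": 7.9},
    {"trip_id": 3, "distance": 2.51, "fare": 14.9},
]

# Row-oriented access: every record is visited in full,
# even though the query only needs the "fare" field.
total_fare_rows = sum(row["fare"] for row in rows)

# Column-oriented layout: the values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Column-oriented access: only the "fare" column is read;
# "trip_id" and "distance" are never touched.
total_fare_columns = sum(columns["fare"])

print(total_fare_rows == total_fare_columns)
```

Both layouts give the same answer; what changes is how much data has to be read to compute it, which matters when a table has many columns and a query uses only a few.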

DuckDB runs inside the application that is using it. There is no separate database server to install, start, or connect to, which makes it really easy to use and deploy.

DuckDB stands out as an entry-level OLAP tool due to its minimalism and accessibility, making it easier to adopt than distributed systems like [ClickHouse](https://clickhouse.com/), [Druid](https://druid.apache.org/), or [Apache Pinot](https://pinot.apache.org/).

It offers good performance and low latencies thanks to its in-process architecture (no network overhead).

It can work directly with files in many [formats](https://duckdb.org/docs/stable/data/data_sources), such as CSV, Parquet, and JSON, among others.

It also supports [extensions](https://duckdb.org/docs/stable/extensions/overview) to enhance its functionalities (e.g. adding support for other file formats, introducing new types, adding domain-specific functionalities).

## DuckDB's limitations

As it is column-oriented, it is less suited to OLTP workloads.

It does not scale out to multiple machines: it runs on a single node and offers no distributed querying.

It has limited support for concurrency.

In exchange, it stays simple and efficient, and it supports a [variety of languages](https://duckdb.org/docs/stable/clients/overview) for interacting with it.



# Prerequisites
There are some prerequisites for following along with this tutorial. Fortunately, they are quite simple to set up.
- A recent version of Python (3.9+). Python 3.9 is the minimum version supported by DuckDB, but it is no longer supported upstream, so we recommend a more recent version. The tutorial has been tested with version 3.13.7.
- Basic SQL knowledge and familiarity with classical row-store DBMSs like MySQL, MariaDB, Postgres, etc.
- Familiarity with Python, Python virtual environments, and Jupyter Notebooks.

# Installation

DuckDB is available as a package installable with pip, which makes it easy to use from this notebook, as we only need Python. SQLite support ships with Python itself (the `sqlite3` module), so there is no need to install it.

The tutorial has been tested with Python version 3.13.7.

First, create a Python virtual environment. The following shell command will create a virtual environment in the current directory named 'duckdb':

```
python -m venv duckdb
```

We can activate the environment on UNIX-like systems with:

```
source duckdb/bin/activate
```

or on Microsoft Windows:

```
.\duckdb\Scripts\activate
```

N.B.: On Windows, this might require [changing the script execution policy](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy?view=powershell-7.5) on the host machine. In PowerShell, one can change it using the following cmdlet, which allows signed scripts to run for the current process only (i.e. the current terminal session).

```
Set-ExecutionPolicy -ExecutionPolicy AllSigned -Scope Process
```

Once the virtual environment is activated, we can install the required libraries.

```
pip install -r requirements.txt
```

We are now ready to go.



In [5]:
# If this cell executes correctly, you have all the dependencies needed for this tutorial.

import time
import duckdb
import pandas as pd
import sqlite3  # Directly present in python, no need to pip install it
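
If an import fails, it can help to see which packages are actually installed in the active environment. The sketch below uses the standard library's `importlib.metadata`. The package names passed to it are an assumption about what `requirements.txt` contains, and `installed_versions` is a hypothetical helper written for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list: adjust it to match your requirements.txt.
for name, version in installed_versions(["duckdb", "pandas", "pyarrow"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```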

# Hello, DuckDB!
We will simply test our DuckDB installation by printing a traditional Hello World. This basic script uses an in-memory database: no file is created, and the data is discarded once the connection is closed.

In [6]:
# A simple test to check that DuckDB works.
# ':memory:' opens an in-memory database: everything runs in-process and no file is created.
test_con = duckdb.connect(':memory:')

query = "CREATE TABLE IF NOT EXISTS hello AS SELECT 'Hello, world!'"

test_con.execute(query)

fetch = "SELECT * FROM hello"

result = test_con.execute(fetch).fetchall()

print(result)

test_con.close()

[('Hello, world!',)]


# Dataset
We will now fetch the dataset used in the rest of this tutorial: NYC yellow taxi trip records.
There are multiple ways to get a dataset. Here we simply download a Parquet file from a URL.
DuckDB is able to handle a [variety](https://duckdb.org/docs/stable/data/overview) of file types, including CSV, JSON, and Parquet.

In [7]:
# Download the January 2023 NYC yellow taxi trips (Parquet file) and keep the first 10,000 rows
url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'
df = pd.read_parquet(url)
df = df.head(10000)
print(f"Dataset shape: {df.shape}")


Dataset shape: (10000, 19)


# Dump the dataset to database files
We will now dump this subset of the dataset into two dedicated database files, one managed by DuckDB (`taxi_data.ddb`) and one managed by SQLite (`taxi_data.sqlite`).

In [8]:
# Connect to a local DuckDB file (creates it if it doesn't exist)
duckdb_con = duckdb.connect('taxi_data.ddb')

# Register the Pandas DataFrame as a temporary view
duckdb_con.register('temp_view', df)

# Dump the data: Create a persistent table from the view
duckdb_con.execute("CREATE TABLE IF NOT EXISTS trips AS SELECT * FROM temp_view")

# Verify (optional): Query a sample
print(duckdb_con.sql("SELECT * FROM trips LIMIT 5").df())

# Close the connection (data is now persisted in the file)
duckdb_con.close()

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2023-01-01 00:32:10   2023-01-01 00:40:36              1.0   
1         2  2023-01-01 00:55:08   2023-01-01 01:01:27              1.0   
2         2  2023-01-01 00:25:04   2023-01-01 00:37:49              1.0   
3         1  2023-01-01 00:03:48   2023-01-01 00:13:25              0.0   
4         2  2023-01-01 00:10:29   2023-01-01 00:21:19              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           0.97         1.0                  N           161           141   
1           1.10         1.0                  N            43           237   
2           2.51         1.0                  N            48           238   
3           1.90         1.0                  N           138             7   
4           1.43         1.0                  N           107            79   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \


In [9]:
# Connect to a local SQLite file (creates it if it doesn't exist)
sqlite_con = sqlite3.connect('taxi_data.sqlite')

# Dump the data: write the DataFrame to a table (replacing it if it exists).
# Note: to_sql also stores the DataFrame index as an extra 'index' column.
df.to_sql('taxi_data', sqlite_con, if_exists='replace')

# Testing by querying the database
print(pd.read_sql_query("SELECT * FROM taxi_data LIMIT 5", sqlite_con))
sqlite_con.close()

   index  VendorID tpep_pickup_datetime tpep_dropoff_datetime  \
0      0         2  2023-01-01 00:32:10   2023-01-01 00:40:36   
1      1         2  2023-01-01 00:55:08   2023-01-01 01:01:27   
2      2         2  2023-01-01 00:25:04   2023-01-01 00:37:49   
3      3         1  2023-01-01 00:03:48   2023-01-01 00:13:25   
4      4         2  2023-01-01 00:10:29   2023-01-01 00:21:19   

   passenger_count  trip_distance  RatecodeID store_and_fwd_flag  \
0              1.0           0.97         1.0                  N   
1              1.0           1.10         1.0                  N   
2              1.0           2.51         1.0                  N   
3              0.0           1.90         1.0                  N   
4              1.0           1.43         1.0                  N   

   PULocationID  DOLocationID  payment_type  fare_amount  extra  mta_tax  \
0           161           141             2          9.3   1.00      0.5   
1            43           237             1     

# Basic queries


# Advanced queries

# Comparison to SQLite
Now that we know how to use DuckDB, we will illustrate its advantages and drawbacks against SQLite, a popular row-store DBMS.

## What is SQLite?


# References

[Wikipedia article about DuckDB](https://en.wikipedia.org/wiki/DuckDB)

[The official DuckDB website](https://duckdb.org/)

[DuckDB Python API](https://duckdb.org/docs/stable/clients/python/overview)
