# DuckDB tutorial

Date: November 2025

Group: 
- Charlotte Michon
- Duy Vu Dinh
- Valérian Wislez

This tutorial introduces DuckDB, a modern column-oriented DBMS. Its target audience is mainly bachelor students in Computer Science, or anyone with some experience of relational row-store databases such as Postgres or MariaDB who wants to learn how to use column-oriented systems.

After describing this technology, we will describe how to get started quickly and easily using Python as the interface to DuckDB.

Then we will proceed with basic queries and describe the most important aspects of DuckDB's Postgres-like SQL dialect.

Next, we will see more advanced queries, suited to OLAP workloads.

Finally, we will compare it with SQLite, a popular row-store relational DBMS, to highlight the advantages and drawbacks of column-oriented systems.


## Table of contents

* [1. What is DuckDB?](#What-is-DuckDB?)
* [2. Why DuckDB?](#Why-DuckDB?)
* [3. Prerequisites](#Prerequisites)
* [4. Installation](#Installation)
* [5. Hello, DuckDB!](#Hello,-DuckDB!)
* [6. Dataset](#Dataset)
* [7. Basic queries](#Basic-queries)
* [8. Comparison to SQLite](#Comparison-to-SQLite)
* [9. Advanced functionalities](#advanced-functionalities)



# What is DuckDB?
[DuckDB](https://duckdb.org/) is a database management system originally developed by Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in the Netherlands and was first released in 2019. It enjoys the following properties. It is:
- fast
- portable
- open-source
- in-process
- analytical
- column-oriented 
- relational

As of November 2025, its latest stable release is 1.4.1. It is written in C++ and released under the MIT license.


# Why DuckDB?

We will now motivate why using DuckDB is a great choice for modern OLAP data workloads.

## Workloads

Unlike traditional transactional database systems such as MySQL or PostgreSQL, which are optimized for [Online Transaction Processing](https://en.wikipedia.org/wiki/Online_transaction_processing) (OLTP) workloads involving many small reads and writes, DuckDB targets [Online Analytical Processing](https://en.wikipedia.org/wiki/Online_analytical_processing) (OLAP) workloads that require large scans, aggregations, and joins over big datasets. It uses a column-oriented engine with vectorized execution to process millions of rows at high speed, often outperforming both traditional databases and tools like pandas for analytical tasks. It is thus well suited to data science, machine learning, scientific computing, economics, and more generally to any field where large volumes of data are processed in a mostly read-only fashion.


## DuckDB's strengths

DuckDB stores data in a columnar format (instead of row by row). This allows it to read only the columns needed for a query, which makes it well suited to OLAP analyses.
DuckDB also processes data in batches (vectors) rather than one row at a time, which improves its speed on large datasets.
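
The difference between the two layouts can be sketched in plain Python. This is only a toy illustration with made-up rows and variable names; DuckDB's actual storage format and execution engine are far more elaborate.

```python
# Toy illustration only: not how DuckDB is implemented internally.

# Row-oriented layout: one record (tuple) per transaction.
rows = [
    ("grocery_pos", 12.5, 0),
    ("travel", 980.0, 1),
    ("grocery_pos", 45.25, 0),
]

# Column-oriented layout: one list per attribute.
columns = {
    "category": [r[0] for r in rows],
    "amt": [r[1] for r in rows],
    "is_fraud": [r[2] for r in rows],
}

# Row layout: every record is visited, even though only `amt` is needed.
total_from_rows = sum(record[1] for record in rows)

# Column layout: only the `amt` column is read, as a single batch.
total_from_columns = sum(columns["amt"])

print(total_from_rows, total_from_columns)
```

Both totals are identical; what changes is how much unrelated data has to be touched to compute them.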

DuckDB runs inside the application that is using it. There is no database server to manage and no network connection to a separate DBMS process, which makes it really easy to use and deploy.

DuckDB stands out as an entry-level OLAP tool due to its minimalism and accessibility, making it easier to adopt than distributed systems like [ClickHouse](https://clickhouse.com/), [Druid](https://druid.apache.org/), or [Apache Pinot](https://pinot.apache.org/).

It offers good performance and low latencies thanks to its in-process architecture (no network overhead).

It can directly query files in many [formats](https://duckdb.org/docs/stable/data/data_sources), such as CSV, Parquet, and JSON.

It also supports [extensions](https://duckdb.org/docs/stable/extensions/overview) to enhance its functionalities (e.g. adding support for other file formats, introducing new types, adding domain-specific functionalities).

## DuckDB's limitations

As it is column-oriented, it is less suited to OLTP workloads.

It does not scale out to multiple machines: it runs on a single node and offers no distributed querying.

It has limited support for concurrency: a database file can be opened for writing by only one process at a time.

On the other hand, this single-node design keeps DuckDB simple and efficient. DuckDB also supports a [variety of languages](https://duckdb.org/docs/stable/clients/overview) for interacting with it.



# Prerequisites
This tutorial has a few prerequisites. Fortunately, they are simple to set up.
- A recent version of Python (3.9+). Python 3.9 is the minimum version supported by DuckDB, but it is no longer maintained, so we recommend a more recent version. The tutorial has been tested with version 3.13.7.
- Basic SQL knowledge and familiarity with classical row-store DBMSes like MySQL, MariaDB, Postgres, etc.
- Familiarity with Python, Python virtual environments, and Jupyter Notebooks.

# Installation

DuckDB is available as a Python package installable with pip, which makes it easy to use from this notebook, as we only need Python. SQLite is part of Python's standard library, so there is no need to install it.

The tutorial has been tested with Python version 3.13.7.

First, create a Python virtual environment. The following shell command will create a virtual environment in the current directory named 'duckdb':

``
python -m venv duckdb
``

We can activate the environment on UNIX-like systems with:

``
source duckdb/bin/activate
``

or on Microsoft Windows:

``
.\duckdb\Scripts\activate
``

N.B.: On Windows, this might require [changing the script execution policy](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy?view=powershell-7.5) on the host machine. In PowerShell, one can change it using the following cmdlet, which applies only to the current process (i.e. terminal session). With the `AllSigned` policy, only scripts signed by a trusted publisher are allowed to run.

``
Set-ExecutionPolicy -ExecutionPolicy AllSigned -Scope Process
``

Once the virtual environment is activated, we can install the required libraries.

``
pip install -r requirements.txt
``

We are now ready to go.



In [None]:
# If this cell executes correctly, you have all the dependencies needed for this tutorial.

import time
import duckdb
import pandas as pd
import sqlite3  # Part of the Python standard library, no need to pip install it
import kagglehub
import matplotlib

# Hello, DuckDB!
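We will test our DuckDB installation by printing a traditional Hello World. This basic script uses an in-memory database (`:memory:`), so no file is created and the data is lost once the connection is closed. Passing a file path to `duckdb.connect` instead creates a database file that persists data after the connection is closed.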

In [None]:
# This is a simple test to see if DuckDB works.
# ':memory:' opens an in-memory database: everything stays in process and no file is created.
test_con = duckdb.connect(':memory:')

query = "CREATE TABLE IF NOT EXISTS hello AS SELECT 'Hello, world!'"

test_con.execute(query)

fetch = "SELECT * FROM hello"

result = test_con.execute(fetch).fetchall()

print(result)

test_con.close()

# Dataset
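We will now fetch a dataset that suits our needs.
There are multiple ways to get a dataset. We will use a simple online fetch: the code below downloads the `kartik2112/fraud-detection` dataset from Kaggle using `kagglehub`, then loads its training CSV file into a pandas DataFrame.
DuckDB is able to handle a [variety](https://duckdb.org/docs/stable/data/overview) of file types, including CSV, JSON, and Parquet.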

In [None]:
path = kagglehub.dataset_download("kartik2112/fraud-detection")
print("Path to dataset files:", path)
df = pd.read_csv(path + "/fraudTrain.csv")
df = df.drop(columns=["Unnamed: 0"])


In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.dtypes

# Basic queries
Simple queries look just like in traditional SQL. We will create a DuckDB connection, register the DataFrame under the name `fraud`, and show a few rows.

In [None]:
con = duckdb.connect()
con.register("fraud", df)

In [None]:
con.sql("""
    SELECT *
    FROM fraud
    LIMIT 5
""")

Let's count how many transactions there are:

In [None]:
con.sql("""
    SELECT COUNT(*) AS n_transactions
    FROM fraud
""")

Total and average transaction amount:

In [None]:
con.sql("""
    SELECT 
        SUM(amt)  AS total_amount,
        AVG(amt)  AS avg_amount
    FROM fraud
""")

Distinct customers and cities:

In [None]:
con.sql("""
    SELECT
        COUNT(DISTINCT cc_num) AS n_customers,
        COUNT(DISTINCT city)   AS n_cities
    FROM fraud
""")

Fraud vs non-fraud counts:

In [None]:
con.sql("""
    SELECT 
        is_fraud,
        COUNT(*) AS n
    FROM fraud
    GROUP BY is_fraud
    ORDER BY is_fraud
""")

# Comparison to SQLite
Now that we know how to use DuckDB, we will illustrate its advantages and drawbacks against SQLite, a popular row-store DBMS, by performing some more advanced queries.

## What is SQLite?

SQLite is a **serverless, open-source, row-store, relational** database engine that stores an entire database (including tables, indexes, and data) as a single cross-platform file. Like DuckDB, SQLite runs directly **in-process**, requiring zero configuration and no separate server.
It supports most of the SQL standard as well as ACID transactions, and can handle databases up to several terabytes in size.
Because of its **simplicity, reliability, and tiny footprint**, SQLite is **the most widely deployed database in the world**. It is embedded in billions of devices and applications: smartphones, browsers, desktop applications, IoT devices, etc.
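
For comparison with the DuckDB Hello World, a minimal sketch of the same test with Python's built-in `sqlite3` module could look as follows. The connection name `sqlite_demo` and the column alias `greeting` are illustrative choices.

```python
import sqlite3

# In-memory SQLite database: like DuckDB, it runs in process and needs no server.
sqlite_demo = sqlite3.connect(":memory:")
sqlite_demo.execute("CREATE TABLE hello AS SELECT 'Hello, world!' AS greeting")
sqlite_rows = sqlite_demo.execute("SELECT greeting FROM hello").fetchall()
sqlite_demo.close()

print(sqlite_rows)
```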

## Speed comparison with SQLite
We will now perform several operations on both database systems to highlight the key performance differences between row stores and column stores.


Let's first connect to a local database file with SQLite:

In [None]:
sqlite_con = sqlite3.connect("fraud.sqlite")

Let's now transfer the dataframe to the database file. This operation is quite heavy and might take about 15 seconds.

In [None]:
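%%time
# Write the DataFrame to a table named "fraud" in the SQLite database file.
df.to_sql("fraud", sqlite_con, if_exists="replace", index=False)

# Re-register the DataFrame in DuckDB so that both systems query the same data.
con.unregister("fraud")
con.register("fraud", df)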

We will query the total amount by category:

In [None]:
query1 = """
    SELECT
        category,
        COUNT(*)      AS n_transactions,
        SUM(amt)      AS total_amount,
        AVG(amt)      AS avg_amount
    FROM fraud
    GROUP BY category
"""

Let's measure the time taken by DuckDB to execute the query.

In [None]:
start = time.time()
result_duck =  con.sql(query1).df()

end = time.time()

display(result_duck)
print(f"DuckDB execution time: {end - start:.4f} seconds")

Let's do the same for SQLite.

In [None]:
start = time.time()
result_sqlite = pd.read_sql_query(query1, sqlite_con)

end = time.time()

display(result_sqlite)
print(f"SQLite execution time: {end - start:.4f} seconds")
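
A single measurement can be noisy (caching, other processes, first-run overhead). One possible refinement, sketched below, is to repeat each query several times and keep the fastest run. The helper `best_time` and the stand-in table `t` are hypothetical names introduced for this sketch; the stand-in uses `sqlite3` only so that the example runs on its own.

```python
import sqlite3
import time


def best_time(run_query, repeats=5):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in table so that the sketch is self-contained.
timing_con = sqlite3.connect(":memory:")
timing_con.execute("CREATE TABLE t (category TEXT, amt REAL)")
timing_con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a" if i % 2 else "b", float(i)) for i in range(10_000)],
)

elapsed = best_time(
    lambda: timing_con.execute(
        "SELECT category, SUM(amt) FROM t GROUP BY category"
    ).fetchall()
)
timing_con.close()

print(f"Best of 5 runs: {elapsed:.4f} seconds")
```

In this notebook, the helper could be called as `best_time(lambda: con.sql(query1).df())` for DuckDB and `best_time(lambda: pd.read_sql_query(query1, sqlite_con))` for SQLite.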

# Conclusion : ADD CONCLUSION

Let's query the fraud rate by category.

In [None]:
query2 = """
    SELECT
        category,
        COUNT(*)                    AS n_transactions,
        SUM(is_fraud)               AS n_fraud,
        AVG(CAST(is_fraud AS REAL)) AS fraud_rate
    FROM fraud
    GROUP BY category
    ORDER BY fraud_rate DESC
"""

In [None]:
start = time.time()
result_duck = con.sql(query2).df()

end = time.time()

display(result_duck)
print(f"DuckDB execution time: {end - start:.4f} seconds")

In [None]:
start = time.time()
result_sqlite = pd.read_sql_query(query2, sqlite_con)

end = time.time()

display(result_sqlite)
print(f"SQLite execution time: {end - start:.4f} seconds")

Time-based aggregation: daily transaction count, fraud count, and total amount.

In [None]:
query3 = """
    SELECT
        DATE(trans_date_trans_time) AS day,
        COUNT(*)                   AS n_transactions,
        SUM(is_fraud)              AS n_fraud,
        SUM(amt)                   AS total_amount
    FROM fraud
    GROUP BY day
    ORDER BY day
"""


In [None]:
start = time.time()
result_duck = con.sql(query3).df()

end = time.time()

display(result_duck)
print(f"DuckDB execution time: {end - start:.4f} seconds")

In [None]:
start = time.time()
result_sqlite = pd.read_sql_query(query3, sqlite_con)

end = time.time()

display(result_sqlite)
print(f"SQLite execution time: {end - start:.4f} seconds")


Rolling average transaction amount per card (window function), computed over the current transaction and the five preceding ones.

In [None]:
query_window = """
    SELECT
        cc_num,
        trans_date_trans_time,
        amt,
        AVG(amt) OVER (
            PARTITION BY cc_num
            ORDER BY trans_date_trans_time
            ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
        ) AS rolling_avg_amt
    FROM fraud
"""


In [None]:
start = time.time()
result_duck = con.sql(query_window).df()

end = time.time()

display(result_duck)
print(f"DuckDB execution time: {end - start:.4f} seconds")

In [None]:
start = time.time()
result_sqlite = pd.read_sql_query(query_window, sqlite_con)

end = time.time()

display(result_sqlite)
print(f"SQLite execution time: {end - start:.4f} seconds")
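
To see what the `ROWS BETWEEN ... PRECEDING AND CURRENT ROW` frame computes, here is a small worked example on invented data, with a shorter frame of one preceding row so the averages can be checked by hand. The table `tx` and its values are made up; the sketch uses `sqlite3` so that it runs without the dataset, and assumes the bundled SQLite is version 3.25 or later (the first version with window functions).

```python
import sqlite3

window_con = sqlite3.connect(":memory:")
window_con.execute("CREATE TABLE tx (cc_num INTEGER, ts INTEGER, amt REAL)")
window_con.executemany(
    "INSERT INTO tx VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 2, 20.0), (1, 3, 30.0), (1, 4, 40.0), (2, 1, 100.0)],
)

# Each row is averaged with the previous row of the same card, if there is one.
rolling = window_con.execute("""
    SELECT
        cc_num,
        ts,
        AVG(amt) OVER (
            PARTITION BY cc_num
            ORDER BY ts
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS rolling_avg_amt
    FROM tx
    ORDER BY cc_num, ts
""").fetchall()
window_con.close()

for row in rolling:
    print(row)
```

For card 1 the averages are 10, 15, 25 and 35: the first row has no predecessor, and each later row is averaged with the one before it. Card 2 starts a new partition, so its single row is averaged alone.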

# add one last complex query for comparison

# Advanced functionalities
One advanced functionality of DuckDB is querying files directly, without loading them into a table first.

# References

[Wikipedia article about DuckDB](https://en.wikipedia.org/wiki/DuckDB)

[The official DuckDB website](https://duckdb.org/)

[DuckDB Python API](https://duckdb.org/docs/stable/clients/python/overview)

[The official SQLite website](https://sqlite.org/)

[Wikipedia article about SQLite](https://en.wikipedia.org/wiki/SQLite)
