# DuckDB tutorial

In this tutorial, we introduce DuckDB. It is meant as a first contact with DuckDB and column-oriented DBMSes. The target audience is mainly bachelor students in Computer Science, or anyone with some experience of relational row-store databases such as Postgres or MariaDB who wants to learn how to use column-oriented systems.

We will first describe how to get started quickly and easily using Python as the interface to DuckDB.

Then we will proceed with basic queries and describe the most important aspects of DuckDB's Postgres-like SQL dialect.

Next, we will see more advanced queries, suited to OLAP workloads.

Finally, we will compare DuckDB with SQLite, a popular row-store relational DBMS, to highlight the advantages and drawbacks of column-oriented systems.

The whole tutorial will use a well-suited dataset provided by.


## Table of contents

* [1. What is DuckDB?](#What-is-DuckDB?)
* [2. Why DuckDB?](#Why-DuckDB?)
* [3. Prerequisites](#Prerequisites)
* [4. Installation](#Installation)


# What is DuckDB?
[DuckDB](https://duckdb.org/) is a database management system originally developed by Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in the Netherlands and was first released in 2019. It enjoys the following properties. It is:
- fast
- portable
- open-source
- in-process
- analytical
- column-oriented 
- relational

As of November 2025, its latest stable release is 1.4.1. It is written in C++ and released under the MIT license.

DuckDB supports a [variety of languages](https://duckdb.org/docs/stable/clients/overview) for interacting with it. 


# Why DuckDB?

# Prerequisites
There are a few prerequisites for following along with this tutorial. Fortunately, they are simple to set up.
- A recent version of Python (3.9+). Python 3.9 is the minimum recommended by DuckDB, but we recommend a more recent version, as 3.9 itself is no longer supported. The tutorial has been tested with version 3.13.7.
- Basic SQL knowledge and familiarity with classical row-store DBMSes like MySQL, MariaDB, Postgres, etc.
- Familiarity with Python, Python virtual environments, and Jupyter notebooks.
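
The Python version requirement can be checked from within Python itself. The following is a small sketch using only the standard library; the helper name `python_is_recent_enough` is ours and not part of any library.

```python
import sys

# Minimum interpreter version stated in the prerequisites.
MINIMUM_PYTHON = (3, 9)


def python_is_recent_enough(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


print(python_is_recent_enough())
```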

# Installation

DuckDB is available as a package installable with pip, which makes it easy to use from this notebook, since we only need Python. SQLite is part of the Python standard library (the `sqlite3` module), so there is no need to install it.

The tutorial has been tested with Python version 3.13.7.

First, create a Python virtual environment. The following shell command will create a virtual environment in the current directory named 'duckdb':

```
python -m venv duckdb
```

We can activate the environment on UNIX-like systems with:

```
source duckdb/bin/activate
```

or on Microsoft Windows:

```
.\duckdb\Scripts\activate
```

N.B.: On Windows, this might require [changing the script execution policy](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy?view=powershell-7.5) on the host machine. In PowerShell, this can be done with the following cmdlet, which changes the policy for the current process only (i.e. the current terminal session).

```
Set-ExecutionPolicy -ExecutionPolicy AllSigned -Scope Process
```
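
To confirm that the activation worked, one option is to ask the interpreter whether it is running inside a virtual environment. This is a minimal sketch using only the standard library; the function name `in_virtual_environment` is ours. In a Jupyter notebook, it reports on the interpreter running the kernel, which may differ from the one in your terminal.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(in_virtual_environment())
```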

Once the virtual environment is activated, we can start downloading the required libraries.

```
pip install -r requirements.txt
```

We are now ready to go.



In [None]:
# If this cell executes correctly, you have all the dependencies needed for this tutorial.

import time
import duckdb
import pandas as pd
import sqlite3  # Part of the Python standard library, no need to pip install it
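
If the import cell above fails, a sketch like the following can report which packages are missing without raising an error. It relies only on the standard library module `importlib.metadata`. The helper name `installed_versions` is ours, and the package list is an assumption inferred from the imports above, since the contents of `requirements.txt` are not shown here.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list, inferred from the imports used in this tutorial.
for name, version in installed_versions(["duckdb", "pandas"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```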

# Hello, DuckDB!
We will test our DuckDB installation by printing the traditional Hello World. This basic script creates a `.db` file, which persists data even after the connection is closed.

In [None]:
# This is a simple test to see if duckdb works
# Will create the file if it doesn't exist
test_con = duckdb.connect('test.db')

query = "CREATE TABLE IF NOT EXISTS hello AS SELECT 'Hello, world!'"

test_con.execute(query)

fetch = "SELECT * FROM hello"

result = test_con.execute(fetch).fetchall()

print(result)

test_con.close()

# Dataset
We will now fetch a dataset that suits our needs.
There are multiple ways to get a dataset; we will simply fetch a Parquet file online.
DuckDB can handle a [variety](https://duckdb.org/docs/stable/data/overview) of file types, including CSV, JSON, and Parquet.

In [None]:
# Load NYC Taxi data from a Parquet file (the full month is downloaded), then keep the first 10,000 rows
url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'
df = pd.read_parquet(url)
df = df.head(10000)
print(f"Dataset shape: {df.shape}")


# Dump the dataset to database files
We will now dump this subset of the dataset into dedicated database files, one managed by SQLite and one by DuckDB.

In [None]:
# Connect to a local DuckDB file (creates it if it doesn't exist)
duckdb_con = duckdb.connect('taxi_data.ddb')

# Register the Pandas DataFrame as a temporary view
duckdb_con.register('temp_view', df)

# Dump the data: Create a persistent table from the view
duckdb_con.execute("CREATE TABLE IF NOT EXISTS trips AS SELECT * FROM temp_view")

# Verify (optional): Query a sample
print(duckdb_con.sql("SELECT * FROM trips LIMIT 5").df())

# Close the connection (data is now persisted in the file)
duckdb_con.close()

In [None]:
# Connect to a local SQLite file (creates it if it doesn't exist)
sqlite_con = sqlite3.connect('taxi_data.sqlite')

# Dump the data: Write the DataFrame to a table (replaces if exists)
df.to_sql('taxi_data', sqlite_con, if_exists='replace')

# Testing by querying the database
print(pd.read_sql_query("SELECT * FROM taxi_data LIMIT 5", sqlite_con))
sqlite_con.close()

# Basic queries


# Advanced queries

# Comparison to SQLite
Now that we know how to use DuckDB, we will illustrate its advantages and drawbacks against SQLite, a popular row-store DBMS.

## What is SQLite?
