# DuckDB tutorial

## Setup

First, create a Python virtual environment. The following shell command will create a virtual environment in the current directory named 'duckdb':


```
python -m venv duckdb
```

We can activate the environment on UNIX-like systems with:

```
source duckdb/bin/activate
```

or on Microsoft Windows:

```
.\duckdb\Scripts\activate
```

On Windows, activation might require [changing the script execution policy](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy?view=powershell-7.5) on the host machine. In PowerShell, the following cmdlet changes the policy so that scripts can run in the current process only (i.e. the current terminal session):

```
Set-ExecutionPolicy -ExecutionPolicy AllSigned -Scope Process
```
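Once the environment is activated, it can be useful to confirm that the interpreter really runs inside it. The sketch below relies only on the standard library; the helper name `in_virtualenv` is made up for this illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If the first line prints `False`, the activation step most likely did not take effect in the current terminal session.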

Now we can install the required libraries listed in `requirements.txt`.

```
pip install -r requirements.txt
```
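A quick way to check that the installation worked is to ask Python whether the packages can be found. This is a minimal sketch: the helper `missing_packages` is hypothetical, and the package list is an assumption based on the imports used in this tutorial, not on the actual contents of `requirements.txt`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    # find_spec returns None when no installed module matches the name
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, taken from the imports used later in the tutorial
required = ["duckdb", "pandas"]
missing = missing_packages(required)
print("missing packages:", missing or "none")
```

An empty result suggests the environment is ready; any listed name can be installed with `pip install <name>`.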

We are now ready to go.


In [5]:
import time
import duckdb
import pandas as pd
import sqlite3  # Part of the Python standard library, no need to pip install it

In [6]:
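# Connect to a file-backed database; the file is created if it doesn't exist
conn = duckdb.connect('duckdb.db')

# Create a one-row table, but only if it is not already there
query = "CREATE TABLE IF NOT EXISTS hello AS SELECT 'World'"
conn.execute(query)

# Read the table back as a list of tuples
fetch = "SELECT * FROM hello"
result = conn.execute(fetch).fetchall()
print(result)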

[('World',)]


In [None]:
# Load NYC Taxi data from a Parquet file and keep a small subset.
# Note: read_parquet downloads the whole file; head() then keeps the first 10,000 rows.
url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'
df = pd.read_parquet(url)
df = df.head(10000)
print(f"Dataset shape: {df.shape}")

dist = df['trip_distance']
print(dist)

Dataset shape: (10000, 19)
0       0.97
1       1.10
2       2.51
3       1.90
4       1.43
        ... 
9995    1.97
9996    3.33
9997    4.70
9998    1.00
9999    1.80
Name: trip_distance, Length: 10000, dtype: float64
