# Introduction to DuckDB

[DuckDB](https://duckdb.org/) is an open-source, in-process analytical SQL database management system. It is built for complex analytical queries on large volumes of data while keeping a lightweight footprint. DuckDB is known for its fast query execution and runs in a wide range of environments, from laptops to large-scale server setups.

One of its standout features is its support for standard SQL queries, making it accessible to users familiar with SQL syntax. Additionally, DuckDB is optimized for read-heavy workloads, making it an ideal choice for data exploration, analytics, and research purposes.

DuckDB also provides integration with popular data science tools like Jupyter Notebooks and Pandas, facilitating a smooth workflow for analysts and data scientists. Its compatibility with Jupyter Notebooks allows for an interactive and collaborative environment where users can harness DuckDB's power alongside their code and analysis.

Moreover, DuckDB's integration with Pandas simplifies data manipulation and analysis. It enables the execution of SQL queries directly on Pandas DataFrames, providing a familiar interface for those comfortable with Pandas while leveraging DuckDB's speed and efficiency for data processing.

Another notable aspect is that it can handle various data formats such as CSV, Parquet, and others. Being able to interact with these formats through SQL queries streamlines the process of querying and analyzing diverse data sources without the need for extensive data preprocessing.

So DuckDB offers a user-friendly environment for data exploration, analysis, and manipulation.

To get started, let's import duckdb and pandas. To turn Jupyter notebook cells into SQL cells, we also need to load the jupysql extension.

In [1]:
import duckdb
import pandas as pd
import json

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

### Configure the notebook
Set configurations on the IPython SQL extension to return query results directly as Pandas DataFrames and to simplify the output that is printed to the notebook.

In [2]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False


### Connect to a DuckDB database
Connect to a DuckDB database using a SQLAlchemy-style connection string. You can connect either to an in-memory DuckDB database or to a file-backed one.

In [3]:
%sql duckdb:///duckdb_db/sampleDB.duckdb

## Loading data

We have two files in our data directory:
- pokemon__donations.json
- pokemon__masterdata.csv

### Load a json file into duckdb

We will now see how to use DuckDB with JSON and CSV files. Let's start with the JSON file. Here we first create a pandas DataFrame and then load it into the database.

In [4]:
df_donations = pd.read_json('data/pokemon__donations.json', orient='records')
df_donations.head()

Unnamed: 0,pokemon_id,donation_date,donation_number_of_pokemon,donation_amount_eur
0,195,2023-12-21,4,85.6
1,251,2023-12-28,74,6569.72
2,27,2023-12-30,24,1734.24
3,267,2023-12-13,36,3584.16
4,511,2023-12-06,74,2797.2


With DuckDB we can now query this DataFrame using SQL.

In [7]:
%%sql

Select * from df_donations
where donation_amount_eur > 1000;

pokemon_id,donation_date,donation_number_of_pokemon,donation_amount_eur
251,2023-12-28,74,6569.72
27,2023-12-30,24,1734.24
267,2023-12-13,36,3584.16
511,2023-12-06,74,2797.2
430,2023-12-02,77,6303.99
458,2023-12-13,70,5184.2
719,2023-12-23,18,1312.56
194,2023-12-18,55,3502.95
799,2023-12-25,58,4409.16
520,2023-12-21,48,2317.92


Or we can quite easily create a table from this DataFrame in the DuckDB database:

In [5]:
%%sql
CREATE TABLE pokemon__donations AS SELECT * FROM df_donations

Count


### Load a csv file into duckdb

We can also query a CSV file directly with DuckDB.

In [10]:
%%sql

SELECT * FROM 'data/pokemon__masterdata.csv';

pokemon_id,pokemon_name,pokemon_type
1,Bulbasaur,grass
2,Ivysaur,grass
3,Venusaur,grass
4,Charmander,fire
5,Charmeleon,fire
6,Charizard,fire
7,Squirtle,water
8,Wartortle,water
9,Blastoise,water
10,Caterpie,bug


And of course we can also load the CSV file into a table in the DuckDB database. We can do this with the COPY command. First we create the table:

In [8]:
%%sql

CREATE TABLE pokemon__masterdata(
    pokemon_id INTEGER PRIMARY KEY NOT NULL,
    pokemon_name VARCHAR,
    pokemon_type VARCHAR
    );

Count


And then we can load the data into the table:

In [11]:
%%sql

COPY pokemon__masterdata FROM 'data/pokemon__masterdata.csv' (HEADER TRUE, DELIMITER ',');

Count


## Querying the database and exporting the results

Now that we have created the tables and loaded the data into the database, we can query it using SQL. Let's start with the donations table:

In [12]:
%%sql

select * from pokemon__donations

pokemon_id,donation_date,donation_number_of_pokemon,donation_amount_eur
195,2023-12-21,4,85.6
251,2023-12-28,74,6569.72
27,2023-12-30,24,1734.24
267,2023-12-13,36,3584.16
511,2023-12-06,74,2797.2
430,2023-12-02,77,6303.99
607,2023-12-17,51,783.87
458,2023-12-13,70,5184.2
719,2023-12-23,18,1312.56
291,2023-12-21,80,996.0


In [13]:
%%sql

select * from pokemon__masterdata

pokemon_id,pokemon_name,pokemon_type
1,Bulbasaur,grass
2,Ivysaur,grass
3,Venusaur,grass
4,Charmander,fire
5,Charmeleon,fire
6,Charizard,fire
7,Squirtle,water
8,Wartortle,water
9,Blastoise,water
10,Caterpie,bug


We can now join the donations table with the masterdata table to get the pokemon names for the donations, and export the result to a pandas DataFrame:

In [20]:
%%sql

df_join << SELECT * FROM pokemon__donations 
LEFT JOIN pokemon__masterdata 
ON pokemon__donations.pokemon_id = pokemon__masterdata.pokemon_id

In [21]:
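# With autopandas enabled the result is already a DataFrame; otherwise convert it
if not isinstance(df_join, pd.DataFrame):
    df_join = df_join.DataFrame()
df_join.head()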

Unnamed: 0,pokemon_id,donation_date,donation_number_of_pokemon,donation_amount_eur,pokemon_id_2,pokemon_name,pokemon_type
0,195,2023-12-21,4,85.6,195,Quagsire,water
1,251,2023-12-28,74,6569.72,251,Celebi,psychic
2,27,2023-12-30,24,1734.24,27,Sandshrew,ground
3,267,2023-12-13,36,3584.16,267,Beautifly,bug
4,511,2023-12-06,74,2797.2,511,Pansage,grass


Or we could save the result to a csv file:

In [22]:
%%sql

COPY(
    SELECT * FROM pokemon__donations 
    LEFT JOIN pokemon__masterdata 
    ON pokemon__donations.pokemon_id = pokemon__masterdata.pokemon_id
)
TO 'data/output.csv' (HEADER, DELIMITER ',');

Count


## Why should you use DuckDB over Pandas? 

You shouldn't (unless you like SQL more than Python). But you can use DuckDB in combination with Pandas, and so combine the strengths of both.