# Creating a DuckDB database from a dataset

There are many options for creating SQL databases, but a very powerful and free one is [DuckDB](https://duckdb.org/). One advantage of DuckDB is that it has an API for just about every programming language out there (Python, C, R, Java, etc.), so you can use it however you like. It also requires very little setup: you can query directly from a flat file such as a CSV, from multiple files, or from a partitioned directory of files for better efficiency.

We'll use the same Spotify chart [dataset](https://www.kaggle.com/datasets/sunnykakar/spotify-charts-all-audio-data) for this tutorial.

We can call DuckDB's `read_csv` function inside the query itself to scan the CSV file. It automatically detects that the file is compressed. We can let DuckDB infer the data types automatically, much as pandas does, or set them manually with the `columns` parameter.

In [11]:
import duckdb

In [15]:
query = '''
SELECT 
    title,
    rank,
    date,
    artist,
    region,
    chart
FROM read_csv('merged_data.csv.gz')
WHERE title ILIKE '%bad blood%' AND 
      artist ILIKE '%taylor swift%' AND 
      region = 'United States'
'''

df = duckdb.query(query).df()

In [18]:
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv('merged_data.csv.gz',compression='gzip',nrows=10000)

In [19]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,rank,date,artist,url,region,chart,trend,streams,track_id,album,popularity,duration_ms,explicit,release_date,available_markets,af_danceability,af_energy,af_key,af_loudness,af_mode,af_speechiness,af_acousticness,af_instrumentalness,af_liveness,af_valence,af_tempo,af_time_signature
0,0,Chantaje (feat. Maluma),1,2017-01-01,Shakira,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,Argentina,top200,SAME_POSITION,253019.0,6mICuAdrwEjh6Y6lroV2Kg,El Dorado,78.0,195840.0,False,2017-05-26,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA...",0.852,0.773,8.0,-2.921,0.0,0.0776,0.187,3e-05,0.159,0.907,102.034,4.0
1,1,Vente Pa' Ca (feat. Maluma),2,2017-01-01,Ricky Martin,https://open.spotify.com/track/7DM4BPaS7uofFul...,Argentina,top200,MOVE_UP,223988.0,7DM4BPaS7uofFul3ywMe46,Vente Pa' Ca (feat. Maluma),72.0,259195.0,False,2016-09-22,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA...",0.663,0.92,11.0,-4.07,0.0,0.226,0.00431,1.7e-05,0.101,0.533,99.935,4.0
2,2,Reggaetón Lento (Bailemos),3,2017-01-01,CNCO,https://open.spotify.com/track/3AEZUABDXNtecAO...,Argentina,top200,MOVE_DOWN,210943.0,3AEZUABDXNtecAOSC1qTfo,Primera Cita,73.0,222560.0,False,2016-08-26,"['AR', 'AU', 'AT', 'BE', 'BO', 'BG', 'CA', 'CL...",0.761,0.838,4.0,-3.073,0.0,0.0502,0.4,0.0,0.176,0.71,93.974,4.0
3,3,Safari,4,2017-01-01,"J Balvin, Pharrell Williams, BIA, Sky",https://open.spotify.com/track/6rQSrBHf7HlZjtc...,Argentina,top200,SAME_POSITION,173865.0,6rQSrBHf7HlZjtcMZ4S4bO,Energía,0.0,205600.0,False,2016-06-24,[],0.508,0.687,0.0,-4.361,1.0,0.326,0.551,3e-06,0.126,0.555,180.044,4.0
4,4,Shaky Shaky,5,2017-01-01,Daddy Yankee,https://open.spotify.com/track/58IL315gMSTD37D...,Argentina,top200,MOVE_UP,153956.0,58IL315gMSTD37DOZPJ2hf,Shaky Shaky,0.0,234320.0,False,2016-04-08,[],0.899,0.626,6.0,-4.228,0.0,0.292,0.076,0.0,0.0631,0.873,88.007,4.0


This actually ran quite a bit faster than the code we were using before. However, we can make it more efficient by partitioning the dataset, say by region (the code below partitions by chart as well). We'll also introduce a new file type: Parquet, a compressed, columnar file format that preserves data types.

In [49]:
import pandas as pd
dtypes = {
    'title':pd.StringDtype(),
    'rank':pd.Int64Dtype(),
    'date':pd.StringDtype(),
    'artist':pd.StringDtype(),
    'url':pd.StringDtype(),
    'region':pd.StringDtype(),
    'chart':pd.StringDtype(),
    'trend':pd.StringDtype(),
    'streams':pd.Int64Dtype(),
    'track_id':pd.StringDtype(),
    'album':pd.StringDtype(),
    'popularity':pd.Int64Dtype(),
    'duration_ms':pd.Int64Dtype(),
    'explicit':pd.BooleanDtype(),
    'release_date':pd.StringDtype(),
    'available_markets':pd.StringDtype(),
    'af_danceability':pd.Float64Dtype(),
    'af_energy':pd.Float64Dtype(),
    'af_key':pd.Int64Dtype(),
    'af_loudness':pd.Float64Dtype(),
    'af_mode':pd.Int64Dtype(),
    'af_speechiness':pd.Float64Dtype(),
    'af_acousticness':pd.Float64Dtype(),
    'af_instrumentalness':pd.Float64Dtype(),
    'af_liveness':pd.Float64Dtype(),
    'af_valence':pd.Float64Dtype(),
    'af_tempo':pd.Float64Dtype(),
    'af_time_signature':pd.Int64Dtype()    
}

df = pd.read_csv('merged_data.csv.gz',compression='gzip',nrows=10000,dtype=dtypes)

In [50]:
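def process_spotify_data(df):
    """Parse the date columns, then write the chunk to the partitioned 'spotify' Parquet dataset."""
    # Unparseable dates become NaT instead of raising an error
    df['date'] = pd.to_datetime(df['date'],errors='coerce')
    df['release_date'] = pd.to_datetime(df['release_date'],errors='coerce')

    # Files land under spotify/chart=<chart>/region=<region>/
    df.to_parquet('spotify',compression='snappy',index=False,partition_cols=['chart','region'])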
    

In [51]:
for chunk in pd.read_csv('merged_data.csv.gz',
                         compression='gzip',
                         usecols=dtypes.keys(),
                         dtype=dtypes,
                         chunksize=10000):
    
    process_spotify_data(chunk)


In [56]:
chunk

Unnamed: 0,title,rank,date,artist,url,region,chart,trend,streams,track_id,album,popularity,duration_ms,explicit,release_date,available_markets,af_danceability,af_energy,af_key,af_loudness,af_mode,af_speechiness,af_acousticness,af_instrumentalness,af_liveness,af_valence,af_tempo,af_time_signature
26170000,Passion,25,2021-07-30,PinkPantheress,https://open.spotify.com/track/6ZJqCviTotiIujl...,Singapore,viral50,MOVE_DOWN,,6ZJqCviTotiIujl1rfcL53,Passion,56,138268,False,2021-07-01,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA...",0.699,0.628,0,-5.163,0,0.0369,0.49,0.782,0.101,0.607,79.968,4
26170001,Beethoven,26,2021-07-30,Kenndog,https://open.spotify.com/track/5V7eE5xpJKbBCnz...,Singapore,viral50,NEW_ENTRY,,5V7eE5xpJKbBCnzlO4JSNc,Beethoven,0,144561,True,2021-06-11,[],0.875,0.65,9,-8.914,0,0.113,0.149,0.000009,0.106,0.501,102.971,4
26170002,情结,27,2021-07-30,你们的好朋友大雨,https://open.spotify.com/track/6PvqwE59f0NNYe9...,Singapore,viral50,MOVE_UP,,,,,,,NaT,,,,,,,,,,,,,
26170003,Renegade (feat. Taylor Swift),28,2021-07-30,Big Red Machine,https://open.spotify.com/track/1aU1wpYBSpP0M6I...,Singapore,viral50,MOVE_DOWN,,1aU1wpYBSpP0M6IiihY5Ue,Renegade (feat. Taylor Swift),0,254466,True,2021-07-02,[],0.532,0.708,0,-8.121,1,0.0505,0.435,0.0236,0.107,0.586,167.977,4
26170004,Woman,29,2021-07-30,Doja Cat,https://open.spotify.com/track/6Uj1ctrBOjOas8x...,Singapore,viral50,MOVE_UP,,6Uj1ctrBOjOas8xZXGqKk4,Planet Her,85,172626,True,2021-06-25,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA...",0.824,0.764,5,-4.175,0,0.0854,0.0888,0.00294,0.117,0.881,107.998,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26174264,BYE,46,2021-07-31,Jaden,https://open.spotify.com/track/3OUyyDN7EZrL7i0...,Vietnam,viral50,MOVE_UP,,,,,,,NaT,,,,,,,,,,,,,
26174265,Pillars,47,2021-07-31,My Anh,https://open.spotify.com/track/6eky30oFiQbHUAT...,Vietnam,viral50,NEW_ENTRY,,,,,,,NaT,,,,,,,,,,,,,
26174266,Gái Độc Thân,48,2021-07-31,Tlinh,https://open.spotify.com/track/2klsSb2iTfgDh95...,Vietnam,viral50,MOVE_DOWN,,2klsSb2iTfgDh95Ak9uWY2,Gái Độc Thân,50,185175,False,2021-06-30,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA...",0.619,0.665,0,-8.625,0,0.0432,0.194,0.0036,0.154,0.544,99.977,4
26174267,Renegade (feat. Taylor Swift),49,2021-07-31,Big Red Machine,https://open.spotify.com/track/1aU1wpYBSpP0M6I...,Vietnam,viral50,MOVE_DOWN,,1aU1wpYBSpP0M6IiihY5Ue,Renegade (feat. Taylor Swift),0,254466,True,2021-07-02,[],0.532,0.708,0,-8.121,1,0.0505,0.435,0.0236,0.107,0.586,167.977,4


In [57]:
query = '''
SELECT *
FROM read_parquet('spotify/*/*/*.parquet',hive_partitioning=true)
WHERE title ILIKE '%bad blood%' AND 
      artist ILIKE '%taylor swift%' AND 
      region = 'United%20States'
      AND chart = 'top200'
'''

df = duckdb.query(query).df()

In [58]:
df

Unnamed: 0,title,rank,date,artist,url,trend,streams,track_id,album,popularity,duration_ms,explicit,release_date,available_markets,af_danceability,af_energy,af_key,af_loudness,af_mode,af_speechiness,af_acousticness,af_instrumentalness,af_liveness,af_valence,af_tempo,af_time_signature,chart,region
0,Bad Blood,157,2017-06-12,Taylor Swift,https://open.spotify.com/track/0TvQLMecTE8utzo...,MOVE_UP,186481,0TvQLMecTE8utzoNmvXRbK,1989,74,211933,False,2014-01-01,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CL...",0.652,0.802,7,-6.114,1,0.181,0.0871,6e-06,0.148,0.295,170.157,4,top200,United%20States
1,Bad Blood,175,2017-06-11,Taylor Swift,https://open.spotify.com/track/0TvQLMecTE8utzo...,MOVE_DOWN,160422,0TvQLMecTE8utzoNmvXRbK,1989,74,211933,False,2014-01-01,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CL...",0.652,0.802,7,-6.114,1,0.181,0.0871,6e-06,0.148,0.295,170.157,4,top200,United%20States
2,Bad Blood,181,2017-06-13,Taylor Swift,https://open.spotify.com/track/0TvQLMecTE8utzo...,MOVE_DOWN,169950,0TvQLMecTE8utzoNmvXRbK,1989,74,211933,False,2014-01-01,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CL...",0.652,0.802,7,-6.114,1,0.181,0.0871,6e-06,0.148,0.295,170.157,4,top200,United%20States
3,Bad Blood,122,2017-06-10,Taylor Swift,https://open.spotify.com/track/0TvQLMecTE8utzo...,MOVE_UP,215312,0TvQLMecTE8utzoNmvXRbK,1989,74,211933,False,2014-01-01,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CL...",0.652,0.802,7,-6.114,1,0.181,0.0871,6e-06,0.148,0.295,170.157,4,top200,United%20States
4,Bad Blood,127,2017-06-09,Taylor Swift,https://open.spotify.com/track/0TvQLMecTE8utzo...,NEW_ENTRY,238176,0TvQLMecTE8utzoNmvXRbK,1989,74,211933,False,2014-01-01,"['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CL...",0.652,0.802,7,-6.114,1,0.181,0.0871,6e-06,0.148,0.295,170.157,4,top200,United%20States
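
The `region` values in the output above read `United%20States` rather than `United States`, which is why the query filters on `'United%20States'`. This looks like percent-encoding (URL encoding) of the partition directory names, read back without decoding. That explanation is an assumption based on the output, and the behaviour may differ between library versions. A minimal standard-library sketch of the encoding, where `region_filter_value` is a hypothetical helper introduced only for illustration:

```python
from urllib.parse import quote, unquote

region = "United States"

# Percent-encode the name: the space becomes %20
encoded = quote(region)

# Decode it back to the human-readable name
decoded = unquote(encoded)


def region_filter_value(name: str) -> str:
    """Hypothetical helper: the encoded form to use in a WHERE clause."""
    return quote(name, safe="")


print(encoded)                           # United%20States
print(decoded)                           # United States
print(region_filter_value("Argentina"))  # Argentina
```

Names without spaces or special characters, such as `Argentina`, come through unchanged, so only some regions need the encoded form.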
