# Using DuckDB's SQL as an alternative workflow for large CSV / Parquet files

For large CSV files that cannot be loaded into a database, we might reach for Dask or even PySpark.  But there is an alternative, SQL-based workflow using DuckDB.

## Let's start with the Dask workflow

In [1]:
import dask.dataframe as dd
import numpy as np

#### As a best practice, define the data types explicitly instead of having Dask infer them (HINT: type inference is often wrong and leads to poor performance)

In [2]:
col_types = {
    'SUPP_NO': str,
    'ASN_NO': str,
    'ACTUAL_DT': str,
    'ACTUAL_TIME': str,
    'DC_SEG': 'category',
    'TRLR_NO': str,
    'TRLR_ARR_DT': str,
    'TRLR_ARR_TIME': str,
    'ORDERED_QTY': np.int32,   # Use the narrowest integer type that fits the data (int32 holds up to ~2.1 billion; would uint16's 64K be enough?  Are negative values possible?)
    'ASN_PART_QTY': np.int32,         # ditto
    'PART_UNLD_QTY': np.int32,        # ditto
    'TOTAL_ASN_PART_QTY': np.int32,   # ditto
}
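The question in the comment above (how wide does the integer type need to be?) can be answered from the expected value range. The helper below is a minimal sketch, not part of the original notebook: `smallest_int_dtype` is a hypothetical name, and the example ranges are made-up values for illustration. Note that plain NumPy integer types cannot hold missing values, so this only applies to columns without gaps.

```python
import numpy as np


def smallest_int_dtype(min_value, max_value):
    """Return the narrowest NumPy integer dtype that can hold [min_value, max_value].

    Hypothetical helper for illustration only.
    """
    signed = (np.int8, np.int16, np.int32, np.int64)
    unsigned = (np.uint8, np.uint16, np.uint32, np.uint64)
    # Unsigned types only make sense when no negative values are expected.
    candidates = unsigned if min_value >= 0 else signed
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return np.dtype(dtype)
    raise ValueError("value range does not fit any NumPy integer dtype")


# Example ranges (assumed, not taken from the real data):
qty_dtype = smallest_int_dtype(0, 60_000)        # quantities never negative, below 64K
signed_dtype = smallest_int_dtype(-1, 100_000)   # negatives possible, above 64K
print(qty_dtype, signed_dtype)
```

If the quantities could be negative (for example, adjustments) or exceed 65,535, a signed 32-bit type such as the `np.int32` used above is the safer choice.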

In [3]:
df = dd.read_csv('shipping_receiving_all_plants.csv', dtype=col_types)

In [4]:
df.head(2).transpose()

| | 0 | 1 |
|---|---|---|
| REPORT_ID | GPROD006 | GPROD006 |
| SUPP_NO | JN999901 | JN999901 |
| SUPP_NAME | HONDA MOTOR COMPANY | HONDA MOTOR COMPANY |
| ORD_NO | 685.0 | 685.0 |
| PART_NO | 2002MR4J K000 | 2002MR4J K000 |
| PART_CLR_CD | | |
| PART_DESC | DCT MISSION ASSY(R4JT-AA0) | DCT MISSION ASSY(R4JT-AA0) |
| DELV_SCDL_DT | 2019-02-26 15:00:00 | 2019-02-26 15:00:00 |
| SORTDATE | 201902261500 | 201902261500 |
| ASN_NO | RL194971 | RL194971 |


#### Let's time how long it would take to sum a column using the csv file

In [5]:
%%timeit
df['ORDERED_QTY'].sum().compute()

47.5 s ± 2.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Thanks to Dask's parallelization, we didn't blow out our local machine's memory, but it still took on average almost 50 seconds.  Can we do better?

## Querying the parquet format using Dask instead

As before, we should define the data type or schema for the columns.  Dask writes parquet files through pyarrow (the Python library for Apache Arrow), so the schema needs to use pyarrow's data types instead.

In [6]:
import pyarrow as pa

#### Converting CSV to parquet format

In [7]:
col_schema = {
    'SUPP_NO': pa.string(),
    'ASN_NO': pa.string(),
    'ACTUAL_DT': pa.string(),
    'ACTUAL_TIME': pa.string(),
    'DC_SEG': pa.string(),
    'TRLR_NO': pa.string(),
    'TRLR_ARR_DT': pa.string(),
    'TRLR_ARR_TIME': pa.string(),
    'ORDERED_QTY': pa.int32(),
    'ASN_PART_QTY': pa.int32(),
    'PART_UNLD_QTY': pa.int32(),
    'TOTAL_ASN_PART_QTY': pa.int32(),
}

#### Now save the Dask dataframe as a parquet file.  UPDATE: I later found that repartitioning to a single partition boosted performance.

In [8]:
%%time
df = df.repartition(npartitions=1)
df.to_parquet('s_r_all_plants.parquet', write_index=False, schema=col_schema)

Wall time: 1min 58s


The conversion resulted in a folder called `s_r_all_plants.parquet` containing a single-partition parquet file.

In [None]:
!dir s_r_all_plants.parquet

With Dask, we don't need to list the individual parquet files; we just reference the folder containing them:

In [10]:
%%time
dfp = dd.read_parquet('s_r_all_plants.parquet/')

Wall time: 106 ms


In [11]:
dfp.head()

| | REPORT_ID | SUPP_NO | SUPP_NAME | ORD_NO | PART_NO | PART_CLR_CD | PART_DESC | DELV_SCDL_DT | SORTDATE | ASN_NO | ... | RECEIPT_ADJ_NO | PLANT_CD | PLANT_DESC | ASN_PART_QTY1 | ADJ_QTY | DATE1 | TOTAL_ASN_PART_QTY | DC_SEG | ACTUAL_DT | ACTUAL_TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GPROD006 | JN999901 | HONDA MOTOR COMPANY | 685.0 | 2002MR4J K000 | | DCT MISSION ASSY(R4JT-AA0) | 2019-02-26 15:00:00 | 201902261500 | RL194971 | ... | | AEP | ANNA ENGINE PLANT | 0 | 0 | 20210624 | 30 | | | |
| 1 | GPROD006 | JN999901 | HONDA MOTOR COMPANY | 685.0 | 2002MR4J K000 | | DCT MISSION ASSY(R4JT-AA0) | 2019-02-26 15:00:00 | 201902261500 | RL194971 | ... | | AEP | ANNA ENGINE PLANT | 0 | 0 | 20210624 | 30 | | | |
| 2 | GPROD006 | JN999901 | HONDA MOTOR COMPANY | 685.0 | 2002MR4J K000 | | DCT MISSION ASSY(R4JT-AA0) | 2019-02-26 15:00:00 | 201902261500 | RL194971 | ... | | AEP | ANNA ENGINE PLANT | 0 | 0 | 20210624 | 30 | | | |
| 3 | GPROD006 | JN999901 | HONDA MOTOR COMPANY | 685.0 | 2002MR4J K000 | | DCT MISSION ASSY(R4JT-AA0) | 2019-02-26 15:00:00 | 201902261500 | RL194971 | ... | | AEP | ANNA ENGINE PLANT | 0 | 0 | 20210624 | 30 | | | |
| 4 | GPROD006 | JN999901 | HONDA MOTOR COMPANY | 685.0 | 2002MR4J K000 | | DCT MISSION ASSY(R4JT-AA0) | 2019-02-26 15:00:00 | 201902261500 | RL194971 | ... | | AEP | ANNA ENGINE PLANT | 0 | 0 | 20210624 | 30 | | | |


#### Computing the sum on the parquet file took less than a second

In [12]:
%%timeit
dfp['ORDERED_QTY'].sum().compute()

195 ms ± 48.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


But what if you would rather use a SQL workflow than the dataframe API?

## SQL Workflow using DuckDB

Starting with version 0.2.7, DuckDB now supports querying parquet files!  Just `pip install duckdb`, then you're good to go.

In [13]:
import duckdb

In [14]:
duckdb.__version__

'0.2.7'

#### I prefer to use a context manager (the `with` statement) when connecting to databases.  DuckDB does not yet support context management, so I implemented a small wrapper class myself.

In [15]:
class DuckDBConn:
    """Context manager that opens an in-memory DuckDB connection and closes it on exit."""

    def __enter__(self):
        """Open the database connection."""
        self.conn = duckdb.connect()
        return self.conn

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Close the connection; returning False lets any exception propagate."""
        self.conn.close()
        return False
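
The same open/close pattern can also be written with `contextlib.contextmanager` from the standard library. This is a minimal sketch rather than the notebook's own code: `managed_conn` is a hypothetical helper name, and `sqlite3` is used purely as a stand-in so the example runs without DuckDB installed. With DuckDB, the call would be `managed_conn(duckdb.connect)`.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def managed_conn(connect, *args, **kwargs):
    """Open a connection via `connect`, yield it, and always close it afterwards.

    Hypothetical helper; `connect` is any callable returning an object with .close().
    """
    conn = connect(*args, **kwargs)
    try:
        yield conn
    finally:
        # Runs on normal exit and when the body raises; the exception still propagates.
        conn.close()


# Stand-in demo with sqlite3 (an assumption for runnability, not DuckDB itself).
with managed_conn(sqlite3.connect, ":memory:") as con:
    answer = con.execute("SELECT 40 + 2").fetchone()[0]

print(answer)
```

Because the cleanup lives in a `finally` block, the connection is closed whether or not the query raises, which is the same guarantee the class-based version provides.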

DuckDB supports wildcard "globbing" with the asterisk symbol to read in multiple parquet files:

In [16]:
%%timeit
with DuckDBConn() as con:
    df_sr = con.execute("""
        SELECT sum(ORDERED_QTY) as sum_ordered
        FROM 's_r_all_plants.parquet/*.parquet'
    """).df()
    
df_sr

113 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


DuckDB was at least as fast as Dask: about 113 ms versus 195 ms.

Let's try converting the parquet folder into a single parquet file to see whether performance improves.  The problem is that there doesn't seem to be good documentation on choosing an optimal `row_group_size`.  Through trial and error, I settled on 200K rows.

In [17]:
import pyarrow.parquet as pq

In [18]:
pq.write_table(pq.ParquetDataset('s_r_all_plants.parquet/').read(), 'shipping_receiving.parquet')

In [None]:
!dir

In [20]:
%%timeit
with DuckDBConn() as con:
    df_sr = con.execute("""
       SELECT sum(ORDERED_QTY) as sum_ordered
       FROM 'shipping_receiving.parquet'
       """).df()
    
df_sr

140 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


With DuckDB, the single parquet file gave no speed boost: about 140 ms versus 113 ms for the folder.

#### Let's see if there is a difference with Dask

In [21]:
dfp = dd.read_parquet('shipping_receiving.parquet')

In [22]:
%%time
dfp['ORDERED_QTY'].sum().compute()

Wall time: 324 ms


15301755765

Dask did not get much, if any, speed boost from reading the single parquet file.