Use streaming for Snowflake queries #1337

sgrebnov · 2024-05-08T08:21:56Z

Implements #1319

This PR switches to fetching and processing Snowflake query results in streaming mode: Arrow RecordBatches are returned to DataFusion as soon as they become available, that is, as soon as the corresponding data chunk is downloaded.

This dramatically reduces memory usage for large datasets, because records are processed chunk by chunk without keeping the entire result in memory. It also improves performance by giving early access to records.
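To illustrate the memory argument, here is a minimal, std-only sketch contrasting buffered and chunk-by-chunk processing. It is not the code from this PR or from snowflake-rs: `Batch`, `ChunkStream`, `sum_buffered`, `sum_streaming`, and `new_stream` are hypothetical names, a `Vec<i64>` stands in for an Arrow RecordBatch, and a synchronous `Iterator` stands in for the async stream the real implementation would presumably use.

```rust
// Hypothetical stand-in for an Arrow RecordBatch: a single column of i64 values.
struct Batch {
    values: Vec<i64>,
}

// Hypothetical chunk source: lazily yields one batch per "downloaded" chunk.
struct ChunkStream {
    next_chunk: usize,
    total_chunks: usize,
    rows_per_chunk: usize,
}

impl Iterator for ChunkStream {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        if self.next_chunk == self.total_chunks {
            return None;
        }
        // Simulates downloading and decoding a single chunk on demand.
        let start = (self.next_chunk * self.rows_per_chunk) as i64;
        let values = (start..start + self.rows_per_chunk as i64).collect();
        self.next_chunk += 1;
        Some(Batch { values })
    }
}

// Buffered approach: every batch is held in memory before processing starts.
// Returns (sum of all values, peak number of rows held at once).
fn sum_buffered(stream: ChunkStream) -> (i64, usize) {
    let all: Vec<Batch> = stream.collect();
    let peak_rows: usize = all.iter().map(|b| b.values.len()).sum();
    let sum: i64 = all.iter().flat_map(|b| b.values.iter()).sum();
    (sum, peak_rows)
}

// Streaming approach: each batch is processed and dropped before the next one is fetched.
// Returns (sum of all values, peak number of rows held at once).
fn sum_streaming(stream: ChunkStream) -> (i64, usize) {
    let mut sum = 0i64;
    let mut peak_rows = 0usize;
    for batch in stream {
        peak_rows = peak_rows.max(batch.values.len());
        sum += batch.values.iter().sum::<i64>();
    }
    (sum, peak_rows)
}

// Illustrative sizes only: 4 chunks of 1000 rows each.
fn new_stream() -> ChunkStream {
    ChunkStream { next_chunk: 0, total_chunks: 4, rows_per_chunk: 1000 }
}

fn main() {
    let (buffered_sum, buffered_peak) = sum_buffered(new_stream());
    let (streamed_sum, streamed_peak) = sum_streaming(new_stream());
    assert_eq!(buffered_sum, streamed_sum);
    println!("buffered peak rows: {buffered_peak}, streaming peak rows: {streamed_peak}");
}
```

Both functions compute the same result, but the streaming version never holds more than one chunk of rows at a time, so its peak memory depends on the chunk size rather than on the size of the full result.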

With this change I was finally able to run test queries against the very large snowflake_sample_data.tpch_sf100 dataset. Previously, such queries failed after a few minutes with very high application memory usage (8GB+).

sql> SELECT
       "L_RETURNFLAG",
       "L_LINESTATUS",
       SUM("L_QUANTITY") AS "SUM_QTY",
       SUM("L_EXTENDEDPRICE") AS "SUM_BASE_PRICE",
       SUM("L_EXTENDEDPRICE" * (1-"L_DISCOUNT")) AS "SUM_DISC_PRICE",
       SUM("L_EXTENDEDPRICE" * (1-"L_DISCOUNT") * (1+"L_TAX")) AS "SUM_CHARGE",
       AVG("L_QUANTITY") AS "AVG_QTY",
       AVG("L_EXTENDEDPRICE") AS "AVG_PRICE",
       AVG("L_DISCOUNT") AS "AVG_DISC",
       COUNT(*) AS "COUNT_ORDER"
FROM
       lineitem
WHERE
       "L_SHIPDATE" <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY
       "L_RETURNFLAG",
       "L_LINESTATUS"
ORDER BY
       "L_RETURNFLAG",
       "L_LINESTATUS";
+--------------+--------------+--------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-------------+
| L_RETURNFLAG | L_LINESTATUS | SUM_QTY      | SUM_BASE_PRICE   | SUM_DISC_PRICE    | SUM_CHARGE         | AVG_QTY            | AVG_PRICE          | AVG_DISC           | COUNT_ORDER |
+--------------+--------------+--------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-------------+
| A            | F            | 377512775800 | 566077609719445  | -2264319380385681 | -11322063130531741 | 2549.9370423275427 | 3823611.6984304897 | 5.000224353092903  | 148047881   |
| N            | F            | 9855306200   | 14777109838598   | -59084214370854   | -295454378203398   | 2550.155695688288  | 3823719.9388804506 | 4.998528433805397  | 3864590     |
| N            | O            | 743630297600 | 1115072568137359 | -4460231163250018 | -22303052391770873 | 2550.000940437419  | 3823722.764636094  | 4.999791831562552  | 291619617   |
| R            | F            | 377572497000 | 566160303274534  | -2264734385024769 | -11324393336414073 | 2550.006628406532  | 3823669.725845297  | 5.0001304339654125 | 148067261   |
+--------------+--------------+--------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-------------+

Time: 412.024292208 seconds. 4 rows.

Corresponding PR to snowflake-rs library: https://github.com/mycelial/snowflake-rs/pull/44/files

Streaming support for Snowflake queries

ecf1cac

sgrebnov requested a review from a team as a code owner May 8, 2024 08:21

sgrebnov self-assigned this May 8, 2024

Merge branch 'trunk' into sgrebnov/sf-streaming

5a56bda

phillipleblanc approved these changes May 8, 2024


Merge branch 'trunk' into sgrebnov/sf-streaming

200f6a6

sgrebnov changed the title ~~Add streaming support for Snowflake queries~~ Use streaming for Snowflake queries May 8, 2024

sgrebnov merged commit 55db989 into trunk May 8, 2024
16 checks passed

sgrebnov deleted the sgrebnov/sf-streaming branch May 8, 2024 16:31

sgrebnov mentioned this pull request May 8, 2024

Snowflake Streaming support #1319

Closed
