Description
I tried to import a dataset comprising over 700 CSV files totaling roughly a billion rows.
Despite my best efforts, and despite all the help from the engineering team in the Slack channel, it's basically impossible.
How to reproduce
Download the dataset:
https://support.kraken.com/hc/en-us/articles/360047543791-Downloadable-historical-market-data-time-and-sales-
Select the top ten largest CSV files:
du -ah | sort -rh | head -n 10
31G .
2.6G ./XBTEUR.csv
2.1G ./XBTUSD.csv
1.6G ./USDTUSD.csv
1.6G ./USDTEUR.csv
1.4G ./ETHUSD.csv
1.2G ./ETHEUR.csv
1.1G ./ETHXBT.csv
663M ./XRPEUR.csv
628M ./EURUSD.csv
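To sanity-check the claimed scale, count the data rows across all files (a quick sketch; it assumes the Kraken trade CSVs are headerless, which they appear to be):

cat *.csv | wc -l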
Pick any of them, say XBTEUR, and create a table:
CREATE STREAM default.kraken_xbteur
(
`timestamp` datetime64(3),
`price` float64,
`volume` float64,
`_tp_time` datetime64(3, 'UTC') DEFAULT timestamp CODEC(DoubleDelta, LZ4),
INDEX _tp_time_index _tp_time TYPE minmax GRANULARITY 2
)
ENGINE = Stream(1, 1, rand())
PARTITION BY to_YYYYMM(timestamp)
ORDER BY to_start_of_hour(_tp_time)
SETTINGS event_time_column = 'timestamp', index_granularity = 8192
Try to import into the newly created table:
INSERT INTO kraken_xbteur (timestamp, price, volume)
SELECT timestamp, price, volume
FROM file('XBTEUR.csv', 'CSV', 'timestamp datetime64(3), price float64, volume float64')
Error message and/or stacktrace
There are multiple problems, as discussed in Slack.
- The file size limit is set to 100 MB. This is ridiculously low for real datasets. Please fix this in the next release and raise it to a more realistic 15 GB (see the split-and-import sketch after this list).
- Starting the Proton server with a custom configuration fails with multiple errors due to a missing environment variable and incorrect file permissions.
- If you use the Rust client to insert batch-wise, you hit another error related to the size limit, regardless of batch size.
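For what it's worth, here is a minimal sketch of the split-and-import workaround one would expect to fit under the 100 MB cap. It assumes GNU split, that the bundled client accepts --query as in upstream ClickHouse, and that the chunk files end up somewhere the server-side file() function can read:

# Split into ~90 MB pieces without breaking rows (GNU split).
split -C 90m -d --additional-suffix=.csv XBTEUR.csv xbteur_part_
# Insert each piece; table and schema as created above.
for f in xbteur_part_*.csv; do
  proton client --query "INSERT INTO kraken_xbteur (timestamp, price, volume) SELECT timestamp, price, volume FROM file('$f', 'CSV', 'timestamp datetime64(3), price float64, volume float64')"
done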
Additional context
When this is solved, please publish benchmarks that demonstrate that Proton can handle larger imports.
Specifically, for any serious application, Proton must demonstrate that it can import datasets like this one.
In finance, you want to insert all historical trade data into a table, connect to a live stream, and then keep filling the table so the trade data stays up to date. It's self-evident that this cannot be done with Proton until large CSV imports are fixed. And even then, I want to see real import benchmarks, because I noticed some performance problems.
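For reference, a minimal sketch of that backfill-then-live pattern, assuming Proton's Kafka-backed external streams work as documented (the stream name, materialized view name, topic, and broker address below are hypothetical, and the client invocation is assumed to match upstream ClickHouse):

# Hypothetical live feed as a Kafka-backed external stream.
proton client --query "
  CREATE EXTERNAL STREAM kraken_live (
    timestamp datetime64(3), price float64, volume float64
  ) SETTINGS type = 'kafka', brokers = 'localhost:9092', topic = 'kraken-trades'"

# After the historical backfill, keep the table filled from the live feed.
proton client --query "
  CREATE MATERIALIZED VIEW mv_kraken_live INTO kraken_xbteur AS
  SELECT timestamp, price, volume FROM kraken_live"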