-
Notifications
You must be signed in to change notification settings - Fork 96
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
I am running following query to count the event on a file stream
SELECT
count(*)
FROM
file('/var/lib/proton/user_files/EQY_US_ALL_TRADE_20231003.gz', 'CSV', 'Time string, Exchange string, Symbol string, Sale_Condition string, Trade_Volume uint32, Trade_Price decimal(20,6), Trade_Stop_Stock_Indicator string, Trade_Correction_Indicator string, Sequence_Number uint32, Trade_Id uint64, Source_of_Trade string, Trade_Reporting_Facility string, Participant_Timestamp string, Trade_Reporting_Facility_TRF_Timestamp string, Trade_Through_Exempt_Indicator uint8')
SETTINGS
format_csv_delimiter = '|'
Query id: 7081aea4-765c-4ec7-b6f5-3be4b6198ee6
┌─count()─┐
│ 3008814 │
└─────────┘
┌─count()─┐
│ 5821401 │
└─────────┘
┌─count()─┐
│ 8633988 │
└─────────┘
┌──count()─┐
│ 11446575 │
└──────────┘
┌──count()─┐
│ 14193753 │
└──────────┘
┌──count()─┐
│ 17006340 │
└──────────┘
┌──count()─┐
│ 19753518 │
└──────────┘
┌──count()─┐
│ 22566105 │
└──────────┘
┌──count()─┐
│ 25378692 │
└──────────┘
┌──count()─┐
│ 28191279 │
└──────────┘
┌──count()─┐
│ 31003866 │
└──────────┘
┌──count()─┐
│ 33816453 │
└──────────┘
┌──count()─┐
│ 36694449 │
└──────────┘
┌──count()─┐
│ 39637854 │
└──────────┘
┌──count()─┐
│ 42450441 │
└──────────┘
┌──count()─┐
│ 45263028 │
└──────────┘
┌──count()─┐
│ 48075615 │
└──────────┘
┌──count()─┐
│ 50888202 │
└──────────┘
┌──count()─┐
│ 53700789 │
└──────────┘
┌──count()─┐
│ 56513376 │
└──────────┘
┌──count()─┐
│ 59325963 │
└──────────┘
┌──count()─┐
│ 62138550 │
└──────────┘
┌──count()─┐
│ 64951137 │
└──────────┘
┌──count()─┐
│ 67763724 │
└──────────┘
┌──count()─┐
│ 70576311 │
└──────────┘
┌──count()─┐
│ 73388898 │
└──────────┘
26 rows in set. Elapsed: 54.436 sec. Processed 75.81 million rows, 12.74 GB (1.39 million rows/s., 234.07 MB/s.)then I create a view on that file stream query
WITH raw AS
(
SELECT
*
FROM
file('/var/lib/proton/user_files/EQY_US_ALL_TRADE_20231003.gz', 'CSV', 'Time string, Exchange string, Symbol string, Sale_Condition string, Trade_Volume uint32, Trade_Price decimal(20,6), Trade_Stop_Stock_Indicator string, Trade_Correction_Indicator string, Sequence_Number uint32, Trade_Id uint64, Source_of_Trade string, Trade_Reporting_Facility string, Participant_Timestamp string, Trade_Reporting_Facility_TRF_Timestamp string, Trade_Through_Exempt_Indicator uint8')
SETTINGS
format_csv_delimiter = '|'
),
raw_tp_time as
(
SELECT
*, extract_all_groups(Time, '^(\\d{2})(\\d{2})(\\d{2})(\\d{9})') AS g, to_datetime64(concat('2023-10-03T', (g[1])[1], ':', (g[1])[2], ':', (g[1])[3], '.', (g[1])[4]), 9) AS _tp_time
FROM
raw
) select * except g from raw_tp_timethis view just extract timestamp from the file stream
SELECT
count(*)
FROM
v_nyse_trade_20231003
Query id: a29c6d45-cf14-4d48-b21b-ca0fd7b6a4aa
┌─count()─┐
│ 3008814 │
└─────────┘
┌─count()─┐
│ 6017628 │
└─────────┘
┌─count()─┐
│ 9026442 │
└─────────┘
┌──count()─┐
│ 12035256 │
└──────────┘
┌──count()─┐
│ 15044070 │
└──────────┘
┌──count()─┐
│ 18052884 │
└──────────┘
┌──count()─┐
│ 21061698 │
└──────────┘
┌──count()─┐
│ 24070512 │
└──────────┘
┌──count()─┐
│ 27079326 │
└──────────┘
┌──count()─┐
│ 30088140 │
└──────────┘
┌──count()─┐
│ 33162363 │
└──────────┘
┌──count()─┐
│ 36171177 │
└──────────┘
┌──count()─┐
│ 39179991 │
└──────────┘
┌──count()─┐
│ 42188805 │
└──────────┘
┌──count()─┐
│ 45197619 │
└──────────┘
┌──count()─┐
│ 48206433 │
└──────────┘
┌──count()─┐
│ 51280656 │
└──────────┘
┌──count()─┐
│ 54354879 │
└──────────┘
┌──count()─┐
│ 57298284 │
└──────────┘
┌──count()─┐
│ 60307098 │
└──────────┘
┌──count()─┐
│ 63315912 │
└──────────┘
┌──count()─┐
│ 66324726 │
└──────────┘
┌──count()─┐
│ 69333540 │
└──────────┘
┌──count()─┐
│ 72342354 │
└──────────┘
┌──count()─┐
│ 75351168 │
└──────────┘
25 rows in set. Elapsed: 50.767 sec. Processed 75.81 million rows, 12.74 GB (1.49 million rows/s., 250.99 MB/s.)the second query has more result than the first one, I beleive the view did not cause new rows. so one of the count or both must be wrong.
also I tried to insert the query from the view into a real stream.
INSERT INTO nyse_trade_20231003 SELECT
*
FROM
v_nyse_trade_20231003and then count what is in that stream, the count is also different than the previsous two
SELECT
count(*)
FROM
table(nyse_trade_20231003)
Query id: 7cb51827-b7f4-4348-826c-7c6bc9916331
┌──count()─┐
│ 75811661 │
└──────────┘
1 row in set. Elapsed: 0.128 sec. Processed 75.81 million rows, 75.81 MB (592.76 million rows/s., 592.76 MB/s.)
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working