Skip to content

count() on file stream returns different result than view on it #469

@gangtao

Description

@gangtao

I am running following query to count the event on a file stream

SELECT 
  count(*)
FROM 
  file('/var/lib/proton/user_files/EQY_US_ALL_TRADE_20231003.gz', 'CSV', 'Time string, Exchange string, Symbol string, Sale_Condition string, Trade_Volume uint32, Trade_Price decimal(20,6), Trade_Stop_Stock_Indicator string, Trade_Correction_Indicator string, Sequence_Number uint32, Trade_Id uint64, Source_of_Trade string, Trade_Reporting_Facility string, Participant_Timestamp string, Trade_Reporting_Facility_TRF_Timestamp string, Trade_Through_Exempt_Indicator uint8')
SETTINGS 
  format_csv_delimiter = '|'

Query id: 7081aea4-765c-4ec7-b6f5-3be4b6198ee6

┌─count()─┐
│ 3008814 │
└─────────┘
┌─count()─┐
│ 5821401 │
└─────────┘
┌─count()─┐
│ 8633988 │
└─────────┘
┌──count()─┐
│ 11446575 │
└──────────┘
┌──count()─┐
│ 14193753 │
└──────────┘
┌──count()─┐
│ 17006340 │
└──────────┘
┌──count()─┐
│ 19753518 │
└──────────┘
┌──count()─┐
│ 22566105 │
└──────────┘
┌──count()─┐
│ 25378692 │
└──────────┘
┌──count()─┐
│ 28191279 │
└──────────┘
┌──count()─┐
│ 31003866 │
└──────────┘
┌──count()─┐
│ 33816453 │
└──────────┘
┌──count()─┐
│ 36694449 │
└──────────┘
┌──count()─┐
│ 39637854 │
└──────────┘
┌──count()─┐
│ 42450441 │
└──────────┘
┌──count()─┐
│ 45263028 │
└──────────┘
┌──count()─┐
│ 48075615 │
└──────────┘
┌──count()─┐
│ 50888202 │
└──────────┘
┌──count()─┐
│ 53700789 │
└──────────┘
┌──count()─┐
│ 56513376 │
└──────────┘
┌──count()─┐
│ 59325963 │
└──────────┘
┌──count()─┐
│ 62138550 │
└──────────┘
┌──count()─┐
│ 64951137 │
└──────────┘
┌──count()─┐
│ 67763724 │
└──────────┘
┌──count()─┐
│ 70576311 │
└──────────┘
┌──count()─┐
│ 73388898 │
└──────────┘

26 rows in set. Elapsed: 54.436 sec. Processed 75.81 million rows, 12.74 GB (1.39 million rows/s., 234.07 MB/s.)

then I create a view on that file stream query

WITH raw AS
  (
    SELECT 
      *
    FROM 
      file('/var/lib/proton/user_files/EQY_US_ALL_TRADE_20231003.gz', 'CSV', 'Time string, Exchange string, Symbol string, Sale_Condition string, Trade_Volume uint32, Trade_Price decimal(20,6), Trade_Stop_Stock_Indicator string, Trade_Correction_Indicator string, Sequence_Number uint32, Trade_Id uint64, Source_of_Trade string, Trade_Reporting_Facility string, Participant_Timestamp string, Trade_Reporting_Facility_TRF_Timestamp string, Trade_Through_Exempt_Indicator uint8')
    SETTINGS 
      format_csv_delimiter = '|'
  ),
raw_tp_time as
(
SELECT 
  *, extract_all_groups(Time, '^(\\d{2})(\\d{2})(\\d{2})(\\d{9})') AS g, to_datetime64(concat('2023-10-03T', (g[1])[1], ':', (g[1])[2], ':', (g[1])[3], '.', (g[1])[4]), 9) AS _tp_time
FROM 
  raw
) select * except g from raw_tp_time

this view just extract timestamp from the file stream

SELECT 
  count(*)
FROM 
  v_nyse_trade_20231003

Query id: a29c6d45-cf14-4d48-b21b-ca0fd7b6a4aa

┌─count()─┐
│ 3008814 │
└─────────┘
┌─count()─┐
│ 6017628 │
└─────────┘
┌─count()─┐
│ 9026442 │
└─────────┘
┌──count()─┐
│ 12035256 │
└──────────┘
┌──count()─┐
│ 15044070 │
└──────────┘
┌──count()─┐
│ 18052884 │
└──────────┘
┌──count()─┐
│ 21061698 │
└──────────┘
┌──count()─┐
│ 24070512 │
└──────────┘
┌──count()─┐
│ 27079326 │
└──────────┘
┌──count()─┐
│ 30088140 │
└──────────┘
┌──count()─┐
│ 33162363 │
└──────────┘
┌──count()─┐
│ 36171177 │
└──────────┘
┌──count()─┐
│ 39179991 │
└──────────┘
┌──count()─┐
│ 42188805 │
└──────────┘
┌──count()─┐
│ 45197619 │
└──────────┘
┌──count()─┐
│ 48206433 │
└──────────┘
┌──count()─┐
│ 51280656 │
└──────────┘
┌──count()─┐
│ 54354879 │
└──────────┘
┌──count()─┐
│ 57298284 │
└──────────┘
┌──count()─┐
│ 60307098 │
└──────────┘
┌──count()─┐
│ 63315912 │
└──────────┘
┌──count()─┐
│ 66324726 │
└──────────┘
┌──count()─┐
│ 69333540 │
└──────────┘
┌──count()─┐
│ 72342354 │
└──────────┘
┌──count()─┐
│ 75351168 │
└──────────┘

25 rows in set. Elapsed: 50.767 sec. Processed 75.81 million rows, 12.74 GB (1.49 million rows/s., 250.99 MB/s.)

the second query has more result than the first one, I beleive the view did not cause new rows. so one of the count or both must be wrong.

also I tried to insert the query from the view into a real stream.

INSERT INTO nyse_trade_20231003 SELECT 
  *
FROM 
  v_nyse_trade_20231003

and then count what is in that stream, the count is also different than the previsous two

SELECT 
  count(*)
FROM 
  table(nyse_trade_20231003)

Query id: 7cb51827-b7f4-4348-826c-7c6bc9916331

┌──count()─┐
│ 75811661 │
└──────────┘

1 row in set. Elapsed: 0.128 sec. Processed 75.81 million rows, 75.81 MB (592.76 million rows/s., 592.76 MB/s.)

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions