trino count(*) less hive #18223

Closed · Fixed by #18281

zhaoyankun opened this issue Jul 11, 2023 · 10 comments

@zhaoyankun commented Jul 11, 2023

Trino and Hive return different results. The Hive table is stored as csv.gz files that were merged with cat, like this: "cat 20230710*.gz > aaa.gz".
I find that when more than about 300 files are concatenated, Trino cannot read all the way through, but Hive and Spark are normal.

@ebyhr (Member) commented Jul 11, 2023

Please share the complete steps to reproduce, including the directory and file structure, the config properties, and the Trino version.

@zhaoyankun (Author)

Hive version is 3.1.2; Trino versions 411 and 420.

The files were merged like this: "cat *.gz > result.gz"

When the number of merged files is relatively small, everything is normal, but when it is relatively large, Trino may read the data incompletely, while Hive and Spark are normal. I think Trino has a threshold.

Hive table structure:

CREATE TABLE intf_test(
  length string,
  local_pce_txt string,
  local_cy_txt string,
  owner_proe_txt string,
  owner_cy_txt string,
  ring_type string)
PARTITIONED BY (
  date_cd string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='|',
  'serialization.format'='|')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Attachments: source.CSV.gz, result.CSV.gz

@findinpath (Contributor) commented Jul 12, 2023

➜  ~ zcat ~/Downloads/result.CSV.gz | wc -l          
4712000
0: jdbc:hive2://localhost:10000/default> select count(*) from tiny.intf_test;

INFO  : OK
+----------+
|   _c0    |
+----------+
| 4712000  |
+----------+

With the native CSV reader:

trino:default> select count(*) from hive.tiny.intf_test;
 _col0 
-------
  8512 
(1 row)

With the native CSV reader disabled (csv.native-reader.enabled=false and text-file.native-reader.enabled=false in hive.properties):

trino:default> select count(*) from hive.tiny.intf_test;
  _col0  
---------
 4712000 
(1 row)

The issue is easily reproducible based on the scenario provided above.

@findinpath (Contributor) commented Jul 12, 2023

try {
    // fill as much of the buffer as possible
    bufferEnd = in.readNBytes(buffer, 0, buffer.length);
}

While debugging the native text reader, I discovered that after reading 36848 bytes from the input stream (several rounds of 1024-byte reads), the method responsible for filling the buffer with new bytes returns 0, as if we were at the end of the split (which is definitely not the case).

bufferEnd = in.readNBytes(buffer, 0, buffer.length);

This is probably not the right method to use anyway, because we don't get notified whether we are at the end of the stream. Using java.io.InputStream#read(byte[], int, int) would probably be more appropriate.

I also tried refactoring the code locally to use java.io.InputStream#read(byte[], int, int), and surprisingly the stream was still being closed prematurely. I'm not sure why this is happening; it may be linked to the specific gzip file used for testing.
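For reference, here is a sketch of the kind of loop such a refactor could use (illustrative only, not the actual Trino code; it reuses the in/buffer/bufferEnd names from the snippet above). A plain read() loop surfaces end of stream explicitly as a -1 return on each call:

// Hypothetical buffer-filling loop using read() instead of readNBytes()
int bufferEnd = 0;
while (bufferEnd < buffer.length) {
    int n = in.read(buffer, bufferEnd, buffer.length - bufferEnd);
    if (n == -1) {
        break; // unambiguous end of stream
    }
    bufferEnd += n;
}

As noted above, even this did not help, which suggested the decompressing stream itself was signaling end of stream too early.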

@electrum (Member)

The Javadoc for the method says

Reads the requested number of bytes from the input stream into the given byte array. This method blocks until len bytes of input data have been read, end of stream is detected, or an exception is thrown. The number of bytes actually read, possibly zero, is returned.

In the case where end of stream is reached before len bytes have been read, then the actual number of bytes read will be returned.

If it is prematurely returning zero bytes, then that seems to be a bug in the implementation of the method.

@electrum (Member)

Thanks for the report and the example file. I was able to reproduce it locally.

@findinpath was on the right track in noticing that readNBytes() was returning prematurely. I debugged it further and found that GZIPInputStream assumes that InputStream.available() == 0 means end of stream, which is clearly incorrect. It's surprising that such a basic bug could exist in the JDK for so long; this is probably because concatenated GZIP files are uncommon.

@dain found JDK-8081450 which has been open for eight years.
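To illustrate, here is a minimal self-contained sketch (not Trino code) of the JDK behavior. The anonymous stream simulates a source that legally reports available() == 0 and delivers data in small chunks, as a remote or network-backed stream is free to do:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipAvailableBug {
    public static void main(String[] args) throws IOException {
        // Build two concatenated gzip members, like "cat a.gz b.gz > c.gz"
        ByteArrayOutputStream concatenated = new ByteArrayOutputStream();
        for (String member : new String[] {"hello\n", "world\n"}) {
            try (GZIPOutputStream gzip = new GZIPOutputStream(concatenated)) {
                gzip.write(member.getBytes());
            }
        }

        // A stream that legally reports available() == 0 and returns
        // one byte per read, simulating a remote source
        InputStream source = new ByteArrayInputStream(concatenated.toByteArray()) {
            @Override
            public synchronized int available() {
                return 0;
            }

            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };

        // After the first member's trailer, GZIPInputStream sees
        // available() == 0, assumes end of stream, and silently
        // discards the second member (JDK-8081450)
        System.out.print(new String(new GZIPInputStream(source).readAllBytes()));
    }
}

On affected JDKs this prints only "hello", silently dropping the second member, which is exactly the truncation seen in the count(*) results above.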

We should be able to work around this in JdkGzipHadoopInputStream by wrapping the input with a custom version of BufferedInputStream that only returns available() == 0 when at the end of the stream.
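For illustration, a sketch of that idea (a hypothetical class, not the actual aircompressor fix referenced below). It probes with a one-byte blocking read and pushes the byte back, so available() reports 0 only at true end of stream; the blocking is tolerable here because GZIPInputStream consults available() only at gzip member boundaries:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Illustrative wrapper: available() returns 0 only at true end of stream
class EosAwareInputStream extends FilterInputStream {
    EosAwareInputStream(InputStream delegate) {
        super(new PushbackInputStream(delegate, 1));
    }

    @Override
    public int available() throws IOException {
        int available = super.available();
        if (available > 0) {
            return available;
        }
        // available() == 0 is ambiguous: probe with a blocking
        // single-byte read and push the byte back if not at EOS
        PushbackInputStream in = (PushbackInputStream) this.in;
        int b = in.read();
        if (b == -1) {
            return 0; // genuine end of stream
        }
        in.unread(b);
        return 1;
    }
}

Wrapping the source from the previous sketch as new GZIPInputStream(new EosAwareInputStream(source)) makes both concatenated members decode.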

@electrum (Member)

Fixed in airlift/aircompressor#171

@findinpath (Contributor)

Do you plan to release a new version of aircompressor soon?

@zhaoyankun (Author)

I replaced aircompressor-0.24.jar and the Trino result is now correct. Thanks very much.

@electrum (Member)

@zhaoyankun Thanks for verifying the fix. This will be in the next Trino release (422).
