trino count(*) less hive #18223

Closed · Fixed by #18281

zhaoyankun opened this issue Jul 11, 2023 · 10 comments

@zhaoyankun commented Jul 11, 2023

Trino and Hive return different results. The Hive table is stored as csv.gz files that were merged with cat, like this: "cat 20230710*.gz > aaa.gz".
I find that when more than about 300 files are concatenated, Trino cannot read all the way through, but Hive and Spark are normal.

@ebyhr (Member) commented Jul 11, 2023

Please share the complete steps to reproduce, including the directory and file structure, the config properties, and the Trino version.

@zhaoyankun (Author)

Hive version is 3.1.2; Trino versions 411 and 420.

The files were merged like this: "cat *.gz > result.gz"

When the number of merged files is relatively small, everything is normal, but when it is relatively large, Trino may read the data incompletely, while Hive and Spark are normal. I think Trino has a threshold.

Hive table structure:

CREATE TABLE intf_test(
  length string,
  local_pce_txt string,
  local_cy_txt string,
  owner_proe_txt string,
  owner_cy_txt string,
  ring_type string)
PARTITIONED BY (
  date_cd string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='|',
  'serialization.format'='|')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Attachments: source.CSV.gz, result.CSV.gz

@findinpath (Contributor) commented Jul 12, 2023

➜  ~ zcat ~/Downloads/result.CSV.gz | wc -l          
4712000
0: jdbc:hive2://localhost:10000/default> select count(*) from tiny.intf_test;

INFO  : OK
+----------+
|   _c0    |
+----------+
| 4712000  |
+----------+

With the native CSV reader:

trino:default> select count(*) from hive.tiny.intf_test;
 _col0 
-------
  8512 
(1 row)

With the native CSV reader disabled (csv.native-reader.enabled=false and text-file.native-reader.enabled=false in hive.properties):

trino:default> select count(*) from hive.tiny.intf_test;
  _col0  
---------
 4712000 
(1 row)

The issue is easily reproducible based on the scenario provided above.

@findinpath (Contributor) commented Jul 12, 2023

try {
    // fill as much of the buffer as possible
    bufferEnd = in.readNBytes(buffer, 0, buffer.length);
}

While debugging the native text reader, I discovered that after reading 36848 bytes from the input stream (several rounds of 1024-byte reads), the method responsible for filling the buffer with new bytes returns 0, as if we were at the end of the split (which is definitely not the case).

bufferEnd = in.readNBytes(buffer, 0, buffer.length);

This is probably not the right method to use anyway, because we don't get notified whether we are at the end of the stream. Using java.io.InputStream#read(byte[], int, int) would probably be more appropriate.

I also tried refactoring the code locally to use java.io.InputStream#read(byte[], int, int), and surprisingly the stream was still being closed prematurely. I'm not sure why this is happening; it may be linked to the specific gzip file used for testing.
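For reference, here is a sketch of the kind of loop such a refactor could use (illustrative only, not the actual Trino code; it reuses the in/buffer/bufferEnd names from the snippet above). A plain read() loop surfaces end of stream explicitly as a -1 return on each call:

// Hypothetical buffer-filling loop using read() instead of readNBytes()
int bufferEnd = 0;
while (bufferEnd < buffer.length) {
    int n = in.read(buffer, bufferEnd, buffer.length - bufferEnd);
    if (n == -1) {
        break; // unambiguous end of stream
    }
    bufferEnd += n;
}

As noted above, even this did not help, which suggested the decompressing stream itself was signaling end of stream too early.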

@electrum (Member)

The Javadoc for the method says

Reads the requested number of bytes from the input stream into the given byte array. This method blocks until len bytes of input data have been read, end of stream is detected, or an exception is thrown. The number of bytes actually read, possibly zero, is returned.

In the case where end of stream is reached before len bytes have been read, then the actual number of bytes read will be returned.

If it is prematurely returning zero bytes, then that seems to be a bug in the implementation of the method.

@electrum (Member)

Thanks for the report and the example file. I was able to reproduce it locally.

@findinpath was on the right track in noticing that readNBytes() was returning prematurely. I debugged it further and found that GZIPInputStream assumes that InputStream.available() == 0 means end of stream, which is clearly incorrect. It's surprising that such a basic bug could exist in the JDK for so long; this is probably because concatenated GZIP files are uncommon.

@dain found JDK-8081450 which has been open for eight years.
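To illustrate, here is a minimal self-contained sketch (not Trino code) of the JDK behavior. The anonymous stream simulates a source that legally reports available() == 0 and delivers data in small chunks, as a remote or network-backed stream is free to do:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipAvailableBug {
    public static void main(String[] args) throws IOException {
        // Build two concatenated gzip members, like "cat a.gz b.gz > c.gz"
        ByteArrayOutputStream concatenated = new ByteArrayOutputStream();
        for (String member : new String[] {"hello\n", "world\n"}) {
            try (GZIPOutputStream gzip = new GZIPOutputStream(concatenated)) {
                gzip.write(member.getBytes());
            }
        }

        // A stream that legally reports available() == 0 and returns
        // one byte per read, simulating a remote source
        InputStream source = new ByteArrayInputStream(concatenated.toByteArray()) {
            @Override
            public synchronized int available() {
                return 0;
            }

            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };

        // After the first member's trailer, GZIPInputStream sees
        // available() == 0, assumes end of stream, and silently
        // discards the second member (JDK-8081450)
        System.out.print(new String(new GZIPInputStream(source).readAllBytes()));
    }
}

On affected JDKs this prints only "hello", silently dropping the second member, which is exactly the truncation seen in the count(*) results above.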

We should be able to work around this in JdkGzipHadoopInputStream by wrapping the input with a custom version of BufferedInputStream that only returns available() == 0 when at the end of the stream.
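For illustration, a sketch of that idea (a hypothetical class, not the actual aircompressor fix referenced below). It probes with a one-byte blocking read and pushes the byte back, so available() reports 0 only at true end of stream; the blocking is tolerable here because GZIPInputStream consults available() only at gzip member boundaries:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Illustrative wrapper: available() returns 0 only at true end of stream
class EosAwareInputStream extends FilterInputStream {
    EosAwareInputStream(InputStream delegate) {
        super(new PushbackInputStream(delegate, 1));
    }

    @Override
    public int available() throws IOException {
        int available = super.available();
        if (available > 0) {
            return available;
        }
        // available() == 0 is ambiguous: probe with a blocking
        // single-byte read and push the byte back if not at EOS
        PushbackInputStream in = (PushbackInputStream) this.in;
        int b = in.read();
        if (b == -1) {
            return 0; // genuine end of stream
        }
        in.unread(b);
        return 1;
    }
}

Wrapping the source from the previous sketch as new GZIPInputStream(new EosAwareInputStream(source)) makes both concatenated members decode.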

@electrum (Member)

Fixed in airlift/aircompressor#171

@findinpath (Contributor)

Do you plan to release a new version of aircompressor soon?

@zhaoyankun (Author)

I replaced aircompressor-0.24.jar and the Trino result is now correct. Thanks very much.

@electrum (Member)

@zhaoyankun Thanks for verifying the fix. This will be in the next Trino release (422).
