trino count(*) less hive #18223
Comments
Please share the complete steps to reproduce, the directory and file structure, the config properties, and the Trino version.
The Hive version is 3.1.2. The files are merged like this: "cat *.gz > result.gz". When the number of merged files is relatively small, everything works, but when it is relatively large, Trino may read the merged file incompletely, while Hive and Spark read it correctly. I think Trino has a threshold. Hive table structure CREATE TABLE
with the native csv reader
with the native (
The issue is easily reproducible based on the scenario provided above.
trino/lib/trino-hive-formats/src/main/java/io/trino/hive/formats/line/text/TextLineReader.java Lines 246 to 249 in 8bb53d6
While debugging I've discovered that on the native text reader, after reading
This method is probably not the right one to use anyway, because we don't get notified whether we are at the end of the stream. I've also tried refactoring the code locally to use
The Javadoc for the method says
If it is prematurely returning zero bytes, then that seems to be a bug in the implementation of the method.
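To make the end-of-stream concern concrete, here is a minimal sketch of a read loop that treats only a return value of -1 from `InputStream.read(byte[], int, int)` as end of stream, and never relies on `available()` or on a short read. It is a generic JDK illustration, not the code from `TextLineReader` or the aircompressor fix; `ReadUntilEof` and `TrickleStream` are hypothetical names invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical test stream: hands out at most one byte per read call and
// always reports available() == 0 (the InputStream default).
final class TrickleStream extends InputStream {
    private final byte[] data;
    private int position;

    TrickleStream(byte[] data) {
        this.data = data;
    }

    @Override
    public int read() {
        return position < data.length ? (data[position++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        int next = read();
        if (next < 0) {
            return -1;
        }
        buffer[offset] = (byte) next;
        return 1;
    }
}

final class ReadUntilEof {
    // Fills the buffer as far as possible. Only -1 ends the loop early;
    // a short read just means "ask again".
    static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // true end of stream
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "0123456789".getBytes(StandardCharsets.UTF_8);
        byte[] buffer = new byte[16];
        int read = readFully(new TrickleStream(payload), buffer);
        System.out.println("bytes read: " + read);
    }
}
```

On Java 9 and later, `InputStream.readNBytes(byte[], int, int)` provides the same loop-until-full-or-EOF behavior out of the box.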
Thanks for the report and the example file. I was able to reproduce it locally. @findinpath was on the right track in noticing that @dain found JDK-8081450 which has been open for eight years. We should be able to work around this in
Fixed in airlift/aircompressor#171
Do you plan to release a new version of
I replaced aircompressor-0.24.jar, and the Trino result is now correct. Thanks very much.
@zhaoyankun Thanks for verifying the fix. This will be in the next Trino release (422).
Trino and Hive return different results.
The Hive table is stored as csv.gz files,
and the files are merged with cat, like this: "cat 20230710*.gz > aaa.gz".
I find that when more than 300 files are concatenated,
Trino does not read the merged file to the end, but Hive and Spark read it correctly.
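For readers unfamiliar with the file layout, the sketch below builds the same kind of input in memory: many independent gzip members concatenated byte for byte, which is what "cat *.gz > result.gz" produces. It then counts rows with the JDK's `java.util.zip.GZIPInputStream` over a `ByteArrayInputStream`. This uses only the JDK and is not Trino's reader or aircompressor; the class name `ConcatenatedGzipDemo` and the row contents are made up for illustration. It shows the row count a correct reader should return, not the failure itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ConcatenatedGzipDemo {
    // Compress one piece of text into its own gzip member,
    // as a separate .gz file would hold it.
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Byte-for-byte concatenation of single-row members,
    // equivalent to `cat *.gz > result.gz`.
    static byte[] concatenate(int members) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < members; i++) {
            out.write(gzipMember("row-" + i + "\n"));
        }
        return out.toByteArray();
    }

    // Decompress the whole stream and count newline-terminated rows.
    static int countRows(byte[] data) throws IOException {
        int rows = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') {
                        rows++;
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        byte[] merged = concatenate(500);
        System.out.println("rows: " + countRows(merged)); // prints "rows: 500"
    }
}
```

A reader that stops after an earlier member would report fewer rows than the number of members written, which matches the symptom of count(*) being lower in Trino than in Hive.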