
Optimized parquet reader does not perform overflow checks for decimals without rescaling #16355

Open
wypb opened this issue Mar 3, 2023 · 3 comments

wypb commented Mar 3, 2023

I have a Hive table with the following structure:

CREATE TABLE hive.default.parquet_type_test5 (
    int32_field integer,
    int64_field bigint,
    binary_decimal decimal(2, 1)
 )
 WITH (
    external_location = 'file:/tmp/parquet/ParquetBenchmarks/',
    format = 'PARQUET'
 )

The metadata of the parquet file is as follows:
[screenshot of the parquet file metadata]

When I turn on the batch reader of parquet (set session hive.parquet_optimized_reader_enabled=true), the execution results are as follows:

trino:default> set session hive.parquet_optimized_reader_enabled=true;
SET SESSION
trino:default> select binary_decimal from parquet_type_test5 where binary_decimal < 0 order by binary_decimal limit 10;
 binary_decimal
----------------
          -12.8
          -12.8
          -12.8
          -12.8
          -12.7
          -12.7
          -12.7
          -12.7
          -12.7
          -12.7
(10 rows)

Query 20230303_020242_00008_e9qxe, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
2.98 [1000 rows, 15.7KB] [335 rows/s, 5.28KB/s]
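
For context on where these out-of-range values come from: a Parquet FIXED_LEN_BYTE_ARRAY decimal stores a big-endian two's-complement unscaled integer that is divided by 10^scale, so a 1-byte array with scale 1 (which this file appears to use) covers -12.8 to 12.7, wider than decimal(2, 1) allows. A minimal standalone sketch of that decoding (the class and method names are illustrative only):

import java.math.BigDecimal;
import java.math.BigInteger;

public class DecodeDecimal
{
    // A FIXED_LEN_BYTE_ARRAY decimal holds a big-endian two's-complement
    // unscaled integer; dividing by 10^scale yields the decimal value.
    static BigDecimal decodeDecimal(byte[] bytes, int scale)
    {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args)
    {
        // With a 1-byte array and scale 1 the unscaled range is -128..127,
        // so decoded values span -12.8..12.7 -- wider than decimal(2, 1) permits.
        System.out.println(decodeDecimal(new byte[] {(byte) 0x80}, 1)); // -12.8
        System.out.println(decodeDecimal(new byte[] {0x7F}, 1));        // 12.7
    }
}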

If I turn off the batch reader of parquet, an exception occurs instead:

trino:default> set session hive.parquet_optimized_reader_enabled=false;
SET SESSION
trino:default> select binary_decimal from parquet_type_test5 where binary_decimal < 0 order by binary_decimal limit 10;
Query 20230303_020345_00010_e9qxe failed: Cannot cast DECIMAL(2, 1) '-11.1' to DECIMAL(2, 1)

It seems that the batch reader of parquet does not check the validity of the data.

If necessary, the following is the download address of the test file:
https://trinodb.slack.com/files/U049BH76GP5/F04S3BNRRGV/parquet-1m

@raunaqmorarka raunaqmorarka self-assigned this Mar 3, 2023
@raunaqmorarka raunaqmorarka changed the title The batch reader of parquet does not detect the validity of the data. The batch reader of parquet does not apply overflow checks to decimals Mar 3, 2023
@raunaqmorarka raunaqmorarka changed the title The batch reader of parquet does not apply overflow checks to decimals optimized parquet reader does not apply overflow checks to decimals Mar 3, 2023
@raunaqmorarka raunaqmorarka changed the title optimized parquet reader does not apply overflow checks to decimals Optimized parquet reader does not perform overflow checks for decimals without rescaling Mar 3, 2023
raunaqmorarka (Member) commented

@397090770 could you share more details about what tool you're using to generate this file?
The new optimized reader does overflow checks for decimals only when the precision and scale from the decimal logical type annotation in the file do not match the precision and scale of the Trino decimal data type.
When the type annotation and Trino type match, the overflow check is skipped on the assumption that the writer has followed the parquet standard and populated the correct precision in the decimal logical type annotation for the data.
The standard https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal says for fixed_len_byte_array: precision is limited by the array size; length n can store at most floor(log_10(2^(8*n - 1) - 1)) base-10 digits. For a fixed_len_byte_array of size 1 this allows 2 base-10 digits. So logically the writer should either specify a higher precision in the logical annotation together with a longer type length, or write only 2-digit values.
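
For reference, the spec's bound can be computed directly (a standalone sketch, not code taken from the reader):

// Maximum base-10 precision that fits in a FIXED_LEN_BYTE_ARRAY of n bytes,
// per the spec: floor(log10(2^(8*n - 1) - 1)).
static int maxPrecision(int numBytes)
{
    return (int) Math.floor(Math.log10(Math.pow(2, (8L * numBytes) - 1) - 1));
}

// maxPrecision(1) == 2, so a 1-byte array can legally hold only 2-digit values;
// maxPrecision(2) == 4; maxPrecision(16) == 38 (the maximum Trino decimal precision).
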
We could add extra checks to verify this and throw exceptions like the old parquet reader. However, it would add a small perf penalty and it's not clear to me if we should make that extra effort for a non-standard way of generating parquet files.
@skrzypo987 @martint @sopel39 thoughts ?

raunaqmorarka (Member) commented

Just as a point of comparison, I ran this same file through Apache Hive and Apache Spark.
Hive populates a NULL for all the overflowed values and the actual value otherwise.
Spark has the same behaviour as Trino's optimized parquet reader.

wypb (Author) commented Mar 6, 2023

Hi @raunaqmorarka, the data is generated by my own program; there is no limit on the range of the generated values.

protected List<Object> generateValues()
{
    List<Object> values = new ArrayList<>();

    for (int i = 0; i < ROWS; ++i) {
        values.add(Binary.fromConstantByteArray(longToBytes(random.nextLong(), byteArrayLength)));
    }
    return values;
}

The call to random.nextLong() here should limit the range of the data.
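
A possible fix, as a rough sketch (it reuses the ROWS, random, longToBytes and byteArrayLength names from the snippet above, and the bound of 99 assumes the target column is decimal(2, 1)):

protected List<Object> generateValues()
{
    List<Object> values = new ArrayList<>();
    // Keep the unscaled value within the declared precision: decimal(2, 1)
    // allows at most 2 digits, i.e. unscaled values in [-99, 99].
    long maxUnscaled = 99;
    for (int i = 0; i < ROWS; ++i) {
        long bounded = random.nextLong() % (maxUnscaled + 1);
        values.add(Binary.fromConstantByteArray(longToBytes(bounded, byteArrayLength)));
    }
    return values;
}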
