Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctness issues in S3 Select Pushdown #17775

Closed
alexjo2144 opened this issue Jun 6, 2023 · 7 comments
Closed

Correctness issues in S3 Select Pushdown #17775

alexjo2144 opened this issue Jun 6, 2023 · 7 comments

Comments

@alexjo2144
Copy link
Member

alexjo2144 commented Jun 6, 2023

An initial set of corrections is here: #17563

  • This fixed issues around null comparison and added tests

While testing that PR there were a few other issues I ran into:

  • Comparison of decimal types produces incorrect results on AWS. (works on Minio)
  • TEXTFILE tables (using CSV in S3 Select) fail to parse or produce incorrect results if these three properties are not explicitly set
    textfile_field_separator, textfile_field_separator_escape, null_format
    • The reason for this is that AWS uses default values which assume CSV encoding, but they are not the same defaults as LazySimpleSerde which is what it's mapped to in Trino.
  • The S3SelectCsvRecordReader checks for a quote character in the Hive schema, but that property can't be set when creating tables in Trino using the TEXTFILE format
@alexjo2144
Copy link
Member Author

Bullets 2 and 3 can be fixed within the Trino code, but the first one looks like a bug on the AWS side of things. @pettyjamesm do you have much background with s3 select? The test I wrote that hit this is here: https://github.com/trinodb/trino/pull/17563/files#diff-60fcb98253871d1bddcafc2121e18998f5f0b98a5f127731450f98eb446d928fR133

@pettyjamesm
Copy link
Member

Not specifically, let's see if @dnanuti can help out with that

@dnanuti
Copy link
Member

dnanuti commented Jun 7, 2023

Hello! Happy to help.
The default field delimiters of LazySimpleSerde are incompatible with S3Select. This was the reasoning on our side when we re-enabled Select pushdown: #12633
In order to isolate the issue, would you please provide:

  • the query sent by Trino to S3Select
  • a sample .csv file
  • the expected result
    Thank you!

@alexjo2144
Copy link
Member Author

alexjo2144 commented Jun 8, 2023

Thanks Diana (replied with the decimal comparison example in a different thread)

For anyone looking at potentially picking this up, these are some additional tests that don't pass. I believe all the issues here (besides decimal comparison) are on the Trino side, not AWS:

    @Test(dataProvider = "s3SelectFileFormats")
    public void testS3SelectPushdownWithSpecialCharacters(String tableProperties)
    {
        Session usingAppendInserts = Session.builder(getSession())
                .setCatalogSessionProperty("hive", "insert_existing_partitions_behavior", "APPEND")
                .build();
        List<String> values = ImmutableList.of(
                "1, 'a,comma'",
                "2, 'a|pipe'",
                "3, 'an''escaped quote'",
                "4, 'a~null encoding'");
        try (TestTable table = new TestTable(
                sql -> getQueryRunner().execute(usingAppendInserts, sql),
                "hive.%s.test_s3_select_pushdown".formatted(HIVE_TEST_SCHEMA),
                "(id INT, string_t VARCHAR) WITH (" + tableProperties + ")", values)) {
            assertS3SelectQuery("SELECT id FROM " + table.getName() + " WHERE string_t ='a,comma'", "VALUES 1", false);
            assertS3SelectQuery("SELECT id FROM " + table.getName() + " WHERE string_t ='a|pipe'", "VALUES 2", false);
            assertS3SelectQuery("SELECT id FROM " + table.getName() + " WHERE string_t ='an''escaped quote'", "VALUES 3", false);
            assertS3SelectQuery("SELECT id FROM " + table.getName() + " WHERE string_t ='a~null encoding'", "VALUES 4", false);
        }
    }

Additionally if you add

{"format = 'TEXTFILE'"},

to the test data provider here

@electrum
Copy link
Member

electrum commented Jul 6, 2023

@pettyjamesm @dnanuti can you reply to the above? @alexjo2144 has #18102 to disable S3 Select CSV support until this can be fixed (or worked around on our side)

@pettyjamesm
Copy link
Member

pettyjamesm commented Jul 7, 2023

The approach in #18102 seems reasonable- S3 select currently ignores scale and precision on decimal types if both are provided (docs: link) so removing support for push down on that type makes sense so long as that limitation stands.

The other issues do sound like they're fixable from the Trino side (something about a mismatch between LazySimpleSerDe and S3 Select defaults for escape characters?), but putting the feature behind a flag would be an OK mitigation in the mean time if we're not confident about the root cause or not sure how to fix the other (non-decimal) issues.

@alexjo2144
Copy link
Member Author

S3 select was removed in #18241

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants