Add bloom filter write support to ParquetWriter #20662

Merged
merged 5 commits into from
Apr 16, 2024

Conversation

jkylling
Contributor

@jkylling jkylling commented Feb 12, 2024

Description

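The bloom filters are added after all the row groups, right before the footer, similar to the first option described in the Parquet format documentation: https://github.com/apache/parquet-format/blob/master/BloomFilter.md#file-format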

We do not support writing bloom filters for types for which we do not have read support.

Additional context and related issues

Fixes #16536

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Add support for writing bloom filters in Parquet files to speed up equality and IN predicates on high-cardinality columns. ({issue}`20662`)

Member

@raunaqmorarka raunaqmorarka left a comment

A few things to work on from first look:

  • Check where in the parquet file Hive and Spark store the bloom filter (they use parquet-mr directly). We will want to make sure that we're not differing from that.
  • We need product tests verifying that files with bloom filters written by Trino can be read by Apache Hive and Spark.
  • We don't want this getting enabled by default. Users should opt into it on a per-table basis similar to the way bloom filter columns can be specified in ORC in Trino
  • BaseTestParquetWithBloomFilters should be updated to use Trino to write bloom filters

@github-actions github-actions bot added iceberg Iceberg connector hive Hive connector labels Feb 12, 2024
@jkylling jkylling force-pushed the parquet-bloom-writer branch 3 times, most recently from 8172007 to 11a130a Compare February 15, 2024 22:24
@jkylling
Contributor Author

There remains some work to decide on how to configure the bloom filters in Hive (and also Iceberg and Delta). We could do the following:
For the Hive table properties:

parquet.bloom.filter.enabled#<column-name>=true
parquet.bloom.filter.fpp#<column-name>=0.1 # double between 0 and 1
parquet.bloom.filter.expected.ndv#<column-name>=1 # integer between 1 and Long.MAX_VALUE

These are the same properties used by parquet-mr: https://github.com/apache/parquet-mr/blob/20d43639b5a380335953742ad6c9b3dd98e09f29/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L152-L155

If any of the Hive table properties have invalid values, we treat them as unspecified. parquet.bloom.filter.enabled#<column-name>=true must always be specified for the other table properties to take effect.
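
The rules proposed above (invalid values are treated as unspecified, and the `enabled` property gates the others) can be sketched roughly as follows. This is only an illustration of the proposal, not code from this PR: the class `BloomFilterProperties`, the record `ColumnConfig`, and the method names are hypothetical, and treating the FPP bounds as exclusive is an assumption.

```java
import java.util.Map;
import java.util.Optional;
import java.util.OptionalDouble;
import java.util.OptionalLong;

// Hypothetical helper illustrating the proposed rules; not part of Trino or parquet-mr.
final class BloomFilterProperties {
    static final String ENABLED = "parquet.bloom.filter.enabled#";
    static final String FPP = "parquet.bloom.filter.fpp#";
    static final String NDV = "parquet.bloom.filter.expected.ndv#";

    // Per-column settings; an empty Optional means "unspecified, use the writer default".
    record ColumnConfig(OptionalDouble fpp, OptionalLong ndv) {}

    // Returns a config only when the column is explicitly enabled with "true".
    static Optional<ColumnConfig> forColumn(Map<String, String> tableProperties, String column) {
        if (!"true".equalsIgnoreCase(tableProperties.get(ENABLED + column))) {
            return Optional.empty();
        }
        return Optional.of(new ColumnConfig(
                parseFpp(tableProperties.get(FPP + column)),
                parseNdv(tableProperties.get(NDV + column))));
    }

    // Assumption: the valid range is the open interval (0, 1); anything else is unspecified.
    static OptionalDouble parseFpp(String value) {
        if (value == null) {
            return OptionalDouble.empty();
        }
        try {
            double fpp = Double.parseDouble(value.trim());
            return (fpp > 0 && fpp < 1) ? OptionalDouble.of(fpp) : OptionalDouble.empty();
        }
        catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    // Valid range is 1 to Long.MAX_VALUE; anything else is unspecified.
    static OptionalLong parseNdv(String value) {
        if (value == null) {
            return OptionalLong.empty();
        }
        try {
            long ndv = Long.parseLong(value.trim());
            return ndv >= 1 ? OptionalLong.of(ndv) : OptionalLong.empty();
        }
        catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }
}
```

With this sketch, a table that sets only `parquet.bloom.filter.fpp#id=0.1` gets no bloom filter for `id`, because the `enabled` property is missing.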

For the Iceberg table properties we will likely want to use write.parquet.bloom-filter-enabled.column.<column-name>, and similar names for the other properties; see https://iceberg.apache.org/docs/latest/configuration/#write-properties

For the Trino table properties we define:

parquet_bloom_filter_enabled, map(varchar, boolean) = MAP(ARRAY['<column-name>'], ARRAY[<enabled>])
parquet_bloom_filter_fpp, map(varchar, double) = MAP(ARRAY['<column-name>'], ARRAY[<fpp>])
parquet_bloom_filter_ndv, map(varchar, bigint) = MAP(ARRAY['<column-name>'], ARRAY[<ndv>])
parquet_bloom_filters_enabled, boolean. It enables bloom filters for all columns that can support them. If parquet_bloom_filter_columns is also specified, the entries of parquet_bloom_filter_enabled take precedence.

@jkylling jkylling marked this pull request as ready for review February 15, 2024 22:41
@jkylling jkylling changed the title [DRAFT] Add bloom filter write support to ParquetWriter Add bloom filter write support to ParquetWriter Feb 15, 2024

github-actions bot commented Mar 8, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Mar 8, 2024
@github-actions github-actions bot removed the stale label Mar 12, 2024
@jkylling jkylling force-pushed the parquet-bloom-writer branch 2 times, most recently from eb309f0 to e4ecdd0 Compare March 13, 2024 18:18

github-actions bot commented Apr 4, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Apr 4, 2024
@github-actions github-actions bot removed the stale label Apr 5, 2024
@jkylling jkylling force-pushed the parquet-bloom-writer branch 2 times, most recently from 70504f9 to baed59e Compare April 7, 2024 12:18
@raunaqmorarka
Member

I've pushed some minor fixups; please apply them to the appropriate commits.

@jkylling jkylling force-pushed the parquet-bloom-writer branch 2 times, most recently from 59fcecd to ab34493 Compare April 12, 2024 07:03
Member

@raunaqmorarka raunaqmorarka left a comment

minor comments, lgtm

The bloom filters are added after all the row groups, right before the
footer, similar to the first option described
[here](https://github.com/apache/parquet-format/blob/master/BloomFilter.md#file-format).

We do not support writing bloom filters for types for which we do not
have read support.
Bloom filters must be explicitly enabled for each column of a table by
setting the Hive table property
`parquet.bloom.filter.columns=<column-name-1>,<column-name-2>`.

This Hive table property can be set with the Trino table property
`parquet_bloom_filter_columns = ARRAY['<column-name>']`.

We do not support configuring the NDV or FPP of the filter.
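
As a rough illustration of the merged design described in the commit message, a comma-separated `parquet.bloom.filter.columns` value could be turned into a set of column names as sketched below. The class `BloomFilterColumns` and its `parse` method are hypothetical and are not taken from the Trino code; trimming whitespace and skipping empty entries are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper; the real Trino implementation may differ.
final class BloomFilterColumns {
    static final String PROPERTY = "parquet.bloom.filter.columns";

    // Returns the columns that should get a bloom filter, in declaration order.
    // A missing or empty property yields an empty set, so nothing is enabled by default.
    static Set<String> parse(Map<String, String> tableProperties) {
        String value = tableProperties.getOrDefault(PROPERTY, "");
        return Arrays.stream(value.split(","))
                .map(String::trim)                  // assumption: surrounding whitespace is ignored
                .filter(name -> !name.isEmpty())    // assumption: empty entries are skipped
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```

Because the default is an empty set, this keeps the opt-in behavior requested in review: a column gets a bloom filter only when it is listed explicitly.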
@raunaqmorarka raunaqmorarka merged commit 5041496 into trinodb:master Apr 16, 2024
99 checks passed
@github-actions github-actions bot added this to the 445 milestone Apr 16, 2024
Successfully merging this pull request may close these issues.

Support writing bloom filters in optimized parquet writer
3 participants