Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow caching stats in JDBC but not metadata #19859

Merged
merged 8 commits into from Nov 23, 2023

Conversation

findepi
Copy link
Member

@findepi findepi commented Nov 22, 2023

Apply CachingHiveMetastore capabilities in JDBC. Allow turning on
statistics caching without having to use caching for all of the
metadata. In common case caching statistics has much fewer side effects
than caching other stuff.

@cla-bot cla-bot bot added the cla-signed label Nov 22, 2023
@@ -44,6 +45,7 @@ public class BaseJdbcConfig
private Duration metadataCacheTtl = new Duration(0, SECONDS);
private Optional<Duration> schemaNamesCacheTtl = Optional.empty();
private Optional<Duration> tableNamesCacheTtl = Optional.empty();
private Optional<Duration> statisticsCacheTtl = Optional.empty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not enable this by default with some value like 5m ? We enabled it by default in hive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backwards compatibility. enabling stats cache may harm some workloads. like staging tables that are sometimes empty and sometimes full of data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's going to be better for the vast majority of workloads, then I would still enable it despite the harm to some less common cases. This can be explicitly disabled in the catalogs where it turns out to be harmful.
Alternatively, can we enable it for connectors where staging tables are rarely or not used at all ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the usage patterns, so I didn't want to enable it by default, at least just yet.
Do you see a problem with this PR if this is not enabled by default from the start?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a hard blocker for the PR, but the whole point of this change in hive connector was to allow us to enable stats metadata caching by default rather than just allowing for greater manual tweaking of cache ttls. If we don't have a path way to enabling this by default ever in JDBC connectors, then we're missing out on the main benefit of this change.
Unless you have a different plan to find out the usage patterns, the only way I see of discovering the problems with enabling this by default is to just enable it and see what kind of problems are reported due to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against enabling it by default and I think we agree that this is not required for this PR. In other words, can be enabled as a follow-up.

@@ -129,7 +131,7 @@ public CachingJdbcClient(
tableHandlesByQueryCache = buildCache(cacheMaximumSize, metadataCachingTtl);
procedureHandlesByQueryCache = buildCache(cacheMaximumSize, metadataCachingTtl);
columnsCache = buildCache(cacheMaximumSize, metadataCachingTtl);
statisticsCache = buildCache(cacheMaximumSize, metadataCachingTtl);
statisticsCache = buildCache(cacheMaximumSize, statisticsCachingTtl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If metadataCachingTtl is set higher than statisticsCachingTtl, then that should be used as the value for statisticsCachingTtl (we did that in hive metadata caching to avoid inadvertently using smaller ttl for stats cache than overall metadata cache).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want such logic, this should rather be a validation, not silently ignored config

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to ever provide some default value for statisticsCachingTtl, then there's a chance that this default could be lower than the existing value of metadataCachingTtl setup by the user. In this case, using smaller value of statisticsCachingTtl would potentially regress perf and applying validation would force user to read docs and update the config values. Applying validation strictly is a correct but disruptive choice, we chose to do the right thing silently in the case of hive.
cc: @sopel39

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In hive we just use

        Duration metastoreCacheTtl = config.getMetastoreCacheTtl();
        Duration statsCacheTtl = config.getStatsCacheTtl();
        if (metastoreCacheTtl.compareTo(statsCacheTtl) > 0) {
            statsCacheTtl = metastoreCacheTtl;
        }

it mitigates all bad configs and I think it's OK (less troubles for us)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- static fields before instance fields
- private where appropriate
- remove special values
- fix code indentation
- use explicit `Duration` constructor instead of approximating
  `Duration.succinctDuration`
Inline in usages to simplify CachingJdbcClient construction.
Previously the test used cacheMissing=true by default, despite this
being false by default in the config.
Using builder avoids maintaining overloads of various "create some
CachingJdbcClient" methods.

Using config avoids passing 3 Duration parameters positionally to the
constructor. The config is equivalent of named parameters.

This also moves helper methods below tests.

This is code cleanup and preparatory change before adding yet another
Duration parameter.
Many TestCachingJdbcClient's tests construct their own instance, so
providing a default instance in the `@Before` creates duplicated state
and confusion.
This logically reverts commit b25c90b.
The class uses `EvictableCacheBuilder` and the
`io.trino.cache.EvictableCacheBuilder#cacheDisabled` considers the cache
to be disabled when TTL is 0 or cache size is 0, so no need to set cache
size to 0 if TTL is known to be 0.
Apply `CachingHiveMetastore` capabilities in JDBC. Allow turning on
statistics caching without having to use caching for all of the
metadata. In common case caching statistics has much fewer side effects
than caching other stuff.
The test used a masked Thread.sleep to await for cache expiration, and
thus lasted at least 6 seconds. Now the whole test class runs about 1
second.
Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I don't have strong opinions about last commit.

@findepi
Copy link
Member Author

findepi commented Nov 23, 2023

I don't have strong opinions about last commit.

I run TestCachingJdbcClient many times yesterday, so I feel very strongly about the last commit :)

@findepi findepi merged commit 8137b67 into trinodb:master Nov 23, 2023
66 checks passed
@findepi findepi deleted the findepi/jdbc-cache-stats branch November 23, 2023 08:54
@github-actions github-actions bot added this to the 434 milestone Nov 23, 2023
@mosabua
Copy link
Member

mosabua commented Nov 23, 2023

I think only creates the potential for user facing changes and therefore no release notes are needed. Can you confirm @findepi ?

@findepi
Copy link
Member Author

findepi commented Nov 23, 2023

Yes, there is a new configuration toggle added.

@mosabua
Copy link
Member

mosabua commented Nov 23, 2023

Yes, there is a new configuration toggle added.

Uff .. I overlooked that. I sent #19881 to fix this.

Also we should add a release notes entry then .. right @findepi and @raunaqmorarka - any suggestions?

@mosabua
Copy link
Member

mosabua commented Nov 27, 2023

I will add a release notes entry for all affected connectors.. done now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

5 participants