Avoid caching presto worker nodes #465 (Merged)

3 commits merged into qubole:master on Nov 11, 2020

Conversation

JamesRTaylor (Contributor):

No description provided.

@JamesRTaylor changed the title from "Avoid caching presto worker worker nodes #462" to "Avoid caching presto worker worker nodes" on Oct 6, 2020
@JamesRTaylor (Contributor, Author):

@stagraqubole - not sure about the integration test failure.

@@ -48,170 +28,51 @@
 /**
  * Created by stagra on 14/1/16.
  */
-public class PrestoClusterManager extends ClusterManager
+public class PrestoClusterManager extends SyncClusterManager
Contributor:
We need the other form too because there is a standalone mode where Presto's NodeManager is not available. We should create a new one, say SyncPrestoClusterManager, as a copy of what you have, and keep PrestoClusterManager based on AsyncClusterManager.

JamesRTaylor (Contributor, Author), Oct 7, 2020:

I created an implementation of Presto's NodeManager that can be used for standalone mode. We can move that out of the test side of things if need be and even provide a constructor for PrestoClusterManager that passes the NodeManager in. This provides a much cleaner separation of concerns IMHO.
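For illustration, a minimal sketch of what such an in-memory NodeManager could look like. This is not the PR's TestingNodeManager; the class name InMemoryNodeManager is made up, and it assumes Presto's connector SPI NodeManager exposes getAllNodes/getWorkerNodes/getCurrentNode/getEnvironment:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import com.facebook.presto.spi.Node;
import com.facebook.presto.spi.NodeManager;
import com.google.common.collect.ImmutableSet;

// Hypothetical in-memory NodeManager for standalone/testing use: the worker set
// lives in memory and the current node is supplied by the caller.
public class InMemoryNodeManager implements NodeManager
{
    private final Node currentNode;
    private final Set<Node> workers = ConcurrentHashMap.newKeySet();

    public InMemoryNodeManager(Node currentNode)
    {
        this.currentNode = currentNode;
        workers.add(currentNode);
    }

    // Registers an additional worker; in a real deployment the engine does this.
    public void addWorker(Node worker)
    {
        workers.add(worker);
    }

    @Override
    public Set<Node> getAllNodes()
    {
        return ImmutableSet.copyOf(workers);
    }

    @Override
    public Set<Node> getWorkerNodes()
    {
        return ImmutableSet.copyOf(workers);
    }

    @Override
    public Node getCurrentNode()
    {
        return currentNode;
    }

    @Override
    public String getEnvironment()
    {
        return "standalone";
    }
}
```

Passing a NodeManager like this through a constructor is what keeps PrestoClusterManager agnostic about whether it runs embedded in Presto or standalone.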

private final Node currentNode;
private final int serverPort;

public TestingNodeManager(Configuration conf, Node currentNode)
Contributor:
I guess you won't need this once you have Sync and Async implementations of PrestoClusterManager.


private Set<String> currentNodes;

protected abstract boolean hasStateChanged();
Contributor:
The current model will spike CPU usage a lot. The ClusterManager methods are called very frequently, e.g. getCurrentNodeName will be called for every MB of data being read via the worker node. That translates to 1k calls per GB to hasStateChanged, which does comparisons internally.

I think the optimal way would be to work with the Presto team to add logic in Presto's NodeManager that returns info about state changes, e.g. it could store a version which increments every time there is a change in membership. Based on that version, we could compute the hasStateChanged value here.
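To make the suggestion concrete, a rough sketch of the version-based idea. It is entirely hypothetical: no getMembershipVersion method exists in Presto's NodeManager today, and all names below are made up.

```java
// Hypothetical shape of the proposal: the engine would expose a monotonically
// increasing membership version, and hasStateChanged would reduce to comparing
// two longs instead of comparing node sets.
interface VersionedNodeManager
{
    long getMembershipVersion();    // incremented by the engine on any join/leave
}

abstract class VersionedClusterManager
{
    private final VersionedNodeManager nodeManager;
    private long lastSeenVersion = -1;

    VersionedClusterManager(VersionedNodeManager nodeManager)
    {
        this.nodeManager = nodeManager;
    }

    // Cheap check: membership-dependent state is recomputed only when the
    // version advances.
    protected synchronized boolean hasStateChanged()
    {
        long current = nodeManager.getMembershipVersion();
        if (current != lastSeenVersion) {
            lastSeenVersion = current;
            return true;
        }
        return false;
    }
}
```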

JamesRTaylor (Contributor, Author), Oct 7, 2020:

I've confirmed that the hasStateChanged() method returns on the == check at the start of the equals call between the existing and new worker node sets. There's no extra overhead for this; the only time extra work is done is when the worker nodes change.
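In other words, the pattern under discussion boils down to something like the sketch below. This is illustrative rather than the PR's actual code; the class and field names are assumptions.

```java
import java.util.Set;

import com.facebook.presto.spi.Node;
import com.facebook.presto.spi.NodeManager;

// Illustrative only: the previous node set is kept and compared against what the
// NodeManager currently reports. When the NodeManager keeps returning the same Set
// instance, equals() exits on its initial reference check and no per-element
// comparison happens on the hot path.
public abstract class WorkerSetTracker
{
    private final NodeManager nodeManager;
    private Set<Node> previousWorkers;

    protected WorkerSetTracker(NodeManager nodeManager)
    {
        this.nodeManager = nodeManager;
    }

    // Returns true only when worker membership differs from the last observed set.
    protected synchronized boolean hasStateChanged()
    {
        Set<Node> latest = nodeManager.getWorkerNodes();
        boolean changed = !latest.equals(previousWorkers);
        if (changed) {
            previousWorkers = latest;   // recompute membership-dependent state only here
        }
        return changed;
    }
}
```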

Contributor:

Wouldn't the check be on ImmutableSet.equals, which is not a simple == check?

JamesRTaylor (Contributor, Author):

The == check is the first line of the ImmutableSet.equals implementation.
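A tiny, self-contained illustration of that point (it assumes Guava on the classpath): comparing an unchanged set instance against itself is satisfied by the initial reference check, while a rebuilt set with the same members requires element comparison.

```java
import com.google.common.collect.ImmutableSet;

// Small demo of the two comparison paths discussed above.
public final class EqualsShortCircuitDemo
{
    public static void main(String[] args)
    {
        ImmutableSet<String> workers = ImmutableSet.of("worker-1", "worker-2", "worker-3");
        ImmutableSet<String> sameInstance = workers;
        ImmutableSet<String> rebuilt = ImmutableSet.of("worker-1", "worker-2", "worker-3");

        System.out.println(workers.equals(sameInstance)); // true via the initial reference check
        System.out.println(workers.equals(rebuilt));       // true, but via element comparison
    }
}
```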

Contributor:

Ok, so you confirmed that over a period of time, if the cluster state does not change, then Presto's NodeManager returns the same object? That is a dependency on undocumented behaviour of NodeManager; we will need a way/test to catch it if this behaviour changes in the future, otherwise slowness will creep in at runtime.

JamesRTaylor (Contributor, Author), Oct 8, 2020:

> Ok, so you confirmed that over a period of time if cluster state does not change then Presto's NodeManager returns same object?

Yes, I confirmed this with logging that I've since removed. I can add that back if you'd like. Maybe an info level log line if the node membership has changed and a debug level log line for when the == check fails.

Contributor:

Ok, so every 5 sec there will be extra work to compare all the elements of the Set. I still think that doing it in conjunction with Presto's NodeManager, something like a callback that NodeManager invokes when it sees a change in membership, would work out best.

The problem with the current approach (multiple comparisons every 5 sec) would be amplified if we remove the synchronisation you have added on all methods of SyncClusterManager. Given that there will be 100s of parallel requests to these methods (as many as there are task runner threads in Presto), the current code will serialise them, which will definitely have a perf impact. And as soon as you remove the sync, you will have the problem of those 100s of threads (potentially all, though maybe not always all) doing the extra comparisons every 5 sec. So a solution that is deeply integrated with Presto would be better.

Tagging @sopel39 @losipiuk to get input on whether it would be acceptable to add a stateChangeListener to InternalNodeManager in Presto.
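For reference, the listener idea would look roughly like the sketch below. The callback registration is hypothetical, since it depends on Presto exposing such a hook (the very thing being asked about), and the class and method names are made up.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical: if the engine exposed a membership-change callback, rubix could
// do its recomputation only inside that callback and keep the per-read methods
// as lock-free reads.
public class ListenerBackedClusterState
{
    private final AtomicReference<Set<String>> workers = new AtomicReference<>(Set.of());

    // Would be registered with the engine's (hypothetical) listener hook and
    // invoked only when membership actually changes.
    public void onMembershipChange(Set<String> newWorkers)
    {
        workers.set(Set.copyOf(newWorkers));
    }

    // Hot path: no synchronisation and no set comparison per call.
    public Set<String> getCurrentWorkers()
    {
        return workers.get();
    }
}
```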

JamesRTaylor (Contributor, Author):

Given the scale of everything rubix is doing, there's no way comparing 100 strings every five seconds would add any measurable overhead. How about running it with Presto to measure it?

sopel39 (Contributor), Oct 9, 2020:

> I think the optimal way would be work with Presto team to add the logic in Presto's NodeManager to return info about state change

This just shifts the problem to Presto. The workers list is not real-time and requires some time to refresh (I think discovery requests are sent every 1 s, but there is also a blacklist of failed nodes refreshed independently).

On a general note, I'm not sure what problem we are trying to fix. Is the issue that we cache the workers list for 5 s? Workers should join/depart the cluster much less frequently; a cluster with nodes changing every 5 s won't be stable anyway.

Even if the list is not current, queries should not fail.

JamesRTaylor (Contributor, Author), Oct 9, 2020:

@sopel39 - this issue (#462) is a secondary issue in that, with caching enabled, it causes more queries to fail than otherwise would when a node becomes unresponsive and/or is subsequently restarted/terminated. The reason is that rubix is caching the set of worker nodes and using it to determine where to send cache read requests (i.e. it's not the caching by DiscoveryNodeManager, but rather the internal caching done by rubix, that I'm trying to get rid of). Because the set of worker nodes is cached by rubix instead of always being fetched from the NodeManager, more cache read requests are sent to nodes that are down, which leads to more internal-error query failures. We've observed this behavior consistently on a heavily loaded test cluster, whereas without caching enabled under heavy load, queries instead fail with an insufficient-resources timeout (which is what is expected). Preventing the proliferation of these internal-error failures is important because otherwise worker nodes get terminated in our infra and engineers start to get paged.

This PR solves this by always getting the list of worker nodes from the NodeManager and only doing the additional computation rubix needs when the set of worker nodes changes. @stagraqubole is worried about the equals check we do on the worker node set causing too much overhead and is asking whether it's feasible to have some kind of callback in NodeManager for when the set of active workers changes. Looking closer at DiscoveryNodeManager, though it polls for node state every 5 seconds, it seems that it only updates the set of nodes if they've changed. Regardless, even if it did update every 5 seconds, I don't think the string compares of an equals operation occurring every 5 seconds would add any overhead.

The primary issue (which is being worked on by @stagraqubole) is fixing the disk usage accounting to prevent out of disk space issues.

Other primary issues are:

@JamesRTaylor (Contributor, Author):

@shubhamtagra - I can put this change behind a config flag so that it maintains the current implementation if you're worried about it. We'd really like to have this capability in the next release to test out in conjunction with your other fixes.

@JamesRTaylor (Contributor, Author):

WDYT @shubhamtagra?

@JamesRTaylor changed the title from "Avoid caching presto worker worker nodes" to "Avoid caching presto worker nodes" on Nov 4, 2020
@JamesRTaylor (Contributor, Author):

@shubhamtagra - I've updated the PR as we've discussed. Please review when you have a chance.

shubhamtagra (Contributor) left a review comment:

Just one comment about keeping the existing code in PrestoClusterManager that gets the worker nodes list over HTTP.

To use SyncPrestoCM, do you plan to use the rubix.cluster.manager.presto.class config?

@shubhamtagra self-requested a review on November 10, 2020, 06:40.
JamesRTaylor (Contributor, Author) commented Nov 10, 2020:

> To use SyncPrestoCM, do you plan to use the rubix.cluster.manager.presto.class config?

Yes, exactly.

Would it be possible to get a new version of rubix with this change in it?

Thanks again for the code reviews.
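For context, a hedged example of how the selection discussed above might be wired up through a Hadoop Configuration. The key rubix.cluster.manager.presto.class comes from this conversation, but the fully qualified class name below is a guess and may not match the merged code.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: select the synchronous Presto cluster manager via the config key
// mentioned in the review. The package/class name is illustrative only.
public final class RubixClusterManagerConfigExample
{
    public static void main(String[] args)
    {
        Configuration conf = new Configuration();
        conf.set("rubix.cluster.manager.presto.class",
                "com.qubole.rubix.presto.SyncPrestoClusterManager");
        System.out.println(conf.get("rubix.cluster.manager.presto.class"));
    }
}
```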

@shubhamtagra merged commit 9b87ec9 into qubole:master on Nov 11, 2020.
@shubhamtagra (Contributor):

Merged, thanks @JamesRTaylor. I will release a new version so that this can be picked up in Presto.
