Avoid caching presto worker nodes #465 (Merged)

3 commits merged into qubole:master on Nov 11, 2020

Conversation

JamesRTaylor (Contributor):

No description provided.

@JamesRTaylor changed the title from "Avoid caching presto worker worker nodes #462" to "Avoid caching presto worker worker nodes" on Oct 6, 2020
@JamesRTaylor (Contributor, Author):

@stagraqubole - not sure about the integration test failure.

@@ -48,170 +28,51 @@
 /**
  * Created by stagra on 14/1/16.
  */
-public class PrestoClusterManager extends ClusterManager
+public class PrestoClusterManager extends SyncClusterManager
Contributor:
We need the other form too because there is a standalone mode where Presto's NodeManager is not available. We should create a new one, say SyncPrestoClusterManager, as a copy of what you have, and keep PrestoClusterManager based on AsyncClusterManager.

JamesRTaylor (Contributor, Author), Oct 7, 2020:

I created an implementation of Presto's NodeManager that can be used for standalone mode. We can move that out of the test side of things if need be and even provide a constructor for PrestoClusterManager that passes the NodeManager in. This provides a much cleaner separation of concerns IMHO.
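For illustration, a minimal sketch of what such an in-memory NodeManager could look like. This is not the PR's TestingNodeManager; the class name InMemoryNodeManager is made up, and it assumes Presto's connector SPI NodeManager exposes getAllNodes/getWorkerNodes/getCurrentNode/getEnvironment:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import com.facebook.presto.spi.Node;
import com.facebook.presto.spi.NodeManager;
import com.google.common.collect.ImmutableSet;

// Hypothetical in-memory NodeManager for standalone/testing use: the worker set
// lives in memory and the current node is supplied by the caller.
public class InMemoryNodeManager implements NodeManager
{
    private final Node currentNode;
    private final Set<Node> workers = ConcurrentHashMap.newKeySet();

    public InMemoryNodeManager(Node currentNode)
    {
        this.currentNode = currentNode;
        workers.add(currentNode);
    }

    // Registers an additional worker; in a real deployment the engine does this.
    public void addWorker(Node worker)
    {
        workers.add(worker);
    }

    @Override
    public Set<Node> getAllNodes()
    {
        return ImmutableSet.copyOf(workers);
    }

    @Override
    public Set<Node> getWorkerNodes()
    {
        return ImmutableSet.copyOf(workers);
    }

    @Override
    public Node getCurrentNode()
    {
        return currentNode;
    }

    @Override
    public String getEnvironment()
    {
        return "standalone";
    }
}
```

Passing a NodeManager like this through a constructor is what keeps PrestoClusterManager agnostic about whether it runs embedded in Presto or standalone.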

private final Node currentNode;
private final int serverPort;

public TestingNodeManager(Configuration conf, Node currentNode)
Contributor:
I guess you won't need this once you have Sync and Async implementations of PrestoClusterManager.


private Set<String> currentNodes;

protected abstract boolean hasStateChanged();
Contributor:
The current model will spike CPU usage a lot. The ClusterManager methods are called very frequently, e.g. getCurrentNodeName will be called for every MB of data being read via the worker node. That translates to 1k calls per GB to hasStateChanged, which does comparisons internally.

I think the optimal way would be to work with the Presto team to add logic in Presto's NodeManager that returns info about state changes, e.g. it could store a version which increments every time there is a change in membership. Based on that version, we could compute the hasStateChanged value here.
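To make the suggestion concrete, a rough sketch of the version-based idea. It is entirely hypothetical: no getMembershipVersion method exists in Presto's NodeManager today, and all names below are made up.

```java
// Hypothetical shape of the proposal: the engine would expose a monotonically
// increasing membership version, and hasStateChanged would reduce to comparing
// two longs instead of comparing node sets.
interface VersionedNodeManager
{
    long getMembershipVersion();    // incremented by the engine on any join/leave
}

abstract class VersionedClusterManager
{
    private final VersionedNodeManager nodeManager;
    private long lastSeenVersion = -1;

    VersionedClusterManager(VersionedNodeManager nodeManager)
    {
        this.nodeManager = nodeManager;
    }

    // Cheap check: membership-dependent state is recomputed only when the
    // version advances.
    protected synchronized boolean hasStateChanged()
    {
        long current = nodeManager.getMembershipVersion();
        if (current != lastSeenVersion) {
            lastSeenVersion = current;
            return true;
        }
        return false;
    }
}
```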

JamesRTaylor (Contributor, Author), Oct 7, 2020:

I've confirmed that the hasStateChanged() method returns on the == check at the start of the equals call between the existing and new worker node sets. There's no extra overhead for this; the only time extra work is done is when the worker nodes change.
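In other words, the pattern under discussion boils down to something like the sketch below. This is illustrative rather than the PR's actual code; the class and field names are assumptions.

```java
import java.util.Set;

import com.facebook.presto.spi.Node;
import com.facebook.presto.spi.NodeManager;

// Illustrative only: the previous node set is kept and compared against what the
// NodeManager currently reports. When the NodeManager keeps returning the same Set
// instance, equals() exits on its initial reference check and no per-element
// comparison happens on the hot path.
public abstract class WorkerSetTracker
{
    private final NodeManager nodeManager;
    private Set<Node> previousWorkers;

    protected WorkerSetTracker(NodeManager nodeManager)
    {
        this.nodeManager = nodeManager;
    }

    // Returns true only when worker membership differs from the last observed set.
    protected synchronized boolean hasStateChanged()
    {
        Set<Node> latest = nodeManager.getWorkerNodes();
        boolean changed = !latest.equals(previousWorkers);
        if (changed) {
            previousWorkers = latest;   // recompute membership-dependent state only here
        }
        return changed;
    }
}
```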

Contributor:

Wouldn't the check be on ImmutableSet.equals, which is not a simple == check?

JamesRTaylor (Contributor, Author):

The == check is the first line of the ImmutableSet.equals implementation.
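A tiny, self-contained illustration of that point (it assumes Guava on the classpath): comparing an unchanged set instance against itself is satisfied by the initial reference check, while a rebuilt set with the same members requires element comparison.

```java
import com.google.common.collect.ImmutableSet;

// Small demo of the two comparison paths discussed above.
public final class EqualsShortCircuitDemo
{
    public static void main(String[] args)
    {
        ImmutableSet<String> workers = ImmutableSet.of("worker-1", "worker-2", "worker-3");
        ImmutableSet<String> sameInstance = workers;
        ImmutableSet<String> rebuilt = ImmutableSet.of("worker-1", "worker-2", "worker-3");

        System.out.println(workers.equals(sameInstance)); // true via the initial reference check
        System.out.println(workers.equals(rebuilt));       // true, but via element comparison
    }
}
```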

Contributor:

Ok, so you confirmed that over a period of time, if the cluster state does not change, then Presto's NodeManager returns the same object? That is a dependency on undocumented behaviour of NodeManager; we will need a way/test to catch it if this behaviour changes in the future, otherwise slowness will creep in at runtime.

JamesRTaylor (Contributor, Author), Oct 8, 2020:

> Ok, so you confirmed that over a period of time if cluster state does not change then Presto's NodeManager returns same object?

Yes, I confirmed this with logging that I've since removed. I can add that back if you'd like. Maybe an info level log line if the node membership has changed and a debug level log line for when the == check fails.

Contributor:

Ok, so every 5 sec there will be extra work to compare all the elements of the Set. I still think that doing it in conjunction with Presto's NodeManager, something like a callback that NodeManager invokes when it sees a change in membership, would work out best.

The problem with the current approach (multiple comparisons every 5 sec) would be amplified if we remove the synchronisation you have added on all methods of SyncClusterManager. Given that there will be 100s of parallel requests to these methods (as many as there are task runner threads in Presto), the current code will serialise them, which will definitely have a perf impact. And as soon as you remove the sync, you will have the problem of those 100s of threads (potentially all, though maybe not always all) doing the extra comparisons every 5 sec. So a solution that is deeply integrated with Presto would be better.

Tagging @sopel39 @losipiuk to get input on whether it would be acceptable to add a stateChangeListener to InternalNodeManager in Presto.
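For reference, the listener idea would look roughly like the sketch below. The callback registration is hypothetical, since it depends on Presto exposing such a hook (the very thing being asked about), and the class and method names are made up.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical: if the engine exposed a membership-change callback, rubix could
// do its recomputation only inside that callback and keep the per-read methods
// as lock-free reads.
public class ListenerBackedClusterState
{
    private final AtomicReference<Set<String>> workers = new AtomicReference<>(Set.of());

    // Would be registered with the engine's (hypothetical) listener hook and
    // invoked only when membership actually changes.
    public void onMembershipChange(Set<String> newWorkers)
    {
        workers.set(Set.copyOf(newWorkers));
    }

    // Hot path: no synchronisation and no set comparison per call.
    public Set<String> getCurrentWorkers()
    {
        return workers.get();
    }
}
```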

JamesRTaylor (Contributor, Author):

Given the scale of everything rubix is doing, there's no way comparing 100 strings every five seconds would add any measurable overhead. How about running it with Presto to measure it?

sopel39 (Contributor), Oct 9, 2020:

> I think the optimal way would be work with Presto team to add the logic in Presto's NodeManager to return info about state change

This just shifts the problem to Presto. The workers list is not real-time and requires some time to refresh (I think discovery requests are sent every 1 s, but there is also a blacklist of failed nodes refreshed independently).

On a general note, I'm not sure what problem we are trying to fix. Is the issue that we cache the workers list for 5 s? Workers should join/depart the cluster much less frequently; a cluster with nodes changing every 5 s won't be stable anyway.

Even if the list is not current, queries should not fail.

JamesRTaylor (Contributor, Author), Oct 9, 2020:

@sopel39 - this issue (#462) is a secondary issue in that, with caching enabled, it causes more queries to fail than otherwise would when a node becomes unresponsive and/or is subsequently restarted/terminated. The reason is that rubix is caching the set of worker nodes and using it to determine where to send cache read requests (i.e. it's not the caching by DiscoveryNodeManager, but rather the internal caching done by rubix, that I'm trying to get rid of). Because the set of worker nodes is cached by rubix instead of always being fetched from the NodeManager, more cache read requests are sent to nodes that are down, which leads to more internal-error query failures. We've observed this behavior consistently on a heavily loaded test cluster, whereas without caching enabled under heavy load, queries instead fail with an insufficient-resources timeout (which is what is expected). Preventing the proliferation of these internal-error failures is important because otherwise worker nodes get terminated in our infra and engineers start to get paged.

This PR solves this by always getting the list of worker nodes from the NodeManager and only doing the additional computation rubix needs when the set of worker nodes changes. @stagraqubole is worried about the equals check we do on the worker node set causing too much overhead and is asking whether it's feasible to have some kind of callback in NodeManager for when the set of active workers changes. Looking closer at DiscoveryNodeManager, though it polls for node state every 5 seconds, it seems that it only updates the set of nodes if they've changed. Regardless, even if it did update every 5 seconds, I don't think the string compares of an equals operation occurring every 5 seconds would add any overhead.

The primary issue (which is being worked on by @stagraqubole) is fixing the disk usage accounting to prevent out of disk space issues.

Other primary issues are:

@JamesRTaylor (Contributor, Author):

@shubhamtagra - I can put this change behind a config flag so that it maintains the current implementation if you're worried about it. We'd really like to have this capability in the next release to test out in conjunction with your other fixes.

@JamesRTaylor (Contributor, Author):

WDYT @shubhamtagra?

@JamesRTaylor changed the title from "Avoid caching presto worker worker nodes" to "Avoid caching presto worker nodes" on Nov 4, 2020
@JamesRTaylor (Contributor, Author):

@shubhamtagra - I've updated the PR as we've discussed. Please review when you have a chance.

shubhamtagra (Contributor) left a review comment:

Just one comment about keeping the existing code in PrestoClusterManager that gets the worker nodes list over HTTP.

To use SyncPrestoCM, do you plan to use the rubix.cluster.manager.presto.class config?

@shubhamtagra self-requested a review on November 10, 2020, 06:40.
JamesRTaylor (Contributor, Author) commented Nov 10, 2020:

> To use SyncPrestoCM, do you plan to use the rubix.cluster.manager.presto.class config?

Yes, exactly.

Would it be possible to get a new version of rubix with this change in it?

Thanks again for the code reviews.
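For context, a hedged example of how the selection discussed above might be wired up through a Hadoop Configuration. The key rubix.cluster.manager.presto.class comes from this conversation, but the fully qualified class name below is a guess and may not match the merged code.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: select the synchronous Presto cluster manager via the config key
// mentioned in the review. The package/class name is illustrative only.
public final class RubixClusterManagerConfigExample
{
    public static void main(String[] args)
    {
        Configuration conf = new Configuration();
        conf.set("rubix.cluster.manager.presto.class",
                "com.qubole.rubix.presto.SyncPrestoClusterManager");
        System.out.println(conf.get("rubix.cluster.manager.presto.class"));
    }
}
```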

@shubhamtagra merged commit 9b87ec9 into qubole:master on Nov 11, 2020.
@shubhamtagra (Contributor):

Merged, thanks @JamesRTaylor. I will release a new version so that this can be picked up in Presto.
