Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache: added region cache warmup #250
cache: added region cache warmup #250
Changes from all commits
bacdf35
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should establish a connection immediately. I would think that just loading the cache is enough and once one of the region is being used, then a connection will be established. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the
SendRPC
function, we first callgetRegionFromCache
to retrieve the region from the cache. If the region is not found in the cache, we then callfindRegion
to look up the region from the meta table. It seems that we are not establishing a connection when callinggetRegionFromCache
, but we are establishing one infindRegion
. This leads me to think that we should establish a connection when loading a region into the cache. But I could be missing somethingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dethi's suggestion is a good one, to make the connection establishment lazy, but I don't think it would work well today.
What you could do is remove the
MarkUnavailable()
andestablishRegion
calls. Then when an RPC is madegetRegionAndClientForRPC
will see thatreg.Client() == nil
for that region and then callreestablishRegion
. The problem is that the way this works today is it callsestablishRegion
with""
as theaddr
, which results in re-looking up the region from the meta region. So, then we are back to a flurry of requests to the meta region during start up (though, only one per region).I think we should come up with a better fix for this in a later change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah my concern is because if you have 10k regions, this will cause 10k goroutines to start trying to establish a connection to maybe ~100 RegionServers. They will all get block behind a lock, that ensure that we establish only one client connection to a RegionServers, so that part isn't a problem. It's more the impact of all these goroutines on the Go scheduler, the CPU usage of the client on startup and what it would mean for the RegionServers as well if you have a few 10s clients doing the same thing at once. Because one of the thing we do in
establishRegion
is a Get to validate that we have properly established the connection to the right regionserver and that the region is online, seeisRegionEstablished
. That could cause quite a bit of storm.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sucks, and we are doing O(number of regions * number of RegionServers) work during this warmup because
clientRegionCache.put
is going to iterate over every RegionServer for each region added.I still think we should try to merge this change as-is so that we can try out the tradeoffs and continue to work to improve things here. We might be starting 10k goroutines, but at least we aren't sending out 10k scan requests to the meta region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm, but need to get the CI to pass first
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.