-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache: added region cache warmup #250
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I left a couple comments. Let me know if they make sense.
@dethi Do you want to take a final look at this? |
} | ||
|
||
// Start a goroutine to connect to the region | ||
go c.establishRegion(reg, addr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should establish a connection immediately. I would think that just loading the cache is enough and once one of the region is being used, then a connection will be established. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the SendRPC
function, we first call getRegionFromCache
to retrieve the region from the cache. If the region is not found in the cache, we then call findRegion
to look up the region from the meta table. It seems that we are not establishing a connection when calling getRegionFromCache
, but we are establishing one in findRegion
. This leads me to think that we should establish a connection when loading a region into the cache. But I could be missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dethi's suggestion is a good one, to make the connection establishment lazy, but I don't think it would work well today.
What you could do is remove the MarkUnavailable()
and establishRegion
calls. Then when an RPC is made getRegionAndClientForRPC
will see that reg.Client() == nil
for that region and then call reestablishRegion
. The problem is that the way this works today is it calls establishRegion
with ""
as the addr
, which results in re-looking up the region from the meta region. So, then we are back to a flurry of requests to the meta region during start up (though, only one per region).
I think we should come up with a better fix for this in a later change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah my concern is because if you have 10k regions, this will cause 10k goroutines to start trying to establish a connection to maybe ~100 RegionServers. They will all get block behind a lock, that ensure that we establish only one client connection to a RegionServers, so that part isn't a problem. It's more the impact of all these goroutines on the Go scheduler, the CPU usage of the client on startup and what it would mean for the RegionServers as well if you have a few 10s clients doing the same thing at once. Because one of the thing we do in establishRegion
is a Get to validate that we have properly established the connection to the right regionserver and that the region is online, see isRegionEstablished
. That could cause quite a bit of storm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sucks, and we are doing O(number of regions * number of RegionServers) work during this warmup because clientRegionCache.put
is going to iterate over every RegionServer for each region added.
I still think we should try to merge this change as-is so that we can try out the tradeoffs and continue to work to improve things here. We might be starting 10k goroutines, but at least we aren't sending out 10k scan requests to the meta region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm, but need to get the CI to pass first
Left a few comment + CI is complaining about the mock missing the new function. |
Looks like there was a data race detected in the test, related to marshalJSON |
5aac5a1
to
78c56cf
Compare
This change adds an option in gohbase to prepopulate its region cache. It does a single scan of the meta table and cache all regions.
This change adds an option in gohbase to prepopulate its region cache. It does a single scan of the meta table and cache all regions.