Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add protection on namespace cache in case namespace is not created #824

Merged
merged 8 commits into from
Oct 15, 2020

Conversation

wxing1292
Copy link
Contributor

@wxing1292 wxing1292 commented Oct 9, 2020

What changed?
Add additional frequency check on namespace cache to reduce the impact if namespace is not registered.

Why?
Currently, if a namespace does not exists within cache, the logic will check if the namespace exists in DB and possibly perform a cache refreshment. If the namespace does not exists in DB, and caller keeps on calling server APIs, the DB will encounter huge API call volume.

How did you test it?
Existing tests

temporal|namespace-cache-protection ⇒ go test -count=20 -race go.temporal.io/server/common/cache
ok  	go.temporal.io/server/common/cache	89.144s

Potential risks
N/A

@wxing1292 wxing1292 requested review from samarabbas and a team October 9, 2020 20:20
return nil
}

func (c *namespaceCache) checkNamespaceExists(
name string,
id string,
) error {
) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think logic of this func should be changed. It is great that it returns (bool, error) now but bool should mean exist/not exist and error should indicate the error: persistence or throttling. I would return service.ResourceExhausted right from here and propagate it up to the caller. Caller should be able to differentiate if it is actual NotFound or "you are pinging me to fast".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but bool should mean exist/not exist

not really, bool means continue to check the cache & refresh the cache or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and there is not guarantee the check DB see if the namespace exists logic will be exercised within this function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, then don't add this bool at all. Let it return serviceerror.NotFound or serviceerror.ResourceExhausted which will be propagated to the caller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nop, that is not the purpose of this function

the purpose of this function is check whether to continue the path, i.e. check & potentially force refresh the cache.
the bool return value means whether to continue (either the domain exists, or somebody has already checked within short period of time and it is really likely cache can contain the target domain, but no guarantee)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the functionality (function in question) was originally introduced to solve the issue of
t = 0, cache got refreshed, to be refreshed again at t = 10
t = 1, new domain got registered
t = 2, caller just want to use the domain (from within cache)

the function in question (along with the following logic) will solve the above case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be the function name should be changed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, change the name. If existsSomething() returns bool it usually means exists/doesn't exist.

Also you return NotFound at t=2 now, right? I think Unavailable suites better here (client should wait and try again w/o performing extra actions): https://pkg.go.dev/google.golang.org/grpc/codes (comment to FailedPrecondition Code = 8).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function name is changed

Unavailable means server fault, while in this case (namespace missing in cache) is client fault.
the only guarantee we have with customer is: if you namespace is in the DB, it is guaranteed to be in the cache after 10s + creation time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clients should not know about cache or any internal details. So if the statement is "namespace creation can take up to 10 seconds" then returning NotFound within those 10 seconds makes perfect sense.

c.checkLock.Lock()
defer c.checkLock.Unlock()

now = c.timeSource.Now()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need it one more time here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, consider 1000 caller with 10 frontend host with namespace not registered.

on average, there will be ~ 100 concurrent call to this cache, so the line 514 line now := c.timeSource.Now() maybe called a while back.

@@ -563,9 +590,13 @@ func (c *namespaceCache) getNamespace(
return c.getNamespaceByID(id, true)
}

if err := c.checkNamespaceExists(name, ""); err != nil {
doContinue, err := c.checkAndRefresh(name, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably there is a better name for doContinue?

@wxing1292 wxing1292 closed this Oct 15, 2020
@wxing1292 wxing1292 reopened this Oct 15, 2020
@wxing1292 wxing1292 merged commit b25bd3d into temporalio:master Oct 15, 2020
@wxing1292 wxing1292 deleted the namespace-cache-protection branch October 15, 2020 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants