Forcing eventlistener sink to resolve eventlistener #977
Conversation
The following is the coverage report on the affected files.
Force-pushed from 40536ad to a64e56e.
Force-pushed from a64e56e to b2c0d1a.
pkg/sink/initialization.go
Outdated
Duration: 50 * time.Millisecond,
Factor:   2.0,
Jitter:   0.3,
Steps:    10,
If I'm understanding this correctly, it will try a max of 10 times, right? (Not try for a max of 5 seconds?)
So, it's a little of both. Steps is the max number of attempts; if the wait time ever hits the cap, it is clamped to that amount and the loop bails out.
In this case, the first wait time is 50 milliseconds with a scale factor of 2 and jitter of 0.3. The jitter means the effective scale factor lands somewhere between 1.7 and 2.3. This means that after 10 steps, the final wait is somewhere around 10 seconds before it fails.
Makes sense... do you think 10 seconds is quick enough? I was thinking 30s might be a better default?
Let me increase the factor and the number of steps. Again, this is basically a question of:
how long will it take for the k8s API server to commit the EventListener?
I think closer to 15-20 seconds is appropriate, especially considering that this is a function that blocks the startup of the HTTP endpoint. Let me modify the factor so it ends up around this value.
Just two minor comments. Otherwise good to go!
pkg/sink/initialization_test.go
Outdated
r.WaitForEventListenerOrDie()
}
cmd := exec.Command(os.Args[0], "-test.run=TestWaitForEventlistener_Fatal") //nolint:gosec
cmd.Env = append(os.Environ(), "EL_FATAL_CRASH=1")
this is clever...though we might be able to get away with passing in a test logger and capturing its output and then comparing (like we do here:
triggers/pkg/sink/sink_test.go
Lines 926 to 943 in 1066d18
logger := zaptest.NewLogger(t, zaptest.WrapOptions(zap.WrapCore(func(zapcore.Core) zapcore.Core { return core }))).Sugar()
sink.Logger = logger
ts := httptest.NewServer(http.HandlerFunc(sink.HandleEvent))
defer ts.Close()
resp, err := http.Post(ts.URL, "application/json", bytes.NewReader(tc.eventBody))
if err != nil {
	t.Fatalf("error making request to eventListener: %s", err)
}
if resp.StatusCode != tc.wantStatusCode {
	t.Fatalf("Status code mismatch: got %d, want %d", resp.StatusCode, http.StatusInternalServerError)
}
if tc.wantErrLogMsg != "" {
	matches := logs.FilterMessage(tc.wantErrLogMsg)
	if matches == nil || matches.Len() == 0 {
		t.Fatalf("did not find log entry: %s.\n Logs are: %v", tc.wantErrLogMsg, logs.All())
	}
What do you think?
Another option would be to have the function return an error and then have main.go call the log.Fatal.
yes, let me take a look at this strategy and try to implement it here
So, it doesn't look like this is possible, because the behavior of Fatal can't be overridden even though there is a no-op action in the library; it's simply not a supported action on Fatal.
I updated this function to return an error and panic instead.
Force-pushed from 13bd2d7 to 7d926bb.
Force-pushed from 7d926bb to 4401def.
Force-pushed from 4401def to 78c2587.
/approve Just one minor thing -> panic to logger.Fatal in main
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dibyom The full list of commands accepted by this bot can be found here. The pull request process is described here
Co-authored-by: Dibyo Mukherjee <dibyo@google.com>
/test pull-tekton-triggers-integration-tests
/lgtm
Previously, we had a race where the EL would start serving traffic before the lister caches were synced. This leads to the intermittent resolution issues described in #896. #977 was an attempt to fix this for the EventListener resource. This commit fixes it for all resource types by first registering all the listers, and then syncing the cache before serving traffic. We will wait up to 1 minute for the cache to sync before timing out. Tested by modifying the e2e test to run the intermittently failing test 10 times. Without this fix it fails; with it, it does not (see #1012). Fixes #896 Signed-off-by: Dibyo Mukherjee <dibyo@google.com>
Changes
This change forces the eventlistenersink process to be able to resolve
the EventListener before starting the HTTP server.
This means that the sink HTTP process (and readiness probe)
will not listen until the EventListenerLister is able to resolve
the EventListener from the API server.
This is especially useful in startup cases, but can also assist
if the pod is started without permission to read the EventListener
object. In this situation, given this change, the eventlistener
pod will restart with a logged error message about lack of access to
that specific API resource.
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.
Release Notes