Errors on subctl verify when running with Globalnet #707

Comments
Some observations from the Globalnet pod:
We have some test cases in our e2e suite which create GlobalEgressIPs at the namespace level and verify that connectivity works with the globalIP assigned at the namespace level. From the above logs, we can see that after the test case finishes, it tries to delete the GlobalEgressIP object, which in turn triggers the cleanup in the Globalnet pod. We can see from the logs that the Globalnet pod runs the cleanup.
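For reference, a namespace-scoped GlobalEgressIP of the kind these test cases create looks roughly like the following (the names and IP count are illustrative, not taken from the test suite):

```yaml
apiVersion: submariner.io/v1
kind: GlobalEgressIP
metadata:
  name: ns-egressip      # illustrative name
  namespace: test-ns     # no podSelector, so it applies to all pods in this namespace
spec:
  numberOfIPs: 1         # number of global IPs to allocate for egress
```

Deleting this object is what triggers the cleanup path in the Globalnet pod described above.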
All 6 failures are in discovery test cases, and looking closely at the logs, we can see that globalIPs were successfully allocated in all 6 test cases. So this problem needs to be investigated from the Lighthouse POV.
I'm seeing the following errors in the Lighthouse agent logs running on mkolesni-0.12-testday1.
CC: @aswinsuryan @vthapar
Just an observation, and it may not be the reason for the failures: there are a lot of WARNINGs in the Lighthouse agent logs with the following message:
@sridhargaddam The first issue could be one of two things.
The second one is just a warning with zero impact and is being tracked in #576.
The globalIP was actually allocated in time, so it could be some other issue then.
@mkolesnik is the problem consistently reproduced?
I didn't have a chance to re-run this, as the whole test run (due to the obscenely long timeouts) took, I think, more than an hour, maybe two. The setup has been torn down automatically, so it's not available anymore.
@sridhargaddam @vthapar what's the latest on this? This is the only issue blocking us from releasing 0.12.0, AFAIK.
@nyechiel I had a close look at the logs, and from the Globalnet POV things look fine. There are some errors in the Lighthouse tests, and DNS resolution is failing, which is why the discovery tests are failing. Apart from the observations I shared earlier, I also see some errors in Lighthouse when there is a request for DNS resolution and the entry is missing.
Since we no longer have the original setup, I'm trying to deploy AWS clusters to check whether the problem is transient or consistent. @vthapar do you have any additional observations from the Lighthouse logs?
Okay, I re-ran the discovery tests on AWS clusters running OCP 4.9 + Submariner 0.12.0-rc1 + Globalnet, and all the tests passed. So it looks like it was some transient error. @nyechiel we can remove this issue from the blocking list.
@mkolesnik shall we close this issue as not reproducible?
@sridhargaddam @nyechiel I've run it twice with no failures and am now working on my 3rd run to be sure. It does seem to be transient. @mkolesnik This makes a good case for fail-fast, so that we can gather logs at the point of failure and not run any more tests. With e2e currently doing namespace cleanup on failure, we are missing crucial information like the EndpointSlices, ServiceExports, and ServiceImports.
I disagree that this should just be closed; clearly there was an issue that happened on multiple tests, and as such it doesn't seem "transient" to me (or it would fail just 1 test). That's also why fail-fast is not a good option for test day, since you don't know whether the test failed randomly or fails consistently. If we're missing information to understand the problem, then we need to address that as well, since if it happens on a production env we won't have any more information than what's on this bug (and possibly less).
Found the issue and confirmed it from the CoreDNS logs. The issue is only with the headless/StatefulSet tests, where we query a specific pod, or pods in a specific cluster.
Note that the pod name is missing from the DNS query. This is because the ClusterID contains a `.`, which adds an extra label to the query name and breaks the parsing. We should throw an error to the user when trying to deploy a cluster with `.` in the ClusterID.
I see no problem prohibiting `.` in the ClusterID. Also, is there a defined set of characters we should validate against?
We should probably follow the K8s naming restrictions: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/
Yes, the RFC 1035 validations. So where do we add all these checks?
I believe in
The current check for validating ClusterId allows `.` in the cluster name. Instead it should follow the Label Names validations as defined in https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names

Fixes github.com/submariner-io/lighthouse/issues/707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
Check if the ClusterID is a valid DNS-1123 label and exit with an error if not. An invalid ClusterID causes errors when replying to DNS queries for service discovery.

Fixes submariner-io#707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
What happened:
Running subctl verify on 0.12.0-rc1 test day; the clusters are installed with Globalnet. Got the following errors:
Full log: log.txt

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:
- Diagnose information (use subctl diagnose all): all passed
- Gather information (use subctl gather): gather.tar.gz
- Cloud provider or hardware configuration: AWS, OCP 4.9
- Install tools:
- Others: