Errors on subctl verify when running with Globalnet #707

Closed
mkolesnik opened this issue Mar 14, 2022 · 17 comments · Fixed by #709
Labels
0.12.0-testday, bug (Something isn't working)

Comments

@mkolesnik
Contributor

What happened:
Ran subctl verify during the 0.12.0-rc1 test day; the clusters are installed with Globalnet.
Got the following errors:

• Failure [295.955 seconds]
[discovery] Test Service Discovery Across Clusters
github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/service_discovery.go:42
  when a pod tries to resolve a service in a specific remote cluster by its cluster name
  github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/service_discovery.go:74
    should resolve the service on the specified cluster [It]
    github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/service_discovery.go:75

    Failed to verify if service IP is discoverable. expected execution result "" to contain "172.30.17.231"
    Unexpected error:
        <*errors.errorString | 0xc00018bea0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    occurred

    github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513
------------------------------
• Failure [285.577 seconds]
[discovery] Test Headless Service Discovery Across Clusters
github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/headless_services.go:37
  when a pod tries to resolve a headless service in a remote cluster
  github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/headless_services.go:40
    should resolve the backing pod IPs from the remote cluster [It]
    github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/headless_services.go:41

    Failed to  service IP verification. expected execution result "" to contain "242.1.255.252"
    Unexpected error:
        <*errors.errorString | 0xc00018bea0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    occurred

    github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513
------------------------------
• Failure [279.380 seconds]
[discovery] Test Stateful Sets Discovery Across Clusters
github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:34
  when a pod tries to resolve a podname from stateful set in a remote cluster
  github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:37
    should resolve the pod IP from the remote cluster [It]
    github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:38

    Failed to  service IP verification. expected execution result "" to contain "242.1.255.252"
    Unexpected error:
        <*errors.errorString | 0xc00018bea0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    occurred

    github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513
------------------------------
• Failure [280.104 seconds]
[discovery] Test Stateful Sets Discovery Across Clusters
github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:34
  when a pod tries to resolve a podname from stateful set in a local cluster
  github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:43
    should resolve the pod IP from the local cluster [It]
    github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:44

    Failed to  service IP verification. expected execution result "" to contain "242.0.255.252"
    Unexpected error:
        <*errors.errorString | 0xc00018bea0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    occurred

    github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513
------------------------------
• Failure [280.258 seconds]
[discovery] Test Stateful Sets Discovery Across Clusters
github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:34
  when the number of active pods backing a stateful set changes
  github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:49
    should only resolve the IPs from the active pods [It]
    github.com/submariner-io/lighthouse@v0.12.0-rc1/test/e2e/discovery/statefulsets.go:50

    Failed to  service IP verification. expected execution result "" to contain "242.1.255.250"
    Unexpected error:
        <*errors.errorString | 0xc00018bea0>: {
            s: "timed out waiting for the condition",
        }
        timed out waiting for the condition
    occurred

    github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513
------------------------------


Summarizing 6 Failures:

[Fail] [discovery] Test Service Discovery Across Clusters when a pod tries to resolve a service in a specific remote cluster by its cluster name [It] should resolve the service on the specified cluster 
github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513

[Fail] [discovery] Test Headless Service Discovery Across Clusters when a pod tries to resolve a headless service in a remote cluster [It] should resolve the backing pod IPs from the remote cluster 
github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513

[Fail] [discovery] Test Headless Service Discovery Across Clusters when a pod tries to resolve a headless service in a specific remote cluster by its cluster name [It] should resolve the backing pod IPs from the specified remote cluster 
github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513

[Fail] [discovery] Test Stateful Sets Discovery Across Clusters when a pod tries to resolve a podname from stateful set in a remote cluster [It] should resolve the pod IP from the remote cluster 
github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513

[Fail] [discovery] Test Stateful Sets Discovery Across Clusters when a pod tries to resolve a podname from stateful set in a local cluster [It] should resolve the pod IP from the local cluster 
github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513

[Fail] [discovery] Test Stateful Sets Discovery Across Clusters when the number of active pods backing a stateful set changes [It] should only resolve the IPs from the active pods 
github.com/submariner-io/shipyard@v0.12.0-rc1/test/e2e/framework/framework.go:513

Ran 24 of 41 Specs in 2129.496 seconds
FAIL! -- 18 Passed | 6 Failed | 0 Pending | 17 Skipped
[] E2E failed

Full log: log.txt

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Diagnose information (use subctl diagnose all): All passed

  • Gather information (use subctl gather): gather.tar.gz

  • Cloud provider or hardware configuration: AWS, OCP 4.9

  • Install tools:

  • Others:

@mkolesnik added the bug (Something isn't working) and 0.12.0-testday labels on Mar 14, 2022
@sridhargaddam
Member

sridhargaddam commented Mar 14, 2022

Some observations from the Globalnet pod:

I0314 10:39:16.500260       1 global_egressip_controller.go:138] Processing deleted GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63782851137, loc:(*time.Location)(0x2034060)}}, Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.252"}}
I0314 10:39:16.500308       1 base_controllers.go:161] Releasing previously allocated IPs [242.0.255.252] for "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb"
I0314 10:39:16.500334       1 iface.go:263] Deleting iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb": -p all -m set --match-set SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.252
I0314 10:39:16.502730       1 ipset.go:349] Running ipset [destroy SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF]
E0314 10:39:16.507602       1 global_egressip_controller.go:298] Error destroying the ipSet "SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF" for "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb": error destroying set "SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF": exit status 1 (ipset v7.11: Set cannot be destroyed: it is in use by a kernel component
)
I0314 10:39:16.517872       1 global_egressip_controller.go:138] Processing deleted GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63782851137, loc:(*time.Location)(0x2034060)}}, Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.252"}}
I0314 10:39:16.517904       1 base_controllers.go:161] Releasing previously allocated IPs [242.0.255.252] for "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb"
I0314 10:39:16.517920       1 iface.go:263] Deleting iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb": -p all -m set --match-set SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.252
I0314 10:39:16.519788       1 ipset.go:349] Running ipset [destroy SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF]
I0314 10:39:16.525574       1 global_egressip_controller.go:305] Successfully deleted all the iptables/ipset rules for "e2e-tests-dataplane-gn-conn-nd-mn5pb/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-mn5pb" 

We have some test cases in our e2e that create GlobalEgressIPs at the namespace level and verify that connectivity works with the globalIP assigned at the namespace level. From the above logs, we can see that after the test case finishes, it deletes the GlobalEgressIP object, which in turn triggers the cleanup in the Globalnet pod.

We can see that the Globalnet pod runs the iptables delete ... followed by ipset destroy ...; it looks like the iptables delete takes a few milliseconds, and when we try to destroy the corresponding ipset in the meantime, the kernel returns an error (ipset v7.11: Set cannot be destroyed: it is in use by a kernel component). But we have a retry mechanism in the Globalnet code that successfully deletes the ipset after a few milliseconds, so from the Globalnet pod's side, things look fine.
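For illustration only, the retry behaviour described above might look roughly like the following sketch (the helper name, attempt count, and delay are assumptions, not the actual Globalnet implementation):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// destroyIPSetWithRetry is a hypothetical sketch of the retry behaviour
// described above: "ipset destroy" can fail with "Set cannot be destroyed:
// it is in use by a kernel component" while the matching iptables rule is
// still being removed, so we retry a few times with a short delay.
func destroyIPSetWithRetry(setName string, attempts int, delay time.Duration) error {
	var lastErr error

	for i := 0; i < attempts; i++ {
		out, err := exec.Command("ipset", "destroy", setName).CombinedOutput()
		if err == nil {
			return nil
		}

		lastErr = fmt.Errorf("error destroying set %q: %w (%s)", setName, err, out)
		time.Sleep(delay)
	}

	return lastErr
}

func main() {
	if err := destroyIPSetWithRetry("SM-GN-PGRBRJVHOEQH5ZPDMSOGQZGOF", 5, 100*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```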

STEP: Executing "dig +short mkolesni-0.12-testday2.nginx-headless.e2e-tests-discovery-zfxpp.svc.clusterset.local" to verify IPs [242.1.255.252] for service "nginx-headless" "are" discoverable
STEP: Executing "dig +short mkolesni-0.12-testday1.nginx-headless.e2e-tests-discovery-q7kq2.svc.clusterset.local" to verify IPs [242.0.255.252] for service "nginx-headless" "are" discoverable
STEP: Executing "dig +short web-0.mkolesni-0.12-testday2.nginx-ss.e2e-tests-discovery-hkzc2.svc.clusterset.local" to verify IPs [242.1.255.252] for pod "web-0.mkolesni-0.12-testday2.nginx-ss" "are" discoverable
STEP: Executing "dig +short web-0.mkolesni-0.12-testday1.nginx-ss.e2e-tests-discovery-wjjgt.svc.clusterset.local" to verify IPs [242.0.255.252] for pod "web-0.mkolesni-0.12-testday1.nginx-ss" "are" discoverable
STEP: Executing "dig +short web-2.mkolesni-0.12-testday2.nginx-ss.e2e-tests-discovery-j2t8x.svc.clusterset.local" to verify IPs [242.1.255.250] for pod "web-2.mkolesni-0.12-testday2.nginx-ss" "are" discoverable

All 6 failures are in discovery test cases, and looking closely at the logs, we can see that globalIPs were successfully allocated in all 6 cases. So this problem needs to be investigated from the Lighthouse POV.

@sridhargaddam
Member

I'm seeing the following errors in the Lighthouse agent logs on mkolesni-0.12-testday1:

E0314 10:32:29.383035       1 queue.go:103] broker -> local for *v1beta1.EndpointSlice: Failed to process object with key "submariner-k8s-broker/nginx-demo-mkolesni-0.12-testday2": error distributing resource "submariner-k8s-broker/nginx-demo-mkolesni-0.12-testday2": error creating or updating resource: error creating &unstructured.Unstructured{Object:map[string]interface {}{"addressType":"IPv4", "apiVersion":"discovery.k8s.io/v1beta1", "endpoints":[]interface {}{map[string]interface {}{"addresses":[]interface {}{"10.129.2.84"}, "conditions":map[string]interface {}{"ready":true}, "hostname":"nginx-demo-b8f8b4fc6-hl4v2", "topology":map[string]interface {}{"kubernetes.io/hostname":"ip-10-0-132-97.ec2.internal"}}}, "kind":"EndpointSlice", "metadata":map[string]interface {}{"labels":map[string]interface {}{"endpointslice.kubernetes.io/managed-by":"lighthouse-agent.submariner.io", "lighthouse.submariner.io/sourceNamespace":"e2e-tests-discovery-46gz7", "multicluster.kubernetes.io/service-name":"nginx-demo", "multicluster.kubernetes.io/source-cluster":"mkolesni-0.12-testday2", "submariner-io/clusterID":"mkolesni-0.12-testday2", "submariner-io/originatingNamespace":"e2e-tests-discovery-46gz7"}, "name":"nginx-demo-mkolesni-0.12-testday2", "namespace":"e2e-tests-discovery-46gz7"}, "ports":[]interface {}{map[string]interface {}{"name":"metrics", "port":8082, "protocol":"TCP"}, map[string]interface {}{"name":"http", "port":8080, "protocol":"TCP"}}}}: endpointslices.discovery.k8s.io "nginx-demo-mkolesni-0.12-testday2" is forbidden: unable to create new content in namespace e2e-tests-discovery-46gz7 because it is being terminated
E0314 11:01:09.524583       1 queue.go:103] Endpoints -> EndpointSlice: Failed to process object with key "e2e-tests-discovery-wjjgt/nginx-ss": error distributing resource "e2e-tests-discovery-wjjgt/nginx-ss": error creating or updating resource: error creating &unstructured.Unstructured{Object:map[string]interface {}{"addressType":"IPv4", "apiVersion":"discovery.k8s.io/v1beta1", "endpoints":interface {}(nil), "kind":"EndpointSlice", "metadata":map[string]interface {}{"labels":map[string]interface {}{"endpointslice.kubernetes.io/managed-by":"lighthouse-agent.submariner.io", "lighthouse.submariner.io/sourceNamespace":"e2e-tests-discovery-wjjgt", "multicluster.kubernetes.io/service-name":"nginx-ss", "multicluster.kubernetes.io/source-cluster":"mkolesni-0.12-testday1"}, "name":"nginx-ss-mkolesni-0.12-testday1", "namespace":"e2e-tests-discovery-wjjgt"}, "ports":interface {}(nil)}}: endpointslices.discovery.k8s.io "nginx-ss-mkolesni-0.12-testday1" is forbidden: unable to create new content in namespace e2e-tests-discovery-wjjgt because it is being terminated

It says "nginx-ss-mkolesni-0.12-testday1" is forbidden: unable to create new content in namespace e2e-tests-discovery-wjjgt because it is being terminated. Why is LH/Admiral code trying to create something while the namespace is getting deleted?
Is it some issue with the test-code?

CC: @aswinsuryan @vthapar

@sridhargaddam
Member

Just an observation, and it may not be the reason for the failures.

There are a lot of WARNINGs in the Lighthouse agent logs with the following message:

W0314 11:01:23.203691       1 warnings.go:67] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0314 11:01:23.217613       1 warnings.go:67] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0314 11:01:23.224789       1 warnings.go:67] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0314 11:01:23.234061       1 warnings.go:67] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0314 11:05:35.545354       1 warnings.go:67] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice

@vthapar
Contributor

vthapar commented Mar 15, 2022

@sridhargaddam The first issue could be one of two things:

  1. We're waiting for the globalIP [or some other event], but by the time we get the notification for it, the test has already timed out and started the cleanup process. So our test fails and starts cleanup, but during cleanup the Lighthouse agent gets the notification and throws this error.
  2. An issue with running tests in parallel. We had a similar issue about a year or so ago, but IIRC @tpantelis already fixed those.

The second one is just a warning with zero impact and is being tracked in #576

@sridhargaddam
Member

  1. We're waiting for the globalIP [or some other event], but by the time we get the notification for it, the test has already timed out and started the cleanup process. So our test fails and starts cleanup, but during cleanup the Lighthouse agent gets the notification and throws this error.

globalIP was actually allocated in time, so it could be some other issue then.

  2. An issue with running tests in parallel. We had a similar issue about a year or so ago, but IIRC @tpantelis already fixed those.

@mkolesnik is the problem consistently reproduced?

@mkolesnik
Contributor Author

@mkolesnik is the problem consistently reproduced?

I didn't have a chance to re-run this, as the whole test run (due to the obscenely long timeouts) took, I think, more than an hour, maybe two. The setup has been torn down (automatically), so it's not available anymore.
However, diagnose and everything else seemed to be working fine, so IDK if it's a problem with the tests or with LH itself. The gather logs from both clusters are attached and should have enough info.

@nyechiel
Member

@sridhargaddam @vthapar what's the latest on this? This is the only issue blocking us from releasing 0.12.0, AFAIK.

@sridhargaddam
Member

@nyechiel I had a close look at the logs, and from the Globalnet POV things look fine. There are some errors in Lighthouse, and DNS resolution is failing, which is why the discovery tests fail.

Apart from the observations I shared earlier, I also see some errors in LH when there is a request for DNS resolution and the entry is missing:

[ERROR] plugin/errors: 3 mkolesni-0.12-testday1.nginx-demo.e2e-tests-discovery-jnhsl.svc.clusterset.local. A: plugin/lighthouse: record not found
[ERROR] plugin/errors: 3 mkolesni-0.12-testday1.nginx-demo.e2e-tests-discovery-jnhsl.svc.clusterset.local. A: plugin/lighthouse: record not found
[ERROR] plugin/errors: 3 mkolesni-0.12-testday2.nginx-headless.e2e-tests-discovery-zfxpp.svc.clusterset.local. A: plugin/lighthouse: record not found
[ERROR] plugin/errors: 3 mkolesni-0.12-testday2.nginx-headless.e2e-tests-discovery-zfxpp.svc.clusterset.local. A: plugin/lighthouse: record not found

Since we no longer have the original setup, I'm trying to deploy AWS clusters to check whether the problem is transient or consistent.

@vthapar do you have any additional observations looking at Lighthouse logs?

@sridhargaddam
Member

Okay, I re-ran the discovery tests on AWS clusters running OCP 4.9 + Submariner 0.12.0-rc1 + Globalnet, and all the tests passed. So it looks like it was a transient error. @nyechiel, we can remove this issue from the blocking list.

[sgaddam@localhost aws-ocp]$ subctl verify --only service-discovery sgaddam-aws-spoke1/auth/kubeconfig sgaddam-aws-spoke2/auth/kubeconfig --verbose
...
...
------------------------------
SSSSSSSSSS
Ran 12 of 41 Specs in 593.811 seconds
SUCCESS! -- 12 Passed | 0 Failed | 0 Pending | 29 Skipped
[sgaddam@localhost aws-ocp]$ oc version
Client Version: 4.9.11
Server Version: 4.9.0
Kubernetes Version: v1.22.0-rc.0+894a78b
[sgaddam@localhost aws-ocp]$ subctl show all
Cluster "sgaddam-aws-spoke1"
 ✓ Detecting broker(s) 
NAMESPACE                NAME                     COMPONENTS                              
submariner-k8s-broker    submariner-broker        service-discovery, connectivity         

 ✓ Showing Connections 
GATEWAY         CLUSTER             REMOTE IP     NAT  CABLE DRIVER  SUBNETS       STATUS     RTT avg.    
ip-10-0-23-121  sgaddam-aws-spoke2  52.15.37.203  yes  libreswan     242.1.0.0/16  connected  937.845µs   

 ✓ Showing Endpoints 
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
sgaddam-aws-spoke1            10.0.43.92      18.118.194.194  libreswan           local           
sgaddam-aws-spoke2            10.0.23.121     52.15.37.203    libreswan           remote          

 ✓ Showing Gateways 
NODE                            HA STATUS       SUMMARY                         
ip-10-0-43-92                   active          All connections (1) are established

    Discovered network details via Submariner:
 ✓ Showing Network details
        Network plugin:  OpenShiftSDN
        Service CIDRs:   [172.30.0.0/16]
        Cluster CIDRs:   [10.128.0.0/14]
        Global CIDR:     242.0.0.0/16

 ✓ Showing versions 
COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.12.0-rc1      
submariner-operator             quay.io/submariner                                    0.12.0-rc1      
service-discovery               quay.io/submariner                                    0.12.0-rc1      
COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.12.0-rc1      
submariner-operator             quay.io/submariner                                    0.12.0-rc1      
service-discovery               quay.io/submariner                                    0.12.0-rc1      

[sgaddam@localhost aws-ocp]$

@sridhargaddam
Member

@mkolesnik shall we close this issue as not reproducible?

@vthapar
Contributor

vthapar commented Mar 16, 2022

@sridhargaddam @nyechiel I've run it twice with no failures and am now working on my 3rd run to be sure. It does seem to be transient.

@mkolesnik This makes a good case for fail-fast, so that we can gather logs at the point of failure and not run any more tests. With e2e currently doing namespace cleanup on failure, we are missing crucial information like EndpointSlices, Exports and Imports.

@mkolesnik
Contributor Author

I disagree that this should just be closed; clearly there was an issue that happened in multiple tests, and as such it doesn't seem "transient" to me (or it would fail just 1 test). That's also why "fast fail" is not a good option for test day, since you don't know whether the test failed randomly or fails consistently.

If we're missing information to understand the problem, then we need to address that as well, since if it happens in a production env we won't have any more information than what's on this bug (and possibly less).

@vthapar
Contributor

vthapar commented Mar 16, 2022

Found the issue and confirmed it from the CoreDNS logs. The issue is only with the headless/StatefulSet tests, where we query a specific pod or pods in a specific cluster.

[ERROR] plugin/errors: 3 mkolesni-0.12-testday2.nginx-headless.e2e-tests-discovery-zfxpp.svc.clusterset.local. A: plugin/lighthouse: record not found
[ERROR] plugin/errors: 3 mkolesni-0.12-testday2.nginx-headless.e2e-tests-discovery-zfxpp.svc.clusterset.local. A: plugin/lighthouse: record not found

Note that the DNS query is missing the pod name. This is because of the . in the cluster name: it ends up interpreting mkolesni-0 as the pod name and 12-testday2 as the cluster name. The cluster name should be a valid DNS label [alphanumeric and - only].

We should throw an error to the user when they try to deploy a cluster with . in the name. We can consider allowing it if service discovery is not enabled.
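A rough illustration (not the actual Lighthouse plugin code) of why the query breaks: DNS labels are split on ".", so a cluster name containing a dot shifts all the labels.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Expected form for a headless query:
	// <cluster-name>.<service>.<namespace>.svc.clusterset.local
	// (query taken from the CoreDNS errors above)
	query := "mkolesni-0.12-testday2.nginx-headless.e2e-tests-discovery-zfxpp.svc.clusterset.local"

	labels := strings.Split(query, ".")

	// Because the cluster name itself contains a ".", the labels shift:
	fmt.Println(labels[0]) // "mkolesni-0"     -> treated as a pod name (wrong)
	fmt.Println(labels[1]) // "12-testday2"    -> treated as the cluster name (wrong)
	fmt.Println(labels[2]) // "nginx-headless" -> the service name
}
```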

@mkolesnik
Contributor Author

I see no problem with prohibiting . in cluster names altogether.
Note that I didn't specify an explicit name, though, so subctl join should generate a compliant name when one isn't requested, and we should validate only explicitly requested names.

Also, is . the only limitation, or are there more?

@mkolesnik
Contributor Author

We should probably follow the K8s naming restrictions: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/

@vthapar
Contributor

vthapar commented Mar 16, 2022

Yes, RFC 1035 validations. So where do we add all these checks?

  1. subctl deploy-broker?
  2. subctl join?
  3. All operators and agents, basically anything that uses the cluster ID at bringup?

@mkolesnik
Contributor Author

I believe subctl join would make the most sense, and obviously in the agent, as you specified, so that it throws an error. If we need it in other agents, I guess it could make sense somewhat, but realistically they don't use the name for DNS, so it doesn't matter; I'd focus on the places where it matters.
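Per the commits below, the fix validates the cluster ID as a DNS-1123 label and exits with an error if it isn't one. A minimal sketch using the upstream k8s.io/apimachinery helper (the function name and error wording here are illustrative, not the exact subctl/Lighthouse code):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

// validateClusterID is a hypothetical helper: IsDNS1123Label returns a list
// of validation error strings, which is empty when the value is a valid label.
func validateClusterID(clusterID string) error {
	if errs := validation.IsDNS1123Label(clusterID); len(errs) > 0 {
		return fmt.Errorf("cluster ID %q is not a valid DNS-1123 label: %v", clusterID, errs)
	}

	return nil
}

func main() {
	fmt.Println(validateClusterID("mkolesni-0.12-testday1")) // rejected: contains "."
	fmt.Println(validateClusterID("mkolesni-012-testday1"))  // <nil>: valid label
}
```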

vthapar added a commit to vthapar/submariner-operator that referenced this issue Mar 16, 2022
Current check for validating ClusterId allows `.` in clustername.
Instead it should follow Label Names validations as defined in
https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names

Fixes github.com/submariner-io/lighthouse/issues/707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
vthapar added a commit to vthapar/lighthouse that referenced this issue Mar 16, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes submariner-io#707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
skitt pushed a commit that referenced this issue Mar 16, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes #707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
skitt pushed a commit to submariner-io/submariner-operator that referenced this issue Mar 17, 2022
Current check for validating ClusterId allows `.` in clustername.
Instead it should follow Label Names validations as defined in
https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names

Fixes github.com/submariner-io/lighthouse/issues/707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
sridhargaddam pushed a commit to sridhargaddam/submariner-operator that referenced this issue Mar 24, 2022
Current check for validating ClusterId allows `.` in clustername.
Instead it should follow Label Names validations as defined in
https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names

Fixes github.com/submariner-io/lighthouse/issues/707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
vthapar added a commit to vthapar/lighthouse that referenced this issue Apr 5, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes submariner-io#707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
vthapar added a commit to vthapar/lighthouse that referenced this issue Apr 5, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes submariner-io#707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
Jaanki pushed a commit to Jaanki/lighthouse that referenced this issue Apr 5, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes submariner-io#707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
Jaanki pushed a commit to Jaanki/lighthouse that referenced this issue Apr 5, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes submariner-io#707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
skitt pushed a commit that referenced this issue Apr 5, 2022
Check if ClusterID is valid DNS1123 Label and exit with
error if not. Invalid ClusterID causes errors when replying
to DNS queries for service discovery.

Fixes #707
Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
anfredette pushed a commit to anfredette/submariner-operator that referenced this issue Apr 19, 2022
Current check for validating ClusterId allows `.` in clustername.
Instead it should follow Label Names validations as defined in
https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names

Fixes github.com/submariner-io/lighthouse/issues/707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>
anfredette pushed a commit to submariner-io/submariner-operator that referenced this issue Apr 19, 2022
Current check for validating ClusterId allows `.` in clustername.
Instead it should follow Label Names validations as defined in
https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names

Fixes github.com/submariner-io/lighthouse/issues/707

Signed-off-by: Vishal Thapar <5137689+vthapar@users.noreply.github.com>