AWS: All submariner-routeagent pods got into CrashLoopBackOff #736

Closed
manosnoam opened this issue Jul 22, 2020 · 14 comments
Labels: blocker (A blocker bug), bug (Something isn't working), QE (Labels related to QE handling items)


@manosnoam
Contributor

manosnoam commented Jul 22, 2020

What happened:

Right after a new submariner deploy and join, I see:

pod/submariner-routeagent-8554g                      0/1     CrashLoopBackOff   1          25s   app=submariner-routeagent,component=routeagent,controller-revision-hash=c677d9594,pod-template-generation=1
pod/submariner-routeagent-c7494                      0/1     CrashLoopBackOff   1          25s   app=submariner-routeagent,component=routeagent,controller-revision-hash=c677d9594,pod-template-generation=1
pod/submariner-routeagent-fmg5c                      0/1     CrashLoopBackOff   1          25s   app=submariner-routeagent,component=routeagent,controller-revision-hash=c677d9594,pod-template-generation=1
pod/submariner-routeagent-mf2f5                      0/1     CrashLoopBackOff   1          25s   app=submariner-routeagent,component=routeagent,controller-revision-hash=c677d9594,pod-template-generation=1

In the routeagent pod description and logs I see:

Command:
      submariner-route-agent.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 22 Jul 2020 10:54:12 +0300
      Finished:     Wed, 22 Jul 2020 10:54:12 +0300

F0722 07:54:09.225159       1 main.go:97] Error running route controller: error annotating the node "ip-10-166-56-196" with cniIfaceIP: error updatating node "ip-10-166-56-196", err: unable to get node info for node ip-10-166-56-196, err: nodes "ip-10-166-56-196" not found

This issue is reproducible, so it's a blocker.

Please see full test report, including pod logs collection (last step):

Environment:

  • subctl version: v0.5.0-6-gad8f10a

  • Cluster A (AWS public):
    Client Version: 4.5.2
    Server Version: 4.5.2
    Kubernetes Version: v1.18.3+b74c5ed

  • Cluster B (OSP on-prem):
    Client Version: 4.5.2
    Server Version: 4.4.3
    Kubernetes Version: v1.17.1

@manosnoam manosnoam added the bug Something isn't working label Jul 22, 2020
@manosnoam manosnoam added this to the 0.6.0 milestone Jul 22, 2020
@manosnoam manosnoam added this to To do in Submariner Project v0.6.0 via automation Jul 22, 2020
@manosnoam
Contributor Author

According to @sridhargaddam, this bug is related to a recent change:
#729

It has something to do with hostnames, which are constructed differently on AWS only (the issue does not occur on OSP).
Basically, it's a new issue, and it looks like we did not validate the hostNetworking to remoteService use case where the source pod was on the AWS cluster.

@manosnoam manosnoam added the blocker A blocker bug label Jul 22, 2020
@sridhargaddam sridhargaddam changed the title All submariner-routeagent pods got into CrashLoopBackOff AWS: All submariner-routeagent pods got into CrashLoopBackOff Jul 22, 2020
@sridhargaddam sridhargaddam moved this from To do to In progress in Submariner Project v0.6.0 Jul 22, 2020
@sridhargaddam sridhargaddam self-assigned this Jul 22, 2020
sridhargaddam added a commit to sridhargaddam/submariner that referenced this issue Jul 22, 2020
On AWS Clusters, it was seen that nodes are named with FQDN
(i.e., hostname.domainname). However, on many of the other
K8s clusters we tested so far (like KIND, OSP etc), the
hostnames were matching with the nodeNames.

In order to support connectivity from HostNetwork to
remoteClusters (when using Globalnet), routeagent Pod
annotates the respective nodes with the cniIfaceIP
annotation. In the current code, where this annotation
is added, we were using the hostname to query the node
info. This was failing on AWS Clusters for the issue
mentioned above. This PR fixes it.

Fixes issue: submariner-io#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
sridhargaddam added a commit to sridhargaddam/submariner that referenced this issue Jul 22, 2020
On AWS Clusters, it was seen that nodes are named with FQDN
(i.e., hostname.domainname). However, on many of the other
K8s clusters we tested so far (like KIND, OSP etc), the
hostnames were matching with the nodeNames.

In order to support connectivity from HostNetwork to
remoteClusters (when using Globalnet), routeagent Pod
annotates the respective nodes with the cniIfaceIP
annotation. In the current code, where this annotation
is added, we were using the hostname to query the node
info. This was failing on AWS Clusters for the issue
mentioned above. This PR fixes it.

Fixes issue: submariner-io#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
@sridhargaddam
Member

On AWS Clusters, it was seen that nodes are named with FQDN (i.e., hostname.domainname). However, on many of the other
K8s clusters we tested so far (like KIND, OSP etc), the hostnames were matching with the nodeNames.

In order to support connectivity from HostNetwork to remoteClusters (when using Globalnet), routeagent Pod annotates the respective nodes with the cniIfaceIP annotation. In the current code, where this annotation is added, we were using the hostname to query the node info. This was failing on AWS Clusters for the issue mentioned above.

This is now fixed via the following PR: #737
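
A minimal Go sketch of one way to avoid depending on the OS hostname when looking up the Node object. It is an illustration, not the code from the PR: it assumes the pod spec exposes `spec.nodeName` to the container through the Kubernetes Downward API in an environment variable, here given the hypothetical name `NODE_NAME`, and falls back to the OS hostname only when that variable is unset.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// nodeNameEnv is a hypothetical variable name. It is assumed to be populated
// from spec.nodeName through the Downward API in the pod spec.
const nodeNameEnv = "NODE_NAME"

// nodeName returns the name to use when querying the Node object.
// lookupEnv and hostname are injected so the logic can be exercised
// without touching the real environment.
func nodeName(lookupEnv func(string) (string, bool), hostname func() (string, error)) (string, error) {
	// Prefer the name Kubernetes itself assigned to the node.
	if name, ok := lookupEnv(nodeNameEnv); ok && name != "" {
		return name, nil
	}

	// Fall back to the OS hostname, which may or may not match the node name.
	h, err := hostname()
	if err != nil {
		return "", fmt.Errorf("reading OS hostname: %w", err)
	}
	if h == "" {
		return "", errors.New("empty OS hostname")
	}
	return h, nil
}

func main() {
	name, err := nodeName(os.LookupEnv, os.Hostname)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("node name:", name)
}
```

With this approach the lookup uses whatever name the API server knows the node by, whether that is a short hostname or an FQDN.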

@sridhargaddam sridhargaddam moved this from In progress to Review in Progress in Submariner Project v0.6.0 Jul 22, 2020
tpantelis pushed a commit that referenced this issue Jul 22, 2020
On AWS Clusters, it was seen that nodes are named with FQDN
(i.e., hostname.domainname). However, on many of the other
K8s clusters we tested so far (like KIND, OSP etc), the
hostnames were matching with the nodeNames.

In order to support connectivity from HostNetwork to
remoteClusters (when using Globalnet), routeagent Pod
annotates the respective nodes with the cniIfaceIP
annotation. In the current code, where this annotation
is added, we were using the hostname to query the node
info. This was failing on AWS Clusters for the issue
mentioned above. This PR fixes it.

Fixes issue: #736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
@nyechiel nyechiel added the QE Labels related to QE handling items label Jul 23, 2020
sridhargaddam added a commit to sridhargaddam/submariner that referenced this issue Jul 23, 2020
Normally, most of the platforms are configured with hostname without the domainname.
However, on AWS, for one of the nodes, it was seen that hostname was configured as
FQDN (i.e., hostname.domainname) while all the remaining nodes were configured with
just the hostname alone. Because of this, route-agent was failing to start on that
node. This PR fixes it.

Fixes issue: submariner-io#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
tpantelis pushed a commit that referenced this issue Jul 23, 2020
Normally, most of the platforms are configured with hostname without the domainname.
However, on AWS, for one of the nodes, it was seen that hostname was configured as
FQDN (i.e., hostname.domainname) while all the remaining nodes were configured with
just the hostname alone. Because of this, route-agent was failing to start on that
node. This PR fixes it.

Fixes issue: #736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
@manosnoam
Contributor Author

Thanks @sridhargaddam, I've verified the fix.

Submariner Project v0.6.0 automation moved this from Review in Progress to Done Jul 23, 2020
@linanh

linanh commented Sep 2, 2020

If nodes are named with an FQDN (hostname.domainname) and the pod's node name uses only part of the FQDN (hostname), then the issue still exists.

@manosnoam
Contributor Author

@linanh Can you share more info please, perhaps the failed pod logs?

@manosnoam manosnoam reopened this Sep 2, 2020
Submariner Project v0.6.0 automation moved this from Done to In progress Sep 2, 2020
@sridhargaddam
Member

> If nodes are named with an FQDN (hostname.domainname) and the pod's node name uses only part of the FQDN (hostname), then the issue still exists.

The following PRs address this issue in Submariner:
#737
#738

Please check if you are using the Submariner image which includes these fixes. Also, if you can share the logs for submariner-route-agent pod, we can look into it.

@linanh

linanh commented Sep 2, 2020

  1. OS hostname is named "node-01.example.com"
  2. The node name is named "node-01" using rancher
  3. error log:
    Hostname is "node-01.example.com" and routeAgentNodeName is ""
    Error running route controller: could not get the nodeName on host "node-01.example.com"
  4. Environment:
    submariner v0.6.0
    rancher rke (kubernetes v1.18.6)
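
The mismatch reported above (OS hostname "node-01.example.com", node name "node-01") can be tolerated by comparing short names. The following Go sketch is an illustration only, not the code from any Submariner PR; `shortName` and `matchNodeName` are hypothetical helpers, and the list of node names is assumed to come from the Kubernetes API.

```go
package main

import (
	"fmt"
	"strings"
)

// shortName returns the first DNS label of a host or node name,
// e.g. "node-01" for "node-01.example.com".
func shortName(name string) string {
	return strings.SplitN(name, ".", 2)[0]
}

// matchNodeName picks the node name that corresponds to hostname.
// An exact match wins; otherwise names are compared by their first label,
// which covers both "FQDN hostname, short node name" and the reverse.
func matchNodeName(hostname string, nodeNames []string) (string, error) {
	for _, n := range nodeNames {
		if n == hostname {
			return n, nil
		}
	}

	want := shortName(hostname)
	match := ""
	for _, n := range nodeNames {
		if shortName(n) != want {
			continue
		}
		if match != "" {
			// Two nodes share the same first label; refuse to guess.
			return "", fmt.Errorf("hostname %q matches both %q and %q", hostname, match, n)
		}
		match = n
	}
	if match == "" {
		return "", fmt.Errorf("no node matches hostname %q", hostname)
	}
	return match, nil
}

func main() {
	name, err := matchNodeName("node-01.example.com", []string{"node-01", "node-02"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

Short-name matching is a heuristic: it fails on purpose when two nodes share a first label, since guessing could annotate the wrong node.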

@sridhargaddam
Member

> 1. OS hostname is named "node-01.example.com"
> 2. The node name is named "node-01" using rancher
> 3. error log:
>    Hostname is "node-01.example.com" and routeAgentNodeName is ""

That's interesting. Let's see the nodeName configured in the Pod.spec. Can you share the output of the following command?
kubectl get pods --selector=app=submariner-routeagent -o jsonpath='{.items[*].spec.nodeName}' -n submariner-operator

@linanh

linanh commented Sep 2, 2020

@sridhargaddam, the command output is "node-01"

@sridhargaddam
Member

> @sridhargaddam, the command output is "node-01"

Thanks for the details @linanh
I think this needs to be handled. Will push a PR shortly to fix it.

sridhargaddam added a commit to sridhargaddam/submariner that referenced this issue Sep 3, 2020
On Rancher cluster, it was observed that OS.Hostname was configured
as "node-01.example.com" whereas the corresponding nodeName was "node-01".
This was not handled in the code and was causing RouteAgent to fail.
This PR fixes it.

Fixes Issue: submariner-io#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
@sridhargaddam
Member

> > @sridhargaddam, the command output is "node-01"
>
> Thanks for the details @linanh
> I think this needs to be handled. Will push a PR shortly to fix it.

@linanh Thanks for reporting the issue. The following PR will fix it: #783

@sridhargaddam sridhargaddam moved this from In progress to Review / Verify in Submariner Project v0.6.0 Sep 3, 2020
sridhargaddam added a commit that referenced this issue Sep 4, 2020
On Rancher cluster, it was observed that OS.Hostname was configured
as "node-01.example.com" whereas the corresponding nodeName was "node-01".
This was not handled in the code and was causing RouteAgent to fail.
This PR fixes it.

Fixes Issue: #736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
@sridhargaddam
Member

@linanh the necessary fix is now merged, so closing the issue. Please let us know if you still see the issue.

Submariner Project v0.6.0 automation moved this from Review / Verify to Done Sep 7, 2020
@linanh

linanh commented Sep 7, 2020

Thanks, @sridhargaddam. With the Submariner devel image, the submariner-route-agent runs normally.

@mangelajo
Contributor

@Oats87 ^ :-)

novad03 pushed a commit to novad03/k8s-submariner that referenced this issue Nov 25, 2023
On AWS Clusters, it was seen that nodes are named with FQDN
(i.e., hostname.domainname). However, on many of the other
K8s clusters we tested so far (like KIND, OSP etc), the
hostnames were matching with the nodeNames.

In order to support connectivity from HostNetwork to
remoteClusters (when using Globalnet), routeagent Pod
annotates the respective nodes with the cniIfaceIP
annotation. In the current code, where this annotation
is added, we were using the hostname to query the node
info. This was failing on AWS Clusters for the issue
mentioned above. This PR fixes it.

Fixes issue: submariner-io/submariner#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
novad03 pushed a commit to novad03/k8s-submariner that referenced this issue Nov 25, 2023
Normally, most of the platforms are configured with hostname without the domainname.
However, on AWS, for one of the nodes, it was seen that hostname was configured as
FQDN (i.e., hostname.domainname) while all the remaining nodes were configured with
just the hostname alone. Because of this, route-agent was failing to start on that
node. This PR fixes it.

Fixes issue: submariner-io/submariner#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>
novad03 pushed a commit to novad03/k8s-submariner that referenced this issue Nov 25, 2023
On Rancher cluster, it was observed that OS.Hostname was configured
as "node-01.example.com" whereas the corresponding nodeName was "node-01".
This was not handled in the code and was causing RouteAgent to fail.
This PR fixes it.

Fixes Issue: submariner-io/submariner#736

Signed-Off-by: Sridhar Gaddam <sgaddam@redhat.com>