
Restarting the exporter pod causes a crash due to port already in use #11831

Closed
travisn opened this issue Mar 3, 2023 · 8 comments · Fixed by #12193 or #12215

Comments

@travisn
Member

travisn commented Mar 3, 2023

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
Restarting the exporter pod causes it to fail since the previous exporter has not yet freed the resources while its pod is still terminating.

    -3> 2023-03-03T19:44:07.291+0000 ffffaa1be010 -1 asok(0xaaab0ae8d310) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
     0> 2023-03-03T19:44:07.295+0000 ffffaa1be010 -1 *** Caught signal (Segmentation fault) **
 in thread ffffaa1be010 thread_name:ceph-exporter

 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
 1: __kernel_rt_sigreturn()
 2: (std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x24) [0xffffa90defbc]
 3: (DaemonMetricCollector::dump_asok_metrics()+0x1d8) [0xaaaacc93b9ec]
 4: ceph-exporter(+0x6dcb0) [0xaaaacc93dcb0]
 5: ceph-exporter(+0x6e1fc) [0xaaaacc93e1fc]
 6: (boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&)+0x1ac) [0xaaaacc9452dc]
 7: ceph-exporter(+0x6326c) [0xaaaacc93326c]
 8: (DaemonMetricCollector::main()+0x88) [0xaaaacc91dd08]
 9: main()
 10: __libc_start_main()
 11: ceph-exporter(+0x4e978) [0xaaaacc91e978]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

After restarting the exporter pod again, I see this error in the log, but the pod was then able to retry and continue automatically.

2023-03-03T19:45:29.284+0000 ffffb6051010 -1 asok(0xaaaaf5b5f310) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists

Expected behavior:
Restarting the pod should succeed. If the previous exporter pod is not terminated yet, the new pod should wait for it to exit, or perhaps just retry until the port is available. But if it takes too long (perhaps 60s), it should give up and crash.

How to reproduce it (minimal and precise):

  • Start the cluster with the exporter enabled
  • Wait for the exporter to start successfully
  • kubectl -n rook-ceph delete pod <exporter>
  • Watch for the new exporter to fail

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
    When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of krew commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook Krew Plugin

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
@travisn travisn added this to To do in v1.11 via automation Mar 3, 2023
@travisn travisn moved this from To do to Blocking Release in v1.11 Mar 23, 2023
@taxilian
Contributor

In case anyone else is hitting this, the mitigation is to scale the affected deployment to --replicas 0 and then, once it terminates, back to --replicas 1.

This is not debilitating but it is quite annoying =]

@uhthomas
Contributor

I think I see this too.

bash-4.4$ ceph crash info 2023-04-08T07:33:36.872560Z_8881ab4f-fc4d-4179-998e-6c2179a345df
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f7b453eecf0]",
        "gsignal()",
        "abort()",
        "/lib64/libstdc++.so.6(+0x9009b) [0x7f7b445e809b]",
        "/lib64/libstdc++.so.6(+0x9653c) [0x7f7b445ee53c]",
        "/lib64/libstdc++.so.6(+0x96597) [0x7f7b445ee597]",
        "/lib64/libstdc++.so.6(+0x967f8) [0x7f7b445ee7f8]",
        "(boost::json::detail::throw_system_error(boost::system::error_code const&, boost::source_location const&)+0x107) [0x565554483a71]",
        "ceph-exporter(+0x4f5c1) [0x5655544845c1]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x4e4) [0x5655544a3e04]",
        "ceph-exporter(+0x71233) [0x5655544a6233]",
        "ceph-exporter(+0x717cd) [0x5655544a67cd]",
        "(boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&)+0x3da) [0x5655544acdba]",
        "ceph-exporter(+0x657a9) [0x56555449a7a9]",
        "(DaemonMetricCollector::main()+0x94) [0x565554486f44]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-04-08T07:33:36.872560Z_8881ab4f-fc4d-4179-998e-6c2179a345df",
    "entity_name": "client.admin",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-exporter",
    "stack_sig": "65c05e9ed3f5dacba5f4686c8740dcb604c07b1781a027a2c9d308259a15b64c",
    "timestamp": "2023-04-08T07:33:36.872560Z",
    "utsname_hostname": "rook-ceph-exporter-talos-su3-l23-6f7dd54444-cxs8t",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.102-talos",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Mon Mar 13 18:10:38 UTC 2023"
}

@travisn
Member Author

travisn commented Apr 12, 2023

@avanthakkar Are you looking into this?

@travisn
Member Author

travisn commented Apr 12, 2023

The exporter is being disabled with #12077, so this issue will have more time for investigation...

@travisn travisn moved this from Blocking Release to In progress in v1.11 Apr 12, 2023
v1.11 automation moved this from In progress to Done May 8, 2023
@travisn
Member Author

travisn commented May 8, 2023

Should not have been closed

@travisn travisn reopened this May 8, 2023
v1.11 automation moved this from Done to In progress May 8, 2023
@avanthakkar
Member

I wasn't able to reproduce this locally with v1.11.5. The exporter restarts without any crash.
@uhthomas @taxilian Do you mind checking again with the latest version to see if you're still able to reproduce it?

@taxilian
Contributor

taxilian commented May 9, 2023

Is it even creating them now? I haven't seen the issue, but I also only have an exporter deployment on 3 of my 6 nodes.

@travisn
Member Author

travisn commented May 9, 2023

I am not able to reproduce this anymore after enabling the exporter with a local code change to allow v17.2.6.

The exporter log does sometimes show the following error, but it is then able to retry and start successfully. Some related fix must have resolved this in 17.2.6. About every other restart I see this log error; other times it succeeds on the first try.

2023-05-09T19:13:18.673+0000 ffffa325f010 -1 asok(0xaaaadba22cb0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/ceph-client.admin.asok': (17) File exists
system:0
system:0

@avanthakkar Note that I also experimented with setting TerminationGracePeriodSeconds: 2. It didn't change whether I saw the message, but it did greatly shorten the time the exporter stayed in the terminating state before the pod was deleted. Unless there is a reason to wait longer for the exporter to shut down, go ahead and open a PR to apply this. We can close this issue unless someone else reports still seeing it on 17.2.6 (which would require Rook v1.11.4, since the exporter was disabled again in v1.11.5).
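
For reference, the grace-period change lands in the exporter Deployment's pod template. A minimal sketch (container name and image are illustrative, not the exact manifest Rook generates):

```yaml
# Fragment of the exporter Deployment's pod template; only
# terminationGracePeriodSeconds is the relevant change here.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 2   # Kubernetes default is 30s; this
                                         # shortens how long the old pod can
                                         # hold the admin socket and port
      containers:
        - name: ceph-exporter            # illustrative
          image: quay.io/ceph/ceph:v17.2.6
```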

v1.11 automation moved this from In progress to Done May 10, 2023