Skip to content
This repository was archived by the owner on Nov 18, 2020. It is now read-only.

Conversation

hairyhum
Copy link
Contributor

@hairyhum hairyhum commented Feb 1, 2019

Shutdown waits for the node to stop based on OS pid,
if the node is not local, the pid file won't be local and will never exist.

[#142699795]

It's not that easy to autotest this feature, because it requires a node on a different host.

Shutdown waits for the node to stop based on OS pid,
if the node is not local, it will be the wrong pid.

[#142699795]
@hairyhum hairyhum changed the title Make shutdown command require a node to be local. Make shutdown command require the node to be local. Feb 1, 2019
@michaelklishin
Copy link
Contributor

I've moved the check to a validator.

@michaelklishin
Copy link
Contributor

michaelklishin commented Feb 3, 2019

Let's double check if the BOSH release may be using this in a non-local scenario under any circumstances since it technically does work (the node will stop), even though not really as expected (the command won't wait for shutdown since there never will be a local pid file).

@michaelklishin
Copy link
Contributor

@gerhard FYI.

Add --[no-]wait (enabled by default) for those who would prefer
to use it to shut down remote nodes even though it wouldn't wait for
a verified node termination.

Propagate --timeout to calls.

References #309.
@michaelklishin
Copy link
Contributor

I introduced a new flag, --[no]-wait, that lets the user shut down remote nodes without waiting for a confirmed termination if she opts in.

@michaelklishin michaelklishin merged commit 69b1e57 into master Feb 3, 2019
@michaelklishin michaelklishin deleted the shutdown_require_local_host branch February 3, 2019 17:43
@michaelklishin michaelklishin added this to the 3.7.12 milestone Feb 3, 2019
michaelklishin added a commit that referenced this pull request Feb 3, 2019
Add --[no-]wait (enabled by default) for those who would prefer
to use it to shut down remote nodes even though it wouldn't wait for
a verified node termination.

Propagate --timeout to calls.

References #309.

(cherry picked from commit 6d73256)
@michaelklishin
Copy link
Contributor

Backported to v3.7.x.

michaelklishin added a commit that referenced this pull request Feb 3, 2019
michaelklishin added a commit that referenced this pull request Feb 3, 2019
References #309.

(cherry picked from commit ee74f46)
@gerhard
Copy link
Contributor

gerhard commented Feb 4, 2019

This makes sense, good catch.

I can see that cf-rabbitmq-release uses shutdown with remote nodes, this change is likely to impact them. cc @rabbitmq/rabbitmq-pcf-tile-team

rabbitmq-server-boshrelease uses shutdown with local node, no impact. cc @rabbitmq/sme

@michaelklishin michaelklishin changed the title Make shutdown command require the node to be local. Make shutdown command require the node to be local (unless --no-wait is provided) Feb 6, 2019
michaelklishin added a commit that referenced this pull request Feb 7, 2019
michaelklishin added a commit that referenced this pull request Feb 7, 2019
@gerhard
Copy link
Contributor

gerhard commented Mar 6, 2019

We've discovered with @nodo that in cf-rabbitmq-release inet_db:gethostname/0 returns localhost for all nodes. This results in the remote node check failing, since hostname == remote_hostname is always true. Because hostnames are resolved via erl_inetrc, the OS cannot resolve RabbitMQ node hostnames - including the local one - which is why the hostname defaults to localhost on all nodes.

Should we make the remote node check account for this edge-case?

FWIW, the order in which inet_db data gets loaded in Erlang/OTP v20.3.8.20

@lukebakken
Copy link
Contributor

If they are both localhost we could just print a warning that they're using shutdown in yolo mode 😄

@michaelklishin
Copy link
Contributor

@gerhard if a hostname is localhost for all nodes, how can they be distinguished? I would strongly prefer to see this fixed in cf-rabbitmq-release over us adding more hacks.

@lukebakken
Copy link
Contributor

FWIW this is probably pretty common. Here's what I see on my own workstation -

^C(21.2.6)lbakken@shostakovich ~/development/rabbitmq/umbrella (master=)
$ erl
Erlang/OTP 21 [erts-10.2.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

Eshell V10.2.4  (abort with ^G)
1> inet_db:gethostname().
"localhost"

(21.2.6)lbakken@shostakovich ~/development/rabbitmq/umbrella (master=)
$ hostname
shostakovich

(21.2.6)lbakken@shostakovich ~/development/rabbitmq/umbrella (master=)
$ hostname -f
localhost.localdomain

The startup scripts use $HOSTNAME and / or the output of hostname -f to set node names, and not inet_db which is why this sort of thing doesn't affect RMQ operation normally.

@gerhard
Copy link
Contributor

gerhard commented Mar 6, 2019

You might be thinking about the value returned innode(), and the nodename that gets used for RPC calls. inet_db:get_hostname() differs from that. If erl_inetrc is used to resolve node hostnames, instead of the OS, inet_db:get_hostname() will diverge from node() hence the troubles.

@michaelklishin
Copy link
Contributor

If inet_db doesn't take node name or ERL_INETRC into account, we should use a different mechanism.

@gerhard
Copy link
Contributor

gerhard commented Mar 6, 2019

Let me reproduce this issue locally so that it's easier to fix.

@nodo
Copy link

nodo commented Mar 6, 2019

@gerhard happy to help if you want to.

michaelklishin added a commit that referenced this pull request Mar 6, 2019
inet_db is not a very reliable source as it doesn't take
node name CLI arguments and ERL_INETRC file settings.
That can lead to false positives in environments where
inet_db returns the same value (e.g. `localhost`) for
every cluster member.

Per discussion with @gerhard.

Closes #327.
References #309.
michaelklishin added a commit that referenced this pull request Mar 6, 2019
inet_db is not a very reliable source as it doesn't take
node name CLI arguments and ERL_INETRC file settings.
That can lead to false positives in environments where
inet_db returns the same value (e.g. `localhost`) for
every cluster member.

Per discussion with @gerhard.

Closes #327.
References #309.

(cherry picked from commit f83ea58)
michaelklishin added a commit that referenced this pull request Mar 6, 2019
Without this -n has to be used when it previously wasn't required.

Follow-up to #328, references #327, #309.

Per discussion with @lukebakken.
michaelklishin added a commit that referenced this pull request Mar 6, 2019
Without this -n has to be used when it previously wasn't required.

Follow-up to #328, references #327, #309.

Per discussion with @lukebakken.

(cherry picked from commit 952ec9e)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants