Windows: upgrade from 3.5.4 -> 3.7.4 can fail due to computed node name case differences #1568

Closed
Bhaal22 opened this Issue Mar 29, 2018 · 13 comments

Bhaal22 commented Mar 29, 2018

Hi,

I am experiencing migration issues when upgrading from 3.5.4 to 3.7.4 in a Windows environment.
I prepared Windows Docker containers to make the issue easier to reproduce. We see the same issue on Windows virtual machines.

How to reproduce

  • set your machine name to rmq (this is easier in a container than on a real Windows machine)
  • start a RabbitMQ 3.5.4 (OTP 18.3) instance
  • stop the broker
  • start a RabbitMQ 3.7.4 (OTP 20.0) instance using the same RABBITMQ_BASE folder
  • the migration fails with this message:
BOOT FAILED
===========

Error description:
    init:do_boot/3 line 793
    init:start_em/1 line 1085
    rabbit:start_it/1 line 445
    rabbit:'-boot/0-fun-0-'/0 line 296
    rabbit_upgrade:run_mnesia_upgrades/2 line 155
    rabbit_upgrade:die/2 line 209
    io:format(<0.56.0>, "\n\n****\n\nCluster upgrade needed but other disc nodes shut down after this one.\nPlease first star...", [])
error:badarg
Log file(s) (may contain more information):
   c:/rmq-data/log/RABBIT~1.LOG
   c:/rmq-data/log/rabbit@rmq_upgrade.log

{"init terminating in do_boot",badarg}
init terminating in do_boot (badarg)

Crash dump is being written to: c:\rmq-data\log\erl_crash.dump...done

Investigations done

if "!RABBITMQ_NODENAME!"=="" (
    if "!NODENAME!"=="" (
        set RABBITMQ_NODENAME=rabbit@!COMPUTERNAME!
    ) else (
        set RABBITMQ_NODENAME=!NODENAME!
    )
)

With this logic, the default node name is rabbit@COMPUTERNAME, where the COMPUTERNAME value is all uppercase.

if "!RABBITMQ_NODENAME!"=="" (
    if "!NODENAME!"=="" (
        REM We use Erlang to query the local hostname because
        REM !COMPUTERNAME! and Erlang may return different results.
        REM Start erl with -sname to make sure epmd is started.
        call "%ERLANG_HOME%\bin\erl.exe" -A0 -noinput -boot start_clean -sname rabbit-prelaunch-epmd -eval "init:stop()." >nul 2>&1
        for /f "delims=" %%F in ('call "%ERLANG_HOME%\bin\erl.exe" -A0 -noinput -boot start_clean -eval "net_kernel:start([list_to_atom(""rabbit-gethostname-"" ++ os:getpid()), %NAMETYPE%]), [_, H] = string:tokens(atom_to_list(node()), ""@""), io:format(""~s~n"", [H]), init:stop()."') do @set HOSTNAME=%%F
        set RABBITMQ_NODENAME=rabbit@!HOSTNAME!
        set HOSTNAME=
    ) else (
        set RABBITMQ_NODENAME=!NODENAME!
    )
)

With this logic, RabbitMQ generates rabbit@hostname, where hostname has the same value as the output of the hostname command in cmd.
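
To make the difference concrete, here is a minimal Python sketch that models the two default-name rules shown in the snippets above. It is only an illustration, not the actual launcher logic: the function names are hypothetical, and the hostname is passed in as a plain string instead of being queried through Erlang.

```python
def nodename_from_computername(env):
    """Model of the first snippet: rabbit@ plus the COMPUTERNAME variable."""
    return "rabbit@" + env["COMPUTERNAME"]


def nodename_from_hostname(hostname):
    """Model of the second snippet: rabbit@ plus the hostname as reported."""
    return "rabbit@" + hostname


# Values taken from the report: the machine is named rmq,
# while the COMPUTERNAME environment variable holds RMQ.
old_name = nodename_from_computername({"COMPUTERNAME": "RMQ"})
new_name = nodename_from_hostname("rmq")

# The two names differ only by case, yet they are not equal.
names_match = old_name == new_name
print(old_name, new_name, names_match)
```

The existing data directory was created under the first name, so a broker that starts under the second name sees a node it does not recognise as itself.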

Workaround

  • delete the db folder (not really possible)
  • manually set the RABBITMQ_NODENAME environment variable
  • rename the machine using all capital letters

How to reproduce with Windows Docker containers

The Docker Hub images are built from this repository: https://github.com/gsx-solutions/rmq-win

docker volume create rmq-data

docker run --rm -h rmq -v rmq-data:c:\rmq-data -ti gsxsolutions/rmq:3.5.4
docker run --rm -h rmq -v rmq-data:c:\rmq-data -ti gsxsolutions/rmq:3.7.4

Passing -h RMQ instead makes the upgrade work.

Thank you for your work and support.

Member

michaelklishin commented Mar 29, 2018

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team).

We get at least a dozen questions through various venues every single day, often light on details.
At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.

Please post this to rabbitmq-users.

Thank you.

Member

michaelklishin commented Mar 29, 2018

Cluster upgrades between feature versions require an ordered restart, which

Cluster upgrade needed but other disc nodes shut down after this one

hints at. That and more (e.g. Blue/Green deployment migrations) are documented in the Upgrade guide.

Author

Bhaal22 commented Mar 29, 2018

Yes, and according to the matrix, 3.5.x to 3.7.x is supported.

Member

michaelklishin commented Mar 29, 2018

We test quite a few upgrade permutations, including an upgrade from 3.5.8 as part of our CI pipeline.

Cluster upgrade from 3.5.x to 3.7.x will require a cluster-wide shutdown with an ordered restart, as the docs explain. Or you can do a Blue/Green deployment upgrade.

Sorry but there is no evidence of a bug. This is mailing list material at this point.

@rabbitmq rabbitmq locked and limited conversation to collaborators Mar 29, 2018

@rabbitmq rabbitmq unlocked this conversation Mar 29, 2018

Author

Bhaal22 commented Mar 29, 2018

The cluster has a single member node.

Do you have a CI pipeline for Windows environments?
Moreover, it happens only in one particular case:

when the COMPUTERNAME environment variable != hostname.

If my hostname is "rmq", then the COMPUTERNAME environment variable is "RMQ" (uppercase).
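
A quick way to check whether a machine is in this state is to compare the two values directly. The sketch below is a hedged illustration in Python: it assumes that socket.gethostname() reports the same value as the cmd hostname command, which is not verified in this thread, and the helper names are hypothetical.

```python
import os
import socket


def differs_only_by_case(a, b):
    """True when the two names are unequal but match case-insensitively."""
    return a != b and a.lower() == b.lower()


def check_local_names():
    """Compare COMPUTERNAME with the hostname reported by the OS."""
    # COMPUTERNAME is normally set on Windows only; default to "" elsewhere.
    computername = os.environ.get("COMPUTERNAME", "")
    hostname = socket.gethostname()
    if differs_only_by_case(computername, hostname):
        return "case mismatch: COMPUTERNAME=%r, hostname=%r" % (computername, hostname)
    return "no case-only mismatch detected"


print(check_local_names())
```

If a mismatch is reported, explicitly setting RABBITMQ_NODENAME, as listed in the workarounds, avoids relying on either default.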

Member

michaelklishin commented Mar 29, 2018

I now see you have a section about case sensitivity of node names. This is a never-ending source of fun on Windows, and keeping track of what was done by default in which version from years ago is not realistic. Setting RABBITMQ_NODENAME is a reasonable workaround which we will mention in the docs. Thanks for bringing this to our attention.

I suspect that other operating systems which tend to use case-insensitive filesystems are not affected.

Author

Bhaal22 commented Mar 29, 2018

It could also be an idea not to fall back on default values for such an environment variable.

I think we could hit the same issue on Linux if the machine is renamed using uppercase in /etc/hostname:

initially rmq and then RMQ.
I can test that quickly in a Linux container.

Member

michaelklishin commented Mar 29, 2018

@Bhaal22 we have Windows package tests but not upgrades on Windows. I filed a documentation guides issue and this will be covered. I'm not sure we can safely force a particular case assuming there are years worth of releases that do not do that.

Perhaps we can emit a warning of some kind when COMPUTERNAME and hostname do not match but most of our team have concluded that virtually no developer pays any attention to warnings or logs until things go south (in production, obviously).

Scenarios where hostnames have changed are the operator's responsibility. Explicitly setting RABBITMQ_NODENAME or using rabbitmqctl rename_cluster_node in that case will be needed for other reasons.

@michaelklishin changed the title from "RabbitMQ migration 3.5.4 -> 3.7.4 fails" to "Windows: upgrade from 3.5.4 -> 3.7.4 can fail due to computed node name case differences" Mar 29, 2018

Author

Bhaal22 commented Mar 29, 2018

The issue I see with this is the following.
When you start the migration with a different name (RMQ -> rmq),

the nodes_running_at_shutdown file is updated as if this were a cluster with 2 members,
and then the upgrade fails. So even if you rename your host afterwards, for example, the process will keep failing until you manually remove the wrong entries from this file.

Yeah, I understand.
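
The "cluster with 2 members" effect described above can be pictured as two entries that differ only by case. The Python sketch below groups such entries together. It assumes the node list has already been read into a plain list of strings; the on-disk format of nodes_running_at_shutdown is not shown in this thread, so parsing is left out, and the function name is hypothetical.

```python
from collections import defaultdict


def case_collisions(nodes):
    """Return groups of distinct node names that differ only by case."""
    groups = defaultdict(list)
    for node in nodes:
        key = node.lower()
        # Skip exact repeats so only genuinely different spellings are grouped.
        if node not in groups[key]:
            groups[key].append(node)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical list contents after starting once as RMQ and once as rmq.
suspicious = case_collisions(["rabbit@RMQ", "rabbit@rmq"])
print(suspicious)
```

A non-empty result would point at the stale entries that have to be removed by hand before the upgrade can proceed.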

Member

michaelklishin commented Mar 29, 2018

We recognise that it is not ideal and can be very confusing. Thank you for getting to the bottom of it.

If @lukebakken has ideas about what kind of change would be reasonably safe here, we'd be happy to file another issue and consider it.

We can modify the list loaded from nodes_running_at_shutdown to be all lowercase and filter out duplicates, for example, but I expect such a change to break unexpectedly in other scenarios, possibly even beyond Windows :(
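
As a rough illustration of that idea only, and not a proposed patch, the Python sketch below lowercases a node list and drops duplicates while keeping the original order. The function name and the list-of-strings input are assumptions made for the example; RabbitMQ itself is written in Erlang and, as noted above, a change like this is expected to be unsafe in other scenarios.

```python
def lowercase_and_dedupe(nodes):
    """Lowercase every node name and keep the first occurrence of each."""
    seen = set()
    result = []
    for node in nodes:
        lowered = node.lower()
        if lowered not in seen:
            seen.add(lowered)
            result.append(lowered)
    return result


# The two entries from the report collapse into a single node.
normalized = lowercase_and_dedupe(["rabbit@RMQ", "rabbit@rmq"])
print(normalized)
```

Note that this lowercases the whole name, so a deployment that really runs under an uppercase node name would no longer match its own entry, which is one way such a change could break things.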

This is yet another argument for Blue/Green deployment upgrades, which our docs don't promote enough.

Author

Bhaal22 commented Mar 29, 2018

@michaelklishin in fact, this issue already existed and you opened a PR for it,
which is closed but not merged as far as I can see:

#637

Author

Bhaal22 commented Mar 29, 2018

I agree, we are now in a weird state,

with some releases using uppercase and others not.
The PR is not even applicable: it would break existing deployments that use lowercase.

Member

michaelklishin commented Mar 29, 2018

The PR was indeed closed without merging because we figured it was not a safe thing to do.
