Wait should be implemented in start/stop functions in RHEL6 init script #26

Closed
pdilung opened this issue Dec 23, 2015 · 6 comments

Comments

@pdilung
Contributor

pdilung commented Dec 23, 2015

Hi,

I've come across a problem that is likely related to the startup/shutdown of consul on RHEL 6. The details follow:

Version:

  • consul version: 0.5.2
  • consul-rpm version: 0.5.2-2.17.el6

Details:

  1. It often takes some time until port 8400 is listening after consul has been started by the daemon function.
  2. It often takes some time until consul shuts down after being killed by killproc.

This leads to the following errors:

  1. Connections to agent port 8400 are refused when attempted right after startup.
  2. The service fails to start upon restart action because some ports are still bound.
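
Both symptoms appear to come from the init script returning before consul has reached the expected state. A minimal sketch of a bounded polling loop for the start case is shown here. The helper names `wait_until` and `port_is_open` are hypothetical and are not taken from the init script or from PR #25, and the port check assumes a bash built with `/dev/tcp` redirection support.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not the code from PR #25.

# Run a command once per second until it succeeds or <timeout> seconds pass.
# Usage: wait_until <timeout> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
    local timeout=$1; shift
    local waited=0
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Succeed if something accepts TCP connections on <host> <port>.
# Assumption: bash was built with /dev/tcp redirection enabled.
port_is_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
```

In a start function this could be called after the daemon function returns, for example `wait_until 10 port_is_open 127.0.0.1 8400`, reporting failure if it times out.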

Steps to reproduce:

  1. Refusal of connection to agent port 8400:

    # service consul start && echo '>>> trying to join after start'; consul join -rpc-addr=127.0.0.1:8400 <ip_addr>; echo '>>> trying to join after 2s sleep'; sleep 2; consul join -rpc-addr=127.0.0.1:8400 <ip_addr>
    Starting consul:                                           [  OK  ]
    >>> trying to join after start
    Error connecting to Consul agent: dial tcp 127.0.0.1:8400: connection refused
    >>> trying to join after 2s sleep
    Successfully joined cluster by contacting 1 nodes.
    
    # service consul start; netstat -nptl | grep '8400.*LISTEN' || echo "no listener on tcp 8400"; sleep 1 && netstat -nptl | grep '8400.*LISTEN'
    Starting consul:                                           [  OK  ]
    no listener on tcp 8400
    tcp        0      0 127.0.0.1:8400              0.0.0.0:*                   LISTEN      21187/consul
    
  2. The service fails to start upon restart action because some ports are still bound.

    # service consul restart; echo '>>> after restart' && netstat -nptl | grep consul; echo '>>> after 2s sleep'; sleep 2; netstat -nptl | grep consul; echo '>>> log output'; grep 'address already in use' /var/log/consul | tail -1; echo -n '>>> consul pid: '; pgrep consul || echo 'N/A'
    Shutting down consul:                                      [  OK  ]
    Starting consul:                                           [  OK  ]
    >>> after restart
    tcp        0      0 <host_ip>:8301              0.0.0.0:*                   LISTEN      22934/consul
    tcp        0      0 127.0.0.1:8400              0.0.0.0:*                   LISTEN      22934/consul
    tcp        0      0 127.0.0.1:8500              0.0.0.0:*                   LISTEN      22934/consul
    tcp        0      0 127.0.0.1:8600              0.0.0.0:*                   LISTEN      22934/consul
    >>> after 2s sleep
    >>> log output
    ==> Error starting agent: Failed to start Consul client: Failed to start lan serf: Failed to start TCP listener. Err: listen tcp <host_ip>:8301: bind: address already in use
    >>> consul pid: N/A
    

Unfortunately, I haven't investigated this on RHEL 7, nor have I tried different consul versions yet. However, I have implemented a wait in the RHEL 6 init script in PR #25.

@duritong, @tomhillable: Would you please review and, if it looks good, merge?

Thanks.

@duritong
Collaborator

Reading https://consul.io/docs/commands/leave.html and https://consul.io/docs/agent/basics.html, I see two different ways of shutting it down, but with the same outcome: the node leaves the cluster.

The only disadvantage I see with using consul leave is that it won't work out of the box if someone puts the RPC agent on a port other than the default: consul leave would then need to be told the different port, while killproc won't. Any specific reason why you (@pdilung) went with consul leave? If not, I would propose staying with killproc to avoid any problems when the RPC agent is put on a different port.

The other question is if we really wanna wait 60s by default or maybe just 10? I have no idea what a reasonable timeout might be, but 60s sounds quite long.

Otherwise it looks fine to me, though I think the problem might be different or non-existent on EL7, as systemd might take care of it.
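
To illustrate why killproc combined with a wait does not depend on the RPC port, here is a rough sketch of a stop-side wait that polls the process ID instead of any listener. The function name `wait_for_exit` is hypothetical, and how the PID is obtained (for example from a pid file) is left as an assumption. This is not the code that was merged.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not the code from PR #25.

# Wait until the process with <pid> has exited.
# Usage: wait_for_exit <pid> <timeout-seconds>
# Returns 0 once the process is gone, 1 if it is still alive after the timeout.
wait_for_exit() {
    local pid=$1 timeout=$2 waited=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A stop function could call this right after killproc and only report success once it returns 0, so that a following start does not hit ports that are still bound.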

@pdilung
Contributor Author

pdilung commented Dec 23, 2015

@duritong:
ad. consul leave vs killproc: I am fine with killproc and can reimplement this.
ad. MAXWAIT: This is configurable in /etc/sysconfig/consul; however, I have no objection to using 10s as a sane default. I tried running it in a while loop with 10s and it works just fine.
Let me know and I can push the changes you propose into pdilung:fix-rh6-init-script branch.
ad. RHEL7 & systemd: Let's try it in our lab :)
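
As a sketch of how a configurable MAXWAIT with a 10s fallback might look, assuming the init script sources /etc/sysconfig/consul as described above: the function name `load_maxwait` and the rejection of non-numeric values are illustrative additions, not something confirmed by the thread.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not the code from PR #25.

# Source the sysconfig file if it is readable, then make sure MAXWAIT
# holds a non-negative integer, falling back to 10 seconds otherwise.
# Usage: load_maxwait [config-file]
load_maxwait() {
    local conf=${1:-/etc/sysconfig/consul}
    if [ -r "$conf" ]; then
        . "$conf"
    fi
    case ${MAXWAIT:-} in
        ''|*[!0-9]*) MAXWAIT=10 ;;   # unset, empty, or not a plain number
    esac
}
```

With this in place, an administrator could set a line such as `MAXWAIT=30` in the sysconfig file to override the default.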

@duritong
Collaborator

Sounds fine, let's update the branch then.

@pdilung
Contributor Author

pdilung commented Dec 23, 2015

OK, done. I tested it and it seems to work fine.

@duritong
Collaborator

Thanks a lot, I merged the PR. If you find any problems on EL7, let's open another ticket.

@pdilung
Contributor Author

pdilung commented Dec 23, 2015

OK, let's do so then
