Updating docker stack crashes cluster #1783

Closed
regnerisch opened this issue Jun 11, 2024 · 4 comments


regnerisch commented Jun 11, 2024

Description

We run a 3-node Docker Swarm cluster with Typesense on each node. It's configured as described in this guide: https://typesense.org/docs/guide/docker-swarm-high-availability.html

After updating the Typesense configuration (e.g. adding new labels to the services), I run docker stack deploy -c docker-stack.yaml my_stack. The logs of the Typesense container then show the following errors:

W20240611 13:49:50.319334   166 raft_server.cpp:721] Multi-node with no leader: refusing to reset peers.
I20240611 13:49:55.816380   205 node.cpp:1579] node default_group:10.0.2.163:8107:8108 term 2 start pre_vote
W20240611 13:49:55.816478   205 node.cpp:1589] node default_group:10.0.2.163:8107:8108 can't do pre_vote as it is not in 10.0.2.51:8107:8108,10.0.2.57:8107:8108,10.0.2.63:8107:8108
I20240611 13:50:00.321345   166 raft_server.cpp:693] Term: 2, pending_queue: 0, last_index: 3, committed: 0, known_applied: 0, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 0

The nodes file looks like: typesense-1:8107:8108,typesense-2:8107:8108,typesense-3:8107:8108

It seems that Typesense resolves the hostnames and caches the resolved IP addresses. This leads to a cluster crash after updating the stack, as the containers may get different IPs. I specified 10.0.2.0/24 as the subnet and would expect Typesense to find the new IP addresses, especially because I specified hostnames and not static IPs.
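To make the pre_vote warning above concrete, here is a minimal Python sketch of the membership check it describes. This is not Typesense's code; the helper names parse_nodes and is_member are made up for illustration, and the only assumption is that each entry has the form host:peering_port:api_port, as in the nodes file. The addresses are copied from the log.

```python
def parse_nodes(nodes):
    """Split a nodes string into (host, peering_port, api_port) tuples."""
    entries = []
    for item in nodes.strip().split(","):
        # rsplit keeps the host intact and takes the last two fields as ports
        host, peering_port, api_port = item.strip().rsplit(":", 2)
        entries.append((host, int(peering_port), int(api_port)))
    return entries


def is_member(own_address, peers):
    """Return True if own_address (host:peering_port:api_port) is listed in peers."""
    return parse_nodes(own_address)[0] in parse_nodes(peers)


# Peer list and node address taken from the log lines above.
peers = "10.0.2.51:8107:8108,10.0.2.57:8107:8108,10.0.2.63:8107:8108"
print(is_member("10.0.2.163:8107:8108", peers))  # new container IP -> False
print(is_member("10.0.2.57:8107:8108", peers))   # old peer address -> True
```

The recreated container has 10.0.2.163, which is not among the three peer addresses the node still knows about, so it cannot take part in the vote.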

Steps to reproduce

  • Set up Docker Swarm (at least 2 nodes)
  • Set up Typesense as described at https://typesense.org/docs/guide/docker-swarm-high-availability.html
  • Change some service config inside docker-stack.yaml, e.g. add labels or a depends_on, to force docker stack deploy to create new containers
  • Run docker stack deploy -c docker-stack.yaml {stack_name}
  • Watch the logs with docker logs -f {container_id}

Expected Behavior

Cluster will stay alive.

Actual Behavior

Cluster crashes.

Metadata

Typesense Version: 26

OS: Fedora 39 (Docker version 26.1.4, build 5650f9b)

@regnerisch changed the title from "Updating docker stack crashes cluster config" to "Updating docker stack crashes cluster" on Jun 11, 2024
@kishorenc
Member

> It seems that typesense resolves the hostnames and caches the resolved ip addresses.

We don't cache the IP -- we resolve it periodically when we check for changes to the nodes file.

You have to do a rolling rotation of the nodes, with enough time for the cluster to recover in between, because a 3-node raft cluster requires at least 2 nodes to be up to function healthily.

Otherwise, you can add the --reset-peers-on-error flag to make the cluster hard-reset its clustering state if multiple containers have been rotated in one go.
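The 2-of-3 requirement follows from the usual Raft majority rule. A small Python sketch of that arithmetic (the function names are illustrative, not part of Typesense):

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size, nodes_up):
    """Return True if enough nodes are up to elect a leader and commit writes."""
    return nodes_up >= quorum(cluster_size)


print(quorum(3))         # 2
print(has_quorum(3, 2))  # True: one node can be rotated at a time
print(has_quorum(3, 1))  # False: rotating two nodes at once loses the majority
```

Recreating all three containers in one deploy drops the cluster below this threshold, which matches the "no leader" warning in the log.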

@regnerisch
Author

Thanks. If --reset-peers-on-error is enabled, can we lose all the data, since only one container writes it? Or will it keep the data and "recover" it?

@kishorenc
Member

We don't recommend using that flag if you have a high volume of writes. The best way to do rotation is container by container. Some people have mostly static clusters, where they don't mind using this flag for fast recovery.
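A container-by-container rotation could be scripted roughly as follows. This is a hedged sketch, not an official procedure: it assumes each Typesense node is its own Swarm service, that docker service update --force is an acceptable way to recreate one service's container, and that each node's /health endpoint is reachable from where the script runs. The service names and URLs in the usage comment are hypothetical.

```python
import json
import subprocess
import time
import urllib.request


def restart_service(service):
    """Force Docker Swarm to recreate the tasks of a single service."""
    subprocess.run(["docker", "service", "update", "--force", service], check=True)


def is_healthy(url):
    """Return True if GET <url> answers with a JSON object containing "ok": true."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp).get("ok") is True
    except (OSError, ValueError, AttributeError):
        # Connection errors, HTTP errors, or an unexpected body all count as unhealthy.
        return False


def rolling_rotate(nodes, restart=restart_service, healthy=is_healthy,
                   timeout=300.0, interval=5.0, sleep=time.sleep):
    """Rotate one node at a time, waiting for each to report healthy.

    nodes is a list of (service_name, health_url) pairs. A TimeoutError is
    raised if a node does not become healthy within `timeout` seconds, so
    the remaining nodes are left untouched.
    """
    for service, url in nodes:
        restart(service)
        waited = 0.0
        while not healthy(url):
            if waited >= timeout:
                raise TimeoutError(f"{service} not healthy after {timeout}s")
            sleep(interval)
            waited += interval


# Example call (hypothetical service names and URLs):
# rolling_rotate([
#     ("my_stack_typesense-1", "http://typesense-1:8108/health"),
#     ("my_stack_typesense-2", "http://typesense-2:8108/health"),
#     ("my_stack_typesense-3", "http://typesense-3:8108/health"),
# ])
```

Stopping at the first node that fails to recover keeps at least two of the three nodes on their old containers, so the cluster retains its majority.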

@regnerisch
Author

Thank you very much!
