Cluster bootstrap fails ("too many colons in address") when IPv6 addresses are returned by DNS discovery #1500

Closed
jtackaberry opened this issue Dec 19, 2023 · 8 comments · Fixed by #1501

Comments

@jtackaberry
Contributor

What version are you running?

v8.12.1

Are you using Docker or Kubernetes to run your system?

Yes, Microk8s configured with IPv6 support (dual stack)

Are you running a single node or a cluster?

3 node cluster

What did you do?

While developing the Helm chart, I tested enabling dual-stack mode on the K8s service that rqlite uses for service discovery, so rqlite receives IPv6 addresses from DNS when discovering its peers. With both the dns and dns-srv discovery modes, the cluster fails to bootstrap with the following error:

[cluster-bootstrap] 2023/12/19 03:16:36 failed to notify all targets: [2001:470:b0d6:fd:dfe:afeb:1d8a:32bb:4001 10.3.50.187:4001 2001:470:b0d6:fd:ce70:bd4e:b908:3b6d:4001 10.3.59.109:4001 2001:470:b0d6:fd:8a42:53cc:a14a:5a26:4001 10.3.90.38:4001] (failed to notify node at 2001:470:b0d6:fd:dfe:afeb:1d8a:32bb:4001: factory is not able to fill the pool: dial tcp: address 2001:470:b0d6:fd:dfe:afeb:1d8a:32bb:4001: too many colons in address, will retry)

I haven't looked at the code, but my guess is that rqlite handles the DNS lookup, receives IPv6 addresses, appends :<port> to each address, and passes the result up to the raft library. I suspect the raft library requires IPv6 addresses to be in the bracketed form [<addr>]:<port>, for example [2001:470:b0d6:fd:dfe:afeb:1d8a:32bb]:4001, and that it's rqlite's responsibility to format them appropriately?

A bit of a tangent here, but note port 4001 above. As you can tell from the diagnostic info below, I started rqlited with -disco-mode=dns-srv -disco-config={"name":"rqlite-headless","service":"http"}. Even after I reverted the rqlite-headless service to IPv4-only, I still had problems with dns-srv. I specified http as the service name because the docs reference the HTTP port, but in practice discovery seems to require the raft port (4002), not the HTTP port. Changing disco-config to {"name":"rqlite-headless","service":"raft"}, which returns port 4002, fixed that additional issue.

Please include the Status, Nodes, and Expvar output from each node (or at least the Leader!)

Run from one of the nodes. Cluster didn't bootstrap so technically no leader.

Welcome to the rqlite CLI.
Enter ".help" for usage hints.
Connected to http://127.0.0.1:4001 running version v8.12.1
127.0.0.1:4001> .status
disco:
  dns_name:: _http._tcp.rqlite-headless.
  last_addresses: [2001:470:b0d6:fd:dfe:afeb:1d8a:32bb:4001 10.3.50.187:4001 2001:470:b0d6:fd:8a42:53cc:a14a:5a26:4001 10.3.90.38:4001 2001:470:b0d6:fd:ce70:bd4e:b908:3b6d:4001 10.3.59.109:4001]
  last_contact: 2023-12-19T03:17:40.861442717Z
  mode: dns-srv
  name: rqlite-headless
  service: http
network:
  interfaces:
    eth0:
      flags: up|broadcast|multicast|running
      hardware_address: f2:66:4a:6b:51:85
      addresses: [map[address:10.3.59.109/32] map[address:2001:470:b0d6:fd:ce70:bd4e:b908:3b6d/128] map[address:fe80::f066:4aff:fe6b:5185/64]]
    lo:
      flags: up|loopback|running
      hardware_address:
      addresses: [map[address:127.0.0.1/8] map[address:::1/128]]
node:
  uptime: 1m5.533166848s
  current_time: 2023-12-19T03:17:42.55135457Z
  start_time: 2023-12-19T03:16:37.018188494Z
store:
  reap_read_only_timeout: 0s
  last_applied_index: 0
  leader:
    node_id:
    addr:
  ready: false
  trailing_logs: 10240
  db_conf:
    fk_constraints: true
  snapshot_store:
    snapshots: []
    db_path:
    dir: /rqlite/rsnapshots
  dir: /rqlite
  raft:
    snapshot_version_max: 1
    snapshot_version_min: 0
    applied_index: 0
    commit_index: 0
    last_snapshot_index: 0
    latest_configuration_index: 0
    num_peers: 0
    protocol_version_max: 3
    last_log_term: 0
    last_snapshot_term: 0
    log_size: 32768
    term: 0
    bolt:
      OpenTxN: 0
      TxStats:
        CursorCount: 21
        NodeCount: 3
        RebalanceTime: 0
        Split: 0
        SpillTime: 20698
        WriteTime: 7094211
        PageCount: 4
        NodeDeref: 0
        Rebalance: 0
        Spill: 2
        Write: 6
        PageAlloc: 16384
      FreePageN: 0
      PendingPageN: 2
      FreeAlloc: 8192
      FreelistInuse: 32
      TxN: 7
    fsm_pending: 0
    latest_configuration: []
    protocol_version: 3
    protocol_version_min: 0
    last_contact: never
    last_log_index: 0
    state: Follower
    voter: false
  sqlite3:
    ro_dsn: file:/rqlite/db.sqlite?mode=ro&_fk=true
    wal_size: 0
    db_size: 4096
    path: /rqlite/db.sqlite
    pragmas:
      ro:
        foreign_keys: 1
        journal_mode: wal
        synchronous: 1
        wal_autocheckpoint: 1000
      rw:
        wal_autocheckpoint: 0
        foreign_keys: 1
        journal_mode: wal
        synchronous: 0
    rw_dsn: file:/rqlite/db.sqlite?_fk=true
    size: 4096
    version: 3.44.0
    compile_options: [ATOMIC_INTRINSICS=1 COMPILER=gcc-9.4.0 DEFAULT_AUTOVACUUM DEFAULT_CACHE_SIZE=-2000 DEFAULT_FILE_FORMAT=4 DEFAULT_JOURNAL_SIZE_LIMIT=-1 DEFAULT_MMAP_SIZE=0 DEFAULT_PAGE_SIZE=4096 DEFAULT_PCACHE_INITSZ=20 DEFAULT_RECURSIVE_TRIGGERS DEFAULT_SECTOR_SIZE=4096 DEFAULT_SYNCHRONOUS=2 DEFAULT_WAL_AUTOCHECKPOINT=1000 DEFAULT_WAL_SYNCHRONOUS=1 DEFAULT_WORKER_THREADS=0 ENABLE_DBSTAT_VTAB ENABLE_FTS3 ENABLE_FTS3_PARENTHESIS ENABLE_FTS5 ENABLE_RTREE ENABLE_UPDATE_DELETE_LIMIT MALLOC_SOFT_LIMIT=1024 MAX_ATTACHED=10 MAX_COLUMN=2000 MAX_COMPOUND_SELECT=500 MAX_DEFAULT_PAGE_SIZE=8192 MAX_EXPR_DEPTH=1000 MAX_FUNCTION_ARG=127 MAX_LENGTH=1000000000 MAX_LIKE_PATTERN_LENGTH=50000 MAX_MMAP_SIZE=0x7fff0000 MAX_PAGE_COUNT=1073741823 MAX_PAGE_SIZE=65536 MAX_SQL_LENGTH=1000000000 MAX_TRIGGER_DEPTH=1000 MAX_VARIABLE_NUMBER=32766 MAX_VDBE_OP=250000000 MAX_WORKER_THREADS=8 MUTEX_PTHREADS OMIT_DEPRECATED OMIT_LOAD_EXTENSION OMIT_SHARED_CACHE SYSTEM_MALLOC TEMP_STORE=1 THREADSAFE=1]
    conn_pool_stats:
      ro:
        max_idle_closed: 0
        open_connections: 1
        in_use: 0
        idle: 1
        wait_count: 0
        wait_duration: 0
        max_idle_time_closed: 0
        max_lifetime_closed: 0
        max_open_connections: 0
      rw:
        max_open_connections: 1
        in_use: 0
        wait_count: 0
        max_idle_closed: 0
        max_idle_time_closed: 0
        max_lifetime_closed: 0
        open_connections: 1
        idle: 1
        wait_duration: 0
    mem_stats:
      page_count: 1
      page_size: 4096
      soft_heap_limit: 0
      cache_size: -2000
      freelist_count: 0
      hard_heap_limit: 0
      max_page_count: 1073741823
  db_applied_index: 0
  fsm_index: 0
  open: true
  request_marshaler:
    compression_batch: 50
    compression_size: 1024
    force_compression: false
  addr: rqlite-1.rqlite-headless.data.svc.cluster.local:4002
  dir_size: 69632
  node_id: rqlite-1
  observer:
    dropped: 0
    observed: 0
  apply_timeout: 10s
  no_freelist_sync: false
  heartbeat_timeout: 1s
  nodes: []
  reap_timeout: 0s
  snapshot_interval: 30s
  snapshot_threshold: 8192
  election_timeout: 1s
build:
  branch: master
  build_time: 2023-12-18T11:18:08-0500
  commit: 24e5e2d99401f73d21b892ec3fdef6c97c65b98f
  compiler: gc
  version: v8.12.1
cluster:
  https: false
  addr: rqlite-1.rqlite-headless.data.svc.cluster.local:4002
  api_addr: rqlite-1.rqlite-headless.data.svc.cluster.local:4001
http:
  tls:
    enabled: false
  auth: enabled
  bind_addr: [::]:4001
  cluster:
    conn_pool_stats:
      10.3.50.187:4001:
        idle: 0
        max_open_connections: 64
        open_connections: 23
      10.3.59.109:4001:
        idle: 0
        max_open_connections: 64
        open_connections: 23
      10.3.90.38:4001:
        idle: 0
        max_open_connections: 64
        open_connections: 23
    local_node_addr: rqlite-1.rqlite-headless.data.svc.cluster.local:4002
    timeout: 30s
  queue:
    _default:
      batch_size: 128
      max_size: 1024
      sequence_number: 0
      timeout: 50ms
os:
  executable: /bin/rqlited
  hostname: rqlite-1
  page_size: 4096
  pid: 1
  ppid: 0
runtime:
  GOARCH: amd64
  GOMAXPROCS: 10
  GOOS: linux
  num_cpu: 10
  num_goroutine: 20
  version: go1.21.5
127.0.0.1:4001> .nodes
127.0.0.1:4001> .expvar
downloader:
  download_bytes: 0
  num_downloads_fail: 0
  num_downloads_ok: 0
memstats:
  LastGC: 1702955838047004570
  BuckHashSys: 1447249
  Lookups: 0
  OtherSys: 2106399
  Alloc: 3729600
  StackSys: 851968
  MCacheSys: 15600
  DebugGC: false
  TotalAlloc: 5273360
  HeapSys: 11730944
  HeapObjects: 23358
  PauseNs: [75503 86134 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
  NumGC: 2
  EnableGC: true
  HeapAlloc: 3729600
  MSpanSys: 179256
  PauseEnd: [1702955797015621183 1702955838047004570 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
  GCCPUFraction: 0.00001926514891987652
  BySize: [map[Frees:0 Mallocs:0 Size:0] map[Frees:474 Mallocs:1633 Size:8] map[Frees:3296 Mallocs:7838 Size:16] map[Frees:554 Mallocs:1317 Size:24] map[Frees:812 Mallocs:1526 Size:32] map[Frees:1434 Mallocs:4044 Size:48] map[Frees:623 Mallocs:1345 Size:64] map[Frees:856 Mallocs:1476 Size:80] map[Frees:525 Mallocs:1072 Size:96] map[Frees:586 Mallocs:779 Size:112] map[Frees:275 Mallocs:509 Size:128] map[Frees:132 Mallocs:9024 Size:144] map[Frees:100 Mallocs:246 Size:160] map[Frees:3 Mallocs:105 Size:176] map[Frees:13 Mallocs:46 Size:192] map[Frees:95 Mallocs:393 Size:208] map[Frees:24 Mallocs:42 Size:224] map[Frees:1 Mallocs:12 Size:240] map[Frees:64 Mallocs:145 Size:256] map[Frees:29 Mallocs:656 Size:288] map[Frees:25 Mallocs:45 Size:320] map[Frees:44 Mallocs:210 Size:352] map[Frees:5 Mallocs:46 Size:384] map[Frees:18 Mallocs:125 Size:416] map[Frees:3 Mallocs:36 Size:448] map[Frees:3 Mallocs:16 Size:480] map[Frees:18 Mallocs:47 Size:512] map[Frees:26 Mallocs:101 Size:576] map[Frees:11 Mallocs:31 Size:640] map[Frees:14 Mallocs:31 Size:704] map[Frees:0 Mallocs:11 Size:768] map[Frees:19 Mallocs:44 Size:896] map[Frees:60 Mallocs:128 Size:1024] map[Frees:7 Mallocs:101 Size:1152] map[Frees:21 Mallocs:67 Size:1280] map[Frees:4 Mallocs:9 Size:1408] map[Frees:0 Mallocs:2 Size:1536] map[Frees:5 Mallocs:21 Size:1792] map[Frees:4 Mallocs:14 Size:2048] map[Frees:3 Mallocs:132 Size:2304] map[Frees:14 Mallocs:19 Size:2688] map[Frees:3 Mallocs:7 Size:3072] map[Frees:0 Mallocs:1 Size:3200] map[Frees:0 Mallocs:3 Size:3456] map[Frees:8 Mallocs:42 Size:4096] map[Frees:14 Mallocs:33 Size:4864] map[Frees:2 Mallocs:12 Size:5376] map[Frees:0 Mallocs:4 Size:6144] map[Frees:0 Mallocs:0 Size:6528] map[Frees:0 Mallocs:0 Size:6784] map[Frees:2 Mallocs:2 Size:6912] map[Frees:1 Mallocs:18 Size:8192] map[Frees:5 Mallocs:16 Size:9472] map[Frees:0 Mallocs:0 Size:9728] map[Frees:0 Mallocs:0 Size:10240] map[Frees:0 Mallocs:0 Size:10880] map[Frees:2 Mallocs:2 Size:12288] map[Frees:0 Mallocs:0 Size:13568] map[Frees:0 Mallocs:0 Size:14336] map[Frees:1 Mallocs:5 Size:16384] map[Frees:1 Mallocs:1 Size:18432]]
  Mallocs: 35323
  Sys: 20288776
  MCacheInuse: 12000
  NextGC: 6857704
  Frees: 11965
  HeapInuse: 5193728
  HeapReleased: 5562368
  StackInuse: 851968
  MSpanInuse: 164976
  GCSys: 3957360
  PauseTotalNs: 161637
  NumForcedGC: 0
  HeapIdle: 6537216
queue:
  num_timeout: 0
  statements_rx: 0
  statements_tx: 0
  num_flush: 0
snapshot:
  latest_persist_duration: 0
  latest_persist_size: 0
  upgrade_fail: 0
  upgrade_ok: 0
cluster:
  num_query_req: 0
  num_remove_node_req: 0
  num_backup_req: 0
  num_get_node_api_req: 0
  num_get_node_api_resp: 0
  num_load_req: 0
  num_notify_req: 0
  num_client_retries: 0
  num_execute_req: 0
  num_get_node_api_req_local: 0
  num_join_req: 0
  num_request_req: 0
cmdline: [/bin/rqlited -node-id rqlite-1 -http-addr 0.0.0.0:4001 -http-adv-addr rqlite-1.rqlite-headless.data.svc.cluster.local:4001 -raft-addr 0.0.0.0:4002 -raft-adv-addr rqlite-1.rqlite-headless.data.svc.cluster.local:4002 -auth=/secrets/users.json -join-as=_system_rqlite -disco-mode=dns-srv -disco-config={"name":"rqlite-headless","service":"http"} -bootstrap-expect=3 -join-interval=1s -join-attempts=120 -raft-shutdown-stepdown -fk=true /rqlite]
mux:
  num_connections_handled: 0
  num_unregistered_handlers: 0
proto:
  num_uncompressed_requests: 0
  num_compressed_bytes: 0
  num_compressed_requests: 0
  num_compression_misses: 0
  num_precompressed_bytes: 0
  num_requests: 0
  num_uncompressed_bytes: 0
store:
  num_auto_restores_skipped: 0
  snapshot_persist_duration: 0
  failed_heartbeat_observed: 0
  num_auto_restores_failed: 0
  num_provides: 0
  num_snapshots: 0
  num_user_snapshots: 0
  leader_changes_dropped: 0
  nodes_reaped_ok: 0
  num_backups: 0
  num_removed_before_joins: 0
  num_auto_restores: 0
  num_boots: 0
  num_ignored_joins: 0
  num_recoveries: 0
  num_db_stats_errors: 0
  num_restores_failed: 0
  num_uncompressed_commands: 0
  snapshot_precompact_wal_size: 0
  snapshot_create_duration: 0
  leader_changes_observed: 0
  num_restores: 0
  num_snapshots_full: 0
  num_snapshots_incremental: 0
  nodes_reaped_failed: 0
  num_compressed_commands: 0
  num_loads: 0
  num_joins: 0
  snapshot_wal_size: 0
uploader:
  num_uploads_ok: 0
  num_uploads_skipped: 0
  total_upload_bytes: 0
  last_upload_bytes: 0
  num_uploads_fail: 0
db:
  checkpointed_moves: 0
  checkpointed_pages: 0
  checkpoint_duration_ns: 0
  checkpoint_errors: 0
  executions: 0
  open_duration_ms: 7
  checkpoints: 0
  execute_transactions: 0
  request_transactions: 0
  requests: 0
  query_errors: 0
  query_transactions: 0
  execution_errors: 0
  queries: 18
http:
  authFail: 1
  authOK: 21
  queued_executions_num_stmts_tx: 0
  remote_remove_node: 0
  remote_backups: 0
  remote_executions: 0
  remote_loads: 0
  requests: 0
  execute_stmts_rx: 0
  leader_not_found: 0
  queries: 0
  queued_executions_ok: 0
  queued_executions_num_stmts_rx: 0
  queued_executions_unknown_error: 0
  remote_requests_failed: 0
  backups: 0
  boot: 0
  loads_aborted: 0
  queued_executions_no_leader: 0
  remote_requests: 0
  query_stmts_rx: 0
  queued_executions_failed: 0
  queued_executions_leadership_lost: 0
  remote_executions_failed: 0
  num_readyz: 16
  loads: 0
  queued_executions: 0
  remote_queries: 0
  remote_queries_failed: 0
  request_stmts_rx: 0
  executions: 0
  num_status: 2
  queued_executions_not_leader: 0
  queued_executions_wait: 0
@otoolep
Member

otoolep commented Dec 19, 2023

Thanks -- I've just released 8.12.2 which should fix your immediate issue. That said, while I want to support IPv6, I suspect you're going to hit other issues like this since I've never actually tested IPv6. I'll fix them as you hit them.

As for the other issues, that was due to out-of-date docs, which I've fixed now (though let me know if it needs to be clearer). In summary, rqlite 8.x switched to using the Raft port of other nodes for Join operations. Releases before that used the HTTP port of other nodes. Perhaps this means you will need two versions of your charts -- one for 7.x and one for 8.x (or just support 8.x -- there will be no more 7.x releases).
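
For a chart that did want to cover both release lines, the port switch described above could be captured in a small helper that picks the SRV service name by major version. This is only a sketch under assumptions: discoConfigFor is a hypothetical function, and the service names "http" and "raft" are the port names used in this thread's disco-config examples, not names mandated by rqlite.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// discoConfig mirrors the JSON shape passed to -disco-config in this thread.
type discoConfig struct {
	Name    string `json:"name"`
	Service string `json:"service"`
}

// discoConfigFor (hypothetical helper) returns the -disco-config JSON for
// dns-srv discovery: the "raft" service for 8.x and later, "http" before.
func discoConfigFor(major int, name string) (string, error) {
	svc := "http"
	if major >= 8 {
		svc = "raft"
	}
	b, err := json.Marshal(discoConfig{Name: name, Service: svc})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	for _, major := range []int{7, 8} {
		cfg, err := discoConfigFor(major, "rqlite-headless")
		if err != nil {
			panic(err)
		}
		fmt.Printf("rqlite %d.x: -disco-config=%s\n", major, cfg)
	}
}
```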

@otoolep
Member

otoolep commented Dec 19, 2023

Anyway, if you confirm this specific issue is addressed in the latest release, perhaps you can close.

@otoolep
Member

otoolep commented Dec 19, 2023

All my testing so far shows that using IPv6 addresses works.

@jtackaberry
Contributor Author

Yep, looks good here with v8.12.3, @otoolep. Cluster is now bootstrapping with v6 addresses from DNS discovery. Also thanks for fixing the port details in the docs.

I'm getting the impression that you don't sleep. :)

@jtackaberry
Contributor Author

jtackaberry commented Dec 19, 2023

> As for the other issues, that was due to out-of-date docs, which I've fixed now (though let me know if it needs to be clearer).

The language is clear enough now. The only tweak I might be inclined to make is in the code block (and the corresponding DNS name in the paragraph below), changing the service name from rqlite-svc (which is ambiguous) to something like rqlite-raft. That'll help out the doc-skimmers who are mostly looking at the config examples. :)

> Perhaps this means you will need two versions of your charts -- one for 7.x and one for 8.x (or just support 8.x -- there will be no more 7.x releases).

I think I'll just target 8.x for the chart, unless there's a compelling use case for new deployments running 7.x.

otoolep added a commit to rqlite/rqlite.io that referenced this issue Dec 19, 2023
@otoolep
Member

otoolep commented Dec 19, 2023

> language is clear enough now. The only tweak I might be inclined to make is in the code block (and the corresponding DNS name in the paragraph below), changing the service name from rqlite-svc (which is ambiguous) to something like rqlite-raft. That'll help out the doc-skimmers who are mostly looking at the config examples. :)

Good idea, done. See rqlite/rqlite.io@186d936

@jtackaberry
Contributor Author

> In summary rqlite 8.x switched to using the Raft port of other nodes for Join operations. Releases before that used the HTTP port of other nodes

Given that, I believe https://rqlite.io/docs/guides/security/#secure-cluster-example also needs updating? The section has examples that use HTTP URLs with -join.

@otoolep
Member

otoolep commented Dec 20, 2023

Jeez -- yeah. Thanks.

otoolep added a commit to rqlite/rqlite.io that referenced this issue Dec 20, 2023