
Problem deploy cluster on prem #50

Closed · aHsirG opened this issue May 5, 2022 · 16 comments

Labels: area/deploy (Cluster deployment issues), help wanted (Extra attention is needed)

Comments

@aHsirG commented May 5, 2022

Everything works fine on a single machine, but when I assemble a cluster of 3 machines it does not come up.
I am installing on 3 Compute instances in Yandex Cloud.
If I specify IP addresses in host and node_id, whether internal or external, I get this error:

2022-05-05T20:48:33.875257Z :INTERCONNECT NOTICE: Proxy [1:7094356886080632466:1] [node 2] ICP25 outgoing handshake failed, temporary: 1 explanation: outgoing handshake Peer# 10.129.0.10(10.129.0.10:19001) error from peer: ReceiverHostName# 10.129.0.10 mismatch, expected# localhost incoming: [0:0:0] held: no
2022-05-05T20:48:33.875276Z :INTERCONNECT NOTICE: Proxy [1:7094356886080632466:1] [node 2] ICP32 transit to hold-by-error state Explanation# outgoing handshake Peer# 10.129.0.10(10.129.0.10:19001) error from peer: ReceiverHostName# 10.129.0.10 mismatch, expected# localhost LastSessionDieTime# 1970-01-01T00:00:00.000000Z
2022-05-05T20:48:36.210727Z :INTERCONNECT NOTICE: Handshake [1:7094357040699463145:4107] [node 2] ICH03 handshake failed, explanation# incoming handshake Peer# localhost(::ffff:51.250.25.185) ReceiverHostName# 51.250.96.80 mismatch, expected# localhost
2022-05-05T20:48:40.421824Z :INTERCONNECT NOTICE: Handshake [1:7094357057879333233:8186] [node 3] ICH03 handshake failed, explanation# incoming handshake Peer# localhost(::ffff:51.250.28.84) ReceiverHostName# 51.250.96.80 mismatch, expected# localhost

If I specify the internal FQDNs (ydb1.ru-central1.internal, ydb2.ru-central1.internal, ydb3.ru-central1.internal):

Caught exception: ydb/core/driver_lib/cli_utils/cli_cmds_server.cpp:349: cannot detect node ID for ydb1:19001

If I specify hostnames (the machine names ydb1, ydb2, ydb3):
ICP32 transit to hold-by-error state Explanation# outgoing handshake Peer# ydb2() DNS resolve error: Could not contact DNS servers LastSessionDieTime# 1970-01-01T00:00:00.000000Z

telnet and nslookup both work successfully.

@mvgorbunov (Collaborator) commented May 6, 2022

Hi!
It's not a good idea to use IP addresses in configuration files, since they can change; it is better to use FQDNs that can be resolved on every node.

Caught exception: ydb/core/driver_lib/cli_utils/cli_cmds_server.cpp:349: cannot detect node ID for ydb1:19001

Did you follow the on-premises deployment documentation while configuring the multinode cluster? Could you show your configuration file and the server startup command line?
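For reference, a minimal sketch of a hosts section using resolvable FQDNs; the node IDs and location values below are illustrative assumptions, and the exact keys should be checked against the yaml_config_examples in the repository:

hosts:
- host: ydb1.ru-central1.internal   # must resolve on every node
  host_config_id: 1
  node_id: 1
  location: { data_center: 'zone-a', rack: '1', body: 1 }
- host: ydb2.ru-central1.internal
  host_config_id: 1
  node_id: 2
  location: { data_center: 'zone-b', rack: '2', body: 2 }
- host: ydb3.ru-central1.internal
  host_config_id: 1
  node_id: 3
  location: { data_center: 'zone-c', rack: '3', body: 3 }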

@aHsirG (Author) commented May 6, 2022

config.txt (attached; config.txt is the renamed config.yaml)

Command:

/opt/ydb/ydbd/bin/ydbd server --tcp --yaml-config /opt/ydb/cfg/config.yaml --node static \
  --grpc-port 2135 --ic-port 19001 --mon-port 8765 \
  --log-file-name logs/storage_start.log > logs/storage_start_output.log 2>logs/storage_start_err.log &

or

/opt/ydb/ydbd/bin/ydbd server --tcp --yaml-config /opt/ydb/cfg/config.yaml --node 1 \
  --grpc-port 2135 --ic-port 19001 --mon-port 8765 \
  --log-file-name logs/storage_start.log > logs/storage_start_output.log 2>logs/storage_start_err.log &

I followed the example at https://ydb.tech/ru/docs/getting_started/self_hosted/ydb_local

@mvgorbunov (Collaborator) commented May 6, 2022

/opt/ydb/ydbd/bin/ydbd server --tcp --yaml-config /opt/ydb/cfg/config.yaml --node static \
  --grpc-port 2135 --ic-port 19001 --mon-port 8765 \
  --log-file-name logs/storage_start.log > logs/storage_start_output.log 2>logs/storage_start_err.log &

This should work fine.

I followed the example at https://ydb.tech/ru/docs/getting_started/self_hosted/ydb_local

You'd better use the example config mirror-3dc-3-nodes.yaml.
Please note: for a fault-tolerant cluster (mirror-3 erasure) you must have at least 3 nodes with 3 disks each (block devices or files on a filesystem). You can also specify only one disk per node and use erasure: none, but if one of the nodes fails, your database will be unavailable.
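As a sketch of the difference (keys taken from the linked example config; verify them against your YDB version), the erasure mode shows up in fragments like:

static_erasure: mirror-3-dc        # fault-tolerant; requires 3 nodes x 3 disks
domains_config:
  domain:
  - name: Root
    storage_pool_types:
    - kind: ssd
      pool_config:
        box_id: 1
        erasure_species: mirror-3-dc   # "none" for a single-disk, non-fault-tolerant setup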

@aHsirG (Author) commented May 7, 2022

Please check my problem on a single node.
If I use the FQDN, the node fails to start with:
Caught exception: ydb/core/driver_lib/cli_utils/cli_cmds_server.cpp:349: cannot detect node ID for ydb1:19001
If I use the hostname, the node starts; that is fine for a single node, but a cluster configured with hostnames does not work (see my first post).
Attachments: command.txt, config_not_work.txt, config_work.txt, error.txt (plus screenshots)

@mvgorbunov (Collaborator):

You should specify the host name in the configuration file exactly as it is returned by the hostname command.
So, if you want to use FQDNs (and they are necessary for the nodes of a multinode cluster to communicate), set your server's host name to the FQDN (sudo hostname ydb1.ru-central1.internal, or add it to /etc/hostname), then use the FQDN in the configuration file.
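On systemd-based distributions (Ubuntu, Debian) the equivalent persistent change looks like this; the FQDN is just the one from this thread:

# Set the machine's host name to its internal FQDN so it matches
# the host entries in config.yaml, then verify it:
sudo hostnamectl set-hostname ydb1.ru-central1.internal
hostname    # should now print ydb1.ru-central1.internal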

@aHsirG (Author) commented May 8, 2022

Good, that fixed the interconnect, but there is a new problem.

In the config I changed only the host names (config.txt attached), starting from
https://github.com/ydb-platform/ydb/blob/main/ydb/deploy/yaml_config_examples/mirror-3dc-3-nodes.yaml

Errors in the log:
2022-05-08T16:35:41.407210Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936131, type: Console, boot
2022-05-08T16:35:41.407944Z :BS_PROXY_DISCOVER ERROR: [a6cef7aa52e3541a] Status# ERROR Marker# DSPDM02
2022-05-08T16:35:41.407959Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-08T16:35:41.407970Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-08T16:35:41.407973Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1
2022-05-08T16:35:41.408397Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936131, type: Console, boot
2022-05-08T16:35:41.409057Z :BS_PROXY_DISCOVER ERROR: [02f6be02ea6f8374] Status# ERROR Marker# DSPDM02
2022-05-08T16:35:41.409075Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-08T16:35:41.409083Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-08T16:35:41.409086Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1
2022-05-08T16:35:41.409399Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936131, type: Console, boot
2022-05-08T16:35:41.410328Z :BS_PROXY_DISCOVER ERROR: [6c625560d57343e7] Status# ERROR Marker# DSPDM02
2022-05-08T16:35:41.410343Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-08T16:35:41.410347Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-08T16:35:41.410350Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, Suggested

ydbd uses 1 CPU core and 10–14 Mbit/s of network traffic, but the command

/opt/ydb/ydbd/bin/ydbd -s grpc://localhost:2135 admin blobstorage config init --yaml-file cfg/config.yaml > logs/init_storage.log 2>&1

does not finish correctly; it times out with:

MP-0130 Tablet request timed out Marker# MBT4

@mvgorbunov (Collaborator):

This config assumes 3 raw block devices that have already been formatted as described at https://ydb.tech/en/docs/deploy/manual/deploy-ydb-on-premises#prepare-and-format-disks-on-each-server-%7B#-prepare-disks%7D
If you use a local data file instead, replace the drive: path entries in the configs on all nodes.
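For example, a file-backed variant of the host_configs fragment might look like this (a sketch; the file path is an assumption, and the file must be created with sufficient size before starting the server):

host_configs:
- host_config_id: 1
  drive:
  - path: /opt/ydb/data/pdisk01.data   # instead of /dev/disk/by-partlabel/ydb_disk_...
    type: SSD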

@aHsirG (Author) commented May 12, 2022

I created 9 disks (3 per machine) and used these commands to prepare them:
sudo parted /dev/nvme0n1 mklabel gpt -s
sudo parted -a optimal /dev/nvme0n1 mkpart primary 0% 100%
sudo parted /dev/nvme0n1 name 1 ydb_disk_01
sudo partx --u /dev/nvme0n1
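A quick way to check that the labels actually appeared (assuming the same layout as above):

ls -l /dev/disk/by-partlabel/   # should list ydb_disk_01 -> ../../nvme0n1p1, etc.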

@mvgorbunov (Collaborator):

You have a misconfiguration: you labeled the disks as ydb_disk_01 (and so on) but used /dev/disk/by-partlabel/ydb_disk_ssd_01 in config.txt (this is a misspelling in our documentation and we will fix it, thank you for the report!).
The easiest way to fix it now is to replace ydb_disk_ssd_0[1-3] with ydb_disk_0[1-3] in your config.
Also, don't forget to format each labeled disk by running sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/lib /opt/ydb/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_01 for every disk on every server.
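For example, as a small loop to run on each server (a sketch of the same command; note the binary may live under /opt/ydb/ydbd/bin depending on your layout):

# Obliterate all three labeled partitions on this server:
for d in /dev/disk/by-partlabel/ydb_disk_0{1..3}; do
  sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/lib /opt/ydb/bin/ydbd admin bs disk obliterate "$d"
done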

@aHsirG (Author) commented May 13, 2022

OK, sorry, a silly mistake, but fixing it did not clear the error. The error now is:

2022-05-13T13:37:20.837011Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936128, type: Cms, boot
2022-05-13T13:37:20.837829Z :BS_PROXY_DISCOVER ERROR: [69afb1ded53966b6] Status# ERROR Marker# DSPDM02
2022-05-13T13:37:20.837839Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-13T13:37:20.837847Z :TABLET_MAIN ERROR: Tablet: 72057594037936128 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-13T13:37:20.837849Z :TABLET_MAIN ERROR: Tablet: 72057594037936128 Type: Cms, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1
(two screenshots attached)

Full list of commands:
sudo parted /dev/vdb mklabel gpt -s
sudo parted -a optimal /dev/vdb mkpart primary 0% 100%
sudo parted /dev/vdb name 1 ydb_disk_01
sudo partx --u /dev/vdb
sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/ydbd/lib /opt/ydb/ydbd/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_01

sudo parted /dev/vdc mklabel gpt -s
sudo parted -a optimal /dev/vdc mkpart primary 0% 100%
sudo parted /dev/vdc name 1 ydb_disk_02
sudo partx --u /dev/vdc
sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/ydbd/lib /opt/ydb/ydbd/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_02

sudo parted /dev/vdd mklabel gpt -s
sudo parted -a optimal /dev/vdd mkpart primary 0% 100%
sudo parted /dev/vdd name 1 ydb_disk_03
sudo partx --u /dev/vdd
sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/ydbd/lib /opt/ydb/ydbd/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_03

@aHsirG (Author) commented May 14, 2022

Additionally, I tested a simple 1-node, 1-disk install. The error:
2022-05-14T11:05:59.794320Z :BS_PROXY_DISCOVER ERROR: [69485e2bc1c4387b] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794326Z :BS_PROXY_DISCOVER ERROR: [16ad85e56deff3ae] StepDiscovery Die. Disintegrated. DomainRequestsSent# 1 DomainReplies# 1 DomainSuccess# 0 ParityParts# 0 Handoff# 0 Marker# BSD08
2022-05-14T11:05:59.794327Z :BS_PROXY_DISCOVER ERROR: [16ad85e56deff3ae] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794330Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794333Z :BOOTSTRAPPER NOTICE: tablet: 72057594046382081, type: Mediator, boot
2022-05-14T11:05:59.794340Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794342Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794345Z :TABLET_MAIN ERROR: Tablet: 72075186232723360 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794345Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794346Z :TABLET_MAIN ERROR: Tablet: 72075186232723360 Type: SchemeShard, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794350Z :TABLET_MAIN ERROR: Tablet: 72057594037936130 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794351Z :TABLET_MAIN ERROR: Tablet: 72057594037936130 Type: TenantSlotBroker, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794363Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794364Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794372Z :BS_PROXY_DISCOVER ERROR: [fae4ae78692b6f02] StepDiscovery Die. Disintegrated. DomainRequestsSent# 1 DomainReplies# 1 DomainSuccess# 0 ParityParts# 0 Handoff# 0 Marker# BSD08
2022-05-14T11:05:59.794374Z :BS_PROXY_DISCOVER ERROR: [fae4ae78692b6f02] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794380Z :TABLET_MAIN ERROR: Tablet: 72057594037932033 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794381Z :TABLET_MAIN ERROR: Tablet: 72057594037932033 Type: BSController, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794381Z :BS_PROXY_DISCOVER ERROR: [946392aebc310326] StepDiscovery Die. Disintegrated. DomainRequestsSent# 1 DomainReplies# 1 DomainSuccess# 0 ParityParts# 0 Handoff# 0 Marker# BSD08
2022-05-14T11:05:59.794383Z :BS_PROXY_DISCOVER ERROR: [946392aebc310326] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794386Z :BOOTSTRAPPER NOTICE: tablet: 72075186232723360, type: SchemeShard, boot
2022-05-14T11:05:59.794393Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936130, type: TenantSlotBroker, boot

@aHsirG (Author) commented May 14, 2022

This did not work correctly:

sudo groupadd ydb
sudo useradd ydb -g ydb
sudo usermod -aG disk ydb

After:

sudo chmod 777 /dev/disk/
sudo chmod 777 /dev/disk/by-partlabel
sudo chmod 777 /dev/disk/by-partlabel/ydb_disk_01
sudo chmod 777 /dev/disk/by-partlabel/ydb_disk_02
sudo chmod 777 /dev/disk/by-partlabel/ydb_disk_03

the cluster started correctly.
Now some simple testing, and then I will close the issue.

@mvgorbunov (Collaborator):

It's not a good idea to give 777 permissions on /dev/disk. We tested on Debian systems, and it is enough to add the ydb user (under which the server runs) to the disk group.
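A sketch of the group-based approach (the device group and mode shown are the usual Linux defaults, not verified here):

# Grant access through the disk group instead of chmod 777:
sudo usermod -aG disk ydb
# Group membership is read at login, so restart the ydbd service (or the VM)
# to get a fresh session, then verify:
id ydb                           # should list "disk" among the groups
ls -l /dev/disk/by-partlabel/    # symlink targets are typically root:disk, mode 660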

@fomichev3000 added the area/deploy (Cluster deployment issues) and help wanted (Extra attention is needed) labels on May 18, 2022
@mvgorbunov (Collaborator):

@aHsirG Is everything working fine? Can we close this issue?

@aHsirG (Author) commented May 23, 2022

All fine :)
My permission problem was on Ubuntu; I have not tested on Debian yet, so I don't know whether the permission issue is Ubuntu-specific or my own mistake.
I will check on Debian in a few days.

@aHsirG (Author) commented May 24, 2022

After retesting:

sudo groupadd ydb
sudo useradd ydb -g ydb
sudo usermod -aG disk ydb

works fine on both Debian and Ubuntu, but the VM needs to be restarted for the group change to take effect.
I did not know this because I am not a Unix administrator.
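For what it's worth, a full VM reboot is usually not required: group membership is picked up by any new login session (a sketch; behavior can vary by distro):

sudo -iu ydb id    # a fresh session for the ydb user should now include "disk"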

Thanks a lot for your help!

@aHsirG closed this as completed May 24, 2022