
Problem deploy cluster on prem #50

Closed · aHsirG opened this issue May 5, 2022 · 16 comments

Labels: area/deploy (Cluster deployment issues), help wanted (Extra attention is needed)

Comments

@aHsirG commented May 5, 2022

Everything works fine on a single machine, but when I assemble a cluster of 3 machines it does not come up.
I am installing on 3 Compute instances in Yandex Cloud.
If I specify IP addresses in host and node_id, whether internal or external, I get this error:

2022-05-05T20:48:33.875257Z :INTERCONNECT NOTICE: Proxy [1:7094356886080632466:1] [node 2] ICP25 outgoing handshake failed, temporary: 1 explanation: outgoing handshake Peer# 10.129.0.10(10.129.0.10:19001) error from peer: ReceiverHostName# 10.129.0.10 mismatch, expected# localhost incoming: [0:0:0] held: no
2022-05-05T20:48:33.875276Z :INTERCONNECT NOTICE: Proxy [1:7094356886080632466:1] [node 2] ICP32 transit to hold-by-error state Explanation# outgoing handshake Peer# 10.129.0.10(10.129.0.10:19001) error from peer: ReceiverHostName# 10.129.0.10 mismatch, expected# localhost LastSessionDieTime# 1970-01-01T00:00:00.000000Z
2022-05-05T20:48:36.210727Z :INTERCONNECT NOTICE: Handshake [1:7094357040699463145:4107] [node 2] ICH03 handshake failed, explanation# incoming handshake Peer# localhost(::ffff:51.250.25.185) ReceiverHostName# 51.250.96.80 mismatch, expected# localhost
2022-05-05T20:48:40.421824Z :INTERCONNECT NOTICE: Handshake [1:7094357057879333233:8186] [node 3] ICH03 handshake failed, explanation# incoming handshake Peer# localhost(::ffff:51.250.28.84) ReceiverHostName# 51.250.96.80 mismatch, expected# localhost

If I specify the internal FQDNs (ydb1.ru-central1.internal, ydb2.ru-central1.internal, ydb3.ru-central1.internal):

Caught exception: ydb/core/driver_lib/cli_utils/cli_cmds_server.cpp:349: cannot detect node ID for ydb1:19001

If I specify hostnames (the machine names ydb1, ydb2, ydb3):
ICP32 transit to hold-by-error state Explanation# outgoing handshake Peer# ydb2() DNS resolve error: Could not contact DNS servers LastSessionDieTime# 1970-01-01T00:00:00.000000Z

telnet and nslookup both work successfully.

@mvgorbunov (Collaborator) commented May 6, 2022

Hi!
It's not a good idea to use IP addresses in configuration files, since they can change; it is better to use FQDNs that can be resolved on every node.

Caught exception: ydb/core/driver_lib/cli_utils/cli_cmds_server.cpp:349: cannot detect node ID for ydb1:19001

Did you follow the on-premises deployment documentation while configuring the multinode cluster? Could you show your configuration file and the server startup command line?
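For reference, a minimal sketch of a hosts section using resolvable FQDNs; the node IDs and location values below are illustrative assumptions, and the exact keys should be checked against the yaml_config_examples in the repository:

hosts:
- host: ydb1.ru-central1.internal   # must resolve on every node
  host_config_id: 1
  node_id: 1
  location: { data_center: 'zone-a', rack: '1', body: 1 }
- host: ydb2.ru-central1.internal
  host_config_id: 1
  node_id: 2
  location: { data_center: 'zone-b', rack: '2', body: 2 }
- host: ydb3.ru-central1.internal
  host_config_id: 1
  node_id: 3
  location: { data_center: 'zone-c', rack: '3', body: 3 }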

@aHsirG (Author) commented May 6, 2022

config.txt (attached; config.txt is the renamed config.yaml)

Command:

/opt/ydb/ydbd/bin/ydbd server --tcp --yaml-config /opt/ydb/cfg/config.yaml --node static \
  --grpc-port 2135 --ic-port 19001 --mon-port 8765 \
  --log-file-name logs/storage_start.log > logs/storage_start_output.log 2>logs/storage_start_err.log &

or

/opt/ydb/ydbd/bin/ydbd server --tcp --yaml-config /opt/ydb/cfg/config.yaml --node 1 \
  --grpc-port 2135 --ic-port 19001 --mon-port 8765 \
  --log-file-name logs/storage_start.log > logs/storage_start_output.log 2>logs/storage_start_err.log &

I followed the example at https://ydb.tech/ru/docs/getting_started/self_hosted/ydb_local

@mvgorbunov (Collaborator) commented May 6, 2022

/opt/ydb/ydbd/bin/ydbd server --tcp --yaml-config /opt/ydb/cfg/config.yaml --node static \
  --grpc-port 2135 --ic-port 19001 --mon-port 8765 \
  --log-file-name logs/storage_start.log > logs/storage_start_output.log 2>logs/storage_start_err.log &

This should work fine.

I followed the example at https://ydb.tech/ru/docs/getting_started/self_hosted/ydb_local

You'd better use the example config mirror-3dc-3-nodes.yaml.
Please note: for a fault-tolerant cluster (mirror-3 erasure) you must have at least 3 nodes with 3 disks each (block devices or files on a filesystem). You can also specify only one disk per node and use erasure: none, but if one of the nodes fails, your database will be unavailable.
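As a sketch of the difference (keys taken from the linked example config; verify them against your YDB version), the erasure mode shows up in fragments like:

static_erasure: mirror-3-dc        # fault-tolerant; requires 3 nodes x 3 disks
domains_config:
  domain:
  - name: Root
    storage_pool_types:
    - kind: ssd
      pool_config:
        box_id: 1
        erasure_species: mirror-3-dc   # "none" for a single-disk, non-fault-tolerant setup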

@aHsirG (Author) commented May 7, 2022

Please check my problem on a single node.
If I use the FQDN, the node fails to start with:
Caught exception: ydb/core/driver_lib/cli_utils/cli_cmds_server.cpp:349: cannot detect node ID for ydb1:19001
If I use the hostname, the node starts; that is fine for a single node, but a cluster configured with hostnames does not work (see my first post).
Attachments: command.txt, config_not_work.txt, config_work.txt, error.txt (plus screenshots)

@mvgorbunov (Collaborator):

You should specify the host name in the configuration file exactly as it is returned by the hostname command.
So, if you want to use FQDNs (and they are necessary for the nodes of a multinode cluster to communicate), set your server's host name to the FQDN (sudo hostname ydb1.ru-central1.internal, or add it to /etc/hostname), then use the FQDN in the configuration file.
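On systemd-based distributions (Ubuntu, Debian) the equivalent persistent change looks like this; the FQDN is just the one from this thread:

# Set the machine's host name to its internal FQDN so it matches
# the host entries in config.yaml, then verify it:
sudo hostnamectl set-hostname ydb1.ru-central1.internal
hostname    # should now print ydb1.ru-central1.internal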

@aHsirG (Author) commented May 8, 2022

Good, that fixed the interconnect, but there is a new problem.

In the config I changed only the host names (config.txt attached), starting from
https://github.com/ydb-platform/ydb/blob/main/ydb/deploy/yaml_config_examples/mirror-3dc-3-nodes.yaml

Errors in the log:
2022-05-08T16:35:41.407210Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936131, type: Console, boot
2022-05-08T16:35:41.407944Z :BS_PROXY_DISCOVER ERROR: [a6cef7aa52e3541a] Status# ERROR Marker# DSPDM02
2022-05-08T16:35:41.407959Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-08T16:35:41.407970Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-08T16:35:41.407973Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1
2022-05-08T16:35:41.408397Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936131, type: Console, boot
2022-05-08T16:35:41.409057Z :BS_PROXY_DISCOVER ERROR: [02f6be02ea6f8374] Status# ERROR Marker# DSPDM02
2022-05-08T16:35:41.409075Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-08T16:35:41.409083Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-08T16:35:41.409086Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1
2022-05-08T16:35:41.409399Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936131, type: Console, boot
2022-05-08T16:35:41.410328Z :BS_PROXY_DISCOVER ERROR: [6c625560d57343e7] Status# ERROR Marker# DSPDM02
2022-05-08T16:35:41.410343Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-08T16:35:41.410347Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-08T16:35:41.410350Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, Suggested

ydbd uses 1 CPU core and 10–14 Mbit/s of network traffic, but the command

/opt/ydb/ydbd/bin/ydbd -s grpc://localhost:2135 admin blobstorage config init --yaml-file cfg/config.yaml > logs/init_storage.log 2>&1

does not finish correctly; it times out with:

MP-0130 Tablet request timed out Marker# MBT4

@mvgorbunov (Collaborator):

This config assumes 3 raw block devices that have already been formatted as described at https://ydb.tech/en/docs/deploy/manual/deploy-ydb-on-premises#prepare-and-format-disks-on-each-server-%7B#-prepare-disks%7D
If you use a local data file instead, replace the drive: path entries in the configs on all nodes.
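For example, a file-backed variant of the host_configs fragment might look like this (a sketch; the file path is an assumption, and the file must be created with sufficient size before starting the server):

host_configs:
- host_config_id: 1
  drive:
  - path: /opt/ydb/data/pdisk01.data   # instead of /dev/disk/by-partlabel/ydb_disk_...
    type: SSD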

@aHsirG (Author) commented May 12, 2022

I created 9 disks (3 per machine) and used these commands to prepare them:
sudo parted /dev/nvme0n1 mklabel gpt -s
sudo parted -a optimal /dev/nvme0n1 mkpart primary 0% 100%
sudo parted /dev/nvme0n1 name 1 ydb_disk_01
sudo partx --u /dev/nvme0n1
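A quick way to check that the labels actually appeared (assuming the same layout as above):

ls -l /dev/disk/by-partlabel/   # should list ydb_disk_01 -> ../../nvme0n1p1, etc.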

@mvgorbunov (Collaborator):

You have a misconfiguration: you labeled the disks as ydb_disk_01 (and so on) but used /dev/disk/by-partlabel/ydb_disk_ssd_01 in config.txt (this is a misspelling in our documentation and we will fix it, thank you for the report!).
The easiest way to fix it now is to replace ydb_disk_ssd_0[1-3] with ydb_disk_0[1-3] in your config.
Also, don't forget to format each labeled disk by running sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/lib /opt/ydb/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_01 for every disk on every server.
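For example, as a small loop to run on each server (a sketch of the same command; note the binary may live under /opt/ydb/ydbd/bin depending on your layout):

# Obliterate all three labeled partitions on this server:
for d in /dev/disk/by-partlabel/ydb_disk_0{1..3}; do
  sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/lib /opt/ydb/bin/ydbd admin bs disk obliterate "$d"
done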

@aHsirG (Author) commented May 13, 2022

OK, sorry, a silly mistake, but fixing it did not clear the error. The error now is:

2022-05-13T13:37:20.837011Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936128, type: Cms, boot
2022-05-13T13:37:20.837829Z :BS_PROXY_DISCOVER ERROR: [69afb1ded53966b6] Status# ERROR Marker# DSPDM02
2022-05-13T13:37:20.837839Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-13T13:37:20.837847Z :TABLET_MAIN ERROR: Tablet: 72057594037936128 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-13T13:37:20.837849Z :TABLET_MAIN ERROR: Tablet: 72057594037936128 Type: Cms, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1
(two screenshots attached)

Full list of commands:
sudo parted /dev/vdb mklabel gpt -s
sudo parted -a optimal /dev/vdb mkpart primary 0% 100%
sudo parted /dev/vdb name 1 ydb_disk_01
sudo partx --u /dev/vdb
sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/ydbd/lib /opt/ydb/ydbd/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_01

sudo parted /dev/vdc mklabel gpt -s
sudo parted -a optimal /dev/vdc mkpart primary 0% 100%
sudo parted /dev/vdc name 1 ydb_disk_02
sudo partx --u /dev/vdc
sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/ydbd/lib /opt/ydb/ydbd/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_02

sudo parted /dev/vdd mklabel gpt -s
sudo parted -a optimal /dev/vdd mkpart primary 0% 100%
sudo parted /dev/vdd name 1 ydb_disk_03
sudo partx --u /dev/vdd
sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/ydb/ydbd/lib /opt/ydb/ydbd/bin/ydbd admin bs disk obliterate /dev/disk/by-partlabel/ydb_disk_03

@aHsirG (Author) commented May 14, 2022

Additionally, I tested a simple 1-node, 1-disk install. The error:
2022-05-14T11:05:59.794320Z :BS_PROXY_DISCOVER ERROR: [69485e2bc1c4387b] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794326Z :BS_PROXY_DISCOVER ERROR: [16ad85e56deff3ae] StepDiscovery Die. Disintegrated. DomainRequestsSent# 1 DomainReplies# 1 DomainSuccess# 0 ParityParts# 0 Handoff# 0 Marker# BSD08
2022-05-14T11:05:59.794327Z :BS_PROXY_DISCOVER ERROR: [16ad85e56deff3ae] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794330Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794333Z :BOOTSTRAPPER NOTICE: tablet: 72057594046382081, type: Mediator, boot
2022-05-14T11:05:59.794340Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794342Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794345Z :TABLET_MAIN ERROR: Tablet: 72075186232723360 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794345Z :TABLET_MAIN ERROR: Handle::TEvDiscoverResult, result status ERROR
2022-05-14T11:05:59.794346Z :TABLET_MAIN ERROR: Tablet: 72075186232723360 Type: SchemeShard, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794350Z :TABLET_MAIN ERROR: Tablet: 72057594037936130 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794351Z :TABLET_MAIN ERROR: Tablet: 72057594037936130 Type: TenantSlotBroker, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794363Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794364Z :TABLET_MAIN ERROR: Tablet: 72057594037936131 Type: Console, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794372Z :BS_PROXY_DISCOVER ERROR: [fae4ae78692b6f02] StepDiscovery Die. Disintegrated. DomainRequestsSent# 1 DomainReplies# 1 DomainSuccess# 0 ParityParts# 0 Handoff# 0 Marker# BSD08
2022-05-14T11:05:59.794374Z :BS_PROXY_DISCOVER ERROR: [fae4ae78692b6f02] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794380Z :TABLET_MAIN ERROR: Tablet: 72057594037932033 HandleFindLatestLogEntry, msg->Status: ERROR
2022-05-14T11:05:59.794381Z :TABLET_MAIN ERROR: Tablet: 72057594037932033 Type: BSController, EReason: ReasonBootBSError, SuggestedGeneration: 0, KnownGeneration: 1, Details: Group# 0 disintegrated, type A.
2022-05-14T11:05:59.794381Z :BS_PROXY_DISCOVER ERROR: [946392aebc310326] StepDiscovery Die. Disintegrated. DomainRequestsSent# 1 DomainReplies# 1 DomainSuccess# 0 ParityParts# 0 Handoff# 0 Marker# BSD08
2022-05-14T11:05:59.794383Z :BS_PROXY_DISCOVER ERROR: [946392aebc310326] Result# TEvDiscoverResult {Status# ERROR BlockedGeneration# 0 Id# [0:0:0:0:0:0:0] Size# 0 MinGeneration# 0 ErrorReason# "Group# 0 disintegrated, type A."} Marker# BSD01
2022-05-14T11:05:59.794386Z :BOOTSTRAPPER NOTICE: tablet: 72075186232723360, type: SchemeShard, boot
2022-05-14T11:05:59.794393Z :BOOTSTRAPPER NOTICE: tablet: 72057594037936130, type: TenantSlotBroker, boot

@aHsirG (Author) commented May 14, 2022

This did not work correctly:

sudo groupadd ydb
sudo useradd ydb -g ydb
sudo usermod -aG disk ydb

After:

sudo chmod 777 /dev/disk/
sudo chmod 777 /dev/disk/by-partlabel
sudo chmod 777 /dev/disk/by-partlabel/ydb_disk_01
sudo chmod 777 /dev/disk/by-partlabel/ydb_disk_02
sudo chmod 777 /dev/disk/by-partlabel/ydb_disk_03

the cluster started correctly.
Now some simple testing, and then I will close the issue.

@mvgorbunov (Collaborator):

It's not a good idea to give 777 permissions on /dev/disk. We tested on Debian systems, and it is enough to add the ydb user (under which the server runs) to the disk group.
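A sketch of the group-based approach (the device group and mode shown are the usual Linux defaults, not verified here):

# Grant access through the disk group instead of chmod 777:
sudo usermod -aG disk ydb
# Group membership is read at login, so restart the ydbd service (or the VM)
# to get a fresh session, then verify:
id ydb                           # should list "disk" among the groups
ls -l /dev/disk/by-partlabel/    # symlink targets are typically root:disk, mode 660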

@fomichev3000 added the area/deploy (Cluster deployment issues) and help wanted (Extra attention is needed) labels on May 18, 2022
@mvgorbunov (Collaborator):

@aHsirG Is everything working fine? Can we close this issue?

@aHsirG (Author) commented May 23, 2022

All fine :)
My permission problem was on Ubuntu; I have not tested on Debian yet, so I don't know whether the permission issue is Ubuntu-specific or my own mistake.
I will check on Debian in a few days.

@aHsirG (Author) commented May 24, 2022

After retesting:

sudo groupadd ydb
sudo useradd ydb -g ydb
sudo usermod -aG disk ydb

works fine on both Debian and Ubuntu, but the VM needs to be restarted for the group change to take effect.
I did not know this because I am not a Unix administrator.
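For what it's worth, a full VM reboot is usually not required: group membership is picked up by any new login session (a sketch; behavior can vary by distro):

sudo -iu ydb id    # a fresh session for the ydb user should now include "disk"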

Thanks a lot for your help!

@aHsirG closed this as completed May 24, 2022