
Service abnormal after changing the IP of an all-in-one node #16119

Closed

saltfishh opened this issue Mar 3, 2023 · 11 comments
Labels
bug Something isn't working

Comments


saltfishh commented Mar 3, 2023

What happened:
After the IP of the all-in-one node was changed, some host information still shows the old IP, and virtual machines fail to start.
Environment:

  • OS (e.g. cat /etc/os-release):

    • CentOS Linux release 7.9.2009 (Core)
  • Kernel (e.g. uname -a):

  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):

    • release/3.8(97a18aaed22061603)

What happened next:

The server originally deployed as all-in-one had the IP 10.1.1.251. Its IP was changed to 10.13.1.2 following the official IP-change guide: https://www.cloudpods.org/zh/docs/setup/changeip/

Current state:

  • The VM detail page shows an error: Get "https://default-glance:30292/v1/images?id=54c63c51-4f6f-419c-87d2-26fb2cec5379&scope=system": dial tcp 10.108.209.16:30292: connect: connection refused
    • Request
{
  "method": "get",
  "url": "/v1/images",
  "headers": {
    "Accept": "application/json, text/plain, */*",
    "x-yunion-lang": "zh-CN"
  },
  "params": {
    "id": "54c63c51-4f6f-419c-87d2-26fb2cec5379",
    "scope": "system"
  }
}
  • Error
{
  "class": "ConnectRefusedError",
  "code": 499,
  "details": "Get \"https://default-glance:30292/v1/images?id=54c63c51-4f6f-419c-87d2-26fb2cec5379&scope=system\": dial tcp 10.108.209.16:30292: connect: connection refused",
  "time": "2023-03-03T07:06:59.967Z"
}
  • The web UI still lists the original VMs, but they cannot be powered on. Starting a VM fails with:

    • Request:

{
  "method": "post",
  "url": "/v2/servers/9ccf6a3e-7704-4f3e-80d0-b810bb8972e5/start",
  "headers": {
    "Accept": "application/json, text/plain, */*",
    "x-yunion-lang": "zh-CN"
  },
  "params": {}
}

    • Error:
{
  "class": "InvalidStatusError",
  "code": 400,
  "details": "部分磁盘未准备好",
  "time": "2023-03-03T07:09:08.765Z"
}
  • The host → Details → Network interfaces page still shows the original IP 10.1.1.251
  • The host's physical storage shows 0B, yet the host Details → Storage page can still display the other disk from before the IP change.
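The connection-refused detail above records the address the request actually dialed. A small sketch (my own helper, not part of Cloudpods) for pulling that address out of such an error string, so it can be compared against the current glance service ClusterIP:

```shell
# Extract the "dial tcp HOST:PORT" target from an API error detail.
# The detail string is copied verbatim from the error above.
detail='Get "https://default-glance:30292/v1/images?id=54c63c51-4f6f-419c-87d2-26fb2cec5379&scope=system": dial tcp 10.108.209.16:30292: connect: connection refused'
target=$(printf '%s\n' "$detail" | grep -oE 'dial tcp [0-9.]+:[0-9]+' | awk '{print $3}')
echo "$target"   # -> 10.108.209.16:30292
```

If this address no longer matches the service's ClusterIP after the IP change, the client is dialing a stale endpoint.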
@saltfishh saltfishh added the bug Something isn't working label Mar 3, 2023

saltfishh commented Mar 3, 2023

The default-glance pod is in an error state.

The last 10 lines of its log:

[info 230303 06:53:57 cloudcommon.InitDB(database.go:67)] using inmemory lockman
[info 230303 06:53:57 db.CheckSync(models.go:98)] Start check database schema ...
[info 230303 06:53:57 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 230303 06:53:57 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[info 230303 06:53:57 cloudcommon.AppDBInit(database.go:143)] Total 44 db workers, set db connection max
[info 230303 06:53:57 options.StartOptionManagerWithSessionDriver(manager.go:61)] OptionManager start to fetch service configs ...
[info 230303 06:53:57 options.optionsEquals(manager.go:110)] Options added: {"api_server":"https://10.13.1.2","s3_access_key":"minioadmin","s3_endpoint":"minio.onecloud-minio:9000","s3_secret_key":"yunionminio@admin"}
[info 230303 06:53:57 watcher.(*SInformerSyncManager).startWatcher(watcher.go:82)]ServiceConfigManager: Start resource informer watcher for service
[info 230303 06:53:57 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched
[info 230303 06:53:57 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success.
[fatal 230303 06:55:01 service.initS3(service.go:157)] failed init s3 client new minio client: fetchBuckets: client.ListBuckets: Get "http://minio.onecloud-minio:9000/": dial tcp: lookup minio.onecloud-minio on 10.96.0.10:53: no such host

Apart from that, no pods are in an abnormal state:

[root@cloud-stack ~]# kubectl get pods -A | grep -v Running
NAMESPACE             NAME                                                READY   STATUS             RESTARTS   AGE
onecloud              default-glance-5c9cd89f8b-t2mjw                     0/1     CrashLoopBackOff   44         4h30m
[root@cloud-stack ~]# 


zexi commented Mar 3, 2023

The glance error log contains:
[fatal 230303 06:55:01 service.initS3(service.go:157)] failed init s3 client new minio client: fetchBuckets: client.ListBuckets: Get "http://minio.onecloud-minio:9000/": dial tcp: lookup minio.onecloud-minio on 10.96.0.10:53: no such host

@saltfishh This looks like glance can no longer reach the minio container. Run the command below to check whether a minio pod exists:

$ kubectl get pods -A | grep minio
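The unresolvable name in the fatal log follows the usual Kubernetes in-cluster DNS scheme, `<service>.<namespace>`, so "no such host" means no Service named `minio` exists in the `onecloud-minio` namespace. A minimal sketch of that decomposition:

```shell
# Kubernetes in-cluster DNS: <service>.<namespace>[.svc.cluster.local]
fqdn='minio.onecloud-minio'
svc=${fqdn%%.*}   # service name  -> minio
ns=${fqdn#*.}     # namespace     -> onecloud-minio
echo "kubectl get svc -n $ns $svc"   # -> kubectl get svc -n onecloud-minio minio
```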


saltfishh commented Mar 3, 2023

@zexi

kubectl get pods -A -o wide | grep minio
onecloud-monitoring   monitor-minio-6c6cc845dc-9d9hl                      1/1     Running            0          18h     10.40.250.139   cloud-stack   <none>           <none>

There is only a single pod, named monitor-minio.


zexi commented Mar 3, 2023

@saltfishh An allinone node shouldn't be using the glance-to-minio mechanism at all. It seems your previous environment had glance configured with the minio connection options, but after redeployment the minio pod was not started. You can disable glance's use of minio as follows:

# Find the storage_driver option and change it to local
$ climc service-config-edit glance
...
  storage_driver: local
...

# Then restart the glance service and see if that helps
$ kubectl rollout restart deployment -n onecloud default-glance


saltfishh commented Mar 3, 2023

However, after I just upgraded to 3.9, the default-webconsole pod broke.
I already added the -sync-user option under command in the deployment, but the same error persists.

@saltfishh The cause is that webconsole gained a MySQL connection in 3.9; upgrading from an early 3.8 release sometimes hits this problem. Fix it with the following steps:

# Delete the webconsole configmap; the operator will regenerate the file
$ kubectl delete configmap -n onecloud default-webconsole

# Delete the default-webconsole deployment and let the operator recreate it
$ kubectl delete deployment -n onecloud default-webconsole

Once the webconsole configmap and deployment have been recreated, things should be back to normal.


zexi commented Mar 3, 2023

However, after I just upgraded to 3.9, the default-webconsole pod broke.
I already added the -sync-user option under command in the deployment, but the same error persists.

@saltfishh The cause is that webconsole gained a MySQL connection in 3.9; upgrading from an early 3.8 release sometimes hits this problem. Fix it with the following steps:

# Delete the webconsole configmap; the operator will regenerate the file
$ kubectl delete configmap -n onecloud default-webconsole

# Delete the default-webconsole deployment and let the operator recreate it
$ kubectl delete deployment -n onecloud default-webconsole

Once the webconsole configmap and deployment have been recreated, things should be back to normal.

@saltfishh Sorry, I accidentally overwrote your comment; you can try the steps above.


saltfishh commented Mar 3, 2023

@zexi


Thanks for the reply, I'm trying that now.
One more question. I was originally on 3.8 and changed the IP, but VMs currently fail to start, reporting that some disks are not ready. The storage type is local storage.
The block storage configured earlier was GPFS; I had only just started using Cloudpods back then and don't know how it ended up as that type.
After upgrading to 3.9, can I still use the old data?


zexi commented Mar 4, 2023

@saltfishh
If you never added GPFS storage manually, you probably hit an earlier bug that misidentified local storage as GPFS. Run the commands below and post the output; we need to see what the current storages and disks look like:

# Enter the climc container
$ kubectl exec -ti -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- bash
# Run the following commands inside the climc container and post the results

# Query host information
$ climc host-list --details

# Query storage information
$ climc storage-list --details

# Query host-storage associations
$ climc host-storage-list --details

# List virtual machines
$ climc server-list --scope system --details

# List disks
$ climc disk-list --details --scope system

# List VM-disk associations
$ climc server-disk-list --scope system --details
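To spot the misdetection described above in the `climc storage-list` output, one option is to filter for storages whose type is not local. A sketch on a made-up sample table (column names, values, and the field index are assumptions; adjust them to the real output):

```shell
# Hypothetical storage-list output; a stray "gpfs" row on an allinone node
# would indicate the local-storage-misdetected-as-GPFS bug.
sample='ID  Name         Capacity  Storage_type  Status
s1  host1-local  500G      local         online
s2  host1-gpfs   500G      gpfs          online'

# Skip the header, print name and type of every non-local storage.
printf '%s\n' "$sample" | awk 'NR > 1 && $4 != "local" {print $2, $4}'
# -> host1-gpfs gpfs
```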


saltfishh commented Mar 4, 2023


@zexi Thanks for the reply. The virtual machines are running now, but they cannot reach the external network.
VPC subnet: 10.11.1.0/128
EIP subnet: 10.12.1.0/128
The host has a single NIC, 10.13.1.2/24.
Pinging an EIP is unreachable both from the 10.13.1.0/24 network (a route to the .12 subnet with next hop 10.13.1.2 has been added) and from the host itself.
Routes from the host to the VPC/EIP subnets go straight out through the gateway; traffic does not seem to reach the VMs.
Below is a traceroute from the host to a VPC EIP.

traceroute to 10.12.1.10 (10.12.1.10), 30 hops max, 60 byte packets
 1  gateway (10.13.1.1)  0.377 ms  0.276 ms  0.261 ms
 2  117.34.xx.xx (117.34.xx.xx)  1.171 ms  1.387 ms  1.550 ms
---

Below is the error log from the default-ovn-north pod:

ovsdb-server-sb: 2023-03-04T06:38:07.925Z|12119|jsonrpc|WARN|tcp:10.108.209.16:46136: receive error: Connection reset by peer
ovsdb-server-sb: 2023-03-04T06:38:07.925Z|12120|reconnect|WARN|tcp:10.108.209.16:46136: connection dropped (Connection reset by peer)
ovsdb-server-sb: 2023-03-04T06:38:14.949Z|12121|reconnect|WARN|tcp:10.108.209.16:46244: connection dropped (Connection reset by peer)
ovsdb-server-sb: 2023-03-04T06:38:21.973Z|12122|jsonrpc|WARN|Dropped 1 log messages in last 7 seconds (most recently, 7 seconds ago) due to excessive rate
ovsdb-server-sb: 2023-03-04T06:38:21.973Z|12123|jsonrpc|WARN|tcp:10.108.209.16:46334: receive error: Connection reset by peer
ovsdb-server-sb: 2023-03-04T06:38:21.973Z|12124|reconnect|WARN|tcp:10.108.209.16:46334: connection dropped (Connection reset by peer)
ovsdb-server-sb: 2023-03-04T06:38:29.001Z|12125|jsonrpc|WARN|tcp:10.108.209.16:46404: receive error: Connection reset by peer
ovsdb-server-sb: 2023-03-04T06:38:29.001Z|12126|reconnect|WARN|tcp:10.108.209.16:46404: connection dropped (Connection reset by peer)
ovsdb-server-sb: 2023-03-04T06:38:36.026Z|12127|reconnect|WARN|tcp:10.108.209.16:46476: connection dropped (Connection reset by peer)
ovsdb-server-sb: 2023-03-04T06:38:43.053Z|12128|jsonrpc|WARN|Dropped 1 log messages in last 7 seconds (most recently, 7 seconds ago) due to excessive rate
ovsdb-server-sb: 2023-03-04T06:38:43.053Z|12129|jsonrpc|WARN|tcp:10.108.209.16:46618: receive error: Connection reset by peer

I have tried restarting that pod and the host, without success.

@saltfishh

@zexi Hi, is there any way to handle this?

@saltfishh

Many thanks, it's fixed. sdn_enable_eip_man was not enabled in /etc/yunion/host.conf.
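For reference, a minimal fragment of what that fix looks like, assuming the YAML key: value layout of /etc/yunion/host.conf (the surrounding keys, and whether a host service restart is needed afterwards, are assumptions not confirmed in this thread):

```yaml
# /etc/yunion/host.conf (fragment)
# enable the EIP gateway function on this host
sdn_enable_eip_man: true
```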

@saltfishh saltfishh reopened this Mar 6, 2023