Node watch mechanism stops working; paused jobs are still executed #121

Closed
godtou opened this issue Nov 27, 2018 · 10 comments

@godtou

godtou commented Nov 27, 2018

Please answer these questions before submitting your issue. Thanks!

  1. Which versions of Go and cronsun are you using?
    go 1.11
    cronsun v0.3.4

  2. What operating system and processor architecture are you using (go env)?
    CentOS 6.5 amd64

  3. What did you do?
    If possible, provide a recipe for reproducing the error.
    A complete runnable program is good.

On some nodes the watch mechanism stops working, so job-change events are no longer synced to them.

  4. What did you expect to see?

When a job is paused, modified, or deleted, every node should sync the job's new state correctly.

  5. What did you see instead?

So far the node watch has broken twice, and as a result jobs that were paused or deleted kept being scheduled on that node as usual.

@QLeelulu
Contributor

Which version of etcd are you using?

@godtou
Author

godtou commented Nov 27, 2018

Which version of etcd are you using?

etcdctl version: 3.3.10
API version: 3.3

@godtou
Author

godtou commented Nov 27, 2018

@QLeelulu

Progress_Notify - When set, the watch will periodically receive a WatchResponse with no events, if there are no recent events. It is useful when clients wish to recover a disconnected watcher starting from a recent known revision. The etcd server decides how often to send notifications based on current server load.

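I'm not sure whether this field in the WatchRequest would help; I've only just started playing with etcd and don't know it well yet. I noticed that cronsun does not set it in the opts of its watch request.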

@QLeelulu
Contributor

Thanks.
Could you post the ID of a job whose change did not take effect, together with the corresponding cronnode log?

@godtou
Author

godtou commented Nov 28, 2018

Thanks.
Could you post the ID of a job whose change did not take effect, together with the corresponding cronnode log?

Normal node

2018-11-26T14:25:50.566+0800 INFO node/server.go:62 cronsun node[2a3a7b4e-a729-4edb-8fe1-9ac00bb53c9e] pid[21660] service started, Ctrl+C or send kill sign to exit
2018-11-26T14:28:58.290+0800 INFO node/node.go:270 job[e3acc018] group[Finance] rule[e2acc018] timer[0 * * * * *] has added
2018-11-26T14:31:24.295+0800 INFO node/node.go:270 job[e5acc018] group[Finance] rule[e4acc018] timer[0 */1 * * * ?] has added
2018-11-26T14:32:41.596+0800 INFO node/node.go:270 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * ?] has added
2018-11-26T14:37:12.285+0800 INFO node/node.go:292 job[e3acc018] group[Finance] rule[e2acc018] timer[0 * * * * *] has updated
2018-11-26T14:37:19.390+0800 INFO node/node.go:292 job[e5acc018] group[Finance] rule[e4acc018] timer[0 */1 * * * ?] has updated
2018-11-26T14:37:25.476+0800 INFO node/node.go:292 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * ?] has updated
2018-11-26T14:37:57.890+0800 INFO node/node.go:292 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * *] has updated
2018-11-26T14:38:06.758+0800 INFO node/node.go:292 job[e5acc018] group[Finance] rule[e4acc018] timer[0 */1 * * * *] has updated
2018-11-26T14:38:22.687+0800 INFO node/node.go:292 job[e5acc018] group[Finance] rule[e4acc018] timer[0 * * * * *] has updated
2018-11-26T14:40:16.087+0800 INFO node/node.go:270 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has added
2018-11-26T19:44:44.138+0800 INFO node/node.go:299 job[e3acc018] group[Finance] rule[e2acc018] timer[0 * * * * *] has deleted
2018-11-26T19:44:47.864+0800 INFO node/node.go:299 job[e5acc018] group[Finance] rule[e4acc018] timer[0 * * * * *] has deleted
2018-11-26T19:44:49.704+0800 INFO node/node.go:299 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * *] has deleted
2018-11-26T19:44:51.548+0800 INFO node/node.go:299 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has deleted
2018-11-27T18:05:06.886+0800 INFO node/node.go:270 job[e3acc018] group[Finance] rule[e2acc018] timer[0 * * * * *] has added
2018-11-27T18:05:08.808+0800 INFO node/node.go:270 job[e5acc018] group[Finance] rule[e4acc018] timer[0 * * * * *] has added
2018-11-27T18:05:10.472+0800 INFO node/node.go:270 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * *] has added
2018-11-27T18:05:12.009+0800 INFO node/node.go:270 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has added

Abnormal node

2018-11-26T15:27:39.324+0800 INFO node/server.go:62 cronsun node[06a3dfbf-ecd8-44d4-bc1f-3706ddeedf3d] pid[15412] service started, Ctrl+C or send kill sign to exit
2018-11-26T15:27:56.214+0800 INFO node/node.go:270 job[e3acc018] group[Finance] rule[e2acc018] timer[0 * * * * *] has added
2018-11-26T15:27:56.216+0800 INFO node/node.go:270 job[e5acc018] group[Finance] rule[e4acc018] timer[0 * * * * *] has added
2018-11-26T15:27:56.217+0800 INFO node/node.go:270 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has added
2018-11-26T15:27:56.218+0800 INFO node/node.go:270 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * *] has added
2018-11-27T17:16:19.394+0800 INFO node/server.go:71 exit success
nohup: ignoring input
2018-11-27T17:16:33.907+0800 INFO node/server.go:62 cronsun node[06a3dfbf-ecd8-44d4-bc1f-3706ddeedf3d] pid[4151] service started, Ctrl+C or send kill sign to exit
2018-11-27T18:05:06.886+0800 INFO node/node.go:270 job[e3acc018] group[Finance] rule[e2acc018] timer[0 * * * * *] has added
2018-11-27T18:05:08.808+0800 INFO node/node.go:270 job[e5acc018] group[Finance] rule[e4acc018] timer[0 * * * * *] has added
2018-11-27T18:05:10.473+0800 INFO node/node.go:270 job[e7acc018] group[Finance] rule[e6acc018] timer[0 * * * * *] has added
2018-11-27T18:05:12.010+0800 INFO node/node.go:270 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has added

As you can see, the abnormal node did not receive any of the job changes.

@QLeelulu
Contributor

etcdctl cluster-health

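Could you use this command to check whether the etcd cluster is healthy?
Also, why do the updates logged by the two cronnode nodes differ?
For example: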

2018-11-26T15:27:56.217+0800 INFO node/node.go:270 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has added

There is no corresponding log entry on the normal node at the same time.

@godtou
Author

godtou commented Nov 28, 2018

etcdctl cluster-health

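Could you use this command to check whether the etcd cluster is healthy?
Also, why do the updates logged by the two cronnode nodes differ?
For example: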

2018-11-26T15:27:56.217+0800 INFO node/node.go:270 job[f9acc018] group[Finance] rule[f8acc018] timer[0 1 0 * * *] has added

There is no corresponding log entry on the normal node at the same time.

etcdctl endpoint health

:2379 is healthy: successfully committed proposal: took = 570.227µs
:2379 is healthy: successfully committed proposal: took = 909.463µs
:2379 is healthy: successfully committed proposal: took = 759.799µs

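The difference between the two nodes' logs comes from the order in which they were started. The normal node started at 14:25, after which the jobs were created and modified; the abnormal node did not start until 15:27, at which point it only synced the existing jobs. The jobs are assigned to a group made up of these two nodes and are configured as "single machine, single process".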

@QLeelulu
Contributor

QLeelulu commented Nov 28, 2018

Could you also check the output of etcdctl endpoint --cluster status -w table?

@godtou
Author

godtou commented Nov 28, 2018

Could you also check the output of etcdctl endpoint --cluster status -w table?

+--------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------+------------------+---------+---------+-----------+-----------+------------+
|              :2379 | e20b8da8ecbc3363 |  3.3.10 |  242 MB |     false |       107 |    5556450 |
|              :2379 |  ca3c4cdc869af0f |  3.3.10 |  242 MB |     false |       107 |    5556451 |
|              :2379 |  685b0ebcb18cd06 |  3.3.10 |  242 MB |      true |       107 |    5556452 |
+--------------------+------------------+---------+---------+-----------+-----------+------------+
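Reading the status table can be turned into a small check. The sketch below is hypothetical: `memberStatus`, `checkCluster`, and the `maxLag` threshold of 100 are illustrative names and values, not etcdctl features. It flags a cluster that does not have exactly one leader, whose members disagree on the raft term, or whose raft indexes are spread further apart than the threshold. Fed the three rows above, it reports no problems (one leader, all members on term 107, raft indexes within 2 of each other), which is consistent with the `etcdctl endpoint health` output showing every member healthy.

```go
package main

import "fmt"

// memberStatus mirrors the columns of `etcdctl endpoint status -w table`
// that matter for this check (hypothetical type, filled in by hand).
type memberStatus struct {
	ID        string
	IsLeader  bool
	RaftTerm  uint64
	RaftIndex uint64
}

// checkCluster describes every inconsistency it finds: members on a different
// raft term than the first row, a leader count other than one, or a raft
// index spread larger than maxLag.
func checkCluster(ms []memberStatus, maxLag uint64) []string {
	if len(ms) == 0 {
		return []string{"no members"}
	}
	var problems []string
	leaders := 0
	minIdx, maxIdx := ms[0].RaftIndex, ms[0].RaftIndex
	for _, m := range ms {
		if m.IsLeader {
			leaders++
		}
		if m.RaftTerm != ms[0].RaftTerm {
			problems = append(problems, fmt.Sprintf("member %s is on term %d, first member on %d", m.ID, m.RaftTerm, ms[0].RaftTerm))
		}
		if m.RaftIndex < minIdx {
			minIdx = m.RaftIndex
		}
		if m.RaftIndex > maxIdx {
			maxIdx = m.RaftIndex
		}
	}
	if leaders != 1 {
		problems = append(problems, fmt.Sprintf("expected 1 leader, found %d", leaders))
	}
	if maxIdx-minIdx > maxLag {
		problems = append(problems, fmt.Sprintf("raft index spread %d exceeds %d", maxIdx-minIdx, maxLag))
	}
	return problems
}

func main() {
	// Rows copied from the table above; maxLag = 100 is an arbitrary threshold.
	rows := []memberStatus{
		{ID: "e20b8da8ecbc3363", IsLeader: false, RaftTerm: 107, RaftIndex: 5556450},
		{ID: "ca3c4cdc869af0f", IsLeader: false, RaftTerm: 107, RaftIndex: 5556451},
		{ID: "685b0ebcb18cd06", IsLeader: true, RaftTerm: 107, RaftIndex: 5556452},
	}
	problems := checkCluster(rows, 100)
	if len(problems) == 0 {
		fmt.Println("no problems found")
	}
	for _, p := range problems {
		fmt.Println(p)
	}
}
```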

@QLeelulu
Contributor

QQ discussion group: 123731057

Could you join the QQ group?
