phxpaxos fails to start #32

Closed
xinmingyao opened this issue Nov 22, 2016 · 9 comments

Comments

@xinmingyao

After a restart, the paxos service on one of our servers suddenly would not start. The log shows:

2016-11-22 08:08:03.11s CheckpointInstanceID 6497752
2016-11-22 08:08:03.11s ERR(0): PN8phxpaxos8DatabaseE::GetMinChosenInstanceID no min chosen instanceid
2016-11-22 08:08:03.11s ERR(0): PN8phxpaxos8InstanceE::PlayLog log read fail, instanceid 6497754 ret 1
What could be causing this?

The version is master.
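
The prefixes in the log lines above, such as `PN8phxpaxos8DatabaseE`, look like Itanium C++ ABI mangled type names. As a reading aid, here is a minimal sketch that demangles them with `abi::__cxa_demangle`, assuming GCC or Clang on Linux. The helper name `DemangleTypeName` is made up for illustration and is not part of phxpaxos.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Demangle an Itanium-ABI type name such as the prefixes in the log above.
// Returns the input unchanged if demangling fails.
// DemangleTypeName is a hypothetical helper, not a phxpaxos function.
std::string DemangleTypeName(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer, so release it with free().
    std::unique_ptr<char, void (*)(void*)> demangled(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    if (status == 0 && demangled) {
        return std::string(demangled.get());
    }
    return std::string(mangled);
}
```

With this, `DemangleTypeName("PN8phxpaxos8DatabaseE")` yields `phxpaxos::Database*`, so the first ERR line comes from `phxpaxos::Database::GetMinChosenInstanceID`.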

@lynncui00
Collaborator

Judging from this log, part of the PaxosLog has been lost.
Turn on the full log level and paste the complete startup log so we can take a look.

@xinmingyao
Author

We have bSync set to false.

2016-11-22 09:03:51.11s CheckpointInstanceID 6497752

2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8LogStoreE::ParseFileID fileid 19 offset 52932087 checksum 2691965949
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos8LogStoreE::RebuildIndex START fileid 19 offset 52932087 checksum 2691965949
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::OpenFile ok, path ../storage/paxoslog/g0/vfile/19.f
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::RebuildIndexForOneFile rebuild one index ok, fileid 19 offset 52932087 instanceid 6570821 checksum 2691965949 buffer size 399 )
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos8LogStoreE::RebuildIndexForOneFile File Data End, fileid 19 offset 52932498
2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8LogStoreE::RebuildIndexForOneFile file not exist, filepath ../storage/paxoslog/g0/vfile/20.f
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::RebuildIndex END rebuild ok, nowfileid 20
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::OpenFile ok, path ../storage/paxoslog/g0/vfile/19.f
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos8LogStoreE::Init ok, path ../storage/paxoslog/g0/vfile fileid 19 meta checksum 1676158829 nowfilesize 104857600 nowfilewriteoffset 52932498 )
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8DatabaseE::Init OK, db_path ../storage/paxoslog/g0
2016-11-22 09:03:51.11s Showy: PN8phxpaxos13MultiDatabaseE::Init OK, DBPath ../storage/paxoslog groupcount 1
2016-11-22 09:03:51.11s Showy: PN8phxpaxos5PNodeE::InitLogStorage OK, use default logstorage
2016-11-22 09:03:51.11s Showy: PN8phxpaxos5PNodeE::InitNetWork OK, use default network
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos18MasterStateMachineE::Init OK, master nodeid 2095519671909359521 version 6560819 expiretime 1222895188
2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8DatabaseE::GetFromLevelDB LevelDB.Get not found, instanceid 18446744073709551614
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos20SystemVariablesStoreE::Read DB.Get not found, groupidx 0
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos9SystemVSME::Init variables not exist
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos9SystemVSME::RefleshNodeID ip 10.10.122.228 port 8097 nodeid 16463482425872228257
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos9SystemVSME::RefleshNodeID ip 10.10.123.130 port 8097 nodeid 9402119685132001185
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos9SystemVSME::RefleshNodeID ip 10.10.123.153 port 8097 nodeid 11059444348004343713
2016-11-22 09:03:51.11s Imp(0): PN8phxpaxos6ConfigE::Init OK
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos11MasterDamonE::TryBeMaster Ohter as master, can't try be master, masterid 2095519671909359521 myid 9402119685132001185
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos11MasterDamonE::run TryBeMaster, sleep time 3299ms
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8PaxosLogE::GetMaxInstanceIDFromLog OK, MaxInstanceID 6570821 groupidsx 0
2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8LogStoreE::ParseFileID fileid 19 offset 52932087 checksum 2691965949
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::OpenFile ok, path ../storage/paxoslog/g0/vfile/19.f
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::Read ok, fileid 19 offset 52932087 instanceid 6570821 buffer size 399
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos13AcceptorStateE::Load GroupIdx 0 InstanceID 6570821 PromiseID 246 PromiseNodeID 2095519671909359521 AccectpedID 246 AcceptedNodeID 2095519671909359521 ValueLen 359 Checksum 275208196
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8AcceptorE::Init OK
2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8DatabaseE::GetFromLevelDB LevelDB.Get not found, instanceid 18446744073709551615
2016-11-22 09:03:51.11s ERR(0): PN8phxpaxos8DatabaseE::GetMinChosenInstanceID no min chosen instanceid
2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8LogStoreE::ParseFileID fileid 0 offset 34 checksum 4187667134
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::OpenFile ok, path ../storage/paxoslog/g0/vfile/0.f
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::Read ok, fileid 0 offset 34 instanceid 0 buffer size 70
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos7CleanerE::FixMinChosenInstanceID ok, old minchosen 0 fix minchosen 0
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8InstanceE::Init Acceptor.OK, Log.InstanceID 6570821 Checkpoint.InstanceID 6497753
2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8LogStoreE::ParseFileID fileid 19 offset 24157577 checksum 2042402563
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::OpenFile ok, path ../storage/paxoslog/g0/vfile/19.f
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8LogStoreE::Read ok, fileid 19 offset 24157577 instanceid 6497753 buffer size 399
2016-11-22 09:03:51.11s no need to sync checkpoint, skiptimes 1

2016-11-22 09:03:51.11s DEBUG(0): PN8phxpaxos8DatabaseE::GetFromLevelDB LevelDB.Get not found, instanceid 6497754
2016-11-22 09:03:51.11s Showy(0): PN8phxpaxos8PaxosLogE::ReadState DB.Get not found, groupidx 0
2016-11-22 09:03:51.11s ERR(0): PN8phxpaxos8InstanceE::PlayLog log read fail, instanceid 6497754 ret 1

@lynncui00
Collaborator

LevelDB has lost the data for one instance, 6497754; the cause is unknown.
Build the paxos_log_tools tool under src/tools and check how much data is missing after 6497753.

@xinmingyao
Author

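I checked the data between 6497753 and 6570821 (the maximum).
Missing: 6497754, 6498905, 6530370, 6540384, and then roughly 16660 entries between 65707722 and 6553500.
The disk is an ordinary SATA disk, and bSync is set to false.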
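
The check described above amounts to listing the holes in a range of instance IDs. A minimal sketch of that scan, assuming the IDs that can still be read have already been collected into a `std::set`. The names `InstanceID`, `Range`, and `FindMissingRanges` are hypothetical and do not come from paxos_log_tools.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using InstanceID = uint64_t;
using Range = std::pair<InstanceID, InstanceID>;  // inclusive [first, second]

// Returns the inclusive ranges of IDs in [first, last] that are absent
// from `present`. Hypothetical helper, not part of paxos_log_tools.
std::vector<Range> FindMissingRanges(const std::set<InstanceID>& present,
                                     InstanceID first, InstanceID last) {
    assert(first <= last);
    std::vector<Range> gaps;
    InstanceID next = first;  // lowest ID not yet accounted for
    for (auto it = present.lower_bound(first);
         it != present.end() && *it <= last; ++it) {
        if (*it > next) {
            gaps.emplace_back(next, *it - 1);
        }
        if (*it == last) {
            return gaps;  // also avoids overflow when last is UINT64_MAX
        }
        next = *it + 1;
    }
    // Everything from `next` through `last` was never seen.
    gaps.emplace_back(next, last);
    return gaps;
}
```

For example, with IDs 10, 11, 13 and 16 present, scanning 10 through 16 reports the gaps 12 to 12 and 14 to 15.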

@lynncui00
Collaborator

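Was the machine rebooted during this period? With bSync set to false, a machine reboot can cause problems.

For now, the only fix is to delete the paxos log data and restart.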

@xinmingyao
Author

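With bSync set to false, performance on a SATA disk is rather poor.
Could phxpaxos provide a stop interface, so that the paxos log can be flushed to disk before a restart? Also, it would be better if only the data at the tail could be lost rather than data in the middle. That way the service could still start, and the lost data could be synced over from other machines in the cluster.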

@lynncui00
Collaborator

lynncui00 commented Nov 23, 2016

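A machine reboot may leave no chance to write to disk, for example when the machine suddenly loses power.
Also, data in the middle normally should not be lost; what happened here is an extreme anomaly.
To avoid losing data, there is currently no good option other than setting bSync to true. In addition, phxpaxos performs very poorly on SATA disks; if you need to run it on SATA, consider rewriting the storage module yourself.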
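
To illustrate the trade-off behind bSync: the thread does not show how phxpaxos implements the option, so the sketch below only assumes that a synchronous write means forcing data to stable storage before returning, using the POSIX `fsync` call. `AppendRecord` and its `sync` parameter are hypothetical names, not the phxpaxos storage code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Appends `data` to the file at `path`. When `sync` is true, fsync() is
// called before returning, so the data should survive a sudden power loss.
// When `sync` is false, the data may still sit in the OS page cache.
// Hypothetical helper; not taken from phxpaxos.
bool AppendRecord(const std::string& path, const std::string& data, bool sync) {
    assert(!path.empty());
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return false;

    bool ok = true;
    size_t written = 0;
    while (written < data.size()) {
        ssize_t n = ::write(fd, data.data() + written, data.size() - written);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry
            ok = false;
            break;
        }
        written += static_cast<size_t>(n);
    }
    // The expensive step on spinning disks: wait for the device to persist.
    if (ok && sync && ::fsync(fd) != 0) ok = false;
    if (::close(fd) != 0) ok = false;
    return ok;
}
```

Each `fsync` waits for the device, which is typically slow on spinning SATA disks; skipping it is faster but leaves recent writes exposed to a crash or power cut.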

@xinmingyao
Author

For now, I added a method on top of paxos_log_tools that deletes everything from the first missing entry through the maximum.
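
The actual change to paxos_log_tools is not shown in this thread. As a rough sketch of the idea only, the code below uses a `std::map` as a hypothetical stand-in for the on-disk log and drops every entry from the first gap onward, leaving a contiguous prefix. `LogStore` and `TruncateFromFirstGap` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using InstanceID = uint64_t;
// Hypothetical in-memory stand-in for the paxos log: instance ID -> value.
using LogStore = std::map<InstanceID, std::string>;

// Walks contiguous IDs starting at `start` and erases every entry from the
// first missing ID onward. Returns the number of entries erased.
// Illustrative only; this is not the method added to paxos_log_tools.
size_t TruncateFromFirstGap(LogStore& log, InstanceID start) {
    const size_t before = log.size();
    InstanceID expected = start;
    auto it = log.lower_bound(start);
    while (it != log.end() && it->first == expected) {
        ++it;
        ++expected;
    }
    // `it` now points at the first entry past the gap (or at end()).
    const size_t erased = static_cast<size_t>(std::distance(it, log.end()));
    log.erase(it, log.end());
    assert(log.size() + erased == before);
    return erased;
}
```

After truncation the log ends just before the first hole, which matches the earlier suggestion that losing only tail data lets the node start and sync the rest from other nodes in the cluster.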
