
refactor pd job mechanism #103

Closed · siddontang opened this issue May 13, 2016 · 5 comments

siddontang (Contributor) commented May 13, 2016

## Problems

  1. We may now have lots of jobs in PD, because we don't handle duplicated commands for a region.
  2. Even worse, suppose leader peer 1 of a region asks to add a peer: PD may first create a ConfChange job to add peer 2 but not execute it quickly; some time later peer 1 asks to add a peer again, and PD may create another ConfChange job to add peer 3, so we end up with two different jobs for the same region at the same time.
  3. If a job for a region fails because of a timeout or other error, PD retries it in a loop, which also blocks the following jobs of every other region.

## Principle

  1. One region can only do one thing at a time: if the leader of a region asks to add a peer, PD will skip any other commands for this region until the add-peer job is finished (success or failure are both fine).
  2. Expanding on 1, once PD decides what to do for a command, e.g. adding peer 3 to region a, it will keep executing that same job, no matter how many times it retries or how many times the leader asks.
  3. The execution of one region's job won't block the jobs of other regions.

## How to (Deprecated)

  1. Use the region ID as the job key, e.g. /job/region_id -> job, so we can discard duplicated jobs for a region and guarantee only one job per region at a time.
  2. PD can scan the jobs of multiple regions and execute them concurrently.
  3. Expanding on 2, the scanning above has a problem: we always get the jobs with the smaller region IDs first, e.g. if region 1 and region 2 both have a job and region 2's job is earlier, we still scan region 1's job first. So we should also keep a job list as before, which gives us two KV pairs: one maps region_id -> job_id and the other maps job_id -> job.
    So the job meta is: /job_region/region_id -> job_id and /job/job_id -> job (see the sketch after this list).
  4. Guarantee that all job executions are retryable and can't corrupt the Raft group, e.g. when adding peer 3 to region a, we must be sure that if adding peer 3 has already succeeded, a retried "add peer 3" fails instead of being applied again.
  5. There should be a cancel mechanism: if a job can't be executed successfully for a long time, PD may cancel it or notify the user to handle it manually.
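
A minimal sketch, in Go, of the deprecated two-key job layout above. The `Job` struct, the key helpers, and the map-backed store are illustrative assumptions (real PD would write both keys in one etcd transaction), not names from the PD source:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Job is a hypothetical job record, not the real PD job type.
type Job struct {
	JobID    uint64 `json:"job_id"`
	RegionID uint64 `json:"region_id"`
	Kind     string `json:"kind"` // e.g. "add_peer"
	PeerID   uint64 `json:"peer_id"`
}

func jobRegionKey(regionID uint64) string { return fmt.Sprintf("/job_region/%d", regionID) }
func jobKey(jobID uint64) string          { return fmt.Sprintf("/job/%d", jobID) }

// store stands in for etcd; both keys should be written atomically in practice.
type store map[string]string

// addJob registers a job only if the region has no pending job yet,
// which is how duplicated commands for the same region get discarded.
func (s store) addJob(j Job) error {
	idx := jobRegionKey(j.RegionID)
	if _, ok := s[idx]; ok {
		return fmt.Errorf("region %d already has a pending job", j.RegionID)
	}
	body, err := json.Marshal(j)
	if err != nil {
		return err
	}
	s[idx] = fmt.Sprint(j.JobID)        // /job_region/region_id -> job_id
	s[jobKey(j.JobID)] = string(body)   // /job/job_id -> job
	return nil
}

func main() {
	s := store{}
	fmt.Println(s.addJob(Job{JobID: 1, RegionID: 7, Kind: "add_peer", PeerID: 2})) // <nil>
	fmt.Println(s.addJob(Job{JobID: 2, RegionID: 7, Kind: "add_peer", PeerID: 3})) // duplicate is rejected
}
```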

/cc @ngaut @qiuyesuifeng @disksing @tiancaiamao


siddontang commented May 14, 2016

## How To

We have only one final state for every region.

  • PD caches all the region metadata in memory.
  • The leader peer of each region reports the region's status regularly.

### For conf change

Leader 1 of region a asks for a ChangePeer; PD adds peer 2 to region a directly, so the region meta in etcd now contains peers 1 and 2. The region in TiKV must eventually reach this final state.

If leader 1 asks for a ChangePeer again, PD still replies with "add peer 2" until it finds that region a already has peers 1 and 2; only then does it carry out the next conf change.
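
A minimal sketch, in Go with made-up types, of this "keep replying the same decision until the reported state matches" idea; `handleChangePeer` and the peer-ID allocation are illustrative assumptions, not PD's actual API:

```go
package main

import "fmt"

// Region is a hypothetical, simplified region descriptor.
type Region struct {
	ID    uint64
	Peers []uint64 // peer IDs
}

// pd keeps one decided "final state" per region: the peer it already chose to add.
type pd struct {
	pendingPeer map[uint64]uint64 // region ID -> peer ID that PD decided to add
	nextPeerID  uint64
}

// handleChangePeer always replies with the same decision until the reported
// region shows that the peer has actually been added.
func (p *pd) handleChangePeer(reported Region) (addPeer uint64) {
	if id, ok := p.pendingPeer[reported.ID]; ok {
		for _, peer := range reported.Peers {
			if peer == id {
				// The final state has been reached; a later request may
				// start the next conf change.
				delete(p.pendingPeer, reported.ID)
				return p.handleChangePeer(reported)
			}
		}
		return id // not applied yet: repeat the same decision
	}
	p.nextPeerID++
	p.pendingPeer[reported.ID] = p.nextPeerID
	return p.nextPeerID
}

func main() {
	p := &pd{pendingPeer: map[uint64]uint64{}, nextPeerID: 1}
	fmt.Println(p.handleChangePeer(Region{ID: 1, Peers: []uint64{1}}))    // decides to add peer 2
	fmt.Println(p.handleChangePeer(Region{ID: 1, Peers: []uint64{1}}))    // still replies peer 2
	fmt.Println(p.handleChangePeer(Region{ID: 1, Peers: []uint64{1, 2}})) // done, next decision is peer 3
}
```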

### For split

Leader 1 of region a [a-c) asks for a Split; PD only allocates the new region ID and peer IDs.
Region a [a-c) splits into a [a-b) + b [b-c).
The leader of region a [a-b) reports its region info to PD; PD finds it differs from the in-memory cache, so it persists the new meta first and then updates the cache.
The leader of region b [b-c) reports its region info to PD; PD finds it is neither in the cache nor in etcd, so it adds the region directly to etcd and the cache.

Problem: if region a reports its status first but region b's report is delayed, we may see a gap in the key range, because we don't know where to find the data in [b, c). This gap can be fixed later once region b reports.
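
A minimal sketch, again with hypothetical types, of how PD might reconcile a split report against its cache and persistent store as described above; the map-backed etcd stand-in and `onRegionReport` are assumptions for illustration:

```go
package main

import "fmt"

// Region is a simplified region descriptor used for illustration only.
type Region struct {
	ID       uint64
	StartKey string
	EndKey   string
}

// pd keeps a persistent copy (etcd stand-in) and an in-memory cache of region meta.
type pd struct {
	etcd  map[uint64]Region
	cache map[uint64]Region
}

// onRegionReport reconciles a leader's report: an unknown region is inserted,
// and a known region whose key range changed (e.g. after a split) is updated,
// persisting first and updating the cache afterwards.
func (p *pd) onRegionReport(r Region) {
	cached, ok := p.cache[r.ID]
	if !ok {
		p.etcd[r.ID] = r // new region: the right half of a split
		p.cache[r.ID] = r
		return
	}
	if cached.StartKey != r.StartKey || cached.EndKey != r.EndKey {
		p.etcd[r.ID] = r // existing region shrank after the split
		p.cache[r.ID] = r
	}
}

func main() {
	p := &pd{
		etcd:  map[uint64]Region{1: {ID: 1, StartKey: "a", EndKey: "c"}},
		cache: map[uint64]Region{1: {ID: 1, StartKey: "a", EndKey: "c"}},
	}
	p.onRegionReport(Region{ID: 1, StartKey: "a", EndKey: "b"}) // left half reports first
	p.onRegionReport(Region{ID: 2, StartKey: "b", EndKey: "c"}) // right half arrives later
	fmt.Println(p.cache[1], p.cache[2])
}
```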

qiuyesuifeng (Contributor) commented:

/cc @disksing It seems something should be done in ticlient.

tiancaiamao (Contributor) commented:

> One region can only do one thing at a time

Do you mean all commands, including read-only requests such as GetRegion?

> Expanding on 1, once PD decides what to do for a command, e.g. adding peer 3 to region a, it will keep executing that same job, no matter how many times it retries or how many times the leader asks.

This brings a risk of deadlock. We have two "states": one in the Raft node, the other in PD.

  1. The Raft node sends a request to PD.
  2. PD decides what to do and sends a job to the Raft node.
  3. The Raft node executes the job and reports its state to PD periodically.
  4. PD responds after it knows the state has been updated.

The whole process forms a ring; if Raft runs into trouble and can't finish the job (although I haven't come up with a specific case), a deadlock will occur.

tiancaiamao (Contributor) commented:

BTW, as long as we have GetRegion, those changes don't affect #102 much, i.e. if a peer finds itself inactive for a long time, it can ask PD whether it is still alive.

siddontang (Contributor, Author) commented:

PD now has no jobs, so there is no deadlock.
