
Optimize log file format for faster recovery #83

Merged: 29 commits into tikv:master, Aug 19, 2021

Conversation

@MrCroxx (Member) commented Aug 5, 2021

The new log file format can be found in the annotation on
`LogBatch::encode_to_bytes`.

What's changed?

  • Optimize log file format for faster recovery.
  • Refine EntryIndex fields in memtable.

TODO:

  • There are 2 [PLEASE REVIEW] todos.
  • Optimize read_file for the new log format.
  • Benchmark

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
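The format itself is documented in the code annotation, but the recovery-speed idea can be sketched: give each batch a small fixed-size header carrying the payload length, so recovery can locate batch boundaries without decoding full payloads. The field layout below is an illustrative assumption, not the actual on-disk format:

```rust
// Hypothetical batch header: length plus checksum. The real layout lives
// in the `LogBatch::encode_to_bytes` annotation; the 12-byte size and
// field order here are illustrative only.
fn encode_header(len: u64, checksum: u32) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12);
    buf.extend_from_slice(&len.to_le_bytes());
    buf.extend_from_slice(&checksum.to_le_bytes());
    buf
}

fn decode_header(buf: &[u8]) -> Option<(u64, u32)> {
    if buf.len() < 12 {
        return None; // truncated header, e.g. torn write at file tail
    }
    let len = u64::from_le_bytes(buf[0..8].try_into().ok()?);
    let checksum = u32::from_le_bytes(buf[8..12].try_into().ok()?);
    Some((len, checksum))
}
```

With the length known up front, recovery can seek past or into a batch instead of parsing the entire file.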

@MrCroxx (Member, Author) commented Aug 5, 2021

@tabokie I found that `LogBatch::entries_size` was replaced with `LogBatch::approximate_size` in raft-engine#82, but it is not used for now. I'm confused about why we need the field. Is it reserved for metrics?

Review thread on src/memtable.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Instead of reading the whole file, recovery now reads only the needed
parts of the log file, which saves recovery time.

What's changed?
+ `LogBatchFileReader`: yields recovered `LogBatch`es from the log file,
internally maintaining a buffer.
+ Move common file operation functions from `crate::file_pipe_log` to `crate::util`.
+ All tests passed.

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
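The buffered, header-first reading described above can be sketched like this (the names, the 8-byte length prefix, and the API shape are assumptions for illustration; the real `LogBatchFileReader` differs):

```rust
use std::io::{Read, Seek, SeekFrom};

// Assumed fixed-size length prefix; not the crate's actual format.
const HEADER_LEN: usize = 8;

// Sketch of a reader that pulls only the bytes each batch needs,
// rather than reading the whole file into memory.
struct BatchReader<R: Read + Seek> {
    file: R,
    offset: u64,
}

impl<R: Read + Seek> BatchReader<R> {
    fn next_batch(&mut self) -> std::io::Result<Option<Vec<u8>>> {
        self.file.seek(SeekFrom::Start(self.offset))?;
        let mut header = [0u8; HEADER_LEN];
        if self.file.read_exact(&mut header).is_err() {
            return Ok(None); // EOF: no more batches
        }
        let len = u64::from_le_bytes(header) as usize;
        let mut body = vec![0u8; len];
        self.file.read_exact(&mut body)?;
        self.offset += (HEADER_LEN + len) as u64;
        Ok(Some(body))
    }
}
```

Each call reads one header, then exactly the batch body it announces, so recovery I/O is proportional to what is actually replayed.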
Review thread on src/log_batch.rs (outdated, resolved)
Only write the first index of a `LogItem::Entries`, since the entries
it carries are always contiguous.

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
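Because the entries are contiguous, the full index list can be rebuilt from the first index and a count alone; a minimal sketch (names are illustrative, not the crate's fields):

```rust
// Rebuild contiguous entry indexes from the first index and a count,
// instead of persisting every index in the log file.
fn expand_indexes(first_index: u64, count: u64) -> Vec<u64> {
    (first_index..first_index + count).collect()
}
```

This trades a little recomputation at recovery time for fewer bytes written per batch.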
Review thread on src/memtable.rs (outdated, resolved)
Review thread on src/memtable.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/reader.rs (outdated, resolved)
+ Add bench `bench_recovery`.
+ Add a failpoint in `open` to disable the page cache when benchmarking recovery.

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
@tabokie (Member) left a comment:

Good work, let's see some results! BTW, you can put the benchmark into a separate PR.

Review thread on Cargo.toml (outdated, resolved)
Review thread on benches/bench_recovery.rs (outdated, resolved)
Review thread on benches/bench_recovery.rs (outdated, resolved)
@MrCroxx MrCroxx changed the title Optimize log file format for faster recovery. Optimize log file format for faster recovery Aug 13, 2021
@MrCroxx (Member, Author) commented Aug 13, 2021

/rebuild

Review thread on src/engine.rs (outdated, resolved):
let mut offset = (FILE_MAGIC_HEADER.len() + Version::len()) as u64;
debug!("recover from log file {:?}:{:?}", queue, file_id);
let fd = self.get_queue(queue).get_fd(file_id)?;
let mut reader = LogBatchFileReader::new(
Member:

How about we implement this logic in `recover_queue`, so as to reuse `LogBatchFileReader` and avoid buffering log batches?

MrCroxx (Member, Author):

That's a good idea. But right now some functions like `get_queue` are only implemented for `FilePipeLog`, while `recover_queue` is implemented for the trait `PipeLog`. Is it fine to implement a `recover_queue` only for the `FilePipeLog` version?

Member:

Indeed, ideally we shouldn't expose the log file. Maybe we could replace `read_file_into_log_batch` with `replay_log_batch`, which takes in a closure `Fn(LogBatch)`.
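A closure-taking replay API of the kind proposed here might be sketched as follows (the types and names are illustrative assumptions, not the crate's actual API):

```rust
// Illustrative stand-in for the decoded batch type.
struct LogBatch {
    bytes: Vec<u8>,
}

// Decoding is elided; the point is that callers receive batches through a
// closure instead of reading the log file themselves.
fn replay_log_batches<F: FnMut(LogBatch)>(raw_batches: Vec<Vec<u8>>, mut on_batch: F) {
    for bytes in raw_batches {
        on_batch(LogBatch { bytes });
    }
}
```

`FnMut` is used here so the caller can accumulate state (e.g. fill a memtable) while replaying; the comment above mentions `Fn(LogBatch)`, which is the stricter variant of the same idea.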

MrCroxx (Member, Author):

In the next PR I'll implement parallel recovery. I'm not sure if this modification is good for it. What about modifying it in the next PR or a separate PR?

Member:

Actually, I have a big PR coming up tomorrow and will fix this issue there.

Review thread on src/reader.rs (outdated, resolved)
Review thread on src/reader.rs (outdated, resolved)
Review thread on src/file_pipe_log.rs (resolved)
Review thread on src/reader.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Previously, `LogBatch` was partially filled (only `entries_index` needed
to be recovered), which was unreasonable. This commit separates entries
from items, and adds structs to hold them.

Signed-off-by: MrCroxx <mrcroxx@outlook.com>
@tabokie (Member) left a comment:

Code changes look good overall, but we still need some benchmarks to evaluate the regression in write performance and IO usage.

Review thread on src/reader.rs (outdated, resolved)
Review thread on src/reader.rs (outdated, resolved)
Review thread on src/reader.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/log_batch.rs (outdated, resolved)
Review thread on src/reader.rs (outdated, resolved)
Review thread on src/reader.rs (outdated, resolved)
@MrCroxx (Member, Author) commented Aug 18, 2021

Benchmark (recovery):

configs:

| config | 0 - default | 1 - compressed | 2 - small-batch | 3 - 10GB |
|---|---|---|---|---|
| total size | 1GB | 1GB | 1GB | 10GB |
| region count | 100 | 100 | 100 | 1000 |
| batch size | 1MB | 1MB | 1KB | 1MB |
| item size | 1KB | 1KB | 256B | 1KB |
| entry size | 256B | 256B | 32B | 256B |
| compression threshold | - | 8KB | - | - |

env: slow SSD, `fadvise(DONTNEED)` has been called before reads to disable the page cache.

| config | 0 - default | 1 - compressed | 2 - small-batch | 3 - 10GB |
|---|---|---|---|---|
| old | 5.3239s | 5.7234s | 17.207s | 64.410s |
| new | 1.1660s | 1.2687s | 7.5835s | 15.332s |
| change | -78.1% | -77.8% | -55.9% | -76.2% |

@MrCroxx (Member, Author) commented Aug 19, 2021

Benchmark (stress):

old:

--compression-threshold 8KB --regions 1000 --write-entry-count 10 --write-region-count 10 --write-sync true --write-threads 100 --time 600
[write]
Throughput(QPS) = 2991.55
Latency(μs) min = 224, avg = 33403.32, p50 = 26607, p90 = 65503, p95 = 78463, p99 = 116799, p99.9 = 184831, max = 332543
Fairness = 99.2%
Write Bandwidth = 2.6MiB/s

new:

--compression-threshold 8KB --regions 1000 --write-entry-count 10 --write-region-count 10 --write-sync true --write-threads 100 --time 600
[write]
Throughput(QPS) = 3270.73
Latency(μs) min = 289, avg = 30545.84, p50 = 26719, p90 = 52479, p95 = 65791, p99 = 92479, p99.9 = 133503, max = 422655
Fairness = 99.1%
Write Bandwidth = 3.8MiB/s

(Write Bandwidth / QPS), new relative to old: (3.8 / 3270.73) / (2.6 / 2991.55) × 100% ≈ 133% (includes randomness).
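As a sanity check, the quoted ratio follows directly from the two result sets above:

```rust
// Ratio of per-query write bandwidth (WB / QPS), new over old, as a
// percentage, using the stress-test numbers quoted above.
fn wb_per_qps_ratio(wb_new: f64, qps_new: f64, wb_old: f64, qps_old: f64) -> f64 {
    (wb_new / qps_new) / (wb_old / qps_old) * 100.0
}
```

With WB in MiB/s, `wb_per_qps_ratio(3.8, 3270.73, 2.6, 2991.55)` comes out to roughly 133%.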

@MrCroxx (Member, Author) commented Aug 19, 2021

@tabokie Is this PR ready to merge?

@MrCroxx (Member, Author) commented Aug 19, 2021

Benchmark (recovery, limited by batch count):

configs:

| config | 0 - default | 1 - compressed | 2 - 10k |
|---|---|---|---|
| total batch count | 1000 | 1000 | 10000 |
| region count | 100 | 100 | 1000 |
| batch size | 1MB | 1MB | 1MB |
| item size | 1KB | 1KB | 1KB |
| entry size | 256B | 256B | 256B |
| compression threshold | - | 8KB | - |

env: slow SSD, `fadvise(DONTNEED)` has been called before reads to disable the page cache.

| config | 0 - default | 1 - compressed | 2 - 10k |
|---|---|---|---|
| old | 4.1913s | 4.3947s | 50.943s |
| new | 0.9803s | 1.0350s | 11.382s |
| change | -77.1% | -76.4% | -77.7% |

tabokie merged commit f18b737 into tikv:master Aug 19, 2021
tabokie pushed a commit that referenced this pull request Sep 7, 2021
tabokie pushed a commit to tabokie/raft-engine that referenced this pull request Sep 8, 2021
tabokie pushed a commit to tabokie/raft-engine that referenced this pull request Sep 8, 2021