Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PCP-21] Titan GC doesn’t affect online write #121

Merged
merged 29 commits into from
Feb 15, 2020

Conversation

tabokie
Copy link
Member

@tabokie tabokie commented Nov 29, 2019

Green Blob GC by merge operator

tikv/tikv#5739

  • New value type: kMergeBlobIndex
  • New value content: LSN + BlobIndex

Merge Semantics: keep freshest version

A. key + merged index: source-index must be older than key, apply key

B. index + merged index: if index is stale (GCed), apply merged-index or simply delete key; else apply index

implementation:

  • if any merge-index is sourced from base-index, it's certain that base-index is stale.
  • source-file-number info can be stored in merge-blob-index

C. merged index + merged index (merge against delete): deletes key (use deletion marker instead)

D. merged index + merged index (partial merge): by LSN version of original put.

implementation:

  • embed LSN in merge operand, during blob building append LSN to blob key, during GC pass this LSN to merge operand (also to new blob file).
  • embed LSN in blob key, when conflict read blob for latest index. (partial merge must read blob)

GC workflow:

Three approaches to merge new index into rocksdb, requires further benchmarking:

  • plain merge
  • ingestion
  • fast ingestion
    • write SST of newly created blob index as Merge op
    • ingest external SST in a weak consistent way (do not pause write and flush on overlap, only guarantee ingested file is latest in persistent level)

Rocksdb Change:

tikv/rocksdb#147

  • support kTypeBlobIndex in merge operation (to support user-specified merge on top of gc-merge, extend merge value to include operand type)
  • add value type input/output to merge interface
  • merge operator deletes key
  • skip_memtable_check flag in IngestOption

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
@tabokie
Copy link
Member Author

tabokie commented Nov 30, 2019

A bug is spotted in current work: merge's correctness depends on retrieving up-to-date logical file state, meanwhile the file state transit depends on all new indexes being inserted so we can use LatestSequenceNumber as obsolete sequence. The problem is tricky to solve given titan inclines to atomically change file state and persist it to log.
New idea is to avoid file state inspection (which also save us bunch of locking). Instead, embed old file number to merge operands, so that base index is ignored if sharing the same file number with merges (announcing itself the source index producing this merge operand).
New value content is now LSN + old file number + BlobIndex

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
@tabokie
Copy link
Member Author

tabokie commented Nov 30, 2019

ready for review now @yiwu-arbug @Connor1996. benchmark coming soon.

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
@yiwu-arbug
Copy link
Collaborator

@tabokie Awesome! Let's see how benchmark is doing.

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
@tabokie
Copy link
Member Author

tabokie commented Dec 29, 2019

@yiwu-arbug @Connor1996 Here are some test results, Write performance brought by new gc is impressive, while gc-with-merge is more feasible with read regression below 10 percent.

dbbench (./titandb_bench --key_size="16" --value_size="256" --cache_size="4294967296" --titan_min_blob_size="0" --num="10000000" --numdistinct="10000" --benchmarks="fillseq,overwrite,readrandom" --threads="6" --titan_max_background_gc="6"):
gc_rewrite_mode = kDefault
fillseq: 99.3 MB/s; overwrite: 23.2 MB/s; readrandom: 130.5 MB/s
gc_rewrite_mode = kMerge
fillseq: 102.5 MB/s; overwrite: 59.3 MB/s; readrandom: 122.8 MB/s
gc_rewrite_mode = kFastIngest
fillseq: 101.1 MB/s; overwrite: 61.8 MB/s; readrandom: 82.4 MB/s

go-ycsb (max-background-gc=4, threads=256, readcount=1kw, operationcount=2kw)
gc_rewrite_mode = kDefault
update ops 5w -> 4.53w, read ops 12w
gc_rewrite_mode = kMerge
update ops 5w -> 4.58w, read ops 11w
gc_rewrite_mode = kFastIngest
update ops 5w -> 4.7w, read ops 7w

Signed-off-by: tabokie <xy.tao@outlook.com>
@codecov-io
Copy link

codecov-io commented Dec 29, 2019

Codecov Report

Merging #121 into master will decrease coverage by 0.56%.
The diff coverage is 85.38%.

@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
- Coverage   86.02%   85.46%   -0.57%     
==========================================
  Files          45       45              
  Lines        3528     3419     -109     
==========================================
- Hits         3035     2922     -113     
- Misses        493      497       +4

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
@yiwu-arbug
Copy link
Collaborator

yiwu-arbug commented Feb 9, 2020

@tabokie As we are approaching the end of PCP, we need to merge this PR this week. To speed it up, let's do the following:

  1. remove all ingest file related logic to make review simpler.
  2. send PR for related rocksdb changes to tikv/rocksdb repo.
  3. default to turn off the feature. we can add option to option.h later.
  4. adding sequence to blob record is a breaking change. can you remind us why it is needed? I'm incline to remove it for now, even if it will introduce correctness issue.
  5. can you remind us on the merge semantics? I find the following in the old docs but it is still unclear to me:

Screen Shot 2020-02-09 at 1 15 14 PM

This is great work overall! We can keep working on remaining work after PCP. We definitely want it to included in tikv 4.0 release.

@tabokie
Copy link
Member Author

tabokie commented Feb 10, 2020

@yiwu-arbug Some Q&A first:

  1. Embedding LSN in blob record is a critical design choice. In this snippet, merge operator avoids losing blob index by only keeping the latest one in multiple merge operands. Here, latest refer to timestamp of the original Put, which is a meaningful concept only on titan end, and therefore required to be stored in blob record payload. In detail, this blob timestamp is flowing in this path: MemKey ->[table-builder/GC-by-put]-> BlobRecord ->[GC-by-merge/merge]->MergeOperand. Notice the GC-by-put mutation is causing mis-alignment between the two timestamp.
  2. On a second thought, merge semantics is better explained by describing the anomalies it tries to avoid ([X] refers to put, (X) refers to merge, full content is [X blob-ts source-fno value-fno], time is flowing rightwards):
    *. basic rule, keep the base value: [?] ... [X] (Y) (Z) => [X]
    this is mostly true when Y/Z stem from older ?, guaranteed by rocksdb visibility constraint.
    a. base + delta: [Y] ... [X] (Y) (X') => [X']
    simple equal case where X' stem from exactly X. We identify this case by checking the source-file-number stored in merge operands, in detail: [X - - 0] (Y) (X' - 0 1).
    b. merge reorder: [A] [B] (B') (A') => [B']
    visibility order of merge != that of value. We need to preserve original order in reordered merge operands (explained before). in detail: [A 0 - -] [B 1 - -] (B' 1 - -) (A' 0 - -)
    c. deletion: [delete] (X) (Y) => [delete marker]
    introducing deletion marker is a workaround against a restriction that rocksdb merge can't produce empty result.

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
@tabokie
Copy link
Member Author

tabokie commented Feb 11, 2020

Almost good now, PTAL @yiwu-arbug @Connor1996

src/blob_format.h Show resolved Hide resolved
src/blob_gc_job.cc Outdated Show resolved Hide resolved
src/blob_index_merge_operator.h Show resolved Hide resolved
src/db_iter.h Show resolved Hide resolved
src/blob_index_merge_operator.h Outdated Show resolved Hide resolved
@yiwu-arbug
Copy link
Collaborator

yiwu-arbug commented Feb 11, 2020

I may be missing something, but I still think merge operator don't need to know LSN of values. Just like the write callback for gc-with-put doesn't need to know LSN of keys. I think we can simply use the following logic in merge op (pseudo-code):

blob_index = existing_value;
for (merge_op : in_merge_op_list) {
    if (merge_op.source_file == blob_index.file && merge_op.source_offset == blob_index.offset) {
        blob_index.file = merge_op.file;
        blob_index.offset = merge_op.offset
    }
}
merge_output = blob_index;

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Copy link
Member

@Connor1996 Connor1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

src/db_impl.cc Show resolved Hide resolved
src/blob_index_merge_operator.h Show resolved Hide resolved
src/blob_index_merge_operator.h Show resolved Hide resolved
Signed-off-by: tabokie <xy.tao@outlook.com>
Copy link
Member

@Connor1996 Connor1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's merge rocksdb first

@yiwu-arbug
Copy link
Collaborator

LGTM. Leave to @Connor1996 for merge.

@zhangjinpeng87
Copy link
Member

@Connor1996 Can we merge it now?

Signed-off-by: tabokie <xy.tao@outlook.com>
Signed-off-by: tabokie <xy.tao@outlook.com>
Copy link
Member

@Connor1996 Connor1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Connor1996 Connor1996 merged commit 4dc4ba8 into tikv:master Feb 15, 2020
yiwu-arbug pushed a commit that referenced this pull request Mar 20, 2020
* Green Blob GC by merge operator

Signed-off-by: tabokie <xy.tao@outlook.com>
Little-Wallace added a commit to Little-Wallace/titan that referenced this pull request Nov 9, 2021
tabokie added a commit to tabokie/titan that referenced this pull request May 26, 2022
This reverts commit 4dc4ba8.

Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie added a commit that referenced this pull request Jun 6, 2022
* Revert "[PCP-21] Titan GC doesn’t affect online write (#121)"

This reverts commit 4dc4ba8.

Signed-off-by: tabokie <xy.tao@outlook.com>

* format and build

Signed-off-by: tabokie <xy.tao@outlook.com>

* restore some modifications to tests

Signed-off-by: tabokie <xy.tao@outlook.com>

* Trigger rebuild

Signed-off-by: tabokie <xy.tao@outlook.com>
tabokie added a commit that referenced this pull request Jun 30, 2022
* work with 6.29.tikv

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix some more

Signed-off-by: tabokie <xy.tao@outlook.com>

* make it build

Signed-off-by: tabokie <xy.tao@outlook.com>

* format

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix snappy include dir

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix more includes

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix some conflicts with blobdb

Signed-off-by: tabokie <xy.tao@outlook.com>

* properly initialize compaction filter when creating cf

Signed-off-by: tabokie <xy.tao@outlook.com>

* clean up comments

Signed-off-by: tabokie <xy.tao@outlook.com>

* Revert "[PCP-21] Titan GC doesn’t affect online write (#121)"

This reverts commit 4dc4ba8.

Signed-off-by: tabokie <xy.tao@outlook.com>

* format and build

Signed-off-by: tabokie <xy.tao@outlook.com>

* restore some modifications to tests

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix double free and test issues

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix test case LevelMerge

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix testharness link

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix release build

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix test build

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix asan errors

Signed-off-by: tabokie <xy.tao@outlook.com>

* remove stale code

Signed-off-by: tabokie <xy.tao@outlook.com>

* re-inline db_iter implementation

Signed-off-by: tabokie <xy.tao@outlook.com>

* re-order includes

Signed-off-by: tabokie <xy.tao@outlook.com>

* rearrange CMakeLists.txt

Signed-off-by: tabokie <xy.tao@outlook.com>

* address comment and fix tools build

Signed-off-by: tabokie <xy.tao@outlook.com>

* fix unstable test DeleteFilesInRangeDuringGC

Signed-off-by: tabokie <xy.tao@outlook.com>

* update readme

Signed-off-by: tabokie <xy.tao@outlook.com>

* group some more includes

Signed-off-by: tabokie <xy.tao@outlook.com>

* update README

Signed-off-by: tabokie <xy.tao@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants