From b459c107ae5e4e730262126d2003db58f5e9da54 Mon Sep 17 00:00:00 2001
From: Jay Lee
Date: Sun, 5 Jun 2022 00:14:27 -0700
Subject: [PATCH 1/5] introduce latest cf

Signed-off-by: Jay Lee
---
 text/xxxx-add-latest-cf.md | 72 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 text/xxxx-add-latest-cf.md

diff --git a/text/xxxx-add-latest-cf.md b/text/xxxx-add-latest-cf.md
new file mode 100644
index 00000000..5d4efe49
--- /dev/null
+++ b/text/xxxx-add-latest-cf.md
@@ -0,0 +1,72 @@
+# Add latest cf
+
+- RFC PR: https://github.com/tikv/rfcs/pull/0000
+- Tracking Issue: https://github.com/tikv/repo/issues/0000
+
+## Summary
+
+Add a cf (column family in RocksDB) named "latest" to store the latest version of keys in MVCC.
+
+## Motivation
+
+Currently, TiKV stores all its data in RocksDB. It creates a cf (column family) named "write" to store all the available versions. The key format looks like the following:
+
+```
+| key | version |
+```
+
+The version is a 64-bit number encoded in descending order (larger versions sort first).
+
+When reading a key at version v0, TiKV is expected to return the largest version v1 such that v1 <= v0. Because TiKV has no idea which versions are available, it has to create an iterator and use version v0 to encode a seek key. If any key is hit, it should be the requested key.
+
+The procedure is straightforward but expensive. Creating an iterator in RocksDB is not free, and even a point get still needs one. The seek operation is also costly, almost as expensive as creating an iterator. To avoid seeking too many times, we introduced `near_seek` in TiKV in the early days, which tries `next` several times before falling back to `seek`.
+
+As explained, seek is necessary because TiKV has no idea which versions are available. Otherwise, it could just use `get` to query the specific key, which is a lot cheaper.
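The descending version encoding mentioned above can be illustrated with a small sketch. This is an illustration only, not TiKV's actual codec: the version suffix is bit-inverted so that larger versions produce byte strings that sort first, which is why a single seek at the encoded (key, v0) lands on the newest visible version.

```rust
// Hypothetical sketch of MVCC key encoding with descending version order;
// not TiKV's real encoding.
fn encode_key(user_key: &[u8], version: u64) -> Vec<u8> {
    let mut buf = user_key.to_vec();
    // Bit-wise NOT makes a larger version encode to smaller bytes,
    // i.e. descending order under a plain byte-wise comparator.
    buf.extend_from_slice(&(!version).to_be_bytes());
    buf
}

fn main() {
    let newer = encode_key(b"k1", 10);
    let older = encode_key(b"k1", 5);
    // Seeking at encode_key(key, v0) therefore stops at the largest
    // version <= v0 of that key.
    assert!(newer < older);
    println!("descending order holds");
}
```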
+
+## Detailed design
+
+TiKV doesn't need to know all existing versions of keys. In fact, most of the time v0 is larger than any existing version of the key, if there are more reads than writes. So it should be enough to just let TiKV know the latest version of every key.
+
+The RFC proposes to add a new cf named "latest". When a key is inserted using the transaction API, it should update write cf (and default cf) as it currently does. In addition, it also inserts a key into latest cf with
+- the key set to the original key without any encoding or version
+- the value set to the same value as in write cf, but including the corresponding version.
+
+For example, inserting k1 with version v0 and value dummy will insert two keys
+- to write cf, k1|v0 -> (dummy and other meta)
+- to latest cf, k1 -> (dummy, v0 and other meta)
+
+So all keys in latest cf represent the latest version of all keys.
+
+When all the versions of a key are GCed, the key in latest cf should also be deleted, but only when it matches the version of the last GCed key.
+
+When a key is queried, latest cf should be queried with `get` first. If nothing is found, or the version found is larger than requested, write cf should be queried as a fallback. In most cases, only one `get` is performed.
+
+When a range scan is triggered, it should scan latest cf directly. If a larger version is met, it should fall back to seeking write cf for that specific key instead. Because only the latest versions are stored in latest cf, far fewer keys need to be scanned than in write cf. In most cases, only one `seek` is performed, and all other operations are `next`.
+
+All of the fallback queries should be performed lazily.
+
+The improvement should be very significant when keys are updated frequently.
+
+### Compatibility
+
+Because all existing cfs are updated just as before, there are no major compatibility issues.
+
+But using latest cf should be triggered explicitly by the client.
+The client should ask TiKV to query using latest cf only when it has updated all keys with latest cf.
+
+Take TiDB as an example: it can add a new storage format at the table level, and perhaps even a new DDL job to change a table's storage format. In the new format, latest cf is always updated, and TiKV is only asked to use latest cf when the target table is fully upgraded to the new format.
+
+### Why use a new cf?
+
+If the latest keys are written to write cf instead, it will break compatibility. It also makes range scans less efficient, as more versions need to be scanned and skipped.
+
+## Drawbacks
+
+It apparently introduces write amplification. That is also why the access pattern needs to be controlled by the client: the client should enable latest cf only when it knows a range of keys is updated very often and can benefit from the change.
+
+On the other hand, the additional write is just a key in a different cf and a value that is probably not larger than 255 bytes, so the overhead may not be very significant. More experiments are needed.
+
+## Alternatives
+
+unistore separates the latest version from other versions by adjusting the file format: when flushing or compacting, it places the latest version of each key in the first part of the file and the rest in the second part. This approach has no write overhead, but it is not backward compatible in TiKV.
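The point-get path described in the detailed design above — a cheap `get` on latest cf with a lazy fallback to write cf — can be sketched roughly as follows. `Store`, `Record`, and the in-memory maps are invented stand-ins for illustration, not TiKV's real API, and write cf is modeled here with ascending versions for simplicity.

```rust
use std::collections::BTreeMap;

// Invented types for illustration only.
struct Record {
    version: u64,
    value: Vec<u8>,
}

struct Store {
    latest_cf: BTreeMap<Vec<u8>, (u64, Vec<u8>)>, // key -> (version, value)
    write_cf: BTreeMap<(Vec<u8>, u64), Vec<u8>>,  // (key, version) -> value
}

impl Store {
    fn mvcc_get(&self, key: &[u8], read_version: u64) -> Option<Record> {
        // Fast path: one cheap point `get` on latest cf.
        if let Some((v, val)) = self.latest_cf.get(key) {
            if *v <= read_version {
                return Some(Record { version: *v, value: val.clone() });
            }
        }
        // Lazy fallback: the latest version is too new (or absent), so look
        // up the largest version <= read_version of this key in write cf.
        self.write_cf
            .range((key.to_vec(), 0)..=(key.to_vec(), read_version))
            .next_back()
            .map(|((_, v), val)| Record { version: *v, value: val.clone() })
    }
}

fn main() {
    let mut store = Store { latest_cf: BTreeMap::new(), write_cf: BTreeMap::new() };
    store.latest_cf.insert(b"k1".to_vec(), (9, b"new".to_vec()));
    store.write_cf.insert((b"k1".to_vec(), 3), b"old".to_vec());
    // Fast path: latest version 9 is visible at read version 10.
    assert_eq!(store.mvcc_get(b"k1", 10).unwrap().version, 9);
    // Fallback: version 9 is too new for read version 5, so write cf answers.
    assert_eq!(store.mvcc_get(b"k1", 5).unwrap().version, 3);
    println!("ok");
}
```

The fallback only runs when the fast path fails, which matches the claim that in most read-heavy workloads a single `get` is enough.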
+
+## Unresolved questions

From d91583d2f0df93d2b8e2fb68e1478dd1f3125990 Mon Sep 17 00:00:00 2001
From: Jay Lee
Date: Sun, 5 Jun 2022 00:16:03 -0700
Subject: [PATCH 2/5] update number

Signed-off-by: Jay Lee
---
 text/{xxxx-add-latest-cf.md => 0095-add-latest-cf.md} | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
 rename text/{xxxx-add-latest-cf.md => 0095-add-latest-cf.md} (98%)

diff --git a/text/xxxx-add-latest-cf.md b/text/0095-add-latest-cf.md
similarity index 98%
rename from text/xxxx-add-latest-cf.md
rename to text/0095-add-latest-cf.md
index 5d4efe49..b85bd116 100644
--- a/text/xxxx-add-latest-cf.md
+++ b/text/0095-add-latest-cf.md
@@ -1,6 +1,6 @@
-# Add latest cf
+# Introduce latest cf
 
-- RFC PR: https://github.com/tikv/rfcs/pull/0000
+- RFC PR: https://github.com/tikv/rfcs/pull/95
 - Tracking Issue: https://github.com/tikv/repo/issues/0000
 
 ## Summary

From 75ffd9e05dddd58d884bb1557481089259edd5a9 Mon Sep 17 00:00:00 2001
From: Jay Lee
Date: Sun, 5 Jun 2022 16:22:59 -0700
Subject: [PATCH 3/5] add another alternative and compatibility issue

Signed-off-by: Jay Lee
---
 text/0095-add-latest-cf.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/text/0095-add-latest-cf.md b/text/0095-add-latest-cf.md
index b85bd116..c6dcb506 100644
--- a/text/0095-add-latest-cf.md
+++ b/text/0095-add-latest-cf.md
@@ -55,6 +55,8 @@ But using latest cf should be triggered explicitly by client. Client should ensu
 Take TiDB as an example: it can add a new storage format at the table level, and perhaps even a new DDL job to change a table's storage format. In the new format, latest cf is always updated, and TiKV is only asked to use latest cf when the target table is fully upgraded to the new format.
 
+As a new cf is added, it also needs to be included in the raft snapshots sent between replicas.
+
 ### Why use a new cf?
 
 If the latest keys are written to write cf instead, it will break compatibility.
 It also makes range scans less efficient, as more versions need to be scanned and skipped.
@@ -69,4 +71,10 @@ On the other hand, the additional write is just a key in a different cf and a va
 unistore separates the latest version from other versions by adjusting the file format: when flushing or compacting, it places the latest version of each key in the first part of the file and the rest in the second part. This approach has no write overhead, but it is not backward compatible in TiKV.
 
+Another proposal has also been discussed in the past: instead of adding latest cf, add a history cf to store as many versions as possible. All keys are written to write cf first, and then a compaction filter moves all versions except the latest to history cf. This approach delays the additional writes to a background job, so it may have less impact on foreground writes. But it has the following shortcomings:
+- compaction filter is not reliable: the timing at which it is triggered can be tricky. We have observed issues introduced by compaction not happening in time (tikv/tikv#12729).
+- compaction filter only works on SST files; versions in memory are still mixed.
+- point get still requires a seek unless we switch to user timestamp completely, which is not used in production yet.
+- If we remove the KV WAL completely, writing during compaction can be expensive, as the data needs to be either ingested as a new SST or flushed; otherwise restarting TiKV may lose data.
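The history-cf alternative above can be sketched with a toy in-memory model. All names and layouts here are invented for illustration; a real compaction filter runs inside RocksDB during compaction, which is exactly why it only sees SST data.

```rust
// Toy in-memory model of the history-cf alternative: a background compaction
// pass keeps only the newest version of each key in write cf and moves every
// older version to history cf.
use std::collections::BTreeMap;

type Cf = BTreeMap<(Vec<u8>, u64), Vec<u8>>; // (key, version) -> value

fn compact_move_history(write_cf: &mut Cf, history_cf: &mut Cf) {
    // Find the newest version of every key currently in write cf.
    let mut newest: BTreeMap<Vec<u8>, u64> = BTreeMap::new();
    for ((key, version), _) in write_cf.iter() {
        let entry = newest.entry(key.clone()).or_insert(*version);
        if *version > *entry {
            *entry = *version;
        }
    }
    // Move everything that is not the newest version for its key.
    let to_move: Vec<(Vec<u8>, u64)> = write_cf
        .keys()
        .filter(|(key, version)| newest[key] != *version)
        .cloned()
        .collect();
    for kv in to_move {
        let value = write_cf.remove(&kv).expect("key was just listed");
        history_cf.insert(kv, value);
    }
}

fn main() {
    let mut write_cf = Cf::new();
    let mut history_cf = Cf::new();
    write_cf.insert((b"k1".to_vec(), 1), b"old".to_vec());
    write_cf.insert((b"k1".to_vec(), 2), b"new".to_vec());
    compact_move_history(&mut write_cf, &mut history_cf);
    // Only the newest version stays in write cf; the rest becomes history.
    assert!(write_cf.contains_key(&(b"k1".to_vec(), 2)));
    assert!(history_cf.contains_key(&(b"k1".to_vec(), 1)));
}
```

Even in this toy form the shortcoming is visible: the split happens after the fact in a background pass, so reads that arrive before compaction still see mixed versions.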
+
 ## Unresolved questions

From e267b076085fcfd546909575c80a6dfb7012ef55 Mon Sep 17 00:00:00 2001
From: Jay Lee
Date: Sun, 12 Jun 2022 10:52:39 -0700
Subject: [PATCH 4/5] address comment

Signed-off-by: Jay Lee
---
 text/0095-add-latest-cf.md | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/text/0095-add-latest-cf.md b/text/0095-add-latest-cf.md
index c6dcb506..ab5da4cb 100644
--- a/text/0095-add-latest-cf.md
+++ b/text/0095-add-latest-cf.md
@@ -27,13 +27,14 @@ As explained, the reason why seek is necessary is because TiKV has no idea what
 TiKV doesn't need to know all existing versions of keys. In fact, most of the time v0 is larger than any existing version of the key, if there are more reads than writes. So it should be enough to just let TiKV know the latest version of every key.
 
-The RFC proposes to add a new cf named "latest". When a key is inserted using the transaction API, it should update write cf (and default cf) as it currently does. In addition, it also inserts a key into latest cf with
-- the key set to the original key without any encoding or version
-- the value set to the same value as in write cf, but including the corresponding version.
+The RFC proposes to add a new cf named "latest". When a key is inserted using the transaction API, it should update latest cf using the original key, without any encoding or version. The value should be similar to the one used for updating write cf before, but it includes the corresponding version. If the key already exists in latest cf, its value should be read and written to write cf in the old format, alongside the latest cf update.
 
-For example, inserting k1 with version v0 and value dummy will insert two keys
-- to write cf, k1|v0 -> (dummy and other meta)
-- to latest cf, k1 -> (dummy, v0 and other meta)
+For example, suppose there is no key in latest cf.
+Inserting k1 with version v0 and value foo will insert one key:
+- to latest cf, k1 -> (foo, v0 and other meta)
+
+Inserting v1 again with version v1 and value bar will insert two keys:
+- to write cf, k1|v0 -> (foo and other meta)
+- to latest cf, k1 -> (bar, v1 and other meta)
 
 So all keys in latest cf represent the latest version of all keys.
 
@@ -49,17 +50,28 @@ The improvement should be very significant when keys are updated frequently.
 ### Compatibility
 
-Because all existing cfs are updated just as before, there are no major compatibility issues.
+Because all keys are written to latest cf first, it will not be compatible with the existing write cf, as at least one key will be missing there. To make the switch easier, let's introduce an intermediate format in which the latest value is written to both latest cf and write cf at the same time. Every range that is expected to upgrade to the latest format should upgrade to the intermediate format first.
+
+```mermaid
+graph LR;
+    origin[Original Format]
+    inter[Intermediate Format]
+    latest[Latest Format]
+    origin --"query using origin way"--> inter --"query using latest way"--> latest
+```
+
+Because write cf may not contain the latest change, latest cf should always be queried in all TiKV internal services, such as GC.
 
-But using latest cf should be triggered explicitly by the client. The client should ask TiKV to query using latest cf only when it has updated all keys with latest cf.
+Public APIs should tell TiKV whether latest cf should be used, so that upgrading between formats can be seamless to TiKV. The client should ask TiKV to query using latest cf only when it has updated the range to at least the intermediate format.
 
-Take TiDB as an example: it can add a new storage format at the table level, and perhaps even a new DDL job to change a table's storage format. In the new format, latest cf is always updated, and TiKV is only asked to use latest cf when the target table is fully upgraded to the new format.
+Take TiDB as an example: it can add a new storage format at the table level. Any new table should use the latest format, while existing tables keep the original format. However, a new DDL job can be added to upgrade a table's storage format to the intermediate format, and TiKV is only asked to use latest cf when the target table is fully upgraded to the intermediate format.
 
 As a new cf is added, it also needs to be included in the raft snapshots sent between replicas.
 
 ### Why use a new cf?
 
-If the latest keys are written to write cf instead, it will break compatibility. It also makes range scans less efficient, as more versions need to be scanned and skipped.
+1. The key format is different; using a different cf is more efficient.
+2. Changing an existing cf can bring more compatibility issues than introducing a new one.
 
 ## Drawbacks
 
 It apparently introduces write amplification. That is also why the access pattern needs to be controlled by the client: the client should enable latest cf only when it knows a range of keys is updated very often and can benefit from the change.
 
 On the other hand, the additional write is just a key in a different cf and a value that is probably not larger than 255 bytes, so the overhead may not be very significant. More experiments are needed.
 
 ## Alternatives
 
 unistore separates the latest version from other versions by adjusting the file format: when flushing or compacting, it places the latest version of each key in the first part of the file and the rest in the second part. This approach has no write overhead, but it is not backward compatible in TiKV.
 
-Another proposal has also been discussed in the past: instead of adding latest cf, add a history cf to store as many versions as possible. All keys are written to write cf first, and then a compaction filter moves all versions except the latest to history cf. This approach delays the additional writes to a background job, so it may have less impact on foreground writes.
+Another proposal has also been discussed in the past: instead of adding latest cf, add a history cf to store old versions. All keys are written to write cf first, and then a compaction filter moves all versions except the latest to history cf. This approach delays the additional writes to a background job, so it may have less impact on foreground writes.
+But it has the following shortcomings:
 - compaction filter is not reliable: the timing at which it is triggered can be tricky. We have observed issues introduced by compaction not happening in time (tikv/tikv#12729).
 - compaction filter only works on SST files; versions in memory are still mixed.
 - point get still requires a seek unless we switch to user timestamp completely, which is not used in production yet.
 - If we remove the KV WAL completely, writing during compaction can be expensive, as the data needs to be either ingested as a new SST or flushed; otherwise restarting TiKV may lose data.

From d64a4a6120f79f1d290b56fe57110b984261f5f3 Mon Sep 17 00:00:00 2001
From: Jay Lee
Date: Wed, 15 Jun 2022 10:52:24 -0700
Subject: [PATCH 5/5] fix typo

Signed-off-by: Jay Lee
---
 text/0095-add-latest-cf.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0095-add-latest-cf.md b/text/0095-add-latest-cf.md
index ab5da4cb..10cf9d2d 100644
--- a/text/0095-add-latest-cf.md
+++ b/text/0095-add-latest-cf.md
@@ -32,7 +32,7 @@ The RFC propose to add a new cf named "latest". When a key is inserted using tra
 For example, suppose there is no key in latest cf. Inserting k1 with version v0 and value foo will insert one key:
 - to latest cf, k1 -> (foo, v0 and other meta)
 
-Inserting v1 again with version v1 and value bar will insert two keys:
+Inserting k1 again with version v1 and value bar will insert two keys:
 - to write cf, k1|v0 -> (foo and other meta)
 - to latest cf, k1 -> (bar, v1 and other meta)
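The write path that PATCH 4 describes (with the example corrected by PATCH 5) can be sketched as a toy in-memory model. `Store`, `LatestRecord`, and `put` are invented names for illustration, not TiKV's API: each insert demotes the previous latest record, if any, into write cf in the old key|version format, then overwrites latest cf.

```rust
use std::collections::HashMap;

// Invented types for illustration only.
#[derive(Clone)]
struct LatestRecord {
    version: u64,
    value: Vec<u8>,
}

struct Store {
    latest_cf: HashMap<Vec<u8>, LatestRecord>,  // key -> newest version + value
    write_cf: HashMap<(Vec<u8>, u64), Vec<u8>>, // key|version -> value
}

impl Store {
    fn new() -> Self {
        Store { latest_cf: HashMap::new(), write_cf: HashMap::new() }
    }

    fn put(&mut self, key: &[u8], version: u64, value: &[u8]) {
        // Demote the old latest record into write cf, in the old format.
        if let Some(prev) = self.latest_cf.get(key).cloned() {
            self.write_cf.insert((key.to_vec(), prev.version), prev.value);
        }
        // latest cf always holds exactly the newest version of the key.
        self.latest_cf.insert(
            key.to_vec(),
            LatestRecord { version, value: value.to_vec() },
        );
    }
}

fn main() {
    let mut store = Store::new();
    // First insert touches only latest cf, matching the k1/v0/foo example.
    store.put(b"k1", 0, b"foo");
    assert!(store.write_cf.is_empty());
    // Second insert writes two keys, matching the k1/v1/bar example.
    store.put(b"k1", 1, b"bar");
    assert_eq!(store.write_cf[&(b"k1".to_vec(), 0)], b"foo".to_vec());
    assert_eq!(store.latest_cf[b"k1".as_slice()].version, 1);
}
```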