New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
txn_types: Add GC fence feild to Write records #9207
Changes from all commits
6b4466b
ae871ec
bbc5b0c
7ae0c9f
7fcd438
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,13 @@ | ||
// Copyright 2016 TiKV Project Authors. Licensed under Apache-2.0. | ||
|
||
use codec::prelude::NumberDecoder; | ||
use std::mem::size_of; | ||
use tikv_util::codec::number::{self, NumberEncoder, MAX_VAR_U64_LEN}; | ||
|
||
use crate::lock::LockType; | ||
use crate::timestamp::TimeStamp; | ||
use crate::types::{Value, SHORT_VALUE_MAX_LEN, SHORT_VALUE_PREFIX}; | ||
use crate::types::{Value, SHORT_VALUE_PREFIX}; | ||
use crate::{Error, ErrorInner, Result}; | ||
use codec::prelude::NumberDecoder; | ||
use tikv_util::codec::number::{NumberEncoder, MAX_VAR_U64_LEN}; | ||
|
||
#[derive(Debug, Clone, Copy, PartialEq)] | ||
pub enum WriteType { | ||
|
@@ -22,6 +24,8 @@ const FLAG_ROLLBACK: u8 = b'R'; | |
|
||
const FLAG_OVERLAPPED_ROLLBACK: u8 = b'R'; | ||
|
||
const GC_FENCE_PREFIX: u8 = b'F'; | ||
|
||
/// The short value for rollback records which are protected from being collapsed. | ||
const PROTECTED_ROLLBACK_SHORT_VALUE: &[u8] = b"p"; | ||
|
||
|
@@ -60,6 +64,7 @@ pub struct Write { | |
pub write_type: WriteType, | ||
pub start_ts: TimeStamp, | ||
pub short_value: Option<Value>, | ||
|
||
/// The `commit_ts` of transactions can be non-globally-unique. But since we store Rollback | ||
/// records in the same CF where Commit records is, and Rollback records are saved with | ||
/// `user_key{start_ts}` as the internal key, the collision between Commit and Rollback | ||
|
@@ -68,6 +73,57 @@ pub struct Write { | |
/// Also note that `has_overlapped_rollback` field is only necessary when the Rollback record | ||
/// should be protected. | ||
pub has_overlapped_rollback: bool, | ||
|
||
/// Records the next version after this version when overlapping rollback happens on an already | ||
/// existed commit record. | ||
/// | ||
/// When a rollback flag is written on an already-written commit record, it causes rewriting | ||
/// the commit record. It may cause problems with the GC compaction filter. Consider this case: | ||
/// | ||
/// ```text | ||
/// Key_100_put, Key_120_del | ||
/// ``` | ||
/// | ||
/// and a rollback on `100` happens: | ||
/// | ||
/// ```text | ||
/// Key_100_put_R, Key_120_del | ||
/// ``` | ||
/// | ||
/// Then GC with safepoint = 130 may happen. However a follower may not have finished applying | ||
/// the change. So on the follower, it's possible that: | ||
/// | ||
/// 1. `Key_100_put`, `Key_120_del` applied | ||
/// 2. GC with safepoint = 130 started and `Key_100_put`, `Key_120_del` are deleted | ||
/// 3. Finished applying `Key_100_put_R`, which means to rewrite `Key_100_put` | ||
/// 4. Read at `140` should get nothing (since it's MVCC-deleted at 120) but finds `Key_100_put` | ||
/// | ||
/// To solve the problem, when marking `has_overlapped_rollback` on an already-existed commit | ||
/// record, add a special field `gc_fence` on it. If there is a newer version after the record | ||
/// being rewritten, the next version's `commit_ts` will be recorded. When MVCC reading finds | ||
/// a commit record with a GC fence timestamp but the corresponding version that matches that ts | ||
/// doesn't exist, the current version will be believed to be already GC-ed and ignored. | ||
/// | ||
/// Therefore, for the example above, in the 3rd step it will record the version `120` to the | ||
/// `gc_fence` field: | ||
/// | ||
/// ```text | ||
/// Key_100_put_R_120, Key_120_del | ||
/// ``` | ||
/// | ||
/// And when the reading in the 4th step finds the `PUT` record but the version at 120 doesn't | ||
/// exist, it will be regarded as already GC-ed and ignored. | ||
/// | ||
/// For CDC and TiFlash, when they receives a commit record with `gc_fence` field set, it can | ||
/// determine that it must be caused by an overlapped rollback instead of an actual commit. | ||
/// | ||
/// The meaning of the field: | ||
/// * `None`: A record that haven't been rewritten | ||
/// * `Some(0)`: A commit record that has been rewritten due to overlapping rollback, but it | ||
/// doesn't have an newer version. | ||
/// * `Some(ts)`: A commit record that has been rewritten due to overlapping rollback, | ||
/// and it's next version's `commit_ts` is `ts` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a great comment! 👍👍👍 |
||
pub gc_fence: Option<TimeStamp>, | ||
} | ||
|
||
impl std::fmt::Debug for Write { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
@@ -84,6 +140,7 @@ impl std::fmt::Debug for Write { | |
.unwrap_or_else(|| "None".to_owned()), | ||
) | ||
.field("has_overlapped_rollback", &self.has_overlapped_rollback) | ||
.field("gc_fence", &self.gc_fence) | ||
.finish() | ||
} | ||
} | ||
|
@@ -97,6 +154,7 @@ impl Write { | |
start_ts, | ||
short_value, | ||
has_overlapped_rollback: false, | ||
gc_fence: None, | ||
} | ||
} | ||
|
||
|
@@ -113,12 +171,18 @@ impl Write { | |
start_ts, | ||
short_value, | ||
has_overlapped_rollback: false, | ||
gc_fence: None, | ||
} | ||
} | ||
|
||
#[inline] | ||
pub fn set_overlapped_rollback(mut self, has_overlapped_rollback: bool) -> Self { | ||
pub fn set_overlapped_rollback( | ||
mut self, | ||
has_overlapped_rollback: bool, | ||
gc_fence: Option<TimeStamp>, | ||
) -> Self { | ||
self.has_overlapped_rollback = has_overlapped_rollback; | ||
self.gc_fence = gc_fence; | ||
self | ||
} | ||
|
||
|
@@ -137,6 +201,7 @@ impl Write { | |
start_ts: self.start_ts, | ||
short_value: self.short_value.as_deref(), | ||
has_overlapped_rollback: self.has_overlapped_rollback, | ||
gc_fence: self.gc_fence, | ||
} | ||
} | ||
} | ||
|
@@ -154,6 +219,12 @@ pub struct WriteRef<'a> { | |
/// Also note that `has_overlapped_rollback` field is only necessary when the Rollback record | ||
/// should be protected. | ||
pub has_overlapped_rollback: bool, | ||
|
||
/// Records the next version after this version when overlapping rollback happens on an already | ||
/// existed commit record. | ||
/// | ||
/// See [`Write::gc_fence`] for more detail. | ||
pub gc_fence: Option<TimeStamp>, | ||
} | ||
|
||
impl WriteRef<'_> { | ||
|
@@ -170,6 +241,7 @@ impl WriteRef<'_> { | |
|
||
let mut short_value = None; | ||
let mut has_overlapped_rollback = false; | ||
let mut gc_fence = None; | ||
|
||
while !b.is_empty() { | ||
match b | ||
|
@@ -193,6 +265,7 @@ impl WriteRef<'_> { | |
FLAG_OVERLAPPED_ROLLBACK => { | ||
has_overlapped_rollback = true; | ||
} | ||
GC_FENCE_PREFIX => gc_fence = Some(number::decode_u64(&mut b)?.into()), | ||
flag => panic!("invalid flag [{}] in write", flag), | ||
} | ||
} | ||
|
@@ -202,11 +275,12 @@ impl WriteRef<'_> { | |
start_ts, | ||
short_value, | ||
has_overlapped_rollback, | ||
gc_fence, | ||
}) | ||
} | ||
|
||
pub fn to_bytes(&self) -> Vec<u8> { | ||
let mut b = Vec::with_capacity(1 + MAX_VAR_U64_LEN + SHORT_VALUE_MAX_LEN + 2 + 1); | ||
let mut b = Vec::with_capacity(self.pre_allocate_size()); | ||
b.push(self.write_type.to_u8()); | ||
b.encode_var_u64(self.start_ts.into_inner()).unwrap(); | ||
if let Some(v) = self.short_value { | ||
|
@@ -217,9 +291,25 @@ impl WriteRef<'_> { | |
if self.has_overlapped_rollback { | ||
b.push(FLAG_OVERLAPPED_ROLLBACK); | ||
} | ||
if let Some(ts) = self.gc_fence { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can calculate a better capacity estimation like https://github.com/tikv/tikv/blob/6b4466ba8a/components/txn_types/src/lock.rs#L161 |
||
b.push(GC_FENCE_PREFIX); | ||
b.encode_u64(ts.into_inner()).unwrap(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. VarInt saves space when ts is 0. But VarInt occupies more space (9 bytes vs 8 bytes) and is slower when ts is a real timestamp. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought it doesn't matter much which encoding to use here, but encode_u64 looks faster than var_u64 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But it's strange to encode start_ts in var int and this ts in u64. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can see in Lock's encoding, everything new is using encode_u64... |
||
} | ||
b | ||
} | ||
|
||
fn pre_allocate_size(&self) -> usize { | ||
let mut size = 1 + MAX_VAR_U64_LEN + self.has_overlapped_rollback as usize; | ||
|
||
if let Some(v) = &self.short_value { | ||
size += 2 + v.len(); | ||
} | ||
if self.gc_fence.is_some() { | ||
size += 1 + size_of::<u64>(); | ||
} | ||
size | ||
} | ||
|
||
#[inline] | ||
pub fn is_protected(&self) -> bool { | ||
self.write_type == WriteType::Rollback | ||
|
@@ -237,7 +327,7 @@ impl WriteRef<'_> { | |
self.start_ts, | ||
self.short_value.map(|v| v.to_owned()), | ||
) | ||
.set_overlapped_rollback(self.has_overlapped_rollback) | ||
.set_overlapped_rollback(self.has_overlapped_rollback, self.gc_fence) | ||
} | ||
} | ||
|
||
|
@@ -285,12 +375,21 @@ mod tests { | |
Write::new(WriteType::Delete, (1 << 20).into(), None), | ||
Write::new_rollback((1 << 40).into(), true), | ||
Write::new(WriteType::Rollback, (1 << 41).into(), None), | ||
Write::new(WriteType::Put, 123.into(), None).set_overlapped_rollback(true), | ||
Write::new(WriteType::Put, 123.into(), None).set_overlapped_rollback(true, None), | ||
Write::new(WriteType::Put, 123.into(), None) | ||
.set_overlapped_rollback(true, Some(1234567.into())), | ||
Write::new(WriteType::Put, 456.into(), Some(b"short_value".to_vec())) | ||
.set_overlapped_rollback(true), | ||
.set_overlapped_rollback(true, None), | ||
Write::new(WriteType::Put, 456.into(), Some(b"short_value".to_vec())) | ||
.set_overlapped_rollback(true, Some(0.into())), | ||
Write::new(WriteType::Put, 456.into(), Some(b"short_value".to_vec())) | ||
.set_overlapped_rollback(true, Some(2345678.into())), | ||
Write::new(WriteType::Put, 456.into(), Some(b"short_value".to_vec())) | ||
.set_overlapped_rollback(true, Some(421397468076048385.into())), | ||
]; | ||
for (i, write) in writes.drain(..).enumerate() { | ||
let v = write.as_ref().to_bytes(); | ||
assert!(v.len() <= write.as_ref().pre_allocate_size()); | ||
let w = WriteRef::parse(&v[..]) | ||
.unwrap_or_else(|e| panic!("#{} parse() err: {:?}", i, e)) | ||
.to_owned(); | ||
|
@@ -314,7 +413,7 @@ mod tests { | |
assert!(!Write::new( | ||
WriteType::Put, | ||
3.into(), | ||
Some(PROTECTED_ROLLBACK_SHORT_VALUE.to_vec()) | ||
Some(PROTECTED_ROLLBACK_SHORT_VALUE.to_vec()), | ||
) | ||
.as_ref() | ||
.is_protected()); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it can be replaced by
gc_fence
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's produced because commit_ts == rollback_ts in the lock, it doesn't need a gc fence.