util: compare raw and encoded keys without explicit decoding #5613
Why would an index key hardly ever become the primary key? I think it works the opposite way.
for raw_len in 0..=255 {
let raw: Vec<u8> = (0..raw_len).collect();
let encoded = Key::from_raw(&raw);
assert!(encoded.is_encoded_from(&raw));
sticnarf
Oct 10, 2019
Author
Contributor
Added. PTAL again
Tests are deliberately repeated in tikv_util::codec and keys, in case we change the implementation in keys to components/codec.
My fault... Index keys don't become primary keys in pessimistic transactions. In optimistic transactions, index keys tend to become primary keys. I'll update the main thread. Anyway, comparing from the end helps return early.
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
.zip(raw)
.all(|(&enc, &raw)| enc == !raw)
&& encoded[len..encoded.len() - 1].iter().all(|&v| v == 0xff)
&& encoded[ENC_GROUP_SIZE] == !(ENC_MARKER - pad)
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
When will this compare function be used?
In #5575, to prevent primary keys of pessimistic transactions from being collapsed. And maybe in the future, to avoid writing rollback or lock of secondary keys.
It looks like this comparison implementation is even slower than simply decoding.
1000 bytes:
10000 bytes:
let mut rev_encoded_chunks = encoded.rchunks_exact(ENC_GROUP_SIZE + 1);
// Valid encoded bytes must have complete chunks
assert!(rev_encoded_chunks.remainder().is_empty());
breeswish
Oct 12, 2019
Member
I think it would be better to accept invalid bytes but also return false.
sticnarf
Oct 12, 2019
Author
Contributor
Encoded bytes are all generated and used inside TiKV. Therefore, it's either a bug or data corruption if encoded bytes are invalid. So I choose to panic here.
breeswish
Oct 12, 2019
Member
However, as a utility function, it should not hold that knowledge. You can't prevent this utility function from being used elsewhere as a common way to compare user-passed bytes or bytes given by TiDB.
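One way to read the suggestion above is a guard that rejects malformed input instead of asserting on it. The following is only a minimal sketch, assuming a group size of 8 bytes plus one marker byte; `encoded_groups_rev` is a hypothetical name and is not taken from the PR.

```rust
use std::slice::RChunksExact;

// Assumed group size of the encoding: 8 data bytes followed by 1 marker byte.
const ENC_GROUP_SIZE: usize = 8;

/// Hypothetical helper: yields the encoded groups from back to front, or
/// `None` when the input is empty or does not split into complete groups.
/// A caller can turn `None` into `false` instead of panicking.
fn encoded_groups_rev(encoded: &[u8]) -> Option<RChunksExact<'_, u8>> {
    let chunks = encoded.rchunks_exact(ENC_GROUP_SIZE + 1);
    if encoded.is_empty() || !chunks.remainder().is_empty() {
        None
    } else {
        Some(chunks)
    }
}
```

With a guard like this, corrupted or foreign bytes simply compare as "not encoded from `raw`", and the decision to panic is left to the caller.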
// Valid encoded bytes must have complete chunks
assert!(rev_encoded_chunks.remainder().is_empty());

// Bytes are compared in reverse order because in TiKV, if two keys are different, the last
sticnarf
Oct 12, 2019
Author
Contributor
Ah, yes...Anyway, I still think it's all right to depend on this feature.
@breeswish If necessary, I think we can do similar optimizations. But this PR actually helps when we can get the result by checking the length or just a few bytes...
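To illustrate the "checking the length" shortcut, here is a minimal sketch. It assumes the encoding writes one 9-byte group per 8 raw bytes plus one final padded group; the function names are hypothetical and not from the PR.

```rust
// Assumed group size of the encoding: 8 data bytes followed by 1 marker byte.
const ENC_GROUP_SIZE: usize = 8;

/// Length of the encoded form of a raw key of `raw_len` bytes, under the
/// assumed layout: every 8 raw bytes become 9 encoded bytes, and one padded
/// group is always appended at the end.
fn expected_encoded_len(raw_len: usize) -> usize {
    (raw_len / ENC_GROUP_SIZE + 1) * (ENC_GROUP_SIZE + 1)
}

/// Cheap pre-check: a length mismatch rules out equality without touching
/// any key byte. A `true` result still requires a byte-wise comparison.
fn may_be_encoded_from(encoded: &[u8], raw: &[u8]) -> bool {
    encoded.len() == expected_encoded_len(raw.len())
}
```

Under these assumptions a 30-byte raw key encodes to 36 bytes, so any encoded key of a different length can be rejected immediately.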
@sticnarf If there are optimizations that can be done in this PR, then why not do them 🤪 I guess the slowness comes from too many abstractions instead of simple, raw array operations. Note that in the TiDB scenario, keys are very short, so it looks like the benefit of this PR could be trivial.
My consideration is that this PR is going to be used in transactions to fix a bug in 3.0. Correctness is more important than efficiency. (And comparing keys is far less costly than other operations such as a RocksDB seek, so such optimizations don't improve the final performance much.) Later, in master, we can improve this further. There will be more time to test before the next major release.
@sticnarf I don't think a bug fix can be an excuse for writing inefficient code, especially considering that it is not that hard to be efficient, and that this is not a bug that occurs in high-priority clients (?). If there are future plans to improve it, please open a new issue to track it and I can accept it.
let mut rev_encoded_chunks = encoded.rchunks_exact(ENC_GROUP_SIZE + 1);
// Valid encoded bytes must have complete chunks
assert!(rev_encoded_chunks.remainder().is_empty());
let raw_chunks = raw.chunks_exact(ENC_GROUP_SIZE);
// Check the last chunk first
match rev_encoded_chunks.next() {
Some(encoded_chunk) if check_single_chunk(encoded_chunk, raw_chunks.remainder()) => {}
breeswish
Oct 12, 2019
Member
What will happen if `raw_chunks`'s length is `N*ENC_GROUP_SIZE` so that `raw_chunks.remainder()` is empty?
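A small worked example may help with this case. It is a sketch only, assuming the usual memcomparable layout in which the final group is always present, is zero-padded to 8 bytes, and ends with the marker `0xff - pad`; `last_group_for` is a hypothetical helper, not code from the PR.

```rust
// Assumed constants of the encoding.
const ENC_GROUP_SIZE: usize = 8;
const ENC_MARKER: u8 = 0xff;

/// Hypothetical helper: builds the final encoded group (ascending order)
/// that a raw key is expected to end with under the assumed layout.
fn last_group_for(raw: &[u8]) -> [u8; ENC_GROUP_SIZE + 1] {
    // Bytes left over after all complete 8-byte chunks; may be empty.
    let rest = raw.chunks_exact(ENC_GROUP_SIZE).remainder();
    let pad = ENC_GROUP_SIZE - rest.len();
    let mut group = [0u8; ENC_GROUP_SIZE + 1];
    group[..rest.len()].copy_from_slice(rest);
    // The marker records how many padding bytes were added (1..=8).
    group[ENC_GROUP_SIZE] = ENC_MARKER - pad as u8;
    group
}
```

Under these assumptions, an empty remainder is still a valid input: the expected last group is eight zero bytes followed by the marker `0xf7`, so the same single-chunk check applies with `pad == 8`.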
Except for the panic design, I'm fine with the rest.
.iter()
.zip(raw)
.all(|(&enc, &raw)| enc == !raw)
&& encoded[len..encoded.len() - 1].iter().all(|&v| v == 0xff)
} else {
encoded[ENC_GROUP_SIZE] == (ENC_MARKER - pad)
&& &encoded[..len] == raw
&& encoded[len..encoded.len() - 1].iter().all(|&v| v == 0)
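The two branches quoted in this review differ only in whether every byte is bitwise inverted. As a hedged illustration, the sketch below folds both into one function by XOR-ing with a mask; `check_single_group` and its `desc` parameter are hypothetical and do not reproduce the PR's `check_single_chunk` signature.

```rust
// Assumed constants of the encoding.
const ENC_GROUP_SIZE: usize = 8;
const ENC_MARKER: u8 = 0xff;

/// Hypothetical single-group check. `encoded` is one 9-byte group and `raw`
/// holds the 0..=8 raw bytes it should contain. For descending order every
/// encoded byte is the bitwise NOT of its ascending counterpart, so XOR with
/// 0xff maps it back before comparing.
fn check_single_group(encoded: &[u8], raw: &[u8], desc: bool) -> bool {
    if encoded.len() != ENC_GROUP_SIZE + 1 || raw.len() > ENC_GROUP_SIZE {
        return false; // malformed input is rejected, not asserted on
    }
    let len = raw.len();
    let pad = (ENC_GROUP_SIZE - len) as u8;
    let mask: u8 = if desc { 0xff } else { 0x00 };
    // Marker first: a single byte decides most mismatches.
    (encoded[ENC_GROUP_SIZE] ^ mask) == ENC_MARKER - pad
        && encoded[..len].iter().zip(raw).all(|(&e, &r)| (e ^ mask) == r)
        && encoded[len..ENC_GROUP_SIZE].iter().all(|&e| (e ^ mask) == 0)
}
```

Whether a single masked path is faster than two specialised branches is not something this sketch measures; it is only meant to show that the ascending and descending checks verify the same three conditions.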
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
@breeswish Panic behavior is changed. #5641 gives a faster implementation. Could you review that PR and help us evaluate the risk of using that one now?
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
/// # Panics
///
/// Panics if `encoded` is not valid
/// Returns whether `encoded` bytes is encoded from `raw`. Returns `false` if `encoded` is invalid.
breeswish
Oct 14, 2019
Member
If `encoded` is invalid, it must not be encoded from `raw`. Thus this scenario is already covered by the first sentence.
sticnarf
Oct 14, 2019
Author
Contributor
It's just to clarify that this function can accept invalid encoded bytes :)
/test
///
/// # Panics
///
/// Panics if `self` is not a valid encoded key.
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
/merge
/run-all-tests
cherry pick to release-3.1 in PR #5645
cherry pick to release-3.0 in PR #5646
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
sticnarf commented Oct 10, 2019
What have you changed?
This PR adds a function to check whether an encoded key and a raw key refer to the same key.
Motivation
The motivation is that we want to know if a lock is a primary lock. However, inside TiKV we use encoded keys while the stored primary key in the lock is in the raw format.
An alternative is to pass raw keys directly into the transaction layer, but this makes the transaction code uglier. The benchmark result below shows that for a typical key length (30 bytes), the comparison takes just under 10 ns, and in most cases two different keys differ in their tails. So the performance impact is quite small.
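The idea can be sketched end to end as follows. This is not the PR's implementation: it assumes ascending memcomparable encoding with 8-byte groups, a `0xff` marker for full groups and `0xff - pad` for the final zero-padded group. `encode_asc` and `is_encoded_from_sketch` are hypothetical names used only for illustration.

```rust
// Assumed constants of the encoding.
const ENC_GROUP_SIZE: usize = 8;
const ENC_MARKER: u8 = 0xff;

/// Hypothetical ascending-order encoder, written only for this sketch.
fn encode_asc(raw: &[u8]) -> Vec<u8> {
    let cap = (raw.len() / ENC_GROUP_SIZE + 1) * (ENC_GROUP_SIZE + 1);
    let mut out = Vec::with_capacity(cap);
    let mut chunks = raw.chunks_exact(ENC_GROUP_SIZE);
    for chunk in &mut chunks {
        out.extend_from_slice(chunk);
        out.push(ENC_MARKER); // full group: no padding
    }
    let rest = chunks.remainder();
    let pad = ENC_GROUP_SIZE - rest.len();
    out.extend_from_slice(rest);
    out.resize(out.len() + pad, 0); // zero padding
    out.push(ENC_MARKER - pad as u8);
    out
}

/// Compares without decoding: length first, then the last group, then the
/// remaining groups from back to front so that keys differing in their
/// tails are rejected early. Invalid `encoded` input yields `false`.
fn is_encoded_from_sketch(encoded: &[u8], raw: &[u8]) -> bool {
    if encoded.len() != (raw.len() / ENC_GROUP_SIZE + 1) * (ENC_GROUP_SIZE + 1) {
        return false;
    }
    let mut enc_groups = encoded.rchunks_exact(ENC_GROUP_SIZE + 1);
    let raw_chunks = raw.chunks_exact(ENC_GROUP_SIZE);
    let rest = raw_chunks.remainder();
    let last = match enc_groups.next() {
        Some(group) => group,
        None => return false,
    };
    let pad = ENC_GROUP_SIZE - rest.len();
    if last[ENC_GROUP_SIZE] != ENC_MARKER - pad as u8
        || &last[..rest.len()] != rest
        || last[rest.len()..ENC_GROUP_SIZE].iter().any(|&b| b != 0)
    {
        return false;
    }
    // Both iterators now walk the full groups in reverse order.
    enc_groups
        .zip(raw_chunks.rev())
        .all(|(enc, raw)| enc[ENC_GROUP_SIZE] == ENC_MARKER && &enc[..ENC_GROUP_SIZE] == raw)
}
```

For example, `is_encoded_from_sketch(&encode_asc(b"t_index_key"), b"t_index_key")` holds, while a raw key with one extra trailing byte is rejected by either the length check or the marker of the last group.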
What is the type of the changes?
How is the PR tested?
Does this PR affect documentation (docs) or should it be mentioned in the release notes?
No
Does this PR affect `tidb-ansible`?
No
Refer to a related PR or issue link (optional)
#5575 may utilize it.
Benchmark result if necessary (optional)
Any examples? (optional)