New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Fix WAL reuse issue under encryption scenario #16698
Labels
2.18 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/medium
Medium priority issue
Comments
yusong-yan
added
area/docdb
YugabyteDB core features
status/awaiting-triage
Issue awaiting triage
labels
Apr 5, 2023
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
Apr 5, 2023
yusong-yan
changed the title
[DocDB] Add NewWritableFile Encryption Wrapper to fix WAL reuse issue
[DocDB] Fix WAL reuse issue under encryption scenario
Apr 5, 2023
yusong-yan
added a commit
that referenced
this issue
May 16, 2023
…use issue Summary: **Issue:** We have seen WAL reuse feature caused cluster crash under encryption on rest. scenario. This is because WAL reuse feature reopens the writable file by using `NewWritableFile`, which doesn't have encryption wrappers implemented like `NewTempWritableFile` and `NewRandomAccessFile`. Thus, after the log reopens the last file for write, the future appended data will not be encrypted in this file. **Fix:** Implemented the encryption wrapper for `NewWritableFile`: 1. If the file is encrypted already, we first decrypt its header. Every WAL file’s encryption header contains its own unique key(used for encrypting file data) and universe key id. 2. If the universe key id is not the latest, we will skip reusing the WAL. Instead, a new segment will be allocated as the active writable file. Otherwise, move to step 3. 3. Then use universe key id to decrypt file's own unique key. The future appended data will be encrypted by using this decrypted file's key. 4. When reopening this file for write, there is an `initial_offset` that determines where to begin to write. Under the encryption scenario, we need to add the encryption header size into `initial_offset`. Here is the implementation level explanation: the `initial_offset`value comes from `RandomAccessFile::read()` iterating WAL entries until the last entry. For example, when we ask encrypted `RandomAccessFile` to read from offset **x**, internally, it always starts reading from offset **x**+**head_size**, which is why we need to adjust the `initial_offset` value. Extra clarification for step 2. Here is the reason why we want to make sure the file’s universe key id is the latest: As I mentioned, every WAL file’s encryption header contains its own unique key(used for encrypting file data) and universe key id. And the relation between these two keys is: the file’s own unique key is randomly generated at the beginning, then later encrypted by the latest universe key. Thus, without WAL reuse feature, after a tserver or tablet restart, the tablet’s active writable WAL file always has the most updated/latest universe key id because, after the restart, a new WAL file gets allocated. However, this will no longer be the case with the log reuse feature, since after tserver restart, WAL’s active writable file might contain old universe key id. Which is why it is necessary to do the universe key id comparison in order to make sure the file contains the latest universe key id. The above description is for encryption-enabled scenario. Sometimes we might turn on or off the encryption when the universe is running. So here are four cases that could happen when we are trying to reuse the WAL file: 1. File has an encryption header, encryption is on. 2. File has an encryption header, encryption is off. 3. File doesn't have an encryption header, encryption is on. 4. File doesn't have an encryption header, encryption is off. For cases 1 and 4, since the env and file's encryption status are the same it is okay to continue reusing the file. For cases 2 and 3, WAL will give up reusing the file and allocate a new file, so that the future appended data can have the correct encryption status. In the crash loop scenario, even if reuse WAL fails in the first restart, future restarts will be either case 1 or 4. Thus, crash loop memory hazards can still be prevented. There are many places calling `NewWritableFile`, but for now, only the WAL reuse feature uses the `NewWritableFile` encryption implementation, because it's fairly complicated to ensure all callers of `NewWritableFile` are safe to do the encryption. Thus, there will be a new diff to handle it. Test Plan: ./yb_build.sh release --cxx-test integration-tests_encryption-test --gtest_filter EncryptionTest.EncryptWALDataAfterWALReuse --clang15 -n 100 ./yb_build.sh release --cxx-test integration-tests_encryption-test --gtest_filter EncryptionTest.EncryptWALDataAfterWALReuseWithRotateKey --clang15 -n 100 ./yb_build.sh release --cxx-test encryption_encrypted_env-test --gtest_filter TestEncryptedEnv.FileOps --clang15 -n 100 Reviewers: timur, rahuldesirazu, rthallam Reviewed By: timur, rahuldesirazu Subscribers: ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D24155
1 task
yusong-yan
added a commit
that referenced
this issue
May 19, 2023
Summary: Now that the WAL reuse encryption issue is fixed in #16698, we can enable WAL reuse feature. Jira: DB-6578 Test Plan: Check the detective test result. Reviewers: rthallam Reviewed By: rthallam Subscribers: bogdan, ybase Differential Revision: https://phorge.dev.yugabyte.com/D25455
1 task
yusong-yan
added a commit
that referenced
this issue
May 19, 2023
Summary: Now that the WAL reuse encryption issue is fixed in #16698, we can enable WAL reuse feature. Jira: DB-6578 Test Plan: Jenkins: skip Reviewers: rthallam, timur Reviewed By: rthallam Subscribers: ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D25552
yusong-yan
added a commit
that referenced
this issue
May 24, 2023
… to fix WAL reuse issue Summary: Original commit: 6ef4381 / D24155 **Issue:** We have seen WAL reuse feature caused cluster crash under encryption on rest. scenario. This is because WAL reuse feature reopens the writable file by using `NewWritableFile`, which doesn't have encryption wrappers implemented like `NewTempWritableFile` and `NewRandomAccessFile`. Thus, after the log reopens the last file for write, the future appended data will not be encrypted in this file. **Fix:** Implemented the encryption wrapper for `NewWritableFile`: 1. If the file is encrypted already, we first decrypt its header. Every WAL file’s encryption header contains its own unique key(used for encrypting file data) and universe key id. 2. If the universe key id is not the latest, we will skip reusing the WAL. Instead, a new segment will be allocated as the active writable file. Otherwise, move to step 3. 3. Then use universe key id to decrypt file's own unique key. The future appended data will be encrypted by using this decrypted file's key. 4. When reopening this file for write, there is an `initial_offset` that determines where to begin to write. Under the encryption scenario, we need to add the encryption header size into `initial_offset`. Here is the implementation level explanation: the `initial_offset`value comes from `RandomAccessFile::read()` iterating WAL entries until the last entry. For example, when we ask encrypted `RandomAccessFile` to read from offset **x**, internally, it always starts reading from offset **x**+**head_size**, which is why we need to adjust the `initial_offset` value. Extra clarification for step 2. Here is the reason why we want to make sure the file’s universe key id is the latest: As I mentioned, every WAL file’s encryption header contains its own unique key(used for encrypting file data) and universe key id. And the relation between these two keys is: the file’s own unique key is randomly generated at the beginning, then later encrypted by the latest universe key. Thus, without WAL reuse feature, after a tserver or tablet restart, the tablet’s active writable WAL file always has the most updated/latest universe key id because, after the restart, a new WAL file gets allocated. However, this will no longer be the case with the log reuse feature, since after tserver restart, WAL’s active writable file might contain old universe key id. Which is why it is necessary to do the universe key id comparison in order to make sure the file contains the latest universe key id. The above description is for encryption-enabled scenario. Sometimes we might turn on or off the encryption when the universe is running. So here are four cases that could happen when we are trying to reuse the WAL file: 1. File has an encryption header, encryption is on. 2. File has an encryption header, encryption is off. 3. File doesn't have an encryption header, encryption is on. 4. File doesn't have an encryption header, encryption is off. For cases 1 and 4, since the env and file's encryption status are the same it is okay to continue reusing the file. For cases 2 and 3, WAL will give up reusing the file and allocate a new file, so that the future appended data can have the correct encryption status. In the crash loop scenario, even if reuse WAL fails in the first restart, future restarts will be either case 1 or 4. Thus, crash loop memory hazards can still be prevented. There are many places calling `NewWritableFile`, but for now, only the WAL reuse feature uses the `NewWritableFile` encryption implementation, because it's fairly complicated to ensure all callers of `NewWritableFile` are safe to do the encryption. Thus, there will be a new diff to handle it. Jira: DB-6074 Test Plan: ./yb_build.sh release --cxx-test integration-tests_encryption-test --gtest_filter EncryptionTest.EncryptWALDataAfterWALReuse --clang15 -n 100 ./yb_build.sh release --cxx-test integration-tests_encryption-test --gtest_filter EncryptionTest.EncryptWALDataAfterWALReuseWithRotateKey --clang15 -n 100 ./yb_build.sh release --cxx-test encryption_encrypted_env-test --gtest_filter TestEncryptedEnv.FileOps --clang15 -n 100 Reviewers: timur, rahuldesirazu, rthallam Reviewed By: timur Subscribers: bogdan, ybase Differential Revision: https://phorge.dev.yugabyte.com/D25555
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2.18 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/medium
Medium priority issue
Jira Link: DB-6074
Description
Following up on GH16345, the reason why WAL reuse didn't work successfully under encryption on the rest scenario is that this new feature used
NewWritableFile
for reopening the writable file, but it didn't have encryption wrapper likeNewTempWritableFile
andNewRandomAccessFile
. Thus, after the log reopens the last file for write, the future appended data will not be encrypted in this file.Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: