Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WT-10862 Support read-only fast-truncate data format change in MongoDB 6.0 release #9396

Merged
merged 10 commits into from Jul 21, 2023
2 changes: 2 additions & 0 deletions src/btree/bt_vrfy_dsk.c
Expand Up @@ -122,6 +122,8 @@ __wt_verify_dsk_image(WT_SESSION_IMPL *session, const char *tag, const WT_PAGE_H
LF_CLR(WT_PAGE_ENCRYPTED);
if (LF_ISSET(WT_PAGE_UNUSED))
LF_CLR(WT_PAGE_UNUSED);
if (LF_ISSET(WT_PAGE_FT_UPDATE))
LF_CLR(WT_PAGE_FT_UPDATE);
if (flags != 0)
WT_RET_VRFY(session, "page at %s has invalid flags set: 0x%" PRIx8, tag, flags);

Expand Down
4 changes: 2 additions & 2 deletions src/btree/row_modify.c
Expand Up @@ -133,8 +133,8 @@ __wt_row_modify(WT_CURSOR_BTREE *cbt, const WT_ITEM *key, const WT_ITEM *value,
WT_ASSERT(session,
!WT_IS_HS(S2BT(session)->dhandle) ||
(*upd_entry == NULL ||
((*upd_entry)->type == WT_UPDATE_TOMBSTONE && (*upd_entry)->txnid == WT_TS_NONE &&
(*upd_entry)->start_ts == WT_TS_NONE)) ||
((*upd_entry)->type == WT_UPDATE_TOMBSTONE &&
(*upd_entry)->txnid == WT_TXN_NONE && (*upd_entry)->start_ts == WT_TS_NONE)) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch - do we need to be concerned about this? It looks like a bug regardless of the backport here?

Copy link
Contributor Author

@mm-ng mm-ng Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it looks like a typo that I decided to just fix here. I think it's actually ok because WT_TS_NONE and WT_TXN_NONE are actually the same value in the codebase, so I don't know if we should call it a bug. It's just confusing to read in the condition.

(upd_arg->type == WT_UPDATE_TOMBSTONE && upd_arg->start_ts == WT_TS_NONE &&
upd_arg->next == NULL) ||
(upd_arg->type == WT_UPDATE_TOMBSTONE && upd_arg->next != NULL &&
Expand Down
1 change: 1 addition & 0 deletions src/include/btmem.h
Expand Up @@ -77,6 +77,7 @@ struct __wt_page_header {
#define WT_PAGE_EMPTY_V_NONE 0x04u /* Page has no zero-length values */
#define WT_PAGE_ENCRYPTED 0x08u /* Page is encrypted on disk */
#define WT_PAGE_UNUSED 0x10u /* Historic lookaside store page updates, no longer used */
#define WT_PAGE_FT_UPDATE 0x20u /* Page contains updated fast-truncate information */
uint8_t flags; /* 25: flags */

/* A byte of padding, positioned to be added to the flags. */
Expand Down
11 changes: 6 additions & 5 deletions src/include/cell.h
Expand Up @@ -132,20 +132,21 @@
*/
struct __wt_cell {
/*
* Maximum of 71 bytes:
* Maximum of 98 bytes:
* 1: cell descriptor byte
* 1: prefix compression count
* 1: secondary descriptor byte
* 36: 4 timestamps (uint64_t encoding, max 9 bytes)
* 18: 2 transaction IDs (uint64_t encoding, max 9 bytes)
* 9: associated 64-bit value (uint64_t encoding, max 9 bytes)
* 27: fast-delete information (transaction ID, 2 timestamps)
* 5: data length (uint32_t encoding, max 5 bytes)
*
* This calculation is extremely pessimistic: the prefix compression
* count and 64V value overlap, and the validity window, 64V value
* and data length are all optional in some cases.
* This calculation is pessimistic: the prefix compression count and 64V value overlap, and the
* validity window, 64V value, fast-delete information and data length are all optional in some
* or even most cases.
*/
uint8_t __chunk[1 + 1 + 1 + 7 * WT_INTPACK64_MAXSIZE + WT_INTPACK32_MAXSIZE];
uint8_t __chunk[98];
};

/* AUTOMATIC FLAG VALUE GENERATION START 0 */
Expand Down
18 changes: 18 additions & 0 deletions src/include/cell_inline.h
Expand Up @@ -670,6 +670,7 @@ __wt_cell_unpack_safe(WT_SESSION_IMPL *session, const WT_PAGE_HEADER *dsk, WT_CE
WT_TIME_WINDOW tw;
} copy;
WT_CELL_UNPACK_COMMON *unpack;
WT_PAGE_DELETED page_del;
WT_TIME_AGGREGATE *ta;
WT_TIME_WINDOW *tw;
uint64_t v;
Expand Down Expand Up @@ -767,6 +768,7 @@ __wt_cell_unpack_safe(WT_SESSION_IMPL *session, const WT_PAGE_HEADER *dsk, WT_CE
if ((cell->__chunk[0] & WT_CELL_SECOND_DESC) == 0)
break;
flags = *p++; /* skip second descriptor byte */
WT_CELL_LEN_CHK(p, 0);

if (LF_ISSET(WT_CELL_PREPARE))
ta->prepare = 1;
Expand Down Expand Up @@ -810,6 +812,7 @@ __wt_cell_unpack_safe(WT_SESSION_IMPL *session, const WT_PAGE_HEADER *dsk, WT_CE
if ((cell->__chunk[0] & WT_CELL_SECOND_DESC) == 0)
break;
flags = *p++; /* skip second descriptor byte */
WT_CELL_LEN_CHK(p, 0);

if (LF_ISSET(WT_CELL_PREPARE))
tw->prepare = 1;
Expand Down Expand Up @@ -845,6 +848,21 @@ __wt_cell_unpack_safe(WT_SESSION_IMPL *session, const WT_PAGE_HEADER *dsk, WT_CE
break;
}

/*
* Unpack any fast-truncate information. Note that there is no way to write fast-truncate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just skip the fast truncate information, rather than unpacking it?

I guess it's a question of whether information that was generated by a newer release should be parsed out and maybe, therefore, alter decisions made at runtime.

It feels as though it is going to be hard to test that combination (i.e: run with a newer version, generate fast truncate information, downgrade, read fast truncate information, ensure correct decisions are being made).

So, I'd be inclined towards ignoring the information as early as possible, rather than using it to populate data structures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Alex, but I believe we need to unpack it, otherwise, we may get the problem. Instead of saving the unpacked data into the unpack_addr, just save them into a local variable and ignore them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't believe it's possible to simply skip the information - I think it could disrupt other information that is being unpacked as it would affect the position of the pointer. I can definitely unpack it into a local WT_PAGE_DELETED structure rather than into a structure inside the unpack_addr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this logic into a dedicated function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quchenhao Would there be any specific advantages of having a separate function for this? We're only unpacking the information once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make the code cleaner.

* information to disk in versions before 6.1, but this information may still exist in the
* database files if we are downgrading from newer versions. If the fast-truncate information is
* present, it needs to be unpacked but we will ignore these values. Ignoring these values is
* equivalent to writing out truncated content that is not associated with any visibility.
*/
if (unpack->raw == WT_CELL_ADDR_DEL && F_ISSET(dsk, WT_PAGE_FT_UPDATE)) {
WT_RET(
__wt_vunpack_uint(&p, end == NULL ? 0 : WT_PTRDIFF(end, p), (uint64_t *)&page_del.txnid));
WT_RET(__wt_vunpack_uint(&p, end == NULL ? 0 : WT_PTRDIFF(end, p), &page_del.timestamp));
WT_RET(
__wt_vunpack_uint(&p, end == NULL ? 0 : WT_PTRDIFF(end, p), &page_del.durable_timestamp));
}

/*
* Check for an RLE count or record number that optionally follows the cell descriptor byte on
* column-store variable-length pages.
Expand Down