
Fix: S3 GW list multipart uploads ordering#9847

Open
N-o-Z wants to merge 9 commits into master from fix/s3gw-list-parts-ordering-9554

Conversation

@N-o-Z
Member

@N-o-Z N-o-Z commented Dec 19, 2025

Closes #9554

This PR should not be merged before we decide on migration strategy (see below)

Change Description

Bug Fix

  • Added a new List method to the multipart tracker
  • Refactored multipart upload keys - multiparts are now part of the repo partition, and the key is a combination of path + uploadID
  • Multipart upload listing now comes from the KV store and not from the object store
  • Implemented a new upload ID iterator
  • Fixed additional bugs that were found along the way

Testing Details

Added new unit and integration tests

Breaking Change?

Yes

Migration Strategy:

This PR introduces a breaking change due to the change in the upload ID key in our database.
This means that users who upgrade to this version while they have ongoing MPUs will effectively lose them.
This is made worse by the fact that our current implementation does not save upload IDs in the context of a repository, and the UploadID data does not record the repository information - therefore we cannot take the route of a standard migration.

Below is a proposal on how to deal with existing MPUs:

Create a migration flow that aborts any ongoing MPUs and cleans up any remaining keys in the old partition.

  • Users that try to upgrade without performing the migration will fail to start the server. They will either need to complete / abort any outstanding MPUs or run the migration, which will de facto abort the outstanding MPUs
  • This flow will be valid only for version latest++
  • For any version > latest++, the upgrade will be blocked if there are outstanding MPUs
  • This behavior should be well documented, as well as highlighted in the release notes / changelog

@N-o-Z N-o-Z self-assigned this Dec 19, 2025
@N-o-Z N-o-Z added bug Something isn't working include-changelog PR description should be included in next release changelog labels Dec 19, 2025
@github-actions github-actions bot added area/gateway Changes to the gateway area/testing Improvements or additions to tests labels Dec 19, 2025
@N-o-Z N-o-Z force-pushed the fix/s3gw-list-parts-ordering-9554 branch from 9d9c513 to 036db8a on December 19, 2025 22:19
@N-o-Z N-o-Z force-pushed the fix/s3gw-list-parts-ordering-9554 branch from 036db8a to 4d60edf on December 19, 2025 22:38
@N-o-Z N-o-Z requested review from a team, arielshaqed, itaiad200 and nopcoder December 21, 2025 00:32
@arielshaqed
Contributor

users who upgrade to this version and have ongoing MPUs will basically lose them

It's probably a bit worse: users who start an MPU during the rollout may also lose that MPU. I think we may need some product direction here. There is another option - to release a version supporting both modes, upgrade to that version, then after a week release a version dropping the old mode. That one is of course expensive, hence we should ask product.

@N-o-Z
Member Author

N-o-Z commented Dec 21, 2025

users who upgrade to this version and have ongoing MPUs will basically lose them

It's probably a bit worse: users who start an MPU during the rollout may also lose that MPU. I think we may need some product direction here. There is another option - to release a version supporting both modes, upgrade to that version, then after a week release a version dropping the old mode. That one is of course expensive, hence we should ask product.

I don't think that's necessary, since once we declare the migration path we can require users not to perform any MPUs during the upgrade. The choice whether to complete outstanding MPUs before the upgrade or let the migration process abort them - all of that responsibility falls to the user

Comment on lines +319 to +325
t.Cleanup(func() {
_, _ = s3Client.AbortMultipartUpload(ctx, &s3.AbortMultipartUploadInput{
Bucket: aws.String(repo),
Key: aws.String(key),
UploadId: resp.UploadId,
})
})
Contributor


Worried that this will do more harm than good.
Is there a way to list all the uploads and delete/abort the ones that are relevant for the test before the test starts? It would let us distinguish a real failure during the test from the cleanup/setup that we perform before the test.

Comment on lines 459 to 462
// IsTruncated should be nil or false when not truncated
if outputExact.IsTruncated != nil {
require.False(t, *outputExact.IsTruncated, "should not be truncated when request is completely fulfilled")
}
Contributor


Relevant to multiple places in the test code.
In this case we require nil or a pointer to false - we can require using apiutil.Value, or require.True(outputExact.IsTruncated == nil || !*outputExact.IsTruncated).

Not a blocker.

Comment on lines 10 to 25
// UploadIterator is an iterator over multipart uploads sorted by Path, then UploadID
type UploadIterator interface {
// Next advances the iterator to the next upload
// Returns true if there is a next upload, false otherwise
Next() bool
// Value returns the current upload
// Should only be called after Next returns true
Value() *Upload
// Err returns any error encountered during iteration
Err() error
// Close releases resources associated with the iterator
Close()
// SeekGE seeks to the first upload with key >= uploadIDKey(path, uploadID)
// After calling SeekGE, Next() must be called to access the first element at or after the seek position
SeekGE(key, uploadID string)
}
Contributor


The interface is part of the tracker interface and code here should define the UploadIterator implementation and return a pointer to the actual struct.

Member Author


Not sure I understand the comment - please clarify

Contributor


Moving UploadIterator interface to pkg/gateway/multipart/tracker.go where it is used.
The newUploadIterator func should return the implementation type - *kvUploadIterator.

Comment on lines 37 to 40
if errors.Is(err, kv.ErrNotFound) {
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrNoSuchBucket))
}
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrInternalError))
Contributor


missing return or else

Comment on lines 82 to 98
func (it *kvUploadIterator) Err() error {
if it.err != nil {
return it.err
}
if !it.closed {
return it.kvIter.Err()
}
return nil
}

func (it *kvUploadIterator) Close() {
if it.closed {
return
}
it.kvIter.Close()
it.closed = true
}
Contributor


From this code it seems that we don't need to manage the 'closed' state - we just delegate the state to the underlying iterator. The kvIter itself can serve as the closed indicator if needed.

Comment on lines 438 to 441
if errors.Is(err, kv.ErrNotFound) {
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrNoSuchBucket))
}
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrInternalError))
Contributor


return or else

Comment on lines 482 to 492
// Check if there are more uploads (for IsTruncated flag)
// If we exited the loop due to length limit, iter.Next() hasn't been called for the next item yet
isTruncated := iter.Next()

// Set pagination markers for next page
var nextKeyMarker, nextUploadIDMarker string
if isTruncated && len(uploads) > 0 {
last := uploads[len(uploads)-1]
nextKeyMarker = last.Key
nextUploadIDMarker = last.UploadID
}
Contributor


To prevent an endless loop: in case isTruncated is true but len(uploads) is zero, we should set isTruncated to false

Member Author


It shouldn't happen but added anyway

Comment on lines 79 to 82
if errors.Is(err, kv.ErrNotFound) {
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrNoSuchBucket))
}
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrInternalError))
Contributor


return or else

Comment on lines 118 to 121
if errors.Is(err, kv.ErrNotFound) {
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrNoSuchBucket))
}
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrInternalError))
Contributor


return or else

Comment on lines 166 to 169
if errors.Is(err, kv.ErrNotFound) {
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrNoSuchBucket))
}
_ = o.EncodeError(w, req, err, gatewayerrors.Codes.ToAPIErr(gatewayerrors.ErrInternalError))
Contributor


return or else

…-ordering-9554

# Conflicts:
#	esti/commit_test.go
#	esti/s3_gateway_test.go
@N-o-Z N-o-Z marked this pull request as ready for review December 30, 2025 00:34
@N-o-Z N-o-Z requested a review from nopcoder December 30, 2025 00:38
@N-o-Z
Member Author

N-o-Z commented Dec 30, 2025

@nopcoder Thanks for the thorough review - I hope I didn't miss anything

@arielshaqed
Contributor

users who upgrade to this version and have ongoing MPUs will basically lose them

It's probably a bit worse: users who start an MPU during the rollout may also lose that MPU. I think we may need some product direction here. There is another option - to release a version supporting both modes, upgrade to that version, then after a week release a version dropping the old mode. That one is of course expensive, hence we should ask product.

I don't think that's necessary since once we declare the migration path we can require users to not perform any MPUs during the upgrade. The choice whether to complete outstanding MPUs before upgrade or let the migration process abort them - all of that responsibility will be rolled down to the user

This is even more product than before. Bear in mind that "the user" is not a single person. For instance, think about how to roll this out on lakeFS Cloud. We would need to announce that at a certain time slot all MPUs are disallowed [1]. And then we must upgrade the entire cluster within that time slot. This will probably require CS involvement.

Footnotes

  1. For instance, some Spark users use S3A, and will end up doing MPUs.

@N-o-Z
Member Author

N-o-Z commented Dec 30, 2025

users who upgrade to this version and have ongoing MPUs will basically lose them

It's probably a bit worse: users who start an MPU during the rollout may also lose that MPU. I think we may need some product direction here. There is another option - to release a version supporting both modes, upgrade to that version, then after a week release a version dropping the old mode. That one is of course expensive, hence we should ask product.

I don't think that's necessary since once we declare the migration path we can require users to not perform any MPUs during the upgrade. The choice whether to complete outstanding MPUs before upgrade or let the migration process abort them - all of that responsibility will be rolled down to the user

This is even more product than before. Bear in mind that "the user" is not a single person. For instance, think about how to roll this out on lakeFS Cloud. We would need to announce that at a certain time slot all MPUs are disallowed [1]. And then we must upgrade the entire cluster within that time slot. This will probably require CS involvement.

Footnotes

  1. For instance, some Spark users use S3A, and will end up doing MPUs.

I agree completely with everything you said.
That's why we're not going to merge this change before we get @treeverse/product's input and decide on the migration path

Contributor

@nopcoder nopcoder left a comment


Thanks for addressing all the comments.
There is one concern I have with the kvUploadIterator implementation:

  1. The Seek implementation first checks for errors, which means we don't allow calling Seek again after New or a previous Seek has failed
  2. Because Seek is not exactly like New (where we expect the user to call Close only when no error was returned), we need to check whether 'it.kvIter' is set - for the case where New succeeded, Seek failed, and Close then finds 'it.kvIter' set to nil

@N-o-Z N-o-Z requested a review from nopcoder January 12, 2026 23:38
@ozkatz
Collaborator

ozkatz commented Jan 28, 2026

You've asked for product feedback so here it is :)

To make things simple, let's put forth a constraint: our rule of thumb should be to never require downtime to upgrade a lakeFS minor version. There might be extreme cases where we'd have to break that rule, but I'm not convinced this is one of them.

I suggest we either find a way to fix this while allowing MPUs to continue uninterrupted during the upgrade process (without downtime) - or consider this a big enough breaking change that we simply can't introduce in lakeFS 1.x.


Labels

area/gateway Changes to the gateway area/testing Improvements or additions to tests bug Something isn't working include-changelog PR description should be included in next release changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway's ListMultipartUploads doesn't respect S3 ordering

4 participants