List objects with unicode - should sort keys using byte-by-byte order and not using utf8 sort order #8218

Open
@guymguym

Environment info

  • NooBaa Version: 5.17
  • Platform: Any

Actual behavior

  1. We always read object keys into JS strings and then sort them. JS strings are UTF-16 encoded, so this sort compares UTF-16 code units (the order loosely called "UTF8 order" in this issue).
  2. However, AWS S3 sorts keys in their "binary" form, comparing the UTF-8 bytes one by one.
  3. We can see this both empirically and from a hint in the AWS docs.
  4. See this link with a simple test example that I checked: AWS indeed returns the "binary order" while noobaa returns the JS string order.
  5. The AWS docs hint at this behavior by calling it "binary order": "List results are always returned in UTF-8 binary order." See https://docs.aws.amazon.com/AmazonS3/latest/userguide/ListingKeysUsingAPIs.html.
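
To make the difference concrete, here is a minimal Node.js sketch that sorts the same three keys both ways. The keys are illustrative stand-ins chosen for this example (they are not the exact file names from the reproduction steps), and the variable names are hypothetical. The two orders can only disagree when a character outside the BMP (a surrogate pair in UTF-16) is compared against a character in the U+E000 to U+FFFF range.

```javascript
// Buffer is a Node.js global. The keys below are illustrative stand-ins.
const keys = [
    'file_\u{103A3}_4.txt', // U+103A3: outside the BMP, a surrogate pair in UTF-16
    'file_\u{FF5E}_3.txt',  // U+FF5E: in the BMP, above the surrogate range
    'file_\u{A98F}_1.txt',  // U+A98F: in the BMP, below the surrogate range
];

// Helper: the digit just before ".txt", used to print a compact order.
const tag = key => key[key.length - 5];

// Default JS ordering compares UTF-16 code units.
const utf16Order = [...keys].sort();

// Byte-by-byte ordering of the UTF-8 encoding of each key.
const byteOrder = [...keys].sort((a, b) =>
    Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8')));

console.log(utf16Order.map(tag)); // [ '1', '4', '3' ]
console.log(byteOrder.map(tag));  // [ '1', '3', '4' ]
```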

Expected behavior

  1. The sort order of ListObjects for keys with unicode characters should be compatible with AWS and should not rely on the JS string sort order.
  2. We can load each string into a buffer and use Buffer.compare instead. The only concern is the extra work and GC pressure this adds to the listing flow, so we should try to minimize that overhead.
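
One possible way to limit that overhead, sketched here as an assumption rather than a measured or agreed design, is to compare the JS strings directly and only adjust the first differing UTF-16 code unit so that surrogates sort above the rest of the BMP. For well-formed strings this gives code-point order, which is the same as UTF-8 byte order, without allocating a Buffer per comparison. The function names `fixupUnit` and `compareKeysCodePointOrder` are hypothetical.

```javascript
// Remap a UTF-16 code unit so that surrogates (0xD800-0xDFFF) sort above
// the remaining BMP range (0xE000-0xFFFF), which matches code-point order.
function fixupUnit(unit) {
    if (unit < 0xD800) return unit;
    return unit >= 0xE000 ? unit - 0x800 : unit + 0x2000;
}

// Compare two keys in code-point order (equal to UTF-8 byte order).
// Assumes both strings are well-formed UTF-16 (no lone surrogates).
// Returns a negative number, zero, or a positive number.
function compareKeysCodePointOrder(a, b) {
    const n = Math.min(a.length, b.length);
    for (let i = 0; i < n; ++i) {
        const x = a.charCodeAt(i);
        const y = b.charCodeAt(i);
        if (x !== y) return fixupUnit(x) - fixupUnit(y);
    }
    // One key is a prefix of the other: the shorter key sorts first.
    return a.length - b.length;
}

// Reference comparator using Buffer.compare, for cross-checking.
const compareKeysByBytes = (a, b) =>
    Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'));

const sampleKeys = ['a/b', 'file_\u{103A3}', 'file_\u{E000}', 'a', 'file_\u{A98F}', 'file_\u{FF5E}', ''];
const sortedFast = [...sampleKeys].sort(compareKeysCodePointOrder);
const sortedByBytes = [...sampleKeys].sort(compareKeysByBytes);
console.log(sortedFast.every((key, i) => key === sortedByBytes[i])); // true
```

Keys that contain lone surrogates are encoded differently by Buffer.from, so the two comparators can disagree on such input. Whether this is actually cheaper in the listing flow would need to be benchmarked.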

Steps to reproduce

  1. Here is a copy of the flow described in: https://forum.moonwalkinc.com/t/determining-s3-listing-order/116

Some third-party implementations of Amazon’s S3 protocol return object information (‘file listings’) in UTF-16 code-unit order rather than the Amazon-compatible Unicode code-point order.

Introduced in Moonwalk 2023.2, when configuring Moonwalk’s s3generic:// plugin (as well as certain other plugins that provide 3rd party S3 support such as s3cos://), a ‘UTF-16 listing order work-around’ option is provided in the Plugin Configuration panel to allow Moonwalk to correctly process results returned in this non-standard order and thereby allow correct and complete scanning of your S3 buckets.

How do you determine whether you need to enable this option?
The following experiment will test the sort order of your S3-compatible device.

Create a new folder on a Windows server with Moonwalk Agent installed
Add files with the EXACT names shown below - use cut & paste to get them right
file_ꦏ_1.txt
file__2.txt
file__3.txt
file_𐎣_4.txt
Don’t worry about the order that Windows shows the files in and don’t worry if some programs just show the characters between the underscores as a box or a question mark etc
Use an Ingest policy to upload this folder to a test bucket on your S3-compatible storage
Use a Gather Statistics policy to scan the location to which you just ingested the files
  a. Tick ‘Export raw file metadata’
  b. Untick the ‘Compress (gzip)’ option
  c. Choose ‘CSV’ format
Check the exported CSV data (e.g. using notepad) to determine the order in which the files appear:
  • If the files appear in 1, 2, 3, 4 order: congratulations, your S3-compatible device uses the expected AWS ordering - you should NOT tick the workaround box
  • If the files appear in 1, 4, 2, 3 order: your device is using UTF-16 code-unit order - you WILL need to tick the ‘UTF-16 listing order work-around’ box
Note: this option does not change the order in which results are actually returned, it just ensures that Moonwalk processes them correctly.
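
The same check can be scripted against any list of keys in the order a ListObjects call returned them. The following is a rough Node.js sketch under stated assumptions: `classifyListingOrder` and its return labels are hypothetical, and the example keys use stand-in characters (U+A98F, U+E000, U+FF5E, U+103A3) rather than the exact file names above.

```javascript
// Return true if `keys` is in non-decreasing order under comparator `cmp`.
function isSorted(keys, cmp) {
    for (let i = 1; i < keys.length; ++i) {
        if (cmp(keys[i - 1], keys[i]) > 0) return false;
    }
    return true;
}

// UTF-8 byte-by-byte order (Buffer is a Node.js global).
const byBytes = (a, b) =>
    Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'));

// Default JS string comparison, which is UTF-16 code-unit order.
const byUtf16 = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Classify the order of a listing, given keys as returned by the service.
function classifyListingOrder(keys) {
    const bytes = isSorted(keys, byBytes);
    const utf16 = isSorted(keys, byUtf16);
    if (bytes && utf16) return 'inconclusive'; // keys do not distinguish the two orders
    if (bytes) return 'utf8-byte-order';
    if (utf16) return 'utf16-code-unit-order';
    return 'unsorted';
}

// Stand-in keys, numbered in their expected AWS order.
const k1 = 'file_\u{A98F}_1.txt';
const k2 = 'file_\u{E000}_2.txt';
const k3 = 'file_\u{FF5E}_3.txt';
const k4 = 'file_\u{103A3}_4.txt';

console.log(classifyListingOrder([k1, k2, k3, k4])); // utf8-byte-order
console.log(classifyListingOrder([k1, k4, k2, k3])); // utf16-code-unit-order
```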

Labels

S3-Compatibility (S3 Compatibility and Namespace over AWS)
