Skip to content

feat: support unicode in object names#875

Open
ferhatelmas wants to merge 25 commits intomasterfrom
ferhat/name
Open

feat: support unicode in object names#875
ferhatelmas wants to merge 25 commits intomasterfrom
ferhat/name

Conversation

@ferhatelmas
Copy link
Member

@ferhatelmas ferhatelmas commented Feb 26, 2026

What kind of change does this PR introduce?

feature

What is the current behavior?

Limited set of characters are permitted in object names.

What is the new behavior?

Unicode is supported across the board.

Additional context

#133

@coderabbitai
Copy link

coderabbitai bot commented Feb 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 4ee18241-4689-423b-998a-5e458aab46a8

📥 Commits

Reviewing files that changed from the base of the PR and between e9d8095 and cadda24.

📒 Files selected for processing (1)
  • src/test/object.test.ts

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Broader Unicode support for object names, optional source-version copy, XML numeric-entity decoding, webhook disable filtering, and safer encoding for S3 copy sources. Signed-upload/render URL validation tightened.
  • Bug Fixes

    • Clearer "invalid object name" errors, rejection of disallowed Unicode/non-character sequences, fixes to continuation-token and signing/encoding edge cases (including double-encoded paths).
  • Tests

    • Large expansion of Unicode, signed-URL, copy/multipart, webhook-filtering, TUS and test-hygiene coverage.

Walkthrough

Adds a tenant SQL migration to enforce object-name constraints (excluding control characters and disallowed Unicode non-characters), adds XML numeric entity decoding to the XML parser, tightens key validation to reject XML-invalid and non-character code points, standardizes S3 CopySource encoding and parsing (including support and validation of source/versionId), versioned continuation tokens, simplifies signed URL signing APIs with path-matching utilities, refines DB error messages for the new constraint, and adds extensive tests for Unicode, CopySource, signed URLs, and token behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP as S3 Handler
    participant Parser as parseCopySource
    participant Storage as Storage Service
    participant DB as Database
    participant Backend as Object Backend

    Client->>HTTP: COPY request (CopySource header, Destination)
    HTTP->>Parser: parseCopySource(CopySource)
    alt malformed
        Parser-->>HTTP: throw InvalidParameter / MissingParameter
        HTTP-->>Client: 400
    else parsed
        Parser-->>HTTP: {bucket, key, sourceVersion?}
        HTTP->>Storage: copyObject(srcBucket, srcKey, srcVersion?, destBucket, destKey)
        Storage->>DB: SELECT version WHERE bucket=srcBucket AND name=srcKey
        DB-->>Storage: {version, metadata} or not found
        alt sourceVersion provided
            Storage->>Storage: compare provided sourceVersion vs DB.version
            alt mismatch
                Storage-->>HTTP: throw NoSuchKey / InvalidParameter
                HTTP-->>Client: 404/400
            else match
                Storage->>Backend: CopyObject with encoded CopySource (includes versionId)
                Backend-->>Storage: {eTag, lastModified}
                Storage-->>HTTP: success
                HTTP-->>Client: 200 OK (copy response)
            end
        else no sourceVersion
            Storage->>Backend: CopyObject using DB.version in CopySource
            Backend-->>Storage: {eTag, lastModified}
            Storage-->>HTTP: success
            HTTP-->>Client: 200 OK (copy response)
        end
    end
Loading

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ferhatelmas ferhatelmas force-pushed the ferhat/name branch 7 times, most recently from 3dd9e78 to 60b9538 Compare March 5, 2026 10:20
@ferhatelmas ferhatelmas marked this pull request as ready for review March 5, 2026 13:15
@ferhatelmas ferhatelmas requested a review from a team as a code owner March 5, 2026 13:15
@ferhatelmas ferhatelmas changed the title feat: support utf8 in object names feat: support unicode in object names Mar 5, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/storage/object.ts (1)

876-894: ⚠️ Potential issue | 🟠 Major

Backward compatibility break in continuation token decoding corrupts pagination cursors

Line 889 unconditionally decodes all token values with decodeURIComponent(). Old tokens stored raw unescaped values (e.g., a%2Fb as a literal string), but the new decoder treats %2F as an encoded /, silently transforming cursor positions and causing list operations to skip or duplicate results.

Proposed fix (version-prefix encoded tokens)
 function encodeContinuationToken(tokenInfo: ContinuationToken) {
   const result: string[] = []
   for (const [k, v] of Object.entries(CONTINUATION_TOKEN_PART_MAP)) {
     const value = tokenInfo[v]
     if (value) {
-      result.push(`${k}:${encodeURIComponent(value)}`)
+      result.push(`${k}:e:${encodeURIComponent(value)}`)
     }
   }
   return Buffer.from(result.join('\n')).toString('base64')
 }

 function decodeContinuationToken(token: string): ContinuationToken {
@@
-    let value = partMatch[2]
-    try {
-      value = decodeURIComponent(value)
-    } catch {
-      // Backward compatibility: previously cursor values were stored unescaped.
-    }
+    let value = partMatch[2]
+    if (value.startsWith('e:')) {
+      value = decodeURIComponent(value.slice(2))
+    }
     result[CONTINUATION_TOKEN_PART_MAP[partMatch[1]]] = value
   }
   return result
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/object.ts` around lines 876 - 894, The decoder
decodeContinuationToken currently always runs decodeURIComponent on every part
which corrupts legacy unescaped tokens; update decodeContinuationToken to
support a version prefix: detect and strip a token version prefix (e.g., "v1:"
vs legacy/no-prefix), and only apply decodeURIComponent for the new version (v1)
while preserving legacy behavior by leaving values raw (no decoding) when the
token has no version prefix; keep using CONTINUATION_TOKEN_PART_MAP and the same
part parsing logic, and ensure the function throws the same Invalid continuation
token error for malformed parts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/internal/errors/codes.ts`:
- Line 327: The error message in ERRORS.InvalidKey uses encodeURIComponent(key)
which can throw for unpaired UTF-16 surrogates; update ERRORS.InvalidKey (where
the message is constructed) to produce a safe-encoded key: prefer using
key.toWellFormed() if available, otherwise normalize/replace unpaired surrogates
(or fall back to a try/catch around encodeURIComponent and sanitize the key to
replace invalid code units with U+FFFD) and then call encodeURIComponent on the
sanitized string before embedding it in the message so URIError cannot escape
and turn a 400 into a 500.

---

Outside diff comments:
In `@src/storage/object.ts`:
- Around line 876-894: The decoder decodeContinuationToken currently always runs
decodeURIComponent on every part which corrupts legacy unescaped tokens; update
decodeContinuationToken to support a version prefix: detect and strip a token
version prefix (e.g., "v1:" vs legacy/no-prefix), and only apply
decodeURIComponent for the new version (v1) while preserving legacy behavior by
leaving values raw (no decoding) when the token has no version prefix; keep
using CONTINUATION_TOKEN_PART_MAP and the same part parsing logic, and ensure
the function throws the same Invalid continuation token error for malformed
parts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 46444ed6-2080-41d2-80b4-c88c86a791d1

📥 Commits

Reviewing files that changed from the base of the PR and between d8dba53 and cde444c.

📒 Files selected for processing (21)
  • migrations/tenant/57-unicode-object-names.sql
  • package.json
  • src/http/plugins/xml.ts
  • src/internal/database/migrations/types.ts
  • src/internal/errors/codes.ts
  • src/scripts/migrations-types.ts
  • src/storage/backend/s3/adapter.ts
  • src/storage/database/knex.ts
  • src/storage/limits.ts
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/bucket.test.ts
  • src/test/common.ts
  • src/test/limits.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/object.test.ts
  • src/test/s3-adapter.test.ts
  • src/test/s3-protocol.test.ts
  • src/test/test-hygiene.test.ts
  • src/test/tus.test.ts
  • src/test/xml-plugin.test.ts

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/test/xml-plugin.test.ts`:
- Around line 3-15: Add tests to ensure decodeXmlNumericEntities rejects
XML-forbidden but in-range numeric entities (so they remain un-decoded). In the
existing test suite for decodeXmlNumericEntities add negative cases asserting
inputs like '�' (and '�' hex/dec variants), surrogate halves such as
'�' (and decimal '�'), and noncharacters like '' (and
decimal '') are returned unchanged; reference the
decodeXmlNumericEntities function to locate the implementation to validate
behavior and ensure both hex (&#x...) and decimal (&#...;) forms are covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: af25095c-594e-4a8f-afaa-599e92b3eee5

📥 Commits

Reviewing files that changed from the base of the PR and between cde444c and fc16aa6.

📒 Files selected for processing (21)
  • migrations/tenant/57-unicode-object-names.sql
  • package.json
  • src/http/plugins/xml.ts
  • src/internal/database/migrations/types.ts
  • src/internal/errors/codes.ts
  • src/scripts/migrations-types.ts
  • src/storage/backend/s3/adapter.ts
  • src/storage/database/knex.ts
  • src/storage/limits.ts
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/bucket.test.ts
  • src/test/common.ts
  • src/test/limits.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/object.test.ts
  • src/test/s3-adapter.test.ts
  • src/test/s3-protocol.test.ts
  • src/test/test-hygiene.test.ts
  • src/test/tus.test.ts
  • src/test/xml-plugin.test.ts
🚧 Files skipped from review as they are similar to previous changes (9)
  • src/test/test-hygiene.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/common.ts
  • migrations/tenant/57-unicode-object-names.sql
  • src/storage/object.ts
  • src/storage/backend/s3/adapter.ts
  • src/internal/errors/codes.ts
  • src/http/plugins/xml.ts
  • src/test/limits.test.ts

@ferhatelmas ferhatelmas force-pushed the ferhat/name branch 3 times, most recently from ff9670e to bf8efff Compare March 6, 2026 09:41
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/protocols/s3/s3-handler.ts`:
- Around line 1448-1501: Change decodeLegacyContinuationToken and
decodeContinuationToken to accept a parameter name (e.g.
decodeContinuationToken(token: string, parameterName: string)) and replace all
plain throws like throw new Error('Invalid continuation token') with the
standardized request error from your ERRORS collection, passing the parameter
name (e.g. throw ERRORS.InvalidArgument(parameterName) or the appropriate
ERRORS.* factory used elsewhere). Update callers (e.g. where
KeyMarker/UploadIdMarker are decoded) to pass the parameter name string such as
'KeyMarker' or 'UploadIdMarker' when calling
decodeContinuationToken/decodeLegacyContinuationToken so malformed client tokens
produce a 4xx request error instead of a generic exception.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 773df860-6e8f-4ac9-bba0-96890cba39d8

📥 Commits

Reviewing files that changed from the base of the PR and between 0121419 and ff9670e.

📒 Files selected for processing (20)
  • migrations/tenant/57-unicode-object-names.sql
  • src/http/routes/object/getSignedObject.ts
  • src/http/routes/object/getSignedUploadURL.ts
  • src/http/routes/render/renderSignedImage.ts
  • src/http/routes/signed-url.ts
  • src/storage/backend/s3/adapter.ts
  • src/storage/events/lifecycle/webhook-filter.ts
  • src/storage/events/lifecycle/webhook.ts
  • src/storage/object.ts
  • src/storage/protocols/s3/s3-handler.ts
  • src/test/limits.test.ts
  • src/test/object-list-v2.test.ts
  • src/test/object.test.ts
  • src/test/render-routes.test.ts
  • src/test/s3-adapter.test.ts
  • src/test/s3-protocol.test.ts
  • src/test/signed-url-route.test.ts
  • src/test/tus.test.ts
  • src/test/webhook-filter.test.ts
  • src/test/webhooks.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/storage/backend/s3/adapter.ts
  • src/test/limits.test.ts

Comment on lines +1448 to +1501
function decodeLegacyContinuationToken(decoded: string) {
// Backward compatibility: preserve pre-version behavior for old in-flight tokens.
const continuationToken = decoded.split(':')[1]
if (!continuationToken) {
throw new Error('Invalid continuation token')
}
return continuationToken
}

function decodeContinuationToken(token: string) {
const decoded = Buffer.from(token, 'base64').toString().split(':')
const decoded = Buffer.from(token, 'base64').toString()

if (decoded.startsWith('l:')) {
return decodeLegacyContinuationToken(decoded)
}

const parts = decoded.split('\n')
let version: string | undefined
let encodedValue: string | undefined

for (const part of parts) {
const match = part.match(/^(\S):(.*)/)
if (!match || match.length !== 3) {
throw new Error('Invalid continuation token')
}

if (match[1] === 'v') {
if (version !== undefined) {
throw new Error('Invalid continuation token')
}
version = match[2]
continue
}

if (match[1] === 'l') {
if (encodedValue !== undefined) {
throw new Error('Invalid continuation token')
}
encodedValue = match[2]
continue
}

if (decoded.length === 0) {
throw new Error('Invalid continuation token')
}

return decoded[1]
if (version !== '1' || !encodedValue) {
throw new Error('Invalid continuation token')
}

try {
return decodeURIComponent(encodedValue)
} catch {
throw new Error('Invalid continuation token')
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Return structured request errors for invalid continuation tokens.

These helpers now throw plain Error for malformed client-supplied markers. In this handler, invalid request input is otherwise surfaced via ERRORS.*; letting KeyMarker / UploadIdMarker failures escape as generic exceptions makes the response path unpredictable and can turn a bad client token into an internal error instead of a 4xx.

Proposed fix
-function decodeLegacyContinuationToken(decoded: string) {
+function decodeLegacyContinuationToken(decoded: string, parameterName: string) {
   // Backward compatibility: preserve pre-version behavior for old in-flight tokens.
   const continuationToken = decoded.split(':')[1]
   if (!continuationToken) {
-    throw new Error('Invalid continuation token')
+    throw ERRORS.InvalidParameter(parameterName)
   }
   return continuationToken
 }
 
-function decodeContinuationToken(token: string) {
+function decodeContinuationToken(token: string, parameterName: string) {
   const decoded = Buffer.from(token, 'base64').toString()
 
   if (decoded.startsWith('l:')) {
-    return decodeLegacyContinuationToken(decoded)
+    return decodeLegacyContinuationToken(decoded, parameterName)
   }
 
   const parts = decoded.split('\n')
   let version: string | undefined
   let encodedValue: string | undefined
 
   for (const part of parts) {
     const match = part.match(/^(\S):(.*)/)
     if (!match || match.length !== 3) {
-      throw new Error('Invalid continuation token')
+      throw ERRORS.InvalidParameter(parameterName)
     }
 
     if (match[1] === 'v') {
       if (version !== undefined) {
-        throw new Error('Invalid continuation token')
+        throw ERRORS.InvalidParameter(parameterName)
       }
       version = match[2]
       continue
     }
 
     if (match[1] === 'l') {
       if (encodedValue !== undefined) {
-        throw new Error('Invalid continuation token')
+        throw ERRORS.InvalidParameter(parameterName)
       }
       encodedValue = match[2]
       continue
     }
 
-    throw new Error('Invalid continuation token')
+    throw ERRORS.InvalidParameter(parameterName)
   }
 
   if (version !== '1' || !encodedValue) {
-    throw new Error('Invalid continuation token')
+    throw ERRORS.InvalidParameter(parameterName)
   }
 
   try {
     return decodeURIComponent(encodedValue)
   } catch {
-    throw new Error('Invalid continuation token')
+    throw ERRORS.InvalidParameter(parameterName)
   }
 }

And pass the parameter name at the call sites, e.g. decodeContinuationToken(keyContinuationToken, 'KeyMarker').

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/protocols/s3/s3-handler.ts` around lines 1448 - 1501, Change
decodeLegacyContinuationToken and decodeContinuationToken to accept a parameter
name (e.g. decodeContinuationToken(token: string, parameterName: string)) and
replace all plain throws like throw new Error('Invalid continuation token') with
the standardized request error from your ERRORS collection, passing the
parameter name (e.g. throw ERRORS.InvalidArgument(parameterName) or the
appropriate ERRORS.* factory used elsewhere). Update callers (e.g. where
KeyMarker/UploadIdMarker are decoded) to pass the parameter name string such as
'KeyMarker' or 'UploadIdMarker' when calling
decodeContinuationToken/decodeLegacyContinuationToken so malformed client tokens
produce a 4xx request error instead of a generic exception.

mlatief and others added 11 commits March 6, 2026 14:42
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants