When adding a collection on macOS, qmd fails to index files containing non-ASCII characters (like German umlauts) in directory or file names. The error shows mojibake encoding artifacts suggesting a UTF-8 encoding issue.
Environment
- OS: macOS (any version)
- qmd version: latest
- Shell: zsh/bash
- File location: iCloud Drive, Obsidian vaults, or any directory with non-ASCII filenames
Steps to Reproduce
-
Create a file with umlaut characters:
mkdir -p ~/test-notes/Jürgen\ Müller
echo "# Test note" > ~/test-notes/Jürgen\ Müller/note.md
-
Add as collection:
qmd collection add ~/test-notes --name test-encoding
Expected Behavior
Collection should index successfully, recognizing files like Jürgen Müller/note.md.
Actual Behavior
Indexing fails with ENOENT:
ENOENT: no such file or directory, stat '/Users/user/test-notes/J[corrupted-path]/note.md'
The error shows mojibake (corrupted UTF-8 interpreted as Latin-1).
Root Cause
Bun UTF-8 Path Corruption Bug
Investigation revealed that only Bun.file().stat() has a bug that corrupts UTF-8 file paths internally before making system calls:
- ✗
Bun.file(path).stat() - BROKEN - corrupts UTF-8 paths, causes ENOENT
- ✓
Bun.file(path).text() - WORKS - uses different code path, not affected
When calling Bun.file(filepath).stat() on paths containing non-ASCII characters (e.g., German umlauts like ö, ü, ä), Bun mangles the encoding, causing the underlying stat() syscall to receive a corrupted path.
Example:
- Input path:
naïve-notes.txt (with composed Unicode characters)
- Corrupted path in syscall: mojibake bytes causing ENOENT
Suggested Fix
Replace Bun.file().stat() calls with Node.js fs module:
// Before (broken):
const stat = await Bun.file(filepath).stat();
// After (works):
import { statSync } from 'fs';
const stat = statSync(filepath);
While only .stat() is buggy, we may also standardize on readFileSync() instead of Bun.file().text() for API consistency.
Additional Context
- This affects files with: German umlauts (ä, ö, ü), French accents (é, è, ê), etc.
- German, French, Japanese, and other non-ASCII users are impacted
- Node.js
fs module handles UTF-8 paths correctly on macOS
- Original thought was this was macOS NFD encoding issue, but investigation revealed it's specifically a Bun bug
Related
When adding a collection on macOS, qmd fails to index files containing non-ASCII characters (like German umlauts) in directory or file names. The error shows mojibake encoding artifacts suggesting a UTF-8 encoding issue.
Environment
Steps to Reproduce
Create a file with umlaut characters:
Add as collection:
qmd collection add ~/test-notes --name test-encodingExpected Behavior
Collection should index successfully, recognizing files like
Jürgen Müller/note.md.Actual Behavior
Indexing fails with
ENOENT:The error shows mojibake (corrupted UTF-8 interpreted as Latin-1).
Root Cause
Bun UTF-8 Path Corruption Bug
Investigation revealed that only
Bun.file().stat()has a bug that corrupts UTF-8 file paths internally before making system calls:Bun.file(path).stat()- BROKEN - corrupts UTF-8 paths, causes ENOENTBun.file(path).text()- WORKS - uses different code path, not affectedWhen calling
Bun.file(filepath).stat()on paths containing non-ASCII characters (e.g., German umlauts like ö, ü, ä), Bun mangles the encoding, causing the underlyingstat()syscall to receive a corrupted path.Example:
naïve-notes.txt(with composed Unicode characters)Suggested Fix
Replace
Bun.file().stat()calls with Node.jsfsmodule:While only
.stat()is buggy, we may also standardize onreadFileSync()instead ofBun.file().text()for API consistency.Additional Context
fsmodule handles UTF-8 paths correctly on macOSRelated