Skip to content

ENOENT error on macOS when indexing files with umlaut characters in paths #81

@c-stoeckl

Description

@c-stoeckl

When adding a collection on macOS, qmd fails to index files containing non-ASCII characters (like German umlauts) in directory or file names. The error shows mojibake encoding artifacts suggesting a UTF-8 encoding issue.

Environment

  • OS: macOS (any version)
  • qmd version: latest
  • Shell: zsh/bash
  • File location: iCloud Drive, Obsidian vaults, or any directory with non-ASCII filenames

Steps to Reproduce

  1. Create a file with umlaut characters:

    mkdir -p ~/test-notes/Jürgen\ Müller
    echo "# Test note" > ~/test-notes/Jürgen\ Müller/note.md
  2. Add as collection:

    qmd collection add ~/test-notes --name test-encoding

Expected Behavior

Collection should index successfully, recognizing files like Jürgen Müller/note.md.

Actual Behavior

Indexing fails with ENOENT:

ENOENT: no such file or directory, stat '/Users/user/test-notes/J[corrupted-path]/note.md'

The error shows mojibake (corrupted UTF-8 interpreted as Latin-1).

Root Cause

Bun UTF-8 Path Corruption Bug

Investigation revealed that only Bun.file().stat() has a bug that corrupts UTF-8 file paths internally before making system calls:

  • Bun.file(path).stat() - BROKEN - corrupts UTF-8 paths, causes ENOENT
  • Bun.file(path).text() - WORKS - uses different code path, not affected

When calling Bun.file(filepath).stat() on paths containing non-ASCII characters (e.g., German umlauts like ö, ü, ä), Bun mangles the encoding, causing the underlying stat() syscall to receive a corrupted path.

Example:

  • Input path: naïve-notes.txt (with composed Unicode characters)
  • Corrupted path in syscall: mojibake bytes causing ENOENT

Suggested Fix

Replace Bun.file().stat() calls with Node.js fs module:

// Before (broken):
const stat = await Bun.file(filepath).stat();

// After (works):
import { statSync } from 'fs';
const stat = statSync(filepath);

While only .stat() is buggy, we may also standardize on readFileSync() instead of Bun.file().text() for API consistency.

Additional Context

  • This affects files with: German umlauts (ä, ö, ü), French accents (é, è, ê), etc.
  • German, French, Japanese, and other non-ASCII users are impacted
  • Node.js fs module handles UTF-8 paths correctly on macOS
  • Original thought was this was macOS NFD encoding issue, but investigation revealed it's specifically a Bun bug

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions