feat: support multi-character input delimiters #132

Merged
vmvarela merged 2 commits into master from
issue-86/multi-char-delimiter
May 7, 2026

@vmvarela vmvarela commented May 7, 2026

Summary

Closes #86.

Extends -d/--delimiter to accept strings of 1–8 bytes instead of a single character. Common real-world separators like ||, ;;, or two spaces now work without preprocessing.

Changes

src/csv.zig — streaming parser

  • delimiter: u8 → delimiter: []const u8; added partial_delim: usize = 0 to track in-progress multi-byte matches
  • State machine updated in .field_start, .unquoted, and .quote_saw: when delimiter[0] is seen, bytes are tentatively consumed; on full match the field is flushed, on mismatch the prefix is emitted as literal field content and the current byte is re-evaluated
  • EOF handler flushes any pending partial-match bytes before closing the record
  • partial_delim is reset at the top of nextRecord to guard against stale state after a non-fatal I/O error
  • +11 unit tests covering: 2-char, 3-char, partial false-positive, quoted field containing delimiter, empty first/last field, only-delimiter input, EOF without newline, partial at EOF, greedy left-to-right behavior
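The tentative-match logic described above can be sketched as follows. This is a minimal, non-streaming illustration and not the code in src/csv.zig: the helper name splitFields, its signature, and the caller-supplied output slice are all assumptions made for the example, and quoting and record terminators are left out entirely. Only the partial_delim counter mirrors the field named in this PR.

```zig
const std = @import("std");

/// Hypothetical helper: splits one unquoted record into fields on a
/// multi-byte delimiter. `out` must be large enough to hold every field.
/// Returns the number of fields written.
fn splitFields(input: []const u8, delimiter: []const u8, out: [][]const u8) usize {
    std.debug.assert(delimiter.len >= 1);
    var count: usize = 0;
    var field_start: usize = 0;
    var partial_delim: usize = 0; // delimiter bytes tentatively matched so far
    var i: usize = 0;
    while (i < input.len) {
        const b = input[i];
        if (b == delimiter[partial_delim]) {
            // Tentatively consume this byte as part of the delimiter.
            partial_delim += 1;
            i += 1;
            if (partial_delim == delimiter.len) {
                // Full match: flush the field that ended before the delimiter.
                out[count] = input[field_start .. i - delimiter.len];
                count += 1;
                field_start = i;
                partial_delim = 0;
            }
        } else if (partial_delim > 0) {
            // Mismatch: the tentative prefix stays in the field as literal
            // content. `i` is not advanced, so the current byte is
            // re-evaluated with a clean match state.
            partial_delim = 0;
        } else {
            i += 1;
        }
    }
    // End of input: a pending partial match is ordinary field content.
    out[count] = input[field_start..];
    count += 1;
    return count;
}

test "full match and partial false positive" {
    var fields: [4][]const u8 = undefined;

    var n = splitFields("a||b||c", "||", &fields);
    try std.testing.expectEqual(@as(usize, 3), n);
    try std.testing.expectEqualStrings("b", fields[1]);

    // A lone '|' is a false positive and stays inside the field.
    n = splitFields("a|b||c", "||", &fields);
    try std.testing.expectEqual(@as(usize, 2), n);
    try std.testing.expectEqualStrings("a|b", fields[0]);

    // A partial delimiter at end of input is treated as field content.
    n = splitFields("x;;", ";;;", &fields);
    try std.testing.expectEqual(@as(usize, 1), n);
    try std.testing.expectEqualStrings("x;;", fields[0]);
}
```

In this sketch a mismatch discards the whole tentative prefix without rescanning it, so a delimiter with a repeating prefix (for example aab against the input aaab) is not found when the match starts inside the discarded prefix. Whether src/csv.zig treats that case differently is not stated in this PR.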

src/main.zig — CLI and output

  • parseDelimiter: now returns []const u8; accepts 1–8 bytes, rejects empty and len > 8
  • All delimiter: u8 fields updated across ParsedArgs, ColumnsArgs, ValidateArgs, SampleArgs
  • writeField/printRow/printHeaderRow: delimiter type updated to []const u8; quoting detection uses std.mem.indexOf instead of char-by-char comparison
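As a rough illustration of the two changes listed above, the sketch below validates a delimiter argument against the 1–8 byte limit and uses std.mem.indexOf to decide whether a field needs quoting. The error names, the max_delimiter_len constant, and the needsQuoting helper are assumptions for this example; the actual signatures in src/main.zig may differ.

```zig
const std = @import("std");

// Hypothetical error set and limit; the names are assumptions.
const DelimiterError = error{ EmptyDelimiter, DelimiterTooLong };
const max_delimiter_len = 8;

/// Accepts a delimiter of 1 to 8 bytes and returns it unchanged.
fn parseDelimiter(arg: []const u8) DelimiterError![]const u8 {
    if (arg.len == 0) return error.EmptyDelimiter;
    if (arg.len > max_delimiter_len) return error.DelimiterTooLong;
    return arg;
}

/// Hypothetical helper: a field needs quoting if it contains the whole
/// delimiter sequence, a double quote, or a line break.
fn needsQuoting(field: []const u8, delimiter: []const u8) bool {
    if (std.mem.indexOf(u8, field, delimiter) != null) return true;
    return std.mem.indexOfAny(u8, field, "\"\r\n") != null;
}

test "delimiter validation" {
    try std.testing.expectEqualStrings("||", try parseDelimiter("||"));
    try std.testing.expectError(error.EmptyDelimiter, parseDelimiter(""));
    try std.testing.expectError(error.DelimiterTooLong, parseDelimiter("123456789"));
}

test "quoting detection with a multi-byte delimiter" {
    try std.testing.expect(needsQuoting("a||b", "||"));
    // A single '|' is not the delimiter, so no quoting is needed.
    try std.testing.expect(!needsQuoting("a|b", "||"));
    try std.testing.expect(needsQuoting("say \"hi\"", "||"));
}
```

Searching for the full sequence matters here: a byte-by-byte check against delimiter[0] alone would quote fields such as a|b unnecessarily.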

build.zig

  • Integration tests 95–98: || delimiter, ;;; delimiter, empty delimiter error (exit 1), delimiter > 8 chars error (exit 1)

Docs

  • README.md and docs/sql-pipe.1.scd: updated to reflect 1–8 character range, added multi-char examples

Behavior notes

  • Single-character delimiters are fully backward compatible (fast path: delimiter.len == 1 avoids partial-match overhead)
  • Matching is greedy left-to-right: the earliest complete match wins, which also settles overlapping patterns (e.g. aa in aaa)
  • --tsv remains a shorthand for --delimiter $'\t' (single char, unchanged)
  • --sample output uses the same delimiter as input (consistent with existing single-char behavior)
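To make the greedy left-to-right rule concrete, the test below uses the standard library's std.mem.splitSequence, which also takes the leftmost non-overlapping match. It stands in for the streaming parser purely for illustration and is not how src/csv.zig is implemented.

```zig
const std = @import("std");

test "greedy left-to-right matching on an overlapping pattern" {
    // "aaa" split on "aa": the first two bytes form the delimiter,
    // leaving an empty first field and a trailing "a".
    var it = std.mem.splitSequence(u8, "aaa", "aa");
    try std.testing.expectEqualStrings("", it.next().?);
    try std.testing.expectEqualStrings("a", it.next().?);
    try std.testing.expect(it.next() == null);
}
```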

Testing

zig build test   # all integration + unit tests
ziglint src build.zig

vmvarela added 2 commits May 7, 2026 13:57
Extend -d/--delimiter to accept strings of 1-8 bytes instead of a
single character. Common real-world separators like '||', ';;', or
two spaces now work without preprocessing.

- CsvReader: delimiter field changed from u8 to []const u8; added
  partial_delim: usize to track in-progress multi-byte matches in
  the streaming state machine
- parseDelimiter: returns []const u8, rejects empty and >8-byte values
- writeField / printRow / printHeaderRow: delimiter type updated to
  []const u8; quoting detection uses std.mem.indexOf instead of
  byte-by-byte comparison
- New unit tests: 2-char (||), 3-char (;;;), partial-match false
  positive, quoted field containing multi-char delimiter
- New integration tests 95-98: double-pipe, three-char, empty-error,
  too-long-error
- README and man page updated to describe the 1-8 char constraint
- Reset partial_delim at start of nextRecord (latent correctness fix:
  avoids stale state if a previous call exited via a non-fatal error)
- Add 6 unit tests covering: empty first field, empty last field,
  only-delimiter input, EOF without newline, partial delimiter at EOF
  treated as field content, and greedy left-to-right matching behavior
@vmvarela vmvarela added type:feature New functionality priority:medium Should be done soon size:m Medium — 4 to 8 hours status:review In code review or waiting for feedback labels May 7, 2026
@vmvarela vmvarela merged commit fe2e1b1 into master May 7, 2026
5 checks passed
@vmvarela vmvarela deleted the issue-86/multi-char-delimiter branch May 7, 2026 12:24
Development

Successfully merging this pull request may close these issues.

Support multi-character input delimiters