feat: support multi-character input delimiters #132
Merged
Extend -d/--delimiter to accept strings of 1-8 bytes instead of a single character. Common real-world separators like '||', ';;', or two spaces now work without preprocessing.

- CsvReader: delimiter field changed from u8 to []const u8; added partial_delim: usize to track in-progress multi-byte matches in the streaming state machine
- parseDelimiter: returns []const u8, rejects empty and >8-byte values
- writeField / printRow / printHeaderRow: delimiter type updated to []const u8; quoting detection uses std.mem.indexOf instead of byte-by-byte comparison
- New unit tests: 2-char (||), 3-char (;;;), partial-match false positive, quoted field containing multi-char delimiter
- New integration tests 95-98: double-pipe, three-char, empty-error, too-long-error
- README and man page updated to describe the 1-8 char constraint
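For readers skimming the commit, the validation and quoting rules described above could look roughly like the sketch below. This is not the code from the PR: the error names (`EmptyDelimiter`, `DelimiterTooLong`), the `max_delimiter_len` constant, and the `needsQuoting` helper are assumptions made for illustration; only `parseDelimiter`, the 1-8 byte limit, and the use of `std.mem.indexOf` come from the commit message.

```zig
const std = @import("std");

// Hypothetical error set; the PR does not show the real one.
const DelimiterError = error{ EmptyDelimiter, DelimiterTooLong };

// Limit taken from the commit message (1-8 bytes).
const max_delimiter_len = 8;

/// Validate a raw -d/--delimiter argument and return it unchanged.
fn parseDelimiter(raw: []const u8) DelimiterError![]const u8 {
    if (raw.len == 0) return error.EmptyDelimiter;
    if (raw.len > max_delimiter_len) return error.DelimiterTooLong;
    return raw;
}

/// Hypothetical helper: decide whether an output field must be quoted.
/// A substring search replaces the old single-byte comparison.
fn needsQuoting(field: []const u8, delimiter: []const u8) bool {
    if (std.mem.indexOf(u8, field, delimiter) != null) return true;
    // Quotes and line breaks always force quoting (an assumption of this sketch).
    return std.mem.indexOfAny(u8, field, "\"\r\n") != null;
}

test "parseDelimiter enforces the 1-8 byte range" {
    try std.testing.expectEqualStrings("||", try parseDelimiter("||"));
    try std.testing.expectError(error.EmptyDelimiter, parseDelimiter(""));
    try std.testing.expectError(error.DelimiterTooLong, parseDelimiter("123456789"));
}

test "needsQuoting looks for the whole delimiter" {
    try std.testing.expect(needsQuoting("a||b", "||"));
    try std.testing.expect(!needsQuoting("a|b", "||"));
}
```

Note that a field containing only part of the delimiter (`a|b` with `||`) does not need quoting in this sketch, which is the reason a substring search is used rather than a per-byte check.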
- Reset partial_delim at start of nextRecord (latent correctness fix: avoids stale state if a previous call exited via a non-fatal error)
- Add 6 unit tests covering: empty first field, empty last field, only-delimiter input, EOF without newline, partial delimiter at EOF treated as field content, and greedy left-to-right matching behavior
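The edge cases listed above are easier to follow with a small model of the matching loop. The sketch below is a simplified, unquoted-only stand-in for the real `CsvReader`: `splitLine` and its `out` buffer are hypothetical names, and the local `partial` counter plays the role of `partial_delim`. It assumes the strategy described in this PR, where a mismatch turns the tentatively consumed prefix into literal field content and the current byte is re-evaluated.

```zig
const std = @import("std");

/// Hypothetical helper, not the PR's CsvReader: split one unquoted line
/// into fields, writing slices of `line` into `out`. Returns the field count.
fn splitLine(line: []const u8, delimiter: []const u8, out: [][]const u8) usize {
    std.debug.assert(delimiter.len > 0);
    var count: usize = 0;
    var field_start: usize = 0;
    var partial: usize = 0; // delimiter bytes matched so far
    var i: usize = 0;
    while (i < line.len) {
        if (line[i] == delimiter[partial]) {
            // Tentatively consume this byte as part of the delimiter.
            partial += 1;
            i += 1;
            if (partial == delimiter.len) {
                // Full match: flush the field that ended before the delimiter.
                std.debug.assert(count < out.len);
                out[count] = line[field_start .. i - delimiter.len];
                count += 1;
                field_start = i;
                partial = 0;
            }
        } else if (partial > 0) {
            // Mismatch: the matched prefix stays in the field as literal
            // content; do not advance, so the current byte is re-evaluated.
            partial = 0;
        } else {
            i += 1;
        }
    }
    // A partial match left over at end of input is plain field content.
    std.debug.assert(count < out.len);
    out[count] = line[field_start..];
    return count + 1;
}

test "greedy left-to-right: aa in aaa" {
    var buf: [4][]const u8 = undefined;
    const n = splitLine("aaa", "aa", &buf);
    try std.testing.expectEqual(@as(usize, 2), n);
    try std.testing.expectEqualStrings("", buf[0]);
    try std.testing.expectEqualStrings("a", buf[1]);
}

test "partial delimiter at EOF is field content" {
    var buf: [4][]const u8 = undefined;
    const n = splitLine("a|", "||", &buf);
    try std.testing.expectEqual(@as(usize, 1), n);
    try std.testing.expectEqualStrings("a|", buf[0]);
}
```

Because the sketch handles only unquoted input, every field is a contiguous slice of the line and no allocation is needed; the real parser also has to carry this state across quoted fields and buffer boundaries.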
Summary
Closes #86.
Extends `-d`/`--delimiter` to accept strings of 1–8 bytes instead of a single character. Common real-world separators like `||`, `;;`, or two spaces now work without preprocessing.

Changes

src/csv.zig: streaming parser
- `delimiter: u8` → `delimiter: []const u8`; added `partial_delim: usize = 0` to track in-progress multi-byte matches
- `.field_start`, `.unquoted`, and `.quote_saw`: when `delimiter[0]` is seen, bytes are tentatively consumed; on a full match the field is flushed, on a mismatch the prefix is emitted as literal field content and the current byte is re-evaluated
- `partial_delim` is reset at the top of `nextRecord` to guard against stale state after a non-fatal I/O error

src/main.zig: CLI and output
- `parseDelimiter`: now returns `[]const u8`; accepts 1–8 bytes, rejects empty and `len > 8`
- `delimiter: u8` fields updated across `ParsedArgs`, `ColumnsArgs`, `ValidateArgs`, `SampleArgs`
- `writeField` / `printRow` / `printHeaderRow`: delimiter type updated to `[]const u8`; quoting detection uses `std.mem.indexOf` instead of char-by-char comparison

build.zig
- Integration tests 95–98: `||` delimiter, `;;;` delimiter, empty delimiter error (exit 1), delimiter > 8 chars error (exit 1)

Docs
- `README.md` and `docs/sql-pipe.1.scd`: updated to reflect the 1–8 character range, added multi-char examples

Behavior notes
- Single-character delimiters: `delimiter.len == 1` avoids partial-match overhead
- Matching is greedy and left-to-right; overlapping cases (e.g. `aa` in `aaa`) follow this documented strategy
- `--tsv` remains a shorthand for `--delimiter $'\t'` (single char, unchanged)
- `--sample` output uses the same delimiter as input (consistent with existing single-char behavior)

Testing