refactor: align internal terminology with ubiquitous language#127
refactor: align internal terminology with ubiquitous language#127stevehansen wants to merge 1 commit into
Conversation
Internal/private identifiers and XML-doc comments now follow a single documented vocabulary (record / physical line / field / value / quoting). No public API changes — every rename is internal or doc-only. - Reader record classes: rawSplitLine->rawFields, parsedLine->parsedValues, and the private `Line` property (which returned the parsed field array) ->ParsedValues. - Writer: fix the escape-vs-quote naming (FixedEscapeChars->QuoteTriggerChars, needsGeneralEscape->needsQuoting, the wrap-the-field `escape` flag->mustQuote). Kept cell/WriteCell/WriteRow for consistency with the public writer API. - Reword misleading public XML docs (ColumnCount counts fields, Read* yields records, int indexers take a field index, WriteCell does quoting and escaping). - Add UBIQUITOUS_LANGUAGE.md: the glossary, blast-radius analysis, and a vNext rename-target list for the frozen public names. Builds on netstandard2.0/net8.0/net9.0; all 179 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request aligns the codebase and XML documentation with a newly defined 'Ubiquitous Language' glossary, renaming internal fields and updating comments to consistently use terms like 'record', 'field', and 'quoting' instead of 'line', 'cell', and 'escaping'. Feedback points out a critical inconsistency where \r is omitted from the quote-trigger characters in CsvWriter (unlike CsvBufferWriter), which could result in malformed CSVs. Additionally, the reviewer recommends caching the quote-trigger characters in a static array to avoid unnecessary per-row allocations in the writer's serialization methods.
| #if NET8_0_OR_GREATER | ||
| // The separator is per-call so it can't be baked into a single cached SearchValues. | ||
| // Keep the fixed escape chars cached and check the separator with a separate Contains. | ||
| // Keep the fixed quote-trigger chars cached and check the separator with a separate Contains. | ||
| // Without this caching, MemoryExtensions.IndexOfAny(ReadOnlySpan, ReadOnlySpan)/char[] | ||
| // builds a fresh SearchValues<char> on the heap every call. | ||
| private static readonly SearchValues<char> FixedEscapeChars = SearchValues.Create("'\n"); | ||
| private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n"); | ||
| #endif |
There was a problem hiding this comment.
There is an inconsistency in the quote-trigger characters between CsvWriter and CsvBufferWriter. CsvBufferWriter correctly includes \r (carriage return) in its QuoteTriggerChars ("'\n\r"), but CsvWriter only includes '\n in its QuoteTriggerChars and quoteTriggerChars arrays.
Omitting \r can cause fields containing carriage returns to be written without quotes, resulting in malformed CSV files when read by standard parsers.
Additionally, we can define a static FixedQuoteTriggerChars array here to avoid allocating a new char[] array on every row write in WriteLine and WriteLineAsync.
#if NET8_0_OR_GREATER
// The separator is per-call so it can't be baked into a single cached SearchValues.
// Keep the fixed quote-trigger chars cached and check the separator with a separate Contains.
// Without this caching, MemoryExtensions.IndexOfAny(ReadOnlySpan, ReadOnlySpan)/char[]
// builds a fresh SearchValues<char> on the heap every call.
private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n\r");
#endif
private static readonly char[] FixedQuoteTriggerChars = { '\'', '\n', '\r' };| var quoteTriggerChars = new[] { separator, '\'', '\n' }; | ||
| for (var i = 0; i < columnCount; i++) | ||
| { | ||
| if (i > 0) | ||
| writer.Write(separator); | ||
|
|
||
| if (i < data.Length) | ||
| { | ||
| var escape = false; | ||
| var mustQuote = false; | ||
| var cell = data[i] ?? string.Empty; | ||
| #if NET8_0_OR_GREATER | ||
| if (cell.Contains('"')) | ||
| #else | ||
| if (cell.Contains("\"")) | ||
| #endif | ||
| { | ||
| escape = true; | ||
| mustQuote = true; | ||
| cell = cell.Replace("\"", "\"\""); | ||
| } | ||
| else if (cell.IndexOfAny(escapeChars) >= 0) | ||
| escape = true; | ||
| else if (cell.IndexOfAny(quoteTriggerChars) >= 0) | ||
| mustQuote = true; |
There was a problem hiding this comment.
Allocating a new char[] array (quoteTriggerChars) on every single call to WriteLine introduces unnecessary garbage collection pressure, especially when writing large datasets.
By using the static FixedQuoteTriggerChars array and checking the dynamic separator character separately via IndexOf, we can completely eliminate this per-row allocation.
for (var i = 0; i < columnCount; i++)
{
if (i > 0)
writer.Write(separator);
if (i < data.Length)
{
var mustQuote = false;
var cell = data[i] ?? string.Empty;
#if NET8_0_OR_GREATER
if (cell.Contains('"'))
#else
if (cell.Contains("\""))
#endif
{
mustQuote = true;
cell = cell.Replace("\"", "\"\"");
}
else if (cell.IndexOf(separator) >= 0 || cell.IndexOfAny(FixedQuoteTriggerChars) >= 0)
mustQuote = true;| var quoteTriggerChars = new[] { separator, '\'', '\n' }; | ||
| for (var i = 0; i < columnCount; i++) | ||
| { |
| await writer.WriteAsync('"').ConfigureAwait(false); | ||
| } | ||
| else if (cell.IndexOfAny(escapeChars) >= 0) | ||
| else if (cell.IndexOfAny(quoteTriggerChars) >= 0) |
|
Thanks @gemini-code-assist — both points are valid, and the
This PR is a deliberately behavior-neutral terminology/rename pass, so to keep it clean I've addressed both in a focused follow-up rather than here: #128 (adds |
|
Thanks for the update, @stevehansen. That sounds like a solid plan. Keeping the terminology refactor focused while addressing the |
What
Aligns the codebase to a single documented vocabulary (record / physical line / field / value / quoting), captured in a new
UBIQUITOUS_LANGUAGE.md. This came out of a glossary pass that surfaced six pervasive terminology conflicts; this PR fixes the ones that are safe to fix now.Why
Csvis a public package with millions of downloads, so renaming a public member is a SemVer-major break. The changes here are scoped by blast radius so nothing in the public contract moves:Csv.TestsviaInternalsVisibleTo→ zero ecosystem impact)[Obsolete]forwardersChanges
Internal renames
rawSplitLine→rawFields,RawSplitLine→RawFields,parsedLine→parsedValues, and the private property literally namedLine(which returned the parsed field array) →ParsedValues.FixedEscapeChars→QuoteTriggerChars,escapeChars→quoteTriggerChars,needsGeneralEscape→needsQuoting, and the wrap-the-fieldescapeflag →mustQuote.needsQuoteEscapeis kept (it genuinely means quote-doubling).cell/WriteCell/WriteRow/WriteLineare kept for consistency with the public writer API.Doc rewording (non-breaking)
ColumnCountnow documents "number of fields in this record";ValidateColumnCountmatches "field count per row";Read*summaries say "Reads the records";intindexers andICsvLineSpan.GetSpan/GetMemory/TryGet*document a "field index";CsvBufferWriter.WriteCelldocuments "quoting and escaping".New file
UBIQUITOUS_LANGUAGE.md— the glossary, the blast-radius analysis, what changed in this pass, and a vNext rename-target table for the frozen public names (ColumnCount→FieldCount,ValidateColumnCount→ValidateFieldCount,LineHasColumn→RecordHasValue,ICsvLine→ICsvRecord).Verification
netstandard2.0/net8.0/net9.0, 0 errors.🤖 Generated with Claude Code