
refactor: align internal terminology with ubiquitous language #127

Open
stevehansen wants to merge 1 commit into master from refactor/align-internal-terminology

@stevehansen
Owner

What

Aligns the codebase to a single documented vocabulary (record / physical line / field / value / quoting), captured in a new UBIQUITOUS_LANGUAGE.md. This came out of a glossary pass that surfaced six pervasive terminology conflicts; this PR fixes the ones that are safe to fix now.

Why

Csv is a public package with millions of downloads, so renaming a public member is a SemVer-major break. The changes here are scoped by blast radius so nothing in the public contract moves:

| Bucket | Action |
| --- | --- |
| 🟢 Internal / private identifiers | Renamed (visible only to Csv.Tests via InternalsVisibleTo → zero ecosystem impact) |
| 🟡 Public XML-doc text | Reworded to canonical terms (not part of the binary/source contract) |
| 🔴 Public member names | Untouched; deferred to a future vNext behind [Obsolete] forwarders |

Changes

Internal renames

  • Reader record classes: rawSplitLine → rawFields, RawSplitLine → RawFields, parsedLine → parsedValues, and the private property literally named Line (which returned the parsed field array) → ParsedValues.
  • Writer escape-vs-quote fix: FixedEscapeChars → QuoteTriggerChars, escapeChars → quoteTriggerChars, needsGeneralEscape → needsQuoting, and the wrap-the-field escape flag → mustQuote. needsQuoteEscape is kept (it genuinely means quote-doubling). cell/WriteCell/WriteRow/WriteLine are kept for consistency with the public writer API.

Doc rewording (non-breaking)

  • ColumnCount now documents "number of fields in this record"; ValidateColumnCount matches "field count per row"; Read* summaries say "Reads the records"; int indexers and ICsvLineSpan.GetSpan/GetMemory/TryGet* document a "field index"; CsvBufferWriter.WriteCell documents "quoting and escaping".

New file

  • UBIQUITOUS_LANGUAGE.md: the glossary, the blast-radius analysis, what changed in this pass, and a vNext rename-target table for the frozen public names (ColumnCount → FieldCount, ValidateColumnCount → ValidateFieldCount, LineHasColumn → RecordHasValue, ICsvLine → ICsvRecord).
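
For readers unfamiliar with the "[Obsolete] forwarder" idea mentioned for vNext, here is a minimal sketch of what one rename from the target table (ColumnCount → FieldCount) could look like. The SampleRecord class and the member bodies are hypothetical and are not part of this PR or of the Csv package; only the names ICsvRecord, FieldCount, and ColumnCount come from the table above.

```csharp
using System;

// Hypothetical vNext shape. The names follow the rename-target table;
// everything else here is an illustrative assumption.
public interface ICsvRecord
{
    /// <summary>Gets the number of fields in this record.</summary>
    int FieldCount { get; }
}

public sealed class SampleRecord : ICsvRecord
{
    private readonly string[] parsedValues;

    public SampleRecord(string[] parsedValues) => this.parsedValues = parsedValues;

    // New canonical name.
    public int FieldCount => parsedValues.Length;

    // Old name kept as a forwarder: existing callers still compile,
    // but get a CS0618 warning pointing them at the new member.
    [Obsolete("Use FieldCount instead.")]
    public int ColumnCount => FieldCount;
}
```

With a forwarder like this, the old member can stay for one major version and be removed later, so the rename does not break callers at the moment it is introduced.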

Verification

  • Builds on netstandard2.0 / net8.0 / net9.0, 0 errors.
  • All 179 tests pass.
  • No public API surface changed (renames are internal/private; only XML-doc text and internal identifiers were touched).

🤖 Generated with Claude Code

Internal/private identifiers and XML-doc comments now follow a single
documented vocabulary (record / physical line / field / value / quoting).
No public API changes — every rename is internal or doc-only.

- Reader record classes: rawSplitLine->rawFields, parsedLine->parsedValues,
  and the private `Line` property (which returned the parsed field array)
  ->ParsedValues.
- Writer: fix the escape-vs-quote naming (FixedEscapeChars->QuoteTriggerChars,
  needsGeneralEscape->needsQuoting, the wrap-the-field `escape` flag->mustQuote).
  Kept cell/WriteCell/WriteRow for consistency with the public writer API.
- Reword misleading public XML docs (ColumnCount counts fields, Read* yields
  records, int indexers take a field index, WriteCell does quoting and escaping).
- Add UBIQUITOUS_LANGUAGE.md: the glossary, blast-radius analysis, and a vNext
  rename-target list for the frozen public names.

Builds on netstandard2.0/net8.0/net9.0; all 179 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request aligns the codebase and XML documentation with a newly defined 'Ubiquitous Language' glossary, renaming internal fields and updating comments to consistently use terms like 'record', 'field', and 'quoting' instead of 'line', 'cell', and 'escaping'. Feedback points out a critical inconsistency where \r is omitted from the quote-trigger characters in CsvWriter (unlike CsvBufferWriter), which could result in malformed CSVs. Additionally, the reviewer recommends caching the quote-trigger characters in a static array to avoid unnecessary per-row allocations in the writer's serialization methods.

Comment thread Csv/CsvWriter.cs
Comment on lines 24 to 30
  #if NET8_0_OR_GREATER
          // The separator is per-call so it can't be baked into a single cached SearchValues.
-         // Keep the fixed escape chars cached and check the separator with a separate Contains.
+         // Keep the fixed quote-trigger chars cached and check the separator with a separate Contains.
          // Without this caching, MemoryExtensions.IndexOfAny(ReadOnlySpan, ReadOnlySpan)/char[]
          // builds a fresh SearchValues<char> on the heap every call.
-         private static readonly SearchValues<char> FixedEscapeChars = SearchValues.Create("'\n");
+         private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n");
  #endif

high

There is an inconsistency in the quote-trigger characters between CsvWriter and CsvBufferWriter. CsvBufferWriter correctly includes \r (carriage return) in its QuoteTriggerChars ("'\n\r"), but CsvWriter includes only ' and \n in its QuoteTriggerChars, and only the separator, ', and \n in its per-call quoteTriggerChars arrays.

Omitting \r can cause fields containing carriage returns to be written without quotes, resulting in malformed CSV files when read by standard parsers.

Additionally, we can define a static FixedQuoteTriggerChars array here to avoid allocating a new char[] array on every row write in WriteLine and WriteLineAsync.

#if NET8_0_OR_GREATER
        // The separator is per-call so it can't be baked into a single cached SearchValues.
        // Keep the fixed quote-trigger chars cached and check the separator with a separate Contains.
        // Without this caching, MemoryExtensions.IndexOfAny(ReadOnlySpan, ReadOnlySpan)/char[]
        // builds a fresh SearchValues<char> on the heap every call.
        private static readonly SearchValues<char> QuoteTriggerChars = SearchValues.Create("'\n\r");
#endif
        private static readonly char[] FixedQuoteTriggerChars = { '\'', '\n', '\r' };

Comment thread Csv/CsvWriter.cs
Comment on lines +471 to +491
+             var quoteTriggerChars = new[] { separator, '\'', '\n' };
              for (var i = 0; i < columnCount; i++)
              {
                  if (i > 0)
                      writer.Write(separator);

                  if (i < data.Length)
                  {
-                     var escape = false;
+                     var mustQuote = false;
                      var cell = data[i] ?? string.Empty;
  #if NET8_0_OR_GREATER
                      if (cell.Contains('"'))
  #else
                      if (cell.Contains("\""))
  #endif
                      {
-                         escape = true;
+                         mustQuote = true;
                          cell = cell.Replace("\"", "\"\"");
                      }
-                     else if (cell.IndexOfAny(escapeChars) >= 0)
-                         escape = true;
+                     else if (cell.IndexOfAny(quoteTriggerChars) >= 0)
+                         mustQuote = true;

medium

Allocating a new char[] array (quoteTriggerChars) on every single call to WriteLine introduces unnecessary garbage collection pressure, especially when writing large datasets.

By using the static FixedQuoteTriggerChars array and checking the dynamic separator character separately via IndexOf, we can completely eliminate this per-row allocation.

            for (var i = 0; i < columnCount; i++)
            {
                if (i > 0)
                    writer.Write(separator);

                if (i < data.Length)
                {
                    var mustQuote = false;
                    var cell = data[i] ?? string.Empty;
#if NET8_0_OR_GREATER
                    if (cell.Contains('"'))
#else
                    if (cell.Contains("\""))
#endif
                    {
                        mustQuote = true;
                        cell = cell.Replace("\"", "\"\"");
                    }
                    else if (cell.IndexOf(separator) >= 0 || cell.IndexOfAny(FixedQuoteTriggerChars) >= 0)
                        mustQuote = true;

Comment thread Csv/CsvWriter.cs
Comment on lines +506 to 508
+             var quoteTriggerChars = new[] { separator, '\'', '\n' };
              for (var i = 0; i < columnCount; i++)
              {

medium

Similar to WriteLine, we can avoid allocating the quoteTriggerChars array on every call to WriteLineAsync by removing it and using FixedQuoteTriggerChars instead.

            for (var i = 0; i < columnCount; i++)
            {

Comment thread Csv/CsvWriter.cs
                          await writer.WriteAsync('"').ConfigureAwait(false);
                      }
-                     else if (cell.IndexOfAny(escapeChars) >= 0)
+                     else if (cell.IndexOfAny(quoteTriggerChars) >= 0)

medium

Update the check in WriteLineAsync to use FixedQuoteTriggerChars and check the separator separately, avoiding the allocated array.

                    else if (cell.IndexOf(separator) >= 0 || cell.IndexOfAny(FixedQuoteTriggerChars) >= 0)

@stevehansen
Owner Author

Thanks @gemini-code-assist — both points are valid, and the \r one is a genuine pre-existing bug:

  • \r missing from CsvWriter's quote triggers (HIGH): confirmed. CsvWriter triggered on ", the separator, ', and \n but not \r, while CsvBufferWriter already included it — so a field like a\rb was written unquoted and is malformed for strict RFC-4180 parsers.
  • Per-row char[] allocation (MEDIUM): agreed, worth hoisting to a static array + a separate separator check.

This PR is a deliberately behavior-neutral terminology/rename pass, so to keep it clean I've addressed both in a focused follow-up rather than here: #128 (adds \r across all CsvWriter paths — sync, async, and the ReadOnlyMemory<char> paths — removes the allocation, and adds regression tests + a CHANGELOG entry). This PR stays terminology-only.
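
To make the two review points concrete, here is a minimal, self-contained sketch of the quoting decision with \r included and without a per-row array allocation. It is not the code in #128 or in CsvWriter; QuotingSketch and QuoteIfNeeded are hypothetical names, and only FixedQuoteTriggerChars and the trigger set mirror the review suggestion above.

```csharp
using System;

public static class QuotingSketch
{
    // Fixed quote triggers, allocated once for the process. '\r' is included
    // so that a bare carriage return inside a field forces quoting.
    private static readonly char[] FixedQuoteTriggerChars = { '\'', '\n', '\r' };

    // Returns the field as it would be written: wrapped in double quotes when
    // it contains a double quote, the separator, or a fixed trigger character.
    public static string QuoteIfNeeded(string field, char separator)
    {
        field = field ?? string.Empty;

        // Embedded double quotes are doubled and the field is always quoted.
        if (field.IndexOf('"') >= 0)
            return "\"" + field.Replace("\"", "\"\"") + "\"";

        // The separator varies per call, so it is checked separately instead
        // of being copied into a freshly allocated array on every row.
        var mustQuote = field.IndexOf(separator) >= 0
                        || field.IndexOfAny(FixedQuoteTriggerChars) >= 0;

        return mustQuote ? "\"" + field + "\"" : field;
    }
}
```

Under these assumptions, QuotingSketch.QuoteIfNeeded("a\rb", ',') wraps the field in quotes, whereas a trigger set without '\r' would return it unquoted, which is the malformed case described above.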

@gemini-code-assist

Thanks for the update, @stevehansen. That sounds like a solid plan. Keeping the terminology refactor focused while addressing the \r bug and allocation improvements in a separate, dedicated PR (#128) is definitely the right approach to ensure both are handled cleanly and effectively. I look forward to reviewing the follow-up.
