.Net: Bug: TextChunker does not actually split on newlines #12556

Closed
@kyle-rader-msft

Description

Context

While debugging a large git diff that we build summaries of, we found that the text chunker split the output into only 2 giant chunks. We thought it was splitting the input into lines and then splitting those lines into paragraphs, but in reality we only get 2 "paragraphs" with the newline characters from the input preserved.

Describe the bug
TextChunker.SplitPlainTextLines does not actually split on newline characters, as its name implies, if the input token count is less than the maxTokenCount per line passed in.

This is quite confusing, as the name implies it will split on lines. If we use this function to get a list of input strings and then pass them to SplitPlainTextParagraphs, we don't really get what we would expect, since our input isn't nearly as split as we might think.

To Reproduce

This unit test fails:

[Theory]
[InlineData("First line\r\nSecond line\r\nThird line")]
[InlineData("First line\nSecond line\nThird line")]
public void ActuallySplitsOnNewLines(string input)
{
    var result = TextChunker.SplitPlainTextLines(input, 10);

    var expected = new[]
    {
        "First line",
        "Second line",
        "Third line"
    };

    Assert.Equal(expected, result); // ❌
}

with message:

Assert.Equal() Failure: Collections differ
                        ↓ (pos 0)
Expected: string[]     ["First line", "Second line", "Third line"]
Actual:   List<string> ["First line\nSecond line\nThird line"]
                        ↑ (pos 0)

Expected behavior

I am unsure if this is intentional, but I don't think it is, given the presence of the \r\n in the textSplitOptions, which is itself not a valid line ending.

TextChunker.SplitPlainTextLines should produce a list of strings that are split on \n characters.
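One possible workaround, until the behavior is clarified, is to split the input on line endings before handing anything to the chunker. The following is a minimal sketch that uses only string.Split from the .NET base class library. The LineSplitter class and its SplitOnNewLines method are hypothetical names introduced here for illustration; they are not part of TextChunker or Semantic Kernel.

```csharp
using System;
using System.Collections.Generic;

public static class LineSplitter
{
    // Hypothetical helper, not part of Semantic Kernel.
    // "\r\n" is listed before "\n" so that Windows line endings are consumed
    // as a single separator and no trailing '\r' is left on any line.
    public static List<string> SplitOnNewLines(string input)
    {
        var separators = new[] { "\r\n", "\n" };

        // StringSplitOptions.None keeps empty lines, so blank lines in the
        // input are preserved as empty strings in the result.
        return new List<string>(input.Split(separators, StringSplitOptions.None));
    }
}
```

Each resulting line could then be passed to TextChunker.SplitPlainTextLines individually if lines longer than the token limit still need to be broken up, assuming that is the behavior the caller wants.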

Platform

  • Language: [DotNet]

Labels: .NET, bug, memory
