Description
Context
While debugging summaries we build from a large git diff, we found that the text chunker split the output into only 2 giant chunks. We thought we were splitting the input into lines and then grouping those lines into paragraphs, but in reality we only get 2 "paragraphs", with the newline characters from the input preserved.
Describe the bug
TextChunker.SplitPlainTextLines does not actually produce a list split on newline characters, as its name implies, if the input token count is less than the maxTokenCount per line passed in.
This is quite confusing, since the name implies it will split on lines. If we use this function to get a list of input strings and then pass that list to SplitPlainTextParagraphs, we don't get what we would expect, because the input isn't nearly as split up as we might think.
To Reproduce
This unit test fails:
[Theory]
[InlineData("First line\r\nSecond line\r\nThird line")]
[InlineData("First line\nSecond line\nThird line")]
public void ActuallySplitsOnNewLines(string input)
{
    var result = TextChunker.SplitPlainTextLines(input, 10);
    var expected = new[]
    {
        "First line",
        "Second line",
        "Third line"
    };
    Assert.Equal(expected, result); // ❌
}
with message:
Assert.Equal() Failure: Collections differ
↓ (pos 0)
Expected: string[] ["First line", "Second line", "Third line"]
Actual: List<string> ["First line\nSecond line\nThird line"]
↑ (pos 0)
Expected behavior
I am unsure whether this is intentional, but I don't think it is, given the presence of \r\n in the textSplitOptions, which is itself not a valid line ending.
I would expect TextChunker.SplitPlainTextLines to produce a list of strings that is split on \n characters.
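As a possible workaround until this is clarified, the input can be split on line endings before it reaches the chunker. The sketch below is only an illustration: LineSplitWorkaround and SplitOnNewLines are hypothetical names, not part of TextChunker, and the sketch assumes blank lines can be dropped.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of TextChunker.
public static class LineSplitWorkaround
{
    // Splits text on \r\n, \n, or \r and drops empty entries.
    // Assumption: blank lines carry no meaning for the later chunking step.
    public static List<string> SplitOnNewLines(string text)
    {
        // "\r\n" is listed first so a Windows line ending is treated as one separator.
        var separators = new[] { "\r\n", "\n", "\r" };
        return new List<string>(
            text.Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

The resulting list could then be passed to SplitPlainTextParagraphs in place of the output of SplitPlainTextLines. Lines that exceed the per-line token limit on their own would presumably still need to go through SplitPlainTextLines individually.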
Platform
- Language: [DotNet]