Skip to content

.Net: Fix TextChunker.SplitPlainTextParagraphs to handle embedded newlines in input strings #12558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

shethaadit
Copy link
Contributor

@shethaadit shethaadit commented Jun 20, 2025

Description

Summary

Fixes issue #12556 where TextChunker.SplitPlainTextParagraphs does not properly handle embedded newlines in input strings.

Problem

The SplitPlainTextParagraphs method had two issues:

  1. Incorrect separator: Used "\n\r" (LF+CR) which is not a standard line ending format - should be "\r\n" (CR+LF) for Windows or "\n" for Unix
  2. No embedded newline handling: When input strings contained embedded newlines, they were not split into separate lines for processing

This caused the method to process text with embedded newlines as single units instead of handling each line separately.

Solution

  • Modified s_plaintextSplitOptions array to use "\n" as the separator for proper newline recognition
  • Modified SplitPlainTextParagraphs to use SelectMany with Split('\n') to handle embedded newlines
  • Added normalization of all newline formats (\r\n, \r, \n) to ensure consistent handling
  • Lines are split before processing but may be recombined based on token limits (expected behavior)

Changes

  • Modified: s_plaintextSplitOptions array to use correct newline separator
  • Modified: SplitPlainTextParagraphs method to split embedded newlines before processing
  • Preserved: Existing paragraph grouping behavior based on token limits

Testing

  • ✅ Fixes handling of embedded newlines in input strings
  • ✅ All existing tests continue to pass, including CanSplitTextParagraphsOnNewlines
  • ✅ Maintains backward compatibility for paragraph splitting behavior

@shethaadit shethaadit requested a review from a team as a code owner June 20, 2025 23:46
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel labels Jun 20, 2025
@github-actions github-actions bot changed the title Fix TextChunker.SplitPlainTextLines to actually split on newlines regardless of token count .Net: Fix TextChunker.SplitPlainTextLines to actually split on newlines regardless of token count Jun 20, 2025
Co-authored-by: Kyle Rader <126627085+kyle-rader-msft@users.noreply.github.com>
@westey-m
Copy link
Contributor

@shethaadit, thanks for the contribution. Looks like there is a build failure. Can you take a look?

@shethaadit
Copy link
Contributor Author

@shethaadit, thanks for the contribution. Looks like there is a build failure. Can you take a look?

Hi @westey-m, I just updated the code and seems the build is passing locally.

image

@shethaadit shethaadit changed the title .Net: Fix TextChunker.SplitPlainTextLines to actually split on newlines regardless of token count .Net: Fix TextChunker.SplitPlainTextParagraphs to handle embedded newlines in input strings Jun 24, 2025
@markwallace-microsoft markwallace-microsoft added this pull request to the merge queue Jun 25, 2025
Merged via the queue into microsoft:main with commit 55132da Jun 25, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernel Issues or pull requests impacting the core kernel .NET Issue or Pull requests regarding .NET code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants