Text containing quotes or parentheses sometimes isn't split into sentences correctly. #1026

julianpeterson1 · 2023-08-18T16:24:18Z

There are some cases where the sentence parser doesn't split text correctly when it contains quotations or parentheses:

Example 1:

Descartes famously said, "I think therefore I am." I think Descartes is wrong.

Should return an array of two sentences:

Descartes famously said, "I think therefore I am."
I think Descartes is wrong.

Instead, it returns just a single sentence. (This happens with both inline quotes and parentheses.)

Example 2:

In the case where multiple sentences exist within a set of parentheses or an inline quote, the sentence parser doesn't return the correct result:

Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.) I believe Descartes is wrong.

Should return an array of two sentences:

Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.)
I believe Descartes is wrong.

Instead, it returns the whole text as a single sentence.

Thanks! Awesome library.

spencermountain · 2023-08-22T13:52:19Z

Hey Julian - apologies for the delay, I've been off-keyboard for a week or two.

yea - I understand the frustration, I've gone back and forth on this a few times. If you have strong feelings about one style, I could be persuaded.

My concern was things like Descartes famously said "Yo!" and I agree. - I didn't want to tokenize "descartes famously said" as a full sentence. Maybe there's a good way to classify scare-quotes vs block-quotes - if it has a subj-verb-obj? I dunno.

You can see the current logic here to determine if a sentence is within a quotation - it simply uses a character-count. PR is welcome, if there's a proper definition from oxford or something. Maybe some other tokenizers have clearer opinions.

you're also welcome to swap-out a custom sentence splitter completely - I had to do it for the japanese compromise and can help you if you prefer this.
cheers

julianpeterson1 · 2023-09-19T15:45:31Z

Hey Spencer,

I think the rule is that full-stop punctuation at the end of a quotation or a set of parentheses should be considered the end of the whole sentence, unless it is followed by a coordinating conjunction such as "and", "but", or "or".

For example:

Descartes famously said "Yo!" and I agree. -- One sentence.
Descartes famously said "Yo!" but I agree. -- One sentence.
Descartes famously said "Yo!" I agree. -- Two sentences.

Without that conjunction, the splitting should consider the full stop punctuation to signify the end of the sentence.

Let me know what you think,

Julian
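The rule above can be sketched in a few lines. This is only an illustration and not how compromise splits sentences: the function name `splitAfterClosedStops`, the conjunction list, and the set of closing characters are all assumptions made for the example. It only looks at boundaries where terminal punctuation is immediately followed by a closing quote or parenthesis, and leaves every other boundary alone.

```typescript
// Hypothetical list; the comment above only names "and", "but", "or".
const COORDINATING_CONJUNCTIONS: string[] = ['and', 'but', 'or', 'nor', 'yet'];

/**
 * Splits text only where terminal punctuation is directly followed by a
 * closing quote or parenthesis, unless the next word is a coordinating
 * conjunction. Ordinary sentence boundaries are not handled here.
 */
function splitAfterClosedStops(text: string): string[] {
  // stop character, closing delimiter, whitespace, then the next word
  const pattern = /[.!?]["”»)]\s+([A-Za-z']+)/g;
  const sentences: string[] = [];
  let start = 0;
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(text)) !== null) {
    const nextWord = match[1].toLowerCase();
    if (COORDINATING_CONJUNCTIONS.indexOf(nextWord) !== -1) {
      continue; // the sentence carries on after the quote or parenthesis
    }
    // cut right after the two characters "stop + closing delimiter"
    const cut = match.index + 2;
    sentences.push(text.slice(start, cut).trim());
    start = cut;
  }

  const rest = text.slice(start).trim();
  if (rest.length > 0) {
    sentences.push(rest);
  }
  return sentences;
}
```

With the three "Yo!" examples, the "and" and "but" variants come back as one entry each, and the last variant comes back as two. Words such as "so" or "for" are left out of the list on purpose, since they often start a new sentence; whether they belong there is a judgment call.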

julianpeterson1 · 2024-01-15T16:51:06Z

Just following up on this, I think I got the rule right in the above comment. Let me know.

spaceemotion · 2024-07-14T22:01:54Z

Just stumbled over a similar issue. I want to extract all sentences from a given text, but leave dialogue out of the sentence detection.

For example, I expected to only get three sentences back for this text:

The bell above the door jingled as a gust of autumn air swept into Sweetie Pie's Bakery. The aroma of cinnamon and fresh-baked bread filled the cozy space.

"Me? Heavens, no! I've got my hands full with my own recipes. Why would I need Agatha's?"

Instead, I get each and every one, ignoring the quotes entirely.

spencermountain · 2024-07-18T13:18:34Z

hey, yeah you're right Leonie - this needs some work. Maybe we should add a .blockquote() function, that matches multi-sentence quotes, so in your case you can remove them.

Will add this feature request to the pile. I don't have a lot of time at the moment.

In the meantime, you could do something like this:

import nlp from 'compromise'

let str = `The bell above the door jingled as a gust of autumn air swept into Sweetie Pie's Bakery. The aroma of cinnamon and fresh-baked bread filled the cozy space.

"Me? Heavens, no! I've got my hands full with my own recipes. Why would I need Agatha's?"`

let doc = nlp(str)
doc.firstTerms().match('@hasQuotation').tag('QuoteStart')
doc.lastTerms().match('@hasQuotation').ifNo('QuoteStart').tag('QuoteEnd')
doc.debug()

Then remove what you'd like, using some custom for-loop.

cheers
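A rough idea of what that custom for-loop could look like, as a sketch only. It does not use the `QuoteStart`/`QuoteEnd` tags from the snippet above; instead it assumes you already have a plain array of sentence strings (for example from `doc.out('array')`) and groups every run that starts with an opening quote and ends with a closing quote. The name `mergeQuotedSentences` and the quote lists are made up for the example.

```typescript
const OPENING_QUOTES: string[] = ['"', '“', '«'];
const CLOSING_QUOTES: string[] = ['"', '”', '»'];

/**
 * Joins consecutive sentences that belong to one quotation into a single
 * entry. A quotation starts at a sentence whose first character is an
 * opening quote and ends at a sentence whose last character is a closing quote.
 */
function mergeQuotedSentences(sentences: string[]): string[] {
  const merged: string[] = [];
  let buffer: string[] = []; // sentences of the quotation currently open

  for (const sentence of sentences) {
    const trimmed = sentence.trim();
    if (trimmed.length === 0) {
      continue;
    }
    const opens = OPENING_QUOTES.indexOf(trimmed.charAt(0)) !== -1;
    const closes = CLOSING_QUOTES.indexOf(trimmed.charAt(trimmed.length - 1)) !== -1;

    if (buffer.length === 0 && !opens) {
      merged.push(trimmed); // ordinary sentence outside any quotation
      continue;
    }
    buffer.push(trimmed);
    if (closes) {
      merged.push(buffer.join(' '));
      buffer = [];
    }
  }

  // An unterminated quotation is kept rather than silently dropped.
  if (buffer.length > 0) {
    merged.push(buffer.join(' '));
  }
  return merged;
}
```

For the bakery text this yields three entries, with the whole dialogue as the last one. To drop the dialogue instead, filter out the entries that start with an opening quote.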

spaceemotion · 2024-07-18T13:21:14Z

ah perfect, will try that one out - thanks!

Here is the workaround i built that kind of works, but isn't fancy or easy to read:

/**
 * Extracts all sentences from a given text.
 *
 * Handles various cases including:
 * - Regular sentences
 * - Quoted text as single sentences
 * - Multiple sentences within quotes
 * - Sentences split across multiple lines
 */
export const extractSentences = (text: string): string[] => {
  const sentences: string[] = [];
  let currentSentence = '';
  let inQuotes = false;

  const addSentence = () => {
    const trimmed = currentSentence.trim();

    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }

    currentSentence = '';
  }

  for (let i = 0; i < text.length; i++) {
    const char = text[i];
    currentSentence += char;

    if (
      // quotes
      char === '"' || char === '“' || char === '”'
      // guillemets
      || char === '«' || char === '»'
    ) {
      inQuotes = !inQuotes;
    }

    if (!inQuotes && char === '\n') {
      addSentence();
      continue;
    }

    if (!inQuotes && (
      char === '.'
      || char === '!'
      || char === '?'
      || char === '…'
      || char === '‽'
    )) {
      if (i === text.length - 1) {
        continue;
      }

      // Only end the sentence when the punctuation is followed by whitespace
      if (text[i + 1] === ' ' || text[i + 1] === '\n') {
        addSentence();
      }
    }
  }

  // Add any remaining text as a sentence
  addSentence();

  return sentences;
};

Edit: i just replaced the logic with yours and all our tests are still passing. that's awesome.
instead of debug i just had to use return doc.out('array');

spaceemotion · 2024-07-18T17:59:41Z

@spencermountain i found a bug with your implementation/quick fix still:

“Do you think Mrs. Hargrove really found that sapphire in her scone?”

“I wouldn’t put it past her. She’s been dying to show off her treasure, hasn’t she? You can practically hear her voice echoing from the back of the shop.”

does not get recognized correctly. it kind of splits the text up until the "." of "put it past her.". A quick fix i added was to pre-split the text by new-lines and then treat each line individually, but interesting nonetheless.
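The pre-split quick fix could look roughly like this. It is a sketch under assumptions: `splitPerLine` and the `SentenceSplitter` type are names invented here, and the per-line splitter is passed in so that any implementation (the compromise-based one or a hand-written one) can be plugged in.

```typescript
// Any function that turns one line of text into sentences.
type SentenceSplitter = (line: string) => string[];

/**
 * Splits the text on line breaks first, then runs the given splitter on
 * each non-empty line, so a quotation can never span two lines.
 */
function splitPerLine(text: string, splitLine: SentenceSplitter): string[] {
  const sentences: string[] = [];

  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.length === 0) {
      continue; // skip blank lines between paragraphs
    }
    for (const sentence of splitLine(trimmed)) {
      sentences.push(sentence);
    }
  }
  return sentences;
}
```

Because every line is handled on its own, an opening quote on one line cannot be paired with a closing quote on a later line. The trade-off is that a quotation that really does run across several lines would be cut at each line break.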

spencermountain added the hmmm label Aug 22, 2023

spencermountain added the enhancement label Jul 18, 2024
