Text containing quotes or parentheses sometimes isn't split into sentences correctly. #1026
Comments
Hey Julian - apologies for the delay, I've been off-keyboard for a week or two. yea - I understand the frustration, I've gone back and forth on this a few times. If you have strong feelings about one style, I could be persuaded. My concern was things like You can see the current logic here to determine if a sentence is within a quotation - it simply uses a character-count. PR is welcome, if there's a proper definition from oxford or something. Maybe some other tokenizers have clearer opinions. you're also welcome to swap-out a custom sentence splitter completely - I had to do it for the japanese compromise and can help you if you prefer this. |
Hey Spencer, I think the rule is that full-stop punctuation at the end of a quotation or a set of parentheses should be considered the end of the whole sentence unless it is followed by coordinating conjunction, such as and, but, or, etc. For example: Descartes famously said "Yo!" and I agree. -- One sentence Without that conjunction, the splitting should consider the full stop punctuation to signify the end of the sentence. Let me know what you think, Julian |
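For reference, here is a minimal sketch of the rule described above, assuming it only has to decide boundaries that sit right after a closing quote or parenthesis. `splitAfterClosers` and `CONJUNCTIONS` are hypothetical names, not part of compromise, and ordinary sentence boundaries are left to the regular splitter.

```javascript
// Hypothetical helper, not a compromise API.
// Coordinating conjunctions that keep the sentence going after a closer.
const CONJUNCTIONS = new Set(['and', 'but', 'or', 'nor', 'for', 'yet', 'so'])

// Full-stop punctuation, then a closing quote or parenthesis, then whitespace.
const CLOSER = /[.!?…]["”»)]\s+/g

function splitAfterClosers(text) {
  const sentences = []
  let start = 0
  for (const m of text.matchAll(CLOSER)) {
    const end = m.index + m[0].length
    // Look at the word that follows the closer (case-sensitive on purpose:
    // a capitalised "And" is treated as the start of a new sentence).
    const nextWord = (text.slice(end).match(/^[A-Za-z]+/) || [''])[0]
    if (CONJUNCTIONS.has(nextWord)) continue
    sentences.push(text.slice(start, end).trim())
    start = end
  }
  const rest = text.slice(start).trim()
  if (rest) sentences.push(rest)
  return sentences
}
```

With this sketch, `Descartes famously said "Yo!" and I agree.` stays one sentence, while the two examples from the issue description each come back as two. Words like "for" and "so" can also start a clause that is not coordinated, so the conjunction list is a judgment call.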
Just following up on this, I think I got the rule right in the above comment. Let me know. |
Just stumbled over a similar issue. I want to extract all sentences from a given text, but leave dialogue out of the sentence detection. For example, I expected to only get three sentences back for this text: The bell above the door jingled as a gust of autumn air swept into Sweetie Pie's Bakery. The aroma of cinnamon and fresh-baked bread filled the cozy space.
"Me? Heavens, no! I've got my hands full with my own recipes. Why would I need Agatha's?" Instead, I get each and every one, ignoring the quotes entirely. |
hey, yeah you're right Leonie - this needs some work. Maybe we should add a Will add this feature request to the pile. I don't have a lot of time at the moment. In the meantime, you could do something like this: let str = `The bell above the door jingled as a gust of autumn air swept into Sweetie Pie's Bakery. The aroma of cinnamon and fresh-baked bread filled the cozy space.
"Me? Heavens, no! I've got my hands full with my own recipes. Why would I need Agatha's?"`
let doc = nlp(str)
doc.firstTerms().match('@hasQuotation').tag('QuoteStart')
doc.lastTerms().match('@hasQuotation').ifNo('QuoteStart').tag('QuoteEnd')
doc.debug() Then remove what you'd like, using some custom for-loop. cheers |
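The "custom for-loop" step could look roughly like the sketch below. It assumes the sentences have already been pulled out as plain strings, and it merges each quoted run into a single entry instead of removing it. `mergeQuotedRuns` is a hypothetical helper that works on strings only and does not call compromise.

```javascript
// Hypothetical helper, not a compromise API: works on plain sentence strings.
const OPENS_QUOTE = /^["“«]/
const CLOSES_QUOTE = /["”»]$/

function mergeQuotedRuns(sentences) {
  const out = []
  let buffer = [] // sentences collected since the opening quote
  for (const s of sentences) {
    const t = s.trim()
    // Outside a quotation and no opening quote: keep the sentence as-is.
    if (buffer.length === 0 && !OPENS_QUOTE.test(t)) {
      out.push(t)
      continue
    }
    buffer.push(t)
    // A closing quote ends the run, so emit it as one sentence.
    if (CLOSES_QUOTE.test(t)) {
      out.push(buffer.join(' '))
      buffer = []
    }
  }
  // An unclosed quotation is emitted as one trailing entry.
  if (buffer.length > 0) out.push(buffer.join(' '))
  return out
}
```

For the bakery text above, the two narration sentences pass through unchanged and the four dialogue fragments come back as one entry. Nested or mismatched quote characters are not handled.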
ah perfect, will try that one out - thanks! Here is the workaround i built that kind of works, but isn't fancy or easy to read:
/**
 * Extracts all sentences from a given text.
 *
 * Handles various cases including:
 * - Regular sentences
 * - Quoted text as single sentences
 * - Multiple sentences within quotes
 * - Sentences split across multiple lines
 */
export const extractSentences = (text: string): string[] => {
  const sentences: string[] = [];
  let currentSentence = '';
  let inQuotes = false;
  // Push the accumulated text as a sentence (if non-empty) and reset the buffer
  const addSentence = () => {
    const trimmed = currentSentence.trim();
    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }
    currentSentence = '';
  }
  for (let i = 0; i < text.length; i++) {
    const char = text[i];
    currentSentence += char;
    if (
      // quotes
      char === '"' || char === '“' || char === '”'
      // guillemets
      || char === '«' || char === '»'
    ) {
      inQuotes = !inQuotes;
    }
    // A newline outside of quotes always ends the current sentence
    if (!inQuotes && char === '\n') {
      addSentence();
      continue;
    }
    if (!inQuotes && (
      char === '.'
      || char === '!'
      || char === '?'
      || char === '…'
      || char === '‽'
    )) {
      // The last character is handled by the final addSentence() below
      if (i === text.length - 1) {
        continue;
      }
      // Split only if the next character is a space or a newline
      if (text[i + 1] === ' ' || text[i + 1] === '\n') {
        addSentence();
      }
    }
  }
  // Add any remaining text as a sentence
  addSentence();
  return sentences;
};
Edit: i just replaced the logic with yours and all our tests are still passing. that's awesome. |
@spencermountain i found a bug with your implementation/quick fix still:
does not get recognized correctly. it kind of splits the text up until the "." of "put it past her.". A quick fix i added was to pre-split the text by new-lines and then treat each line individually, but interesting nonetheless. |
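A minimal sketch of that pre-split workaround, assuming any sentence splitter can be passed in as a function. `splitPerLine` and its `splitSentences` parameter are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: split the text on newlines first, then run the
// supplied sentence splitter on each non-empty line separately.
function splitPerLine(text, splitSentences) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .flatMap(line => splitSentences(line))
}
```

Because each line is handled on its own, a quotation left open on one line cannot swallow the lines that follow it. The trade-off is that a quotation that really does span several lines gets cut at every newline.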
There are some cases where the sentence parser doesn't parse correctly when using quotations or parentheses:
Example 1:
Descartes famously said, "I think therefore I am." I think Descartes is wrong.
Should return an array of two sentences:
Instead, it returns just a single sentence. (this is an issue with either inline quotes or parentheses).
Example 2:
In the case where multiple sentences exist within a set of parentheses or an inline quote, the sentence parser doesn't return the correct result:
Descartes famously said cool things (well, he didn't say super cool things actually. But whatever.) I believe Descartes is wrong.
Instead, it returns the whole text as a single sentence.
Thanks! Awesome library.