[QUESTION]Semantic Sentence Tokenization #1383

TheAIMagics · 2024-04-18T14:09:09Z

I'm working with a corpus that primarily consists of longer documents. I'm seeking recommendations for the most effective approach to semantically tokenize them.

Examples:

Original Text: "I like the ambiance but the food was terrible."
Desired Output: ["I like the ambiance"] ["but the food was terrible."]

Original Text: "I don't know. I like the restaurant but not the food."
Desired Output: ["I don't know."] ["I like the restaurant"] ["but not the food."]

Any suggestions or advice on how to achieve this would be greatly appreciated!


AngledLuffa · 2024-04-18T14:33:20Z

We don't have anything that explicitly does what you're looking for. You could constituency parse the sentence and take the top-level divisions, and that might do a good job, though.
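A minimal sketch of the "take the top-level divisions" idea, for illustration only. It works on any tree whose nodes have a `label` and a list of `children`; the `Node` class, the `pre` helper, and the label sets below are hypothetical stand-ins, and the Penn Treebank-style labels are an assumption about the parser's tag set. With a real parser the tree would come from its output (in Stanza this should be `sentence.constituency`, but verify that against the docs). The heuristic only splits when the top level coordinates two or more clauses, so "I like the restaurant but not the food" would likely stay in one piece.

```python
from dataclasses import dataclass, field
from typing import List

# Penn Treebank-style labels (assumed; adjust to your parser's tag set).
CLAUSE_LABELS = {"S", "SBAR", "SINV", "SQ", "SBARQ"}
PUNCT_LABELS = {".", ",", ":", "``", "''"}


@dataclass
class Node:
    """Hypothetical stand-in for a parser's tree node: a label plus children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def pre(tag: str, word: str) -> Node:
    """Build a preterminal (POS tag over a single word)."""
    return Node(tag, [Node(word)])


def leaves(node) -> List[str]:
    """Collect the words under a node, left to right."""
    if not node.children:
        return [node.label]
    return [w for child in node.children for w in leaves(child)]


def top_level_chunks(tree) -> List[str]:
    """Split a sentence at its top-level clause boundaries."""
    # Skip unary chains such as ROOT -> S to reach the first real branching.
    node = tree
    while len(node.children) == 1 and node.children[0].children:
        node = node.children[0]

    # Only split when the top level coordinates two or more clauses.
    n_clauses = sum(c.label in CLAUSE_LABELS for c in node.children)
    if n_clauses < 2:
        return [" ".join(leaves(tree))]

    chunks: List[List[str]] = []
    pending: List[str] = []  # words (e.g. "but") waiting for the next clause
    for child in node.children:
        words = leaves(child)
        if child.label in PUNCT_LABELS and chunks:
            # Glue punctuation onto the last word of the previous chunk.
            chunks[-1][-1] += "".join(words)
        elif child.label in CLAUSE_LABELS:
            chunks.append(pending + words)
            pending = []
        else:
            pending.extend(words)
    if pending:
        chunks[-1].extend(pending)
    return [" ".join(c) for c in chunks]


# Hand-built parse of "I like the ambiance but the food was terrible."
example = Node("ROOT", [Node("S", [
    Node("S", [
        Node("NP", [pre("PRP", "I")]),
        Node("VP", [pre("VBP", "like"),
                    Node("NP", [pre("DT", "the"), pre("NN", "ambiance")])]),
    ]),
    pre("CC", "but"),
    Node("S", [
        Node("NP", [pre("DT", "the"), pre("NN", "food")]),
        Node("VP", [pre("VBD", "was"), Node("ADJP", [pre("JJ", "terrible")])]),
    ]),
    pre(".", "."),
])])

print(top_level_chunks(example))
# ['I like the ambiance', 'but the food was terrible.']
```

The conjunction is attached to the clause that follows it and sentence-final punctuation to the clause that precedes it, which matches the desired output in the first example. Splits below the clause level (coordinated noun phrases, for instance) would need a deeper walk of the tree.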

TheAIMagics added the question label Apr 18, 2024
