I am getting very bad responses from the LLM when chatting with the file I uploaded. I am using Qdrant as a vector DB and I checked the contents in the database and realized that all the chunks are only one sentence long.
This is because, by default, privateGPT uses SentenceWindowNodeParser to split the text in ingest_service.py:
node_parser = SentenceWindowNodeParser.from_defaults()
Instead of the default, I tried implementing SentenceSplitter like this:
node_parser = SentenceSplitter.from_defaults(chunk_size=1024, chunk_overlap=200)
I also tried implementing SemanticSplitterNodeParser like this:
Both implementations run and the embedding completes without errors, but when I check the database the file is still split by sentences, instead of into larger chunks (chunk_size=1024) in the case of SentenceSplitter, or by meaning in the case of SemanticSplitterNodeParser.
Should I make some other changes, and what am I missing? Why is splitting by sentence the default if it's so ineffective?
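For context, the difference between the two splitting strategies can be sketched in plain Python. This is a simplified, stdlib-only approximation, not the llama-index implementations: the real SentenceWindowNodeParser stores the surrounding sentences as node metadata (which is why single sentences show up as the chunk text in Qdrant), and the real SentenceSplitter counts tokens rather than characters and respects sentence boundaries.

```python
import re

def sentence_window_split(text: str):
    # Rough approximation of SentenceWindowNodeParser: each node's text
    # is a single sentence; neighboring sentences go into metadata,
    # so the vector DB shows one-sentence chunks.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_split(text: str, chunk_size: int = 1024, chunk_overlap: int = 200):
    # Rough approximation of a chunk-size splitter: fixed-size windows
    # with overlap (the real SentenceSplitter measures tokens, not
    # characters, and avoids cutting mid-sentence).
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks

doc = "First sentence. Second sentence! Third sentence?"
print(sentence_window_split(doc))   # three one-sentence nodes
print(len(chunk_split("x" * 3000)))  # a handful of ~1024-char chunks
```

The practical consequence is retrieval context: with one-sentence nodes, each embedded vector represents very little text, so unless the window metadata is re-attached at query time the LLM sees almost no surrounding context.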
I run privategpt with: PGPT_PROFILES=vllm make run
Thanks in advance for any help!
lenartgolob changed the title from "Change chunking method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser" to "Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser" on May 8, 2024.