
Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser #1917

Open
lenartgolob opened this issue May 8, 2024 · 1 comment

lenartgolob commented May 8, 2024

I am getting very poor responses from the LLM when chatting with the file I uploaded. I am using Qdrant as the vector DB, and when I checked the contents of the collection I realized that all the chunks are only one sentence long.
This is because, by default, privateGPT uses SentenceWindowNodeParser to split the text in ingest_service.py:
node_parser = SentenceWindowNodeParser.from_defaults()
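
For context, SentenceWindowNodeParser emits one node per sentence and only keeps the neighbouring sentences in each node's metadata, which is why the vector store ends up with single-sentence chunks. A minimal sketch of that behaviour (assuming llama-index 0.10-style imports, run outside privateGPT):

# Minimal sketch (assumption: llama-index 0.10-style imports; run outside privateGPT).
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,                       # sentences kept on each side, stored in metadata only
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(
    [Document(text="First sentence. Second sentence. Third sentence. Fourth sentence.")]
)
print(len(nodes))                        # one node per sentence -> 4
for node in nodes:
    print(repr(node.text))               # the stored node text is a single sentence
    print(node.metadata["window"])       # the surrounding sentences live only here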

Instead of the default, I tried implementing SentenceSplitter like this:
node_parser = SentenceSplitter.from_defaults(chunk_size=1024, chunk_overlap=200)

I also tried implementing SemanticSplitterNodeParser like this:

ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

node_parser = SemanticSplitterNodeParser(buffer_size=5, embed_model=ollama_embedding)
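
One way to rule out the parser itself is to run it on the file directly, outside the privateGPT ingestion path, and look at the resulting node sizes. A minimal sketch (assuming the same local Ollama embedding model as above and a hypothetical input file):

# Minimal sketch (assumptions: llama-index 0.10-style imports, the same local
# Ollama embedding model as above, and a hypothetical input file). Multi-sentence
# nodes here would mean the parser is fine and the wiring in privateGPT is not.
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.ollama import OllamaEmbedding

embed_model = OllamaEmbedding(model_name="nomic-embed-text:latest", base_url="http://localhost:11434")
parser = SemanticSplitterNodeParser(buffer_size=5, embed_model=embed_model)

doc = Document(text=open("my_file.txt").read())  # hypothetical input file
for node in parser.get_nodes_from_documents([doc]):
    print(len(node.text), node.text[:80])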

Both implementations run and the embedding completes without errors, but when I check the database the file is still split into single sentences rather than ~1024-token chunks in the case of SentenceSplitter, or chunks grouped by meaning in the case of SemanticSplitterNodeParser.
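
To double-check what actually lands in Qdrant, the stored payloads can be inspected directly. A minimal sketch (assumptions: the qdrant-client package, a local Qdrant instance, and a placeholder collection name; the payload layout may differ):

# Minimal sketch (assumptions: qdrant-client, a local Qdrant instance, and a
# placeholder collection name; the text field inside the payload may differ).
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
points, _next_offset = client.scroll(collection_name="my_collection", limit=20, with_payload=True)
for point in points:
    payload_text = str(point.payload)    # inspect the raw payload to locate the chunk text
    print(len(payload_text), payload_text[:120])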

Should I be making some other changes, and what am I missing? Why is splitting by sentence the default if it is so ineffective?
I run privateGPT with:
PGPT_PROFILES=vllm make run
Thanks in advance for any help!

@lenartgolob lenartgolob changed the title Change chunking method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser May 8, 2024
@AlexPerkin

Try, for example:
node_parser = SentenceWindowNodeParser.from_defaults(window_size=20)
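
Note that window_size only changes how many neighbouring sentences are stored in each node's metadata; the node text saved in the vector DB is still a single sentence. In llama-index, the larger window is surfaced at query time by MetadataReplacementPostProcessor, which swaps each retrieved sentence for its stored window. A minimal sketch of that mechanism (assuming llama-index 0.10-style imports and a local Ollama embedding model; this is not privateGPT's actual query path):

# Minimal sketch (assumptions: llama-index 0.10-style imports and a local Ollama
# embedding model; not privateGPT's actual query pipeline).
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.embeddings.ollama import OllamaEmbedding

Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text:latest", base_url="http://localhost:11434"
)

parser = SentenceWindowNodeParser.from_defaults(window_size=20)
nodes = parser.get_nodes_from_documents([Document(text="...your document text...")])

index = VectorStoreIndex(nodes)
hits = index.as_retriever(similarity_top_k=2).retrieve("something mentioned in the text")
expanded = MetadataReplacementPostProcessor(target_metadata_key="window").postprocess_nodes(hits)
for hit in expanded:
    print(hit.node.get_content())  # the 20-sentence window, not just the single stored sentence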
