I am getting very bad responses from the LLM when chatting with the file I uploaded. I am using Qdrant as a vector DB and I checked the contents in the database and realized that all the chunks are only one sentence long.
This is because, by default, privateGPT uses SentenceWindowNodeParser to split the text in ingest_service.py:
node_parser = SentenceWindowNodeParser.from_defaults()
Instead of the default, I tried implementing SentenceSplitter like this:
node_parser = SentenceSplitter.from_defaults(chunk_size=1024, chunk_overlap=200)
I also tried implementing SemanticSplitterNodeParser like this:
Both implementations run and the embedding completes without errors, but when I check the database the file is still split by sentences, instead of into larger chunks (chunk_size=1024) in the case of SentenceSplitter, or by meaning in the case of SemanticSplitterNodeParser.
Should I make some other changes, and what am I missing? Why is splitting by sentence the default if it's so ineffective?
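For context, the difference between the two splitting strategies can be sketched in plain Python. This is a simplified, stdlib-only approximation, not the llama-index implementations: the real SentenceWindowNodeParser stores the surrounding sentences as node metadata (which is why single sentences show up as the chunk text in Qdrant), and the real SentenceSplitter counts tokens rather than characters and respects sentence boundaries.

```python
import re

def sentence_window_split(text: str):
    # Rough approximation of SentenceWindowNodeParser: each node's text
    # is a single sentence; neighboring sentences go into metadata,
    # so the vector DB shows one-sentence chunks.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_split(text: str, chunk_size: int = 1024, chunk_overlap: int = 200):
    # Rough approximation of a chunk-size splitter: fixed-size windows
    # with overlap (the real SentenceSplitter measures tokens, not
    # characters, and avoids cutting mid-sentence).
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks

doc = "First sentence. Second sentence! Third sentence?"
print(sentence_window_split(doc))   # three one-sentence nodes
print(len(chunk_split("x" * 3000)))  # a handful of ~1024-char chunks
```

The practical consequence is retrieval context: with one-sentence nodes, each embedded vector represents very little text, so unless the window metadata is re-attached at query time the LLM sees almost no surrounding context.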
I run privategpt with: PGPT_PROFILES=vllm make run
Thanks in advance for any help!
lenartgolob changed the title from "Change chunking method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser" to "Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser" on May 8, 2024.