Knowledge Chunking and RAG

Knowledge bases are the core input layer in GEOFlow. High-quality knowledge makes generation more stable and more grounded in real business material. Low-quality knowledge turns automation into a noise amplifier.

For a complete explanation of knowledge base principles, usage, vectorization rules, retrieval algorithms, and governance logic, start with: AI Knowledge Base Tutorial.

1. Minimum Setup

If you only want to upload files and preview chunks, an embedding model is not required.

If you want RAG retrieval during article generation, you need:

at least one working chat model
at least one working embedding model
the embedding model selected as the default embedding model
PostgreSQL pgvector available
the knowledge base re-saved, re-uploaded, or refreshed so that vectors are written
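
The checklist above can be read as a simple preflight check. The following is a minimal sketch, not GEOFlow's actual code: `ModelConfig`, `missing_rag_requirements`, and all field names are hypothetical and only mirror the five requirements listed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelConfig:
    # Hypothetical shape; GEOFlow's real model settings may look different.
    name: str
    kind: str                 # "chat" or "embedding"
    enabled: bool = True
    is_default: bool = False


def missing_rag_requirements(models: List[ModelConfig],
                             pgvector_available: bool,
                             vectors_written: bool) -> List[str]:
    """Return the unmet prerequisites for RAG retrieval, in checklist order."""
    problems = []
    chat = [m for m in models if m.kind == "chat" and m.enabled]
    embed = [m for m in models if m.kind == "embedding" and m.enabled]
    if not chat:
        problems.append("no working chat model")
    if not embed:
        problems.append("no working embedding model")
    elif not any(m.is_default for m in embed):
        problems.append("no embedding model selected as default")
    if not pgvector_available:
        problems.append("pgvector is not available in PostgreSQL")
    if not vectors_written:
        problems.append("knowledge base not re-saved or refreshed for vector writing")
    return problems
```

An empty result means every item in the checklist is satisfied; any returned string names the first thing to fix.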

2. Chunking Strategies

Knowledge chunking is configured from the AI Models page.

Strategy	Description	Best fit
Structured rule chunking	GEOFlow chunks by headings, paragraphs, length, and overlap windows	Recommended default; stable, controllable, low-cost
Automatic strategy	GEOFlow chooses a suitable strategy from configuration	Use when you want simpler setup
LLM semantic planning	A chat model plans semantic boundaries; GEOFlow rebuilds final chunks from the source text	Long documents, complex structures, or documents where semantic completeness matters
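
To make the structured rule strategy concrete, here is a minimal sketch of chunking by headings, paragraphs, length, and overlap windows. It is an illustration under assumptions, not GEOFlow's implementation: the function name `rule_chunks`, the Markdown-style `#` heading detection, and the `max_chars` and `overlap` defaults are all hypothetical.

```python
import re
from typing import List


def rule_chunks(text: str, max_chars: int = 800, overlap: int = 100) -> List[str]:
    """Split at headings, pack paragraphs up to max_chars, and cut
    oversized paragraphs into overlapping windows."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks: List[str] = []
    # Split just before each heading line so a heading stays with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    for section in sections:
        buf = ""
        for para in (p.strip() for p in re.split(r"\n\s*\n", section)):
            if not para:
                continue
            if len(para) > max_chars:
                # Flush what we have, then window the long paragraph.
                if buf:
                    chunks.append(buf)
                    buf = ""
                step = max_chars - overlap
                for start in range(0, len(para), step):
                    chunks.append(para[start:start + max_chars])
                    if start + max_chars >= len(para):
                        break
            elif buf and len(buf) + 2 + len(para) > max_chars:
                # Adding this paragraph would exceed the limit: start a new chunk.
                chunks.append(buf)
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(buf)
    return chunks
```

Because every step is deterministic string handling, this kind of strategy needs no model call, which is consistent with the "stable, controllable, low-cost" description in the table.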

3. How LLM Semantic Planning Works

Semantic planning does not rewrite the knowledge base and does not generate new knowledge.

It does one thing:

Plan chunk boundaries from the source document.

The chunks stored in the database are still rebuilt from the original text. This balances semantic completeness, cost, speed, and traceability.
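
The idea of rebuilding chunks from the original text can be sketched as follows. The plan format here is an assumption made for illustration: `anchors` is a hypothetical list of short verbatim snippets, in document order, each marking where a chunk starts. GEOFlow's real plan format is not described on this page.

```python
from typing import List


def rebuild_chunks(source: str, anchors: List[str]) -> List[str]:
    """Slice `source` at the positions where each anchor snippet begins.

    Only boundary positions come from the plan; every character of every
    chunk is copied from `source`, so nothing is rewritten or invented.
    """
    starts: List[int] = []
    cursor = 0
    for anchor in anchors:
        pos = source.find(anchor, cursor) if anchor else -1
        if pos == -1:
            raise ValueError(f"anchor not found in source: {anchor!r}")
        starts.append(pos)
        cursor = pos + len(anchor)
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first anchor
    ends = starts[1:] + [len(source)]
    return [source[s:e] for s, e in zip(starts, ends)]
```

Joining the returned chunks reproduces the source exactly, which is what keeps stored chunks traceable back to the uploaded document.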

4. Stable Fallback

If semantic planning fails, for example because:

the model times out
JSON is invalid
boundary counts are abnormal
planned boundaries cannot be mapped back to source text

GEOFlow falls back to structured rule chunking so knowledge ingestion can still complete.
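
A fallback of this kind can be sketched as a single try/except around the planning step. Everything named here is hypothetical: `plan_fn` stands in for the chat-model call and is assumed to return a JSON array of anchor snippets, `rule_fn` stands in for the rule chunker, and `max_boundaries` and the strategy labels are illustrative values.

```python
import json
from typing import Callable, List, Tuple


def chunk_with_fallback(source: str,
                        plan_fn: Callable[[str], str],
                        rule_fn: Callable[[str], List[str]],
                        max_boundaries: int = 200) -> Tuple[str, List[str]]:
    """Try semantic planning; on any listed failure, use rule chunking."""
    try:
        raw = plan_fn(source)        # may raise TimeoutError
        anchors = json.loads(raw)    # invalid JSON raises a ValueError subclass
        if not isinstance(anchors, list) or not 1 <= len(anchors) <= max_boundaries:
            raise ValueError("abnormal boundary count")
        starts, cursor = [0], 0
        for anchor in anchors:
            ok = isinstance(anchor, str) and anchor != ""
            pos = source.find(anchor, cursor) if ok else -1
            if pos == -1:
                raise ValueError("boundary cannot be mapped back to source")
            if pos > 0:
                starts.append(pos)
            cursor = pos + len(anchor)
        ends = starts[1:] + [len(source)]
        return "llm_semantic", [source[s:e] for s, e in zip(starts, ends)]
    except (TimeoutError, ValueError):
        # Ingestion still completes, just with the deterministic strategy.
        return "structured_rule", rule_fn(source)
```

Returning the strategy label alongside the chunks makes it visible afterwards which path actually produced them.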

5. Chunk Metadata

Each chunk keeps as much metadata as possible:

chunk title
section path
chunking strategy
sequence number
token / character estimate
source hash

This metadata is useful for previewing, debugging, rebuilding, and future retrieval improvements.
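
As an illustration, the fields listed above could be carried in a record like the one below. This is a sketch with hypothetical names (`ChunkMeta`, `build_meta`); it also assumes that the source hash is a SHA-256 digest of the full source document and that tokens are estimated at roughly four characters each, neither of which is confirmed by this page.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkMeta:
    title: str
    section_path: List[str]   # e.g. ["Guide", "Pricing"]
    strategy: str             # which chunking strategy produced the chunk
    sequence: int             # position of the chunk within the document
    char_count: int
    token_estimate: int
    source_hash: str


def build_meta(chunk: str, title: str, section_path: List[str],
               strategy: str, sequence: int, source: str) -> ChunkMeta:
    """Build metadata for one chunk of `source`."""
    return ChunkMeta(
        title=title,
        section_path=section_path,
        strategy=strategy,
        sequence=sequence,
        char_count=len(chunk),
        # Rough heuristic (about 4 characters per token), not a real tokenizer.
        token_estimate=max(1, len(chunk) // 4),
        # Assumption: hash the whole source so a changed document is detectable.
        source_hash=hashlib.sha256(source.encode("utf-8")).hexdigest(),
    )
```

With a hash like this, chunks whose `source_hash` no longer matches the current document can be identified as stale and rebuilt.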

6. Common Questions

Why are there chunks but zero vectors?

Usually because the embedding model is missing, disabled, failing, or not selected as the default embedding model.

Why did semantic chunking not take effect?

Check that the LLM semantic planning strategy is selected and that a working chat model is available for planning. Keep in mind that when semantic planning fails, GEOFlow falls back to structured rule chunking, so chunks can still appear even though semantic planning did not run.

Which model should be used for semantic planning?

Use a fast, low-cost chat model with enough context, such as a lightweight Gemini or OpenAI-compatible model. Boundary planning does not require the heaviest reasoning model.

Should every knowledge base use LLM semantic planning?

No. Structured rule chunking is usually enough for clean documents. Semantic planning is most valuable for long, complex documents whose meaning spans multiple sections.
