AI Knowledge Base Tutorial

yaojingang edited this page Jun 2, 2026 · 1 revision

The AI knowledge base is GEOFlow's fact layer and retrieval layer. It is not a magic button that makes the model smarter after files are uploaded. It is a way to turn scattered material into searchable, citable, and governable knowledge assets.

For GEO content engineering, knowledge quality determines factual accuracy, reuse quality, and long-term maintainability. A strong knowledge base is the foundation of disciplined GEO work and is worth sustained investment.

1. What It Solves

GEOFlow's AI knowledge base is designed to:

  • Turn documents, webpages, business material, and team experience into structured knowledge.
  • Retrieve evidence by title, keywords, section paths, and semantic vectors during task generation.
  • Reduce unsupported model output by grounding generation in real source material.
  • Preserve governance metadata such as source, date, business line, risk level, and review status.
  • Help generated content cite concrete evidence instead of relying only on model memory.

Vectorization is only one part of RAG. Retrieval quality also depends on chunking, metadata, hybrid retrieval, conflict handling, review governance, and evidence citation.

2. Basic Workflow

A complete knowledge ingestion and retrieval flow usually looks like this:

  1. Collect material: paste text, upload files in batches, or generate knowledge from a URL.
  2. Clean the body: normalize encoding, line breaks, whitespace, and body structure.
  3. Create structured chunks: split by headings, sections, paragraphs, lists, tables, quotes, and length.
  4. Store metadata: keep source, date, business line, risk level, review status, and section path.
  5. Vectorize chunks: use the default embedding model to write real vectors.
  6. Retrieve evidence: use a hybrid of keywords, headings, sections, and vector similarity.
  7. Inject evidence: pass retrieved evidence into the generation prompt and require evidence IDs when facts are used.
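
Step 2 of the flow above can be illustrated with a minimal sketch. The function name `clean_body` and the specific rules (UTF-8 decoding, NFC normalization, collapsing runs of blank lines) are assumptions chosen for illustration; the source does not describe GEOFlow's actual cleaning rules.

```python
import re
import unicodedata


def clean_body(raw: bytes) -> str:
    """Normalize encoding, line breaks, and whitespace (illustrative rules only)."""
    # Decode as UTF-8 and replace undecodable bytes instead of failing.
    text = raw.decode("utf-8", errors="replace")
    # Drop a leading byte-order mark if the file had one.
    text = text.lstrip("\ufeff")
    # Use one canonical Unicode form so identical-looking strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Unify Windows and old Mac line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove trailing whitespace on every line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Cleaning before chunking matters because later steps rely on blank lines and headings to find block boundaries.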

3. How To Use It

Recommended path:

  1. Configure at least one working chat model in the AI Configurator.
  2. Configure an embedding model in AI Models and set it as the default embedding model.
  3. Open Materials -> AI Knowledge Base Hub.
  4. Click New Knowledge Base, then upload documents or paste text.
  5. Fill in the knowledge base name, source, business line, effective date, risk level, and review status.
  6. Choose Save and Generate Chunks, or save first and run Refresh Chunks from the list.
  7. Select the knowledge base when creating or editing a task.
  8. After generation, review whether facts, citations, and source evidence match.

If you only want to organize material first, an embedding model is not required. GEOFlow can still create chunks. A default embedding model is required when you need RAG retrieval with real semantic vectors.

4. Chunking Rules

GEOFlow currently combines three mechanisms: structured rule chunking as the default, optional LLM semantic planning, and a stable fallback to rule chunking when planning fails.

Structured Rule Chunking

This is the recommended default. It is stable, controllable, and low-cost.

It detects:

  • Markdown headings and section levels.
  • Paragraphs, lists, tables, quotes, and code blocks.
  • Oversized text blocks that need length-based splitting.
  • Current section paths and inferred chunk titles.

The system preserves source structure as much as possible and does not ask the model to rewrite knowledge content.
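
As a rough illustration of rule chunking, the sketch below splits Markdown by headings, tracks the current section path, and applies length-based splitting to oversized bodies. The names `Chunk` and `chunk_markdown` and the `max_chars` default are hypothetical, and the sketch ignores lists, tables, quotes, and code blocks, which GEOFlow is described as detecting.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    section_path: str  # e.g. "Pricing > Refunds"
    title: str         # inferred from the nearest heading
    content: str       # copied from the source, never rewritten


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, max_chars: int = 800) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: list[str] = []    # stack of headings above the current position
    buffer: list[str] = []  # body lines collected under the current heading

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        title = path[-1] if path else "Untitled"
        # Length-based splitting for oversized text blocks.
        for start in range(0, len(body), max_chars):
            piece = body[start:start + max_chars]
            chunks.append(Chunk(" > ".join(path), title, piece))

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or a deeper level
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

A production splitter would cut at paragraph or sentence boundaries rather than at a fixed character offset, and would skip `#` lines inside fenced code.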

LLM Semantic Planning

Semantic planning asks a chat model to decide which source blocks should belong to each chunk. It does not generate new knowledge and does not rewrite the source text.

The final stored chunks are still rebuilt from the original source by GEOFlow. This keeps semantic completeness, cost, speed, and traceability in balance.

Stable Fallback

If semantic planning fails because of timeout, invalid JSON, abnormal boundaries, or failed source mapping, GEOFlow falls back to structured rule chunking so ingestion can still finish.
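
The plan-then-rebuild idea and its fallback can be sketched as follows. The plan format (a JSON list of lists of source block indices) and the names `plan_to_groups`, `rule_chunks`, and `chunk_with_fallback` are assumptions for illustration; the source only states that the model plans boundaries, that chunks are rebuilt from the original text, and that failures fall back to rule chunking.

```python
import json
from typing import Callable


def rule_chunks(blocks: list[str]) -> list[list[int]]:
    """Stand-in for rule chunking: one chunk per source block."""
    return [[i] for i in range(len(blocks))]


def plan_to_groups(plan_json: str, block_count: int) -> list[list[int]]:
    """Validate a plan that assigns every source block to exactly one chunk, in order."""
    groups = json.loads(plan_json)  # invalid JSON raises a ValueError subclass
    if not isinstance(groups, list) or not all(isinstance(g, list) and g for g in groups):
        raise ValueError("plan must be a list of non-empty lists")
    flat = [index for group in groups for index in group]
    if flat != list(range(block_count)):
        raise ValueError("abnormal boundaries or failed source mapping")
    return groups


def chunk_with_fallback(
    blocks: list[str], request_plan: Callable[[list[str]], str]
) -> tuple[list[str], str]:
    try:
        groups = plan_to_groups(request_plan(blocks), len(blocks))
        strategy = "semantic"
    except (ValueError, TypeError, TimeoutError):
        groups = rule_chunks(blocks)
        strategy = "rule_fallback"
    # Chunk text is always rebuilt from the original blocks, never from model output.
    return ["\n\n".join(blocks[i] for i in group) for group in groups], strategy
```

Because the model returns only indices, a bad plan can affect grouping but cannot alter the stored wording.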

5. Vectorization Rules

Vectorization converts each knowledge chunk into an embedding vector for semantic similarity retrieval.

Current rules:

  • GEOFlow prefers the default embedding model configured in the admin panel.
  • It supports OpenAI-compatible embedding endpoints and native Gemini embedding endpoints.
  • Additional embedding services can be connected through the unified model adapter, including compatible endpoints such as Volcengine / Doubao and Zhipu.
  • When real vectors are written, GEOFlow stores the model ID, vector dimensions, provider, and pgvector field.
  • During ordinary save flows, if the embedding service is temporarily unavailable, GEOFlow keeps the knowledge base and chunks instead of blocking ingestion.
  • During manual Refresh Chunks with real vector refresh, failures are reported clearly and are not silently treated as success.
  • For providers with strict batch-size limits, GEOFlow uses conservative batching and single-request fallback to reduce 400-level parameter errors.
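
Conservative batching with a single-request fallback might look like the sketch below. The `embed` callable stands in for whatever provider client is configured, and `embed_in_batches` and the default batch size are hypothetical; real providers differ in limits and error types.

```python
from typing import Callable, Optional, Sequence

Vector = list[float]
EmbedFn = Callable[[Sequence[str]], list[Vector]]


def embed_in_batches(
    texts: Sequence[str], embed: EmbedFn, batch_size: int = 8
) -> list[Optional[Vector]]:
    """Embed texts in small batches; retry a failed batch one text at a time."""
    results: list[Optional[Vector]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            vectors = embed(batch)
            if len(vectors) != len(batch):
                raise ValueError("provider returned an unexpected number of vectors")
            results.extend(vectors)
        except Exception:
            # Single-request fallback: isolate the failing text.
            for text in batch:
                try:
                    results.append(embed([text])[0])
                except Exception:
                    # Keep the chunk without a vector instead of blocking ingestion.
                    results.append(None)
    return results
```

Returning `None` for a failed item mirrors the ordinary save flow, where chunks are kept even when vectors cannot be written. A manual refresh would instead surface these failures to the user.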

Vectorization alone does not guarantee correctness. It only improves semantic similarity. Reliable answers still depend on source quality, chunk quality, metadata, and evidence-aware generation.

6. Retrieval Algorithm

GEOFlow uses hybrid retrieval instead of vector-only search.

The current score combines:

  • Vector similarity.
  • Keyword matching.
  • Title and section-path matching.
  • Metadata quality, including source, business line, effective date, risk level, and review status.

Vector similarity and keyword matching carry most of the weight, and titles, section paths, and metadata contribute smaller amounts. Combining these signals helps avoid the weaknesses of both pure vector retrieval and pure keyword retrieval.
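
A weighted sum is one simple way to combine these signals. The weights below are placeholders invented for this sketch; the source does not publish GEOFlow's actual values, only that vector similarity and keyword matching dominate.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    vector: float         # semantic similarity, scaled to 0..1
    keyword: float        # keyword match strength, 0..1
    title_section: float  # title and section-path match, 0..1
    metadata: float       # metadata quality, 0..1


# Hypothetical weights: vector and keyword signals dominate, the rest adjust.
WEIGHTS = {"vector": 0.45, "keyword": 0.35, "title_section": 0.12, "metadata": 0.08}


def hybrid_score(signals: Signals) -> float:
    """Weighted sum of the signals, each clamped to the 0..1 range."""
    return sum(
        weight * min(max(getattr(signals, name), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
```

With these placeholder weights, a chunk with moderate vector similarity but strong keyword, title, and metadata matches can outrank a chunk that is only semantically close.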

For large knowledge bases, GEOFlow avoids loading every chunk into PHP for every query:

  • Small knowledge bases still use full local scoring for simplicity.
  • Large knowledge bases use pgvector top candidates when available.
  • If pgvector is unavailable, GEOFlow uses keyword prefiltering across titles, section paths, and content.
  • Prefiltered candidates then enter the normal hybrid scoring flow.
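
The selection logic in the list above can be sketched as a three-way branch. The threshold, the limit, the dictionary keys, and the function names are assumptions; `pgvector_top` stands in for the rows a pgvector nearest-neighbor query would return.

```python
from typing import Optional


def keyword_prefilter(chunks: list[dict], terms: list[str], limit: int) -> list[dict]:
    """Keep chunks whose title, section path, or content mentions a query term."""
    def hits(chunk: dict) -> int:
        haystack = " ".join(
            (chunk["title"], chunk["section_path"], chunk["content"])
        ).lower()
        return sum(term.lower() in haystack for term in terms)

    scored = [(hits(chunk), chunk) for chunk in chunks]
    matched = [pair for pair in scored if pair[0] > 0]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in matched[:limit]]


def select_candidates(
    chunks: list[dict],
    terms: list[str],
    pgvector_top: Optional[list[dict]] = None,
    small_threshold: int = 500,
    limit: int = 100,
) -> list[dict]:
    if len(chunks) <= small_threshold:
        return chunks                   # small base: score everything locally
    if pgvector_top is not None:
        return pgvector_top[:limit]     # large base with pgvector available
    return keyword_prefilter(chunks, terms, limit)  # large base without pgvector
```

Whatever this function returns would then go through the normal hybrid scoring flow.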

7. Conflict Handling And Governance

Real business material often has old and new versions of the same topic. GEOFlow performs lightweight conflict merging for same-topic evidence.

The topic key is primarily derived from:

  • Section path.
  • Chunk title.

GEOFlow treats same-topic material as a version conflict only when effective dates differ. This avoids incorrectly merging multiple chunks from the same section.

When a conflict is detected, GEOFlow prefers:

  1. Reviewed evidence.
  2. Newer evidence.
  3. Lower-risk evidence.
  4. Higher retrieval score.

Material that is both high-risk and unreviewed is excluded from the evidence context. When conflict merging happens, the evidence context includes a note that newer or reviewed material was preferred.
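
The preference order can be read as a lexicographic comparison, sketched below. This is one interpretation: the `Evidence` fields, the numeric risk scale, and the choice to keep a single winner per conflicting topic are assumptions, not a description of GEOFlow's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    section_path: str
    title: str
    effective_date: date
    reviewed: bool
    risk: int      # assumed scale: 0 = low, 1 = medium, 2 = high
    score: float   # hybrid retrieval score


def preference(e: Evidence) -> tuple:
    # Reviewed first, then newer, then lower risk, then higher score.
    return (e.reviewed, e.effective_date, -e.risk, e.score)


def merge_conflicts(items: list[Evidence]) -> list[Evidence]:
    # Exclude material that is both high-risk and unreviewed.
    eligible = [e for e in items if not (e.risk == 2 and not e.reviewed)]
    by_topic: dict[tuple[str, str], list[Evidence]] = {}
    for e in eligible:
        by_topic.setdefault((e.section_path, e.title), []).append(e)
    kept: list[Evidence] = []
    for group in by_topic.values():
        if len({e.effective_date for e in group}) > 1:
            kept.append(max(group, key=preference))  # version conflict: keep one
        else:
            kept.extend(group)  # same date: sibling chunks, keep all
    return kept
```

Under this reading, an older reviewed chunk beats a newer unreviewed one, because review status is compared before the date.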

8. Evidence Citation

GEOFlow formats retrieved chunks as knowledge evidence and injects them into the generation prompt.

Each evidence item can include:

  • Evidence ID, such as K1 or K2.
  • Title.
  • Section.
  • Source.
  • Link.
  • Date.
  • Business line.
  • Risk and review status.
  • Content excerpt.

When generated content uses facts, data, or business judgments from the knowledge base, GEOFlow asks the model to cite evidence IDs such as [K1]. If evidence is insufficient, the model should use cautious wording and avoid inventing sources or conclusions.
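
A minimal sketch of evidence formatting and a post-generation citation check follows. The layout of each evidence block, the dictionary keys, and the function names are hypothetical; only the `K1`-style IDs and the `[K1]` citation form come from the text above.

```python
import re


def format_evidence(items: list[dict]) -> str:
    """Render retrieved chunks as numbered evidence for the generation prompt."""
    blocks = []
    for number, item in enumerate(items, start=1):
        meta = " | ".join(
            f"{label}: {item[key]}"
            for label, key in (("Section", "section"), ("Source", "source"), ("Date", "date"))
            if item.get(key)
        )
        blocks.append(f"[K{number}] {item['title']}\n{meta}\n{item['excerpt']}")
    return "\n\n".join(blocks)


def unknown_citations(generated: str, evidence_count: int) -> set[str]:
    """Return cited evidence IDs that were never supplied to the model."""
    cited = {int(n) for n in re.findall(r"\[K(\d+)\]", generated)}
    return {f"K{n}" for n in cited if not 1 <= n <= evidence_count}
```

A check like `unknown_citations` can support the manual review step by flagging citations that point at evidence the model was never given.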

9. Current Advantages

Compared with a simple "chunk documents and run vector search" RAG system, GEOFlow has several advantages:

  • Stable: default rule chunking does not depend on an external model.
  • Controllable: LLMs plan boundaries only; they do not rewrite source knowledge.
  • Traceable: chunks preserve titles, sections, sources, dates, and source hashes.
  • Better for Chinese business material: keyword retrieval includes Chinese n-gram terms.
  • More robust retrieval: vector similarity, keywords, title/section matches, and metadata are combined.
  • Governable: high-risk unreviewed material is excluded, while reviewed and low-risk material receives higher weight.
  • Version-aware: same-topic evidence prefers newer or reviewed versions.
  • Scalable enough for larger knowledge bases: large collections are prefiltered before hybrid scoring.
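
The Chinese n-gram point can be illustrated with a small term extractor. Chinese text has no spaces between words, so overlapping character n-grams give keyword matching something to work with. The bigram default, the CJK character range, and the name `keyword_terms` are assumptions; the source does not state which n or which tokenizer GEOFlow uses.

```python
import re

CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")   # runs of common CJK ideographs
LATIN_WORD = re.compile(r"[A-Za-z0-9]+")    # ASCII words and numbers


def keyword_terms(text: str, n: int = 2) -> list[str]:
    """Lowercased ASCII words first, then overlapping n-grams for each CJK run."""
    terms = [word.lower() for word in LATIN_WORD.findall(text)]
    for run in CJK_RUN.findall(text):
        if len(run) <= n:
            terms.append(run)
        else:
            terms.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return terms
```

For example, `keyword_terms("GEO 知识库检索")` yields `["geo", "知识", "识库", "库检", "检索"]`, so a query containing 检索 can match the chunk without a word segmenter.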

10. Best Practices

  • Keep each knowledge base focused on one business topic.
  • Clean duplicated, outdated, low-quality, or promotional material before upload.
  • Add source, date, business line, and review status whenever possible.
  • Review high-risk material before using it for generation.
  • Refresh chunks and vectors after editing source material.
  • Test retrieval with real task titles, not just by looking at chunk counts.
  • Do not treat vectorization as a quality guarantee. Poor material remains poor after vectorization.

11. Common Questions

Why does generation not cite the knowledge base clearly?

Common causes include not selecting a knowledge base for the task, a mismatch between the task title and the knowledge topic, missing embedding configuration, or source material that lacks citable facts.

Why are there chunks but very few real vectors?

The default embedding model may be missing, the provider URL may be wrong, the API key may be invalid, the model endpoint may be failing, or the knowledge base may have been saved without a real vector refresh.

Should every knowledge base use LLM semantic planning?

No. Clean Markdown, product documents, and FAQs are usually fine with rule chunking. Semantic planning is most useful for long, messy, or cross-section documents.

Can GEOFlow guarantee generated content is always correct?

No. GEOFlow improves evidence retrieval, citation, and governance, but factual accuracy still depends on source quality, review workflow, and human judgment. GEO raises the probability of good outcomes and the quality of factual expression. It cannot guarantee how any platform will behave.

12. Related Pages
