AI Knowledge Base Tutorial
The AI knowledge base is GEOFlow's fact layer and retrieval layer. It is not a magic button that makes the model smarter after files are uploaded. It is a way to turn scattered material into searchable, citable, and governable knowledge assets.
For GEO content engineering, knowledge quality determines factual accuracy, reuse quality, and long-term maintainability. A strong knowledge base is the foundation of disciplined GEO work and is worth sustained investment.
GEOFlow's AI knowledge base is designed to:
- Turn documents, webpages, business material, and team experience into structured knowledge.
- Retrieve evidence by title, keywords, section paths, and semantic vectors during task generation.
- Reduce unsupported model output by grounding generation in real source material.
- Preserve governance metadata such as source, date, business line, risk level, and review status.
- Help generated content cite concrete evidence instead of relying only on model memory.
Vectorization is only one part of RAG. Retrieval quality also depends on chunking, metadata, hybrid retrieval, conflict handling, review governance, and evidence citation.
A complete knowledge ingestion and retrieval flow usually looks like this:
- Collect material: paste text, upload files in batches, or generate knowledge from a URL.
- Clean the body: normalize encoding, line breaks, whitespace, and body structure.
- Create structured chunks: split by headings, sections, paragraphs, lists, tables, quotes, and length.
- Store metadata: keep source, date, business line, risk level, review status, and section path.
- Vectorize chunks: use the default embedding model to write real vectors.
- Retrieve evidence: use a hybrid of keywords, headings, sections, and vector similarity.
- Inject evidence: pass retrieved evidence into the generation prompt and require evidence IDs when facts are used.
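The "clean the body" step in the flow above can be illustrated with a minimal sketch. The function name `clean_body` and the exact normalization rules are assumptions for illustration, not GEOFlow's actual implementation.

```python
import re
import unicodedata


def clean_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical cleaning step: normalize encoding, line breaks, and whitespace."""
    # Decode defensively so one bad byte does not abort ingestion.
    text = raw.decode(encoding, errors="replace")
    # Use one canonical Unicode form so equal-looking strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing spaces and tabs on every line.
    text = "\n".join(re.sub(r"[ \t]+$", "", line) for line in text.split("\n"))
    # Collapse runs of blank lines so paragraph boundaries stay unambiguous.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Cleaning before chunking matters because the later steps rely on blank lines and headings as boundaries.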
Recommended path:
- Configure at least one working chat model in the AI Configurator.
- Configure an embedding model in AI Models and set it as the default embedding model.
- Open Materials -> AI Knowledge Base Hub.
- Click New Knowledge Base, then upload documents or paste text.
- Fill in the knowledge base name, source, business line, effective date, risk level, and review status.
- Choose Save and Generate Chunks, or save first and run Refresh Chunks from the list.
- Select the knowledge base when creating or editing a task.
- After generation, review whether facts, citations, and source evidence match.
If you only want to organize material first, an embedding model is not required. GEOFlow can still create chunks. A default embedding model is required when you need RAG retrieval with real semantic vectors.
GEOFlow currently uses a "structured rule chunking + optional LLM semantic planning + stable fallback" strategy.
Structured rule chunking is the recommended default. It is stable, controllable, and low-cost.
It detects:
- Markdown headings and section levels.
- Paragraphs, lists, tables, quotes, and code blocks.
- Oversized text blocks that need length-based splitting.
- Current section paths and inferred chunk titles.
The system preserves source structure as much as possible and does not ask the model to rewrite knowledge content.
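As a rough illustration of rule-based chunking, the sketch below splits Markdown by headings, tracks the current section path, and splits oversized sections on paragraph boundaries. The names `Chunk` and `chunk_markdown` and the `max_chars` default are hypothetical. It also ignores tables, quotes, and fenced code blocks, which the real chunker is described as detecting.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    section_path: str
    title: str
    content: str


def chunk_markdown(text: str, max_chars: int = 800) -> list[Chunk]:
    path: list[str] = []    # stack of headings leading to the current section
    buffer: list[str] = []  # body lines collected since the last heading
    chunks: list[Chunk] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        title = path[-1] if path else "Untitled"
        section = " > ".join(path)
        piece = ""
        # Length-based splitting: start a new chunk when the next paragraph
        # would push the current one past max_chars.
        for para in body.split("\n\n"):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append(Chunk(section, title, piece))
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append(Chunk(section, title, piece))

    for line in text.split("\n"):
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # leave deeper or same-level sections
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

The source text is copied into chunks unchanged, which matches the stated goal of preserving structure rather than rewriting content.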
Semantic planning asks a chat model to decide which source blocks should belong to each chunk. It does not generate new knowledge and does not rewrite the source text.
The final stored chunks are still rebuilt from the original source by GEOFlow. This keeps semantic completeness, cost, speed, and traceability in balance.
If semantic planning fails because of timeout, invalid JSON, abnormal boundaries, or failed source mapping, GEOFlow falls back to structured rule chunking so ingestion can still finish.
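The plan-then-rebuild idea with a fallback can be sketched as follows. This assumes, purely for illustration, that the chat model returns a JSON list of groups of block indices; the function names and the plan format are hypothetical, and `rule_chunks` is a trivial stand-in for structured rule chunking.

```python
import json
from typing import Callable


def rule_chunks(blocks: list[str]) -> list[str]:
    """Stand-in for structured rule chunking: one chunk per source block."""
    return list(blocks)


def plan_chunks(blocks: list[str], ask_model: Callable[[list[str]], str]) -> list[str]:
    """Group source blocks according to a model plan, or fall back to rules."""
    try:
        # Assumed plan format: [[0, 1], [2], ...] (indices into `blocks`).
        plan = json.loads(ask_model(blocks))
        if not all(group for group in plan):
            raise ValueError("plan contains an empty group")
        flat = [index for group in plan for index in group]
        # Every block must appear exactly once and in source order.
        if flat != list(range(len(blocks))):
            raise ValueError("plan does not map cleanly onto the source blocks")
    except (TimeoutError, ValueError, TypeError):
        # Timeout, invalid JSON, bad boundaries, or failed mapping.
        return rule_chunks(blocks)
    # Chunks are rebuilt from the original text; the model output is never stored.
    return ["\n\n".join(blocks[i] for i in group) for group in plan]
```

Because the model only returns indices, a bad response can cost a retry or a fallback but cannot alter the stored knowledge.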
Vectorization converts each knowledge chunk into an embedding vector for semantic similarity retrieval.
Current rules:
- GEOFlow prefers the default embedding model configured in the admin panel.
- It supports OpenAI-compatible embedding endpoints and native Gemini embedding endpoints.
- Additional embedding services can be connected through the unified model adapter, including compatible endpoints such as Volcengine / Doubao and Zhipu.
- When real vectors are written, GEOFlow stores the model ID, vector dimensions, provider, and pgvector field.
- During ordinary save flows, if the embedding service is temporarily unavailable, GEOFlow keeps the knowledge base and chunks instead of blocking ingestion.
- During manual Refresh Chunks with real vector refresh, failures are reported clearly and are not silently treated as success.
- For providers with strict batch-size limits, GEOFlow uses conservative batching and single-request fallback to reduce 400-level parameter errors.
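The conservative batching with single-request fallback described in the last rule might look like the sketch below. `BatchRejected`, `embed_in_batches`, and the `batch_size` default are hypothetical names and values; `embed` stands for whatever client call the configured provider exposes.

```python
from typing import Callable, Sequence


class BatchRejected(Exception):
    """Hypothetical error raised when a provider rejects a batch (for example HTTP 400)."""


def embed_in_batches(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 8,
) -> list[list[float]]:
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            vectors.extend(embed(batch))
        except BatchRejected:
            # Fall back to one request per chunk for this batch only.
            for text in batch:
                vectors.extend(embed([text]))
    return vectors
```

Only rejected batches are retried one item at a time, so providers that accept batches keep their throughput.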
Vectorization alone does not guarantee correctness. It only improves semantic similarity. Reliable answers still depend on source quality, chunk quality, metadata, and evidence-aware generation.
GEOFlow uses hybrid retrieval instead of vector-only search.
The current score combines:
- Vector similarity.
- Keyword matching.
- Title and section-path matching.
- Metadata quality, including source, business line, effective date, risk level, and review status.
The current weighting is centered on vector similarity and keyword matching, with additional weight for titles, sections, and metadata. This helps avoid the weaknesses of pure vector retrieval and pure keyword retrieval.
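To make the combination concrete, here is a minimal scoring sketch. The weights, field names, and the simple term-overlap measure are illustrative assumptions; the tutorial does not publish GEOFlow's actual weights.

```python
import math

# Illustrative weights only: vector and keyword signals dominate,
# title/section and metadata add smaller contributions.
WEIGHTS = {"vector": 0.45, "keyword": 0.35, "title": 0.10, "metadata": 0.10}
META_FIELDS = ("source", "business_line", "effective_date", "risk_level", "review_status")


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def term_overlap(terms: set[str], text: str) -> float:
    """Share of query terms (assumed lowercase) that occur in the text."""
    if not terms:
        return 0.0
    lowered = text.lower()
    return sum(1 for term in terms if term in lowered) / len(terms)


def hybrid_score(query_vec: list[float], terms: set[str], chunk: dict) -> float:
    meta = chunk["metadata"]
    meta_quality = sum(1 for f in META_FIELDS if meta.get(f)) / len(META_FIELDS)
    heading_text = f"{chunk['title']} {chunk['section_path']}"
    return (
        WEIGHTS["vector"] * cosine(query_vec, chunk["vector"])
        + WEIGHTS["keyword"] * term_overlap(terms, chunk["content"])
        + WEIGHTS["title"] * term_overlap(terms, heading_text)
        + WEIGHTS["metadata"] * meta_quality
    )
```

A chunk that matches only by vector or only by keyword still scores, but one that matches on several signals and carries complete metadata ranks higher.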
For large knowledge bases, GEOFlow avoids loading every chunk into PHP for every query:
- Small knowledge bases still use full local scoring for simplicity.
- Large knowledge bases use pgvector top candidates when available.
- If pgvector is unavailable, GEOFlow uses keyword prefiltering across titles, section paths, and content.
- Prefiltered candidates then enter the normal hybrid scoring flow.
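The branching above can be sketched as a small selection function. The `small_limit` threshold and the `vector_top_k` callable are hypothetical. In a real deployment the pgvector and keyword branches would run as database queries so that chunks are not all loaded into application memory; the in-memory filter here only illustrates the decision logic.

```python
from typing import Callable, Optional


def select_candidates(
    chunks: list[dict],
    query_terms: list[str],
    small_limit: int = 200,
    vector_top_k: Optional[Callable[[], list[dict]]] = None,
) -> list[dict]:
    # Small knowledge base: score everything locally.
    if len(chunks) <= small_limit:
        return list(chunks)
    # Large knowledge base with pgvector available: take its top candidates.
    if vector_top_k is not None:
        return vector_top_k()
    # No pgvector: keyword prefilter over title, section path, and content.
    terms = [term.lower() for term in query_terms]

    def matches(chunk: dict) -> bool:
        haystack = " ".join(
            (chunk["title"], chunk["section_path"], chunk["content"])
        ).lower()
        return any(term in haystack for term in terms)

    return [chunk for chunk in chunks if matches(chunk)]
```

Whichever branch runs, the returned candidates go through the same hybrid scoring afterwards.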
Real business material often has old and new versions of the same topic. GEOFlow performs lightweight conflict merging for same-topic evidence.
The topic key is primarily derived from:
- Section path.
- Chunk title.
GEOFlow treats same-topic material as a version conflict only when effective dates differ. This avoids incorrectly merging multiple chunks from the same section.
When a conflict is detected, GEOFlow prefers:
- Reviewed evidence.
- Newer evidence.
- Lower-risk evidence.
- Higher retrieval score.
High-risk material that has not been reviewed is excluded from the evidence context. When conflict merging happens, the evidence context includes a note that newer or reviewed material was preferred.
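A minimal sketch of this conflict handling follows. It assumes the four preferences are applied as a strict priority order, that every item has an effective date, and that risk levels are "low", "medium", or "high"; the dictionary keys are hypothetical.

```python
from collections import defaultdict
from datetime import date

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def usable(item: dict) -> bool:
    """Exclude material that is both high-risk and unreviewed."""
    return not (item["risk"] == "high" and not item["reviewed"])


def preference(item: dict) -> tuple:
    # Smaller tuples win: reviewed, then newer, then lower risk, then higher score.
    return (
        not item["reviewed"],
        -item["effective_date"].toordinal(),
        RISK_ORDER[item["risk"]],
        -item["score"],
    )


def merge_conflicts(evidence: list[dict]) -> list[dict]:
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for item in evidence:
        if usable(item):
            # Topic key: section path plus chunk title.
            groups[(item["section_path"], item["title"])].append(item)
    merged: list[dict] = []
    for items in groups.values():
        dates = {item["effective_date"] for item in items}
        if len(dates) > 1:
            # Different effective dates on one topic: keep the preferred version.
            merged.append(min(items, key=preference))
        else:
            # Same date: these are sibling chunks, not versions, so keep all.
            merged.extend(items)
    return merged
```

Checking effective dates before merging is what keeps several chunks from one section from being collapsed into a single item.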
GEOFlow formats retrieved chunks as knowledge evidence and injects them into the generation prompt.
Each evidence item can include:
- Evidence ID, such as K1 or K2.
- Title.
- Section.
- Source.
- Link.
- Date.
- Business line.
- Risk and review status.
- Content excerpt.
When generated content uses facts, data, or business judgments from the knowledge base, GEOFlow asks the model to cite evidence IDs such as [K1]. If evidence is insufficient, the model should use cautious wording and avoid inventing sources or conclusions.
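The evidence formatting and citation instruction could be assembled roughly as below. The layout, key names, and prompt wording are assumptions for illustration, not GEOFlow's actual prompt template.

```python
def format_evidence(chunks: list[dict], excerpt_chars: int = 200) -> str:
    """Render retrieved chunks as numbered evidence items K1, K2, ..."""
    lines: list[str] = []
    for number, chunk in enumerate(chunks, start=1):
        lines.append(
            f"[K{number}] {chunk['title']} | Section: {chunk['section_path']} "
            f"| Source: {chunk['source']} | Date: {chunk['date']}"
        )
        # Only an excerpt is injected, to keep the prompt within budget.
        lines.append(chunk["content"][:excerpt_chars])
    return "\n".join(lines)


def build_prompt(task_title: str, chunks: list[dict]) -> str:
    return (
        f"Write an article titled: {task_title}\n\n"
        "Knowledge evidence:\n"
        f"{format_evidence(chunks)}\n\n"
        "When you use a fact, figure, or business judgment from the evidence, "
        "cite its ID, for example [K1]. If the evidence is insufficient, use "
        "cautious wording and do not invent sources or conclusions."
    )
```

Stable evidence IDs make it possible to check each cited claim against its source chunk during review.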
Compared with a simple "chunk documents and run vector search" RAG system, GEOFlow has several advantages:
- Stable: default rule chunking does not depend on an external model.
- Controllable: LLMs plan boundaries only; they do not rewrite source knowledge.
- Traceable: chunks preserve titles, sections, sources, dates, and source hashes.
- Better for Chinese business material: keyword retrieval includes Chinese n-gram terms.
- More robust retrieval: vector similarity, keywords, title/section matches, and metadata are combined.
- Governable: high-risk unreviewed material is excluded, while reviewed and low-risk material receives higher weight.
- Version-aware: same-topic evidence prefers newer or reviewed versions.
- Scalable enough for larger knowledge bases: large collections are prefiltered before hybrid scoring.
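The Chinese n-gram keyword terms mentioned in the list above can be illustrated with a short sketch. Chinese text has no spaces between words, so overlapping character n-grams are a common way to get matchable terms without a word segmenter. The function name and the choice of bigrams are assumptions, not GEOFlow's documented behavior.

```python
import re

CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")    # runs of common CJK ideographs
LATIN_WORD = re.compile(r"[A-Za-z0-9]+")


def keyword_terms(text: str, n: int = 2) -> set[str]:
    # Latin words and numbers are kept whole, lowercased.
    terms = {word.lower() for word in LATIN_WORD.findall(text)}
    for run in CJK_RUN.findall(text):
        if len(run) < n:
            terms.add(run)  # a run shorter than n is kept as-is
        # Overlapping n-grams: "内容工程" -> "内容", "容工", "工程"
        for i in range(len(run) - n + 1):
            terms.add(run[i:i + n])
    return terms
```

Some n-grams, such as "容工", are not real words, but they rarely match unrelated text and the real ones still match.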
Recommended practices:
- Keep each knowledge base focused on one business topic.
- Clean duplicated, outdated, low-quality, or promotional material before upload.
- Add source, date, business line, and review status whenever possible.
- Review high-risk material before using it for generation.
- Refresh chunks and vectors after editing source material.
- Test retrieval with real task titles, not just by looking at chunk counts.
- Do not treat vectorization as a quality guarantee. Poor material remains poor after vectorization.
If generated content does not draw on the knowledge base, common causes include not selecting a knowledge base for the task, a mismatch between the task title and the knowledge topic, missing embedding configuration, or source material that lacks citable facts.
If vectorization fails or chunks have no vectors, the default embedding model may be missing, the provider URL may be wrong, the API key may be invalid, the model endpoint may be failing, or the knowledge base may have been saved without a real vector refresh.
Semantic planning is not needed for every document. Clean Markdown, product documents, and FAQs are usually fine with rule chunking. Semantic planning is most useful for long, messy, or cross-section documents.
The knowledge base does not guarantee accurate output on its own. GEOFlow improves evidence retrieval, citation, and governance, but factual accuracy still depends on source quality, review workflow, and human judgment. GEO improves the probability of good outcomes and the quality of factual expression. It cannot guarantee how any platform will treat the content.