# Large Model Application Development: RAG Chapter

## I. What is RAG?

### 1.1 Core Concepts of RAG
Retrieval-Augmented Generation (RAG) is a technical framework that combines information retrieval with Large Language Models (LLMs). In short:
- The model first "consults its materials": it retrieves relevant passages from your knowledge base before answering
- It then generates the answer from the retrieved passages, which reduces fabrication (hallucinations)

![](https://ai-studio-static-online.cdn.bcebos.com/c41c00052949463692fe89f05751e26bb08bf0d1bce74d76889a1958b7baa954)

### 1.2 Core Problems Solved by RAG
Three major pain points of traditional large models addressed by RAG:
- **Outdated knowledge**: A model's training data has a cutoff date, while RAG reads from a knowledge base that can be updated at any time
- **Hallucinations**: Models may fabricate incorrect information, while RAG grounds answers in real source materials
- **Domain limitations**: General-purpose models lack depth in specialized fields, while RAG can connect to industry knowledge bases

![](https://ai-studio-static-online.cdn.bcebos.com/091ca1c290fd479da526488c6a400c262c902e01c8bc4314bb1a1521eb6b507f)
![](https://ai-studio-static-online.cdn.bcebos.com/d87ee748c6c9416eb5d380585954c40d09ed166cf26244bc9169adefca3d1512)


## II. Basic Workflow of RAG

```mermaid
graph TD
    A[User Query] --> B[Query Parsing]
    B --> C[Retrieve from Knowledge Base]
    C --> D[Obtain Relevant Documents]
    D --> E[Splice into Prompt]
    E --> F[LLM Generate Answer]
    F --> G[Output Result]
```

Specific steps:
1. **User Query**: For example, "What new features does Wenxin Large Model 4.5 have?"
2. **Query Parsing**: Extract keywords "Wenxin Large Model 4.5" and "new features"
3. **Retrieve from Knowledge Base**: Find content about Wenxin 4.5 in your documents
4. **Obtain Relevant Documents**: Return the 2-3 most relevant passages
5. **Splice into Prompt**: Combine the query and the retrieved passages into a prompt (e.g., "Answer based on the following content: [Documents] Question: ...")
6. **LLM Generate Answer**: The locally deployed ERNIE-4.5-21B-A3B model generates an answer based on the prompt
7. **Output Result**: Return the answer together with its supporting references
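
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's implementation: the knowledge base contents, the keyword-overlap scoring, and the function names `tokenize`, `retrieve`, and `build_prompt` are all assumptions made for the example. A real system would use vector search for retrieval and send the resulting prompt to the model service.

```python
import re

# Hypothetical mini knowledge base; the texts are illustrative only.
KNOWLEDGE_BASE = [
    {"source": "doc_a.txt", "text": "RAG retrieves documents before the model answers."},
    {"source": "doc_b.txt", "text": "Vector databases store text embeddings for similarity search."},
    {"source": "doc_c.txt", "text": "Chunking splits long documents into short passages."},
]


def tokenize(text: str) -> set:
    """Step 2 (query parsing): lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, top_k: int = 2) -> list:
    """Steps 3-4: score documents by keyword overlap and keep the best top_k."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc["text"])), doc) for doc in KNOWLEDGE_BASE]
    scored = [pair for pair in scored if pair[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query: str, docs: list) -> str:
    """Step 5: splice the retrieved passages and the question into one prompt."""
    references = "\n".join(
        f"[{i + 1}] ({doc['source']}) {doc['text']}" for i, doc in enumerate(docs)
    )
    return f"Answer based on the following content:\n{references}\n\nQuestion: {query}"


question = "How do vector databases search embeddings?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # Step 6 would send this prompt to the model service
```

Keyword overlap is used here only because it needs no extra libraries; it matches words, not meaning.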

![](https://ai-studio-static-online.cdn.bcebos.com/b6eef408e01d4becaca2efa6b839a3fef79a2790b53448c8a407cb395914fb1f)

## III. Core Components of RAG Systems

### 3.1 Three Core Modules
1. **Knowledge Base**: Your private materials (documents, web pages, conversation records, etc.)
2. **Retriever**: Quickly finds content related to the query from the knowledge base (key component is the "vector database")
3. **Generator**: Locally deployed ERNIE-4.5-21B-A3B model (generates answers based on retrieved content)

### 3.2 What is a Vector Database?
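- An embedding model converts each text into a numerical vector (a kind of "mathematical fingerprint" of the text); the vector database stores these vectors and searches them by similarity
- The more semantically similar two texts are, the smaller the distance between their vectors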
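
To make "closer vectors" concrete, here is a small sketch that compares texts by cosine similarity. It assumes a toy word-count "embedding" (the `embed` function is hypothetical and only captures shared words); a real vector database relies on a neural embedding model that captures meaning, but the similarity computation works the same way.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. Real systems use a neural embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = embed("werewolf night strategy")
related = embed("werewolf strategy for the first night")
unrelated = embed("recipe for tomato soup")

print(f"related:   {cosine_similarity(query, related):.3f}")
print(f"unrelated: {cosine_similarity(query, unrelated):.3f}")
```

The related text scores higher than the unrelated one, which is the property a retriever uses to rank passages.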

## IV. Selection of Wenxin Open-Source Models

### 4.1 ERNIE-4.5 Model Series Specification Comparison Table

| Model Series | Model Name | Total Parameters | Activated Parameters | Modality Support | Context Length | Main Use Cases | Deployment Scenario |
|---------|---------|--------|---------|---------|-----------|---------|---------|
| **A47B Large Scale** | ERNIE-4.5-300B-A47B-Base | 300B | 47B | Text | 128K | Pre-training Base | Cloud GPU Cluster |
| | ERNIE-4.5-300B-A47B | 300B | 47B | Text | 128K | Instruction Following/Creative Generation | Cloud GPU Cluster |
| | ERNIE-4.5-VL-424B-A47B-Base | 424B | 47B | Text+Vision | 128K | Multimodal Pre-training | Cloud GPU Cluster |
| | ERNIE-4.5-VL-424B-A47B | 424B | 47B | Text+Vision | 128K | Image-Text Understanding/Generation | Cloud GPU Cluster |
| **A3B Medium Scale** | ERNIE-4.5-21B-A3B-Base | 21B | 3B | Text | 128K | Pre-training Base | Single Machine with Multiple GPUs |
| | **ERNIE-4.5-21B-A3B** | **21B** | **3B** | **Text** | **128K** | **Dialogue/Document Processing** | **Single Machine with Multiple GPUs** |
| | ERNIE-4.5-VL-28B-A3B-Base | 28B | 3B | Text+Vision | 128K | Multimodal Pre-training | Single Machine with Multiple GPUs |
| | ERNIE-4.5-VL-28B-A3B | 28B | 3B | Text+Vision | 128K | Lightweight Multimodal Applications | Single Machine with Multiple GPUs |
| **0.3B Lightweight** | ERNIE-4.5-0.3B-Base | 0.3B | 0.3B | Text | 4K | Edge-side Pre-training | Mobile/Edge Devices |
| | ERNIE-4.5-0.3B | 0.3B | 0.3B | Text | 4K | Real-time Dialogue | Mobile/Edge Devices |

### 4.2 Model Specification Selection Strategy Table

| Application Scenario | Recommended Model | Reason | Hardware Requirements | Inference Latency |
|---------|---------|------|---------|---------|
| **Complex Reasoning Tasks** | ERNIE-4.5-300B-A47B | Strongest reasoning capability | 8×A100(80GB) | High |
| **Creative Content Generation** | ERNIE-4.5-300B-A47B | Best creative performance | 8×A100(80GB) | High |
| **Multimodal Understanding** | ERNIE-4.5-VL-424B-A47B | Optimal image-text integration | 8×A100(80GB) | High |
| **Daily Dialogue Customer Service** | **ERNIE-4.5-21B-A3B** | **Balanced performance and cost** | **4×V100(32GB)** | **Medium** |
| **Document Information Extraction** | **ERNIE-4.5-21B-A3B** | **Sufficient understanding capability** | **4×V100(32GB)** | **Medium** |
| **Lightweight Multimodal** | ERNIE-4.5-VL-28B-A3B | Balanced image-text processing | 4×V100(32GB) | Medium |
| **Mobile Applications** | ERNIE-4.5-0.3B | Low latency, fast response | 1×GPU/CPU | Low |
| **Edge Computing** | ERNIE-4.5-0.3B | Minimal resource consumption | CPU/NPU | Low |

**Reasons for selecting ERNIE-4.5-21B-A3B in this tutorial:**
- Moderate parameter scale (21B total parameters, 3B activated parameters), achieving a balance between performance and resource consumption
- Supports 128K long context, suitable for processing long documents
- Strong dialogue and document processing capabilities, ideal for RAG application scenarios
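
A rough back-of-the-envelope estimate helps relate "21B total, 3B activated" to hardware. The sketch below assumes 2 bytes per parameter (16-bit weights) and counts weights only, ignoring the KV cache, activations, and framework overhead, so real requirements are higher. In a mixture-of-experts model all experts normally have to be resident in memory, so memory tracks the total parameter count, while per-token compute tracks the activated count. The function name is hypothetical.

```python
def weight_memory_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return total_params_billion * bytes_per_param


total_gb = weight_memory_gb(21)   # 21B parameters at 2 bytes each
per_gpu_gb = total_gb / 4         # if the weights are split evenly across 4 GPUs

print(f"Weights: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.1f} GB per GPU on 4 GPUs")
```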

## V. Hands-on Implementation: Basic RAG System

### 5.1 Environment Preparation
#### 5.1.1 Install Required Libraries

In [1]:
%%capture
!pip install chromadb

In [1]:
%%capture
!python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

#### 5.1.2 Model Download and Deployment
**1. Download ERNIE-4.5-21B-A3B model**

In [2]:
%%capture
# Download model using AIStudio command
!aistudio download --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle --local_dir /home/aistudio/work/models

**2. Start model service**

In [None]:
python -m fastdeploy.entrypoints.openai.api_server \
       --model /home/aistudio/work/models \
       --port 7000 \
       --metrics-port 7001 \
       --engine-worker-queue-port 7002 \
       --max-model-len 32768 \
       --max-num-seqs 32
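
Loading a 21B model can take a while, so a client that connects too early fails with a connection error. The helper below is a small sketch for waiting until the service port accepts TCP connections. `wait_for_port` is a hypothetical name, and a successful TCP connection only shows that something is listening on the port; it does not prove that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the service started above, `wait_for_port("127.0.0.1", 7000, timeout_s=600)` would block for up to ten minutes and return `True` as soon as port 7000 accepts connections.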

**3. Test model connection**

In [2]:
import openai

host = "0.0.0.0"
port = "7000"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": "You are an intelligent assistant developed by Aistudio and Wenxin Large Model. Please introduce yourself."}
    ],
    stream=True,
)

for chunk in response:
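    # Skip chunks that carry no text (e.g. the final chunk), otherwise "None" is printed
    if chunk.choices[0].delta and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')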

Hello! I am an intelligent assistant developed jointly by Aistudio and Wenxin Large Model. My core function is to help users complete tasks such as knowledge Q&A, text creation, code debugging, and logical reasoning through natural language interaction. Whether it's academic research, daily consultation, or creative generation, you can communicate with me through text, and I will provide accurate, efficient, and context-appropriate solutions combining the technical advantages of multimodal large models.

My design philosophy is "understand needs, create value". I support bilingual interaction in Chinese and English, and continuously optimize my capability boundaries through user feedback. If you have any specific needs (such as data analysis, code implementation, text polishing, etc.), please feel free to tell me, and I will do my best to assist!

### 5.2 Step 1: Document Processing

In [3]:
!python document_processor.py

Building prefix dict from the default dictionary ...
2025-07-11 20:51:18,778 - jieba - DEBUG - Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
2025-07-11 20:51:18,778 - jieba - DEBUG - Loading model from cache /tmp/jieba.cache
Loading model cost 0.667 seconds.
2025-07-11 20:51:19,445 - jieba - DEBUG - Loading model cost 0.667 seconds.
Prefix dict has been built successfully.
2025-07-11 20:51:19,445 - jieba - DEBUG - Prefix dict has been built successfully.
2025-07-11 20:51:19,445 - DocumentProcessor - INFO - Starting processing 5 documents...
2025-07-11 20:51:19,460 - DocumentProcessor - INFO - Successfully processed Witch Gameplay.txt => processed_data/processed_data.jsonl
2025-07-11 20:51:19,471 - DocumentProcessor - INFO - Successfully processed Hunter Gameplay.txt => processed_data/processed_data.jsonl
2025-07-11 20:51:19,487 - DocumentProcessor - INFO - Successfully processed Werewolf Gameplay.txt => processed_data/processe
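
The source of `document_processor.py` is not shown here; the log only tells us that it loads jieba and writes its results to `processed_data/processed_data.jsonl`. As a rough idea of what such a step might do, the sketch below splits a text into paragraphs and serializes them as JSON Lines. The function names, the record fields (`id`, `source`, `text`), and the MD5-based IDs are assumptions for illustration, and the sketch does not use jieba.

```python
import hashlib
import json


def make_records(source: str, text: str) -> list:
    """Split text on blank lines and wrap each paragraph in a record with a stable ID."""
    records = []
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        # Hash source + position + content so identical paragraphs still get distinct IDs
        raw_id = f"{source}:{index}:{paragraph}".encode("utf-8")
        records.append({
            "id": hashlib.md5(raw_id).hexdigest(),
            "source": source,
            "text": paragraph,
        })
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


sample = "First paragraph.\n\nSecond paragraph.\n\nFirst paragraph."
print(to_jsonl(make_records("Witch Gameplay.txt", sample)))
```

Including the position in the hash is a deliberate choice: hashing the content alone would give repeated paragraphs the same ID.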

### 5.3 Step 2: Create Chroma Vector Database and Test Retrieval Function

In [4]:
!python chroma_builder.py

2025-07-11 20:52:23,046 - ChromaBuilder - INFO - 🚀 Starting to build Chroma knowledge base...
2025-07-11 20:52:23,102 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2025-07-11 20:52:23,157 - ChromaBuilder - INFO - Using ChromaDB default embedding function
2025-07-11 20:52:23,159 - ChromaBuilder - INFO - Chroma knowledge base initialized, data directory: ./chroma_db
2025-07-11 20:52:23,159 - ChromaBuilder - INFO - Found 1 JSONL file
Building knowledge base:   0%|                                         | 0/1 [00:00<?, ?it/s]2025-07-11 20:52:23,161 - ChromaBuilder - INFO - Starting to load data from processed_data/processed_data.jsonl...
2025-07-11 20:52:23,162 - ChromaBuilder - ERROR - Failed to add documents in batch: Expected IDs to be unique, found duplicates of: 333dca46f32667f00ce6adff4a43f40e, 5cc95892c1ef7061451f7c769e9ed6f4, af452c146011ad354f5d81d27658f43f, 5c36
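
The `ERROR` line above reports that one batch contained several records with the same ID. One possible way to avoid it is to drop repeated IDs before handing a batch to the vector database. The sketch below assumes each record is a dict with an `"id"` key; that shape and the function name are hypothetical, because `chroma_builder.py` is not shown.

```python
def deduplicate_by_id(records: list) -> list:
    """Keep the first record for each ID and drop later repeats, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] in seen:
            continue
        seen.add(record["id"])
        unique.append(record)
    return unique


batch = [
    {"id": "a1", "text": "Witch rules"},
    {"id": "b2", "text": "Hunter rules"},
    {"id": "a1", "text": "Witch rules"},  # duplicate ID
]
print(deduplicate_by_id(batch))
```

If the repeated IDs belong to genuinely different chunks, deduplication would silently discard content; in that case the fix belongs in how the IDs are generated.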

### 5.4 Step 3: Call Local ERNIE-4.5 Model to Generate Answers

In [None]:
# rag_example.py

import os
import logging
from pathlib import Path
import openai
from document_processor import DocumentProcessor
from chroma_builder import ChromaKnowledgeBase, build_knowledge_base_from_processed_data

# Configure logging - only show errors and warnings
logging.basicConfig(
    level=logging.WARNING,
    format='%(levelname)s: %(message)s'
)
logger = logging.getLogger("RAGExample")

class ERNIERAGSystem:
    def __init__(self, 
                 ernie_host: str = "0.0.0.0",
                 ernie_port: str = "7000",
                 chroma_db_dir: str = "./chroma_db",
                 embedding_model: str = "default"):
        """
        Initialize RAG system
        
        Args:
            ernie_host: ERNIE model service host
            ernie_port: ERNIE model service port
            chroma_db_dir: Chroma database directory
            embedding_model: Embedding model type
        """
        # Initialize local ERNIE model client
        self.ernie_client = openai.Client(
            base_url=f"http://{ernie_host}:{ernie_port}/v1", 
            api_key="null"
        )
        
        # Initialize knowledge base
        self.knowledge_base = ChromaKnowledgeBase(
            persist_directory=chroma_db_dir,
            embedding_model=embedding_model
        )

    def retrieve_relevant_docs(self, question: str, top_k: int = 3, 
                             collection_name: str = None) -> list:
        """Retrieve relevant documents"""
        # If no collection name specified, automatically select collection with data
        if collection_name is None:
            stats = self.knowledge_base.get_collection_stats()
            for name, info in stats.items():
                if info['document_count'] > 0:
                    collection_name = name
                    break
            
            if collection_name is None:
                return []
        
        results = self.knowledge_base.search_knowledge(
            query=question,
            collection_name=collection_name,
            n_results=top_k
        )
        
        documents = results["documents"][0]
        metadatas = results["metadatas"][0]
        distances = results["distances"][0]
        
        # Format retrieval results
        retrieved_docs = []
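        for i, (doc, metadata, distance) in enumerate(zip(documents, metadatas, distances)):
            # 1 - distance equals cosine similarity only if the collection uses
            # cosine distance; with other metrics treat it as a relative score
            similarity = 1 - distance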
            retrieved_docs.append({
                "text": doc,
                "metadata": metadata,
                "similarity": similarity,
                "rank": i + 1
            })
        
        return retrieved_docs

    def generate_answer_stream(self, question: str, context_docs: list, 
                             max_tokens: int = 1000, temperature: float = 0.7):
        """Generate answer stream using ERNIE model"""
        
        # Build context
        context_parts = []
        for i, doc_info in enumerate(context_docs):
            source = doc_info["metadata"].get("source", "Unknown source")
            similarity = doc_info["similarity"]
            context_parts.append(f"Reference {i+1} (Similarity:{similarity:.3f}, Source:{source}):\n{doc_info['text']}")
        
        context_text = "\n\n".join(context_parts)
        
        # Build prompt
        prompt = f"""Please answer the question based on the following reference materials. If the reference materials do not contain relevant information, clearly state "Cannot fully answer this question based on the provided reference materials".

Reference materials:
{context_text}

Question: {question}

Please provide an accurate and detailed answer based on the above reference materials, and indicate information sources at appropriate positions:"""

        try:
            # Call local ERNIE model (streaming)
            response = self.ernie_client.chat.completions.create(
                model="null",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                temperature=temperature,
                stream=True  # Enable streaming output
            )
            
            return response
            
        except Exception as e:
            logger.error(f"❌ Failed to call ERNIE model: {e}")
            return None

    def ask_stream(self, question: str, top_k: int = 3, collection_name: str = None):
        """Streaming RAG question answering process"""
        print(f"\n🔍 Searching for relevant materials...")
        
        # 1. Retrieve relevant documents
        retrieved_docs = self.retrieve_relevant_docs(question, top_k, collection_name)
        
        if not retrieved_docs:
            print("💔 Sorry, no relevant reference materials were found to answer your question.")
            return
        
        # Display retrieved materials information
        sources = list(set([doc["metadata"].get("source", "Unknown source") for doc in retrieved_docs]))
        print(f"📚 Found {len(retrieved_docs)} relevant materials, sources: {', '.join(sources)}")
        print(f"\n🤖 ERNIE-4.5 is thinking")
        
        # 2. Generate answer stream
        response_stream = self.generate_answer_stream(question, retrieved_docs)
        
        if response_stream is None:
            print("❌ Error occurred while generating answer. Please check if ERNIE model service is running normally.")
            return
        
        print("✨ Answer: ", end="", flush=True)
        
        # Process streaming response
        full_answer = ""
        try:
            for chunk in response_stream:
                if chunk.choices[0].delta and chunk.choices[0].delta.content:
                    content = chunk.choices[0].delta.content
                    print(content, end="", flush=True)
                    full_answer += content
            
            print(f"\n\n📖 Reference sources: {', '.join(sources)}")
            print("-" * 60)
            
        except Exception as e:
            print(f"\n❌ Error occurred during streaming output: {e}")
            
        return full_answer



def interactive_rag_chat(rag_system: ERNIERAGSystem):
    """Interactive RAG streaming dialogue"""
    print("\n🎯 RAG Intelligent Question Answering System - Streaming Dialogue Mode")
    print("✨ Based on local ERNIE-4.5-21B-A3B model")
    print("💡 Enter questions to start dialogue, enter 'quit' or 'exit' to quit")
    print("💡 Enter 'stats' to view knowledge base statistics")
    print("💡 Enter 'clear' to clear screen")
    print("=" * 60)
    
    while True:
        try:
            question = input("\n❓ Please enter your question: ").strip()
            
            if question.lower() in ['quit', 'exit', '退出', 'q']:
                print("\n👋 Thank you for using RAG Intelligent Question Answering System, goodbye!")
                break
            elif question.lower() == 'stats':
                stats = rag_system.knowledge_base.get_collection_stats()
                print("\n📊 Knowledge base statistics:")
                total_docs = 0
                for name, info in stats.items():
                    count = info['document_count']
                    total_docs += count
                    if count > 0:
                        print(f"  ✅ {name}: {count} documents")
                    else:
                        print(f"  ⚪ {name}: {count} documents")
                print(f"📈 Total: {total_docs} documents")
                continue
            elif question.lower() == 'clear':
                # os is already imported at module level
                os.system('cls' if os.name == 'nt' else 'clear')
                print("🎯 RAG Intelligent Question Answering System - Streaming Dialogue Mode")
                continue
            elif not question:
                print("⚠️ Please enter a valid question")
                continue
            
            # Perform RAG streaming question answering
            rag_system.ask_stream(question, top_k=3)
            
        except KeyboardInterrupt:
            print("\n\n👋 Program interrupted, goodbye!")
            break
        except Exception as e:
            print(f"\n❌ Error occurred: {e}")
            print("💡 Please check if ERNIE model service is running normally")

def main():
    """Main program entry"""
    print("🚀 Initializing RAG Intelligent Question Answering System...")
    
    try:
        # Directly initialize RAG system
        rag_system = ERNIERAGSystem(
            ernie_host="0.0.0.0",
            ernie_port="7000",
            chroma_db_dir="./chroma_db",
            embedding_model="default"
        )
        
        # Check if knowledge base has data
        stats = rag_system.knowledge_base.get_collection_stats()
        total_docs = sum(info['document_count'] for info in stats.values())
        
        if total_docs == 0:
            print("⚠️ Warning: No data in knowledge base!")
            print("💡 Please run the following commands to build knowledge base first:")
            print("   python document_processor.py")
            print("   python chroma_builder.py")
            return
        
        print(f"✅ Knowledge base loaded, total {total_docs} documents")
        
        # Start interactive dialogue
        interactive_rag_chat(rag_system)
        
    except Exception as e:
        print(f"❌ System initialization failed: {e}")
        print("💡 Please check:")
        print("  1. Whether ERNIE model service is running on port 7000")
        print("  2. Whether knowledge base directory ./chroma_db exists")
        print("  3. Whether relevant dependencies are installed correctly")

if __name__ == "__main__":
    main()

Building prefix dict from the default dictionary ...
2025-07-11 20:53:35,694 - jieba - DEBUG - Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
2025-07-11 20:53:35,695 - jieba - DEBUG - Loading model from cache /tmp/jieba.cache
Loading model cost 0.677 seconds.
2025-07-11 20:53:36,372 - jieba - DEBUG - Loading model cost 0.677 seconds.
Prefix dict has been built successfully.
2025-07-11 20:53:36,376 - jieba - DEBUG - Prefix dict has been built successfully.
2025-07-11 20:53:36,952 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2025-07-11 20:53:37,009 - ChromaBuilder - INFO - Using ChromaDB default embedding function
2025-07-11 20:53:37,012 - ChromaBuilder - INFO - Chroma knowledge base initialized, data directory: ./chroma_db


🚀 Initializing RAG Intelligent Question Answering System...
✅ Knowledge base loaded, total 9 documents

🎯 RAG Intelligent Question Answering System - Streaming Dialogue Mode
✨ Based on local ERNIE-4.5-21B-A3B model
💡 Enter questions to start dialogue, enter 'quit' or 'exit' to quit
💡 Enter 'stats' to view knowledge base statistics
💡 Enter 'clear' to clear screen



❓ Please enter your question:  Werewolf



🔍 Searching for relevant materials...


2025-07-11 20:53:52,587 - ChromaBuilder - INFO - Query 'Werewolf...' returned 3 results
2025-07-11 20:53:52,599 - httpx - INFO - HTTP Request: POST http://0.0.0.0:7000/v1/chat/completions "HTTP/1.1 200 OK"


📚 Found 3 relevant materials, sources: knowledge_data/Hunter Gameplay.txt, knowledge_data/Werewolf Gameplay.txt

🤖 ERNIE-4.5 is thinking
✨ Answer: Based on the provided reference materials, here is a detailed answer about werewolf gameplay:

### Core Strategies and Techniques for Werewolf Gameplay

1. **Round Pressure Response Table**  
   - **3 werewolves**: Aggressively fake identity, quickly compress the information space of good players, confuse the good camp through frequent speeches or misleading information.  
   - **2 werewolves**: 1 werewolf feigns allegiance + 1 werewolf hides deeply. The feigning werewolf pretends to be a good player to vote out key special roles, while the deep-hiding werewolf waits for the right moment.  
   - **1 werewolf**: Hide completely, exploit logical loopholes of good players to vote out key special roles (such as seer, witch), avoid early exposure.  
   *Source: Reference 3 (Werewolf Gameplay.txt)*

2. **Advanced Thinking Models**  



❓ Please enter your question:  Villager gameplay



🔍 Searching for relevant materials...


2025-07-11 20:54:41,438 - ChromaBuilder - INFO - Query 'Villager gameplay...' returned 3 results
2025-07-11 20:54:41,447 - httpx - INFO - HTTP Request: POST http://0.0.0.0:7000/v1/chat/completions "HTTP/1.1 200 OK"


📚 Found 3 relevant materials, sources: knowledge_data/Hunter Gameplay.txt, knowledge_data/Villager Gameplay.txt, knowledge_data/Werewolf Gameplay.txt

🤖 ERNIE-4.5 is thinking
✨ Answer: Based on the provided reference materials, the core goal of villager gameplay is to build a multi-dimensional perspective analysis network through information integration and logical modeling, assist special roles in accurately identifying werewolves, while avoiding becoming a round-consuming item. However, the reference materials do not directly provide specific operational strategies or detailed steps for villager gameplay.

In Werewolf game, villagers (ordinary players) typically do not have special abilities but can assist special role players (such as seers, witches, hunters, etc.) in reasoning by observing other players' behaviors, speeches, and voting patterns. Here are some general strategies that villager players might adopt:

1. **Observation and Recording**: Villager players should clos
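
One detail of `retrieve_relevant_docs` deserves a note: it computes `similarity = 1 - distance`. Whether that value is a meaningful similarity depends on the distance metric of the Chroma collection. Chroma defines cosine and inner-product distances as one minus the respective score, while its default `l2` space returns squared Euclidean distance, for which `1 - distance` can be negative. The sketch below shows one way to map distances to scores per metric. The function name and the `1 / (1 + d)` mapping for `l2` are illustrative choices, not part of Chroma, and which metric `ChromaKnowledgeBase` configures is not shown in this tutorial.

```python
def distance_to_similarity(distance: float, space: str = "l2") -> float:
    """Map a distance to a score where larger means more similar."""
    if space in ("cosine", "ip"):
        # These distances are defined as 1 - score, so invert directly
        return 1.0 - distance
    if space == "l2":
        # Squared L2 distance is unbounded; squash it into (0, 1]
        return 1.0 / (1.0 + distance)
    raise ValueError(f"Unknown distance space: {space}")


print(distance_to_similarity(0.25, "cosine"))
print(distance_to_similarity(3.0, "l2"))
```

Scores from different metrics are not comparable with each other; they are only useful for ranking results within one collection.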

## VI. Industry Application Scenarios of RAG

### 6.1 Enterprise Customer Service
- Connect product manuals and after-sales procedures through RAG so that customer service bots can pull up the latest materials when answering user questions, reducing manual intervention
- **Localization advantage**: Enterprise data never leaves the company, ensuring data security

### 6.2 Medical Assistance
- A doctor enters the patient's symptoms; RAG retrieves the latest clinical guidelines and cases to help the doctor assess possible causes (professional review is still required)
- **Localization advantage**: Patient privacy data is processed entirely locally

### 6.3 Educational Tutoring
- When a student asks a math question, RAG finds the relevant knowledge points in textbooks and exercise sets, and the model explains them based on the textbook content, keeping answers in step with what is taught in class
- **Localization advantage**: Teaching content is controllable, no need to worry about network connection issues

### 6.4 Legal Retrieval
- A lawyer enters the case details; RAG retrieves the relevant laws and precedents, and the model drafts a legal analysis, which speeds up research
- **Localization advantage**: Case information remains confidential, meeting data security requirements of the legal industry

## VII. Optimization Tips for Beginners

1. **Text Chunking Optimization**: When splitting long documents, split by "semantic completeness" (e.g., by paragraphs) to avoid cutting off sentences
   ```python
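   # On recent LangChain releases this class lives in the langchain-text-splitters package:
   # from langchain_text_splitters import RecursiveCharacterTextSplitter
   from langchain.text_splitter import RecursiveCharacterTextSplitter

   long_document = "First paragraph.\n\nSecond paragraph."  # placeholder text
   text_splitter = RecursiveCharacterTextSplitter(
       chunk_size=500,      # maximum characters per chunk
       chunk_overlap=50,    # characters shared between neighboring chunks
       separators=["\n\n", "\n", "。", "！", "？", "；"]
   )
   chunks = text_splitter.split_text(long_document)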
   ```

2. **Prompt Optimization**: Clearly tell the model to "only answer based on the provided materials"
   ```
   Please answer strictly based on the following materials, do not mention information outside the materials. If materials are insufficient, directly say "Cannot answer this question based on the provided materials".
   Materials: {context}
   Question: {question}
   ```
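
For the chunking tip, the idea of splitting on sentence boundaries with some overlap can also be shown without any third-party library. The sketch below is a simplified illustration, not a replacement for a full text splitter: `split_into_chunks` is a hypothetical name, the sentence detection is a naive regular expression (it would also split at the period in "3.14"), and sentences are joined with a space, which suits English better than Chinese.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 500, overlap_sentences: int = 1) -> list:
    """Pack whole sentences into chunks of roughly chunk_size characters.

    The last overlap_sentences sentences of a chunk are repeated at the start
    of the next one, so a chunk can exceed chunk_size slightly.
    """
    # Split after sentence-ending punctuation (English and Chinese)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？；])\s*", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk over as context
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_into_chunks("One. Two. Three. Four.", chunk_size=10))
```

Because sentences are never cut in the middle, every chunk remains readable on its own, which is the point of splitting by semantic completeness.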

You have now walked through the core logic and a complete implementation of a RAG system built on a locally deployed ERNIE-4.5-21B-A3B model. Because everything runs locally, your data stays on your own machines while the system answers questions from your knowledge base, which makes this setup a solid starting point for enterprise RAG applications.