# 【ERNIE-4.5-VL-28B】Multimodal Intelligent Medical Consultation System Based on ERNIE-4.5-VL-28B-Paddle Local Deployment + RAG + Multi-Agent Collaboration

## Project Overview

This project builds an intelligent medical consultation system on top of a **locally deployed ERNIE-4.5-VL-28B-A3B-Paddle** multimodal large model, combining RAG knowledge base retrieval with multi-Agent collaboration. The model is served locally through the FastDeploy framework, and its multimodal understanding is paired with a ChromaDB knowledge base to provide users with medical consulting services.

### 🎯 Project Highlights

- **🏥 Complete Medical Scenarios**: Full-process intelligent consultation from symptom description to treatment recommendations
- **🖼️ Multimodal Fusion**: Supports mixed input of text + images, capable of analyzing medical images such as skin lesions  
- **🧠 Local Deployment**: Fully localized solution based on ERNIE-4.5-VL-28B-A3B-Paddle, ensuring data security and controllability
- **📚 Knowledge Base Driven**: Medical knowledge base built with ChromaDB, supporting intelligent retrieval of symptoms, diseases, and treatment plans
- **🤖 Multi-Agent Collaboration**: Collaborative work of professional Agents for symptom parsing, knowledge retrieval, diagnostic decision-making, etc.
- **⚡ High-Performance Inference**: FastDeploy acceleration framework, single-machine multi-card deployment, and optimized inference latency

### 🏗️ System Architecture Diagram

```mermaid
graph TD
    A[User Consultation Input] --> B[Gradio Frontend Interface]
    B --> C[MedicalConsultation]
    C --> D[AgentCoordinator]
    
    D --> E[ERNIE-4.5-VL Local Model]
    E --> F[FastDeploy Inference Engine]
    F --> G[Multimodal Understanding]
    
    D --> H[SymptomParserAgent]
    D --> I[KnowledgeRetrievalAgent] 
    D --> J[DiagnosisAgent]
    
    I --> K[ChromaDB Knowledge Base]
    K --> L[Symptom Database]
    K --> M[Disease Database]
    K --> N[Treatment Database]
    
    H --> O[Symptom Extraction]
    I --> P[Knowledge Retrieval]
    J --> Q[Risk Assessment]
    J --> R[Treatment Recommendations]
    
    O --> S[Diagnostic Result Integration]
    P --> S
    Q --> S
    R --> S
    
    S --> T[Structured Medical Report]
    T --> B
```

### 🏆 Technical Innovation Summary

This project realizes a complete technical chain from **large model local deployment** to **intelligent medical applications**:

1. **🔥 Core Breakthrough**: Efficient local deployment of the 28B-parameter ERNIE-4.5-VL multimodal large model
2. **🧠 Intelligent Upgrade**: RAG knowledge base retrieval + multi-Agent collaborative medical expert system  
3. **🛡️ Data Security**: Fully localized solution; patient data is processed on-premises and is not sent to external services
4. **⚡ Performance Optimization**: FastDeploy inference acceleration, targeting responses within seconds for medical consultations

### 🛠️ Technology Stack Selection

| Layer | Technical Component | Version | Function |
|------|---------|------|------|
| **AI Model Layer** | ERNIE-4.5-VL-28B-A3B-Paddle | 28B parameters | Multimodal understanding and generation |
| **Inference Framework** | FastDeploy | Latest version | Model deployment and inference acceleration |
| **Knowledge Base** | ChromaDB | 1.0.15 | Vector database and semantic retrieval |
| **Web Framework** | Gradio | 5.35.0 | Interactive user interface |
| **Agent Framework** | Self-developed multi-Agent system | - | Task coordination and business logic |
| **Data Processing** | Pillow + NumPy | 10.2.0 + 1.24.3 | Image processing and numerical calculation |

## 🏥 Implementation of Intelligent Medical Consultation System

### Core Function Modules

#### 1. Multimodal Input Processing
```python
import base64


class ErnieClient:
    def medical_image_analysis(self, image_path: str) -> str:
        """Medical image analysis"""
        with open(image_path, "rb") as image_file:
            base64_image = base64.b64encode(image_file.read()).decode("utf-8")
        
        messages = [{
            "role": "user", 
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                {"type": "text", "text": "Please analyze this medical image and describe the visible symptom characteristics"}
            ]
        }]
        # Call local ERNIE-4.5-VL model
        return self._call_local_model(messages)
```
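The method above hard-codes `image/jpeg` in the data URL. As a minimal sketch, assuming users may also upload PNG or other image formats, the message content could be built with the MIME type guessed from the file extension. `build_image_content` is a hypothetical helper name used for illustration and is not part of the project code.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_content(image_path: str, prompt: str) -> list:
    """Build an OpenAI-style multimodal `content` list for one image and one prompt."""
    # Guess the MIME type from the file extension; fall back to JPEG as in the original code.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return [
        {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        {"type": "text", "text": prompt},
    ]
```

The returned list can be used as the `content` value of the user message in place of the inline literal.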

#### 2. Knowledge Base Retrieval System
```python
import chromadb


class KnowledgeBase:
    def __init__(self, persist_directory="medical_kb"):
        # Build medical knowledge base using ChromaDB
        self.client = chromadb.PersistentClient(path=persist_directory)
        
        # Create professional medical collections
        self.symptoms_collection = self.client.get_or_create_collection("symptoms")
        self.diseases_collection = self.client.get_or_create_collection("diseases")  
        self.treatments_collection = self.client.get_or_create_collection("treatments")
```

#### 3. Multi-Agent Collaboration System
```python
from typing import Optional


class AgentCoordinator:
    def process_consultation(self, text_input: str, image_path: Optional[str] = None):
        # 1. Image analysis (if provided)
        image_analysis = self.ernie.medical_image_analysis(image_path) if image_path else None
        
        # 2. Symptom parsing Agent
        symptoms = self.symptom_parser.parse_symptoms(text_input, image_analysis)
        
        # 3. Knowledge retrieval Agent  
        medical_info = self.knowledge_retriever.retrieve_relevant_info(symptoms)
        
        # 4. Diagnostic decision Agent
        risk_assessment = self.diagnosis_agent.analyze_risk_level(symptoms, medical_info)
        treatment_plan = self.diagnosis_agent.generate_treatment_plan(symptoms, medical_info)
        
        return {
            "symptoms": symptoms,
            "risk_assessment": risk_assessment, 
            "treatment_plan": treatment_plan,
            "image_analysis": image_analysis
        }
```

### 🎯 Application Scenarios and Effects

#### Typical Usage Process
1. **User Input**: Describe symptoms + upload lesion images (optional)
2. **Multimodal Analysis**: ERNIE-4.5-VL understands both text and images simultaneously
3. **Symptom Extraction**: AI identifies key symptoms and medical terms
4. **Knowledge Retrieval**: Retrieve relevant disease information from professional medical databases
5. **Risk Assessment**: Evaluate the severity of the condition and the urgency of medical treatment
6. **Treatment Recommendations**: Generate personalized suggestions for examinations, medications, and lifestyle

#### System Advantages
- **Data Security**: Fully local deployment, ensuring patient data remains on-site
- **Professional Accuracy**: Built on a general-purpose 28B-parameter multimodal model whose answers are grounded in a dedicated medical knowledge base
- **Fast Response**: Local inference without network latency
- **Continuous Learning**: The knowledge base can be continuously expanded and updated

## 🧠 Selection of ERNIE-4.5-VL-28B-A3B-Paddle Model

### Why Choose ERNIE-4.5-VL-28B-A3B-Paddle?

#### Selection of Wenxin Open-Source Models

### ERNIE-4.5 Model Series Specification Comparison Table

| Model Series | Model Name | Total Parameters | Activated Parameters | Modal Support | Context Length | Main Usage | Deployment Scenario |
|---------|---------|--------|---------|---------|-----------|---------|---------|
| **A47B Large-Scale** | ERNIE-4.5-300B-A47B-Base | 300B | 47B | Text | 128K | Pre-training base | Cloud GPU cluster |
| | ERNIE-4.5-300B-A47B | 300B | 47B | Text | 128K | Instruction following/creative generation | Cloud GPU cluster |
| | ERNIE-4.5-VL-424B-A47B-Base | 424B | 47B | Text+Vision | 128K | Multimodal pre-training | Cloud GPU cluster |
| | ERNIE-4.5-VL-424B-A47B | 424B | 47B | Text+Vision | 128K | Image-text understanding/generation | Cloud GPU cluster |
| **A3B Medium-Scale** | ERNIE-4.5-21B-A3B-Base | 21B | 3B | Text | 128K | Pre-training base | Single-machine multi-card |
| | ERNIE-4.5-21B-A3B | 21B | 3B | Text | 128K | Dialogue/document processing | Single-machine multi-card |
| | ERNIE-4.5-VL-28B-A3B-Base | 28B | 3B | Text+Vision | 128K | Multimodal pre-training | Single-machine multi-card |
| | ERNIE-4.5-VL-28B-A3B | 28B | 3B | Text+Vision | 128K | Lightweight multimodal applications | Single-machine multi-card |
| **0.3B Lightweight** | ERNIE-4.5-0.3B-Base | 0.3B | 0.3B | Text | 128K | End-side pre-training | Mobile/edge |
| | ERNIE-4.5-0.3B | 0.3B | 0.3B | Text | 128K | Real-time dialogue | Mobile/edge |

### Model Specification Selection Strategy Table

| Application Scenario | Recommended Model | Reason | Hardware Requirements | Inference Latency |
|---------|---------|------|---------|---------|
| **Complex reasoning tasks** | ERNIE-4.5-300B-A47B | Strongest reasoning ability | 8×A100(80GB) | High |
| **Creative content generation** | ERNIE-4.5-300B-A47B | Best creative performance | 8×A100(80GB) | High |
| **Multimodal understanding** | ERNIE-4.5-VL-424B-A47B | Image-text fusion understanding | 8×A100(80GB) | High |
| **Daily dialogue customer service** | ERNIE-4.5-21B-A3B | Balanced performance and cost | 4×V100(32GB) | Medium |
| **Document information extraction** | ERNIE-4.5-21B-A3B | Sufficient understanding ability | 4×V100(32GB) | Medium |
| **Lightweight multimodal** | ERNIE-4.5-VL-28B-A3B | Balanced image-text processing | 4×V100(32GB) | Medium |
| **Mobile applications** | ERNIE-4.5-0.3B | Low latency and fast response | 1×GPU/CPU | Low |
| **Edge computing** | ERNIE-4.5-0.3B | Minimal resource consumption | CPU/NPU | Low |

### Why Choose ERNIE-4.5-VL-28B-A3B-Paddle Model?

#### 1. **Optimal Balance Between Performance and Cost**
- **Moderate parameter scale**: 28B total parameters, 3B activated parameters, ensuring reasoning ability while controlling computing costs
- **Reasonable hardware requirements**: Supports single-machine multi-card deployment (4×V100 or 2×A100), reducing hardware threshold by 75% compared to A47B series
- **Moderate inference latency**: 3-5 times faster response speed than large-scale models while ensuring output quality

#### 2. **Outstanding Multimodal Capabilities**
- **Text+vision fusion**: Natively supports image-text understanding; the vision encoder is part of the model, so no external visual encoder needs to be attached
- **Long context support**: 128K token context length, capable of processing long documents and multiple images
- **Rich application scenarios**: Suitable for document analysis, image description, multimodal question answering, etc.

#### 3. **Deployment Friendliness**
- **AIStudio platform optimization**: Official in-depth adaptation, providing one-click download and deployment
- **FastDeploy integration**: Complete inference acceleration and service support
- **Open-source ecosystem**: PaddlePaddle ecosystem with complete documentation and active community

#### 4. **Practical Application Value**
- **Enterprise-grade availability**: Significantly improved understanding and generation quality compared to 0.3B models
- **Controllable cost**: 70% lower deployment cost and 60% lower operating cost compared to A47B series
- **Strong scalability**: Supports LoRA fine-tuning, which can be optimized for specific scenarios

#### 5. **Technical Advancement**
- **MoE architecture**: Mixture of Experts model with activated parameters only 1/9 of total parameters, high inference efficiency
- **Multimodal alignment**: Deep fusion of visual and text features, with understanding ability close to GPT-4V
- **Chinese optimization**: In-depth optimization for Chinese scenarios, excellent performance in Chinese multimodal tasks
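The "1/9" figure for the MoE architecture can be checked with simple arithmetic. The sketch below only divides the activated parameter count by the total parameter count for the MoE models in the specification table; the dictionary and function names are illustrative.

```python
# (total, activated) parameter counts in billions, copied from the specification table
MODELS = {
    "ERNIE-4.5-300B-A47B": (300, 47),
    "ERNIE-4.5-VL-424B-A47B": (424, 47),
    "ERNIE-4.5-21B-A3B": (21, 3),
    "ERNIE-4.5-VL-28B-A3B": (28, 3),
}


def activation_ratio(total_b: float, active_b: float) -> float:
    """Fraction of parameters that take part in computing each token."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    print(f"{name}: {activation_ratio(total, active):.1%} of parameters active per token")
```

For ERNIE-4.5-VL-28B-A3B this gives 3/28, a little under 11%, which rounds to the 1/9 quoted above.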

#### Selection Recommendations
**Recommended scenarios**:
- Multimodal AI application development for small and medium-sized enterprises
- Multimodal experiments in educational and scientific research projects
- AI product prototype verification for individual developers
- Business systems requiring image-text understanding capabilities

**Not recommended scenarios**:
- Real-time systems with extremely high requirements for inference latency (choose 0.3B)
- Scenarios with sufficient budget and pursuit of ultimate performance (choose A47B)
- Pure text applications without multimodal needs (choose 21B-A3B)

## 🚀 ERNIE-4.5-VL Model Local Deployment Solution

### I. Environment Preparation
### 1. Hardware Requirements
- **GPU**: NVIDIA A100 80GB (supports single-card/multi-card, CUDA 11.8+ recommended)  
- **Memory**: ≥60GB RAM  
- **Storage**: ≥60GB at minimum (the 12 safetensors shards total roughly 55-60GB on disk, so reserve extra space for logs/cache)  

### 2. Software Dependencies

```python
%%capture
!python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

### 3. Environment Verification
#### ① Check CUDA Version

```python
!nvcc --version
```

Expected output example:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
```


#### ② Check GPU Information

```python
# Check GPU status and memory
!nvidia-smi
```

Expected output example:

```
Tue Jul  8 01:25:28 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A800-SXM4-80GB          On  |   00000000:D3:00.0 Off |                    0 |
| N/A   41C    P0             66W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
```

#### ③ Troubleshooting Common Environment Issues

```python
# Check PaddlePaddle GPU support
!python -c "import paddle; print('GPU available:', paddle.is_compiled_with_cuda()); print('Number of GPU devices:', paddle.device.cuda.device_count())"

# Check FastDeploy installation
!python -c "from fastdeploy import LLM, SamplingParams; print('FastDeploy installed successfully!')"

# Check OpenAI library version
!python -c "import openai; print('OpenAI library version:', openai.__version__)"
```

## II. Model Download and Directory Structure
### 1. Download Model Files

```python
%%capture
# Download model using AIStudio command
!aistudio download --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Paddle --local_dir /home/aistudio/work/models
```

```python
# View model files
!ls -l work/models
```

Output (truncated):

```
total 57441133
-rw-r--r-- 1 aistudio aistudio      11366 Jul  6 18:57 LICENSE
-rw-r--r-- 1 aistudio aistudio       9077 Jul  6 18:56 README.md
-rw-r--r-- 1 aistudio aistudio      86904 Jul  6 18:56 added_tokens.json
-rw-r--r-- 1 aistudio aistudio       1306 Jul  6 18:57 config.json
-rw-r--r-- 1 aistudio aistudio        134 Jul  6 18:57 generation_config.json
-rw-r--r-- 1 aistudio aistudio 4991326368 Jul  6 18:57 model-00001-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4988696384 Jul  6 18:56 model-00002-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4999185600 Jul  6 18:56 model-00003-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4995268296 Jul  6 18:57 model-00004-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4988696984 Jul  6 18:56 model-00005-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4999193256 Jul  6 18:57 model-00006-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4995261896 Jul  6 18:57 model-00007-of-00012.safetensors
...
```

### 2. Directory Structure
```
work/
└── models/
    ├── LICENSE                           # License file
    ├── README.md                         # Model description document
    ├── added_tokens.json                 # Added token configuration
    ├── config.json                       # Model configuration file
    ├── generation_config.json            # Generation configuration file
    ├── model-00001-of-00012.safetensors  # Model parameter file (shard 1/12)
    ├── model-00002-of-00012.safetensors  # Model parameter file (shard 2/12)
    ├── model-00003-of-00012.safetensors  # Model parameter file (shard 3/12)
    ├── model-00004-of-00012.safetensors  # Model parameter file (shard 4/12)
    ├── model-00005-of-00012.safetensors  # Model parameter file (shard 5/12)
    ├── model-00006-of-00012.safetensors  # Model parameter file (shard 6/12)
    ├── model-00007-of-00012.safetensors  # Model parameter file (shard 7/12)
    ├── model-00008-of-00012.safetensors  # Model parameter file (shard 8/12)
    ├── model-00009-of-00012.safetensors  # Model parameter file (shard 9/12)
    ├── model-00010-of-00012.safetensors  # Model parameter file (shard 10/12)
    ├── model-00011-of-00012.safetensors  # Model parameter file (shard 11/12)
    ├── model-00012-of-00012.safetensors  # Model parameter file (shard 12/12)
    ├── model.safetensors.index.json      # Model shard index file
    ├── preprocessor_config.json          # Preprocessor configuration
    ├── special_tokens_map.json           # Special token mapping
    ├── tokenizer.model                   # Tokenizer model file
    └── tokenizer_config.json             # Tokenizer configuration file
```

### 3. File Description
| File Type | File Name | Description |
|---------|--------|------|
| **Model weights** | model-00001~00012-of-00012.safetensors | Model parameter shard files in safetensors format |
| **Index file** | model.safetensors.index.json | Model shard index, specifying which shard each parameter is in |
| **Configuration file** | config.json | Model architecture configuration, including number of layers, hidden layer size, etc. |
| **Generation configuration** | generation_config.json | Text generation-related configurations, such as maximum length, sampling parameters, etc. |
| **Tokenizer** | tokenizer.model | SentencePiece tokenizer model |
| **Tokenizer configuration** | tokenizer_config.json | Tokenizer configuration parameters |
| **Preprocessor** | preprocessor_config.json | Image preprocessing configuration (for multimodal models only) |
| **Special tokens** | special_tokens_map.json | Special token mapping, such as padding, unknown, etc. |
| **Added tokens** | added_tokens.json | User-defined added tokens |
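Because the weights are split across 12 shards, an interrupted download can leave the directory incomplete. The following is a minimal sketch of a completeness check, assuming `model.safetensors.index.json` uses the common layout with a `weight_map` object that maps each parameter name to its shard file name (an assumption; inspect the file if the check fails). `check_model_dir` is a hypothetical helper name.

```python
import json
from pathlib import Path


def check_model_dir(model_dir: str) -> dict:
    """Report missing shard files and the total weight size for a sharded checkpoint."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text(encoding="utf-8"))
    # Assumption: the index has the layout {"weight_map": {param_name: shard_file_name}}.
    expected = set(index["weight_map"].values())
    shards = list(root.glob("*.safetensors"))
    present = {p.name for p in shards}
    total_bytes = sum(p.stat().st_size for p in shards)
    return {"missing": sorted(expected - present), "total_gb": total_bytes / 1e9}


# Example: print(check_model_dir("work/models"))
```

An empty `missing` list means every shard referenced by the index is present on disk.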

## III. Start Service (Key Commands)

```bash
python -m fastdeploy.entrypoints.openai.api_server \
       --model work/models \
       --port 8180 \
       --metrics-port 8181 \
       --engine-worker-queue-port 8182 \
       --max-model-len 32768 \
       --enable-mm \
       --reasoning-parser ernie-45-vl \
       --max-num-seqs 32
```

Example startup log:

```
[2025-07-06 18:57:55,035] [    INFO] - loading configuration file work/models/preprocessor_config.json
INFO     2025-07-06 18:57:58,188 5574  engine.py[line:206] Waitting worker processes ready...
Loading Weights: 100%|████████████████████████| 100/100 [01:17<00:00,  1.30it/s]
Loading Layers: 100%|█████████████████████████| 100/100 [00:07<00:00, 13.31it/s]
INFO     2025-07-06 18:59:35,857 5574  engine.py[line:276] Worker processes are launched with 119.6253821849823 seconds.
INFO     2025-07-06 18:59:35,857 5574  api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
INFO     2025-07-06 18:59:35,857 5574  api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
INFO     2025-07-06 18:59:35,857 5574  api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO:     Started server process [5574]
INFO:     Waiting for application startup.
```


### Detailed Description of Service Startup Parameters

| Parameter Name | Description | Default Value | Example Value |
|---------|---------|-------|-------|
| `--model` | Path to model files, directory containing model weights and configuration files | Required | `work/models` |
| `--port` | API service listening port, through which clients access the service | 8000 | `8180` |
| `--metrics-port` | Monitoring metrics service port, used for performance monitoring and health checks | 8001 | `8181` |
| `--engine-worker-queue-port` | Engine worker queue port, used for internal task scheduling | 8002 | `8182` |
| `--max-model-len` | Maximum sequence length (number of tokens) supported by the model | 2048 | `32768` |
| `--enable-mm` | Enable multimodal function (text+image processing) | False | `Enabled` |
| `--reasoning-parser` | Reasoning parser type, selecting how the model's reasoning content is parsed out of its output | None | `ernie-45-vl` |
| `--max-num-seqs` | Maximum number of concurrent sequences, controlling batch size | 256 | `32` |
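When the service is launched from a script instead of a notebook cell, the same flags can be assembled programmatically. This is a small sketch that only reuses the flags and example values from the table above; `build_server_command` is a hypothetical helper, and deriving the two auxiliary ports from the main port is an illustrative convention, not a FastDeploy requirement.

```python
import shlex
import sys


def build_server_command(model_dir: str = "work/models", port: int = 8180) -> list:
    """Assemble the argument list for the FastDeploy OpenAI-compatible API server."""
    return [
        sys.executable, "-m", "fastdeploy.entrypoints.openai.api_server",
        "--model", model_dir,
        "--port", str(port),
        "--metrics-port", str(port + 1),               # 8181 in the example above
        "--engine-worker-queue-port", str(port + 2),   # 8182 in the example above
        "--max-model-len", "32768",
        "--enable-mm",
        "--reasoning-parser", "ernie-45-vl",
        "--max-num-seqs", "32",
    ]


cmd = build_server_command()
print(shlex.join(cmd))
# To actually start the server: subprocess.Popen(cmd)
```

Passing the arguments as a list avoids shell quoting problems if the model path contains spaces.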

### Other Common Parameters

| Parameter Name | Description | Default Value | Remarks |
|---------|---------|-------|------|
| `--host` | Host address bound by the service | 0.0.0.0 | Set to 0.0.0.0 to allow external access |
| `--trust-remote-code` | Trust remote code execution | False | Required when loading custom models |
| `--tensor-parallel-size` | Tensor parallel size (multi-GPU) | 1 | Set according to the number of GPUs |
| `--gpu-memory-utilization` | GPU memory utilization | 0.9 | Recommended between 0.8-0.95 |
| `--max-num-batched-tokens` | Maximum number of batched tokens | Automatically calculated | Adjust according to GPU memory |
| `--swap-space` | Swap space size (GB) | 4 | Used when memory is insufficient |
| `--enable-lora` | Enable LoRA adapter | False | Used when fine-tuning models |
| `--max-log-len` | Maximum log length | Unlimited | Control log file size |

### Service Verification
```bash
# Check if the model is loaded correctly
curl http://localhost:8180/v1/models

# Expected output (contains model ID)
{"data":[{"id":"ernie-4.5-vl-28b-a3b-paddle","object":"model"}]}
```
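Loading the weights took about two minutes in the startup log, so scripts that call the service right after launching it may need to wait. Below is a minimal polling sketch using only the standard library; `wait_until_ready` is a hypothetical helper, and the timeout and interval values are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval)
    return False


# Example: wait_until_ready("http://localhost:8180/v1/models")
```

The function returns `False` instead of raising, so the caller decides whether to abort or keep waiting.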

## IV. Model Calling Examples

### 1. OpenAI Library Calling Method

#### ① Text Generation
```python
import openai

host = "localhost"  # connect via localhost; 0.0.0.0 is a bind address, not a client target
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": "You are an intelligent assistant developed by Aistudio and Wenxin large model. Please introduce yourself."}
    ],
    stream=True,
)

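for chunk in response:
    # Skip chunks without text (e.g. a role-only chunk), which would otherwise print "None"
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()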
```

#### ② Image Description Generation
```python
import openai
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

client = openai.Client(
    base_url="http://localhost:8180/v1",
    api_key="null"
)

response = client.chat.completions.create(
    model="null",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('1.jpg')}"
                    }
                },
                {
                    "type": "text",
                    "text": "Generate a description of this image"
                }
            ]
        }
    ],
    stream=True,
)

print("Image description: ", end='', flush=True)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()
```

### 2. Requests Library Calling Method

#### ① Text Generation
```python
import requests
import json

url = "http://localhost:8180/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
    "model": "null",
    "messages": [
        {"role": "user", "content": "You are an intelligent assistant developed by Aistudio and Wenxin large model. Please introduce yourself."}
    ],
    "stream": True
}

response = requests.post(url, headers=headers, json=payload, stream=True)

for line in response.iter_lines():
    if not line:
        continue
    line = line.decode('utf-8')
    # Strip only the leading SSE "data: " prefix so generated text containing it stays intact
    if line.startswith('data: '):
        line = line[len('data: '):]

    if line.strip() == '[DONE]':
        break
    try:
        data = json.loads(line)
    except json.JSONDecodeError as e:
        print(f"Error decoding line: {line}")
        print(f"Error: {e}")
        continue

    if 'choices' in data and len(data['choices']) > 0:
        delta = data['choices'][0].get('delta', {})
        content = delta.get('content', '')
        if content:
            print(content, end='', flush=True)

print()
```

#### ② Image Description Generation
```python
import requests
import json
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

url = "http://localhost:8180/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
    "model": "null",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('1.jpg')}"}},
                {"type": "text", "text": "Generate image description"}
            ]
        }
    ],
    "stream": True
}

response = requests.post(url, headers=headers, json=payload, stream=True)

for line in response.iter_lines():
    if not line:
        continue
    line = line.decode('utf-8')
    # Strip only the leading SSE "data: " prefix so generated text containing it stays intact
    if line.startswith('data: '):
        line = line[len('data: '):]

    if line.strip() == '[DONE]':
        break
    try:
        data = json.loads(line)
    except json.JSONDecodeError as e:
        print(f"Error decoding line: {line}")
        print(f"Error: {e}")
        continue

    if 'choices' in data and len(data['choices']) > 0:
        delta = data['choices'][0].get('delta', {})
        content = delta.get('content', '')
        if content:
            print(content, end='', flush=True)

print()
```
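Both `requests` examples repeat the same stream-parsing logic. One way to share it is a small helper that turns a raw server-sent-event line into the text delta it carries. This is a sketch under the assumption that the service emits OpenAI-style `data: {...}` chunks terminated by `data: [DONE]`, as the examples above expect; `parse_sse_line` is a hypothetical name.

```python
import json
from typing import Optional


def parse_sse_line(raw: bytes) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if it carries none."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines or other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None  # malformed chunk; the caller may choose to log it
    choices = data.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content") or None
```

With this helper the loop body reduces to calling `parse_sse_line(line)` and printing the result when it is not `None`.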

### 3. Description of Calling Parameters

#### Common Parameter Configuration
```python
# Complete calling parameter example
response = client.chat.completions.create(
    model="null",                    # Model name (fixed value)
    messages=[...],                  # Message list
    stream=True,                     # Whether to enable streaming response
    max_tokens=2048,                 # Maximum number of generated tokens
    temperature=0.7,                 # Temperature parameter (0.0-2.0)
    top_p=0.9,                      # Nucleus sampling parameter (0.0-1.0)
    frequency_penalty=0.0,           # Frequency penalty
    presence_penalty=0.0,            # Presence penalty
    stop=["<|endoftext|>"],         # Stop word list
)
```

#### Detailed Parameter Description
| Parameter Name | Type | Default Value | Description |
|---------|------|--------|------|
| `model` | str | "null" | Model name, fixed as "null" for local deployment |
| `messages` | list | Required | List of dialogue messages, including role and content |
| `stream` | bool | False | Whether to enable streaming response |
| `max_tokens` | int | Automatic | Maximum number of generated tokens |
| `temperature` | float | 1.0 | Controls randomness, higher values are more random |
| `top_p` | float | 1.0 | Nucleus sampling, controls the range of vocabulary selection |
| `frequency_penalty` | float | 0.0 | Frequency penalty, reduces repetitive content |
| `presence_penalty` | float | 0.0 | Presence penalty, encourages talking about new topics |
| `stop` | list | None | List of strings to stop generation |
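To make the effect of `temperature` and `top_p` concrete, the sketch below applies the textbook definitions to a toy list of logits. It illustrates the general technique only and is not FastDeploy's sampling implementation; the function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        mass += p
        if mass >= top_p:
            break
    return {idx: p / mass for idx, p in kept}


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: the top token dominates
print(softmax_with_temperature(logits, 1.5))  # flatter: more randomness
print(nucleus([0.6, 0.3, 0.1], top_p=0.8))    # the least likely token is dropped
```

For a consultation system, lower values for both parameters generally make the output more repeatable.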

## V. Common Problems and Solutions
### 1. Port Occupation Problem
```bash
# Check process occupying the port
lsof -i:8180

# Terminate the process (replace <PID> with the actual process ID)
kill -9 <PID>
```
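The same check can be done from Python before starting the service, which is convenient in a launch script. This is a minimal sketch using the standard `socket` module; `is_port_in_use` is a hypothetical helper name.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


for port in (8180, 8181, 8182):
    print(port, "in use" if is_port_in_use(port) else "free")
```

All three ports used by the startup command (API, metrics, and worker queue) need to be free.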

### 2. Model Loading Failure
- **Check directory**: Ensure the complete model files (the 12 `.safetensors` shards, `model.safetensors.index.json`, `config.json`, `tokenizer.model`) exist under `work/models`  
- **Parser parameters**: Ensure the startup command includes `--reasoning-parser ernie-45-vl`  
- **Driver version**: NVIDIA driver must be ≥520.61.05 (supports A100)  

### 3. Abnormal Streaming Response
- Ensure the `stream=True` parameter is correctly passed  
- Check service logs (`fastdeploy_server.log`) for insufficient memory/video memory errors  


## VI. Test Cases
### 1. Text Generation Verification
- **Input**: `"Summarize the core steps of this tutorial"`  
- **Expected**: Output coherent text containing keywords such as "environment configuration", "service startup", "interface calling", etc.  

### 2. Image Description Verification
- **Test image**: Use jpg files containing natural scenes (such as landscapes, people's activities)  
- **Expected**: Output sentences containing scene features (such as "green grassland under blue sky and white clouds") and action descriptions (such as "people walking by the lake")  

## VII. Intelligent Medical Consultation System Deployment

### 1. System Dependency Installation
```bash
# Install the project dependencies
pip install -r requirements.txt
```

### 2. Medical Knowledge Base Initialization
```bash
# Initialize ChromaDB medical knowledge base
python init_knowledge_base.py

# After initialization, the following collections will be created:
# - symptoms: Symptom knowledge base
# - diseases: Disease knowledge base  
# - treatments: Treatment plan base
```

### 3. Start the Complete System
```bash
# Step 1: Start ERNIE-4.5-VL model service
python -m fastdeploy.entrypoints.openai.api_server \
       --model work/models \
       --port 8180 \
       --enable-mm \
       --reasoning-parser ernie-45-vl

# Step 2: Start the medical consultation Web interface  
python main.gradio.py
```

### 4. System Verification
```bash
# Run system tests
python test_system.py

# Expected output:
# ✅ ERNIE service connection successful
# ✅ Knowledge base connection normal  
# ✅ Symptom analysis function normal
# ✅ Consultation process completed
```

## VIII. Function Display and Effects

### 🖼️ System Interface
![Intelligent Medical Consultation System Main Interface](https://ai-studio-static-online.cdn.bcebos.com/9071eee410474644988213a51be33c75de6f65b21f5b44328d25094600a98361)

### 📋 Diagnostic Report Example
![Multimodal Consultation Result](https://ai-studio-static-online.cdn.bcebos.com/b899c223a17849f3ae0414b569f480bc24c98fc54fbf4923b57e24eb15850c65)

### 🔍 Image Analysis Capability  
![Skin Lesion Image Analysis](https://ai-studio-static-online.cdn.bcebos.com/be14a288a7a848a783b700d5292b767eddf11bee1253412c95460e0619dadba2)

### 💊 Treatment Recommendation Output Format
```
【Symptom Analysis】
Identified symptoms: fever, cough, fatigue

【Risk Assessment】  
Risk level: ⚠️⚠️ (Recommend ordinary outpatient visit)
Recommendations:
- Symptoms have persisted for a short time, no serious complications
- It is recommended to seek medical attention promptly to rule out infectious diseases

【Recommended Examination Items】
- Routine blood test
- C-reactive protein test
- Chest X-ray

【Medication Recommendations】
- Symptomatic treatment: Ibuprofen for fever reduction
- Cough relief and phlegm reduction: Compound Glycyrrhiza Tablets
- Please take medication as directed by a doctor, avoid self-medication

【Lifestyle Recommendations】  
- Adequate rest, avoid strenuous exercise
- Drink plenty of water, keep indoor ventilation
- Keep warm, avoid catching cold again
```
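A report in this layout can be produced with plain string formatting. The sketch below is one possible renderer, not the project's actual implementation; `format_report` and its parameters are hypothetical names.

```python
def format_report(symptoms, risk_summary, sections):
    """Render a consultation result as bracketed sections like the sample above.

    `sections` maps a section title to a list of bullet strings.
    """
    lines = ["【Symptom Analysis】", "Identified symptoms: " + ", ".join(symptoms), ""]
    lines += ["【Risk Assessment】", f"Risk level: {risk_summary}", ""]
    for title, items in sections.items():
        lines.append(f"【{title}】")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines).rstrip()


report = format_report(
    ["fever", "cough", "fatigue"],
    "⚠️⚠️ (Recommend ordinary outpatient visit)",
    {
        "Recommended Examination Items": ["Routine blood test", "Chest X-ray"],
        "Lifestyle Recommendations": ["Adequate rest, avoid strenuous exercise"],
    },
)
print(report)
```

Keeping the rendering separate from the Agents lets the same result dictionary feed both the Gradio interface and a saved report.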

## IX. Technical Innovation Points

### 🔬 Core Technical Advantages

#### 1. Localized Multimodal Large Model
- **Model scale**: MoE architecture with 28B total parameters, of which about 3B are activated per token
- **Multimodal capability**: Natively supports joint understanding of text and images
- **Deployment optimization**: Served with the FastDeploy framework for efficient single-machine, multi-GPU inference
- **Data security**: Inference runs entirely on local hardware, so patient data does not need to leave the deployment environment

#### 2. RAG-Enhanced Knowledge System
- **Vectorized storage**: Efficient semantic retrieval built with ChromaDB
- **Hierarchical knowledge base**: Structured management of symptoms, diseases, and treatment plans
- **Real-time retrieval**: Millisecond-level similarity matching and knowledge recall
- **Dynamic update**: Supports incremental update and expansion of the knowledge base
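The retrieval step can be illustrated without any external service. The sketch below is a toy stand-in, not the project's ChromaDB code: it replaces learned embeddings with bag-of-words counts and ranks documents by cosine similarity. The sample documents and the names `embed`, `cosine`, and `retrieve` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Return up to top_k documents with non-zero similarity, best match first."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(doc)), doc) for doc in documents),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [doc for score, doc in scored[:top_k] if score > 0]


# Hypothetical knowledge snippets, for illustration only
knowledge = [
    "fever and cough suggest a respiratory infection",
    "itchy red rash on the skin suggests dermatitis",
    "chest pain with shortness of breath needs urgent care",
]
hits = retrieve("red rash on skin", knowledge, top_k=1)
```

In the real system, a vector store takes over the embedding, indexing, and persistence that this sketch does by brute force.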

#### 3. Multi-Agent Collaboration Architecture
- **Modular design**: Each Agent is responsible for specific medical tasks
- **Intelligent orchestration**: AgentCoordinator for unified scheduling and data flow management
- **Fault-tolerance mechanism**: Failure of a single Agent does not affect the overall system operation
- **Scalability**: New medical-specialty Agents can be added without modifying existing ones
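The orchestration and fault-tolerance ideas above can be sketched in a few lines. This is a hedged illustration, not the project's implementation: only the name `AgentCoordinator` comes from the architecture, while the `register`/`run` interface, the shared context dictionary, and the three toy agents are assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple

# An agent reads the shared context and returns the keys it wants to add
Agent = Callable[[Dict[str, Any]], Dict[str, Any]]


class AgentCoordinator:
    """Runs registered agents in order and isolates their failures."""

    def __init__(self) -> None:
        self._agents: List[Tuple[str, Agent]] = []

    def register(self, name: str, agent: Agent) -> None:
        self._agents.append((name, agent))

    def run(self, user_input: str) -> Dict[str, Any]:
        context: Dict[str, Any] = {"input": user_input, "errors": {}}
        for name, agent in self._agents:
            try:
                context.update(agent(context))
            except Exception as exc:
                # Record the failure and let the remaining agents continue
                context["errors"][name] = str(exc)
        return context


def symptom_parser(ctx: Dict[str, Any]) -> Dict[str, Any]:
    known = ["fever", "cough", "fatigue"]  # toy vocabulary
    text = ctx["input"].lower()
    return {"symptoms": [s for s in known if s in text]}


def knowledge_retrieval(ctx: Dict[str, Any]) -> Dict[str, Any]:
    raise RuntimeError("knowledge base unavailable")  # simulated outage


def diagnosis(ctx: Dict[str, Any]) -> Dict[str, Any]:
    references = ctx.get("knowledge", [])  # degrade gracefully if retrieval failed
    return {"diagnosis": {"symptoms": ctx.get("symptoms", []),
                          "reference_count": len(references)}}


coordinator = AgentCoordinator()
coordinator.register("symptom_parser", symptom_parser)
coordinator.register("knowledge_retrieval", knowledge_retrieval)
coordinator.register("diagnosis", diagnosis)
result = coordinator.run("I have a fever and a cough")
```

Here the retrieval agent fails, yet the diagnosis agent still runs on the parsed symptoms, and the error is kept in `result["errors"]` for logging.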

#### 4. User Experience Optimization
- **Streaming response**: Real-time display of AI analysis process, improving interaction experience
- **Multi-device support**: The web interface can be used from both desktop and mobile browsers
- **Result visualization**: Structured medical reports for easy understanding and saving
- **Operation convenience**: Drag-and-drop image upload, quick text input
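Streaming display usually comes down to a generator that yields the text accumulated so far, which a UI framework such as Gradio can re-render on every yield. A minimal sketch, assuming the model output arrives as an iterable of text pieces (`stream_text` is a hypothetical helper name):

```python
from typing import Iterable, Iterator, Optional


def stream_text(chunks: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield the growing response text after each non-empty chunk."""
    buffer = ""
    for piece in chunks:
        if not piece:
            # Streaming APIs may emit chunks that carry no text; skip them
            continue
        buffer += piece
        yield buffer


# Each yielded value is the full text so far, ready to overwrite the output box
partial_outputs = list(stream_text(["Risk", None, " level:", "", " low"]))
```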

## X. Application Value and Scenarios

### 🏥 Medical Scenario Applications

#### Primary Medical Institutions
- **Preliminary consultation**: Assisting general practitioners in symptom analysis
- **Triage assistance**: Assessing the urgency of patients' conditions  
- **Knowledge support**: Providing doctors with reference for disease diagnosis and treatment

#### Telemedicine Services
- **Online consultation**: 24-hour intelligent medical consultation service
- **Image diagnosis**: Analysis of visually assessable conditions such as skin diseases and wounds
- **Health education**: Providing professional health management advice

#### Personal Health Management
- **Symptom self-check**: Users independently assess their health status
- **Medical guidance**: Providing scientific medical advice and department recommendations
- **Medication consultation**: Safe medication guidance based on symptoms

### 💡 Technical Value

#### Industry Promotion
- **AI medical standardization**: Technical paradigm for multimodal medical AI
- **Local deployment**: Providing solutions for medical data security
- **Open-source ecosystem**: Complete technology stack based on PaddlePaddle

#### Innovation Breakthroughs
- **Multimodal fusion**: Unified medical understanding across images and text
- **Knowledge base driven**: In-depth application of RAG technology to the medical domain
- **Agent collaboration**: A collaboration mechanism among specialized AI Agents

## XI. System Monitoring and Maintenance

### 📊 Performance Monitoring
```bash
# Check model service status
curl http://localhost:8180/v1/models

# Monitor system resource usage
nvidia-smi  # GPU usage
top         # CPU and memory usage

# View service logs
tail -f logs/gradio_app_*.log
```

### 📞 Contact Information
- **Project Author**: WeChat: X_ruilian

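```python
import openai

# Connect to the locally deployed model service through its OpenAI-compatible API
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": "You are an intelligent assistant developed by Aistudio and the Wenxin large model. Please introduce yourself."}
    ],
    stream=True,
)
# Print text as it arrives; skip chunks that carry no content
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
```

```
2025-07-06 21:47:22,130 - INFO - HTTP Request: POST http://0.0.0.0:8180/v1/chat/completions "HTTP/1.1 200 OK"
```
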
I am an intelligent assistant empowered by **Aistudio** (Baidu's open-source AI development platform) and the **Wenxin Large Model** technology framework. Here is an introduction to my core features and functions:

### 1. **Technical Background**
   - Based on the foundational capabilities of Baidu's self-developed **Wenxin Large Model**, integrating multimodal understanding and generation technologies, with strong natural language processing capabilities.
   - Relying on Aistudio's open-source ecosystem to support developer collaboration and model iteration optimization.

### 2. **Core Capabilities**
   - **Knowledge Q&A**: Covering a wide range of fields (technology, culture, daily life, etc.), providing accurate and concise answers.
   - **Text Generation**: Capable of writing articles, code, poems, dialogues, etc., supporting both creative and practical scenarios.
   - **Logical Reasoning**: Analyzing complex problems and providing structured thinking paths.
   - **Mu

```python
import openai
import base64

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

client = openai.Client(
    base_url="http://0.0.0.0:8180/v1",
    api_key="null"
)

# Send the image (as a base64 data URL) together with a text instruction
response = client.chat.completions.create(
    model="null",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('1.jpg')}"
                    }
                },
                {
                    "type": "text",
                    "text": "Generate a description of this image"
                }
            ]
        }
    ],
    stream=True,
)

print("Image description: ", end='', flush=True)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()
```

Image description:
This image shows the core architecture and functional modules of PaddleNLP (PaddlePaddle Natural Language Processing development library). The overall design is concise and clear, using blue and white colors to highlight a sense of technology. The image is divided into three main parts:

### 1. **Industrial-grade Predefined Tasks (Taskflow)**
   - **Natural Language Understanding**: Including lexical analysis, text error correction, sentiment analysis, and syntactic analysis, covering basic text processing and semantic understanding tasks.
   - **Natural Language Generation**: Supporting automatic couplet creation, intelligent poem writing, generative Q&A, and open-domain dialogue, reflecting the application of NLP in creative tasks.

### 2. **Industry-grade Model Library**
   - **Self-developed Pre-trained Models**: Listing multiple ERNIE series models (such as ERNIE-1.0, ERNIE-2.0, ERNIE-Tiny, etc.) as well as models like PLATO-2 and SKEP, demonstrating the
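
The request above hardcodes `image/jpeg` in the data URL, which only matches JPEG input. A small sketch of deriving the MIME type from the file name instead, using only the standard library; `to_data_url` is a hypothetical helper name, not part of the project:

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a base64 data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {filename}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

For example, `to_data_url(open("1.jpg", "rb").read(), "1.jpg")` could replace the f-string in the `image_url` field.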

```python
%%capture
pip install -r requirements.txt --user
```

```python
!python init_knowledge_base.py
```

```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

✅ Successfully added 3 records to the symptoms collection
✅ Successfully added 3 records to the diseases collection
✅ Successfully added 3 records to the treatments collection
Medical knowledge base initialization completed!
```
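
The `huggingface/tokenizers` warning in this output does not stop initialization, since the records are still added. As the message itself suggests, it can be silenced by setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set in the notebook before the script is launched so that the child process inherits it:

```python
import os

# Set explicitly, as the warning recommends; "false" disables tokenizer parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```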


### If initialization fails, download the embedding model first

```python
# Create the cache directory first so that wget has somewhere to write
!mkdir -p ~/.cache/chroma/onnx_models/all-MiniLM-L6-v2
!wget -O ~/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz \
    https://chroma-onnx-models.s3.amazonaws.com/all-MiniLM-L6-v2/onnx.tar.gz
```

```
--2025-07-07 14:21:26--  https://chroma-onnx-models.s3.amazonaws.com/all-MiniLM-L6-v2/onnx.tar.gz
Resolving chroma-onnx-models.s3.amazonaws.com (chroma-onnx-models.s3.amazonaws.com)... 52.217.143.65, 3.5.27.174, 3.5.25.41, ...
Connecting to chroma-onnx-models.s3.amazonaws.com (chroma-onnx-models.s3.amazonaws.com)|52.217.143.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83178821 (79M) [application/x-gzip]
Saving to: '/home/aistudio/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz'

2025-07-07 14:21:34 (11.3 MB/s) - '/home/aistudio/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz' saved [83178821/83178821]
```

