# ERNIE-4.5-VL-28B-A3B-Paddle Local Deployment and Calling Tutorial

## Selection of Wenxin Open-Source Models

### ERNIE-4.5 Model Series Specification Comparison Table

| Model Series | Model Name | Total Parameters | Activated Parameters | Modality Support | Context Length | Main Usage | Deployment Scenario |
|---------|---------|--------|---------|---------|-----------|---------|---------|
| **A47B Large Scale** | ERNIE-4.5-300B-A47B-Base | 300B | 47B | Text | 128K | Pre-training Base | Cloud GPU Cluster |
| | ERNIE-4.5-300B-A47B | 300B | 47B | Text | 128K | Instruction Following/Creative Generation | Cloud GPU Cluster |
| | ERNIE-4.5-VL-424B-A47B-Base | 424B | 47B | Text+Vision | 128K | Multimodal Pre-training | Cloud GPU Cluster |
| | ERNIE-4.5-VL-424B-A47B | 424B | 47B | Text+Vision | 128K | Image-Text Understanding/Generation | Cloud GPU Cluster |
| **A3B Medium Scale** | ERNIE-4.5-21B-A3B-Base | 21B | 3B | Text | 128K | Pre-training Base | Single-machine Multi-GPU |
| | ERNIE-4.5-21B-A3B | 21B | 3B | Text | 128K | Dialogue/Document Processing | Single-machine Multi-GPU |
| | ERNIE-4.5-VL-28B-A3B-Base | 28B | 3B | Text+Vision | 128K | Multimodal Pre-training | Single-machine Multi-GPU |
| | ERNIE-4.5-VL-28B-A3B | 28B | 3B | Text+Vision | 128K | Lightweight Multimodal Applications | Single-machine Multi-GPU |
| **0.3B Lightweight** | ERNIE-4.5-0.3B-Base | 0.3B | 0.3B | Text | 4K | On-device Pre-training | Mobile/Edge |
| | ERNIE-4.5-0.3B | 0.3B | 0.3B | Text | 4K | Real-time Dialogue | Mobile/Edge |

### Model Specification Selection Strategy Table

| Application Scenario | Recommended Model | Reason | Hardware Requirements | Inference Latency |
|---------|---------|------|---------|---------|
| **Complex Reasoning Tasks** | ERNIE-4.5-300B-A47B | Strongest reasoning capability | 8×A100(80GB) | High |
| **Creative Content Generation** | ERNIE-4.5-300B-A47B | Best creative performance | 8×A100(80GB) | High |
| **Multimodal Understanding** | ERNIE-4.5-VL-424B-A47B | Image-text fusion understanding | 8×A100(80GB) | High |
| **Daily Dialogue Customer Service** | ERNIE-4.5-21B-A3B | Balanced performance and cost | 4×V100(32GB) | Medium |
| **Document Information Extraction** | ERNIE-4.5-21B-A3B | Sufficient understanding capability | 4×V100(32GB) | Medium |
| **Lightweight Multimodal** | ERNIE-4.5-VL-28B-A3B | Balanced image-text processing | 4×V100(32GB) | Medium |
| **Mobile Applications** | ERNIE-4.5-0.3B | Low latency and fast response | 1×GPU/CPU | Low |
| **Edge Computing** | ERNIE-4.5-0.3B | Minimal resource consumption | CPU/NPU | Low |

### Why Choose ERNIE-4.5-VL-28B-A3B-Paddle Model?

#### 1. **Optimal Balance between Performance and Cost**
- **Moderate parameter scale**: 28B total parameters, 3B activated parameters, ensuring inference capability while controlling computational costs
- **Reasonable hardware requirements**: Supports single-machine multi-GPU deployment (4×V100 or 2×A100), reducing hardware threshold by 75% compared to A47B series
- **Moderate inference latency**: 3-5 times faster response speed than large-scale models while ensuring output quality

#### 2. **Outstanding Multimodal Capabilities**
- **Text+Vision fusion**: Natively supports image-text understanding without additional visual encoders
- **Long context support**: 128K token context length, capable of processing long documents and multiple images
- **Rich application scenarios**: Suitable for document analysis, image description, multimodal question answering, etc.

#### 3. **Deployment Friendliness**
- **AIStudio platform optimization**: Officially adapted for the platform, with one-click download and deployment
- **FastDeploy integration**: Complete inference acceleration and service support
- **Open-source ecosystem**: PaddlePaddle ecosystem with comprehensive documentation and active community

#### 4. **Practical Application Value**
- **Enterprise-grade availability**: Significantly improved understanding capability and generation quality compared to 0.3B models
- **Controllable costs**: 70% lower deployment cost and 60% lower operating cost compared to A47B series
- **Strong scalability**: Supports LoRA fine-tuning for scenario-specific optimization

#### 5. **Technical Advancement**
- **MoE architecture**: Mixture of Experts model in which only about 1/9 of the parameters (3B of 28B) are activated per token, giving high inference efficiency
- **Multimodal alignment**: Deep fusion of visual and text features, understanding capability close to GPT-4V
- **Chinese optimization**: In-depth optimization for Chinese scenarios, excellent performance in Chinese multimodal tasks

#### Selection Recommendations
**Recommended scenarios**:
- Multimodal AI application development for small and medium enterprises
- Multimodal experiments in educational and research projects
- AI product prototype verification for individual developers
- Business systems requiring image-text understanding capabilities

**Not recommended scenarios**:
- Real-time systems with extremely high requirements for inference latency (choose 0.3B)
- Scenarios with sufficient budget pursuing ultimate performance (choose A47B)
- Pure text applications without multimodal needs (choose 21B-A3B)

## I. Environment Preparation
### 1. Hardware Requirements
- **GPU**: NVIDIA A100 80GB (supports single/multi-GPU, recommended CUDA 11.8+)  
- **Memory**: ≥60GB RAM  
- **Storage**: ≥120GB (the downloaded weights occupy roughly 55-60GB on disk as 12 safetensors shards; reserve additional space for logs/cache)  
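
A rough size estimate helps when reserving disk space. The sketch below assumes the weights are stored at 2 bytes per parameter (for example bf16); that precision is an assumption, not something this tutorial states, although it is consistent with the shard sizes listed in the download step. The function names are hypothetical helpers.

```python
import shutil


def estimate_weight_size_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough weight size in GB (10^9 bytes): parameters (in billions) x bytes per parameter."""
    return total_params_billion * bytes_per_param


def has_enough_disk(path: str, required_gb: float) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    size_gb = estimate_weight_size_gb(28)  # 28B total parameters
    print(f"Estimated weight size: {size_gb:.0f} GB")
    print("120 GB free in current directory:", has_enough_disk(".", 120))
```

The estimate covers the weights only; runtime caches and logs come on top of it, which is why the storage requirement above is larger.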

### 2. Software Dependencies

In [20]:
%%capture
!python -m pip install fastdeploy-gpu -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-gpu-80_90/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

### 3. Environment Verification
#### ① Check CUDA Version

In [21]:
!nvcc --version
# Expected output example:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0


#### ② Check GPU Information

In [22]:
# Check GPU status and memory
!nvidia-smi
# Expected output example:

Sun Jul  6 18:10:39 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A800-SXM4-80GB          On  |   00000000:D0:00.0 Off |                    0 |
| N/A   29C    P0             64W /  400W |   68463MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                     

#### ③ Environment Verification Instructions
| Check Item | Minimum Requirement | Recommended Configuration | Description |
|---------|---------|---------|------|
| **CUDA Version** | 11.8+ | 12.6+ | Supports latest GPU acceleration features |
| **Driver Version** | 520.61.05+ | 570.148.08+ | Compatible with A800/A100 series GPUs |
| **GPU Model** | V100(32GB) | A800/A100(80GB) | Ensure sufficient memory for 28B model |
| **Memory Capacity** | 32GB+ | 80GB+ | Model loading requires approximately 25-30GB memory |
| **GPU Utilization** | 0% | 0% | GPU should be idle before startup |
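
The minimum versions in the table can be compared programmatically. This is a minimal sketch that assumes purely numeric, dot-separated version strings such as `12.6` or `570.148.08`; `parse_version` and `meets_minimum` are hypothetical helper names.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '520.61.05' into (520, 61, 5)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare two dotted versions component by component, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Values taken from the example outputs and the table above
    print("CUDA ok:", meets_minimum("12.6", "11.8"))
    print("Driver ok:", meets_minimum("570.148.08", "520.61.05"))
```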

#### ④ Troubleshooting Common Environment Issues

In [28]:
# Check PaddlePaddle GPU support
!python -c "import paddle; print('GPU available:', paddle.is_compiled_with_cuda()); print('Number of GPU devices:', paddle.device.cuda.device_count())"

# Check FastDeploy installation
!python -c "from fastdeploy import LLM, SamplingParams; print('FastDeploy installed successfully!')"

# Check OpenAI library version
!python -c "import openai; print('OpenAI library version:', openai.__version__)"

GPU available: True
Number of GPU devices: 1
FastDeploy installed successfully!
OpenAI library version: 1.91.0
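
The three checks above can also be combined into one pass that reports which packages are missing without importing them. This is a sketch using only the standard library; `missing_packages` is a hypothetical helper name, and it only confirms that a top-level package can be found, not that its GPU support works.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["paddle", "fastdeploy", "openai"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```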


## II. Model Download and Directory Structure
### 1. Download Model Files

In [19]:
%%capture
# Download model using AIStudio command (the %%capture cell magic must be the first line of the cell)
!aistudio download --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Paddle --local_dir /home/aistudio/work/models

In [32]:
# View model files
!ls -l work/models

total 57441133
-rw-r--r-- 1 aistudio aistudio      11366 Jul  5 18:29 LICENSE
-rw-r--r-- 1 aistudio aistudio       9077 Jul  5 18:30 README.md
-rw-r--r-- 1 aistudio aistudio      86904 Jul  5 18:29 added_tokens.json
-rw-r--r-- 1 aistudio aistudio       1306 Jul  5 18:28 config.json
-rw-r--r-- 1 aistudio aistudio        134 Jul  5 18:29 generation_config.json
-rw-r--r-- 1 aistudio aistudio 4991326368 Jul  5 18:30 model-00001-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4988696384 Jul  5 18:29 model-00002-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4999185600 Jul  5 18:30 model-00003-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4995268296 Jul  5 18:29 model-00004-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4988696984 Jul  5 18:30 model-00005-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4999193256 Jul  5 18:29 model-00006-of-00012.safetensors
-rw-r--r-- 1 aistudio aistudio 4995261896 Jul  5 18:30 model-00007-of-00012.safetensors
-rw-r--r--

### 2. Directory Structure
```
work/
└── models/
    ├── LICENSE                           # License file
    ├── README.md                         # Model description document
    ├── added_tokens.json                 # Added token configuration
    ├── config.json                       # Model configuration file
    ├── generation_config.json            # Generation configuration file
    ├── model-00001-of-00012.safetensors  # Model parameter file (shard 1/12)
    ├── model-00002-of-00012.safetensors  # Model parameter file (shard 2/12)
    ├── model-00003-of-00012.safetensors  # Model parameter file (shard 3/12)
    ├── model-00004-of-00012.safetensors  # Model parameter file (shard 4/12)
    ├── model-00005-of-00012.safetensors  # Model parameter file (shard 5/12)
    ├── model-00006-of-00012.safetensors  # Model parameter file (shard 6/12)
    ├── model-00007-of-00012.safetensors  # Model parameter file (shard 7/12)
    ├── model-00008-of-00012.safetensors  # Model parameter file (shard 8/12)
    ├── model-00009-of-00012.safetensors  # Model parameter file (shard 9/12)
    ├── model-00010-of-00012.safetensors  # Model parameter file (shard 10/12)
    ├── model-00011-of-00012.safetensors  # Model parameter file (shard 11/12)
    ├── model-00012-of-00012.safetensors  # Model parameter file (shard 12/12)
    ├── model.safetensors.index.json      # Model shard index file
    ├── preprocessor_config.json          # Preprocessor configuration
    ├── special_tokens_map.json           # Special token mapping
    ├── tokenizer.model                   # Tokenizer model file
    └── tokenizer_config.json             # Tokenizer configuration file
```
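
Before starting the service, it can be useful to confirm that every shard referenced by the index file is present. The sketch below assumes `model.safetensors.index.json` follows the common sharded-safetensors layout with a top-level `weight_map` that maps tensor names to shard file names; that layout is an assumption here, and `find_missing_shards` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_missing_shards(model_dir: str) -> list:
    """Return shard file names listed in the index but absent from model_dir."""
    root = Path(model_dir)
    index_path = root / "model.safetensors.index.json"
    with index_path.open(encoding="utf-8") as f:
        index = json.load(f)
    # Assumed layout: {"weight_map": {"<tensor name>": "<shard file name>", ...}}
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]


if __name__ == "__main__":
    model_dir = "work/models"
    if (Path(model_dir) / "model.safetensors.index.json").is_file():
        missing = find_missing_shards(model_dir)
        print("Missing shards:", missing if missing else "none")
    else:
        print("Index file not found under", model_dir)
```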

## III. Start Service (Key Commands)

In [None]:
!python -m fastdeploy.entrypoints.openai.api_server \
       --model work/models \
       --port 8180 \
       --metrics-port 8181 \
       --engine-worker-queue-port 8182 \
       --max-model-len 32768 \
       --enable-mm \
       --reasoning-parser ernie-45-vl \
       --max-num-seqs 32

[2025-07-05 18:31:56,005] [    INFO] - loading configuration file work/models/preprocessor_config.json
INFO     2025-07-05 18:31:59,153 14427 engine.py[line:206] Waitting worker processes ready...
Loading Weights: 100%|████████████████████████| 100/100 [01:17<00:00,  1.30it/s]
Loading Layers: 100%|█████████████████████████| 100/100 [00:07<00:00, 13.31it/s]
INFO     2025-07-05 18:33:37,799 14427 engine.py[line:276] Worker processes are launched with 119.77000260353088 seconds.
INFO     2025-07-05 18:33:37,800 14427 api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
INFO     2025-07-05 18:33:37,800 14427 api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
INFO     2025-07-05 18:33:37,800 14427 api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
[32mINFO[0m:     Started server process [[36m14427[0m]
[32mINFO[0m:     Waiting for application startup.

### Model Deployment (Running in a Separate Terminal Is Recommended)
![](https://ai-studio-static-online.cdn.bcebos.com/adb145c8b36e4bfcb4eb4fc95addd20dedc26cffd6514e189c18900cbc593dc2)

### Detailed Explanation of Service Startup Parameters

| Parameter Name | Description | Default Value | Example Value |
|---------|---------|-------|-------|
| `--model` | Model file path, directory containing model weights and configuration files | Required | `work/models` |
| `--port` | API service listening port, clients access the service through this port | 8000 | `8180` |
| `--metrics-port` | Monitoring metrics service port, used for performance monitoring and health checks | 8001 | `8181` |
| `--engine-worker-queue-port` | Engine worker queue port, used for internal task scheduling | 8002 | `8182` |
| `--max-model-len` | Maximum sequence length (number of tokens) supported by the model | 2048 | `32768` |
| `--enable-mm` | Enable multimodal functionality (text+image processing); a switch that takes no value | False | `--enable-mm` |
| `--reasoning-parser` | Reasoning parser type, specifying the model's reasoning logic | None | `ernie-45-vl` |
| `--max-num-seqs` | Maximum number of concurrent sequences, controlling batch size | 256 | `32` |
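
When the service is started from a script rather than a notebook cell, the startup parameters can be kept in a dictionary and turned into a command line. This is a sketch under stated assumptions: `build_server_command` is a hypothetical helper, and it only uses the flags shown in the startup command above.

```python
import sys


def build_server_command(options: dict) -> list:
    """Build the argv list for the FastDeploy OpenAI-compatible API server.

    Keys use underscores (e.g. max_model_len) and become --max-model-len.
    A value of True emits a bare switch; False or None omits the flag.
    """
    cmd = [sys.executable, "-m", "fastdeploy.entrypoints.openai.api_server"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)
        elif value is False or value is None:
            continue
        else:
            cmd.extend([flag, str(value)])
    return cmd


if __name__ == "__main__":
    command = build_server_command({
        "model": "work/models",
        "port": 8180,
        "metrics_port": 8181,
        "engine_worker_queue_port": 8182,
        "max_model_len": 32768,
        "enable_mm": True,
        "reasoning_parser": "ernie-45-vl",
        "max_num_seqs": 32,
    })
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.Popen(command)` so the server runs in the background instead of blocking the notebook cell.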

### Other Common Parameters

| Parameter Name | Description | Default Value | Notes |
|---------|---------|-------|------|
| `--host` | Host address bound by the service | 0.0.0.0 | Set to 0.0.0.0 to allow external access |
| `--trust-remote-code` | Trust remote code execution | False | Required when loading custom models |
| `--tensor-parallel-size` | Tensor parallel size (multi-GPU) | 1 | Set according to the number of GPUs |
| `--gpu-memory-utilization` | GPU memory utilization | 0.9 | Recommended between 0.8-0.95 |
| `--max-num-batched-tokens` | Maximum number of batched tokens | Auto-calculated | Adjust according to GPU memory |
| `--swap-space` | Swap space size (GB) | 4 | Used when memory is insufficient |
| `--enable-lora` | Enable LoRA adapter | False | Used when fine-tuning models |
| `--max-log-len` | Maximum log length | Unlimited | Control log file size |

### Service Verification

In [43]:
# Check if the model is loaded correctly
!curl -X POST http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "null","messages": [{"role": "user", "content": "Hello, please introduce yourself"}],"temperature": 0.7}'

{"id":"chatcmpl-bd47b6ab-f962-4ad2-bf8c-e418cad02b8e","object":"chat.completion","created":1751798020,"model":"null","choices":[{"index":0,"message":{"role":"assistant","content":"\n\nHello! I'm your intelligent assistant. I can help you answer various questions, provide suggestions, or chat with you about various topics. Whether it's academic doubts, small problems in life, or if you want to hear a story or create some content, I'll explore with you! Feel free to interact with me anytime~ 😊","reasoning_content":"\nThe user asked me to introduce myself, and I need to respond in a natural and friendly way. First, I need to determine what the user's needs are. They might be new to this platform and want to understand the basic functions of the AI, or they might just be curious about my capabilities. As an AI assistant, I should highlight my main functions, such as answering questions, providing suggestions, and helping with learning.\n\nNext, I should consider the user's potential needs.
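
The non-streaming response above carries the answer in `choices[0].message.content` and the model's reasoning in `choices[0].message.reasoning_content`. A small sketch for separating the two from a parsed response follows; `extract_reply` is a hypothetical helper, and the sample dictionary is a shortened stand-in rather than real server output.

```python
def extract_reply(response: dict) -> tuple:
    """Return (reasoning, answer) from a parsed chat.completion response."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    # Shortened stand-in for the JSON structure shown above
    sample = {
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello! I'm your intelligent assistant.",
                "reasoning_content": "\nThe user asked me to introduce myself.",
            },
        }],
    }
    reasoning, answer = extract_reply(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```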

## IV. Model Calling Examples

### 1. OpenAI Library Calling Method

#### ① Text Generation

In [5]:
import openai
host = "localhost"  # connect to the local service; 0.0.0.0 is a bind address, not a portable client target
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "user", "content": "You are an intelligent assistant developed by Aistudio and Wenxin Large Model. Please introduce yourself."}
    ],
    stream=True,
)
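for chunk in response:
    # delta.content can be None (for example in the final chunk), so only print real text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)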



I am an intelligent assistant empowered by **Aistudio** (Baidu's AI technology ecosystem platform) and the **Wenxin Large Model** technology system. Here is an introduction to my core features and functions:

### 1. **Technical Background**
   - Based on Baidu's independently developed **Wenxin Large Model** technology, integrating multimodal understanding, deep logical reasoning, and natural language generation capabilities.
   - Leveraging **Aistudio's** development tools and ecological resources to support rapid access to AI capabilities for innovative applications.

### 2. **Core Capabilities**
   - **Knowledge Q&A**: Covering a wide range of fields (technology, culture, life, etc.) to provide accurate and structured answers.
   - **Text Generation**: Supporting diverse text creation such as creative writing, copy generation, code assistance, and summary extraction.
   - **Logical Reasoning**: Handling complex problems, analyzing cause-effect relationships, and assisti

#### ② Image Description Generation

In [44]:
import openai
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

client = openai.Client(
    base_url="http://localhost:8180/v1",
    api_key="null"
)

response = client.chat.completions.create(
    model="null",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('1.jpg')}"
                    }
                },
                {
                    "type": "text",
                    "text": "Generate a description of this image"
                }
            ]
        }
    ],
    stream=True,
)

print("Image description: ", end='', flush=True)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()

Image description: 
This image showcases the core architecture and functional modules of PaddleNLP (PaddlePaddle Natural Language Processing Development Library). The overall design is concise and clear, adopting a blue-white color scheme that highlights technical professionalism. The image is divided into three main parts:

### 1. **Industrial-Grade Predefined Tasks (Taskflow)**
   - **Natural Language Understanding**: Provides basic understanding capabilities such as lexical analysis, text correction, sentiment analysis, and syntactic analysis.
   - **Natural Language Generation**: Covers application scenarios such as automatic couplet creation, intelligent poetry writing, generative question answering, and open-domain dialogue, meeting text creation and interaction needs.

### 2. **Industry-Grade Model Library**
   - **Self-developed Pre-trained Models**: Includes ERNIE series (such as ERNIE-1.0, ERNIE-2.0, ERNIE-Tiny, etc.), PLATO-2, SKEP, etc., supporting multi-task learni
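
The example above hard-codes `image/jpeg` in the data URL. The sketch below derives the MIME type from the file extension and builds the same message structure. `image_to_data_url` and `build_image_message` are hypothetical helpers, and whether the service accepts formats other than JPEG is an assumption that should be checked against the FastDeploy documentation.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 data URL, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for: {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(image_path: str, prompt: str) -> dict:
    """Build a user message with one image part and one text part, as in the example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dictionary can be passed directly as an element of `messages` in `client.chat.completions.create(...)`.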

### 2. Requests Library Calling Method

#### ① Text Generation

In [17]:
import requests
import json

url = "http://localhost:8180/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
    "model": "null",
    "messages": [
        {"role": "user", "content": "You are an intelligent assistant developed by Aistudio and Wenxin Large Model. Please introduce yourself."}
    ],
    "stream": True
}

response = requests.post(url, headers=headers, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        # Strip only the leading SSE prefix; replace() would also alter "data: " inside the payload
        if line.startswith('data: '):
            line = line[len('data: '):]

        if line.strip() == '[DONE]':
            break
        try:
            data = json.loads(line)

            if 'choices' in data and len(data['choices']) > 0:
                delta = data['choices'][0].get('delta', {})
                content = delta.get('content', '')
                if content:
                    print(content, end='', flush=True)
        except json.JSONDecodeError as e:
            print(f"Error decoding line: {line}")
            print(f"Error: {e}")

print()



I am an intelligent assistant empowered by **Aistudio** (Baidu's open-source deep learning platform) and the **Wenxin Large Model** technical framework. Here is an introduction to my core features and functions:

### 1. **Identity Background**
   - Based on the technical foundation of the Wenxin Large Model, integrating massive multi-domain knowledge bases and deep learning algorithms, with strong language understanding and generation capabilities.
   - Continuously optimized by Baidu's AI team, focusing on the balance of logical reasoning, knowledge integration, and interaction experience.

### 2. **Core Capabilities**
   - **Natural Language Interaction**: Supports Chinese and multilingual dialogue, capable of understanding complex problems and providing structured answers.
   - **Knowledge Q&A**: Covers a wide range of fields such as science, technology, humanities, and life, providing accurate information and explanations.
   - **Text Generation**: Can write code, arti

#### ② Image Description Generation

In [18]:
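import base64
import json
import requests

def encode_image(image_path):
    # Same helper as in the OpenAI example: read the file and return base64 text
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

url = "http://localhost:8180/v1/chat/completions"
headers = {"Content-Type": "application/json"}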
payload = {
    "model":"null",
    "messages":[
        {
            "role":"user",
            "content":[
                {"type":"image_url","image_url":{"url":f"data:image/jpeg;base64,{encode_image('1.jpg')}"}},
                {"type":"text","text":"Generate image description"}
            ]
        }
    ],
    "stream": True
}

response = requests.post(url, headers=headers, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        # Strip only the leading SSE prefix; replace() would also alter "data: " inside the payload
        if line.startswith('data: '):
            line = line[len('data: '):]

        if line.strip() == '[DONE]':
            break
        try:
            data = json.loads(line)

            if 'choices' in data and len(data['choices']) > 0:
                delta = data['choices'][0].get('delta', {})
                content = delta.get('content', '')
                if content:
                    print(content, end='', flush=True)
        except json.JSONDecodeError as e:
            print(f"Error decoding line: {line}")
            print(f"Error: {e}")

print()


The image displays the core architecture and functional modules of PaddleNLP (PaddlePaddle Natural Language Processing Development Library), adopting a concise blue-white color scheme with clear structure, divided into three main levels:

### **1. Industrial-Grade Predefined Tasks (Taskflow)**
- **Natural Language Understanding**: Provides basic understanding capabilities, including lexical analysis (word segmentation, part-of-speech tagging, etc.), text correction, sentiment analysis, and syntactic analysis (grammatical structure parsing).
- **Natural Language Generation**: Covers diverse generation scenarios such as automatic couplet creation, intelligent poetry writing, generative question answering, and open-domain dialogue, supporting creative text generation.

### **2. Industry-Grade Model Library**
- **Self-developed Pre-trained Models**: Displays Baidu's developed series of models, such as ERNIE (including versions 1.0/2.0/Tiny/Gram, etc.), PLATO (dialogue model), SKEP
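
Both `requests` examples parse the same server-sent event stream, so the parsing can be factored into one function that works on any iterable of lines. This is a sketch, not the library's own API: `collect_stream` is a hypothetical helper, and the presence of a `reasoning_content` field in streamed deltas is an assumption based on the non-streaming response shown earlier.

```python
import json


def collect_stream(lines):
    """Accumulate streamed deltas from SSE lines and return (reasoning, content)."""
    reasoning_parts, content_parts = [], []
    for raw in lines:
        if not raw:
            continue  # keep-alive blank line
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not text.startswith("data:"):
            continue  # not a data event
        payload = text[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed events instead of aborting the stream
        choices = data.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        if delta.get("reasoning_content"):  # assumed field name for streamed reasoning
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(content_parts)
```

With `requests`, it can be called as `collect_stream(response.iter_lines())` once the whole answer is wanted rather than incremental printing.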

### 3. Calling Parameter Description

#### Common Parameter Configuration
```python
# Complete calling parameter example
response = client.chat.completions.create(
    model="null",                    # Model name (fixed value)
    messages=[...],                  # Message list
    stream=True,                     # Whether to enable streaming response
    max_tokens=2048,                 # Maximum number of generated tokens
    temperature=0.7,                 # Temperature parameter (0.0-2.0)
    top_p=0.9,                      # Nucleus sampling parameter (0.0-1.0)
    frequency_penalty=0.0,           # Frequency penalty (-2.0-2.0)
    presence_penalty=0.0,            # Presence penalty (-2.0-2.0)
    stop=["<|endoftext|>"],         # Stop word list
)
```

#### Detailed Parameter Description
| Parameter Name | Type | Default Value | Description |
|---------|------|--------|------|
| `model` | str | "null" | Model name, fixed as "null" for local deployment |
| `messages` | list | Required | Dialogue message list, containing role and content |
| `stream` | bool | False | Whether to enable streaming response |
| `max_tokens` | int | Auto | Maximum number of generated tokens |
| `temperature` | float | 1.0 | Controls randomness, higher values mean more randomness |
| `top_p` | float | 1.0 | Nucleus sampling, controls vocabulary selection range |
| `frequency_penalty` | float | 0.0 | Frequency penalty, reduces repetitive content |
| `presence_penalty` | float | 0.0 | Presence penalty, encourages talking about new topics |
| `stop` | list | None | List of strings to stop generation |
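
The value ranges noted in the parameter example above can be checked on the client side before a request is sent. This is a minimal sketch; `validate_sampling_params` is a hypothetical helper, and the server may enforce different or additional limits.

```python
def validate_sampling_params(temperature=1.0, top_p=1.0,
                             frequency_penalty=0.0, presence_penalty=0.0,
                             max_tokens=None) -> list:
    """Return a list of human-readable problems; an empty list means all values are in range."""
    errors = []
    if not 0.0 <= temperature <= 2.0:
        errors.append(f"temperature must be in [0.0, 2.0], got {temperature}")
    if not 0.0 <= top_p <= 1.0:
        errors.append(f"top_p must be in [0.0, 1.0], got {top_p}")
    if not -2.0 <= frequency_penalty <= 2.0:
        errors.append(f"frequency_penalty must be in [-2.0, 2.0], got {frequency_penalty}")
    if not -2.0 <= presence_penalty <= 2.0:
        errors.append(f"presence_penalty must be in [-2.0, 2.0], got {presence_penalty}")
    if max_tokens is not None and max_tokens <= 0:
        errors.append(f"max_tokens must be positive, got {max_tokens}")
    return errors


if __name__ == "__main__":
    for problem in validate_sampling_params(temperature=0.7, top_p=0.9, max_tokens=2048):
        print(problem)
```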

## V. Common Problems and Solutions
### 1. Port Already in Use
```bash
# Check processes occupying the port
lsof -i:8180

# Terminate the process (replace <PID> with the actual process ID)
kill -9 <PID>
```
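
The same check can be done from Python before launching the service, which needs three ports (8180, 8181, and 8182 in this tutorial). This is a sketch using the standard `socket` module; `is_port_free` is a hypothetical helper, and it tests by briefly binding to the port.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


if __name__ == "__main__":
    for port in (8180, 8181, 8182):
        state = "free" if is_port_free(port) else "in use"
        print(f"Port {port}: {state}")
```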

### 2. Model Loading Failure
- **Check directory**: Ensure the complete model files (`model-*.safetensors` shards, `model.safetensors.index.json`, `config.json`, `tokenizer.model`) exist under `work/models`  
- **Parser parameters**: Ensure the startup command includes `--reasoning-parser ernie-45-vl`  
- **Driver version**: NVIDIA driver must be ≥520.61.05 (supports A100)  

### 3. Abnormal Streaming Responses
- Ensure the `stream=True` parameter is correctly passed  
- Check service logs (`fastdeploy_server.log`) for out-of-memory errors (host RAM or GPU memory)  

## VI. Test Cases
### 1. Text Generation Verification
- **Input**: `"Summarize the core steps of this tutorial"`  
- **Expected**: Output coherent text containing keywords such as "environment configuration", "service startup", "interface calling", etc.  
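
The keyword expectation above can be turned into a simple automated check. This is a sketch only: `missing_keywords` is a hypothetical helper, the sample text is invented for illustration, and a case-insensitive substring match is a coarse proxy for output quality.

```python
def missing_keywords(text: str, keywords) -> list:
    """Return the keywords that do not appear in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword.lower() not in lowered]


if __name__ == "__main__":
    expected = ["environment configuration", "service startup", "interface calling"]
    # Invented sample text standing in for a model reply
    reply = "The tutorial covers environment configuration, service startup, and interface calling."
    absent = missing_keywords(reply, expected)
    print("PASS" if not absent else f"Missing keywords: {absent}")
```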

### 2. Image Description Verification
- **Test image**: Use a jpg file containing natural scenes (such as landscapes, people's activities)  
- **Expected**: Output sentences containing scene features (such as "green grassland under blue sky and white clouds") and action descriptions (such as "people walking by the lake")  

## VII. Resource Links
- **FastDeploy Documentation**: https://www.paddlepaddle.org.cn/fastdeploy  
- **AIStudio Platform**: https://aistudio.baidu.com/  
- **Model Repository**: https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Paddle  