# Running DeepSeek-V3.2-Exp with vLLM on NVIDIA GPUs

This notebook shows how to run the **DeepSeek-V3.2-Exp** model with vLLM on NVIDIA GPUs, with a focus on the Blackwell architecture (GB200, B200).

## About DeepSeek-V3.2-Exp

**DeepSeek-V3.2-Exp** is an open-weight 685B parameter sparse MoE model.

**Quick Facts:**
- 685B total params, ~37B active per token | 128K context length | MIT license
- **Main Innovation:** Sparse attention = 2-3x faster on long contexts (90% compute reduction at 128K tokens)
- **Best For:** Math reasoning (89.3% AIME), coding (Codeforces 2121), long documents, agentic workflows
- **Hardware:** Requires 8x GB200/B200 or H200 GPUs

**Resources:**
- Model card: [deepseek-ai/DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp)
- vLLM Recipe: [DeepSeek-V3.2-Exp Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html)


## Table of Contents

- [Part 1: Installation and Serving](#Part-1:-Installation-and-Serving)
  - [Prerequisites](#Prerequisites)
  - [Environment Setup](#Environment-Setup)
  - [Installing vLLM](#Installing-vLLM)
  - [Launch OpenAI-Compatible Server](#Launch-OpenAI-Compatible-Server)
  - [Client Setup](#Client-Setup)
- [Part 2: Applications and Examples](#Part-2:-Applications-and-Examples)
  - [Long Context Testing](#Long-Context-Testing)
  - [Real-World Applications](#Real-World-Applications)
  - [Optimization Tips](#Optimization-Tips)
- [Conclusion](#Conclusion)


### Performance Benchmarks

DeepSeek-V3.2-Exp scores strongly across math, coding, agentic, and general-knowledge benchmarks while running 2-3x faster on long contexts than dense-attention models.

#### Key Results

| Category | Highlight Benchmark | Score | Notes |
|----------|-------------------|-------|-------|
| **Mathematics** | AIME 2025 | **89.3%** | Competition-level math |
| | GSM8K (5-shot) | **95.91%** | Grade school math |
| **Coding** | Codeforces Rating | **2121** | Competitive programming |
| | SWE Verified | **67.8%** | Real GitHub issues |
| **Agents** | SimpleQA | **97.1%** | Tool use & reasoning |
| **Language** | MMLU-Pro | **85.0%** | Graduate knowledge |


#### Long Context Performance

DeepSeek-V3.2-Exp uses **sparse attention** to cut the compute cost of long contexts (up to 128K tokens) while maintaining quality, making it 2-3x faster than dense-attention models on long sequences.

*Benchmark scores from the [DeepSeek-V3.2-Exp model card](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp).*

---
## Part 1: Installation and Serving


### Prerequisites

Before starting, ensure your environment meets these requirements:

#### Hardware Requirements

**Minimum Configuration:**
- 8x NVIDIA H200 GPUs (141GB HBM3 each)
- 2TB+ system RAM
- 5TB+ SSD storage for model weights (~1.3TB) and workspace
- High-speed GPU interconnect (NVLink 4.0+)
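
As a rough sanity check on these numbers, the weight footprint can be estimated as parameters times bytes per parameter. The sketch below is back-of-envelope arithmetic only: the two precisions are assumptions chosen for illustration (this notebook does not state the checkpoint's storage format), and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Estimate the size of the model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 685e9   # total parameter count from the Quick Facts
H200_MEMORY_GB = 141   # per-GPU memory from the minimum configuration
NUM_GPUS = 8

aggregate_gb = H200_MEMORY_GB * NUM_GPUS
for label, nbytes in (("1 byte/param (e.g. FP8)", 1), ("2 bytes/param (e.g. BF16)", 2)):
    weights_gb = weight_footprint_gb(TOTAL_PARAMS, nbytes)
    print(f"{label}: ~{weights_gb:,.0f} GB of weights vs {aggregate_gb:,} GB aggregate GPU memory")
```

The 2-bytes-per-parameter estimate lands in the same range as the ~1.3TB storage figure quoted above.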


#### Software Requirements

**Core Dependencies:**
- Python 3.12 (the prebuilt DeepGEMM wheel installed later in this notebook targets CPython 3.12)
- CUDA 12.1+ (12.9 recommended for Blackwell)
- cuDNN 8.9+
- NCCL 2.18+ for multi-GPU communication

**Python Packages:**
- PyTorch 2.3+ with CUDA support
- vLLM 0.10.2rc3+ (custom build for DeepSeek-V3.2)
- Transformers 4.38+
- OpenAI Python client 1.12+

### Environment Setup

Let's verify your system is ready for DeepSeek-V3.2-Exp:


In [None]:
# GPU environment check
import torch
import platform
import subprocess
import os

print("="*70)
print("SYSTEM INFORMATION")
print("="*70)
print(f"OS: {platform.system()} {platform.release()}")
print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    print("\n" + "="*70)
    print("GPU DETAILS")
    print("="*70)

    total_memory_gb = 0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        memory_gb = props.total_memory / 1e9
        total_memory_gb += memory_gb

        print(f"\nGPU[{i}]:")
        print(f"  Name: {props.name}")
        print(f"  Compute Capability: {props.major}.{props.minor}")
        print(f"  Memory: {memory_gb:.2f} GB")
        print(f"  Multi-Processors: {props.multi_processor_count}")

        # Check if suitable for DeepSeek-V3.2
        if "H200" in props.name:
            print(f"  Status: ✅ Hopper architecture - Supported")
        elif "B200" in props.name or "GB200" in props.name:
            print(f"  Status: ✅ Blackwell architecture - Optimal")
        else:
            print(f"  Status: ⚠️ May not be supported")

    print(f"\nTotal GPU Memory: {total_memory_gb:.2f} GB")

    # Check NVLink connectivity
    print("\n" + "="*70)
    print("NVLINK STATUS")
    print("="*70)
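    try:
        nvlink_output = subprocess.check_output(['nvidia-smi', 'nvlink', '--status'],
                                                stderr=subprocess.STDOUT, text=True)
        # A zero exit code alone does not prove NVLink is active; active links
        # normally report a bandwidth in GB/s in the status output
        if "GB/s" in nvlink_output:
            print("✅ NVLink detected - Multi-GPU performance will be optimal")
        else:
            print("⚠️ No active NVLink links reported - Performance may be degraded")
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("⚠️ Could not detect NVLink - Performance may be degraded")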

    # Recommendations
    print("\n" + "="*70)
    print("CONFIGURATION RECOMMENDATIONS")
    print("="*70)

    if total_memory_gb >= 1100:
        print("✅ Sufficient GPU memory for DeepSeek-V3.2-Exp")
        print("   Recommended mode: EP/DP (-dp 8 --enable-expert-parallel)")
    elif total_memory_gb >= 900:
        print("⚠️ Marginal GPU memory - Consider using FP8 quantization")
        print("   Recommended mode: Tensor Parallel (--tensor-parallel-size 8)")
    else:
        print("❌ Insufficient GPU memory for full model")
        print("   Consider: Smaller model or quantized version")

else:
    print("❌ CUDA not available - GPU required for this model")


SYSTEM INFORMATION
OS: Linux 6.8.0-79-generic
Python: 3.12.11
PyTorch: 2.8.0+cu128
CUDA available: True
CUDA version: 12.8
cuDNN version: 91002
Number of GPUs: 8

GPU DETAILS

GPU[0]:
  Name: NVIDIA B200
  Compute Capability: 10.0
  Memory: 191.50 GB
  Multi-Processors: 148
  Status: ✅ Blackwell architecture - Optimal

GPU[1]:
  Name: NVIDIA B200
  Compute Capability: 10.0
  Memory: 191.50 GB
  Multi-Processors: 148
  Status: ✅ Blackwell architecture - Optimal

GPU[2]:
  Name: NVIDIA B200
  Compute Capability: 10.0
  Memory: 191.50 GB
  Multi-Processors: 148
  Status: ✅ Blackwell architecture - Optimal

GPU[3]:
  Name: NVIDIA B200
  Compute Capability: 10.0
  Memory: 191.50 GB
  Multi-Processors: 148
  Status: ✅ Blackwell architecture - Optimal

GPU[4]:
  Name: NVIDIA B200
  Compute Capability: 10.0
  Memory: 191.50 GB
  Multi-Processors: 148
  Status: ✅ Blackwell architecture - Optimal

GPU[5]:
  Name: NVIDIA B200
  Compute Capability: 10.0
  Memory: 191.50 GB
  Multi-Processors: 148


### Installing vLLM

We'll install the vLLM nightly build together with DeepGEMM kernels for optimal performance.

**Note:** For the latest installation steps, refer to the [official vLLM DeepSeek-V3.2-Exp Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html).


In [None]:
# Install uv (fast Python package installer)
!pip install --upgrade pip
!pip install uv

#### Step 1: Install vLLM Nightly Build


In [None]:
# Install vLLM nightly build from wheels
!uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

print("✅ vLLM nightly build installed successfully")

#### Step 2: Install DeepGEMM Kernels

DeepGEMM provides optimized matrix multiplication kernels for DeepSeek models. vLLM uses it in two places: the MoE layers and the MQA logits computation. It is optional for MoE but required for the MQA logits.


In [None]:
# Install DeepGEMM kernels
!uv pip install https://wheels.vllm.ai/dsv32/deep_gemm-2.1.0%2B594953a-cp312-cp312-linux_x86_64.whl

print("✅ DeepGEMM kernels installed successfully")

#### Step 3: Install Additional Dependencies


In [None]:
# Install required Python packages
!pip install openai transformers accelerate numpy --quiet

print("\n" + "="*70)
print("✅ Installation Complete!")
print("="*70)
print("\nYou can now proceed to launch the vLLM server.")

### Launch OpenAI-Compatible Server

vLLM provides multiple serving modes for DeepSeek-V3.2-Exp. We'll cover both the **recommended EP/DP mode** and the **fallback TP mode**.

#### Serving Modes Explained

**1. Expert Parallelism + Data Parallelism (EP/DP) - Recommended**
- Attention layers are replicated on every GPU rather than sharded (TP=1)
- Experts distributed across GPUs
- Data parallelism for batching
- **Best performance** but requires stable setup
- Command: `-dp 8 --enable-expert-parallel`

**2. Tensor Parallelism (TP) - Fallback**
- Model sharded across GPUs
- More robust, easier to debug
- **~20-30% slower** than EP/DP
- Command: `--tensor-parallel-size 8`

#### Launch Commands

**Important:** Run these commands in a **separate terminal**, not in this notebook. The server needs to run continuously in the background.


#### Option 1: EP/DP Mode (Recommended for H200/B200)

```bash
# For 8x GB200/B200/H200 (or H20) GPUs - Optimal Performance
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel
```

This is the recommended serving mode as the kernels are mainly optimized for TP=1. The command uses:
- `-dp 8`: Data parallelism across 8 GPUs  
- `--enable-expert-parallel`: Expert parallelism for MoE layers
- Default settings optimized for the model

**Full command with additional options:**
```bash
vllm serve deepseek-ai/DeepSeek-V3.2-Exp \
    -dp 8 \
    --enable-expert-parallel \
    --served-model-name deepseek-v32 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code
```

**Note:** If you encounter errors like `CUDA error: invalid configuration argument`, try reducing `--max-num-seqs` to 256 or smaller (default is 1024).

#### Option 2: Tensor Parallel Mode (Fallback)

```bash
# Tensor Parallel - More Robust, works when EP/DP has issues
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8
```

Use this mode if you hit errors or hangs with EP/DP mode. Simple tensor parallel works and is more robust, but the performance is not optimal.

#### Expected Output

When the server starts successfully, you should see:

```
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
```

On first startup, vLLM downloads the model weights (~1.3TB), which takes **10-20 minutes** or longer depending on network bandwidth.

**Supported Hardware:** Only Hopper (H100/H200) and Blackwell (B200/GB200) data center GPUs are supported.
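
Because startup can take a while, it is convenient to poll the server from the notebook instead of watching the terminal. The sketch below assumes the server is reachable on port 8000 and answers on the `/health` route of vLLM's OpenAI-compatible server; the helper names, the 30-minute default wait, and the 10-second poll interval are choices made for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(
    url: str,
    max_wait_s: float = 1800.0,
    poll_interval_s: float = 10.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until the probe succeeds or `max_wait_s` elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        # Stop if another sleep would push us past the deadline.
        if time.monotonic() + poll_interval_s > deadline:
            return False
        time.sleep(poll_interval_s)


# Example (uncomment once the server has been launched in another terminal):
# ready = wait_for_server("http://127.0.0.1:8000/health")
# print("Server ready" if ready else "Server did not become ready in time")
```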


### Client Setup

Once the server is running, connect using the OpenAI Python client:


In [None]:
from openai import OpenAI
import time

# Connect to vLLM server
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="dummy"  # vLLM doesn't require a real API key
)

MODEL_NAME = "deepseek-v32"

print(f"Connected to vLLM server at http://127.0.0.1:8000")
print(f"Using model: {MODEL_NAME}")


Connected to vLLM server at http://127.0.0.1:8000
Using model: deepseek-v32


#### Quick Test

Let's verify the server is working correctly:


In [None]:
# Simple test to verify the server is working
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hello! Can you tell me what 2+2 equals?"}
    ],
    temperature=0.7,
    max_tokens=300
)

print("="*70)
print("MODEL RESPONSE")
print("="*70)
print(response.choices[0].message.content)
print("\n" + "="*70)


MODEL RESPONSE
Of course! 2 + 2 equals **4**.



---
## Part 2: Applications and Examples

### Long Context Testing

DeepSeek-V3.2-Exp's sparse attention mechanism is designed for long contexts. The test below sends progressively longer prompts and asks a question about the whole document:


In [None]:
def test_long_context(context_length_words: int = 8000):
    """
    Test with long input contexts.
    DeepSeek Sparse Attention should handle this efficiently.
    """

    # Create a long context document
    long_document = " ".join([
        f"This is sentence number {i} in a very long technical document. "
        f"It discusses advanced topics in artificial intelligence, specifically topic {i % 10}. "
        f"The research findings indicate significant improvements in performance metrics. "
        for i in range(context_length_words // 30)  # Approximate word count
    ])

    # Add a question that requires understanding the full context
    question = "Based on the entire document above, how many different topics are discussed?"
    full_prompt = f"{long_document}\n\n{question}"

    word_count = len(full_prompt.split())
    estimated_tokens = int(word_count / 0.75)  # Rough estimate: 1 token ≈ 0.75 words

    print(f"📄 Testing with ~{word_count:,} words (~{estimated_tokens:,} tokens)...")

    # Get streaming response
    chunks = []
    stream = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": full_prompt}],
        temperature=0.7,
        max_tokens=150,
        stream=True
    )

    for chunk in stream:
        # Some streamed chunks can arrive with an empty choices list, so guard the index
        if chunk.choices and chunk.choices[0].delta.content:
            chunks.append(chunk.choices[0].delta.content)

    response_text = "".join(chunks)

    print(f"\n{'='*70}")
    print(f"LONG CONTEXT TEST")
    print("="*70)
    print(f"Context Size: ~{word_count:,} words (~{estimated_tokens:,} tokens)")
    print(f"\n💬 Model Response:")
    print(response_text)
    print("="*70 + "\n")

# Test with increasing context sizes
for ctx_len in [2000, 8000, 16000]:
    test_long_context(ctx_len)
    time.sleep(2)  # Brief pause between tests


📄 Testing with ~1,992 words (~2,656.0 tokens)...

LONG CONTEXT TEST
Context Size: ~1,992 words (~2,656.0 tokens)

💬 Model Response:
Looking at the document, I can see that each sentence mentions a specific topic number from 0 to 9, and this pattern repeats throughout the text.

The topics mentioned are: topic 0, topic 1, topic 2, topic 3, topic 4, topic 5, topic 6, topic 7, topic 8, and topic 9.

Therefore, there are **10 different topics** discussed in the document.

📄 Testing with ~7,992 words (~10,656.0 tokens)...

LONG CONTEXT TEST
Context Size: ~7,992 words (~10,656.0 tokens)

💬 Model Response:
Let's look at the pattern in the text.  

Each sentence says:  

> "This is sentence number X … specifically topic Y."  

Looking at a few examples:  
- Sentence 0 → topic 0  
- Sentence 1 → topic 1  
- Sentence 2 → topic 2  
- …  
- Sentence 9 → topic 9  
- Sentence 10 → topic 0  
- Sentence 11 → topic 1  
- …  

The topics cycle from 0 to 9 repeatedly. That means the only topics mentioned
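
The test above checks correctness but does not record timing. To see how latency changes with context size, one option is to measure time to first token and total time around the streaming call. The sketch below is one way to do that; the helper name and the returned dictionary keys are assumptions made for this example, and it takes a zero-argument callable so the timer starts before the request is sent.

```python
import time
from typing import Any, Callable, Iterable


def measure_stream(make_stream: Callable[[], Iterable[Any]]) -> dict[str, Any]:
    """Time a streaming chat completion.

    `make_stream` is a zero-argument callable that issues the request and
    returns the chunk iterator, so request latency is included in the timing.
    """
    start = time.perf_counter()
    first_token_s = None
    pieces = []
    for chunk in make_stream():
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            if first_token_s is None:
                first_token_s = time.perf_counter() - start
            pieces.append(text)
    total_s = time.perf_counter() - start
    return {"ttft_s": first_token_s, "total_s": total_s, "text": "".join(pieces)}


# Example with the client defined earlier (requires a running server):
# stats = measure_stream(lambda: client.chat.completions.create(
#     model=MODEL_NAME,
#     messages=[{"role": "user", "content": "Hello"}],
#     max_tokens=50,
#     stream=True,
# ))
# print(f"TTFT: {stats['ttft_s']:.2f}s, total: {stats['total_s']:.2f}s")
```

Time to first token is dominated by prompt processing, so it is the number most sensitive to context length.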

### Real-World Applications

Now let's explore specific applications where DeepSeek-V3.2-Exp excels:

#### Application 1: Advanced Mathematical Reasoning


In [None]:
# Test mathematical reasoning (AIME-level)
math_prompt = """
Solve this problem step by step:

A sequence is defined by a₁ = 1 and aₙ₊₁ = 2aₙ + 3 for n ≥ 1.
Find a₁₀.

Show your work clearly and verify your answer.
"""

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are an expert mathematician. Solve problems step-by-step with clear reasoning."},
        {"role": "user", "content": math_prompt}
    ],
    temperature=0.3,  # Lower temperature for precise reasoning
    max_tokens=1024
)

print("="*70)
print("APPLICATION 1: MATHEMATICAL REASONING")
print("="*70)
print(response.choices[0].message.content)
print("="*70 + "\n")


APPLICATION 1: MATHEMATICAL REASONING
Let’s go step-by-step.

---

We are told:  
\[
a_1 = 1, \quad a_{n+1} = 2a_n + 3 \quad \text{for } n \ge 1.
\]

---

**Step 1: Find a pattern or closed form**

The recurrence is:
\[
a_{n+1} = 2a_n + 3.
\]
This is a linear first-order recurrence.  
We can solve it by finding the fixed point \( L \) such that \( L = 2L + 3 \), i.e. \( L = -3 \).

Now define \( b_n = a_n - L = a_n + 3 \). Then:
\[
a_{n+1} + 3 = 2a_n + 3 + 3 = 2a_n + 6 = 2(a_n + 3).
\]
So:
\[
b_{n+1} = 2 b_n.
\]
Thus \( b_n \) is geometric:
\[
b_n = b_1 \cdot 2^{n-1}.
\]
We have \( b_1 = a_1 + 3 = 1 + 3 = 4 \).

So:
\[
b_n = 4 \cdot 2^{n-1} = 2^{n+1}.
\]
Therefore:
\[
a_n = b_n - 3 = 2^{n+1} - 3.
\]

---

**Step 2: Verify for small \( n \)**

- \( n = 1 \): \( a_1 = 2^{2} - 3 = 4 - 3 = 1 \) ✓  
- \( n = 2 \): \( a_2 = 2^{3} - 3 = 8 - 3 = 5 \), check via recurrence: \( a_2 = 2a_1 + 3 = 2 + 3 = 5 \) ✓  
- \( n = 3 \): \( a_3 = 2^{4} - 3 = 16 - 3 = 13 \), check: \( a_3 = 2a_2 + 3 = 10 + 3
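
The model's response above is cut off by the `max_tokens` limit before it reaches the final answer. The closed form it derived, a_n = 2^(n+1) - 3, can be checked independently by iterating the recurrence directly, as in this short sketch (the function name is chosen for the example):

```python
def sequence_term(n: int) -> int:
    """Return a_n for a_1 = 1, a_{n+1} = 2 * a_n + 3 by direct iteration."""
    value = 1
    for _ in range(n - 1):
        value = 2 * value + 3
    return value


# The iterated values agree with the closed form 2^(n+1) - 3 for n = 1..10.
for n in range(1, 11):
    assert sequence_term(n) == 2 ** (n + 1) - 3

print(sequence_term(10))  # 2045
```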

#### Application 2: Code Generation


In [None]:
# Test coding capabilities
code_prompt = """
Write a Python function to reverse a string. Include type hints and a docstring.
"""

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are an expert Python developer. Write clean, efficient, well-documented code."},
        {"role": "user", "content": code_prompt}
    ],
    temperature=0.4,
    max_tokens=512
)

print("="*70)
print("APPLICATION 2: CODE GENERATION")
print("="*70)
print(response.choices[0].message.content)
print("="*70 + "\n")


APPLICATION 2: CODE GENERATION
```python
def reverse_string(text: str) -> str:
    """
    Reverse the input string.
    
    Args:
        text (str): The string to be reversed
        
    Returns:
        str: The reversed string
        
    Examples:
        >>> reverse_string("hello")
        'olleh'
        >>> reverse_string("Python")
        'nohtyP'
        >>> reverse_string("")
        ''
        >>> reverse_string("a")
        'a'
    """
    return text[::-1]
```

This function:

1. **Uses type hints**: `text: str` specifies the input type, and `-> str` specifies the return type
2. **Has a comprehensive docstring** that includes:
   - Brief description
   - Args section explaining the parameter
   - Returns section explaining the return value
   - Examples showing usage with different cases

3. **Implements the reversal efficiently** using Python's slice notation `[::-1]` which:
   - Starts at the end of the string
   - Goes to the beginning
   - Steps backwards one chara

#### Application 3: Multi-lingual Understanding


In [None]:
# Test multilingual understanding
multilingual_prompts = [
    ("English", "Explain how neural networks learn in 3 sentences."),
    ("中文 (Chinese)", "用三句话解释神经网络如何学习。"),
    ("Español (Spanish)", "Explica cómo aprenden las redes neuronales en 3 frases."),
    ("Français (French)", "Expliquez en 3 phrases comment les réseaux neuronaux apprennent.")
]

print("="*70)
print("APPLICATION 3: MULTILINGUAL CAPABILITIES")
print("="*70)

for language, prompt in multilingual_prompts:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=200
    )
    print(f"\n🌐 {language}:")
    print(f"Prompt: {prompt}")
    print(f"Response: {response.choices[0].message.content}")
    print("-" * 70)

print("="*70 + "\n")


APPLICATION 3: MULTILINGUAL CAPABILITIES

🌐 English:
Prompt: Explain how neural networks learn in 3 sentences.
Response: Neural networks learn by iteratively adjusting their internal parameters, called weights and biases, to minimize the difference between their predictions and the actual correct answers. This adjustment process is guided by an optimization algorithm, typically a variant of gradient descent, which calculates how to change the weights to reduce error using the chain rule from calculus (a technique called backpropagation). Over many cycles of processing data and making these small corrections, the network's configuration gradually improves, allowing it to model complex patterns and relationships within the data.
----------------------------------------------------------------------

🌐 中文 (Chinese):
Prompt: 用三句话解释神经网络如何学习。
Response: 神经网络通过大量数据样本不断调整内部参数，使预测结果逐渐接近真实值。每次预测误差会通过反向传播算法从输出层向输入层传递，指导各层参数进行梯度下降优化。经过反复迭代，网络最终自动提取数据关键特征，形成从输入到输出的精准映射能力。
---------------------------

### Optimization Tips

Here are production-ready optimization strategies for DeepSeek-V3.2-Exp on H200/B200:

#### 1. Parallelism Strategy Selection

**EP/DP Mode (Recommended for Production)**
```bash
# Best Performance - Use this for H200/B200 deployments
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel
```
**When to use:**
- Production deployments
- Maximum throughput needed
- Stable hardware setup
- H200/B200/H20 GPUs

**Why EP/DP is recommended:** The kernels are mainly optimized for TP=1, so it's best to run this model under EP/DP mode (DP=8, EP=8, TP=1).

**Tensor Parallel Mode (Fallback)**
```bash
# More Stable - Use for debugging or if EP/DP has issues
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8
```
**When to use:**
- Initial testing and debugging
- EP/DP mode encounters errors or hangs
- Need maximum stability
- Performance is not optimal but more robust

#### 2. DeepGEMM Configuration

DeepGEMM is used in two places: MoE and MQA logits computation. It is necessary for MQA logits computation.

```bash
# Disable DeepGEMM for MoE (some users report better performance on H20)
VLLM_USE_DEEP_GEMM=0 vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel
```

**Benefits of disabling DeepGEMM for MoE:**
- Better performance on some GPUs (e.g., H20)
- Skips the long warmup period
- DeepGEMM is still used for the MQA logits computation, where it is required

#### 3. KV Cache Configuration

**For Short Contexts (<2K tokens):**
```bash
--kv-cache-dtype bfloat16
```
- Less quantization overhead
- Better latency for short requests
- Uses more GPU memory

**For Long Contexts (>2K tokens) - Default:**
```bash
# FP8 is default (custom fp8 kvcache), no flag needed
# Or explicitly set:
--kv-cache-dtype fp8
```
- Saves GPU memory
- Allows more tokens to be cached
- Better for long documents
- Incurs additional quantization/dequantization overhead

**Recommendation:** Use `bfloat16` for short requests and `fp8` for long requests.

#### 4. Batch Size Tuning

```bash
# Default is 1024 (may cause CUDA errors with large batches)

# Recommended if you see CUDA errors
--max-num-seqs 256

# Conservative (if seeing OOM errors)
--max-num-seqs 128
```

**Note:** If you encounter errors like `CUDA error (flashmla-src/csrc/smxx/mla_combine.cu:201): invalid configuration argument`, the batch size may be too large. Try `--max-num-seqs 256` or smaller.

#### 5. Memory Management

```bash
# Aggressive memory use (recommended for dedicated servers)
--gpu-memory-utilization 0.95

# Conservative (if running other workloads)
--gpu-memory-utilization 0.85

# Very conservative (shared GPU environment)
--gpu-memory-utilization 0.75
```


---
## Conclusion

### Summary

Congratulations! You've set up and tested **DeepSeek-V3.2-Exp** with vLLM on NVIDIA GB200/B200 GPUs. This notebook covered:

✅ **Introduction**
- Architecture and innovations (Sparse Attention, MoE)
- Performance benchmarks across multiple domains
- Why DeepSeek-V3.2-Exp is special

✅ **Part 1: Installation and Serving**
- Complete installation guide
- Server configuration for EP/DP and TP modes
- Client setup and basic testing

✅ **Part 2: Applications and Examples**
- Tested long context handling
- Demonstrated real-world applications
- Optimization strategies for production

### Resources

#### Documentation
- 📚 [DeepSeek-V3.2-Exp Model Card](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp)
- 🔧 [vLLM DeepSeek Recipe](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html)
- 🏗️ [NVIDIA vLLM Guide](https://docs.nvidia.com/ai-enterprise/deployment/vllm/)

#### Code and Kernels
- 💾 [TileLang Sparse Kernels](https://github.com/deepseek-ai/TileLang)
- ⚡ [FlashMLA Implementation](https://github.com/deepseek-ai/FlashMLA)
- 🧪 [DeepSeek Examples](https://github.com/deepseek-ai/DeepSeek-V3)

#### Community
- 📧 [NVIDIA Developer Forums](https://forums.developer.nvidia.com/)


#### Acknowledgments

**Author:** Jay Rodge, Senior Developer Advocate @ NVIDIA

Special thanks to the DeepSeek and vLLM teams for their incredible work on these technologies.