# Qwen 2.5 Coder 7B Instruct - Local Testing Notebook

## Hardware Requirements

### Minimum Requirements:
- **RAM**: 16 GB system RAM (for model loading and inference)
- **VRAM**: 8 GB GPU memory (sufficient for 4-bit quantization; unquantized FP16/BF16 needs about 14 GB for the weights alone)
- **Storage**: ~15 GB free disk space (for model weights)
- **CPU**: Modern multi-core processor (if running CPU-only)

### Recommended Requirements:
- **RAM**: 32 GB system RAM
- **VRAM**: 16 GB or more of GPU memory (e.g., RTX A4000 with 16 GB, RTX 3090 or 4090 with 24 GB)
- **Storage**: 20 GB SSD
- **GPU**: CUDA-compatible GPU (compute capability 7.0+)

### Quantization Options (to reduce memory usage):
- **8-bit quantization**: ~8-9 GB VRAM required
- **4-bit quantization**: ~5-6 GB VRAM required
- **CPU-only mode**: 16+ GB RAM (slower inference)

### Model Details:
- **Model**: Qwen/Qwen2.5-Coder-7B-Instruct
- **Parameters**: 7.61 billion
- **Context Length**: up to 128K tokens (32K with the default configuration; longer contexts require enabling YaRN rope scaling)
- **Architecture**: Transformer-based decoder
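
Weight memory can be estimated as parameters × bytes per parameter, which gives a lower bound on the VRAM needed at each precision. A rough sketch of that arithmetic follows; it ignores activations, the KV cache, and any layers a quantization backend keeps in 16-bit, so real usage is higher. The function name is illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 7.61e9  # parameter count listed under Model Details

for label, bits in [("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

This prints roughly 14.2, 7.1, and 3.5 GiB, which is why an 8 GB GPU can only hold the model with 4-bit quantization.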

## 1. Install Required Packages

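Uncomment and run this cell first to install all dependencies:

In [23]:
# Install required packages
# Version specifiers are quoted so the shell does not treat ">" as output redirection
# !pip install -q "torch>=2.0.0" "transformers>=4.37.0" "accelerate>=0.25.0" "bitsandbytes>=0.41.0" "sentencepiece>=0.1.99" "protobuf>=3.20.0"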
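
After installing, it can help to confirm which versions are actually present in the kernel's environment. A minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is illustrative:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes"]
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```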

## 2. Import Libraries

In [24]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 3. Check System Information

Let's verify your hardware setup:

In [25]:
def check_system_info():
    """Display system information for debugging"""
    print("=" * 60)
    print("SYSTEM INFORMATION")
    print("=" * 60)
    print(f"PyTorch Version: {torch.__version__}")
    print(f"CUDA Available: {torch.cuda.is_available()}")
    
    if torch.cuda.is_available():
        print(f"CUDA Version: {torch.version.cuda}")
        print(f"GPU Device: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    else:
        print("Running on CPU (slower performance)")
    print("=" * 60)

check_system_info()

SYSTEM INFORMATION
PyTorch Version: 2.5.1+cu121
CUDA Available: True
CUDA Version: 12.1
GPU Device: NVIDIA GeForce RTX 4060 Laptop GPU
GPU Memory: 8.00 GB


## 4. Configure Model Loading

Choose your quantization option:
- `None` - No quantization (FP16/BF16 half precision) - requires ~15 GB VRAM
- `'8bit'` - 8-bit quantization - requires ~8-9 GB VRAM
- `'4bit'` - 4-bit quantization - requires ~5-6 GB VRAM

In [26]:
# Configuration - Change this based on your hardware
USE_QUANTIZATION = '4bit'  # Options: None, '8bit', '4bit'

print(f"Configuration: Quantization = {USE_QUANTIZATION if USE_QUANTIZATION else 'None (Full Precision)'}")

Configuration: Quantization = 4bit
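
The setting could also be derived from the detected VRAM instead of being edited by hand. A minimal sketch follows; the helper name and the thresholds are assumptions chosen to leave headroom for activations and the KV cache, not measured values.

```python
def choose_quantization(vram_gb):
    """Pick a quantization mode from total VRAM in GB (thresholds are assumptions)."""
    if vram_gb is None:
        return None      # no GPU detected: the bitsandbytes modes target CUDA
    if vram_gb >= 20:
        return None      # enough room for unquantized FP16/BF16
    if vram_gb >= 12:
        return '8bit'
    return '4bit'

# In the notebook, vram_gb could come from:
# torch.cuda.get_device_properties(0).total_memory / 1024**3
print(choose_quantization(8.0))  # 4bit
```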


## 5. Load Model and Tokenizer

This will download the model on first run (~14 GB). Subsequent runs will use the cached version.

In [27]:
def load_model(use_quantization=None):
    """
    Load Qwen 2.5 Coder 7B Instruct model
    
    Args:
        use_quantization: None, '8bit', or '4bit'
    """
    model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
    
    print(f"Loading model: {model_name}")
    print(f"Quantization: {use_quantization if use_quantization else 'None (FP16/BF16)'}")
    print("This may take a few minutes on first run...")
    print()
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True
    )
    
    # Configure model loading based on quantization
    model_kwargs = {
        "trust_remote_code": True,
        "device_map": "auto"  # Automatically distribute across available devices
    }
    
    if use_quantization == '8bit':
        print("Loading with 8-bit quantization (requires bitsandbytes)")
        model_kwargs["load_in_8bit"] = True
    elif use_quantization == '4bit':
        print("Loading with 4-bit quantization (requires bitsandbytes)")
        model_kwargs["load_in_4bit"] = True
    else:
        # Use BF16 if available, otherwise FP16
        if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
            model_kwargs["torch_dtype"] = torch.bfloat16
            print("Using BF16 precision")
        else:
            model_kwargs["torch_dtype"] = torch.float16
            print("Using FP16 precision")
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        **model_kwargs
    )
    
    print("Model loaded successfully!")
    print()
    
    return model, tokenizer

# Load the model
model, tokenizer = load_model(use_quantization=USE_QUANTIZATION)

Loading model: Qwen/Qwen2.5-Coder-7B-Instruct
Quantization: 4bit
This may take a few minutes on first run...



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading with 4-bit quantization (requires bitsandbytes)


Loading checkpoint shards: 100%|██████████| 4/4 [00:22<00:00,  5.59s/it]


Model loaded successfully!
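
The deprecation warning in the output above comes from passing `load_in_4bit`/`load_in_8bit` directly to `from_pretrained`. Below is a sketch of the replacement it suggests: build a `BitsAndBytesConfig` and pass it as `quantization_config`. The helper names are hypothetical, and the NF4, double-quantization, and compute-dtype settings are common choices rather than values taken from this notebook.

```python
def quantization_settings(mode):
    """Return BitsAndBytesConfig keyword arguments for '8bit' or '4bit', else None."""
    if mode == '8bit':
        return {"load_in_8bit": True}
    if mode == '4bit':
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",       # assumed choice
            "bnb_4bit_use_double_quant": True,  # assumed choice; saves a little memory
        }
    return None

def build_model_kwargs(mode):
    """Build from_pretrained kwargs without the deprecated load_in_* flags."""
    kwargs = {"device_map": "auto"}
    settings = quantization_settings(mode)
    if settings is not None:
        # Imported lazily so the pure helper above works without transformers
        import torch
        from transformers import BitsAndBytesConfig
        if mode == '4bit':
            settings["bnb_4bit_compute_dtype"] = torch.float16  # assumed choice
        kwargs["quantization_config"] = BitsAndBytesConfig(**settings)
    return kwargs

# Usage (loads the model, so left commented out):
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen2.5-Coder-7B-Instruct", **build_model_kwargs('4bit'))
```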



## 6. Define Generation Function

In [28]:
def generate_response(model, tokenizer, prompt, max_new_tokens=512, temperature=0.7):
    """Generate a response from the model"""
    
    # Format the prompt using chat template
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    model_inputs = tokenizer([text], return_tensors="pt")
    
    # Move inputs to the model's device (also correct for CPU-only runs)
    model_inputs = model_inputs.to(model.device)
    
    # Generate
    print("Generating response...")
    with torch.no_grad():
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the generated tokens (exclude the prompt)
    generated_ids = [
        output_ids[len(input_ids):] 
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    return response

print("Generation function defined successfully!")

Generation function defined successfully!
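
`apply_chat_template` turns the message list into a single prompt string. For Qwen2.5 models this follows the ChatML layout. The sketch below is a simplified approximation for illustration only: the real template also handles tool calls and a default system prompt, and `render_chatml` is a hypothetical name.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML string produced for a list of role/content dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

print(render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]))
```

Printing the result of `tokenizer.apply_chat_template(...)` in the notebook shows the exact string the model receives.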


## 7. Test the Model

Let's test with some example prompts:

In [29]:
# Example 1: Fibonacci sequence
prompt1 = "Write a Python function to calculate the fibonacci sequence recursively."
# prompt1 = 'Write a python function to print hello world'
print("=" * 60)
print(f"Prompt: {prompt1}")
print("=" * 60)

response1 = generate_response(
    model, 
    tokenizer, 
    prompt1,
    max_new_tokens=512,
    temperature=0.7
)

print("\nResponse:")
print(response1)

Prompt: Write a Python function to calculate the fibonacci sequence recursively.
Generating response...

Response:
Sure! Here's an example implementation of a recursive function in Python that calculates the Fibonacci sequence:

```python
def fibonacci(n):
    if n <= 0:
        return "Input should be a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```

This function takes an input `n` and returns the nth number in the Fibonacci sequence. It uses recursion to calculate the previous two numbers in the sequence (i.e., `fibonacci(n-1)` and `fibonacci(n-2)`) until it reaches the base cases where `n` is either 1 or 2.

Here's how you can use this function:

```python
print(fibonacci(5))  # Output: 3
print(fibonacci(7))  # Output: 8
```

Note that this implementation has exponential time complexity due to repeated calculations. For large values of `n`, it may take a long time to compute the res

In [31]:
# Example 2: Binary Search Tree
prompt2 = "Explain what is a binary search tree and implement it in Python."

print("=" * 60)
print(f"Prompt: {prompt2}")
print("=" * 60)

response2 = generate_response(
    model, 
    tokenizer, 
    prompt2,
    max_new_tokens=512,
    temperature=0.7
)

print("\nResponse:")
print(response2)

Prompt: Explain what is a binary search tree and implement it in Python.
Generating response...

Response:
A Binary Search Tree (BST) is a type of binary tree where each node has at most two child nodes, referred to as the left child and the right child. The key property of a BST is that for any given node, all values in its left subtree are less than the value of the node, and all values in its right subtree are greater than the value of the node.

Here's an implementation of a Binary Search Tree in Python:

```python
class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.val = key

# Function to insert a new node with the given key
def insert(root, key):
    # If the tree is empty, return a new node
    if root is None:
        return Node(key)

    # Otherwise, recur down the tree
    if key < root.val:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)

    # return the (unchanged) node
    return root
```

In [32]:
# Example 3: Debug code
prompt3 = "Debug this code: def add(a b): return a + b"

print("=" * 60)
print(f"Prompt: {prompt3}")
print("=" * 60)

response3 = generate_response(
    model, 
    tokenizer, 
    prompt3,
    max_new_tokens=512,
    temperature=0.7
)

print("\nResponse:")
print(response3)

Prompt: Debug this code: def add(a b): return a + b
Generating response...

Response:
The problem with the given code is that there is no space between 'a' and 'b' in the function definition. In Python, parameters should be separated by commas.

Here's the corrected version of the code:

```python
def add(a, b):
    return a + b
```

Now you can call this function like `add(5, 3)` to get `8`.
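
The actual defect in the prompt is the missing comma between `a` and `b`; the model's wording about a missing space is slightly off, although its fix is correct. Model-suggested fixes can be checked cheaply by parsing them. A minimal sketch using the standard-library `ast` module, with `is_valid_python` as an illustrative helper name:

```python
import ast

def is_valid_python(source):
    """Return True if the source parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(is_valid_python("def add(a b): return a + b"))        # False
print(is_valid_python("def add(a, b):\n    return a + b"))  # True
```

Parsing only confirms the syntax; it does not show that the code behaves as intended.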


## 8. Interactive Testing

Use this cell to test your own prompts:

In [None]:
# Enter your custom prompt here
custom_prompt = "Write a Python function to reverse a linked list."

print("=" * 60)
print(f"Your Prompt: {custom_prompt}")
print("=" * 60)

custom_response = generate_response(
    model, 
    tokenizer, 
    custom_prompt,
    max_new_tokens=512,
    temperature=0.7
)

print("\nResponse:")
print(custom_response)

## 9. Advanced Configuration

Adjust generation parameters for different behaviors:

In [None]:
def generate_response_advanced(
    model, 
    tokenizer, 
    prompt, 
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.0
):
    """Generate response with advanced parameters"""
    
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    model_inputs = tokenizer([text], return_tensors="pt")
    
    # Move inputs to the model's device (also correct for CPU-only runs)
    model_inputs = model_inputs.to(model.device)
    
    print(f"Generating with: temp={temperature}, top_p={top_p}, top_k={top_k}, rep_penalty={repetition_penalty}")
    
    with torch.no_grad():
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            repetition_penalty=repetition_penalty,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    generated_ids = [
        output_ids[len(input_ids):] 
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

# Example with different temperature for more creative/deterministic output
creative_prompt = "Write a creative algorithm to solve the traveling salesman problem."

print("\n" + "=" * 60)
print("Testing with higher temperature (more creative):")
print("=" * 60)

creative_response = generate_response_advanced(
    model,
    tokenizer,
    creative_prompt,
    temperature=0.9,  # Higher temperature = more creative
    top_p=0.95,
    max_new_tokens=512
)

print("\nResponse:")
print(creative_response)
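
Sampling parameters only take effect when `do_sample=True`, and `generate` typically rejects `temperature=0` in sampling mode, so deterministic output is usually requested with greedy decoding instead. Below is a small sketch of a helper that builds consistent keyword arguments; `build_generation_kwargs` is a hypothetical name, not part of transformers.

```python
def build_generation_kwargs(max_new_tokens=512, temperature=0.7,
                            top_p=0.9, top_k=50, repetition_penalty=1.0):
    """Return kwargs for model.generate, switching to greedy decoding at temperature 0."""
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
    }
    if temperature and temperature > 0:
        kwargs.update(do_sample=True, temperature=temperature,
                      top_p=top_p, top_k=top_k)
    else:
        # Greedy decoding: always pick the most likely token
        kwargs["do_sample"] = False
    return kwargs

print(build_generation_kwargs(temperature=0))
```

The returned dictionary could be unpacked into the call as `model.generate(**model_inputs, **build_generation_kwargs(temperature=0))`.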

## 10. Memory Management

If you need to free up memory, run this cell:

In [None]:
# Clear GPU memory
import gc

# Delete model and tokenizer
del model
del tokenizer

# Run garbage collection
gc.collect()

# Clear CUDA cache if using GPU
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory cleared")

print("Memory freed successfully!")

GPU memory cleared
Memory freed successfully!
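
To confirm that the memory was actually released, the allocator statistics can be printed before and after the cleanup. A minimal sketch; `gpu_memory_report` is an illustrative helper, and it returns `None` when no CUDA device (or no PyTorch install) is available:

```python
try:
    import torch
except ImportError:  # lets the helper degrade gracefully where torch is absent
    torch = None

def gpu_memory_report(device=0):
    """Return allocated/reserved/total GPU memory in GiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    gib = 1024**3
    return {
        "allocated": torch.cuda.memory_allocated(device) / gib,
        "reserved": torch.cuda.memory_reserved(device) / gib,
        "total": torch.cuda.get_device_properties(device).total_memory / gib,
    }

print(gpu_memory_report())
```

If `allocated` stays high after cleanup, another variable (for example a stored output tensor) is probably still holding a reference to GPU memory.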


## Troubleshooting

### CUDA Out of Memory Error
- Try 8-bit quantization: Set `USE_QUANTIZATION = '8bit'` and reload the model
- Try 4-bit quantization: Set `USE_QUANTIZATION = '4bit'` and reload the model
- Reduce `max_new_tokens` to generate shorter responses
- Close other GPU-intensive applications
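
Reducing `max_new_tokens` can also be automated. The sketch below retries with half the token budget after each out-of-memory failure. `generate_with_backoff` is a hypothetical helper; it relies on CUDA out-of-memory errors being `RuntimeError` subclasses whose message contains "out of memory".

```python
def generate_with_backoff(generate_fn, max_new_tokens=512, min_new_tokens=64,
                          oom_errors=(RuntimeError,)):
    """Call generate_fn(max_new_tokens=...), halving the budget after each OOM-like error."""
    budget = max_new_tokens
    while True:
        try:
            return generate_fn(max_new_tokens=budget)
        except oom_errors as err:
            # Only retry on out-of-memory failures; re-raise anything else
            too_small = budget // 2 < min_new_tokens
            if "out of memory" not in str(err).lower() or too_small:
                raise
            budget //= 2

# Usage with the notebook's objects (needs the loaded model, so left commented out):
# response = generate_with_backoff(
#     lambda max_new_tokens: generate_response(
#         model, tokenizer, prompt1, max_new_tokens=max_new_tokens))
```

Calling `torch.cuda.empty_cache()` between attempts may help as well.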

### Model Download Issues
- Ensure stable internet connection
- The model (~14 GB) downloads automatically on first run
- Cache location: `~/.cache/huggingface/hub/`
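
To check whether a download completed, the model's cache folder and its size on disk can be inspected. The sketch below is simplified: it honors the `HF_HUB_CACHE` and `HF_HOME` environment variables and otherwise falls back to the default location above. The helper names are illustrative.

```python
import os
from pathlib import Path

def hub_cache_dir(env=None):
    """Resolve the Hugging Face hub cache directory from environment variables."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"

def model_cache_dir(repo_id, env=None):
    """Directory where a model repo is cached, named models--Org--Name."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))

def dir_size_gb(path):
    """Total size of regular files under path in GB (0.0 if the directory is missing)."""
    path = Path(path)
    if not path.exists():
        return 0.0
    # Skip symlinks so snapshot links to blobs are not counted twice
    files = (f for f in path.rglob("*") if f.is_file() and not f.is_symlink())
    return sum(f.stat().st_size for f in files) / 1e9

cache = model_cache_dir("Qwen/Qwen2.5-Coder-7B-Instruct")
print(cache, f"{dir_size_gb(cache):.1f} GB")
```

A size well below the expected download size suggests an interrupted download; rerunning the load cell should resume it.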

### ImportError for bitsandbytes
Run: `!pip install bitsandbytes`

Note: On Windows, bitsandbytes may require additional setup.

## Performance Notes

| Configuration | VRAM Usage | Speed (tokens/sec) |
|--------------|------------|-------------------|
| FP16         | ~15 GB     | ~30-50 (GPU)      |
| 8-bit        | ~8-9 GB    | ~20-35 (GPU)      |
| 4-bit        | ~5-6 GB    | ~15-25 (GPU)      |
| CPU          | 0 GB       | ~2-5 (CPU)        |

*Speed varies based on hardware and prompt complexity*
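
Because throughput depends so heavily on the machine, it is worth measuring on your own hardware. Below is a minimal timing sketch: `measure_tokens_per_second` is an illustrative helper that takes any zero-argument generation callable and a token-counting callable.

```python
import time

def measure_tokens_per_second(generate_fn, count_tokens_fn):
    """Time one generation call and return (output, tokens per second)."""
    start = time.perf_counter()
    output = generate_fn()
    elapsed = time.perf_counter() - start
    n_tokens = count_tokens_fn(output)
    return output, (n_tokens / elapsed if elapsed > 0 else float("inf"))

# Usage with the notebook's objects (needs the loaded model, so left commented out):
# text, tps = measure_tokens_per_second(
#     lambda: generate_response(model, tokenizer, "Write quicksort in Python."),
#     lambda text: len(tokenizer(text).input_ids),
# )
# print(f"{tps:.1f} tokens/sec")
```

Re-tokenizing the decoded text only approximates the number of generated tokens, and the timing includes prompt processing, so short generations will look slower than the steady-state rate.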