# Qwen2-VL Model Testing Notebook

This notebook demonstrates how to use the Qwen2-VL-2B-Instruct model for vision-language tasks.
The model can process both text and images to generate responses.

## 1. Setup and Model Loading

In [1]:
# Install required packages if not already installed
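# (accelerate is needed for device_map; requests and psutil are used in later cells)
# !pip install transformers torch torchvision Pillow accelerate requests psutil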

In [1]:
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import requests
from io import BytesIO
import json
import os
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

PyTorch version: 2.2.2
CUDA available: False
MPS available: True
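
The availability flags printed above can drive the choice of device and dtype. The following is a minimal sketch, not part of the original notebook: `pick_device_and_dtype` is a hypothetical helper that takes the two flags as plain booleans so the logic can be checked without a GPU. The preference order, the `allow_mps` switch (off by default, mirroring the notebook's decision to avoid MPS), and the choice of float32 on CPU are all assumptions; half precision on CPU is often slow, but that should be measured rather than taken for granted.

```python
def pick_device_and_dtype(cuda_available, mps_available, allow_mps=False):
    """Return (device, dtype_name) as strings.

    Hypothetical helper: the ordering and dtype choices are assumptions,
    not something the notebook or the model card prescribes.
    """
    if cuda_available:
        return "cuda", "float16"
    if mps_available and allow_mps:
        return "mps", "float16"
    # Fall back to CPU; float32 is assumed here because half-precision
    # kernels on CPU are frequently much slower.
    return "cpu", "float32"


# Flags as reported by the first cell of this notebook
device, dtype_name = pick_device_and_dtype(cuda_available=False, mps_available=True)
print(device, dtype_name)
```

In the notebook itself, the flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the dtype name could be resolved with `getattr(torch, dtype_name)` before passing it as `torch_dtype`.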


In [2]:
# Load the model and processors
model_name = "Qwen/Qwen2-VL-2B-Instruct"

print("Loading model...")

# Use CPU for now to avoid MPS issues
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cpu",
    low_cpu_mem_usage=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

print("Model loaded successfully!")
print(f"Model device: {next(model.parameters()).device}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

print("Model loaded successfully!")

Loading model...


config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/429M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.99G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/272 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

preprocessor_config.json:   0%|          | 0.00/347 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


chat_template.json: 0.00B [00:00, ?B/s]

Model loaded successfully!
Model device: cpu
Model loaded successfully!


In [3]:
# Try loading a quantized or smaller model for faster testing
from transformers import BitsAndBytesConfig

model_name = "Qwen/Qwen2-VL-2B-Instruct"  # Use 2B instead of 7B

print("Loading smaller model for faster testing...")

try:
    # 8-bit quantization via bitsandbytes; pass a BitsAndBytesConfig because
    # the bare load_in_8bit argument is deprecated. This typically needs a
    # CUDA GPU, so on other machines it raises and we fall back below.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
except Exception:
    # Fallback to regular (unquantized) loading on CPU
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="cpu"
    )

tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

print("Model loaded successfully!")
print(f"Model device: {next(model.parameters()).device}")
print(f"Model size: {model_name}")


Loading smaller model for faster testing...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model loaded successfully!
Model device: cpu
Model size: Qwen/Qwen2-VL-2B-Instruct


## 2. Helper Functions

In [8]:
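def load_image(image_path_or_url):
    """Load an image from a local path or an HTTP(S) URL and return it as RGB."""
    if image_path_or_url.startswith(('http://', 'https://')):
        # Fail fast on network problems instead of hanging or decoding an error page
        response = requests.get(image_path_or_url, timeout=30)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(image_path_or_url)

    return image.convert('RGB')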

def generate_response(prompt, image=None, max_new_tokens=50, temperature=0.7):
    """Generate a response from a text prompt and an optional PIL image.

    Decoding is greedy (do_sample=False), so `temperature` is currently unused.
    """

    # Get model device
    device = next(model.parameters()).device

    if image is not None:
        # For vision-language tasks
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": image,
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ]
    else:
        # For text-only tasks
        messages = [
            {
                "role": "user",
                "content": prompt,
            }
        ]

    # Apply chat template
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize the text and preprocess the image (if any)
    inputs = processor(
        text=[text],
        images=[image] if image is not None else None,
        padding=True,
        return_tensors="pt",
    )

    # Move inputs to model device
    inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

    # Generate response with faster settings
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Use greedy decoding (faster than sampling)
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode response
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs['input_ids'], generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]
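
Two steps inside `generate_response` are plain data manipulation and can be looked at in isolation: building the chat message list, and slicing the echoed prompt tokens off the generated sequences. The sketch below restates both with ordinary Python lists so it runs without the model. `build_messages` and `trim_prompt_tokens` are hypothetical names introduced here for illustration, and the string standing in for the image is a placeholder, not a real `PIL.Image`.

```python
def build_messages(prompt, image=None):
    """Build the single-turn message list that is passed to apply_chat_template."""
    if image is None:
        # Text-only: content is a plain string
        return [{"role": "user", "content": prompt}]
    # Vision-language: content is a list of typed parts, image first
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt tokens that generate() echoes at the start of each output."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


text_only = build_messages("Hi")
with_image = build_messages("Describe this image.", image="<image placeholder>")
part_types = [part["type"] for part in with_image[0]["content"]]

# Toy token ids: the prompt is 3 tokens long and 2 new tokens were generated
new_tokens = trim_prompt_tokens([[1, 2, 3]], [[1, 2, 3, 7, 8]])

print(text_only)
print(part_types)
print(new_tokens)
```

The slicing works because, for this kind of decoder-only generation, each output sequence starts with the corresponding input sequence, so the prompt length is the right offset.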

## 3. Text-only Testing

In [10]:
# Test 1: Simple text generation (optimized)
print("=== Test 1: Simple Text Generation (Fast) ===")
import time

start = time.time()
prompt = "Hi"
response = generate_response(prompt, max_new_tokens=30)  # Shorter response
end = time.time()

print(f"Prompt: {prompt}")
print(f"Response: {response}")
print(f"Time taken: {end - start:.1f} seconds")
print()

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


=== Test 1: Simple Text Generation (Fast) ===


KeyboardInterrupt: 

In [None]:
# Test 2: Code generation
print("=== Test 2: Code Generation ===")
prompt = "Write a Python function to calculate the Fibonacci sequence up to n terms."
response = generate_response(prompt)
print(f"Prompt: {prompt}")
print(f"Response: {response}")
print()

## 4. Vision-Language Testing

In [None]:
# Test 3: Image description with online image
print("=== Test 3: Online Image Description ===")
try:
    image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
    image = load_image(image_url)
    
    prompt = "Describe this image in detail."
    response = generate_response(prompt, image=image)
    
    print(f"Image URL: {image_url}")
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print()
    
    # Display image
    print("Image:")
    display(image)
    
except Exception as e:
    print(f"Error loading online image: {e}")

In [None]:
# Test 4: Document understanding with local images
print("=== Test 4: Document Analysis ===")

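# Collect the per-document subdirectories under data/dom/MMLongBench-Doc.
# Filter with is_dir() because glob("*/") also matches files on Python < 3.11,
# and sort so that "the first directory" is deterministic.
image_dirs = sorted(p for p in Path("data/dom/MMLongBench-Doc").glob("*") if p.is_dir())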
if image_dirs:
    # Take the first directory with images
    first_dir = image_dirs[0]
    image_files = list(first_dir.glob("*.png"))
    
    if image_files:
        # Use the first image
        image_path = image_files[0]
        print(f"Using image: {image_path}")
        
        try:
            image = load_image(str(image_path))
            
            prompt = "What type of document is this? Describe the content and structure you can see."
            response = generate_response(prompt, image=image)
            
            print(f"Prompt: {prompt}")
            print(f"Response: {response}")
            print()
            
            # Display image
            print("Image:")
            display(image.resize((400, 400)))
            
        except Exception as e:
            print(f"Error processing local image: {e}")
    else:
        print("No PNG images found in the first directory")
else:
    print("No image directories found in data/dom/MMLongBench-Doc")
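
The directory-then-file lookup in Test 4 is repeated in the later tests, so it may be convenient to factor it out. The sketch below is one way to do that under stated assumptions: `find_images` is a hypothetical helper, the layout (one subdirectory per document, PNG pages inside) is inferred from the cell above, and the demo runs on a temporary directory rather than on `data/dom/MMLongBench-Doc`.

```python
import tempfile
from pathlib import Path


def find_images(root, pattern="*.png", max_dirs=None):
    """Return sorted image paths from the immediate subdirectories of root.

    Returns an empty list when root does not exist, so callers can simply
    test the result for truthiness.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    if max_dirs is not None:
        subdirs = subdirs[:max_dirs]
    images = []
    for subdir in subdirs:
        images.extend(sorted(subdir.glob(pattern)))
    return images


# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    doc_dir = Path(tmp) / "doc_a"
    doc_dir.mkdir()
    (doc_dir / "page_1.png").touch()
    (doc_dir / "notes.txt").touch()
    found = [p.name for p in find_images(tmp)]
    missing = find_images(Path(tmp) / "does_not_exist")

print(found)
print(missing)
```

With such a helper, a test cell could start from `find_images("data/dom/MMLongBench-Doc", max_dirs=1)` and report "no images" whenever the list is empty.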

In [None]:
# Test 5: Mathematical problem solving with image
print("=== Test 5: Mathematical Content Analysis ===")

# Look for academic paper images (likely to have mathematical content)
academic_dirs = [d for d in image_dirs if any(x in d.name for x in ['2005', '2021', '2023', '2024'])]

if academic_dirs:
    academic_dir = academic_dirs[0]
    academic_images = list(academic_dir.glob("*.png"))
    
    if academic_images:
        image_path = academic_images[0]
        print(f"Using academic paper image: {image_path}")
        
        try:
            image = load_image(str(image_path))
            
            prompt = "Analyze this academic content. What mathematical concepts, formulas, or technical information can you identify?"
            response = generate_response(prompt, image=image)
            
            print(f"Prompt: {prompt}")
            print(f"Response: {response}")
            print()
            
            # Display image
            print("Image:")
            display(image.resize((600, 400)))
            
        except Exception as e:
            print(f"Error processing academic image: {e}")
    else:
        print("No images found in academic directories")
else:
    print("No academic paper directories found")

## 5. Advanced Testing and Evaluation

In [None]:
# Test 6: OCR and text extraction
print("=== Test 6: OCR and Text Extraction ===")

# Look for document images that likely contain text
if image_dirs:
    # Try to find a table or figure image
    table_images = []
    for img_dir in image_dirs[:3]:  # Check first 3 directories
        table_imgs = list(img_dir.glob("*table*.png")) + list(img_dir.glob("*figure*.png"))
        table_images.extend(table_imgs[:2])  # Take up to 2 from each directory
    
    if table_images:
        image_path = table_images[0]
        print(f"Using document image: {image_path}")
        
        try:
            image = load_image(str(image_path))
            
            prompt = "Extract and transcribe all the text you can see in this image. Include any numbers, labels, and structured data."
            response = generate_response(prompt, image=image, max_new_tokens=1024)
            
            print(f"Prompt: {prompt}")
            print(f"Response: {response}")
            print()
            
            # Display image
            print("Image:")
            display(image.resize((600, 400)))
            
        except Exception as e:
            print(f"Error processing document image: {e}")
    else:
        print("No table or figure images found")
else:
    print("No image directories available")

In [None]:
# Test 7: Comparative analysis
print("=== Test 7: Multi-image Comparison ===")

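# Describe two images from the same document one at a time.
# Each image is sent in its own prompt, so the model never sees both together;
# comparing the two descriptions is left to the reader.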
if image_dirs:
    first_dir = image_dirs[0]
    images = list(first_dir.glob("*.png"))[:2]  # Take first 2 images
    
    if len(images) >= 2:
        print(f"Comparing images from: {first_dir.name}")
        
        for i, img_path in enumerate(images):
            try:
                image = load_image(str(img_path))
                
                prompt = f"This is image {i+1} from a document. Describe what you see and identify key elements."
                response = generate_response(prompt, image=image)
                
                print(f"\n--- Image {i+1}: {img_path.name} ---")
                print(f"Response: {response}")
                
                # Display image
                print(f"Image {i+1}:")
                display(image.resize((300, 200)))
                
            except Exception as e:
                print(f"Error processing image {i+1}: {e}")
    else:
        print("Not enough images for comparison")
else:
    print("No image directories available")

## 6. Model Performance and Memory Usage

In [None]:
# Check model memory usage and performance
import time
import psutil
import gc

def get_memory_usage():
    """Get current memory usage"""
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    return memory_info.rss / 1024 / 1024 / 1024  # Convert to GB

print("=== Model Performance Analysis ===")
print(f"Current memory usage: {get_memory_usage():.2f} GB")

# Test inference speed
test_prompts = [
    "What is machine learning?",
    "Explain quantum computing in simple terms.",
    "How does photosynthesis work?"
]

total_time = 0
for i, prompt in enumerate(test_prompts):
    start_time = time.time()
    response = generate_response(prompt, max_new_tokens=100)
    end_time = time.time()
    
    inference_time = end_time - start_time
    total_time += inference_time
    
    print(f"\nTest {i+1}:")
    print(f"Prompt: {prompt}")
    print(f"Inference time: {inference_time:.2f} seconds")
    print(f"Response length: {len(response)} characters")

print(f"\nAverage inference time: {total_time/len(test_prompts):.2f} seconds")
print(f"Final memory usage: {get_memory_usage():.2f} GB")

# Clean up
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
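
The timing loop above can be separated from the model call, which makes it easy to reuse and to check on its own. The following is a rough sketch, assuming a hypothetical `time_calls` helper and a stand-in `fake_generate` function in place of `generate_response`; it uses `time.perf_counter`, which is generally a better fit for measuring elapsed intervals than `time.time`.

```python
import statistics
import time


def time_calls(fn, inputs):
    """Call fn on each input and return (results, per-call durations in seconds)."""
    results, durations = [], []
    for item in inputs:
        start = time.perf_counter()
        results.append(fn(item))
        durations.append(time.perf_counter() - start)
    return results, durations


def fake_generate(prompt):
    """Stand-in for generate_response so the sketch runs without the model."""
    return prompt.upper()


results, durations = time_calls(fake_generate, ["a", "b", "c"])
print(f"mean: {statistics.mean(durations):.6f} s, max: {max(durations):.6f} s")
```

In the notebook, `fn` could be something like `lambda p: generate_response(p, max_new_tokens=100)`. The first call often includes one-off warm-up costs, so reporting it separately from the mean may give a fairer picture.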

## 7. Save Test Results

In [None]:
# Save model information and test summary
model_info = {
    "model_name": model_name,
    "device": str(next(model.parameters()).device),
    "torch_version": torch.__version__,
    "model_parameters": sum(p.numel() for p in model.parameters()),
    "model_dtype": str(model.dtype),
    "cuda_available": torch.cuda.is_available(),
    "mps_available": torch.backends.mps.is_available(),
}

print("=== Model Information ===")
for key, value in model_info.items():
    print(f"{key}: {value}")

# Save to JSON file
with open("qwen2vl_test_results.json", "w") as f:
    json.dump(model_info, f, indent=2, default=str)

print("\nTest results saved to qwen2vl_test_results.json")
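
The cell above stores only model metadata. If per-test outcomes should be kept as well, one option is to write them next to the metadata in the same JSON file. The sketch below assumes a hypothetical `save_results` helper and an illustrative record layout (`name`, `prompt`, `response`, `seconds`, `status`); none of these field names come from the notebook. The example record reflects the only test that was started in this run, which was interrupted before it produced a response.

```python
import json
import tempfile
from pathlib import Path


def save_results(path, model_info, test_records):
    """Write model metadata and per-test records to one JSON file."""
    payload = {"model_info": model_info, "tests": test_records}
    Path(path).write_text(
        json.dumps(payload, indent=2, default=str), encoding="utf-8"
    )
    return payload


# Illustrative inputs; the record layout is an assumption
model_info = {"model_name": "Qwen/Qwen2-VL-2B-Instruct", "device": "cpu"}
records = [
    {
        "name": "simple_text_generation",
        "prompt": "Hi",
        "response": None,
        "seconds": None,
        "status": "interrupted",
    }
]

# Write to a temporary directory and read the file back to confirm the round trip
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "qwen2vl_test_results.json"
    payload = save_results(out_path, model_info, records)
    loaded = json.loads(out_path.read_text(encoding="utf-8"))

print(loaded["tests"][0]["status"])
```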

## 8. Custom Interactive Testing

In [None]:
# Interactive testing function
def interactive_test():
    """Interactive function for custom testing"""
    print("Interactive Qwen2-VL Testing")
    print("Enter your prompts below. Type 'quit' to exit.")
    print("To test with an image, first provide the image path, then the prompt.")
    print("Example: 'image:/path/to/image.png' followed by 'What do you see in this image?'")
    print("Type 'clear' to drop the current image and return to text-only prompts.")
    
    current_image = None
    
    while True:
        user_input = input("\nEnter prompt (or 'quit' to exit): ").strip()
        
        if user_input.lower() == 'quit':
            break
        
        if user_input.startswith('image:'):
            image_path = user_input[6:].strip()
            try:
                current_image = load_image(image_path)
                print(f"Image loaded: {image_path}")
                display(current_image.resize((300, 200)))
            except Exception as e:
                print(f"Error loading image: {e}")
                current_image = None
            continue
        
        # Match the exact command so prompts such as "clearly explain ..." are not swallowed
        if user_input.lower() == 'clear':
            current_image = None
            print("Image cleared")
            continue
        
        try:
            response = generate_response(user_input, image=current_image)
            print(f"\nResponse: {response}")
        except Exception as e:
            print(f"Error generating response: {e}")

# Uncomment the line below to run interactive testing
# interactive_test()
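
The command handling inside `interactive_test` can be pulled out into a small pure function, which makes the rules (`quit`, `clear`, `image:<path>`, anything else is a prompt) easy to check without running the input loop. This is a sketch only: `parse_command` is a hypothetical name, and matching `quit` and `clear` case-insensitively as whole commands is an assumption about the intended behaviour.

```python
def parse_command(user_input):
    """Classify one line of input as (kind, payload).

    kind is one of 'quit', 'clear', 'image', or 'prompt'.
    """
    text = user_input.strip()
    lowered = text.lower()
    if lowered == "quit":
        return ("quit", None)
    if lowered == "clear":
        return ("clear", None)
    if lowered.startswith("image:"):
        # Everything after the prefix is treated as the image path or URL
        return ("image", text[len("image:"):].strip())
    return ("prompt", text)


for line in ["image: /path/to/image.png", "clearly explain this", " QUIT "]:
    print(repr(line), "->", parse_command(line))
```

The loop would then branch on the returned kind, calling `load_image` for `'image'` and `generate_response` for `'prompt'`.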

## Conclusion

This notebook provides a testing framework for the Qwen2-VL-2B-Instruct model. The tests cover:

1. **Text-only generation**: Basic language understanding and code generation
2. **Vision-language tasks**: Image description and document analysis
3. **OCR capabilities**: Text extraction from images
4. **Performance analysis**: Speed and memory usage
5. **Interactive testing**: Custom prompts and image analysis

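Qwen2-VL is a multimodal model intended for tasks such as document processing and image analysis. In the run recorded here, only model loading completed: Test 1 was interrupted while generating on CPU, and the remaining cells were not executed, so they still need to be run before drawing conclusions about output quality or speed.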