### Universal Recognition with Qwen2.5-VL

This notebook demonstrates how to use Qwen2.5-VL for universal recognition. Given an image and a text query, the model interprets the query in the context of the image.

!pip install git+https://github.com/huggingface/transformers
!pip install qwen-vl-utils
!pip install qwen_agent
!pip install openai

#### \[Setup\]

Install the dependencies, then load the plotting and inference utilities.

In [1]:
!pip install git+https://github.com/huggingface/transformers
!pip install qwen-vl-utils
!pip install openai

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-x9up0p8r
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-x9up0p8r
  Resolved https://github.com/huggingface/transformers to commit 41b9b92b52215bed472c9a534a06abbc3a9a95cd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting filelock (from transformers==4.51.0.dev0)
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers==4.51.0.dev0)
  Using cached huggingface_hub-0.30.1-py3-none-any.whl.metadata (13 kB)
Collecting numpy>=1.17 (from transformers==4.51.0.dev0)
  Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting pyyaml>=5.1 (from tra

In [7]:
import json
import random
from PIL import Image, ImageDraw, ImageFont
from openai import OpenAI
import os
import base64

sys_prompt = '''You are a prompt enhancer for image generation models like DALL·E or Midjourney.

Given:

An image.

The original text prompt used to generate the image.

Your task is to analyze the image and the original prompt, and then output a more detailed, vivid, and compositionally rich enhanced prompt that would recreate the image with higher fidelity, aesthetic quality, and visual richness. The enhanced prompt should:

Include specific objects, styles, settings, lighting, mood, and artistic techniques observed in the image.

Be clear, descriptive, and suitable for input into an AI image generator.

Improve upon vague or minimal descriptions in the original prompt.

Input:

Image: [insert image]

Original Prompt: '[insert original prompt here]'

Output:

Enhanced Prompt: '[Your improved prompt here]'
'''

# @title inference function
def inference(image_path, prompt, sys_prompt=sys_prompt, max_new_tokens=4096, return_input=False):
    image = Image.open(image_path)
    image_local_path = "file://" + image_path
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"image": image_local_path},
            ]
        },
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # print("text:", text)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    # Move inputs to the device that device_map="auto" selected for the model
    inputs = inputs.to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated tokens are decoded
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    if return_input:
        return output_text[0], inputs
    else:
        return output_text[0]
    
    



# Encode an image file as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
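
`encode_image` is defined here but never called in this notebook. Given the `openai` import, one plausible use is sending the image inline to an OpenAI-compatible endpoint as a base64 data URL. The sketch below shows what such a message could look like; `build_image_message` and the default MIME type are hypothetical and not part of the original notebook.

```python
import base64


def encode_image(image_path):
    """Read a file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def build_image_message(image_path, prompt, mime_type="image/png"):
    """Hypothetical helper: build a user message carrying an inline base64 image."""
    data_url = f"data:{mime_type};base64,{encode_image(image_path)}"
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dictionary follows the chat message shape used for image inputs by OpenAI-compatible servers. Whether a given deployment accepts data URLs is an assumption to check against that server's documentation.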



Load the model and processor.

In [8]:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2",device_map="auto")
processor = AutoProcessor.from_pretrained(model_path)

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
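
The load call above requests `attn_implementation="flash_attention_2"`, which depends on the separate `flash-attn` package being installed. A minimal sketch of choosing a fallback, assuming PyTorch's built-in `"sdpa"` attention is acceptable when `flash-attn` is missing; `pick_attn_implementation` is a hypothetical helper, not part of transformers.

```python
import importlib.util


def pick_attn_implementation():
    """Hypothetical helper: prefer FlashAttention 2 when the flash_attn package is importable."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # "sdpa" selects PyTorch's scaled_dot_product_attention backend in transformers
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result can be passed as `attn_implementation=attn_implementation` to `from_pretrained`. This only checks that the package is importable; it does not verify that the GPU supports FlashAttention.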



#### 1. Birds Recognition

There are more than 10,000 bird species in the world, and many of them have only slight differences in appearance. This is a very challenging fine-grained recognition task.

##### 1.1 Single image recognition 

In [12]:
import pandas as pd
import requests
from io import BytesIO
from PIL import Image
import os

# ====== CONFIGURATION ======
INPUT_CSV_PATH = "../datasets/900k-diffusion-prompts-dataset/diffusion_prompts.csv"
OUTPUT_CSV_PATH = "../output/enhanced_prompts.csv"
NUM_SAMPLES = 100  # <--- Change this number to control how many rows are processed
IMAGE_SAVE_DIR = "../datasets/900k-diffusion-prompts-dataset/downloaded_images"

# ====== Ensure image and output directories exist ======
os.makedirs(IMAGE_SAVE_DIR, exist_ok=True)
os.makedirs(os.path.dirname(OUTPUT_CSV_PATH), exist_ok=True)

# ====== Load the CSV ======
df = pd.read_csv(INPUT_CSV_PATH)

# ====== Slice for the desired number of rows ======
df_subset = df.head(NUM_SAMPLES)

# ====== Store successful rows and enhanced prompts ======
successful_rows = []
enhanced_prompts = []

# ====== Process each row ======
for idx, row in df_subset.iterrows():
    prompt = row['prompt']
    image_url = row['url']
    image_id = row['id']
    
    try:
        # Download image (the timeout keeps a stalled request from hanging the loop)
        response = requests.get(image_url, timeout=30)
        response.raise_for_status()  # raise on HTTP error status codes
        image = Image.open(BytesIO(response.content))
        image_path = os.path.join(IMAGE_SAVE_DIR, f"{image_id}.png")
        image.save(image_path)

        # Run inference
        model_response = inference(image_path, prompt)
        print(f"model_response: {model_response}")

        # Extract enhanced prompt; the system prompt asks for single quotes, so strip both kinds
        if isinstance(model_response, str) and "Enhanced Prompt:" in model_response:
            enhanced_prompt = model_response.split("Enhanced Prompt:")[-1].strip().strip("'\"")
        else:
            print(f"Warning: No enhanced prompt in response for row {idx}")
            continue  # skip this row

        # Save successful row and enhanced prompt
        enhanced_prompts.append(enhanced_prompt)
        successful_rows.append(row)

    except Exception as e:
        print(f"Error processing row {idx}: {e}")
        continue  # skip this row

# ====== Build final DataFrame and save ======
final_df = pd.DataFrame(successful_rows)
final_df['prompt'] = enhanced_prompts  # replace with enhanced prompts

final_df.to_csv(OUTPUT_CSV_PATH, index=False)

# ====== Summary ======
print(f"Intended to process {NUM_SAMPLES} rows.")
print(f"Successfully processed and saved {len(final_df)} rows.")
print(f"Enhanced CSV saved to {OUTPUT_CSV_PATH}")
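
The extraction step in the loop can be factored into a small helper that is easy to test on its own. This is a minimal sketch, assuming the model follows the `Enhanced Prompt:` label requested in the system prompt; `extract_enhanced_prompt` and `QUOTE_CHARS` are hypothetical names, not part of the original notebook.

```python
# Straight and typographic quotes that may wrap the model's answer
QUOTE_CHARS = "'\"\u2018\u2019\u201c\u201d"


def extract_enhanced_prompt(model_response):
    """Return the text after the last 'Enhanced Prompt:' label, or None if it is absent or empty."""
    if not isinstance(model_response, str):
        return None
    _, label, tail = model_response.rpartition("Enhanced Prompt:")
    if not label:
        return None
    cleaned = tail.strip().strip(QUOTE_CHARS).strip()
    return cleaned or None
```

Returning `None` for a missing or empty answer lets the caller keep a single `if enhanced_prompt is None: continue` check. Note that stripping quote characters also removes a legitimate apostrophe at the very end of a prompt.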


KeyboardInterrupt: 
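
The run above ended with a `KeyboardInterrupt`. Because the loop writes `OUTPUT_CSV_PATH` only after every row has finished, an interrupted run saves nothing. One way to limit the loss, sketched here with the standard `csv` module, is to append each result as it is produced and to skip ids already on disk when the cell is rerun. `append_result`, `load_done_ids`, and the two-column layout are hypothetical; the original notebook keeps all dataset columns.

```python
import csv
import os

FIELDNAMES = ["id", "prompt"]  # assumed columns; extend to match the dataset


def load_done_ids(csv_path):
    """Return the set of ids already written to csv_path (empty if the file is missing)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}


def append_result(csv_path, image_id, enhanced_prompt):
    """Append one row, writing the header first when the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        writer.writerow({"id": str(image_id), "prompt": enhanced_prompt})
```

In the processing loop this would mean calling `load_done_ids(OUTPUT_CSV_PATH)` once before iterating, skipping rows whose `str(image_id)` is in that set, and calling `append_result` right after each successful inference. Ids are compared as strings because `csv` reads every field back as text.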