This notebook contains the implementation of the models to:

- Make boundary boxes
- Crop the images
    - Store the cropped images in a folder, with the name of the shelf_image. All the cropped images should be there
- Try to eliminate the non-product images

Models:

- YOLOv11.n (10MB)
- YOLOv5 (148MB)

In [None]:
# Dependencies and Packages:

!pip install ultralytics
# from google.colab import drive
# drive.mount('/content/drive')
# !cp "/content/drive/MyDrive/YOLOv5_SKU.pt" /content/
!git clone https://github.com/warun7/dotslash-repo.git
!cp "/content/dotslash-repo/YOLOv11_SKU.pt" /content/
!unzip /content/dotslash-repo/smart_cataloging.zip

Collecting ultralytics
  Downloading ultralytics-8.3.36-py3-none-any.whl.metadata (35 kB)
Collecting ultralytics-thop>=2.0.0 (from ultralytics)
  Downloading ultralytics_thop-2.0.12-py3-none-any.whl.metadata (9.4 kB)
Downloading ultralytics-8.3.36-py3-none-any.whl (887 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m887.3/887.3 kB[0m [31m40.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading ultralytics_thop-2.0.12-py3-none-any.whl (26 kB)
Installing collected packages: ultralytics-thop, ultralytics
Successfully installed ultralytics-8.3.36 ultralytics-thop-2.0.12
Cloning into 'dotslash-repo'...
remote: Enumerating objects: 634, done.[K
remote: Counting objects: 100% (634/634), done.[K
remote: Compressing objects: 100% (465/465), done.[K
remote: Total 634 (delta 38), reused 623 (delta 36), pack-reused 0 (from 0)[K
Receiving objects: 100% (634/634), 56.49 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (38/38), done.
Archive:  /content/dotslash-repo/smart_catalogin

In [None]:
## Importing the models

import torch
import ultralytics
from ultralytics import YOLO



Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.


In [None]:
import os
import numpy as np
from PIL import Image
from pathlib import Path
import cv2
from ultralytics import YOLO  # For YOLOv11

class YOLOv11Detector:
    def __init__(self, weights_path, base_output_dir):
        """
        Initialize YOLOv11 detector

        Args:
            weights_path (str): Path to YOLOv11 weights
            base_output_dir (str): Base directory for outputs
        """
        self.model = self._load_model(weights_path)
        self.base_output_dir = base_output_dir
        os.makedirs(base_output_dir, exist_ok=True)

    def _load_model(self, weights_path):
        """Load YOLOv11 model with specified weights"""
        model = YOLO(weights_path)
        model.conf = 0.25  # confidence threshold
        model.iou = 0.45   # NMS IoU threshold
        return model

    def create_output_directory(self, image_name):
        """Create output directory structure for an image"""
        image_dir = os.path.join(self.base_output_dir, image_name)
        os.makedirs(image_dir, exist_ok=True)
        return image_dir

    def draw_boxes(self, image, detections):
        """Draw bounding boxes on image"""
        img = np.array(image)

        # Get detections
        boxes = detections.boxes.data.cpu().numpy()

        for box in boxes:
            x1, y1, x2, y2 = map(int, box[:4])
            conf = box[4]

            # Draw rectangle
            cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)

            # Add confidence score
            conf_text = f'{conf:.2f}'
            cv2.putText(img, conf_text, (x1, y1-10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        return Image.fromarray(img)

    def crop_detections(self, image, detections, output_dir, image_name):
        """Crop and save detected products"""
        saved_paths = []

        # Get detections
        boxes = detections.boxes.data.cpu().numpy()

        for i, box in enumerate(boxes):
            x1, y1, x2, y2 = map(int, box[:4])

            # Crop detection
            cropped = image.crop((x1, y1, x2, y2))

            # Save cropped image
            save_path = os.path.join(output_dir, f"{image_name}_P_{i+1}.jpg")
            cropped.save(save_path)
            saved_paths.append(save_path)

        return saved_paths

    def filter_non_products(self, image_path, min_size=50, aspect_ratio_range=(0.5, 2.0)):
        """Filter out likely non-product images"""
        img = Image.open(image_path)
        width, height = img.size

        if width < min_size or height < min_size:
            return False

        aspect_ratio = width / height
        if not (aspect_ratio_range[0] <= aspect_ratio <= aspect_ratio_range[1]):
            return False

        return True

    def process_image(self, image_path):
        """Process single image with YOLOv11"""
        # Load image
        image = Image.open(image_path)
        image_name = Path(image_path).stem

        print(f"Processing {image_name}...")

        # Create output directory
        output_dir = self.create_output_directory(image_name)

        # Detect products
        detections = self.model(image)[0]

        # Save image with bounding boxes
        bb_image = self.draw_boxes(image, detections)
        bb_path = os.path.join(output_dir, f"{image_name}_BB.jpg")
        bb_image.save(bb_path)

        # Crop and save individual products
        cropped_paths = self.crop_detections(image, detections, output_dir, image_name)

        # # Filter non-products
        # for path in cropped_paths:
        #     if not self.filter_non_products(path):
        #         os.remove(path)
        #         print(f"Removed likely non-product: {path}")

        print(f"Finished processing {image_name}")
        print(f"Detected {len(cropped_paths)} products")

    def process_directory(self, input_dir):
        """Process all images in directory"""
        total_images = len([f for f in os.listdir(input_dir)
                          if f.lower().endswith(('.jpg', '.jpeg'))])
        processed = 0

        print(f"Found {total_images} images to process")

        for image_file in os.listdir(input_dir):
            if image_file.lower().endswith(('.jpg', '.jpeg')):
                image_path = os.path.join(input_dir, image_file)
                self.process_image(image_path)
                processed += 1
                print(f"Progress: {processed}/{total_images} images processed")

def main():
    # Configuration
    input_dir = "/content/smart_cataloging"
    base_output_dir = "YOLOv11_results"
    yolov11_weights = "/content/YOLOv11_SKU.pt"

    # Initialize detector
    detector = YOLOv11Detector(
        weights_path=yolov11_weights,
        base_output_dir=base_output_dir
    )

    # Process all images
    detector.process_directory(input_dir)

if __name__ == "__main__":
    main()

Found 52 images to process
Processing store_22...

0: 640x480 175 objects, 46.6ms
Speed: 35.2ms preprocess, 46.6ms inference, 751.4ms postprocess per image at shape (1, 3, 640, 480)
Finished processing store_22
Detected 175 products
Progress: 1/52 images processed
Processing store_26...

0: 640x480 93 objects, 9.7ms
Speed: 2.5ms preprocess, 9.7ms inference, 1.6ms postprocess per image at shape (1, 3, 640, 480)
Finished processing store_26
Detected 93 products
Progress: 2/52 images processed
Processing store_18...

0: 640x480 17 objects, 9.7ms
Speed: 2.8ms preprocess, 9.7ms inference, 1.5ms postprocess per image at shape (1, 3, 640, 480)
Finished processing store_18
Detected 17 products
Progress: 3/52 images processed
Processing store_47...

0: 640x480 110 objects, 9.6ms
Speed: 2.7ms preprocess, 9.6ms inference, 1.4ms postprocess per image at shape (1, 3, 640, 480)
Finished processing store_47
Detected 110 products
Progress: 4/52 images processed
Processing store_34...

0: 480x640 64 ob

In [None]:
!zip YOLOv11_results_all -r /content/YOLOv11_results

  adding: content/YOLOv11_results/ (stored 0%)
  adding: content/YOLOv11_results/store_6/ (stored 0%)
  adding: content/YOLOv11_results/store_6/store_6_P_26.jpg (deflated 12%)
  adding: content/YOLOv11_results/store_6/store_6_P_55.jpg (deflated 9%)
  adding: content/YOLOv11_results/store_6/store_6_P_68.jpg (deflated 5%)
  adding: content/YOLOv11_results/store_6/store_6_P_56.jpg (deflated 15%)
  adding: content/YOLOv11_results/store_6/store_6_P_92.jpg (deflated 10%)
  adding: content/YOLOv11_results/store_6/store_6_P_104.jpg (deflated 6%)
  adding: content/YOLOv11_results/store_6/store_6_P_66.jpg (deflated 16%)
  adding: content/YOLOv11_results/store_6/store_6_P_29.jpg (deflated 9%)
  adding: content/YOLOv11_results/store_6/store_6_P_61.jpg (deflated 11%)
  adding: content/YOLOv11_results/store_6/store_6_P_21.jpg (deflated 9%)
  adding: content/YOLOv11_results/store_6/store_6_P_107.jpg (deflated 5%)
  adding: content/YOLOv11_results/store_6/store_6_P_76.jpg (deflated 5%)
  adding: conte

## Trying OCR:

In [None]:
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("huz-relay/idefics2-8b-ocr")
model = AutoModelForImageTextToText.from_pretrained("huz-relay/idefics2-8b-ocr")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


processor_config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Chat templates should be in a 'chat_template.json' file but found key='chat_template' in the processor's config. Make sure to move your template to its own file.


preprocessor_config.json:   0%|          | 0.00/460 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.64k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/92.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.04k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/74.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/7 [00:00<?, ?it/s]

model-00001-of-00007.safetensors:   0%|          | 0.00/4.64G [00:00<?, ?B/s]

model-00002-of-00007.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00003-of-00007.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00004-of-00007.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00005-of-00007.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00006-of-00007.safetensors:   0%|          | 0.00/4.83G [00:00<?, ?B/s]

model-00007-of-00007.safetensors:   0%|          | 0.00/4.25G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

In [1]:
print(model)

NameError: name 'model' is not defined

In [None]:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

def perform_ocr(image_path, model, processor, prompt="The following is a product image, extract the product name from this image:", max_new_tokens=512):
    """
    Perform OCR on an image using IDEFICS2 model.

    Args:
        image_path (str): Path to the image file
        model: The loaded IDEFICS2 model
        processor: The loaded IDEFICS2 processor
        prompt (str): Prompt to guide the text extraction
        max_new_tokens (int): Maximum number of tokens to generate

    Returns:
        str: Extracted text from the image
    """
    # Load and preprocess the image
    image = Image.open(image_path)

    # Prepare inputs
    inputs = processor(
        prompt,
        images=image,
        return_tensors="pt",
        truncation=True,
        max_length=2048
    )

    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False
        )

    # Decode the generated text
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

    # Remove the prompt from the generated text
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):].strip()

    return generated_text

# Example usage
if __name__ == "__main__":
    # Load model and processor (you already have this part)
    # processor = AutoProcessor.from_pretrained("huz-relay/idefics2-8b-ocr")
    # model = AutoModelForImageTextToText.from_pretrained("huz-relay/idefics2-8b-ocr")

    # Example usage
    image_path = "/content/YOLOv11_results/store_29/store_29_P_26.jpg"
    extracted_text = perform_ocr(image_path, model, processor)
    print("Extracted text:", extracted_text)

    # Example with custom prompt
    custom_prompt = "The following is a product image, extract the product name from this image"
    extracted_text = perform_ocr(image_path, model, processor, prompt=custom_prompt)
    print("Extracted text with custom prompt:", extracted_text)

TypeError: Idefics2Processor.__call__() got multiple values for argument 'images'

In [None]:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

def perform_ocr(image_path, model, processor, prompt="Extract all text from this image:", max_new_tokens=512):
    """
    Perform OCR on an image using IDEFICS2 model.

    Args:
        image_path (str): Path to the image file
        model: The loaded IDEFICS2 model
        processor: The loaded IDEFICS2 processor
        prompt (str): Prompt to guide the text extraction
        max_new_tokens (int): Maximum number of tokens to generate

    Returns:
        str: Extracted text from the image
    """
    # Load and preprocess the image
    image = Image.open(image_path)

    # Prepare inputs - format text and images correctly
    inputs = processor(
       text=[prompt],
       images=image,
       return_tensors="pt",
       truncation=True,
       max_length=2048
   )
    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False
        )

    # Decode the generated text
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

    # Remove the prompt from the generated text if it appears at the start
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):].strip()

    return generated_text

# Example usage
if __name__ == "__main__":
    # Load image and try OCR
    image_path = "/content/Everes_masala.jpg"

    # Different prompts to try
    prompts = [
        "Extract all text from this image:",
        "What text can you read in this image? Extract all text:",
        "Please transcribe all visible text in this image, including any small text:"
    ]

    # Try with default prompt first
    try:
        extracted_text = perform_ocr(image_path, model, processor)
        print("Extracted text:", extracted_text)
    except Exception as e:
        print(f"Error with default prompt: {str(e)}")

    # Try with different prompts
    for prompt in prompts:
        try:
            print(f"\nTrying with prompt: {prompt}")
            extracted_text = perform_ocr(image_path, model, processor, prompt=prompt)
            print("Extracted text:", extracted_text)
        except Exception as e:
            print(f"Error with prompt '{prompt}': {str(e)}")

Error with default prompt: The number of images in the text [0] and images  [1] should be the same.

Trying with prompt: Extract all text from this image:
Error with prompt 'Extract all text from this image:': The number of images in the text [0] and images  [1] should be the same.

Trying with prompt: What text can you read in this image? Extract all text:
Error with prompt 'What text can you read in this image? Extract all text:': The number of images in the text [0] and images  [1] should be the same.

Trying with prompt: Please transcribe all visible text in this image, including any small text:
Error with prompt 'Please transcribe all visible text in this image, including any small text:': The number of images in the text [0] and images  [1] should be the same.


In [None]:
!export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

In [None]:
model = model.to("cuda")

OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 211.06 MiB is free. Process 4277 has 14.54 GiB memory in use. Of the allocated memory 14.16 GiB is allocated by PyTorch, and 236.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
import requests
import torch
from PIL import Image
from io import BytesIO

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

# DEVICE = "cuda:0"

# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("/content/Everest_SM.jpeg")
image2 = load_image("/content/Everest_PB.jpg")
image3 = load_image("/content/Everes_masala.jpg")



In [None]:
# processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
# model = AutoModelForVision2Seq.from_pretrained(
#     "HuggingFaceM4/idefics2-8b-base",
# ).to(DEVICE)

# Create inputs
prompts = [
  "<image>The following is a product image, extract the product name from this image.<image>The following is a product image, extract the product name from this image,",
  "The following is a product image, extract the product name from this image<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
# ['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of Chicago, and more specifically the skyscrapers of the city.', 'In which city is that bridge located? The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and']





RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)