# Automatic Mask Generation Using Unsupervised Approach with Grounding Dino, SAM2, and Gemma3

In this notebook, we build an end-to-end unsupervised pipeline for object detection, segmentation, classification, and tracking—focusing on identifying and following milk pouches without manual labels. This approach leverages cutting-edge vision and language models and concludes with lightweight object tracking based on extracted features from segmentation masks.

Key Components:



1.   **Grounding Dino**

An open-set vision-language detector that returns bounding boxes for regions matching a free-text prompt (in this notebook, "packet"), so no labeled training data is required.

2.   **SAM2 (Segment Anything Model v2)**

Using the bounding boxes from Grounding Dino, SAM2 generates precise segmentation masks, enabling instance-level understanding and clean extraction of objects.

3.  **Gemma3 12B QAT Model**

Each masked crop is passed to Gemma3 12B, an open-source multimodal model released with quantization-aware training (QAT), which decides whether the crop shows a milk pouch. This gives classification without any supervised training on this dataset.
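
The way the three components chain together can be sketched as below. This is only an illustration: `run_pipeline` and the `detect`, `segment`, and `classify` callables are hypothetical placeholders for Grounding Dino, SAM2, and Gemma3, not functions from any of those libraries.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def run_pipeline(
    image,
    detect: Callable[[object], Sequence[Box]],
    segment: Callable[[object, Box], object],
    classify: Callable[[object], bool],
) -> List[Tuple[Box, bool]]:
    """Chain detection, segmentation, and classification for one image."""
    results = []
    for box in detect(image):                  # stage 1: candidate boxes
        crop = segment(image, box)             # stage 2: masked crop for this box
        results.append((box, classify(crop)))  # stage 3: milk pouch or not
    return results


# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
demo = run_pipeline(
    image="frame",
    detect=lambda img: [(0, 0, 10, 10), (5, 5, 20, 20)],
    segment=lambda img, box: box,
    classify=lambda crop: crop[2] - crop[0] > 12,
)
print(demo)
```

Keeping the stages as separate callables makes it straightforward to swap any one model without touching the other two.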


## Install necessary packages.


In [None]:
!git clone 'https://github.com/IDEA-Research/Grounded-SAM-2'
!pip install 'git+https://github.com/IDEA-Research/Grounded-SAM-2'

%cd 'Grounded-SAM-2'

# Install SAM2
!pip install -e .

# Install Grounding Dino
!pip install --no-build-isolation -e grounding_dino

# Quote the version specifier so the shell does not treat ">" as an output redirect.
!pip install addict yapf 'supervision>=0.22.0'

In [None]:
# Required for Ollama to detect GPUs.
!sudo apt-get install -y pciutils lshw
!pip install ollama

## Import model weights and configuration files.

In [None]:
# Download Grounding Dino weights.
!mkdir grounding_dino_weights
!wget -P ./grounding_dino_weights https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
!wget -P ./grounding_dino_weights https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/refs/heads/main/groundingdino/config/GroundingDINO_SwinT_OGC.py

In [None]:
# Download SAM2 weights
!mkdir sam2_weights
!wget -P ./sam2_weights https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

In [None]:
# Download the sample image from the CircularNet project.
url = (
    "https://raw.githubusercontent.com/tensorflow/models/master/official/"
    "projects/waste_identification_ml/pre_processing/config/sample_images/"
    "IMG_6509.png"
)

!curl -O {url} > /dev/null 2>&1

## Import libraries.

In [None]:
import os
import supervision as sv
import torch
import tqdm
import numpy as np
from torchvision.ops import box_convert
from PIL import Image
from ollama import chat, ChatResponse
import glob
import cv2
import matplotlib.pyplot as plt
import math

In [None]:
#@title Utils

def show_mask(
        mask,
        ax,
        random_color=False,
        borders=True
):
  if random_color:
    color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
  else:
    color = np.array([30/255, 144/255, 255/255, 0.6])
  h, w = mask.shape[-2:]
  binary_mask = mask.astype(np.uint8)
  mask_image =  binary_mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
  if borders:
    contours, _ = cv2.findContours(binary_mask,cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    # Try to smooth contours
    contours = [cv2.approxPolyDP(contour, epsilon=0.01, closed=True) for contour in contours]
    mask_image = cv2.drawContours(mask_image, contours, -1, (1, 1, 1, 0.5), thickness=2)
  ax.imshow(mask_image)


def show_points(
        coords,
        labels,
        ax,
        marker_size=375
):
  pos_points = coords[labels==1]
  neg_points = coords[labels==0]
  ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)
  ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)


def show_box(box, ax):
  x0, y0 = box[0], box[1]
  w, h = box[2] - box[0], box[3] - box[1]
  ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), lw=2))


def show_masks(
        image,
        masks,
        scores,
        point_coords=None,
        box_coords=None,
        input_labels=None,
        borders=True
):
  for i, (mask, score) in enumerate(zip(masks, scores)):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    show_mask(mask, plt.gca(), borders=borders)
    if point_coords is not None:
      assert input_labels is not None
      show_points(point_coords, input_labels, plt.gca())
    if box_coords is not None:
      # boxes
      show_box(box_coords, plt.gca())
    if len(scores) > 1:
      plt.title(f"Mask {i+1}, Score: {score:.3f}", fontsize=18)
    plt.axis('off')
    plt.show()

## Load models.

In [None]:
# Load Grounding Dino model.
from grounding_dino.groundingdino.util.inference import load_model, load_image, predict, annotate

# Path to the pre-trained Grounding Dino model checkpoint
WEIGHTS_PATH = "grounding_dino_weights/groundingdino_swint_ogc.pth"

# Path to the configuration file for the Grounding Dino model variant being used
CONFIG_PATH = "grounding_dino_weights/GroundingDINO_SwinT_OGC.py"

model = load_model(CONFIG_PATH, WEIGHTS_PATH)

In [None]:
# Load SAM2 model.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Path to the pre-trained SAM2 model checkpoint
sam2_checkpoint = "sam2_weights/sam2.1_hiera_large.pt"

# Path to the configuration file for the SAM2 model variant being used
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"

# Build the SAM2 model using the config and checkpoint; use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=device)

# Create a predictor object using the loaded SAM2 model for image-based mask prediction
sam2_predictor = SAM2ImagePredictor(sam2_model)

## Inference

In [None]:
%%time
# Inference via Grounding Dino. The %%time cell magic must be the first line of the cell.
IMAGE_PATH = "IMG_6509.png"
TEXT_PROMPT = "packet"
BOX_THRESHOLD = 0.25
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)
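
The two thresholds passed to `predict` play different roles. The sketch below is a simplified, pure-Python picture of how they are commonly understood to act: a candidate box is kept when its best per-token score exceeds the box threshold, and its phrase is built from the tokens scoring above the text threshold. The function `select_detections` and the toy scores are assumptions for illustration, not the library's code.

```python
def select_detections(token_scores, tokens, box_threshold, text_threshold):
    """Filter candidate boxes and assemble their phrases.

    token_scores[i][j] is the (already sigmoid-ed) score linking
    candidate box i to prompt token j.
    """
    kept = []
    for query_index, scores in enumerate(token_scores):
        best = max(scores)
        if best <= box_threshold:
            continue  # no token matches this box strongly enough
        phrase = " ".join(
            token for token, score in zip(tokens, scores) if score > text_threshold
        )
        kept.append((query_index, best, phrase))
    return kept


# Hypothetical scores for three candidate boxes and a two-token prompt.
tokens = ["milk", "packet"]
token_scores = [
    [0.10, 0.62],  # kept, phrase "packet"
    [0.05, 0.12],  # dropped, best score below the box threshold
    [0.40, 0.31],  # kept, phrase "milk packet"
]
detections = select_detections(token_scores, tokens, 0.25, 0.25)
print(detections)
```

Raising the box threshold trades recall for precision, while the text threshold only affects which words end up in each returned phrase.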

In [None]:
# Visualize Grounding Dino results.
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline
sv.plot_image(annotated_frame, (16, 16))

In [None]:
# Perform segmentation on bbox coordinates using the SAM2 model.
sam2_predictor.set_image(image_source)

In [None]:
# Create a directory to store the cropped object images.
os.makedirs('tempdir', exist_ok=True)

# Convert bbox format
h, w, _ = image_source.shape
boxes = boxes * torch.Tensor([w, h, w, h])
xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy().astype(int)
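
The conversion above takes normalized center-format boxes `(cx, cy, w, h)` and produces pixel corner-format boxes `(x1, y1, x2, y2)`. A minimal pure-Python sketch of the same arithmetic follows. The helper name `cxcywh_to_xyxy` is hypothetical, and the clamping step is an addition the notebook does not perform; it is shown because a negative corner would otherwise wrap around when used as a slice index.

```python
def cxcywh_to_xyxy(box, width, height, clamp=True):
    """Convert one normalized (cx, cy, w, h) box to integer pixel corners."""
    cx, cy, bw, bh = box
    # Scale to pixels first, mirroring `boxes * [w, h, w, h]`.
    cx, bw = cx * width, bw * width
    cy, bh = cy * height, bh * height
    # int() truncates toward zero, like `.astype(int)`.
    x1, y1 = int(cx - bw / 2), int(cy - bh / 2)
    x2, y2 = int(cx + bw / 2), int(cy + bh / 2)
    if clamp:
        # Assumption: keep corners inside the image so slicing behaves as expected.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
    return x1, y1, x2, y2


centered = cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), width=800, height=400)
near_edge = cxcywh_to_xyxy((0.0625, 0.5, 0.25, 0.5), width=800, height=400)
print(centered, near_edge)
```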

In [None]:
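# Wrap the array (not the enumerate object) so tqdm knows the total count.
for idx, bbox in enumerate(tqdm.tqdm(xyxy)):
  x1, y1, x2, y2 = bbox

  # Skip boxes covering 25% or more of the image area; h and w come from image_source.shape.
  if (x2 - x1) * (y2 - y1) < 0.25 * h * w: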
    masks, scores, _ = sam2_predictor.predict(
      point_coords=None,
      point_labels=None,
      box=bbox[None, :],
      multimask_output=False,
    )

    # show_masks(image_source, masks, scores, box_coords=bbox)

    # Use the first mask as a condition, expanded to broadcast over the image channels.
    # np.where keeps object pixels from the original image and sets the background to 0.
    # Crop the masked image to the bounding box [y1:y2, x1:x2].
    masked_object = Image.fromarray(
        np.where(
            np.expand_dims(masks[0]*255, -1),
            image_source, 0
        )[y1:y2, x1:x2]
    )

    image_path = f'tempdir/{os.path.splitext(IMAGE_PATH)[0]}_{idx}.png'
    masked_object.save(image_path)
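
The loop above only segments boxes that cover less than a quarter of the image, which discards detections that span most of the frame. The sketch below isolates that size filter. The helper names `box_area_fraction` and `keep_box` and the `max_fraction` parameter are hypothetical, introduced only for illustration.

```python
def box_area_fraction(box, width, height):
    """Return the share of the image area covered by an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (width * height)


def keep_box(box, width, height, max_fraction=0.25):
    """Keep only boxes strictly smaller than max_fraction of the image."""
    return box_area_fraction(box, width, height) < max_fraction


# Toy boxes on an 800x400 image.
candidates = [
    (300, 100, 500, 300),  # 12.5% of the image: kept
    (0, 0, 400, 200),      # exactly 25%: dropped, the comparison is strict
    (0, 0, 800, 400),      # the whole frame: dropped
]
kept = [box for box in candidates if keep_box(box, 800, 400)]
print(kept)
```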

## Download Gemma3 model using Ollama tool.

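Run the following commands in a terminal within your Colab notebook. `ollama serve` keeps running in the foreground, so leave that terminal open while executing the cells below.

```
curl -fsSL https://ollama.com/install.sh | sh
ollama serve
```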



In [None]:
# Pull the required open-source model.
!ollama pull gemma3:12b-it-qat

In [None]:
# Check if the model is downloaded.
!ollama list

In [None]:
# Prompt to analyze an image for milk packet vs others.
prompt = """
Analyze the provided image of packaging. Was this packaging used to contain milk or a milk-based product?  Answer in yes or no only.
"""

In [None]:
# Read the cropped images and run inference on each with the LLM.
images = glob.glob('tempdir/*.png')

for path in images:
  # Run the chat/inference API, sending the temporary masked object image as input.
  response: ChatResponse = chat(model='gemma3:12b-it-qat', messages=[
    {
      'role': 'user',
      'content': prompt,
      'images': [path]
    },
  ])
  image = cv2.imread(path)
  image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
  plt.imshow(image)
  plt.axis('off')
  plt.show()

  # Print the model's response content (the generated answer)
  print(f"\n{response.message.content}")
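
The prompt asks for a yes-or-no answer, but the model may still add punctuation, markdown emphasis, or an explanation. A minimal sketch for turning the reply text into a boolean is shown below. The helper `parse_yes_no` is hypothetical, and returning `None` for anything that does not start with "yes" or "no" is an assumed convention.

```python
import re


def parse_yes_no(answer):
    """Map a free-text reply to True (yes), False (no), or None (unclear)."""
    # Skip leading whitespace or symbols such as "**", then require a whole word.
    match = re.match(r"\W*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


replies = ["Yes.", " no", "**Yes**", "NO, this looks like a chip bag.", "Nothing visible."]
parsed = [parse_yes_no(reply) for reply in replies]
print(parsed)
```

In the loop above, `parse_yes_no(response.message.content)` could be used to keep only the crops classified as milk pouches and to flag unclear replies for review.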