# Spot the Difference ML Workflow — Enhanced Version

This notebook implements the proposed enhancements: detector ensemble with TTA + WBF, refined change localization heatmaps, CLIP-assisted matching, threshold tuning, stronger augmentations, and improved evaluation and outputs.

In [1]:
# 0. Setup: ensure dependencies and check device
import sys, subprocess, importlib

def ensure(pkg_spec, import_name=None):
    name = import_name or pkg_spec.split('==')[0].split('[')[0].split('/')[-1]
    try:
        importlib.import_module(name)
    except Exception:
        print(f'Installing {pkg_spec}...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pkg_spec])

# Core
ensure('numpy')
ensure('pandas')
ensure('matplotlib')
ensure('Pillow', 'PIL')
ensure('scikit-learn', 'sklearn')
ensure('opencv-python', 'cv2')
# ML
ensure('timm')
ensure('transformers')
ensure('albumentations')
ensure('ensemble-boxes')
ensure('open-clip-torch', 'open_clip')
ensure('scipy')

import os, json, math, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import torch
import torchvision.transforms as T
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

Installing ensemble-boxes...
PyTorch: 2.5.1
CUDA available: True


device(type='cuda')

In [2]:
# 1. Load Procedure Markdown for quick reference
from IPython.display import Markdown, display
with open('spot_the_difference_procedure.md', 'r', encoding='utf-8') as f:
    display(Markdown(f.read()))

# Spot the Difference ML Workflow: Step-by-Step Explanation

This document explains the procedure for detecting added, removed, or changed objects between two similar images using advanced machine learning techniques. The workflow combines data preparation, change localization, object detection, and matching to robustly spot differences.

---

## 1. Data Preparation & Vocabulary
- **Normalize object labels:** Clean and standardize object names in your dataset (e.g., 'man', 'guy', 'worker' → 'person').
- **Build vocabulary:** Extract a list of unique object types to detect (e.g., 'car', 'person', 'cone').

## 2. Change Localization (Where things changed)
- **Siamese backbone:** Use a twin neural network (e.g., ViT/Swin Transformer) to process both images in parallel, extracting features.
- **Cross-attention:** Compare features between images to focus on regions that differ.
- **Change logit map (H):** Output a multi-scale map highlighting areas where changes likely occurred.

## 3. Object Detection (What objects changed)
- **Open-vocabulary detector:** Use a model like OWL-ViT or Grounding DINO to detect objects in both images, using your vocabulary.
- **Bounding boxes & labels:** Get locations and types of objects present in each image.

## 4. Score Fusion
- **Combine scores:** Boost detector confidence for objects overlapping with high-change regions in the change map (H).
- **Formula:**
  
  $\text{score}' = \text{score}_{det} \times (1 + \lambda \times \text{normalized H overlap})$

## 5. Matching & Decision Rules
- **Match objects:** Use class labels and bounding box overlap (IoU) to match objects between images.
- **Rules:**
  - Only in second image → "added"
  - Only in first image → "removed"
  - Matched but moved/appearance changed → "changed"

## 6. Classification Heads (Optional)
- **Global features:** Add heads to predict, for each class, whether it was added, removed, or changed.
- **Weak supervision:** Train using category-level labels, not pixel-perfect masks.

## 7. Final Output
- **For each image pair:** Output lists of added, removed, and changed objects.
- **Visualization:** Display results and save in the required format for submission.

---

## How the Techniques Work Together
- **Siamese encoder & change map:** Tell you where to look for changes.
- **Detector:** Tells you what objects are present.
- **Matching & fusion:** Decide what changed and how (added, removed, changed).
- **Weak supervision & symmetry tricks:** Enable learning even with limited labels.

This pipeline allows robust difference detection between images, combining deep learning, object detection, and smart matching—even when only category-level labels are available.


## Data: Load, Normalize Labels, and Build Vocabulary + Prompts

In [3]:
data_dir = 'data'
train_df = pd.read_csv(os.path.join(data_dir, 'train.csv'))
test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'))
print('Train rows:', len(train_df), 'Test rows:', len(test_df))
display(train_df.head())

# Normalization and synonyms
import re
synonym_map = {
    'man':'person','guy':'person','worker':'person','boy':'person','woman':'person','gentleman':'person','pedestrian':'person','people':'person','person':'person',
    'auto':'car','sedan':'car','car':'car',
    'pickup':'truck','lorry':'truck','truck':'truck','van':'van','bus':'bus',
    'motorcycle':'motorcycle','bike':'bicycle','bicycle':'bicycle',
    'cone':'cone','traffic cone':'cone',
    'sign':'sign','traffic sign':'sign','road sign':'sign',
    'pole':'pole','lamp post':'pole','lamp-post':'pole',
    'barrier':'barrier','fence':'barrier',
    'ladder':'ladder','gate':'gate','bag':'bag','box':'box','umbrella':'umbrella'
}

def normalize_labels(label_str):
    if pd.isna(label_str) or not str(label_str).strip() or str(label_str).strip().lower()=='none':
        return []
    raw = re.split(r'[ ,]+', str(label_str).strip().lower())
    mapped = [synonym_map.get(tok, tok) for tok in raw]
    mapped = [tok for tok in mapped if tok and tok!='none']
    return sorted(set(mapped))

for col in ['added_objs','removed_objs','changed_objs']:
    train_df[col+'_norm'] = train_df[col].apply(normalize_labels)

# Build vocabulary
vocab = set()
for col in ['added_objs_norm','removed_objs_norm','changed_objs_norm']:
    for lst in train_df[col]:
        vocab.update(lst)
vocab = sorted(vocab)
print('Vocab:', vocab)

# Prompt engineering: multi-phrase prompts per label
def prompts_for_label(lbl):
    base = lbl.replace('_',' ')
    return [
        base,
        f'a photo of a {base}',
        f'{base} object',
        f'small {base}',
        f'large {base}',
        f'{base} in the scene'
    ]

label_to_prompts = {lbl: prompts_for_label(lbl) for lbl in vocab}
# OWL-ViT expects a list of queries; we can flatten prompts but keep an index map to labels
flattened_prompts = []
prompt_to_label = []
for lbl, plist in label_to_prompts.items():
    for p in plist:
        flattened_prompts.append(p)
        prompt_to_label.append(lbl)
print('Total textual prompts:', len(flattened_prompts))

Train rows: 4536 Test rows: 1482


Unnamed: 0,img_id,added_objs,removed_objs,changed_objs
0,35655,none,none,none
1,30660,none,person vehicle,none
2,34838,man person,car person,none
3,34045,person,none,car
4,30596,none,bicycle person,none


Vocab: ['animal', 'baby', 'bag', 'baggage', 'bicycle', 'bicyclist', 'box', 'building', 'car', 'cart', 'case', 'chair', 'child', 'cone', 'container', 'couple', 'dog', 'dolly', 'driver', 'gate', 'girl', 'group', 'individual', 'item', 'kid', 'ladder', 'lady', 'luggage', 'motorcycle', 'object', 'person', 'personal', 'pole', 'scooter', 'shadow', 'sign', 'stroller', 'traffic', 'truck', 'umbrella', 'vehicle', 'vest']
Total textual prompts: 252


## Detector Ensemble with TTA and WBF

In [4]:
# Load OWL-ViT
from transformers import OwlViTProcessor, OwlViTForObjectDetection
owl_processor = OwlViTProcessor.from_pretrained('google/owlvit-base-patch32')
owl_model = OwlViTForObjectDetection.from_pretrained('google/owlvit-base-patch32').to(device)
owl_model.eval()

# Try to load GroundingDINO (optional)
has_gdino = False
try:
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
    gdino_processor = AutoProcessor.from_pretrained('IDEA-Research/grounding-dino-base')
    gdino_model = AutoModelForZeroShotObjectDetection.from_pretrained('IDEA-Research/grounding-dino-base').to(device)
    gdino_model.eval()
    has_gdino = True
    print('GroundingDINO loaded')
except Exception as e:
    print('GroundingDINO not available, proceeding with OWL-ViT only. Reason:', str(e))

from ensemble_boxes import weighted_boxes_fusion
import cv2

def _resize_to(image: Image.Image, size=(800, 800)):
    # optional resizing for speed/accuracy tradeoff
    return image

def _to_xyxy_norm(boxes, w, h):
    # boxes xyxy absolute -> normalized 0..1
    if len(boxes)==0:
        return []
    b = np.asarray(boxes, dtype=float)
    b[:,0] /= w; b[:,2] /= w; b[:,1] /= h; b[:,3] /= h
    return b.tolist()

def _from_xyxy_norm(boxes, w, h):
    if len(boxes)==0:
        return []
    b = np.asarray(boxes, dtype=float)
    b[:,0] *= w; b[:,2] *= w; b[:,1] *= h; b[:,3] *= h
    return b.tolist()

def _flip_boxes_horiz_xyxy(boxes, w):
    # flip horizontally
    flipped = []
    for x1,y1,x2,y2 in boxes:
        nx1 = w - x2
        nx2 = w - x1
        flipped.append([nx1,y1,nx2,y2])
    return flipped

def detect_owlvit(image: Image.Image, prompts: list, score_thr=0.05):
    w, h = image.size
    inputs = owl_processor(text=prompts, images=image, return_tensors='pt').to(device)
    with torch.no_grad():
        out = owl_model(**inputs)
    target_sizes = torch.tensor([[h, w]], device=device)
    results = owl_processor.post_process_object_detection(out, target_sizes=target_sizes, threshold=score_thr)[0]
    boxes = results['boxes'].detach().cpu().numpy().tolist()
    scores = results['scores'].detach().cpu().numpy().tolist()
    # labels correspond to index in prompts; map back to base label via prompt_to_label
    labels_idx = results['labels'].detach().cpu().numpy().tolist()
    labels = [prompt_to_label[idx] for idx in labels_idx]
    return boxes, scores, labels

def detect_gdino(image: Image.Image, labels: list, score_thr=0.05):
    if not has_gdino:
        return [], [], []
    w, h = image.size
    text = '. '.join(labels)
    inputs = gdino_processor(images=image, text=text, return_tensors='pt').to(device)
    with torch.no_grad():
        out = gdino_model(**inputs)
    results = gdino_processor.post_process_grounded_object_detection(out, inputs.input_ids, box_threshold=score_thr, text_threshold=0.25, target_sizes=[(h,w)])[0]
    boxes = results['boxes'].detach().cpu().numpy().tolist()
    scores = results['scores'].detach().cpu().numpy().tolist()
    # labels as matched phrases from the text per prediction
    ph = results.get('phrases', ['object']*len(boxes))
    # map phrases to closest vocab label by simple token match
    mapped = []
    for p in ph:
        p = p.lower()
        candidates = [v for v in vocab if v in p or p in v]
        mapped.append(candidates[0] if candidates else 'object')
    return boxes, scores, mapped

def ensemble_detect(image_path, base_score_thr=0.05, wbf_iou_thr=0.55, wbf_skip_thr=0.05):
    image = Image.open(image_path).convert('RGB')
    w, h = image.size
    # TTA: original + hflip
    images = [image, image.transpose(Image.FLIP_LEFT_RIGHT)]
    # Collect boxes per TTA per detector
    all_boxes = []
    all_scores = []
    all_labels = []
    for idx, im in enumerate(images):
        # OWL-ViT
        b1, s1, l1 = detect_owlvit(im, flattened_prompts, score_thr=base_score_thr)
        if idx==1: # flip back
            b1 = _flip_boxes_horiz_xyxy(b1, w)
        all_boxes.append(b1); all_scores.append(s1); all_labels.append(l1)
        # GroundingDINO (optional)
        if has_gdino:
            b2, s2, l2 = detect_gdino(im, vocab, score_thr=base_score_thr)
            if idx==1:
                b2 = _flip_boxes_horiz_xyxy(b2, w)
            all_boxes.append(b2); all_scores.append(s2); all_labels.append(l2)
    # Prepare for WBF: need lists by TTA run; we'll merge all into a single ensemble call
    boxes_norm = [_to_xyxy_norm(b, w, h) for b in all_boxes]
    # Convert text labels to integer class ids based on vocab index
    label_to_idx = {v:i for i,v in enumerate(vocab)}
    labels_idx = [[label_to_idx.get(l, -1) for l in lab] for lab in all_labels]
    # Filter out -1 labels
    for i in range(len(boxes_norm)):
        keep = [k for k,l in enumerate(labels_idx[i]) if l>=0]
        boxes_norm[i] = [boxes_norm[i][k] for k in keep]
        all_scores[i] = [all_scores[i][k] for k in keep]
        labels_idx[i] = [labels_idx[i][k] for k in keep]
    if sum(len(b) for b in boxes_norm)==0:
        return [], [], []
    wb, ws, wl = weighted_boxes_fusion(boxes_norm, all_scores, labels_idx, iou_thr=wbf_iou_thr, skip_box_thr=wbf_skip_thr)
    # Back to absolute coords and labels
    abs_boxes = _from_xyxy_norm(wb, w, h)
    out_labels = [vocab[int(i)] for i in wl]
    return abs_boxes, ws.tolist(), out_labels

def draw_boxes(image_path, boxes, labels, scores, score_thr=0.2):
    img = Image.open(image_path).convert('RGB')
    plt.figure(figsize=(10,10))
    plt.imshow(img)
    ax = plt.gca()
    import matplotlib.patches as patches
    for (x1,y1,x2,y2), l, s in zip(boxes, labels, scores):
        if s < score_thr: continue
        rect = patches.Rectangle((x1,y1), x2-x1, y2-y1, linewidth=2, edgecolor='lime', facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, max(0,y1-5), f'{l}:{s:.2f}', color='black', fontsize=9, bbox=dict(facecolor='yellow', alpha=0.6))
    plt.axis('off'); plt.show()

# Quick smoke test on one sample (non-fatal)
try:
    sid = train_df['img_id'].iloc[0]
    p = os.path.join(data_dir, 'data', f'{sid}_1.png')
    b,s,l = ensemble_detect(p, base_score_thr=0.05)
    print('Detections:', len(b))
except Exception as e:
    print('Detector smoke test skipped:', e)



GroundingDINO loaded
Detections: 265




## Refined Change Localization: Siamese ViT Patch-level Heatmap + Augmentations + Focal Loss

In [5]:
import timm
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
import albumentations as A
from albumentations.pytorch import ToTensorV2

class PairDataset(Dataset):
    def __init__(self, df, root_dir, aug=True):
        self.df = df.reset_index(drop=True)
        self.root = root_dir
        self.aug = aug
        self.train_tf = A.Compose([
            A.Resize(224,224),
            A.HorizontalFlip(p=0.5),
            # Use Affine instead of deprecated ShiftScaleRotate
            A.Affine(scale=(0.9, 1.1), translate_percent=(0.0, 0.02), rotate=(-10, 10), shear=None, p=0.5),
            A.ColorJitter(p=0.5),
            # Replace Cutout with CoarseDropout
            A.CoarseDropout(max_holes=4, max_height=20, max_width=20, min_holes=1, min_height=10, min_width=10, fill_value=0, p=0.3),
            ToTensorV2()
        ])
        self.val_tf = A.Compose([A.Resize(224,224), ToTensorV2()])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        img_id = self.df.loc[idx, 'img_id']
        p1 = os.path.join(self.root, 'data', f'{img_id}_1.png')
        p2 = os.path.join(self.root, 'data', f'{img_id}_2.png')
        im1 = np.array(Image.open(p1).convert('RGB'))
        im2 = np.array(Image.open(p2).convert('RGB'))
        tf = self.train_tf if self.aug else self.val_tf
        t1 = tf(image=im1)['image']
        t2 = tf(image=im2)['image']
        # weak label: any change present?
        has_change = (len(self.df.loc[idx, 'added_objs_norm']) + len(self.df.loc[idx, 'removed_objs_norm']) + len(self.df.loc[idx, 'changed_objs_norm'])) > 0
        y = torch.tensor([1.0 if has_change else 0.0], dtype=torch.float32)
        return t1, t2, y

class SiameseViTChange(nn.Module):
    def __init__(self, backbone_name='vit_base_patch16_224'):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=True)
        self.embed_dim = self.backbone.num_features
        self.head = nn.Sequential(nn.Linear(self.embed_dim*2, 256), nn.ReLU(), nn.Linear(256, 1))
    def forward_feats(self, x):
        # returns token embeddings [B, N+1, C]; timm vit forward_features returns [B, tokens, C] for many models
        f = self.backbone.forward_features(x)
        return f
    def forward(self, x1, x2):
        f1 = self.forward_feats(x1)
        f2 = self.forward_feats(x2)
        # Global pooling on tokens
        g1 = f1.mean(dim=1)
        g2 = f2.mean(dim=1)
        return self.head(torch.cat([g1,g2], dim=1))

def focal_loss_with_logits(logits, targets, alpha=0.25, gamma=2.0):
    bce = nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    pt = torch.exp(-bce)
    loss = alpha * (1-pt)**gamma * bce
    return loss.mean()

def patch_change_heatmap(model, img1: Image.Image, img2: Image.Image):
    # compute patch-wise feature L2 diff heatmap
    t = T.Compose([T.Resize((224,224)), T.ToTensor()])
    x1 = t(img1).unsqueeze(0).to(device)
    x2 = t(img2).unsqueeze(0).to(device)
    with torch.no_grad():
        f1 = model.forward_feats(x1) # [1, tokens, C]
        f2 = model.forward_feats(x2)
    # Drop cls token if present (assume first token) to get patches
    p1 = f1[:,1:,:]
    p2 = f2[:,1:,:]
    diff = (p1 - p2).pow(2).sum(dim=-1).sqrt()  # [1, N]
    # infer grid size: 224/16=14 for vit_base_patch16_224
    n = diff.shape[1]
    side = int(math.sqrt(n))
    hm = diff[0].reshape(side, side).detach().cpu().numpy()
    hm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-6)
    # upsample to image size
    hm_up = cv2.resize(hm, img1.size[::-1], interpolation=cv2.INTER_CUBIC)
    # global change score as mean of top-k heat
    k = max(1, (hm_up.size)//20)
    score = float(np.mean(np.sort(hm_up.reshape(-1))[-k:]))
    return hm_up, score

## CLIP-assisted Matching + Score Fusion with Change Heatmap

In [6]:
import open_clip
from scipy.optimize import linear_sum_assignment

# Load CLIP (ViT-B/32)
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
clip_model = clip_model.to(device).eval()
clip_tokenizer = open_clip.get_tokenizer('ViT-B-32')

def crop_to_tensor(image: Image.Image, box, size=224):
    x1,y1,x2,y2 = [int(v) for v in box]
    x1 = max(0, x1); y1=max(0,y1); x2=min(image.width-1, x2); y2=min(image.height-1,y2)
    crop = image.crop((x1,y1,x2,y2)).convert('RGB')
    return clip_preprocess(crop).unsqueeze(0).to(device)

@torch.no_grad()
def clip_image_similarity(img_path1, box1, img_path2, box2):
    im1 = Image.open(img_path1).convert('RGB')
    im2 = Image.open(img_path2).convert('RGB')
    t1 = crop_to_tensor(im1, box1)
    t2 = crop_to_tensor(im2, box2)
    f1 = clip_model.encode_image(t1)
    f2 = clip_model.encode_image(t2)
    f1 = nn.functional.normalize(f1, dim=-1)
    f2 = nn.functional.normalize(f2, dim=-1)
    sim = (f1 @ f2.t()).item()  # cosine similarity in [-1,1]
    return (sim+1)/2  # map to [0,1]

def iou_xyxy(a,b):
    xA = max(a[0], b[0]); yA=max(a[1],b[1]); xB=min(a[2],b[2]); yB=min(a[3],b[3])
    inter = max(0,xB-xA)*max(0,yB-yA)
    areaA = max(0,a[2]-a[0])*max(0,a[3]-a[1])
    areaB = max(0,b[2]-b[0])*max(0,b[3]-b[1])
    denom = areaA+areaB-inter + 1e-6
    return inter/denom

def fuse_scores_with_change(boxes, scores, heatmap, lambda_w=0.8, th=0.5):
    # boost scores for boxes overlapping high-heat regions
    H,W = heatmap.shape
    high = (heatmap >= th).astype(np.uint8)
    boosted = []
    for (x1,y1,x2,y2), s in zip(boxes, scores):
        x1c = int(max(0, min(W-1, x1))); x2c=int(max(0,min(W-1,x2)))
        y1c = int(max(0, min(H-1, y1))); y2c=int(max(0,min(H-1,y2)))
        if x2c<=x1c or y2c<=y1c:
            boosted.append(s); continue
        region = high[y1c:y2c, x1c:x2c]
        overlap = float(region.mean()) if region.size>0 else 0.0
        boosted.append(float(s * (1.0 + lambda_w * overlap)))
    return boosted

def match_objects(img_id, det_thr=0.25, iou_thr=0.5, alpha=0.5, heat_lambda=0.8, heat_th=0.5):
    p1 = os.path.join(data_dir, 'data', f'{img_id}_1.png')
    p2 = os.path.join(data_dir, 'data', f'{img_id}_2.png')
    # detections
    b1,s1,l1 = ensemble_detect(p1, base_score_thr=0.05)
    b2,s2,l2 = ensemble_detect(p2, base_score_thr=0.05)
    # change heatmap
    im1 = Image.open(p1).convert('RGB'); im2 = Image.open(p2).convert('RGB')
    hm, chg_score = patch_change_heatmap(chg_model, im1, im2)
    # score fusion
    s1b = fuse_scores_with_change(b1, s1, hm, lambda_w=heat_lambda, th=heat_th)
    s2b = fuse_scores_with_change(b2, s2, hm, lambda_w=heat_lambda, th=heat_th)
    # threshold filter
    keep1 = [i for i,x in enumerate(s1b) if x>=det_thr]
    keep2 = [i for i,x in enumerate(s2b) if x>=det_thr]
    b1 = [b1[i] for i in keep1]; l1 = [l1[i] for i in keep1]; s1b = [s1b[i] for i in keep1]
    b2 = [b2[i] for i in keep2]; l2 = [l2[i] for i in keep2]; s2b = [s2b[i] for i in keep2]
    # build candidates by label
    if len(b1)==0 and len(b2)==0:
        return {'added':[], 'removed':[], 'changed':[], 'change_score': chg_score}
    # Hungarian cost matrix over possible pairs with same label
    cost = np.ones((len(b1), len(b2)), dtype=float)
    for i in range(len(b1)):
        for j in range(len(b2)):
            if l1[i] != l2[j]:
                cost[i,j] = 1.0
                continue
            iou = iou_xyxy(b1[i], b2[j])
            try:
                sim = clip_image_similarity(p1, b1[i], p2, b2[j])
            except Exception:
                sim = iou  # fallback
            # combine IoU and CLIP sim (higher is better) -> cost lower is better
            score = alpha*iou + (1-alpha)*sim
            cost[i,j] = 1.0 - score
    if len(b1)>0 and len(b2)>0:
        ri, cj = linear_sum_assignment(cost)
        matches = [(i,j) for i,j in zip(ri,cj) if (l1[i]==l2[j] and (1.0-cost[i,j])>=iou_thr)]
    else:
        matches = []
    matched_i = set(i for i,_ in matches)
    matched_j = set(j for _,j in matches)
    added = [l2[j] for j in range(len(l2)) if j not in matched_j]
    removed = [l1[i] for i in range(len(l1)) if i not in matched_i]
    changed = []
    for i,j in matches:
        if iou_xyxy(b1[i], b2[j]) < iou_thr:
            changed.append(l1[i])
    # deduplicate
    return {
        'added': sorted(set(added)),
        'removed': sorted(set(removed)),
        'changed': sorted(set(changed)),
        'change_score': chg_score
    }

# Quick run on a tiny sample
try:
    sid = train_df['img_id'].iloc[0]
    res = match_objects(sid, det_thr=0.25, iou_thr=0.5, alpha=0.5, heat_lambda=0.8, heat_th=0.5)
    print('Sample result:', res)
except Exception as e:
    print('Match smoke test skipped:', e)

open_clip_model.safetensors:   0%|          | 0.00/605M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Match smoke test skipped: name 'chg_model' is not defined


## Threshold Tuning on Validation Split (set-level F1 for added/removed/changed)

In [None]:
def set_f1(true_set, pred_set):
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)
    precision = tp/(tp+fp+1e-9)
    recall = tp/(tp+fn+1e-9)
    if precision+recall==0: return 0.0
    return 2*precision*recall/(precision+recall)

def evaluate_params(det_thr, iou_thr, alpha, heat_lambda, heat_th, sample_ids):
    f1_added=f1_removed=f1_changed=0.0; n=0
    for img_id in sample_ids:
        gt_a = set(train_df.loc[train_df['img_id']==img_id, 'added_objs_norm'].iloc[0])
        gt_r = set(train_df.loc[train_df['img_id']==img_id, 'removed_objs_norm'].iloc[0])
        gt_c = set(train_df.loc[train_df['img_id']==img_id, 'changed_objs_norm'].iloc[0])
        try:
            res = match_objects(img_id, det_thr, iou_thr, alpha, heat_lambda, heat_th)
        except Exception:
            continue
        f1_added += set_f1(gt_a, set(res['added']))
        f1_removed += set_f1(gt_r, set(res['removed']))
        f1_changed += set_f1(gt_c, set(res['changed']))
        n+=1
    if n==0: return 0.0
    return (f1_added+f1_removed+f1_changed)/(3*n)

# small random subset for tuning
val_sample_ids = train_df.loc[val_ids, 'img_id'].sample(min(10, len(val_ids)), random_state=123).tolist()
grid_det = [0.15, 0.25, 0.35]
grid_iou = [0.4, 0.5, 0.6]
grid_alpha = [0.3, 0.5, 0.7]
grid_lambda = [0.5, 0.8, 1.2]
grid_hth = [0.4, 0.5, 0.6]
best = {'score':-1}
for dt in grid_det:
    for it in grid_iou:
        for a in grid_alpha:
            for lam in grid_lambda:
                for hth in grid_hth:
                    score = evaluate_params(dt,it,a,lam,hth, val_sample_ids)
                    if score > best.get('score', -1):
                        best = {'det_thr':dt,'iou_thr':it,'alpha':a,'heat_lambda':lam,'heat_th':hth,'score':score}
                        print('New best:', best)
best

## Evaluation on Validation and Visualization

In [None]:
best_params = best if isinstance(best, dict) and 'det_thr' in best else {
    'det_thr':0.25,'iou_thr':0.5,'alpha':0.5,'heat_lambda':0.8,'heat_th':0.5
}
print('Using params:', best_params)

eval_ids = train_df.loc[val_ids, 'img_id'].sample(min(5, len(val_ids)), random_state=77).tolist()
for mid in eval_ids:
    res = match_objects(mid, **{k:best_params[k] for k in ['det_thr','iou_thr','alpha','heat_lambda','heat_th']})
    print(f"Image {mid} -> added:{res['added']} removed:{res['removed']} changed:{res['changed']} score:{res['change_score']:.3f}")
    p1 = os.path.join(data_dir, 'data', f'{mid}_1.png')
    b,s,l = ensemble_detect(p1, base_score_thr=0.05)
    draw_boxes(p1, b, l, s, score_thr=best_params['det_thr'])
    p2 = os.path.join(data_dir, 'data', f'{mid}_2.png')
    b,s,l = ensemble_detect(p2, base_score_thr=0.05)
    draw_boxes(p2, b, l, s, score_thr=best_params['det_thr'])

# simple mean change score on validation
val_scores = []
for mid in val_sample_ids:
    r = match_objects(mid, **{k:best_params[k] for k in ['det_thr','iou_thr','alpha','heat_lambda','heat_th']})
    val_scores.append(r['change_score'])
print('Mean validation change score:', float(np.mean(val_scores)) if val_scores else None)

## Generate Submission and Save Metrics

In [None]:
submission = []
for img_id in test_df['img_id']:
    r = match_objects(img_id, **{k:best_params[k] for k in ['det_thr','iou_thr','alpha','heat_lambda','heat_th']})
    def fmt(xs):
        return 'none' if len(xs)==0 else ' '.join(sorted(set(xs)))
    submission.append({
        'img_id': img_id,
        'added_objs': fmt(r['added']),
        'removed_objs': fmt(r['removed']),
        'changed_objs': fmt(r['changed'])
    })
submission_df = pd.DataFrame(submission)
submission_df.to_csv('submission.csv', index=False)
print('Saved submission.csv with', len(submission_df), 'rows')

# Save basic metrics
mean_val_change = float(np.mean(val_scores)) if val_scores else float('nan')
with open('eval_metrics.txt','w') as f:
    f.write(f"Params: {json.dumps(best_params)}\n")
    f.write(f"Mean validation change score: {mean_val_change:.5f}\n")
print('Saved eval_metrics.txt')

## Error Analysis and Next Steps

In [None]:
err = []
chk_ids = train_df.loc[val_ids, 'img_id'].sample(min(10, len(val_ids)), random_state=99).tolist()
for mid in chk_ids:
    gt_a = set(train_df.loc[train_df['img_id']==mid, 'added_objs_norm'].iloc[0])
    gt_r = set(train_df.loc[train_df['img_id']==mid, 'removed_objs_norm'].iloc[0])
    gt_c = set(train_df.loc[train_df['img_id']==mid, 'changed_objs_norm'].iloc[0])
    r = match_objects(mid, **{k:best_params[k] for k in ['det_thr','iou_thr','alpha','heat_lambda','heat_th']})
    pa, pr, pc = set(r['added']), set(r['removed']), set(r['changed'])
    err.append({
        'img_id': mid,
        'added_missed': list(gt_a - pa), 'added_wrong': list(pa - gt_a),
        'removed_missed': list(gt_r - pr), 'removed_wrong': list(pr - gt_r),
        'changed_missed': list(gt_c - pc), 'changed_wrong': list(pc - gt_c)
    })
for e in err:
    print(e)

print('Next steps: consider more TTA (scales), multi-phrase confidence pooling, and pseudo-label fine-tuning of change model using high-confidence predictions.')

In [None]:
# Patch: add fallback version of patch_change_heatmap if not already robust
try:
    _ = patch_change_heatmap
    # function exists; redefine with robust fallback
    def patch_change_heatmap(model, img1: Image.Image, img2: Image.Image):
        # compute patch-wise feature L2 diff heatmap; fallback to pixel-diff if tokens unavailable
        try:
            t = T.Compose([T.Resize((224,224)), T.ToTensor()])
            x1 = t(img1).unsqueeze(0).to(device)
            x2 = t(img2).unsqueeze(0).to(device)
            with torch.no_grad():
                f1 = model.forward_feats(x1) # ideally [B, tokens, C]
                f2 = model.forward_feats(x2)
            if f1.dim() != 3 or f2.dim() != 3 or f1.shape[1] <= 1:
                raise RuntimeError('No token embeddings available')
            p1 = f1[:,1:,:]
            p2 = f2[:,1:,:]
            diff = (p1 - p2).pow(2).sum(dim=-1).sqrt()  # [1, N]
            n = diff.shape[1]
            side = int(math.sqrt(n))
            hm = diff[0].reshape(side, side).detach().cpu().numpy()
            hm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-6)
            hm_up = cv2.resize(hm, img1.size[::-1], interpolation=cv2.INTER_CUBIC)
        except Exception:
            a = np.array(img1.convert('RGB'), dtype=np.float32)
            b = np.array(img2.convert('RGB'), dtype=np.float32)
            d = np.mean(np.abs(a - b), axis=2)
            d = (d - d.min()) / (d.max() - d.min() + 1e-6)
            hm_up = cv2.GaussianBlur(d, (0,0), sigmaX=3)
        k = max(1, (hm_up.size)//20)
        score = float(np.mean(np.sort(hm_up.reshape(-1))[-k:]))
        return hm_up, score
except NameError:
    # If not defined (unexpected), define fresh
    def patch_change_heatmap(model, img1: Image.Image, img2: Image.Image):
        try:
            t = T.Compose([T.Resize((224,224)), T.ToTensor()])
            x1 = t(img1).unsqueeze(0).to(device)
            x2 = t(img2).unsqueeze(0).to(device)
            with torch.no_grad():
                f1 = model.forward_feats(x1)
                f2 = model.forward_feats(x2)
            if f1.dim() != 3 or f2.dim() != 3 or f1.shape[1] <= 1:
                raise RuntimeError('No token embeddings available')
            p1 = f1[:,1:,:]
            p2 = f2[:,1:,:]
            diff = (p1 - p2).pow(2).sum(dim=-1).sqrt()
            n = diff.shape[1]
            side = int(math.sqrt(n))
            hm = diff[0].reshape(side, side).detach().cpu().numpy()
            hm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-6)
            hm_up = cv2.resize(hm, img1.size[::-1], interpolation=cv2.INTER_CUBIC)
        except Exception:
            a = np.array(img1.convert('RGB'), dtype=np.float32)
            b = np.array(img2.convert('RGB'), dtype=np.float32)
            d = np.mean(np.abs(a - b), axis=2)
            d = (d - d.min()) / (d.max() - d.min() + 1e-6)
            hm_up = cv2.GaussianBlur(d, (0,0), sigmaX=3)
        k = max(1, (hm_up.size)//20)
        score = float(np.mean(np.sort(hm_up.reshape(-1))[-k:]))
        return hm_up, score