# Image Cleanup Pipeline (with CLIP + Duplicate Detection)

This notebook provides a pipeline to **clean up a folder of images** by detecting duplicate files and classifying each image with OpenAI’s CLIP model. The goal is to safely separate "junk" images (e.g., screenshots, memes, tickets) from "keep" images (e.g., personal photos, travel pictures).

To run it, place the notebook in the same folder as the images you want to clean up. The model can make mistakes, so review the results manually before deleting anything.

---

## ⚙️ Pipeline Overview

### 1. Duplicate Detection
- Uses **MD5 hashing** of file contents to detect byte-identical image files; resized or re-encoded copies are not detected.
- The first copy encountered is kept, and later copies are flagged.
- Duplicates are reported but not deleted at this stage.
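
As a rough illustration of the hash-based approach above, the sketch below groups files by MD5 digest and reports every file after the first in each group. The helper names `md5_of` and `find_duplicates` are illustrative and not part of the notebook, and sorting the paths is an assumption added here so that "first copy" is deterministic.

```python
import hashlib
from pathlib import Path


def md5_of(path, block_size=65536):
    """Hash a file in chunks so large images need not fit in memory."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def find_duplicates(folder, extensions=(".png", ".jpg", ".jpeg")):
    """Return (kept, duplicates): the first file per digest, and the rest."""
    seen = {}        # digest -> first path with that content
    duplicates = []  # later paths whose content was already seen
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        digest = md5_of(path)
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return list(seen.values()), duplicates
```

Calling `find_duplicates(".")` in the image folder would return the files to keep and the files flagged as duplicates without moving anything.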

### 2. CLIP Classification (Preview Stage)
- Combines **delete categories** (e.g., memes, screenshots, tickets) and **keep categories** (e.g., portraits, selfies, landscapes) into a single list of text prompts.
- Each image is embedded with CLIP and compared to the text prompts.
- Each image is assigned to the category with the **highest similarity score**.
- A threshold (default `0.25`) on the best cosine similarity filters out weak matches; images scoring at or below it are left unclassified.
- Up to **3 sample images per category** are displayed with confidence scores for manual review.
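
The decision rule above can be sketched without loading CLIP. The example uses tiny hand-made vectors in place of real embeddings (an assumption for illustration; CLIP ViT-B/32 embeddings are 512-dimensional), and the `classify` helper is hypothetical rather than something the notebook defines.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify(image_vec, text_vecs, labels, threshold=0.25):
    """Return (label, score) for the best-matching prompt, or (None, score)
    when the best cosine similarity does not exceed the threshold."""
    image_vec = normalize(image_vec)
    scores = [
        sum(a * b for a, b in zip(image_vec, normalize(t))) for t in text_vecs
    ]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_idx]
    if best_score > threshold:
        return labels[best_idx], best_score
    return None, best_score


# Toy 3-dimensional "embeddings" standing in for real CLIP features.
labels = ["a phone screenshot of an app", "a landscape photo of nature or city"]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(classify([0.9, 0.1, 0.0], text_vecs, labels))  # close to the first prompt
print(classify([0.1, 0.1, 1.0], text_vecs, labels))  # weak match: unclassified
```

The second call returns `None` as the label because its best score stays below the threshold, which mirrors how the notebook leaves such images out of both the delete and keep lists.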

### 3. Summary Report
- Reports how many images were flagged in each category (both delete and keep).
- Helps decide which categories to actually move.

### 4. Deletion Stage
- User specifies which **delete categories** should be moved.
- Selected images (and duplicates) are moved into a `trash/` folder (never permanently deleted).
- Progress is tracked with **tqdm progress bars**.
- Final summary shows how many images were moved per category and in total.
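
One caveat for the move step: `shutil.move` may overwrite an existing file of the same name in the destination, depending on the platform. A minimal sketch of a collision-safe variant follows. `safe_move_to_trash` is a hypothetical helper, not something the notebook defines, and the `_1`, `_2` suffix scheme is an arbitrary choice.

```python
import shutil
from pathlib import Path


def safe_move_to_trash(src, trash_dir):
    """Move src into trash_dir, renaming instead of overwriting on a name clash."""
    src = Path(src)
    trash_dir = Path(trash_dir)
    trash_dir.mkdir(exist_ok=True)

    dest = trash_dir / src.name
    counter = 1
    while dest.exists():
        # photo.jpg -> photo_1.jpg, photo_2.jpg, ...
        dest = trash_dir / f"{src.stem}_{counter}{src.suffix}"
        counter += 1

    shutil.move(str(src), str(dest))
    return dest
```

Swapping this in for the direct `shutil.move` calls would matter mainly when the notebook is rerun on a folder whose `trash/` already holds files with the same names.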

---

## ✅ Key Features
- **Safe by design**: nothing is permanently deleted; everything goes to `trash/`.
- **Preview before delete**: you see flagged images before deciding.
- **Customizable categories**: you can tune the prompts and threshold.
- **Cross-platform**: works with Apple Silicon (MPS), CUDA GPUs, or CPU.

---

## ⚠️ Notes
- CLIP is powerful but not perfect — borderline images may need manual checking.
- Prompts matter: natural, descriptive prompts (like `"a phone screenshot of an app"`) improve accuracy.
- Raising the threshold (e.g., to `0.30`) reduces false positives but leaves more images unclassified.
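
To get a feel for the threshold trade-off, one option is to record each image's best score and count how many images would still be classified at several thresholds. The scores below are made-up values for illustration, not results from a real run, and `count_above` is a hypothetical helper.

```python
def count_above(scores, thresholds):
    """Map each threshold to the number of scores strictly above it."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Hypothetical best-match scores for six images (not real CLIP output).
best_scores = [0.21, 0.24, 0.26, 0.28, 0.31, 0.33]

for threshold, kept in count_above(best_scores, [0.20, 0.25, 0.30]).items():
    print(f"threshold {threshold:.2f}: {kept} of {len(best_scores)} images classified")
```

With these sample values, each step up in the threshold drops two images from the classified set, which is the kind of check that helps pick a value before moving anything.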


In [3]:
import os
import hashlib
from pathlib import Path
import torch
import clip
from PIL import Image
import matplotlib.pyplot as plt
import shutil
from collections import Counter
from tqdm import tqdm

In [None]:

# ---------------- SETUP ----------------
if torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple Silicon GPU (MPS).")  # on a MacBook M3 this processes roughly 25 images/s
elif torch.cuda.is_available():
    device = "cuda"
    print("Using CUDA GPU.")
else:
    device = "cpu"
    print("Falling back to CPU.")

model, preprocess = clip.load("ViT-B/32", device=device)

base_dir = Path(os.getcwd())
delete_categories = [
    "a phone screenshot of an app or chat with the status bar showing time and battery",
    "a cropped screenshot that contains mostly text",
    "a social media meme image with text overlay and jokes",
    "an internet meme with captions or funny text",
    "a screenshot from instagram, facebook, or tiktok",
    "a screenshot of a pdf document or ebook page",
    "an electronic ticket such as an airplane ticket or concert ticket",
    "a screenshot of a map application like google maps",
    "a product photo with labels, prices, or packaging information"
]  # Example delete categories. Be as precise as possible: CLIP was trained on image captions from the internet.
keep_categories = [
    "a portrait photo of a woman",
    "a portrait photo of a man",
    "a selfie photo taken with a phone camera",
    "a normal photo of friends or family",
    "a travel photo of people outdoors",
    "a photo of a couple together",
    "a landscape photo of nature or city",
    "a photo of people at the beach or in a park",
    "a photo taken with a DSLR or phone camera in natural light"
]  # Example keep categories; the same advice about precise prompts applies.


# --- Duplicate check function ---
def file_hash(filepath, block_size=65536):
    """Return MD5 hash of a file for duplicate detection."""
    hasher = hashlib.md5()
    with open(filepath, "rb") as f:
        buf = f.read(block_size)
        while buf:
            hasher.update(buf)
            buf = f.read(block_size)
    return hasher.hexdigest()

hashes = {}
duplicates = []

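# --- Scan directory for duplicates ---
# os.listdir returns names in arbitrary order; sorting makes "first copy kept" deterministic.
for filename in tqdm(sorted(os.listdir(base_dir)), desc="Checking duplicates"):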
    if not filename.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    filepath = base_dir / filename
    filehash = file_hash(filepath)

    if filehash in hashes:
        duplicates.append(filepath)
    else:
        hashes[filehash] = filename

print(f"Found {len(duplicates)} duplicates (will be removed later).")  # moved to trash/ in the next cell, not deleted

# ---------------- CLIP PREVIEW ----------------
categories = delete_categories + keep_categories

text_tokens = clip.tokenize(categories).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)  # normalize once

# Storage
category_samples = {cat: [] for cat in categories}
to_delete = []  # flagged for deletion
to_keep = []    # flagged as safe

for filename in tqdm(os.listdir(base_dir), desc="Classifying images"):
    if not filename.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    filepath = base_dir / filename
    try:
        image = preprocess(Image.open(filepath)).unsqueeze(0).to(device)
    except Exception as e:
        print(f"Skipping {filename}: {e}")
        continue

    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)

        similarities = (image_features @ text_features.T).squeeze(0)

        best_idx = similarities.argmax().item()
        best_score = similarities[best_idx].item()

        category = categories[best_idx]

        if best_score > 0.25:  # threshold; raise it for stricter classification (images at or below it stay unclassified)
            if category in delete_categories:
                to_delete.append((filepath, category, best_score))
            else:
                to_keep.append((filepath, category, best_score))

            if len(category_samples[category]) < 3:
                category_samples[category].append((filepath, best_score))

# --- Show samples per category to know what the model sees ---
for cat, samples in category_samples.items():
    if samples:
        print(f"\nCategory: {cat} (showing {len(samples)} samples)")
        fig, axes = plt.subplots(1, len(samples), figsize=(12, 4))
        if len(samples) == 1:
            axes = [axes]
        for ax, (img_path, score) in zip(axes, samples):
            ax.imshow(Image.open(img_path))
            ax.set_title(f"{cat}\n{score:.2f}")
            ax.axis("off")
        plt.show()




Using Apple Silicon GPU (MPS).


Checking duplicates: 100%|██████████| 1/1 [00:00<00:00, 41527.76it/s]


Found 0 duplicates (will be removed later).


Classifying images: 100%|██████████| 1/1 [00:00<00:00, 27594.11it/s]


In [6]:
# Create trash folder
trash_dir = base_dir / "trash"
trash_dir.mkdir(exist_ok=True)

# --- Move duplicates ---
moved_duplicates = 0  # count only files that were actually moved
for filepath in tqdm(duplicates, desc="Moving duplicates"):
    if filepath.exists():
        shutil.move(str(filepath), trash_dir / filepath.name)
        moved_duplicates += 1
print(f"Moved {moved_duplicates} duplicates to trash.")

# --- Show summary of flagged categories ---
delete_counts = Counter([c for _, c, _ in to_delete])
keep_counts = Counter([c for _, c, _ in to_keep])

print("\nFlagged DELETE images per category:")
for cat, count in delete_counts.items():
    print(f"  {cat}: {count}")

print("\nFlagged KEEP images per category:")
for cat, count in keep_counts.items():
    print(f"  {cat}: {count}")


# --- Show which categories are available ---
# Reuse delete_counts instead of rescanning to_delete for every category.
available_delete_cats = sorted(delete_counts)
print("\nAvailable DELETE categories in this run:")
for cat in available_delete_cats:
    print(f"  {cat}: {delete_counts[cat]}")


Moving duplicates: 0it [00:00, ?it/s]

Moved 0 duplicates to trash.

Flagged DELETE images per category:

Flagged KEEP images per category:

Available DELETE categories in this run:





In [7]:
# --- Manually insert the categories you want to delete ---
delete_these = [
    "a phone screenshot of an app or chat with the status bar showing time and battery",
    "a cropped screenshot that contains mostly text",
    "a social media meme image with text overlay and jokes",
    "an internet meme with captions or funny text",
    "a screenshot of a pdf document or ebook page",
    "an electronic ticket such as an airplane ticket or concert ticket",
    "a screenshot of a map application like google maps",
    "a product photo with labels, prices, or packaging information"
]

# Validate categories
invalid = [cat for cat in delete_these if cat not in available_delete_cats]
if invalid:
    print("\n Warning: These categories were not flagged and will be ignored:", invalid)

print("\nYou chose to move:", delete_these)

# --- Move only selected categories ---
moved_per_cat = Counter()
for filepath, category, score in tqdm(to_delete, desc="Moving flagged images"):
    if category in delete_these and filepath.exists():
        shutil.move(str(filepath), trash_dir / filepath.name)
        moved_per_cat[category] += 1

# --- Summary ---
print("\n Finished moving images.")
for cat, count in moved_per_cat.items():
    print(f"  {cat}: {count} moved")
print(f"TOTAL: {sum(moved_per_cat.values())} images moved to trash.")
print("All moved files are in:", trash_dir)



You chose to move: ['a phone screenshot of an app or chat with the status bar showing time and battery', 'a cropped screenshot that contains mostly text', 'a social media meme image with text overlay and jokes', 'an internet meme with captions or funny text', 'a screenshot of a pdf document or ebook page', 'an electronic ticket such as an airplane ticket or concert ticket', 'a screenshot of a map application like google maps', 'a product photo with labels, prices, or packaging information']


Moving flagged images: 0it [00:00, ?it/s]


 Finished moving images.
TOTAL: 0 images moved to trash.
All moved files are in: /Users/vashqu/Desktop/clearing_photos/trash





Remember that the model is not perfect! Review the contents of `trash/` before deleting anything permanently.