<a href="https://colab.research.google.com/github/tanphong-sudo/image-deduplication-project/blob/main/notebooks/project_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
# Clone repository
GIT_REPO_URL = "https://github.com/tanphong-sudo/image-deduplication-project"
!git clone $GIT_REPO_URL
%cd image-deduplication-project

Cloning into 'image-deduplication-project'...
remote: Enumerating objects: 293, done.

In [22]:
# Install dependencies
!pip install -q -r requirements.txt

**Note:** The C++ SimHash module is optional. If the build fails, you can still use the FAISS and MinHash methods.

In [23]:
# Build C++ SimHash module
%cd src/lsh_cpp_module

# Verify pybind11 is installed
try:
    import pybind11
    print(f"✓ Pybind11 found: {pybind11.get_include()}")
except ImportError:
    print("Installing pybind11...")
    !pip install -q pybind11

# Build the module
!python setup.py build_ext --inplace

%cd ../..

/content/image-deduplication-project/image-deduplication-project/image-deduplication-project/image-deduplication-project/src/lsh_cpp_module
✓ Pybind11 found: /usr/local/lib/python3.12/dist-packages/pybind11/include
running build_ext
building 'lsh_cpp_module' extension
creating build/temp.linux-x86_64-cpython-312/lsh_cpp
x86_64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/local/lib/python3.12/dist-packages/pybind11/include -Ilsh_cpp -I/usr/include/python3.12 -c lsh_cpp/bindings.cpp -o build/temp.linux-x86_64-cpython-312/lsh_cpp/bindings.o -DVERSION_INFO=\"1.0.0\" -std=c++14 -O3 -ffast-math -Wall -Wextra -march=native -fopenmp
creating build/lib.linux-x86_64-cpython-312
x86_64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsym
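
To confirm whether the optional extension is usable after the build, a quick import check can help. This is a minimal sketch: the module name `lsh_cpp_module` is taken from the build log above, but whether it can be found from the current working directory depends on how the repository sets up `sys.path`, which is an assumption here. The helper `available_methods` is hypothetical and not part of the project.

```python
import importlib.util


def available_methods(module_name="lsh_cpp_module"):
    """Return the search methods that should be usable in this environment.

    FAISS and MinHash are listed unconditionally (assumed to be installed
    from requirements.txt); SimHash is added only if the compiled
    extension can be located by the import system.
    """
    methods = ["faiss", "minhash"]
    if importlib.util.find_spec(module_name) is not None:
        methods.append("simhash")
    return methods


print(available_methods())
```

If `simhash` is missing from the printed list, the FAISS and MinHash cells below still apply.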

---

## 📥 Prepare Your Dataset

Choose one of the methods below to provide images:

### Option 1: Copy from Google Drive

Mount your Drive and copy images into `data/raw/`:

In [24]:
# Option 1: Upload from Google Drive (Colab only)
# 1. Upload your images to Google Drive folder
# 2. Run this cell and authorize
# 3. Your images will be copied to data/raw/

import os
import shutil

# Check if running on Colab
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("⚠️ This cell is for Google Colab only")
    print("💡 If running locally, your images should already be in data/raw/")
    print("   Skip this cell and continue to Quick Start")

if IN_COLAB:
    # Mount Google Drive
    drive.mount('/content/drive')

    # Specify your Drive folder path (edit this!)
    DRIVE_FOLDER = '/content/drive/MyDrive/DSA/Project/data/raw'  # <- Change this to your folder

    # Copy images to data/raw/
    os.makedirs('data/raw', exist_ok=True)

    if os.path.exists(DRIVE_FOLDER):
        image_files = [f for f in os.listdir(DRIVE_FOLDER)
                       if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp'))]

        print(f"Copying {len(image_files)} images from Drive...")
        for img_file in image_files:
            src = os.path.join(DRIVE_FOLDER, img_file)
            dst = os.path.join('data/raw', img_file)
            shutil.copy2(src, dst)

        print(f"✓ Copied {len(image_files)} images to data/raw/")
    else:
        print(f"⚠️ Folder not found: {DRIVE_FOLDER}")
        print("Please edit DRIVE_FOLDER path in the cell above")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Copying 720 images from Drive...
✓ Copied 720 images to data/raw/


### Option 2: Upload Files Directly

Use Colab's file upload widget:

In [None]:
# Option 2: Upload files directly (Colab only)
import os

# Check if running on Colab
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("⚠️ This cell is for Google Colab only")
    print("💡 If running locally, place your images in data/raw/ folder")

if IN_COLAB:
    os.makedirs('data/raw', exist_ok=True)

    print("Click 'Choose Files' and select your images...")
    uploaded = files.upload()

    # Save uploaded files to data/raw/
    for filename, content in uploaded.items():
        with open(f'data/raw/{filename}', 'wb') as f:
            f.write(content)

    print(f"\n✓ Uploaded {len(uploaded)} images to data/raw/")

### Option 3: Download from URL

If your dataset is hosted somewhere (Dropbox, OneDrive, etc.):

In [None]:
# Option 3: Download dataset from URL (zip file)
import os
import zipfile

# Example: Download and extract a zip file
DATASET_URL = "https://your-url.com/dataset.zip"  # <- Change this to your URL

os.makedirs('data/raw', exist_ok=True)

# Uncomment below to download
# !wget -q {DATASET_URL} -O dataset.zip
# with zipfile.ZipFile('dataset.zip', 'r') as zip_ref:
#     zip_ref.extractall('data/raw')
# !rm dataset.zip
# print("✓ Downloaded and extracted dataset")

print("Edit DATASET_URL and uncomment the code above")
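
Whichever option you use, it can be worth checking what actually landed in `data/raw/` before running the pipeline. The sketch below counts image files by extension, using the same extension list as the Drive cell above. `count_images` is a hypothetical helper, and it only looks at the top level of the folder (an assumption about how the dataset is laid out).

```python
from collections import Counter
from pathlib import Path

# Same extensions as the Google Drive cell above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images(folder="data/raw"):
    """Count image files in `folder` by lowercase extension (non-recursive)."""
    root = Path(folder)
    if not root.is_dir():
        return Counter()  # missing folder: report zero images
    return Counter(
        p.suffix.lower()
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


counts = count_images()
print(f"{sum(counts.values())} images in data/raw: {dict(counts)}")
```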

---

## ⚡ Quick Start (Recommended)

Run this cell to use the default configuration (EfficientNet + FAISS):

In [25]:
# Run with best defaults: EfficientNet + FAISS
!python run_pipeline.py


🖼️  IMAGE DEDUPLICATION PIPELINE
📁 Dataset:            data/raw
📊 Feature Extractor:  efficientnet
🔍 Search Method:      faiss
📏 Threshold:          50.0
💾 Output Directory:   data/processed

2025-11-03 15:54:57,333 INFO Found 720 images
2025-11-03 15:54:57,336 INFO Extracted 10 unique labels.
2025-11-03 15:54:57,340 INFO Saved image paths and labels to data/processed/image_labels.csv
2025-11-03 15:54:57,386 INFO Saved 25560 ground-truth pairs to data/processed/ground_truth_pairs.json
2025-11-03 15:54:57,386 INFO Using extractor=efficientnet device=cpu
2025-11-03 15:55:38,310 INFO Features loaded: N=720, d=1280
2025-11-03 15:55:38,310 INFO Building FAISS index type=flat nlist=1024
2025-11-03 15:55:38,311 INFO Running kNN self-search (FAISS)
2025-11-03 15:55:38,842 INFO Saved clusters (count=10) to faiss_clusters.json
2025-11-03 15:55:38,842 INFO Copied representative for cluster 0: obj10__340.png
2025-11-03 15:55:38,842 INFO Copied representative for cluster 1: obj
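
The 25560 ground-truth pairs reported in the log are consistent with 10 labels of 72 images each, since every label contributes C(72, 2) = 2556 within-label pairs. The sketch below reproduces that arithmetic on synthetic filenames. It assumes the label is the part of the filename before `__` (as in `obj10__340.png`) and that all classes are the same size; neither assumption is confirmed by the pipeline code, and `ground_truth_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations


def ground_truth_pairs(filenames):
    """Return all unordered pairs of files that share a label.

    Assumption: label = text before the first "__" in the filename.
    """
    by_label = defaultdict(list)
    for name in filenames:
        by_label[name.split("__", 1)[0]].append(name)
    pairs = []
    for group in by_label.values():
        pairs.extend(combinations(sorted(group), 2))
    return pairs


# Synthetic names: 10 objects x 72 views (angles 0, 5, ..., 355).
names = [f"obj{o}__{a}.png" for o in range(1, 11) for a in range(0, 360, 5)]
print(len(names), len(ground_truth_pairs(names)))  # 720 25560
```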

---

## 💾 Feature Caching (Optional but Recommended)

**Why cache features?**
- Feature extraction is the slowest step (about 40 seconds on CPU for 720 images in the runs shown here), but it is a one-time cost
- After caching, you can test different search methods in <1 second
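
The caching idea itself is simple: compute the feature matrix once, write it to a `.npy` file, and read it back on later runs. Here is a minimal sketch of that pattern with NumPy. The function `load_or_extract` and the random stand-in extractor are hypothetical; the pipeline's own `--save-features` and `--load-features` flags may be implemented differently.

```python
import os
import tempfile

import numpy as np


def load_or_extract(cache_path, extract_fn):
    """Load features from `cache_path` if present; otherwise compute and save."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    features = extract_fn()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    np.save(cache_path, features)
    return features


def fake_extract():
    """Stand-in extractor: random vectors with the shape seen in the logs."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((720, 1280)).astype(np.float32)


path = os.path.join(tempfile.mkdtemp(), "features.npy")
first = load_or_extract(path, fake_extract)   # computes and writes the file
second = load_or_extract(path, fake_extract)  # reads the cached file
print(first.shape, np.array_equal(first, second))
```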

**Extract and save features once:**

In [26]:
!python run_pipeline.py --save-features data/processed/features.npy


🖼️  IMAGE DEDUPLICATION PIPELINE
📁 Dataset:            data/raw
📊 Feature Extractor:  efficientnet
🔍 Search Method:      faiss
📏 Threshold:          50.0
💾 Output Directory:   data/processed

2025-11-03 15:57:14,056 INFO Found 720 images
2025-11-03 15:57:14,059 INFO Extracted 10 unique labels.
2025-11-03 15:57:14,062 INFO Saved image paths and labels to data/processed/image_labels.csv
2025-11-03 15:57:14,063 INFO Ground-truth file already exists at data/processed/ground_truth_pairs.json, skipping generation.
2025-11-03 15:57:14,063 INFO Using extractor=efficientnet device=cpu
2025-11-03 15:57:54,360 INFO Features loaded: N=720, d=1280
2025-11-03 15:57:54,360 INFO Building FAISS index type=flat nlist=1024
2025-11-03 15:57:54,362 INFO Running kNN self-search (FAISS)
2025-11-03 15:57:54,985 INFO Saved clusters (count=10) to faiss_clusters.json
2025-11-03 15:57:54,985 INFO Copied representative for cluster 0: obj10__340.png
2025-11-03 15:57:54,986 INFO Copied represent

**Now test different methods instantly (no re-extraction needed):**

In [29]:
!python run_pipeline.py --load-features data/processed/features.npy --method simhash


🖼️  IMAGE DEDUPLICATION PIPELINE
📁 Dataset:            data/raw
📊 Feature Extractor:  efficientnet
🔍 Search Method:      simhash
📏 Threshold:          50.0
🔢 Hamming Threshold:  5
💾 Output Directory:   data/processed

2025-11-03 15:59:00,366 INFO Found 720 images
2025-11-03 15:59:00,369 INFO Extracted 10 unique labels.
2025-11-03 15:59:00,374 INFO Saved image paths and labels to data/processed/image_labels.csv
2025-11-03 15:59:00,374 INFO Ground-truth file already exists at data/processed/ground_truth_pairs.json, skipping generation.
2025-11-03 15:59:00,374 INFO Loading features from data/processed/features.npy
2025-11-03 15:59:00,378 INFO Features loaded: N=720, d=1280
2025-11-03 15:59:00,381 INFO Using SimHashSearch (C++ backend via Pybind11)
✓ Using C++ SimHash LSH (dim=1280, bits=64, tables=8)
2025-11-03 15:59:00,483 INFO Running SimHash search (tables=8, hamming_threshold=5)...
2025-11-03 15:59:00,665 INFO Clustering with threshold=50.0
2025-11-03 15:59:0
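
To make the SimHash parameters in the log more concrete (dim=1280, bits=64, Hamming threshold 5), here is a small pure-NumPy sketch of random-hyperplane SimHash. It is an illustration only: it does not model the 8 hash tables, and the C++ module's actual hashing and probing may differ. All function names are hypothetical.

```python
import numpy as np


def simhash_signatures(features, n_bits=64, seed=0):
    """Project onto random hyperplanes and keep the sign of each projection."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((features.shape[1], n_bits))
    return (features @ planes) > 0  # boolean array, shape (N, n_bits)


def hamming(a, b):
    """Number of differing bits between two boolean signatures."""
    return int(np.count_nonzero(a != b))


rng = np.random.default_rng(1)
base = rng.standard_normal(1280)
near = base + 0.01 * rng.standard_normal(1280)  # slightly perturbed copy
far = rng.standard_normal(1280)                 # unrelated vector

sigs = simhash_signatures(np.stack([base, near, far]))
print("near:", hamming(sigs[0], sigs[1]), "far:", hamming(sigs[0], sigs[2]))
```

Vectors separated by a small angle disagree on few bits, so a low Hamming threshold keeps near-duplicates together while unrelated vectors, which differ on roughly half the bits, are rejected.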

---

## 🎛️ Advanced: Custom Configuration

Want to try different combinations? Edit and run this cell:

In [None]:
# ========================================
# CONFIGURATION - Edit these parameters
# ========================================

# Feature Extractor: 'resnet' or 'efficientnet'
EXTRACTOR = 'efficientnet'

# Search Method: 'faiss', 'simhash', or 'minhash'
METHOD = 'faiss'

# Threshold (for FAISS and MinHash): recommended 50-100 for FAISS, 0.5 for MinHash
THRESHOLD = 50.0

# Hamming Threshold (for SimHash only): recommended 5 for EfficientNet, 6 for ResNet
HAMMING_THRESHOLD = 5

USE_CACHE = False  # Change to True after running once

# ========================================
# Run pipeline with your configuration
# ========================================
import os

if USE_CACHE and os.path.exists('data/processed/features.npy'):
    if METHOD == 'simhash':
        !python run_pipeline.py --load-features data/processed/features.npy --method {METHOD} --hamming-threshold {HAMMING_THRESHOLD}
    else:
        !python run_pipeline.py --load-features data/processed/features.npy --method {METHOD} --threshold {THRESHOLD}
else:
    if METHOD == 'simhash':
        !python run_pipeline.py --extractor {EXTRACTOR} --method {METHOD} --hamming-threshold {HAMMING_THRESHOLD}
    else:
        !python run_pipeline.py --extractor {EXTRACTOR} --method {METHOD} --threshold {THRESHOLD}

### 💡 Recommended Configurations

| Configuration | Extractor | Method | Threshold/Hamming | Use Case |
|--------------|-----------|--------|-------------------|----------|
| **Best Accuracy** ⭐ | efficientnet | faiss | 50 | Production, guaranteed results |
| **Best Speed** ⭐ | efficientnet | faiss | 50 | Fast queries, moderate dataset |
| **Baseline** | efficientnet | minhash | 0.5 | Comparison benchmark |
| **Large Scale** 🔧 | efficientnet | simhash | 5 | Billions of images (requires C++ build) |
| **High Dimensional** 🔧 | resnet | simhash | 6 | Very high-dim features (requires C++ build) |

⭐ = Works on all systems  
🔧 = Requires C++ SimHash module

*Edit the configuration cell above to try different combinations!*
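
How a distance threshold can turn neighbor-search results into duplicate clusters is sketched below. This is not the pipeline's implementation: it uses a brute-force NumPy distance matrix in place of FAISS, assumes the threshold applies to Euclidean (L2) distance, and links images transitively with a simple union-find. The helper name `cluster_by_threshold` is hypothetical.

```python
import numpy as np


def cluster_by_threshold(features, threshold):
    """Group rows whose pairwise L2 distance is below `threshold`.

    Rows are linked transitively (single linkage) with union-find.
    """
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force distance matrix: fine for tiny inputs, a stand-in for an index.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Link every pair (i < j) that falls under the threshold.
    for i, j in zip(*np.where(np.triu(dist < threshold, k=1))):
        parent[find(int(i))] = find(int(j))

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


points = np.array([[0.0, 0.0], [1.0, 0.0], [100.0, 0.0], [101.0, 0.0]])
print(cluster_by_threshold(points, threshold=50.0))  # [[0, 1], [2, 3]]
```

A larger threshold merges more images into each cluster, and a smaller one splits them apart, which is why the recommended values differ per method.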

## 📊 View Results

In [None]:
# Load and display evaluation results
import json

with open('data/processed/evaluation_full.json', 'r') as f:
    results = json.load(f)

print("=" * 60)
print("EVALUATION RESULTS")
print("=" * 60)
print(json.dumps(results, indent=2))

# Visualize duplicate clusters
print("\n" + "=" * 60)
print("Visualizing duplicate clusters...")
print("=" * 60)
!python view_results.py
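
The contents of `evaluation_full.json` are not shown in this notebook, so the exact metric definitions are unknown here. As a hedged illustration, pairwise precision and recall for a clustering could be computed as follows. The function name and the pair-based definition are assumptions, not the project's evaluation code.

```python
from itertools import combinations


def pairwise_precision_recall(clusters, truth_pairs):
    """Compare pairs implied by `clusters` with ground-truth pairs.

    clusters:    iterable of lists of item ids
    truth_pairs: iterable of 2-tuples of item ids (order-insensitive)
    """
    predicted = {frozenset(p) for c in clusters for p in combinations(c, 2)}
    truth = {frozenset(p) for p in truth_pairs}
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# One true pair found, two false pairs predicted, one true pair missed.
clusters = [["a1", "a2", "b1"], ["b2"]]
truth = [("a1", "a2"), ("b1", "b2")]
print(pairwise_precision_recall(clusters, truth))
```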

## ⚡ Benchmark: C++ vs Python SimHash

In [None]:
# Run performance benchmark
%cd src/lsh_cpp_module
!python benchmark_comparison.py
%cd ../..

In [None]:
# Display benchmark visualization
from IPython.display import Image, display
display(Image('src/lsh_cpp_module/lsh_benchmark_results.png'))

## 📈 Summary & Conclusions

### Key Findings:

1. **FAISS**: Best overall performance, with 100% accuracy and the fastest query time
2. **SimHash LSH (C++)**: Excellent for large-scale applications and memory-efficient
3. **MinHash LSH**: Good baseline, suitable for set-based similarity

### Performance Highlights:

- **C++ SimHash** achieves a 165-187x speedup over Python implementations
- **Multi-probing** is critical for high recall with deep learning features
- **Hamming threshold tuning**: values of 5-6 are optimal for reaching 100% recall
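
Multi-probing can be illustrated with a small sketch: besides the bucket a query hashes to, nearby buckets whose keys differ in a few bits are also inspected. The code below enumerates such keys. It is a generic illustration with hypothetical names, and the C++ module's actual probing strategy may differ.

```python
from itertools import combinations


def probe_keys(key, n_bits, radius=1):
    """Yield `key` and every key within Hamming distance `radius` of it."""
    yield key
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = key
            for pos in positions:
                flipped ^= 1 << pos  # flip one bit of the bucket key
            yield flipped


keys = list(probe_keys(0b1010, n_bits=4, radius=1))
print([format(k, "04b") for k in keys])
# ['1010', '1011', '1000', '1110', '0010']
```

The number of probed buckets grows quickly with the radius (1 + C(b, 1) + C(b, 2) + ... for b-bit keys), so recall is traded against query time.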

### Recommendations:

- Use **FAISS** for production systems requiring guaranteed accuracy
- Use **SimHash C++** for billion-scale datasets with memory constraints
- Tune `hamming_threshold` based on feature extractor (ResNet50: 6, EfficientNet: 5)

---


**Team**: Lê Bảo Tấn Phong, Nguyễn Anh Quân, Phạm Văn Hên  
**HCMUT** - Data Structures & Algorithms - 2025

⭐ [GitHub Repository](https://github.com/tanphong-sudo/image-deduplication-project)