# 🖼️ Image Deduplication with Deep Learning & LSH

**HCMUT** - Data Structures & Algorithms Project

<a href="https://colab.research.google.com/github/tanphong-sudo/image-deduplication-project/blob/main/notebooks/project_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Clone repository
GIT_REPO_URL = "https://github.com/tanphong-sudo/image-deduplication-project"
!git clone $GIT_REPO_URL
%cd image-deduplication-project

Cloning into 'image-deduplication-project'...
remote: Enumerating objects: 253, done.
remote: Counting objects: 100% (253/253), done.
remote: Compressing objects: 100% (160/160), done.
remote: Total 253 (delta 130), reused 194 (delta 84), pack-reused 0 (from 0)
Receiving objects: 100% (253/253), 81.85 KiB | 1.32 MiB/s, done.
Resolving deltas: 100% (130/130), done.
/Users/lebaotanphong/Documents/image-deduplication-project/notebooks/image-deduplication-project

In [2]:
# Install dependencies
!pip install -q -r requirements.txt

**Note:** The C++ SimHash module is optional. If the build fails, you can still use the FAISS and MinHash methods.

In [5]:
# Build C++ SimHash module
%cd src/lsh_cpp_module

# Verify pybind11 is installed
try:
    import pybind11
    print(f"✓ Pybind11 found: {pybind11.get_include()}")
except ImportError:
    print("Installing pybind11...")
    !pip install -q pybind11

# Build the module
!python setup.py build_ext --inplace

%cd ../..

/Users/lebaotanphong/Documents/image-deduplication-project/notebooks/image-deduplication-project/src/lsh_cpp_module
✓ Pybind11 found: /Users/lebaotanphong/Documents/image-deduplication-project/.venv/lib/python3.9/site-packages/pybind11/include
running build_ext
building 'lsh_cpp_module' extension
creating build
creating build/temp.macosx-10.9-universal2-3.9
creating build/temp.macosx-10.9-universal2-3.9/lsh_cpp
In file included from lsh_cpp/bindings.cpp:4:
  182 |         if (vec.size() != dim) {
      |             ~~~~~~~~~~ ^  ~~~
  224 |         if (query_vec.size() != dim) {
      |             ~~~~~~~~~~~~~~~~ ^  ~~~
  231 |         for (int t = 0; t < num_tables && candidates_set.size() < max_candidates; ++t) {
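
Because the C++ module is optional, it can be useful to check whether the build produced an importable extension before selecting `simhash` later on. The following is a minimal sketch, assuming the compiled extension is importable under the name `lsh_cpp_module` (the name shown in the build log) and that its directory is on `sys.path`. The helper names `module_available` and `available_methods` are hypothetical and not part of the project.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def available_methods(cpp_module: str = "lsh_cpp_module") -> list:
    """List the search methods usable right now (hypothetical helper)."""
    # Per the note above, FAISS and MinHash do not need the C++ build.
    methods = ["faiss", "minhash"]
    if module_available(cpp_module):
        methods.append("simhash")
    return methods


print(available_methods())
```

If `simhash` is missing from the printed list, the build most likely failed or the module is not on the import path, and the FAISS or MinHash methods remain available.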

---

## 📥 Prepare Your Dataset

Choose one of the methods below to provide images:

### Option 1: Copy from Google Drive

Mount your Drive and copy the images into `data/raw/`:

In [8]:
# Option 1: Upload from Google Drive (Colab only)
# 1. Upload your images to a Google Drive folder
# 2. Run this cell and authorize
# 3. Your images will be copied to data/raw/

import os
import shutil

# Check if running on Colab
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("⚠️ This cell is for Google Colab only")
    print("💡 If running locally, your images should already be in data/raw/")
    print("   Skip this cell and continue to Quick Start")

if IN_COLAB:
    # Mount Google Drive
    drive.mount('/content/drive')
    
    # Specify your Drive folder path (edit this!)
    DRIVE_FOLDER = '/content/drive/MyDrive/image_dataset'  # <- Change this to your folder
    
    # Copy images to data/raw/
    os.makedirs('data/raw', exist_ok=True)
    
    if os.path.exists(DRIVE_FOLDER):
        image_files = [f for f in os.listdir(DRIVE_FOLDER) 
                       if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp'))]
        
        print(f"Copying {len(image_files)} images from Drive...")
        for img_file in image_files:
            src = os.path.join(DRIVE_FOLDER, img_file)
            dst = os.path.join('data/raw', img_file)
            shutil.copy2(src, dst)
        
        print(f"✓ Copied {len(image_files)} images to data/raw/")
    else:
        print(f"⚠️ Folder not found: {DRIVE_FOLDER}")
        print("Please edit DRIVE_FOLDER path in the cell above")

⚠️ This cell is for Google Colab only
💡 If running locally, your images should already be in data/raw/
   Skip this cell and continue to Quick Start


### Option 2: Upload Files Directly

Use Colab's file upload widget:

In [None]:
# Option 2: Upload files directly (Colab only)
import os

# Check if running on Colab
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("⚠️ This cell is for Google Colab only")
    print("💡 If running locally, place your images in data/raw/ folder")

if IN_COLAB:
    os.makedirs('data/raw', exist_ok=True)
    
    print("Click 'Choose Files' and select your images...")
    uploaded = files.upload()
    
    # Save uploaded files to data/raw/
    for filename, content in uploaded.items():
        with open(f'data/raw/{filename}', 'wb') as f:
            f.write(content)
    
    print(f"\n✓ Uploaded {len(uploaded)} images to data/raw/")

### Option 3: Download from URL

If your dataset is hosted somewhere (Dropbox, OneDrive, etc.):

In [None]:
# Option 3: Download dataset from URL (zip file)
import os
import zipfile

# Example: Download and extract a zip file
DATASET_URL = "https://your-url.com/dataset.zip"  # <- Change this to your URL

os.makedirs('data/raw', exist_ok=True)

# Uncomment below to download
# !wget -q {DATASET_URL} -O dataset.zip
# with zipfile.ZipFile('dataset.zip', 'r') as zip_ref:
#     zip_ref.extractall('data/raw')
# !rm dataset.zip
# print("✓ Downloaded and extracted dataset")

print("Edit DATASET_URL and uncomment the code above")
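
Archives often contain nested folders and non-image files, so `extractall` can leave `data/raw` cluttered. The sketch below extracts only image files and flattens any folder structure. It assumes the same extension set as the Google Drive cell; `extract_images` is a hypothetical helper, and files that share a name in different folders will overwrite each other.

```python
import zipfile
from pathlib import Path

# Assumption: same extensions as the Google Drive cell above.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}


def extract_images(zip_path, dest="data/raw"):
    """Extract only image files from a zip into `dest`, dropping folder paths."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            name = Path(info.filename).name  # keep the file name only
            if info.is_dir() or name.startswith("."):
                continue  # skip folders and hidden metadata files
            if Path(name).suffix.lower() not in IMAGE_EXTS:
                continue  # skip non-image files
            target = dest_dir / name
            target.write_bytes(zf.read(info))
            extracted.append(target)
    return extracted
```

After downloading the archive, `extract_images('dataset.zip')` returns the list of files written to `data/raw`.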

---

## ⚡ Quick Start (Recommended)

Run this cell to use the default configuration (EfficientNet + FAISS):

In [9]:
# Run with best defaults: EfficientNet + FAISS
!python run_pipeline.py


🖼️  IMAGE DEDUPLICATION PIPELINE
📁 Dataset:            data/raw
📊 Feature Extractor:  efficientnet
🔍 Search Method:      faiss
📏 Threshold:          50.0
💾 Output Directory:   data/processed

2025-10-06 21:11:50,111 ERROR No images found in dataset
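
The `No images found in dataset` error means the pipeline found no usable images under `data/raw`. A quick check before running the pipeline can catch this earlier. This is a minimal sketch: the extension set is an assumption copied from the Google Drive cell and may not match what `run_pipeline.py` accepts, and `list_images` is a hypothetical helper that only looks at the top level of the folder.

```python
from pathlib import Path

# Assumption: same extensions as the Google Drive cell.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}


def list_images(folder="data/raw"):
    """Return the sorted image files directly inside `folder` (non-recursive)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


images = list_images()
print(f"Found {len(images)} image(s) in data/raw")
```

If the count is zero, go back to the dataset preparation step before rerunning the pipeline.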


---

## 🎛️ Advanced: Custom Configuration

Want to try different combinations? Edit and run this cell:

In [None]:
# ========================================
# CONFIGURATION - Edit these parameters
# ========================================

# Feature Extractor: 'resnet' or 'efficientnet'
EXTRACTOR = 'efficientnet'

# Search Method: 'faiss', 'simhash', or 'minhash'
METHOD = 'faiss'

# Threshold (FAISS and MinHash only): recommended 50-100 for FAISS, 0.5 for MinHash.
# Change this value whenever you switch METHOD; the two scales are not interchangeable.
THRESHOLD = 50.0

# Hamming Threshold (for SimHash only): recommended 5 for EfficientNet, 6 for ResNet
HAMMING_THRESHOLD = 5

# ========================================
# Run pipeline with your configuration
# ========================================
if METHOD == 'simhash':
    !python run_pipeline.py --extractor {EXTRACTOR} --method {METHOD} --hamming-threshold {HAMMING_THRESHOLD}
else:
    !python run_pipeline.py --extractor {EXTRACTOR} --method {METHOD} --threshold {THRESHOLD}
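
The same branching can be written as a plain Python function, which catches typos in the configuration before the pipeline starts. This is a sketch under the assumption that `run_pipeline.py` accepts exactly the flags used in the cell above; `build_command` is a hypothetical helper, not part of the project.

```python
import sys


def build_command(extractor, method, threshold=50.0, hamming_threshold=5):
    """Return the argument list for run_pipeline.py after validating the choices."""
    if extractor not in {"resnet", "efficientnet"}:
        raise ValueError(f"unknown extractor: {extractor!r}")
    if method not in {"faiss", "simhash", "minhash"}:
        raise ValueError(f"unknown method: {method!r}")
    cmd = [sys.executable, "run_pipeline.py",
           "--extractor", extractor, "--method", method]
    if method == "simhash":
        # SimHash uses the Hamming threshold instead of --threshold.
        cmd += ["--hamming-threshold", str(hamming_threshold)]
    else:
        cmd += ["--threshold", str(threshold)]
    return cmd


print(" ".join(build_command("efficientnet", "faiss")[1:]))
# → run_pipeline.py --extractor efficientnet --method faiss --threshold 50.0
```

The returned list can be passed to `subprocess.run(cmd, check=True)` when running outside a notebook.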

### 💡 Recommended Configurations

| Configuration | Extractor | Method | Threshold/Hamming | Use Case |
|--------------|-----------|--------|-------------------|----------|
| **Best Accuracy** ⭐ | efficientnet | faiss | 50 | Production, guaranteed results |
| **Best Speed** ⭐ | efficientnet | faiss | 50 | Fast queries, moderate dataset |
| **Baseline** | efficientnet | minhash | 0.5 | Comparison benchmark |
| **Large Scale** 🔧 | efficientnet | simhash | 5 | Billions of images (requires C++ build) |
| **High Dimensional** 🔧 | resnet | simhash | 6 | Very high-dim features (requires C++ build) |

⭐ = Works on all systems  
🔧 = Requires C++ SimHash module

*Edit the configuration cell above to try different combinations!*

## 📊 View Results

In [None]:
# Load and display evaluation results
import json

with open('data/processed/evaluation_full.json', 'r') as f:
    results = json.load(f)

print("=" * 60)
print("EVALUATION RESULTS")
print("=" * 60)
print(json.dumps(results, indent=2))

# Visualize duplicate clusters
print("\n" + "=" * 60)
print("Visualizing duplicate clusters...")
print("=" * 60)
!python view_results.py
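
If the pipeline stopped early, `data/processed/evaluation_full.json` may not exist, and the `open` call above then raises `FileNotFoundError`. A more forgiving loader could look like the sketch below; `load_results` is a hypothetical helper, and nothing is assumed about the structure of the JSON file.

```python
import json
from pathlib import Path


def load_results(path="data/processed/evaluation_full.json"):
    """Return the parsed JSON results, or None if the file does not exist."""
    p = Path(path)
    if not p.is_file():
        return None
    with p.open("r", encoding="utf-8") as f:
        return json.load(f)


results = load_results()
if results is None:
    print("No evaluation file found; run the pipeline on a non-empty dataset first.")
else:
    print(json.dumps(results, indent=2))
```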

## ⚡ Benchmark: C++ vs Python SimHash

In [None]:
# Run performance benchmark
%cd src/lsh_cpp_module
!python benchmark_comparison.py
%cd ../..

In [None]:
# Display benchmark visualization
from IPython.display import Image, display
display(Image('src/lsh_cpp_module/lsh_benchmark_results.png'))

## 📈 Summary & Conclusions

### Key Findings:

1. **FAISS**: Best overall performance, with 100% accuracy and the fastest query time
2. **SimHash LSH (C++)**: Excellent for large-scale applications and memory efficient
3. **MinHash LSH**: A good baseline, suitable for set-based similarity
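
To make the SimHash idea concrete: each feature vector is projected onto a set of random hyperplanes, and the signs of the projections form a bit signature. Similar vectors tend to get signatures that differ in only a few bits, which is the quantity a Hamming threshold is compared against. The following is a generic pure-Python sketch of random-hyperplane SimHash, not the project's C++ implementation; the 64-bit width, the toy 8-dimensional vector, and all helper names are assumptions.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random Gaussian hyperplanes of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vec, hyperplanes):
    """Pack the sign of each projection into an integer, one bit per hyperplane."""
    sig = 0
    for i, plane in enumerate(hyperplanes):
        if sum(p * v for p, v in zip(plane, vec)) >= 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Count the bits in which two signatures differ."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(num_bits=64, dim=8)
v = [0.3, -1.2, 0.8, 0.1, 2.0, -0.5, 0.0, 1.1]
near = [x + 0.01 for x in v]  # a slightly perturbed copy of v
print(hamming(simhash(v, planes), simhash(near, planes)))
```

Two images would then be flagged as duplicate candidates when the Hamming distance between their signatures is at most the chosen threshold.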

### Performance Highlights:

- **C++ SimHash** achieves a 165-187x speedup over Python implementations
- **Multi-probing** is critical for high recall with deep learning features
- **Hamming threshold tuning**: a threshold of 5-6 is optimal for 100% recall
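
Multi-probing improves recall by looking not only in the bucket a query hashes to, but also in neighbouring buckets whose keys differ by a few bits. The sketch below shows one common form of the idea, enumerating every key within a small number of bit flips. It is an illustration only; the project's C++ module may choose probes differently, and `probe_keys` is a hypothetical helper.

```python
from itertools import combinations


def probe_keys(key, num_bits, max_flips=1):
    """Return `key` followed by every key within `max_flips` bit flips of it."""
    keys = [key]
    for r in range(1, max_flips + 1):
        for positions in combinations(range(num_bits), r):
            flipped = key
            for pos in positions:
                flipped ^= 1 << pos  # flip one bit of the bucket key
            keys.append(flipped)
    return keys


print(probe_keys(0b000, num_bits=3, max_flips=1))  # → [0, 1, 2, 4]
```

The number of probed buckets grows quickly with `max_flips`, so there is a trade-off between recall and query time.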

### Recommendations:

- Use **FAISS** for production systems requiring guaranteed accuracy
- Use **SimHash C++** for billion-scale datasets with memory constraints
- Tune `hamming_threshold` based on feature extractor (ResNet50: 6, EfficientNet: 5)

---


**Team**: Lê Bảo Tấn Phong, Nguyễn Anh Quân, Phạm Văn Hên  
**HCMUT** - Data Structures & Algorithms - 2025

⭐ [GitHub Repository](https://github.com/tanphong-sudo/image-deduplication-project)