# 🖼️ Image Deduplication with Deep Learning & LSH

**HCMUT** - Data Structures & Algorithms Project

<a href="https://colab.research.google.com/github/tanphong-sudo/image-deduplication-project/blob/main/notebooks/project_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone repository
GIT_REPO_URL = "https://github.com/tanphong-sudo/image-deduplication-project"
!git clone $GIT_REPO_URL
%cd image-deduplication-project

In [None]:
# Install dependencies
!pip install -q -r requirements.txt

In [None]:
# Build C++ SimHash module
%cd src/lsh_cpp_module
!pip install -q -e .
%cd ../..

---

## ⚡ Quick Start (Recommended)

Run this cell to use the default configuration (EfficientNet + FAISS):

In [None]:
# Run with best defaults: EfficientNet + FAISS
!python run_pipeline.py

---

## 🎛️ Advanced: Custom Configuration

Want to try different combinations? Edit and run this cell:

In [None]:
# ========================================
# CONFIGURATION - Edit these parameters
# ========================================

# Feature Extractor: 'resnet' or 'efficientnet'
EXTRACTOR = 'efficientnet'

# Search Method: 'faiss', 'simhash', or 'minhash'
METHOD = 'faiss'

# Threshold (for FAISS and MinHash): recommended 50-100 for FAISS, 0.5 for MinHash
THRESHOLD = 50.0

# Hamming Threshold (for SimHash only): recommended 5 for EfficientNet, 6 for ResNet
HAMMING_THRESHOLD = 5

# ========================================
# Run pipeline with your configuration
# ========================================
if METHOD == 'simhash':
    !python run_pipeline.py --extractor {EXTRACTOR} --method {METHOD} --hamming-threshold {HAMMING_THRESHOLD}
else:
    !python run_pipeline.py --extractor {EXTRACTOR} --method {METHOD} --threshold {THRESHOLD}

### 💡 Recommended Configurations

| Configuration | Extractor | Method | Threshold/Hamming | Use Case |
|--------------|-----------|--------|-------------------|----------|
| **Best Accuracy** | efficientnet | faiss | 50 | Production, guaranteed results |
| **Best Speed** | efficientnet | faiss | 50 | Fast queries, moderate dataset |
| **Large Scale** | efficientnet | simhash | 5 | Billions of images, memory limited |
| **High Dimensional** | resnet | simhash | 6 | Very high-dim features (2048D) |
| **Baseline** | efficientnet | minhash | 0.5 | Comparison benchmark |

*Edit the configuration cell above to try different combinations!*
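
To make the `HAMMING_THRESHOLD` setting more concrete, here is a minimal pure-Python sketch of random-hyperplane SimHash. It is an illustration only and not the project's C++ module: the function names, the 64-bit signature length, and the fixed seed are assumptions chosen for the example.

```python
import random

def simhash_signature(vector, num_bits=64, seed=0):
    """Return an int whose bits are the signs of random projections of `vector`.

    Assumption: 64 bits and seed 0 are illustrative values, not project settings.
    """
    rng = random.Random(seed)  # same seed -> same hyperplanes for every vector
    signature = 0
    for bit in range(num_bits):
        # One random hyperplane per bit, with standard-normal components.
        plane = [rng.gauss(0.0, 1.0) for _ in vector]
        dot = sum(p * x for p, x in zip(plane, vector))
        if dot >= 0:
            signature |= 1 << bit
    return signature

def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def is_duplicate(vec_a, vec_b, hamming_threshold=5):
    """Treat two vectors as duplicates if their signatures are close enough."""
    dist = hamming_distance(simhash_signature(vec_a), simhash_signature(vec_b))
    return dist <= hamming_threshold
```

Vectors that point in similar directions fall on the same side of most hyperplanes, so their signatures differ in few bits. A larger threshold accepts more candidate pairs, which tends to raise recall at the cost of more false positives.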

## 📊 View Results

In [None]:
# Load and display evaluation results
import json

with open('data/processed/evaluation_full.json', 'r') as f:
    results = json.load(f)

print("=" * 60)
print("EVALUATION RESULTS")
print("=" * 60)
print(json.dumps(results, indent=2))

# Visualize duplicate clusters
print("\n" + "=" * 60)
print("Visualizing duplicate clusters...")
print("=" * 60)
!python view_results.py

## ⚡ Benchmark: C++ vs Python SimHash

In [None]:
# Run performance benchmark
%cd src/lsh_cpp_module
!python benchmark_comparison.py
%cd ../..

In [None]:
# Display benchmark visualization
from IPython.display import Image, display
display(Image('src/lsh_cpp_module/lsh_benchmark_results.png'))

## 📈 Summary & Conclusions

### Key Findings:

1. **FAISS**: Best overall performance in our evaluation, with 100% accuracy and the fastest query time
2. **SimHash LSH (C++)**: Excellent for large-scale applications, memory efficient
3. **MinHash LSH**: Good baseline, suitable for set-based similarity
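
As background for the set-based similarity mentioned in the MinHash finding, the sketch below shows how a MinHash signature estimates Jaccard similarity. It is a generic illustration under stated assumptions: the helper names, the 128 hash functions, and the use of `hashlib.blake2b` are choices made for this example, and how the project turns image features into sets is not shown here.

```python
import hashlib

def _hash(token, seed):
    """Deterministic 64-bit hash of a token, salted with `seed`."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_perm=128):
    """For each salted hash function, keep the minimum hash over the set.

    Assumes `items` is non-empty; num_perm=128 is an illustrative value.
    """
    items = set(items)
    return [min(_hash(x, seed) for x in items) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

For two sets, the probability that a single position matches equals their Jaccard similarity, so the estimate gets tighter as `num_perm` grows. A threshold such as 0.5 is then applied to this estimated similarity.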

### Performance Highlights:

- **C++ SimHash** achieves a 165-187x speedup over the Python implementations
- **Multi-probing** is critical for high recall with deep learning features
- **Hamming threshold tuning**: a threshold of 5-6 was optimal for reaching 100% recall
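
The idea behind multi-probing can be sketched as follows: besides the query's own bucket, the lookup also visits buckets whose keys differ in a small number of bits. This is a simplified illustration; `probe_keys`, `multiprobe_lookup`, and the dictionary-based table are hypothetical names and structures, and the probing strategy of the project's C++ module may differ.

```python
from itertools import combinations

def probe_keys(key, num_bits, max_flips=1):
    """Yield `key`, then every key that differs from it in up to `max_flips` bits."""
    yield key
    for flips in range(1, max_flips + 1):
        for positions in combinations(range(num_bits), flips):
            probed = key
            for pos in positions:
                probed ^= 1 << pos  # flip one bit of the bucket key
            yield probed

def multiprobe_lookup(table, key, num_bits, max_flips=1):
    """Collect candidates from the query's bucket and its neighbouring buckets.

    `table` is assumed to map an integer bucket key to a list of item ids.
    """
    candidates = []
    for probed in probe_keys(key, num_bits, max_flips):
        candidates.extend(table.get(probed, []))
    return candidates

# Hypothetical 4-bit table: b.jpg sits one bit flip away from the query bucket.
table = {0b1010: ["a.jpg"], 0b1011: ["b.jpg"], 0b0101: ["c.jpg"]}
print(multiprobe_lookup(table, 0b1010, num_bits=4, max_flips=1))
```

Without probing, only `a.jpg` would be returned for the query key `0b1010`. The number of probed buckets grows quickly with `max_flips`, so this trades query time for recall.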

### Recommendations:

- Use **FAISS** for production systems requiring guaranteed accuracy
- Use **SimHash C++** for billion-scale datasets with memory constraints
- Tune `hamming_threshold` based on feature extractor (ResNet50: 6, EfficientNet: 5)
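
One way to approach the tuning step above is to sweep candidate thresholds over labeled pairs and compare precision and recall. The sketch below assumes you already have, for each pair, a Hamming distance and a ground-truth duplicate label; the function name is hypothetical and the toy pairs are made up for illustration, not measurements from this project.

```python
def sweep_hamming_thresholds(pairs, thresholds):
    """Return (threshold, precision, recall) rows.

    `pairs` is a list of (hamming_distance, is_true_duplicate) tuples.
    """
    total_true = sum(1 for _, is_dup in pairs if is_dup)
    rows = []
    for t in thresholds:
        # Pairs at or below the threshold are predicted to be duplicates.
        predicted = [(d, is_dup) for d, is_dup in pairs if d <= t]
        tp = sum(1 for _, is_dup in predicted if is_dup)
        recall = tp / total_true if total_true else 0.0
        precision = tp / len(predicted) if predicted else 1.0
        rows.append((t, precision, recall))
    return rows

# Made-up toy pairs purely for illustration.
toy_pairs = [(2, True), (4, True), (7, True), (8, False), (12, False), (20, False)]
for t, precision, recall in sweep_hamming_thresholds(toy_pairs, [3, 6, 9]):
    print(f"threshold={t} precision={precision:.2f} recall={recall:.2f}")
```

The smallest threshold that reaches the recall you need is usually a sensible starting point, since larger values only add more candidate pairs to verify.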

---


**Team**: Lê Bảo Tấn Phong, Nguyễn Anh Quân, Phạm Văn Hên  
**HCMUT** - Data Structures & Algorithms - 2025

⭐ [GitHub Repository](https://github.com/tanphong-sudo/image-deduplication-project)