A comprehensive offline, keyframe-based retrieval pipeline for text-driven video search using computer vision and sentence transformers. This system enables searching through surveillance footage using natural language queries like "a black bag" or "person walking".
- Frame Extraction: Intelligent keyframe extraction with shot boundary detection and deduplication
- Semantic Feature Extraction: Computer vision features + Sentence Transformers for text encoding
- Fast Retrieval: FAISS-based indexing for efficient similarity search
- Enhanced Search: Novel enhancements including query expansion, temporal clustering, and multi-modal fusion
- Object Grounding: Optional caption-based object detection
- Interactive Interface: Command-line and interactive search modes
Videos → Frame Extraction → Feature Extraction → Indexing → Enhanced Retrieval
.mp4/.avi → Keyframes → BLIP Captions + Sentence Transformer Embeddings → FAISS Index → Search Results
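To illustrate the captioning and embedding stages of this flow, here is a minimal sketch using the BLIP model named in config.py and a sentence-transformers encoder. The encoder name `all-MiniLM-L6-v2` and the helper function are assumptions for illustration; the actual feature_extractor.py may differ.

```python
# Minimal sketch: caption one keyframe with BLIP, then embed the caption
# with a Sentence Transformer. The encoder name is an assumed example.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer

BLIP_NAME = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(BLIP_NAME)
blip = BlipForConditionalGeneration.from_pretrained(BLIP_NAME)
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def caption_and_embed(keyframe_path):
    """Return (caption, embedding) for a single keyframe image."""
    image = Image.open(keyframe_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)
    embedding = text_encoder.encode(caption, normalize_embeddings=True)
    return caption, embedding
```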
Prerequisites: Python 3.8+, optional CUDA GPU. Install deps:
pip install -r requirements.txt
Place your video files in the `data/videos/` directory. Supported formats: .mp4, .avi, .mov, .mkv, .flv, .wmv
data/
└── videos/
├── camera1_2023-01-01.mp4
├── camera2_2023-01-01.mp4
└── ...
python video_search_pipeline.py build
This will extract keyframes, generate captions + embeddings, and build a FAISS index.
- Clone and install
!git clone https://github.com/rishika-nn/Capstone_Project capstone
%cd capstone
!pip install -r requirements.txt
- Provide videos in data/videos/
- Upload via the Colab Files pane (left sidebar) into data/videos/, or mount Drive:
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p data
!rm -f data/videos
!ln -s "/content/drive/MyDrive/your_videos_folder" data/videos
!ls -lah data/videos
- Build index:
!python video_search_pipeline.py build --segment-captions --object-tags --force-rebuild
- Search:
!python video_search_pipeline.py search "a black bag" --max-results 20
Notes:
- The pipeline automatically falls back to a Flat FAISS index for small datasets (too few vectors for IVF training).
- If you previously had a partial build, clean outputs and rebuild:
!rm -rf data/keyframes data/features data/index
!python video_search_pipeline.py build --force-rebuild
Quick search:
python video_search_pipeline.py search "a black bag"
Interactive mode:
python video_search_pipeline.py interactive
# Basic build
python video_search_pipeline.py build
# Force rebuild existing index
python video_search_pipeline.py build --force-rebuild
# Specify custom videos directory
python video_search_pipeline.py build --videos-dir /path/to/videos
# Basic search
python video_search_pipeline.py search "person with backpack"
# Search with more results
python video_search_pipeline.py search "black bag" --max-results 20
# Disable enhanced retrieval (faster, less accurate)
python video_search_pipeline.py search "car" --no-enhancements
# Save results to directory
python video_search_pipeline.py search "person walking" --save-results ./resultspython video_search_pipeline.py statssearch <query>statsquit
├── config.py # Configuration settings
├── frame_extractor.py # Frame extraction and deduplication
├── feature_extractor.py # CV + Sentence Transformer feature extraction
├── retrieval_system.py # FAISS indexing and retrieval
├── enhancements.py # Novel enhancements (query expansion, clustering)
├── video_search_pipeline.py # Main pipeline orchestration
├── requirements.txt # Python dependencies
├── README.md # This file
└── data/ # Data directory (created automatically)
├── videos/ # Input video files
├── keyframes/ # Extracted keyframes
├── features/ # Extracted features and embeddings
├── index/ # FAISS index files
└── logs/ # Log files
Required contents before build:
- `data/videos/` must contain at least one video file (.mp4/.avi/.mov/.mkv/.flv/.wmv)
Expected contents after build:
- `data/keyframes/*.jpg` and per-video metadata JSON
- `data/features/frame_features.json`, `embeddings.npy`, `feature_metadata.json`
- `data/index/faiss_index.bin`, `metadata.json`, `config.json`, `caption_index.json`
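As a rough illustration of how the files in `data/index/` come about, the sketch below builds a cosine-similarity FAISS index over normalized caption embeddings and falls back to a Flat index when there are too few vectors to train IVF (the fallback noted in the Colab section). Function names and the size threshold are assumptions, not the repository's actual API.

```python
# Sketch: build a FAISS index over caption embeddings and search it with an
# encoded text query. Names and thresholds are illustrative assumptions.
import numpy as np
import faiss

def build_index(embeddings, nlist=100):
    """Build an IVF index, falling back to Flat when the dataset is small."""
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dim = embeddings.shape[1]
    if embeddings.shape[0] < 40 * nlist:
        # Too few vectors to train IVF well: exact inner-product search
        # (cosine similarity on normalized vectors)
        index = faiss.IndexFlatIP(dim)
    else:
        quantizer = faiss.IndexFlatIP(dim)
        index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
        index.train(embeddings)
    index.add(embeddings)
    return index

def search(index, query_embedding, k=20):
    query = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    scores, ids = index.search(query, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

# Persisting / reloading, roughly what data/index/faiss_index.bin holds:
# faiss.write_index(index, "data/index/faiss_index.bin")
# index = faiss.read_index("data/index/faiss_index.bin")
```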
Edit config.py to customize the pipeline:
FRAME_EXTRACTION_RATE = 2 # frames per second
MAX_KEYFRAMES_PER_SHOT = 5 # max keyframes per shot
SHOT_DETECTION_THRESHOLD = 0.3  # shot boundary sensitivity
PERCEPTUAL_HASH_THRESHOLD = 5   # Hamming distance for duplicates
SIMILARITY_THRESHOLD = 0.95     # Feature similarity threshold
TOP_K_RESULTS = 20              # Default number of results
FAISS_INDEX_TYPE = "IVF"        # Index type: "IVF", "HNSW", or "Flat"
BLIP_MODEL_NAME = "Salesforce/blip-image-captioning-base"
The system automatically expands queries with synonyms and contextual information:
- Synonym Expansion: "bag" → ["backpack", "briefcase", "purse"]
- Contextual Expansion: "black bag" → ["person with black bag", "scene showing black bag"]
- Semantic Expansion: Uses corpus captions to find related terms
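A minimal sketch of how the synonym and contextual expansion above could work; the synonym table and helper name are illustrative assumptions rather than the exact contents of enhancements.py.

```python
# Sketch: expand a query with synonyms and simple contextual templates.
# The synonym table is a small illustrative example.
SYNONYMS = {
    "bag": ["backpack", "briefcase", "purse"],
    "car": ["vehicle", "automobile"],
    "person": ["man", "woman", "pedestrian"],
}

def expand_query(query):
    expansions = [query]
    # Synonym expansion: swap in known synonyms for each query word
    for word in query.lower().split():
        for synonym in SYNONYMS.get(word, []):
            expansions.append(query.lower().replace(word, synonym))
    # Contextual expansion: embed the query in scene-level templates
    expansions.append(f"person with {query}")
    expansions.append(f"scene showing {query}")
    return list(dict.fromkeys(expansions))  # de-duplicate, keep order

# expand_query("black bag") -> ["black bag", "black backpack", ...,
#                               "person with black bag", "scene showing black bag"]
```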
Groups temporally close and visually similar frames:
# Enable temporal clustering in search
results = pipeline.search("black bag", use_clustering=True)Combines visual and textual signals for better ranking:
Combines visual and textual signals for better ranking:
- Visual similarity (approximated via text similarity to captions)
- Textual similarity (caption matching)
- Word overlap analysis
- Grounding-based boosting
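One way these signals could be combined into a single ranking score, shown purely as an illustration; the weights and field names are assumptions, not the values used in enhancements.py.

```python
# Sketch: fuse caption similarity, word overlap, and an optional grounding
# boost into one score. Weights are illustrative assumptions.
def fused_score(query, hit, w_caption=0.6, w_overlap=0.3, w_ground=0.1):
    query_words = set(query.lower().split())
    caption_words = set(hit["caption"].lower().split())
    # Word-overlap signal: Jaccard similarity between query and caption
    overlap = len(query_words & caption_words) / max(len(query_words | caption_words), 1)
    # Grounding boost if the queried object was detected in the frame
    grounded = 1.0 if query_words & set(hit.get("object_tags", [])) else 0.0
    return (w_caption * hit["caption_similarity"]  # cosine score from the index
            + w_overlap * overlap
            + w_ground * grounded)
```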
Optional object detection for improved precision:
# Enable object grounding
results = pipeline.search("black bag", use_grounding=True)Adapt embeddings to your specific environment:
Adapt embeddings to your specific environment:
# Perform domain adaptation
adapted_embeddings = domain_adaptation.adapt_to_campus_environment(
embeddings, captions
)
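The adaptation step is described only at a high level here; one simple form, shown as an assumption rather than the repository's actual method, is to re-center embeddings on statistics of the deployment corpus.

```python
# Sketch: an assumed, simple form of domain adaptation -- re-center the
# embedding space on the deployment corpus and re-normalize, so similarity
# is driven by what varies within this environment.
import numpy as np

def adapt_embeddings(embeddings):
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.clip(norms, 1e-8, None)
```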
- Use a GPU in Colab for faster feature extraction.
- For large datasets, reduce `FRAME_EXTRACTION_RATE` in `config.py`.
- No videos found: ensure files exist in `data/videos/` or pass `--videos-dir`.
- Partial/failed build: `rm -rf data/keyframes data/features data/index`, then rebuild with `--force-rebuild`.
- Small dataset IVF error: the system automatically falls back to a Flat index.
- Search errors: recent updates pad query embeddings to the index dimension; rebuild if you changed feature shapes.
The system provides several metrics for evaluation:
- Retrieval Precision: Fraction of relevant results in top-K
- Caption Quality: Measured via human evaluation or cosine similarity to queries
- Temporal Coherence: Consistency of results across time
- Grounding Accuracy: Object detection precision
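For Retrieval Precision, a hypothetical precision@K helper could look like the sketch below; it assumes you supply per-query relevance labels, which the repository does not ship.

```python
# Sketch: precision@K over labeled results. Relevance labels and the
# frame_id field are illustrative assumptions.
def precision_at_k(results, relevant_frame_ids, k=10):
    top_k = results[:k]
    hits = sum(1 for r in top_k if r["frame_id"] in relevant_frame_ids)
    return hits / max(len(top_k), 1)
```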
# Evaluate on test queries
test_queries = [
"a person with a black bag",
"red car in parking lot",
"person walking on sidewalk"
]
for query in test_queries:
    results = pipeline.search(query)
    # Evaluate results...
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
If you use this system in your research, please cite:
@article{text_driven_video_search,
title={Text-Driven Video Search Using CV and Sentence Transformers},
author={Your Name},
journal={Your Conference},
year={2024}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- BLIP for image captioning
- Sentence Transformers for efficient text embeddings
- FAISS for efficient similarity search
- Transformers for model integration
- Real-time video processing
- Multi-camera synchronization
- Advanced temporal reasoning
- Fine-grained object detection
- Web interface
- Mobile app integration