A comprehensive offline, keyframe-based retrieval pipeline for text-driven video search using computer vision and sentence transformers. This system enables searching through surveillance footage using natural language queries like "a black bag" or "person walking".
- Frame Extraction: Intelligent keyframe extraction with shot boundary detection and deduplication
- Semantic Feature Extraction: Computer vision features + Sentence Transformers for text encoding
- Fast Retrieval: FAISS-based indexing for efficient similarity search
- Enhanced Search: Novel enhancements including query expansion, temporal clustering, and multi-modal fusion
- Object Grounding: Optional caption-based object detection
- Interactive Interface: Command-line and interactive search modes
Videos → Frame Extraction → Feature Extraction → Indexing → Enhanced Retrieval
.mp4/.avi → Keyframes → BLIP Captions + Sentence Transformer Embeddings → FAISS Index → Search Results
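To illustrate the captioning and embedding stages of this flow, here is a minimal sketch using the BLIP model named in config.py and a sentence-transformers encoder. The encoder name `all-MiniLM-L6-v2` and the helper function are assumptions for illustration; the actual feature_extractor.py may differ.

```python
# Minimal sketch: caption one keyframe with BLIP, then embed the caption
# with a Sentence Transformer. The encoder name is an assumed example.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer

BLIP_NAME = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(BLIP_NAME)
blip = BlipForConditionalGeneration.from_pretrained(BLIP_NAME)
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def caption_and_embed(keyframe_path):
    """Return (caption, embedding) for a single keyframe image."""
    image = Image.open(keyframe_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)
    embedding = text_encoder.encode(caption, normalize_embeddings=True)
    return caption, embedding
```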
Prerequisites: Python 3.8+, optional CUDA GPU. Install deps:
pip install -r requirements.txt
Place your video files in the `data/videos/` directory. Supported formats: .mp4, .avi, .mov, .mkv, .flv, .wmv
data/
└── videos/
├── camera1_2023-01-01.mp4
├── camera2_2023-01-01.mp4
└── ...
python video_search_pipeline.py build
This will extract keyframes, generate captions + embeddings, and build a FAISS index.
- Clone and install
!git clone https://github.com/rishika-nn/Capstone_Project capstone
%cd capstone
!pip install -r requirements.txt
- Provide videos in data/videos/
- Upload via the Colab Files pane (left sidebar) into data/videos/, or mount Drive:
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p data
!rm -f data/videos
!ln -s "/content/drive/MyDrive/your_videos_folder" data/videos
!ls -lah data/videos
- Build index:
!python video_search_pipeline.py build --segment-captions --object-tags --force-rebuild
- Search:
!python video_search_pipeline.py search "a black bag" --max-results 20
Notes:
- The pipeline automatically falls back to a Flat FAISS index for small datasets (too few vectors for IVF training).
- If you previously had a partial build, clean outputs and rebuild:
!rm -rf data/keyframes data/features data/index
!python video_search_pipeline.py build --force-rebuild
Quick search:
python video_search_pipeline.py search "a black bag"
Interactive mode:
python video_search_pipeline.py interactive
# Basic build
python video_search_pipeline.py build
# Force rebuild existing index
python video_search_pipeline.py build --force-rebuild
# Specify custom videos directory
python video_search_pipeline.py build --videos-dir /path/to/videos
# Basic search
python video_search_pipeline.py search "person with backpack"
# Search with more results
python video_search_pipeline.py search "black bag" --max-results 20
# Disable enhanced retrieval (faster, less accurate)
python video_search_pipeline.py search "car" --no-enhancements
# Save results to directory
python video_search_pipeline.py search "person walking" --save-results ./resultspython video_search_pipeline.py statssearch <query>statsquit
├── config.py # Configuration settings
├── frame_extractor.py # Frame extraction and deduplication
├── feature_extractor.py # CV + Sentence Transformer feature extraction
├── retrieval_system.py # FAISS indexing and retrieval
├── enhancements.py # Novel enhancements (query expansion, clustering)
├── video_search_pipeline.py # Main pipeline orchestration
├── requirements.txt # Python dependencies
├── README.md # This file
└── data/ # Data directory (created automatically)
├── videos/ # Input video files
├── keyframes/ # Extracted keyframes
├── features/ # Extracted features and embeddings
├── index/ # FAISS index files
└── logs/ # Log files
Required contents before build:
- `data/videos/` must contain at least one video file (.mp4/.avi/.mov/.mkv/.flv/.wmv)
Expected contents after build:
- `data/keyframes/*.jpg` and per-video metadata JSON
- `data/features/frame_features.json`, `embeddings.npy`, `feature_metadata.json`
- `data/index/faiss_index.bin`, `metadata.json`, `config.json`, `caption_index.json`
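As a rough illustration of how the files in `data/index/` come about, the sketch below builds a cosine-similarity FAISS index over normalized caption embeddings and falls back to a Flat index when there are too few vectors to train IVF (the fallback noted in the Colab section). Function names and the size threshold are assumptions, not the repository's actual API.

```python
# Sketch: build a FAISS index over caption embeddings and search it with an
# encoded text query. Names and thresholds are illustrative assumptions.
import numpy as np
import faiss

def build_index(embeddings, nlist=100):
    """Build an IVF index, falling back to Flat when the dataset is small."""
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dim = embeddings.shape[1]
    if embeddings.shape[0] < 40 * nlist:
        # Too few vectors to train IVF well: exact inner-product search
        # (cosine similarity on normalized vectors)
        index = faiss.IndexFlatIP(dim)
    else:
        quantizer = faiss.IndexFlatIP(dim)
        index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
        index.train(embeddings)
    index.add(embeddings)
    return index

def search(index, query_embedding, k=20):
    query = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    scores, ids = index.search(query, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))

# Persisting / reloading, roughly what data/index/faiss_index.bin holds:
# faiss.write_index(index, "data/index/faiss_index.bin")
# index = faiss.read_index("data/index/faiss_index.bin")
```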
Edit config.py to customize the pipeline:
FRAME_EXTRACTION_RATE = 2 # frames per second
MAX_KEYFRAMES_PER_SHOT = 5 # max keyframes per shot
SHOT_DETECTION_THRESHOLD = 0.3  # shot boundary sensitivity
PERCEPTUAL_HASH_THRESHOLD = 5   # Hamming distance for duplicates
SIMILARITY_THRESHOLD = 0.95     # Feature similarity threshold
TOP_K_RESULTS = 20              # Default number of results
FAISS_INDEX_TYPE = "IVF"        # Index type: "IVF", "HNSW", or "Flat"
BLIP_MODEL_NAME = "Salesforce/blip-image-captioning-base"
The system automatically expands queries with synonyms and contextual information:
- Synonym Expansion: "bag" → ["backpack", "briefcase", "purse"]
- Contextual Expansion: "black bag" → ["person with black bag", "scene showing black bag"]
- Semantic Expansion: Uses corpus captions to find related terms
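A minimal sketch of how the synonym and contextual expansion above could work; the synonym table and helper name are illustrative assumptions rather than the exact contents of enhancements.py.

```python
# Sketch: expand a query with synonyms and simple contextual templates.
# The synonym table is a small illustrative example.
SYNONYMS = {
    "bag": ["backpack", "briefcase", "purse"],
    "car": ["vehicle", "automobile"],
    "person": ["man", "woman", "pedestrian"],
}

def expand_query(query):
    expansions = [query]
    # Synonym expansion: swap in known synonyms for each query word
    for word in query.lower().split():
        for synonym in SYNONYMS.get(word, []):
            expansions.append(query.lower().replace(word, synonym))
    # Contextual expansion: embed the query in scene-level templates
    expansions.append(f"person with {query}")
    expansions.append(f"scene showing {query}")
    return list(dict.fromkeys(expansions))  # de-duplicate, keep order

# expand_query("black bag") -> ["black bag", "black backpack", ...,
#                               "person with black bag", "scene showing black bag"]
```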
Groups temporally close and visually similar frames:
# Enable temporal clustering in search
results = pipeline.search("black bag", use_clustering=True)Combines visual and textual signals for better ranking:
Combines visual and textual signals for better ranking:
- Visual similarity (approximated via text similarity to captions)
- Textual similarity (caption matching)
- Word overlap analysis
- Grounding-based boosting
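One way these signals could be combined into a single ranking score, shown purely as an illustration; the weights and field names are assumptions, not the values used in enhancements.py.

```python
# Sketch: fuse caption similarity, word overlap, and an optional grounding
# boost into one score. Weights are illustrative assumptions.
def fused_score(query, hit, w_caption=0.6, w_overlap=0.3, w_ground=0.1):
    query_words = set(query.lower().split())
    caption_words = set(hit["caption"].lower().split())
    # Word-overlap signal: Jaccard similarity between query and caption
    overlap = len(query_words & caption_words) / max(len(query_words | caption_words), 1)
    # Grounding boost if the queried object was detected in the frame
    grounded = 1.0 if query_words & set(hit.get("object_tags", [])) else 0.0
    return (w_caption * hit["caption_similarity"]  # cosine score from the index
            + w_overlap * overlap
            + w_ground * grounded)
```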
Optional object detection for improved precision:
# Enable object grounding
results = pipeline.search("black bag", use_grounding=True)Adapt embeddings to your specific environment:
Adapt embeddings to your specific environment:
# Perform domain adaptation
adapted_embeddings = domain_adaptation.adapt_to_campus_environment(
embeddings, captions
)
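The adaptation step is described only at a high level here; one simple form, shown as an assumption rather than the repository's actual method, is to re-center embeddings on statistics of the deployment corpus.

```python
# Sketch: an assumed, simple form of domain adaptation -- re-center the
# embedding space on the deployment corpus and re-normalize, so similarity
# is driven by what varies within this environment.
import numpy as np

def adapt_embeddings(embeddings):
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.clip(norms, 1e-8, None)
```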
- Use a GPU in Colab for faster feature extraction.
- For large datasets, reduce `FRAME_EXTRACTION_RATE` in `config.py`.
- No videos found: ensure files exist in `data/videos/` or pass `--videos-dir`.
- Partial/failed build: `rm -rf data/keyframes data/features data/index`, then rebuild with `--force-rebuild`.
- Small dataset IVF error: the system automatically falls back to a Flat index.
- Search errors: recent updates pad query embeddings to the index dimension; rebuild if you changed feature shapes.
The system provides several metrics for evaluation:
- Retrieval Precision: Fraction of relevant results in top-K
- Caption Quality: Measured via human evaluation or cosine similarity to queries
- Temporal Coherence: Consistency of results across time
- Grounding Accuracy: Object detection precision
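For Retrieval Precision, a hypothetical precision@K helper could look like the sketch below; it assumes you supply per-query relevance labels, which the repository does not ship.

```python
# Sketch: precision@K over labeled results. Relevance labels and the
# frame_id field are illustrative assumptions.
def precision_at_k(results, relevant_frame_ids, k=10):
    top_k = results[:k]
    hits = sum(1 for r in top_k if r["frame_id"] in relevant_frame_ids)
    return hits / max(len(top_k), 1)
```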
# Evaluate on test queries
test_queries = [
"a person with a black bag",
"red car in parking lot",
"person walking on sidewalk"
]
for query in test_queries:
    results = pipeline.search(query)
    # Evaluate results...
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
If you use this system in your research, please cite:
@article{text_driven_video_search,
title={Text-Driven Video Search Using CV and Sentence Transformers},
author={Your Name},
journal={Your Conference},
year={2024}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- BLIP for image captioning
- Sentence Transformers for efficient text embeddings
- FAISS for efficient similarity search
- Transformers for model integration
- Real-time video processing
- Multi-camera synchronization
- Advanced temporal reasoning
- Fine-grained object detection
- Web interface
- Mobile app integration