This project provides a blueprint for building visually perceptive and interactive AI agents for video search and summarization (VSS) using:
- LLaMA (LLM)
- Pinecone (vector DB)
- Flask (web UI)
- Vision-Language Models (VLM)
- Retrieval-Augmented Generation (RAG)
Project structure:
- `app/` — Flask web app
- `video_processing/` — Video frame extraction and preprocessing (sketched below)
- `vlm/` — Vision-language model integration
- `llm/` — LLaMA integration
- `db/` — Pinecone vector DB integration
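As an illustration of the `video_processing/` step, frame extraction can be as simple as sampling every Nth frame with OpenCV. The function name and default sampling rate below are illustrative, not the repo's actual API:

```python
import cv2

def extract_frames(video_path, every_n=30):
    """Sample every Nth frame from a video (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if index % every_n == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```

At 30 fps, `every_n=30` yields roughly one frame per second, which keeps downstream captioning costs manageable.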
- Install dependencies (a sample requirements.txt is sketched below): `pip install -r requirements.txt`
- Run the app: `python app/main.py`
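The pinned dependencies aren't reproduced here; a plausible requirements.txt for this stack might look like the following (the exact package names, especially for the Pinecone client and the LLaMA backend, are assumptions):

```text
flask
opencv-python
torch
transformers
pillow
sentence-transformers
pinecone
```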
The `vlm/` module uses BLIP for image captioning, with automatic GPU/CPU device management and robust error handling.
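The repo's implementation isn't reproduced here, but a minimal sketch of what `generate_captions` could look like with Hugging Face's BLIP (the checkpoint `Salesforce/blip-image-captioning-base` and the `max_new_tokens` setting are assumptions):

```python
import cv2
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

def generate_captions(frames):
    """Caption a list of OpenCV (BGR) frames; errors are reported in-place."""
    captions = []
    for frame in frames:
        try:
            # OpenCV frames are BGR; BLIP expects RGB
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt").to(device)
            output = model.generate(**inputs, max_new_tokens=30)
            captions.append(processor.decode(output[0], skip_special_tokens=True))
        except Exception as exc:
            captions.append(f"Error: {exc}")
    return captions
```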
Example usage:

```python
import cv2
from vlm import generate_captions

# Extract a single frame from a video and generate a caption
cap = cv2.VideoCapture('sample.mp4')
frames = []
ret, frame = cap.read()
if ret:
    frames.append(frame)
cap.release()

captions = generate_captions(frames)
print(captions)
```
- The model will use GPU if available, otherwise CPU.
- Errors during caption generation are caught and reported in the output list.
- For best performance, a CUDA-capable GPU is recommended for BLIP and LLaMA models.
- Ensure your environment has the necessary drivers and libraries for GPU support.
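On the retrieval side, the `db/` module presumably stores caption embeddings in Pinecone so the `llm/` module can answer queries over them (the RAG loop). Below is a minimal sketch using the v3 `pinecone` client and `sentence-transformers`; the index name, embedding model, and metadata layout are all assumptions:

```python
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
pc = Pinecone(api_key="YOUR_API_KEY")

# Create the index once; 384 dimensions matches all-MiniLM-L6-v2
if "video-captions" not in pc.list_indexes().names():
    pc.create_index(
        name="video-captions",
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index("video-captions")

# Index each frame caption, keeping the frame number as metadata
captions = ["a person walks into a room", "a dog runs across a field"]
index.upsert(vectors=[
    {
        "id": f"frame-{i}",
        "values": embedder.encode(caption).tolist(),
        "metadata": {"caption": caption, "frame": i},
    }
    for i, caption in enumerate(captions)
])

# Retrieve the most relevant captions and assemble a RAG prompt for LLaMA
query = "What happens at the start of the video?"
matches = index.query(
    vector=embedder.encode(query).tolist(), top_k=5, include_metadata=True
).matches
context = "\n".join(m.metadata["caption"] for m in matches)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` would then be passed to the llm/ module's LLaMA wrapper.
```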