"When one sense isn't enough — fuse them all."
- Overview
- The Problem & Our Solution
- System Architecture
- Data Processing Pipeline
- Agentic RAG Engine
- Technical Stack
- Directory Structure
- Setup & Installation
- Usage Guide
- Performance Metrics
- Roadmap
- License
Videntia is an Agentic AI platform that transforms unstructured video footage and transcripts into searchable, evidence-grade forensic reports.
Instead of simple keyword search, Videntia fuses transcript text, detected emotions, and speaker context through a multi-agent reasoning loop. It supports two ingestion paths:
- Full video ingestion — FFmpeg + Whisper + BLIP-2 + pyannote (GPU recommended)
- NoteGPT transcript ingestion — ingest timestamped transcripts directly without a GPU, using overlapping 30-second sliding-window segments for richer retrieval
The backend runs on FastAPI + ChromaDB + BM25, the LLM is Groq (Llama 3.1 8B Instant), and the frontend is a Next.js 14 dark-theme forensic dashboard.
| Challenge | Impact |
|---|---|
| Dark Data Accumulation | Hours of footage (meetings, depositions, news interviews) remain unsearchable due to volume. |
| Context Blindness | Keyword search misses emotional tone, speaker identity, and contextual cues entirely. |
| Scalability | Manual review of hours of footage to find a 10-second clip is economically unviable. |
Videntia uses a 4-agent LangGraph supervisor where specialized agents act as an investigative team — decomposing queries, retrieving hybrid evidence, verifying confidence, and writing structured reports. Agents loop iteratively until a confidence threshold is met or max iterations are reached.
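Under the hood the agents pass a shared state object between graph nodes; `agents/state.py` defines it as a TypedDict. A minimal sketch of what that state plausibly holds, with field names inferred from the flow described here rather than copied from the code:

```python
from typing import Dict, List, TypedDict

class AgentState(TypedDict):
    # Illustrative sketch only; the authoritative definition lives in agents/state.py.
    query: str                 # original user question
    sub_tasks: List[str]       # 2-5 searchable sub-tasks from the Lead Detective
    evidence: List[Dict]       # deduplicated evidence segments from the Retriever
    contradictions: List[str]  # issues flagged by the Verifier
    confidence: float          # adjusted confidence score in [0, 1]
    iterations: int            # completed detective-retriever-verifier loops
    report: str                # final Markdown report from the Scribe
```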
```
[Next.js Frontend] <--REST/JSON--> [FastAPI Backend]
                                          |
                    +---------------------+---------------------+
                    |                     |                     |
               [ChromaDB]           [BM25 Index]        [LangGraph Agents]
               dense vecs          sparse index      DET -> RET -> VER -> SCR
                    |                     |
              [nomic-embed]          [BM25Okapi]
                    |
              [Records JSON]
              data/records/
```
Key design decisions vs. original plan:
- Supabase PostgreSQL replaced by local JSON records (`data/records/`) for zero-cost local operation
- Groq model used is `llama-3.1-8b-instant` (500K tokens/day free tier) instead of 70B
- Windows-compatible: all emoji/unicode stripped from agent console output; stdout reconfigured to UTF-8
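The Windows fix in the last bullet amounts to forcing UTF-8 on the standard streams before any agent output is printed; a minimal sketch:

```python
import sys

# Avoid UnicodeEncodeError on Windows consoles that default to a legacy code page.
sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")
```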
Path A (full video ingestion):
- Chunking — FFmpeg slices video into 10-second segments
- Transcription — `faster-whisper` (base model) produces speech-to-text
- Speaker Diarization — `pyannote.audio` identifies and labels speakers
- Visual Captioning — `BLIP-2` captions keyframes per segment
- Emotion Fusion — `fuse.py` extracts emotion signals from transcript + captions
- Indexing — `nomic-embed-text-v1.5` embeds combined text into ChromaDB; BM25 index rebuilt
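The indexing step is split between `embed/text_embedder.py`, `embed/store.py`, and `embed/bm25_index.py`. A hedged sketch of what it boils down to, assuming the record fields (`segment_id`, `combined_text`) and storage paths described later; the actual module code may differ:

```python
import json, pickle
from pathlib import Path

import chromadb
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from nltk.tokenize import word_tokenize  # needs nltk.download("punkt") once

records = [json.loads(p.read_text(encoding="utf-8"))
           for p in Path("data/records").glob("*.json")]
ids = [r["segment_id"] for r in records]
texts = [r["combined_text"] for r in records]

# Dense index: nomic-embed-text-v1.5 (768-d) into the ChromaDB text_segments collection.
embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
vectors = embedder.encode([f"search_document: {t}" for t in texts])  # nomic task prefix
chroma = chromadb.PersistentClient(path="db/chroma")  # persistence path is an assumption
chroma.get_or_create_collection("text_segments").add(
    ids=ids, documents=texts, embeddings=vectors.tolist()
)

# Sparse index: BM25Okapi over NLTK-tokenized combined_text, persisted to db/bm25.pkl.
bm25 = BM25Okapi([word_tokenize(t.lower()) for t in texts])
with open("db/bm25.pkl", "wb") as f:
    pickle.dump({"bm25": bm25, "ids": ids}, f)
```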
Path B (NoteGPT transcript ingestion):
- Parse — `ingest_notegpt.py` parses NoteGPT `HH:MM:SS` timestamp blocks
- Overlapping Windows — 30-second sliding windows (15-second stride) built for richer retrieval
- Emotion Signals — `fuse.py` extracts emotions from transcript text only
- Indexing — same ChromaDB + BM25 pipeline as full ingestion
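The overlapping-window step can be sketched as follows; `build_windows` and the block field names are illustrative, not the exact code in `ingest_notegpt.py`:

```python
WINDOW_SEC, STRIDE_SEC = 30, 15  # 30-second windows, 15-second stride

def build_windows(blocks: list[dict], duration_sec: float) -> list[dict]:
    """blocks: parsed NoteGPT entries like {"start_sec": float, "end_sec": float, "text": str}."""
    windows, start = [], 0.0
    while start < duration_sec:
        end = min(start + WINDOW_SEC, duration_sec)
        # Join the text of every transcript block that overlaps [start, end).
        text = " ".join(b["text"] for b in blocks
                        if b["start_sec"] < end and b["end_sec"] > start)
        if text:
            windows.append({"start_sec": start, "end_sec": end, "text": text})
        start += STRIDE_SEC
    return windows
```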
```bash
# Ingest a NoteGPT transcript
python ingest_notegpt.py --transcript "path/to/transcript.txt" --video "path/to/video.mp4"

# Ingest a full video (GPU required)
python -c "from pipeline.ingest import ingest_video; ingest_video('data/videos/video.mp4')"
```

```mermaid
flowchart TD
A[User provides input] --> B{Input type?}
B -->|Full video file| C[Full Video Ingestion Pipeline]
B -->|NoteGPT timestamp transcript + optional video ref| D[Transcript Ingestion Pipeline]
subgraph FULL_INGEST[Path A -- Full Video Ingestion]
C --> C1[Split video into 10-second chunks with FFmpeg]
C1 --> C2[Transcribe full video once with Whisper]
C2 --> C3[Map transcript text back to each 10s segment by timestamp overlap]
C3 --> C4{Captioning enabled?}
C4 -->|Yes| C5[Extract frames from each segment]
C5 --> C6[Generate visual captions with BLIP-2]
C4 -->|No| C7[Skip visual captioning]
C6 --> C8[Extract emotion signals from transcript + visual captions]
C7 --> C8
C8 --> C9{Diarization enabled?}
C9 -->|Yes| C10[Run speaker diarization and assign speaker labels]
C9 -->|No| C11[Skip speaker detection]
C10 --> C12[Build SegmentRecord]
C11 --> C12
end
subgraph NOTEGPT_INGEST[Path B -- NoteGPT Transcript Ingestion]
D --> D1[Parse HH:MM:SS transcript blocks]
D1 --> D2[Infer end times for each block]
D2 --> D3[Create overlapping windows: 30s window / 15s stride]
D3 --> D4[Join all text overlapping each window]
D4 --> D5[Extract emotion signals from transcript text]
D5 --> D6[Build SegmentRecord]
end
subgraph RECORD_LAYER[Record Formation]
C12 --> E1[Create unique segment_id and video_id]
D6 --> E1
E1 --> E2[Store timing info: start_sec / end_sec]
E2 --> E3[Store transcript]
E3 --> E4[Store visual captions if available]
E4 --> E5[Store speaker if available]
E5 --> E6[Store emotion metadata: detected emotions, scores, intensity]
E6 --> E7[Create combined_text]
E7 --> E8[combined_text = transcript + visual cues + emotion tags or time labels]
E8 --> E9[Save each SegmentRecord as JSON in data/records]
end
subgraph INDEXING[Index Building]
E9 --> F1[Load all saved records]
F1 --> F2[Build BM25 sparse index from combined_text]
F1 --> F3[Build dense text embeddings from combined_text]
F1 --> F4{Visual captions exist?}
F4 -->|Yes| F5[Build dense vision embeddings from joined captions]
F4 -->|No| F6[Skip vision embedding for that segment]
F2 --> F7[Persist BM25 index to db/bm25.pkl]
F3 --> F8[Store text embeddings in ChromaDB text_segments]
F5 --> F9[Store vision embeddings in ChromaDB vision_segments]
end
G[User asks a question] --> H[main.py initializes AgentState]
H --> I[LangGraph workflow starts]
subgraph AGENT_FLOW[Multi-Agent Reasoning Loop]
I --> J[Lead Detective Agent]
J --> J1[Read query + current evidence + confidence + contradictions]
J1 --> J2[Break question into 2 to 5 searchable sub-tasks]
J2 --> K[Retriever Agent]
K --> K1[For each sub-task run hybrid_retrieve]
K1 --> K2[Aggregate evidence across sub-tasks]
K2 --> K3[Early deduplication by segment_id]
K3 --> L[Verifier Agent]
L --> L1[Deduplicate evidence again]
L1 --> L2[Compute average rerank score]
L2 --> L3[LLM checks evidence quality and contradictions]
L3 --> L4[Compute adjusted confidence score]
L4 --> M{Confidence >= threshold or max loops reached?}
M -->|No| J
M -->|Yes| N[Scribe Agent]
N --> N1[Sort verified evidence by rerank_score]
N1 --> N2[Take top evidence segments]
N2 --> N3[Write forensic Markdown report]
N3 --> N4[Append metadata: confidence / evidence count / iterations]
N4 --> O[Final answer + report returned]
end
subgraph HYBRID[Detailed Hybrid Retrieval for each sub-task]
K1 --> R1[Detect query type: emotion / temporal / speaker]
R1 --> R2[Expand query synonyms for BM25 if useful]
R2 --> R3[BM25 sparse search on combined_text]
R2 --> R4[Dense semantic search on ChromaDB text collection]
R2 --> R5[Dense semantic search on ChromaDB vision collection]
R3 --> R6[RRF fusion of ranked lists]
R4 --> R6
R5 --> R6
R6 --> R7[Build top candidate pool]
R7 --> R8[Cross-encoder reranking]
R8 --> R9{Emotion query?}
R9 -->|Yes| R10[Boost segments with stronger emotion metadata]
R9 -->|No| R11[No emotion boost]
R10 --> R12{Temporal query?}
R11 --> R12
R12 -->|Yes| R13[Add adjacent +/-1 segments for context]
R12 -->|No| R14[Keep top reranked results]
R13 --> R15[Return final evidence segments]
R14 --> R15
end
subgraph STORAGE[Stored Data]
S1[data/records JSON records]
S2[db/bm25.pkl]
S3[ChromaDB text_segments]
S4[ChromaDB vision_segments]
S5[reports/ saved Markdown reports]
end
E9 --> S1
F7 --> S2
F8 --> S3
F9 --> S4
N4 --> S5
S2 --> R3
S3 --> R4
S4 --> R5
S1 -. fallback if dense data missing .-> R7
```
```
User Query
    |
[Lead Detective] -- decomposes query into 2-5 sub-tasks
    |
[Retriever Agent] -- hybrid search for each sub-task:
    |    1. BM25 sparse (top 50)
    |    2. ChromaDB dense (top 50)
    |    3. RRF fusion (top 20)
    |    4. Cross-encoder rerank (top 8, bge-reranker-v2-m3)
    |
[Verifier Agent] -- scores confidence, detects contradictions
    |
    +-- confidence < 75% AND iterations < 5 --> back to Detective
    |
[Scribe Agent] -- writes structured Markdown forensic report
    |
Final Report + Evidence Segments returned to API
```
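The fusion and reranking stages in `rag/retriever.py` and `rag/reranker.py` can be sketched roughly as below; the RRF constant `k=60` and the function names are assumptions, and the real code additionally handles query expansion, emotion boosting, and temporal neighbours:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over ranked lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] = scores.get(seg_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def rerank(query: str, candidates: dict[str, str], top_k: int = 8) -> list[tuple[str, float]]:
    """Cross-encoder rerank of candidate segments (id -> combined_text); keep the top_k."""
    ids = list(candidates)
    scores = reranker.predict([(query, candidates[i]) for i in ids])
    return sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)[:top_k]

# pool = rrf_fuse([bm25_top50_ids, dense_top50_ids])               # top-20 candidate pool
# evidence = rerank(sub_task, {i: combined_text[i] for i in pool})  # top-8 for the Scribe
```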
| Parameter | Value | Description |
|---|---|---|
| `MAX_ITERATIONS` | 5 | Max detective-retriever-verifier loops |
| `MIN_CONFIDENCE` | 0.75 | Confidence threshold to exit the loop |
| `RERANK_TOP_K` | 8 | Final evidence segments passed to the Scribe |
| `BM25_TOP_K` | 50 | Sparse candidates per sub-task |
| `DENSE_TOP_K` | 50 | Dense candidates per sub-task |
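`graph.py` compiles the four agents into a LangGraph state graph whose conditional edge either loops back to the Lead Detective or hands off to the Scribe. A wiring sketch using the `AgentState` sketched earlier and trivial stub nodes (the real node functions live in `agents/`):

```python
from langgraph.graph import StateGraph, END

MAX_ITERATIONS, MIN_CONFIDENCE = 5, 0.75   # values from the parameter table above

# Trivial stand-ins for the real agents in agents/*.py.
def detective(state: AgentState) -> dict: return {"iterations": state["iterations"] + 1}
def retriever(state: AgentState) -> dict: return {"evidence": state["evidence"]}
def verifier(state: AgentState) -> dict:  return {"confidence": 0.8}
def scribe(state: AgentState) -> dict:    return {"report": "# Forensic Report (stub)"}

def should_continue(state: AgentState) -> str:
    """Loop until confidence clears the threshold or the iteration budget is spent."""
    if state["confidence"] >= MIN_CONFIDENCE or state["iterations"] >= MAX_ITERATIONS:
        return "report"
    return "loop"

graph = StateGraph(AgentState)
graph.add_node("detective", detective)
graph.add_node("retriever", retriever)
graph.add_node("verifier", verifier)
graph.add_node("scribe", scribe)
graph.set_entry_point("detective")
graph.add_edge("detective", "retriever")
graph.add_edge("retriever", "verifier")
graph.add_conditional_edges("verifier", should_continue, {"loop": "detective", "report": "scribe"})
graph.add_edge("scribe", END)
app = graph.compile()
# app.invoke({"query": "...", "sub_tasks": [], "evidence": [], "contradictions": [],
#             "confidence": 0.0, "iterations": 0, "report": ""})
```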
| Component | Technology | Notes |
|---|---|---|
| LLM | Groq Llama 3.1 8B Instant | 500K tokens/day free; llama-3.1-8b-instant |
| Orchestration | LangGraph | 4-node state graph with conditional loops |
| Embeddings | nomic-embed-text-v1.5 | 768D, trust_remote_code=True |
| Reranker | BAAI/bge-reranker-v2-m3 | Cross-encoder relevance scoring |
| Vector DB | ChromaDB (local) | Persistent, no cloud required |
| Sparse Search | BM25Okapi + NLTK | Keyword token matching |
| Transcription | faster-whisper (base) | GPU-accelerated, optional |
| Speaker ID | pyannote.audio 3.1 | Requires HF_TOKEN, optional |
| Vision | BLIP-2 (Salesforce) | Frame captioning, optional |
| Backend | FastAPI + uvicorn | Async REST, port 8000 |
| Frontend | Next.js 14 | Dark-theme forensic UI, port 3000 |
| Storage | Local JSON records | data/records/, no DB setup needed |
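For reference, a call to the Groq-hosted model listed above looks like the following with the official `groq` Python SDK; whether the project calls the SDK directly or through a LangChain wrapper isn't shown here, and the prompt is only an example:

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])  # key from the .env file described below
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    temperature=0.2,
    messages=[
        {"role": "system",
         "content": "You are the Lead Detective. Decompose the question into 2-5 searchable sub-tasks."},
        {"role": "user",
         "content": "Who showed the most concern and what did they say?"},
    ],
)
print(resp.choices[0].message.content)
```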
```
videntia/
├── agents/
│ ├── state.py # AgentState TypedDict
│ ├── lead_detective.py # Query decomposition + loop control
│ ├── retriever_agent.py # Hybrid RAG retrieval per sub-task
│ ├── verifier_agent.py # Confidence scoring + contradiction detection
│ └── scribe_agent.py # Forensic report generation
├── rag/
│ ├── retriever.py # BM25 + dense + RRF fusion
│ └── reranker.py # Cross-encoder reranking
├── embed/
│ ├── store.py # ChromaDB connection + upsert
│ ├── bm25_index.py # BM25 index build/load
│ └── text_embedder.py # nomic-embed-text-v1.5 wrapper
├── pipeline/
│ ├── ingest.py # Full video ingestion pipeline
│ ├── segment.py # FFmpeg chunking
│ ├── transcribe.py # faster-whisper transcription
│ ├── caption.py # BLIP-2 frame captioning
│ ├── fuse.py # Multimodal fusion + emotion extraction
│ └── audio_embeddings.py # pyannote speaker diarization
├── frontend/
│ └── app/
│ ├── page.tsx # Home: upload + existing videos
│ ├── analyze/[videoId] # Timeline viewer + transcript
│ └── query/[videoId] # Forensic Query Terminal
├── api.py # FastAPI routes
├── config.py # All config (paths, keys, params)
├── graph.py # LangGraph workflow compilation
├── main.py # CLI entry point
├── ingest_notegpt.py # NoteGPT transcript ingestion
└── requirements.txt
```
- Python 3.10+
- Node.js 18+
- `GROQ_API_KEY` (free at console.groq.com)
- `HF_TOKEN` (optional, for pyannote speaker diarization)
```bash
git clone https://github.com/tsanhith/videntia.git
cd videntia

python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

pip install -r requirements.txt
```

Create a `.env` file in the project root:

```env
GROQ_API_KEY=your_groq_api_key_here
HF_TOKEN=your_huggingface_token_here   # optional
SUPABASE_URL=                          # optional, not required for local use
SUPABASE_KEY=                          # optional
```

Start the backend:

```bash
# Windows
.\venv\Scripts\python.exe -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

# Linux/Mac
uvicorn api:app --reload --host 0.0.0.0 --port 8000
```

In a second terminal, start the frontend:

```bash
cd frontend
npm install
npm run dev
```

Open http://localhost:3000.
- Go to http://localhost:3000
- Upload a video file (MP4) — it will be ingested automatically
- Click Deploy Agents to open the Forensic Query Terminal
- Type any natural-language question and press Enter
Download a transcript from NoteGPT and run:
```bash
python ingest_notegpt.py \
  --transcript "path/to/NoteGPT_TRANSCRIPT_video_name.txt" \
  --video "path/to/video.mp4"
```

This outputs a `video_id`. Then query:

```bash
python main.py "What is Iran's definition of victory in this war?"
```

Or use the `video_id` in the frontend URL:

```
http://localhost:3000/query/<video_id>
```
```bash
python main.py "Who showed the most concern and what did they say?"
python main.py "What strategies were discussed?" --max-iter 3
```

Example questions:
- "Why did Iran start targeting Arab countries so quickly?"
- "How are Iranian civilians reacting to the strikes?"
- "What role are Kurdish forces playing in Iran?"
- "What comparisons were made between Iran and Gaza?"
| Metric | Value | Notes |
|---|---|---|
| NoteGPT ingestion | ~5 seconds | For a 10-minute transcript on CPU |
| Full video ingestion | ~90 seconds/2hr video | Requires T4 GPU |
| Query latency | 1–4 seconds | Groq 8B, depends on iterations |
| Agent iterations | 1–3 typical | Before confidence >= 75% |
| Retrieval precision | ~85% P@5 | On tested news interview content |
| False positive reduction | ~75% | Via verifier contradiction checks |
- Streaming real-time ingestion (RTMP/HLS).
- Knowledge Graph RAG (entity relationship mapping).
- Intelligent scene boundary detection (replace fixed 10s chunks).
- Multi-video cross-analysis queries.
- Desktop app packaging (Electron + PyInstaller).
MIT License. See LICENSE for details.
Note: pyannote.audio model weights on HuggingFace are subject to their own non-commercial use terms. Review before commercial deployment.