"When one sense isn't enough — fuse them all."
- Overview
- The Problem & Our Solution
- System Architecture
- Data Processing Pipeline
- Agentic RAG Engine
- Technical Stack
- Directory Structure
- Setup & Installation
- Usage Guide
- Performance Metrics
- Roadmap
- License
Videntia is an Agentic AI platform that transforms unstructured video footage and transcripts into searchable, evidence-grade forensic reports.
Instead of simple keyword search, Videntia fuses transcript text, detected emotions, and speaker context through a multi-agent reasoning loop. It supports two ingestion paths:
- Full video ingestion — FFmpeg + Whisper + BLIP-2 + pyannote (GPU recommended)
- NoteGPT transcript ingestion — ingest timestamped transcripts directly without a GPU, using overlapping 30-second sliding-window segments for richer retrieval
The backend runs on FastAPI + ChromaDB + BM25, the LLM is Groq (Llama 3.1 8B Instant), and the frontend is a Next.js 14 dark-theme forensic dashboard.
| Challenge | Impact |
|---|---|
| Dark Data Accumulation | Hours of footage (meetings, depositions, news interviews) remain unsearchable due to volume. |
| Context Blindness | Keyword search misses emotional tone, speaker identity, and contextual cues entirely. |
| Scalability | Manual review of hours of footage to find a 10-second clip is economically unviable. |
Videntia uses a 4-agent LangGraph supervisor where specialized agents act as an investigative team — decomposing queries, retrieving hybrid evidence, verifying confidence, and writing structured reports. Agents loop iteratively until a confidence threshold is met or max iterations are reached.
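Under the hood the agents pass a shared state object between graph nodes; `agents/state.py` defines it as a TypedDict. A minimal sketch of what that state plausibly holds, with field names inferred from the flow described here rather than copied from the code:

```python
from typing import Dict, List, TypedDict

class AgentState(TypedDict):
    # Illustrative sketch only; the authoritative definition lives in agents/state.py.
    query: str                 # original user question
    sub_tasks: List[str]       # 2-5 searchable sub-tasks from the Lead Detective
    evidence: List[Dict]       # deduplicated evidence segments from the Retriever
    contradictions: List[str]  # issues flagged by the Verifier
    confidence: float          # adjusted confidence score in [0, 1]
    iterations: int            # completed detective-retriever-verifier loops
    report: str                # final Markdown report from the Scribe
```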
```
[Next.js Frontend] <--REST/JSON--> [FastAPI Backend]
                                          |
                    +---------------------+---------------------+
                    |                     |                     |
               [ChromaDB]           [BM25 Index]        [LangGraph Agents]
               dense vecs          sparse index      DET -> RET -> VER -> SCR
                    |                     |
              [nomic-embed]          [BM25Okapi]
                    |
              [Records JSON]
              data/records/
```
Key design decisions vs. original plan:
- Supabase PostgreSQL replaced by local JSON records (`data/records/`) for zero-cost local operation
- Groq model used is `llama-3.1-8b-instant` (500K tokens/day free tier) instead of 70B
- Windows-compatible: all emoji/unicode stripped from agent console output; stdout reconfigured to UTF-8
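The Windows fix in the last bullet amounts to forcing UTF-8 on the standard streams before any agent output is printed; a minimal sketch:

```python
import sys

# Avoid UnicodeEncodeError on Windows consoles that default to a legacy code page.
sys.stdout.reconfigure(encoding="utf-8")
sys.stderr.reconfigure(encoding="utf-8")
```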
Path A (full video ingestion):
- Chunking — FFmpeg slices video into 10-second segments
- Transcription — `faster-whisper` (base model) produces speech-to-text
- Speaker Diarization — `pyannote.audio` identifies and labels speakers
- Visual Captioning — `BLIP-2` captions keyframes per segment
- Emotion Fusion — `fuse.py` extracts emotion signals from transcript + captions
- Indexing — `nomic-embed-text-v1.5` embeds combined text into ChromaDB; BM25 index rebuilt
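The indexing step is split between `embed/text_embedder.py`, `embed/store.py`, and `embed/bm25_index.py`. A hedged sketch of what it boils down to, assuming the record fields (`segment_id`, `combined_text`) and storage paths described later; the actual module code may differ:

```python
import json, pickle
from pathlib import Path

import chromadb
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from nltk.tokenize import word_tokenize  # needs nltk.download("punkt") once

records = [json.loads(p.read_text(encoding="utf-8"))
           for p in Path("data/records").glob("*.json")]
ids = [r["segment_id"] for r in records]
texts = [r["combined_text"] for r in records]

# Dense index: nomic-embed-text-v1.5 (768-d) into the ChromaDB text_segments collection.
embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
vectors = embedder.encode([f"search_document: {t}" for t in texts])  # nomic task prefix
chroma = chromadb.PersistentClient(path="db/chroma")  # persistence path is an assumption
chroma.get_or_create_collection("text_segments").add(
    ids=ids, documents=texts, embeddings=vectors.tolist()
)

# Sparse index: BM25Okapi over NLTK-tokenized combined_text, persisted to db/bm25.pkl.
bm25 = BM25Okapi([word_tokenize(t.lower()) for t in texts])
with open("db/bm25.pkl", "wb") as f:
    pickle.dump({"bm25": bm25, "ids": ids}, f)
```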
Path B (NoteGPT transcript ingestion):
- Parse — `ingest_notegpt.py` parses NoteGPT `HH:MM:SS` timestamp blocks
- Overlapping Windows — 30-second sliding windows (15-second stride) built for richer retrieval
- Emotion Signals — `fuse.py` extracts emotions from transcript text only
- Indexing — same ChromaDB + BM25 pipeline as full ingestion
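The overlapping-window step can be sketched as follows; `build_windows` and the block field names are illustrative, not the exact code in `ingest_notegpt.py`:

```python
WINDOW_SEC, STRIDE_SEC = 30, 15  # 30-second windows, 15-second stride

def build_windows(blocks: list[dict], duration_sec: float) -> list[dict]:
    """blocks: parsed NoteGPT entries like {"start_sec": float, "end_sec": float, "text": str}."""
    windows, start = [], 0.0
    while start < duration_sec:
        end = min(start + WINDOW_SEC, duration_sec)
        # Join the text of every transcript block that overlaps [start, end).
        text = " ".join(b["text"] for b in blocks
                        if b["start_sec"] < end and b["end_sec"] > start)
        if text:
            windows.append({"start_sec": start, "end_sec": end, "text": text})
        start += STRIDE_SEC
    return windows
```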
```bash
# Ingest a NoteGPT transcript
python ingest_notegpt.py --transcript "path/to/transcript.txt" --video "path/to/video.mp4"

# Ingest a full video (GPU required)
python -c "from pipeline.ingest import ingest_video; ingest_video('data/videos/video.mp4')"
```

```mermaid
flowchart TD
A[User provides input] --> B{Input type?}
B -->|Full video file| C[Full Video Ingestion Pipeline]
B -->|NoteGPT timestamp transcript + optional video ref| D[Transcript Ingestion Pipeline]
subgraph FULL_INGEST[Path A -- Full Video Ingestion]
C --> C1[Split video into 10-second chunks with FFmpeg]
C1 --> C2[Transcribe full video once with Whisper]
C2 --> C3[Map transcript text back to each 10s segment by timestamp overlap]
C3 --> C4{Captioning enabled?}
C4 -->|Yes| C5[Extract frames from each segment]
C5 --> C6[Generate visual captions with BLIP-2]
C4 -->|No| C7[Skip visual captioning]
C6 --> C8[Extract emotion signals from transcript + visual captions]
C7 --> C8
C8 --> C9{Diarization enabled?}
C9 -->|Yes| C10[Run speaker diarization and assign speaker labels]
C9 -->|No| C11[Skip speaker detection]
C10 --> C12[Build SegmentRecord]
C11 --> C12
end
subgraph NOTEGPT_INGEST[Path B -- NoteGPT Transcript Ingestion]
D --> D1[Parse HH:MM:SS transcript blocks]
D1 --> D2[Infer end times for each block]
D2 --> D3[Create overlapping windows: 30s window / 15s stride]
D3 --> D4[Join all text overlapping each window]
D4 --> D5[Extract emotion signals from transcript text]
D5 --> D6[Build SegmentRecord]
end
subgraph RECORD_LAYER[Record Formation]
C12 --> E1[Create unique segment_id and video_id]
D6 --> E1
E1 --> E2[Store timing info: start_sec / end_sec]
E2 --> E3[Store transcript]
E3 --> E4[Store visual captions if available]
E4 --> E5[Store speaker if available]
E5 --> E6[Store emotion metadata: detected emotions, scores, intensity]
E6 --> E7[Create combined_text]
E7 --> E8[combined_text = transcript + visual cues + emotion tags or time labels]
E8 --> E9[Save each SegmentRecord as JSON in data/records]
end
subgraph INDEXING[Index Building]
E9 --> F1[Load all saved records]
F1 --> F2[Build BM25 sparse index from combined_text]
F1 --> F3[Build dense text embeddings from combined_text]
F1 --> F4{Visual captions exist?}
F4 -->|Yes| F5[Build dense vision embeddings from joined captions]
F4 -->|No| F6[Skip vision embedding for that segment]
F2 --> F7[Persist BM25 index to db/bm25.pkl]
F3 --> F8[Store text embeddings in ChromaDB text_segments]
F5 --> F9[Store vision embeddings in ChromaDB vision_segments]
end
G[User asks a question] --> H[main.py initializes AgentState]
H --> I[LangGraph workflow starts]
subgraph AGENT_FLOW[Multi-Agent Reasoning Loop]
I --> J[Lead Detective Agent]
J --> J1[Read query + current evidence + confidence + contradictions]
J1 --> J2[Break question into 2 to 5 searchable sub-tasks]
J2 --> K[Retriever Agent]
K --> K1[For each sub-task run hybrid_retrieve]
K1 --> K2[Aggregate evidence across sub-tasks]
K2 --> K3[Early deduplication by segment_id]
K3 --> L[Verifier Agent]
L --> L1[Deduplicate evidence again]
L1 --> L2[Compute average rerank score]
L2 --> L3[LLM checks evidence quality and contradictions]
L3 --> L4[Compute adjusted confidence score]
L4 --> M{Confidence >= threshold or max loops reached?}
M -->|No| J
M -->|Yes| N[Scribe Agent]
N --> N1[Sort verified evidence by rerank_score]
N1 --> N2[Take top evidence segments]
N2 --> N3[Write forensic Markdown report]
N3 --> N4[Append metadata: confidence / evidence count / iterations]
N4 --> O[Final answer + report returned]
end
subgraph HYBRID[Detailed Hybrid Retrieval for each sub-task]
K1 --> R1[Detect query type: emotion / temporal / speaker]
R1 --> R2[Expand query synonyms for BM25 if useful]
R2 --> R3[BM25 sparse search on combined_text]
R2 --> R4[Dense semantic search on ChromaDB text collection]
R2 --> R5[Dense semantic search on ChromaDB vision collection]
R3 --> R6[RRF fusion of ranked lists]
R4 --> R6
R5 --> R6
R6 --> R7[Build top candidate pool]
R7 --> R8[Cross-encoder reranking]
R8 --> R9{Emotion query?}
R9 -->|Yes| R10[Boost segments with stronger emotion metadata]
R9 -->|No| R11[No emotion boost]
R10 --> R12{Temporal query?}
R11 --> R12
R12 -->|Yes| R13[Add adjacent +/-1 segments for context]
R12 -->|No| R14[Keep top reranked results]
R13 --> R15[Return final evidence segments]
R14 --> R15
end
subgraph STORAGE[Stored Data]
S1[data/records JSON records]
S2[db/bm25.pkl]
S3[ChromaDB text_segments]
S4[ChromaDB vision_segments]
S5[reports/ saved Markdown reports]
end
E9 --> S1
F7 --> S2
F8 --> S3
F9 --> S4
N4 --> S5
S2 --> R3
S3 --> R4
S4 --> R5
S1 -. fallback if dense data missing .-> R7
```
```
User Query
    |
[Lead Detective] -- decomposes query into 2-5 sub-tasks
    |
[Retriever Agent] -- hybrid search for each sub-task:
    |    1. BM25 sparse (top 50)
    |    2. ChromaDB dense (top 50)
    |    3. RRF fusion (top 20)
    |    4. Cross-encoder rerank (top 8, bge-reranker-v2-m3)
    |
[Verifier Agent] -- scores confidence, detects contradictions
    |
    +-- confidence < 75% AND iterations < 5 --> back to Detective
    |
[Scribe Agent] -- writes structured Markdown forensic report
    |
Final Report + Evidence Segments returned to API
```
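The fusion and reranking stages in `rag/retriever.py` and `rag/reranker.py` can be sketched roughly as below; the RRF constant `k=60` and the function names are assumptions, and the real code additionally handles query expansion, emotion boosting, and temporal neighbours:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over ranked lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] = scores.get(seg_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def rerank(query: str, candidates: dict[str, str], top_k: int = 8) -> list[tuple[str, float]]:
    """Cross-encoder rerank of candidate segments (id -> combined_text); keep the top_k."""
    ids = list(candidates)
    scores = reranker.predict([(query, candidates[i]) for i in ids])
    return sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)[:top_k]

# pool = rrf_fuse([bm25_top50_ids, dense_top50_ids])               # top-20 candidate pool
# evidence = rerank(sub_task, {i: combined_text[i] for i in pool})  # top-8 for the Scribe
```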
| Parameter | Value | Description |
|---|---|---|
| `MAX_ITERATIONS` | 5 | Max detective-retriever-verifier loops |
| `MIN_CONFIDENCE` | 0.75 | Confidence threshold to exit the loop |
| `RERANK_TOP_K` | 8 | Final evidence segments passed to the Scribe |
| `BM25_TOP_K` | 50 | Sparse candidates per sub-task |
| `DENSE_TOP_K` | 50 | Dense candidates per sub-task |
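`graph.py` compiles the four agents into a LangGraph state graph whose conditional edge either loops back to the Lead Detective or hands off to the Scribe. A wiring sketch using the `AgentState` sketched earlier and trivial stub nodes (the real node functions live in `agents/`):

```python
from langgraph.graph import StateGraph, END

MAX_ITERATIONS, MIN_CONFIDENCE = 5, 0.75   # values from the parameter table above

# Trivial stand-ins for the real agents in agents/*.py.
def detective(state: AgentState) -> dict: return {"iterations": state["iterations"] + 1}
def retriever(state: AgentState) -> dict: return {"evidence": state["evidence"]}
def verifier(state: AgentState) -> dict:  return {"confidence": 0.8}
def scribe(state: AgentState) -> dict:    return {"report": "# Forensic Report (stub)"}

def should_continue(state: AgentState) -> str:
    """Loop until confidence clears the threshold or the iteration budget is spent."""
    if state["confidence"] >= MIN_CONFIDENCE or state["iterations"] >= MAX_ITERATIONS:
        return "report"
    return "loop"

graph = StateGraph(AgentState)
graph.add_node("detective", detective)
graph.add_node("retriever", retriever)
graph.add_node("verifier", verifier)
graph.add_node("scribe", scribe)
graph.set_entry_point("detective")
graph.add_edge("detective", "retriever")
graph.add_edge("retriever", "verifier")
graph.add_conditional_edges("verifier", should_continue, {"loop": "detective", "report": "scribe"})
graph.add_edge("scribe", END)
app = graph.compile()
# app.invoke({"query": "...", "sub_tasks": [], "evidence": [], "contradictions": [],
#             "confidence": 0.0, "iterations": 0, "report": ""})
```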
| Component | Technology | Notes |
|---|---|---|
| LLM | Groq Llama 3.1 8B Instant | 500K tokens/day free; llama-3.1-8b-instant |
| Orchestration | LangGraph | 4-node state graph with conditional loops |
| Embeddings | nomic-embed-text-v1.5 | 768D, trust_remote_code=True |
| Reranker | BAAI/bge-reranker-v2-m3 | Cross-encoder relevance scoring |
| Vector DB | ChromaDB (local) | Persistent, no cloud required |
| Sparse Search | BM25Okapi + NLTK | Keyword token matching |
| Transcription | faster-whisper (base) | GPU-accelerated, optional |
| Speaker ID | pyannote.audio 3.1 | Requires HF_TOKEN, optional |
| Vision | BLIP-2 (Salesforce) | Frame captioning, optional |
| Backend | FastAPI + uvicorn | Async REST, port 8000 |
| Frontend | Next.js 14 | Dark-theme forensic UI, port 3000 |
| Storage | Local JSON records | data/records/, no DB setup needed |
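For reference, a call to the Groq-hosted model listed above looks like the following with the official `groq` Python SDK; whether the project calls the SDK directly or through a LangChain wrapper isn't shown here, and the prompt is only an example:

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])  # key from the .env file described below
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    temperature=0.2,
    messages=[
        {"role": "system",
         "content": "You are the Lead Detective. Decompose the question into 2-5 searchable sub-tasks."},
        {"role": "user",
         "content": "Who showed the most concern and what did they say?"},
    ],
)
print(resp.choices[0].message.content)
```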
```
videntia/
├── agents/
│ ├── state.py # AgentState TypedDict
│ ├── lead_detective.py # Query decomposition + loop control
│ ├── retriever_agent.py # Hybrid RAG retrieval per sub-task
│ ├── verifier_agent.py # Confidence scoring + contradiction detection
│ └── scribe_agent.py # Forensic report generation
├── rag/
│ ├── retriever.py # BM25 + dense + RRF fusion
│ └── reranker.py # Cross-encoder reranking
├── embed/
│ ├── store.py # ChromaDB connection + upsert
│ ├── bm25_index.py # BM25 index build/load
│ └── text_embedder.py # nomic-embed-text-v1.5 wrapper
├── pipeline/
│ ├── ingest.py # Full video ingestion pipeline
│ ├── segment.py # FFmpeg chunking
│ ├── transcribe.py # faster-whisper transcription
│ ├── caption.py # BLIP-2 frame captioning
│ ├── fuse.py # Multimodal fusion + emotion extraction
│ └── audio_embeddings.py # pyannote speaker diarization
├── frontend/
│ └── app/
│ ├── page.tsx # Home: upload + existing videos
│ ├── analyze/[videoId] # Timeline viewer + transcript
│ └── query/[videoId] # Forensic Query Terminal
├── api.py # FastAPI routes
├── config.py # All config (paths, keys, params)
├── graph.py # LangGraph workflow compilation
├── main.py # CLI entry point
├── ingest_notegpt.py # NoteGPT transcript ingestion
└── requirements.txt
```
- Python 3.10+
- Node.js 18+
- `GROQ_API_KEY` (free at console.groq.com)
- `HF_TOKEN` (optional, for pyannote speaker diarization)
```bash
git clone https://github.com/tsanhith/videntia.git
cd videntia

python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

pip install -r requirements.txt
```

Create a `.env` file in the project root:

```env
GROQ_API_KEY=your_groq_api_key_here
HF_TOKEN=your_huggingface_token_here   # optional
SUPABASE_URL=                          # optional, not required for local use
SUPABASE_KEY=                          # optional
```

Start the backend:

```bash
# Windows
.\venv\Scripts\python.exe -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

# Linux/Mac
uvicorn api:app --reload --host 0.0.0.0 --port 8000
```

In a second terminal, start the frontend:

```bash
cd frontend
npm install
npm run dev
```

Open http://localhost:3000.
- Go to http://localhost:3000
- Upload a video file (MP4) — it will be ingested automatically
- Click Deploy Agents to open the Forensic Query Terminal
- Type any natural-language question and press Enter
Download a transcript from NoteGPT and run:
```bash
python ingest_notegpt.py \
  --transcript "path/to/NoteGPT_TRANSCRIPT_video_name.txt" \
  --video "path/to/video.mp4"
```

This outputs a `video_id`. Then query:

```bash
python main.py "What is Iran's definition of victory in this war?"
```

Or use the `video_id` in the frontend URL:

```
http://localhost:3000/query/<video_id>
```
```bash
python main.py "Who showed the most concern and what did they say?"
python main.py "What strategies were discussed?" --max-iter 3
```

Example questions:
- "Why did Iran start targeting Arab countries so quickly?"
- "How are Iranian civilians reacting to the strikes?"
- "What role are Kurdish forces playing in Iran?"
- "What comparisons were made between Iran and Gaza?"
| Metric | Value | Notes |
|---|---|---|
| NoteGPT ingestion | ~5 seconds | For a 10-minute transcript on CPU |
| Full video ingestion | ~90 seconds/2hr video | Requires T4 GPU |
| Query latency | 1–4 seconds | Groq 8B, depends on iterations |
| Agent iterations | 1–3 typical | Before confidence >= 75% |
| Retrieval precision | ~85% P@5 | On tested news interview content |
| False positive reduction | ~75% | Via verifier contradiction checks |
- Streaming real-time ingestion (RTMP/HLS).
- Knowledge Graph RAG (entity relationship mapping).
- Intelligent scene boundary detection (replace fixed 10s chunks).
- Multi-video cross-analysis queries.
- Desktop app packaging (Electron + PyInstaller).
MIT License. See LICENSE for details.
Note: pyannote.audio model weights on HuggingFace are subject to their own non-commercial use terms. Review before commercial deployment.