Videntia — Multimodal Forensic Video Intelligence.

"When one sense isn't enough — fuse them all."



Table of Contents

  1. Overview
  2. The Problem & Our Solution
  3. System Architecture
  4. Data Processing Pipeline
  5. Agentic RAG Engine
  6. Technical Stack
  7. Directory Structure
  8. Setup & Installation
  9. Usage Guide
  10. Performance Metrics
  11. Roadmap
  12. License

Overview

Videntia is an Agentic AI platform that transforms unstructured video footage and transcripts into searchable, evidence-grade forensic reports.

Instead of simple keyword search, Videntia fuses transcript text, detected emotions, and speaker context through a multi-agent reasoning loop. It supports two ingestion paths:

  • Full video ingestion — FFmpeg + Whisper + BLIP-2 + pyannote (GPU recommended)
  • NoteGPT transcript ingestion — ingest timestamped transcripts directly without a GPU, using overlapping 30-second sliding-window segments for richer retrieval

The backend runs on FastAPI + ChromaDB + BM25, the LLM is Llama 3.1 8B Instant served via Groq, and the frontend is a Next.js 14 dark-theme forensic dashboard.


The Problem & Our Solution

The Challenge of "Dark Data".

Challenge Impact
Dark Data Accumulation Hours of footage (meetings, depositions, news interviews) remain unsearchable due to volume.
Context Blindness Keyword search misses emotional tone, speaker identity, and contextual cues entirely.
Scalability Manual review of hours of footage to find a 10-second clip is economically unviable.

The Agentic Multi-Agent Solution

Videntia uses a 4-agent LangGraph supervisor where specialized agents act as an investigative team — decomposing queries, retrieving hybrid evidence, verifying confidence, and writing structured reports. Agents loop iteratively until a confidence threshold is met or max iterations are reached.
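
A minimal sketch of how such a supervisor loop can be wired in LangGraph, using the standard StateGraph API; the node bodies and state fields here are illustrative stand-ins, not the project's actual agents/state.py definitions:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    query: str
    evidence: list
    confidence: float
    iterations: int

# Each node returns a partial state update; real logic is stubbed out here.
def detective(state: AgentState) -> dict:
    return {"iterations": state["iterations"] + 1}  # decompose query into sub-tasks

def retriever(state: AgentState) -> dict:
    return {}                                       # hybrid retrieval per sub-task

def verifier(state: AgentState) -> dict:
    return {"confidence": 0.8}                      # LLM-adjusted confidence score

def scribe(state: AgentState) -> dict:
    return {}                                       # write the Markdown report

def route(state: AgentState) -> str:
    # Loop back to the Detective until confident enough or out of iterations.
    done = state["confidence"] >= 0.75 or state["iterations"] >= 5
    return "scribe" if done else "detective"

graph = StateGraph(AgentState)
for name, fn in [("detective", detective), ("retriever", retriever),
                 ("verifier", verifier), ("scribe", scribe)]:
    graph.add_node(name, fn)
graph.set_entry_point("detective")
graph.add_edge("detective", "retriever")
graph.add_edge("retriever", "verifier")
graph.add_conditional_edges("verifier", route, {"detective": "detective", "scribe": "scribe"})
graph.add_edge("scribe", END)
app = graph.compile()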


System Architecture

[Next.js Frontend] <--REST/JSON--> [FastAPI Backend]
                                        |
                    +---------+---------+---------+
                    |         |                   |
              [ChromaDB]  [BM25 Index]    [LangGraph Agents]
              dense vecs  sparse index    DET -> RET -> VER -> SCR
                    |         |
              [nomic-embed]  [BM25Okapi]
                    |
             [Records JSON]
             data/records/

Key design decisions vs. original plan:

  • Supabase PostgreSQL replaced by local JSON records (data/records/) for zero-cost local operation
  • Groq model used is llama-3.1-8b-instant (500K tokens/day free tier) instead of 70B
  • Windows-compatible: emoji and non-ASCII symbols stripped from agent console output; stdout reconfigured to UTF-8 (one-liner below)
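
The UTF-8 fix in the last bullet is standard-library Python (3.7+); exactly where Videntia applies it is an assumption:

import sys

# Force UTF-8 output on Windows consoles so agent logs avoid charmap errors.
if sys.stdout.encoding and sys.stdout.encoding.lower() != "utf-8":
    sys.stdout.reconfigure(encoding="utf-8")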

Data Processing Pipeline

Path A — Full Video Ingestion (GPU)

  1. Chunking — FFmpeg slices the video into 10-second segments (command sketch after this list)
  2. Transcription — faster-whisper (base model) produces speech-to-text
  3. Speaker Diarization — pyannote.audio identifies and labels speakers
  4. Visual Captioning — BLIP-2 captions keyframes per segment
  5. Emotion Fusion — fuse.py extracts emotion signals from transcript + captions
  6. Indexing — nomic-embed-text-v1.5 embeds the combined text into ChromaDB; the BM25 index is rebuilt
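
A minimal sketch of the chunking step, assuming FFmpeg's standard segment muxer; the function name and output layout are illustrative, not pipeline/segment.py's actual interface:

import subprocess
from pathlib import Path

def chunk_video(video_path: str, out_dir: str = "data/segments", seconds: int = 10) -> list:
    """Slice a video into fixed-length chunks with FFmpeg's segment muxer."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-f", "segment",                # fixed-window segment muxer
        "-segment_time", str(seconds),  # 10-second chunks
        "-reset_timestamps", "1",       # each chunk's timestamps start at zero
        "-c", "copy",                   # stream copy: no re-encode
        str(out / "chunk_%04d.mp4"),
    ], check=True)
    return sorted(out.glob("chunk_*.mp4"))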

Path B — NoteGPT Transcript Ingestion (CPU, no GPU needed)

  1. Parse — ingest_notegpt.py parses NoteGPT HH:MM:SS timestamp blocks
  2. Overlapping Windows — 30-second sliding windows (15s stride) built for richer retrieval; sketched after the commands below
  3. Emotion Signals — fuse.py extracts emotions from transcript text only
  4. Indexing — same ChromaDB + BM25 pipeline as full ingestion
# Ingest a NoteGPT transcript
python ingest_notegpt.py --transcript "path/to/transcript.txt" --video "path/to/video.mp4"

# Ingest a full video (GPU required)
python -c "from pipeline.ingest import ingest_video; ingest_video('data/videos/video.mp4')"

Detailed Architecture Flowchart

flowchart TD

    A[User provides input] --> B{Input type?}
    B -->|Full video file| C[Full Video Ingestion Pipeline]
    B -->|NoteGPT timestamp transcript + optional video ref| D[Transcript Ingestion Pipeline]

    subgraph FULL_INGEST[Path A -- Full Video Ingestion]
        C --> C1[Split video into 10-second chunks with FFmpeg]
        C1 --> C2[Transcribe full video once with Whisper]
        C2 --> C3[Map transcript text back to each 10s segment by timestamp overlap]
        C3 --> C4{Captioning enabled?}
        C4 -->|Yes| C5[Extract frames from each segment]
        C5 --> C6[Generate visual captions with BLIP-2]
        C4 -->|No| C7[Skip visual captioning]
        C6 --> C8[Extract emotion signals from transcript + visual captions]
        C7 --> C8
        C8 --> C9{Diarization enabled?}
        C9 -->|Yes| C10[Run speaker diarization and assign speaker labels]
        C9 -->|No| C11[Skip speaker detection]
        C10 --> C12[Build SegmentRecord]
        C11 --> C12
    end

    subgraph NOTEGPT_INGEST[Path B -- NoteGPT Transcript Ingestion]
        D --> D1[Parse HH:MM:SS transcript blocks]
        D1 --> D2[Infer end times for each block]
        D2 --> D3[Create overlapping windows: 30s window / 15s stride]
        D3 --> D4[Join all text overlapping each window]
        D4 --> D5[Extract emotion signals from transcript text]
        D5 --> D6[Build SegmentRecord]
    end

    subgraph RECORD_LAYER[Record Formation]
        C12 --> E1[Create unique segment_id and video_id]
        D6 --> E1
        E1 --> E2[Store timing info: start_sec / end_sec]
        E2 --> E3[Store transcript]
        E3 --> E4[Store visual captions if available]
        E4 --> E5[Store speaker if available]
        E5 --> E6[Store emotion metadata: detected emotions, scores, intensity]
        E6 --> E7[Create combined_text]
        E7 --> E8[combined_text = transcript + visual cues + emotion tags or time labels]
        E8 --> E9[Save each SegmentRecord as JSON in data/records]
    end

    subgraph INDEXING[Index Building]
        E9 --> F1[Load all saved records]
        F1 --> F2[Build BM25 sparse index from combined_text]
        F1 --> F3[Build dense text embeddings from combined_text]
        F1 --> F4{Visual captions exist?}
        F4 -->|Yes| F5[Build dense vision embeddings from joined captions]
        F4 -->|No| F6[Skip vision embedding for that segment]
        F2 --> F7[Persist BM25 index to db/bm25.pkl]
        F3 --> F8[Store text embeddings in ChromaDB text_segments]
        F5 --> F9[Store vision embeddings in ChromaDB vision_segments]
    end

    G[User asks a question] --> H[main.py initializes AgentState]
    H --> I[LangGraph workflow starts]

    subgraph AGENT_FLOW[Multi-Agent Reasoning Loop]
        I --> J[Lead Detective Agent]
        J --> J1[Read query + current evidence + confidence + contradictions]
        J1 --> J2[Break question into 2 to 5 searchable sub-tasks]
        J2 --> K[Retriever Agent]

        K --> K1[For each sub-task run hybrid_retrieve]
        K1 --> K2[Aggregate evidence across sub-tasks]
        K2 --> K3[Early deduplication by segment_id]
        K3 --> L[Verifier Agent]

        L --> L1[Deduplicate evidence again]
        L1 --> L2[Compute average rerank score]
        L2 --> L3[LLM checks evidence quality and contradictions]
        L3 --> L4[Compute adjusted confidence score]
        L4 --> M{Confidence >= threshold or max loops reached?}

        M -->|No| J
        M -->|Yes| N[Scribe Agent]
        N --> N1[Sort verified evidence by rerank_score]
        N1 --> N2[Take top evidence segments]
        N2 --> N3[Write forensic Markdown report]
        N3 --> N4[Append metadata: confidence / evidence count / iterations]
        N4 --> O[Final answer + report returned]
    end

    subgraph HYBRID[Detailed Hybrid Retrieval for each sub-task]
        K1 --> R1[Detect query type: emotion / temporal / speaker]
        R1 --> R2[Expand query synonyms for BM25 if useful]
        R2 --> R3[BM25 sparse search on combined_text]
        R2 --> R4[Dense semantic search on ChromaDB text collection]
        R2 --> R5[Dense semantic search on ChromaDB vision collection]
        R3 --> R6[RRF fusion of ranked lists]
        R4 --> R6
        R5 --> R6
        R6 --> R7[Build top candidate pool]
        R7 --> R8[Cross-encoder reranking]
        R8 --> R9{Emotion query?}
        R9 -->|Yes| R10[Boost segments with stronger emotion metadata]
        R9 -->|No| R11[No emotion boost]
        R10 --> R12{Temporal query?}
        R11 --> R12
        R12 -->|Yes| R13[Add adjacent +/-1 segments for context]
        R12 -->|No| R14[Keep top reranked results]
        R13 --> R15[Return final evidence segments]
        R14 --> R15
    end

    subgraph STORAGE[Stored Data]
        S1[data/records JSON records]
        S2[db/bm25.pkl]
        S3[ChromaDB text_segments]
        S4[ChromaDB vision_segments]
        S5[reports/ saved Markdown reports]
    end

    E9 --> S1
    F7 --> S2
    F8 --> S3
    F9 --> S4
    N4 --> S5

    S2 --> R3
    S3 --> R4
    S4 --> R5
    S1 -. fallback if dense data missing .-> R7
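
The record layer above emits one JSON file per segment. A plausible shape, following the fields named in the flowchart (the exact schema in data/records/ is an assumption):

# Illustrative SegmentRecord; the real schema in data/records/ may differ.
segment_record = {
    "segment_id": "vid_ab12_seg_0004",
    "video_id": "vid_ab12",
    "start_sec": 60.0,
    "end_sec": 90.0,
    "transcript": "We never expected the strikes to reach this far...",
    "visual_captions": ["a man speaking at a desk in a studio"],  # empty on Path B
    "speaker": "SPEAKER_01",                                      # None if diarization off
    "emotions": {"fear": 0.62, "anger": 0.21},
    "emotion_intensity": "high",
    "combined_text": "[00:01:00-00:01:30] We never expected the strikes... "
                     "VISUAL: a man speaking at a desk EMOTION: fear, anger",
}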

Agentic RAG Engine

4-Agent Workflow

User Query
    |
[Lead Detective] -- decomposes query into 2-5 sub-tasks
    |
[Retriever Agent] -- hybrid search for each sub-task:
    |   1. BM25 sparse (top 50)
    |   2. ChromaDB dense (top 50)
    |   3. RRF fusion (top 20)
    |   4. Cross-encoder rerank (top 8, bge-reranker-v2-m3)
    |
[Verifier Agent] -- scores confidence, detects contradictions
    |
    +-- confidence < 75% AND iterations < 5 --> back to Detective
    |
[Scribe Agent] -- writes structured markdown forensic report
    |
Final Report + Evidence Segments returned to API
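
A condensed sketch of steps 3-4 of the Retriever, assuming sentence-transformers for the cross-encoder; function names are illustrative, and the real logic lives in rag/retriever.py and rag/reranker.py:

from sentence_transformers import CrossEncoder

def rrf_fuse(ranked_lists: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] = scores.get(seg_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query: str, pool: list, texts: dict, top_k: int = 8) -> list:
    """Cross-encoder rerank of the fused candidate pool down to the final evidence."""
    model = CrossEncoder("BAAI/bge-reranker-v2-m3")
    scores = model.predict([(query, texts[seg_id]) for seg_id in pool])
    ranked = sorted(zip(pool, scores), key=lambda p: p[1], reverse=True)
    return [seg_id for seg_id, _ in ranked[:top_k]]

# bm25_ids and dense_ids are the two ranked top-50 lists from steps 1-2:
# evidence = rerank(query, rrf_fuse([bm25_ids, dense_ids])[:20], texts)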

Loop Control (config.py)

| Parameter | Value | Description |
| --- | --- | --- |
| MAX_ITERATIONS | 5 | Max detective-retriever-verifier loops |
| MIN_CONFIDENCE | 0.75 | Confidence threshold to exit the loop |
| RERANK_TOP_K | 8 | Final evidence segments passed to the Scribe |
| BM25_TOP_K | 50 | Sparse candidates per sub-task |
| DENSE_TOP_K | 50 | Dense candidates per sub-task |
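
As plain constants, the corresponding config.py block would look roughly like this (a sketch; the actual file also holds paths and API keys):

# Loop-control knobs (sketch of the matching config.py section).
MAX_ITERATIONS = 5      # max detective -> retriever -> verifier loops
MIN_CONFIDENCE = 0.75   # exit the loop once confidence reaches 75%
RERANK_TOP_K = 8        # final evidence segments handed to the Scribe
BM25_TOP_K = 50         # sparse candidates per sub-task
DENSE_TOP_K = 50        # dense candidates per sub-task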

Technical Stack

| Component | Technology | Notes |
| --- | --- | --- |
| LLM | Groq Llama 3.1 8B Instant | 500K tokens/day free; llama-3.1-8b-instant |
| Orchestration | LangGraph | 4-node state graph with conditional loops |
| Embeddings | nomic-embed-text-v1.5 | 768D, trust_remote_code=True |
| Reranker | BAAI/bge-reranker-v2-m3 | Cross-encoder relevance scoring |
| Vector DB | ChromaDB (local) | Persistent, no cloud required |
| Sparse Search | BM25Okapi + NLTK | Keyword token matching |
| Transcription | faster-whisper (base) | GPU-accelerated, optional |
| Speaker ID | pyannote.audio 3.1 | Requires HF_TOKEN, optional |
| Vision | BLIP-2 (Salesforce) | Frame captioning, optional |
| Backend | FastAPI + uvicorn | Async REST, port 8000 |
| Frontend | Next.js 14 | Dark-theme forensic UI, port 3000 |
| Storage | Local JSON records | data/records/, no DB setup needed |
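
A minimal sketch of how these pieces meet at index time, using the public APIs of sentence-transformers, chromadb, and rank_bm25; collection and file names follow the flowchart, everything else is illustrative rather than embed/store.py's actual code:

import pickle
import chromadb
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

records = [  # normally loaded from data/records/; one inline record for the sketch
    {"segment_id": "vid_ab12_seg_0004",
     "combined_text": "[00:01:00-00:01:30] We never expected... EMOTION: fear"},
]

# Dense index: 768-dim nomic embeddings in a persistent local ChromaDB.
embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                               trust_remote_code=True)
client = chromadb.PersistentClient(path="db")
col = client.get_or_create_collection("text_segments")
col.add(
    ids=[r["segment_id"] for r in records],
    embeddings=embedder.encode([r["combined_text"] for r in records]).tolist(),
    documents=[r["combined_text"] for r in records],
)

# Sparse index: BM25 over tokenized combined_text, pickled to db/bm25.pkl.
bm25 = BM25Okapi([r["combined_text"].lower().split() for r in records])
with open("db/bm25.pkl", "wb") as f:
    pickle.dump(bm25, f)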

Directory Structure

videntia/
├── agents/
│   ├── state.py              # AgentState TypedDict
│   ├── lead_detective.py     # Query decomposition + loop control
│   ├── retriever_agent.py    # Hybrid RAG retrieval per sub-task
│   ├── verifier_agent.py     # Confidence scoring + contradiction detection
│   └── scribe_agent.py       # Forensic report generation
├── rag/
│   ├── retriever.py          # BM25 + dense + RRF fusion
│   └── reranker.py           # Cross-encoder reranking
├── embed/
│   ├── store.py              # ChromaDB connection + upsert
│   ├── bm25_index.py         # BM25 index build/load
│   └── text_embedder.py      # nomic-embed-text-v1.5 wrapper
├── pipeline/
│   ├── ingest.py             # Full video ingestion pipeline
│   ├── segment.py            # FFmpeg chunking
│   ├── transcribe.py         # faster-whisper transcription
│   ├── caption.py            # BLIP-2 frame captioning
│   ├── fuse.py               # Multimodal fusion + emotion extraction
│   └── audio_embeddings.py   # pyannote speaker diarization
├── frontend/
│   └── app/
│       ├── page.tsx          # Home: upload + existing videos
│       ├── analyze/[videoId] # Timeline viewer + transcript
│       └── query/[videoId]   # Forensic Query Terminal
├── api.py                    # FastAPI routes
├── config.py                 # All config (paths, keys, params)
├── graph.py                  # LangGraph workflow compilation
├── main.py                   # CLI entry point
├── ingest_notegpt.py         # NoteGPT transcript ingestion
└── requirements.txt

Setup & Installation

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • GROQ_API_KEY (free at console.groq.com)
  • HF_TOKEN (optional, for pyannote speaker diarization)

1. Clone & Install

git clone https://github.com/tsanhith/videntia.git
cd videntia
python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root:

GROQ_API_KEY=your_groq_api_key_here
HF_TOKEN=your_huggingface_token_here   # optional
SUPABASE_URL=                           # optional, not required for local use
SUPABASE_KEY=                           # optional

3. Start the Backend

# Windows
.\venv\Scripts\python.exe -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

# Linux/Mac
uvicorn api:app --reload --host 0.0.0.0 --port 8000

4. Start the Frontend

cd frontend
npm install
npm run dev

Open http://localhost:3000


Usage Guide

Option A — Upload via Web UI

  1. Go to http://localhost:3000
  2. Upload a video file (MP4) — it will be ingested automatically
  3. Click Deploy Agents to open the Forensic Query Terminal
  4. Type any natural language question and press Enter

Option B — Ingest NoteGPT Transcript (no GPU)

Download a transcript from NoteGPT and run:

python ingest_notegpt.py \
  --transcript "path/to/NoteGPT_TRANSCRIPT_video_name.txt" \
  --video "path/to/video.mp4"

This outputs a video_id. Then query:

python main.py "What is Iran's definition of victory in this war?"

Or use the video_id in the frontend URL:

http://localhost:3000/query/<video_id>

Option C — CLI Query

python main.py "Who showed the most concern and what did they say?"
python main.py "What strategies were discussed?" --max-iter 3

Example Questions (Iran podcast transcript)

  • "What is Iran's definition of victory in this war?"
  • "Why did Iran start targeting Arab countries so quickly?"
  • "How are Iranian civilians reacting to the strikes?"
  • "What role are Kurdish forces playing in Iran?"
  • "What comparisons were made between Iran and Gaza?"

Performance Metrics

| Metric | Value | Notes |
| --- | --- | --- |
| NoteGPT ingestion | ~5 seconds | 10-minute transcript on CPU |
| Full video ingestion | ~90 seconds per 2-hour video | Requires a T4 GPU |
| Query latency | 1–4 seconds | Groq 8B; depends on iteration count |
| Agent iterations | 1–3 typical | Before confidence >= 75% |
| Retrieval precision | ~85% P@5 | On tested news-interview content |
| False-positive reduction | ~75% | Via Verifier contradiction checks |

Roadmap

  • Streaming real-time ingestion (RTMP/HLS).
  • Knowledge Graph RAG (entity relationship mapping).
  • Intelligent scene boundary detection (replace fixed 10s chunks).
  • Multi-video cross-analysis queries.
  • Desktop app packaging (Electron + PyInstaller).

License

MIT License. See LICENSE for details.

Note: pyannote.audio model weights on HuggingFace are subject to their own non-commercial use terms. Review before commercial deployment.
