VidExplainAgent

An AI-powered system that makes STEM video content accessible to blind and low vision (BLV) learners through automated visual description generation and interactive question-answering.

  • We annotated the ground truth; the data is available at evaluation/data/ground_truth
  • Please see the paper (with updated graphics): Paper_VidExplainAgent

🎯 Overview

VidExplainAgent uses a multimodal RAG (Retrieval-Augmented Generation) pipeline to:

  1. Extract and describe visual elements from educational videos using Vision-Language Models
  2. Index content in a vector database for semantic search
  3. Answer natural language questions about video content
  4. Generate accessible audio descriptions
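
The retrieve-then-generate loop behind steps 2–3 can be sketched in plain Python. The real pipeline uses ChromaDB's neural embeddings and Gemini for answer generation; the word-count vectors and sample segments below are illustrative stand-ins, not the repository's actual data or API.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy word-count vector; the real pipeline uses ChromaDB's neural embeddings
    return Counter(w.strip(".,:?").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values())) or 1.0
    nb = sqrt(sum(v * v for v in b.values())) or 1.0
    return dot / (na * nb)

# Hypothetical (timestamp, description) pairs standing in for VLM-generated segments
segments = [
    ("0:45", "The animation shows light passing through the two slits forming interference bands"),
    ("2:10", "An equation relates wavelength to momentum for a photon"),
]
index = [(ts, desc, embed(desc)) for ts, desc in segments]

def retrieve(question: str, k: int = 1):
    """Return the k most similar segments; these become the LLM's context."""
    q = embed(question)
    return sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)[:k]

ts, desc, _ = retrieve("What does the animation of the two slits show?")[0]
print(ts, desc)  # the top segment would be passed to Gemini as context
```

In the full system the retrieved descriptions are interpolated into a prompt and sent to Gemini, whose answer is then voiced via TTS.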

📊 Evaluation Results

Comprehensive two-tier evaluation on the Wave-Particle Duality video (Physics, 3:32):

Component-Level (VLM)

  • BERTScore F1: 0.588 - Strong semantic understanding
  • ROUGE-L F1: 0.248 - Moderate phrase similarity
  • BLEU-4: 0.045 - Low n-gram overlap (expected for generative output)
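
For reference, ROUGE-L is the F1 score over the longest common subsequence (LCS) of candidate and reference tokens. A minimal pure-Python version follows; the reported scores were computed with the rouge-score package, and the example sentences here are made up.

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("light behaves as a wave and a particle",
                   "light acts as both a wave and a particle")
print(round(score, 3))  # → 0.824
```

BERTScore replaces exact token matching with contextual-embedding similarity, which is why it rewards paraphrases that ROUGE and BLEU penalize.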

RAG System (End-to-End)

  • Context Relevance: 100% ⭐⭐⭐ Perfect retrieval
  • Answer Faithfulness: 90% - Minimal hallucination
  • Answer Relevancy: 85% - Addresses questions well
  • Answer Correctness: 75% - Good factual accuracy
  • Overall Pass Rate: 87.5% - Excellent performance

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────┐
│                    VidExplainAgent                       │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Frontend (Next.js/React)                                │
│    ├─ Video Upload & YouTube URL Input                   │
│    ├─ Interactive Chat Interface                         │
│    ├─ Voice Input (Web Speech API)                       │
│    └─ Audio Playback (TTS)                               │
│                                                          │
│  Backend (FastAPI)                                       │
│    ├─ Ingestion Pipeline                                 │
│    │   ├─ Gemini 2.5 Flash (Multimodal VLM)              │
│    │   ├─ Video Processing                               │
│    │   └─ ChromaDB Indexing                              │
│    │                                                     │
│    └─ Query Pipeline                                     │
│        ├─ Semantic Search (ChromaDB)                     │
│        ├─ RAG with Gemini                                │
│        └─ TTS Generation                                 │
│                                                          │
│  Evaluation Framework                                    │
│    ├─ Component: BLEU, ROUGE, BERTScore                  │
│    ├─ RAG: RAGAS (Context, Faithfulness, Relevancy)      │
│    └─ Human Evaluation Templates                         │
│                                                          │
└──────────────────────────────────────────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.13+
  • Node.js 18+
  • Google GenAI API Key
  • OpenAI API Key (for evaluation only)

Backend Setup

cd backend

# Install dependencies (using uv)
uv pip install -r requirements.txt

# Set environment variables
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY

# Run server
uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

Frontend Setup

cd frontend

# Install dependencies
npm install

# Run development server
npm run dev

Access the app at http://localhost:3000

Evaluation Framework

cd evaluation

# Install evaluation dependencies
pip install -r requirements.txt

# Set OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# Run complete evaluation
python scripts/run_ragas_simple.py \
  --qa-pairs data/ground_truth/qa_pairs.json \
  --system-responses results/system_responses.json \
  --output results/ragas_scores.json

πŸ“ Project Structure

VidExplainAgent/
├── backend/
│   ├── src/
│   │   ├── main.py                   # FastAPI application
│   │   ├── ingestion_pipeline.py     # Video processing & indexing
│   │   ├── explanation_synthesis.py  # RAG & TTS generation
│   │   └── config.py                 # Configuration
│   ├── static/audio/                 # Generated TTS audio
│   ├── db/                           # ChromaDB vector store
│   └── history/                      # Processing logs
│
├── frontend/
│   ├── app/
│   │   ├── page.tsx                  # Main UI component
│   │   ├── layout.tsx                # App layout
│   │   └── globals.css               # Styles
│   └── public/                       # Static assets
│
├── evaluation/
│   ├── data/
│   │   ├── ground_truth/             # Human annotations
│   │   └── test_video/               # Test video info
│   ├── scripts/
│   │   ├── run_ragas_simple.py       # RAGAS evaluation
│   │   ├── run_component_eval.py     # BLEU/ROUGE/BERTScore
│   │   └── generate_system_outputs.py # Output generation
│   ├── src/
│   │   ├── component_eval.py         # Component metrics
│   │   ├── rag_eval.py               # RAG metrics
│   │   ├── human_eval.py             # Human evaluation
│   │   └── visualization.py          # Result plotting
│   ├── results/                      # Evaluation outputs
│   └── templates/                    # Report templates
│
├── pyproject.toml                    # Python dependencies
└── docker-compose.yml                # Docker configuration

🔬 Key Features

1. Multimodal Video Processing

  • Gemini 2.5 Flash for visual understanding
  • Temporal segmentation of video content
  • Rich metadata extraction (concepts, difficulty, speakers)
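
Temporal segmentation can be illustrated with a simple fixed-window grouping. In the actual pipeline the VLM chooses semantically meaningful boundaries, so this sketch (with hypothetical timestamped events) only shows the bucketing idea:

```python
def segment(events, window=30.0):
    """Group (time_sec, text) events into fixed windows; the real pipeline
    lets the VLM pick semantically meaningful boundaries instead."""
    buckets = {}
    for t, text in events:
        buckets.setdefault(int(t // window), []).append(text)
    return [
        {"start": k * window, "end": (k + 1) * window, "text": " ".join(texts)}
        for k, texts in sorted(buckets.items())
    ]

# Hypothetical events for a short physics video
events = [(5.0, "Title slide appears."),
          (12.0, "Narrator introduces photons."),
          (41.0, "Double-slit diagram is drawn.")]
chunks = segment(events)
print(len(chunks))  # two 30-second windows
```

Each resulting chunk (plus its metadata) becomes one retrievable document in the vector store.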

2. Intelligent RAG Pipeline

  • ChromaDB vector database with semantic search
  • Context-aware answer generation
  • Exponential backoff for API reliability
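
The exponential-backoff retry can be sketched as a small wrapper; the function and parameter names here are illustrative, not the repository's actual API:

```python
import random
import time

def with_backoff(fn, max_tries=5, base=1.0, sleep=time.sleep):
    """Retry fn on exception, doubling the delay each attempt (plus jitter),
    to smooth over transient API errors such as rate limits."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the error
            sleep(base * 2 ** attempt + random.uniform(0, 0.1))

# Demo: fail twice, then succeed; a no-op sleep keeps the demo instant
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 from upstream")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # prints "ok"
```

Wrapping each Gemini call this way turns sporadic 429/503 responses into short delays instead of failed requests.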

3. Accessibility-First Design

  • Voice input for hands-free interaction
  • TTS audio responses
  • Screen reader compatible interface
  • Progressive disclosure of complex information

4. Comprehensive Evaluation

  • Component-level: BLEU, ROUGE, BERTScore
  • End-to-end: RAGAS framework (Context Relevance, Faithfulness, Relevancy, Correctness)
  • Human evaluation: Likert-scale templates prepared

📊 Evaluation Details

Full evaluation report: evaluation/results/COMPLETE_EVALUATION_REPORT.md

Test Video: Wave-Particle Duality (Perimeter Institute for Theoretical Physics)

  • Duration: 3:32
  • Subject: Quantum Physics
  • Complexity: Moderate (animations, equations, concepts)

Datasets:

  • 23 human-annotated visual descriptions
  • 20 Q&A pairs with ground truth
  • Expert-verified annotations

Key Finding: Perfect retrieval (100%) + high faithfulness (90%) + strong semantic understanding (0.588) = reliable, trustworthy system for accessible STEM education.

🎓 Academic References

  • BLEU: Papineni et al. (2002) - N-gram overlap metric
  • ROUGE: Lin (2004) - Sequence similarity
  • BERTScore: Zhang et al. (2019) - Semantic similarity with BERT
  • RAGAS: Es et al. (2023) - RAG-specific evaluation framework

πŸ› οΈ Technologies

Backend:

  • FastAPI (Python web framework)
  • Google Gemini 2.5 Flash (Vision-Language Model)
  • ChromaDB (Vector database)
  • Uvicorn (ASGI server)

Frontend:

  • Next.js 14 (React framework)
  • TypeScript
  • Tailwind CSS
  • Web Speech API (voice input)

Evaluation:

  • RAGAS (RAG evaluation)
  • NLTK, rouge-score, bert-score (text metrics)
  • Matplotlib, Seaborn (visualization)

πŸ“ License

All rights reserved.

👥 Contributors

Syed Ali Haider

📧 Contact

For questions or collaboration: syed.ali.haider.gr@dartmouth.edu


Date: November 2025
