An AI-powered system that makes STEM video content accessible to blind and low vision (BLV) learners through automated visual description generation and interactive question-answering.
- We annotated the ground truth; the data is available at evaluation/data/ground_truth.
- Please see the paper with updated graphics: Paper_VidExplainAgent.
VidExplainAgent uses a multimodal RAG (Retrieval-Augmented Generation) pipeline to:
- Extract and describe visual elements from educational videos using Vision-Language Models
- Index content in a vector database for semantic search
- Answer natural language questions about video content
- Generate accessible audio descriptions
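The ingest-then-query flow above can be sketched end to end. The following is a self-contained toy sketch, not the production pipeline: `Segment`, `ingest`, `retrieve`, and `answer` are hypothetical names, word-overlap scoring stands in for ChromaDB's embedding search, and a string template stands in for the Gemini call.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    description: str

def ingest(segments):
    # In the real pipeline, Gemini 2.5 Flash generates these descriptions
    # and ChromaDB stores their embeddings; here we keep a plain list.
    return list(segments)

def retrieve(index, query, k=2):
    # Stand-in for semantic search: rank segments by shared-word overlap.
    def score(seg):
        return len(set(query.lower().split()) & set(seg.description.lower().split()))
    return sorted(index, key=score, reverse=True)[:k]

def answer(query, context):
    # Stand-in for the Gemini RAG call: ground the reply in retrieved context.
    return f"Based on the video: {context[0].description}"

index = ingest([
    Segment(0.0, 12.5, "An animation shows light behaving as a wave"),
    Segment(12.5, 30.0, "A diagram of the double-slit experiment appears"),
])
ctx = retrieve(index, "What does the double-slit diagram show?")
print(answer("What does the double-slit diagram show?", ctx))
```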
Comprehensive two-tier evaluation on the Wave-Particle Duality video (Physics, 3:32):
- BERTScore F1: 0.588 - Strong semantic understanding
- ROUGE-L F1: 0.248 - Moderate phrase similarity
- BLEU-4: 0.045 - Low lexical overlap (expected for generative models)
- Context Relevance: 100% - Perfect retrieval
- Answer Faithfulness: 90% - Minimal hallucination
- Answer Relevancy: 85% - Addresses questions well
- Answer Correctness: 75% - Good factual accuracy
- Overall Pass Rate: 87.5% - Excellent performance
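To make the ROUGE-L number above concrete: the repository computes it with the rouge-score package, but the metric itself is just an F1 over the longest common subsequence of tokens. A minimal self-contained sketch (illustrative only; `lcs_len` and `rouge_l_f1` are hypothetical helpers, and the whitespace tokenization is a simplification):

```python
def lcs_len(a, b):
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "light behaves as both a wave and a particle"
hyp = "light acts as a wave and a particle"
print(round(rouge_l_f1(ref, hyp), 3))
```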
┌───────────────────────────────────────────────────────────┐
│                      VidExplainAgent                      │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  Frontend (Next.js/React)                                 │
│  ├─ Video Upload & YouTube URL Input                      │
│  ├─ Interactive Chat Interface                            │
│  ├─ Voice Input (Web Speech API)                          │
│  └─ Audio Playback (TTS)                                  │
│                                                           │
│  Backend (FastAPI)                                        │
│  ├─ Ingestion Pipeline                                    │
│  │  ├─ Gemini 2.5 Flash (Multimodal VLM)                  │
│  │  ├─ Video Processing                                   │
│  │  └─ ChromaDB Indexing                                  │
│  │                                                        │
│  └─ Query Pipeline                                        │
│     ├─ Semantic Search (ChromaDB)                         │
│     ├─ RAG with Gemini                                    │
│     └─ TTS Generation                                     │
│                                                           │
│  Evaluation Framework                                     │
│  ├─ Component: BLEU, ROUGE, BERTScore                     │
│  ├─ RAG: RAGAS (Context, Faithfulness, Relevancy)         │
│  └─ Human Evaluation Templates                            │
│                                                           │
└───────────────────────────────────────────────────────────┘
- Python 3.13+
- Node.js 18+
- Google GenAI API Key
- OpenAI API Key (for evaluation only)
cd backend
# Install dependencies (using uv)
uv pip install -r requirements.txt
# Set environment variables
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY
# Run server
uvicorn src.main:app --host 0.0.0.0 --port 8000 --reload

cd frontend
# Install dependencies
npm install
# Run development server
npm run dev

Access the app at http://localhost:3000
cd evaluation
# Install evaluation dependencies
pip install -r requirements.txt
# Set OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# Run complete evaluation
python scripts/run_ragas_simple.py \
--qa-pairs data/ground_truth/qa_pairs.json \
--system-responses results/system_responses.json \
--output results/ragas_scores.json

VidExplainAgent/
├── backend/
│   ├── src/
│   │   ├── main.py                     # FastAPI application
│   │   ├── ingestion_pipeline.py       # Video processing & indexing
│   │   ├── explanation_synthesis.py    # RAG & TTS generation
│   │   └── config.py                   # Configuration
│   ├── static/audio/                   # Generated TTS audio
│   ├── db/                             # ChromaDB vector store
│   └── history/                        # Processing logs
│
├── frontend/
│   ├── app/
│   │   ├── page.tsx                    # Main UI component
│   │   ├── layout.tsx                  # App layout
│   │   └── globals.css                 # Styles
│   └── public/                         # Static assets
│
├── evaluation/
│   ├── data/
│   │   ├── ground_truth/               # Human annotations
│   │   └── test_video/                 # Test video info
│   ├── scripts/
│   │   ├── run_ragas_simple.py         # RAGAS evaluation
│   │   ├── run_component_eval.py       # BLEU/ROUGE/BERTScore
│   │   └── generate_system_outputs.py  # Output generation
│   ├── src/
│   │   ├── component_eval.py           # Component metrics
│   │   ├── rag_eval.py                 # RAG metrics
│   │   ├── human_eval.py               # Human evaluation
│   │   └── visualization.py            # Result plotting
│   ├── results/                        # Evaluation outputs
│   └── templates/                      # Report templates
│
├── pyproject.toml                      # Python dependencies
└── docker-compose.yml                  # Docker configuration
- Gemini 2.5 Flash for visual understanding
- Temporal segmentation of video content
- Rich metadata extraction (concepts, difficulty, speakers)
- ChromaDB vector database with semantic search
- Context-aware answer generation
- Exponential backoff for API reliability
- Voice input for hands-free interaction
- TTS audio responses
- Screen reader compatible interface
- Progressive disclosure of complex information
- Component-level: BLEU, ROUGE, BERTScore
- End-to-end: RAGAS framework (Context Relevance, Faithfulness, Relevancy, Correctness)
- Human evaluation: Likert scale templates ready
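The exponential backoff used for API reliability (listed above) can be sketched as a generic retry wrapper. This is an illustrative sketch, not the repository's actual code; `with_backoff` and its parameters are hypothetical names.

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry fn on exception, doubling the wait each attempt (with jitter)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, 0.1 * delay))

# Example: a flaky call that fails twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))
```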
Full evaluation report: evaluation/results/COMPLETE_EVALUATION_REPORT.md
Test Video: Wave-Particle Duality (Perimeter Institute for Theoretical Physics)
- Duration: 3:32
- Subject: Quantum Physics
- Complexity: Moderate (animations, equations, concepts)
Datasets:
- 23 human-annotated visual descriptions
- 20 Q&A pairs with ground truth
- Expert-verified annotations
Key Finding: Perfect retrieval (100%), high faithfulness (90%), and strong semantic understanding (BERTScore F1 0.588) add up to a reliable, trustworthy system for accessible STEM education.
- BLEU: Papineni et al. (2002) - N-gram overlap metric
- ROUGE: Lin (2004) - Sequence similarity
- BERTScore: Zhang et al. (2019) - Semantic similarity with BERT
- RAGAS: Es et al. (2023) - RAG-specific evaluation framework
Backend:
- FastAPI (Python web framework)
- Google Gemini 2.5 Flash (Vision-Language Model)
- ChromaDB (Vector database)
- Uvicorn (ASGI server)
Frontend:
- Next.js 14 (React framework)
- TypeScript
- Tailwind CSS
- Web Speech API (voice input)
Evaluation:
- RAGAS (RAG evaluation)
- NLTK, rouge-score, bert-score (text metrics)
- Matplotlib, Seaborn (visualization)
All rights reserved.
Syed Ali Haider
For questions or collaboration: syed.ali.haider.gr@dartmouth.edu
Date: November 2025