AI-powered video understanding that turns long-form content into searchable knowledge with transcription, summarization, and conversational RAG.
OmniVision transforms videos into structured, searchable intelligence.
Upload a local video or paste a YouTube link, choose the transcription mode, and OmniVision will:
- transcribe the video
- generate a title and summary
- extract action items, key decisions, and open questions
- index the transcript in ChromaDB
- let you ask follow-up questions through a RAG assistant
This project combines a Flask backend, a React + Tailwind frontend, Whisper-based transcription, Sarvam Hindi-to-English transcription support, and LangChain-powered summarization and retrieval.
flowchart LR
A[Upload Video or Paste YouTube URL] --> B[Audio Extraction]
B --> C[Transcription]
C --> D[Summary + Insights]
D --> E[ChromaDB Indexing]
E --> F[RAG Chat Experience]
- Video upload or YouTube link analysis
- English transcription with Whisper small
- Hindi-to-English transcription with Sarvam saaras:v2.5
- AI-generated title and meeting-style summary
- Extracted action items, key decisions, and open questions
- ChromaDB-based retrieval pipeline for follow-up questions
- Modern light-themed UI with animated processing states
- Custom React + Tailwind interface
- Creative glassmorphism-inspired layout
- Interactive processing section
- Transcript, summary, and insight cards
- Conversational RAG assistant
- Markdown-rendered output formatting
| Layer | Tools |
|---|---|
| Frontend | React, Vite, Tailwind CSS, Lucide Icons |
| Backend | Flask |
| Transcription | OpenAI Whisper, Sarvam AI |
| LLM / Orchestration | LangChain, Mistral |
| Vector Store | ChromaDB |
| Media Processing | yt-dlp, pydub, ffmpeg |
AI Video assistant/
├─ app.py
├─ main.py
├─ run_backend.py
├─ requirements.txt
├─ .env.example
├─ README.md
├─ core/
│ ├─ extractor.py
│ ├─ pipeline.py
│ ├─ rag_engine.py
│ ├─ summarizer.py
│ ├─ transcriber.py
│ └─ vector_store.py
├─ utils/
│ └─ audio_processor.py
└─ frontend/
  ├─ index.html
  ├─ package.json
  ├─ package-lock.json
  ├─ vite.config.js
  ├─ tailwind.config.js
  ├─ postcss.config.js
  └─ src/
    ├─ App.jsx
    ├─ main.jsx
    └─ styles.css
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
Windows (PowerShell):
py -3.12 -m venv .venv
.venv\Scripts\Activate.ps1
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
Install the Python dependencies:
pip install -r requirements.txt
Copy .env.example to .env and fill in your keys:
cp .env.example .env
Required values:
- MISTRAL_API_KEY
- SARVAM_API_KEY
Optional:
- WHISPER_MODEL
- SARVAM_STT_MODEL
cd frontend
npm install
cd ..
python run_backend.py
Backend runs at:
http://127.0.0.1:5000
In a second terminal:
cd frontend
npm run dev
Frontend runs at:
http://127.0.0.1:5173
MISTRAL_API_KEY="your_mistral_api_key_here"
WHISPER_MODEL="small"
SARVAM_API_KEY="your_sarvam_api_key_here"
SARVAM_STT_MODEL="saaras:v2.5"
Never commit your real .env file to GitHub.
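The variables above could be read and validated at startup along the following lines. This is a minimal sketch, not the project's actual loader: the `Settings` class and `load_settings` function are hypothetical names, and the fallback values for the two optional variables are an assumption based on the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_KEYS = ("MISTRAL_API_KEY", "SARVAM_API_KEY")


@dataclass(frozen=True)
class Settings:
    mistral_api_key: str
    sarvam_api_key: str
    whisper_model: str
    sarvam_stt_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on missing keys."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(
        mistral_api_key=env["MISTRAL_API_KEY"],
        sarvam_api_key=env["SARVAM_API_KEY"],
        # Assumed defaults, mirroring the example .env values above.
        whisper_model=env.get("WHISPER_MODEL") or "small",
        sarvam_stt_model=env.get("SARVAM_STT_MODEL") or "saaras:v2.5",
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.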
Pipeline Breakdown
- If the source is a YouTube URL, the app downloads audio using yt-dlp
- If the source is a local file, the app converts it to WAV
- Audio is chunked for manageable transcription
- english mode uses local Whisper
- hinglish mode uses Sarvam to transcribe Hindi audio into English text
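The chunking and mode-selection steps above can be sketched as follows. This is illustrative only: `chunk_boundaries` and `pick_backend` are hypothetical helpers, the 60-second chunk length is an assumed default, and the actual logic in `utils/audio_processor.py` and `core/transcriber.py` may differ.

```python
from typing import List, Tuple


def chunk_boundaries(total_ms: int, chunk_ms: int = 60_000) -> List[Tuple[int, int]]:
    """Return (start, end) millisecond offsets that cover the whole recording."""
    if total_ms < 0 or chunk_ms <= 0:
        raise ValueError("total_ms must be >= 0 and chunk_ms must be > 0")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


def pick_backend(mode: str) -> str:
    """Map the transcription mode to the engine named in the list above."""
    backends = {"english": "whisper", "hinglish": "sarvam"}
    try:
        return backends[mode]
    except KeyError:
        raise ValueError(f"Unknown transcription mode: {mode!r}") from None
```

Because pydub slices an `AudioSegment` by milliseconds, each `(start, end)` pair could be applied as `audio[start:end]` to obtain one chunk.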
- Mistral generates:
- a short title
- a summary
- action items
- key decisions
- open questions
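One way to carry these five outputs through the backend is a small record type filled from a JSON reply. The sketch below assumes the model is prompted to answer with a JSON object using the keys shown; that prompt format, the `Insights` class, and `parse_insights` are assumptions for illustration, not the contents of `core/summarizer.py`.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Insights:
    title: str
    summary: str
    action_items: List[str] = field(default_factory=list)
    key_decisions: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)


def parse_insights(raw: str) -> Insights:
    """Parse a JSON object string from the model into an Insights record."""
    data = json.loads(raw)

    def as_list(key: str) -> List[str]:
        # Tolerate a missing key or a single string instead of a list.
        value = data.get(key) or []
        if isinstance(value, str):
            value = [value]
        return [str(item).strip() for item in value if str(item).strip()]

    return Insights(
        title=str(data.get("title", "")).strip(),
        summary=str(data.get("summary", "")).strip(),
        action_items=as_list("action_items"),
        key_decisions=as_list("key_decisions"),
        open_questions=as_list("open_questions"),
    )
```

Normalizing missing or malformed fields to empty lists keeps the frontend insight cards from having to handle several shapes of response.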
- The transcript is chunked and embedded
- ChromaDB stores transcript vectors
- LangChain retrieves relevant transcript segments for Q&A
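To make the chunk-then-retrieve idea concrete, here is a dependency-free sketch. It deliberately swaps the real components for simpler stand-ins: word windows with overlap instead of a LangChain text splitter, and a word-overlap score instead of embeddings stored in ChromaDB. The function names and the default sizes are hypothetical.

```python
import re
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of `size` that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def top_chunks(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Rank chunks by how many distinct question words they contain."""
    query = set(re.findall(r"\w+", question.lower()))

    def score(chunk: str) -> int:
        return len(query & set(re.findall(r"\w+", chunk.lower())))

    ranked = sorted(chunks, key=score, reverse=True)
    # Drop chunks that share no words with the question.
    return [chunk for chunk in ranked[:k] if score(chunk) > 0]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. An embedding-based store serves the same purpose as `top_chunks` but matches on meaning rather than exact words.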
UI Highlights
- Hero section with branded OmniVision positioning
- Upload + YouTube dual input flow
- Language selection for English or Hindi-to-English
- Animated processing card
- Summary and transcript views
- Insight cards for extracted outputs
- RAG assistant chat interface
Run the backend:
python run_backend.py
Run the frontend in development mode:
cd frontend
npm run dev
Build the frontend for production:
cd frontend
npm run build
This project can be deployed, but there are a few practical considerations:
- Whisper can be heavy on small/free instances
- ffmpeg must be available on the server
- yt-dlp and long video processing may increase runtime significantly
- Chroma persistence should be configured carefully in production
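A quick pre-flight check can confirm that the external tools are present before the server starts accepting jobs. This is a sketch under assumptions: `missing_tools` is a hypothetical helper, and it only looks for executables on `PATH`, so it would not detect yt-dlp when it is used purely as an imported Python module.

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str] = ("ffmpeg", "yt-dlp")) -> List[str]:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Missing required tools: " + ", ".join(absent))
    else:
        print("All required tools found.")
```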
Recommended first deployment targets:
- Render
- Railway
- VPS-based deployment for more control
Upload these:
- app.py
- main.py
- run_backend.py
- requirements.txt
- .env.example
- README.md
- core/
- utils/
- frontend/
Do not upload these:
- .venv/
- frontend/node_modules/
- frontend/dist/
- .env
- vector_db/
- downloads/
- uploads/
- logs and cache files
Frontend starts but backend does not respond
Make sure the backend is running with:
python run_backend.py
PowerShell blocks virtual environment activation
Run:
Set-ExecutionPolicy -Scope Process Bypass
Then activate again:
.venv\Scripts\Activate.ps1
YouTube processing takes a long time
That is expected for long videos. The pipeline currently performs:
- download
- audio extraction
- transcription
- summarization
- vector indexing
Try a shorter clip first when testing.
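When a run feels slow, timing each of the stages listed above shows where the time goes. The helper below is a generic sketch using only the standard library; `timed` and `slowest_stage` are hypothetical names, and the stage calls in the usage comment are placeholders, not the project's real functions.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record how long the wrapped block takes, in seconds, under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the duration even if the stage raised an exception.
        timings[stage] = time.perf_counter() - start


def slowest_stage(timings: Dict[str, float]) -> str:
    """Return the name of the stage that took the longest."""
    return max(timings, key=timings.get)


# Usage (placeholder stage functions):
#   timings = {}
#   with timed("download", timings):
#       download_audio(url)
#   with timed("transcription", timings):
#       transcribe(audio_path)
#   print(slowest_stage(timings))
```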
- Rotate any API keys that were ever exposed accidentally
- Keep .env private
- Do not commit vector databases, logs, or uploaded media
- Background jobs for long-running video analysis
- Live progress tracking in the UI
- Better production deployment flow
- Transcript export options
- Multi-session history
Add your preferred license here, for example MIT.
Built as an AI-powered video intelligence platform for turning content into actionable knowledge.