OmniVision

See Beyond the Video

AI-powered video understanding that turns long-form content into searchable knowledge with transcription, summarization, and conversational RAG.

Overview

OmniVision transforms videos into structured, searchable intelligence.

Upload a local video or paste a YouTube link, choose the transcription mode, and OmniVision will:

- transcribe the video
- generate a title and summary
- extract action items, key decisions, and open questions
- index the transcript in ChromaDB
- let you ask follow-up questions through a RAG assistant

This project combines a Flask backend, a React + Tailwind frontend, Whisper-based transcription, Sarvam Hindi-to-English transcription support, and LangChain-powered summarization and retrieval.

Demo Flow

flowchart LR
    A[Upload Video or Paste YouTube URL] --> B[Audio Extraction]
    B --> C[Transcription]
    C --> D[Summary + Insights]
    D --> E[ChromaDB Indexing]
    E --> F[RAG Chat Experience]

Features

Core Capabilities

- Video upload or YouTube link analysis
- English transcription with Whisper small
- Hindi-to-English transcription with Sarvam saaras:v2.5
- AI-generated title and meeting-style summary
- Extracted action items, key decisions, and open questions
- ChromaDB-based retrieval pipeline for follow-up questions
- Modern light-themed UI with animated processing states

Frontend Experience

- Custom React + Tailwind interface
- Creative glassmorphism-inspired layout
- Interactive processing section
- Transcript, summary, and insight cards
- Conversational RAG assistant
- Markdown-rendered output formatting

Tech Stack

| Layer | Tools |
| --- | --- |
| Frontend | React, Vite, Tailwind CSS, Lucide Icons |
| Backend | Flask |
| Transcription | OpenAI Whisper, Sarvam AI |
| LLM / Orchestration | LangChain, Mistral |
| Vector Store | ChromaDB |
| Media Processing | `yt-dlp`, `pydub`, `ffmpeg` |

Project Structure

AI Video assistant/
├─ app.py
├─ main.py
├─ run_backend.py
├─ requirements.txt
├─ .env.example
├─ README.md
├─ core/
│  ├─ extractor.py
│  ├─ pipeline.py
│  ├─ rag_engine.py
│  ├─ summarizer.py
│  ├─ transcriber.py
│  └─ vector_store.py
├─ utils/
│  └─ audio_processor.py
└─ frontend/
   ├─ index.html
   ├─ package.json
   ├─ package-lock.json
   ├─ vite.config.js
   ├─ tailwind.config.js
   ├─ postcss.config.js
   └─ src/
      ├─ App.jsx
      ├─ main.jsx
      └─ styles.css

Quick Start

1. Clone the repository

git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name

2. Create and activate a virtual environment

Windows PowerShell

py -3.12 -m venv .venv
.venv\Scripts\Activate.ps1

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

3. Install backend dependencies

pip install -r requirements.txt

4. Configure environment variables

Copy .env.example to .env and fill in your keys:

cp .env.example .env

Required values:

- MISTRAL_API_KEY
- SARVAM_API_KEY

Optional:

- WHISPER_MODEL
- SARVAM_STT_MODEL

5. Install frontend dependencies

cd frontend
npm install
cd ..

6. Run the backend

python run_backend.py

Backend runs at:

http://127.0.0.1:5000

7. Run the frontend

In a second terminal:

cd frontend
npm run dev

Frontend runs at:

http://127.0.0.1:5173

Environment Variables

MISTRAL_API_KEY="your_mistral_api_key_here"
WHISPER_MODEL="small"
SARVAM_API_KEY="your_sarvam_api_key_here"
SARVAM_STT_MODEL="saaras:v2.5"

Never commit your real .env file to GitHub.
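
As a rough illustration of how these values might be consumed, here is a minimal sketch that reads them from an environment mapping and fails fast when a required key is missing. `Settings` and `load_settings` are hypothetical names, not taken from the project, and the sketch assumes the variables are already present in the process environment (how `.env` gets loaded is not stated in this README).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    mistral_api_key: str
    sarvam_api_key: str
    whisper_model: str = "small"
    sarvam_stt_model: str = "saaras:v2.5"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping; raise if a required key is absent."""
    required = ("MISTRAL_API_KEY", "SARVAM_API_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return Settings(
        mistral_api_key=env["MISTRAL_API_KEY"],
        sarvam_api_key=env["SARVAM_API_KEY"],
        # Optional values fall back to the defaults shown in .env.example.
        whisper_model=env.get("WHISPER_MODEL", "small"),
        sarvam_stt_model=env.get("SARVAM_STT_MODEL", "saaras:v2.5"),
    )
```

Passing the mapping as a parameter keeps the function easy to exercise with a plain dictionary instead of the real environment.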

How It Works

Pipeline Breakdown

Input Processing

- If the source is a YouTube URL, the app downloads audio using yt-dlp
- If the source is a local file, the app converts it to WAV
- Audio is chunked for manageable transcription
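
The chunking step can be pictured as computing fixed-length time windows over the audio. The sketch below is an assumption about how that might look, not the project's actual code: `chunk_spans` is a hypothetical helper, and the 60-second default is an illustrative value.

```python
from typing import Iterator, Tuple


def chunk_spans(total_ms: int, chunk_ms: int = 60_000) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) windows that cover total_ms without overlap."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, total_ms, chunk_ms):
        # The final window is clipped to the end of the audio.
        yield start, min(start + chunk_ms, total_ms)


# With pydub, each span could be cut by millisecond slicing, for example:
#   audio = AudioSegment.from_file("input.wav")
#   parts = [audio[start:end] for start, end in chunk_spans(len(audio))]
```

For a 150-second file this yields three windows, the last one 30 seconds long.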

Transcription

- `english` mode transcribes locally with Whisper (the `small` model by default)
- `hinglish` mode uses Sarvam to transcribe Hindi audio into English text
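
One way to picture the mode switch is a small dispatcher that routes an audio file to whichever backend is registered for the selected mode. This is a sketch under assumptions: `transcribe` and `demo_backends` are hypothetical names, and the backends here are stand-ins that return placeholder strings rather than calling Whisper or Sarvam.

```python
from typing import Callable, Mapping

Transcriber = Callable[[str], str]  # audio path -> transcript text


def transcribe(audio_path: str, mode: str, backends: Mapping[str, Transcriber]) -> str:
    """Route an audio file to the backend registered for the chosen mode."""
    try:
        backend = backends[mode]
    except KeyError:
        raise ValueError(
            f"Unsupported mode {mode!r}; expected one of {sorted(backends)}"
        ) from None
    return backend(audio_path)


# Stand-in backends for illustration only. A real "english" backend could wrap
# openai-whisper, e.g. whisper.load_model("small").transcribe(path)["text"].
demo_backends = {
    "english": lambda path: f"[whisper transcript of {path}]",
    "hinglish": lambda path: f"[sarvam translation of {path}]",
}
```

Keeping the backends in a mapping means a new mode can be added without touching the dispatch logic.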

Post Processing

Mistral generates:
- a short title
- a summary
- action items
- key decisions
- open questions
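
If the model is prompted to reply in JSON, the five outputs above could be collected into one structure. The sketch below assumes such a JSON reply with the field names shown; both the reply format and the names `VideoInsights` and `parse_insights` are illustrative assumptions, not the project's confirmed schema.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoInsights:
    """Hypothetical shape for the post-processing outputs."""
    title: str
    summary: str
    action_items: List[str] = field(default_factory=list)
    key_decisions: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)


def parse_insights(raw: str) -> VideoInsights:
    """Parse a JSON reply into VideoInsights, tolerating missing or null list fields."""
    data = json.loads(raw)
    return VideoInsights(
        title=str(data.get("title", "")).strip(),
        summary=str(data.get("summary", "")).strip(),
        action_items=list(data.get("action_items") or []),
        key_decisions=list(data.get("key_decisions") or []),
        open_questions=list(data.get("open_questions") or []),
    )
```

Defaulting the list fields to empty lists lets the UI render insight cards even when a video has, say, no open questions.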

Retrieval

- The transcript is chunked and embedded
- ChromaDB stores transcript vectors
- LangChain retrieves relevant transcript segments for Q&A
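
To make the chunking step concrete, here is a minimal character-window splitter with overlap. It is a simplified stand-in for whatever splitter the project actually uses (LangChain ships more careful ones); `split_transcript` and its default sizes are assumptions chosen for illustration.

```python
from typing import List


def split_transcript(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval. Each chunk would then be embedded and stored in the vector store.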

Screens and UX

UI Highlights

- Hero section with branded OmniVision positioning
- Upload + YouTube dual input flow
- Language selection for English or Hindi-to-English
- Animated processing card
- Summary and transcript views
- Insight cards for extracted outputs
- RAG assistant chat interface

Common Commands

Run backend

python run_backend.py

Run frontend

cd frontend
npm run dev

Build frontend

cd frontend
npm run build

Deployment Notes

This project can be deployed, but there are a few practical considerations:

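- Whisper is compute- and memory-intensive and can be slow or fail on small or free-tier instances
- ffmpeg must be installed and available on the server's PATH
- yt-dlp downloads and long videos can increase processing time significantly
- Chroma persistence should point to durable storage in production rather than an ephemeral filesystem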
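
A small preflight check can catch a missing ffmpeg before the first upload fails. The sketch below uses the standard library's `shutil.which`; `missing_tools` is a hypothetical helper name, and where to call it (for example at backend startup) is left as an assumption.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


# Possible use at startup (illustrative):
#   absent = missing_tools(["ffmpeg"])
#   if absent:
#       raise SystemExit(f"Missing required tools: {', '.join(absent)}")
```

The `which` parameter exists only so the lookup can be swapped out in tests.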

Recommended first deployment targets:

- Render
- Railway
- VPS-based deployment for more control

Manual GitHub Upload Checklist

Upload these:

- app.py
- main.py
- run_backend.py
- requirements.txt
- .env.example
- README.md
- core/
- utils/
- frontend/

Do not upload these:

- .venv/
- frontend/node_modules/
- frontend/dist/
- .env
- vector_db/
- downloads/uploads/
- logs and cache files

Troubleshooting

Frontend starts but backend does not respond

Make sure the backend is running with:

python run_backend.py
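
If it is unclear whether the backend is actually listening, a quick TCP probe against the address from the Quick Start can help. This is a minimal sketch; `port_is_open` is a hypothetical helper, and it only confirms that something accepts connections on the port, not that the Flask app is healthy.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 5000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open()` with the defaults checks `127.0.0.1:5000`; a `False` result suggests the backend is not running or is bound to a different port.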

PowerShell blocks virtual environment activation

Run:

Set-ExecutionPolicy -Scope Process Bypass

Then activate again:

.venv\Scripts\Activate.ps1

YouTube processing takes a long time

That is expected for long videos. The pipeline currently performs:

- download
- audio extraction
- transcription
- summarization
- vector indexing

Try a shorter clip first when testing.
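
To see which stage dominates the runtime, each step could be wrapped in a small timer. The sketch below is illustrative only: `timed` is a hypothetical helper, and the `time.sleep` call stands in for a real pipeline stage.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the wrapped block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the stage raises, so failures still show their cost.
        timings[stage] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("transcription", timings):
    time.sleep(0.01)  # placeholder for the real transcription call
```

After a run, `timings` maps each stage name to its duration in seconds, which makes it easier to tell whether the download or the transcription is the bottleneck.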

Security Notes

- Rotate any API keys that were ever exposed accidentally
- Keep .env private
- Do not commit vector databases, logs, or uploaded media

Roadmap

- Background jobs for long-running video analysis
- Live progress tracking in the UI
- Better production deployment flow
- Transcript export options
- Multi-session history

License

Add your preferred license here, for example MIT.

Author

Built as an AI-powered video intelligence platform for turning content into actionable knowledge.
