This backend system captures real-time video and audio from presentations or speeches, performs facial emotion analysis, transcribes speech, and generates AI-narrated summaries with highlighted important segments.
- Real-time Video Capture: Captures video frames from camera/OBS at configurable intervals
- Facial Emotion Detection: Analyzes facial expressions for 7 emotions (anger, disgust, fear, happiness, neutral, sadness, surprise)
- Speech Transcription: Real-time audio transcription using Deepgram streaming API
- AI-Powered Summaries: Automatically generates narrated summaries using Google Gemini + ElevenLabs
- Firebase Storage: All data persisted in Firestore for easy retrieval and analysis
- Flask API: RESTful endpoints for session control and summary retrieval
```
┌─────────────┐
│   Camera    │──┐
└─────────────┘  │
                 ├──► ┌──────────────┐      ┌─────────────────┐
┌─────────────┐  │    │    srt.py    │─────►│    Firebase     │
│ Microphone  │──┘    │ (Main Loop)  │      │    Firestore    │
└─────────────┘       └──────────────┘      └─────────────────┘
                         │          │
                     Flask API      │
                         ▼          ▼
                ┌─────────────┐   ┌──────────────────┐
                │   Client    │   │   elevenlabs_    │
                │  Requests   │   │   service.py     │
                └─────────────┘   └──────────────────┘
                                          │
                         ┌────────────────┼────────────────┐
                         ▼                ▼                ▼
                   ┌───────────┐   ┌───────────┐   ┌──────────┐
                   │  Gemini   │   │ ElevenLabs│   │  Audio   │
                   │    API    │   │    API    │   │  Files   │
                   └───────────┘   └───────────┘   └──────────┘
```
- Manages video capture from camera
- Handles audio recording and streaming to Deepgram
- Coordinates frame processing and API calls
- Provides Flask control endpoints
- Auto-triggers summary generation on session stop
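To make the responsibilities above concrete, here is a rough sketch of what such a capture loop looks like. It is illustrative only, not the actual `srt.py` code, and the emotion-API request shape (a hypothetical `/analyze` path taking a JPEG upload) is an assumption rather than the documented contract.

```python
# Minimal sketch of a frame-capture loop in the spirit of srt.py (not a verbatim copy).
# The /analyze path and "file" field are assumptions about the emotion-analysis API.
import time
import cv2
import requests

API_URL = "https://example-runpod-proxy.example.com/analyze"  # hypothetical endpoint
FRAME_INTERVAL = 0.2   # seconds between captures (documented default)
CAPTURE_DEVICE = 1     # 0 = default webcam, 1 = OBS Virtual Camera

def capture_loop(stop_flag: dict) -> None:
    cap = cv2.VideoCapture(CAPTURE_DEVICE)
    if not cap.isOpened():
        raise RuntimeError("Camera failed to open; run testcam.py to find the right index")
    frame_number = 0
    try:
        while not stop_flag.get("stop"):
            ok, frame = cap.read()
            if not ok:
                continue
            # Encode the frame as JPEG and send it to the emotion-analysis API.
            encoded, jpeg = cv2.imencode(".jpg", frame)
            if encoded:
                resp = requests.post(
                    API_URL,
                    files={"file": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
                    timeout=10,
                )
                detections = resp.json()  # assumed to return per-face emotion scores
                print(f"[ok] frame {frame_number}: {len(detections.get('detections', []))} faces")
            frame_number += 1
            time.sleep(FRAME_INTERVAL)
    finally:
        cap.release()

if __name__ == "__main__":
    capture_loop({"stop": False})  # runs until interrupted
```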
- Isolated, plug-and-play design - minimal coupling with main system
- Fetches transcripts from Firebase Firestore
- Uses Google Gemini to identify important speech segments
- Calculates text highlight spans (start/end indices)
- Generates narrated audio via ElevenLabs TTS
- Manages async job processing with status tracking
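One way to picture the highlight-span step listed above: given segments Gemini flags as important, compute start/end character indices into the full transcript. The sketch below uses a simple forward substring search, which is an assumption about how `elevenlabs_service.py` does it; the segment and highlight shapes mirror the API response example later in this README.

```python
# Sketch: turn Gemini-flagged segments into character-index highlight spans.
from typing import Dict, List

def compute_highlights(full_transcript: str, segments: List[Dict]) -> List[Dict]:
    highlights = []
    cursor = 0  # search forward so repeated phrases map to successive occurrences
    for seg in segments:
        text = seg["text"]
        start = full_transcript.find(text, cursor)
        if start == -1:
            continue  # segment not found verbatim; skip rather than guess
        end = start + len(text)
        highlights.append({
            "start": start,
            "end": end,
            "text": text,
            "importance": seg.get("importance", "medium"),
            "reason": seg.get("reason", ""),
        })
        cursor = end
    return highlights

# Matches the "Hello?" highlight shown in the completed-summary response below.
print(compute_highlights(
    "Hello? My name is Harsh...",
    [{"text": "Hello?", "importance": "high", "reason": "Opening greeting"}],
))
```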
- Emotion Analysis API: RunPod-hosted facial emotion detection
- Deepgram: Real-time speech-to-text transcription
- Google Gemini: Transcript analysis and importance detection
- ElevenLabs: High-quality text-to-speech narration
```
videos/
└── {session_name}/
    └── frames/
        ├── frame_0
        ├── frame_1
        └── frame_n
            ├── frame_number: int
            ├── timestamp: string (ISO 8601)
            ├── num_detections: int
            ├── detections: array[
            │     ├── face_id: string
            │     ├── emotion_scores: map
            │     ├── pose: map
            │     └── action_units: map
            │   ]
            └── transcripts: array[
                  ├── text: string
                  ├── relative_start: float
                  └── relative_end: float
                ]
```
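For reference, a minimal sketch of writing one frame document in this shape with the `firebase-admin` SDK. The collection path and field names follow the structure above; error handling is omitted, and the real write logic in `srt.py` may differ.

```python
# Sketch: persist one frame document matching the Firestore structure above.
# Assumes firebase-key.json sits in the working directory, as described in Setup.
from datetime import datetime, timezone

import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("firebase-key.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

def save_frame(session_name: str, frame_number: int, detections: list, transcripts: list) -> None:
    doc_ref = (db.collection("videos")
                 .document(session_name)
                 .collection("frames")
                 .document(f"frame_{frame_number}"))
    doc_ref.set({
        "frame_number": frame_number,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "num_detections": len(detections),
        "detections": detections,    # [{face_id, emotion_scores, pose, action_units}, ...]
        "transcripts": transcripts,  # [{text, relative_start, relative_end}, ...]
    })

save_frame("my_presentation", 0, [], [])
```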
- Python 3.8+
- Firebase project with Firestore enabled
- Camera/OBS Virtual Camera
- Microphone
- Clone and navigate to the backend:

  ```
  cd /path/to/HACKNC2025/backend
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Configure Firebase:
  - Download your Firebase service account key
  - Save it as `firebase-key.json` in the backend directory

- Set up API keys:
  - Copy `env.example` to `.env`
  - Add your API keys:

    ```
    DEEPGRAM_KEY=your_deepgram_key
    GEMINI_API_KEY=your_gemini_key
    ELEVENLABS_API_KEY=your_elevenlabs_key
    ```

- Configure the camera device:
  - Edit `srt.py` line 23: `CAPTURE_DEVICE = 1` (0 = default webcam, 1 = OBS)
  - Run `python testcam.py` to list available cameras (see the sketch after this list)
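If `testcam.py` is unavailable, a device scan of this kind gives the same information; this is a minimal OpenCV enumeration sketch, not the script itself.

```python
# Sketch: probe the first few OpenCV device indices and report which ones open.
import cv2

for index in range(5):
    cap = cv2.VideoCapture(index)
    if cap.isOpened():
        ok, frame = cap.read()
        shape = frame.shape if ok and frame is not None else "no frame"
        print(f"[ok] device {index}: {shape}")
    else:
        print(f"[warn] device {index}: not available")
    cap.release()
```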
Run the application:

```
python srt.py
```

This starts:
- Flask control server on `http://0.0.0.0:80`
- Video/audio capture loop (waiting for a session to start)
Start Recording Session:

```
POST http://localhost/start
Content-Type: application/json

{
  "name": "my_presentation"
}
```

Stop Recording Session:

```
POST http://localhost/stop
```

Response:

```json
{
  "session": "my_presentation",
  "status": "stopped",
  "summary": "generating"
}
```

Check Summary Status:

```
GET http://localhost/api/summary/{session_name}
```

Response (processing):

```json
{
  "status": "processing",
  "started_at": "2025-10-12T01:23:45Z"
}
```

Response (completed):

```json
{
  "status": "completed",
  "session_name": "my_presentation",
  "result": {
    "full_transcript": "Hello? My name is Harsh...",
    "highlights": [
      {
        "start": 0,
        "end": 6,
        "text": "Hello?",
        "importance": "high",
        "reason": "Opening greeting establishes speaker presence"
      }
    ],
    "audio_url": "/static/summaries/my_presentation_summary.mp3",
    "voice_tonality": "professional",
    "generated_at": "2025-10-12T01:24:30Z"
  }
}
```

Manually Trigger Summary:

```
POST http://localhost/api/generate-summary/{session_name}
```

List All Summaries:

```
GET http://localhost/api/summaries
```

Download Audio File:

```
GET http://localhost/static/summaries/{session_name}_summary.mp3
```

| Variable | Description | Required |
|---|---|---|
| `DEEPGRAM_KEY` | Deepgram API key for transcription | Yes |
| `GEMINI_API_KEY` | Google Gemini API key for analysis | Yes |
| `ELEVENLABS_API_KEY` | ElevenLabs API key for TTS | Yes |
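A small client sketch that exercises the session endpoints documented above with `requests`. The host and session name are just this README's examples; the 60-second recording window and polling limit are arbitrary choices for illustration.

```python
# Sketch: drive a full session via the Flask API and poll until the summary is ready.
import time
import requests

BASE = "http://localhost"        # Flask listens on port 80 by default
SESSION = "my_presentation"

requests.post(f"{BASE}/start", json={"name": SESSION}).raise_for_status()
time.sleep(60)                                   # record for about a minute
print(requests.post(f"{BASE}/stop").json())      # {"session": ..., "status": "stopped", "summary": "generating"}

# Poll the summary endpoint until generation finishes (give up after ~2 minutes).
status = {}
for _ in range(60):
    status = requests.get(f"{BASE}/api/summary/{SESSION}").json()
    if status.get("status") == "completed":
        break
    time.sleep(2)

if status.get("status") == "completed":
    audio_url = status["result"]["audio_url"]
    with open(f"{SESSION}_summary.mp3", "wb") as f:
        f.write(requests.get(f"{BASE}{audio_url}").content)
```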
| Setting | Default | Description |
|---|---|---|
| `API_URL` | RunPod proxy URL | Emotion analysis API endpoint |
| `FRAME_INTERVAL` | 0.2 seconds | Time between frame captures |
| `CAPTURE_DEVICE` | 1 | Camera device index |
| `FLASK_HOST` | 0.0.0.0 | Flask server host |
| `FLASK_PORT` | 80 | Flask server port |
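These settings are edited directly in `srt.py` (for example, `CAPTURE_DEVICE` on line 23); presumably they sit as module-level constants near the top of the file. The snippet below only illustrates that shape with the documented defaults, with a placeholder URL.

```python
# Illustrative constants mirroring the settings table above (not a verbatim copy of srt.py).
API_URL = "https://your-runpod-proxy.example.com"  # emotion analysis endpoint (placeholder)
FRAME_INTERVAL = 0.2    # seconds between frame captures
CAPTURE_DEVICE = 1      # camera device index (0 = default webcam, 1 = OBS)
FLASK_HOST = "0.0.0.0"  # Flask server host
FLASK_PORT = 80         # Flask server port
```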
| Tonality | ElevenLabs Voice |
|---|---|
| professional | Rachel |
| warm | Bella |
| authoritative | Adam |
| enthusiastic | Antoni |
| calm | Elli |
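A sketch of how the tonality mapping might be applied when requesting narration from ElevenLabs' text-to-speech REST endpoint. The voice IDs here are placeholders (look up the real IDs for Rachel, Bella, etc. in your ElevenLabs voice library), and the actual `elevenlabs_service.py` may use the official SDK instead of raw HTTP.

```python
# Sketch: map a tonality to an ElevenLabs voice and request narration over the REST API.
# Voice IDs are placeholders; substitute the IDs from your ElevenLabs voice library.
import os
import requests

TONALITY_TO_VOICE_ID = {
    "professional": "VOICE_ID_RACHEL",
    "warm": "VOICE_ID_BELLA",
    "authoritative": "VOICE_ID_ADAM",
    "enthusiastic": "VOICE_ID_ANTONI",
    "calm": "VOICE_ID_ELLI",
}

def narrate(text: str, tonality: str, out_path: str) -> None:
    voice_id = TONALITY_TO_VOICE_ID.get(tonality, TONALITY_TO_VOICE_ID["professional"])
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text},
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # MP3 audio bytes

narrate("Here is the narrated summary.", "professional", "example_summary.mp3")
```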
```
backend/
├── srt.py                  # Main application & Flask API
├── elevenlabs_service.py   # Summary generation module (isolated)
├── stream.py               # Simple video-only streaming
├── live.py                 # Legacy streaming script
├── audiod.py               # Audio streaming test
├── req.py                  # API test client
├── testcam.py              # Camera detection utility
├── firebase-key.json       # Firebase credentials (gitignored)
├── requirements.txt        # Python dependencies
├── env.example             # API key template
├── static/
│   └── summaries/          # Generated audio files
└── frames/                 # Test frame images
```
- Start a test session:

  ```
  curl -X POST http://localhost/start \
    -H "Content-Type: application/json" \
    -d '{"name": "test_session"}'
  ```

- Let it record for 30-60 seconds (speak into the microphone)

- Stop the session:

  ```
  curl -X POST http://localhost/stop
  ```

- Poll for completion:

  ```
  watch -n 2 'curl http://localhost/api/summary/test_session'
  ```

- Download the audio when complete:

  ```
  curl -O http://localhost/static/summaries/test_session_summary.mp3
  ```
For existing sessions in Firebase:
```
curl -X POST http://localhost/api/generate-summary/catshop
```

Issue: "ELEVENLABS_API_KEY not set"
- Solution: Add `ELEVENLABS_API_KEY` to your `.env` file
- The summary will still generate, but without audio
Issue: Camera not opening
- Solution: Run `python testcam.py` to find the correct device index
- Update `CAPTURE_DEVICE` in `srt.py`
Issue: "Failed to connect to Deepgram"
- Solution: Check that `DEEPGRAM_KEY` is valid
- Verify network connectivity
Issue: Gemini API rate limit
- Solution: Wait a few minutes between summary generations
- Consider upgrading to paid Gemini tier
Issue: No transcripts in Firebase
- Solution: Check microphone is working
- Verify Deepgram connection in logs
- Speak clearly during recording
The system provides detailed logging:
- `[ok]` - Successful operations
- `[warn]` - Warnings (non-critical)
- `[error]` - Errors requiring attention
- `[db]` - Firebase operations
- `[api]` - API endpoint calls
- `[elevenlabs_service]` - Summary generation progress
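The prefixes are plain strings in the console output; a trivial helper of the kind that could produce them (illustrative only, not the project's logger):

```python
# Illustrative helper producing the tagged console output described above.
from datetime import datetime

def log(tag: str, message: str) -> None:
    print(f"[{tag}] {datetime.now().strftime('%H:%M:%S')} {message}")

log("ok", "Frame 42 uploaded to Firestore")
log("warn", "No faces detected in frame 43")
```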
- API Keys: Never commit API keys to git
- Firebase Credentials: Keep `firebase-key.json` secure
- Input Validation: Session names are validated to prevent injection
- File Access: Audio file serving validates against directory traversal
- Network Access: Consider firewall rules for production deployment
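The session-name validation and directory-traversal checks mentioned above might look like the following; this is a sketch under assumed rules (the allowed character set, for instance), not the exact checks in `srt.py`.

```python
# Sketch: validate session names and resolve audio paths safely before serving them.
import re
from pathlib import Path

SUMMARIES_DIR = Path("static/summaries").resolve()
SESSION_NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")  # assumed allowed character set

def validate_session_name(name: str) -> str:
    if not SESSION_NAME_RE.match(name):
        raise ValueError(f"invalid session name: {name!r}")
    return name

def safe_summary_path(session_name: str) -> Path:
    path = (SUMMARIES_DIR / f"{validate_session_name(session_name)}_summary.mp3").resolve()
    if SUMMARIES_DIR not in path.parents:
        raise ValueError("path escapes the summaries directory")  # blocks directory traversal
    return path

print(safe_summary_path("my_presentation"))
```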
- Frame Processing: ~0.2 seconds per frame
- Transcription Latency: ~1-2 seconds (real-time streaming)
- Summary Generation: 30-60 seconds for 5-minute recording
- Audio File Size: ~100KB per minute of speech
- WebSocket notifications for summary completion
- Persistent job storage (Redis/database)
- Multiple voice selection UI
- Emotion-aware voice modulation
- Summary caching to avoid regeneration
- Rate limiting for API endpoints
- User authentication and authorization
- Real-time emotion visualization
- Export to multiple formats (PDF, JSON, SRT)
[Your License Here]
For issues or questions, contact [your contact info].