This backend system captures real-time video and audio from presentations or speeches, performs facial emotion analysis, transcribes speech, and generates AI-narrated summaries with highlighted important segments.
- Real-time Video Capture: Captures video frames from camera/OBS at configurable intervals
- Facial Emotion Detection: Analyzes facial expressions for 7 emotions (anger, disgust, fear, happiness, neutral, sadness, surprise)
- Speech Transcription: Real-time audio transcription using Deepgram streaming API
- AI-Powered Summaries: Automatically generates narrated summaries using Google Gemini + ElevenLabs
- Firebase Storage: All data persisted in Firestore for easy retrieval and analysis
- Flask API: RESTful endpoints for session control and summary retrieval
```
┌─────────────┐
│   Camera    │──┐
└─────────────┘  │
                 ├──► ┌──────────────┐      ┌─────────────────┐
┌─────────────┐  │    │    srt.py    │─────►│    Firebase     │
│ Microphone  │──┘    │ (Main Loop)  │      │    Firestore    │
└─────────────┘       └──────────────┘      └─────────────────┘
                         │          │
                     Flask API      │
                         ▼          ▼
                ┌─────────────┐   ┌──────────────────┐
                │   Client    │   │   elevenlabs_    │
                │  Requests   │   │   service.py     │
                └─────────────┘   └──────────────────┘
                                          │
                         ┌────────────────┼────────────────┐
                         ▼                ▼                ▼
                   ┌───────────┐   ┌───────────┐   ┌──────────┐
                   │  Gemini   │   │ ElevenLabs│   │  Audio   │
                   │    API    │   │    API    │   │  Files   │
                   └───────────┘   └───────────┘   └──────────┘
```
- Manages video capture from camera
- Handles audio recording and streaming to Deepgram
- Coordinates frame processing and API calls
- Provides Flask control endpoints
- Auto-triggers summary generation on session stop
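To make the responsibilities above concrete, here is a rough sketch of what such a capture loop looks like. It is illustrative only, not the actual `srt.py` code, and the emotion-API request shape (a hypothetical `/analyze` path taking a JPEG upload) is an assumption rather than the documented contract.

```python
# Minimal sketch of a frame-capture loop in the spirit of srt.py (not a verbatim copy).
# The /analyze path and "file" field are assumptions about the emotion-analysis API.
import time
import cv2
import requests

API_URL = "https://example-runpod-proxy.example.com/analyze"  # hypothetical endpoint
FRAME_INTERVAL = 0.2   # seconds between captures (documented default)
CAPTURE_DEVICE = 1     # 0 = default webcam, 1 = OBS Virtual Camera

def capture_loop(stop_flag: dict) -> None:
    cap = cv2.VideoCapture(CAPTURE_DEVICE)
    if not cap.isOpened():
        raise RuntimeError("Camera failed to open; run testcam.py to find the right index")
    frame_number = 0
    try:
        while not stop_flag.get("stop"):
            ok, frame = cap.read()
            if not ok:
                continue
            # Encode the frame as JPEG and send it to the emotion-analysis API.
            encoded, jpeg = cv2.imencode(".jpg", frame)
            if encoded:
                resp = requests.post(
                    API_URL,
                    files={"file": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
                    timeout=10,
                )
                detections = resp.json()  # assumed to return per-face emotion scores
                print(f"[ok] frame {frame_number}: {len(detections.get('detections', []))} faces")
            frame_number += 1
            time.sleep(FRAME_INTERVAL)
    finally:
        cap.release()

if __name__ == "__main__":
    capture_loop({"stop": False})  # runs until interrupted
```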
- Isolated, plug-and-play design - minimal coupling with main system
- Fetches transcripts from Firebase Firestore
- Uses Google Gemini to identify important speech segments
- Calculates text highlight spans (start/end indices)
- Generates narrated audio via ElevenLabs TTS
- Manages async job processing with status tracking
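One way to picture the highlight-span step listed above: given segments Gemini flags as important, compute start/end character indices into the full transcript. The sketch below uses a simple forward substring search, which is an assumption about how `elevenlabs_service.py` does it; the segment and highlight shapes mirror the API response example later in this README.

```python
# Sketch: turn Gemini-flagged segments into character-index highlight spans.
from typing import Dict, List

def compute_highlights(full_transcript: str, segments: List[Dict]) -> List[Dict]:
    highlights = []
    cursor = 0  # search forward so repeated phrases map to successive occurrences
    for seg in segments:
        text = seg["text"]
        start = full_transcript.find(text, cursor)
        if start == -1:
            continue  # segment not found verbatim; skip rather than guess
        end = start + len(text)
        highlights.append({
            "start": start,
            "end": end,
            "text": text,
            "importance": seg.get("importance", "medium"),
            "reason": seg.get("reason", ""),
        })
        cursor = end
    return highlights

# Matches the "Hello?" highlight shown in the completed-summary response below.
print(compute_highlights(
    "Hello? My name is Harsh...",
    [{"text": "Hello?", "importance": "high", "reason": "Opening greeting"}],
))
```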
- Emotion Analysis API: RunPod-hosted facial emotion detection
- Deepgram: Real-time speech-to-text transcription
- Google Gemini: Transcript analysis and importance detection
- ElevenLabs: High-quality text-to-speech narration
```
videos/
└── {session_name}/
    └── frames/
        ├── frame_0
        ├── frame_1
        └── frame_n
            ├── frame_number: int
            ├── timestamp: string (ISO 8601)
            ├── num_detections: int
            ├── detections: array[
            │     ├── face_id: string
            │     ├── emotion_scores: map
            │     ├── pose: map
            │     └── action_units: map
            │   ]
            └── transcripts: array[
                  ├── text: string
                  ├── relative_start: float
                  └── relative_end: float
                ]
```
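For reference, a minimal sketch of writing one frame document in this shape with the `firebase-admin` SDK. The collection path and field names follow the structure above; error handling is omitted, and the real write logic in `srt.py` may differ.

```python
# Sketch: persist one frame document matching the Firestore structure above.
# Assumes firebase-key.json sits in the working directory, as described in Setup.
from datetime import datetime, timezone

import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("firebase-key.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

def save_frame(session_name: str, frame_number: int, detections: list, transcripts: list) -> None:
    doc_ref = (db.collection("videos")
                 .document(session_name)
                 .collection("frames")
                 .document(f"frame_{frame_number}"))
    doc_ref.set({
        "frame_number": frame_number,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "num_detections": len(detections),
        "detections": detections,    # [{face_id, emotion_scores, pose, action_units}, ...]
        "transcripts": transcripts,  # [{text, relative_start, relative_end}, ...]
    })

save_frame("my_presentation", 0, [], [])
```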
- Python 3.8+
- Firebase project with Firestore enabled
- Camera/OBS Virtual Camera
- Microphone
- Clone and navigate to the backend:

  ```
  cd /path/to/HACKNC2025/backend
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Configure Firebase:
  - Download your Firebase service account key
  - Save it as `firebase-key.json` in the backend directory

- Set up API keys:
  - Copy `env.example` to `.env`
  - Add your API keys:

    ```
    DEEPGRAM_KEY=your_deepgram_key
    GEMINI_API_KEY=your_gemini_key
    ELEVENLABS_API_KEY=your_elevenlabs_key
    ```

- Configure the camera device:
  - Edit `srt.py` line 23: `CAPTURE_DEVICE = 1` (0 = default webcam, 1 = OBS)
  - Run `python testcam.py` to list available cameras (see the sketch after this list)
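If `testcam.py` is unavailable, a device scan of this kind gives the same information; this is a minimal OpenCV enumeration sketch, not the script itself.

```python
# Sketch: probe the first few OpenCV device indices and report which ones open.
import cv2

for index in range(5):
    cap = cv2.VideoCapture(index)
    if cap.isOpened():
        ok, frame = cap.read()
        shape = frame.shape if ok and frame is not None else "no frame"
        print(f"[ok] device {index}: {shape}")
    else:
        print(f"[warn] device {index}: not available")
    cap.release()
```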
Run the application:

```
python srt.py
```

This starts:
- Flask control server on `http://0.0.0.0:80`
- Video/audio capture loop (waiting for a session to start)
Start Recording Session:

```
POST http://localhost/start
Content-Type: application/json

{
  "name": "my_presentation"
}
```

Stop Recording Session:

```
POST http://localhost/stop
```

Response:

```json
{
  "session": "my_presentation",
  "status": "stopped",
  "summary": "generating"
}
```

Check Summary Status:

```
GET http://localhost/api/summary/{session_name}
```

Response (processing):

```json
{
  "status": "processing",
  "started_at": "2025-10-12T01:23:45Z"
}
```

Response (completed):

```json
{
  "status": "completed",
  "session_name": "my_presentation",
  "result": {
    "full_transcript": "Hello? My name is Harsh...",
    "highlights": [
      {
        "start": 0,
        "end": 6,
        "text": "Hello?",
        "importance": "high",
        "reason": "Opening greeting establishes speaker presence"
      }
    ],
    "audio_url": "/static/summaries/my_presentation_summary.mp3",
    "voice_tonality": "professional",
    "generated_at": "2025-10-12T01:24:30Z"
  }
}
```

Manually Trigger Summary:

```
POST http://localhost/api/generate-summary/{session_name}
```

List All Summaries:

```
GET http://localhost/api/summaries
```

Download Audio File:

```
GET http://localhost/static/summaries/{session_name}_summary.mp3
```

| Variable | Description | Required |
|---|---|---|
| `DEEPGRAM_KEY` | Deepgram API key for transcription | Yes |
| `GEMINI_API_KEY` | Google Gemini API key for analysis | Yes |
| `ELEVENLABS_API_KEY` | ElevenLabs API key for TTS | Yes |
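A small client sketch that exercises the session endpoints documented above with `requests`. The host and session name are just this README's examples; the 60-second recording window and polling limit are arbitrary choices for illustration.

```python
# Sketch: drive a full session via the Flask API and poll until the summary is ready.
import time
import requests

BASE = "http://localhost"        # Flask listens on port 80 by default
SESSION = "my_presentation"

requests.post(f"{BASE}/start", json={"name": SESSION}).raise_for_status()
time.sleep(60)                                   # record for about a minute
print(requests.post(f"{BASE}/stop").json())      # {"session": ..., "status": "stopped", "summary": "generating"}

# Poll the summary endpoint until generation finishes (give up after ~2 minutes).
status = {}
for _ in range(60):
    status = requests.get(f"{BASE}/api/summary/{SESSION}").json()
    if status.get("status") == "completed":
        break
    time.sleep(2)

if status.get("status") == "completed":
    audio_url = status["result"]["audio_url"]
    with open(f"{SESSION}_summary.mp3", "wb") as f:
        f.write(requests.get(f"{BASE}{audio_url}").content)
```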
| Setting | Default | Description |
|---|---|---|
| `API_URL` | RunPod proxy URL | Emotion analysis API endpoint |
| `FRAME_INTERVAL` | 0.2 seconds | Time between frame captures |
| `CAPTURE_DEVICE` | 1 | Camera device index |
| `FLASK_HOST` | 0.0.0.0 | Flask server host |
| `FLASK_PORT` | 80 | Flask server port |
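These settings are edited directly in `srt.py` (for example, `CAPTURE_DEVICE` on line 23); presumably they sit as module-level constants near the top of the file. The snippet below only illustrates that shape with the documented defaults, with a placeholder URL.

```python
# Illustrative constants mirroring the settings table above (not a verbatim copy of srt.py).
API_URL = "https://your-runpod-proxy.example.com"  # emotion analysis endpoint (placeholder)
FRAME_INTERVAL = 0.2    # seconds between frame captures
CAPTURE_DEVICE = 1      # camera device index (0 = default webcam, 1 = OBS)
FLASK_HOST = "0.0.0.0"  # Flask server host
FLASK_PORT = 80         # Flask server port
```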
| Tonality | ElevenLabs Voice |
|---|---|
| professional | Rachel |
| warm | Bella |
| authoritative | Adam |
| enthusiastic | Antoni |
| calm | Elli |
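A sketch of how the tonality mapping might be applied when requesting narration from ElevenLabs' text-to-speech REST endpoint. The voice IDs here are placeholders (look up the real IDs for Rachel, Bella, etc. in your ElevenLabs voice library), and the actual `elevenlabs_service.py` may use the official SDK instead of raw HTTP.

```python
# Sketch: map a tonality to an ElevenLabs voice and request narration over the REST API.
# Voice IDs are placeholders; substitute the IDs from your ElevenLabs voice library.
import os
import requests

TONALITY_TO_VOICE_ID = {
    "professional": "VOICE_ID_RACHEL",
    "warm": "VOICE_ID_BELLA",
    "authoritative": "VOICE_ID_ADAM",
    "enthusiastic": "VOICE_ID_ANTONI",
    "calm": "VOICE_ID_ELLI",
}

def narrate(text: str, tonality: str, out_path: str) -> None:
    voice_id = TONALITY_TO_VOICE_ID.get(tonality, TONALITY_TO_VOICE_ID["professional"])
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text},
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # MP3 audio bytes

narrate("Here is the narrated summary.", "professional", "example_summary.mp3")
```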
```
backend/
├── srt.py                  # Main application & Flask API
├── elevenlabs_service.py   # Summary generation module (isolated)
├── stream.py               # Simple video-only streaming
├── live.py                 # Legacy streaming script
├── audiod.py               # Audio streaming test
├── req.py                  # API test client
├── testcam.py              # Camera detection utility
├── firebase-key.json       # Firebase credentials (gitignored)
├── requirements.txt        # Python dependencies
├── env.example             # API key template
├── static/
│   └── summaries/          # Generated audio files
└── frames/                 # Test frame images
```
- Start a test session:

  ```
  curl -X POST http://localhost/start \
    -H "Content-Type: application/json" \
    -d '{"name": "test_session"}'
  ```

- Let it record for 30-60 seconds (speak into the microphone)

- Stop the session:

  ```
  curl -X POST http://localhost/stop
  ```

- Poll for completion:

  ```
  watch -n 2 'curl http://localhost/api/summary/test_session'
  ```

- Download the audio when complete:

  ```
  curl -O http://localhost/static/summaries/test_session_summary.mp3
  ```
For existing sessions in Firebase:
```
curl -X POST http://localhost/api/generate-summary/catshop
```

Issue: "ELEVENLABS_API_KEY not set"
- Solution: Add `ELEVENLABS_API_KEY` to your `.env` file
- The summary will still generate, but without audio
Issue: Camera not opening
- Solution: Run `python testcam.py` to find the correct device index
- Update `CAPTURE_DEVICE` in `srt.py`
Issue: "Failed to connect to Deepgram"
- Solution: Check that `DEEPGRAM_KEY` is valid
- Verify network connectivity
Issue: Gemini API rate limit
- Solution: Wait a few minutes between summary generations
- Consider upgrading to paid Gemini tier
Issue: No transcripts in Firebase
- Solution: Check microphone is working
- Verify Deepgram connection in logs
- Speak clearly during recording
The system provides detailed logging:
- `[ok]` - Successful operations
- `[warn]` - Warnings (non-critical)
- `[error]` - Errors requiring attention
- `[db]` - Firebase operations
- `[api]` - API endpoint calls
- `[elevenlabs_service]` - Summary generation progress
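The prefixes are plain strings in the console output; a trivial helper of the kind that could produce them (illustrative only, not the project's logger):

```python
# Illustrative helper producing the tagged console output described above.
from datetime import datetime

def log(tag: str, message: str) -> None:
    print(f"[{tag}] {datetime.now().strftime('%H:%M:%S')} {message}")

log("ok", "Frame 42 uploaded to Firestore")
log("warn", "No faces detected in frame 43")
```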
- API Keys: Never commit API keys to git
- Firebase Credentials: Keep `firebase-key.json` secure
- Input Validation: Session names are validated to prevent injection
- File Access: Audio file serving validates against directory traversal
- Network Access: Consider firewall rules for production deployment
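The session-name validation and directory-traversal checks mentioned above might look like the following; this is a sketch under assumed rules (the allowed character set, for instance), not the exact checks in `srt.py`.

```python
# Sketch: validate session names and resolve audio paths safely before serving them.
import re
from pathlib import Path

SUMMARIES_DIR = Path("static/summaries").resolve()
SESSION_NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")  # assumed allowed character set

def validate_session_name(name: str) -> str:
    if not SESSION_NAME_RE.match(name):
        raise ValueError(f"invalid session name: {name!r}")
    return name

def safe_summary_path(session_name: str) -> Path:
    path = (SUMMARIES_DIR / f"{validate_session_name(session_name)}_summary.mp3").resolve()
    if SUMMARIES_DIR not in path.parents:
        raise ValueError("path escapes the summaries directory")  # blocks directory traversal
    return path

print(safe_summary_path("my_presentation"))
```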
- Frame Processing: ~0.2 seconds per frame
- Transcription Latency: ~1-2 seconds (real-time streaming)
- Summary Generation: 30-60 seconds for 5-minute recording
- Audio File Size: ~100KB per minute of speech
- WebSocket notifications for summary completion
- Persistent job storage (Redis/database)
- Multiple voice selection UI
- Emotion-aware voice modulation
- Summary caching to avoid regeneration
- Rate limiting for API endpoints
- User authentication and authorization
- Real-time emotion visualization
- Export to multiple formats (PDF, JSON, SRT)
[Your License Here]
For issues or questions, contact [your contact info].