sentiframe/backend

Real-Time Emotion Analysis & Speech Transcription Backend

Overview

This backend system captures real-time video and audio from presentations or speeches, performs facial emotion analysis, transcribes speech, and generates AI-narrated summaries with highlighted important segments.

Key Features

  • Real-time Video Capture: Captures video frames from camera/OBS at configurable intervals
  • Facial Emotion Detection: Analyzes facial expressions for 7 emotions (anger, disgust, fear, happiness, neutral, sadness, surprise)
  • Speech Transcription: Real-time audio transcription using Deepgram streaming API
  • AI-Powered Summaries: Automatically generates narrated summaries using Google Gemini + ElevenLabs
  • Firestore Persistence: Frame analysis results and transcripts are stored in Firebase Firestore for easy retrieval and analysis
  • Flask API: RESTful endpoints for session control and summary retrieval

Architecture

┌─────────────┐
│   Camera    │──┐
└─────────────┘  │
                 ├──► ┌──────────────┐      ┌─────────────────┐
┌─────────────┐  │    │   srt.py     │─────►│   Firebase      │
│ Microphone  │──┘    │ (Main Loop)  │      │   Firestore     │
└─────────────┘       └──────────────┘      └─────────────────┘
                             │                       │
                             │ Flask API             │
                             ▼                       ▼
                      ┌─────────────┐        ┌──────────────────┐
                      │   Client    │        │ elevenlabs_      │
                      │  Requests   │        │ service.py       │
                      └─────────────┘        └──────────────────┘
                                                     │
                                    ┌────────────────┼────────────────┐
                                    ▼                ▼                ▼
                             ┌───────────┐   ┌───────────┐   ┌──────────┐
                             │  Gemini   │   │ ElevenLabs│   │  Audio   │
                             │    API    │   │    API    │   │  Files   │
                             └───────────┘   └───────────┘   └──────────┘

System Components

1. srt.py - Main Application

  • Manages video capture from camera
  • Handles audio recording and streaming to Deepgram
  • Coordinates frame processing and API calls
  • Provides Flask control endpoints
  • Auto-triggers summary generation on session stop
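
The capture side of the responsibilities above can be pictured as a fixed-interval loop. The following is only a sketch of that pattern, not the code in srt.py: run_capture_loop, process_frame, and should_continue are hypothetical names, and the real loop also handles camera I/O, audio streaming, and API calls that are left out here.

```python
import time
from typing import Callable


def run_capture_loop(
    process_frame: Callable[[int], None],
    should_continue: Callable[[], bool],
    interval: float = 0.2,  # mirrors the documented FRAME_INTERVAL default
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call process_frame(frame_number) roughly every `interval` seconds.

    Returns the number of frames processed. The clock and sleep functions
    are injectable so the loop can be exercised without real waiting.
    """
    frame_number = 0
    next_due = clock()
    while should_continue():
        process_frame(frame_number)
        frame_number += 1
        next_due += interval
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)
        else:
            # Processing took longer than the interval: resynchronise
            # instead of firing a burst of frames to catch up.
            next_due = clock()
    return frame_number
```

Scheduling against a running deadline (next_due) rather than sleeping a flat 0.2 seconds after each frame keeps the average rate close to the configured interval even when per-frame work varies.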

2. elevenlabs_service.py - Summary Generation Module

  • Isolated, plug-and-play design - minimal coupling with main system
  • Fetches transcripts from Firebase Firestore
  • Uses Google Gemini to identify important speech segments
  • Calculates text highlight spans (start/end indices)
  • Generates narrated audio via ElevenLabs TTS
  • Manages async job processing with status tracking
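
As an illustration of the highlight-span step, the sketch below locates each important segment inside the full transcript and records start/end character indices. It assumes the segments come back as exact substrings of the transcript; find_highlight_spans is a hypothetical helper, not the function used in elevenlabs_service.py, and the sample transcript is made up.

```python
from typing import Dict, List


def find_highlight_spans(transcript: str, segments: List[str]) -> List[Dict]:
    """Return start/end indices for each segment found in the transcript.

    `end` is exclusive, so transcript[start:end] equals the segment text.
    """
    spans = []
    cursor = 0  # search forward so repeated phrases map to later occurrences
    for segment in segments:
        start = transcript.find(segment, cursor)
        if start == -1:
            # Segment may appear before the cursor; retry from the beginning.
            start = transcript.find(segment)
        if start == -1:
            continue  # not an exact substring (e.g. paraphrased); skip it
        end = start + len(segment)
        spans.append({"start": start, "end": end, "text": segment})
        cursor = end
    return spans


# Made-up sample data for illustration only.
transcript = "Hello? My name is Harsh and today I will talk about emotion."
spans = find_highlight_spans(transcript, ["Hello?", "talk about emotion"])
```

With this convention the first span is start 0, end 6 for "Hello?", which matches the shape of the highlights in the summary response shown under API Endpoints.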

3. External APIs

  • Emotion Analysis API: RunPod-hosted facial emotion detection
  • Deepgram: Real-time speech-to-text transcription
  • Google Gemini: Transcript analysis and importance detection
  • ElevenLabs: High-quality text-to-speech narration

4. Firebase Firestore Structure

videos/
  └── {session_name}/
      └── frames/
          ├── frame_0
          ├── frame_1
          └── frame_n
              ├── frame_number: int
              ├── timestamp: string (ISO 8601)
              ├── num_detections: int
              ├── detections: array[
              │   ├── face_id: string
              │   ├── emotion_scores: map
              │   ├── pose: map
              │   └── action_units: map
              │   ]
              └── transcripts: array[
                  ├── text: string
                  ├── relative_start: float
                  └── relative_end: float
                  ]
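
To make the layout above concrete, here is a sketch that assembles one frame document as a plain dictionary and derives its Firestore path. build_frame_document and frame_document_path are hypothetical helpers, the sample values are invented, and the actual write through the Firebase client is deliberately omitted.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def frame_document_path(session_name: str, frame_number: int) -> str:
    """Path of a frame document following the videos/{session}/frames layout."""
    return f"videos/{session_name}/frames/frame_{frame_number}"


def build_frame_document(
    frame_number: int,
    detections: List[Dict[str, Any]],
    transcripts: List[Dict[str, Any]],
    timestamp: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble the fields listed in the Firestore structure."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "frame_number": frame_number,
        "timestamp": ts.isoformat(),  # ISO 8601 string
        "num_detections": len(detections),
        "detections": detections,
        "transcripts": transcripts,
    }


# Invented sample values for illustration only.
doc = build_frame_document(
    frame_number=0,
    detections=[{
        "face_id": "face_0",
        "emotion_scores": {"happiness": 0.8, "neutral": 0.2},
        "pose": {},
        "action_units": {},
    }],
    transcripts=[{"text": "Hello?", "relative_start": 0.0, "relative_end": 0.6}],
    timestamp=datetime(2025, 10, 12, 1, 23, 45, tzinfo=timezone.utc),
)
path = frame_document_path("my_presentation", 0)
```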

Installation

Prerequisites

  • Python 3.8+
  • Firebase project with Firestore enabled
  • Camera/OBS Virtual Camera
  • Microphone

Setup Steps

  1. Clone the repository, then navigate to the backend directory:

    cd /path/to/HACKNC2025/backend
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure Firebase:

    • Download your Firebase service account key
    • Save as firebase-key.json in the backend directory
  4. Set up API keys:

    • Copy env.example to .env
    • Add your API keys:
      DEEPGRAM_KEY=your_deepgram_key
      GEMINI_API_KEY=your_gemini_key
      ELEVENLABS_API_KEY=your_elevenlabs_key
  5. Configure camera device:

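    • Edit CAPTURE_DEVICE in srt.py (line 23). The default is 1; index 0 is usually the default webcam and 1 the OBS Virtual Camera, but numbering varies by machine
    • Run python testcam.py to list the cameras available on your system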
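
Before the first run it can help to confirm that the three keys from step 4 are actually present. The sketch below is a standalone check, not part of the project: parse_env_text is a deliberately simplified .env reader (no export prefixes, no multi-line values), and how srt.py itself loads .env is not shown in this README.

```python
import os
from typing import Dict, List, Mapping

REQUIRED_KEYS = ("DEEPGRAM_KEY", "GEMINI_API_KEY", "ELEVENLABS_API_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


# Example: check the text of a .env file without touching the real environment.
sample = "DEEPGRAM_KEY=abc123\n# comment\nGEMINI_API_KEY=\n"
problems = missing_keys(parse_env_text(sample))
```

Calling missing_keys() with no argument checks the current process environment instead.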

Usage

Starting the System

python srt.py

This starts:

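  • Flask control server on http://0.0.0.0:80 (binding to port 80 usually requires elevated privileges on Linux; change FLASK_PORT in srt.py if that is a problem)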
  • Video/audio capture loop (waiting for session start)

API Endpoints

Session Control

Start Recording Session:

POST http://localhost/start
Content-Type: application/json

{
  "name": "my_presentation"
}

Stop Recording Session:

POST http://localhost/stop

Response:

{
  "session": "my_presentation",
  "status": "stopped",
  "summary": "generating"
}
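
The two session-control calls can also be made from Python. This is a minimal client sketch using only the standard library; the helper names are hypothetical, BASE_URL assumes the server is reachable on localhost port 80, and the empty body sent to /stop is an assumption since the README shows no payload for it.

```python
import json
import urllib.request

BASE_URL = "http://localhost"  # assumption: default FLASK_PORT of 80


def build_start_request(name: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /start with the session name as a JSON body."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_stop_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """POST /stop with an empty body."""
    return urllib.request.Request(f"{base_url}/stop", data=b"", method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running):
#   send(build_start_request("my_presentation"))
#   send(build_stop_request())
```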

Summary Management

Check Summary Status:

GET http://localhost/api/summary/{session_name}

Response (processing):

{
  "status": "processing",
  "started_at": "2025-10-12T01:23:45Z"
}

Response (completed):

{
  "status": "completed",
  "session_name": "my_presentation",
  "result": {
    "full_transcript": "Hello? My name is Harsh...",
    "highlights": [
      {
        "start": 0,
        "end": 6,
        "text": "Hello?",
        "importance": "high",
        "reason": "Opening greeting establishes speaker presence"
      }
    ],
    "audio_url": "/static/summaries/my_presentation_summary.mp3",
    "voice_tonality": "professional",
    "generated_at": "2025-10-12T01:24:30Z"
  }
}
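
Because summary generation is asynchronous, a client typically polls this endpoint until the status changes. The sketch below shows one way to do that; wait_for_summary is a hypothetical helper, the fetch function is injected so the loop stays independent of any HTTP library, and treating every status other than "processing" as final is an assumption (only "processing" and "completed" are documented here).

```python
import time
from typing import Callable, Dict


def wait_for_summary(
    fetch_status: Callable[[], Dict],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the summary leaves the "processing" state.

    `fetch_status` should return the decoded JSON from
    GET /api/summary/{session_name}.
    """
    waited = 0.0
    while True:
        payload = fetch_status()
        if payload.get("status") != "processing":
            return payload  # assumed terminal: "completed" or anything else
        if waited >= timeout:
            raise TimeoutError(f"summary still processing after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

The default timeout of 120 seconds is a guess sized against the 30-60 second generation time quoted under Performance; adjust it for longer recordings.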

Manually Trigger Summary:

POST http://localhost/api/generate-summary/{session_name}

List All Summaries:

GET http://localhost/api/summaries

Download Audio File:

GET http://localhost/static/summaries/{session_name}_summary.mp3

Configuration

Environment Variables

Variable            Description                          Required
DEEPGRAM_KEY        Deepgram API key for transcription   Yes
GEMINI_API_KEY      Google Gemini API key for analysis   Yes
ELEVENLABS_API_KEY  ElevenLabs API key for TTS           Yes

Application Settings (srt.py)

Setting         Default           Description
API_URL         RunPod proxy URL  Emotion analysis API endpoint
FRAME_INTERVAL  0.2 seconds       Time between frame captures
CAPTURE_DEVICE  1                 Camera device index
FLASK_HOST      0.0.0.0           Flask server host
FLASK_PORT      80                Flask server port

Voice Tonality Mapping (elevenlabs_service.py)

Tonality       ElevenLabs Voice
professional   Rachel
warm           Bella
authoritative  Adam
enthusiastic   Antoni
calm           Elli
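
In code, a mapping like this is naturally a dictionary lookup with a fallback. The sketch below mirrors the table; voice_for is a hypothetical helper, and falling back to the "professional" voice for unknown tonalities is an assumption, not documented behaviour of elevenlabs_service.py.

```python
from typing import Optional

VOICE_BY_TONALITY = {
    "professional": "Rachel",
    "warm": "Bella",
    "authoritative": "Adam",
    "enthusiastic": "Antoni",
    "calm": "Elli",
}

DEFAULT_TONALITY = "professional"  # assumption: fallback for unknown values


def voice_for(tonality: Optional[str]) -> str:
    """Return the voice name for a tonality, ignoring case and whitespace."""
    key = (tonality or "").strip().lower()
    return VOICE_BY_TONALITY.get(key, VOICE_BY_TONALITY[DEFAULT_TONALITY])
```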

Development

Project Structure

backend/
├── srt.py                      # Main application & Flask API
├── elevenlabs_service.py       # Summary generation module (isolated)
├── stream.py                   # Simple video-only streaming
├── live.py                     # Legacy streaming script
├── audiod.py                   # Audio streaming test
├── req.py                      # API test client
├── testcam.py                  # Camera detection utility
├── firebase-key.json           # Firebase credentials (gitignored)
├── requirements.txt            # Python dependencies
├── env.example                 # API key template
├── static/
│   └── summaries/              # Generated audio files
└── frames/                     # Test frame images

Testing the ElevenLabs Integration

  1. Start a test session:

    curl -X POST http://localhost/start \
      -H "Content-Type: application/json" \
      -d '{"name": "test_session"}'
  2. Let it record for 30-60 seconds (speak into microphone)

  3. Stop the session:

    curl -X POST http://localhost/stop
  4. Poll for completion:

    watch -n 2 'curl -s http://localhost/api/summary/test_session'
  5. Download audio when complete:

    curl -O http://localhost/static/summaries/test_session_summary.mp3

Manual Summary Generation

For existing sessions in Firebase:

curl -X POST http://localhost/api/generate-summary/catshop

Troubleshooting

Common Issues

Issue: "ELEVENLABS_API_KEY not set"

  • Solution: Add ELEVENLABS_API_KEY to your .env file
  • Without the key, the summary is still generated, but no audio is produced

Issue: Camera not opening

  • Solution: Run python testcam.py to find correct device index
  • Update CAPTURE_DEVICE in srt.py

Issue: "Failed to connect to Deepgram"

  • Solution: Check DEEPGRAM_KEY is valid
  • Verify network connectivity

Issue: Gemini API rate limit

  • Solution: Wait a few minutes between summary generations
  • Consider upgrading to paid Gemini tier

Issue: No transcripts in Firebase

  • Solution: Check microphone is working
  • Verify Deepgram connection in logs
  • Speak clearly during recording

Logs

The system provides detailed logging:

  • [ok] - Successful operations
  • [warn] - Warnings (non-critical)
  • [error] - Errors requiring attention
  • [db] - Firebase operations
  • [api] - API endpoint calls
  • [elevenlabs_service] - Summary generation progress
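
A tagged logging scheme like this takes only a few lines to reproduce. The sketch below is illustrative: format_log and log are hypothetical helpers, and the exact output format (tag in brackets, one space, then the message) and the routing of warnings and errors to stderr are assumptions rather than what srt.py necessarily does.

```python
import sys

KNOWN_TAGS = {"ok", "warn", "error", "db", "api", "elevenlabs_service"}


def format_log(tag: str, message: str) -> str:
    """Build a log line such as "[db] saved frame_0"."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown log tag: {tag}")
    return f"[{tag}] {message}"


def log(tag: str, message: str) -> None:
    """Print a tagged line; warnings and errors go to stderr."""
    stream = sys.stderr if tag in {"warn", "error"} else sys.stdout
    print(format_log(tag, message), file=stream)
```

Keeping the tags in a fixed set makes it easy to filter output, for example with grep '\[error\]'.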

Security Considerations

  1. API Keys: Never commit API keys to git
  2. Firebase Credentials: Keep firebase-key.json secure
  3. Input Validation: Session names are validated to prevent injection
  4. File Access: Audio file serving guards against directory traversal
  5. Network Access: Consider firewall rules for production deployment
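
Points 3 and 4 can be sketched as follows. This is one common way to implement such checks, not the project's actual code: the allowed character set and 64-character limit in SESSION_NAME_RE are assumptions, and both function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: letters, digits, underscore and hyphen, 1 to 64 characters.
SESSION_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_session_name(name: str) -> bool:
    """Reject names containing path separators, dots, or other unexpected characters."""
    return bool(SESSION_NAME_RE.fullmatch(name))


def resolve_summary_path(base_dir: str, filename: str) -> Path:
    """Resolve filename inside base_dir, refusing anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    if base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {filename}")
    return candidate
```

Resolving both paths before comparing them means that ".." segments and absolute filenames are normalised first, so a request such as ../../etc/passwd is rejected rather than served.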

Performance

  • Frame Processing: ~0.2 seconds per frame
  • Transcription Latency: ~1-2 seconds (real-time streaming)
  • Summary Generation: 30-60 seconds for 5-minute recording
  • Audio File Size: ~100KB per minute of speech
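
These figures allow a rough back-of-the-envelope estimate for a session. The sketch below uses only the numbers quoted in this README (the 0.2 second frame interval and ~100KB per minute of narrated audio); the function names are hypothetical, and the frame count assumes processing keeps pace with the capture interval.

```python
FRAME_INTERVAL_S = 0.2      # documented FRAME_INTERVAL default
AUDIO_KB_PER_MINUTE = 100   # approximate figure quoted above


def estimate_frames(duration_s: float) -> int:
    """Approximate number of frame documents written for a recording."""
    return round(duration_s / FRAME_INTERVAL_S)


def estimate_audio_kb(narration_minutes: float) -> float:
    """Approximate size of the narrated summary audio in kilobytes."""
    return narration_minutes * AUDIO_KB_PER_MINUTE


frames = estimate_frames(5 * 60)      # a 5-minute recording
audio_kb = estimate_audio_kb(1.5)     # a 90-second narration
```

For a 5-minute recording this works out to about 1500 frame documents, which is worth keeping in mind when estimating Firestore write volume.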

Future Enhancements

  • WebSocket notifications for summary completion
  • Persistent job storage (Redis/database)
  • Multiple voice selection UI
  • Emotion-aware voice modulation
  • Summary caching to avoid regeneration
  • Rate limiting for API endpoints
  • User authentication and authorization
  • Real-time emotion visualization
  • Export to multiple formats (PDF, JSON, SRT)

License

[Your License Here]

Support

For issues or questions, contact [your contact info].
