This tool provides comprehensive video analysis through audio transcription, visual frame analysis, and executive summary generation. It's designed for creating VP-level summaries and detailed visual narratives from video content.
- macOS (tested on macOS 14+)
- ffmpeg installed (for audio/video processing)
- Hermit package manager (for Python environment)
- Python packages: openai-whisper, tqdm, pillow, numpy
- Create the project directory and navigate into it:
mkdir audio_transcriber
cd audio_transcriber
- Initialize Hermit:
curl -fsSL https://github.com/cashapp/hermit/releases/download/stable/install.sh | HERMIT_STATE_DIR=. bash
- Install Python using Hermit:
./bin/hermit install python3-3.11.9
source bin/activate-hermit
- Install required Python packages:
pip install openai-whisper tqdm pillow numpy
- Place your MP4 video file in your Downloads directory.
- Run the transcription script:
./audio_to_text.py
- Extract frames from video:
mkdir -p frames
ffmpeg -i ~/Downloads/your_video.mp4 -vf "fps=1" frames/frame_%04d.jpg
- Run the frame analysis script:
./analyze_frames.py
- Run the texture analysis (optional):
./analyze_texture.py frames/frame_0001.jpg
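The frame extraction step can also be driven from Python. The following is a minimal sketch, not code from this repository: the function names `build_extract_command` and `extract_frames` are hypothetical, and it assumes `ffmpeg` is available on the PATH. It builds the same ffmpeg invocation as the shell command above, with the frame rate exposed as a parameter.

```python
import subprocess
from pathlib import Path


def build_extract_command(video: Path, out_dir: Path, fps: float = 1.0) -> list[str]:
    """Build the ffmpeg argument list equivalent to the shell command above."""
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", f"fps={fps:g}",            # e.g. "fps=1" or "fps=0.5"
        str(out_dir / "frame_%04d.jpg"),  # frame_0001.jpg, frame_0002.jpg, ...
    ]


def extract_frames(video: Path, out_dir: Path = Path("frames"), fps: float = 1.0) -> None:
    """Create the output directory (like `mkdir -p`) and run ffmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_command(video, out_dir, fps), check=True)
```

Calling `extract_frames(Path.home() / "Downloads" / "your_video.mp4")` would then write one JPEG per second of video into `frames/`.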
let's create an audio-to-text script using python. we can use `ffmpeg` on path to extract audio. the target file for audio extraction will be in the `.mp4` in `~/Downloads`
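For orientation, a script produced from a prompt like this might look roughly as follows. This is a sketch under stated assumptions, not the actual contents of `audio_to_text.py`: the helper names, the 16 kHz mono WAV intermediate, the `[HH:MM:SS]` line format, and writing the transcript next to the video are all illustrative choices. The only Whisper calls used are `whisper.load_model` and `model.transcribe` from the openai-whisper package.

```python
import subprocess
from pathlib import Path


def build_audio_command(video: Path, wav: Path) -> list[str]:
    """ffmpeg arguments that extract a mono 16 kHz WAV track from the video."""
    return [
        "ffmpeg", "-y",   # overwrite the WAV file if it already exists
        "-i", str(video),
        "-vn",            # drop the video stream
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sample rate
        str(wav),
    ]


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS for a timestamped transcript line."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def transcribe(video: Path, model_name: str = "medium") -> Path:
    """Extract audio, run Whisper, and write <videoname>.transcript.txt."""
    import whisper  # imported lazily; provided by the openai-whisper package

    wav = video.with_suffix(".wav")
    subprocess.run(build_audio_command(video, wav), check=True)

    model = whisper.load_model(model_name)
    result = model.transcribe(str(wav))

    # Assumption: the transcript is written next to the source video.
    out_path = video.with_name(f"{video.stem}.transcript.txt")
    with out_path.open("w", encoding="utf-8") as fh:
        for seg in result["segments"]:
            fh.write(f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}\n")
    return out_path
```

Keeping the command builder and timestamp formatter separate from `transcribe` makes them easy to check without downloading the model.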
let's try "watching" the video by using `ffmpeg` to isolate every frame of the video, write it to a separate file, and merge the files into one document of files. then, scan the document sequentially top to bottom, keeping a running context of what's happening by summarizing each document page
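The "running context" idea in this prompt can be sketched in a few lines. The code below is illustrative only: `scan_frames` and its `describe` callback are hypothetical names, and `describe` stands in for whatever actually produces a one-line summary of a frame. It walks the frames in filename order and passes the most recent summaries along as context for the next one.

```python
from collections import deque
from pathlib import Path
from typing import Callable, Iterable


def scan_frames(
    frames: Iterable[Path],
    describe: Callable[[Path, str], str],
    window: int = 5,
) -> list[str]:
    """Walk frames in order, passing a rolling context to `describe`.

    `describe(frame, context)` is a placeholder for the step that
    summarizes a single frame given what has been seen so far.
    """
    recent: deque = deque(maxlen=window)  # keep only the last few summaries
    notes: list[str] = []
    for frame in sorted(frames):
        context = " | ".join(recent)
        summary = describe(frame, context)
        recent.append(summary)
        notes.append(f"{frame.name}: {summary}")
    return notes
```

Bounding the context with `window` keeps it from growing with the length of the video, at the cost of forgetting earlier scenes.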
After running frame analysis:
let's follow the sequence further, all the way to the end, then create a visual narrative document
This will create `visual_narrative_detailed.txt` containing:
- Scene-by-scene breakdown
- Visual style analysis
- Narrative themes
- Production quality notes
After having both transcript and visual analysis:
can you please combine both the visual narrative analysis and creation we've done here, and the audio transcript work we did earlier, and use both to create an executive summary suitable for VP+ level?
This will create `executive_summary_combined.txt` containing:
- Business impact overview
- Key outcomes
- Strategic relevance
- Market implications
- ROI indicators
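Before asking for a combined summary, it can help to line up the audio and visual notes on one timeline. The sketch below is an assumption-heavy illustration, not part of the tool: it supposes both files contain lines of the form `[HH:MM:SS] text`, which the visual narrative may not, and the names `parse_timed_lines` and `merge_timeline` are hypothetical.

```python
import re

# Matches lines such as "[00:01:05] some text"
LINE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")


def parse_timed_lines(text: str, source: str) -> list[tuple[int, str, str]]:
    """Return (seconds, source, body) for each '[HH:MM:SS] ...' line."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            h, m, s, body = match.groups()
            rows.append((int(h) * 3600 + int(m) * 60 + int(s), source, body))
    return rows


def merge_timeline(transcript: str, visual: str) -> list[str]:
    """Interleave audio and visual notes in time order."""
    rows = parse_timed_lines(transcript, "audio") + parse_timed_lines(visual, "visual")
    rows.sort(key=lambda r: r[0])  # stable sort: audio stays ahead of visual on ties
    return [f"{sec:>5}s [{src}] {body}" for sec, src, body in rows]
```

The merged lines can then be pasted into the summary prompt so that what was said and what was shown appear side by side.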
The tool generates several types of files:
- Transcript (`[videoname].transcript.txt`):
  - Timestamped transcription
  - Word-by-word accuracy
  - Speaker context
- Visual Analysis (`visual_narrative_detailed.txt`):
  - Scene-by-scene breakdown
  - Visual style notes
  - Narrative themes
  - Production quality assessment
- Executive Summary (`executive_summary_combined.txt`):
  - Business impact overview
  - Strategic analysis
  - Market implications
  - Distribution recommendations
You can modify the analysis by:
- Adjusting frame extraction rate
- Changing analysis depth
- Customizing summary format
- Targeting specific audience levels
- Frame Extraction:
  - Use 1 fps for standard analysis
  - Higher rates for fast-moving content
  - Lower rates for static content
- Analysis Sequence:
  - Start with audio transcription
  - Follow with frame analysis
  - Combine for executive summary
- Summary Creation:
  - Focus on business impact
  - Include concrete metrics
  - Highlight strategic relevance
  - Consider audience level
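When choosing a frame rate, a quick estimate of how many frames will be produced is useful, since every frame has to be analyzed. The helper below is a hypothetical illustration (the name `expected_frame_count` is not part of the tool), and the result is only approximate because ffmpeg's exact output count can differ slightly at the start and end of a clip.

```python
import math


def expected_frame_count(duration_seconds: float, fps: float = 1.0) -> int:
    """Approximate number of frames written for a given duration and rate."""
    return math.ceil(duration_seconds * fps)


# A 10-minute video at three candidate extraction rates.
for rate in (0.5, 1.0, 2.0):
    print(f"{rate:g} fps -> about {expected_frame_count(600, rate)} frames")
```

At 1 fps a 10-minute video yields roughly 600 frames; doubling the rate doubles the number of frames to analyze.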
- Uses the Whisper "medium" model, which is more accurate than the smaller models
- The first run downloads the Whisper model (~1.42 GB)
- Frame analysis works best with videos under 10 minutes
- High-resolution frame extraction preserves detail
- Processing time depends on video length and CPU power