# [M_01] PROTOCOL: MULTIMODAL ANALYSIS (GEMINI 3 FLASH)

**PROJECT:** OMNI-OPERATOR-V1  
**ORGANIZATION:** [OPERATORS' FORGE](https://takzenai-hub.pl)  
**STATUS:** VISION_OPERATION

This module implements the logic for multimodal analysis of raw video material. We use the **Gemini 3 Flash** model to directly extract key moments (hooks) and narrative structure.

**Why is this groundbreaking?**
1. **No transcription:** We skip the separate Whisper/STT step. Gemini processes the video directly, including gestures, emotions, and on-screen text.
2. **Video Grounding:** The model ties what is said in the audio to specific video frames.
3. **Structured Outputs:** The response is parsed directly into a Pydantic model, ready for automatic editing.

In [4]:
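import json
import os
import sys
import time
from typing import List

from google import genai
from pydantic import BaseModel, Field

# 1. WORKING DIRECTORY CORRECTION
# When the kernel starts inside notebooks/, move up to the project root.
if os.getcwd().endswith("notebooks"):
    os.chdir("..")

# Put the project root and src/ on sys.path so that both `src.core` and `core` resolve.
# This has to run before the project import below.
for path in (os.getcwd(), os.path.join(os.getcwd(), "src")):
    if path not in sys.path:
        sys.path.append(path)

from src.core.config import settings  # noqa: E402 (import after path setup)

# 2. ENGINE CONFIGURATION
client = genai.Client(api_key=settings.gemini_api_key)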

print(f"LOG: Analytics system ready. ROOT directory: {os.getcwd()}")

LOG: Analytics system ready. ROOT directory: c:\Users\takze\OneDrive\Pulpit\project\omni-operator-v1


## 1. Data Contract Definition (Structured Output)

A strict schema is crucial for Stage 3 (FFmpeg Editing). Gemini must return data in a format that Python can parse and validate without manual cleanup.

In [5]:
class ShotCandidate(BaseModel):
    """Represents a video fragment selected for viral potential."""
    start: str = Field(description="Fragment start timestamp (MM:SS format)")
    end: str = Field(description="Fragment end timestamp (MM:SS format)")
    visual_description: str = Field(description="Description of what's happening on screen at this moment")
    narrative_hook: str = Field(description="Why this moment will capture viewer's attention")
    score: int = Field(description="Viral potential on a scale of 1-10")

class VideoAnalysisReport(BaseModel):
    """Complete report from source material analysis."""
    main_topic: str = Field(description="Main topic and purpose of the recording")
    suggested_titles: List[str] = Field(description="Catchy title suggestions (max 3)")
    clips: List[ShotCandidate] = Field(description="List of suggested fragments to extract")

print("LOG: Pydantic data models initialized.")

LOG: Pydantic data models initialized.
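
The `start` and `end` fields are plain `MM:SS` strings, so any later stage has to turn them into numbers before it can compare or cut clips. Below is a minimal sketch of that conversion using only the standard library. The helper names `to_seconds` and `clip_duration` are hypothetical and do not appear elsewhere in this project.

```python
def to_seconds(timestamp: str) -> int:
    """Convert an 'MM:SS' timestamp (the format used by ShotCandidate) to whole seconds."""
    minutes_text, seconds_text = timestamp.strip().split(":")
    minutes, seconds = int(minutes_text), int(seconds_text)
    if minutes < 0 or not 0 <= seconds < 60:
        raise ValueError(f"Invalid MM:SS timestamp: {timestamp!r}")
    return minutes * 60 + seconds


def clip_duration(start: str, end: str) -> int:
    """Return the clip length in seconds; reject clips that do not end after they start."""
    duration = to_seconds(end) - to_seconds(start)
    if duration <= 0:
        raise ValueError(f"Clip end {end!r} is not after start {start!r}")
    return duration


print(to_seconds("01:30"))              # 90
print(clip_duration("00:05", "00:25"))  # 20
```

A check like `15 <= clip_duration(clip.start, clip.end) <= 60` could then verify the length rule that the prompt asks for, since the schema itself only constrains the shape of the data.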


## 2. Media File Management (Google File API)

Before analysis, the video file must be uploaded through the Google File API and processed on Google's side. Gemini 3 Flash can only reference the file once it reaches the `ACTIVE` state.

In [6]:
def upload_to_gemini(file_path: str):
    """Uploads file to API and monitors processing status."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Error: File {file_path} not found in ROOT directory.")

    print(f"LOG: Uploading material {file_path} to Google Cloud...")
    
    # NEW SDK: client.files.upload
    with open(file_path, "rb") as f:
        media_file = client.files.upload(file=f, config={"mime_type": "video/mp4"})
    
    # Waiting for processing by Google servers
    while media_file.state.name == "PROCESSING":
        print(".", end="", flush=True)
        time.sleep(3)
        # NEW SDK: client.files.get
        media_file = client.files.get(name=media_file.name)
        
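    # Any final state other than ACTIVE (e.g. FAILED) means the file cannot be used
    if media_file.state.name != "ACTIVE":
        raise RuntimeError(
            f"Video processing by Google API ended in state {media_file.state.name}."
        )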
        
    print(f"\nLOG: Material active. URI: {media_file.uri}")
    return media_file
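
The polling loop above runs for as long as the file stays in `PROCESSING`. One possible hardening is to add a deadline. The sketch below shows the idea in isolation: `wait_until_active` is a hypothetical helper, and the `fetch_state` callable stands in for the real status lookup so the example runs without network access.

```python
import time
from typing import Callable


def wait_until_active(
    fetch_state: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 3.0,
) -> str:
    """Poll fetch_state() until it stops returning 'PROCESSING' or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state != "PROCESSING":
            if state != "ACTIVE":
                raise RuntimeError(f"File processing ended in state {state}")
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"File still PROCESSING after {timeout_s} s")
        time.sleep(poll_interval_s)


# Stand-in for the real status call. In this notebook it would be something like:
#   lambda: client.files.get(name=media_file.name).state.name
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_active(lambda: next(fake_states), poll_interval_s=0.01))  # ACTIVE
```

The default timeout and interval are illustrative values, not recommendations from the API documentation.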

## 3. Executing Analysis (Native Vision & Reasoning)

We launch the analysis process. We use the `response_schema` mechanism to force the Gemini 3 Flash model to strictly adhere to our data structure.

In [7]:
# CRITICAL: Make sure the test_video.mp4 file is in your root folder!
INPUT_FILE = "test_video.mp4"

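# NOTE: declared async so the notebook cell can `await` it; the SDK calls inside
# are synchronous and block until they return.
async def run_multimodal_analysis():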
    try:
        # 1. Media upload
        video_handle = upload_to_gemini(INPUT_FILE)
        
        # 2. Model definition (NEW SDK - provide as string)
        model_id = "gemini-3-flash-preview" 
        
        prompt = (
            "Analyze this video recording for creating short content (Shorts/TikTok). "
            "Identify the main topic and select the 3 best moments. "
            "IMPORTANT: Each clip must last between 15 and 60 seconds. "
            "Return the result as clean JSON compliant with the VideoAnalysisReport structure."
        )
        
        # 3. Analysis with schema enforcement (NEW SDK)
        print("LOG: Agent is analyzing image and audio...")
        response = client.models.generate_content(
            model=model_id,
            contents=[video_handle, prompt],
            config={
                "response_mime_type": "application/json",
                "response_schema": VideoAnalysisReport
            }
        )
        
        # 4. Parsing and validation
        # NEW SDK: try using response.parsed or fallback to text
        if hasattr(response, 'parsed') and response.parsed:
            report = response.parsed
        else:
            report_data = json.loads(response.text)
            report = VideoAnalysisReport.model_validate(report_data)
        
        # 5. Results presentation
        print("\n" + "="*45)
        print(f"🚀 REPORT: {report.main_topic.upper()}")
        print("="*45)
        for i, clip in enumerate(report.clips, 1):
            print(f"CLIP {i}: [{clip.start} - {clip.end}] (Score: {clip.score}/10)")
            print(f"VISUAL: {clip.visual_description}")
            print(f"HOOK: {clip.narrative_hook}\n")
            
        return report

    except Exception as e:
        print(f"❌ OPERATIONAL ERROR: {str(e)}")
        import traceback
        traceback.print_exc()

# EXECUTION
analysis_result = await run_multimodal_analysis()

LOG: Uploading material test_video.mp4 to Google Cloud...
......
LOG: Material active. URI: https://generativelanguage.googleapis.com/v1beta/files/8tq1dytjkitn
LOG: Agent is analyzing image and audio...

🚀 REPORT: A COLLECTION OF HIGH-STAKES SCIENCE FICTION ACTION AND DRAMATIC SCENES FEATURING ADVANCED ROBOTICS, FUTURISTIC WARFARE, AND EMOTIONAL CHARACTER CONFRONTATIONS.
CLIP 1: [00:05 - 00:25] (Score: 9/10)
VISUAL: A dramatic close-up of a woman with bionic eye implants confronting a man as futuristic HUDs glitch in the background while a giant machine looms.
HOOK: Emotional betrayal meets futuristic technology in this high-tension confrontation that immediately draws the viewer into a complex sci-fi world.

CLIP 2: [01:30 - 02:00] (Score: 10/10)
VISUAL: Giant mechanical walkers invade a grand, overgrown cathedral-like building while a soldier engages in a desperate shootout while swinging from a rope.
HOOK: Witness an epic, high-octane battle between human resistance and massive m

## STATUS: MODULE 01 COMPLETED

The input data for editing is ready. The system "understood" the video and identified the moments to extract.
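
To illustrate how a clip from the report could feed Stage 3, here is a rough sketch that turns one `start`/`end` pair into an FFmpeg command line. Stage 3 is not part of this notebook, so everything here is an assumption: the function name `build_cut_command`, the output file name, and the choice of stream copy over re-encoding. The command is only built and printed, not executed.

```python
from typing import List


def build_cut_command(source: str, start: str, end: str, output: str) -> List[str]:
    """Build an ffmpeg argument list that cuts the span from start to end out of source."""
    return [
        "ffmpeg",
        "-y",           # overwrite the output file if it already exists
        "-i", source,
        "-ss", start,   # start position (ffmpeg accepts MM:SS)
        "-to", end,     # stop position
        "-c", "copy",   # stream copy, no re-encoding; cut points may snap to keyframes
        output,
    ]


# Timestamps of CLIP 1 from the report printed above; "clip_01.mp4" is a made-up name.
command = build_cut_command("test_video.mp4", "00:05", "00:25", "clip_01.mp4")
print(" ".join(command))
```

The list form can be passed to `subprocess.run(command, check=True)`. If frame-accurate cuts are required, re-encoding instead of `-c copy` is the usual trade-off.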

**Next steps:**
1. Save the notebook.
2. Go to `notebooks/02_AGENT_COPYWRITER.ipynb`.