# First‑Order Motion Model + Diff2Lip Demo

This notebook demonstrates a two‑stage talking‑head pipeline:

1. **FOMM** (First‑Order Motion Model) animates a still source image with the head motion and expressions of a driving video.
2. **Diff2Lip** re‑synthesizes the mouth region with a diffusion model so that the lips follow an audio track.

---

## 1. Install Requirements

In [None]:
# Clone repos & install dependencies (once per machine)
!git clone https://github.com/AliaksandrSiarohin/first-order-model.git fomm
!git clone https://github.com/YuanGary/DiffusionLi.git diff2lip
!pip install -r fomm/requirements.txt
!pip install -r diff2lip/requirements.txt
!pip install torch torchvision diffusers accelerate transformers
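
Before moving on, it can help to confirm that the installs above actually produced importable modules. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module list is an assumption taken from the imports used later in this notebook.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed list: the third-party modules imported in the next section.
required = ["torch", "cv2", "numpy", "imageio"]
print("Missing modules:", missing_modules(required) or "none")
```

If anything is reported as missing, re-run the install cell before continuing.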

## 2. Imports & Utility Functions

In [None]:
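import os, sys, torch, cv2, numpy as np
# demo.py uses repo-relative imports (e.g. `from modules...`), so the clone must be on sys.path
sys.path.insert(0, "fomm")
from fomm.demo import load_checkpoints, make_animation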
from diff2lip.diff2lip import Diff2Lip
from IPython.display import HTML
from base64 import b64encode
import imageio

# Utility to show video inline (embeds the whole file as a base64 data URL)
def show_video(path, width=320):
    with open(path, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML(f'<video width={width} controls src="{data_url}"></video>')
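
Because the video is embedded as base64, the notebook grows by roughly four thirds of the MP4 size for every clip shown. A minimal sketch of a size guard follows; the helper name `video_data_url` and the `max_mb` limit are hypothetical and not part of the original utility.

```python
import os
from base64 import b64encode

def video_data_url(path, max_mb=20):
    """Return a base64 data URL for an MP4, refusing files larger than max_mb."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(f"{path} is {size_mb:.1f} MB; limit is {max_mb} MB")
    with open(path, "rb") as f:
        return "data:video/mp4;base64," + b64encode(f.read()).decode()
```

The returned string can be used as the `src` attribute of the `<video>` tag in the same way as `data_url` above.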

## 3. Load Models

In [None]:
# 3.1 FOMM: generator & keypoint detector
fomm_config = "fomm/config/vox-256.yaml"
fomm_checkpoint = "fomm/checkpoints/first_order_model.pth" # Make sure to download this checkpoint
# The upstream load_checkpoints selects the device with cpu=True/False, not device=...
generator, kp_detector = load_checkpoints(fomm_config, fomm_checkpoint, cpu=False)

# 3.2 Diff2Lip
diff2lip_model = Diff2Lip(
    vpn_model="diff2lip/pretrained/vpn.pth",      # if provided
    diff_model="diff2lip/pretrained/diff2lips.pth",
    device="cuda"
)
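
Neither repository ships its weights in the clone, so the paths above only work after the files have been downloaded. As a small sketch, the check below lists any checkpoint that is not yet in place; the helper name `missing_files` is hypothetical, and the paths are simply the ones used in this section.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths that do not point to an existing regular file."""
    return [p for p in paths if not Path(p).is_file()]

checkpoints = [
    "fomm/checkpoints/first_order_model.pth",
    "diff2lip/pretrained/vpn.pth",        # marked "if provided" above, so possibly optional
    "diff2lip/pretrained/diff2lips.pth",
]
missing = missing_files(checkpoints)
if missing:
    print("Download these files before loading the models:", missing)
```

Running this before the loading cell gives a clearer message than a failure inside the model loaders.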

## 4. Prepare Inputs

In [None]:
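# FOMM (vox-256) expects 256x256 RGB frames as floats in [0, 1]
def load_rgb_256(path):
    img = cv2.imread(path)  # returns None if the file cannot be read
    if img is None:
        raise FileNotFoundError(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # BGR→RGB
    return cv2.resize(img, (256, 256)).astype(np.float32) / 255.0

# Still source image (one frame)
source_image = load_rgb_256("assets/alice.png")

# Driving video (30 frames, i.e. 3 s at the 10 fps used when saving)
driving_video = [load_rgb_256(f"assets/driver/{i:03d}.png") for i in range(30)]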
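
`cv2.imread` returns `None` instead of raising when a file cannot be read, so it is worth confirming that every numbered frame exists before loading. The sketch below assumes the zero-padded `000.png`, `001.png`, ... naming used above; the helper name `frame_paths` is hypothetical.

```python
import os

def frame_paths(directory, count, pattern="{:03d}.png"):
    """Build the numbered frame paths and fail early if any file is missing."""
    paths = [os.path.join(directory, pattern.format(i)) for i in range(count)]
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} frame(s) missing, first: {missing[0]}"
        )
    return paths
```

For the clip used here, `frame_paths("assets/driver", 30)` would return the 30 paths to read, or raise with the first missing file.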

## 5. Stage 1 – FOMM Animation

In [None]:
# Run FOMM to produce initial talking-head clip
# (the upstream make_animation selects the device with cpu=True/False, not device=...)
with torch.no_grad():
    fomm_result = make_animation(
        source_image, driving_video,
        generator, kp_detector,
        relative=True, adapt_movement_scale=True,
        cpu=False
    )

# Save interim video (imageio expects RGB uint8 frames; FOMM returns RGB floats in [0, 1])
fomm_path = "output_fomm.mp4"
writer = imageio.get_writer(fomm_path, fps=10)
for frame in fomm_result:
    writer.append_data((np.clip(frame, 0, 1) * 255).astype(np.uint8))
writer.close()

show_video(fomm_path, width=480)

## 6. Stage 2 – Diff2Lip Refinement

In [None]:
# Diff2Lip expects a video file + WAV audio
audio_file = "assets/hello.wav"

# Run Diff2Lip to replace mouth region
refined_path = "output_diff2lip.mp4"
diff2lip_model.render(
    video_in=fomm_path,
    audio_in=audio_file,
    video_out=refined_path,
    upscale_factor=1
)

show_video(refined_path, width=480)
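
Since the lips are driven by the audio, the WAV file and the FOMM clip should cover about the same time span: 30 frames written at 10 fps last 3 s. How Diff2Lip handles a length mismatch is not covered here, so the sketch below only measures both durations using the standard `wave` module (PCM WAV files only). The helper names are hypothetical.

```python
import wave

def wav_duration_seconds(path):
    """Duration of a PCM WAV file: number of sample frames / sample rate."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def video_duration_seconds(num_frames, fps):
    """Duration of a clip written with a constant frame rate."""
    return num_frames / fps
```

Comparing `wav_duration_seconds("assets/hello.wav")` with `video_duration_seconds(30, 10)` before rendering shows whether the audio or the driving clip needs trimming.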

## 7. Conclusion

You now have a high‑fidelity talking‑head MP4:

- Full head motion and expressions from FOMM.
- Diffusion‑quality lip sync from Diff2Lip.

From here, the pipeline can be integrated into your Avatar Renderer Pod to generate avatar videos.

### Next steps:

- Swap in different driving videos to transfer other head motions and expressions.
- Fine‑tune a checkpoint on your own face dataset.
- Hook into the MCP‑server (`mcp_server.py`) for orchestration.