<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/MeiGen_AI_MultiTalk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NOTE: Run this notebook on an A100 GPU
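
Before installing anything, it can help to confirm which GPU the runtime actually provides. The cell below is a minimal sketch, not part of the original notebook: it assumes `nvidia-smi` is on the `PATH` (as it normally is on Colab GPU runtimes), and the helper names `parse_gpu_lines` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_lines(csv_text):
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name cannot break parsing.
        name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append((name, int(mem_mib)))
    return gpus


def query_gpus():
    """Return a list of (name, memory_MiB) tuples, or [] when no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return []
    return parse_gpu_lines(result.stdout)


gpus = query_gpus()
if not gpus:
    print("No NVIDIA GPU detected; switch the Colab runtime to an A100.")
for name, mem_mib in gpus:
    note = "" if "A100" in name else " (this notebook recommends an A100)"
    print(f"{name}: {mem_mib} MiB{note}")
```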

# 🛠️Installation


In [None]:
# 1. install pytorch, xformers

!pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
!pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121


In [None]:
# 2. Flash-attn installation:
!pip install "misaki[en]" ninja psutil packaging
!pip install flash_attn==2.7.4.post1

In [None]:
!pip list | grep -E "librosa|ffmpeg"


In [1]:
!git clone https://github.com/weedge/MultiTalk.git

Cloning into 'MultiTalk'...
remote: Enumerating objects: 281, done.
remote: Counting objects: 100% (145/145), done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 281 (delta 99), reused 48 (delta 48), pack-reused 136 (from 1)
Receiving objects: 100% (281/281), 21.10 MiB | 15.41 MiB/s, done.
Resolving deltas: 100% (111/111), done.


In [None]:
!cd /content/MultiTalk && pip install -r requirements.txt


In [None]:
!pip install -q transformers==4.49.0

In [None]:
!pip list | grep -E "torch|transformers"

In [None]:
!pip uninstall -y tensorflow jax jaxlib
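
After the installation cells, the pinned versions can also be checked from Python instead of reading `pip list` output by eye. This is a minimal sketch under a few assumptions: the `PINS` table simply repeats the versions pinned in the cells above, the helper names are hypothetical, and local version suffixes such as `+cu121` are ignored when comparing.

```python
from importlib import metadata

# Versions copied from the pip commands in this notebook.
PINS = {
    "torch": "2.4.1",
    "torchvision": "0.19.1",
    "torchaudio": "2.4.1",
    "xformers": "0.0.28",
    "flash_attn": "2.7.4.post1",
    "transformers": "4.49.0",
}


def base_version(version):
    """Drop a local suffix such as '+cu121' so '2.4.1+cu121' compares equal to '2.4.1'."""
    return version.split("+", 1)[0]


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (wanted, found)}; found is None when the package is missing."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = (wanted, None)
            continue
        if base_version(found) != wanted:
            problems[name] = (wanted, found)
    return problems


for name, (wanted, found) in find_mismatches(PINS).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty output means every pinned package is installed at the expected version.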

# 🧱Model Preparation


#### 1. Model Download

| Models | Download Link | Notes |
| --- | --- | --- |
| Wan2.1-I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) | Base model |
| chinese-wav2vec2-base | 🤗 [Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base) | Audio encoder |
| Kokoro-82M | 🤗 [Huggingface](https://huggingface.co/hexgrad/Kokoro-82M) | TTS weights |
| MeiGen-MultiTalk | 🤗 [Huggingface](https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk) | Our audio condition weights |


In [7]:
!huggingface-cli download --quiet Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P

/content/weights/Wan2.1-I2V-14B-480P


In [8]:
!huggingface-cli download --quiet TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base


/content/weights/chinese-wav2vec2-base


In [9]:
!huggingface-cli download --quiet TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base


weights/chinese-wav2vec2-base/model.safetensors


In [10]:
!huggingface-cli download --quiet hexgrad/Kokoro-82M --local-dir ./weights/Kokoro-82M


/content/weights/Kokoro-82M


In [11]:
!huggingface-cli download --quiet MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk

/content/weights/MeiGen-MultiTalk


#### 2. Link or Copy the MultiTalk Model to the Wan2.1-I2V-14B-480P Directory

In [12]:
!mv weights/Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json weights/Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json_old
!cp weights/MeiGen-MultiTalk/diffusion_pytorch_model.safetensors.index.json weights/Wan2.1-I2V-14B-480P/
!cp weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/
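
Re-running the shell cell above would move the already-replaced index over the `_old` backup and lose the original file. The following is a hedged Python sketch of the same three steps that keeps the first backup intact; the function name `install_multitalk_weights` is hypothetical, and the paths are the ones used in this notebook.

```python
import shutil
from pathlib import Path

INDEX_NAME = "diffusion_pytorch_model.safetensors.index.json"


def install_multitalk_weights(base_dir, multitalk_dir):
    """Back up the base index once, then copy the MultiTalk index and weights in."""
    base_dir, multitalk_dir = Path(base_dir), Path(multitalk_dir)
    index = base_dir / INDEX_NAME
    backup = base_dir / (INDEX_NAME + "_old")
    # Only create the backup the first time, so a re-run cannot overwrite it.
    if index.exists() and not backup.exists():
        index.rename(backup)
    shutil.copy2(multitalk_dir / INDEX_NAME, index)
    shutil.copy2(multitalk_dir / "multitalk.safetensors", base_dir / "multitalk.safetensors")
    return index, backup


# Run only when the downloaded weights are present in the working directory.
if Path("weights/MeiGen-MultiTalk").is_dir():
    install_multitalk_weights("weights/Wan2.1-I2V-14B-480P", "weights/MeiGen-MultiTalk")
```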

# 🔑 Quick Inference

Our model is compatible with both 480P and 720P resolutions, but the current code only supports 480P inference. 720P inference requires multiple GPUs, and we will provide an update soon.

> Some tips
> - Lip synchronization accuracy: Audio CFG works best between 3 and 5. Increase the audio CFG value for better synchronization.
> - Video clip length: The model was trained on 81-frame videos at 25 FPS. For the best prompt-following performance, generate clips of 81 frames. Generating up to 201 frames is possible, though longer clips might reduce prompt-following performance.
> - Long video generation: Audio CFG influences color tone consistency across segments. Set this value to 3 to reduce tonal variations.
> - Sampling steps: To generate a video faster, you can reduce the sampling steps to as few as 10. This does not hurt lip synchronization accuracy, but it does reduce motion and visual quality. More sampling steps give better video quality.
> - TeaCache acceleration: The optimal range for `--teacache_thresh` is between 0.2 and 0.5. Increasing this value speeds up generation further, but may also reduce the quality of the generated video.
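
To make the clip-length tip concrete, the frame counts can be converted to seconds. This is a small worked calculation, assuming the generated video plays back at the same 25 FPS the model was trained on; the helper name `clip_seconds` is hypothetical.

```python
FPS = 25  # training frame rate stated in the tips above


def clip_seconds(num_frames, fps=FPS):
    """Duration in seconds of a clip with the given number of frames."""
    return num_frames / fps


for frames in (81, 201):
    print(f"{frames} frames at {FPS} FPS = {clip_seconds(frames):.2f} s")
```

Under that assumption, an 81-frame clip lasts a little over 3 seconds and a 201-frame clip about 8 seconds.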

#### Usage of MultiTalk
```
--mode streaming: long video generation.
--mode clip: generate short video with one chunk.
--use_teacache: run with TeaCache.
--size multitalk-480: generate 480P video.
--size multitalk-720: generate 720P video.
--use_apg: run with APG.
--teacache_thresh: a coefficient used for TeaCache acceleration.
--sample_text_guide_scale: when not using LoRA, the optimal value is 5. After applying LoRA, the recommended value is 1.
--sample_audio_guide_scale: when not using LoRA, the optimal value is 4. After applying LoRA, the recommended value is 2.
```
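
The flag combinations above can be assembled programmatically so the guidance scales follow the LoRA setting. The sketch below is illustrative only: `guide_scales` and `build_command` are hypothetical helpers, the default paths are the ones used in this notebook, and the scale values are taken from the usage notes (the FusioniX cells later in this notebook pass 1.0 for both scales and omit `--use_teacache`, so adjust as needed).

```python
import shlex

WEIGHTS = "/content/weights"  # weight directory used throughout this notebook


def guide_scales(use_lora):
    """Return (text_scale, audio_scale) following the usage notes above."""
    return (1.0, 2.0) if use_lora else (5.0, 4.0)


def build_command(input_json, save_file, lora_path=None, sample_steps=40,
                  mode="streaming", use_teacache=True):
    """Build the argument list for generate_multitalk.py."""
    text_scale, audio_scale = guide_scales(lora_path is not None)
    cmd = [
        "python", "generate_multitalk.py",
        "--ckpt_dir", f"{WEIGHTS}/Wan2.1-I2V-14B-480P",
        "--wav2vec_dir", f"{WEIGHTS}/chinese-wav2vec2-base",
        "--input_json", input_json,
        "--sample_steps", str(sample_steps),
        "--mode", mode,
        "--sample_text_guide_scale", str(text_scale),
        "--sample_audio_guide_scale", str(audio_scale),
        "--save_file", save_file,
    ]
    if lora_path is not None:
        cmd += ["--lora_dir", lora_path, "--lora_scale", "1.0"]
    if use_teacache:
        cmd.append("--use_teacache")
    return cmd


# Print a shell-ready command line for a single-person run without LoRA.
print(shlex.join(build_command("examples/single_example_1.json", "single_exp")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the `MultiTalk` directory, or the printed string can be pasted after a `!` in a notebook cell.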


## 1. 🏃🏻Single-Person

In [2]:
%cd /content/MultiTalk

/content/MultiTalk




If you want to run with very low VRAM, set `--num_persistent_param_in_dit 0`.

In [3]:
# Run with very low VRAM
!python generate_multitalk.py \
    --ckpt_dir /content/weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir /content/weights/chinese-wav2vec2-base \
    --input_json /content/MultiTalk/examples/single_example_1.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file single_long_lowvram_exp

Loading checkpoint shards: 100% 8/8 [00:03<00:00,  2.64it/s]
teacache_init
teacache_init done
100% 40/40 [1:19:43<00:00, 119.59s/it]
^C


In [None]:
# Run with TTS
!python generate_multitalk.py \
    --ckpt_dir /content/weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir /content/weights/chinese-wav2vec2-base \
    --input_json /content/MultiTalk/examples/single_example_tts_1.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file single_long_lowvram_tts_exp \
    --audio_mode tts

## 2. 🏃🏻🏃🏻Multi-Person

In [None]:
%cd /content/MultiTalk

In [None]:
# Run with very low VRAM
!python generate_multitalk.py \
    --ckpt_dir /content/weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir /content/weights/chinese-wav2vec2-base \
    --input_json /content/MultiTalk/examples/multitalk_example_2.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file multi_long_lowvram_exp

In [None]:
# Run with TTS
!python generate_multitalk.py \
    --ckpt_dir /content/weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir /content/weights/chinese-wav2vec2-base \
    --input_json /content/MultiTalk/examples/multitalk_example_tts_1.json \
    --sample_steps 40 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --use_teacache \
    --save_file multi_long_lowvram_tts_exp \
    --audio_mode tts

## 3. 📺 Run with FusioniX and CausVid (requires only 4~8 steps)
[FusioniX](https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/FusionX_LoRa/Wan2.1_I2V_14B_FusionX_LoRA.safetensors) requires 8 steps, and [lightx2v](https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors) requires only 4 steps.
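
The cells in this section expect the LoRA file at `/content/weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors`, but no earlier cell downloads it. The following is a minimal sketch of one way to fetch it, assuming the `huggingface_hub` package is installed; `parse_hf_blob_url` and `download_lora` are hypothetical helpers, and the URL is the FusioniX link above.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse

FUSIONX_URL = (
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/"
    "FusionX_LoRa/Wan2.1_I2V_14B_FusionX_LoRA.safetensors"
)


def parse_hf_blob_url(url):
    """Split a Hugging Face 'blob' URL into (repo_id, revision, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<revision>/<path inside the repo>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a Hugging Face blob URL: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def download_lora(url, weights_dir="/content/weights"):
    """Download the file behind `url` and copy it flat into `weights_dir`."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs network access

    repo_id, revision, filename = parse_hf_blob_url(url)
    cached = hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
    target = Path(weights_dir) / Path(filename).name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(cached, target)
    return target


print(parse_hf_blob_url(FUSIONX_URL))
```

Calling `download_lora(FUSIONX_URL)` in a Colab cell should then place the file where `--lora_dir` in the cells below points.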

In [None]:
!python generate_multitalk.py \
    --ckpt_dir /content/weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir /content/weights/chinese-wav2vec2-base \
    --input_json /content/MultiTalk/examples/single_example_1.json \
    --lora_dir /content/weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --lora_scale 1.0 \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 1.0 \
    --sample_steps 8 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --save_file single_long_lowvram_fusionx_exp \
    --sample_shift 2

In [None]:
!python generate_multitalk.py \
    --ckpt_dir /content/weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir /content/weights/chinese-wav2vec2-base \
    --input_json /content/MultiTalk/examples/multitalk_example_2.json \
    --lora_dir /content/weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --lora_scale 1.0 \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 1.0 \
    --sample_steps 8 \
    --mode streaming \
    --num_persistent_param_in_dit 0 \
    --save_file multi_long_lowvram_fusionx_exp \
    --sample_shift 2