## Wav2Lip
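**[Wav2Lip](https://arxiv.org/pdf/2008.10010.pdf)** is a GAN-based model that generates talking-face video driven by speech. As the figure below shows, the network has three parts: a generator, a discriminator, and a pre-trained Lip-Sync Expert. It takes two inputs, an arbitrary video and a speech clip, and outputs a lip-synced video. The generator follows an encoder-decoder design: two encoders, a speech encoder and an identity encoder, encode the input speech and the face frames respectively; their outputs are concatenated and passed to the face decoder, which decodes them into the output video frames. The Visual Quality Discriminator constrains the quality of the generated frames and improves the sharpness of the generated video. To better enforce lip sync, Wav2Lip adds a pre-trained lip-sync discriminator, the Pre-trained Lip-sync Expert, whose output serves as an additional loss measuring how well the generated lips match the audio.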

### Lip-Sync Expert
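The Lip-sync Expert is based on **[SyncNet](https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/)**, a network that judges whether speech and video are in sync. As the figure below shows, SyncNet also takes two inputs: MFCC speech features and video frames of the mouth region. Two CNN-based encoders reduce the dimensionality of the speech and of the frames and extract features, mapping both into the same embedding space. A contrastive loss then measures lip sync: a larger value means the pair is less in sync, and a smaller value means it is more in sync. Wav2Lip modifies the SyncNet architecture further: the network is deeper, it adds residual connections, and the input speech features are replaced with mel-spectrogram features.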
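
The contrastive objective described above can be illustrated with a small sketch. It assumes the standard margin-based formulation, in which a synced pair is penalized by its squared distance and an unsynced pair only while it sits inside a margin. The function names, the margin of 1.0, and the toy embeddings are illustrative assumptions, not taken from the SyncNet or Wav2Lip code.

```python
import math


def euclidean_distance(a, b):
    """Distance between an audio embedding and a video embedding."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(audio_emb, video_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for a single audio/video pair.

    in_sync=True  -> the loss grows with the distance (pull the pair together).
    in_sync=False -> the loss is non-zero only while the distance is below
                     `margin` (push the pair apart).
    """
    d = euclidean_distance(audio_emb, video_emb)
    if in_sync:
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2


# Toy 2-D embeddings (hypothetical values).
audio = [0.0, 0.0]
close_video = [0.3, 0.4]  # distance 0.5 from `audio`
far_video = [3.0, 4.0]    # distance 5.0 from `audio`

print(contrastive_loss(audio, close_video, in_sync=True))
print(contrastive_loss(audio, far_video, in_sync=True))
print(contrastive_loss(audio, far_video, in_sync=False))
```

For a pair labeled as synced, the far embedding yields the larger loss, which matches the reading above that a larger value means the pair is less in sync.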

## 1. Environment setup
- A Linux machine with a GPU is recommended; a third-party cloud service such as Google Colab also works.
- Python 3.6 or later.
- ffmpeg: `sudo apt-get install ffmpeg`
- Required Python packages: all of them are listed in `requirements.txt` and can be installed in one step with `pip install -r requirements.txt`.
- This experiment relies on face detection, so download the face detection [pre-trained model](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth) and save it as `face_detection/detection/sfd/s3fd.pth`.
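
Before running the later cells, it can help to confirm these prerequisites programmatically. The following is a minimal sketch using only the standard library; the function name `check_environment` and the returned keys are hypothetical, and the weights path is the one given in the list above.

```python
import os
import shutil
import sys

# Path taken from the setup list above.
S3FD_WEIGHTS = "face_detection/detection/sfd/s3fd.pth"


def check_environment(weights_path=S3FD_WEIGHTS, min_python=(3, 6)):
    """Return a dict mapping each prerequisite to True (present) or False."""
    return {
        # Interpreter version is at least the required one.
        "python": sys.version_info[:2] >= min_python,
        # An `ffmpeg` executable is reachable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # The face detector weights were saved to the expected file.
        "s3fd_weights": os.path.isfile(weights_path),
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The check does not verify the packages in `requirements.txt`; `pip install -r requirements.txt` reports those itself.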

In [3]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting librosa>=0.7.0 (from -r requirements.txt (line 1))
  Using cached librosa-0.10.0.post2-py3-none-any.whl (253 kB)
Collecting opencv-contrib-python>=4.2.0.34 (from -r requirements.txt (line 3))
  Using cached opencv_contrib_python-4.8.0.74-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (67.8 MB)
Collecting audioread>=2.1.9 (from librosa>=0.7.0->-r requirements.txt (line 1))
  Using cached audioread-3.0.0-py3-none-any.whl
Collecting soundfile>=0.12.1 (from librosa>=0.7.0->-r requirements.txt (line 1))
  Using cached soundfile-0.12.1-py2.py3-none-manylinux_2_17_x86_64.whl (1.3 MB)
Collecting pooch<1.7,>=1.0 (from librosa>=0.7.0->-r requirements.txt (line 1))
  Downloading pooch-1.6.0-py3-none-any.whl (56 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 kB 3.0 MB/s eta 0:00:00
Collecting soxr>=0.3.2 (from librosa>=0.7.0->-r require

## 2. Dataset preparation and preprocessing

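**Downloading the LRS2 dataset**  
The dataset for this experiment is available at <a href="http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html">LRS2 dataset</a>. Downloading it requires permission from the BBC: you need to email a request to obtain a download key, as described on that page. After downloading, extract the dataset into the `mvlrs_v1/` folder in this directory, and move the LRS2 file lists `train.txt`, `val.txt`, and `test.txt` into the `filelists/` folder. The resulting dataset layout is shown below.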
```
data_root (mvlrs_v1)
├── main, pretrain (we only use the data under main)
|	├── list of folders
|	│   ├── five-digit video IDs ending in .mp4
```
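**Dataset preprocessing**

Most videos in the dataset show the speaker's upper body or full body, while our model only needs the face. During preprocessing we therefore split each video into individual frames, then use the `face detection` package to locate and crop the face, keeping only the face images. We also extract the audio track from each video.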
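
For the audio step, one way to separate the speech track is to call ffmpeg once per clip. The sketch below only builds and prints the command. The helper names, the example paths, and the chosen options (`-vn` to drop the video stream, `-ar 16000` to resample to 16 kHz) are assumptions for illustration and may differ from what `preprocess.py` actually runs.

```python
import os
import subprocess


def build_audio_extract_cmd(video_path, out_dir, sample_rate=16000):
    """Build an ffmpeg command that writes <out_dir>/audio.wav from a video."""
    wav_path = os.path.join(out_dir, "audio.wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # ignore the video stream
        "-ar", str(sample_rate),  # output audio sample rate (assumed value)
        wav_path,
    ]


def extract_audio(video_path, out_dir):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_audio_extract_cmd(video_path, out_dir), check=True)


# Hypothetical paths, printed only; nothing is executed here.
cmd = build_audio_extract_cmd("clip/00001.mp4", "preprocessed/clip/00001")
print(" ".join(cmd))
```

Passing the command as a list rather than a formatted string avoids shell quoting problems when paths contain spaces.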

In [None]:
# !wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O face_detection/detection/sfd/s3fd.pth

In [2]:
!rm -rf ../LSR2/demo
!mkdir -p ../LSR2/demo
!cp -r ../LSR2/main/553* ../LSR2/demo/

In [6]:
!rm -rf ../LSR2/lrs2_preprocessed
!python preprocess.py --data_root "../LSR2/main" --preprocessed_root "../LSR2/lrs2_preprocessed" --batch_size 128
# !python preprocess.py --data_root "../LSR2/demo" --preprocessed_root "../LSR2/lrs2_preprocessed" --batch_size 128

Started processing for ../LSR2/demo with 1 GPUs
100%|█████████████████████████████████████████| 432/432 [07:59<00:00,  1.11s/it]
Dumping audios...
100%|█████████████████████████████████████████| 432/432 [00:08<00:00, 49.30it/s]


After preprocessing, the `lrs2_preprocessed/` folder has the following structure:
```
preprocessed_root (lrs2_preprocessed)
├── list of folders
|	├── five-digit video ID
|	│   ├── *.jpg
|	│   ├── audio.wav
```
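
Since training reads frames and audio from this layout, a quick consistency check can catch clips where face detection or audio extraction failed. The sketch below assumes the two-level layout shown above (a folder, then a clip ID containing `*.jpg` frames and `audio.wav`); the function name and the IDs in the self-check are hypothetical.

```python
import os
import tempfile


def list_valid_clips(preprocessed_root):
    """Return sorted "<folder>/<clip>" entries that have frames and audio."""
    valid = []
    for folder in sorted(os.listdir(preprocessed_root)):
        folder_path = os.path.join(preprocessed_root, folder)
        if not os.path.isdir(folder_path):
            continue
        for clip in sorted(os.listdir(folder_path)):
            clip_path = os.path.join(folder_path, clip)
            if not os.path.isdir(clip_path):
                continue
            files = os.listdir(clip_path)
            has_frames = any(name.endswith(".jpg") for name in files)
            if has_frames and "audio.wav" in files:
                valid.append(f"{folder}/{clip}")
    return valid


# Small self-check on a throwaway directory (hypothetical IDs).
with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, "1234567", "00001")
    bad = os.path.join(root, "1234567", "00002")
    os.makedirs(good)
    os.makedirs(bad)
    for name in ("0.jpg", "1.jpg", "audio.wav"):
        open(os.path.join(good, name), "w").close()
    open(os.path.join(bad, "0.jpg"), "w").close()  # frames but no audio
    demo_clips = list_valid_clips(root)

print(demo_clips)  # ['1234567/00001']
```

The returned entries use the same `<folder>/<clip>` form as the lines of the file lists.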

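Collect the resulting file list and write it to `filelists/train.txt` and `filelists/val.txt`; only the video names need to be stored. The code below can serve as a reference: it renames the video samples and generates the corresponding name lists. With too few video files here (fewer than 2), it raises an error: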

In [13]:
import time
from glob import glob
import shutil,os
 
from sklearn.model_selection import train_test_split
 

# Strip special characters from the names and rename the video files with uniform sequence numbers
 
# def original_video_name_format():
#     base_path = "../LSR2/main"
#     result = list(glob("{}/*".format(base_path),recursive=False))
#     file_num = 0
#     result_list = []
 
#     for each in result:
#         file_num +=1
#         new_position ="{0}{1}".format( int(time.time()),file_num)
#         result_list.append(new_position)
#         shutil.move(each, os.path.join(base_path,new_position+".mp4"))
#         pass

def trained_data_name_format():
    base_path = "../LSR2/lrs2_preprocessed"
    # result = list(glob("{}/*".format(base_path)))
    result = os.listdir(base_path)
    print(result)
    result_list = []
    for i,dirpath in enumerate(result):
        # shutil.move(dirpath,"{0}/{1}".format(base_path,i))
        # result_list.append(str(i))
        # print('dirpath:', dirpath)
        result_list.append(dirpath)
    if len(result_list)<14:
        test_result=val_result=train_result=result_list
    else:
        train_result,test_result = train_test_split(result_list,test_size=0.15, random_state=42)
        test_result, val_result = train_test_split(test_result, test_size=0.5, random_state=42)
 
    for file_name,dataset in zip(("train.txt","test.txt","val.txt"),(train_result,test_result,val_result)):
        with open(os.path.join("filelists",file_name),'w',encoding='utf-8') as fi:
            for dataset_i in dataset:
                # print('dataset_i:', dataset_i)
                # List the clip folders that belong to this video ID
                video_result = os.listdir(os.path.join(base_path, dataset_i))
                # print('video_result:', video_result)
                video_result = [dataset_i+'/'+video for video in video_result]
                fi.write("\n".join(video_result))
                fi.write("\n")
 
    # print("\n".join(result_list))

trained_data_name_format()

['5535864093654496929', '5538635636050605931', '5537751731781090844', '5537369050195015499', '5537514649586349811', '5539474443163516678', '5537522380527482610', '5536266102593401990', '5536915501648559593', '5539826200985059108', '5539535002202392187', '5539702505926936192', '5535423430009926848', '5537693749722594824', '5536968329746298779', '5539741160632598296', '5537885734760724252', '5535496873950688380', '5535415699068794046', '5537893465701857051', '5536760882825901738', '5536745420943636139', '5537143564411975377', '5539444807889172133', '5536038039829982468', '5536876846942893978']
dirpath: 5535864093654496929
dirpath: 5538635636050605931
dirpath: 5537751731781090844
dirpath: 5537369050195015499
dirpath: 5537514649586349811
dirpath: 5539474443163516678
dirpath: 5537522380527482610
dirpath: 5536266102593401990
dirpath: 5536915501648559593
dirpath: 5539826200985059108
dirpath: 5539535002202392187
dirpath: 5539702505926936192
dirpath: 5535423430009926848
dirpath: 553769374972259

Training the expert discriminator

In [14]:
!python color_syncnet_train.py --data_root ../LSR2/lrs2_preprocessed/ --checkpoint_dir ./savedmodel --checkpoint_path ./checkpoints/lipsync_expert.pth

use_cuda: True
total trainable params 16435072
Load checkpoint from: ./checkpoints/lipsync_expert.pth
Load optimizer state from ./checkpoints/lipsync_expert.pth
Loss: 0.40534352511167526: : 12it [00:05,  2.18it/s]
Loss: 0.3580533017714818: : 12it [00:02,  4.13it/s] 
Loss: 0.32754940415422124: : 12it [00:02,  4.90it/s]
Loss: 0.29246504480640095: : 12it [00:02,  5.00it/s]
Loss: 0.2891663412253062: : 12it [00:02,  4.50it/s]
Loss: 0.30349138379096985: : 12it [00:03,  3.86it/s]
Loss: 0.2882717673977216: : 12it [00:02,  5.12it/s]
Loss: 0.273677296936512: : 12it [00:02,  4.07it/s]  
Loss: 0.2897895773251851: : 12it [00:02,  5.14it/s]
Loss: 0.2912977859377861: : 12it [00:02,  4.91it/s] 
Loss: 0.25479352350036305: : 12it [00:02,  4.56it/s]
Loss: 0.27851927901307744: : 12it [00:02,  5.00it/s]
Loss: 0.2553818387289842: : 12it [00:02,  4.84it/s] 
Loss: 0.270049549639225: : 12it [00:02,  4.40it/s]  
Loss: 0.2607487055162589: : 12it [00:02,  5.01it/s] 
Loss: 0.25605521475275356: : 12it [00:02,  4.54

Run the following command to start training:

In [5]:
!python wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./checkpoints/lipsync_expert.pth --checkpoint_path ./checkpoints/wav2lip.pth


use_cuda: True
total trainable params 36298035
Load checkpoint from: ./checkpoints/wav2lip.pth
Load optimizer state from ./checkpoints/wav2lip.pth
Load checkpoint from: ./checkpoints/lipsync_expert.pth
Starting Epoch: 203
0it [00:00, ?it/s]^C


In [15]:
!python hq_wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./checkpoints/lipsync_expert.pth --checkpoint_path ./checkpoints/wav2lip.pth --disc_checkpoint_path ./checkpoints/visual_quality_disc.pth


use_cuda: True
total trainable params 36298035
total DISC trainable params 14113793
Load checkpoint from: ./checkpoints/wav2lip.pth
Load optimizer state from ./checkpoints/wav2lip.pth
Load checkpoint from: ./checkpoints/visual_quality_disc.pth
Load optimizer state from ./checkpoints/visual_quality_disc.pth
Load checkpoint from: ./checkpoints/lipsync_expert.pth
Starting Epoch: 203
L1: 0.032601705711820854, Sync: 0.0, Percep: 1.0678514177384584 | Fake: 0.5464184919129247, Real: 0.5613359776527985: : 46it [00:18,  2.44it/s]
Starting Epoch: 204
L1: 0.04400158479161884, Sync: 0.0, Percep: 1.028250489545905 | Fake: 0.5532738423865774, Real: 0.5383027455081111: : 46it [00:15,  2.97it/s]  
Starting Epoch: 205
L1: 0.041590580344200136, Sync: 0.0, Percep: 1.0814215958118438 | Fake: 0.5159455627202988, Real: 0.47654434740543367: : 10it [00:04,  3.09it/s]^C
L1: 0.041590580344200136, Sync: 0.0, Percep: 1.0814215958118438 | Fake: 0.5159455627202988, Real: 0.47654434740543367: : 10it [00:04,  2.31it/
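
The columns in the log above (`L1`, `Sync`, `Percep`) are separate loss terms. As a sketch of how such terms can be folded into one generator objective, the weighted sum below follows the form given in the Wav2Lip paper, where the reconstruction weight is whatever remains after the sync and adversarial weights. The function name and the default weights (0.03 and 0.07, the values we recall from the paper) are assumptions here, so check `hparams.py` in your checkout before relying on them.

```python
def total_generator_loss(l1, sync, adv, sync_wt=0.03, disc_wt=0.07):
    """Weighted sum of reconstruction (L1), sync and adversarial losses.

    The L1 term receives the remaining weight 1 - sync_wt - disc_wt,
    so the three weights always sum to 1.
    """
    if sync_wt < 0 or disc_wt < 0 or sync_wt + disc_wt > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return (1.0 - sync_wt - disc_wt) * l1 + sync_wt * sync + disc_wt * adv


# Values rounded from the first epoch line of the log above; the sync
# weight is set to zero here (an assumption) so that term drops out.
print(total_generator_loss(l1=0.0326, sync=0.0, adv=1.0679, sync_wt=0.0))
```

A sync weight of zero would be consistent with the `Sync: 0.0` readings in the log, although the log itself does not state the weights in use.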

In [1]:
!python inference.py --checkpoint_path ./checkpoints/wav2lip_gan.pth --face ../LSR2/demo/5539702505926936192/00001.mp4 --audio ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/audio.wav


Using cuda for inference.
Reading video frames...
Number of frames available for inference: 57
(80, 185)
Length of mel chunks: 55
  0%|                                                     | 0/1 [00:00<?, ?it/s]
  0%|                                                     | 0/4 [00:00<?, ?it/s]
 25%|███████████▎                                 | 1/4 [00:02<00:07,  2.35s/it]
 50%|██████████████████████▌                      | 2/4 [00:02<00:02,  1.11s/it]
 75%|█████████████████████████████████▊           | 3/4 [00:02<00:00,  1.41it/s]
100%|█████████████████████████████████████████████| 4/4 [00:03<00:00,  1.16it/s]
Load checkpoint from: ./checkpoints/wav2lip_gan.pth
Model loaded
100%|█████████████████████████████████████████████| 1/1 [01:37<00:00, 97.80s/it]
ffmpeg version 4.2.3 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7.5.0 (crosstool-NG 1.24.0.123_1667d2b)
  configuration: --prefix=/home/conda/feedstock_root/build_artifacts/ffmpeg_1590573566052/_h_env_pl
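
The inference log reports a mel-spectrogram of shape `(80, 185)` and 55 mel chunks. One plausible way to arrive at that count is a sliding window over the mel time axis whose start position advances in step with the video frame rate. The sketch below assumes 80 mel frames per second of audio, a window of 16 mel frames, and a 25 fps video; these constants and the function name are assumptions for illustration, not values read from the log.

```python
def mel_chunk_starts(n_mel_frames, fps, mel_step_size=16, mel_per_sec=80.0):
    """Return the start index of each mel window, one window per video frame.

    The window for video frame i starts at int(i * mel_per_sec / fps).
    When a window would run past the end of the spectrogram, one final
    window aligned to the end is emitted and the loop stops.
    """
    if n_mel_frames < mel_step_size:
        raise ValueError("spectrogram is shorter than one window")
    multiplier = mel_per_sec / fps
    starts = []
    i = 0
    while True:
        start = int(i * multiplier)
        if start + mel_step_size > n_mel_frames:
            starts.append(n_mel_frames - mel_step_size)
            break
        starts.append(start)
        i += 1
    return starts


starts = mel_chunk_starts(n_mel_frames=185, fps=25)
print(len(starts))  # 55
print(starts[:6])   # [0, 3, 6, 9, 12, 16]
```

Under these assumptions the count matches the `Length of mel chunks: 55` line in the log, and each chunk pairs with one output video frame.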

In [None]:
!cd ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/final_results/ && ffmpeg -r 25 -i %d.png 00001-sr.mp4 -y


In [3]:
!python inference.py --checkpoint_path ./checkpoints/wav2lip_gan.pth --face ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/final_results/00001-sr.mp4 --audio ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/audio.wav


Using cuda for inference.
Reading video frames...
Number of frames available for inference: 57
(80, 185)
Length of mel chunks: 55
  0%|                                                     | 0/1 [00:00<?, ?it/s]
  0%|                                                     | 0/4 [00:00<?, ?it/s]
 25%|███████████                                 | 1/4 [01:45<05:15, 105.07s/it]
 50%|██████████████████████▌                      | 2/4 [01:47<01:29, 44.69s/it]
 75%|█████████████████████████████████▊           | 3/4 [01:49<00:25, 25.24s/it]
100%|█████████████████████████████████████████████| 4/4 [01:57<00:00, 29.44s/it]
Load checkpoint from: ./checkpoints/wav2lip_gan.pth
Model loaded
100%|████████████████████████████████████████████| 1/1 [02:02<00:00, 122.76s/it]
ffmpeg version 4.2.3 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7.5.0 (crosstool-NG 1.24.0.123_1667d2b)
  configuration: --prefix=/home/conda/feedstock_root/build_artifacts/ffmpeg_1590573566052/_h_env_pl