## Wav2Lip
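**[Wav2Lip](https://arxiv.org/pdf/2008.10010.pdf)** is a speech-driven talking-face video generation model based on generative adversarial networks. As shown in the figure below, the network consists of three parts: a generator, a discriminator, and a pre-trained Lip-Sync Expert. It takes two inputs, an arbitrary video and a speech segment, and outputs a lip-synced video. The generator follows an encoder-decoder design: two encoders, a speech encoder and an identity encoder, encode the input speech and the face frames respectively; their outputs are concatenated and passed to a face decoder, which produces the output video frames. The Visual Quality Discriminator constrains the quality of the generated frames and improves their sharpness. To better enforce lip sync, Wav2Lip also uses a pre-trained lip-sync discriminator, the Pre-trained Lip-sync Expert, whose output serves as an additional loss measuring how well the generated lips match the audio.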
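
How the three components interact during training can be sketched as a weighted sum of three loss terms. The weights used below (0.03 for the sync term, 0.07 for the adversarial term) are the values reported in the Wav2Lip paper, not something stated in this notebook, and the function name `total_generator_loss` is hypothetical; treat this as a minimal illustration rather than the training script's actual code.

```python
def total_generator_loss(recon_loss, sync_loss, adv_loss,
                         sync_weight=0.03, adv_weight=0.07):
    """Combine the generator's three loss terms into one scalar.

    recon_loss: L1 distance between generated and ground-truth frames
    sync_loss:  penalty from the frozen, pre-trained lip-sync expert
    adv_loss:   penalty from the visual quality discriminator
    The weights are assumptions taken from the paper's reported values.
    """
    # The reconstruction term receives whatever weight is left over.
    recon_weight = 1.0 - sync_weight - adv_weight
    return (recon_weight * recon_loss
            + sync_weight * sync_loss
            + adv_weight * adv_loss)


if __name__ == "__main__":
    print(total_generator_loss(recon_loss=0.05, sync_loss=0.6, adv_loss=0.7))
```

Because the weights sum to one, the reconstruction term dominates and the two discriminators act as comparatively small corrections.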

### Lip-Sync Expert
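The Lip-Sync Expert is based on **[SyncNet](https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/)**, a network that judges whether speech and video are in sync. As shown in the figure below, SyncNet also takes two inputs: MFCC audio features and video frames of the mouth region. Two convolutional encoders reduce the dimensionality of the audio and video inputs, extract features, and map both into the same embedding space. A contrastive loss then measures lip sync: a larger value means the pair is less in sync, and a smaller value means it is better synchronized. Wav2Lip modifies the SyncNet architecture further: the network is deeper, it adds residual connections, and the input audio features are replaced with mel-spectrograms.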
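
The contrastive objective described above can be illustrated for a single audio/video pair in plain Python. This is a sketch of the standard contrastive loss over Euclidean distance; the margin value and the function names are assumptions chosen for the example.

```python
import math


def euclidean_distance(a, b):
    """Distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(audio_emb, video_emb, in_sync, margin=1.0):
    """Contrastive loss for one audio/video embedding pair.

    In-sync pairs are pulled together (loss grows with distance);
    out-of-sync pairs are pushed apart until they are at least
    `margin` away from each other. `margin` is an assumed value.
    """
    d = euclidean_distance(audio_emb, video_emb)
    if in_sync:
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2


if __name__ == "__main__":
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], in_sync=True))   # 12.5
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], in_sync=False))  # 0.0
```

At inference time only the distance is needed: a small distance suggests the lips match the audio, a large one suggests they do not.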

## 1. Environment setup
- A Linux machine with a GPU is recommended; alternatively, use a third-party cloud service such as Google Colab.
- Python 3.6 or later.
- ffmpeg: `sudo apt-get install ffmpeg`
- The required Python packages are listed in `requirements.txt` and can be installed in one step with `pip install -r requirements.txt`.
- This experiment relies on face detection, so download the pre-trained face detection model: Face detection [pre-trained model](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth) and save it as `face_detection/detection/sfd/s3fd.pth`.
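
Before running the remaining cells, the prerequisites listed above can be checked with a few lines of standard-library Python. The helper name `check_environment` is hypothetical, and the check only confirms that the tools and the detector weights are present, not that they work.

```python
import os
import shutil
import sys


def check_environment(detector_path="face_detection/detection/sfd/s3fd.pth"):
    """Return a dict mapping each prerequisite to True or False."""
    return {
        "python>=3.6": sys.version_info >= (3, 6),
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "s3fd.pth present": os.path.isfile(detector_path),
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(("OK      " if ok else "MISSING ") + name)
```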

In [None]:
!pip install -r requirements.txt

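## 2. Dataset preparation and preprocessing

**Downloading the LRS2 dataset**  
The dataset used in this experiment is available at <a href="http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html">LRS2 dataset</a>. Downloading it requires permission from the BBC: you need to send a request email to obtain a download key; see the instructions on that page for details. After downloading, extract the dataset into the `mvlrs_v1/` folder in this directory and move the LRS2 file lists `train.txt`, `val.txt`, and `test.txt` into the `filelists/` folder. The resulting dataset directory structure is shown below.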
```
data_root (mvlrs_v1)
├── main, pretrain (only the data under main is used)
|	├── list of folders
|	│   ├── five-digit video IDs ending in .mp4
```
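
Given the layout above, the clips under `main` can be enumerated as `folder/video_id` entries. The sketch below uses only the standard library; the function name `list_lrs2_videos` is an assumption, and it simply ignores any non-`.mp4` files it finds next to the videos.

```python
import os


def list_lrs2_videos(data_root, split="main"):
    """Return sorted 'folder/video_id' entries for every .mp4 under data_root/split."""
    split_dir = os.path.join(data_root, split)
    entries = []
    for folder in sorted(os.listdir(split_dir)):
        folder_path = os.path.join(split_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(folder_path)):
            if name.endswith(".mp4"):
                entries.append(folder + "/" + name[:-len(".mp4")])
    return entries
```

For example, `len(list_lrs2_videos("mvlrs_v1"))` gives a quick count of the clips available before preprocessing starts.
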
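**Dataset preprocessing**  
Most videos in the dataset show the speaker's upper body or full body, whereas the model only needs the face region. During preprocessing, each video is therefore split into individual frames, the `face detection` package locates and crops the face in every frame, and only the cropped face images are kept. The audio track of each video is also extracted separately.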
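
The audio-separation step can be sketched as a call to ffmpeg. This is an illustration under stated assumptions, not the contents of `preprocess.py`: the helper names are hypothetical, and the face detection and cropping steps are not covered here.

```python
import os
import subprocess


def build_audio_extract_cmd(video_path, wav_path):
    """Build an ffmpeg command that writes the video's audio track to wav_path."""
    # -y overwrites an existing output file; -vn drops the video stream.
    return ["ffmpeg", "-loglevel", "error", "-y",
            "-i", video_path, "-vn", wav_path]


def extract_audio(video_path, out_dir):
    """Write out_dir/audio.wav for one video and return its path (needs ffmpeg)."""
    os.makedirs(out_dir, exist_ok=True)
    wav_path = os.path.join(out_dir, "audio.wav")
    subprocess.run(build_audio_extract_cmd(video_path, wav_path), check=True)
    return wav_path
```

Passing the command as a list instead of a formatted shell string avoids quoting problems when paths contain spaces.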

In [None]:
# !pip install -r requirements.txt

In [1]:
# !wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O face_detection/detection/sfd/s3fd.pth

--2023-08-29 23:09:29--  https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth
Resolving www.adrianbulat.com (www.adrianbulat.com)... 45.136.29.207
Connecting to www.adrianbulat.com (www.adrianbulat.com)|45.136.29.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89843225 (86M) [application/octet-stream]
Saving to: ‘face_detection/detection/sfd/s3fd.pth’


2023-08-29 23:09:33 (23.7 MB/s) - ‘face_detection/detection/sfd/s3fd.pth’ saved [89843225/89843225]



In [None]:
!rm -rf ../LSR2/demo
!mkdir -p ../LSR2/demo
!cp -r ../LSR2/main/553* ../LSR2/demo/

In [None]:
# !python generate_hq_videos.py

In [None]:
# %%time

# import os
# import shutil
# from tqdm import tqdm
# from multiprocessing import Pool

# codeformer_cmd = 'cd ../CodeFormer && python inference_codeformer.py --bg_upsampler realesrgan --face_upsample -w 1.0 -s 1 --input_path {} --output_path {}'
# base_dir = '../LSR2/demo'
# sub_dirs = os.listdir(base_dir)

# cmds = []

# def execute_cmd(cmd):
#     os.system(cmd[0])
#     shutil.copy(cmd[1][0], cmd[1][1])

# for sub_dir in sub_dirs:
#     sub_dir = os.path.join(base_dir, sub_dir)
#     filenames = os.listdir(sub_dir)
#     for filename in tqdm(filenames):
#         if filename.endswith('mp4') and '_hq' not in filename:
#             full_filename = os.path.join(sub_dir, filename)
#             new_filename = full_filename[:-4]+'_hq'+full_filename[-4:]
#             new_dirname = full_filename[:-4]
#             # print(codeformer_cmd.format(full_filename, new_dirname))
#             # print(os.path.join(new_dirname, filename), new_filename)
#             if not os.path.exists(new_filename):
#                 cmds.append((codeformer_cmd.format(full_filename, new_dirname), (os.path.join(new_dirname, filename), new_filename)))
#                 # os.system(codeformer_cmd.format(full_filename, new_dirname))
#                 # shutil.copy(os.path.join(new_dirname, filename), new_filename)

# if len(cmds)>0:
#     print('cmds:', len(cmds), cmds[0])
#     with Pool(4) as p:
#         p.map(execute_cmd, tqdm(cmds))

In [2]:
!rm -rf ../LSR2/lrs2_preprocessed_288x288
!python preprocess.py --data_root "../LSR2/main_hq" --preprocessed_root "../LSR2/lrs2_preprocessed_288x288" --batch_size 32 --ngpu 4
# !rm -rf ../LSR2/lrs2_preprocessed_288x288-demo
# !python preprocess.py --data_root "../LSR2/demo" --preprocessed_root "../LSR2/lrs2_preprocessed_288x288-demo" --batch_size 128

^C


After preprocessing, the `lrs2_preprocessed/` folder has the following directory structure:
```
preprocessed_root (lrs2_preprocessed)
├── list of folders
|	├── five-digit video IDs
|	│   ├── *.jpg
|	│   ├── audio.wav
```
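
Since the preprocessing run can be interrupted partway through, it may help to verify that every clip directory matches the structure above before training. The following sketch, with the hypothetical helper `find_incomplete_clips`, reports clips that lack `audio.wav` or contain too few `.jpg` frames.

```python
import os


def find_incomplete_clips(preprocessed_root, min_frames=1):
    """Return 'folder/clip' entries missing audio.wav or with fewer than min_frames JPEGs."""
    incomplete = []
    for folder in sorted(os.listdir(preprocessed_root)):
        folder_path = os.path.join(preprocessed_root, folder)
        if not os.path.isdir(folder_path):
            continue
        for clip in sorted(os.listdir(folder_path)):
            clip_path = os.path.join(folder_path, clip)
            if not os.path.isdir(clip_path):
                continue
            files = os.listdir(clip_path)
            n_frames = sum(1 for f in files if f.endswith(".jpg"))
            if "audio.wav" not in files or n_frames < min_frames:
                incomplete.append(folder + "/" + clip)
    return incomplete
```

An empty result means every clip has both frames and audio; anything listed can be reprocessed or left out of the file lists.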

In [None]:
# import os
# from tqdm import tqdm

# codeformer_cmd = 'cd ../CodeFormer && python inference_codeformer.py --bg_upsampler realesrgan --face_upsample -w 1.0 --input_path {} --output_path {}'
# preprocessed_root = "../LSR2/lrs2_preprocessed_288x288-demo"
# sub_dirs = os.listdir(preprocessed_root)
# for sub_dir in tqdm(sub_dirs):
#     video_dirs = os.listdir(os.path.join(preprocessed_root, sub_dir))
#     for video_dir in video_dirs:
#         video_dir = os.path.join(preprocessed_root, sub_dir, video_dir)
#         # print(video_dir)
#         # print(codeformer_cmd.format(video_dir, video_dir))
#         os.system(codeformer_cmd.format(video_dir, video_dir))

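Collect the corresponding file lists and write them to filelists/train.txt and filelists/val.txt; only the video names need to be stored. The code below can serve as a reference: it renames the video samples and generates the matching name lists. Note that with too few video files (fewer than 2) it raises an error: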

In [None]:
# import time
# from glob import glob
# import shutil,os
 
# from sklearn.model_selection import train_test_split
 
# preprocessed_root = "../LSR2/lrs2_preprocessed_288x288-demo"

# # Strip special characters from names and give the video files uniform numbered names
 
# # def original_video_name_format():
# #     base_path = "../LSR2/main"
# #     result = list(glob("{}/*".format(base_path),recursive=False))
# #     file_num = 0
# #     result_list = []
 
# #     for each in result:
# #         file_num +=1
# #         new_position ="{0}{1}".format( int(time.time()),file_num)
# #         result_list.append(new_position)
# #         shutil.move(each, os.path.join(base_path,new_position+".mp4"))
# #         pass

# def trained_data_name_format():
#     base_path = preprocessed_root
#     # result = list(glob("{}/*".format(base_path)))
#     result = os.listdir(base_path)
#     print(result)
#     result_list = []
#     for i,dirpath in enumerate(result):
#         # shutil.move(dirpath,"{0}/{1}".format(base_path,i))
#         # result_list.append(str(i))
#         # print('dirpath:', dirpath)
#         result_list.append(dirpath)
#     if len(result_list)<14:
#         test_result=val_result=train_result=result_list
#     else:
#         train_result,test_result = train_test_split(result_list,test_size=0.15, random_state=42)
#         test_result, val_result = train_test_split(test_result, test_size=0.5, random_state=42)
 
#     for file_name,dataset in zip(("train.txt","test.txt","val.txt"),(train_result,test_result,val_result)):
#         with open(os.path.join("filelists",file_name),'w',encoding='utf-8') as fi:
#             for dataset_i in dataset:
#                 # print('dataset_i:', dataset_i)
#                 video_result = os.listdir(os.path.join(base_path, dataset_i))
#                 # print('video_result:', video_result)
#                 video_result = [dataset_i+'/'+video for video in video_result]
#                 fi.write("\n".join(video_result))
#                 fi.write("\n")
 
#     # print("\n".join(result_list))

# trained_data_name_format()

In [None]:
!python generate_filelists.py
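
The contents of `generate_filelists.py` are not shown in this notebook. As a rough idea of what such a script might do, here is a standard-library sketch that splits `folder/clip` names into train, val, and test lists without scikit-learn. It mirrors the 15% held-out split and the small-dataset fallback of the commented-out cell above; all function names are hypothetical.

```python
import os
import random


def split_clips(clips, held_out_fraction=0.15, seed=42):
    """Shuffle clips deterministically and split them into train/val/test."""
    clips = sorted(clips)
    random.Random(seed).shuffle(clips)
    n_held_out = int(len(clips) * held_out_fraction)
    n_test = n_held_out // 2
    n_val = n_held_out - n_test
    if n_test == 0 or n_val == 0:
        # Too few clips to hold any out: reuse everything for all three lists.
        return {"train": clips, "val": list(clips), "test": list(clips)}
    return {
        "val": clips[:n_val],
        "test": clips[n_val:n_held_out],
        "train": clips[n_held_out:],
    }


def write_filelists(splits, out_dir="filelists"):
    """Write one '<name>.txt' per split, one clip name per line."""
    os.makedirs(out_dir, exist_ok=True)
    for name, clips in splits.items():
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(clip + "\n" for clip in clips)
```

Fixing the seed keeps the split reproducible across runs, so checkpoints trained at different times are evaluated on the same held-out clips.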

Training the expert discriminator

In [None]:
# !python color_syncnet_train.py --data_root ../LSR2/lrs2_preprocessed_288x288/ --checkpoint_dir ./savedmodel --checkpoint_path ./checkpoints/lipsync_expert.pth
!python color_syncnet_train.py --data_root ../LSR2/lrs2_preprocessed_288x288/ --checkpoint_dir ./savedmodel 
!python color_syncnet_train.py --data_root ../LSR2/lrs2_preprocessed_288x288/ --checkpoint_dir ./savedmodel --checkpoint_path ./savedmodel/checkpoint_step000032000.pth 
# --checkpoint_path ./checkpoints/lipsync_expert.pth
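
When resuming training, the step number in `--checkpoint_path` has to be updated by hand. A small helper can pick the newest checkpoint instead. The sketch below assumes the file naming seen in this notebook (`checkpoint_step000032000.pth` for the expert, `wav2lip_checkpoint_step000093000.pth` for the generator); the function itself is hypothetical.

```python
import os
import re


def latest_checkpoint(checkpoint_dir, prefix=""):
    """Return the path of the highest-step '<prefix>checkpoint_step<N>.pth', or None."""
    pattern = re.compile("^" + re.escape(prefix) + r"checkpoint_step(\d+)\.pth$")
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match is None:
            continue
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, os.path.join(checkpoint_dir, name)
    return best_path
```

With an empty prefix this selects the expert discriminator's checkpoints only; `latest_checkpoint("./savedmodel", prefix="wav2lip_")` would select the generator's.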

Run the following command to start training:

In [None]:
# !python wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./checkpoints/lipsync_expert.pth 
!python wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed_288x288 --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./savedmodel/checkpoint_step000032000.pth 
# --checkpoint_path ./checkpoints/wav2lip.pth


In [None]:
!python hq_wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed_288x288 --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./savedmodel/checkpoint_step000032000.pth 
# !python hq_wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./checkpoints/lipsync_expert.pth --checkpoint_path ./checkpoints/wav2lip.pth --disc_checkpoint_path ./checkpoints/visual_quality_disc.pth


In [None]:
# !python wloss_hq_wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed_288x288/ --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./checkpoints/lipsync_expert.pth
!python wloss_hq_wav2lip_train.py --data_root ../LSR2/lrs2_preprocessed_288x288/ --checkpoint_dir ./savedmodel --syncnet_checkpoint_path ./savedmodel/checkpoint_step000050000.pth 
# --checkpoint_path ./checkpoints/wav2lip.pth


In [5]:
%%time
!python inference.py --checkpoint_path ./savedmodel/wav2lip_checkpoint_step000093000.pth --face ../videos/97.mp4 --audio ../videos/test.wav


Using cuda:2 for inference.
Reading video frames...
Number of frames available for inference: 380
(80, 936)
Length of mel chunks: 232
  0%|                                                     | 0/2 [00:00<?, ?it/s]
  0%|                                                    | 0/15 [00:05<?, ?it/s]
Recovering from OOM error; New batch size: 8

  0%|                                                    | 0/29 [00:00<?, ?it/s]
  3%|█▌                                          | 1/29 [01:26<40:32, 86.89s/it]
  7%|███                                         | 2/29 [01:28<16:27, 36.57s/it]
 10%|████▌                                       | 3/29 [01:29<08:52, 20.49s/it]
 14%|██████                                      | 4/29 [01:30<05:23, 12.93s/it]
 17%|███████▌                                    | 5/29 [01:32<03:29,  8.75s/it]
 21%|█████████                                   | 6/29 [01:33<02:23,  6.23s/it]
 24%|██████████▌                                 | 7/29 [01:34<01:4
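
The numbers in the log above can be reproduced with a little arithmetic. The sketch below assumes the sliding-window scheme of the upstream Wav2Lip `inference.py` (a 16-frame mel window, 80 mel frames per second of audio); this fork may differ. It also assumes a 20 fps input video, which is inferred from the log and not stated anywhere in the notebook. Under those assumptions, a mel spectrogram of shape `(80, 936)` yields 232 chunks, and the progress-bar totals of 2, 15, and 29 are consistent with batch sizes of 128, 16, and 8.

```python
import math


def count_mel_chunks(n_mel_frames, fps, mel_step_size=16, mel_frames_per_sec=80):
    """Count sliding mel windows, one per output video frame (assumed scheme)."""
    step = mel_frames_per_sec / fps  # mel frames advanced per video frame
    count = 0
    i = 0
    while True:
        start = int(i * step)
        count += 1
        if start + mel_step_size > n_mel_frames:
            # The final window is clamped to the end of the spectrogram.
            break
        i += 1
    return count


def count_batches(n_items, batch_size):
    """Number of batches needed to cover n_items."""
    return math.ceil(n_items / batch_size)


if __name__ == "__main__":
    chunks = count_mel_chunks(n_mel_frames=936, fps=20)
    print(chunks)                      # 232
    print(count_batches(chunks, 128))  # 2
    print(count_batches(chunks, 16))   # 15
    print(count_batches(chunks, 8))    # 29
```

The jump from 15 to 29 batches matches the out-of-memory recovery message, which halves the batch size from 16 to 8.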

In [None]:
# !python inference.py --checkpoint_path ./savedmodel/wav2lip_checkpoint_step000008000.pth --face ../LSR2/demo/5539702505926936192/00001.mp4 --audio ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/audio.wav


In [None]:
# !cd ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/final_results/ && ffmpeg -r 25 -i %d.png 00001-sr.mp4 -y


In [None]:
# !python inference.py --checkpoint_path ./savedmodel/wav2lip_checkpoint_step000008000.pth --face ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/final_results/00001-sr.mp4 --audio ../LSR2/lrs2_preprocessed_288x288-demo/5539702505926936192/00001/audio.wav
