<a href="https://colab.research.google.com/drive/1jovRJtpTNgUuOBoiMqPnwxHrEiKx7sp5?usp=drive_link" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild (ORIGINAL SOURCE)

[Arxiv](https://arxiv.org/abs/2211.14758) | [Project](https://vinthony.github.io/video-retalking/) | [Github](https://github.com/vinthony/video-retalking)

Kun Cheng, Xiaodong Cun, Yong Zhang, Menghan Xia, Fei Yin, Mingrui Zhu, Xuan Wang, Jue Wang, Nannan Wang

Xidian University, Tencent AI Lab, Tsinghua University

*SIGGRAPH Asia 2022 Conference Track*



Project Code:

In [None]:
!nvidia-smi

!python --version
!apt-get update
!apt install ffmpeg &> /dev/null

!git clone https://github.com/pyetwi/video-retalking.git &> /dev/null
%cd video-retalking
!pip install -r requirements.txt

In [None]:
import gdown

### For this project, we curated subsets of LibriSpeech (real audio), LibriSevoc (fake audio), FakeAVCeleb (real video)
### We had to request permission for FakeAVCeleb (waited for approximately 1 week).
### This involved quite a bit of shell-scripting to manually process the zip files!
audio_fake_url = 'https://drive.google.com/uc?export=download&id=1F_8hYzlA1i9KIIrB_KzOfagIo-Ovgtfr'
audio_real_url = 'https://drive.google.com/uc?export=download&id=1OAVkQ_Hvf7xI0y5PZILOCgJKNIISA1wW'
video_real_url = 'https://drive.google.com/uc?export=download&id=1yyv9_z_pR3CTIwhPqNWE0aGZjbfrUDe5'

# Download the files using gdown
gdown.download(audio_fake_url, 'audio_fake.zip', quiet=False)
gdown.download(audio_real_url, 'audio_real.zip', quiet=False)
gdown.download(video_real_url, 'video_real.zip', quiet=False)

!unzip -q -d examples/ audio_fake.zip
!unzip -q -d examples/ audio_real.zip
!unzip -q -d examples/ video_real.zip


!rm -rf audio_fake.zip
!rm -rf audio_real.zip
!rm -rf video_real.zip

**Download Pretrained Models**

In [None]:
!mkdir ./checkpoints  
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/30_net_gen.pth -O ./checkpoints/30_net_gen.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/BFM.zip -O ./checkpoints/BFM.zip
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/DNet.pt -O ./checkpoints/DNet.pt
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/ENet.pth -O ./checkpoints/ENet.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/expression.mat -O ./checkpoints/expression.mat
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/face3d_pretrain_epoch_20.pth -O ./checkpoints/face3d_pretrain_epoch_20.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/GFPGANv1.3.pth -O ./checkpoints/GFPGANv1.3.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/GPEN-BFR-512.pth -O ./checkpoints/GPEN-BFR-512.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/LNet.pth -O ./checkpoints/LNet.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/ParseNet-latest.pth -O ./checkpoints/ParseNet-latest.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/RetinaFace-R50.pth -O ./checkpoints/RetinaFace-R50.pth
!wget https://github.com/pyetwi/video-retalking/releases/download/v0.0.1/shape_predictor_68_face_landmarks.dat -O ./checkpoints/shape_predictor_68_face_landmarks.dat
!unzip -d ./checkpoints/BFM ./checkpoints/BFM.zip
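
Before running inference it may help to confirm that every download above actually landed in `./checkpoints`, since a failed `wget` with `-O` can leave an empty file behind. The following is a minimal sketch and not part of the original notebook: the helper name `missing_checkpoints` is hypothetical, and the file list simply mirrors the `wget` commands above.

```python
from pathlib import Path

# File names copied from the wget commands above.
EXPECTED_CHECKPOINTS = [
    "30_net_gen.pth",
    "BFM.zip",
    "DNet.pt",
    "ENet.pth",
    "expression.mat",
    "face3d_pretrain_epoch_20.pth",
    "GFPGANv1.3.pth",
    "GPEN-BFR-512.pth",
    "LNet.pth",
    "ParseNet-latest.pth",
    "RetinaFace-R50.pth",
    "shape_predictor_68_face_landmarks.dat",
]


def missing_checkpoints(checkpoint_dir, expected=EXPECTED_CHECKPOINTS):
    """Return the expected files that are absent or empty (hypothetical helper)."""
    root = Path(checkpoint_dir)
    missing = []
    for name in expected:
        path = root / name
        # Treat zero-byte files as missing, since they indicate a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_checkpoints("./checkpoints")
    if problems:
        print("Missing or empty checkpoint files:", problems)
    else:
        print("All expected checkpoint files are present.")
```

Any file reported as missing can be downloaded again with the matching `wget` line before continuing.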

**Dataset Generation**

For our project, Hoang Tang engineered a pipeline to generate deepfake datasets.

Our dataset contains three types of videos:

Type #1: Completely legitimate videos (from subsets of the FakeAVCeleb and GRID datasets), each consisting of a real video and its corresponding audio.

Type #2: Deepfakes in which a real video has been transformed to match a randomly selected real soundtrack from a subset of the LibriSpeech dataset.

Type #3: Deepfakes in which a real video has been transformed to match a randomly selected fake soundtrack from a subset of the LibriSevoc dataset.

We manually wrote thousands of lines of shell scripts to preprocess each of the datasets, since they were all compressed and initially too large to use directly. We also wanted to use the KoDF (Korean Deepfakes) dataset for our project, but at nearly 2 TB it was infeasible to use with our available resources.
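
The three video types above can be summarized as a small manifest that pairs each source video with an audio track and a label. This is a minimal sketch rather than the pipeline actually used for the project: the `Sample` record, the `build_manifest` function, the label strings, and the fixed seed are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    video: str            # source video file name
    audio: Optional[str]  # replacement audio, or None to keep the original track
    label: str            # "real" for Type 1, "deepfake" for Types 2 and 3
    kind: int             # 1, 2, or 3, matching the types described above


def build_manifest(videos, real_audios, fake_audios, per_type, seed=0):
    """Pair videos with audio for each of the three types (illustrative only)."""
    rng = random.Random(seed)
    manifest: List[Sample] = []

    # Type 1: untouched video with its own audio.
    for video in rng.sample(videos, per_type):
        manifest.append(Sample(video, None, "real", 1))

    # Type 2: video re-synced to a randomly chosen real audio clip.
    for video, audio in zip(rng.sample(videos, per_type),
                            rng.sample(real_audios, per_type)):
        manifest.append(Sample(video, audio, "deepfake", 2))

    # Type 3: video re-synced to a randomly chosen fake audio clip.
    for video, audio in zip(rng.sample(videos, per_type),
                            rng.sample(fake_audios, per_type)):
        manifest.append(Sample(video, audio, "deepfake", 3))

    return manifest


if __name__ == "__main__":
    demo = build_manifest(
        videos=["v1.mp4", "v2.mp4", "v3.mp4"],
        real_audios=["r1.wav", "r2.wav"],
        fake_audios=["f1.wav", "f2.wav"],
        per_type=2,
    )
    for sample in demo:
        print(sample)
```

Keeping the label next to each pair makes it straightforward to write the manifest to disk later and use it as ground truth when training or evaluating a detector.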

In [None]:
#@title
import glob, os, sys
import random
import numpy as np
import ipywidgets as widgets
from IPython.display import HTML
from base64 import b64encode

src_videos = 'examples/video_real'
src_real_audios = 'examples/audio_real'
src_fake_audios = 'examples/audio_fake'

all_videos = [
    os.path.basename(x) for x in glob.glob('{}/*.mp4'.format(src_videos))
]

all_real_audios = [
    os.path.basename(x) for x in glob.glob('{}/*.wav'.format(src_real_audios))
]

all_fake_audios = [
    os.path.basename(x) for x in glob.glob('{}/*.wav'.format(src_fake_audios))
]

### This code will generate a dataset of the following videos:
### Type 1 videos (real): 
### - These are completely legitimate videos, consisting of a real video & its corresponding audio
TYPE_ONE_COUNT = 5
type_one_dst_folder = 'examples/type_one'
os.makedirs(type_one_dst_folder, exist_ok=True)

# randomly sample TYPE_ONE_COUNT videos, with replacement.
type_one_videos = np.random.choice(all_videos, TYPE_ONE_COUNT, replace=True)
for video in type_one_videos:
  print(video)
  # !cp {src_videos}/{video} {type_one_dst_folder}/{video}


### Type 2 videos (deepfake):
### - These videos contain the real video, but its video has been transformed to match a randomly selected (real) soundtrack from the LibriSpeech dataset.
TYPE_TWO_COUNT = 5
type_two_dst_folder = 'examples/type_two'
os.makedirs(type_two_dst_folder, exist_ok=True)

type_two_videos = random.sample(all_videos, TYPE_TWO_COUNT)
type_two_audios = random.sample(all_real_audios, TYPE_TWO_COUNT)

for idx, (video, audio) in enumerate(zip(type_two_videos, type_two_audios), start = 1):
  print(video)
  print(audio)

### Type 3 videos (deepfake):
### - These videos contain the real video, but its video has been transformed to match a randomly selected (fake) soundtrack from the LibriSevoc dataset.
TYPE_THREE_COUNT = 5
type_three_dst_folder = 'examples/type_three'
os.makedirs(type_three_dst_folder, exist_ok=True)

type_three_videos = random.sample(all_videos, TYPE_THREE_COUNT)
type_three_audios = random.sample(all_fake_audios, TYPE_THREE_COUNT)




# input_video_path = 'examples/face/{}'.format(default_vid_name.value)
# input_audio_path = 'examples/audio/{}'.format(default_audio_name.value)

# !python3 inference.py \
#   --face {input_video_path} \
#   --audio {input_audio_path} \
#   --up_face "surprise" \
#   --outfile results/output.mp4
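
The generation cell above samples video and audio pairs but leaves the call to `inference.py` commented out. One way the Type 2 and Type 3 pairs could be turned into commands is sketched below. It is an illustration under stated assumptions, not the project's actual pipeline: `build_inference_command`, `plan_commands`, and the output naming scheme are hypothetical, and only the `--face`, `--audio`, and `--outfile` flags that already appear in this notebook are used.

```python
import os
import shlex


def build_inference_command(video_path, audio_path, outfile):
    """Return the argument list for one inference.py run (hypothetical helper)."""
    return [
        "python3", "inference.py",
        "--face", video_path,
        "--audio", audio_path,
        "--outfile", outfile,
    ]


def plan_commands(videos, audios, video_dir, audio_dir, dst_folder):
    """Pair videos with audios in order and build one command per pair."""
    commands = []
    for idx, (video, audio) in enumerate(zip(videos, audios), start=1):
        # Output naming is an assumption made for this sketch.
        outfile = os.path.join(dst_folder, "{:04d}_{}".format(idx, video))
        commands.append(build_inference_command(
            os.path.join(video_dir, video),
            os.path.join(audio_dir, audio),
            outfile,
        ))
    return commands


if __name__ == "__main__":
    for cmd in plan_commands(["a.mp4"], ["x.wav"],
                             "examples/video_real", "examples/audio_real",
                             "examples/type_two"):
        # Print instead of executing; pass cmd to subprocess.run to run it.
        print(shlex.join(cmd))
```

Building argument lists rather than formatted shell strings avoids quoting problems when file names contain spaces.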

Visualize the input video and audio:

In [None]:
#@title
# Preview the first sampled Type 2 video from the dataset-generation cell.
input_video_name = os.path.join(src_videos, type_two_videos[0])
input_video_mp4 = open('{}'.format(input_video_name),'rb').read()
input_video_data_url = "data:video/x-m4v;base64," + b64encode(input_video_mp4).decode()
print('Display input video: {}'.format(input_video_name), file=sys.stderr)
display(HTML("""
  <video width=400 controls>
        <source src="%s" type="video/mp4">
  </video>
  """ % input_video_data_url))

# Preview the first sampled Type 2 audio clip from the dataset-generation cell.
input_audio_name = os.path.join(src_real_audios, type_two_audios[0])
input_audio_mp4 = open('{}'.format(input_audio_name),'rb').read()
input_audio_data_url = "data:audio/wav;base64," + b64encode(input_audio_mp4).decode()
print('Display input audio: {}'.format(input_audio_name), file=sys.stderr)
display(HTML("""
  <audio width=400 controls>
        <source src="%s" type="audio/wav">
  </audio>
  """ % input_audio_data_url))


In [None]:
# Run inference on the first sampled Type 2 video/audio pair.
input_video_path = os.path.join(src_videos, type_two_videos[0])
input_audio_path = os.path.join(src_real_audios, type_two_audios[0])

!python3 inference.py \
  --face {input_video_path} \
  --audio {input_audio_path} \
  --outfile results/output.mp4

Visualize the output video:

In [None]:
#@title
# visualize code from makeittalk
from IPython.display import HTML
from base64 import b64encode
import os, sys, glob, cv2, subprocess, platform

def read_video(vid_name):
  video_stream = cv2.VideoCapture(vid_name)
  fps = video_stream.get(cv2.CAP_PROP_FPS)
  full_frames = []
  while True:
    still_reading, frame = video_stream.read()
    if not still_reading:
        video_stream.release()
        break
    full_frames.append(frame)
  return full_frames, fps

input_video_frames, fps = read_video(input_video_path)
output_video_frames, _ = read_video('./results/output.mp4')

frame_h, frame_w = input_video_frames[0].shape[:-1]
out_concat = cv2.VideoWriter('./temp/temp/result_concat.mp4', cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_w*2, frame_h))
for i in range(len(output_video_frames)):
  frame_input = input_video_frames[i % len(input_video_frames)]
  frame_output = output_video_frames[i]
  out_concat.write(cv2.hconcat([frame_input, frame_output]))
out_concat.release()

command = 'ffmpeg -loglevel error -y -i {} -i {} -strict -2 -q:v 1 {}'.format(input_audio_path, './temp/temp/result_concat.mp4', './results/output_concat_input.mp4')
subprocess.call(command, shell=platform.system() != 'Windows')


output_video_name = './results/output.mp4'
output_video_mp4 = open('{}'.format(output_video_name),'rb').read()
output_video_data_url = "data:video/mp4;base64," + b64encode(output_video_mp4).decode()
print('Display lip-syncing video: {}'.format(output_video_name), file=sys.stderr)
display(HTML("""
  <video height=400 controls>
        <source src="%s" type="video/mp4">
  </video>
  """ % output_video_data_url))

output_concat_video_name = './results/output_concat_input.mp4'
output_concat_video_mp4 = open('{}'.format(output_concat_video_name),'rb').read()
output_concat_video_data_url = "data:video/mp4;base64," + b64encode(output_concat_video_mp4).decode()
print('Display input video and lip-syncing video: {}'.format(output_concat_video_name), file=sys.stderr)
display(HTML("""
  <video height=400 controls>
        <source src="%s" type="video/mp4">
  </video>
  """ % output_concat_video_data_url))
