Face swapping:
- [DeepFaceLab: A simple, flexible and extensible face swapping framework](https://arxiv.org/abs/2005.05535v1)
- [FaceSwap](https://github.com/deepfakes/faceswap)
- [Roop](https://github.com/s0md3v/roop)
- [SimSwap: An Efficient Framework For High Fidelity Face Swapping](https://arxiv.org/abs/2106.06340)
- [DeepFaceLive](https://github.com/iperov/DeepFaceLive)
- [Deep-Live-Cam](https://github.com/hacksider/Deep-Live-Cam)

Lip-syncing:
- [A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild](https://arxiv.org/abs/2008.10010)

Voice cloning:
- [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243)

Pose-based animation:
- [First Order Motion Model for Image Animation](https://github.com/AliaksandrSiarohin/first-order-model)
- [Thin-Plate Spline Motion Model for Image Animation](https://arxiv.org/abs/2203.14367)
- [Deep Video Portraits](https://arxiv.org/abs/1805.11714)

# 1. Thin-Plate Spline Motion Model

In [1]:
!git clone https://github.com/yoyo-nb/Thin-Plate-Spline-Motion-Model.git
%cd Thin-Plate-Spline-Motion-Model

Cloning into 'Thin-Plate-Spline-Motion-Model'...
remote: Enumerating objects: 115, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 115 (delta 44), reused 30 (delta 30), pack-reused 46 (from 1)
Receiving objects: 100% (115/115), 32.65 MiB | 11.13 MiB/s, done.
Resolving deltas: 100% (51/51), done.
/home/yungshun317/workspace/py/torch-cv/Thin-Plate-Spline-Motion-Model



In [2]:
!mkdir -p checkpoints
!pip3 install wldhx.yadisk-direct
!curl -L $(yadisk-direct https://disk.yandex.com/d/i08z-kCuDGLuYA) -o checkpoints/vox.pth.tar

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  334M  100  334M    0     0  4097k      0  0:01:23  0:01:23 --:--:-- 4582k


In [3]:
# Good to have other models
!curl -L $(yadisk-direct https://disk.yandex.com/d/vk5dirE6KNvEXQ) -o checkpoints/taichi.pth.tar
!curl -L $(yadisk-direct https://disk.yandex.com/d/IVtro0k2MVHSvQ) -o checkpoints/mgif.pth.tar
!curl -L $(yadisk-direct https://disk.yandex.com/d/B3ipFzpmkB1HIA) -o checkpoints/ted.pth.tar

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  334M  100  334M    0     0  3981k      0  0:01:26  0:01:26 --:--:-- 4651k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  291M  100  291M    0     0  4022k      0  0:01:14  0:01:14 --:--:-- 4686k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  334M  100  334M    0     0  3743k      0  0:01:31  0:01:31 --:--:-- 4502k
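
A failed download can leave an empty or truncated file that only surfaces later, when the checkpoint is loaded. The following is a minimal sketch of a sanity check over the four files fetched above. The helper name `find_bad_checkpoints` and the `min_bytes` threshold are hypothetical, not part of the repository.

```python
import os

# Checkpoint file names written by the download cells above.
EXPECTED = ["vox.pth.tar", "taichi.pth.tar", "mgif.pth.tar", "ted.pth.tar"]


def find_bad_checkpoints(directory, names=EXPECTED, min_bytes=1_000_000):
    """Return the names that are missing or smaller than min_bytes.

    min_bytes is an arbitrary threshold (an assumption): a failed download
    usually leaves an empty file or a short error page, far below the
    hundreds of megabytes reported by curl above.
    """
    bad = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
            bad.append(name)
    return bad


# An empty list means every expected file is present and plausibly complete.
print(find_bad_checkpoints("checkpoints"))
```

This checks presence and size only; it does not validate the contents of a checkpoint.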


In [4]:
# Configuration
import torch

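# Edit the config
# Fall back to the CPU when no CUDA device is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
dataset_name = 'vox'  # one of ['vox', 'taichi', 'ted', 'mgif']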
source_image_path = './assets/source.png'
driving_video_path = './assets/driving.mp4'
output_video_path = './generated.mp4'
config_path = 'config/vox-256.yaml'
checkpoint_path = 'checkpoints/vox.pth.tar'
predict_mode = 'relative'  # one of ['standard', 'relative', 'avd']
# In relative mode on faces, find_best_frame=True can give better-quality results
find_best_frame = False

# `vox`, `taichi` and `mgif` use 256 x 256; `ted` uses 384 x 384
pixel = 384 if dataset_name == 'ted' else 256

if find_best_frame:
    !pip install face_alignment
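
Switching datasets means changing `dataset_name`, `config_path`, `checkpoint_path` and `pixel` together, which is easy to get out of sync. The sketch below derives the other three from the dataset name. The helper `resolve_settings` is hypothetical, and the config file pattern is an assumption extrapolated from `config/vox-256.yaml`; check the repository's `config` directory before relying on it.

```python
# Resolution per dataset, as stated in the configuration cell above.
RESOLUTION = {"vox": 256, "taichi": 256, "mgif": 256, "ted": 384}


def resolve_settings(dataset_name):
    """Return (config_path, checkpoint_path, pixel) for a dataset name.

    The config path pattern is an assumption based on 'config/vox-256.yaml';
    the checkpoint names match the files written by the download cells.
    """
    if dataset_name not in RESOLUTION:
        raise ValueError(f"unknown dataset: {dataset_name!r}")
    pixel = RESOLUTION[dataset_name]
    config_path = f"config/{dataset_name}-{pixel}.yaml"
    checkpoint_path = f"checkpoints/{dataset_name}.pth.tar"
    return config_path, checkpoint_path, pixel


print(resolve_settings("ted"))
# ('config/ted-384.yaml', 'checkpoints/ted.pth.tar', 384)
```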

In [5]:
# Inspect the source image & driving video
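try:
    import imageio
    import imageio_ffmpeg
except ImportError:
    # Install the missing packages, then retry the imports
    !pip install imageio imageio_ffmpeg
    import imageio
    import imageio_ffmpeg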
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from skimage.transform import resize
from IPython.display import HTML
import warnings
import os

warnings.filterwarnings("ignore")

source_image = imageio.imread(source_image_path)
reader = imageio.get_reader(driving_video_path)

source_image = resize(source_image, (pixel, pixel))[..., :3]

fps = reader.get_meta_data()['fps']
driving_video = []
try:
    for im in reader:
        driving_video.append(im)
except RuntimeError:
    pass
reader.close()

driving_video = [resize(frame, (pixel, pixel))[..., :3] for frame in driving_video]

def display(source, driving, generated=None):
    fig = plt.figure(figsize=(8 + 4 * (generated is not None), 6))

    ims = []
    for i in range(len(driving)):
        cols = [source]
        cols.append(driving[i])
        if generated is not None:
            cols.append(generated[i])
        im = plt.imshow(np.concatenate(cols, axis=1), animated=True)
        plt.axis('off')
        ims.append([im])

    ani = animation.ArtistAnimation(fig, ims, interval=50, repeat_delay=1000)
    plt.close()
    return ani

HTML(display(source_image, driving_video).to_html5_video())
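
The `display` helper builds each animation frame by dropping any alpha channel and concatenating the images along the width axis. A small sketch with synthetic arrays shows the resulting shapes. The function name `side_by_side` is hypothetical and the arrays stand in for real frames.

```python
import numpy as np


def side_by_side(source, driving_frame, generated_frame=None):
    """Concatenate frames horizontally, as the display helper does per frame."""
    cols = [source, driving_frame]
    if generated_frame is not None:
        cols.append(generated_frame)
    return np.concatenate(cols, axis=1)


pixel = 256
rgba = np.zeros((pixel, pixel, 4))   # stand-in for a PNG with an alpha channel
source = rgba[..., :3]               # keep only the RGB channels
frame = np.ones((pixel, pixel, 3))   # stand-in for one driving frame

print(source.shape)                              # (256, 256, 3)
print(side_by_side(source, frame).shape)         # (256, 512, 3)
print(side_by_side(source, frame, frame).shape)  # (256, 768, 3)
```

All inputs must share the same height and channel count, which is why both the source image and the driving frames are resized to `(pixel, pixel)` first.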

In [6]:
from demo import load_checkpoints

inpainting, kp_detector, dense_motion_network, avd_network = load_checkpoints(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    device=device,
)

In [7]:
from demo import make_animation
from skimage import img_as_ubyte

if predict_mode == 'relative' and find_best_frame:
    from demo import find_best_frame as _find
    # The third argument tells the helper whether to run on the CPU
    i = _find(source_image, driving_video, device.type == 'cpu')
    print(f"Best frame: {i}")
    driving_forward = driving_video[i:]
    driving_backward = driving_video[:(i+1)][::-1]
    predictions_forward = make_animation(source_image, driving_forward, inpainting, kp_detector, dense_motion_network, avd_network, device = device, mode = predict_mode)
    predictions_backward = make_animation(source_image, driving_backward, inpainting, kp_detector, dense_motion_network, avd_network, device = device, mode = predict_mode)
    predictions = predictions_backward[::-1] + predictions_forward[1:]
else:
    predictions = make_animation(source_image, driving_video, inpainting, kp_detector, dense_motion_network, avd_network, device = device, mode = predict_mode)

# Save resulting video
imageio.mimsave(output_video_path, [img_as_ubyte(frame) for frame in predictions], fps=fps)

HTML(display(source_image, driving_video, predictions).to_html5_video())

100%|█████████████████████████████████████████| 169/169 [00:03<00:00, 43.90it/s]
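
When `find_best_frame` is enabled, the cell above animates the driving video in two passes that both start at the best frame, one running forward and one running backward, then stitches them into the original order. The sketch below isolates that index logic with integers in place of frames. The names `animate_from_best_frame` and `identity` are hypothetical stand-ins, with `identity` playing the role of `make_animation`.

```python
def animate_from_best_frame(frames, best, animate):
    """Run two passes starting at index `best` and stitch them back in order."""
    forward = frames[best:]              # best, best+1, ..., last
    backward = frames[:best + 1][::-1]   # best, best-1, ..., 0
    out_forward = animate(forward)
    out_backward = animate(backward)
    # Reverse the backward pass and skip the duplicated best frame.
    return out_backward[::-1] + out_forward[1:]


def identity(seq):
    """Stand-in for make_animation: returns its input unchanged."""
    return list(seq)


frames = list(range(6))
print(animate_from_best_frame(frames, 3, identity))
# [0, 1, 2, 3, 4, 5]
```

The output has the same length and order as the input, so it can be saved at the same `fps` as the driving video.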


# 2. TorToiSe

In [None]:
!git clone https://github.com/neonbjb/tortoise-tts
%cd tortoise-tts
!python -m pip install -r ./requirements.txt
!python setup.py install