[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jovRJtpTNgUuOBoiMqPnwxHrEiKx7sp5?authuser=1#scrollTo=x6M0HVmn94j5)

## Generative AI Project (10-623): Demo Code for Deepfake Dataset Generation

**Authors**: Hoang Tang, Derek Duenas, Ishita Gupta

---

### Original VideoReTalking Paper
- 📄 [ArXiv](https://arxiv.org/abs/2211.14758) | 🌐 [Project Page](https://vinthony.github.io/video-retalking/) | 💻 [GitHub Repo](https://github.com/vinthony/video-retalking)

---

### Datasets Used

**Main Datasets:**
- [LibriSpeech (real audio)](https://www.openslr.org/12)
- [LibriSeVoc (fake audio)](https://github.com/csun22/Synthetic-Voice-Detection-Vocoder-Artifacts)
- [FakeAVCeleb (real videos)](https://sites.google.com/view/fakeavcelebdash-lab/download)
- [GRID (audio-visual speech corpus)](https://spandh.dcs.shef.ac.uk/gridcorpus/)
- [LipSyncTIMIT](https://docs.google.com/forms/d/e/1FAIpQLSeKn-OAlJKcOZTU1k6GXVZZjkIuHbGs3am9ScvqkKE7M35psA/viewform)
- [VidTIMIT](https://conradsanderson.id.au/vidtimit/)

**Additional Dataset Explored:**
- [KoDF (Korean Deepfakes)](https://deepbrainai-research.github.io/kodf/)

---

*This notebook supports the automatic generation and transformation of video-audio datasets for deepfake detection and synthesis tasks.*

> **Other Notes**  
> Accessing deepfake datasets turned out to be surprisingly challenging. A lot of large datasets require permission (which takes a while to acquire...)

> Hoang had to write a *concerning* amount of shell scripting just to download, convert, and structure the videos and audios into usable formats.

> Hoang was able to curate a dataset of 805 real frontal-face videos from FakeAVCeleb + GRID + VidTIMIT, 2700 human-recorded audio files from LibriSpeech, and 1130 synthetic (AI-generated) audio files from LibriSeVoc.

**Installation** (30s)

In [None]:
!nvidia-smi

!python --version
!apt-get update
!apt install ffmpeg &> /dev/null

!git clone https://github.com/pyetwi/video-retalking.git &> /dev/null
%cd video-retalking
!pip install -r requirements.txt

Fri Apr 25 17:01:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   51C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import gdown

### For this project, we curated subsets of LibriSpeech (real audio), LibriSeVoc (fake audio), and FakeAVCeleb (real video).
### We had to request permission for FakeAVCeleb (the wait was approximately 1 week).
### We also received permission for the KoDF (Korean deepfake) dataset, but decided NOT to use it: at 2TB, it was too large for our project.
### Preparing these archives involved quite a bit of shell scripting to manually process the zip files!

audio_fake_url = 'https://drive.google.com/uc?export=download&id=1F_8hYzlA1i9KIIrB_KzOfagIo-Ovgtfr'
audio_real_url = 'https://drive.google.com/uc?export=download&id=1B0NfuBjdEyYpK34OCGngmSl1tTi1px1k'
video_real_url = 'https://drive.google.com/uc?export=download&id=13ENY9mbL7j6Zhg_5PyPyKtPin7CRagVQ'

# Download the files using gdown
gdown.download(audio_fake_url, 'audio_fake.zip', quiet=False)
gdown.download(audio_real_url, 'audio_real.zip', quiet=False)
gdown.download(video_real_url, 'video_real.zip', quiet=False)

!unzip -q -d examples/ audio_fake.zip
!unzip -q -d examples/ audio_real.zip
!unzip -q -d examples/ video_real.zip


!rm -rf audio_fake.zip
!rm -rf audio_real.zip
!rm -rf video_real.zip

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1F_8hYzlA1i9KIIrB_KzOfagIo-Ovgtfr
From (redirected): https://drive.google.com/uc?export=download&id=1F_8hYzlA1i9KIIrB_KzOfagIo-Ovgtfr&confirm=t&uuid=25087db5-bddc-403d-a544-3af2e763b9dd
To: /content/video-retalking/audio_fake.zip
100%|██████████| 833M/833M [00:13<00:00, 63.2MB/s]
Downloading...
From (original): https://drive.google.com/uc?export=download&id=1B0NfuBjdEyYpK34OCGngmSl1tTi1px1k
From (redirected): https://drive.google.com/uc?export=download&id=1B0NfuBjdEyYpK34OCGngmSl1tTi1px1k&confirm=t&uuid=7889fd2a-3c81-4027-ada6-e69910c4b2a3
To: /content/video-retalking/audio_real.zip
100%|██████████| 505M/505M [00:06<00:00, 74.8MB/s]
Downloading...
From (original): https://drive.google.com/uc?export=download&id=13ENY9mbL7j6Zhg_5PyPyKtPin7CRagVQ
From (redirected): https://drive.google.com/uc?export=download&id=13ENY9mbL7j6Zhg_5PyPyKtPin7CRagVQ&confirm=t&uuid=b1fd9af6-4b7c-416a-b856-7b6e4529bcf0
To: /content/vi

**Download Pretrained Models**

In [None]:
#@title
!mkdir -p ./checkpoints
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/30_net_gen.pth -O ./checkpoints/30_net_gen.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/BFM.zip -O ./checkpoints/BFM.zip
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/DNet.pt -O ./checkpoints/DNet.pt
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/ENet.pth -O ./checkpoints/ENet.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/expression.mat -O ./checkpoints/expression.mat
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/face3d_pretrain_epoch_20.pth -O ./checkpoints/face3d_pretrain_epoch_20.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/GFPGANv1.3.pth -O ./checkpoints/GFPGANv1.3.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/GPEN-BFR-512.pth -O ./checkpoints/GPEN-BFR-512.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/LNet.pth -O ./checkpoints/LNet.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/ParseNet-latest.pth -O ./checkpoints/ParseNet-latest.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/RetinaFace-R50.pth -O ./checkpoints/RetinaFace-R50.pth
!wget https://github.com/vinthony/video-retalking/releases/download/v0.0.1/shape_predictor_68_face_landmarks.dat -O ./checkpoints/shape_predictor_68_face_landmarks.dat
!unzip -d ./checkpoints/BFM ./checkpoints/BFM.zip

--2025-04-25 17:03:41--  https://github.com/vinthony/video-retalking/releases/download/v0.0.1/30_net_gen.pth
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/OpenTalker/video-retalking/releases/download/v0.0.1/30_net_gen.pth [following]
--2025-04-25 17:03:42--  https://github.com/OpenTalker/video-retalking/releases/download/v0.0.1/30_net_gen.pth
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/536411820/38c02c2b-bf57-4d4e-9711-6cefdc5c817c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250425%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250425T170342Z&X-Amz-Expires=300&X-Amz-Signature=6fe607ed8ab00084c7be32111483d428a86c558b2426b9b394b06503a18cd8f5&X-Amz-Signed

First, we define the paths to the real videos (FakeAVCeleb + GRID + VidTIMIT), real audio (LibriSpeech), and AI-generated audio (LibriSeVoc), then list the files in each folder.

In [None]:
import glob, os, sys
import random
import shutil
import numpy as np
import ipywidgets as widgets
from IPython.display import HTML
from base64 import b64encode

src_videos = 'examples/video_real'
src_real_audios = 'examples/audio_real'
src_fake_audios = 'examples/audio_fake'

videos = [
    os.path.basename(x) for x in glob.glob('{}/*.mp4'.format(src_videos))
]

real_audios = [
    os.path.basename(x) for x in glob.glob('{}/*.wav'.format(src_real_audios))
]

fake_audios = [
    os.path.basename(x) for x in glob.glob('{}/*.wav'.format(src_fake_audios))
]
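Before sampling, it can help to confirm that the extracted folders hold what we expect. The sketch below is a minimal check, not part of the original pipeline: `list_media` is a hypothetical helper, and the expected counts are taken from the notes at the top of this notebook on the assumption that the downloaded archives contain the full curated sets. Sorting the names also makes any later seeded sampling repeatable, since `glob` returns files in arbitrary order.

```python
from pathlib import Path


def list_media(folder, suffix):
    """Return sorted file names in `folder` ending with `suffix`.

    Hypothetical helper; returns an empty list if the folder is missing.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p.name for p in folder.iterdir() if p.is_file() and p.suffix == suffix
    )


# Expected counts come from the notes above; they are an assumption about
# what the downloaded archives actually contain.
EXPECTED = {
    "examples/video_real": (".mp4", 805),
    "examples/audio_real": (".wav", 2700),
    "examples/audio_fake": (".wav", 1130),
}

for folder, (suffix, expected) in EXPECTED.items():
    found = len(list_media(folder, suffix))
    note = "ok" if found == expected else "differs from the notes"
    print(f"{folder}: {found} {suffix} files ({note})")
```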

Now, we create our dataset of Type 1 videos. These are completely authentic: each is a real video paired with its own original audio track.

In [None]:
TYPE_ONE_COUNT = 100
type_one_dst_folder = 'examples/type_one'

# Delete and recreate the directory to ensure it's empty
shutil.rmtree(type_one_dst_folder, ignore_errors=True)
os.makedirs(type_one_dst_folder, exist_ok=True)

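# Sample without replacement: a name drawn twice would be copied onto itself,
# leaving fewer than TYPE_ONE_COUNT files in the folder.
type_one_videos = np.random.choice(videos, min(TYPE_ONE_COUNT, len(videos)), replace=False)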

for video in type_one_videos:
    src_path = os.path.join(src_videos, video)
    dst_path = os.path.join(type_one_dst_folder, video)
    shutil.copy(src_path, dst_path)
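Each sampled file is copied under its original name, so a name drawn more than once yields only one file in the destination folder. Below is a minimal sketch of duplicate-free, repeatable sampling. It swaps `np.random.choice` for the standard library's `random.Random.sample`; the helper name `sample_unique` and the seed value are assumptions made for illustration.

```python
import random


def sample_unique(names, count, seed=0):
    """Return up to `count` distinct names, chosen reproducibly.

    Hypothetical helper: the seed is arbitrary, and `count` is capped at
    the number of available names so the call never raises.
    """
    rng = random.Random(seed)
    count = min(count, len(names))
    return rng.sample(list(names), count)


# Placeholder names standing in for the real video list.
demo_names = [f"clip_{i}.mp4" for i in range(10)]
demo_pick = sample_unique(demo_names, 4)
print(demo_pick)
```

Passing the sorted `videos` list and `TYPE_ONE_COUNT` to such a helper would give the same selection on every run with the same seed.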

Let's visualize a random video in 'examples/type_one'!

In [None]:
type_one_dst_folder = 'examples/type_one'

default_vid_name = random.choice([
    f for f in os.listdir(type_one_dst_folder) if f.endswith('.mp4')
])

input_video_path = os.path.join(type_one_dst_folder, default_vid_name)
with open(input_video_path, 'rb') as f:
    input_video_mp4 = f.read()

input_video_data_url = "data:video/mp4;base64," + b64encode(input_video_mp4).decode()
print('Displaying video:', input_video_path, file=sys.stderr)
display(HTML(f"""
  <video width="400" controls>
    <source src="{input_video_data_url}" type="video/mp4">
    Your browser does not support the video tag.
  </video>
"""))


Displaying video: examples/type_one/bbakzs.mp4
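The preview cell embeds the video as a base64 data URL inside a `<video>` tag, and the same steps are repeated for each dataset type. As a sketch, they could be factored into two small functions; `video_data_url` and `video_html` are hypothetical names, not part of VideoReTalking.

```python
import tempfile
from base64 import b64encode
from pathlib import Path


def video_data_url(path):
    """Encode the file at `path` as a base64 data URL with an MP4 MIME type."""
    payload = b64encode(Path(path).read_bytes()).decode("ascii")
    return "data:video/mp4;base64," + payload


def video_html(path, width=400):
    """Build the <video> markup used in the preview cells."""
    return (
        f'<video width="{width}" controls>'
        f'<source src="{video_data_url(path)}" type="video/mp4">'
        "Your browser does not support the video tag."
        "</video>"
    )


# Self-check on a tiny placeholder file (not a real video).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.mp4"
    sample.write_bytes(b"abc")
    demo_url = video_data_url(sample)
    demo_html = video_html(sample)
```

In the notebook, `display(HTML(video_html(input_video_path)))` would then render the player. Because the whole file is inlined into the page, this approach suits short clips like the ones here.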


Now, we create our dataset of Type 2 videos. These are deepfake videos, synthesized by re-syncing the lips in a real video (.mp4) to a soundtrack from LibriSpeech - a dataset of real, human-recorded speech.

### First-Run Error Notice (Colab Users)

On your first run, you may encounter the following error, caused by a mismatch between the **torchvision version installed on Colab** and the one expected by ``basicsr``, a dependency of **VideoReTalking**:


```
File "/usr/local/lib/python3.11/dist-packages/basicsr/data/degradations.py", line 8, in <module>
    from torchvision.transforms.functional_tensor import rgb_to_grayscale
ModuleNotFoundError: No module named 'torchvision.transforms.functional_tensor'
```

To fix this, open the file named in the traceback (``basicsr/data/degradations.py``) and, on line 8, change the import

``torchvision.transforms.functional_tensor`` to

``torchvision.transforms.functional``.
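The edit can also be applied from a notebook cell instead of by hand. The following is a minimal sketch, assuming the file lives at the path shown in the traceback (it will differ if the Python version or install location differs); `patch_import` and `patch_file` are hypothetical helper names.

```python
from pathlib import Path

OLD_IMPORT = "from torchvision.transforms.functional_tensor import rgb_to_grayscale"
NEW_IMPORT = "from torchvision.transforms.functional import rgb_to_grayscale"


def patch_import(source):
    """Return `source` with the outdated import line replaced."""
    return source.replace(OLD_IMPORT, NEW_IMPORT)


def patch_file(path):
    """Rewrite the file in place; return True only if something changed."""
    path = Path(path)
    original = path.read_text()
    patched = patch_import(original)
    if patched == original:
        return False
    path.write_text(patched)
    return True


# Path copied from the traceback; treat it as an assumption about your runtime.
target = Path("/usr/local/lib/python3.11/dist-packages/basicsr/data/degradations.py")
if target.exists():
    print("patched" if patch_file(target) else "already up to date")
else:
    print("degradations.py not found at the assumed path")
```

Running the cell twice is harmless, since the second pass finds nothing left to replace.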

In [None]:
TYPE_TWO_COUNT = 25
type_two_dst_folder = 'examples/type_two'

# Delete and recreate the directory to ensure it's empty
shutil.rmtree(type_two_dst_folder, ignore_errors=True)
os.makedirs(type_two_dst_folder, exist_ok=True)

type_two_videos = np.random.choice(videos, TYPE_TWO_COUNT, replace=True)
type_two_audios = np.random.choice(real_audios, TYPE_TWO_COUNT, replace=True)

# Run deepfake generation
for idx, (video, audio) in enumerate(zip(type_two_videos, type_two_audios), start=1):
    output_file = os.path.join(type_two_dst_folder, f"{idx}.mp4")
    video_path = os.path.join(src_videos, video)
    audio_path = os.path.join(src_real_audios, audio)
    print(f"[{idx}/{TYPE_TWO_COUNT}]: Combining {video_path} and {audio_path} into a type two deepfake!")
    !python3 inference.py \
    --face {video_path} \
    --audio {audio_path} \
    --outfile {output_file}


[1/25]: Combining examples/video_real/bbae9p.mp4 and examples/audio_real/2803-161169-0009.wav into a type two deepfake!
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
[Info] Using cuda for inference.
[Step 0] Number of frames available for inference: 75
[Step 1] Landmarks Extraction in Video.
landmark Det:: 100% 75/75 [00:12<00:00,  5.91it/s]
[Step 2] 3DMM Extraction In Video:: 100% 75/75 [00:01<00:00, 73.17it/s]
using expression center
Load checkpoint from: checkpoints/DNet.pt
Load checkpoint from: checkpoints/LNet.pth
Load checkpoint from: checkpoints/ENet.pth
[Step 3] Stabilize the expression In Video:: 100% 75/75 [00:08<00:00,  8.37it/s]
[Step 4] Load audio; Length of mel chunks: 614
[Step 5] Reference Enhancement: 100% 75/75 [00:34<00:00,  2.20it/s]
[Step 6] Lip Synthesis::   0% 0/39 [00:00<?, ?it/s]
landmark Det::   0% 0/75 [00:00<?, ?it/s][A
landmark Det::   1% 1/75 [00:00<00:23,  3.1
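The generation loop shells out to `inference.py` once per video/audio pair using the `--face`, `--audio`, and `--outfile` flags. A sketch of the same call through `subprocess` is shown below; passing arguments as a list keeps paths with spaces intact and lets the loop record which pairs failed. The helper names are hypothetical, and the sketch assumes `inference.py` exits with a non-zero status on failure, which we have not verified.

```python
import subprocess


def build_inference_cmd(face, audio, outfile):
    """Build the argument list for one inference.py run (flags as used above)."""
    return [
        "python3", "inference.py",
        "--face", str(face),
        "--audio", str(audio),
        "--outfile", str(outfile),
    ]


def run_inference(face, audio, outfile):
    """Run one job; return True if the process exited with status 0.

    Assumes a non-zero exit status signals failure (unverified).
    """
    completed = subprocess.run(build_inference_cmd(face, audio, outfile))
    return completed.returncode == 0


# Placeholder paths for illustration; note the space in the video name.
demo_cmd = build_inference_cmd(
    "examples/video_real/a b.mp4",
    "examples/audio_real/x.wav",
    "examples/type_two/1.mp4",
)
print(demo_cmd)
```

Inside the loop, `if not run_inference(video_path, audio_path, output_file): failed.append(idx)` would collect the indices to retry, given a `failed` list created beforehand.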

Let's visualize a random video in 'examples/type_two'!

In [None]:
type_two_dst_folder = 'examples/type_two'

default_vid_name = random.choice([
    f for f in os.listdir(type_two_dst_folder) if f.endswith('.mp4')
])

input_video_path = os.path.join(type_two_dst_folder, default_vid_name)
with open(input_video_path, 'rb') as f:
    input_video_mp4 = f.read()

input_video_data_url = "data:video/mp4;base64," + b64encode(input_video_mp4).decode()
print('Displaying video:', input_video_path, file=sys.stderr)
display(HTML(f"""
  <video width="400" controls>
    <source src="{input_video_data_url}" type="video/mp4">
    Your browser does not support the video tag.
  </video>
"""))

Displaying video: examples/type_two/12.mp4


Next, we create our dataset of Type 3 videos. These are deepfake videos, synthesized by re-syncing the lips in a real video (.mp4) to a soundtrack from LibriSeVoc - an AI-generated (deepfake) dataset mimicking human speech.

In [None]:
TYPE_THREE_COUNT = 25
type_three_dst_folder = 'examples/type_three'

# Delete and recreate the directory to ensure it's empty
shutil.rmtree(type_three_dst_folder, ignore_errors=True)
os.makedirs(type_three_dst_folder, exist_ok=True)

type_three_videos = np.random.choice(videos, TYPE_THREE_COUNT, replace=True)
type_three_audios = np.random.choice(fake_audios, TYPE_THREE_COUNT, replace=True)

# Run deepfake generation
for idx, (video, audio) in enumerate(zip(type_three_videos, type_three_audios), start=1):
    output_file = os.path.join(type_three_dst_folder, f"{idx}.mp4")
    video_path = os.path.join(src_videos, video)
    audio_path = os.path.join(src_fake_audios, audio)
    print(f"[{idx}/{TYPE_THREE_COUNT}]: Combining {video_path} and {audio_path} into a type three deepfake!")
    !python3 inference.py \
    --face {video_path} \
    --audio {audio_path} \
    --outfile {output_file}


[1/25]: Combining examples/video_real/mwbt0_sx113.mp4 and examples/audio_fake/103_1241_000054_000007_gen.wav into a type three deepfake!
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
[Info] Using cuda for inference.
[Step 0] Number of frames available for inference: 73
[Step 1] Landmarks Extraction in Video.
landmark Det:: 100% 73/73 [00:12<00:00,  5.74it/s]
[Step 2] 3DMM Extraction In Video:: 100% 73/73 [00:00<00:00, 75.49it/s]
using expression center
Load checkpoint from: checkpoints/DNet.pt
Load checkpoint from: checkpoints/LNet.pth
Load checkpoint from: checkpoints/ENet.pth
[Step 3] Stabilize the expression In Video:: 100% 73/73 [00:08<00:00,  8.35it/s]
[Step 4] Load audio; Length of mel chunks: 204
[Step 5] Reference Enhancement: 100% 73/73 [00:33<00:00,  2.19it/s]
[Step 6] Lip Synthesis::   0% 0/13 [00:00<?, ?it/s]
landmark Det::   0% 0/73 [00:00<?, ?it/s][A
landmark Det::   1% 1/73 [

Let's visualize a random video in 'examples/type_three'!

In [None]:
type_three_dst_folder = 'examples/type_three'

default_vid_name = random.choice([
    f for f in os.listdir(type_three_dst_folder) if f.endswith('.mp4')
])

input_video_path = os.path.join(type_three_dst_folder, default_vid_name)
with open(input_video_path, 'rb') as f:
    input_video_mp4 = f.read()

input_video_data_url = "data:video/mp4;base64," + b64encode(input_video_mp4).decode()
print('Displaying video:', input_video_path, file=sys.stderr)
display(HTML(f"""
  <video width="400" controls>
    <source src="{input_video_data_url}" type="video/mp4">
    Your browser does not support the video tag.
  </video>
"""))

Displaying video: examples/type_three/9.mp4


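Finally, these type one, type two, and type three datasets will be used for experimentation with our implementation of the LIPINC-V2 model. Let's ZIP them up. The cell below archives `examples/type_two`; to bundle all three, pass `examples/type_one` and `examples/type_three` to the same `zip -r` command as well.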

In [None]:
!zip -r examples/genai_dataset.zip examples/type_two

  adding: examples/type_two/ (stored 0%)
  adding: examples/type_two/11.mp4 (deflated 1%)
  adding: examples/type_two/3.mp4 (deflated 2%)
  adding: examples/type_two/1.mp4 (deflated 1%)
  adding: examples/type_two/14.mp4 (deflated 2%)
  adding: examples/type_two/24.mp4 (deflated 2%)
  adding: examples/type_two/6.mp4 (deflated 2%)
  adding: examples/type_two/15.mp4 (deflated 1%)
  adding: examples/type_two/19.mp4 (deflated 2%)
  adding: examples/type_two/9.mp4 (deflated 2%)
  adding: examples/type_two/13.mp4 (deflated 0%)
  adding: examples/type_two/17.mp4 (deflated 2%)
  adding: examples/type_two/2.mp4 (deflated 2%)
  adding: examples/type_two/21.mp4 (deflated 2%)
  adding: examples/type_two/8.mp4 (deflated 1%)
  adding: examples/type_two/7.mp4 (deflated 1%)
  adding: examples/type_two/23.mp4 (deflated 1%)
  adding: examples/type_two/10.mp4 (deflated 1%)
  adding: examples/type_two/4.mp4 (deflated 1%)
  adding: examples/type_two/25.mp4 (deflated 1%)
  adding: examples/type_two/16.mp4 (
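
The `zip` command above archives only `examples/type_two`. As a sketch of bundling all three folders from Python, the block below uses the standard `zipfile` module; `zip_folders` is a hypothetical helper. It stores files uncompressed, since the log above shows MP4s deflating by only 0-2%.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folders(folders, archive_path):
    """Write every .mp4 under each folder into one archive; return the file count.

    Entries are named relative to each folder's parent, e.g. 'type_one/3.mp4'.
    """
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for folder in map(Path, folders):
            for file in sorted(folder.rglob("*.mp4")):
                zf.write(file, arcname=file.relative_to(folder.parent).as_posix())
                count += 1
    return count


# Self-check on throwaway folders holding placeholder files (not real videos).
with tempfile.TemporaryDirectory() as tmp:
    demo_folders = []
    for name in ("type_one", "type_two", "type_three"):
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "1.mp4").write_bytes(b"demo")
        demo_folders.append(folder)
    demo_archive = Path(tmp) / "genai_dataset.zip"
    demo_count = zip_folders(demo_folders, demo_archive)
    with zipfile.ZipFile(demo_archive) as zf:
        demo_entries = zf.namelist()
```

In the notebook, `zip_folders(['examples/type_one', 'examples/type_two', 'examples/type_three'], 'examples/genai_dataset.zip')` would produce a single archive with one top-level folder per dataset type.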