# Lip-to-Speech Synthesis with LLMs and Audio-Visual Instructions

This notebook demonstrates an end-to-end lip-to-speech synthesis pipeline that generates speech from silent lip videos, using large language models (LLMs) and audio-visual instructions. The project combines techniques from the papers ["AVI-Talking"](https://arxiv.org/abs/2402.16124) and ["Towards Accurate Lip-to-Speech Synthesis in-the-Wild"](https://arxiv.org/abs/2403.01087) to achieve this task.

The primary goal of this project is to enable natural speech synthesis from lip movements, without requiring any audio input or transcripts. This technology has potential applications in various domains, including virtual assistants, video dubbing, and accessibility tools for individuals with speech impairments.

## Overview

The lip-to-speech synthesis process is divided into two main stages:

1. **Audio-Visual Instruction Generation**: In this stage, a pre-trained lipreading model extracts visual features from the input lip video. These visual features are then combined with noisy text transcripts (generated by the lipreading model) and fed into a fine-tuned large language model (LLM). The LLM generates audio-visual instructions that describe the lip movements and the corresponding speech content.
2. **Visual Text-to-Speech Synthesis**: In this stage, the audio-visual instructions are used to condition a visual text-to-speech model, which generates a mel-spectrogram representation of the speech. This mel-spectrogram is then converted into an audio waveform using a vocoder (e.g., ParallelWaveGAN), while considering the target speaker's voice characteristics.

## Usage

To use this project, you will need to provide two input files:

1. `input_lip_video.mp4`: The silent video used for lip reading. It should show clear lip movements of the speaker, be reasonably well lit, and keep the lips and mouth area visible throughout. Ideally, the video has a resolution of at least 720p and a frame rate of at least 30 FPS.
2. `target_speaker_audio.wav`: The reference audio sample used to recreate the speaker's voice. A short clip (e.g., 5-10 seconds) of the target speaker is enough; the speaker's voice characteristics are extracted from it and incorporated into the synthesized speech.

With these input files, you can run the notebook cells in sequential order to generate synthesized speech from the lip video, while preserving the target speaker's voice characteristics.
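
The input guidelines above can be checked before running the pipeline. The following is a minimal sketch, assuming the reference audio is an uncompressed PCM WAV file that Python's `wave` module can read; `wav_duration_seconds` and `check_inputs` are illustrative helper names that are not part of the notebook.

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def check_inputs(video_height, video_fps, audio_seconds,
                 min_height=720, min_fps=30, audio_range=(5.0, 10.0)):
    """Return a list of warnings for inputs that fall outside the guidelines."""
    warnings = []
    if video_height < min_height:
        warnings.append(f"video height {video_height}px is below {min_height}p")
    if video_fps < min_fps:
        warnings.append(f"frame rate {video_fps} FPS is below {min_fps} FPS")
    low, high = audio_range
    if not low <= audio_seconds <= high:
        warnings.append(
            f"reference audio lasts {audio_seconds:.1f}s, outside {low:.0f}-{high:.0f}s"
        )
    return warnings
```

The video height and frame rate can be read with OpenCV via `cap.get(cv2.CAP_PROP_FRAME_HEIGHT)` and `cap.get(cv2.CAP_PROP_FPS)` on a `cv2.VideoCapture` object.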

Please note that this project requires access to pre-trained models and libraries, which are automatically downloaded and set up within the notebook. Additionally, some components of the pipeline may require significant computational resources, particularly for processing longer videos or achieving real-time performance.

By leveraging the power of LLMs and audio-visual instructions, this project aims to push the boundaries of lip-to-speech synthesis, enabling more natural and expressive speech generation from visual cues alone.

## Setup and initialization...

In [None]:
!pip install huggingface_hub
hf_access_token = "REPLACE WITH YOUR HUGGING FACE TOKEN"

In [None]:
!pip install gdown

In [None]:
!pip install torch torchvision torchaudio
import torch
import torch.nn as nn

!pip install transformers
from transformers import LlamaForCausalLM, LlamaTokenizer

!pip install parallel_wavegan
from parallel_wavegan.models import ParallelWaveGANGenerator

!pip install phonemizer
from phonemizer import phonemize

!pip install speechbrain
from speechbrain.inference import EncoderClassifier
!mkdir pretrained_models

!pip install opencv-python
import cv2

!pip install numpy
import numpy as np

In [None]:
# Check that a GPU is available to PyTorch (the models below are moved to CUDA)
import torch
print("GPU available:", torch.cuda.is_available())
print("Num GPUs:", torch.cuda.device_count())

In [None]:
%cd /content
!git clone --recursive https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks.git
%cd Lipreading_using_Temporal_Convolutional_Networks
!pip install -r requirements.txt

In [None]:
%cd /content/Lipreading_using_Temporal_Convolutional_Networks
import gdown
gdown.download('https://drive.usercontent.google.com/download?id=1TGFG0dW5M3rBErgU8i0N7M1ys9YMIvgm&export=download&authuser=0&confirm=t&uuid=69f4c2bc-42ca-461c-a61f-56b4e7dc601b&at=APZUnTWa0cWQMJsRReOodH4XyVIP%3A1710192153986', output='models/model_weights.pth', fuzzy=True)

In [None]:
# Install the required modules
!pip install torch torchvision

# Import the required modules
import torch
from lipreading.utils import load_model, calculateNorm2, load_json
from lipreading.model import Lipreading

In [None]:
%cd /content

import urllib.request, json

def get_lrw_model_from_json(json_config_url):
  response = urllib.request.urlopen(json_config_url)
  args_loaded = json.loads(response.read())
  backbone_type = args_loaded['backbone_type']
  width_mult = args_loaded['width_mult']
  relu_type = args_loaded['relu_type']
  use_boundary = args_loaded.get("use_boundary", False)

  if args_loaded.get('tcn_num_layers', ''):
    tcn_options = { 'num_layers': args_loaded['tcn_num_layers'],
                    'kernel_size': args_loaded['tcn_kernel_size'],
                    'dropout': args_loaded['tcn_dropout'],
                    'dwpw': args_loaded['tcn_dwpw'],
                    'width_mult': args_loaded['tcn_width_mult'],
                  }
  else:
    tcn_options = {}
  if args_loaded.get('densetcn_block_config', ''):
    densetcn_options = {'block_config': args_loaded['densetcn_block_config'],
                        'growth_rate_set': args_loaded['densetcn_growth_rate_set'],
                        'reduced_size': args_loaded['densetcn_reduced_size'],
                        'kernel_size_set': args_loaded['densetcn_kernel_size_set'],
                        'dilation_size_set': args_loaded['densetcn_dilation_size_set'],
                        'squeeze_excitation': args_loaded['densetcn_se'],
                        'dropout': args_loaded['densetcn_dropout'],
                        }
  else:
    densetcn_options = {}

  model = Lipreading( modality='video',
                      num_classes=500,
                      tcn_options=tcn_options,
                      densetcn_options=densetcn_options,
                      backbone_type=backbone_type,
                      relu_type=relu_type,
                      width_mult=width_mult,
                      use_boundary=use_boundary,
                      extract_feats=False).cuda()
  calculateNorm2(model)
  return model

# Get the model with default parameters
json_config_url = 'https://raw.githubusercontent.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/master/configs/lrw_resnet18_dctcn_boundary.json'
lrw_model = get_lrw_model_from_json(json_config_url)

# Load the model weights
lrw_model = load_model("/content/Lipreading_using_Temporal_Convolutional_Networks/models/model_weights.pth", lrw_model)
lrw_model.eval()

In [None]:
!pip install transformers

import os
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download and load the Llama 2 model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=hf_access_token)
llm_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=hf_access_token)

# Download the BigVGAN vocoder
bigvgan_url = 'https://github.com/PlayVoice/BigVGAN/releases/download/augment/nsf_bigvgan_pretrain_32K_Augment.pth'
bigvgan_output = 'bigvgan_pretrained.pth'
gdown.download(bigvgan_url, bigvgan_output, quiet=False)

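# Download the CMU phoneme vocabulary
# NOTE: the URL names version 0.7a while the local filename says 0.7b.symbols. Verify that the
# downloaded file holds one phoneme symbol per line, which is what PhonemeTokenizer expects.
cmu_url = 'http://www.speech.cs.cmu.edu/cgi-bin/cmudict/cmudict.0.7a'
cmu_output = 'cmudict-0.7b.symbols'
gdown.download(cmu_url, cmu_output, quiet=False)

print("Models and vocabulary downloaded successfully.")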

## Example Usage

Load the pre-trained models and provide an example usage of the lip-to-speech synthesis program.

1. [Load the pretrained models](https://colab.research.google.com/drive/1T5ihTJMCaqmOSAcC1xXyKU0CzGLttqHA#scrollTo=download_datasets&line=2&uniqifier=1)
2. Load the Video and Audio files.
3. Execute the synthesis function.
4. Save the new audio file.
5. [Optional] Use ffmpeg to merge the audio and video files (not included in this project).
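
The optional ffmpeg step is left out of the notebook. One possible way to do it is sketched here, assuming the `ffmpeg` binary is installed and on the `PATH`; the function names and file names are hypothetical.

```python
import subprocess

def build_ffmpeg_merge_command(video_path, audio_path, output_path):
    """Build an ffmpeg command that muxes the synthesized audio onto the silent video."""
    return [
        "ffmpeg", "-y",     # overwrite the output file if it exists
        "-i", video_path,   # input 0: silent video
        "-i", audio_path,   # input 1: generated speech
        "-map", "0:v:0",    # take the video stream from input 0
        "-map", "1:a:0",    # take the audio stream from input 1
        "-c:v", "copy",     # copy video frames without re-encoding
        "-shortest",        # stop at the end of the shorter stream
        output_path,
    ]

def merge_audio_and_video(video_path, audio_path, output_path):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    command = build_ffmpeg_merge_command(video_path, audio_path, output_path)
    subprocess.run(command, check=True)
```

For example, `merge_audio_and_video('silent_video.mp4', 'generated_speech.wav', 'dubbed_video.mp4')` would write a video carrying the generated speech as its audio track.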

In [None]:
import urllib.request
import os

# Download the silent lip video
silent_video_url = 'https://github.com/zoharbabin/lipsynth-experiment/raw/main/silent_video.mp4'
silent_video_filename = 'silent_video.mp4'
urllib.request.urlretrieve(silent_video_url, silent_video_filename)
print(f'Silent lip video downloaded as {silent_video_filename}')
# Download the target speaker audio
target_audio_url = 'https://github.com/zoharbabin/lipsynth-experiment/raw/main/reference_audio.wav'
target_audio_filename = 'reference_audio.wav'
urllib.request.urlretrieve(target_audio_url, target_audio_filename)
print(f'Target speaker audio downloaded as {target_audio_filename}')

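# Example usage (run the utility function cells below first; they define
# load_video, lip_to_speech_synthesis, and save_audio)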
lip_video = load_video('silent_video.mp4')
target_speaker_audio = 'reference_audio.wav'

generated_speech = lip_to_speech_synthesis(lip_video, target_speaker_audio)

# Save the generated speech
save_audio(generated_speech, 'generated_speech.wav')

## Utility Functions Below...

In [14]:
def load_video(video_path):
    # Load the video frames using OpenCV
    cap = cv2.VideoCapture(video_path)
    frames = []
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
    cap.release()
    return np.array(frames)

In [15]:
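import scipy.io.wavfile

def save_audio(audio, output_path, sample_rate=22050):
    # Save the audio using scipy; tensors are converted to a NumPy array first
    if isinstance(audio, torch.Tensor):
        audio = audio.detach().cpu().numpy().squeeze()
    scipy.io.wavfile.write(output_path, sample_rate, audio)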
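
`scipy.io.wavfile.write` stores floating-point arrays as floating-point WAV data, which some players do not handle. If the generated waveform is floating point in the range [-1, 1] (an assumption; the notebook does not state the vocoder's output range), it can be converted to 16-bit PCM first. The sketch below uses only the standard library, and both helper names are hypothetical.

```python
import struct
import wave

def float_to_pcm16(samples):
    """Clip samples to [-1, 1] and scale them to 16-bit signed integers."""
    return [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]

def write_pcm16_wav(path, samples, sample_rate=22050):
    """Write mono float samples to a 16-bit PCM WAV file."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```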

In [16]:
class PhonemeTokenizer:
    def __init__(self, vocab_file):
        self.vocab = self.load_vocab(vocab_file)
        self.vocab_size = len(self.vocab)

    def load_vocab(self, vocab_file):
        with open(vocab_file, 'r') as f:
            vocab = f.read().splitlines()
        return {p: i for i, p in enumerate(vocab)}

    def tokenize(self, phoneme_sequence):
        return [self.vocab[p] for p in phoneme_sequence.split()]
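
`PhonemeTokenizer` reads one symbol per line and raises a `KeyError` for any symbol missing from the vocabulary. The sketch below shows the expected file layout and one way to tolerate unknown symbols. `SafePhonemeTokenizer`, the `<unk>` token, and the four sample symbols are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

class SafePhonemeTokenizer:
    """Hypothetical variant of PhonemeTokenizer that maps unknown symbols to <unk>."""

    def __init__(self, vocab_file, unk_token="<unk>"):
        with open(vocab_file, "r") as f:
            symbols = f.read().splitlines()
        # Reserve index 0 for the unknown token, then number symbols in file order
        self.vocab = {unk_token: 0}
        for symbol in symbols:
            self.vocab.setdefault(symbol, len(self.vocab))
        self.unk_id = 0

    def tokenize(self, phoneme_sequence):
        return [self.vocab.get(p, self.unk_id) for p in phoneme_sequence.split()]

# Build a tiny vocabulary file with one symbol per line (illustrative symbols only)
vocab_path = os.path.join(tempfile.mkdtemp(), "symbols.txt")
with open(vocab_path, "w") as f:
    f.write("HH\nAH\nL\nOW\n")

tokenizer_demo = SafePhonemeTokenizer(vocab_path)
print(tokenizer_demo.tokenize("HH AH L OW ZZ"))  # [1, 2, 3, 4, 0]
```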

In [17]:
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
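
For reference, the table built in `PositionalEncoding` is the standard sinusoidal encoding. Here `pos` is the position index and `i` indexes pairs of feature dimensions:

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
```

The `div_term` in the code equals $\exp\!\left(2i \cdot \frac{-\ln 10000}{d_{\text{model}}}\right) = 10000^{-2i/d_{\text{model}}}$, so multiplying it by `position` gives the argument of the sine and cosine above.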

In [18]:
def generate_AV_instructions(lip_video):
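    # Extract visual features using the pre-trained Lipreading model
    # NOTE: extract_features and predict are placeholder method names; adapt them to the
    # inference interface of the Lipreading class from the cloned repository
    visual_features = lrw_model.extract_features(lip_video)

    # Generate noisy text transcripts using the Lipreading model
    noisy_text_transcripts = lrw_model.predict(lip_video)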

    # Perform Q-Former alignment
    aligned_visual_features = Q_Former_align(visual_features)

    # Fine-tune the LLM projection layer
    fine_tuned_llm = fine_tune_LLM_projection(llm_model, aligned_visual_features, noisy_text_transcripts)

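    # Generate audio-visual instructions using the fine-tuned LLM.
    # generate() accepts no text_inputs argument, so only the aligned embeddings are passed;
    # the generated token IDs are decoded back to text for the phonemizer in Stage 2
    output_ids = fine_tuned_llm.generate(inputs_embeds=aligned_visual_features)
    AV_instructions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]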

    return AV_instructions, visual_features

In [19]:
class QFormer(nn.Module):
    def __init__(self, d_model, nhead, num_layers):
        super(QFormer, self).__init__()
        self.self_attn_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, nhead) for _ in range(num_layers)])
        self.cross_attn_layers = nn.ModuleList([nn.TransformerDecoderLayer(d_model, nhead) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_features, text_features):
        # Self-attention on visual features
        for self_attn_layer in self.self_attn_layers:
            visual_features = self_attn_layer(visual_features)

        # Cross-attention between visual features and text features
        # (skipped when no text features are provided, as in Q_Former_align)
        if text_features is not None:
            for cross_attn_layer in self.cross_attn_layers:
                visual_features = cross_attn_layer(visual_features, text_features)

        # Layer normalization
        visual_features = self.norm(visual_features)

        return visual_features

def Q_Former_align(visual_features, d_model=512, nhead=8, num_layers=4):
    # Create an instance of the Q-Former network
    q_former = QFormer(d_model, nhead, num_layers)

    # Align visual features using Q-Former
    aligned_visual_features = q_former(visual_features, None)

    return aligned_visual_features

In [20]:
def fine_tune_LLM_projection(llm_model, aligned_visual_features, noisy_text_transcript, num_epochs=5, batch_size=4, learning_rate=1e-5):
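    # Tokenize the noisy text transcript with the Llama tokenizer loaded earlier
    # (Llama 2 defines no pad token, so the EOS token is reused for padding)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    input_ids = tokenizer(noisy_text_transcript, return_tensors='pt', padding=True, truncation=True)['input_ids']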

    # Create a DataLoader for batching the data
    dataset = torch.utils.data.TensorDataset(aligned_visual_features, input_ids)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Set the LLM to training mode
    llm_model.train()

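    # Freeze all layers except the input projection layer
    # (the input embedding is named embed_tokens in Llama models)
    for name, param in llm_model.named_parameters():
        if 'embed_tokens' not in name:
            param.requires_grad = False

    # Create an optimizer over the parameters that remain trainable
    trainable_params = [p for p in llm_model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable_params, lr=learning_rate)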

    # Fine-tune the input projection layer
    for epoch in range(num_epochs):
        for batch in dataloader:
            batch_visual_features, batch_input_ids = batch

            # Forward pass
            outputs = llm_model(inputs_embeds=batch_visual_features, labels=batch_input_ids)
            loss = outputs.loss

            # Backward pass and optimization
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

    return llm_model

In [21]:
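def text_to_phonemes(text):
    # phonemize returns a string for string input and a list for list input
    phonemes = phonemize(text, language='en-us', backend='espeak', strip=True)
    if isinstance(phonemes, str):
        return phonemes
    # Join the per-utterance results into a single space-separated sequence
    return ' '.join(phonemes)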

In [22]:
class TextEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, dropout=0.1):
        super(TextEncoder, self).__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Pass dropout by keyword; the second positional argument of PositionalEncoding is max_len
        self.positional_encoding = PositionalEncoding(d_model, dropout=dropout)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, src):
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.positional_encoding(src)
        output = self.transformer_encoder(src)
        return output

def encode_text(phoneme_sequence, vocab_size, d_model=512, nhead=8, num_layers=6, dropout=0.1):
    # Tokenize the phoneme sequence
    tokenizer = PhonemeTokenizer('cmudict-0.7b.symbols')
    tokens = tokenizer.tokenize(phoneme_sequence)

    # Convert tokens to tensor
    input_tensor = torch.tensor(tokens).unsqueeze(0)  # Add batch dimension

    # Create an instance of the TextEncoder
    text_encoder = TextEncoder(vocab_size, d_model, nhead, num_layers, dropout)

    # Encode the phoneme sequence
    phoneme_embeddings = text_encoder(input_tensor)

    return phoneme_embeddings

In [23]:
class VisualEncoder(nn.Module):
    def __init__(self, d_model, nhead, num_layers, dropout=0.1):
        super(VisualEncoder, self).__init__()
        # Pass dropout by keyword; the second positional argument of PositionalEncoding is max_len
        self.positional_encoding = PositionalEncoding(d_model, dropout=dropout)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, src):
        src = self.positional_encoding(src)
        output = self.transformer_encoder(src)
        return output

def encode_visual(visual_features, d_model=512, nhead=8, num_layers=6, dropout=0.1):
    # Ensure the visual features have the correct dimensions
    assert visual_features.ndim == 3, "Visual features should have dimensions (batch_size, sequence_length, feature_dim)"

    # Create an instance of the VisualEncoder
    visual_encoder = VisualEncoder(d_model, nhead, num_layers, dropout)

    # Encode the visual features
    visual_embeddings = visual_encoder(visual_features)

    return visual_embeddings

In [24]:
import torch
import torch.nn as nn
import math

def scaled_dot_product_attention(query, key, value, mask=None):
    """Compute scaled dot-product attention.

    Args:
        query (torch.Tensor): Query tensor of shape (batch_size, num_heads, query_length, d_k).
        key (torch.Tensor): Key tensor of shape (batch_size, num_heads, key_length, d_k).
        value (torch.Tensor): Value tensor of shape (batch_size, num_heads, value_length, d_v).
        mask (torch.Tensor, optional): Mask tensor of shape (batch_size, 1, 1, key_length). Defaults to None.

    Returns:
        torch.Tensor: Output tensor of shape (batch_size, num_heads, query_length, d_v).
        torch.Tensor: Attention weights tensor of shape (batch_size, num_heads, query_length, key_length).
    """
    d_k = key.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = nn.functional.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, value)
    return output, attention_weights

def align_embeddings(phoneme_embeddings, visual_embeddings, d_model=512, nhead=8, dropout=0.1):
    """Align phoneme embeddings with visual embeddings using scaled dot-product attention.

    Args:
        phoneme_embeddings (torch.Tensor): Phoneme embeddings tensor of shape (batch_size, phoneme_seq_length, d_model).
        visual_embeddings (torch.Tensor): Visual embeddings tensor of shape (batch_size, visual_seq_length, d_model).
        d_model (int, optional): Model dimension. Defaults to 512.
        nhead (int, optional): Number of attention heads. Defaults to 8.
        dropout (float, optional): Dropout probability. Defaults to 0.1.

    Returns:
        torch.Tensor: Aligned embeddings tensor of shape (batch_size, phoneme_seq_length, d_model).
    """
    # Ensure the embeddings have the correct dimensions
    assert phoneme_embeddings.ndim == 3 and visual_embeddings.ndim == 3, "Embeddings should have dimensions (batch_size, sequence_length, embedding_dim)"
    assert phoneme_embeddings.size(-1) == d_model and visual_embeddings.size(-1) == d_model, "Embeddings should have the same feature dimension as d_model"

    # Create query, key, and value tensors
    query = phoneme_embeddings
    key = visual_embeddings
    value = visual_embeddings

    # Split the embeddings into heads; the phoneme (query) and visual (key/value)
    # sequences may have different lengths
    batch_size, seq_len, _ = query.size()
    kv_len = key.size(1)
    query = query.view(batch_size, seq_len, nhead, d_model // nhead).transpose(1, 2)
    key = key.view(batch_size, kv_len, nhead, d_model // nhead).transpose(1, 2)
    value = value.view(batch_size, kv_len, nhead, d_model // nhead).transpose(1, 2)

    # Compute the scaled dot-product attention
    aligned_embeddings, _ = scaled_dot_product_attention(query, key, value)

    # Concatenate the heads and apply dropout
    aligned_embeddings = aligned_embeddings.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
    aligned_embeddings = nn.functional.dropout(aligned_embeddings, p=dropout)

    return aligned_embeddings
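
In this alignment step the queries come from the phoneme sequence and the keys and values from the video frames, so the two sequences usually differ in length and the output has one row per phoneme. The single-head sketch below illustrates this on plain Python lists; the toy vectors are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

phoneme_queries = [[1.0, 0.0], [0.0, 1.0]]          # 2 phonemes
frame_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 video frames
frame_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

aligned = attention(phoneme_queries, frame_keys, frame_values)
print(len(aligned), len(aligned[0]))  # 2 2
```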

In [25]:
class UpsampleNetwork(nn.Module):
    def __init__(self, input_dim, output_dim, upsample_factors):
        super(UpsampleNetwork, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.upsample_factors = upsample_factors

        self.upsample_layers = nn.ModuleList()
        for factor in upsample_factors:
            self.upsample_layers.append(
                nn.Sequential(
                    nn.ConvTranspose1d(input_dim, output_dim, kernel_size=factor, stride=factor),
                    nn.BatchNorm1d(output_dim),
                    nn.ReLU(inplace=True)
                )
            )
            input_dim = output_dim

    def forward(self, x):
        for layer in self.upsample_layers:
            x = layer(x)
        return x

def upsample(aligned_embeddings, output_dim, upsample_factors):
    # Ensure the aligned embeddings have the correct dimensions
    assert aligned_embeddings.ndim == 3, "Aligned embeddings should have dimensions (batch_size, sequence_length, embedding_dim)"

    # Create an instance of the UpsampleNetwork
    input_dim = aligned_embeddings.size(-1)
    upsample_network = UpsampleNetwork(input_dim, output_dim, upsample_factors)

    # Upsample the aligned embeddings
    upsampled_embeddings = upsample_network(aligned_embeddings.transpose(1, 2))
    upsampled_embeddings = upsampled_embeddings.transpose(1, 2)

    return upsampled_embeddings
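
Because every layer uses a transposed convolution with `kernel_size == stride == factor`, each layer multiplies the sequence length by its factor. The sketch below works through that arithmetic for the factors `[2, 2, 2, 2, 5]` used in `synthesize_speech`; the helper names and the 25-step input length are illustrative.

```python
import math

def conv_transpose1d_length(length_in, kernel_size, stride, padding=0):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length_in - 1) * stride - 2 * padding + kernel_size

def upsampled_length(length_in, upsample_factors):
    """Apply one transposed convolution per factor, with kernel_size == stride == factor."""
    length = length_in
    for factor in upsample_factors:
        length = conv_transpose1d_length(length, kernel_size=factor, stride=factor)
    return length

factors = [2, 2, 2, 2, 5]
print(math.prod(factors))             # 80
print(upsampled_length(25, factors))  # 2000
```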

In [26]:
class MelDecoder(nn.Module):
    def __init__(self, d_model, nhead, num_layers, output_dim, dropout=0.1):
        super(MelDecoder, self).__init__()
        # Pass dropout by keyword; the second positional argument of PositionalEncoding is max_len
        self.pos_encoder = PositionalEncoding(d_model, dropout=dropout)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, dropout=dropout)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.fc = nn.Linear(d_model, output_dim)

    def forward(self, tgt, memory):
        tgt = self.pos_encoder(tgt)
        output = self.transformer_decoder(tgt, memory)
        output = self.fc(output)
        return output

def decode_melspectrogram(conditioned_embeddings, d_model, nhead, num_layers, output_dim, dropout=0.1):
    # Ensure the conditioned embeddings have the correct dimensions
    assert conditioned_embeddings.ndim == 3, "Conditioned embeddings should have dimensions (batch_size, sequence_length, embedding_dim)"

    # Create target tensor for decoder input
    batch_size, seq_len, _ = conditioned_embeddings.size()
    tgt = torch.zeros(batch_size, seq_len, d_model, device=conditioned_embeddings.device)

    # Create an instance of the MelDecoder
    mel_decoder = MelDecoder(d_model, nhead, num_layers, output_dim, dropout)

    # Decode the mel-spectrogram
    generated_melspectrogram = mel_decoder(tgt, conditioned_embeddings)


    return generated_melspectrogram

In [27]:
import torchaudio

def extract_speaker_embedding(audio_path):
    # Load the pre-trained speaker recognition model
    speaker_recognition = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir="pretrained_models/spkrec-ecapa-voxceleb"
    )

    # Load the audio file
    signal, fs = torchaudio.load(audio_path)

    # The ECAPA model was trained on 16 kHz audio, so resample if needed
    if fs != 16000:
        signal = torchaudio.functional.resample(signal, fs, 16000)

    # Extract the speaker embedding
    embeddings = speaker_recognition.encode_batch(signal)

    return embeddings

In [34]:
def synthesize_speech(AV_instructions, visual_features, target_speaker_embedding):
    # Tokenize AV instructions and create phoneme sequence
    phoneme_sequence = text_to_phonemes(AV_instructions)

    if not phoneme_sequence:
        print("Warning: Empty or invalid phoneme sequence. Skipping speech synthesis.")
        return None

    # Encode phoneme sequence using a Transformer encoder
    # (encode_text tokenizes the sequence itself, so the raw phoneme string is passed)
    phoneme_tokenizer = PhonemeTokenizer('/content/cmudict-0.7b.symbols')
    phoneme_embeddings = encode_text(phoneme_sequence, phoneme_tokenizer.vocab_size)

    # Encode visual features using a Transformer encoder
    visual_embeddings = encode_visual(visual_features)

    # Align phoneme and visual embeddings using scaled dot-product attention
    aligned_embeddings = align_embeddings(phoneme_embeddings, visual_embeddings)

    # Upsample aligned embeddings to match the desired mel-spectrogram length
    upsampled_embeddings = upsample(aligned_embeddings, output_dim=80, upsample_factors=[2, 2, 2, 2, 5])

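    # Broadcast the target speaker embedding over batch and time, then concatenate it
    # with the upsampled embeddings on the feature axis (assumes a single reference clip)
    batch_size, num_frames, _ = upsampled_embeddings.shape
    speaker = target_speaker_embedding.reshape(1, 1, -1).expand(batch_size, num_frames, -1)
    conditioned_embeddings = torch.cat((upsampled_embeddings, speaker), dim=-1)

    # Decode conditioned embeddings into a mel-spectrogram using a Transformer decoder;
    # d_model must equal the concatenated feature size and be divisible by nhead
    generated_melspectrogram = decode_melspectrogram(conditioned_embeddings, d_model=conditioned_embeddings.size(-1), nhead=8, num_layers=6, output_dim=80)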

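    # Convert mel-spectrogram to speech waveform using the ParallelWaveGANGenerator
    # NOTE: the constructor call below is a placeholder; a trained vocoder checkpoint
    # has to be loaded before inference can produce usable audio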
    vocoder = ParallelWaveGANGenerator(model_dir='/content/')
    with torch.no_grad():
        generated_speech = vocoder.inference(generated_melspectrogram)

    return generated_speech

In [29]:
def lip_to_speech_synthesis(lip_video, target_speaker_audio):
    # Generate audio-visual instructions using LLMs (Stage 1)
    AV_instructions, visual_features = generate_AV_instructions(lip_video)

    # Extract target speaker embedding from a short audio clip
    target_speaker_embedding = extract_speaker_embedding(target_speaker_audio)

    # Synthesize speech using visual text-to-speech model (Stage 2)
    generated_speech = synthesize_speech(AV_instructions, visual_features, target_speaker_embedding)

    if generated_speech is None:
        print("Warning: Speech synthesis failed. Returning None.")
        return None

    return generated_speech