# **Auto-Transcription Project**

### **Description**
This notebook demonstrates the process of converting audio files into text using modern automatic speech recognition (ASR) techniques. The workflow involves loading audio files, preprocessing them, applying a transcription model, and exporting the results in a user-friendly format.

---

## **Notebook Structure**
1. **Introduction**  
   Overview of the notebook’s purpose and objectives.

2. **Setup**  
   Required libraries and configurations for the transcription process.

3. **Data Preparation**  
   Loading and exploring audio files to ensure proper format and quality.

4. **Transcription**  
   Applying a pretrained speech-to-text model to convert audio into text.

5. **Results**  
   Presenting the transcribed text and saving outputs for further analysis.

6. **Conclusion and Future Work**  
   Summary of findings, challenges, and potential improvements.

---

## **Key Features**
- **Preprocessing:** Ensures audio files meet the requirements for ASR models.  
- **Pretrained Model Usage:** Utilises state-of-the-art models for accurate transcription.  
- **Scalability:** Processes multiple audio files in batch mode.  
- **Output:** Saves transcriptions as text files for easy integration with other tools.

---

### **Usage Instructions**
- Place all input audio files in a dedicated folder. Ensure they are in the correct format (e.g., `.wav`, 16kHz).  
- Update configuration settings (e.g., file paths, model parameters) as needed.  
- Run the notebook sequentially to complete the transcription process.  
- Review and validate the transcribed text for quality assurance.

---

# Set-up

Here’s how to set up for your voice cloning:

### Transcription

In [2]:
%pip install imageio[ffmpeg]
%pip install alive-progress
!ffmpeg -version

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.comNote: you may need to restart the kernel to use updated packages.

ffmpeg version 2024-11-21-git-f298507323-essentials_build-www.gyan.dev Copyright (c) 2000-2024 the FFmpeg developers
built with gcc 14.2.0 (Rev1, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libvidstab 

In [3]:
%pip install git+https://github.com/openai/whisper.git
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to c:\users\tdrelangue\appdata\local\temp\pip-req-build-zl77n_my
  Resolved https://github.com/openai/whisper.git to commit 90db0de1896c23cbfaf0c58bc2d30665f709f170
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Note: you may need to restart the kernel to use updated packages.


  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git 'C:\Users\tdrelangue\AppData\Local\Temp\pip-req-build-zl77n_my'


Looking in indexes: https://download.pytorch.org/whl/cu118, https://pypi.ngc.nvidia.comNote: you may need to restart the kernel to use updated packages.



(Adjust the PyTorch URL based on your CUDA version; use `cu118` for CUDA 11.8 or `cpu` for non-GPU systems.)


### Data Prep

In [4]:
import whisper
import csv
from alive_progress import alive_bar
from pydub import AudioSegment
from pydub.silence import split_on_silence
import librosa
import soundfile as sf

### Tacotron

In [None]:
%pip install torch
%pip install pytorch-lightning
%pip install torchaudio
%pip install torchvision 
%pip install pandas
%pip install pyopencl
%pip install icecream

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.comNote: you may need to restart the kernel to use updated packages.

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.comNote: you may need to restart the kernel to use updated packages.

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [None]:
import torch
import torchaudio
from torch.utils.data import Dataset
import os
import pandas as pd
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from pytorch_lightning import LightningModule, Trainer
from torchaudio.models import Tacotron2
import torch.optim as optim
import pyopencl as clpr
import icecream as ic

  from .autonotebook import tqdm as notebook_tqdm
  warn("Unable to import recommended hash 'siphash24.siphash13', "


# Batch Transcribe Audios

## Set-up

In [44]:
os.environ["PATH"] += r";C:\Users\tdrelangue\ffmpeg\bin"
model = whisper.load_model("base")  # Choose a model size
audio_folder = "Audios"

## Run

In [45]:
def transcribe(audio_folder=audio_folder, dataprep=False):
    metadata=[]
    with alive_bar(len(os.listdir(audio_folder)),force_tty=True) as bar:
        for filename in os.listdir(audio_folder):
            if filename.endswith(".wav") or filename.endswith(".mp3"):
                audio_path = os.path.join(os.getcwd(),audio_folder, filename)

                # Load model and transcribe
                result = model.transcribe(audio_path, fp16=False)
                if not dataprep:
                    text_path = audio_path.replace(".wav", ".txt").replace(".mp3", ".txt").replace("Audio", "transcript")
                    with open(text_path, "w") as f:
                        f.write(result["text"])
                else :
                    metadata.append([audio_path, result["text"]])
            bar()
        
    
    if dataprep:
        with open("dataset\metadata.csv", "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile, delimiter="|")
            writer.writerows(metadata)
        print("Data transcribed !")

#transcribe()

# Prepare Data

In [46]:
# Load audio
with alive_bar(len(os.listdir(audio_folder)),force_tty=True) as bar:
    for file, filename in enumerate(os.listdir(audio_folder)):
        audio = None
        audio_path = os.path.join(os.getcwd(),audio_folder, filename)

        if filename.endswith(".mp3"):
            audio = AudioSegment.from_file(f"{audio_path}", format="mp3")
        elif filename.endswith(".wav"):
            audio = AudioSegment.from_file(f"{audio_path}", format="wav")   

        if audio:
            # Split on silence
            chunks = split_on_silence(audio, min_silence_len=200, silence_thresh=-40)

            # Export clips
            for i, chunk in enumerate(chunks):
                os.makedirs("dataset/wavs", exist_ok=True)
                chunk_name = f"dataset/wavs/clip_{file}_{i:04d}.wav"
                chunk.export(chunk_name, format="wav")
                y, sr = librosa.load(chunk_name, sr=16000, mono=True)  # Load and resample
                sf.write(chunk_name, y, sr)  # Save the processed file
        bar()

|████████████████████████████████████████| 4/4 [100%] in 6:05.6 (0.01/s)        


In [47]:
transcribe(audio_folder=f"dataset/wavs",dataprep=True)

|████████████████████████████████████████| 1153/1153 [100%] in 1:29:51.7 (0.21/s
Data transcribed !


## Make Training and Validation sets
useless

### Set up

In [48]:
import random

# Paths
metadata_path = "dataset/metadata.csv"  # Path to your metadata.csv
filelists_folder = "filelists"          # Output folder for filelists
os.makedirs(filelists_folder, exist_ok=True)  # Ensure the filelists folder exists

### Split data

In [49]:
# Read metadata
with open(metadata_path, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Shuffle lines
random.shuffle(lines)

# Split into training and validation sets
split_ratio = 0.85  # 90% train, 10% validation
split_index = int(len(lines) * split_ratio)
train_lines = lines[:split_index]
val_lines = lines[split_index:]

# Write train_filelist.txt
train_file_path = os.path.join(filelists_folder, "train_filelist.txt")
with open(train_file_path, "w", encoding="utf-8") as f:
    f.writelines(train_lines)

# Write val_filelist.txt
val_file_path = os.path.join(filelists_folder, "val_filelist.txt")
with open(val_file_path, "w", encoding="utf-8") as f:
    f.writelines(val_lines)

print(f"Training filelist created: {train_file_path}")
print(f"Validation filelist created: {val_file_path}")


Training filelist created: filelists\train_filelist.txt
Validation filelist created: filelists\val_filelist.txt


# Tacotron Model

## Tokenization

In [7]:
class TextTokenizer:
    def __init__(self):
        # Define a character set. You can expand this if needed.
        self.char_to_id = {char: idx for idx, char in enumerate("abcdefghijklmnopqrstuvwxyz ")}  # Include space
        self.id_to_char = {idx: char for idx, char in enumerate("abcdefghijklmnopqrstuvwxyz ")}
    
    def encode(self, text):
        # Check if the input is a string; if not, pass it as is
        if isinstance(text, torch.Tensor):
            return text  # Already encoded, return as is
        if not isinstance(text, str):
            text = self.decode(ids=text)

        # Encode text to a list of indices and convert to a tensor
        token_indices = [self.char_to_id[char] for char in text.lower() if char in self.char_to_id]
        return torch.tensor(token_indices, dtype=torch.long)

    
    def decode(self, ids):
        # Decode a tensor or list of indices back to text
        if isinstance(ids, torch.Tensor):  # If input is a tensor, convert to a list
            ids = ids.tolist()
        if isinstance(ids, float):  # If input is a tensor, convert to a list
            return f'{ids}'
        return ''.join([self.id_to_char[idx] for idx in ids])


In [8]:
%pip install transformers
%pip install sacremoses

from transformers import AutoModel, AutoTokenizer

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.comNote: you may need to restart the kernel to use updated packages.

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [9]:
# chargement du tokenizer associé au modèle bert-base-multilingual-cased
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# les classes AutoTokenizer / AutoModel / AutoConfig permettent de "deviner" la classe exacte
# des différents modèles, d'après leur nom
# (le nom "bert-base-multilingual-cased" est associé à un BertModel, BertTokenizer et BertConfig)
print(type(tokenizer_bert))

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>


## Mel spectrogram

In [10]:
import os
import torchaudio
import torchaudio.transforms as T
import pandas as pd
from torch.utils.data import DataLoader, random_split, Dataset

class MelSpectrogramDataset(Dataset):
    def __init__(self, metadata_path, audio_dir, cfg, target_length=None):
        super().__init__()
        self.metadata = pd.read_csv(metadata_path, sep="|", header=None, names=["file", "text"])
        self.audio_dir = audio_dir
        self.target_length = target_length  # Specify target length for padding/truncation

        # Create MelSpectrogram transform
        self.mel_transform = T.MelSpectrogram(
            sample_rate=22050,
            n_fft=1024,
            hop_length=256,
            n_mels=cfg["model"]["mel_channels"],
        )

    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        # Get file path and text label
        row = self.metadata.iloc[idx]
        audio_path = os.path.join(self.audio_dir, row["file"])
        label_text = row["text"]

        # Load audio and convert to mel spectrogram
        waveform, sample_rate = torchaudio.load(audio_path)
        if sample_rate != 22050:
            resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=22050)
            waveform = resample(waveform)
        mel_spectrogram = self.mel_transform(waveform).squeeze(0)  # Remove channel dim if mono

        return mel_spectrogram, label_text

    def pad_or_truncate(self, mel_spectrogram, target_length):
        # Pad or truncate the mel spectrogram to the target length
        if mel_spectrogram.size(1) < target_length:
            # Pad with zeros (along the time dimension)
            padding = target_length - mel_spectrogram.size(1)
            mel_spectrogram = torch.nn.functional.pad(mel_spectrogram, (0, padding), mode='constant', value=0)
        else:
            # Truncate if it's too long
            mel_spectrogram = mel_spectrogram[:, :target_length]

        return mel_spectrogram

In [11]:
def collate_fn(batch):
    # Separate texts and mel spectrograms from the batch
    texts, mel_spectrograms = zip(*batch)

    # Debugging: Print the shape of the first mel spectrogram
    print(f"First mel spectrogram shape (before padding): {mel_spectrograms[0].shape}")

    # Pad texts (assumes each text is already a 1D tensor)
    padded_texts = pad_sequence(texts, batch_first=True, padding_value=0)

    # Ensure mel spectrograms are padded along the time dimension
    max_len = max(mel.shape[1] for mel in mel_spectrograms)  # Max time dimension
    padded_mels = []
    for mel in mel_spectrograms:
        pad_len = max_len - mel.shape[1]
        padded_mel = torch.nn.functional.pad(
            mel,
            (0, pad_len),  # Padding along the time dimension
            mode="constant",
            value=0.0,  # Use 0.0 for padding
        )
        padded_mels.append(padded_mel)

    # Stack into a batch tensor
    mel_batch = torch.stack(padded_mels, dim=0)  # (batch_size, num_mels, max_len)

    return padded_texts, mel_batch

## Prepare Variables

In [12]:
class Tacotron2TTS(LightningModule):
    def __init__(self, cfg):
        super().__init__()
        self.model = Tacotron2(cfg["model"])
        self.mel_channels = cfg["model"]["mel_channels"]  # Corresponds to mel_channels
        self.hidden_channels = cfg["model"]["hidden_channels"]
        self.attention_dim = cfg["model"]["attention_dim"]
        self.default_mel_length = 80
        self.cfg = cfg
        # Add the embedding layer
        self.embedding = nn.Embedding(cfg["model"]["vocab_size"], cfg["model"]["embedding_dim"])  # Define the embedding layer
    
    def forward(self, text, mel_spectrogram, token_lengths=None ,mel_specgram_lengths=None):
        # Ensure `mel_spectrogram` is a tensor, or create a placeholder if `None`
        if mel_spectrogram is None:
            mel_spectrogram = torch.zeros((text.size(0), self.mel_channels, self.default_mel_length), 
                                        dtype=torch.float32, device=self.device)
        
        mel_spectrogram = torch.tensor(mel_spectrogram, dtype=torch.float32) if not isinstance(mel_spectrogram, torch.Tensor) else mel_spectrogram
        
        # Ensure token_lengths is a tensor
        token_lengths = torch.tensor(token_lengths, dtype=torch.long) if not isinstance(token_lengths, torch.Tensor) else token_lengths
        mel_specgram_lengths = torch.tensor(mel_spectrogram, dtype=torch.long) if not isinstance(mel_specgram_lengths, torch.Tensor) else mel_specgram_lengths
        # Sort token_lengths in descending order and get the sorted indices
        sorted_lengths, sorted_idx = torch.sort(token_lengths, descending=True)
        sorted_text = text[sorted_idx]  # Sort text accordingly
        
        # Pack the padded sequences (tokens) with sorted lengths
        packed_input = nn.utils.rnn.pack_padded_sequence(sorted_text, sorted_lengths, batch_first=True, enforce_sorted=False)

        # Forward pass through the model (you will use `packed_input` now)
        embedded_inputs = self.embedding(packed_input.data).transpose(1, 2)  # Example of how to handle packed input
        
        encoder_outputs = self.encoder(embedded_inputs, sorted_lengths)
        mel_specgram, gate_outputs, alignments = self.decoder(encoder_outputs, mel_spectrogram, memory_lengths=sorted_lengths)
        
        mel_specgram_postnet = self.postnet(mel_specgram)
        
        return mel_specgram_postnet

    def training_step(self, batch, batch_idx):
        # Unpack the batch
        text, mel_spectrogram = batch

        # Ensure token lengths and spectrogram lengths are tensors
        token_lengths = torch.tensor([text.shape[1]] * text.shape[0], dtype=torch.long, device=text.device)
        mel_specgram_lengths = torch.tensor([mel_spectrogram.shape[2]] * mel_spectrogram.shape[0], dtype=torch.long, device=mel_spectrogram.device)

        # Forward pass
        mel_spectrogram_pred = self.forward(text=text, mel_spectrogram=mel_spectrogram, token_lengths=token_lengths, mel_specgram_lengths=mel_specgram_lengths)
        
        # Compute loss
        loss = torch.nn.functional.mse_loss(mel_spectrogram_pred, mel_spectrogram)
        
        # Log the loss for monitoring
        # Print debug information 
        # print(f"text type: {type(text)}, text shape: {text.shape}") 
        # print(f"mel_spectrogram type: {type(mel_spectrogram)}, mel_spectrogram shape: {mel_spectrogram.shape}")
        self.log('train_loss', loss)

        return loss

    def validation_step(self, batch, batch_idx):
        text, mel_spectrogram = batch
        token_lengths = torch.tensor([text.shape[1]] * text.shape[0], dtype=torch.long, device=text.device)  # Example: length per sample
        mel_spectrogram_pred = self.forward(text=text, mel_spectrogram=mel_spectrogram, token_lengths=token_lengths)
        loss = torch.nn.functional.mse_loss(mel_spectrogram_pred, mel_spectrogram)
        self.log("val_loss", loss)

    def test_step(self, batch, batch_idx):
        text, mel_spectrogram = batch
        token_lengths = torch.tensor([text.shape[1]] * text.shape[0], dtype=torch.long, device=text.device)  # Example: length per sample
        mel_spectrogram_pred = self.forward(text=text, mel_spectrogram=mel_spectrogram, token_lengths=token_lengths)
        loss = torch.nn.functional.mse_loss(mel_spectrogram_pred, mel_spectrogram)
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=self.cfg["trainer"]["lr"])

    def print_batch_properties(self, batch):
        """
        Prints the properties of the batch components.

        Args:
            batch: A batch of data, expected to be a tuple or a list.
        """
        if not isinstance(batch, (tuple, list)):
            print("Batch is not a tuple or list. Type:", type(batch))
            return

        print("Batch contains", len(batch), "elements.")
        for i, item in enumerate(batch):
            print(f"--- Element {i} ---")
            print(f"Type: {type(item)}")
            if isinstance(item, torch.Tensor):
                print(f"Shape: {item.shape}")
                print(f"Dtype: {item.dtype}")
            elif hasattr(item, "__len__"):
                print(f"Length: {len(item)}")
            else:
                print("No additional properties available.")
            print("-------------------")

In [13]:
from collections import Counter

metadata_path = "dataset/metadata.csv"
audio_dir = "dataset/wavs"
metadata = pd.read_csv(metadata_path, sep="|", header=None, names=["file", "text"])

# Combine all texts from the dataset
all_texts = metadata["text"].dropna().tolist()

tokenized_texts=[]
# Tokenize the texts using the tokenizer
for text in all_texts:
    
    try:
        tokenized_text = tokenizer_bert(text, padding=True, truncation=True, return_tensors="pt", add_special_tokens=True)
    except:
        print(text)
        print(type(text))
        raise ValueError(text)
    tokenized_texts.append(tokenized_text)

# Dataset setup
vocab_size = len(tokenizer_bert.get_vocab())


# Configuration
cfg = {
    "model": {
        "mel_channels": 80,
        "hidden_channels": 128,
        "attention_dim": 128,
        "vocab_size": vocab_size,
        "embedding_dim":512
    },
    "trainer": {
        "max_epochs": 10,
        "lr": 1e-3,
        "batch_size": 16,
    },
}

In [14]:
# Assuming MelSpectrogramDataset is defined
dataset = MelSpectrogramDataset(metadata_path, audio_dir, cfg)

# Split dataset into training and validation
dataset_size = len(dataset)
val_size = int(0.2 * dataset_size)  # 20% validation
train_size = dataset_size - val_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# DataLoaders for training and validation sets
train_loader = DataLoader(
    train_dataset,
    batch_size=cfg["trainer"]["batch_size"],
    shuffle=True,
    num_workers=8,  # Optional: Adjust depending on your CPU cores for faster data loading
    pin_memory=True,
    collate_fn=lambda batch: (
        torch.stack([item[0] for item in batch]),  # Stack mel spectrograms
        torch.stack([item[1] for item in batch]),  # Stack labels (tokenized or numeric)
    )
)

val_loader = DataLoader(
    val_dataset,
    batch_size=cfg["trainer"]["batch_size"],
    shuffle=False,
    num_workers=4,  # Optional: Adjust depending on your CPU cores for faster data loading
    pin_memory=True,  # Optional: To speed up data transfer to GPU if using one
    collate_fn=lambda batch: (
        torch.stack([item[0] for item in batch]),  # Stack mel spectrograms
        torch.stack([item[1] for item in batch]),  # Stack labels (tokenized or numeric)
    )
)

In [15]:
print(clpr.get_platforms())
platforms = clpr.get_platforms()
devices = platforms[0].get_devices()
print(f"Available devices: {devices}")
context = clpr.Context(devices=[devices[0]])  # Select the first device
print(f"Using device: {devices[0].name}")



[<pyopencl.Platform 'Intel(R) OpenCL Graphics' at 0x257eed86840>]
Available devices: [<pyopencl.Device 'Intel(R) UHD Graphics' on 'Intel(R) OpenCL Graphics' at 0x257ef462cf0>]
Using device: Intel(R) UHD Graphics


In [16]:
# Model
model = Tacotron2TTS(cfg)

# Move model to the appropriate device
print("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.set_num_threads(8) 
model = model.to(device)

# Define criterion and optimizer
criterion = nn.MSELoss()
# Define loss function (e.g., CrossEntropyLoss for classification)
# criterion = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=cfg["trainer"]["lr"])

cpu


## Training

In [None]:
with alive_bar(cfg["trainer"]["max_epochs"]*len(train_loader)*7,force_tty=True) as bar:
    for epoch in range(cfg["trainer"]["max_epochs"]):
        model.train()  # Set model to training mode
        running_loss = 0.0
        for batch in train_loader:
            mel_spectrograms, labels = batch  # Unpack batch into spectrograms and labels
            bar()
            
            # Move to GPU if available
            mel_spectrograms = mel_spectrograms.to(device)
            labels = labels.to(device)  # Get labels from your dataset (you may need to modify this based on your use case)
            optimizer.zero_grad()  # Clear gradients
            bar()
            # Forward pass
            outputs = model(mel_spectrograms)
            bar()
            # Compute loss
            loss = criterion(outputs, labels)
            bar()
            # Backward pass
            loss.backward()
            bar()
            # Optimize
            optimizer.step()
            bar()
            running_loss += loss.item()
            bar()

        # Print epoch loss
        print(f"Epoch {epoch+1}/{cfg['trainer']['max_epochs']}, Loss: {running_loss / len(train_loader)}")

|                                        | ▆█▆ 0/4060 [0%] in 6:46:06 (~0s, 0.0/ █▆▄ 0/4060 [0%] in 20:50 (~0s, 0.0/s) ▁▃▅ 0/4060 [0%] in 23:40 (~0s, 0.0/s) ▇▇▅ 0/4060 [0%] in 27:08 (~0s, 0.0/s) ▂▂▄ 0/4060 [0%] in 28:37 (~0s, 0.0/s) ▅▃▁ 0/4060 [0%] in 29:03 (~0s, 0.0/s) ▃▅▇ 0/4060 [0%] in 45:19 (~0s, 0.0/s) ▃▅▇ 0/4060 [0%] in 47:46 (~0s, 0.0/s) ▆█▆ 0/4060 [0%] in 3:08:56 (~0s, 0.0/ ▂▄▆ 0/4060 [0%] in 6:26:55 (~0s, 0.0/

## Synthesise

In [None]:
from tacotron2 import synthesize

audio = synthesize(
    checkpoint="checkpoints/latest_model.pth",
    text="Hello, this is a test synthesis."
)
with open("output.wav", "wb") as f:
    f.write(audio)