# Imports

This cell handles the initial setup and imports the dependencies for the SSL-TTS framework:
- Optionally clones the Coqui TTS repository (the `git clone` line is commented out by default)
- Installs the required packages: transformers, torchaudio
- Imports core deep learning libraries (torch, torchaudio)
- Imports the WavLM model for SSL feature extraction
- GlowTTS components for the text-to-SSL model are not imported here; the pipeline below generates the source speech with gTTS instead
- Sets up other essential utilities like torch.nn.functional

In [None]:
#!git clone https://github.com/idiap/coqui-ai-TTS.git
!pip install transformers torchaudio

# Do not restart the runtime after running this cell.
# A few warnings or errors may appear; they can be ignored.

In [None]:

import torch
import torchaudio
import torch.optim as optim
from transformers import WavLMModel
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import os
from torch import nn
from google.colab import drive
from datetime import datetime
import json
import torch.nn.functional as F
from typing import Tuple, Dict, List
import pandas as pd
import numpy
from dataclasses import dataclass, field


# SSL Encoder Implementation (WavLM Integration)

Implements the Self-Supervised Learning encoder component using WavLM-Large:

### Key Features
1. Model Initialization:
   - Loads WavLM-Large model from HuggingFace
   - Automatically selects GPU if available
   - Sets model to evaluation mode

2. Feature Extraction:
   - Uses WavLM's 6th layer for optimal feature representation
   - Handles automatic resampling to 16kHz
   - Manages proper tensor dimensions and device placement
   - Outputs 1024-dimensional feature vectors

3. Audio Processing:
   - Supports variable length inputs
   - Does not down-mix stereo: each channel of a multi-channel input is treated as a separate batch item
   - Adds a batch dimension to 1-D inputs automatically

### Technical Details
- Input: Audio waveform tensor [B, T] or [1, T]
- Output: SSL features [B, T', 1024]
- Uses @torch.no_grad() for efficient inference
- Includes sample rate verification and conversion
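
To make the `T'` in the output shape concrete, the sketch below estimates how many feature frames WavLM produces for a given number of input samples. It assumes the convolutional feature extractor uses kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2), which to our knowledge are the defaults of the Hugging Face WavLM configuration. `num_frames` is a hypothetical helper and is not part of the notebook.

```python
# Assumed WavLM feature-extractor geometry (believed to be the Hugging Face defaults).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Estimate T' (output frames) for a waveform with `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_frames(16000))  # one second of 16 kHz audio -> 49 frames
```

Under these assumptions the total stride is 320 samples, so the encoder emits roughly one 1024-dimensional vector every 20 ms of 16 kHz audio.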




In [None]:
class SSLEncoder:
    def __init__(self, device='cuda' if torch.cuda.is_available() else 'cpu'):
        self.device = device
        print(f"Loading WavLM model to {device}...")
        self.model = WavLMModel.from_pretrained("microsoft/wavlm-large").to(device)
        self.model.eval()
        print("WavLM model loaded successfully!")

    @torch.no_grad()
    def extract_features(self, waveform, sample_rate=16000):
        """Extract WavLM features from the 6th layer"""
        # Resample if sample rate is not 16000 Hz
        if sample_rate != 16000:
            waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

        # Ensure waveform is properly batched
        if waveform.ndim == 1:
            waveform = waveform.unsqueeze(0)

        # Move waveform to the specified device
        waveform = waveform.to(self.device)
        outputs = self.model(waveform, output_hidden_states=True)

        # Extract features from the 6th layer
        features = outputs.hidden_states[6]
        return features

'''# Example usage
ssl_encoder = SSLEncoder()

# Load a sample audio file (replace '/content/harvard.wav' with the actual file path)
waveform, sample_rate = torchaudio.load('/content/harvard.wav')

# Extract features
features = ssl_encoder.extract_features(waveform, sample_rate)
print("Extracted features shape:", features.shape)'''



# LJSpeech dataset

In [None]:
# Download the LJSpeech dataset
!wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

# Extract the dataset
!tar -xjf LJSpeech-1.1.tar.bz2

# Verify the extraction by listing the contents
!ls LJSpeech-1.1


--2024-12-13 14:34:00--  https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
Resolving data.keithito.com (data.keithito.com)... 169.150.236.98, 2400:52e0:1a00::1207:2
Connecting to data.keithito.com (data.keithito.com)|169.150.236.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2748572632 (2.6G) [text/plain]
Saving to: ‘LJSpeech-1.1.tar.bz2’


2024-12-13 14:34:16 (162 MB/s) - ‘LJSpeech-1.1.tar.bz2’ saved [2748572632/2748572632]

metadata.csv  README  wavs


# k-NN Retrieval System

Implements the k-Nearest Neighbors retrieval mechanism for voice conversion:

### Technical Features
1. Efficient Batch Processing:
   - Handles multiple sequences simultaneously
   - Optimized matrix operations
   - Memory-efficient implementation

2. Distance Calculation (more below):
   - Cosine similarity metric
   - Numerical stability handling
   - Batch matrix multiplication

3. Feature Averaging:
   - Uniform weighting of k-nearest neighbors
   - Proper dimension handling
   - Gradient-free operations

### Parameters
- k: Number of neighbors (default: 4)
- device: Computation device
- input dimensions: [B, T, D] for both source and target
- output dimensions: [B, T, D] for selected features
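
The retrieval step can be illustrated with a small dependency-free sketch. It works on plain Python lists for a single sequence rather than batched tensors, and the names `cosine_distance` and `knn_select` are hypothetical; the notebook's own version lives in `TTSPipeline.KNN` and uses `torch`.

```python
import math


def cosine_distance(a, b, eps=1e-8):
    """Cosine distance between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_prod = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_prod + eps)


def knn_select(source, target, k=4):
    """For each source frame, average its k nearest target frames (uniform weights)."""
    selected = []
    for frame in source:
        # Rank target frames by distance to this source frame and keep the k closest.
        nearest = sorted(target, key=lambda t: cosine_distance(frame, t))[:k]
        dim = len(frame)
        selected.append([sum(t[d] for t in nearest) / len(nearest) for d in range(dim)])
    return selected


# Toy data: 2 source frames and 4 target frames in 2 dimensions.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[2.0, 0.1], [3.0, -0.1], [0.1, 5.0], [-0.1, 4.0]]
print(knn_select(source, target, k=2))
```

Each output frame is built only from target frames, which is why the result carries the target speaker's characteristics while following the source frame sequence.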

### Cosine Similarity
For two feature vectors $\mathbf{a}$ and $\mathbf{b}$ in a high-dimensional space (in our case, $\mathbb{R}^{1024}$), the cosine similarity is defined as:

$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}
$

Where:
- $\mathbf{a} \cdot \mathbf{b}$ is the dot product
- $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ are the L2 norms (Euclidean norms)

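For batched computation with source features $\mathbf{S} \in \mathbb{R}^{B \times T_s \times D}$ and target features $\mathbf{T} \in \mathbb{R}^{B \times T_t \times D}$, we compute the following, where $^T$ transposes the last two dimensions and the division is applied elementwise: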

$
\text{Similarity}_{batch} = \frac{\mathbf{S}\mathbf{T}^T}{\|\mathbf{S}\|_2 \|\mathbf{T}\|_2^T}
$

### Cosine Distance
The cosine distance is derived from the cosine similarity:

$
d_{cos}(\mathbf{a}, \mathbf{b}) = 1 - \cos(\mathbf{a}, \mathbf{b})
$

In our implementation, we compute this in steps:

1. **Dot Product**:
   $\text{dot}_{batch} = \mathbf{S}\mathbf{T}^T \in \mathbb{R}^{B \times T_s \times T_t}$

2. **L2 Norms**:
   $\|\mathbf{S}\|_2 \in \mathbb{R}^{B \times T_s \times 1}$ and $\|\mathbf{T}\|_2 \in \mathbb{R}^{B \times T_t \times 1}$

3. **Norm Product**:
   $\text{norm\_prod} = \|\mathbf{S}\|_2\|\mathbf{T}\|_2^T \in \mathbb{R}^{B \times T_s \times T_t}$

4. **Final Distance**:
   $d_{cos} = 1 - \frac{\text{dot}_{batch}}{\text{norm\_prod} + \epsilon}$

where $\epsilon = 10^{-8}$ for numerical stability.

This distance metric has several advantageous properties for our SSL-TTS framework:

1. **Bounded Range**: $d_{cos} \in [0, 2]$ where:
   - 0 indicates identical direction
   - 1 indicates orthogonal vectors
   - 2 indicates opposite directions

2. **Scale Invariance**: The distance is invariant to the magnitude of the vectors, making it suitable for comparing SSL features that may have different magnitudes but similar patterns.

3. **Batch Efficiency**: The formulation allows efficient computation across batches using matrix operations, crucial for processing multiple time steps simultaneously.
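
The `cosine_dist` function in the Pipeline section does not form $\mathbf{S}\mathbf{T}^T$ directly. It recovers each dot product from the pairwise Euclidean distances returned by `torch.cdist`. The identity it relies on is written out here for reference (as written, that function also omits the $\epsilon$ term):

$
\|\mathbf{a} - \mathbf{b}\|^2 = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - 2\,\mathbf{a} \cdot \mathbf{b}
\quad\Longrightarrow\quad
\mathbf{a} \cdot \mathbf{b} = \frac{\|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - \|\mathbf{a} - \mathbf{b}\|^2}{2}
$

As a quick check, for $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (0, 1)$ we have $\|\mathbf{a} - \mathbf{b}\|^2 = 2$, so $\mathbf{a} \cdot \mathbf{b} = (1 + 1 - 2)/2 = 0$ and $d_{cos} = 1$, as expected for orthogonal vectors.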


# $\lambda$ function

### Core Implementation
1. Interpolation Formula:
```python
converted_features = (lambda_value * selected_features
                      + (1 - lambda_value) * source_features)
```

2. Parameter Management:
   - Lambda value bounds checking
   - Device handling
   - Tensor dimension verification

### Features
1. Input Validation:
   - Lambda range enforcement
   - Tensor dimension checking
   - Device consistency

2. Computation Efficiency:
   - In-place operations where possible
   - Memory-efficient implementation
   - Batch processing support

3. Interface Options:
   - Direct method call
   - Callable interface
   - Flexible parameter passing
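
The effect of $\lambda$ and of the range enforcement can be seen in a tiny dependency-free sketch. It uses plain lists instead of tensors, and `interpolate` is a hypothetical name; the notebook's version is `TTSPipeline.linear_interpolation`.

```python
def interpolate(source, selected, lambda_value=1.0):
    """Blend two equal-length feature vectors: lambda=0 keeps source, lambda=1 keeps selected."""
    if len(source) != len(selected):
        raise ValueError("source and selected must have the same length")
    # Clamp lambda into [0, 1], mirroring the bounds check described above.
    lam = max(0.0, min(1.0, lambda_value))
    return [lam * sel + (1.0 - lam) * src for src, sel in zip(source, selected)]


source = [0.0, 2.0]
selected = [4.0, 6.0]
print(interpolate(source, selected, 0.0))  # [0.0, 2.0]: source features only
print(interpolate(source, selected, 0.5))  # [2.0, 4.0]: halfway blend
print(interpolate(source, selected, 1.5))  # [4.0, 6.0]: lambda clamped to 1.0
```

With $\lambda = 1$, the setting used for all tests in this notebook, the converted features are exactly the kNN-selected target features.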

# Vocoder

Implements the HiFi-GAN vocoder for waveform generation:

### Technical Details
1. Model Components:
   - Residual blocks
   - Upsampling layers
   - Convolutional layers

2. Audio Generation:
   - Feature conditioning
   - Multi-scale processing
   - Waveform synthesis

3. Current Status:
   - Checkpoint loading issue
   - Needs path configuration
   - Testing infrastructure ready

In [None]:
knn_vc = torch.hub.load(
    'bshall/knn-vc',
    'knn_vc',
    pretrained=True,
    prematched=True,
    trust_repo=True
)


In [None]:
vocoder = knn_vc.hifigan


# Urhythmic Component

In [None]:
# Pretrained models are available for:
# VCTK: p228, p268, p225, p232, p257, p231.
# and LJSpeech.

def load_urhythmic_model(source, target):
    hubert = torch.hub.load("bshall/hubert:main", "hubert_soft").cuda()
    urhythmic, encode = torch.hub.load(
        "bshall/urhythmic:main",
        "urhythmic_global",
        source_speaker=source,
        target_speaker=target,
    )
    urhythmic.cuda()

    return hubert, encode, urhythmic


# Pipeline

In [None]:
!pip install gTTS

In [None]:
from gtts import gTTS
from io import BytesIO

In [None]:
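def cosine_dist(source_features, target_features):
    """Pairwise cosine distance between source frames [T_s, D] and target frames [T_t, D]."""
    source_norms = torch.norm(source_features, p=2, dim=-1)
    matching_norms = torch.norm(target_features, p=2, dim=-1)
    # Recover the dot product from squared Euclidean distances:
    # a.b = (|a|^2 + |b|^2 - |a - b|^2) / 2
    dotprod = -torch.cdist(source_features[None], target_features[None], p=2)[0]**2 + source_norms[:, None]**2 + matching_norms[None]**2
    dotprod /= 2

    # Cosine distance = 1 - cosine similarity; the result has shape [T_s, T_t]
    dists = 1 - ( dotprod / (source_norms[:, None] * matching_norms[None]) )
    return dists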

In [None]:
class TTSPipeline(nn.Module):
    def __init__(self, source='LJSpeech', target='LJSpeech'):
        super().__init__()
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.ssl_encoder = SSLEncoder()
        self.vocoder = vocoder
        self.resampler = torchaudio.transforms.Resample(orig_freq=24000, new_freq=16000)
        self.hubert, self.encode, self.urhythmic = load_urhythmic_model(source, target)
        self.i = 0

    def reset_i(self):
        self.i = 0

    def text_to_waveform(self, text):
        #text to speech through google python library
        tts = gTTS(text)
        #save to buffer
        mp3_buffer = BytesIO()
        tts.write_to_fp(mp3_buffer)
        mp3_buffer.seek(0)
        #load
        waveform, sample_rate = torchaudio.load(mp3_buffer, format="mp3", normalize=True)
        #close and delete buffer
        mp3_buffer.close()
        del mp3_buffer
        #resample to 16 kHz from the actual sample rate of the decoded audio
        if sample_rate != 16000:
            waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

        return waveform.to(self.device)

    def KNN(self, source_features, target_features):
        synth_set = target_features

        dists = cosine_dist(source_features, target_features)
        best = dists.topk(k=4, largest=False, dim=-1)
        selected_features = synth_set[best.indices].mean(dim=1)

        return selected_features


    def linear_interpolation(self, source_features, selected_features, lambda_value = 1.0):
        # Ensure tensors are on correct device
        selected_features = selected_features.to(self.device)
        source_features = source_features.to(self.device)

        # Ensure lambda is in valid range
        lambda_value = max(0.0, min(1.0, lambda_value))

        # Perform linear interpolation
        converted_features = lambda_value * selected_features + (1 - lambda_value) * source_features

        return converted_features

    def get_target_features(self, wavs):
        features = []
        # i=0
        for path in wavs:
            # print(i)
            # i+=1
            features.append(self.get_features(path, get_target=True))

        features = torch.concat(features, dim = 0)
        return features.to(self.device)


    def get_features(self, path = None, waveform = None, get_target = False):
        if waveform is not None:
            x = waveform
        else:
            x, sample_rate = torchaudio.load(path, normalize=True)
            if sample_rate != 16000:
                resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
                x = resampler(x)


        if get_target:
            transform = torchaudio.transforms.Vad(sample_rate=16000, trigger_level=7.0)
            x_trim = transform(x)
            x_reversed = torch.flip(x_trim, (-1,))
            x_reversed_trim = transform(x_reversed)
            x_full_trim = torch.flip(x_reversed_trim, (-1,))
            x = x_full_trim

        features = self.ssl_encoder.extract_features(x)
        features = features.squeeze(0)

        return features.to(self.device)

    def forward(self, text, wavs, save_file = False):
        waveform = self.text_to_waveform(text)

        waveform = waveform.to(self.device)

        source_features = self.get_features(waveform = waveform)
        target_features = self.get_target_features(wavs)

        selected_features = self.KNN(source_features, target_features)

        converted_features = self.linear_interpolation(source_features, selected_features)

        generated_waveform = self.vocoder(converted_features[None].to(self.device)).cpu().squeeze()

        src_loudness = torchaudio.functional.loudness(generated_waveform[None], 16000)
        tgt_loudness = -16
        generated_waveform = torchaudio.functional.gain(generated_waveform, tgt_loudness - src_loudness)

        if save_file:
            save_waveform = generated_waveform
            #change output path to wherever you want to put files in drive
            output_path = f'/content/drive/MyDrive/LJ_graph/LJ_150/LJ_150_gtts_generated_waveform_{self.i:02d}.wav'
            torchaudio.save(output_path, save_waveform.detach().unsqueeze(0), sample_rate=16000)
            print(f"Generated waveform saved at {output_path}")


        return generated_waveform.unsqueeze(0)

    def forward_with_urhythmic(self, text, wavs, save_file = False, save_example = False):
        # Set save_example=True to also save the waveform from before Urhythmic is applied
        waveform = self.forward(text, wavs, save_file=save_example)


        waveform = waveform.unsqueeze(0).cuda()
        with torch.inference_mode():
            # Extract speech units and log probabilities
            units, log_probs = self.encode(self.hubert, waveform)
            # Convert to the target speaker
            generated_waveform = self.urhythmic(units, log_probs)

        generated_waveform = generated_waveform.cpu().squeeze()

        # src_loudness = torchaudio.functional.loudness(generated_waveform[None], 16000)
        # tgt_loudness = -16
        # generated_waveform = torchaudio.functional.gain(generated_waveform, tgt_loudness - src_loudness)

        if save_file:
            save_waveform = generated_waveform
            #change output path to wherever you want to put files in drive
            output_path = f'/content/drive/MyDrive/LJ_graph/LJ_ur_150/LJ_150_gtts_generated_waveform_urythmic_{self.i:02d}.wav'
            torchaudio.save(output_path, save_waveform.detach().unsqueeze(0), sample_rate=16000)
            print(f"Generated waveform saved at {output_path}")
            self.i+=1

        return generated_waveform



# VCTK wavs for Audio Generation (in local drive)

In [None]:
wavs = [
    '/content/drive/MyDrive/p228/p228_001_mic1.flac',
    '/content/drive/MyDrive/p228/p228_001_mic2.flac',
    '/content/drive/MyDrive/p228/p228_002_mic1.flac',
    '/content/drive/MyDrive/p228/p228_003_mic2.flac',
    '/content/drive/MyDrive/p228/p228_004_mic1.flac',
    '/content/drive/MyDrive/p228/p228_005_mic1.flac',
    '/content/drive/MyDrive/p228/p228_006_mic1.flac',
    '/content/drive/MyDrive/p228/p228_007_mic1.flac',
    '/content/drive/MyDrive/p228/p228_035_mic2.flac',
    '/content/drive/MyDrive/p228/p228_009_mic1.flac',
    '/content/drive/MyDrive/p228/p228_010_mic1.flac',
    '/content/drive/MyDrive/p228/p228_011_mic1.flac',
    '/content/drive/MyDrive/p228/p228_012_mic2.flac',
    '/content/drive/MyDrive/p228/p228_013_mic1.flac',
    '/content/drive/MyDrive/p228/p228_014_mic2.flac',
    '/content/drive/MyDrive/p228/p228_033_mic1.flac',
    '/content/drive/MyDrive/p228/p228_016_mic1.flac',
    '/content/drive/MyDrive/p228/p228_037_mic1.flac',
    '/content/drive/MyDrive/p228/p228_018_mic2.flac',
    '/content/drive/MyDrive/p228/p228_019_mic1.flac',
    '/content/drive/MyDrive/p228/p228_020_mic2.flac',
    '/content/drive/MyDrive/p228/p228_021_mic2.flac',
    '/content/drive/MyDrive/p228/p228_038_mic1.flac',
    '/content/drive/MyDrive/p228/p228_039_mic1.flac',
    '/content/drive/MyDrive/p228/p228_024_mic1.flac',
    '/content/drive/MyDrive/p228/p228_025_mic2.flac',
    '/content/drive/MyDrive/p228/p228_026_mic2.flac',
    '/content/drive/MyDrive/p228/p228_027_mic1.flac',
    '/content/drive/MyDrive/p228/p228_028_mic1.flac',
    '/content/drive/MyDrive/p228/p228_029_mic1.flac',
    '/content/drive/MyDrive/p228/p228_030_mic1.flac',
]

In [None]:
wavs = [
    '/content/drive/MyDrive/p231/p231_001_mic1.flac',
    '/content/drive/MyDrive/p231/p231_001_mic2.flac',
    '/content/drive/MyDrive/p231/p231_002_mic1.flac',
    '/content/drive/MyDrive/p231/p231_003_mic2.flac',
    '/content/drive/MyDrive/p231/p231_004_mic1.flac',
    '/content/drive/MyDrive/p231/p231_005_mic1.flac',
    '/content/drive/MyDrive/p231/p231_006_mic1.flac',
    '/content/drive/MyDrive/p231/p231_007_mic1.flac',
    '/content/drive/MyDrive/p231/p231_008_mic2.flac',
    '/content/drive/MyDrive/p231/p231_009_mic1.flac',
    '/content/drive/MyDrive/p231/p231_010_mic1.flac',
    '/content/drive/MyDrive/p231/p231_011_mic1.flac',
    '/content/drive/MyDrive/p231/p231_012_mic2.flac',
    '/content/drive/MyDrive/p231/p231_013_mic1.flac',
    '/content/drive/MyDrive/p231/p231_014_mic2.flac',
    '/content/drive/MyDrive/p231/p231_033_mic1.flac',
    '/content/drive/MyDrive/p231/p231_016_mic1.flac',
    '/content/drive/MyDrive/p231/p231_015_mic1.flac',
    '/content/drive/MyDrive/p231/p231_018_mic2.flac',
    '/content/drive/MyDrive/p231/p231_019_mic1.flac',
    '/content/drive/MyDrive/p231/p231_020_mic2.flac',
    '/content/drive/MyDrive/p231/p231_021_mic2.flac',
    '/content/drive/MyDrive/p231/p231_031_mic1.flac',
    '/content/drive/MyDrive/p231/p231_023_mic1.flac',
    '/content/drive/MyDrive/p231/p231_024_mic1.flac',
    '/content/drive/MyDrive/p231/p231_025_mic2.flac',
    '/content/drive/MyDrive/p231/p231_026_mic2.flac',
    '/content/drive/MyDrive/p231/p231_027_mic1.flac',
    '/content/drive/MyDrive/p231/p231_028_mic1.flac',
    '/content/drive/MyDrive/p231/p231_029_mic1.flac',
    '/content/drive/MyDrive/p231/p231_030_mic1.flac',
]

In [None]:
wavs = [
    '/content/drive/MyDrive/p257/p257_001_mic1.flac',
    '/content/drive/MyDrive/p257/p257_001_mic2.flac',
    '/content/drive/MyDrive/p257/p257_002_mic1.flac',
    '/content/drive/MyDrive/p257/p257_003_mic2.flac',
    '/content/drive/MyDrive/p257/p257_004_mic1.flac',
    '/content/drive/MyDrive/p257/p257_005_mic1.flac',
    '/content/drive/MyDrive/p257/p257_006_mic1.flac',
    '/content/drive/MyDrive/p257/p257_007_mic1.flac',
    '/content/drive/MyDrive/p257/p257_008_mic2.flac',
    '/content/drive/MyDrive/p257/p257_009_mic1.flac',
    '/content/drive/MyDrive/p257/p257_010_mic1.flac',
    '/content/drive/MyDrive/p257/p257_011_mic1.flac',
    '/content/drive/MyDrive/p257/p257_012_mic2.flac',
    '/content/drive/MyDrive/p257/p257_013_mic1.flac',
    '/content/drive/MyDrive/p257/p257_014_mic2.flac',
    '/content/drive/MyDrive/p257/p257_033_mic1.flac',
    '/content/drive/MyDrive/p257/p257_016_mic1.flac',
    '/content/drive/MyDrive/p257/p257_015_mic1.flac',
    '/content/drive/MyDrive/p257/p257_018_mic2.flac',
    '/content/drive/MyDrive/p257/p257_019_mic1.flac',
    '/content/drive/MyDrive/p257/p257_020_mic2.flac',
    '/content/drive/MyDrive/p257/p257_021_mic2.flac',
    '/content/drive/MyDrive/p257/p257_031_mic1.flac',
    '/content/drive/MyDrive/p257/p257_023_mic1.flac',
    '/content/drive/MyDrive/p257/p257_024_mic1.flac',
    '/content/drive/MyDrive/p257/p257_025_mic2.flac',
    '/content/drive/MyDrive/p257/p257_026_mic2.flac',
    '/content/drive/MyDrive/p257/p257_027_mic1.flac',
    '/content/drive/MyDrive/p257/p257_028_mic1.flac',
    '/content/drive/MyDrive/p257/p257_029_mic1.flac',
    '/content/drive/MyDrive/p257/p257_030_mic1.flac',
]

In [None]:
wavs = [
    '/content/drive/MyDrive/p268/p268_001_mic1.flac',
    '/content/drive/MyDrive/p268/p268_001_mic2.flac',
    '/content/drive/MyDrive/p268/p268_002_mic1.flac',
    '/content/drive/MyDrive/p268/p268_003_mic2.flac',
    '/content/drive/MyDrive/p268/p268_004_mic1.flac',
    '/content/drive/MyDrive/p268/p268_005_mic1.flac',
    '/content/drive/MyDrive/p268/p268_006_mic1.flac',
    '/content/drive/MyDrive/p268/p268_007_mic1.flac',
    '/content/drive/MyDrive/p268/p268_035_mic2.flac',
    '/content/drive/MyDrive/p268/p268_009_mic1.flac',
    '/content/drive/MyDrive/p268/p268_010_mic1.flac',
    '/content/drive/MyDrive/p268/p268_011_mic1.flac',
    '/content/drive/MyDrive/p268/p268_012_mic2.flac',
    '/content/drive/MyDrive/p268/p268_013_mic1.flac',
    '/content/drive/MyDrive/p268/p268_014_mic2.flac',
    '/content/drive/MyDrive/p268/p268_033_mic1.flac',
    '/content/drive/MyDrive/p268/p268_016_mic1.flac',
    '/content/drive/MyDrive/p268/p268_015_mic1.flac',
    '/content/drive/MyDrive/p268/p268_018_mic2.flac',
    '/content/drive/MyDrive/p268/p268_019_mic1.flac',
    '/content/drive/MyDrive/p268/p268_020_mic2.flac',
    '/content/drive/MyDrive/p268/p268_021_mic2.flac',
    '/content/drive/MyDrive/p268/p268_031_mic1.flac',
    '/content/drive/MyDrive/p268/p268_034_mic1.flac',
    '/content/drive/MyDrive/p268/p268_024_mic1.flac',
    '/content/drive/MyDrive/p268/p268_025_mic2.flac',
    '/content/drive/MyDrive/p268/p268_026_mic2.flac',
    '/content/drive/MyDrive/p268/p268_027_mic1.flac',
    '/content/drive/MyDrive/p268/p268_028_mic1.flac',
    '/content/drive/MyDrive/p268/p268_029_mic1.flac',
    '/content/drive/MyDrive/p268/p268_030_mic1.flac',
]

# LJSpeech wavs for Audio Generation

In [None]:
wavs = [
    '/content/LJSpeech-1.1/wavs/LJ001-0001.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0002.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0003.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0004.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0005.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0006.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0007.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0008.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0009.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0010.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0011.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0012.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0013.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0014.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0015.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0016.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0017.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0018.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0019.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0020.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0021.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0022.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0023.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0024.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0025.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0026.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0027.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0028.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0029.wav',
    '/content/LJSpeech-1.1/wavs/LJ001-0030.wav',
]

# Example Audio Generation

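If you wish to test this, the easiest way is to use the LJSpeech dataset. Run the Imports and SSL Encoder cells, the cell under LJSpeech dataset, the Vocoder and Urhythmic cells (the pipeline needs `vocoder` and `load_urhythmic_model`; skipping them produces the `NameError` shown in the output below), then the Pipeline cells and the LJSpeech wavs cell.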

In [None]:
import pandas as pd
import os

# Path to a text file with one sentence per line.
# Adjust it if your file lives elsewhere, e.g. on your Google Drive.
csv_path = '/content/sampled_sentences2.csv'

# Read the sentences into a single-column DataFrame, skipping empty lines
with open(csv_path, 'r', encoding='utf-8') as file:
    lines = [line.strip() for line in file if line.strip()]

df = pd.DataFrame(lines, columns=['text'])

tts = TTSPipeline()

# Now, loop through each sentence and create a TTS file for it
for i, row in df.iterrows():
    # Each row holds one sentence in the 'text' column
    sentence = row['text']
    tts.forward_with_urhythmic(sentence, wavs, save_file=True, save_example=True)


print("All audio files have been generated and saved to the folder in your drive.")

Loading WavLM model to cpu...



WavLM model loaded successfully!


NameError: name 'vocoder' is not defined

# Testing and Evaluation

To evaluate the zero-shot model, the authors of the original paper used the LibriSpeech test-clean dataset for target speaker reference utterances (ground truth). The set contains speech from 20 male and 20 female speakers, with 8 minutes of speech each. We downloaded the data; the part we need is the test-clean folder, which contains one subfolder per speaker. Each speaker folder holds one subfolder per chapter of the book that the speaker read from, and each chapter folder holds the individual audio files (in .flac form, one sentence per file).
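
The folder layout described above can be walked with a short helper. This is a minimal sketch that assumes the `<speaker>/<chapter>/<utterance>.flac` structure stated in the text; `list_utterances` is a hypothetical name and the directory path is a placeholder.

```python
from pathlib import Path


def list_utterances(test_clean_dir):
    """Map each speaker ID to the sorted list of its .flac files under test-clean."""
    utterances = {}
    # Layout assumed from the description: <speaker>/<chapter>/<utterance>.flac
    for flac_path in sorted(Path(test_clean_dir).glob("*/*/*.flac")):
        speaker_id = flac_path.parent.parent.name
        utterances.setdefault(speaker_id, []).append(flac_path)
    return utterances
```

For example, `list_utterances("LibriSpeech/test-clean")` (path assumed) would return one entry per speaker, and the resulting file lists can be passed as `wavs` to the pipeline.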

\\

To create the model output, they passed in 100 English sentences per speaker from the FLoRes+ dataset. We downloaded the data and located the sentences. We only need one file, “devtest.eng_Latn”, which contains a large number of English sentences. Example sentences are shown below.
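
Drawing a fixed number of sentences from that file could look like the sketch below. It assumes the file stores one sentence per line, and `sample_sentences` is a hypothetical helper rather than the procedure used in the original paper.

```python
import random


def sample_sentences(path, n=100, seed=0):
    """Return n distinct non-empty lines from a one-sentence-per-line text file."""
    with open(path, encoding="utf-8") as handle:
        sentences = [line.strip() for line in handle if line.strip()]
    # A fixed seed keeps the selection reproducible across runs.
    return random.Random(seed).sample(sentences, n)
```

`random.Random.sample` raises a `ValueError` if the file contains fewer than `n` sentences, which makes a truncated download easy to notice.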

\\

MOS (mean opinion score) is a measure of the human-judged overall quality of an event or experience. For us, a MOS is a rating of the quality of speech utterances. It is most often judged on a scale of 1 (bad) to 5 (excellent) and is the average of a number of individually human-scored parameters. Although MOS values were originally derived from surveys of expert observers, today a MOS is often produced by an objective measurement method that approximates a human rating. A score of 4.3-4.5 is considered an excellent target because human raters rarely give a perfect 5. A score below 3.5 is generally considered unacceptable.

\\

All tests are conducted with $\lambda=1$. The evaluation focuses on a few key metrics:

- **Naturalness: UTMOS**
 - UTMOS = UTokyo-SaruLab Mean Opinion Score, an automatic method of predicting MOS.

- **Intelligibility: WER, PER**
 - WER = Word Error Rate, i.e. the ratio of word errors in a transcript to the total number of words spoken. A lower WER means the speech was recognized more accurately. In our case, it is calculated with the formula $\frac{S+D+I}{N}$, where S is the number of substitutions (words in the transcript of the synthesized sentence that would need to be substituted to match the ground-truth sentence), D is the number of deletions (words that would need to be deleted to match the ground truth), I is the number of insertions (words that would need to be inserted to match the ground truth), and N is the total number of words in the ground-truth sentence. The numerator is also known as the edit distance because it represents "how far away" two sentences are.
 - PER = Phoneme Error Rate, i.e. the ratio of phoneme errors in a transcript to the total number of phonemes spoken. As above, a lower PER means better accuracy. The formula is the same as above, except that it compares phonemes instead of words.
 - Both of these are calculated using the Whisper-Large v3 model.

- **Speaker Similarity: SECS**
 - SECS = Speaker-Encoder Cosine Similarity, i.e. the cosine similarity between the embeddings of two audio samples, which in our case are a ground-truth sample from one speaker and the synthesized sample for that same speaker. The original paper uses ECAPA2 to compute these embeddings and their similarity. The goal of speaker similarity is to determine whether two audio samples come from the same speaker: if the similarity is above a certain threshold, they are considered to be from the same speaker; otherwise, they are considered to be from different speakers.

- **Subjective Evaluation: N-MOS, S-MOS**
 - N-MOS = Naturalness MOS, i.e. how natural the utterance (output) sounds compared to the ground-truth recording.
 - S-MOS = Similarity MOS, i.e. how similar the utterance sounds compared to the ground-truth recording.

The original paper had 10 raters evaluate 3 synthesized sentences per speaker, 60 in total. Each synthesis was scored from 1 to 5 in 0.5 increments. The raters were native English speakers in the United States hired through Amazon Mechanical Turk; in our case, the raters would be just the four of us.
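
The WER formula given above can be computed with a word-level edit distance. The following is a minimal dependency-free sketch; it assumes the reference and hypothesis transcripts are already available as normalized strings (in this evaluation the hypothesis would come from the speech recognizer), and `word_error_rate` is a hypothetical name. PER follows the same logic applied to phoneme sequences instead of words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.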

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## UTMOS

In [None]:
!pip install pip==23.2.1
!pip install utmos

Collecting pip==23.2.1
  Downloading pip-23.2.1-py3-none-any.whl.metadata (4.2 kB)
Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-23.2.1
Collecting utmos
  Obtaining dependency information for utmos from https://files.pythonhosted.org/packages/45/2b/92e89033000755d437239da84e062eeeae464cba

In [None]:
import os
import pandas as pd
import utmos

# Initialize the model once
model = utmos.Score()

# Suppose you have a parent directory that contains the 14 subfolders
parent_dir = '/content/drive/MyDrive/examples'

# Get the list of all folders inside the parent directory
folders = [os.path.join(parent_dir, d) for d in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, d))]

# Create an empty list to store results
results = []
i = 0
# Iterate through each folder
for folder_path in folders:
    # Get the folder name (last part of the path)
    folder_name = os.path.basename(folder_path)

    # Iterate through each .wav file in this folder
    for filename in os.listdir(folder_path):
        if filename.lower().endswith('.wav'):
            audio_file_path = os.path.join(folder_path, filename)

            # Calculate the score for this audio file
            score = model.calculate_wav_file(audio_file_path)
            if i % 75 == 0:
              print(f'{filename}: {score}')
            i += 1
            # Append the result as a dictionary
            results.append({
                'folder': folder_name,
                'filename': filename,
                'score': score
            })

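# Convert the list of dictionaries into a DataFrame
df2 = pd.DataFrame(results)

# df2 now holds one UTMOS score per .wav file across all folders.
# Average the scores per folder (groupby returns the folder names sorted).
folder_means = df2.groupby('folder')['score'].mean()
i = 0
base = 0
urhyth = 0
# Print each folder and its mean score.
# With this folder naming, sorting puts each baseline folder directly before
# its Urhythmic counterpart, so even positions are baseline and odd are Urhythmic.
for folder, mean_score in folder_means.items():
    print(f"{folder}: {mean_score}")

    if i % 2 == 0:
      base += mean_score
    else:
      urhyth += mean_score
    i += 1
# Overall means; the divisor is the number of folders of each kind (6 here)
print(base/6)
print(urhyth/6)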


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.9 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/huggingface/hub/models--mosnets--utmos/snapshots/65956d677bd502519c30c0f4bfda97a749d63009/model.ckpt`
  state = torch.load(f, map_location=torch.device("cpu"))
  WeightNorm.apply(module, name, dim)


p225_gtts_generated_waveform_00.wav: 3.729201376438141
LJ_generated_waveform_urythmic_00.wav: 4.256206393241882
LJ_generated_waveform_00.wav: 3.794767916202545
p225_gtts_generated_waveform_urythmic_00.wav: 3.5768465399742126
p231_gtts_generated_waveform_urythmic_00.wav: 3.8537667393684387
p228_gtts_generated_waveform_00.wav: 4.123617649078369
p228_gtts_generated_waveform_urythmic_00.wav: 4.067911148071289
p268_gtts_generated_waveform_urythmic_00.wav: 4.081933617591858
p257_gtts_generated_waveform_00.wav: 3.787180006504059
p257_gtts_generated_waveform_urythmic_00.wav: 3.4753727316856384
p231_gtts_generated_waveform_00.wav: 3.9803765416145325
p268_gtts_generated_waveform_00.wav: 4.387761950492859
LJSpeech_examples: 3.7928078508377077
LJSpeech_examples_urhythmic: 4.193322395185629
p225_examples: 3.678368322253227
p225_examples_urhythmic: 3.3803739565610886
p228_examples: 3.9510156413167716
p228_examples_urhythmic: 3.821045938928922
p231_examples: 3.995636603931586
p231_examples_urhythmic:
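
The baseline/Urhythmic split above depends on folder order and a fixed divisor. As a hedged alternative, the split can be made by folder name instead. This is a minimal sketch: `split_means` is a hypothetical helper, and it assumes that every Urhythmic folder name contains the substring `urhythmic`. The values in `example` are illustrative only, not results from this notebook.

```python
from statistics import mean


def split_means(folder_means):
    """Average per-folder means separately for baseline and Urhythmic folders.

    `folder_means` maps folder name -> mean score. A folder counts as
    Urhythmic when its name contains 'urhythmic' (assumed naming scheme).
    """
    base = [m for name, m in folder_means.items() if "urhythmic" not in name]
    urhyth = [m for name, m in folder_means.items() if "urhythmic" in name]
    # statistics.mean raises StatisticsError if either group is empty
    return mean(base), mean(urhyth)


# Illustrative values only
example = {
    "spkA_examples": 4.0,
    "spkA_examples_urhythmic": 3.0,
    "spkB_examples": 2.0,
    "spkB_examples_urhythmic": 5.0,
}
base_mean, urhyth_mean = split_means(example)
print(base_mean, urhyth_mean)
```

With a pandas Series such as `folder_means`, the call would be `split_means(folder_means.to_dict())`.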

## SECS

In [None]:
!pip install --upgrade huggingface_hub
!pip install scikit-learn



In [None]:
# Note: This metric needs a ground truth audio file to compare with,
# so if you wish to test it, you must load up the LJSpeech dataset
from huggingface_hub import hf_hub_download
from sklearn.metrics.pairwise import cosine_similarity
import os
import pandas as pd

# Initialize the model once
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='ecapa2.pt', cache_dir=None)
ecapa2 = torch.jit.load(model_file, map_location='cuda')
ecapa2.half()

parent_dir = '/content/drive/MyDrive/examples'

folders = [os.path.join(parent_dir, d) for d in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, d))]

results = []
i = 0

for folder_path in folders:
    folder_name = os.path.basename(folder_path)

    # Iterate through each .wav file in this folder
    for filename in os.listdir(folder_path):
        if filename.lower().endswith('.wav'):
            audio_file_path = os.path.join(folder_path, filename)

            # Calculate the embedding for this audio file
            audio_gen, sr = torchaudio.load(audio_file_path)# sample rate of 16 kHz expected
            audio_gen = audio_gen.to('cuda')
            embedding_generated = ecapa2(audio_gen).detach().cpu().numpy()

            # Calculate the embedding for the target speaker from one reference clip
            if folder_name[:2] == 'LJ':
                truth_path = '/content/drive/MyDrive/LJ001-0001.wav'
            elif folder_name[:4] in ('p225', 'p228', 'p231', 'p232', 'p257', 'p268'):
                speaker_id = folder_name[:4]
                truth_path = f'/content/drive/MyDrive/{speaker_id}/{speaker_id}_001_mic1.flac'
            else:
                # No reference clip for this folder: skip it instead of reusing
                # the embedding left over from the previous file
                continue
            audio_truth, sr = torchaudio.load(truth_path)
            audio_truth = audio_truth.to('cuda')
            embedding_truth = ecapa2(audio_truth).detach().cpu().numpy()

            # Calculate SECS (speaker encoder cosine similarity)
            secs = cosine_similarity(embedding_generated, embedding_truth)

            # Cosine similarity lies in [-1, 1]; the values in the paper lie
            # between 0 and 1, so rescale linearly to [0, 1]
            secs = (secs+1) / 2

            #Check results
            if i % 75 == 0:
              print(f'{folder_name}/{filename}: {secs}')
            i += 1

            # Append the result as a dictionary
            results.append({
                'folder': folder_name,
                'filename': filename,
                'secs': secs
            })

LJSpeech_examples_gtts/LJ_generated_waveform_00.wav: [[0.56690031]]
LJSpeech_examples_urhythmic_gtts/LJ_generated_waveform_urythmic_00.wav: [[0.66848637]]
p225_examples_gtts/p225_gtts_generated_waveform_00.wav: [[0.48326665]]
p225_examples_urhythmic_gtts/p225_gtts_generated_waveform_urythmic_00.wav: [[0.53131772]]
p228_examples_urhythmic_gtts/p228_gtts_generated_waveform_urythmic_00.wav: [[0.56336158]]
p228_examples_gtts/p228_gtts_generated_waveform_00.wav: [[0.51761073]]
p231_examples_urhythmic_gtts/p231_gtts_generated_waveform_urythmic_00.wav: [[0.55573299]]
p231_examples_gtts/p231_gtts_generated_waveform_00.wav: [[0.51919524]]
p257_examples_urhythmic_gtts/p257_gtts_generated_waveform_urythmic_00.wav: [[0.53735418]]
p257_examples_gtts/p257_gtts_generated_waveform_00.wav: [[0.54565198]]
p268_examples_urhythmic_gtts/p268_gtts_generated_waveform_urythmic_00.wav: [[0.51130244]]
p268_examples_gtts/p268_gtts_generated_waveform_00.wav: [[0.46486592]]
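
For reference, the SECS value printed above is a cosine similarity between two speaker embeddings, rescaled from [-1, 1] to [0, 1]. The sketch below reproduces that arithmetic on plain Python lists so it can be checked by hand. `cosine_similarity_1d` and `secs_unit_interval` are hypothetical helper names, and the toy 2-D vectors stand in for real ECAPA2 embeddings.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def secs_unit_interval(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1]."""
    return (cosine_similarity_1d(a, b) + 1) / 2


print(secs_unit_interval([1.0, 0.0], [1.0, 0.0]))   # same direction
print(secs_unit_interval([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(secs_unit_interval([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Under this rescaling, 0.5 corresponds to orthogonal embeddings, so values slightly above 0.5 indicate only weak positive similarity.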


In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

!pip install --upgrade pandas numpy



In [None]:
df3 = pd.DataFrame(results)
df3['secs'] = df3['secs'].apply(lambda x: x[0][0])
df3.columns = list(df3.columns)

# df3 now holds one SECS value per .wav file across all folders.
# Average per folder; the CSV is written further down in this cell.
folder_means = df3.groupby('folder')['secs'].mean()
i = 0
base = 0
urhyth = 0
# Print each folder and its mean score
for folder, mean_score in folder_means.items():
    print(f"{folder}: {mean_score}")

    if i % 2 == 0:
      base += mean_score
    else:
      urhyth += mean_score
    i += 1
print(base/6)
print(urhyth/6)

output_csv_path = '/content/secs_scores.csv'
df3_dict = df3.to_dict(orient='records')
# Use the csv module to write the dictionary to CSV:
import csv
with open(output_csv_path, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=df3.columns)
    writer.writeheader()
    writer.writerows(df3_dict)

print("All scores have been calculated and saved to:", output_csv_path)

LJSpeech_examples_gtts: 0.5706962302208631
LJSpeech_examples_urhythmic_gtts: 0.6494904270143566
p225_examples_gtts: 0.4821260035306302
p225_examples_urhythmic_gtts: 0.531314990573211
p228_examples_gtts: 0.5132718528655164
p228_examples_urhythmic_gtts: 0.5460057999979038
p231_examples_gtts: 0.5024797379488829
p231_examples_urhythmic_gtts: 0.5428143174305543
p257_examples_gtts: 0.5547537291982848
p257_examples_urhythmic_gtts: 0.5399783076574057
p268_examples_gtts: 0.47704823177929495
p268_examples_urhythmic_gtts: 0.51124499872964
0.5167292975905787
0.5534748069005119
All scores have been calculated and saved to: /content/no_taco_secs_scores.csv


## WER and PER

In [None]:
!pip install -U openai-whisper

Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting triton>=2.0.0 (from openai-whisper)
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylin

In [None]:
!pip install jiwer g2p-en


Collecting jiwer
  Downloading jiwer-3.0.5-py3-none-any.whl.metadata (2.7 kB)
Collecting g2p-en
  Downloading g2p_en-2.1.0-py3-none-any.whl.metadata (4.5 kB)
Collecting rapidfuzz<4,>=3 (from jiwer)
  Downloading rapidfuzz-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting distance>=0.1.3 (from g2p-en)
  Downloading Distance-0.1.3.tar.gz (180 kB)
  Preparing metadata (setup.py) ... done
Downloading jiwer-3.0.5-py3-none-any.whl (21 kB)
Downloading g2p_en-2.1.0-py3-none-any.whl (3.1 MB)
Downloading rapidfuzz-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)

In [None]:
!pip install --upgrade nltk g2p_en




In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:

import os
import pandas as pd
import whisper

# Load the Whisper model once
model = whisper.load_model("turbo")

# Path to the top-level directory containing all speaker folders
data_dir = "/content/drive/MyDrive/examples"  # Adjust this to your directory

# Load ground truth sentences, one per line, in the same order as the clips.
# The .csv file is read as plain text, so any CSV quoting stays in the strings.
with open("/content/sampled_sentences2.csv", "r", encoding="utf-8") as f:
    ground_truth_sentences = [line.strip() for line in f if line.strip()]

# Initialize a list for DataFrame rows
rows = []

# Each speaker directory is assumed to hold 75 consistently named clips,
# so the i-th file (after sorting) corresponds to the i-th ground truth sentence.
for speaker_dir in sorted(os.listdir(data_dir)):
    speaker_path = os.path.join(data_dir, speaker_dir)
    if os.path.isdir(speaker_path):
        # Collect all wav files
        audio_files = [f for f in os.listdir(speaker_path) if f.lower().endswith(".wav")]

        # Sort so that file order matches the order of the ground truth sentences.
        # Plain string sorting is enough here because the clip indices are
        # zero-padded (_00, _01, ...).
        audio_files.sort()

        # Transcribe each clip and pair it with the sentence at the same index
        for i, fname in enumerate(audio_files):
            filepath = os.path.join(speaker_path, fname)

            # Transcribe the audio file
            result = model.transcribe(filepath)
            transcription = result["text"].strip()
            if i % 75 == 0:
              print(f'{speaker_dir}/{fname}: {transcription}')

            # Match ground truth by index
            # i corresponds to the i-th file of the speaker, so ground_truth_sentences[i] should match
            # if the order is consistent.
            gt_sentence = ground_truth_sentences[i] if i < len(ground_truth_sentences) else ""

            rows.append({
                "speaker": speaker_dir,
                "filename": fname,
                "transcription": transcription,
                "ground_truth": gt_sentence
            })

# Create a DataFrame
df = pd.DataFrame(rows, columns=["speaker", "filename", "transcription", "ground_truth"])

# Preview the first few rows
print(df.head())


  checkpoint = torch.load(fp, map_location=device)


LJSpeech_examples_gtts/LJ_generated_waveform_00.wav: Each year, dozens of visitors are injured because they didn't keep a proper distance. These animals are large, wild, and potentially dangerous, so give them nest space.
LJSpeech_examples_urhythmic_gtts/LJ_generated_waveform_urythmic_00.wav: Each year, dozens of visitors are injured because they didn't keep a proper distance. These animals are large, wild, and potentially dangerous, so give them their space.
p225_examples_gtts/p225_gtts_generated_waveform_00.wav: Each year, dozens of visitors are injured because they didn't keep a proper distance. These animals are large, wild, and potentially dangerous, so give them their space.
p225_examples_urhythmic_gtts/p225_gtts_generated_waveform_urythmic_00.wav: Each year, dozens of visitors are injured because they didn't keep a proper distance. These animals are large, wild and potentially dangerous, so give them their space.
p228_examples_gtts/p228_gtts_generated_waveform_00.wav: Each year,
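
The loop above pairs the i-th sorted file with the i-th ground truth sentence, so the sort order matters. String sorting is correct for zero-padded indices such as `_00` to `_74`, but it would misplace unpadded names (`clip10.wav` sorts before `clip2.wav`). A minimal sketch of a numeric sort key follows. `numeric_suffix_key` is a hypothetical helper, and it assumes the clip index is the last number before the file extension.

```python
import re


def numeric_suffix_key(filename):
    """Sort key: (text before the trailing number, that number as an int)."""
    match = re.search(r"(\d+)(?=\.\w+$)", filename)
    if match is None:
        # No trailing number: sort by name, ahead of numbered files
        return (filename, -1)
    return (filename[:match.start()], int(match.group(1)))


names = ["clip10.wav", "clip2.wav", "clip1.wav"]
print(sorted(names))                          # lexicographic order
print(sorted(names, key=numeric_suffix_key))  # numeric order
```

In the cell above this would replace `audio_files.sort()` with `audio_files.sort(key=numeric_suffix_key)`; for the zero-padded names used here both give the same order.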

In [None]:
# Remove parentheses from the ground truth text before normalization
df["ground_truth"] = df["ground_truth"].str.replace("[()]", "", regex=True)


In [None]:
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
df["transcription_clean"] = [normalizer(text) for text in df["transcription"]]
df["ground_truth_clean"] = [normalizer(text) for text in df["ground_truth"]]
df

Unnamed: 0,speaker,filename,transcription,ground_truth,transcription_clean,ground_truth_clean
0,LJSpeech_examples_gtts,LJ_generated_waveform_00.wav,"Each year, dozens of visitors are injured beca...","""Each year, dozens of visitors are injured bec...",each year dozens of visitors are injured becau...,each year dozens of visitors are injured becau...
1,LJSpeech_examples_gtts,LJ_generated_waveform_01.wav,"personal involvement, and continuing relations...","""""""Personal involvement” and “continuing relat...",personal involvement and continuing relationsh...,personal involvement and continuing relationsh...
2,LJSpeech_examples_gtts,LJ_generated_waveform_02.wav,Women did the cooking in the yard. Stores were...,Women did the cooking in the yard; stores were...,women did the cooking in the yard stores were ...,women did the cooking in the yard stores were ...
3,LJSpeech_examples_gtts,LJ_generated_waveform_03.wav,"The main Amazon River is 6,387 kilometres, 3,9...","""The main Amazon River is 6,387 kilometers 3,9...",the main amazon river is 6387 kilometers 3980 ...,the main amazon river is 6387 kilometers 3980 ...
4,LJSpeech_examples_gtts,LJ_generated_waveform_04.wav,"For some, understanding something about how ai...","""For some, understanding something about how a...",for some understanding something about how air...,for some understanding something about how air...
...,...,...,...,...,...,...
895,p268_examples_urhythmic_gtts,p268_gtts_generated_waveform_urythmic_70.wav,Money can be exchanged at the only bank in the...,Money can be exchanged at the only bank in the...,money can be exchanged at the only bank in the...,money can be exchanged at the only bank in the...
896,p268_examples_urhythmic_gtts,p268_gtts_generated_waveform_urythmic_71.wav,The feathers structure suggests that they were...,"""""""The feathers' structure suggests that they ...",the feathers structure suggests that they were...,the feathers structure suggests that they were...
897,p268_examples_urhythmic_gtts,p268_gtts_generated_waveform_urythmic_72.wav,handicraft products might be defined as antiqu...,"""""""Handicraft products might be defined as ant...",handicraft products might be defined as antiqu...,handicraft products might be defined as antiqu...
898,p268_examples_urhythmic_gtts,p268_gtts_generated_waveform_urythmic_73.wav,"However, a nationwide road network is not econ...","""""""However, a nationwide road network is not e...",however a nationwide road network is not econo...,however a nationwide road network is not econo...
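
The leading quote characters in the `ground_truth` column suggest that the file's CSV quoting survived the line-by-line read. The text normalizer removes them here, but they could also be parsed away at load time. The sketch below assumes the file is a single-column CSV without a header row, which the notebook does not show, so treat it as an assumption. `read_sentences` is a hypothetical helper and the sample lines are illustrative.

```python
import csv
import io


def read_sentences(handle):
    """Return the first column of every non-empty CSV row, unquoted."""
    return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


# Illustrative content; a real file would be opened with
# open(path, "r", encoding="utf-8", newline="")
sample = io.StringIO(
    '"Each year, dozens of visitors are injured."\n'
    'Women did the cooking in the yard.\n'
    '"""Quoted"" text, then more."\n'
)
for sentence in read_sentences(sample):
    print(sentence)
```

The `csv` reader strips the enclosing quotes and turns doubled quotes back into single ones, so sentences containing commas or quotation marks come out as written.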


In [None]:
from jiwer import wer

# Calculate WER for each row in the DataFrame
df["wer"] = [wer(ref, hyp) for ref, hyp in zip(df["ground_truth_clean"], df["transcription_clean"])]

df.head()


Unnamed: 0,speaker,filename,transcription,ground_truth,transcription_clean,ground_truth_clean,wer
0,LJSpeech_examples_gtts,LJ_generated_waveform_00.wav,"Each year, dozens of visitors are injured beca...","""Each year, dozens of visitors are injured bec...",each year dozens of visitors are injured becau...,each year dozens of visitors are injured becau...,0.035714
1,LJSpeech_examples_gtts,LJ_generated_waveform_01.wav,"personal involvement, and continuing relations...","""""""Personal involvement” and “continuing relat...",personal involvement and continuing relationsh...,personal involvement and continuing relationsh...,0.041667
2,LJSpeech_examples_gtts,LJ_generated_waveform_02.wav,Women did the cooking in the yard. Stores were...,Women did the cooking in the yard; stores were...,women did the cooking in the yard stores were ...,women did the cooking in the yard stores were ...,0.0
3,LJSpeech_examples_gtts,LJ_generated_waveform_03.wav,"The main Amazon River is 6,387 kilometres, 3,9...","""The main Amazon River is 6,387 kilometers 3,9...",the main amazon river is 6387 kilometers 3980 ...,the main amazon river is 6387 kilometers 3980 ...,0.0
4,LJSpeech_examples_gtts,LJ_generated_waveform_04.wav,"For some, understanding something about how ai...","""For some, understanding something about how a...",for some understanding something about how air...,for some understanding something about how air...,0.0
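
For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. The sketch below is a plain standard-library version of that definition, not jiwer's implementation. `word_error_rate` is a hypothetical helper that splits on whitespace only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words
print(word_error_rate("so give them their space", "so give them nest space"))
```

Because the rate is normalized by the reference length, a single wrong word costs more in a short sentence than in a long one.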


In [None]:
from g2p_en import G2p

g2p = G2p()

def text_to_phonemes(text):
    # Convert text to phonemes using g2p
    phonemes = g2p(text)
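    # g2p returns ARPAbet phonemes (vowels carry a stress digit, e.g. 'AH0')
    # along with spaces and punctuation. Keep only tokens that start with a
    # letter, so stressed vowels are kept and separators are dropped:
    phonemes = [p for p in phonemes if p[:1].isalpha()]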
    return phonemes

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
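
g2p_en emits ARPAbet symbols in which vowels carry a trailing stress digit (`AH0`, `OW1`), mixed with spaces and punctuation tokens. The sketch below shows one way to filter such a token list, with an optional step that drops the stress digits. `keep_phonemes` is a hypothetical helper, and the token list is hand-written in g2p_en's output style rather than produced by the model.

```python
def keep_phonemes(tokens, strip_stress=False):
    """Drop spaces and punctuation; optionally remove ARPAbet stress digits."""
    # A token is treated as a phoneme if it starts with a letter
    phonemes = [t for t in tokens if t[:1].isalpha()]
    if strip_stress:
        # Stress markers are the trailing digits 0, 1 or 2
        phonemes = [p.rstrip("012") for p in phonemes]
    return phonemes


# Hand-written tokens for "hello world!"
tokens = ["HH", "AH0", "L", "OW1", " ", "W", "ER1", "L", "D", "!"]
print(keep_phonemes(tokens))
print(keep_phonemes(tokens, strip_stress=True))
```

Whether to strip stress is a design choice: keeping the digits counts a stress mismatch as a phoneme error, while stripping them compares the phoneme identities only.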


In [None]:

def compute_per_for_row(row):
    # Convert the reference and hypothesis texts to phonemes
    reference_phonemes = text_to_phonemes(row["ground_truth_clean"])
    hypothesis_phonemes = text_to_phonemes(row["transcription_clean"])

    # Join them into strings for jiwer
    ref_phoneme_str = " ".join(reference_phonemes)
    hyp_phoneme_str = " ".join(hypothesis_phonemes)

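    # jiwer's wer treats each space-separated token as a word, so applying it
    # to the phoneme strings yields the phoneme error rate (PER)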
    return wer(ref_phoneme_str, hyp_phoneme_str)

# Apply the function to each row of the DataFrame
df["per"] = df.apply(compute_per_for_row, axis=1)

# Now 'df' will have a new column 'per' with the phoneme error rate
print(df.head())


                  speaker                      filename  \
0  LJSpeech_examples_gtts  LJ_generated_waveform_00.wav   
1  LJSpeech_examples_gtts  LJ_generated_waveform_01.wav   
2  LJSpeech_examples_gtts  LJ_generated_waveform_02.wav   
3  LJSpeech_examples_gtts  LJ_generated_waveform_03.wav   
4  LJSpeech_examples_gtts  LJ_generated_waveform_04.wav   

                                       transcription  \
0  Each year, dozens of visitors are injured beca...   
1  personal involvement, and continuing relations...   
2  Women did the cooking in the yard. Stores were...   
3  The main Amazon River is 6,387 kilometres, 3,9...   
4  For some, understanding something about how ai...   

                                        ground_truth  \
0  "Each year, dozens of visitors are injured bec...   
1  """Personal involvement” and “continuing relat...   
2  Women did the cooking in the yard; stores were...   
3  "The main Amazon River is 6,387 kilometers 3,9...   
4  "For some, understanding 

In [None]:
mean_results = df.groupby("speaker")[["wer", "per"]].mean().reset_index()
print(mean_results)


                             speaker       wer       per
0             LJSpeech_examples_gtts  0.015789  0.010536
1   LJSpeech_examples_urhythmic_gtts  0.019772  0.010754
2                 p225_examples_gtts  0.013540  0.008231
3       p225_examples_urhythmic_gtts  0.016080  0.006540
4                 p228_examples_gtts  0.019743  0.013647
5       p228_examples_urhythmic_gtts  0.020906  0.011958
6                 p231_examples_gtts  0.014669  0.008263
7       p231_examples_urhythmic_gtts  0.021695  0.014283
8                 p257_examples_gtts  0.019027  0.013242
9       p257_examples_urhythmic_gtts  0.018525  0.010895
10                p268_examples_gtts  0.016392  0.010644
11      p268_examples_urhythmic_gtts  0.017118  0.010928


In [None]:
output_csv_path = '/content/wer_per_scores.csv'
df.to_csv(output_csv_path, index=False)