Required libraries are listed in `requirements.txt`.

Run `%pip install -r requirements.txt` to install them.

We ran the experiments on a GPU (the `g.1.8` configuration in [Yandex Datasphere](https://datasphere.yandex.ru/): 8 CPU cores and one V100 GPU) and assume that a GPU is available for any replication.

In [1]:
#!g1.1
import os
import IPython.display as ipd
import torch, torchaudio
from xnot_matcher import XNOTNeighborsVC
from xnot import XNot

from speechbrain.pretrained import EncoderClassifier

import glob
import numpy as np
import json

from tqdm.auto import tqdm
from collections import defaultdict
from sklearn.metrics import roc_curve
from jiwer import cer, wer

device = 'cuda'

For the intelligibility evaluation and some later experiments we use a commercially available ASR and TTS service, [Yandex SpeechKit](https://cloud.yandex.ru/en/services/speechkit).

In [2]:
#!g1.1

from speechkit import configure_credentials, creds
from speechkit import model_repository
from speechkit.stt import AudioProcessingType

configure_credentials(
    yandex_credentials=creds.YandexCredentials(
        api_key=os.environ['api_key'],
    )
)

model = model_repository.recognition_model()

model.model = 'general:rc'
model.language = 'en-US'
model.audio_processing_type = AudioProcessingType.Full

To evaluate how well speaker identity is preserved, we use a pretrained x-vector extraction model from [SpeechBrain](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb).

In [5]:
#!g1.1
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb", run_opts={"device": device})

As a baseline, we use the best checkpoint provided by the authors of the original [paper](https://arxiv.org/abs/2305.18975): [kNN-VC with prematched HiFiGAN](https://github.com/bshall/knn-vc/releases/download/v0.1/prematch_g_02500000.pt).

`XNOTNeighborsVC` is our modification of `KNeighborsVC` that allows for `xnot` usage during inference.

In [6]:
#!g1.1
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True, device=device)
xnot_vc = XNOTNeighborsVC(knn_vc.wavlm, knn_vc.hifigan, knn_vc.h, device=device).eval()

Downloading: "https://github.com/bshall/knn-vc/zipball/master" to /tmp/xdg_cache/torch/hub/master.zip
Downloading: "https://github.com/bshall/knn-vc/releases/download/v0.1/prematch_g_02500000.pt" to /tmp/xdg_cache/torch/hub/checkpoints/prematch_g_02500000.pt
100%|██████████| 63.1M/63.1M [00:00<00:00, 118MB/s] 
Downloading: "https://github.com/bshall/knn-vc/releases/download/v0.1/WavLM-Large.pt" to /tmp/xdg_cache/torch/hub/checkpoints/WavLM-Large.pt


Removing weight norm...
[HiFiGAN] Generator loaded with 16,523,393 parameters.


100%|██████████| 1.18G/1.18G [00:09<00:00, 138MB/s] 


WavLM-Large loaded with 315,453,120 parameters.


Following the original [paper](https://arxiv.org/abs/2305.18975), we use the LibriSpeech test-clean split from http://www.openslr.org/12. Download this split and unpack it in this repository so that it is available at `data/LibriSpeech/test-clean/`, the path the code below reads from.

For convenience, we use the preprocessed ground-truth transcripts for this split from [LibriSpeech Alignments](https://github.com/CorentinJ/librispeech-alignments). Download the texts in condensed format via the link in that repository's `README.md` and unpack them in the root of this repository as well.

In [7]:
#!g1.1
libri_folder = 'data/LibriSpeech/test-clean/'
speakers = list(sorted(os.listdir(libri_folder)))

In [8]:
#!g1.1
all_speaker_audios = {}
for speaker in speakers:
    files = glob.glob(f'{libri_folder}/{speaker}/*/*.flac', recursive=True)
    all_speaker_audios[speaker] = files

Following the original [paper](https://arxiv.org/abs/2305.18975) once again, we choose 5 audio samples at random for each speaker.

In [9]:
#!g1.1
chosen = {}
for speaker in speakers:
    files = glob.glob(f'{libri_folder}/{speaker}/*/*.flac', recursive=True)
    chosen[speaker] = np.random.choice(files, 5, replace=False)
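The cell above draws a fresh sample on every run, so the 5 utterances per speaker change between executions unless a seed is fixed. Below is a minimal sketch of seeded sampling, assuming reproducibility is desired. `choose_utterances` and `seed` are hypothetical names, and the standard-library `random` module stands in for `np.random.choice`.

```python
import random

def choose_utterances(files, k=5, seed=0):
    """Pick k files reproducibly; sorting first removes dependence on glob order."""
    rng = random.Random(seed)
    return rng.sample(sorted(files), k)

# Hypothetical file names standing in for one speaker's .flac files.
files = [f"utt-{i:04d}.flac" for i in range(20)]
first = choose_utterances(files)
second = choose_utterances(list(reversed(files)))
print(first == second)  # same seed and same file set give the same choice
```

Sorting before sampling matters because `glob.glob` does not guarantee a stable ordering across file systems.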

For convenience, we precalculate feature embeddings for chosen audios:

In [11]:
#!g1.1
matching_sets = {}

for speaker in tqdm(speakers):
    audios = []
    for filename in chosen[speaker]:
        audio, _ = torchaudio.load(filename)
        audios.append(audio)
    matching_sets[speaker] = knn_vc.get_matching_set(audios).cpu()

os.makedirs('data/matchings', exist_ok=True)  # torch.save does not create missing directories
torch.save(matching_sets, 'data/matchings/matching_sets')

Since the number of source-target pairs grows roughly as the square of the number of speakers, we consider only the first 10 speakers (out of 40) due to time and budget constraints.
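To make the growth concrete, the following sketch counts the conversions implied by the setup: ordered source-target pairs, 5 audios per source speaker, and the three values of `w` used for `xnot`. The function name `conversion_counts` is hypothetical.

```python
def conversion_counts(n_speakers, n_audios=5, n_weights=3):
    """Return (ordered speaker pairs, kNN conversions, xnot conversions)."""
    ordered_pairs = n_speakers * (n_speakers - 1)  # source != target
    knn = ordered_pairs * n_audios                 # one conversion per source audio
    xnot = knn * n_weights                         # repeated for every value of w
    return ordered_pairs, knn, xnot

for n in (10, 40):
    pairs, knn, xnot = conversion_counts(n)
    print(f"{n} speakers: {pairs} pairs, {knn} kNN conversions, {xnot} xnot conversions")
```

With 10 speakers this gives 90 pairs, 450 kNN conversions and 1350 `xnot` conversions; with all 40 speakers the figures rise to 1560, 7800 and 23400.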

---

### Basic setup

Each audio is used as a single source and converted to every other considered speaker with both the original `knn` algorithm and our `xnot` modification of it. For `xnot`, we use `[1, 2, 4]` as values of the `w` hyperparameter of `XNot` in **all** experiments.

In [None]:
#!g1.1
n_speakers = 10
speakers_list = speakers[:n_speakers]


for src_speaker in tqdm(speakers_list):
    for target_speaker in speakers_list:
        if src_speaker == target_speaker:
            continue
        print(f'Converting {src_speaker=} to {target_speaker=}')
        for filename in chosen[src_speaker]:
            audio, _ = torchaudio.load(filename)
            query_seq = knn_vc.get_features(audio)
            idx = filename.split('/')[-1].split('.')[0]

            for i, W in enumerate([1.0, 2.0, 4.0]):
                path = f'w-{int(W)}/target-{target_speaker}-idx-{idx}-src-{src_speaker}'
                if os.path.exists(f'data/x_nots/{path}'):
                    continue
                out_wav_xnot, xnot = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='xnot', W=W, max_steps=200)
                torchaudio.save(f'data/audios/xnot/{path}.wav', out_wav_xnot[None].cpu(), 16000)
                torch.save(
                {
                    'state_dict': xnot.state_dict()
                }, f'data/x_nots/{path}')
            path = f'target-{target_speaker}-idx-{idx}-src-{src_speaker}'
            out_wav_knn, _ = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='knn')
            torchaudio.save(f'data/audios/knn/{path}.wav', out_wav_knn[None].cpu(), 16000)
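The loop above writes into nested folders such as `data/audios/xnot/w-1/` and `data/x_nots/w-1/`. `torch.save` does not create missing parent directories, and we assume the same holds for `torchaudio.save`, so these folders have to exist beforehand. A small helper along these lines could wrap each output path (`ensure_parent` is a hypothetical name):

```python
import os
import tempfile

def ensure_parent(path):
    """Create the parent directory of `path` if needed and return `path` unchanged."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path

# Demonstration inside a temporary directory so nothing is written to the repository.
root = tempfile.mkdtemp()
wav_path = ensure_parent(os.path.join(root, "data/audios/xnot/w-1/example.wav"))
print(os.path.isdir(os.path.dirname(wav_path)))
```

In the conversion loop this would be used as, for example, `torchaudio.save(ensure_parent(f'data/audios/knn/{path}.wav'), ...)`.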

### V2 experiment setup

For each source-target speaker pair we use **all 5** available audios for each speaker to train a single `XNot` checkpoint, which we use to convert all 5 audios afterwards.

In [None]:
#!g1.1
%%time

for src_speaker in tqdm(speakers_list):
    for target_speaker in speakers_list:
        if src_speaker == target_speaker:
            continue
        print(f'Converting {src_speaker=} to {target_speaker=}')
        src_audios = []

        for filename in chosen[src_speaker]:
            audio, _ = torchaudio.load(filename)
            src_audios.append(audio)

        query_matchings = knn_vc.get_matching_set(src_audios)
        for i, W in enumerate([1.0, 2.0, 4.0]):

            x_not = XNot(query_matchings.shape[1], device)
            x_not = x_not.train(True)
            x_not.fit(query_matchings.to(device), matching_sets[target_speaker].to(device), max_steps=200, t_iters=10, cost="sq_cost", batch_size=64, W=W)

            for filename in chosen[src_speaker]:
                audio, _ = torchaudio.load(filename)
                query_seq = knn_vc.get_features(audio)
                idx = filename.split('/')[-1].split('.')[0]

                path = f'w-{int(W)}/target-{target_speaker}-idx-{idx}-src-{src_speaker}'
                if os.path.exists(f'data/2/x_nots/{path}'):
                    continue
                out_wav_xnot, _ = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='xnot', W=W, max_steps=200, x_not=x_not)
                torchaudio.save(f'data/2/audios/xnot/{path}.wav', out_wav_xnot[None].cpu(), 16000)
                torch.save(
                {
                    'state_dict': x_not.state_dict()
                }, f'data/2/x_nots/{path}')
                path = f'target-{target_speaker}-idx-{idx}-src-{src_speaker}'
                out_wav_knn, _ = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='knn')
                torchaudio.save(f'data/2/audios/knn/{path}.wav', out_wav_knn[None].cpu(), 16000)

### Ablation setup

For each source-target speaker pair we take a single `XNot` checkpoint trained in the Basic setup (the one fitted on the first chosen audio) and use it to convert the remaining 4 audios.

In [None]:
#!g1.1
%%time
n_speakers = 10
speakers_list = speakers[:n_speakers]
xnot = XNot(1024, device)  # 1024 is the original embedding size

for src_speaker in tqdm(speakers_list):
    for target_speaker in speakers_list:
        if src_speaker == target_speaker:
            continue
        print(f'Converting {src_speaker=} to {target_speaker=}')
        existing_xnot_filename = chosen[src_speaker][0]
        existing_xnot_idx = existing_xnot_filename.split('/')[-1].split('.')[0]

        for filename in chosen[src_speaker][1:]:
            audio, _ = torchaudio.load(filename)
            query_seq = knn_vc.get_features(audio)
            idx = filename.split('/')[-1].split('.')[0]

            for i, W in enumerate([1.0, 2.0, 4.0]):
                path = f'w-{int(W)}/target-{target_speaker}-existing-{existing_xnot_idx}-idx-{idx}-src-{src_speaker}'
                xnot.load_state_dict(torch.load(f'data/x_nots/w-{int(W)}/target-{target_speaker}-idx-{existing_xnot_idx}-src-{src_speaker}')['state_dict'])
                out_wav_xnot, _ = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='xnot', W=W, max_steps=200, x_not=xnot)
                torchaudio.save(f'data/audios/reconstructed/{path}.wav', out_wav_xnot[None].cpu(), 16000)

## Evaluation

### Speaker identity conservation

We could not find the `EER` computation code used by the authors, and our version definitely differs from theirs (ours is capped at 1.0 rather than 0.5). We therefore also compute our version of the metric on the outputs of the original algorithm to obtain comparable values.

In [None]:
#!g1.1
def compute_eer(label, pred, positive_label=1):
    # ROC operating points for every score threshold
    fpr, tpr, threshold = roc_curve(label, pred, pos_label=positive_label)
    fnr = 1 - tpr
    # Operating point where the false-negative and false-positive rates are closest
    idx = np.nanargmin(np.absolute(fnr - fpr))
    # The two rates rarely coincide exactly, so report their average
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer

`EER` is computed over x-vector cosine similarity scores.

In [None]:
#!g1.1
sim = torch.nn.CosineSimilarity().to(device)

knn_similarities = []

for target_speaker in tqdm(chosen, total=len(chosen)):
    conversions = glob.glob(f'data/audios/knn/target-{target_speaker}*', recursive=True)
    source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

    for source, conversion in zip(source_audios, conversions):
        source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
        converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
        knn_similarities.append(sim(source_embedding, converted_embedding).cpu().item())


gt_similarities = []

for target_speaker in np.random.choice(speakers, size=len(knn_similarities)):
    src_speaker = target_speaker
    while src_speaker == target_speaker:
        src_speaker = np.random.choice(speakers)

    target_audio = np.random.choice(all_speaker_audios[target_speaker], size=1)[0]
    source_audio = np.random.choice(all_speaker_audios[src_speaker], size=1)[0]

    source_embedding = classifier.encode_batch(torchaudio.load(source_audio)[0]).squeeze(1)
    converted_embedding = classifier.encode_batch(torchaudio.load(target_audio)[0]).squeeze(1)
    gt_similarities.append(sim(source_embedding, converted_embedding).cpu().item())


preds_knn = np.array(knn_similarities + gt_similarities)
targets = np.array([0. for _ in knn_similarities] + [1. for _ in gt_similarities])
knn_eer = compute_eer(targets, preds_knn, positive_label=1)

results = []
for W in [1.0, 2.0, 4.0]:
    similarities = []

    for target_speaker in tqdm(chosen, total=len(chosen)):
        conversions = glob.glob(f'data/audios/xnot/w-{int(W)}/target-{target_speaker}*', recursive=True)
        source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

        for source, conversion in zip(source_audios, conversions):
            source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
            converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
            similarities.append(sim(source_embedding, converted_embedding).cpu().item())

    preds_xnot = np.array(similarities + gt_similarities)
    targets = np.array([0. for _ in similarities] + [1. for _ in gt_similarities])

    results.append(compute_eer(targets, preds_xnot, positive_label=1))
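In the cell above, similarities between converted audio and real target audio get label `0`, while similarities between real audios of two different speakers get label `1`, which is also the positive label. Under this convention a convincing conversion pushes the value towards 1.0, which appears to be why this variant is not bounded by 0.5. The dependency-free sketch below illustrates the effect on made-up scores; `eer_sketch` is a hypothetical threshold scan, not the `roc_curve`-based `compute_eer` from the notebook.

```python
def eer_sketch(labels, scores, positive_label=1):
    """Equal error rate by scanning thresholds; a score >= threshold is predicted positive."""
    n_pos = sum(1 for l in labels if l == positive_label)
    n_neg = len(labels) - n_pos
    best_gap, best_eer = None, None
    # float("inf") covers the operating point where nothing is predicted positive.
    for threshold in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == positive_label)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l != positive_label)
        fpr = fp / n_neg
        fnr = 1 - tp / n_pos
        gap = abs(fnr - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer

# Made-up similarity scores, for illustration only.
converted_vs_target = [0.90, 0.85, 0.80]  # label 0: converted audio vs. real target audio
different_speakers = [0.30, 0.20, 0.10]   # label 1: two real audios of different speakers
labels = [0] * 3 + [1] * 3

convincing = eer_sketch(labels, converted_vs_target + different_speakers)
failed = eer_sketch(labels, different_speakers + converted_vs_target)
print(convincing, failed)
```

Here `convincing` evaluates to 1.0 because every converted sample is closer to its target than any pair of distinct speakers, and `failed` evaluates to 0.0 for the swapped scores.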

In [None]:
#!g1.1
knn_similarities_2 = []

for target_speaker in tqdm(chosen, total=len(chosen)):
    conversions = glob.glob(f'data/2/audios/knn/target-{target_speaker}*', recursive=True)
    source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

    for source, conversion in zip(source_audios, conversions):
        source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
        converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
        knn_similarities_2.append(sim(source_embedding, converted_embedding).cpu().item())

gt_similarities = []

for target_speaker in np.random.choice(speakers, size=len(knn_similarities_2)):
    src_speaker = target_speaker
    while src_speaker == target_speaker:
        src_speaker = np.random.choice(speakers)

    target_audio = np.random.choice(all_speaker_audios[target_speaker], size=1)[0]
    source_audio = np.random.choice(all_speaker_audios[src_speaker], size=1)[0]

    source_embedding = classifier.encode_batch(torchaudio.load(source_audio)[0]).squeeze(1)
    converted_embedding = classifier.encode_batch(torchaudio.load(target_audio)[0]).squeeze(1)
    gt_similarities.append(sim(source_embedding, converted_embedding).cpu().item())


preds_knn = np.array(knn_similarities_2 + gt_similarities)
targets = np.array([0. for _ in knn_similarities_2] + [1. for _ in gt_similarities])
knn_eer_2 = compute_eer(targets, preds_knn, positive_label=1)

results_2 = []
for W in [1.0, 2.0, 4.0]:
    similarities = []

    for target_speaker in tqdm(chosen, total=len(chosen)):
        conversions = glob.glob(f'data/2/audios/xnot/w-{int(W)}/target-{target_speaker}*', recursive=True)
        source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

        for source, conversion in zip(source_audios, conversions):
            source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
            converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
            similarities.append(sim(source_embedding, converted_embedding).cpu().item())

    preds_xnot = np.array(similarities + gt_similarities)
    targets = np.array([0. for _ in similarities] + [1. for _ in gt_similarities])

    results_2.append(compute_eer(targets, preds_xnot, positive_label=1))

In [None]:
#!g1.1
results_reconstructed = []
for W in [1.0, 2.0, 4.0]:
    similarities = []

    for target_speaker in tqdm(chosen, total=len(chosen)):
        conversions = glob.glob(f'data/audios/reconstructed/w-{int(W)}/target-{target_speaker}*', recursive=True)
        source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

        for source, conversion in zip(source_audios, conversions):
            source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
            converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
            similarities.append(sim(source_embedding, converted_embedding).cpu().item())

    preds_xnot = np.array(similarities + gt_similarities)
    targets = np.array([0. for _ in similarities] + [1. for _ in gt_similarities])

    results_reconstructed.append(compute_eer(targets, preds_xnot, positive_label=1))

### Intelligibility

We obtain SpeechKit transcriptions of the source and generated audio:

In [None]:
#!g1.1
recognitions = defaultdict(dict)

for speaker, filenames in tqdm(chosen.items(), total=len(chosen)):
    for filename in filenames:
        if filename in recognitions['src']:
            continue
        recognition = model.transcribe_file(filename)
        assert len(recognition) == 1
        recognitions['src'][filename] = recognition[0].raw_text

with open('recognitions-src.json', 'w') as f:
    json.dump(recognitions['src'], f, ensure_ascii=False, indent=4)

In [None]:
#!g1.1
knn_wavs = glob.glob(f'data/audios/knn/*.wav', recursive=True)
xnot_all_wavs = glob.glob(f'data/audios/xnot/*/*.wav', recursive=True)
xnot_reconstructed_wavs = glob.glob(f'data/audios/reconstructed/*/*.wav', recursive=True)

knn_2_wavs = glob.glob(f'data/2/audios/knn/*.wav', recursive=True)
xnot_2_all_wavs = glob.glob(f'data/2/audios/xnot/*/*.wav', recursive=True)

for filename in tqdm(knn_wavs):
    if filename in recognitions['knn']:
        continue
    recognition = model.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['knn'][filename] = recognition[0].raw_text

for filename in tqdm(xnot_all_wavs):
    if filename in recognitions['xnot']:
        continue
    recognition = model.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['xnot'][filename] = recognition[0].raw_text


for filename in tqdm(xnot_reconstructed_wavs):
    if filename in recognitions['reconstructed']:
        continue
    recognition = model.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['reconstructed'][filename] = recognition[0].raw_text


for filename in tqdm(knn_2_wavs):
    if filename in recognitions['2-knn']:
        continue

    recognition = model.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['2-knn'][filename] = recognition[0].raw_text
    print(recognitions['2-knn'][filename])

for filename in tqdm(xnot_2_all_wavs):
    if filename in recognitions['2-xnot']:
        continue
    recognition = model.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['2-xnot'][filename] = recognition[0].raw_text

In [None]:
#!g1.1
with open('recognitions.json', 'w') as f:
    json.dump(recognitions, f, ensure_ascii=False, indent=4)
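The transcription loops skip files that already have an entry in `recognitions`, so an interrupted run can be resumed without paying for the same audio twice, provided the saved JSON is loaded first. A minimal sketch of such a loader follows; `load_recognitions` is a hypothetical helper, and the round trip uses a temporary file with a made-up entry.

```python
import json
import os
import tempfile
from collections import defaultdict

def load_recognitions(path):
    """Return cached recognitions as a defaultdict(dict); empty if the file does not exist."""
    recognitions = defaultdict(dict)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            recognitions.update(json.load(f))
    return recognitions

# Round trip in a temporary directory with a made-up entry.
cache_path = os.path.join(tempfile.mkdtemp(), "recognitions.json")
with open(cache_path, "w", encoding="utf-8") as f:
    json.dump({"knn": {"example.wav": "hello world"}}, f, ensure_ascii=False, indent=4)

recognitions = load_recognitions(cache_path)
print("example.wav" in recognitions["knn"], recognitions["xnot"])
```

Returning a `defaultdict(dict)` keeps checks such as `filename in recognitions['xnot']` valid even for groups that were never saved.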

Ground truth transcripts retrieval:

In [None]:
#!g1.1
text_files = glob.glob(f'{libri_folder}/**/*trans.txt', recursive=True)

In [None]:
#!g1.1
gt_texts = {}

for file in text_files:
    with open(file) as f:
        for line in f:
            line = line.strip()
            idx, text = line.split(maxsplit=1)
            gt_texts[idx] = text.lower()

with open('gt_texts.json', 'w') as f:
    json.dump(gt_texts, f, ensure_ascii=False, indent=4)

`WER` and `CER` calculation:

In [None]:
#!g1.1
wer_results = {}

knn_wers = []
knn_cers = []

for filename, rec_text in recognitions['knn'].items():
    idx = filename.split('-', maxsplit=3)[-1].rsplit('-', maxsplit=2)[0]
    knn_wers.append(wer(gt_texts[idx], rec_text))
    knn_cers.append(cer(gt_texts[idx], rec_text))

for W in [1.0, 2.0, 4.0]:

    # Conversions produced with this value of W
    xnot_wavs = set(glob.glob(f'data/audios/xnot/w-{int(W)}/*.wav', recursive=True))

    xnot_wers = []
    xnot_cers = []

    for filename, rec_text in recognitions['xnot'].items():
        if filename not in xnot_wavs:
            continue  # recognition belongs to a different W
        idx = filename.split('-', maxsplit=4)[-1].rsplit('-', maxsplit=2)[0]
        xnot_wers.append(wer(gt_texts[idx], rec_text))
        xnot_cers.append(cer(gt_texts[idx], rec_text))

    wer_results[int(W)] = (xnot_wers, xnot_cers)


wer_results_reconstructed = {}

for W in [1.0, 2.0, 4.0]:

    # Reconstructed conversions produced with this value of W
    xnot_wavs = set(glob.glob(f'data/audios/reconstructed/w-{int(W)}/*.wav', recursive=True))

    xnot_wers = []
    xnot_cers = []

    for filename, rec_text in recognitions['reconstructed'].items():
        if filename not in xnot_wavs:
            continue  # recognition belongs to a different W
        idx = filename.split('-', maxsplit=8)[-1].rsplit('-', maxsplit=2)[0]
        xnot_wers.append(wer(gt_texts[idx], rec_text))
        xnot_cers.append(cer(gt_texts[idx], rec_text))

    wer_results_reconstructed[int(W)] = (xnot_wers, xnot_cers)



wer_results_2 = {}

knn_wers_2 = []
knn_cers_2 = []

for filename, rec_text in recognitions['2-knn'].items():
    if 'target-ru' not in filename:
        idx = filename.split('-', maxsplit=3)[-1].rsplit('-', maxsplit=2)[0]
    else:
        idx = filename.split('-', maxsplit=4)[-1].rsplit('-', maxsplit=2)[0]

    knn_wers_2.append(wer(gt_texts[idx], rec_text))
    knn_cers_2.append(cer(gt_texts[idx], rec_text))

for W in [1.0, 2.0, 4.0]:
    # V2 conversions produced with this value of W
    xnot_wavs = set(glob.glob(f'data/2/audios/xnot/w-{int(W)}/*.wav', recursive=True))

    xnot_wers = []
    xnot_cers = []

    for filename, rec_text in recognitions['2-xnot'].items():
        if filename not in xnot_wavs:
            continue  # recognition belongs to a different W
        if 'target-ru' not in filename:
            idx = filename.split('-', maxsplit=4)[-1].rsplit('-', maxsplit=2)[0]
        else:
            idx = filename.split('-', maxsplit=5)[-1].rsplit('-', maxsplit=2)[0]
        xnot_wers.append(wer(gt_texts[idx], rec_text))
        xnot_cers.append(cer(gt_texts[idx], rec_text))

    wer_results_2[int(W)] = (xnot_wers, xnot_cers)
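The cell above recovers the utterance id by counting hyphens, which has to be adjusted for every folder layout. As an alternative sketch, a regular expression can parse the `target-...-idx-...-src-...` naming scheme directly. `parse_conversion_path` and both patterns are hypothetical, and the paths below are examples that follow the scheme rather than files guaranteed to exist.

```python
import re

NAME_RE = re.compile(
    r"target-(?P<target>.+?)-(?:existing-(?P<existing>.+?)-)?idx-(?P<idx>.+)-src-(?P<src>.+)\.wav$"
)
WEIGHT_RE = re.compile(r"/w-(?P<w>\d+)/")

def parse_conversion_path(path):
    """Split a conversion path into target, idx, src, optional existing idx and optional W."""
    name = NAME_RE.search(path)
    if name is None:
        raise ValueError(f"unexpected file name: {path}")
    fields = name.groupdict()
    weight = WEIGHT_RE.search(path)
    fields["w"] = int(weight.group("w")) if weight else None
    return fields

basic = parse_conversion_path("data/audios/xnot/w-2/target-1089-idx-1188-133604-0000-src-1188.wav")
cross = parse_conversion_path("data/ru/audios/knn/target-1089-idx-0-src-ru-tts.wav")
print(basic["idx"], basic["w"], cross["src"])
```

The parsed `idx` can then be used as the key into `gt_texts`, and `w` replaces the per-folder globbing.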

As a precaution, we also calculate `WER` and `CER` of the chosen ASR service on the original audios:

In [None]:
#!g1.1
gt_wers = []
gt_cers = []

for filename, rec_text in recognitions['src'].items():
    idx = filename.split('/')[-1].split('.')[0]
    gt_wers.append(wer(gt_texts[idx], rec_text))
    gt_cers.append(cer(gt_texts[idx], rec_text))
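For readers unfamiliar with the metric, `WER` is the word-level edit distance between reference and hypothesis divided by the number of reference words; `CER` is the same quantity over characters. The sketch below is a plain dynamic-programming illustration of that definition, assuming a non-empty reference; it is not the `jiwer` implementation used in the notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)

one_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
identical = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
print(one_deletion, identical)
```

One missing word out of six reference words gives a rate of 1/6, and identical strings give 0.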

# Cross-lingual conversion

We synthesized 5 randomly chosen passages from `War and Peace` by Leo Tolstoy (in Russian) with [Yandex SpeechKit](https://cloud.yandex.ru/en/services/speechkit) TTS. We then converted them to the voices of the considered English speakers, and vice versa.

In [None]:
#!g1.1
tts_model = model_repository.synthesis_model()

In [None]:
#!g1.1
import string

In [None]:
#!g1.1
ru_speaker = 'ru-tts'

with open('data/ru_texts.txt') as f:
    for i, line in enumerate(f):
        text = line.strip()
        assert str(i) not in gt_texts  # keys are stored as strings
        gt_texts[str(i)] = text.lower().translate(str.maketrans('', '', string.punctuation))  # lowercase and strip ASCII punctuation
        tts_model.synthesize(text).export(f'data/LibriSpeech/test-clean/{ru_speaker}/{i}.wav', format='wav')
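`string.punctuation` contains only ASCII characters, so Russian typographic marks such as guillemets, long dashes and the ellipsis character survive the normalization above and may be counted as errors in `CER`. A possible alternative, sketched here under the assumption that all Unicode punctuation should be dropped, filters on the Unicode category instead. `strip_punctuation` is a hypothetical helper, and it additionally collapses repeated whitespace.

```python
import unicodedata

def strip_punctuation(text):
    """Lowercase, delete every Unicode punctuation character, collapse whitespace."""
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.lower().split())

# \u2014 is a long dash and \u2026 the ellipsis character.
sample = "«Ну, здравствуйте!» \u2014 сказал он\u2026"
print(strip_punctuation(sample))
```

Whether this matters in practice depends on the punctuation actually present in `data/ru_texts.txt` and in the recognized text.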

In [None]:
#!g1.1
%%time
chosen[ru_speaker] = glob.glob(f'{libri_folder}/{ru_speaker}/*.wav', recursive=True)

# Build one matching set from all synthesized audios, as for the LibriSpeech speakers
audios = []
for filename in chosen[ru_speaker]:
    audio, _ = torchaudio.load(filename)
    audios.append(audio)
matching_sets[ru_speaker] = knn_vc.get_matching_set(audios).cpu()

torch.save(matching_sets, 'data/matchings/matching_sets-ru')


for src_speaker in tqdm(speakers_list):
    target_speaker = ru_speaker
    print(f'Converting {src_speaker=} to {target_speaker=}')
    for filename in chosen[src_speaker]:
        audio, _ = torchaudio.load(filename)
        query_seq = knn_vc.get_features(audio)
        idx = filename.split('/')[-1].split('.')[0]

        for i, W in enumerate([1.0, 2.0, 4.0]):
            path = f'w-{int(W)}/target-{target_speaker}-idx-{idx}-src-{src_speaker}'
            if os.path.exists(f'data/ru/x_nots/{path}'):
                continue
            out_wav_xnot, xnot = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='xnot', W=W, max_steps=200)
            torchaudio.save(f'data/ru/audios/xnot/{path}.wav', out_wav_xnot[None].cpu(), 16000)
            torch.save(
            {
                'state_dict': xnot.state_dict()
            }, f'data/ru/x_nots/{path}')
        path = f'target-{target_speaker}-idx-{idx}-src-{src_speaker}'
        out_wav_knn, _ = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='knn')
        torchaudio.save(f'data/ru/audios/knn/{path}.wav', out_wav_knn[None].cpu(), 16000)



for target_speaker in tqdm(speakers_list):
    src_speaker = ru_speaker
    print(f'Converting {src_speaker=} to {target_speaker=}')
    for filename in chosen[src_speaker]:
        audio, _ = torchaudio.load(filename)
        query_seq = knn_vc.get_features(audio)
        idx = filename.split('/')[-1].split('.')[0]

        for i, W in enumerate([1.0, 2.0, 4.0]):
            path = f'w-{int(W)}/target-{target_speaker}-idx-{idx}-src-{src_speaker}'
            if os.path.exists(f'data/ru/x_nots/{path}'):
                continue
            out_wav_xnot, xnot = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='xnot', W=W, max_steps=200)
            torchaudio.save(f'data/ru/audios/xnot/{path}.wav', out_wav_xnot[None].cpu(), 16000)
            torch.save(
            {
                'state_dict': xnot.state_dict()
            }, f'data/ru/x_nots/{path}')
        path = f'target-{target_speaker}-idx-{idx}-src-{src_speaker}'
        out_wav_knn, _ = xnot_vc.match(query_seq, matching_sets[target_speaker], topk=4, algorithm='knn')
        torchaudio.save(f'data/ru/audios/knn/{path}.wav', out_wav_knn[None].cpu(), 16000)

In [None]:
#!g1.1
ru_model = model_repository.recognition_model()

ru_model.model = 'general:rc'
ru_model.language = 'ru-RU'
ru_model.audio_processing_type = AudioProcessingType.Full

In [None]:
#!g1.1
knn_ru_wavs = glob.glob(f'data/ru/audios/knn/*.wav', recursive=True)
xnot_ru_all_wavs = glob.glob(f'data/ru/audios/xnot/*/*.wav', recursive=True)

for filename in tqdm(knn_ru_wavs):
    if filename in recognitions['ru-knn']:
        continue
    if f'target-{ru_speaker}' in filename:
        m = model
    else:
        m = ru_model
    recognition = m.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['ru-knn'][filename] = recognition[0].raw_text

for filename in tqdm(xnot_ru_all_wavs):
    if filename in recognitions['ru-xnot']:
        continue
    if f'target-{ru_speaker}' in filename:
        m = model
    else:
        m = ru_model
    recognition = m.transcribe_file(filename)
    assert len(recognition) == 1
    recognitions['ru-xnot'][filename] = recognition[0].raw_text
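The model selection in the loops above follows from the fact that voice conversion keeps the spoken content of the source: when the target is the Russian TTS voice, the source is an English LibriSpeech utterance, so the English model is used, and the other way round. A compact restatement of that rule, with the hypothetical helper `asr_language` returning the language code instead of a model object:

```python
def asr_language(filename, ru_speaker="ru-tts"):
    """Return the recognition language implied by the conversion direction in the file name."""
    # Target is the Russian voice -> the content comes from an English source utterance.
    return "en-US" if f"target-{ru_speaker}" in filename else "ru-RU"

to_russian_voice = asr_language("data/ru/audios/knn/target-ru-tts-idx-1089-134686-0000-src-1089.wav")
to_english_voice = asr_language("data/ru/audios/knn/target-1089-idx-0-src-ru-tts.wav")
print(to_russian_voice, to_english_voice)
```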

In [None]:
#!g1.1
ru_wer_results = {}

ru_knn_wers = []
ru_knn_cers = []

for filename, rec_text in recognitions['ru-knn'].items():
    if 'target-ru' not in filename:
        idx = filename.split('-', maxsplit=3)[-1].rsplit('-', maxsplit=3)[0]
    else:

        idx = filename.split('-', maxsplit=4)[-1].rsplit('-', maxsplit=2)[0]

    ru_knn_wers.append(wer(gt_texts[idx], rec_text))
    ru_knn_cers.append(cer(gt_texts[idx], rec_text))

for W in [1.0, 2.0, 4.0]:

    # Cross-lingual conversions produced with this value of W
    ru_xnot_wavs = set(glob.glob(f'data/ru/audios/xnot/w-{int(W)}/*.wav', recursive=True))

    ru_xnot_wers = []
    ru_xnot_cers = []

    for filename, rec_text in recognitions['ru-xnot'].items():
        if filename not in ru_xnot_wavs:
            continue  # recognition belongs to a different W
        if 'target-ru' not in filename:
            idx = filename.split('-', maxsplit=4)[-1].rsplit('-', maxsplit=3)[0]
        else:
            idx = filename.split('-', maxsplit=5)[-1].rsplit('-', maxsplit=2)[0]
        ru_xnot_wers.append(wer(gt_texts[idx], rec_text))
        ru_xnot_cers.append(cer(gt_texts[idx], rec_text))

    ru_wer_results[int(W)] = (ru_xnot_wers, ru_xnot_cers)

In [None]:
#!g1.1
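ru_knn_similarities = []

# The synthetic Russian speaker is not in `all_speaker_audios` yet; register its reference audios
all_speaker_audios.setdefault(ru_speaker, list(chosen[ru_speaker]))

# `chosen` already contains `ru_speaker` (added above), so it is not appended a second time
for target_speaker in tqdm(chosen, total=len(chosen)):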
    conversions = glob.glob(f'data/ru/audios/knn/target-{target_speaker}*', recursive=True)
    source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

    for source, conversion in zip(source_audios, conversions):
        source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
        converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
        ru_knn_similarities.append(sim(source_embedding, converted_embedding).cpu().item())


gt_similarities = []

for target_speaker in np.random.choice(speakers + [ru_speaker], size=len(ru_knn_similarities)):
    src_speaker = target_speaker
    while src_speaker == target_speaker:
        src_speaker = np.random.choice(speakers + [ru_speaker])

    target_audio = np.random.choice(all_speaker_audios[target_speaker], size=1)[0]
    source_audio = np.random.choice(all_speaker_audios[src_speaker], size=1)[0]

    source_embedding = classifier.encode_batch(torchaudio.load(source_audio)[0]).squeeze(1)
    converted_embedding = classifier.encode_batch(torchaudio.load(target_audio)[0]).squeeze(1)
    gt_similarities.append(sim(source_embedding, converted_embedding).cpu().item())


preds_knn = np.array(ru_knn_similarities + gt_similarities)
targets = np.array([0. for _ in ru_knn_similarities] + [1. for _ in gt_similarities])
knn_eer_ru = compute_eer(targets, preds_knn, positive_label=1)

results_ru = []
for W in [1.0, 2.0, 4.0]:
    similarities = []

    for target_speaker in tqdm(chosen, total=len(chosen)):  # `chosen` already contains `ru_speaker`
        conversions = glob.glob(f'data/ru/audios/xnot/w-{int(W)}/target-{target_speaker}*', recursive=True)
        source_audios = np.random.choice(all_speaker_audios[target_speaker], size=len(conversions))

        for source, conversion in zip(source_audios, conversions):
            source_embedding = classifier.encode_batch(torchaudio.load(source)[0]).squeeze(1)
            converted_embedding = classifier.encode_batch(torchaudio.load(conversion)[0]).squeeze(1)
            similarities.append(sim(source_embedding, converted_embedding).cpu().item())

    preds_xnot = np.array(similarities + gt_similarities)
    targets = np.array([0. for _ in similarities] + [1. for _ in gt_similarities])

    results_ru.append(compute_eer(targets, preds_xnot, positive_label=1))

# Results

In [None]:
#!g1.1
exp_results = dict(
    knn_eer=knn_eer,
    results=results,
    knn_eer_2=knn_eer_2,
    results_2=results_2,
    results_reconstructed=results_reconstructed,
    knn_eer_ru=knn_eer_ru,
    results_ru=results_ru,
    knn_wers=knn_wers,
    knn_cers=knn_cers,
    wer_results=wer_results,
    wer_results_reconstructed=wer_results_reconstructed,
    wer_results_2=wer_results_2,
    knn_wers_2=knn_wers_2,
    knn_cers_2=knn_cers_2
)

In [None]:
#!g1.1
with open('results.json', 'w') as f:
    json.dump(exp_results, f)
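When `results.json` is read back, note that JSON stores the integer `W` keys as strings and the `(wers, cers)` tuples as lists. The sketch below shows one way to reduce such a structure to mean `WER` and `CER` per `W`. `summarise` is a hypothetical helper, and the numbers are dummy values that only show the shape of the data, not experimental results.

```python
import json
from statistics import mean

def summarise(wer_results):
    """Map each W to (mean WER, mean CER) given {W: (wers, cers)}."""
    return {str(w): (mean(wers), mean(cers)) for w, (wers, cers) in wer_results.items()}

# Dummy numbers only; they are not results from the experiments.
dummy = {1: ([0.1, 0.3], [0.05, 0.15]), 2: ([0.2, 0.4], [0.1, 0.2])}
restored = json.loads(json.dumps(dummy))  # integer keys come back as strings, tuples as lists
print(sorted(restored))
print(summarise(restored))
```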