# VERSA Colab Demonstration

In this demonstration, we show some examples of the VERSA (Versatile Speech and Audio) evaluation toolkit.

Main reference:
- [VERSA repository](https://github.com/shinjiwlab/versa)
- [ESPnet repository]()

Author:
- Jiatong Shi ([email](mailto:jiatongs@andrew.cmu.edu))



## Motivation

Evaluating the quality of generated speech and audio presents numerous challenges:
- Collecting subjective evaluations is not easy.
- Implementing objective evaluations is equally challenging.
- There is a growing need for multi-domain, large-scale evaluations with unified speech, audio, and music modeling.

Although new metrics are emerging in the field of speech and audio evaluation, researchers still face significant obstacles in accessing a broad range of evaluation metrics.

VERSA aims to provide a general interface for speech and audio evaluation, offering a collection of both conventional and recent automatic quality evaluation metrics. While it can function independently, we also offer seamless integration with existing ESPnet tasks.

## Contents

1. VERSA Installation

2. VERSA Examples with API

   2.1 VERSA base evaluation

   2.2 VERSA speaker evaluation

   2.3 VERSA singing MOS evaluation

3. VERSA Realtime Demonstration

4. Contact

In [None]:
!pip install numpy==1.23.5 tensorflow==2.12.0 jax==0.4.9 jaxlib==0.4.9 ml_dtypes==0.2.0 transformers
# IMPORTANT NOTE: because of a recent change in Colab's default packages, the session must be restarted for the pinned numpy (1.23.5) to take effect.
# Go to `Runtime` and select `Restart session`.

## 1. VERSA Installation

VERSA can be installed easily with `pip`. By default, most of the supported metrics are included (see the [Metric List](https://github.com/shinjiwlab/versa?tab=readme-ov-file#list-of-metrics) for the default set), while some other metrics are left as optional.


For detailed instructions on installing specific metrics, please refer to the installation scripts/guides available at https://github.com/shinjiwlab/versa/tree/main/tools
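
After restarting the session, it can help to confirm that the pinned packages from the first cell are the ones actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example, and only a subset of the pinned packages is checked.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Pins copied from the install cell at the top of this notebook.
pinned = {"numpy": "1.23.5", "tensorflow": "2.12.0", "jax": "0.4.9"}
for name, found in installed_versions(pinned).items():
    status = "ok" if found == pinned[name] else "differs from pin"
    print(f"{name}: installed={found}, pinned={pinned[name]} ({status})")
```

If a package is reported as differing from its pin, restarting the session and rerunning the check is the first thing to try.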

In [None]:
# It takes 3-5 minutes.
!git clone --depth 5 https://github.com/shinjiwlab/versa.git
%cd /content/versa
!pip install -e .

## 2. VERSA Examples with API

In this section, we demonstrate several examples using the VERSA scorer API, which serves as the primary interface for VERSA-supported metrics.

The first step involves downloading some prepared audio samples along with example configuration files.

In [None]:
!git clone https://github.com/ftshijt/versa_demo_egs.git

### 2.1 VERSA base evaluation

In this section, we showcase some basic evaluations supported by VERSA.

First, let's review the configuration used for the evaluation. We offer two versions: a CPU version and a GPU version, indicating which metrics require only a CPU and which can be accelerated with a GPU.

In [None]:
print("CPU metrics")

!cat /content/versa/versa_demo_egs/configs/codec_16k_cpu.yaml

print("***" * 20)


print("GPU metrics")
!cat /content/versa/versa_demo_egs/configs/codec_16k_gpu.yaml


We can then use the configuration to evaluate the speech signals. In this demo, we have prepared a few samples from the LibriSpeech development set (you can also try a real-time demo with customized audio later!).

In [None]:
# Audio Listening
import soundfile
from IPython.display import Audio, display

codec_dir = "/content/versa/versa_demo_egs/examples/normal_speech/codec"

for title, folder in [
    ("Ground Truth Audio (22050Hz) for Testing", "gt"),
    ("Target Audio (16000Hz) for Testing (Resynthesized with Encodec codec)", "encodec"),
]:
    print(title)
    for idx in (1, 2, 3):
        audio, sr = soundfile.read(f"{codec_dir}/{folder}/{idx}.wav")
        display(Audio(audio, rate=sr))




Scoring is simple and can be done with a single command.

- Note that audio files with different sampling rates will be automatically handled.
- We also support multi-processing with a job scheduling system to speed up the process. For more details, please visit [this link](https://github.com/shinjiwlab/versa?tab=readme-ov-file#usage).
- A reference signal is optional for scoring, but its absence will result in some metrics being skipped if they require it.
- We accept either a folder of WAV files or a Kaldi-style `wav.scp` file as input.
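
For readers unfamiliar with the Kaldi convention mentioned in the last point, a `wav.scp` file lists one utterance per line as `<utterance-id> <path-to-wav>`. The sketch below builds such a file from a folder. The helper name `write_wav_scp` and the choice of the file stem as utterance ID are assumptions for illustration; the `--io` value that VERSA expects for this input type is not shown in this notebook, so check the usage guide linked above.

```python
from pathlib import Path
import tempfile


def write_wav_scp(wav_dir, scp_path):
    """Write one '<utterance-id> <absolute wav path>' line per .wav file."""
    lines = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        # The file stem serves as the utterance ID (an illustrative choice).
        lines.append(f"{wav.stem} {wav.resolve()}")
    Path(scp_path).write_text("\n".join(lines) + "\n")
    return lines


# Demo on empty placeholder files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.wav", "2.wav", "3.wav"):
        (Path(tmp) / name).touch()
    entries = write_wav_scp(tmp, Path(tmp) / "wav.scp")
    print("\n".join(entries))
```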

In [None]:
! python versa/bin/scorer.py --score_config /content/versa/versa_demo_egs/configs/codec_16k_cpu.yaml \
    --gt /content/versa/versa_demo_egs/examples/normal_speech/codec/gt \
    --pred /content/versa/versa_demo_egs/examples/normal_speech/codec/encodec \
    --io dir \
    --output_file codec_base_cpu.json

The utterance-level results are stored in `codec_base_cpu.json`, and the aggregated results can be obtained with the provided `scripts/show_result.py` script.
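
Conceptually, the aggregation step averages each metric over all utterances. The sketch below illustrates that idea on in-memory records. It is not the implementation of the provided script, and the record layout, the `key` field, and the metric names are made up for the example.

```python
from statistics import mean


def aggregate(utterance_scores):
    """Average every numeric metric over a list of per-utterance score dicts."""
    collected = {}
    for scores in utterance_scores:
        for metric, value in scores.items():
            # Skip identifiers and other non-numeric fields (bool is excluded on purpose).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                collected.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in collected.items()}


# Made-up records for illustration only; these are not real VERSA outputs.
example = [
    {"key": "utt1", "metric_a": 1.0, "metric_b": 10.0},
    {"key": "utt2", "metric_a": 3.0, "metric_b": 20.0},
]
print(aggregate(example))
```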

In [None]:
!cat codec_base_cpu.json

!python /content/versa/scripts/show_result.py codec_base_cpu.json

Passing `--use_gpu true` to the scorer enables GPU acceleration for the metrics that support it.
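
The CPU and GPU invocations in this notebook differ only in the configuration file and this flag. When scripting several runs, the command can be assembled programmatically, as in the sketch below. The helper `build_scorer_command` is hypothetical; it only uses the flags that appear in the cells of this notebook.

```python
import shlex


def build_scorer_command(score_config, pred, output_file, gt=None, use_gpu=False, io="dir"):
    """Assemble the scorer invocation used in this notebook as an argument list."""
    cmd = [
        "python", "versa/bin/scorer.py",
        "--score_config", score_config,
        "--pred", pred,
        "--io", io,
        "--output_file", output_file,
    ]
    if gt is not None:
        # The reference is optional; metrics that require it are skipped without it.
        cmd += ["--gt", gt]
    if use_gpu:
        cmd += ["--use_gpu", "true"]
    return cmd


egs = "/content/versa/versa_demo_egs"
cmd = build_scorer_command(
    score_config=f"{egs}/configs/codec_16k_gpu.yaml",
    gt=f"{egs}/examples/normal_speech/codec/gt",
    pred=f"{egs}/examples/normal_speech/codec/encodec",
    output_file="codec_base_gpu.json",
    use_gpu=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the `/content/versa` directory.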

In [None]:
! python versa/bin/scorer.py --score_config /content/versa/versa_demo_egs/configs/codec_16k_gpu.yaml \
    --gt /content/versa/versa_demo_egs/examples/normal_speech/codec/gt \
    --pred /content/versa/versa_demo_egs/examples/normal_speech/codec/encodec \
    --io dir \
    --use_gpu true \
    --output_file codec_base_gpu.json

! cat codec_base_gpu.json

!python /content/versa/scripts/show_result.py codec_base_gpu.json

### 2.2 VERSA speaker evaluation

The reference signal can include samples from the same speakers, enabling speaker embedding analysis. VERSA is proud to integrate ESPnet-SPK, supporting more than 10 speaker embedding models!

- Reference:
  - [ESPnet-SPK](https://arxiv.org/abs/2401.17230)
  - [Available Speaker Embedding](https://huggingface.co/models?other=speaker-recognition&sort=trending&search=espnet)

For testing purposes, we have prepared three folders (with data also from LibriSpeech):
- spk1: target speaker
- spk1-other: other speech from the same target speaker
- spk2: another speaker
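
Speaker embedding analysis is commonly scored as the cosine similarity between the embedding of the reference and that of the test utterance: same-speaker pairs should score close to 1, different-speaker pairs noticeably lower. The sketch below shows the computation on synthetic vectors. Whether the configuration used here computes exactly this quantity is an assumption, and the random vectors and the dimension of 192 are stand-ins, not real ESPnet-SPK embeddings.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for embeddings (hypothetical dimension of 192).
rng = np.random.default_rng(0)
spk1 = rng.normal(size=192)
spk1_other = spk1 + 0.1 * rng.normal(size=192)  # small perturbation of the same vector
spk2 = rng.normal(size=192)                     # independent vector

print("same speaker     :", cosine_similarity(spk1, spk1_other))
print("different speaker:", cosine_similarity(spk1, spk2))
```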


In [None]:
# Audio Listening
import soundfile
from IPython.display import Audio, display

spk_dir = "/content/versa/versa_demo_egs/examples/spk"

for title, folder in [
    ("Target Speaker (22050Hz) for Testing", "spk1"),
    ("Other Speech from Target Speaker (22050Hz) for Testing", "spk1-other"),
    ("Other Speaker (22050Hz) for Testing", "spk2"),
]:
    print(title)
    for idx in (1, 2, 3):
        audio, sr = soundfile.read(f"{spk_dir}/{folder}/{idx}.wav")
        display(Audio(audio, rate=sr))

In [None]:
# Test if the same speaker can be identified
! python versa/bin/scorer.py --score_config /content/versa/versa_demo_egs/configs/codec_16k_spk.yaml \
    --gt /content/versa/versa_demo_egs/examples/spk/spk1 \
    --pred /content/versa/versa_demo_egs/examples/spk/spk1-other \
    --io dir \
    --use_gpu true \
    --output_file spk_base.json

! cat spk_base.json

!python /content/versa/scripts/show_result.py spk_base.json

In [None]:
# Test if different speakers can be identified
! python versa/bin/scorer.py --score_config /content/versa/versa_demo_egs/configs/codec_16k_spk.yaml \
    --gt /content/versa/versa_demo_egs/examples/spk/spk1 \
    --pred /content/versa/versa_demo_egs/examples/spk/spk2 \
    --io dir \
    --use_gpu true \
    --output_file spk_base_different.json

! cat spk_base_different.json

!python /content/versa/scripts/show_result.py spk_base_different.json

### 2.3 VERSA singing MOS evaluation

In this subsection, we demonstrate the use of two types of Mean Opinion Score (MOS) evaluations applied to singing voice evaluation, which is distinct from normal speech.

References
  - [UTMOS](https://arxiv.org/abs/2204.02152)
  - [UTMOS implementation](https://github.com/tarepan/SpeechMOS)
  - [SingMOS](https://arxiv.org/abs/2406.10911)
  - [SingMOS implementation](https://github.com/South-Twilight/SingMOS)

We use singing samples from the ACE-KiSing paper. Three samples are compared: baseline, baseline with proposed corpora, and ground truth.

Reference:
- [ACE-Opencpop and ACE-KiSing](https://arxiv.org/pdf/2401.17619)
- [KiSing](https://arxiv.org/abs/2205.04029)
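
As background, a human MOS is the arithmetic mean of listener ratings, usually on a 1-5 scale, and models such as UTMOS and SingMOS predict that score directly from audio. The sketch below computes a MOS with an approximate 95% confidence interval from a list of ratings. The ratings are made up, and the normal approximation (`z = 1.96`) is a simplifying assumption.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    score = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return score, half_width


# Made-up listener ratings on a 1-5 scale, for illustration only.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
score, half_width = mos_with_ci(ratings)
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```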

In [None]:
# Audio Listening
import soundfile
from IPython.display import Audio, display

print("Singing voice synthesis baseline")
audio, sr = soundfile.read("/content/versa/versa_demo_egs/examples/sing/visinger2-kising.wav")
display(Audio(audio, rate=sr))
print("Singing voice synthesis baseline with proposed corpora")
audio, sr = soundfile.read("/content/versa/versa_demo_egs/examples/sing/visinger2-ace-kising.wav")
display(Audio(audio, rate=sr))
print("Singing voice synthesis ground truth")
audio, sr = soundfile.read("/content/versa/versa_demo_egs/examples/sing/ground_truth.wav")
display(Audio(audio, rate=sr))

In [None]:
# Scoring with the model
! python versa/bin/scorer.py --score_config /content/versa/versa_demo_egs/configs/codec_16k_sing.yaml \
    --pred /content/versa/versa_demo_egs/examples/sing \
    --io dir \
    --output_file sing_base.json

! cat sing_base.json

## 3. VERSA Realtime Demonstration

Let's try a real-time demonstration by recording your own voice.

First, record your voice into the system. To do that, we first define an audio recording function.
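
One step of the recording function in the next cell deserves a note: the WAV data comes from ffmpeg through a pipe, so the size field in bytes 4:8 of the RIFF header is presumably not final and the function overwrites it with the real size (total length minus 8). The sketch below reproduces that patch in isolation on a tiny WAV built in memory. The helper name `patch_riff_size` and the placeholder value are assumptions for illustration.

```python
import io
import struct
import wave


def patch_riff_size(wav_bytes):
    """Overwrite bytes 4:8 with the true RIFF chunk size (total length minus 8)."""
    size = struct.pack("<I", len(wav_bytes) - 8)  # unsigned 32-bit, little-endian
    return wav_bytes[:4] + size + wav_bytes[8:]


# Build a tiny valid WAV in memory (10 ms of silence at 16 kHz, 16-bit mono).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)
original = buf.getvalue()

# Simulate a streamed header by replacing the size field with a placeholder.
streamed = original[:4] + b"\xff\xff\xff\xff" + original[8:]

fixed = patch_riff_size(streamed)
print(fixed == original)
```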

In [None]:
# Credit to https://colab.research.google.com/drive/1Z6VIRZ_sX314hyev3Gm5gBqvm1wQVo-a#scrollTo=9ol3xVNL6gy3

!pip install ffmpeg-python

from IPython.display import HTML, Audio, display
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});

</script>
"""

def get_audio():
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])

  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)

  # The WAV was streamed through a pipe, so rewrite the size field in its header.
  riff_chunk_size = len(output) - 8
  # Break up the chunk size into four little-endian bytes, held in b.
  q = riff_chunk_size
  b = []
  for i in range(4):
      q, r = divmod(q, 256)
      b.append(r)

  # Replace bytes 4:8 of the ffmpeg output with the actual size of the RIFF chunk.
  riff = output[:4] + bytes(b) + output[8:]

  sr, audio = wav_read(io.BytesIO(riff))

  return audio, sr


Now, let's start recording. Recording starts right after you execute the cell. Press the button to stop and save the recording.

In [None]:
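audio, sr = get_audio()

import os

import librosa
import soundfile

# scipy returns integer PCM samples; scale them to [-1, 1] so that soundfile
# does not clip the signal when it writes PCM_16
if np.issubdtype(audio.dtype, np.integer):
    audio = audio.astype(np.float32) / (np.iinfo(audio.dtype).max + 1)

# Resample to 16 kHz
test_audio = librosa.resample(np.array(audio, dtype=np.float32), orig_sr=sr, target_sr=16000)

os.makedirs("test_audio", exist_ok=True)
soundfile.write('test_audio/test.wav', test_audio, 16000, 'PCM_16')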

In [None]:
# Scoring with the model
! python versa/bin/scorer.py --score_config /content/versa/versa_demo_egs/configs/codec_16k_sing.yaml \
    --pred test_audio \
    --io dir \
    --output_file realtime_demo.json

! cat realtime_demo.json

## 4. Contact

VERSA is expanding rapidly. To contribute, propose or request new metrics, or report a bug, please submit an issue to [our repo](https://github.com/shinjiwlab/versa.git) or contact Jiatong Shi (jiatongs@andrew.cmu.edu) or Shinji Watanabe (shinjiw@ieee.org).