# L635-Fall2025-Assignment 1: ESPnet Tutorial

In this tutorial, we will learn how to fine-tune a speech foundation model using ESPnet-EZ.

Main references:
- [ESPnet repository](https://github.com/espnet/espnet)
- [ESPnet documentation](https://espnet.github.io/espnet/)
- [ESPnet-EZ repo](https://github.com/espnet/espnet/tree/master/espnetez)

## Important Notes
- Please submit PDF files of your completed notebooks to Canvas. You can print the notebook using `File -> Print` in the menu bar.

## Acknowledgement
- This homework is adapted from the ESPnet online demos and tutorials.

## Install ESPnet
- We use a temporary version of ESPnet for this assignment to avoid compatibility issues. You may see some dependency errors during installation; these are safe to ignore for now.


In [1]:
!wget https://github.com/Fhrozen/espnet/archive/refs/heads/pr-numpy.zip
!unzip pr-numpy.zip

Streaming output truncated to the last 5000 lines.
  inflating: espnet-pr-numpy/espnet2/diar/decoder/linear_decoder.py  
  inflating: espnet-pr-numpy/espnet2/diar/espnet_model.py  
  inflating: espnet-pr-numpy/espnet2/diar/label_processor.py  
   creating: espnet-pr-numpy/espnet2/diar/layers/
 extracting: espnet-pr-numpy/espnet2/diar/layers/__init__.py  
  inflating: espnet-pr-numpy/espnet2/diar/layers/abs_mask.py  
  inflating: espnet-pr-numpy/espnet2/diar/layers/multi_mask.py  
  inflating: espnet-pr-numpy/espnet2/diar/layers/tcn_nomask.py  
   creating: espnet-pr-numpy/espnet2/diar/separator/
 extracting: espnet-pr-numpy/espnet2/diar/separator/__init__.py  
  inflating: espnet-pr-numpy/espnet2/diar/separator/tcn_separator_nomask.py  
   creating: espnet-pr-numpy/espnet2/enh/
 extracting: espnet-pr-numpy/espnet2/enh/__init__.py  
  inflating: espnet-pr-numpy/espnet2/enh/abs_enh.py  
   creating: espnet-pr-numpy/espnet2/enh/decoder/
 extracting: espnet-pr-numpy/espnet2/e

In [2]:
!cd espnet-pr-numpy && pip install .

Processing /content/espnet-pr-numpy
  Preparing metadata (setup.py) ... done
Collecting g2p_en@ git+https://github.com/espnet/g2p.git@master (from espnet==202506)
  Cloning https://github.com/espnet/g2p.git (to revision master) to /tmp/pip-install-j09u5ran/g2p-en_ddea8b1da65d4eedb58653152e45e244
  Running command git clone --filter=blob:none --quiet https://github.com/espnet/g2p.git /tmp/pip-install-j09u5ran/g2p-en_ddea8b1da65d4eedb58653152e45e244
  Resolved https://github.com/espnet/g2p.git to commit 053dfa94811efc6370bbdb0ff9777be4a5cb0b84
  Preparing metadata (setup.py) ... done
Collecting ctc-segmentation@ git+https://github.com/espnet/ctc-segmentation.git@9b9ea1d (from espnet==202506)
  Cloning https://github.com/espnet/ctc-segmentation.git (to revision 9b9ea1d) to /tmp/pip-install-j09u5ran/ctc-segmentation_f4251a1cedec428ba208172abcfb6559
  Running command git clone --filter=blob:none --quiet https://github.com/espnet/ctc-segmentation.git /tmp/pip-install-

In [3]:
!pip install espnet-model-zoo # for downloading pre-trained models
!apt install ffmpeg # for audio file processing
!pip install ipywebrtc notebook # for real-time recording


Collecting espnet-model-zoo
  Downloading espnet_model_zoo-0.1.7-py3-none-any.whl.metadata (10 kB)
Collecting g2p-en@ git+https://github.com/espnet/g2p.git@master (from espnet->espnet-model-zoo)
  Cloning https://github.com/espnet/g2p.git (to revision master) to /tmp/pip-install-i241f35g/g2p-en_86aadfe387f74e2b8156c422360d8065
  Running command git clone --filter=blob:none --quiet https://github.com/espnet/g2p.git /tmp/pip-install-i241f35g/g2p-en_86aadfe387f74e2b8156c422360d8065
  Resolved https://github.com/espnet/g2p.git to commit 053dfa94811efc6370bbdb0ff9777be4a5cb0b84
  Preparing metadata (setup.py) ... done
Collecting ctc-segmentation@ git+https://github.com/espnet/ctc-segmentation.git@9b9ea1d (from espnet->espnet-model-zoo)
  Cloning https://github.com/espnet/ctc-segmentation.git (to revision 9b9ea1d) to /tmp/pip-install-i241f35g/ctc-segmentation_8f22c73461974587a97686756d080bb5
  Running command git clone --filter=blob:none --quiet https://github.com/espnet/ctc-segm

### Import the dependencies and check the state of installation

In [7]:
!pip install datasets==3.6.0

Collecting datasets==3.6.0
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.5/491.5 kB 11.2 MB/s eta 0:00:00
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
Successfully installed datasets-3.6.0


In [1]:
import torch
import datasets
# ESPnet wrapper that simplifies integration. If this import fails,
# click Runtime -> Restart Session and rerun from the beginning.
import espnetez as ez
import numpy as np
import librosa
from espnet2.bin.s2t_inference import Speech2Text # Core ESPnet module for pre-trained models

print("Installation success!")

Failed to import Flash Attention, using ESPnet default: No module named 'flash_attn'


  @torch.cuda.amp.autocast(enabled=False)
  @torch.cuda.amp.autocast(enabled=False)


Installation success!


## Data Processing
For this tutorial, we will use the [FLEURS](https://arxiv.org/abs/2205.12446) dataset from HuggingFace: https://huggingface.co/datasets/google/fleurs

FLEURS is a 102-language multilingual speech dataset, supporting tasks such as Automatic Speech Recognition (ASR), Speech Translation (ST), and Language Identification (LID).

While the total size of FLEURS is relatively large at ~1000 hours of training data, each individual language only has 7-10 hours of audio.

For this tutorial, we will focus on monolingual ASR for one of the 102 languages.

### Data Downloading
We will first download the data for one language of FLEURS. FLEURS identifies each language by its ISO2 language code and locale. For example, American English is `en_us`.

**We will use English for the first fine-tuning experiment.** You will have the opportunity to try a different language later on in the assignment.

If you want to download the data for another language, you can map the language name to its ISO2 code using Table 9 in the FLEURS paper: https://arxiv.org/pdf/2205.12446. You can then find the matching language+region combination in the HuggingFace data previewer: https://huggingface.co/datasets/google/fleurs

(When prompted to run custom code to download the data, enter `y`. As the prompt itself notes, passing `trust_remote_code=True` to `load_dataset` skips this prompt.)

In [2]:
fleurs_language = 'en_us' # fleurs language codes are in the <language code>_<region code> format.
fleurs_hf = datasets.load_dataset("google/fleurs", fleurs_language)

The repository for google/fleurs contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/google/fleurs.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


data/en_us/audio/train.tar.gz:   0%|          | 0.00/1.38G [00:00<?, ?B/s]

data/en_us/audio/dev.tar.gz:   0%|          | 0.00/171M [00:00<?, ?B/s]

data/en_us/audio/test.tar.gz:   0%|          | 0.00/290M [00:00<?, ?B/s]

train.tsv: 0.00B [00:00, ?B/s]

dev.tsv: 0.00B [00:00, ?B/s]

test.tsv: 0.00B [00:00, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

### Inspect the data

In [3]:
fleurs_hf['train'][0]

{'id': 903,
 'num_samples': 108800,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/be467d88ba270014363a9d0aaae3893b4701a2710e0a55c6091a6d4fa56a9d84/10004088536354799741.wav',
 'audio': {'path': 'train/10004088536354799741.wav',
  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         -3.15904617e-06, -3.03983688e-06, -3.27825546e-06]),
  'sampling_rate': 16000},
 'transcription': 'a tornado is a spinning column of very low-pressure air which sucks the surrounding air inward and upward',
 'raw_transcription': 'A tornado is a spinning column of very low-pressure air, which sucks the surrounding air inward and upward.',
 'gender': 1,
 'lang_id': 19,
 'language': 'English',
 'lang_group_id': 0}

In [4]:
from IPython.display import Audio, display
display(Audio(fleurs_hf['train'][0]['audio']['array'], rate=16000))
print(fleurs_hf['train'][0]['transcription'])

a tornado is a spinning column of very low-pressure air which sucks the surrounding air inward and upward
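
As a quick sanity check on the fields printed above, the length of a clip follows from `num_samples` and `sampling_rate`. The helper names below are illustrative and are not part of ESPnet or `datasets`; treat this as a minimal sketch.

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of a clip in seconds."""
    return num_samples / sampling_rate


def total_hours(sample_counts, sampling_rate: int = 16000) -> float:
    """Total audio length in hours for an iterable of per-clip sample counts."""
    return sum(sample_counts) / sampling_rate / 3600


# Values copied from the first training example printed above.
first_clip = duration_seconds(num_samples=108800, sampling_rate=16000)
print(f"First clip: {first_clip:.1f} s")
```

On the loaded dataset, `total_hours(fleurs_hf['train']['num_samples'])` should give the size of the training split in hours, assuming every clip is sampled at 16 kHz.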


## Pretrained Model
In low-resource settings, training a model from scratch is unlikely to lead to good results. So instead, we will fine-tune a pre-trained foundation model.

We will use the base version of [OWSM 3.1](https://arxiv.org/pdf/2401.16658), an open-source speech foundation model trained on 180K hours of multilingual ASR and ST data.

### Downloading
Since it needs to support many language varieties, OWSM uses ISO3 for the language IDs. The ISO3 code for your language of choice can also be found in Table 9 in the FLEURS paper: https://arxiv.org/pdf/2205.12446
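
Because FLEURS and OWSM use different code systems, it can help to keep the two in one place. The sketch below uses a hypothetical lookup table and helper (`FLEURS_TO_ISO3`, `owsm_lang_symbol`); only the `en_us` to `eng` pair comes from this notebook, and any other row should be checked against Table 9 of the FLEURS paper.

```python
# Hypothetical lookup table: FLEURS config name -> ISO3 code used by OWSM.
# Only "en_us" -> "eng" comes from this notebook; verify the other rows
# against Table 9 of the FLEURS paper before relying on them.
FLEURS_TO_ISO3 = {
    "en_us": "eng",
    "fr_fr": "fra",
    "de_de": "deu",
}


def owsm_lang_symbol(fleurs_code: str) -> str:
    """Return the OWSM language token, e.g. '<eng>', for a FLEURS config."""
    try:
        iso3 = FLEURS_TO_ISO3[fleurs_code]
    except KeyError:
        raise KeyError(f"No ISO3 mapping recorded for {fleurs_code!r}") from None
    return f"<{iso3}>"


print(owsm_lang_symbol("en_us"))
```

The returned string has the same form as the `lang_sym` value passed to `Speech2Text.from_pretrained` in this notebook.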

In [5]:
FINETUNE_MODEL="espnet/owsm_v3.1_ebf_base"
owsm_language="eng" # language code in ISO3

In [6]:
pretrained_model = Speech2Text.from_pretrained(
    FINETUNE_MODEL,
    lang_sym=f"<{owsm_language}>",
    beam_size=1,
    device='cuda'
)
torch.save(pretrained_model.s2t_model.state_dict(), 'original.pth')
pretrain_config = vars(pretrained_model.s2t_train_args)
tokenizer = pretrained_model.tokenizer
converter = pretrained_model.converter

Fetching 29 files:   0%|          | 0/29 [00:00<?, ?it/s]

data/token_list/bpe_unigram50000/bpe.mod(…):   0%|          | 0.00/1.04M [00:00<?, ?B/s]

exp/s2t_stats_raw_bpe50000/train/feats_s(…):   0%|          | 0.00/1.40k [00:00<?, ?B/s]

config.yaml: 0.00B [00:00, ?B/s]

acc.png:   0%|          | 0.00/33.1k [00:00<?, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

README.md: 0.00B [00:00, ?B/s]

backward_time.png:   0%|          | 0.00/42.2k [00:00<?, ?B/s]

RESULTS.md:   0%|          | 0.00/366 [00:00<?, ?B/s]

cer.png:   0%|          | 0.00/28.5k [00:00<?, ?B/s]

gpu_max_cached_mem_GB.png:   0%|          | 0.00/36.9k [00:00<?, ?B/s]

forward_time.png:   0%|          | 0.00/48.7k [00:00<?, ?B/s]

grad_norm.png:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

clip.png:   0%|          | 0.00/29.1k [00:00<?, ?B/s]

cer_ctc.png:   0%|          | 0.00/26.6k [00:00<?, ?B/s]

loss.png:   0%|          | 0.00/31.4k [00:00<?, ?B/s]

optim0_lr0.png:   0%|          | 0.00/30.6k [00:00<?, ?B/s]

loss_scale.png:   0%|          | 0.00/32.8k [00:00<?, ?B/s]

loss_ctc.png:   0%|          | 0.00/34.5k [00:00<?, ?B/s]

loss_att.png:   0%|          | 0.00/31.6k [00:00<?, ?B/s]

iter_time.png:   0%|          | 0.00/41.8k [00:00<?, ?B/s]

exp/s2t_train_s2t_ebf_conv2d_size384_e6_(…):   0%|          | 0.00/405M [00:00<?, ?B/s]

optim_step_time.png:   0%|          | 0.00/33.8k [00:00<?, ?B/s]

train_time.png:   0%|          | 0.00/40.3k [00:00<?, ?B/s]

wer.png:   0%|          | 0.00/46.4k [00:00<?, ?B/s]

train.log: 0.00B [00:00, ?B/s]

train.1.log: 0.00B [00:00, ?B/s]

train.2.log: 0.00B [00:00, ?B/s]

train.3.log: 0.00B [00:00, ?B/s]

meta.yaml:   0%|          | 0.00/422 [00:00<?, ?B/s]

## Setup Training
We first need to convert the HuggingFace data into a format that ESPnet can read. This can be easily done by defining a `data_info` dictionary that maps each field required for OWSM fine-tuning to a column in our dataset.

In [7]:
'''
pretrained_model -> the pre-trained model we downloaded earlier
tokenizer -> Tokenizes raw text into subwords
converter -> Converts subwords into integer IDs for model input
'''

def tokenize(text):
    return np.array(converter.tokens2ids(tokenizer.text2tokens(text)))
data_info = {
    "speech": lambda d: d['audio']['array'].astype(np.float32), # 1-D raw waveform
    "text": lambda d: tokenize(f"<{owsm_language}><asr><notimestamps> {d['transcription']}"), # tokenized text mapped to integer ids
    "text_prev": lambda d: tokenize("<na>"), # tokenized text of previous utterance for prompting, unused here
    "text_ctc": lambda d: tokenize(d['transcription']), # tokenized text mapped to integer ids for CTC loss, can be different from "text" depending on task
}
test_data_info = {
    "speech": lambda d: d['audio']['array'].astype(np.float32),
    "text": lambda d: tokenize(f"<{owsm_language}><asr><notimestamps> {d['transcription']}"),
    "text_prev": lambda d: tokenize("<na>"),
    "text_ctc": lambda d: tokenize(d['transcription']),
    "text_raw": lambda d: d['transcription'], # raw untokenized text as the reference
}
train_dataset = ez.dataset.ESPnetEZDataset(fleurs_hf['train'], data_info=data_info)
valid_dataset = ez.dataset.ESPnetEZDataset(fleurs_hf['validation'], data_info=data_info)
test_dataset = ez.dataset.ESPnetEZDataset(fleurs_hf['test'], data_info=test_data_info)
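
To make the role of the `data_info` mapping concrete, here is a toy version that runs without ESPnet. The whitespace tokenizer, the growing vocabulary, and `apply_data_info` are stand-ins invented for illustration; this sketch does not reproduce the internals of `ESPnetEZDataset` or the OWSM tokenizer.

```python
# Toy stand-ins so the sketch runs without ESPnet: a whitespace "tokenizer"
# and a vocabulary that grows as new tokens appear. These are assumptions
# for illustration, not the OWSM tokenizer or converter.
vocab = {}


def toy_tokenize(text):
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


toy_data_info = {
    "text": lambda d: toy_tokenize(f"<eng><asr><notimestamps> {d['transcription']}"),
    "text_ctc": lambda d: toy_tokenize(d["transcription"]),
    "text_raw": lambda d: d["transcription"],
}


def apply_data_info(example, data_info):
    """Build one item by calling every field function on the dataset row."""
    return {field: fn(example) for field, fn in data_info.items()}


row = {"transcription": "a tornado is a spinning column"}
item = apply_data_info(row, toy_data_info)
print(item["text"])      # task prefix followed by the transcript tokens
print(item["text_ctc"])  # transcript tokens only
```

The point to notice is that `text` carries the language and task prefix while `text_ctc` contains only the transcript, which mirrors the two fields defined in the cell above.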

Next, we need to define a function that passes our pre-trained model to ESPnet. This function doesn't do much here since our setup is simple, but it is required for more complex settings (such as LoRA fine-tuning).

In [8]:
# define model loading function
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def build_model_fn(args):
    model = pretrained_model.s2t_model
    model.train()
    print(f'Trainable parameters: {count_parameters(model)}')
    return model

### Training
Training requires tuning many hyper-parameters. Here is an initial config to start you off.

In [9]:
!gdown 1Hp4hgtgdt84i4Qd99hW5ZM5aV743xN1U
!mkdir -p config
!mv finetune.yaml config/finetune.yaml

Downloading...
From: https://drive.google.com/uc?id=1Hp4hgtgdt84i4Qd99hW5ZM5aV743xN1U
To: /content/finetune.yaml
  0% 0.00/987 [00:00<?, ?B/s]100% 987/987 [00:00<00:00, 5.89MB/s]


Before we begin training, we need to define where our model files and logs will be saved. We also need to override some of the settings used to pre-train the foundation model with our own settings.
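
The idea of overriding pre-training settings can be pictured as a dictionary merge in which fine-tuning values win. The function below is a hypothetical illustration with made-up values; it is not the implementation of `ez.config.update_finetune_config`, and whether that function merges nested keys in the same way is an assumption to check against the ESPnet-EZ source.

```python
import copy


def override_config(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` where keys from `overrides` take precedence.

    Nested dictionaries are merged recursively; other values are replaced.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = override_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Made-up values for illustration only; they are not the OWSM settings.
pretrain = {"max_epoch": 100, "optim_conf": {"lr": 1e-3, "weight_decay": 0.0}}
finetune = {"max_epoch": 1, "optim_conf": {"lr": 1e-5}}
print(override_config(pretrain, finetune))
```

Settings that the fine-tuning side does not mention (here `weight_decay`) are carried over from the pre-training side unchanged.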

In [10]:
EXP_DIR = "./exp/finetune"
STATS_DIR = "./exp/stats_finetune"
finetune_config = ez.config.update_finetune_config(
    's2t',
    pretrain_config,
    "./config/finetune.yaml"
)

# You can edit your config by changing the finetune.yaml file directly (but make sure you rerun this cell afterwards!)
# You can also change it programmatically like this
finetune_config['max_epoch'] = 1
finetune_config['num_iters_per_epoch'] = 500

Finally, we just need to pass our model, data, and configs to a trainer.

In [11]:
trainer = ez.Trainer(
    task='s2t',
    train_config=finetune_config,
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    build_model_fn=build_model_fn, # provide the pre-trained model
    data_info=data_info,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    ngpu=1
)

In [12]:
trainer.collect_stats() # collect audio/text length information to construct batches

/usr/bin/python3 /usr/local/lib/python3.12/dist-packages/colab_kernel_launcher.py -f /root/.local/share/jupyter/runtime/kernel-d46b9550-eb4f-43b9-adad-3129c1db55f8.json


Trainable parameters: 101182628
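
The parameter count printed above can be turned into a rough memory figure. This is a back-of-the-envelope sketch that assumes the weights are stored as 32-bit floats (4 bytes each); `model_size_mb` is an illustrative helper, not an ESPnet function.

```python
def model_size_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate size of the weights in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1e6


trainable = 101_182_628  # from the "Trainable parameters" line above
print(f"{model_size_mb(trainable):.0f} MB in float32")
print(f"{model_size_mb(trainable, bytes_per_parameter=2):.0f} MB in float16")
```

The float32 estimate is in line with the roughly 405 MB checkpoint file downloaded earlier. Training needs noticeably more memory than this, since gradients and optimizer state are kept alongside the weights.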


In [13]:
trainer.train() # every 100 steps takes ~1 min

/usr/bin/python3 /usr/local/lib/python3.12/dist-packages/colab_kernel_launcher.py -f /root/.local/share/jupyter/runtime/kernel-d46b9550-eb4f-43b9-adad-3129c1db55f8.json


Trainable parameters: 101182628


  with autocast(
  with autocast(False):
  with autocast(
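
When adjusting `max_epoch` and `num_iters_per_epoch`, a rough time budget helps. The sketch below uses the approximate rate of one minute per 100 steps noted in the training cell; the actual rate depends on the GPU and batch settings, and `estimated_minutes` is an illustrative helper.

```python
def estimated_minutes(max_epoch: int, iters_per_epoch: int,
                      minutes_per_100_steps: float = 1.0) -> float:
    """Rough wall-clock estimate for training, ignoring validation and saving."""
    total_steps = max_epoch * iters_per_epoch
    return total_steps / 100 * minutes_per_100_steps


print(estimated_minutes(max_epoch=1, iters_per_epoch=500))  # settings used above
print(estimated_minutes(max_epoch=3, iters_per_epoch=500))  # a hypothetical longer run
```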


## Inference

Here is a demo of how to perform inference, and how to load checkpoints.

In [14]:
utt_id, sample_test_utterance = test_dataset[0]  # avoid shadowing the built-in `id`

In [15]:
pretrained_model.s2t_model.cuda()
pretrained_model.device = 'cuda'

d = torch.load("original.pth")
pretrained_model.s2t_model.load_state_dict(d)
pred = pretrained_model(sample_test_utterance['speech'])
print('PREDICTED: ' + pred[0][0])
print('REFERENCE: ' + sample_test_utterance['text_raw'])

PREDICTED: <eng><asr><notimestamps> However, due to the slow communication channels, styles in the west could lag behind by twenty five to thirty years.
REFERENCE: however due to the slow communication channels styles in the west could lag behind by 25 to 30 year


### Inference with fine-tuned model

In [16]:
d = torch.load("./exp/finetune/1epoch.pth")
pretrained_model.s2t_model.load_state_dict(d)
pred = pretrained_model(sample_test_utterance['speech'])
print('PREDICTED: ' + pred[0][0])
print('REFERENCE: ' + sample_test_utterance['text_raw'])

PREDICTED: <eng><asr><notimestamps> however due to the slow communication channels styles in the west could lag behind by 25 to 30 years
REFERENCE: however due to the slow communication channels styles in the west could lag behind by 25 to 30 year
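
To complement a side-by-side reading of the two outputs, a word error rate (WER) can be computed after light text normalization. The sketch below is a plain-Python implementation under simple assumptions (special tokens in angle brackets are dropped, text is lowercased, punctuation other than hyphens is removed); it is not ESPnet's scoring pipeline, and `normalize` and `word_error_rate` are illustrative names.

```python
import re


def normalize(text: str) -> list[str]:
    """Drop <special> tokens, lowercase, strip punctuation, split into words."""
    text = re.sub(r"<[^<>]+>", " ", text)
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Strings copied from the two inference outputs above.
reference = "however due to the slow communication channels styles in the west could lag behind by 25 to 30 year"
pretrained_out = "<eng><asr><notimestamps> However, due to the slow communication channels, styles in the west could lag behind by twenty five to thirty years."
finetuned_out = "<eng><asr><notimestamps> however due to the slow communication channels styles in the west could lag behind by 25 to 30 years"

print(f"pre-trained WER: {word_error_rate(reference, pretrained_out):.3f}")
print(f"fine-tuned WER:  {word_error_rate(reference, finetuned_out):.3f}")
```

A single utterance is too little to draw conclusions from; running the same function over more items of `test_dataset` gives a more reliable picture.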


## ✅ Task 1
Now that you have performed inference with both the pre-trained model and your fine-tuned model, provide a qualitative analysis of the results. How do the outputs of the two models differ? Do you observe any stylistic differences in the transcriptions?

(Your answer here)

## ✅ Task 2


A good way to understand the capabilities of neural models is to play with them yourself. Record a brief audio clip of yourself speaking. How accurately does the model transcribe your speech?

Then, utter the same sentence, but vary some characteristics of your speech. For example, you can try speaking faster, in a different pitch, or in a different accent. How does the resulting transcription change?

Finally, if the model made any mistakes in its transcriptions, why do you think these errors occurred?

In [None]:
# You may need to run this cell twice before the recorder widget works
!rm -f recording.webm recording.wav
from ipywebrtc import AudioRecorder, CameraStream
from google.colab import output
output.enable_custom_widget_manager()
mic = AudioRecorder(stream=CameraStream(constraints={'audio': True,'video':False}))
mic # press the circle to record, press it again to stop recording

In [None]:
with open('recording.webm', 'wb') as out_f:
    out_f.write(mic.audio.value)
audio, sr = librosa.load('recording.webm', sr=16000) # ASR typically uses 16kHz audio
pretrained_model(audio)[0][0]

(Your answer here)