# mini-project: Automatic Speech Recognition (2/2)
We split the mini-project into two notebooks. 

This is ***2/2*** of the mini-project. Make sure to also complete and submit ***LAS_ASR.ipynb***.

***Credits***: ELENE 6820, Xilin Jiang (xj2289 at columbia) and Prof. Nima Mesgarani.

Rewritten from Yi Luo's ASR homework in 2021.

I acknowledge SpeechBrain [Tutorials](https://speechbrain.github.io/tutorial_basics.html) for reference and ChatGPT for rephrasing and testing.


---



Hope you had a good time with LAS. 

In this notebook, we will run experiments on a self-supervised learning (SSL) model to assess its ability to recognize words. 

We will revisit the **HuBERT** model, which was briefly mentioned in Lecture 6 (Speech Signal Representation).

You don't need to re-download the dataset. But if you are using Colab, you need to re-mount the folder from Google Drive and re-install SpeechBrain.


In [None]:
# from google.colab import drive
# drive.mount('/content/drive/')
# %cd drive/MyDrive/HW6_part2/

In [None]:
# !pip install speechbrain
# !pip install transformers

In [None]:
import os
import re
import torch
import torch.nn as nn

import librosa
import datetime
import speechbrain as sb
import matplotlib.pyplot as plt
from hyperpyyaml import load_hyperpyyaml

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
!nvidia-smi

## 1. Configuration

Load hyperparameters and set the experiment time stamp. 

***You need to reload*** `hparams/hubert.yaml` every time you change hyperparameters. 

In [None]:
HPARAM_FILE = 'hparams/hubert.yaml'

time_stamp = datetime.datetime.now().strftime('%Y-%m-%d+%H-%M-%S')
print(f'Experiment Time Stamp: {time_stamp}')
argv = [HPARAM_FILE, '--time_stamp', time_stamp]

hparam_file, run_opts, overrides = sb.parse_arguments(argv)

with open(HPARAM_FILE) as f:
    hparams = load_hyperpyyaml(f, overrides)

## 2. Models

We will use HuBERT as our encoder for speech recognition. Other speech SSL models can be used in a similar way.

Please read the paper: https://arxiv.org/abs/2106.07447 and the lecture slides for how HuBERT learns speech representations in a self-supervised manner.


We can get a pretrained HuBERT model from the [Transformers](https://huggingface.co/docs/transformers/index) library by 🤗 Hugging Face. 

🤗 Hugging Face is the 'GitHub' for ML models. You can upload and download models just like code.


The [checkpoint](https://huggingface.co/facebook/hubert-large-ls960-ft) we will be using was first pretrained on the 60,000-hour [Libri-Light](https://arxiv.org/abs/1912.07875) dataset without supervision and then finetuned on 960 hours of LibriSpeech for ASR. In this notebook, we will build an ASR system from HuBERT embeddings, rather than from MFCC or shallow CNN encodings.

***Note***: In this homework, you are ***not*** pretraining or finetuning HuBERT. The HuBERT checkpoint parameters are already optimized and should be fixed. You only need to implement and optimize the missing part in the ASR pipeline.

In [None]:
import transformers
from transformers import HubertModel

# Initialize pretrained HuBERT. 
# Ignore the warning 'Some weights of the model checkpoint ... were not used'.
# It refers to the checkpoint's finetuned output head, which we do not load or use.
encoder = HubertModel.from_pretrained(hparams['hubert_path']).to(device)

Let's encode a sample audio with HuBERT. Our HuBERT encoder has [24](https://huggingface.co/facebook/hubert-large-ls960-ft/blob/main/config.json#L66) hidden layers, so there are 25 levels of encoded features we can use. We will observe different performance if we choose the embeddings from different levels. By default, HuBERT only returns the features of the last layer. Set `output_hidden_states=True` to get them all.

Pay attention to HuBERT feature dimension. Can you see anything meaningful from HuBERT features? (Probably not.)
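To get a feel for the time axis of these features, the sketch below estimates how many frames the encoder produces for a waveform of a given length. It assumes the convolutional front end uses the kernel sizes and strides commonly listed for wav2vec 2.0 / HuBERT checkpoints and a 16 kHz sample rate; both are assumptions here, so check `config.json` and `hparams['sample_rate']` before relying on the numbers. `num_frames` is a hypothetical helper, not part of Transformers.

```python
# Assumed front-end configuration (verify against the checkpoint's config.json).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Output length of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Standard length formula for a convolution without padding or dilation.
        length = (length - kernel) // stride + 1
    return length


sample_rate = 16000  # assumed; the notebook reads it from hparams['sample_rate']
for seconds in (1, 5, 10):
    print(f'{seconds} s -> {num_frames(seconds * sample_rate)} frames')
```

Under these assumptions the total stride is 320 samples, so the encoder emits roughly one frame every 20 ms. You can compare the result with `hubert_last_feats.shape[1]` for the sample audio.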

In [None]:
sr = hparams['sample_rate']
sample_audio_path = 'data/LibriSpeech/train-clean-5/1088/134315/1088-134315-0000.flac'
sample_audio = librosa.load(sample_audio_path, sr=sr)[0]
sample_audio = torch.tensor(sample_audio).unsqueeze(dim=0).to(device)

from IPython.display import Audio
Audio(sample_audio.cpu(), rate=sr)

In [None]:
with torch.no_grad():
    hubert_feats = encoder(sample_audio, output_hidden_states=True).hidden_states
    hubert_last_feats = hubert_feats[-1]
    hubert_middle_feats = hubert_feats[12]
    
hubert_dim = hubert_last_feats.shape[-1] 
    
plt.figure(figsize=(10, 10))
plt.subplot(121)
plt.imshow(hubert_last_feats[0].cpu().detach().numpy().T)
plt.xlabel('T')
plt.title('HuBERT last-layer features')
plt.subplot(122)
plt.imshow(hubert_middle_feats[0].cpu().detach().numpy().T)
plt.xlabel('T')
plt.title('HuBERT 12th-layer features')

Next, we load the tokenizer for HuBERT. The vocabulary contains 32 characters (26 English letters + 6 special symbols). 

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(hparams['hubert_path'])
print(tokenizer.get_vocab())
vocab_size = len(tokenizer.get_vocab())

## What's Next?

Recall the task of Speech Recognition. We need to build a system to map a waveform to words.

Right now, we have an encoder mapping a waveform to HuBERT features, $\mathbb{R}^{1 \times T} \to \mathbb{R}^{C \times N}$, where $T$ is the number of samples, $C$ is the feature dimension, and $N$ is the number of frames. We also have a tokenizer that decodes words from the token indices of all frames, $\{0, 1, \dots, V-1\}^{N}$ with $V=32$.

We have the frontend and the backend. The only missing part in the middle is to predict the most likely character from the HuBERT features. 

You need to implement a ***linear*** CTC `Predictor` $\mathbb{R}^{C \times N} \to \mathbb{R}^{V \times N}$ to estimate token probabilities. 

I wrote a `Decoder` which takes the argmax of your prediction.
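The argmax gives one token index per frame, which is usually much longer than the transcript. With CTC, a frame-level sequence is turned into text by merging consecutive repeats and then dropping the blank token. The sketch below illustrates that rule in plain Python. The toy vocabulary, the blank index 0 and the `|` word delimiter are assumptions for illustration only; the real mapping comes from `tokenizer.get_vocab()`, and `tokenizer.decode` is what the notebook actually calls.

```python
from itertools import groupby


def ctc_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then remove the blank id."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


def ids_to_text(ids, id_to_char, word_delimiter='|'):
    """Map ids to characters and turn the word delimiter into spaces."""
    return ''.join(id_to_char[i] for i in ids).replace(word_delimiter, ' ').strip()


# Hypothetical toy vocabulary, not the HuBERT one.
toy_vocab = {0: '<pad>', 1: '|', 2: 'H', 3: 'I', 4: 'A'}

# One id per frame; the blank between the two 4s keeps them as separate letters.
frames = [0, 2, 2, 0, 3, 3, 1, 1, 0, 4, 0, 4, 0]
collapsed = ctc_collapse(frames)
text = ids_to_text(collapsed, toy_vocab)
print(collapsed, repr(text))
```

Note that a repeated letter only survives the collapse if a blank separates its two occurrences, which is why CTC needs the blank symbol in the first place.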

**TODO (8/10)**. Implement `Predictor`.

In [None]:
class Predictor(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super(Predictor, self).__init__()
        raise NotImplementedError
        
    def forward(self, x):
        raise NotImplementedError
        

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, preds):
        '''
        preds (B, N, V)
        '''
        ids = torch.argmax(preds, dim=-1)
        return ids
    

## 3. Data

We encode words with a new tokenizer in the `text_pipeline`. The rest of the data pipeline is unchanged.

We use the full `train-clean-5` folder for training and the first 100 utterances from `dev-clean-2` and `test-clean` for validation and testing.
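As a rough picture of what `text_pipeline` yields, the sketch below encodes a transcript character by character and builds the `tokens_bos` and `tokens_eos` variants. Everything here is a stand-in: `TOY_VOCAB`, `encode_chars` and the index values are hypothetical, and the real ids come from the pretrained tokenizer and from `hparams['bos_index']` and `hparams['eos_index']`.

```python
# Hypothetical toy vocabulary; print tokenizer.get_vocab() to see the real one.
TOY_VOCAB = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, '|': 4}
TOY_VOCAB.update({ch: i + 5 for i, ch in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ')})


def encode_chars(words, vocab, word_delimiter='|', unk='<unk>'):
    """Encode a transcript as one id per character, with spaces as delimiters."""
    chars = words.upper().replace(' ', word_delimiter)
    return [vocab.get(ch, vocab[unk]) for ch in chars]


tokens_list = encode_chars('hi all', TOY_VOCAB)

# Assumed indices for this toy example only.
bos_index, eos_index = TOY_VOCAB['<s>'], TOY_VOCAB['</s>']
tokens_bos = [bos_index] + tokens_list  # begin-of-sentence prepended
tokens_eos = tokens_list + [eos_index]  # end-of-sentence appended

print(tokens_list)
print(tokens_bos)
print(tokens_eos)
```

Only `tokens` (the plain id sequence) is used as the CTC target in this notebook; the `bos` and `eos` variants are kept so the data pipeline matches the previous notebook.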

In [None]:
bos_index = hparams['bos_index']
eos_index = hparams['eos_index']
@sb.utils.data_pipeline.takes('words')
@sb.utils.data_pipeline.provides('words', 'tokens_list', 'tokens_bos', 'tokens_eos', 'tokens', 'attention_mask')
def text_pipeline(words):
    yield words
    encoded = tokenizer(words)
    tokens_list, attention_mask = encoded['input_ids'], encoded['attention_mask']
    yield tokens_list
    tokens_bos = torch.LongTensor([bos_index] + (tokens_list))
    yield tokens_bos
    # we use same eos and bos indexes as in pretrained model
    tokens_eos = torch.LongTensor(tokens_list + [eos_index])
    yield tokens_eos
    tokens = torch.LongTensor(tokens_list)
    yield tokens
    attention_mask = torch.LongTensor(attention_mask)
    yield attention_mask

In [None]:
from speechbrain.dataio.dataset import DynamicItemDataset

# Create Train Dataset
train_manifest_path = hparams['train_manifest_path']
print(f'Creating training set from {train_manifest_path}')
train_set = DynamicItemDataset.from_json(train_manifest_path)
train_set.add_dynamic_item(
    sb.dataio.dataio.read_audio, takes='file_path', provides='signal'
)
train_set = train_set.filtered_sorted(sort_key='length', select_n=hparams['n_train'])
train_set.add_dynamic_item(text_pipeline)
train_set.set_output_keys(
    ['id', 'signal', 'words', 'tokens_list', 'tokens_bos', 'tokens_eos', 'tokens', 'attention_mask']
)
    
# Create Valid Dataset
valid_manifest_path = hparams['valid_manifest_path']
print(f'Creating validation set from {valid_manifest_path}')
valid_set = DynamicItemDataset.from_json(valid_manifest_path)
valid_set.add_dynamic_item(
    sb.dataio.dataio.read_audio, takes='file_path', provides='signal'
)
valid_set = valid_set.filtered_sorted(sort_key='length', select_n=hparams['n_valid'])
valid_set.add_dynamic_item(text_pipeline)
valid_set.set_output_keys(
    ['id', 'signal', 'words', 'tokens_list', 'tokens_bos', 'tokens_eos', 'tokens', 'attention_mask']
)

# Create Test Dataset 
test_manifest_path = hparams['test_manifest_path'] 
print(f'Creating test set from {test_manifest_path}')
test_set = DynamicItemDataset.from_json(test_manifest_path)
test_set.add_dynamic_item(
    sb.dataio.dataio.read_audio, takes='file_path', provides='signal'
)
test_set = test_set.filtered_sorted(sort_key='length', select_n=hparams['n_test'])
test_set.add_dynamic_item(text_pipeline)
test_set.set_output_keys(
    ['id', 'signal', 'words', 'tokens_list', 'tokens_bos', 'tokens_eos', 'tokens', 'attention_mask']
)

## 4. Brain

**TODO (9/10)**. Complete the `compute_forward` function.

*Hint1*: Use the last-layer embedding of HuBERT. You can either set `output_hidden_states=False` and read `last_hidden_state`, or set `output_hidden_states=True` and take `hidden_states[-1]`.

*Hint2*: Don't update HuBERT. Wrap the HuBERT encoding in `with torch.no_grad()`.

In [None]:
class HubertASR(sb.core.Brain):
    
    def on_fit_start(self):
        super().on_fit_start()
        self.modules.encoder.feature_extractor._freeze_parameters()
        self.checkpointer.add_recoverable('predictor', self.modules.predictor)

    def compute_forward(self, batch, stage):
        """Forward computations from the waveform batches to the output probabilities."""
        batch = batch.to(self.device)
        wavs, wav_lens = batch.signal
                    
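        # HuBERT encode features
        feats = None
        
        # Estimate tokens
        preds = None
        # p_ctc is returned below and passed to self.hparams.ctc_cost in
        # compute_objectives; define it from preds.
        p_ctc = None
        raise NotImplementedError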

        # Compute outputs
        if stage != sb.Stage.TRAIN:
            hyps = self.modules.decoder(preds.detach())
        else:
            hyps = None

        return p_ctc, wav_lens, hyps

    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss (CTC) given predictions and targets."""

        (p_ctc, wav_lens, hyps,) = predictions

        ids = batch.id
        tokens_eos, tokens_eos_lens = batch.tokens_eos
        tokens, tokens_lens = batch.tokens
        attention_mask, _ = batch.attention_mask
        tokens = tokens.masked_fill(attention_mask.ne(1), -100)
        loss_ctc = self.hparams.ctc_cost(
            p_ctc, tokens, wav_lens, tokens_lens
        ).mean()
        loss = loss_ctc

        if stage != sb.Stage.TRAIN:
            # Decode token terms to words
            predicted_words = [
                self.tokenizer.decode(utt_seq).split(" ") 
                for utt_seq in hyps
            ]
            target_words = [wrd.split(" ") for wrd in batch.words]
            self.wer_metric.append(ids, predicted_words, target_words)
            self.cer_metric.append(ids, predicted_words, target_words)
            
        return loss

    def evaluate_batch(self, batch, stage):
        """Computations needed for validation/test batches"""
        with torch.no_grad():
            predictions = self.compute_forward(batch, stage=stage)
            loss = self.compute_objectives(predictions, batch, stage=stage)
        return loss.detach()

    def on_stage_start(self, stage, epoch):
        """Gets called at the beginning of each stage (train, valid or test)."""
        if stage != sb.Stage.TRAIN:
            self.wer_metric = self.hparams.wer_computer()
            self.cer_metric = self.hparams.cer_computer()

    def on_stage_end(self, stage, stage_loss, epoch):
        """Gets called at the end of each stage (train, valid or test)."""
        # Compute/store important stats
        stage_stats = {'loss': stage_loss}
        if stage == sb.Stage.TRAIN:
            self.train_stats = stage_stats
        else:
            stage_stats['WER'] = self.wer_metric.summarize('error_rate')
            stage_stats["CER"] = self.cer_metric.summarize("error_rate")

        if stage == sb.Stage.VALID:
            self.checkpointer.save_and_keep_only(
                meta={'WER': stage_stats['WER'], 'epoch': epoch},
                min_keys=['WER'],
            )

        elif stage == sb.Stage.TEST:
            with open(self.hparams.wer_file, 'w') as w:
                self.wer_metric.write_stats(w)
                
        print(f'Epoch {epoch}: ', stage, stage_stats) 

    def fit_batch(self, batch):
        outputs = self.compute_forward(batch, sb.Stage.TRAIN)
        loss = self.compute_objectives(outputs, batch, sb.Stage.TRAIN)
        loss.backward()
        if self.check_gradients(loss):
            self.optimizer.step()
        self.optimizer.zero_grad()
        self.optimizer_step += 1
        
        return loss.detach().cpu()


## 5. Experiments
Instantiate `HubertASR` with our models.

In [None]:
# Add encoder, predictor and decoder
modules = {
    'encoder': encoder,
    'predictor': Predictor(hubert_dim, vocab_size),
    'decoder': Decoder()
}

brain = HubertASR(
    modules, 
    hparams=hparams, 
    opt_class=hparams['opt_class'],
    run_opts=run_opts,
    checkpointer=hparams['checkpointer']
)

brain.tokenizer = tokenizer

**TODO (10/10)**. Train and evaluate the model.

You need to reach a ***WER $\leq 3$*** (in percent, as reported by SpeechBrain) on the test set for full points.

Your training loss may increase within an epoch, because utterances are sorted by length and longer utterances produce larger loss.
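For reference, word error rate counts the substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal pure-Python version of that definition, pooled over a set of utterances and expressed in percent. It is an illustration only, not SpeechBrain's implementation, and `edit_distance` and `word_error_rate` are hypothetical helper names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """WER in percent, pooled over all utterances."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return 100.0 * errors / total


refs = ['THE CAT SAT ON THE MAT', 'HELLO WORLD']
hyps = ['THE CAT SAT ON A MAT', 'HELLO WORD']
wer = word_error_rate(refs, hyps)
print(f'WER = {wer:.1f}%')
```

In this toy case there are two word errors over eight reference words. Passing character lists instead of word lists to `edit_distance` gives the corresponding character error rate.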



In [None]:
brain.fit(
    epoch_counter=hparams['epoch_counter'],
    train_set=train_set,
    valid_set=valid_set,
    train_loader_kwargs=hparams['train_dataloader_opts'],
    valid_loader_kwargs=hparams['valid_dataloader_opts'],
)

In [None]:
brain.hparams.wer_file = os.path.join(
    hparams['save_folder'], 'wer.txt'
)
brain.evaluate(
    test_set,
    test_loader_kwargs=hparams['test_dataloader_opts'],
    min_key='WER'
)