Original: https://colab.research.google.com/drive/196b1E_4SievYhi8mxdoS10i72K--ySaK

# CS753 2023 -- Assignment 2

This assignment is due on or before **11:59 pm on April 9, 2023**. The submission portal on Moodle will remain open until midnight on April 11th, with a 5% penalty for each additional day after the 9th. This assignment adds up to 25 points overall. It is a group assignment; you can form groups of 2-3 students.

## **Acknowledgements**

* All of Task 0's ASR-related code snippets have been borrowed from the ASR notebook in [CS224S's Assignment 4 at Stanford](https://web.stanford.edu/class/cs224s/assignments/a4/), which were, in turn, borrowed from the [SpeechBrain toolkit](https://github.com/speechbrain/speechbrain/).

* SpeechBrain models are downloaded from their host site on [Huggingface](https://huggingface.co/speechbrain).  

* The following [SpeechBrain tutorial](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=J6N0Fb51pFnZ) will give you a code walkthrough of how an ASR system is coded from scratch using this toolkit. 

## **Dataset** 

We will use *HarperValleyBank (HVB)* -- a publicly-available spoken dialog corpus. Click [here](https://arxiv.org/pdf/2010.13929.pdf) for more details about the HVB corpus, how it was collected and what it is annotated for.  

## **What to submit**

On Moodle, you will have to submit a text file `README.txt` with the answers requested in the tasks below, and a `.ipynb` file containing all your code. Name the notebook using the LDAP IDs of the team members delimited by `_`, e.g., `220022022_220022021.ipynb`.

## **Getting Started**
* Make a copy of this notebook in your personal Google Drive to make edits. 
* Change your runtime type (under "Runtime") to GPU. 
* Go through all the steps in Task 0 to get set up with the first ASR task. 
* **Important:** ASR training, as in Task 0, will take close to **1.5 hours** for two epochs. Keep this in mind when scheduling your runs. Consider saving the checkpoints from a training run if you want to reuse them for other experiments or for additional finetuning.


# Dependencies 

If you have any issues using `gdown`, the same data and config files are available directly via Google Drive [here](https://drive.google.com/drive/folders/1xQxvR9NRlwK-75KMd0i1y4Dy0fimTsM9).  

In [None]:
# setup
# !gdown 1oJh0U3g_bUx6UPX4xix2UHMVHeCE_H1y
!gdown 1_OXiLOL2RBsbdCb4WyQsLudYxzJxMDJr
!unzip -q hvb.zip
!mv content/data /content/
!rm -r /content/content

!gdown 1a0EGlsLbXnGn1xwZoSqT0tcdAQ1L2nfd # train.py
!gdown 1yCmjRbxXRxfEN5LXdnE1Zpl8ZOIzdrAO # train.yaml
!gdown 1KHmdcLVFI9ontvGmi5J6vfaropGYuKcr # inference.yaml

In [None]:
!pip install speechbrain -q

import speechbrain as sb
from speechbrain.pretrained import EncoderDecoderASR
import json
import torchaudio
import torch
from torch import nn
from tqdm import tqdm
from collections import Counter
from IPython.display import Audio
from scipy.io import wavfile

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Task 0: Evaluate a pretrained CRDNN model, further finetuned with HVB (5 points)

We will first load a [SpeechBrain CRDNN model pretrained on LibriSpeech](https://huggingface.co/speechbrain/asr-crdnn-rnnlm-librispeech). SpeechBrain has some utils with built-in options to source pre-trained models from a repository on HuggingFace. 

In Task 0, you will first load this pretrained CRDNN model from HuggingFace and use it for inference on the first 200 examples from `test_manifest.json` (included in the `hvb.zip` you downloaded earlier). You will then finetune this model on the HVB training dataset, re-evaluate it on the test examples and note the difference in performance.

In [None]:
crdnn = EncoderDecoderASR.from_hparams(
    source='speechbrain/asr-crdnn-rnnlm-librispeech',
    savedir='asr-crdnn-rnnlm-librispeech',
    run_opts={'device': device}  # falls back to CPU when no GPU is available
)

The manifests we prepared for use with SpeechBrain are JSON files with the following structure:
```
{
    "15748": {
        "wav": "/content/data/segments/15748.wav",
        "length": 1.86,
        "words": "WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT"
    },
    ...
}
```
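
As a quick sanity check on this format, a short sketch like the following summarizes a manifest. The `summarize_manifest` helper and the second toy entry are illustrative assumptions, not part of the provided code or data.

```python
import json

def summarize_manifest(manifest):
    """Return (number of utterances, total seconds, average words per utterance)."""
    num_utts = len(manifest)
    total_seconds = sum(entry["length"] for entry in manifest.values())
    total_words = sum(len(entry["words"].split()) for entry in manifest.values())
    avg_words = total_words / num_utts if num_utts else 0.0
    return num_utts, total_seconds, avg_words

# Toy manifest in the same shape as the real one (the second entry is made up).
toy_manifest = json.loads("""
{
    "15748": {"wav": "/content/data/segments/15748.wav", "length": 1.86,
              "words": "WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT"},
    "00001": {"wav": "/content/data/segments/00001.wav", "length": 0.64,
              "words": "THANK YOU"}
}
""")

num_utts, total_seconds, avg_words = summarize_manifest(toy_manifest)
print(num_utts, round(total_seconds, 2), avg_words)
```

Running the same helper on the real `test_manifest` gives a rough idea of how much audio an evaluation run will process.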

We first load them and define a function to batch them into a format that our `EncoderDecoderASR` object can ingest:

In [None]:
TEST_SIZE = 200 # for faster processing

with open('data/test_manifest.json', 'r') as f:
    test_manifest = json.load(f)
test_manifest = {
    k: v for k, v in list(test_manifest.items())[:TEST_SIZE]
}

def batchify(manifest, batch_size):
    keys = list(manifest.keys())
    wav_paths = list(map(lambda x: x['wav'], manifest.values()))
    iterable = zip(keys, wav_paths)
    num_examples = len(manifest)
    for i in range(0, num_examples, batch_size):
        batch_wavs = nn.utils.rnn.pad_sequence([
            torchaudio.load(path)[0].squeeze(0)
            for path in wav_paths[i:min(i + batch_size, num_examples)]
        ], batch_first=True)
        batch_keys = keys[i:min(i + batch_size, num_examples)]
        batch_wav_lens = torch.tensor([
            manifest[key]['length'] for key in batch_keys
        ])
        batch_wav_lens = batch_wav_lens / batch_wav_lens.max()
        yield batch_keys, batch_wavs, batch_wav_lens
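
The relative lengths that `batchify` yields tell the model which part of each padded row is real audio. Here is a minimal sketch with synthetic tensors, so no audio files are needed. The sample counts are arbitrary, and this sketch derives the lengths from sample counts, whereas `batchify` above uses the durations stored in the manifest.

```python
import torch
from torch import nn

# Three fake mono "waveforms" with different numbers of samples.
wavs = [torch.ones(4), torch.ones(2), torch.ones(3)]

# pad_sequence right-pads every waveform with zeros up to the longest one.
batch = nn.utils.rnn.pad_sequence(wavs, batch_first=True)

# Relative length = true length / longest length in the batch.
lengths = torch.tensor([float(w.shape[0]) for w in wavs])
rel_lens = lengths / lengths.max()

print(batch.shape)  # torch.Size([3, 4])
print(rel_lens)     # values 1.0, 0.5 and 0.75
```

The longest utterance in a batch always gets relative length 1.0, and shorter ones get the fraction of the row that holds real samples.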

Next, we feed our test examples through the pretrained ASR model:

In [None]:
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

def inference(model, test_manifest, batch_size=8):
    torch.cuda.empty_cache()
    pred_dict = {}
    num_batches = (len(test_manifest) + batch_size - 1) // batch_size  # ceiling division
    for keys, wavs, wav_lens in tqdm(batchify(test_manifest, batch_size), total=num_batches):
        transcriptions, _ = model.transcribe_batch(wavs.to(device), wav_lens.to(device))
        for key, transcription in zip(keys, transcriptions):
            pred_dict[key] = transcription
    return pred_dict

pred_dict = inference(crdnn, test_manifest)

## Q1: Evaluate WER of pretrained model predictions (2 points)

Check the word error rate on the first 200 test instances in `test_manifest.json`. Note that we want WERs, so we need to split our transcripts into lists of words. 

You don't need to implement anything new here. Just follow along and ensure you can run the code to obtain WER on the results you just generated. 
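
For intuition about what these utilities compute: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal pure-Python sketch with a hypothetical helper `simple_wer`. It is not SpeechBrain's implementation, it assumes a non-empty reference, and the hypothesis sentence is invented for illustration.

```python
def simple_wer(ref_words, hyp_words):
    """Word error rate in percent, via Levenshtein distance over words."""
    # prev[j] = edit distance between the reference prefix seen so far and hyp_words[:j]
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref_words)

ref = "WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT".split()
hyp = "WHAT DAY WOULD YOU LIKE YOUR APARTMENT".split()
print(simple_wer(ref, hyp))  # 25.0 (1 deletion + 1 substitution over 8 words)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.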

In [None]:
# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)

In [None]:
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)

What is the WER value you obtain? Write it down in `README.txt` that you will upload on Moodle, along with your Colab notebook. 

We expect the WER of this pretrained system to be quite high on HVB data (around 72% or more). Note that we have already resampled the audio to 16 kHz to make the HVB audio features more similar to the training inputs of the pretrained model.

ASR errors often have specific error modes or correlations, so let's see if we can understand where our pretrained system is failing on HVB data. We can start by checking some of the utterances with the highest WER.

In [None]:
def summarize(detail_dict, true_dict, pred_dict):
    print(f"{detail_dict['key']}: {detail_dict['WER']}")
    print(f"\tTrue: {true_dict[detail_dict['key']]}")
    print(f"\tPred: {pred_dict[detail_dict['key']]}")

for wer_dict in sb.utils.edit_distance.top_wer_utts(details_by_utterance, 10)[0]:
    summarize(wer_dict, true_dict, pred_dict)

It seems that many of our predictions repeat the same word over and over. Let's see why.

Select some examples the model wrongly predicts, and try to build a hypothesis around what in the data is associated with the model making mistakes. Examples of mistake types include:
- Repeated wrong word
- A few correct words but clearly wrong transcript

Give examples of at least 3 audio files and different kinds of errors you have identified in these audio files. Add this to `README.txt`.
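
To shortlist candidates for the "repeated wrong word" error type, one option is to flag predictions that are dominated by a single word. This is only a sketch: the `repetition_ratio` helper, the thresholds (at least 3 words, ratio of 0.8 or more) and the toy predictions are all assumptions chosen for illustration.

```python
from collections import Counter

def repetition_ratio(transcript):
    """Fraction of the words taken up by the single most frequent word."""
    words = transcript.split()
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)

# Invented predictions, keyed by made-up utterance IDs.
toy_preds = {
    "a": "THE THE THE THE THE",
    "b": "WHAT DAY WOULD YOU LIKE",
    "c": "",
}
suspicious = [key for key, text in toy_preds.items()
              if len(text.split()) >= 3 and repetition_ratio(text) >= 0.8]
print(suspicious)  # ['a']
```

Applying the same filter to `pred_dict` gives a list of keys whose audio you can then listen to with `Audio(...)`.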


Here's a code snippet to help you get started. 

In [None]:
example = details_by_utterance[1]
summarize(example, true_dict, pred_dict)
Audio(test_manifest[example['key']]['wav'])

(Hint: In our initial checks, poor audio quality seems to be associated with repeated word errors.)

## Q2: Finetune the pretrained CRDNN model (2 points)

The performance of this pretrained model is disappointing. However, the model was trained on a different data domain (read speech from LibriSpeech) than the call-center conversations in HVB. To see if we can get better performance on our dataset, we finetune the pretrained model on the HVB training data and test it. For this experiment, you won't need to modify the provided setup much. Just get training working, and adjust some training or decoding parameters if you like. The key goal here is to understand how finetuning on a new corpus works in SpeechBrain. (This is a state-of-the-art approach to building and adjusting ASR models that might be used in industry projects.)

We've set up the training script, experiment yaml, and inference yaml for you, but we encourage you to take a look at how it works (and most importantly what a neat ML experiment yaml file looks like). Training this model for 2 epochs on Colab GPUs should take around **1.5 hours**.

The model saves checkpoints during training, and you can select one of them for inference / testing below. That means you can use a finetuned version of the model even if you don't wait for training to complete. It is okay to submit the homework with a partially finetuned model, i.e., one trained for fewer than 2 full epochs.

In [None]:
# this downloads the training and config files for our fine tuning setup
!gdown 1v_3Kl8OrUd6_1_D0ZGoYVFEuOKhZ7YMo # train.py
!gdown 17cQIpx5kLLMCD23EDaE0EYg2E9LPqMCF # train.yaml
!gdown 1CWYOD2PC97gXguW4krc9122HKAraHkYS # inference.yaml

**Finetuning with HVB data**: 

There are two files you've just downloaded which specify the architecture to train, and the main training loop for improving the neural net ASR system. 

- `train.yaml` is the yaml config file SpeechBrain uses to specify both the network / ASR system architecture, as well as the parameters of training procedures, loss functions, datasets, etc. This is a good starting point for understanding the architecture of the ASR system you're working with. Note that you are not able to modify much about the ASR network architecture as it needs to match what we load from file. You can adjust things like loss function weights, learning rate, and training time to adjust the fine tuning setup. 
- `train.py` specifies the main training loop for fitting the acoustic model. You do not need to modify this file. 

Edit the training yaml file and run the training loop as shown below. Finetune with the modified hyperparameters that worked best for you. Copy/paste the train loss, valid loss, valid CER and valid WER from your training output into `README.txt`, along with the epoch number. (Using `train.yaml` as-is, we obtained training and validation losses < 1.75 after epoch 1.)



In [None]:
torch.cuda.empty_cache()

!python train.py train.yaml --batch_size=4
# OOM on batch_size=5

## Q3: Evaluate your finetuned model (1 point)

To run inference, you need to use a different yaml to be compatible with the `EncoderDecoderASR` class. 

NOTE: You need to set your checkpoint path in a few locations to make this work. If things aren't working (e.g., the code tries to download the model from HuggingFace), check that these paths are set correctly before debugging anything else.

To get inference working, there are three steps:
1. Note the directory that your checkpoints are saved in (under `./results/CRDNN_BPE_960h_LM/2602/save/{your ckpt here}`).
2. Paste this directory into the `ckptdir` entry in `inference.yaml`.
3. Paste this directory after `ckpt_path = ` in the cell below.

You can then use the same inference procedure as in Q1.
NOTE: Set `ckpt_path` below AND set `ckptdir` in `inference.yaml` to the desired checkpoint. The cell below copies `inference.yaml` into the checkpoint directory, so if you change `ckptdir` after running it, either edit the copy inside the checkpoint directory or re-run the cell.
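
If you prefer not to copy the checkpoint directory name by hand, a small helper can look it up. This is a sketch under two assumptions: the checkpoint folders follow the `CKPT+<timestamp>` naming that appears in the cell below, and sorting those names alphabetically matches chronological order. `latest_checkpoint` is a hypothetical helper, and the most recent checkpoint is not necessarily the best one.

```python
from pathlib import Path

def latest_checkpoint(save_dir):
    """Return the last CKPT+... directory (by name) under save_dir, or None."""
    base = Path(save_dir)
    if not base.is_dir():
        return None
    ckpts = sorted(p for p in base.glob("CKPT+*") if p.is_dir())
    return str(ckpts[-1]) if ckpts else None

# Prints None if no training run has written checkpoints here yet.
print(latest_checkpoint("/content/results/CRDNN_BPE_960h_LM/2602/save"))
```

The returned string can be used as `ckpt_path`; the `ckptdir` entry in `inference.yaml` still has to be updated to match.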


In [None]:
ckpt_path = "/content/results/CRDNN_BPE_960h_LM/2602/save/CKPT+2023-04-08+06-09-55+00"
!cp inference.yaml {ckpt_path}

Evaluate your finetuned model on the first 50 test sentences in `test_manifest.json`. Generate predictions by setting up a model object and calling `inference()`. Remember that your checkpoint paths must be set correctly in the copy of `inference.yaml` that is read when running inference. For this step, simply populate `pred_dict` with predictions for your test subset and write down the WER in `README.txt`. Also compute the WER of the pretrained model on this same 50-utterance subset and write it down in `README.txt`.

NOTE: Running inference on ~50 utterances might require ~15 minutes of computation on a Colab CPU. The code below uses CPU inference as we could not get checkpoint-loaded DNNs to work with SpeechBrain's inference on the GPU (you are free to try this). 

In [None]:
device = 'cpu'
our_model = EncoderDecoderASR.from_hparams(
    source=ckpt_path, 
    hparams_file='inference.yaml', 
    savedir="our_ckpt",
    run_opts={'device': device}
)

In [None]:
##############################
#### YOUR CODE GOES BELOW #####
##############################

### Populate test_manifest with the first 50 test sentences and then call inference() below###
EVAL_SIZE = 50
test_manifest = {
    k: v for k, v in list(test_manifest.items())[:EVAL_SIZE]
}
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

pred_dict = inference(our_model.to(device), test_manifest)
# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)

In [None]:
pred_dict = inference(crdnn, test_manifest)
# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)

# Task 1: Train a sentiment detection model (20 points)


Apart from the audio files and transcriptions, the HVB corpus also comes with annotations for intent, sentiment/emotion and dialog actions. Use the commands below to download `transcript.zip` and `hvb-audio.zip`. 

In [None]:
# !gdown 1-s2e8dZYSjhVgfo_TL0V_89RZVhGnZ1Y #transcript.zip
!gdown 1oCn4PoJO-9XMEh-RtuatRgtKKL6ZyXb6
!unzip -q transcript.zip

!gdown 1ChdI1XyhmGq9z8Y8M38yXPMob6oPRqPO  #train.txt
!gdown 10w15DnUbJcQRBSZWP03qjM6Oq8l7WSVQ  #dev.txt

# !gdown 1eimo-BFXZz6Z3FeZK8uC-wT84ji-JVos #hvb-audio.zip
!gdown 1xMyXiFpQo3reF5sWFi6MahUQviW4Z1RA
!unzip -q hvb-audio.zip

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
with open('transcript/370981f1f0254ebc.json', 'r') as f:
    print(f.read())

Each transcript json file within `/content/transcript/` refers to a conversation with a list of utterances. Each segment (utterance) is associated with a json object and one of its keys is labeled as "emotion" that maps to three probability values associated with positive, negative and neutral sentiments. You can consider the sentiment with maximum probability to be the ground-truth label for each utterance. 
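
Picking the ground-truth label amounts to an argmax over the three probabilities. A minimal sketch follows; the probabilities are invented, and the key names and the class order in `LABELS` are assumptions that should be checked against an actual transcript file.

```python
# Assumed class order used to turn a sentiment name into a class index.
LABELS = ("neutral", "negative", "positive")

def sentiment_label(emotion):
    """Return the name of the sentiment with the highest probability."""
    return max(emotion, key=emotion.get)

# Invented probabilities in the shape of the "emotion" field.
example_emotion = {"neutral": 0.62, "negative": 0.08, "positive": 0.30}

label = sentiment_label(example_emotion)
print(label, LABELS.index(label))  # neutral 0
```

Fixing the class order once, as in `LABELS`, keeps the label indices consistent between training and evaluation.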

For this task, you will write new code to create a sentiment classification model for HVB utterances. Here are the various steps:
- Create train and dev splits from `train.txt` and `dev.txt` downloaded in the code cell above. To understand the format of `train.txt` (and `dev.txt`), consider this line from `train.txt`: *010d38f5ada54e0d:1,2,3,4,5,6,7,8*. It refers to conversation ID 010d38f5ada54e0d and to eight utterances that appear sequentially within the file `/content/transcript/010d38f5ada54e0d.json`. These utterances are numbered by the field `index` in the json file. Interpret the lines in `dev.txt` in the same way. Use these text files (`train.txt` and `dev.txt`) to create the train and dev sets, and extract the relevant metadata ("emotion") from the respective transcript json files. The corresponding audio for these segments is in the files named `010d38f5ada54e0d.wav` within `/content/audio/agent` and `/content/audio/caller` (obtained after you unzip `hvb-audio.zip`). The fields `channel_index`, `index`, `start_ms` and `duration_ms` within `/content/transcript/010d38f5ada54e0d.json` give you all the information needed to extract the audio for the eight utterances mentioned in the example above. Click [here](https://github.com/cricketclub/gridspace-stanford-harper-valley#field-descriptions) for more details about the json field descriptions.
- Load the pretrained CRDNN Librispeech model as in Task 0 and extract the encoder.
- Add a linear layer mapping the encoding of the audio signal to a prediction of the underlying sentiment. This will be randomly initialized.
- Train both the new linear layer and all the pretrained encoder layers using a cross-entropy loss with the reference emotion labels derived as described at the top of this cell. 
- Evaluate your trained sentiment detection model on utterances listed in `dev.txt` and compute overall accuracy for sentiment prediction. 
- Write down the accuracy you obtained in `README.txt`. 
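
The first step above can be sketched as follows: parse one line of `train.txt`, then turn `start_ms` / `duration_ms` into sample indices for slicing a waveform. Both helper names are hypothetical, and the timing values and the 8000 Hz sample rate in the usage line are placeholders; read the real rate from the audio file.

```python
def parse_split_line(line):
    """Split '<conversation-id>:<i1>,<i2>,...' into (conversation_id, [indices])."""
    conv_id, index_str = line.strip().split(":")
    return conv_id, [int(i) for i in index_str.split(",")]

def ms_to_sample_range(start_ms, duration_ms, sample_rate):
    """Convert (start, duration) in milliseconds to [start, end) sample indices."""
    start = start_ms * sample_rate // 1000
    end = (start_ms + duration_ms) * sample_rate // 1000
    return start, end

conv_id, indices = parse_split_line("010d38f5ada54e0d:1,2,3,4,5,6,7,8\n")
print(conv_id, indices)
print(ms_to_sample_range(1500, 2000, 8000))  # (12000, 28000)
```

With a waveform tensor of shape `(channels, samples)`, the utterance is then `waveform[:, start:end]`.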



## Rough Score Breakdown

---

1. Data preprocessing (4 points)
2. Loading the CRDNN model and extracting the encoder layers (6 points)
3. Adding a linear layer + training the model (7 points)
4. Obtaining accuracies similar to or better than our solution code (3 points)  
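
One possible shape for the "encoder + linear layer" part is sketched below. It averages the encoder output over time before the linear layer, so the classifier does not depend on the utterance length. This is an illustration under stated assumptions, not the reference solution: `SentimentHead` is a hypothetical class, and `toy_encoder` is a stand-in so the sketch runs without SpeechBrain (in the assignment, the pretrained CRDNN encoder would take its place).

```python
import torch
from torch import nn

class SentimentHead(nn.Module):
    """Encoder followed by mean pooling over time and a linear classifier."""

    def __init__(self, encoder, enc_dim, num_classes=3):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(enc_dim, num_classes)

    def forward(self, feats):
        enc = self.encoder(feats)        # (batch, time, enc_dim)
        pooled = enc.mean(dim=1)         # (batch, enc_dim)
        return self.classifier(pooled)   # (batch, num_classes) logits

# Stand-in encoder (an assumption); sizes are arbitrary.
toy_encoder = nn.Sequential(nn.Linear(40, 16), nn.ReLU())
model = SentimentHead(toy_encoder, enc_dim=16)

feats = torch.randn(2, 50, 40)   # 2 utterances, 50 frames, 40 features each
labels = torch.tensor([0, 2])    # class indices of the reference sentiments
logits = model(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                  # gradients reach both encoder and classifier
print(logits.shape)  # torch.Size([2, 3])
```

Note that a plain mean also averages over padded frames; masking with the relative lengths would be a refinement.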

In [None]:
!pip install pydub

In [None]:
##########################################################
#### ALL YOUR CODE FOR SENTIMENT DETECTION GOES HERE #####
import json
from pydub import AudioSegment

def build_split(list_file, suffix=''):
    """Cut out every utterance listed in `list_file` and return its metadata.

    Each line of `list_file` looks like `<conversation-id>:<idx>,<idx>,...`.
    """
    data = []
    with open(list_file, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            convID, index_str = line.split(':')
            indices = list(map(int, index_str.split(',')))
            with open(f'transcript/{convID}.json', 'r') as f1:
                utts = json.load(f1)
            recordings = {}  # full recording per channel, loaded at most once
            for i in indices:
                utt = utts[i - 1]  # assumes `index` is 1-based and matches list position
                obj = {
                    'words': utt['transcript'],
                    # probabilities in the order they appear in the json
                    'emotion': list(utt['emotion'].values()),
                    'channel_index': utt['channel_index'],
                    'start_ms': utt['start_ms'],
                    'duration_ms': utt['duration_ms'],
                    'length': utt['duration_ms'] / 1000,  # seconds
                }
                path = 'audio/caller/' if obj['channel_index'] == 1 else 'audio/agent/'
                if path not in recordings:
                    recordings[path] = AudioSegment.from_wav(path + convID + '.wav')
                # pydub slices an AudioSegment in milliseconds
                start = obj['start_ms']
                segment = recordings[path][start:start + obj['duration_ms']]
                obj['wav'] = f'{path}{convID}_{i}{suffix}.wav'
                segment.export(obj['wav'], format='wav')
                data.append(obj)
    return data

train_data = build_split('train.txt')
dev_data = build_split('dev.txt', suffix='_dev')
##########################################################

In [None]:
crdnn = EncoderDecoderASR.from_hparams(
    source='speechbrain/asr-crdnn-rnnlm-librispeech',
    savedir='asr-crdnn-rnnlm-librispeech',
    run_opts={'device': device}
)

def getMaxLength(manifest):
    """Return the number of samples in the longest utterance of `manifest`."""
    return max(
        torchaudio.load(entry['wav'])[0].shape[-1]
        for entry in manifest.values()
    )

train_manifest = {key: train_data[key] for key in range(len(train_data))}

# Every utterance is later zero-padded to this many samples so that the
# encoder output has a fixed size.
max_length = getMaxLength(train_manifest)

In [None]:
import torch
from torch import nn

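class Network(nn.Module):

    def __init__(self, crdnn):
        super().__init__()
        self.enc = crdnn.mods.encoder
        # The task asks for the encoder to be trained too; re-enable gradients
        # in case the pretrained interface loaded its parameters frozen.
        for p in self.enc.parameters():
            p.requires_grad = True
        # 233 * 512 is the flattened encoder output (frames x features) for an
        # input padded to max_length; update 233 if max_length changes.
        self.lin = nn.Linear(233*512, 3)

    def forward(self, wav, wav_len):
        x = self.enc(wav, wav_len)
        x = x.reshape(x.shape[0], -1)  # flatten to (batch, frames * features)
        x = self.lin(x)
        return x

model = Network(crdnn).to(device)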

In [None]:
# training
# https://speechbrain.readthedocs.io/en/latest/API/speechbrain.pretrained.interfaces.html#speechbrain.pretrained.interfaces.EncoderDecoderASR
from tqdm.notebook import tqdm

def mybatchify(manifest, batch_size):
    """Yield (keys, padded wavs, relative lengths, emotion probabilities) per batch."""
    keys = list(manifest.keys())
    num_examples = len(manifest)
    for i in range(0, num_examples, batch_size):
        batch_keys = keys[i:i + batch_size]
        # Truncate to max_length so that longer dev utterances do not break padding.
        wavs = [
            torchaudio.load(manifest[key]['wav'])[0].squeeze(0)[:max_length]
            for key in batch_keys
        ]
        # Zero-pad every utterance to exactly max_length samples.
        batch_wavs = torch.zeros(len(wavs), max_length)
        for row, wav in enumerate(wavs):
            batch_wavs[row, :wav.shape[0]] = wav
        # Lengths are relative to the padded length, not to the longest item in the batch.
        batch_wav_lens = torch.tensor([wav.shape[0] / max_length for wav in wavs])
        emotions = torch.tensor([
            manifest[key]['emotion'] for key in batch_keys
        ])
        yield batch_keys, batch_wavs, batch_wav_lens, emotions

def train(model, train_manifest, batch_size=8):
    torch.cuda.empty_cache()
    model.train()
    optim = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.01)
    loss_fn = torch.nn.functional.cross_entropy
    total_loss = 0
    num_batches = (len(train_manifest) + batch_size - 1) // batch_size  # ceiling division
    for keys, wavs, wav_lens, emotions in tqdm(mybatchify(train_manifest, batch_size), total=num_batches):
        optim.zero_grad()
        preds = model(wavs.to(device), wav_lens.to(device))
        # The ground-truth label is the sentiment with the highest probability.
        labels = emotions.to(device).argmax(dim=1)
        loss = loss_fn(preds, labels)
        total_loss += loss.item()
        loss.backward()
        optim.step()
    print(f'mean train loss: {total_loss / max(num_batches, 1):.4f}')

train(model, train_manifest, 1)

In [None]:
### Evaluate sentiment accuracy on the utterances listed in dev.txt ###
def evaluate(model, test_manifest, batch_size=8):
    torch.cuda.empty_cache()
    model.eval()
    score = 0
    num_batches = (len(test_manifest) + batch_size - 1) // batch_size  # ceiling division
    with torch.no_grad():  # no gradients are needed for evaluation
        for keys, wavs, wav_lens, emotions in tqdm(mybatchify(test_manifest, batch_size), total=num_batches):
            preds = model(wavs.to(device), wav_lens.to(device))
            emotions = emotions.to(device)
            # A prediction is correct when its argmax matches the reference argmax.
            score += (preds.argmax(dim=1) == emotions.argmax(dim=1)).sum().item()
    print(score / len(test_manifest))

test_manifest = {key: dev_data[key] for key in range(len(dev_data))}

evaluate(model, test_manifest, 1)

# Extra Credit: Use Whisper's pretrained model to evaluate HVB (5 points)

Write code to evaluate [Whisper's **small** model](https://github.com/openai/whisper/blob/main/model-card.md) on the first 200 test utterances in `test_manifest.json` that you used in Q1 of Task 0. Compute the WER with predictions from Whisper and add it to `README.txt`.
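
Whisper returns cased, punctuated text, while the manifest example shown earlier in this notebook suggests that the reference transcripts are uppercase without punctuation. Normalizing predictions before scoring keeps formatting differences from being counted as word errors. Here is a minimal sketch; `normalize_transcript` is a hypothetical helper, it assumes ASCII English text, and the input sentence is invented.

```python
import re

def normalize_transcript(text):
    """Uppercase, drop punctuation except apostrophes, and collapse whitespace."""
    text = text.upper()
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)
    return " ".join(text.split())

print(normalize_transcript(" What day would you like, for your appointment?"))
# WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT
```

This does not reconcile digits with spelled-out numbers (e.g. "5" vs. "FIVE"), which can still show up as errors.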

In [None]:
!pip install openai-whisper

In [None]:
##############################################################
#### YOUR EVALUATION CODE USING Whisper-small GOES BELOW #####
import re
import whisper

TEST_SIZE = 200

with open('data/test_manifest.json', 'r') as f:
    test_manifest = json.load(f)
test_manifest = {
    k: v for k, v in list(test_manifest.items())[:TEST_SIZE]
}
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

whisp = whisper.load_model('small', device=device)

pred_dict = {}
for key in test_manifest:
    result = whisp.transcribe(test_manifest[key]['wav'])
    # Whisper emits cased, punctuated text, while the references appear to be
    # uppercase without punctuation; normalize so formatting is not scored as errors.
    text = re.sub(r"[^A-Z0-9' ]+", ' ', result['text'].upper())
    pred_dict[key] = ' '.join(text.split())

# this data structure stores WER information we use later.
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)
##############################################################