# Metrics

## Imports

In [1]:
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

## Evaluate

In [2]:
reference = "the cat sat on the mat"
prediction = "the cat sit on the"

### Word Error Rate (WER)

The word error rate (WER) is the de facto standard metric for speech recognition. It counts substitutions, insertions and deletions at the word level, so errors are annotated on a word-by-word basis.
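As a worked equation, the three error counts are usually combined into a single ratio (the symbols $S$, $D$, $I$ and $N$ are introduced here for illustration):

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words, and $N$ is the number of words in the reference. For the reference/prediction pair defined above there is one substitution ("sat" → "sit"), one deletion ("mat") and no insertion, over $N = 6$ reference words:

$$
\mathrm{WER} = \frac{1 + 1 + 0}{6} \approx 0.333
$$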

In [3]:
wer_metric = load("wer")
wer = wer_metric.compute(references=[reference], predictions=[prediction])

print(wer)

0.3333333333333333
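To make the computation concrete, here is a minimal sketch of a word-level edit distance in plain Python. The helper name `word_error_rate` is hypothetical and this is not how the `evaluate` metric is implemented internally. It assumes whitespace tokenisation and a non-empty reference.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Illustrative sketch: assumes whitespace tokenisation and a non-empty reference.
    """
    ref_words = reference.split()
    pred_words = prediction.split()

    # dist[i][j] = minimum number of edits needed to turn the first i
    # reference words into the first j predicted words
    dist = [[0] * (len(pred_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(pred_words) + 1):
        dist[0][j] = j  # j insertions

    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(pred_words) + 1):
            cost = 0 if ref_words[i - 1] == pred_words[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    return dist[-1][-1] / len(ref_words)


print(word_error_rate("the cat sat on the mat", "the cat sit on the"))
```

On the example pair this sketch gives 2 edits over 6 reference words, the same ratio as the metric output above.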


### Word Accuracy

Word Accuracy = 1 - WER
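
A short sketch of this relation, using a hypothetical helper `word_accuracy`. It also illustrates a property worth keeping in mind: because insertions count as errors, the WER can exceed 1, in which case the word accuracy becomes negative.

```python
def word_accuracy(wer: float) -> float:
    """Word accuracy as defined above: 1 - WER (WER given as a fraction)."""
    return 1.0 - wer


# WER of the running example: 2 errors over 6 reference words
print(word_accuracy(2 / 6))

# WER is not bounded by 1: with many insertions the number of errors can
# exceed the number of reference words, so the accuracy drops below zero.
print(word_accuracy(4 / 3))
```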

### Character Error Rate (CER)

It seems a bit unfair that we marked the entire word “sit” as wrong when in fact only one letter was incorrect. That is because we evaluated our system at the word level, annotating errors on a word-by-word basis. The character error rate (CER) assesses systems at the character level instead: we split the words into their individual characters and annotate errors on a character-by-character basis:

In [4]:
cer_metric = load("cer")
cer = cer_metric.compute(references=[reference], predictions=[prediction])

print(cer)

0.22727272727272727
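The value above can be checked by hand with a character-level edit distance. The sketch below uses a hypothetical helper `char_error_rate`; it is not the implementation behind the `evaluate` metric, it counts spaces as characters, and it assumes a non-empty reference.

```python
def char_error_rate(reference: str, prediction: str) -> float:
    """Character-level Levenshtein distance divided by the reference length.

    Illustrative sketch: spaces count as characters, reference must be non-empty.
    """
    # prev[j] = edits needed to turn the reference prefix processed so far
    # into the first j characters of the prediction
    prev = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # turning i reference characters into an empty string
        for j, pred_char in enumerate(prediction, start=1):
            cost = 0 if ref_char == pred_char else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(reference)


print(char_error_rate("the cat sat on the mat", "the cat sit on the"))
```

Here the reference has 22 characters and 5 edits are needed (one substituted letter in "sat", plus the four deleted characters of " mat"), which gives 5 / 22.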


### Normalisation

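If we train an ASR model on data with punctuation and casing, it will learn to predict casing and punctuation in its transcriptions. This is great when we want to use our model for actual speech recognition applications, such as transcribing meetings or dictation, since the predicted transcriptions will be fully formatted with casing and punctuation, a style referred to as orthographic. For evaluation, however, differences in casing and punctuation count as word errors, so we can normalise both the predictions and the references (lower-casing and removing punctuation) before computing the WER: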

In [5]:
normalizer = BasicTextNormalizer()

prediction = " He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly is drawn from eating and its results occur most readily to the mind."
normalized_prediction = normalizer(prediction)

normalized_prediction

' he tells us that at this festive season of the year with christmas and roast beef looming before us similarly is drawn from eating and its results occur most readily to the mind '

In [6]:
reference = "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND"
normalized_reference = normalizer(reference)

wer = wer_metric.compute(
    references=[normalized_reference], predictions=[normalized_prediction]
)
wer

0.0625
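
To see why normalisation matters, the following self-contained sketch compares the WER on the raw strings with the WER after a simplified normalisation. Both `simple_normalize` and `word_errors` are hypothetical helpers written for this illustration; `simple_normalize` only lower-cases and strips punctuation, and is a stand-in for, not a copy of, `BasicTextNormalizer`.

```python
import re


def simple_normalize(text: str) -> str:
    """Simplified stand-in for a text normaliser (not BasicTextNormalizer)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def word_errors(reference: str, prediction: str) -> int:
    """Word-level edit distance (substitutions + insertions + deletions)."""
    ref, pred = reference.split(), prediction.split()
    prev = list(range(len(pred) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, pred_word in enumerate(pred, start=1):
            cost = 0 if ref_word == pred_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


ref_text = (
    "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND "
    "ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS "
    "OCCUR MOST READILY TO THE MIND"
)
pred_text = (
    " He tells us that at this festive season of the year, with Christmas and "
    "roast beef looming before us, similarly is drawn from eating and its "
    "results occur most readily to the mind."
)

n_ref = len(ref_text.split())
raw_wer = word_errors(ref_text, pred_text) / n_ref
norm_wer = word_errors(simple_normalize(ref_text), simple_normalize(pred_text)) / n_ref

print(f"raw WER: {raw_wer:.4f}, normalised WER: {norm_wer:.4f}")
```

Without normalisation every word differs in casing, so the raw WER is above 1 even though the transcription is almost perfect. After normalisation only "similarly is" versus "similes" remains, i.e. 2 errors over 32 reference words.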

## Putting it all together

In [7]:
from transformers import pipeline
import torch


if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    torch_dtype=torch_dtype,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
from huggingface_hub import notebook_login

notebook_login()


In [24]:
from datasets import load_dataset

common_voice_test = load_dataset(
    "mozilla-foundation/common_voice_13_0", "et", split="test"
)



In [25]:
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

all_predictions = []

# run streamed inference
for prediction in tqdm(
    pipe(
        KeyDataset(common_voice_test, "audio"),
        max_new_tokens=128,
        generate_kwargs={"task": "transcribe", "language": "et"},
        batch_size=32,
    ),
    total=len(common_voice_test),
):
    all_predictions.append(prediction["text"])

100%|██████████| 2638/2638 [34:52<00:00,  1.26it/s]


In [26]:
from evaluate import load

wer_metric = load("wer")

wer_ortho = 100 * wer_metric.compute(
    references=common_voice_test["sentence"], predictions=all_predictions
)
wer_ortho

74.25920197958553

In [27]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# compute normalised WER
all_predictions_norm = [normalizer(pred) for pred in all_predictions]
all_references_norm = [normalizer(label) for label in common_voice_test["sentence"]]

# keep only the samples whose normalised reference is non-empty
# (WER is undefined for an empty reference)
kept = [
    (pred, ref)
    for pred, ref in zip(all_predictions_norm, all_references_norm)
    if len(ref) > 0
]
all_predictions_norm = [pred for pred, _ in kept]
all_references_norm = [ref for _, ref in kept]

wer = 100 * wer_metric.compute(
    references=all_references_norm, predictions=all_predictions_norm
)

wer

71.7370515470278
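
As a rough reading of the two numbers above, the sketch below computes the absolute and relative difference between the orthographic and the normalised WER. The variable names are chosen for this illustration, and the comparison is only approximate, since the normalised WER is computed after dropping samples with empty references.

```python
ortho_wer_pct = 74.25920197958553  # orthographic WER (%) reported above
norm_wer_pct = 71.7370515470278    # normalised WER (%) reported above

absolute_gain = ortho_wer_pct - norm_wer_pct  # in percentage points
relative_gain = absolute_gain / ortho_wer_pct  # as a fraction of the orthographic WER

print(f"absolute: {absolute_gain:.2f} points, relative: {100 * relative_gain:.1f}%")
# absolute: 2.52 points, relative: 3.4%
```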