<a href="https://colab.research.google.com/github/zanqi/tiny_asr/blob/main/asr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Transformers installation
! pip install transformers datasets evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


# Automatic speech recognition

Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications like live captioning and note-taking during meetings.

This guide will show you how to:

1. Fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to transcribe audio to text.
2. Use your fine-tuned model for inference.

<Tip>

To see all architectures and checkpoints compatible with this task, we recommend checking the [task-page](https://huggingface.co/tasks/automatic-speech-recognition).

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate jiwer soundfile librosa torchcodec
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [None]:
pip install transformers datasets evaluate jiwer soundfile librosa torchcodec

Collecting jiwer
  Downloading jiwer-4.0.0-py3-none-any.whl.metadata (3.3 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading jiwer-4.0.0-py3-none-any.whl (23 kB)
Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-4.0.0 rapidfuzz-3.14.3


In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load MInDS-14 dataset

Start by loading a smaller subset of the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset from the 🤗 Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [None]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")

README.md: 0.00B [00:00, ?B/s]

en-US/train-00000-of-00001.parquet:   0%|          | 0.00/34.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/563 [00:00<?, ? examples/s]

Split the dataset's `train` split into a train and test set with the `Dataset.train_test_split` method:

In [None]:
minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

In [None]:
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 80
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 20
    })
})

While the dataset contains a lot of useful information, like `lang_id` and `english_transcription`, this guide focuses on the `audio` and `transcription`. Remove the other columns with the [remove_columns](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.remove_columns) method:

In [None]:
minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])

Review the example again:

In [None]:
minds["train"][0]

{'path': 'en-US~JOINT_ACCOUNT/602baa0fbb1e6d0fbce9214f.wav',
 'audio': <datasets.features._torchcodec.AudioDecoder at 0x79208ff4d250>,
 'transcription': 'I need to find out if I probably set up a joint account'}

There are two fields:

- `audio`: the speech signal. Accessing this column decodes the audio file (resampling it if needed) and exposes a 1-dimensional `array` along with its `sampling_rate`.
- `transcription`: the target text.

In [None]:
import os
from IPython.display import Audio as AudioPlayer

# Access the audio feature which automatically loads the audio data
audio_sample = minds["train"][0]['audio']
print(minds["train"][0]['path'])

# Pass the audio array and its sampling rate to IPython.display.Audio
AudioPlayer(audio_sample['array'], rate=audio_sample['sampling_rate'])

en-US~JOINT_ACCOUNT/602baa0fbb1e6d0fbce9214f.wav


## Preprocess

The next step is to load a Wav2Vec2 processor to process the audio signal:

In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/163 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

vocab.json:   0%|          | 0.00/291 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

The MInDS-14 dataset has a sampling rate of 8000Hz (you can find this information in its [dataset card](https://huggingface.co/datasets/PolyAI/minds14)), which means you'll need to resample the dataset to 16000Hz to use the pretrained Wav2Vec2 model:

In [None]:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'path': 'en-US~JOINT_ACCOUNT/602baa0fbb1e6d0fbce9214f.wav',
 'audio': <datasets.features._torchcodec.AudioDecoder at 0x79209021fb30>,
 'transcription': 'I need to find out if I probably set up a joint account'}
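
To make the resampling step concrete, here is a minimal pure-Python sketch of what going from 8000Hz to 16000Hz does to a signal: the duration stays the same while the number of samples doubles. It uses naive linear interpolation, which is an assumption for illustration only; `cast_column` relies on a proper resampling filter, and `resample_linear` is a hypothetical helper, not part of 🤗 Datasets.

```python
import math


def resample_linear(signal, orig_sr, target_sr):
    """Naively resample `signal` from orig_sr to target_sr by linear interpolation."""
    if not signal:
        return []
    n_out = int(round(len(signal) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        lo = min(int(math.floor(pos)), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


# One second of a 440 Hz tone sampled at 8 kHz (synthetic data, not from MInDS-14).
tone_8k = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
tone_16k = resample_linear(tone_8k, orig_sr=8000, target_sr=16000)

print(len(tone_8k), len(tone_16k))  # 8000 16000
```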

As you can see in the `transcription` above, the text contains a mix of uppercase and lowercase characters. The Wav2Vec2 tokenizer is only trained on uppercase characters so you'll need to make sure the text matches the tokenizer's vocabulary:

In [None]:
def uppercase(example):
    return {"transcription": example["transcription"].upper()}


minds = minds.map(uppercase)

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

In [None]:
minds["train"][0]

{'path': 'en-US~JOINT_ACCOUNT/602baa0fbb1e6d0fbce9214f.wav',
 'audio': <datasets.features._torchcodec.AudioDecoder at 0x791f1b3ba660>,
 'transcription': 'I NEED TO FIND OUT IF I PROBABLY SET UP A JOINT ACCOUNT'}

Now create a preprocessing function that:

1. Calls the `audio` column to load and resample the audio file.
2. Extracts the `input_values` from the audio file and tokenizes the `transcription` column with the processor.

In [None]:
def prepare_dataset(batch):
    audio = batch["audio"]
    batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
    batch["input_length"] = len(batch["input_values"][0])
    return batch

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by increasing the number of processes with the `num_proc` parameter. Remove the columns you don't need with the [remove_columns](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.remove_columns) method:

In [None]:
encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)

Map (num_proc=4):   0%|          | 0/80 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/20 [00:00<?, ? examples/s]

🤗 Transformers doesn't have a data collator for ASR, so you'll need to adapt the [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding) to create a batch of examples. It'll also dynamically pad your text and labels to the length of the longest element in its batch (instead of the entire dataset) so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

Unlike other data collators, this specific data collator needs to apply a different padding method to `input_values` and `labels`:

In [None]:
import torch

from dataclasses import dataclass
from typing import Union


@dataclass
class DataCollatorCTCWithPadding:
    processor: AutoProcessor
    padding: Union[bool, str] = "longest"

    def __call__(self, features: list[dict[str, Union[list[int], torch.Tensor]]]) -> dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"][0]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

Now instantiate your `DataCollatorCTCWithPadding`:

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
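
The two padding rules the collator applies can be illustrated without any library. The following is a minimal sketch, assuming plain Python lists; `pad_batch` is a hypothetical helper and not the `processor.pad` API. Audio is padded with zeros, while labels are padded with -100 so that the padded positions are ignored by the loss.

```python
def pad_batch(input_values, labels, label_pad=-100):
    """Pad variable-length audio and label sequences to the longest in the batch."""
    max_audio = max(len(x) for x in input_values)
    max_label = max(len(y) for y in labels)

    # Audio is padded with silence (0.0).
    padded_audio = [x + [0.0] * (max_audio - len(x)) for x in input_values]
    # Labels are padded with -100 so those positions are skipped by the loss.
    padded_labels = [y + [label_pad] * (max_label - len(y)) for y in labels]
    return padded_audio, padded_labels


# Toy batch of two examples with different lengths (made-up values).
audio = [[0.1, -0.2, 0.3], [0.5]]
labels = [[10, 4, 9], [11, 5]]
padded_audio, padded_labels = pad_batch(audio, labels)

print(padded_audio)   # [[0.1, -0.2, 0.3], [0.5, 0.0, 0.0]]
print(padded_labels)  # [[10, 4, 9], [11, 5, -100]]
```

Padding per batch rather than per dataset keeps every batch only as long as its longest example.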

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [word error rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) metric (refer to the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about loading and computing metrics):

In [None]:
import evaluate

wer = evaluate.load("wer")

Downloading builder script: 0.00B [00:00, ?B/s]

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the WER:

In [None]:
import numpy as np


def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer_score = wer.compute(predictions=pred_str, references=label_str)

    return {"wer": wer_score}

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
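
For intuition about what the metric measures, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; `word_error_rate` is a hypothetical helper, not the 🤗 Evaluate implementation, and it applies no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("SET UP A JOINT ACCOUNT", "SET UP A JOIN ACCOUNT"))  # 0.2
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.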

## Train

<Tip>

If you aren't familiar with fine-tuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/training#train-with-pytorch-trainer)!

</Tip>

You are now ready to start training your model! Load Wav2Vec2 with [AutoModelForCTC](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCTC). Specify the reduction to apply with the `ctc_loss_reduction` parameter. It is often better to use the average instead of the default summation:

In [None]:
from transformers import AutoModelForCTC, TrainingArguments, Trainer

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

model.safetensors:   0%|          | 0.00/378M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/212 [00:00<?, ?it/s]

Wav2Vec2ForCTC LOAD REPORT from: facebook/wav2vec2-base-960h
Key                        | Status  | 
---------------------------+---------+-
wav2vec2.masked_spec_embed | MISSING | 

Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). With `eval_strategy="steps"`, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the WER every `eval_steps` (1,000) steps and save a training checkpoint every `save_steps` (1,000) steps.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, processor, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to fine-tune your model.

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_asr_mind_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    eval_strategy="steps",
    per_device_eval_batch_size=8,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    processing_class=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# trainer.train()
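
It helps to work out what these hyperparameters imply for the 80-example training subset. The arithmetic below is a rough sketch that assumes a single device (`num_devices = 1` is an assumption, not something the notebook states); the exact step accounting inside `Trainer` may differ slightly, and the variable names are illustrative.

```python
import math

num_train_examples = 80           # size of the train split above
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
max_steps = 2000
num_devices = 1                   # assumption: single GPU

# Examples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes per epoch, then optimizer updates per epoch.
micro_batches_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
updates_per_epoch = max(micro_batches_per_epoch // gradient_accumulation_steps, 1)

approx_epochs = max_steps / updates_per_epoch

print(effective_batch_size)  # 16
print(updates_per_epoch)     # 5
print(approx_epochs)         # 400.0
```

Under these assumptions, 2,000 steps is roughly 400 passes over the 80 examples, so for a quick smoke test on the small subset a much smaller `max_steps` (and `warmup_steps`) may be more practical.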

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so it can be accessible to everyone:

In [None]:
trainer.push_to_hub()

Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...d_model/training_args.bin: 100%|##########| 5.20kB / 5.20kB            

  ...d_model/model.safetensors:   0%|          | 16.2kB /  378MB            

CommitInfo(commit_url='https://huggingface.co/keylazy/my_awesome_asr_mind_model/commit/ebb14f97936cd9745072aa2d95962a116cf2a45c', commit_message='End of training', commit_description='', oid='ebb14f97936cd9745072aa2d95962a116cf2a45c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/keylazy/my_awesome_asr_mind_model', endpoint='https://huggingface.co', repo_type='model', repo_id='keylazy/my_awesome_asr_mind_model'), pr_revision=None, pr_num=None)

<Tip>

For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR.

</Tip>

## Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an audio file you'd like to run inference on. Remember to resample the audio file to match the sampling rate of the model if you need to!

In [None]:
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["path"]

In [None]:
dataset[0]

{'path': 'en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'audio': <datasets.features._torchcodec.AudioDecoder at 0x791f18b7d340>,
 'transcription': 'I would like to set up a joint account with my partner',
 'english_transcription': 'I would like to set up a joint account with my partner',
 'intent_class': 11,
 'lang_id': 4}

The simplest way to try out your fine-tuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for automatic speech recognition with your model, and pass your audio file to it:

In [None]:
audio_sample = dataset[0]['audio']
AudioPlayer(audio_sample['array'], rate=audio_sample['sampling_rate'])

In [None]:
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="keylazy/my_awesome_asr_mind_model")
transcriber(dataset[0]["audio"])

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/378M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/213 [00:00<?, ?it/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json:   0%|          | 0.00/358 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/300 [00:00<?, ?B/s]

{'text': 'I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT'}

<Tip>

The transcription is decent, but it could be better! Try finetuning your model on more examples to get even better results!

</Tip>

You can also manually replicate the results of the `pipeline` if you'd like:

Load a processor to preprocess the audio file and transcription and return the `input` as PyTorch tensors:

In [None]:
# 1. Inspect a single processed example
print("--- Single Processed Example ---")
example = encoded_minds["train"][0]
print(f"Keys: {example.keys()}")
print(f"Input values shape: {len(example['input_values'][0])}") # It's a list of list
if 'labels' in example:
    print(f"Labels shape: {len(example['labels'])}")
    print(f"Labels sample: {example['labels'][:10]}")

# 2. Test the Data Collator
print("\n--- Testing Data Collator ---")
batch_size = 2
batch_data = [encoded_minds["train"][i] for i in range(batch_size)]

try:
    batch_out = data_collator(batch_data)
    print("Collator ran successfully.")
    print(f"Batch keys: {batch_out.keys()}")

    # Check Input Values
    if "input_values" in batch_out:
        print(f"Batch input_values shape: {batch_out['input_values'].shape}")

    # Check Labels
    if "labels" in batch_out:
        print(f"Batch labels shape: {batch_out['labels'].shape}")
        print(f"Batch labels (first row): {batch_out['labels'][0]}")
        # Check if padding is -100
        is_padded = (batch_out['labels'] == -100).any()
        print(f"Labels contain -100 padding: {is_padded}")

except Exception as e:
    print(f"Collator Failed: {e}")

--- Single Processed Example ---
Keys: dict_keys(['input_values', 'labels', 'input_length'])
Input values shape: 117400
Labels shape: 55
Labels sample: [10, 4, 9, 5, 5, 14, 4, 6, 8, 4]

--- Testing Data Collator ---
Collator ran successfully.
Batch keys: KeysView({'input_values': tensor([[ 1.8774e-05, -2.6943e-03, -5.2926e-03,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [-2.4078e-04,  8.7171e-05, -2.4078e-04,  ...,  9.1279e-04,
          4.0266e-03,  5.6968e-03]]), 'labels': tensor([[10,  4,  9,  5,  5, 14,  4,  6,  8,  4, 20, 10,  9, 14,  4,  8, 16,  6,
          4, 10, 20,  4, 10,  4, 23, 13,  8, 24,  7, 24, 15, 22,  4, 12,  5,  6,
          4, 16, 23,  4,  7,  4, 29,  8, 10,  9,  6,  4,  7, 19, 19,  8, 16,  9,
          6],
        [11,  5, 22,  4, 10, 27, 17,  4, 19,  7, 15, 15, 10,  9, 21,  4,  6,  8,
          4,  7, 12, 26,  4,  7, 24,  8, 16,  6,  4, 17, 22,  4, 19, 16, 13, 13,
          5,  9,  6,  4,  7, 19, 19,  8, 16,  9,  6,  4, 24,  7, 15,  7,  9, 19,
 

In [None]:
# 1. Check if the labels can be decoded back to text
print("--- Inspecting Tokenizer & Labels ---")
label_ids = batch_out['labels'][0]
# Filter out -100 padding before decoding
label_ids = label_ids[label_ids != -100]

decoded_text = processor.decode(label_ids, group_tokens=False)
print(f"Original Label IDs: {label_ids[:10]}...")
print(f"Decoded Text: '{decoded_text}'")

# 2. Check Vocab Size
vocab_size = len(processor.tokenizer)
print(f"\nTokenizer Vocab Size: {vocab_size}")

# 3. Check if Model Head matches Vocab
print(f"Model Output Layer (lm_head) size: {model.lm_head.out_features}")

if vocab_size != model.lm_head.out_features:
    print("\nWARNING: Model head size does not match tokenizer vocabulary!")
else:
    print("\nModel head matches tokenizer.")

--- Inspecting Tokenizer & Labels ---
Original Label IDs: tensor([10,  4,  9,  5,  5, 14,  4,  6,  8,  4])...
Decoded Text: 'I NEED TO FIND OUT IF I PROBABLY SET UP A JOINT ACCOUNT'

Tokenizer Vocab Size: 32
Model Output Layer (lm_head) size: 32

Model head matches tokenizer.


In [None]:
# Let's look at the actual vocabulary mapping
vocab = processor.tokenizer.get_vocab()

# Sort by ID for easier reading
sorted_vocab = sorted(vocab.items(), key=lambda item: item[1])

print(f"Total Vocab Size: {len(sorted_vocab)}")
print("Token Mapping:")
for token, token_id in sorted_vocab:
    print(f"ID {token_id}: '{token}'")

Total Vocab Size: 32
Token Mapping:
ID 0: '<pad>'
ID 1: '<s>'
ID 2: '</s>'
ID 3: '<unk>'
ID 4: '|'
ID 5: 'E'
ID 6: 'T'
ID 7: 'A'
ID 8: 'O'
ID 9: 'N'
ID 10: 'I'
ID 11: 'H'
ID 12: 'S'
ID 13: 'R'
ID 14: 'D'
ID 15: 'L'
ID 16: 'U'
ID 17: 'M'
ID 18: 'W'
ID 19: 'C'
ID 20: 'F'
ID 21: 'G'
ID 22: 'Y'
ID 23: 'P'
ID 24: 'B'
ID 25: 'V'
ID 26: 'K'
ID 27: '''
ID 28: 'X'
ID 29: 'J'
ID 30: 'Q'
ID 31: 'Z'


In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("keylazy/my_awesome_asr_mind_model")
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Pass your inputs to the model and return the logits:

In [None]:
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("keylazy/my_awesome_asr_mind_model")
with torch.no_grad():
    logits = model(**inputs).logits

Loading weights:   0%|          | 0/213 [00:00<?, ?it/s]

Get the predicted `input_ids` with the highest probability, and use the processor to decode the predicted `input_ids` back into text:

In [None]:
import torch

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
transcription

['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT']
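
What `batch_decode` does to the argmax ids can be approximated by the usual CTC greedy decoding rule: collapse consecutive repeats, drop the blank token, and turn the word delimiter into a space. The sketch below assumes that `<pad>` (ID 0) acts as the CTC blank and `|` (ID 4) is the word delimiter, and uses a few entries from the vocabulary printed earlier; `ctc_greedy_decode` is a hypothetical helper, not the tokenizer's implementation.

```python
def ctc_greedy_decode(ids, id_to_token, blank_id=0, delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the delimiter to a space."""
    chars = []
    prev = None
    for i in ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if i != prev and i != blank_id:
            chars.append(id_to_token[i])
        prev = i
    return "".join(chars).replace(delimiter, " ").strip()


# Subset of the vocabulary shown earlier in this notebook.
id_to_token = {0: "<pad>", 4: "|", 5: "E", 6: "T", 9: "N", 12: "S", 14: "D", 16: "U", 23: "P"}

# Made-up frame-level predictions: S S <pad> E T T | | U <pad> P
frame_ids = [12, 12, 0, 5, 6, 6, 4, 4, 16, 0, 23]
print(ctc_greedy_decode(frame_ids, id_to_token))  # SET UP

# Label ids are not frame-level, so collapsing repeats would corrupt them:
print(ctc_greedy_decode([9, 5, 5, 14], id_to_token))  # NED
```

The second call shows why `compute_metrics` decodes the labels with `group_tokens=False`: the labels for "NEED" contain a genuine double letter that must not be merged.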

### Moonshine

In [None]:
!pip install --upgrade pip
!pip install --upgrade git+https://github.com/huggingface/transformers.git#egg=transformers datasets[audio]


Collecting transformers
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-install-3pni95t5/transformers_7c65942fa35e4f2facf17cdfc6363b77
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-install-3pni95t5/transformers_7c65942fa35e4f2facf17cdfc6363b77
  Resolved https://github.com/huggingface/transformers.git to commit 609e3d585bfe2f78e95f18761dd85ae753da5f1b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
minds = minds.train_test_split(test_size=0.2)

In [None]:
from datasets import Audio

# Force the dataset to resample to 16kHz
minds = minds.cast_column("audio", Audio(sampling_rate=16000))

# Verify the change
print(minds["train"][0]["audio"]["sampling_rate"]) # Should print 16000

16000


In [None]:
from transformers import MoonshineStreamingForConditionalGeneration, AutoProcessor
import torch

# You can change this to "usefulsensors/moonshine-streaming-small" if you want the 123M version.
model_id = "usefulsensors/moonshine-streaming-tiny"

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = MoonshineStreamingForConditionalGeneration.from_pretrained(
    model_id
).to(device).to(torch_dtype)

processor = AutoProcessor.from_pretrained(model_id)

print(f"Moonshine Streaming (Tiny) loaded on {device}")

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/176M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/161 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/167 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/195 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/96.0 [00:00<?, ?B/s]

Moonshine Streaming (Tiny) loaded on cuda


In [None]:
import jiwer
import jiwer.transforms as tr

def evaluate_moonshine_streaming(batch):
    # 1. Prepare inputs
    audio = batch["audio"]
    inputs = processor(
        audio["array"],
        return_tensors="pt",
        sampling_rate=16000 # Explicitly set to match model expectation
    )
    inputs = inputs.to(device, torch_dtype)

    # 2. Calculate Max Length (Anti-Hallucination Logic)
    # The paper suggests 6.5 tokens per second of audio to prevent loops
    token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate
    seq_lens = inputs.attention_mask.sum(dim=-1)
    max_length = int((seq_lens * token_limit_factor).max().item())

    # 3. Generate
    # We enforce max_length here.
    # To test failure modes for Wen, you could comment out `max_length` later.
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_length=max_length)

    # 4. Decode
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    batch["pred_str"] = transcription
    return batch

# Run on the first 10 samples to test
results = minds["test"].select(range(10)).map(evaluate_moonshine_streaming)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

WER: 0.5227272727272727
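
The `max_length` cap computed inside `evaluate_moonshine_streaming` is plain arithmetic: audio duration in seconds times 6.5 tokens per second, truncated to an integer. Here is a small sketch, assuming the attention mask covers one position per audio sample; `max_tokens_for_audio` is a hypothetical helper used only for illustration.

```python
def max_tokens_for_audio(num_samples, sampling_rate=16000, tokens_per_second=6.5):
    """Upper bound on generated tokens for a clip with `num_samples` samples."""
    duration_s = num_samples / sampling_rate
    return int(duration_s * tokens_per_second)


# A clip of 117,400 samples at 16 kHz lasts about 7.3 seconds.
print(max_tokens_for_audio(117_400))  # 47
# A one-second clip is capped at 6 tokens.
print(max_tokens_for_audio(16_000))   # 6
```

For very short clips the cap is tight, so it is worth checking whether short utterances are being truncated.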


In [None]:
import jiwer.transformations as tr

transforms = tr.wer_standardize

print("WER:", jiwer.wer(list(results["transcription"]), list(results["pred_str"]), transforms, transforms))

WER: 0.45555555555555555


In [None]:
list(results["transcription"])

['I need to find out if I probably set up a joint account',
 'show me my account balance please',
 'I would like to set up a joint account can I do that in the app',
 'set up a joint account',
 'account balance',
 'I have to pay bill make payment',
 'I am calling because I would like to set up a joint with my partner as possible and is there a bunny',
 'what is my account balance',
 "what's my current balance",
 "I'd like to set up a joint account"]

In [None]:
import os
from IPython.display import Audio as AudioPlayer

# Access the audio feature which automatically loads the audio data
audio_sample = minds["test"][6]['audio']

# Pass the audio array and its sampling rate to IPython.display.Audio
AudioPlayer(audio_sample['array'], rate=audio_sample['sampling_rate'])

In [None]:
list(results["pred_str"])

['I need to find out about how to set up a joint account please.',
 'Show me my account balance, please.',
 'I would like to set up a challenge. Can I do that in the app?',
 'Set up a joint account.',
 'Account balance.',
 'I have to pay a bill, make a payment, my credit card is due.',
 "I am calling because I would like to set up an account that is joined with my partner. It's this possible and is there a joint account that I can see on the app.",
 'What is my account balance?',
 "What's my current balance?",
 "I'd like to set up a joint account."]

Finding: after listening to the examples, Moonshine's transcripts are qualitatively better than the labels in several cases. In the first example, Moonshine includes the 'please' that the speaker actually says. In the example at index 6, the label mentions a 'bunny', which is clearly wrong, and Moonshine produces a much better transcript.
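
Part of the measured WER likely comes from formatting rather than recognition errors: the predictions carry capitalization and punctuation that the labels lack. The sketch below shows one way to normalize both sides before scoring; `normalize_text` is a hypothetical helper that lowercases and strips punctuation other than apostrophes, and it is not part of jiwer.

```python
import re


def normalize_text(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


# Reference and prediction pairs taken from the outputs above.
reference = "show me my account balance please"
prediction = "Show me my account balance, please."
print(normalize_text(prediction) == normalize_text(reference))  # True

print(normalize_text("What's my current balance?"))  # what's my current balance
```

Applying the same normalization to references and predictions before computing WER separates genuine transcription differences from punctuation and casing differences.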

#### Let's try LibriSpeech

In [None]:
from datasets import load_dataset, Audio
import jiwer
import jiwer.transformations as tr
import numpy as np
import torch

# 1. Load LibriSpeech (Clean Test Set) in Streaming Mode
# This avoids downloading 100GB+ of data
dataset_stream = load_dataset("librispeech_asr", "clean", split="test", streaming=True)

# 2. Resample to 16kHz on the fly
dataset_stream = dataset_stream.cast_column("audio", Audio(sampling_rate=16000))

# 3. Take 10 examples
# converting to list downloads just these 10 audio files
samples = list(dataset_stream.take(10))

print(f"Loaded {len(samples)} examples from LibriSpeech (clean)")
print(f"Sample text: {samples[0]['text']}")

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Loaded 10 examples from LibriSpeech (clean)
Sample text: CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS


In [None]:
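# Standardize text (lowercase, expand contractions, normalize whitespace) for fair comparison.
# Note: jiwer's wer_standardize does not strip punctuation, so punctuation in predictions counts as errors.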
transforms = tr.wer_standardize

def run_inference(batch_samples, noise_level=0.0):
    refs = []
    preds = []

    print(f"--- Running Inference (Noise Level: {noise_level}) ---")

    for i, sample in enumerate(batch_samples):
        # A. Get Audio & Add Noise
        audio_array = sample["audio"]["array"]

        if noise_level > 0:
            noise = np.random.randn(len(audio_array))
            audio_array = audio_array + (noise_level * noise)

        # B. Prepare Inputs
        inputs = processor(
            audio_array,
            return_tensors="pt",
            sampling_rate=16000
        ).to(device, torch_dtype)

        # C. Generate (using the limit logic from before to be safe)
        token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate
        seq_lens = inputs.attention_mask.sum(dim=-1)
        max_length = int((seq_lens * token_limit_factor).max().item())

        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_length=max_length)

        # D. Decode & Store
        transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

        refs.append(sample["text"])
        preds.append(transcription)

        # Print every example to see what's happening (i > -1 is always true)
        if i > -1:
            print(f"Ref:  {sample['text']}")
            print(f"Pred: {transcription}")

    # Calculate WER
    wer = jiwer.wer(
        refs,
        preds,
        reference_transform=transforms,
        hypothesis_transform=transforms
    )
    return wer

# Run Clean Test
clean_wer = run_inference(samples, noise_level=0.0)
print(f"\n✅ Clean Baseline WER: {clean_wer:.2%}")

--- Running Inference (Noise Level: 0.0) ---
Ref:  CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS
Pred: Concord returned to its place amidst the tents.
Ref:  THE ENGLISH FORWARDED TO THE FRENCH BASKETS OF FLOWERS OF WHICH THEY HAD MADE A PLENTIFUL PROVISION TO GREET THE ARRIVAL OF THE YOUNG PRINCESS THE FRENCH IN RETURN INVITED THE ENGLISH TO A SUPPER WHICH WAS TO BE GIVEN THE NEXT DAY
Pred: The English voted to the French baskets of flowers, of which they had made a plentiful provision to greet the arrival of the young princess. The French in return invited the English to a supper, which was to be given the next day.
Ref:  CONGRATULATIONS WERE POURED IN UPON THE PRINCESS EVERYWHERE DURING HER JOURNEY
Pred: Congratulations were poured in upon the princess everywhere during her journey.
Ref:  FROM THE RESPECT PAID HER ON ALL SIDES SHE SEEMED LIKE A QUEEN AND FROM THE ADORATION WITH WHICH SHE WAS TREATED BY TWO OR THREE SHE APPEARED AN OBJECT OF WORSHIP THE QUEEN MOTHER GAVE THE FRENCH T

In [None]:
# Experiment 1: Light Noise (SNR ~20dB)
noise_wer_low = run_inference(samples, noise_level=0.005)
print(f"⚠️ Low Noise WER: {noise_wer_low:.2%}")
print()

# Experiment 2: Heavy Noise (SNR ~10dB)
# This should break the model significantly
noise_wer_high = run_inference(samples, noise_level=0.02)
print(f"❌ High Noise WER: {noise_wer_high:.2%}")

--- Running Inference (Noise Level: 0.005) ---
Ref:  CONCORD RETURNED TO ITS PLACE AMIDST THE TENTS
Pred: Concord returned to its place amidst the tents.
Ref:  THE ENGLISH FORWARDED TO THE FRENCH BASKETS OF FLOWERS OF WHICH THEY HAD MADE A PLENTIFUL PROVISION TO GREET THE ARRIVAL OF THE YOUNG PRINCESS THE FRENCH IN RETURN INVITED THE ENGLISH TO A SUPPER WHICH WAS TO BE GIVEN THE NEXT DAY
Pred: The English voted to the French baskets of flowers of which they had made a plentiful provision to greet the arrival of the young princess. The French in return invited the English to a supper which was to be given the next day.
Ref:  CONGRATULATIONS WERE POURED IN UPON THE PRINCESS EVERYWHERE DURING HER JOURNEY
Pred: Congratulations were poured in upon the princess everywhere during her journey.
Ref:  FROM THE RESPECT PAID HER ON ALL SIDES SHE SEEMED LIKE A QUEEN AND FROM THE ADORATION WITH WHICH SHE WAS TREATED BY TWO OR THREE SHE APPEARED AN OBJECT OF WORSHIP THE QUEEN MOTHER GAVE THE FRENCH T
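
The SNR figures in the comments depend on how loud each clip is, so they are best treated as rough estimates. For white Gaussian noise with standard deviation `noise_level`, the SNR in dB is 20·log10(signal RMS / noise_level). Below is a minimal sketch, assuming a hypothetical clip with an RMS amplitude of 0.05 (real clips will vary); `rms` and `snr_db` are illustrative helpers.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def snr_db(signal_rms, noise_std):
    """Signal-to-noise ratio in dB for additive noise with standard deviation noise_std."""
    return 20 * math.log10(signal_rms / noise_std)


# Hypothetical clip: a square wave with amplitude 0.05, so its RMS is 0.05.
clip = [0.05, -0.05] * 100
clip_rms = rms(clip)

for level in (0.005, 0.02, 0.05):
    print(level, round(snr_db(clip_rms, level), 1))
# 0.005 20.0
# 0.02 8.0
# 0.05 0.0
```

Computing `rms` on the actual `audio_array` of each sample would give the SNR that a given `noise_level` really produces for that clip.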

In [None]:
import numpy as np
import IPython.display as ipd

# 1. Pick one sample (index 3) from your list of 10 LibriSpeech examples
example_idx = 3
sample = samples[example_idx]
audio_clean = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]

print(f"📖 Text: {sample['text']}")

# 2. Define Noise Injector
def add_noise(audio, noise_level):
    noise = np.random.randn(len(audio))
    # Add noise and clip to avoid digital distortion
    return np.clip(audio + (noise_level * noise), -1.0, 1.0)

# 3. Create 3 Scenarios
audio_low_noise  = add_noise(audio_clean, noise_level=0.005) # Light Hiss
audio_med_noise  = add_noise(audio_clean, noise_level=0.02)  # Noticeable Static
audio_high_noise = add_noise(audio_clean, noise_level=0.05)  # "Extreme" (Expect Failures)

# 4. Listen
print("\n--- 🟢 Clean Audio ---")
display(ipd.Audio(audio_clean, rate=sr))

print("\n--- 🟡 Low Noise (0.005) - Cafe/Office ---")
display(ipd.Audio(audio_low_noise, rate=sr))

print("\n--- 🟠 Medium Noise (0.02) - Street/Phone ---")
display(ipd.Audio(audio_med_noise, rate=sr))

print("\n--- 🔴 High Noise (0.05) - Construction/Wind (ASR Killer) ---")
display(ipd.Audio(audio_high_noise, rate=sr))

📖 Text: FROM THE RESPECT PAID HER ON ALL SIDES SHE SEEMED LIKE A QUEEN AND FROM THE ADORATION WITH WHICH SHE WAS TREATED BY TWO OR THREE SHE APPEARED AN OBJECT OF WORSHIP THE QUEEN MOTHER GAVE THE FRENCH THE MOST AFFECTIONATE RECEPTION FRANCE WAS HER NATIVE COUNTRY AND SHE HAD SUFFERED TOO MUCH UNHAPPINESS IN ENGLAND FOR ENGLAND TO HAVE MADE HER FORGET FRANCE

--- 🟢 Clean Audio ---



--- 🟡 Low Noise (0.005) - Cafe/Office ---



--- 🟠 Medium Noise (0.02) - Street/Phone ---



--- 🔴 High Noise (0.05) - Construction/Wind (ASR Killer) ---
