<a href="https://colab.research.google.com/github/vijaygwu/SEAS8525/blob/main/Class_6_Nemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


1. **`import nemo.collections.asr as nemo_asr`**  
   Pulls in NeMo’s **speech-recognition collection** and gives it the shorter nickname `nemo_asr`. NeMo is a modular NVIDIA framework; this import brings in ready-made ASR models, audio helpers, and decoding utilities.

2. **`ASRModel.from_pretrained("nvidia/stt_en_conformer_ctc_small")`**  
   * Downloads a 13 M-parameter **Conformer-CTC “small” model** from NVIDIA’s model hub (Hugging Face / NGC).  
   * Caches its `.nemo` checkpoint (weights plus config) locally, under `~/.cache/huggingface/hub/…` in the run shown in this notebook, and builds a PyTorch model object on the current GPU/CPU.  
   * Conformer is a hybrid CNN + Transformer encoder; the **CTC** head lets it emit all tokens in one pass instead of step-by-step, so it runs fast even on a Colab T4.  ([STT En Conformer-CTC Small - NGC Catalog - NVIDIA](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_small))

3. **`!wget -q https://…2086-149220-0033.wav`**  
   Shell escape (`!`) grabs a 16 kHz WAV from LibriSpeech, a corpus of read English audiobook speech. The `-q` flag keeps wget quiet. The file ends up in the current working directory (`/content/` on Colab).

4. **Upload block (commented-out)**  
   Shows how you could swap in your own audio via Colab’s `files.upload()`. Only one line to change: `audio_path`.

5. **`audio_path = "2086-149220-0033.wav"`**  
   Just sets the variable that will be fed to the model. (If you uploaded something, you’d overwrite this.)

6. **`transcript = asr_model.transcribe([audio_path])[0]`**  
   * `transcribe()` is a high-level helper that accepts a **list** of paths so it can batch multiple clips.  
   * Under the hood it:
     1. Loads each file and resamples it to mono 16 kHz.  
     2. Converts waveforms to log-mel features.  
     3. Runs the Conformer encoder, then the linear CTC output layer, to get per-frame token probabilities.  
     4. Decodes those per-frame outputs to text, greedily by default (beam search is an optional alternative).  
   * It returns a list of **Hypothesis** objects (one per clip) that hold `.text`, token IDs, and log-prob scores. The `[0]` grabs the first (and only) hypothesis.  ([NeMo/examples/asr/transcribe_speech.py at main - GitHub](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/transcribe_speech.py), [NeMo/nemo/collections/asr/models/ctc_models.py at main - GitHub](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/ctc_models.py))

7. **`print("Transcript:", transcript.text)`**  
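   Prints the transcription, which this model emits as lower-case text without sentence punctuation. For the sample clip:

   ```
   Transcript: well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait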
   ```
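
The greedy decoding in step 6 can be illustrated with a minimal sketch. It assumes a toy four-symbol vocabulary with the blank at index 0 and a hand-written list of per-frame argmax labels; both are hypothetical, and the real model uses a 1024-token SentencePiece vocabulary with its own blank index. The rule itself (merge consecutive repeats, then drop blanks) is the standard CTC collapse:

```python
from itertools import groupby

BLANK = 0  # assumed blank index for this toy vocabulary only


def ctc_greedy_decode(frame_ids, vocab, blank=BLANK):
    """Merge consecutive duplicate labels, then remove blank labels."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank)


# Hypothetical vocabulary and per-frame argmax labels (not real model output).
vocab = ["_", "c", "a", "t"]
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, vocab))  # cat
```

A blank between two identical labels is what lets CTC emit a doubled symbol: `[2, 2, 0, 2]` decodes to `aa`, whereas `[2, 2, 2]` decodes to a single `a`.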

---

* **Latency & memory:** As a rough guide, this small Conformer needs about 400 MB of GPU RAM and transcribes a clip of a few seconds in roughly 40 ms on a T4.  
* **Why CTC?** The model predicts the subword tokens for every audio frame in a single forward pass, so there is no autoregressive loop, which keeps decoding cheap and suits low-latency applications.  
* **Want timestamps?** `asr_model.transcribe([...], timestamps=True)` adds word- and char-level time offsets.  
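
To check latency figures like the one above on your own hardware, one option is to measure the real-time factor (processing time divided by audio duration). The sketch below uses only the standard library; the helper names `wav_duration_seconds` and `real_time_factor` are made up for illustration, and it assumes the input is an uncompressed PCM WAV that Python's `wave` module can read:

```python
import time
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(transcribe_fn, path):
    """Time one call to transcribe_fn(path) and divide by the clip duration.

    Values below 1.0 mean the clip is processed faster than it plays.
    """
    start = time.perf_counter()
    transcribe_fn(path)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration_seconds(path)
```

In this notebook it could be called as `real_time_factor(lambda p: asr_model.transcribe([p]), audio_path)` once the model and audio file are loaded. Timing a second call rather than the first avoids counting one-off warm-up costs.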


In [3]:
!pip install -q Cython packaging 'nemo_toolkit[asr]'
!pip install numpy==1.24.3 -q --no-deps

import nemo.collections.asr as nemo_asr
print("✔ NeMo loaded inside venv:", nemo_asr.__version__)




✔ NeMo loaded inside venv: 2.2.1


In [4]:
import nemo.collections.asr as nemo_asr

# 13 M-parameter English model (~100 MB download, fits on a T4)
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/stt_en_conformer_ctc_small"
)


[NeMo I 2025-04-25 18:40:58 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2025-04-25 18:40:58 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/NeMo_ASR_SET/English/v2.0/train/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: true
    tarred_audio_filepaths: /data/NeMo_ASR_SET/English/v2.0/train/audio__OP_0..4095_CL_.tar
    
[NeMo W 2025-04-25 18:40:58 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath:
    - /data/ASR/LibriSpeech/librisp

[NeMo I 2025-04-25 18:40:58 nemo_logging:393] PADDING: 0
[NeMo I 2025-04-25 18:40:59 nemo_logging:393] Model EncDecCTCModelBPE was successfully restored from /root/.cache/huggingface/hub/models--nvidia--stt_en_conformer_ctc_small/snapshots/e5b9941cc1b0b8a08c29b31a111c674f3040a80f/stt_en_conformer_ctc_small.nemo.


In [5]:
# Option A: download a public 16 kHz sample (LibriSpeech utterance, ~240 kB)
!wget -q https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

# Option B (uncomment): upload your own .wav
# from google.colab import files
# uploaded = files.upload()
# audio_path = next(iter(uploaded))   # first uploaded file name
audio_path = "2086-149220-0033.wav"   # comment out if you used Option B

# Run inference (returns a list of Hypothesis objects, one per input file)
transcript = asr_model.transcribe([audio_path])[0]
print("Transcript:", transcript.text)


Transcribing: 100%|██████████| 1/1 [00:00<00:00, 23.60it/s]

Transcript: well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait
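
The cell above reads `.text` from the returned object, which matches the NeMo 2.2.1 run shown here. Some other NeMo releases are understood to return plain strings from `transcribe()` by default, so a small defensive helper can make the notebook less version-sensitive. This is a sketch under that assumption; `hypothesis_text` is a hypothetical name, and `SimpleNamespace` stands in for a real Hypothesis object:

```python
from types import SimpleNamespace


def hypothesis_text(result):
    """Return the transcript from either a plain string or an object with .text."""
    if isinstance(result, str):
        return result
    return result.text


# Stand-in for a Hypothesis object; only the .text attribute is modelled here.
fake_hypothesis = SimpleNamespace(text="hello world")
print(hypothesis_text(fake_hypothesis))  # hello world
print(hypothesis_text("hello world"))    # hello world
```

With it, the print line could become `print("Transcript:", hypothesis_text(transcript))`.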



