<a href="https://colab.research.google.com/github/usshaa/SkillzRevozNLP/blob/main/04_NLP/03_speech_recognition_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎙️ Speech Recognition with Python
This notebook demonstrates two ways of performing **speech-to-text**:
- `SpeechRecognition` (Google Web Speech API)
- A pretrained Hugging Face model (**Wav2Vec2**)

## Step 1: Install Dependencies

In [1]:
!pip install SpeechRecognition pydub transformers datasets torchaudio --quiet

## Step 2: Import Libraries

In [2]:
import speech_recognition as sr
from pydub import AudioSegment
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import numpy as np


## Step 3: Load or Record Audio

In [3]:
from google.colab import files
uploaded = files.upload()  # upload your .wav or .mp3 file

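# Convert MP3 uploads to WAV (sr.AudioFile reads WAV, AIFF, or FLAC, not MP3).
# If several files are uploaded, the last one is used.
audio_path = None
for fn in uploaded.keys():
    if fn.lower().endswith('.mp3'):
        sound = AudioSegment.from_mp3(fn)
        fn_wav = fn[:-4] + '.wav'  # swap only the extension
        sound.export(fn_wav, format="wav")
        audio_path = fn_wav
    else:
        audio_path = fn
if audio_path is None:
    raise ValueError("No file was uploaded")
print("Audio ready:", audio_path)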

Saving Speaker26_000.wav to Speaker26_000.wav
Audio ready: Speaker26_000.wav
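
Before transcribing, it can help to check what was actually uploaded, since Step 5 resamples the audio to 16 kHz for Wav2Vec2. The sketch below reads the WAV header with the standard-library `wave` module, which handles uncompressed PCM WAV only. The helpers `wav_info` and `write_test_tone` and the synthetic tone file are hypothetical names used for illustration; they are not part of the notebook.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Return basic header fields of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": frames / rate,
        }


def write_test_tone(path, rate=16000, seconds=1.0, freq=440.0):
    """Write a mono 16-bit sine tone so the demo needs no real recording."""
    n = int(rate * seconds)
    samples = (int(32767 * 0.3 * math.sin(2 * math.pi * freq * t / rate)) for t in range(n))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes = 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))


demo_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
write_test_tone(demo_path)
info = wav_info(demo_path)
print(info)
```

In the notebook, `wav_info(audio_path)` would report the same fields for the uploaded file, assuming it is a PCM WAV.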


## Step 4: Baseline – Google SpeechRecognition API

In [4]:
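recognizer = sr.Recognizer()
with sr.AudioFile(audio_path) as source:
    audio_data = recognizer.record(source)  # read the whole file into memory
try:
    text = recognizer.recognize_google(audio_data)
    print("🔹 Recognized Text (Google API):", text)
except sr.UnknownValueError:
    print("Error: the recognizer could not understand the audio")
except sr.RequestError as e:
    print("Error: request to the Google Web Speech API failed:", e)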

🔹 Recognized Text (Google API): section 0 of Aesop's Fables a new revised version by Aesop this labor box recording is in the public domain preface the following are some of Aesop's Best Loved fables the goose with the golden eggs a certain man had the Good Fortune to possess a goose that laid him a golden egg everyday but I satisfied with so slow and thinking to see the whole treasure at once he killed the goose and cutting her open her just what any other Goose would be much more and loses all the town Mouse and The Country Mouse a country mouse invited a townhouse
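
The Google transcript above is mixed-case, whereas the Wav2Vec2 checkpoint used in Step 5 emits upper-case text. If both were scored against one reference, such surface differences would count as word errors. A minimal normalization sketch follows; `normalize_transcript` is a hypothetical helper, and the rule set (lowercase, keep letters, digits, and apostrophes) is an assumption rather than a standard. It does not map digits to words, so "0" and "zero" would still differ.

```python
import re


def normalize_transcript(text):
    """Lowercase, replace everything except a-z, 0-9, apostrophes and spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


print(normalize_transcript("Aesop's Best Loved fables"))   # aesop's best loved fables
print(normalize_transcript("ESOP'S  BEST LOVED FABLES"))   # esop's best loved fables
```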


## Step 5: Deep Learning – Wav2Vec2 Pretrained Model

In [5]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, rate = librosa.load(audio_path, sr=16000)
input_values = processor(speech, return_tensors="pt", sampling_rate=16000).input_values

with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print("🔹 Recognized Text (Wav2Vec2):", transcription)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


🔹 Recognized Text (Wav2Vec2): SECTION ZERO OF EESOP'S FABLES A NEW REVISED VERSION BY EESOP THIS SLEEVER BOX RECORDING IS IN THE PUBLIC DOMAIN PREFACE THE FOLLOWING ARE SOME OF ESOP'S BEST LOVED FABLES THE GOOSE WITH THE GOLDEN EGGS A CERTAIN MAN HAD THE GOOD FORTUNE TO POSSESS A GOOSE THAT LAID HIM A GOLDEN EGG EVERY DAY BUT DISSATISFIED WITH SO SLOW AN INCOME AND THINKING TO SEIZE THE WHOLE TREASURE AT ONCE HE KILL THE GOOSE AND CUTTING HER OPEN FOUND HER JUST WHAT ANY OTHER GOOSE WOULD BE MUCH ONCE MORE AND LOOSES ALL THE TOWN MOUSE AND THE COUNTRY MOUSE A COUNTRY MOUSE INVITED A TOWN MOUSE AND INTIMATE FRIEND TO PAY HIM A VISIT AND PARTAKE OF HIS COUNTRY FAR AS THEY WERE ON THE BARE PLAUED LANDS EATING THEIR WHEAT STA
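
The `torch.argmax` plus `processor.batch_decode` pair in the cell above amounts to greedy CTC decoding: pick the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below reproduces that collapse rule in plain Python on a toy example. The vocabulary and the frame ids are invented for illustration and are not the real `facebook/wav2vec2-base-960h` vocabulary; `ctc_greedy_decode` is a hypothetical helper.

```python
from itertools import groupby

# Toy vocabulary (an assumption for this sketch): id 0 is the CTC blank,
# and "|" marks a word boundary.
VOCAB = {0: "<pad>", 1: "|", 2: "G", 3: "O", 4: "S", 5: "E"}
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into text using the CTC rule."""
    # 1) Merge runs of identical ids (one token can span many frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) Remove blanks; a blank between two equal ids keeps a doubled letter.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # 3) Turn word delimiters into spaces.
    return "".join(tokens).replace("|", " ").strip()


frames = [2, 2, 0, 3, 3, 0, 3, 4, 4, 5, 1, 1]
print(ctc_greedy_decode(frames))  # GOOSE
```

The blank between the two runs of `O` is what lets the double letter in "GOOSE" survive the merge step.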


## Step 6: Evaluation (Word Error Rate)

In [12]:
!pip install jiwer --quiet

In [9]:
!pip install evaluate --quiet

In [14]:
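# datasets.load_metric is deprecated; the evaluate library replaces it
from evaluate import load
wer_metric = load("wer")

# Reference transcript. Note: this string is copied from the Wav2Vec2 output
# above with one character changed, so the WER below does not measure real
# accuracy. Substitute a human-verified transcript for a meaningful score.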
ground_truth = "SECTION ZERO OF EESOP'S FABLES A NEW REVISED VERSION BY EESOP THIS SLEEVER BOX RECORDING IS IN THE PUBLIC DOMAIN PREFACE THE FOLLOWING ARE SOME OF ESOP'S BEST LOVED FABLES THE GOOSE WITH THE GOLDEN EGGS A CERTAIN MAN HAD THE GOOD FORTUNE TO POSSESS A GOOSE THAT LAID HIM A GOLDEN EGG EVERY DAY BUT DISSATISFIED WITH SO SLOW AN INCOME AND THINKING TO SEIZE THE WHOLE TREASURE AT ONCE HE KILL THE GOOSE AND CUTTING HER OPEN FOUND HER JUST WHAT ANY OTHER GOOSE WOULD BE MUCH ONCE MORE AND LOOSES ALL THE TOWN MOUSE AND THE COUNTRY MOUSE A COUNTRY MOUSE INVITED A TOWN MOUSE AND INTIMATE FRIEND TO PAY HIM A VISIT AND PARTAKE OF HIS COUNTRY FAR AS THEY WERE ON THE BARE PLAUED LANDS EATING THEIR WHEAT STAe"
wer = wer_metric.compute(predictions=[transcription.lower()], references=[ground_truth.lower()])
print(f"WER: {wer:.2f}")

WER: 0.01
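
Word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The score above is tiny because the reference differs from the prediction in a single word out of more than a hundred. The sketch below computes the same quantity without external libraries, as a way to see what the metric does; `word_error_rate` is a hypothetical helper, not the `evaluate` or `jiwer` implementation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the goose with the golden eggs"
hypothesis = "the goose with golden egg"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 0.333
```

Here one deleted word ("the") and one substitution ("eggs" to "egg") give 2 errors over 6 reference words.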
