# 🎙️ Speech Recognition with Python
This notebook demonstrates two ways of performing **speech-to-text** in Python:
- `SpeechRecognition` (Google Web Speech API)
- A Hugging Face pretrained model (**Wav2Vec2**)

## Step 1: Install Dependencies

In [None]:
!pip install SpeechRecognition pydub transformers datasets torchaudio librosa --quiet

## Step 2: Import Libraries

In [None]:
import speech_recognition as sr
from pydub import AudioSegment
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import numpy as np

## Step 3: Load or Record Audio

In [None]:
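import os
from google.colab import files

uploaded = files.upload()  # upload your .wav or .mp3 file

# Convert MP3 to WAV, since sr.AudioFile reads WAV/AIFF/FLAC but not MP3.
# If several files are uploaded, the last one is used.
audio_path = None
for fn in uploaded:
    base, ext = os.path.splitext(fn)
    if ext.lower() == ".mp3":  # also matches .MP3
        audio_path = base + ".wav"
        AudioSegment.from_mp3(fn).export(audio_path, format="wav")
    else:
        audio_path = fn

if audio_path is None:
    raise ValueError("No file was uploaded.")
print("Audio ready:", audio_path)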
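
Before transcribing, it can help to check what was actually uploaded, since Wav2Vec2 is fed 16 kHz audio in Step 5. The following is a minimal sketch using only the standard-library `wave` module; the helper name `describe_wav` is hypothetical and not part of any library used in this notebook, and it only handles uncompressed PCM WAV files.

```python
import wave

def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": frames / rate,
        }
```

Calling `describe_wav(audio_path)` after the upload cell shows the channel count, sample rate, and duration of the file that the later steps will read.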

## Step 4: Baseline – Google SpeechRecognition API

In [None]:
recognizer = sr.Recognizer()
with sr.AudioFile(audio_path) as source:
    audio_data = recognizer.record(source)
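try:
    text = recognizer.recognize_google(audio_data)
    print("🔹 Recognized Text (Google API):", text)
except sr.UnknownValueError:
    # The service returned no usable transcription
    print("Error: Google API could not understand the audio")
except sr.RequestError as e:
    # Network problem or the service rejected the request
    print("Error: Google API request failed:", e)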
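
The cell above sends the whole recording in a single request. For longer files, one option is to transcribe in fixed-length windows instead. The sketch below only computes the windows; the helper name `chunk_windows` and the 30-second default are assumptions chosen for illustration, not limits documented by the service.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0):
    """Split a duration into consecutive (start, length) windows (hypothetical helper)."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        # The last window may be shorter than chunk_seconds
        length = min(float(chunk_seconds), total_seconds - start)
        windows.append((start, length))
        start += chunk_seconds
    return windows

# Possible use with the recognizer from this step (not executed here):
# with sr.AudioFile(audio_path) as source:
#     for _, length in chunk_windows(source.DURATION):
#         piece = recognizer.record(source, duration=length)  # continues where the last read stopped
#         print(recognizer.recognize_google(piece))
```

Cutting at fixed offsets can split a word across two windows, so results near the boundaries should be treated with some caution.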

## Step 5: Deep Learning – Wav2Vec2 Pretrained Model

In [None]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

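# The model was trained on 16 kHz audio; librosa resamples to 16 kHz and downmixes to mono on load
speech, rate = librosa.load(audio_path, sr=16000)
input_values = processor(speech, return_tensors="pt", sampling_rate=16000).input_values

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(input_values).logits

# Greedy CTC decoding: take the most likely token per frame, then let the processor
# merge repeated tokens and drop blanks
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print("🔹 Recognized Text (Wav2Vec2):", transcription)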
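
The `argmax` followed by `batch_decode` in the cell above is greedy CTC decoding: pick the most likely token for each audio frame, merge consecutive repeats, then remove the blank token. The sketch below illustrates that collapse rule in plain Python. The function name `ctc_greedy_collapse`, the toy vocabulary, and the choice of `0` as the blank id are assumptions made for illustration; it is not the processor's actual implementation.

```python
from itertools import groupby

def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse per-frame ids: merge consecutive repeats, then drop blanks."""
    merged = [token for token, _ in groupby(ids)]
    return [token for token in merged if token != blank_id]

# Hypothetical toy vocabulary, for illustration only (not the model's real vocabulary)
toy_vocab = {1: "H", 2: "I", 3: "L", 4: "O", 5: "E"}

# Per-frame argmax output; the blank (0) between the two runs of 3 keeps the double "L"
frame_ids = [1, 1, 0, 5, 3, 3, 0, 3, 4, 4]
text = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # prints HELLO
```

The blank token is what lets CTC represent genuinely doubled letters: without a blank between them, the two runs of `3` would merge into a single "L".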

## Step 6: Evaluation (Word Error Rate)

In [None]:
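!pip install evaluate jiwer --quiet
import evaluate

# `datasets.load_metric` was deprecated and later removed from `datasets`;
# the `evaluate` library provides the same WER metric (it relies on `jiwer`)
wer_metric = evaluate.load("wer")

# Replace this placeholder with the actual reference transcript of your audio
ground_truth = "this is a sample sentence"
# Wav2Vec2 outputs uppercase text, so both sides are lowercased before comparison
wer = wer_metric.compute(predictions=[transcription.lower()], references=[ground_truth.lower()])
print(f"WER: {wer:.2f}")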
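
Word error rate is the word-level edit distance between the reference and the prediction, divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. As a rough sketch of what the metric computes, the hypothetical function `word_error_rate` below implements this with a standard dynamic-programming edit distance; it splits on whitespace only and does no other text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# One of five reference words is missing from the hypothesis
print(word_error_rate("this is a sample sentence", "this is sample sentence"))  # prints 0.2
```

Because insertions count as errors, WER can exceed 1.0 when the prediction is much longer than the reference.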

## ✅ Future Enhancements
- Train on a custom dataset
- Add real-time microphone recording
- Deploy as a Flask or Streamlit app