<a href="https://colab.research.google.com/github/srinijalanda93/SPR_LAB/blob/main/SPR_L3_2448526.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install openai-whisper vosk google-cloud-speech pydub soundfile pandas
!apt-get install ffmpeg -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


In [None]:
from google.colab import files
import os

os.makedirs("audio_files", exist_ok=True)
uploaded = files.upload()

for fn in uploaded.keys():
    os.rename(fn, f"audio_files/{fn}")
print("✅ Files uploaded to audio_files/")

Saving soft_voice.wav to soft_voice.wav
Saving noisy_background.wav to noisy_background.wav
Saving fast_speech.wav to fast_speech.wav
Saving clear_male.wav to clear_male.wav
Saving clear_female.wav to clear_female.wav
✅ Files uploaded to audio_files/
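
The recognition cells convert every upload to 16 kHz mono before transcription. As a rough sanity check, the header of a WAV file can be inspected with the standard-library `wave` module. The sketch below is illustrative only: the helper names `describe_wav` and `is_16k_mono_pcm16` are hypothetical and not part of the notebook, and the demo writes one second of silence to a temporary file because the uploaded clips are not available here.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }


def is_16k_mono_pcm16(path):
    """True if the file is 16 kHz, single channel, 16-bit PCM."""
    info = describe_wav(path)
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["sample_width_bytes"] == 2
    )


# Demo on a synthetic file (assumption: stands in for an uploaded clip).
tmp = os.path.join(tempfile.mkdtemp(), "silence_16k.wav")
with wave.open(tmp, "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)   # 16 kHz
    wf.writeframes(b"\x00\x00" * 16000)  # one second of silence

print(describe_wav(tmp))
print(is_16k_mono_pcm16(tmp))
```

Running the same check on a converted file such as `audio_files/clear_male_16k.wav` would confirm that the conversion step produced the expected format.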


In [None]:
!pip install SpeechRecognition



In [None]:
import os, wave, json
from pathlib import Path
import pandas as pd
from pydub import AudioSegment

# Whisper
import whisper

# Vosk
from vosk import Model as VoskModel, KaldiRecognizer

# Google (speech_recognition, free API)
import speech_recognition as sr

# ---------------------------
# Convert audio to 16kHz mono
# ---------------------------
def ensure_wav_16k_mono(src_path, dst_path):
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(dst_path, format="wav")

def read_wave_frames(path):
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    return sample_rate, frames

# ---------------------------
# Whisper recognizer
# ---------------------------
whisper_model = whisper.load_model("base")

def whisper_recognize(wav_path):
    res = whisper_model.transcribe(wav_path, verbose=False)
    return res.get("text", "").strip()

# ---------------------------
# Vosk recognizer
# ---------------------------
!wget -q https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
!unzip -q vosk-model-small-en-us-0.15.zip
vosk_model = VoskModel("vosk-model-small-en-us-0.15")

def vosk_recognize(wav_path):
    # Use a distinct name so the speech_recognition alias `sr` is not shadowed.
    sample_rate, frames = read_wave_frames(wav_path)
    rec = KaldiRecognizer(vosk_model, sample_rate)
    rec.AcceptWaveform(frames)
    res = json.loads(rec.Result())
    return res.get("text", "").strip()

# ---------------------------
# Google recognizer (free API)
# ---------------------------
def google_recognize(wav_path):
    r = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio_data = r.record(source)
    try:
        return r.recognize_google(audio_data)
    except sr.UnknownValueError:
        return "Google API could not understand audio"
    except sr.RequestError as e:
        return f"Google API request failed: {e}"

# ---------------------------
# Run recognition on folder
# ---------------------------
rows = []
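# Snapshot the listing first: the loop writes new files into this folder.
for file in list(Path("audio_files").iterdir()):
    if file.stem.endswith("_16k"):
        continue  # skip converted copies left by an earlier run
    base = file.stem
    preproc = f"audio_files/{base}_16k.wav"
    ensure_wav_16k_mono(str(file), preproc)

    print(f"\n🎤 Processing {file.name} ...")
    whisper_text = vosk_text = google_text = ""
    notes = ""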

    # Whisper
    try:
        print("Recognizing with Whisper...")
        whisper_text = whisper_recognize(preproc)
    except Exception as e:
        notes += f"Whisper error: {e} | "

    # Vosk
    try:
        print("Recognizing with Vosk...")
        vosk_text = vosk_recognize(preproc)
    except Exception as e:
        notes += f"Vosk error: {e} | "

    # Google (free API)
    try:
        print("Recognizing with Google API...")
        google_text = google_recognize(preproc)
    except Exception as e:
        notes += f"Google error: {e} | "

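    # Surface any collected errors instead of silently discarding them.
    if notes:
        print(f"Notes: {notes.rstrip(' |')}")

    rows.append({
        "Audio Type": base,
        "Whisper Output": whisper_text,
        "Vosk Output": vosk_text,
        "Google API Output": google_text
    })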

df = pd.DataFrame(rows)
df


🎤 Processing fast_speech.wav ...
Recognizing with Whisper...
Detected language: English


100%|██████████| 1154/1154 [00:00<00:00, 3744.19frames/s]


Recognizing with Vosk...
Recognizing with Google API...

🎤 Processing noisy_background.wav ...
Recognizing with Whisper...
Detected language: English


100%|██████████| 885/885 [00:00<00:00, 3460.91frames/s]


Recognizing with Vosk...
Recognizing with Google API...

🎤 Processing clear_male.wav ...
Recognizing with Whisper...
Detected language: English


100%|██████████| 788/788 [00:00<00:00, 3919.21frames/s]


Recognizing with Vosk...
Recognizing with Google API...

🎤 Processing clear_female.wav ...
Recognizing with Whisper...
Detected language: English


100%|██████████| 728/728 [00:00<00:00, 3287.85frames/s]


Recognizing with Vosk...
Recognizing with Google API...

🎤 Processing soft_voice.wav ...
Recognizing with Whisper...
Detected language: English


100%|██████████| 1063/1063 [00:00<00:00, 4306.97frames/s]


Recognizing with Vosk...
Recognizing with Google API...


,Audio Type,Whisper Output,Vosk Output,Google API Output
0,fast_speech,Deep learning models such as convolutional neu...,deep learning models such as convolution or ne...,convolutional neural networks and recurrent ne...
1,noisy_background,"In artificial intelligence, reinforcement lear...",the artificial intelligence reinforcement lear...,can artificial intelligence reinforcement lear...
2,clear_male,This is a clear male voice. Artificial intelli...,this is a clear male voice artificial intellig...,this is a clear male voice artificial intellig...
3,clear_female,This is a clear female voice. Neural networks ...,this is a clear female voice neural networks l...,this is a clear female voice neural networks l...
4,soft_voice,"As an AI and machine learning student, I am ex...",as and a i and machine learning student i am e...,as an AI and machine learning student I am exp...
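
The table makes it possible to compare the three engines by eye, but a word error rate (WER) gives a single number per clip. The notebook contains no ground-truth transcripts, so the sketch below treats the Whisper output as a stand-in reference; that is an assumption made purely for illustration, and a real evaluation would need the scripts that were actually read aloud. The helper names `normalize_words` and `word_error_rate` are hypothetical, and the strings are copied from the `clear_male` results.

```python
import re


def normalize_words(text):
    """Lowercase and drop punctuation so engines are compared on words only."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize_words(reference)
    hyp = normalize_words(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Assumption: Whisper's transcript is used as the reference text.
reference = ("This is a clear male voice. Artificial intelligence and "
             "machine learning are transforming industries worldwide.")
vosk_out = ("this is a clear male voice artificial intelligence and "
            "machine learning a transforming industries worldwide")
google_out = ("this is a clear male voice artificial intelligence and "
              "machine learning are transforming Industries worldwide")

print(f"Vosk WER:   {word_error_rate(reference, vosk_out):.3f}")    # 0.067
print(f"Google WER: {word_error_rate(reference, google_out):.3f}")  # 0.000
```

Normalizing case and punctuation first matters here because Whisper returns punctuated, capitalized text while Vosk returns lowercase text without punctuation; without that step, formatting differences would be counted as recognition errors.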


In [None]:
import os, wave, json
from pathlib import Path
import pandas as pd
from pydub import AudioSegment

# Whisper
import whisper

# Vosk
from vosk import Model as VoskModel, KaldiRecognizer

# Google (speech_recognition, free API)
import speech_recognition as sr

# ---------------------------
# Convert audio to 16kHz mono
# ---------------------------
def ensure_wav_16k_mono(src_path, dst_path):
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(dst_path, format="wav")

def read_wave_frames(path):
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    return sample_rate, frames

# ---------------------------
# Whisper recognizer
# ---------------------------
whisper_model = whisper.load_model("base")

def whisper_recognize(wav_path):
    print("Recognizing with Whisper...")
    res = whisper_model.transcribe(wav_path, verbose=False)
    text = res.get("text", "").strip()
    print(f"Whisper Output: {text}\n")
    return text

# ---------------------------
# Vosk recognizer
# ---------------------------
vosk_model = VoskModel("vosk-model-small-en-us-0.15")  # make sure model is downloaded

def vosk_recognize(wav_path):
    print("Recognizing with Vosk...")
    sr_val, frames = read_wave_frames(wav_path)
    rec = KaldiRecognizer(vosk_model, sr_val)
    rec.AcceptWaveform(frames)
    res = json.loads(rec.Result())
    text = res.get("text", "").strip()
    print(f"Vosk Output: {text}\n")
    return text

# ---------------------------
# Google recognizer (free API)
# ---------------------------
def google_recognize(wav_path):
    print("Recognizing with Google API...")
    r = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio_data = r.record(source)
    try:
        text = r.recognize_google(audio_data)
    except sr.UnknownValueError:
        text = "Google API could not understand audio"
    except sr.RequestError as e:
        text = f"Google API request failed: {e}"
    print(f"Google API Output: {text}\n")
    return text

# ---------------------------
# Run recognition on folder
# ---------------------------
rows = []

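# Snapshot the listing first: the loop writes new files into this folder.
for file in list(Path("audio_files").iterdir()):
    # Skip converted copies left in audio_files/ by an earlier run;
    # without this check every clip is transcribed twice.
    if file.stem.endswith("_16k"):
        continue
    base = file.stem
    preproc = f"audio_files/{base}_16k.wav"
    ensure_wav_16k_mono(str(file), preproc)

    print(f"\n🎤 Processing {file.name} ...\n")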

    # Initialize outputs
    whisper_text = vosk_text = google_text = ""

    # Whisper
    try:
        whisper_text = whisper_recognize(preproc)
    except Exception as e:
        whisper_text = f"Whisper error: {e}"
        print(whisper_text)

    # Vosk
    try:
        vosk_text = vosk_recognize(preproc)
    except Exception as e:
        vosk_text = f"Vosk error: {e}"
        print(vosk_text)

    # Google
    try:
        google_text = google_recognize(preproc)
    except Exception as e:
        google_text = f"Google error: {e}"
        print(google_text)

    # Save to DataFrame
    rows.append({
        "Audio Type": base,
        "Whisper Output": whisper_text,
        "Vosk Output": vosk_text,
        "Google API Output": google_text
    })

# Final comparison table
df = pd.DataFrame(rows)
print("\n✅ Summary Table:\n")
print(df)


🎤 Processing clear_male_16k.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 788/788 [00:00<00:00, 3978.73frames/s]


Whisper Output: This is a clear male voice. Artificial intelligence and machine learning are transforming industries worldwide.

Recognizing with Vosk...
Vosk Output: this is a clear male voice artificial intelligence and machine learning a transforming industries worldwide

Recognizing with Google API...
Google API Output: this is a clear male voice artificial intelligence and machine learning are transforming Industries worldwide


🎤 Processing fast_speech.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 1154/1154 [00:00<00:00, 4030.90frames/s]


Whisper Output: Deep learning models such as convolutional neural networks and recurrent neural networks are widely used in image recognition, natural language processing and speech analysis.

Recognizing with Vosk...
Vosk Output: deep learning models such as convolution or neural networks and recurrent neural networks are widely used in image recognition natural language processing and speech analysis

Recognizing with Google API...
Google API Output: convolutional neural networks and recurrent neural networks are widely used in image recognition natural language processing and speech analysis


🎤 Processing noisy_background.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 885/885 [00:00<00:00, 3492.71frames/s]


Whisper Output: In artificial intelligence, reinforcement learning enables agents to learn from rewards and penalties, similar to human trial and error.

Recognizing with Vosk...
Vosk Output: the artificial intelligence reinforcement learning enables ages to learn from reward penalty similar to humans file and era

Recognizing with Google API...
Google API Output: can artificial intelligence reinforcement learning enables agents to learn from rewards and penalties similar to human file and error


🎤 Processing clear_male.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 788/788 [00:00<00:00, 4113.86frames/s]


Whisper Output: This is a clear male voice. Artificial intelligence and machine learning are transforming industries worldwide.

Recognizing with Vosk...
Vosk Output: this is a clear male voice artificial intelligence and machine learning a transforming industries worldwide

Recognizing with Google API...
Google API Output: this is a clear male voice artificial intelligence and machine learning are transforming Industries worldwide


🎤 Processing soft_voice_16k.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 1063/1063 [00:00<00:00, 4436.39frames/s]


Whisper Output: As an AI and machine learning student, I am exploring algorithms that help computers think and learn like humans.

Recognizing with Vosk...
Vosk Output: as and a i and machine learning student i am exploring algorithms that help computers think and learn like humans

Recognizing with Google API...
Google API Output: as an AI and machine learning student I am exploring algorithms that help computers think and learn like humans


🎤 Processing clear_female.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 728/728 [00:00<00:00, 3377.51frames/s]


Whisper Output: This is a clear female voice. Neural networks learn patterns from data to make accurate predictions.

Recognizing with Vosk...
Vosk Output: this is a clear female voice neural networks learn patterns from data to make accurate predictions

Recognizing with Google API...
Google API Output: this is a clear female voice neural networks learn patterns from data to make accurate predictions


🎤 Processing noisy_background_16k.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 885/885 [00:00<00:00, 3667.65frames/s]


Whisper Output: In artificial intelligence, reinforcement learning enables agents to learn from rewards and penalties, similar to human trial and error.

Recognizing with Vosk...
Vosk Output: the artificial intelligence reinforcement learning enables ages to learn from reward penalty similar to humans file and era

Recognizing with Google API...
Google API Output: can artificial intelligence reinforcement learning enables agents to learn from rewards and penalties similar to human file and error


🎤 Processing clear_female_16k.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 728/728 [00:00<00:00, 3325.97frames/s]


Whisper Output: This is a clear female voice. Neural networks learn patterns from data to make accurate predictions.

Recognizing with Vosk...
Vosk Output: this is a clear female voice neural networks learn patterns from data to make accurate predictions

Recognizing with Google API...
Google API Output: this is a clear female voice neural networks learn patterns from data to make accurate predictions


🎤 Processing soft_voice.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 1063/1063 [00:00<00:00, 4634.21frames/s]


Whisper Output: As an AI and machine learning student, I am exploring algorithms that help computers think and learn like humans.

Recognizing with Vosk...
Vosk Output: as and a i am machine learning student i am exploring algorithms that help computers think and learn like humans

Recognizing with Google API...
Google API Output: as an AI and machine learning student I am exploring algorithms that help computers think and learn like humans


🎤 Processing fast_speech_16k.wav ...

Recognizing with Whisper...
Detected language: English


100%|██████████| 1154/1154 [00:00<00:00, 2716.29frames/s]


Whisper Output: Deep learning models such as convolutional neural networks and recurrent neural networks are widely used in image recognition, natural language processing and speech analysis.

Recognizing with Vosk...
Vosk Output: deep learning models such as convolution or neural networks and recurrent neural networks are widely used in image recognition natural language processing and speech analysis

Recognizing with Google API...
Google API Output: convolutional neural networks and recurrent neural networks are widely used in image recognition natural language processing and speech analysis


✅ Summary Table:

             Audio Type                                     Whisper Output  \
0        clear_male_16k  This is a clear male voice. Artificial intelli...   
1           fast_speech  Deep learning models such as convolutional neu...   
2      noisy_background  In artificial intelligence, reinforcement lear...   
3            clear_male  This is a clear male voice. Artificial 