# mlx-whisper

- Option 1: mlx-whisper (tiny model)

  `pip install mlx-whisper`

- Option 2: openai-whisper
  - tiny
  - tiny.en

  `pip install -U openai-whisper`



## Mac

### mlx-Whisper

In [2]:
import mlx_whisper

samples_folder='../samples'
import time
import librosa
import os

# Get all audio files from samples folder
audio_files = [f for f in os.listdir(samples_folder) if f.endswith(('.wav', '.flac', '.mp3','.m4a'))]
print ('default model: whisper-tiny 74m' )


for audio_file in audio_files:
    print(f"\nProcessing: {audio_file}")
    print("-" * 50)
    
    # Get audio duration
    audio_path = os.path.join(samples_folder, audio_file)
    duration = librosa.get_duration(path=audio_path)

    # Transcribe audio
    start_time = time.time()
    text = mlx_whisper.transcribe(audio_path)['text']
    end_time = time.time()

    # Print results
    print(f"Audio duration: {duration:.2f} seconds")
    print(f"Transcription time: {end_time - start_time:.2f} seconds")
    print(f"Transcribed text: {text}")



default model: whisper-tiny 74m

Processing: recordAfterVAD.m4a
--------------------------------------------------


	Audioread support is deprecated in librosa 0.10.0 and will be removed in version 1.0.
  duration = librosa.get_duration(path=audio_path)


Audio duration: 14.48 seconds
Transcription time: 0.19 seconds
Transcribed text:  speech to the text and I want to know is that be the way to show that all the sound for 300 I want to know that clap clap

Processing: Speaker27_000.wav
--------------------------------------------------
Audio duration: 60.00 seconds
Transcription time: 0.60 seconds
Transcribed text:  The story of the invention, development, and present-day uses of wars and newest weapons. This is the Librevox Recording. All Librevox Recording are in the public domain. For more information or to volunteer, please visit Librevox.org. Recording by William Tomko. Aircraft and submarines by Willis J. Abbott. Preface. Not since Gunpowder was first employed in warfare has so revolutionary contribution to the science of slaughtering men been made as by the perfection of aircraft and submarines. The former have had their first employment in this worldwide war of the nations. The latter, though in the experimental stage as far bac
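
The timing loop above is repeated for every engine in these notes, so it can be factored into one helper. The following is a minimal sketch, not part of the original notebook: `time_transcription` and `fake_transcribe` are hypothetical names, and the optional warm-up call assumes that the first call in a session may include one-off model loading that should not be counted.

```python
import time

def time_transcription(transcribe, audio_path, audio_seconds, warmup=True):
    """Time one call of transcribe(audio_path).

    Returns (text, elapsed_seconds, real_time_factor).
    """
    if warmup:
        # Untimed first call, assumed to absorb one-off model loading.
        transcribe(audio_path)
    start = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return text, elapsed, elapsed / audio_seconds

# Hypothetical stand-in so the sketch runs without a model installed.
def fake_transcribe(path):
    return f"transcript of {path}"

text, elapsed, rtf = time_transcription(fake_transcribe, "sample.wav", 60.0)
print(f"{text!r} in {elapsed:.4f} s (RTF {rtf:.5f})")
```

With mlx-whisper, the callable would be something like `lambda p: mlx_whisper.transcribe(p)['text']`, matching the call used in the cell above.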

### Apple Speech (SFSpeechRecognizer)


In [2]:
# ── IMPORTS ───────────────────────────────────────────────────────────────
import os, time, librosa
from Foundation import NSRunLoop, NSDate, NSURL
import Speech                       # ← Apple’s Speech framework via PyObjC

# ── CONFIG ────────────────────────────────────────────────────────────────
SAMPLES_DIR = "../samples"
SUPPORTED   = (".wav", ".m4a", ".mp3", ".flac", ".aiff", ".aif")

recognizer  = Speech.SFSpeechRecognizer.alloc().init()   # default locale

def transcribe_file(path):
    """Return the best transcription of an audio file using Apple Speech."""
    url       = NSURL.fileURLWithPath_(os.path.abspath(path))
    request   = Speech.SFSpeechURLRecognitionRequest.alloc().initWithURL_(url)
    finished  = False
    transcript = ""

    # Callback that Speech runs on its own thread
    def handler(result, error):
        nonlocal finished, transcript
        if result:
            transcript = str(result.bestTranscription().formattedString())
            if result.isFinal():
                finished = True
        else:                     # error or user denied permission
            finished = True

    # Kick off recognition
    task = recognizer.recognitionTaskWithRequest_resultHandler_(request, handler)

    # Spin the run-loop until the async callback sets finished = True
    while not finished:
        NSRunLoop.currentRunLoop().runUntilDate_(
            NSDate.dateWithTimeIntervalSinceNow_(0.1)
        )
    return transcript

# ── MAIN LOOP ─────────────────────────────────────────────────────────────
print("Apple Speech framework (SFSpeechRecognizer)\n")

for fname in sorted(f for f in os.listdir(SAMPLES_DIR)
                    if f.lower().endswith(SUPPORTED)):
    path = os.path.join(SAMPLES_DIR, fname)
    print(f"Processing: {fname}\n" + "-"*50)

    duration = librosa.get_duration(path=path)
    t0       = time.time()
    text     = transcribe_file(path)
    dt       = time.time() - t0

    print(f"Audio duration     : {duration:6.2f} s")
    print(f"Transcription time : {dt:6.2f} s")
    print("Transcribed text   :", text or "[empty]", "\n")

Apple Speech framework (SFSpeechRecognizer)

Processing: Speaker26_000.wav
--------------------------------------------------
Audio duration     :  60.00 s
Transcription time :   3.22 s
Transcribed text   : Section 0 of Aesop fables a new revised version by ASAP this lab box recording is in the public domain preface the following or some of EOPS best loved fables the goose with the golden eggs a certain man had a good fortune to possess a goose that laid him a golden egg every day but dissatisfied with so slow and income and thinking to see the whole treasure at once he killed the goose and cutting her open found her just what any other goose would be much once more and loses all the town mouse and the country mouse a country mouse invited a townhouse and intimate friend to pay him a visit and partake of his country fair as they were on the bear plowed lands eating their wheat 

Processing: Speaker27_000.wav
--------------------------------------------------
Audio duration     :  60.00
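
The `while not finished` loop in `transcribe_file` has no upper bound, so it would spin forever if the callback never fires. Below is a minimal sketch of the same polling pattern with a deadline. `wait_until`, `is_done`, and `pump` are hypothetical names introduced for illustration, and the demo uses stand-ins instead of the run loop so it runs anywhere.

```python
import time

def wait_until(is_done, pump, timeout_s=30.0, step_s=0.1):
    """Call pump(step_s) until is_done() is true or timeout_s has elapsed.

    Returns True if is_done() became true, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while not is_done():
        if time.monotonic() >= deadline:
            return False
        pump(step_s)
    return True

# Demo with stand-ins: the "callback" completes after three pump calls.
state = {"ticks": 0}

def fake_pump(step_s):
    state["ticks"] += 1

finished = wait_until(lambda: state["ticks"] >= 3, fake_pump, timeout_s=1.0)

# A task that never completes hits the deadline instead of hanging.
never_finished = wait_until(lambda: False, lambda s: time.sleep(0.01), timeout_s=0.05)
print(finished, never_finished)
```

In the cell above, `pump` would wrap the existing `NSRunLoop.currentRunLoop().runUntilDate_(...)` call and `is_done` would read the `finished` flag set by the handler.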

### whisper.cpp (pywhispercpp)
https://github.com/absadiki/pywhispercpp

In [6]:
from pywhispercpp.model import Model
import time
import librosa
import os
import soundfile as sf

# ── CONFIG ────────────────────────────────────────────────────────────────
SAMPLES_DIR = "../samples"
SUPPORTED   = (".wav", ".m4a", ".mp3", ".flac", ".aiff", ".aif")
TARGET_SR = 16000  # Whisper requires 16kHz audio

# Initialize model
model = Model('tiny.en')

# ── MAIN LOOP ─────────────────────────────────────────────────────────────
print("Whisper.cpp (tiny.en model)\n")

for fname in sorted(f for f in os.listdir(SAMPLES_DIR)
                    if f.lower().endswith(SUPPORTED)):
    path = os.path.join(SAMPLES_DIR, fname)
    print(f"Processing: {fname}\n" + "-"*50)

    # Load and resample audio to 16kHz
    audio, sr = librosa.load(path, sr=TARGET_SR)
    
    # Save as temporary WAV file with correct format
    temp_path = f"temp_{fname}.wav"
    sf.write(temp_path, audio, TARGET_SR)

    duration = librosa.get_duration(path=path)
    t0       = time.time()
    text     = model.transcribe(temp_path)
    dt       = time.time() - t0

    # Clean up temporary file
    os.remove(temp_path)

    print(f"Audio duration     : {duration:6.2f} s")
    print(f"Transcription time : {dt:6.2f} s")
    print("Transcribed text   :", text or "[empty]", "\n")

whisper_init_from_file_with_params_no_state: loading model from '/Users/rongweiji/Library/Application Support/pywhispercpp/models/ggml-tiny.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 3
whisper_init_with_params_no_state: backends   = 3
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1 (tiny)
whi

Whisper.cpp (tiny.en model)

Processing: Speaker26_000.wav
--------------------------------------------------


Progress:   0%
Progress:  38%
Progress:  85%
Progress:  95%
Progress: 100%


Audio duration     :  60.00 s
Transcription time :   0.63 s
Transcribed text   : [t0=0, t1=668, text=Section zero of ESOP's fables, a new revised version by ESOP. This sliver box recording, t0=668, t1=1660, text=is in the public domain. Preface, the following are some of ESOP's best-loved fables., t0=1660, t1=2308, text=The Goose with the Golden Eggs A certain man had the good fortune to possess a goose, t0=2308, t1=3072, text=that laid him a golden egg every day, but dissatisfied with so slow an income and thinking, t0=3072, t1=3660, text=to seize the whole treasure at once, he killed the Goose and cutting her open, found her, t0=3660, t1=4520, text=just what any other Goose would be. Much once more and loses all., t0=4520, t1=5108, text=The Town Mouse and the Country Mouse A country mouse invited a town mouse, t0=5108, t1=5700, text=and intimate friend to pay him a visit and partake of his country fair. As they were on, t0=5700, t1=6000, text=the bare plowed lands, eating their wheat

Progress:   0%
Progress:  48%
Progress:  93%
Progress: 100%
Progress:   0%
Progress: 100%
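
The printed result above is a list of segments, each with `t0`, `t1`, and `text`, rather than a single string. The sketch below joins them into plain text and formats per-segment timestamps. The `Segment` dataclass is a hypothetical stand-in that mirrors only those three fields, and the 100 ticks per second scale is inferred from the output above (`t1=6000` for a 60 s file), so treat both as assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Hypothetical stand-in with the fields shown in the printed output."""
    t0: int
    t1: int
    text: str

def segments_to_text(segments):
    """Join segment texts into one string, skipping empty segments."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())

def format_segment(seg, ticks_per_second=100):
    """Render one segment with start and end times in seconds."""
    start = seg.t0 / ticks_per_second
    end = seg.t1 / ticks_per_second
    return f"[{start:6.2f} s - {end:6.2f} s] {seg.text.strip()}"

# First two segments copied from the output above.
demo = [
    Segment(0, 668, "Section zero of ESOP's fables, a new revised version by ESOP. This sliver box recording"),
    Segment(668, 1660, "is in the public domain. Preface, the following are some of ESOP's best-loved fables."),
]

print(segments_to_text(demo))
for seg in demo:
    print(format_segment(seg))
```

Applied to the cell above, `segments_to_text(model.transcribe(temp_path))` would give a string comparable to the other engines' output.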


### openai-whisper

In [20]:
import whisper

import time
import librosa
import os
samples_folder='../samples'
model = whisper.load_model("tiny")
# Get all audio files from samples folder
audio_files = [f for f in os.listdir(samples_folder) if f.endswith(('.wav', '.flac', '.mp3'))]

print ('openai-whisper, tiny model, 72M')
for audio_file in audio_files:
    print(f"\nProcessing: {audio_file}")
    print("-" * 50)
    
    # Get audio duration
    audio_path = os.path.join( samples_folder, audio_file)
    duration = librosa.get_duration(path=audio_path)

    # Transcribe audio
    start_time = time.time()
    result = model.transcribe(audio_path)
    end_time = time.time()

    # Print results
    print(f"Audio duration: {duration:.2f} seconds")
    print(f"Transcription time: {end_time - start_time:.2f} seconds")
    print(f"Transcribed text: {result['text']}")



openai-whisper, tiny model, 72M

Processing: Speaker27_000.wav
--------------------------------------------------




Audio duration: 60.00 seconds
Transcription time: 1.26 seconds
Transcribed text:  The story of the invention, development, and present-day uses of wars and newest weapons. This is the Librevox Recording. All Librevox Recording are in the public domain. For more information or to volunteer, please visit Librevox.org. Recording by William Tomko. Aircraft and submarines by Willis J. Abbott. Preface. Not since Gunpowder was first employed in warfare has so revolutionary contribution to the science of slaughtering men been made as by the perfection of aircraft and submarines. The former have had their first employment in this worldwide war of the nations. The latter, though in the experimental stage as far back as the American Revolution, have in this bitter contest then for the first time brought to so practical and stage of development as to exert a really appreciable influence on the outcome of this.

Processing: Speaker26_000.wav
--------------------------------------------------




Audio duration: 60.00 seconds
Transcription time: 1.08 seconds
Transcribed text:  Section 0 of esophs fables, a new revised version by esoph. This sleeper box recording is in the public domain. Preface, the following are some of esophs best-loved fables. The Goose with the Golden Eggs. A certain man had the good fortune to possess a Goose that laid him a Golden Egg every day, but dissatisfied with so slow and encommed, and thinking to seize the whole treasure at once, he killed the Goose and cutting her open, found her just what any other Goose would be. Much once more, and loses all. The town mouse and the country mouse. A country mouse invited a town mouse and intimate friend to pay him a visit and portake of his country fair. As they were on the bare-plowed lands, eating their wheat stock.

Processing: ls_test.flac
--------------------------------------------------




Audio duration: 6.67 seconds
Transcription time: 0.31 seconds
Transcribed text:  Then the good soul openly sorted the boat and she had buoyed so long in secret and bravely stretched on alone.


In [21]:
import whisper

import time
import librosa
import os
samples_folder='../samples'
model_en = whisper.load_model("tiny.en")
# Get all audio files from samples folder
audio_files = [f for f in os.listdir(samples_folder) if f.endswith(('.wav', '.flac', '.mp3'))]

print ('openai-whisper, tiny_en model, 72M')
for audio_file in audio_files:
    print(f"\nProcessing: {audio_file}")
    print("-" * 50)
    
    # Get audio duration
    audio_path = os.path.join( samples_folder, audio_file)
    duration = librosa.get_duration(path=audio_path)

    # Transcribe audio
    start_time = time.time()
    result = model_en.transcribe(audio_path)
    end_time = time.time()

    # Print results
    print(f"Audio duration: {duration:.2f} seconds")
    print(f"Transcription time: {end_time - start_time:.2f} seconds")
    print(f"Transcribed text: {result['text']}")



100%|█████████████████████████████████████| 72.1M/72.1M [00:01<00:00, 69.9MiB/s]


openai-whisper, tiny model, 72M

Processing: Speaker27_000.wav
--------------------------------------------------




Audio duration: 60.00 seconds
Transcription time: 1.12 seconds
Transcribed text:  Preface of Aircraft and Submarines, the story of the invention, development, and present-day uses of wars and newest weapons. This is the Liber vox recording. All Liber vox recordings are in the public domain. For more information or to volunteer, please visit Libervox.org. Recording by William Tomko, Aircraft and Submarines by Willis J. Abbott, Preface. Not since gunpowder was first employed in warfare has so revolutionary a contribution to the science of slaughtering men been made as by the perfection of aircraft and submarines. The former have had their first employment in this worldwide war of the nations. The latter, though in the experimental stage, as far back as the American Revolution, have in this bitter contest, then for the first time brought to so practical a stage of development, as to exert a really appreciable influence on the outcome of the...

Processing: Speaker26_000.wav
--------------



Audio duration: 60.00 seconds
Transcription time: 0.97 seconds
Transcribed text:  Section zero of ESOP's fables, a new revised version by ESOP. This sliver box recording is in the public domain. Preface, the following are some of ESOP's best-loved fables. The Goose with the Golden Eggs A certain man had the good fortune to possess a goose that laid him a golden egg every day, but dissatisfied with so slow an income and thinking to seize the whole treasure at once, he killed the Goose and cutting her open, found her just what any other Goose would be. Much once more and loses all. The Town Mouse and the Country Mouse A country mouse invited a town mouse and intimate friend to pay him a visit and partake of his country fair. As they were on the bare plowed lands, eating their wheat stock.

Processing: ls_test.flac
--------------------------------------------------




Audio duration: 6.67 seconds
Transcription time: 0.23 seconds
Transcribed text:  Then the goods sold openly, shorted the burden she had borne so long in secret and bravely trudged on alone.


In [2]:
import torch
print(torch.__version__)
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())

2.7.0
True
True


### Vosk


In [7]:
"""
Transcribe every audio file in ../samples with Vosk.

Requirements
------------
pip install vosk soundfile librosa numpy
Download a Vosk model (e.g. “vosk-model-small-en-us-0.15”) and point
`model_path` to its directory.
"""

import os
import time
import json
import numpy as np
import librosa
from vosk import Model, KaldiRecognizer

# ------------------------------------------------------------------ #
# Configuration
# ------------------------------------------------------------------ #
samples_folder = "../samples"
model_path     = "../models/vosk-model-small-en-us-0.15"   # ← change if needed
sample_rate    = 16_000                                 # Vosk expects 16 kHz
chunk_samples  = 4_000                                  # ~0.25 s per chunk
# ------------------------------------------------------------------ #

# Load Vosk model once
print(f"default model: {os.path.basename(model_path)}")
model = Model(model_path)

# Collect supported audio files
audio_files = [
    f for f in os.listdir(samples_folder)
    if f.lower().endswith((".wav", ".flac", ".mp3"))
]

print(f'Vosk model: {os.path.basename(model_path)} 40M')
for audio_file in audio_files:
    print(f"\nProcessing: {audio_file}")
    print("-" * 50)

    audio_path = os.path.join(samples_folder, audio_file)

    # 1) Read and resample to 16 kHz mono (librosa handles any format ffmpeg can decode)
    waveform, sr = librosa.load(audio_path, sr=sample_rate, mono=True)
    duration_sec = len(waveform) / sample_rate

    # 2) Set up a recognizer for this file
    rec = KaldiRecognizer(model, sample_rate)
    rec.SetWords(True)          # include word-level timestamps

    # 3) Stream audio in small chunks (Vosk works best this way)
    start = time.time()

    for i in range(0, len(waveform), chunk_samples):
        chunk = waveform[i : i + chunk_samples]
        # convert float32 → 16-bit PCM bytes (clip first so overshoot cannot wrap around)
        pcm16 = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
        rec.AcceptWaveform(pcm16)

    result = json.loads(rec.FinalResult())
    transcription = result.get("text", "")

    end = time.time()

    # 4) Report
    print(f"Audio duration: {duration_sec:.2f} s")
    print(f"Transcription time: {end - start:.2f} s")
    print(f"Transcribed text:\n{transcription}")


default model: vosk-model-small-en-us-0.15


LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from ../models/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from ../models/vosk-model-small-en-us-0.15/graph/HCLr.fst ../models/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo ../models/vosk-model-small-en-us-0.15/graph/phones/word_boundary.int
ggml_metal_free: deallocating


Vosk model: vosk-model-small-en-us-0.15 40M

Processing: Speaker27_000.wav
--------------------------------------------------
Audio duration: 60.00 s
Transcription time: 55.57 s
Transcribed text:
preface of aircraft and submarines the story of the invention development and present day uses of wars newest weapons this is a liberal box recording olive revives recordings are in the public domain for more information or to volunteer please visit liberal vox dot org recording by william tomko aircraft and submarines by will is jay abbott preface that since gunpowder was first employed in warfare has so revolutionary a contribution to the science of slaughtering men been made as why the perfection of aircraft and submarines the former have had their first employment in this world wide war of the nation's the ladder though in the experimental stage as far back as the american revolution have in this bitter contest in for the first time brought to so practical and stage of development as to ex

KeyboardInterrupt: 

## Windows

In [1]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Current CUDA device: {torch.cuda.current_device()}")
    print(f"CUDA device name: {torch.cuda.get_device_name()}")

CUDA available: True
Current CUDA device: 0
CUDA device name: NVIDIA GeForce RTX 3060


In [2]:
import whisper

import time
import librosa
import os
samples_folder='../samples'
model_en = whisper.load_model("tiny.en")
# Get all audio files from samples folder
audio_files = [f for f in os.listdir(samples_folder) if f.endswith(('.wav', '.flac', '.mp3'))]

print ('openai-whisper, tiny_en model, 72M')
for audio_file in audio_files:
    print(f"\nProcessing: {audio_file}")
    print("-" * 50)
    
    # Get audio duration
    audio_path = os.path.join( samples_folder, audio_file)
    duration = librosa.get_duration(path=audio_path)

    # Transcribe audio
    start_time = time.time()
    result = model_en.transcribe(audio_path)
    end_time = time.time()

    # Print results
    print(f"Audio duration: {duration:.2f} seconds")
    print(f"Transcription time: {end_time - start_time:.2f} seconds")
    print(f"Transcribed text: {result['text']}")



openai-whisper, tiny_en model, 72M

Processing: ls_test.flac
--------------------------------------------------
Audio duration: 6.67 seconds
Transcription time: 0.96 seconds
Transcribed text:  Then the goods sold openly, shorted the burden she had borne so long in secret and bravely trudged on alone.

Processing: Speaker26_000.wav
--------------------------------------------------
Audio duration: 60.00 seconds
Transcription time: 1.70 seconds
Transcribed text:  Section zero of ESOP's fables, a new revised version by ESOP. This sliver box recording is in the public domain. Preface, the following are some of ESOP's best-loved fables. The Goose with the Golden Eggs A certain man had the good fortune to possess a goose that laid him a golden egg every day, but dissatisfied with so slow an income and thinking to seize the whole treasure at once, he killed the Goose and cutting her open, found her just what any other Goose would be. Much once more and loses all. The Town Mouse and the Countr

### Swift app timings



- whisper.cpp, using Core ML for the encoder

| Model | File | Audio (s) | Transcription time (s) |
| --- | --- | --- | --- |
| tiny.en | Speaker26_000.wav | 60 | 1.18, 0.93, 1.024 |
| tiny.en | Speaker27_000.wav | 60 | 0.87, 0.968, 0.88 |
| base.en | Speaker26_000.wav | 60 | 1.27, 1.30, 1.29 |
| base.en | Speaker27_000.wav | 60 | 1.75, 1.65, 1.50 |

- SFSpeechRecognizer, using the on-device model

| File | Audio (s) | Transcription time (s) |
| --- | --- | --- |
| Speaker26_000.wav | 60 | 2.64, 3.027, 2.63 |
| Speaker27_000.wav | 60 | 2.75, 2.83, 2.73 |


Real-time factor (processing time / audio duration; lower is faster), in the same order as the timings above:
- whisper.cpp tiny.en: 0.0197, 0.0155, 0.0171, 0.0145, 0.0161, 0.0147
- whisper.cpp base.en: 0.0212, 0.0217, 0.0215, 0.0292, 0.0275, 0.0250
- Apple Speech (on-device): 0.0440, 0.0505, 0.0438, 0.0458, 0.0472, 0.0455
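
The real-time factors can be recomputed from the timings listed in this section. This is a small sketch with the measured transcription times copied from above. The dictionary layout and the `real_time_factor` helper are illustrative choices, and the per-engine mean is an added summary, not something reported in the original measurements.

```python
AUDIO_SECONDS = 60.0  # both sample files are 60 s long

# Transcription times in seconds: Speaker26 runs first, then Speaker27.
timings = {
    "whisper.cpp tiny.en": [1.18, 0.93, 1.024, 0.87, 0.968, 0.88],
    "whisper.cpp base.en": [1.27, 1.30, 1.29, 1.75, 1.65, 1.50],
    "Apple Speech (on-device)": [2.64, 3.027, 2.63, 2.75, 2.83, 2.73],
}

def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

rtf = {
    name: [real_time_factor(t, AUDIO_SECONDS) for t in runs]
    for name, runs in timings.items()
}
mean_rtf = {name: sum(values) / len(values) for name, values in rtf.items()}

for name, values in rtf.items():
    per_run = ", ".join(f"{v:.4f}" for v in values)
    print(f"{name}: {per_run} (mean {mean_rtf[name]:.4f})")
```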

