# Web App Demonstrating OpenAI's Whisper Speech Recognition Model

This is a Colab notebook that allows you to record or upload audio files to [OpenAI's free Whisper speech recognition model](https://openai.com/blog/whisper/). This was based on [an original notebook by @amrrs](https://github.com/amrrs/openai-whisper-webapp), with added documentation and test files by [Pete Warden](https://twitter.com/petewarden).

To use it, choose `Runtime->Run All` from the Colab menu. If you're viewing this notebook on GitHub, follow [this link](https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb) to open it in Colab first. After a minute or so, you should see a button at the bottom of the page with a `Record from microphone` link. Click it, grant permission to access your mic when asked, and then speak for up to 30 seconds. Once you're done, press `Stop recording`, and a transcript of the first 30 seconds of your speech should soon appear in the box to the right of the recording button. To transcribe more speech, click `Clear` in the left box and start over.

You can also upload your own audio samples using the folder icon on the left of this page. That gives you access to a file system you can upload to by dragging files into it. You can see examples of how to run the transcription in a couple of the cells below.

## Install the Whisper Code

In [35]:
! pip install git+https://github.com/openai/whisper.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Load the ML Model

In [36]:
import whisper

model = whisper.load_model("base")


## Check we have a GPU

You should see the output `device(type='cuda', index=0)` below. If you don't, you may be on a CPU-only Colab instance which will run more slowly. Go to `Runtime->Change Runtime Type` to fix this.

In [37]:
model.device

device(type='cuda', index=0)

## Define the Transcribe Function

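Now that the code is installed and the model is loaded, we can define a function that takes an audio file path as input and returns the recognized text. It also prints the language it detects. Because the audio is padded or trimmed to 30 seconds, only the first 30 seconds of a file are transcribed.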

In [38]:
def transcribe(audio):

    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text
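
The pad/trim step is what limits `transcribe` to 30 seconds of audio. As a rough illustration of the idea, here is a pure-Python sketch. It is not Whisper's own implementation: `pad_or_trim_sketch` is a hypothetical name used only here, the clips are made-up lists of samples, and the sketch assumes Whisper's 16 kHz sample rate.

```python
SAMPLE_RATE = 16000                      # samples per second (assumed: Whisper works at 16 kHz)
CHUNK_SECONDS = 30                       # window length handled by one decode call
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Return exactly `length` samples: cut off any excess, or pad with zeros."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)   # made-up 5-second clip
long_clip = [0.1] * (40 * SAMPLE_RATE)   # made-up 40-second clip

print(len(pad_or_trim_sketch(short_clip)))  # padded up to 480000
print(len(pad_or_trim_sketch(long_clip)))   # trimmed down to 480000
```

Anything past the 30-second mark is simply dropped, which is why longer files need to be split into chunks before transcription.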


## Install the Web UI Toolkit

We'll be using gradio to provide the widgets we need to do audio recording.

In [39]:
! pip install gradio -q

In [40]:
import gradio as gr
import time

## File Upload Facility

Run the cell below and choose an audio file in the upload prompt that appears.

In [41]:
from google.colab import files
uploaded = files.upload()



Saving How DeepSeek Rewrote the Transformer [MLA].mp3 to How DeepSeek Rewrote the Transformer [MLA] (1).mp3


In [42]:
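from IPython.display import Audio, display

# files.upload() returns a dict keyed by filename; take the first uploaded file
filename = next(iter(uploaded))
print(filename)

# Show an audio player for the uploaded file
display(Audio(filename))

# transcribe() only covers the first 30 seconds of the file
hard_text = transcribe(filename)
print(hard_text)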

How DeepSeek Rewrote the Transformer [MLA] (1).mp3
Detected language: en
This video is sponsored by KiwiCo, more on them later. In January 2025, the Chinese company DeepSeek shocked the world with the release of R1, a highly competitive language model that requires only a fraction of the compute of other leading models. Perhaps even more shocking is that unlike most of its American counterparts, DeepSeek has publicly released the R1 model weights, inference code, and extensive technical reports. Publishing an average of one report per month in 2024,
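
`files.upload()` returns a dictionary keyed by filename, which is why `next(iter(uploaded))` picks out the first uploaded file. The sketch below uses a stand-in dictionary (the filenames and byte strings are made up) to show the same lookup, plus a loop for the case where several files were uploaded at once.

```python
# Stand-in for the dict returned by files.upload(): filename -> file contents
uploaded_example = {
    "first_talk.mp3": b"\x00\x01",
    "second_talk.wav": b"\x02\x03\x04",
}

# Same pattern as in the notebook: take the first key
first_name = next(iter(uploaded_example))
print(first_name)

# Handle every uploaded file, not just the first
for name, data in uploaded_example.items():
    print(f"{name}: {len(data)} bytes")
```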


In [43]:
from pydub import AudioSegment
from IPython.display import Audio, display
import os

# Load the uploaded audio; len(audio) is its duration in milliseconds
audio = AudioSegment.from_file(filename)

# Chunk length: 25 seconds = 25,000 ms, which stays under transcribe()'s 30-second limit
frame_duration = 25 * 1000
# Number of chunks, counting a trailing partial chunk if there is one
num_chunks = len(audio) // frame_duration + (1 if len(audio) % frame_duration > 0 else 0)

# Create a folder for the chunk files
os.makedirs("chunks", exist_ok=True)

text = ""
# Export each chunk to a WAV file and transcribe it
for i in range(num_chunks):
    start = i * frame_duration
    end = min((i + 1) * frame_duration, len(audio))
    chunk = audio[start:end]

    # Export to WAV file
    chunk_filename = f"chunks/chunk_{i+1}.wav"
    chunk.export(chunk_filename, format="wav")

    # Uncomment to listen to each chunk as it is processed
    # print(f"🔊 Playing chunk {i+1}: {start/1000:.2f}s to {end/1000:.2f}s")
    # display(Audio(chunk_filename))

    text_chunks = transcribe(chunk_filename)
    print(text_chunks)

    # Chunks are appended without a separator, so text runs together at chunk boundaries
    text += text_chunks




Detected language: en
This video is sponsored by KiwiCo, more on them later. In January 2025, the Chinese company DeepSeek shocked the world with the release of R1, a highly competitive language model that requires only a fraction of the compute of other leading models. Perhaps even more shocking is that unlike most of its American counterparts, DeepSeek has publicly released the R1 model weights, inference code, and extensive technical reports.
Detected language: en
reports, publishing an average of one report per month in 2024, and detailing many of the innovations that dramatically culminated in the release of R1 in early 2025. Back in June of 2024, the Deepseek team introduced a technique that they call multi-head latent attention. Unlike many deep-seek innovations that occur at the margins of the stack, multi-head latent attention strikes at the core of the transformer itself.
Detected language: en
This is the compute architecture that virtually all large language models share. Th
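
The chunk boundaries computed in the loop above can be checked without any audio. A small sketch follows, using the same arithmetic; `chunk_bounds` is a hypothetical helper and the 65-second duration is made up for illustration.

```python
def chunk_bounds(total_ms, frame_ms):
    """Return (start, end) millisecond pairs that cover total_ms in frame_ms steps."""
    num_chunks = total_ms // frame_ms + (1 if total_ms % frame_ms > 0 else 0)
    return [
        (i * frame_ms, min((i + 1) * frame_ms, total_ms))
        for i in range(num_chunks)
    ]


# A 65-second file split into 25-second chunks: two full chunks and a 15-second remainder
print(chunk_bounds(65_000, 25_000))
# → [(0, 25000), (25000, 50000), (50000, 65000)]
```

Hard cuts like these can land in the middle of a word, so the text on either side of a boundary may be incomplete or repeated.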

In [44]:
print(text)

This video is sponsored by KiwiCo, more on them later. In January 2025, the Chinese company DeepSeek shocked the world with the release of R1, a highly competitive language model that requires only a fraction of the compute of other leading models. Perhaps even more shocking is that unlike most of its American counterparts, DeepSeek has publicly released the R1 model weights, inference code, and extensive technical reports.reports, publishing an average of one report per month in 2024, and detailing many of the innovations that dramatically culminated in the release of R1 in early 2025. Back in June of 2024, the Deepseek team introduced a technique that they call multi-head latent attention. Unlike many deep-seek innovations that occur at the margins of the stack, multi-head latent attention strikes at the core of the transformer itself.This is the compute architecture that virtually all large language models share. This modification reduces the size of an important bottleneck, called 
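
Because each chunk's text is appended directly, sentences run together at chunk boundaries (for example `reports.reports` in the output above). One option, sketched here with made-up strings, is to collect the chunk transcripts in a list and join them with a space.

```python
# Made-up chunk transcripts standing in for the values returned by transcribe()
chunk_texts = ["First chunk ends here.", "Second chunk starts here.", "  Third chunk. "]

# Strip stray whitespace, drop empty chunks, and join with a single space
joined = " ".join(t.strip() for t in chunk_texts if t.strip())
print(joined)
# → First chunk ends here. Second chunk starts here. Third chunk.
```

This only fixes the spacing. Words that are cut or repeated at a chunk boundary still need a different approach, such as overlapping the chunks.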

In [45]:
from google.colab import drive
import os

drive.mount('/content/drive')

# Name the transcript after the audio file, minus its extension
base_name = os.path.splitext(filename)[0]
file_path = f"/content/drive/My Drive/{base_name}.txt"
print(file_path)

with open(file_path, 'w') as f:
  f.write(text)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/How DeepSeek Rewrote the Transformer [MLA] (1).txt
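
When building a transcript path from an audio filename, it helps to remember that `str.split` returns a list, while `os.path.splitext` returns the name and extension as a pair. A quick comparison, using a made-up filename:

```python
import os

example_name = "talk.mp3"  # made-up filename for illustration

# str.split returns a list, which ends up verbatim inside an f-string
print(example_name.split(".mp3"))      # → ['talk', '']

# os.path.splitext separates the base name from the extension
base, ext = os.path.splitext(example_name)
print(base, ext)                       # → talk .mp3

transcript_name = f"{base}.txt"
print(transcript_name)                 # → talk.txt
```

`os.path.splitext` also works for other extensions such as `.wav`, so the same line covers any audio format you upload.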
