<a href="https://colab.research.google.com/github/s1161858/project1/blob/main/09_WhisperSpeechRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Speech Recognition with Whisper

### By AIDCEC, EdUHK


Speech recognition (i.e. speech-to-text recognition) tools are now common in everyday life. In this Colab notebook, we will introduce OpenAI's Whisper for transcribing audio files and a few other packages for audio processing.

GitHub page: https://github.com/openai/whisper

## 1. Install and Import Libraries

Install Whisper and FFmpeg

In [None]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-8gz8o0nj
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-8gz8o0nj
  Resolved https://github.com/openai/whisper.git to commit c0d2f624c09dc18e709e37c2ad90c039a4eb72a2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
!sudo apt update && sudo apt install -y ffmpeg

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Do
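
Whisper calls the `ffmpeg` command-line tool to decode audio, so it can be worth confirming that the tool is available before going further. The following is a minimal sketch using only the standard library; the helper name `find_tool` is hypothetical and not part of Whisper.

```python
import shutil


def find_tool(name):
    """Return the full path of a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)


ffmpeg_path = find_tool("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg was not found; re-run the install cell above.")
else:
    print(f"ffmpeg found at {ffmpeg_path}")
```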

Import the **os** library to navigate the file system and the **warnings** library to suppress warnings

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

## 2. Load an Audio File and/or Record Your Own

### Choice (A): Test with Pre-recorded Audio Files

You can find audio files in the following corpora:

Language | Corpus website
-------------------|------------------
English | https://corpus.eduhk.hk/english_speech_corpus/
English | https://corpus.eduhk.hk/esl_learner_corpus/
Putonghua | https://corpus.eduhk.hk/pth_learner_corpus/
Cantonese | https://corpus.eduhk.hk/cantonese/


Below, we use the audio file "250231.mp3" as an example, which contains a recording of the phrase "It's on me" in English.

**Remember to upload the file "250231.mp3" to the "Files" panel before running the code below.**

In [None]:
from IPython.display import Audio
Audio('/content/250231.mp3')  # change the file name here if you uploaded a different audio file

To try your own audio file, upload it, change the file name in the block above, and rerun the block.
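
To check which uploaded files are available before changing the file name, a short listing helper can be used. This is only a sketch: the function name `list_audio_files` and the extension set are assumptions made for illustration, and `/content` is the Colab folder used elsewhere in this notebook.

```python
from pathlib import Path

# Illustrative set of audio extensions; extend it as needed.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def list_audio_files(directory):
    """Return the sorted audio file paths found directly inside `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


for path in list_audio_files("/content"):
    print(path)
```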

### Choice (B): Record Your Own Audio Files

You can record your own voice in Colab with "GoogleAudio".

In [None]:
!pip install GoogleAudio==0.0.3



Run the following code and allow the browser to use the microphone when prompted, then re-run the code to start recording. To finish, press the "Recording... press to stop" button.

**If the code does not run, try running all the code again, or refresh the page and try again.**

In [None]:
from googleaudio import colabaudio as agoogle  # recording helpers for Colab
my_audio_name = 'my_audio_123.wav'  # output file name; change it before the next recording to keep the old file
my_audio, sample_rate = agoogle.get_audio()  # read the audio data and sample rate
agoogle.saveaudio(my_audio_name, my_audio, sample_rate)  # save the recording under that name

If you want to keep each recording, change the file name in the block above before every new recording.
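
Renaming by hand is easy to forget, so one option is to generate the next unused name automatically. The sketch below is an assumption-laden illustration: the helper `next_audio_name` and the `my_audio_001.wav` numbering scheme are hypothetical and not part of GoogleAudio.

```python
from pathlib import Path


def next_audio_name(directory=".", prefix="my_audio_", extension=".wav"):
    """Return the first name such as my_audio_001.wav that is not yet used in `directory`."""
    folder = Path(directory)
    counter = 1
    while (folder / f"{prefix}{counter:03d}{extension}").exists():
        counter += 1
    return f"{prefix}{counter:03d}{extension}"


my_audio_name = next_audio_name("/content")
print(my_audio_name)
```

The returned name can then be used as `my_audio_name` in the recording block.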

## 3. Transcribe a Single Audio File

**Model selection**

Five pre-trained model sizes are available:

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
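
The table can be turned into a simple lookup for choosing a model size. This is a rough sketch, assuming the approximate VRAM figures above are the only constraint; the helper `largest_model_for` is hypothetical.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest to largest).
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_for(vram_gb):
    """Return the largest model whose approximate VRAM need fits, or None if none fits."""
    fitting = [name for name, need in MODEL_VRAM_GB if need <= vram_gb]
    return fitting[-1] if fitting else None


print(largest_model_for(4))
print(largest_model_for(16))
```

Larger models are slower, as the relative speed column shows, so the largest model that fits is not always the best choice for a quick test.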

**Note:** \
FP16 and FP32 are floating-point precision formats (16-bit and 32-bit). Whisper uses FP16 by default, which suits a GPU runtime. \
On a CPU runtime, add `--fp16 False` so that Whisper uses FP32, because FP16 is not supported on the CPU.
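
To get a feel for the difference between the two formats, the same number can be stored in each and read back. This sketch uses Python's standard `struct` module and is independent of Whisper; the helper `round_trip` is hypothetical.

```python
import struct


def round_trip(value, fmt):
    """Store `value` in the given struct format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1
as_fp16 = round_trip(value, "<e")  # "e" is IEEE 754 half precision (16 bits)
as_fp32 = round_trip(value, "<f")  # "f" is IEEE 754 single precision (32 bits)

print(f"FP16 stores 0.1 as {as_fp16!r} (error {abs(as_fp16 - value):.2e})")
print(f"FP32 stores 0.1 as {as_fp32!r} (error {abs(as_fp32 - value):.2e})")
print(f"Bytes per number: FP16={struct.calcsize('<e')}, FP32={struct.calcsize('<f')}")
```

FP16 halves the memory per number at the cost of a larger rounding error.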

**Transcribe the File from Choice (A): Pre-recorded Audio Files**

In [None]:
!whisper "/content/250231.mp3" --model base

100%|███████████████████████████████████████| 139M/139M [00:01<00:00, 83.8MiB/s]
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:02.000]  It's on me.


**Transcribe the File from Choice (B): Your Own Audio Files**

In [None]:
!whisper "/content/my_audio_123.wav" --model base --fp16 False

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:02.000]  Bye bye bye bye bye bye


In [None]:
!whisper "/content/my_audio_123.wav" --model large --fp16 False

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:01.480]  Why, why, why, why, why?


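Whisper also saves each transcript in several formats (json, srt, tsv, txt and vtt). By default the files are written to the current working directory; use `--output_dir` to choose another folder.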
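
The JSON output is convenient for further processing in Python. The sketch below assumes the file holds a `segments` list whose items carry `start`, `end` and `text` fields, and that the transcript of "250231.mp3" is saved as "250231.json"; both are assumptions to check against your own output. The helper `load_segments` is hypothetical.

```python
import json
from pathlib import Path


def load_segments(json_path):
    """Return (start, end, text) tuples from a Whisper JSON transcript."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return [
        (segment["start"], segment["end"], segment["text"].strip())
        for segment in data.get("segments", [])
    ]


transcript_path = Path("/content/250231.json")  # assumed output name for 250231.mp3
if transcript_path.exists():
    for start, end, text in load_segments(transcript_path):
        print(f"{start:7.2f} to {end:7.2f}  {text}")
```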

## 4. Transcribe YouTube Videos

Download YouTube videos with **yt-dlp**

In [None]:
!pip install yt-dlp

Collecting yt-dlp
  Downloading yt_dlp-2025.10.22-py3-none-any.whl.metadata (176 kB)
Downloading yt_dlp-2025.10.22-py3-none-any.whl (3.2 MB)
Installing collected packages: yt-dlp
Successfully installed yt-dlp-2025.10.22


**Example 1: English Video**

ChatGPT: Are humans still smarter than AI? - BBC News \
https://www.youtube.com/watch?v=NR1Tvxaiu2Y

In [None]:
# Remove any earlier download so that yt-dlp writes a fresh file
if os.path.exists("/content/audio1.mp3"):
  os.remove("/content/audio1.mp3")
!yt-dlp -x --audio-format mp3 --output "audio1.mp3" "https://www.youtube.com/watch?v=NR1Tvxaiu2Y"

[youtube] Extracting URL: https://www.youtube.com/watch?v=NR1Tvxaiu2Y
[youtube] NR1Tvxaiu2Y: Downloading webpage
[youtube] NR1Tvxaiu2Y: Downloading android sdkless player API JSON
[youtube] NR1Tvxaiu2Y: Downloading tv client config
[youtube] NR1Tvxaiu2Y: Downloading tv player API JSON
[youtube] NR1Tvxaiu2Y: Downloading web safari player API JSON
[youtube] NR1Tvxaiu2Y: Downloading player 87644c66-main
         player = https://www.youtube.com/s/player/87644c66/player_ias.vflset/en_US/base.js
         n = gvcojeHQcVgC8A4aizS ; player = https://www.youtube.com/s/player/87644c66/player_ias.vflset/en_US/base.js
         Please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
[youtube] NR1Tvxaiu2Y: Downloading m3u8 information
[info] NR1Tvxaiu2Y: Downloading 1 format(s): 251
[download] Sleeping 4.00 seconds as required by the site...
[download] Destination: audio1.webm
[dow

In [None]:
!whisper "/content/audio1.mp3" --model base --fp16 False

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:04.240]  I'm designed to generate responses to questions based on my understanding of natural language.
[00:04.240 --> 00:07.520]  You've probably used it, you've definitely heard of it.
[00:07.520 --> 00:11.680]  Chat GPT has been held as a game changer. It can write songs.
[00:11.680 --> 00:15.520]  Melissa's got a vision. A story she wants to tell.
[00:15.520 --> 00:19.840]  Define quantum mechanics as well as explaining how I could boost my intelligence.
[00:20.480 --> 00:23.280]  Google has also released a competitor, Bard.
[00:23.920 --> 00:29.440]  Impressive, right? Computers seem to be rapidly outsmarting their creators.
[00:29.440 --> 00:31.120]  But is that really true?
[00:31.120 --> 00:37.520]  We have a lot of very sophisticated algorithms, but basically what these algorithms are doing is
[00:37.520 --> 00:42.480]  getting a lot of inf
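
The console output above follows a fixed pattern: a start time, an end time, then the text. If you copy such lines into Python, they can be split into their parts with a regular expression. This is a sketch based only on the line format shown in this notebook; the helper names are hypothetical.

```python
import re

# Matches lines such as "[00:04.240 --> 00:07.520]  Some text"
SEGMENT_LINE = re.compile(r"^\[([\d:.]+) --> ([\d:.]+)\]\s*(.*)$")


def to_seconds(timestamp):
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_segment(line):
    """Return (start_seconds, end_seconds, text), or None for non-segment lines."""
    match = SEGMENT_LINE.match(line.strip())
    if match is None:
        return None
    start, end, text = match.groups()
    return to_seconds(start), to_seconds(end), text


sample = "[00:04.240 --> 00:07.520]  You've probably used it, you've definitely heard of it."
print(parse_segment(sample))
```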

**Example 2: Non-English Video**

EdUHK AIDCEC 2526 MSc AIEP - Postgraduate Student Interview Part 1 \
https://www.youtube.com/watch?v=EKkZ8yVu2yg


**You can also try with other links:** \
EdUHK AIDCEC 2526 MSc AIEP - Postgraduate Student Interview Part 2 \
https://www.youtube.com/watch?v=tZcThqxgfmg

開心香港 \
https://www.youtube.com/watch?v=VpMve52OpQQ

香港教育大學「看動畫．學歷史」第一集：孔子(普通話) \
https://www.youtube.com/watch?v=D2Qcwjidxns
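
When trying several links, it can help to name each download after its video ID so that the files do not overwrite one another. The sketch below reads the `v` query parameter with the standard library; the helper `youtube_video_id` and the naming scheme are assumptions for illustration, not something yt-dlp requires.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


links = [
    "https://www.youtube.com/watch?v=EKkZ8yVu2yg",
    "https://www.youtube.com/watch?v=tZcThqxgfmg",
    "https://www.youtube.com/watch?v=VpMve52OpQQ",
    "https://www.youtube.com/watch?v=D2Qcwjidxns",
]
for link in links:
    print(f"{youtube_video_id(link)}.mp3")
```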


In [None]:
# Remove any earlier download so that yt-dlp writes a fresh file
if os.path.exists("/content/audio2.mp3"):
  os.remove("/content/audio2.mp3")
!yt-dlp -x --audio-format mp3 --output "audio2.mp3" "https://www.youtube.com/watch?v=EKkZ8yVu2yg"  # try replacing the link here

[youtube] Extracting URL: https://www.youtube.com/watch?v=EKkZ8yVu2yg
[youtube] EKkZ8yVu2yg: Downloading webpage
[youtube] EKkZ8yVu2yg: Downloading android sdkless player API JSON
[youtube] EKkZ8yVu2yg: Downloading tv client config
[youtube] EKkZ8yVu2yg: Downloading tv player API JSON
[youtube] EKkZ8yVu2yg: Downloading web safari player API JSON
[youtube] EKkZ8yVu2yg: Downloading m3u8 information
[info] EKkZ8yVu2yg: Downloading 1 format(s): 251
[download] Sleeping 5.00 seconds as required by the site...
[download] Destination: audio2.webm
[download] 100% of    1.25MiB in 00:00:00 at 2.57MiB/s
[ExtractAudio] Destination: audio2.mp3
Deleting original file audio2.webm (pass -k to keep)
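
The check-then-delete step used before each download can be wrapped in a small helper. This is one possible sketch using `pathlib` (Python 3.8 or newer); the function name `remove_if_exists` and the path "/content/audio3.mp3" are hypothetical.

```python
from pathlib import Path


def remove_if_exists(path):
    """Delete `path` if it is there; return True if a file was removed."""
    target = Path(path)
    existed = target.exists()
    target.unlink(missing_ok=True)  # no error if the file is already gone
    return existed


# Hypothetical file name for a third download; nothing happens if it is absent.
print(remove_if_exists("/content/audio3.mp3"))
```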


In [None]:
!whisper "/content/audio2.mp3" --model large --fp16 False

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.000 --> 00:14.400] 一个是给我更多的项目的了解
[00:14.400 --> 00:16.320] 一个是在我知识方面
[00:16.320 --> 00:18.580] 会让我有更广泛的学习
[00:18.580 --> 00:21.880] 它也是帮助我们结合AI和教育之间
[00:21.880 --> 00:23.600] 一个很好的一个桥梁
[00:23.600 --> 00:26.140] 所以在我了解到这个课程的时候
[00:26.140 --> 00:27.380] 就非常的感兴趣
[00:27.380 --> 00:28.940] 科技之类读书
[00:28.940 --> 00:33.520] 现在社会都对人文智能有很广泛的应用
[00:33.520 --> 00:36.400] 所以我想通过这方面的课程
[00:36.400 --> 00:40.260] 可以有我深入对这个专业的认识
[00:40.260 --> 00:41.740] 我是跟他讲我后面的规划
[00:41.740 --> 00:44.020] 他也是会给我一些相对的建议
[00:44.020 --> 00:45.720] 说他能给予我什么帮助
[00:45.720 --> 00:47.780] 我需要达到一种什么样的程度
[00:47.780 --> 00:49.420] 这边是小组讨论
[00:49.420 --> 00:50.200] 你要有一个小组
[00:50.200 --> 00:51.640] 然后你要去做一些discord
[00:51.640 --> 00:53.980] 我觉得这个对于一些思维的碰撞
[00:53.980 --> 00:54.900] 这个是非常好
[00:54.900 --> 00:56.020] 他里面学的一些东西
[00:56.020 --> 00:58.440] 和我们现在学的AI是非常的接触
[00:58.440 --> 00:58.860] 我觉得
[00:58.8

Translate non-English YouTube videos into English

In [None]:
!whisper "/content/audio2.mp3" --model large --task translate

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.000 --> 00:11.000]  What is the most important thing for you to learn about AI?
[00:11.000 --> 00:14.000]  One is to learn more about the project
[00:14.000 --> 00:18.000]  The other is to learn more about my knowledge
[00:18.000 --> 00:23.000]  It is also a good bridge between AI and education
[00:23.000 --> 00:27.000]  So when I learned about this course, I was very interested
[00:27.000 --> 00:29.000]  I studied in science and technology
[00:29.000 --> 00:33.000]  Now the society has a wide range of applications for AI
[00:33.000 --> 00:40.000]  So I want to learn more about AI through this course
[00:40.000 --> 00:42.000]  I talked to him about my future plans
[00:42.000 --> 00:44.000]  He also gave me some relative suggestions
[00:44.000 --> 00:46.000]  What kind of help can he give me?
[00:46.000 --> 00:48.000]  To what extent do I need to achieve?
[00:4
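
The same commands can be assembled in Python, which is handy when looping over several files or options. The sketch below only builds the argument list from flags that appear in this notebook (`--model`, `--task`, `--language`, `--fp16`); the helper `build_whisper_command` is hypothetical, and running the command still requires Whisper to be installed.

```python
import shlex


def build_whisper_command(audio_path, model="base", task="transcribe",
                          language=None, fp16=True):
    """Assemble the argument list for the `whisper` CLI from a few options."""
    command = ["whisper", audio_path, "--model", model, "--task", task]
    if language is not None:
        command += ["--language", language]
    if not fp16:
        command += ["--fp16", "False"]
    return command


command = build_whisper_command("/content/audio2.mp3", model="large",
                                task="translate", language="zh")
print(shlex.join(command))
# To run it outside a notebook cell: subprocess.run(command, check=True)
```

Passing `--language` skips the automatic language detection mentioned in the output above.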

*Extra: View all available options in Whisper*

In [None]:
!whisper --help

usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE]
               [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}]
               [--verbose VERBOSE] [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,L