<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Speech_to_Text_with_Whisper/Whisper_AI_for_Public_Health.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1>Speech-to-Text with Whisper AI</h1></center>

![](https://images.ctfassets.net/kftzwdyauwt9/18ff9c06-7853-4e3b-d849bc901978/2b49cdd19fcdf22f689f606fdf2dc8d6/asr-details-desktop.svg?w=1920&q=90)

# Link to Materials: [GDrive Link](https://drive.google.com/drive/folders/1NAwDRZS4tfW-1A7eEUax-upMQGnz25Q-?usp=share_link)

Link to transcript of the 56-minute audio: [GDrive Link](https://drive.google.com/file/d/1j9Z7lF2DsinNkGBtRzpUtjkuGy8vGSXD/view?usp=share_link)

# Speech-to-text Fundamentals

## Some Terminology
- Speech-To-Text (STT): The task of taking an audio file containing speech as input and returning the words and sentences spoken as output, usually with timestamps.
- Transcripts: Files that store the spoken content of a recording as text.
- (Closed) Captions: Text that follows the audio in time, and may include descriptions of the audio and video content.
- Subtitles: Translations of captions into another language.
- Word Error Rate (WER): A metric used to evaluate transcription quality. It is the number of word-level errors (substitutions, deletions, and insertions) in a transcript divided by the number of words in the reference transcript, usually reported as a percentage.
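
To make the WER definition concrete, here is a minimal sketch that computes it with a word-level edit distance. The function name `word_error_rate` and the two example sentences are illustrative assumptions, and real evaluations usually normalize punctuation and casing more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "she was referred to the hospital for admission"
hypothesis = "she was referred to hospital for an admission"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.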


## Transcription formats and content
- Caption formats: `.vtt`, `.srt`
- Other containers: plain text files, JSON, TextGrids
- Typical content: speaker, content, timestamps

1. VTT (WebVTT)
WebVTT is commonly used for displaying timed text tracks in HTML5 videos.

```
WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, and welcome to today's workshop

00:00:02.500 --> 00:00:05.000
where we will discuss speech recognition.
```

2. SRT (SubRip Subtitle)
SRT is one of the most widely used subtitle formats, known for its simplicity.

```
1
00:00:00,000 --> 00:00:02,500
Hello, and welcome to today's workshop

2
00:00:02,500 --> 00:00:05,000
where we will discuss speech recognition.
```

3. JSON can be useful for storing structured data, including transcription with timestamps.

```
{
    "transcriptions": [
        {
            "start": "00:00:00.000",
            "end": "00:00:02.500",
            "text": "Hello, and welcome to our video."
        },
        {
            "start": "00:00:02.500",
            "end": "00:00:05.000",
            "text": "Today, we will discuss the basics of speech recognition."
        }
    ]
}
```
4. TextGrids: these are outputs from speech processing software like Praat, where the same file contains data from multiple annotation tiers. It's a great way to annotate audio in more than one way.

```
FileType = "ooTextFile"
ObjectClass = "TextGrid"
xmin = 0
xmax = 10
tiers? <exists>
size = 2
item []:
    item [0]:
        class = "IntervalTier"
        name = "Words"
        xmin = 0
        xmax = 10
        intervals: size = 3
        intervals [0]: xmin = 0.0; xmax = 2.5; text = "Hello"
        intervals [1]: xmin = 2.5; xmax = 5.0; text = "and welcome"
        intervals [2]: xmin = 5.0; xmax = 10.0; text = "to today's session"
    item [1]:
        class = "IntervalTier"
        name = "Phonemes"
        xmin = 0
        xmax = 10
        intervals: size = 6
        intervals [0]: xmin = 0.0; xmax = 0.5; text = "H"
        intervals [1]: xmin = 0.5; xmax = 1.0; text = "ɛ"
        intervals [2]: xmin = 1.0; xmax = 1.5; text = "l"
        intervals [3]: xmin = 1.5; xmax = 2.0; text = "o"
        intervals [4]: xmin = 2.5; xmax = 3.0; text = "a"
        intervals [5]: xmin = 3.0; xmax = 3.5; text = "n"
```

**Bottom line:** all transcription outputs contain information about the content of a recording, and can be converted into one another. For our NLP pipeline, it is important to know which one is being asked for.
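
As a rough illustration of how these formats can be converted into one another, the sketch below writes the same list of segments as WebVTT and as SRT. The helper names and the choice of storing `start` and `end` as seconds (floats) are assumptions made for this example; the JSON sample above stores them as strings instead.

```python
def format_timestamp(seconds: float, decimal_marker: str = ".") -> str:
    """Format seconds as HH:MM:SS<marker>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_vtt(segments) -> str:
    """WebVTT: a WEBVTT header, then cues with '.' before the milliseconds."""
    cues = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        cues += [f"{start} --> {end}", seg["text"].strip(), ""]
    return "\n".join(cues)


def to_srt(segments) -> str:
    """SRT: numbered cues with ',' before the milliseconds and no header."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], decimal_marker=",")
        end = format_timestamp(seg["end"], decimal_marker=",")
        cues += [str(index), f"{start} --> {end}", seg["text"].strip(), ""]
    return "\n".join(cues)


segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello, and welcome to today's workshop"},
    {"start": 2.5, "end": 5.0, "text": "where we will discuss speech recognition."},
]
print(to_vtt(segments))
print(to_srt(segments))
```

The only differences between the two outputs are the header, the cue numbers, and the decimal marker in the timestamps.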

## Popular use cases
- Accessibility
- Audio input for assistants, hands-free applications
- Downstream NLP tasks


## Option 1: Zoom Transcription and Captions

- Free for premium accounts
- Set up in the Zoom cloud
- Great for non-private settings
- Generates `.vtt` files with timestamps, as well as a `.txt` transcript file. Ignores false starts and filler sounds
- Caption support for many languages

## Documentation:
- [Enabling or disabling audio transcription for cloud recordings](https://support.zoom.com/hc/en/article?id=zm_kb&sysparm_article=KB0065911)
- [Enabling automated captions](https://support.zoom.com/hc/en/article?id=zm_kb&sysparm_article=KB0058810)
- [Enabling and configuring translated captions](https://support.zoom.com/hc/en/article?id=zm_kb&sysparm_article=KB0059081)
- [Real-time automatic caption translation](https://support.zoom.com/hc/en/article?id=zm_kb&sysparm_article=KB0060844)

## Downstream NLP tasks from STT data

- Conversation summarization and automatic note-taking
- Topic analysis
- Named Entity Recognition (NER)
- Speaker dominance and conversation quality assessments

## Web-scale Supervised Pretraining for Speech Recognition (Whisper)


### Basics


<img src="https://raw.githubusercontent.com/openai/whisper/main/approach.png" width="600" />

[image source](https://raw.githubusercontent.com/openai/whisper/main/approach.png)

- Powerful audio transformer model from OpenAI.
- The model maps utterances to their transcribed form across multiple languages.
- It can be downloaded and run on one's own setup (a GPU is strongly recommended) without sending data over the web.
- Its training data includes many different recording conditions: noisy and quiet environments, audio with and without speech, songs, etc.
- As a result, it performs well in both quiet and noisy environments.
- Whisper uses a **sequence-to-sequence transformer** model.
- It also uses weak supervision for training on transcripts (that is, not all of the transcripts are labelled or even generated by humans).
- It uses a 'multitask training format': a set of special tokens tells the single model which task to perform on the audio.
- It is powerful because the model has been pre-trained on many speech processing tasks, such as multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

When we call the model to process a file, it makes predictions for the set of tasks as a whole, instead of sending the data through different stages.


### Data and pipeline
- 680,000 hours of audio and transcripts
- Source of data: the internet.
- ~65% (438,000 hours): English-language audio and transcripts
- ~18% (126,000 hours): non-English audio with English translations
- ~17% (117,000 hours): non-English audio and transcripts from 98 languages

 Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
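
The chunking step can be sketched in plain Python. This is a simplification under stated assumptions: the 16 kHz sample rate is assumed here, the function name `split_into_windows` is made up for illustration, and the real pipeline also computes the log-Mel spectrogram for each window, which is not shown.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed input rate)
WINDOW_SECONDS = 30    # window length described in the text above


def split_into_windows(samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Split a sequence of samples into fixed-length windows, zero-padding the last one."""
    window_size = sample_rate * window_seconds
    windows = []
    for start in range(0, len(samples), window_size):
        window = list(samples[start:start + window_size])
        window += [0.0] * (window_size - len(window))  # pad the final, shorter window
        windows.append(window)
    return windows


# Toy example: 70 "seconds" of audio at 10 samples per second
toy_audio = [0.1] * 700
windows = split_into_windows(toy_audio, sample_rate=10)
print(len(windows), [len(w) for w in windows])
```

At 16 kHz, each 30-second window holds 480,000 samples, so short recordings are mostly padding.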

# STEP-1: Whisper Setup

We use Whisper by installing its Python library and downloading the model we need.

In [1]:
! pip install git+https://github.com/openai/whisper.git -q

# Load the model
import whisper
model = whisper.load_model("base")



# STEP-2: Processing the Audio files

In this example, we will run Whisper on the command line for `.mp3` files in English and Hindi. The larger the file, the more time the model needs.

In [2]:
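import locale

# Show the current preferred encoding, then force UTF-8.
# This works around shell commands in Colab failing when the locale is not UTF-8.
print(locale.getpreferredencoding())

def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding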

utf-8


In [3]:
! pip install -U pytube
!pip install yt-dlp
import os




## Get Audio File

Download it here: [GDrive Link](https://drive.google.com/file/d/1xMrxqPLVh-kmoeKkHQXiA6_pawJw98JJ/view?usp=share_link), or download it from YouTube:

In [4]:
#### Edit this to add a YouTube URL
video_url = "https://www.youtube.com/watch?v=NCY_qMpCExQ"
audio_filename = "audio_file1.mp3"

In [5]:
os.system(f'yt-dlp -x --audio-format mp3 -o "{audio_filename.replace( "mp3", "%(ext)s")}" "{video_url}"')

0
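
A possible alternative to `os.system` is to pass the same yt-dlp options as an argument list to `subprocess.run`, which avoids shell quoting problems and raises an error when the download fails. The helper names below are hypothetical; the flags are the ones used in the cell above.

```python
import subprocess


def build_ytdlp_command(url: str, output_path: str) -> list[str]:
    """Build the yt-dlp argument list used above, without going through a shell."""
    # yt-dlp fills in %(ext)s with the final extension after audio extraction
    template = output_path.rsplit(".", 1)[0] + ".%(ext)s"
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", template, url]


def download_audio(url: str, output_path: str) -> None:
    """Run yt-dlp and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_ytdlp_command(url, output_path), check=True)


print(build_ytdlp_command("https://www.youtube.com/watch?v=NCY_qMpCExQ", "audio_file1.mp3"))
```

Splitting off only the final extension also avoids accidentally rewriting an `mp3` that appears elsewhere in the path.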

In [9]:
print(audio_filename)

audio_file1.mp3


## Chunk Audio File

Next, we will use FFmpeg, a command-line tool with powerful media-handling capabilities.

To keep processing manageable, we will use FFmpeg to split the audio file into short 2-minute (120-second) chunks.

In [10]:
# FFmpeg is a system tool, not a Python package: `pip install ffmpeg` installs an
# unrelated PyPI package. Colab ships with FFmpeg, so we only confirm it is available:
!ffmpeg -version


In [11]:
!mkdir -p /content/chunks

In [12]:
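# Split the audio into 120-second segments without re-encoding (-c copy).
# Segments are written as audio_filename_000.mp3, audio_filename_001.mp3, ...
!ffmpeg -i "/content/$audio_filename" -f segment -segment_time 120 -c copy /content/chunks/audio_filename_%03d.mp3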



## Check the output folder

`/content/chunks` should contain segments of the audio file. Check if you can play the audio:

In [13]:
# ffmpeg numbers segments from 000, so this is the second chunk
first_test_audio = "/content/chunks/audio_filename_001.mp3"

In [14]:
from IPython.display import Audio
Audio(first_test_audio)
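
To check the output folder programmatically, a sketch like the following lists the chunks in order and estimates where each one starts in the original recording. The function name `chunk_offsets` is hypothetical, and the offsets are approximate because `-c copy` can only cut at packet boundaries.

```python
import re
from pathlib import Path

SEGMENT_SECONDS = 120  # must match the -segment_time value passed to ffmpeg


def chunk_offsets(directory, segment_seconds=SEGMENT_SECONDS):
    """Return (path, start_offset_in_seconds) for each numbered chunk, in order."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    chunks = []
    for path in sorted(directory.glob("*.mp3")):
        match = re.search(r"_(\d+)\.mp3$", path.name)
        if match:  # skip files that do not follow the name_NNN.mp3 pattern
            index = int(match.group(1))
            chunks.append((path, index * segment_seconds))
    return chunks


for path, offset in chunk_offsets("/content/chunks"):
    print(f"{path.name}: starts at about {offset} s in the original recording")
```

These offsets are useful later for shifting chunk-relative timestamps back onto the timeline of the full recording.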

# STEP-3: Generating transcription files
[Whisper documentation](https://pypi.org/project/openai-whisper/)

[Code Source](https://github.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial)

In [None]:
# install whisper from the Github repository:
# !pip install git+https://github.com/openai/whisper.git -q

# # Load the model
# import whisper

# model = whisper.load_model("base")

In [7]:
# Build helper that Whisper's tokenizer dependency may need if no pre-built wheel is available:
!pip install setuptools-rust

Collecting setuptools-rust
  Downloading setuptools_rust-1.11.1-py3-none-any.whl.metadata (9.6 kB)
Downloading setuptools_rust-1.11.1-py3-none-any.whl (28 kB)
Installing collected packages: setuptools-rust
Successfully installed setuptools-rust-1.11.1


In [15]:
# Check the usage guide for more options:
!whisper --help

usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE]
               [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}]
               [--verbose VERBOSE] [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,L

In [16]:
# Check for GPU availability:
!nvidia-smi

Wed Jun 11 21:36:37 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             30W /   70W |     544MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Transcription Task

In [17]:
from IPython.display import Audio
Audio(first_test_audio)

In [18]:
output_dir_transciption = first_test_audio.replace(".mp3", "_transcripts")

# Process an audio file with specified parameters:
!whisper $first_test_audio \
--model medium --task transcribe \
--output_dir $output_dir_transciption \
--output_format all \
--word_timestamps True


Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.380]  We're a non-inherent patient, some who speaks English, some who doesn't speak English.
[00:03.640 --> 00:08.060]  I went to the hospital earlier to try to interview the patient.
[00:08.320 --> 00:12.980]  She was in a CT scan, but I was able to read the full medical record.
[00:13.220 --> 00:21.340]  She is a 22-year-old female, born in Ecuador, symptoms of cough, fever, and a 10 to 15-pound
[00:21.340 --> 00:22.940]  weight loss over the past month.
[00:23.300 --> 00:28.660]  She was evaluated by her primary physician, and codeine cough syrup was prescribed with
[00:28.660 --> 00:29.520]  no improvement.
[00:30.160 --> 00:35.760]  Chest X-ray reported bronchitis, and antibiotics were prescribed with no response.
[00:36.000 --> 00:42.660]  She was referred to the hospital ER for admission, and a chest X-ray was abnormal with cavitary
[00:42

In [None]:
# Process an audio file with specified parameters:
!whisper $first_test_audio \
--model medium \
--task transcribe \
--output_dir $output_dir_transciption \
--output_format all \
--word_timestamps True


# You can also try:
# --output_format srt
# --max_words_per_line 3
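
To run the same command over many chunks, one option is to assemble the CLI call in Python. This is a sketch, not part of the Whisper package: `build_whisper_command` and `run_whisper` are hypothetical helpers, and they only use flags that appear in the cells above and in the `whisper --help` output.

```python
import subprocess


def build_whisper_command(audio_path, output_dir, model="medium", task="transcribe",
                          output_format="all", language=None, word_timestamps=False):
    """Assemble a whisper CLI call from the options used in this notebook."""
    command = ["whisper", str(audio_path),
               "--model", model,
               "--task", task,
               "--output_dir", str(output_dir),
               "--output_format", output_format]
    if language:
        command += ["--language", language]
    if word_timestamps:
        command += ["--word_timestamps", "True"]
    return command


def run_whisper(audio_paths, **options):
    """Process several files one after another, e.g. every chunk in /content/chunks."""
    for audio_path in audio_paths:
        output_dir = str(audio_path).replace(".mp3", "_transcripts")
        subprocess.run(build_whisper_command(audio_path, output_dir, **options), check=True)


print(" ".join(build_whisper_command("/content/chunks/audio_filename_001.mp3",
                                     "/content/chunks/audio_filename_001_transcripts",
                                     word_timestamps=True)))
```

Each CLI call loads the model again, so for a large number of chunks it may be quicker to call Whisper from Python as described in STEP-4A.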

## Translation Task

[Code Source](https://github.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial)

In [11]:
#### Edit this to add a YouTube URL
video_url = "https://www.youtube.com/watch?v=mtGe2MM56IU"


In [22]:
audio_filename_hindi = "/content/" + "tb_hindi.mp3"
os.system(f'yt-dlp -x --audio-format mp3 -o "{audio_filename_hindi.replace( "mp3", "%(ext)s")}" "{video_url}"')


In [23]:
from IPython.display import Audio
Audio(audio_filename_hindi)

In [24]:
audio_filename_hindi_transcription_path = audio_filename_hindi.replace(".mp3", "_translate")

In [25]:
# Process and translate audio from Hindi, and save an English transcript:
!whisper $audio_filename_hindi \
--language Hindi \
--task translate \
--model medium \
--output_dir $audio_filename_hindi_transcription_path \
--output_format txt


# STEP-4A: [OPTIONAL] Adding Whisper to a task pipeline



Instead of calling Whisper on the command line, we can call it from our Python code, which gives us more control over the input and output.

In this example, we define a function that takes an audio file path as input and returns the recognized text (and prints the language it detects).

[Code source](https://github.com/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb)

In [41]:
# Create a function to process the audio file
# and generate the Whisper output:

def transcribe_30_seconds(audio):

    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text
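
The function above decodes only the first 30 seconds of a file. For whole recordings, a sketch along the following lines uses the model's `transcribe` method, which moves a window over the full audio. The wrapper name `transcribe_file` is hypothetical, and the example call is left as a comment because it assumes the `model` from STEP-1 and the audio files downloaded earlier.

```python
def transcribe_file(model, audio_path, task="transcribe", language=None):
    """Transcribe (or translate) a whole file and return (start, end, text) tuples."""
    # language=None lets the model detect the language from the start of the audio
    result = model.transcribe(audio_path, task=task, language=language)
    print(f"Language: {result['language']}")
    return [(seg["start"], seg["end"], seg["text"].strip()) for seg in result["segments"]]


# Example (assumes `model` and `audio_filename_hindi` are defined as in this notebook):
# for start, end, text in transcribe_file(model, audio_filename_hindi, task="translate"):
#     print(f"[{start:7.2f} --> {end:7.2f}] {text}")
```

Returning the segments rather than a single string keeps the timestamps available for captions or alignment.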


In [43]:
# Transcribe two recording files:
# easy_text = transcribe_30_seconds(first_test_audio)
# print(easy_text)

hard_text = transcribe_30_seconds(audio_filename_hindi)
print(hard_text)


Detected language: hi
ٹیبر کلوسز جسے آمتور پر ٹیبی کے روپے جانا جاتا ہے یہ ایک سنکراما کنفیکشن ہے جو کی آمتور پر پھے پھروں پر ہملا کرتا ہے یا شریع کے انہیں حسوں میں بھی پھیل سکتا ہے جس کی مستشک ریڈ کی ہڈیادی آج زیادہ تر ٹیبی کے معاملے اینٹی بیائیوٹک دواؤں کے ساتھ ٹیک ہو جاتے ہیں لیکن اس میں λمبہ سمی لکتا ہے کم سے کم چھے سے νوں میں ہی نے


# STEP-4B: [OPTIONAL] Web UI Toolkit for Recording

A simple web UI for recording and processing audio, built with Gradio. Please run the code in Step-4A first.

[Code Source](https://colab.research.google.com/github/petewarden/openai-whisper-webapp/blob/main/OpenAI_Whisper_ASR_Demo.ipynb#scrollTo=deSAVvfJcWBo)

Note: this may ask for browser permissions, and the recording function may or may not work on Colab.

In [44]:
! pip install gradio -q

In [45]:
import gradio as gr
import time

In [46]:
gr.Interface(
    title = 'OpenAI Whisper ASR Gradio Web UI',
    fn=transcribe_30_seconds,
    inputs=[
        gr.Audio(type="filepath")
    ],
    outputs=[
        "textbox"
    ],
    live=True).launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b0f1e0566b5815e778.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# Visualising and Analysing STT Output

Now that we have text for the audio we processed, we can use it for downstream analysis. Let's try and visualise the data to see what is being talked about.



## First, we fetch the transcripts:

In [71]:
import os

def find_first_txt(directory="./transcripts"):
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            return os.path.join(directory, filename)
    return None  # If no .txt file found



In [72]:
# Example usage
txt_file = find_first_txt(output_dir_transciption)
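
When several chunks have been transcribed, it can be convenient to read all of their `.txt` files at once. The sketch below is one way to do that; `load_transcripts` is a hypothetical helper, and it assumes the transcripts sit somewhere under a single root folder such as `/content/chunks`.

```python
from pathlib import Path


def load_transcripts(root):
    """Read every .txt file under `root` (in sorted path order) and join the text."""
    root = Path(root)
    if not root.is_dir():
        return ""
    parts = []
    for path in sorted(root.rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        parts.append(" ".join(text.split()))  # collapse line breaks and repeated spaces
    return " ".join(parts)


all_text = load_transcripts("/content/chunks")
print(all_text[:100])
```

Sorting the paths keeps the chunks in recording order as long as the zero-padded numbering from ffmpeg is preserved in the folder names.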


In [83]:
# Sample text
# Read text from file
with open(txt_file, "r", encoding="utf-8") as file:
    text = file.read().replace("\n", " ")  # Remove line breaks
print(text[:100])

Then you got an antibiotic, which didn't help either, I understand. Yeah, I think so. I just don't r


In [69]:
!pip install spacy wordcloud matplotlib
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [88]:
import spacy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

# Load the English spaCy model
nlp = spacy.load("en_core_web_sm")

# Apply NER
doc = nlp(text)


## Word Clouds: Useful for assessing topics being discussed in a text

We will use the spaCy library for processing the text.

### Example-1

In [None]:
# Word Cloud for all words:

# Keep only non-stopword, non-punctuation words
words = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]

# Count word frequencies
word_freq = Counter(words)

# Create the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_freq)

# Print unique words
print("Unique words:")
for word in word_freq:
    print(word)

# Plot it
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud (All Words)")
plt.show()
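
Printing every unique word can get long for a full transcript. A small sketch such as the one below shows only the most frequent words instead; `top_words` is a hypothetical helper, and the toy counts stand in for the `word_freq` counter built in the cell above.

```python
from collections import Counter


def top_words(word_freq, n=10, min_count=1):
    """Return up to `n` (word, count) pairs with count >= min_count, most frequent first."""
    return [(word, count) for word, count in word_freq.most_common(n) if count >= min_count]


# Toy counts standing in for the `word_freq` Counter from the notebook
example_freq = Counter(["cough", "fever", "cough", "hospital", "cough", "fever"])
for word, count in top_words(example_freq, n=2):
    print(f"{word:<10} {count}")
```

The same list can be used as a quick sanity check that the largest words in the cloud match the raw counts.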


In [None]:
# Extract named entities (you can filter by type if needed)
entities = [ent.text for ent in doc.ents]

# Count how often each entity occurs (case-insensitive)
entity_counts = Counter([e.lower() for e in entities])

# Print unique entities
print("Unique entities:")
for word in entity_counts:
    print(word)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(entity_counts)

# Display it
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Named Entities Word Cloud")
plt.show()
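
The entity extraction above mentions filtering by type. A minimal sketch of that idea follows, using each entity's `label_` attribute; `count_entities` is a hypothetical helper, and the stand-in objects and their labels are illustrative rather than real model output.

```python
from collections import Counter
from types import SimpleNamespace


def count_entities(entities, labels=None):
    """Count entity texts (lowercased), optionally keeping only the given labels."""
    return Counter(
        ent.text.lower()
        for ent in entities
        if labels is None or ent.label_ in labels
    )


# Stand-ins for spaCy entity spans; with the notebook's `doc`, pass `doc.ents` instead
example_entities = [
    SimpleNamespace(text="Ecuador", label_="GPE"),
    SimpleNamespace(text="22-year-old", label_="DATE"),
    SimpleNamespace(text="ecuador", label_="GPE"),
]
print(count_entities(example_entities, labels={"GPE"}))
```

The resulting counter can be passed to `generate_from_frequencies` in the same way as `entity_counts`.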

In [91]:
txt_file2 = find_first_txt(audio_filename_hindi_transcription_path)

In [92]:
# Sample text
# Read text from file
with open(txt_file2, "r", encoding="utf-8") as file:
    text2 = file.read().replace("\n", " ")  # Remove line breaks
print(text2[:100])


In [95]:
# Apply NER
doc = nlp(text2)

In [None]:

# Keep only non-stopword, non-punctuation words
words = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]

# Count word frequencies
word_freq = Counter(words)

# Print unique words
print("Unique words:")
for word in word_freq:
    print(word)

# Create the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_freq)

# Plot it
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud (All Words)")
plt.show()


In [None]:
# Extract named entities (you can filter by type if needed)
entities = [ent.text for ent in doc.ents]

# Count how often each entity occurs (case-insensitive)
entity_counts = Counter([e.lower() for e in entities])

# Print unique entities
print("Unique entities:")
for word in entity_counts:
    print(word)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(entity_counts)

# Display it
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Named Entities Word Cloud")
plt.show()

# References and Further Reading

- [Illustrated Wav2vec 2.0](https://jonathanbgn.com/2021/09/30/illustrated-wav2vec-2.html)
- [Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/pdf/2006.11477)
- [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/pdf/2212.04356)

