<a href="https://colab.research.google.com/github/sarahkaarina/lazy-language/blob/main/transcribing_data/demo_transcribe_language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transcribing data

**This basic notebook walks you through how to transcribe pre-recorded audio using OpenAI's Whisper**

For more information on Whisper, other tutorials, and the sources where I learned most of what I know about using it, see the references at the bottom of this notebook.

**Step 1**

Whisper is not included in Colab, so let's install it using pip.

Every time the runtime of this notebook is reset, Colab will ask you to reinstall it.
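Since the install has to be repeated after a reset, it can be handy to check first whether the package is still available. Here is a minimal sketch using only the standard library; the helper name `is_installed` is my own and is not part of Colab or Whisper.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this runtime."""
    return find_spec(module_name) is not None


# "whisper" is the import name of the openai-whisper package.
if is_installed("whisper"):
    print("whisper is already installed")
else:
    print("whisper is missing: run the pip install cell")
```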

In [10]:
!pip install -U openai-whisper

Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting triton>=2.0.0 (from openai-whisper)
  Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)

**Step 2**

Load the relevant libraries (including our newly installed whisper).

In [1]:
# Library we will use to define and read in our file paths:
from pathlib import Path

# Library to read and manage json files:
import json

# Library to transcribe our data:
import whisper

# And the usual suspects, i.e. libraries we need to wrangle and organize data:
import numpy as np
import pandas as pd

# Display images:
from PIL import Image
from IPython.display import display


**Step 3**

Load the relevant whisper model.

Whisper has 6 different models, which range in size (and therefore speed; the smaller the model, the faster the transcription, though usually at some cost in accuracy).

For those interested in further details on Whisper itself, the source documentation can be found here: https://github.com/openai/whisper

In brief, whisper is a speech-recognition model that has been trained on a large amount of both English and multilingual audio data. Whisper can transcribe, translate and identify the language used in your speech data.

The screenshot below (taken from the Whisper landing page) shows the 6 different models. Let's break down their differences.


*   **Size**: the size of the model. It's important to know that the name of the size is also the name of the model we will need to call to load it later on (for the English-only models it's the model name + .en, e.g. base.en).

*   **Parameters**: the number of learned weights in the model. More parameters generally mean a more capable model, but also more memory and slower transcription. For more details see the references below (or, to have a play with calculating parameter counts for models, see here: https://transformerparameters-calculator.streamlit.app/)

*   **Required VRAM**: the amount of GPU memory you will need available to run the model.

*   **Relative speed**: how fast the model runs in comparison to the slowest model (large).

N.B.: The columns "English-only model" and "Multilingual model" are fairly self-explanatory: they indicate whether the model is only for English data or will work on multilingual data as well.
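The naming rule from the Size bullet can be sketched as a small helper. This is only an illustration: `model_name` is a hypothetical function of mine, not part of the whisper library, and it assumes (per the Whisper README) that .en variants exist for tiny, base, small and medium but not for large or turbo.

```python
# English-only checkpoints exist for these sizes (per the Whisper README);
# "large" and "turbo" are assumed to be multilingual only.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}
ALL_SIZES = ENGLISH_ONLY_SIZES | {"large", "turbo"}


def model_name(size: str, english_only: bool = False) -> str:
    """Build the name to pass to whisper.load_model() (hypothetical helper)."""
    if size not in ALL_SIZES:
        raise ValueError(f"Unknown model size: {size!r}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    # Fall back to the multilingual model when no ".en" variant exists.
    return size


print(model_name("base", english_only=True))
print(model_name("turbo", english_only=True))
```

The returned string is what you would hand to whisper.load_model in the next step.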

In [6]:
img = Image.open('/content/Screenshot 2024-10-11 at 12.29.29 PM.png')
display(img)

In [2]:
model = whisper.load_model("turbo")

100%|█████████████████████████████████████| 1.51G/1.51G [00:21<00:00, 74.7MiB/s]
  checkpoint = torch.load(fp, map_location=device)


**Step 4**

Transcribe the data!

Doing so is very simple: we call the function 'transcribe' on the whisper model we just loaded.

The function takes a string containing the file path to your audio file. We will also use the following arguments:



1.   **Language**: this tells whisper what language you want to transcribe from. In this case we will tell it we are transcribing from English ('en'). (See the list of ISO codes in the references.)

2.   **Verbose**: setting verbose to True means that whisper will show us its output as it's transcribing the data (each segment with its timestamps, plus any warnings).




In [8]:
# NB: whisper expects the file path as a string, not a Path object (i.e. not Path("path/to/file.wav")).

audiofile = "/content/28158-15-3228993-task-lj3f-10598718-animalsenglish-3-2.wav"

result = model.transcribe(audiofile, language="en", verbose=True)
result["text"]



[00:00.000 --> 00:26.800]  Dogs, cats, horses, pigs, wolves, badgers, foxes, weasels, stoats, ferrets, squirrels, sheep, cows, pigs, horses, goats, rabbits, whales, elephants, giraffes, tigers, lions.
[00:26.800 --> 00:31.160]  Dolphins.
[00:33.240 --> 00:36.600]  Gazelles, deer.
[00:42.600 --> 00:47.280]  Rhinoceroses, hippopotamuses.
[00:49.400 --> 00:52.440]  Hedgehogs.
[00:56.800 --> 01:00.080]  Wild Caps.
[01:00.120 --> 01:01.160] ulei Valudjaros.
[01:03.200 --> 01:05.240]  Hjistum við tóraum að Sól og hverfolaðan í þetta Nesna al người undernoð.
[01:05.240 --> 01:18.680]  Flnin er tórið gjordfið alveginarstörsk


' Dogs, cats, horses, pigs, wolves, badgers, foxes, weasels, stoats, ferrets, squirrels, sheep, cows, pigs, horses, goats, rabbits, whales, elephants, giraffes, tigers, lions. Dolphins. Gazelles, deer. Rhinoceroses, hippopotamuses. Hedgehogs. Wild Caps.ulei Valudjaros. Hjistum við tóraum að Sól og hverfolaðan í þetta Nesna al người undernoð. Flnin er tórið gjordfið alveginarstörsk<|ko|>'
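As noted in the cell above, 'transcribe' wants the file path as a string. If you build your paths with pathlib (imported in Step 2), convert them with str() first. Below is a sketch for collecting every .wav file in a folder and transcribing them in turn; both helper names and the folder layout are assumptions for illustration.

```python
from pathlib import Path


def list_audio_files(folder, pattern="*.wav"):
    """Return the matching files in `folder` as sorted plain strings."""
    return [str(path) for path in sorted(Path(folder).glob(pattern))]


def transcribe_folder(model, folder, language="en"):
    """Transcribe every matching file; returns {file path: transcript text}."""
    transcripts = {}
    for audio_path in list_audio_files(folder):
        # audio_path is already a str, which is what transcribe expects.
        result = model.transcribe(audio_path, language=language, verbose=False)
        transcripts[audio_path] = result["text"]
    return transcripts
```

With the model loaded in Step 3, calling transcribe_folder(model, "/content") would return one transcript per file.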

**Optional**:

Because Whisper's decoder works like a GPT-style language model, predicting each word from the text that came before it, you can also give it a 'helping hand' by prompting it on the content of the audio.

The example audio I am using in this notebook is from verbal fluency tasks I collected for my PhD thesis. Therefore, I am going to prompt Whisper that the audio in this file contains content from a verbal fluency task. I will also give it a further description of the task itself.

*Let's see if it's any better!*

In [7]:
# Adjacent string literals are joined as-is, so each piece ends with a space.
prompt = (
    "This is a computer recorded audio file containing a verbal fluency task. "
    "The speaker is a research subject and is naming all the animals "
    "they can think of in under sixty seconds.")

result = model.transcribe(audiofile, language='en', verbose=True, initial_prompt=prompt)
result["text"]



[00:00.000 --> 00:26.820]  Dogs, cats, horses, pigs, wolves, badgers, foxes, weasels, stoats, ferrets, squirrels, sheep, cows, pigs, horses, goats, rabbits, whales, elephants, giraffes, tigers, lions.
[00:26.820 --> 00:31.140]  Dolphins.
[00:33.660 --> 00:36.580]  Gazelles, deer.
[00:43.260 --> 00:47.280]  Rhinoceroses, hippopotamuses.
[00:50.280 --> 00:52.440]  Hedgehogs.
[00:56.820 --> 01:00.020]  Wildc elevate bancor.
[01:00.020 --> 01:15.940]  Many祖 props toücken their homeland in Michigan, they were drawn by gold militant, Georgios, Geovits, amputin andosis in the action of the population and distribution surg iconic patent in the colduna funds.


' Dogs, cats, horses, pigs, wolves, badgers, foxes, weasels, stoats, ferrets, squirrels, sheep, cows, pigs, horses, goats, rabbits, whales, elephants, giraffes, tigers, lions. Dolphins. Gazelles, deer. Rhinoceroses, hippopotamuses. Hedgehogs. Wildc elevate bancor. Many祖 props toücken their homeland in Michigan, they were drawn by gold militant, Georgios, Geovits, amputin andosis in the action of the population and distribution surg iconic patent in the colduna funds.'
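Besides "text", the result of 'transcribe' also holds a "segments" list, where every entry carries "start", "end" and "text" fields (the same values printed in the verbose output). Since the task lasts sixty seconds, one option is to keep only the segments that begin inside that window. This is a sketch under the assumption that the task starts at the very beginning of the recording; the helper name is mine, and the sample data is a hand-made stand-in using timestamps from the unprompted run.

```python
import json


def segments_within(result, max_seconds=60.0):
    """Keep segments that start before `max_seconds`, as plain dicts."""
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
        if seg["start"] < max_seconds
    ]


# Hand-made stand-in for a whisper result (not produced by running the model).
example_result = {
    "segments": [
        {"start": 49.4, "end": 52.44, "text": " Hedgehogs."},
        {"start": 56.8, "end": 60.08, "text": " Wild Caps."},
        {"start": 60.12, "end": 61.16, "text": "ulei Valudjaros."},
    ]
}

kept = segments_within(example_result)
print(json.dumps(kept, indent=2, ensure_ascii=False))
```

The same list of dicts can be written to disk with json.dump if you want to keep the timestamps alongside the text.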

**References:**

Whisper tutorial: https://christophergs.com/blog/ai-podcast-transcription-whisper

Whisper landing page: https://github.com/openai/whisper

Regarding parameters: https://medium.com/@geosar/understanding-parameter-calculation-in-transformer-based-models-simplified-e8c7f4e059d8

Language ISO codes: https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes

Some fun resources on YouTube: https://www.youtube.com/watch?v=wjZofJX0v4M&t=776s