# Audio2Caption AI

----

## Setup

Before we begin, install the libraries and tools that the model needs in order to run in this notebook.

**The libraries needed are as follows:**
- `numba` - (JIT compiler for generating machine code from Python)
- `numpy` - (mathematics functions library, needed for most AI/ML)
- `torch` - (all-in-one package for tensor computation/NN)
- `tqdm` - (progress meter)
- `more-itertools` - (iteration helpers such as chunking, used when handling the audio file)
- `tiktoken` - (tokenization of input data)

Run the following code block to install them all.

In [1]:
%pip install numba
%pip install numpy
%pip install torch
%pip install tqdm
%pip install more-itertools
%pip install tiktoken





[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Note: you may need to restart the kernel to use updated packages.
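
Once the install cell has finished (and the kernel has been restarted, if pip asked for it), a quick check like the sketch below can confirm that each library is importable. The helper name `missing_modules` is only an illustration and is not part of any library. Note that the pip package `more-itertools` is imported as `more_itertools`.

```python
import importlib.util

# Import names for the packages listed above
# (the pip name `more-itertools` is imported as `more_itertools`).
REQUIRED_MODULES = ["numba", "numpy", "torch", "tqdm", "more_itertools", "tiktoken"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

`find_spec` only looks the module up, so the check stays fast even for large packages such as `torch`.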





If you are running this on Linux, please also install the `triton` package by running the command below.

Triton provides a Python-like language for writing GPU kernels. It is listed as a Linux-only requirement because it is not yet fully supported on other platforms.

In [7]:
# Quote the whole requirement so the shell does not treat ">" or ";" as special characters
%pip install "triton>=2.0.0; platform_machine == 'x86_64' and sys_platform == 'linux' or sys_platform == 'linux2'"
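
To see in advance whether the environment marker in the `triton` requirement matches your machine, the condition can be evaluated by hand. This is a minimal sketch: `should_install_triton` is a hypothetical helper that mirrors the marker (`platform_machine` corresponds to `platform.machine()` and `sys_platform` to `sys.platform`), not something pip or Whisper provides.

```python
import platform
import sys


def should_install_triton(sys_platform, machine):
    """Mirror the marker: x86_64 Linux, or the legacy 'linux2' platform name."""
    return (machine == "x86_64" and sys_platform == "linux") or sys_platform == "linux2"


if should_install_triton(sys.platform, platform.machine()):
    print("The marker matches this platform: install triton.")
else:
    print("The marker does not match this platform: the triton install can be skipped.")
```

In the marker, `and` binds more tightly than `or`, which is why the parentheses are placed around the first two conditions.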

After installing the libraries above, you need two additional tools.

One of them is `ffmpeg`, a command-line tool for **converting audio and video**. Whisper uses it to decode the audio file.

You can install it with a package manager such as Chocolatey or Scoop on Windows, `brew` on macOS, or `apt` on Ubuntu/Debian. The full list of commands is below.

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
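
After installing, it can help to confirm that `ffmpeg` is actually reachable from the notebook's kernel. The sketch below uses the standard library's `shutil.which`. The helper name `find_tool` is an illustration only.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the full path of a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)


ffmpeg_path = find_tool("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg was not found on PATH; install it with one of the commands above.")
else:
    # `ffmpeg -version` prints build information; show only its first line.
    result = subprocess.run([ffmpeg_path, "-version"], capture_output=True, text=True)
    first_line = result.stdout.splitlines()[0] if result.stdout else "(no output)"
    print(first_line)
```

If the tool was installed while the notebook was already open, the kernel may need a restart before the new PATH entry is visible.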

The other is `rust`, a fast, widely used programming language. You only need it if `tiktoken` does not provide a pre-built wheel for your platform. In that case, pip attempts to **compile the library from its source code**, which is written in Rust.

To install `rust` on your system, [click here](https://www.rust-lang.org/tools/install).

**(Optional)** Additionally, you may need to configure the PATH environment variable, e.g. `export PATH="$HOME/.cargo/bin:$PATH"`.
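
Inside a notebook, an `export` run in a shell cell does not change the environment of the running kernel, so one option is to update `os.environ` from Python instead. The sketch below assumes Rust was installed to the default `~/.cargo/bin` location shown above. `prepend_to_path` is a hypothetical helper written for this example.

```python
import os
from pathlib import Path


def prepend_to_path(directory, env):
    """Put `directory` at the front of env["PATH"] unless it is already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.insert(0, directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Assumed default install location; adjust if Rust was installed elsewhere.
cargo_bin = Path.home() / ".cargo" / "bin"
if cargo_bin.is_dir():
    prepend_to_path(str(cargo_bin), os.environ)
    print("PATH now includes:", cargo_bin)
else:
    print("No ~/.cargo/bin directory found; Rust may be installed elsewhere.")
```

The change only lasts for the current kernel session, so commands started from later cells will see it, but a new terminal will not.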

**(Optional)** If the installation fails with `No module named 'setuptools_rust'`, install `setuptools_rust` by running:

In [2]:
%pip install setuptools_rust

Collecting setuptools_rust
  Downloading setuptools_rust-1.10.2-py3-none-any.whl.metadata (9.2 kB)
Collecting semantic-version<3,>=2.8.2 (from setuptools_rust)
  Downloading semantic_version-2.10.0-py2.py3-none-any.whl.metadata (9.7 kB)
Downloading setuptools_rust-1.10.2-py3-none-any.whl (26 kB)
Downloading semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: semantic-version, setuptools_rust
Successfully installed semantic-version-2.10.0 setuptools_rust-1.10.2
Note: you may need to restart the kernel to use updated packages.


Refer to the following table to choose the model size that suits your hardware. Required VRAM is approximate, and relative speed is measured against the `large` model. Keep your choice in mind, as we will refer to it later.

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~10x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~7x       |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~4x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
| turbo  |   809 M    |        N/A         |      `turbo`       |     ~6 GB     |      ~8x       |
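
As a rough aid for reading the table, the sketch below picks a model name from the amount of VRAM available. The selection rule (largest parameter count that fits) is an assumption made for this example, and `pick_model` is a hypothetical helper. Depending on your needs, the relative speed column may matter more than size.

```python
# Rows copied from the table above: (size, parameters in millions,
# English-only name or None, multilingual name, approximate VRAM in GB).
MODELS = [
    ("tiny", 39, "tiny.en", "tiny", 1),
    ("base", 74, "base.en", "base", 1),
    ("small", 244, "small.en", "small", 2),
    ("medium", 769, "medium.en", "medium", 5),
    ("large", 1550, None, "large", 10),
    ("turbo", 809, None, "turbo", 6),
]


def pick_model(vram_gb, english_only=False):
    """Return the name of the largest listed model that fits in vram_gb, or None."""
    best = None
    for size, params, en_name, multi_name, vram in MODELS:
        name = en_name if english_only else multi_name
        if name is None or vram > vram_gb:
            continue  # no such variant, or it needs more memory than available
        if best is None or params > best[0]:
            best = (params, name)
    return best[1] if best else None


print(pick_model(6))                     # turbo
print(pick_model(4, english_only=True))  # small.en
```

The returned string is in the same form as the names in the table, for example `"turbo"`.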

We can now officially begin the coding process.

## Deploying your Whisper model

Following the [Whisper repository on GitHub](https://github.com/openai/whisper), the code below loads the model, splits the audio into 30-second chunks, and transcribes each chunk with timestamps.

In [6]:
import whisper
import math

# Load the model
model = whisper.load_model("turbo")

# Load audio file
audio = whisper.load_audio("harvard.wav")

# Define the duration in samples for a 30-second window
chunk_duration = 30 * 16000  # 30 seconds at 16,000 samples per second
total_chunks = math.ceil(len(audio) / chunk_duration)

# Iterate over chunks and transcribe with timestamps
transcribed_text = []
for i in range(total_chunks):
    start_sample = i * chunk_duration
    end_sample = min((i + 1) * chunk_duration, len(audio))

    # Pad or trim the chunk to exactly 30 seconds of samples
    chunk = whisper.pad_or_trim(audio[start_sample:end_sample])

    # Transcribe the chunk; transcribe() computes the log-Mel spectrogram itself,
    # so no separate spectrogram step is needed here
    result = model.transcribe(chunk, language="en", word_timestamps=True)

    # Adjust segment timestamps based on current chunk start
    for segment in result["segments"]:
        start_time = segment["start"] + (i * 30)
        end_time = segment["end"] + (i * 30)
        text = segment["text"]
        transcribed_text.append(f"[{start_time:.3f} --> {end_time:.3f}]  {text}")

# Print the full transcription with timestamps
print("\n".join(transcribed_text))




[0.820 --> 3.620]   The stale smell of old beer lingers.
[3.920 --> 6.180]   It takes heat to bring out the odor.
[6.660 --> 9.340]   A cold dip restores health and zest.
[9.840 --> 12.020]   A salt pickle tastes fine with ham.
[12.640 --> 14.320]   Tacos al pastor are my favorite.
[14.920 --> 17.460]   A zestful food is the hot cross bun.
