### Testing out Distil-Whisper

Model source: the [Distil-Whisper GitHub repo](https://github.com/huggingface/distil-whisper/tree/3c8c15f771139f4c98284486534667a87927ae45).

In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

In [2]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

In [3]:
display(device)
display(torch_dtype)

'cpu'

torch.float32

In [4]:
# Distil-Whisper model ID on the Hugging Face Hub
model_id = "distil-whisper/distil-large-v2"

# load the model...
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,  # reduce peak CPU memory while loading the weights
    use_safetensors=True,  # load weights from the safetensors format
)

model.to(device)

# ... and the processor
processor = AutoProcessor.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [8]:
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    # max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

In [14]:
# load audio sample from LibriSpeech corpus
from datasets import load_dataset

dataset = load_dataset('hf-internal-testing/librispeech_asr_dummy', 'clean', split='validation')
sample_audio = dataset[0]['audio']

In [15]:
result = pipe(sample_audio)
print(result['text'])

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.


In [15]:
# load a local audio sample
humorous_conversation = r"C:\Users\Administrator\Documents\Sound Recordings\Spreading_jam_on_bread.mp3"
result2 = pipe(humorous_conversation)
print(result2['text'])

 Ah, good morrow, dear sir, might I in my moment of hunger-induced desperation humbly beseech thee for the boon of thy bladed utensil that I may partake in the task of spreading delightfully fruity preserve upon yonder mundane unleavened sustenance. Certainly, my fine fellow, here you go, my trusty cutlery, ready to serve in your quest for culinary satisfaction.


In [16]:
result2

{'text': ' Ah, good morrow, dear sir, might I in my moment of hunger-induced desperation humbly beseech thee for the boon of thy bladed utensil that I may partake in the task of spreading delightfully fruity preserve upon yonder mundane unleavened sustenance. Certainly, my fine fellow, here you go, my trusty cutlery, ready to serve in your quest for culinary satisfaction.'}

### Being robust to audio length

- Since the model is only suited to limited-length audio segments, we have to split the audio into segments, transcribe each segment separately, and then concatenate the resulting transcripts.

- To avoid splitting words, phrases, and sentences in half during segmentation, we use a VAD (Voice Activity Detection) model to find the segment boundaries.
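
Whisper-family models process audio in windows of up to 30 seconds, and a VAD step often returns many short speech segments. One possible refinement, sketched below, is to merge consecutive VAD segments greedily into chunks that stay under a maximum length, so the model is called fewer times while each cut still falls on a pause. The function name `merge_segments`, the `max_len` default, and the sample timestamps are assumptions made for illustration; they are not part of pyAudioAnalysis or transformers.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def merge_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Greedily merge consecutive speech segments into chunks of at most max_len seconds.

    A single segment that is already longer than max_len is kept unchanged.
    """
    chunks: List[Segment] = []
    for start, end in segments:
        # Extend the current chunk if the merged span still fits within max_len
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


# Hypothetical VAD output, in seconds
vad_segments = [(0.0, 4.0), (5.0, 12.0), (13.5, 28.0), (29.0, 41.0), (42.0, 50.0)]
print(merge_segments(vad_segments))
```

The merged chunks include the short pauses between the original segments, which is usually harmless for transcription.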

In [23]:
! pip install pyAudioAnalysis

In [112]:
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation

# Path to the local audio file
audio_file = r"C:\Users\Administrator\Documents\Sound Recordings\Spreading_jam_on_bread.mp3"

# Read the audio file: fs is the sampling rate, x is the array of samples
fs, x = audioBasicIO.read_audio_file(audio_file)

# Downmix to a single channel (mono); input that is already mono is handled too
x = audioBasicIO.stereo_to_mono(x)

# Short-term window and step sizes in seconds (adjust as needed)
st_win = 0.1  # Window size (in seconds)
st_step = 0.02  # Step size (in seconds)

# Perform VAD; returns a list of [start, end] segment limits in seconds
segments = audioSegmentation.silence_removal(x, fs, st_win, st_step)

In [None]:
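import numpy as np

# Transcribe each segment using the ASR model
transcriptions = []
for start_time, end_time in segments:
    # convert segment boundaries from seconds to sample indices
    low, high = int(start_time * fs), int(end_time * fs)
    # scale to floats in [-1, 1]; assumes read_audio_file returned 16-bit PCM samples
    audio_segment = x[low:high].astype(np.float32) / 32768.0

    # pass the sampling rate so the pipeline can resample to the rate the model expects
    segment_transcription = pipe({"raw": audio_segment, "sampling_rate": fs})
    transcriptions.append(segment_transcription['text'])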
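
The seconds-to-samples conversion in the loop above can be factored into a small helper that also clamps the indices to the signal bounds, which guards against a segment limit that runs slightly past the end of the audio. This is a minimal sketch; `segment_to_samples` is a hypothetical name and the sampling rate and durations are made-up values.

```python
def segment_to_samples(start_s, end_s, fs, n_samples):
    """Map a (start, end) interval in seconds to sample indices clamped to [0, n_samples]."""
    low = max(0, round(start_s * fs))
    high = min(n_samples, round(end_s * fs))
    # Guarantee a valid (possibly empty) slice even for degenerate input
    return low, max(low, high)


fs = 16000          # hypothetical sampling rate in Hz
n_samples = 3 * fs  # a hypothetical 3-second signal

print(segment_to_samples(0.5, 2.0, fs, n_samples))
print(segment_to_samples(2.5, 3.5, fs, n_samples))  # end is clamped to the signal length
```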

In [None]:
# each transcript starts with a leading space, so strip before joining
full_transcription = " ".join(t.strip() for t in transcriptions)
print("Full Transcription:")
print(full_transcription)