### Importing libraries

In [5]:
import numpy as np
import pandas as pd
import sklearn
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import scipy
from datasets import load_dataset
from pydub import AudioSegment
import librosa
import pickle
import os

General Workflow:

1. Speech signal ---> Text
2. Text ---> Dirichlet clusters
3. Dirichlet cluster keywords ---> Search engine / YouTube API

In [6]:
# Load the Whisper model and processor, on the GPU when one is available

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Smoke test: transcribe one long-form LibriSpeech sample
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next man!
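
Because the pipeline is created with `return_timestamps=True`, the result should carry a `chunks` list next to `text`, where each entry holds a `(start, end)` timestamp in seconds and the text of that segment. A minimal sketch of turning those chunks into readable lines follows. `format_chunks` is a hypothetical helper, not part of transformers, and `example_result` is made-up data standing in for the `result` returned above.

```python
def format_chunks(result):
    """Render pipeline chunks as '[start - end] text' lines.

    Assumes `result` looks like the ASR pipeline output with timestamps:
    {"text": str, "chunks": [{"timestamp": (start, end), "text": str}, ...]}
    """
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # A timestamp can be None, e.g. the end of a segment cut off by the audio
        start_label = f"{start:.2f}s" if start is not None else "start"
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start_label} - {end_label}] {chunk['text'].strip()}")
    return lines


# Made-up example standing in for the `result` returned by `pipe(sample)`
example_result = {
    "text": " Hello there. Next man!",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, None), "text": " Next man!"},
    ],
}

for line in format_chunks(example_result):
    print(line)
```

With the objects from this notebook the call would be `format_chunks(result)`.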


In [8]:
current_directory = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_directory, os.pardir))

print("Current directory:", current_directory)
print("Parent directory:", parent_dir)

Current directory: /home/alejandro/Documents/nus-mtechis/procrastinate/procrastinate_data_processor/static/ml_models
Parent directory: /home/alejandro/Documents/nus-mtechis/procrastinate/procrastinate_data_processor/static
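
The same directory lookup can also be written with `pathlib`, which avoids string concatenation when file paths are built later on. This is only a sketch: the `audio` folder and the `test_mono_audio.m4a` file name are the ones used elsewhere in this notebook, and nothing here checks that the file exists.

```python
from pathlib import Path

current_directory = Path.cwd()
# .parent gives the same result as joining with os.pardir and normalising
parent_dir = current_directory.parent

# The / operator joins path components with the platform's separator
sample_audio_path = parent_dir / "audio" / "test_mono_audio.m4a"

print("Current directory:", current_directory)
print("Parent directory:", parent_dir)
print("Sample audio path:", sample_audio_path)
```

`Path` objects can be passed directly to `open`, and wrapped in `str(...)` for libraries that expect plain strings.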


In [9]:
sample_audio_path = os.path.join(parent_dir, "audio", "test_mono_audio.m4a")
print(sample_audio_path)
# scipy.io.wavfile.read only handles WAV files, so the .m4a recording is loaded with librosa instead.

/home/alejandro/Documents/nus-mtechis/procrastinate/procrastinate_data_processor/static/audio/test_mono_audio.m4a


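### Reading the audio file with librosa
We use the librosa package to read the audio file. For more information about librosa, see https://github.com/librosa/librosa

Note: librosa can read mp3 and m4a files. By default, `librosa.load` converts the audio to mono and resamples it to 22,050 Hz, which is the `sr` value printed below.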

In [10]:
y, sr = librosa.load(sample_audio_path)
print(y)
print(sr)

[ 0.0000000e+00  0.0000000e+00  0.0000000e+00 ... -1.3549987e-12
 -3.1959628e-12 -1.4259451e-12]
22050


  y, sr = librosa.load(sample_audio_path)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


In [11]:
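# Note: a bare NumPy array is assumed to already be at the feature extractor's sampling rate
# (16 kHz for Whisper), whereas y was loaded at 22,050 Hz above.
output_result = pipe(y, generate_kwargs={"language": "english", "task": "transcribe"})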
print(output_result["text"])

 This is a sample file for the speech-to-text notebook. This is meant as a test audio to try out whether Whisper works to actually decode the audio into word tokens. Check. Check. One, two, three, four. Zero. Over.
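
One way to avoid relying on the array already being at the expected rate is to pass the pipeline a dictionary with `raw` and `sampling_rate` keys, so that it can resample the audio itself (to our knowledge this needs torchaudio to be installed). Another is to load the file with `librosa.load(sample_audio_path, sr=16000)`. The sketch below wraps the first option. `transcribe_array` is a hypothetical helper, and `fake_pipe` is a stand-in so the example runs without loading a model.

```python
def transcribe_array(pipe, samples, sampling_rate, language="english"):
    """Transcribe raw audio samples, stating their sampling rate explicitly."""
    # A fresh dict is built for every call; "raw" holds the samples
    audio = {"raw": samples, "sampling_rate": sampling_rate}
    result = pipe(audio, generate_kwargs={"language": language, "task": "transcribe"})
    # The transcription usually starts with a space, so trim it
    return result["text"].strip()


def fake_pipe(audio, generate_kwargs=None):
    """Stand-in for the real pipeline so the sketch runs without a model."""
    return {"text": f" received {len(audio['raw'])} samples at {audio['sampling_rate']} Hz"}


print(transcribe_array(fake_pipe, [0.0, 0.0, 0.0], 22050))
```

With the objects from this notebook the call would be `transcribe_array(pipe, y, sr)`.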


### Exporting the transcription to a text file

In [14]:
output_file = os.path.join(parent_dir, "output", "mono_recording_voice_transcribed2.txt")
print(output_file)
transcribed_text = output_result["text"]
# Write the transcription as UTF-8 so non-ASCII characters are saved consistently
with open(output_file, "w", encoding="utf-8") as dst:
    dst.write(transcribed_text)

/home/alejandro/Documents/nus-mtechis/procrastinate/procrastinate_data_processor/static/output/mono_recording_voice_transcribed2.txt
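
The export step fails if the `output` folder does not exist yet. A minimal sketch of a more defensive version is shown below. `save_transcript` is a hypothetical helper, and the demonstration writes to a temporary directory rather than to the project folders.

```python
import tempfile
from pathlib import Path


def save_transcript(text, output_file):
    """Write `text` to `output_file` as UTF-8, creating parent folders as needed."""
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Trim surrounding whitespace and end the file with a single newline
    output_path.write_text(text.strip() + "\n", encoding="utf-8")
    return output_path


# Demonstration in a temporary directory so nothing in the project is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_transcript(" Check. One, two, three.", Path(tmp_dir) / "output" / "demo.txt")
    print(saved.read_text(encoding="utf-8"))
```

With the objects from this notebook the call would be `save_transcript(output_result["text"], output_file)`.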
