https://discuss.huggingface.co/t/convert-openai-whisper-transformer-model-to-quantized-tflite-model/25488

Quantisation to tflite
https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/whisper_to_onnx_tflite_int8.ipynb

Quantisation from HuggingFace to tflite
https://colab.research.google.com/github/usefulsensors/openai-whisper/blob/main/notebooks/tflite_from_huggingface_whisper.ipynb

Quantised ggml models: https://huggingface.co/farmer00317558/quantized_whisper_model
- https://ggml.ggerganov.com/
- https://whisper.ggerganov.com/

https://github.com/MiscellaneousStuff/openai-whisper-cpu

https://medium.com/@daniel-klitzke/quantizing-openais-whisper-with-the-huggingface-optimum-library-30-faster-inference-64-36d9815190e0

https://twitter.com/younesbelkada/status/1590735022398455810?lang=en

ONNX quantisation
- https://huggingface.co/docs/optimum/onnxruntime/usage_guides/pipelines
- https://huggingface.co/docs/transformers/model_doc/whisper#whisper

Whisper Documentation
https://huggingface.co/docs/transformers/model_doc/whisper

Whisper Finetuning
https://github.com/vasistalodagala/whisper-finetune

https://pytorch.org/docs/stable/quantization.html

https://huggingface.co/docs/optimum/main/en/concept_guides/quantization

https://opennmt.net/CTranslate2/guides/transformers.html#whisper

https://www.youtube.com/watch?v=2kSPbH4jWME

https://huggingface.co/vasista22/whisper-hindi-large-v2

https://github.com/Vaibhavs10/fast-whisper-finetuning

- Try the Flax model for faster fine-tuning

https://huggingface.co/DrishtiSharma/whisper-large-v2-hindi-3k-steps/tree/main

https://huggingface.co/spaces/autoevaluate/leaderboards/tree/main

### Datasets
- google/fleurs: https://huggingface.co/datasets/google/fleurs
- Common Voice:

### Models
- Medium: https://huggingface.co/openai/whisper-medium/tree/main
- Large-v2: https://huggingface.co/openai/whisper-large-v2/tree/main

In [1]:
# Check Pytorch Configuration
import torch
print('Torch Version: ', torch.__version__)
print('CUDA Available: ', torch.cuda.is_available())
print('CUDA Version: ', torch.version.cuda)
print('GPU Count: ', torch.cuda.device_count())
print('GPU Info: ',
    [
        f"{i}: {torch.cuda.get_device_name(i)} ({'.'.join(map(str, torch.cuda.get_device_capability(i)))})"
                    for i in range(torch.cuda.device_count())
    ]
)

Torch Version:  2.0.1+cu117
CUDA Available:  True
CUDA Version:  11.7
GPU Count:  2
GPU Info:  ['0: NVIDIA GeForce RTX 3060 (8.6)', '1: NVIDIA GeForce GTX 1660 SUPER (7.5)']


In [None]:
# Check Tensorflow Configuration
import tensorflow as tf
import keras

print('TensorFlow Version:', tf.__version__)
print('Keras Version:', keras.__version__)
print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPU Count:', len(tf.config.list_physical_devices('GPU')))
print('GPU Info:')
for gpu in tf.config.list_physical_devices('GPU'):
    print(f"- {gpu.name}: {tf.config.experimental.get_device_details(gpu)}")


In [1]:
# Select the GPU to use. This must run before anything initialises CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

MODEL_NAME = 'medium'
DEVICE = 'cuda'

COMPILED_MODEL_DIR = 'models/compiled'
OUTPUT_DIR = 'models/output'

# MODEL_DIR = f'./models/raw_hf/whisper-{MODEL_NAME}'
MODEL_DIR_FP16 = f'./models/output/whisper_{MODEL_NAME}_fp16_transformers'
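
`CUDA_VISIBLE_DEVICES` hides every GPU that is not listed and renumbers the remaining ones from 0. That is why the model later reports `cuda:0` and the pipeline is given `device=0`, even though the variable is set to `"1"`. A minimal sketch of the renumbering, assuming the variable holds a comma-separated list of integer indices (the helper name `logical_to_physical` is made up for illustration):

```python
def logical_to_physical(cuda_visible_devices: str) -> dict[int, int]:
    """Map logical CUDA indices (what frameworks see) to physical GPU indices.

    Assumes a comma-separated list of integer GPU indices; UUID entries and
    other forms that CUDA accepts are not handled in this sketch.
    """
    physical = [int(p) for p in cuda_visible_devices.split(",") if p.strip()]
    return {logical: phys for logical, phys in enumerate(physical)}


# With the setting used above, only physical GPU 1 is visible,
# and frameworks address it as device 0.
mapping = logical_to_physical("1")
print(mapping)  # {0: 1}

# Hypothetical setting that exposes both GPUs in swapped order.
print(logical_to_physical("1,0"))  # {0: 1, 1: 0}
```

With this setting, `cuda:0` in the rest of the notebook refers to the GPU listed as index 1 in the PyTorch configuration check, provided the device ordering is the same in both runs.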

## Loading using whisper

In [None]:
import whisper
import torch
model_fp32 = whisper.load_model(
    name=MODEL_NAME,
    device=DEVICE,
    download_root=COMPILED_MODEL_DIR,
)

# Set model to evaluation mode
# model_fp32.eval()

# See the model architecture
# model_fp32

In [None]:
SAMPLE_RATE = 16000

audio = whisper.load_audio('test.wav')
audio = whisper.pad_or_trim(audio)

from IPython.display import Audio
Audio(audio, rate=SAMPLE_RATE)

In [None]:
# Transcribe the audio using whisper
mel = whisper.log_mel_spectrogram(audio, device=DEVICE)
options = whisper.DecodingOptions(language="en")
result = whisper.decode(model_fp32, mel, options)
print(result.text)

# torch.cuda.empty_cache()

## Load using Transformers

In [7]:
from transformers import (
    WhisperForConditionalGeneration, 
    WhisperProcessor, 
    WhisperConfig, 
    BitsAndBytesConfig, 
    WhisperTokenizer,
    pipeline
)
import torch

# set torch default device
# torch.set_default_device(DEVICE)


In [8]:
processor = WhisperProcessor.from_pretrained(MODEL_DIR_FP16)

config = WhisperConfig.from_pretrained(MODEL_DIR_FP16)
tokeniser = WhisperTokenizer.from_pretrained(MODEL_DIR_FP16)
# tokeniser = processor.tokenizer

In [10]:
tokeniser

WhisperTokenizer(name_or_path='./models/output/whisper_medium_fp16_transformers', vocab_size=50258, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|endoftext|>', '<|startoftranscript|>', '<|en|>', '<|zh|>', '<|de|>', '<|es|>', '<|ru|>', '<|ko|>', '<|fr|>', '<|ja|>', '<|pt|>', '<|tr|>', '<|pl|>', '<|ca|>', '<|nl|>', '<|ar|>', '<|sv|>', '<|it|>', '<|id|>', '<|hi|>', '<|fi|>', '<|vi|>', '<|he|>', '<|uk|>', '<|el|>', '<|ms|>', '<|cs|>', '<|ro|>', '<|da|>', '<|hu|>', '<|ta|>', '<|no|>', '<|th|>', '<|ur|>', '<|hr|>', '<|bg|>', '<|lt|>', '<|la|>', '<|mi|>', '<|ml|>',

In [None]:
# Setting the default dtype to FP16 works, but loading is slow because the weights are converted to FP16 on the CPU
# torch.set_default_dtype(config.torch_dtype)

In [None]:
# bitsandbytes (BnB) 4-bit quantisation configs
# https://huggingface.co/blog/hf-bitsandbytes-integration

# 4-bit weights, with bfloat16 as the compute dtype
bf16_compute_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Normal Float 4
)

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)
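
Each of these configs is meant to be passed as `quantization_config=` to `from_pretrained`. To get a feel for what the storage formats buy, here is a rough, weights-only estimate. It assumes the approximate parameter counts from the openai/whisper README (769 M for medium, 1550 M for large) and ignores activations, the KV cache, quantisation constants and any layers that stay unquantised, so real memory use will be higher.

```python
# Approximate parameter counts (assumed from the openai/whisper README).
PARAMS = {"medium": 769_000_000, "large-v2": 1_550_000_000}

# Bytes per parameter for each storage format (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_size_gib(model: str, fmt: str) -> float:
    """Rough weight footprint in GiB for a model and storage format."""
    return PARAMS[model] * BYTES_PER_PARAM[fmt] / 2**30


for fmt in BYTES_PER_PARAM:
    print(f"medium {fmt}: {weight_size_gib('medium', fmt):.2f} GiB")
```

The estimate halves with each step from FP32 to FP16 to INT8 to 4-bit, which is the main reason to try the quantised variants on the smaller of the two GPUs.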

In [9]:
# model = AutoModel.from_pretrained(MODEL_DIR_FP16)
# from_pretrained is a classmethod: call it on the class instead of on a
# freshly constructed (randomly initialised) instance.
model = WhisperForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_DIR_FP16,
    torch_dtype=config.torch_dtype,
    # device_map options: 'balanced', 'balanced_low_0', 'sequential'
    # device_map="cuda",  # increases memory usage
    low_cpu_mem_usage=True,
    # quantization_config=nf4_config,
)

In [13]:
# See Model Device
print('Device: ',model.device)

# See Model Device Map (Device Map is used to map the model to the device)
# print('Device Map: ',model.hf_device_map)

Device:  cuda:0


In [12]:
# Move model to GPU
if model.device.type != 'cuda':
    print('Moving model to GPU')
    model = model.to('cuda')
    model.eval()

else:
    print('Model is already on GPU')
    model.eval()

Moving model to GPU


In [14]:
# dtype
print('dtype of model acc to config: ', config.torch_dtype)
print('dtype of loaded model: ', model.dtype)

dtype of model acc to config:  torch.float16
dtype of loaded model:  torch.float16


In [21]:
transcriber = pipeline(
    task="automatic-speech-recognition", 
    model=model, 
    config=config,
    tokenizer=tokeniser,
    feature_extractor=processor.feature_extractor,
    device=0,
    # low_cpu_mem_usage = True,
    torch_dtype = config.torch_dtype,
    )

In [31]:
transcriber("./audios/test.wav")


{'text': ' Hello world, this is a sample test for the Python post request module to the Whisper API hosted on the Jarvis server. Now I am going to speak in Hindi. So I have opened my whatsapp and I have messaged a person that he will participate in the event. Thank you very much.'}

In [None]:
# Force the decoder prompt ids (language and task)
# model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe", no_timestamps=True)

In [25]:
# load_audio and pad_or_trim functions

import ffmpeg
import torch
import torch.nn.functional as F
import numpy as np
# import whisper

SAMPLE_RATE = 16000
CHUNK_LENGTH = 30  # 30-second chunks
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000 samples in a 30-second chunk

# audio = whisper.load_audio('test.wav')
def load_audio(file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
    try:
        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
        out, _ = (
            ffmpeg.input(file, ss=start_time, threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
    return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0


# audio = whisper.pad_or_trim(audio)
def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
    """
    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
    """
    if torch.is_tensor(array):
        if array.shape[axis] > length:
            array = array.index_select(
                dim=axis, index=torch.arange(length, device=array.device)
            )

        if array.shape[axis] < length:
            pad_widths = [(0, 0)] * array.ndim
            pad_widths[axis] = (0, length - array.shape[axis])
            array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
    else:
        if array.shape[axis] > length:
            array = array.take(indices=range(length), axis=axis)

        if array.shape[axis] < length:
            pad_widths = [(0, 0)] * array.ndim
            pad_widths[axis] = (0, length - array.shape[axis])
            array = np.pad(array, pad_widths)

    return array
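
`load_audio` asks ffmpeg for raw `s16le` output, that is, little-endian signed 16-bit PCM, and then divides by 32768 to bring the samples into the range [-1.0, 1.0). The sketch below shows the same decoding step with only the standard library, on four hand-made samples instead of real ffmpeg output. The helper name `pcm_s16le_to_floats` is made up for illustration.

```python
import struct


def pcm_s16le_to_floats(raw: bytes) -> list[float]:
    """Decode little-endian signed 16-bit PCM into floats in [-1.0, 1.0)."""
    n = len(raw) // 2  # two bytes per sample
    samples = struct.unpack(f"<{n}h", raw[: n * 2])
    return [s / 32768.0 for s in samples]


# Silence, half scale, the most negative and the most positive int16 value.
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
print(pcm_s16le_to_floats(raw))  # [0.0, 0.5, -1.0, 0.999969482421875]
```

Dividing by 32768 rather than 32767 maps the most negative sample to exactly -1.0 and leaves the most positive one just below 1.0.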

In [27]:
audio = load_audio('audios/test.wav', dtype=np.float16, start_time=0)
audio = pad_or_trim(audio)

from IPython.display import Audio
Audio(audio, rate=SAMPLE_RATE)
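
`pad_or_trim` always returns exactly `N_SAMPLES` samples, so a clip shorter than 30 seconds is zero-padded and anything beyond 30 seconds is discarded before it reaches the model. A small sketch of that arithmetic, using two hypothetical clip lengths (the helper `pad_or_trim_effect` is not part of the notebook):

```python
SAMPLE_RATE = 16000
CHUNK_LENGTH = 30  # seconds
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000


def pad_or_trim_effect(n_samples: int, length: int = N_SAMPLES) -> tuple[int, int]:
    """Return (samples_padded, samples_dropped) for a clip of n_samples."""
    padded = max(length - n_samples, 0)
    dropped = max(n_samples - length, 0)
    return padded, dropped


print(pad_or_trim_effect(12 * SAMPLE_RATE))  # (288000, 0): 18 s of zero padding
print(pad_or_trim_effect(45 * SAMPLE_RATE))  # (0, 240000): the last 15 s are dropped
```

For recordings longer than 30 seconds, this path only transcribes the first chunk, so longer audio has to be split before calling `model.generate` directly.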

In [28]:
input_features = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt").input_features.half().to(DEVICE)
# input_features = whisper.log_mel_spectrogram(audio, device=DEVICE).unsqueeze(0)

In [30]:
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        num_beams = 1,
        language="english",
        task="transcribe",
        use_cache=True,
        is_multilingual=True,
        return_timestamps=True,
        # return_attention_masks=True,
    )

# transcription = tokeniser.decode(predicted_ids[0])
# prediction = tokeniser._normalize(transcription)

prediction = tokeniser.batch_decode(predicted_ids, skip_special_tokens=False)

print(prediction)
# torch.cuda.empty_cache()

['<|startoftranscript|><|en|><|transcribe|> Hello world, this is a sample test for the Python post request module to the Whisper API hosted on the Jarvis server. Now I am going to speak in Hindi. I have opened my whatsapp and I have messaged a person to ask what event he will participate in. Thank you very much.<|endoftext|>']
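
The decoded string keeps the `<|...|>` markers because `skip_special_tokens=False` was passed. Passing `skip_special_tokens=True` to `batch_decode` is the tokenizer's own way to drop them. For strings that have already been decoded, a regular expression is one possible alternative. The sketch below assumes every special token has the `<|name|>` shape and uses a shortened version of the output above.

```python
import re

# Whisper's special tokens are written as <|name|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|> markers and surrounding whitespace from a decoded string."""
    return SPECIAL_TOKEN.sub("", text).strip()


decoded = "<|startoftranscript|><|en|><|transcribe|> Hello world. Thank you very much.<|endoftext|>"
print(strip_special_tokens(decoded))  # Hello world. Thank you very much.
```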


In [None]:
predicted_ids

In [None]:
model.generation_config.use_cache

In [None]:
model.generation_config.return_timestamps

In [None]:
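# forced_decoder_ids is a list of [position, token_id] pairs (or None); decode only the token ids
forced_ids = model.config.forced_decoder_ids or []
tokeniser.batch_decode([[token_id] for _, token_id in forced_ids], skip_special_tokens=False)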
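
When it is set, `model.config.forced_decoder_ids` is a list of `[position, token_id]` pairs rather than a flat list of token ids, so the position has to be separated from the id before looking tokens up. The sketch below uses hypothetical example data: the pairs and the small id-to-token table are assumed values for a multilingual checkpoint and should be checked against the actual config and tokenizer.

```python
# Hypothetical example data; real values come from model.config.forced_decoder_ids
# and from the tokenizer's vocabulary.
forced_decoder_ids = [[1, 50259], [2, 50359], [3, 50363]]
id_to_token = {50259: "<|en|>", 50359: "<|transcribe|>", 50363: "<|notimestamps|>"}


def forced_prompt(pairs, vocab):
    """Return the forced tokens ordered by decoder position."""
    return [vocab[token_id] for _, token_id in sorted(pairs)]


print(forced_prompt(forced_decoder_ids, id_to_token))
# ['<|en|>', '<|transcribe|>', '<|notimestamps|>']
```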

In [None]:
# import os
# SAVE_DIR = os.path.join(OUTPUT_DIR, MODEL_NAME + '_fp16' + '_transformers')
# model.save_pretrained(SAVE_DIR)
# processor.save_pretrained(SAVE_DIR)

In [None]:
for name, param in model.named_parameters():
    print(f"Parameter: {name} \t Dtype: {param.dtype}")

In [None]:
model.dtype

In [None]:
model.half()

In [None]:
torch.cuda.empty_cache()

In [None]:
# memory_summary() returns a pre-formatted string, so a plain print is enough
print(torch.cuda.memory_summary())

In [None]:
# torch.save(model, 'models/output/large-v2_fp16_unit8_bnb.pt')