SPDX-License-Identifier: Apache-2.0 Copyright (c) 2023, Shivani karangula karangulax.shivani@intel.com, Nikhila Haridas nikhilax.haridas@intel.com


# Automatic Speech Recognition Quantization using Intel® Neural Compressor

## Intel® Neural Compressor

* Intel® Neural Compressor performs model optimization to reduce the model size and increase the speed of deep learning inference for deployment on CPUs or GPUs.
  
* Intel® Neural Compressor is an open-source Python* library that applies model compression techniques such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks, including TensorFlow*, PyTorch*, and ONNX* (Open Neural Network Exchange) Runtime.

### Quantization

Quantization is a deep learning model optimization technique used to improve inference speed. It reduces the number of bits needed to store weights and activations by converting real-valued numbers (typically 32-bit floating point) into a lower-bit representation such as int8 or int4. This lowers the memory requirement, cache miss rate, and computational cost of running a neural network, which in turn improves inference performance.
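As a rough illustration of the idea, the following minimal sketch quantizes a handful of float values to int8 with a single symmetric scale and then maps them back. It is plain Python written for this walkthrough, not the implementation used by Intel® Neural Compressor; the helper names `quantize_int8` and `dequantize` and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.45, -0.6]  # illustrative values only
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print("int8 codes:", q_weights)
print("largest round-trip error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each value now needs one byte instead of four, at the cost of a rounding error of at most half the scale per value.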

There are three approaches to quantization:

* Post-training dynamic quantization
* Post-training static quantization
* Quantization-aware training
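The practical difference between the two post-training approaches is where the activation scale comes from. The sketch below is a simplified illustration that assumes symmetric int8 quantization and uses hypothetical helper names: dynamic quantization derives the scale from each input at inference time, while static quantization fixes it in advance from calibration data. Quantization-aware training is not covered by this sketch.

```python
def scale_for(values):
    """Symmetric int8 scale that covers the largest magnitude in `values`."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def dynamic_scale(activations):
    # Dynamic: computed at inference time from the activations actually seen.
    return scale_for(activations)


def static_scale(calibration_batches):
    # Static: computed once, ahead of time, from representative calibration data.
    return scale_for([v for batch in calibration_batches for v in batch])


calibration = [[0.5, -1.0, 0.25], [2.0, -0.75, 1.5]]  # illustrative values only
fixed = static_scale(calibration)

for batch in ([0.1, -0.2, 0.05], [3.0, -4.0, 1.0]):
    print(f"dynamic={dynamic_scale(batch):.5f}  static={fixed:.5f}")
```

Because the dynamic approach needs no calibration dataset, it can be applied to a pretrained model directly, which is the route this notebook takes.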

## Notebook overview
This notebook provides a step-by-step code walkthrough of how to use Intel® Neural Compressor to quantize the Whisper model with the post-training dynamic quantization approach on **CPU**. It covers the following steps:

* Load the model and run inference before quantization.
* Quantize the model.
* Run inference with the quantized model.

### Environment setup 
 
* The cell below creates a Python virtual environment named ```asr-quant```.
* It installs the necessary packages, including Intel® Extension for PyTorch.
* It registers a Jupyter kernel for the notebook environment.
* Finally, select the ```Python (asr-quant)``` kernel to run the notebook.

In [5]:
# Creating python virtual environment
!python3 -m venv asr-quant

print("Virtual environment created")

print("Installing the required dependencies")

# Installing required dependencies in the virtual environment
!asr-quant/bin/pip install transformers==4.44.2 \
                           pandas==2.2.2 \
                           numpy==1.26.4 \
                           datasets==3.0.0 \
                           evaluate==0.4.2 \
                           jiwer==3.0.4 \
                           librosa==0.10.2.post1 \
                           neural_compressor==3.0.2 \
                           torch==2.3.1+cxx11.abi \
                           intel-extension-for-pytorch==2.3.110+xpu \
                           --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

print("Dependencies installed")

# Install `ipykernel` to register the virtual environment as a Jupyter kernel
!asr-quant/bin/pip install ipykernel

# Register the virtual environment as a Jupyter kernel
!asr-quant/bin/python -m ipykernel install --user --name=asr-quant --display-name "Python (asr-quant)"


Virtual environment created
Installing the required dependencies
Looking in indexes: https://pypi.org/simple, https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
Dependencies installed
Installed kernelspec asr-quant in /home/u8bd311b633876ba392b704069aeab3e/.local/share/jupyter/kernels/asr-quant
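
After switching kernels, it can be worth confirming that the notebook is running from the new environment. The check below is a minimal sketch that uses only the standard library; it assumes the kernel was registered from the `asr-quant` environment as described above, in which case the printed interpreter path should sit under `asr-quant/`.

```python
import sys

# In a virtual environment, sys.prefix points at the venv while
# sys.base_prefix points at the interpreter it was created from.
in_virtual_env = sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtual_env}")
```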


## Let's patch a few changes into the source

* A few changes are needed in modeling_whisper.py and generation_whisper.py in order to run the quantization script. Run the two patch calls below to make them.
* Line numbers may vary with the Transformers version. This notebook uses Transformers version 4.44.2.
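
Because line numbers shift between Transformers releases, one option is to look the target line up by its content rather than by a fixed number. The helper below is a minimal sketch of that idea; `find_line_number` is a hypothetical name, and the sample lines stand in for the contents of a real source file.

```python
def find_line_number(lines, target):
    """Return the 1-based number of the first line whose stripped text equals `target`, else None."""
    wanted = target.strip()
    for number, line in enumerate(lines, start=1):
        if line.strip() == wanted:
            return number
    return None


# Tiny stand-in for a source file; a real file would be read with readlines().
sample_lines = [
    "def forward(self, input_features):\n",
    "    expected_seq_length = compute()\n",
    "    return input_features\n",
]

print(find_line_number(sample_lines, "expected_seq_length = compute()"))
print(find_line_number(sample_lines, "not present"))
```

The returned number could then be passed as the `line_number` argument of a line-based patch routine, with `None` signalling that the line was not found.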

In [2]:
def patch_file(file_path, line_number, original_line, replacement_text):
    """
    Replaces a specific line in a file with a given replacement text if the original line matches.
    
    Parameters:
        file_path (str): Path to the file to be modified.
        line_number (int): Line number where the replacement should happen (1-based indexing).
        original_line (str): The exact line content to be replaced.
        replacement_text (str): The new multi-line text to replace the original line.
    """
    try:
        with open(file_path, 'r') as file:
            lines = file.readlines()
        
        line_index = line_number - 1
        
        if len(lines) > line_index and lines[line_index].strip() == original_line.strip():
            
            lines[line_index] = replacement_text
            
            with open(file_path, 'w') as file:
                file.writelines(lines)
            print(f"Replacement on line {line_number} completed in {file_path}!")
        else:
            print(f"Original line not found at line {line_number} in {file_path}. No changes made.")
    except Exception as e:
        print(f"An error occurred while modifying the file {file_path}: {e}")

# first file modeling_whisper.py

patch_file(
    file_path='./asr-quant/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py',
    line_number=1069,
    original_line="expected_seq_length = self.config.max_source_positions * self.conv1.stride[0] * self.conv2.stride[0]\n",
    replacement_text="""        try:
            expected_seq_length = self.config.max_source_positions * self.conv1.stride[0] * self.conv2.stride[0]
        except AttributeError:
            try:
                expected_seq_length = self.config.max_source_positions * self.conv1.module.module.stride[0] * self.conv2.module.module.stride[0]
            except AttributeError:
                try:
                    expected_seq_length = self.config.max_source_positions * self.conv1.module.module.module.stride[0] * self.conv2.module.module.module.stride[0]
                except AttributeError:
                    expected_seq_length = None
# REPLACEMENT_DONE
"""
)

# second file generation_whisper.py
patch_file(
    file_path='./asr-quant/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py',
    line_number=505,
    original_line="input_stride = self.model.encoder.conv1.stride[0] * self.model.encoder.conv2.stride[0]\n",
    replacement_text="""        try:
            input_stride = self.model.encoder.conv1.stride[0] * self.model.encoder.conv2.stride[0]
        except AttributeError:
            try:
                input_stride = self.model.encoder.conv1.module.module.stride[0] * self.model.encoder.conv2.module.module.stride[0]
            except AttributeError:
                try:
                    input_stride = self.model.encoder.conv1.module.module.module.stride[0] * self.model.encoder.conv2.module.module.module.stride[0]
                except AttributeError:
                    input_stride = None
#REPLACEMENT_DONE
"""
)

Original line not found at line 1069 in ./asr-quant/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py. No changes made.
Original line not found at line 505 in ./asr-quant/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py. No changes made.
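
The nested `try`/`except` blocks in the replacement text look for `stride` one or more `.module` levels below the convolution layer, which suggests that quantization may wrap those layers. Assuming that wrapping pattern, the same lookup can be sketched as a loop. `resolve_stride` and the stand-in classes are hypothetical and are not part of Transformers or Intel® Neural Compressor.

```python
def resolve_stride(layer, max_depth=3):
    """Follow `.module` attributes until an object with `stride` is found."""
    current = layer
    for _ in range(max_depth + 1):
        if hasattr(current, "stride"):
            return current.stride
        if not hasattr(current, "module"):
            break
        current = current.module
    return None  # mirrors the patched code, which falls back to None


class FakeConv:
    """Stand-in for a convolution layer."""
    stride = (2,)


class Wrapper:
    """Stand-in for a wrapper that stores the original layer in `.module`."""
    def __init__(self, module):
        self.module = module


print(resolve_stride(FakeConv()))
print(resolve_stride(Wrapper(Wrapper(FakeConv()))))
print(resolve_stride(object()))
```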


### Importing required packages

In [3]:
import time
import os
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline
from datasets import load_dataset
from neural_compressor import quantization
from neural_compressor.quantization import fit
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion, AccuracyCriterion
from neural_compressor.utils.pytorch import load



### Whisper Model Inference before Quantization

**Model Card** : https://huggingface.co/openai/whisper-large

In [4]:
model_id = 'openai/whisper-large'

model = WhisperForConditionalGeneration.from_pretrained(model_id, use_safetensors=True)

processor = WhisperProcessor.from_pretrained(model_id)

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
start_time = time.time()

predicted_ids = model.generate(input_features)

end_time = time.time()
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
print(f"Time spent before quantization :{end_time-start_time} seconds")

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
Time spent before quantization :116.60087156295776 seconds
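
A single timed call, as above, can be noisy because it may include one-off costs such as cache warm-up. One hedged alternative is to time several runs and report the median. The helper below is a sketch with a hypothetical name, `time_call`, and a stand-in workload. With the model in this notebook each run takes on the order of minutes, so the number of repeats would need to be chosen accordingly.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=3, **kwargs):
    """Return the median wall-clock time of fn(*args, **kwargs) in seconds."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up runs are not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; in the notebook this would be
# time_call(model.generate, input_features).
elapsed = time_call(sum, range(100_000))
print(f"Median time: {elapsed:.6f} seconds")
```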


### Quantization of Whisper Model

In [5]:
model_id = 'openai/whisper-large'

model = WhisperForConditionalGeneration.from_pretrained(model_id, use_safetensors=True)
output_dir = "quantized_whisper_large"

# Set to False to skip the quantization step
tune = True

if tune:
    tuning_criterion = TuningCriterion(max_trials=5)
    accuracy_criterion = AccuracyCriterion(tolerable_loss=0.1)
    # Keep Embedding ops in fp32
    op_type_dict = {
        "Embedding": {"weight": {"dtype": ["fp32"]}, "activation": {"dtype": ["fp32"]}}
    }
    conf = PostTrainingQuantConfig(
        approach="dynamic",
        tuning_criterion=tuning_criterion,
        accuracy_criterion=accuracy_criterion,
        op_type_dict=op_type_dict,
    )
    # No eval_func is passed, so the model is quantized without accuracy-driven tuning
    q_model = quantization.fit(model, conf=conf)
    q_model.save(output_dir)

2024-12-10 09:21:30 [INFO] Start auto tuning.
2024-12-10 09:21:30 [INFO] Quantize model without tuning!
2024-12-10 09:21:30 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-12-10 09:21:30 [INFO] Adaptor has 5 recipes.
2024-12-10 09:21:30 [INFO] 0 recipes specified by user.
2024-12-10 09:21:30 [INFO] 3 recipes require future tuning.
2024-12-10 09:21:30 [INFO] *** Initialize auto tuning
2024-12-10 09:21:30 [INFO] {
2024-12-10 09:21:30 [INFO]     'PostTrainingQuantConfig': {
2024-12-10 09:21:30 [INFO]         'AccuracyCriterion': {
2024-12-10 09:21:30 [INFO]             'criterion': 'relative',
2024-12-10 09:21:30 [INFO]             'higher_is_better': True,
2024-12-10 09:21:30 [INFO]             'tolerable_loss': 0.1,
2024-12-10 09:21:30 [INFO]             'absolute': None,
2024-12-10 09:21:30 [INFO]      
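
The log above notes that, without an `eval_func` (or an `eval_dataloader` and `eval_metric`), the model is quantized with the default configuration and is not evaluated. One possible metric for speech recognition is word error rate (WER). The sketch below computes it in plain Python; `word_error_rate` is a hypothetical helper and the sentences are illustrative. A real `eval_func` would additionally need to run the model on an evaluation set and, since the logged criterion is `higher_is_better`, return a score such as `1 - WER`.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


reference = "mr quilter is the apostle of the middle classes"
hypothesis = "mr quilter is the apostle of middle class"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}, score (1 - WER): {1 - wer:.3f}")
```

The `jiwer` package installed earlier offers a WER implementation that could be used instead of this hand-written version.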

### Whisper Model Inference after Quantization

In [6]:
model_id = 'openai/whisper-large'

# Load the original model; it is passed to load() together with the saved quantized checkpoint
model = WhisperForConditionalGeneration.from_pretrained(model_id, use_safetensors=True)

# Load the quantized model
model = load(os.path.abspath(os.path.expanduser('./quantized_whisper_large')), model)

# load the processor
processor = WhisperProcessor.from_pretrained(model_id)

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

start_time = time.time()

# generate token ids
predicted_ids = model.generate(input_features)

end_time = time.time()
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
print(f"Time spent after quantization :{end_time-start_time} seconds")

2024-12-10 09:23:06 [INFO] Convert operators to bfloat16


[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
Time spent after quantization :104.29494643211365 seconds
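
The two timings reported in this notebook can be turned into a speedup factor and a relative latency reduction with a little arithmetic. The numbers below are copied from the outputs above. Each comes from a single run, so the result is indicative only.

```python
time_before = 116.60087156295776  # seconds, from the run before quantization
time_after = 104.29494643211365   # seconds, from the run after quantization

speedup = time_before / time_after
reduction_pct = (time_before - time_after) / time_before * 100

print(f"Speedup: {speedup:.2f}x")
print(f"Latency reduction: {reduction_pct:.1f}%")
```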


### Benchmark results of verified Whisper models

| **Model**                 | **Original Size** | **Quantized Size** |
|---------------------------|-------------------|--------------------|
| openai/whisper-large      | 5.8G              | 1.7G               |
| openai/whisper-small      | 927M              | 393.8M             |
| openai/whisper-large-v3   | 2.9G              | 1.7G               |
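
The size reduction for each model can be expressed as a ratio of original to quantized size. The sketch below does this for the values in the table; `to_megabytes` is a hypothetical helper, and treating `G` as 1024 `M` is an assumption, since the table does not say whether the units are binary or decimal.

```python
UNITS = {"M": 1, "G": 1024}  # assumption: binary units, sizes expressed in M


def to_megabytes(size):
    """Convert a string such as '5.8G' or '927M' to a number of megabytes."""
    return float(size[:-1]) * UNITS[size[-1]]


# (original size, quantized size), copied from the table above
sizes = {
    "whisper-large": ("5.8G", "1.7G"),
    "whisper-small": ("927M", "393.8M"),
    "whisper-large-v3": ("2.9G", "1.7G"),
}

ratios = {
    model: to_megabytes(original) / to_megabytes(quantized)
    for model, (original, quantized) in sizes.items()
}

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f}x smaller")
```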



### Disclaimer for Using Large Language Models

Please be aware that while models like OpenAI Whisper are powerful tools for speech recognition, they may sometimes produce results that are unexpected, biased, or inconsistent with the input audio. It's advisable to carefully review the generated transcriptions and consider the context and application in which you are using these models.

For detailed information on each model's capabilities, licensing, and attribution, please refer to the respective model cards:

* **openai/whisper-large**
  
   * Model card : https://huggingface.co/openai/whisper-large

* **openai/whisper-small**

   * Model card : https://huggingface.co/openai/whisper-small

* **openai/whisper-large-v3**

   * Model card : https://huggingface.co/openai/whisper-large-v3

Usage of these models must also adhere to the licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above.

To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.