# Pretrain Wav2Vec2 model for speech recognition with Hugging Face and SageMaker

## Background

Wav2Vec2 is a transformer-based architecture for automatic speech recognition (ASR) tasks that was released in September 2020. A simplified architecture diagram is shown below; for more details, see the [original paper](https://arxiv.org/abs/2006.11477). The model consists of a multi-layer convolutional neural network (CNN) that acts as a feature extractor: it takes the raw audio signal as input and outputs audio representations, also called features. These features are fed into a transformer network to generate contextualized representations. This part of training can be self-supervised, which means the transformer can be trained on a large amount of unlabeled speech. The model is then fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is [Wav2Vec2-Base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), which is fine-tuned on 960 hours of LibriSpeech 16 kHz sampled speech audio.
<img src="images/wav2vec2.png">
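
To get a feel for how much the convolutional feature extractor shortens its input, the sketch below computes the number of feature frames produced for a given number of audio samples. The kernel sizes and strides are an assumption taken from the published Wav2Vec2 base configuration (the `conv_kernel` and `conv_stride` defaults of the Hugging Face `Wav2Vec2Config`), not from this notebook, and the function name is illustrative.

```python
# Kernel sizes and strides of the seven 1-D convolution layers in the
# Wav2Vec2 base feature extractor (assumed from the default Wav2Vec2Config).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return how many feature frames the CNN emits for `num_samples` audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


# One second of audio sampled at 16 kHz.
print(num_output_frames(16_000))  # 49
```

Under these assumptions, one second of 16 kHz audio becomes 49 frames, so each frame passed to the transformer covers roughly 20 ms of speech.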

Connectionist Temporal Classification (CTC) is a character-based algorithm. During training, it automatically demarcates each character of the transcription in the speech, so no timeframe alignment between the audio signal and the transcription is required. For example, if an audio clip says "Hello World", we don't need to know at which second the word "hello" occurs. This saves a lot of labeling effort for ASR use cases. If you are interested in how the algorithm works underneath, see [this article](https://distill.pub/2017/ctc/) for more information.
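
The decoding side of CTC can be illustrated with a few lines of Python. This is a minimal sketch of the standard collapse rule (merge consecutive repeated labels, then drop blank tokens), not the code used by the model. The blank symbol `_` and the per-frame label strings are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank token


def ctc_collapse(frame_labels: str) -> str:
    """Collapse a per-frame label sequence into a transcription."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = (label for label, _ in groupby(frame_labels))
    # Step 2: remove blank tokens, which only separate repeated characters.
    return "".join(label for label in merged if label != BLANK)


print(ctc_collapse("hh_e_ll_ll_oo"))  # hello
print(ctc_collapse("hhelllo"))        # helo (no blank between the two l's)
```

The second call shows why the blank token matters: without a blank between them, the two `l` characters merge into one.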


## Notebook Overview 

In this notebook, we use the [SUPERB (Speech processing Universal PERformance Benchmark) dataset](https://huggingface.co/datasets/superb), which is available from the Hugging Face Datasets library, to fine-tune the Wav2Vec2 model and deploy it as a SageMaker endpoint for real-time inference on an ASR task.
<img src="images/solution_overview.png">

First, we show how to load and preprocess the SUPERB dataset in the SageMaker environment to obtain the tokenizer and feature extractor, which are required for fine-tuning the Wav2Vec2 model. Then we use SageMaker Script Mode for the training and inference steps. Script Mode lets you define and use custom training and inference scripts, while SageMaker provides supported Hugging Face framework Docker containers. For more information about training and serving Hugging Face models on SageMaker, see [Use Hugging Face with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html). This functionality is available through the Hugging Face [AWS Deep Learning Containers (DLC)](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html).

This notebook was tested in both SageMaker Studio and SageMaker Notebook environments with the following setup:
- SageMaker Studio: **ml.m5.xlarge** instance with **Data Science** kernel.
- SageMaker Notebook: **ml.m5.xlarge** instance with **conda_python3** kernel. 


## Set up 
First, install the dependencies.

In [2]:
!pip install sagemaker --upgrade
!pip install boto --upgrade
!pip install boto3 --upgrade
!pip install "transformers>=4.4.2" 
!pip install s3fs --upgrade
!pip install datasets --upgrade 
#!pip install "librosa==0.9.1"
!pip install torch # framework is required for transformer 
!pip install torchaudio
!pip install transformers
!pip install "accelerate>=0.5.0"
!pip install tensorboard
!pip install wandb

!conda install -y -c conda-forge librosa

Keyring is skipped due to an exception: 'keyring.backends'
Collecting sagemaker
  Downloading sagemaker-2.127.0.tar.gz (655 kB)
  Preparing metadata (setup.py) ... done
Collecting boto3<2.0,>=1.26.28
  Downloading boto3-1.26.43-py3-none-any.whl (132 kB)
Collecting importlib-metadata<5.0,>=1.4.0
  Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting botocore<1.30.0,>=1.29.43
  Downloading botocore-1.29.43-py3-none-any.whl (10.3 MB)

The **soundfile** library is used to read raw audio files and convert them into arrays. Before installing the **soundfile** Python library, the **libsndfile** package needs to be installed.

In [3]:
!conda install -c conda-forge libsndfile -y
!pip install soundfile

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 22.9.0
  latest version: 22.11.1

Please update conda by running

    $ conda update -n base -c conda-forge conda



# All requested packages already installed.

Retrieving notices: ...working... done

In [5]:
#!pip install boto --upgrade
#!pip install boto3 --upgrade

Collecting botocore<1.30.0,>=1.29.43
  Using cached botocore-1.29.43-py3-none-any.whl (10.3 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.27.59
    Uninstalling botocore-1.27.59:
      Successfully uninstalled botocore-1.27.59
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.24 requires botocore==1.29.24, but you have botocore 1.29.43 which is incompatible.
awscli 1.27.24 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.
awscli 1.27.24 requires rsa<4.8,>=3.1.2, but you have rsa 4.9 which is incompatible.
aiobotocore 2.4.1 requires botocore<1.27.60,>=1.27.59, but you have botocore 1.29.43 which is incompatible.
Successfully installed botocore-1.29.43

Next, import the common Python libraries. Create an S3 bucket in the AWS console for this project, and replace the value of `BUCKET` with your bucket name.
Get the execution role, which allows training and serving jobs to access your data.

In [6]:
import json
import time
import boto3
import numpy as np
import random
import soundfile 
import sagemaker
import sagemaker.huggingface

BUCKET="pretrain-wav2vec2-on-swahili" # please use your bucket name
PREFIX = "900h-radio-2022-dataset"
ROLE = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=BUCKET)

print(f"sagemaker role arn: {ROLE}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")



sagemaker role arn: arn:aws:iam::121713061542:role/service-role/AmazonSageMaker-ExecutionRole-20220927T193257
sagemaker bucket: pretrain-wav2vec2-on-swahili
sagemaker session region: us-west-2


Log in to HuggingFace

In [7]:
from huggingface_hub import notebook_login

notebook_login()


Set up Weights and Biases

In [8]:
import wandb
wandb.login()


Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 

 ········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [9]:
wandb.sagemaker_auth(path="./sagemaker/pretrain_wav2vec/pytorch")

## Data Pre-processing
We use the SUPERB dataset for this notebook, which can be loaded directly from the Hugging Face [dataset library](https://huggingface.co/datasets/superb) using the `load_dataset` function. SUPERB is a leaderboard that benchmarks the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. The dataset also includes columns such as speaker_id and chapter_id. These columns are removed, and we keep only the audio files and transcriptions to fine-tune the Wav2Vec2 model for a speech recognition task, which transcribes speech to text.

In [10]:
from huggingface_hub import HfFolder
HF_API_TOKEN=HfFolder.get_token()
HF_MODEL_ID="mutisya/wav2vec2-pretrain-swahili-radio2022-sage-1"

## Fine-tune the HuggingFace model (Wav2Vec2)

### Training script

Here we use SageMaker Hugging Face DLC (Deep Learning Container) script mode to construct the training and inference jobs. Script mode lets you write custom training and serving code while using Hugging Face framework containers that are maintained and supported by AWS.

When we create a training job using the script mode, the `entry_point` script, hyperparameters, its dependencies (inside requirements.txt) and input data (train and test datasets) will be copied into the container. Then it invokes the `entry_point` training script, where the train and test datasets will be loaded, training steps will be executed and model artifacts will be saved in `/opt/ml/model` in the container. After training, artifacts in this directory are uploaded to S3 for later model hosting.
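
As a rough illustration of the receiving end, the sketch below shows how an entry-point script might read its configuration. It assumes the usual script-mode behavior in which each hyperparameter is passed as a `--name value` command-line argument and the model directory is exposed through the `SM_MODEL_DIR` environment variable. The argument names mirror some of the hyperparameters used later in this notebook, but the function itself is hypothetical and is not the notebook's training script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed to a script-mode entry point."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as strings, e.g. `--learning_rate 0.002`.
    parser.add_argument("--learning_rate", type=float, default=0.002)
    parser.add_argument("--max_train_steps", type=int, default=20000)
    # Booleans are sent as the strings "True" / "False".
    parser.add_argument("--push_to_hub", type=lambda s: s.lower() == "true", default=False)
    # SM_MODEL_DIR is set inside the container; fall back to the documented path.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # parse_known_args ignores hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args(["--learning_rate", "0.002", "--push_to_hub", "True"])
    print(args.learning_rate, args.push_to_hub, args.model_dir)
```

Anything the script writes to `args.model_dir` is what gets packaged and uploaded to S3 when the job finishes.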

This script is saved in the directory `sagemaker/pretrain_wav2vec/pytorch`, and you can inspect it by running the next cell.

In [11]:
!pygmentize sagemaker/pretrain_wav2vec/pytorch/run_wav2vec2_pretraining_no_trainer.py

from transformers import (
    Wav2Vec2ForCTC,
    Trainer,
    TrainingArguments,
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor)
from datasets import load_from_disk, load_metric
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
import logging
import sys
import argparse
import os
import torch
import numpy as np
import boto3


if __name__ == "__main__"

### Create an Estimator and start a training job

Note that when you create a Hugging Face Estimator, you can configure hyperparameters and pass custom parameters into the training script, such as `vocab_url` in this example. You can also specify metric definitions in the Estimator, which parse metric values from the training logs and send them to CloudWatch so that you can monitor and track training performance.

In [17]:
from sagemaker.huggingface import HuggingFace

# Create a unique id to tag the training job, model name and endpoint name.
id = int(time.time())

TRAINING_JOB_NAME = f"huggingface-wav2vec2-pretrain-swahili-radio2022-{id}"
print('Training job name: ', TRAINING_JOB_NAME)

vocab_url = f"s3://{BUCKET}/{PREFIX}/vocab.json"
hyperparameters = {
    'dataset_names':"mutisya/swahili_radio_yt_2022_v0.1_sage_pt0 mutisya/swahili_radio_yt_2022_v0.1_sage_pt1 mutisya/swahili_radio_yt_2022_v0.1_sage_pt2 mutisya/swahili_radio_yt_2022_v0.1_sage_pt3 mutisya/swahili_radio_yt_2022_v0.1_sage_pt4 mutisya/swahili_radio_yt_2022_v0.1_sage_pt5 mutisya/swahili_radio_yt_2022_v0.1_sage_pt6 mutisya/swahili_radio_yt_2022_v0.1_sage_pt7 mutisya/swahili_radio_yt_2022_v0.1_sage_pt8 mutisya/swahili_radio_yt_2022_v0.1_sage_pt9",
    'dataset_split_names': "train", 
    'dataset_config_names': "train", 
    'dataset_use_auth_token' : "True",
    'model_name_or_path': "patrickvonplaten/wav2vec2-base-v2",
    'output_dir': "./wav2vec2-pretrain-swahili-radio2022-1",
    'max_train_steps': "20000",
    'num_warmup_steps': "32000",
    'saving_steps': "10000",
    'gradient_accumulation_steps': "4",
    'learning_rate': "0.002",
    'weight_decay' : "0.01",
    'max_duration_in_seconds': "30.5",
    'min_duration_in_seconds' : "2.0",
    'logging_steps': "1",
    'per_device_train_batch_size' : "4",
    'per_device_eval_batch_size': "4",
    'adam_beta1': "0.9",
    'adam_beta2' : "0.98",
    'adam_epsilon' : "1e-06",
    'push_to_hub': "True",
    'gradient_checkpointing': "True",
    'hub_token': HF_API_TOKEN,
    'hub_model_id': HF_MODEL_ID,
  }

# define metrics definitions
metric_definitions=[
        {'Name': 'val_loss', 'Regex': "'val_loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'val_contrastive_loss', 'Regex': "'val_contrastive_loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'val_diversity_loss', 'Regex': "'val_diversity_loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'val_num_losses', 'Regex': "'val_num_losses': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'contrast_loss', 'Regex': "'contrast_loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'div_loss', 'Regex': "'div_loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': '%_mask_idx', 'Regex': "'%_mask_idx': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'ppl', 'Regex': "'ppl': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'lr', 'Regex': "'lr': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'temp', 'Regex': "'temp': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'grad_norm', 'Regex': "'grad_norm': ([0-9]+(.|e\-)[0-9]+),?"}]

Training job name:  huggingface-wav2vec2-pretrain-swahili-radio2022-1672871623
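
Before launching a job, it can help to check locally what a metric regex actually captures. The sketch below applies the `loss` and `lr` patterns from `metric_definitions` with Python's `re` module. The log lines are hypothetical (the real training script's log format may differ), and the `extract` helper is illustrative only.

```python
import re

# Same patterns as the 'loss' and 'lr' entries in metric_definitions.
LOSS_REGEX = r"'loss': ([0-9]+(.|e\-)[0-9]+),?"
LR_REGEX = r"'lr': ([0-9]+(.|e\-)[0-9]+),?"


def extract(pattern: str, line: str):
    """Return the first capture group of `pattern` in `line`, or None."""
    match = re.search(pattern, line)
    return match.group(1) if match else None


# Hypothetical log line written by a training script.
log_line = "{'loss': 4.532, 'lr': 1e-05, 'val_loss': 3.9}"

print(extract(LOSS_REGEX, log_line))          # 4.532
print(extract(LR_REGEX, log_line))            # 1e-05
print(extract(LR_REGEX, "{'lr': 2.5e-05}"))   # 2.5
```

The last call shows a limitation of this pattern under Python's regex semantics: a value that has both a decimal point and an exponent, such as `2.5e-05`, is captured only up to `2.5`. If the training script logs values in that form, the pattern would need to be extended.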



We use the [HuggingFace estimator class](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) to train our model. When creating the estimator, the following parameters need to be specified:

* **entry_point**: the name of the training script. It loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model. 
* **source_dir**: the location of the training scripts. 
* **transformers_version**: the Hugging Face transformers library version we want to use.
* **pytorch_version**: the PyTorch version that is compatible with the transformers library.

**Instance Selection**: The estimator below uses one ml.g5.2xlarge instance with `max_run` set to 259200 seconds (72 hours). You can select a more powerful instance to reduce the training time, but it will cost more.
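
The instance choice also affects how much data the job processes per optimizer update. The sketch below works through the arithmetic for the hyperparameters defined earlier. It assumes a single-GPU instance with `instance_count=1`, and it assumes that one training step corresponds to one optimizer update after gradient accumulation; both are assumptions about the setup rather than facts stated in this notebook, and the function name is illustrative.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Number of audio clips contributing to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


# Values from the hyperparameters dictionary; num_gpus=1 is an assumption.
batch = effective_batch_size(per_device_batch_size=4,
                             gradient_accumulation_steps=4,
                             num_gpus=1)
max_train_steps = 20_000

print(batch)                    # 16
print(batch * max_train_steps)  # 320000 clips seen at most
```

Moving to an instance with more GPUs raises the effective batch size proportionally, so the learning rate and step counts may need to be revisited.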

In [None]:
OUTPUT_PATH= f's3://{BUCKET}/{PREFIX}/{TRAINING_JOB_NAME}/output/'

env_variables = {
    'HF_API_TOKEN':HF_API_TOKEN,
    'HF_MODEL_ID': HF_MODEL_ID
}
huggingface_estimator = HuggingFace(entry_point='run_wav2vec2_pretraining_no_trainer_sagemaker.py',
                                    source_dir='./sagemaker/pretrain_wav2vec/pytorch',
                                    output_path= OUTPUT_PATH, 
                                    instance_type='ml.g5.2xlarge',
                                    instance_count=1,
                                    transformers_version='4.17.0',
                                    pytorch_version='1.10.2',
                                    #pytorch_version='1.8.0',
                                    py_version='py38',
                                    #py_version='py37',
                                    role=ROLE,
                                    # use_spot_instances=False,  # Use a spot instance 
                                    max_run=259200,  # Max training time
                                    # max_wait=3600,  # Max training time + spot waiting time
                                    hyperparameters = hyperparameters,
                                    metric_definitions = metric_definitions,
                                    environment = env_variables
                                   )

#Starts the training job using the fit function, training takes approximately 2 hours to complete.
huggingface_estimator.fit(job_name=TRAINING_JOB_NAME)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-wav2vec2-pretrain-swahili-radio2022-1672871623


2023-01-04 22:33:51 Starting - Starting the training job...
2023-01-04 22:34:07 Starting - Preparing the instances for training.........
2023-01-04 22:35:49 Downloading - Downloading input data....................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-01-04 22:38:56,756 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-01-04 22:38:56,775 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-01-04 22:38:56,777 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-01-04 22:38:56,931 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.8 -m pip install -r requirements.txt
Collecting accelerate>=0.5.0
Downloading accelerate-0.15.0-py3-none-any.whl (191 kB)



From the training logs, you can see that after 10 epochs of training, the model evaluation metric WER (word error rate) reaches around 0.32 on the subset of the SUPERB dataset. You can increase the number of epochs or use the full dataset to improve the model further.
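
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcription and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration. It is not the implementation used by the training script, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("hello world", "hello word"))  # 0.5
```

A WER of 0.32 therefore means that roughly one in three reference words requires an insertion, deletion, or substitution to match the model output.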