How to use an off-the-shelf pre-trained model for audio classification in a few lines of code with Hugging Face Transformers

In [1]:
! pip install datasets

Collecting datasets
  Using cached datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting numpy>=1.17
  Using cached numpy-1.25.1-cp311-cp311-win_amd64.whl (15.0 MB)
Collecting pyarrow>=8.0.0
  Using cached pyarrow-12.0.1-cp311-cp311-win_amd64.whl (21.5 MB)
Collecting dill<0.3.7,>=0.3.0
  Using cached dill-0.3.6-py3-none-any.whl (110 kB)
Collecting pandas
  Using cached pandas-2.0.3-cp311-cp311-win_amd64.whl (10.6 MB)
Collecting requests>=2.19.0
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting tqdm>=4.62.1
  Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting xxhash
  Using cached xxhash-3.2.0-cp311-cp311-win_amd64.whl (30 kB)
Collecting multiprocess
  Using cached multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collecting fsspec[http]>=2021.11.1
  Using cached fsspec-2023.6.0-py3-none-any.whl (163 kB)
Collecting aiohttp
  Using cached aiohttp-3.8.4-cp311-cp311-win_amd64.whl (317 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
  Using cached huggingface_hub-0.16


In [1]:
# Load the en-AU subset of MINDS-14 to try out the pipeline, then upsample the
# audio to 16 kHz, the sampling rate most pre-trained speech models expect.
from datasets import Audio, load_dataset

minds = load_dataset("PolyAI/minds14", "en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16000))

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset minds14 (C:/Users/Raj/.cache/huggingface/datasets/PolyAI___minds14/en-AU/1.0.0/65c7e0f3be79e18a6ffaf879a083daf706312d421ac90d25718459cbf3c42696)
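
To make the resampling step concrete, here is a minimal sketch of what upsampling to 16 kHz does to the raw samples. It assumes an 8 kHz source recording and uses plain linear interpolation. `resample_linear` is a hypothetical helper written for illustration only; it is not the resampler that `datasets` uses when casting the column, which is of higher quality.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a list of floats by linear interpolation (illustration only)."""
    if not samples:
        return []
    # The duration stays the same, so the sample count scales with the rate.
    n_out = int(round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of silence at an assumed 8 kHz becomes 16000 samples at 16 kHz.
one_second_8k = [0.0] * 8000
upsampled = resample_linear(one_second_8k, 8000, 16000)
print(len(upsampled))  # 16000
```

The point of the sketch is only that the array handed to the model has twice as many samples per second after the cast; the audio content itself is unchanged.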


In [3]:
! pip install transformers[torch]

Collecting transformers[torch]
  Using cached transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting regex!=2019.12.17 (from transformers[torch])
  Using cached regex-2023.6.3-cp311-cp311-win_amd64.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[torch])
  Using cached tokenizers-0.13.3-cp311-cp311-win_amd64.whl (3.5 MB)
Collecting safetensors>=0.3.1 (from transformers[torch])
  Using cached safetensors-0.3.1-cp311-cp311-win_amd64.whl (263 kB)
Collecting torch!=1.12.0,>=1.9 (from transformers[torch])
  Using cached torch-2.0.1-cp311-cp311-win_amd64.whl (172.3 MB)
Collecting accelerate>=0.20.2 (from transformers[torch])
  Using cached accelerate-0.20.3-py3-none-any.whl (227 kB)
Collecting sympy (from torch!=1.12.0,>=1.9->transformers[torch])
  Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch!=1.12.0,>=1.9->transformers[torch])
  Using cached networkx-3.1-py3-none-any.whl (2.1 MB)
Collecting jinja2 (from torch!=1.12.0,>=1.9->

In [2]:
# To classify an audio recording into a set of classes, use the audio-classification pipeline from Hugging Face Transformers
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

In [9]:
! pip install librosa

Collecting librosa
  Using cached librosa-0.10.0.post2-py3-none-any.whl (253 kB)
Collecting audioread>=2.1.9 (from librosa)
  Using cached audioread-3.0.0-py3-none-any.whl
Collecting scipy>=1.2.0 (from librosa)
  Using cached scipy-1.11.1-cp311-cp311-win_amd64.whl (44.0 MB)
Collecting scikit-learn>=0.20.0 (from librosa)
  Using cached scikit_learn-1.3.0-cp311-cp311-win_amd64.whl (9.2 MB)
Collecting joblib>=0.14 (from librosa)
  Using cached joblib-1.3.1-py3-none-any.whl (301 kB)
Collecting numba>=0.51.0 (from librosa)
  Using cached numba-0.57.1-cp311-cp311-win_amd64.whl (2.6 MB)
Collecting soundfile>=0.12.1 (from librosa)
  Using cached soundfile-0.12.1-py2.py3-none-win_amd64.whl (1.0 MB)
Collecting pooch<1.7,>=1.0 (from librosa)
  Using cached pooch-1.6.0-py3-none-any.whl (56 kB)
Collecting soxr>=0.3.2 (from librosa)
  Using cached soxr-0.3.5-cp311-cp311-win_amd64.whl (184 kB)
Collecting lazy-loader>=0.1 (from librosa)
  Using cached lazy_loader-0.3-py3-none-any.whl (9.1 kB)
Collecti

In [14]:
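# Pin NumPy to 1.24, presumably because numba 0.57.1 (installed with librosa) does not support NumPy 1.25
! pip install numpy==1.24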



In [3]:
# The raw audio data is stored as a NumPy array under ["audio"]["array"]
example = minds[0]

In [4]:
classifier(example["audio"]["array"])

[{'score': 0.9625311493873596, 'label': 'pay_bill'},
 {'score': 0.028672676533460617, 'label': 'freeze'},
 {'score': 0.0033497896511107683, 'label': 'card_issues'},
 {'score': 0.0020057999063283205, 'label': 'abroad'},
 {'score': 0.0008484320132993162, 'label': 'high_value_payment'}]

The model is very confident (a score of about 0.96) that the caller wants to know about paying their bill. Let's see what the actual label for this example is:
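
The pipeline returns a list of dictionaries, one per candidate label, so a single prediction can be extracted with ordinary Python. The sketch below copies the output shown above into a literal so that it runs without loading the model; the variable names are illustrative.

```python
# Output of the classifier call above, copied verbatim.
predictions = [
    {"score": 0.9625311493873596, "label": "pay_bill"},
    {"score": 0.028672676533460617, "label": "freeze"},
    {"score": 0.0033497896511107683, "label": "card_issues"},
    {"score": 0.0020057999063283205, "label": "abroad"},
    {"score": 0.0008484320132993162, "label": "high_value_payment"},
]

# Pick the entry with the highest score rather than relying on list order.
top = max(predictions, key=lambda p: p["score"])
print(top["label"], round(top["score"], 3))  # pay_bill 0.963
```
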

In [5]:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'

Hooray! The predicted label is correct.
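
The check above can be written as code. The sketch below mimics what `int2str` does, which is to look up a class name by its integer id, and then compares the result with the top prediction. `INTENT_NAMES` is a hypothetical, shortened label list with an invented order; the real list and ids come from `minds.features["intent_class"]`.

```python
# Hypothetical label list; the real names and their order are defined by the dataset.
INTENT_NAMES = ["abroad", "card_issues", "freeze", "pay_bill"]


def int2str(class_id):
    """Map an integer class id to its label name, as ClassLabel.int2str does."""
    return INTENT_NAMES[class_id]


def is_correct(predictions, true_class_id):
    """Return True if the highest-scoring predicted label matches the true label."""
    predicted = max(predictions, key=lambda p: p["score"])["label"]
    return predicted == int2str(true_class_id)


# Top two entries of the classifier output shown earlier.
example_predictions = [
    {"score": 0.9625311493873596, "label": "pay_bill"},
    {"score": 0.028672676533460617, "label": "freeze"},
]

# With the hypothetical list above, id 3 stands for 'pay_bill'.
print(is_correct(example_predictions, 3))  # True
```
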