## Zero-Shot Audio Classification

The `librosa` library may require [ffmpeg](https://www.ffmpeg.org/download.html) to be installed. The [librosa](https://pypi.org/project/librosa/) page on PyPI provides installation instructions for ffmpeg.

In [None]:
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa

from transformers.utils import logging
logging.set_verbosity_error()


### Prepare the dataset of audio recordings

In [None]:
from datasets import load_dataset, load_from_disk

# This dataset is a collection of 5-second recordings of different sounds
# dataset = load_dataset("ashraq/esc50",
#                       split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

In [None]:
audio_sample = dataset[0]
audio_sample

from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

### Build the `audio classification` pipeline using the 🤗 Transformers library

In [None]:
from transformers import pipeline

In [None]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="./models/laion/clap-htsat-unfused")

More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate for Transformer Models
- How long does 1 second of high-resolution audio (192,000 Hz) appear to be to the Whisper model (which is trained to expect audio sampled at 16,000 Hz)?

In [3]:
(1 * 192000) / 16000

12.0

- 1 second of high-resolution audio appears to the model as if it were 12 seconds of audio.

- How about 5 seconds of audio?

In [4]:
(5 * 192000) / 16000

60.0

- 5 seconds of high-resolution audio appears to the model as if it were 60 seconds of audio.
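
The arithmetic above can be wrapped in a small helper. This is a minimal sketch; the function name `perceived_duration` and its parameters are made up for illustration and are not part of any library.

```python
def perceived_duration(seconds, source_rate, model_rate):
    """Return how long `seconds` of audio sampled at `source_rate` appears
    to a model that assumes every input is sampled at `model_rate`."""
    # Number of samples actually present in the recording
    num_samples = seconds * source_rate
    # The model interprets every `model_rate` samples as one second
    return num_samples / model_rate


print(perceived_duration(1, 192_000, 16_000))  # 12.0
print(perceived_duration(5, 192_000, 16_000))  # 60.0
```

When the two rates are equal the function returns the true duration, which is why the input is resampled before it is passed to the model.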

In [None]:
# Sampling rate the model expects
print(zero_shot_classifier.feature_extractor.sampling_rate)
# Sampling rate of the audio sample
print(audio_sample["audio"]["sampling_rate"])

Resample the input audio so that its sampling rate matches the rate the model expects.
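
Resampling changes the number of samples in a clip while keeping its duration the same. The sketch below illustrates that relationship. The helper `resampled_length` is hypothetical, and the 44,100 Hz source rate is an assumption made for the example, not a value taken from this notebook.

```python
def resampled_length(num_samples, source_rate, target_rate):
    """Return the approximate number of samples after resampling a clip
    from `source_rate` to `target_rate` (duration is unchanged)."""
    return round(num_samples * target_rate / source_rate)


# Assumption: a 5-second clip originally sampled at 44,100 Hz
original_samples = 5 * 44_100
new_samples = resampled_length(original_samples, 44_100, 48_000)

print(new_samples)           # 240000
print(new_samples / 48_000)  # 5.0 (still 5 seconds)
```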

In [None]:
from datasets import Audio

dataset = dataset.cast_column(
    "audio",
    Audio(sampling_rate=48_000))
audio_sample = dataset[0]
audio_sample

candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)
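
To act on the classifier's result programmatically, one option is to pick the highest-scoring label. This sketch assumes the pipeline returns a list of dictionaries with `"score"` and `"label"` keys; the helper `top_prediction` is hypothetical and the scores below are made-up placeholders, not real model output.

```python
def top_prediction(predictions):
    """Return the label of the highest-scoring entry in `predictions`."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"]


# Placeholder values for illustration only (not real model scores)
example_predictions = [
    {"score": 0.9, "label": "Sound of a dog"},
    {"score": 0.1, "label": "Sound of vacuum cleaner"},
]

print(top_prediction(example_predictions))  # Sound of a dog
```

Keep in mind that a zero-shot classifier can only choose among the candidate labels it is given, so a high score does not guarantee that the label actually describes the sound.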

### Automatic Speech Recognition

In [None]:
!pip install transformers
!pip install -U datasets
!pip install soundfile
!pip install librosa
!pip install gradio

from transformers.utils import logging
logging.set_verbosity_error()

from datasets import load_dataset

# Load the speech dataset in streaming mode to minimize memory usage
dataset = load_dataset("librispeech_asr",
                       split="train.clean.100",
                       streaming=True,
                       trust_remote_code=True)


In [None]:

# Fetch the first example from the streamed dataset
example = next(iter(dataset))

# Fetch the first five examples
dataset_head = dataset.take(5)
list(dataset_head)


In [None]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(example["audio"]["array"],
             rate=example["audio"]["sampling_rate"])

In [None]:
from transformers import pipeline

asr = pipeline(task="automatic-speech-recognition", model="distil-whisper/distil-small.en")

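# Compare the sampling rate the model expects with that of the example
print(asr.feature_extractor.sampling_rate)
print(example["audio"]["sampling_rate"])

# Compare the model's transcription with the reference text
print(asr(example["audio"]["array"]))
print(example["text"])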
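
The transcription and the reference text may differ in casing and punctuation even when the words match. A minimal sketch of normalizing both before comparing them follows. The helper `normalize` and the two strings are hypothetical stand-ins, not actual model output or dataset content.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Hypothetical strings for illustration only
prediction = " Hello, world. This is a test."
reference = "HELLO WORLD THIS IS A TEST"

print(normalize(prediction) == normalize(reference))  # True
```

In the notebook this would be applied to the `"text"` field of the pipeline output and to `example["text"]`, assuming the pipeline returns a dictionary with a `"text"` key.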