# Lesson 5: Zero-Shot Audio Classification

- In the classroom, the libraries have already been installed for you.
- If you are running this code on your own machine, please install the following:
```
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
```

The `librosa` library may require [ffmpeg](https://www.ffmpeg.org/download.html) to be installed.
- The [librosa](https://pypi.org/project/librosa/) page on PyPI provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [1]:
from transformers.utils import logging
logging.set_verbosity_error()

### Prepare the dataset of audio recordings

In [2]:
from datasets import load_dataset, load_from_disk

# This dataset is a collection of different sounds, each 5 seconds long
# dataset = load_dataset("ashraq/esc50",
#                       split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

In [3]:
audio_sample = dataset[0]

In [4]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100}}

In [5]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

### Build the `zero-shot audio classification` pipeline using the 🤗 Transformers library

In [6]:
from transformers import pipeline

In [7]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="./models/laion/clap-htsat-unfused")

More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate for Transformer Models
- How long does 1 second of high-resolution audio (192,000 Hz) appear to be to a model such as Whisper, which is trained to expect audio sampled at 16,000 Hz?

In [3]:
(1 * 192000) / 16000

12.0

- 1 second of high-resolution audio appears to the model as 12 seconds of audio.

- How about 5 seconds of audio?

In [4]:
(5 * 192000) / 16000

60.0

- 5 seconds of high-resolution audio appears to the model as 60 seconds of audio.
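
The two calculations above follow the same pattern, which can be captured in a small helper. This is a minimal sketch; the function name `apparent_duration` and its parameter names are illustrative and not part of any library.

```python
def apparent_duration(seconds, actual_rate, expected_rate):
    """Return how long a clip appears to a model that assumes `expected_rate`.

    The clip holds seconds * actual_rate samples. A model that reads them at
    expected_rate samples per second perceives that many samples as
    (seconds * actual_rate) / expected_rate seconds.
    """
    return (seconds * actual_rate) / expected_rate


print(apparent_duration(1, 192_000, 16_000))  # 12.0
print(apparent_duration(5, 192_000, 16_000))  # 60.0
```

When the two rates are equal the function returns the true duration, which is the situation resampling aims for.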

In [8]:
zero_shot_classifier.feature_extractor.sampling_rate

48000

In [9]:
audio_sample["audio"]["sampling_rate"]

44100

- The two rates differ, so resample the dataset audio to the 48,000 Hz that the model's feature extractor expects.

In [10]:
from datasets import Audio

In [11]:
dataset = dataset.cast_column(
    "audio",
    Audio(sampling_rate=48_000))

In [12]:
audio_sample = dataset[0]

In [13]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 48000}}
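
Casting the column changes the number of samples in each array, not the duration of the clip. The sketch below estimates the array length before and after resampling, assuming a clip of exactly 5 seconds. `num_samples` is a hypothetical helper, and a real resampler's output length may differ by a sample or so because of rounding.

```python
def num_samples(duration_seconds, sampling_rate):
    """Number of samples in a clip of the given duration at the given rate."""
    return round(duration_seconds * sampling_rate)


before = num_samples(5, 44_100)  # length at the dataset's original rate
after = num_samples(5, 48_000)   # length at the rate the model expects

print(before, after)   # 220500 240000
print(after / before)  # the same ratio as 48000 / 44100
```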

In [14]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [15]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9985589385032654, 'label': 'Sound of a dog'},
 {'score': 0.0014411123702302575, 'label': 'Sound of vacuum cleaner'}]
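
The pipeline returns a list of dictionaries, each holding a `score` and a `label`. A minimal sketch for pulling out the best match, with the output above copied in as a literal; `top_label` is a hypothetical helper, not part of the Transformers API.

```python
# Output of the previous cell, copied in as a literal.
results = [
    {"score": 0.9985589385032654, "label": "Sound of a dog"},
    {"score": 0.0014411123702302575, "label": "Sound of vacuum cleaner"},
]


def top_label(results):
    """Return the (label, score) pair with the highest score."""
    best = max(results, key=lambda r: r["score"])
    return best["label"], best["score"]


label, score = top_label(results)
print(f"{label}: {score:.4f}")  # Sound of a dog: 0.9986
```

Using `max` rather than `results[0]` avoids relying on the order in which the results are returned.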

In [16]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [17]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.6172530055046082, 'label': 'Sound of a bird singing'},
 {'score': 0.21602635085582733, 'label': 'Sound of vacuum cleaner'},
 {'score': 0.12547191977500916, 'label': 'Sound of an airplane'},
 {'score': 0.04124866798520088, 'label': 'Sound of a child crying'}]
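
None of these labels describes a dog, yet the scores still add up to roughly 1 and one label wins. This is consistent with the scores being normalized across the candidate labels only, so a high score means "best of the options offered" rather than "correct". The sketch below illustrates the effect with a softmax over made-up logits; the logits are hypothetical and were not produced by the model, and the assumption that the pipeline normalizes with a softmax is stated here for illustration.

```python
import math


def softmax(logits):
    """Normalize a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarity logits for four labels, none of which fits well.
made_up_logits = [2.0, 1.0, 0.5, -0.5]
probs = softmax(made_up_logits)
print(sum(probs))  # approximately 1, however poor the labels are

# The scores reported above behave the same way.
reported = [0.6172530055046082, 0.21602635085582733,
            0.12547191977500916, 0.04124866798520088]
print(abs(sum(reported) - 1.0) < 1e-6)  # True
```

In practice, this is a reason to include a plausible label for every sound the audio might contain.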

### Try it yourself! 
- Try this model with some other labels and audio files!
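
One way to experiment is to run the classifier over several samples and compare each prediction with the dataset's own `category` field. The sketch below is a starting point under stated assumptions: `classify_many` is a hypothetical helper, and `stub_classifier` is a stand-in that only mimics the output format shown earlier so that the example runs without the model. In the notebook, `zero_shot_classifier` would be passed instead.

```python
def classify_many(classifier, samples, candidate_labels):
    """Run a zero-shot classifier over samples and keep each top label."""
    predictions = []
    for sample in samples:
        results = classifier(sample["audio"]["array"],
                             candidate_labels=candidate_labels)
        best = max(results, key=lambda r: r["score"])
        predictions.append((sample["category"], best["label"], best["score"]))
    return predictions


def stub_classifier(array, candidate_labels):
    """Stand-in for the real pipeline: always favors the first label."""
    return [{"score": 0.9, "label": candidate_labels[0]},
            {"score": 0.1, "label": candidate_labels[1]}]


# Fake samples with the same keys as the dataset rows shown earlier.
samples = [{"category": "dog", "audio": {"array": [0.0, 0.0]}}]
labels = ["Sound of a dog", "Sound of vacuum cleaner"]

for category, label, score in classify_many(stub_classifier, samples, labels):
    print(f"true: {category:<10} predicted: {label} ({score:.2f})")
```

With the real pipeline, a call along the lines of `classify_many(zero_shot_classifier, [dataset[i] for i in range(5)], candidate_labels)` would be the equivalent.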