# Lesson 5: Zero-Shot Audio Classification

- In the classroom, the libraries have already been installed for you.
- If you are running this code on your own machine, please install the following:
```
    !pip install transformers
    !pip install datasets
    !pip install soundfile
    !pip install librosa
```

The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed.
- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [2]:
# !pip install transformers
# !pip install datasets
# !pip install soundfile
# !pip install librosa

In [3]:
from transformers.utils import logging
logging.set_verbosity_error()

### Prepare the dataset of audio recordings

In [5]:
from datasets import load_dataset, load_from_disk

# Loading from disk only works if the files have already been downloaded and saved locally.
# This dataset is a collection of 5-second recordings of different sounds.
# dataset = load_from_disk("./models/ashraq/esc50/train")

# download directly from the web
dataset = load_dataset("ashraq/esc50",
                      split="train[0:10]")


In [6]:
audio_sample = dataset[0]

In [7]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100}}

In [8]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

# barking

# audio data

## continuous sound wave === microphone/data collection ===> electrical signal ===> analog-to-digital converter === sampling ===> digital representation


## Sampling rate

### Sampling = measuring the value of a continuous signal at fixed time steps.
### Sampling rate (Hz) = the number of samples taken in one second.
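
As a minimal sketch of what sampling means, the snippet below measures a sine wave at fixed time steps using NumPy. The 440 Hz tone, the 1-second duration, and the helper name `sample_sine` are illustrative assumptions and not part of the lesson's code.

```python
import numpy as np

def sample_sine(frequency_hz, duration_s, sampling_rate):
    """Measure a sine wave at fixed time steps (hypothetical helper)."""
    num_samples = int(duration_s * sampling_rate)
    # Time of each sample: 0, 1/rate, 2/rate, ...
    times = np.arange(num_samples) / sampling_rate
    values = np.sin(2 * np.pi * frequency_hz * times)
    return times, values

times, values = sample_sine(frequency_hz=440, duration_s=1.0, sampling_rate=8_000)
print(len(values))          # number of samples taken in one second
print(times[1] - times[0])  # time step between two samples, in seconds
```

A higher sampling rate gives more values per second and a smaller time step between them.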

## Sampling rate examples:

8,000 Hz: telephone / walkie-talkie \
16,000 Hz: human speech recording \
192,000 Hz: high-resolution audio

## Sampling rate for transformer models:
5-second sound \
At 8,000 Hz ==> 5*8000 = 40,000 signal values \
At 16,000 Hz ==> 5*16000 = 80,000 signal values \
At 192,000 Hz ==> 5*192000 = 960,000 signal values

For a transformer trained on 16 kHz audio, an array of 960,000 values \
will look like a 60-second recording at 16 kHz (60*16000 = 960,000)
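
The arithmetic above can be checked with a short sketch. The helper names `num_samples` and `apparent_duration` are hypothetical and introduced only for illustration.

```python
def num_samples(duration_s, sampling_rate):
    """Number of signal values in a recording of the given length."""
    return int(duration_s * sampling_rate)

def apparent_duration(n_samples, model_sampling_rate):
    """Length in seconds that a model assumes for an array of n_samples values."""
    return n_samples / model_sampling_rate

# A 5-second sound recorded at three different sampling rates
for rate in (8_000, 16_000, 192_000):
    print(rate, num_samples(5, rate))

# The 192 kHz recording, interpreted by a model that expects 16 kHz audio
n = num_samples(5, 192_000)
print(apparent_duration(n, 16_000))
```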

### Build the `audio classification` pipeline using 🤗 Transformers Library

In [9]:
from transformers import pipeline

In [11]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

# no local files, just download


More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate for Transformer Models
- How long does 1 second of high-resolution audio (192,000 Hz) appear to a model such as Whisper, which is trained to expect audio at 16,000 Hz?

In [12]:
(1 * 192000) / 16000

12.0

- The 1 second of high-resolution audio appears to the model as if it were 12 seconds of audio.

- How about 5 seconds of audio?

In [13]:
(5 * 192000) / 16000

60.0

- 5 seconds of high-resolution audio appears to the model as if it were 60 seconds of audio.

In [14]:
zero_shot_classifier.feature_extractor.sampling_rate

48000

In [15]:
audio_sample["audio"]["sampling_rate"]

44100

- The dataset audio is sampled at 44,100 Hz, but the model expects 48,000 Hz. Resample the input so that both sampling rates match.

In [16]:
from datasets import Audio

In [17]:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))
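
Casting the column with `Audio(sampling_rate=48_000)` makes 🤗 Datasets resample each clip when it is decoded. To illustrate what resampling does to an array, here is a rough sketch that uses linear interpolation with NumPy. This is a simplification under stated assumptions: it is not the algorithm 🤗 Datasets uses, the helper `resample_linear` is hypothetical, and the test tone stands in for a real clip. Production resamplers also apply anti-aliasing filters.

```python
import numpy as np

def resample_linear(signal, orig_rate, target_rate):
    """Resample a 1-D signal by linear interpolation (illustrative only)."""
    duration_s = len(signal) / orig_rate
    target_len = int(round(duration_s * target_rate))
    # Time stamps of the original and the new samples, in seconds
    orig_times = np.arange(len(signal)) / orig_rate
    target_times = np.arange(target_len) / target_rate
    return np.interp(target_times, orig_times, signal)

# A 5-second test tone at 44,100 Hz stands in for a dataset clip
clip = np.sin(2 * np.pi * 440 * np.arange(5 * 44_100) / 44_100)
resampled = resample_linear(clip, orig_rate=44_100, target_rate=48_000)
print(len(clip), len(resampled))
```

The clip still lasts 5 seconds after resampling. Only the number of values used to represent it changes.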

In [26]:
audio_sample = dataset[0]

In [19]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 48000}}

In [20]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [21]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9985589385032654, 'label': 'Sound of a dog'},
 {'score': 0.0014411048032343388, 'label': 'Sound of vacuum cleaner'}]

In [29]:
# The candidate labels do not have to match the dataset's category names exactly:
# the model still assigns a probability to every candidate label,
# and the probabilities of all labels add up to 1.
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane",
                    "sound of a dog barking"]

In [30]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9893622994422913, 'label': 'Sound of a bird singing'},
 {'score': 0.0061001889407634735, 'label': 'Sound of an airplane'},
 {'score': 0.0022795458789914846, 'label': 'sound of a dog barking'},
 {'score': 0.0015596762532368302, 'label': 'Sound of a child crying'},
 {'score': 0.0006982541526667774, 'label': 'Sound of vacuum cleaner'}]
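
The scores add up to 1 because the per-label scores are normalized across the candidate labels. Here is a minimal sketch of that normalization, assuming a softmax over similarity values between the audio and each label. The numbers in `similarities` are made up for illustration and are not model outputs.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ["Sound of a dog", "Sound of vacuum cleaner"]
similarities = [9.5, 2.0]  # hypothetical audio-to-text similarity values
scores = softmax(similarities)

for label, score in zip(labels, scores):
    print(f"{label}: {score:.4f}")
print("total:", scores.sum())
```

Because the normalization is relative, a high score only means that a label fits better than the other candidates, not that it is correct in an absolute sense.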

In [31]:
# audio_sample_1 = dataset[1]

### Try it yourself!
- Try this model with some other labels and audio files!

In [33]:
# put things together:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))
audio_sample = dataset[1]
print(audio_sample)
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

# bird singing for 5 seconds

{'filename': '1-100038-A-14.wav', 'fold': 1, 'target': 14, 'category': 'chirping_birds', 'esc10': False, 'src_file': 100038, 'take': 'A', 'audio': {'path': None, 'array': array([-0.01288922, -0.09524129, -0.14230728, ...,  0.03312215,
        0.00153297,  0.        ]), 'sampling_rate': 48000}}


In [34]:
# zero shot

candidate_labels = ["Sound of a dog",
                    "Sound of a bird"]
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9994524121284485, 'label': 'Sound of a bird'},
 {'score': 0.0005475395009852946, 'label': 'Sound of a dog'}]

In [35]:
# The candidate labels do not have to match the dataset's category names exactly:
# the model still assigns a probability to every candidate label,
# and the probabilities of all labels add up to 1.
# In this case, the correct answer is included among the candidate labels.
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane",
                    "sound of a dog barking"]

In [36]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9893622994422913, 'label': 'Sound of a bird singing'},
 {'score': 0.0061001889407634735, 'label': 'Sound of an airplane'},
 {'score': 0.0022795458789914846, 'label': 'sound of a dog barking'},
 {'score': 0.0015596762532368302, 'label': 'Sound of a child crying'},
 {'score': 0.0006982541526667774, 'label': 'Sound of vacuum cleaner'}]

## Another example:

In [39]:
# put things together:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))
audio_sample = dataset[2]
print(audio_sample)
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

# vacuum cleaner for 5 seconds

{'filename': '1-100210-A-36.wav', 'fold': 1, 'target': 36, 'category': 'vacuum_cleaner', 'esc10': False, 'src_file': 100210, 'take': 'A', 'audio': {'path': None, 'array': array([-0.00669002, -0.01229006, -0.01174703, ..., -0.08509892,
       -0.27288783,  0.        ]), 'sampling_rate': 48000}}


In [43]:
# zero shot

candidate_labels = ["Sound of a dog",
                    "Sound of a bird"]
two_choice_result = zero_shot_classifier(audio_sample["audio"]["array"],
                                         candidate_labels=candidate_labels)
# neither label is correct here: the recording is a vacuum cleaner
print("set 1: ", two_choice_result)

candidate_labels_2 = ["Sound of a dog",
                    "Sound of vacuum cleaner"]
two_choice_result = zero_shot_classifier(audio_sample["audio"]["array"],
                                         candidate_labels=candidate_labels_2)
# the correct label is among the candidates here
print("set 2: ", two_choice_result)

# The candidate labels do not have to match the dataset's category names exactly:
# the model still assigns a probability to every candidate label,
# and the probabilities of all labels add up to 1.
# In this case, the correct answer is included among the candidate labels.
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane",
                    "sound of a dog barking"]
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

set 1:  [{'score': 0.8161998391151428, 'label': 'Sound of a dog'}, {'score': 0.18380016088485718, 'label': 'Sound of a bird'}]
set 2:  [{'score': 0.99989914894104, 'label': 'Sound of vacuum cleaner'}, {'score': 0.00010087894042953849, 'label': 'Sound of a dog'}]


[{'score': 0.9970026612281799, 'label': 'Sound of vacuum cleaner'},
 {'score': 0.002884805668145418, 'label': 'Sound of an airplane'},
 {'score': 4.5282329665496945e-05, 'label': 'sound of a dog barking'},
 {'score': 4.442631689016707e-05, 'label': 'Sound of a child crying'},
 {'score': 2.285035043314565e-05, 'label': 'Sound of a bird singing'}]
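
In set 1 above, the vacuum cleaner recording still receives a score of about 0.82 for "Sound of a dog", because the scores are only relative to the candidates supplied. One possible way to handle this is to accept the top label only when its score clears a threshold. The sketch below assumes a threshold of 0.9, which is an arbitrary illustrative value, and the helper `top_label` is hypothetical. The two result lists are copied from the output above.

```python
def top_label(results, min_score=0.9):
    """Return the best label, or None if its score is below min_score."""
    best = max(results, key=lambda r: r["score"])
    if best["score"] < min_score:
        return None
    return best["label"]

set_1 = [{"score": 0.8161998391151428, "label": "Sound of a dog"},
         {"score": 0.18380016088485718, "label": "Sound of a bird"}]
set_2 = [{"score": 0.99989914894104, "label": "Sound of vacuum cleaner"},
         {"score": 0.00010087894042953849, "label": "Sound of a dog"}]

print(top_label(set_1))  # top score is below the threshold
print(top_label(set_2))  # top score clears the threshold
```

A threshold does not guarantee a correct answer, since a wrong label can still score highly when no candidate fits the audio. It only filters out the less decisive cases.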