<a href="https://colab.research.google.com/github/x1001000/yamnet-on-raspberrypi3/blob/main/colab_notebooks/IISNRL_labeled_by_YAMNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sound classification with YAMNet

YAMNet is a deep net that predicts 521 audio event [classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv) from the [AudioSet-YouTube corpus](http://g.co/audioset) it was trained on. It employs the
[Mobilenet_v1](https://arxiv.org/pdf/1704.04861.pdf) depthwise-separable
convolution architecture.

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import csv
#import io

import matplotlib.pyplot as plt
from IPython.display import Audio
from scipy.io import wavfile
import librosa
import scipy.signal

Let's load the model from TensorFlow Hub.

**Note**: to read the documentation, follow the model [URL](https://tfhub.dev/google/yamnet/1).

In [None]:
# Load the model.
model = hub.load('https://tfhub.dev/google/yamnet/1')

The labels file is loaded from the model's assets; its path is given by `model.class_map_path()`.
We will load it into the `class_names` variable.

In [None]:
# Read the class display names from the class map CSV bundled with the model.
def class_names_from_csv(class_map_csv_text):
  """Returns list of class names corresponding to score vector."""
  class_names = []
  # newline='' is the setting the csv module documents for files passed to a reader.
  with open(class_map_csv_text, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      class_names.append(row['display_name'])

  return class_names

class_map_path = model.class_map_path().numpy()
class_names = class_names_from_csv(class_map_path)
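
As a quick check of the parsing logic without downloading the model, the sketch below runs the same `csv.DictReader` approach on a small in-memory sample. The three rows are only an illustrative excerpt in the `index,mid,display_name` column layout of the linked class map file, and the helper name `class_names_from_text` is hypothetical.

```python
import csv
import io

# Illustrative excerpt in the class map's column layout (not the full file).
SAMPLE_CSV = (
    "index,mid,display_name\n"
    "0,/m/09x0r,Speech\n"
    '1,/m/0ytgt,"Child speech, kid speaking"\n'
    "2,/m/01h8n0,Conversation\n"
)

def class_names_from_text(csv_text):
    """Hypothetical helper: return the display_name column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row['display_name'] for row in reader]

names = class_names_from_text(SAMPLE_CSV)
print(names)
```

Some display names contain commas and are quoted in the file, which is why the `csv` module is used here rather than splitting each line on `,`.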

In [None]:
#class_map_path # https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv

#AudioSet Ontology
#import requests
#r = requests.get('https://raw.githubusercontent.com/audioset/ontology/master/ontology.json')
#len(r.json())

Let's add a function that checks whether loaded audio is at the proper sample rate (16 kHz) and resamples it if not; feeding the model audio at a different rate would change its results.

In [None]:
def ensure_sample_rate(original_sample_rate, waveform,
                       desired_sample_rate=16000):
  """Resample waveform if required."""
  if original_sample_rate != desired_sample_rate:
    desired_length = int(round(float(len(waveform)) /
                               original_sample_rate * desired_sample_rate))
    waveform = scipy.signal.resample(waveform, desired_length)
  return desired_sample_rate, waveform
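
The length calculation inside `ensure_sample_rate` can be checked on its own. The sketch below isolates that arithmetic in a hypothetical helper, `resampled_length`, so it runs without SciPy; it does not resample any audio.

```python
def resampled_length(num_samples, original_sample_rate, desired_sample_rate=16000):
    """Hypothetical helper: number of samples after resampling.

    Mirrors the arithmetic in ensure_sample_rate: the duration in seconds
    (num_samples / original_sample_rate) times the new rate, rounded.
    """
    return int(round(float(num_samples) /
                     original_sample_rate * desired_sample_rate))

# One second of audio at 22050 Hz becomes 16000 samples at 16 kHz.
print(resampled_length(22050, 22050))   # 16000
# Three seconds of audio at 44100 Hz becomes 48000 samples at 16 kHz.
print(resampled_length(132300, 44100))  # 48000
```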

## Transform wav file into waveform

Here we define a helper that loads a WAV file and converts it into the waveform the model expects.
If you already have a file available, upload it to Colab and use it instead.

**Note**: The model expects a mono WAV file at a 16 kHz sample rate.

In [None]:
# wav_file_name = 'speech_whistling2.wav'
# wav_file_name = 'test.wav'
def waveform(wav_file_name):
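    #sample_rate, wav_data = wavfile.read(wav_file_name, 'rb')
    # With its default arguments, librosa.load returns mono float32 audio
    # resampled to 22050 Hz, so the checks below mostly matter when the
    # wavfile.read line above is used instead.
    wav_data, sample_rate = librosa.load(wav_file_name)
    # stereo to mono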
    if wav_data.ndim > 1:
        wav_data = wav_data[:,0]
    # normalization before resampling
    if wav_data.dtype == np.uint8:
        wav_data = wav_data / tf.uint8.max
    elif wav_data.dtype == np.int16:
        wav_data = wav_data / tf.int16.max
    elif wav_data.dtype == np.int32:
        wav_data = wav_data / tf.int32.max
    elif wav_data.dtype == np.float32:
        pass
    else:
        print('wav_data.dtype UNKNOWN')
    sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)

    # Show some basic information about the audio.
    #duration = len(wav_data)/sample_rate
    #print(f'Sample rate: {sample_rate} Hz')
    #print(f'Total duration: {duration:.2f}s')
    #print(f'Size of the input: {len(wav_data)}')

    # Let's listen to the wav file.
    #Audio(wav_data, rate=sample_rate)

    return wav_data

The `wav_data` needs to be normalized to values in `[-1.0, 1.0]` (as stated in the model's [documentation](https://tfhub.dev/google/yamnet/1))

In [None]:
#waveform = wav_data / tf.int16.max
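
To make the normalization step concrete, here is a dependency-free sketch of scaling 16-bit PCM samples by the `int16` maximum, as the code above does with `tf.int16.max`. The helper name `normalize_int16` and the sample values are made up for illustration.

```python
INT16_MAX = 32767  # same value as tf.int16.max

def normalize_int16(samples):
    """Hypothetical helper: scale 16-bit PCM integers to floats near [-1.0, 1.0]."""
    return [s / INT16_MAX for s in samples]

# Made-up samples covering the extremes of the int16 range.
pcm = [0, 16384, 32767, -32767, -32768]
normalized = normalize_int16(pcm)
print(normalized)
print(min(normalized), max(normalized))
```

Because the `int16` range is asymmetric, the most negative sample (-32768) maps to a value very slightly below -1.0 when dividing by 32767. Dividing by 32768 instead would keep every value inside the interval, at the cost of never reaching exactly 1.0.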

## Executing the Model

Now the easy part: using the data we've already prepared, we just call the model and get the scores, the embeddings, and the spectrogram.

The scores are the main result we will use.
The spectrogram can be used for visualizations later.

In [None]:
def inference(waveform):
    # Run the model, check the output.
    scores, embeddings, spectrogram = model(waveform)
    scores_np = scores.numpy()
    spectrogram_np = spectrogram.numpy()
    #infered_class = class_names[scores_np.mean(axis=0).argmax()]
    #print(f'The main sound is: {infered_class}')
    
    #top1 = class_names[scores_np.mean(axis=0).argsort()[-1]]
    #top2 = class_names[scores_np.mean(axis=0).argsort()[-2]]
    #top3 = class_names[scores_np.mean(axis=0).argsort()[-3]]
    #return top1, top2, top3

    return scores_np.mean(axis=0)
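
The aggregation in `inference` averages the per-frame scores into one score per class. As a minimal sketch of that step, plus the top-k lookup hinted at in the commented-out lines, the example below uses plain Python lists. The function `top_k_classes`, the four class names, and the frame scores are toy values for illustration, not model output.

```python
def top_k_classes(frame_scores, class_names, k=3):
    """Hypothetical helper: average scores over frames, return the k best classes.

    frame_scores is a list of frames, each a list with one score per class.
    """
    num_frames = len(frame_scores)
    # zip(*frame_scores) groups the scores by class, i.e. walks the columns.
    mean_scores = [sum(column) / num_frames for column in zip(*frame_scores)]
    ranked = sorted(zip(class_names, mean_scores),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy data: 3 frames, 4 classes.
toy_class_names = ['Speech', 'Animal', 'Silence', 'Music']
toy_frame_scores = [
    [0.10, 0.70, 0.20, 0.00],
    [0.20, 0.50, 0.30, 0.00],
    [0.00, 0.60, 0.10, 0.03],
]

top = top_k_classes(toy_frame_scores, toy_class_names, k=2)
for name, score in top:
    print(f'{name}: {score:.2f}')
```

With NumPy arrays, the same ranking is `scores_np.mean(axis=0).argsort()[::-1][:k]`, which yields class indices to look up in `class_names`.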

# IISNRL data fed into YAMNet model

In [None]:
#!curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav
#!curl -O https://storage.googleapis.com/audioset/miaow_16k.wav



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
from os.path import isdir
import pandas as pd
#df = pd.DataFrame(columns=['Filename','IISNRL','YAMNET #1','YAMNET #2','YAMNET #3'])
# Collect one record per file, then build the DataFrame once. This avoids
# chained assignment such as df.iloc[-1]['Filename'] = ..., which pandas
# does not guarantee will write back to the original frame.
rows = []
for entry in os.scandir('/content/drive/My Drive/IISNRL1700'):
    if entry.is_dir():
        for wav_file in os.scandir(entry):
            if not wav_file.name.startswith('.'):
                print(len(rows) + 1, wav_file.name)
                # Mean score per class, keyed by class display name.
                scores = pd.Series(dict(zip(class_names, inference(waveform(wav_file.path)))))
                rows.append({'Filename': wav_file.name, 'Scores': scores})

df = pd.DataFrame(rows, columns=['Filename', 'Scores'])
# Keep the 1-based row labels that the original loop produced.
df.index = range(1, len(df) + 1)

1 human_001_05.wav
2 human_001_03.wav
3 human_001_02.wav
4 human_001_04.wav
5 human_001_01.wav
6 human_002_03-2.wav
7 human_002_05.wav
8 human_002_15.wav
9 human_002_10.wav
10 human_002_03.wav
11 human_002_08.wav
12 human_002_12.wav
13 human_002_11.wav
14 human_002_18.wav
15 human_002_02.wav
16 human_002_04.wav
17 human_002_21.wav
18 human_002_27.wav
19 human_002_16.wav
20 human_002_07.wav
21 human_002_19.wav
22 human_002_01.wav
23 human_002_25.wav
24 human_002_20.wav
25 human_002_22.wav
26 human_002_28.wav
27 human_002_13.wav
28 human_002_17.wav
29 human_002_26.wav
30 human_002_14.wav
31 human_002_09.wav
32 human_002_23.wav
33 192124-2-0-4.wav
34 169043-2-0-21.wav
35 72015-2-0-3.wav
36 80806-2-0-3.wav
37 99500-2-0-23.wav
38 133090-2-0-38.wav
39 129356-2-0-199.wav
40 175904-2-0-76.wav
41 162318-2-0-34.wav
42 49312-2-0-1.wav
43 174786-2-0-39.wav
44 58857-2-0-13.wav
45 17009-2-0-3.wav
46 138031-2-0-13.wav
47 84143-2-0-7.wav
48 193698-2-0-112.wav
49 207213-2-0-81.wav
50 49312-2-0-2.wav
51

In [None]:
#df.to_csv('/content/drive/My Drive/IISNRL_labeled_by_YAMNet.csv')
with open('/content/drive/My Drive/IISNRL1700_labeled_by_YAMNet.json', 'w') as f:
    f.write(df.to_json(orient='index'))

In [None]:
import json
with open('/content/drive/My Drive/IISNRL1700_labeled_by_YAMNet.json') as f:
    # json.load parses the whole file, so it does not rely on the JSON being on one line.
    d = json.load(f)
    for k,v in sorted(d['1700']['Scores'].items(), key=lambda item: item[1], reverse=True):
        print(k,v)

Animal 0.3596857786
Fowl 0.2890303731
Livestock, farm animals, working animals 0.2394678146
Turkey 0.2385484725
Silence 0.1971412003
Gobble 0.151840955
Bird 0.0987212509
Wild animals 0.0915187076
Goose 0.0767482966
Outside, rural or natural 0.0631526709
Duck 0.0631161481
Honk 0.05554289
Crow 0.0227993503
Bird vocalization, bird call, bird song 0.0212207176
Speech 0.0180486664
Quack 0.0174022391
Domestic animals, pets 0.0138384625
Chicken, rooster 0.0136832483
Outside, urban or manmade 0.0122036338
Vehicle 0.0099217044
Caw 0.0099178199
Inside, small room 0.0068507874
Chirp, tweet 0.005718682
Whimper 0.0056208847
Coo 0.0048755948
Cat 0.0048269676
Sound effect 0.0047461521
Pigeon, dove 0.0046159169
Crying, sobbing 0.0045233513
Television 0.0044442369
Cluck 0.0044127032
Dog 0.004343844
Wind 0.0040741321
Owl 0.0039514815
Alarm 0.0036405835
Explosion 0.0032704263
Crowing, cock-a-doodle-doo 0.0032250099
Radio 0.0030828656
Motor vehicle (road) 0.0027632648
Telephone 0.0026100548
Burst, pop 0.0
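
The ranking above inspects a single record. As a sketch of how the same file layout (row label mapping to `Filename` and `Scores`) could be summarized for every record, the example below picks the highest-scoring label per file. The helper `top_label_per_file` and the two-record JSON string are made up for illustration and are not taken from the IISNRL results.

```python
import json

# Made-up miniature of the saved layout: row label -> {'Filename', 'Scores'}.
SAMPLE_JSON = (
    '{"1": {"Filename": "a.wav", "Scores": {"Speech": 0.8, "Animal": 0.1}},'
    ' "2": {"Filename": "b.wav", "Scores": {"Speech": 0.05, "Animal": 0.6}}}'
)

def top_label_per_file(records):
    """Hypothetical helper: map each filename to its highest-scoring label."""
    result = {}
    for record in records.values():
        scores = record['Scores']
        # max over the dict keys, compared by their score values.
        result[record['Filename']] = max(scores, key=scores.get)
    return result

summary = top_label_per_file(json.loads(SAMPLE_JSON))
print(summary)
```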