### At the end of this Notebook you can generate and download the whole 128x128 image (mel-spectrograms) training dataset for short audio sounds, as well as the whole cropped 128x128 image training soundscape dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
import IPython.display as ipd
import warnings
warnings.filterwarnings(action='ignore')
import pathlib
from pathlib import Path
from PIL import Image
from tqdm import tqdm

# Audio library for Python
import librosa
import librosa.display

# Sources

https://www.kaggle.com/stefankahl/birdclef2021-processing-audio-data

# Data

* **train_short_audio** - The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xenocanto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format. The training data should have nearly all relevant files; we expect there is no benefit to looking for more on xenocanto.org.
* **train_soundscapes** - Audio files that are quite comparable to the test set. They are all roughly ten minutes long and in the ogg format. The test set also has soundscapes from the two recording locations represented here.
* **test_soundscapes** - When you submit a notebook, the test_soundscapes directory will be populated with approximately 80 recordings to be used for scoring. These will be roughly 10 minutes long and in ogg audio format. The file names include the date the recording was taken, which can be especially useful for identifying migratory birds.

This folder also contains text files with the name and approximate coordinates of the recording location plus a csv with the set of dates the test set soundscapes were recorded.

* **test.csv** - Only the first three rows are available for download; the full test.csv is in the hidden test set.
    * row_id: ID code for the row.
    * site: Site ID.
    * seconds: the second ending the time window
    * audio_id: ID code for the audio file.
* **train_metadata.csv** - A wide range of metadata is provided for the training data. The most directly relevant fields are:
    * primary_label: a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
    * recordist: the user who provided the recording.
    * latitude & longitude: coordinates for where the recording was taken. Some bird species may have local call 'dialects,' so you may want to seek geographic diversity in your training data.
    * date: while some bird calls can be made year round, such as an alarm call, some are restricted to a specific season. You may want to seek temporal diversity in your training data.

    * filename: the name of the associated audio file.
* **train_soundscape_labels.csv** -
    * row_id: ID code for the row.
    * site: Site ID.
    * seconds: the second ending the time window
    * audio_id: ID code for the audio file.
    * birds: space delimited list of any bird songs present in the 5 second window. The label nocall means that no call occurred.
* **sample_submission.csv** - A properly formed sample submission file. Only the first three rows are public, the remainder will be provided to your notebook as part of the hidden test set.
    * row_id
    * birds: space delimited list of any bird songs present in the 5 second window. If there are no bird calls, use the label nocall.
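
As a small illustration of the two label-file conventions above, here is a sketch of how a `row_id` and a `birds` value could be parsed. It assumes that a `row_id` has the form `<audio_id>_<site>_<seconds>` (for example `11254_COR_5`, a made-up value built from a file name used later in this notebook); `parse_row_id` and `parse_birds` are hypothetical helper names, not part of the competition code.

```python
def parse_row_id(row_id):
    """Split a row_id, ASSUMED to look like '<audio_id>_<site>_<seconds>'."""
    audio_id, site, seconds = row_id.split("_")
    return int(audio_id), site, int(seconds)


def parse_birds(birds):
    """Split the space-delimited label string; 'nocall' becomes an empty list."""
    labels = birds.split()
    return [] if labels == ["nocall"] else labels


print(parse_row_id("11254_COR_5"))    # hypothetical row_id
print(parse_birds("nocall"))
print(parse_birds("amecro rubwre1"))  # two species in the same 5 second window
```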
    


In [None]:
path = Path('/kaggle/working')
train_short_audio_path = Path('../input/birdclef-2021/train_short_audio')
train_soundscapes = Path('../input/birdclef-2021/train_soundscapes')
train_path = path/'train'
train_soundscapes_path = path/'train_soundscapes'
train_path.mkdir(exist_ok=True)
train_soundscapes_path.mkdir(exist_ok=True)

training_df = pd.read_csv("../input/birdclef-2021/train_metadata.csv")
training_soundscapes_df = pd.read_csv("../input/birdclef-2021/train_soundscape_labels.csv")

In [None]:
training_df.head()

# Explore data

## Listen to files

### short train file

The extracts are short, and you can often hear the bird singing several times in a row. When you listen to the files in the same folder, you realize that a single species can produce many different calls.

In [None]:
# Pick a file
audio_path = '../input/birdclef-2021/train_short_audio/rubwre1/XC236057.ogg'

# Listen to it
ipd.Audio(audio_path)

### Long train file

The extracts are long, noisy, and sometimes there is no call during the whole extract.

In [None]:
# Pick a file
audio_path_long = '../input/birdclef-2021/train_soundscapes/11254_COR_20190904.ogg'

# Listen to it
# ipd.Audio(audio_path_long)

## Transform audio into a numpy array

We need to feed sound waves into a computer. But sound is transmitted as waves. How do we turn sound waves into numbers?


Sound waves are one-dimensional. At every moment in time, they have a single value based on the height of the wave. To turn this sound wave into numbers, we just record the height of the wave at equally spaced points.

This is called **sampling**. We are taking a reading thousands of times a second and recording a number representing the height of the sound wave at that point in time. That’s basically all an uncompressed .wav audio file is.

Here, audio files are sampled at 32 kHz (32000 readings per second).
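
To make sampling concrete, here is a minimal sketch that "records" a pure tone at the same 32 kHz rate using only NumPy. The 440 Hz tone and the 10 ms duration are arbitrary values chosen for illustration, not something taken from the dataset.

```python
import numpy as np

SAMPLE_RATE = 32000   # readings per second, as in this dataset
DURATION = 0.01       # seconds (kept tiny for illustration)
FREQ = 440.0          # Hz, arbitrary illustrative tone

n_samples = round(SAMPLE_RATE * DURATION)
# Time stamp of every reading: 0, 1/32000, 2/32000, ...
t = np.arange(n_samples) / SAMPLE_RATE
# Height of the wave at each of those equally spaced points
samples = np.sin(2 * np.pi * FREQ * t)

print(samples.shape)
print(samples[:5])
```

The array length is simply duration times sample rate, which is the same rule that gives 160000 values for a 5 second clip.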

In [None]:
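# Load only the first 5 seconds so that the array length matches the explanation below
sig, rate = librosa.load(audio_path, sr=32000, offset=None, duration=5)

# The result is a 1D numpy array that contains audio samples (shape = seconds * sample rate = 5 * 32000 == 160000,
# or fewer samples if the recording is shorter than 5 seconds)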
print('SIGNAL SHAPE:', sig.shape)

As we can see, we get a 1D array of 5x32000=160000 numbers, because every 1/32000 of a second we record one number that represents the height of the sound wave at that point in time. We can visualize this:

In [None]:
plt.figure(figsize=(15, 5))
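# Note: waveplot was removed in librosa 0.10; on newer versions use librosa.display.waveshow instead
librosa.display.waveplot(sig, sr=32000)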

## Transform audio 1D array into 2D array (spectrogram) for image recognition

We now have an array of numbers, each representing the sound wave's amplitude at intervals of 1/32000th of a second.

We could feed these numbers right into a neural network. But trying to recognize bird calls by processing these samples directly is difficult. Convolutional neural networks (CNNs) have performed particularly well for sound classification, but CNNs need 2D inputs. Luckily, we can transform an audio signal into a 2D representation.

To make this data easier for a neural network to process, we are going to break apart this complex sound wave into its component parts. We'll break out the low-pitched parts, the next-lowest-pitched parts, and so on. Then by adding up how much energy is in each of those frequency bands (from low to high), we create a fingerprint of sorts for this audio snippet.

We do this using a mathematical operation called a Fourier transform. It breaks apart the complex sound wave into the simple sound waves that make it up. Once we have those individual sound waves, we add up how much energy is contained in each one.

The end result is a score of how important each frequency range is, from low pitch (i.e. bass notes) to high pitch.

![](https://cdn-images-1.medium.com/max/1200/1*A4CxgdyqYd_nrF3e-7ETWA.png)

If we repeat this process on every short chunk of audio (20 milliseconds, for example), we end up with a spectrogram.
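
Here is a minimal sketch of that idea on a single chunk, using NumPy's FFT rather than librosa. The frame is synthetic (a strong 2 kHz tone plus a weaker 6 kHz tone), and the frame length of 2048 samples is an assumption chosen for illustration.

```python
import numpy as np

SAMPLE_RATE = 32000
N = 2048  # frame length in samples (64 ms at 32 kHz)

t = np.arange(N) / SAMPLE_RATE
# Synthetic frame: a strong 2 kHz tone plus a weaker 6 kHz tone
frame = 1.0 * np.sin(2 * np.pi * 2000 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)

# Fourier transform of the (windowed) frame, then the energy per frequency bin
spectrum = np.fft.rfft(frame * np.hanning(N))
energy = np.abs(spectrum) ** 2
freqs = np.fft.rfftfreq(N, d=1 / SAMPLE_RATE)  # centre frequency of each bin in Hz

peak_hz = freqs[np.argmax(energy)]
print(energy.shape)  # one energy value per frequency bin
print(peak_hz)       # frequency of the strongest component
```

Stacking one such `energy` column per frame, side by side along the time axis, is what produces the spectrogram.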

In [None]:
# First, compute the spectrogram using the "short-time Fourier transform" (stft)
# stft returns complex values, so take the magnitude before converting to decibels
spec = np.abs(librosa.stft(sig))

# Scale the amplitudes according to the decibel scale
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

# Plot the spectrogram
plt.figure(figsize=(15, 5))
librosa.display.specshow(spec_db, 
                         sr=32000, 
                         x_axis='time', 
                         y_axis='hz', 
                         cmap=plt.get_cmap('viridis'))

In [None]:
print('SPEC SHAPE:', spec_db.shape)

Very nice! However, that's still a very large input for a CNN. Let's change the *"window length"* and the *"hop length"*.
A good final size could be 128x128 or 256x256 (if the second number is higher than 128 or 256, we can then split the sample into multiple smaller samples).
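
The link between hop length and spectrogram width can be checked with a little arithmetic. This sketch assumes centred frames (librosa's default behaviour), for which the number of frames is `1 + n_samples // hop_length`.

```python
SAMPLE_RATE = 32000
DURATION = 5       # seconds per training image
SPEC_WIDTH = 128   # desired number of time frames

n_samples = SAMPLE_RATE * DURATION
# Choose the hop so that the frames just cover the 5 second window
hop_length = int(n_samples / (SPEC_WIDTH - 1))
# With centred frames, one frame starts every hop_length samples
n_frames = 1 + n_samples // hop_length

print(hop_length, n_frames)
```

A clip shorter than 5 seconds has fewer samples and therefore yields fewer than 128 frames.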

We could use the so-called mel scale (https://en.wikipedia.org/wiki/Mel_scale) to scale the frequency axis of our spectrogram. Even though it was initially designed for human speech, this approach has worked well for bird sound recognition in the past. Luckily, Librosa supports this transformation. We can set the number of mel bins we want to use, and that number becomes the vertical resolution of the spectrogram. The hop length we choose determines the width of the spectrogram, so we have to settle on a certain value. On top of that, we should probably process 5-second chunks of audio (since that's the submission segment duration).
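
To give a feel for what the mel scale does, here is a sketch of one common conversion formula (the HTK-style variant) written in plain NumPy. Librosa uses a slightly different (Slaney-style) variant by default, so its numbers will differ somewhat; `hz_to_mel_htk` and `mel_to_hz_htk` are hypothetical helper names.

```python
import numpy as np


def hz_to_mel_htk(freq_hz):
    """HTK-style mel formula (one of several variants in use)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


# 128 frequencies equally spaced on the mel axis between 20 Hz and 16 kHz
mel_points = np.linspace(hz_to_mel_htk(20), hz_to_mel_htk(16000), 128)
freqs_hz = mel_to_hz_htk(mel_points)

print(freqs_hz[:3])   # closely spaced at the low end
print(freqs_hz[-3:])  # widely spaced at the high end
```

Equal steps in mel correspond to small steps in Hz at low frequencies and large steps at high frequencies, which is why a mel spectrogram devotes more rows to the lower part of the spectrum.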

We should probably also consider the vocal and auditory range of birds. Most songbirds vocalize between 1 and 4 kHz, yet some species vocalize below that, and some significantly above. In general, we can probably limit the frequency range of the spectrogram to between 500 Hz and 12.5 kHz; not many birds vocalize outside this range. Note that the code below keeps a wider band, from 20 Hz up to 16 kHz (the Nyquist frequency at a 32 kHz sample rate).



In [None]:
# Desired shape of the input spectrogram for a 5s time window
SPEC_HEIGHT = 128
SPEC_WIDTH = 128

# Derive num_mels and hop_length from desired spec shape
# num_mels is easy, that's just spec_height
# hop_length is a bit more complicated
NUM_MELS = SPEC_HEIGHT
HOP_LENGTH = int(32000 * 5 / (SPEC_WIDTH - 1)) # sample rate * duration / (spec width - 1) == 1259

# High- and low-pass frequencies
# For many birds, these are a good choice
FMIN = 20
FMAX = 16000

# Compute the spectrogram and apply the mel scale
mel_spec = librosa.feature.melspectrogram(y=sig, 
                                      sr=32000, 
                                      n_fft=2048, 
                                      hop_length=HOP_LENGTH, 
                                      n_mels=NUM_MELS, 
                                      fmin=FMIN, 
                                      fmax=FMAX)

mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

# Show the spec
plt.figure()
plt.title('Shape: ' + str(mel_spec_db.shape))
plt.imshow(mel_spec_db)
# librosa.display.specshow(mel_spec_db, 
#                              sr=32000, 
#                              hop_length=HOP_LENGTH, 
#                              x_axis='time', 
#                              y_axis='mel',
#                              fmin=FMIN, 
#                              fmax=FMAX, 
#                              cmap=plt.get_cmap('viridis'))

Last but not least, we can convert this 2D array into a 3D array.
The mono_to_color function takes the spectrogram of our sound as input.
* It stacks it three times, so that it has the same shape as a classic RGB image.
* Then it standardizes the array (shifts and scales it so that its mean is 0 and its variance is 1). This improves performance.
* Then it rescales each value to the range 0 to 255 (gray scale).
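
The three steps can be followed on a tiny array. This is a simplified sketch of the same idea with made-up decibel values, not a replacement for the function defined below.

```python
import numpy as np

# Toy 2x2 "spectrogram" in decibels (made-up values)
x = np.array([[-80.0, -40.0],
              [-20.0, 0.0]])

# 1. Stack three copies along a new last axis: (2, 2) -> (2, 2, 3)
rgb = np.stack([x, x, x], axis=-1)

# 2. Standardize: zero mean, unit variance
standardized = (rgb - rgb.mean()) / (rgb.std() + 1e-6)

# 3. Rescale so the smallest value maps to 0 and the largest to 255
lo, hi = standardized.min(), standardized.max()
scaled = (255 * (standardized - lo) / (hi - lo)).astype(np.uint8)

print(scaled.shape)
print(scaled[..., 0])  # the three channels are identical
```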

In [None]:
def mono_to_color(X: np.ndarray,
                  mean=None,
                  std=None,
                  norm_max=None,
                  norm_min=None,
                  eps=1e-6):
    """
    Code from https://www.kaggle.com/daisukelab/creating-fat2019-preprocessed-data
    """
    # Stack X as [X,X,X]
    X = np.stack([X, X, X], axis=-1)

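    # Standardize (explicit None checks so that a supplied value of 0 is respected)
    mean = X.mean() if mean is None else mean
    X = X - mean
    std = X.std() if std is None else std
    Xstd = X / (std + eps)
    _min, _max = Xstd.min(), Xstd.max()
    norm_max = _max if norm_max is None else norm_max
    norm_min = _min if norm_min is None else norm_min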
    if (_max - _min) > eps:
        # Normalize to [0, 255]
        V = Xstd
        V[V < norm_min] = norm_min
        V[V > norm_max] = norm_max
        V = 255 * (V - norm_min) / (norm_max - norm_min)
        V = V.astype(np.uint8)
    else:
        # Just zero
        V = np.zeros_like(Xstd, dtype=np.uint8)
    return V

In [None]:
image = mono_to_color(mel_spec_db)
plt.title('Shape: ' + str(image.shape))
plt.imshow(image)

# Prepare Data

Let's use everything we've done above to define:
* A function which converts an ogg file into a 3D array image.
* A function which splits a soundscape into 5-second segments and then converts each one into a 3D array image.
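
The splitting step on its own can be sketched with a NumPy reshape. The signal here is a stand-in array of silence rather than a real soundscape, and any incomplete segment at the end is dropped.

```python
import numpy as np

SAMPLE_RATE = 32000
DURATION = 5
chunk_len = SAMPLE_RATE * DURATION  # samples per 5 second segment

# Stand-in for a loaded recording: 12.5 seconds of silence
sig = np.zeros(int(12.5 * SAMPLE_RATE), dtype=np.float32)

n_chunks = len(sig) // chunk_len  # number of complete segments
# Drop the incomplete tail, then view the rest as (segments, samples)
chunks = sig[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)

print(chunks.shape)
```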

In [None]:
SPEC_HEIGHT = 128
SPEC_WIDTH = 128
NUM_MELS = SPEC_HEIGHT
HOP_LENGTH = int(32000 * 5 / (SPEC_WIDTH - 1)) # sample rate * duration / (spec width - 1) == 1259
FMIN = 20
FMAX = 16000
SAMPLE_RATE = 32000
N_FFT = 2048
DURATION = 5

def ogg_to_image(audio_path):
    # Load the ogg file
    sig, rate = librosa.load(audio_path, sr=SAMPLE_RATE, offset=None, duration=DURATION)
    # Get start and stop sample (clips shorter than DURATION give a narrower image)
    s_start = 0
    s_end = DURATION * SAMPLE_RATE
    # Compute the spectrogram and apply the mel scale
    mel_spec = librosa.feature.melspectrogram(y=sig[s_start:s_end], sr=SAMPLE_RATE, n_fft=N_FFT,
                                              hop_length=HOP_LENGTH, n_mels=NUM_MELS, fmin=FMIN, fmax=FMAX)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    # Convert the spectrogram into a 128x128x3 image
    image = mono_to_color(mel_spec_db)
    return image

def soundscape_to_images(audio_path):
    images_soundscape = []
    # Load the ogg file
    sig, rate = librosa.load(audio_path, sr=SAMPLE_RATE, offset=None)
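    # Walk through the recording in DURATION-second windows
    # (605 is hard-coded: it assumes soundscapes of about 10 minutes, i.e. 120 windows)
    for second in tqdm(range(DURATION, 605, DURATION)):
        # Get start and stop sample
        s_start = (second - DURATION) * SAMPLE_RATE
        s_end = second * SAMPLE_RATE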
        # Compute the spectrogram and apply the mel scale
        mel_spec = librosa.feature.melspectrogram(y=sig[s_start:s_end], sr=SAMPLE_RATE, n_fft=N_FFT,
                                                  hop_length=HOP_LENGTH, n_mels=NUM_MELS, fmin=FMIN, fmax=FMAX)
        mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
        # Convert the spectrogram into a 128x128x3 image
        image = mono_to_color(mel_spec_db)
        images_soundscape.append(image)
    return images_soundscape

Now let's check that everything works.

In [None]:
audio_path = '../input/birdclef-2021/train_short_audio/rubwre1/XC236057.ogg'
image = ogg_to_image(audio_path)
plt.imshow(image)

In [None]:
audio_path_soundscape = '../input/birdclef-2021/train_soundscapes/10534_SSW_20170429.ogg'
images_soundscape = soundscape_to_images(audio_path_soundscape)
print('Shape: ' + str(np.array(images_soundscape).shape))
for i in range(0,5):
    plt.imshow(images_soundscape[i])
    plt.show()

# Get all processed data

For now, I only take the first 5 seconds of each short audio sample.

In [None]:
# Create an image for every short audio file in the training set
for row in tqdm(training_df.itertuples(index=False), total=len(training_df)):
    # One output folder per species code
    out_dir = train_path/str(row.primary_label)
    out_dir.mkdir(exist_ok=True)
    audio_path = train_short_audio_path/str(row.primary_label)/str(row.filename)
    img = ogg_to_image(audio_path)
    filename = Path(row.filename).stem  # drop the .ogg extension
    Image.fromarray(img, mode='RGB').save(out_dir/f"{filename}.png")

In [None]:
# This is fast because there are not many audio files; the slow part is loading each audio file
# The step of 120 assumes 120 label rows (one per 5-second window) for every soundscape
for i in tqdm(range(0, len(training_soundscapes_df), 120)):
    # Find the audio file whose name contains the row_id minus its last two characters
    prefix = training_soundscapes_df.iloc[i]['row_id'][:-2]
    audio_filename = [f for f in os.listdir(train_soundscapes)
                      if os.path.isfile(os.path.join(train_soundscapes, f)) and prefix in f][0]
    audio_path_soundscape = train_soundscapes/audio_filename
    imgs = soundscape_to_images(audio_path_soundscape)
    for j in range(len(imgs)):
        # One output folder per label string (the 'birds' column)
        out_dir = train_soundscapes_path/training_soundscapes_df.iloc[i+j]['birds']
        out_dir.mkdir(exist_ok=True)
        filename = str(training_soundscapes_df.iloc[i+j]['row_id'])
        Image.fromarray(imgs[j], mode='RGB').save(out_dir/f"{filename}.png")

# Profit: download the processed dataset!

The archives (.tar.gz files) will appear in the output panel at the top right of your screen; just download them.
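
The cell below issues one `tar` command per letter of the alphabet. As an alternative, the same grouping could be written as a loop with Python's standard `tarfile` module. This is only a sketch: `archive_by_first_letter` is a hypothetical helper, and unlike the shell commands it stores each species folder under its own name rather than under the full `/kaggle/working/train/...` path.

```python
import string
import tarfile
from pathlib import Path


def archive_by_first_letter(src_dir, out_dir):
    """Write one train_<letter>.tar.gz per initial letter of the sub-folders in src_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    created = []
    for letter in string.ascii_lowercase:
        folders = sorted(p for p in src_dir.glob(f"{letter}*") if p.is_dir())
        if not folders:
            continue  # no folder starts with this letter
        archive = out_dir / f"train_{letter}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for folder in folders:
                # Store paths relative to src_dir, e.g. "amecro/XC123.png"
                tar.add(folder, arcname=folder.name)
        created.append(archive)
    return created


# Example call with the paths used in this notebook:
# archive_by_first_letter('/kaggle/working/train', '/kaggle/working')
```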

In [None]:
# Sorry for this: one archive per first letter of the species code.
!tar -zcvf train_a.tar.gz /kaggle/working/train/a*
!tar -zcvf train_b.tar.gz /kaggle/working/train/b*
!tar -zcvf train_c.tar.gz /kaggle/working/train/c*
!tar -zcvf train_d.tar.gz /kaggle/working/train/d*
!tar -zcvf train_e.tar.gz /kaggle/working/train/e*
!tar -zcvf train_f.tar.gz /kaggle/working/train/f*
!tar -zcvf train_g.tar.gz /kaggle/working/train/g*
!tar -zcvf train_h.tar.gz /kaggle/working/train/h*
!tar -zcvf train_i.tar.gz /kaggle/working/train/i*
!tar -zcvf train_j.tar.gz /kaggle/working/train/j*
!tar -zcvf train_k.tar.gz /kaggle/working/train/k*
!tar -zcvf train_l.tar.gz /kaggle/working/train/l*
!tar -zcvf train_m.tar.gz /kaggle/working/train/m*
!tar -zcvf train_n.tar.gz /kaggle/working/train/n*
!tar -zcvf train_o.tar.gz /kaggle/working/train/o*
!tar -zcvf train_p.tar.gz /kaggle/working/train/p*
!tar -zcvf train_q.tar.gz /kaggle/working/train/q*
!tar -zcvf train_r.tar.gz /kaggle/working/train/r*
!tar -zcvf train_s.tar.gz /kaggle/working/train/s*
!tar -zcvf train_t.tar.gz /kaggle/working/train/t*
!tar -zcvf train_u.tar.gz /kaggle/working/train/u*
!tar -zcvf train_v.tar.gz /kaggle/working/train/v*
!tar -zcvf train_w.tar.gz /kaggle/working/train/w*
!tar -zcvf train_x.tar.gz /kaggle/working/train/x*
!tar -zcvf train_y.tar.gz /kaggle/working/train/y*
!tar -zcvf train_z.tar.gz /kaggle/working/train/z*

# !tar -zcvf train_soundscapes.tar.gz /kaggle/working/train_soundscapes