One of the most widely used approaches to audio classification is to convert the 1D audio signal into a 2D image representation and apply well-established computer vision techniques. Following this widely used and successful approach, this kernel builds an "interesting" image dataset from an audio dataset.


**Idea:** Transform the audio data into a one-channel image such as a spectrogram. Take three such transforms and stack them together to form a 3-channel image.

The three transforms that this notebook uses are:
- STFT based spectrogram
- Log of the spectrogram
- Log-mel spectrogram (called the MFCC based spectrogram in this notebook)

_Sneak Peek of the dataset_

![img](https://i.imgur.com/z8P7ByC.png)

Overview: 

* Generate multiple image datasets using different values of the hyperparameters - `n_fft` and `hop_length`. 
* Since we are creating multiple datasets to experiment with, dataset version control is useful. We will save each dataset as a W&B artifact. 

_Sneak Peek of our dataset version control_

![img](https://i.imgur.com/1cqGL0B.png)

In a different kernel we shall consume these artifacts to train neural network based models. 

# Imports and Setups

In [None]:
import os
import csv
import numpy as np
import pandas as pd
from pathlib import Path
from PIL import Image

import librosa as lb 
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd

from skimage.transform import resize
from scipy import stats

import wandb
wandb.login()

# Simple EDA

Let's briefly investigate `csv` files. We have two annotation files - `train_tp.csv` and `train_fp.csv`.

For more information on `train_tp` vs `train_fp`, check out this [discussion thread](https://www.kaggle.com/c/rfcx-species-audio-detection/discussion/197866). We are discarding `train_fp.csv` for now.

For more on `submission.csv` check out this [discussion thread](https://www.kaggle.com/c/rfcx-species-audio-detection/discussion/200757).

In [None]:
data_dir = '/kaggle/input/rfcx-species-audio-detection/'
train_tp = pd.read_csv(os.path.join(data_dir, 'train_tp.csv'))
train_fp = pd.read_csv(os.path.join(data_dir, 'train_fp.csv'))

### A peek into the file.

In [None]:
train_tp.head(2)

In [None]:
train_fp.head(2)

### General description

In [None]:
train_tp.describe()

In [None]:
train_fp.describe()

### Number of Classes

In [None]:
# number of classes.
print('Classes: ', sorted(train_tp.species_id.unique()))

In [None]:
# unique `songtype_id`
sorted(train_tp.songtype_id.unique())

Pointers:

* There are 24 species of birds and frogs in total, hence 24 classes.
* `songtype_id` identifies the song type: the same species can have different song types, annotated with different frequency ranges. 
* There are 1216 rows in `train_tp.csv` and 7781 rows in `train_fp.csv`.

Let us see the number of training audio files.

### Number of train and test audio records

In [None]:
train_folder= Path(data_dir+'train')
test_folder = Path(data_dir+'test')

train_file_path = list(map(str, list(train_folder.glob('*.flac'))))
test_file_path = list(map(str, list(test_folder.glob('*.flac'))))

print('Number of audio files to train: {} and test: {}'.format(len(train_file_path), len(test_file_path)))

The number of audio files does not match the row counts of the annotation files, so a closer look at the `csv` files is required. 

In [None]:
print('Number of unique true positive annotated files: ', len(train_tp.recording_id.unique()))
print('Number of unique false positive annotated files: ', len(train_fp.recording_id.unique()))

print('TP + FP: ', 1132+3958)

The reason TP + FP does not add up to the number of audio files is that some files are annotated with both TP and FP.

In [None]:
print('Number of files present in both tp and fp: ', len(set(train_tp.recording_id.unique()).intersection(set(train_fp.recording_id.unique()))))
print('Total number of files: ', 5090-363)
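
The subtraction above is the inclusion-exclusion identity for two sets: the size of the union is the sum of the sizes minus the size of the overlap. A minimal sketch follows; the recording IDs are made-up toy values, and only the three counts in the last step come from this notebook.

```python
# Toy recording IDs (hypothetical, not taken from the competition data).
tp_ids = {"a", "b", "c", "d"}
fp_ids = {"c", "d", "e", "f", "g"}

# Files annotated in both csv files are counted twice in len(tp) + len(fp).
both = tp_ids & fp_ids
total_by_formula = len(tp_ids) + len(fp_ids) - len(both)
total_by_union = len(tp_ids | fp_ids)
print(len(both), total_by_formula, total_by_union)

# The same identity with the counts used in this notebook.
notebook_total = 1132 + 3958 - 363
print(notebook_total)
```

With real data, `len(tp_ids | fp_ids)` gives the total directly, without hardcoding the intermediate counts.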

# Save the `csv` files as W&B Artifacts

In [None]:
# initialize a W&B run
run = wandb.init(project='rainforest', job_type='load_dataset')

# create an artifact to add file(s) and meaningful description.
artifact = wandb.Artifact('csv_reference', 
                          type='dataset', 
                          description='These csv files contain both true positive and false positive annotations.',
                          metadata={'type': 'csv'})
    
artifact.add_file(data_dir+'train_tp.csv')
artifact.add_file(data_dir+'train_fp.csv')

# Save the artifact version to W&B and mark it as the output of this run
run.log_artifact(artifact)
    
run.join()

# Audio to Image Transformations

We will quickly look at the standard audio to image transformation techniques. They will be used to generate the dataset to train our model. 

The transformations are based on this [Kaggle kernel](https://www.kaggle.com/samcantor9/getting-started-with-rainforest-audio-data) by [Sam Cantor](https://www.kaggle.com/samcantor9).

In [None]:
training_files = train_tp.recording_id.unique()
training_files

In [None]:
sample_recording_id = np.random.choice(training_files, 1)[0]
sample_path = [path for path in train_file_path if sample_recording_id in path][0]

SR = 48000
signal, sr = lb.load(sample_path, sr=SR)
lb.display.waveplot(signal, sr=SR)
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.show()

ipd.Audio(sample_path)

In [None]:
# which species does the sample audio file belong to?
train_tp.loc[train_tp['recording_id'] == sample_recording_id]

### 1. STFT based spectrogram

In [None]:
n_fft = 2048 # number of samples per FFT (the duration of each slice)
hop_length = 512 # shift

stft = lb.core.stft(signal, hop_length=hop_length, n_fft=n_fft)

spectrogram = np.abs(stft)

lb.display.specshow(spectrogram, sr=sr, hop_length=hop_length)
plt.xlabel('Time') 
plt.ylabel('Frequency')
clb = plt.colorbar()
clb.set_label('Amplitude')
plt.show()
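
To make the roles of `n_fft` and `hop_length` concrete, here is a minimal NumPy-only sketch of a magnitude STFT. It is not librosa's implementation: it skips the centering/padding that `lb.core.stft` applies by default and uses `np.hanning` as the window. The function name `naive_stft_magnitude` and the 3 kHz test tone are assumptions made for illustration.

```python
import numpy as np

def naive_stft_magnitude(signal, n_fft=2048, hop_length=512):
    """Magnitude STFT without padding: rows are frequency bins, columns are frames."""
    window = np.hanning(n_fft)
    # Each frame starts hop_length samples after the previous one.
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack(
        [signal[i * hop_length : i * hop_length + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, axis=0))

sr = 48000
t = np.arange(sr) / sr                      # one second of audio
tone = np.sin(2 * np.pi * 3000 * t)         # synthetic 3 kHz tone

spec = naive_stft_magnitude(tone)
peak_bin = int(spec.mean(axis=1).argmax())  # strongest frequency bin on average
peak_hz = peak_bin * sr / 2048              # bin index -> frequency in Hz
print(spec.shape, peak_bin, peak_hz)
```

A larger `n_fft` gives more frequency bins (finer frequency resolution), while a smaller `hop_length` gives more frames (finer time resolution).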

### 2. STFT based log spectrogram

Simply convert the spectrogram to a log (decibel) scale.

In [None]:
log_spectrogram = lb.amplitude_to_db(spectrogram)

lb.display.specshow(log_spectrogram, sr=sr, hop_length=hop_length)
plt.xlabel('Time')
plt.ylabel('Frequency')
clb = plt.colorbar()
clb.set_label('Amplitude')
plt.show()
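
The decibel conversion can be sketched in a few lines of NumPy. This is an approximation written for illustration, assuming librosa's documented defaults for `amplitude_to_db` (`ref=1.0`, `amin=1e-5`, `top_db=80.0`); the function name `amplitude_to_db_sketch` is hypothetical.

```python
import numpy as np

def amplitude_to_db_sketch(magnitude, ref=1.0, amin=1e-5, top_db=80.0):
    """20 * log10 of the amplitude, floored at amin and clipped to top_db below the peak."""
    db = 20.0 * np.log10(np.maximum(magnitude, amin) / ref)
    return np.maximum(db, db.max() - top_db)

amplitudes = np.array([1.0, 10.0, 100.0, 1e-9])
db_values = amplitude_to_db_sketch(amplitudes)
print(db_values)
```

Every tenfold increase in amplitude adds 20 dB, and very quiet bins are clipped so that the dynamic range never exceeds `top_db`.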

### 3. Mel Frequency Cepstral Coefficients (MFCCs)

In [None]:
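# Note: this is a mel-scaled spectrogram in decibels (a log-mel spectrogram);
# true MFCCs would additionally apply a DCT to it.
mel_spectrogram = lb.feature.melspectrogram(y=signal, n_fft=n_fft, hop_length=hop_length, sr=sr)

# melspectrogram returns a power spectrogram by default, hence power_to_db
log_mel_spectrogram = lb.power_to_db(mel_spectrogram)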

lb.display.specshow(log_mel_spectrogram, sr=sr, hop_length=hop_length)
plt.xlabel('Time')
plt.ylabel('MFCC')
clb = plt.colorbar()
clb.set_label('Volume') 
plt.show()  
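
The mel scale compresses high frequencies so that equal steps on the scale correspond roughly to equal perceived pitch steps. The sketch below uses the common HTK-style formula purely as an illustration; as far as I know, librosa defaults to a different (Slaney-style) variant, so the exact numbers will differ from what `lb.feature.melspectrogram` uses. The function names are hypothetical.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK-style mel conversion: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Four equally wide mel bands between 0 Hz and the Nyquist frequency (24 kHz at sr=48000).
max_mel = hz_to_mel_htk(24000.0)
mel_edges = [max_mel * i / 4 for i in range(5)]
hz_edges = [mel_to_hz_htk(m) for m in mel_edges]
band_widths_hz = [hi - lo for lo, hi in zip(hz_edges[:-1], hz_edges[1:])]
print([round(w) for w in band_widths_hz])
```

Bands that are equally wide in mel become progressively wider in Hz, which is why a mel spectrogram devotes more rows to low frequencies than a linear STFT spectrogram does.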

# IDEA: Stack all three transformations such that we get a standard 3 channel image.

In [None]:
spectrogram = resize(spectrogram, (224, 400))
log_spectrogram = resize(log_spectrogram, (224, 400))
log_mel_spectrogram = resize(log_mel_spectrogram, (224, 400))

In [None]:
img = np.stack((spectrogram, log_spectrogram, log_mel_spectrogram), axis=-1)
print(img.shape)

plt.imshow(img);

In [None]:
# normalize image
# (stats.zscore defaults to axis=0, so each column is standardized along the height axis)
norm_img = stats.zscore(img)
plt.imshow(norm_img);
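
The axis along which the z-score is computed matters for a stacked image. The NumPy-only sketch below compares what `stats.zscore(img)` does by default (`axis=0`) with a per-channel alternative. The per-channel variant is shown only as an option to consider, not as the method used in this notebook, and the random image is a stand-in for real spectrograms.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake 3-channel "image" whose channels have very different value ranges.
img = rng.normal(loc=[0.0, -40.0, -20.0], scale=[1.0, 10.0, 5.0], size=(224, 400, 3))

# Equivalent to scipy.stats.zscore(img) with its default axis=0:
# every (column, channel) pair is standardized over the 224 rows.
col_norm = (img - img.mean(axis=0)) / img.std(axis=0)

# Alternative: one mean and one standard deviation per channel.
chan_norm = (img - img.mean(axis=(0, 1))) / img.std(axis=(0, 1))

print(col_norm.mean(axis=0).shape, chan_norm.mean(axis=(0, 1)).shape)
```

Column-wise normalization removes loudness differences between time frames, whereas per-channel normalization preserves them and only equalizes the three transforms.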

# Dataset Creation and W&B Artifacts

Learn more about the artifacts through this easy to understand [Colab notebook](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/wandb-artifacts/Pipeline_Versioning_with_W%26B_Artifacts.ipynb).

### Dataset related hyperparameters

I created three variants of the dataset by changing the hyperparameters. By re-running the Jupyter cells below with different values, you can create a new dataset and save it as an artifact. You can later download the dataset from that artifact and train a model. 

In [None]:
N_FFT = 1024
HOP_LENGTH = 512
SR = 48000 # high sampling rate, for fewer rounding errors
LENGTH = 10 * SR # length of each slice in samples (10 seconds)

IMG_WIDTH = 400
IMG_HEIGHT = 224

SAVE_DIR = 'kaggle/working/'
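
These hyperparameters determine the shape of the raw spectrogram before it is resized to `IMG_HEIGHT` x `IMG_WIDTH`. A quick sketch of the arithmetic follows, assuming librosa's default centered STFT, for which the number of frames is `1 + n_samples // hop_length`. The `N_FFT=2048` and `HOP_LENGTH=256` settings in the loop are hypothetical examples, not necessarily the variants that were generated.

```python
SR = 48000
LENGTH = 10 * SR  # 480000 samples per slice

def raw_spectrogram_shape(n_fft, hop_length, n_samples=LENGTH):
    """(frequency bins, time frames) of a centered STFT."""
    return n_fft // 2 + 1, 1 + n_samples // hop_length

default_shape = raw_spectrogram_shape(n_fft=1024, hop_length=512)
print(default_shape)

# Hypothetical alternative settings, for comparison only.
for n_fft, hop in [(2048, 512), (1024, 256)]:
    print(n_fft, hop, raw_spectrogram_shape(n_fft, hop))
```

Since every variant is resized to the same image size, changing `n_fft` and `hop_length` changes how much frequency and time detail is available before downsampling, not the final image shape.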

### Prepare dataset and save as [W&B artifacts](https://www.wandb.com/artifacts)

In [None]:
# initialize a W&B run
run = wandb.init(project='rainforest', job_type='prepare_dataset')

# declare which artifact we'll be using
artifact_csv = run.use_artifact('wandb/rainforest/csv_reference:v0')

# we can either download the csv file from artifact_csv or reuse the DataFrame that is already loaded. 
recording_ids = train_tp.recording_id.values

# make a directory to save the created files
os.makedirs(SAVE_DIR+'nfft_{}_hop_{}'.format(N_FFT, HOP_LENGTH), exist_ok=True)
print('Dir successfully made')

# create an Artifact to hold the dataset so that it can be used later
artifact = wandb.Artifact('spectrogram-dataset_nfft_{}_hop_{}'.format(N_FFT, HOP_LENGTH), 
                          type='dataset', 
                          description=('This dataset was generated by stacking spectrogram, log spectrogram and MFCC based spectrogram '
                                       'with n_fft value of {} and hop length of {}.').format(N_FFT, HOP_LENGTH),
                          metadata={'width': 400,
                                    'height': 224,
                                    'channel': 3,
                                    'data_type': 'uint8'})

# actual loop to create the dataset 
# iterate over annotation rows, so that every annotation of a recording gets its own image
for i, (recording_id, features) in enumerate(zip(recording_ids, train_tp.values)):
    # load the audio 
    file_path = [path for path in train_file_path if recording_id in path][0]
    wav, sr = lb.load(file_path, sr=SR)
    
    # get features from this row of train_tp.csv
    # (columns: recording_id, species_id, songtype_id, t_min, f_min, t_max, f_max)
    t_min = features[3] * sr
    t_max = features[5] * sr
    
    # Get the position to slice the audio
    center = np.round((t_min + t_max) / 2)
    beginning = center - LENGTH / 2
    if beginning < 0:
        beginning = 0
    ending = beginning + LENGTH
    
    if ending > len(wav):
        ending = len(wav)
        beginning = ending - LENGTH
        
    wav_slice = wav[int(beginning):int(ending)]
    
    # spectrogram
    stft = lb.core.stft(wav_slice, hop_length=HOP_LENGTH, n_fft=N_FFT)
    spectrogram = np.abs(stft)
    spectrogram = resize(spectrogram, (IMG_HEIGHT, IMG_WIDTH))
    
    # log_spectrogram
    log_spectrogram = lb.amplitude_to_db(spectrogram)
    log_spectrogram = resize(log_spectrogram, (IMG_HEIGHT, IMG_WIDTH))
    
    # mel_spectrogram (melspectrogram returns power by default, hence power_to_db)
    mel_spectrogram = lb.feature.melspectrogram(y=wav_slice, n_fft=N_FFT, hop_length=HOP_LENGTH, sr=sr)
    log_mel_spectrogram = lb.power_to_db(mel_spectrogram)
    log_mel_spectrogram = resize(log_mel_spectrogram, (IMG_HEIGHT, IMG_WIDTH))
    
    # generate image by stacking three transforms 
    img = np.stack((spectrogram, log_spectrogram, log_mel_spectrogram), axis=-1)
    
    # normalize image
    norm_img = stats.zscore(img)
    #scale image to 0-1
    norm_img = norm_img - np.min(norm_img)
    norm_img = norm_img / np.max(norm_img)
    # scale up to 0-255 to save in bmp format
    norm_img = np.round(norm_img*255).astype('uint8')
    norm_img = np.asarray(norm_img)
    
    # convert to PIL Image and save in bmp format
    bmp = Image.fromarray(norm_img)
    bmp.save(SAVE_DIR + 'nfft_{}_hop_{}/'.format(N_FFT, HOP_LENGTH) + recording_id + '_' + str(features[1]) + '_' + str(center) + '.bmp')
    
    if i % 100 == 0:
        print('Processed ' + str(i) + ' train examples from ' + str(len(recording_ids)))

# save the directory as an artifact
artifact.add_dir(SAVE_DIR+'nfft_{}_hop_{}'.format(N_FFT, HOP_LENGTH))

# Save the artifact version to W&B
run.log_artifact(artifact)

# let W&B know that this run is complete
run.join()
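
The slicing logic inside the loop can be isolated as a small helper, which makes the edge cases easier to check. This is a sketch of the same idea, assuming every recording is at least `length` samples long; the function name `clip_window` and the example timestamps are made up for illustration.

```python
def clip_window(t_min, t_max, sr, total_samples, length):
    """Return (beginning, ending) of a fixed-length slice centered on the annotation."""
    center = round((t_min * sr + t_max * sr) / 2)
    beginning = max(center - length // 2, 0)   # do not start before the recording
    ending = beginning + length
    if ending > total_samples:                 # do not run past the end either
        ending = total_samples
        beginning = ending - length
    return int(beginning), int(ending)

SR = 48000
LENGTH = 10 * SR
TOTAL = 60 * SR  # assuming a 60 second recording

near_start = clip_window(1.0, 2.0, SR, TOTAL, LENGTH)
middle = clip_window(29.0, 31.0, SR, TOTAL, LENGTH)
near_end = clip_window(58.0, 59.5, SR, TOTAL, LENGTH)
print(near_start, middle, near_end)
```

Annotations close to either end of a recording are therefore not centered in the slice; the window is shifted so that it always has the same length.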