![xyzy fsadsdf](https://images.unsplash.com/photo-1511142878591-5040f0bdaadd?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1650&q=80)


## 🎯 Motivations
**Processing spectrograms on-the-fly is slow!** It is likely to become the bottleneck that limits your GPU utilization, and hence the speed of your training loop. One simple way to speed up your data pipeline is to precompute all spectrograms and save them to disk beforehand. By doing so you avoid repeating the same costly transformation every epoch.

In this notebook, I will implement this simple solution in fastai and fastaudio, using the [Rainforest Connection Species Audio Detection competition](https://www.kaggle.com/c/rfcx-species-audio-detection/overview) as an example. I will show you:  
1. How to transform audio into mel-spectrograms and save them to disk
2. How to include audio-domain augmentations in the precomputed mel-spectrograms
3. How to construct `DataLoaders` from these precomputed mel-spectrograms
4. How the speed of data loading compares with and without precomputed mel-spectrograms

At the end, I will show you how this simple trick gives roughly a **20x SPEED-UP**! (Remember to run this kernel in GPU mode to reproduce the stated speed-up.)

In [None]:
import pkg_resources

def placeholder(x):
    raise pkg_resources.DistributionNotFound
pkg_resources.get_distribution = placeholder


# uninstall fastai-2.3.0 & fastcore-1.3.19 to remove constraint on torch version
!pip uninstall fastai fastcore torchaudio -y
!pip install torch==1.8.1 torchaudio==0.8.1 fastcore==1.3.20
!pip install fastaudio

In [None]:
import os
import librosa
from tqdm import tqdm

import pandas as pd
from fastaudio.all import *
from fastai.vision.all import *

import torch
import torchaudio
import fastcore
import fastai
import fastaudio
torchaudio.set_audio_backend("sox_io")

## 📔 Bookkeeping

In [None]:
print(f'CUDA mode: {torch.cuda.is_available()}')
for _module in (torch, torchaudio, fastcore, fastaudio, fastai):
    print(f'{_module.__name__} version: {_module.__version__}')

In [None]:
# CONFIGURATIONS
DATA_DIR = Path('../input/rfcx-species-audio-detection')
CACHE_DIR = Path('./cached_melspectrograms')
SR = 48000
MELSPEC_CONFIG = {'n_fft': 2048, 'sample_rate': SR}
CROP_INTERVAL = 10  # sec
RADNOM_CROP_INTERVAL = 8  # sec
RESAMPLE_N = 15
BS = 32
CLASS_N = 24
CLASS_NAMES = list(map(str, range(CLASS_N)))

## ☕ Save Our Mel-spectrograms to Disk
First we build a pipeline that transforms audio into mel-spectrograms and saves them to disk. Instead of using a full audio clip, we crop a 10-second clip centered on the interval where a species was identified, so each cropped clip corresponds to a sample of a single species. This has been shown to be an effective approach for a model to learn the classification of different species.  
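
The cropping arithmetic can be sketched on its own with plain Python. This is only an illustration of the idea: `crop_window` is a hypothetical helper name, and the annotation times used below are made-up values, not rows from `train_tp.csv`.

```python
SR = 48000          # sample rate used in this notebook
CROP_INTERVAL = 10  # crop length in seconds


def crop_window(t_min, t_max, sr=SR, crop_sec=CROP_INTERVAL):
    """Return (frame_offset, num_frames) for a crop centered on the annotation."""
    center = (t_min + t_max) / 2
    # clamp at 0 so the crop never starts before the beginning of the file
    start_t = max(0.0, center - crop_sec / 2)
    return int(start_t * sr), int(crop_sec * sr)


# hypothetical annotations, in seconds
print(crop_window(12.0, 14.0))  # (384000, 480000)
print(crop_window(1.0, 3.0))    # (0, 480000)
```

Note that for an annotation close to the start of a recording the window is clamped at 0, so the call is no longer exactly centered in the crop.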

We apply `AddNoise` and `ResizeSignal` as augmentations on the audio. `ResizeSignal` randomly crops an 8-second clip from the 10-second clip. In the pipeline, we resample the same sample many times (controlled by `RESAMPLE_N`) to create a variety of augmented snapshots. With this setting, we can emulate samples being randomly augmented across iterations in the downstream stage. (We will show this later.)
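
To make the snapshot idea concrete, here is a minimal sketch in plain Python of drawing several random 8-second windows out of one 10-second clip. It only mimics the concept and is not fastaudio's implementation; `random_crop_starts` is a hypothetical helper, and the fixed seed is there only to make the illustration repeatable.

```python
import random


def random_crop_starts(clip_sec, crop_sec, n_snapshots, seed=0):
    """Draw one random start time (in seconds) per snapshot so each crop fits in the clip."""
    rng = random.Random(seed)
    max_start = clip_sec - crop_sec  # latest start that still leaves room for a full crop
    return [rng.uniform(0.0, max_start) for _ in range(n_snapshots)]


# 15 snapshots, matching RESAMPLE_N in this notebook
starts = random_crop_starts(clip_sec=10, crop_sec=8, n_snapshots=15)
for s in starts[:3]:
    print(f"snapshot covers {s:.2f}s to {s + 8:.2f}s of the 10-second clip")
```

Every snapshot sees a slightly different part of the same clip, which is what lets a fixed set of saved files stand in for random augmentation.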

We need an identifiable filename for each sample. Since a `recording_id` can contain multiple samples, we additionally use `t_min` and `t_max` (columns from `train_tp.csv`) in the filename, and give each snapshot a different index, i.e. `{recording_id}_{t_min}_{t_max}_{index}.pt`.
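
As a small illustration of this naming scheme, the sketch below builds a filename and parses it back. Both helper names are hypothetical, the recording id is a made-up value, and the three-decimal formatting is an assumption chosen for the example.

```python
def snapshot_filename(recording_id, t_min, t_max, index):
    """Build '{recording_id}_{t_min}_{t_max}_{index}.pt' with times rounded to 3 decimals."""
    return f"{recording_id}_{t_min:.3f}_{t_max:.3f}_{index}.pt"


def parse_snapshot_filename(fn):
    """Recover (recording_id, t_min, t_max, index) from a snapshot filename."""
    stem = fn[:-len(".pt")]
    # split from the right so underscores inside recording_id are preserved
    recording_id, t_min, t_max, index = stem.rsplit("_", 3)
    return recording_id, float(t_min), float(t_max), int(index)


name = snapshot_filename("rec_01", 12.3456, 14.0, 7)  # hypothetical recording id
print(name)                           # rec_01_12.346_14.000_7.pt
print(parse_snapshot_filename(name))  # ('rec_01', 12.346, 14.0, 7)
```

Because the times are rounded, two annotations of the same recording that differ only beyond the third decimal would map to the same filename, so the precision should be at least as fine as the annotations.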

In [None]:
audio_fns = get_audio_files(DATA_DIR/'train')
print(f'No. of audio files in train folder: {len(audio_fns)}')

# to save time, use only the true-positive annotations (train_tp.csv)
df = pd.read_csv(DATA_DIR/ 'train_tp.csv')
df.species_id = df.species_id.astype(str)
print(f'No. of samples to be processed: {df.shape[0]}')

In [None]:
def read_and_crop_audio(row: pd.Series) -> AudioTensor:
    # crop the subclip responsible for the target species
    t_min, t_max = row.t_min, row.t_max
    center = (t_max + t_min) / 2
    start_t = center - (CROP_INTERVAL / 2.)
    
    _frame_offset = int(max(0, start_t) * SR)
    _num_frames = int(CROP_INTERVAL * SR)
    audio_fn = DATA_DIR.resolve()/ f'train/{row.recording_id}.flac'
    assert audio_fn.is_file()
    audio = AudioTensor.create(audio_fn,
                               frame_offset=_frame_offset,
                               num_frames=_num_frames)
    
    # attach metadata to tensor so as to help create unique filename later
    audio.t_min, audio.t_max = row.t_min, row.t_max
    audio.recording_id = row.recording_id
    return audio


def new_encodes(self, audio: AudioTensor):
    self.pipe.to(audio.device)
    self.settings.update({"sr": audio.sr, "nchannels": audio.nchannels})
    spec = AudioSpectrogram.create(self.pipe(audio), settings=dict(self.settings))
    
    # propagate metadata from AudioTensor to AudioSpectrogram
    spec.__dict__.update(audio.__dict__)
    return spec

# overload an existing type-dispatch function
AudioToSpec.encodes.add(new_encodes)


def save_spectrogram(spec: AudioSpectrogram, idx: int):
    CACHE_DIR.mkdir(exist_ok=True)
    spec_fn = f'{spec.recording_id}_{spec.t_min:.3f}_{spec.t_max:.3f}_{idx}.pt'
    torch.save(spec, CACHE_DIR/ spec_fn)

In [None]:
spec_tfms = AudioToSpec.from_cfg(AudioConfig.BasicMelSpectrogram(**MELSPEC_CONFIG))
tfms_ls = [read_and_crop_audio, 
           AddNoise(noise_level=0.1), 
           ResizeSignal(RADNOM_CROP_INTERVAL*1000, AudioPadType.Repeat),
           spec_tfms]
dsets = Datasets(items=df, tfms=[tfms_ls])

In [None]:
%%time
for sample_i in tqdm(range(len(dsets))):
    # create different snapshots as an augmentation in audio domain
    for repeat_j in range(RESAMPLE_N):
        spec, = dsets[sample_i]
        save_spectrogram(spec, repeat_j)

In [None]:
# sanity check all files have been successfully saved
!ls $CACHE_DIR | wc -l

## ⏩ Build `DataLoaders` with the Precomputed Mel-spectrograms
Once all mel-spectrograms have been saved, we can build `DataLoaders` that load the precomputed mel-spectrograms. For simplicity, I skip the creation of a validation set, but you can easily plug that part into the pipeline.

In [None]:
def load_precompute_spectrogram(row: pd.Series) -> AudioSpectrogram:
    assert CACHE_DIR.is_dir()
    # use random file sampling to emulate audio domain augmentation
    random_idx = random.choice(range(RESAMPLE_N))
    random_fn = f'{row.recording_id}_{row.t_min:.3f}_{row.t_max:.3f}_{random_idx}.pt'
    fn = CACHE_DIR/ random_fn
    spec = torch.load(fn)
    return spec


def parse_labels_from_row(row: pd.Series):
    _y_reader = ColReader('species_id')
    label = _y_reader(row)
    assert isinstance(label, str)
    return [label]

In [None]:
# emulate DataLoaders for model training: with input + target
x_tfms_ls = [load_precompute_spectrogram]
y_tfms_ls = [parse_labels_from_row, MultiCategorize(vocab=CLASS_NAMES), OneHotEncode]
dsets_precomp = Datasets(df, tfms=[x_tfms_ls, y_tfms_ls])
dls_precomp = dsets_precomp.dataloaders(bs=BS, num_workers=4)

# sanity check
batch_precomp = dls_precomp.one_batch()
assert isinstance(batch_precomp[0], AudioSpectrogram)
assert isinstance(batch_precomp[1], TensorMultiCategory)
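
For the validation set that was skipped above, one possible starting point is a random split of the row indices. The sketch below uses only the standard library; `split_indices`, the 20% hold-out and the seed are assumptions made for illustration.

```python
import random


def split_indices(n_items, valid_pct=0.2, seed=42):
    """Shuffle row indices and return (train_idx, valid_idx)."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    n_valid = int(n_items * valid_pct)
    return sorted(idxs[n_valid:]), sorted(idxs[:n_valid])


# ~1200 samples, as in this notebook
train_idx, valid_idx = split_indices(1200)
print(len(train_idx), len(valid_idx))  # 960 240
```

A pair of index lists like this is the kind of value the `splits` argument of fastai's `Datasets` accepts, and fastai's `RandomSplitter` produces one in a similar way.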

## 🚀 Time the Speed of Loading 1 Epoch
To time the speed of data loading, let's emulate one epoch of feeding data into our model. Note that, purely to illustrate the speed, I simply feed each batch to a ResNet18 model instead of constructing a full training loop with `Learner`.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = create_cnn_model(resnet18, n_in=1, 
                         n_out=CLASS_N, 
                         pretrained=True)
model.to(device);

It takes only **~2s to go through ~1200 samples**!

In [None]:
%%time
for _batch in tqdm(dls_precomp.train):
    _ = model(_batch[0])
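
`%%time` is a notebook magic, so outside a notebook the same measurement needs a small helper. The sketch below is one way to do it with the standard library; `time_one_pass` is a hypothetical name, and the toy `range`/`lambda` call merely stands in for the real batches and model.

```python
import time


def time_one_pass(batches, step):
    """Run `step` once on every batch and return (n_batches, elapsed_seconds)."""
    start = time.perf_counter()
    n_batches = 0
    for batch in batches:
        step(batch)
        n_batches += 1
    return n_batches, time.perf_counter() - start


# toy stand-in for `dls_precomp.train` and the model's forward pass
n, elapsed = time_one_pass(range(100), lambda b: b * 2)
print(f"{n} batches in {elapsed:.4f}s")
```

On a GPU, kernels are launched asynchronously, so for a precise figure it helps to call `torch.cuda.synchronize()` before reading the clock at the end.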

## 🔥 20x Speedup Compared to On-the-fly Approach!
To see how fast this is by contrast, let's build another `DataLoaders` with the same pipeline, but this time computing the mel-spectrograms on-the-fly.

In [None]:
# DataLoaders that load in melspectrogram on the fly
onthefly_x_tfms_ls = tfms_ls[:-1]
dsets_woprecomp = Datasets(df, tfms=[onthefly_x_tfms_ls, y_tfms_ls])
dls_woprecomp = dsets_woprecomp.dataloaders(bs=BS, after_batch=spec_tfms, num_workers=4)

# sanity check
batch_woprecomp = dls_woprecomp.one_batch()
assert isinstance(batch_woprecomp[0], AudioSpectrogram)
assert isinstance(batch_woprecomp[1], TensorMultiCategory)

This time it takes **~38s (vs. ~2s)** to complete the same loading.  

**So we have gained ~20x speedup simply by precomputing melspectrograms!**

In [None]:
%%time
for _batch in tqdm(dls_woprecomp.train):
    _ = model(_batch[0])

## 📝 Final Remarks
1. Saving different (random) snapshots of the same audio can emulate augmentations done in audio space, but the shortcoming of this approach is that it can eat up a lot of disk space: the more variations you want, the more snapshots you have to save to disk. This is certainly an issue when you have a large sample size, since the snapshots can take up several times more space than your original input.
2. The speed of one epoch is mainly determined by the speed of data processing and the speed of your model, so if you plug a heavier model into the same pipeline, you probably won't get the same speedup, because the epoch time is then likely to be bounded by your model.
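
To get a feeling for the disk cost in the first remark before running the pipeline, a back-of-the-envelope estimate is enough. In the sketch below, `estimate_cache_bytes` is a hypothetical helper; the ~1200 samples and 15 snapshots come from this notebook, while the 128 x 376 spectrogram shape and 4 bytes per value (float32) are assumed example values, not measurements. File-format overhead is ignored.

```python
def estimate_cache_bytes(n_samples, snapshots_per_sample, n_mels, n_frames, bytes_per_value=4):
    """Rough size of the cache: one (n_mels x n_frames) array per saved snapshot."""
    per_spectrogram = n_mels * n_frames * bytes_per_value
    return n_samples * snapshots_per_sample * per_spectrogram


# spectrogram shape below is an assumption for illustration only
total = estimate_cache_bytes(1200, 15, n_mels=128, n_frames=376)
print(f"{total / 1e9:.2f} GB")  # 3.47 GB
```

The estimate grows linearly with the number of snapshots per sample, which is the trade-off between augmentation variety and disk usage described above.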

In [None]:
!du -sh $CACHE_DIR