Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audio pre-processing for Machine Learning: Getting things right #9

scarecrow1123 opened this issue Dec 31, 2019 · 0 comments

Audio pre-processing for Machine Learning: Getting things right #9

scarecrow1123 opened this issue Dec 31, 2019 · 0 comments


Copy link

@scarecrow1123 scarecrow1123 commented Dec 31, 2019

Audio pre-processing for Machine Learning: Getting things right

For any machine learning experiment, careful handling of input data in terms of cleaning, encoding/decoding, featurizing are paramount. When it comes to applying machine learning for audio, it gets even trickier when compared with text/image, since dealing with audio involves many tiny details that can be overlooked. Any sort of inconsistency in the pre-processing pipeline could be a potential disaster in terms of the final accuracy of the overall system. We'll look into a few basic things that need to be set right when writing an audio pre-processing pipeline.

If you are not familiar with how audio input is fed to a machine learning model, I highly recommend reading these two articles first: How to do Speech Recognition with Deep Learning, Speech Processing for Machine Learning - Filter banks, etc.

Fixing on a data format

First step to get the pipeline right is to fix on a specific data format that the system would require. This would ensure a consistent interface that the dataset reader can rely upon. The usual practice is to use WAV which is a lossless format(FLAC is also another popular choice). Since WAV is an uncompressed format, it tends to be better when compared to lossy formats such as MP3, etc.

Do not vary the sample rate

WAV stores audio signals as a series of numbers also called the PCM (Pulse Code Modulation) data. PCM is a way to convert analog audio to digital data. So essentially if you are loading an audio file into a numpy array, it is the underlying PCM data that is loaded. Each number in the sequence is called a sample, that represents the amplitude of the signal at an approximate point in time. It is called a sample since the PCM method approximates the amplitude value by sampling the original audio signal for a fixed number of times every second. The number of samples taken for every second is the sampling rate of the signal. This is an important factor that needs to be uniform in the audio pipeline. If this varies in different parts of a system, things can get miserable! Many machine learning systems for audio applications such as speech recognition, wake-word detection, etc. can work well with 16k Hz audio(16000 samples for every second of the original audio). So for example, a numpy array for a 5 second audio with 16k Hz sample rate would have the shape (80000,) ( 5 * 16000 = 80000).

Popular audio libraries such as PySoundFile, audiofile, librosa, etc. in Python provide operations for loading audio to numpy array and return the sample rate of the signal. The libraries use the header information in WAV files to figure out the sample rate.

# Using soundfile to load audio and know its sample rate
import soundfile as sf

audio, sample_rate ="sample.wav")

Bit depth

This is a crucial property that needs to be handled correctly, especially in places where the data is loaded to arrays/tensors. Bit depth represents the number of bits required to represent each sample in the PCM audio data. In practice, 16-bit signed integers can be used to store training data. During training, these 16-bit data can be loaded to 32-bit float tensors/arrays and can be fed to neural nets. Things can go wrong here say when a 24-bit audio file is loaded into a 16-bit array. Let's take Python stdlib's wave module for example, which returns a byte array from an audio file:

import wave

w ="sample_16b.wav", "rb")
n_samples = w.getnframes()
audio_data = w.readnframes(n_samples)

The byte array is converted into a np array using np.frombuffer and specifying the appropriate type of the data stored, 16-bit int in this case. Things will go wrong when it is loaded into a wrong container say np.int8

# correct type
audio_array = np.frombuffer(audio_data, dtype=np.int16)

# incorrect
audio_array = np.frombuffer(audio_data, dtype=np.int8)

Hence deciding on a standard bit depth that the system will always look for, will help eliminate overflows because of incorrect typecasting.

Byte order

It is also recommended to not to take the byte order for granted when reading/writing audio data. Even though the underlying codec may take into account the system's byte order, for the paranoid ones, it is better to get fixed on one standard order, say little endian.


Number of channels can depend on the actual application for which the pre-processing step is done. For speech recognition let's say, an input to a neural net is typically a single channel. In case of a stereo input, each channel can form distinct inputs to the neural net. Or the channels could be merged together to form a mono audio. However, this is an application specific choice.

A standard way to load/convert input audio

To make sure nothing goes wrong in your audio pre-processing pipeline, it would be the safest to assume none of your inputs is in the right format and always go for a standard format conversion routine. Below would be a set of useful ffmpeg options using ffmpeg-python to standardize the incoming input:

import ffmpeg
import numpy as np

stream = ffmpeg.input("sample.wav")
# set the output sample rate is 16000
stream = ffmpeg.filter(stream, "aresample", osr=16000)
# set num channels = 1, bit depth to 16-bit int(s16), byte order to little endian(le) 
stream = ffmpeg.output(stream, "pipe:", ac=1, acodec="pcm_s16le", format="s16le")
out, err =, quiet=True)

# load it with proper data type (int16)
audio_array = np.frombuffer(out, dtype=np.int16)

Note that audio_array is raw PCM data and cannot be directly written into a WAV file. It is safe to use the IO mechanisms that the audio libraries provide to write the raw data into a WAV file. This will make sure appropriate headers are in place in the WAV file.

import soundfile as sf

sf.write("sample_out.wav", audio_array, samplerate=16000, subtype="PCM_16", endian="LITTLE")

The raw array data however is the starting point for further pre-processing which depend on the downstream experiment/application. They can be converted to signal processing features such as spectrogram, MFCC, etc. which are supported by libraries such as librosa, torchaudio, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
1 participant
You can’t perform that action at this time.