Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
Audio pre-processing for Machine Learning: Getting things right #9
Audio pre-processing for Machine Learning: Getting things right
For any machine learning experiment, careful handling of input data in terms of cleaning, encoding/decoding, featurizing are paramount. When it comes to applying machine learning for audio, it gets even trickier when compared with text/image, since dealing with audio involves many tiny details that can be overlooked. Any sort of inconsistency in the pre-processing pipeline could be a potential disaster in terms of the final accuracy of the overall system. We'll look into a few basic things that need to be set right when writing an audio pre-processing pipeline.
If you are not familiar with how audio input is fed to a machine learning model, I highly recommend reading these two articles first: How to do Speech Recognition with Deep Learning, Speech Processing for Machine Learning - Filter banks, etc.
Fixing on a data format
First step to get the pipeline right is to fix on a specific data format that the system would require. This would ensure a consistent interface that the dataset reader can rely upon. The usual practice is to use WAV which is a lossless format(FLAC is also another popular choice). Since WAV is an uncompressed format, it tends to be better when compared to lossy formats such as MP3, etc.
Do not vary the sample rate
WAV stores audio signals as a series of numbers also called the PCM (Pulse Code Modulation) data. PCM is a way to convert analog audio to digital data. So essentially if you are loading an audio file into a numpy array, it is the underlying PCM data that is loaded. Each number in the sequence is called a sample, that represents the amplitude of the signal at an approximate point in time. It is called a sample since the PCM method approximates the amplitude value by sampling the original audio signal for a fixed number of times every second. The number of samples taken for every second is the sampling rate of the signal. This is an important factor that needs to be uniform in the audio pipeline. If this varies in different parts of a system, things can get miserable! Many machine learning systems for audio applications such as speech recognition, wake-word detection, etc. can work well with 16k Hz audio(16000 samples for every second of the original audio). So for example, a numpy array for a 5 second audio with 16k Hz sample rate would have the shape
Popular audio libraries such as PySoundFile, audiofile, librosa, etc. in Python provide operations for loading audio to numpy array and return the sample rate of the signal. The libraries use the header information in WAV files to figure out the sample rate.
# Using soundfile to load audio and know its sample rate import soundfile as sf audio, sample_rate = sf.read("sample.wav")
This is a crucial property that needs to be handled correctly, especially in places where the data is loaded to arrays/tensors. Bit depth represents the number of bits required to represent each sample in the PCM audio data. In practice, 16-bit signed integers can be used to store training data. During training, these 16-bit data can be loaded to 32-bit float tensors/arrays and can be fed to neural nets. Things can go wrong here say when a 24-bit audio file is loaded into a 16-bit array. Let's take Python stdlib's
import wave w = wave.open("sample_16b.wav", "rb") n_samples = w.getnframes() audio_data = w.readnframes(n_samples)
The byte array is converted into a np array using
# correct type audio_array = np.frombuffer(audio_data, dtype=np.int16) # incorrect audio_array = np.frombuffer(audio_data, dtype=np.int8)
Hence deciding on a standard bit depth that the system will always look for, will help eliminate overflows because of incorrect typecasting.
It is also recommended to not to take the byte order for granted when reading/writing audio data. Even though the underlying codec may take into account the system's byte order, for the paranoid ones, it is better to get fixed on one standard order, say little endian.
Number of channels can depend on the actual application for which the pre-processing step is done. For speech recognition let's say, an input to a neural net is typically a single channel. In case of a stereo input, each channel can form distinct inputs to the neural net. Or the channels could be merged together to form a mono audio. However, this is an application specific choice.
A standard way to load/convert input audio
To make sure nothing goes wrong in your audio pre-processing pipeline, it would be the safest to assume none of your inputs is in the right format and always go for a standard format conversion routine. Below would be a set of useful ffmpeg options using ffmpeg-python to standardize the incoming input:
import ffmpeg import numpy as np stream = ffmpeg.input("sample.wav") # set the output sample rate is 16000 stream = ffmpeg.filter(stream, "aresample", osr=16000) # set num channels = 1, bit depth to 16-bit int(s16), byte order to little endian(le) stream = ffmpeg.output(stream, "pipe:", ac=1, acodec="pcm_s16le", format="s16le") out, err = ffmpeg.run(stream, quiet=True) # load it with proper data type (int16) audio_array = np.frombuffer(out, dtype=np.int16)
import soundfile as sf sf.write("sample_out.wav", audio_array, samplerate=16000, subtype="PCM_16", endian="LITTLE")
The raw array data however is the starting point for further pre-processing which depend on the downstream experiment/application. They can be converted to signal processing features such as spectrogram, MFCC, etc. which are supported by libraries such as librosa, torchaudio, etc.