# 11-process-audio
> Getting audio files in the correct format

In this notebook, we evaluate the state of the audio files and convert them to the correct format if necessary.  This code is based on the findings of [26-transcribe-audio.ipynb](https://github.com/vanderbilt-data-science/wise/blob/26-transcribe-audio-files/26-transcribe-audio.ipynb) at commit `df09abc`.

Our objective is to use `wav2vec2` or a similar audio transformer model.  According to the documentation for the `wav2vec2` model we plan to use (e.g., https://huggingface.co/facebook/wav2vec2-base-960h), the first requirement is that the audio is sampled at 16 kHz.  This notebook gets the audio into that format and checks that the functionality we need is available.

**To regenerate the resampled audio files**:
Our resampled audio files are stored under `cleaned_data/resampled_audio_16khz`.  To regenerate them, uncomment the final code cell in this notebook and run it.  This will regenerate all of the files.

In [None]:
#all_no_test
#modeling packages
import soundfile as sf
import torch
import librosa

#data science packages
import pandas as pd
import numpy as np

#other python packages
import os.path
import glob

# Load data
Here, we'll just obtain the list of audio files by using `glob` to list the contents of the audio directory.  We won't load the actual audio until we have to, since the data is non-trivially sized.

In [None]:
#File constants
base_prefix = os.path.expanduser('~/Box/DSI Documents/')
audio_files_prefix = base_prefix + 'Audio Files & Tanscripts/Audio Files/' 

In [None]:
#get list of audio files
audio_files_list = glob.glob(audio_files_prefix + '*.wav') + glob.glob(audio_files_prefix + '*.mp3')

#print info about them
print('Number of audio files available:', len(audio_files_list))
audio_files_list[:3]

Number of audio files available: 109


['C:\\Users\\bellcs1/Box/DSI Documents/Audio Files & Tanscripts/Audio Files\\008-1.wav',
 'C:\\Users\\bellcs1/Box/DSI Documents/Audio Files & Tanscripts/Audio Files\\008-2.wav',
 'C:\\Users\\bellcs1/Box/DSI Documents/Audio Files & Tanscripts/Audio Files\\008-3.wav']
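The paths printed above mix `/` and `\\` separators, so any code that pulls out the file name should handle both. A minimal sketch using the standard library's `pathlib.PureWindowsPath`, which treats both characters as separators regardless of the operating system it runs on (the helper name `file_id_from_path` is hypothetical and not part of this notebook):

```python
from pathlib import PureWindowsPath

def file_id_from_path(path_str):
    """Return (file name, stem) for a Windows-style path string."""
    p = PureWindowsPath(path_str)
    return p.name, p.stem

# One of the paths shown in the output above
example = 'C:\\Users\\bellcs1/Box/DSI Documents/Audio Files & Tanscripts/Audio Files\\008-1.wav'
name, stem = file_id_from_path(example)
print(name, stem)  # 008-1.wav 008-1
```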

# Check the sampling rate
Here, we'll check the sampling rate of all of the audio files and see if we need to fix any.  Again, keep in mind that the sampling rate we want is 16 kHz.

Note that the following code will fail with `NoBackendError` if we read something other than `.wav` files (e.g., `.mp3` files).  For the following to work, you may need to run `conda install -c conda-forge ffmpeg` to get the mp3 backend.

In [None]:
#get sampling rates (see above if you get an error here)
sampling_rates = [librosa.get_samplerate(audio_file) for audio_file in audio_files_list]
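Only non-`.wav` files need the extra backend, so it can help to count the files by extension before reading them. The following is a rough sketch using only the standard library. The function name `count_extensions` and the sample paths are made up for illustration; in the notebook you would pass `audio_files_list` instead.

```python
from collections import Counter
from pathlib import PureWindowsPath

def count_extensions(paths):
    """Count files per lower-cased extension (e.g. '.wav', '.mp3')."""
    return Counter(PureWindowsPath(p).suffix.lower() for p in paths)

# Hypothetical paths, for illustration only
sample_paths = ['Audio Files\\008-1.wav',
                'Audio Files\\008-2.WAV',
                'Audio Files\\example.mp3']
ext_counts = count_extensions(sample_paths)
print(ext_counts)
```

If the count for `.mp3` is zero, the `NoBackendError` caveat should not apply to this run.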

In [None]:
#put the details into a pandas DataFrame for simplicity
audio_details = pd.DataFrame({'file_id': [os.path.basename(audio_file) for audio_file in audio_files_list],
                              'sampling_rate': sampling_rates,
                              'file_path': audio_files_list})

#get value counts for the sampling rates
audio_details['sampling_rate'].value_counts()

44100    74
11025    28
22050     7
Name: sampling_rate, dtype: int64

Wow, fantastic.  It looks like we have anything and everything other than 16 kHz.  No problem - we'll resample using `librosa` and write these files back to the `cleaned_data` directory.
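To make concrete what resampling means for each of the three rates found, here is a small arithmetic sketch. It only computes the exact ratio between the target and source rates and does not touch any audio; the helper name `resample_ratio` is an illustration, not part of the notebook.

```python
from fractions import Fraction

TARGET_RATE = 16000  # rate required by the wav2vec2 model card

def resample_ratio(source_rate, target_rate=TARGET_RATE):
    """Exact ratio of output samples to input samples."""
    return Fraction(target_rate, source_rate)

for rate in (44100, 22050, 11025):
    ratio = resample_ratio(rate)
    direction = 'downsample' if ratio < 1 else 'upsample'
    # Content above half the lower of the two rates cannot be represented
    max_freq = min(rate, TARGET_RATE) / 2
    print(f'{rate} Hz: ratio {ratio}, {direction}, content up to {max_freq} Hz')
```

Note that the 11025 Hz files are upsampled rather than downsampled, so after conversion they still only carry content up to about 5.5 kHz.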

# Resample and Save Audio
Given the above, we need to resample the audio to 16 kHz and save it to a new location: `cleaned_data/resampled_audio_16khz`.  The filenames will stay the same, apart from the extension, which is always `.wav`.  As shown in the prior commit, we can use the librosa `load` function for resampling.

Note:  This is **NOT** fast!  Resampling each file takes about 30 seconds.

In [None]:
#constants
sampling_rate = 16000
resampled_audio_base = base_prefix + 'cleaned_data/resampled_audio_16khz/'

In [None]:
#A function to do the resampling and saving
def resample_save(audio_fpath, sample_rate, save_directory, verbose=False):
    
    #resample while loading
    audio_signal, sr = librosa.load(audio_fpath, sr=sample_rate) #sr must be passed by keyword in recent librosa versions
    
    #create name of file
    fname = os.path.basename(audio_fpath) #returns 008-1.wav or similar
    fname = os.path.splitext(fname)[0] + '.wav' #make sure it is saved as a wave file
    
    #save file
    sf.write(save_directory + fname, audio_signal, samplerate=sample_rate, format='WAV')
    if verbose:
        print('Saving', save_directory + fname, 'done.')
    
    return fname, save_directory + fname, sr
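Since each file takes a while to resample, a re-run may not need to redo files that are already saved. One possible approach, sketched here under the assumption that an existing output file of the same name is valid, is to filter the input list first. The helper `pending_files` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import PureWindowsPath

def pending_files(input_paths, save_directory):
    """Return the input paths whose resampled .wav is not yet in save_directory."""
    pending = []
    for path in input_paths:
        # Same naming rule as resample_save: original stem plus '.wav'
        out_name = PureWindowsPath(path).stem + '.wav'
        if not os.path.exists(os.path.join(save_directory, out_name)):
            pending.append(path)
    return pending

# Possible usage with the names defined in this notebook:
# to_do = pending_files(audio_files_list, resampled_audio_base)
```

Because every output is renamed to `.wav`, two inputs with the same stem but different extensions would map to the same output file, so this check treats them as one.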

In [None]:
#actually do the resampling and saving
#resamp_results = [resample_save(audio_file, sampling_rate, resampled_audio_base, verbose=True) for audio_file in audio_files_list]

Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/008-1.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/008-2.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/008-3.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/027-1.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/027-2.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/027-3.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/038-2.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/038-3.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/038-4.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/046-1.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio



Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/088-1a.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/088-1b.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/088-2.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/088-3.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/088-4.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/105-1.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/105-2.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/105-3.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/105-4.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_audio_16khz/107-1.wav done.
Saving C:\Users\bellcs1/Box/DSI Documents/cleaned_data/resampled_aud

OK, things are looking good here.  I've confirmed that the data is now saved on Box and plays back correctly.  Normally we would not leave long outputs like these in a notebook, but this one is kept because of the warning raised in one of the conversions.  The problem is likely `088-1a`, so we should keep an eye on it.
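Part of this confirmation can be automated by reading the sampling rate back from each saved file. This is a minimal sketch using the standard library `wave` module. It assumes the saved files are PCM-encoded WAV (which, as far as I know, is what `soundfile` writes by default for `format='WAV'`), and the helper names are hypothetical.

```python
import glob
import os
import wave

def wav_sample_rate(path):
    """Read the sampling rate from a PCM WAV file header."""
    with wave.open(path, 'rb') as wav_file:
        return wav_file.getframerate()

def files_not_at_rate(directory, expected_rate=16000):
    """Map file name -> actual rate for every .wav that is not at expected_rate."""
    bad = {}
    for path in sorted(glob.glob(os.path.join(directory, '*.wav'))):
        rate = wav_sample_rate(path)
        if rate != expected_rate:
            bad[os.path.basename(path)] = rate
    return bad

# Possible usage with the directory defined in this notebook:
# print(files_not_at_rate(resampled_audio_base))
```

An empty dictionary would mean every file in the directory reports 16 kHz. It says nothing about audio quality, so a file like `088-1a` would still need a listen.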