# Refining Raw Audio

In this tutorial, you will learn how to apply `VocalForge.audio` pipelines to audio files.

Each pipeline removes (or at least attempts to remove) poor or inappropriate audio from each file to better prime it for dataset creation, or whatever other purpose you have in mind. The pipelines can be run in a different order, and some can be skipped entirely. It's up to you!

The models generally consist of a neural network designed to identify a specific kind of audio, then mark timestamps for its removal. Let's go over the ones currently supported to better illustrate VocalForge's usefulness:

- `Voice Detection` removes segments of audio in which no human sounds are found. Say there is a long segment of city noise, or a musical intro to a podcast: all of this is removed. This not only gets rid of the non-human audio, but also reduces the time the subsequent pipelines need to process what is left.

- `Overlap` covers speech in which two or more people talk at the same time. Not only does it forcibly remove egotistical people trying to take over a conversation, but it *also* removes poor audio from podcasts or other casual conversational settings.

- `Isolate` is one of the less straightforward pipelines. It goes through and separates each speaker in each audio file. From there, you can specify a speaker you want to target, and it will find that same speaker across each audio file, even in different recording environments, such as a recording studio and a park.

- `Export` is really just there to tie everything up with a nice little bow. Given a directory, it formats the audio's sample rate, and can optionally normalize and noise-reduce the audio.
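
To make the "mark timestamps, then remove" idea concrete, here is a minimal sketch in plain Python. It is not VocalForge's implementation: the `keep_segments` helper and the toy signal are made up for illustration, and the segments are assumed to be non-overlapping `(start, end)` pairs in seconds.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def keep_segments(samples: Sequence[float], sample_rate: int,
                  segments: List[Segment]) -> List[float]:
    """Return only the samples that fall inside the given time segments.

    Hypothetical helper: assumes the segments do not overlap each other.
    """
    kept: List[float] = []
    for start, end in sorted(segments):
        # Convert seconds to sample indices and clamp them to the signal.
        first = max(0, int(round(start * sample_rate)))
        last = min(len(samples), int(round(end * sample_rate)))
        kept.extend(samples[first:last])
    return kept


# Toy signal: 10 "samples" at 2 samples per second (5 seconds of audio).
audio = list(range(10))
# Pretend a detector marked 0.0-1.0 s and 3.0-4.5 s as worth keeping.
trimmed = keep_segments(audio, sample_rate=2, segments=[(0.0, 1.0), (3.0, 4.5)])
print(trimmed)
```

Everything outside the marked segments is simply dropped, which is also why the later pipelines have less audio to chew through.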

More pipelines are coming soon™

NOTE: If you are running locally, it is highly recommended to run this in a conda environment, which you can create with the command
`conda create -n VocalForge python=3.8 pytorch=1.11.0 torchvision=0.12.0 torchaudio=0.11.0 cudatoolkit=11.3.1 -c pytorch`

### Getting Started

First, let's create our work directory and install `VocalForge`.

In [None]:
from pathlib import Path

root_path = Path.cwd()  # Gets current working directory
print(root_path)

work_audio_path = root_path / 'work' / 'audio'  # Constructs a new path

work_audio_path.mkdir(parents=True, exist_ok=True)  # Creates all missing parents in the path (does not raise any exceptions if the directory already exists)

In [None]:
from VocalForge.audio.audio_utils import create_core_folders

root_path = Path.cwd()
work_audio_path = root_path / 'work' / 'audio'

folder_names = ['RawAudio', 'Samples', 'VD', 'Overlap', 'Verification',
                'Isolated', 'Exported', 'Noise_Removed', 'Normalized']

# Pass the work directory to 'create_core_folders' as a string (built with pathlib rather than 'os.path.join')
create_core_folders(folder_names, workdir=str(work_audio_path))
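
If you are curious what a helper like this boils down to, here is a rough stand-in using only `pathlib`. I am assuming that `create_core_folders` simply creates one sub-folder per name under `workdir`; the `make_folders` function below is a hypothetical equivalent, not the library's code.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def make_folders(names: Iterable[str], workdir: str) -> List[Path]:
    """Create one sub-folder per name under workdir and return their paths."""
    created = []
    for name in names:
        folder = Path(workdir) / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(folder)
    return created


# Try it in a throwaway directory so nothing lands in your real work folder.
with tempfile.TemporaryDirectory() as tmp:
    folders = make_folders(['RawAudio', 'Samples', 'VD'], workdir=tmp)
    folder_names_created = sorted(p.name for p in Path(tmp).iterdir())
    print(folder_names_created)
```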

Alright cool, that's all taken care of. Now, for the sake of our demo, we will download a YouTube playlist of Joe Biden. You could also link your own playlist, or simply drop your own local wav files into the `RawAudio` folder.

In [None]:
from VocalForge.audio.audio_utils import download_videos

work_path = Path.cwd() / 'work' / 'audio'

download_videos(
    url='https://www.youtube.com/playlist?list=PLAVNH_8nglubKvZ8bdiEjf9IKKB73SvIy', 
    out_dir=str(work_path / 'RawAudio')
)


For actual production, we would want to process all the audio we can get our grubby hands on. But for the purposes of our demo, we will trim each audio file down to 5 minutes using the `create_samples` function.

In [None]:
from VocalForge.audio.audio_utils import create_samples

work_path = Path.cwd() / 'work' / 'audio'

create_samples(
    length=300,
    input_dir=str(work_path / 'RawAudio'),
    output_dir=str(work_path / 'Samples'),
)
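
For a feel of what trimming involves, here is a small sketch that cuts a WAV file down to a maximum length using only Python's standard `wave` module. The `trim_wav` helper is hypothetical and says nothing about how `create_samples` is actually implemented; the demo file is just two seconds of generated silence.

```python
import tempfile
import wave
from pathlib import Path


def trim_wav(src: str, dst: str, length: float) -> int:
    """Copy at most `length` seconds from src to dst; return frames written."""
    with wave.open(src, 'rb') as reader:
        params = reader.getparams()
        max_frames = int(length * reader.getframerate())
        frames = reader.readframes(min(max_frames, reader.getnframes()))
    with wave.open(dst, 'wb') as writer:
        writer.setparams(params)    # same channels, sample width and rate
        writer.writeframes(frames)  # the header frame count is patched on write
    return len(frames) // (params.sampwidth * params.nchannels)


with tempfile.TemporaryDirectory() as tmp:
    src = str(Path(tmp) / 'long.wav')
    dst = str(Path(tmp) / 'short.wav')
    # Build a 2-second, 8 kHz, 16-bit mono file of silence to trim.
    with wave.open(src, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b'\x00\x00' * 16000)
    written = trim_wav(src, dst, length=1.0)
    with wave.open(dst, 'rb') as r:
        trimmed_frames = r.getnframes()
    print(written, trimmed_frames)
```

In the tutorial's call, `length=300` plays the same role as `length=1.0` here: 300 seconds is the 5 minutes mentioned above.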


In [None]:
from IPython.display import Audio

work_path = Path.cwd() / 'work' / 'audio'

Audio(str(work_path / 'Samples' / 'DATA0.wav'))

### Voice Activity

Initialize the class, setting the directory the input files are read from and the directory the filtered files are written to.

In [None]:
from VocalForge.audio import VoiceDetection

work_path = Path.cwd() / 'work' / 'audio'

VD = VoiceDetection(
    input_dir=str(work_path / 'Samples'),
    output_dir=str(work_path / 'VD'),
)

VD.run()


Alright! Let's check out the timeline of an audio file to see what parts got deleted.

In [None]:
VD.timelines[0]

In [None]:
from IPython.display import Audio

work_path = Path.cwd() / 'work' / 'audio'

Audio(str(work_path / 'VD' / 'DATA0.wav'))


Let's say that the audio highlighted in red has too many short breaks, which cause abrupt cuts in the audio. We can fix this by adjusting some model parameters, namely the `min_duration_off` and `min_duration_on` values.

In [None]:

HYPER_PARAMETERS = {
  # onset/offset activation thresholds
  "onset": 0.2, "offset": 0.6,
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 1.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 1.0
}

The default values are normally:

- `onset: 0.5`
- `offset: 0.5`
- `min_duration_on: 0.0`
- `min_duration_off: 0.0`
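
To see what the `onset` and `offset` thresholds do, here is a simplified sketch of threshold-based segmentation. It assumes the model produces one speech score between 0 and 1 per frame; the `hysteresis_segments` helper and the scores are made up for illustration and are not taken from VocalForge.

```python
from typing import List, Tuple


def hysteresis_segments(scores: List[float], onset: float,
                        offset: float) -> List[Tuple[int, int]]:
    """Turn per-frame speech scores into (start_frame, end_frame) regions.

    A region opens when the score rises above `onset` and closes when it
    drops below `offset`. End frames are exclusive.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if start is None and score > onset:
            start = i                    # speech begins
        elif start is not None and score < offset:
            regions.append((start, i))   # speech ends
            start = None
    if start is not None:
        regions.append((start, len(scores)))  # still speaking at the end
    return regions


scores = [0.1, 0.3, 0.7, 0.55, 0.4, 0.1, 0.6, 0.9]
default_regions = hysteresis_segments(scores, onset=0.5, offset=0.5)
custom_regions = hysteresis_segments(scores, onset=0.2, offset=0.6)
print(default_regions)
print(custom_regions)
```

With `onset=0.2` and `offset=0.6`, regions open on weaker evidence but also close sooner, so the same scores produce more, shorter regions in this toy example.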

You can change any of these values to make the model a little more or less liberal in what it counts as speech (see what I did there?). The same parameters can also be used for overlapping speech; however, this feature does not exist for isolating voices.
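
Here is a small sketch of how `min_duration_off` and `min_duration_on` can smooth out a list of detected speech regions. The `apply_min_durations` helper is hypothetical, and I am assuming short gaps are filled first and short regions are dropped afterwards.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def apply_min_durations(segments: List[Segment], min_duration_on: float,
                        min_duration_off: float) -> List[Segment]:
    """Fill short non-speech gaps, then drop short speech regions."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        # Fill the gap if it is shorter than min_duration_off.
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Remove speech regions shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


speech = [(0.0, 2.0), (2.4, 5.0), (8.0, 8.3), (12.0, 15.0)]
untouched = apply_min_durations(speech, min_duration_on=0.0, min_duration_off=0.0)
smoothed = apply_min_durations(speech, min_duration_on=1.0, min_duration_off=1.0)
print(untouched)
print(smoothed)
```

With both values at `1.0`, the 0.4 second pause is bridged and the 0.3 second blip disappears, which is the kind of change that gets rid of abrupt cuts.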

### Overlapping Removal

In [None]:
# Overlap Detection
from VocalForge.audio import Overlap

OD = Overlap(
    input_dir=str(work_path / 'VD'),
    output_dir=str(work_path / 'Overlap')
)
OD.run()

In [None]:
OD.timelines[0]

In [None]:
from IPython.display import Audio

work_path = Path.cwd() / 'work' / 'audio'

Audio(str(work_path / 'Overlap' / 'DATA0.wav'))
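
Conceptually, overlap removal needs to know where two or more people are talking at once. Assuming you already have per-speaker time segments (the speaker labels and times below are invented), those regions can be found with a simple sweep over start and end events. This illustrates the idea only, not the neural model the pipeline uses.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def find_overlaps(turns: Dict[str, List[Segment]]) -> List[Segment]:
    """Return regions where two or more speakers are active at once."""
    events = []
    for segments in turns.values():
        for start, end in segments:
            events.append((start, 1))   # a speaker starts talking
            events.append((end, -1))    # a speaker stops talking
    # Sort by time; at equal times, stops (-1) come before starts (+1).
    events.sort()
    overlaps: List[Segment] = []
    active = 0
    overlap_start = None
    for time, delta in events:
        active += delta
        if active >= 2 and overlap_start is None:
            overlap_start = time
        elif active < 2 and overlap_start is not None:
            overlaps.append((overlap_start, time))
            overlap_start = None
    return overlaps


turns = {
    'SPEAKER_00': [(0.0, 4.0), (6.0, 9.0)],
    'SPEAKER_01': [(3.0, 6.0), (8.5, 10.0)],
}
overlaps = find_overlaps(turns)
print(overlaps)
```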

### Speaker Isolation

In [None]:
from VocalForge.audio.isolate import Isolate
from pathlib import Path
work_path = Path.cwd() / 'work' / 'audio'

IV = Isolate(
    input_dir=str(work_path / 'Overlap'),
    verification_dir=str(work_path / 'Verification'),  # this is where the separated voices will be saved
    output_dir=str(work_path / 'Isolated'),  # this is where the targeted voice will be saved
)

In [None]:
IV.isolate_speakers()

In [None]:
from IPython.display import Audio

work_path = Path.cwd() / 'work' / 'audio'

Audio(str(work_path / 'Verification' / 'DATA0' / 'SPEAKER_00.wav'))

In [None]:
Audio(str(work_path / 'Verification' / 'DATA0' / 'SPEAKER_01.wav'))

In [None]:
IV.create_target_embedding(str(work_path / 'Verification' / 'DATA0' / 'SPEAKER_00.wav'), 'joe_biden')

In [None]:
IV.group_audios_by_speaker(threshold=0.25)
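
One way to picture the `threshold` argument: each separated voice gets an embedding (a vector of numbers), and a voice is grouped with the target when its embedding is close enough. The sketch below assumes cosine distance and that lower means closer. Both are my assumptions for illustration, since the metric used internally is not covered here, and the three-number "embeddings" are made up.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_speakers(target: Sequence[float],
                   candidates: Dict[str, Sequence[float]],
                   threshold: float) -> List[str]:
    """Return the candidate names whose embedding is within the threshold."""
    return [name for name, emb in candidates.items()
            if cosine_distance(target, emb) <= threshold]


# Toy embeddings; real speaker embeddings have hundreds of dimensions.
target = [1.0, 0.0, 0.0]
candidates = {
    'DATA2_SPEAKER_01': [0.9, 0.1, 0.0],
    'DATA2_SPEAKER_00': [0.0, 1.0, 0.0],
    'DATA4_SPEAKER_01': [0.8, 0.0, 0.3],
}
matches = match_speakers(target, candidates, threshold=0.25)
print(matches)
```

Under these assumptions, a lower threshold makes the matching stricter, and a higher one lets in more voices at the risk of including the wrong speaker.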

In [None]:
from IPython.display import Audio

work_path = Path.cwd() / 'work' / 'audio'

Audio(str(work_path / 'Isolated' / 'joe_biden' / 'DATA2_SPEAKER_01.wav'))

In [None]:
Audio(str(work_path / 'Isolated' / 'joe_biden' / 'DATA4_SPEAKER_01.wav'))

In [None]:
Audio(str(work_path / 'Isolated' / 'joe_biden' / 'DATA5_SPEAKER_00.wav'))

In [None]:
Audio(str(work_path / 'Isolated' / 'joe_biden' / 'DATA6_SPEAKER_01.wav'))

Now to export. This is how we can define the final output of the wav files. 

Declaring a directory for `noise_removed_dir` will apply DeepFilterNet2 to each audio file to reduce noise. I find that this specific NN works better than solutions like the Adobe Podcast Audio Upscaler for tasks like TTS training, or any other application that requires natural-sounding audio.

`normalization_dir`, if declared, will export a normalized copy of the exported audio.
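
If normalization is new to you, here is the simplest flavor, peak normalization, sketched in plain Python. I am showing it only to illustrate the concept: the method `ExportAudio` actually applies is not specified here, and the `peak_normalize` helper and its `target_peak` parameter are hypothetical.

```python
from typing import List, Sequence


def peak_normalize(samples: Sequence[float],
                   target_peak: float = 0.95) -> List[float]:
    """Scale samples so the loudest one has magnitude `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet = [0.0, 0.1, -0.25, 0.2]
louder = peak_normalize(quiet, target_peak=0.5)
print(louder)
```

Every sample is multiplied by the same gain, so the shape of the waveform is unchanged and only the overall level moves.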

In [None]:
from VocalForge.audio import ExportAudio
from pathlib import Path

work_path = Path.cwd() / 'work' / 'audio'

exported = ExportAudio(
    input_dir=str(work_path / 'Isolated' / 'joe_biden'),
    output_dir=str(work_path / 'Exported'),
    noise_removed_dir=str(work_path / 'Noise_Removed'),
    normalization_dir=str(work_path / 'Normalized'),
)

In [None]:
exported.noise_remove()

In [None]:
from IPython.display import Audio

work_path = Path.cwd() / 'work' / 'audio'

Audio(str(work_path / 'Noise_Removed' / 'DATA0_SPEAKER_00.wav'))

In [None]:
exported.normalize()

In [None]:
exported.create_samples(max_seconds=120)

And you're done! Well, sort of. While this process does a pretty good job, to get the best results you will want to check the output manually. As I add more filters, the process will hopefully become more precise, reducing the time needed to review the output. But for now, stay vigilant.

Next, we will be going over how to format this now refined audio into a dataset ready and prepped for a NN. Stay tuned!