# Mouth noise detection and remediation
## Problem overview
The focus of this workbook is identifying mouth noises in spoken audio, because touching these up manually is a real pain. These are just the minor clicks your mouth generates when saliva catches on various parts of it as you speak. They're not really noticeable in day-to-day conversation, but they really stick out in podcast/voiceover recordings. Somehow, I could not find any good tools online for doing this automatically, which is really surprising to me. Everything I could find was about removing general background noise, not these minor clicks.

From experience, I can usually spot these just by eyeballing the waveform and spectrogram, which seem to be reliable cues for when one is occurring: generally an overlaid frequency spike with a duration of between 8 and 60 (median ~25) ten-thousandths of a second, i.e. 0.8 to 6ms (median ~2.5ms). They are particularly common on “a” sounds, at the end of a sentence, or before “the” sounds (although they can appear in many other locations).

The solution for these (once identified) in Audacity is to use the “repair” tool over the specified section (or, since the section is commonly longer than the 128-sample limit, to use it repeatedly across the duration) to “smooth” the soundwave of that section based on the brief sample before and after it. Alternatively, just cut the section entirely, picking points before and after with equal positions on the waveform. Because these sounds are so short, it is almost impossible to notice the cut.
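
To put those durations next to the 128-sample limit, here is a quick bit of arithmetic. It assumes a 44100Hz recording (the rate used later in this notebook); `repair_passes` is a made-up helper name, not anything from Audacity.

```python
import math

SAMPLE_RATE = 44100   # Hz, assumed recording rate
REPAIR_LIMIT = 128    # samples, the limit of Audacity's "repair" tool mentioned above


def repair_passes(duration_ten_thousandths, sr=SAMPLE_RATE, limit=REPAIR_LIMIT):
    """Return (length in samples, number of repair passes needed) for a click."""
    samples = round(duration_ten_thousandths * sr / 10_000)
    return samples, math.ceil(samples / limit)


for label, duration in [("shortest", 8), ("median", 25), ("longest", 60)]:
    samples, passes = repair_passes(duration)
    print(f"{label}: {samples} samples, {passes} repair pass(es)")
```

So at this sample rate a 60 ten-thousandths click is 265 samples long and needs three passes, while the shortest ones fit in a single pass.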

I recorded a ~22 minute piece of sample audio, where I intentionally didn’t use any of the “just be a virtuous person” bullshit “mouth noise reducing tips” espoused on all the Google search results. So far I have gone through the first 6:36 of the recording and manually annotated the start and rough end times of 464 mouth sounds, which is where I got the duration values above. I *was* going to train an ML tool on this dataset, but it turned out my naive benchmark below did the job just fine.

(I have a separate private notebook where there is a bunch of me screwing around figuring out how to work with sound files, what Fourier transform functions actually do, etc. I've cleared all that out for this public version, but if you want to improve on it, I'd suggest diving into the librosa library and getting a feel for all that stuff.)

## Naïve approach

This is done by loading a sound file through librosa, cutting it into 4ms chunks each starting 1ms after the other, and then analysing how much high-frequency content each chunk has (the summed magnitude of its frequencies between 5,000 and 10,000Hz by default). If a given chunk has substantially more of it than the chunks abutting it (e.g. more than 3x that of both the chunk ending where it starts and the chunk starting where it ends), it will be flagged as a likely click.
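
To get a feel for the idea before reading the real code, here is a toy version of it in plain NumPy. It is only a sketch: the signal is synthetic (silence plus a 2ms burst at 8kHz, which I made up), and it uses `np.fft.rfft` on each chunk rather than the `librosa.stft` call used in the notebook, so the numbers will not match the real thing.

```python
import numpy as np

sr = 44100
chunk_size = int(sr * 0.004)  # 4ms chunks -> 176 samples
hop_size = int(sr * 0.001)    # each chunk starts ~1ms after the last -> 44 samples

# Toy signal (an assumption for illustration): 0.2s of silence with a 2ms burst of 8kHz at 0.1s
signal = np.zeros(int(0.2 * sr))
click_start = int(0.1 * sr)
click_len = int(0.002 * sr)
burst_t = np.arange(click_len) / sr
signal[click_start:click_start + click_len] = 0.3 * np.sin(2 * np.pi * 8000 * burst_t)

# Cut into overlapping chunks: one row per chunk
chunks = np.lib.stride_tricks.sliding_window_view(signal, chunk_size)[::hop_size]

# Summed magnitude of each chunk between 5kHz and 10kHz
spectrum = np.abs(np.fft.rfft(chunks, axis=1))
freqs = np.fft.rfftfreq(chunk_size, d=1 / sr)
in_band = (freqs > 5000) & (freqs < 10000)
band_mag = spectrum[:, in_band].sum(axis=1)

# Flag chunks that beat both the chunk 4 hops before and the chunk 4 hops after by 3x
offset = 4
flags = np.zeros(len(band_mag), dtype=bool)
current = band_mag[offset:-offset]
flags[offset:-offset] = (current > 3 * band_mag[:-2 * offset]) & (current > 3 * band_mag[2 * offset:])

flagged_times = np.flatnonzero(flags) * hop_size / sr
print(flagged_times)
```

The flagged chunks all sit around the 0.1s mark where the burst was planted. Real speech is much messier than silence, which is why the threshold factor and frequency band are parameters in the functions below.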

Once the entire recording has been searched and the chunks classified, overlapping flagged chunks are merged into a single section, and the start of that section is then nudged towards the point with the most high-frequency energy. The result is the definitive "mouth noise" label for that section.
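
The merging step on its own is just interval merging. A minimal sketch in plain Python, with made-up timestamps and a hypothetical `merge_overlapping` helper (the notebook does the same thing with a loop over a dataframe):

```python
def merge_overlapping(spans):
    """Merge (start, end) spans that overlap or touch into single spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps the span being built, so extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap, so start a new span
            merged.append([start, end])
    return [tuple(span) for span in merged]


# Three flagged 4ms chunks 1ms apart, then an unrelated one later on
flagged = [(0.100, 0.104), (0.101, 0.105), (0.102, 0.106), (0.250, 0.254)]
print(merge_overlapping(flagged))
```

The first three chunks collapse into one span from 0.100 to 0.106, and the last one stays on its own.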

For this to work, you need a folder called "WAV" in the same directory as this notebook, with the file to be analysed saved in there.
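
If you want to check the folder is set up before running anything, something like this would do. `list_wav_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def list_wav_files(folder):
    """Return the .wav files in a folder, or an empty list if the folder is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.wav"))


# The "WAV" folder is expected to sit next to the notebook
wav_files = list_wav_files(Path.cwd() / "WAV")
print(wav_files if wav_files else "No WAV folder (or no .wav files in it) found")
```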

In [8]:
import librosa
import numpy as np
import pandas as pd
import os
import time 

# Set a variable for the current notebook's path for various loading/saving mechanisms
nb_path = os.getcwd()

import warnings
warnings.filterwarnings('ignore')

def gross_mag_in_threshold(chunk, sr=44100, band_start=5000, band_end=10000):
    """Calculate the total intensity of frequencies within a chunk using Fourier Transform.
    chunk - array of floats - a librosa loaded sound object containing sound data to be analyzed
    sr - int - sample rate of the chunk
    band_start - int - the minimum frequency band to include (exclusive, in Hz)
    band_end - int - the maximum frequency band to include (exclusive, in Hz)
    """
    stft = librosa.stft(chunk, n_fft=len(chunk))
    
    # Get the frequency bin indices
    freq_bins = librosa.fft_frequencies(sr=sr, n_fft=len(chunk))

    # Get the magnitude of each frequency band by averaging each element returned in stft
    magnit = np.abs(stft)
    av_mag = np.mean(magnit, axis=1)
    gross_mag = np.sum(av_mag * (freq_bins > band_start) * (freq_bins < band_end))
    return gross_mag


def tick_detector(chunks, start_times, chunk_ms=4, threshold_factor=3, band_start=5000, band_end=10000, sr=44100):
    """Returns a pandas DataFrame presenting timestamps within a sound file that contain ticks.
    chunks - array of overlapping librosa loaded sound objects containing sound data to be analyzed
    start_times - array of floats - timestamps (in seconds) of the time where each chunk begins
    chunk_ms - int - length (in milliseconds) of each chunk; also used as the number of chunks to step
               when picking neighbors, which assumes chunks start 1ms apart
    threshold_factor - number - how much a chunk has to exceed its neighbors by to indicate a likely tick
    band_start - int - the minimum frequency band to include
    band_end - int - the maximum frequency band to include
    sr - int - sample rate of the chunks
    """
    flags = np.zeros(len(chunks), dtype=bool) #initialize array with all False values
    
    chunk_size = int(sr * (chunk_ms / 1000))  # chunk size in samples

    chunk_mags = np.array([gross_mag_in_threshold(chunk, band_start=band_start, band_end=band_end, sr=sr)
                                   for chunk in chunks])
    
    # Compare each chunk with the chunks chunk_ms positions before and after it (e.g. chunk n is
    # compared to n-4 and n+4). With chunks starting 1ms apart, those are the non-overlapping neighbors.
    prev_frequency_mag = chunk_mags[:-(2 * chunk_ms)]
    current_frequency_mag = chunk_mags[chunk_ms:-chunk_ms]
    next_frequency_mag = chunk_mags[(2 * chunk_ms):]
    
    flags[chunk_ms:-chunk_ms] = (current_frequency_mag > threshold_factor * prev_frequency_mag) & \
                                (current_frequency_mag > threshold_factor * next_frequency_mag)

    df = pd.DataFrame({'Start Time': start_times, 'Flag': flags})
    
    # Filter the flagged chunks
    mouth_sounds = df.loc[df['Flag']].copy()
    # Calculate the end times - need to use chunk_size because of rounding issues
    mouth_sounds['End Time'] = mouth_sounds['Start Time'] + (chunk_size / sr)
    # Reset the index
    mouth_sounds.reset_index(drop=True, inplace=True)

    return mouth_sounds


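def generate_mouth_sounds_labels(filepath, output_file_name='output_file.txt', sample_rate=None):
    """Detect likely mouth noises in a sound file and write them to a tab delimited label file for Audacity.
    filepath - str - path to the sound file to analyse
    output_file_name - str - name of the label file to write
    sample_rate - int - sample rate to load the file at (None keeps the file's own rate)
    """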
    kickoff_time = time.time()
    print(time.strftime('%X %x %Z'))

    test1, sr = librosa.load(filepath, sr=sample_rate)
    
    # cutting it into chunks of "chunk_ms" length each starting 1ms after the other
    chunk_ms = 4
    chunk_size = int(sr * (chunk_ms / 1000))
    hop_size = int(sr * 0.001)
    
    chunks = np.array([test1[i:i+chunk_size] for i in range(0, len(test1) - chunk_size, hop_size)])
    # Build start times from integer chunk indices; np.arange with a float step can return one element too many
    start_times = np.arange(len(chunks)) * hop_size / sr

    df_2 = tick_detector(chunks, start_times, threshold_factor=2, band_end=12000, sr=sr)
    
    # Collect the consolidated (merged) chunks in a list first
    merged_spans = []
    
    # Initialize variables for tracking the current chunk
    current_start = None
    current_end = None

    # Iterate through each row in the dataframe
    for index, row in df_2.iterrows():
        if current_start is None:
            # If it's the first chunk, set the current chunk
            current_start = row['Start Time']
            current_end = row['End Time']
        elif row['Start Time'] <= current_end:
            # If the current chunk overlaps with the next chunk, extend the current chunk
            current_end = row['End Time']
        else:
            # If the next chunk is not overlapping, store the consolidated chunk
            merged_spans.append({'Start Time': current_start, 'End Time': current_end})
            # Set the next chunk as the current chunk
            current_start = row['Start Time']
            current_end = row['End Time']

    # Add the last consolidated chunk, unless nothing was flagged at all
    if current_start is not None:
        merged_spans.append({'Start Time': current_start, 'End Time': current_end})

    # DataFrame.append was removed in pandas 2.0, so build the dataframe from the list in one go
    consolidated_chunks = pd.DataFrame(merged_spans, columns=['Start Time', 'End Time'])
    
    
    
    # iterate over each chunk 4 steps each delayed 0.5ms to find the maximum energy in high frequency bands to narrow
    # down the starting point. This part is ripe for optimisation. I had 1 attempt but it didn't really work out.
    
    # Iterate over each row in the dataframe
    for index, row in consolidated_chunks.iterrows():
        start_time = row['Start Time']
        end_time = row['End Time']

        # Find the corresponding samples for the start and end times
        start_sample = int(start_time * sr)
        end_sample = int(end_time * sr)

        # Extract the chunk from the original loaded .wav object
        chunk = test1[start_sample:end_sample]

        # Create four new chunks with staggered start times (0.5ms apart)
        chunk_offsets = [0, int(0.5e-3 * sr), int(1e-3 * sr), int(1.5e-3 * sr)]
        chunks = [chunk[offset:] for offset in chunk_offsets]

        # Calculate the energy for each chunk within the desired frequency band
        energies = [gross_mag_in_threshold(chunk, sr=sr) for chunk in chunks]

        # Find the index of the chunk with the maximum energy, then step back one offset (never below 0)
        max_energy_index = max(0, np.argmax(energies)-1)

        # Update the start time based on the selected chunk
        new_start_time = start_time + chunk_offsets[max_energy_index] / sr

        # Update the dataframe with the new start time
        consolidated_chunks.loc[index, 'Start Time'] = new_start_time
    
    #export the dataframe to a tab delimited text file for import into Audacity
    consolidated_chunks['Index'] = consolidated_chunks.index
    
    consolidated_chunks.to_csv(output_file_name, sep='\t', columns=['Start Time', 'End Time', 'Index'], index=False)

    print(time.strftime('%X %x %Z'))
    print("time taken: --- %s seconds ---" % (time.time() - kickoff_time))
    return None


In [9]:
generate_mouth_sounds_labels(os.path.join(nb_path, 'WAV', 'test 30s.wav'), "30s44100.txt")

10:42:55 06/27/23 AUS Eastern Standard Time
10:43:10 06/27/23 AUS Eastern Standard Time
time taken: --- 14.295339584350586 seconds ---


In [11]:
generate_mouth_sounds_labels(os.path.join(nb_path, 'WAV', 'test 30s.wav'), "30s22050Hz.txt", sample_rate=22050)

10:44:36 06/27/23 AUS Eastern Standard Time
10:44:49 06/27/23 AUS Eastern Standard Time
time taken: --- 12.711917638778687 seconds ---


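The highest frequency a recording can represent is half its sample rate (the Nyquist limit), so I would not want to go below 22050: at that rate the ceiling is 11025Hz, which already clips the top of the 5000-12000Hz band used above. The smaller sample rate is slightly faster, but identifies 9 fewer mouth noises (not sure if any of these are false positives in the 44100 sr output).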
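
The effect of the sample rate on the frequency band can be worked out directly. A small sketch, where `effective_band` is a hypothetical helper and the 5000/12000 defaults are the `band_start` and `band_end` values that `generate_mouth_sounds_labels()` ends up using:

```python
def effective_band(sr, band_start=5000, band_end=12000):
    """Return the part of the band that survives at a given sample rate."""
    nyquist = sr / 2  # highest frequency representable at this sample rate
    return band_start, min(band_end, nyquist)


for sr in (44100, 22050):
    low, high = effective_band(sr)
    print(f"{sr}Hz: band {low}-{high:.0f}Hz ({high - low:.0f}Hz wide)")
```

At 44100 the full 7000Hz-wide band is available. At 22050 it shrinks to 6025Hz, so the two runs are not looking at quite the same thing.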
   
Further opportunities:   
 - Set up the loaded sound file as a class.
 - Optimise the "fine tuning" section of "generate_mouth_sounds_labels()" to use vectors instead of loops.
 - Extend the "fine tuning" section to trim the tail of the identified clip, not just the lead.
 - Multithreading. Since the "process" is comparing a set of independent chunks, it would be very possible to divide the total sound file into n threads and simultaneously process each thread. Potential bottleneck on memory?
 - Add a "clean" function which duplicates the mechanism of the "repair" function in Audacity to re-write the identified "mouthsound" datapoints with a smoothed interpolated function and export a new sound file. This way all processing is done in one space rather than flitting between programs.
 - Write this as a .py file that can be used via terminal instead of a notebook.
 - Be less lazy.
 - Write up a summary of the mechanics of all this so someone can learn and actually implement these improvements.
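
As a starting point for the "clean" function idea above, here is a minimal sketch that overwrites a span of samples with a straight line between its neighbours. To be clear, this is plain linear interpolation that I've swapped in for illustration, not a copy of what Audacity's "repair" actually does, and `repair_span` is a made-up name.

```python
import numpy as np


def repair_span(samples, start, end):
    """Return a copy of samples with samples[start:end] replaced by a straight line
    drawn between the sample just before the span and the sample just after it."""
    if start < 1 or end >= len(samples) or start >= end:
        raise ValueError("span needs at least one sample on either side")
    repaired = samples.copy()
    # Anchor on the last good sample before the span and the first good sample after it
    known_positions = [start - 1, end]
    known_values = [samples[start - 1], samples[end]]
    repaired[start:end] = np.interp(np.arange(start, end), known_positions, known_values)
    return repaired


# A clean ramp with three corrupted samples in the middle
ramp = np.arange(10.0)
damaged = ramp.copy()
damaged[4:7] = 100.0
print(repair_span(damaged, 4, 7))
```

On real audio a straight line is probably too crude for the longer clicks, but since the labels come out as start and end times, multiplying them by the sample rate gives the `start` and `end` indices to feed in.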