# Use Google Text-To-Speech to generate a dataset for keyword spotting


### Local Software Requirements
- Python 3
- Pip package manager 
- Jupyter Notebook: https://jupyter.org/install
- pip packages (install with `pip install `*`packagename`*):
    - pydub https://pypi.org/project/pydub/
    - google-cloud-texttospeech  https://cloud.google.com/python/docs/reference/texttospeech/latest
    - requests https://pypi.org/project/requests/




In [19]:
# Imports
import os
import shutil
import json
import time
import io
import random
from typing import cast
from pydub import AudioSegment
from google.cloud import texttospeech


## Set up Google TTS API
First, set up an Edge Impulse account and create your first project.
You will also need a Google Cloud account with the Text-to-Speech API enabled: https://cloud.google.com/text-to-speech. The first million characters generated each month are free (WaveNet voices), which should be plenty for most cases since you only need to generate your dataset once.
From Google Cloud, download a service account credentials JSON file and point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at it so the Python API can authenticate: https://developers.google.com/workspace/guides/create-credentials#service-account


In [20]:

# Insert the path to your service account API key json file here for google cloud
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './emo-consens-bot-c0925968d3c8.json'
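Before running the remaining cells, it can help to confirm that the configured path points at a readable key file. The following is a minimal sketch using only the standard library. `looks_like_service_account_key` is a hypothetical helper name, and the check assumes the key file is JSON with a `"type": "service_account"` field.

```python
import json
import os


def looks_like_service_account_key(path: str) -> bool:
    """Return True if `path` is a readable JSON file that looks like a service account key."""
    if not os.path.isfile(path):
        return False
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Unreadable file or invalid JSON
        return False
    return isinstance(data, dict) and data.get("type") == "service_account"


# Check whatever path is currently configured (empty string if unset)
configured_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
if not looks_like_service_account_key(configured_path):
    print("Credentials file not found or not a service account key:", repr(configured_path))
```

This only inspects the file locally. It does not confirm that the key is still valid or that the Text-to-Speech API is enabled for the project.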


## Generate the desired samples
First off we need to set our desired keywords and labels:


In [21]:

# Keyword or short sentence and label (e.g. 'hello world')
keyword = [
    {'string': 'hey', 'label': 'hey'},
    {'string': 'pepper', 'label': 'pepper'},
]


Then we need to set up the parameters for our speech dataset. All possible combinations of these will be iterated through:
- voices - Choose the text to speech voice languages to use (https://cloud.google.com/text-to-speech/docs/voices)
- pitches - Which voice pitches to apply
- speakingRates - Which speaking speeds to apply

In [22]:
voices = [
    'de-DE-Wavenet-A',
    'de-DE-Wavenet-B',
    'de-DE-Wavenet-C',
    'de-DE-Wavenet-D',
    'de-DE-Wavenet-E',
    'de-DE-Wavenet-F',
    'en-US-Wavenet-A',
    'en-US-Wavenet-B',
    'en-US-Wavenet-C',
    'en-US-Wavenet-D',
    'en-US-Wavenet-E',
    'en-US-Wavenet-F',
]
'''Voices to use for the generated audio'''
pitches = [-2, 0, 2]
'''Pitches to generate (in semitones) range: [-20.0, 20.0]'''
speakingRates = [0.9, 1, 1.1]
'''Speaking rates to use range: [0.25, 4.0]'''



'Speaking rates to use range: [0.25, 4.0]'
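To get a feel for how quickly the number of combinations grows, here is a small sketch using `itertools.product`. The `example_*` lists are shortened, illustrative stand-ins for the lists above, not the values the notebook uses.

```python
from itertools import product

# Shortened, illustrative copies of the parameter lists defined above
example_voices = ['de-DE-Wavenet-A', 'en-US-Wavenet-A']
example_pitches = [-2, 0, 2]
example_rates = [0.9, 1, 1.1]

# Every (pitch, voice, speakingRate) triple is one candidate sample per keyword
combinations = list(product(example_pitches, example_voices, example_rates))
print(f'{len(combinations)} combinations per keyword')
```

With the full lists in this notebook (12 voices, 3 pitches, 3 speaking rates) the same calculation gives 108 combinations per keyword.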


Then provide some other key parameters:
- out_length - How long each output sample should be, in seconds
- count - Maximum number of samples to generate per keyword (restricts the output if there are more combinations of voices, pitches and speaking rates than this)
- voice_dir - Where to store the clean samples before noise is added
- noise_files - Which noise files (read from the local `noise/` folder) to apply to your samples
- output_folder - The final output location of the noisy samples
- num_copies - How many different noisy versions of each sample to create
- max_noise_level - A negative value in dB; each noise clip is attenuated by a random amount between 0 dB and this limit before mixing



In [32]:
out_length = 1 
'''Out length minimum (default: 1s)'''
count = 100
'''Maximum number of samples to generate per keyword'''
all_opts_filename = 'all_opts.json'
voice_dir = 'out-wav' 
'''Raw sample output directory'''
noise_files = ['bathroom_1.wav', 'cafeteria_1_quieter.wav', 'crowd_1_quieter.wav','fan_1.wav', 'fan_2.wav',  'homeoffice_1_quieter.wav', 'homeoffice_2.wav', 'office_1_quieter.wav', 'static_1.wav']
output_folder = 'out-noisy'
num_copies = 6 
''' Number of noisy copies to create for each input sample '''
max_noise_level = -5  
'''Maximum noise level to add in dBFS (negative value)'''


'Maximum noise level to add in dBFS (negative value)'
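The mixing cell later in this notebook draws a random value between `max_noise_level` and 0 and subtracts its absolute value from the noise clip, which in pydub reduces the gain by that many dB. As a rough illustration of what such an attenuation means for the signal amplitude, here is a minimal sketch. `db_to_amplitude_ratio` is a hypothetical helper, not part of pydub.

```python
import random


def db_to_amplitude_ratio(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude ratio."""
    return 10 ** (gain_db / 20)


example_max_noise_level = -5  # same value as max_noise_level above

# Draw an attenuation between 0 dB and 5 dB, as the mixing step does
attenuation_db = abs(random.uniform(example_max_noise_level, 0))
ratio = db_to_amplitude_ratio(-attenuation_db)
print(f'Attenuating noise by {attenuation_db:.2f} dB scales its amplitude by {ratio:.2f}')
```

At the 5 dB limit the noise amplitude is scaled to roughly 0.56 of its original value, so the noise stays clearly audible in every copy.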

Then we need to check all the output folders are ready

In [24]:

# Create the output directory for noisy files if it doesn't exist
os.makedirs(output_folder, exist_ok=True)
# Create the output directory for raw voices if it doesn't exist
os.makedirs(voice_dir, exist_ok=True)


And load the background noise files from the local `noise/` folder:

In [25]:
# Load each noise file once and keep its filename for naming the outputs
noise_audios: list[tuple[AudioSegment, str]] = [
    (AudioSegment.from_file(os.path.join('noise', filename), format='wav'), filename)
    for filename in noise_files
]

Then we can generate a list of all possible parameter combinations based on the input earlier. If you have set `count` to be smaller than the number of combinations, a random subset of `count` options is kept for each keyword:

In [26]:

# Generate all combinations of parameters
all_opts = []
for kw in keyword:
    keyword_opts = []
    for p in pitches:
        for v in voices:
            for s in speakingRates:
                keyword_opts.append({
                        "pitch": p,
                        "voice": v,
                        "speakingRate": s,
                        "text": kw['string'],
                        "label": kw['label']
                    })
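    # Keep at most `count` options per keyword (random.sample raises ValueError if asked for more than exist)
    all_opts += random.sample(keyword_opts, min(count, len(keyword_opts)))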
    
print(f'Generating {len(all_opts)} samples')

all_opts_file = os.path.join('./', all_opts_filename)
all_opts_info = {
    "version": 1,
    "files": all_opts
}
# Output the metadata file
with open(all_opts_file, "w") as f:
    json.dump(all_opts_info, f)

Generating 200 samples
[{'pitch': -2, 'voice': 'en-US-Wavenet-B', 'speakingRate': 0.9, 'text': 'hey', 'label': 'hey'}, {'pitch': -2, 'voice': 'en-US-Wavenet-F', 'speakingRate': 0.9, 'text': 'hey', 'label': 'hey'}, {'pitch': 0, 'voice': 'de-DE-Wavenet-D', 'speakingRate': 0.9, 'text': 'hey', 'label': 'hey'}, {'pitch': 0, 'voice': 'de-DE-Wavenet-B', 'speakingRate': 1.1, 'text': 'hey', 'label': 'hey'}, {'pitch': 2, 'voice': 'en-US-Wavenet-F', 'speakingRate': 1, 'text': 'hey', 'label': 'hey'}, {'pitch': 0, 'voice': 'de-DE-Wavenet-A', 'speakingRate': 0.9, 'text': 'hey', 'label': 'hey'}, {'pitch': -2, 'voice': 'en-US-Wavenet-A', 'speakingRate': 0.9, 'text': 'hey', 'label': 'hey'}, {'pitch': -2, 'voice': 'de-DE-Wavenet-D', 'speakingRate': 1.1, 'text': 'hey', 'label': 'hey'}, {'pitch': 2, 'voice': 'en-US-Wavenet-D', 'speakingRate': 1, 'text': 'hey', 'label': 'hey'}, {'pitch': 0, 'voice': 'de-DE-Wavenet-A', 'speakingRate': 1, 'text': 'hey', 'label': 'hey'}, {'pitch': 0, 'voice': 'de-DE-Wavenet-A
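Because the subset is drawn at random, two runs of the notebook will generally pick different options. If a repeatable selection is useful, one possibility is to sample through a seeded `random.Random` instance. The sketch below is an illustration under that assumption: `sample_options` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import random


def sample_options(options: list, limit: int, seed=None) -> list:
    """Pick at most `limit` distinct options; return them all (shuffled) if there are fewer."""
    rng = random.Random(seed)
    return rng.sample(options, min(limit, len(options)))


example_opts = [{'pitch': p, 'speakingRate': s} for p in (-2, 0, 2) for s in (0.9, 1, 1.1)]

# Only 9 options exist, so asking for 100 returns all 9
print(len(sample_options(example_opts, 100)))
# The same seed yields the same selection on every call
print(sample_options(example_opts, 4, seed=0) == sample_options(example_opts, 4, seed=0))
```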

Finally we iterate through all the options generated, call the Google TTS API to generate the desired sample, and apply noise to it, saving locally with metadata:

In [27]:
def clear_directory(directory):
    # Check each item in the directory
    for item in os.listdir(directory):
        item_path = os.path.join(directory, item)
        try:
            if os.path.isfile(item_path) or os.path.islink(item_path):
                os.unlink(item_path)  # Remove files and links
            elif os.path.isdir(item_path):
                shutil.rmtree(item_path)  # Remove directories
        except Exception as e:
            print(f'Failed to delete {item_path}. Reason: {e}')

In [28]:

clear_directory(voice_dir)

In [29]:
# Instantiates a client
client = texttospeech.TextToSpeechClient()

ix = 0
for o in all_opts:
    ix += 1
    # Set the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(text=o['text'])
    # Build the voice request
    voice = texttospeech.VoiceSelectionParams(
        language_code=o['voice'][:5],
        name=o['voice'],
    )
    # Select the type of audio file you want returned
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        pitch=o['pitch'],
        speaking_rate=o['speakingRate'],
        sample_rate_hertz=16000
    )
    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type

    # Use the same flat '<label>.<voice>-<pitch>-<rate>' naming that the noise step reads back
    wav_file_name = f"{voice_dir}/{o['label']}.{o['voice']}-{o['pitch']}-{o['speakingRate']}.tts.wav"

    if not os.path.exists(wav_file_name):
        print(f"[{ix}/{len(all_opts)}] Text-to-speeching...")
        response = client.synthesize_speech(
            input=synthesis_input, voice=voice, audio_config=audio_config
        )
        with open(wav_file_name, "wb") as f:
            f.write(response.audio_content)
        has_hit_api = True
        # Pause briefly after each request to avoid hammering the API
        time.sleep(0.5)
    else:
        print(f'skipping {wav_file_name}')
        has_hit_api = False

[1/200] Text-to-speeching...
[2/200] Text-to-speeching...
[3/200] Text-to-speeching...
[4/200] Text-to-speeching...
[5/200] Text-to-speeching...
[6/200] Text-to-speeching...
[7/200] Text-to-speeching...
[8/200] Text-to-speeching...
[9/200] Text-to-speeching...
[10/200] Text-to-speeching...
[11/200] Text-to-speeching...
[12/200] Text-to-speeching...
[13/200] Text-to-speeching...
[14/200] Text-to-speeching...
[15/200] Text-to-speeching...
[16/200] Text-to-speeching...
[17/200] Text-to-speeching...
[18/200] Text-to-speeching...
[19/200] Text-to-speeching...
[20/200] Text-to-speeching...
[21/200] Text-to-speeching...
[22/200] Text-to-speeching...
[23/200] Text-to-speeching...
[24/200] Text-to-speeching...
[25/200] Text-to-speeching...
[26/200] Text-to-speeching...
[27/200] Text-to-speeching...
[28/200] Text-to-speeching...
[29/200] Text-to-speeching...
[30/200] Text-to-speeching...
[31/200] Text-to-speeching...
[32/200] Text-to-speeching...
[33/200] Text-to-speeching...
[34/200] Text-to-sp
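Once the clean samples are on disk, it may be worth spot-checking that they have the expected sample rate before adding noise. The sketch below uses the standard library `wave` module and assumes the saved files carry a standard WAV header. `wav_info` is a hypothetical helper, and the silent clip is stand-in data rather than real TTS output.

```python
import io
import wave


def wav_info(wav_bytes: bytes) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for WAV data held in memory."""
    with wave.open(io.BytesIO(wav_bytes), 'rb') as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


# Build half a second of 16 kHz mono silence as stand-in data (not real TTS output)
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b'\x00\x00' * 8000)

print(wav_info(buffer.getvalue()))
```

To inspect a generated file instead, read its bytes with `open(path, 'rb').read()` and pass them to `wav_info`.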

In [30]:
clear_directory(output_folder)

In [33]:
# Instantiate list for file label information
downloaded_files = []

for o in all_opts:
    wav_file_name = f"{voice_dir}/{o['label']}.{o['voice']}-{o['pitch']}-{o['speakingRate']}.tts.wav"
    voice_audio: AudioSegment = AudioSegment.from_file(wav_file_name)
    # Add silence to match output length with random padding
    difference = (out_length * 1000) - len(voice_audio)
    if difference > 0:
        padding_before = random.randint(0, difference)
        padding_after = difference - padding_before
        voice_audio = AudioSegment.silent(duration=padding_before) +  voice_audio + AudioSegment.silent(duration=padding_after)

    
    for noise_audio, noise_filename in random.sample(noise_audios, num_copies):
        # Save noisy sample to output folder
        output_filename = f"{o['label']}.{o['voice']}-{o['pitch']}-{o['speakingRate']}_{noise_filename.split('.')[0]}.wav"
        output_path = os.path.join(output_folder, output_filename)
        if not os.path.exists(output_path):
            # Select random section of noise and random noise level
            start_time = random.randint(0, len(noise_audio) - len(voice_audio))
            end_time = start_time +len(voice_audio)
            noise_level = random.uniform(max_noise_level, 0)

            # Extract selected section of noise and adjust volume
            noise_segment = cast(AudioSegment, noise_audio[start_time:end_time])
            noise_segment = noise_segment - abs(noise_level)

            # Mix voice sample with noise segment
            mixed_audio = voice_audio.overlay(noise_segment)
            # Save mixed audio to file
            mixed_audio.export(output_path, format='wav')

            print(f'Saved mixed audio to {output_path}')
        else:
            print(f'skipping {output_path}')
        # Save metadata for file
        downloaded_files.append({
            "path": str(output_filename),
            "label": o['label'],
            "category": "split",
            "metadata": {
                "pitch": str(o['pitch']),
                "voice": o['voice'],
                "speakingRate": str(o['speakingRate']),
                "text": o['text'],
                "imported_from": "Google Cloud TTS"
            }
        })


print("Done text-to-speeching")
print("")

input_file = os.path.join(output_folder, 'input.json')
info_file = {
    "version": 1,
    "files": downloaded_files
}
# Output the metadata file
with open(input_file, "w") as f:
    json.dump(info_file, f)

Saved mixed audio to out-noisy/hey.en-US-Wavenet-B--2-0.9_office_1_quieter.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-B--2-0.9_fan_2.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-B--2-0.9_homeoffice_2.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-B--2-0.9_bathroom_1.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-B--2-0.9_crowd_1_quieter.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-B--2-0.9_static_1.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-F--2-0.9_fan_2.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-F--2-0.9_homeoffice_2.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-F--2-0.9_office_1_quieter.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-F--2-0.9_crowd_1_quieter.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-F--2-0.9_static_1.wav
Saved mixed audio to out-noisy/hey.en-US-Wavenet-F--2-0.9_bathroom_1.wav
Saved mixed audio to out-noisy/hey.de-DE-Wavenet-D-0-0.9_homeoffice_2.wav
Saved mixed audio to out-noisy/hey.de-
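The random padding step in the cell above places each keyword at a different position inside the fixed-length window. The arithmetic can be sketched on its own without pydub. `split_padding` is a hypothetical helper, and the 430 ms clip length is a made-up example value.

```python
import random


def split_padding(clip_ms: int, target_ms: int, rng: random.Random) -> tuple[int, int]:
    """Return (padding_before, padding_after) in ms so the clip fills target_ms."""
    difference = target_ms - clip_ms
    if difference <= 0:
        return 0, 0  # clip is already long enough; this sketch does not trim it
    before = rng.randint(0, difference)
    return before, difference - before


rng = random.Random(42)
before, after = split_padding(clip_ms=430, target_ms=1000, rng=rng)
# The two paddings plus the clip always add up to the target length
print(before, after, before + 430 + after)
```

As in the notebook cell, clips that are already longer than the target are left unpadded, so they stay longer than `out_length`.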