### Hello there! 👋
<a target="_blank" href="https://colab.research.google.com/github/Sharonio/roboshaul/blob/main/roboshaul_usage_colab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If you're interested in using Roboshaul to generate Hebrew text-to-speech, you've come to the right place! I'll guide you through the steps so that you can start using it in no time, even if you're new to machine learning.

Here are the steps we'll follow in this tutorial:

1. Import necessary Python libraries
2. Download the trained version of the Roboshaul TTS model
3. Download the trained version of the spectrogram-to-wav model, trained on Shaul Amsterdamski's voice
4. Connect all the components and test the system by generating Hebrew text and hearing Roboshaul speak it out loud

Let's get started! By the end, you'll be able to use our trained model and get results similar to the ones on this demo page:
https://anonymous19283746.github.io/saspeech/

The infrastructure we'll be using is Coqui TTS. You can learn more about it here:
https://github.com/coqui-ai/TTS

In [1]:
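# Import the submodule explicitly: a bare `import importlib` does not guarantee
# that importlib.util is loaded.
import importlib.util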

!pip install numpy=='1.22.4'

if not importlib.util.find_spec("TTS"):
    !git clone https://github.com/shenberg/TTS
    !pip install -e TTS



#### Import necessary Python libraries

In [2]:
import os
import sys
import subprocess
import signal

from pathlib import Path
from IPython.display import Audio

from google.colab import auth
auth.authenticate_user()
from googleapiclient.discovery import build
drive_service = build('drive', 'v3')
from googleapiclient.http import MediaIoBaseDownload

def download_file_from_gdrive(file_id, dest_file_name):
    request = drive_service.files().get_media(fileId=file_id)
    with open(dest_file_name, 'wb') as downloaded:
        downloader = MediaIoBaseDownload(downloaded, request)
        done = False
        while not done:
            # status reports how much of the file has been downloaded so far;
            # done becomes True once the last chunk has arrived.
            status, done = downloader.next_chunk()
            print(f"progress: {status.progress():.1%}")

#### Adding diacritics (Nikud) to Hebrew text
The input text must include Nikud for the model to turn it into good-sounding audio.

Two online tools make it easy to add Nikud:
- https://nakdan.dicta.org.il/
- https://www.nakdan.com/

(When we trained our TTS model, we used this repository to automate the process: https://github.com/elazarg/nakdimon (give it a ⭐️ on GitHub). By the way, if you're an experienced coder and want to help this repository, integrating the Nikud step into this notebook would be a meaningful contribution.)
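
Since the model expects text with Nikud, it can help to check your input before running it. Here is a minimal sketch, not part of the original notebook: the helper names `count_nikud` and `has_nikud` are made up for illustration, and the check assumes that "Nikud" means the non-spacing marks in the Unicode range U+05B0 to U+05C7 (punctuation in that range, such as maqaf, is not counted).

```python
import unicodedata


def count_nikud(text: str) -> int:
    """Count Hebrew point marks (Nikud) in text.

    Assumption: Nikud are the non-spacing marks (category "Mn")
    between U+05B0 and U+05C7.
    """
    return sum(
        1
        for ch in text
        if "\u05b0" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn"
    )


def has_nikud(text: str) -> bool:
    """Return True if the text contains at least one Nikud mark."""
    return count_nikud(text) > 0
```

You could call `has_nikud(input_text)` before generating audio and paste the text into one of the tools above if it returns `False`.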

#### Connect all the components and test the system by generating Hebrew text and hearing Roboshaul speak it out loud
- Define input text
- Download models

In [3]:
# This is the text that will be turned into audio; feel free to change it ♡
input_text =  "אַתֶּם הֶאֱזַנְתֶּם לְחַיוֹת כִּיס, הַפּוֹדְקָאסְט הַכַּלְכָּלִי שֶׁל כָּאן."

#### Download the trained version of the Roboshaul TTS model
Trained on 4 hours of Shaul Amsterdamski's voice + transcripts

In [4]:
# tts model:
model_path = Path('tts_model')
model_path.mkdir(exist_ok=True)
model_pth_path = model_path / 'saspeech_nikud_7350.pth'
model_config_path = model_path / 'config_overflow.json'

download_file_from_gdrive('1dExa0AZqmyjz8rSZz1noyQY9aF7dR8ew', model_pth_path)
download_file_from_gdrive('1eK1XR_ZwuUy4yWh80nui-q5PBifJsYfy', model_config_path)

progress: 30.7%
progress: 61.4%
progress: 92.0%
progress: 100.0%
progress: 100.0%


#### Download the trained version of the Mel-to-wav model
Trained on 30 hours of Shaul Amsterdamski's voice

In [5]:
# Mel-to-wav:
vocoder_path = Path('hifigan_model')
vocoder_path.mkdir(exist_ok=True)
vocoder_pth_path = vocoder_path / 'checkpoint_500000.pth'
vocoder_config_path = vocoder_path / 'config_hifigan.json'

download_file_from_gdrive('1XdmRRHjZ_eZOFKoAQgQ8wivrLDJnNDkh', vocoder_pth_path)
download_file_from_gdrive('1An6cTCYkxXWhagIJe3NGkoP8n2CQWQ-3', vocoder_config_path)

progress: 12.2%
progress: 24.4%
progress: 36.6%
progress: 48.7%
progress: 60.9%
progress: 73.1%
progress: 85.3%
progress: 97.5%
progress: 100.0%
progress: 100.0%
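
Before moving on, you may want to confirm that all four files actually landed on disk. The sketch below is an optional addition, and `missing_or_empty` is a hypothetical helper name, not something the notebook or Coqui TTS provides.

```python
from pathlib import Path


def missing_or_empty(paths):
    """Return the paths that do not exist as files or contain zero bytes."""
    problems = []
    for p in paths:
        p = Path(p)
        if not p.is_file() or p.stat().st_size == 0:
            problems.append(p)
    return problems
```

In this notebook you could run `missing_or_empty([model_pth_path, model_config_path, vocoder_pth_path, vocoder_config_path])`; an empty list means every download produced a non-empty file.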


In [6]:
# Where will the outputs be saved?
output_folder = "outputs"

if not os.path.exists(output_folder):
    os.makedirs(output_folder)
    print(f"Folder named {output_folder} created.")
else:
    print(f"Folder named {output_folder} already exists.")

Folder named outputs already exists.


In [7]:
def escape_dquote(s):
    return s.replace('"', r'\"')

global_p = None

def run_model(text, output_wav_path):
    global global_p
    call_tts_string = f"""CUDA_VISIBLE_DEVICES=0 tts --text "{escape_dquote(text)}" \
        --model_path {model_pth_path} \
        --config_path {model_config_path} \
        --vocoder_path {vocoder_pth_path} \
        --vocoder_config_path {vocoder_config_path} \
        --out_path "{output_wav_path}" """
    try:
        print(call_tts_string)
        p = subprocess.Popen(['bash','-c',call_tts_string],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE, start_new_session=True)
        global_p = p
        # wait up to 60 seconds for the process to finish;
        # communicate() raises TimeoutExpired if it takes longer
        stdout, stderr = p.communicate(timeout=60)
        print(stdout.decode('utf-8'))
        print(stderr.decode('utf-8'))
    except subprocess.TimeoutExpired as e:
        print(f'Timeout for {call_tts_string} (60s) expired', file=sys.stderr)
        print('Terminating the whole process group...', file=sys.stderr)
        os.killpg(os.getpgid(p.pid), signal.SIGTERM)
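
If you'd rather not worry about escaping quotes for bash, one alternative is to pass the arguments as a list so no shell is involved. This is only a sketch under a few assumptions: `build_tts_command` and `run_command` are hypothetical names, the flags are the same ones used in the cell above, and on timeout `subprocess.run` kills only the direct child process rather than the whole process group as `run_model` does.

```python
import os
import subprocess


def build_tts_command(text, model_path, config_path,
                      vocoder_path, vocoder_config_path, out_path):
    """Build the tts call as an argument list; no shell quoting is needed."""
    return [
        "tts",
        "--text", text,
        "--model_path", str(model_path),
        "--config_path", str(config_path),
        "--vocoder_path", str(vocoder_path),
        "--vocoder_config_path", str(vocoder_config_path),
        "--out_path", str(out_path),
    ]


def run_command(cmd, timeout=60):
    """Run cmd with GPU 0 visible and return the completed process.

    Raises subprocess.TimeoutExpired if the command runs longer than timeout.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout, env=env)
```

The returned object exposes `stdout`, `stderr`, and `returncode`, so you can print the logs or check whether the command failed.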

In [8]:
run_model(input_text, output_folder + "/output.wav")

CUDA_VISIBLE_DEVICES=0 tts --text "אַתֶּם הֶאֱזַנְתֶּם לְחַיוֹת כִּיס, הַפּוֹדְקָאסְט הַכַּלְכָּלִי שֶׁל כָּאן."         --model_path tts_model/saspeech_nikud_7350.pth         --config_path tts_model/config_overflow.json         --vocoder_path hifigan_model/checkpoint_500000.pth         --vocoder_config_path hifigan_model/config_hifigan.json         --out_path "outputs/output.wav" 
 > Using model: Overflow
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:True
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linea

### Listen to the result 👾

In [9]:
Audio(filename=output_folder + '/output.wav')
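
If the player doesn't show up, a quick sanity check is to read the file's length. The sketch below uses Python's standard `wave` module; `wav_duration_seconds` is a hypothetical helper name, and it assumes the output is a PCM WAV file (the `wave` module raises `wave.Error` for formats it cannot read).

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

For example, `wav_duration_seconds(output_folder + "/output.wav")` should return a few seconds for the sample sentence; a value of zero suggests the model produced no audio.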