## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/541970/text-to-speech-conversion-using-hugging-face-transformers

For my other articles on Daniweb.com, please see this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Installing and Importing Required Libraries

In [None]:
!git clone https://github.com/myshell-ai/MeloTTS.git
%cd MeloTTS
!pip install -e .
!python -m unidic download


Cloning into 'MeloTTS'...
remote: Enumerating objects: 410, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 410 (delta 147), reused 122 (delta 117), pack-reused 233
Receiving objects: 100% (410/410), 6.04 MiB | 15.95 MiB/s, done.
Resolving deltas: 100% (203/203), done.
/content/MeloTTS
Obtaining file:///content/MeloTTS
  Preparing metadata (setup.py) ... done
Collecting txtsplit (from melotts==0.1.2)
  Downloading txtsplit-1.0.0.tar.gz (6.2 kB)
  Preparing metadata (setup.py) ... done
Collecting cached_path (from melotts==0.1.2)
  Downloading cached_path-1.6.2-py3-none-any.whl (35 kB)
Collecting transformers==4.27.4 (from melotts==0.1.2)
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 57.8 MB/s eta 0:00:00
Collecting mecab-python3==1.0.5 (from melotts==0.1.2)
  Download

## A Basic Example

In [None]:
from melo.api import TTS

speed = 1.0


device = 'auto' # Will automatically use GPU if available

# English
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

# American accent
output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)

 > Text split to sentences.
In this video, you will learn about Large Language Models. This is going to be fun :


100%|██████████| 1/1 [00:00<00:00,  4.16it/s]


## Trying Different Accents

In [None]:
print(speaker_ids)

{'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}
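Since `speaker_ids` maps each accent name to an integer ID, one way to try every accent in a single pass is to derive an output file name from each key and loop over the result. The sketch below works on a plain dictionary copied from the printed output above so that it runs without a loaded model. The helper name `accent_output_paths` and the file-naming scheme are illustrative assumptions, not part of MeloTTS.

```python
def accent_output_paths(speaker_ids, suffix=".wav"):
    """Map each speaker key to a lowercase, hyphenated file name (hypothetical helper)."""
    return {
        name: name.lower().replace("_", "-") + suffix
        for name in speaker_ids.keys()
    }

# Plain-dict stand-in for model.hps.data.spk2id, copied from the output above.
example_speaker_ids = {'EN-US': 0, 'EN-BR': 1, 'EN_INDIA': 2, 'EN-AU': 3, 'EN-Default': 4}
paths = accent_output_paths(example_speaker_ids)

for name, path in paths.items():
    print(f"{name} -> {path}")
    # With a loaded model, each accent could then be synthesized like this:
    # model.tts_to_file(text, speaker_ids[name], path, speed=speed)
```

Uncommenting the last line inside the loop, with `model`, `text`, `speaker_ids`, and `speed` defined as in the cells of this notebook, would write one file per accent.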


In [None]:
speed = 1.0

device = 'auto'

# English Indian Accent
text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

output_path = 'en-in.wav'
model.tts_to_file(text, speaker_ids['EN_INDIA'], output_path, speed=speed)



 > Text split to sentences.
In this video, you will learn about Large Language Models. This is going to be fun.


100%|██████████| 1/1 [00:00<00:00,  3.93it/s]


## Trying Different Languages

In [None]:
speed = 1.0

device = 'auto'

# French
model = TTS(language='FR', device=device)
speaker_ids = model.hps.data.spk2id
print(speaker_ids)

{'FR': 0}


In [None]:
text = "Dans cette vidéo, vous allez apprendre sur les Large Language Models. Ça va être amusant."
output_path = 'fr.wav'
model.tts_to_file(text, speaker_ids['FR'], output_path, speed=speed)

 > Text split to sentences.
Dans cette vidéo, vous allez apprendre sur les Large Language Models. Ça va être amusant.


100%|██████████| 1/1 [00:00<00:00,  3.32it/s]


## Adjusting Speech Speed

In [None]:
speed = 5


text = "In this video, you will learn about Large Language Models. This is going to be fun."
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id

output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)

 > Text split to sentences.
In this video, you will learn about Large Language Models. This is going to be fun.


100%|██████████| 1/1 [00:00<00:00,  6.07it/s]
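A quick way to see what the `speed` parameter did is to compare the durations of the generated files. The sketch below reads a WAV header with Python's standard `wave` module. `wav_duration_seconds` is a hypothetical helper name, and the code assumes the file is PCM-encoded WAV, which is the only kind `wave` can read. The demo writes one second of silence to a scratch file so that it runs without MeloTTS.

```python
import os
import tempfile
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

# Self-contained demo: write one second of 16-bit mono silence, then measure it.
sample_rate = 44100
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(b"\x00\x00" * sample_rate)

demo_duration = wav_duration_seconds(demo_path)
print(demo_duration)
```

Applied to this notebook, `wav_duration_seconds('en-us.wav')` could be called once after the `speed = 1.0` run and once after the `speed = 5` run. Both runs write to the same path, so the first value would need to be recorded before the file is overwritten.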


In [None]:
help(model.tts_to_file)  # Inspect the method's signature; calling it with no arguments raises a TypeError

## Putting It All Together

In [None]:
def generate_speech(text, language = "EN", speaker_id = "EN-US", speed = 1.0, audio_path = "speech.wav"):

  model = TTS(language=language,
              device='auto')

  speaker_ids = model.hps.data.spk2id

  model.tts_to_file(text,
                    speaker_ids[speaker_id],
                    audio_path,
                    speed=speed)

In [None]:
generate_speech(text = "Hello, this is test speech",
                speaker_id = "EN_INDIA",
                speed = 3,
                audio_path = "indian_speech.wav")

 > Text split to sentences.
Hello, this is test speech


100%|██████████| 1/1 [00:00<00:00,  6.13it/s]
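Because `generate_speech` constructs a new `TTS` object on every call, repeated calls reload the model each time. One possible variation, sketched below, caches the loaded model per language and device with `functools.lru_cache`. The names `get_model` and `generate_speech_cached` are hypothetical, and the sketch assumes the `TTS` constructor and `tts_to_file` behave exactly as used in the cells above.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_model(language="EN", device="auto"):
    """Load a MeloTTS model once per (language, device) pair and reuse it."""
    from melo.api import TTS  # imported lazily, only when a model is first requested
    return TTS(language=language, device=device)

def generate_speech_cached(text, language="EN", speaker_id="EN-US",
                           speed=1.0, audio_path="speech.wav"):
    """Same steps as generate_speech, but reuses an already-loaded model."""
    model = get_model(language)
    speaker_ids = model.hps.data.spk2id
    model.tts_to_file(text, speaker_ids[speaker_id], audio_path, speed=speed)
```

With this variation, two consecutive calls such as `generate_speech_cached("Hello")` and `generate_speech_cached("Goodbye")` would share one English model, at the cost of keeping every loaded model in memory.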
