# Lesson 7: Text to Speech

- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:

```
    !pip install transformers
    !pip install gradio
    !pip install timm
    !pip install inflect
    !pip install phonemizer
    
```

**Note:**  `py-espeak-ng` is only available on Linux operating systems.

To run locally on a Linux machine, follow these commands:
```
    sudo apt-get update
    sudo apt-get install espeak-ng
    pip install py-espeak-ng
```

### Build the `text-to-speech` pipeline using the 🤗 Transformers Library

- Here is some code that suppresses warning messages.

In [2]:
# !pip install transformers
# !pip install gradio
# !pip install timm
# !pip install inflect
# !pip install phonemizer

In [3]:
from transformers.utils import logging

logging.set_verbosity_error()

In [4]:
from transformers import pipeline

# This checkpoint supports English only
narrator = pipeline("text-to-speech",
                    model="kakao-enterprise/vits-ljs")

Info about [kakao-enterprise/vits-ljs](https://huggingface.co/kakao-enterprise/vits-ljs)

In [5]:
text = """
Researchers at the Allen Institute for AI, \
HuggingFace, Microsoft, the University of Washington, \
Carnegie Mellon University, and the Hebrew University of \
Jerusalem developed a tool that measures atmospheric \
carbon emitted by cloud servers while training machine \
learning models. After a model’s size, the biggest variables \
were the server’s location and time of day it was active.
"""

In [6]:
!apt-get install espeak
# espeak is a system package, not a Python package on PyPI, so pip cannot install it.
# Homebrew (brew) is only an option on a local machine, so use apt-get here.



In [82]:
narrated_text = narrator(text)



In [8]:
from IPython.display import Audio as IPythonAudio

IPythonAudio(narrated_text["audio"][0],
             rate=narrated_text["sampling_rate"])
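
The cell above reads two fields from the pipeline output: the `audio` samples and the `sampling_rate`. As a minimal sketch of how those two values relate, the helper below (the name `audio_duration_seconds` is hypothetical and not part of Transformers) computes the clip length in seconds. A synthetic sine wave stands in for real model output so the snippet runs without downloading a model.

```python
import math

def audio_duration_seconds(audio, sampling_rate):
    """Return the clip length in seconds.

    Accepts either a flat sequence of samples (N,) or a sequence with a
    leading batch axis (1, N), which is the shape indexed with [0] above.
    Illustrative helper only; it is not a Transformers API.
    """
    first = audio[0] if len(audio) else None
    # A nested first element means there is a leading batch axis to skip
    samples = first if hasattr(first, "__len__") else audio
    return len(samples) / sampling_rate

# Synthetic stand-in for model output: 2 seconds of a 440 Hz tone, shaped (1, N)
rate = 22050
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(2 * rate)]
fake_audio = [tone]

print(audio_duration_seconds(fake_audio, rate))
```

With real output, the call would be `audio_duration_seconds(narrated_text["audio"], narrated_text["sampling_rate"])`. Passing the wrong `rate` to the audio widget changes playback speed and pitch, which is why the rate is taken from the pipeline output rather than typed in by hand.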

### Try it yourself!
- Try this model with your own text to speech examples!

In [80]:
text_Chinese = """生活就像一盒巧克力，你永远不知道下一颗是什么味道。\
早上赖床五分钟也许会错过一班公交，但却可能换来一次偶遇老朋友的惊喜。\
偶尔的小迷糊、犯点小错，没什么大不了的。开心的时候，笑得大声一点；\
难过的时候，吃点好吃的。别忘了，快乐是生活的主旋律，偶尔跑调也没关系。只要保持微笑，一切都会好起来的！"""


In [83]:
# Try the Chinese text with the same kakao-enterprise/vits-ljs narrator
narrated_chinese = narrator(text_Chinese)

IPythonAudio(narrated_chinese["audio"][0],
             rate=narrated_chinese["sampling_rate"])
# This does not give usable Chinese speech (it may even raise an error),
# because kakao-enterprise/vits-ljs mainly works with English.
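
Because the English-only narrator cannot handle Chinese input, it can help to check the text before sending it to the model. The sketch below is one simple way to do that, assuming that "contains a character from the main CJK Unified Ideographs block" is a good enough signal. The function name `contains_cjk` is hypothetical, and the check does not distinguish Chinese from Japanese kanji.

```python
def contains_cjk(text):
    """Return True if any character falls in the main CJK Unified Ideographs block.

    Illustrative heuristic only: it covers U+4E00 to U+9FFF and ignores
    extension blocks, full-width punctuation, and other scripts.
    """
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

print(contains_cjk("Life is like a box of chocolates."))
print(contains_cjk("生活就像一盒巧克力"))
```

A notebook could use this check to decide whether to translate the text first or pass it straight to the English narrator.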

In [84]:
# Workaround: translate the Chinese text to English with Helsinki-NLP/opus-mt-zh-en, then narrate the translation

In [94]:
# Putting it all together: Chinese text -> English text -> English speech.
# 1. Translate the Chinese text to English with Helsinki-NLP/opus-mt-zh-en.
# 2. Send the translated text to an English TTS pipeline.
from transformers import pipeline
import soundfile as sf  # for saving audio to disk
import numpy as np
from IPython.display import Audio

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
english = translator(text_Chinese)[0]["translation_text"]
print(english)

tts_pipeline = pipeline("text-to-speech", model="facebook/mms-tts-eng")
audio_output = tts_pipeline(english)
# Squeeze drops any leading batch axis so the waveform is a flat NumPy array
waveform_np = np.squeeze(audio_output["audio"])
# Play it with the rate the pipeline reports instead of a hard-coded 16000
Audio(waveform_np, rate=audio_output["sampling_rate"])

Life is like a box of chocolate, and you never know what it's like to be next. Five minutes of bed in the morning may miss a bus, but it may be a surprise to meet an old friend. It's not a big deal to make a little bit of a mistake. When you're happy, laugh louder; when you're sad, eat something good. Don't forget, happiness is the main melody of life, and sometimes it's okay to run around. Everything will be fine if you just smile.
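
Translation models accept only a limited number of tokens per input, so a long passage may be cut off if it is sent in one piece. One common workaround is to translate sentence by sentence. The sketch below only shows the splitting step, assuming that splitting on Chinese sentence-ending punctuation is acceptable. The function name `split_chinese_sentences` is hypothetical.

```python
import re

def split_chinese_sentences(text):
    """Split text after Chinese sentence-ending punctuation, keeping the punctuation.

    Illustrative helper only. The zero-width lookbehind split requires
    Python 3.7 or newer.
    """
    parts = re.split(r"(?<=[。！？；])", text)
    return [part.strip() for part in parts if part.strip()]

sample = "开心的时候，笑得大声一点；难过的时候，吃点好吃的。只要保持微笑，一切都会好起来的！"
for sentence in split_chinese_sentences(sample):
    print(sentence)

# With the `translator` pipeline from the cell above, each piece could then be
# translated separately and joined, for example:
# english = " ".join(translator(s)[0]["translation_text"] for s in split_chinese_sentences(text_Chinese))
```

Shorter inputs keep each call within the model's limit, at the cost of losing some context between sentences.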


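In [75]:
# Save a real WAV file; soundfile (imported above as sf) writes the header and samples
sf.write("english_speech.wav",
         np.squeeze(audio_output["audio"]),
         audio_output["sampling_rate"])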
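
A playable WAV file needs a header (channel count, sample width, sample rate) in front of the samples, not just the raw bytes of the array. As a minimal sketch of what that involves, the helper below writes 16-bit mono PCM using only Python's standard `wave` module. The function name `write_wav_int16` and the file name `tone_example.wav` are hypothetical, and the input is assumed to be float samples in the range -1.0 to 1.0, with a synthetic tone standing in for model output.

```python
import math
import struct
import wave

def write_wav_int16(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Illustrative helper only; a library such as soundfile does the same job
    with less code.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        frames += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))

# Synthetic stand-in for model output: 1 second of a 440 Hz tone at 16 kHz
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
write_wav_int16("tone_example.wav", tone, rate)
```

With real pipeline output, the flattened waveform and its reported sampling rate would be passed in place of `tone` and `rate`.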
