# Lesson 7: Text to Speech

- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:

```
    !pip install transformers
    !pip install gradio
    !pip install timm
    !pip install inflect
    !pip install phonemizer
    
```

**Note:** `py-espeak-ng` is only available on Linux operating systems.

To run locally on a Linux machine, run these commands:
```
    sudo apt-get update
    sudo apt-get install espeak-ng
    pip install py-espeak-ng
```

### Build the `text-to-speech` pipeline using the 🤗 Transformers Library

- Here is some code that suppresses warning messages.

In [1]:
from transformers.utils import logging

logging.set_verbosity_error()

In [2]:
from transformers import pipeline

narrator = pipeline("text-to-speech",
                    model="./models/kakao-enterprise/vits-ljs")

In [14]:
type(narrator)

transformers.pipelines.text_to_audio.TextToAudioPipeline

Info about [kakao-enterprise/vits-ljs](https://huggingface.co/kakao-enterprise/vits-ljs)

Text-to-speech models: https://huggingface.co/models?pipeline_tag=text-to-speech

List of all the pipelines: https://huggingface.co/docs/transformers/main_classes/pipelines

In [3]:
text = """
Researchers at the Allen Institute for AI, \
HuggingFace, Microsoft, the University of Washington, \
Carnegie Mellon University, and the Hebrew University of \
Jerusalem developed a tool that measures atmospheric \
carbon emitted by cloud servers while training machine \
learning models. After a model’s size, the biggest variables \
were the server’s location and time of day it was active.
"""

In [4]:
narrated_text = narrator(text)

In [5]:
from IPython.display import Audio as IPythonAudio

IPythonAudio(narrated_text["audio"][0],
             rate=narrated_text["sampling_rate"])

In [6]:
len(narrated_text["audio"])

1

In [8]:
type(narrated_text["audio"][0])

numpy.ndarray

In [9]:
narrated_text["audio"][0].shape

(538880,)

In [11]:
narrated_text["audio"][0][:10]

array([-0.00066421, -0.00055108, -0.00076541, -0.00069564, -0.00067338,
       -0.00062859, -0.00060529, -0.00098113, -0.00096382, -0.00103115],
      dtype=float32)
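
The samples are 32-bit floats, while the most widely supported WAV files store 16-bit integers. As a minimal sketch that is not part of the original lesson, the hypothetical helper `save_wav` below scales floats (assumed to lie in the range -1.0 to 1.0) to 16-bit PCM and writes them with Python's standard `wave` module. A synthetic sine tone stands in for `narrated_text["audio"][0]` so the snippet runs without the model; the file name `narration.wav` is also just an illustration.

```python
import math
import struct
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples (assumed in [-1.0, 1.0]) as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip first so an out-of-range sample cannot overflow the 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for narrated_text["audio"][0]: one second of a 440 Hz tone.
sampling_rate = 22050
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sampling_rate)
        for n in range(sampling_rate)]
save_wav("narration.wav", tone, sampling_rate)
```

With the pipeline output, the call would be `save_wav("narration.wav", narrated_text["audio"][0], narrated_text["sampling_rate"])`. Looping over a NumPy array one sample at a time is slow for long clips, so a vectorized conversion may be preferable there.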

In [12]:
narrated_text.keys()

dict_keys(['audio', 'sampling_rate'])

In [13]:
narrated_text["sampling_rate"]

22050
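
The sample count and the sampling rate together give the length of the clip: duration in seconds is the number of samples divided by samples per second. A small sketch using the two values printed above (hard-coded here so it runs without the model):

```python
num_samples = 538880     # narrated_text["audio"][0].shape[0], from the output above
sampling_rate = 22050    # narrated_text["sampling_rate"], from the output above

# Each second of audio holds `sampling_rate` samples.
duration_seconds = num_samples / sampling_rate
print(f"{duration_seconds:.2f} seconds")  # prints "24.44 seconds"
```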

### Try it yourself!
- Try this model with your own text-to-speech examples!

In [15]:
text = """
In dual label assignments, the one-to-many branch provides much richer supervisory signals than
one-to-one branch. Intuitively, if we can harmonize the supervision of the one-to-one head with that
of one-to-many head, we can optimize the one-to-one head towards the direction of one-to-many
head’s optimization. As a result, the one-to-one head can provide improved quality of samples during
inference, leading to better performance. To this end, we first analyze the supervision gap between the
two heads.
"""

In [16]:
narrated_text = narrator(text)

In [17]:
IPythonAudio(narrated_text["audio"][0],
             rate=narrated_text["sampling_rate"])
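
Unlike the first example, this passage has no trailing backslashes, so the string contains a newline at the end of every line. How a given tokenizer treats those newlines is model-specific and is not covered in this lesson. As a hedged sketch, the hypothetical helper `collapse_whitespace` replaces every run of whitespace with a single space, which you could apply before passing the text to `narrator`:

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


sample = """
In dual label assignments, the one-to-many branch provides
much richer supervisory signals than one-to-one branch.
"""
cleaned = collapse_whitespace(sample)
print(cleaned)
```

The cleaned string can then be narrated as before with `narrator(cleaned)`.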