# Installing Required Libraries

In [1]:
!pip install transformers



The **pipeline** function is a high-level API provided by the **Hugging Face** transformers library. It simplifies the use of various pre-trained models for a wide range of tasks in Natural Language Processing (NLP) and other domains.

This function abstracts away the complex details of **loading models, tokenizing input, and processing output**, making it easy to perform tasks with just a few lines of code.

In [2]:
from transformers import pipeline

# Loading a Pre-trained Model

### Text-to-Speech Model
The **suno/bark-small** model is a text-to-speech (TTS) model, part of the Bark project developed by Suno, aimed at generating realistic speech from text input. It is designed for **text-to-audio generation**, providing high-quality speech synthesis, including nuances like tone and intonation. The small version of the model is optimized for efficiency, making it suitable for resource-constrained environments without sacrificing too much on audio quality.



#### **Key Features:**
**Text-to-Speech:**

Converts text input into spoken language (speech) in a natural and expressive manner.
Supports various languages and can produce speech that includes emotions, making the output more human-like.

**Efficient Size:**

The "small" variant is designed to be lightweight, using fewer resources (like memory and compute) compared to larger models, while still delivering acceptable performance.
Ideal for situations where speed and lower computational load are important, such as deployment on edge devices or inference on lower-end hardware.

**Pre-trained:**

The model is pre-trained on large datasets of speech and text pairs, making it capable of generalizing well to a wide range of inputs.
You can use it right away for tasks like converting text to speech without needing additional fine-tuning.

**Multilingual Support:**

Bark models, including bark-small, can handle multiple languages, so they are not limited to English. This makes them versatile for global applications.

**Realism and Emotion:**

One of the key advantages of the Bark models is the ability to generate more natural and expressive speech that can capture different emotions or tones.
This capability makes it a preferred choice for applications like voice assistants, audiobooks, podcasts, and interactive storytelling, where the richness of voice matters.

**GPU Support:**

You can run the model on both CPU and GPU, with CUDA support for faster inference using GPUs.
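
As a minimal sketch of choosing between the two, the helper below returns `"cuda"` when PyTorch reports a usable GPU and `"cpu"` otherwise. The name `pick_device` is hypothetical (it is not part of transformers), and the sketch assumes PyTorch is the backend; if `torch` is not installed it simply falls back to the CPU.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        # Assumed backend; imported lazily so the sketch also runs without it.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string could then be passed as the `device` argument of `pipeline(...)` instead of a hard-coded value.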



In [3]:
text1 = "Keep pushing forward, success comes to those who grind relentlessly. You’ve got this!"
text2 = """Large Language Models (LLMs) are AI systems designed to understand and generate human language, using vast amounts of data to perform tasks
like translation, summarization, and text generation."""

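# Load the Bark text-to-speech pipeline on the GPU
# (use device="cpu" if no CUDA GPU is available)
pipe = pipeline("text-to-speech", model="suno/bark-small", device="cuda")

# Generate speech for the first sample text
output = pipe(text1)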


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


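**output** is a dictionary with two entries. The `audio` entry is a **NumPy** array containing the audio waveform generated by the text-to-speech (TTS) **model**, and `sampling_rate` gives the number of samples per second.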

The values in the array represent the **amplitude of the audio signal** at each point in time. Since the data type is float32, each value is a 32-bit floating-point number, and the waveform oscillates between positive and negative values, producing sound when played.

These values can be converted into sound when processed by an audio system or saved into a file format like .wav.
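
As one way to do the saving step, here is a minimal sketch that uses only Python's standard `wave` module. The helper name `save_wav`, the output file name, and the sine-wave stand-in signal are illustrative assumptions, not part of the notebook. The function expects a flat sequence of floats between -1.0 and 1.0 and writes them as 16-bit mono PCM.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in signal (an assumption): one second of a 440 Hz tone at 24 kHz.
rate = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]

out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
save_wav(out_path, tone, rate)
```

The pipeline's `audio` array is two-dimensional with a single row (as the printed output further down shows), so its first row, `output["audio"][0]`, would be the sequence to pass here, together with `output["sampling_rate"]`.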

**'sampling_rate': 24000:**

The sampling rate is the number of samples per second, here 24,000 (24 kHz).
A higher sampling rate lets the signal represent higher frequencies, up to half the sampling rate, so 24 kHz covers frequencies up to 12 kHz, which is enough for clear speech.

In [4]:
output

{'audio': array([[ 0.00178578,  0.00186026,  0.00252651, ..., -0.01629342,
         -0.01533457, -0.01505096]], dtype=float32),
 'sampling_rate': 24000}
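
The sampling rate also tells you how long a clip is: the duration in seconds is the number of samples divided by the samples per second. The sketch below uses a made-up stand-in dictionary with the same layout as `output`. The function name `clip_duration_seconds` is hypothetical, and the sample count of 36,000 is an assumption chosen for easy arithmetic, not the real clip length.

```python
def clip_duration_seconds(result):
    """Duration = samples in the first (only) row / samples per second."""
    num_samples = len(result["audio"][0])
    return num_samples / result["sampling_rate"]


# Hypothetical stand-in with the same keys as the pipeline output.
fake_output = {"audio": [[0.0] * 36000], "sampling_rate": 24000}

duration = clip_duration_seconds(fake_output)
print(duration)  # 36000 / 24000 = 1.5
```

Passing the real `output` dictionary to the same function should work unchanged, since indexing the first row and calling `len` behave the same way on a NumPy array.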

The **Audio** class from `IPython.display` embeds an audio player in the notebook, so you can listen to the speech generated from your input text.

In [5]:
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])

Hugging Face hosts many other text-to-speech models. I recommend trying some of them for more hands-on practice.

[Hugging Face text-to-speech models](https://huggingface.co/models?pipeline_tag=text-to-speech&sort=trending)