<a href="https://colab.research.google.com/github/tanyamadaan/test/blob/main/SpeechT5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SpeechT5 with Hugging Face

Also check out the blog post: [hf.co/blog/speecht5](http://hf.co/blog/speecht5)

And the online demos:

- [Speech Synthesis (TTS)](https://huggingface.co/spaces/Matthijs/speecht5-tts-demo)
- [Voice Conversion](https://huggingface.co/spaces/Matthijs/speecht5-vc-demo)
- [Automatic Speech Recognition](https://huggingface.co/spaces/Matthijs/speecht5-asr-demo)

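First install Transformers, Datasets, and sentencepiece. The examples below also use `soundfile` to save audio, so install it with pip if it is not already available in your environment.

**Note:** Restart the notebook runtime after installing sentencepiece, or the demos won't work!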

In [None]:
!pip install git+https://github.com/huggingface/transformers.git

In [None]:
!pip install datasets

In [None]:
!pip install sentencepiece

## Text-to-speech

Load the model:

In [None]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

Preprocess the text input:

In [None]:
inputs = processor(text="Don't count the days, make the days count.", return_tensors="pt")

Load a speaker embedding (an x-vector that sets the voice of the generated speech):

In [None]:
from datasets import load_dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")

import torch
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

In [None]:
speaker_embeddings.shape

torch.Size([1, 512])
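
The `unsqueeze(0)` call above adds a leading batch dimension, which is why the shape is `[1, 512]` rather than `[512]`. Here is a minimal sketch of the same reshaping. It uses NumPy and a zero-filled stand-in vector instead of a real x-vector; both are assumptions made so the snippet runs without downloading the dataset.

```python
import numpy as np

# Stand-in for a single 512-dimensional x-vector (hypothetical values).
xvector = np.zeros(512, dtype=np.float32)

# Add a leading batch axis, mirroring torch.Tensor.unsqueeze(0).
batched = np.expand_dims(xvector, axis=0)

print(xvector.shape)  # (512,)
print(batched.shape)  # (1, 512)
```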

Load a vocoder, which turns the spectrogram predicted by the model into an audio waveform:

In [None]:
from transformers import SpeechT5HifiGan
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

Generate the speech from the input text:

In [None]:
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

In [None]:
speech.shape

torch.Size([34816])
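
The output is a one-dimensional waveform. Because this notebook plays and saves it at 16000 Hz, the clip length follows directly from the sample count. A quick arithmetic check, using the numbers shown above:

```python
num_samples = 34816    # from speech.shape above
sampling_rate = 16000  # rate used for playback and saving in this notebook

duration_seconds = num_samples / sampling_rate
print(f"{duration_seconds:.3f} s")  # 2.176 s
```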

In [None]:
from IPython.display import Audio

Audio(speech, rate=16000)

In [None]:
import soundfile as sf
sf.write("tts_example.wav", speech.numpy(), samplerate=16000)
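
To confirm that a saved file has the expected rate and length, you can read its header back with Python's standard `wave` module. The sketch below assumes the file holds plain PCM data, which the `wave` module requires. The helper name `wav_info` and the stand-in file `standin_example.wav` are hypothetical; the stand-in is a short synthetic tone written here only so the snippet runs on its own.

```python
import math
import struct
import wave

def wav_info(path):
    """Return (sample_rate, num_channels, num_frames) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return (wav_file.getframerate(),
                wav_file.getnchannels(),
                wav_file.getnframes())

# Stand-in file: 0.5 s of a 440 Hz tone as 16-bit mono PCM at 16 kHz.
# In the notebook you would call wav_info("tts_example.wav") instead.
sample_rate = 16000
samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate))
           for n in range(sample_rate // 2)]
with wave.open("standin_example.wav", "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))

rate, channels, frames = wav_info("standin_example.wav")
print(rate, channels, frames, frames / rate)
```

Dividing the frame count by the sample rate gives the duration in seconds, which should match the length of the generated waveform.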

## Speech-to-speech for voice conversion

Load the model:

In [None]:
from transformers import SpeechT5ForSpeechToSpeech

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")

Load an input speech example:

In [None]:
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
example = dataset[40]

In [None]:
Audio(example["audio"]["array"], rate=16000)

Preprocess the speech input:

In [None]:
sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
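
The example audio is already at 16 kHz, the rate used throughout this notebook. If your own recording has a different rate, it needs to be resampled before it is passed to the processor. The following is a deliberately naive sketch using linear interpolation with NumPy. The function name `resample_linear` is hypothetical, and there is no anti-aliasing filter, so a dedicated resampler from an audio library is the better choice for real use.

```python
import numpy as np

def resample_linear(waveform, orig_rate, target_rate=16000):
    """Naively resample a 1-D waveform by linear interpolation (no anti-aliasing)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if orig_rate == target_rate:
        return waveform
    num_out = int(round(len(waveform) * target_rate / orig_rate))
    # Positions of the output samples, expressed in input-sample units.
    positions = np.arange(num_out) * (orig_rate / target_rate)
    resampled = np.interp(positions, np.arange(len(waveform)), waveform)
    return resampled.astype(np.float32)

# Stand-in input: one second of a 220 Hz tone recorded at 8 kHz.
t = np.arange(8000) / 8000
tone_8k = np.sin(2 * np.pi * 220 * t).astype(np.float32)

tone_16k = resample_linear(tone_8k, orig_rate=8000)
print(len(tone_8k), len(tone_16k))  # 8000 16000
```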

Load the speaker embedding for the target speaker's voice:

In [None]:
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

Generate the speech:

In [None]:
speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)

In [None]:
Audio(speech, rate=16000)

In [None]:
import soundfile as sf
sf.write("speech_converted.wav", speech.numpy(), samplerate=16000)

## Automatic speech recognition (using pipeline)

In [None]:
from transformers import pipeline
generator = pipeline(task="automatic-speech-recognition", model="microsoft/speecht5_asr")

In [None]:
transcription = generator(example["audio"]["array"])

In [None]:
transcription["text"]

'a man said to the universe sir i exist'

## Automatic speech recognition (using the model)

Load the model:

In [None]:
from transformers import SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

Preprocess the input speech example:

In [None]:
sampling_rate = dataset.features["audio"].sampling_rate
inputs = processor(audio=example["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

Generate text from the speech input:

In [None]:
predicted_ids = model.generate(**inputs, max_length=100)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

In [None]:
transcription[0]

'a man said to the universe sir i exist'
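
Both approaches return lowercase text without punctuation. If you have a reference transcript, a common way to score such output is the word error rate (WER): the word-level edit distance divided by the number of reference words. A minimal sketch follows. The function name `word_error_rate` is hypothetical, and the second hypothesis string is made up to show a non-zero score; it is not output from the model.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

reference = "a man said to the universe sir i exist"
perfect = "a man said to the universe sir i exist"
one_error = "a man said to the universe sir i exists"  # made-up hypothesis

print(word_error_rate(reference, perfect))             # 0.0
print(f"{word_error_rate(reference, one_error):.3f}")  # 0.111
```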