# Text-to-Speech (TTS) Models

## 1. English Text-to-Speech Model  
- This model is based on **VITS (Variational Inference Text-to-Speech)** and is sourced from **kakao-enterprise/vits-ljs**.  
- It uses `AutoTokenizer` and `AutoModelForTextToWaveform` from Hugging Face’s Transformers library.  
- The model is pre-trained on **LJSpeech**, a widely used dataset for English speech synthesis.  
- It generates natural-sounding **English** speech from text, in the single voice recorded for LJSpeech.  

## 2. Arabic Text-to-Speech Model  
- This model is published by **Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)** as `MBZUAI/speecht5_tts_clartts_ar`; it is Microsoft's **SpeechT5** fine-tuned for **Arabic** speech synthesis.  
- It is loaded through the Hugging Face `pipeline` function, which wraps text processing and inference in a single call.  
- It takes a speaker embedding (an x-vector) that selects the voice of the generated speech.  
- It generates clear, natural **Arabic** speech from text.  

Both models are loaded through the Hugging Face Transformers library and return a waveform that can be played directly in the notebook.


In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForTextToWaveform

tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-ljs")
model = AutoModelForTextToWaveform.from_pretrained("kakao-enterprise/vits-ljs")

In [None]:
#!pip install phonemizer 

In [None]:
#!apt-get install espeak

In [None]:
#!pip install torch

In [None]:
import torch

text = "The error message cannot open backup device. Operating system error 3 The system cannot find the path specified means that SQL Server is unable to locate or access the specified directory. Here’s how to fix it"

# Tokenize the text into PyTorch tensors
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    speech = model(**inputs).waveform

In [None]:
#!pip install --upgrade --force-reinstall ipython -q

In [None]:
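from IPython.display import Audio

# The waveform has shape (1, num_samples); drop the batch axis and pass a NumPy array
Audio(speech.squeeze().numpy(), rate=model.config.sampling_rate)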
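
The English waveform above is only played inline, while the Arabic example later writes its output to disk. As a rough sketch of how the same could be done here without extra dependencies, the helper below writes mono float samples to a 16-bit PCM WAV file using only the standard library. The name `write_wav_pcm16`, the file name `demo_tone.wav`, and the 16 kHz test tone are illustrative assumptions and are not part of the original notebook.

```python
import math
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2  # number of samples written


# Stand-in signal (hypothetical): one second of a 440 Hz tone at 16 kHz.
demo_rate = 16000
demo = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate)]
written = write_wav_pcm16("demo_tone.wav", demo, demo_rate)
```

With the tensor from this notebook, the call would presumably look like `write_wav_pcm16("english.wav", speech.squeeze().tolist(), model.config.sampling_rate)`.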

## Arabic model from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
### SpeechT5 is a Microsoft model; MBZUAI fine-tuned it on the Arabic language

In [None]:
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")

# Load a speaker x-vector that determines the voice of the output.
# You can replace this embedding with your own as well.
embeddings_dataset = load_dataset("herwoww/arabic_xvector_embeddings", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[105]["speaker_embeddings"]).unsqueeze(0)

# ArTST is trained without diacritics, so the input text is undiacritized.
speech = synthesiser("لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", forward_params={"speaker_embeddings": speaker_embedding})

# Save the generated audio at the sampling rate reported by the pipeline
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
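
The comment in the cell above notes that the model is trained without diacritics, so vocalised Arabic text may need its short-vowel marks removed before synthesis. A minimal sketch of such a preprocessing step follows. The helper name `strip_diacritics` and the exact set of characters removed (U+064B to U+0652 plus the superscript alef U+0670) are assumptions for illustration, not something the model card prescribes.

```python
import re

# Assumption: only the common tashkeel marks U+064B..U+0652 and the
# superscript alef U+0670 are removed; extend the class if your text needs it.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Return `text` with Arabic short-vowel marks removed."""
    return _TASHKEEL.sub("", text)


# "marhaban" written with full vowel marks, given as escapes for clarity.
vocalised = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"
plain = strip_diacritics(vocalised)
```

The cleaned string can then be passed to `synthesiser(...)` in place of the raw text.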


In [None]:
Audio('speech.wav')