# Chatbot with Speech Implementation using MeloTTS

I learned a lot about how text-to-speech (TTS) works by taking a closer look at the MeloTTS repo and seeing how it works, and I also learned how to use it in Natural Language Processing tasks.

Problems faced and solutions implemented:
1. Installing from the requirements.txt file kept failing, and my attempts to fix it did not work, so I installed each library individually. The problem was the mecab-python3 library, which was pinned to version 1.0.5 and had to be updated to 1.0.9. It is only used for Japanese and, for some reason, it was listed twice. I updated the requirements.txt file accordingly.
2. Installing unidic also failed, this time with an SSL certificate error (s:1153), so I downloaded it directly from the web and placed it in the required folder. Again, it is only used for Japanese.
3. Once the installations were done, I tried converting a very short English text to speech in an American accent and hit an error: the code could not access a folder in the unidic library. That got me wondering why an English run would fail on unidic if it is only used for Japanese. It turned out that english.py uses a function from japanese.py, distribute_phone, so I copied that function into english.py. That solved one issue, but the error still showed up, so I also removed the Japanese import from cleaner.py, which made it go away (this could surely be solved properly given enough time). I started getting speech output after this. Later I deleted unidic and used unidic_lite instead, after which the error was resolved and even Japanese speech was possible.
4. I did not face any problems while building the chatbot and integrating TTS into it.
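
The duplicated mecab-python3 entry from item 1 is the kind of problem a quick check can surface before running pip. The following is a minimal sketch, not part of MeloTTS: the helper name `find_duplicate_requirements` and the sample lines are hypothetical, and the parsing is deliberately simple (it does not handle every requirement syntax pip accepts).

```python
import re
from collections import defaultdict


def find_duplicate_requirements(lines):
    """Return {package: [requirement lines]} for packages listed more than once."""
    seen = defaultdict(list)
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        # The package name is everything before the first specifier, extra, or marker.
        name = re.split(r"[=<>!~\[;\s@]", line, maxsplit=1)[0]
        # Package names are compared case-insensitively, with -, _ and . treated alike.
        key = re.sub(r"[-_.]+", "-", name).lower()
        seen[key].append(line)
    return {name: entries for name, entries in seen.items() if len(entries) > 1}


# Illustrative sample only; these are not the actual contents of the MeloTTS file.
sample = [
    "mecab-python3==1.0.5",
    "unidic",
    "mecab-python3==1.0.5  # listed a second time",
    "transformers",
]
duplicates = find_duplicate_requirements(sample)
print(duplicates)
```

To check a real file, the same function can be called with `open("requirements.txt").read().splitlines()`.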


I have built a basic chatbot that takes a question and returns a text answer along with an audio file, which is saved and can also be played in the notebook itself. There are a number of additions we could make to the chatbot: adding memory so that it remembers previous conversations, storing the audio from each exchange, and creating a web UI for a ChatGPT-like interface that also gives audio output.
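
As a rough illustration of the memory idea, one option is to keep the last few question/answer pairs and prepend them to each new prompt. This is only a sketch under that assumption: the `ConversationMemory` class is hypothetical, it is not wired into the chain used in this notebook, and the follow-up question is made up for the example.

```python
from collections import deque


class ConversationMemory:
    """Keeps the most recent question/answer pairs and folds them into a prompt."""

    def __init__(self, max_turns=3):
        # A bounded deque silently discards the oldest turn once it is full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def build_prompt(self, question):
        # Replay earlier turns in the same Question/Answer layout as the template.
        history = "".join(f"Question: {q}\nAnswer: {a}\n" for q, a in self.turns)
        return f"{history}Question: {question}\nAnswer:"


memory = ConversationMemory(max_turns=2)
memory.add("What is the capital of Germany?", "Berlin")
memory.add("What is the capital of Japan?", "The capital of Japan is Tokyo.")
memory.add("What is the capital of India?", "New Delhi")

# Only the two most recent turns remain, so the Germany turn has been dropped.
prompt_text = memory.build_prompt("Which of those two cities is further east?")
print(prompt_text)
```

The resulting string could then be passed to the model in place of the bare question; how well a given model uses that history is something to test rather than assume.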


In [1]:
# importing essential libraries
import os
import IPython.display as ipd
from melo.api import TTS
from langchain.prompts import PromptTemplate
from langchain.llms import CTransformers
from langchain_huggingface import HuggingFaceEndpoint, HuggingFacePipeline
# PromptTemplate is already imported above, so only LLMChain is needed here
from langchain.chains import LLMChain
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, pipeline



In [2]:
# Started with the FLAN-T5 base model
model_id = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

In [3]:
# created a hugging face pipeline 
p = pipeline('text2text-generation', model=model, tokenizer=tokenizer, max_length=100)
hf = HuggingFacePipeline(pipeline = p)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [4]:
# Used a prompt template to ask a question and get the answer, and created a chain
# The prompt template asks for brief answers so that the speech is not very long
question = 'What is the capital of germany?'
template = """Question: {question}
Answer: Answer as briefly as you can in a single sentence."""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt = prompt, llm = hf)



In [5]:
# This is the answer to the question "What is the capital of germany?"
chain.run(question=question)



'Berlin'

In [6]:
# It gave the correct answer for the previous question, so now I asked another question
# This time it gave a wrong answer, so I switched to a different open-source model
question = "What is the capital of India?"
chain.run(question=question)

'chennai'

In [7]:
# I used the Hugging Face endpoint for this one so that the entire model does not have to be downloaded
from getpass import getpass
sec_key = getpass('Enter your Hugging Face API key: ')
# The environment variable LangChain reads is HUGGINGFACEHUB_API_TOKEN
os.environ['HUGGINGFACEHUB_API_TOKEN'] = sec_key

In [8]:
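# Then tried the Mistral AI model
# Note: the endpoint's parameter is `max_new_tokens`; `max_new_token` is not recognized and is
# moved to model_kwargs (see the warning below), which is why the answers run well past 10 tokens
repo_id = 'mistralai/Mistral-7B-Instruct-v0.3'
llm = HuggingFaceEndpoint(repo_id=repo_id, max_new_token=10, temperature=0.9, huggingfacehub_api_token=sec_key)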

                    max_new_token was transferred to model_kwargs.
                    Please make sure that max_new_token is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/zeeshan/.cache/huggingface/token
Login successful


In [9]:
# I asked a question "What is the capital of India?" and it gave the correct answer
# Although the answer was unnecessarily long
question = input('Ask a question: ')
llm.invoke(question)

" Well, I can answer that for you: New Delhi. But it’s more interesting to talk about why it’s the capital.\n\nFrom 1526 to 1858, Delhi was the capital of the Mughal Empire. It was a center of art, culture, and power, with monuments such as the Red Fort, the Jama Masjid, and the Qutub Minar that still stand today.\n\nIn 1858, the British took control of India, and the capital was moved to Calcutta (now Kolkata). However, in 1911, the British decided to move the capital back to Delhi, establishing the new city of New Delhi, which was built between 1911 and 1931.\n\nThe decision to move the capital back to Delhi was based on strategic, symbolic, and cultural reasons. Delhi was central to the Indian subcontinent, and the new city was designed to be a symbol of British power and authority. The city was also meant to be a center of Indian culture, with a mix of traditional Indian and modern British architecture.\n\nNew Delhi was designed by British architects Edwin Lutyens and Herbert Baker

In [10]:
# Used a prompt template similar to the previous one to get a brief answer
template = """Question: {question}
Answer: Answer as briefly as you can in a single sentence."""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt = prompt, llm = llm)

In [21]:
# It gave an accurate answer and it is also to the point

question = input('Ask a question: ')
text = chain.run(question)
print(text)




The capital of Japan is Tokyo.


Now that the chatbot part was done, I moved on to the TTS part, using the TTS model from MeloTTS.

In [22]:
# Having chosen the Mistral AI model, I started implementing the TTS model
# I set the language to English and the device to auto; the American accent is selected via the speaker ID in the next cell

speed = 1.0

device = 'auto'
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id



In [23]:
# This saves the audio to a file and also plays it here in the notebook
output_path = 'en-us.wav'
model.tts_to_file(text, speaker_ids['EN-US'], output_path, speed=speed)
ipd.Audio(output_path)

 > Text split to sentences.
The capital of Japan is Tokyo.


100%|██████████| 1/1 [00:05<00:00,  5.05s/it]


In [30]:
# I combined all of the work into one function: pass in a question and get the answer as text, with the audio output also saved to a .wav file

repo_id = 'mistralai/Mistral-7B-Instruct-v0.3'
# Note: `max_new_token` is not the recognized parameter name (`max_new_tokens` is), so it is moved to model_kwargs, as the warning below shows
llm = HuggingFaceEndpoint(repo_id=repo_id, max_new_token=10, temperature=0.9, huggingfacehub_api_token=sec_key)

template = "Answer as briefly as you can {question}."
# template = """Question: {question}
# Answer: Answer as briefly as you can in a single sentence."""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt = prompt, llm = llm)

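def chatbot(question):
    # Get the text answer from the LLM chain
    text = chain.run(question=question)

    speed = 1.2

    # Load the MeloTTS English model and look up its speaker IDs
    device = 'auto'
    model = TTS(language='EN', device=device)
    speaker_ids = model.hps.data.spk2id

    # Synthesize the answer with the British English voice and save it
    output_path = 'en-br.wav'
    model.tts_to_file(text, speaker_ids['EN-BR'], output_path, speed=speed)

    # Return the answer together with the path of the saved audio file
    return text, output_path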

                    max_new_token was transferred to model_kwargs.
                    Please make sure that max_new_token is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/zeeshan/.cache/huggingface/token
Login successful
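
The `chatbot` function above constructs the TTS model on every call. One way to avoid repeating that work is to cache the loaded model and reuse it. The sketch below shows the pattern with `functools.lru_cache`; `load_model` is a hypothetical stand-in for an expensive constructor such as `TTS(language=..., device=...)`, used here so the example runs without MeloTTS installed.

```python
from functools import lru_cache


def load_model(language):
    # Hypothetical stand-in for an expensive model constructor.
    load_model.calls += 1
    return {"language": language}


load_model.calls = 0  # counts how many times the "model" is actually built


@lru_cache(maxsize=None)
def get_model(language):
    # The first call per language builds the model; later calls reuse it.
    return load_model(language)


first = get_model("EN")
second = get_model("EN")
print(first is second, load_model.calls)
```

In the notebook, the same effect could be had more simply by creating the TTS model once in its own cell and letting `chatbot` use that object.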


In [32]:
# Here is an implementation where I simply passed a question, got the text output, and had the audio file saved so that it could be played in the notebook

# question = input('Ask a question: ')
question = "Places to visit in Europe"
chatbot(question)
# chatbot() writes to 'en-br.wav'; the global output_path still points to 'en-us.wav' from the earlier cell
ipd.Audio('en-br.wav')



 > Text split to sentences.
1. Paris, France: Known as the city of love, Paris is famous for its iconic landmarks like the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. 2. Rome, Italy: Home to ancient Roman ruins such as the Colosseum and the Forum, as well as Vatican City,
which houses St. Peter's Basilica and the Sistine Chapel. 3. Barcelona, Spain: Famous for its unique architecture, particularly the works of Antoni Gaudi, such as the Sagrada Familia and Park Güell. 4. Amsterdam, Netherlands: Known for its artistic heritage,
elaborate canal system, and narrow houses. 5. London, England: Home to the British Museum, the Tower of London, Buckingham Palace, and the London Eye. 6. Prague, Czech Republic: Known for its Old Town Square, colorful baroque buildings, Gothic churches, and the medieval Astronomical Clock.
7. Athens, Greece: Home to the Acropolis and Parthenon, as well as numerous other ancient Greek monuments. 8. Vienna, Austria: Known for its imperial palaces, includi

100%|██████████| 5/5 [00:50<00:00, 10.11s/it]
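
The last answer was cut off mid-word and took about 50 seconds to synthesize, so the prompt alone did not keep it short. A possible safeguard is to trim the text to its first few sentences before passing it to the TTS model. This is a minimal sketch: `first_sentences` is a hypothetical helper, and its splitting rule is naive (it would, for instance, treat the "1." of a numbered list as a sentence of its own).

```python
import re


def first_sentences(text, max_sentences=2):
    """Keep at most max_sentences sentences from the start of text."""
    # Split wherever ., ! or ? is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:max_sentences])


answer = (
    "Well, I can answer that for you: New Delhi. "
    "But it's more interesting to talk about why it's the capital. "
    "From 1526 to 1858, Delhi was the capital of the Mughal Empire."
)
short_answer = first_sentences(answer, max_sentences=1)
print(short_answer)
```

The trimmed string would then be passed to `tts_to_file` in place of the full answer.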
