## <b><font color='darkblue'>Preface</font></b>
In this notebook, we explore how to integrate our Model A (STT) into the [**LangChain**](https://python.langchain.com/v0.1/docs/get_started/introduction) framework.

In [33]:
import os
from os import walk
import IPython
from IPython.display import Markdown as md

### <b><font color='darkgreen'>Model A candidates</font></b>
In [issue #30](https://github.com/Cockatoo-AI-Org/Cockatoo.AI/issues/30), we evaluated a few Model A candidates:
```
Raw data (LangEnum.cn)
+-------------------------+------------------------------+-------+----------+
|          Input          |             Name             | Score | Time (s) |
+-------------------------+------------------------------+-------+----------+
| cn_20240108_johnlee.wav |    SpeechRecognition/GCP     |  0.36 |   6.27   |
| cn_20240108_johnlee.wav | SpeechRecognition/WhisperAPI |  0.88 |   2.42   |
|  cn_20240121_chung.wav  |    SpeechRecognition/GCP     |  0.93 |   15.1   |
|  cn_20240121_chung.wav  | SpeechRecognition/WhisperAPI |  0.96 |   2.54   |
+-------------------------+------------------------------+-------+----------+

Ranking Table
+------------------------------+------------+--------------+
|             Name             | Avg. score | Avg. Time(s) |
+------------------------------+------------+--------------+
| SpeechRecognition/WhisperAPI |    0.92    |     2.48     |
|    SpeechRecognition/GCP     |    0.65    |    10.68     |
+------------------------------+------------+--------------+
```
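The ranking table can be reproduced from the raw rows by averaging score and time per engine. The snippet below is a minimal sketch of that arithmetic; the `RAW_RESULTS` list and the `rank_engines` helper are names made up for illustration and are not part of the evaluation script used in the issue.

```python
from collections import defaultdict

# Raw rows copied from the table above: (input file, engine, score, seconds).
RAW_RESULTS = [
    ("cn_20240108_johnlee.wav", "SpeechRecognition/GCP", 0.36, 6.27),
    ("cn_20240108_johnlee.wav", "SpeechRecognition/WhisperAPI", 0.88, 2.42),
    ("cn_20240121_chung.wav", "SpeechRecognition/GCP", 0.93, 15.1),
    ("cn_20240121_chung.wav", "SpeechRecognition/WhisperAPI", 0.96, 2.54),
]


def rank_engines(rows):
    """Average score and time per engine, best average score first."""
    grouped = defaultdict(list)
    for _input, name, score, seconds in rows:
        grouped[name].append((score, seconds))

    ranking = []
    for name, samples in grouped.items():
        avg_score = sum(s for s, _ in samples) / len(samples)
        avg_time = sum(t for _, t in samples) / len(samples)
        ranking.append((name, avg_score, avg_time))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


for name, avg_score, avg_time in rank_engines(RAW_RESULTS):
    print(f"{name:<30} {avg_score:.2f} {avg_time:.2f}")
```

With only two audio files per engine, each average is very sensitive to a single sample, which is worth keeping in mind when reading the ranking.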

Although the current dataset is too small to fully assess the Model A candidates (which we plan to address), <b>we will proceed with integrating `WhisperAPI` because it has shown the best accuracy and speed so far</b>.

### <b><font color='darkgreen'>Testing dataset</font></b>
We keep our testing dataset under the path `Cockatoo.AI/experiments/voice_dataset` (the root of the repo is `Cockatoo.AI/`).

In [122]:
TEST_DATASET_PATH = '../voice_dataset'
TEST_AUDIO_PATH = '../voice_dataset/what_is_cock_ai.wav'
WHAT_IS_MODEL_A_TEST_AUDIO_PATH = '../voice_dataset/what_is_model_a.wav'

In [121]:
for (dirpath, dirnames, filenames) in walk(TEST_DATASET_PATH):
    for file_name in filenames:
        file_path = os.path.join(dirpath, file_name)
        if file_path.endswith('.wav'):
            print(f'{file_path}')

../voice_dataset/what_is_cock_ai.wav
../voice_dataset/what_is_model_a.wav


In [12]:
IPython.display.Audio(TEST_AUDIO_PATH)

<a id='agenda'></a>
## <b><font color='darkblue'>Integration work</font></b>
Here we are going to explain the integration work step by step:
* <font size='3ptx'>[**API to use model A**](#sect_1)</font>
* <font size='3ptx'>[**LangChain API**](#sect_2)</font>
* <font size='3ptx'>[**Integrate RAG into LangChain**](#sect_3)</font>
* <font size='3ptx'>[**Integrate model A into LangChain**](#sect_4)</font>
* <font size='3ptx'>[**Survey and demo of GCP Text-to-Speech APIs**](#sect_5)</font>

<a id='sect_1'></a>
### <b><font color='darkgreen'>API to use model A</font></b>
Here we introduce how to use our wrapper around `model A` to carry out the speech-to-text (`STT`) transformation.

In [10]:
from speech_recognition_wrapper import SRWhisperWrapper
from dotenv import load_dotenv
import openai
import wrapper

load_dotenv()
openai.api_key = os.environ['OPENAI_API_KEY']

stt_agent = SRWhisperWrapper(wrapper.LangEnum.en)

In [7]:
%%time
resp = stt_agent.audio_2_text(TEST_AUDIO_PATH)

CPU times: user 13.2 ms, sys: 0 ns, total: 13.2 ms
Wall time: 1.8 s


In [8]:
resp.text

'Could you explain to me what Cockatoo AI is?'

<a id='sect_2'></a>
### <b><font color='darkgreen'>LangChain API</font></b>
<b><font size='3ptx'>Here we study how to integrate our `model A` wrapper with the LangChain API</font>. First, let's recall how chains work in LangChain, as introduced in the section **["Demonstrate the usage of RAG"](https://github.com/Cockatoo-AI-Org/Cockatoo.AI/blob/master/experiments/langchain_lab/README.md#demonstrate-the-usage-of-rag)**</b>:
```python
  chat_model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
  retriever = vectorstore.as_retriever(k=most_relevant_doc_count)
  qa_chain_with_rag = (
      {"context": retriever, "question": RunnablePassthrough()}  # 1) Integrate RAG
      | review_prompt_template                                   # 2) Put context retrieved RAG into question
      | chat_model                                               # 3) Feed in LLM
      | StrOutputParser()                                        # 4) Format the output
  )
  return qa_chain_with_rag
```
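The `|` syntax in the chain above can be read as function composition: each step receives the output of the previous one, and a dict fans the same input out to several branches. The toy classes below are a hypothetical, dependency-free sketch of that idea (`Step` and `coerce` are names invented here); LangChain's real `Runnable` classes do considerably more, such as batching, streaming, and tracing.

```python
class Step:
    """Tiny stand-in for a LangChain Runnable (hypothetical, for illustration)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs `a`, then feeds its output to `b`.
        nxt = coerce(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Handles `{"k": step} | step`, where the left operand is a plain dict.
        return coerce(other) | self


def coerce(obj):
    """Turn dicts and plain callables into steps."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: coerce(val) for key, val in obj.items()}
        # A dict fans the same input out to every branch.
        return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})
    return Step(obj)


# Fake stages standing in for the retriever, the prompt template, and the LLM.
passthrough = Step(lambda value: value)
fake_retriever = Step(lambda question: f"docs about: {question}")
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt_text: prompt_text.upper())

chain = {"context": fake_retriever, "question": passthrough} | fake_prompt | fake_llm
print(chain.invoke("what is lcel?"))
```

The point of the sketch is the data flow: the question string reaches both the retriever branch and the passthrough branch, and the prompt step then sees a dict with both keys.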

So we can try out a chatbot with [**RAG**](https://python.langchain.com/docs/tutorials/rag/) this way:
```python
>>> from scripts import doc_loader
>>> doc_path = '...'  # Please replace `...` with doc path we want LLM to search for.
>>> vectorstore = doc_loader.create_vector_db_of_repo([doc_path])
>>> question = 'Where do we put the sample code of LangChain?'
>>> from chatbot_demo import rag
>>> rag_chatbot = rag.get_chatbot(vectorstore)
```

Let's play with it a bit from this notebook:

In [13]:
from langchain.prompts import (
    PromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    ChatPromptTemplate,
)
from langchain.schema.runnable import RunnablePassthrough
# from langchain_community.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from chroma_utils_in_digest_repo_info import prepare_chroma_data, CHROMA_PATH, EMBEDDING_FUNC_NAME, COLLECTION_NAME, TEST_DOC_PATHS
import chromadb
from chromadb.utils import embedding_functions
from more_itertools import batched

import os
import json
import openai
import textwrap

In [14]:
qa_system_template_str = textwrap.dedent("""
You are a senior engineer working on the Cockatoo.AI open-source project.
You are an expert on how to use the `Cockatoo.AI` GitHub repository.

Answer questions using only the provided context. Be as detailed as possible.
If you don't know the answer, say "I don't know."

{context}
""").strip()

review_system_prompt = SystemMessagePromptTemplate(
    prompt=PromptTemplate(
        input_variables=["context"], template=qa_system_template_str))

review_human_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        input_variables=["question"], template="{question}"))

In [15]:
messages = [review_system_prompt, review_human_prompt]
review_prompt_template = ChatPromptTemplate(
    input_variables=["context", "question"],
    messages=messages)

In [19]:
context = '''
Cockatoo AI is a framework that integrates Model A for Speech-to-Text (STT),
Model B as a Large Language Model (LLM), and Model C for Text-to-Speech (TTS) to enable a wide range of applications.'''.strip()

resp = review_prompt_template.invoke({
        'question': 'What is Cockatoo AI?',
        'context': context})

In [18]:
print(resp)

messages=[SystemMessage(content='You are a senior engineer working on the Cockatoo.AI open-source project.\nYou are an expert on how to use the `Cockatoo.AI` GitHub repository.\n\nAnswer questions using only the provided context. Be as detailed as possible.\nIf you don\'t know the answer, say "I don\'t know."\n\nCockatoo AI is a framework that integrates Model A for Speech-to-Text (STT),\nModel B as a Large Language Model (LLM), and Model C for Text-to-Speech (TTS) to enable a wide range of applications.'), HumanMessage(content='What is Cockatoo AI?')]


We can use the following script to extract information about Cockatoo.AI from the repository, enabling integration with Retrieval Augmented Generation ([**RAG**](https://python.langchain.com/v0.2/docs/tutorials/rag/)):
```shell
$ ./chroma_utils_in_digest_repo_info.py

$ ls -hl cockatoo_repo_embeddings/
total 172K
drwxr-x--- 2 johnkclee primarygroup 4.0K May 12 02:49 4a02abcf-767a-4962-bf71-14117ad65422
-rw-r----- 1 johnkclee primarygroup 164K May 12 02:49 chroma.sqlite3
```

The folder <font color='olive'>`cockatoo_repo_embeddings`</font> will be created to store the knowledge/information of `Cockatoo.AI` as a [**vector store**](https://en.wikipedia.org/wiki/Vector_database).

In [20]:
test_chroma_dataset = prepare_chroma_data(TEST_DOC_PATHS)

Total 5 documents collected!


In [21]:
test_chroma_dataset['ids']

['1', '2', '3', '4', '5']

In [22]:
document_indices = list(range(len(test_chroma_dataset['documents'])))

print(f'Ingesting doc({len(document_indices)}) ...')
for batch in batched(document_indices, 166):
    print(batch)

Ingesting doc(5) ...
(0, 1, 2, 3, 4)
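The `batched` helper from `more_itertools` splits the document indices into chunks of at most the given size, so the five documents here fit into a single batch. For readers without `more_itertools`, the following standard-library sketch shows the same chunking behaviour; `batched_indices` is a name chosen for this example only.

```python
from itertools import islice


def batched_indices(items, batch_size):
    """Yield tuples of at most `batch_size` items, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


print(list(batched_indices(range(5), 166)))  # one batch, as in the cell above
print(list(batched_indices(range(5), 2)))    # several smaller batches
```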


Then we could retrieve the vectorstore instance for RAG query:

In [23]:
CHROMA_PATH

'cockatoo_repo_embeddings'

In [24]:
chromadb_client = chromadb.PersistentClient(CHROMA_PATH)

In [25]:
# embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
#     model_name=EMBEDDING_FUNC_NAME)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
      model_name="text-embedding-ada-002")

def embed_query(text, embedding_func=openai_ef):
    return embedding_func(text)[0]

openai_ef.embed_query = embed_query

In [28]:
%%time
embed_query('test')[:20]

CPU times: user 10.8 ms, sys: 5.27 ms, total: 16 ms
Wall time: 1.11 s


[-0.014994166791439056,
 -0.009446053765714169,
 0.008583617396652699,
 -0.016827693209052086,
 -0.03107486665248871,
 0.01245439425110817,
 -0.015306545421481133,
 -0.026266954839229584,
 -0.006142311729490757,
 -0.014138521626591682,
 0.009249119088053703,
 0.0245285015553236,
 0.0120401531457901,
 -0.003870776854455471,
 -0.014016286469995975,
 0.010159091092646122,
 0.044955335557460785,
 0.0029013848397880793,
 0.006193242967128754,
 -0.02691887505352497]

In [29]:
collection = chromadb_client.get_collection(
    name=COLLECTION_NAME, embedding_function=openai_ef)

In [30]:
# Number of ingested documents
len(collection.get()['ids'])

5

In [31]:
resp = collection.query(
    query_texts=["What is Cockatoo AI?"],
    n_results=2)
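Conceptually, a query like the one above embeds the query text and returns the stored documents whose vectors lie closest to it. The sketch below illustrates that ranking on toy vectors with an exact search and squared Euclidean distance; both choices are assumptions made for illustration, since Chroma uses an approximate index and its distance metric is configurable.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_n(query_vec, doc_vecs, n_results=2):
    """Return ids of the `n_results` closest documents, nearest first."""
    ranked = sorted(doc_vecs, key=lambda doc_id: squared_l2(query_vec, doc_vecs[doc_id]))
    return ranked[:n_results]


# Toy 3-dimensional "embeddings"; real ones have far more dimensions.
toy_docs = {
    "1": [0.9, 0.1, 0.0],
    "2": [0.0, 1.0, 0.0],
    "3": [0.8, 0.2, 0.1],
}
print(top_n([1.0, 0.0, 0.0], toy_docs))
```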

In [34]:
md(resp["documents"][0][0])

# Text to Speech (TTS) model notes

## General: 
 - [TTS Intro](https://huggingface.co/docs/transformers/tasks/text-to-speech)
 - [What is TTS](https://huggingface.co/tasks/text-to-speech) (Suno Bark, MS SpeechT5, MMS TTS)


## Resources

### Open AI TTS API
 - Pricing:
   Model | Usage 
   -- | -- 
   Whisper | $0.006 / minute (rounded to the nearest second)
   TTS | $0.015 / 1K characters
   TTS HD | $0.030 / 1K characters
 - [API Tutorial](https://platform.openai.com/docs/guides/text-to-speech)

### [Speech-T5-TTS](https://huggingface.co/microsoft/speecht5_tts)
 - [Interactive Demo](https://huggingface.co/spaces/Zhenhong/text-to-speech-SpeechT5-demo): only support for English, and the generated sound is too monotonic without cadence.
 - Currently the English TTS are good for most models that I have tried. The problem is that when it comes to Chinese-English mixed text, it is hard for model to synthesize the output voice

### [Suno Bark Github](https://github.com/suno-ai/bark)
Suno Bark model is also supported in Coqui TTS, or in Hugging Face Framework.
 - [Tutorial](https://huggingface.co/suno/bark)
 - [Bark Docs](https://huggingface.co/docs/transformers/main/en/model_doc/bark)
 - Bark supported preset voices: [Bark Speaker Library v2](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c). It supports multi-language, including simplified Chinese, but not Taiwanese tones. Users can click this speaker library to listen to some demos there (for me the generated quality is so so). 
 - [Interactive Demo](https://huggingface.co/spaces/suno/bark): users can interact with Bark model.

### [Coqui Github](https://github.com/coqui-ai/TTS)
- [Interactive Demo](https://huggingface.co/spaces/coqui/xtts)
- Allows user to input voice and text so model can imitate your voice to speak out the texts.


## Evaluations
In the world of TTS models, there are three or more evalution methods (keep updating...):
- Log F0
- Human eval in the loop
- mel cepstral distortion (MCD)


## Notes
At the end, I wanna record some technical terms encountered when reading papers so as to easy recall them in the future:
- Speaker Verification
- (log-)mel Spectrogram


Finally, let's apply RAG to an LLM query:

In [42]:
question = "What is Cockatoo AI?"
context = textwrap.dedent("""
You are a senior engineer working on the Cockatoo.AI open-source project.
You are an expert on how to use the `Cockatoo.AI` GitHub repository.

Answer questions using only the provided context. Be as detailed as possible.
If you don't know the answer, say "I don't know."

{}
""").strip()

In [36]:
good_reviews = collection.query(
    query_texts=[question],
    n_results=2,
    include=["documents"],
    # where={"Rating": {"$gte": 3}},
)

In [41]:
reviews_str = "\n\n".join(good_reviews["documents"][0])
#md(reviews_str)

In [43]:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": context.format(reviews_str)},
        {"role": "user", "content": question},
    ],
    temperature=0,
    n=1)

In [44]:
response

ChatCompletion(id='chatcmpl-Aw6bniH7QgjMHYK0GLyzWdgcdk9zP', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Cockatoo.AI is an open-source project that focuses on developing and improving various AI models, particularly in the areas of Text-to-Speech (TTS) and language tutoring. The project aims to provide users with tools and resources to enhance their language learning and communication skills through AI-powered solutions. Cockatoo.AI offers a range of models and resources, such as TTS models like Suno Bark and Coqui TTS, as well as language tutoring capabilities using advanced AI technologies. The project also emphasizes the use of open-source models to benefit users and team members alike.', role='assistant', function_call=None, tool_calls=None, refusal=None))], created=1738411791, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier='default', system_fingerprint=None, usage=CompletionUsage(completion_tokens=117, pro

In [45]:
print(response.choices[0].message.content)

Cockatoo.AI is an open-source project that focuses on developing and improving various AI models, particularly in the areas of Text-to-Speech (TTS) and language tutoring. The project aims to provide users with tools and resources to enhance their language learning and communication skills through AI-powered solutions. Cockatoo.AI offers a range of models and resources, such as TTS models like Suno Bark and Coqui TTS, as well as language tutoring capabilities using advanced AI technologies. The project also emphasizes the use of open-source models to benefit users and team members alike.


<a id='sect_3'></a>
### <b><font color='darkgreen'>Integrate RAG into LangChain</font></b> ([back](#agenda))
<b><font size='3ptx'>A [retrieval system](https://python.langchain.com/docs/concepts/retrievers/) is defined as something that can take string queries and return the most 'relevant' Documents from some sources.</font></b>

A retriever follows the standard Runnable interface, and should be used via the standard runnable methods of `invoke`, `ainvoke`, `batch`, `abatch`.

When implementing a custom retriever, the class should implement the `_get_relevant_documents` method to define the logic for retrieving documents. Optionally, an async-native implementation can be provided by overriding the `_aget_relevant_documents` method.

Example: A retriever that returns the first 5 documents from a list of documents
```python
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from typing import List

class SimpleRetriever(BaseRetriever):
    docs: List[Document]
    k: int = 5
    
    def _get_relevant_documents(self, query: str) -> List[Document]:
        """Return the first k documents from the list of documents"""
        return self.docs[:self.k]

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
        """(Optional) async native implementation."""
        return self.docs[:self.k]
```
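To make the split between the public `invoke` method and the private `_get_relevant_documents` hook more concrete, here is a dependency-free sketch of the same pattern. `Doc`, `MiniRetriever`, and `KeywordRetriever` are simplified stand-ins invented for this example; they are not LangChain classes, and the real `BaseRetriever` additionally handles callbacks, configuration, and async execution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str


class MiniRetriever:
    """Simplified stand-in for BaseRetriever: public `invoke`, private hook."""

    def invoke(self, query: str) -> List[Doc]:
        # Callers use `invoke`; subclasses only override the hook below.
        return self._get_relevant_documents(query)

    def batch(self, queries: List[str]) -> List[List[Doc]]:
        return [self.invoke(query) for query in queries]

    def _get_relevant_documents(self, query: str) -> List[Doc]:
        raise NotImplementedError


class KeywordRetriever(MiniRetriever):
    def __init__(self, docs: List[Doc], k: int = 5):
        self.docs = docs
        self.k = k

    def _get_relevant_documents(self, query: str) -> List[Doc]:
        # Keep documents sharing at least one lowercase word with the query.
        words = set(query.lower().split())
        hits = [d for d in self.docs if words & set(d.page_content.lower().split())]
        return hits[: self.k]


docs = [
    Doc("Model A converts speech to text"),
    Doc("Model C converts text to speech"),
    Doc("Chroma stores embeddings"),
]
retriever_demo = KeywordRetriever(docs, k=2)
print(retriever_demo.invoke("chroma"))
```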

In [47]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class ChromaRetriever(BaseRetriever):
    pass

In [63]:
vectorstore = Chroma(
    persist_directory=CHROMA_PATH,
    collection_name=COLLECTION_NAME,
    embedding_function=openai_ef)

In [55]:
# dir(vectorstore)

In [67]:
md(vectorstore.similarity_search('Cockatoo.AI')[0].page_content)

# Project language tutor (focused on listening & speaking)

_In the Open issues, please put your name at the end of the task that you are interested in._

# Goal towards users:
- Speak as native as possible to users so users can practice listening as well as speaking.
- Can adjust the language tutor’s teaching attitude (e.g., Direct vs Plato)
- Can adjust the language tutor’s speaking/work usage difficulties (e.g., A1, A2, B1, B2, C1 levels) to make learning easier for users.
- Can discuss the topic the user uploaded/raised and can correspondingly ask appropriate questions to the user. 
- Can practice listening to various tones from AI (e.g., expressing the sentence with happiness, sadness, …, from elderly/adult/woman/man …)

# Goal towards team members:
- Improve LLM, voice models, DL, and other framework knowledge/skills. 
- Can use the language tutor to benefit him/herself.
- Learn CI/CD and do research on improving potential product biasedness and relevant skills. 
- Learn the balance between fine-tuning and our computing powers.

# How To (Initiative):
Use models (A-B-C as shown below) to form the pipeline as the final language tutor:
- model A: voice to text
- model B(LLM): text à text , might be managed by [langChain](https://python.langchain.com/docs/get_started/introduction), Andrew Ng provided [2-3 free short courses](https://www.deeplearning.ai/short-courses/).
- model C: text to voice
PS: C might be the same as A, needs more research. But the** Goal towards users** should be the top concern when considering which model to use. Besides, better to use open-source models instead of paying premium APIs.

For model C, one may use LangChain to manage LLM. langChain is a tool that helps build automation pipelines, store vector databases (our own data), and automate the monitoring of LLM with various technical strategies as well as prompt engineering. For example, we may manage the prompt strategy by telling LLM to self-supervise the output text quality, ie, tell LLM to redo it again if the previous output was not good.


# Market Competitors (for language tutoring)
- Langotalk (multi-lan)
- Tutor Lily: AI Language Tutor (multi-lan)

# Good References
## For Model A (Speech to text, STT)
- The library [`SpeechRecognition`](https://pypi.org/project/SpeechRecognition/)
  is used for performing speech recognition, with support for several engines and APIs, online and offline.
    - [Notebook - Easy Speech-to-Text with Python](https://github.com/johnklee/ml_articles/blob/master/medium/Easy-speech-to-text-with-python/notebook.ipynb)
- [`Whisper`](https://github.com/openai/whisper) is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
    - [Notebook - Introduction for Whisper in OpenAI](https://github.com/johnklee/ml_articles/blob/master/others/ithome_ithelp_openapi_whisper/notebook.ipynb)
## For Model C (Text to speech, TTS)
- The company [Evelenlabs](https://elevenlabs.io/) has advanced **non-open sourced** TTS model for multi-languages. It performs well in Chinese (no Taiwanese tone) and English from my perspective, which inlcudes happy/sad/... tones as well as male/female/... sounds.


# Misc


In [68]:
chat_model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa_chain_with_rag = (
    {"context": retriever, "question": RunnablePassthrough()}  # 1) Integrate RAG
    | review_prompt_template                                   # 2) Put context retrieved RAG into question
    | chat_model                                               # 3) Feed in LLM
    | StrOutputParser()                                        # 4) Format the output
)

In [71]:
%%time
question = "What is Cockatoo AI?"
resp = qa_chain_with_rag.invoke(question)

CPU times: user 52.3 ms, sys: 6.95 ms, total: 59.3 ms
Wall time: 3.72 s


In [72]:
md(resp)

Cockatoo.AI is an open-source project focused on creating a language tutor that emphasizes listening and speaking skills. The goal of Cockatoo.AI is to help users practice speaking as naturally as possible and adjust the teaching attitude and difficulty levels of the language tutor to make learning easier. The project also aims to enable the language tutor to discuss user-uploaded topics and ask appropriate questions. Additionally, Cockatoo.AI aims to help team members improve their knowledge and skills in areas such as LLM, voice models, deep learning, and other frameworks. The project also encourages team members to learn about CI/CD and research ways to reduce potential product biases.

<a id='sect_4'></a>
### <b><font color='darkgreen'>Integrate model A into LangChain</font></b> ([back](#agenda))
<b><font size='3ptx'>Our next step is to turn voice into text before feeding it into the LangChain pipeline.</font></b>

Here we leverage [**LCEL**](https://python.langchain.com/v0.1/docs/expression_language/) (LangChain Expression Language) to achieve this goal:

In [75]:
from langchain_core.runnables import RunnableLambda

stt_agent = SRWhisperWrapper(wrapper.LangEnum.en)

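def parse_audio_file(data_dict):
    """Transcribe the audio file and return the recognized text as a plain string."""
    audio_file_path = data_dict['audio_file_path']
    #context = data_dict.get('context', 'You are a software engineer and passionate about AI.')
    resp = stt_agent.audio_2_text(audio_file_path)
    # Return a plain string (not a dict) so that the next step can hand it to
    # both the retriever and RunnablePassthrough() as the question text.
    return resp.text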

input_audio_parser = RunnableLambda(parse_audio_file)

In [76]:
prompt_str = textwrap.dedent("""
You are a senior engineer working on the Cockatoo.AI open-source project.
You are an expert on how to use the `Cockatoo.AI` GitHub repository.

Answer questions using only the provided context. Be as detailed as possible.
If you don't know the answer, say "I don't know."

{context}

With above context, please answer below question:
{question}
""").strip()
prompt = ChatPromptTemplate.from_template(prompt_str)

In [77]:
chain = (
    input_audio_parser                                           # 1) Turn speech into text
    | {"context": retriever, "question": RunnablePassthrough()}  # 2) Integrate RAG
    | prompt                                                     # 3) Put context retrieved RAG into question
    | chat_model                                                 # 4) Feed in LLM
    | StrOutputParser()                                          # 5) Format the output
)

In [79]:
%%time
resp = chain.invoke({'audio_file_path': TEST_AUDIO_PATH})

CPU times: user 39.9 ms, sys: 9.5 ms, total: 49.4 ms
Wall time: 5.89 s


In [80]:
resp

'Cockatoo.AI is an open-source project focused on creating a language tutor that helps users practice listening and speaking skills. The goal of Cockatoo.AI is to enable users to speak as native as possible, adjust the teaching attitude and difficulty levels of the language tutor, discuss user-uploaded topics, and practice listening to various tones from AI. The project aims to improve voice models, deep learning knowledge, and other framework skills for team members. The project also emphasizes using open-source models instead of premium APIs and learning the balance between fine-tuning and computing power. The project utilizes models for voice-to-text conversion, language model management, and text-to-voice conversion to create the final language tutor pipeline. Additionally, Cockatoo.AI provides references for speech-to-text and text-to-speech models, as well as information on market competitors in the language tutoring space.'
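In the chain above, the retriever's output (a list of documents) is substituted into `{context}` as-is, so the prompt receives the string representation of that list. A common refinement is to join only the page contents before they reach the prompt. The helper below is a minimal sketch of that step; `Doc` is a stand-in for LangChain's `Document`, and `format_docs` is a name chosen here, not something defined in this repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str


def format_docs(docs: List[Doc]) -> str:
    """Join retrieved documents into one context string, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Doc("model A: voice to text"), Doc("model C: text to voice")]
print(format_docs(retrieved))

# Assumed usage inside the chain (LCEL wraps plain functions as runnables):
#   {"context": retriever | format_docs, "question": RunnablePassthrough()}
```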

<a id='sect_5'></a>
### <b><font color='darkgreen'>Survey and demo of GCP Text-to-Speech APIs</font></b> ([back](#agenda))
<b><font size='3ptx'>Here we try out the GCP Text-to-Speech APIs for `model C`.</font></b>

In [84]:
from dotenv import load_dotenv, find_dotenv
import os

os.environ['ALLOW_RESET'] = 'TRUE'
_ = load_dotenv(find_dotenv(os.path.expanduser('~/.env')))

In [87]:
from typing import Sequence
import google.cloud.texttospeech as tts

def unique_languages_from_voices(voices: Sequence[tts.Voice]):
    language_set = set()
    for voice in voices:
        for language_code in voice.language_codes:
            language_set.add(language_code)
    return language_set

def list_languages():
    client = tts.TextToSpeechClient()
    response = client.list_voices()
    languages = unique_languages_from_voices(response.voices)

    print(f" Languages: {len(languages)} ".center(60, "-"))
    for i, language in enumerate(sorted(languages)):
        print(f"{language:>10}", end="\n" if i % 5 == 4 else "")


def list_voices(language_code=None):
    client = tts.TextToSpeechClient()
    response = client.list_voices(language_code=language_code)
    voices = sorted(response.voices, key=lambda voice: voice.name)

    print(f" Voices: {len(voices)} ".center(60, "-"))
    for voice in voices:
        languages = ", ".join(voice.language_codes)
        name = voice.name
        gender = tts.SsmlVoiceGender(voice.ssml_gender).name
        rate = voice.natural_sample_rate_hertz
        print(f"{languages:<8} | {name:<24} | {gender:<8} | {rate:,} Hz")

In [85]:
list_languages()

---------------------- Languages: 60 -----------------------
     af-ZA     am-ET     ar-XA     bg-BG     bn-IN
     ca-ES    cmn-CN    cmn-TW     cs-CZ     da-DK
     de-DE     el-GR     en-AU     en-GB     en-IN
     en-US     es-ES     es-US     et-EE     eu-ES
     fi-FI    fil-PH     fr-CA     fr-FR     gl-ES
     gu-IN     he-IL     hi-IN     hu-HU     id-ID
     is-IS     it-IT     ja-JP     kn-IN     ko-KR
     lt-LT     lv-LV     ml-IN     mr-IN     ms-MY
     nb-NO     nl-BE     nl-NL     pa-IN     pl-PL
     pt-BR     pt-PT     ro-RO     ru-RU     sk-SK
     sr-RS     sv-SE     ta-IN     te-IN     th-TH
     tr-TR     uk-UA     ur-IN     vi-VN    yue-HK


In [88]:
list_voices("en")

----------------------- Voices: 105 ------------------------
en-AU    | en-AU-Journey-D          | MALE     | 24,000 Hz
en-AU    | en-AU-Journey-F          | FEMALE   | 24,000 Hz
en-AU    | en-AU-Journey-O          | FEMALE   | 24,000 Hz
en-AU    | en-AU-Neural2-A          | FEMALE   | 24,000 Hz
en-AU    | en-AU-Neural2-B          | MALE     | 24,000 Hz
en-AU    | en-AU-Neural2-C          | FEMALE   | 24,000 Hz
en-AU    | en-AU-Neural2-D          | MALE     | 24,000 Hz
en-AU    | en-AU-News-E             | FEMALE   | 24,000 Hz
en-AU    | en-AU-News-F             | FEMALE   | 24,000 Hz
en-AU    | en-AU-News-G             | MALE     | 24,000 Hz
en-AU    | en-AU-Polyglot-1         | MALE     | 24,000 Hz
en-AU    | en-AU-Standard-A         | FEMALE   | 24,000 Hz
en-AU    | en-AU-Standard-B         | MALE     | 24,000 Hz
en-AU    | en-AU-Standard-C         | FEMALE   | 24,000 Hz
en-AU    | en-AU-Standard-D         | MALE     | 24,000 Hz
en-AU    | en-AU-Wavenet-A          | FEMALE   | 24,00
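With more than a hundred English voices, it can help to group them by family (for example `Neural2`, `Standard`, or `Journey`) before choosing one. The sketch below does this from the voice names alone, assuming they follow the `<language>-<region>-<family>-<variant>` layout seen in the listing above; `voice_family` and `group_by_family` are hypothetical helpers, not part of the GCP client library.

```python
from collections import defaultdict


def voice_family(voice_name: str) -> str:
    """Extract the family segment, e.g. 'Neural2' from 'en-AU-Neural2-A'."""
    parts = voice_name.split("-")
    # Assumed layout: <language>-<region>-<family>-<variant>.
    return parts[2] if len(parts) >= 4 else "unknown"


def group_by_family(voice_names):
    """Map each family name to the list of voices that belong to it."""
    groups = defaultdict(list)
    for name in voice_names:
        groups[voice_family(name)].append(name)
    return dict(groups)


sample = ["en-AU-Journey-D", "en-AU-Neural2-A", "en-AU-Neural2-B", "en-AU-Standard-A"]
print(group_by_family(sample))
```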

In [97]:
def text_to_wav(voice_name: str, text: str):
    language_code = "-".join(voice_name.split("-")[:2])
    text_input = tts.SynthesisInput(text=text)
    voice_params = tts.VoiceSelectionParams(
        language_code=language_code, name=voice_name
    )
    audio_config = tts.AudioConfig(audio_encoding=tts.AudioEncoding.LINEAR16)

    client = tts.TextToSpeechClient()
    response = client.synthesize_speech(
        input=text_input,
        voice=voice_params,
        audio_config=audio_config,
    )

    filename = f"{voice_name}.wav"
    with open(filename, "wb") as out:
        out.write(response.audio_content)
        print(f'Generated speech saved to "{filename}"')

    return filename

In [98]:
output_wav_file = text_to_wav("en-US-Studio-O", "What is Cockatoo.AI?")

Generated speech saved to "en-US-Studio-O.wav"
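Note that `text_to_wav` names its output file after the voice only, so every call with the same voice overwrites the previous recording. If earlier answers should be kept, one option is to include a short hash of the text in the file name. The helper below is a sketch of that idea; `wav_filename` is a hypothetical name, and the 8-character SHA-1 prefix is an arbitrary choice.

```python
import hashlib


def wav_filename(voice_name: str, text: str) -> str:
    """Build a file name that differs whenever the voice or the text differs."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
    return f"{voice_name}-{digest}.wav"


print(wav_filename("en-US-Studio-O", "What is Cockatoo.AI?"))
print(wav_filename("en-US-Studio-O", "What is model A?"))
```

Inside `text_to_wav`, this would replace the `filename = f"{voice_name}.wav"` line.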


In [99]:
IPython.display.Audio(output_wav_file)

As we did for `model A`, we can wrap `model C` in a `RunnableLambda` so that it can be chained in LangChain:

In [103]:
def output_text_2_audio_file(text):
    return text_to_wav("en-US-Studio-O", text)

output_audio_file = RunnableLambda(output_text_2_audio_file)

In [104]:
%%time
audio_file = output_audio_file.invoke(
    'Cockatoo.AI is an open-source project focused on creating a language tutor that helps users practice listening and speaking skills.')

Generated speech saved to "en-US-Studio-O.wav"
CPU times: user 52.4 ms, sys: 0 ns, total: 52.4 ms
Wall time: 113 ms


In [105]:
IPython.display.Audio(audio_file)

<a id='sect_6'></a>
## <b><font color='darkblue'>Chain models `A`, `B` and `C` together in LangChain</font></b>
<b><font size='3ptx'>Our final step is to chain models `A`, `B` and `C` to form a total solution.</font></b>

In [106]:
final_chain = (
    input_audio_parser                                           # 1) Turn speech into text
    | {"context": retriever, "question": RunnablePassthrough()}  # 2) Integrate RAG
    | prompt                                                     # 3) Put context retrieved RAG into question
    | chat_model                                                 # 4) Feed in LLM
    | StrOutputParser()                                          # 5) Format the output
    | output_audio_file                                          # 6) Turn text into speech
)

Let's try the first question, `What is Cockatoo.AI?`:

In [109]:
%%time
audio_file = final_chain.invoke({'audio_file_path': TEST_AUDIO_PATH})

Generated speech saved to "en-US-Studio-O.wav"
CPU times: user 95.7 ms, sys: 38.2 ms, total: 134 ms
Wall time: 6.53 s


In [110]:
IPython.display.Audio(audio_file)

And the second question, `What is model A?`:

In [123]:
IPython.display.Audio(WHAT_IS_MODEL_A_TEST_AUDIO_PATH)

In [124]:
%%time
audio_file = final_chain.invoke({'audio_file_path': WHAT_IS_MODEL_A_TEST_AUDIO_PATH})

Generated speech saved to "en-US-Studio-O.wav"
CPU times: user 116 ms, sys: 15.9 ms, total: 132 ms
Wall time: 6.31 s


In [125]:
IPython.display.Audio(audio_file)
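The full chain takes a little over 6 seconds per question. The earlier cells measured about 1.8 s for STT alone, 3.72 s for the RAG chain, and 113 ms for TTS, so it may be useful to record per-stage timings inside the chain itself. The decorator below is a minimal sketch of one way to do that; `timed`, `stage_timings`, and the fake stage functions are placeholders invented for this example.

```python
import time
from functools import wraps

# Most recent wall time per stage name, in seconds.
stage_timings = {}


def timed(stage_name):
    """Decorator that records the wall time of the most recent call per stage."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                stage_timings[stage_name] = time.perf_counter() - start
        return wrapper
    return decorator


@timed("stt")
def fake_stt(audio_path):
    # Placeholder for stt_agent.audio_2_text(audio_path).text
    return f"transcript of {audio_path}"


@timed("tts")
def fake_tts(text):
    # Placeholder for text_to_wav("en-US-Studio-O", text)
    return "answer.wav"


fake_tts(fake_stt("question.wav"))
print({name: f"{seconds:.6f}s" for name, seconds in stage_timings.items()})
```

In the notebook, the same decorator could be applied to `parse_audio_file` and `output_text_2_audio_file` before they are wrapped in `RunnableLambda`.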

## <b><font color='darkblue'>Experiment to record the voice</font></b>

In [113]:
import sounddevice as sd
from scipy.io.wavfile import write
import wavio as wv

In [114]:
# Sampling frequency
freq = 44100

# Recording duration
duration = 5

In [118]:
# Start recorder with the given values of 
# duration and sample frequency
recording = sd.rec(int(duration * freq), 
                   samplerate=freq, channels=1)

# Record audio for the given number of seconds
sd.wait()

In [119]:
# This will convert the NumPy array to an audio
# file with the given sampling frequency
recording_file_name = "recording0.wav"
write(recording_file_name, freq, recording)
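One caveat, stated as an assumption: as far as we know, `sd.rec` returns floating-point samples by default, and writing them directly produces a floating-point WAV file, which some speech tools may not accept. If 16-bit PCM is needed, the samples can be scaled and written with the standard `wave` module. The sketch below uses a synthetic tone instead of a real recording, and `floats_to_pcm16` and `write_pcm16_wav` are helper names chosen for this example.

```python
import math
import struct
import wave


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono 16-bit PCM audio to `path`."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono, matching channels=1 above
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(floats_to_pcm16(samples))


# Demo input: 0.1 s of a 440 Hz tone instead of a real microphone recording.
rate = 44100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
write_pcm16_wav("tone_pcm16.wav", tone, rate)
```

For the recording above, something like `write_pcm16_wav("recording0_pcm16.wav", recording[:, 0].tolist(), freq)` should work, assuming `recording` is a NumPy array of shape `(frames, 1)`.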

In [120]:
IPython.display.Audio(recording_file_name)

## <b><font color='darkblue'>Supplement</font></b>
* [Pinecone - LangChain Expression Language Explained](https://github.com/johnklee/ml_articles/blob/master/others/Langchain_expression_language/notebook.ipynb)
* [OpenAI - Introducing APIs for GPT-3.5 Turbo and Whisper](https://openai.com/index/introducing-chatgpt-and-whisper-apis/)
* [LangChain doc - Chains](https://python.langchain.com/v0.1/docs/modules/chains/)
* [LangChain doc - LangChain Expression Language (LCEL)](https://python.langchain.com/v0.1/docs/expression_language/)
> LangChain Expression Language, or LCEL, is a declarative way to easily compose chains together.
* [LangChain doc > Vectorstore: Chroma](https://python.langchain.com/v0.1/docs/integrations/vectorstores/chroma/)
* [RealPython: Playing and Recording Sound in Python](https://realpython.com/playing-and-recording-sound-python/#recording-audio)
* [Tutorial: Use Chroma and OpenAI to Build a Custom Q&A Bot](https://thenewstack.io/tutorial-use-chroma-and-openai-to-build-a-custom-qa-bot/)
* [GCP doc - Using the Text-to-Speech API with Python](https://codelabs.developers.google.com/codelabs/cloud-text-speech-python3?hl=zh-tw#6)
* [Create a Voice Recorder using Python](https://www.geeksforgeeks.org/create-a-voice-recorder-using-python/)