## Tutorial: How to Create a Voice-Enabled Chat-Bot

**Team:** Skandda Chandrasekar, Tonya Chivandire, Luke Summers

**Summary:** This project creates a voice-enabled chat-bot that you can ask questions out loud and that answers back in audio. Most chat-bots in use today are text-based, which makes them hard to access for people who are blind or visually impaired, so we set out to build one with better accessibility. Related work also tends to rely on non-ML solutions and supports only a single question and answer. By the end of this tutorial, you will be able to have a rolling audio conversation with a chat-bot that remembers all previous questions and answers.

**Audience:** Individuals with beginner Python experience who want to learn about audio input and output as well as response generation, more specifically with the ChatGPT API.

**External libraries used:** `SpeechRecognition`, `PyAudio`, `OpenAI`, `gTTS`, `playsound`, and potentially `flac` and/or `portaudio`, along with Python's built-in `os` module

**Vocabulary:** Natural Language Processing, ChatGPT, API, audio, API Key, class, model, chat-bot, system, library


## Introduction
This tutorial is written in Python and makes use of various libraries and APIs supported by Python.

**What is a library?:** A library is a group of modules that can be downloaded and used in new programs. For example, one library used in this tutorial is the gTTS library, which provides functions for converting text into an audio file. Once downloaded, a library can be imported into any Python file. You can import the whole library or only the individual modules you need.
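
As a small illustration of the two import styles described above, the sketch below uses Python's built-in `os.path` module rather than a downloaded library, so it runs without installing anything. The file path is a made-up example.

```python
# Style 1: import a module and reach its functions through the module name.
import os.path

# Style 2: import one function from the module and call it directly.
from os.path import basename

example_path = "/tmp/response.mp3"  # hypothetical path, used only for illustration

name_via_module = os.path.basename(example_path)
name_via_function = basename(example_path)

# Both styles call the same function, so both names are identical.
print(name_via_module, name_via_function)
```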

<span style=color:lightblue>**Source:** [What are Libraries in Python](https://codeinstitute.net/global/blog/what-are-libraries-in-python/)</span>

**What is an API?:** An API, or application programming interface, is a way for your code to connect to other software. APIs can be used for many things, but in this tutorial the OpenAI API will be used to send prompts to a ChatGPT model. Some APIs are open, meaning they are free to use and require only an email account to obtain a key. OpenAI's is not: each query requires a key tied to an account that can be charged. You will not have to worry about that for this tutorial, as significant charges would only arise from far more queries than are made by running this tutorial multiple times. For any API, once you have a key, you can use it to query the API in your code as long as the key has the proper access for the call you are making.

<span style=color:lightblue>**Source:** [What is an API](https://aws.amazon.com/what-is/api/)</span>

Using a library and API that convert microphone input into text, the OpenAI ChatGPT API, and a library that converts text into a playable audio file, this tutorial walks through creating a ChatGPT-powered conversation bot in Python.

In this tutorial, we will discuss how to do the following:

* Install a library onto your machine
* Create an OpenAI account
* Process audio into text (either from user input or a file)
* Generate a ChatGPT-created response to a query
* Convert said response to an audio file and output it
* Have your chat-bot remember past questions and answers, like a conversation

To start, we will learn how to install all the required dependencies for this tutorial.

## Part 0: Installing Homebrew for Mac
**Important Note:** If you are running this tutorial on Mac, you will need to install the package manager Homebrew.

<span style="color:salmon">**I. Installing Homebrew**</span>

Homebrew is a terminal package manager for Mac that we will use to install dependencies required by certain libraries later in the tutorial. To install Homebrew, point your browser to https://docs.brew.sh/Installation for Homebrew's system requirements and installation instructions. After following the installation prompts, Homebrew will be ready to use in the terminal. For this project, you will need to run the commands ```brew install flac``` and ```brew install portaudio``` in your MacBook's terminal. Note that Homebrew installs packages into different locations depending on whether your MacBook has an Intel or an Apple processor. While this will not affect our project, if you were using Homebrew to install packages for something like compiling C++ files, you may need to adjust your compiler settings to account for this.

## Part 1: Installing Required Libraries

**Important Note:** This tutorial will be run on Python 3.11.6 in a Jupyter Notebook environment. We recommend running it in VS Code.

<span style="color:salmon">**I. Processing Audio to Text**</span>

The libraries required for this portion of the tutorial are SpeechRecognition and PyAudio. These can be installed using pip.
Those who are utilizing MacBooks for this tutorial require both the Flac and PortAudio libraries that can be installed using homebrew. If you do not have homebrew on your machine, you can follow the instructions [here](https://brew.sh/).

For MacBook users, please make sure you followed Part 0 in order to get flac and portaudio in your terminal.
* ```brew install flac```
* ```brew install portaudio```

Both Windows and Mac users should run the following two commands

* ```pip install SpeechRecognition```
* ```pip install PyAudio```

<span style="color:salmon">**II. Generate Response from Text**</span>

There is only one library required for this portion of the tutorial and it is OpenAI. This can be installed using pip.

* ```pip install openai```

<span style="color:salmon">**III. Converting Text to Speech**</span>

The libraries required for this portion of this tutorial are Google Text to Speech and Playsound. These can be installed using pip. Please note that we will be installing a specific version of Playsound, specifically 1.2.2, which is compatible with the tutorial.

* ```pip install gtts```
* ```pip install playsound==1.2.2```

If you have any trouble installing any library, but more specifically the Playsound library, try running the following command.

* ```pip install --upgrade wheel```

<span style="color:salmon">If you want to automatically install these, run the following cell below. If you are on MacBook, make sure that you followed Part 0 in order to have flac and portaudio installed. Make sure to restart the Kernel in your code editor in order to use these libraries.</span>

In [None]:
%pip install SpeechRecognition
%pip install PyAudio
%pip install openai
%pip install gTTS
%pip install playsound==1.2.2
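
After restarting the kernel, it can help to confirm that each library is actually importable before moving on. The sketch below is one way to do that with the standard library only. The mapping from pip package name to import name is our assumption about how these packages are imported, and `find_missing` is a hypothetical helper, not part of any of the libraries.

```python
import importlib.util

# pip package name -> name used in an import statement (assumed mapping).
REQUIRED_MODULES = {
    "SpeechRecognition": "speech_recognition",
    "PyAudio": "pyaudio",
    "openai": "openai",
    "gTTS": "gtts",
    "playsound": "playsound",
}

def find_missing(modules):
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]

missing = find_missing(REQUIRED_MODULES)
print("Missing libraries:", missing if missing else "none")
```

If the list is not empty, rerun the matching `%pip install` line and restart the kernel again.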

## Troubleshooting
<span style="color:salmon">**I. Installing Anaconda**</span>

If you have trouble running part one and get errors related to the flac library, try changing the kernel to anaconda. This can be done by installing anaconda and then restarting VS Code. When you try to run the code after restarting, choose the Anaconda python environment.
Anaconda provides distributions of the Python and R programming languages, and it helps manage the packages for these environments and deploy them. While its focus is on running programs that involve data science, we will use it to change the Python environment of the IPython notebook kernel. To install Anaconda, point your browser to https://www.anaconda.com/download, and download the Mac Anaconda installer. Once downloaded, run the installer, following all of its prompts until Anaconda has been installed successfully. Once installed, the installer can be disposed of. To make use of the Anaconda environment in the IPython notebook kernel, upon the first run of code in an open notebook, you should be prompted to choose a Python environment for the kernel. With Anaconda installed, there will be an option to choose an Anaconda environment instead of the recommended Python one, and choosing that option sets up the kernel successfully.

In [None]:
#if anything fails to install, run this cell and see if it helps. If not, ignore this cell.
%pip install --upgrade wheel

In [None]:
#MACBOOK: Try this cell if you have issues later on retrieving audio. You could just install this to be safe as well
%pip install -U PyObjC
%pip install AppKit

## Part 2: Creating an OpenAI ChatGPT Account

If you created a ChatGPT account with your phone number more than three months ago, then your free trial period for the API has expired, and you can just use our key. If you have access to a new phone number for a new ChatGPT account or have created an account within the past three months, then you can use your key in limited amounts for free.

<span style="color:salmon">**Our API Key:**</span> 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

<span style="color:salmon">**Follow the steps below to create an OpenAI ChatGPT Account and access your API key.**</span>

Please note, this tutorial will be free for those who create a new ChatGPT Account and complete the tutorial within the first three months of doing so. You are also welcome to use our API key in this tutorial.

Follow these steps to create a new account if you have not created one yet. If you have, then you should use our API key.

1. Navigate in any internet browser to [OpenAI](https://openai.com).

2. Click 'login' in the top right of the webpage.

3. Create an account if you do not have one already. If you do, please login.

4. Once you are logged in, navigate to the [documentation for OpenAI](https://platform.openai.com/docs/overview).

5. On the left sidebar, navigate to the 'API Keys' section.

6. Create a new 'secret key' and make sure that you copy down the key. Keep it in a safe place for later.

## Part 3: Implementation: Processing Audio into Text

There are two main processes involved in processing audio into text. First, we need to **retrieve the audio**. This is capturing and processing the audio into a sequence of bytes. Second, we need to **transcribe the audio**. This is converting the sequence of bytes into text/natural language.

To do this, we first import the speech recognition library and create a new instance of the ```Recognizer``` class.

In [None]:
import speech_recognition as sr
recognizer = sr.Recognizer()

### 3.1 Retrieving the Audio

<span>**I. Listening to Audio with a Microphone**</span>

One way of retrieving audio input is by listening to live speech through a microphone. To do this we use the ```Microphone``` class from the speech recognition library and declare it as our audio source. The ```listen``` function will process the audio and return an object of type ```AudioData```. This object stores a sequence of bytes representing audio samples.

Since we will be making more calls to `listen`, we will wrap the code in a function `listen_to_audio` which returns the audio data from `listen`.

<span style=color:lightblue>**Source:** [SpeechRecognition Microphone Example](https://github.com/Uberi/speech_recognition/blob/master/examples/microphone_recognition.py#L7-L11)</span>


In [None]:
# obtain audio from the microphone
def listen_to_audio():
    with sr.Microphone() as source:
        print("Say something!")
        audio = recognizer.listen(source)
        print("Audio retrieved.")
    return audio

**Adjusting Energy Threshold:** The energy threshold is the minimum audio energy that the recognizer will treat as speech. It has a default value of 300. A higher value makes the recognizer less sensitive, which is suitable if you are speaking in a loud room or a room with a lot of ambient sound. Not adjusting the threshold in such an environment can lead to unintelligible audio during transcription in Part 3.2. If noise is an issue, you can try adjusting the value of the energy threshold in the cell below.

<span style=color:lightblue>**Source:** [SpeechRecognition Library Reference](https://github.com/Uberi/speech_recognition/blob/master/reference/library-reference.rst#recognizer_instanceenergy_threshold--300---type-float)</span>

In [None]:
#This is what has worked well for us
recognizer.energy_threshold = 700
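
Instead of picking a number by hand, the SpeechRecognition library can also measure the background noise and set the threshold itself through `adjust_for_ambient_noise`. The sketch below wraps that call in a hypothetical helper, `calibrate_threshold`, which is our own name and not part of the library. It assumes you pass in the `recognizer` created earlier and an already opened audio source.

```python
def calibrate_threshold(recognizer, source, seconds=1.0):
    """Let the recognizer sample background noise and set its own threshold.

    `recognizer` is expected to be an sr.Recognizer and `source` an open
    sr.Microphone (or sr.AudioFile); neither is created here.
    """
    recognizer.adjust_for_ambient_noise(source, duration=seconds)
    return recognizer.energy_threshold

# Example usage (needs a working microphone, so it is left commented out):
# with sr.Microphone() as source:
#     print(calibrate_threshold(recognizer, source))
```

Stay quiet while the calibration runs, since anything you say during that window is counted as background noise.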

In [None]:
mic_audio = listen_to_audio()

<span>**II. Retrieving the Audio from a File**</span>

Alternatively, we can recognize speech input from an audio file. This is useful when needing to recognize pre-recorded or separate audio inputs e.g. a series of requests in different languages.

To do this, we first declare our audio file as our source. Then we use the ```record``` function to read the entire audio file. Similarly to the ```listen``` function, ```record``` returns an ```AudioData``` object. This is wrapped in our function `get_audio_from_file`.

<span style=color:lightblue>**Source:** [SpeechRecognition Audio Transcribe Example](https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py#L11-L14)</span>

In [None]:
# read the entire audio file
def get_audio_from_file(audio_file):
    with sr.AudioFile(audio_file) as source:
        return recognizer.record(source)

The SpeechRecognition library only supports WAV, AIFF and FLAC audio file types. You can use the wav files provided for you in this tutorial, but if you prefer to use your own speech, record your speech to an audio file, convert it to a supported file format, and save the file in the same directory as this notebook. Be sure to replace *`how_are_you.wav`* below with the name of your file.
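
Before handing a file to `get_audio_from_file`, a quick extension check can catch an unsupported format early. This is a minimal sketch: `is_supported_audio_file` is a hypothetical helper, and treating `.aif` as an alternative spelling of `.aiff` is our assumption. It only looks at the file name and does not inspect the audio data itself.

```python
from pathlib import Path

# Extensions for the WAV, AIFF and FLAC formats (".aif" is an assumed alias).
SUPPORTED_EXTENSIONS = {".wav", ".aiff", ".aif", ".flac"}

def is_supported_audio_file(file_name):
    """Return True if the file extension matches a supported audio format."""
    return Path(file_name).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported_audio_file("how_are_you.wav"))
print(is_supported_audio_file("voice_memo.mp3"))
```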


In [None]:
file_audio_en = get_audio_from_file("how_are_you.wav")
file_audio_fr = get_audio_from_file("french.flac")

### 3.2 Transcribing the Audio to Text

Our audio is now retrieved, but as mentioned before, it is still stored in ```AudioData``` objects, which hold a sequence of bytes. We need to convert those bytes into actual text (a string) that we can feed as input to our chatbot. To do this, we can call the recognizer's `recognize_google` method, which sends the audio data to the Google Speech Recognition API for transcription. This tutorial will use the default API key, which does not need to be passed into the call.

Run the cell below to transcribe the speech that we recorded from the microphone earlier.

<span style=color:lightblue>**Source:** [SpeechRecognition Audio Transcribe Example](https://github.com/Uberi/speech_recognition/blob/master/examples/audio_transcribe.py#L24-L29)</span>

In [None]:
mic_audio_text = recognizer.recognize_google(mic_audio)
print("Google Speech Recognition thinks you said: \"{}\"".format(mic_audio_text))

The code cell below transcribes our two English and French audio files. Note that the default language is *en-US* and we need to specify the language parameter to recognize any other language/dialect.

In [None]:
text_en = recognizer.recognize_google(file_audio_en)
print("Google Speech Recognition thinks you said: \"{}\"".format(text_en))

text_fr = recognizer.recognize_google(file_audio_fr, language="fr")
print("Google Speech Recognition thinks you said: \"{}\"".format(text_fr))

The `recognize_google` function can raise two errors.

1. An `UnknownValueError` is raised if the speech is unintelligible. This could be because the audio input is in a different language from the recognizer language (in which case you would specify the language as shown above). It could also be due to ambient noise (revisit 3.1 on how to adjust the energy threshold).

2. A `RequestError` is raised if the request to the API fails, for example because of a network issue (in which case check that your internet connection is stable).
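
One way to keep a single unintelligible phrase from stopping the notebook is to retry the step a few times. The sketch below is a generic retry helper written with the standard library only. `retry` and its parameters are our own hypothetical names, and the commented usage line assumes the `recognizer`, `listen_to_audio`, and `sr` names defined earlier in this tutorial.

```python
def retry(action, retry_on, attempts=3):
    """Call `action()` until it succeeds or `attempts` runs out.

    `retry_on` is a tuple of exception classes that should trigger another
    try; any other exception is raised as usual. Returns None if every
    attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            print(f"Attempt {attempt} of {attempts} failed: {type(error).__name__}")
    return None

# Example usage (needs a microphone and network access, so it is commented out):
# text = retry(lambda: recognizer.recognize_google(listen_to_audio()),
#              retry_on=(sr.UnknownValueError,))
```

Only `UnknownValueError` is retried in the usage line because speaking again can fix it; a `RequestError` usually needs the connection to be checked first.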

## Part 4: Implementation: Generating Response from Text

At this point, we have taken audio input from the user and saved it as a string named `mic_audio_text`, and now we will use the ChatGPT API to get a conversational response to this string. In this section, we will write a function that takes this string as input, queries the gpt-3.5-turbo ChatGPT model, and returns the response as a string. This will make use of the `OpenAI` object from the openai library and its `.chat.completions` functionality.

In [None]:
from openai import OpenAI

The first step is to create a variable for the API key. This will be used to make the call to the ChatGPT model for a response.

In [None]:
#Replace this with your API key if you have generated one.
OPENAI_API_KEY = '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'
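
If you generate your own key, you may prefer not to paste it into the notebook at all. A common alternative is to store it in an environment variable and read it at run time. The sketch below assumes you have set a variable named `OPENAI_API_KEY` in your shell; `load_api_key` is a hypothetical helper of our own.

```python
import os

def load_api_key(variable_name="OPENAI_API_KEY"):
    """Return the key stored in the named environment variable."""
    key = os.environ.get(variable_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {variable_name} environment variable first.")
    return key

# Example usage, once the variable has been set:
# OPENAI_API_KEY = load_api_key()
```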

The function, `generate_response`, takes only one string input, which is the input to be used in the ChatGPT query. The first step within the function is to make an OpenAI object, as with that we will be able to query whichever ChatGPT model we please. The API key is passed as a parameter when constructing the object to unlock access to the API. Then, we will create a chat completion object in order to generate and save a response to our input. We then access and return the response that the query gave from the choices attribute of the chat completion object.

The client (in our case) will take the model and a list of inputs that are messages. We will manipulate these more heavily later in the tutorial, so we will explain them in more detail then. For now, just know we have a system command and then the question that the user asked. The system command tells the model how to behave when creating responses.

The call returns the response with lots more information than we need for the sake of this tutorial. For example, it gives a finish reason, id, and more. We are interested in the message content. We access this by calling `response.choices[0].message.content` as you can see in the return statement.

Especially in this section, if you want to delve into the details and power of the API beyond what we are doing here, we suggest you read more on the OpenAI documentation pages.

<span style=color:lightblue>**Source:** [OpenAI API Documentation](https://platform.openai.com/docs/guides/text-generation)</span>

In [None]:

def generate_response(question):
  #construct OpenAI object with API key
  client = OpenAI(api_key=OPENAI_API_KEY)
  #create chat completion object with 'question' as an input
  response = client.chat.completions.create(
    model="gpt-3.5-turbo", #sets the model of ChatGPT to desired model
    messages=[
      {"role": "system", "content": "You are a helpful assistant."}, #message from the system to the model that it is a helpful assistant
      {"role": "user", "content": question}, #input message from user
    ]
  )
  #return only the text content of the first choice's message
  return response.choices[0].message.content
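
To see how the access path `response.choices[0].message.content` works without spending any API credit, the sketch below builds a hand-made stand-in object with `SimpleNamespace`. It mimics only the fields mentioned above; the id and the message text are invented for illustration, and a real response carries more fields.

```python
from types import SimpleNamespace

# A hand-built stand-in, not a real API response.
fake_response = SimpleNamespace(
    id="example-id",
    choices=[
        SimpleNamespace(
            finish_reason="stop",
            message=SimpleNamespace(
                role="assistant",
                content="Hello! How can I help?",
            ),
        )
    ],
)

# Same access path as the return statement in generate_response.
answer = fake_response.choices[0].message.content
print(answer)
```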

This is an example call of the function that saves the response to a string variable named 'response'.

In [None]:
# generate response
response = generate_response(mic_audio_text)

## Part 5: Implementation: Converting Text to Speech and Output

We have now received a response to our initial voice-inputted question from ChatGPT. Reminder that we have saved this response in a local variable `response`, as you can see above. Now we can take this response and output it as audio through the device we are currently on! We will be using the `gTTS` ( _Google Text-To-Speech_), `playsound`, and `os` libraries to make this process trivial.

There are four main parts in this part of the implementation:
* Convert the response into a "gTTS" object
* Save this object to an mp3 file in your current directory
* Output the mp3 file as audio through the default audio player
* Delete the mp3 file to clean up the process

First, let's import the required libraries.

In [None]:
# Imports the required libraries to output the response as audio

from gtts import gTTS
import playsound
import os

Now that we have imported the required libraries, we can finally write our `text_to_speech` function that will complete the four steps outlined above. This function will take 4 inputs, but only one of them is required (the response generated in **"Part 4"** of the tutorial).

These inputs are:
* <span style=color:lightblue>response:</span> a string - the question response returned by the ChatGPT API
* <span style=color:lightblue>language:</span> a string - default set to English, the language that you want the output to be. A comprehensive list of all languages can be found [here](https://gtts.readthedocs.io/en/latest/module.html#module-gtts.tts).
* <span style=color:lightblue>speed:</span> a boolean - default set to false, will output audio slower if set to true
* <span style=color:lightblue>file_name:</span> a string - default set to "response.mp3", the name of the temporary audio file that will hold your audio

This function will not return anything. Instead, it will output the audio through your default computer audio player.

<span style=color:lightblue>**Source:** [gTTS Documentation](https://gtts.readthedocs.io/en/latest/), [playsound Documentation](https://pypi.org/project/playsound/)</span>

In [None]:
def text_to_speech(response, language='en', speed=False, file_name='response.mp3'):

    #Uses the response to save a "gTTS" object
    audio = gTTS(response, lang=language, slow=speed)

    #Saves the object into your current directory using the desired file_name
    audio.save(file_name)

    #Get absolute path and format whitespaces
    abs_path = os.path.abspath(file_name).replace(' ', '%20')

    #Output the mp3 file as audio
    playsound.playsound(abs_path)

    #Delete the mp3 file
    os.remove(file_name)
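
One thing to be aware of in the function above is that the delete step is skipped if playback raises an error, which leaves the mp3 file behind. A `try`/`finally` block is one way to guarantee the cleanup. The sketch below shows the pattern on its own; `play_then_delete` is a hypothetical helper, and `play` stands for any function that plays a file, such as `playsound.playsound`.

```python
import os

def play_then_delete(file_name, play):
    """Play `file_name` with the given `play` function, then always delete it."""
    try:
        play(file_name)
    finally:
        # Runs whether or not playback raised an error.
        if os.path.exists(file_name):
            os.remove(file_name)

# Example usage with this tutorial's library (commented out: needs audio output):
# play_then_delete("response.mp3", playsound.playsound)
```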

We have finished writing all the required functions for our voice-enabled chat-bot. We can now test our output by calling our `text_to_speech` function and receiving the answer to our question through audio. Run the cell below and get the answer!

In [None]:
#Outputs the response as audio
text_to_speech(response)

## Part 6: Implementation: Have a Recurring Conversation

Although we've finished the main part of this tutorial, a natural extension of a simple question-and-answer bot is to allow for a recurring conversation, that is, having the bot remember the previous questions that were asked so that it has context for future responses. Rather than editing the `generate_response` function that we wrote earlier, we will write a new function called `generate_rolling_response` that takes a log of messages into account as opposed to just a question.

Besides that, the function will also take an OpenAI client as an input so that we aren't reinitializing a client every time we ask a new question in the thread.

The inputs are:

* <span style=color:lightblue>log:</span> a list - containing dictionaries with role and content, which will be explained later.
* <span style=color:lightblue>client:</span> an OpenAI client - initialized with an API_KEY, this will generate your responses

This function will return the response as a string.

<span style=color:lightblue>**Source:** [OpenAI API Documentation](https://platform.openai.com/docs/guides/text-generation)</span>


In [None]:
#This function will feed a log with context to the client in order to get a more accurate response with context
def generate_rolling_response(log, client):

  #contacts the API to get a response
  response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=log
  )

  #returns the actual answer as a string
  return response.choices[0].message.content

Now that we have a `generate_rolling_response` function, we can write a function called `chatbot_interface` that continuously listens for questions and outputs the responses as we did before, keeping the conversation going.

We will initialize our client before entering the loop. You may be wondering what the log is: it is a list of dictionaries that hold the context for each entry to the chatbot. Each dictionary contains a role and content.

There are three different roles you can have, and the content usually will depend on the role:

* <span style=color:lightblue>system:</span> This is basically describing how you want the chatbot to behave.
* <span style=color:lightblue>user:</span> For our sake, this is the role that will be asking questions, i.e. you are the user
* <span style=color:lightblue>assistant:</span> This is the chat-bot, which generates responses according to the instructions given by the system.
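
To make the three roles concrete, here is what a log might look like partway through a conversation. The questions and the answer are invented for illustration; only the dictionary structure matters.

```python
# Start with the system command, then alternate user and assistant entries.
log = [{"role": "system", "content": "You are a helpful assistant."}]

log.append({"role": "user", "content": "What is the capital of France?"})
log.append({"role": "assistant", "content": "The capital of France is Paris."})

# The follow-up only makes sense because the earlier entries are still in the log.
log.append({"role": "user", "content": "What is its population?"})

roles = [message["role"] for message in log]
print(roles)
```

Because the whole log is sent with every request, the model can work out that "its" in the last question refers to Paris.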

Please note that if the conversation goes on for too long, you will hit the model's token limit. To guard against this, we will write a function called `count_tokens` that takes the log and returns an estimate of the number of tokens in the current thread. The estimate is based on whitespace-separated words, so it is only a rough approximation of the count the model itself would produce.

<span style=color:lightblue>**Source:** [Counting Tokens in Paragraph](https://www.tutorialspoint.com/python_text_processing/python_counting_token_in_paragraphs.htm)</span>


In [None]:
#roughly estimates the number of tokens in the log to ensure that we don't hit the limit
def count_tokens(log):
    total_tokens = 0

    #loops through all the dictionaries in the log, adding the length of the role string and the number of whitespace-separated words in the content
    for query in log:
        total_tokens += (len(query.get("role")) + len(query.get("content").split()))

    return total_tokens
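
Word counts tend to come in below the number of tokens a model actually uses. Another rough estimate, based on the commonly quoted rule of thumb of about four characters per token for English text, is sketched below. The ratio is an approximation rather than an exact figure, and `estimate_tokens_by_characters` is a hypothetical helper, not a replacement for a real tokenizer.

```python
import math

def estimate_tokens_by_characters(log, characters_per_token=4):
    """Estimate tokens from the total character count of all message contents."""
    total_characters = sum(len(message["content"]) for message in log)
    return math.ceil(total_characters / characters_per_token)

example_log = [{"role": "system", "content": "You are a helpful assistant."}]
print(estimate_tokens_by_characters(example_log))
```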

To go more in-depth into the `chatbot_interface` function, we will initialize the log with the system command prior to starting our thread. We will use a generic "You are a helpful assistant." for the content here so the bot generates typical responses. Then, we will enter the while loop which is the main part of the interface. This is where we will be talking back and forth with the chat-bot.

* First, it will retrieve the input like before, but will check to see if the user says quit. That way, there is a way to break out of the thread. Then, it adds that user question to the log.
* Then, it will check that the conversation has not grown so long that the model cannot handle it. If it has, it breaks out of the thread.
* If not, it generates a response with the log by giving the context to the client. After it does that, it adds the response given to the log as context.
* Lastly, we output the last response as audio so the user gets the answer to the most recently asked question.
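
Ending the conversation is the simplest response to a full log, but it is not the only one. As an alternative sketch, the oldest questions and answers could be dropped so the conversation can continue with less context. `trim_log` below is a hypothetical helper that is not used elsewhere in this tutorial; it counts whitespace-separated words only and always keeps the system command and the newest message.

```python
def trim_log(log, max_words):
    """Return a copy of `log` with the oldest non-system messages removed
    until the whitespace-separated word count fits within `max_words`."""
    def word_count(messages):
        return sum(len(message["content"].split()) for message in messages)

    system_messages = [m for m in log if m["role"] == "system"]
    other_messages = [m for m in log if m["role"] != "system"]

    # Drop from the front (oldest first), but always keep the newest message.
    while len(other_messages) > 1 and word_count(system_messages + other_messages) > max_words:
        other_messages.pop(0)

    return system_messages + other_messages
```

The trade-off is that the bot forgets the earliest parts of the conversation, so follow-up questions about them may no longer be answered in context.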

<span style=color:lightblue>**Source:** [OpenAI Documentation Chat Completions](https://platform.openai.com/docs/guides/text-generation/chat-completions-api), [OpenAI Token Limit](https://platform.openai.com/docs/models/gpt-3-5)</span>


In [None]:
def chatbot_interface(max_tokens=4040):
    client = OpenAI(api_key=OPENAI_API_KEY)

    #initializes the log with the system command
    log = [{"role": "system", "content": "You are a helpful assistant."}]

    #main loop
    while True:

        #retrieve audio
        print("Listening... Say quit to exit.")
        audio_input = listen_to_audio()
        user_input = recognizer.recognize_google(audio_input)

        #checks if the user wants to quit
        if user_input.lower() == "quit":
            break

        print("You said: " + user_input)

        #appends the question to the log
        log.append({"role": "user", "content": user_input})

        #checks the number of tokens currently in the log, breaking out of the loop if it exceeds a soft value
        tokens = count_tokens(log)
        if tokens >= max_tokens:
            print("Token limit exceeded, please start a new conversation.")
            break

        #generates a response
        response = generate_rolling_response(log, client)

        #adds the response to the log
        log.append({"role": "assistant", "content": response})

        #outputs the response as audio
        text_to_speech(response)

        print(response)

Now, we can run our code to have a conversation with the chat-bot who will now remember previous questions it was asked as well as answers it has given. Run the following cell and test it out!

In [None]:
recognizer.energy_threshold = 700

chatbot_interface()

If you are struggling to retrieve the audio, try moving to a quieter location or adjusting the energy threshold. Thank you so much for completing this tutorial with us!

## Part 7: Exploration: System Commands

We've completed the main part of the tutorial: we can now have a full audio conversation with a chat-bot that can generate responses to our questions. Now, what if we changed the system command to get special styles of responses? To do this, we will edit the `chatbot_interface` function to take a system command as a string. This way, if we have a specific use for our chat-bot, we can tune our system command to fit it.

On top of the optional <span style=color:lightblue>max_tokens</span> input that `chatbot_interface` already takes, we will add another optional input, <span style=color:lightblue>system_command</span>.

We will call this function `custom_chatbot_interface`. Experiment with the cell that calls this function: try different system commands and see how the output changes.

<span style=color:lightblue>**Source:** [OpenAI Documentation Chat Completions](https://platform.openai.com/docs/guides/text-generation/chat-completions-api)</span>

In [None]:
def custom_chatbot_interface(max_tokens=4040, system_command="You are a helpful assistant."):
    client = OpenAI(api_key=OPENAI_API_KEY)

    #initializes the log with the system command
    log = [{"role": "system", "content": system_command}]

    #main loop
    while True:

        #retrieve audio
        print("Listening... Say quit to exit.")
        audio_input = listen_to_audio()
        user_input = recognizer.recognize_google(audio_input)

        #checks if the user wants to quit
        if user_input.lower() == "quit":
            break

        print("You said: " + user_input)

        #appends the question to the log
        log.append({"role": "user", "content": user_input})

        #checks the number of tokens currently in the log, breaking out of the loop if it exceeds a soft value
        tokens = count_tokens(log)
        if tokens >= max_tokens:
            print("Token limit exceeded, please start a new conversation.")
            break

        #generates a response
        response = generate_rolling_response(log, client)

        #adds the response to the log
        log.append({"role": "assistant", "content": response})

        #outputs the response as audio
        text_to_speech(response)

        print(response)

In [None]:
#insert your command here!
#EXAMPLE: command = "You are a very mean assistant."
command = ""

custom_chatbot_interface(system_command=command)
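
Note that leaving `command` as an empty string sends an empty system message instead of the default one. One possible guard, sketched below, falls back to the default whenever the command is blank. `build_initial_log` and `DEFAULT_SYSTEM_COMMAND` are hypothetical names of our own, and the rhyming command is just an example.

```python
DEFAULT_SYSTEM_COMMAND = "You are a helpful assistant."

def build_initial_log(system_command=""):
    """Start a conversation log, using the default if the command is blank."""
    content = system_command.strip() or DEFAULT_SYSTEM_COMMAND
    return [{"role": "system", "content": content}]

print(build_initial_log("You answer only in rhyming couplets."))
print(build_initial_log(""))
```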