<H1>This code builds an AI Voice Assistant.</H1> 

<p>The code provides the following features:</p>

<ol>
    <li>Capture mic input</li> 
    <li>Check audio level of input</li>
    <li>If loud enough, record user speech in one-second blocks until the volume drops below a threshold</li>
    <li>This speech is then saved as an audio file</li>
    <li>The audio file is read and transcribed into text</li>
    <li>The text is saved to a file</li>
    <li>LLM access and the API key are set up</li>
    <li>A prompt instruction is created for the LLM</li>
    <li>The text is read from file</li> 
    <li>The text is sent to LLM with instructions</li>
    <li>The response text is captured from LLM</li>
    <li>The response text is written to file</li>
    <li>The text response from LLM is read from file</li>
    <li>The text is tokenized into sentences</li>
    <li>Each sentence is synthesized into audio</li> 
    <li>A pause is inserted after each sentence</li>
    <li>The audio from each sentence is aggregated</li>
    <li>The aggregated audio is saved to file</li>
    <li>The audio is then played back to the user</li>
    <li>This entire process is repeated for a fixed number of rounds</li>
</ol>

In [6]:
import pyaudio                      # To manipulate audio
from pydub import AudioSegment      # To manipulate audio
import pygame                       # To play audio
import struct                       # To grab short integer for volume check
import wave                         # To store as wave file
import speech_recognition as sr     # Google's speech recognition tool
from openai import OpenAI           # OpenAI's api
from gtts import gTTS               # text to speech conversion
import os                           # for accessing file data

pygame 2.5.2 (SDL 2.28.2, Python 3.11.7)
Hello from the pygame community. https://www.pygame.org/contribute.html


In [7]:
# Create a dictionary to store the parameters for the 
# audio stream.
param = {
    "format": pyaudio.paInt16,
    "channels": 1,
    "rate": 44100,
    "chunk": 1024,
}
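
<p>As a rough illustration of what these parameters imply, the sketch below works out how long one chunk lasts and how many chunks cover a second of audio. The helper names <code>chunk_duration</code> and <code>chunks_per_second</code> are illustrative and are not part of the notebook.</p>

```python
def chunk_duration(rate, chunk):
    """Seconds of audio held in one buffer of `chunk` frames."""
    return chunk / rate


def chunks_per_second(rate, chunk, seconds=1):
    """Whole number of buffers needed to cover `seconds` of audio."""
    return int(rate / chunk * seconds)


# Same values as the param dictionary above (format omitted, not needed here)
params = {"channels": 1, "rate": 44100, "chunk": 1024}

print(f"{chunk_duration(params['rate'], params['chunk']) * 1000:.1f} ms per chunk")
print(chunks_per_second(params["rate"], params["chunk"]), "chunks per second")
```

<p>With a 44100 Hz rate and 1024-frame chunks, the volume is therefore checked roughly every 23 ms while listening.</p>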

In [8]:
def listen(threshold=5000):
    """
    Listens through the microphone continuously, for example
    after the user presses a button. It waits for the audio
    level to be sufficiently high and then returns.
    """
    # Initialize PyAudio
    p = pyaudio.PyAudio()

    # Open audio stream
    stream = p.open(
        format=param["format"], 
        channels=param["channels"], 
        rate=param["rate"], 
        input=True, 
        frames_per_buffer=param["chunk"]
    )

    # Assume the initial volume of audio is zero.
    # Samples are signed 16-bit values, so they can
    # vary between -32,768 and 32,767
    volume = 0

    # Iterate and keep checking the audio until the level
    # exceeds the threshold. If it does then we assume
    # that this is because there is some speech directed
    # at the voice assistant.
    while (volume < threshold):
        # Read audio stream
        rawAudioBytes = stream.read(param["chunk"])

        # There are two bytes per frame for
        # mono 16-bit audio
        frame_bytes = len(rawAudioBytes) // param["chunk"]

        # Use integer division with // to build the
        # struct format string, e.g. "1024h"
        fmt = "%dh" % (len(rawAudioBytes) // frame_bytes)

        # Extract sample data as a magnitude from the
        # raw audio byte stream. The audio is in the
        # form of a tuple containing all sample volume
        # levels
        audio = struct.unpack(fmt, rawAudioBytes)

        # Check the maximum volume of the samples
        volume = max(audio)

    # Stop and close the stream
    stream.stop_stream()
    stream.close()
    p.terminate()
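
<p>The volume check can be tried without a microphone by packing a few synthetic samples into bytes. The sketch below is a variation, not the notebook's method: it takes the absolute value of each sample so that loud negative peaks also count, whereas <code>max(audio)</code> above only sees positive ones. The name <code>peak_amplitude</code> is hypothetical.</p>

```python
import struct


def peak_amplitude(raw_bytes, sample_width=2):
    """Return the largest absolute sample value in a buffer of 16-bit PCM."""
    count = len(raw_bytes) // sample_width
    samples = struct.unpack("%dh" % count, raw_bytes)
    # default=0 keeps an empty buffer from raising ValueError
    return max((abs(s) for s in samples), default=0)


# Synthetic buffer standing in for stream.read(...): four signed samples
buffer = struct.pack("4h", 120, -9000, 4500, -30)
print(peak_amplitude(buffer))
```

<p>Here the loudest sample is the negative one, which a plain <code>max</code> over the tuple would miss.</p>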

In [9]:
def record(file="output.wav", threshold=2000, time=1):
    # Initialize PyAudio
    p = pyaudio.PyAudio()

    # Open audio stream
    stream = p.open(
        format=param["format"], 
        channels=param["channels"], 
        rate=param["rate"], 
        input=True, 
        frames_per_buffer=param["chunk"]
    )

    # Initialize an empty list for the frames
    # to be recorded
    frames = []

    # Assume that the volume is already at the threshold
    # when starting the recording
    volume = threshold

    # Record until the peak volume of the audio
    # falls below the threshold. We want to record at 
    # least 'time' seconds before checking volume.
    while (volume>=threshold):
        # The minimum number of frames to record at a time
        # before checking volume
        minframes = int(param["rate"] / param["chunk"] * time)

        # Create an array of volumes for each time period of recording
        volume=[]

        # Record data from the microphone
        for i in range(0, minframes):
            # Read the raw audio of chunk size indicated
            rawAudioBytes = stream.read(param["chunk"])

            # Store the raw bytes as valid data
            frames.append(rawAudioBytes)

            # How many bytes in a single frame 
            frame_bytes = len(rawAudioBytes) // param["chunk"]

            # Use integer division with // to build the
            # struct format string, e.g. "1024h"
            fmt = "%dh" % (len(rawAudioBytes) // frame_bytes)

            # Extract sample data as a magnitude from the
            # raw audio byte stream. The audio is in the
            # form of a tuple containing all sample volume
            # levels
            audio = struct.unpack(fmt, rawAudioBytes)

            # Check the max volume of the samples
            volume.append(max(audio))
        
        # Grab the max volume in the time period
        volume = max(volume)

        print (volume)

    # Stop and close the stream once
    # recording is over. Look up the sample width
    # before PyAudio is terminated.
    stream.stop_stream()
    stream.close()
    sampwidth = p.get_sample_size(param["format"])
    p.terminate()

    # Save the recorded data as a WAV file
    # set the audio parameters for recording as well
    # then concatenate all the frames together into a 
    # single continuous set of bytes 
    wf = wave.open(file, 'wb')
    wf.setnchannels(param["channels"])
    wf.setsampwidth(sampwidth)
    wf.setframerate(param["rate"])
    wf.writeframes(b''.join(frames))
    wf.close()
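
<p>The WAV-writing step at the end of <code>record</code> can be exercised on its own. The following is a minimal sketch that assumes mono 16-bit audio at 44100 Hz and writes to an in-memory buffer instead of a file. The helpers <code>frames_to_wav</code>, <code>wav_info</code> and <code>tone_chunk</code> are illustrative names, and the 440 Hz tone stands in for microphone data.</p>

```python
import io
import math
import struct
import wave


def frames_to_wav(frames, channels=1, sample_width=2, rate=44100):
    """Join raw PCM chunks and return them as the bytes of a WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))
    return buffer.getvalue()


def wav_info(data):
    """Return (frame count, sample rate, sample width) of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        return wf.getnframes(), wf.getframerate(), wf.getsampwidth()


def tone_chunk(start, count, rate=44100, freq=440.0, amplitude=8000):
    """Build `count` samples of a sine tone as little-endian 16-bit PCM."""
    samples = [
        int(amplitude * math.sin(2 * math.pi * freq * (start + i) / rate))
        for i in range(count)
    ]
    return struct.pack("<%dh" % count, *samples)


# Two 1024-frame chunks, as if read twice from the audio stream
frames = [tone_chunk(0, 1024), tone_chunk(1024, 1024)]
print(wav_info(frames_to_wav(frames)))
```

<p>Reading the header back is a quick way to confirm that the channel count, sample width and rate written to the file match the stream parameters.</p>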

In [10]:
def extract(audiofile="output.wav", textfile="output.txt"):
    # Create a new recognizer object
    r = sr.Recognizer()

    # Open a recorded audio file
    audio = sr.AudioFile(audiofile)

    # Record the audio into the recognizer object
    with audio as source:
        audio_ = r.record(source)

    # Transcribe audio to text
    try:
        transcription=r.recognize_google(audio_)
    except Exception as e:
        print(str(e))
        return

    # Write the text to file
    with open(textfile, "w") as file:
        # Write the variables to the file
        file.write(f"{transcription}")
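
<p>When recognition fails, <code>extract</code> returns without writing anything, so the next step reads whatever transcript was left by an earlier round. One possible guard is sketched below. It assumes the recognizer is passed in as a plain callable (here a stub, not the real <code>speech_recognition</code> API), and <code>save_transcription</code> is a hypothetical name.</p>

```python
import os
import tempfile


def save_transcription(transcribe, audiofile, textfile):
    """Write the transcript to `textfile`; return True only on success.

    `transcribe` is any callable taking an audio path and returning text.
    """
    try:
        text = transcribe(audiofile)
    except Exception as exc:  # the recognizer may raise on bad audio
        print(f"transcription failed: {exc}")
        text = ""
    if not text.strip():
        # Remove any transcript left over from a previous round
        if os.path.exists(textfile):
            os.remove(textfile)
        return False
    with open(textfile, "w", encoding="utf-8") as fh:
        fh.write(text)
    return True


# Demonstration with a stub recognizer and a temporary directory
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "output.txt")
print(save_transcription(lambda path: "what time is it", "output.wav", target))
print(save_transcription(lambda path: "", "output.wav", target))
```

<p>The boolean return value lets the caller skip the LLM request for a round in which nothing intelligible was recorded.</p>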

In [11]:
def send(requestfile="output.txt", responsefile="output_.txt"):
    # Read the text of the request generated by the user
    with open(requestfile, 'r') as file:
        requesttxt = file.read()
 
    # Token budget for the response (currently unused) and the
    # instructions that tell the model how to answer
    tokens=150
    instructions="Provide a response to the user query. Keep it brief and concise. Remove grammatically incorrect parts of the text from the query."
    prompt = f"{instructions}: {requesttxt}"
    # Read the API key from the environment; never hard-code secrets
    key = os.environ["OPENAI_API_KEY"]

    # Instantiate an openai client
    client = OpenAI(api_key=key)

    # response = client.completion.create(
    #     engine="text-davinci-003",
    #     prompt=prompt,
    #     max_tokens=tokens
    # )

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )

    # Extract the response text from the OpenAI API response
    responsetxt = completion.choices[0].message.content

    # Write the text to file
    with open(responsefile, "w") as file:
        # Write the variables to the file
        file.write(f"{responsetxt}")
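
<p>The request built in <code>send</code> joins the instructions and the user text into one string. An alternative, sketched here under the assumption that the chat endpoint is used with a separate <code>system</code> message, keeps the two apart and reads the key from an environment variable. <code>build_messages</code> and <code>load_api_key</code> are hypothetical helpers, and the environment variable name is an assumption.</p>

```python
import os


def build_messages(instructions, request):
    """Return chat messages with instructions and user text kept apart."""
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": request.strip()},
    ]


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the key from the environment rather than from source code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return key


messages = build_messages(
    "Provide a brief, concise response to the user query.",
    "what is the capital of France\n",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

<p>The resulting list has the shape expected by the <code>messages</code> argument of <code>client.chat.completions.create</code>, so it could be passed in place of the single user message.</p>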

In [12]:
def respond(outputfile = "output_.mp3", textfile="output_.txt"):
    # Read the response text generated by the LLM
    with open(textfile, 'r') as file:
        text = file.read()

    # provide a data structure containing 
    # a mapping for various regional accents
    # to be used by gtts
    ac={
        'gtts':{
            'en':{
                'au':'com.au',
                'uk':'co.uk',
                'us':'com',
                'ca':'ca',
                'in':'co.in',
                'ie':'ie',
                'za':'co.za',
            }

        },
        'pyttsx':{
            'en':{
                'uk':'english_rp+f3'
            }
        }
    }

    # Choose your accent and language
    accent = 'uk'
    language = 'en'

    # Call Google text-to-speech and save audio to mp3
    tts = gTTS(text=text, lang=language, tld=ac['gtts'][language][accent], slow=False)
    tts.save(outputfile)

    # Convert mp3 to wav and store it under the same base name
    name, ext = os.path.splitext(outputfile)
    AudioSegment.from_mp3(outputfile).export(name+".wav", format="wav")
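
<p>The feature list mentions tokenizing the response into sentences, while <code>respond</code> passes the whole text to the speech engine in one call. A minimal way to do the splitting with the standard library is sketched below. It is a naive rule (a break after <code>.</code>, <code>!</code> or <code>?</code> followed by whitespace), and <code>split_sentences</code> is a hypothetical helper rather than part of the notebook.</p>

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


reply = "Paris is the capital of France. It lies on the Seine! Anything else?"
for sentence in split_sentences(reply):
    print(sentence)
```

<p>Each sentence could then be synthesized separately and joined with a short silence, for example with pydub's <code>AudioSegment.silent</code>. Abbreviations such as "Dr." would be split incorrectly by this rule, so a proper tokenizer may be preferable for real use.</p>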
    

In [13]:
def play(audiofile="output_.wav"):
    # Initialize pygame mixer
    pygame.mixer.init()

    # Load the sound
    sound = pygame.mixer.Sound(audiofile)

    # Play the sound
    sound.play()

    # Keep the script running until the audio finishes
    while pygame.mixer.get_busy():
        pygame.time.Clock().tick(10)


In [14]:
def main():
    for i in range(0,10):
        # Listen until audio detected
        listen(threshold=12000)
        print ("audio detected")

        # Record audio to file
        record(threshold=7000)
        print ("audio recorded")

        # Extract text from the audio file
        extract()
        print ("audio transcribed")

        # Feed the text into LLM and get response
        send()
        print ("text sent")

        # Responds to the user in audio format
        respond()
        print ("response given")

        # Plays the audio for user
        play()
        print ("Plays audio to user")
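
<p>The loop in <code>main</code> runs every stage even when an earlier one produced nothing, and it can only be stopped with a keyboard interrupt that ends in a traceback. One possible arrangement is sketched here with stub stages instead of the real functions. <code>run_rounds</code> is a hypothetical helper, and it assumes that a stage signals failure by returning <code>False</code>.</p>

```python
def run_rounds(stages, rounds=10):
    """Run each stage in order per round; return the number of full rounds.

    A stage returning False ends the current round early, and Ctrl+C
    (KeyboardInterrupt) stops the loop without a traceback.
    """
    completed = 0
    try:
        for _ in range(rounds):
            for name, stage in stages:
                if stage() is False:
                    print(f"{name} failed, skipping the rest of this round")
                    break
            else:
                # Only reached when no stage broke out of the inner loop
                completed += 1
    except KeyboardInterrupt:
        print("stopped by user")
    return completed


# Stub stages standing in for listen(), extract() and send()
stages = [
    ("listen", lambda: print("audio detected")),
    ("extract", lambda: False),  # pretend transcription failed
    ("send", lambda: print("text sent")),
]
print(run_rounds(stages, rounds=2), "complete rounds")
```

<p>With the stubs above, the "send" stage is never reached because the failed transcription ends each round early.</p>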
    

In [16]:
if __name__ == "__main__":
    main()

audio detected
16021
13688
1203
audio recorded
audio transcribed
text sent
response given
Plays audio to user
audio detected
27506
17099
994
audio recorded
audio transcribed
text sent
response given
Plays audio to user
audio detected
32767
1230
audio recorded
audio transcribed
text sent
response given
Plays audio to user
audio detected
23669
15246
13882
16372
9678
882
audio recorded
audio transcribed
text sent
response given
Plays audio to user
audio detected
8364
4710
audio recorded

audio transcribed
text sent
response given


KeyboardInterrupt: 