# Week 3 Homework – Voice Agent (ASR → LLM → TTS)

This notebook contains my solutions for **Week 3**, where I build a small **voice assistant**:

- Use **OpenAI audio models** for speech recognition (ASR) and text-to-speech (TTS).
- Use **Ollama + Llama 3** as the local LLM “brain”.
- Wrap everything in a **FastAPI** backend with a `/chat/` endpoint.
- Maintain simple **multi-turn conversation memory** (5 turns).

I developed and tested everything in my existing `ollama314` Conda environment on Windows 11.


## 1. Environment & Dependencies

I use the same Conda environment as in Week 1 & 2 (`ollama314`):

- Python 3.14
- `openai` (already installed)
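- `fastapi`, `uvicorn`, `python-multipart` for serving the API
- `python-dotenv` for loading `OPENAI_API_KEY` from a `.env` file (pulled in by `uvicorn[standard]`)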

If any of these are missing, I install them below.



In [8]:
# --- Load API Key from .env ---
from dotenv import load_dotenv
import os

load_dotenv()  # loads OPENAI_API_KEY into environment

from openai import OpenAI

# --- Real OpenAI client (for ASR + TTS) ---
openai_client = OpenAI()  
# This automatically reads OPENAI_API_KEY from environment


# --- Ollama local client (for LLM) ---
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"   # dummy key; Ollama ignores it
)


# --- Model configuration ---
LLM_MODEL = "llama3"               # local LLM for responses
ASR_MODEL = "gpt-4o-transcribe"    # or "whisper-1"
TTS_MODEL = "gpt-4o-mini-tts"      # any available TTS model
TTS_VOICE = "alloy"                # any supported voice


print("Cell 3 loaded successfully!")
print("OpenAI API Key detected:", "Yes" if os.getenv("OPENAI_API_KEY") else "No")
print("OpenAI endpoint:", openai_client.base_url)
print("Ollama endpoint:", ollama_client.base_url)


Cell 3 loaded successfully!
OpenAI API Key detected: Yes
OpenAI endpoint: https://api.openai.com/v1/
Ollama endpoint: http://localhost:11434/v1/


In [1]:
# 1.1 Install / check dependencies (run once)

%pip install --upgrade fastapi uvicorn[standard] python-multipart


Collecting fastapi
  Downloading fastapi-0.123.10-py3-none-any.whl.metadata (30 kB)
Collecting python-multipart
  Using cached python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting uvicorn[standard]
  Using cached uvicorn-0.38.0-py3-none-any.whl.metadata (6.8 kB)
Collecting starlette<0.51.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.50.0-py3-none-any.whl.metadata (6.3 kB)
Collecting annotated-doc>=0.0.2 (from fastapi)
  Using cached annotated_doc-0.0.4-py3-none-any.whl.metadata (6.6 kB)
Collecting click>=7.0 (from uvicorn[standard])
  Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting httptools>=0.6.3 (from uvicorn[standard])
  Downloading httptools-0.7.1-cp314-cp314-win_amd64.whl.metadata (3.6 kB)
Collecting python-dotenv>=0.13 (from uvicorn[standard])
  Using cached python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Collecting watchfiles>=0.13 (from uvicorn[standard])
  Downloading watchfiles-1.1.1-cp314-cp314-win_amd64.whl.metadata (5.0 k

In [2]:
# 1.2 Basic environment check

import sys, os
print("Python:", sys.version)
print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)

# IMPORTANT:
#  - OPENAI_API_KEY must be set in the environment for ASR + TTS.
#  - Ollama must be running: `ollama serve` or Ollama app open.


Python: 3.14.0 | packaged by Anaconda, Inc. | (main, Oct 22 2025, 08:58:42) [MSC v.1929 64 bit (AMD64)]
OPENAI_API_KEY set: False


## 2. Clients & Configuration

I use **two different “OpenAI-compatible” endpoints**:

1. **Real OpenAI API** (internet)  
   - For ASR (speech-to-text) and TTS (text-to-speech)  
   - Uses `OPENAI_API_KEY`

2. **Local Ollama (http://localhost:11434)**  
   - For chat completions with `llama3`  
   - Uses Ollama's OpenAI-compatible API surface

This keeps the “voice” models in the cloud but the **reasoning model local**.


In [12]:
from openai import OpenAI

# Real OpenAI for ASR + TTS
openai_client = OpenAI()  # uses OPENAI_API_KEY

# Ollama (local) for LLM
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",     # dummy string; Ollama ignores it
)

LLM_MODEL = "llama3"      # make sure this is already pulled in Ollama
ASR_MODEL = "gpt-4o-transcribe"  # or "whisper-1"
TTS_MODEL = "gpt-4o-mini-tts"    # any supported TTS model
TTS_VOICE = "alloy"


## 3. ASR – Audio → Text (OpenAI Audio)

Here I implement a helper that accepts **raw audio bytes** and returns the
transcribed text using OpenAI’s audio transcription API.

The FastAPI endpoint will call this function.


In [13]:
from tempfile import NamedTemporaryFile
from pathlib import Path

def transcribe_audio_bytes(audio_bytes: bytes, suffix: str = ".wav") -> str:
    """
    Save uploaded audio bytes to a temporary file and send it
    to OpenAI's audio transcription endpoint.
    """
    # delete=False so the file can be reopened by name (required on Windows)
    with NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(audio_bytes)
        tmp_path = Path(tmp.name)

    try:
        with tmp_path.open("rb") as f:
            transcript = openai_client.audio.transcriptions.create(
                model=ASR_MODEL,
                file=f,
            )
    finally:
        # Always clean up the temp file, even if the API call fails
        tmp_path.unlink(missing_ok=True)

    return transcript.text


In [17]:
# Optional sanity check: point this at a short WAV/MP3 file
# If you don't have one handy you can skip this cell.

TEST_AUDIO = "test.wav"  # change to your file

if Path(TEST_AUDIO).exists():
    with open(TEST_AUDIO, "rb") as f:
        audio_bytes = f.read()
    print("Transcription:", transcribe_audio_bytes(audio_bytes))
else:
    print("Skipping ASR test – file not found:", TEST_AUDIO)


Skipping ASR test – file not found: test.wav


## 4. LLM – Text → Text with Local Llama 3 (Ollama)

Next, I build a wrapper around **Ollama’s OpenAI-compatible API**.

I also add a very small **conversation memory** structure:

- `history` is a list of `{user, bot}` turns.
- For each new message, I:
  - Add the last ≤5 turns as context
  - Append the new user message
  - Ask `llama3` for a reply.
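
To make the windowing concrete, here is a minimal, self-contained sketch of the same idea. The names `window_messages` and `demo_history` are hypothetical and exist only for this illustration; the system prompt is left out to keep the example short.

```python
from typing import Dict, List


def window_messages(
    user_text: str,
    history: List[Dict[str, str]],
    max_turns: int = 5,
) -> List[Dict[str, str]]:
    """Flatten the last `max_turns` {user, bot} turns into chat messages."""
    messages: List[Dict[str, str]] = []
    for turn in history[-max_turns:]:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["bot"]})
    # The new user message always goes last.
    messages.append({"role": "user", "content": user_text})
    return messages


# Hypothetical history of 7 turns; only the last 5 are used as context.
demo_history = [{"user": f"question {i}", "bot": f"answer {i}"} for i in range(7)]
demo_messages = window_messages("question 7", demo_history)

print(len(demo_messages))           # 5 turns * 2 messages + 1 new message = 11
print(demo_messages[0]["content"])  # "question 2" (turns 0 and 1 were dropped)
```

Each stored turn expands into two chat messages, so a 5-turn window adds at most 10 messages of context ahead of the new user message.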


In [18]:
from typing import List, Dict

ConversationTurn = Dict[str, str]
history: List[ConversationTurn] = []  # global in-notebook memory


def build_messages(user_text: str, history: List[ConversationTurn], max_turns: int = 5):
    """
    Convert our simple history list into OpenAI-style chat messages.
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a friendly, concise voice assistant. "
                "Answer in short, clear sentences."
            ),
        }
    ]

    # Only keep the last max_turns turns
    for turn in history[-max_turns:]:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["bot"]})

    messages.append({"role": "user", "content": user_text})
    return messages


def generate_bot_reply(user_text: str) -> str:
    """
    Call local llama3 via Ollama to generate a reply.
    """
    messages = build_messages(user_text, history)

    response = ollama_client.chat.completions.create(
        model=LLM_MODEL,
        messages=messages,
        temperature=0.7,
    )

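    # content can be None in the API response, so fall back to an empty string
    reply_text = (response.choices[0].message.content or "").strip()

    # Update the conversation history
    history.append({"user": user_text, "bot": reply_text})
    # Cap stored history at 10 turns; build_messages only sends the last 5
    if len(history) > 10:
        del history[:-10]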

    return reply_text


In [19]:
test_reply = generate_bot_reply("Hi, who are you and what can you do?")
print(test_reply)


I'm a friendly voice assistant! I can help with various tasks, such as answering questions, providing definitions, giving directions, sending messages, setting reminders, and even telling jokes! What would you like to know or get done today?


## 5. TTS – Text → Audio (OpenAI TTS)

Now I convert the assistant reply text into **spoken audio**.

- I use OpenAI's `audio.speech` endpoint.
- The helper returns the **path to an MP3 file** that can be served back
  to clients (e.g. via FastAPI or directly downloaded).


In [20]:
from pathlib import Path

def text_to_speech_file(text: str, out_path: str = "reply.mp3") -> str:
    """
    Use OpenAI TTS to synthesize speech from text.
    The result is written to `out_path` and the path is returned.
    """
    out_file = Path(out_path)

    # Reuse the OpenAI client from Section 2 instead of the module-level API
    with openai_client.audio.speech.with_streaming_response.create(
        model=TTS_MODEL,
        voice=TTS_VOICE,
        input=text,
    ) as response:
        response.stream_to_file(out_file)

    return str(out_file)


In [21]:
# This will create an MP3 file in the current directory.
# You can play it with any media player.

TEST_SENTENCE = "Hello, this is a small test of my Week 3 voice agent."
mp3_path = text_to_speech_file(TEST_SENTENCE, "tts_test.mp3")
print("Generated:", mp3_path)


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

## 6. FastAPI Integration – `/chat/` Endpoint

Finally, I glue everything together in a **FastAPI app**:

1. Client uploads an audio file to `/chat/` (`multipart/form-data`).
2. Backend:
   - Reads audio bytes.
   - Calls `transcribe_audio_bytes` → `user_text`.
   - Calls `generate_bot_reply(user_text)` → `bot_text`.
   - Calls `text_to_speech_file(bot_text)` → `reply.mp3`.
3. Returns a JSON response describing the interaction and the audio filename,
   which the client can then fetch from `/audio/{filename}`.
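
For reference, a client could exercise this flow with nothing but the standard library. The sketch below is an illustration, not part of the submission: the helper names `build_multipart` and `post_audio` are hypothetical, and the default URL assumes the server is running locally on port 8000 as in Section 7. The form field is named `file` to match the `file: UploadFile` parameter of the endpoint.

```python
import json
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "audio/wav") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_audio(path: str, url: str = "http://localhost:8000/chat/") -> dict:
    """Upload an audio file to the /chat/ endpoint and return the parsed JSON."""
    audio = Path(path)
    body, content_type = build_multipart("file", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Requires the server to be running, so it is left commented out here:
# result = post_audio("test.wav")
# print(result["user_text"], "->", result["bot_text"])
```

In practice a library such as `requests` or `httpx` would shorten this considerably; the manual encoding is shown only to make the `multipart/form-data` upload explicit.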


In [22]:
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import FileResponse, JSONResponse
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Week 3 Voice Agent")

# Allow simple local front-ends (HTML/JS) if needed
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.post("/chat/")
async def chat_endpoint(file: UploadFile = File(...)):
    """
    Main voice-chat endpoint.

    1. Receive audio upload.
    2. ASR → user text.
    3. LLM → bot text.
    4. TTS → MP3 reply.
    """
    audio_bytes = await file.read()
    # file.filename can be None, so fall back to ".wav"
    suffix = Path(file.filename or "").suffix or ".wav"
    user_text = transcribe_audio_bytes(audio_bytes, suffix=suffix)
    bot_text = generate_bot_reply(user_text)

    # Save audio reply as MP3 (overwrite each time for simplicity)
    reply_audio_path = text_to_speech_file(bot_text, "last_reply.mp3")

    return JSONResponse(
        {
            "user_text": user_text,
            "bot_text": bot_text,
            "audio_file": Path(reply_audio_path).name,
            "turns_in_memory": len(history),
        }
    )


@app.get("/audio/{filename}")
async def get_audio(filename: str):
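    """
    Simple endpoint to download / play the generated audio file.
    """
    # Use only the final path component and serve MP3 files only,
    # so requests cannot reach other files (e.g. .env) or directories
    path = Path(Path(filename).name)
    if path.suffix.lower() != ".mp3" or not path.is_file():
        return JSONResponse({"error": "file not found"}, status_code=404)
    return FileResponse(path, media_type="audio/mpeg")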


## 7. Running the FastAPI Server

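To actually serve the voice agent, I run **uvicorn** from the terminal
in the same directory as this notebook file. Because uvicorn imports a
Python module, this assumes the notebook code has been exported to
`week3_submission.py`.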

Example command:

```bash
uvicorn week3_submission:app --reload --port 8000
```



## 8. Design Notes & Limitations

- **ASR**: I chose `gpt-4o-transcribe` (or `whisper-1`) from OpenAI
  instead of installing a heavy local Whisper model.  
  This keeps my environment simpler and avoids GPU requirements.

- **LLM**: I reuse my local **Ollama + Llama 3** from Week 1–2 to provide
  the conversational “brain”.  
  The messages include up to **5 prior turns** for short-term memory.

- **TTS**: I used OpenAI's TTS (`gpt-4o-mini-tts`) since installing
  CosyVoice locally is heavier and more complex on Windows.

- **FastAPI**: The app is intentionally small and modular:
  - `transcribe_audio_bytes` – ASR
  - `generate_bot_reply` – LLM + memory
  - `text_to_speech_file` – TTS
  - `chat_endpoint` – orchestration

- **Security & performance**: For homework I keep everything simple
  (no auth, no streaming). In a production system I would:
  - Add authentication and rate limiting
  - Stream ASR/LLM/TTS for lower latency
  - Store conversation state in a database instead of an in-memory list
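
As a rough idea of what database-backed conversation state could look like, here is a minimal sketch using the standard-library `sqlite3` module. Everything in it is an assumption for illustration: the `turns` table, the `session_id` column, and the helper names `open_store`, `add_turn`, and `recent_turns` do not exist in this notebook. The returned turns use the same `{user, bot}` shape as `history`, so they could be passed to `build_messages`.

```python
import sqlite3
from typing import Dict, List


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) `turns` table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT NOT NULL, "
        "user_text TEXT NOT NULL, "
        "bot_text TEXT NOT NULL)"
    )
    return conn


def add_turn(conn: sqlite3.Connection, session_id: str,
             user_text: str, bot_text: str) -> None:
    """Persist one user/bot exchange for a session."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO turns (session_id, user_text, bot_text) VALUES (?, ?, ?)",
            (session_id, user_text, bot_text),
        )


def recent_turns(conn: sqlite3.Connection, session_id: str,
                 max_turns: int = 5) -> List[Dict[str, str]]:
    """Return the last `max_turns` turns of a session in chronological order."""
    rows = conn.execute(
        "SELECT user_text, bot_text FROM turns "
        "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, max_turns),
    ).fetchall()
    # Rows come back newest-first; reverse them to chronological order.
    return [{"user": u, "bot": b} for u, b in reversed(rows)]


conn = open_store()  # in-memory database, for demonstration only
for i in range(7):
    add_turn(conn, "alice", f"q{i}", f"a{i}")
add_turn(conn, "bob", "hello", "hi")

print(recent_turns(conn, "alice")[0])  # {'user': 'q2', 'bot': 'a2'}
```

Keying turns by session also keeps different users' conversations separate, which the single global `history` list does not do.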


## 9. Summary

In this Week 3 assignment I:

1. **Implemented ASR** using OpenAI's audio transcription API.
2. **Integrated a local LLM (Llama 3 via Ollama)** to act as the dialog agent.
3. **Added TTS** using OpenAI's text-to-speech models.
4. Combined everything in a **FastAPI microservice** with a `/chat/` endpoint.
5. Implemented a basic **5-turn conversation memory** in Python.

This notebook serves both as:
- My **final submission** for Week 3, and
- A starting point to extend into a richer voice assistant
  (streaming, better memory, richer prompts, frontend UI, etc.).
