# Macaw OpenVoice — API Validation Notebook

Validates **all REST API endpoints and the WebSocket protocol** of Macaw OpenVoice.
Run all cells sequentially on Google Colab (GPU runtime recommended).

**What this validates:**
- Health and system endpoints
- Audio transcription (STT) — all formats, language, word timestamps
- Audio translation
- Speech synthesis (TTS) — WAV, PCM, effects, alignment, seed
- Voice management — CRUD operations
- WebSocket realtime — STT streaming, TTS full-duplex
- Error handling — expected error responses

> **Prerequisite:** Select a GPU runtime (Runtime > Change runtime type > T4 GPU)

## 1. Setup

In [None]:
# Edit these before running
STT_MODEL = 'faster-whisper-tiny'
TTS_MODEL = 'kokoro-v1'
SERVER_PORT = 8000
BASE_URL = f'http://localhost:{SERVER_PORT}'

# Installation source:
#   'pip'     — install released version from PyPI
#   'develop' — install from git develop branch (pre-release validation)
INSTALL_FROM = 'develop'

### Installation

Installs from **PyPI** or **git develop branch** based on `INSTALL_FROM` above.

> **Note (Colab):** You may see a pip resolver warning about `protobuf` conflicts with pre-installed `tensorflow`/`grpcio-status`. Macaw requires `protobuf>=6.31` and does not use TensorFlow, so the warning can be safely ignored.

In [None]:
%%capture
if INSTALL_FROM == 'develop':
    !pip install 'macaw-openvoice[faster-whisper,kokoro,itn,codec] @ git+https://github.com/usemacaw/macaw-openvoice.git@develop'
else:
    !pip install 'macaw-openvoice[faster-whisper,kokoro,itn,codec]'

In [None]:
%%capture
!macaw pull $STT_MODEL
!macaw pull $TTS_MODEL

### Start server

In [None]:
import os
os.environ['MACAW_WORKER_HEALTH_PROBE_TIMEOUT_S'] = '300'
os.environ['MACAW_LOG_FORMAT'] = 'json'  # avoid ANSI escape codes in logs
os.environ['MACAW_VOICE_DIR'] = '/tmp/macaw_voices'

In [None]:
!nohup macaw serve --host 0.0.0.0 --port $SERVER_PORT --voice-dir /tmp/macaw_voices > /tmp/macaw.log 2>&1 &

In [None]:
import time, httpx

print('Waiting for server ...')
for i in range(90):
    try:
        r = httpx.get(f'{BASE_URL}/health', timeout=5)
        if r.json().get('status') == 'ok':
            print(f'Server ready! (attempt {i+1})')
            print(r.json())
            break
    except Exception:
        pass
    time.sleep(2)
else:
    print('Server did not start. Logs:')
    !tail -50 /tmp/macaw.log
    raise RuntimeError('Server not ready')

## 2. Health & System Endpoints

### 2.1 `GET /health`

In [None]:
import httpx

r = httpx.get(f'{BASE_URL}/health', timeout=10)
data = r.json()
print(data)
assert r.status_code == 200
assert data['status'] == 'ok'
assert data['workers_ready'] > 0

### 2.2 `GET /v1/models`

In [None]:
r = httpx.get(f'{BASE_URL}/v1/models', timeout=10)
data = r.json()
print(data)
assert data['object'] == 'list'
model_ids = [m['id'] for m in data['data']]
assert STT_MODEL in model_ids, f'{STT_MODEL} not loaded'
assert TTS_MODEL in model_ids, f'{TTS_MODEL} not loaded'

## 3. Generate Test Audio

Uses TTS to generate speech, then uses that audio for STT tests (round-trip).
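
The round-trip can also be checked for content, not just for non-empty output. The following is a minimal sketch, assuming a simple word-overlap score is good enough for a smoke test; `normalize` and `word_overlap` are illustrative helpers and not part of Macaw.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_overlap(reference: str, hypothesis: str) -> float:
    """Fraction of reference words that also appear in the hypothesis."""
    ref = normalize(reference)
    hyp = set(normalize(hypothesis))
    if not ref:
        return 0.0
    return sum(1 for w in ref if w in hyp) / len(ref)

# Same sentence the notebook synthesizes; the transcript here is hand-written.
reference = 'Hello world. This is a test of the Macaw voice system.'
transcript = 'hello world this is a test of the macaw voice system'
print(f'Overlap: {word_overlap(reference, transcript):.2f}')
```

In the STT cells, the transcribed `data['text']` could be passed as the hypothesis. Any pass threshold is a judgment call, since a small model may misspell a proper noun such as "Macaw".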

In [None]:
TEST_TEXT = 'Hello world. This is a test of the Macaw voice system.'

r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': TEST_TEXT, 'voice': 'default'},
    timeout=120,
)
assert r.status_code == 200, f'TTS failed: {r.status_code} {r.text}'
assert len(r.content) > 1000

TEST_AUDIO = '/tmp/test_audio.wav'
with open(TEST_AUDIO, 'wb') as f:
    f.write(r.content)
print(f'Test audio: {len(r.content):,} bytes')

## 4. Audio Transcription (`POST /v1/audio/transcriptions`)

### 4.1 JSON format

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'response_format': 'json'},
        timeout=120,
    )
data = r.json()
print(data)
assert r.status_code == 200
assert len(data['text'].strip()) > 0

### 4.2 Verbose JSON format

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'response_format': 'verbose_json'},
        timeout=120,
    )
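assert r.status_code == 200, f'Request failed: {r.status_code} {r.text}'
data = r.json()
# Check required keys before printing so a missing key fails with a clear message
for key in ('segments', 'language', 'duration'):
    assert key in data, f'Missing key: {key}'
print(f"Language={data['language']}, Duration={data['duration']}s")
print(f"Text: {data['text']}")
print(f"Segments: {len(data['segments'])}")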

### 4.3 Text format

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'response_format': 'text'},
        timeout=120,
    )
print(repr(r.text))
assert r.status_code == 200
assert len(r.text.strip()) > 0

### 4.4 SRT format

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'response_format': 'srt'},
        timeout=120,
    )
print(r.text)
assert r.status_code == 200, f'Request failed: {r.status_code} {r.text}'
assert '-->' in r.text, 'No SRT timestamps'
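
Beyond checking for the `-->` marker, the cue timings themselves can be parsed and sanity-checked. This is a sketch that assumes the standard SubRip timing line (`HH:MM:SS,mmm --> HH:MM:SS,mmm`); `srt_cues` is an illustrative helper and the sample text is hand-written, not server output.

```python
import re

# One SubRip timing line: start and end as HH:MM:SS,mmm
SRT_TIME = re.compile(
    r'(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})'
)

def srt_cues(srt_text: str) -> list[tuple[float, float]]:
    """Return (start, end) in seconds for every cue timing line."""
    cues = []
    for match in SRT_TIME.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end))
    return cues

sample = """1
00:00:00,000 --> 00:00:01,500
Hello world.

2
00:00:01,500 --> 00:00:04,250
This is a test.
"""
cues = srt_cues(sample)
assert cues, 'No cues parsed'
assert all(start < end for start, end in cues), 'Cue ends before it starts'
print(cues)
```

The same function could be applied to `r.text` from the request above.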

### 4.5 VTT format

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'response_format': 'vtt'},
        timeout=120,
    )
print(r.text)
assert r.status_code == 200, f'Request failed: {r.status_code} {r.text}'
assert 'WEBVTT' in r.text

### 4.6 With explicit language

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'language': 'en', 'response_format': 'verbose_json'},
        timeout=120,
    )
assert r.status_code == 200, f'Request failed: {r.status_code} {r.text}'
data = r.json()
print(f"Language: {data['language']}, Text: {data['text']}")
assert data['language'] == 'en'

### 4.7 With word-level timestamps

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/transcriptions',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL, 'response_format': 'verbose_json',
              'timestamp_granularities[]': 'word'},
        timeout=120,
    )
data = r.json()
words = data.get('words', [])
print(f'Words: {words[:5]}')
assert len(words) > 0, 'No word timestamps'
assert 'start' in words[0]
assert 'end' in words[0]
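
The cell above only inspects the first word. A stricter check walks the whole list and verifies that the timings are ordered. This sketch assumes each entry carries numeric `start` and `end` values in seconds, as asserted above; the `word` key in the sample is an assumption, and `check_word_timestamps` is an illustrative helper.

```python
def check_word_timestamps(words: list[dict]) -> None:
    """Raise AssertionError if any word has inverted or out-of-order timings."""
    previous_start = 0.0
    for w in words:
        assert w['start'] <= w['end'], f'start after end: {w}'
        assert w['start'] >= previous_start, f'out of order: {w}'
        previous_start = w['start']

# Hand-written sample, not server output.
sample_words = [
    {'word': 'Hello', 'start': 0.00, 'end': 0.42},
    {'word': 'world', 'start': 0.42, 'end': 0.90},
]
check_word_timestamps(sample_words)
print('Word timestamps are ordered.')
```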

## 5. Audio Translation (`POST /v1/audio/translations`)

In [None]:
with open(TEST_AUDIO, 'rb') as f:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/translations',
        files={'file': ('test.wav', f, 'audio/wav')},
        data={'model': STT_MODEL},
        timeout=120,
    )
data = r.json()
print(data)
assert r.status_code == 200
assert len(data['text'].strip()) > 0

## 6. Speech Synthesis (`POST /v1/audio/speech`)

### 6.1 WAV format (default)

In [None]:
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Hello world', 'voice': 'default'},
    timeout=120,
)
assert r.status_code == 200
assert r.content[:4] == b'RIFF', 'Not a WAV file'
print(f'WAV: {len(r.content):,} bytes')

### 6.2 PCM format

In [None]:
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Hello world', 'response_format': 'pcm'},
    timeout=120,
)
assert r.status_code == 200
assert r.content[:4] != b'RIFF', 'Should be raw PCM, not WAV'
assert len(r.content) % 2 == 0, 'PCM 16-bit must be even bytes'
print(f'PCM: {len(r.content):,} bytes')
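
Raw PCM has no header, so it cannot be played or measured without knowing its format. The sketch below wraps 16-bit PCM in a WAV container using the standard-library `wave` module and computes the duration. The 24 kHz mono default is an assumption about the TTS output, not something this notebook confirms; adjust it to the rate the server actually reports.

```python
import io
import wave

def pcm_duration_s(num_bytes: int, sample_rate: int = 24000,
                   channels: int = 1, sample_width: int = 2) -> float:
    """Duration in seconds of raw PCM with the given (assumed) format."""
    return num_bytes / (sample_rate * channels * sample_width)

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container and return the WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# One second of silence at the assumed format, standing in for r.content.
silence = b'\x00\x00' * 24000
wav_bytes = pcm_to_wav(silence)
assert wav_bytes[:4] == b'RIFF'
print(f'Duration: {pcm_duration_s(len(silence)):.2f}s, WAV size: {len(wav_bytes):,} bytes')
```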

### 6.3 Speed control

In [None]:
r_normal = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Testing speed control', 'speed': 1.0},
    timeout=120,
)
r_fast = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Testing speed control', 'speed': 2.0},
    timeout=120,
)
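assert r_normal.status_code == 200, f'Normal failed: {r_normal.status_code}'
assert r_fast.status_code == 200, f'Fast failed: {r_fast.status_code}'
print(f'Normal: {len(r_normal.content):,} bytes')
print(f'Fast:   {len(r_fast.content):,} bytes')
assert len(r_fast.content) < len(r_normal.content), 'Fast should be shorter'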

### 6.4 Audio effects — pitch shift

In [None]:
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Pitch shift test',
          'effects': {'pitch_shift_semitones': 3.0}},
    timeout=120,
)
assert r.status_code == 200
print(f'Pitch shift: {len(r.content):,} bytes')

### 6.5 Audio effects — reverb

In [None]:
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Reverb test',
          'effects': {'reverb_room_size': 0.7, 'reverb_damping': 0.5,
                      'reverb_wet_dry_mix': 0.3}},
    timeout=120,
)
assert r.status_code == 200
print(f'Reverb: {len(r.content):,} bytes')

### 6.6 Audio effects — combined

In [None]:
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Combined effects',
          'effects': {'pitch_shift_semitones': -2.0,
                      'reverb_room_size': 0.5, 'reverb_wet_dry_mix': 0.2}},
    timeout=120,
)
assert r.status_code == 200
print(f'Combined: {len(r.content):,} bytes')

### 6.7 Word-level alignment (NDJSON)

In [None]:
import json as _json

r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Hello world',
          'include_alignment': True, 'alignment_granularity': 'word'},
    timeout=120,
)
assert r.status_code == 200
assert 'ndjson' in r.headers.get('content-type', '')

lines = [_json.loads(l) for l in r.text.strip().split('\n') if l.strip()]
audio_lines = [l for l in lines if l['type'] == 'audio']
done_lines = [l for l in lines if l['type'] == 'done']
print(f'Audio chunks: {len(audio_lines)}, Done: {len(done_lines)}')

aligned = [l for l in audio_lines if l.get('alignment')]
if aligned:
    print(f"Alignment: {aligned[0]['alignment']}")
assert len(done_lines) == 1
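
The NDJSON stream can also be summarized by record type. This sketch assumes only what the cell above relies on (each line is a JSON object with a `type` field of `audio` or `done`); the expectation that `done` comes last is an assumption, and the sample lines are hand-written.

```python
import json
from collections import Counter

def parse_ndjson(text: str) -> list[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hand-written stand-in for r.text; real records carry more fields.
sample = '{"type": "audio"}\n{"type": "audio"}\n\n{"type": "done"}\n'
records = parse_ndjson(sample)
counts = Counter(rec['type'] for rec in records)
print(dict(counts))
assert counts['done'] == 1
assert records[-1]['type'] == 'done', 'Assumed: done is the final record'
```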

### 6.8 Character-level alignment

In [None]:
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Hi',
          'include_alignment': True, 'alignment_granularity': 'character'},
    timeout=120,
)
assert r.status_code == 200
lines = [_json.loads(l) for l in r.text.strip().split('\n') if l.strip()]
aligned = [l for l in lines if l.get('type') == 'audio' and l.get('alignment')]
if aligned:
    a = aligned[0]['alignment']
    print(f"Granularity: {a.get('granularity')}, Items: {a['items']}")
    assert a['granularity'] == 'character'

### 6.9 Seed parameter

> **Note:** Kokoro is a deterministic engine — `seed` is accepted but has no effect. Seed is meaningful for non-deterministic engines like Qwen3-TTS where it controls sampling behavior.

In [None]:
payload = {'model': TTS_MODEL, 'input': 'Reproducibility test', 'seed': 42}
r1 = httpx.post(f'{BASE_URL}/v1/audio/speech', json=payload, timeout=120)
assert r1.status_code == 200
print(f'Seed accepted. Size: {len(r1.content):,} bytes')

### 6.10 Text normalization

In [None]:
for mode in ['auto', 'on', 'off']:
    r = httpx.post(
        f'{BASE_URL}/v1/audio/speech',
        json={'model': TTS_MODEL, 'input': 'I have 3 cats.',
              'text_normalization': mode},
        timeout=120,
    )
    assert r.status_code == 200
    print(f'{mode}: {len(r.content):,} bytes')

## 7. Voice Management

### 7.1 List preset voices

In [None]:
r = httpx.get(f'{BASE_URL}/v1/voices', timeout=30)
data = r.json()
assert data['object'] == 'list'
assert len(data['data']) > 0
print(f"Total voices: {len(data['data'])}")
for v in data['data'][:5]:
    print(f"  {v['voice_id']}: {v['name']} ({v.get('language')})")

### 7.2 Voice CRUD (create / get / use / delete)

In [None]:
# Create a saved voice
r = httpx.post(
    f'{BASE_URL}/v1/voices',
    data={'name': 'test-voice', 'voice_type': 'designed',
          'instruction': 'A calm and warm English voice', 'language': 'en'},
    timeout=30,
)
assert r.status_code == 201, f'Create voice failed: {r.status_code} {r.text}'
VOICE_ID = r.json()['voice_id']
print(f'Created voice: {VOICE_ID}')

In [None]:
# Get saved voice
r = httpx.get(f'{BASE_URL}/v1/voices/{VOICE_ID}', timeout=10)
assert r.status_code == 200
print(f"Get voice: {r.json()['voice_id']}, name={r.json()['name']}")

# Use saved voice in synthesis
r = httpx.post(
    f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'Saved voice test',
          'voice': f'voice_{VOICE_ID}'},
    timeout=120,
)
assert r.status_code == 200
print(f'Audio with saved voice: {len(r.content):,} bytes')

# Delete saved voice
r = httpx.delete(f'{BASE_URL}/v1/voices/{VOICE_ID}', timeout=10)
assert r.status_code == 204
r2 = httpx.get(f'{BASE_URL}/v1/voices/{VOICE_ID}', timeout=10)
assert r2.status_code == 404
print('Voice deleted and confirmed gone.')

## 8. WebSocket Realtime (`/v1/realtime`)

Tests the bidirectional WebSocket protocol for STT streaming and TTS full-duplex.

In [None]:
%%capture
!pip install websockets

### 8.1 STT streaming

In [None]:
import asyncio, json, websockets

async def test_ws_stt():
    ws_url = f'ws://localhost:{SERVER_PORT}/v1/realtime?model={STT_MODEL}'
    async with websockets.connect(ws_url) as ws:
        msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
        assert msg['type'] == 'session.created'
        print(f"Session: {msg['session_id']}")

        # Send test audio as raw PCM (assumes a canonical 44-byte WAV header)
        with open(TEST_AUDIO, 'rb') as f:
            pcm = f.read()[44:]
        for i in range(0, min(len(pcm), 48000), 3200):
            await ws.send(pcm[i:i+3200])
            await asyncio.sleep(0.05)

        await ws.send(json.dumps({'type': 'input_audio_buffer.commit'}))

        events = []
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), timeout=15)
                if isinstance(raw, str):
                    ev = json.loads(raw)
                    events.append(ev)
                    print(f"  {ev['type']}")
                    if ev['type'] == 'transcript.final':
                        print(f"  Text: {ev['text']}")
                        break
        except asyncio.TimeoutError:
            pass

        types = [e['type'] for e in events]
        assert 'transcript.final' in types or 'transcript.partial' in types
        await ws.send(json.dumps({'type': 'session.close'}))

await test_ws_stt()
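
The fixed 44-byte offset used in 8.1 holds only for a canonical PCM WAV header; files with extra chunks have a longer one. As a more defensive alternative, the standard-library `wave` module can locate the audio data and report the format. This is a sketch: `read_pcm` and `chunk_pcm` are illustrative helpers, the demo WAV is generated in memory, and its 16 kHz rate is an arbitrary demo value. The 3200-byte chunk size matches the cell above.

```python
import io
import wave

def read_pcm(wav_bytes: bytes) -> tuple[bytes, int, int]:
    """Return (pcm_frames, sample_rate, channels) from 16-bit WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), 'rb') as w:
        assert w.getsampwidth() == 2, 'expected 16-bit samples'
        return w.readframes(w.getnframes()), w.getframerate(), w.getnchannels()

def chunk_pcm(pcm: bytes, size: int = 3200):
    """Yield consecutive slices of at most `size` bytes."""
    for i in range(0, len(pcm), size):
        yield pcm[i:i + size]

# Build a small demo WAV in memory (0.2 s of silence) instead of reading a file.
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b'\x00\x00' * 3200)
demo_wav = buf.getvalue()

pcm, rate, channels = read_pcm(demo_wav)
sizes = [len(c) for c in chunk_pcm(pcm)]
print(f'{rate} Hz, {channels} channel(s), chunks: {sizes}')
```

In 8.1, the bytes read from `TEST_AUDIO` could be passed to `read_pcm`, and each chunk sent with `ws.send`. The reported sample rate is also worth printing, since the streaming endpoint may expect a specific rate.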

### 8.2 Session configuration

In [None]:
async def test_ws_configure():
    ws_url = f'ws://localhost:{SERVER_PORT}/v1/realtime?model={STT_MODEL}'
    async with websockets.connect(ws_url) as ws:
        msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
        assert msg['type'] == 'session.created'

        await ws.send(json.dumps({
            'type': 'session.configure',
            'language': 'en',
            'enable_partial_transcripts': True,
            'vad_sensitivity': 'high',
        }))
        print('Configure sent OK')

        await ws.send(json.dumps({'type': 'session.close'}))
        try:
            while True:
                await asyncio.wait_for(ws.recv(), timeout=3)
        except (asyncio.TimeoutError, websockets.exceptions.ConnectionClosed):
            pass
        print('Session closed OK')

await test_ws_configure()

### 8.3 TTS via WebSocket

In [None]:
async def test_ws_tts():
    ws_url = f'ws://localhost:{SERVER_PORT}/v1/realtime?model={STT_MODEL}'
    async with websockets.connect(ws_url) as ws:
        json.loads(await asyncio.wait_for(ws.recv(), timeout=10))

        await ws.send(json.dumps({'type': 'session.configure', 'model_tts': TTS_MODEL}))
        await ws.send(json.dumps({'type': 'tts.speak', 'text': 'Hello from WebSocket',
                                  'request_id': 'test_tts_1'}))

        events, audio_frames = [], []
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), timeout=30)
                if isinstance(raw, bytes):
                    audio_frames.append(raw)
                else:
                    ev = json.loads(raw)
                    events.append(ev)
                    print(f"  {ev['type']}")
                    if ev['type'] == 'tts.speaking_end':
                        break
        except asyncio.TimeoutError:
            pass

        types = [e['type'] for e in events]
        assert 'tts.speaking_start' in types
        assert 'tts.speaking_end' in types
        total = sum(len(f) for f in audio_frames)
        print(f'Audio: {len(audio_frames)} frames, {total:,} bytes')
        assert total > 500
        await ws.send(json.dumps({'type': 'session.close'}))

await test_ws_tts()

### 8.4 TTS with alignment

In [None]:
async def test_ws_alignment():
    ws_url = f'ws://localhost:{SERVER_PORT}/v1/realtime?model={STT_MODEL}'
    async with websockets.connect(ws_url) as ws:
        json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
        await ws.send(json.dumps({'type': 'session.configure', 'model_tts': TTS_MODEL}))
        await ws.send(json.dumps({'type': 'tts.speak', 'text': 'Alignment test',
                                  'include_alignment': True, 'request_id': 'align'}))

        events = []
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), timeout=30)
                if isinstance(raw, str):
                    ev = json.loads(raw)
                    events.append(ev)
                    if ev['type'] == 'tts.speaking_end':
                        break
        except asyncio.TimeoutError:
            pass

        types = [e['type'] for e in events]
        assert 'tts.alignment' in types, f'No alignment event. Got: {types}'
        ae = next(e for e in events if e['type'] == 'tts.alignment')
        print(f"Alignment items: {ae['items']}")
        await ws.send(json.dumps({'type': 'session.close'}))

await test_ws_alignment()

### 8.5 TTS cancel

In [None]:
async def test_ws_cancel():
    ws_url = f'ws://localhost:{SERVER_PORT}/v1/realtime?model={STT_MODEL}'
    async with websockets.connect(ws_url) as ws:
        json.loads(await asyncio.wait_for(ws.recv(), timeout=10))
        await ws.send(json.dumps({'type': 'session.configure', 'model_tts': TTS_MODEL}))
        await ws.send(json.dumps({'type': 'tts.speak', 'request_id': 'cancel_test',
            'text': 'This long sentence should be cancelled before finishing.'}))

        # Wait for speaking_start, then cancel
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
                if isinstance(raw, str):
                    ev = json.loads(raw)
                    if ev['type'] == 'tts.speaking_start':
                        await ws.send(json.dumps({'type': 'tts.cancel'}))
                        break
        except asyncio.TimeoutError:
            pass

        # Wait for cancelled speaking_end
        cancelled = False
        try:
            while True:
                raw = await asyncio.wait_for(ws.recv(), timeout=10)
                if isinstance(raw, str):
                    ev = json.loads(raw)
                    if ev['type'] == 'tts.speaking_end':
                        cancelled = ev.get('cancelled', False)
                        break
        except asyncio.TimeoutError:
            pass

        print(f'Cancelled: {cancelled}')
        assert cancelled
        await ws.send(json.dumps({'type': 'session.close'}))

await test_ws_cancel()

## 9. Error Handling

In [None]:
# Empty text
r = httpx.post(f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': ''}, timeout=30)
assert r.status_code >= 400
print(f'Empty text: {r.status_code}')

# Non-existent model
r = httpx.post(f'{BASE_URL}/v1/audio/speech',
    json={'model': 'nonexistent', 'input': 'test'}, timeout=30)
assert r.status_code >= 400
print(f'Bad model: {r.status_code}')

# Non-existent voice
r = httpx.get(f'{BASE_URL}/v1/voices/nonexistent', timeout=10)
assert r.status_code == 404
print(f'Bad voice: {r.status_code}')

# Invalid audio
r = httpx.post(f'{BASE_URL}/v1/audio/transcriptions',
    files={'file': ('bad.txt', b'not audio', 'text/plain')},
    data={'model': STT_MODEL}, timeout=30)
assert r.status_code == 400
print(f'Bad audio: {r.status_code}')

# Alignment + opus conflict
r = httpx.post(f'{BASE_URL}/v1/audio/speech',
    json={'model': TTS_MODEL, 'input': 'test',
          'include_alignment': True, 'response_format': 'opus'}, timeout=30)
assert r.status_code == 400
print(f'Alignment+opus: {r.status_code}')

## 10. Cleanup

In [None]:
import subprocess
subprocess.run(['pkill', '-f', 'macaw serve'],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
print('Server stopped.')