<a href="https://colab.research.google.com/github/nattaran/HealthTequity-LLM/blob/main/HealthTequity_VoicePipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# HealthTequity Voice Pipeline

## Introduction <a id="introduction"></a>
This notebook presents a modular voice pipeline for bilingual (Spanish–English) processing and evaluation. The system performs the following sequence:
1) Automatic Speech Recognition (ASR) on Spanish audio inputs (Whisper base).  
2) Machine translation (Spanish → English) for downstream analysis.  
3) Question answering over a tabular blood-pressure dataset using an LLM (GPT-4o mini).  
4) Back-translation to Spanish followed by text-to-speech (TTS).  
5) Re-transcription of the generated Spanish audio using ASR and evaluation (WER, CER, SER).

The steps are kept in separate cells so that each can be run on its own once the setup cells (paths, imports, and the OpenAI client) have been executed.



## Table of Contents
1. [Introduction](#introduction)  
2. [Environment Setup](#setup)  
3. [Folder Configuration](#paths)  
4. [OpenAI Key Initialization](#openai)  
5. [Step 1 – ASR and Translation](#asr-translation)  
6. [Step 2 – ASR Evaluation (Input)](#asr-eval)  
7. [Step 3 – LLM Question Answering](#qa)  
8. [Step 4 – Translation and TTS](#tts)  
9. [Step 5 – Whisper Evaluation of TTS](#tts-eval)  
10. [Orchestration (Optional)](#orchestration)  
11. [Step 6 – Results Summary](#summary)  
12. [Appendix](#appendix)



## Environment Setup <a id="setup"></a>
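Install dependencies from the repository's `requirements.txt`. The file must be present in the current working directory (for example, after cloning the repository or uploading the file); otherwise `pip` fails with a "Could not open requirements file" error. Run this cell once per runtime.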


In [1]:

# Install project requirements (run once per session)
# If you prefer to pin versions, ensure requirements.txt includes exact versions.
!pip install -r requirements.txt


ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


## Folder Configuration <a id="paths"></a>
Centralized path configuration for data and results. This version assumes a Google Drive working directory that persists across sessions; Drive must already be mounted at `/content/drive` before this cell runs.


In [None]:

from pathlib import Path
import os

# Project root in Google Drive (adjust if needed)
PROJECT_ROOT = Path("/content/drive/MyDrive/HealthTequity-LLM")

# Data and results subfolders
DATA_DIR     = PROJECT_ROOT / "data"
CSV_DIR      = DATA_DIR / "synthetic_csv"
AUDIO_DIR    = DATA_DIR / "Spanish_audio"
RESULTS_DIR  = PROJECT_ROOT / "results"
LLM_OUT      = RESULTS_DIR / "llm_outputs"
EVAL_DIR     = RESULTS_DIR / "evaluation_metrics"
TTS_DIR      = RESULTS_DIR / "tts_audio"

# Create outputs if missing (idempotent)
for p in [RESULTS_DIR, LLM_OUT, EVAL_DIR, TTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("Project root:", PROJECT_ROOT)
print("Data dir:", DATA_DIR)
print("CSV dir:", CSV_DIR)
print("Audio dir:", AUDIO_DIR)
print("Results dir:", RESULTS_DIR)
print("LLM outputs:", LLM_OUT)
print("Evaluation dir:", EVAL_DIR)
print("TTS dir:", TTS_DIR)
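The cell above creates only the output folders; the input folders (`CSV_DIR`, `AUDIO_DIR`) are assumed to exist already. A minimal sketch of a pre-flight check is shown below. The helper name `find_missing_inputs` and the temporary demo folders are hypothetical and not part of the notebook.

```python
import tempfile
from pathlib import Path


def find_missing_inputs(required: dict[str, Path]) -> list[str]:
    """Return the labels of required input folders that do not exist."""
    return [label for label, path in required.items() if not path.is_dir()]


# Demo on a throwaway directory tree (a stand-in for PROJECT_ROOT / "data").
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "synthetic_csv").mkdir()
    demo_missing = find_missing_inputs({
        "CSV_DIR": demo_root / "synthetic_csv",
        "AUDIO_DIR": demo_root / "Spanish_audio",  # deliberately not created
    })

print("Missing input folders:", demo_missing)
```

In the notebook, the same call could be made with the real `CSV_DIR` and `AUDIO_DIR` values before Step 1 is run.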



## OpenAI Key Initialization <a id="openai"></a>
This cell securely initializes the OpenAI client. If the `OPENAI_API_KEY` environment variable is not present, the cell prompts for a key using a hidden input.


In [None]:

import os
from getpass import getpass
from openai import OpenAI

# Attempt to load key from environment. Prompt if missing.
if not os.getenv("OPENAI_API_KEY"):
    print("OpenAI API key not found in environment.")
    os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key (input hidden): ").strip()

# Initialize client (raises only if no key is set; an invalid key surfaces on the first API call)
client = OpenAI()
print("OpenAI client initialized.")



## Step 1 – ASR and Translation <a id="asr-translation"></a>
This section performs automatic speech recognition (ASR) on Spanish audio using Whisper (**base** model), followed by English translation via the OpenAI API.

**Inputs**: `.wav` files under `AUDIO_DIR`.  
**Outputs**: `audio_translations.csv` with columns: `audio_file`, `spanish_transcription`, `english_translation`, `language_detected`.


In [None]:

import whisper
import pandas as pd

def transcribe_spanish_audio(model, audio_path: Path):
    """
    Transcribe a single Spanish audio file using Whisper.

    Parameters
    ----------
    model : whisper.Whisper
        Loaded Whisper model instance.
    audio_path : Path
        Path to the .wav file.

    Returns
    -------
    text : str
        Transcribed text.
    lang : str
        Language code reported by Whisper. Because language="es" is passed
        explicitly, detection is skipped and this is expected to be "es".
    """
    result = model.transcribe(str(audio_path), language="es", task="transcribe", verbose=False)
    return result["text"].strip(), result.get("language", "unknown")


def translate_spanish_to_english(spanish_text: str) -> str:
    """
    Translate a Spanish transcription to English via OpenAI chat completion.

    Parameters
    ----------
    spanish_text : str
        Input Spanish text to translate.

    Returns
    -------
    english_text : str
        Translated English text.
    """
    prompt = (
        "Translate the following Spanish medical transcription into clear, faithful English:\n\n"
        + spanish_text
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content.strip()


def process_and_translate_audio(audio_folder: Path, output_csv: Path, model_size: str = "base") -> pd.DataFrame:
    """
    Run Whisper ASR on all .wav files in a folder, translate to English, and save results.

    Parameters
    ----------
    audio_folder : Path
        Directory containing .wav files.
    output_csv : Path
        Destination CSV for transcriptions and translations.
    model_size : str, optional
        Whisper model size (default: "base").

    Returns
    -------
    pd.DataFrame
        DataFrame of results with columns:
        [audio_file, spanish_transcription, english_translation, language_detected].
    """
    model = whisper.load_model(model_size)
    audio_files = sorted([f for f in os.listdir(audio_folder) if f.endswith(".wav")])

    results = []
    for fname in audio_files:
        audio_path = audio_folder / fname
        if not audio_path.exists():
            continue
        es_text, detected_lang = transcribe_spanish_audio(model, audio_path)
        en_text = translate_spanish_to_english(es_text)
        results.append({
            "audio_file": fname,
            "spanish_transcription": es_text,
            "english_translation": en_text,
            "language_detected": detected_lang
        })
    df = pd.DataFrame(results)
    df.to_csv(output_csv, index=False, encoding="utf-8-sig")
    print("Saved:", output_csv)
    return df

# Example (do not auto-run):
# trans_csv = LLM_OUT / "audio_translations.csv"
# _ = process_and_translate_audio(AUDIO_DIR, trans_csv, model_size="base")
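`process_and_translate_audio` selects files with `f.endswith(".wav")`, which is case-sensitive, so a file named `q1.WAV` would be skipped. If that matters for your data, one possible alternative is sketched below with `pathlib`. The helper `list_wav_files` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_wav_files(folder: Path) -> list[Path]:
    """Return .wav files in `folder`, sorted by name, ignoring extension case."""
    return sorted(
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".wav"),
        key=lambda p: p.name,
    )


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ["q2.wav", "q1.WAV", "notes.txt"]:
        (demo_dir / name).touch()
    demo_names = [p.name for p in list_wav_files(demo_dir)]

print(demo_names)
```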



## Step 2 – ASR Evaluation (Input) <a id="asr-eval"></a>
This section computes WER and CER for the input ASR stage by aligning Whisper transcriptions with ground-truth text.

**Assumptions**  
- Ground truth is provided in `CSV_DIR / "ground_truth.csv"` with columns: `audio_file`, `ground_truth`.
- Transcriptions are in `LLM_OUT / "audio_translations.csv"` with columns: `audio_file`, `spanish_transcription`.


In [None]:

import pandas as pd
from jiwer import wer, cer

def evaluate_asr_performance(ground_truth_csv: Path, trans_csv: Path, save_csv: Path) -> pd.DataFrame:
    """
    Compute WER and CER for input ASR against ground truth.

    Parameters
    ----------
    ground_truth_csv : Path
        CSV file with columns: [audio_file, ground_truth].
    trans_csv : Path
        CSV file with columns: [audio_file, spanish_transcription].
    save_csv : Path
        Destination CSV for ASR metrics.

    Returns
    -------
    pd.DataFrame
        Evaluation results with per-file WER/CER.
    """
    gt_df = pd.read_csv(ground_truth_csv)
    tr_df = pd.read_csv(trans_csv)

    # Defensive renaming for common variations.
    gt_df = gt_df.rename(columns={"filename": "audio_file", "spanish_text": "ground_truth"})
    tr_df = tr_df.rename(columns={"transcription": "spanish_transcription"})

    df = pd.merge(tr_df[["audio_file", "spanish_transcription"]],
                  gt_df[["audio_file", "ground_truth"]],
                  on="audio_file", how="inner")

    df["WER"] = [wer(ref, hyp) for ref, hyp in zip(df["ground_truth"], df["spanish_transcription"])]
    df["CER"] = [cer(ref, hyp) for ref, hyp in zip(df["ground_truth"], df["spanish_transcription"])]

    df.to_csv(save_csv, index=False)
    print("Saved:", save_csv)
    return df

# Example (do not auto-run):
# gt_csv  = CSV_DIR / "ground_truth.csv"
# trans_csv = LLM_OUT / "audio_translations.csv"
# asr_csv = EVAL_DIR / "asr_metrics.csv"
# _ = evaluate_asr_performance(gt_csv, trans_csv, asr_csv)
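To make the two metrics concrete: WER is the word-level edit distance between hypothesis and reference divided by the number of reference words, and CER is the same ratio at the character level. The sketch below is an illustrative pure-Python version, not `jiwer`'s implementation; the function names and the example sentence are assumptions chosen for demonstration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, deletions, insertions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simple_wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def simple_cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


ref = "la presión es alta"
hyp = "la presion es alta"  # only the accent differs
print(simple_wer(ref, hyp))  # one of four words is counted as wrong
print(simple_cer(ref, hyp))  # the same mismatch, counted per character
```

Without normalization, an accent-only difference counts as a full word error, which is one reason Step 5 normalizes both strings before scoring.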



## Step 3 – LLM Question Answering <a id="qa"></a>
This section queries an LLM with English questions derived from the ASR+translation step and provides answers based on a tabular blood-pressure dataset.

**Inputs**: A CSV file with synthetic blood-pressure records.  
**Outputs**: English answers and associated computed fields.


In [None]:

import json
import pandas as pd

def ask_gpt(question_en: str, csv_block: str) -> dict:
    """
    Query the LLM with a question and a CSV context block.

    Parameters
    ----------
    question_en : str
        English question derived from Spanish transcription.
    csv_block : str
        CSV content as a single string for in-context grounding.

    Returns
    -------
    dict
        A dictionary with keys:
        - "answer": str (LLM's English answer)
        - "computed_fields": dict (optional structured fields)
    """
    system = (
        "You are a careful data analyst. Read the CSV and answer the question. "
        "Only use the information that can be derived from the table."
    )
    user = (
        f"CSV:\n{csv_block}\n\n"
        f"Question:\n{question_en}\n\n"
        "Provide a concise answer. If a calculation is required, do it transparently."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    answer = resp.choices[0].message.content.strip()
    return {"answer": answer, "computed_fields": {}}
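`ask_gpt` expects the table as a single CSV string (the orchestration cell builds it with `df_bp.to_csv(index=False)`). The sketch below shows, using only the standard library, how such a block and the user message fit together without calling the API. The column names and readings are invented for illustration and are not taken from the project's dataset; `rows_to_csv_block` and `build_user_prompt` are hypothetical helpers, the latter mirroring the f-string in `ask_gpt`.

```python
import csv
import io


def rows_to_csv_block(rows: list[dict]) -> str:
    """Serialize a list of row dictionaries to a CSV string with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_user_prompt(question_en: str, csv_block: str) -> str:
    """Assemble the user message in the same layout that ask_gpt uses."""
    return (
        f"CSV:\n{csv_block}\n\n"
        f"Question:\n{question_en}\n\n"
        "Provide a concise answer. If a calculation is required, do it transparently."
    )


# Made-up readings, for illustration only.
demo_rows = [
    {"date": "2024-01-01", "systolic": 128, "diastolic": 82},
    {"date": "2024-01-02", "systolic": 135, "diastolic": 88},
]
demo_block = rows_to_csv_block(demo_rows)
demo_prompt = build_user_prompt("What was the highest systolic reading?", demo_block)
print(demo_prompt)
```

Because the whole table is sent with every question, this in-context approach is likely best suited to small tables.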



## Step 4 – Translation and TTS <a id="tts"></a>
This section back-translates the LLM's English answers to Spanish and generates Spanish audio (TTS).

Note: A simple gTTS-based fallback is provided (exports `.wav` via pydub). If your project includes a custom TTS, replace the fallback with your implementation and keep the same function signature.


In [None]:

from pathlib import Path

def translate_to_spanish(text_en: str) -> str:
    """
    Translate English text to Spanish using OpenAI chat completion.

    Parameters
    ----------
    text_en : str
        English text to translate.

    Returns
    -------
    str
        Spanish translation.
    """
    prompt = "Translate the following English medical answer into Spanish:\n\n" + text_en
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def text_to_speech_spanish(text_es: str, out_wav_path: Path):
    """
    Generate Spanish speech from text.

    This fallback uses gTTS + pydub to export a WAV file if a custom TTS is
    not available. Replace this function with your project-specific TTS if needed.

    Parameters
    ----------
    text_es : str
        Spanish text to synthesize.
    out_wav_path : Path
        Destination path for the WAV file.
    """
    try:
        from gtts import gTTS
        from pydub import AudioSegment
        tmp_mp3 = out_wav_path.with_suffix(".mp3")
        gTTS(text_es, lang="es").save(tmp_mp3)
        # Convert to WAV
        audio = AudioSegment.from_file(tmp_mp3, format="mp3")
        audio.export(out_wav_path, format="wav")
        os.remove(tmp_mp3)
    except Exception as e:
        # Chain the original exception so the underlying traceback is preserved.
        raise RuntimeError(f"TTS fallback failed: {e}") from e
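In the fallback above, the temporary `.mp3` is deleted only when conversion succeeds. One possible variation is to wrap the work in `try`/`finally` so the intermediate file is removed either way. The sketch uses plain callables as stand-ins for the gTTS and pydub steps, so `synthesize_with_cleanup`, `make_mp3`, and `convert` are hypothetical names, not library APIs.

```python
import tempfile
from pathlib import Path
from typing import Callable


def synthesize_with_cleanup(
    out_wav_path: Path,
    make_mp3: Callable[[Path], None],
    convert: Callable[[Path, Path], None],
) -> None:
    """Create a temporary MP3, convert it to WAV, and always remove the MP3."""
    tmp_mp3 = out_wav_path.with_suffix(".mp3")
    try:
        make_mp3(tmp_mp3)
        convert(tmp_mp3, out_wav_path)
    finally:
        # Runs on success and on failure alike.
        tmp_mp3.unlink(missing_ok=True)


def fake_make_mp3(path: Path) -> None:
    """Stand-in for the TTS step: writes placeholder bytes."""
    path.write_bytes(b"fake mp3 bytes")


def failing_convert(src: Path, dst: Path) -> None:
    """Stand-in for the conversion step: always fails."""
    raise ValueError("simulated conversion failure")


with tempfile.TemporaryDirectory() as tmp:
    demo_wav = Path(tmp) / "answer_1_es.wav"
    try:
        synthesize_with_cleanup(demo_wav, fake_make_mp3, failing_convert)
        demo_failed = False
    except ValueError:
        demo_failed = True
    demo_mp3_left_behind = demo_wav.with_suffix(".mp3").exists()

print(demo_failed, demo_mp3_left_behind)
```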



## Step 5 – Whisper Evaluation of TTS <a id="tts-eval"></a>
This section re-transcribes the generated Spanish audio responses using Whisper (base) and evaluates intelligibility against the ground-truth Spanish answers using WER, CER, and SER.


In [None]:

import re, unicodedata
import Levenshtein
from jiwer import process_words

def normalize_text(text: str) -> str:
    """
    Normalize text for fair ASR comparison.
    - Lowercase
    - Strip accents
    - Remove punctuation
    - Collapse extra whitespace
    """
    text = text.lower()
    text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def evaluate_output_asr_whisper(
    tts_csv: Path,
    output_csv: Path = None,
    model_size: str = "base"
) -> pd.DataFrame:
    """
    Evaluate Spanish TTS audios using Whisper by computing WER, CER, and SER
    against ground-truth Spanish answers.

    Parameters
    ----------
    tts_csv : Path
        CSV with columns: [spanish_answer, audio_answer_file].
    output_csv : Path, optional
        Destination CSV for ASR metrics (defaults to EVAL_DIR / "output_asr_metrics_whisper.csv").
    model_size : str, optional
        Whisper model size (default: "base").

    Returns
    -------
    pd.DataFrame
        DataFrame with per-file metrics.
    """
    if output_csv is None:
        output_csv = EVAL_DIR / "output_asr_metrics_whisper.csv"

    if not tts_csv.exists():
        raise FileNotFoundError(f"Missing final results CSV: {tts_csv}")

    model = whisper.load_model(model_size)

    df = pd.read_csv(tts_csv)
    rows = []
    for _, row in df.iterrows():
        gt = str(row.get("spanish_answer", "")).strip()
        audio_file = str(row.get("audio_answer_file", "")).strip()
        if not gt or not audio_file or not os.path.exists(audio_file):
            continue

        # Transcribe generated audio and normalize
        res = model.transcribe(audio_file, language="es", task="transcribe", verbose=False)
        hyp = res.get("text", "").strip()

        gt_norm = normalize_text(gt)
        hyp_norm = normalize_text(hyp)

        measures = process_words(gt_norm, hyp_norm)
        wer_score = round(measures.wer, 4)
        subs, dels, ins = measures.substitutions, measures.deletions, measures.insertions
        cer_score = round(Levenshtein.distance(gt_norm, hyp_norm) / max(len(gt_norm), 1), 4)
        ser_score = 0 if gt_norm == hyp_norm else 1

        rows.append({
            "audio_file": os.path.basename(audio_file),
            "ground_truth": gt,
            "whisper_transcription": hyp,
            "WER": wer_score,
            "Substitutions": subs,
            "Deletions": dels,
            "Insertions": ins,
            "CER": cer_score,
            "SER": ser_score,
        })

    out_df = pd.DataFrame(rows)
    out_df.to_csv(output_csv, index=False)
    print("Saved:", output_csv)
    return out_df
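A short worked example may help show what the normalization does and how the per-file SER flags aggregate into a rate. The sketch repeats `normalize_text` so that it runs on its own; `sentence_error_rate` and the example sentences are hypothetical additions for illustration.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Same normalization as in the cell above, repeated so the sketch runs alone."""
    text = text.lower()
    text = "".join(
        c for c in unicodedata.normalize("NFD", text) if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def sentence_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (reference, hypothesis) pairs that differ after normalization."""
    if not pairs:
        return 0.0
    errors = sum(normalize_text(ref) != normalize_text(hyp) for ref, hyp in pairs)
    return errors / len(pairs)


demo_pairs = [
    # Punctuation and accents are stripped, so these two match.
    ("¿Cuál es mi presión arterial?", "cual es mi presion arterial"),
    # The slash is removed without a space, so "120/80" becomes "12080".
    ("Su presión es 120/80.", "su presion es 120 80"),
]
for ref, hyp in demo_pairs:
    print(repr(normalize_text(ref)), "vs", repr(normalize_text(hyp)))

demo_ser = sentence_error_rate(demo_pairs)
print("SER:", demo_ser)
```

Two side effects are worth keeping in mind when reading the scores: stripping combining marks also maps `ñ` to `n` (so `año` and `ano` compare equal), and removing punctuation joins readings written as `120/80` into a single token.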



## Orchestration (Optional) <a id="orchestration"></a>
The following function orchestrates all steps in sequence. Each step can also be run individually.


In [None]:

def run_full_pipeline(csv_path: Path, audio_folder: Path, whisper_model_size: str = "base"):
    """
    Execute the full pipeline, from input ASR/translation to TTS evaluation.

    Parameters
    ----------
    csv_path : Path
        Path to the synthetic blood-pressure CSV.
    audio_folder : Path
        Directory containing input Spanish .wav files.
    whisper_model_size : str, optional
        Whisper model size, default "base".

    Returns
    -------
    dict
        Summary dictionary with output artifact locations.
    """
    # Step 1: ASR + Translation
    trans_csv = LLM_OUT / "audio_translations.csv"
    _ = process_and_translate_audio(audio_folder, trans_csv, model_size=whisper_model_size)

    # Step 2: Evaluate input ASR
    gt_csv = CSV_DIR / "ground_truth.csv"
    asr_csv = EVAL_DIR / "asr_metrics.csv"
    _ = evaluate_asr_performance(gt_csv, trans_csv, asr_csv)

    # Step 3 (setup): Load tabular data for LLM grounding
    df_bp = pd.read_csv(csv_path)
    csv_block = df_bp.to_csv(index=False)

    # Steps 3–4: Q&A, back-translation to Spanish, and TTS
    results = []
    tr_df = pd.read_csv(trans_csv)
    for i, row in tr_df.iterrows():
        q_en = row["english_translation"]
        ans = ask_gpt(q_en, csv_block)
        ans_en = ans.get("answer", "").strip()
        ans_es = translate_to_spanish(ans_en)

        out_wav = TTS_DIR / f"answer_{i+1}_es.wav"
        text_to_speech_spanish(ans_es, out_wav)

        results.append({
            "question_number": i + 1,
            "audio_file_in": row["audio_file"],
            "spanish_question": row["spanish_transcription"],
            "english_question": q_en,
            "english_answer": ans_en,
            "spanish_answer": ans_es,
            "audio_answer_file": str(out_wav),
            "computed_fields": json.dumps(ans.get("computed_fields", {}))
        })

    final_csv = LLM_OUT / "final_pipeline_results.csv"
    pd.DataFrame(results).to_csv(final_csv, index=False)
    print("Saved:", final_csv)

    # Step 5: Evaluate TTS intelligibility
    output_asr_csv = EVAL_DIR / "output_asr_metrics_whisper.csv"
    _ = evaluate_output_asr_whisper(final_csv, output_csv=output_asr_csv, model_size=whisper_model_size)

    return {
        "transcriptions_csv": str(trans_csv),
        "input_asr_metrics_csv": str(asr_csv),
        "final_pipeline_csv": str(final_csv),
        "output_asr_metrics_csv": str(output_asr_csv),
    }

# Example (do not auto-run):
# bp_csv = CSV_DIR / "synthetic_bp_one_person.csv"
# _ = run_full_pipeline(bp_csv, AUDIO_DIR, whisper_model_size="base")



## Step 6 – Results Summary <a id="summary"></a>
This section reports dataset-level averages: WER and CER for the input ASR stage, and WER, CER, and SER for the TTS evaluation stage.


In [None]:

import pandas as pd

def summarize_results(input_asr_csv: Path, output_asr_csv: Path):
    """
    Print dataset-level average metrics for input ASR and TTS ASR evaluation.

    Parameters
    ----------
    input_asr_csv : Path
        CSV with per-file input ASR metrics.
    output_asr_csv : Path
        CSV with per-file TTS ASR metrics.
    """
    print("Input ASR metrics (WER, CER):")
    if Path(input_asr_csv).exists():
        d1 = pd.read_csv(input_asr_csv)
        print(d1[["WER", "CER"]].mean(numeric_only=True).to_frame("Average"))
    else:
        print("Missing:", input_asr_csv)

    print("\nTTS ASR metrics (WER, CER, SER):")
    if Path(output_asr_csv).exists():
        d2 = pd.read_csv(output_asr_csv)
        print(d2[["WER", "CER", "SER"]].mean(numeric_only=True).to_frame("Average"))
    else:
        print("Missing:", output_asr_csv)

# Example (do not auto-run):
# summarize_results(EVAL_DIR / "asr_metrics.csv", EVAL_DIR / "output_asr_metrics_whisper.csv")



## Appendix <a id="appendix"></a>
- All paths are centralized under `PROJECT_ROOT` for reproducibility.  
- Replace the TTS fallback with a project-specific implementation if available.  
- The Whisper model size can be adjusted by changing `whisper_model_size` in the orchestration call.
