Skip to content

feat(eval): Azure OpenAI support for reader/judge clients (#19)#22

Merged
jphein merged 1 commit into
mainfrom
feat/azure-openai-eval
May 24, 2026
Merged

feat(eval): Azure OpenAI support for reader/judge clients (#19)#22
jphein merged 1 commit into
mainfrom
feat/azure-openai-eval

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented May 24, 2026

Summary

  • _default_client() in answer_generator.py and longmemeval_judge.py now detects AZURE_API_KEY + AZURE_API_BASE env vars and constructs AzureOpenAI instead of OpenAI. Falls back to existing OPENAI_API_KEY path when Azure vars aren't set.
  • run_longmemeval_mempalace.py script defaults flipped to Azure-deployed model names (gpt-4o-mini reader, gpt-4o judge). Library-level defaults unchanged for direct-OpenAI users.
  • 8 new Azure-path tests + fixes for 2 pre-existing tests that broke when AZURE_API_KEY is in the shell environment.

Dry-run cost estimate (500 questions)

Component Model Input tokens Output tokens Cost
Reader gpt-4o-mini ~761K ~25K $0.13
Judge gpt-4o ~150K ~10K $0.47
Total $0.60

Live 500-question run not included — ready to fire on go-ahead.

Test plan

  • 317 tests pass (8 new Azure-path tests)
  • Dry-run cost estimate against longmemeval_oracle.json
  • Live daemon run (pending go-ahead — ~$0.60)

🤖 Generated with Claude Code

_default_client() in answer_generator and longmemeval_judge now
detects AZURE_API_KEY + AZURE_API_BASE and constructs AzureOpenAI
instead of OpenAI. Run scripts default to Azure-deployed model
names (gpt-4o-mini reader, gpt-4o judge). Existing OPENAI_API_KEY
path unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 24, 2026 21:00
@jphein jphein merged commit 7d13451 into main May 24, 2026
0 of 4 checks passed
@jphein jphein deleted the feat/azure-openai-eval branch May 24, 2026 21:01
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Azure OpenAI support by updating the client initialization logic in answer_generator.py and longmemeval_judge.py to check for Azure-specific environment variables. It also updates the default models in run_longmemeval_mempalace.py to Azure-compatible versions and adds corresponding unit tests. Feedback focuses on improving error handling for partially configured environment variables to ensure the application fails fast, and refactoring duplicated client initialization and test logic into shared utilities to reduce maintenance overhead.

Comment on lines +50 to +52
azure_key = os.environ.get("AZURE_API_KEY")
azure_base = os.environ.get("AZURE_API_BASE")
if azure_key and azure_base:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the general rules, when fetching credentials from environment variables, we should explicitly check for their presence and raise a RuntimeError if they are missing to ensure the application fails fast. The current implementation silently falls back to the OpenAI path if Azure environment variables are only partially set, which can lead to confusing behavior.

    azure_key = os.environ.get("AZURE_API_KEY")
    azure_base = os.environ.get("AZURE_API_BASE")
    if (azure_key is not None) ^ (azure_base is not None):
        raise RuntimeError("Both AZURE_API_KEY and AZURE_API_BASE must be set for Azure OpenAI support.")
    if azure_key and azure_base:
References
  1. When fetching credentials from environment variables, explicitly check for their presence and raise a RuntimeError if missing to ensure the application fails fast.

Comment on lines +240 to +255
azure_key = os.environ.get("AZURE_API_KEY")
azure_base = os.environ.get("AZURE_API_BASE")
if azure_key and azure_base:
try:
from openai import AzureOpenAI # type: ignore[import-not-found]
except ImportError:
log.info("longmemeval_judge: openai SDK not installed")
return None
api_version = os.environ.get(
"AZURE_API_VERSION", "2024-12-01-preview"
)
return AzureOpenAI(
azure_endpoint=azure_base,
api_key=azure_key,
api_version=api_version,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This _default_client implementation is identical to the one added in sme/eval/answer_generator.py. Duplicating complex logic like this increases maintenance overhead and the risk of divergence. Consider refactoring this into a shared utility function within the sme.eval package.

Comment on lines +279 to +344
def _install_fake_openai_module(monkeypatch):
fake = ModuleType("openai")

class _FakeOpenAI:
last_kwargs: dict | None = None

def __init__(self, **kwargs):
type(self).last_kwargs = kwargs

class _FakeAzureOpenAI:
last_kwargs: dict | None = None

def __init__(self, **kwargs):
type(self).last_kwargs = kwargs

fake.OpenAI = _FakeOpenAI
fake.AzureOpenAI = _FakeAzureOpenAI
monkeypatch.setitem(sys.modules, "openai", fake)
return _FakeOpenAI, _FakeAzureOpenAI


def test_default_client_prefers_azure_when_both_env_vars_set(monkeypatch):
monkeypatch.setenv("AZURE_API_KEY", "azure-secret")
monkeypatch.setenv("AZURE_API_BASE", "https://example.azure.com/")
monkeypatch.setenv("AZURE_API_VERSION", "2099-12-01-preview")
monkeypatch.setenv("OPENAI_API_KEY", "openai-secret")
_, FakeAzure = _install_fake_openai_module(monkeypatch)

client = _default_client()

assert isinstance(client, FakeAzure)
assert FakeAzure.last_kwargs == {
"azure_endpoint": "https://example.azure.com/",
"api_key": "azure-secret",
"api_version": "2099-12-01-preview",
}


def test_default_client_azure_uses_default_api_version(monkeypatch):
monkeypatch.setenv("AZURE_API_KEY", "azure-secret")
monkeypatch.setenv("AZURE_API_BASE", "https://example.azure.com/")
monkeypatch.delenv("AZURE_API_VERSION", raising=False)
monkeypatch.delenv("OPENAI_API_KEY", raising=False)
_, FakeAzure = _install_fake_openai_module(monkeypatch)

_default_client()

assert FakeAzure.last_kwargs["api_version"] == "2024-12-01-preview"


def test_default_client_falls_back_to_openai_when_azure_partial(monkeypatch):
monkeypatch.setenv("AZURE_API_KEY", "azure-secret")
monkeypatch.delenv("AZURE_API_BASE", raising=False)
monkeypatch.setenv("OPENAI_API_KEY", "openai-secret")
FakeOpenAI, _ = _install_fake_openai_module(monkeypatch)

client = _default_client()

assert isinstance(client, FakeOpenAI)


def test_default_client_returns_none_when_neither_configured(monkeypatch):
monkeypatch.delenv("AZURE_API_KEY", raising=False)
monkeypatch.delenv("AZURE_API_BASE", raising=False)
monkeypatch.delenv("OPENAI_API_KEY", raising=False)
assert _default_client() is None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The Azure client tests and the _install_fake_openai_module helper are duplicated from tests/test_answer_generator.py. Consolidating the client factory logic into a shared utility would also allow for consolidating these tests, ensuring consistent test coverage across the repository.

@jphein jphein review requested due to automatic review settings May 24, 2026 21:24
jphein added a commit that referenced this pull request May 26, 2026
…22)

Adds grade_answer_replicated() to characterize intra-rater LLM-judge
stochasticity. Backward compatible: replicates=1 delegates to
grade_answer with temperature=0.0 (paper setting). For K>1, runs K
independent judge calls at temperature=0.3 by default and aggregates
via majority vote, excluding ERROR replicates from the vote.

Returns added diagnostics for K>1:
- replicates: list of per-call results
- label_counts: dict of non-ERROR label -> count
- agreement_rate / flip_rate: fraction matching the majority
- usage: summed across all K calls (cost accounting still accurate)

Threads a temperature parameter through _call_openai and grade_answer
so K-replicate calls can opt into stochasticity without changing the
default deterministic behaviour.

Closes M0nkeyFl0wer#22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jphein added a commit that referenced this pull request May 26, 2026
* feat(cat4): surface per-ID edge degree in collision report (#13)

Adds id_degrees dict to CollisionGroup so operators can tell which
ID is the canonical one when deduplicating — the highest-degree ID
is usually the entity that other extractions converged on, while
low-degree duplicates are stragglers that missed canonicalization.

The format_report collision section now prints each ID with its
degree below the names line and marks the highest with "← keep".

Fixture ground_truth gains collision_id_degrees ({e1: 4, e2..e4: 0})
and two new tests cover both the dataclass field population and the
rendered "← keep" marker placement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(cat4c): edge count alongside component count, sparse annotation (#15)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(adapters): random + oracle retrieval baselines (TREC bounds) (#23)

Two reference adapters that establish the standard lower/upper bounds
for retrieval evaluation per TREC methodology:

- RandomRetrievalAdapter: returns K uniformly random items from the
  ingested corpus, seeded for reproducibility. Any system that can't
  beat this isn't doing retrieval.
- OracleRetrievalAdapter: returns the gold expected_sources verbatim
  in context_string. Substring-scorer ceiling — no system can do
  better than perfect retrieval.

Both wired into sme/cli.py _load_adapter() under the names
random/random-retrieval/random_retrieval and oracle/oracle-retrieval/
oracle_retrieval. Adapter-specific kwargs from other adapters are
silently dropped for CLI parity.

Per the issue brief, baselines are intentionally NOT added to the
adapter harness manifest contract — they're reference bounds, not
adapters under test.

Closes M0nkeyFl0wer#23

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(cat7b): query latency distribution with YCSB percentiles (#25)

Adds sme/categories/latency.py with LatencyCollector that wraps
adapter calls and captures wall-clock timing, computing YCSB-standard
percentiles (p50, p95, p99, p99.9, max) via numpy.

The [latency] optional extra reserves an HdrHistogram upgrade path
when higher-fidelity tail measurement is needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): cost-per-correct with versioned pricing table (HELM methodology) (#26)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(eval): judge variance — K-replicate readings with majority vote (#22)

Adds grade_answer_replicated() to characterize intra-rater LLM-judge
stochasticity. Backward compatible: replicates=1 delegates to
grade_answer with temperature=0.0 (paper setting). For K>1, runs K
independent judge calls at temperature=0.3 by default and aggregates
via majority vote, excluding ERROR replicates from the vote.

Returns added diagnostics for K>1:
- replicates: list of per-call results
- label_counts: dict of non-ERROR label -> count
- agreement_rate / flip_rate: fraction matching the majority
- usage: summed across all K calls (cost accounting still accurate)

Threads a temperature parameter through _call_openai and grade_answer
so K-replicate calls can opt into stochasticity without changing the
default deterministic behaviour.

Closes M0nkeyFl0wer#22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(stats): paired bootstrap CIs and BH-FDR correction (#21)

Adds sme/stats/ package with two standard tools for distinguishing
real effects from noise across conditions:

- paired_bootstrap_ci: non-parametric CI on per-question paired
  score differences (Efron & Tibshirani 1993)
- benjamini_hochberg: FDR correction for multiple comparisons
  across categories or conditions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(cat5): break out isolates by entity_type in structural summary (#14)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(cat5): representative cycles alongside Betti-1 (#16)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(cat5): component-size distribution and non-trivial summaries (#17)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(tests): clear ruff E741 + F401 blocking CI on PR #28

- tests/test_ingestion_integrity.py:155 — rename ambiguous `l` →
  `line` in next() comprehension (E741).
- tests/test_pricing.py:2 — drop unused `PricingTable` import (F401).

Both errors landed in the bundle and were caught only after rebase
onto main; main's pyproject ruff pin is stricter than the original
branch base.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant