feat(eval): Azure OpenAI support for reader/judge clients (#19)#22
Conversation
_default_client() in answer_generator and longmemeval_judge now detects AZURE_API_KEY + AZURE_API_BASE and constructs AzureOpenAI instead of OpenAI. Run scripts default to Azure-deployed model names (gpt-4o-mini reader, gpt-4o judge). Existing OPENAI_API_KEY path unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces Azure OpenAI support by updating the client initialization logic in answer_generator.py and longmemeval_judge.py to check for Azure-specific environment variables. It also updates the default models in run_longmemeval_mempalace.py to Azure-compatible versions and adds corresponding unit tests. Feedback focuses on improving error handling for partially configured environment variables to ensure the application fails fast, and refactoring duplicated client initialization and test logic into shared utilities to reduce maintenance overhead.
| azure_key = os.environ.get("AZURE_API_KEY") | ||
| azure_base = os.environ.get("AZURE_API_BASE") | ||
| if azure_key and azure_base: |
There was a problem hiding this comment.
According to the general rules, when fetching credentials from environment variables, we should explicitly check for their presence and raise a RuntimeError if they are missing to ensure the application fails fast. The current implementation silently falls back to the OpenAI path if Azure environment variables are only partially set, which can lead to confusing behavior.
azure_key = os.environ.get("AZURE_API_KEY")
azure_base = os.environ.get("AZURE_API_BASE")
if (azure_key is not None) ^ (azure_base is not None):
raise RuntimeError("Both AZURE_API_KEY and AZURE_API_BASE must be set for Azure OpenAI support.")
if azure_key and azure_base:References
- When fetching credentials from environment variables, explicitly check for their presence and raise a RuntimeError if missing to ensure the application fails fast.
| azure_key = os.environ.get("AZURE_API_KEY") | ||
| azure_base = os.environ.get("AZURE_API_BASE") | ||
| if azure_key and azure_base: | ||
| try: | ||
| from openai import AzureOpenAI # type: ignore[import-not-found] | ||
| except ImportError: | ||
| log.info("longmemeval_judge: openai SDK not installed") | ||
| return None | ||
| api_version = os.environ.get( | ||
| "AZURE_API_VERSION", "2024-12-01-preview" | ||
| ) | ||
| return AzureOpenAI( | ||
| azure_endpoint=azure_base, | ||
| api_key=azure_key, | ||
| api_version=api_version, | ||
| ) |
There was a problem hiding this comment.
| def _install_fake_openai_module(monkeypatch): | ||
| fake = ModuleType("openai") | ||
|
|
||
| class _FakeOpenAI: | ||
| last_kwargs: dict | None = None | ||
|
|
||
| def __init__(self, **kwargs): | ||
| type(self).last_kwargs = kwargs | ||
|
|
||
| class _FakeAzureOpenAI: | ||
| last_kwargs: dict | None = None | ||
|
|
||
| def __init__(self, **kwargs): | ||
| type(self).last_kwargs = kwargs | ||
|
|
||
| fake.OpenAI = _FakeOpenAI | ||
| fake.AzureOpenAI = _FakeAzureOpenAI | ||
| monkeypatch.setitem(sys.modules, "openai", fake) | ||
| return _FakeOpenAI, _FakeAzureOpenAI | ||
|
|
||
|
|
||
| def test_default_client_prefers_azure_when_both_env_vars_set(monkeypatch): | ||
| monkeypatch.setenv("AZURE_API_KEY", "azure-secret") | ||
| monkeypatch.setenv("AZURE_API_BASE", "https://example.azure.com/") | ||
| monkeypatch.setenv("AZURE_API_VERSION", "2099-12-01-preview") | ||
| monkeypatch.setenv("OPENAI_API_KEY", "openai-secret") | ||
| _, FakeAzure = _install_fake_openai_module(monkeypatch) | ||
|
|
||
| client = _default_client() | ||
|
|
||
| assert isinstance(client, FakeAzure) | ||
| assert FakeAzure.last_kwargs == { | ||
| "azure_endpoint": "https://example.azure.com/", | ||
| "api_key": "azure-secret", | ||
| "api_version": "2099-12-01-preview", | ||
| } | ||
|
|
||
|
|
||
| def test_default_client_azure_uses_default_api_version(monkeypatch): | ||
| monkeypatch.setenv("AZURE_API_KEY", "azure-secret") | ||
| monkeypatch.setenv("AZURE_API_BASE", "https://example.azure.com/") | ||
| monkeypatch.delenv("AZURE_API_VERSION", raising=False) | ||
| monkeypatch.delenv("OPENAI_API_KEY", raising=False) | ||
| _, FakeAzure = _install_fake_openai_module(monkeypatch) | ||
|
|
||
| _default_client() | ||
|
|
||
| assert FakeAzure.last_kwargs["api_version"] == "2024-12-01-preview" | ||
|
|
||
|
|
||
| def test_default_client_falls_back_to_openai_when_azure_partial(monkeypatch): | ||
| monkeypatch.setenv("AZURE_API_KEY", "azure-secret") | ||
| monkeypatch.delenv("AZURE_API_BASE", raising=False) | ||
| monkeypatch.setenv("OPENAI_API_KEY", "openai-secret") | ||
| FakeOpenAI, _ = _install_fake_openai_module(monkeypatch) | ||
|
|
||
| client = _default_client() | ||
|
|
||
| assert isinstance(client, FakeOpenAI) | ||
|
|
||
|
|
||
| def test_default_client_returns_none_when_neither_configured(monkeypatch): | ||
| monkeypatch.delenv("AZURE_API_KEY", raising=False) | ||
| monkeypatch.delenv("AZURE_API_BASE", raising=False) | ||
| monkeypatch.delenv("OPENAI_API_KEY", raising=False) | ||
| assert _default_client() is None |
There was a problem hiding this comment.
…22) Adds grade_answer_replicated() to characterize intra-rater LLM-judge stochasticity. Backward compatible: replicates=1 delegates to grade_answer with temperature=0.0 (paper setting). For K>1, runs K independent judge calls at temperature=0.3 by default and aggregates via majority vote, excluding ERROR replicates from the vote. Returns added diagnostics for K>1: - replicates: list of per-call results - label_counts: dict of non-ERROR label -> count - agreement_rate / flip_rate: fraction matching the majority - usage: summed across all K calls (cost accounting still accurate) Threads a temperature parameter through _call_openai and grade_answer so K-replicate calls can opt into stochasticity without changing the default deterministic behaviour. Closes M0nkeyFl0wer#22 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(cat4): surface per-ID edge degree in collision report (#13) Adds id_degrees dict to CollisionGroup so operators can tell which ID is the canonical one when deduplicating — the highest-degree ID is usually the entity that other extractions converged on, while low-degree duplicates are stragglers that missed canonicalization. The format_report collision section now prints each ID with its degree below the names line and marks the highest with "← keep". Fixture ground_truth gains collision_id_degrees ({e1: 4, e2..e4: 0}) and two new tests cover both the dataclass field population and the rendered "← keep" marker placement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(cat4c): edge count alongside component count, sparse annotation (#15) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(adapters): random + oracle retrieval baselines (TREC bounds) (#23) Two reference adapters that establish the standard lower/upper bounds for retrieval evaluation per TREC methodology: - RandomRetrievalAdapter: returns K uniformly random items from the ingested corpus, seeded for reproducibility. Any system that can't beat this isn't doing retrieval. - OracleRetrievalAdapter: returns the gold expected_sources verbatim in context_string. Substring-scorer ceiling — no system can do better than perfect retrieval. Both wired into sme/cli.py _load_adapter() under the names random/random-retrieval/random_retrieval and oracle/oracle-retrieval/ oracle_retrieval. Adapter-specific kwargs from other adapters are silently dropped for CLI parity. Per the issue brief, baselines are intentionally NOT added to the adapter harness manifest contract — they're reference bounds, not adapters under test. Closes M0nkeyFl0wer#23 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat7b): query latency distribution with YCSB percentiles (#25) Adds sme/categories/latency.py with LatencyCollector that wraps adapter calls and captures wall-clock timing, computing YCSB-standard percentiles (p50, p95, p99, p99.9, max) via numpy. The [latency] optional extra reserves an HdrHistogram upgrade path when higher-fidelity tail measurement is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(eval): cost-per-correct with versioned pricing table (HELM methodology) (#26) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(eval): judge variance — K-replicate readings with majority vote (#22) Adds grade_answer_replicated() to characterize intra-rater LLM-judge stochasticity. Backward compatible: replicates=1 delegates to grade_answer with temperature=0.0 (paper setting). For K>1, runs K independent judge calls at temperature=0.3 by default and aggregates via majority vote, excluding ERROR replicates from the vote. Returns added diagnostics for K>1: - replicates: list of per-call results - label_counts: dict of non-ERROR label -> count - agreement_rate / flip_rate: fraction matching the majority - usage: summed across all K calls (cost accounting still accurate) Threads a temperature parameter through _call_openai and grade_answer so K-replicate calls can opt into stochasticity without changing the default deterministic behaviour. Closes M0nkeyFl0wer#22 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(stats): paired bootstrap CIs and BH-FDR correction (#21) Adds sme/stats/ package with two standard tools for distinguishing real effects from noise across conditions: - paired_bootstrap_ci: non-parametric CI on per-question paired score differences (Efron & Tibshirani 1993) - benjamini_hochberg: FDR correction for multiple comparisons across categories or conditions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat5): break out isolates by entity_type in structural summary (#14) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat5): representative cycles alongside Betti-1 (#16) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat5): component-size distribution and non-trivial summaries (#17) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(tests): clear ruff E741 + F401 blocking CI on PR #28 - tests/test_ingestion_integrity.py:155 — rename ambiguous `l` → `line` in next() comprehension (E741). - tests/test_pricing.py:2 — drop unused `PricingTable` import (F401). Both errors landed in the bundle and were caught only after rebase onto main; main's pyproject ruff pin is stricter than the original branch base. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
_default_client()inanswer_generator.pyandlongmemeval_judge.pynow detectsAZURE_API_KEY+AZURE_API_BASEenv vars and constructsAzureOpenAIinstead ofOpenAI. Falls back to existingOPENAI_API_KEYpath when Azure vars aren't set.run_longmemeval_mempalace.pyscript defaults flipped to Azure-deployed model names (gpt-4o-minireader,gpt-4ojudge). Library-level defaults unchanged for direct-OpenAI users.AZURE_API_KEYis in the shell environment.Dry-run cost estimate (500 questions)
Live 500-question run not included — ready to fire on go-ahead.
Test plan
longmemeval_oracle.json🤖 Generated with Claude Code