feat(eval): Azure OpenAI support for reader/judge clients (#19) by jphein · Pull Request #22 · techempower-org/multipass-structural-memory-eval

jphein · 2026-05-24T21:00:53Z

Summary

_default_client() in answer_generator.py and longmemeval_judge.py now detects AZURE_API_KEY + AZURE_API_BASE env vars and constructs AzureOpenAI instead of OpenAI. Falls back to existing OPENAI_API_KEY path when Azure vars aren't set.
run_longmemeval_mempalace.py script defaults flipped to Azure-deployed model names (gpt-4o-mini reader, gpt-4o judge). Library-level defaults unchanged for direct-OpenAI users.
8 new Azure-path tests + fixes for 2 pre-existing tests that broke when AZURE_API_KEY is in the shell environment.

Dry-run cost estimate (500 questions)

Component	Model	Input tokens	Output tokens	Cost
Reader	gpt-4o-mini	~761K	~25K	$0.13
Judge	gpt-4o	~150K	~10K	$0.47
Total				$0.60

Live 500-question run not included — ready to fire on go-ahead.

Test plan

317 tests pass (8 new Azure-path tests)
Dry-run cost estimate against longmemeval_oracle.json
Live daemon run (pending go-ahead — ~$0.60)

🤖 Generated with Claude Code

_default_client() in answer_generator and longmemeval_judge now detects AZURE_API_KEY + AZURE_API_BASE and constructs AzureOpenAI instead of OpenAI. Run scripts default to Azure-deployed model names (gpt-4o-mini reader, gpt-4o judge). Existing OPENAI_API_KEY path unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces Azure OpenAI support by updating the client initialization logic in answer_generator.py and longmemeval_judge.py to check for Azure-specific environment variables. It also updates the default models in run_longmemeval_mempalace.py to Azure-compatible versions and adds corresponding unit tests. Feedback focuses on improving error handling for partially configured environment variables to ensure the application fails fast, and refactoring duplicated client initialization and test logic into shared utilities to reduce maintenance overhead.

gemini-code-assist · 2026-05-24T21:03:00Z

+    azure_key = os.environ.get("AZURE_API_KEY")
+    azure_base = os.environ.get("AZURE_API_BASE")
+    if azure_key and azure_base:


According to the general rules, when fetching credentials from environment variables, we should explicitly check for their presence and raise a RuntimeError if they are missing to ensure the application fails fast. The current implementation silently falls back to the OpenAI path if Azure environment variables are only partially set, which can lead to confusing behavior.

azure_key = os.environ.get("AZURE_API_KEY") azure_base = os.environ.get("AZURE_API_BASE") if (azure_key is not None) ^ (azure_base is not None): raise RuntimeError("Both AZURE_API_KEY and AZURE_API_BASE must be set for Azure OpenAI support.") if azure_key and azure_base:

References

When fetching credentials from environment variables, explicitly check for their presence and raise a RuntimeError if missing to ensure the application fails fast.

gemini-code-assist · 2026-05-24T21:03:00Z

+    azure_key = os.environ.get("AZURE_API_KEY")
+    azure_base = os.environ.get("AZURE_API_BASE")
+    if azure_key and azure_base:
+        try:
+            from openai import AzureOpenAI  # type: ignore[import-not-found]
+        except ImportError:
+            log.info("longmemeval_judge: openai SDK not installed")
+            return None
+        api_version = os.environ.get(
+            "AZURE_API_VERSION", "2024-12-01-preview"
+        )
+        return AzureOpenAI(
+            azure_endpoint=azure_base,
+            api_key=azure_key,
+            api_version=api_version,
+        )


This _default_client implementation is identical to the one added in sme/eval/answer_generator.py. Duplicating complex logic like this increases maintenance overhead and the risk of divergence. Consider refactoring this into a shared utility function within the sme.eval package.

gemini-code-assist · 2026-05-24T21:03:00Z

+def _install_fake_openai_module(monkeypatch):
+    fake = ModuleType("openai")
+
+    class _FakeOpenAI:
+        last_kwargs: dict | None = None
+
+        def __init__(self, **kwargs):
+            type(self).last_kwargs = kwargs
+
+    class _FakeAzureOpenAI:
+        last_kwargs: dict | None = None
+
+        def __init__(self, **kwargs):
+            type(self).last_kwargs = kwargs
+
+    fake.OpenAI = _FakeOpenAI
+    fake.AzureOpenAI = _FakeAzureOpenAI
+    monkeypatch.setitem(sys.modules, "openai", fake)
+    return _FakeOpenAI, _FakeAzureOpenAI
+
+
+def test_default_client_prefers_azure_when_both_env_vars_set(monkeypatch):
+    monkeypatch.setenv("AZURE_API_KEY", "azure-secret")
+    monkeypatch.setenv("AZURE_API_BASE", "https://example.azure.com/")
+    monkeypatch.setenv("AZURE_API_VERSION", "2099-12-01-preview")
+    monkeypatch.setenv("OPENAI_API_KEY", "openai-secret")
+    _, FakeAzure = _install_fake_openai_module(monkeypatch)
+
+    client = _default_client()
+
+    assert isinstance(client, FakeAzure)
+    assert FakeAzure.last_kwargs == {
+        "azure_endpoint": "https://example.azure.com/",
+        "api_key": "azure-secret",
+        "api_version": "2099-12-01-preview",
+    }
+
+
+def test_default_client_azure_uses_default_api_version(monkeypatch):
+    monkeypatch.setenv("AZURE_API_KEY", "azure-secret")
+    monkeypatch.setenv("AZURE_API_BASE", "https://example.azure.com/")
+    monkeypatch.delenv("AZURE_API_VERSION", raising=False)
+    monkeypatch.delenv("OPENAI_API_KEY", raising=False)
+    _, FakeAzure = _install_fake_openai_module(monkeypatch)
+
+    _default_client()
+
+    assert FakeAzure.last_kwargs["api_version"] == "2024-12-01-preview"
+
+
+def test_default_client_falls_back_to_openai_when_azure_partial(monkeypatch):
+    monkeypatch.setenv("AZURE_API_KEY", "azure-secret")
+    monkeypatch.delenv("AZURE_API_BASE", raising=False)
+    monkeypatch.setenv("OPENAI_API_KEY", "openai-secret")
+    FakeOpenAI, _ = _install_fake_openai_module(monkeypatch)
+
+    client = _default_client()
+
+    assert isinstance(client, FakeOpenAI)
+
+
+def test_default_client_returns_none_when_neither_configured(monkeypatch):
+    monkeypatch.delenv("AZURE_API_KEY", raising=False)
+    monkeypatch.delenv("AZURE_API_BASE", raising=False)
+    monkeypatch.delenv("OPENAI_API_KEY", raising=False)
+    assert _default_client() is None


The Azure client tests and the _install_fake_openai_module helper are duplicated from tests/test_answer_generator.py. Consolidating the client factory logic into a shared utility would also allow for consolidating these tests, ensuring consistent test coverage across the repository.

…22) Adds grade_answer_replicated() to characterize intra-rater LLM-judge stochasticity. Backward compatible: replicates=1 delegates to grade_answer with temperature=0.0 (paper setting). For K>1, runs K independent judge calls at temperature=0.3 by default and aggregates via majority vote, excluding ERROR replicates from the vote. Returns added diagnostics for K>1: - replicates: list of per-call results - label_counts: dict of non-ERROR label -> count - agreement_rate / flip_rate: fraction matching the majority - usage: summed across all K calls (cost accounting still accurate) Threads a temperature parameter through _call_openai and grade_answer so K-replicate calls can opt into stochasticity without changing the default deterministic behaviour. Closes M0nkeyFl0wer#22 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(cat4): surface per-ID edge degree in collision report (#13) Adds id_degrees dict to CollisionGroup so operators can tell which ID is the canonical one when deduplicating — the highest-degree ID is usually the entity that other extractions converged on, while low-degree duplicates are stragglers that missed canonicalization. The format_report collision section now prints each ID with its degree below the names line and marks the highest with "← keep". Fixture ground_truth gains collision_id_degrees ({e1: 4, e2..e4: 0}) and two new tests cover both the dataclass field population and the rendered "← keep" marker placement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(cat4c): edge count alongside component count, sparse annotation (#15) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(adapters): random + oracle retrieval baselines (TREC bounds) (#23) Two reference adapters that establish the standard lower/upper bounds for retrieval evaluation per TREC methodology: - RandomRetrievalAdapter: returns K uniformly random items from the ingested corpus, seeded for reproducibility. Any system that can't beat this isn't doing retrieval. - OracleRetrievalAdapter: returns the gold expected_sources verbatim in context_string. Substring-scorer ceiling — no system can do better than perfect retrieval. Both wired into sme/cli.py _load_adapter() under the names random/random-retrieval/random_retrieval and oracle/oracle-retrieval/ oracle_retrieval. Adapter-specific kwargs from other adapters are silently dropped for CLI parity. Per the issue brief, baselines are intentionally NOT added to the adapter harness manifest contract — they're reference bounds, not adapters under test. Closes M0nkeyFl0wer#23 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat7b): query latency distribution with YCSB percentiles (#25) Adds sme/categories/latency.py with LatencyCollector that wraps adapter calls and captures wall-clock timing, computing YCSB-standard percentiles (p50, p95, p99, p99.9, max) via numpy. The [latency] optional extra reserves an HdrHistogram upgrade path when higher-fidelity tail measurement is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(eval): cost-per-correct with versioned pricing table (HELM methodology) (#26) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(eval): judge variance — K-replicate readings with majority vote (#22) Adds grade_answer_replicated() to characterize intra-rater LLM-judge stochasticity. Backward compatible: replicates=1 delegates to grade_answer with temperature=0.0 (paper setting). For K>1, runs K independent judge calls at temperature=0.3 by default and aggregates via majority vote, excluding ERROR replicates from the vote. Returns added diagnostics for K>1: - replicates: list of per-call results - label_counts: dict of non-ERROR label -> count - agreement_rate / flip_rate: fraction matching the majority - usage: summed across all K calls (cost accounting still accurate) Threads a temperature parameter through _call_openai and grade_answer so K-replicate calls can opt into stochasticity without changing the default deterministic behaviour. Closes M0nkeyFl0wer#22 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(stats): paired bootstrap CIs and BH-FDR correction (#21) Adds sme/stats/ package with two standard tools for distinguishing real effects from noise across conditions: - paired_bootstrap_ci: non-parametric CI on per-question paired score differences (Efron & Tibshirani 1993) - benjamini_hochberg: FDR correction for multiple comparisons across categories or conditions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat5): break out isolates by entity_type in structural summary (#14) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat5): representative cycles alongside Betti-1 (#16) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cat5): component-size distribution and non-trivial summaries (#17) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(tests): clear ruff E741 + F401 blocking CI on PR #28 - tests/test_ingestion_integrity.py:155 — rename ambiguous `l` → `line` in next() comprehension (E741). - tests/test_pricing.py:2 — drop unused `PricingTable` import (F401). Both errors landed in the bundle and were caught only after rebase onto main; main's pyproject ruff pin is stricter than the original branch base. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 24, 2026 21:00

Copilot started reviewing on behalf of jphein May 24, 2026 21:01 View session

jphein merged commit 7d13451 into main May 24, 2026
0 of 4 checks passed

jphein deleted the feat/azure-openai-eval branch May 24, 2026 21:01

gemini-code-assist Bot reviewed May 24, 2026

View reviewed changes

jphein mentioned this pull request May 24, 2026

feat: run MemPalace fork through LongMemEval E2E QA (first published score) #19

Closed

jphein review requested due to automatic review settings May 24, 2026 21:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): Azure OpenAI support for reader/judge clients (#19)#22

feat(eval): Azure OpenAI support for reader/judge clients (#19)#22
jphein merged 1 commit into
mainfrom
feat/azure-openai-eval

jphein commented May 24, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 24, 2026

Uh oh!

gemini-code-assist Bot May 24, 2026

Uh oh!

gemini-code-assist Bot May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jphein commented May 24, 2026

Summary

Dry-run cost estimate (500 questions)

Test plan

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant