# Phase 7 Offline A/B Evaluation

This notebook compares **control (TF-IDF)** vs **treatment (two-stage system)** using the offline A/B simulation report.

In [1]:
from pathlib import Path
import json
import pandas as pd

report_path = Path('../data/logs/phase7_offline_ab_report.json')
if not report_path.exists():
    raise FileNotFoundError('Run scripts/10_offline_ab_simulation.py first to generate phase7_offline_ab_report.json')

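# Read explicitly as UTF-8 so non-ASCII characters in the report (e.g. Δ) decode correctly on every platform.
report = json.loads(report_path.read_text(encoding='utf-8'))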

report.keys()

dict_keys(['timestamp_utc', 'phase', 'control', 'treatment', 'ndcg_lift', 'mrr_lift', 'guardrails', 'hypothesis'])
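
Since the later cells index into the report by key, it can help to fail fast when a field is missing. The following is a minimal sketch, not part of the simulation framework: the helper name `missing_report_keys` is hypothetical, and the required key set is simply copied from the `dict_keys` output above.

```python
# Keys copied from the dict_keys output above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = (
    "timestamp_utc", "phase", "control", "treatment",
    "ndcg_lift", "mrr_lift", "guardrails", "hypothesis",
)

def missing_report_keys(report, required=REQUIRED_KEYS):
    """Hypothetical helper: return the required keys absent from `report`, in order."""
    return [key for key in required if key not in report]

# Usage with a deliberately incomplete example report.
example_report = {"phase": 7, "control": {}, "treatment": {}}
print(missing_report_keys(example_report))
```

In the notebook this could be called as `missing_report_keys(report)` right after loading, raising an error if the returned list is non-empty.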

In [2]:
comparison = pd.DataFrame([report['control'], report['treatment']]).set_index('arm_name')

comparison 

| arm_name | n_requests | ndcg_at_k | mrr | null_result_rate | p95_latency_ms | mean_latency_ms | latency_alert |
|---|---|---|---|---|---|---|---|
| control_tfidf | 710 | 0.265647 | 0.108649 | 0.0 | 24.332577 | 16.520781 | False |
| treatment_two_stage | 710 | 0.796232 | 0.723944 | 0.0 | 34.037968 | 23.472322 | False |


In [3]:
lifts = pd.Series({'ndcg_lift': report['ndcg_lift'], 'mrr_lift': report['mrr_lift']})

guardrails = pd.Series(report['guardrails'])

lifts, guardrails

(ndcg_lift    0.530584
 mrr_lift     0.615295
 dtype: float64,
 null_result_rate_alert    False
 latency_alert             False
 max_null_result_rate       0.01
 max_p95_latency_ms        100.0
 dtype: object)
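
The reported lifts appear to be absolute differences between arms (treatment minus control), and the guardrail flags appear to compare each arm against the listed maxima. Below is a minimal sketch for cross-checking this with values copied from the outputs above. The helper name `recompute_checks` and the alert rule (an alert fires when any arm strictly exceeds the maximum) are assumptions, not taken from the simulation script.

```python
import math

# Per-arm metrics and thresholds copied from the notebook outputs above.
control = {"ndcg_at_k": 0.265647, "mrr": 0.108649,
           "null_result_rate": 0.0, "p95_latency_ms": 24.332577}
treatment = {"ndcg_at_k": 0.796232, "mrr": 0.723944,
             "null_result_rate": 0.0, "p95_latency_ms": 34.037968}
limits = {"max_null_result_rate": 0.01, "max_p95_latency_ms": 100.0}
reported_lifts = {"ndcg_lift": 0.530584, "mrr_lift": 0.615295}

def recompute_checks(control, treatment, limits):
    """Hypothetical helper: recompute absolute lifts and guardrail alerts."""
    lifts = {
        "ndcg_lift": treatment["ndcg_at_k"] - control["ndcg_at_k"],
        "mrr_lift": treatment["mrr"] - control["mrr"],
    }
    arms = (control, treatment)
    # Assumed rule: an alert fires if any arm strictly exceeds its maximum.
    alerts = {
        "null_result_rate_alert": any(
            arm["null_result_rate"] > limits["max_null_result_rate"] for arm in arms),
        "latency_alert": any(
            arm["p95_latency_ms"] > limits["max_p95_latency_ms"] for arm in arms),
    }
    return lifts, alerts

lifts_check, alerts_check = recompute_checks(control, treatment, limits)

# Displayed values are rounded to six decimals, so allow a small tolerance.
for name, value in lifts_check.items():
    assert math.isclose(value, reported_lifts[name], abs_tol=2e-6), name
```

If the assertions pass, the displayed lifts are consistent with a plain difference of the per-arm metrics rather than a relative (percentage) lift.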

## Hypothesis for production experiment

The string below is generated by the simulation framework and can serve as the draft hypothesis in the experiment documentation.

In [4]:
print(report['hypothesis'])

Hypothesis: the two-stage treatment improves ranking quality over TF-IDF by at least +0.0100 NDCG@K while keeping guardrails healthy. Offline estimate: ΔNDCG=0.5306, which meets this bar.
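
The hypothesis text implies a simple decision rule: the offline NDCG lift must reach the minimum bar of 0.0100 and no guardrail alert may fire. Here is a sketch of that rule, assuming alert flags are the guardrail entries whose names end in `_alert`. The function name `meets_hypothesis_bar` and the `min_ndcg_lift` parameter are hypothetical and may not match how the framework evaluates the bar.

```python
def meets_hypothesis_bar(ndcg_lift, guardrails, min_ndcg_lift=0.01):
    """Hypothetical check: lift reaches the bar and no guardrail alert is set."""
    # Assumption: boolean alert flags are the keys ending in "_alert";
    # the remaining entries are thresholds and are ignored here.
    fired = [name for name, value in guardrails.items()
             if name.endswith("_alert") and value]
    return ndcg_lift >= min_ndcg_lift and not fired

# Values copied from the lift and guardrail outputs earlier in the notebook.
guardrails = {
    "null_result_rate_alert": False,
    "latency_alert": False,
    "max_null_result_rate": 0.01,
    "max_p95_latency_ms": 100.0,
}
print(meets_hypothesis_bar(0.530584, guardrails))
```

Keeping the bar as a parameter makes it easy to reuse the same check if the production experiment adopts a different minimum detectable lift.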
