Materials for "DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects".
This repository provides:
- An implementation of DOLCE for off-policy evaluation (OPE) and off-policy learning (OPL) under support violation.
- Synthetic experiments used in the paper.
- A real-world data processing / evaluation pipeline (code only; the data itself is confidential).
# clone the repository
git clone git@github.com:shutech2001/DOLCE.git
# build the environment with poetry
poetry install
# activate virtual environment
eval $(poetry env activate)
# [Optional] to set the Python interpreter in your editor/IDE, select the path printed by the following command
poetry env info --path
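As an optional sanity check, you can confirm that the environment resolves the repository's modules (a minimal sketch; the imports simply mirror the examples used later in this README):

# optional: verify the poetry environment can import the repository's modules
from synthetic import generate_synthetic_data
from ope import calc_dolce
from opl import DOLCE
from utils import eps_greedy_policy

print("DOLCE environment ready")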
- src/ope/models.py: OPE estimators (calc_dm, calc_ips, calc_dr, calc_dolce)
- src/opl/models.py: OPL learners (RegressionBasedPolicyLearner, GradientBasedPolicyLearner, DOLCE)
- src/synthetic/: synthetic data generator and ground-truth utilities
- scripts/: experiment drivers
  - scripts/exec_synthetic_ope.py
  - scripts/exec_synthetic_opl.py
  - scripts/exec_real_world.py (pipeline entrypoint)
  - scripts/real_world/ (pipeline components)
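A quick-start example for the DOLCE estimator itself follows in the next section. For the other listed estimators, here is a hedged sketch of a side-by-side comparison; it assumes calc_dm, calc_ips, and calc_dr share calc_dolce's (data, pi) call pattern and return per-sample contributions, which may not match the actual signatures in src/ope/models.py:

from synthetic import generate_synthetic_data
from ope import calc_dm, calc_ips, calc_dr, calc_dolce
from utils import eps_greedy_policy

data = generate_synthetic_data(
    num_data=1000,
    num_features=5,
    num_actions=5,
    non_overlap_ratio=0.3,
    lambda_=0.5,
    eta=0.0,
    random_state=42,
    env_random_state=7,
)
pi = eps_greedy_policy(data.get("g_x_t_a_t", data["q"]))

# assumed interface: each estimator returns per-sample contributions whose mean is the estimate
for name, estimator in [("DM", calc_dm), ("IPS", calc_ips), ("DR", calc_dr)]:
    print(name, float(estimator(data, pi).mean()))
contrib, _ = calc_dolce(data, pi, random_state=42)
print("DOLCE", float(contrib.mean()))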
You can run the DOLCE estimator alone (without the full experiment scripts) as follows:
from synthetic import generate_synthetic_data
from ope import calc_dolce
from utils import eps_greedy_policy
data = generate_synthetic_data(
    num_data=1000,
    num_features=5,
    num_actions=5,
    non_overlap_ratio=0.3,
    lambda_=0.5,
    eta=0.0,
    random_state=42,
    env_random_state=7,
)
# Target policy (example: epsilon-greedy on current-context component)
q_for_pi = data.get("g_x_t_a_t", data["q"])
pi = eps_greedy_policy(q_for_pi)
# calc_dolce returns per-sample contributions and a diagnostics dict
contrib, info = calc_dolce(data, pi, random_state=42)
print("DOLCE estimate:", float(contrib.mean()))
print("lag weight range:", float(info["lag_weight_min"]), float(info["lag_weight_max"]))You can also run the DOLCE policy learner directly:
from synthetic import generate_synthetic_data
from opl import DOLCE
logged = generate_synthetic_data(
    num_data=1000,
    num_features=5,
    num_actions=5,
    non_overlap_ratio=0.3,
    lambda_=0.5,
    eta=0.0,
    random_state=42,
    env_random_state=7,
    logging_eps=0.2,
)
test = generate_synthetic_data(
    num_data=2000,
    num_features=5,
    num_actions=5,
    non_overlap_ratio=0.3,
    lambda_=0.5,
    eta=0.0,
    random_state=999,
    env_random_state=7,
    logging_eps=0.2,
)
# train DOLCE on the logged data and evaluate the learned policy on the test data
learner = DOLCE(num_features=5, num_actions=5, max_iter=30, random_state=42)
learner.fit(logged, test)
# pi_hat: the learned policy's action-choice probabilities for each test context
pi_hat = learner.predict(test)
# policy value under the ground-truth expected rewards q
value = float((test["q"] * pi_hat).sum(1).mean())
print("Learned policy value:", value)The synthetic experiments are controlled by the two scripts below. Each script supports sweeping:
- support violation ratio (default)
- logged data size
- number of actions
- lambda
- eta
python3 scripts/exec_synthetic_ope.py --sweep support_violation
python3 scripts/exec_synthetic_ope.py --sweep num_data
python3 scripts/exec_synthetic_ope.py --sweep num_actions
python3 scripts/exec_synthetic_ope.py --sweep lambda
python3 scripts/exec_synthetic_ope.py --sweep eta

python3 scripts/exec_synthetic_opl.py --sweep support_violation
python3 scripts/exec_synthetic_opl.py --sweep num_data
python3 scripts/exec_synthetic_opl.py --sweep num_actions
python3 scripts/exec_synthetic_opl.py --sweep lambda
python3 scripts/exec_synthetic_opl.py --sweep eta

By default, each --sweep uses the paper's grid; you can override it with --values "..." if needed.
When sweeping anything other than support violation, you can fix the ratio with --support-violation (default: 0.0).
Plots are saved under results/plots/ as ope_<sweep>.pdf / opl_<sweep>.pdf.
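After a run finishes, a minimal sketch for listing the generated plots (it only globs whatever PDFs currently exist under results/plots/):

from pathlib import Path

# list the ope_<sweep>.pdf / opl_<sweep>.pdf files produced by the sweep scripts
for pdf in sorted(Path("results/plots").glob("*.pdf")):
    print(pdf)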
The real-world dataset is confidential, so this repository provides only the processing/evaluation pipeline. Relevant files:
- scripts/exec_real_world.py (entrypoint)
- scripts/real_world/ (data loading, preprocessing, and OPE evaluation)
@article{tamano2025dolce,
author={Tamano, Shu},
title={{DOLCE}: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects},
journal={arXiv preprint arXiv:2505.00961},
year={2025},
}
If you have any questions, please feel free to contact: tamano-shu212@g.ecc.u-tokyo.ac.jp