
Implement NN Policy Learner #78

Merged: 18 commits into st-tech:master on Mar 27, 2021

Conversation

Kurorororo (Contributor)

This PR implements NNPolicyLearner.
It uses a neural network whose objective function is an OPE estimator. To realize this, estimate_policy_value_tensor is implemented in each OPE estimator. However, the replay method and switch-DR cannot be used because they are non-differentiable.
An example script examples/opl/evaluate_off_policy_learners.py and a notebook opl.ipynb are also added.
torch >= 1.7.1 is now added to the dependencies.
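
For context, a minimal sketch of the training objective, assuming an IPW-style estimator (the helper ipw_policy_value_tensor and the variable names are illustrative, not the actual obp API; the real estimators expose this logic via estimate_policy_value_tensor):

import torch

def ipw_policy_value_tensor(reward, action, pscore, action_dist):
    # reward: (n,) observed rewards
    # action: (n,) logged actions (long tensor)
    # pscore: (n,) action-choice probabilities of the logging policy
    # action_dist: (n, n_actions) probabilities output by the policy network
    idx = torch.arange(action.shape[0])
    iw = action_dist[idx, action] / pscore  # importance weights
    return (iw * reward).mean()  # differentiable IPW policy-value estimate

# One training step minimizes the negative estimated policy value:
#   action_dist = policy_net(context)  # e.g., ends with a softmax layer
#   loss = -ipw_policy_value_tensor(reward, action, pscore, action_dist)
#   loss.backward()

This also shows why the replay method and switch-DR are excluded: both contain indicator functions, which have zero gradient almost everywhere.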

usaito (Contributor) commented on Mar 18, 2021

@Kurorororo
Thanks! Overall, the implementations LGTM!
I have some minor comments. Could you fix these? Then, I will merge this PR.

[must]

offline.py

[nits]

tests/policy/test_offline.py

[nits]

maybe

assert np.allclose(action_dist.sum(1), np.ones((context_test.shape[0], len_list)))

examples/opl/evaluate_off_policy_learners.py

[imo]

  • I think
["random_policy", "ipw_learner", f"nn_policy_learner (with {ope_estimator})"],

is more interpretable

https://github.com/Kurorororo/zr-obp/blob/200d109a74040352f18a9c125726d1f1bee4ab5b/examples/opl/evaluate_off_policy_learners.py#L238

Kurorororo (Contributor, Author)

@usaito Thank you for the review!
I have changed the code according to the comments, so please check the updates.

usaito (Contributor) commented on Mar 27, 2021

@Kurorororo Thanks! I have just a few minor comments.

policy/offline.py

[imo&ask]

examples/README.md

I think

opl/: example implementations for comparing the performance of several off-policy learners with synthetic bandit datasets.

better describes what you implemented than the current form below:

opl/: example implementations for evaluating several off-policy learners with synthetic bandit datasets.

(this may be confusing because we also have many OPE-related examples)

Kurorororo (Contributor, Author) commented on Mar 27, 2021

I updated the README.md.

@usaito Semantically, these arguments should not be default arguments, but they must be, because NNPolicyLearner inherits from BaseOfflinePolicyLearner, whose constructor has a default argument len_list. In Python, non-default arguments must not be placed after default arguments, so dim_context and off_policy_objective have to be given defaults (see the sketch below).
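
A minimal sketch of that constraint, assuming the dataclass style obp uses for its policy learners (types abbreviated for illustration):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class BaseOfflinePolicyLearner:
    n_actions: int
    len_list: int = 1  # the base class defines a default here

@dataclass
class NNPolicyLearner(BaseOfflinePolicyLearner):
    # Inherited fields (including len_list=1) come first, so omitting the
    # defaults below raises:
    #   TypeError: non-default argument 'dim_context' follows default argument
    dim_context: Optional[int] = None
    off_policy_objective: Optional[Callable] = None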

usaito (Contributor) commented on Mar 27, 2021

@Kurorororo Got it. Thanks! Then, how about removing the default=None statements for those two variables from the docstrings? I think they may confuse users. What do you think?
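
Concretely, the docstring change would look like this (a hypothetical numpydoc-style fragment; only the default=None note is dropped for the two required parameters):

# before
dim_context: int, default=None
    Number of dimensions of context vectors.

# after
dim_context: int
    Number of dimensions of context vectors.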

Kurorororo (Contributor, Author)

@usaito
Sounds reasonable. I have updated the docstring.

usaito (Contributor) commented on Mar 27, 2021

@Kurorororo Thanks! I'll merge this PR.

usaito merged commit 2591b02 into st-tech:master on Mar 27, 2021.