
Implement NN Policy Learner #78

Merged: 18 commits into st-tech:master on Mar 27, 2021

Conversation

Kurorororo (Contributor)

This PR implements NNPolicyLearner.
It uses a neural network whose objective function is an OPE estimator. To realize this, estimate_policy_value_tensor is implemented in each OPE estimator. However, the replay method and switch-DR cannot be used because they are non-differentiable.
An example script examples/opl/evaluate_off_policy_learners.py and a notebook opl.ipynb are also added.
torch >= 1.7.1 is now added to the dependencies.
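
For context, a minimal sketch of the training objective, assuming an IPW-style estimator (the helper ipw_policy_value_tensor and the variable names are illustrative, not the actual obp API; the real estimators expose this logic via estimate_policy_value_tensor):

import torch

def ipw_policy_value_tensor(reward, action, pscore, action_dist):
    # reward: (n,) observed rewards
    # action: (n,) logged actions (long tensor)
    # pscore: (n,) action-choice probabilities of the logging policy
    # action_dist: (n, n_actions) probabilities output by the policy network
    idx = torch.arange(action.shape[0])
    iw = action_dist[idx, action] / pscore  # importance weights
    return (iw * reward).mean()  # differentiable IPW policy-value estimate

# One training step minimizes the negative estimated policy value:
#   action_dist = policy_net(context)  # e.g., ends with a softmax layer
#   loss = -ipw_policy_value_tensor(reward, action, pscore, action_dist)
#   loss.backward()

This also shows why the replay method and switch-DR are excluded: both contain indicator functions, which have zero gradient almost everywhere.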

usaito (Contributor) commented on Mar 18, 2021

@Kurorororo
Thanks! Overall, the implementations LGTM!
I have some minor comments. Could you fix these? Then, I will merge this PR.

[must]

offline.py

[nits]

tests/policy/test_offline.py

[nits]

maybe

assert np.allclose(action_dist.sum(1), np.ones((context_test.shape[0], len_list)))

examples/opl/evaluate_off_policy_learners.py

[imo]

  • I think
["random_policy", "ipw_learner", f"nn_policy_learner (with {ope_estimator})"],

is more interpretable

https://github.com/Kurorororo/zr-obp/blob/200d109a74040352f18a9c125726d1f1bee4ab5b/examples/opl/evaluate_off_policy_learners.py#L238

Kurorororo (Contributor, Author)

@usaito Thank you for the review!
I have changed the code according to the comments, so please check the updates.

usaito (Contributor) commented on Mar 27, 2021

@Kurorororo Thanks! I have just a few minor comments.

policy/offline.py

[imo&ask]

examples/README.md

I think

opl/: example implementations for comparing the performance of several off-policy learners with synthetic bandit datasets.

better describes what you implemented than the current form below:

opl/: example implementations for evaluating several off-policy learners with synthetic bandit datasets.

(this may be confusing because we also have many OPE-related examples)

Kurorororo (Contributor, Author) commented on Mar 27, 2021

I updated the README.md.

@usaito Semantically, these arguments should not be default arguments, but they must be, because NNPolicyLearner inherits from BaseOfflinePolicyLearner, whose constructor has a default argument len_list. In Python, non-default arguments must not be placed after default arguments, so dim_context and off_policy_objective have to be given defaults (see the sketch below).
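
A minimal sketch of that constraint, assuming the dataclass style obp uses for its policy learners (types abbreviated for illustration):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class BaseOfflinePolicyLearner:
    n_actions: int
    len_list: int = 1  # the base class defines a default here

@dataclass
class NNPolicyLearner(BaseOfflinePolicyLearner):
    # Inherited fields (including len_list=1) come first, so omitting the
    # defaults below raises:
    #   TypeError: non-default argument 'dim_context' follows default argument
    dim_context: Optional[int] = None
    off_policy_objective: Optional[Callable] = None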

usaito (Contributor) commented on Mar 27, 2021

@Kurorororo Got it. Thanks! Then, how about removing the default=None statements for those two variables from the docstrings? I think they may confuse users. What do you think?
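
Concretely, the docstring change would look like this (a hypothetical numpydoc-style fragment; only the default=None note is dropped for the two required parameters):

# before
dim_context: int, default=None
    Number of dimensions of context vectors.

# after
dim_context: int
    Number of dimensions of context vectors.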

Kurorororo (Contributor, Author)

@usaito
Sounds reasonable. I have updated the docstring.

usaito (Contributor) commented on Mar 27, 2021

@Kurorororo Thanks! I'll merge this PR.

usaito merged commit 2591b02 into st-tech:master on Mar 27, 2021.