This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Model-Based RL: Player #1330

Merged - 24 commits merged into tensorflow:master on Jan 8, 2019

Conversation

konradczechowski
Contributor

Utilities and scripts for:

  1. Interactive human play with an environment simulated by the world model.
  2. Recording trajectories from the real and simulated environments, using a policy trained in the MBRL pipeline.

@googlebot added the "cla: yes (PR author has signed CLA)" label on Dec 28, 2018
@konradczechowski
Contributor Author

It seems that commit 6e0177c fixes one bug in the <Python 2.7, TF_VERSION="1.12.*"> configuration (the dopamine import) and reveals another, probably older, problem (exceptions.OSError: [Errno 2] No such file or directory).

Travis before this PR: https://travis-ci.org/tensorflow/tensor2tensor/jobs/472985666
Travis after: https://travis-ci.org/tensorflow/tensor2tensor/jobs/473105651
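For context, guarding an optional dependency like dopamine usually looks something like the sketch below; the module path and fallback behaviour here are assumptions, not necessarily what commit 6e0177c actually does.

# Minimal sketch of guarding an optional dependency so that environments
# without dopamine installed do not fail at import time.
try:
  from dopamine.agents.dqn import dqn_agent
except ImportError:
  dqn_agent = None  # callers must handle the missing dependency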

Contributor

@lukaszkaiser lukaszkaiser left a comment


A number of comments - let's put in a bit of effort to make this code more unified. Thanks!

FLAGS = flags.FLAGS


def make_simulated_env(real_env, world_model_dir, hparams, random_starts):
Contributor

This does not feel specific to the player at all. Let's have unified utilities and not duplicate this function from rlmb.

Contributor Author

I've moved the initial_frame_chooser function generation to a different module; now make_simulated_env (which I renamed to make_simulated_gym_env) has little if any code repetition with rlmb train_agent(). (6a4503a)
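For illustration, a frame-chooser factory of the kind moved into the shared module could look roughly like this; the function name and signature are assumptions, not the actual rl_utils code.

import numpy as np

def make_initial_frame_chooser(starting_frames, random_starts=True):
  """Returns fn(batch_size) -> batch of initial frames for the simulated env."""
  starting_frames = np.asarray(starting_frames)

  def initial_frame_chooser(batch_size):
    if random_starts:
      indices = np.random.randint(len(starting_frames), size=batch_size)
    else:
      indices = np.zeros(batch_size, dtype=np.int64)
    return starting_frames[indices]

  return initial_frame_chooser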

Contributor Author

I've moved make_simulated_env to rl.py.

import rl_utils
from envs.simulated_batch_gym_env import FlatBatchEnv
from tensor2tensor.models.research.rl import get_policy
from tensor2tensor.rl.trainer_model_based import make_simulated_env_fn, \
Contributor

The trainer_model_based file is a binary, not a library; it should not be imported. If we have many general-purpose functions there, it's a sign we need a separate library with utilities - should we create one?

Contributor Author

I've removed the trainer_model_based imports and moved some functions from there to gym_env and rl_utils.
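In terms of imports, the change amounts to something like the following; the rl_utils module path is an assumption, while the rl.py path matches the import already shown above and the commit list below.

# Before: importing helpers from the trainer_model_based binary.
# from tensor2tensor.rl.trainer_model_based import make_simulated_env_fn

# After: importing them from library modules instead.
from tensor2tensor.models.research.rl import make_simulated_env_fn
from tensor2tensor.rl import rl_utils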

return flat_env


def last_epoch(data_dir):
Contributor

Feels like this already exists somewhere?

Contributor Author

As far as I know it does not. Epoch numbers for start_new_epoch(...) (which loads data when restarting) are defined by the trainer_model_based.py main loop (and not read from disk).
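For illustration only, a last_epoch helper consistent with the return statement shown below could scan the data directory like this; the on-disk naming scheme is an assumption.

import os
import re

def last_epoch(data_dir):
  """Infers the largest epoch number from entries named like '..._<epoch>'."""
  epochs_str = [
      name.rsplit("_", 1)[-1]
      for name in os.listdir(data_dir)
      if re.match(r".+_\d+$", name)
  ]
  assert epochs_str, "No epoch data found in {}".format(data_dir)
  return max([int(epoch_str) for epoch_str in epochs_str])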

return max([int(epoch_str) for epoch_str in epochs_str])


def load_t2t_env(hparams, data_dir, which_epoch_data=None,
Contributor

This function should probably be part of T2TBatchGymEnv as a method to allow easy initialization?

Contributor Author

load_t2t_env uses setup_env(...) (from rl_utils; it unpacks hparams and parses the game name), so I'll move setup_env() to T2TEnv (as a static method) as well.
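As a sketch of the resulting call site; the module path, method placement, and argument names are assumptions based on this thread, not the actual API.

from tensor2tensor.data_generators.gym_env import T2TGymEnv  # assumed module path

def load_t2t_env_sketch(hparams, data_dir, which_epoch_data=None):
  # Hypothetical: setup_env as a static method on the env class, then load
  # the recorded data for the requested (or latest) epoch.
  env = T2TGymEnv.setup_env(hparams, batch_size=1, max_num_noops=0)
  env.start_new_epoch(which_epoch_data or last_epoch(data_dir), data_dir)
  return env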

return t2t_env


def join_and_check(output_dir, subdirectory):
Contributor

Let's not have special wrapper functions for the os module; let's just not check whether the path exists unless strictly necessary.

Contributor Author

Checks are useful for failing faster when incorrect data is given, but I've removed it in 3a3a103.
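With the helper gone, the call site reduces to a plain join, roughly as below; the subdirectory name here is only an example.

import os

# join_and_check(output_dir, "policy") becomes simply:
policy_dir = os.path.join(output_dir, "policy")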

assert os.path.exists(path), "{} does not exist".format(path)
return path

class SimulatedEnv(Env):
Contributor

Could one of our existing SimulatedEnv classes already expose that?

Contributor Author

Good point - we want the gym.Env interface (not the "batch" one), but the FlatBatchEnv wrapper is enough here. I've cleaned it up in e6c7c08.
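For intuition, the FlatBatchEnv idea is a thin wrapper that exposes a batch env of size 1 as a plain gym.Env; a rough sketch follows, where the batch env's attribute and method names are assumptions.

import gym

class FlatBatchEnvSketch(gym.Env):
  """Exposes a batch env with batch_size == 1 as a single gym.Env."""

  def __init__(self, batch_env):
    assert batch_env.batch_size == 1
    self._batch_env = batch_env
    self.observation_space = batch_env.observation_space
    self.action_space = batch_env.action_space

  def reset(self):
    return self._batch_env.reset()[0]

  def step(self, action):
    obs, rewards, dones = self._batch_env.step([action])
    return obs[0], rewards[0], dones[0], {}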

random_starts=random_starts)


class PPOPolicyInferencer(object):
Contributor

Why a new class? Could our Policy class not provide the appropriate methods already?

Contributor Author

There are two different interfaces:

  • PolicyLearner is heavy - it requires a simulated env (not in __init__, though) and performs high-level training and evaluation.
  • PPOPolicyInferencer only loads a policy from disk and provides a low-level, non-TF interface for policy inference.

PolicyLearner also infers the policy, but does it inside the graph; the whole idea of PPOPolicyInferencer is to provide a Python API for other purposes.

For these reasons all the methods are different. We could merge these classes, but this would require removing the base_event_dir and total_num_epochs arguments from PolicyLearner.__init__ (or making them optional) and initializing the inference graph in PPOLearner.__init__ (which would be unused in the rlmb pipeline). I'm not sure that would make things clearer. We could also just move PPOPolicyInferencer to ppo_learner.py. Do you think that would suffice?
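For intuition only, the "restore a policy from disk and expose a plain Python inference call" pattern described above boils down to something like the sketch below; it is not the actual PPOPolicyInferencer API.

import numpy as np
import tensorflow as tf

class PolicyInferencerSketch(object):
  """Loads a policy checkpoint and runs inference outside any training loop."""

  def __init__(self, build_policy_logits, observation_shape, policy_dir):
    self._graph = tf.Graph()
    with self._graph.as_default():
      self._obs = tf.placeholder(tf.float32, [None] + list(observation_shape))
      self._logits = build_policy_logits(self._obs)
      self._sess = tf.Session(graph=self._graph)
      tf.train.Saver().restore(
          self._sess, tf.train.latest_checkpoint(policy_dir))

  def infer(self, observations):
    logits = self._sess.run(self._logits, {self._obs: observations})
    return np.argmax(logits, axis=-1)  # greedy actions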


"""

Run this script with the same parameters as trainer_model_based.py /
Contributor

I think recording should be an option passed to the player -- let's not make another binary just for that. Let's merge it with player.py and add a flag there to set up recording.
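In flag terms the suggestion is roughly the following in player.py; the flag names and help strings are illustrative, not the actual flags.

# Using the same `flags` module as the FLAGS = flags.FLAGS line shown above.
flags.DEFINE_boolean("record", False,
                     "Whether to record the played episodes to video.")
flags.DEFINE_string("record_dir", "",
                    "Directory to write recordings to.")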

@konradczechowski
Contributor Author

@lukaszkaiser Thanks for all the comments! I'll walk through them.

@konradczechowski
Contributor Author

@lukaszkaiser The discussions about evaluation (and whether we need record_ppo.py) and about merging PPOLearner with PPOPolicyInferencer are still open. I've applied the rest of the changes you suggested; what do you think of these?

We plan to work on evaluation next week, so it would be convenient for us to merge this PR first if possible. If adding the record_ppo.py file is a problem, you can remove it from this PR.

@konradczechowski
Contributor Author

I've corrected the relative imports (which broke the Python 2 build) and rebased on current master.
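For reference, the import fix amounts to replacing the relative imports shown earlier with absolute ones, roughly as below; the exact target module paths are assumptions.

# Relative imports that broke the Python 2 build:
# import rl_utils
# from envs.simulated_batch_gym_env import FlatBatchEnv

# Absolute imports instead:
from tensor2tensor.rl import rl_utils
from tensor2tensor.rl.envs.simulated_batch_gym_env import FlatBatchEnv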

Contributor

@lukaszkaiser lukaszkaiser left a comment


Thanks, let's get it in.

@lukaszkaiser lukaszkaiser merged commit 186ffc3 into tensorflow:master Jan 8, 2019
tensorflow-copybara pushed a commit that referenced this pull request Jan 8, 2019
PiperOrigin-RevId: 228267563
kpe pushed a commit to kpe/tensor2tensor that referenced this pull request Mar 2, 2019
* SimulatedEnv with gym-like interface.

* Initial Player

* Player: Add reward header, keyboard reset, CLI, move to player.py

* Player: add WAIT mode and a few CLI options.

* Introduce Policy Inferencer

* Recording videos for ppo and player, some refactoring.

* Player refactor. Add real env recording with PPO agent.

* Extend CLI documentation. Remove some imports.

* Pylint

* Extend documentation.

* Correct dopamine import.

* Move gym.utils.play to global imports.

* Remove SimulatedEnv (unnecessary wrapper for FlatBatchEnv<SimulatedBatchGymEnv>)

* Replace join_and_check with os.path.join.

* Move generation of initial_frame_chooser function to rl_utils.

* Move make_simulated_env_fn from trainer_model_based.py to rl.py

* Remove trainer_model_based imports, clean up player and record_ppo FLAGS.

* Move setup_env and load_t2t_gym_env to T2TGymEnv.

* Correct relative imports.

* Custom policy world_model and data paths for player.

* Enable BatchGymEnv to load directly from checkpoint file.

* Small fix for record_ppo.

* Remove unused record_ppo.py.
kpe pushed a commit to kpe/tensor2tensor that referenced this pull request Mar 2, 2019
PiperOrigin-RevId: 228267563