Commit

release 0.1
zuoxingdong committed Sep 20, 2018
1 parent 4c511a5 commit e843d25
Showing 30 changed files with 577 additions and 35 deletions.
23 changes: 11 additions & 12 deletions README.md
@@ -22,22 +22,23 @@

# Basics

`lagom` balances between the flexibility and the userability when developing reinforcement learning (RL) algorithms. The library is built on top of [PyTorch](https://pytorch.org/) and provides modular tools to quickly prototype RL algorithms. However, we do not go overboard, because going too low level is rather time consuming and prone to potential bugs, while going too high level degrades the flexibility which makes it difficult to try out some crazy ideas.
`lagom` balances flexibility and usability when developing reinforcement learning (RL) algorithms. The library is built on top of [PyTorch](https://pytorch.org/) and provides modular tools to quickly prototype RL algorithms. However, it does not go overboard: going too low-level is time-consuming and prone to bugs, while going too high-level reduces the flexibility needed to quickly try out crazy ideas.

We are continuously making `lagom` more 'self-contained' to run experiments quickly. Now, it internally supports base classes for multiprocessing ([master-worker framework](https://en.wikipedia.org/wiki/Master/slave_(technology))) to parallelize (e.g. experiments and evolution strategies). It also supports hyperparameter search by defining configurations either as grid search or random search.
We are continuously making `lagom` more 'self-contained' so that experiments can be set up and run quickly. It provides base classes for multiprocessing (a [master-worker framework](https://en.wikipedia.org/wiki/Master/slave_(technology))) to parallelize work such as running experiments and evolution strategies. It also supports hyperparameter search, with configurations defined as either grid search or random search.
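
As a rough illustration of how such configurations might look, here is a hedged sketch modeled on the `experiment.py` files in this commit. Only `Configurator` and `configurator.fixed(...)` appear in the diff; the `Configurator('grid')` constructor argument, `configurator.grid(...)`, and `configurator.make_configs()` below are assumptions for illustration, not the confirmed 0.1 API:

```python
# Hedged sketch of defining a hyperparameter search (not the confirmed API).
# In practice this would live in experiment.py, e.g. as a make_configs() method.
from lagom.experiment import Configurator

def make_configs():
    configurator = Configurator('grid')              # assumption: select grid-search mode
    configurator.fixed('cuda', True)                 # fixed value, as in the diff below
    configurator.fixed('env.id', 'HalfCheetah-v2')
    configurator.grid('algo.lr', [1e-3, 3e-4])       # hypothetical: sweep the learning rate
    configurator.grid('algo.gamma', [0.99, 0.995])   # hypothetical: sweep the discount factor
    list_config = configurator.make_configs()        # assumption: expands to a list of configs
    return list_config
```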

One of the main pipelines to use `lagom` can be done as following:
1. Define environment and RL agent
2. User runner to collect data for agent
3. Define algorithm to train agent
4. Define experiment and configurations.
A common pipeline for using `lagom` is as follows (a schematic sketch is given after the list):
1. Define [environment](lagom/envs) and [agent](lagom/agents) (mainly for RL)
2. Use [runner](lagom/runner) to collect data (trajectories or segments) for agent
3. Define [engine](lagom/engine) for training and evaluating the agent
4. Define [algorithm](lagom/base_algo.py)
5. Define [experiment](lagom/experiment) and [configurations](lagom/experiment/configurator.py)
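
To make the steps above concrete, here is a schematic, self-contained sketch written with plain Python stubs. It is not the actual `lagom` API; every class and function in it is defined locally for illustration only:

```python
# Schematic sketch of the pipeline above using local stubs (NOT the lagom API).
import gym


class RandomAgent:
    """Step 1: a stand-in 'agent' that samples random actions."""
    def __init__(self, env):
        self.env = env

    def choose_action(self, obs):
        return self.env.action_space.sample()

    def learn(self, data):
        pass  # Steps 3-4: a real engine/algorithm would update parameters here


def collect_segment(agent, env, T):
    """Step 2: a stand-in 'runner' that collects one segment of length T."""
    obs = env.reset()
    segment = []
    for _ in range(T):
        action = agent.choose_action(obs)
        obs, reward, done, info = env.step(action)
        segment.append((obs, action, reward, done))
        if done:
            obs = env.reset()
    return segment


env = gym.make('CartPole-v0')       # Step 1: environment
agent = RandomAgent(env)            # Step 1: agent
for iteration in range(10):         # Steps 4-5: a minimal training/experiment loop
    data = [collect_segment(agent, env, T=5) for _ in range(16)]
    agent.learn(data)
```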

A graphical illustration is coming soon.

# Installation

## Install dependencies
Run the following command to install [all the dependencies](./requirements.txt):
Run the following command to install [all required dependencies](./requirements.txt):

```bash
pip install -r requirements.txt
@@ -53,7 +54,7 @@ We also provide some bash scripts in [scripts/](scripts/) directory to automatic

## Install lagom

Run the following command to install from source:
Run the following commands to install lagom from source:

```bash
git clone https://github.com/zuoxingdong/lagom.git
@@ -73,7 +74,7 @@ The documentation hosted by ReadTheDocs is available online at [http://lagom.rea

# Examples

We shall continuously provide [examples/](examples/) to use lagom.
We are continuously adding [examples/](examples/) that show how to use lagom.

# Test

@@ -86,7 +87,6 @@ pytest test -v
# Roadmap

## Core
- Readthedocs Documentation
- Tutorials
## More standard RL baselines
- TRPO/PPO
@@ -99,7 +99,6 @@ pytest test -v
## More standard networks
- Monte Carlo Dropout/Concrete Dropout
## Misc
- VecEnv: similar to that of OpenAI baseline
- Support pip install
- Technical report

2 changes: 1 addition & 1 deletion examples/es/rl/README.md
@@ -14,4 +14,4 @@ One could modify [experiment.py](./experiment.py) to quickly set up different co

# Results

<img src='data/result.png' width='100%'>
<img src='data/result.png' width='75%'>
8 changes: 4 additions & 4 deletions examples/policy_gradient/README.md
@@ -1,5 +1,5 @@
We benchmark three baselines for policy gradient method in several different perspectives
1. REINFORCE
2. Actor-Critic/Vanilla Policy Gradient
3. Advantage Actor-Critic (A2C)
This example includes implementations of the following policy gradient algorithms:

- [REINFORCE](reinforce)
- [Vanilla Policy Gradient (VPG)](vpg)
- [Advantage Actor-Critic (A2C)](a2c)
17 changes: 17 additions & 0 deletions examples/policy_gradient/a2c/README.md
@@ -0,0 +1,17 @@
# Advantage Actor-Critic (A2C)

This is an implementation of the [A2C](https://blog.openai.com/baselines-acktr-a2c/) algorithm.

# Usage

Run the following command to start parallelized training:

```bash
python main.py
```

One could modify [experiment.py](./experiment.py) to quickly set up different configurations.
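
For example, a different run might only change a few fixed values. The snippet below is a hedged illustration: the keys come from the `experiment.py` diff shown further down, while the new values are purely illustrative.

```python
# Inside make_configs() in experiment.py -- illustrative values only.
configurator.fixed('env.id', 'Hopper-v2')        # try a different MuJoCo task
configurator.fixed('algo.lr', 3e-4)              # use a smaller learning rate
configurator.fixed('agent.entropy_coef', 0.0)    # disable the entropy bonus
```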

# Results

<img src='data/result.png' width='75%'>
1 change: 1 addition & 0 deletions examples/policy_gradient/a2c/experiment.py
@@ -28,6 +28,7 @@ def make_configs(self):
configurator.fixed('algo.gamma', 0.99)

configurator.fixed('agent.standardize_Q', False) # whether to standardize discounted returns
configurator.fixed('agent.standardize_adv', True) # whether to standardize advantage estimates
configurator.fixed('agent.max_grad_norm', 0.5) # grad clipping, set None to turn off
configurator.fixed('agent.entropy_coef', 0.01)
configurator.fixed('agent.value_coef', 0.5)
1 change: 1 addition & 0 deletions examples/policy_gradient/a2c/logs/0/config.yml
@@ -9,6 +9,7 @@ algo.lr: 0.001
algo.use_lr_scheduler: true
algo.gamma: 0.99
agent.standardize_Q: false
agent.standardize_adv: true
agent.max_grad_norm: 0.5
agent.entropy_coef: 0.01
agent.value_coef: 0.5
Binary file modified examples/policy_gradient/a2c/logs/configs.pkl
Binary file not shown.
232 changes: 232 additions & 0 deletions examples/policy_gradient/a2c/main.ipynb
@@ -0,0 +1,232 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/zuo/Code/lagom/lagom/core/plotter/__init__.py:9: UserWarning: ImageViewer failed to import due to pyglet. \n",
" warnings.warn('ImageViewer failed to import due to pyglet. ')\n"
]
}
],
"source": [
"from pathlib import Path\n",
"from lagom.experiment import Configurator\n",
"\n",
"from lagom import pickle_load\n",
"\n",
"from lagom.core.plotter import CurvePlot"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>cuda</th>\n",
" <th>env.id</th>\n",
" <th>env.standardize</th>\n",
" <th>network.hidden_sizes</th>\n",
" <th>algo.lr</th>\n",
" <th>algo.use_lr_scheduler</th>\n",
" <th>algo.gamma</th>\n",
" <th>agent.standardize_Q</th>\n",
" <th>agent.standardize_adv</th>\n",
" <th>...</th>\n",
" <th>agent.constant_std</th>\n",
" <th>agent.std_state_dependent</th>\n",
" <th>agent.init_std</th>\n",
" <th>train.timestep</th>\n",
" <th>train.N</th>\n",
" <th>train.T</th>\n",
" <th>eval.N</th>\n",
" <th>log.record_interval</th>\n",
" <th>log.print_interval</th>\n",
" <th>log.dir</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" <td>HalfCheetah-v2</td>\n",
" <td>True</td>\n",
" <td>[64, 64]</td>\n",
" <td>0.001</td>\n",
" <td>True</td>\n",
" <td>0.99</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>False</td>\n",
" <td>0.5</td>\n",
" <td>1000000.0</td>\n",
" <td>16</td>\n",
" <td>5</td>\n",
" <td>10</td>\n",
" <td>100</td>\n",
" <td>1000</td>\n",
" <td>logs</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 25 columns</p>\n",
"</div>"
],
"text/plain": [
" ID cuda env.id env.standardize network.hidden_sizes algo.lr \\\n",
"0 0 True HalfCheetah-v2 True [64, 64] 0.001 \n",
"\n",
" algo.use_lr_scheduler algo.gamma agent.standardize_Q \\\n",
"0 True 0.99 False \n",
"\n",
" agent.standardize_adv ... agent.constant_std \\\n",
"0 True ... None \n",
"\n",
" agent.std_state_dependent agent.init_std train.timestep train.N train.T \\\n",
"0 False 0.5 1000000.0 16 5 \n",
"\n",
" eval.N log.record_interval log.print_interval log.dir \n",
"0 10 100 1000 logs \n",
"\n",
"[1 rows x 25 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_folder = Path('logs')\n",
"\n",
"list_config = pickle_load(log_folder/'configs.pkl')\n",
"configs = Configurator.to_dataframe(list_config)\n",
"configs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def load_results(log_folder, ID, f):\n",
" p = Path(log_folder)/str(ID)\n",
" \n",
" list_result = []\n",
" for sub in p.iterdir():\n",
" if sub.is_dir() and (sub/f).exists():\n",
" list_result.append(pickle_load(sub/f))\n",
" \n",
" return list_result\n",
"\n",
"\n",
"def get_returns(list_result):\n",
" returns = []\n",
" for result in list_result:\n",
" #x_values = [i['evaluation_iteration'][0] for i in result]\n",
" x_values = [i['accumulated_trained_timesteps'][0] for i in result]\n",
" y_values = [i['average_return'][0] for i in result]\n",
" returns.append([x_values, y_values])\n",
" \n",
" return returns\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"ID = 0\n",
"env_id = configs.loc[configs['ID'] == ID]['env.id'].values[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list_result = load_results('logs', ID, 'eval_logs.pkl')\n",
"returns = get_returns(list_result)\n",
"x_values, y_values = zip(*returns)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot = CurvePlot()\n",
"plot.add('A2C', y_values, xvalues=x_values)\n",
"ax = plot(title=f'A2C on {env_id}', \n",
" xlabel='Iteration', \n",
" ylabel='Mean Episode Reward', \n",
" num_tick=6, \n",
" xscale_magnitude=None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ax.figure.savefig('data/result.png')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
17 changes: 17 additions & 0 deletions examples/policy_gradient/reinforce/README.md
@@ -0,0 +1,17 @@
# REINFORCE

This is an implementation of the [REINFORCE](https://link.springer.com/article/10.1007/BF00992696) algorithm.

# Usage

Run the following command to start parallelized training:

```bash
python main.py
```

One could modify [experiment.py](./experiment.py) to quickly set up different configurations.

# Results

<img src='data/result.png' width='75%'>
2 changes: 1 addition & 1 deletion examples/policy_gradient/reinforce/experiment.py
@@ -18,7 +18,7 @@ def make_configs(self):

configurator.fixed('cuda', True) # whether to use GPU

configurator.fixed('env.id', 'Reacher-v2')
configurator.fixed('env.id', 'HalfCheetah-v2')
configurator.fixed('env.standardize', True) # whether to use VecStandardize

configurator.fixed('network.hidden_sizes', [64, 64])
Binary file not shown.
2 changes: 1 addition & 1 deletion examples/policy_gradient/reinforce/logs/0/config.yml
@@ -1,6 +1,6 @@
ID: 0
cuda: true
env.id: Reacher-v2
env.id: HalfCheetah-v2
env.standardize: true
network.hidden_sizes:
- 64
Binary file modified examples/policy_gradient/reinforce/logs/configs.pkl
Binary file not shown.
