# Reproduction of DeepMind’s StarCraft II Research Using Higher-Level Frameworks

Author: Shengyi (Costa) Huang, Costa.Huang@outlook.com

## Introduction

In 2017, Google DeepMind released [``pysc2``](https://github.com/deepmind/pysc2), the Python component of SC2LE (the StarCraft II Learning Environment). It provides an interface for RL (Reinforcement Learning) agents to interact with StarCraft II: the environment supplies observations and receives actions. In their paper, DeepMind also described several baseline training algorithms and used them to train agents on [mini-games](https://github.com/deepmind/pysc2/blob/master/docs/mini_games.md) such as ``BuildMarines``, ``DefeatRoaches`` and ``MoveToBeacon``<cite data-cite="Vinyals2017-ck"> (Vinyals, 2017)</cite>. However, the implementation of these algorithms was not released. In this research project, I surveyed several reproductions of DeepMind's results and found most of their implementations hard to understand and reproduce. In light of this, I reproduce the results using higher-level frameworks (e.g. Tensorforce and Gym) to improve maintainability and understandability.

## Basic Mechanics of PySC2

Two of the most helpful tutorials are [Building a Basic PySC2 Agent](https://chatbotslife.com/building-a-basic-pysc2-agent-b109cde1477c) and [Building a Smart PySC2 Agent](https://chatbotslife.com/building-a-smart-pysc2-agent-cdc269cb095d). In the simplest form, you create a class that inherits from ``base_agent.BaseAgent`` and override its ``step()`` method. In essence, ``step()`` receives the observation of the current state (including units killed, rewards, etc.) and must return an action.

In [12]:
from pysc2.agents import base_agent
from pysc2.lib import actions

class SimpleAgent(base_agent.BaseAgent):
    def step(self, obs):
        super(SimpleAgent, self).step(obs)
        
        return actions.FunctionCall(actions.FUNCTIONS.no_op.id, [])

Suppose you follow the tutorial and create a [``simple_agent.py``](https://github.com/skjb/pysc2-tutorial/blob/master/Building%20a%20Basic%20Agent/simple_agent.py). You can then run it with:
```cmd
python -m pysc2.bin.agent \
--map Simple64 \
--agent simple_agent.SimpleAgent \
--agent_race T
```

You will get something like this:

In [21]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/vd-LxvKmnX8" frameborder="0" allowfullscreen></iframe>

## Reinforcement Learning

Definition: reinforcement learning trains the agent to maximize accumulated rewards.
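
To make "accumulated rewards" concrete, a common formalization is the discounted return, where each future reward is weighted by a power of a discount factor. The sketch below is only an illustration: the function name ``discounted_returns`` and the reward sequence are assumptions made for this example, not part of ``pysc2`` or any library.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: the only reward arrives at the last step
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

With a discount below 1, the same reward is worth less the further away it is, which is why the first step of the episode only sees a quarter of the final reward.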

<br>
<center>
<img src="screenshots/Reinforcement Learning.PNG" height="600" width="600" />
<br>
<span><i>Powerpoint. The visualization of RL by David Silver</i></span>
</center>

<br>
<center>
<img src="screenshots/Reinforcement Learning example.PNG" height="600" width="600" />
<br>
<span><i>Powerpoint. An RL example by David Silver</i></span>
</center>

<br>
<center>
<img src="screenshots/Reinforcement optimal policy.PNG" height="600" width="600" />
<br>
<span><i>Powerpoint. The policy of an agent by David Silver</i></span>
</center>

<br>
<center>
<img src="screenshots/Reinforcement Value Function.PNG" height="600" width="600" />
<br>
<span><i>Powerpoint. The value function of an agent by David Silver</i></span>
</center>

### Actor critic algorithm

The actor is the "policy" of the agent, while the critic is its "value function". Actor-only algorithms usually suffer from slow learning, while critic-only algorithms have trouble dealing with continuous action spaces<cite data-cite="Grondman2012-ve"> (Grondman, 2012)</cite>. Furthermore, according to Grondman, actor-critic algorithms combine the best of both worlds and offer the following advantages:

* Faster training time
* Better handling of continuous actions
* (Usually) Nice convergence properties
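
To show how the two parts interact, here is a minimal sketch of a one-step, tabular actor-critic update with a softmax policy. It is a textbook-style illustration, not the method used by DeepMind or by any of the repositories discussed here; the function names, table shapes, and step sizes are all assumptions.

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def actor_critic_update(theta, values, s, a, r, s_next, done,
                        gamma=0.99, alpha_actor=0.1, alpha_critic=0.1):
    """One-step tabular actor-critic update, applied in place.

    theta  : (n_states, n_actions) action preferences (the actor)
    values : (n_states,) state-value estimates (the critic)
    """
    # Critic: the TD error says how much better or worse the step was than expected
    target = r + (0.0 if done else gamma * values[s_next])
    td_error = target - values[s]
    values[s] += alpha_critic * td_error

    # Actor: for a softmax policy, grad log pi(a|s) = one_hot(a) - pi(.|s)
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log_pi
    return td_error


theta = np.zeros((2, 2))
values = np.zeros(2)
delta = actor_critic_update(theta, values, s=0, a=1, r=1.0, s_next=1, done=True)
print(delta, values[0], softmax(theta[0]))
```

A positive TD error raises both the value estimate of the visited state and the probability of the action that was taken; a negative one lowers both.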

### A3C

A3C (Asynchronous Advantage Actor-Critic) specifies a procedure that asynchronously trains multiple agents in parallel, on multiple instances of the environment. This procedure surpassed the previous state of the art on the Atari domain while "training for half the time on a single multi-core CPU instead of a GPU" <cite data-cite="Mnih2016-uo"> (Mnih, 2016)</cite>. Intuitively, one can imagine multiple agents being trained in parallel universes, each asynchronously feeding its experience back into a shared model, which results in more generalized training.
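
The asynchronous update pattern can be sketched with plain threads. This is not A3C itself: there is no environment, policy, or advantage estimate here, only several workers that each pull the shared parameters, compute a noisy gradient of a toy quadratic objective, and push their update without waiting for the others. The function ``run_async_workers``, the objective, and all constants are assumptions made for illustration. A lock is used to keep the sketch simple, whereas the original algorithm applies its updates lock-free.

```python
import threading

import numpy as np


def run_async_workers(n_workers=4, steps=200, lr=0.05):
    target = np.array([1.0, -2.0])  # hypothetical optimum of the toy objective
    shared = np.zeros(2)            # global parameters, updated by every worker
    lock = threading.Lock()

    def worker(seed):
        # Each worker has its own noise source, standing in for its own environment
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            with lock:
                local = shared.copy()  # sync a local copy of the global parameters
            # Noisy gradient of ||local - target||^2, computed outside the lock
            grad = 2.0 * (local - target) + 0.01 * rng.standard_normal(2)
            with lock:
                shared[:] = shared - lr * grad  # push the (possibly stale) update

    threads = [threading.Thread(target=worker, args=(seed,)) for seed in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared, target


params, target = run_async_workers()
print(params, target)
```

Because a worker may compute its gradient from parameters that other workers have since changed, individual updates can be slightly stale; with a small step size the shared parameters still settle near the optimum.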

## Existing Reproduction

Popular (most-starred) reproduction repositories include, but are not limited to:
* [pysc2-examples](https://github.com/chris-chris/pysc2-examples)
* [pysc2-agents](https://github.com/xhujoy/pysc2-agents)
* [sc2aibot](https://github.com/pekaalto/sc2aibot)
* [pysc2-RLagents](https://github.com/greentfrapp/pysc2-RLagents)

Unfortunately, most of these implementations are very hard to understand. Common issues include:

* Unconventional project setup
    * For example, ``pysc2-examples``, one of the most-starred reproductions, does not follow the typical Python project structure defined [here](https://packaging.python.org/tutorials/distributing-packages/#initial-files). As a result, it seems that one can only run the project from PyCharm.
* Poor documentation
    * For example, ``sc2aibot`` only documents the [agent parameters](https://github.com/pekaalto/sc2aibot/blob/master/actorcritic/agent.py); all other functions are left undocumented. Because of this, it is hard even to figure out how the observation is passed to the model.
* Tight coupling / large functions
    * For example, ``pysc2-RLagents`` uses [one giant file](https://github.com/greentfrapp/pysc2-RLagents/blob/master/Agents/PySC2_A3C_AtariNet.py) to hold "everything" (the training, running, and reacting components of the algorithm). Deeply nested loops and conditions are ubiquitous and, as a result, the code is almost unreadable.
    
Nonetheless, I would like to point out that ``sc2aibot`` is probably the best of these repos. Unlike the others, ``sc2aibot`` reproduces most of the mini-games and shows decent results compared to DeepMind's:


<center>
<br>
<table align="center">
  <tr>
        <td align="center">Map</td>
        <td align="center">Avg score</td>
        <td align="center">Deepmind avg</td>
    </tr>
    <tr>
        <td align="center">MoveToBeacon</td>
        <td align="center">25</td>
        <td align="center">26</td>
    </tr>
    <tr>
        <td align="center">CollectMineralShards</td>
        <td align="center">91</td>
        <td align="center">103</td>
    </tr>
    <tr>
      <td align="center">DefeatZerglingsAndBanelings</td>
      <td align="center">48</td>
      <td align="center">62</td>
    </tr>
    <tr>
      <td align="center">FindAndDefeatZerglings</td>
      <td align="center">42</td>
      <td align="center">45</td>
    </tr>
    <tr>
      <td align="center">DefeatRoaches</td>
      <td align="center">70-90</td>
      <td align="center">100</td>
    </tr>
</table>
<br>
<span><i>Table. The comparison between the results from DeepMind and sc2aibot</i></span>
</center>

I ran the ``sc2aibot`` implementation on the ``CollectMineralShards`` mini-game for 12 hours, yet it completed only 5,000 episodes and the average reward was still around 20. Since the author had already trained the same algorithm for 56,000 episodes, I did not continue the run.

<center>
<img src="screenshots/sc2aibot_CollectMineralShards_training_time.png" width="600" height="600" />
<br>
<span><i>Picture. The training episodes (x-axis) and scores (y-axis)</i></span>
</center>

Note that, according to the author's screenshot, training took only about 8 hours. This is probably due to a more powerful computer and parallel training.
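
The gap between the two setups can be estimated from the figures quoted above. This is back-of-the-envelope arithmetic only; it assumes a constant episode rate, which real training runs do not guarantee.

```python
my_hours = 12.0            # duration of my run
my_episodes = 5_000        # episodes completed in that time
author_hours = 8.0         # approximate duration read from the author's screenshot
author_episodes = 56_000   # episodes in the author's run

my_rate = my_episodes / my_hours              # episodes per hour on my machine
author_rate = author_episodes / author_hours  # episodes per hour in the author's run
hours_needed = author_episodes / my_rate      # time I would need for 56,000 episodes

print(round(my_rate, 1))            # 416.7
print(round(hours_needed, 1))       # 134.4
print(round(hours_needed / 24, 1))  # 5.6
print(round(author_rate / my_rate, 1))  # 16.8
```

Under these assumptions, matching the author's episode count would have taken about five and a half days on my setup, roughly 17 times slower than the run in the screenshot.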

## Reproduction through OpenAI Gym and Tensorforce

From a software engineering perspective, we want to write modular code for maintainability and understandability. Because of this, we want to separate the task of training agents into two subsystems: one that deals with the gaming environment and one that deals with training the agent.

### The subsystem that deals with the gaming environment

Fortunately, well-structured and well-documented libraries for this already exist. For example, OpenAI Gym is a Python RL toolkit that gives you access to a standardized set of environments <cite data-cite="Brockman2016-dq"> (Brockman, 2016)</cite>. Each environment is expected to expose a standardized set of methods for the user to call. A typical environment looks like this:

In [18]:
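# https://github.com/openai/gym/tree/master/gym/envs
import gym
import numpy as np
from gym import spaces

# Newer gym releases name the methods below step, reset and render (no underscore)
class FooEnv(gym.Env):
    metadata = {'render.modes': ['human']}
    
    def __init__(self):
        # Two discrete actions: 0 and 1
        self.action_space = spaces.Discrete(2)
        # Placeholder bounds for a 4-dimensional observation vector
        low = np.zeros(4, dtype=np.float32)
        high = np.ones(4, dtype=np.float32)
        self.observation_space = spaces.Box(low=low, high=high)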

    def _step(self, action):
        pass
    
    def _reset(self):
        pass
    
    def _render(self, mode='human', close=False):
        pass

Because of this setup, one can find out explicitly what valid actions and observations look like by inspecting ``action_space`` and ``observation_space``.

In [20]:
action_space = spaces.Discrete(2)
# The action_space only has two discrete actions: 0 and 1
print(action_space)
action_space.contains(15)

Discrete(2)


False
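
The same reset/step contract can be sketched without depending on ``gym`` at all. ``LineWorldEnv`` below is a hypothetical toy environment written for this example; it only mimics the classic Gym interface (``reset()`` returns an observation, ``step()`` returns observation, reward, done flag and info dict) and is not part of ``gym`` or ``sc2gym``.

```python
import random


class LineWorldEnv:
    """Hypothetical toy environment that mimics the classic Gym interface."""

    def __init__(self, length=5):
        self.length = length
        self.n_actions = 2  # 0 = move left, 1 = move right
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Reject anything outside the declared action set
        if action not in range(self.n_actions):
            raise ValueError("invalid action: {}".format(action))
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1  # episode ends at the right edge
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def rollout(env, policy, max_steps=100):
    """Run one episode and return the total reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


env = LineWorldEnv()
print(rollout(env, policy=lambda obs: 1))                    # always move right: 1.0
print(rollout(env, policy=lambda obs: random.randrange(2)))  # random policy
```

Because ``rollout`` only relies on ``reset`` and ``step``, any environment that honours the same contract can be swapped in without touching the loop.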

By using OpenAI Gym, we enhance the robustness and predictability of the environment. I was planning on creating a Gym wrapper/binding for ``pysc2``, but luckily someone has already made such a library: [``sc2gym``](https://github.com/islamelnabarawy/sc2gym). It enables us to drive the ``pysc2`` mini-games through the standard Gym interface.





### The subsystem that deals with the training process

One shortfall of the existing reproductions is that they hard-code the training process, which includes predicting an action, analyzing rewards, updating the policy, etc. Instead of reinventing the wheel, we should use readily available reinforcement learning libraries such as Tensorforce and Keras-RL. Their training process is much more robust because the underlying codebase is usually well tested.
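
The separation argued for here can be sketched in a few lines: the training loop talks to the agent only through ``act`` and ``observe``, and to the environment only through ``reset`` and ``step``. Everything below is hypothetical and written for this illustration (``BanditEnv``, ``EpsilonGreedyAgent`` and ``run`` are not library classes); the method names merely follow the act/observe pattern that such libraries tend to expose.

```python
import random


class BanditEnv:
    """Hypothetical two-armed bandit: arm 1 pays off more often than arm 0."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # single dummy state

    def step(self, action):
        p = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < p else 0.0
        return 0, reward, True, {}  # every episode lasts one step


class EpsilonGreedyAgent:
    """Hypothetical agent exposing act/observe, independent of any environment."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.estimates = [0.0] * n_actions  # running mean reward per action
        self.counts = [0] * n_actions
        self.last_action = None

    def act(self, state):
        if self.rng.random() < self.epsilon:
            # Explore: pick a random action
            self.last_action = self.rng.randrange(len(self.estimates))
        else:
            # Exploit: pick the action with the best estimate so far
            self.last_action = max(range(len(self.estimates)),
                                   key=self.estimates.__getitem__)
        return self.last_action

    def observe(self, reward, terminal):
        a = self.last_action
        self.counts[a] += 1
        self.estimates[a] += (reward - self.estimates[a]) / self.counts[a]


def run(agent, env, episodes):
    """Generic loop that knows nothing about agent or environment internals."""
    episode_rewards = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done, _ = env.step(agent.act(state))
            agent.observe(reward=reward, terminal=done)
            total += reward
        episode_rewards.append(total)
    return episode_rewards


agent = EpsilonGreedyAgent(n_actions=2)
rewards = run(agent, BanditEnv(), episodes=2000)
print(sum(rewards[-500:]) / 500)
```

Swapping the learning algorithm means replacing only the agent class, and swapping the game means replacing only the environment class; the loop itself stays untouched.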

After much comparison, I chose Tensorforce because of its flexibility.
> TensorForce is an open source reinforcement learning library focused on providing clear APIs, readability and modularisation to deploy reinforcement learning solutions both in research and practice. <cite data-cite="Schaarschmidt2017-ur"> (Schaarschmidt, 2017)</cite>

A typical example involving OpenAI Gym looks like the following:

In [22]:
# https://github.com/reinforceio/tensorforce/blob/master/examples/quickstart.py
import numpy as np

from tensorforce.agents import PPOAgent
from tensorforce.execution import Runner
from tensorforce.contrib.openai_gym import OpenAIGym

# Create an OpenAIgym environment
env = OpenAIGym('CartPole-v0', visualize=True)

# Network as list of layers
network_spec = [
    dict(type='dense', size=32, activation='tanh'),
    dict(type='dense', size=32, activation='tanh')
]

agent = PPOAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    batch_size=4096,
    # Agent
    preprocessing=None,
    exploration=None,
    reward_preprocessing=None,
    # BatchAgent
    keep_last_timestep=True,
    # PPOAgent
    step_optimizer=dict(
        type='adam',
        learning_rate=1e-3
    ),
    optimization_steps=10,
    # Model
    scope='ppo',
    discount=0.99,
    # DistributionModel
    distributions_spec=None,
    entropy_regularization=0.01,
    # PGModel
    baseline_mode=None,
    baseline=None,
    baseline_optimizer=None,
    gae_lambda=None,
    normalize_rewards=False,
    # PGLRModel
    likelihood_ratio_clipping=0.2,
    summary_spec=None,
    distributed_spec=None
)

# Create the runner
runner = Runner(agent=agent, environment=env)


# Callback function printing episode statistics
def episode_finished(r):
    print("Finished episode {ep} after {ts} timesteps (reward: {reward})".format(ep=r.episode, ts=r.episode_timestep,
                                                                                 reward=r.episode_rewards[-1]))
    return True


# Start learning
runner.run(episodes=10, max_episode_timesteps=200, episode_finished=episode_finished)

# Print statistics
print("Learning finished. Total episodes: {ep}. Average reward of last 100 episodes: {ar}.".format(
    ep=runner.episode,
    ar=np.mean(runner.episode_rewards[-100:]))
)

InternalError: Blas GEMM launch failed : a.shape=(1, 4), b.shape=(4, 32), m=1, n=32, k=4


By the way, the reason you can use ``env = OpenAIGym('CartPole-v0', visualize=True)`` to visualize the training process of OpenAI Gym with Tensorforce is that I contributed this option in a [pull request](https://github.com/reinforceio/tensorforce/pull/242). You are welcome :)

## First Attempt

With the tools ready, I gave it a try with the following code. (Don't run it in a Jupyter notebook, because it prints continuously.)

In [None]:
import numpy as np

from tensorforce.agents import PPOAgent
from tensorforce.execution import Runner
from tensorforce.contrib.openai_gym import OpenAIGym

import sc2gym
from absl import flags

# pysc2 reads its settings from absl flags, so they must be parsed before the environment is created
FLAGS = flags.FLAGS
FLAGS([__file__])


# Create an OpenAI Gym environment provided by sc2gym
env = OpenAIGym('SC2CollectMineralShards-v2', visualize=False)

# Network as list of layers
network_spec = [
    dict(type='conv2d', size=32),
    dict(type='flatten'),
    dict(type='dense', size=32, activation='relu'),
    dict(type='lstm', size=32)
]

saver_spec = {
    'load': True,
    'file': 'model.ckpt-7914479',
    'directory': './model',
    'seconds': 3600
}

agent = PPOAgent(
    states_spec=env.states,
    actions_spec=env.actions,
    network_spec=network_spec,
    batch_size=10,
    # Agent
    preprocessing=None,
    exploration=None,
    reward_preprocessing=None,
    saver_spec=saver_spec,
    # BatchAgent
    keep_last_timestep=True,
    # PPOAgent
    step_optimizer=dict(
        type='adam',
        learning_rate=1e-4,
        epsilon=5e-7
    ),
    optimization_steps=10,
    # Model
    scope='ppo',
    discount=0.99,
    # DistributionModel
    distributions_spec=None,
    entropy_regularization=0.01,
    # PGModel
    baseline_mode=None,
    baseline=None,
    baseline_optimizer=None,
    gae_lambda=None,
    normalize_rewards=False,
    # PGLRModel
    likelihood_ratio_clipping=0.2,
    summary_spec=None,
    distributed_spec=None
)
    
print('Agent created')

# Create the runner
runner = Runner(agent=agent, environment=env)


# Callback function printing episode statistics
rewards = []
def episode_finished(r):
    print("Finished episode {ep} after {ts} timesteps (reward: {reward})".format(ep=r.episode, ts=r.episode_timestep,
                                                                                 reward=r.episode_rewards[-1]))
    # Keep a copy of each episode's reward for later analysis
    rewards.append(r.episode_rewards[-1])
    # Returning True tells the runner to continue training
    return True


# Start learning
runner.run(episodes=60000, episode_finished=episode_finished)

# Print statistics
print("Learning finished. Total episodes: {ep}. Average reward of last 100 episodes: {ar}.".format(
    ep=runner.episode,
    ar=np.mean(runner.episode_rewards[-100:]))
)


## Second Attempt

After running the first algorithm nonstop for roughly 5 days, the average score was still about 20, and I lost faith in the model's ability to ever escape the local maximum. As a result, I stopped the program and ran it with a different set of hyperparameters.

<center>
<img src="screenshots/Hyperparameter_Tuning_web.png" />
<br>
<span><i>Picture. The definition of Hyperparameter Tuning by Chris Albon</i></span>
</center>
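
One simple way to organize such a search is to enumerate a grid of candidate settings and train one agent per combination. The sketch below only generates the combinations; the search space is hypothetical and its values are illustrative, not necessarily the ones used in the runs described here.

```python
from itertools import product

# Hypothetical search space with illustrative values
search_space = {
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [10, 100],
    "entropy_regularization": [0.01, 0.001],
}


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(grid(search_space))
print(len(configs))  # 8 combinations (2 * 2 * 2)
for config in configs[:2]:
    print(config)
```

Each resulting dict could then be passed as keyword arguments when building an agent. Since a single run here takes days, a grid of this size is already expensive, which is one reason to keep the search space small.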

## Reproduction through 
