# Minecraft Reinforcement Learning on Ray cluster with Azure Machine Learning

In this notebook, we run distributed reinforcement learning (RL) at scale with the Ray framework on Azure Machine Learning.<br>
This example is based on [minecraft-rl-on-ray-cluster](https://github.com/tsmatz/minecraft-rl-on-ray-cluster), in which an agent learns to solve a maze in Minecraft using Project Malmo.

With Azure Machine Learning, the compute cluster is automatically scaled down to 0 nodes when training has completed.<br>
This example also sends logs (total episodes and mean episode reward for each training iteration) to the Azure Machine Learning workspace.

> Note : For practical training, it is better to run on GPU. This example is a getting-started sample and runs on CPU; change the configuration to run it on GPU.

> Note : You can now also use the Python package ```ray-on-aml``` to run a Ray cluster on Azure Machine Learning. (See [here](https://github.com/james-tn/ray-on-aml).)

To run this notebook:

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).
2. Install the Azure Machine Learning SDK (core package) as follows:

```
pip install azureml-core
```

## 1. Create script for RL training (train_ray_cluster.py)

Save a script file (```train_ray_cluster.py```) for Ray RLlib training.

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [2]:
%%writefile script/train_ray_cluster.py
import os
import ray
import ray.tune as tune
from azureml.core import Run

# Stop condition: end the trial once the mean episode reward reaches 85
def stop_check(trial_id, result):
    return result["episode_reward_mean"] >= 85

# Function for logging in Azure Machine Learning workspace
# (Callback on train result to record metrics returned by trainer)
def on_train_result(info):
    run = Run.get_context()
    run.log(
        name='episode_reward_mean',
        value=info["result"]["episode_reward_mean"])
    run.log(
        name='episodes_total',
        value=info["result"]["episodes_total"])

def train_agent(num_workers, num_gpus, num_cpus_per_worker):
    ray.init(address="auto")

    ray.tune.run(
        "IMPALA",
        config={
            "log_level": "WARN",
            "env": "custom_malmo_env:MalmoMazeEnv-v0",
            "num_workers": num_workers,
            "num_gpus": num_gpus,
            "num_cpus_per_worker": num_cpus_per_worker,
            "explore": True,
            "exploration_config": {
                "type": "EpsilonGreedy",
                "initial_epsilon": 1.0,
                "final_epsilon": 0.02,
                "epsilon_timesteps": 500000
            },
            "callbacks": {"on_train_result": on_train_result},
        },
        stop=stop_check,
        checkpoint_at_end=True,
        checkpoint_freq=2,
        local_dir='./outputs'
    )

Overwriting script/train_ray_cluster.py
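
The `exploration_config` in the script above anneals epsilon from 1.0 down to 0.02 over 500,000 timesteps. The following is a minimal sketch of such a schedule, assuming a plain linear interpolation between the two values that is then held constant. The helper name `epsilon_at` is hypothetical and is not part of RLlib; the exact schedule RLlib applies is not verified here.

```python
def epsilon_at(timestep, initial_epsilon=1.0, final_epsilon=0.02,
               epsilon_timesteps=500_000):
    """Hypothetical helper: linear annealing, then hold at final_epsilon.

    Assumption: epsilon is interpolated linearly over epsilon_timesteps.
    The default values are taken from exploration_config in the script.
    """
    # Fraction of the annealing period that has elapsed, clamped to [0, 1]
    fraction = min(max(timestep, 0) / epsilon_timesteps, 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


# Print epsilon at the start, the midpoint, the end, and past the end
for t in (0, 250_000, 500_000, 1_000_000):
    print(t, round(epsilon_at(t), 4))
```

Under this assumption the agent still takes a random action about half of the time after 250,000 timesteps, so the early iterations are dominated by exploration.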


## 2. Create script for entry (start_server.py)

Create an entry script that starts the Ray cluster (head and workers) and invokes RL training.<br>
Here we run 3 nodes with the following roles.

- Rank 0 : Ray Head
- Rank 1 : Ray Worker
- Rank 2 : Ray Worker

In [3]:
%%writefile script/start_server.py
import argparse
import os
from mpi4py import MPI
import socket

from train_ray_cluster import train_agent

parser = argparse.ArgumentParser()
parser.add_argument("--num_workers",
    type=int,
    required=False,
    default=1,
    help="number of ray workers")
parser.add_argument("--num_gpus",
    type=int,
    required=False,
    default=0,
    help="number of gpus")
parser.add_argument("--num_cpus_per_worker",
    type=int,
    required=False,
    default=1,
    help="number of cores per worker")
args = parser.parse_args()

mpi_comm = MPI.COMM_WORLD
mpi_rank = mpi_comm.Get_rank()
if mpi_rank == 0 :
    #
    # Head Node (Rank 0)
    #

    # Start Ray Head (Run the following command)
    # ray start --head --port=6379
    os.environ["LC_ALL"] = "C.UTF-8" # Needed for running Ray
    os.system("ray start --head --port=6379")
    del os.environ["LC_ALL"] # Removed for running Malmo

    # Send head address to workers
    ipaddr = socket.gethostbyname(socket.gethostname())
    header_info = {
        "address"  : ipaddr + ":6379"
    }
    header_info = mpi_comm.bcast(header_info, root=0)

    # Wait until both workers report that they have started
    req = mpi_comm.irecv(source=1, tag=1)
    data = req.wait()
    req = mpi_comm.irecv(source=2, tag=2)
    data = req.wait()

    # Run the training script (train_ray_cluster.py)
    try:
        train_agent(args.num_workers, args.num_gpus, args.num_cpus_per_worker)
        print("Training has completed!")
        os.system("ray stop")
    except:
        data = mpi_comm.bcast({"status":"error"}, root=0)
        os.system("ray stop")
        raise

    data = mpi_comm.bcast({"status":"done"}, root=0)

else :
    #
    # Worker Nodes (Rank 1, 2)
    #

    # Wait for the head node to start (receive its address info)
    header_info = mpi_comm.bcast(None, root=0)
    header_address = header_info["address"]

    # Start Ray Worker (Run the following command)
    # ray start --address='xx.xx.xx.xx:6379' --redis-password="5241590000000000"
    os.environ["LC_ALL"] = "C.UTF-8" # Needed for running Ray
    os.system("ray start --address=\"" + header_address + "\" --redis-password=\"5241590000000000\"")
    del os.environ["LC_ALL"] # Removed for running Malmo

    # Send ready message to head
    req = mpi_comm.isend('ready', dest=0, tag=mpi_rank)
    req.wait()

    # Wait for completing job (with status info)
    status_info = mpi_comm.bcast(None, root=0)
    os.system("ray stop")

Overwriting script/start_server.py
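
The role split in the entry script comes down to which `ray start` command each MPI rank runs. The following is a small sketch of that decision without MPI, using the same port and Redis password as `start_server.py`. The helper name `ray_start_command` and the IP address `10.0.0.4` are hypothetical and are used only for illustration.

```python
RAY_PORT = 6379
REDIS_PASSWORD = "5241590000000000"  # same value as in start_server.py


def ray_start_command(rank, head_ip=None):
    """Hypothetical helper: build the shell command for a given MPI rank."""
    if rank == 0:
        # Rank 0 becomes the Ray head
        return "ray start --head --port={}".format(RAY_PORT)
    if head_ip is None:
        # Workers cannot join until the head has broadcast its address
        raise ValueError("worker ranks need the head IP address")
    return 'ray start --address="{}:{}" --redis-password="{}"'.format(
        head_ip, RAY_PORT, REDIS_PASSWORD)


# 10.0.0.4 is a made-up address for illustration
print(ray_start_command(0))
print(ray_start_command(1, "10.0.0.4"))
print(ray_start_command(2, "10.0.0.4"))
```

Because the workers depend on the head address, the entry script broadcasts it from rank 0 before the workers run their command.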


## 3. Connect to Azure Machine Learning (Create AML config)

Connect to your Azure Machine Learning (AML) workspace.<br>
Fill in the workspace name, subscription ID, and resource group name below. (You can find these values on the AML resource blade in the Azure Portal.)

In [4]:
from azureml.core import Workspace
import azureml.core

ws = Workspace(
    workspace_name = "{AML WORKSPACE NAME}",
    subscription_id = "{SUBSCRIPTION ID}",
    resource_group = "{RESOURCE GROUP NAME}")

## 4. Create cluster (multiple nodes)

Create a remote cluster with 3 nodes - 1 head node and 2 worker nodes.

Here we use ```Standard_D3_v2``` VMs, but for practical training it is better to use GPU VMs. (The Dockerfile and pip packages must also be changed to run on GPU.)
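
As a rough sanity check on the cluster size, the sketch below compares the CPUs available on the cluster with the CPUs requested by the run configuration later in this notebook (3 workers with 3 CPUs each). It assumes that ```Standard_D3_v2``` provides 4 vCPUs per node and that the driver process reserves 1 CPU; both numbers are assumptions to adjust for your setup, and the helper name `cpu_budget` is hypothetical.

```python
def cpu_budget(node_count, vcpus_per_node, num_workers,
               num_cpus_per_worker, driver_cpus=1):
    """Hypothetical helper: compare available CPUs with requested CPUs."""
    available = node_count * vcpus_per_node
    # Rollout workers plus the driver process (driver_cpus is an assumption)
    requested = num_workers * num_cpus_per_worker + driver_cpus
    return available, requested, requested <= available


# 3 nodes, assumed 4 vCPUs each; 3 workers with 3 CPUs each
available, requested, fits = cpu_budget(3, 4, 3, 3)
print("available:", available, "requested:", requested, "fits:", fits)
```

If the requested total exceeds what the cluster offers, Ray cannot place all workers, so it is worth re-running this arithmetic whenever the VM size or worker settings change.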

In [5]:
from azureml.core import Workspace
import azureml.core
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
 
# Create AML compute (or Get existing one)
# (Total 3 nodes : 1 head, 2 workers)
try:
    compute_target = ComputeTarget(workspace=ws, name='cluster01')
    print('found existing:', compute_target.name)
except ComputeTargetException:
    print('creating new.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_D3_v2',
        min_nodes=0,
        max_nodes=3,
        location="eastus")
    compute_target = ComputeTarget.create(ws, 'cluster01', compute_config)
    compute_target.wait_for_completion(show_output=True)

creating new.
InProgress....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded......................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## 5. Generate config for run

Generate a script run configuration in AML.<br>
Here we build a custom container image with the following components installed and configured. (See [here](https://github.com/tsmatz/minecraft-rl-on-ray-cluster) for details.)

- Open MPI 3.1.2
- Azure ML Python SDK
- Ray 1.6.0 with TensorFlow 2.x backend
- Project Malmo with Minecraft (needs Java 8)
- Custom Gym env for running Maze agent (see [here](https://github.com/tsmatz/minecraft-rl-on-ray-cluster/tree/master/Malmo_Maze_Sample/custom_malmo_env))

In [6]:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment
from azureml.core import Run, ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration, MpiConfiguration

# Create environment
# (All components are already set up in this image.)
env = Environment('minecraft-rl')
env.python.user_managed_dependencies=True
env.python.interpreter_path = "/usr/bin/python"
env.docker.base_image = None
env.docker.base_dockerfile = """
FROM ubuntu:18.04

#
# Note : This image is not configured for running on GPU
#

WORKDIR /

# Prerequisites settings
RUN apt-get update && \
    apt-get install -y apt-utils git rsync wget bzip2 gcc g++ make

# Install Python
RUN apt-get install -y python3.6 && \
    apt-get install -y python3-pip && \
    pip3 install --upgrade pip
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.6 1

# Install Open MPI

#RUN wget -q https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.4.tar.gz && \
#    tar -xzf openmpi-1.10.4.tar.gz && \
#    cd openmpi-1.10.4 && \
#    ./configure --prefix=/usr/local/mpi && \
#    make -j"$(nproc)" install && \
#    cd .. && \
#    rm -rf /openmpi-1.10.4 && \
#    rm -rf openmpi-1.10.4.tar.gz
#ENV PATH=/usr/local/mpi/bin:$PATH \
#    LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH

ENV OPENMPI_VERSION 3.1.2
RUN mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz && \
    tar zxf openmpi-3.1.2.tar.gz && \
    cd openmpi-3.1.2 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi
RUN pip3 install mpi4py

# Install Java 8 (JDK)
RUN apt-get install -y openjdk-8-jdk
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Install Ray with TensorFlow 2.x
RUN pip3 install gym lxml numpy pillow && \
    pip3 install tensorflow==2.4.1 ray[default]==1.6.0 ray[rllib]==1.6.0 ray[tune]==1.6.0 attrs==19.1.0 pandas

# Install Desktop Components for Headless
RUN apt-get install -y xvfb && \
    echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections && \
    apt-get install -y lxde

# Install Azure ML core
RUN pip3 install azureml-core

# Install Malmo
RUN pip3 install --index-url https://test.pypi.org/simple/ malmo==0.36.0
ENV MALMO_PATH=/malmo_package
WORKDIR $MALMO_PATH
RUN python3 -c "import malmo.minecraftbootstrap; malmo.minecraftbootstrap.download();"
ENV MALMO_XSD_PATH=$MALMO_PATH/MalmoPlatform/Schemas

WORKDIR /

# Install custom Gym env
RUN git clone https://github.com/tsmatz/minecraft-rl-on-ray-cluster
RUN cd minecraft-rl-on-ray-cluster && \
    pip3 install Malmo_Maze_Sample/

EXPOSE 6379 8265
"""

# register environment to re-use later
env.register(workspace=ws)
## # speed up by using the existing environment
## env = Environment.get(ws, name='minecraft-rl')

# create script run config
src = ScriptRunConfig(
    source_directory='./script',
    script='start_server.py',
    arguments=[
        '--num_workers', 3,
        '--num_cpus_per_worker', 3], 
    compute_target=compute_target,
    environment=env,
    docker_runtime_config=DockerConfiguration(use_docker=True),
    distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=3))

## 6. Run !

Now let's run Minecraft RL training on Ray.

This training takes about 1 day to complete when run on GPU.<br>
You can view the metrics (mean reward and total episodes) in the Azure Machine Learning studio UI during training. (See "Experiments" in AML studio.)

> Note : The first run builds the Docker image, so it takes a long time before training starts. (Once the image is registered, later runs start faster.)

In [None]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name='minecraft_rl_test')
run = exp.submit(config=src)
# See the output when debugging
# run.wait_for_completion(show_output=True)

RunId: minecraft_rl_test_1634097472_c2a1da1f
Web View: https://ml.azure.com/runs/minecraft_rl_test_1634097472_c2a1da1f?wsid=/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/TEST20211011/workspaces/ws01&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/55_azureml-execution-tvmps_594e4f8070c2df7271bfb89a011c5981a0bb209292aae30e4ddec4eb22184b80_d.txt

2021-10-13T04:02:14Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/ws01/azureml/minecraft_rl_test_1634097472_c2a1da1f/mounts/workspaceblobstore
2021-10-13T04:02:15Z The vmsize standard_d3_v2 is not a GPU VM, skipping get GPU count by running nvidia-smi command.
2021-10-13T04:02:15Z Starting output-watcher...
2021-10-13T04:02:15Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
Login Succeeded
Using default tag: latest
latest: Pulling from azureml/azureml_8246df9e4b586fcfa0c160abcb816314
284055322776: Pulling fs layer
f83c636e4934: Pulling fs layer
7df8bb74

[2m[36m(pid=454, ip=10.0.0.9)[0m Finished waiting for instance
[2m[36m(pid=568)[0m 2021-10-13 04:07:11,897	INFO trainable.py:109 -- Trainable.setup took 213.393 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=452, ip=10.0.0.6)[0m Finished waiting for instance
[2m[36m(pid=597)[0m Finished waiting for instance
Result for IMPALA_custom_malmo_env:MalmoMazeEnv-v0_887c0_00000:
  agent_timesteps_total: 500
  custom_metrics: {}
  date: 2021-10-13_04-10-44
  done: false
  episode_len_mean: 5.102564102564102
  episode_media: {}
  episode_reward_max: 0.0
  episode_reward_mean: -104.4551282051282
  episode_reward_min: -128.0
  episodes_this_iter: 156
  episodes_total: 156
  experiment_id: e42db0f41ca44a67b6849af2b138aec9
  hostname: cd14b61fbfe54116be7b7b7e2f4f3ea4000000
  info:
    learner:
      default_policy:
        cur_lr: 0.0005000000237487257
        entropy: 0.13840879499912262
        entropy

Result for IMPALA_custom_malmo_env:MalmoMazeEnv-v0_887c0_00000:
  agent_timesteps_total: 1500
  custom_metrics: {}
  date: 2021-10-13_04-15-03
  done: false
  episode_len_mean: 5.098039215686274
  episode_media: {}
  episode_reward_max: -101.0
  episode_reward_mean: -105.09803921568627
  episode_reward_min: -126.0
  episodes_this_iter: 102
  episodes_total: 368
  experiment_id: e42db0f41ca44a67b6849af2b138aec9
  hostname: cd14b61fbfe54116be7b7b7e2f4f3ea4000000
  info:
    learner:
      default_policy:
        cur_lr: 0.0005000000237487257
        entropy: 0.0
        entropy_coeff: 0.009999999776482582
        grad_gnorm: 40.0
        model: {}
        policy_loss: -0.0
        var_gnorm: 9.450931549072266
        vf_explained_var: 0.04075777530670166
        vf_loss: 0.569044828414917
    learner_queue:
      size_count: 3
      size_mean: 0.0
      size_quantiles:
      - 0.0
      - 0.0
      - 0.0
      - 0.0
      - 0.0
      size_std: 0.0
    num_agent_steps_sampled: 1500
    nu
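
The two metrics logged by `on_train_result` can also be inspected from Python once they have been collected. The sketch below assumes a dictionary that maps each metric name to the list of logged values (a single value if it was logged only once), which is how the return value of `Run.get_metrics()` is understood to be shaped here. The helper name `summarize` is hypothetical, and the sample values are the two iterations shown in the log output above.

```python
REWARD_THRESHOLD = 85  # same threshold as stop_check in train_ray_cluster.py


def as_list(value):
    # A metric logged only once may come back as a scalar (assumption)
    return value if isinstance(value, list) else [value]


def summarize(metrics):
    """Hypothetical helper: summarize the logged training metrics."""
    rewards = as_list(metrics["episode_reward_mean"])
    episodes = as_list(metrics["episodes_total"])
    return {
        "iterations": len(rewards),
        "latest_reward_mean": rewards[-1],
        "episodes_total": episodes[-1],
        "solved": rewards[-1] >= REWARD_THRESHOLD,
    }


# Values copied from the two training results printed above
sample = {
    "episode_reward_mean": [-104.4551282051282, -105.09803921568627],
    "episodes_total": [156, 368],
}
print(summarize(sample))
```

With a live run, the same function could be given the dictionary returned by `run.get_metrics()` instead of `sample`.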

## 7. Remove cluster (Clean-up)

In [17]:
# Delete cluster (nodes) in AML workspace
mycompute = AmlCompute(workspace=ws, name='cluster01')
mycompute.delete()