# Solving Cartpole Balancing with Amazon SageMaker and Ray

![](figures/cartpole.gif)

---
## Introduction

In this notebook we start from the cart-pole balancing problem, in which a pole is attached by an un-actuated joint to a cart that moves along a frictionless track. Instead of applying control theory, this example solves the problem with reinforcement learning (RL) using Amazon SageMaker and Ray RLlib. You can choose either TensorFlow or PyTorch as the underlying deep learning framework.

(For a similar example using the Coach library, see this [link](../rl_cartpole_coach/rl_cartpole_coach_gymEnv.ipynb). Another cart-pole example using the Coach library and offline data can be found [here](../rl_cartpole_batch_coach/rl_cartpole_batch_coach.ipynb).)

1. *Objective*: Prevent the pole from falling over
2. *Environment*: The environment used in this example is part of OpenAI Gym and corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson [1]
3. *State*: Cart position, cart velocity, pole angle, pole velocity at tip
4. *Action*: Push cart to the left, push cart to the right
5. *Reward*: Reward is 1 for every step taken, including the termination step
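
To make the state, action, and reward definitions above concrete, here is a minimal pure-Python sketch of cart-pole dynamics. It is not the Gym environment itself: the constants, the Euler update, the termination limits (cart position beyond 2.4, pole angle beyond 12 degrees), and the helper names `step`, `run_episode`, `always_left`, and `lean_follow` are assumptions chosen to resemble the classic setup.

```python
import math

# Physical constants: assumed values resembling the classic cart-pole setup.
GRAVITY, MASS_CART, MASS_POLE = 9.8, 1.0, 0.1
HALF_POLE_LENGTH, FORCE_MAG, TAU = 0.5, 10.0, 0.02
X_LIMIT = 2.4                        # cart position limit
THETA_LIMIT = 12 * math.pi / 180     # pole angle limit in radians


def step(state, action):
    """Advance the system by one Euler step. action: 0 = push left, 1 = push right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_POLE_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass
    next_state = (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )
    done = abs(next_state[0]) > X_LIMIT or abs(next_state[2]) > THETA_LIMIT
    return next_state, 1.0, done  # reward is 1 for every step, including the last one


def run_episode(policy, max_steps=500):
    """Run one episode from a slightly tilted pole and return (total reward, steps)."""
    state, total_reward, steps = (0.0, 0.0, 0.01, 0.0), 0.0, 0
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state))
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_left(state):
    return 0


def lean_follow(state):
    # Push toward the side the pole is leaning to.
    return 1 if state[2] > 0 else 0


print("always left:", run_episode(always_left))
print("lean follow:", run_episode(lean_follow))
```

Because every step yields a reward of 1, the episode return equals the number of steps the pole stays up, which is why a longer episode means a better policy.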

References

1. AG Barto, RS Sutton and CW Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems", IEEE Transactions on Systems, Man, and Cybernetics, 1983.

## Goal

In this notebook, we walk through the whole process of using SageMaker and Ray to train an RL agent.
1. Set up the prerequisite dependencies
2. Initialize the Ray cluster
3. Initialize the RL agent
4. Train the agent in a distributed fashion using the CPUs and GPUs provided by the SageMaker notebook
5. Save/restore/evaluate the agent
6. Tune the agent by trying different combinations of its hyperparameters
7. Find the best hyperparameters, then restore/evaluate the tuned agent

## Pre-requisites 
### Install dependencies
To get started, install the required libraries.

In [None]:
!pip install -U 'ray[rllib, tune, serve]'
!pip install gym[atari] autorom[accept-rom-license]
!pip install box2d-py
!pip install pygame
!pip install tqdm

### Imports
Import the Python libraries used throughout the notebook.

In [None]:
import torch 
import os 
import tqdm
import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print
import logging
from ray.air.config import RunConfig, ScalingConfig

### Check the CPUs and GPUs of the notebook instance in use (ml.p3.2xlarge)

In [None]:
cpu_num = os.cpu_count()
gpu_num = torch.cuda.device_count()
print(cpu_num, 'CPUs')
print(gpu_num, 'GPUs')

### Start a Ray cluster on the notebook instance (ml.p3.2xlarge)
Note that ray.shutdown() is called before ray.init(). This avoids the error raised when ray.init() is accidentally called more than once, for example when the cell is re-run.

In [None]:
ray.shutdown()
ray.init(ignore_reinit_error=True, log_to_driver=False, logging_level=logging.FATAL)

Ray automatically detects all available CPUs and GPUs.

In [None]:
num_cpus = int(ray.available_resources()['CPU'])
num_gpus = int(ray.available_resources().get('GPU', 0))  # the 'GPU' key is absent on CPU-only instances
print('num_cpus', num_cpus)
print('num_gpus', num_gpus)

### Set up configurations of Ray RL trainer

Choose the deep learning framework: 'tf' for TensorFlow or 'torch' for PyTorch.

In [None]:
config = DEFAULT_CONFIG.copy()
config['framework'] = 'torch'

This set of configurations controls the allocation of CPUs and GPUs. See https://docs.ray.io/en/latest/rllib/rllib-training.html#specifying-parameters for guidelines.

In [None]:
config['num_workers'] = num_cpus - 1  # number of rollout workers; one CPU is left for the driver
config['num_gpus'] = num_gpus  # number of GPUs assigned to the driver
config['num_cpus_per_worker'] = 1  # number of CPUs for each worker, at least 1
config['num_gpus_per_worker'] = 0  # number of GPUs for each worker; 0 because workers only collect data and do not learn
config['num_envs_per_worker'] = 5  # number of environments each worker interacts with
config['recreate_failed_workers'] = True  # automatically recreate workers that fail
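
The allocation above can be checked with simple arithmetic. The sketch below assumes that the total CPU request is the driver's CPUs plus the CPUs of all workers, and that the driver takes 1 CPU; the helper name `resource_plan` is hypothetical and is not part of RLlib.

```python
def resource_plan(num_workers, num_cpus_per_worker, num_gpus_per_worker,
                  num_gpus, num_envs_per_worker, num_cpus_for_driver=1):
    """Summarize what a given worker/driver configuration asks for (illustrative only)."""
    return {
        # Driver CPUs plus the CPUs reserved by every rollout worker.
        'total_cpus': num_cpus_for_driver + num_workers * num_cpus_per_worker,
        # Driver GPUs plus any GPUs reserved by the workers.
        'total_gpus': num_gpus + num_workers * num_gpus_per_worker,
        # Environments stepped in parallel across all workers.
        'parallel_envs': num_workers * num_envs_per_worker,
    }


# Example: an instance with 8 CPUs and 1 GPU, configured as in the cell above.
plan = resource_plan(num_workers=8 - 1, num_cpus_per_worker=1, num_gpus_per_worker=0,
                     num_gpus=1, num_envs_per_worker=5)
print(plan)
```

With these settings the request uses all 8 CPUs and the single GPU, and 35 environments are sampled in parallel.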

This set of configurations controls the training scheme of the RL algorithm (here, PPO).

In [None]:
config['train_batch_size'] = 5000
config['num_sgd_iter'] = 10
config['sgd_minibatch_size'] = 500
# config['model']['fcnet_hiddens'] = [64, 64]
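
To see how these three values interact, the sketch below counts the gradient updates in one training iteration. It assumes that each training iteration makes `num_sgd_iter` passes over the collected batch and that each pass is split into minibatches of `sgd_minibatch_size` samples, with the batch size dividing evenly; the helper name `sgd_schedule` is hypothetical.

```python
def sgd_schedule(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Count minibatches and gradient updates per training iteration (illustrative only)."""
    if train_batch_size % sgd_minibatch_size != 0:
        raise ValueError('this sketch assumes the batch splits evenly into minibatches')
    minibatches_per_pass = train_batch_size // sgd_minibatch_size
    return {
        'minibatches_per_pass': minibatches_per_pass,
        # Every pass over the batch performs one update per minibatch.
        'updates_per_iteration': minibatches_per_pass * num_sgd_iter,
        # Each collected sample is revisited once per pass.
        'times_each_sample_is_used': num_sgd_iter,
    }


schedule = sgd_schedule(train_batch_size=5000, sgd_minibatch_size=500, num_sgd_iter=10)
print(schedule)
```

Under these assumptions, the values in the cell above give 10 minibatches per pass and 100 gradient updates per call to agent.train().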

In [None]:
config['log_level'] = 'ERROR'
config['create_env_on_driver'] = True # used for evaluation

Define the environment 

In [None]:
# env_name = 'Taxi-v3' 
env_name = 'CartPole-v1'

Initialize the agent 

In [None]:
agent = PPOTrainer(config=config,  env=env_name)

### Train the agent
Training results are automatically logged in '~/ray_results/' by default.

In [None]:
%%time
print('training ....')
rewards = []
for i in tqdm.tqdm(range(50)):
    result = agent.train()
    rewards.append(result['episode_reward_mean'])
    print("iteration {:3d} reward {:6.2f}".format(i+1,result['episode_reward_mean']))
    # Save checkpoints periodically.
    if i % 10 == 0 and i > 0:
        checkpoint_path = agent.save()
        # Report i + 1 so the number matches the iteration printed above.
        print('iteration %s checkpoint saved at' % (i + 1), checkpoint_path)


In [None]:
import pandas as pd
df = pd.DataFrame({'reward': rewards})
df.to_csv(env_name+'_rewards_num_worker_%s.csv'%(int(config['num_workers'])), index=False)

import matplotlib.pyplot as plt 
plt.figure(figsize=(6,4))
plt.plot(range(df.shape[0]), df['reward'].values, label='num_workers=%s'%(num_cpus-1))
plt.xlabel('training iterations', fontsize=12)
plt.ylabel('reward/episode', fontsize=12)
plt.legend(loc=0, fontsize=12)
plt.title(env_name, fontsize=12)
plt.tight_layout()
plt.savefig(env_name+'_rewards_num_worker_%s'%(int(config['num_workers']))+'.png', dpi=100)
plt.show()

Show the neural network structure of the agent policy

In [None]:
policy = agent.get_policy()
model = policy.model
print(model)

Save the trained agent as a checkpoint by calling agent.save() 

In [None]:
# Save the trained model as a checkpoint
checkpoint_path = agent.save()
print(checkpoint_path)

### Load the checkpoint and evaluate the trained agent

In [None]:
agent.restore(checkpoint_path)  # load the saved state first; evaluate() does not load a checkpoint itself
evaluation = agent.evaluate()
print(pretty_print(evaluation))

### Hyperparameter tuning
1. param_space defines the hyperparameters that we would like to tune, e.g., 'num_sgd_iter', 'sgd_minibatch_size', 'train_batch_size', etc.
2. Ray automatically schedules the tuning trials across all available workers

In [None]:
ray.shutdown()
ray.init(log_to_driver=False, logging_level=logging.FATAL)

In [None]:
import random
from ray import air, tune

In [None]:
tuner = tune.Tuner(
    'PPO',
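    run_config=air.RunConfig(
        stop={"training_iteration": 100},
        verbose=0,
        # Keep a checkpoint at the end of each trial so the best trial can be restored later.
        checkpoint_config=air.CheckpointConfig(checkpoint_at_end=True),
    ),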
    
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max",
        num_samples=10, 
    ),
    
    param_space={
        'env': 'CartPole-v1',
        'framework': 'torch',
        "num_sgd_iter": tune.choice([10, 20]), # tune.uniform()/tune.grid_search()
        "sgd_minibatch_size": tune.choice([128, 256]),
        "train_batch_size": tune.choice([500, 1000]),
    },
)
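
The search space above has 2 x 2 x 2 = 8 possible combinations, while num_samples=10 draws 10 random configurations. The pure-Python sketch below imitates that sampling with `random.choice` to show that random draws can repeat some combinations and miss others, whereas an exhaustive grid visits each exactly once. It only illustrates the idea and does not call Ray Tune; the variable names are hypothetical.

```python
import itertools
import random

search_space = {
    'num_sgd_iter': [10, 20],
    'sgd_minibatch_size': [128, 256],
    'train_batch_size': [500, 1000],
}

# Exhaustive grid: every combination appears exactly once.
grid = [dict(zip(search_space, values))
        for values in itertools.product(*search_space.values())]

# Random search: each hyperparameter is drawn independently for every sample.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [{name: rng.choice(options) for name, options in search_space.items()}
           for _ in range(10)]
distinct = {tuple(sorted(sample.items())) for sample in samples}

print(len(grid), 'grid combinations')
print(len(samples), 'random samples covering', len(distinct), 'distinct combinations')
```

For a space this small, an exhaustive grid is a reasonable alternative to random choice; random sampling becomes more attractive as the number of hyperparameters grows.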

In [None]:
tune_results = tuner.fit()

### Organize the tuning results in a dataframe and sort the rows by the filter metric, e.g., episode_reward_mean

In [None]:
df = tune_results.get_dataframe(filter_metric="episode_reward_mean", filter_mode="max")
df[['episode_reward_mean', 'config/train_batch_size', 'config/num_sgd_iter','config/sgd_minibatch_size']].sort_values(by='episode_reward_mean', ascending=False)[:5]

## Retrieve the best hyperparameters, then restore and evaluate the tuned agent

In [None]:
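best_result = tune_results.get_best_result(metric='episode_reward_mean', mode='max')
best_checkpoint = best_result.checkpoint
best_config = best_result.config
# print(pretty_print(best_config))

# The earlier `agent` was trained before tuning and does not hold the tuned weights,
# so build a fresh trainer from the best config and restore the best checkpoint into it.
# Assumption: this Ray version lets restore() accept the checkpoint object returned by Tune;
# if it does not, pass a local path instead, e.g. best_checkpoint.to_directory().
best_config['create_env_on_driver'] = True  # evaluate() needs an environment on the driver
tuned_agent = PPOTrainer(config=best_config)
tuned_agent.restore(best_checkpoint)
evaluation = tuned_agent.evaluate()
print(pretty_print(evaluation))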