<a href="https://colab.research.google.com/github/AISG-Technology-Team/Diner-Dash-Workshop/blob/master/Challenge_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Diner Dash Challenge**

---

## Objective:

Using Reinforcement Learning (RL) algorithms and a **maximum training budget of 10 million timesteps**, maximise the average reward over 100 games/episodes of Diner Dash.

## Instructions and Expectations:

1. Please use Google Colab for all computing needs (installing dependencies, training and testing models, generating the submission, etc.). This is to ensure fairness in the competition. You can run multiple notebooks, but please take note of the constraints on GPU usage.

2. You are required to submit **2 files**:
  - A fully run Google Colab notebook
  - A JSON file containing the action list for each given seeded environment
  
  The "Testing of Policies and Verification of Submission" section provides a function that saves your best algorithm's action lists to a JSON file. For more information about the submission, please refer to the [workshop repo](https://github.com/AISG-Technology-Team/Diner-Dash-Workshop).

3. Please update the "Details of Submission" section

4. We expect to see that the models are learning during training

5. If you have any questions, please discuss them within your group first. If that does not resolve the issue, check whether it has already been reported on the [workshop repo](https://github.com/AISG-Technology-Team/Diner-Dash-Workshop/issues), and raise a new issue if it has not.
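
As a rough illustration of the JSON file mentioned in item 2: the test function later in this notebook builds a dictionary that maps the best algorithm's function name to a list of per-episode action lists. The sketch below uses a made-up algorithm name and made-up actions purely to show that shape; it is not real submission data.

```python
import json

# Hypothetical example: two short episodes from a function named "testPPO".
# Each inner list holds the action IDs taken in one seeded episode.
submission = {"testPPO": [[0, 15, 1, 7], [21, 2, 7, 14]]}

# Round-trip through JSON, as the file on disk would be read back
encoded = json.dumps(submission)
decoded = json.loads(encoded)

# The file is expected to contain exactly one algorithm
(algo_name, episodes), = decoded.items()
print(algo_name, len(episodes))
```

With the real seed list, the outer list would contain one action list per seed.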

## Advice on approach to challenge

1. Spend some time to read up about the various RL algos, especially easily implementable baselines

2. Split the shortlisted algos among the group

3. You can choose to train for fewer timesteps and later on further train the model

4. Take note of the training duration. Time is tight!

5. If necessary, tune the hyperparameters to ensure learning

6. Have fun!

## Important Resources:

1. [Diner Dash repo](https://github.com/AdaCompNUS/diner-dash-simulator)

2. [Workshop repo](https://github.com/AISG-Technology-Team/Diner-Dash-Workshop)

3. [Stable Baselines](https://github.com/hill-a/stable-baselines)

## Things to note:

1. To use a GPU, change the runtime type: in the menu bar, click Runtime > Change runtime type, then select GPU in the Hardware accelerator dropdown

2. If an "Error: A module (diner_dash) was specified for the environment but was not found, make sure the package is installed with `pip install` before calling `gym.make()`" error is raised, please restart the runtime and rerun the installation of the diner dash simulator.

3. Please ensure a stable internet connection throughout this challenge to avoid disconnecting from the Colab GPUs

4. Do not leave your computer idle, as Colab automatically disconnects GPUs if the idle time is too long

5. GPUs run on CUDA 10.1

For other FAQs, refer to this [link](https://research.google.com/colaboratory/faq.html).

---

# Details of Submission [Please Edit]

### Team Name / ID:
AISG Engineers / Team 32

### Names of Group Members:
Ban Kar Weng, Yash Khare, Wee Yeong Loo

### Names of Algorithms tested:
Random Agent, PPO, PPO2, A2C, ACKTR, ACER


# Information on Colab

## Python Version

In [1]:
!python -V

Python 3.6.9


## Cuda Version

In [2]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


# Mounting Google Drive

To store trained models

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Create Project Directory

In [4]:
from os import path, chdir, getcwd, mkdir

# Choose a project name
projectName = "DinerDashChallenge"

# Project directory is in My Drive
projectDirectory = "/content/drive/My Drive/" + projectName

# Checks if cwd is in content folder
if getcwd() == "/content":
  # Makes project directory if it does not exist
  if not path.isdir(projectDirectory):
    mkdir(projectDirectory)
    print(f"Project {projectName} has been created!")
  else:
    print(f"Project {projectName} already exist!")
  # Changes to project directory
  chdir(projectDirectory)

print(f"The current working directory is {getcwd()}")

Project DinerDashChallenge already exist!
The current working directory is /content/drive/My Drive/DinerDashChallenge


# Installing Dependencies

Downloading relevant project dependencies

## Dependencies for [diner dash simulator](https://github.com/AdaCompNUS/diner-dash-simulator)

In [5]:
from os import path, getcwd

repoName = "diner-dash-simulator"

# Clones repo if it does not exist
if not path.isdir(repoName):
  !git clone https://github.com/AdaCompNUS/diner-dash-simulator.git
  print(f"Diner Dash repo has been cloned to {getcwd()}")
else:
  print(f"Diner Dash repo is already available at {path.join(getcwd(), repoName)}")

Diner Dash repo is already available at /content/drive/My Drive/DinerDashChallenge/diner-dash-simulator


In [6]:
!pip install -e diner-dash-simulator/DinerDashEnv

Obtaining file:///content/drive/My%20Drive/DinerDashChallenge/diner-dash-simulator/DinerDashEnv
Installing collected packages: diner-dash
  Found existing installation: diner-dash 0.0.1
    Can't uninstall 'diner-dash'. No files were found to uninstall.
  Running setup.py develop for diner-dash
Successfully installed diner-dash


In [7]:
import gym

# Test make environment
def testEnv():
  env = gym.make('diner_dash:DinerDash-v0').unwrapped
  env.flash_sim = False
  env.close()
  return True

if testEnv():
  print("Installation of diner dash simulator is successful!")

Installation of diner dash simulator is successful!


## Dependencies for Policy [Please Edit]

In [8]:
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [9]:
# Stable Baselines only supports tensorflow 1.x for now
%tensorflow_version 1.x
!pip install stable-baselines[mpi]==2.10.0

TensorFlow 1.x selected.


# Check GPU usage

In [10]:
# Check if runtime uses GPU
# Ignore error if you do not wish to use a GPU
from tensorflow.test import gpu_device_name

device_name = gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')
else:
  print("GPU runtime is in use!")

GPU runtime is in use!


# Helper Functions

For easier debugging

In [11]:
def getAction(actionID):
    actionIDtoName = {
        0 : "None",
        1 : "Move to Table 1",
        2 : "Move to Table 2",
        3 : "Move to Table 3",
        4 : "Move to Table 4",
        5 : "Move to Table 5",
        6 : "Move to Table 6",
        7 : "Move to Counter",
        8 : "Pick Food for Table 1",
        9 : "Pick Food for Table 2",
        10 : "Pick Food for Table 3",
        11 : "Pick Food for Table 4",
        12 : "Pick Food for Table 5",
        13 : "Pick Food for Table 6",
        14 : "Move to Food Collection",
        15 : "Pick Table 1 for Group 1",
        16 : "Pick Table 2 for Group 1",
        17 : "Pick Table 3 for Group 1",
        18 : "Pick Table 4 for Group 1",
        19 : "Pick Table 5 for Group 1",
        20 : "Pick Table 6 for Group 1",
        21 : "Pick Table 1 for Group 2",
        22 : "Pick Table 2 for Group 2",
        23 : "Pick Table 3 for Group 2",
        24 : "Pick Table 4 for Group 2",
        25 : "Pick Table 5 for Group 2",
        26 : "Pick Table 6 for Group 2",
        27 : "Pick Table 1 for Group 3",
        28 : "Pick Table 2 for Group 3",
        29 : "Pick Table 3 for Group 3",
        30 : "Pick Table 4 for Group 3",
        31 : "Pick Table 5 for Group 3",
        32 : "Pick Table 6 for Group 3",
        33 : "Pick Table 1 for Group 4",
        34 : "Pick Table 2 for Group 4",
        35 : "Pick Table 3 for Group 4",
        36 : "Pick Table 4 for Group 4",
        37 : "Pick Table 5 for Group 4",
        38 : "Pick Table 6 for Group 4",
        39 : "Pick Table 1 for Group 5",
        40 : "Pick Table 2 for Group 5",
        41 : "Pick Table 3 for Group 5",
        42 : "Pick Table 4 for Group 5",
        43 : "Pick Table 5 for Group 5",
        44 : "Pick Table 6 for Group 5",
        45 : "Pick Table 1 for Group 6",
        46 : "Pick Table 2 for Group 6",
        47 : "Pick Table 3 for Group 6",
        48 : "Pick Table 4 for Group 6",
        49 : "Pick Table 5 for Group 6",
        50 : "Pick Table 6 for Group 6",
        51 : "Pick Table 1 for Group 7",
        52 : "Pick Table 2 for Group 7",
        53 : "Pick Table 3 for Group 7",
        54 : "Pick Table 4 for Group 7",
        55 : "Pick Table 5 for Group 7",
        56 : "Pick Table 6 for Group 7",
    }
    return actionIDtoName[actionID]
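
The lookup table above follows a regular layout, so the same names can also be derived arithmetically. The following is a minimal sketch of that layout; `decode_action` is an illustrative name and is not part of the simulator.

```python
def decode_action(action_id):
    """Rebuild the action name from the ID layout used in getAction."""
    if action_id == 0:
        return "None"
    if 1 <= action_id <= 6:
        return f"Move to Table {action_id}"
    if action_id == 7:
        return "Move to Counter"
    if 8 <= action_id <= 13:
        return f"Pick Food for Table {action_id - 7}"
    if action_id == 14:
        return "Move to Food Collection"
    if 15 <= action_id <= 56:
        # IDs 15-56 enumerate 7 groups x 6 tables, one group at a time
        group, table = divmod(action_id - 15, 6)
        return f"Pick Table {table + 1} for Group {group + 1}"
    raise ValueError(f"Unknown action ID: {action_id}")

print(decode_action(32))  # Pick Table 6 for Group 3
```

This makes it easier to see that the action space has 57 entries: 15 movement or food actions (including "None") plus 42 seating actions.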

# Policies [Please Edit]

## Initialise Environment

In [12]:
import time
import gym
import numpy as np

# Initialises first env
def initEnv(seed=None):
  env = gym.make('diner_dash:DinerDash-v0').unwrapped
  env.flash_sim = False
  
  if seed is not None:
    # sets random seed
    env.seed(seed)

  obs = env.reset()

  return env, obs

## Self Implemented/Adapted Models

### Random Agent

In [13]:
from random import randint

In [14]:
# Randomly select an action from the action space
def testRA(seed):

  # init env
  env, _ = initEnv(seed=seed)

  # init variables
  done = False
  sumReward = 0
  actionList = []

  while not done:
      action = randint(0, 56)
      actionList.append(action)
      state, reward, done, _ = env.step(action)
      sumReward += reward

  return sumReward, actionList

### [PPO](https://github.com/nikhilbarhate99/PPO-PyTorch)

In [15]:
import torch
import torch.nn as nn
import numpy as np
from torch.distributions import Categorical

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [16]:
class Memory:
    def __init__(self):
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []
        self.is_terminals = []
    
    def clear_memory(self):
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.is_terminals[:]

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorCritic, self).__init__()

        # actor
        self.action_layer = nn.Sequential(
                nn.Linear(state_dim, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, action_dim),
                nn.Softmax(dim=-1)
                )
        
        # critic
        self.value_layer = nn.Sequential(
                nn.Linear(state_dim, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, 1)
                )
        
    def forward(self):
        raise NotImplementedError
        
    def act(self, state, memory):
        state = torch.from_numpy(state).float().to(device) 
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)
        action = dist.sample()
        
        memory.states.append(state)
        memory.actions.append(action)
        memory.logprobs.append(dist.log_prob(action))
        
        return action.item()
    
    def evaluate(self, state, action):
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)
        
        action_logprobs = dist.log_prob(action)
        dist_entropy = dist.entropy()
        
        state_value = self.value_layer(state)
        
        return action_logprobs, torch.squeeze(state_value), dist_entropy
        
class PPO:
    def __init__(self, state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip):
        self.lr = lr
        self.betas = betas
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs
        
        self.policy = ActorCritic(state_dim, action_dim, n_latent_var).to(device)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas)
        self.policy_old = ActorCritic(state_dim, action_dim, n_latent_var).to(device)
        self.policy_old.load_state_dict(self.policy.state_dict())
        
        self.MseLoss = nn.MSELoss()
    
    def update(self, memory):   
        # Monte Carlo estimate of state rewards:
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(memory.rewards), reversed(memory.is_terminals)):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)
        
        # Normalizing the rewards:
        rewards = torch.tensor(rewards).to(device)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
        
        # convert list to tensor
        old_states = torch.stack(memory.states).to(device).detach()
        old_actions = torch.stack(memory.actions).to(device).detach()
        old_logprobs = torch.stack(memory.logprobs).to(device).detach()
        
        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            # Evaluating old actions and values :
            logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions)
            
            # Finding the ratio (pi_theta / pi_theta__old):
            ratios = torch.exp(logprobs - old_logprobs.detach())
                
            # Finding Surrogate Loss:
            advantages = rewards - state_values.detach()
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1-self.eps_clip, 1+self.eps_clip) * advantages
            loss = -torch.min(surr1, surr2) + 0.5*self.MseLoss(state_values, rewards) - 0.01*dist_entropy
            
            # take gradient step
            self.optimizer.zero_grad()
            loss.mean().backward()
            self.optimizer.step()
        
        # Copy new weights into old policy:
        self.policy_old.load_state_dict(self.policy.state_dict())
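
To make the two core steps of `update` more concrete, here is a small pure-Python sketch of the discounted-return computation (with resets at episode boundaries) and the clipped surrogate term for a single sample. The function names `discounted_returns` and `clipped_surrogate` are illustrative and work on plain numbers instead of tensors; normalisation, the value loss, and the entropy bonus are left out.

```python
def discounted_returns(rewards, is_terminals, gamma):
    """Walk the buffer backwards, restarting the sum at each episode end."""
    returns = []
    running = 0.0
    for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
        if is_terminal:
            running = 0.0
        running = reward + gamma * running
        returns.insert(0, running)
    return returns


def clipped_surrogate(ratio, advantage, eps_clip):
    """PPO objective for one sample: the smaller of the raw and clipped terms."""
    clipped_ratio = max(1 - eps_clip, min(ratio, 1 + eps_clip))
    return min(ratio * advantage, clipped_ratio * advantage)


# Two episodes of two steps each, stored back to back in one buffer
print(discounted_returns([1, 2, 3, 4], [False, True, False, True], gamma=0.5))
# [2.0, 2.0, 5.0, 4.0]

# A ratio above 1 + eps_clip with a positive advantage gets clipped
print(clipped_surrogate(ratio=1.5, advantage=2.0, eps_clip=0.2))
```

Note how the return at index 1 ignores the rewards of the second episode: the terminal flag stops later rewards from leaking backwards across the boundary.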

In [17]:
def trainPPO():
    ############## Hyperparameters ##############
    # creating environment
    env, _ = initEnv()
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    log_directory = "./logs"     # log directory
    log_interval = 500          # print avg reward in the interval
    save_interval = int(5e5)    # checkpoints to save model
    max_timesteps = int(1e7)    # max training timesteps
    n_latent_var = 64           # number of variables in hidden layer
    update_timestep = 2000      # update policy every n timesteps
    lr = 0.003
    betas = (0.9, 0.999)
    gamma = 0.99                # discount factor
    K_epochs = 4                # update policy for K epochs
    eps_clip = 0.2              # clip parameter for PPO
    random_seed = None
    #############################################
    
    # Train model
    start_time = time.time()

    if random_seed:
        torch.manual_seed(random_seed)
        env.seed(random_seed)
    
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip)
    print(lr,betas)
    
    # logging variables
    running_reward = 0
    avg_length = 0
    timestep = 0 # train timesteps
    t = 0 # timestep within each episode
    e = 0 # num of episodes

    done = False
    
    # training loop
    while timestep <= max_timesteps:
        state = env.reset()
        e += 1 # episode number
        while not done:
            timestep += 1
            t += 1 # timestep within each episode

            if timestep == max_timesteps:
                torch.save(ppo.policy.state_dict(), f'./PPO_diner-dash_{timestep:.0e}.pth')
                print(f"--- Time taken to train model = {(time.time() - start_time)//60} minutes ---")
                return
            
            # Running policy_old:
            action = ppo.policy_old.act(state, memory)
            state, reward, done, _ = env.step(action)
            
            # Saving reward and is_terminal:
            memory.rewards.append(reward)
            memory.is_terminals.append(done)
            
            # update if its time
            if timestep % update_timestep == 0:
                ppo.update(memory)
                memory.clear_memory()

            # save model at checkpoints
            if timestep % save_interval == 0:
                if not path.isdir(log_directory):
                    mkdir(log_directory)
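                # Use the exact step count in the name: '.0e' keeps only one significant
                # digit, so a checkpoint such as 1.5M would share a filename with a neighbour
                torch.save(ppo.policy.state_dict(), f'{log_directory}/PPO_diner-dash_{timestep}.pth')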
            
            running_reward += reward
                
        avg_length += t

        # reset timestep t and done since episode ended
        t = 0
        done = False
            
        # logging
        if e % log_interval == 0:
            avg_length = int(avg_length/log_interval)
            running_reward = int((running_reward/log_interval))
            
            print('Episode {} \t avg length: {} \t reward: {}'.format(e, avg_length, running_reward))
            running_reward = 0
            avg_length = 0
    return
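
The final model is saved with the step count formatted as `.0e`, which is why `testPPO` later loads `PPO_diner-dash_1e+07.pth`. That format keeps a single significant digit, so step counts that are not a single digit times a power of ten can map to the same name. The sketch below only exercises Python string formatting on a hypothetical schedule of four checkpoints; nothing is written to disk.

```python
save_interval = int(5e5)

# Hypothetical schedule: checkpoints at 0.5M, 1M, 1.5M and 2M timesteps
checkpoints = list(range(save_interval, int(2e6) + 1, save_interval))

# Names built with the '.0e' format spec (one significant digit)
names = [f"PPO_diner-dash_{t:.0e}.pth" for t in checkpoints]
for t, name in zip(checkpoints, names):
    print(t, name)

# Names built from the exact integer step count
unique_names = [f"PPO_diner-dash_{t}.pth" for t in checkpoints]

# The 10M-step model name that testPPO expects
final_name = f"PPO_diner-dash_{int(1e7):.0e}.pth"
print(final_name)  # PPO_diner-dash_1e+07.pth

print(len(set(names)), "distinct '.0e' names for", len(checkpoints), "checkpoints")
```

When every intermediate checkpoint needs to be kept, the plain integer step count is the safer choice for the filename.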

In [18]:
trainPPO()

0.003 (0.9, 0.999)
Episode 500 	 avg length: 137 	 reward: -1131
Episode 1000 	 avg length: 137 	 reward: -968
Episode 1500 	 avg length: 138 	 reward: -972
Episode 2000 	 avg length: 137 	 reward: -675
Episode 2500 	 avg length: 136 	 reward: -490
Episode 3000 	 avg length: 136 	 reward: -245
Episode 3500 	 avg length: 135 	 reward: -122
Episode 4000 	 avg length: 136 	 reward: -10
Episode 4500 	 avg length: 137 	 reward: 42
Episode 5000 	 avg length: 138 	 reward: 4
Episode 5500 	 avg length: 136 	 reward: 105
Episode 6000 	 avg length: 136 	 reward: 166
Episode 6500 	 avg length: 138 	 reward: 201
Episode 7000 	 avg length: 138 	 reward: 208
Episode 7500 	 avg length: 138 	 reward: 199
Episode 8000 	 avg length: 137 	 reward: 228
Episode 8500 	 avg length: 138 	 reward: 261
Episode 9000 	 avg length: 139 	 reward: 277
Episode 9500 	 avg length: 138 	 reward: 326
Episode 10000 	 avg length: 138 	 reward: 253
Episode 10500 	 avg length: 139 	 reward: 219
Episode 11000 	 avg length: 14

In [19]:
def testPPO(seed):
    ############## Hyperparameters ##############
    # creating environment
    env, obs = initEnv(seed=seed)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    n_latent_var = 64           # number of variables in hidden layer
    filename = "./PPO_diner-dash_1e+07.pth" # path to saved model
    lr = 0.0007
    betas = (0.9, 0.999)
    gamma = 0.99                # discount factor
    K_epochs = 4                # update policy for K epochs
    eps_clip = 0.2              # clip parameter for PPO
    #############################################
    
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip)
    
    ppo.policy_old.load_state_dict(torch.load(filename))

    ep_reward = 0
    done = False

    while not done:
      action = ppo.policy_old.act(obs, memory)
      obs, reward, done, _ = env.step(action)
      ep_reward += reward

    actionList = [action.item() for action in memory.actions]

    return ep_reward, actionList

## [Stable Baselines](https://github.com/hill-a/stable-baselines)

### Check Env setup for Stable Baselines

In [20]:
from stable_baselines.common.env_checker import check_env

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [21]:
error = check_env(gym.make('diner_dash:DinerDash-v0').unwrapped)
if error is None:
  print("Diner Dash environment is compatible with Stable-Baselines!")

Diner Dash environment is compatible with Stable-Baselines!


### Saving Models


#### Using Callbacks

- Save a checkpoint every 1000 steps
- Change the callback variable name and `name_prefix` to something specific to your model
- Models are saved in the `./logs` directory

  `PPO_callback = CheckpointCallback(save_freq=1000, save_path='./logs/', name_prefix='diner-dash-PPO')`

#### Using save function

- Model saved in current directory

  `model.save('name-of-model')`

### Loading Models

- If only for evaluation

  `PPO_model = PPO2.load('name-of-model')`

  `PPO_model.predict(state)`

- If loading for further training

  `PPO_model = PPO2.load('name-of-model', env)`

  `PPO_model.learn(5000)`

### Wrapper for better training performance

In [22]:
class OneHotWrapper(gym.Wrapper):
  """
  :param env: (gym.Env) Gym environment that will be wrapped
  """
  def __init__(self, env):
    # Call the parent constructor, so we can access self.env later
    super(OneHotWrapper, self).__init__(env)
    self.config = [7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
                   2, 2, 2, 2, 2, 2, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
                   7, 7, 19, 19]
    self.low_state = np.array([-20] * sum(self.config), dtype=np.float32)
    self.high_state = np.array([20] * sum(self.config), dtype=np.float32)
    self.observation_space = gym.spaces.Box(low=self.low_state, high=self.high_state, dtype=np.float32)

  def oneHotEncode(self, rawObs):
    for i, val in enumerate(rawObs):
      tmp = np.zeros(self.config[i])
      tmp[val] = 1
      if i == 0:
        obs = tmp
      else:
        obs = np.concatenate((obs, tmp))
    return obs

  def reset(self):
    """
    Reset the environment 
    """
    obs = self.env.reset()
    return self.oneHotEncode(obs)

  def step(self, action):
    """
    :param action: ([float] or int) Action taken by the agent
    :return: (np.ndarray, float, bool, dict) observation, reward, is the episode over?, additional informations
    """
    obs, reward, done, info = self.env.step(action)
    return self.oneHotEncode(obs), reward, done, info
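
To show what the wrapper's encoding does to an observation, here is a toy sketch in plain Python. The two-feature `sizes` list is a made-up stand-in for `self.config`, and `one_hot_encode` is an illustrative name; the idea is that each integer feature becomes a block of zeros with a single one at the feature's value, and the blocks are concatenated.

```python
def one_hot_encode(raw_obs, sizes):
    """Expand each integer feature into its own one-hot block."""
    encoded = []
    for value, size in zip(raw_obs, sizes):
        block = [0] * size
        block[value] = 1
        encoded.extend(block)
    return encoded

# Hypothetical observation with two features taking 3 and 2 possible values
sizes = [3, 2]
print(one_hot_encode([2, 0], sizes))  # [0, 0, 1, 1, 0]
```

The encoded length equals `sum(sizes)`, which is why the wrapper sizes its observation space with `sum(self.config)`.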

### PPO2

In [25]:
# Refer to the stable baseline documentation for alternative implementations
# of callbacks, baselines and others
from stable_baselines.common.callbacks import CheckpointCallback
from stable_baselines.common import make_vec_env
from stable_baselines import PPO2

# fixed params for challenge
# Total timesteps for training
tts = int(1e7) 

# custom params
n_envs = 24
save_freq = int(1e5)
log_interval = 500

# Vectorize environment
env = make_vec_env(env_id="diner_dash:DinerDash-v0", n_envs=n_envs, wrapper_class=OneHotWrapper)

# Initialise model
model = PPO2('MlpPolicy', env, verbose=1)
# model = A2C(CnnPolicy, env, lr_schedule='constant', verbose=1)
# model.learn(total_timesteps=int(5e6))

# Initialise callback
ppo2_callback = CheckpointCallback(save_freq=save_freq, save_path=f'./logs-PPO2-nenv={n_envs}-tts={tts:.0e}/', name_prefix='diner-dash-PPO')

# Train model
start_time = time.time()
model.learn(total_timesteps=tts, log_interval=log_interval, callback=ppo2_callback)
print(f"--- Time taken to train model = {(time.time() - start_time)//60} minutes ---")

# Save model
print("Saving Final Model...")
modelDirectory = "./"
modelName = f"PPO2-nenv={n_envs}-tts={tts:.0e}"
model.save(modelDirectory + modelName)
print(f"Model saved as {modelDirectory + modelName}")





--------------------------------------
| approxkl           | 8.491728e-06  |
| clipfrac           | 0.0           |
| ep_len_mean        | 113           |
| ep_reward_mean     | -1.36e+03     |
| explained_variance | -0.000222     |
| fps                | 1196          |
| n_updates          | 1             |
| policy_entropy     | 4.043032      |
| policy_loss        | -0.0006455337 |
| serial_timesteps   | 128           |
| time_elapsed       | 3.62e-05      |
| total_timesteps    | 3072          |
| value_loss         | 24675.428     |
--------------------------------------


KeyboardInterrupt: ignored

In [None]:
def testPPO2(seed):
  from stable_baselines.common import make_vec_env
  from stable_baselines import PPO2

  # Vectorize environment with given seed
  env = make_vec_env(env_id="diner_dash:DinerDash-v0", wrapper_class=OneHotWrapper, seed=seed)

  # Load saved model
  PPO_model = PPO2.load("PPO2-nenv=24-tts=1e+07", env=env)

  # Reset environment, init obs
  obs = env.reset()

  done = False
  sum_rewards = 0
  action_list = []

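  while not done:
    action, _states = PPO_model.predict(obs)
    action_list.append(action.item())
    # The vectorised env returns length-1 arrays; unpack the single env's values
    obs, rewards, dones, info = env.step(action)
    sum_rewards += float(rewards[0])
    done = bool(dones[0])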

  return sum_rewards, action_list

### ACER

In [48]:
# Refer to the stable baseline documentation for alternative implementations
# of callbacks, baselines and others
from stable_baselines.common.callbacks import CheckpointCallback
from stable_baselines.common import make_vec_env
from stable_baselines import ACER

# fixed params for challenge
# Total timesteps for training
tts = int(1e7) 

# custom params
n_envs = 24
save_freq = int(1e5)
log_interval = 500

# Vectorize environment
env = make_vec_env(env_id="diner_dash:DinerDash-v0", n_envs=n_envs, wrapper_class=OneHotWrapper)

# Initialise model
model = ACER('MlpPolicy', env, verbose=1)

# Initialise callback
acer_callback = CheckpointCallback(save_freq=save_freq, save_path=f'./logs-ACER-nenv={n_envs}-tts={tts:.0e}/', name_prefix='diner-dash-ACER')

# Train model
start_time = time.time()
model.learn(total_timesteps=tts, log_interval=log_interval, callback=acer_callback)
print(f"--- Time taken to train model = {(time.time() - start_time)//60} minutes ---")

# Save model
print("Saving Final Model...")
modelDirectory = "./"
modelName = f"ACER-nenv={n_envs}-tts={tts:.0e}"
model.save(modelDirectory + modelName)
print(f"Model saved as {modelDirectory + modelName}")


----------------------------------
| avg_norm_adj        | 0.000288 |
| avg_norm_g          | 404      |
| avg_norm_grads_f    | 404      |
| avg_norm_k          | 7.55     |
| avg_norm_k_dot_g    | 404      |
| entropy             | 2.04e+03 |
| explained_variance  | 6.5e-06  |
| fps                 | 0        |
| loss                | 36.3     |
| loss_bc             | -0       |
| loss_f              | 28.6     |
| loss_policy         | 28.6     |
| loss_q              | 56.2     |
| mean_episode_length | 0        |
| mean_episode_reward | 0        |
| norm_grads          | 3.18     |
| norm_grads_policy   | 2.36     |
| norm_grads_q        | 2.13     |
| total_timesteps     | 480      |
----------------------------------
----------------------------------
| avg_norm_adj        | 47.3     |
| avg_norm_g          | 4.61e+03 |
| avg_no

In [67]:
def testACER(seed):
  from stable_baselines.common import make_vec_env
  from stable_baselines import ACER

  # Vectorize environment with given seed
  env = make_vec_env(env_id="diner_dash:DinerDash-v0", wrapper_class=OneHotWrapper, seed=seed)

  # Load saved model
  ACER_model = ACER.load("ACER-nenv=24-tts=1e+07")

  # Reset environment, init obs
  obs = env.reset()

  done = False
  sum_rewards = 0
  action_list = []

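  while not done:
    action, _states = ACER_model.predict(obs)
    action_list.append(action.item())
    # The vectorised env returns length-1 arrays; unpack the single env's values
    obs, rewards, dones, info = env.step(action)
    sum_rewards += float(rewards[0])
    done = bool(dones[0])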

  return sum_rewards, action_list

# Testing of Policies and Verification of Submission [Please Edit]

Creates a JSON file of action lists from the best-scoring algorithm

In [64]:
from random import randint
import json
from os import getcwd
from tqdm.notebook import tqdm

# Sample test
def test():
  ############################ CHANGEABLE AREA ##############################
  # Changeable parameters
  numEpisodes = 100                             # num of test episodes
  # algos = [testRA, testPPO, testPPO2]           # Add or remove algos (must have unique names)
  algos = [testACER]           # Add or remove algos (must have unique names)
  saveJson = True                              # Whether to save actions_dict
  fileDirectory = "./"                          # Path of saved json file
  fileName = "submission.json"                  # Name of json file

  ### Replace the list of randomSeeds with that given for submission
  # e.g. randomSeeds = [1, 2, 3]
  # randomSeeds = [randint(0, int(1e8)) for i in range(numEpisodes)]
  randomSeeds = [45990181, 42859851, 88292417, 17986451, 4310124, 28416871, 21378509, 28987250, 49793653, 81705172, 90381554, 13393105, 90402290, 69802779, 87378977, 7338848, 74942140, 86896376, 60192513, 90268611, 12193092, 45037492, 32444344, 60276470, 81720257, 48114169, 2745186, 39780027, 68039546, 63661496, 89673369, 54490252, 9508183, 78690722, 41872036, 40729179, 71091571, 52945376, 49602567, 11079941, 35506423, 32863705, 98722501, 95078645, 2050683, 30225876, 12983163, 5244339, 28278496, 80180211, 63902897, 46843366, 74357835, 90376940, 98407071, 48007796, 96438018, 54730109, 40955186, 60494091, 76878283, 24175421, 91447265, 36570693, 3334869, 14057265, 53946219, 30908957, 86325356, 90558192, 24759335, 51591742, 38364662, 1189567, 536631, 16559969, 68687507, 24406829, 9720389, 23088515, 34242387, 74268255, 23615670, 68613237, 7166219, 27203162, 29343492, 75431707, 39683866, 87146964, 78351462, 23184439, 9088138, 34637812, 25889305, 95479264, 55637910, 26835621, 37209126, 47123382]

  ############################################################################

  # uses function name as key
  # hence, name function with algo name (e.g. testPPO or just PPO)
  rewards_dict = {algo.__name__ : [] for algo in algos}
  actions_dict = {algo.__name__ : [] for algo in algos}

  # Test begins
  for seed in tqdm(randomSeeds):
    for algo in algos:
      # Given a random seed
      # Returns the sum of rewards for that episode and the actions list
      rewards, actions = algo(seed)

      rewards_dict[algo.__name__].append(rewards)
      actions_dict[algo.__name__].append(actions)

  # Print average rewards from n episodes for each algo
  avgReward_dict = {algo : int(sum(rewards)/len(rewards)) for algo, rewards in rewards_dict.items()}
  print(f"Average Rewards for each algo: {avgReward_dict}")

  # Prints best algo
  best_algo = max(avgReward_dict.keys(), key=(lambda k: avgReward_dict[k]))
  best_reward = avgReward_dict[best_algo]
  print(f"The best algo is {best_algo} with the highest rewards of {best_reward}")

  # Print an action dict containing actions list for each random seed env for each algo
  print(f"Actions list for each env for each algo: {actions_dict}")

  submission_dict = {best_algo: actions_dict[best_algo]}

  if saveJson:
    print("Saving best algo to json file...")
    with open(fileDirectory + fileName, "w") as write_file:
      json.dump(submission_dict, write_file)
      print(f"{fileName} was saved in {getcwd()}")
    
    print("-" * 100)
    
    print(f"Verifying {fileName}...")
    (best_algo, best_action_list), = submission_dict.items()
    print(f"Name of best algo: {best_algo}")
    submissionEpisodes = len(best_action_list)
    if submissionEpisodes != len(randomSeeds):
      raise ValueError("Number of episodes in submission does not match the number of random seeds!")
    print(f"Number of episodes(random seeds): {submissionEpisodes}")
    print("Number of episodes in submission matches the number of random seeds")
    print("Verification Complete! Please double check the verification results")
  
  return None
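
The verification step above checks the in-memory dictionary. As a complementary sketch, the saved file itself can be read back and checked. The helper name `validate_submission` and the demo file contents below are assumptions for illustration; the limit of 57 actions comes from the action table in the Helper Functions section.

```python
import json
import os
import tempfile


def validate_submission(file_path, num_seeds, num_actions=57):
    """Reload a submission file and check its basic shape."""
    with open(file_path) as read_file:
        submission = json.load(read_file)

    if len(submission) != 1:
        raise ValueError("Submission should contain exactly one algorithm")
    (algo_name, episodes), = submission.items()

    if len(episodes) != num_seeds:
        raise ValueError("Number of episodes does not match the number of seeds")

    for actions in episodes:
        if not all(isinstance(a, int) and 0 <= a < num_actions for a in actions):
            raise ValueError("Action list contains an invalid action ID")

    return algo_name, len(episodes)


# Demo with a made-up two-episode file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "submission.json")
    with open(demo_path, "w") as write_file:
        json.dump({"testACER": [[0, 15, 7], [56, 1]]}, write_file)
    result = validate_submission(demo_path, num_seeds=2)
    print(result)  # ('testACER', 2)
```

For the real file, `num_seeds` would be `len(randomSeeds)`.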

In [66]:
test()

AssertionError: ignored