## SETUP
*Read the introduction below during the setup of your environment.*

In [0]:
#@title ##### Imports and downloads
# Install the MQTT broker and the Python MQTT client
!apt install -y mosquitto
!pip install paho-mqtt

# Download and unzip ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -o ngrok-stable-linux-amd64.zip

# Download front end & framework
! git clone https://gitlab.com/tothepoint/reinforcement-learning-workshop/rl-frontend.git reinforcement-learning
! git clone https://gitlab.com/tothepoint/reinforcement-learning-workshop/rl-gridworld.git boarld_root

# Run MQTT broker
! mosquitto -d -c boarld_root/configs/mosquitto.conf

import json
import os
import subprocess
import time
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request
from threading import Thread

import paho.mqtt.client as mqtt


In [0]:
# @title ##### Run ngrok tunnels
%cd /content/
!pkill ngrok
import subprocess
subprocess.Popen(['./ngrok', 'start', '-config', 'boarld_root/configs/ngrok.conf', 'frontend'])
subprocess.Popen(['./ngrok', 'start', '-config', 'boarld_root/configs/ngrok.conf', 'mqttbroker'])
time.sleep(5)

# Get ngrok URLs
ngrok_data_frontend = json.load(urllib.request.urlopen('http://localhost:4040/api/tunnels'))
ngrok_data_mqttbroker = json.load(urllib.request.urlopen('http://localhost:4041/api/tunnels'))
frontend_ngrok_url = ngrok_data_frontend['tunnels'][0]['public_url'].split('//')[1]
mqtt_ngrok_url = ngrok_data_mqttbroker['tunnels'][0]['public_url'].split('//')[1]

In [0]:
# @title ##### Tell front end where to find MQTT broker (through ngrok tunnel)
with open("/content/reinforcement-learning/assets/tunnel-domain.txt", "w") as f:
  f.write(mqtt_ngrok_url)

In [0]:
# @title ##### Start front end
%cd /content/reinforcement-learning/
!npm install 

thr = Thread(target=os.system, args=('npm run dev', ))
thr.start()
time.sleep(5)

In [0]:
# @title ##### Install RL framework
%cd /content/boarld_root/boarld
!pip install .
%cd /content/

In [0]:
# @title ##### Import useful elements of the RL framework
from boarld.core.rl.arlgorithm.ARLgorithm import ARLgorithm
from boarld.core.rl.arlgorithm.QlearningARLgorithm import QlearningARLgorithm
from boarld.core.rl.arlgorithm.SarsaARLgorithm import SarsaARLgorithm
from boarld.core.rl.arlgorithm.BellmanARLgorithm import BellmanARLgorithm
from boarld.core.env.action.Action import *
from boarld.core.rl.runner.Runner import Runner
from boarld.core.rl.trainer.Trainer import Trainer
from boarld.gridworld.env.predefined.predefined_grids import *
from boarld.sliding_puzzle.env.board.SlidingPuzzle import SlidingPuzzle
from boarld.gridworld.rl.agent.GridAgent import GridAgent
from boarld.sliding_puzzle.rl.agent.PuzzleAgent import PuzzleAgent


## Introduction
The goal of the exercises below is to implement some reinforcement learning algorithms on an abstract level. We defined an `Agent` interacting with a `Board`. The `Agent` has a state and can change this state by performing an `Action`. The `Agent` contains a `Qtable` that determines the best action in each state to solve the problem at hand. 

What follows is a brief overview of the objects and methods you'll probably need to implement the learning process. Once done, you can test your algorithm in a gridworld environment, as well as on a sliding puzzle. For a more detailed description of the API, please refer to the documentation.

  

### `ARLgorithm`
`ARLgorithm` is an abstraction of a reinforcement learning algorithm. An `ARLgorithm` contains an `Agent`, which learns how to become good at a certain task. The `ARLgorithm`'s `learn()` method updates the `Qtable` of the `Agent`, causing a change in the `Agent`'s behavior.  
Important methods and attributes:
```
- agent: Agent
- learn(self, nb_episodes, nb_of_eps_before_table_update, qtable_convergence_threshold, nb_steps_before_timeout, random_rate=0.3, learning_rate=0.2, discount_factor=0.7)
```

### `Agent`
`Agent` represents a reinforcement learning agent. The agent is associated with a board on which it operates, a current state, and one or more target states. It receives a reward for performing a move or reaching a target state.  
Important methods and attributes:
```
- Qtable: Qtable
- choose_action_epsilon_greedily(random_rate, old_state)
- get_reward
- move(next_action)
- reset()
- set_agent_to_random_state(nb_actions)
- state_is_final(state)
```
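
To make the idea behind `choose_action_epsilon_greedily` concrete, here is a minimal, framework-independent sketch. It assumes the Q-values of the current state are available as a plain dictionary from action name to value; that layout and the function signature are illustrative assumptions, not the framework's actual API.

```python
import random


def choose_action_epsilon_greedily(q_values, random_rate, rng=random):
    """Explore with probability random_rate, otherwise exploit.

    q_values is a hypothetical dict: action name -> Q-value in the current state.
    """
    if rng.random() < random_rate:
        # Explore: pick any action uniformly at random.
        return rng.choice(list(q_values))
    # Exploit: pick the action with the highest Q-value.
    return max(q_values, key=q_values.get)


q_values = {"UP": 1, "DOWN": 1, "LEFT": 1, "RIGHT": 10}

# With random_rate=0.0 the choice is always greedy.
greedy_action = choose_action_epsilon_greedily(q_values, random_rate=0.0)
print(greedy_action)
```

A higher `random_rate` makes the agent explore more; a value of 0 makes it follow its current Q-values exactly.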

### `Qtable`
`Qtable` represents a Q-table in reinforcement learning: it holds a value for each (state, action) pair, representing the quality of taking that action in that state. During training, we want to distinguish between the Q-table we are writing updates to and the Q-table from which we derive our policy, so `Qtable` consists of two tables: the regular table, from which Q-values are read, and a shadow copy that receives updates through `update_value(state, action, value)`. After a number of episodes, or once the Q-table has converged (`has_converged_since_last_snapshot`, which compares it to a snapshot taken earlier with `take_snapshot()`), you can replace the regular table with the shadow copy using `update_table_by_shadow()`.  
Important methods and attributes:
```
- has_converged_since_last_snapshot(threshold)
- get_Q_value(state, action, replace_neg_inf_by_zero)
- update_value(state, action, value)
- update_table_by_shadow()
- take_snapshot()
```
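
The two-table idea can be sketched without the framework. The class below, `TinyQtable`, is a hypothetical stand-in that only mirrors the method names listed above. Its internals, and in particular the convergence test (largest absolute change since the snapshot), are assumptions made for illustration.

```python
class TinyQtable:
    """Hypothetical stand-in for the framework's Qtable, for illustration only."""

    def __init__(self, states, actions):
        # Regular table: the values the policy reads from.
        self.table = {(s, a): 0.0 for s in states for a in actions}
        # Shadow table: receives updates until it is swapped in.
        self.shadow = dict(self.table)
        self.snapshot = None

    def get_Q_value(self, state, action):
        return self.table[(state, action)]

    def update_value(self, state, action, value):
        self.shadow[(state, action)] = value

    def update_table_by_shadow(self):
        self.table = dict(self.shadow)

    def take_snapshot(self):
        self.snapshot = dict(self.table)

    def has_converged_since_last_snapshot(self, threshold):
        # Assumed criterion: no entry moved by more than threshold.
        if self.snapshot is None:
            return False
        return all(abs(self.table[k] - self.snapshot[k]) <= threshold
                   for k in self.table)


q = TinyQtable(states=[(0, 0), (1, 0)], actions=["LEFT", "RIGHT"])
q.take_snapshot()
q.update_value((0, 0), "RIGHT", 10.0)

value_before_swap = q.get_Q_value((0, 0), "RIGHT")  # still the old value
q.update_table_by_shadow()
value_after_swap = q.get_Q_value((0, 0), "RIGHT")   # now the updated value
converged = q.has_converged_since_last_snapshot(0.001)
```

The point of the sketch is that an update stays invisible to `get_Q_value` until `update_table_by_shadow()` is called, so the policy stays fixed between swaps.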


## Exercises

In [0]:
# @title ##### Get the URL to our agent's visualization
print('Front end at http://%s' % frontend_ngrok_url)
# print('Mqtt broker at %s' % mqtt_ngrok_url)

### Defining my own, very stupid algorithm...

Let's define an algorithm that makes our agent move right in every state.  

Recall: a Q-table looks something like:  

|        	| UP  	| DOWN 	| LEFT 	| RIGHT 	|
|--------	|-----	|------	|------	|-------	|
| (0, 0) 	| 1   	| 1    	| 1    	| 10    	|
| (1, 0) 	| 10  	| 1    	| 1    	| 0     	|
| ...    	| ... 	| ...  	| ...  	| ...   	|
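
As a small illustration of how such a table turns into behavior, the snippet below stores the two example rows in a nested dictionary (an assumed layout, not the framework's `Qtable`) and reads off the greedy action per state.

```python
# The two example rows from the table above, as state -> {action: Q-value}.
q_table = {
    (0, 0): {"UP": 1, "DOWN": 1, "LEFT": 1, "RIGHT": 10},
    (1, 0): {"UP": 10, "DOWN": 1, "LEFT": 1, "RIGHT": 0},
}

# The greedy policy picks, in every state, the action with the highest Q-value.
policy = {state: max(values, key=values.get) for state, values in q_table.items()}
print(policy)  # {(0, 0): 'RIGHT', (1, 0): 'UP'}
```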

In [0]:
class StupidARLgorithm(ARLgorithm):
  def learn(self, nb_episodes, nb_of_eps_before_table_update, qtable_convergence_threshold,
          nb_steps_before_timeout, random_rate=0.3, learning_rate=0.2, discount_factor=0.7):

    for i in range(nb_episodes):        
      for state in self.agent.get_list_of_possible_states():
        self.agent.Qtable.update_value(state, Right(), 100)
        self.agent.Qtable.update_value(state, Left(), 1)
        self.agent.Qtable.update_value(state, Up(), 1)
        self.agent.Qtable.update_value(state, Down(), 1)
      self.agent.Qtable.update_table_by_shadow()

Train the agent and let it try to get to its goal. 

In [0]:
# -- AGENT SETUP -- #
stupid_agent = GridAgent(Grid1.GRID)

In [0]:
# -- TRAINER SETUP -- #
stupid_trainer = Trainer() \
    .with_agent(stupid_agent) \
    .with_arlgorithm(StupidARLgorithm) \
    .with_nb_episodes(1) \
    .with_nb_of_eps_before_table_update(1) \
    .with_qtable_convergence_threshold(.001) \
    .with_nb_steps_before_timeout(1) \
    .with_random_rate(0.3) \
    .with_learning_rate(0.2) \
    .with_discount_factor(1)

# -- TRAIN -- #
stupid_trainer.train()

In [0]:
# -- RUN -- #
stupid_agent.reset()
stupid_runner = Runner() \
    .with_agent(stupid_agent) 
stupid_runner.run()

### Running a real RL algorithm

In [0]:
# -- AGENT SETUP -- #
grid_agent = GridAgent(Grid1.GRID)

In [0]:
# -- TRAINER SETUP -- #
gridworld_trainer = Trainer() \
    .with_agent(grid_agent) \
    .with_arlgorithm(QlearningARLgorithm) \
    .with_nb_episodes(5) \
    .with_nb_of_eps_before_table_update(50) \
    .with_qtable_convergence_threshold(.001) \
    .with_nb_steps_before_timeout(25) \
    .with_random_rate(0.3) \
    .with_learning_rate(0.2) \
    .with_discount_factor(1)
    
### --- DANGER ZONE --- ###
# !!! ONLY UNCOMMENT THE LINES BELOW IF NB_EPISODES < 20 !!!
# If not, your agent will take forever to train. You'll have to factory reset your runtime and run the entire setup again. 

### --- --- --- ###
# gridworld_trainer \
#     .with_observe_agent(True) \
#     .with_observe_qtable(True) \
### --- --- --- ###


In [0]:
# -- TRAIN -- #
gridworld_trainer.train()

In [0]:
# -- RUN -- #
grid_agent.reset()
gridworld_runner = Runner() \
    .with_agent(grid_agent)

gridworld_runner.run()

### Your turn: implement Q-learning

![Pseudocode Q-learning](https://www.cse.unsw.edu.au/~cs9417ml/RL1/images/qalg.gif)  
[(Image credit)](https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html)
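
Before wiring the update into `learn()`, it may help to try the single update step from the pseudocode on plain dictionaries. The function below is a sketch under that assumption: `q_learning_update` and the nested-dict table are hypothetical and do not use the framework's `Qtable`.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      learning_rate=0.2, discount_factor=0.7):
    """Return the new Q(state, action) after one Q-learning step.

    Q(s, a) <- Q(s, a) + lr * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    old_value = q_table[state][action]
    # Off-policy target: the best value reachable from the next state.
    best_next_value = max(q_table[next_state].values())
    target = reward + discount_factor * best_next_value
    return old_value + learning_rate * (target - old_value)


q_table = {
    (0, 0): {"UP": 1.0, "DOWN": 1.0, "LEFT": 1.0, "RIGHT": 10.0},
    (1, 0): {"UP": 10.0, "DOWN": 1.0, "LEFT": 1.0, "RIGHT": 0.0},
}

# Moving RIGHT from (0, 0) to (1, 0) with an illustrative reward of -1.
new_value = q_learning_update(q_table, (0, 0), "RIGHT",
                              reward=-1.0, next_state=(1, 0))
print(new_value)
```

With these numbers the target is -1 + 0.7 * 10 = 6, so the old value of 10 moves one fifth of the way toward it, to roughly 9.2.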

In [0]:
class MyQlearning(ARLgorithm):
  def learn(self, nb_episodes, nb_of_eps_before_table_update, qtable_convergence_threshold,
          nb_steps_before_timeout, random_rate=0.3, learning_rate=0.2, discount_factor=0.7):
    pass

In [0]:
# -- AGENT SETUP -- #
my_qlearning_agent = GridAgent(Grid1.GRID)

In [0]:
# -- TRAINER SETUP -- #
my_qlearning_trainer = Trainer() \
    .with_agent(my_qlearning_agent) \
    .with_arlgorithm(MyQlearning) \
    .with_nb_episodes(5) \
    .with_nb_of_eps_before_table_update(50) \
    .with_qtable_convergence_threshold(.001) \
    .with_nb_steps_before_timeout(25) \
    .with_random_rate(0.3) \
    .with_learning_rate(0.2) \
    .with_discount_factor(1)
    
### --- DANGER ZONE --- ###
# !!! ONLY UNCOMMENT THE LINES BELOW IF NB_EPISODES < 20 !!!
# If not, your agent will take forever to train. You'll have to factory reset your runtime and run the entire setup again. 

### --- --- --- ###
# my_qlearning_trainer \
#     .with_observe_agent(True) \
#     .with_observe_qtable(True) \
### --- --- --- ###

In [0]:
# -- TRAIN -- #
my_qlearning_trainer.train()

In [0]:
# -- RUN -- #
my_qlearning_agent.reset()
my_qlearning_runner = Runner() \
    .with_agent(my_qlearning_agent)

my_qlearning_runner.run()

### Now do the same for SARSA, and train a new agent.
![Pseudocode SARSA](https://www.cse.unsw.edu.au/~cs9417ml/RL1/images/salg.gif)  
[(Image credit)](https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html)
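
The difference from Q-learning is in the target: SARSA uses the value of the action that is actually chosen in the next state, not the maximum. The sketch below shows one such update on plain dictionaries; `sarsa_update` and the table layout are hypothetical and independent of the framework.

```python
def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 learning_rate=0.2, discount_factor=0.7):
    """Return the new Q(state, action) after one SARSA step.

    Q(s, a) <- Q(s, a) + lr * (r + gamma * Q(s', a') - Q(s, a))
    """
    old_value = q_table[state][action]
    # On-policy target: the value of the action actually taken next.
    next_value = q_table[next_state][next_action]
    target = reward + discount_factor * next_value
    return old_value + learning_rate * (target - old_value)


q_table = {
    (0, 0): {"UP": 1.0, "DOWN": 1.0, "LEFT": 1.0, "RIGHT": 10.0},
    (1, 0): {"UP": 10.0, "DOWN": 1.0, "LEFT": 1.0, "RIGHT": 0.0},
}

# Same move as before, but the (exploring) agent picks DOWN in the next state.
new_value = sarsa_update(q_table, (0, 0), "RIGHT", reward=-1.0,
                         next_state=(1, 0), next_action="DOWN")
print(new_value)
```

Here the target is -1 + 0.7 * 1 = -0.3, so the result is roughly 7.94. An exploratory next action lowers the estimate, which is why SARSA tends to learn more cautious policies than Q-learning when the agent explores.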

In [0]:
class MySARSA(ARLgorithm):
  def learn(self, nb_episodes, nb_of_eps_before_table_update, qtable_convergence_threshold,
          nb_steps_before_timeout, random_rate=0.3, learning_rate=0.2, discount_factor=0.7):
    pass

In [0]:
# -- AGENT SETUP -- #
my_sarsa_agent = GridAgent(Grid1.GRID)

In [0]:
# -- TRAINER SETUP -- #
my_sarsa_trainer = Trainer() \
    .with_agent(my_sarsa_agent) \
    .with_arlgorithm(MySARSA) \
    .with_nb_episodes(5) \
    .with_nb_of_eps_before_table_update(50) \
    .with_qtable_convergence_threshold(.001) \
    .with_nb_steps_before_timeout(25) \
    .with_random_rate(0.3) \
    .with_learning_rate(0.2) \
    .with_discount_factor(1)
    
### --- DANGER ZONE --- ###
# !!! ONLY UNCOMMENT THE LINES BELOW IF NB_EPISODES < 20 !!!
# If not, your agent will take forever to train. If you interrupt the training process, you'll have to factory reset your runtime and run the entire setup again. 

### --- --- --- ###
# my_sarsa_trainer \
#     .with_observe_agent(True) \
#     .with_observe_qtable(True) \
### --- --- --- ###

In [0]:
# -- TRAIN -- #
my_sarsa_trainer.train()

In [0]:
# -- RUN -- #
my_sarsa_agent.reset()
my_sarsa_runner = Runner() \
    .with_agent(my_sarsa_agent)

my_sarsa_runner.run()

## Extra

### Define your own grid

In [0]:
# You can define your own grid, if you want!
class MyOwnGrid:
    name='my_own_grid'
    board_rows = 8
    board_columns = 8
    win_state = {Goal(7, 0)}
    lose_states = {Trap(4, 2)}
    start = Start(1, 4)
    blocked_cells = {Obstacle(1, 1), Obstacle(2, 1), Obstacle(3, 1), Obstacle(5, 5), Obstacle(5, 6), Obstacle(5, 3), Obstacle(5, 4), Obstacle(2, 6), Obstacle(3, 6), Obstacle(4, 6)}
    GRID = Grid(board_rows, board_columns, start, win_state, lose_states, blocked_cells, name)


### Sliding puzzle
You can use the algorithms you implemented to teach an agent how to solve a sliding puzzle. 

In [0]:
# -- BOARD SETUP -- #
nb_rows = 2
nb_cols = 2
puzz = SlidingPuzzle(nb_rows, nb_cols)

# -- AGENT SETUP -- #
puzzle_agent = PuzzleAgent(puzz)

In [0]:

# -- TRAINER SETUP -- #
puzzle_trainer = Trainer() \
    .with_agent(puzzle_agent) \
    .with_arlgorithm(MyQlearning) \
    .with_nb_episodes(5) \
    .with_nb_of_eps_before_table_update(50) \
    .with_qtable_convergence_threshold(.001) \
    .with_nb_steps_before_timeout(25) \
    .with_random_rate(0.3) \
    .with_learning_rate(0.2) \
    .with_discount_factor(1)

### --- DANGER ZONE --- ###
# !!! ONLY UNCOMMENT THE LINES BELOW IF NB_EPISODES < 20 !!!
# If not, your agent will take forever to train. You'll have to factory reset your runtime and run the entire setup again. 

### --- --- --- ###
# puzzle_trainer \
#     .with_observe_agent(True) \
#     .with_observe_qtable(True) \
### --- --- --- ###

In [0]:
# -- TRAIN -- #
puzzle_trainer.train()

In [0]:
# -- RUN -- #
puzzle_runner = Runner() \
    .with_agent(puzzle_agent)

print('Shuffle..')
puzzle_agent.set_agent_to_random_state(nb_actions=100)

puzzle_runner.run()