# Exemplary Training: Q-Learning, SARSA and DQN



This notebook demonstrates the training process of an agent in practice. For simplicity, only the stowage planning problem for a small RoRo deck is shown, using the loading list outlined in the corresponding thesis. The algorithms use default values, which are listed in the annex of the thesis.

In contrast, the notebook `Example.ipynb` shows how an already trained model can be used in practice.


***
## Imports

First, various modules are imported, including agent classes, environment classes, a plotting unit and a logger.

In [1]:
import os
import sys
sys.path.insert(0, os.path.abspath('../'))
module_path = str(os.getcwd())+'\\out\\'

from analysis.plotter import Plotter
from analysis.evaluator import *
from analysis.loggingUnit import LoggingBase
from env.roroDeck import RoRoDeck
from agent import sarsa, tdq, dqn
from analysis.algorithms import *

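import logging

import numpy as np  # used for the loading lists below; imported explicitly instead of relying on the wildcard imports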

Using TensorFlow backend.


***
## Register logger and set training duration

The first step is to register the logging unit. This also sets the output path where the trained models will be saved. Furthermore, the number of episodes for which the agent is trained has to be chosen. If it is not set, the agent falls back on default values.

In [2]:
# Register Output path and Logger
loggingBase = LoggingBase()
module_path = loggingBase.module_path

print('Training outputs will be save to:\n'+module_path)

Training outputs will be save to:
C:\Users\braun\Documents\Masterarbeit\analysis\out\20201118\2049\2049


In [3]:
number_of_episodes = 6_000

In [10]:
# Choose algorithm from list
algorithms = ['SARSA','TDQ','DQN']
algorithm = algorithms[2]

***
## Initialise the environment

Secondly, the environment is initialised. The user can choose the size of the environment and whether it should behave stochastically. If `stochastic` is set to true, the environment behaves deterministically with probability $p$ (`env.p`), in the sense that the cargo type chosen by the agent is actually loaded. With probability $1-p$, a random cargo type is loaded instead.

**Note:** In the thesis the environment is assumed to be deterministic, since deviations are not expected to happen regularly.
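
The stochastic behaviour described above can be sketched as follows. This is a minimal illustration and not the `RoRoDeck` implementation: the helper `resolve_action`, the uniform draw over all cargo types and the value $p = 0.9$ are assumptions made for this example.

```python
import random


def resolve_action(chosen_type, cargo_types, p, rng=random):
    """Return the cargo type that is actually loaded.

    With probability p the agent's choice is kept. Otherwise a cargo type
    is drawn uniformly at random (assumption: the real environment may
    sample differently). Hypothetical helper, not part of RoRoDeck.
    """
    if rng.random() < p:
        return chosen_type
    return rng.choice(cargo_types)


rng = random.Random(0)            # fixed seed for reproducibility
cargo_types = [0, 1, 2, 3, 4]     # illustrative cargo type ids
outcomes = [resolve_action(2, cargo_types, p=0.9, rng=rng) for _ in range(10_000)]
share_kept = sum(o == 2 for o in outcomes) / len(outcomes)
print(f"chosen type loaded in {share_kept:.1%} of steps")
```

Because the uniform draw can also return the chosen type, the chosen type is loaded slightly more often than $p$ in this sketch.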

The `vehicle_data` variable corresponds to the loading list, which may be changed by the user. After every change the environment needs to be reset.

The `reset()`-method returns the representation of the initial state.

In [11]:
l1 = np.array([[ 0,  1,  2,  3,  4],
       [ 5,  5, -1, -1,  2],
       [ 1,  1,  0,  0,  1],
       [ 1,  2,  1,  2,  2],
       [ 3,  4,  2,  3,  2],
       [ 0,  0,  0,  0,  1]])

l2 = np.array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 5,  5, -1, -1,  2,  2,  2],
       [ 1,  1,  0,  0,  1,  1,  1],
       [ 1,  2,  1,  2,  2,  1,  2],
       [ 2,  3,  2,  3,  2,  2,  3],
       [ 0,  0,  0,  0,  1,  0,  0]])



env = RoRoDeck(lanes=8, rows=12)

env.reset()
env.render()

-----------Loading Sequence----------------------------------------------------------------
X	X	X	X	X	X	X	X	

X	X	X	-	-	X	X	X	

X	X	-	-	-	-	X	X	

X	-	-	-	-	-	-	X	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-----------VehicleType--------------------------------------------------------------------
X	X	X	X	X	X	X	X	

X	X	X	-	-	X	X	X	

X	X	-	-	-	-	X	X	

X	-	-	-	-	-	-	X	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-----------Destination--------------------------------------------------------------------
X	X	X	X	X	X	X	X	

X	X	X	-	-	X	X	X	

X	X	-	-	-	-	X	X	

X	-	-	-	-	-	-	X	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	

-	-	-	-	-	-	-	-	




***
## Train the agent

Train the agent on the environment.
The user may choose between different algorithms:
- TDQ-Learning
- SARSA
- Deep Q-Learning (DQN)

The training is started by calling `agent.train()`. After the last training episode the `train()`-method will show a grid representation of the final stowage plan.

**Important Note:** The run time of this method may depend on how much memory is already used by Jupyter notebooks and on the browser settings. The `main()`-methods of `tdq.py`, `sarsa.py` and `dqn.py` demonstrate the same usage and might run faster.
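
To make the difference between the two tabular algorithms listed above concrete, the following sketch contrasts the textbook update rules of Q-learning and SARSA. It is not taken from `tdq.py` or `sarsa.py`: the function names, the tiny Q-table and the values of `alpha` and `gamma` are assumptions chosen for illustration, and terminal states are ignored.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])


# Two identical toy Q-tables with 3 states and 2 actions (illustrative values)
Q_q = np.zeros((3, 2))
Q_s = np.zeros((3, 2))
Q_q[1] = Q_s[1] = [1.0, 5.0]

# Same transition for both; SARSA's next action is the non-greedy one
q_learning_update(Q_q, s=0, a=0, r=-1.0, s_next=1)
sarsa_update(Q_s, s=0, a=0, r=-1.0, s_next=1, a_next=0)
print(Q_q[0, 0], Q_s[0, 0])
```

With the same transition, Q-learning moves the estimate towards the value of the best next action, whereas SARSA moves it towards the value of the next action that was actually selected. DQN replaces the table by a neural network but keeps the Q-learning style target.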

In [14]:
# If DQN is used the number of episodes should not exceed roughly 14_000 (default value)
# to solve the problem in reasonable time if GPU cannot be used for training

# Compare strings with ==; `is` checks object identity and is not reliable here
if algorithm == 'DQN':
    assert 5_000 <= number_of_episodes <= 14_000


print('Train agent with '+algorithm+'\n')

if algorithm == 'SARSA':
    agent = sarsa.SARSA(env, module_path, number_of_episodes)
elif algorithm == 'TDQ':
    agent = tdq.TDQLearning(env, module_path, number_of_episodes)
else:
    agent = dqn.DQLearningAgent(env=env, module_path=module_path,
                                number_of_episodes=number_of_episodes,
                                layers=[128, 128])

# Call train-method
model, total_rewards, vehicle_loaded, eps_history, state_expansion = agent.train()
# Save model to output path
agent.save_model(module_path)

print(agent.get_info())

Train agent with DQN

episode  10 score -439.98 	 illegal moves 8 	 avg. score -427.25
episode  20 score -443.98 	 illegal moves 8 	 avg. score -432.07
episode  30 score -403.99 	 illegal moves 8 	 avg. score -431.27
episode  40 score -473.98 	 illegal moves 8 	 avg. score -431.93
episode  50 score -403.98 	 illegal moves 8 	 avg. score -433.31
episode  60 score -391.98 	 illegal moves 8 	 avg. score -435.09
episode  70 score -395.98 	 illegal moves 8 	 avg. score -434.80
episode  80 score -413.98 	 illegal moves 8 	 avg. score -434.62
episode  90 score -405.98 	 illegal moves 8 	 avg. score -434.62
episode  100 score -403.98 	 illegal moves 8 	 avg. score -434.40
episode  110 score -429.98 	 illegal moves 8 	 avg. score -436.22
episode  120 score -411.98 	 illegal moves 8 	 avg. score -434.63
episode  130 score -455.97 	 illegal moves 8 	 avg. score -436.02
episode  140 score -413.98 	 illegal moves 8 	 avg. score -436.24
episode  150 score -441.98 	 illegal moves 8 	 avg. score -434.

episode  1250 score -297.98 	 illegal moves 5 	 avg. score -411.55
episode  1260 score -449.97 	 illegal moves 8 	 avg. score -410.79
episode  1270 score -378.03 	 illegal moves 5 	 avg. score -410.08
episode  1280 score -387.97 	 illegal moves 8 	 avg. score -410.38
episode  1290 score -401.98 	 illegal moves 8 	 avg. score -410.34
episode  1300 score -423.98 	 illegal moves 8 	 avg. score -409.07
episode  1310 score -433.98 	 illegal moves 8 	 avg. score -407.92
episode  1320 score -403.97 	 illegal moves 8 	 avg. score -408.46
episode  1330 score -427.97 	 illegal moves 8 	 avg. score -409.69
episode  1340 score -212.01 	 illegal moves 4 	 avg. score -407.09
episode  1350 score -391.98 	 illegal moves 8 	 avg. score -406.16
episode  1360 score -412.01 	 illegal moves 6 	 avg. score -405.23
episode  1370 score -405.97 	 illegal moves 8 	 avg. score -405.15
episode  1380 score -413.97 	 illegal moves 8 	 avg. score -406.38
episode  1390 score -518.03 	 illegal moves 7 	 avg. score -40

episode  2480 score -288.05 	 illegal moves 6 	 avg. score -331.71
episode  2490 score -100.03 	 illegal moves 2 	 avg. score -324.10
episode  2500 score -364.01 	 illegal moves 6 	 avg. score -321.49
episode  2510 score -411.97 	 illegal moves 8 	 avg. score -325.73
episode  2520 score -352.01 	 illegal moves 6 	 avg. score -327.94
episode  2530 score -302.02 	 illegal moves 5 	 avg. score -325.21
episode  2540 score -362.06 	 illegal moves 7 	 avg. score -323.75
episode  2550 score -238.05 	 illegal moves 5 	 avg. score -326.04
episode  2560 score -399.97 	 illegal moves 8 	 avg. score -328.24
episode  2570 score -338.05 	 illegal moves 7 	 avg. score -323.61
episode  2580 score -176.02 	 illegal moves 2 	 avg. score -321.05
episode  2590 score -262.01 	 illegal moves 5 	 avg. score -325.77
episode  2600 score -387.98 	 illegal moves 8 	 avg. score -327.25
episode  2610 score -275.99 	 illegal moves 4 	 avg. score -323.41
episode  2620 score -377.97 	 illegal moves 8 	 avg. score -30

episode  3710 score -26.05 	 illegal moves 1 	 avg. score -209.73
episode  3720 score -415.97 	 illegal moves 8 	 avg. score -214.34
episode  3730 score -128.05 	 illegal moves 2 	 avg. score -204.19
episode  3740 score -238.06 	 illegal moves 5 	 avg. score -203.16
episode  3750 score -375.97 	 illegal moves 8 	 avg. score -197.10
episode  3760 score -50.03 	 illegal moves 1 	 avg. score -190.36
episode  3770 score -0.03 	 illegal moves 0 	 avg. score -185.16
episode  3780 score -238.01 	 illegal moves 5 	 avg. score -189.67
episode  3790 score -262.02 	 illegal moves 5 	 avg. score -191.37
episode  3800 score -100.01 	 illegal moves 2 	 avg. score -191.59
episode  3810 score -401.97 	 illegal moves 8 	 avg. score -192.66
episode  3820 score -150.06 	 illegal moves 3 	 avg. score -183.90
episode  3830 score -76.05 	 illegal moves 2 	 avg. score -182.84
episode  3840 score -377.97 	 illegal moves 8 	 avg. score -183.75
episode  3850 score -389.97 	 illegal moves 8 	 avg. score -190.76


episode  4940 score -290.03 	 illegal moves 5 	 avg. score -130.45
episode  4950 score -288.02 	 illegal moves 6 	 avg. score -131.14
episode  4960 score -138.02 	 illegal moves 3 	 avg. score -131.70
episode  4970 score -38.05 	 illegal moves 1 	 avg. score -126.83
episode  4980 score -26.06 	 illegal moves 1 	 avg. score -123.62
episode  4990 score -38.05 	 illegal moves 1 	 avg. score -122.04
episode  5000 score -26.03 	 illegal moves 1 	 avg. score -113.46
episode  5010 score -76.06 	 illegal moves 2 	 avg. score -110.39
episode  5020 score -126.03 	 illegal moves 3 	 avg. score -110.87
episode  5030 score -0.02 	 illegal moves 0 	 avg. score -105.24
episode  5040 score -226.06 	 illegal moves 5 	 avg. score -106.87
episode  5050 score -26.06 	 illegal moves 1 	 avg. score -102.61
episode  5060 score 11.95 	 illegal moves 0 	 avg. score -92.02
episode  5070 score -375.98 	 illegal moves 8 	 avg. score -101.20
episode  5080 score -126.06 	 illegal moves 3 	 avg. score -109.22
episod

***
## Plot training performance

This will plot:
1. Reward over time
2. The size the Q-table would have if this were a tabular method (state expansion).
3. The steps to finish (which also shows how many vehicles are loaded onto the deck).
4. The $\epsilon$-development over time for $\epsilon$-greedy exploration.
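
As a rough illustration of the $\epsilon$-development in item 4, the sketch below combines $\epsilon$-greedy action selection with a multiplicative decay. The decay schedule and all parameter values (`eps_start`, `eps_min`, `decay`) are assumptions for this example; the agents in this repository may use a different schedule.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def decay_schedule(eps_start, eps_min, decay, n_episodes):
    """Multiplicative decay per episode, clipped at eps_min (hypothetical schedule)."""
    eps, history = eps_start, []
    for _ in range(n_episodes):
        history.append(eps)
        eps = max(eps_min, eps * decay)
    return history


rng = np.random.default_rng(0)
sketch_eps = decay_schedule(eps_start=1.0, eps_min=0.05, decay=0.999, n_episodes=6_000)
action = epsilon_greedy(np.array([0.1, 0.7, 0.2]), sketch_eps[-1], rng)
print(sketch_eps[0], sketch_eps[-1], action)
```

Early episodes are dominated by exploration, while later episodes mostly exploit the learned values once $\epsilon$ has reached its lower bound.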

**Note:** The smoothing window smooths the output plots to make trends more visible. Each plotted value is the average of the last $n$ iterations, where $n$ is the value of the variable `smoothing_window`. Using a smoothing window is highly recommended.
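
The averaging over the last $n$ iterations can be sketched as a trailing mean. How `Plotter` implements the smoothing internally is not shown in this notebook, so the function `smooth` and the reward values below are purely illustrative assumptions.

```python
import numpy as np


def smooth(values, window):
    """Trailing mean over the last `window` entries (fewer at the start)."""
    values = np.asarray(values, dtype=float)
    csum = np.cumsum(np.insert(values, 0, 0.0))  # csum[k] = sum of first k values
    out = np.empty_like(values)
    for i in range(len(values)):
        lo = max(0, i + 1 - window)              # first index inside the window
        out[i] = (csum[i + 1] - csum[lo]) / (i + 1 - lo)
    return out


example_rewards = [-400, -380, -420, -300, -100, -50]  # made-up values, not training results
smoothed = smooth(example_rewards, window=3)
print(smoothed)
```

A larger window removes more noise but also delays the point at which an improvement becomes visible in the plot.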

In [None]:
plotter = Plotter(module_path, number_of_episodes, algorithm=algorithm,
                  smoothing_window=200, show_plot=True, show_title=True)
plotter.plot(total_rewards, state_expansion, vehicle_loaded, eps_history)

***
## Evaluation

Evaluate the final stowage plan that results from always executing the best action.

An optimal stowage plan would result in:

> Mandatory Cargo Loaded: 1.0

This stowage plan will load 100% of the mandatory cargo and ...

> Number of Shifts: 0.0

... causes zero shifts.

> Space Utilisation: 1.0

Moreover, 100% of the available space would be used.
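
To give an intuition for two of these ratios, the following sketch computes them on toy data. It does not reproduce the `Evaluator` class: the grid encoding (`-1` for cells outside the deck, `0` for free, `1` for occupied), both function names and the example numbers are assumptions made for this illustration.

```python
import numpy as np


def space_utilisation(grid):
    """Share of usable cells that are occupied (hypothetical encoding, see above)."""
    usable = grid != -1
    return np.count_nonzero(grid[usable] == 1) / np.count_nonzero(usable)


def mandatory_cargo_loaded(loaded, required):
    """Share of required mandatory units that were loaded, per cargo type capped at the requirement."""
    loaded, required = np.asarray(loaded), np.asarray(required)
    if required.sum() == 0:
        return 1.0
    return float(np.minimum(loaded, required).sum() / required.sum())


toy_grid = np.array([[-1, 1, 1, -1],
                     [ 1, 1, 0,  1],
                     [ 1, 1, 1,  1]])
utilisation = space_utilisation(toy_grid)
mandatory = mandatory_cargo_loaded(loaded=[5, 3], required=[5, 5])
print(utilisation, mandatory)
```

In this toy case one usable cell stays free and two mandatory units are missing, so both ratios fall below the optimum of 1.0.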


The final stowage plan created by the agent evaluates as follows:

In [None]:
evaluator = Evaluator(env.vehicle_data, env.grid)
evaluation = evaluator.evaluate(env.get_stowage_plan())
print(evaluation)

In [None]:
metrics, info = training_metrics(total_rewards)
print(info)

In [None]:
env.render()