# Setting our program's arguments

A few arguments control which environment this notebook uses, what the agent does, and how much debug output is printed.

> **Environments**: For all possible environments, have a look at [OpenAI Environments](https://github.com/openai/gym/wiki/Table-of-environments)
> <br>Note, however, that I have not tested other environments, so you will likely have to tune your hyperparameters and change some code.

| Variable | Description | Default | Possible values | Type |
| :- | :- | :- | :- | :- |
|| **ENVIRONMENT OPTIONS** ||||
| **ENVIRONMENT** | The selected environment (from the environments list or just typed in by hand) | 'CartPole-v1' | A string representing a gym environment | str |
|| **DEBUG OPTIONS** ||||
| **VERBOSE** | What types of debug data are printed. 0 for no debug printing, 1 for normal debug printing and 2 for normal + GPU device placement printing (this option is very spammy and resource intensive) | 1 | 0/1/2 | int |
| **VISUALIZE['train']** | Whether to visualize training | False | - | bool |
| **VISUALIZE['evaluate']** | Whether to visualize evaluation | True | - | bool |
|| **AGENT OPTIONS** ||||
| **MODE** | What the program should be doing | 'both' | 'train' / 'evaluate' / 'both' | str |
| **SCORE_H5** | If you already have a saved model, load it from `./{ENVIRONMENT}-{SCORE_H5}.h5` | None | - | str |
| **TRAIN_STEPS** | The number of training games we wish to play | 500 | > 0 | int |

In [None]:
ENVIRONMENT = 'CartPole-v1'

# Parameters
MODE = 'both'
SCORE_H5 = None # load the .h5 file associated with this score if it exists
TRAIN_STEPS = 500

# Logging and visualisation
VERBOSE = 1
VISUALIZE = {
    'train': False,
    'evaluate': True
}

# Adding our packages

If your notebook stops here, please make sure you have TensorFlow 2.0 or newer installed.

In [None]:
import tensorflow as tf
from collections import deque
from keras_helpers import importantText, checkTensorflowSetup
import numpy as np
import random
import gym
import os

# The functions to create the model with
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Activation

# Purely for logging
from gym.spaces.discrete import Discrete

# Setting up our environment

Our model has inputs and outputs:
- Our inputs are the things we perceive, also known as the observation space (`env.observation_space`)
- Our outputs are the things we can do in the environment, also known as the action space (`env.action_space`)

There are two different types of spaces in OpenAI Gym that matter here.
- Discrete: the agent picks exactly one action out of a fixed set each step, such as push left (0) or push right (1), never something in between
- Box/Real: each action is a continuous value, so it can be partly on or partly off (0, .1, .11, .5, 1)

Box action spaces usually have fewer actions, since you have finer control over every individual action.
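
As a rough illustration of the difference, the sketch below samples one action of each kind using only the standard library. The helpers `sample_discrete` and `sample_box` are hypothetical stand-ins for this example and are not part of the Gym API.

```python
import random

def sample_discrete(n):
    """Pick one action index out of n possibilities (similar in spirit to a Discrete space)."""
    return random.randrange(n)

def sample_box(low, high, size):
    """Pick `size` continuous values between low and high (similar in spirit to a Box space)."""
    return [random.uniform(low, high) for _ in range(size)]

discrete_action = sample_discrete(2)    # a single integer: 0 or 1
box_action = sample_box(-1.0, 1.0, 3)   # three floats, each between -1 and 1
print(discrete_action, box_action)
```

A discrete action is a single index, while a box action is a list of real numbers, which is why the notebook reads `env.action_space.n` in one case and `env.action_space.shape[0]` in the other.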

In [None]:
env = gym.make(ENVIRONMENT)

# Check if we're running a Discrete (on/off) or Box action_space (0,.1,.11,1)
if isinstance(env.action_space, Discrete):
    possible_actions = env.action_space.n
    ACTION_TYPE = 'discrete (on/off)'
else:
    possible_actions = env.action_space.shape[0]
    ACTION_TYPE = 'real (.1,.5,1)'

observation_shape = env.observation_space.shape

# Log some information about the environment you're playing
print(f'You\'re playing {ENVIRONMENT}')
print(f'You have control over {possible_actions} {ACTION_TYPE} actions')
print(f'Your observable space has a shape of: {observation_shape}')

# Checking our TensorFlow setup

Here we check whether we are running on GPU or CPU and which TensorFlow version is installed.

In [None]:
checkTensorflowSetup(VERBOSE)

# Creating your model

A model always has an input, an output and a way to transform that input into the output.
The little knobs we can turn to change that transformation are called hyperparameters. Our agent has the following:

### HyperParameters
| Math Symbol | Variable | Description | Default | Possible values | Type |
| :- | :- | :- | :- | :- | :- |
| - | **inputs** | The input shape (the observation shape of the environment) | (4,) for CartPole-v1 | - | tuple |
| - | **outputs** | The output size (the number of possible actions) | 2 for CartPole-v1 | > 0 | int |
| - | **learning_rate** | The initial step size of our model | 1e-3 | > 0 | float |
| - | **memory** | The maximum number of experiences our agent can remember | 2000 | > 0 | int |
| γ | **gamma** | How much future rewards count compared to immediate rewards | 0.95 | 0 - 1 | float |
| 𝜖 | **epsilon** | The chance of our agent taking a random action instead of a predicted one | 1.0 | 0 - 1 | float |
| 𝜖 | **epsilon_low** | (epsilon lower bound) How often our agent still acts randomly once it has learned patterns | 1e-2 | 0 - <1 | float |
| 𝜖-greedy | **epsilon_decay** | The factor epsilon is multiplied by after each replay (`epsilon *= epsilon_decay`) | 0.95 | 0 - <1 | float |
| - | **batch_size** | The number of remembered experiences it trains on per replay | 32 | > 0 | int |
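
To get a feel for the epsilon settings in the table above, here is a minimal sketch that counts how many decays it takes before epsilon drops below its lower bound. The function name `decay_steps` is made up for this example; the loop mirrors the `epsilon *= epsilon_decay` rule and uses the default values from the table.

```python
def decay_steps(epsilon=1.0, epsilon_low=1e-2, epsilon_decay=0.95):
    """Count how many decays happen before epsilon stops shrinking."""
    steps = 0
    while epsilon > epsilon_low:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon

steps, final_epsilon = decay_steps()
print(steps, final_epsilon)
```

With the defaults this takes 90 decays, so the agent explores mostly at random early on and relies almost entirely on its predictions after that. A decay closer to 1 keeps the exploration phase going for longer.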

In [None]:
class Agent:
    def __init__(self, observation_shape, number_of_actions, environment_name):
        self.inputs = observation_shape
        self.outputs = number_of_actions
        self.env_name = environment_name # the model doesn't use this, it's just used as a name for the weights file
        self.memory = deque(maxlen=2000)
        self.gamma = .95
        self.epsilon = 1.0
        self.epsilon_low = 1e-2
        self.epsilon_decay = .95
        self.learning_rate = 1e-3
        self.batch_size = 32
        self.model = self._create_model()
    
    def _create_model(self):
        """ Creates a keras model
        
            Returns:
                (Sequential): A sequential keras model
        """
        # FIXME: Create a keras sequential model with self.inputs, self.outputs and a number of hidden layers
        # NOTE: feel free to use this function `Dense(..., activation = 'relu')`
        model = Sequential([
            # Add a Dense input layer with the input_shape argument equal to self.inputs
            # Add some Dense hidden layers
            # Add a Dense output layer with the first argument equal to self.outputs
        ])
        
        # FIXME: Finally compile the model using `model.compile` (this function requires the arguments: loss and optimizer)
        # For the loss function please use 'mse'
        # For the optimizer please use Adam with the learning_rate as an argument
        
        if VERBOSE >= 1:
            model.summary()
        
        return model

    def memorize(self, state, action, reward, next_state, done):
        """ stores one experience so `replay` can train on it later """
        self.memory.append((state, action, reward, next_state, done))

    def save_weights(self, score):
        """ saves the weights into a .h5 file for later use """
        self.model.save_weights(f"./{self.env_name}-{score}.h5")
        
    def load_weights(self, score):
        """ loads the weights from a .h5 file """
        self.model.load_weights(f"./{self.env_name}-{score}.h5")
        print(importantText('Loaded model'))
        
    def act(self, state):
        """
            Returns:
                (int / float):
                    An action, either predicted or random based on the epsilon trigger.
        """
        if np.random.rand() <= self.epsilon: # See if we explore (do a random action) or predict
            return random.randrange(self.outputs)
        
        prediction = self.model.predict(state)[0]
        return np.argmax(prediction)

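    def replay(self):
        """ trains the model on a random batch of remembered experiences """
        if len(self.memory) < self.batch_size:
            return # not enough experiences collected yet
        
        batch = random.sample(self.memory, self.batch_size)
        for state, action, reward, next_state, done in batch:
            # If the game ended here there is no future reward to add
            target = reward
            
            if not done:
                # Otherwise add the discounted best predicted value of the next state
                next_prediction = self.model.predict(next_state)[0]
                target = reward + self.gamma * np.amax(next_prediction)

            # Only overwrite the value of the action that was actually taken
            target_f = self.model.predict(state)
            target_f[0][action] = target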
            
            self.model.fit(
                state,
                target_f,
                verbose=0,
                epochs=1
            )
            
        # Slowly base actions more on predictions than on random actions
        if self.epsilon > self.epsilon_low:
            self.epsilon *= self.epsilon_decay
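
The target value computed inside `replay` can be tried out on its own. The sketch below is a plain-Python version of that calculation, assuming the default `gamma` of 0.95; the function name `q_target` and the example numbers are made up for illustration.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Value the taken action should be trained towards."""
    if done:
        # The game ended, so there is no future reward to add
        return reward
    # Reward now plus the discounted best predicted value of the next state
    return reward + gamma * max(next_q_values)

ongoing = q_target(1.0, [0.5, 2.0], done=False)  # 1.0 + 0.95 * 2.0
finished = q_target(1.0, [0.5, 2.0], done=True)  # just the reward
print(ongoing, finished)
```

A lower `gamma` shrinks the second term, which makes the agent care more about the immediate reward and less about what might come later.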

# Create our agent

In [None]:
agent = Agent(
    observation_shape = observation_shape,
    number_of_actions = possible_actions,
    environment_name = ENVIRONMENT
)

# If we have a saved model, load it.
if SCORE_H5 is not None and os.path.isfile(f'./{ENVIRONMENT}-{SCORE_H5}.h5'):
    agent.load_weights(SCORE_H5)

# Boiling your PC, also known as "the training phase"

<img src="https://4.bp.blogspot.com/-Cu1mJOh11AU/XAIcUyPK0WI/AAAAAAAANNA/BRlNj0Cbt6EJHNH25D4RhB0e6_sbL1Y8QCLcBGAs/s1600/28056576_10213577221682063_7572084637958860851_n.jpg" width="500px" align="right"/>

In this section all the previous items come together, and we'll use our `Agent` to predict and execute actions on the environment (`env`).

OpenAI Gym gives us a few built-in functions we can use:
- `step` (Execute an action on the environment)
- `reset` (Reset the environment)

#### What do these functions return?
The `reset` function returns the initial state of the environment.<br>
The `step` function returns the following values:
- next_state: The new state that resulted from your action
- reward: The reward you got for executing that action
- done: Whether the episode is over (level finished or a fatal action)
- info: (`_`) Extra info about the environment (we don't use it)
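
The `reset` and `step` contract described above can be sketched without Gym at all. `CountdownEnv` below is a hypothetical toy environment written for this example; it only imitates the four-value `step` return this notebook relies on (newer Gym releases changed these signatures), and a random choice stands in for `agent.act(state)`.

```python
import random

class CountdownEnv:
    """Hypothetical toy environment that ends after five steps."""

    def reset(self):
        self.steps_left = 5
        return [0.0]  # the initial state

    def step(self, action):
        self.steps_left -= 1
        next_state = [float(action)]
        reward = 1.0
        done = self.steps_left == 0
        info = {}
        return next_state, reward, done, info

toy_env = CountdownEnv()
state = toy_env.reset()
done = False
score = 0.0
while not done:
    action = random.randrange(2)  # placeholder for the agent's decision
    state, reward, done, _ = toy_env.step(action)
    score += reward
print(score)
```

The loop keeps stepping until `done` is true and adds up the rewards along the way, which is the same shape as the training loop in this notebook.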

In [None]:
if (MODE in ['train', 'both']):
    SCORES = []

    for step in range(TRAIN_STEPS + 1):
        state = env.reset()
        state = np.reshape(state, [1, observation_shape[0]])

        done = False
        score = 0
        while not done:
            if VISUALIZE['train']:
                env.render()

            # FIXME: Execute an action on the environment by using the `act` function of our `Agent` class
            action = 

            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, observation_shape[0]])

            agent.memorize(state, action, reward, next_state, done)

            state = next_state
            score += 1

        print(f"Episode: {step}/{TRAIN_STEPS} score: {score}")
        SCORES.append(score)

        # Save our weights every 100 episodes
        if step % 100 == 0 and step != 0:
            mean_score = np.mean(SCORES[-100:])
            print(f"Mean score of last 100 episodes is {mean_score}")
            agent.save_weights(mean_score)

        # Retrain on memory
        agent.replay()

# Evaluating your model

So where are the fancy things, like watching our agent in the environment?
That's where the evaluation part comes in. Here we can see our agent floundering about in the environment.

In [None]:
if (MODE in ['evaluate', 'both']):
    SCORES = []
    for step in range(101):
        state = env.reset()
        state = np.reshape(state, [1, observation_shape[0]])
        
        done = False
        score = 0
        while not done:
            if VISUALIZE['evaluate']:
                env.render()

            # FIXME: Execute an action on the environment by using the `act` function of our `Agent` class
            action = 
            
            next_state, reward, done, info = env.step(action)
            next_state = np.reshape(next_state, [1, observation_shape[0]])
            
            state = next_state
            score += 1
        
        SCORES.append(score)
    
    # NOTE: 195 is the solve threshold of CartPole-v0, adjust it for other environments
    print(importantText(f"This environment was {'solved' if np.mean(SCORES) > 195 else 'unsolved'}"))
    print(f"Score was {np.mean(SCORES)}")
          
# Lastly close the environment
env.close()

# References / Sources
The sources that were used for creating this workshop can be found below.

- Deep Q-Learning with Keras and Gym · Keon’s Blog. (2017, 6 February). Consulted on 29 February 2020, from https://keon.github.io/deep-q-learning/
- OpenAI. (2019a). Gym: A toolkit for developing and comparing reinforcement learning algorithms. Consulted on 29 February 2020, from https://gym.openai.com/
- OpenAI. (2019b, 18 April). Openai/gym. Consulted on 29 February 2020, from https://github.com/openai/gym/wiki/Table-of-environments
- Zychlinski, S. (2019, 23 February). Towards Data Science. Consulted on 7 March 2020, from https://towardsdatascience.com/the-complete-reinforcement-learning-dictionary-e16230b7d24e
- Blok, S. (2019, 2 July). Selenecodes/IPASS. Consulted on 25 February 2020, from https://github.com/selenecodes/IPASS