# AI-LAB LESSON 6: Deep Reinforcement Learning

In this lesson we will use the CartPole environment and we will see how to create and work with a neural network using Kears on top of Tensorflow.

## CartPole
The environment used is **CartPole** (taken from the book of Sutton and Barto as visible in the figure)

![Cartpole](images/cartpole.jpg)

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

In [1]:
import os, sys, tensorflow.keras, random, numpy
module_path = os.path.abspath(os.path.join('../tools'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gym, envs
from utils.ai_lab_functions import *
from timeit import default_timer as timer
from tqdm import tqdm as tqdm
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

The **state** of environment is represented as a tuple of 4 values: 
- *Cart Position* range from -4.9 to 4.8
- *Cart Velocity* range from -inf to +inf
- *Pole Angle* range from -24 deg to 24 deg
- *Pole Velocity* range from -inf to +inf

The **actions** allowed in the environment are 2:
- *action 0*: push cart to left
- *action 1*: push cart to right

The **reward** is 1 for every step taken, including the termination step.

In [2]:
env = gym.make("CartPole-v1")
state = env.reset()
print("STARTING STATE: {}".format(state))
print("\tCart Position: {}\n\tCart Velocity {}\n\tPole Angle {} \n\tPole Velocity {}".format(state[0], state[1], state[2], state[3]))

print("\nPOSSIBLE ACTIONS: ", env.action_space.n)

STARTING STATE: [-0.00579938 -0.0007288   0.03674978 -0.0240394 ]
	Cart Position: -0.005799378004022436
	Cart Velocity -0.000728801695898916
	Pole Angle 0.036749777798396985 
	Pole Velocity -0.024039403923742753

POSSIBLE ACTIONS:  2


Finally, we still have the standard functionalities of a Gym environment:
- step(action): the agent performs action from the current state. Returns a tuple (new_state, reward, done, info) where:
    - new_state: is the new state reached as a consequence of the agent's last action
    - reward: the reward obtained by the agent in this step
    - done: True if the episode is terminal, False otherwise
    - info: not used, you can safely discard it

- reset(): the environment is reset and the agent goes back to the starting position. Returns the initial state id

## Neural network with Kears
**Keras** is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

![Network](images/neural_networks.png)

With kears you can easly create a neural network with the **Sequential** module. Before training a neural netowrk you must compile it, selecting the loss function and the optimizer, in our experiment we will use the *mean_squared_error* for the loss function and the *adam* optimizer, that is a standard configuration for a DQN problem.

In [3]:
input_layer = 3
layer_size = 5
output_layer = 2

model = Sequential()
model.add(Dense(layer_size, input_dim=input_layer, activation="relu")) #input layer + hidden layer #1
model.add(Dense(layer_size, activation="relu")) #hidden layer #2
model.add(Dense(layer_size, activation="relu")) #hidden layer #3
model.add(Dense(layer_size, activation="relu")) #hidden layer #4
model.add(Dense(layer_size, activation="relu")) #hidden layer #5
model.add(Dense(output_layer, activation="linear")) #output layer

model.compile(loss="mean_squared_error", optimizer='adam') #loss function and optimzer definition

2022-02-09 15:33:26.583376: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-09 15:33:26.589014: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-09 15:33:26.600182: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


In Keras you can compute the output of a network with the **predict** function, that requires in input the values of the input layer nodes and returns the corresponding values of the output layer.

In [4]:
input_network = [random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)]
output_network = model.predict(np.array([input_network]))
print("Input network: {}".format(input_network))
print("network Prediction: {}".format(output_network[0]))

2022-02-09 15:33:30.321125: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-02-09 15:33:30.379390: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2899885000 Hz


Input network: [0.10617476863789388, 0.028838591879704834, 0.9263239751335113]
network Prediction: [-0.02210776 -0.03628682]


To train a network in Keras we must use the function **fit**, that take as input:
- *input*: the input of the network that we are interested to train
- *expected_output*: the output that we consider correct
- *epochs*: the number of iteration for the backpropagation (in DQN this value is always 1).

In [5]:
input_network = [random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)]
expected_output = [0, 0]

print("Prediction 'before' training:")
print(model.predict(np.array([input_network])))

model.fit(np.array([input_network]), np.array([expected_output]), epochs=1000, verbose=0)

print("\nPrediction 'after' training:")
print(model.predict(np.array([input_network])))

Prediction 'before' training:
[[-0.01937051 -0.03179402]]

Prediction 'after' training:
[[0. 0.]]


Finally, remember that for all the methods (*fit*, *predict*, ...) keras requires as input a numpy array of array, for example you must convert your state in the correct **shape**.  Kears will return, in the same way, an array of array, so to extract the corresponding ouutput layer you must select the first element.

In [6]:
state = np.array([0, 0, 0])
# model.predict(input_network) will give you a shape error
state = state.reshape(1, 3)
print("Prediction:", model.predict(state)[0])

Prediction: [0.0122903  0.00565043]


## Assignment: Q-Learning

Your first assignement is to implement all the functions nexessary for a deep q-learning algorithm. In particular you must implement the following functions: *create_model*, *train_model* and *DQN*.

#### Hint:
For the experience replay buffer you can use the python data structure *dequeue*, defining the maximum length allowed. With the *random.sample(replay_buffer, size)* function you can sample *size* element from the queue:

In [7]:
replay_buffer = deque(maxlen=10000)
for _ in range(100): replay_buffer.append(random.uniform(0, 1))
    
samples = random.sample(replay_buffer, 3) 
print("Get 3 element from replay_buffer:", samples)

Get 3 element from replay_buffer: [0.11087038738747923, 0.06471072930711519, 0.24706944929712527]


In [8]:
def create_model(input_size, output_size, hidden_layer_size, hidden_layer_number):
    """
    Create the neural network model with the given parameters
    
    Args:
        input_size: the number of nodes for the input layer
        output_size: the number of nodes for the output layer
        hidden_layer_size: the number of nodes for each hidden layer
        hidden_layer_number: the number of hidden layers
        
    Returns:
        model: the corresponding neural network
    """

    model = Sequential()

    # Layer iniziale di input + hidden layer
    model.add(Dense(hidden_layer_size, input_dim=input_size, activation="relu"))
    
    # Hidden layers intermedi
    for _ in range(1, hidden_layer_number):
        model.add(Dense(hidden_layer_size, activation="relu"))

    # Layer finale di output
    model.add(Dense(output_size, activation="linear"))

    model.compile(loss="mean_squared_error", optimizer='adam')
    return model

In [58]:
def experience_replay(neural_network, memory, batch_size, gamma=0.99):
    """
    Performs the value iteration algorithm for a specific environment
    
    Args:
        neural_network: the neural network model to train
        memory: the memory array on wich perform the training
        batch_size: the size of the batch sampled from the memory
        gamma: gamma value, the discount factor for the Bellman equation
    """    
    
    TRAINING_EPOCHS  = 1
    TRAINING_VERBOSE = 0

    # Se non abbiamo abbastanza eventi per riempire un mini-batch
    # evitiamo di fare training
    if len(memory) < batch_size:
        return

    # Ottiene mini-batch dall'esperienza
    experience = random.sample(memory, batch_size)
    for event in experience:
        current_state, action, next_state, reward, done = event
        
        # Dato uno stato passato genera la nuova risposta del NN
        nn_actions_score = neural_network.predict(np.array([current_state]))[0]
        action = numpy.argmax(nn_actions_score)

        # Calcola reward dell'azione a seconda che lo stato sia terminale o no
        if done == True:
            nn_actions_score[action] = reward
        else:
            max_q = max(neural_network.predict(np.array([next_state]))[0])
            nn_actions_score[action] = reward + max_q * gamma

        # Training sulle esperienze
        target_actions_score = np.array([nn_actions_score])
        neural_network.fit(np.array([current_state]), target_actions_score, 
            epochs=TRAINING_EPOCHS, verbose=TRAINING_VERBOSE)
    
    return model

In [59]:
def DQN(environment, neural_network, trials, goal_score, batch_size, epsilon_decay=0.9995):
    """
    Performs the Q-Learning algorithm for a specific environment on a specific neural network model
    
    Args:
        environment: OpenAI Gym environment
        neural_network: the neural network to train
        trials: the number of iterations for the training phase
        goal_score: the minimum score to consider 'solved' the problem
        batch_size: the size of the batch sampled from the memory
        epsilon_decay: the dacay value of epsilon for the eps-greedy exploration
        
    Returns:
        score_queue: 1-d dimensional array of the reward obtained at each trial step
    """


    epsilon = 1.0; epsilon_min = 0.01
    score = 0; score_queue = []

    EXPERIENCE_BUFFER_LEN = 10000
    experience_buffer = deque(maxlen = EXPERIENCE_BUFFER_LEN)

    trial_step  = 0
    for trial in range(trials):
        
        state   = environment.reset()
        done    = False
        epsilon = 1.0

        # Numero di azioni prese per esplorazione o sfruttamento
        exploration  = 0
        exploitation = 0

        # Episodio di addestramento
        trial_score = 0
        while not done:

            # Output del NN
            nn_actions_score = neural_network.predict(np.array([state]))[0]
                        
            # Con probabilità epsilon sceglie di esplorare e con prob. (1 - epsilon) di sfruttare
            if random.uniform(0.0, 1.0) > epsilon:
                action = numpy.argmax(nn_actions_score)
                exploitation += 1
            else:
                action = numpy.argmin(nn_actions_score)
                exploration += 1

            # Rendiamo sempre meno probabile l'esplorazione
            epsilon = max(epsilon * epsilon_decay, epsilon_min)

            # Esegue l'azione scelta e salva il risultato nella cache di esperienza
            next_state, reward, done, _ = environment.step(action)
            experience_buffer.append([state, action, next_state, reward, done])
            
            trial_score += reward
            state = next_state

            # Addestra il NN ogni 4 step
            trial_step += 1
            if trial_step % 4 == 0:
                experience_replay(neural_network, experience_buffer, batch_size)
            

        score_queue.append(trial_score)
        if score > goal_score or score_queue == []: 
            break
        
        print("Episode: {:7.0f}, Score: {:3.0f}, ExpBufferSize: {}, Exploitation: {}, Exploration: {}, EPS: {:3.5f}"
            .format(trial, score_queue[-1], len(experience_buffer), str(exploitation), str(exploration), epsilon))
    
    return neural_network, score_queue

In [60]:
env = gym.make("CartPole-v1")
neural_network = create_model(4, 2, 32, 2)
neural_network, score = DQN(env, neural_network, trials=1000, goal_score=130, batch_size=64)

Episode:       0, Score:   9, ExpBufferSize: 9, Exploitation: 0, Exploration: 9, EPS: 0.99551
Episode:       1, Score:  11, ExpBufferSize: 20, Exploitation: 0, Exploration: 11, EPS: 0.99451
Episode:       2, Score:  10, ExpBufferSize: 30, Exploitation: 0, Exploration: 10, EPS: 0.99501
Episode:       3, Score:   9, ExpBufferSize: 39, Exploitation: 0, Exploration: 9, EPS: 0.99551
Episode:       4, Score:   8, ExpBufferSize: 47, Exploitation: 0, Exploration: 8, EPS: 0.99601


KeyboardInterrupt: 

## Execution
The following code executes the DQN and plots the reward function, the execution could require up to 10 minutes on some computer. A more efficent version of the code can be found [here](https://github.com/d-corsi/BasicRL).
Correct results for comparison can be found here below. Notice that since the executions are stochastic the charts could differ: the important thing is the global trend and the final convergence to a visible reward improvement.

In [41]:
rewser = []
window = 10

score = rolling(np.array(score), window)
rewser.append({"x": np.arange(1, len(score) + 1), "y": score, "ls": "-", "label": "DQN"})
plot(rewser, "Rewards", "Episodes", "Rewards")

NameError: name 'score' is not defined

**Standard DQN on CartPole results:**
<img src="images/results-dqn.png" width="600">