__Author: Christian Urcuqui__

__Date: 23 October 2018__

![image](../../Utilities/Rl_agent.png)

# Reinforcement Learning

The aim of this notebook is to understand the theory, the state of the art, the frameworks, and the applications of reinforcement learning. I'm going to use some examples and explanations from the literature listed in the references section. This notebook is divided into the following sections:

+ [Introduction](#Introduction)
+ [Frameworks](#Frameworks)

## Introduction

Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. The learner must discover which actions yield the most reward by trying them. The interaction forms a loop: actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.

These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

_Reinforcement learning_ is defined by characterizing a _learning problem_ through interaction with an environment. The objective is to reach a goal, and progress toward that goal is signaled by the rewards obtained from the environment.

The _reinforcement learning_ problem can be fully specified in terms of the optimal control of a Markov decision process, where the idea is to capture the most important aspects of the real problem facing a learning agent that interacts with its environment to achieve a goal. In effect, the agent must be able to sense the state of the environment and take actions which affect that state. The formulation includes three aspects, __sensation, action, and goal__, in their simplest forms without trivializing any of them.



<img src="https://cdn-images-1.medium.com/max/1000/1*1qBauAy9xWzNFt1N_mmfEw.gif" />

A good example of reinforcement learning is a maze, where the goal is to determine the path through the maze using exactly the right moves.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Prim_Maze.svg/1200px-Prim_Maze.svg.png" width="350" />

+ The agent is the intelligent program
+ The environment is the maze
+ The state is the place in the maze where the agent is
+ The action is the move we take to move to the next state
+ The reward is the points associated with reaching a particular state. It can be positive, negative, or zero
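
The roles listed above can be sketched as a minimal interaction loop. This is only an illustration under stated assumptions: the "maze" is a hypothetical one-dimensional corridor of five cells, and the names `step`, `run_episode`, `random_policy` and `always_right` are invented here, not taken from any library.

```python
import random

# Hypothetical one-dimensional "maze": cells 0..4, the exit is cell 4.
N_CELLS = 5
EXIT = N_CELLS - 1
MOVES = {"left": -1, "right": +1}


def step(state, action):
    """Environment: apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + MOVES[action], 0), EXIT)  # walls clamp the move
    done = next_state == EXIT
    reward = 1 if done else 0  # reward only when the exit is reached
    return next_state, reward, done


def run_episode(policy, max_steps=1000, seed=0):
    """Agent-environment loop: observe the state, act, receive a reward."""
    rng = random.Random(seed)
    state, total_reward, steps, done = 0, 0, 0, False
    while not done and steps < max_steps:
        action = policy(state, rng)                 # the agent decides
        state, reward, done = step(state, action)   # the environment responds
        total_reward += reward
        steps += 1
    return total_reward, steps


def random_policy(state, rng):
    return rng.choice(list(MOVES))


def always_right(state, rng):
    return "right"


print(run_episode(always_right))   # (1, 4): four moves to the right reach the exit
print(run_episode(random_policy))  # usually needs more steps than always_right
```

Neither policy learns anything yet; the loop only shows where the state, action and reward flow between the agent and the environment.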

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/250px-Reinforcement_learning_diagram.svg.png" />

As we can see in the next image, _reinforcement learning_ sits at the intersection of many different fields of science.
<img src="https://www.researchgate.net/profile/Ahmad_Hammoudeh3/publication/323178749/figure/fig1/AS:594073228423171@1518649503854/Reinforcement-Learning-faces-10.png" width="400" />

One of the challenges of this area is the trade-off between __exploration__ and __exploitation__. To obtain a lot of reward, the agent must prefer actions that it has tried in the past and found to be effective in producing a reward (exploitation). But to discover such actions, it has to try actions that it has not selected before (exploration). The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward.
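
A common way to balance the two is an epsilon-greedy rule. The sketch below is an illustration, not part of the original example: it assumes a hypothetical three-armed bandit with Gaussian rewards, and the function `run_bandit` and its parameter `explore_prob` are names invented here. Note that `explore_prob` is the probability of exploring, whereas the `EPSILON` constant used later in this notebook is the probability of acting greedily.

```python
import random


def run_bandit(true_means, explore_prob, n_steps=5000, seed=0):
    """Epsilon-greedy on a hypothetical bandit with unit-variance Gaussian rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each action was tried
    estimates = [0.0] * n_arms   # running mean reward per action
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < explore_prob:
            arm = rng.randrange(n_arms)                            # exploration
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploitation
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return estimates, counts, total_reward / n_steps


for p in (0.0, 0.1, 1.0):
    estimates, counts, avg = run_bandit([0.0, 0.5, 1.0], explore_prob=p)
    print(p, counts, round(avg, 3))
```

With `explore_prob=1.0` every action is sampled often, so the estimates are reliable but the reward collected is only average; with `explore_prob=0.0` the agent may lock onto an early guess. A small non-zero value is a typical compromise, although the exact numbers depend on the seed.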

We can describe the state transition process and its rewards with the following expression, which the agent needs to maximize:

$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$ where $0 \le \gamma \le 1$

At each state transition the reward may take a different value, which is why the expression above contains $r_0$, $r_1$, $r_2$, etc. Gamma ($\gamma$) is called the _discount factor_ and it determines how much weight future rewards receive:

+ A gamma value of 0 means that only the immediate reward counts, so the reward is associated with the current state only
+ A gamma value of 1 means that future rewards count as much as the immediate one, so the agent optimizes the long-term reward
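
The effect of the discount factor can be checked numerically. The following is a small sketch with made-up rewards; `discounted_return` is a helper name introduced here for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total  # accumulate from the last reward backwards
    return total


rewards = [1, 1, 1, 1]  # hypothetical rewards r_0..r_3
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# 0.0 1.0    -> only the immediate reward counts
# 0.5 1.875  -> later rewards are progressively down-weighted
# 1.0 4.0    -> all rewards count equally
```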



We are going to have two constants: _gamma_ ($\gamma$) and _lambda_ ($\lambda$).

+ Gamma is used in each state transition and keeps the same value at every state change. Gamma controls how strongly the rewards expected in later states are discounted.
+ Lambda is generally used when we are dealing with temporal difference problems. It is involved with predictions in successive states: it controls how far back along the visited states the credit for a reward is propagated. Larger values of lambda spread each update over more of the preceding states, which can make learning faster.

The environments in reinforcement learning have the following features:
+ Deterministic
+ Observable
+ Discrete or continuous
+ Single or multiagent

Solutions in RL can be single-agent or multiagent. When we are dealing with complex problems, we use multiagent reinforcement learning. Complex problems might involve different environments in which agents perform different jobs and also need to interact with each other.

Multiagent solutions are based on a non-deterministic approach, because when multiple agents interact there might be more than one option for moving to the next state, and decisions have to be made under that ambiguity. Moreover, compared to single agents, these agents face dynamic environments with many alternatives. A dynamic environment is one that changes while the agents interact with it.

<img src="https://cdn-images-1.medium.com/max/1600/1*BLCc488tqRHMWrBcIkLzGw.png" width="400"/>

## Markov Decision Process

Environments in RL are represented by a Markov Decision Process (MDP):
+ $S$ is a finite set of states.
+ $A$ is a finite set of actions.
+ $T: S \times A \times S \to [0,1]$ is a transition model that maps (state, action, state) triples to probabilities.
+ $T(s,a,s')$ is the probability that you'll land in state $s'$ if you were in state $s$ and took action $a$:

$$T(s,a,s') = P(s' \mid s,a)$$

+ $R: S \times S \to \mathbb{R}$ is a reward function that gives a real number that represents the amount of reward (or punishment) the environment will grant for a state transition.

We can define the expected utility for the agent to be the accumulated rewards it gets throughout its experience with the environment. If the agent goes through the states $s_0, s_1, \dots, s_{n-1}, s_n$, you could formally define its expected utility as follows:

$$\sum_{t=1}^{n} \gamma^t E[R(s_{t-1}, s_t)]$$
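
To make these definitions concrete, here is a minimal sketch of an MDP stored in plain dictionaries. Everything in it is an assumption made for illustration: the two states, the two actions, the probabilities and the rewards are invented, and the helper names are not part of any library. The expected utility is estimated by averaging the discounted sum over many sampled paths.

```python
import random

# Hypothetical two-state MDP, used only for illustration.
# TRANSITIONS[(s, a)] maps each next state s' to P(s' | s, a).
TRANSITIONS = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# REWARDS[(s, s')] is the reward granted for the transition s -> s'.
REWARDS = {
    ("cool", "cool"): 1.0,
    ("cool", "hot"): 2.0,
    ("hot", "cool"): 1.0,
    ("hot", "hot"): -1.0,
}


def check_transition_model(transitions):
    """Every (state, action) row must be a probability distribution."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in transitions.values())


def sample_next_state(state, action, rng):
    """Draw s' according to T(s, a, s')."""
    row = TRANSITIONS[(state, action)]
    return rng.choices(list(row), weights=list(row.values()))[0]


def discounted_utility(path, gamma):
    """Sum of gamma^t * R(s_{t-1}, s_t) for t = 1..n along one path s_0..s_n."""
    return sum(gamma ** t * REWARDS[(path[t - 1], path[t])] for t in range(1, len(path)))


def estimate_expected_utility(policy, start, n_steps, gamma, n_runs=2000, seed=0):
    """Monte Carlo estimate of the expected utility under a fixed policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        path = [start]
        for _ in range(n_steps):
            path.append(sample_next_state(path[-1], policy(path[-1]), rng))
        total += discounted_utility(path, gamma)
    return total / n_runs


print(check_transition_model(TRANSITIONS))
print(estimate_expected_utility(lambda s: "fast", "cool", n_steps=3, gamma=0.9))
```

Storing $T$ as one dictionary row per (state, action) pair keeps the check that each row sums to 1 trivial; a dense array would be the more usual choice for larger state spaces.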

### Hello world (searching for the treasure on the right)

Let's look at the following example proposed by MorvanZhou. The exercise has a scenario in which a wanderer, shown as the letter O, wants to get to the treasure T as fast as it can. It looks like this:

```
O-----T
```
The wanderer tries to find the best path to reach the treasure. During each episode, the steps the wanderer takes to reach the treasure are counted. With each episode, the behaviour improves and the number of steps declines.

+ The available actions are left or right

ACTIONS = ['left', 'right']

+ The wanderer can be considered the agent
+ The number of states (positions in the world) is limited in this example to 6

N_STATES = 6

We must pay attention to the hyperparameters in a reinforcement learning approach, they are:

+ _Epsilon_ is the greedy factor: the probability of choosing the best-known action instead of a random one
+ _Alpha_ is the learning rate
+ _Gamma_ is the discount factor

The maximum number of episodes in this case is 13. The refresh time is the pause, in seconds, between two moves when the scenario is redrawn.

To create the process from which the computer learns (it is called Q-learning), we have to build a table (it is called the Q table). The Q table holds one value for each state and action, and decisions are made based on those values.
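
Each entry of the Q table is refined with the standard tabular Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$, which is the same update performed inside the `rl()` loop further down. The sketch below applies that rule to a hypothetical three-state table stored as nested dictionaries; `q_learning_update` is a name introduced here for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if terminal else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical Q table: 3 states, actions 'left' and 'right', all zeros.
q = {s: {"left": 0.0, "right": 0.0} for s in range(3)}

# Reaching the goal from state 2 with reward 1 (terminal transition): about 0.1.
print(q_learning_update(q, 2, "right", 1.0, None, alpha=0.1, gamma=0.9, terminal=True))

# Moving right from state 1 now bootstraps from the value of state 2: about 0.009.
print(q_learning_update(q, 1, "right", 0.0, 2, alpha=0.1, gamma=0.9))
```

The second call shows how the reward found at the goal leaks backwards one state at a time, which is why the values in a learned Q table tend to shrink with the distance from the treasure.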

In [1]:
import numpy as np
import pandas as pd
import time

In [2]:
np.random.seed(2)
N_STATES = 6 # the length of the one-dimensional world
ACTIONS = ['left', 'right'] # available actions
EPSILON = 0.9 # greedy factor
ALPHA = 0.1  # learning rate
GAMMA = 0.9 # discount factor
MAX_EPISODES = 13 # maximum episodes
FRESH_TIME = 0.3 # fresh time for one move

In [16]:
def build_q_table(n_states, actions):
    table = pd.DataFrame(
    np.zeros((n_states, len(actions))), # q_table initial values
    columns = actions, # action's name
    )
    return table

build_q_table(N_STATES, ACTIONS)

   left  right
0   0.0    0.0
1   0.0    0.0
2   0.0    0.0
3   0.0    0.0
4   0.0    0.0
5   0.0    0.0


In [17]:
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy or state-action have no value
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.idxmax()    # idxmax is used instead of argmax, which behaves differently in newer versions of pandas
    return action_name

choose_action(2, build_q_table(N_STATES, ACTIONS))

'left'

Now we create the environment and determine how the agents will work within the environment

In [18]:
def get_env_feedback(S, A):
    # This is how the agent interacts with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

This function prints the wanderer and treasure hunt conditions

In [19]:
def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '-----T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)

In [20]:
# The rl() method calls the Q Learning scenario
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:

            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state

            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table

In [21]:
# let's start it 
q_table = rl()
print('\r\nQ-table:\n')
print(q_table)

                                
Q-table:

       left     right
0  0.000003  0.004320
1  0.000000  0.026123
2  0.000000  0.116579
3  0.000000  0.361124
4  0.018238  0.745813
5  0.000000  0.000000


In [3]:
import gym
env = gym.make("Taxi-v2").env
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

## Dynamic programming


In [2]:
import numpy as np
import copy

import check_test
from frozenlake import FrozenLakeEnv
from plots_utils import plot_values

ModuleNotFoundError: No module named 'check_test'

## Frameworks

In [2]:
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(100):
    env.render()
    env.step(env.action_space.sample()) # take a random action

KeyboardInterrupt: 

## References 

+ Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning (Vol. 135). Cambridge: MIT press.
+ OpenAI Gym. https://gym.openai.com
+ Nandy, A., & Biswas, M. (2017). Reinforcement Learning: With Open AI, TensorFlow and Keras Using Python.
+ https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow
+ https://www.kaggle.com/slobo777/tic-tac-toe-agent-using-q-learning