__Author: Christian Urcuqui__

__Date: 23 october 2018__

![image](../../Utilities/Rl_agent.png)

# Reinforcement Learning

The aim of this notebook is to understand the theory, the state of art, it's frameworks and applications of reinforcement learning. I'm going to use some examples and the explanations from the literature that you can see in the references section. This notebook is divided in the next sections:

+ [Introduction](#Introduction)
+ [frameworks](#Frameworks)

## Introduction

Reinforcement learning is learning what to do - how to map situations to actions - so as to maximaze a numerical reward signal. The learner must discover which actions yield the most reward by trying them. The idea here is the loop of cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. 

We can mention two characteristics - trial-and-error search delayed reward - are two most important distinguishing features of reinforcement learning. 

_Reinforcement Learning_ is defined by characterizing a _learning problem_ through the interaction with environments. The objective of this approach is to reach a goal which is reaching by rewards obtained from the environment.

A full specification of _reinforcement learning_ problem in terms of optimal control of Markov decision process, where the idea is to capture the most important aspects of the real problem facing a learning agent interacting with its environment  to archieve the goal. In effect, this agent must be able to sense the state of the environment and take actions wich affect the state. The formulation includes three aspects -__sensation, action, and goal__- in their simplest forms without trivializing any of them. 



<img src="https://cdn-images-1.medium.com/max/1000/1*1qBauAy9xWzNFt1N_mmfEw.gif" />

A good example of reinforcement learning is a maze, where the idea is determining the path with the exact right moves.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Prim_Maze.svg/1200px-Prim_Maze.svg.png" width="350" />

+ The agent is the intelligent program
+ The environment is the maze
+ The state is the place in the maze where the agent is
+ The action is the move we take to move to the next state
+ The reward is the points associated with reaching a particular state. It can be positive, negative, or zero

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/250px-Reinforcement_learning_diagram.svg.png" />

As we can see in the next image, _reinforcement learning_ sits at the intersaction of many different fields of science.
<img src="https://www.researchgate.net/profile/Ahmad_Hammoudeh3/publication/323178749/figure/fig1/AS:594073228423171@1518649503854/Reinforcement-Learning-faces-10.png" width="400" />

One challenges of this area is the trade-off between __exploration__ and __exploitation__. To obtain a lot of reward, the agent must prefer actions that it has tried in the past and found to be effective in producing a reward (exploitation). But to discover such actions, it has to try actions that it has not selected before (exploration). The dilemma is that neither exploration nor exploitation ca be pusued exclusively without failing at the task. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. 

We can see the state transition process and the interaction of each feature as the next equation which needs to maximize:

$r0 + \gamma r1 + \gamma r2 + ... +$ where $0 < \gamma < 1$

At reach state transition, the reward is a different value in each step, that is the reason that we have in the last expresion r0, r1, r2, etc. Gamma ($\gamma$) is called a _discount factor_ and it determines what future reward types we get:

+ A value of gamma 0 means the reward is associated with the current state only
+ A gamma value of 1 means that the reward is long-term



We are going to have two constants, these are _gamma_($\gamma$) and _lambda_ ($\lambda$).

+ Gamma is used in each state transition and is a constant value at each state change. Gamma allows you to give information bout the type of reward you will be getting in every state.
+ Lambda is generally used when we are dealing with temporal difference problems. It is more involved with predictions in successive states. When lambda increases we can infer that the algorithm is learning fast.

The environments in the Reinforcement Learning have the next features:
+ Deterministic
+ Observable
+ Discrete or continiouous
+ Single or multiagent

Solutions in RL can be of single agent types or multiagent types, when we are dealing with complex problems, we use multiagent Reinforcement Learning. Complex problems might have different environments where the agent is doing different jobs to get involved in RL and the agent also wnats to interact. 

Multiagent solutions are based on the non-deterministic approach, because when the multiagent interact, there might be more than one option to change or move to the next state and we have to make decisions based on that ambiguity. Moreover, these agents have dynamic environments comparated to single angents with a lot alernativies. Dynamic environments can involve changing environments in the places to interact with. 

<img src="https://cdn-images-1.medium.com/max/1600/1*BLCc488tqRHMWrBcIkLzGw.png" width="400"/>

_Reinforcement Learning system_ has four main subelements: a policy, a reward function, a value function, and, optionally, a model of the environment.

A _policy_ is a mapping from perceived states of the environment to actions to be taken when in those states (it is called in psychology a set of of stimulus-response rules or associations). In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. 

A _reward function_ defines the goal in a reinforcement learning problem, in other words, it maps each perceived state of the environment to a single number, a _reward_, indicating the intrinsic desirability of that state. A reinforcement learning agent's objective is to maximize the total reward it receives in the long run. The reward function defines what are the good and bad events for the agent. 

A _value function_ specifies what is good in the long run. Roughly speaking, it is the total amount of reward an agent can expect to accumulate over the future, starting from that state. 

A _model_ is used for planning, where the model can find information about the resultant next state and the next reward for a current state and action. 

## Markov Decision Process

Environments in RF are represented by the Markov Decision Process (MDP).
+ SS is a finite set of states. 
+ AA is a finite set of actions.
+ $T:S×A×S→[0,1]T:S×A×S→[0,1]$ is a transition model that maps (state, action, state) triples to probabilities.
+ T(s,a,s′)T(s,a,s′) is the probability that you’ll land in state s′s′ if you were in state ss and took action aa.

`
T(s,a,s′)=P(s′|s,a)T(s,a,s′)=P(s′|s,a)
`
$R:S×S→RR:S×S→R$ is a reward function that gives a real number that represents the amount of reward (or punishment) the environment will grant for a state transition. 

We can define the expected utility for the agent to be the accumulated rewards it gets throughout its experience with the environment. If the agent goes through the states s0,s1,…,sn−1,sns0,s1,…,sn−1,sn, you could formally define its expected utility as follows:

$\sum nt = 1\gamma t \epsilon [R(st−1,s\theta)] \sum t =1n \gamma t \epsilon [R(st−1,st)]$

### hello world (searching the treasure on right)

Let's see the next example proposed by MorvanZhou, the exercise has an escenario where we can find a letter O as wondered which wants to get the trasure T as fast as it can, it look like this:

```
O-----T
```
The wanderer tries to find the best path to reach the treasure, during each episode, the steps the wanderer takes to reach the treasure are counted. With each episode, the condition improves and the number of steps declines. 

+ The available actions are left or right

ACTIONS = ['left', 'right']

+ The wandered can be considered the agent 
+ The number of states (steps) is limited in this example to 6

N_STATES = 6

We must pay attention to the hyperparameters in a reinforcement learning approach, they are:

+ _Epsilon_ is the greedy factor
+ _Alpha_ is the learning rate
+ _Gamma_ is the discount factor

The maximum number of episodes in this case is 13. The refresh rate is when the scenario is refreshed

To create the process (it is called Q Learning) from which the computer learns, we have to formulate a table (it is called Q table). All the key elements are stored in the Q table and the decisions are made based on the Q table.

In [1]:
import numpy as np
import pandas as pd
import time

In [2]:
np.random.seed(2)
N_STATES = 6 # the length of the dimensional world
ACTIONS = ['left', 'right'] # available actions
EPSILON = 0.9 # greedy factor
ALPHA = 0.1  # learning rate
GAMMA = 0.9 # discount factor
MAX_EPISODES = 13 # maximum episodes
FRESH_TIME = 0.3 # fresh time for one move

In [16]:
def build_q_table(n_states, actions):
    table = pd.DataFrame(
    np.zeros((n_states, len(actions))), # q_table initial values
    columns = actions, # action's name
    )
    return table

build_q_table(N_STATES, ACTIONS)

Unnamed: 0,left,right
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0
5,0.0,0.0


In [17]:
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy or state-action have no value
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.idxmax()    # replace argmax to idxmax as argmax means a different function in newer version of pandas
    return action_name

choose_action(2, build_q_table(N_STATES, ACTIONS))

'left'

Now we create the environment and determine how the agents will work within the environment

In [18]:
def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

This function prints the wanderer and treasure hunt conditions

In [19]:
def update_env(S, episode, step_counter):
    # This is how environment be updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)

In [20]:
# The rl() method calls the Q Learning scenario
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:

            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state

            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table

In [21]:
# let's start it 
q_table = rl()
print('\r\nQ-table:\n')
print(q_table)

                                
Q-table:

       left     right
0  0.000003  0.004320
1  0.000000  0.026123
2  0.000000  0.116579
3  0.000000  0.361124
4  0.018238  0.745813
5  0.000000  0.000000


## An extended Example: Tic-tac-toe

How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its changes of winning?

First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. 

In [1]:
import numpy as np
import pandas as pd
import time
import sys

In [52]:
np.random.seed(128)
N_STATES = 9 # the length of the dimensional world
ACTIONS = [] # available actions
for i in range(3):
    ACTIONS += [[0, i], [1, i], [2, i]]
EPSILON = 0.9 # greedy factor
ALPHA = 0.1  # learning rate
GAMMA = 0.9 # discount factor
MAX_EPISODES = 50 # maximum episodes
FRESH_TIME = 0.3 # fresh time for one move

Second, we are going to define the Q table to this problem

In [3]:
def build_q_table(n_states, actions):
    table = pd.DataFrame(
    np.zeros((n_states, len(actions))), # q_table initial values
    columns = np.arange(0,9), # action's name
    )
    return table
build_q_table(N_STATES, ACTIONS)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy or state-action have no value
        action_name = ACTIONS[np.random.randint(0, len(ACTIONS))]        
    else:   # act greedy
        action_name = [state, state_actions.idxmax()]    # replace argmax to idxmax as argmax means a different function in newer version of pandas
    return action_name

choose_action(5, pd.DataFrame(
    np.zeros((N_STATES, len(ACTIONS))), # q_table initial values
    columns = np.arange(0,9), # action's name
    ))

[1, 0]

In [37]:
def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    lineas = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 3, 6], [1, 4, 7], [2, 5,8], [0, 4, 8], [2, 4, 6]]
    for linea in lineas:
        if A[linea[0]] == A[linea[1]] and A[linea[0]] == A[linea[2]] and A[linea[0]] != ' ':
            if S == N_STATES - 3:
                R = 1
                S_ = 'terminal'
            else:
                S_ = 'terminal'
                R = 0.5
        else:
            R = 0            
            S_ = S + 1
        return S_, R

get_env_feedback(2, ['X','X',' ',' ',' ',' ',' ',' ',' '])

(3, 0)

In [6]:
def update_env(S, episode, step_counter):
    global jugada_maquina
    # This is how environment be updated    
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
       
        print('total_steps = %s' %S)
        time.sleep(FRESH_TIME) 

In [7]:
def ver_tablero(board):

    print('   |   |')
    print(' ' + board[0] + ' | ' + board[1] + ' | ' + board[2])
    print('   |   |')
    print('-----------')
    print('   |   |')
    print(' ' + board[3] + ' | ' + board[4] + ' | ' + board[5])
    print('   |   |')
    print('-----------')
    print('   |   |')
    print(' ' + board[6] + ' | ' + board[7] + ' | ' + board[8])
    print('   |   |')  

In [8]:
def game_over(tablero):
    # hay tablas?
    no_tablas = False
    for i in range(0, len(tablero)):
        if tablero[i] == ' ':
            no_tablas = True
            
    # hay ganador?
    if ganador(tablero) == '0' and no_tablas:
        return False
    else:
        return True

In [9]:
def ganador(tablero):
    # combinaciones de estados de ganadores
    lineas = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 3, 6], [1, 4, 7], [2, 5,8], [0, 4, 8], [2, 4, 6]]
    ganador = '0'
    for linea in lineas:
        if tablero[linea[0]] == tablero[linea[1]] and tablero[linea[0]] == tablero[linea[2]] and tablero[linea[0]] != ' ':
            ganador = tablero[linea[0]]
    return ganador

In [46]:
def juega_ordenador(S, tablero):
    global jugada_maquina
    exit = True
    while(exit):        
        A = choose_action(S, q_table)
        if [0,0] == A:
            jugada_maquina = 0
        elif [0,1] == A:
            jugada_maquina = 1 
        elif [0,2] == A:
            jugada_maquina = 2 
        elif [1,0] == A:
            jugada_maquina = 3
        elif [1,1] == A:
            jugada_maquina = 4
        elif [1,2] == A:
            jugada_maquina = 5
        elif [2,0] == A:
            jugada_maquina = 6
        elif [2,1] == A:
            jugada_maquina = 7
        elif [2,2] == A:
            jugada_maquina = 8
        if tablero[jugada_maquina] == ' ':
            tablero[jugada_maquina] = 'X'
            exit = False
    return tablero, jugada_maquina

In [12]:
def juega_humano(tablero):
    ok= False
    while not ok:
        casilla = input("Casilla?")
        # obtenemos la posición de la casilla de 1-9 y comparamos con su respectivo indice en la lista
        if str(casilla) in '0123456789' and len(str(casilla)) == 1 and tablero[int(casilla)-1] == ' ':
            # asignamos a la casilla del jugador un valor de -1
            
            tablero[int(casilla)-1] = 'O'
            ok = True
        if casilla == "exit":
            sys.exit(0)
    return tablero

Training alone

In [53]:
#traning phase 1

# Execute this cell in order to start the game 
q_table = build_q_table(N_STATES, ACTIONS)

for i in range(MAX_EPISODES):    
    tablero = [' ',' ',' ',' ',' ',' ',' ',' ',' ']
    step_counter = 0
    S = 0    
    is_terminated = False
    while(is_terminated==False):        
        print("step: " + str(step_counter)+ "\n")
        ver_tablero(tablero)        
        tablero, A = juega_ordenador(S, tablero)
        # update the environment and the learning process
        S_, R = get_env_feedback(S, tablero)  # take action & get next state and reward
        q_predict = q_table.loc[step_counter, A]
        if game_over(tablero) or S_ == 'terminal':
            q_target = R     # next state is terminal
            is_terminated = True    # terminate this episode
            
        else:
            q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
        
        q_table.loc[step_counter, A] += ALPHA * (q_target - q_predict)  # update
       
        S = S_  # move to next state       
        step_counter += 1 
    ver_tablero(tablero)
    g = ganador(tablero)
    if g == '0':
        gana = "Tablas"
    elif g == 'O':
        gana = "Jugador"
    else:
        gana = "Ordenador"

    print("Ganador: " + gana)
    print("Episode: " + str(i))
    episode += 1   
q_table

step: 0

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 1

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   | X |  
   |   |
step: 2

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X | X |  
   |   |
step: 3

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X | X |  
   |   |
step: 4

   |   |
 X |   | X
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X | X |  
   |   |
   |   |
 X | X | X
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X | X |  
   |   |
Ganador: Ordenador
Episode: 0
step: 0

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 1

   |   |
   |   |  
   |   |
-----------
   |   |
   |   | X
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
 


   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
 X | X |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
   |   |
 X | X | X
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
Ganador: Ordenador
Episode: 17
step: 0

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 1

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X |   |  
   |   |
step: 3

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X |   | X
   |   |
step: 4

   |   |
 X | X |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
 X |   | X
   |   |
   |   |
 X | X |  
   |  

step: 1

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
 X |   |  
   |   |
-----------
   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 3

   |   |
 X |   |  
   |   |
-----------
   |   |
 X |   |  
   |   |
-----------
   |   |
   |   | X
   |   |
step: 4

   |   |
 X |   |  
   |   |
-----------
   |   |
 X |   | X
   |   |
-----------
   |   |
   |   | X
   |   |
step: 5

   |   |
 X | X |  
   |   |
-----------
   |   |
 X |   | X
   |   |
-----------
   |   |
   |   | X
   |   |
   |   |
 X | X | X
   |   |
-----------
   |   |
 X |   | X
   |   |
-----------
   |   |
   |   | X
   |   |
Ganador: Ordenador
Episode: 29
step: 0

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 1

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |


step: 1

   |   |
 X |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
 X | X |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 3

   |   |
 X | X |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   | X
   |   |
   |   |
 X | X |  
   |   |
-----------
   |   |
   | X |  
   |   |
-----------
   |   |
   |   | X
   |   |
Ganador: Ordenador
Episode: 41
step: 0

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 1

   |   |
   | X |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
   | X |  
   |   |
-----------
   |   |
   | X |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 3

   |   |
   | X |  
   |   |
-----------
   |   |
   | X |  
   |   |
-----------
   |   |
   |   | X
   |   |
step: 4

   |   |


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.01725,0.001907,0.0,0.001098,6e-06,1.2e-05,0.0,3e-06,0.003007
1,3.6e-05,0.015709,0.008825,0.025712,0.0045,0.009003,0.018414,0.01247,0.023577
2,0.0,0.003151,0.049846,0.015729,0.000693,0.003801,0.015178,0.007869,0.040126
3,0.00855,0.018428,0.098461,0.008505,0.001937,0.020598,0.006667,0.01021,0.004455
4,0.0,0.0495,0.042125,0.007695,0.0,0.0,0.00855,0.00855,0.0
5,0.0,0.095,0.05,0.0,0.0,0.0,0.0,0.0,0.0
6,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
# save the traning process
q_table.to_pickle("traning_rf.pkl") 

In [56]:
# load the traning process
q_table = pd.read_pickle("traning_rf.pkl")
q_table

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.01725,0.001907,0.0,0.001098,6e-06,1.2e-05,0.0,3e-06,0.003007
1,3.6e-05,0.015709,0.008825,0.025712,0.0045,0.009003,0.018414,0.01247,0.023577
2,0.0,0.003151,0.049846,0.015729,0.000693,0.003801,0.015178,0.007869,0.040126
3,0.00855,0.018428,0.098461,0.008505,0.001937,0.020598,0.006667,0.01021,0.004455
4,0.0,0.0495,0.042125,0.007695,0.0,0.0,0.00855,0.00855,0.0
5,0.0,0.095,0.05,0.0,0.0,0.0,0.0,0.0,0.0
6,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Training phase with human

In [57]:
#traning phase with human

# Execute this cell in order to start the game 
print("Introduce casilla o exit para terminar")
q_table = build_q_table(N_STATES, ACTIONS)
episode = 0
while(episode <= 3):    
    tablero = [' ',' ',' ',' ',' ',' ',' ',' ',' ']
    step_counter = 0
    S = 0    
    is_terminated = False
    while(is_terminated==False):        
        print("step: " + str(step_counter)+ "\n")
        ver_tablero(tablero)
        tablero = juega_humano(tablero)
        if game_over(tablero):
            is_terminated = True
            break
        ver_tablero(tablero)        
        tablero, A = juega_ordenador(S, tablero)
        # update the environment and the learning process
        S_, R = get_env_feedback(S, tablero)  # take action & get next state and reward
        q_predict = q_table.loc[step_counter, A]
        if game_over(tablero) or S_ == 'terminal':
            q_target = R     # next state is terminal
            is_terminated = True    # terminate this episode
            
        else:
            q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
        
        q_table.loc[step_counter, A] += ALPHA * (q_target - q_predict)  # update
       
        S = S_  # move to next state       
        step_counter += 1 
    ver_tablero(tablero)
    g = ganador(tablero)
    if g == '0':
        gana = "Tablas"
    elif g == 'O':
        gana = "Jugador"
    else:
        gana = "Ordenador"

    print("Ganador: " + gana)
    print("Episode: " + str(i))
    episode += 1  
q_table

Introduce casilla o exit para terminar
step: 0

   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
Casilla?1
   |   |
 O |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
-----------
   |   |
   |   |  
   |   |
step: 1

   |   |
 O |   |  
   |   |
-----------
   |   |
   |   | X
   |   |
-----------
   |   |
   |   |  
   |   |
Casilla?3
   |   |
 O |   | O
   |   |
-----------
   |   |
   |   | X
   |   |
-----------
   |   |
   |   |  
   |   |
step: 2

   |   |
 O | X | O
   |   |
-----------
   |   |
   |   | X
   |   |
-----------
   |   |
   |   |  
   |   |
Casilla?7
   |   |
 O | X | O
   |   |
-----------
   |   |
   |   | X
   |   |
-----------
   |   |
 O |   |  
   |   |
step: 3

   |   |
 O | X | O
   |   |
-----------
   |   |
   |   | X
   |   |
-----------
   |   |
 O |   | X
   |   |
Casilla?4
   |   |
 O | X | O
   |   |
-----------
   |   |
 O |   | X
   |   |
-----------
   |   |
 O |   | X
   |  

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Machine versus Machine 

<img src="https://s1.dmcdn.net/KlD-z/x1080-aCW.jpg" width="500"/>

## Dynamic programming


## Frameworks

https://gym.openai.com/

https://ai.googleblog.com/2019/02/introducing-planet-deep-planning.html

In [58]:
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(100):
    env.render()
    env.step(env.action_space.sample()) # take a random action

[33mWARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.[0m


In [3]:
import gym
env = gym.make("Taxi-v2")
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample() # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)

+---------+
|[43mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+

+---------+
|[43mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (Pickup)
+---------+
|[43mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (Dropoff)
+---------+
|R: | : :[35mG[0m|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)
+---------+
|R: | : :[35mG[0m|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (West)
+---------+
|R: | : :[35mG[0m|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (West)
+---------+
|R: | : :[35mG[0m|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (West)
+---------+
|[43mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (North)
+---------+
|[43mR[0m: 

## References 

+ Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning (Vol. 135). Cambridge: MIT press.
+ openaigym. https://gym.openai.com
+ Nandy, A., & Biswas, M. (2017). Reinforcement Learning: With Open AI, TensorFlow and Keras Using Python.
+ https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow
+ https://www.kaggle.com/slobo777/tic-tac-toe-agent-using-q-learning
+ https://github.com/rfeinman/tictactoe-reinforcement-learning/blob/master/agent.py