# Reinforcement Learning Example: PathFinder Bot

Suppose a building has 5 rooms, A to E, connected by certain doors.
We can treat the outside of the building as one big room, F, that surrounds the building.
Two doors lead into the building from F: one through room B and one through room E.


![title](RL_problem.png)

Which path should the agent choose?



# Step 1: Modeling the environment

- Represent the rooms as a graph.
- Each room is a vertex (or node).
- Each door is an edge (or link).
- The goal room is node F.
![image.png](RL1.png)




Goal: reach the outside of the building (node F). Each room is assigned a reward value.

State: each room (including the outside of the building)

Action: the agent's movement from one room to the next

Initial state: C (chosen at random)

Reward: the goal node gets the highest reward (100); the rest get 0.

State Diagram 
![image.png](RL2.png)


![title](RL_image.png)

In [27]:
import numpy as np
import math

In [28]:
# R matrix: rows are the current room, columns are the destination room.
# If there is no door (edge) between two rooms (nodes),
# the reward is negative infinity.
# If a door exists and the destination room is the goal, the reward is 100.
# If a door exists and the destination is not the goal, the reward is 0.
inf = -math.inf

Rewards = np.matrix([[inf, inf, inf, inf, 0, inf], 
                     [inf, inf, inf, 0, inf, 100], 
                     [inf, inf, inf, 0, inf, inf], 
                     [inf, 0, 0, inf, 0, inf], 
                     [0, inf, inf, 0, inf, 100],
                     [inf, 0, inf, inf, 0, 100]])

Rewards

matrix([[-inf, -inf, -inf, -inf,   0., -inf],
        [-inf, -inf, -inf,   0., -inf, 100.],
        [-inf, -inf, -inf,   0., -inf, -inf],
        [-inf,   0.,   0., -inf,   0., -inf],
        [  0., -inf, -inf,   0., -inf, 100.],
        [-inf,   0., -inf, -inf,   0., 100.]])
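
The same matrix can be derived from a list of doors instead of being typed by hand. The sketch below is one way to do that. The `doors` list is read off the finite entries of `Rewards` above, and the names `ROOMS`, `GOAL`, `doors`, and `build_rewards` are illustrative assumptions, not part of the original notebook. It uses a plain `ndarray` rather than `np.matrix`.

```python
import numpy as np

ROOMS = ["A", "B", "C", "D", "E", "F"]
GOAL = 5  # index of F, the outside of the building

# Undirected doors, read off the finite entries of the Rewards matrix (assumption).
doors = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]

def build_rewards(n_states, doors, goal, goal_reward=100.0):
    """Return an n x n array: -inf for no door, goal_reward into the goal, else 0."""
    R = np.full((n_states, n_states), -np.inf)
    for a, b in doors:
        # A door can be used in both directions.
        R[a, b] = goal_reward if b == goal else 0.0
        R[b, a] = goal_reward if a == goal else 0.0
    # Staying at the goal is allowed and rewarded, as in row F of the matrix above.
    R[goal, goal] = goal_reward
    return R

R = build_rewards(len(ROOMS), doors, GOAL)
print(R)
```

Keeping the doors in one list makes it easier to change the floor plan without re-checking every matrix entry.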

In [29]:
# Q matrix: zero matrix of the same size as the R matrix

# Create the Q matrix with the dimensions of the reward matrix
Q = np.zeros((Rewards.shape[0], Rewards.shape[1]))

Q

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [30]:
# Gamma (discount factor): how much future rewards count relative to immediate ones.
gamma = 0.8

In [31]:
import random

# Initial state (usually chosen at random).

# We have 6 possible states: rooms A-E (0-4) and the outside F (5)
states = [0, 1, 2, 3, 4, 5]

# Initial state is chosen at random from the above states
# (random.randint includes both endpoints)
initial_state = random.randint(0, len(states) - 1)




In [32]:
# This function returns all available actions in the state given as an argument
def available_actions(state):
    current_state_row = Rewards[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act

In [33]:
# Get available actions in the current state
available_act = available_actions(initial_state) 

In [34]:
# This function chooses at random which action to perform from the
# available actions passed in as an argument.
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action

In [35]:
# Sample next action to be performed
action = sample_next_action(available_act)

In [36]:
# This function updates the Q matrix according to the path selected and the Q 
# learning algorithm
def update(current_state, action, gamma):
    
    max_index = np.where(Q[action,] == np.max(Q[action,]))[0]

    # Break ties between equally good next actions at random
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    
    # max_value is the best Q value reachable from the next state.
    # The next state is `action`; max_index is the column holding that best value.
    max_value = Q[action, max_index]
    
    # Q learning formula
    Q[current_state, action] = Rewards[current_state, action] + gamma * max_value

# Update Q matrix
update(initial_state,action,gamma)
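
As a worked illustration of the rule that `update` applies: because an action here is simply the destination room, the next state $s'$ equals the action $a$, and the code overwrites the old value (which corresponds to a learning rate of 1, an interpretation rather than something the notebook states):

$$
Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a'), \qquad s' = a
$$

For example, the first time the agent moves from B to F while `Q` is still all zeros, with $\gamma = 0.8$:

$$
Q(B, F) = 100 + 0.8 \cdot 0 = 100
$$

Assuming the self-loop at F ends up as the best action in F, repeated updates approach the fixed point

$$
Q(F, F) = 100 + 0.8\, Q(F, F) \;\Rightarrow\; Q(F, F) = \frac{100}{1 - 0.8} = 500
$$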

In [37]:
#-------------------------------------------------------------------------------
# Training

# Train over 10,000 iterations (repeating the process above).
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

# The "trained" Q matrix
print("The Trained Q matrix:")
print(Q)

# Normalizing the matrix is optional

# Normalize the "trained" Q matrix
# print("Trained Normalized Q matrix:")
# Q_nor = Q / np.max(Q)
# print(Q_nor)

The Trained Q matrix:
[[  0.   0.   0.   0. 400.   0.]
 [  0.   0.   0. 320.   0. 500.]
 [  0.   0.   0. 320.   0.   0.]
 [  0. 400. 256.   0. 400.   0.]
 [320.   0.   0. 320.   0. 500.]
 [  0. 400.   0.   0. 400. 500.]]
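
The trained values can be cross-checked without random sampling. The sketch below repeatedly applies the same update to every valid (room, door) pair at once until the values settle. This is a deterministic sweep swapped in for the notebook's random training loop, and the names `R`, `valid`, and `Q_check` are illustrative assumptions.

```python
import numpy as np

gamma = 0.8
ninf = -np.inf

# Same rewards as the R matrix in this notebook, as a plain ndarray.
R = np.array([
    [ninf, ninf, ninf, ninf, 0, ninf],
    [ninf, ninf, ninf, 0, ninf, 100],
    [ninf, ninf, ninf, 0, ninf, ninf],
    [ninf, 0, 0, ninf, 0, ninf],
    [0, ninf, ninf, 0, ninf, 100],
    [ninf, 0, ninf, ninf, 0, 100],
])

valid = np.isfinite(R)        # True where a door exists
Q_check = np.zeros_like(R)

for _ in range(200):
    # Best Q value available from each room (each room is a possible next state).
    best_next = Q_check.max(axis=1)
    # Q[s, a] = R[s, a] + gamma * best value from room a; entries without a door stay 0.
    Q_check = np.where(valid, R + gamma * best_next[None, :], 0.0)

print(np.round(Q_check))
```

If the random training loop has run long enough, its `Q` should agree with `Q_check` up to rounding.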


In [38]:
#-------------------------------------------------------------------------------
# Testing

# States  = [A, B, C, D, E, F]
# Indices = [0, 1, 2, 3, 4, 5]

# Goal state = 5
# A best path starting from 2: 2, 3, 1, 5 (2, 3, 4, 5 scores the same)

for initial_state in range(1, 6):

    current_state = initial_state
    steps = [current_state]

    while current_state != 5:

        next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[0]

        # Break ties between equally good next rooms at random
        if next_step_index.shape[0] > 1:
            next_step_index = int(np.random.choice(next_step_index))
        else:
            next_step_index = int(next_step_index[0])
        
        steps.append(next_step_index)
        current_state = next_step_index
    # Print selected sequence of steps
    print("Selected path:", end= ' ')
    print(steps)

Selected path: [1, 5]
Selected path: [2, 3]
Selected path: [2, 3, 4]
Selected path: [2, 3, 4, 5]
Selected path: [3, 4]
Selected path: [3, 4, 5]
Selected path: [4, 5]
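
The path-following step can also be wrapped in a small helper. The sketch below is a hypothetical variant, not the notebook's code: `greedy_path` and `Q_trained` are illustrative names, ties are broken by taking the first maximum (`np.argmax`) instead of a random choice, and a `max_steps` cap is added so that an untrained Q matrix cannot cause an endless loop. The values in `Q_trained` are copied from the trained matrix printed earlier.

```python
import numpy as np

# Trained Q values copied from the output above.
Q_trained = np.array([
    [0, 0, 0, 0, 400, 0],
    [0, 0, 0, 320, 0, 500],
    [0, 0, 0, 320, 0, 0],
    [0, 400, 256, 0, 400, 0],
    [320, 0, 0, 320, 0, 500],
    [0, 400, 0, 0, 400, 500],
], dtype=float)

def greedy_path(Q, start, goal=5, max_steps=20):
    """Follow the highest Q value from `start` until `goal` or the step cap."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        state = int(np.argmax(Q[state]))  # first maximum wins ties
        path.append(state)
    return path

paths = {s: greedy_path(Q_trained, s) for s in range(6)}
for start, path in paths.items():
    print("Selected path from", start, ":", path)
```

With first-maximum tie-breaking, room D (3) always goes through B (1); the route through E (4) has the same score.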




![image.png](RL_prob.png)
