# 18.0 Reinforcement Learning

[Reinforcement Learning Tutorial](https://www.youtube.com/watch?v=LzaWrmKL1Z4&ab_channel=edureka%21)

[Python Tutorial](https://www.youtube.com/playlist?list=PL9ooVrP1hQOHY-BeYrKHDrHKphsJOyRyu)

[Google Colab](https://colab.research.google.com/drive/1pWue8kCeiIQrZCkYzQCe4fuIDkaBMfs_?usp=sharing)

#### Reinforcement learning occurs when you sequentially present the algorithm with examples that lack labels, as in unsupervised learning. However, you accompany each example with positive or negative feedback according to the solution the algorithm proposes. 

Reinforcement learning is connected to applications for
which the algorithm must make decisions (so the product is prescriptive, not just
descriptive, as in unsupervised learning), and the decisions bear consequences.

**Reinforcement learning is like learning by trial and error for humans.**

Errors help you learn because they incur a penalty (cost, loss of time, regret, pain, and so on),
teaching you that a certain course of action is less likely to succeed than others.

![Screen%20Shot%202022-11-24%20at%201.28.31%20AM.png](attachment:Screen%20Shot%202022-11-24%20at%201.28.31%20AM.png)

**An interesting example of reinforcement learning occurs when computers learn to
play video games by themselves.** In this case, an application presents the algorithm
with examples of specific situations, such as having the gamer stuck in a
maze while avoiding an enemy. The application lets the algorithm know the outcome
of actions it takes, and learning occurs while trying to avoid what it discovers
to be dangerous and to pursue survival. 

You can have a look at how the company
DeepMind has created a reinforcement learning program that plays old Atari video
games at https://www.youtube.com/watch?v=V1eYniJ0Rnk. When watching the
video, notice how the program is initially clumsy and unskilled but steadily
improves with training until it becomes a champion.

![Screen%20Shot%202022-11-24%20at%203.12.11%20PM.png](attachment:Screen%20Shot%202022-11-24%20at%203.12.11%20PM.png)

![Screen%20Shot%202022-11-24%20at%203.14.04%20PM.png](attachment:Screen%20Shot%202022-11-24%20at%203.14.04%20PM.png)

In [143]:
import numpy as np

## 1 Initializing R matrix and calculating a Q-value

In [144]:
# R matrix
# -1 marks a move that is not possible from that state
# 0 or 100 is the reward for a move that is possible (100 = reaching the goal, state 5)
R = np.matrix([
    [-1, -1, -1, -1, 0, -1], # row 0
    [-1, -1, -1, 0, -1, 100],# row 1
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [-1, 0, 0, -1, -1, 100],
    [-1, 0, -1, -1, 0, 100]
])

In [145]:
# Q matrix
Q = np.matrix(np.zeros([6, 6])) # we have 6 states, numbered 0 to 5

In [146]:
# Gamma (discount factor): how strongly future rewards count compared with immediate ones
gamma = 0.8 # can be changed
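
To get a feel for what gamma does, here is a small sketch (not part of the original notebook) that discounts the goal reward of 100 by the number of moves still needed to collect it. The helper name `discounted_reward` is made up for illustration.

```python
# Illustrative helper (hypothetical name, not from the notebook):
# value today of a reward that is only received `steps_away` moves later.
def discounted_reward(reward, gamma, steps_away):
    return reward * gamma ** steps_away

gamma = 0.8
for steps_away in range(4):
    print(steps_away, round(discounted_reward(100, gamma, steps_away), 1))
```

With `gamma = 0.8` the loop prints 100.0, 80.0, 64.0 and 51.2. A gamma closer to 1 makes distant rewards count almost as much as immediate ones, while a gamma closer to 0 makes the agent short-sighted.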

In [147]:
# Initial state (usually to be chosen randomly)
initial_state = 1 # which room we start from

In [148]:
# Returns all available actions in the state given as an argument
def available_actions(state):
    # take the whole row of R for this state
    current_state_row = R[state, ]
    
    # column indices of all values >= 0, i.e. the states we can travel to
    av_act = np.where(current_state_row >=0)[1]
    return av_act

In [149]:
# current_state_row = R[1, ]
# av_act = np.where(current_state_row >=0)[1]
# print(av_act)
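
As a quick sanity check of the reward table, the sketch below lists the allowed moves for every state at once. It is an illustration only: it copies R into a plain ndarray under the made-up name `R_demo` so the notebook's own `R` is left untouched, and it uses `np.flatnonzero` instead of the `np.where(...)[1]` call used above.

```python
import numpy as np

# Same reward table as the notebook's R, stored as a plain ndarray under a
# separate (hypothetical) name.
R_demo = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [-1,  0,  0, -1, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

# For each state, the allowed actions are the columns holding a reward >= 0.
allowed = {state: np.flatnonzero(R_demo[state] >= 0).tolist()
           for state in range(R_demo.shape[0])}

for state, actions in allowed.items():
    print(f"state {state}: can move to {actions}")
```

State 1, the initial state used here, can move to states 3 and 5, and only the move to 5 carries the reward of 100.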

In [150]:
# Get available actions in the current state
# Storing all available nodes where we can travel to from the initial state = 1
available_act = available_actions(initial_state)

In [151]:
# This function randomly chooses which action to perform
# out of all the available actions passed in as an argument
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action

In [152]:
# Sample next action to be performed
action = sample_next_action(available_act)

In [153]:
# This function updates the Q matrix according to the path selected and the Q
# learning algorithm

def update(current_state, action, gamma):
    
    # the chosen action moves us to the state with the same index, so look at
    # that state's row in Q and find the column(s) holding its maximum Q value
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]
    
    # if several columns share the maximum, pick one of them at random
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    
    max_value = Q[action, max_index]
    
    # Q learning formula
    Q[current_state, action] = R[current_state, action] + gamma * max_value
    return Q

In [154]:
# update Q matrix
update(initial_state, action, gamma)

matrix([[  0.,   0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0., 100.],
        [  0.,   0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,   0.]])
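
Written as an equation, the rule that `update` applies looks like the line below. The symbols are notation introduced here rather than names from the code: $s$ is the current state, $a$ the chosen action, $s'$ the state that action leads to (in this notebook it has the same index as the action), and $a'$ ranges over the columns of row $s'$.

```latex
Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')
```

In the run shown above, the sampled action was 5 and every entry of $Q$ was still zero, so $Q(1, 5) = 100 + 0.8 \cdot 0 = 100$, which is the single nonzero entry in the printed matrix. Had the sampled action been 3, the update would have written $0 + 0.8 \cdot 0 = 0$ and the matrix would have stayed all zeros.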

## 2 Training
### The more we train the algorithm, the better it is going to learn

In [155]:
arr = np.array([
    [1, 2, 3, 4], 
    [5, 6, 7, 8]
               ])
print(arr.shape)
print(arr.shape[0])
print(arr.shape[1])

(2, 4)
2
4


In [156]:
Q.shape[0]

6

In [157]:
print('Untrained Q matrix:')
print(Q)

Untrained Q matrix:
[[  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0. 100.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]]


In [158]:
# Train over 10000 iterations.
# Re-iterate the process above.

In [159]:
# Run 10000 random (state, action) updates so the agent can find the best policy
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

In [160]:
# Normalize the 'trained' Q matrix
print('Trained Q matrix:')
print(Q / np.max(Q) * 100)

Trained Q matrix:
[[  0.    0.    0.    0.   80.    0. ]
 [  0.    0.    0.   64.    0.  100. ]
 [  0.    0.    0.   64.    0.    0. ]
 [  0.   80.   51.2   0.   80.    0. ]
 [  0.   80.   51.2   0.    0.  100. ]
 [  0.   80.    0.    0.   80.  100. ]]
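
One way to double-check the trained matrix without relying on random sampling is to apply the same update rule to every allowed (state, action) pair in repeated sweeps. This is a variation for verification, not the notebook's training loop. The names `R_check`, `Q_check` and `gamma_check` are made up, and the 200 sweeps are an arbitrary count that is comfortably enough for this small problem.

```python
import numpy as np

# Reward table copied from the notebook, under a hypothetical name.
R_check = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [-1,  0,  0, -1, -1, 100],
    [-1,  0, -1, -1,  0, 100],
], dtype=float)
gamma_check = 0.8
Q_check = np.zeros_like(R_check)

# Visit every allowed (state, action) pair in turn instead of sampling.
for _ in range(200):
    for state, action in zip(*np.nonzero(R_check >= 0)):
        # same rule as update(): reward plus discounted best value of the next state
        Q_check[state, action] = (R_check[state, action]
                                  + gamma_check * Q_check[action].max())

normalized = Q_check / Q_check.max() * 100
print(np.round(normalized, 1))
```

The sweeps settle on the same normalized values as the printed matrix: 100 for moves that reach state 5, then 80, 64 and 51.2 for moves that are one, two and three steps further away, which is 100 multiplied by 0.8 once, twice and three times.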


## 3 Testing

In [163]:
# Goal state = 5
# Best sequence path starting from 1 -> 1, 5

current_state = 1
steps = [current_state]

while current_state != 5:
    
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
        
    steps.append(next_step_index)
    current_state = next_step_index

**From the graph we can see that the best path is a direct one from 1 to 5, because it gives the maximum reward = 100.** Let's see if the function suggests this.

![Screen%20Shot%202022-11-24%20at%204.19.49%20PM.png](attachment:Screen%20Shot%202022-11-24%20at%204.19.49%20PM.png)

In [166]:
# Print selected sequence of steps
print('Selected path:', steps)

Selected path: [1, 5]


If we start from a different state, the suggested path changes:

In [167]:
# Goal state = 5
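# Best sequence paths starting from 2 -> 2, 3, 1, 5 or 2, 3, 4, 5
# (from state 3 the moves to 1 and to 4 have the same Q value, so the tie is broken at random)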

current_state = 2
steps = [current_state]

while current_state != 5:
    
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
        
    steps.append(next_step_index)
    current_state = next_step_index

![Screen%20Shot%202022-11-24%20at%204.23.17%20PM.png](attachment:Screen%20Shot%202022-11-24%20at%204.23.17%20PM.png)

In [168]:
# Print selected sequence of steps
print('Selected path:', steps)

Selected path: [2, 3, 4, 5]
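
The two test cells repeat the same loop, so the path search could be wrapped in a small function. The sketch below is one possible version, not the notebook's own code: `greedy_path` and `Q_trained` are made-up names, the values are copied from the "Trained Q matrix" output, and ties are broken by `np.argmax` (lowest index first) rather than at random.

```python
import numpy as np

# Normalized Q values copied from the "Trained Q matrix" output above.
Q_trained = np.array([
    [0,  0,    0,  0, 80,   0],
    [0,  0,    0, 64,  0, 100],
    [0,  0,    0, 64,  0,   0],
    [0, 80, 51.2,  0, 80,   0],
    [0, 80, 51.2,  0,  0, 100],
    [0, 80,    0,  0, 80, 100],
])

def greedy_path(q, start, goal, max_steps=20):
    """Follow the highest Q value from `start` until `goal` is reached."""
    path = [start]
    # max_steps is a safety limit in case the table never leads to the goal
    while path[-1] != goal and len(path) <= max_steps:
        # np.argmax returns the first maximum, so ties go to the lowest index
        path.append(int(np.argmax(q[path[-1]])))
    return path

print(greedy_path(Q_trained, 1, 5))
print(greedy_path(Q_trained, 2, 5))
```

Because of the deterministic tie-break, starting from state 2 this version always returns `[2, 3, 1, 5]`, whereas the random tie-break in the cell above can also produce `[2, 3, 4, 5]`. Both paths take three moves, so they are equally good.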


***Thank you for going through this project. Your comments are more than welcome at ybezginova2021@gmail.com***

***Best wishes,***

***Yulia***