# Week 1: Reinforcement Learning and Q-Learning

In the first week of my independent study, I looked at several brief resources that covered a wide variety of reinforcement learning topics from dynamic programming to SARSA, Markov Decision Processes, and Monte Carlo trees.

What I want to focus on first, though, is Q-Learning.





## What is Reinforcement Learning?

Reinforcement learning is a subset of machine learning concerned with how agents should behave (take actions) in a given environment in order to maximize the cumulative reward they receive.


A reinforcement learning problem is typically formalized as a Markov Decision Process (MDP) with the following components:

- $S$ -> set of states
- $A$ -> set of actions
- $Pr(s'|a,s)$ -> transition probability, the probability of moving to a new state $s'$ given the current state $s$ and the action $a$ taken in it
- $\alpha$ -> starting state distribution
- $\gamma$ -> discount factor
- $r(s,a)$ -> reward given a state and an action
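
To make the components above concrete, here is a minimal sketch of a toy problem written out as plain Python data. The two states, two actions, probabilities, and rewards are all made up for illustration and do not come from any of the resources mentioned; the helper `sample_next_state` is likewise hypothetical.

```python
import random

# Hypothetical two-state problem, used only to illustrate the components.
states = ["cold", "warm"]    # S
actions = ["stay", "move"]   # A

# transition[(s, a)] maps each next state s' to Pr(s' | a, s).
transition = {
    ("cold", "stay"): {"cold": 1.0},
    ("cold", "move"): {"cold": 0.2, "warm": 0.8},
    ("warm", "stay"): {"warm": 1.0},
    ("warm", "move"): {"cold": 0.8, "warm": 0.2},
}

# reward[(s, a)] is r(s, a).
reward = {
    ("cold", "stay"): 0.0,
    ("cold", "move"): -1.0,
    ("warm", "stay"): 1.0,
    ("warm", "move"): -1.0,
}

start_distribution = {"cold": 1.0, "warm": 0.0}  # alpha
gamma = 0.95                                     # discount factor


def sample_next_state(state, action, rng):
    """Draw a next state s' according to Pr(s' | a, s)."""
    distribution = transition[(state, action)]
    next_states = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_next_state("cold", "move", rng))
```

Each entry of `transition` is a probability distribution over next states, so its values should sum to 1.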

## What is Q-Learning?

Q-Learning is a reinforcement learning technique that allows an agent to learn a policy for how to behave in a given environment.

Q stands for the __*quality*__ of an action in a given state.


More specifically, the aim of this approach is to obtain a function, $Q(s,a)$, that estimates the cumulative "reward" value the agent can expect after taking action $a$ in state $s$. The best action in a state is then the one with the highest Q-value.

This function is iteratively updated via the Bellman equation, which is as follows:

$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$

The first term, $r$, is the immediate reward. The second term is the discounted value of the best action available in the next state $s'$, which serves as the estimate of future reward.
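
As a quick numeric check of the Bellman equation above, the sketch below computes the right-hand side for one transition. The function name `bellman_target` and the sample numbers are my own, chosen only for illustration.

```python
def bellman_target(reward, gamma, next_q_values):
    """Return r + gamma * max_a' Q(s', a') for a single transition."""
    return reward + gamma * max(next_q_values)


# Suppose the agent received r = 1.0 and the next state has
# Q-values [0.0, 2.0, 0.5] for its three available actions.
target = bellman_target(reward=1.0, gamma=0.95, next_q_values=[0.0, 2.0, 0.5])
print(target)  # approximately 1.0 + 0.95 * 2.0 = 2.9
```

Only the largest Q-value of the next state matters here, which is why the other two entries have no effect on the result.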

In a relatively simple environment, the Q-function can be represented as a matrix (a Q-table) where the rows represent the states and the columns represent the actions, which is the state x action layout used in the implementation below. Each cell holds the current Q-value estimate for that state-action pair, not just its immediate reward.

*A sample representation of a Q-Matrix*

$$
Q =
  \begin{bmatrix}
    1 & 2 & 3 \\
    3 & 1 & 5 \\
    3 & 3 & -5
  \end{bmatrix}
$$
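
To show how such a matrix is used, here is a small sketch that reads the greedy action for each state out of the sample Q-matrix. It assumes the state x action layout used by the implementation later in this notebook (one row per state, one column per action); `greedy_action` is a hypothetical helper name.

```python
# The sample Q-matrix from above, with rows as states and columns as actions
# (an assumption that matches the state x action Q-table built later).
Q = [
    [1, 2, 3],
    [3, 1, 5],
    [3, 3, -5],
]


def greedy_action(q_table, state):
    """Return the index of the highest Q-value in the given state's row.

    Ties are broken in favor of the lowest action index.
    """
    row = q_table[state]
    return max(range(len(row)), key=lambda action: row[action])


policy = [greedy_action(Q, state) for state in range(len(Q))]
print(policy)  # [2, 2, 0]
```

In the last state, actions 0 and 1 are tied at 3, so the helper returns the first of them.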


## Implementing our first Q-Learning algorithm

*[The guide I am following can be found here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/FrozenLake/Q%20Learning%20with%20FrozenLake.ipynb)*

Using the OpenAI Gym library, we can build reinforcement learning algorithms against virtual environments that allow us to control agents within them.

Before we begin, we must import the dependencies:

In [1]:
# Enables intellisense (press TAB after the .)
%config IPCompleter.greedy=True

import torch
import numpy as np
import gym
import random

### Create the environment


In [2]:
env = gym.make("FrozenLake-v0")

### Initialize the Q-table

The Q-table is stored as a state x action matrix: one row per state and one column per action.

In [3]:
action_size = env.action_space.n
state_size = env.observation_space.n

In [4]:
q_table = torch.zeros((state_size,action_size))

print(q_table)

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])


In [14]:
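total_episodes = 15000   # number of training episodes
learning_rate = 0.8      # step size for each Q-value update
max_steps = 99           # maximum number of steps per episode
gamma = 0.95             # discount factor

# Exploration parameters (epsilon-greedy)
epsilon = 1.0            # current exploration rate
max_epsilon = 1.0        # exploration rate at the start of training
min_epsilon = 0.01       # lower bound on the exploration rate
decay_rate = 0.005       # exponential decay rate for epsilon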

print(q_table)
print(q_table[state,:])

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
tensor([0., 0., 0., 0.])
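
To get a feel for the exploration parameters above, the sketch below evaluates the exponential decay formula used in the training loop at a few episode numbers. It uses `math.exp` instead of `np.exp` so that it runs without NumPy; `epsilon_at` is a hypothetical helper name.

```python
import math

max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005


def epsilon_at(episode):
    """Exploration rate after the given number of episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Epsilon starts near max_epsilon and decays toward min_epsilon.
for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

With these settings the agent acts almost entirely at random in the first episodes and almost entirely greedily well before the 15000 episodes are over.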


In [None]:
rewards = []

for episode in range(total_episodes):
    # Reset the environment at the start of each episode
    state = env.reset()
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Choose an action a in the current world state (s)
        ## We randomize a number
        exp_exp_tradeoff = random.uniform(0,1)
        
        ## if this number is greater than epsilon, we take that action which corresponds to the biggest Q-value in a given state (Exploitation)
        if exp_exp_tradeoff > epsilon:
            action = torch.argmax(q_table[state,:]).item()
        # Else doing a random choice
        else:
            action = env.action_space.sample()
            
        # take an action (a), and observe the new state (s') and reward (r)
        new_state, reward, done, info = env.step(action)
        
        # Update Q(s,a) := Q(s,a) + lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        q_table[state,action] = q_table[state,action] + learning_rate * (reward + gamma * torch.max(q_table[new_state,:]) - q_table[state,action])
        
        total_rewards += reward
        
        state = new_state
        
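        # If done is True, the episode has ended (the agent fell in a hole or reached the goal)
        if done:
            break

    # Decay epsilon once per episode so the agent explores less over time
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    # Record this episode's total reward for the final score
    rewards.append(total_rewards)

print("Score over time: " + str(sum(rewards) / total_episodes))
print(q_table)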
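
To sanity-check the update rule without installing Gym, here is a minimal, self-contained sketch of tabular Q-learning on a made-up environment. The five-state corridor, the `step` and `train` functions, and the fixed `epsilon` are all assumptions for illustration and are not part of the guide I am following. The update is the standard form $Q(s,a) \leftarrow Q(s,a) + lr \cdot (r + \gamma \max_{a'} Q(s',a') - Q(s,a))$.

```python
import random

# Hypothetical 1-D corridor: states 0..4, start in state 0,
# reward 1.0 for reaching state 4, which ends the episode.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Deterministic transition; returns (new_state, reward, done)."""
    if action == 0:
        new_state = max(0, state - 1)
    else:
        new_state = min(N_STATES - 1, state + 1)
    done = new_state == N_STATES - 1
    return new_state, (1.0 if done else 0.0), done


def train(episodes=500, learning_rate=0.8, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on steps per episode
            if rng.random() > epsilon:
                # Exploit: pick a best action, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            else:
                # Explore: pick any action at random
                action = rng.choice(ACTIONS)
            new_state, reward, done = step(state, action)
            # Q-learning update toward r + gamma * max_a' Q(s', a')
            td_target = reward + gamma * max(q[new_state])
            q[state][action] += learning_rate * (td_target - q[state][action])
            state = new_state
            if done:
                break
    return q


q = train()
# Greedy action in every non-terminal state
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

If the update is implemented correctly, the learned greedy policy should move right in every non-terminal state, since that is the shortest path to the only reward.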