# Q-Table Learning for FrozenLake

This notebook implements Q-table learning for the FrozenLake reinforcement learning environment. It follows the tutorial from [awjuliani](https://gist.github.com/awjuliani/9024166ca08c489a60994e529484f7fe#file-q-table-learning-clean-ipynb).

## The Environment

In [1]:
import gym
import numpy as np

In [2]:
env = gym.make('FrozenLake-v0')

`env` is a $4 \times 4$ grid with four types of tiles: S (start), F (frozen), H (hole), and G (goal). The agent starts on the S tile and chooses one of four directions to move: 0 (left), 1 (down), 2 (right), or 3 (up). The agent must reach the G tile while avoiding the H tiles. Here is the environment:

In [3]:
env.reset()
env.render()


SFFF
FHFH
FFFH
HFFG


A run ends when the agent falls into an H tile (reward: 0) or the agent finds the goal (reward: 1). 

The catch: the frozen lake is slippery. When the agent chooses a direction, there is only a 1/3 chance that it actually moves in that direction. With probability 2/3 it instead moves perpendicular to the chosen direction (1/3 to the left of it and 1/3 to the right of it). Because of this stochastic element, the environment cannot be perfectly solved.
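The slip rule can be written down compactly. The sketch below assumes the action encoding given earlier (0 left, 1 down, 2 right, 3 up), under which the two directions perpendicular to action `a` are `(a - 1) % 4` and `(a + 1) % 4`. The helper names `slip_outcomes` and `sample_actual_action` are made up for illustration and are not part of gym.

```python
import random

ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def slip_outcomes(action):
    """Return the three equally likely directions for a chosen action."""
    # The chosen direction plus its two perpendicular neighbours.
    return [(action - 1) % 4, action, (action + 1) % 4]

def sample_actual_action(action, rng=random):
    """Sample the direction the agent really moves in (illustrative helper)."""
    return rng.choice(slip_outcomes(action))

for a in range(4):
    outcomes = [ACTION_NAMES[b] for b in slip_outcomes(a)]
    print(f"chosen {ACTION_NAMES[a]:>5}: moves {outcomes}, each with probability 1/3")
```

Note that the direction opposite to the chosen one, `(a + 2) % 4`, never appears among the outcomes.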

In fact, if you inspect the finished Q-table as a sanity check, you'll find that the best strategy next to an H tile is to move *directly away from it*, even if that doesn't bring the agent closer to the goal. This is the only way the agent can guarantee that it *won't* fall into the hole (except when it is adjacent to two holes, which happens once, at row 2, column 3, counting from 1). I had to check the source code to verify the slippery behaviour; before that, the strategies learned by the Q-table confused me.

The locations are indexed from 0 to 15, left to right along each row, starting from the top row.
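As a small illustration of this indexing, the hypothetical helpers below convert between a state index and a zero-based (row, column) position, assuming the $4 \times 4$ grid shown above.

```python
N_COLS = 4  # the lake is a 4 x 4 grid

def state_to_position(state):
    """Map a state index (0-15) to a zero-based (row, column) pair."""
    return divmod(state, N_COLS)

def position_to_state(row, col):
    """Map a zero-based (row, column) pair back to a state index."""
    return row * N_COLS + col

# Zero-based (1, 2) is the tile the text calls row 2, column 3.
print(state_to_position(6))     # (1, 2)
print(position_to_state(3, 3))  # 15
```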

The Q-Table contains Q-values for state-action pairs. The Q-value for a state-action pair is the *estimated, expected value* of the reward for taking action $a$ when currently in state $s$. It takes the following form:

$Q(s, a) = r + \gamma \text{max}(Q(s', a'))$

It equals the reward $r$ received for taking action $a$ in state $s$, plus the discounted maximum Q-value over the actions $a'$ available from the next state $s'$. Each Q-value is updated with the following rule, which is based on the Bellman equation:

$Q(s, a) \leftarrow Q(s, a) + \alpha (r + \gamma \text{max}(Q(s', a')) - Q(s, a))$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor (`lr` and `y` in the code below).
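To make the update concrete, here is a minimal sketch of a single update step as a standalone function. The name `q_update` and the example numbers are illustrative only; the defaults for `alpha` and `gamma` are the values used in the training code (0.8 and 0.95).

```python
def q_update(q_sa, reward, max_q_next, alpha=0.8, gamma=0.95):
    """Apply one Q-learning update to a single Q(s, a) entry."""
    target = reward + gamma * max_q_next      # r + gamma * max Q(s', a')
    return q_sa + alpha * (target - q_sa)     # move Q(s, a) towards the target

# Reaching the goal for the first time: reward 1, nothing learned yet.
print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))  # 0.8

# A step with no reward still propagates value back from the next state.
print(q_update(q_sa=0.0, reward=0.0, max_q_next=0.8))
```

The second call shows how value flows backwards from the goal: a state that leads to a valuable next state gains value itself, even though its own reward is 0.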

## Q-Table Learning

In [4]:
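# One row per state and one column per action, initialised to zero
Q = np.zeros([env.observation_space.n, env.action_space.n])
lr = .8  # learning rate (alpha)
y = .95  # discount factor (gamma)
num_episodes = 5000
rList = []  # total reward collected in each episode
for i in range(num_episodes):
    s = env.reset()
    rAll = 0
    # Cap each episode at 99 steps
    for _ in range(99):
        # Greedy action, perturbed by Gaussian noise that shrinks as 1/(i+1)
        a = np.argmax(Q[s,:] + np.random.randn(1, env.action_space.n) * (1./(i+1)))
        s1,r,d,_ = env.step(a)
        # Move Q[s,a] towards the target r + y*max(Q[s1,:])
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        # Stop when the agent reaches the goal or falls into a hole
        if d:
            break
    rList.append(rAll)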

In [5]:
print("Average reward per episode: " + str(sum(rList)/num_episodes))

Average reward per episode: 0.5822
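
The average above is taken over all 5000 episodes, including the early ones in which the Q-table is still all zeros. One way to see how performance changes during training is to average the rewards over consecutive windows of episodes. The helper below is an illustrative sketch (the name `windowed_means` is not from the original notebook), demonstrated on a short made-up reward list rather than on real results.

```python
import numpy as np

def windowed_means(rewards, window):
    """Average rewards over consecutive, non-overlapping windows of episodes."""
    rewards = np.asarray(rewards, dtype=float)
    n_full = len(rewards) // window  # drop any incomplete final window
    return rewards[: n_full * window].reshape(n_full, window).mean(axis=1)

# Made-up data: one success in the first four episodes, three in the last four.
example_rewards = [0, 0, 1, 0, 1, 1, 0, 1]
print(windowed_means(example_rewards, window=4))  # [0.25 0.75]
```

For the run above, `windowed_means(rList, 500)` would give ten values, one per block of 500 episodes.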


In [6]:
print("Q Table values")
print(np.round(Q, 3))

Q Table values
[[0.163 0.008 0.006 0.005]
 [0.    0.001 0.    0.146]
 [0.001 0.009 0.003 0.097]
 [0.    0.001 0.002 0.097]
 [0.34  0.005 0.    0.   ]
 [0.    0.    0.    0.   ]
 [0.    0.    0.025 0.   ]
 [0.    0.    0.    0.   ]
 [0.005 0.    0.005 0.36 ]
 [0.    0.654 0.001 0.001]
 [0.164 0.    0.    0.   ]
 [0.    0.    0.    0.   ]
 [0.    0.    0.    0.   ]
 [0.    0.001 0.892 0.003]
 [0.    0.    0.995 0.002]
 [0.    0.    0.    0.   ]]


In [7]:
dic = {0:'<', 1:'▼', 2:'>', 3:'▲'}

In [8]:
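out = []
for actions in np.round(Q, 3):
    if np.sum(actions) == 0:
        # All-zero row: no value learned (here, the hole and goal tiles)
        out.append('O')
    else:
        # Otherwise show the arrow for the highest-valued action
        out.append(dic[np.argmax(actions)])
out[-1] = '!'  # mark the goal tile
out = np.array(out).reshape(4, 4)
print(out)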

[['<' '▲' '▲' '▲']
 ['<' 'O' '>' 'O']
 ['▲' '▼' '<' 'O']
 ['O' '>' '>' '!']]


We can see that the learner has learned to be "afraid" of holes: whenever it is adjacent to one, the best action is to move directly away from it. There's trouble at row 2, column 3, because the tiles to its left and right are both holes. Interestingly, in this situation the agent usually learns that the best action is to move towards one of the holes. This is, in fact, the best strategy: if the agent moves up or down, directly toward safe ground, it has a 2/3 chance of slipping sideways into a hole, whereas if it heads directly toward either hole it has a 2/3 chance of slipping onto safe ground.
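
These probabilities can be checked with a few lines of code. The sketch below hard-codes the map rendered earlier and assumes the slip rule described above (the chosen direction and its two perpendiculars are each taken with probability 1/3, and moving into a wall leaves the agent in place). The function name `hole_probability` is hypothetical and not part of gym.

```python
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

# (row, column) offsets for actions 0 (left), 1 (down), 2 (right), 3 (up)
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def hole_probability(row, col, action):
    """Probability of landing in a hole after choosing `action` at (row, col)."""
    holes = 0
    # The chosen direction and its two perpendiculars are equally likely.
    for actual in [(action - 1) % 4, action, (action + 1) % 4]:
        d_row, d_col = MOVES[actual]
        # Clamp to the grid: moving into a wall leaves the agent where it is.
        new_row = min(max(row + d_row, 0), 3)
        new_col = min(max(col + d_col, 0), 3)
        holes += LAKE[new_row][new_col] == "H"
    return holes / 3

# Zero-based (1, 2) is the tile between the two holes (row 2, column 3 counting from 1).
for action in range(4):
    print(ACTION_NAMES[action], hole_probability(1, 2, action))
```

Choosing left or right gives a 1/3 chance of a hole, while choosing up or down gives 2/3. For a tile next to a single hole, such as zero-based (1, 0), moving directly away with `hole_probability(1, 0, 0)` gives 0.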