## Table of Contents
- [Part 2: Q-Learning for FrozenLake-v0, OpenAI Gym environment](#frozen-lake)
    - [Q-Learning Approach](#q-learn)
        - [OpenAI Gym Stochastic FrozenLake approach](#stochastic-field)
        - [Personal Deterministic FrozenLake approach](#deterministic-field)
        - [Play Against Environment](#pve)
    - [DNN Approach](#dnn)

## Part 2: Q-Learning for FrozenLake-v0, OpenAI Gym environment <a class="anchor" id="frozen-lake"></a>

The idea behind this part: just parsing data is not that exciting... aaannnd, I don't have time to make **Self taught Quantum Checkers** the _**thing**_. So I've used an existing <a href="https://gym.openai.com/">OpenAI Gym</a> environment, <a href="https://gym.openai.com/envs/FrozenLake8x8-v0/">Frozen Lake 8x8</a> in particular (the code below works with the smaller 4x4 `FrozenLake-v0` map).

**Basic Idea**:
The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

_Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend._

**Grid Tiles**:
- S: starting point, safe
- F: frozen surface, safe
- H: hole, fall to your doom
- G: goal, where the frisbee is located

**Example**: <br>
SFFF <br>
FHFH <br>
FFFH <br> 
HFFG <br>
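
The environment numbers the tiles row by row, so on a 4x4 map state `s` sits at row `s // 4`, column `s % 4`. A minimal sketch of that mapping for the example map above; the helper `tile_at` is a hypothetical name used only for illustration, not part of gym:

```python
# Example 4x4 map from the text, one string per row
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])

def tile_at(state):
    """Return the tile letter for a row-major state index (hypothetical helper)."""
    row, col = divmod(state, N_COLS)
    return GRID[row][col]

holes = [s for s in range(16) if tile_at(s) == "H"]
goal = [s for s in range(16) if tile_at(s) == "G"]
print("holes:", holes)  # holes: [5, 7, 11, 12]
print("goal:", goal)    # goal: [15]
```

These five states are terminal, which is why their rows stay at zero in the Q-matrices printed later in this part.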

**Glossary**:
- **environment** - The object (interface) through which we, or our game bot (the agent), interact with the game and read its current state. Gym provides many different games, each exposed as an environment.
- **step** - The function that performs an action in the current state of the game and advances the game by one move.
- **action** - The value describing what we want to do in the current state of the game, such as moving left or right, or jumping.
- **observation (object)** - An environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- **reward (float)** - Amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- **done (boolean)** - whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- **info (dictionary)** - diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
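
To make the glossary concrete, here is a minimal sketch of the `reset()` / `step()` cycle. `ToyLake` is a hypothetical stand-in written for this example, not the gym environment: it is deterministic and only mimics the four-value return `(observation, reward, done, info)` described above.

```python
class ToyLake:
    """Hypothetical deterministic 4x4 lake with a gym-like reset()/step() interface."""
    GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
    # action -> (row delta, col delta): 0 Left, 1 Down, 2 Right, 3 Up
    MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        row, col = divmod(self.state, 4)
        d_row, d_col = self.MOVES[action]
        # Moves off the board leave the agent where it is
        row = min(max(row + d_row, 0), 3)
        col = min(max(col + d_col, 0), 3)
        self.state = row * 4 + col
        tile = self.GRID[row][col]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "HG"  # holes and the goal both end the episode
        info = {"tile": tile}
        return self.state, reward, done, info

env = ToyLake()
observation = env.reset()
total_reward = 0.0
for action in [1, 1, 2, 2, 1, 2]:  # Down, Down, Right, Right, Down, Right
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print(observation, total_reward, done)  # 15 1.0 True
```

The loop stops as soon as `done` is true; stepping an environment after that point is not meaningful until `reset()` is called again.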

In [1]:
'''
    If the following import fails, install gym from the Anaconda console using:
        pip install gym
'''
import gym
import numpy as np
import time, pickle, os

In [2]:
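# Global constants
epsilon = 0.9           # probability of taking a random (exploratory) action
total_epoches = 10000   # number of training episodes
max_steps = 1000        # step limit per episode

lr_rate = 0.81          # learning rate
gamma = 0.96            # discount factor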

### Q-Learning Approach <a class="anchor" id="q-learn"></a>

Before moving on to a DNN implementation, I decided to start with the <a href="https://en.wikipedia.org/wiki/Q-learning">Q-learning</a> algorithm.
The goal of Q-learning is to learn a policy that tells the agent which action to take in each state.
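
Q-learning builds that policy from a table of values `Q(s, a)`, nudging each entry toward `reward + gamma * max Q(s', .)` after every move. A small worked sketch of one such update, using the `lr_rate` and `gamma` values defined above; the function name `q_update` is made up for this illustration:

```python
LR_RATE = 0.81  # same value as lr_rate in this notebook
GAMMA = 0.96    # same value as gamma in this notebook

def q_update(q_sa, reward, best_next_q):
    """One tabular Q-learning step: move Q(s, a) toward reward + gamma * max Q(s', .)."""
    target = reward + GAMMA * best_next_q
    return q_sa + LR_RATE * (target - q_sa)

# Step into the goal from an untrained table: reward 1, nothing known about s'
q_last_move = q_update(0.0, 1.0, 0.0)
# One tile earlier: no reward yet, but the next state is now worth 0.81
q_move_before = q_update(0.0, 0.0, q_last_move)
print(round(q_last_move, 4), round(q_move_before, 4))  # 0.81 0.6299
```

This is how the single reward at the goal spreads backwards through the table, one tile per visit.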

#### OpenAI Gym Stochastic FrozenLake Approach <a class="anchor" id="stochastic-field"></a>

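In this context, **stochastic** means that the chosen action is not always the one that gets executed: the agent moves in the intended direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3 each. You see - ice is slippery.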
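
A minimal sketch of the slip rule, assuming it matches the gym FrozenLake implementation: the move actually executed is drawn uniformly from the intended direction and its two perpendicular neighbours, so the intended direction is taken only one time in three. The function names are hypothetical.

```python
import random
from collections import Counter

ACTIONS = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}

def slip_outcomes(action):
    """Directions that can actually be executed for an intended action."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def slippery_move(action, rng):
    """Pick one of the three possible directions uniformly at random."""
    return rng.choice(slip_outcomes(action))

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(slippery_move(2, rng) for _ in range(3000))  # intend Right
for direction, n in sorted(counts.items()):
    print(ACTIONS[direction], n)
```

When the agent intends to go Right, it ends up going Down, Right or Up about equally often, and never Left.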

In [3]:
# Load OpenAI gym environment
env_stochastic = gym.make('FrozenLake-v0')

# Define our Q-Learn matrix
Q_stochastic = np.zeros((env_stochastic.observation_space.n, env_stochastic.action_space.n))
file_Q_stochastic = "frozenLake_stochastic_qTable.pkl"

# Preview board
env_stochastic.render()


SFFF
FHFH
FFFH
HFFG


In [45]:
# All possible actions our agent can take. In our case it's a space of 4 actions: Left, Down, Right, Up, coded as [0,1,2,3].
display(env_stochastic.action_space)

# Observation space for our agent. In our case it's a discrete space of 16 states, one per tile of the 4x4 board.
display(env_stochastic.observation_space)

# Reward range. In our case 0 if failed and 1 if succeeded.
display(env_stochastic.reward_range)

Discrete(4)

Discrete(16)

(0, 1)

In [46]:
# Used to dump Q-Matrices to files
def SaveQTableToFile(Q, name):
    with open(name, 'wb') as f:
        pickle.dump(Q, f)

In [47]:
# "Epsilon Greedy" action selection
def choose_action(state, environment, Q):
    if np.random.uniform(0, 1) < epsilon:
        # Explore: take a random action with probability epsilon
        action = environment.action_space.sample()
    else:
        # Exploit: take the best known action for this state
        action = np.argmax(Q[state, :])
    return action

# Learning function
def learn(state, state2, reward, action, Q):
    # Q(s, a) <- Q(s, a) + lr_rate * (reward + gamma * max Q(s2, .) - Q(s, a))
    predict = Q[state, action]
    target = reward + gamma * np.max(Q[state2, :])
    Q[state, action] = Q[state, action] + lr_rate * (target - predict)

# Train function
def initQLearn(epoches, environment, Q, showOutput=False):
    for episode in range(epoches):
        state = environment.reset()
        t = 0
        while t < max_steps:
            if showOutput:
                environment.render()

            action = choose_action(state, environment, Q)
            state2, reward, done, info = environment.step(action)
            if showOutput:
                print("Reward:", reward)
                print("Info:", info)

            learn(state, state2, reward, action, Q)
            state = state2
            t += 1

            # Stop as soon as the episode ends, whether the agent reached
            # the goal (reward 1) or fell into a hole (reward 0)
            if done:
                if reward == 1:
                    print('Episode', episode, 'was successful. Agent has reached the Exit.')
                break

            if showOutput:
                time.sleep(0.1)

    print('Our Q is equal:\n', Q)
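
To get a feel for how much the epsilon-greedy rule explores with `epsilon = 0.9`, here is a small stand-alone sketch. It uses plain Python instead of NumPy and gym, and `choose_action_sketch` is a hypothetical variant that also reports whether the move was random.

```python
import random

EPSILON = 0.9  # same value as epsilon in this notebook

def choose_action_sketch(q_row, rng, epsilon=EPSILON):
    """Return (action, explored): random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row)), True
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return best, False

rng = random.Random(42)  # fixed seed so the run is repeatable
q_row = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for one state
trials = 10000
explored = sum(choose_action_sketch(q_row, rng)[1] for _ in range(trials))
print("explored fraction:", explored / trials)
```

Roughly nine out of ten training moves are random with this setting. That only affects training: `initPlayByQModel` further down always picks the greedy action.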

In [48]:
# Train Q-Learn model on stochastic environment
initQLearn(total_epoches, env_stochastic, Q_stochastic, False)

Episode 72 was successful. Agent has reached the Exit.
Episode 105 was successful. Agent has reached the Exit.
Episode 118 was successful. Agent has reached the Exit.
Episode 148 was successful. Agent has reached the Exit.
Episode 191 was successful. Agent has reached the Exit.
Episode 224 was successful. Agent has reached the Exit.
Episode 333 was successful. Agent has reached the Exit.
Episode 355 was successful. Agent has reached the Exit.
Episode 371 was successful. Agent has reached the Exit.
Episode 479 was successful. Agent has reached the Exit.
Episode 673 was successful. Agent has reached the Exit.
Episode 752 was successful. Agent has reached the Exit.
Episode 867 was successful. Agent has reached the Exit.
Episode 869 was successful. Agent has reached the Exit.
Episode 898 was successful. Agent has reached the Exit.
Episode 1058 was successful. Agent has reached the Exit.
Episode 1115 was successful. Agent has reached the Exit.
Episode 1145 was successful. Agent has reached 

In [49]:
# Save our Q-Matrix to file
SaveQTableToFile(Q_stochastic, file_Q_stochastic)
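
A quick way to convince yourself that the table survives the trip to disk is a save/load round trip. A minimal sketch with a made-up table and a temporary file; the helper names and the file name are hypothetical:

```python
import os
import pickle
import tempfile

def save_q_table(q_table, path):
    with open(path, "wb") as f:
        pickle.dump(q_table, f)

def load_q_table(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Small made-up table; plain lists here, but NumPy arrays pickle the same way
q_table = [[0.0, 0.5], [0.25, 1.0]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_qTable.pkl")  # hypothetical file name
    save_q_table(q_table, path)
    restored = load_q_table(path)
print(restored == q_table)  # True
```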

The functions below only read the saved Q-table and do no further learning, so they can be changed freely without retraining.

In [2]:
# Selects a move (action) for the given state, based on the experience saved in the Q-table.
def choose_QModel_action(state, verbose, Q):
    action = np.argmax(Q[state, :])
    if verbose:
        print(action)
    return action

def initPlayByQModel(episodes_count, environment, file_QTable, showOutput=True, verbose=False):
    with open(file_QTable, 'rb') as f:
        _Q = pickle.load(f)

    for episode in range(episodes_count):
        state = environment.reset()
        
        print("*** Starting Episode: ", episode)
        t = 0
        
        while t < max_steps:
            if showOutput == True:
                environment.render()

            action = choose_QModel_action(state, verbose, _Q)
            state2, reward, done, info = environment.step(action)
            
            if verbose == True:
                print("Reward:", reward)
                print("Info:", info)
                print("State2:", state2)
                
            state = state2
            t += 1

            if done == True and reward == 1:
                print('Success: Agent passed the Lake!')
                break
            
            if done == True and reward == 0:
                print('Agent died in vain!')
                break
                
        os.system('clear')
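
The greedy rule from `choose_QModel_action` can also be applied to every state at once, which shows the whole learned policy as arrows on the map. A minimal sketch with a made-up Q-table in plain Python; `greedy_policy_map` is a hypothetical helper, not part of the notebook:

```python
ARROWS = ["<", "v", ">", "^"]  # 0 Left, 1 Down, 2 Right, 3 Up
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]

def greedy_policy_map(q_table):
    """Draw the argmax action of every non-terminal tile as an arrow."""
    lines = []
    for row in range(4):
        line = ""
        for col in range(4):
            tile = GRID[row][col]
            if tile in "HG":
                line += tile  # terminal tiles: no action to show
            else:
                q_row = q_table[row * 4 + col]
                line += ARROWS[max(range(4), key=lambda a: q_row[a])]
        lines.append(line)
    return lines

# Made-up table that prefers Down on tiles 0 and 4 and Right everywhere else
q_table = [[0.0, 0.0, 1.0, 0.0] for _ in range(16)]
for s in (0, 4):
    q_table[s] = [0.0, 1.0, 0.0, 0.0]
print("\n".join(greedy_policy_map(q_table)))
```

Passing a trained table converted with `Q.tolist()` instead of the made-up one gives a picture of what the agent will actually do.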

In [20]:
# The environment is stochastic, so repeated runs can and will produce different outcomes.
initPlayByQModel(10, env_stochastic, file_Q_stochastic, False, False)

*** Starting Episode:  0
Success: Agent passed the Lake!
*** Starting Episode:  1
Success: Agent passed the Lake!
*** Starting Episode:  2
Agent died in vain!
*** Starting Episode:  3
Agent died in vain!
*** Starting Episode:  4
Success: Agent passed the Lake!
*** Starting Episode:  5
Agent died in vain!
*** Starting Episode:  6
Agent died in vain!
*** Starting Episode:  7
Agent died in vain!
*** Starting Episode:  8
Agent died in vain!
*** Starting Episode:  9
Agent died in vain!


#### Personal Deterministic FrozenLake Approach <a class="anchor" id="deterministic-field"></a>

So, as you can see, the stochastic environment ain't that good for Q-Learning (rerun the cell above a few more times to see just how bad it is). Let's make the environment **deterministic**!
For details on the arguments and the environment, we can look directly at the OpenAI FrozenLake <a href="https://github.com/openai/gym/blob/master/gym/envs/toy_text/frozen_lake.py">implementation</a>.

In [4]:
# First, we need to register the new environment we are going to work with
from gym.envs.registration import register

register(id='Deterministic-FrozenLake4x4-v0',
    entry_point='gym.envs.toy_text.frozen_lake:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False}
)

In [5]:
# Create new environment instance to work with
env_deterministic = gym.make('Deterministic-FrozenLake4x4-v0')

# Define our Q-Learn matrix
Q_deterministic = np.zeros((env_deterministic.observation_space.n, env_deterministic.action_space.n))
file_Q_deterministic = "frozenLake_deterministic_qTable.pkl"

# Preview board
env_deterministic.render()


SFFF
FHFH
FFFH
HFFG


In [57]:
# Same learning function but for different environment
initQLearn(total_epoches, env_deterministic, Q_deterministic, False)

Episode 19 was successful. Agent has reached the Exit.
Episode 39 was successful. Agent has reached the Exit.
Episode 52 was successful. Agent has reached the Exit.
Episode 66 was successful. Agent has reached the Exit.
Episode 98 was successful. Agent has reached the Exit.
Episode 124 was successful. Agent has reached the Exit.
Episode 199 was successful. Agent has reached the Exit.
Episode 263 was successful. Agent has reached the Exit.
Episode 295 was successful. Agent has reached the Exit.
Episode 331 was successful. Agent has reached the Exit.
Episode 394 was successful. Agent has reached the Exit.
Episode 406 was successful. Agent has reached the Exit.
Episode 412 was successful. Agent has reached the Exit.
Episode 434 was successful. Agent has reached the Exit.
Episode 498 was successful. Agent has reached the Exit.
Episode 500 was successful. Agent has reached the Exit.
Episode 513 was successful. Agent has reached the Exit.
Episode 519 was successful. Agent has reached the Exi

In [58]:
# Save our Q-Matrix to file
SaveQTableToFile(Q_deterministic, file_Q_deterministic)

In [59]:
initPlayByQModel(10, env_deterministic, file_Q_deterministic, False, False)

*** Starting Episode:  0
Success: Agent passed the Lake!
*** Starting Episode:  1
Success: Agent passed the Lake!
*** Starting Episode:  2
Success: Agent passed the Lake!
*** Starting Episode:  3
Success: Agent passed the Lake!
*** Starting Episode:  4
Success: Agent passed the Lake!
*** Starting Episode:  5
Success: Agent passed the Lake!
*** Starting Episode:  6
Success: Agent passed the Lake!
*** Starting Episode:  7
Success: Agent passed the Lake!
*** Starting Episode:  8
Success: Agent passed the Lake!
*** Starting Episode:  9
Success: Agent passed the Lake!


Wow. The environment clearly plays a huge role in the results.
Let's compare the learned Q-matrices:

In [60]:
print("Stochastic:\n\n", Q_stochastic, "\n\n\nDeterministic:\n\n", Q_deterministic)

Stochastic:

 [[0.63155759 0.7260818  0.63098495 0.64096586]
 [0.08923413 0.59320009 0.4857933  0.5864178 ]
 [0.58253307 0.54801955 0.74002867 0.5670273 ]
 [0.09847175 0.43629738 0.52282304 0.5225726 ]
 [0.69112814 0.72705092 0.76905186 0.12600166]
 [0.         0.         0.         0.        ]
 [0.59500203 0.00506861 0.15284422 0.46626663]
 [0.         0.         0.         0.        ]
 [0.73769729 0.12177414 0.02240348 0.75699989]
 [0.02665465 0.75511081 0.84217728 0.79496329]
 [0.88576406 0.8295706  0.79526765 0.79497617]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.15828885 0.02537745 0.92143326 0.16265478]
 [0.83612259 0.99437935 0.90915363 0.94866158]
 [0.         0.         0.         0.        ]] 


Deterministic:

 [[0.78275779 0.8153727  0.8153727  0.78275779]
 [0.78275779 0.         0.84934656 0.8153727 ]
 [0.8153727  0.884736   0.8153727  0.84934656]
 [0.84934656 0.         0.8153727  0.8153727 ]
 [0.8153727  0.84934656 0
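
The deterministic values visible above are powers of `gamma`. With a reward of 1 only on the final move and no randomness, the value of a move that reaches the goal in `k` optimal steps (this move included) settles at `gamma ** (k - 1)`. A quick check; `converged_value` is a hypothetical helper name:

```python
GAMMA = 0.96  # same value as gamma in this notebook

def converged_value(k):
    """Value of a move when k optimal steps (this move included) reach the goal."""
    return GAMMA ** (k - 1)

for k in range(4, 8):
    print(k, round(converged_value(k), 8))
```

For example, moving Down from tile 2 reaches the goal in 4 steps (2, 6, 10, 14, 15), which gives 0.96 ** 3 = 0.884736, the second entry of the third row.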

#### Play Against Environment <a class="anchor" id="pve"></a>

Let's adapt the environment to be friendly to a human agent.

In [61]:
def initPlayVsEnv(environment, showOutput=True):
    state = environment.reset()

    while True:
        if showOutput:
            environment.render()

        action = input('Your action? 0 -> Left, 1 -> Down, 2 -> Right, 3 -> Up')
        action = int(action)

        # Ask again on an out-of-range action instead of ending the game
        if action < 0 or action >= 4:
            print('No such input! Try once more (0 to 3)')
            continue

        state2, reward, done, info = environment.step(action)
        state = state2

        if done and reward == 1:
            print('Success: You have passed the lake!')
            break

        if done and reward == 0:
            print('Bottom of the Lake is Dark and Full of Terrors!')
            # This break must stay inside the if: one level out, it ends
            # the loop after the very first move
            break

In [62]:
initPlayVsEnv(env_stochastic, True)


SFFF
FHFH
FFFH
HFFG


Your action? 0 -> Left, 1 -> Down, 2 -> Right, 3 -> Up 0


That run halted after the first move :( The reason: the last `break` was indented at the loop level rather than inside the final `if`, so the loop exited after a single step. In the function above it sits inside the `if`.

### DNN Approach <a class="anchor" id="dnn"></a>

I had no time :'(