# FrozenLake
Today you will learn how to survive a walk across a (virtual) frozen lake using discrete optimization.

<img src="http://vignette2.wikia.nocookie.net/riseoftheguardians/images/4/4c/Jack's_little_sister_on_the_ice.jpg/revision/latest?cb=20141218030206" alt="a random image to attract attention" style="width: 400px;"/>


In [7]:
!cd /Users/alexajax/gym && pip install -e .[atari]


In [4]:
!wget https://raw.githubusercontent.com/justheuristic/vime/master/mountaincar-a2c.ipynb -O mountaincar-a2c.ipynb

--2017-01-06 17:21:11--  https://raw.githubusercontent.com/justheuristic/vime/master/mountaincar-a2c.ipynb
Resolving raw.githubusercontent.com... 151.101.192.133, 151.101.128.133, 151.101.64.133, ...
Connecting to raw.githubusercontent.com|151.101.192.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458325 (448K) [text/plain]
Saving to: ‘mountaincar-a2c.ipynb’


2017-01-06 17:21:13 (376 KB/s) - ‘mountaincar-a2c.ipynb’ saved [458325/458325]



In [None]:
import gym

# create a single game instance
env = gym.make("FrozenLake-v0")

# start a new game
env.reset();

In [None]:
# display the game state
env.render()

### Legend

![img](https://cdn-images-1.medium.com/max/800/1*MCjDzR-wfMMkS0rPqXSmKw.png)

### Gym interface

The three main methods of an environment are
* __reset()__ - reset the environment to its initial state and _return the first observation_
* __render()__ - show the current environment state (a more colorful version :) )
* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)
 * _new observation_ - the observation right after committing action __a__
 * _reward_ - a number representing your reward for committing action __a__
 * _is done_ - True if the MDP has just finished, False if it is still in progress
 * _info_ - some auxiliary information about what just happened. Ignore it for now
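
The loop below is a minimal sketch of how these three methods fit together. To keep it runnable without gym installed, it uses a made-up stand-in environment (`ToyCorridorEnv`, which is not part of gym) that follows the same `reset()` / `step(a)` convention described above; `run_episode` is likewise just an illustrative name.

```python
class ToyCorridorEnv:
    """Hypothetical stand-in for a gym environment (NOT part of gym).

    States 0..3 lie on a line; action 1 moves right, any other action moves left.
    Reaching state 3 ends the episode with reward 1.
    """
    n_states = 4

    def reset(self):
        self.state = 0
        return self.state  # first observation

    def step(self, action):
        move = 1 if action == 1 else -1
        # clip the position so the agent cannot leave the corridor
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        is_done = self.state == self.n_states - 1
        reward = 1.0 if is_done else 0.0
        return self.state, reward, is_done, {}  # (observation, reward, is done, info)

    def render(self):
        print(''.join('X' if i == self.state else '.' for i in range(self.n_states)))


def run_episode(env, choose_action, t_max=10):
    """Play one episode and return the sum of rewards."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(t_max):
        obs, reward, is_done, info = env.step(choose_action(obs))
        total_reward += reward
        if is_done:
            break
    return total_reward


toy_env = ToyCorridorEnv()
print(run_episode(toy_env, lambda obs: 1))  # always go right
```

The same `run_episode` loop should work with a real gym environment of this vintage, as long as `step` returns the 4-tuple listed above.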

In [29]:
print("initial observation code:",env.reset())
print('printing observation:')
env.render()
print("observations:",env.observation_space, 'n=',env.observation_space.n)
print("actions:",env.action_space, 'n=',env.action_space.n)

initial observation code: 0
printing observation:
SFFF
FHFH
FFFH
HFFG
  (Up)
observations: Discrete(16) n= 16
actions: Discrete(4) n= 4


In [30]:
print ("taking action 3 (up)")
new_obs, reward, is_done, _ = env.step(3)
print ("new observation code:",new_obs)
print ("reward:", reward)
print ("is game over?:",is_done)
print ("printing new state:")
env.render()

taking action 3 (up)
new observation code: 0
reward: 0.0
is game over?: False
printing new state:
SFFF
FHFH
FFFH
HFFG
  (Up)



In [31]:
action_to_i = {
    'left':0,
    'down':1,
    'right':2,
    'up':3
}

### Play with it
* Try walking 5 steps without falling into a (H)ole
 * Bonus quest - get to the (G)oal
* Sometimes your actions will not be executed as intended because you slip on the ice
* If you fall, call __env.reset()__ to restart
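
The slipping can be sketched as follows, assuming the default slippery FrozenLake dynamics: the move actually executed is drawn uniformly from the intended action and its two perpendicular neighbours. The helper names (`slip_candidates`, `executed_action`) are illustrative and not part of gym.

```python
import random

ACTION_NAMES = ['left', 'down', 'right', 'up']  # same indexing as action_to_i

def slip_candidates(action):
    """Actions that may actually be executed when `action` is requested."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def executed_action(action, rng=random):
    """Sample the move that really happens on slippery ice (assumed uniform)."""
    return rng.choice(slip_candidates(action))

for a, name in enumerate(ACTION_NAMES):
    print(name, '->', [ACTION_NAMES[b] for b in slip_candidates(a)])
```

Under this assumption the agent never slips in the direction opposite to the one it asked for, which is worth remembering when choosing moves next to a hole.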

In [32]:
env.step(action_to_i['up'])
env.render()

SFFF
FHFH
FFFH
HFFG
  (Up)



### Policy

* The environment has a 4x4 grid of states (16 total), indexed from 0 to 15
* From each state there are 4 actions (left, down, right, up), indexed from 0 to 3

We need to define the agent's policy for picking actions given states. Since we have only 16 distinct states and 4 actions, we can simply store the action for each state in an array.

This means that any array of 16 integers from 0 to 3 is a valid policy.
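
As a quick illustration (the helper names below are made up for this sketch), a state index can be mapped to a grid cell by integer division, assuming the states are numbered row by row. The last line counts how many distinct arrays of this kind exist.

```python
N_ROWS, N_COLS, N_ACTIONS = 4, 4, 4

def state_to_cell(state):
    """Convert a state index (0..15) into (row, column), assuming row-major order."""
    return divmod(state, N_COLS)

def cell_to_state(row, col):
    """Inverse of state_to_cell."""
    return row * N_COLS + col

print(state_to_cell(0))    # start tile, top-left corner
print(state_to_cell(15))   # goal tile, bottom-right corner
print(N_ACTIONS ** (N_ROWS * N_COLS), "possible deterministic policies")
```

With roughly 4.3 billion candidates, evaluating every policy is impractical, which is one motivation for sampling policies instead.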

In [36]:
import numpy as np
def get_random_policy():
    """
    Build a numpy array representing the agent's policy.
    The array must have one element for each of the 16 environment states.
    Each element must be an integer from 0 to 3, representing the action
    to take from that state.
    """
    return np.random.randint(0,4,16)

In [35]:
np.random.seed(1234)
policies = [get_random_policy() for i in range(10**4)]

assert all([len(p) == 16 for p in policies]), 'policy length should always be 16'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == 3, 'maximal action id should be 3'
action_probas = np.unique(policies,return_counts=True)[-1] /10**4. /16.
print ("Action frequencies over 10^4 samples:",action_probas)
assert np.allclose(action_probas,[0.25]*4,atol=0.05), "The policies aren't uniformly random (maybe it's just extremely bad luck)"
print ("Seems fine!")

Action frequencies over 10^4 samples: [ 0.25014375  0.25130625  0.2495375   0.2490125 ]
Seems fine!


### Let's evaluate!
* Implement a simple function that runs one game and returns the total reward

In [41]:
policy = get_random_policy()
policy

array([0, 2, 1, 0, 0, 0, 2, 2, 0, 3, 3, 2, 0, 1, 1, 3])

In [47]:
s = env.reset()

new_s,r,done,_ = env.step(policy[s])

In [49]:
env.step(policy[new_s])

(0, 0.0, False, {'prob': 0.3333333333333333})

In [50]:
def sample_reward(env,policy,t_max=25):
    """
    Interact with the environment and return the sum of all rewards.
    If the game doesn't end within t_max steps (e.g. the agent keeps walking into a wall),
    force-end the game and return whatever reward was collected so far.
    """
    s = env.reset()
    total_reward = 0

    for i in range(t_max):
        # act according to the policy in the current state
        s,r,is_end,_ = env.step(policy[s])
        total_reward += r
        if is_end:
            break
    return total_reward

In [52]:
print ("generating 10^3 sessions...")
rewards = [sample_reward(env,get_random_policy()) for _ in range(10**3)]
assert all([type(r) in (int,float) for r in rewards]), 'sample_reward must return a single number'
assert all([0 <= r <= 1 for r in rewards]), 'total rewards should be between 0 and 1 for frozenlake'
print ("Looks good!")

generating 10^3 sessions...
Looks good!


In [53]:
def evaluate(policy,n_times=100):
    """Run several evaluations and average the score the policy gets."""
    rewards = [sample_reward(env,policy) for x in range(n_times)]
    return np.mean(rewards)
        

In [56]:
def print_policy(policy):
    """a function that displays a policy in a human-readable way"""
    lake = "SFFFFHFHFFFHHFFG"
    
    # where to move from each tile
    arrows = ['<v>^'[a] for a in policy]
    
    #draw arrows above S and F only
    signs = [arrow if tile in "SF" else tile for arrow,tile in zip(arrows,lake)]
    
    for i in range(0,16,4):
        print (' '.join(signs[i:i+4]))

print ("random policy:")
print_policy(get_random_policy())

random policy:
> < < >
< H ^ H
v < < H
H > ^ G


### Random search

In [59]:
best_policy = None
best_score = -float('inf')

from tqdm import tqdm
for i in tqdm(range(10000)):
    policy = get_random_policy()
    score = evaluate(policy)
    if score > best_score:
        best_score = score
        best_policy = policy
        print ("New best score:",score)
        print ("Best policy:")
        print_policy(best_policy)

  0%|          | 9/10000 [00:00<01:56, 85.60it/s]

New best score: 0.0
Best policy:
^ > ^ <
> H ^ H
> ^ v H
H < ^ G
New best score: 0.07
Best policy:
v ^ < <
< H v H
^ v ^ H
H > > G


  0%|          | 38/10000 [00:00<01:51, 89.23it/s]

New best score: 0.09
Best policy:
< > v ^
v H > H
v v < H
H > ^ G


  1%|          | 80/10000 [00:00<01:59, 82.72it/s]

New best score: 0.1
Best policy:
^ ^ v ^
v H < H
> v > H
H < > G
New best score: 0.13
Best policy:
> v > ^
v H < H
^ v < H
H v > G


  3%|▎         | 281/10000 [00:02<01:37, 99.51it/s]

New best score: 0.17
Best policy:
< ^ < v
< H < H
^ v < H
H ^ v G


  7%|▋         | 709/10000 [00:07<01:41, 91.70it/s]

New best score: 0.2
Best policy:
v v < >
< H < H
^ > v H
H > v G


 11%|█         | 1097/10000 [00:11<01:44, 84.96it/s]

New best score: 0.36
Best policy:
< > ^ >
< H > H
^ v < H
H > v G


100%|██████████| 10000/10000 [02:43<00:00, 61.32it/s]
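
Keep in mind that each score above is an average over only 100 episodes (the default `n_times` in `evaluate`), so it is a noisy estimate. The sketch below gives a rough idea of how noisy, assuming each episode's reward is an independent 0-or-1 outcome; `score_standard_error` is an illustrative helper, not part of the notebook.

```python
import math

def score_standard_error(score, n_times=100):
    """Standard error of a mean of `n_times` independent 0/1 rewards."""
    return math.sqrt(score * (1.0 - score) / n_times)

for score in (0.07, 0.2, 0.36):
    se = score_standard_error(score)
    print("score %.2f  +/- %.3f (one standard error)" % (score, se))
```

Because random search keeps the maximum over many noisy evaluations, the reported best score tends to be optimistic. Re-running `evaluate(best_policy, n_times=1000)` should give a steadier figure.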


### Genetic algorithm

In [None]:
def recombine(policy1,policy2,p=0.5):
    """
    for each state, with probability p take action from policy1, else policy2
    """
    <your code>
    return <your code>
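
One possible way to fill in the placeholder above, offered as a sketch rather than the reference solution: draw a random boolean mask with one entry per state and use `np.where` to pick each action from the first or the second parent.

```python
import numpy as np

def recombine(policy1, policy2, p=0.5):
    """For each state, take the action from policy1 with probability p, else from policy2."""
    policy1, policy2 = np.asarray(policy1), np.asarray(policy2)
    take_first = np.random.rand(len(policy1)) < p  # boolean mask, one entry per state
    return np.where(take_first, policy1, policy2)

parent_a = np.zeros(16, dtype=int)      # always "left"
parent_b = np.full(16, 2, dtype=int)    # always "right"
child = recombine(parent_a, parent_b)
print(child)
```

With `p=1.0` the child is a copy of `policy1` and with `p=0.0` a copy of `policy2`, which makes for an easy sanity check.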

In [None]:
np.random.seed(1234)
policies = [recombine(get_random_policy(),get_random_policy()) 
            for i in range(10**4)]

assert all([len(p) == 16 for p in policies]), 'policy length should always be 16'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == 3, 'maximal action id should be 3'
print ("Seems fine!")

In [None]:

pool_size = 100
n_recombinations = 50
n_mutations = 10

n_epochs = 100

print ("initializing...")
pool = [get_random_policy() for _ in range(pool_size)]
pool_scores = [evaluate(p) for p in pool]

for epoch in range(n_epochs):
    print ("Epoch %s:"%epoch)
    recombined = <recombine random guys from pool>
    
    mutated = <add several new policies at random>
    
    everyone = pool + recombined + mutated
    
    scores = pool_scores+[evaluate(p) for p in recombined+mutated]
    
    #select best
    selected_indices = np.argsort(scores)[-pool_size:]
    pool = [everyone[i] for i in selected_indices]
    pool_scores = [scores[i] for i in selected_indices]
    
    print (evaluate(pool[-1]))
    print_policy(pool[-1])
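
The two placeholders in the loop above could be filled in along the following lines. This is only a sketch under stated assumptions: `make_recombined` and `make_mutated` are hypothetical helper names, "mutation" is read literally as adding fresh random policies, and `recombine` is one possible implementation of the exercise. The functions are redefined here so that the block runs on its own.

```python
import numpy as np

def get_random_policy():
    """16 random actions, each an integer from 0 to 3."""
    return np.random.randint(0, 4, 16)

def recombine(policy1, policy2, p=0.5):
    """Per state, take the action from policy1 with probability p, else policy2."""
    mask = np.random.rand(len(policy1)) < p
    return np.where(mask, policy1, policy2)

def make_recombined(pool, n_recombinations):
    """Cross random pairs of policies drawn from the pool (a pair may repeat a parent)."""
    children = []
    for _ in range(n_recombinations):
        i, j = np.random.randint(0, len(pool), size=2)
        children.append(recombine(pool[i], pool[j]))
    return children

def make_mutated(n_mutations):
    """Assumption: 'mutation' here simply means brand-new random policies."""
    return [get_random_policy() for _ in range(n_mutations)]

pool = [get_random_policy() for _ in range(100)]
recombined = make_recombined(pool, n_recombinations=50)
mutated = make_mutated(n_mutations=10)
everyone = pool + recombined + mutated
print(len(everyone))
```

Both helpers return plain Python lists, so `pool + recombined + mutated` concatenates them as the loop expects.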