#  Lab 1
## Voropaev Pavel, 144
#### Frozen Lake

In this notebook i was trying to understand how gym works, and create a few algoritms for simple game **FrozenLake 4x4**

This game game is some kind of labirint, but you can sometimes slide throw the cell. There are 4 command for agent - L, R, U, D  - left, right, up and down.

In [72]:
import gym

#create a single game instance
env = gym.make("FrozenLake-v0")

#start new game
env.reset();

[2017-12-18 11:46:24,116] Making new env: FrozenLake-v0


In [73]:
# display the game state
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


#### Gym interface

The three main methods of an environment are
* __reset()__ - reset environment to initial state, _return first observation_
* __render()__ - show current environment state (a more colorful version :) )
* __step(a)__ - commit action __a__ and return (new observation, reward, is done, info)
 * _new observation_ - an observation right after commiting the action __a__
 * _reward_ - a number representing your reward for commiting action __a__
 * _is done_ - True if the MDP has just finished, False if still in progress
 * _info_ - some auxilary stuff about what just happened. Ignore it for now

In [74]:
print("initial observation code:", env.reset())
print('printing observation:')
env.render()
print("observations:", env.observation_space, 'n=', env.observation_space.n)
print("actions:", env.action_space, 'n=', env.action_space.n)

initial observation code: 0
printing observation:

[41mS[0mFFF
FHFH
FFFH
HFFG
observations: Discrete(16) n= 16
actions: Discrete(4) n= 4


In [75]:
print("taking action 2 (right)")
new_obs, reward, is_done, _ = env.step(2)
print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", is_done)
print("printing new state:")
env.render()

taking action 2 (right)
new observation code: 1
reward: 0.0
is game over?: False
printing new state:
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG


In [76]:
action_to_i = {
    'left':0,
    'down':1,
    'right':2,
    'up':3
}

Now, we can try to win the game manualy. But it is really not so easy :)

In [77]:
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


Run this cell, and change the action, if you want

In [79]:
new_obs, reward, is_done, _ = env.step(action_to_i['right'])
print(new_obs,is_done, reward)
env.render()

0 False 0.0
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG


## Random srtategy

#### Now we can try to learn some simple strategy, random strategy for example

In [80]:
import numpy as np

n_states = env.observation_space.n
n_actions = env.action_space.n

# create numpy array representing agent policy
# array have size 16, so one action from 0 to 3 for each state of the game
def get_random_policy():
    return np.random.randint(0,n_actions,n_states)

In [81]:
np.random.seed(42)
policies = [get_random_policy() for i in range(10**4)]

In [83]:
# play the game with given policy
# return total reward, if the game not end after t_max steps return current reward
def sample_reward(env, policy, t_max=100):
    s = env.reset()
    total_reward = 0 
    t = 0
    is_done = False
    cur_obs = 0
    
    while t < t_max and not is_done:
        new_obs, reward, is_done, _ = env.step(policy[cur_obs])
        cur_obs = new_obs
        t += 1
        total_reward += reward
        
    return total_reward

In [84]:
# run the game n_times with one policy and return average reward
def evaluate(policy, n_times=100):
    rewards = [sample_reward(env,policy) for _ in range(n_times)]
    return float(np.mean(rewards))       

In [85]:
# print policy in readable way
def print_policy(policy):
    lake = "SFFFFHFHFFFHHFFG"
    assert env.spec.id == "FrozenLake-v0", "this function only works with frozenlake 4x4"

    # where to move from each tile (we're a bit unsure if this is accurate)
    arrows = ['>^v<'[a] for a in policy]
    
    #draw arrows above S and F only
    signs = [arrow if tile in "SF" else tile for arrow, tile in zip(arrows, lake)]
    
    for i in range(0, 16, 4):
        print(' '.join(signs[i:i+4]))

print("random policy:")
print_policy(get_random_policy())

random policy:
< ^ v <
> H < H
< v ^ H
H v > G


#### Main loop

In [None]:
best_policy = None
best_score = -float('inf')

from tqdm import tqdm
for i in tqdm(range(10000)):
    policy = get_random_policy()
    score = evaluate(policy)
    if score > best_score:
        best_score = score
        best_policy = policy
        print("New best score:", score)
        print("Best policy:")
        print_policy(best_policy)

  0%|          | 4/10000 [00:00<05:43, 29.12it/s]

New best score: 0.0
Best policy:
v < v >
< H v H
v < < H
H ^ > G
New best score: 0.1
Best policy:
> > ^ >
> H > H
< < > H
H v < G


  0%|          | 37/10000 [00:00<02:58, 55.78it/s]

New best score: 0.11
Best policy:
> ^ v ^
> H > H
< > > H
H ^ v G


  1%|          | 74/10000 [00:01<02:46, 59.54it/s]

New best score: 0.14
Best policy:
< < v <
v H > H
^ < ^ H
H ^ ^ G


  1%|          | 81/10000 [00:01<03:43, 44.34it/s]

New best score: 0.2
Best policy:
< < > >
^ H v H
> ^ v H
H ^ v G


  1%|          | 87/10000 [00:01<04:25, 37.27it/s]

New best score: 0.68
Best policy:
> v < v
> H v H
< ^ > H
H v ^ G


 21%|██        | 2076/10000 [00:33<02:00, 65.70it/s]

## Genetic algorithm

In [86]:
# for each state, with probability p take action from policy1, else policy2
def crossover(policy1, policy2, p=0.5):
    return np.where(np.random.random(policy1.shape[0]) <= p, policy1, policy2)

In [87]:
# for each state, with probability p replace action with random action
def mutation(policy, p=0.1):
    return crossover(get_random_policy(), policy, p)

In [37]:
n_epochs = 500 #how many cycles to make
pool_size = 1000 #how many policies to maintain
n_crossovers = 250 #how many crossovers to make on each step
n_mutations = 250 #how many mutations to make on each tick

In [26]:
print("initializing...")
pool = [get_random_policy() for _ in range(pool_size)]
pool_scores = [evaluate(policy) for policy in pool]

initializing...


#### Main loop

In [24]:
for epoch in range(n_epochs):
    print("Epoch %s:"%epoch)
    
    crossovered = [crossover(pool[np.random.randint(pool_size)],
                             pool[np.random.randint(pool_size)])
                   for _ in range(n_crossovers)]
    mutated = [mutation(pool[np.random.randint(pool_size)]) for _ in range(n_mutations)]
    
    #add new policies to the pool
    pool = pool + crossovered + mutated
    pool_scores = [evaluate(policy) for policy in pool]
    
    #select pool_size best policies
    selected_indices = np.argsort(pool_scores)[-pool_size:]
    pool = [pool[i] for i in selected_indices]
    pool_scores = [pool_scores[i] for i in selected_indices]

    #print the best policy so far (last in ascending score order)
    print("best score:", pool_scores[-1])
    print_policy(pool[-1])

Epoch 0:
best score: 0.2
> v v <
> H > H
< v v H
H ^ ^ G
Epoch 1:
best score: 0.22
> v v <
> H > H
< v v H
H ^ ^ G
Epoch 2:
best score: 0.22
> v v ^
> H > H
< v v H
H ^ ^ G
Epoch 3:
best score: 0.58
> v > >
> H > H
< ^ ^ H
H v ^ G
Epoch 4:
best score: 0.62
> v > >
> H > H
< ^ ^ H
H v ^ G
Epoch 5:
best score: 0.71
> v > ^
> H < H
< ^ > H
H v ^ G
Epoch 6:
best score: 0.68
> > > ^
> H < H
< ^ > H
H v ^ G
Epoch 7:
best score: 0.71
> < < <
> H > H
< ^ > H
H v ^ G
Epoch 8:
best score: 0.77
> < ^ <
> H v H
< ^ > H
H v ^ G
Epoch 9:
best score: 0.83
> < < <
> H > H
< ^ > H
H v ^ G
Epoch 10:
best score: 0.79
> < < <
> H v H
< ^ > H
H v ^ G
Epoch 11:
best score: 0.84
> < < <
> H > H
< ^ > H
H v ^ G
Epoch 12:
best score: 0.8
> v > >
> H v H
< ^ > H
H v ^ G
Epoch 13:
best score: 0.84
> < < <
> H > H
< ^ > H
H v ^ G
Epoch 14:
best score: 0.83
> < < <
> H v H
< ^ > H
H v ^ G
Epoch 15:
best score: 0.82
> v > >
> H v H
< ^ > H
H v ^ G
Epoch 16:
best score: 0.82
> < > v
> H v H
< ^ > H
H v ^ G
Epoch 17:

KeyboardInterrupt: 

#### Now we can try make our crossover a little bit smarter :) 
#### So each time we will take action from better policy  with higher probability

In [88]:
def smart_crossover(policy1, policy2, p=0.8):    
    score1 = evaluate(policy1)
    score2 = evaluate(policy2)
    
    if score1 < score2:
        p = 1 - p
    
    return np.where(np.random.random(policy1.shape[0]) <= p, policy1, policy2)

#### Main loop with smart crossover and more diverse pool

In [30]:
print("initializing...")
pool = [get_random_policy() for _ in range(pool_size)]
pool_scores = [evaluate(policy) for policy in pool]
n_random_policy = 10

for epoch in range(n_epochs):
    print("Epoch %s:"%epoch)
    
    crossovered = [smart_crossover(pool[np.random.randint(pool_size)],
                                   pool[np.random.randint(pool_size)])
                   for _ in range(n_crossovers)]
    mutated = [mutation(pool[np.random.randint(pool_size)], p=0.2) for _ in range(n_mutations)]
    
    assert type(crossovered) == type(mutated) == list
    
    #add new policies to the pool
    pool = pool + crossovered + mutated
    pool_scores = [evaluate(policy) for policy in pool]
    
    #select pool_size best policies
    selected_indices = np.argsort(pool_scores)[-pool_size + n_random_policy:]
    random_policies = [pool[np.random.randint(pool_size)] for _ in range(n_random_policy)]
    pool = [pool[i] for i in selected_indices] + random_policies
    pool_scores = [evaluate(policy) for policy in random_policies] + [pool_scores[i] for i in selected_indices]

    #print the best policy so far (last in ascending score order)
    print("best score:", pool_scores[-1])
    print_policy(pool[-1])

initializing...
Epoch 0:
best score: 0.28
^ < < <
^ H v H
^ v > H
H > v G
Epoch 1:
best score: 0.33
^ > > v
^ H > H
< < < H
H < < G
Epoch 2:
best score: 0.37
v > ^ >
> H v H
v < > H
H < v G
Epoch 3:
best score: 0.35
< < > <
v H > H
^ > ^ H
H < ^ G
Epoch 4:
best score: 0.39
< < > >
> H > H
^ > ^ H
H < v G
Epoch 5:
best score: 0.5
> v > v
> H ^ H
< v v H
H v ^ G
Epoch 6:
best score: 0.76
> v > <
> H ^ H
< v ^ H
H < ^ G
Epoch 7:
best score: 0.69
> v v <
> H v H
< v ^ H
H v ^ G
Epoch 8:
best score: 0.74
> < > v
> H ^ H
< v v H
H v v G
Epoch 9:
best score: 0.74
> ^ v >
> H < H
< ^ ^ H
H ^ v G
Epoch 10:
best score: 0.76
> ^ > >
> H ^ H
< ^ ^ H
H v v G
Epoch 11:
best score: 0.77
> > > <
> H ^ H
< ^ ^ H
H v ^ G
Epoch 12:
best score: 0.78
> < > <
> H > H
< ^ ^ H
H v ^ G
Epoch 13:
best score: 0.81
> < > <
> H < H
< ^ ^ H
H v ^ G
Epoch 14:
best score: 0.82
> > v <
> H v H
< ^ > H
H v ^ G
Epoch 15:
best score: 0.79
> > ^ v
> H > H
< ^ > H
H v ^ G
Epoch 16:
best score: 0.81
> v v <
> H v H
< ^ > H
