# Skiing by evolution

The Skiing game has a reward structure that makes learning difficult.

The goal of the game is to avoid trees and pass through the gates, but the reward for the gates is given only at the end. The agent receives a living penalty of -3 to -7 per step and -500 * (missed gates) at the end of the game.
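
To make the reward structure concrete, here is a minimal sketch that adds up the return of one episode as described above. The helper name `episode_return` and the numbers in the example (1000 steps at -5 each, 3 missed gates) are illustrative assumptions, not values measured from the emulator.

```python
def episode_return(step_penalties, missed_gates, gate_penalty=-500):
    """Sum the per-step living penalties and add the end-of-game gate penalty."""
    return sum(step_penalties) + gate_penalty * missed_gates

# Hypothetical episode: 1000 steps with a living penalty of -5 each, 3 missed gates.
steps = [-5] * 1000
total = episode_return(steps, missed_gates=3)
print(total)  # -6500
```

In this hypothetical episode the living penalty contributes -5000 and the gates only -1500, and the gate term arrives as a single signal at the very end.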

Because of that, the agent can evaluate its behavior only at the end of the game, and not immediately after passing (or not passing) gates.

The standard Q-learning technique would require a significant amount of training in this case, because almost all (roughly 99.9%) of the network's weight updates would be meaningless: they would be driven by the random living penalty rather than by the gates.

Therefore, we use another approach to train the network: an evolutionary algorithm. Instead of gradient updates, we use mutation and selection based on the sum of rewards. The solution basically follows this scheme:
1. Create a random set of neural networks, each consisting of 2 convolutional and 2 fully-connected layers
2. Evaluate their fitness (i.e. the sum of rewards over one game)
3. Choose the best ones
4. Crossover and mutate them (possibly adding new neurons); the code in this notebook implements only the mutation part
5. Repeat from stage 2 with the resulting networks.
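
The scheme above can be sketched on a toy problem. This is a minimal illustration, not the notebook's training code: it evolves flat weight vectors by mutation and selection only (no crossover), and the names `evolve` and `toy_fitness`, the population sizes, and the mutation scale `sigma` are all assumptions chosen for the example.

```python
import numpy as np

def evolve(fitness, init, generations=50, population=20, keep=5, sigma=0.1, seed=0):
    """Mutation-and-selection loop over weight vectors (higher fitness is better)."""
    rng = np.random.RandomState(seed)
    # Step 1: create a random initial population.
    pop = [init(rng) for _ in range(population)]
    for _ in range(generations):
        # Step 2: evaluate the fitness of every individual.
        scores = np.array([fitness(w) for w in pop])
        # Step 3: keep the best individuals.
        order = np.argsort(scores)[::-1][:keep]
        best = [pop[i] for i in order]
        # Step 4: refill the population with mutated copies of the survivors.
        children = [best[rng.randint(keep)] + sigma * rng.standard_normal(best[0].shape)
                    for _ in range(population - keep)]
        # Step 5: repeat with the new population.
        pop = best + children
    scores = [fitness(w) for w in pop]
    return pop[int(np.argmax(scores))]

# Toy fitness standing in for the sum of game rewards: maximal at w == target.
target = np.array([1.0, -2.0, 0.5])

def toy_fitness(w):
    return -np.sum((w - target) ** 2)

best = evolve(toy_fitness, init=lambda rng: rng.standard_normal(3))
print(toy_fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next, provided the fitness is deterministic. In the game the fitness is a full episode, so each evaluation is far more expensive than in this toy.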

Our solution is inspired by a method called NEAT, used to play Mario: https://www.youtube.com/watch?v=qv6UVOQ0F44

The example in the video uses handcrafted features, but we use 2 convolutional layers trained the same way instead. This way, the approach can (theoretically) play any game of this kind without any game-specific features or rewards.
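
As a rough check on what the two convolutional layers produce, the sketch below traces the spatial size of a 42x42 frame through two rounds of 5x5 "valid" convolution, each followed by 2x2 pooling that drops any remainder. These are the settings used by the `NeuralNetwork` class in this notebook; the helper names `conv_valid` and `pool` are illustrative.

```python
def conv_valid(size, kernel):
    """Output side length of a 'valid' convolution along one dimension."""
    return size - kernel + 1

def pool(size, window):
    """Output side length of non-overlapping pooling that ignores the border."""
    return size // window

size = 42                      # side length of the preprocessed frame
for kernel in (5, 5):          # two convolutional layers with 5x5 filters
    size = pool(conv_valid(size, kernel), 2)

features = size * size
print(size, features)  # 7 49
```

The result, 49 features, matches the first dimension of the `(49, 100)` weight matrix of the first fully-connected layer.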

However, the training time was not sufficient, so for now the agent is only able to go straight down (at the beginning it pressed keys quite randomly).

# Imports

In [None]:
from six.moves import cPickle
import cv2
import numpy as np
from scipy.signal import convolve2d
import theano
import gym
from gym import wrappers
import theano.tensor as T
import lasagne

In [3]:
# Crop the frame and convert it to 42x42 grayscale with values in [0, 1]
def _process_frame42(frame):
    frame = frame[34:34+160, :160]
    # Resize by half, then down to 42x42 (essentially mipmapping). If
    # we resize directly we lose pixels that, when mapped to 42x42,
    # aren't close enough to the pixel boundary.
    frame = cv2.resize(frame, (80, 80))
    frame = cv2.resize(frame, (42, 42))
    frame = frame.mean(2)
    frame = frame.astype(np.float32)
    frame *= (1.0 / 255.0)
    frame = np.reshape(frame, [42, 42, 1])
    return frame

In [4]:
# Neural Evolution
class NeuralNetwork:
    conv1_size = 5
    conv2_size = 5
    evolution_probability = 0.97
    scale_factor = 1
    def __init__(self):
        self.conv1_filtr = np.random.standard_normal((self.conv1_size, self.conv1_size))
        self.conv2_filtr = np.random.standard_normal((self.conv2_size, self.conv2_size))
        self.dense1_weights = np.random.standard_normal((49, 100))
        self.dense2_weights = np.random.standard_normal((100, 3))
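    _pooling = None  # compiled 2x2 max-pooling function, shared by all instances
    def Convolve(self, compressed_observation):
        # Compile the Theano pooling function once instead of on every call
        if NeuralNetwork._pooling is None:
            input_var = T.dmatrix('inputs')
            NeuralNetwork._pooling = theano.function(
                [input_var],
                theano.tensor.signal.pool.pool_2d(input_var, (2, 2), ignore_border=True))
        pooling = NeuralNetwork._pooling
        # conv 5x5 -> pool 2x2 -> conv 5x5 -> pool 2x2
        return pooling(convolve2d(pooling(convolve2d(compressed_observation, self.conv1_filtr , mode='valid')),
                          self.conv2_filtr, mode='valid'))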
    def ForwardPropogate(self, compressed_observation):
        result_convolution = self.Convolve(compressed_observation)
        result_convolution = result_convolution.reshape(1, -1)
        dense1_output = result_convolution.dot(self.dense1_weights)
        dense1_activations = 1 / (1 + np.exp(- dense1_output))
        dense2_output = dense1_activations.dot(self.dense2_weights)
        dense2_activations =  1 / (1 + np.exp(- dense2_output))
        return dense2_activations
    def _mutate(self, weights):
        # Perturb only the entries where a standard-normal draw exceeds
        # evolution_probability (a threshold on N(0, 1), not a probability);
        # selected entries receive Gaussian noise scaled by scale_factor.
        mask = (np.random.standard_normal(weights.shape) - self.evolution_probability) > 0
        return weights + mask * np.random.standard_normal(weights.shape) * self.scale_factor
    def Evolution(self):
        # Return a mutated copy; the parent network is left unchanged.
        new_network = NeuralNetwork()
        new_network.conv1_filtr = self._mutate(self.conv1_filtr)
        new_network.conv2_filtr = self._mutate(self.conv2_filtr)
        new_network.dense1_weights = self._mutate(self.dense1_weights)
        new_network.dense2_weights = self._mutate(self.dense2_weights)
        return new_network
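
A note on the mutation rate: `Evolution()` selects the weights to perturb by comparing a standard-normal draw against `evolution_probability`, so the value 0.97 acts as a threshold on a Gaussian rather than as a probability. The sketch below estimates what fraction of weights is actually mutated under that masking expression. The sample size and seed are arbitrary choices for the illustration.

```python
import math
import numpy as np

threshold = 0.97  # value of evolution_probability in the class above

# Exact tail probability of a standard normal: P(Z > t) = 0.5 * erfc(t / sqrt(2)).
exact = 0.5 * math.erfc(threshold / math.sqrt(2))

# Empirical check using the same masking expression as Evolution().
rng = np.random.RandomState(0)
mask = (rng.standard_normal(1000000) - threshold) > 0
empirical = mask.mean()

print(round(exact, 3), round(empirical, 3))
```

The exact value is about 0.166, so roughly one weight in six is perturbed per mutation, not 3% as the name `evolution_probability` might suggest. If a 3% mutation rate were intended, a uniform draw compared against 0.97 would give it, but that is a guess about intent rather than something the notebook states.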

In [10]:
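# a = argmax_a Q(s,a)
def predict_action(observation, network):
    # Preprocess the frame and pick the action with the highest network output
    # (the original returned env.action_space.sample() here, so the network was never used)
    compressed_observation = _process_frame42(observation)
    return np.argmax(network.ForwardPropogate(compressed_observation[:,:,0]))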

In [11]:
# Play one game taking actions provided by the network
def PlayGame(env, network):
    observation = env.reset()
    done = False
    iteration, all_reward = 0, 0
    while not done:
        env.render()
        action = predict_action(observation, network)
        observation, reward, done, info = env.step(action)
        all_reward += reward

        if all_reward < -30000 or iteration >= 9000:
            break
        
        iteration += 1

    print("Reward: ", all_reward)
    return all_reward

# Training

In [12]:
env = gym.make("Skiing-v0")
#env = wrappers.Monitor(env, "/tmp/gym-results", force = True)

[2017-02-01 17:44:38,142] Making new env: Skiing-v0


In [13]:
network = NeuralNetwork()
print("Initializing...")
reward = PlayGame(env, network)
print("Init done")
num_evolution_try = 3
iteration = 0
while reward < -6000:
    evolution_rewards = []
    evolution_networks = []
    
    for i in range(0, num_evolution_try):
        new_network = network.Evolution()
        evolution_networks += [new_network]
        print("Epoch {0} try {1}".format(iteration, i))
        evolution_rewards += [PlayGame(env, new_network)]

    i_max = np.argmax(evolution_rewards)
    if evolution_rewards[i_max] < reward:
        continue
    else:
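        network = evolution_networks[i_max]
        # Track the new best reward so the loop condition and the comparison above use it
        reward = evolution_rewards[i_max]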
        
    iteration += 1

Initializing...



In [None]:
PlayGame(env, network)

In [20]:
# Save and restore

def save_to_cPickle(file_name, obj):
    # Serialize obj to '<file_name>.save'
    with open(file_name + '.save', 'wb') as f:
        cPickle.dump(obj, f, protocol=cPickle.HIGHEST_PROTOCOL)

def load_from_cPickle(file_name):
    # Load the object stored in '<file_name>.save'
    with open(file_name + '.save', 'rb') as f:
        return cPickle.load(f)

In [None]:
save_to_cPickle("best_network", network)
new_net = load_from_cPickle("best_network")