# Cartpole

### Steps

For the Deep Q agent, define:
- the DNN model
- the action selection procedure
  - needs epsilon and epsilon decay parameters
- memory and replay functions for learning
  - needs learning rate and batch size parameters
  - minimizes the cost (MSE), where the target value is the current reward plus the discounted maximum predicted Q-value of the next state (the code below applies a discount factor of 1 - gamma = 0.9)

Uses https://keon.io/deep-q-learning/ as a reference.
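
The action selection step listed above can be sketched on its own, without a network. This is a minimal illustration rather than the notebook's exact code: `epsilon_greedy` and `decay_epsilon` are hypothetical helper names, the Q-values and the starting epsilon of 1.0 are made up, and the multiplicative decay mirrors the `epsilon *= (1 - epsilon_decay)` update used by the agent further down.

```python
import random

import numpy as np


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))


def decay_epsilon(epsilon, decay, min_epsilon):
    """Shrink epsilon multiplicatively, never dropping below min_epsilon."""
    return max(min_epsilon, epsilon * (1.0 - decay))


# Hypothetical Q-values for the two CartPole actions (push left, push right).
q_values = np.array([0.2, 0.7])
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # epsilon 0 never explores

# Made-up starting value; 0.005 matches the decay rate used by the agent.
epsilon = 1.0
for _ in range(3):
    epsilon = decay_epsilon(epsilon, decay=0.005, min_epsilon=0.01)
```

With a high starting epsilon the agent explores almost uniformly at first and becomes greedier as epsilon shrinks toward the floor.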


In [3]:
!pip install gym
!pip install keras

Collecting gym
  Downloading gym-0.9.7.tar.gz (108kB)
Collecting pyglet>=1.2.0 (from gym)
  Downloading pyglet-1.3.1-py2.py3-none-any.whl (1.0MB)
Building wheels for collected packages: gym
  Running setup.py bdist_wheel for gym ... done
  Stored in directory: /content/.cache/pip/wheels/a8/e4/fc/145832d732d33de702076907d7c3b4c47ba4302dbedd35fc80
Successfully built gym
Installing collected packages: pyglet, gym
Successfully installed gym-0.9.7 pyglet-1.3.1
Collecting keras
  Downloading Keras-2.1.4-py2.py3-none-any.whl (322kB)
Installing collected packages: keras
Successfully installed keras-2.1.4


In [1]:
# Code in this cell from: https://www.kaggle.com/getting-started/47096 -- mount directory so that I can read / write files from drive

# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse



# Generate auth tokens for Colab
from google.colab import auth
auth.authenticate_user()


# Generate creds for the Drive FUSE library.
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}


# Create a directory and mount Google Drive using that directory.
!mkdir -p drive
!google-drive-ocamlfuse drive

print('Files in Drive:')
!ls drive/

# Create a file in Drive.
!echo "This newly created file will appear in your Drive file list." > drive/created.txt

Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
··········
Please enter the verification code: Access token retrieved correctly.
Files in Drive:
Australia Itinerary.ods  Fall 2015		Pictures-for-craigslist
CCC			 Ireland itinerary.ods	Resumes
colab_notebooks		 LPs			Sarangi
Colab Notebooks		 Misc-personal		saved_models
Columbia		 MPS Fall 2016		SensoDx
Columbia-backup		 MPS Spring 2016	SH
created.txt		 Old schoolwork		St. Croix expenses

In [4]:
import gym
import matplotlib.pyplot as plt
import numpy as np
import random
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

Using TensorFlow backend.


In [0]:
class DeepQAgent(object):
  
  
  def __init__(self, state_size, action_size):
    self.state_size = state_size
    self.action_size = action_size
    
    self.learning_rate = 0.001
    self.gamma = 0.1 # note: replay() discounts by (1 - gamma), so the effective discount factor is 0.9

    self.epsilon = 0.01 # set low if using pre-trained weights
    self.epsilon_decay = 0.005
    self.min_epsilon = 0.01
    
    self.replay_buffer = deque(maxlen=10000) # from stackoverflow: https://stackoverflow.com/questions/23487307/python-deque-vs-list-performance-comparison
    self.model = self.build_model()

    
  def build_model(self):
    
    # Construct a NN with two hidden layers, rectified linear unit activation
    model = Sequential()
    model.add(Dense(24, input_dim=self.state_size, activation='relu')) # input layer size (state_size,)
    model.add(Dense(24, activation='relu'))
    model.add(Dense(self.action_size, activation='linear')) # output layer size (action_size,). Linear because we want to map to Q values, e.g. not softmax probabilities
    
    model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
    
    return model

  
  def choose_action(self, state):
    if np.random.rand() <= self.epsilon: # explore with probability epsilon
      return random.randrange(self.action_size) # uniform random action, without relying on the global env
    else:
      q_values = self.model.predict(state)
      return np.argmax(q_values[0]) # choose the action that maximizes q value

    
  def add_to_replay_buffer(self, state, action, reward, next_state, done):
    self.replay_buffer.append((state, action, reward, next_state, done))

    
  def replay(self, batch_size):
    # sample a batch from memory
    batch = random.sample(self.replay_buffer, batch_size)

    # for the sampled data in batch
    for state, action, reward, next_state, done in batch:
      target = reward

      if not done: # if the state isn't terminal, add the discounted predicted value of the next state; otherwise the target is just the reward received
        target = reward + (1 - self.gamma) * np.amax(self.model.predict(next_state)[0]) # discount factor is (1 - gamma) = 0.9; take the highest Q VALUE (not the action), since we are updating the target

      # train the agent to map current state to future (discounted) reward
      q_values = self.model.predict(state) # predict the values (since we don't know what both actions would have yielded)
      q_values[0][action] = target # and then reassign the taken action value to the reward / target

      self.model.fit(state, q_values, epochs=1, verbose=0)

    if self.epsilon > self.min_epsilon:
      self.epsilon *= (1 - self.epsilon_decay)

      
  def load_weights(self, name):
    self.model.load_weights(name)

    
  def save_weights(self, name):
    self.model.save_weights(name)
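
The `replay` method above calls `predict` and `fit` once per sampled transition. As a sketch of the same target rule in array form, the targets for a whole minibatch can be built in one pass with NumPy. `batch_targets` is a hypothetical helper, not part of the agent; the predictions are made-up numbers standing in for `model.predict` output, and the `discount` of 0.9 corresponds to the `(1 - gamma)` factor used in `replay`.

```python
import numpy as np


def batch_targets(q_current, q_next, actions, rewards, dones, discount):
    """Build Q-learning regression targets for a whole minibatch at once.

    q_current, q_next: arrays of shape (batch, n_actions) holding the model's
    predictions for the sampled states and next states.
    """
    targets = q_current.copy()
    # Bootstrapped value of the best next action; zero for terminal transitions.
    future = np.where(dones, 0.0, discount * q_next.max(axis=1))
    # Overwrite only the entry of the action that was actually taken.
    targets[np.arange(len(actions)), actions] = rewards + future
    return targets


# Made-up predictions for two transitions; the second one is terminal.
q_current = np.array([[1.0, 2.0], [3.0, 4.0]])
q_next = np.array([[5.0, 6.0], [7.0, 8.0]])
actions = np.array([0, 1])
rewards = np.array([1.0, 1.0])
dones = np.array([False, True])

targets = batch_targets(q_current, q_next, actions, rewards, dones, discount=0.9)
print(targets)
```

The untouched entries keep the model's own predictions, so they contribute zero error under the MSE loss and only the taken action's value is pushed toward its target.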
        

In [0]:
# helper function to reshape the returned state to (1, dimension) so that it can serve as a single-sample batch input to the model

def transpose(state, dimension):
  return np.reshape(state, [1, dimension])
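
For illustration, this is what the reshape does to a single CartPole observation. The observation values are made up; the point is only the change in shape from a flat vector of four state variables to a batch containing one sample.

```python
import numpy as np

# Made-up observation with CartPole's four state variables.
raw_state = np.array([0.01, -0.02, 0.03, 0.04])
batched_state = np.reshape(raw_state, [1, 4])

print(raw_state.shape)      # (4,)
print(batched_state.shape)  # (1, 4)
```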

In [12]:
env = gym.make("CartPole-v0")
observation = env.reset()

# print(vars(env.action_space))
# print(vars(env.observation_space))

action_size = env.action_space.n
state_size = env.observation_space.shape[0]

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [13]:
EPISODES = 1000
MAX_STEPS = 500 # because we don't want this to go on forever (should the model ever get that good ;) )
SOLVED_STEPS = 195 # OpenAI counts CartPole-v0 as solved at an average reward of at least 195 over 100 consecutive trials
BATCH_SIZE = 32

FILE_NAME = 'drive/saved_models/cartpole_dqn_weights.h5'


agent = DeepQAgent(state_size, action_size)
agent.load_weights(FILE_NAME)

timesteps = 0
max_timesteps_so_far = 0

for i in range(EPISODES):
  
  state = env.reset()
  state = transpose(state, state_size)
  
  for t in range(MAX_STEPS):
    # env.render() # can use this to watch visualization on a VM that supports it
    
    action = agent.choose_action(state)
    next_state, reward, done, _ = env.step(action)
    next_state = transpose(next_state, state_size)
    agent.add_to_replay_buffer(state, action, reward, next_state, done)
    
    state = next_state

    if done:
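      # accumulate timesteps and report the average over episode_range episodes
      # note: t is zero-based, so each episode is counted as one step shorter than it ran,
      # and the first report (i == 10) sums 11 episodes (0 through 10) while dividing by 10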
      timesteps += t
      episode_range = 10
      
      if i != 0 and i % episode_range == 0:
        print("Episodes {} through {} finished after an average of {} timesteps".format(i - episode_range, i, timesteps / episode_range))
        if timesteps / episode_range > max_timesteps_so_far:
          max_timesteps_so_far = timesteps / episode_range
#           agent.save_weights(FILE_NAME)
#           print('Model weights saved.')

        if timesteps / episode_range >= SOLVED_STEPS:
          print('Episodes were a success!')
    
        timesteps = 0
      break
      
  if len(agent.replay_buffer) > BATCH_SIZE: # start learning once we have seen enough examples to form a batch to train on
    agent.replay(BATCH_SIZE)

# env.render()
env.reset()
env.close()

Episodes 0 through 10 finished after an average of 215.8 timesteps
Episodes were a success!
Episodes 10 through 20 finished after an average of 190.5 timesteps
Episodes 20 through 30 finished after an average of 199.0 timesteps
Episodes were a success!
Episodes 30 through 40 finished after an average of 193.9 timesteps
Episodes 40 through 50 finished after an average of 194.0 timesteps
Episodes 50 through 60 finished after an average of 199.0 timesteps
Episodes were a success!


KeyboardInterrupt: ignored
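
The 10-episode averages printed above are a coarser check than the usual CartPole-v0 criterion, which is an average reward of at least 195 over 100 consecutive episodes. A minimal sketch of that check follows, assuming one unit of reward per timestep so that episode length equals episode reward. `is_solved` and `history` are hypothetical names, not part of the notebook, and the episode lengths are made up.

```python
from collections import deque


def is_solved(episode_lengths, window=100, threshold=195.0):
    """True once the mean of the last `window` episode lengths reaches threshold."""
    if len(episode_lengths) < window:
        return False
    recent = list(episode_lengths)[-window:]
    return sum(recent) / window >= threshold


# Hypothetical history: the deque keeps only the most recent 100 episode lengths.
history = deque(maxlen=100)
for length in [200] * 99:
    history.append(length)
solved_early = is_solved(history)  # fewer than 100 episodes recorded so far

history.append(200)
solved_now = is_solved(history)
```

In the training loop, the episode length would be appended to `history` each time `done` is reached, and the check run after every episode instead of every tenth one.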