# Default CartPole with Q-Learning

## *TFG Reinforcement Learning through the GymRetro Platform.*

In this notebook we will show how to load and train a Tensorforce DQN agent in the Gym CartPole environment, taking the screen as input.

In [None]:
%matplotlib inline

Adapted from the original work of [Adam Paszke](https://github.com/apaszke).
[**Original code can be found here**](https://github.com/pytorch/tutorials/blob/master/intermediate_source/reinforcement_q_learning.py)

License for the original code:

BSD 3-Clause License

Copyright (c) 2017-2022, Pytorch contributors
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The CartPole task is designed so that the inputs to the agent are 4 real
values representing the environment state (position, velocity, etc.).
However, neural networks can solve the task purely by looking at the
scene, so we'll use a patch of the screen centered on the cart as an
input. Because of this, our results aren't directly comparable to the
ones from the official leaderboard - our task is much harder.
Unfortunately this does slow down the training, because we have to
render all the frames.

Strictly speaking, we will present the state as the difference between
the current screen patch and the previous one. This will allow the agent
to take the velocity of the pole into account from one image.
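
The frame-difference idea can be illustrated with a toy example. The sketch below does not use the real CartPole renderer: `make_frame` is a hypothetical helper that draws a single bright column standing in for the pole, and the frame size is arbitrary. It only shows why subtracting consecutive frames exposes motion.

```python
import numpy as np

def make_frame(pole_column, height=4, width=8):
    """Return a toy grayscale frame with a bright vertical bar at `pole_column`."""
    frame = np.zeros((height, width), dtype=np.float32)
    frame[:, pole_column] = 1.0
    return frame

# Two consecutive toy frames: the bar moves one column to the right.
previous_frame = make_frame(pole_column=3)
current_frame = make_frame(pole_column=4)

# The state is the element-wise difference between consecutive frames.
state = current_frame - previous_frame

# Positive entries mark where the bar arrived, negative entries where it left,
# so the sign pattern of a single difference image encodes the direction of motion.
arrived = int(np.argmax(state[0]))
left = int(np.argmin(state[0]))
print(f"bar moved from column {left} to column {arrived}")
```

A single raw frame would show only where the bar is; the difference image also shows where it came from.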

## Prerequisite installations:

In [None]:
!pip install gym
!pip install torch
!pip install torchvision
!pip install tensorforce
!pip install keras

## Required libraries:

In [None]:
import gym
from tensorforce import Agent, Environment

from PIL import Image
import torchvision.transforms as T
import torch

import math
import random
import numpy as np

import time

from IPython.display import clear_output

## Setup of the environment:

We define the environment manually, adding methods that capture the screen at each timestep and preprocess it (crop around the cart, rescale, and resize) so that every frame is easier to process.

In [None]:
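# Image.CUBIC was a legacy alias of Image.BICUBIC and is no longer available
# in recent Pillow releases, so the canonical name is used here.
resize = T.Compose([T.ToPILImage(),
                    T.Resize(40, interpolation=Image.BICUBIC),
                    T.ToTensor()])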

class CartPoleVisionEnvironment(Environment):

    def __init__(self):
        self.env = gym.make('CartPole-v0').unwrapped
        self.env.reset()
        self.init_screen = self.get_screen()
        self.init_state = (self.get_screen() - self.init_screen).cpu().squeeze(0).permute(1, 2, 0).numpy()
        self.last_screen = self.init_screen
        self.current_screen = self.init_screen
        super().__init__()
        
    def get_cart_location(self, screen_width):
        world_width = self.env.x_threshold * 2
        scale = screen_width / world_width
        return int(self.env.state[0] * scale + screen_width / 2.0)  # MIDDLE OF CART
        
    def get_screen(self):
        # Returned screen requested by gym is 400x600x3, but is sometimes larger
        # such as 800x1200x3. Transpose it into torch order (CHW).
        screen = self.env.render(mode='rgb_array').transpose((2, 0, 1))
        # Cart is in the lower half, so strip off the top and bottom of the screen
        _, screen_height, screen_width = screen.shape
        screen = screen[:, int(screen_height*0.4):int(screen_height * 0.8)]
        view_width = int(screen_width * 0.6)
        cart_location = self.get_cart_location(screen_width)
        if cart_location < view_width // 2:
            slice_range = slice(view_width)
        elif cart_location > (screen_width - view_width // 2):
            slice_range = slice(-view_width, None)
        else:
            slice_range = slice(cart_location - view_width // 2,
                                cart_location + view_width // 2)
        # Strip off the edges, so that we have a square image centered on a cart
        screen = screen[:, :, slice_range]
        # Convert to float, rescale, convert to torch tensor
        # (this doesn't require a copy)
        screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
        screen = torch.from_numpy(screen)
        # Resize, and add a batch dimension (BCHW)
        return resize(screen).unsqueeze(0)
    
    def is_vectorizable(self):
        return True                          
    
    def actions(self):
        return dict(type='int', shape=(), num_values=2)

    def states(self):
        return dict(type='float', shape=self.init_state.shape)

    def execute(self, actions):
        _, reward, done, _ = self.env.step(actions)
        self.last_screen = self.current_screen
        self.current_screen = self.get_screen()
        if not done:
            next_state = (self.current_screen - self.last_screen).cpu().squeeze(0).permute(1, 2, 0).numpy()
        else:
            next_state = None
        return next_state, done, reward
    
    def reset(self):
        self.env.reset()
        self.last_screen = self.init_screen
        self.current_screen = self.init_screen
        return self.init_state
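
The cropping logic in `get_cart_location` and `get_screen` can be checked in isolation, without rendering anything. The sketch below restates it as two standalone functions. The names `cart_pixel` and `crop_window` are hypothetical, and the default `x_threshold=2.4` is assumed to match the value exposed by the unwrapped CartPole environment.

```python
def cart_pixel(cart_x, screen_width, x_threshold=2.4):
    """Map the cart's world x-coordinate to a pixel column (middle of the cart)."""
    world_width = x_threshold * 2
    scale = screen_width / world_width
    return int(cart_x * scale + screen_width / 2.0)

def crop_window(cart_location, screen_width, view_fraction=0.6):
    """Return the horizontal slice of the screen that is kept around the cart."""
    view_width = int(screen_width * view_fraction)
    half = view_width // 2
    if cart_location < half:
        # Cart is near the left edge: keep the leftmost view_width columns.
        return slice(view_width)
    if cart_location > screen_width - half:
        # Cart is near the right edge: keep the rightmost view_width columns.
        return slice(-view_width, None)
    # Otherwise centre the window on the cart.
    return slice(cart_location - half, cart_location + half)

screen_width = 600
centre = cart_pixel(0.0, screen_width)
print(centre, crop_window(centre, screen_width))
print(crop_window(50, screen_width))
print(crop_window(590, screen_width))
```

Clamping the window at the screen edges keeps the cropped patch the same width in every frame, which is what allows consecutive patches to be subtracted from each other.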

## Creation or loading of the agent:

Execute the first cell if it's your first time training the agent, or execute the second cell if you want to load an existing agent.

In [None]:
environment = Environment.create(environment=CartPoleVisionEnvironment, max_episode_timesteps=10000)

# Instantiate a Tensorforce agent
agent = Agent.create(
    agent='dqn',
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=10000,
    batch_size=32,
    exploration=0.05,
    # Save agent every 100 updates and keep the 5 most recent checkpoints
    saver=dict(directory='Agent_directory', frequency=100, max_checkpoints=5),
    tracking='all',
)
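
For discrete actions, the Tensorforce documentation describes `exploration` as the probability of choosing a uniformly random action instead of the one the agent prefers. The sketch below illustrates that idea with a plain epsilon-greedy rule. It is not Tensorforce's internal implementation, and the function name and Q-values are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(0)
q_values = [0.2, 0.8]  # hypothetical Q-value estimates for "push left" and "push right"

# Sample many decisions with the same epsilon used for the agent above.
choices = [epsilon_greedy(q_values, epsilon=0.05, rng=rng) for _ in range(10_000)]
greedy_share = choices.count(1) / len(choices)
print(f"greedy action chosen in {greedy_share:.1%} of steps")
```

With two actions and `epsilon=0.05`, the greedy action is expected in roughly 97.5% of steps, since half of the random draws also land on it.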

In [None]:
agent = Agent.load(directory='Agent_directory')

## Agent training:

In [None]:
environment = Environment.create(environment=CartPoleVisionEnvironment, max_episode_timesteps=10000)

episode_reward = []
episodeTimes = []
episodeTimeSteps = []

trainingStart = time.time()

# Train for 10000 episodes
for episode in range(10000):

    # Initialize episode
    states = environment.reset()
    terminal = False
    rewardTotal = 0
    currentEpisodeTimeSteps = 0
    episodeStart = time.time()
    while not terminal:
        # Episode timestep
        currentEpisodeTimeSteps += 1
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)
        rewardTotal += reward
    
    episodeEnd = time.time()
    timeEpisode = episodeEnd - episodeStart
    episodeTimes.append(timeEpisode)
    episode_reward.append(rewardTotal)
    episodeTimeSteps.append(currentEpisodeTimeSteps)
    clear_output(wait=True)
    print(f"Episode: {episode}")
    
trainingEnd = time.time()
trainingTime = trainingEnd - trainingStart
environment.close()

print(f"Elapsed training time: {trainingTime} seconds")

We write some of the data gathered during training to files so that we can plot it later and evaluate how the agent evolved:

In [None]:
with open('rewards_per_episode.txt', 'w') as f:
    for item in episode_reward:
        f.write("%s\n" % item)
        
with open('timesteps_per_episode.txt', 'w') as f:
    for item in episodeTimeSteps:
        f.write("%s\n" % item)
        
with open('times_per_episode.txt', 'w') as f:
    for item in episodeTimes:
        f.write("%s\n" % item)
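
Per-episode rewards are usually noisy, so a moving average makes the trend easier to read before plotting. The sketch below reads a file in the one-value-per-line format written above and smooths it. `load_series` and `moving_average` are hypothetical helper names, the window size is an arbitrary choice, and the demo writes its own small temporary file rather than touching real training output.

```python
import os
import tempfile
from pathlib import Path

def load_series(path):
    """Read one float per line, skipping blank lines."""
    return [float(line) for line in Path(path).read_text().splitlines() if line.strip()]

def moving_average(values, window):
    """Average over the last `window` values (fewer at the start of the series)."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        averages.append(running / min(i + 1, window))
    return averages

# Demo on a temporary file that mimics rewards_per_episode.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rewards_per_episode.txt")
    with open(path, "w") as f:
        for item in [10.0, 20.0, 30.0, 40.0]:
            f.write("%s\n" % item)
    rewards = load_series(path)

smoothed = moving_average(rewards, window=2)
print(smoothed)
```

The resulting list has the same length as the input, so it can be plotted against the episode index directly.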

## Evaluation of our trained agent:

We check the performance of an already trained agent without training it again.

In [None]:
agent = Agent.load(directory='Agent_directory')
environment = Environment.create(environment=CartPoleVisionEnvironment, max_episode_timesteps=10000)

episodeTimes = []
episodeTimeSteps = []
for _ in range(10):
    episodeStart = time.time()
    # Initialize episode
    states = environment.reset()
    terminal = False
    currentEpisodeTimeSteps = 0
    while not terminal:
        # Episode timestep
        currentEpisodeTimeSteps += 1
        actions = agent.act(states=states, independent=True, deterministic=True)
        states, terminal, reward = environment.execute(actions=actions)
    
    episodeEnd = time.time()
    timeEpisode = episodeEnd - episodeStart
    episodeTimes.append(timeEpisode)
    episodeTimeSteps.append(currentEpisodeTimeSteps)
    
environment.close()
    
avgEpisodeTime = sum(episodeTimes) / len(episodeTimes)
bestEpisodeTime = max(episodeTimes)
avgEpisodeTimeSteps = sum(episodeTimeSteps) / len(episodeTimeSteps)
bestEpisodeTimeSteps = max(episodeTimeSteps)

## Check results of training:

In [None]:
print(f"Average time steps per episode: {avgEpisodeTimeSteps} timesteps")
print(f"Best episode: {bestEpisodeTimeSteps} timesteps")

In [None]:
print(f"Training time: {trainingTime} seconds")
print(f"Average seconds per episode after training: {avgEpisodeTime} seconds")