# Default CartPole with Q-Learning

## *TFG Reinforcement Learning through the GymRetro Platform.*

In this notebook we will show how to load and train a Tensorforce DQN agent in the Gym CartPole environment.

## Prerequisites:

First we will install __Gym__, a library by _OpenAI_ that provides different environments for reinforcement learning.

We will also install __Tensorforce__, which provides an easy way to create Deep Reinforcement Learning agents that interact with these environments, along with the other packages this notebook requires.

In [None]:
!pip install gym[all]==0.21.0
!pip install tensorforce
!pip install keras==2.6.0
!pip install pygame
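
Since the cell above pins specific versions, it can be useful to confirm what actually got installed before continuing. The following is a minimal sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and is not part of Gym or Tensorforce.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages installed by the cell above.
for name in ("gym", "tensorforce", "keras", "pygame"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line reports `not installed`, or a version other than the pinned one, re-running the install cell and restarting the kernel is usually the first thing to try.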

## Required libraries:

In [None]:
import gym
from tensorforce import Agent, Environment

from IPython.display import clear_output
import time

## Creation or loading of the agent:

Now we will create a _Deep Q-Learning_ (DQN) Tensorforce agent that should learn to move the cart in a way that keeps the pole from falling over. Tensorforce has integrated support for Gym environments, which makes the implementation much easier.

The observation that the environment provides to the agent at each timestep has the following format:
[position of cart, velocity of cart, angle of pole, angular velocity of pole].
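
To make that layout concrete, here is a small sketch that wraps a 4-element observation in a named tuple so each component can be read by name. The type `CartPoleState`, its field names and the helper `describe_state` are illustrative assumptions, not part of Gym or Tensorforce, and the numbers are made up.

```python
from collections import namedtuple

# Field order follows the format listed above; the names are illustrative.
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)


def describe_state(states):
    """Wrap a 4-element observation in a named tuple for readable access."""
    return CartPoleState(*(float(value) for value in states))


# Made-up observation values, only to show the layout.
example = describe_state([0.02, -0.15, 0.03, 0.21])
print(example.pole_angle)
```

The same helper could be applied to the `states` value returned by `environment.reset()` in the training loop, assuming it is a 4-element sequence as described.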

Run the agent-creation cell (`Agent.create`) if this is your first time training the agent, or run the loading cell (`Agent.load`) if you want to continue from an existing agent.

In [None]:
# Add the parameter visualize=True (a boolean, not a string) to watch the training process. This is slower.
environment = Environment.create(environment='gym', level='CartPole-v1')

# Instantiate a Tensorforce agent
agent = Agent.create(
    agent='dqn',
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=50000,
    batch_size=32,
    # Save agent every 100 updates and keep the 5 most recent checkpoints
    saver=dict(directory='Agent_directory', frequency=100, max_checkpoints=5),
)

In [None]:
agent = Agent.load(directory='Agent_directory')

## Agent training:

In [None]:
environment = Environment.create(environment='gym', level='CartPole-v1')

episode_reward = []
episodeTimes = []
episodeTimeSteps = []

trainingStart = time.time()

# Train for 10000 episodes
for episode in range(10000):

    # Initialize episode
    states = environment.reset()
    terminal = False
    rewardTotal = 0
    currentEpisodeTimeSteps = 0
    episodeStart = time.time()
    while not terminal:
        # Episode timestep       
        currentEpisodeTimeSteps += 1
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)
        rewardTotal += reward
     
    episodeEnd = time.time()
    timeEpisode = episodeEnd - episodeStart
    episodeTimes.append(timeEpisode)
    episode_reward.append(rewardTotal)
    episodeTimeSteps.append(currentEpisodeTimeSteps)
    clear_output(wait=True)
    print(f"Episode: {episode}")
    
    
trainingEnd = time.time()
trainingTime = trainingEnd - trainingStart
environment.close()

print(f"Elapsed training time: {trainingTime} seconds")

We save some of the data gathered during training to files so that we can plot it and evaluate how the agent evolved:

In [None]:
with open('rewards_per_episode.txt', 'w') as f:
    for item in episode_reward:
        f.write("%s\n" % item)
        
with open('timesteps_per_episode.txt', 'w') as f:
    for item in episodeTimeSteps:
        f.write("%s\n" % item)
        
with open('times_per_episode.txt', 'w') as f:
    for item in episodeTimes:
        f.write("%s\n" % item)
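
Per-episode rewards tend to be noisy, so a smoothed curve is often easier to read when plotting. The sketch below reads one of the files written above back into a list and computes a simple moving average. The helper names `read_values` and `moving_average` and the window size are assumptions chosen for illustration.

```python
def read_values(path):
    """Read one float per line from a text file, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def moving_average(values, window):
    """Average every run of `window` consecutive values.

    Returns len(values) - window + 1 averages (an empty list if there
    are fewer values than `window`).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Small made-up series to show the behaviour.
print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))  # [1.5, 2.5, 3.5]
```

After training, something like `moving_average(read_values('rewards_per_episode.txt'), window=100)` would give a smoothed reward curve that can be passed to any plotting library.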

## Evaluation of our trained agent:

We check the performance of an already trained agent without training it again.

In [None]:
agent = Agent.load(directory='DQNCartPolemodel4')
# visualize=True renders the environment so we can watch the agent; remove it to run faster.
environment = Environment.create(environment='gym', level='CartPole', max_episode_timesteps=10000, visualize=True)

episodeTimes = []
episodeTimeSteps = []
for _ in range(10):
    episodeStart = time.time()
    # Initialize episode
    states = environment.reset()
    terminal = False
    currentEpisodeTimeSteps = 0
    while not terminal:
        # Episode timestep
        currentEpisodeTimeSteps += 1
        actions = agent.act(states=states, independent=True, deterministic=True)
        #print(actions)
        states, terminal, reward = environment.execute(actions=actions)
    
    episodeEnd = time.time()
    timeEpisode = episodeEnd - episodeStart
    episodeTimes.append(timeEpisode)
    episodeTimeSteps.append(currentEpisodeTimeSteps)
    
environment.close()
    
avgEpisodeTime = sum(episodeTimes) / len(episodeTimes)
bestEpisodeTime = max(episodeTimes)
avgEpisodeTimeSteps = sum(episodeTimeSteps) / len(episodeTimeSteps)
bestEpisodeTimeSteps = max(episodeTimeSteps)

## Check results of the evaluation:

In [None]:
print(f"Average time steps per episode: {avgEpisodeTimeSteps} timesteps")
print(f"Best episode: {bestEpisodeTimeSteps} timesteps")
print(f"Average episode duration: {avgEpisodeTime} seconds")