# Default CartPole with Q-Learning

## *TFG Reinforcement Learning through the GymRetro Platform.*

This notebook shows how to implement the Q-Learning algorithm for the CartPole environment included in Gym and how to use it to train an agent in that environment.

## Prerequisites:

Only run this cell if you don't have Gym installed locally.

In [None]:
!pip install gym

## Required libraries:

In [None]:
import numpy as np 
import gym
import time
import math 

## A classic approach: Q-Learning

The main obstacle to implementing the Q-Learning algorithm in this environment is discretizing the state space, which is continuous in the CartPole problem.

First, we explore the environment attributes exposed by Gym. As we can see, the action space consists of two discrete actions: 0 pushes the cart to the left and 1 pushes it to the right.

In [None]:
env = gym.make("CartPole-v1").env
print(env.action_space)
print(env.action_space.n)

The observation space is more complex: a `Box` object that represents a 4-dimensional continuous space. Each state is a tuple (cart position, cart velocity, pole angle, pole angular velocity) that describes the full state of the environment.

In [None]:
#Peeking into the observation space to uncover and understand its structure
print(env.observation_space)
print(env.observation_space.shape)

After this brief inspection of the environment, we implement the Q-Learning algorithm:

## Q-Learning: 

Reward table (from the environment): P-table

Q-table: stores the _Q-value_ (the 'quality' of an action) associated with each `(state, action)` pair.


Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:

$$Q(\text{state}, \text{action}) \leftarrow (1 - \alpha)\, Q(\text{state}, \text{action}) + \alpha \left( \text{reward} + \gamma \max_{a} Q(\text{next state}, a) \right)$$

Where:
- α (alpha) is the learning rate (0<α≤1)
- γ (gamma) is the discount factor (0≤γ≤1)
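
As a worked example of the update rule, the following minimal sketch applies it once to a single `(state, action)` pair. The function name `q_update` and the input numbers are illustrative assumptions; `alpha=0.1` and `gamma=0.95` match the values used later in this notebook.

```python
def q_update(current_q, reward, max_next_q, alpha=0.1, gamma=0.95):
    """Return the updated Q-value for one (state, action) pair."""
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + gamma * max_next_q
    # Blend the old estimate with the target according to the learning rate.
    return (1 - alpha) * current_q + alpha * target

# Illustrative numbers only: old estimate 0.5, reward 1, best next value 0.8.
new_q = q_update(current_q=0.5, reward=1.0, max_next_q=0.8)
print(new_q)
```

Here the target is 1 + 0.95 · 0.8 = 1.76, so the new estimate is 0.9 · 0.5 + 0.1 · 1.76, approximately 0.626.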

Q-table values are initialized randomly and then updated during training toward values that maximize the reward the agent collects as it moves through the environment.

Steps:
- Initialize the Q-table.
- Start exploring actions: for the current state (S), select one of the possible actions (a).
- Travel to the next state (S') as a result of that action (a).
- For all possible actions from the state (S') select the one with the highest Q-value.
- Update Q-table values using the equation.
- Set the next state as the current state.
- If a terminal state is reached, end the episode and repeat the process.
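
The steps above can be sketched on a toy problem that is much smaller than CartPole. Everything in this block is a hypothetical illustration, not the notebook's training code: `toy_step` is an invented 5-state chain environment where moving right from the last non-goal state yields a reward of 1, and actions are chosen purely at random. Unlike the CartPole loop later in this notebook, this sketch also updates the table on the terminal transition, because the only reward in the toy problem arrives there.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # toy chain: states 0..4, actions 0=left, 1=right
GOAL = N_STATES - 1

def toy_step(state, action):
    """Hypothetical environment: move along the chain; reward 1 at the goal."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, N_ACTIONS))   # initialize the Q-table
alpha, gamma = 0.1, 0.95

for _ in range(500):                  # repeat the process for many episodes
    state, done = 0, False
    while not done:
        action = int(rng.integers(N_ACTIONS))                # select an action
        next_state, reward, done = toy_step(state, action)   # travel to S'
        # Highest Q-value available from S' (zero if S' is terminal).
        max_next_q = 0.0 if done else np.max(q[next_state])
        # Update the Q-table with the equation.
        q[state, action] = (1 - alpha) * q[state, action] + alpha * (reward + gamma * max_next_q)
        state = next_state            # the next state becomes the current state

# Greedy action for each non-goal state.
greedy_policy = np.argmax(q[:GOAL], axis=1)
print(greedy_policy)
```

After training, the greedy policy should prefer action 1 (move right) in every non-goal state, since that is the shortest route to the reward.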

We want to prevent the agent from always taking the same route and possibly overfitting, so we introduce another parameter, ε (epsilon), to handle this during training.
Instead of always selecting the action with the best learned Q-value, the agent will sometimes pick a random action to explore the action space further.
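
A minimal sketch of this ε-greedy selection is shown below. The helper name `epsilon_greedy` and the sample Q-values are assumptions made for illustration; the logic is equivalent to the `np.random.random() > epsilon` check used in the training loop of this notebook.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.7])   # illustrative Q-values for actions 0 and 1

# With epsilon = 0 the choice is always greedy (action 1 here).
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)

# With epsilon = 1 every choice is random, so both actions can appear.
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(20)]
print(greedy_action, random_actions)
```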

In [None]:
#Q-Learning parameters
alpha = 0.1
gamma = 0.95
epsilon = 1
epsilon_decay_value = 0.99995

#Training-specific values
EPISODES = 500000
total_time = 0
total_reward = 0
prior_reward = 0

#Discretization values
Observation = [30, 30, 50, 50]
np_array_win_size = np.array([0.25, 0.25, 0.01, 0.1])
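
To get a feel for these values, the sketch below estimates when exploration effectively stops. It assumes the decay rule used in the training loop of this notebook, `epsilon = epsilon_decay_value ** (episode - 10000)`, which is applied only while epsilon is above 0.05 and only on episodes whose reward improved on the previous one, so the result is a lower bound on the episode at which the floor is reached. The variable names `decay_start` and `epsilon_floor` are introduced here for illustration.

```python
import math

epsilon_decay_value = 0.99995
decay_start = 10000    # episode offset used by the training loop
epsilon_floor = 0.05   # the loop stops decaying once epsilon is at or below this

# Smallest integer n with epsilon_decay_value ** n <= epsilon_floor.
n = math.ceil(math.log(epsilon_floor) / math.log(epsilon_decay_value))
first_episode_at_floor = decay_start + n
print(first_episode_at_floor)
```

With these parameters the floor is reached at around episode 70,000, so most of the 500,000 training episodes run with very little exploration.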

We check the dimensions of the Q-table and print its randomly initialized contents.

In [None]:
#Initialization of the Q-table: a random value is assigned to each (state, action) pair
q_table = np.random.uniform(low=0, high=1, size=(Observation + [env.action_space.n]))
print(q_table.shape)
print(q_table)
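
As a quick check on the size of this table, the following sketch computes the number of entries and the approximate memory footprint without needing the environment. It assumes two actions (as reported by `env.action_space.n` for CartPole) and NumPy's default `float64` storage.

```python
import numpy as np

Observation = [30, 30, 50, 50]   # bins per state dimension, as above
n_actions = 2                    # assumed value of env.action_space.n

shape = tuple(Observation + [n_actions])
n_entries = int(np.prod(shape))
# Each float64 entry takes 8 bytes.
size_mb = n_entries * np.dtype(np.float64).itemsize / 1e6
print(shape, n_entries, size_mb)
```

The table has 4,500,000 entries (about 36 MB), which is why `print(q_table)` only shows a truncated summary.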

Function for the discretization of each state.

In [None]:
#Discretization function
def get_discrete_state(state):
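    #Scale each component by its window size and shift it by a per-dimension offset
    discrete_state = state / np_array_win_size + np.array([15, 10, 1, 10])
    #np.int was removed in recent NumPy releases; the built-in int is equivalent here
    return tuple(discrete_state.astype(int))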
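
Note that the function above can produce indices outside the table: negative indices silently wrap around under NumPy indexing, and indices at or above the bin count raise an `IndexError`. The sketch below is one possible hardened variant, not the notebook's method. The function name `get_discrete_state_clipped` is hypothetical, and the centered offsets (`bins // 2`) are an assumption that differs from the `[15, 10, 1, 10]` used above, because with an offset of 1 clipping would collapse almost all negative pole angles into bin 0.

```python
import numpy as np

bins = np.array([30, 30, 50, 50])              # bins per dimension, as in Observation
win_size = np.array([0.25, 0.25, 0.01, 0.1])   # same window sizes as the notebook
offset = bins // 2                             # assumption: center each dimension

def get_discrete_state_clipped(state):
    """Hypothetical variant: floor to a bin index, then clip into the table."""
    idx = np.floor(np.asarray(state) / win_size + offset).astype(int)
    return tuple(int(i) for i in np.clip(idx, 0, bins - 1))

print(get_discrete_state_clipped([0.0, 0.0, 0.0, 0.0]))        # centered state
print(get_discrete_state_clipped([10.0, 10.0, 1.0, 10.0]))     # far above range
print(get_discrete_state_clipped([-10.0, -10.0, -1.0, -10.0])) # far below range
```

Using `np.floor` rather than truncation also keeps the bin around zero the same width as its neighbours. A Q-table trained with the original function is not compatible with this variant, since the two map states to different indices.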

## Agent training:

In [None]:
episodes_reward = []
episodeTimes = []
episodeTimeSteps = []

episodeTrainingTimes = []
trainingStart = time.time()
for episode in range(EPISODES): #go through the episodes
    episodeTrainingStart = time.time()
    t0 = time.time() #set the initial time
    discrete_state = get_discrete_state(env.reset()) #get the discrete start for the restarted environment 
    done = False
    episode_reward = 0 #reward starts as 0 for each episode
    
    currentEpisodeTimeSteps = 0

    if episode % 2000 == 0: 
        print("Episode: " + str(episode))

    while not done: 

        if np.random.random() > epsilon:

            action = np.argmax(q_table[discrete_state]) #exploit: take the best known action
        else:

            action = np.random.randint(0, env.action_space.n) #explore: take a random action

        new_state, reward, done, _ = env.step(action) #step action to get new states, reward, and the "done" status.

        episode_reward += reward #add the reward
        
        currentEpisodeTimeSteps += 1

        new_discrete_state = get_discrete_state(new_state)

        if episode % 2000 == 0: #render
            env.render()

        if not done: #update q-table
            max_future_q = np.max(q_table[new_discrete_state])

            current_q = q_table[discrete_state + (action,)]

            new_q = (1 - alpha) * current_q + alpha * (reward + gamma * max_future_q)

            q_table[discrete_state + (action,)] = new_q

        discrete_state = new_discrete_state

    if epsilon > 0.05: #epsilon modification
        if episode_reward > prior_reward and episode > 10000:
            epsilon = math.pow(epsilon_decay_value, episode - 10000)

            if episode % 500 == 0:
                print("Epsilon: " + str(epsilon))

    t1 = time.time() #episode has finished
    episode_total = t1 - t0 #episode total time
    episodeTimes.append(episode_total)
    episodes_reward.append(episode_reward)
    episodeTimeSteps.append(currentEpisodeTimeSteps)
    total_time = total_time + episode_total

    total_reward += episode_reward #episode total reward
    prior_reward = episode_reward

    if episode % 1000 == 0: #every 1000 episodes print the average time and the average reward
        mean = total_time / 1000
        print("Time Average: " + str(mean))
        total_time = 0 #reset the accumulated time for the next 1000 episodes

        mean_reward = total_reward / 1000
        print("Mean Reward: " + str(mean_reward))
        total_reward = 0
    
trainingEnd = time.time()
trainingTime = trainingEnd - trainingStart
env.close()

We save some of the training data to files so that we can plot it later and check how the agent evolved during training.

In [None]:
with open('rewards_per_episode.txt', 'w') as f:
    for item in episodes_reward:
        f.write("%s\n" % item)
        
with open('timesteps_per_episode.txt', 'w') as f:
    for item in episodeTimeSteps:
        f.write("%s\n" % item)
        
with open('times_per_episode.txt', 'w') as f:
    for item in episodeTimes:
        f.write("%s\n" % item)
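
Since per-episode rewards are noisy, a moving average is a common way to read files like these before plotting. The sketch below is an illustrative assumption about how the saved data could be post-processed: `moving_average` is a hypothetical helper, and the demo writes a tiny stand-in file in the same one-value-per-line format so that it runs on its own.

```python
import os
import tempfile
import numpy as np

def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return np.convolve(values, np.ones(window) / window, mode="valid")

# Stand-in file with made-up numbers; in this notebook you would point
# np.loadtxt at 'rewards_per_episode.txt' instead.
path = os.path.join(tempfile.mkdtemp(), "rewards_demo.txt")
with open(path, "w") as f:
    for item in [10.0, 20.0, 30.0, 40.0]:
        f.write("%s\n" % item)

rewards = np.loadtxt(path)
smoothed = moving_average(rewards, window=2)
print(smoothed)
```

For the real training data a much larger window (for example 1000 episodes) would likely be more informative.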

## Evaluation of our trained agent:

In [None]:
"""Evaluate agent's performance after Q-learning"""
total_epochs, total_rewards = 0, 0
episodes = 1
episodeTimes = []
episodeTimesteps = []
for _ in range(episodes):
    episodeStart = time.time()
    state = env.reset()
    env.render()
    currentTimesteps = 0
    rewards = 0
    done = False
    
    while not done:
        action = np.argmax(q_table[get_discrete_state(state)])
        state, reward, done, _ = env.step(action)
        rewards += reward
        currentTimesteps += 1
    
    total_rewards += rewards
    episodeEnd = time.time()
    timeEpisode = episodeEnd - episodeStart
    episodeTimes.append(timeEpisode)
    episodeTimesteps.append(currentTimesteps)

env.close()


#Some metrics
avgEpisodeTime = sum(episodeTimes) / len(episodeTimes)
bestEpisodeTime = max(episodeTimes)
avgEpisodeTimesteps = sum(episodeTimesteps) / len(episodeTimesteps)
bestEpisodeTimesteps = max(episodeTimesteps)

## Check results of training:

Reward results:

In [None]:
print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {avgEpisodeTimesteps} timesteps")
print(f"Best episode: {bestEpisodeTimesteps} timesteps")

Episode duration results:

In [None]:
print(f"Training time: {trainingTime} seconds")
print(f"Average seconds per episode after training: {avgEpisodeTime} seconds")
print(f"Longest episode: {bestEpisodeTime} seconds")