# DQN on Trackmania
<center><img src="https://cdn1.epicgames.com/b04882669b2e495e9f747c8560488c93/offer/TM_StarterEdition_Store_Landscape_2560x1440-2560x1440-e748deac61ee274e1ef8faa5f40b03cd.jpg?h=270&resize=1&w=480"/></center><br/>

Trackmania is a very interesting game for AI research. It's a complex environment with a quite sparse reward function. Furthermore, the game has been fairly well optimised by human players, as Trackmania is a very competitive game.
Using the [TMForge tool](https://github.com/TheoBoyer/TMForge), I implemented the DQN algorithm on Trackmania. The purpose of this notebook is to analyse the results.

In [None]:
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import importlib.util  # the util submodule must be imported explicitly
import json

# The benchmark
The map I ran the algorithm on is a custom one that I created especially for reinforcement learning algorithms. It contains a lot of checkpoints, which reduces the sparsity of the reward function. The repeated half-turn sections also help the agent: once it has learned how to pass one, it can generalize to all the half-turns on the map. The ultimate goal is to finish this map as a sort of pretraining for the agent, and then test it on other maps.

<img style="margin: auto" src="https://i.imgur.com/QofS9Mz.png">
<p style="text-align: center; padding: 15px">
    <span><i>The reference map for benchmarks on TMForge, available in the TMForge club.</i></span>
</p>



# The data
The data available is the results folder of the experiment. It contains the agent's code, the hyperparameters used, the TMForge config file, the model's weights, the replay buffer, and a CSV file containing metrics.
The latter is the one we will analyze in this notebook.

In [None]:
def load_config(experiment_folder):
    """
        Load the config.py file of an experiment folder as a Python module
        Return the loaded module
    """
    config_path = os.path.join(experiment_folder, "config.py")
    spec = importlib.util.spec_from_file_location("config", config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Feel free to upload your own experiments, fork the notebook and change this line to see the performance of your agent
experiment_folder = '../input/tmforge-experiments/frames_buffer'

# Config
config = load_config(experiment_folder)
with open(os.path.join(experiment_folder, "hyperparameters.json")) as f:
    hyperparameters = json.load(f)
metrics = pd.read_csv(os.path.join(experiment_folder, 'metrics.csv'))
print("Hyperparameters:")
print(hyperparameters)
print("Metrics:")
metrics

In [None]:
frames = np.load(os.path.join(experiment_folder, "frames_buffer.npy"))
actions = np.load(os.path.join(experiment_folder, "actions_buffer.npy"))
rewards = np.load(os.path.join(experiment_folder, "rewards_buffer.npy"))
dones = np.load(os.path.join(experiment_folder, "dones_buffer.npy"))

In [None]:
idx = -227
sample = frames[idx]
print(actions[idx])
print(rewards[idx])
print(dones[idx])
plt.imshow(sample[-1], cmap='gray')
plt.show()
plt.imshow(sample[-2], cmap='gray')
plt.show()
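# Cast to float before subtracting: if the frames are stored as unsigned integers, the difference would wrap around
plt.imshow(np.abs(sample[-1].astype(np.float32) - sample[-2].astype(np.float32)), cmap='gray')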
plt.show()
print(sample[-3][0, 0])

## Performance of the agent
First, let's see how the agent performed.

In [None]:
metrics['cp_crossed'] = metrics["Reward"] == 1.0
cp_reached = metrics[['Episode', 'cp_crossed']].groupby(['Episode']).sum()
plt.bar(cp_reached.index.values, cp_reached["cp_crossed"].values)
plt.title("Checkpoints reached by the agent during training")
plt.xlabel("Episode")
plt.ylabel("Checkpoints crossed")
print("The best run reached CP", np.max(cp_reached["cp_crossed"].values))
plt.show()

episode_reward = metrics["Episode Reward"].values
episode_reward = episode_reward[~np.isnan(episode_reward)]  # drop the NaN entries
plt.plot(episode_reward, label='Cumulative rewards')
plt.title("Cumulative rewards obtained during training")
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.show()

In [None]:
cum_dis_r = 0
curr_ep_n = None
q_values = []
for r, n_ep in zip(reversed(metrics["Reward"].values), reversed(metrics["Episode"].values)):
    if curr_ep_n != n_ep:
        cum_dis_r = 0
        curr_ep_n = n_ep
    cum_dis_r = r + hyperparameters["reward_discount_factor"] * cum_dis_r
    q_values.append(cum_dis_r)
q_values = list(reversed(q_values))
metrics["True Q-value"] = q_values

approx_error = np.abs(metrics["Q-value"].values - metrics["True Q-value"].values)

plt.plot(metrics["Q-value"].values, label="Q-value estimation")
plt.plot(metrics["True Q-value"].values, label="Q-value truth")
plt.title("Computed Q-values")
plt.xlabel("Game steps")
plt.ylabel("Q-value")
plt.legend()
plt.show()

plt.plot(approx_error)
plt.title("Q-value approximation error")
plt.xlabel("Game steps")
plt.ylabel("Absolute error")
plt.show()

The Q-value estimation is too high. I used the [double-DQN algorithm](https://arxiv.org/pdf/1509.06461.pdf), which is supposed to overcome this common issue in vanilla DQN, so it's possible that my implementation is wrong.
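
As a reference point when checking an implementation, the difference between the vanilla DQN target and the double-DQN target can be sketched in NumPy. This is only a minimal illustration, not the code of the TMForge agent: the function names and arguments are hypothetical, and `q_online_next` and `q_target_next` stand for the outputs of the online and target networks on the next states.

```python
import numpy as np

def vanilla_dqn_targets(rewards, dones, q_target_next, gamma):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = ~np.asarray(dones, dtype=bool)
    next_values = np.asarray(q_target_next).max(axis=1)
    # No bootstrapping on terminal transitions
    return rewards + gamma * next_values * not_done

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = ~np.asarray(dones, dtype=bool)
    q_target_next = np.asarray(q_target_next)
    # Action selection with the online network
    best_actions = np.argmax(np.asarray(q_online_next), axis=1)
    # Action evaluation with the target network
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * next_values * not_done
```

A typical mistake is to take the maximum of the target network's output in both places, which silently falls back to the vanilla target and its overestimation bias.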

## Technical performance
As we do not have a perfect emulator for Trackmania like we do for the Atari games, the deterministic assumption does not hold.
<center><img src="https://i.imgur.com/V51ucym.png"/></center><br/>

In the classical framework, the duration of the decision process does not matter because the emulator is paused during it. This is not the case for Trackmania, because the so-called emulator is in fact an interface with the game, and the latter never stops, even during the decision process. This can have really bad consequences on training because it adds a lot of noise. The DQN implementation does everything it can to keep the time between observing the state and performing the action as constant as possible. Let's study this metric.
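
One way to keep the observe-to-act delay constant is to pad every decision up to a fixed time budget. The sketch below is a hypothetical illustration of that idea and not the TMForge implementation: `observe`, `decide` and `act` are placeholder callables, `delay_s` is an assumed budget in seconds, and the clock and sleep functions are injectable so the logic can be checked without real waiting.

```python
import time

def act_with_fixed_delay(observe, decide, act, delay_s,
                         clock=time.perf_counter, sleep=time.sleep):
    """Apply the action a fixed delay after the observation, when possible.

    Returns the measured observe-to-act latency in seconds.
    """
    t_obs = clock()
    state = observe()
    action = decide(state)
    # Wait out whatever is left of the budget so that fast and slow
    # decisions reach the game after the same delay
    remaining = delay_s - (clock() - t_obs)
    if remaining > 0:
        sleep(remaining)
    act(action)
    return clock() - t_obs
```

Padding can only slow down fast decisions; a decision that overruns the budget is applied late, so the budget has to sit above the usual inference time. The precision of `time.sleep` also depends on the operating system.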

In [None]:
DISPLAY_UPDATE_FREQUENCY = 60  # Hz
action_latency = metrics["Action Latency"].values
# Percentiles 0 to 99: the last entry is the 99th percentile
percentiles = np.percentile(action_latency, np.arange(100))
# Keep only the samples below the 99th percentile so the histogram stays readable
kept_latency = action_latency[action_latency < percentiles[-1]]
n_removed = len(action_latency) - len(kept_latency)
print("Removed {} samples ({:.3f} % of the data) for clarity".format(n_removed, 100 * n_removed / len(action_latency)))
plt.title("Distribution of the latency")
plt.xlabel("latency")
v = plt.hist(kept_latency, bins=100, density=True)
latency_std = action_latency.std()
latency_99_std = kept_latency.std()
frames_dt = 1 / DISPLAY_UPDATE_FREQUENCY
print("Standard deviation of the latency:", latency_std)
print("Standard deviation of the latency of 99% of the frames:", latency_99_std)

print("Time between two frames:", frames_dt)

As we can see, despite some rare extreme cases (probably when a backup is done, which involves writing files of several GB), 99% of the time the latency is ~8 ms, which is low enough to have no big consequences on a 60 Hz display.
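
To put that figure in perspective, the latency can be expressed as a fraction of a display frame. The helper below is a hypothetical illustration and assumes the latency is given in seconds.

```python
def latency_in_frames(latency_s, refresh_hz=60):
    """Convert a latency in seconds into a number of display frames."""
    frame_duration_s = 1.0 / refresh_hz
    return latency_s / frame_duration_s

# Latency of about 8 ms on a 60 Hz display, expressed in frames
print(latency_in_frames(0.008, 60))
```

An 8 ms latency comes to roughly half a frame at 60 Hz, so the action is usually applied before the next frame is displayed.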