# SuperMarioBot: Using Deep Q-Learning to Play Super Mario Bros. (Columbia University MA Stats GR5242: Advanced Machine Learning Final Project Report)

## Contributors

* Sam Kolins (sk3651)
* Atishay Sehgal (as5453)
* Arpita Shah (as5451)

## Introduction

In 2013, researchers at Google DeepMind published the paper *Playing Atari with Deep Reinforcement Learning*, in which, for the first time, a neural network learned how to play video games in a fairly organic sort of way: by playing the games itself and training on the images produced from those play sessions. In the five short years since, a whole host of enthusiastic data scientists have trained neural networks on a wide variety of different games. Most of these have been Atari games, but for our project we wanted to be a bit more ambitious and try *Super Mario Bros.* for the Nintendo Entertainment System (NES), using Deep Q-Learning. Concretely, this means we want to get Mario to the end of the level before time runs out and without dying.


## Game Objective and Information

The goal of the game is to get to the end of the first level, which is marked by a flag that Mario needs to touch to beat the level. This is tricky because Mario faces obstacles (pipes, floating walls), enemies, and bottomless pits along the way. Mario also starts with only three lives. Contact with an enemy or a fall into a pit results in the loss of a life, and the loss of all three lives ends the game. A clock counts down from 400 seconds and, if it reaches 0, also ends the game. *Super Mario Bros.* is a side-scrolling platformer, so Mario needs to keep moving right in order to reach the flag at the end.

## Project Objective and Challenges

We wish to run a series of reinforcement learning algorithms on a downsampled version of *Super Mario Bros.* for the NES. This emulation is provided as an OpenAI Gym environment, but is not part of the main OpenAI Gym website; the main tutorial we used to set it up is Christopher Messier's tutorial (1). Further documentation can be found on the PyPI project page for `gym-super-mario-bros`. Mario will have to learn to keep moving and jumping forward while avoiding hazards that could cause him to lose lives, like on-screen enemies and bottomless pits. To achieve this, we use a four-layer dense neural network with rectified linear unit (ReLU) activation functions as our baseline. Further, we implement a convolutional neural network (CNN) architecture to learn from still frames created from training sessions on the game (either through standard Q-learning runs through episodes or through manually created replays, which is a special functionality of the code we are using). After this, we train a deep Q-network (DQN) on these images to attempt to get Mario through each level.



# Specifications



## OpenAI Gym and `gym-super-mario-bros`

The version of the game we are using for this project is not exactly the same as the NES version of Mario, as we do not have access to an authentic NES console. Instead, we are **using an NES emulator to run the game**, modified to run in Python (2.7, 3.5, or 3.6) and with some extra modifications for convenience. The package we are using, called **[`gym-super-mario-bros`](https://pypi.org/project/gym-super-mario-bros/)**, was created by Christian Kauten ("kautenja" on GitHub) and is an OpenAI Gym environment using the [`nes-py` emulator](https://pypi.org/project/nes-py/) (also made by Kauten) that can run both the original *Super Mario Bros.* and *Super Mario Bros. 2: The Lost Levels*. We are only concerned with *Super Mario Bros.* for this project.

A few notable changes to the game are:

- __Levels can be loaded individually.__ Rather than play from the very beginning of the first level, we can instead decide to train a network on any particular level of our choosing. This can be a good way to train Mario on trickier levels that may have more difficult terrain (like Athletic or Castle levels) or enemies that only appear in certain levels.
- __There are multiple downsampled versions of the game.__ Downsampling is important because the reduction in rendering detail will make it easier for the convolutional layers of the deep Q-learning network to process the game images. There are three downsampled versions, in addition to `v0`, the original version of the game (see the `gym-super-mario-bros` documentation for more details):
    * `v1` does not affect any foreground elements, but removes color from the background and simplifies the designs of background elements somewhat.
    * `v2` further simplifies all in-game assets, including Mario and the in-game heads-up display (HUD), into blockier designs. **This is the render we are using for our project.**
    * `v3` takes this even further, simplifying all elements into colored rectangles.
- The base game contains **4 frames of frameskip**. This means that, though each frame is calculated, only every fourth frame is drawn when the game is rendered. **This frameskip can be removed**, but doing so is not recommended for CPUs/GPUs that are not fast enough to render the game quickly.

There are many more details about how the game tracks various statistics that we would like to optimize over (like Mario's x-position) or against (like the in-game clock). Please read the `gym-super-mario-bros` documentation for more details.
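As a rough illustration of how a particular level and render version can be selected, here is a minimal sketch. It assumes the `SuperMarioBros-<world>-<stage>-v<version>` ID format from the `gym-super-mario-bros` documentation and the `JoypadSpace` wrapper from recent `nes-py` releases (older releases exposed this wrapper under a different name, so check your installed version). The helper names `make_env_id` and `make_env` are our own and purely illustrative.

```python
def make_env_id(world, stage, version=2):
    """Build an environment ID such as 'SuperMarioBros-1-1-v2'.

    `version` selects the render: 0 is the original game, 1 to 3 are the
    progressively more downsampled versions described above.
    """
    return "SuperMarioBros-{}-{}-v{}".format(world, stage, version)


def make_env(world=1, stage=1, version=2):
    """Create the emulator environment for a single level.

    The imports are deferred so that `make_env_id` can be used even when
    the emulator packages are not installed.
    """
    import gym_super_mario_bros
    from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
    from nes_py.wrappers import JoypadSpace  # assumption: recent nes-py release

    env = gym_super_mario_bros.make(make_env_id(world, stage, version))
    # Restrict the controller to the button combinations in SIMPLE_MOVEMENT.
    return JoypadSpace(env, SIMPLE_MOVEMENT)
```

Calling `make_env_id(1, 1)` gives the ID for World 1-1 in the `v2` render used in this project.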

## The State Space

The PyPI project page (2) describes the `info` dictionary, which keeps track of the following data:

* `x_pos`: Mario's x-position
* `world` and `stage`: the current world and stage
* `coins`: the number of collected coins
* `flag_get`: a boolean value indicating whether Mario has collected a flag
* `score`: the cumulative in-game score
* `status`: Mario's current status (big, small, Fire Mario, etc.)
* `time`: the time remaining on the clock

In addition, the CNN algorithm collects data on Mario's sprite, what separates him from the uninteractable background, and what the other foreground objects are. This information is not technically recorded by `gym-super-mario-bros` as state information, but will be necessary to help Mario get through the stage.
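To make the shape of this dictionary concrete, here is a small sketch that works on a hand-written example. The values in `sample_info` are made up for illustration, and the helpers `summarize` and `level_cleared` are hypothetical names of our own, not part of `gym-super-mario-bros`.

```python
# Hypothetical snapshot of the info dictionary; every value here is invented.
sample_info = {
    "coins": 0,
    "flag_get": False,
    "score": 100,
    "stage": 1,
    "status": "small",
    "time": 387,
    "world": 1,
    "x_pos": 412,
}


def summarize(info):
    """Format the fields we care about most into a one-line progress report."""
    return (
        "World {world}-{stage} | x={x_pos} | time={time} | "
        "status={status} | flag={flag_get}".format(**info)
    )


def level_cleared(info):
    """Return True once Mario has reached the flag."""
    return bool(info["flag_get"])


print(summarize(sample_info))
```

In a real run, a dictionary like this would come back as the last element of the tuple returned by `env.step()`.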


## The Action Space

The action space is relatively simple. We are limited to any valid button combination available on a standard [NES controller](https://pisces.bbystatic.com/image2/BestBuy_US/images/products/5579/5579396_sd.jpg;maxHeight=640;maxWidth=550). The NES controller has eight buttons: the ***D-pad*** (containing the four cardinal directions **up**, **down**, **left**, and **right**), **A**, **B**, **Start**, and **Select**. The function of each button is generally intuitive, but depends on context. The `gym-super-mario-bros` module provides three action lists, `RIGHT_ONLY`, `SIMPLE_MOVEMENT`, and `COMPLEX_MOVEMENT`, that place varying levels of constraint on the complete NES action space. We use the first two action lists for our project. `SIMPLE_MOVEMENT` contains the following seven inputs: **idle**, **right**, **right+A**, **right+B**, **right+A+B**, **A**, and **left**. `COMPLEX_MOVEMENT` adds button combinations that involve the direction **left**, while `RIGHT_ONLY` removes any non-idle action that does not move Mario to the right.
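The relationship between the two action lists we use can be sketched as follows. The lists below are local copies written out from the description above (with `NOOP` standing for **idle**); in practice they would be imported from `gym_super_mario_bros.actions`. The helper `action_index` is a hypothetical name of our own.

```python
# Local copy of the seven SIMPLE_MOVEMENT inputs described above.
SIMPLE_MOVEMENT = [
    ["NOOP"],
    ["right"],
    ["right", "A"],
    ["right", "B"],
    ["right", "A", "B"],
    ["A"],
    ["left"],
]

# RIGHT_ONLY keeps idle plus every action that moves Mario to the right.
RIGHT_ONLY = [a for a in SIMPLE_MOVEMENT if a == ["NOOP"] or "right" in a]


def action_index(buttons, actions):
    """Look up the discrete action index for a given button combination."""
    return actions.index(list(buttons))


print(len(SIMPLE_MOVEMENT), len(RIGHT_ONLY))
```

The network's output layer has one unit per entry in the chosen list, so switching lists changes the size of the action space the agent has to search.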


## The Reward Function

The documentation for `gym_super_mario_bros` defines the reward per step $r$ as

$$ r = v + c + d $$

where:

* $v$ is Mario's **instantaneous velocity**. More precisely, it is the difference in the agent's $x$-position between states. Therefore, if $x_0$ is Mario's initial position and $x_1$ is Mario's position after advancing a step, then $v = x_1 - x_0$. As a consequence, $v > 0$ implies Mario is moving right, $v < 0$ implies Mario is moving left, and $v = 0$ implies Mario has not moved at all (at least not horizontally; perhaps he is jumping/falling in place or climbing a vine).
* $c$ is the **difference in game clock between steps**. More precisely, if $c_0$ and $c_1$ are the in-game times before and after the step respectively, then $c = c_1 - c_0$. Because the timer decreases as the game continues, $c$ is never a positive number; either $c = 0$ and the timer hasn't decreased at all (possible because one in-game "second" is roughly equivalent to $0.4$ real-world seconds, a length of time still longer than a frame or even $4$ frames in the frameskip version of the NES ROM we are using) or $c < 0$ and the in-game clock has ticked down. This is essentially insignificant to Mario's reward at best and a penalty at worst, which is by design as we don't want Mario to spend too much time standing still.
* Finally, $d$ is a **death penalty** that penalizes the agent for dying in a state. This is a hefty penalty that strongly discourages the agent from dying, which will help motivate it to learn what causes Mario to die in the first place. If Mario is alive, $d = 0$, but if Mario dies, $d = -15$. Because the game has been scrubbed of (virtually) all cutscenes, the death penalty should only penalize Mario for a single step.

We make one initial addition to the above: a game-over penalty of $-50$ if Mario loses all his lives. We added this to see whether Mario could learn not to lose all his lives in cases where the game would otherwise end while we still had training steps left.

After computing $r$, there is one final adjustment made to the reward: it is clipped into the interval $[-15, 15]$. That means Mario cannot gain or lose more than $15$ points on a single step. Because this value is also equal to the lowest possible value of $d$ (acquired upon death), Mario cannot do any worse on a single step than dying, which establishes death as the game's ultimate single-state penalty. Of course, in the long term, this likely won't greatly hinder Mario's reward score as Mario can only die three times in a single episode, totaling $-45$ points, but it works as an excellent short-term motivator against interacting with anything that might kill Mario and should still force the agent to think more intelligently about traversing each course.
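The per-step reward described above can be written out as a short function. This is a minimal sketch of the documented formula only: it takes the clock term as "after minus before" so that it is never positive, as the text describes, and it leaves out our additional game-over penalty. The function name `step_reward` and its argument names are our own.

```python
def step_reward(x_before, x_after, clock_before, clock_after, died):
    """Compute r = v + c + d, clipped to the interval [-15, 15]."""
    v = x_after - x_before          # positive when Mario moves right
    c = clock_after - clock_before  # zero or negative: the clock only counts down
    d = -15 if died else 0          # death penalty
    return max(-15, min(15, v + c + d))


# Moving 3 pixels right with no clock tick earns 3 points.
print(step_reward(100, 103, 400, 400, died=False))
# Standing still while the clock ticks costs 1 point.
print(step_reward(100, 100, 400, 399, died=False))
```

Note that the clipping is applied after the three terms are summed, so a large jump forward is capped at $15$ just as a death is capped at $-15$.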



# Project Implementation

We implement two separate architectures in our project: a basic dense neural network and a five-layer convolutional neural network with an AV-stream. We use two GPUs to train separate models: an NVIDIA Tesla V100 GPU on a Google Cloud instance and ????. We need GPU computation both for CNN training and because the problem is complex enough that Mario needs to go through as much training as possible.

## Baseline 

### Architecture: 4-Layer Dense Neural Network

We want to start with a simple, dense architecture. Our baseline architecture has four layers increasing in width from 24 units to 96 units in successive increments of 24 units (that is, 24, 48, 72, and 96 units). Each of the layers has a rectified linear unit as its activation function.
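To give a feel for the size of this baseline, the sketch below computes the layer widths and the number of trainable parameters in a fully connected stack. The input dimension of `10` and the seven-unit output layer (one Q-value per `SIMPLE_MOVEMENT` action) are assumptions made for illustration; the real input size depends on how the frames are flattened.

```python
def layer_widths(n_layers=4, increment=24):
    """Widths of the hidden layers: 24, 48, 72, 96 for the defaults."""
    return [increment * (i + 1) for i in range(n_layers)]


def dense_param_count(input_dim, widths, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [input_dim] + list(widths) + [n_outputs]
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


widths = layer_widths()
print(widths)
# Hypothetical 10-dimensional input and 7 output Q-values.
print(dense_param_count(10, widths, 7))
```

With a flattened image as input, the first layer would dominate the parameter count, which is one reason convolutional layers are a natural next step.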

### Hyperparameters

We pick a high discount rate and a high exploration rate with a high decay so that, as the game progresses, Mario gradually moves from exploration to exploitation. The goal is to explore as much as possible in the earlier stages in order to gather as much information as possible about state-action interactions. We keep the step size at 500, despite Mario having 400 seconds on the clock, because in-game time travels 4/5ths slower than real time. We try different batch sizes ranging from 25 to 50,000 to see the variation in accrued mean reward. We also try a range of episode counts to find the minimum number of episodes required for Mario to achieve certain small goals, like killing a Goomba or jumping over a pipe.
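The move from exploration to exploitation can be sketched with an epsilon-greedy policy whose exploration rate shrinks over time. The multiplicative decay, the decay factor of `0.995`, and the floor of `0.05` are illustrative assumptions, not the exact values used in our runs.

```python
import random


def decay_epsilon(epsilon, decay=0.995, floor=0.05):
    """Shrink the exploration rate, but never below the floor."""
    return max(floor, epsilon * decay)


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


epsilon = 1.0
for _ in range(3):
    action = choose_action([0.1, 0.9, 0.3], epsilon)
    epsilon = decay_epsilon(epsilon)
print(epsilon)
```

Raising `floor` is the knob that keeps Mario making random movements late in training, which matters when he gets stuck in front of an obstacle.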


### Results

We saw that our plain vanilla neural network doesn't do so well. In the initial runs, the rewards ***decay*** as the number of episodes increases, and we could see this happening when we rendered the environment. While Mario learned to move to the right fairly quickly, it took him a little longer to jump over the Goombas or jump on them to kill them. However, the biggest challenge was jumping over the tall pipes. That is where Mario would get stuck for a long time and the rewards would suffer. With our initial hyperparameter settings, Mario was unable to learn to jump over the tall pipe. An added hindrance was that, while stuck there, the network never got a chance to see later frames of the level and learn something from them. That is why we raised the minimum possible value of epsilon, so that Mario could still explore with random movements.

![rewards_atishay](https://imgur.com/5iLbXB9.png)
![rewards_arpita](https://imgur.com/Ht4dhCZ.png)

As we can see from the second plot, running the network for 50,000 time steps wasn't enough. While the positive reward area is a lot denser, we still don't see any obvious signs of learning. This is consistent with what we found in our research: others have needed over 4 million steps to see any significant learning by the network. Instead of running this network for a longer duration, we decided to add convolutional layers, since we are learning from screen images.

## Modified Network

### Architecture: 5-layer CNN with AV-stream

Now, we assemble the network architecture as an object instance of the class `CNN_DQNAgent`. The first half is a collection of five convolutional layers; the second half is, for lack of a better term, going to be called the ***AV-stream***. This splits the output of the final convolutional layer into a *value stream* and an *advantage stream*. The former represents how well off Mario is in his current state; the latter represents how much better off Mario can be by taking a particular action. Advantage is essentially the difference between the Q function and the value stream. To get the estimated Q output, we must add the advantage and value streams back together; the action with the highest advantage then, of course, yields the highest Q output and therefore represents our optimal action for the current state.
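The recombination of the two streams can be sketched in a few lines. This is an illustration of the idea rather than the code inside `CNN_DQNAgent`: the plain sum matches the description above, while the optional mean-subtraction is a common variant from the dueling-network literature that we include only as an assumption. The function names are our own.

```python
def combine_streams(value, advantages, subtract_mean=False):
    """Recombine a scalar state value with per-action advantages into Q-values."""
    if subtract_mean:
        # Common variant: centre the advantages so the value stream
        # carries the average Q-value of the state.
        mean_adv = sum(advantages) / len(advantages)
        advantages = [a - mean_adv for a in advantages]
    return [value + a for a in advantages]


def greedy_action(q_values):
    """Index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


q = combine_streams(2.0, [0.5, -1.0, 1.5])
print(q, greedy_action(q))
```

Either way, the value term shifts every action's Q-value by the same amount, so the greedy action is always the one with the highest advantage.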

The basis of this network design was drawn from [Branko Blagojevic's great tutorial](https://medium.com/ml-everything/learning-from-pixels-and-deep-q-networks-with-keras-20c5f3a78a0) on reinforcement learning with CNNs and Keras.

![model_summary](https://imgur.com/MbS6uoi.png)

We now describe the modifications we made to this network and how they affected the learning process of the agent.

### Training Period 1 (Initial Run): Early Failure

We used the following parameter settings (not including parameters that are not really intended for changing, like path location variables, `load_model`, `final_layer_size`, or `tau`):

```
batch_size = 64
num_epochs = 20
update_freq = 3
y = 0.99
prob_random_start = 0.6
prob_random_end = 0.1
epsilon_steps = 27
num_episodes = 30
pre_train_episodes = 3
max_num_step = 50000
print_every = 1
save_every = 1
```
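To show how the exploration parameters above might interact, here is a sketch of a per-episode $\epsilon$ schedule. It assumes that the pre-training episodes are fully random and that $\epsilon$ is then annealed linearly from `prob_random_start` to `prob_random_end` over `epsilon_steps` episodes; this is our reading of the parameters, not a transcription of the training code, and the function name is hypothetical.

```python
def epsilon_for_episode(episode, start=0.6, end=0.1,
                        epsilon_steps=27, pre_train_episodes=3):
    """Exploration rate for a zero-based episode index (assumed linear anneal)."""
    if episode < pre_train_episodes:
        return 1.0  # pre-training episodes act completely at random
    progress = min(1.0, (episode - pre_train_episodes) / epsilon_steps)
    return start + (end - start) * progress


for ep in (0, 3, 16, 30):
    print(ep, round(epsilon_for_episode(ep), 3))
```

Under this reading, `epsilon_steps` equals `num_episodes - pre_train_episodes`, so $\epsilon$ reaches its floor exactly at the end of the run.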
Here are the plots for the (cumulative) number of steps per episode:

![x](https://i.imgur.com/Wzt3ldY.png)

As we can see, the number of steps varies wildly, from just a few hundred to over $2000$, with very little in between. This seems to suggest that Mario is either dying early to one of the Goombas at the beginning of the level or is making it up to a certain area before getting stuck (like the [pipes](https://i.imgur.com/J4SHXL6.png) near the beginning of the level). Completely random Mario has a tendency to get stuck and only occasionally succeeds in jumping over the pipes. This suggests our more intelligent Mario did not learn well enough how to clear the pipes in $30$ episodes. But what about the few episodes where it ran for $4000$ steps? Well, we can look at the plots for the (rolling average of) reward(s) per episode to get more information:

![x](https://i.imgur.com/iG8ORiK.png)

The pre-training, completely random episodes seem to establish a reward baseline of about $1700$, but the trained episodes afterward never seem to reach that threshold and, in fact, get increasingly stuck as the episodes go on. This can be attributed to the fact that $\epsilon$ is decaying, so Mario is exploring less and less with each episode. Evidently, Mario is not learning much, and this is further reinforced by the plot of losses per episode:

![x](https://i.imgur.com/xUduZQZ.png)

(The losses in the first $6$ episodes, which comprise the set of pre-training episodes and the episodes that run before the first training weight update, are zero because the network has not yet trained on batches taken from the global experience replay buffer.) This is a very steep loss curve that quickly plateaus at a low loss value, which corroborates the idea that Mario is not learning.
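The count of zero-loss episodes follows directly from the parameter settings. The sketch below assumes that the first weight update happens `update_freq` episodes after the pre-training episodes end and that updates then repeat every `update_freq` episodes; the function names are our own.

```python
def first_update_after(pre_train_episodes, update_freq):
    """Number of episodes that run before the first training update."""
    return pre_train_episodes + update_freq


def number_of_updates(num_episodes, pre_train_episodes, update_freq):
    """How many training updates fit into a run of the given length."""
    return max(0, (num_episodes - pre_train_episodes) // update_freq)


# With pre_train_episodes = 3 and update_freq = 3, the first 6 episodes
# report zero loss, and a 30-episode run performs 9 updates.
print(first_update_after(3, 3), number_of_updates(30, 3, 3))
```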

Given all this, it seems reasonable to conclude that running this network for more than the relatively paltry number of $30$ episodes is not likely to be beneficial, and that we would be better off changing the parameters and starting over. Luckily, the design of our code makes this easy. Had we actually had a successful run, adding additional training would be even easier: we would not have to change anything at all (except maybe the number of episodes to train on), and would just need to re-run the training portion of the code, because it always checks for saved model weights before it starts any training.

### Training Period 2: Increasing Initial $\epsilon$ and Optimizer Learning Rate

Since the random runs seem to at least get past the first row of pipes better than the trained runs in TP1 (Training Period 1), it may benefit Mario to try to be a little more random in the early stages. It's also possible that the learning rate for the Adam optimizer is just too low (that could contribute to the relatively pathetic loss curve above), so we bumped it up to `0.05` from `0.0001` (the latter is the learning rate that Blagojevic's tutorial used; he didn't give any justification as to why he used that value). We also added an extra pre-training episode, added a few more episodes to the total count, and adjusted the update frequency accordingly. The final parameters used are below; all parameters enclosed in asterisks were changed, with the value in parentheses representing the value each previously had.

```
*self.model.optimizer.lr = 0.05* (0.0001)

batch_size = 64
num_epochs = 20
*update_freq = 4* (3)
y = 0.99
*prob_random_start = 0.9* (0.6)
prob_random_end = 0.1
*epsilon_steps = 32* (27)
*num_episodes = 36* (30)
*pre_train_episodes = 4* (3)
max_num_step = 50000
print_every = 1
save_every = 1
```

![x](https://i.imgur.com/sxRxe8P.png)
![x](https://i.imgur.com/E6FA15H.png)
![x](https://i.imgur.com/z5HwI9s.png)

The episodes seem to run a bit longer on average compared to TP1, but nothing else looks promising. There was one episode (Episode 10, in particular) where Mario was able to score a reward higher than any of the pre-training runs, but that was the only time that happened. The strangest part is that the losses returned were `NaN`s past Episode 8, which suggests that something about this combination of parameters causes problems. Perhaps the learning rate for the Adam optimizer is too high, leading to some weird exploding or vanishing gradient problem? We're not really sure at this point.
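As a toy analogy for how an overly large learning rate can end in `NaN` values, consider plain gradient descent on $f(w) = w^2$. This is not the Adam optimizer and not our network, so it only illustrates the general mechanism: once the step size is too large, each update overshoots further than the last, the weight overflows to infinity, and the next subtraction produces `NaN`.

```python
import math


def gradient_descent(lr, w=1.0, steps=2000):
    """Minimise f(w) = w**2 with a fixed learning rate and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # each step multiplies w by (1 - 2 * lr)
    return w


print(gradient_descent(0.1))              # shrinks towards the minimum at 0
print(math.isnan(gradient_descent(1.5)))  # overshoots, overflows, becomes NaN
```

A check like `math.isnan(loss)` inside the training loop would be one simple way to stop a run as soon as this happens instead of training on for many more episodes.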

### Training Period 3: Decreasing Learning Rate Slightly

For this run, all we did was lower the learning rate to `0.01`. The learning rate definitely seems to be the culprit behind the loss issues because, while we did get `NaN`s again, we also got this:

![x](https://i.imgur.com/fN1Ah38.png)

We also noticed some rather odd behaviors as the training episodes progressed. At first, it seemed like Mario was learning at a solid rate, and he even managed to encounter a Koopa (a turtle-esque enemy common in the *Mario* franchise)!

![x](https://i.imgur.com/TRa2ZaE.gif)

But then, it updated the weights for the first time, proceeding into a phase that we are calling "**dancing Mario**". Mario would be constantly jumping back and forth, almost as if he was wildly confused, which is peculiar because there is no **left+A** movement option in the `SIMPLE_MOVEMENT` action library, which means Mario must be rapidly pressing `left` and `A` on separate but consecutive frames.

![x](https://i.imgur.com/8FgTNZr.gif)

(We know Mario kind of looks like he's floating around in this gif, but that's because the software that we used to capture the video of Mario's run off our screen drops a lot of frames.) After another update of the training weights, the losses returned to `NaN` and we were welcomed by a familiar performance:

![x](https://i.imgur.com/Hl5GuzO.gif)

The reward only continued to slide from here. Despite a bit of a dramatic episode, Mario failed to retain the knowledge he had obtained in the beginning.

Here are the plots:

![x](https://i.imgur.com/vAYbZi5.png)
![x](https://i.imgur.com/xT6yatd.png)
![x](https://i.imgur.com/c83Hi09.png)

All of these plots corroborate the above observations. There is definitely a spike of learning early on, but this dissipates quickly. Getting the right learning rate will therefore be of utmost importance.


### Training Period 4: Explorer Mario (learning rate = `0.001` and higher $\epsilon$ floor)

We will now do a training run of the same length but with a learning rate brought down to ten times its original value (`0.001`) rather than a hundred times (`0.01`) as it was before. It also occurred to us that what really might be helping Mario in his first set of training runs, before the weights get updated, is that he is still allowed to be experimental early on; it could be that $\epsilon$ is decreasing too fast. Therefore, we are raising the floor of $\epsilon$, stored in `prob_random_end`, to `0.5`, and we are also extending the training period a bit. Here are the parameters we are using:

```
*self.model.optimizer.lr = 0.001* (0.01)

batch_size = 64
num_epochs = 20
update_freq = 4
y = 0.99
prob_random_start = 0.9
*prob_random_end = 0.5* (0.1)
*epsilon_steps = 36* (32)
*num_episodes = 40* (36)
pre_train_episodes = 4
max_num_step = 50000
print_every = 1
save_every = 1
```

And here are the plots:

![x](https://i.imgur.com/VaVdUkN.png)
![x](https://i.imgur.com/qcFM6gJ.png)
![x](https://i.imgur.com/TH6q5fA.png)

It appears at first that Mario is truly learning, and there are many episodes near the beginning that surpass the random pre-training performances. However, as Mario continues to update his training weights, his reward sinks quickly as he seems more and more determined to go left instead of right (or to revert to "dancing Mario", in which he seems conflicted about which direction to travel). We're still not sure why he's pursuing this behavior, but it perhaps indicates the need for another test: since Mario is clearly able to score higher with a higher $\epsilon$, at least for a little while, he likely needs to run for a much longer amount of time. This will also allow us to decay $\epsilon$ even more slowly than we have been, allowing Mario to reap the full benefits of his training for longer.

### Training Period 5: Explorer Mario Part 2

After each training period, we have been removing the saved weights from the `.\models` directory, essentially wiping Mario's memory. This time, we are going to ***keep the weights*** from TP4 and continue training where they left off. We will also be decreasing the learning rate once again to `0.0005`, five times its original value and half of what it was before, as well as sharply increasing the number of episodes to train on (as well as associated parameters). Here's what we'll be using:

```
*self.model.optimizer.lr = 0.0005* (0.001)

batch_size = 64
num_epochs = 20
update_freq = 4
y = 0.99
prob_random_start = 0.9
*prob_random_end = 0.3* (0.5)
*epsilon_steps = 290* (36)
*num_episodes = 300* (40)
*pre_train_episodes = 10* (4)
max_num_step = 50000
print_every = 1
*save_every = 10* (1)
```

Running for between $30$ and $40$ episodes took around an hour, so running for $300$ episodes (remember: that's a whopping $900$ lives!) should take at least ten hours. Because the weights are also updated every $4$ episodes after the $10$ pre-training episodes, for a total of $72$ updates, we would actually estimate the total run time at closer to eleven or twelve hours. The hope is that Mario will be able to correct himself because of the much greater time spent acting randomly, although we could very easily end up with twelve hours of "dancing Mario". In the end, Mario made quite the trek, taking well over $1.2\mathrm{M}$ steps!

![x](https://i.imgur.com/KncHdw9.png)

However, the plots don't tell a pretty story:

![x](https://i.imgur.com/oLGEItm.png)
![x](https://i.imgur.com/PEXWZbl.png)
![x](https://i.imgur.com/qGSj8Ao.png)

For the loss plot, we removed the losses from the initial training runs after the pre-training episodes because they were huge and were greatly distorting the plot, as one can see here:

![x](https://i.imgur.com/zNniyOk.png)

Mario does sustain a stronger average reward over time in the beginning, but before even a third of the episodes have completed, it has already started to decline. About halfway through, Mario's average reward dips below zero and never recovers. One thing is certain, then: running Mario on this architecture for a long time does not rectify his behavior. Even with $\epsilon$ values at around `0.6`, Mario seems to lose quite a bit of steam, with reward values that often dip below $1000$. The fact that Mario's reward eventually falls into the deep negative numbers suggests that ***Mario is learning to persistently move left***. The reward function, which we did not write ourselves and which is taken as an output from `env.step()`, is definitely telling Mario that his reward is sinking because of his actions, yet as $\epsilon$ falls, he is only ever more persistent in traveling left. This at least suggests he is learning *something*, because his performance is definitely not identical to the random performances, but it does sadly mean that he is learning how *not* to play the game. We figure there are three potential causes for this:

1. The action space just needs to be smaller. Perhaps Mario shouldn't be allowed to go left, in which case we can use the `RIGHT_ONLY` action library.
2. The architecture is fundamentally problematic and needs to be changed. This seems to be corroborated by our very steep loss curves. The learning rate might also still be too high.
3. Interestingly, if we take a screenshot of a GIF of Mario's performance, one can see that the image dimensions are `256 x 240`. However, the shape of the observation space of the environment itself is actually `(240, 256, 3)`, which could mean that the observation space is a rotation of the true image we want to train on. This could be problematic as it could cause Mario to get disoriented during training (i.e. it might think moving left moves Mario's position to the right), but we have our doubts that this is the case for a few reasons. Firstly, with these dimensions, the observation space would be a *rotation* of the true images, not a *reflection*, which means Mario, and everything else on screen, would appear to be moving up or down, not left or right. Secondly, the reward function does seem to measure Mario's reward accurately, which means the algorithm seems to have an understanding that Mario is making a concerted effort to go left and is losing reward because of this, yet it is not correcting this behavior even after a lot of training sessions with high initial $\epsilon$ that decays slowly. Thirdly, image arrays are conventionally indexed as (height, width, channels), so a frame that is `256` pixels wide and `240` pixels tall would have the shape `(240, 256, 3)` without any rotation at all.
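The shape question in the third point can be checked with a small sketch. Image arrays are conventionally stored row by row, so the first axis counts rows (height) and the second counts columns (width). The example below uses plain nested lists rather than the emulator's actual frames, and the helper names are our own.

```python
def blank_frame(width, height, channels=3):
    """Build an all-black image stored row by row, as image arrays usually are."""
    return [[[0] * channels for _ in range(width)] for _ in range(height)]


def shape_of(frame):
    """Return (rows, columns, channels) of a nested-list image."""
    return (len(frame), len(frame[0]), len(frame[0][0]))


# A frame that is 256 pixels wide and 240 pixels tall...
frame = blank_frame(width=256, height=240)
# ...has 240 rows of 256 pixels each.
print(shape_of(frame))
```

Under this convention, a `256 x 240` (width by height) screenshot and a `(240, 256, 3)` array describe the same unrotated image.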

### Training Period 6: `RIGHT_ONLY` Mario

`RIGHT_ONLY` removes Mario's ability to move left and jump in place, which means Mario can do only five things: `idle`, `right`, `right+A`, `right+B`, and `right+A+B`. This could solve the problem in the short term, but we doubt that Mario will somehow magically become amazing at the game in the long term, because he loses the ability to go left to correct his position if he needs to (say, to take a running start and acquire more horizontal momentum before jumping). In particular, this will be important for clearing the row of pipes at the beginning of World 1-1, so we suspect that Mario will not be able to sustain training performances higher than random action for very long, and will instead plateau (but not decrease into the negatives, surely). On the other hand, if $\epsilon$ remains high enough, Mario will be forced to explore running and jumping right more often than he did before, in which case he could discover that maintaining high rightward momentum at all times is generally very helpful, especially in the beginning of World 1-1. Even if this does work, though, it seems unlikely that Mario will be able to use these techniques to beat the whole game, as even human players cannot be that careless with their own rightward momentum. Such is the difficulty of reinforcement learning, however, and such is the nature of how narrow neural networks are in their performance scope.

In any case, all we have changed here is the action library (`SIMPLE_MOVEMENT` to `RIGHT_ONLY`) and we have decreased the learning rate back down to what it was for TP1 (`0.0001`). This will be another long $300$-episode run, so we expect the code to take several hours to complete. This is what the code changes look like as far as the creation of `main_qn` and `target_qn` (remember: we also need to import a new action library before we use it!) as well as a summary of the number of parameters. The dense Advantage layer `final_advantage` has a smaller shape now, meaning we save on a few hundred parameters.
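The parameter saving from the smaller advantage layer can be estimated with a short calculation. A dense layer has one weight per input-output pair plus one bias per output, so dropping from seven outputs to five removes two columns of weights and two biases. The input size of `128` below is a hypothetical stand-in, since the true number of units feeding `final_advantage` depends on `final_layer_size`.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases in a fully connected layer."""
    return (n_inputs + 1) * n_outputs


def advantage_savings(n_inputs, old_actions=7, new_actions=5):
    """Parameters saved by shrinking the advantage layer's output size."""
    return dense_params(n_inputs, old_actions) - dense_params(n_inputs, new_actions)


# Hypothetical example: 128 units feeding the advantage layer.
print(advantage_savings(128))
```

The saving is tiny compared with the convolutional layers, so the main effect of `RIGHT_ONLY` is on the action space Mario has to explore rather than on model size.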

![x](https://i.imgur.com/fzhliyM.png)

![x](https://i.imgur.com/LHVP2U2.png)
![x](https://i.imgur.com/X88RHoR.png)
![x](https://i.imgur.com/Uz5b97y.png)

There's some intrigue here, although we are still seeing the reward totals decline per episode. Interestingly, the number of steps per episode rarely exceeds $2000$, which perhaps indicates that Mario dies more often in these runs and does not allow himself to time out as often as before. The loss curve is also rather curious; it decreases at about the same rate as before, though much more gracefully, but then goes back up towards the end. This could potentially indicate that the learning rate is still too high, or that there are diminishing returns on training for this length of time.