# Train a Smartcab to Drive Report
## Abstract
In this project we aim to train a **Smartcab** (also referred to as the **smart agent**) to move on an 8 by 6 grid, with a randomly chosen starting point and destination. The **environment** consists of the **grid**, three other **randomly moving agents**, and **traffic lights** at every intersection, switching at different time intervals.

Our goal is to get our agent to the destination within the set deadline, and as fast as possible.

Note that most if not all of our code mentioned in this report will be in `agent.py`. We may make minor changes to other files for logging purposes only.

## Setting up the Baseline
### Random Walk Mode
Before we start implementing any machine learning algorithm, it is important to know the point from which we are optimizing. For problems like this, we can very often choose the random walk as our baseline.

This can be easily implemented with one line of code (or two lines, if you count the import):

```python
import random
action = random.choice(Environment.valid_actions)
```

where `Environment.valid_actions = [None, 'forward', 'left', 'right']`.

```python
from __future__ import division
import numpy as np
import pandas as pd
from IPython.display import display

# KPIs of the five test runs in random walk mode (100 trials each)
random_mode = pd.DataFrame({'Number of Trials' : 100,
                    'Success Rate' : pd.Series([0.24, 0.24, 0.18, 0.16, 0.21]),
                    'Number of Penalized Trials' : pd.Series([96, 98, 97, 96, 97]),
                    'Avg. Reward Rate' : pd.Series([0.728, 0.672, 0.641, 0.753, 0.675]),
                    'Max Reward Rate' : pd.Series([5.0, 3.167, 6.25, 10.5, 4.0]),
                    'Min Reward Rate' : pd.Series([0.038, 0.0, -0.069, -0.071, 0.0]),
                    'Mode Reward Rate' : pd.Series([(0.0, 19), (0.0, 15), (0.0, 26), (0.0, 24), (0.0, 16)])})

display(random_mode)
```

| | Avg. Reward Rate | Max Reward Rate | Min Reward Rate | Mode Reward Rate | Number of Penalized Trials | Number of Trials | Success Rate |
|---|---|---|---|---|---|---|---|
| 0 | 0.728 | 5.0 | 0.038 | (0.0, 19) | 96 | 100 | 0.24 |
| 1 | 0.672 | 3.167 | 0.0 | (0.0, 15) | 98 | 100 | 0.24 |
| 2 | 0.641 | 6.25 | -0.069 | (0.0, 26) | 97 | 100 | 0.18 |
| 3 | 0.753 | 10.5 | -0.071 | (0.0, 24) | 96 | 100 | 0.16 |
| 4 | 0.675 | 4.0 | 0.0 | (0.0, 16) | 97 | 100 | 0.21 |


Above are the data for each test run in random walk mode. The KPIs are defined as follows:

* Success Rate: $\frac{\text{Number of times the agent reaches the destination}}{\text{Number of trials}}$
* Number of Penalized Trials: Number of trials which get any negative reward
* Reward Rate: $\frac{\text{Net reward}}{\text{Number of steps taken to get to the destination}}$
* Avg. Reward Rate: Mean Reward Rate of all the successful trials
* Max Reward Rate: Maximum Reward Rate
* Min Reward Rate: Minimum Reward Rate
* Mode Reward Rate: Mode of Reward Rates rounded to two decimal places, shown as (value, count)
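
As an illustration of the KPI definitions above, here is a minimal sketch of how they could be computed from per-trial logs. The record layout (`reached`, `net_reward`, `steps`, `penalized`) and the function name `summarize_trials` are hypothetical and not taken from `agent.py`; the sketch also assumes that the maximum, minimum, and mode are taken over all trials, while the average is taken over successful trials only.

```python
from collections import Counter

def summarize_trials(trials):
    """Compute the report's KPIs from a list of per-trial records.

    Each record is a hypothetical dict with the keys
    'reached' (bool), 'net_reward' (float), 'steps' (int), 'penalized' (bool).
    """
    n = float(len(trials))
    # Reward rate of every trial: net reward divided by steps taken.
    rates = [t['net_reward'] / float(t['steps']) for t in trials]
    # Reward rates of the successful trials only, used for the average.
    successful = [r for r, t in zip(rates, trials) if t['reached']]
    # Mode of the rounded reward rates, as a (value, count) pair.
    mode = Counter(round(r, 2) for r in rates).most_common(1)[0]
    return {
        'Number of Trials': len(trials),
        'Success Rate': sum(1 for t in trials if t['reached']) / n,
        'Number of Penalized Trials': sum(1 for t in trials if t['penalized']),
        'Avg. Reward Rate': sum(successful) / len(successful) if successful else float('nan'),
        'Max Reward Rate': max(rates),
        'Min Reward Rate': min(rates),
        'Mode Reward Rate': mode,
    }

# Made-up example records, for illustration only (not data from the report).
example_trials = [
    {'reached': True,  'net_reward': 12.0, 'steps': 4,  'penalized': False},
    {'reached': False, 'net_reward': 0.0,  'steps': 10, 'penalized': True},
    {'reached': False, 'net_reward': 0.0,  'steps': 20, 'penalized': True},
    {'reached': True,  'net_reward': 5.0,  'steps': 5,  'penalized': True},
]
example_kpis = summarize_trials(example_trials)
```

The resulting dictionary has the same keys as the columns of the summary table, so it could be passed to `pd.DataFrame` one run at a time.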

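With the KPIs set up, we can see that the random walk performs rather poorly at getting our cab to its destination: the success rate ranges from 0.16 to 0.24 across the five runs, and almost every trial incurs a penalty. We will improve on that with a better algorithm.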

## Optimize Against the Baseline
### Identify the States
Our environment assumes the US right-of-way rules. 

1. On a green light, you can turn left only if there is no oncoming traffic at the intersection coming straight. 
2. On a red light, you can turn right if there is no oncoming traffic turning left or traffic from the left going straight.
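
The two rules above can be written as a small predicate. This is only a sketch of the rules as stated in this report, not the environment's own code: the function name `violates_right_of_way` is hypothetical, and it assumes that the status of another agent is its intended action (`None`, `'forward'`, `'left'`, or `'right'`).

```python
def violates_right_of_way(light, action, oncoming, left):
    """Return True if `action` breaks one of the two right-of-way rules.

    `oncoming` and `left` are assumed to be the intended actions of the
    oncoming agent and of the agent coming from the left (None if absent).
    """
    if light == 'green' and action == 'left':
        # Rule 1: no left turn while oncoming traffic is going straight.
        return oncoming == 'forward'
    if light == 'red' and action == 'right':
        # Rule 2: no right turn on red while oncoming traffic turns left
        # or traffic from the left is going straight.
        return oncoming == 'left' or left == 'forward'
    # Any other combination is outside the scope of these two rules.
    return False
```

Running a red light by going forward or turning left is not covered here, since it is a traffic light rule rather than a right-of-way rule.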

Although violating some of these rules is not penalized in the current version, we would like to take those features into account and make them part of our state so that the code is ready for future changes (this will of course expand our feature space, and therefore the agent will need more trials to learn the optimal policy).

Below are the inputs that our smartcab can take:

1. Next waypoint
2. Traffic light status
3. Status of oncoming agent
4. Status of agent on the left
5. Status of agent on the right
6. Deadline

As said earlier, although the smartcab doesn't get penalized for violating the right-of-way rules (which means we don't really need to let the smartcab sense the statuses of other agents), for now we still take them into account and observe how the smart agent would learn.
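
To give a feel for the size of the resulting feature space, here is a minimal sketch that builds a state from the first five inputs (the deadline is left out) and enumerates every possible state. The value sets are assumptions: the next waypoint is taken to be one of `'forward'`, `'left'`, `'right'`, the light either `'red'` or `'green'`, and the status of each other agent one of the four values in `valid_actions`. The helper `build_state` and the `inputs` dictionary keys are hypothetical names.

```python
from itertools import product

# Assumed value sets for each sensed input (see the lead-in above).
WAYPOINTS = ['forward', 'left', 'right']
LIGHTS = ['red', 'green']
OTHER_AGENT = [None, 'forward', 'left', 'right']

def build_state(next_waypoint, inputs):
    """Pack the sensed inputs into a hashable tuple usable as a dictionary key.

    `inputs` is a hypothetical dict with the keys
    'light', 'oncoming', 'left', and 'right'.
    """
    return (next_waypoint, inputs['light'],
            inputs['oncoming'], inputs['left'], inputs['right'])

# Every combination of the five inputs is one state.
all_states = list(product(WAYPOINTS, LIGHTS, OTHER_AGENT, OTHER_AGENT, OTHER_AGENT))
print(len(all_states))  # 3 * 2 * 4 * 4 * 4 = 384
```

Under these assumptions, dropping the three other-agent inputs would leave only 3 * 2 = 6 states, which is the trade-off to keep in mind if learning turns out to be too slow.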

#### Next waypoint
This is the essential state to take into account. Since the destination is different for each new trial, representing the absolute location and heading is not useful; the only way for our smartcab to find the destination is to follow the **next waypoint**. The smartcab gets a reward of 2 each time it correctly follows the next waypoint, and a reward of 0.5 when it doesn't.

#### Traffic light status
Status of the traffic light plays an important role here as well. The reward/penalty works as follows:

1. When the traffic light is 'red' and the agent goes 'forward' it gets penalized by -1
2. When the traffic light is 'red' and the agent goes 'left' it gets penalized by -1
3. In other cases the agent gets reward as stated in the Next waypoint section.
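
The reward rules listed above can be summarized in a short function. This is a sketch of the rules as described in this report, not the environment's actual implementation; the name `step_reward` is hypothetical, idling (`None`) is treated like any other action that does not follow the waypoint, and any bonus for reaching the destination is left out.

```python
def step_reward(light, action, next_waypoint):
    """Reward for a single step, following the rules described in the text."""
    # Rules 1 and 2: going forward or turning left on red is penalized.
    if light == 'red' and action in ('forward', 'left'):
        return -1.0
    # Rule 3: otherwise the reward depends on following the next waypoint.
    if action == next_waypoint:
        return 2.0
    # Assumption: every other action, including None, counts as not following.
    return 0.5
```

For example, `step_reward('red', 'right', 'right')` falls under rule 3 and returns the waypoint reward, since the report lists a penalty only for going forward or left on red.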

#### Statuses of other agents
As said earlier, the statuses of other agents don't really play any role here. We include them so that our code is ready for future changes. Including them slightly expands the feature space, so we may need to take them out if our smart agent fails to learn the optimal policy. Yet, we may not need to.

#### Deadline
The deadline is not explicitly represented as a state for now. However, it is implicitly represented by the discount factor $\gamma$: the fewer steps the agent takes to reach the destination, the more valuable the reward for reaching it, which in turn favors policies that get there faster. We may take the deadline into account later if the smart agent's performance is not satisfactory.
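
To make this argument concrete, here is a short worked sketch. It assumes the standard discounted return with $0 < \gamma < 1$, and the symbols $R_{\text{dest}}$ (a positive reward for reaching the destination) and $T$ (the number of steps needed to reach it) are introduced here for illustration only. Seen from the start of a trial, the destination reward is worth

$$
\gamma^{T} R_{\text{dest}},
\qquad\text{and}\qquad
\gamma^{T_1} R_{\text{dest}} > \gamma^{T_2} R_{\text{dest}} \quad\text{whenever } T_1 < T_2 .
$$

With $\gamma = 0.9$, for instance, reaching the destination in 5 steps keeps $0.9^{5} \approx 0.59$ of that reward, while taking 20 steps keeps only $0.9^{20} \approx 0.12$.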


### Q-Learning and the Smarter Policy
#### Quick and Dirty Q-Learning

To start with, we implement Q-learning with an epsilon-greedy algorithm to obtain policies for our smart agent. A quick and dirty implementation gives the following result:

```python
# Wrap the mode tuple in a list so pandas stores it in a single cell;
# a bare tuple would be expanded into two rows.
qd_perf = pd.DataFrame({'Number of Trials' : 100,
                    'Success Rate' : 0.46,
                    'Number of Penalized Trials' : 67,
                    'Avg. Reward Rate' : 1.545,
                    'Max Reward Rate' : 4.5,
                    'Min Reward Rate' : 0.538,
                    'Mode Reward Rate' : [(0.0, 99)]})

display(qd_perf)
```

| | Avg. Reward Rate | Max Reward Rate | Min Reward Rate | Mode Reward Rate | Number of Penalized Trials | Number of Trials | Success Rate |
|---|---|---|---|---|---|---|---|
| 0 | 1.545 | 4.5 | 0.538 | (0.0, 99) | 67 | 100 | 0.46 |
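
For reference, tabular Q-learning with epsilon-greedy action selection, as used in this section, can be sketched roughly as follows. This is a minimal illustration and not the code in `agent.py`: the class name `QLearner`, its method names, and the default values of `alpha`, `gamma`, and `epsilon` are all assumptions.

```python
import random
from collections import defaultdict

class QLearner(object):
    """Tabular Q-learning with epsilon-greedy action selection (sketch)."""

    def __init__(self, actions, alpha=0.5, gamma=0.8, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # probability of taking a random action
        self.q = defaultdict(float)  # (state, action) -> value, 0.0 if unseen
        self.rng = random.Random(seed)

    def choose_action(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        best = max(self.q[(state, a)] for a in self.actions)
        # Break ties between equally good actions at random.
        return self.rng.choice([a for a in self.actions if self.q[(state, a)] == best])

    def learn(self, state, action, reward, next_state):
        # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

In each step the agent would call `choose_action` with the current state, apply the action in the environment, and then call `learn` with the observed reward and the next state. States only need to be hashable, so a tuple of the sensed inputs works as a key.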
