# A quick tour of Jupyter notebooks

For those who may be unfamiliar with Amazon SageMaker, Jupyter notebooks, and OpenAI Gym, this is a good starting place. First, we need to explain how to run cells. Try to run the cell below!

In [None]:
print("Hi! This is a cell. Press the ▶ button above to run it")
print("You can also run a cell with Ctrl+Enter or Shift+Enter. Experiment a bit with that.")

Let's look at some important components of a Jupyter notebook: 

**Dashboard** - When the notebook server is first started, a browser will be opened to the notebook dashboard. The dashboard serves as a home page for the notebook. Its main purpose is to display the portion of the filesystem accessible by the user, and to provide an overview of the running kernels, terminals, and parallel clusters.

**Files** - The files tab provides an interactive view of the portion of the filesystem which is accessible by the user. This is typically rooted by the directory in which the notebook server was started.

**Notebook body** - The body of a notebook is composed of cells. Each cell contains either markdown, code input, code output, or raw text. Cells can be included in any order and edited at will, which allows a large amount of flexibility in constructing a narrative. There are different types of cells:

* **Markdown cells** - These are used to build a nicely formatted narrative around the code in the document. The majority of this lesson is composed of markdown cells.

* **Code cells** - These are used to define the computational code in the document. They come in two forms: the *input cell* where the user types the code to be executed, and the *output cell* which is the representation of the executed code. Depending on the code, this representation may be a simple scalar value, or something more complex like a plot or an interactive widget.

* **Raw cells** - These are used when text needs to be included in raw form, without execution or transformation.

**Terminal** - The notebook application is able to spawn interactive terminal instances. A new terminal can be spawned from the dashboard by clicking on the **`Files`** tab, followed by the **`New`** dropdown button, and then selecting **`Terminal`**.

**Shortcuts** - The following shortcuts have been found to be the most useful in day-to-day tasks:

 * Basic navigation: **`enter`**, **`shift-enter`**, **`up/k`**, **`down/j`**
 * Saving the notebook: **`s`**
 * Cell types: **`y`**, **`m`**, **`1-6`**, **`r`**
 * Cell creation: **`a`**, **`b`**
 * Cell editing: **`x`**, **`c`**, **`v`**, **`d`**, **`z`**, **`ctrl+shift+-`**
 * Kernel operations: **`i`**, **`.`**


# Do a fresh install of Gym with Box2D

Now that you've got the basics down, the rest of this notebook will explain the Lunar Lander environment, including Box2D and OpenAI Gym. 

In [None]:
!pip uninstall -y gym; exit 0
!pip install box2d-py
!pip install gym[box2d]
!pip install --upgrade matplotlib

In [None]:
!pip show gym

In [None]:
# Make sure this is the same path as the one shown above next to "Location:". Otherwise, change it.
Location = '/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages'
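If you would rather not copy the path by hand, the same directory can usually be derived from Python itself. The sketch below is one way to do that with the standard library's `importlib`; `package_location` is a hypothetical helper name introduced here, not part of the original notebook, and it assumes a regular (non-editable) install.

```python
import importlib.util
import os


def package_location(package_name):
    """Return the directory that contains an installed package (hypothetical helper)."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        raise ImportError(f"{package_name!r} is not installed as a package")
    package_dir = list(spec.submodule_search_locations)[0]
    # The parent of the package directory is typically what `pip show` lists as "Location".
    return os.path.dirname(package_dir)


# Demonstrated on a standard-library package so the example runs anywhere;
# in this notebook you would pass "gym" and assign the result to Location.
print(package_location("json"))
```

It is still worth comparing the result with the `pip show gym` output before overwriting any files.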

#### Changing the render function of the original lunar lander code by overwriting the class file...

In [None]:
!sudo cp ./src/lunar_lander-local.py {Location}/gym/envs/box2d/lunar_lander.py

In [None]:
!pygmentize {Location}/gym/envs/box2d/lunar_lander.py

#### Import and make the lunar lander environment

In [None]:
import gym
env = gym.make('LunarLander-v2')

# Explore some properties of this environment

#### What are the states of this agent?

```python
pos = self.lander.position
vel = self.lander.linearVelocity

state = [
            (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2),
            (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_H/SCALE/2),
            vel.x*(VIEWPORT_W/SCALE/2)/FPS,
            vel.y*(VIEWPORT_H/SCALE/2)/FPS,
            self.lander.angle,
            20.0*self.lander.angularVelocity/FPS,
            1.0 if self.legs[0].ground_contact else 0.0,
            1.0 if self.legs[1].ground_contact else 0.0
       ]
```

#### What states of the environment can be observed outside?

In [None]:
env.observation_space

This means that all 8 state variables are observable. In many environments this is not the case, and the agent only sees part of the state.
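To make the eight numbers easier to read, one option is to wrap an observation in a named tuple. This is only a sketch: the field names are labels chosen here to follow the order of the state code above, not identifiers from Gym, and the example observation is made up.

```python
from collections import namedtuple

# Illustrative labels, in the same order as the state vector shown earlier.
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity", "leg0_contact", "leg1_contact"],
)

# A made-up observation, used only to show the labelling.
example = LanderState(*[0.1, 1.2, -0.05, -0.3, 0.02, 0.0, 0.0, 0.0])
print(example.y, example.vy)
```

With the real environment, `LanderState(*state)` would label the array returned by `env.reset()` or `env.step(action)` in the same way.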

#### What actions can the agent take?

In [None]:
env.action_space

That is, there are four discrete actions: do nothing, fire the left engine, fire the main engine, or fire the right engine.

According to Pontryagin's maximum principle, it is optimal to either fire an engine at full throttle or turn it off. That is why discrete actions (engine on or off) are adequate for this environment.
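The four actions listed above can be given readable names so that later code does not rely on bare integers. The enum below is an illustrative convenience, assuming the actions are numbered 0 to 3 in the order listed; `LanderAction` and its member names are not part of Gym.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Illustrative names for the four discrete actions, in the order listed above."""
    NOOP = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


# IntEnum members behave like plain integers, so they can be passed to env.step().
print(LanderAction(2).name, int(LanderAction.FIRE_RIGHT_ENGINE))
```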

#### What about rewards? 

```python
        reward = 0
        shaping = \
            - 100*np.sqrt(state[0]*state[0] + state[1]*state[1]) \
            - 100*np.sqrt(state[2]*state[2] + state[3]*state[3]) \
            - 100*abs(state[4]) + 10*state[6] + 10*state[7]   # And ten points for legs contact, the idea is if you
                                                              # lose contact again after landing, you get negative reward
        if self.prev_shaping is not None:
            reward = shaping - self.prev_shaping
        self.prev_shaping = shaping

        reward -= m_power*0.30  # less fuel spent is better, about -30 for heuristic landing
        reward -= s_power*0.03

        done = False
        if self.game_over or abs(state[0]) >= 1.0:
            done   = True
            reward = -100
        if not self.lander.awake:
            done   = True
            reward = +100
```

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector.
The reward for moving from the top of the screen to the landing pad and coming to zero speed is about 100 to 140 points.
If the lander moves away from the landing pad, it loses that reward again. The episode finishes if the lander crashes or
comes to rest, receiving an additional -100 or +100 points. Each leg ground contact is worth +10. Firing the main
engine costs 0.3 points each frame. Firing a side engine costs 0.03 points each frame. The environment counts as solved at 200 points.

Landing outside the landing pad is possible. Fuel is infinite, though the reward accounts for its usage, so an agent can learn to fly and then land on its first attempt. Please see the source code for details.
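The per-step reward is the change in the shaping term between consecutive states, which a small worked example makes concrete. The sketch below mirrors the shaping expression from the reward code above as a standalone function; the two states are made-up values, and fuel costs and terminal rewards are left out.

```python
import math


def shaping(state):
    """Shaping term from the reward code above, written as a standalone function."""
    return (
        -100 * math.hypot(state[0], state[1])   # distance from the landing pad
        - 100 * math.hypot(state[2], state[3])  # speed
        - 100 * abs(state[4])                   # tilt angle
        + 10 * state[6] + 10 * state[7]         # leg contact bonus
    )


# Two made-up consecutive states: the lander gets closer to the pad and slows down.
previous = [0.0, 1.0, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0]
current = [0.0, 0.9, 0.0, -0.4, 0.0, 0.0, 0.0, 0.0]

step_reward = shaping(current) - shaping(previous)
print(step_reward)  # positive: the second state is both closer and slower
```

Because only the difference matters, hovering in place earns nothing from shaping, while drifting away from the pad gives back reward that was earned earlier.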

## Step through the environment with a random agent

Here we will:

1. Initialize the scene with ```env.reset()```
1. Get a random action from the set of allowable actions for this agent in this environment with ```action = env.action_space.sample()```
1. Perform a step using this random action ```env.step(action)```
1. Repeat for 200 time steps

### **IMPORTANT: Stop the animation by hitting the big blue power button before proceeding.**

In [None]:
%matplotlib notebook 

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.animation

f = plt.figure()
ax = f.gca()

state = env.reset()
im = env.render()
image = plt.imshow(im, interpolation='None', animated=True)

#Do an experiment and collect images, then animate
images = []
for _ in range(200):
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    images.append(env.render())
    
def animate(frame_index, images):
    image.set_data(images[frame_index])
    ax.set_title(str(frame_index))
    return image

ani = matplotlib.animation.FuncAnimation(f, animate, interval=100, frames=200, fargs=(images,))


Hard landing? Unsurprisingly, random actions do not get the lander down safely.
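One way to put a number on how a policy performs is to add up the reward over an episode and stop as soon as the environment reports `done`. The sketch below assumes the classic Gym API used in this notebook, where `env.step` returns a 4-tuple. `run_episode` and `CountdownEnv` are hypothetical names; the toy environment only stands in for the lander so that the example runs without Gym installed.

```python
import random


def run_episode(env, policy, max_steps=200):
    """Run one episode and return (total_reward, steps_taken)."""
    state = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = policy(state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:  # stop as soon as the episode ends
            return total_reward, step
    return total_reward, max_steps


class CountdownEnv:
    """Toy stand-in environment: ends after 5 steps, reward of -1 per step."""

    def reset(self):
        self.remaining = 5
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, -1.0, self.remaining == 0, {}


total, steps = run_episode(CountdownEnv(), policy=lambda state: random.randrange(4))
print(total, steps)
```

With the lander, the equivalent call would be `run_episode(env, lambda state: env.action_space.sample())`.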

Can we use some rules or heuristics to help land safely? When coding game AI, it is typical to construct rules or decision trees that pick the most sensible action for the agent's current state. Perhaps control theory can help?

_Note_: This low resolution world is what the agent sees, but we will be modifying it and rendering a better version of the world in the lab you will do next!

--------------------------------------------------------

The ```heuristic()``` function below uses the following logic:

1. Point towards where the agent needs to land
1. Limit rotation by clamping the target angle ```angle_targ```
1. Hover above the center for a soft, centered landing, using the target height ```hover_targ```
1. Use two PID controllers:
    * one for converging towards the target angle
    * another for converging on the hover target and damping vertical velocity, for soft landings
1. Output an appropriate action (```a = 1```, ```2```, ```3``` or ```0```) to achieve the above targets


### Wait, what is a PID controller?

The basic idea behind a PID controller is to read a sensor and then compute the desired actuator output by summing three responses to the error: proportional, integral, and derivative. In a closed-loop system, a sensor measures the process variable (here, angle and velocity) and feeds it back to the control system. The set point (or "target") is the desired or commanded value of the process variable, such as 100 degrees Celsius in a temperature control system (here, the targets are for angle and velocity). At any given moment, the control algorithm (the compensator) uses the difference between the set point and the process variable to determine the actuator output that drives the system. Here, the actuators are the engines, one main and two side, and they are controlled by the actions you pass to the environment.

<img src="http://blog.opticontrols.com/wp-content/uploads/2011/03/PID-Controller.png" width ="50%">
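As a rough illustration of the idea, here is a textbook discrete-time PID controller. It is a minimal sketch, not the code used by the heuristic in this notebook; the class name `PID`, its parameters, and the fixed time step `dt` are all assumptions made for the example.

```python
class PID:
    """Textbook discrete-time PID controller (illustrative only)."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.previous_error = None

    def update(self, set_point, measurement):
        """Return the actuator output for the latest sensor reading."""
        error = set_point - measurement
        self.integral += error * self.dt
        if self.previous_error is None:
            derivative = 0.0  # no history yet on the first call
        else:
            derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        # Sum of the proportional, integral, and derivative responses.
        return self.kp * error + self.ki * self.integral + self.kd * derivative


controller = PID(kp=0.5, ki=0.0, kd=1.0)
print(controller.update(set_point=1.0, measurement=0.0))
```

Setting `ki=0.0` turns this into a PD controller, which reacts to the current error and its rate of change but does not accumulate past error.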

### Ok, but how do you determine this? What are those magic coefficients in the PID equations below?

```angle_todo = (angle_targ - s[4])*0.5 - (s[5])*1.0```

```hover_todo = (hover_targ - s[1])*0.5 - (s[3])*0.5```

In the two equations above, the first coefficient is the proportional gain and the second is the derivative gain; there is no integral term. The process of setting the gains for P, I and D to get an ideal response from a control system is called tuning. There are different tuning methods, of which "guess and check" and the Ziegler-Nichols method are popular. Assume that the tuning was done by trial and error and that you know these coefficients.
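To get a feel for what "guess and check" tuning looks like, the sketch below tries two gain settings on a toy system: a unit mass pushed along a line towards position 0. This is not the lander's physics; `residual_error`, the time step, and the toy plant are assumptions chosen only to show how the derivative gain changes the response.

```python
def residual_error(kp, kd, steps=400, dt=0.05):
    """Largest distance from the target during the second half of the run."""
    position, velocity = 1.0, 0.0
    worst = 0.0
    for step in range(steps):
        # Proportional term on the position error, derivative term on the velocity.
        force = kp * (0.0 - position) - kd * velocity
        velocity += force * dt
        position += velocity * dt
        if step >= steps // 2:
            worst = max(worst, abs(position))
    return worst


# Same proportional gain, with and without the derivative (damping) term.
print("kp=0.5, kd=1.0:", residual_error(0.5, 1.0))
print("kp=0.5, kd=0.0:", residual_error(0.5, 0.0))
```

With the derivative term the mass settles close to the target, while without it the mass keeps oscillating around the target. Trial-and-error tuning amounts to repeating this kind of comparison until the response looks acceptable.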


### Ok, back to constructing the heuristic function ....

In [None]:
import numpy as np 

def heuristic(env, s):
    # Heuristic for:
    # 1. Testing. 
    # 2. Demonstration rollout.
    angle_targ = s[0]*0.5 + s[2]*1.0         # angle should point towards center (s[0] is horizontal coordinate, s[2] hor speed)
    if angle_targ >  0.4: angle_targ =  0.4  # more than 0.4 radians (22 degrees) is bad
    if angle_targ < -0.4: angle_targ = -0.4
    hover_targ = 0.55*np.abs(s[0])           # target y should be proportional to horizontal offset

    # PID controller: s[4] angle, s[5] angularSpeed
    angle_todo = (angle_targ - s[4])*0.5 - (s[5])*1.0
    #print("angle_targ=%0.2f, angle_todo=%0.2f" % (angle_targ, angle_todo))

    # PID controller: s[1] vertical coordinate s[3] vertical speed
    hover_todo = (hover_targ - s[1])*0.5 - (s[3])*0.5
    #print("hover_targ=%0.2f, hover_todo=%0.2f" % (hover_targ, hover_todo))

    if s[6] or s[7]: # legs have contact
        angle_todo = 0
        hover_todo = -(s[3])*0.5  # override to reduce fall speed, that's all we need after contact

    if env.continuous:
        a = np.array( [hover_todo*20 - 1, -angle_todo*20] )
        a = np.clip(a, -1, +1)
    else:
        a = 0
        if hover_todo > np.abs(angle_todo) and hover_todo > 0.05: a = 2
        elif angle_todo < -0.05: a = 3
        elif angle_todo > +0.05: a = 1
    return a

## Step through the environment with a heuristic agent

The only change here is that we use the function above to suggest a "good action".

In [None]:
# import matplotlib.pyplot as plt
import numpy as np
import matplotlib.animation

f2 = plt.figure()
ax2 = f2.gca()

state = env.reset()
im = env.render()
image = plt.imshow(im, interpolation='None', animated=True)

#Do an experiment and collect images, then animate
images = []
for _ in range(200):
    action = heuristic(env, state) # THIS LINE CHANGED !!!
    state, reward, done, _ = env.step(action)
    images.append(env.render())
    
def animate(frame_index, images):
    image.set_data(images[frame_index])
    ax2.set_title(str(frame_index))
    return image

ani2 = matplotlib.animation.FuncAnimation(f2, animate, interval=100, frames=200, fargs=(images,))

That should look a little better! 
**"Why do I need Reinforcement Learning (RL) then?"** ... For a simple environment and a simple agent with 4 actions, hand-coded rules are manageable. In situations where hand-coded rules are difficult or impossible to write, RL comes to the rescue. 

This line in the above code is a "policy", specifically a policy that we hand-coded. It answers the question, _what is the best action I can take from the current state?_ 

```
action = heuristic(env, state)
```

In general, it is difficult to hand-code policies for agents, so RL algorithms help **generate** an optimal policy for an agent in an environment such as this Lunar Lander one. Today we will be using RL to teach the agent how to land. 

# Let's go!