In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

import pendulum

# Goal of the project
The goal of this project is to learn a policy that makes an inverted pendulum model perform a swing-up motion. Beyond the task of inverting a pendulum, the goal is also to gain an understanding of how value iteration and Q-learning work, as well as their limitations and advantages.

To make the problem interesting, the inverted pendulum has a limit on the maximum torque it can apply. The pendulum therefore needs a few "back and forth" motions to reach the inverted position ($\theta=\pi$) from the non-inverted position at rest ($\theta=0$).

<img src='pendulum.png' width="120">

In the following, we will write $x = \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix}$ for the state vector of the system. We will also work with time-discretized dynamics, and refer to $x_n$ as the state at time $t = n \Delta t$ (assuming a discretization time $\Delta t$).

We want to minimize the following discounted cost function
$$\sum_{i=0}^{\infty} \alpha^i g(x_i, u_i)$$ where 
$$g(x_i, u_i) = (\theta_i-\pi)^2 + 0.01 \cdot \dot{\theta}_i^2 + 0.0001 \cdot u_i^2 \qquad \textrm{and} \qquad\alpha=0.99$$
This cost mostly penalizes deviations from the inverted position but also encourages small velocities and small control inputs.
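
As a minimal sketch, the stage cost and the discounted sum over a finite trajectory could be written as follows. The names `stage_cost` and `discounted_cost` are illustrative and not part of the `pendulum` module, the infinite sum is truncated to the length of the trajectory, and the assumed array shapes are stated in the docstring.

```python
import numpy as np

ALPHA = 0.99  # discount factor from the cost definition above


def stage_cost(x, u):
    """g(x, u) for a state x = [theta, dtheta] and a scalar control u."""
    theta, dtheta = x
    return (theta - np.pi) ** 2 + 0.01 * dtheta ** 2 + 0.0001 * u ** 2


def discounted_cost(xs, us, alpha=ALPHA):
    """Truncated discounted cost of a trajectory.

    Assumes xs has shape (2, N + 1) and us has length N.
    """
    return sum(alpha ** i * stage_cost(xs[:, i], us[i]) for i in range(len(us)))
```

The cost is zero only at the inverted position at rest with zero control, for example `stage_cost(np.array([np.pi, 0.0]), 0.0)`.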

## Part 1 - Value iteration with a model
In the first part, we will implement the value iteration algorithm, which requires a model of the robot dynamics, i.e. we need to be able to compute the next state $x_{n+1}$ given $(x_n, u_n)$. To that end, we are given a robot (defined in the module ```pendulum.py```) with a function ```next_state(x,u)``` that returns $x_{n+1}$ given $(x_n, u_n)$.

In [2]:
# we can create a robot
robot = pendulum.Pendulum()

# assume we set theta and dtheta = 0 and u = -5, we can get the next state using
x = np.array([0,0])
u = -5
x_next = robot.next_state(x, u)

We also need to discretize the state space: we discretize $\theta \in [0, 2\pi)$ into 50 states and $\dot{\theta} \in [-6, 6]$ into 50 states. For example:

In [3]:
# we don't want 2pi to be in the set because it's the same as 0
discretized_theta = np.linspace(0, 2*np.pi, 50, endpoint=False)

discretized_thetadot = np.linspace(-6, 6, 50)

# now given an arbitrary continuous state theta
theta_arbitrary = 0.23471

# we can find the index of the closest element in the set of discretized states
index_in_discretized_theta = np.argmin(np.abs(discretized_theta - theta_arbitrary))

# and find the closest discretized state
closest_state = discretized_theta[index_in_discretized_theta]
print(f'the discretized theta closest to {theta_arbitrary} is {closest_state} with index {index_in_discretized_theta}')

the discretized theta closest to 0.23471 is 0.25132741228718347 with index 2
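
The same lookup is needed for both state dimensions, so it can be convenient to wrap it in a helper. The sketch below uses a hypothetical function name, `state_to_index`, and additionally measures the angular distance on the circle. That wrap-around is an assumption on our part: without it, an angle just below $2\pi$ would be matched to the last grid point rather than to index 0, which is closer.

```python
import numpy as np

discretized_theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
discretized_thetadot = np.linspace(-6, 6, 50)


def state_to_index(x):
    """Return (theta index, thetadot index) of the grid point closest to x."""
    # distance on the circle, so that angles near 2*pi are matched to 0
    diff = np.abs(discretized_theta - np.mod(x[0], 2 * np.pi))
    circular_diff = np.minimum(diff, 2 * np.pi - diff)
    th_index = int(np.argmin(circular_diff))
    # velocities outside [-6, 6] fall onto the nearest boundary point
    dth_index = int(np.argmin(np.abs(discretized_thetadot - x[1])))
    return th_index, dth_index
```

For the angle used above, `state_to_index(np.array([0.23471, -6.0]))` gives the same angle index (2) as the direct `np.argmin` lookup.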


1. Write a function that implements the value iteration algorithm ```value_iteration(robot)``` that takes as an input the robot model (to use ```robot.next_state```) and returns two 50x50 arrays, one for the value function and one for the optimal policy.

2. Compute the optimal value function and policy when $u$ has three possible values $\{-5, 0, 5\}$ (plot these functions as 2D images, cf. below). How many iterations of the value iteration algorithm are necessary to converge? (Assume convergence when no value changes by more than $10^{-2}$ after an iteration, and initialize the value function to 0.) Using the simulate / animate functions (cf. below), how many back-and-forth swings of the pendulum are necessary to go from $x = [0,0]$ to the fully inverted position?

3. Answer the same questions when $u \in \{-3, 0 ,3\}$

4. Answer the same questions when $u \in \{-5, 0 ,5\}$ and we use a "sparse" cost function, i.e. a cost function where every state, apart from the states close to the inverted position, has the same cost: $g(x,u) = \left\{ \begin{array}{l} 0 \quad \textrm{if } |\theta - \pi| < 0.2 \\ 1 \quad \textrm{otherwise} \end{array} \right.$

5. What are the advantages and disadvantages of using the value iteration algorithm? Could we use the same algorithm for a more complicated robot? Why?
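
To make the structure of the algorithm concrete, here is one possible sketch of value iteration on the 50x50 grid. It is not a reference solution. `ToyPendulum` is a hypothetical stand-in with assumed Euler dynamics so that the block runs without the `pendulum` module; the real `pendulum.Pendulum` may behave differently. The keyword arguments and the third return value (the iteration count) are also additions for convenience; the task itself asks only for the two arrays.

```python
import numpy as np


class ToyPendulum:
    """Hypothetical stand-in for pendulum.Pendulum (assumed Euler dynamics)."""

    delta_t = 0.1

    def next_state(self, x, u):
        theta, dtheta = x
        accel = u - 9.81 * np.sin(theta)  # theta = 0 is hanging down
        dtheta_next = np.clip(dtheta + self.delta_t * accel, -6.0, 6.0)
        theta_next = np.mod(theta + self.delta_t * dtheta_next, 2 * np.pi)
        return np.array([theta_next, dtheta_next])


def dense_cost(x, u):
    return (x[0] - np.pi) ** 2 + 0.01 * x[1] ** 2 + 0.0001 * u ** 2


def sparse_cost(x, u):
    return 0.0 if abs(x[0] - np.pi) < 0.2 else 1.0


def value_iteration(robot, controls=(-5.0, 0.0, 5.0), cost=dense_cost,
                    alpha=0.99, tol=1e-2, max_iters=10000):
    thetas = np.linspace(0, 2 * np.pi, 50, endpoint=False)
    dthetas = np.linspace(-6, 6, 50)
    shape = (len(thetas), len(dthetas), len(controls))

    # precompute, once, the stage cost and the grid indices of each next state
    stage = np.zeros(shape)
    next_i = np.zeros(shape, dtype=int)
    next_j = np.zeros(shape, dtype=int)
    for i, th in enumerate(thetas):
        for j, dth in enumerate(dthetas):
            for k, u in enumerate(controls):
                x = np.array([th, dth])
                x_next = robot.next_state(x, u)
                stage[i, j, k] = cost(x, u)
                d = np.abs(thetas - np.mod(x_next[0], 2 * np.pi))
                next_i[i, j, k] = np.argmin(np.minimum(d, 2 * np.pi - d))
                next_j[i, j, k] = np.argmin(np.abs(dthetas - x_next[1]))

    value = np.zeros(shape[:2])  # initial guess: zero everywhere
    for iteration in range(1, max_iters + 1):
        # Bellman backup: V(x) = min_u [ g(x, u) + alpha * V(x_next) ]
        new_value = (stage + alpha * value[next_i, next_j]).min(axis=2)
        change = np.max(np.abs(new_value - value))
        value = new_value
        if change < tol:
            break

    # greedy policy with respect to the final value function
    q = stage + alpha * value[next_i, next_j]
    policy = np.asarray(controls)[q.argmin(axis=2)]
    return value, policy, iteration
```

With the real model, a call such as `value_iteration(pendulum.Pendulum())` would replace the stand-in; passing `controls=(-3.0, 0.0, 3.0)` or `cost=sparse_cost` covers the other variants. Precomputing the transition table keeps each sweep vectorized, which only works because the grid is small enough to enumerate.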

## Part 2 - Tabular Q-learning
Now we want to implement the Q-learning algorithm (with a table and discretized states / actions). The difference with value iteration is that we no longer assume access to the next state for an arbitrary $(x,u)$. Instead, we run realistic episodes that always start with the pendulum at the position $[0,0]$.
1. Write a function ```q_learning(robot)``` that implements the tabular Q-learning algorithm (use episodes of 100 timesteps and an epsilon-greedy policy with $\epsilon=0.1$). The function should return the Q-table as a 50x50x3 numpy array (assume that only three controls are possible as in Part 1).

2. How can you compute the optimal policy from the Q function? And the optimal value function? Write a function that does this using as an input the Q-table computed by ```q_learning(robot)```.

3. How many episodes (approximately) does it take for Q-learning to learn how to invert the pendulum when $u \in \{-5,0,5\}$? (use a learning rate of 0.1)

4. How do the estimates of the value function and policy compare to the ones computed with the value iteration algorithm? (plot these functions) Do you see any qualitative differences between the computed policy and the optimal solution computed with value iteration? (e.g. do they achieve the same cost? did Q-learning find the optimal value function? why?)

5. How is learning affected when changing $\epsilon$ and the learning rate?

6. Can you learn how to invert the pendulum with the sparse cost function defined in Part 1.4? Is it easier or harder? Why?
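
The following is one way the tabular algorithm could be sketched. It is not a reference solution. Because we minimize a cost, the greedy action and the update target use a minimum over the Q-values. `ToyPendulum` is again a hypothetical stand-in with assumed Euler dynamics, and the number of episodes, the random seed, and the helper names `to_index` and `policy_and_value_from_q` are illustrative choices that do not come from the task statement.

```python
import numpy as np


class ToyPendulum:
    """Hypothetical stand-in for pendulum.Pendulum (assumed Euler dynamics)."""

    delta_t = 0.1

    def next_state(self, x, u):
        theta, dtheta = x
        accel = u - 9.81 * np.sin(theta)  # theta = 0 is hanging down
        dtheta_next = np.clip(dtheta + self.delta_t * accel, -6.0, 6.0)
        theta_next = np.mod(theta + self.delta_t * dtheta_next, 2 * np.pi)
        return np.array([theta_next, dtheta_next])


THETAS = np.linspace(0, 2 * np.pi, 50, endpoint=False)
DTHETAS = np.linspace(-6, 6, 50)
CONTROLS = np.array([-5.0, 0.0, 5.0])


def to_index(x):
    """Grid indices of the discretized state closest to x."""
    d = np.abs(THETAS - np.mod(x[0], 2 * np.pi))
    i = int(np.argmin(np.minimum(d, 2 * np.pi - d)))
    j = int(np.argmin(np.abs(DTHETAS - x[1])))
    return i, j


def cost(x, u):
    return (x[0] - np.pi) ** 2 + 0.01 * x[1] ** 2 + 0.0001 * u ** 2


def q_learning(robot, n_episodes=1000, n_steps=100, epsilon=0.1,
               learning_rate=0.1, alpha=0.99, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((len(THETAS), len(DTHETAS), len(CONTROLS)))
    for _ in range(n_episodes):
        x = np.array([0.0, 0.0])  # every episode starts at rest, hanging down
        i, j = to_index(x)
        for _ in range(n_steps):
            # epsilon-greedy: explore with probability epsilon,
            # otherwise take the action with the lowest estimated cost
            if rng.random() < epsilon:
                k = int(rng.integers(len(CONTROLS)))
            else:
                k = int(np.argmin(q_table[i, j]))
            u = CONTROLS[k]
            x_next = robot.next_state(x, u)
            i_next, j_next = to_index(x_next)
            # temporal-difference update towards g + alpha * min_u' Q(x', u')
            target = cost(x, u) + alpha * np.min(q_table[i_next, j_next])
            q_table[i, j, k] += learning_rate * (target - q_table[i, j, k])
            x, i, j = x_next, i_next, j_next
    return q_table


def policy_and_value_from_q(q_table):
    """Greedy policy and value estimate: argmin and min over the action axis."""
    return CONTROLS[np.argmin(q_table, axis=2)], np.min(q_table, axis=2)
```

Note that this sketch simulates the continuous state and only discretizes it to index the table. Entries for states that are never visited from $[0,0]$ keep their initial value of zero, which is worth keeping in mind when comparing the plots with those from value iteration.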

In [4]:
# here is some code to plot results, assuming a policy and a value function are given
# this can be used to answer questions in both Part 1 and 2


# we make a robot
robot = pendulum.Pendulum()

value_function = np.zeros([50,50])
policy = np.zeros([50,50])

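# we plot the value function
# (arrays are indexed [angle, velocity], so we transpose to put the angle on the x-axis
# and use origin='lower' so that the velocity increases upwards)
plt.figure(figsize=[6,6])
plt.imshow(value_function.T, extent=[0., 2*np.pi, -6, 6], aspect='auto', origin='lower')
plt.xlabel('Pendulum Angle')
plt.ylabel('Velocity')
plt.title('Value Function')

# we plot the policy (same axis convention as above)
plt.figure(figsize=[6,6])
plt.imshow(policy.T, extent=[0., 2*np.pi, -6, 6], aspect='auto', origin='lower')
plt.xlabel('Pendulum Angle')
plt.ylabel('Velocity')
plt.title('Policy')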

# now we simulate the dynamics for 100 time steps
x0 = np.array([0.,0.])

def controller(x):
    theta = np.linspace(0, 2*np.pi, 50, endpoint=False)
    dtheta = np.linspace(-6, 6, 50)
    
    th_index = np.argmin(np.abs(theta - x[0]))
    dth_index = np.argmin(np.abs(dtheta - x[1]))
    return policy[th_index, dth_index]

x, u = robot.simulate(x0, controller, 10)

# and plot the results
# (the time axis is built from the time step, so it matches the simulated horizon)
time = robot.delta_t * np.arange(x.shape[1])
plt.figure()
plt.subplot(3,1,1)
plt.plot(time,x[0,:])
plt.ylabel('angle')
plt.subplot(3,1,2)
plt.plot(time,x[1,:])
plt.ylabel('velocity')
plt.subplot(3,1,3)
plt.plot(time[:-1],u)
plt.ylabel('control')

# and show an animation
robot.animate_robot(x, robot.delta_t)