# Model Free RL - Q Learning with Continuous State Space

### Import required dependencies

In [None]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import gym
from IPython.display import display, HTML
from tqdm import tqdm
from ModelFreeRL_Support.helper import animate
import pickle

## Pendulum swing up

<div>
<img src="https://www.gymlibrary.dev/_images/pendulum.gif" width="200"/>
</div>

See the [Pendulum documentation](https://www.gymlibrary.dev/environments/classic_control/pendulum/) for more information about the environment.

A quick summary of how to use the environment:

```
env = gym.make('Pendulum-v1')
current_state, _ = env.reset()
next_state, reward, done, truncated, info = env.step(action)
```

where `env.reset()` resets the environment and initializes it with a random state. `env.step` advances the simulation by one time step, applying the action that you passed into it. It returns the state at the next time step, `next_state`, and the immediate `reward` for being in `current_state` and taking action `action`. In the case of the Pendulum environment:
- `done` indicates that the agent reached the goal; for Pendulum it is always False,
- `truncated` is True when the episode ends because it ran too long; the default limit is 200 time steps.

## Discretization

There are many ways to discretize a continuous state space. In this notebook, I will introduce one of them, based on `np.linspace` and `np.digitize`.

Let's say we want to discretize the range (-10, 10) into 9 bins, which takes 10 dividers. We can generate these dividers with `np.linspace`:

In [None]:
bins = np.linspace(-10, 10, 10)
print(bins)

Next, we want a function that receives a number and decides which bin to put it in. That's `np.digitize`:

In [None]:
idx = np.digitize(3.21, bins)
print(idx)

Let's see which boundary this index corresponds to:

In [None]:
bins[idx]

Try this sequence again with the boundaries of the range, observe the behaviour, and predict what might go wrong with it.

In [None]:
bins[np.digitize(-10, bins)]

In [None]:
bins[np.digitize(10, bins)]
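
The edge behaviour can be inspected without indexing into `bins` at all. The sketch below is one possible way to turn the raw `np.digitize` result into an index that is always valid; the helper name `to_bin_index` and the choice to clip are assumptions made for illustration, not the only reasonable convention.

```python
import numpy as np

bins = np.linspace(-10, 10, 10)   # 10 dividers -> 9 intervals

def to_bin_index(value, bins):
    """Hypothetical helper: map a value to an interval index in 0 .. len(bins) - 2."""
    # np.digitize returns 0 for values below bins[0] and len(bins) for
    # values at or above bins[-1], so the raw result can fall outside the array.
    raw = np.digitize(value, bins)
    # Shift to a 0-based interval index and clamp out-of-range values.
    return int(np.clip(raw - 1, 0, len(bins) - 2))

for value in (-10.0, 3.21, 10.0, 12.5):
    print(value, np.digitize(value, bins), to_bin_index(value, bins))
```

Clamping to `len(bins) - 1` instead would also give valid indices for a table that has one slot per divider; which convention to use is a design choice.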

## Q-Learning on Continuous State Space Implementation

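Our Q-table is a 4D array. The first 3 dimensions index the state, which according to the `gym` documentation consists of x = cos(theta), y = sin(theta) and the angular velocity. The fourth dimension indexes the discrete action.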

Hence, given a continuous state, we have to obtain the corresponding index of our bins. 

**TODO**: Let's write a function that receives 4 arguments and returns the three bin indices, one per state component:
- `state`: numpy array of 3 elements
- `x_bins`: numpy array
- `y_bins`: numpy array
- `v_bins`: numpy array

In [None]:
def get_discrete_state(state, x_bins, y_bins, v_bins):
    #TODO 1: Getting the index corresponding to our bins  ========
    pass
    #ENDTODO =====================================================

# Have a quick test
print("We expect to see 3 integers printed out, such as [12, 35, 56] or (12, 35, 56)")
get_discrete_state([-1.0, 0.5, 1.0],
                    np.linspace(-1, 1, 10),
                    np.linspace(-1, 1, 10),
                    np.linspace(-1, 1, 10))

Next, let's define all of our parameters

In [None]:
MIN_E = 0.001
MAX_E = 1.0
E_DECAY = 80.0
NUM_EPISODES = 500
epsilon_by_episode = lambda ep_idx: MIN_E + (MAX_E - MIN_E) * np.exp(-ep_idx / E_DECAY)
fig, ax = plt.subplots(figsize=(4, 4))
ax.plot([epsilon_by_episode(i) for i in range(NUM_EPISODES)])
ax.set_xlabel("Num. episodes")
ax.set_ylabel("Epsilon")

NUM_X_BINS = 10
NUM_Y_BINS = 10
NUM_V_BINS = 10
NUM_A_BINS = 10
ALPHA = 0.1
GAMMA = 0.99
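
To get a feel for how quickly exploration fades under this schedule, the sketch below restates the same exponential decay with the standard library only and prints a few sample values. The function name `epsilon_at` is a hypothetical stand-in for `epsilon_by_episode`; the constants are copied from the cell above.

```python
import math

MIN_E, MAX_E, E_DECAY = 0.001, 1.0, 80.0   # same values as the cell above

def epsilon_at(ep_idx):
    """Exponentially decaying exploration rate (same formula, without NumPy)."""
    return MIN_E + (MAX_E - MIN_E) * math.exp(-ep_idx / E_DECAY)

# Epsilon starts at MAX_E and approaches MIN_E as the episode index grows.
for ep in (0, 80, 240, 499):
    print(f"episode {ep:3d}: epsilon = {epsilon_at(ep):.4f}")
```

After `E_DECAY` episodes the exploratory part has shrunk by a factor of e, so a smaller `E_DECAY` makes the agent commit to its greedy choice sooner.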

### Implementing Q-learning loop

There are 2 TODOs in this cell:
- **TODO 2**: Implement the epsilon-greedy policy. Remember that:
    - our `q_table` is a 4D NumPy array whose first 3 indices are the state (end-effector x, end-effector y, and angular velocity). To get the row of Q-values for a state tuple (x, y, v), index it with `q_table[x, y, v]`.
    - you then apply the action to the environment. The environment accepts a continuous torque in the range (-2.0, 2.0), so if you find the optimal *discrete* action using `argmax`, use `action_map` to map it to a continuous action.
    - to get the epsilon for the current episode, call `epsilon_lambda(episode)`.
- **TODO 3**: Update the Q-table. This should be very similar to the implementation in `ModelFreeRL.ipynb`; the only difference is that we are dealing with a 3D discrete state instead of a 1D discrete state.
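
The indexing pattern behind both TODOs can be tried on a toy table first. The following is a minimal sketch under stated assumptions, not the reference solution: the table sizes, the states, the reward and the helper name `choose_action` are all made up for illustration, and the update is the standard Q-learning rule $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy table: 2 x 2 x 2 discrete states, 3 discrete actions (illustrative sizes only).
q_table = np.zeros((2, 2, 2, 3))
action_map = np.linspace(-2.0, 2.0, 3)   # discrete action index -> continuous torque
alpha, gamma, epsilon = 0.1, 0.99, 0.2

def choose_action(q_table, d_state, epsilon):
    """Epsilon-greedy: random action index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[-1]))
    return int(np.argmax(q_table[d_state]))

d_state, next_d_state = (0, 1, 1), (1, 1, 0)   # made-up discrete states
reward = -1.5                                   # made-up reward

a_idx = choose_action(q_table, d_state, epsilon)
torque = np.array([action_map[a_idx]])          # Pendulum's action space has shape (1,)

# One temporal-difference update of Q(s, a).
td_target = reward + gamma * np.max(q_table[next_d_state])
q_table[d_state + (a_idx,)] += alpha * (td_target - q_table[d_state + (a_idx,)])
print(a_idx, torque, q_table[d_state])
```

Indexing with a tuple such as `q_table[d_state]` selects the row of action values, and appending the action index to the tuple selects a single entry, which is why keeping the discrete state as a tuple is convenient here.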

In [None]:
def q_learning(env, epsilon_lambda,
               n_episodes, gamma, alpha,
               num_x_bins, num_y_bins, num_v_bins, num_a_bins, 
               saveto = None):
    """
    Q-learning algorithm for the Pendulum environment with a 4D Q-table.
    """
    # Discretize the state and action spaces
    x_bins = np.linspace(-1, 1, num_x_bins)
    y_bins = np.linspace(-1, 1, num_y_bins)
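    # max_speed lives on the underlying environment, not on the wrappers added by gym.make
    max_speed = env.unwrapped.max_speed
    v_bins = np.linspace(-max_speed, max_speed, num_v_bins)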
    action_map = np.linspace(-2.0, 2.0, num_a_bins)

    # Initialize Q-table
    q_table = np.zeros((num_x_bins, num_y_bins, num_v_bins, num_a_bins))

    episode_rewards = []
    for episode in tqdm(range(n_episodes)):
        done = False
        truncated = False
        current_state, _ = env.reset()
        Dstate = get_discrete_state(current_state, x_bins, y_bins, v_bins)
        cumulative_rewards = 0
        
        while not (done or truncated):
            #TODO 2a: e-greedy policy ===================================================
            action =
            #ENDTODO ====================================================================

            #TODO 2b: Take action and observe the state transition and reward ===========
            next_state, reward, done, truncated, _ = 
            #ENDTODO ====================================================================
            next_Dstate = get_discrete_state(next_state, x_bins, y_bins, v_bins)

            # Update Q-value
            #TODO 3: Update the Q table =================================================
            
            #ENDTODO ====================================================================

            # Bookkeeping
            Dstate = next_Dstate
            cumulative_rewards += reward
        
        episode_rewards.append(cumulative_rewards)

    if saveto is not None:
        with open(f"{saveto}.pk", "wb") as f:
            pickle.dump({"q_table": q_table, 
                         "x_bins": x_bins, 
                         "y_bins": y_bins, 
                         "v_bins": v_bins, 
                         "action_map": action_map}, f)

    return q_table, episode_rewards, x_bins, y_bins, v_bins, action_map



env = gym.make('Pendulum-v1')

# Train the agent
q_table, e_rewards, x_bins, y_bins, v_bins, action_map = q_learning(env, epsilon_by_episode, NUM_EPISODES, GAMMA, ALPHA,
                                                                    NUM_X_BINS, NUM_Y_BINS, NUM_V_BINS, NUM_A_BINS)


# Close the environment
env.close()

## Let's visualize the policy that you obtained from Q-learning 

The last TODO is very similar to **TODO 2**. Instead of the $\epsilon$-greedy policy, we use the greedy policy, which means we always take the optimal action according to our `q_table`.

In [None]:
from ModelFreeRL_Support.helper import animate

frame_buffer = []
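# Re-create the environment: the training env was closed above, and
# render_mode="rgb_array" makes env.render() return image frames
env = gym.make('Pendulum-v1', render_mode='rgb_array')
# Reset to a (seeded) random state
state, _ = env.reset(seed = 42)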

# Initialization for rendering
frame = env.render()
frame_buffer.append(frame)
# Number of steps you want to simulate/render
num_steps = 200

rewards = 0
for _ in range(num_steps):
    Dstate = get_discrete_state(state, x_bins, y_bins, v_bins)
    #TODO 4: Agent always follows the greedy policy ==========================
    action =
    next_state, reward, done, truncated, _ = 
    #ENDTODO ==================================================================
    state = next_state  # advance, so the next iteration discretizes the new state
    rewards += reward
    
    # Render to IPython display
    frame = env.render()
    frame_buffer.append(frame)
    if done or truncated:
        break

print(f"Cumulative rewards of run: {rewards}")
env.close()
print("Animating ...")
animate(frame_buffer)