# Mountain Car and Q-Learning

Mountain car is a classic control task: you have to get a car to the top of a mountain by pushing it gently left and right. You cannot just push it straight up because the car is too heavy, so it has to rock back and forth to build momentum.

In this lecture we will use Q-learning to learn a policy to solve the mountain car.

# Clones, Installs, Imports

## Clone GitHub Repository
This clones the repository, which includes the code and data files, to your machine and then changes into the repository directory.

In [1]:
!git clone https://github.com/zlisto/reinforcement_learning_tutorial

import os
os.chdir("reinforcement_learning_tutorial")

Cloning into 'reinforcement_learning_tutorial'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 36 (delta 10), reused 33 (delta 7), pack-reused 0
Unpacking objects: 100% (36/36), done.


## Install Packages

In [2]:
!pip install gym pyvirtualdisplay 
!apt-get install -y xvfb python-opengl ffmpeg 


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-3.0-py3-none-any.whl (15 kB)
Installing collected packages: pyvirtualdisplay
Successfully installed pyvirtualdisplay-3.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
Suggested packages:
  libgle3
The following NEW packages will be installed:
  python-opengl xvfb
0 upgraded, 2 newly installed, 0 to remove and 49 not upgraded.
Need to get 1,281 kB of archives.
After this operation, 7,687 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-opengl all 3.1.0+dfsg-1 [496 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/universe a

## Import Libraries



In [3]:
from scripts.rl_helper import *

import gym
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay



import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7f252fb7f310>

# Mountain Car Environment
The name of the simulator environment is `"MountainCar-v0"`, which we save to a variable `env_name`.  We can load the simulator environment with the `gym.make` function.  We use the `wrap_env` function so we can visualize the output of the simulator in Colab.  

Note: You can load many different gym simulator environments with this line of code.  Just change `env_name`.  You can find a list of the gym environments here: https://www.gymlibrary.ml/

## Load Environment

In [4]:
env_name = 'MountainCar-v0'



## Run Environment


In [5]:
env = wrap_env(gym.make(env_name))

observation = env.reset()
score = 0
step = 0
while True:
    env.render()
    step+=1
    # your agent goes here
    action = env.action_space.sample()
    
    observation, reward, done, _ = env.step(action)
    score+=reward
    if done:
        break
print(f"{env_name} Score = {score}, Steps = {step}")
env.close()
show_video()


  



MountainCar-v0 Score = -200.0, Steps = 200


# Q-Learning

We will train an agent to solve the mountain car using Q-learning.  We will make a Q-table, which is a tensor (a multidimensional array) indexed by the states and actions.  The value in each cell of the table is the total discounted reward that will be achieved if one takes the given action in the given state and then plays the game perfectly.  Q-learning lets us update the values in this table as we play to learn the right policy.  

When we are done, we will have the tensor `Q` which is the Q-table.  To use this table, in each step of the simulator, we check what state we are in.  Then, we look up this row in the table, and choose the action in this row with the highest value.

## Discretize State

To use Q-learning, we need to discretize the state.  That means we assign an integer to each value of the state.  The mountain car state is the x-coordinate $x$ and the velocity $v$.  These are in the ranges 

$-1.2\leq x \leq 0.6$

$-0.07\leq v \leq 0.07$  

In the code below, let's write a function to turn the continuous state $(x,v)$ into a discrete state that is a pair of integers.

In [6]:
def discrete_state(state):
  x = state[0]
  v = state[1]
  xdiscrete = 
  
  vdiscrete = 
  state_discrete = (xdiscrete, vdiscrete)
  return state_discrete
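
One possible way to fill in the blanks above, offered as a sketch rather than the intended solution: shift each coordinate by its lower bound, scale it, and round to the nearest integer. The scale factors 10 and 100 and the constant names `X_MIN`, `V_MIN`, `X_SCALE`, `V_SCALE` are assumptions, chosen because they map the stated ranges onto 19 position bins and 15 velocity bins.

```python
# Lower bounds of the state ranges given above.
X_MIN, V_MIN = -1.2, -0.07
# Assumed scale factors (not taken from the original notebook).
X_SCALE, V_SCALE = 10, 100

def discrete_state(state):
    """Map a continuous (x, v) state to a pair of non-negative integers."""
    x = float(state[0])
    v = float(state[1])
    # Shift so the minimum maps to 0, scale, then round to the nearest bin.
    xdiscrete = int(round((x - X_MIN) * X_SCALE))
    vdiscrete = int(round((v - V_MIN) * V_SCALE))
    return (xdiscrete, vdiscrete)

print(discrete_state((-1.2, -0.07)))  # smallest state
print(discrete_state((0.6, 0.07)))    # largest state
```

Larger scale factors would give a finer grid and a bigger Q-table, which usually takes more episodes to fill in.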


## Number of States

We need to get the number of discrete states for each dimension to make the Q-table.  You can do this by taking the maximum state value and discretizing it (and then adding 1 since Python starts counting at 0).

In [7]:
num_states = 
print(f"number of states = {num_states}")


number of states = [19 15]
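
As a sanity check on the printed result, the bin count per dimension can also be computed directly from the state ranges. This sketch assumes hypothetical scale factors of 10 for position and 100 for velocity; in the notebook itself, `env.observation_space.high` would supply the maximum state.

```python
# (low, high, assumed scale) for position x and velocity v.
ranges = [(-1.2, 0.6, 10), (-0.07, 0.07, 100)]

# Largest discrete index is round((high - low) * scale); add 1 because
# indices start at 0.
num_states = [int(round((high - low) * scale)) + 1
              for low, high, scale in ranges]
print(f"number of states = {num_states}")
```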


## Initialize Q-table
We initialize the Q-table as an array with dimensions `(num_states[0], num_states[1], num_actions)`.  We also initialize `reward_list`, a list that will hold the agent's total reward for each episode.  


In [8]:
# Initialize Q table
num_actions = env.action_space.n
Q = np.random.uniform(low = -1, high = 1, 
                    size = ())
print(f"Q table shape = {Q.shape}")

# Initialize rewards list
reward_list = []


Q table shape = (19, 15, 3)
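
A sketch of the completed initialization, assuming `num_states = (19, 15)` and three actions as in the printed shape. These values are hard-coded here so the snippet runs without the simulator; in the notebook they come from `num_states` and `env.action_space.n`.

```python
import numpy as np

num_states = (19, 15)   # assumed from the output above
num_actions = 3         # assumed from the printed Q-table shape

# One entry per (position bin, velocity bin, action), drawn uniformly
# from [-1, 1).
Q = np.random.uniform(low=-1, high=1,
                      size=(num_states[0], num_states[1], num_actions))
print(f"Q table shape = {Q.shape}")
```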


## Initialize Parameters

We choose a few parameters for Q-learning.

1. `learning` = how much we change the Q-table by in each iteration

2. `discount` = how much we discount future rewards

3. `epsilon` = the probability with which we ignore the Q-table and take a random action (this makes the algorithm not get stuck in loops)

4. `min_epsilon` = lower bound for epsilon

5. `episodes` = the number of episodes of the simulator we play to learn

6. `reduction` = how much we decrease `epsilon` by after each episode.  As we learn more, we want to do random actions less.

In [9]:
learning = 0.2
discount = 0.9
epsilon = 0.8
min_epsilon = 0
episodes = 10000


# Calculate episodic reduction in epsilon after each episode
reduction = epsilon/episodes
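
To see what this linear decay does, the sketch below replays only the epsilon bookkeeping from the training loop, without the simulator. The helper name `epsilon_schedule` is hypothetical.

```python
def epsilon_schedule(epsilon, min_epsilon, episodes):
    """Return the epsilon used in each episode under linear decay."""
    reduction = epsilon / episodes
    used = []
    for _ in range(episodes):
        used.append(epsilon)
        # Same rule as the training loop: decay until the floor is reached.
        if epsilon > min_epsilon:
            epsilon -= reduction
    return used

schedule = epsilon_schedule(epsilon=0.8, min_epsilon=0, episodes=10000)
print(f"first = {schedule[0]}, middle = {schedule[5000]:.3f}, "
      f"last = {schedule[-1]:.5f}")
```

With these settings the agent takes random actions about 80% of the time at the start and almost never by the final episode.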

## Run Q-learning Algorithm

You can run this loop again to keep training the agent. It will pick up where it left off.  The Q-learning step is calculated as follows.  First, based on the action `action` you took from `state` into `state2`, the one-step look-ahead value in the Q-table is

$Q_{new}(\text{state},\text{action}) = \text{reward} +\text{discount}\times \max_{a}Q(\text{state2},a) $

The old value in the Q-table is simply $Q(\text{state},\text{action})$.  We change this value by the difference between $Q_{new}$ and the old value, weighted by the learning rate `learning`.

$\text{delta} = \text{learning}\times(Q_{new}(\text{state},\text{action})-Q(\text{state},\text{action}))$

$Q(\text{state},\text{action}) = Q(\text{state},\text{action})+\text{delta}$.

Also, keep track of the best reward and best Q-table you have so far.
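
The update equations above can be checked on a single made-up transition. This is a minimal sketch on scalar values; the function name `q_update` and the numbers passed to it are hypothetical, and in the notebook the same arithmetic is applied to entries of `Q`.

```python
def q_update(q_old, reward, max_q_next, learning, discount):
    """Return the updated Q(state, action) after one Q-learning step."""
    q_new = reward + discount * max_q_next   # one-step look-ahead value
    delta = learning * (q_new - q_old)       # move part of the way toward it
    return q_old + delta

# Hypothetical numbers: old value 0.5, reward -1, best next-state value 0.2.
updated = q_update(q_old=0.5, reward=-1, max_q_next=0.2,
                   learning=0.2, discount=0.9)
print(round(updated, 3))
```

Here the look-ahead value is $-1 + 0.9\times 0.2 = -0.82$, so the entry moves 20% of the way from $0.5$ toward $-0.82$.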

In [None]:
print(f"Running Q-learning on {env_name} for {episodes} episodes")
best_reward = -np.inf
Qbest = Q.copy()  # copy so later updates to Q do not change the saved table
for i in range(episodes):
    # Initialize parameters
    done = False
    tot_reward = 0  #total rewards for the episode
    state = env.reset()

    # Discretize state
    state_adj = 
    while not done:
        
        # Determine next action - epsilon greedy strategy
        if np.random.random() < 1 - epsilon:
            action =  
        else:
            action = 

        # Get next state and reward
        state2, reward, done, info = env.step(action) 

        # Discretize state2
        state2_adj = 
        
        #Allow for terminal states
        if done and state2[0] >= 0.5:
            Q[state_adj[0], state_adj[1], action] = reward

        # Q-Learning: adjust Q value for current state
        else:
            Qnew = 
            Qold = 
            delta = 
            Q[state_adj[0], state_adj[1],action] += delta

        # Update variables
        tot_reward += reward
        state_adj = state2_adj

    # Decay epsilon
    if epsilon > min_epsilon:
        epsilon -= reduction

    # Track rewards (compute the average before printing it)
    reward_list.append(tot_reward)
    avg_reward = np.mean(reward_list[-100:])
    if (i+1) % 1000 == 0:
        print(f'Episode {i+1}: 100 episode average reward = {avg_reward}')

    # Save Q-table if new high 100 episode average
    if avg_reward > best_reward:
        Qbest = 
        best_reward = 
        print(f"New best reward: episode {i}: {best_reward}")
    
env.close()
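
The epsilon-greedy choice inside the loop can be sketched on its own. This version uses a plain list for one row of the Q-table and Python's `random` module in place of `np.random` and `env.action_space.sample()`; the function name `choose_action` and the example values are hypothetical.

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon, else a random one."""
    if rng.random() < 1 - epsilon:
        # Exploit: index of the largest Q-value in this state's row.
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: uniform random action.
    return rng.randrange(len(q_row))

q_row = [-0.3, 0.7, 0.1]   # hypothetical Q-values for three actions
print(choose_action(q_row, epsilon=0.0))   # always greedy when epsilon is 0
```

When saving the best table in the loop, storing a copy (for example `Q.copy()`) rather than a reference keeps later updates to `Q` from changing it.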

## Plot Rewards vs. Episode

We make a column of the moving average of the rewards using the `rolling` and `mean` functions on the `reward` column of the dataframe.

In [21]:
# Plot Rewards
mavg = 100 #number of episodes to avg rewards over
df = pd.DataFrame({'episode':list(range(len(reward_list))), 
                   'reward':reward_list})
df[f'reward_{mavg}mavg'] = 
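
A sketch of the moving-average column on a small made-up reward list, using the pandas `rolling` and `mean` methods named above. The window of 3 and the reward values are for illustration only; the notebook uses `mavg = 100` on the real `reward_list`.

```python
import pandas as pd

mavg = 3  # small window for illustration; the notebook uses 100
reward_list = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up rewards
df = pd.DataFrame({'episode': list(range(len(reward_list))),
                   'reward': reward_list})
# The first mavg - 1 rows are NaN because the window is not yet full.
df[f'reward_{mavg}mavg'] = df['reward'].rolling(mavg).mean()
print(df)
```

The leading NaN rows are why the plotting cell skips the start of the dataframe with `df[mavg:]`.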


In [None]:
fig = plt.figure(figsize = (8,6))
sns.lineplot(data = df[mavg:], x='episode', 
             y = f'reward_{mavg}mavg',
            color = 'orange')
plt.xlabel('Episodes')
plt.ylabel('Average Reward')
plt.title(f'Q-learning on {env_name}')
plt.grid()
plt.show()

# Test Trained Agent

Make sure you use your best Q-table `Qbest` to select `action`.

In [None]:
env = wrap_env(gym.make(env_name))

state = env.reset()
score = 0
step = 0
while True:
    env.render()
    step+=1
    # your agent goes here
    state_adj = discrete_state(state)
    action = 
    state, reward, done, _ = env.step(action)
    score+=reward
    if done:
        break
print(f"{env_name} Score = {score}, Steps = {step}")
env.close()
show_video() 
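
The greedy lookup used at test time can be sketched on a tiny made-up table. The table contents and the `state_adj` value here are hypothetical; in the notebook, `Qbest` is the trained table and `state_adj` comes from `discrete_state(state)`.

```python
import numpy as np

# Hypothetical tiny table: 2 position bins x 2 velocity bins x 3 actions.
Qbest = np.zeros((2, 2, 3))
Qbest[1, 0] = [-0.5, 0.2, 0.9]

state_adj = (1, 0)  # a discretized state
# Greedy policy: the action with the highest Q-value in this state.
action = np.argmax(Qbest[state_adj[0], state_adj[1]])
print(action)
```

No random exploration is used here, since the goal is to evaluate the learned policy rather than keep learning.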



