<a href="https://colab.research.google.com/github/shashwatrathod/MountainCar-v0/blob/master/MountainCarV0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MountainCar-v0 OpenAI**
## A Q-learning approach

The problem: https://gym.openai.com/envs/MountainCar-v0/ <br>
References: https://pythonprogramming.net/

In [0]:
import gym
from gym import logger as gymlogger
import numpy as np
import time
import pickle

All hyperparameters are defined here:

In [0]:
EPISODES = 1_000
LEARNING_RATE = 0.01
DISCOUNT = 0.95
epsilon = 0.5
decay = epsilon / ((EPISODES//2) - 1)
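
To get a feel for the exploration schedule: assuming `decay` is subtracted once per episode, epsilon falls linearly from 0.5 to roughly 0 over the first half of training and stays at 0 afterwards. The helper `epsilon_schedule` below is hypothetical (it is not part of the notebook) and is only a minimal sketch of that arithmetic:

```python
# Stand-alone copy of the notebook's hyperparameters
EPISODES = 1_000
epsilon = 0.5
decay = epsilon / ((EPISODES // 2) - 1)

def epsilon_schedule(start, step, episodes):
    """Return the exploration rate used in each episode under linear decay.

    Assumption: the rate is lowered once per episode and floored at 0.
    """
    values = []
    eps = start
    for _ in range(episodes):
        values.append(eps)
        eps = max(0.0, eps - step)  # never let the rate go negative
    return values

schedule = epsilon_schedule(epsilon, decay, EPISODES)
# First episode, last episode of the first half, and final episode
print(schedule[0], schedule[EPISODES // 2 - 1], schedule[-1])
```

With these numbers the agent explores half of the time at the start and acts purely greedily in the second half of training.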

In [10]:
env = gym.make("MountainCar-v0")
observation = env.reset()
print(observation)
print(f"Size of the action space : {env.action_space.n}")
print(f"Range of observations : {env.observation_space.high} : {env.observation_space.low}")

[-0.53225033  0.        ]
Size of the action space : 3
Range of observations : [0.6  0.07] : [-1.2  -0.07]




Let's write a function that plays just one episode.

(Note: You can add env.render() to see what's going on under the hood. It will make the whole process a little slower, though.)

In [11]:
def play_once():
  done = False
  env.reset()
  while not done:
    action = 2  # always push right (0 = push left, 1 = no push, 2 = push right)
    new_state, reward, done, _ = env.step(action)
    print(reward, new_state)

play_once()

-1.0 [-0.46890143  0.00059602]
-1.0 [-0.4677138   0.00118763]
-1.0 [-0.46594335  0.00177045]
-1.0 [-0.46360316  0.00234019]
-1.0 [-0.46071051  0.00289265]
-1.0 [-0.45728674  0.00342378]
-1.0 [-0.45335703  0.00392971]
-1.0 [-0.44895024  0.00440679]
-1.0 [-0.44409864  0.00485159]
-1.0 [-0.43883767  0.00526098]
-1.0 [-0.43320557  0.0056321 ]
-1.0 [-0.42724314  0.00596243]
-1.0 [-0.42099336  0.00624978]
-1.0 [-0.41450102  0.00649234]
-1.0 [-0.40781238  0.00668863]
-1.0 [-0.40097479  0.00683759]
-1.0 [-0.39403627  0.00693852]
-1.0 [-0.38704519  0.00699108]
-1.0 [-0.38004985  0.00699534]
-1.0 [-0.37309816  0.00695169]
-1.0 [-0.36623726  0.0068609 ]
-1.0 [-0.35951322  0.00672404]
-1.0 [-0.35297072  0.0065425 ]
-1.0 [-0.34665278  0.00631795]
-1.0 [-0.34060047  0.00605231]
-1.0 [-0.33485274  0.00574773]
-1.0 [-0.32944616  0.00540658]
-1.0 [-0.32441478  0.00503138]
-1.0 [-0.31978995  0.00462483]
-1.0 [-0.31560021  0.00418974]
-1.0 [-0.31187117  0.00372904]
-1.0 [-0.30862545  0.00324572]
-1.0 [-0

In this problem, the car starts at a position of around -0.5. The goal is to reach 0.5, i.e. the yellow flag. The engine is too weak to climb the hill directly, so the car has to gain momentum by going back and forth. A reward of -1 is given for every action taken; once the car reaches the flag, the episode ends and no further reward is given. A maximum of 200 steps is allowed per episode.<br>
To solve this problem with Q-learning, let's first build a q-table. The catch is that the states returned after each action are continuous, not discrete, so we first need to quantize them.

In [12]:
QUANTIZATION_LEVELS = [30] * observation.shape[0]
QUANTIZATION_STEPS = (env.observation_space.high - env.observation_space.low)/ QUANTIZATION_LEVELS
print(f"Quantization levels: {QUANTIZATION_LEVELS}")
print(f"Quantization Steps: {QUANTIZATION_STEPS}")

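def quantize_state(state):
  quantized_state = (state - env.observation_space.low) / QUANTIZATION_STEPS
  # Clip so that an observation exactly at the upper bound still maps to a valid bin
  quantized_state = np.clip(quantized_state.astype(int), 0, np.array(QUANTIZATION_LEVELS) - 1)
  return tuple(quantized_state)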

Quantization levels: [30, 30]
Quantization Steps: [0.06       0.00466667]
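
As a worked example of the quantization, here is a minimal stand-alone sketch that does not need gym. The bounds are copied from the observation range printed earlier; the name `to_bins`, the sample velocity of 0.02, and the clipping of boundary values into the last bin are assumptions made for illustration:

```python
import numpy as np

# Bounds copied from the printed observation range
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
levels = np.array([30, 30])
bin_width = (high - low) / levels

def to_bins(state):
    """Map a continuous (position, velocity) pair to integer bin indices."""
    idx = ((np.asarray(state) - low) / bin_width).astype(int)
    # Assumption: values on the upper bound are folded into the last bin
    return tuple(int(i) for i in np.clip(idx, 0, levels - 1))

print(to_bins([-0.53225033, 0.02]))  # → (11, 19)
print(to_bins([0.6, 0.07]))          # → (29, 29)
```

A position of about -0.53 lies 0.668 above the lower bound of -1.2, which is a little more than 11 bins of width 0.06, so it lands in bin 11.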


Now, let's initialize our q-table.

In [13]:
q_table = np.random.uniform(low=-0.2, high=0, size = (QUANTIZATION_LEVELS + [env.action_space.n]))
print(f"The shape of our q-table is {q_table.shape}")

The shape of our q-table is (30, 30, 3)
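
The table holds one value per (position bin, velocity bin, action). The sketch below shows how a quantized state tuple indexes into such a table; the names `table` and `state`, the seed, and the bin values are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Same shape and value range as the notebook's q-table
table = rng.uniform(low=-0.2, high=0, size=(30, 30, 3))

state = (11, 15)                  # hypothetical quantized (position, velocity)
action_values = table[state]      # shape (3,): one value per action
best_action = int(np.argmax(action_values))

# Appending the action to the state tuple selects a single scalar entry
single_value = table[state + (best_action,)]
print(action_values.shape, best_action, single_value)
```

Indexing with the two-element tuple returns the three action values for that state, while the three-element tuple returns the single entry that the learning step reads and overwrites.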


Let's start the learning!<br>
The update rule for a q-table entry is:<br>

![Q updation rule](https://wikimedia.org/api/rest_v1/media/math/render/svg/678cb558a9d59c33ef4810c9618baf34a9577686)
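
In case the image does not load, the standard Q-learning update can be written out as follows. This is the commonly stated form; the mapping of symbols to the notebook's variables is an assumption based on their names:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)
$$

Here $\alpha$ corresponds to `LEARNING_RATE`, $\gamma$ to `DISCOUNT`, $s_t$ to the quantized current state, $a_t$ to the chosen action, $r_t$ to the reward, and $s_{t+1}$ to the quantized new state.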

In [14]:
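all_steps = []         # steps taken in each successful episode
success_episodes = []  # episode index of each successful episode
for i in range(EPISODES):
  current_state = quantize_state(env.reset())
  done = False
  steps = 0
  prev_time = time.time()
  while not done:
    # Epsilon-greedy: exploit the q-table, or explore with probability epsilon
    if np.random.random() > epsilon:
      action = np.argmax(q_table[current_state])
    else:
      action = env.action_space.sample()
    new_state, reward, done, info = env.step(action)
    new_state_q = quantize_state(new_state)

    if not done:
      maxQ = np.max(q_table[new_state_q])
      # Q-value of the state-action pair we just left (not of the new state)
      current_q = q_table[current_state + (action,)]
      q_table[current_state + (action,)] = current_q + LEARNING_RATE*(reward + DISCOUNT*maxQ - current_q)

    elif new_state[0] >= env.goal_position:
      # Terminal state: no future reward is left to collect
      q_table[current_state + (action,)] = 0
      print(f"Achieved succes at game {i} in {steps} steps and time {time.time() - prev_time}")
      all_steps.append(steps)
      success_episodes.append(i)

    current_state = new_state_q
    steps = steps + 1

  # Decay epsilon once per episode so exploration ends about halfway through training
  epsilon = max(0.0, epsilon - decay)

with open("q_table", "wb") as f:
  pickle.dump(q_table, f)

# np.argmin/np.argmax index into the list of successes, so map them back to episode numbers
if all_steps:
  best, worst = np.argmin(all_steps), np.argmax(all_steps)
  print(f"Least number of steps taken: {all_steps[best]} at episode {success_episodes[best]}")
  print(f"Maximum number of steps taken: {all_steps[worst]} at episode {success_episodes[worst]}")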

Achieved succes at game 72 in 173 steps and time 0.01383829116821289
Achieved succes at game 86 in 182 steps and time 0.02497076988220215
Achieved succes at game 89 in 194 steps and time 0.016370534896850586
Achieved succes at game 90 in 172 steps and time 0.015056371688842773
Achieved succes at game 94 in 170 steps and time 0.013183832168579102
Achieved succes at game 95 in 167 steps and time 0.012781858444213867
Achieved succes at game 96 in 176 steps and time 0.01427006721496582
Achieved succes at game 98 in 175 steps and time 0.012859821319580078
Achieved succes at game 100 in 163 steps and time 0.01680159568786621
Achieved succes at game 101 in 184 steps and time 0.017403602600097656
Achieved succes at game 103 in 182 steps and time 0.013735294342041016
Achieved succes at game 104 in 175 steps and time 0.01272892951965332
Achieved succes at game 105 in 170 steps and time 0.013014078140258789
Achieved succes at game 106 in 150 steps and time 0.011301755905151367
Achieved succes at 