### Deep Q Learning 2 - Mountain Car Problem

This notebook is a continuation of the [previous one](https://github.com/sezan92/DQL/blob/master/Deep%20Q%20Learning.ipynb). Please read that notebook before starting this one. In this notebook I try to solve the Mountain Car problem from ```OpenAI``` Gym. The video is available [here](https://www.youtube.com/watch?v=MbArDXXYcjM).

##### Rule

The rule of the game is simple. The car in the valley must reach the flag at the top right in as few steps as possible. Each step earns a reward of -1. The game is considered won when the total reward is -110 or better. The game is considered a failure when the car takes 200 steps without reaching the flag.

![mountain car](Final.png)

Almost all of the code is the same as before, so I will only explain the parts that are slightly different, and why.

In [1]:
import gym
import numpy as np
from collections import deque
import random
from keras import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import matplotlib.pyplot as plt
env = gym.make('MountainCar-v0')

Using TensorFlow backend.
[2018-03-06 23:43:59,858] Making new env: MountainCar-v0


In [2]:
legal_actions=env.action_space.n
actions = list(range(legal_actions))  # MountainCar-v0 has 3 actions: 0, 1, 2
gamma =0.95
lr =0.5
num_episodes =1000
epsilon =1
epsilon_decay =0.995
memory_size =1000
batch_size=100
show=False
action_size=env.action_space.n  # Discrete spaces expose their size as .n, not via .shape
state_size=env.observation_space.shape[0]

Change the ```show``` variable to ```True``` to visualize the training process.
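As a side note on the exploration settings in the cell above: the training loop later in this notebook multiplies ```epsilon``` by ```epsilon_decay``` once per episode, so the schedule can be worked out ahead of time. Here is a small sketch of that arithmetic. The helper name ```epsilon_after``` is made up for illustration and is not part of the notebook.

```python
import math

epsilon = 1
epsilon_decay = 0.995
num_episodes = 1000

def epsilon_after(n_episodes, start=epsilon, decay=epsilon_decay):
    """Exploration rate after n_episodes multiplicative decays."""
    return start * decay ** n_episodes

final_epsilon = epsilon_after(num_episodes)
# First episode count at which epsilon has dropped below 0.1
episodes_to_10_percent = math.ceil(math.log(0.1) / math.log(epsilon_decay))
print(final_epsilon, episodes_to_10_percent)
```

With these settings the agent is still exploring heavily for the first few hundred episodes, and by the last episode it picks a random action less than 1% of the time.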

Now look at this part of the OpenAI Gym wiki [page](https://github.com/openai/gym/wiki/MountainCar-v0):

![MountainObs](MountainCarObs.png)

The minimum velocity is -0.07 and the maximum is 0.07, while the minimum position is -1.2 and the maximum is 0.6. Here is the problem: the two features are not on the same scale. The position values are fine, since their magnitude is on the order of 1. But the velocity values are more than ten times smaller, and machine learning algorithms usually don't work well when input features differ this much in scale. So what's the solution? Simple: multiply each component of the state by a factor before working with it. I have defined a list named ```factor``` which rescales the data accordingly.

In [3]:
factor=[1,100]

You can try different values for the factor
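To see what the factor does to the observation range, here is a minimal sketch using only NumPy. The bounds are the ones quoted above. The names ```low```, ```high``` and ```scale_state``` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Observation bounds quoted above, ordered as [position, velocity]
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
factor = [1, 100]

def scale_state(s, factor):
    """Reshape a raw observation to (1, n) and rescale each component."""
    s = np.asarray(s, dtype=float).reshape((1, -1))
    return s * factor  # the list broadcasts across the last axis

scaled_low = scale_state(low, factor)
scaled_high = scale_state(high, factor)
print(scaled_low, scaled_high)
```

With ```factor=[1,100]``` the velocity ends up spanning about -7 to 7, so it becomes the larger of the two features. A smaller value such as 10 would bring both to a similar magnitude, which is one reason to experiment with it.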

The rest of the code is almost the same, but there is a slight difference.

In [None]:
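# `model` and `memory` are assumed to be set up as in the previous notebook.
# Start new logs if none exist yet (undefined or None).
if globals().get('ep_list') is None or globals().get('reward_list') is None:
    ep_list =[]
    reward_list =[]
index=0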
for ep in range(num_episodes):
    s= env.reset()
    s=s.reshape((1,-1))
    s = s*factor
    rAll =0
    d = False
    j = 0
    for j in range(200):
        #time.sleep(0.01)
        #epsilon greedy. to choose random actions initially when Q is all zeros
        if np.random.random()< epsilon:
            a = np.random.randint(0,legal_actions)
            #epsilon = epsilon*epsilon_decay
        else:
            Q = model.predict(s.reshape(-1,s.shape[0],s.shape[1]))
            a =np.argmax(Q)
        new_s,r,d,_ = env.step(a)
        new_s = new_s.reshape((1,-1))
        new_s = new_s*factor
        rAll=rAll+r
        if show:
            env.render()
        if d:
            if rAll<-199:
                r =-100
                experience = (s,r,a,new_s)
                memory.append(experience)
                print("Episode %d, Failed! Reward %d"%(ep,rAll))
                #break
            elif rAll<-110:  # also covers rAll == -199, which the old bounds skipped
                r=-10
                experience = (s,r,a,new_s)
                memory.append(experience)
                print("Episode %d, Better! Reward %d"%(ep,rAll))
            elif rAll>=-110:
                r=100
                experience = (s,r,a,new_s)
                memory.append(experience)

                print("Episode %d, Passed! Reward %d"%(ep,rAll))
            ep_list.append(ep)
            reward_list.append(rAll)
            break
        
        experience = (s,r,a,new_s)
        memory.append(experience)
        if j==199:
            print("Reward %d after full episode"%(rAll))
            
        s = new_s
    batches=random.sample(memory,batch_size)
    #batches= list(memory)[index:index+batch_size]
    states= np.array([batch[0] for batch in batches])
    rewards= np.array([batch[1] for batch in batches])
    actions= np.array([batch[2] for batch in batches])
    new_states= np.array([batch[3] for batch in batches])
    Qs =model.predict(states)
    new_Qs = model.predict(new_states)
    for i in range(len(rewards)):
        if rewards[i]==-100 or rewards[i]==-10:
            Qs[i][0][actions[i]]=Qs[i][0][actions[i]]+ lr*(rewards[i]-Qs[i][0][actions[i]])
        else:
            Qs[i][0][actions[i]]= Qs[i][0][actions[i]]+ lr*(rewards[i]+gamma*np.max(new_Qs[i])-Qs[i][0][actions[i]])
    model.fit(states,Qs,verbose=0)
    epsilon=epsilon*epsilon_decay
    index=index+batch_size
    if index>=len(memory):
        index=0
env.close()       


The slight difference is the punishment given when the episode ends before the agent reaches the flag. Please figure it out yourself :)
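For readers who want to check their answer, here is a sketch of the two pieces of the training cell that matter, pulled out as standalone functions. The function names ```shaped_terminal_reward``` and ```q_update``` are hypothetical and not part of the notebook. The thresholds and the update rule follow the cell above, and this sketch treats every total between -199 and -111 as the middle case.

```python
def shaped_terminal_reward(total_reward):
    """Reward stored for the last step of an episode, based on its total reward."""
    if total_reward < -199:      # 200 steps without reaching the flag
        return -100
    elif total_reward < -110:    # reached the flag, but slowly
        return -10
    else:                        # reached the flag with -110 or better
        return 100

def q_update(q_sa, reward, max_next_q, lr=0.5, gamma=0.95):
    """Move Q(s, a) toward its target, as in the training loop."""
    if reward in (-100, -10):
        # Punished end of episode: no bootstrapping from the next state
        target = reward
    else:
        target = reward + gamma * max_next_q
    return q_sa + lr * (target - q_sa)

print(shaped_terminal_reward(-200), shaped_terminal_reward(-150), shaped_terminal_reward(-105))
print(q_update(0.0, -100, 5.0), q_update(0.0, -1, 2.0))
```

Note that a final step stored with the +100 reward still bootstraps from the next state under this rule, which matches what the notebook's loop does.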

### Performance Curve

Plot for Deep Q Learning

![MountainDQL](Reward_vs_Episode_DL_lr_0.500000_eps_1000.jpg)

Plot for Vanilla Q learning

![MountainCarVanillaQ](Reward_vs_Episode_QL_lr_0.500000_eps_1000.jpg)

We see that the deep-learning-based Q network earned higher rewards and was more stable than vanilla Q learning.