# Understanding the OpenAI Gym Interface

# Gym
- Gym is a toolkit for developing and comparing reinforcement learning algorithms.
- It supports teaching agents everything from walking to playing games like Pong or Pinball.

The environment is provided by the Gym API.

# Some Common Terms
-  Agent
-  Environment
-  Actions, rewards, observations

In [None]:
! pip install gym

In [5]:
import gym

In [6]:
# Create an environment.
# The environment is the central class in Gym.
env=gym.make('CartPole-v0') # the argument is the name (ID) of the game
# Gym provides hundreds of environments.
# You can create an environment for a game and train an agent that is able to play that game.
# It is also possible to write generic agents that work across multiple games.

 Once you create this environment object, it comes with certain methods and attributes:
 - action_space
 - observation_space
 - reset()
 - step()
 - render()

- The action space consists of all possible actions that you can perform in a game environment.
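
To make the idea of a discrete action space concrete, here is a minimal sketch in plain Python. `ToyDiscrete` is a hypothetical stand-in written for illustration only; it is not Gym's own class, and it only imitates the `n` attribute and the `sample()` / `contains()` methods that Gym's discrete space offers.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a discrete action space with n actions."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        # Pick one of the actions 0, 1, ..., n-1 uniformly at random.
        return self._rng.randrange(self.n)

    def contains(self, action):
        # An action is valid if it is an integer in the range [0, n).
        return isinstance(action, int) and 0 <= action < self.n


action_space = ToyDiscrete(2, seed=0)  # two actions, e.g. 0 = left, 1 = right
samples = [action_space.sample() for _ in range(10)]
print(action_space.n, samples)
```

Every sampled value is a valid action, which is why a random agent can simply pass the result of `sample()` to the environment.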

In [None]:
# reset() puts the game back into its initial state and returns that state.
env.reset()
# The returned array has 4 elements: the 4 variables that describe the environment.

# The four variables are:
-   position of the cart
-   velocity of the cart
-   angle of the rod
-   angular velocity of the rod

In [None]:
import warnings
warnings.filterwarnings('ignore')
env.reset()
for t in range(1000):
    random_action=env.action_space.sample()
    env.step(random_action) # randomly move left or right
    env.render()
env.close()

- In this game we move the cart left and right and try to balance the rod on top of it. The surface is frictionless, and the rod is attached to the cart by a small pivot, so it tips left or right according to the movement of the cart.
- We want to keep the rod balanced for as long as possible. You get a reward of +1 for every time step during which the rod stays balanced.
- If at any moment the rod tilts past the threshold angle (15 degrees in the task description; the CartPole source code checks about 12 degrees), the rod counts as fallen and the game is over.
- print(env.action_space) tells you which actions you can take; here you can take only 2 actions.

In [None]:
env.action_space
# Shows which actions you can take; here it is a discrete action space.

In [None]:
# The action space is an object of the Discrete class; it contains all possible actions.
# In this case the two actions are left and right.
# The sample() method randomly picks an action from the action space.

In [None]:
env.action_space.n # number of actions available

In [None]:
env.observation_space

As you can see, it is an object of type Box. Box is another class in Gym, used to represent an n-dimensional tensor.

In [None]:
env.observation_space.shape

In [None]:
env.observation_space.shape[0] # number of values in each observation

To win this game you need to reach a reward of 200, which is also the maximum, since an episode of CartPole-v0 is cut off after 200 time steps.

# Playing Games with a Random Strategy
-  Game episode
-  step() function in more detail
-  Game over?

- One game episode is one entire play-through: the span from the start of the game until the game is over.
- There are multiple ways in which a game can end: you have completed the level, an enemy has killed you, or your lives have run out.
- In reinforcement learning we need to play the game many times; each play-through is called an episode.

- The step() function returns four values:
- observation: the new state
- reward: the reward at that particular time step
- done: tells whether the game is over or not
- info: extra information about the game
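
The episode loop and the four return values of step() can be sketched without Gym at all. `CoinFlipEnv` and `run_episode` below are hypothetical names made up for this illustration; the toy environment only imitates the reset()/step() shape described above and has nothing to do with CartPole's physics.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment that mimics the reset()/step() shape."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self._rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial observation: the step counter

    def step(self, action):
        self.t += 1
        coin = self._rng.randrange(2)
        reward = 1.0 if action == coin else 0.0
        # The episode ends on a wrong guess or when the step limit is reached.
        done = (reward == 0.0) or (self.t >= self.max_steps)
        info = {"coin": coin}
        return self.t, reward, done, info


def run_episode(env, policy):
    """Play one episode and return (total reward, last observation)."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward, observation


toy_env = CoinFlipEnv(max_steps=10, seed=0)
total, steps = run_episode(toy_env, policy=lambda obs: 0)  # always guess 0
print("total reward:", total, "steps taken:", steps)
```

The loop keeps calling step() until `done` is true, which is the same control flow the CartPole cells in this notebook use.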

In [None]:
for e in range(20):
    # play 20 episodes
    observation=env.reset()
    for t in range(50):
        # each game runs for at most 50 time steps
        env.render()
        action=env.action_space.sample()
        observation,reward,done,info=env.step(action)
        if done:
            # the game episode is over
            print("Game Episode is {}/{} High Score:{}".format(e,20,t))
            break
env.close()
print('All 20 episodes over!!!!')
    
            

As we can see from the output, the maximum score in these 20 episodes is 45. Each episode was allowed at most 50 time steps. In one episode, for example, the rod stayed balanced from t=0 to t=17, but at t=17 the rotation angle became greater than the threshold, the rod fell, and the game was over. Similar things happened in every episode. Since this was a random strategy, the score was not that high.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import os
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import random

Using TensorFlow backend.


In [8]:
class Agent:
    # characteristics of agent
    def __init__(self,state_size,action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95 # discount factor
        # Exploration vs. exploitation trade-off
        # Exploration: good in the beginning ==> helps you try various random actions
        # Exploitation: use what was learned from past experience (memory) ==> good towards the end
        self.epsilon = 1.0 # 100% random exploration at the start (epsilon-greedy method)
        self.epsilon_decay = 0.995 # epsilon is multiplied by 0.995 on every training call
        self.epsilon_min = 0.01 # never go below 1% random exploration
        self.learning_rate = 0.001 
        self.model = self._create_model()
        # brain of agent
    def _create_model(self):
        model=Sequential()
        model.add(Dense(24,input_dim=self.state_size,activation='relu'))
        model.add(Dense(24,activation='relu'))
        model.add(Dense(self.action_size,activation='linear'))
        model.compile(loss='mse',optimizer=Adam(lr=self.learning_rate))
        return model
    def remember(self,state,action,reward,next_state,done):
        # Remember Past Experience
        self.memory.append((state,action,reward,next_state,done))
    def act(self,state):
        # Sample an action according to the epsilon-greedy method.
        # Initially epsilon is 1, i.e. every action is chosen by random exploration.
        # If epsilon is, say, 0.7, then about 70% of the actions are random
        # and 30% are chosen by the network from past experience.
        if np.random.rand()<=self.epsilon: # draw a random number in [0, 1) and compare it with epsilon
            # take a random action
            return random.randrange(self.action_size) # one of the 2 actions
        # otherwise ask the neural network for the most suitable action
        return np.argmax(self.model.predict(state)[0])
    def train(self,batch_size=32):
        # Training using a replay buffer:
        # we sample a batch of experiences from the buffer and feed them to the neural network one by one.
        # The weights are updated after every single experience tuple (one fit call per tuple),
        # using the Adam optimizer chosen in _create_model.
        # E.g. with a batch size of 5, we sample 5 experience tuples and make 5 weight updates.
        mini_batch=random.sample(self.memory,batch_size)
        for experience in mini_batch:
            state,action,reward,next_state,done=experience
            # X,Y: we need an input X and a target Y to train the neural network. Y is the target value
            # given by the Bellman equation used in deep Q-learning.
            # X = state, Y = expected reward
            if not done:
                # the game is not over yet: use the Bellman equation to approximate the target value
                target=reward+self.gamma*np.amax(self.model.predict(next_state)[0])
            else:
                # the game is over: there is no future reward
                target=reward
            target_f=self.model.predict(state) # a prediction of the form [[--,--]]; initially these are essentially random values
            target_f[0][action]=target # overwrite the value of the action taken with the target
            # X=state, Y=target_f
            self.model.fit(state,target_f,epochs=1,verbose=0)
            # epochs=1 because we feed only one example to the neural network at a time
            # verbose=0 turns off logging
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
            # Decay epsilon from 1 towards epsilon_min: as training goes on we want less
            # random exploration and more use of past experience,
            # so every training call multiplies epsilon by epsilon_decay.
            # In other words: the more experienced the agent gets, the less it relies on
            # random exploration and the more it trusts its experience.
    def load(self,name):
        self.model.load_weights(name)
    def save(self,name):
        self.model.save_weights(name)
        # e.g. you may want to save the model every 50 episodes
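
The target computed in the train method can be traced by hand with plain lists instead of a neural network. The sketch below is only an illustration: `q_target` and `updated_q_row` are hypothetical helper names, and the Q-values are made-up numbers rather than real network outputs.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Return the training target for the action that was taken."""
    if done:
        # Terminal transition: there is no future reward to bootstrap from.
        return reward
    # Non-terminal: reward plus the discounted best Q-value of the next state.
    return reward + gamma * max(next_q_values)


def updated_q_row(current_q_values, action, target):
    """Copy the predicted Q-values and overwrite only the taken action."""
    row = list(current_q_values)
    row[action] = target
    return row


# Made-up numbers, chosen only for illustration.
current_q = [0.5, 0.2]  # stands in for model.predict(state)[0]
next_q = [1.0, 2.0]     # stands in for model.predict(next_state)[0]

target = q_target(reward=1.0, next_q_values=next_q, done=False)
print(target)  # computed as 1.0 + 0.95 * 2.0
print(updated_q_row(current_q, action=1, target=target))
print(q_target(reward=-10, next_q_values=next_q, done=True))
```

Only the entry for the action actually taken changes; the other entry keeps the network's own prediction, so fitting on this row leaves the untaken action's error at zero.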

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.
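
A small sketch of how the learning rate sets the step size, using plain gradient descent on f(x) = x**2. This is an assumption-laden toy: the agent in this notebook uses the Adam optimizer, not this update rule, and the function `gradient_descent` and the three learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, start=10.0, steps=20):
    """Minimise f(x) = x**2 from `start` and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x                  # derivative of x**2
        x = x - learning_rate * grad  # the learning rate scales each step
    return x


for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With a very small rate the value barely moves away from 10, with a moderate rate it approaches the minimum at 0, and with a rate that is too large the iterates overshoot and grow instead of shrinking.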

import warnings
warnings.filterwarnings('ignore')
model=Sequential()
model.add(Dense(24,input_dim=4,activation='relu'))
model.add(Dense(24,activation='relu'))
model.add(Dense(2,activation='linear'))
model.compile(loss='mse',optimizer=Adam(lr=0.001))
model.summary()

# Training the DQN Agent (Deep Q-Learner)

In [9]:
n_episodes=1000 # number of games to play while training the agent
output_dir="cartpole_model/" # directory in which to save the model weights
state_size=4
action_size=2
batch_size=32
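
It is worth checking that the exploration schedule fits inside the number of training episodes. The sketch below replays the decay rule from the Agent class (epsilon starts at 1.0, is multiplied by 0.995 per training call, and stops at 0.01) and counts the calls needed; it assumes, as in the training loop of this notebook, at most one training call per episode.

```python
import math

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

# Count how many multiplications by epsilon_decay it takes to reach the floor,
# using the same "decay while above the minimum" rule as Agent.train.
train_calls = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    train_calls += 1

# Closed form: the smallest n with epsilon_decay**n <= epsilon_min.
closed_form = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))

print(train_calls, closed_form, round(epsilon, 4))
```

Both counts come out at 919 training calls, so with 1000 episodes the exploration rate reaches its floor shortly before training ends.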

In [10]:
import warnings
warnings.filterwarnings('ignore')
agent=Agent(state_size=4,action_size=2)
done=False
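
The agent created above keeps its experience in a deque with a fixed maximum length and trains on random samples from it. A minimal sketch of that behaviour, with a capacity of 5 instead of 2000 and made-up transitions so the effect is easy to see:

```python
import random
from collections import deque

memory = deque(maxlen=5)  # small capacity for illustration; the agent uses 2000

# Store made-up transitions (state, action, reward, next_state, done).
for step in range(8):
    memory.append((step, step % 2, 1.0, step + 1, False))

# Only the 5 most recent transitions remain; the oldest ones were discarded.
kept_states = [transition[0] for transition in memory]
print(kept_states)

# Sample a mini-batch without replacement, as Agent.train does.
random.seed(0)
mini_batch = random.sample(list(memory), 3)
print(len(mini_batch))
```

Because the deque drops its oldest entries automatically, the agent always trains on a mix of its most recent 2000 steps, and random sampling breaks up the strong correlation between consecutive steps of one episode.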

In [11]:

for e in range(n_episodes):
    state=env.reset()
    state=np.reshape(state,[1,state_size]) # the network expects input of shape (1, state_size)
    for time in range(5000):
        env.render()
        action=agent.act(state) # action is either 0 or 1
        #print(action)
        next_state,reward,done,other_info=env.step(action)
        reward=reward if not done else -10 # penalise the step that ends the episode
        next_state=np.reshape(next_state,[1,state_size])
        agent.remember(state,action,reward,next_state,done)
        state = next_state
        if done:
            print("Game Episode is {}/{} High Score:{} Exploration_rate:{:.2}".format(e,n_episodes,time,agent.epsilon))
            break
    if len(agent.memory)>batch_size:
        agent.train(batch_size) # one replay-training call per episode
  
print("Deep Q Learner model Trained !!")
env.close()

Game Episode is 0/1000 High Score:20 Exploration_rate:1.0


Game Episode is 1/1000 High Score:14 Exploration_rate:1.0
Game Episode is 2/1000 High Score:17 Exploration_rate:0.99
Game Episode is 3/1000 High Score:12 Exploration_rate:0.99
Game Episode is 4/1000 High Score:24 Exploration_rate:0.99
Game Episode is 5/1000 High Score:9 Exploration_rate:0.98
Game Episode is 6/1000 High Score:11 Exploration_rate:0.98
Game Episode is 7/1000 High Score:16 Exploration_rate:0.97
Game Episode is 8/1000 High Score:14 Exploration_rate:0.97
Game Episode is 9/1000 High Score:19 Exploration_rate:0.96
Game Episode is 10/1000 High Score:21 Exploration_rate:0.96
Game Episode is 11/1000 High Score:11 Exploration_rate:0.95
Game Episode is 12/1000 High Score:28 Exploration_rate:0.95
Game Episode is 13/1000 High Score:9 Exploration_rate:0.94
Game Episode is 14/1000 High Score:9 Exploration_rate:0.94
Game Episode is 15/1000 High Score:9 Exploration_rate:0.93
Game Episode is 16/1000 High Score:30 Exploration_rate:0.93
Game Episode is 17/1000 High Score:11 Exploration_rate

Game Episode is 138/1000 High Score:19 Exploration_rate:0.5
Game Episode is 139/1000 High Score:29 Exploration_rate:0.5
Game Episode is 140/1000 High Score:24 Exploration_rate:0.5
Game Episode is 141/1000 High Score:21 Exploration_rate:0.5
Game Episode is 142/1000 High Score:59 Exploration_rate:0.49
Game Episode is 143/1000 High Score:50 Exploration_rate:0.49
Game Episode is 144/1000 High Score:137 Exploration_rate:0.49
Game Episode is 145/1000 High Score:53 Exploration_rate:0.49
Game Episode is 146/1000 High Score:25 Exploration_rate:0.48
Game Episode is 147/1000 High Score:50 Exploration_rate:0.48
Game Episode is 148/1000 High Score:109 Exploration_rate:0.48
Game Episode is 149/1000 High Score:30 Exploration_rate:0.48
Game Episode is 150/1000 High Score:58 Exploration_rate:0.47
Game Episode is 151/1000 High Score:28 Exploration_rate:0.47
Game Episode is 152/1000 High Score:30 Exploration_rate:0.47
Game Episode is 153/1000 High Score:62 Exploration_rate:0.47
Game Episode is 154/1000 H

Game Episode is 272/1000 High Score:199 Exploration_rate:0.26
Game Episode is 273/1000 High Score:199 Exploration_rate:0.26
Game Episode is 274/1000 High Score:144 Exploration_rate:0.25
Game Episode is 275/1000 High Score:170 Exploration_rate:0.25
Game Episode is 276/1000 High Score:95 Exploration_rate:0.25
Game Episode is 277/1000 High Score:146 Exploration_rate:0.25
Game Episode is 278/1000 High Score:119 Exploration_rate:0.25
Game Episode is 279/1000 High Score:169 Exploration_rate:0.25
Game Episode is 280/1000 High Score:138 Exploration_rate:0.25
Game Episode is 281/1000 High Score:199 Exploration_rate:0.25
Game Episode is 282/1000 High Score:150 Exploration_rate:0.24
Game Episode is 283/1000 High Score:199 Exploration_rate:0.24
Game Episode is 284/1000 High Score:113 Exploration_rate:0.24
Game Episode is 285/1000 High Score:91 Exploration_rate:0.24
Game Episode is 286/1000 High Score:110 Exploration_rate:0.24
Game Episode is 287/1000 High Score:193 Exploration_rate:0.24
Game Episo

Game Episode is 405/1000 High Score:159 Exploration_rate:0.13
Game Episode is 406/1000 High Score:195 Exploration_rate:0.13
Game Episode is 407/1000 High Score:199 Exploration_rate:0.13
Game Episode is 408/1000 High Score:199 Exploration_rate:0.13
Game Episode is 409/1000 High Score:199 Exploration_rate:0.13
Game Episode is 410/1000 High Score:199 Exploration_rate:0.13
Game Episode is 411/1000 High Score:199 Exploration_rate:0.13
Game Episode is 412/1000 High Score:199 Exploration_rate:0.13
Game Episode is 413/1000 High Score:136 Exploration_rate:0.13
Game Episode is 414/1000 High Score:131 Exploration_rate:0.13
Game Episode is 415/1000 High Score:159 Exploration_rate:0.13
Game Episode is 416/1000 High Score:199 Exploration_rate:0.12
Game Episode is 417/1000 High Score:133 Exploration_rate:0.12
Game Episode is 418/1000 High Score:101 Exploration_rate:0.12
Game Episode is 419/1000 High Score:146 Exploration_rate:0.12
Game Episode is 420/1000 High Score:199 Exploration_rate:0.12
Game Epi

Game Episode is 537/1000 High Score:167 Exploration_rate:0.068
Game Episode is 538/1000 High Score:199 Exploration_rate:0.068
Game Episode is 539/1000 High Score:188 Exploration_rate:0.067
Game Episode is 540/1000 High Score:187 Exploration_rate:0.067
Game Episode is 541/1000 High Score:121 Exploration_rate:0.067
Game Episode is 542/1000 High Score:166 Exploration_rate:0.066
Game Episode is 543/1000 High Score:148 Exploration_rate:0.066
Game Episode is 544/1000 High Score:149 Exploration_rate:0.066
Game Episode is 545/1000 High Score:146 Exploration_rate:0.065
Game Episode is 546/1000 High Score:151 Exploration_rate:0.065
Game Episode is 547/1000 High Score:165 Exploration_rate:0.065
Game Episode is 548/1000 High Score:143 Exploration_rate:0.064
Game Episode is 549/1000 High Score:152 Exploration_rate:0.064
Game Episode is 550/1000 High Score:166 Exploration_rate:0.064
Game Episode is 551/1000 High Score:193 Exploration_rate:0.063
Game Episode is 552/1000 High Score:199 Exploration_rat

Game Episode is 668/1000 High Score:150 Exploration_rate:0.035
Game Episode is 669/1000 High Score:146 Exploration_rate:0.035
Game Episode is 670/1000 High Score:171 Exploration_rate:0.035
Game Episode is 671/1000 High Score:149 Exploration_rate:0.035
Game Episode is 672/1000 High Score:157 Exploration_rate:0.035
Game Episode is 673/1000 High Score:195 Exploration_rate:0.034
Game Episode is 674/1000 High Score:199 Exploration_rate:0.034
Game Episode is 675/1000 High Score:199 Exploration_rate:0.034
Game Episode is 676/1000 High Score:199 Exploration_rate:0.034
Game Episode is 677/1000 High Score:185 Exploration_rate:0.034
Game Episode is 678/1000 High Score:199 Exploration_rate:0.034
Game Episode is 679/1000 High Score:179 Exploration_rate:0.033
Game Episode is 680/1000 High Score:199 Exploration_rate:0.033
Game Episode is 681/1000 High Score:199 Exploration_rate:0.033
Game Episode is 682/1000 High Score:199 Exploration_rate:0.033
Game Episode is 683/1000 High Score:199 Exploration_rat

Game Episode is 799/1000 High Score:199 Exploration_rate:0.018
Game Episode is 800/1000 High Score:180 Exploration_rate:0.018
Game Episode is 801/1000 High Score:199 Exploration_rate:0.018
Game Episode is 802/1000 High Score:199 Exploration_rate:0.018
Game Episode is 803/1000 High Score:199 Exploration_rate:0.018
Game Episode is 804/1000 High Score:199 Exploration_rate:0.018
Game Episode is 805/1000 High Score:199 Exploration_rate:0.018
Game Episode is 806/1000 High Score:199 Exploration_rate:0.018
Game Episode is 807/1000 High Score:199 Exploration_rate:0.018
Game Episode is 808/1000 High Score:199 Exploration_rate:0.018
Game Episode is 809/1000 High Score:199 Exploration_rate:0.017
Game Episode is 810/1000 High Score:155 Exploration_rate:0.017
Game Episode is 811/1000 High Score:159 Exploration_rate:0.017
Game Episode is 812/1000 High Score:172 Exploration_rate:0.017
Game Episode is 813/1000 High Score:189 Exploration_rate:0.017
Game Episode is 814/1000 High Score:199 Exploration_rat

Game Episode is 930/1000 High Score:199 Exploration_rate:0.01
Game Episode is 931/1000 High Score:146 Exploration_rate:0.01
Game Episode is 932/1000 High Score:199 Exploration_rate:0.01
Game Episode is 933/1000 High Score:199 Exploration_rate:0.01
Game Episode is 934/1000 High Score:199 Exploration_rate:0.01
Game Episode is 935/1000 High Score:199 Exploration_rate:0.01
Game Episode is 936/1000 High Score:199 Exploration_rate:0.01
Game Episode is 937/1000 High Score:199 Exploration_rate:0.01
Game Episode is 938/1000 High Score:199 Exploration_rate:0.01
Game Episode is 939/1000 High Score:199 Exploration_rate:0.01
Game Episode is 940/1000 High Score:185 Exploration_rate:0.01
Game Episode is 941/1000 High Score:199 Exploration_rate:0.01
Game Episode is 942/1000 High Score:199 Exploration_rate:0.01
Game Episode is 943/1000 High Score:182 Exploration_rate:0.01
Game Episode is 944/1000 High Score:199 Exploration_rate:0.01
Game Episode is 945/1000 High Score:170 Exploration_rate:0.01
Game Epi