# Reinforcement Learning with Keras + OpenAI

## What is Reinforcement Learning?

- Reinforcement learning trains **ML models** (agents) to make a **sequence of decisions**
- The agent learns by **trial and error**: it tries actions and observes the outcome
- **Rewards** are given by the environment (here, a game)
- The goal is to **maximize** the **total reward**
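
The loop described above can be sketched without any RL library. This is a minimal illustration, not Gym itself: `CoinGuessEnv`, `run_episode`, and `random_policy` are hypothetical names invented here, and the "game" is just guessing a coin flip.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment: guess a hidden coin flip each step."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.steps_taken = 0
        return 0

    def step(self, action):
        # The environment hands out the reward: +1 for a correct guess, else 0.
        coin = self.rng.randint(0, 1)
        reward = 1 if action == coin else 0
        self.steps_taken += 1
        done = self.steps_taken >= self.episode_length
        return coin, reward, done


def run_episode(env, policy):
    """Play one episode and return the total reward collected."""
    observation = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = policy(observation)                  # decide
        observation, reward, done = env.step(action)  # act and observe
        total_reward += reward                        # accumulate the reward
    return total_reward


def random_policy(observation):
    # Trial and error in its simplest form: try a random action.
    return random.randint(0, 1)


total = run_episode(CoinGuessEnv(), random_policy)
print("Total reward:", total)
```

A learning algorithm would replace `random_policy` with something that improves as it sees which actions led to higher totals.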

<img src="reinforcepic.png">

## What is Keras?

Keras is one of the leading high-level neural network APIs. It is written in Python and supports multiple back-end neural network computation engines.
- With Keras we do not need to implement the backpropagation algorithm ourselves
- Many layers can be added in just a few lines of code
- All types of models are built on the same principles, so it becomes easier to master

<img src = "neuralnetworkgif.gif">

## What is OpenAI Gym?

- Gym is a toolkit for developing and comparing reinforcement learning algorithms.
- It supports teaching agents everything from walking to playing games like Pong or Pinball.
- It has a simple Python interface

## For Starters

- You need to install the OpenAI Gym package (see https://gym.openai.com)
- Run `pip install gym` in a terminal, or `!pip install gym` in a Jupyter notebook

## Learning On CartPole


<img src = "cartpole.gif">

- A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The system is controlled by applying a force of +1 or -1 to the cart.
- The pendulum starts upright, and the goal is to prevent it from falling over.
- A reward of +1 is provided for every timestep that the pole remains upright.
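- The episode ends when the pole is more than 12 degrees from vertical (the environment page linked below says 15 degrees, but the Gym implementation uses a 12-degree threshold), or the cart moves more than 2.4 units from the center.
- Read more about it at https://gym.openai.com/envs/CartPole-v0/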

## Understanding OpenAI Gym Environment

- OpenAI Gym environments are structured around two main parts: an observation space and an action space
- Based on the current observation, we decide which action to take
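
To make the observation-to-action mapping concrete, here is a small hand-written policy for CartPole. It is only a sketch: `lean_policy` is a hypothetical helper, and it assumes the CartPole observation layout `[cart position, cart velocity, pole angle, pole angular velocity]` with actions `0` (push left) and `1` (push right).

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning to.

    Assumes observation = [cart position, cart velocity,
                           pole angle, pole angular velocity]
    and actions 0 = push left, 1 = push right.
    """
    pole_angle = observation[2]
    # A positive angle means the pole leans right, so push right to catch it.
    return 1 if pole_angle > 0 else 0


print(lean_policy([0.0, 0.0, 0.05, 0.0]))   # pole leans right -> 1
print(lean_policy([0.0, 0.0, -0.05, 0.0]))  # pole leans left  -> 0
```

The neural network trained later in this notebook plays the same role as this function: it takes the four observation values and returns an action.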

### Using Gym
- ##### Create a Gym environment
env = gym.make('env-name')
- ##### Reset the environment (returns the first observation)
env.reset()
- ##### Render the environment in a visible window
env.render()
- ##### Take the next step in the game with the chosen action
env.step(action)
- ##### Close the rendering window
env.close()


In [7]:
import gym
import numpy as np

In [3]:
env = gym.make('CartPole-v0')

In [4]:
observation = env.reset()

In [8]:
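t = 0
while t < 1000:
    env.render()
    # step() requires an action; sample a random one from the action space
    observation, reward, done, info = env.step(env.action_space.sample())
    if done:
        # start a new episode once the current one has ended
        observation = env.reset()
    t += 1
env.close()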



# Data Collection

- We will collect data by running a fixed number of random trials
- Only trials that score above a minimum score are kept as training data
- Each action is one-hot encoded so it can be used as the training target
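
The filtering and encoding steps can be sketched on made-up data before running them against the real environment. The helper names `one_hot` and `keep_good_trials` and the two sample trials are hypothetical and exist only for illustration.

```python
def one_hot(action, num_actions=2):
    """Encode an integer action as a one-hot list, e.g. 1 -> [0.0, 1.0]."""
    encoded = [0.0] * num_actions
    encoded[action] = 1.0
    return encoded


def keep_good_trials(trials, min_score):
    """Flatten the (observation, action) pairs of trials scoring above min_score."""
    inputs, targets = [], []
    for score, steps in trials:
        if score > min_score:
            for observation, action in steps:
                inputs.append(observation)
                targets.append(one_hot(action))
    return inputs, targets


# Two made-up trials in the form (score, [(observation, action), ...]).
trials = [
    (12, [([0.0, 0.1, 0.02, -0.1], 0)]),
    (60, [([0.0, 0.1, 0.01, 0.0], 1), ([0.0, 0.3, 0.01, -0.3], 0)]),
]
inputs, targets = keep_good_trials(trials, min_score=50)
print(targets)  # [[0.0, 1.0], [1.0, 0.0]]
```

Only the second trial passes the threshold, so the low-scoring random behaviour never reaches the training set.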


In [9]:
def gather_data(env):
    num_trials = 10000  # number of random episodes to play
    min_score = 50      # keep only episodes that score above this
    sim_steps = 300     # upper bound on steps per episode
    trainingX, trainingY = [], []
    scores = []
    for trial in range(num_trials):
        observation = env.reset()
        score = 0
        training_sampleX, training_sampleY = [], []
        for step in range(sim_steps):
            # Render only every 400th trial to keep data collection fast
            if trial % 400 == 0:
                env.render()
            action = np.random.randint(0, 2)  # 0 = left, 1 = right
            # One-hot encode the action so it can serve as a training target
            one_hot_action = np.zeros(2)
            one_hot_action[action] = 1
            # Pair the observation seen before the step with the action taken
            training_sampleX.append(observation)
            training_sampleY.append(one_hot_action)
            observation, reward, done, info = env.step(action)
            score += reward
            if done:
                break
        # Keep the samples only if the episode scored above the threshold
        if score > min_score:
            scores.append(score)
            trainingX += training_sampleX
            trainingY += training_sampleY
    trainingX, trainingY = np.array(trainingX), np.array(trainingY)
    print("Average: {}".format(np.mean(scores)))
    print("Median: {}".format(np.median(scores)))
    return trainingX, trainingY

In [11]:
print(gather_data(env))
env.close()

Average: 62.154798761609904
Median: 59.0
(array([[-0.02925428,  0.02547628,  0.01115619, -0.0022209 ],
       [-0.02874476,  0.22043648,  0.01111177, -0.29136314],
       [-0.02433603,  0.41539824,  0.00528451, -0.58052094],
       ...,
       [ 0.20535798, -0.65948285,  0.13012252,  1.63853338],
       [ 0.19216832, -0.46610441,  0.16289319,  1.38906612],
       [ 0.18284623, -0.27334315,  0.19067451,  1.15143091]]), array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 1.],
       [0., 1.],
       [1., 0.]]))


# Model Creation

- We will use Keras to define the model
- The model we use here is a very simple one: several fully connected layers
- Enhancements such as convolutions, LSTMs, or dropout can be added
- The input is the observation and the output is the action
- Possible loss functions include mean_squared_error and categorical_crossentropy
- The preferred optimizer is usually adam
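
To see how the two loss choices differ, the sketch below computes both by hand for one sample with a one-hot target and a two-way softmax output. The functions are plain-Python stand-ins written for illustration, not the Keras implementations; the logits `[0.0, 1.0]` are an arbitrary example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_squared_error(target, predicted):
    """Average squared difference over the output units."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def categorical_crossentropy(target, predicted, epsilon=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, epsilon)) for t, p in zip(target, predicted))


target = [0.0, 1.0]              # one-hot: the recorded action was "right"
predicted = softmax([0.0, 1.0])  # the model leans toward "right"

mse = mean_squared_error(target, predicted)
cross_entropy = categorical_crossentropy(target, predicted)
print("MSE: {:.4f}".format(mse))
print("Cross-entropy: {:.4f}".format(cross_entropy))
```

Cross-entropy penalizes confident wrong predictions much more strongly than MSE, which is why it is a common choice with a softmax output; MSE also works for this small problem.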

In [12]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

In [14]:
def create_model():
    model = Sequential()
    model.add(Dense(128,input_shape=(4,),activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(512,activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(2, activation='softmax'))
    model.summary()

    model.compile(loss='mse',optimizer='adam',metrics=['accuracy'])
    return model

# Prediction

- We train the model on the data gathered above
- We run several trials to check the model on multiple cases
- Each trial produces a score
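
Two small pieces of this evaluation are worth isolating: turning the model's output probabilities into an action, and averaging the scores across trials. The sketch below uses hypothetical helpers `choose_action` and `mean_score` and made-up numbers; in the notebook itself the first is done with `np.argmax` and the second with `np.mean`.

```python
def choose_action(probabilities):
    """Return the index of the largest probability (first index wins ties)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def mean_score(scores):
    """Average score over all trials played so far."""
    return sum(scores) / len(scores)


# Made-up model outputs for three consecutive observations.
outputs = [[0.3, 0.7], [0.9, 0.1], [0.5, 0.5]]
actions = [choose_action(probabilities) for probabilities in outputs]
print(actions)  # [1, 0, 0]

# Made-up episode scores from three trials.
print(mean_score([100.0, 150.0, 200.0]))  # 150.0
```

Because CartPole gives +1 per timestep, a trial's score is simply the number of steps the pole stayed up.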

In [15]:
def predict():
    env1 = gym.make('CartPole-v0')
    trainingX,trainingY = gather_data(env1)
    model = create_model()
    model.fit(trainingX,trainingY,epochs=5)
    
    scores = []
    num_trials = 50
    sim_steps = 300
    for trial in range(num_trials):
        observation = env1.reset()
        score = 0
        for step in range(sim_steps):
            if(trial%4==0):
                env1.render()
            action = np.argmax(model.predict(observation.reshape(1,4)))
            observation,reward,done,info = env1.step(action)
            # accumulate the reward on every step, not only on the last one
            score+=reward
            if done:
                break
        scores.append(score)
        print(np.mean(scores))  # running mean score over the trials so far
    env1.close()

In [16]:
predict()

Average: 63.26243093922652
Median: 60.0
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 128)               640       
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 256)               33024     
_________________________________________________________________
dropout_6 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 512)               131584    
_________________________________________________________________
dropout_7 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_9 (Dense