## Source 
 - https://adventuresinmachinelearning.com/reinforcement-learning-tutorial-python-keras/
 
## Agent

![](https://i2.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2018/02/Reinforcement-learning-environment.png?w=381&ssl=1)

## Actions
 - 0: forward one step
 - 1: backward to state 0
 
## Rewards
 - 0: returns 10 only in state 4
 - 1: returns always 2
 

## Environment
 
![](https://adventuresinmachinelearning.com/wp-content/uploads/2018/02/NChain-illustration.png)




## Reinforcement Learning Basics
source : https://towardsdatascience.com/reinforcement-learning-with-openai-d445c2c687d2
![](https://cdn-images-1.medium.com/max/1600/0*ft9-KOlrkKM06RkR.png)

> Earlier behavioral psychology experiments did pave the way for current RL movement in computer science by providing strong theoretical understanding behind the agent’s motivation.

reinforcement learning scenarios can be formulated as dynamic programming problem. Fundamentally meaning agent has to perform series of steps in systematic manner so that it can learn the ideal solution and it will receive guidance from reward values

![](https://cdn-images-1.medium.com/max/1600/1*CylzR3lBFqoMWMuJgjQo0w.png)

> Positive and Negative rewards increases or decreases tendency of that behavior. Eventually leading to better results in that environment over a period of time.


Reinforcement Learning is the science of making optimal decisions. Aim is to formulate reward-motivated behaviour exhibited by living species. Dopamine system in our brain which takes care of reward-motivated behaviour. 
 - source: https://becominghuman.ai/the-very-basics-of-reinforcement-learning-154f28a79071
![](https://cdn-images-1.medium.com/max/1600/1*4u2GtNnMa9xso1WkLh7hVA.png)

>  Instead of using a "model of the world", it uses data directly in the form of samples or simple trajectories. Therefore, it can be viewed as a data driven model-free dynamic programming that operates on samples of raw data.

## When environemnt get complex, Neural Nets enter the picture
NN is very good at mapping input to ouput. Here, state will enter as an input, action will be the output. Each action has a value. This is a regression problem!!

![](https://cdn-images-1.medium.com/max/1600/0*BPeyfQgVvGtB7E5U.png)


## Q-value

$$
Q(s,a) = Q(s,a) + \alpha(r + \gamma max_{a'} Q(s',a') - Q(s,a))
$$

Same as
$$
Q(s,a) = (1-\alpha) Q(s,a) + \alpha(r + \gamma max_{a'} Q(s',a'))
$$

 - $\alpha$ learning rate
 - $\gamma$ discount factor
 - target: $r + \gamma max_{a'} Q(s',a')$
 - prediction : $Q(s,a)$

In [15]:
# 1. It renders instance for 500 timesteps, perform random actions
import gym
env = gym.make('Acrobot-v1')
env.reset()
for _ in range(500):
    env.render()
    env.step(env.action_space.sample())
# 2. To check all env available, uninstalled ones are also shown
from gym import envs 
print(envs.registry.all())
env.close()

dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0), EnvSpec(KellyCoinflip-v0), EnvSpec(KellyCoinflipGeneralized-v0), EnvSpec(FrozenLake-v0), EnvSpec(FrozenLake8x8-v0), EnvSpec(CliffWalking-v0), EnvSpec(NChain-v0), EnvSpec(Roulette-v0), EnvSpec(Taxi-v2), EnvSpec(GuessingGame-v0), EnvSpec(HotterColder-v0), EnvSpec(Reacher-v2), EnvSpec(Pusher-v2), EnvSpec(Thrower-v2), EnvSpec(Striker-v2), EnvSpec(InvertedPendulum-v2), EnvSpec(InvertedDoublePendulum-v2), EnvSpec(HalfCheetah-v2), EnvSpec(HalfCheetah-v3), EnvSpec(Hopper-v2), EnvSpec(Hopper-v3), EnvSpec(Swimmer-v2), EnvSp

In [16]:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        #print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

Episode finished after 17 timesteps
Episode finished after 14 timesteps
Episode finished after 34 timesteps
Episode finished after 14 timesteps
Episode finished after 55 timesteps
Episode finished after 35 timesteps
Episode finished after 15 timesteps
Episode finished after 17 timesteps
Episode finished after 47 timesteps
Episode finished after 11 timesteps
Episode finished after 31 timesteps
Episode finished after 19 timesteps
Episode finished after 24 timesteps
Episode finished after 21 timesteps
Episode finished after 44 timesteps
Episode finished after 14 timesteps
Episode finished after 11 timesteps
Episode finished after 16 timesteps
Episode finished after 16 timesteps
Episode finished after 30 timesteps


In [20]:
import gym
env = gym.make('MountainCarContinuous-v0') # try for different environements
observation = env.reset()
for t in range(100):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Finished after {} timesteps".format(t+1))
            break
env.close()

## openai 
 - founded by Elon Musk and Sam Altman

For more info:
 - http://gym.openai.com/docs/
 

In [4]:
#!pip install 'gym[all]'

In [21]:
import gym

In [22]:
env = gym.make('NChain-v0')

In [24]:
env.reset()

0

What `env.step(1)` returns
 
 - The new state after the action
 - The reward due to the action
 - Whether the game is “done” or not – the NChain game is done after 1,000 steps
 - Debugging information – not relevant in this example

In [25]:
env.step(1)# 1:backward

(1, 0, False, {})

In [26]:
env.step(1)# 1:backward

(2, 0, False, {})

In [27]:
env.step(1) # 1:backward

(0, 2, False, {})

In [28]:
env.step(0) # 0:forward

(0, 2, False, {})

In [9]:
env.step(0) # 0:forward

(1, 0, False, {})

In [10]:
env.step(0) # 0:forward

(2, 0, False, {})

In [11]:
env.step(0) # 0:forward

(3, 0, False, {})

In [12]:
env.step(1)# 1:backward

(0, 2, False, {})

In [13]:
env.step(1)# 1:backward

(0, 2, False, {})

In [14]:
env.action_space

Discrete(2)

In [14]:
env.observation_space

Discrete(5)

In [19]:
env.action_space.to_jsonable

<bound method Space.to_jsonable of Discrete(2)>

In [21]:
env.render

<bound method Wrapper.render of <TimeLimit<NChainEnv<NChain-v0>>>>

## Naive RL

Reward table
 - $n_{states} \times n_{actions}$ = $[5 \times 2]$
 - Each entry $r_{s,a}$ the sum of the rewards that the agent has received in the past while taking action a in state s


In [14]:
import numpy as np

In [15]:
def naive_sum_reward_agent(env, num_episodes=500):
    # this is the table that will hold our summated rewards for
    # each action in each state
    r_table = np.zeros((5, 2))
    for g in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if np.sum(r_table[s, :]) == 0:
                # make a random selection of actions
                a = np.random.randint(0, 2)
            else:
                # select the action with highest cummulative reward
                a = np.argmax(r_table[s, :])
            new_s, r, done, _ = env.step(a)
            r_table[s, a] += r
            s = new_s #  the state s is updated to new_s – the new state 
    return r_table

> Examining the results above, you can observe that the most common state for the agent to be in is the first state, seeing as any action 1 will bring the agent back to this point. The least occupied state is state 4, as it is difficult for the agent to progress from state 0 to 4 without the action being “flipped” and the agent being sent back to state 0.

In [16]:
naive_sum_reward_agent(env)

array([[     0., 641190.],
       [     0., 127442.],
       [     0.,  25312.],
       [     0.,   5106.],
       [     0.,   3252.]])

### Clearly – something is wrong with this table.
Locked In
> First, once there is a reward stored in one of the columns, the agent will always choose that action from that point on. 

# Delayed reward reinforcement learning

Q-value

$$
Q(s,a) = Q(s,a) + \alpha(r + \gamma max_{a'} Q(s',a') - Q(s,a))
$$

Same as
$$
Q(s,a) = (1-\alpha) Q(s,a) + \alpha(r + \gamma max_{a'} Q(s',a'))
$$

 - $\alpha$ learning rate
 - $\gamma$ discount factor
 - target: $r + \gamma max_{a'} Q(s',a')$
 - prediction : $Q(s,a)$

In [29]:
def q_learning_with_table(env, num_episodes=500):
    q_table = np.zeros((5, 2))
    y = 0.95
    lr = 0.8
    for i in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if np.sum(q_table[s,:]) == 0:
                # make a random selection of actions
                a = np.random.randint(0, 2)
            else:
                # select the action with largest q value in state s
                a = np.argmax(q_table[s, :])
            new_s, r, done, _ = env.step(a)
            q_table[s, a] += r + lr*(y*np.max(q_table[new_s, :]) - q_table[s, a]) # Q learnning rule
            s = new_s
    return q_table

In [30]:
q_learning_with_table(env)

array([[ 0.        , 27.57202638],
       [28.18089044,  0.        ],
       [ 0.        , 28.8523263 ],
       [34.23301915,  0.        ],
       [33.68482516,  0.        ]])

## Epsilon-greedy (Mutation)

Locked in 
> initial bad decisions may continue

Mutation, random actions, can escape from locked-in

In [31]:
def eps_greedy_q_learning_with_table(env, num_episodes=500):
    q_table = np.zeros((5, 2))
    y = 0.95
    eps = 0.5
    lr = 0.8
    decay_factor = 0.999
    for i in range(num_episodes):
        s = env.reset()
        eps *= decay_factor # mutation/innovation frequency decreases over time == annealing
        done = False
        while not done:
            # make a random selection of actions
            if np.random.random() < eps or np.sum(q_table[s, :]) == 0:
                a = np.random.randint(0, 2)
            # select the action with highest cummulative reward
            else:
                a = np.argmax(q_table[s, :])
            # pdb.set_trace()
            new_s, r, done, _ = env.step(a)
            q_table[s, a] += r + lr * (y * np.max(q_table[new_s, :]) - q_table[s, a])
            s = new_s
    return q_table

In [34]:
eps_greedy_q_learning_with_table(env)

array([[31.30841065, 36.76243193],
       [32.40476918, 33.17920827],
       [38.39178364, 33.83378057],
       [42.79325837, 34.44995201],
       [38.89323493, 34.00175548]])

> Finally we have a table which favors action 0 in state 4 

## Test

In [36]:
def run_game(table, env):
    s = env.reset()
    tot_reward = 0
    done = False
    while not done:
        a = np.argmax(table[s, :])
        s, r, done, _ = env.step(a)
        tot_reward += r
    return tot_reward

In [39]:
def test_methods(env, num_iterations=10):
    winner = np.zeros((3,))
    for g in range(num_iterations):
        m0_table = naive_sum_reward_agent(env, 500)
        m1_table = q_learning_with_table(env, 500)
        m2_table = eps_greedy_q_learning_with_table(env, 500)
        m0 = run_game(m0_table, env)
        m1 = run_game(m1_table, env)
        m2 = run_game(m2_table, env)
        w = np.argmax(np.array([m0, m1, m2]))
        winner[w] += 1
        print("Game {} of {}".format(g + 1, num_iterations))
    return winner

In [40]:
test_methods(env)

Game 1 of 10
Game 2 of 10
Game 3 of 10
Game 4 of 10
Game 5 of 10
Game 6 of 10
Game 7 of 10
Game 8 of 10
Game 9 of 10
Game 10 of 10


array([2., 0., 8.])

> As can be observed, of the 100 experiments the  eps-greedy, Q learning algorithm (i.e. the third model that was presented) wins 65 of them.

## Deep RL

Function approximation approach!!

> In particular, deep reinforcement learning was developed by implying deep neural networks as a function approximation within the Bellman equation.

$$
\text{loss} = (\underbrace{r + \gamma \max_{a’} Q'(s’, a’)}_{\text{target}} – \underbrace{Q(s, a)}_{\text{prediction}})^2
$$

![](https://i1.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2018/03/Reinforcement-learning-Keras.png?zoom=2&resize=340%2C335&ssl=1)

In [49]:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD

model = Sequential()
#model.add(InputLayer(batch_input_shape=(1, 5)))
model.add(Dense(10, input_shape=(5,), activation='sigmoid'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

In [56]:
# now execute the q learning
y = 0.95
eps = 0.5
decay_factor = 0.999
num_episodes = 1000
r_avg_list = []
for i in range(num_episodes):
    s = env.reset()
    eps *= decay_factor
    if i % 100 == 0:
        print("Episode {} of {}".format(i + 1, num_episodes))
    done = False
    r_sum = 0
    while not done:
        # epsilon greedy
        if np.random.random() < eps:
            a = np.random.randint(0, 2)
        else:
            # a: best action in state s
            a = np.argmax(model.predict(np.identity(5)[s:s + 1]))
        # new state, after taking best action a from s
        new_s, r, done, _ = env.step(a)
        
        # Target is the reward r plus the discounted maximum of the predicted Q values for the new state, new_s. 
        target = r + y * np.max(model.predict(np.identity(5)[new_s:new_s + 1]))
        
        # Previous Q(s,a)
        target_vec = model.predict(np.identity(5)[s:s + 1])[0]
        
        # we want the Keras model to learn to predict for state s and action a New Q(s,a).
        target_vec[a] = target
        model.fit(np.identity(5)[s:s + 1], target_vec.reshape(-1, 2), epochs=1, verbose=0)
        s = new_s
        r_sum += r
    r_avg_list.append(r_sum / 1000)

Episode 1 of 1000
Episode 101 of 1000
Episode 201 of 1000
Episode 301 of 1000
Episode 401 of 1000
Episode 501 of 1000
Episode 601 of 1000
Episode 701 of 1000
Episode 801 of 1000
Episode 901 of 1000


In [53]:
s = 3
np.identity(5)[s:s + 1]

array([[0., 0., 0., 1., 0.]])

In [60]:
model.predict(np.identity(5)[s:s + 1])[0]

array([66.7478  , 62.803864], dtype=float32)

> This output looks sensible – we can see that the Q values for each state will favor choosing action 0 (moving forward) to shoot for those big, repeated rewards in state 4. Intuitively, this seems like the best strategy.

In [59]:
model.predict(np.identity(5))

array([[63.28666 , 61.928818],
       [66.7478  , 62.803867],
       [71.22221 , 63.97839 ],
       [77.43433 , 65.56561 ],
       [85.506065, 67.42528 ]], dtype=float32)