<a href="https://colab.research.google.com/github/vaibhawvipul/Reinforcement-Learning-Lectures/blob/master/RL_Lecture_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Reinforcement Learning

**Learn** to make a **good sequence of decisions**!

A fundamental challenge in machine learning is learning to make good decisions under uncertainty.

---

## Types of Learning - 

*   Active Learning
*   Passive Learning

A human can learn **without examples** of optimal behaviour. Right?

## Why Learn?

There are two distinct reasons to learn - 

1.   Find previously unknown solutions
2.   Find solutions **online** in case of unforeseen circumstances. (Not only generalization but learning online)

RL seeks to provide algorithms for both cases.

---

## What is Reinforcement Learning?

RL is the science of learning to make decisions from interaction. Is RL = AI?

## How is RL different from other Machine Learning paradigms?

Reinforcement Learning is different from other machine learning paradigms because - 

*   There is no supervisor, only a **reward** signal.
*   Feedback can be delayed.
*   Previous decisions might affect future predictions/interactions.

There is a fine line between imitation learning and reinforcement learning. Can you guess what it is?

## What does RL involve?

* Optimization - finding the way to make decisions that yields the best outcomes.
* Delayed consequences - decisions made now can affect outcomes much later.
* Exploration 
* Generalization


## Core Concepts of Reinforcement Learning - 

* Environment
* Reward Signal
* Agent - 
    *   Agent State
    *   Policy
    *   Value Function (probable)
    *   Model (Optional)

![Agent Environment Interaction](https://github.com/vaibhawvipul/Reinforcement-Learning-Lectures/blob/master/agent-env-rl.jpeg?raw=true)

---

## Behaviour and Intelligence - 

Let's spend some time in understanding this interesting creature - **Sea Squirt**

![Sea Squirt](https://goodheartextremescience.files.wordpress.com/2010/01/sea_squirts_img_0704.jpg)

Sea squirts are primitive creatures most famous for “eating their brains.”

Sea squirts are hermaphrodites: they have both male and female reproductive organs. They reproduce by releasing eggs and sperm into the water at the same time. When the eggs develop into tadpole-like larvae, they swim with wiggling and twitching movements.

The free-swimming larval stage lasts only a short time, since the larvae aren’t capable of feeding. Soon, they settle on the sea floor or on a rock and cement themselves headfirst to the spot where they will spend the rest of their lives.

The sea squirt larva then begins absorbing all its tadpole-like parts. Where the larva once had gills, it develops the intake and exit siphons that will help it bring water and food into its body. It absorbs its twitching tail. It absorbs its primitive eye and its spine-like notochord. Finally, it even absorbs the rudimentary little “brain” (cerebral ganglion) that it used to swim about and find its attachment place.

Since the sea squirt no longer needs its brain to help it swim around or to see, this isn’t a great loss to the creature. Read more about this fascinating creature [here](https://goodheartextremescience.wordpress.com/2010/01/27/meet-the-creature-that-eats-its-own-brain/).

The example above suggests that the brain exists to help with decision making: once no more decisions are needed, is there any need for a brain?

This reminds us **why an agent needs to be intelligent: it has to make decisions!**

---




Let us install **OpenAI Gym**. We will be using OpenAI's Gym for a lot of tutorials and exercises. 

OpenAI Gym - Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

Here is how you can install OpenAI Gym.

```
pip install gym
```


---

The following is a small demo of OpenAI Gym - 


In [0]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [0]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1

In [0]:
# Importing the necessary modules and dependencies
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

In [0]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

In [0]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

In [0]:
env = wrap_env(gym.make("CartPole-v0"))

In [0]:
print(env.action_space)

In [0]:
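observation = env.reset()

while True:
    env.render()

    # your agent goes here; for now, sample a random action
    action = env.action_space.sample()

    # apply the action and receive the next observation and reward
    observation, reward, done, info = env.step(action)

    # stop when the episode ends
    if done:
        break

env.close()
show_video()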

The above code is also an example of how to run OpenAI's Gym in Colab! Please make sure that you note it down somewhere.

---




## Agent and Environment

At each time step t the agent:
* receives an observation (or environment state) and a reward.
* executes an action

The environment:
* receives an action
* emits an observation and a reward
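
The interaction loop described above can be sketched in plain Python without any Gym dependency. This is only an illustration: `CoinFlipEnv`, `RandomAgent`, `run_episode`, and the reward scheme are hypothetical names invented for this example, not part of any library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        # Start a new episode and emit the first observation (the time step).
        self.t = 0
        return self.t

    def step(self, action):
        # Receive an action, then emit the next observation and a reward.
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


class RandomAgent:
    """Hypothetical agent that ignores the observation and acts randomly."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.choice([0, 1])


def run_episode(env, agent):
    observation = env.reset()
    rewards = []
    done = False
    while not done:
        action = agent.act(observation)               # agent executes an action
        observation, reward, done = env.step(action)  # environment responds
        rewards.append(reward)
    return rewards


rewards = run_episode(CoinFlipEnv(), RandomAgent())
print(len(rewards), sum(rewards))
```

The agent and the environment only communicate through actions, observations, and rewards, which is the same contract the Gym demo earlier relies on.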

---

## Rewards

A reward is a scalar (real-valued) feedback signal. It indicates how well an agent is doing at time step t. The agent's job is to maximize the *cumulative reward* (called **return**).

**Reward Hypothesis** - Any goal can be formalized as the outcome of maximizing cumulative reward.







---


## History

A history is a sequence of past observations, actions and rewards. The agent chooses an action based on the history.

This history can then be used to construct **agent state**.
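
One simple way to build an agent state from the history is to keep only the most recent steps. The sketch below assumes that choice; the function `agent_state`, the window size `k`, and the example entries are hypothetical, and many other state functions are possible.

```python
def agent_state(history, k=2):
    """Hypothetical state function: keep only the k most recent steps."""
    return tuple(history[-k:])


# Each history entry is an (observation, action, reward) triple.
history = [
    ("obs0", "left", 0.0),
    ("obs1", "right", 1.0),
    ("obs2", "right", 0.0),
]

state = agent_state(history)
print(state)
```

With `k=2` the state contains only the last two triples, so the agent forgets everything that happened before them.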


---



## Value

The expected cumulative reward from state S is called the value.

The goal is then to maximize value by choosing suitable actions. 

Note - Returns and Values can be defined recursively. 
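
As a minimal sketch of the recursive definition, the return at each step can be computed backwards as the current reward plus the (optionally discounted) return of the next step. The function name `returns_from_rewards` and the discount factor `gamma` are assumptions made for this example; `gamma=1.0` gives the plain cumulative reward.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Compute the return for every time step, working backwards.

    Uses the recursion G_t = r_t + gamma * G_{t+1}, with a return of 0
    after the final step. Here rewards[t] is the reward received for
    the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns


rewards = [1.0, 0.0, 2.0]
print(returns_from_rewards(rewards))             # [3.0, 2.0, 2.0]
print(returns_from_rewards(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

The value of a state is the expectation of this quantity, so it inherits the same recursive structure.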

---



## Policy 

A mapping from states to actions is called a policy.
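
In code, a policy can be as simple as a lookup table. The sketch below shows a deterministic table and, as an extension that goes beyond the definition above, a stochastic policy that maps each state to a probability distribution over actions. The battery states and action names are hypothetical.

```python
import random

# Deterministic policy: a lookup table from state to action.
deterministic_policy = {"low": "recharge", "high": "search"}


def act(policy, state):
    return policy[state]


# Stochastic policy: each state maps to action probabilities.
stochastic_policy = {
    "low": {"recharge": 0.9, "search": 0.1},
    "high": {"recharge": 0.0, "search": 1.0},
}


def sample_action(policy, state, rng):
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act(deterministic_policy, "low"))
print(sample_action(stochastic_policy, "high", rng))
```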

---



## Markov Assumption

The term Markov assumption is used to describe a model where the Markov property is assumed to hold, such as a hidden Markov model.

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

The future is independent of the past given the present.
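
A small simulation can make the Markov property concrete. In the sketch below, the two-state weather chain and its transition probabilities `P` are hypothetical; the point is that `next_state` looks only at the current state and never at the rest of the path.

```python
import random

# Hypothetical two-state chain: P[current][next] is a transition probability.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    # The distribution over the next state depends only on `current`,
    # not on how the chain got there.
    states = list(P[current])
    weights = [P[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate("sunny", 5)
print(path)
```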

## Fully Observable Environments 

Suppose the agent sees the full environment state. Then the agent state is equal to the observed state of the environment (e.g. board game states).

Let the current blood pressure be the current state, and let the action be whether or not to take medication. Is that Markov?

## Why is Markov so popular?

* It can always be satisfied, for example by using the full history as the state.
* In practice, we often assume that the most recent observation is a sufficient statistic of the history. However, this notion is changing with deep learning and the arrival of LSTMs etc.

## Partially Observable MDP (POMDP)

* The agent state is not the same as the world state.
* The environment state can still be Markov, but the agent doesn't know it. 

Example - Poker.


---



## RL algorithm components - 

Often includes one or more of - 

* Model - How the world changes in response to the agent's action.
* Policy - Mapping from agent state to actions.
* Value Function - Expected future reward from being in a state and/or taking an action when following a particular policy.


---

## Types of RL agents 

* Model based
* Model free


## Value Function 

Expected discounted sum of future rewards under a particular policy. It can be used to compare policies. 
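
To illustrate comparing policies by value, the sketch below computes the discounted sum of rewards for two policies in a tiny deterministic corridor. Everything here is an assumption made for the example: the corridor environment, the names `rollout_value`, `corridor_step`, `always_right`, and `always_left`, and the discount factor `gamma=0.9`. Because the environment is deterministic, one rollout gives the exact value; with randomness one would average over many rollouts.

```python
GOAL = 3  # hypothetical corridor with states 0..3; state 3 is terminal


def corridor_step(state, action):
    # action is +1 (right) or -1 (left); reaching GOAL pays 1 and ends.
    nxt = max(0, min(GOAL, state + action))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def rollout_value(policy, start, step_fn, gamma=0.9, max_steps=100):
    """Discounted sum of rewards from `start` when following `policy`."""
    state, value, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        state, reward, done = step_fn(state, policy(state))
        value += discount * reward
        discount *= gamma
        if done:
            break
    return value


def always_right(state):
    return +1


def always_left(state):
    return -1


v_right = rollout_value(always_right, 0, corridor_step)
v_left = rollout_value(always_left, 0, corridor_step)
print(v_right, v_left)
```

From state 0, `always_right` reaches the goal on its third step, so its value is about `0.9 ** 2`, while `always_left` never reaches the goal and has value 0. The first policy is therefore better from that state.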

**Bellman Equation** 