### **Installing TensorFlow Agents**

In [None]:
!pip install tf-agents[reverb]



### **Importing the required libraries**

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import abc
import tensorflow as tf
import numpy as np

from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts

### **Instantiating the CartPole-v0 environment**

In [None]:
environment = suite_gym.load('CartPole-v0')
print('action_spec:', environment.action_spec())
print('time_step_spec.observation:', environment.time_step_spec().observation)
print('time_step_spec.step_type:', environment.time_step_spec().step_type)
print('time_step_spec.discount:', environment.time_step_spec().discount)
print('time_step_spec.reward:', environment.time_step_spec().reward)

action_spec: BoundedArraySpec(shape=(), dtype=dtype('int64'), name='action', minimum=0, maximum=1)
time_step_spec.observation: BoundedArraySpec(shape=(4,), dtype=dtype('float32'), name='observation', minimum=[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], maximum=[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38])
time_step_spec.step_type: ArraySpec(shape=(), dtype=dtype('int32'), name='step_type')
time_step_spec.discount: BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0)
time_step_spec.reward: ArraySpec(shape=(), dtype=dtype('float32'), name='reward')
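
The bounds in the observation spec above can be reproduced with a little arithmetic. The sketch below assumes the thresholds used by Gym's CartPole implementation (a cart position limit of 2.4 and a pole angle limit of 12 degrees); these values are not shown in this notebook, so treat them as assumptions. The spec bounds are twice those thresholds, and the two velocity components are effectively unbounded, so they are capped at the largest `float32`.

```python
import numpy as np

# Assumed CartPole thresholds (from Gym's implementation, not from this notebook).
X_THRESHOLD = 2.4                        # episode ends when |cart position| exceeds this
THETA_THRESHOLD = np.deg2rad(12.0)       # episode ends when |pole angle| exceeds this (radians)
FLOAT32_MAX = np.finfo(np.float32).max   # stands in for "unbounded" velocities

# Order matches the observation: position, velocity, angle, angular velocity.
obs_high = np.array(
    [2 * X_THRESHOLD, FLOAT32_MAX, 2 * THETA_THRESHOLD, FLOAT32_MAX],
    dtype=np.float32,
)
obs_low = -obs_high

print('maximum:', obs_high)
print('minimum:', obs_low)
```

The printed arrays should match the `minimum` and `maximum` fields of `time_step_spec.observation`.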


### **Implementing the CartPole environment**

#### **About the environment**

The CartPole environment is one of the best-known classic reinforcement learning problems (the "Hello, World!" of RL). A pole is attached to a cart, which can move along a frictionless track. The pole starts upright and the goal is to prevent it from falling over by controlling the cart.

- The observation from the environment $s_t$ is a 4D vector representing the position and velocity of the cart, and the angle and angular velocity of the pole.

- The agent can control the system by taking one of 2 actions $a_t$: push the cart right (+1) or left (-1).

- A reward $r_{t+1} = 1$ is provided for every timestep that the pole remains upright. The episode ends when one of the following is true:
 - the pole tips over some angle limit
 - the cart moves outside of the world edges
 - 200 time steps pass.

- The goal of the agent is to learn a policy $\pi(a_t|s_t)$ so as to maximize the sum of rewards in an episode $\sum_{t=0}^{T} \gamma^t r_t$. Here $\gamma$ is a discount factor in $[0, 1]$ that discounts future rewards relative to immediate rewards. This parameter helps us focus the policy, making it care more about obtaining rewards quickly.


*Source: TensorFlow (tensorflow.org)*
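
As a rough illustration of the discounted sum of rewards described above, the sketch below computes $\sum_t \gamma^t r_t$ for an episode that reaches the 200-step limit with a reward of 1 per step. The helper name `discounted_return` and the example value of $\gamma$ are hypothetical and chosen only for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# A full-length episode: reward 1 on each of the 200 steps.
full_episode = np.ones(200)

undiscounted = discounted_return(full_episode, gamma=1.0)
discounted = discounted_return(full_episode, gamma=0.99)  # example gamma, an assumption

print('gamma = 1.00:', undiscounted)
print('gamma = 0.99:', discounted)
```

With $\gamma = 1$ the return is simply the episode length; with $\gamma < 1$ later rewards count for less, which is why a smaller discount factor makes the policy care more about rewards that arrive early.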

> **Running the environment for 50, 100 and 150 steps**

In [None]:
tf_env = tf_py_environment.TFPyEnvironment(environment)

num_steps = [50, 100, 150]
# NOTE: `reward` is initialised once, outside the loop, so the totals printed
# below accumulate across the 50-, 100- and 150-step runs.
reward = 0
for step in num_steps:
  transitions = []
  # reset() restarts the environment and returns the initial TimeStep.
  time_step = tf_env.reset()
  for i in range(step):
    # Alternate between the two actions: 0, 1, 0, 1, ...
    action = tf.constant([i % 2])
    # step() applies the action and returns the next TimeStep.
    next_time_step = tf_env.step(action)
    transitions.append([time_step, action, next_time_step])
    reward += next_time_step.reward
    time_step = next_time_step

  # Convert the collected tensors to NumPy arrays for inspection.
  np_transitions = tf.nest.map_structure(lambda x: x.numpy(), transitions)
  #print('\n'.join(map(str, np_transitions)))
  print(f'Total reward in {step} steps = {reward.numpy()}')

Total reward in 50 steps = [49.]
Total reward in 100 steps = [148.]
Total reward in 150 steps = [294.]
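
Because `reward` is set to 0 only once, before the outer loop, the three totals above are cumulative. A minimal sketch of how the per-run rewards can be recovered from the printed values (the variable names are illustrative, not part of the notebook):

```python
import numpy as np

step_counts = np.array([50, 100, 150])
cumulative_reward = np.array([49.0, 148.0, 294.0])  # totals printed above

# Differences between successive totals give the reward earned in each run.
per_run_reward = np.diff(cumulative_reward, prepend=0.0)
reward_per_step = per_run_reward / step_counts

for n, r, rate in zip(step_counts, per_run_reward, reward_per_step):
    print(f'{n} steps: reward {r:.0f} ({rate:.2f} per step)')
```

Each run earns slightly less than one reward per step, which suggests the total is driven mainly by how many steps are taken.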


> **Running the environment for 5, 10 and 15 episodes**

In [None]:
tf_env = tf_py_environment.TFPyEnvironment(environment)


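# NOTE: `rewards` and `steps` are created once, outside the loop, so the
# statistics printed below cover every episode run so far (5, then 15, then 30),
# not just the current batch of episodes.
rewards = []
steps = []
num_episodes = [5, 10, 15]

for episode in num_episodes:
  time_step = tf_env.reset()
  for _ in range(episode):
    episode_reward = 0
    episode_steps = 0
    # Run one episode to termination with a uniformly random policy.
    while not time_step.is_last():
      # Sample action 0 or 1 (the upper bound 2 is exclusive).
      action = tf.random.uniform([1], 0, 2, dtype=tf.int32)
      time_step = tf_env.step(action)
      episode_steps += 1
      episode_reward += time_step.reward.numpy()
    rewards.append(episode_reward)
    steps.append(episode_steps)
    time_step = tf_env.reset()

  # Aggregate over all episodes collected so far.
  num_steps = np.sum(steps)
  avg_length = np.mean(steps)
  avg_reward = np.mean(rewards)

  print(f'num_episodes: {episode} num_steps: {num_steps}')
  print(f'avg_length : {avg_length} avg_reward : {avg_reward}')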

num_episodes: 5 num_steps: 139
avg_length : 27.8 avg_reward : 27.799999237060547
num_episodes: 10 num_steps: 317
avg_length : 21.133333333333333 avg_reward : 21.133333206176758
num_episodes: 15 num_steps: 707
avg_length : 23.566666666666666 avg_reward : 23.566667556762695
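
The `rewards` and `steps` lists are never cleared between batches, so each printout averages over all episodes run so far; for example, 21.13 is 317 steps divided by 15 episodes, not by 10. The sketch below, with illustrative variable names, separates the printed figures into per-batch averages:

```python
import numpy as np

batch_sizes = np.array([5, 10, 15])           # episodes requested per batch
cumulative_steps = np.array([139, 317, 707])  # num_steps values printed above

# Episodes actually included in each printed average.
cumulative_episodes = np.cumsum(batch_sizes)
# Steps taken within each batch alone.
steps_per_batch = np.diff(cumulative_steps, prepend=0)
avg_length_per_batch = steps_per_batch / batch_sizes

print('episodes averaged over in each printout:', cumulative_episodes)
print('steps per batch:', steps_per_batch)
print('average episode length per batch:', avg_length_per_batch)
```

Since every step of an episode earns a reward of 1, the average episode length and the average reward coincide, as the printed output shows.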


### **Comparison and Comments on both approaches**

> In the step-wise approach, the total reward depends heavily on the number of steps: almost every step earns a reward of 1, so the total grows roughly in proportion to the step count. Note that `reward` is never reset between runs, so the printed totals (49, 148 and 294) are cumulative; the individual 50-, 100- and 150-step runs contribute 49, 99 and 146.

> The episodic approach gives a similar average reward per episode (roughly 21 to 28) for all three choices of the number of episodes. These averages are likewise computed over all episodes run so far (5, 15 and 30).

> The episodic approach takes more steps overall (707 in total, against 300 for the step-wise approach), but its average reward per episode is far less sensitive to the amount of interaction than the step-wise total is.

> This comparison was only possible because several step counts and episode counts were tried.