<center><h1>Cart Pole Game</h1></center>
<center><img src="images/cart_pole.gif"></center>

A pole stands upright on a cart, and the goal is to keep the pole balanced by pushing the cart to the left or to the right.

## OpenAI CartPole Gym Environment

### Action Space
The action is an ndarray with shape (1,) that takes a value in {0, 1}, indicating the direction of the fixed force with which the cart is pushed.

<center>
<table>
<thead>
<tr><th><p>Num</p></th>
<th><p>Action</p></th>
</tr>
</thead>
<tbody>
<tr><td><p>0</p></td>
<td><p>Push cart to the left</p></td>
</tr>
<tr><td><p>1</p></td>
<td><p>Push cart to the right</p></td>
</tr>
</tbody>
</table>
</center>

### Observation Space
The observation is an ndarray with shape (4,) whose values correspond to the following positions and velocities:

<center><table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head"><p>Num</p></th>
<th class="head"><p>Observation</p></th>
<th class="head"><p>Min</p></th>
<th class="head"><p>Max</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>0</p></td>
<td><p>Cart Position</p></td>
<td><p>-4.8</p></td>
<td><p>4.8</p></td>
</tr>
<tr class="row-odd"><td><p>1</p></td>
<td><p>Cart Velocity</p></td>
<td><p>-Inf</p></td>
<td><p>Inf</p></td>
</tr>
<tr class="row-even"><td><p>2</p></td>
<td><p>Pole Angle</p></td>
<td><p>~ -0.418 rad (-24°)</p></td>
<td><p>~ 0.418 rad (24°)</p></td>
</tr>
<tr class="row-odd"><td><p>3</p></td>
<td><p>Pole Angular Velocity</p></td>
<td><p>-Inf</p></td>
<td><p>Inf</p></td>
</tr>
</tbody>
</table>
</center>

### Rewards
Since the goal is to keep the pole upright for as long as possible, a reward of +1 is given for every step taken, including the termination step.

### Starting State
Each of the four observation values is initialized to a uniformly random value in (-0.05, 0.05).
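
As a rough illustration of the starting state described above, the sketch below draws four values uniformly from the stated range using only the standard library. The helper name `sample_initial_state` is hypothetical and is not part of the Gym API; the environment does its own sampling inside `env.reset()`.

```python
import random

def sample_initial_state(rng, low=-0.05, high=0.05, size=4):
    """Draw `size` values uniformly at random between `low` and `high`."""
    return [rng.uniform(low, high) for _ in range(size)]

rng = random.Random(0)  # seeded only so that the sketch is reproducible
state = sample_initial_state(rng)

# Order matches the observation table:
# cart position, cart velocity, pole angle, pole angular velocity
print(state)
```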

### Episode End

The episode ends if any one of the following occurs:
<ol class="arabic simple">
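<li><p>Pole angle leaves the ±12° range</p></li>
<li><p>Cart position leaves the ±2.4 range (center of the cart reaches the edge of the display)</p></li>
<li><p>Episode length exceeds 500 steps for CartPole-v1 (this is reported as truncation rather than termination)</p></li>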
</ol>
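
The angle and position conditions above can be sketched as a small check on a single observation. This is only an illustration of the thresholds in the list, not the environment's own source; the names `is_terminal`, `ANGLE_LIMIT_RAD` and `POSITION_LIMIT` are made up for this example.

```python
import math

# Thresholds taken from the list above (illustrative names)
ANGLE_LIMIT_RAD = math.radians(12)  # 12 degrees is about 0.2095 rad
POSITION_LIMIT = 2.4

def is_terminal(observation):
    """Return True if the pole angle or the cart position is out of range."""
    cart_position, _cart_velocity, pole_angle, _pole_angular_velocity = observation
    return abs(cart_position) > POSITION_LIMIT or abs(pole_angle) > ANGLE_LIMIT_RAD

print(is_terminal([0.0, 0.0, 0.05, 0.0]))   # inside both limits
print(is_terminal([2.5, 0.0, 0.0, 0.0]))    # cart too far to the right
print(is_terminal([0.0, 0.0, -0.25, 0.0]))  # pole tilted past 12 degrees
```

Note that these limits are tighter than the observation-space bounds in the table (±4.8 and about ±24°), so an episode ends well before an observation reaches those bounds.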

In [16]:
import gym
import random
import time

In [17]:
env = gym.make('CartPole-v1', render_mode='human')
env.reset()

(array([ 0.04530601, -0.04711993, -0.00513815,  0.04254381], dtype=float32),
 {})

In [18]:
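episodes = 0
step = 0
while episodes < 25:
    last_time = time.time()
    
    # Pick a random action: 0 pushes left, 1 pushes right
    action = random.choice([0, 1])
    state, reward, terminated, truncated, _ = env.step(action)
    step += 1
    
    # The episode is over on termination (failure) or truncation (time limit)
    if terminated or truncated:
        time.sleep(0.5)
        episodes += 1
        print(f'Episode {episodes} lasted {step} steps')
        step = 0
        env.reset()
        
    # Cap the loop at roughly 24 frames per second
    time.sleep(max(1/24 - (time.time() - last_time), 0))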

Episode 1 lasted 29 steps
Episode 2 lasted 26 steps
Episode 3 lasted 20 steps
Episode 4 lasted 12 steps
Episode 5 lasted 14 steps
Episode 6 lasted 11 steps
Episode 7 lasted 56 steps
Episode 8 lasted 16 steps
Episode 9 lasted 21 steps
Episode 10 lasted 19 steps
Episode 11 lasted 48 steps
Episode 12 lasted 10 steps
Episode 13 lasted 25 steps
Episode 14 lasted 13 steps
Episode 15 lasted 23 steps
Episode 16 lasted 36 steps
Episode 17 lasted 15 steps
Episode 18 lasted 15 steps
Episode 19 lasted 24 steps
Episode 20 lasted 22 steps
Episode 21 lasted 15 steps
Episode 22 lasted 35 steps
Episode 23 lasted 16 steps
Episode 24 lasted 37 steps
Episode 25 lasted 20 steps


### Average reward with Random Action

In [19]:
env = gym.make('CartPole-v1')

rew_array = []
for i in range(100):
    # reset() returns (observation, info)
    state, _ = env.reset()
    done, rew = False, 0
    
    while not done:
        action = random.choice([0, 1])
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        rew += reward
    
    rew_array.append(rew)
    
print("average reward per episode :", sum(rew_array) / len(rew_array))

average reward per episode : 20.32


In [20]:
env.close()

## Deep Q Network and DQN Agent

In [21]:
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque

In [22]:
torch.manual_seed(42)
random.seed(42)

### Building the Policy Network

<center><img src="images/q_net.webp"></center>

In [8]:
class PolicyModel(nn.Module):
    def __init__(self, state_size, action_size, learning_rate):
        super(PolicyModel, self).__init__()

        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

        self.relu = nn.ReLU()

        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.criterion = nn.MSELoss()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

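    def fit(self, state, target_f):
        # One gradient step that moves the prediction for `state` toward `target_f`
        self.train()

        state_tensor = torch.as_tensor(state, dtype=torch.float32)
        # target_f may already be a tensor: as_tensor avoids the copy-construct
        # warning, and detach() keeps gradients from flowing through the target
        target_f_tensor = torch.as_tensor(target_f, dtype=torch.float32).detach()

        self.optimizer.zero_grad()

        output = self(state_tensor)

        loss = self.criterion(output, target_f_tensor)
        loss.backward()
        self.optimizer.step()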
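
To get a feel for how small this network is, the sketch below counts its trainable parameters by hand. It assumes the sizes used later in the notebook (4 inputs, 2 actions) and the default bias term of `nn.Linear`; the helper `linear_param_count` is a made-up name for this example.

```python
def linear_param_count(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features

state_size, hidden_size, action_size = 4, 24, 2

# fc1, fc2, fc3 as defined in PolicyModel
layer_shapes = [
    (state_size, hidden_size),
    (hidden_size, hidden_size),
    (hidden_size, action_size),
]
per_layer = [linear_param_count(i, o) for i, o in layer_shapes]
total = sum(per_layer)

print(per_layer, total)  # [120, 600, 50] 770
```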

### Building the DQN Agent

In [9]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size

        self.memory = deque(maxlen=10000)
        self.gamma = 0.95
        
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.05
        
        self.learning_rate = 0.001
        self.model = PolicyModel(self.state_size, self.action_size, self.learning_rate)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append([state, action, reward, next_state, done])

    def act(self, state, train=True):
        if random.random() <= self.epsilon and train:
            return random.randrange(self.action_size)
        else:
            self.model.eval()
            state_tensor = torch.tensor(state, dtype=torch.float32)
            act_values = self.model(state_tensor)
            return torch.argmax(act_values).item()

    def replay(self, batch_size):
        self.model.eval()
        minibatch = random.sample(self.memory, batch_size)

        for state, action, reward, next_state, done in minibatch:
            # Targets are constants, so compute them without tracking gradients
            with torch.no_grad():
                target = reward

                if not done:
                    next_state_tensor = torch.tensor(next_state, dtype=torch.float32)
                    # Bellman target: reward plus discounted best next Q-value
                    target = reward + self.gamma * torch.max(self.model(next_state_tensor)).item()

                state_tensor = torch.tensor(state, dtype=torch.float32)
                # Keep the current predictions and overwrite only the taken action
                target_f = self.model(state_tensor).clone()
                target_f[0][action] = target

            self.model.fit(state, target_f)

        # Decay the exploration rate once per replay call
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
            
    def save(self, name):
        torch.save(self.model.state_dict(), name)
        
    def load(self, name):
        self.model.load_state_dict(torch.load(name))
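
Read as math, the `replay` method appears to fit the network to a one-step Q-learning target. The notation below ($Q$ for the network output, $s'$ for the next state, $\mathcal{A}$ for the action set) is introduced here for illustration and does not appear in the code:

$$
y =
\begin{cases}
r & \text{if the episode ended at this step} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
\qquad \gamma = 0.95
$$

Because the target vector equals the current prediction $Q(s, \cdot)$ everywhere except at the chosen action $a$, the mean squared error reduces to a single term:

$$
L = \frac{1}{|\mathcal{A}|} \bigl(Q(s, a) - y\bigr)^2, \qquad |\mathcal{A}| = 2
$$

As a worked example with made-up numbers: for $r = 1$ and $\max_{a'} Q(s', a') = 10$, the target is $y = 1 + 0.95 \cdot 10 = 10.5$.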

## Learning to Play the CartPole game

In [23]:
import numpy as np
import plotly.express as px
import ipywidgets as widgets
from IPython.display import display

In [24]:
env = gym.make('CartPole-v1')

In [25]:
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
state_size, action_size

(4, 2)

In [26]:
batch_size = 128
agent = DQNAgent(state_size, action_size)
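
The agent stores transitions in a bounded `deque` and samples minibatches from it. The sketch below shows the same idea with a tiny capacity so that the eviction of old entries is visible; the transition tuples are placeholders, not real environment data.

```python
import random
from collections import deque

memory = deque(maxlen=5)  # tiny capacity, for illustration only

for step in range(8):
    # Placeholder transition: (state, action, reward, next_state, done)
    memory.append((step, 0, 1.0, step + 1, False))

# Only the 5 most recent transitions remain
print([transition[0] for transition in memory])  # [3, 4, 5, 6, 7]

# Sample without replacement once enough transitions are stored
batch_size = 3
if len(memory) > batch_size:
    minibatch = random.sample(memory, batch_size)
    print(len(minibatch))
```

With `maxlen=10000` and `batch_size = 128` as in the notebook, no learning can happen until more than 128 transitions have been collected.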

In [14]:
scores = []
episode = 0

# Train until the mean score over the last 10 episodes reaches 250
while sum(scores[-10:])/10 < 250:
    episode += 1
    
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    
    for step in range(1000):
        
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        
        next_state = np.reshape(next_state, [1, state_size])
        
        # Penalize failure with -10; hitting the time limit is not a failure
        agent.remember(state, action, -10 if terminated else reward, next_state, terminated)
        
        if terminated or truncated:
            print("episode: {},\t score: {},\t epsilon: {:.2}".format(episode, step, agent.epsilon))
            break
        
        state = next_state
            
    scores.append(step)
    
    # Learn only once the memory holds more than one batch of transitions
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)
        
    # Save a checkpoint every 100 episodes
    if episode % 100 == 0:
        agent.save(f"./models/{episode}.h5")
        
# agent.save("./models/best.h5")


episode: 1,	 score: 10,	 epsilon: 1.0
episode: 2,	 score: 12,	 epsilon: 1.0
episode: 3,	 score: 32,	 epsilon: 1.0
episode: 4,	 score: 18,	 epsilon: 1.0
episode: 5,	 score: 19,	 epsilon: 1.0
episode: 6,	 score: 12,	 epsilon: 1.0
episode: 7,	 score: 14,	 epsilon: 1.0
episode: 8,	 score: 34,	 epsilon: 1.0
episode: 9,	 score: 73,	 epsilon: 0.99
episode: 10,	 score: 21,	 epsilon: 0.99
episode: 11,	 score: 13,	 epsilon: 0.99
episode: 12,	 score: 25,	 epsilon: 0.98
episode: 13,	 score: 39,	 epsilon: 0.98
episode: 14,	 score: 10,	 epsilon: 0.97
episode: 15,	 score: 33,	 epsilon: 0.97
episode: 16,	 score: 32,	 epsilon: 0.96
episode: 17,	 score: 12,	 epsilon: 0.96
episode: 18,	 score: 10,	 epsilon: 0.95
episode: 19,	 score: 23,	 epsilon: 0.95
episode: 20,	 score: 9,	 epsilon: 0.94
episode: 21,	 score: 17,	 epsilon: 0.94
episode: 22,	 score: 11,	 epsilon: 0.93
episode: 23,	 score: 24,	 epsilon: 0.93
episode: 24,	 score: 31,	 epsilon: 0.92
episode: 25,	 score: 11,	 epsilon: 0.92
episode: 26,	 scor
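
The slowly shrinking epsilon in the log follows from the decay rule in `replay`, which multiplies epsilon by 0.995 at most once per episode. A minimal sketch of that schedule in isolation, using the agent's hyperparameters, shows how many replay calls it takes to reach the floor:

```python
epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.05

decays = 0
# Same rule as in DQNAgent.replay: decay only while above the minimum
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    decays += 1

print(decays)             # 598
print(round(epsilon, 4))  # 0.0499
```

Exploration therefore stays noticeable for several hundred episodes, since each episode triggers at most one replay call.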

### Visualize the training of the agent

In [15]:
data = scores
fig = px.line(x=range(len(data)), y=data, labels={'x': 'Episode', 'y': 'Score'})
fig.show()
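
Raw episode scores are noisy, so a smoothed curve can make the trend easier to read. The helper below is one possible way to compute a trailing moving average in plain Python; `moving_average` is a hypothetical name, and the short score list is copied from the first lines of the training log.

```python
def moving_average(values, window=10):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

example_scores = [10, 12, 32, 18, 19]
print(moving_average(example_scores, window=3))
```

The result could be passed to `px.line` in place of `data`. Unlike the stopping criterion in the training loop, which always divides by 10, this variant divides by the number of values actually in the window.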

### Playing the game using a trained model

In [29]:
agent.load("./models/best.h5")

In [30]:
env = gym.make('CartPole-v1', render_mode='human')

for i in range(5):
    state, _ = env.reset()
    done, score = False, 0
    
    text_widget = widgets.Text(value=f"score: {score}")
    display(text_widget)
    
    while not done:
        last_time = time.time()
        
        # Act greedily: no random exploration while playing
        action = agent.act(state, train=False)
        state, _, terminated, truncated, _ = env.step(action)
        # Stop on failure or when the 500-step time limit is reached
        done = terminated or truncated
        
        score += 1
        text_widget.value = f"score: {score}"
        
        # Cap the loop at roughly 24 frames per second
        time.sleep(max(1/24 - (time.time() - last_time), 0))
        
env.close()
