# Creating an AI to Play OpenAI's CartPole Simulation

### Deep Q-Network (DQN)

Because the state space of this environment is continuous, storing a value for every possible state-action pair is impractical and would use a substantial amount of resources for a fairly simple environment.  Instead, we will be using a deep Q-network (DQN).  A DQN works by approximating the optimal action-value function with a neural network.

In this approach, a simple neural network will be used to approximate the optimal value function for the CartPole scenario!
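
For reference, the quantity the network is trying to approximate can be written down as the standard Bellman optimality equation. The symbols here are my own notation rather than anything defined earlier in this notebook: $s$ is the current state, $a$ the action, $r$ the reward, $s'$ the next state, and $\gamma$ the discount factor.

$$
Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\right]
$$

Roughly speaking, training nudges the network's output for $(s, a)$ toward the right-hand side, computed from sampled transitions.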

The tutorial I will be following can be found [here](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html).

In [1]:
# Enables intellisense (press TAB after the .)
%config IPCompleter.greedy=True

import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple
from itertools import count
from PIL import Image

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T


### Defining the environment, and the plot

In [3]:
env = gym.make('CartPole-v0').unwrapped

is_ipython = 'inline' in matplotlib.get_backend()

if is_ipython:
    from IPython import display
plt.ion()

# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


The DQN will use replay memory: the agent stores the transitions it observes and later samples them at random for training.  Because the samples are drawn randomly, the transitions that make up a batch are decorrelated, which helps stabilize training.

There are two classes involved with this: the `Transition` class and the `ReplayMemory` class.

- `Transition`: A named tuple that represents a single transition in an environment.
- `ReplayMemory`: A cyclic buffer of bounded size that maintains recent transitions.  It contains a `.sample()` method to randomly retrieve a batch of transitions.

In [4]:
Transition = namedtuple('Transition',('state','action','next_state','reward'))

class ReplayMemory(object):
    def __init__(self,capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
    
    def push(self, *args):
        """Saves the transition"""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity
    
    def sample(self, batch_size):
        """Returns a random batch of stored transitions"""
        return random.sample(self.memory,batch_size)

    def __len__(self):
        return len(self.memory)
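
To make the cyclic-buffer behaviour concrete, here is a minimal sketch that is separate from the class above. It swaps in `collections.deque` with `maxlen` as a stand-in for `ReplayMemory` (my substitution, not the tutorial's code), and the capacity of 5 and the dummy transitions are made up for illustration.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

# Hypothetical stand-in for ReplayMemory: a deque with maxlen drops the
# oldest item automatically once the buffer is full.
memory = deque(maxlen=5)

# Push 8 dummy transitions; only the 5 most recent ones survive.
for t in range(8):
    memory.append(Transition(state=t, action=t % 2, next_state=t + 1, reward=1.0))

surviving_states = [tr.state for tr in memory]

# Draw a random minibatch without replacement.
random.seed(0)
batch = random.sample(list(memory), 3)

# Regroup a list of Transitions into one Transition of tuples, field by field.
columns = Transition(*zip(*batch))
```

The last line is the usual trick for turning a batch of transitions into per-field tuples (all states together, all actions together, and so on), which is the shape a training step typically wants.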

## Defining the DQN

In [5]:
class DQN(nn.Module):
    
    def __init__(self):
        super(DQN,self).__init__()
        self.conv1 = nn.Conv2d(3,16, kernel_size=5,stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16,32,kernel_size=5,stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32,32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)
        self.head = nn.Linear(448,2)
    
    def forward(self,x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0),-1))
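
The `448` passed to `nn.Linear` is not arbitrary. As a rough check, the sketch below walks an input through three unpadded convolutions with `kernel_size=5` and `stride=2`. It assumes the network is fed 40 x 80 (height x width) images; `conv2d_size_out` is a helper name I am introducing here, not part of PyTorch.

```python
def conv2d_size_out(size, kernel_size=5, stride=2):
    """Output length of one unpadded Conv2d along a single dimension."""
    return (size - kernel_size) // stride + 1

# Assumed input size: a 40 x 80 (height x width) screen crop.
height, width = 40, 80
for _ in range(3):  # conv1, conv2, conv3
    height = conv2d_size_out(height)
    width = conv2d_size_out(width)

out_channels = 32  # number of channels produced by conv3
linear_input_size = out_channels * height * width
print(height, width, linear_input_size)  # 2 7 448
```

If the input resolution changes, the flattened size changes with it, so the first argument of `self.head` would need to be recomputed in the same way.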
        

## Retrieving input from the CartPole simulation!

In [7]:
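# Image.BICUBIC is the same filter as the older Image.CUBIC alias, which newer Pillow releases no longer provide
resize = T.Compose([T.ToPILImage(), T.Resize(40, interpolation=Image.BICUBIC), T.ToTensor()])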

screen_width = 600

def get_cart_location():
    world_width = env.x_threshold * 2
    scale = screen_width / world_width
    
    # returning the location of the middle of the cart
    return int(env.state[0] * scale + screen_width / 2.0)

def get_screen():
    screen = env.render(mode='rgb_array').transpose((2,0,1))
    screen = screen[:,160:320]  # keep rows 160-319 (strip off the top and bottom of the screen)
    view_width = 320 
    cart_location = get_cart_location()
    if cart_location  < view_width // 2:
        slice_range = slice(view_width)
    elif cart_location > (screen_width - view_width//2):
        slice_range = slice(-view_width, None)
    else:
        slice_range = slice(cart_location - view_width//2, cart_location + view_width//2)
    
    # Strip off the edges so we are left with the view of the cart itself
    screen = screen[:,:,slice_range]
    
    # convert to float, and then convert to torch tensor
    screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
    screen = torch.from_numpy(screen)
    
    # resize, and add batch dimension
    return resize(screen).unsqueeze(0).to(device)
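
The horizontal cropping logic can be tried out without rendering anything. The sketch below repeats the same arithmetic in two standalone helpers; `world_to_pixel` and `crop_window` are names I made up for illustration, and the default `x_threshold=2.4` is an assumption about the CartPole environment rather than a value read from `env`.

```python
def world_to_pixel(x, x_threshold=2.4, screen_width=600):
    """Map a cart position in world units to a pixel column."""
    scale = screen_width / (x_threshold * 2)
    return int(x * scale + screen_width / 2.0)

def crop_window(cart_location, screen_width=600, view_width=320):
    """Return the horizontal slice that keeps the cart in view."""
    half = view_width // 2
    if cart_location < half:
        return slice(view_width)           # cart near the left edge
    if cart_location > screen_width - half:
        return slice(-view_width, None)    # cart near the right edge
    return slice(cart_location - half, cart_location + half)

center = world_to_pixel(0.0)               # cart in the middle of the track
print(center, crop_window(center))
```

When the cart is near either edge, the window is pinned to that edge instead of being centered, so the crop is always `view_width` pixels wide.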

# Training


## Hyperparameters and utilities for training

In [None]:
BATCH_SIZE = 128
GAMMA = 0.999