<a href="https://colab.research.google.com/github/vsoni03/AI-projects/blob/main/Personal_Deep_Convolutional_Q_Learning_for_Pac_Man.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Convolutional Q-Learning for Pac-Man

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!pip install ale-py
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


This deep convolutional Q-learning model does not solve every environment. It needs an environment that is partly deterministic, which makes the problem less complex and the convolutional model easier to train.

### Importing the libraries

In [None]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque
from torch.utils.data import DataLoader, TensorDataset

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [None]:
class NeuralNetwork(nn.Module):
  def __init__(self, action_size, seed = 42):
    super(NeuralNetwork, self).__init__()
    # Seed the random number generator for reproducible weight initialization
    self.seed = torch.manual_seed(seed)
    # Layer 1: convolution with 3 input channels (RGB), 32 output channels, kernel size 8, stride 4
    self.conv1 = nn.Conv2d(3, 32,  kernel_size = 8, stride = 4)
    # Batch normalization over the 32 feature maps from the previous layer
    self.bn1 = nn.BatchNorm2d(32)

    # Layer 2: takes the previous 32 channels and increases the output to 64; kernel size and stride decrease
    self.conv2 = nn.Conv2d(32, 64,  kernel_size = 4, stride = 2)
    # Batch normalization over the 64 feature maps from the previous layer
    self.bn2 = nn.BatchNorm2d(64)

    # Layer 3: keeps 64 output channels
    self.conv3 = nn.Conv2d(64, 64,  kernel_size = 3, stride = 1)
    # Batch normalization over the 64 feature maps output by the previous convolution
    self.bn3 = nn.BatchNorm2d(64)

    # Layer 4: increases the output to 128 channels
    self.conv4 = nn.Conv2d(64, 128,  kernel_size = 3, stride = 1)
    # Batch normalization over the 128 feature maps output by the previous convolution
    self.bn4 = nn.BatchNorm2d(128)
    # First fully connected layer: its input is the flattened output of the convolutions
    # (height * width * channels = 10 * 10 * 128) and its output is a vector of size 512
    self.fc1 = nn.Linear(10* 10 * 128, 512)
    # Each following layer takes the previous layer's output as its input
    self.fc2 = nn.Linear(512, 256)
    self.fc3 = nn.Linear(256, action_size)

  def forward(self, state):
    x = F.relu(self.bn1(self.conv1(state)))
    x = F.relu(self.bn2(self.conv2(x)))
    x = F.relu(self.bn3(self.conv3(x)))
    x = F.relu(self.bn4(self.conv4(x)))
    # Flatten the tensor: the first (batch) dimension stays the same
    # and the remaining dimensions are flattened into one
    x = x.view(x.size(0), -1)
    # Forward propagate through the fully connected layers
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    return self.fc3(x)

**Constructor**: The class inherits from `nn.Module`, the base class for all neural network modules in PyTorch. The constructor initializes the layers and sets up the network's architecture. It also sets a seed (42 by default) for the random number generator, which makes the random weight initialization reproducible.

The network has four convolutional layers, each followed by a batch normalization layer. The first convolutional layer takes 3 input channels (RGB) and produces 32 output channels. `kernel_size` defines the size of the filter that captures features in the input, such as edges and textures. `stride` controls how far the kernel moves across the input at each step: a higher stride produces a smaller output, while a lower stride keeps the output closer in size to the input. Across the four layers, the number of output channels increases (32, 64, 64, 128) while the kernel size and stride decrease. More channels in the deeper layers let the network capture more complex and more detailed patterns.

Each batch normalization layer normalizes the outputs of the preceding convolution across the mini-batch, which helps stabilize and speed up training. It computes the mean and standard deviation of the activations in each channel and uses them to rescale the output to a mean of 0 and a standard deviation of 1 (before a learnable scale and shift). It can also have a mild regularizing effect that helps the model generalize.

After the fourth batch normalization, the feature maps are flattened and passed through fully connected layers. The first maps the flattened vector (10 * 10 * 128 values) to a vector of size 512, reducing the dimensionality while retaining key information. The second reduces it further to 256, and the last outputs `action_size` values, one per action, which are the network's final predictions.
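
As a rough illustration of the normalization that a batch normalization layer performs, the sketch below standardizes a small list of activations to mean 0 and standard deviation 1. It is plain Python, not PyTorch's `nn.BatchNorm2d`; the helper name `normalize_batch` and the `eps` constant are assumptions made for this example, and the learnable scale and shift that the real layer applies afterwards are left out.

```python
import math

def normalize_batch(values, eps=1e-5):
    """Standardize a list of activations to mean 0 and (roughly) std 1."""
    mean = sum(values) / len(values)
    # Biased variance (divide by N), as used for normalization during training
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps avoids a division by zero when all values are identical
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

# Hypothetical activations of one channel across a mini-batch
activations = [2.0, 4.0, 6.0, 8.0]
normalized = normalize_batch(activations)
print([round(v, 3) for v in normalized])
```

The normalized values are centered on zero and keep their relative ordering, which is the property the later layers benefit from.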

A kernel is a small window that slides over an image. At each position, a 3 x 3 kernel looks at a 3 x 3 patch, multiplies the pixel values by its own numbers (weights), and adds them up to get a new value, which becomes one pixel in the output. Different kernels detect different features, such as edges or patterns. In effect, the kernel zooms in on one block of the image at a time and measures how closely that block matches the filter before sliding to the next block. This sliding process is the convolution itself (pooling is a separate operation, and this network does not use it). Filters in deeper layers can respond to more specific features.
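
To make the sliding-window idea concrete, here is a minimal pure-Python sketch of a single 3 x 3 kernel moving over a tiny grayscale image. The function name `convolve2d`, the toy image, and the vertical-edge kernel are all made up for illustration; `nn.Conv2d` performs the same kind of arithmetic across many channels and kernels at once.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) and return the output grid."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the patch by the kernel weights element by element and sum
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Toy image: dark on the left, bright on the right
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Kernel that responds to a dark-to-bright transition from left to right
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, edge_kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The output is zero where the patch is uniform and positive where the patch contains the edge, which is what "detecting a feature" means here.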

Stride is how many pixels the kernel moves each time. A stride of 1 means the kernel moves 1 pixel at a time, overlapping heavily with its previous position. A stride of 2 means it moves 2 pixels at a time, covering more ground per step and producing a smaller output.
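
The effect of kernel size and stride on the output can be checked with the standard size formula for an unpadded convolution: output = floor((input - kernel) / stride) + 1. The sketch below applies it to the four convolutional layers defined above, assuming the frames are resized to 128 x 128 before entering the network, as the preprocessing step later in this notebook does. The helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(size, kernel_size, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel_size) // stride + 1

# (kernel_size, stride) of conv1..conv4 in the NeuralNetwork class
layers = [(8, 4), (4, 2), (3, 1), (3, 1)]

size = 128  # assumed input height and width
sizes = []
for kernel_size, stride in layers:
    size = conv_output_size(size, kernel_size, stride)
    sizes.append(size)

print(sizes)  # [31, 14, 12, 10]

# 128 channels come out of the last convolution
flattened = size * size * 128
print(flattened)  # 12800
```

The final 10 x 10 spatial size with 128 channels matches the `10 * 10 * 128` input size of `fc1`.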



**Forward**: This defines the forward pass of the neural network in PyTorch. It describes how the input is processed layer by layer to produce the final output. The input is typically an image or a batch of images. Each convolutional layer is followed by batch normalization and an activation function: the convolution is applied to the input `x`, batch normalization normalizes the result to stabilize training, and the ReLU activation replaces negative values with zero, which introduces non-linearity and helps the network learn complex patterns. The output of the last convolutional block is then flattened from a multi-dimensional tensor into a 1D vector for each image in the batch; `x.size(0)` keeps the batch dimension unchanged while the remaining dimensions are flattened. After flattening, the vector passes through the fully connected layers. The method returns one Q-value per action.
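
The two steps of the forward pass that have no learnable weights, ReLU and flattening, are simple enough to sketch in plain Python. The helpers `relu` and `flatten` below are illustrative stand-ins that operate on nested lists; they are not the PyTorch functions themselves.

```python
def relu(values):
    """Replace negative values with zero and keep the rest unchanged."""
    return [max(0.0, v) for v in values]

def flatten(feature_maps):
    """Flatten one sample of shape (channels, height, width) into a 1D list."""
    return [v for channel in feature_maps for row in channel for v in row]

activated = relu([-2.0, -0.5, 0.0, 1.5])
print(activated)  # [0.0, 0.0, 0.0, 1.5]

# One toy sample with 2 channels of 2 x 2 values
sample = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
flat = flatten(sample)
print(len(flat))  # 8 = 2 * 2 * 2
```

In the network, the same flattening turns each sample of shape (128, 10, 10) into a vector of 12800 values while the batch dimension is left alone.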

## Part 2 - Training the AI

### Setting up the environment

In [None]:
import ale_py
import gymnasium as gym
# Use the deterministic variant of the environment and set full_action_space to False
# to keep the action space smaller and the model easier to train
env = gym.make('MsPacmanDeterministic-v0', full_action_space = False)
state_shape = env.observation_space.shape
state_size = env.observation_space.shape[0]
number_actions = env.action_space.n
print('State shape: ', state_shape)
print('State size: ', state_size)
print('Number of actions: ', number_actions)

State shape:  (210, 160, 3)
State size:  210
Number of actions:  9




Each frame is 210 pixels high and 160 pixels wide, with 3 RGB channels. `state_size` is 210 because it is only the first dimension of that shape (the height), and there are 9 possible actions.

### Initializing the hyperparameters

In [None]:
# Learning rate chosen after much experimentation
learning_rate  = 5e-4
# Number of observations sampled per step to update the model's parameters
# 64 is a common choice; there is no single optimal size
minibatch_size = 64
# Discount factor: the present value of future rewards
# A small value makes the agent shortsighted, looking only at immediate rewards
# A value close to 1 makes it weigh future rewards in the accumulated total reward,
# which is what we want here
discount_factor = 0.99
# No soft-update (interpolation) parameter is used in this implementation

The parameters are almost the same as before, except that the minibatch size decreases to 64. No separate replay buffer size hyperparameter is needed, because the replay memory is implemented in a simpler way inside the agent. The interpolation parameter used for soft updates is also dropped, as soft updates did not improve learning for this Pac-Man model.
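
To see why a discount factor close to 1 makes the agent less shortsighted, the following sketch computes the discounted return of the same reward sequence for two values of the discount factor. The function name `discounted_return` and the reward sequence are invented for illustration.

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards where the reward at step t is weighted by discount_factor ** t."""
    return sum(r * discount_factor ** t for t, r in enumerate(rewards))

# A small immediate reward followed by a large delayed one
rewards = [10, 0, 0, 0, 0, 0, 0, 0, 0, 100]

farsighted = discounted_return(rewards, 0.99)
shortsighted = discounted_return(rewards, 0.5)
print(round(farsighted, 2), round(shortsighted, 2))
```

With 0.99 the delayed reward still dominates the return, while with 0.5 it contributes almost nothing, so an agent trained with the smaller value has little reason to plan for it.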

### Preprocessing the frames

In [None]:
from PIL import Image
from torchvision import transforms

def preprocess_frame(frame):
    # Convert the frame from a NumPy array to a PIL Image object
    frame = Image.fromarray(frame)
    # Compose the list of transformations to apply:
    # resize the 210 x 160 frame to a smaller square (128 x 128),
    # then convert it to a PyTorch tensor with values normalized to [0, 1]
    preprocess = transforms.Compose([transforms.Resize((128, 128)), transforms.ToTensor()])
    return preprocess(frame).unsqueeze(0)

This function preprocesses a frame to make it suitable for the PyTorch model. The input frame is expected to be a NumPy array. It is first converted into a PIL Image object, because `transforms.Resize` expects a PIL image or a tensor rather than a raw NumPy array. `transforms.Compose` combines multiple preprocessing steps into a single pipeline, applied sequentially to the input frame. The frame is resized to a fixed 128 x 128 pixels so that every input to the network has a uniform size, which most neural networks require because input images might originally have different dimensions. Finally, the PIL image is converted into a PyTorch tensor, and `unsqueeze(0)` adds a batch dimension at the front.
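
The shape changes in this pipeline can be traced without processing an actual image. The sketch below only does the bookkeeping in plain Python; `preprocessed_shape` and `scale_pixel` are hypothetical helpers written for this illustration, not torchvision functions.

```python
def preprocessed_shape(frame_shape, target_size=(128, 128)):
    """Follow the shape of one frame through Resize, ToTensor, and unsqueeze(0)."""
    height, width, channels = frame_shape                     # NumPy layout: H x W x C
    resized = (target_size[0], target_size[1], channels)     # after Resize
    channels_first = (resized[2], resized[0], resized[1])    # after ToTensor: C x H x W
    batched = (1,) + channels_first                           # after unsqueeze(0)
    return batched

def scale_pixel(value):
    """ToTensor maps 8-bit pixel values from [0, 255] to floats in [0.0, 1.0]."""
    return value / 255.0

print(preprocessed_shape((210, 160, 3)))  # (1, 3, 128, 128)
print(scale_pixel(0), scale_pixel(255))   # 0.0 1.0
```

The resulting shape, (batch, channels, height, width), is the layout that `nn.Conv2d` expects as input.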

### Implementing the DCQN class

In [None]:
class Agent():
  def __init__(self, action_size):
    # Use the GPU if available, otherwise the CPU
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.action_size = action_size
    # Create the local and target Q-networks as instances of NeuralNetwork and send them to the device
    self.local_qnetwork = NeuralNetwork(action_size).to(self.device)
    self.target_qnetwork = NeuralNetwork(action_size).to(self.device)

    # Adam optimizer that updates the weights of the local Q-network with the given learning rate
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)
    # Replay memory holding at most 10000 transitions
    self.memory = deque(maxlen = 10000)



  def step(self, state, action, reward, next_state, done):
    # Preprocess both the state and the next state before adding them to memory
    state = preprocess_frame(state)
    next_state = preprocess_frame(next_state)
    # Append the transition to memory
    self.memory.append((state, action, reward, next_state, done))
    if len(self.memory) > minibatch_size:
      experiences = random.sample(self.memory, k = minibatch_size)
      self.learn(experiences, discount_factor)



  # Select an action for a given state following an epsilon-greedy policy
  def act(self, state, epsilon = 0.):
    # Preprocess the frame (this also adds the batch dimension) and move it to the device
    state = preprocess_frame(state).to(self.device)
    # Switch to evaluation mode and run a forward pass without tracking gradients
    self.local_qnetwork.eval()
    with torch.no_grad():
      action_values = self.local_qnetwork(state)
    # Switch back to training mode
    self.local_qnetwork.train()
    # Epsilon-greedy: if the random number is larger than epsilon, pick the action
    # with the largest Q-value; otherwise pick a random action
    if random.random() > epsilon:
      return np.argmax(action_values.cpu().data.numpy())
    else:
      return random.choice(np.arange(self.action_size))

  def learn(self, experiences, discount_factor):
    # Unzip the experiences sampled from the memory deque
    # states and next_states are already PyTorch tensors
    states, actions, rewards, next_states, dones = zip(*experiences)

    # Stack the states, actions, rewards, next states, and dones into batched tensors
    states = torch.from_numpy(np.vstack(states)).float().to(self.device)
    actions = torch.from_numpy(np.vstack(actions)).long().to(self.device)
    rewards = torch.from_numpy(np.vstack(rewards)).float().to(self.device)
    next_states = torch.from_numpy(np.vstack(next_states)).float().to(self.device)
    dones = torch.from_numpy(np.vstack(dones).astype(np.uint8)).float().to(self.device)


    # Q-values of the next states from the target network: detach() removes the tensor from the
    # computation graph, max(1) returns (values, indices) per row so [0] keeps the max values,
    # and unsqueeze(1) adds a dimension so the result has shape (batch, 1)
    next_q_targets = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)

    # Compute the Q-targets for the current states
    q_targets = rewards + discount_factor * next_q_targets * (1 - dones)
    q_expected = self.local_qnetwork(states).gather(1, actions)
    # Calculate the loss
    loss = F.mse_loss(q_expected, q_targets)
    # Reset the gradients
    self.optimizer.zero_grad()
    # Backpropagate the loss
    loss.backward()
    # Perform a single optimization step
    self.optimizer.step()
    # Note: the target network is never synchronized with the local network in this implementation (no soft update)

This is the DCQN agent implementation for reinforcement learning, where an agent learns to take actions in an environment to maximize cumulative rewards, using a neural network as a function approximator.

**Constructor:**
- Device setup: uses the GPU if available, otherwise defaults to the CPU
- Q-networks
  - Local Q-network: used for policy evaluation during training
  - Target Q-network: used for stable Q-value updates to mitigate instability
  - Both networks are instances of `NeuralNetwork` sent to `self.device`
- Optimizer: used for updating the weights of the local Q-network
- Replay memory: a deque that stores transitions with a fixed maximum size, to implement experience replay


**Step**
- Stores transitions in memory
- Samples a minibatch of experiences once the memory size exceeds the defined minibatch size
- Calls the learn method to train the network using the sampled experiences
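
The store-then-sample behavior of the step method can be sketched in isolation. The example below uses a deliberately tiny memory and placeholder transitions (strings instead of frames), both of which are assumptions for demonstration; the agent above uses a capacity of 10000 and a minibatch size of 64.

```python
import random
from collections import deque

memory = deque(maxlen=5)   # tiny capacity for demonstration only
minibatch_size = 3
minibatch = []

for step in range(8):
    # Placeholder transition: (state, action, reward, next_state, done)
    transition = (f"s{step}", step % 2, 1.0, f"s{step + 1}", False)
    memory.append(transition)   # the oldest transition is dropped once the deque is full
    if len(memory) > minibatch_size:
        # Sample without replacement; a real agent would train on this minibatch
        minibatch = random.sample(list(memory), k=minibatch_size)

print(len(memory))    # 5
print(memory[0][0])   # s3
```

Because the deque has a fixed maximum length, only the most recent transitions are kept, and random sampling breaks the correlation between consecutive frames.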

**Act**
- Uses an epsilon-greedy strategy
- With probability 1 - ε, selects the action with the highest Q-value predicted by the local network
- Otherwise selects a random action
- Ensures that exploration occurs during training while exploiting the learned policy
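
The epsilon-greedy rule can be shown on its own with a plain list of Q-values. This is a minimal sketch rather than the agent's method: the function `epsilon_greedy`, its `rng` parameter, and the Q-values are made up for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the greedy action with probability 1 - epsilon, else a random one."""
    if rng.random() > epsilon:
        # Exploit: index of the largest Q-value
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: uniformly random action
    return rng.randrange(len(q_values))

q_values = [0.1, 0.7, 0.3, 0.2]

# epsilon = 0 means the greedy action is (almost surely) chosen
greedy_action = epsilon_greedy(q_values, epsilon=0.0)

# epsilon = 1 means every action is chosen at random
rng = random.Random(0)
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
print(greedy_action, sorted(set(random_actions)))
```

Early in training epsilon is high, so most actions are exploratory; as it decays, the agent increasingly relies on its learned Q-values.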

**Learn**
- Processes a batch of experiences
- Target Q-values: Calculated using the target network
- Expected Q-values: Predicted by the local network (local_qnetwork) based on the actions taken.
- Loss Function: Mean Squared Error (MSE) between the target Q-values and expected Q-values.
- Updates the local_qnetwork using backpropagation and a gradient descent step
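
The target and loss computations in the learn method can be worked through by hand on two toy transitions. The sketch below uses plain Python numbers instead of tensors; the helper names `q_target` and `mse_loss` and all the values are assumptions chosen for illustration.

```python
def q_target(reward, next_q_values, done, discount_factor=0.99):
    """Reward plus the discounted best next Q-value, or just the reward if the episode ended."""
    return reward + discount_factor * max(next_q_values) * (1 - done)

def mse_loss(expected, targets):
    """Mean squared error between two equally long lists."""
    return sum((e - t) ** 2 for e, t in zip(expected, targets)) / len(targets)

# Toy batch: (reward, target-network Q-values for the next state, done flag)
batch = [
    (10.0, [1.0, 2.0, 3.0], 0),   # episode continues: bootstrap from max next Q-value
    (50.0, [4.0, 5.0, 6.0], 1),   # episode ended: the target is the reward alone
]
targets = [q_target(r, q_next, done) for r, q_next, done in batch]

# Pretend these are the local network's Q-values for the actions actually taken
expected = [12.0, 48.0]
loss = mse_loss(expected, targets)
print(targets, round(loss, 5))
```

The `(1 - done)` factor is what removes the bootstrapped term for terminal transitions, so the second target equals its reward.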

### Initializing the DCQN agent

In [None]:
agent = Agent(number_actions)

### Training the DCQN agent

In [None]:
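# Maximum number of episodes
number_episodes = 2000
# Maximum number of timesteps per episode
maximum_number_timesteps_per_episode = 10000
# Epsilon starts at 1.0 and is multiplied by 0.995 after each episode, down to a floor of 0.01
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995
epsilon = epsilon_starting_value
# Sliding window holding the scores of the last 100 episodes
scores_on_100_episodes = deque(maxlen = 100)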

for episode in range(1, number_episodes + 1):
  # Reset the environment to its initial state
  state, _ = env.reset()
  score = 0
  for t in range(maximum_number_timesteps_per_episode):
    # act selects an action for the given state following an epsilon-greedy policy
    action = agent.act(state, epsilon)
    # The Gymnasium environment step returns the next state, the reward, and the done flag
    next_state, reward, done, _, _ = env.step(action)
    # The agent stores this transition and learns from it
    agent.step(state, action, reward, next_state, done)
    state = next_state
    score += reward
    if done:
      break
  # Append the episode's score to the window of the last 100 scores
  scores_on_100_episodes.append(score)
  # Decay epsilon
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
  # Dynamic print: \r returns to the start of the line so the output overwrites itself
  print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)), end = "")
  if episode % 100 == 0:
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_on_100_episodes)))
  if np.mean(scores_on_100_episodes) >= 500.0:
    print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode - 100, np.mean(scores_on_100_episodes)))
    torch.save(agent.local_qnetwork.state_dict(), 'checkpoint.pth')
    break






Episode 100	Average Score: 290.30
Episode 200	Average Score: 363.90
Episode 300	Average Score: 371.50
Episode 400	Average Score: 458.00
Episode 500	Average Score: 434.60
Episode 600	Average Score: 424.30
Episode 700	Average Score: 422.80
Episode 800	Average Score: 378.30
Episode 900	Average Score: 435.10
Episode 1000	Average Score: 395.80
Episode 1100	Average Score: 452.00
Episode 1200	Average Score: 404.00
Episode 1219	Average Score: 427.40
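
The epsilon schedule used in the training loop can be replayed on its own to see how long exploration lasts. The sketch below reuses the three epsilon values from the training cell; the variable `episodes_to_floor` is introduced only for this illustration.

```python
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995

epsilon = epsilon_starting_value
episodes_to_floor = None
for episode in range(1, 2001):
    # Same update as at the end of each training episode
    epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
    if epsilon == epsilon_ending_value and episodes_to_floor is None:
        # First episode after which epsilon sits at its minimum
        episodes_to_floor = episode

print(episodes_to_floor)
```

With these values epsilon reaches its floor a little after 900 episodes, so from that point on about 1% of the actions are still random.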

## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        # Stop when the episode terminates or is truncated
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'MsPacmanDeterministic-v0')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()