# Yet Another DQN Code

This repo implements a Deep Q-Network as introduced in the paper ["Playing Atari Games with Deep Reinforcement Learning"](https://arxiv.org/abs/1312.5602). In this notebook I tried to closely follow the ideas and tricks outlined by the paper's authors and to show how this model can be implemented using the [PyTorch Lightning](https://www.pytorchlightning.ai) framework with [Wandb](https://wandb.ai) logging. I highly recommend reading the paper before reading this notebook, since most of the model logic is explained there.

The model is applied to the [Gym Car Racing](https://www.gymlibrary.dev/environments/box2d/car_racing/) environment (with discrete actions), which is a simple example of a racing video game where the player is required to drive a car along a road as far as possible.

The game ends when the car has visited all track tiles or leaves the playfield, that is, goes far off the road; in the latter case the player receives a reward of -100. Our main task is to score as much as possible.

Also, do not forget to turn on the GPU if you wish to run this notebook.

# Import Libraries

In [1]:
!pip install wandb



In [2]:
import os
import subprocess

import numpy as np
import pandas as pd

import gym

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import BatchSampler, RandomSampler, SequentialSampler
from torchvision.transforms import Grayscale

from pytorch_lightning import LightningDataModule, LightningModule, Trainer, seed_everything
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import ModelCheckpoint, LearningRateMonitor

from sklearn.metrics import accuracy_score

import wandb

from PIL import Image

# Replay Memory

We start our implementation of the DQN with the **Replay Memory**, which plays the role of a Dataset and allows the model to sample uncorrelated batches of data. Similar to the original implementation, for each observation we store **n** consecutive frames (in grayscale) of the game, the action taken during these **n** frames, the received reward, the **n** frames of the next observation, and a flag specifying the end of the game. We also specify the size of the replay memory.





In [3]:
class ReplayMemory(Dataset):
    """
    Class contains all memory needed for a Dataloader to
    sample from it, including:
    current_observation_memory - observation during which there was a decision to
    take an action
    action_memory - actions taken at each current observation
    reward_memory - reward achieved at each corresponding action
    end_state_memory - tensors of booleans indicating whether the game
    episode was finished
    next_observation_memory - observation which appeared after an action was taken
    
    Comment on usage of Replay Memory: the argument *observation_shape* is the
    list of dimensions that we expect an observation tensor to have. Its first
    number is the number of channels stored per observation, which is the same
    as the number of frames during which one action is applied.
    """
    def __init__(self, 
                 num_samples : int=1000, 
                 observation_shape : list=[4, 96, 96]):
        
        self.current_observation_memory = torch.empty(num_samples, *observation_shape)
        self.action_memory = torch.empty(num_samples)
        self.reward_memory = torch.empty(num_samples)
        self.end_state_memory = torch.zeros(num_samples, dtype=torch.bool)
        self.next_observation_memory = torch.empty(num_samples, *observation_shape)
        self.idx_memory = 0
        
    def __len__(self):
        """
        Get length of the Replay Memory
        """
        return self.current_observation_memory.shape[0]
    
    def store_observation(self, 
                          current_observation : torch.Tensor, 
                          action : int, 
                          reward : float, 
                          end_state : bool, 
                          next_observation : torch.Tensor):
        """
        Stores observation in replay memory
        """            
        index = self.idx_memory % self.current_observation_memory.shape[0]
        
        self.current_observation_memory[index] = torch.Tensor(current_observation)
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.end_state_memory[index] = end_state
        self.next_observation_memory[index] = torch.Tensor(next_observation)
        
        self.idx_memory += 1
    
    def __getitem__(self, index):
        """
        Returns a uniformly sampled observation from Replay Memory;
        the index passed by the DataLoader is ignored
        """
        index = np.random.randint(low=0, high=self.current_observation_memory.shape[0])
        return self.current_observation_memory[index], self.action_memory[index], \
               self.reward_memory[index], self.end_state_memory[index], \
               self.next_observation_memory[index]
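The wrap-around indexing used by `store_observation` can be looked at in isolation. The sketch below is a minimal pure-Python ring buffer and not the notebook's class: `RingBuffer` and its `store` method are hypothetical names introduced only for illustration. It shows that once the write counter passes the capacity, the oldest entries are overwritten first.

```python
class RingBuffer:
    """Hypothetical fixed-size buffer mirroring the index logic of ReplayMemory."""

    def __init__(self, capacity: int):
        self.items = [None] * capacity
        self.write_count = 0  # plays the role of idx_memory

    def store(self, item) -> int:
        # Same rule as in store_observation: wrap around with modulo
        index = self.write_count % len(self.items)
        self.items[index] = item
        self.write_count += 1
        return index


buffer = RingBuffer(capacity=3)
slots = [buffer.store(step) for step in range(5)]
print(slots)         # slot used by each of the five writes
print(buffer.items)  # the two latest writes replaced the two oldest ones
```

In the notebook, `__getitem__` draws a random index over the whole buffer, so slots that have not been written yet would return uninitialised values from `torch.empty`. `fill_replay_memory` is there to populate most of the buffer with random play before training starts.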

# Environment

The Environment handles all the interaction between the agent, the gym environment, and the replay memory. It holds the *Replay Memory* as one of its attributes and also includes the logic behind the *epsilon annealing* strategy, i.e. the gradual decrease in the probability of the agent choosing a random action. We also specify here the batch size which will be used for training the model.

In [4]:
class Environment(LightningDataModule):
    """
    Environment keeps track of all activity between agent
    and gym environment, including management of memory
    and preprocessing of observations
    """
    def __init__(self, 
                 batch_size : int=8,
                 num_samples_memory : int=1000,
                 observation_shape : list=[4, 96, 96],
                 start_epsilon : float=0.0, 
                 epsilon_decrease_step : float=0.0, 
                 stop_epsilon : float=0.0
                 ):
        super().__init__()
        
        self.batch_size = batch_size
        self.num_samples_memory = num_samples_memory
        self.observation_shape = observation_shape
        
        # ENVIRONMENT
        # Create Environment
        self.env = gym.make("CarRacing-v2", domain_randomize=False, continuous=False)
        
        # Set start game flag to True
        self.start_game_flag = True
        
        # PREPROCESSING OF OBSERVATIONS
        self.grayscale = Grayscale()
        self.frames_per_observation = observation_shape[0]
        
        # EPSILON
        self.epsilon = start_epsilon
        self.epsilon_decrease_step = epsilon_decrease_step
        self.stop_epsilon = stop_epsilon
        
        # LOGGING
        self.number_positive_games = 0
        self.number_negative_games = 0
        
        # ACTION DICTIONARY
        self.action_dict = {0 : 'nothing',
                           1 : 'left',
                           2 : 'right',
                           3 : 'gas',
                           4 : 'brake'}
        
    def setup(self, 
              stage : str=None):
        """
        Create a randomly filled replay memory
        """
        # Create Replay Memory
        self.replay_memory = ReplayMemory(num_samples=self.num_samples_memory, 
                                          observation_shape=self.observation_shape)
        # Fill Replay Memory
        self.fill_replay_memory()
        
        
        
    def fill_replay_memory(self):
        """
        Fills replay memory with random actions
        """
        print('...Start filling Replay Memory...')
        self.play_game(n_steps=self.num_samples_memory-1,
                       model=None,
                       random_action=True)
        print('...Replay memory is filled...')
                
    def play_game(self, 
                  n_steps : int=100, 
                  model=None,
                  random_action : bool=False,
                  predict_stage : bool=False,
                  play_one_game : bool=False):
        """
        Plays the game with a model or with random actions and
        fills the replay memory along the way
        Args:
        n_steps - maximum number of agent steps to play
        model - DQN model used to determine the best action
        random_action - bool specifying whether the game is played randomly,
        used when filling the memory
        predict_stage - used when testing how the agent plays; disables
        random actions and saves the frames of the game to disk
        play_one_game - bool specifying whether to stop as soon as the
        current game ends
        """
        # If predict_stage, set self.epsilon to 0
        # to avoid taking random actions
        # and initialise test frame index
        if predict_stage:
            self.epsilon = 0
            self.stop_epsilon = 0
            self.test_frame_idx = 0
            random_action = False
        
        for idx in range(n_steps):
            # If start game, create a self.current_observation, else
            # create a next observation tensor and store it in the memory
            if self.start_game_flag:
                self.current_observation = self.start_new_game()
            else:
                # Determine action
                action = self.determine_action(model=model,
                                               current_observation=self.current_observation,
                                               random_action=random_action)
                
                # Create next_observation, reward, and end_state to store in memory
                self.next_observation, reward, end_state = \
                                              self.create_next_observation(action=action,
                                                                           predict_stage=predict_stage)
                
                # Store tensors in memory
                self.replay_memory.store_observation(current_observation=self.current_observation, 
                                                     action=action, 
                                                     reward=reward, 
                                                     end_state=end_state, 
                                                     next_observation=self.next_observation)

                
                # Assign next_observation to current_observation
                self.current_observation = self.next_observation
                
                # Decrease epsilon if possible and if not random action
                if not random_action:
                    self.decrement_epsilon()
                    
                if play_one_game and end_state:
                    break
            
            # Log necessary values if not random_action
            if not random_action and not predict_stage:
                self.log_wandb()
            
    def log_wandb(self):
        """
        Logs all wanted values into wandb
        """
        wandb.log({'epsilon' : self.epsilon})
    
    def determine_action(self,
                         current_observation : torch.Tensor,
                         random_action : bool,
                         model=None,
                         ):
        """
        Determine an action of current_observation
        randomly or based on model
        """
        # Determine action
        if random_action or (np.random.uniform() < self.epsilon):
            action = self.env.action_space.sample()
        else:
            with torch.no_grad():
                action = model(current_observation).max(dim=1).indices.item()
        return action
    
    def decrement_epsilon(self):
        """
        Decrements epsilon by self.epsilon_decrease_step
        if it is larger than self.stop_epsilon
        
        Note that the decrement of epsilon is adjusted according
        to the number of frames passed
        """
        if self.epsilon > self.stop_epsilon:
            decrement_epsilon = self.frames_per_observation * self.epsilon_decrease_step
            self.epsilon = max(self.stop_epsilon, self.epsilon - decrement_epsilon)
        
    
    def create_next_observation(self,
                                action : int,
                                predict_stage : bool=False):
        """
        Create a next_observation tensor by repeating
        the chosen action for self.frames_per_observation
        frames
        """
        # Define list of frames for a chosen action
        # and a reward
        frames_action = []
        reward_action = 0
        

            
        for idx_frame in range(self.frames_per_observation):
            frame, reward, terminated, truncated, _ = self.env.step(action)
            
            # If predict_stage, save image to disk
            if predict_stage:
                # Create the output folder if it does not exist yet
                os.makedirs('test_game', exist_ok=True)
                    
                im = Image.fromarray(frame)
                im.save(f"test_game/frame_{self.test_frame_idx}_action_{self.action_dict[action]}.jpeg")
                self.test_frame_idx += 1
                
            # Update the reward of action
            reward_action += reward
            # Change value of the game
            self.game_total_return += reward
            # Get end state
            end_state = terminated or truncated

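            # if end_state, pad the remaining frames with the last frame
            # and exit for loop
            if end_state:
                if not predict_stage:
                    frames_action = self.process_end_game(frames_action=frames_action,
                                                          last_frame=frame)
                else:
                    # No logging at predict stage, but the observation still
                    # needs self.frames_per_observation frames
                    while len(frames_action) < self.frames_per_observation:
                        frames_action.append(frame)
                    self.start_game_flag = True
                break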

            frames_action.append(frame)

        # Create a post-processed observation tensor
        next_observation = self.preprocess_frames(*frames_action)
        return next_observation, reward_action, end_state
        
    
    def preprocess_frames(self,
                          *frames : np.array):
        """
        Preprocesses all frames received from env
        and returns a tensor ready to be put in memory
        """
        out_memory = []
        # Apply grayscale to each of the observations
        for frame_ in frames:
            frame = torch.Tensor(frame_).transpose(0, 2)
            frame = self.grayscale(frame)
            out_memory.append(frame)
            
        # Concatenate tensors
        out = torch.cat(out_memory)
        return out
        
    def process_end_game(self, 
                         frames_action : list,
                         last_frame : np.array):
        """
        Processes the game if it has reached
        an end state by filling the remaining
        frames_action with the last frame
        """
        frames_action = frames_action.copy()
        while len(frames_action) < self.frames_per_observation:
            frames_action.append(last_frame)
            
        # Set flag to start a new game to true
        self.start_game_flag = True
        
        # Log game_total_return
        wandb.log({"Game Total Return" : self.game_total_return})
        
        # Log either plus 1 positive or negative game
        if self.game_total_return >= 0:
            self.number_positive_games += 1
        else:
            self.number_negative_games += 1
        wandb.log({"number_positive_games" : self.number_positive_games,
                   "number_negative_games" : self.number_negative_games})
        
        return frames_action
        
    
    def start_new_game(self):
        """
        Starts new game by observing
        first self.frames_per_observation and
        taking action of not moving for first
        self.frames_per_observations
        """
        
        # Reset environment
        start_frame, _ = self.env.reset()
        # Assign first frame to tensor
        frames_start = [start_frame,]
        
        # Assign other self.frames_per_observation-1 
        # frames to frames_start
        for frame_ in range(self.frames_per_observation-1):
            frame, a_, b_, c_, d_ = self.env.step(0)
            frames_start.append(frame)
        
        # Set self.start_game_flag to false
        self.start_game_flag = False
        
        # Set value of the game to zero
        self.game_total_return = 0
        
        # Return processed start frames
        return self.preprocess_frames(*frames_start)      
        
    def train_dataloader(self):
        """
        Dataloader used for training
        
        To see what Dataloaders output, use code below:
        ---------------------
        env = Environment()
        env.setup(None)
        it = iter(env.train_dataloader())
        current_observations, actions, rewards, \
        end_states, next_observations = next(it)
        ---------------------
        """
        return DataLoader(self.replay_memory, batch_size=self.batch_size, shuffle=True)
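To get a feeling for how quickly exploration fades, the linear schedule implemented by `decrement_epsilon` can be written in closed form. The sketch below is only an illustration: `epsilon_after` is a hypothetical helper that is not part of the notebook, and its defaults are the values from the `CFG` cell further down. Up to floating-point rounding it should agree with calling `decrement_epsilon` the same number of times, assuming one call per agent decision.

```python
def epsilon_after(steps: int,
                  start_epsilon: float = 1.0,
                  epsilon_decrease_step: float = 1 / 30_000,
                  stop_epsilon: float = 0.01,
                  frames_per_observation: int = 4) -> float:
    """Closed form of calling decrement_epsilon `steps` times."""
    # Each agent decision spans several frames, so the per-frame step
    # is multiplied by the number of frames per observation
    decrement = frames_per_observation * epsilon_decrease_step
    return max(stop_epsilon, start_epsilon - steps * decrement)


for steps in (0, 3_750, 7_425, 10_000):
    print(steps, round(epsilon_after(steps), 4))
```

With these values epsilon reaches its floor of 0.01 after roughly 7,425 agent decisions, i.e. about 29,700 game frames. Random actions taken while filling the memory do not count, since epsilon is only decremented when `random_action` is False.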

# DQN Network

This is a standard PyTorch implementation of a convolutional neural network, which is almost identical to the network outlined in the paper: two convolutional layers followed by two linear layers, with ReLU units between all layers.

In [5]:
class DQN(nn.Module):
    """
    DQN network with two convolutional
    and two linear layers with ReLU 
    nonlinearity
    """
    def __init__(self,
                 input_shape : list=[4,96,96],
                 out_channels1 : int=5,
                 out_channels2 : int=5,
                 kernel_size : int=4,
                 stride : int=2,
                 hidden_size : int=256,
                 num_actions : int=5):
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels=input_shape[0], 
                               out_channels=out_channels1, 
                               kernel_size=kernel_size, 
                               stride=stride)
        
        self.conv2 = nn.Conv2d(in_channels=out_channels1, 
                               out_channels=out_channels2, 
                               kernel_size=kernel_size, 
                               stride=stride)
        
        # Determine the shape of self.conv2 output and pass it to linear1
        dummy_input = torch.rand(1,*input_shape)
        with torch.no_grad():
            out_conv2_shape = torch.flatten(self.conv2(self.conv1(dummy_input))).shape[0]
    
        self.linear1 = nn.Linear(out_conv2_shape, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_actions)
        
        self.relu = nn.ReLU()
        
        
    def forward(self, x):
        """
        Forward pass of network
        """
        # Move the input to the device the network lives on
        # (GPU or CPU)
        x = x.to(self.conv1.weight.device)
        
        # Adjust tensor to have shape [batch, *image_shape]
        if len(x.shape) == 3:
            x = x.unsqueeze(0)
            
        x = self.conv1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = torch.flatten(x,1)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.classifier(x)
        return x
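The constructor infers the input size of `linear1` by pushing a dummy tensor through both convolutions. The same number can be obtained by hand from the usual output-size formula for a convolution without padding or dilation. The sketch below assumes the kernel size, stride and `out_channels2` from the `CFG` cell further down; `conv_output_size` is a hypothetical helper, not a PyTorch function.

```python
def conv_output_size(size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial size after one convolution (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


height = width = 96
for _ in range(2):  # two conv layers, both with kernel 4 and stride 2
    height = conv_output_size(height, kernel_size=4, stride=2)
    width = conv_output_size(width, kernel_size=4, stride=2)

out_channels2 = 32  # value from the CFG cell
flattened = out_channels2 * height * width
print(height, width, flattened)
```

The 96x96 input shrinks to 47x47 after the first layer and to 22x22 after the second, so with 32 output channels `linear1` receives 15,488 features.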

# Agent

This class is the final object, which takes all previously constructed objects and orchestrates them into a smooth training procedure. In addition to the parameters of the *DQN Network* and the *Environment*, we need to specify here: **gamma**, the discount factor used in the Bellman equation; **learning rate**, used to update the weights of the network; **memory_update_samples**, the maximum number of actions to take each time the *Replay Memory* is updated; **target_net_update_freq** and **memory_update_freq**, the frequencies (in training steps) with which the *Target Network* and the *Replay Memory* are updated; and finally **tau**, the coefficient of the soft update of the target network as outlined in [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971). Note that this soft update rule differs from the original DQN; I use it because it stabilises the training.
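As a reading of the `training_step` and `update_target_network` methods of the class, the quantities these parameters enter can be written as follows. The notation is introduced here for illustration only: $\theta$ and $\theta^-$ denote the weights of the base and target networks, $d_i \in \{0, 1\}$ is the end-of-game flag of sample $i$, and $B$ is the batch size.

```latex
\begin{aligned}
y_i &= r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\theta^-}(s'_i, a') \\
L(\theta) &= \frac{1}{B} \sum_{i=1}^{B} \bigl(y_i - Q_{\theta}(s_i, a_i)\bigr)^2 \\
\theta^- &\leftarrow \tau\,\theta + (1 - \tau)\,\theta^-
\end{aligned}
```

The first line is the Bellman target with no bootstrapping at the end of a game, the second is the mean squared error that is minimised, and the third is the soft update applied every **target_net_update_freq** training steps.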

In [6]:
class Agent(LightningModule):
    """
    DQN Agent which handles all the training and interaction
    between the network and environment
    """
    def __init__(self, 
                 # Agent parameters
                 gamma : float=0.99,
                 learning_rate : float=1e-4,
                 tau : float=1e-3,
                 memory_update_samples : int=200,
                 target_net_update_freq : int=2,
                 memory_update_freq : int=2,
                 # DQN Parameters
                 input_shape : list=[4,96,96],
                 out_channels1 : int=5,
                 out_channels2 : int=5,
                 kernel_size : int=4,
                 stride : int=2,
                 hidden_size : int=256,
                 num_actions : int=5,
                 # Environment Parameters
                 batch_size : int=8,
                 num_samples_memory : int=1000,
                 observation_shape : list=[4, 96, 96],
                 start_epsilon : float=0.0, 
                 epsilon_decrease_step : float=0.0, 
                 stop_epsilon : float=0.0):
        super().__init__()
        
        # Set Agent parameters
        self.gamma = gamma
        self.lr = learning_rate
        self.tau = tau
        
        # Set target network and base network
        self.base_net = DQN(input_shape=input_shape,
                            out_channels1=out_channels1,
                            out_channels2=out_channels2,
                            kernel_size=kernel_size,
                            stride=stride,
                            hidden_size=hidden_size,
                            num_actions=num_actions)
        
        self.target_net = DQN(input_shape=input_shape,
                            out_channels1=out_channels1,
                            out_channels2=out_channels2,
                            kernel_size=kernel_size,
                            stride=stride,
                            hidden_size=hidden_size,
                            num_actions=num_actions)
        
        # Make target and base networks identical
        self.target_net.load_state_dict(self.base_net.state_dict())
        
        # Create loss
        self.loss = nn.MSELoss()
        
        # Create environment
        self.env = Environment(batch_size=batch_size,
                               num_samples_memory=num_samples_memory,
                               observation_shape=observation_shape,
                               start_epsilon=start_epsilon,
                               epsilon_decrease_step=epsilon_decrease_step,
                               stop_epsilon=stop_epsilon)
        
        # Create train step index
        self.train_step_idx = 0
        
        # Set number of samples to update in training loop
        self.memory_update_samples = memory_update_samples
        
        # Create parameters of memory and model update frequencies
        self.target_net_update_freq = target_net_update_freq
        self.memory_update_freq = memory_update_freq
        
        
    def forward(self, x):
        """
        Forward pass of base network
        """
        x = self.base_net(x)
        return x
    
    def training_step(self, batch, batch_idx):
        """
        Run a training step: compute the loss on a sampled batch,
        then update the target network and let the agent play to
        refresh the replay memory whenever the corresponding
        frequency permits
        """
        
        # RUN FORWARD AND BACKWARD PASSES
        # ------------------------------------------------------
        # Sample a batch and determine best action q-values
        # for next_observation
        current_observations, actions, rewards, end_states, next_observations = batch
        
        with torch.no_grad():
            best_q_values = self.target_net(next_observations).max(dim=1).values
            best_q_values[end_states] = 0
            
        
        # Create y-tensor
        y = rewards + self.gamma * best_q_values
        y = y.reshape(-1,1)
        
        # Create q-values for current_observation
        out = self.base_net(current_observations)
        out = out[range(len(out)),actions.long()].reshape(-1,1)
                
        # Get loss function
        loss = self.loss(y, out)
        # ------------------------------------------------------
        
        # Update target network if needed
        if self.train_step_idx % self.target_net_update_freq == 0:
            self.update_target_network()
        
        # Play a game and update replay memory if needed
        if self.train_step_idx % self.memory_update_freq == 0:
            self.env.play_game(n_steps=self.memory_update_samples, 
                               model=self.base_net,
                               play_one_game=True,
                               random_action=False)
        
        # Increment train_step idx
        self.train_step_idx += 1
        self.log("train_loss", loss)
        return loss
    
    
    def update_target_network(self):
        """
        Update the weights of target network
        """
        new_target_net_dict = self.target_net.state_dict().copy()
        base_net_dict = self.base_net.state_dict().copy()
        
        for name in self.target_net.state_dict().copy().keys():
            new_target_net_dict[name] = self.tau * base_net_dict[name] + \
                                            (1 - self.tau) * new_target_net_dict[name]
            
        self.target_net.load_state_dict(new_target_net_dict)
        
        
    def configure_optimizers(self):
        """
        Configure optimiser for training
        """
        # Pass only the parameters that require gradients to the optimiser
        optimizer = AdamW(filter(lambda p: p.requires_grad, self.parameters()), lr=self.lr)
        return optimizer
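The two numerical ingredients of `training_step` and `update_target_network` can be tried out on plain Python numbers. This is a minimal sketch under simplifying assumptions, not the notebook's tensor code: `td_targets` and `soft_update` are hypothetical helpers, weights are single floats instead of tensors, and the default `gamma` and `tau` are taken from the `CFG` cell further down.

```python
def td_targets(rewards, next_q_values, end_states, gamma=0.95):
    """Targets r + gamma * max_a Q_target(s', a), with no bootstrap at game end."""
    targets = []
    for reward, q_row, done in zip(rewards, next_q_values, end_states):
        best_next = 0.0 if done else max(q_row)
        targets.append(reward + gamma * best_next)
    return targets


def soft_update(target_weights, base_weights, tau=0.05):
    """Move every target weight a fraction tau towards the base weight."""
    return {name: tau * base_weights[name] + (1 - tau) * target_weights[name]
            for name in target_weights}


targets = td_targets(rewards=[1.0, -100.0],
                     next_q_values=[[0.5, 2.0], [3.0, 1.0]],
                     end_states=[False, True])
print(targets)

target = {"w": 0.0}
base = {"w": 1.0}
for _ in range(10):
    target = soft_update(target, base)
print(target["w"])  # equals 1 - 0.95 ** 10, roughly 0.40
```

The second sample ends the game, so its target is just the reward. The loop shows how slowly the target network follows the base network: with `tau=0.05` it has covered only about 40% of the gap after ten updates.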

# Running the Code

The next 3 cells contain the configuration and the launch code that start the DQN model. If you also want to see the visualisations of training and logging, please set **USE_WANDB = True**, input your *Wandb API KEY* into the corresponding field, and enjoy the results.

I also recommend running this code from the **Save Version** button -> **Save and Run All (Commit)**, since it takes several hours to train a model.

For my visualisations of the code, you can refer to my [report](https://wandb.ai/vladargunov/DQN%20Car%20Racing/reports/Summary-Report--VmlldzoyODQ1OTky?accessToken=2biech898sne4t9pwhqy54giuj04793n5jfo133rsb887piv5vla8jaa7pqvu9ro) where I store the successful (and not so successful) runs of my models together with the configuration that I used in each of them.

In [7]:
#########################
# Project Variables
#########################

USE_WANDB = False
# Paste your own key here; never commit a real key to a public notebook
os.environ['WANDB_API_KEY'] = '<YOUR_WANDB_API_KEY>'
PROJECT_NAME = 'DQN Car Racing'
DEBUG = False

In [8]:
class CFG:
    # Agent parameters
    gamma=0.95
    learning_rate=1e-3
    tau=0.05
    memory_update_samples=10_000
    target_net_update_freq=2
    memory_update_freq=1
    # DQN Parameters
    input_shape=[4,96,96] # [channels, height, width] <- shape of observation to 
    # store in memory AFTER the preprocessing
    out_channels1=16
    out_channels2=32
    kernel_size=4
    stride=2 # Used for both conv layers
    hidden_size=256 # Size of hidden linear layer after the 2nd convolution layer
    num_actions=5
    # Environment Parameters
    batch_size=256
    num_samples_memory=15_000
    observation_shape=[4, 96, 96] # Must be equal to input_shape 
    start_epsilon=1
    epsilon_decrease_step=1 / 30_000
    stop_epsilon=0.01
    # Trainer Parameters
    num_train_steps = 6000
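Before launching a run with this configuration it is worth checking how much RAM the replay memory needs, because `ReplayMemory` allocates two full observation tensors up front. The estimate below is a rough sketch: `observation_buffer_bytes` is a hypothetical helper, and it assumes that `torch.empty` uses its default `float32` dtype (4 bytes per value) and ignores the much smaller action, reward and end-state tensors.

```python
def observation_buffer_bytes(num_samples: int,
                             observation_shape=(4, 96, 96),
                             bytes_per_value: int = 4,
                             num_buffers: int = 2) -> int:
    """Bytes held by the current and next observation tensors."""
    values_per_observation = 1
    for dim in observation_shape:
        values_per_observation *= dim
    return num_samples * values_per_observation * bytes_per_value * num_buffers


total = observation_buffer_bytes(num_samples=15_000)
print(f"{total / 1024 ** 3:.2f} GiB")
```

With `num_samples_memory=15_000` this comes to a little over 4 GiB, so the value may need to be lowered on machines with limited memory.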

In [10]:
###############################
# Train
###############################

def main():
    
    if not USE_WANDB:
        os.environ['WANDB_MODE']= 'disabled'
    
    run = wandb.init(reinit=True, project=PROJECT_NAME)

    # Enable wandb logger
    wandb_logger = WandbLogger(log_model=True)

    # Create agent
    not_agent_cfg = ['num_train_steps']
    agent_dict = {key : value for key, value in CFG.__dict__.items() if not key.startswith('__') \
               and key not in not_agent_cfg}
    agent = Agent(**agent_dict)


    # Log histograms of gradients and parameters
    wandb_logger.watch(agent, log='all')
    
    # Checkpoint callback meant to keep the best model by game return;
    # note that it is not passed to the Trainer below, so it is currently unused
    checkpoint_callback = ModelCheckpoint(monitor="Game Total Return", mode='max',save_top_k=1)

    # Start trainer
    trainer = Trainer(logger=wandb_logger,
                      max_epochs=CFG.num_train_steps,
                      reload_dataloaders_every_n_epochs=1,
                      limit_train_batches=1,
                      log_every_n_steps=1,
                      accelerator='auto',
                      fast_dev_run=DEBUG)

    trainer.fit(model=agent, datamodule=agent.env)
    

    # Update configuration to the wandb
    CFG_dict = {key : value for key, value in CFG.__dict__.items() if not key.startswith('__')}    
    wandb_logger.experiment.config.update(CFG_dict)

    wandb.finish()

if __name__ == "__main__":
    main()
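The way `main` turns the `CFG` class into keyword arguments for `Agent` can be tried on a toy class. In this sketch `ExampleCFG` and `config_to_kwargs` are hypothetical names, and `vars(cfg)` is used as an equivalent spelling of `cfg.__dict__`.

```python
class ExampleCFG:
    """Hypothetical stand-in for the CFG class above."""
    gamma = 0.95
    batch_size = 256
    num_train_steps = 6000  # consumed by the Trainer, not by Agent


def config_to_kwargs(cfg, exclude=()):
    """Collect plain class attributes, skipping dunder names and excluded keys."""
    return {key: value for key, value in vars(cfg).items()
            if not key.startswith('__') and key not in exclude}


agent_kwargs = config_to_kwargs(ExampleCFG, exclude=('num_train_steps',))
print(agent_kwargs)
```

Because the resulting dictionary is unpacked into `Agent(**agent_dict)`, any attribute added to `CFG` that `Agent.__init__` does not accept raises a `TypeError`, which is why `num_train_steps` has to be excluded.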



# Testing 

Finally, if you would like to see how your model performs, you can execute the next function to generate a folder named test_game with the frames of a game, where each file name records the action taken at that frame. The commented-out commands at the end of the function pack the folder into test_game.zip, which you can download and inspect locally.

I uploaded a [dataset](https://www.kaggle.com/datasets/vladargunov/gym-car-racing-dqn-model) which contains a model from one of the successful runs, so if you would like to try it out, you can run the next cell after setting **TEST_STAGE = True**. There is no need to train a model beforehand, so you do not have to execute the previous cell.

In [8]:
def test_agent():
    # Set download path
    #model_dir = '../input/gym-car-racing-dqn-model/model.ckpt'
    #model_dir = 'C:/Users/raduc/repos/RAU_2023/input/gym-car-racing-dqn-model/model.ckpt'
    model_dir = "epoch=299-step=300.ckpt"
    # Initialise test model
    not_agent_cfg = ['num_train_steps']
    agent_dict = {key : value for key, value in CFG.__dict__.items() if not key.startswith('__') \
                   and key not in not_agent_cfg}

    agent_dict['num_samples_memory'] = 5 # Set it to minimal value since we do not need memory replay here
    
    test_agent = Agent(**agent_dict)
    test_model = test_agent.target_net
    
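    checkpoint = torch.load(model_dir, map_location='cpu')
    # A Lightning checkpoint keeps the weights under 'state_dict' with a
    # 'target_net.' prefix; a plain state dict is used as it is
    state_dict = checkpoint.get('state_dict', checkpoint)
    prefix = 'target_net.'
    if any(key.startswith(prefix) for key in state_dict):
        state_dict = {key[len(prefix):] : value for key, value in state_dict.items()
                      if key.startswith(prefix)}
    test_model.load_state_dict(state_dict, strict=False)
    # Use the GPU if there is one, otherwise stay on the CPU
    test_model.to('cuda' if torch.cuda.is_available() else 'cpu')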
    # Setup environment
    test_agent.env.setup()
    
    # Test model
    test_agent.env.play_game(model=test_model, predict_stage=True, play_one_game=True)
    # Remove previous zip folder if it exists
    #!rm test_game.zip
    # Zip folder with frames
    #!zip -r test_game.zip ./test_game -q
    # Remove original folder
    #!rm -r test_game
    
    

In [9]:
TEST_STAGE = True


if TEST_STAGE:
    test_agent()
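When browsing the saved frames, keep in mind that an alphabetical listing puts `frame_10` before `frame_2`. The sketch below parses the file names written by `create_next_observation` (`frame_<index>_action_<name>.jpeg`) and orders them numerically; `parse_frame_name` and `sort_frames` are hypothetical helpers and the sample names are made up for illustration.

```python
import re

FRAME_PATTERN = re.compile(r"frame_(\d+)_action_(\w+)\.jpeg$")


def parse_frame_name(file_name: str):
    """Return (frame index, action name) for a saved frame file name."""
    match = FRAME_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return int(match.group(1)), match.group(2)


def sort_frames(file_names):
    """Order frames by their numeric index rather than alphabetically."""
    return sorted(file_names, key=lambda name: parse_frame_name(name)[0])


names = ["frame_10_action_gas.jpeg",
         "frame_2_action_left.jpeg",
         "frame_1_action_nothing.jpeg"]
print(sort_frames(names))
print(parse_frame_name("frame_10_action_gas.jpeg"))
```

The ordered list can then be passed to any image viewer or animation tool to replay the game frame by frame.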


Update: After a couple of test runs I discovered that the car often opts for braking before turns so that it does not lose points. I think we can address this issue in the next notebook.

# Conclusion

Well, that's it. In this notebook we have managed to train the model so that it achieves a positive score, but since we have worked in a discrete environment we could not achieve the best outcome. Yet it is a great start!

I hope you liked this notebook - if you find any mistakes here or would like to leave a comment, you can do it here or you can also mail me if you hesitate to say it in public.

My mail: argunovvlad5@gmail.com

Hope to do better next time!