
Recurrent Replay Non-Distributed Deeper Denser DQN - Gravitar

Submitted as part of the degree of MSci Natural Sciences (3rd year) to the Board of Examiners in the Department of Computer Sciences, Durham University. This summative assignment was assessed and marked by the professor of the module in question:

Grade: 1st - 100/100, 1st in year (of 136 students).

"This is an amazing agent that’s very impressive to see at an undergraduate RL course - you’ve clearly done a huge amount of research and spent many hours for this. It has top of the class high score - you could probably even publish some of the work here. Your agent design choices are all highly appropriate, picking many ways to balance exploration and exploitation with a recurrent architecture that can help with stability and analysis of long-term rewards while exploring within the difficult Gravitar environment. The agent in the video extremely intuitive, playing better than many humans. The convergence is outstanding, especially after 100k steps; increasing steadily after such a large amount of training is quite an achievement! Overall, this is a fantastic piece of academic writing, scientific experimentation, independent research, and countless hours of GPU training and shows an unexpected mastery of deep learning and reinforcement learning for an undergraduate. This must clearly be your passion, and all I can say is I hope you keep doing this in your career! Perfect!" - Dr Chris G. Willcocks

Architecture:

Recurrent Replay Non-Distributed Deeper Denser DQN (R2ND4) is a Reinforcement Learning Agent that was initially designed as a non-distributed version of R2D2 [1] and was later developed further. R2ND4 uses:

  • Double Q-Learning [4];
  • Prioritized Replay Buffer [6] using transition sequences of length 120 [1];
  • n-step Bellman Rewards and Targets [4];
  • Invertible Value Function Rescaling (forward and inverse) [7] (see the sketch after this list);
  • A Duelling Network Architecture [3];
  • A CNN followed by an LSTM for state encodings (with burn-in as per R2D2 [1]);
  • A Novel Deeper Denser Architecture using Skip-Connections, inspired by D2RL's [2] findings;
  • Gradient Clipping as recommended in [3];
  • Frame Stacking;
  • Observations resized to 84x84 and turned to greyscale using OpenCV (also illustrated in the sketch after this list).
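For concreteness, here is a minimal sketch of two of the simpler components above: the invertible value-function rescaling from [7] (as used in R2D2 [1]) and the 84x84 greyscale preprocessing. It is written against PyTorch and OpenCV; the constant `EPS` and the function names are illustrative assumptions rather than the notebook's actual code (see R2ND4.ipynb for the real implementation).

```python
import cv2
import numpy as np
import torch

EPS = 1e-3  # rescaling constant; [7] uses epsilon = 10^-3 (assumed here)

def value_rescale(x: torch.Tensor) -> torch.Tensor:
    """Forward rescaling h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x from [7]."""
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1.0) - 1.0) + EPS * x

def inverse_value_rescale(x: torch.Tensor) -> torch.Tensor:
    """Closed-form inverse of h, applied to target-network Q-values before
    the n-step Bellman target is re-rescaled."""
    return torch.sign(x) * (
        ((torch.sqrt(1.0 + 4.0 * EPS * (torch.abs(x) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Convert a raw RGB Atari frame to an 84x84 greyscale observation."""
    grey = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(grey, (84, 84), interpolation=cv2.INTER_AREA)
```

In an R2D2-style agent the n-step target is then formed as h(r_{t:t+n} + γⁿ · h⁻¹(Q_target)), which keeps TD errors in a well-scaled range on sparse-reward games such as Gravitar.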

Note: For training on a less complex environment, or if faster convergence is the priority, the 2nd, much wider linear layer in the Value and Advantage networks (the one containing the skip connection to the CNN output) can be removed entirely; the model then converges to a lower rolling mean score of around 3,000 after 180k episodes (reaching 3,500 after 500k episodes).
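For readers curious what that wider skip-connected layer looks like structurally, below is a minimal sketch of a duelling value/advantage head in which the 2nd linear layer re-ingests the flattened CNN output via a skip connection, in the spirit of D2RL [2]. All layer widths and names here are illustrative assumptions; the actual R2ND4 architecture is defined in R2ND4.ipynb.

```python
import torch
import torch.nn as nn

class DuelingSkipHead(nn.Module):
    """Duelling head [3] with a D2RL-style [2] skip connection from the CNN features.
    Layer sizes are placeholders, not the notebook's exact configuration."""

    def __init__(self, cnn_dim: int, lstm_dim: int, hidden: int, n_actions: int):
        super().__init__()
        # 1st layer operates on the LSTM state encoding.
        self.value_fc1 = nn.Linear(lstm_dim, hidden)
        self.adv_fc1 = nn.Linear(lstm_dim, hidden)
        # 2nd, much wider layer re-ingests the CNN output via a skip connection;
        # this is the layer the note above suggests removing for faster convergence.
        self.value_fc2 = nn.Linear(hidden + cnn_dim, hidden)
        self.adv_fc2 = nn.Linear(hidden + cnn_dim, hidden)
        self.value_out = nn.Linear(hidden, 1)
        self.adv_out = nn.Linear(hidden, n_actions)

    def forward(self, cnn_feat: torch.Tensor, lstm_feat: torch.Tensor) -> torch.Tensor:
        v = torch.relu(self.value_fc1(lstm_feat))
        a = torch.relu(self.adv_fc1(lstm_feat))
        # Skip connections: concatenate the raw CNN features back in.
        v = torch.relu(self.value_fc2(torch.cat([v, cnn_feat], dim=-1)))
        a = torch.relu(self.adv_fc2(torch.cat([a, cnn_feat], dim=-1)))
        value = self.value_out(v)
        adv = self.adv_out(a)
        # Standard duelling aggregation: Q = V + (A - mean(A)).
        return value + adv - adv.mean(dim=-1, keepdim=True)
```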

Contents:

  • An interactive notebook containing all code for R2ND4, with detailed in-line comments and descriptions to aid others in using it as a learning resource - R2ND4.ipynb.
  • An example video of the (partially) converged agent playing Gravitar, achieving a score of 5050. This episode was neither an anomaly nor an unusual result (see graph).
  • Training logs for statistical analysis or plotting of data.
  • A graph of the training scores over time.

To run the notebook code (R2ND4.ipynb), open it in Jupyter or upload it to Google Colab.

Results:

Demo video (taken from my portfolio page):


We were tasked with creating a Reinforcement Learning agent to play the notoriously difficult Gravitar from the Atari-57 suite. I therefore looked to the then state-of-the-art Reinforcement Learning model for Atari (R2D2) and re-created it as faithfully as I could on my limited hardware. I produced the best agent in the class, and my convergence graph was used as exemplar feedback to the cohort.

Training graph of mean scores (rolling average over past 100 episodes):

Note: This graph was selected as the best in the cohort and included in the class feedback as one of two exemplar graphs.

As is clear from the graph, the agent has not yet fully converged.

References:

The findings of the following papers were relied upon for the design of R2ND4:

The codebase is based upon and borrows from the following sources:
