Skip to content
A collection of resources to study Transformers in depth.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information. Update Sep 24, 2019


A collection of resources to study Transformers in depth.

The original paper

Paper Reviews

If you want an easily comprehensible overview of the paper, Yannic Kilcher's video is a great starting point. For a more discussion-based introduction to Transformers, take a look at AISC's or Microsoft Reading Group's recording. Lastly, Rachel from Kaggle has a 3-part series/livestream, where she reads and tries to understand the paper, while responding to viewers' questions.

Blog Posts

Besides paper reviews, there are also incredible blog posts available. Jay Alammar's "The Illustrated Transformer", with its simple explanations and intuitive visualizations, is the best place to start understanding the different parts of the Transformer such as self-attention, the encoder-decoder architecture and positional encoding. From there, I would read Peter Bloem's blog post, which is one of the most well-written pieces I've encountered so far, with clear wording, beautiful graphics and understandable code. It goes into further detail on the self-attention mechanism and variations of the Transformer, while also providing accompanying PyTorch code. If you want to know more about different types of attention, head to Lilian Weng's blog.

Lectures and Talks

Beyond blog posts and paper reviews, you can also find some amazing lectures & talks on Transformers and self-attention online. Stanford CS224u's lecture goes into more details on the math, BERT and other contextual vectors, whereas the Stanford CS224n guest lecture (by the co-authors of the Transformer and Music Transformer) cover various use cases of self-attention. Rachel (and Jeremy) from give another great overview of Transformer in their Intro To NLP course, but also answer some of the common confusions around Transformers such as the query, key, value system and address its application to language translation. Finally, if you want to hear more from the co-authors of the Transformer, Lukasz Kaiser and Ashish Vaswani each gave a wonderful talk on their work at Pi School 2017 and RAAIS 2019 respectively.

Code Walkthroughs

One of the best ways to understand a concept is implementing it in code. Harvard NLP published an annotated version of the original paper with commented-out code in PyTorch, which is discussed in one of the recorded AISC sessions linked below. If you prefer TensorFlow, there is also a TensorFlow 2.0 Tutorial with a Colab notebook that you can run for free. Once you're comfortable with the basic concepts, check out the NAACL Tutorial on Transfer Learning, which has an amazing Colab that teaches you how to pre-train a GPT2-like Transformer, fine-tune it and do multi-task learning as well as an amazing slide deck full of information about recent developments in Transfer Learning for Natural Language Processing.

Follow-Up Papers

Since the original paper was published, there has been a massive wave of papers building on the Transformer. Most notably, BERT, GPT-2 and XLNet.


GPT-1 and GPT-2

Transformer XL and XLNet

You can’t perform that action at this time.