Until recently, the best-performing models for image classification were convolutional neural networks (CNNs), introduced in LeCun et al. (1998). Transformer architectures have since been shown to match or exceed their performance. One such model, the Vision Transformer (Dosovitskiy et al., 2020), splits each image into regularly sized patches. The patches are treated as a sequence, and attention weights are learned as in a standard transformer model.
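To make the patch idea concrete, here is a minimal sketch of the splitting step alone, assuming NumPy and an image stored as a height × width × channels array. The function name `image_to_patches`, the 32 × 32 image, and the patch size of 8 are illustrative choices, not taken from Dosovitskiy et al. (2020), and the sketch leaves out the learned projection and position embeddings that a full model would apply to each patch.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Cut each spatial axis into (block index, offset within block).
    blocks = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two block indices to the front so each patch is grouped together.
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return blocks.reshape(rows * cols, patch_size * patch_size * channels)


# Hypothetical 32x32 RGB image filled with dummy values.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = image_to_patches(image, patch_size=8)
print(patches.shape)  # (16, 192): a 4x4 grid of patches, each 8*8*3 values
```

Each row of `patches` plays the role that a token plays in a text sequence, which is what lets a standard transformer be applied to the image.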
The Transformer architecture, introduced in the paper Attention Is All You Need by Vaswani et al. (2017), is one of the most widely used neural network architectures in modern machine learning. Its parallelism and scalability to large problems have led to its adoption in domains beyond the one it was originally designed for (sequential data).
NOTE: We adapt and borrow much of the material and many of the concepts from Torralba, A., Isola, P., & Freeman, W. T. (2021, December 1). Foundations of Computer Vision. The MIT Press, Massachusetts Institute of Technology.
Potentially useful packages:

