# Vision Transformers

Not too long after Vaswani *et al.*'s [*Attention is all you need*](https://arxiv.org/abs/1706.03762) paper started revolutionizing NLP models, people started to wonder if transformers could be applied to other domains.

In 2020, Alexey Dosovitskiy *et al.* (from Google Brain) posted a preprint called [*An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*](https://arxiv.org/abs/2010.11929), later published at ICLR 2021, in which they applied transformers to computer vision. Their intro starts:

> Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.

After noting that some attempts to apply transformers to computer vision had not been terribly successful, largely because they continued to use CNN-like approaches, Dosovitskiy *et al.* state:

> Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.

And while they find that results are less than stellar with smaller datasets, Dosovitskiy *et al.* report the following for their Vision Transformer (ViT):

> When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

Here's figure 1 from Dosovitskiy *et al.* (2021):

![Figure 1 from Dosovitskiy *et al.* 2021 ViT architecture](images/ViT_architecture.png)
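To make the patch idea concrete, here is a minimal sketch in plain Python of splitting an image into non-overlapping 16x16 patches and flattening each one. The function name `split_into_patches` and the all-zero toy image are illustrative assumptions, not code from the paper; the real model goes on to multiply each flattened patch by a learned projection matrix and add position embeddings.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches

# Toy 224 x 224 RGB image filled with zeros (a stand-in for a real image).
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 768
```

For a 224x224 RGB image this gives 196 patches of 768 values each, which is the "sequence of tokens" the Transformer sees.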

## First a word about Python modules and Jupyter kernels...

Before getting into using transformers, we need to learn a bit more about Python modules and Jupyter kernels. Check out [this page on the class website](https://aibiology.github.io/vit_mamba_setup.html).

## And now, two options...

At this point, in looking for examples to work through, I ran into some challenges. First, this is all relatively new, so there are not a lot of great examples out there. There are several that walk through *training* a ViT-like model, but not many that go into *using* ViT with transfer learning, fine tuning, and application to data.

I did find two that seem good. 

### Option 1

The first is in this notebook and comes from Philipp Schmid's blog post [*Image Classification with Hugging Face Transformers and `Keras`*](https://www.philschmid.de/image-classification-huggingface-transformers-keras). The problem is that, as written, it takes about one hour per epoch to train. With more time, I might have tried subsampling the data.

### Option 2

The second example, which I think we'll use in class, is in notebook [21b_vit-fine-tuning.ipynb](21b_vit-fine-tuning.ipynb) and comes from [Rauf Momin's Kaggle notebook](https://www.kaggle.com/code/raufmomin/vision-transformer-vit-fine-tuning).

### General observation

One general observation about both options: while both fine-tune a ViT model, neither freezes the weights of the main ViT model as we did in transfer learning. I honestly don't know whether that's an oversight, a deliberate strategy, or something else.
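
For comparison, freezing the pretrained encoder would look roughly like the sketch below. It assumes the model is a Hugging Face `TFViTForImageClassification`, which (as far as I know) exposes the encoder as `model.vit`; the helper name `freeze_backbone` is made up for illustration.

```python
def freeze_backbone(model):
    """Mark the pretrained ViT encoder as non-trainable.

    Assumption: `model` exposes its encoder as `model.vit`, as
    `TFViTForImageClassification` does. Call before `model.compile(...)`
    so that only the new classification head is updated during training.
    """
    model.vit.trainable = False
    return model
```

With the encoder frozen, only the small classification layer on top is trained, which is the transfer learning setup we used with CNNs.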


## Option 1

Much of this example comes from Philipp Schmid's blog post [*Image Classification with Hugging Face Transformers and `Keras`*](https://www.philschmid.de/image-classification-huggingface-transformers-keras).

This example uses the EuroSAT satellite images from [here](https://github.com/phelber/eurosat) published in:
* Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. Patrick Helber, Benjamin Bischke, Andreas Dengel, Damian Borth. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019. [https://doi.org/10.48550/arXiv.1709.00029](https://doi.org/10.48550/arXiv.1709.00029)

*  Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. Patrick Helber, Benjamin Bischke, Andreas Dengel. 2018 IEEE International Geoscience and Remote Sensing Symposium, 2018. [https://doi.org/10.1109/IGARSS.2018.8519248](https://doi.org/10.1109/IGARSS.2018.8519248)

For the class, the dataset has been installed at: `/blue/bsc4892/share/EuroSAT_RGB/`

Here's the description of the data and a composite image of patches:

> In this study, we address the challenge of land use and land cover classification using Sentinel-2 satellite images. The Sentinel-2 satellite images are openly and freely accessible provided in the Earth observation program Copernicus. We present a novel dataset based on Sentinel-2 satellite images covering 13 spectral bands and consisting out of 10 classes with in total 27,000 labeled and geo-referenced images. 

![EuroSAT image](https://github.com/phelber/EuroSAT/raw/master/eurosat_overview_small.jpg?raw=true)

The land use classes are: `AnnualCrop`,  `Forest`,  `HerbaceousVegetation`,  `Highway`,  `Industrial`,  `Pasture`,  `PermanentCrop`,  `Residential`,  `River`,  `SeaLake`. 

## Import modules

I have opted to stick with the image loading and processing pipeline used by Philipp rather than try to convert this to use Keras's `image_dataset_from_directory`, which I *think* should work...

In [None]:
import os

import datasets

from transformers import pipeline
from transformers import ViTFeatureExtractor
from transformers import DefaultDataCollator
from transformers import TFViTForImageClassification, create_optimizer


from tensorflow import keras 
from tensorflow.keras import layers
from tensorflow.keras.callbacks import TensorBoard as TensorboardCallback, EarlyStopping

import tensorflow as tf

from huggingface_hub import HfFolder



In [None]:
# Set model ID and path to images
model_id = "google/vit-base-patch16-224-in21k"
eurosat_path="/blue/bsc4892/share/EuroSAT_RGB/"

In [None]:
def create_image_folder_dataset(path):
  """creates `Dataset` from image folder structure"""
  
  # get class names from folder names (sorted so label ids are reproducible)
  _CLASS_NAMES=sorted(d for d in os.listdir(path) if os.path.isdir(os.path.join(path,d)))
  # defines `datasets` features
  features=datasets.Features({
                      "img": datasets.Image(),
                      "label": datasets.features.ClassLabel(names=_CLASS_NAMES),
                  })
  # temp list holding datapoints for creation
  img_data_files=[]
  label_data_files=[]
  # load image paths and labels into the lists
  for img_class in _CLASS_NAMES:
    for img in os.listdir(os.path.join(path,img_class)):
      path_=os.path.join(path,img_class,img)
      img_data_files.append(path_)
      label_data_files.append(img_class)
  # create dataset
  ds = datasets.Dataset.from_dict({"img":img_data_files,"label":label_data_files},features=features)
  return ds

In [None]:
# Load the data
eurosat_ds = create_image_folder_dataset(eurosat_path)
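
A quick sanity check on the folder layout can save a long wait later. The sketch below counts files per class folder using only the standard library; `count_images_per_class` is a hypothetical helper, and it assumes one subfolder per class as in the EuroSAT RGB download.

```python
import os

def count_images_per_class(root):
    """Return {class_name: number_of_files} for each subfolder of `root`."""
    counts = {}
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if os.path.isdir(class_dir):  # skip stray files such as READMEs
            counts[class_name] = len(os.listdir(class_dir))
    return counts

# Example usage (path as set earlier in this notebook):
# print(count_images_per_class(eurosat_path))
```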

In [None]:
# Display the class labels
img_class_labels = eurosat_ds.features["label"].names
print(img_class_labels)

In [None]:
# Load the feature extractor (the image preprocessing settings) for the pretrained model
# Note this step takes several minutes to run--not sure why...
feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)

In [None]:
# learn more about data augmentation here: https://www.tensorflow.org/tutorials/images/data_augmentation


# data_augmentation = keras.Sequential(
#     [
#         layers.Resizing(feature_extractor.size, feature_extractor.size),
#         layers.Rescaling(1./255),
#         layers.RandomFlip("horizontal"),
#         layers.RandomRotation(factor=0.02),
#         layers.RandomZoom(
#             height_factor=0.2, width_factor=0.2
#         ),
#     ],
#     name="data_augmentation",
# )

# # use keras image data augmentation processing
# def augmentation(examples):
#     # print(examples["img"])
#     examples["pixel_values"] = [data_augmentation(image) for image in examples["img"]]
#     return examples

In [None]:
# Define the preprocessing function

# basic processing: the feature extractor resizes, rescales, and normalizes each image
def process(examples):
    examples.update(feature_extractor(examples['img']))
    return examples

# we are also renaming our label col to labels to use `.to_tf_dataset` later
eurosat_ds = eurosat_ds.rename_column("label", "labels")
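
For intuition, the per-pixel arithmetic the feature extractor applies after resizing is roughly the following. This is a sketch, assuming the checkpoint's preprocessing config uses a mean and standard deviation of 0.5 for every channel (check `feature_extractor.image_mean` and `feature_extractor.image_std` to confirm); `normalize_pixel` is a hypothetical helper, not part of the library.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit pixel value (0-255) to the range the model expects."""
    scaled = value / 255.0          # rescale to [0, 1]
    return (scaled - mean) / std    # shift and scale; [-1, 1] when mean = std = 0.5

print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
```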

In [None]:
# Preprocess dataset
processed_dataset = eurosat_ds.map(process, batched=True)
processed_dataset


In [None]:
# Split the data into train and test datasets
# test size will be 15% of the full dataset
test_size=.15

processed_dataset = processed_dataset.shuffle().train_test_split(test_size=test_size)

In [None]:

id2label = {str(i): label for i, label in enumerate(img_class_labels)}
label2id = {v: k for k, v in id2label.items()}

num_train_epochs = 5
train_batch_size = 32
eval_batch_size = 32
learning_rate = 3e-5
weight_decay_rate=0.01
num_warmup_steps=0
output_dir=model_id.split("/")[1]
#hub_token = HfFolder.get_token() # or your token directly "hf_xxx"
hub_model_id = f'{model_id.split("/")[1]}-euroSat'
fp16=True

# Train in mixed-precision float16
# Set fp16=False above if you're using a GPU that will not benefit from this
if fp16:
  tf.keras.mixed_precision.set_global_policy("mixed_float16")
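
To see what the `id2label` and `label2id` dictionaries contain, here is the same construction run on the ten class names listed earlier. The alphabetical order and the `example_` variable names are assumptions for illustration; in the notebook the order comes from the dataset's `ClassLabel` feature. Note that the ids are strings here.

```python
example_labels = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]

# Same dictionary comprehensions as in the cell above, on the example list.
example_id2label = {str(i): label for i, label in enumerate(example_labels)}
example_label2id = {v: k for k, v in example_id2label.items()}

print(example_id2label["1"])      # Forest
print(example_label2id["River"])  # 8 (stored as the string "8")
```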

In [None]:
# Data collator that stacks the examples into batches of TensorFlow tensors (no padding is needed for fixed-size images)
data_collator = DefaultDataCollator(return_tensors="tf")

# converting our train dataset to tf.data.Dataset
tf_train_dataset = processed_dataset["train"].to_tf_dataset(
   columns=['pixel_values'],
   label_cols=["labels"],
   shuffle=True,
   batch_size=train_batch_size,
   collate_fn=data_collator)

# converting our test dataset to tf.data.Dataset
tf_eval_dataset = processed_dataset["test"].to_tf_dataset(
   columns=['pixel_values'],
   label_cols=["labels"],
   shuffle=False,  # evaluation does not need shuffling
   batch_size=eval_batch_size,
   collate_fn=data_collator)

In [None]:
# create optimizer with weight decay
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=learning_rate,
    num_train_steps=num_train_steps,
    weight_decay_rate=weight_decay_rate,
    num_warmup_steps=num_warmup_steps,
)
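
The step count passed to `create_optimizer` is easy to estimate by hand. The sketch below assumes the 27,000 images quoted in the dataset description, that the split rounds the test set up, and that the last partial batch is kept; all three are assumptions on my part, so treat the numbers as approximate. `estimate_train_steps` is a hypothetical helper.

```python
import math

def estimate_train_steps(num_images, test_size, batch_size, epochs):
    """Return (training images, steps per epoch, total training steps)."""
    num_test = math.ceil(num_images * test_size)   # held-out images
    num_train = num_images - num_test              # images used for training
    steps_per_epoch = math.ceil(num_train / batch_size)
    return num_train, steps_per_epoch, steps_per_epoch * epochs

print(estimate_train_steps(27_000, 0.15, 32, 5))  # (22950, 718, 3590)
```

With roughly 718 steps per epoch, the one hour per epoch mentioned earlier works out to about five seconds per step.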



In [None]:
# load pre-trained ViT model
model = TFViTForImageClassification.from_pretrained(
    model_id,
    num_labels=len(img_class_labels),
    id2label=id2label,
    label2id=label2id,
)

# define loss
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# define metrics 
metrics=[
    tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
    tf.keras.metrics.SparseTopKCategoricalAccuracy(3, name="top-3-accuracy"),
]

# compile model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metrics
              )
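
The top-3 metric counts a prediction as correct when the true class is among the three highest-scoring classes. Here is a plain-Python sketch of that idea; the function `top_k_accuracy` and the toy logits are illustrative, not the Keras implementation.

```python
def top_k_accuracy(logits_batch, labels, k=3):
    """Fraction of examples whose true label is among the k largest logits."""
    hits = 0
    for logits, label in zip(logits_batch, labels):
        # Indices of the classes sorted from highest to lowest score.
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

toy_logits = [[0.1, 0.5, 0.2, 0.9], [2.0, 1.0, 0.5, 0.1]]
toy_labels = [2, 3]
print(top_k_accuracy(toy_logits, toy_labels, k=1))  # 0.0
print(top_k_accuracy(toy_logits, toy_labels, k=3))  # 0.5
```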


In [None]:
callbacks=[]

callbacks.append(TensorboardCallback(log_dir=os.path.join(output_dir,"logs")))
callbacks.append(EarlyStopping(monitor="val_accuracy",patience=1))

#We won't be using HuggingFace hub
#if hub_token:
#  callbacks.append(PushToHubCallback(output_dir=output_dir,
#                                     hub_model_id=hub_model_id,
#                                     hub_token=hub_token))
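
With `patience=1`, training stops the first time validation accuracy fails to improve on its best value so far. The sketch below mimics that rule in plain Python; `early_stopping_epoch` is a hypothetical helper and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_accuracies, patience=1):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("-inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(early_stopping_epoch([0.90, 0.93, 0.92, 0.95]))  # 3
print(early_stopping_epoch([0.90, 0.93, 0.95]))        # None
```

So with five epochs requested, a single flat or worse epoch ends training early, which may be stricter than you want if the validation accuracy is noisy.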


In [None]:
train_results = model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=callbacks,
    epochs=num_train_epochs,
)
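
`model.fit` returns a `History` object whose `.history` attribute is a dictionary of per-epoch metric lists. Here is a small sketch for picking out the best epoch; `best_epoch` is a hypothetical helper and the numbers in `example_history` are invented purely to show the shape of the data.

```python
def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch, value) where `metric` was highest."""
    values = history[metric]
    index = max(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]

example_history = {"val_accuracy": [0.91, 0.95, 0.94]}  # made-up values
print(best_epoch(example_history))  # (2, 0.95)

# In the notebook: best_epoch(train_results.history)
```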