# Earth Observation Foundation Models and Benchmarking

## Section A: Introduction

---
---

### From Transfer Learning and Beyond

---

Transfer learning is a powerful technique in machine learning that allows a model trained on one task to be reused for a different but related task. Instead of starting from scratch, transfer learning leverages the knowledge a model has already gained from a large dataset and applies it to a new problem, which is particularly useful when labeled data for the new task is limited. Once trained, the lower layers of these models typically learn general features, such as edges in images or syntactic patterns in text, which can be useful across a range of tasks. By fine-tuning the upper layers of a pre-trained model or adding new layers specific to the new task, developers can adapt it to perform well with far fewer training examples. Properly executed transfer learning not only saves computational resources but also significantly reduces training time. It can also lead to better performance, particularly in scenarios where the target dataset is small or imbalanced. There are different strategies for transfer learning, such as freezing all or some of the layers during training, or allowing all layers to be updated depending on the similarity between the source and target tasks. For example, in medical imaging, models trained on natural images can be adapted to detect tumors or classify X-rays by fine-tuning only a few layers.
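
As a rough illustration of the freezing strategy described above, here is a minimal PyTorch sketch. The tiny `backbone` and `head` modules are hypothetical stand-ins for a real pretrained network and a new task-specific layer; the point is only that frozen parameters stay fixed while the new layers are updated.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained network: "backbone" plays the role of the lower,
# general-purpose layers and "head" the new task-specific layer.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 2)  # hypothetical 2-class target task
model = nn.Sequential(backbone, head)

# Freeze the backbone so that only the new head is updated.
for param in backbone.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

# One training step on random data, to show that only the head changes.
images = torch.randn(4, 3, 16, 16)
labels = torch.tensor([0, 1, 0, 1])
backbone_before = backbone[0].weight.detach().clone()
head_before = head.weight.detach().clone()

loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_changed = not torch.equal(backbone_before, backbone[0].weight)
head_changed = not torch.equal(head_before, head.weight)
print("backbone changed:", backbone_changed, "| head changed:", head_changed)
```

Unfreezing some or all of the backbone is a matter of leaving `requires_grad` set to `True` for those layers, usually with a smaller learning rate.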

<img src="./assets/pretraining_workflow.png" style='height:400px'/>

Figure 1. Simplified foundation model pretraining workflow

An emerging category of models, termed **'Foundation Models (FMs)'**, is a subset of transfer learning in which the pretraining task is derived from raw, unlabelled data rather than from a specific labelled task. In other words, these models are trained in a 'self-supervised' manner. They potentially offer benefits similar to those of traditional transfer learning (i.e., reduced training time and improved performance). However, unlike traditional transfer learning, FMs circumvent the need to develop an initial large-scale labelled dataset. This technique is especially powerful in remote sensing, where large-scale unprocessed data is available in unprecedented volume and the methods for transforming it into actionable insights draw from a wide range of scientific disciplines. **The larger goal of the Earth observation foundation model (EOFM) development community is to create a model capable of generalizing across domains by learning from the structure and correlations inherent in the raw data, opening the door to more versatile models that can adapt with minimal fine-tuning. Attempts towards this elusive goal, however, must be thoroughly assessed in a way that is relevant to the greater remote sensing community.** In this notebook we cover a simplified understanding of EOFMs, how to use them, how to benchmark their performance, and how to assess their utility for your downstream task.

---

### Pretraining Methodologies

---

Below are three common methods for pretraining a computer vision model in a self-supervised fashion. This is by no means an exhaustive list! Due to the widespread availability of satellite data, a number of variations and combinations of the methods below exist, along with alternative techniques that are not in widespread use. There is also an extensive effort to marry the computer vision and language domains, both in and out of the remote sensing community. For simplicity, we omit those from this notebook. It is worth noting that the listed paradigms are all 'transformer-based': the core architectural unit is a vision transformer, typically trained under the 'encoder-decoder' paradigm. While it is important to understand the high-level distinctions between these techniques, it is not strictly necessary in order to use them in practice, as many developer groups have worked hard to abstract most of the logic away in various frameworks. A deep understanding of each of these model families, as well as the nuances of a transformer, is out of the scope of this notebook, but we encourage folks to learn as much as possible!

#### <u>Autoencoders</u> 

<img src="./assets/mae.png" style='height:400px'/>

Figure 2. Diagram depicting masked autoencoder training. Images are broken into patches that are subsequently embedded (typically with a linear layer) and masked. The decoder is tasked with reconstructing the original image from the unmasked context.

Autoencoders consist of an encoder that compresses input images into a lower-dimensional latent space and a decoder that reconstructs the original images from these representations. This compression helps the model capture essential features while discarding noise or irrelevant details. The most dominant, though not necessarily the most performant, method is masked image modelling (i.e., the 'Masked Autoencoder' (MAE)), where the model is tasked with reconstructing the masked parts of an image from the unmasked context. Researchers have presented a handful of masking strategies, though they all work under the same overarching idea: a model optimized on the reconstruction task learns representations that generalize, that is, perform well on unseen data.
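
To make the masking step concrete, the sketch below splits an image into patches and hides a random subset, which is the simplest of the masking strategies mentioned above. It uses NumPy only; the function name `random_patch_mask`, the patch size, and the 75% mask ratio are illustrative assumptions rather than the settings of any particular EOFM.

```python
import numpy as np

def random_patch_mask(image, patch_size, mask_ratio, rng):
    """Split an (H, W, C) image into patches and hide a random subset.

    H and W are assumed to be divisible by patch_size.
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (num_patches, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    num_keep = int(round(gh * gw * (1 - mask_ratio)))
    order = rng.permutation(gh * gw)
    keep_idx = np.sort(order[:num_keep])   # patches the encoder sees
    mask_idx = np.sort(order[num_keep:])   # patches the decoder must rebuild
    return patches[keep_idx], patches[mask_idx], keep_idx, mask_idx

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # synthetic stand-in for a satellite chip
visible, targets, keep_idx, mask_idx = random_patch_mask(image, 8, 0.75, rng)
print("visible patches:", visible.shape, "| masked patches:", targets.shape)
```

During pretraining, only `visible` would be passed through the encoder, and the reconstruction loss would be computed against `targets`.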

#### <u>Contrastive Learning</u> 

<img src="./assets/simclr.png" style='height:400px'/>

Figure 3. Diagram for SimCLR training paradigm.

Contrastive models are tasked with distinguishing between similar and dissimilar image representations. They work by pulling together representations of positive pairs (different augmented views of the same image) while pushing apart representations of negative pairs (views from different images). Common methods include SimCLR ('A Simple Framework for Contrastive Learning of Visual Representations') and 'Momentum Contrast' (MoCo). Unlike autoencoders, a decoder is typically not included in the pretraining phase since the task is executed entirely in the latent space.
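
The pull-together, push-apart objective can be sketched with an InfoNCE-style loss. The version below is a simplified, one-directional form written in NumPy for readability; it is not the exact loss used by SimCLR or MoCo, and the temperature value and synthetic embeddings are assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a should match row i of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                   # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; every other entry acts as a negative.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.05 * rng.normal(size=(8, 16))  # views that agree
shuffled = aligned[::-1]                             # deliberately mismatched pairs

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
print("aligned loss lower than shuffled:", loss_aligned < loss_shuffled)
```

The loss is small when each embedding is closest to its own augmented view and grows when the pairing is broken.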

#### <u>Non-contrastive Self Distillation</u> 

<img src="./assets/dino.png" style='height:400px'/>

Figure 4. Diagram for DINO training paradigm.

Non-contrastive self-distillation is a self-supervised learning approach for computer vision that avoids the need for negative samples. Instead of contrasting different images, it trains a student network to match the output of a teacher network, with each network fed a different augmented view of the same image. The teacher is often an exponential moving average of the student, which provides stable targets for the student to match. The objective is to produce consistent, meaningful embeddings, not to reconstruct or generate the input. Common methods include 'Self-**di**stillation with **no** labels' (DINO) and 'Bootstrap Your Own Latent' (BYOL).
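
The exponential moving average update of the teacher can be sketched in a few lines of plain Python. Weights are represented here as a dictionary of floats purely for illustration; in practice the same rule is applied to every tensor in the network, and the momentum value is an assumption.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step towards the student weight."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# The student is held fixed here; during training it would change every step.
for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, round(teacher["w"], 4))
```

A momentum close to 1 makes the teacher change slowly, which is what keeps its targets stable.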

---

### Decoders

---

You're probably wondering at this point: what do you do with the trained encoder? Just as you would with any trained neural network, you can load the weights for inference using code similar to what you have seen in previous chapters. There is, however, one more consideration, which depends on the type of task at hand and the shape of the encoder's output. If your goal is to classify images, a lightweight machine learning algorithm applied directly to the vector outputs of the encoder may be sufficient. The assumption here is that the internal representation of the image is robust and **relevant** to your task. These are the 'embeddings' you may have heard of recently, and they are analogous to a large, coarse-resolution raster. We will touch on this topic in a separate chapter. If your goal, on the other hand, is to segment the images, you may need to train a subsequent neural network to extract information from that internal representation and produce a usable map. Below is a summary list of contemporary decoders used in conjunction with transformer-based encoders. They vary in complexity, purpose, and the way they extract features from the transformer layers of the aforementioned encoders.
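
For the classification case, a 'lightweight machine learning algorithm' on top of embeddings can be as simple as a nearest-centroid classifier. The sketch below uses synthetic vectors as stand-ins for encoder outputs; the helper names `fit_centroids` and `predict` are hypothetical, and any standard classifier could be substituted.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Average the embeddings of each class to get one centroid per class."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(embeddings, classes, centroids):
    """Assign each embedding to the class of its nearest centroid."""
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
# Synthetic stand-ins for encoder outputs: two classes around different means.
emb_a = rng.normal(loc=0.0, scale=0.5, size=(50, 16))
emb_b = rng.normal(loc=2.0, scale=0.5, size=(50, 16))
train_x = np.concatenate([emb_a[:40], emb_b[:40]])
train_y = np.array([0] * 40 + [1] * 40)
test_x = np.concatenate([emb_a[40:], emb_b[40:]])
test_y = np.array([0] * 10 + [1] * 10)

classes, centroids = fit_centroids(train_x, train_y)
accuracy = float((predict(test_x, classes, centroids) == test_y).mean())
print("held-out accuracy:", accuracy)
```

How well such a simple classifier works on real embeddings is itself a useful signal of how relevant the representation is to the task.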

#### FCN

<img src="./assets/fcn.png" style='height:400px'/>

Figure 5. Fully convolutional network (FCN) architectural diagram.

The FCN is a stack of convolutional layers that converts the transformer’s output embeddings back into an image-like spatial representation, then upsamples and refines the resulting feature maps. This representation can then be fed through a final prediction layer, typically itself convolutional, to produce the final map. The FCN is typically more lightweight than some of the MLP-based decoders below, due to weight sharing in the learned kernel filters. Local context is also explicitly emphasized, as those same kernel filters may cover only a small portion of the whole input image.
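
The token-to-map conversion described above can be sketched as follows. `TinyFCNHead` is a hypothetical, heavily simplified head and not the `seg_fcn` decoder shipped with the framework; it assumes the class token has already been removed and that the tokens form a regular grid.

```python
import torch
from torch import nn

class TinyFCNHead(nn.Module):
    """Hypothetical FCN-style head: tokens -> feature map -> class logits."""

    def __init__(self, embed_dim, num_classes, patch_size):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(embed_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.classify = nn.Conv2d(64, num_classes, kernel_size=1)
        # Upsample from the patch grid back to the input resolution.
        self.upsample = nn.Upsample(
            scale_factor=patch_size, mode="bilinear", align_corners=False
        )

    def forward(self, tokens, grid_size):
        batch, _, dim = tokens.shape
        grid_h, grid_w = grid_size
        # (B, N, D) -> (B, D, grid_h, grid_w): back to an image-like layout
        feats = tokens.transpose(1, 2).reshape(batch, dim, grid_h, grid_w)
        return self.upsample(self.classify(self.refine(feats)))

# A 224x224 input with 16x16 patches yields a 14x14 grid of tokens.
tokens = torch.randn(2, 14 * 14, 768)
head = TinyFCNHead(embed_dim=768, num_classes=2, patch_size=16)
logits = head(tokens, (14, 14))
print(tuple(logits.shape))
```

The 3x3 convolution is where the local context mentioned above enters: each output location only sees its immediate neighbours on the token grid.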

#### Segformer

<img src="./assets/segformer.png" style='height:400px'/>

Figure 6. Segformer-style network architectural diagram for semantic segmentation.

The Segformer uses an MLP-style decoder to extract features from a hierarchical transformer-based encoder. Originally developed as a standalone framework for end-to-end training, the Segformer decoder has recently been adapted by a handful of research groups attempting to utilize the representation generated by the pretraining phase. The assumption is that the transformer blocks in the encoder are sufficiently hierarchical in nature that their outputs can be composed using this method.

#### UperNet

<img src="./assets/upernet.png" style='height:400px'/>

Figure 7. UperNet architectural diagram.

Similar to the Segformer architecture, the UperNet leverages multi-scale hierarchical features to generate dense predictions. Rather than MLPs, the UperNet uses pooling layers and convolutions to compose the features before the final prediction layers. It's worth noting that the UperNet was designed with a fine-tuned ResNet-50 backbone. In other words, it closely aligns with the 'pretrained encoder with a decoder head' methodology that has recently been popularized.

#### Linear Probe

<img src="./assets/linear.png" style='height:400px'/>

Figure 8. Simplified structure of a single linear layer.

The linear probe is the simplest decoder available: a single linear layer that processes just the output of the encoder. In practice, it isn't typically used for segmentation due to its limited model complexity. It is still useful, however, for assessing representation quality, as it is cheap to train and can be implemented consistently, making it an ideal tool for benchmarking. This mirrors practice in the language domain, where large language models are often evaluated on downstream tasks with linear probes on frozen embeddings before full fine-tuning.
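
A minimal sketch of a linear probe on a frozen encoder is shown below. The `encoder` here is a single linear layer standing in for a pretrained EOFM, and all dimensions are assumptions chosen for illustration; the point is how few trainable parameters the probe adds.

```python
import torch
from torch import nn

embed_dim, num_classes = 768, 2

# Stand-in for a pretrained encoder; a real EOFM would be loaded from a checkpoint.
encoder = nn.Linear(32, embed_dim)
for param in encoder.parameters():
    param.requires_grad = False  # frozen: the probe measures the representation as-is
encoder.eval()

probe = nn.Linear(embed_dim, num_classes)  # the entire decoder

patches = torch.randn(4, 196, 32)  # (batch, tokens, raw patch features)
with torch.no_grad():
    tokens = encoder(patches)      # (4, 196, 768)
logits = probe(tokens)             # one prediction per token: (4, 196, 2)

n_probe = sum(p.numel() for p in probe.parameters())
print(tuple(logits.shape), "| trainable probe parameters:", n_probe)
```

Because only the probe is trained, differences in downstream accuracy between encoders can be attributed mostly to the quality of their representations.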

#### Muster

<img src="./assets/muster.png" style='height:400px'/>

Figure 9. Architectural diagram of the MUSTER decoder.

MUSTER is a more recent decoder development in the computer vision realm. Similar to both the Segformer and the UperNet, hierarchical features play an important role in its predictive strength; the most important distinction is that the decoder itself is also transformer-based. Also like the UperNet, the MUSTER decoder was designed to integrate with pretrained encoders. While this may lend itself to greater performance, it is important to consider the compute required to implement such an architecture.


## Section B: The Framework

---
---

We build upon the Pangaea Bench benchmarking framework originally developed by the ESA Phi Lab team. While other teams have frameworks in development, Pangaea already integrates many of the more popular remote sensing foundation models into its pure PyTorch framework, along with a number of well-established benchmarking datasets. It also includes a relatively simple dataset integration scheme, along with additional decoder heads integrated by the team at Spatial Informatics Group for the purpose of evaluating methods for extracting information from the internal representation of the foundation model. The limited set of dependencies is especially beneficial for small-scale demonstrations such as this learning notebook. You will find themes familiar from previous chapters of the Applied Deep Learning Book, such as dataloaders, PyTorch modules, and loss functions. These are not exclusive to training end-to-end models; in fact, the workflow is almost exactly the same! EOFMs are intended to solve the same segmentation, classification, or regression problems. Here, we focus on the additional considerations that come with using an EOFM.

<img src="./assets/pangaea_workflow.png" style='height:400px'/>

Figure 10. Diagram of Pangaea's general structure.

---

### Software/Hardware considerations

CUDA is NVIDIA's platform for general-purpose computing on its GPUs. While it is not strictly necessary, we recommend running this notebook in a CUDA-enabled Linux environment for convenience and setup simplicity. We also recommend installing python>=3.10, either standalone or through a conda-managed environment. To run a model, we can estimate the VRAM required from the number of parameters and the numerical precision. For example, a 300M-parameter model at 32-bit precision (4 bytes per parameter) needs roughly 1.2 GB for the weights; with an Adam-style optimizer that keeps two extra values per parameter, weights plus optimizer states come to roughly 3.6 GB. Another 1.2 GB is required for the gradients of those parameters, and a few more GB for the data and intermediate activations. Likewise, the decoder used to extract from the internal representation also imposes VRAM and compute requirements. There's no universal formula here, but tools like PyTorch hooks, `torch.cuda.memory_summary()`, or profilers can help. 12 GB of VRAM is a good place to start, and luckily there are free Colab options at this compute scale; however, there is a trend towards larger and more complex models, which inevitably consume more compute.
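
The back-of-the-envelope estimate above can be written as a small helper. This is a rough sketch only: the function name is hypothetical, it assumes an optimizer that stores two extra values per parameter (as Adam does), and it ignores activations, the decoder, and framework overhead.

```python
def estimate_training_vram_gb(n_params, bytes_per_param=4, optimizer_states=2):
    """Rough memory for weights, optimizer states and gradients, in GB (1e9 bytes)."""
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param
    optimizer = n_params * bytes_per_param * optimizer_states
    to_gb = 1e-9
    return {
        "weights": weights * to_gb,
        "optimizer_states": optimizer * to_gb,
        "gradients": gradients * to_gb,
        "total": (weights + gradients + optimizer) * to_gb,
    }

# 300M parameters at 32-bit precision (4 bytes each)
estimate = estimate_training_vram_gb(300_000_000)
for name, gigabytes in estimate.items():
    print(f"{name}: {gigabytes:.1f} GB")
```

Halving `bytes_per_param` gives a first impression of what 16-bit training would save, although mixed-precision setups often keep a 32-bit copy of the weights.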

#### Installing framework and requirements

We install directly from a cloned repo along with the provided requirements.txt file. This may take a few moments, as PyTorch is a fairly large package. Please install Python and pip on your own and set up the notebook environment that best suits your preferences.

In [None]:
!git clone https://github.com/sig-gis/pangaea-bench.git "pangaea-bench"
!pip install -e "pangaea-bench"
!pip install -r pangaea-bench/requirements.txt

#### The 'torchrun' Command

#### Configurations

#### Logging and Model Examination

## Section C: Example Workflow: Burn Scar Mapping (MTBS)

### Frozen vs Unfrozen vs Random

### Let's Fine-Tune

#### DOFA

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa \
    decoder=seg_fcn \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa \
    decoder=seg_linear \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
    

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa \
    decoder=seg_muster \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa \
    decoder=seg_segformer \
    preprocessing=seg_resize_input_layer \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa \
    decoder=seg_upernet \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

#### Prithvi 1.0

In [None]:
!HYDRA_FULL_ERROR=1 torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_fcn \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_linear \
    preprocessing=seg_resize_input_layer \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt


In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_muster \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt


In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_segformer \
    preprocessing=seg_resize_input_layer \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt
  

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_upernet \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt
    

#### CROMA

In [None]:
# Assumption: the CROMA checkpoint is stored as pretrained_models/CROMA_large.pt; adjust the path to match your download.
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical \
    decoder=seg_fcn \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt
    

In [None]:
# Assumption: the CROMA checkpoint is stored as pretrained_models/CROMA_large.pt; adjust the path to match your download.
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical \
    decoder=seg_segformer \
    preprocessing=seg_resize_input_layer \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt


In [None]:
# Assumption: the CROMA checkpoint is stored as pretrained_models/CROMA_large.pt; adjust the path to match your download.
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical \
    decoder=seg_linear \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt


In [None]:
# Assumption: the CROMA checkpoint is stored as pretrained_models/CROMA_large.pt; adjust the path to match your download.
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical \
    decoder=seg_muster \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt


In [None]:
# Assumption: the CROMA checkpoint is stored as pretrained_models/CROMA_large.pt; adjust the path to match your download.
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical \
    decoder=seg_upernet \
    preprocessing=seg_resize_input_layer \
    criterion=cross_entropy \
    task=segmentation \
    task.trainer.n_epochs=16 \
    task.trainer.use_wandb=true \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt
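
The cells above differ only in the encoder, decoder, preprocessing, and weights overrides, so the commands can also be generated programmatically. The sketch below only builds and prints the command strings, using the same override keys as the cells above; `build_command` is a hypothetical helper, and using `seg_default` preprocessing for every decoder is a simplifying assumption.

```python
import shlex

def build_command(encoder, decoder, preprocessing, weights, n_epochs=16):
    """Assemble the torchrun call used in the cells above as a list of arguments."""
    return [
        "torchrun", "pangaea-bench/pangaea/run.py",
        "--config-name=train",
        "work_dir=checkpoints",
        "dataset=hlsburnscars",
        f"encoder={encoder}",
        f"decoder={decoder}",
        f"preprocessing={preprocessing}",
        "criterion=cross_entropy",
        "task=segmentation",
        f"task.trainer.n_epochs={n_epochs}",
        "task.trainer.use_wandb=true",
        "task.trainer.log_interval=1",
        "task.trainer.ckpt_interval=16",
        f"encoder.encoder_weights={weights}",
    ]

decoders = ["seg_fcn", "seg_linear", "seg_muster", "seg_segformer", "seg_upernet"]
commands = [
    build_command(
        "prithvi", decoder, "seg_default", "pretrained_models/Prithvi_EO_V1_100M.pt"
    )
    for decoder in decoders
]

# Print the commands for inspection; nothing is launched here.
for command in commands:
    print(shlex.join(command))
```

Each list could be passed to `subprocess.run` to launch the runs one after another, once the printed commands have been checked against the cells above.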


### Results

#### Performance Metrics

#### Graphs

### Final Comments

#### A Well Defined Problem

#### Multimodality / Multisensor Models

#### Scale