# Earth Observation Foundation Models and Benchmarking

## Section A: Introduction

---
---

### From Transfer Learning and Beyond

---

Transfer learning is a powerful technique in machine learning that allows a model trained on one task to be reused for a different but related task. Instead of starting from scratch, transfer learning leverages the knowledge a model has already gained from a large dataset and applies it to a new problem, which is particularly useful when labeled data for the new task is limited. Once trained, the lower layers of these models typically learn general features, such as edges in images or syntactic patterns in text, which can be useful across a range of tasks. By fine-tuning the upper layers of a pre-trained model or adding new layers specific to the new task, developers can adapt it to perform well with far fewer training examples. Properly executed transfer learning not only saves computational resources but also significantly reduces training time. It can also lead to better performance, particularly in scenarios where the target dataset is small or imbalanced. There are different strategies for transfer learning, such as freezing all or some of the layers during training, or allowing all layers to be updated depending on the similarity between the source and target tasks. For example, in medical imaging, models trained on natural images can be adapted to detect tumors or classify X-rays by fine-tuning only a few layers.

<img src="./assets/pretraining_workflow.png" style='height:400px'/>

Figure 1. Simplified foundation model pretraining workflow

An emerging category of models known as **'Foundation Models (FMs)'** is a subset of the transfer learning technique in which the pretraining task is derived from raw unlabelled data rather than from a specific labelled task. In other words, these models are trained in a 'self-supervised' manner. They potentially offer benefits similar to those of models developed using traditional transfer learning techniques (i.e., reduced training time and improved performance). However, unlike traditional transfer learning, FMs circumvent the need to develop an initial large-scale labelled dataset. This technique is especially powerful in the remote sensing world, where the volume of available unprocessed data is unprecedented and the methods for transforming it into actionable insights draw from a wide range of scientific disciplines. **The larger goal of the earth observation foundation model (EOFM) development community is to create a model capable of generalizing across domains by learning from the structure and correlations inherent in the raw data, opening the door to more versatile models that can adapt with minimal fine-tuning. Attempts towards this elusive goal, however, must be thoroughly assessed in a way that is relevant to the greater remote sensing community.**

In this notebook we cover a simplified understanding of EOFMs, how to use them, how to benchmark their performance, and what to consider when assessing their utility for your downstream task. We assume that you have some familiarity with training deep learning models and the software associated with that process, as well as a high-level understanding of concepts such as network weights, backpropagation, and checkpoints.

---

###  Pretraining Methodologies

---

Below are three common methods for pretraining a computer vision model in a self-supervised fashion. This is by no means an exhaustive list! Due to the widespread availability of satellite data, a number of variations and combinations of the methods below exist, along with a number of alternative techniques that are not in widespread use. There is also an extensive effort to marry the computer vision and language domains both in and out of the remote sensing community. For simplicity, we omit those from this notebook. It is worth noting that the paradigms as presented here are all 'transformer-based': the core architectural unit is a vision transformer, usually trained under the 'encoder-decoder' paradigm. While it is important to understand the high-level distinction between these techniques, it is not strictly necessary in order to use them in practice, as many developer groups have worked hard to abstract most of the logic away in various frameworks. A deep understanding of each of these model families, as well as the nuances of a transformer, is out of the scope of this notebook, but we encourage folks to learn as much as possible!

#### <u>Autoencoders</u> 

<img src="./assets/mae.png" style='height:400px'/>

Figure 2. Diagram depicting masked autoencoder training. Images are broken into patches that are subsequently embedded (typically with a linear layer) and masked. The decoder is tasked with reconstructing the original image from the unmasked context.

Autoencoders consist of an encoder that compresses input images into a lower-dimensional latent space and a decoder that reconstructs the original images from these representations. This compression helps the model capture essential features while discarding noise or irrelevant details. The most dominant, though not necessarily the most performant, method is masked image modeling (i.e., the 'Masked Autoencoder' (MAE)), where the model is tasked with reconstructing the masked parts of an image from the unmasked context. Researchers have presented a handful of masking strategies, but they all rest on the same overarching idea: a model optimized on the reconstruction task learns representations that generalize, that is, perform well on unseen data.
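To make the masking step concrete, here is a minimal sketch of random patch masking in PyTorch. It is an illustration only and not the implementation used by any particular EOFM: the function names (`patchify`, `random_masking`), the 75% mask ratio, and the 6-band 64x64 input are all assumptions chosen for the example.

```python
import torch


def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into (B, N, patch_size * patch_size * C) flattened patches."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must be divisible by patch size"
    gh, gw = h // patch_size, w // patch_size
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)  # (B, gh, gw, p, p, C)
    return x.reshape(b, gh * gw, patch_size * patch_size * c)


def random_masking(patches: torch.Tensor, mask_ratio: float):
    """Keep a random subset of patches per image.

    Returns the visible patches and a binary mask (1 = masked, 0 = visible).
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # random permutation of patch indices
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n)
    mask.scatter_(1, ids_keep, 0.0)            # mark the kept patches as visible
    return visible, mask


# Hypothetical input: a batch of two 6-band 64x64 chips
images = torch.randn(2, 6, 64, 64)
patches = patchify(images, patch_size=16)
visible, mask = random_masking(patches, mask_ratio=0.75)
print(patches.shape, visible.shape, mask.sum(dim=1))
```

In a full masked autoencoder, only `visible` would be passed through the encoder, and the reconstruction loss would be computed on the patches where `mask` equals 1.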

#### <u>Contrastive Learning</u> 

<img src="./assets/simclr.png" style='height:400px'/>

Figure 3. Diagram for SimCLR training paradigm.

Contrastive models are tasked with distinguishing between similar and dissimilar image representations. They work by pulling together representations of positive pairs (different augmented views of the same image) while pushing apart representations of negative pairs, which are views from different images. Common methods include SimCLR ('A Simple Framework for Contrastive Learning of Visual Representations') and 'Momentum Contrast' (MoCo). Unlike autoencoders, a decoder is typically not included in the pretraining phase since the task is executed entirely in the latent space.
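The pull-together, push-apart objective can be sketched as a normalized temperature-scaled cross-entropy loss, in the spirit of the loss used by SimCLR. This is a simplified illustration: the function name `nt_xent_loss`, the temperature of 0.5, and the random embeddings are assumptions for the example, not values taken from any EOFM.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (B, D) embeddings of two augmented views of the same B images."""
    b = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D), unit length
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    # A view must not be contrasted against itself
    self_mask = torch.eye(2 * b, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # The positive for row i is the other view of the same image
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)


torch.manual_seed(0)
z = torch.randn(8, 32)
loss_aligned = nt_xent_loss(z, z.clone())          # both views agree
loss_random = nt_xent_loss(z, torch.randn(8, 32))  # views are unrelated
print(loss_aligned.item(), loss_random.item())
```

The loss is lower when the two views of each image map to nearby embeddings, which is exactly the behavior the pretraining phase rewards.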

#### <u>Non-contrastive Self Distillation</u> 

<img src="./assets/dino.png" style='height:400px'/>

Figure 4. Diagram for DINO training paradigm.

Non-contrastive self-distillation is a self-supervised learning approach for computer vision that avoids the need for negative samples. Instead of contrasting different images, it trains a student network to match the output of a teacher network, with each network fed a different augmented view of the same image. The teacher is often an exponential moving average of the student, which provides stable targets for the student to match. The objective is to produce consistent, meaningful embeddings, not to reconstruct or generate the input. Common methods include 'Self-**di**stillation with **no** labels' (DINO) and 'Bootstrap Your Own Latent' (BYOL).
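The exponential moving average (EMA) teacher mentioned above can be sketched in a few lines. The `ema_update` helper, the momentum of 0.996, and the tiny linear "networks" are assumptions for illustration; real implementations typically schedule the momentum over training.

```python
import copy

import torch
from torch import nn


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996) -> None:
    """Move each teacher parameter a small step towards the matching student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)


# Hypothetical stand-ins for the student and teacher encoders
student = nn.Linear(4, 2)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is never updated by backpropagation

# After each optimizer step on the student, refresh the teacher
ema_update(teacher, student)
```

Because the teacher only drifts slowly, the targets the student is asked to match change smoothly from step to step.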

---

### Decoders

---

You're probably wondering at this point: what do you do with the trained encoder? Just as you would with any trained neural network, you can load the weights for inference using code similar to what you might have seen in previous chapters. There is, however, one more consideration you must make depending on the type of task at hand and the shape of the encoder's output. If your goal is to classify images, a lightweight machine learning algorithm applied directly to the vector outputs of the encoder may be sufficient. The assumption here is that the internal representation of the image is robust and **relevant** to your task. These are the 'embeddings' you may have heard of recently, and they are analogous to a large, coarse-resolution raster. We will touch on this topic in a separate chapter. If your goal, on the other hand, is to segment the images, you may need to train a subsequent neural network to extract information from that internal representation and produce a usable map. Below is a summary list of contemporary decoders used in conjunction with transformer-based encoders. Each varies in complexity, purpose, and the method by which it extracts features from the transformer layers of the aforementioned encoders.

#### FCN

<img src="./assets/fcn.png" style='height:400px'/>

Figure 5. Fully convolutional network (FCN) architectural diagram.

The FCN is a stack of convolutional layers that upsamples and refines the final feature maps produced from the transformer's output embeddings and converts them back into an image-like spatial representation. This representation can then be fed through a final prediction layer, which is typically itself convolutional, to produce the final map. The FCN is typically more lightweight than some of the MLP-based decoders below, due to weight sharing in the learned kernel filters. Local context is also explicitly emphasized, as those same kernel filters may only cover a small portion of the whole input image.
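As a rough sketch of this idea, the module below reshapes a sequence of patch tokens back into a 2D feature map, refines it with a convolution, and upsamples the class scores to the input resolution. The class name `TinyFCNHead`, the layer widths, and the token shapes are hypothetical; this is not the `seg_fcn` decoder shipped with Pangaea.

```python
import torch
from torch import nn


class TinyFCNHead(nn.Module):
    """Illustrative convolutional head on top of transformer patch tokens."""

    def __init__(self, embed_dim: int, num_classes: int, patch_size: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(embed_dim, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)
        # Undo the patch-level downsampling so the output matches the input image size
        self.upsample = nn.Upsample(scale_factor=patch_size, mode="bilinear", align_corners=False)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        b, n, d = tokens.shape
        gh, gw = grid_hw
        assert n == gh * gw, "token count must match the patch grid"
        x = tokens.transpose(1, 2).reshape(b, d, gh, gw)  # (B, D, gh, gw) feature map
        x = self.refine(x)
        return self.upsample(self.classifier(x))          # (B, num_classes, H, W)


# Hypothetical encoder output: a 4x4 grid of 32-dimensional tokens from 64x64 chips
tokens = torch.randn(2, 16, 32)
head = TinyFCNHead(embed_dim=32, num_classes=3, patch_size=16)
logits = head(tokens, grid_hw=(4, 4))
print(logits.shape)
```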

#### Segformer

<img src="./assets/segformer.png" style='height:400px'/>

Figure 6. Segformer-style network architectural diagram for semantic segmentation.

Segformers use an MLP-style decoder to extract features from a hierarchical transformer-based encoder. Originally developed as a standalone framework for end-to-end training, the Segformer has recently been adapted by a handful of research groups attempting to utilize the representation generated by the pretraining phase. The assumption is that the transformer blocks in the encoder are sufficiently hierarchical in nature that their features can be composed using this method.

#### UperNet

<img src="./assets/upernet.png" style='height:400px'/>

Figure 7. UperNet architectural diagram.

Similar to the Segformer architecture, the UperNet leverages multi-scale hierarchical features to generate dense predictions. Rather than MLPs, the UperNet uses pooling layers and convolutions to compose the features before the final prediction layers. It's worth noting here that the UperNet was designed with a ResNet-50 backbone that was fine-tuned. In other words, it closely aligns with the 'pretrained encoder with a decoder head' methodology that has been popularized in the last year.

#### Linear Probe

<img src="./assets/linear.png" style='height:400px'/>

Figure 8. Simplified structure of a single linear layer.

The linear probe is the simplest decoder available. It is a single linear layer that processes just the output of the encoder. In practice, it isn't typically used for segmentation due to its limited model complexity. It is still useful, however, for assessing representation quality, as it is cheap to train and can be implemented consistently, making it an ideal tool for benchmarking. The same technique is used in the language domain, where large language models are often evaluated on downstream tasks with linear probes on frozen embeddings before full fine-tuning.
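A minimal sketch of a linear probe is shown below. The encoder here is a hypothetical two-layer stand-in rather than a real pretrained EOFM, and the shapes and learning rate are arbitrary; the point is only that the encoder produces features without gradients while a single linear layer is trained on top.

```python
import torch
from torch import nn
import torch.nn.functional as F


class LinearProbe(nn.Module):
    """A single linear layer applied independently to every patch token."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(tokens)  # (B, N, num_classes)


# Hypothetical stand-in for a pretrained encoder (assumption, not a real EOFM)
encoder = nn.Sequential(nn.Linear(48, 32), nn.GELU())
encoder.eval()

probe = LinearProbe(embed_dim=32, num_classes=2)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)  # only the probe is optimized

patches = torch.randn(4, 16, 48)       # (B, N, patch_dim)
labels = torch.randint(0, 2, (4, 16))  # one label per patch

with torch.no_grad():                  # frozen features: no gradients reach the encoder
    features = encoder(patches)

logits = probe(features)
loss = F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
optimizer.step()
```

Because the probe has so few parameters, differences in its score across encoders mostly reflect differences in the quality of the frozen features.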

#### Muster

<img src="./assets/muster.png" style='height:400px'/>

Figure 9. Architectural diagram of the MUSTER decoder.

MUSTER is a more recent decoder development in the computer vision realm. As with both the Segformer and the UperNet, hierarchical features play an important role in its predictive strength; the most important distinction is that the decoder itself is also transformer-based. Also like the UperNet, the MUSTER decoder was designed to integrate with pretrained encoders. While this may lend itself to greater performance, it is important to consider the compute required to implement such an architecture.


## Section B: The Framework

---
---

[Link to original repository](https://github.com/VMarsocci/pangaea-bench)

We apply the Pangaea Bench benchmarking framework originally developed by the ESA Phi Lab team. While there are other frameworks in development by other teams, Pangaea conveniently has many of the more popular remote sensing foundation models integrated into its pure PyTorch framework, along with a number of well-established benchmarking datasets. It also includes a relatively simple dataset integration scheme, along with additional decoder heads integrated by the team at Spatial Informatics Group for the purpose of evaluating methods for extracting information from the internal representation of the foundation model. The limited dependencies are especially beneficial for small-scale demonstrations such as this learning notebook. You will find themes similar to those in previous chapters of the Applied Deep Learning Book, such as dataloaders, PyTorch modules, loss functions, etc. These are not exclusive to training end-to-end models and, in fact, the workflow is almost exactly the same! The intended purpose of EOFMs is to solve the same segmentation, classification, or regression problems. Here, we focus on the additional considerations that come with using an EOFM for segmentation.

<img src="./assets/pangaea_workflow.png" style='height:400px'/>

Figure 10. Diagram of Pangaea's general structure.

---

### Software/Hardware considerations

CUDA is NVIDIA's parallel computing platform, the underlying software that runs computation on NVIDIA GPUs. While it is not necessary, we recommend running this notebook in a CUDA-enabled Linux environment for convenience and setup simplicity. We also recommend installing python>=3.10, either standalone or through a conda-managed environment. To estimate the VRAM required to hold the model during training, we can use a simple calculation based on the number of parameters and the numerical precision. For example, a 300M parameter model at 32-bit precision (4 bytes per parameter) needs roughly 1.2 GB for the weights and, assuming an Adam-style optimizer that keeps two states per parameter, another 2.4 GB for the optimizer states, or roughly 3.6 GB together. Another 1.2 GB is required for the gradients of those parameters, plus a few more for the data itself and the intermediate activations. Likewise, the decoder used to extract from the internal representation also imposes VRAM and compute requirements. There's no universal formula here, but tools like PyTorch hooks, `torch.cuda.memory_summary()`, or profilers can help estimate how much compute you need. 12 GB of VRAM is a good place to start, and luckily there are free Google Colab options at this compute scale, but there is a trend towards larger and more complex models, which inevitably consume more compute.
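The back-of-the-envelope estimate above can be written as a small helper. This is a rough sketch under stated assumptions: the function name is made up for this example, it assumes two optimizer states per parameter (as with Adam-style optimizers), it uses 1 GB = 10^9 bytes, and it ignores activations, the input batch, and the decoder.

```python
def estimate_training_memory_gb(
    n_params: int,
    bytes_per_param: int = 4,            # 32-bit precision
    optimizer_states_per_param: int = 2,  # assumption: Adam-style optimizer
) -> dict:
    """Rough lower bound on training memory, excluding activations and data."""
    gb = 1e9
    weights = n_params * bytes_per_param
    optimizer_states = weights * optimizer_states_per_param
    gradients = n_params * bytes_per_param
    return {
        "weights_gb": weights / gb,
        "optimizer_states_gb": optimizer_states / gb,
        "gradients_gb": gradients / gb,
        "total_gb": (weights + optimizer_states + gradients) / gb,
    }


estimate = estimate_training_memory_gb(300_000_000)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

For the 300M parameter example, this gives 1.2 GB of weights, 2.4 GB of optimizer states, and 1.2 GB of gradients, or about 4.8 GB before any data or activations are loaded.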

#### Installing framework and requirements

We install the framework directly from a cloned repo, along with the packages listed in its requirements.txt file. This may take a few moments, as PyTorch is a fairly large package. Please install python and pip on your own and set up the notebook environment that best suits your preferences.

In [3]:
!git clone https://github.com/sig-gis/pangaea-bench.git "pangaea-bench"
!pip install -e "pangaea-bench"
!pip install -r pangaea-bench/requirements.txt

fatal: destination path 'pangaea-bench' already exists and is not an empty directory.
Obtaining file:///home/myscon/workstation/eofm-book/pangaea-bench
  Preparing metadata (setup.py) ... done
Installing collected packages: pangaea
  Attempting uninstall: pangaea
    Found existing installation: pangaea 1.0.0
    Uninstalling pangaea-1.0.0:
      Successfully uninstalled pangaea-1.0.0
  DEPRECATION: Legacy editable install of pangaea==1.0.0 from file:///home/myscon/workstation/eofm-book/pangaea-bench (setup.py develop) is deprecated. pip 25.0 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expected, try using --config-settings editable_mode=compat. Please consult the setuptools documentation for more information. Discussion can be found at https://github.com/pypa/pip/issues/11457
  Running setup.py develop for panga

#### The 'torchrun' Command

The torchrun command is a utility provided by PyTorch to launch distributed training jobs across multiple processes, nodes, or GPUs. It replaces the older torch.distributed.launch and is part of the torch.distributed module. torchrun is designed to be simple and flexible, making it easier to scale up training scripts with minimal changes. At its core, torchrun sets up the environment variables necessary for distributed training, such as RANK, WORLD_SIZE, and MASTER_ADDR, and then spawns multiple processes, one per GPU or per node, depending on the configuration. This enables parallel training using frameworks like DistributedDataParallel (DDP), which synchronizes gradients across processes and can reduce training time significantly. This will be especially useful for training larger networks; however, for the purposes of this demonstration, we will only use it to run on a single machine.

The Pangaea benchmarking framework is a lightweight wrapper around this command that handles most of the boilerplate code that comes with training a neural network. This includes establishing the training loop, calculating metrics, logging, and setting up the datasets and dataloaders associated with the benchmarks. [See torchrun documentation on environment variables for more details](https://docs.pytorch.org/docs/stable/elastic/run.html). There are several command options available, which can be found in ./pangaea-bench/configs. We only use a subset in this demonstration notebook, but there are many more available. Each config file for the dataset, decoder, and encoder corresponds to a PyTorch module that the framework instantiates from the configuration provided.
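As a small illustration of what a training script sees when launched this way, the helper below reads the environment variables that torchrun exports. The function name and the single-process fallback values are assumptions made for this sketch; they are not part of torchrun or Pangaea.

```python
import os


def read_distributed_env(env=None) -> dict:
    """Read the variables torchrun sets for each spawned process.

    The defaults describe a single, non-distributed process and are an
    assumption of this sketch for when the script is run without torchrun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index of this process on its node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
        "master_addr": env.get("MASTER_ADDR", "localhost"),
    }


config = read_distributed_env()
print(config)
```

On a single machine with one GPU, a script launched with torchrun would report a world size of 1 and a rank of 0.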

```bash
# Arguments are collected in an array so that each one can carry a comment.
# In a notebook, run this cell with the %%bash cell magic.
args=(
    --config-name=train              # configuration name, found in ./pangaea-bench/configs
    work_dir=checkpoints             # directory, relative to the working directory, for checkpoint outputs
    dataset=hlsburnscars             # dataset config file, found in ./pangaea-bench/configs
    encoder=dofa                     # encoder config file, found in ./pangaea-bench/configs
    decoder=seg_fcn                  # decoder config file, found in ./pangaea-bench/configs
    preprocessing=seg_default        # preprocessing steps associated with the task type (e.g., normalizing images)
    criterion=cross_entropy          # loss function on which to optimize the training cycle
    task=segmentation                # task type (others include classification and regression)
    use_wandb=true                   # whether to use the online logging software
    task.trainer.n_epochs=16         # number of epochs to limit training
    task.trainer.log_interval=1      # logging interval of calculated metrics
    task.trainer.ckpt_interval=16    # how often to save checkpoints based on metrics
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth  # path to pretrained weights of the FM
)
torchrun pangaea-bench/pangaea/run.py "${args[@]}"  # the torchrun entry command
```

#### Logging and Model Examination

Weights & Biases (W&B, or wandb) is a popular tool for experiment tracking, model monitoring, and collaboration in machine learning workflows. It integrates with PyTorch, TensorFlow, Keras, and other frameworks, allowing users to log training metrics, visualize results in real time, and compare model performance across runs. Pangaea Bench has a built-in wandb integration, which is easily accessed using W&B's authentication protocol. [See W&B documentation on environment variables for more details](https://docs.wandb.ai/guides/track/environment-variables/)

## Section C: Example Workflow: Burn Scar Mapping (MTBS)

### Frozen vs Unfrozen vs Random vs LoRA

Testing frozen, unfrozen, and randomly initialized encoders is useful for understanding the value and applicability of transfer learning for a specific task. A frozen encoder uses pretrained weights without updating them during training. This setup helps evaluate how useful the pretrained features are on their own, especially in cases where the target dataset is small or the model is prone to overfitting. In contrast, an unfrozen encoder allows those pretrained weights to be updated, enabling the model to adapt its learned representations to better suit the target task. This approach often yields better performance when sufficient labeled data is available and the task deviates from the original pretraining domain; however, it requires that additional gradients be computed and stored. A randomly initialized encoder, on the other hand, serves as a baseline, providing a measure of how well the model performs without any prior knowledge. Comparing results from this setup against those using pretrained weights helps quantify the benefits of pretraining in terms of training time and final performance. It's worth noting that, depending on the task and the amount of available training data, the randomly initialized encoder may never achieve the same results as an unfrozen pretrained model. This behavior is still an ongoing area of research.
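In PyTorch terms, the three configurations differ only in which weights the encoder starts from and whether they receive gradients. The sketch below uses a hypothetical helper, `configure_encoder`, and a toy two-layer encoder; it is not how Pangaea exposes these options, which are set through its config files.

```python
import copy

import torch
from torch import nn


def configure_encoder(pretrained: nn.Module, mode: str) -> nn.Module:
    """Return a copy of the encoder set up as 'frozen', 'unfrozen', or 'random'."""
    encoder = copy.deepcopy(pretrained)
    if mode == "frozen":
        for p in encoder.parameters():
            p.requires_grad_(False)      # pretrained weights, never updated
    elif mode == "unfrozen":
        for p in encoder.parameters():
            p.requires_grad_(True)       # pretrained weights, fine-tuned
    elif mode == "random":
        for module in encoder.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()  # discard the pretrained weights
        for p in encoder.parameters():
            p.requires_grad_(True)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return encoder


def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Toy stand-in for a pretrained encoder (assumption for illustration)
pretrained = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))
frozen = configure_encoder(pretrained, "frozen")
unfrozen = configure_encoder(pretrained, "unfrozen")
random_init = configure_encoder(pretrained, "random")
print(count_trainable(frozen), count_trainable(unfrozen), count_trainable(random_init))
```

The frozen encoder contributes no trainable parameters, which is why it needs less memory for gradients and optimizer states than the other two setups.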

Overall, these three basic configurations allow for a comprehensive assessment of whether transfer learning is useful, whether fine-tuning improves results, and whether pretraining is necessary at all for a specific task. There is a fourth method, Low-Rank Adaptation (LoRA), which is something of a compromise between a fully frozen and a fully unfrozen encoder and is most relevant for larger models (i.e., greater than 300M parameters). LoRA is a technique for efficiently fine-tuning large pretrained neural networks by injecting small, trainable weight matrices into the model while keeping the original weights frozen. Traditional fine-tuning updates all parameters of a model, which becomes expensive for large models. LoRA avoids this by approximating the weight updates as the product of two low-rank matrices, significantly reducing the number of trainable parameters. This low-rank decomposition acts as a bottleneck, which improves efficiency and can reduce overfitting. LoRA has gained popularity particularly with large language models. It also enables modular fine-tuning, where separate LoRA modules can be swapped in for different tasks, making it attractive for applications requiring task specialization without retraining the entire model.
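The low-rank update can be sketched as a wrapper around an existing linear layer. The class name `LoRALinear`, the rank of 4, and the scaling factor are assumptions for illustration; libraries that implement LoRA differ in where they inject the adapters and how they initialize them.

```python
import math

import torch
from torch import nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + scale * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # original weights stay frozen
        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = (x @ self.lora_a.t()) @ self.lora_b.t()
        return self.base(x) + update * self.scaling


base = nn.Linear(64, 64)
layer = LoRALinear(base, rank=4)
x = torch.randn(2, 64)

full = sum(p.numel() for p in base.parameters())
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"full layer: {full} parameters, LoRA update: {trainable} trainable parameters")
```

Because the second matrix starts at zero, the wrapped layer initially reproduces the pretrained layer exactly, and training only moves it away from that starting point through the small low-rank matrices.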


### Let's Fine-Tune!

Okay, we've established some of the basic concepts, so let's train some models! We'll be focusing on a single task for three frozen encoder variations: DOFA, CROMA, and Prithvi 1.0 (described below). For each of those three FMs, we train four of the decoders described in Section A (FCN, linear probe, Segformer, and UperNet). That's 12 different models for just a single task! It's clear that, for a comparative analysis of foundation models, compute becomes a key bottleneck, especially when aiming for systematic comparisons across models, tasks, or training regimes. For the sake of this demonstration, we limit each training run to a batch size of 8 for 16 epochs on a single RTX 4000 workstation GPU with 12GB of VRAM. A machine of this size is readily available on most cloud compute platforms.

The task we are targeting is segmenting burn scars in HLS scenes. An HLS-based benchmark is unique in that it sources its optical data from two satellite missions (Landsat and Sentinel-2) and, in doing so, provides a relatively level test bed for models trained on either. Nevertheless, it's important to acknowledge the nature of the benchmark dataset used. While it has been carefully curated to support robust model evaluation, it may not fully reflect the diversity or complexity of real-world fire scenarios, which can have drastically different landscape dynamics, data availability, and task scope. The dataset is limited in geographic scope, vegetation types, and imaging conditions, which means performance metrics obtained here may not generalize well to other fire contexts such as risk mapping. As such, results from this benchmark should be interpreted with these constraints in mind.

#### DOFA
[Link to paper](https://arxiv.org/pdf/2403.15356)

DOFA is a transformer-based model that is optimized on both a masked image modeling objective and a self-distillation objective (see panel (b) of the figure below). DOFA's major contribution is its wavelength-conditioned dynamic patch embedding, in which the central wavelengths of a given sensor's bands are used to derive the weights for processing the respective data. It is apparent from this that the approach targets flexibility and generalizability across multiple modalities, and it does so fairly well. And yet, there are still open questions! For example, wavelength and reflectance do not necessarily translate to amplitude data from SAR. Does this matter? Benchmarking metrics will tell you no, but do keep asking: why?

<img src="./assets/dofa.png"/>

Figure 11: (a) Architecture design. DOFA builds on masked image modeling by processing input images with any number of channels within a single framework. (b) Dynamic weight generator and continual training framework.
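To build some intuition for a wavelength-conditioned patch embedding, the sketch below generates the patch-embedding kernels from each band's central wavelength, so the same module accepts inputs with any number of channels. This is a heavily simplified illustration and not DOFA's actual implementation: the small MLP weight generator, the class name, the layer sizes, and the approximate wavelengths in micrometres are all assumptions made for this example.

```python
import torch
from torch import nn
import torch.nn.functional as F


class WavelengthConditionedPatchEmbed(nn.Module):
    """Illustrative patch embedding whose kernels are generated from band wavelengths."""

    def __init__(self, embed_dim: int = 32, patch_size: int = 8, hidden_dim: int = 64):
        super().__init__()
        self.embed_dim = embed_dim
        self.patch_size = patch_size
        # Maps one wavelength (a scalar) to one (embed_dim, p, p) kernel slice
        self.weight_generator = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim * patch_size * patch_size),
        )
        self.bias = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, images: torch.Tensor, wavelengths_um: torch.Tensor) -> torch.Tensor:
        c = images.shape[1]
        p = self.patch_size
        kernels = self.weight_generator(wavelengths_um.reshape(c, 1))  # (C, D * p * p)
        kernels = kernels.reshape(c, self.embed_dim, p, p).permute(1, 0, 2, 3).contiguous()
        feats = F.conv2d(images, kernels, bias=self.bias, stride=p)    # (B, D, H/p, W/p)
        return feats.flatten(2).transpose(1, 2)                        # (B, N, D)


embed = WavelengthConditionedPatchEmbed()

# Approximate, illustrative central wavelengths in micrometres
optical_wavelengths = torch.tensor([0.49, 0.56, 0.665, 0.865, 1.61, 2.19])
rgb_wavelengths = torch.tensor([0.665, 0.56, 0.49])

tokens_6band = embed(torch.randn(2, 6, 32, 32), optical_wavelengths)
tokens_3band = embed(torch.randn(2, 3, 32, 32), rgb_wavelengths)
print(tokens_6band.shape, tokens_3band.shape)
```

Both inputs produce tokens of the same shape even though they have a different number of bands, which is the property that lets a single encoder be shared across sensors.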

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_fcn\
    preprocessing=seg_default\
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: mtruong (spatial-informatics-group). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in /home/myscon/workstation/eofm-book/wandb/run-20250715_155655-y9eilllh
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 20250715_155654_fc10d0_dofa_seg_fcn_hlsburnscars
wandb: ⭐️ View project at https://wandb.ai/spatial-informatics-group/geofm-bench
wandb: 🚀 View run at https://wandb.ai/spatial-informatics-group/geofm-bench/runs/y9eilllh
INFO - 07/15/25 15:56:55 - 0:00:00 - 'batch_size': 8,
                                      'ckpt_dir': None,
                                      'cr

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_linear\
    preprocessing=seg_default\
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
    

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

#### Prithvi 1.0
[Link to paper](https://arxiv.org/pdf/2310.18660)

Prithvi 1.0 is one of the earliest earth observation foundation models and the simplest of the three models tested in this demonstration. Developed by NASA IMPACT, Prithvi 1.0 is a vanilla masked autoencoder trained exclusively on HLS data. This limited scope presents a unique research opportunity for studying a vision transformer with relatively few modifications to its structure. The expectation, then, is that for this specific test Prithvi would have the best performance metrics, given that its pretraining dataset most closely aligns with the downstream task. TBD!


<img src="./assets/prithvi.png"/>

Figure 12: Masked autoencoder diagram for Prithvi's pretraining.

In [None]:
!HYDRA_FULL_ERROR=1 torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt


In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt
  

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt
    

#### CROMA
[Link to paper](https://arxiv.org/pdf/2311.00566)

CROMA is another transformer-based model with a slightly different approach, combining both masked image modeling (MIM) and contrastive learning. Rather than contrasting positive and negative samples, the CROMA framework contrasts optical and radar data using separate encoders, then combines those embeddings using cross-attention in a unifying transformer encoder. The output of this terminal encoder is then randomly masked and trained in a typical encoder-decoder MIM style. The optical encoder requires all 12 spectral bands from Sentinel-2, while the radar encoder requires the 2 polarization bands from Sentinel-1. In our downstream task, we are limited to 6 harmonized spectral bands. How can we expect this to impact our results?

<img src="./assets/chroma.png"/>

Figure 13: (Left) Pretraining framework for CROMA. (Right) Encoding workflow for leveraging internal representations.

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt
    

In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt


In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt


In [None]:
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

### Results

Performance metrics are not always straightforward to interpret, and comparing metrics across models introduces another layer of complexity to benchmarking.

#### Per Class Metrics

#### Decoder Choice

#### Run Times

#### Compute Costs

### Final Comments

#### A Well Defined Problem

#### Multimodality / Multisensor Models

#### Scale