# Earth Observation Foundation Models: Encoder Representation Analysis and Downstream Task Benchmarking

## Section A: Introduction

---
### Scientific Background and Problem Formulation
---

In this notebook we explore adapting Earth observation foundation models (EOFMs) to a pair of wildfire monitoring tasks. To begin, we discuss the tasks, their associated challenges, and the utility that emerging techniques may offer.

#### Problem Formulation

Following a wildfire event, understanding the severity and extent of the damage can be vital to resource allocation and long-term planning. This can be done by examining earth observation data, such as Sentinel-2 imagery. However, these assessments can be time-consuming and expensive to produce even with the use of earth observation. The work is traditionally done using hand-tuned thresholds for specific indices such as the Normalized Burn Ratio (NBR) and its pre- to post-fire difference (dNBR). These thresholds are specific to the regions they are developed for, sometimes even to a single event, and require expert knowledge to produce, which leaves them inaccessible to users who lack those resources.

These barriers make the prospect of automation attractive for these tasks. However, traditional approaches, such as supervised learning methods, may still impose significant requirements in the form of large labeled datasets and custom architecture development. The promise of EOFMs pre-trained with self-supervision is that models will learn generalizable features from large collections of unlabeled data, and that these features can then be transferred to the relevant tasks. We hope that fine-tuning such models will be more straightforward than a custom architecture search and will save on costly model training time, as it requires only attaching a small set of new weights. We refer to this small supplementary model as a "decoder head"; its job is to customize our pre-trained foundation model to complete our task.

Here we examine two previously introduced tasks: mapping burn scars and mapping burn severity.

- HLS Burn Scars: The HLS Burn Scars dataset (Phillips et al., 2023) consists of pairs of Harmonized Landsat and Sentinel-2 (HLS) images and associated burn scar masks from 2018-2021 over the contiguous United States.
- Monitoring Trends in Burn Severity (MTBS): The MTBS project utilizes expert knowledge to perform post-fire assessments of fire severity within monitored areas. Here we pair these burn severity labels with images collected by the Landsat satellites.

When approaching the question of adapting the many available foundation models to these tasks, many natural questions arise.

- What do we need to do to adapt our EOFMs to this task? What are the downstream impacts of these decisions on the performance and usability of our final solution?
- Which models perform the best on our task? How do we compare them to each other and to our possible baseline methods?

We will discuss these questions in detail in this tutorial, using the aforementioned burn monitoring problems as a frame to assist the reader in developing an understanding of these topics that can be applied to their own use cases. Due to the recent emergence and rapid development of EOFMs, many topics related to these discussions remain open questions, and the material we present here is subject to change as new research is conducted. We aim to provide a snapshot of the situation as it stands, as well as commentary to help frame discussion of these open topics, in order to equip our readers to confront them.



---
### Technical Background and Introduction
---

#### Transfer Learning and Beyond

---

Transfer learning is a powerful technique in machine learning that allows a model trained on one task to be adapted and reused for a different but related task. Instead of starting from scratch, transfer learning leverages the knowledge a model has already gained from a separate, and usually large, dataset and applies it to a new problem. This is particularly useful when the dataset for the new task is small or its annotations are limited. Once initial training has been completed, the layers towards the front of these models (closer to the input layer) typically learn general features, such as edges in images or syntactic patterns in text, which can be useful across a wide range of tasks. Ideally, by fine-tuning the later layers of a pre-trained model or adding new layers specific to the new task, practitioners can adapt it to perform well with far fewer training samples. This can also improve the performance achievable on a smaller or imbalanced dataset, as the pre-trained model already has latent feature extraction capabilities from prior training. Properly executed transfer learning not only has the potential to boost performance, but also saves computational resources and significantly reduces training time.

Many different strategies can be applied to achieve transfer learning, such as freezing different sets of layers during training, or allowing all layers to be updated, depending on the similarity between the source (initial) and target (new) tasks. For example, in medical imaging, models trained on natural images have been fine-tuned to detect tumors or classify X-rays by retraining only a few layers. 

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:250px" 
      src="./assets/pretraining_workflow.png"/>

<div style="text-align: center;"> Figure 1. Simplified foundation model pretraining workflow</div>
<br/>

An emerging category of models termed **'Foundation Models (FMs)'** applies a specific transfer learning technique in which the pretraining task is derived from the data itself, with the goal of developing general-purpose representations of a large unlabeled collection of data that can later serve as the *foundation* of many diverse models for new tasks, as depicted in Figure 1 (Jakubik et al., 2023). This is in contrast to traditional supervised learning problems, which aim to map data samples to annotations from a specific task. FMs are said to be trained in a 'self-supervised' manner and potentially offer similar benefits to models developed using traditional transfer learning techniques (i.e., reduced training time and improved performance). However, unlike traditional transfer learning, FMs circumvent the need for an initial large-scale labeled task. This technique is especially powerful in the remote sensing world, where the availability of large-scale unprocessed data is ever-increasing and the methods for transforming it into actionable insights draw from a wide range of scientific disciplines. **The larger goal of the earth observation foundation model (EOFM) development community is to create models capable of generalizing across Earth Science domains and remote sensing modalities by learning from the structure and correlations inherent in the raw data, opening the door to a wider community of practice and to more versatile models that can adapt with minimal fine-tuning. Attempts towards this elusive goal, however, must be thoroughly assessed in a way that adheres not just to typical computer vision analysis, but also to the rigor required with respect to both the underlying science goals and the remote sensing instrumentation.**

The representation constructed by processing a data point with a foundation model is often called an embedding, and the space from which all these embeddings are drawn is known as the latent space. This notebook will cover a simplified understanding of EOFMs, how to utilize them, how to evaluate their ability to extract and represent data features, how to benchmark their performance, and what to consider when assessing their utility for your downstream task. We assume that you have some familiarity with training deep learning models and the software associated with that process, as well as a high-level understanding of concepts such as network weights, backpropagation, and checkpoints.

---

####  Pretraining Methodologies

---

Below are three categories of methods for pretraining computer vision models in a self-supervised fashion. These have been applied in varying ways to create the present field of EOFMs. This is by no means an exhaustive list! Due to the widespread availability of satellite data, a number of variations and/or combinations of the methods below exist, along with a number of alternative techniques that are not in widespread use. There is also an extensive effort to interconnect strides made separately within the computer vision and natural language processing domains, both in and out of the remote sensing community.

It is worth noting that the models described and tested here are all 'transformer-based', where the core architectural unit is a vision transformer trained under the 'encoder-decoder' paradigm (to be discussed). While it is important to understand the high-level distinction between these techniques, it is not entirely necessary in order to implement them in practice, as many developer groups have worked hard to abstract most of the logic away in various frameworks. A deep understanding of each of these model families, as well as the nuances of a transformer, is outside the scope of this notebook, but we encourage folks to learn as much as possible! The deeper a practitioner's knowledge, the easier it will be to distinguish between appropriate and inappropriate practices, both with data and with the associated model training and deployment.

##### <u>Masked Autoencoding (MAE)</u>

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%; height:250px" 
      src="./assets/mae.png"/>

<div style="text-align: center;"> Figure 2. Diagram depicting masked autoencoder training from (He et al., 2022). Images are broken into small sections, called patches. A percentage of these patches are hidden (masked). The decoder is tasked with reconstructing the original image from the remaining unmasked patches. </div>
<br/>

Autoencoder-centric methods consist of an encoder that projects input images into a (typically) lower-dimensional latent space and a decoder that reconstructs the original images from these representations. This compression helps the model capture essential features while discarding noise or irrelevant details. In masked autoencoding, a portion of the input patches is hidden before encoding; by reconstructing the lost information from the remaining unmasked patches, models are expected to encode general features of images and to develop both image-level understanding and fine-grained spatial reasoning. Because masked patches are dropped and do not have to be processed by the encoder, higher masking ratios have the additional benefit of reducing the computation required per sample, which allows for larger datasets and more training iterations (He et al., 2022).
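To make the masking step concrete, the following is a minimal sketch of random patch masking using only the Python standard library. The function name `random_patch_mask`, the fixed seed, and the 224-pixel image with 16-pixel patches are illustrative assumptions; the 75% masking ratio follows the setting reported by He et al. (2022). A real implementation would operate on tensors of patch embeddings rather than on index lists.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Randomly split patch indices into visible and masked groups."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    # Keep at least one patch so the encoder always has some input.
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    visible = sorted(indices[:num_visible])
    masked = sorted(indices[num_visible:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
patches_per_side = 224 // 16
visible, masked = random_patch_mask(patches_per_side ** 2, mask_ratio=0.75)

# Only the visible patches would be passed through the encoder.
encoder_fraction = len(visible) / (len(visible) + len(masked))
print(f"encoder sees {len(visible)} of {len(visible) + len(masked)} patches "
      f"({encoder_fraction:.0%})")
```

Because the encoder only processes the visible indices, its per-sample cost scales with the unmasked fraction rather than with the full patch grid.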

##### <u>Contrastive Learning</u>

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%; height:300px"
    src="./assets/simclr.png"/>

<div style="text-align: center;"> Figure 3. Diagram for SimCLR training paradigm from (Chen et al., 2020). </div>
<br/>

Contrastive models are tasked with distinguishing between similar and dissimilar image representations. They work by pulling the representations of positive pairs (different augmented views of the same sample) closer together in the representation space while pushing apart dissimilar pairs, which are constructed using views from different images. An example flow diagram can be seen in Figure 3. Common methods include 'Simple Contrastive Learning' (SimCLR) and 'Momentum Contrast' (MoCo) (Chen et al., 2020; He et al., 2020). Unlike autoencoders, a decoder is typically not included in the pretraining phase, since the task is executed entirely in the latent space. This methodology also typically requires some level of annotation (creating positive/negative pairs within the training data) and is sometimes referred to as a 'semi-supervised' methodology. Contrastive methods have exhibited stronger notions of similarity within their embedding space due to their training procedure, which results in very different latent space properties than those of Masked Autoencoder models.
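The pull-together/push-apart objective can be illustrated with a simplified, one-directional InfoNCE-style loss written in plain Python. This is a sketch under stated assumptions, not the exact SimCLR or MoCo loss: the function names, the temperature of 0.1, and the two-dimensional toy embeddings are all hypothetical, and the full SimCLR formulation additionally treats both views of every other image as negatives.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(view_a, view_b, temperature=0.1):
    """Mean cross-entropy of picking the matching view among all candidates.

    view_a[i] and view_b[i] are embeddings of two augmentations of image i;
    every view_b[j] with j != i acts as a negative for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine_similarity(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: positives point in nearly the same direction.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce_loss(view_a, view_b)
# Swapping the second view breaks the pairing, so the loss should increase.
loss_shuffled = info_nce_loss(view_a, [view_b[1], view_b[0]])
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

The loss is small when each embedding is closest to its own positive and grows when the positives are no more similar than the negatives.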

##### <u>Non-contrastive Self Distillation</u>

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%; height:300px"
    src="./assets/dino.png"/>

<div style="text-align: center;"> Figure 4. Diagram for DINO training paradigm from (Caron et al., 2021). </div>
<br/>

Non-contrastive self-distillation is a self-supervised learning approach for computer vision that avoids the need for negative samples. Instead of contrasting different images, it trains a student network to match the output of a teacher network, with each network fed a different augmented view of the same image. The teacher is often an exponential moving average of the student, which provides stable targets for the student to match. Figure 4 shows an example of this approach. The objective is to produce consistent, meaningful embeddings, not to reconstruct or generate the input. Common methods include 'Self-**di**stillation with **no** labels' (DINO) and 'Bootstrap Your Own Latent' (BYOL) (Caron et al., 2021; Grill et al., 2020). These methods are often combined with different pretraining objectives and are a strong way of refining new models while utilizing the results of existing ones.
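The exponential-moving-average relationship between teacher and student can be sketched in a few lines. In this illustration, weights are stored as a plain dictionary of floats and `ema_update` is a hypothetical helper; the momentum values are assumptions chosen for demonstration, and real implementations apply the same rule to every tensor in the network.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return teacher weights moved slightly towards the student weights."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# The teacher starts away from the student and drifts towards it slowly.
teacher_weights = {"w": 0.0}
student_weights = {"w": 1.0}
for _ in range(1000):
    teacher_weights = ema_update(teacher_weights, student_weights, momentum=0.99)

print(f"teacher weight after 1000 updates: {teacher_weights['w']:.4f}")
```

A momentum close to 1 means the teacher changes slowly, which is what makes its outputs a stable target while the student is updated by gradient descent.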

---

#### Encoder Evaluation

---

For the evaluation of encoders, as with traditional supervised tasks, performance measures should not rely solely on evaluating the objective function on the training and validation subsets of the data. While this is slightly different from evaluating a model on its ultimate skill of properly mapping input samples to a target distribution, the same principles apply: a test subset of data previously unseen by the model should be used to evaluate, both quantitatively and qualitatively, the model's ability to perform the task at hand. One of the many benefits of encoder/decoder architectures is that the intermediate separation in the model can be used to evaluate representational skill.

While this chapter is not focused on dataset accumulation, it should also be mentioned that proper splitting of the data in a geospatial setting is imperative for proper evaluation. Splits should account for considerations such as holding out unseen spatiotemporal areas as a test set for independent post-training evaluation of a model. This ensures that complex spatiotemporal cues built into geospatial datasets are not "memorized", which would overstate a model's performance on a given task and its potential generalizability (LaHaye et al., 2021; Parajuli et al., 2024). Here we discuss some ways an encoder's representational "skill" can be measured and evaluated by looking at both the weight matrices of the model and the output of the encoder, the intermediate data representations called encodings or embeddings. Unlike the metrics used for downstream task evaluation, many of these are newer and therefore less well known. We provide a brief description and references for each methodology used.
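One simple way to keep unseen areas out of training is to split by spatial block rather than by individual sample. The sketch below is an illustration under assumptions, not the splitting scheme used by the datasets in this notebook: the `lat`/`lon` keys, the 1-degree block size, the 20% test fraction, and the synthetic chip locations are all hypothetical, and a temporal holdout would need an additional grouping key.

```python
import math
import random


def block_key(sample, block_size_deg):
    """Index of the grid cell that contains the sample's location."""
    return (math.floor(sample["lat"] / block_size_deg),
            math.floor(sample["lon"] / block_size_deg))


def spatial_block_split(samples, block_size_deg=1.0, test_fraction=0.2, seed=0):
    """Hold out whole grid cells so train and test never share a cell."""
    blocks = {}
    for sample in samples:
        blocks.setdefault(block_key(sample, block_size_deg), []).append(sample)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    num_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:num_test])
    train = [s for k in keys if k not in test_keys for s in blocks[k]]
    test = [s for k in keys if k in test_keys for s in blocks[k]]
    return train, test


# Hypothetical chip centers: a 6x6 degree area with two chips per 1-degree cell.
samples = [
    {"id": f"chip_{row}_{col}_{rep}",
     "lat": 34.25 + row + 0.5 * rep,
     "lon": -119.75 + col}
    for row in range(6) for col in range(6) for rep in range(2)
]
train_samples, test_samples = spatial_block_split(samples)
print(len(train_samples), "train chips,", len(test_samples), "test chips")
```

Splitting at the block level means neighboring chips, which tend to look alike, always land on the same side of the split.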

##### <u>Resource Requirements</u> 
 
Understanding the resource demands of deep learning models is crucial for both model selection and hardware planning. As data science practitioners, we typically want to balance performance, interpretability, and resource requirements as the three main driving factors in model selection. It is also crucial that we aim to choose the smallest and most efficient model that meets the requirements of our task in these three areas. This is important not only for the sake of efficiency, but also because the environmental impact of large-scale deep learning models has become a significant concern as model sizes and computational demands continue to grow. Choosing small and optimized models where appropriate is an effective strategy for reducing the emissions and energy consumption footprint associated with model development and deployment (Strubell et al., 2019; Patterson et al., 2021). These considerations should include a model's pretraining phase as well as the resources required for fine-tuning and operational inference.
 
 
Floating Point Operations (FLOPs), Multiply-Accumulate Operations (MACs), and parameter counts are three metrics that offer practical proxies for estimating a model's resource requirements. FLOPs denote the total number of floating-point operations (additions, multiplications, etc.) required to process a single input through the model. A high FLOP count indicates a model that requires substantial computational resources, which can translate directly to longer inference and training times. Multiply-accumulate operations, foundational to neural network layers (especially convolutions and matrix multiplications), are closely linked to the actual physical operations that deep learning accelerators (like GPUs and TPUs) execute, thus providing an estimate of the real computational workload. The parameter count measures the total number of learnable variables (weights and biases) within a model. The number of parameters is a large driver of the memory required to load and store the model, both during execution and on disk.

Some limitations that should be considered:

1) FLOPs are platform-agnostic and do not directly account for hardware-specific optimizations or parallelization capabilities.

2) MACs are specifically meaningful for models dominated by linear algebra; other operations (e.g., activations, normalizations) may not be included in MAC counts.

Collectively, these metrics give us a well-rounded glimpse of a model's resource requirements and allow us to factor them in when making intercomparisons.
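As a worked illustration of how these three metrics relate, the sketch below counts them for a single fully connected layer. The helper `linear_layer_cost` is hypothetical, the layer sizes are an assumed example (a 768-to-3072 projection applied to 196 patch tokens), and the code uses the common convention that one MAC corresponds to two FLOPs (one multiply and one add), ignoring the bias additions.

```python
def linear_layer_cost(in_features, out_features, tokens=1, bias=True):
    """Parameter, MAC, and approximate FLOP counts for one linear layer."""
    params = in_features * out_features + (out_features if bias else 0)
    # Each output value needs in_features multiply-accumulates per token.
    macs = tokens * in_features * out_features
    # Convention: 1 MAC = 1 multiply + 1 add = 2 FLOPs (bias adds ignored).
    flops = 2 * macs
    return {"params": params, "macs": macs, "flops": flops}


cost = linear_layer_cost(in_features=768, out_features=3072, tokens=196)
print(f"params: {cost['params']:,}")
print(f"MACs:   {cost['macs']:,}")
print(f"FLOPs:  {cost['flops']:,}")
```

Note that the parameter count is independent of the number of tokens, while MACs and FLOPs grow linearly with it, which is why input size matters for compute but not for the size of the stored weights.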


##### <u>Weight Matrix Analysis for a Data-Free Assessment of Training Quality</u> 

The WeightWatcher analysis library introduces a measure of an implicit form of self-regularization, as revealed by the spectral properties of a model's weight matrices (Martin et al., 2021). The toolkit and associated papers apply Random Matrix Theory (RMT), a branch of mathematics that studies the properties of matrices whose entries are random variables, and more specifically Empirical Spectral Density (ESD) analysis. The ESD characterizes the distribution of a matrix's eigenvalues as a histogram showing how the eigenvalues are spread across their range of possible values; the matrices in question are often large, random, or structured. As part of this analysis, the $\alpha$ parameter is an exponent characterizing the tail of the ESD of a neural network layer's weight matrix. When the distribution of singular values (or eigenvalues) of a weight matrix follows a power law, the ESD behaves as:

$$\rho(\lambda) \sim \lambda^{-\alpha}$$

where $\lambda$ denotes the eigenvalues, and $\alpha$ is the power law exponent. This exponent is estimated empirically from the tail of the ESD for a given layer. $\alpha$ quantifies how "heavy-tailed" or "spread out" the layer's eigenvalue distribution is. A smaller $\alpha$ means a heavier tail: the weight matrix has more large singular values, indicating more complex/lower-rank structure. In this context, this means a given layer captures not only the true structure but also the noise—too much complexity. A larger $\alpha$ means a faster-decaying tail: the singular values decay rapidly, indicating dominance of noise-like randomness or over-regularization. Here, this means a given layer fails to capture the relevant data structure—too much simplicity.

Here we provide the means to do the full ESD-based analysis that the WeightWatcher tool provides, and use the $\alpha$ parameter as a summarization metric for the analysis.
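To build intuition for what the $\alpha$ exponent measures, the sketch below fits the tail exponent of a set of eigenvalues using the standard maximum-likelihood estimator for a continuous power law, $\hat{\alpha} = 1 + n / \sum_i \ln(\lambda_i / \lambda_{\min})$. This is not the WeightWatcher implementation: the function names are hypothetical, the tail cutoff `x_min` is assumed to be known rather than selected automatically, and the eigenvalues are synthetic samples rather than the spectrum of a real layer.

```python
import math
import random


def power_law_alpha(eigenvalues, x_min):
    """Maximum-likelihood estimate of the power-law tail exponent."""
    tail = [value for value in eigenvalues if value >= x_min]
    if len(tail) < 2:
        raise ValueError("need at least two eigenvalues in the tail")
    log_sum = sum(math.log(value / x_min) for value in tail)
    return 1.0 + len(tail) / log_sum


def sample_power_law(alpha, x_min, count, seed=0):
    """Draw synthetic 'eigenvalues' with density proportional to x**(-alpha)."""
    rng = random.Random(seed)
    # Inverse-transform sampling; 1 - random() lies in (0, 1].
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(count)]


# In practice the eigenvalues would come from W^T W for a layer's weights W.
synthetic_eigenvalues = sample_power_law(alpha=3.0, x_min=1.0, count=20000)
estimated_alpha = power_law_alpha(synthetic_eigenvalues, x_min=1.0)
print(f"estimated alpha: {estimated_alpha:.2f}")
```

With enough samples, the estimate should land close to the exponent used to generate the data; real spectra are noisier and depend heavily on how the start of the tail is chosen.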


##### <u>Working with Embeddings</u> 

One of the most promising prospects of generalizable learned features from EOFMs is the ability to reason over embeddings, that is, samples that have been processed by an EOFM. The goal of an embedding is to compress the data by removing unnecessary information and to provide abstract features that can be used to distinguish similar and dissimilar samples under diverse contexts more easily than the raw imagery allows. Embedding-based approaches are currently being explored and utilized for similarity search, few/zero-shot learning, and low-label monitoring tasks.

However, these spaces are high-dimensional (100s or 1000s of dimensions for each patch) and are the result of the function learned by our neural network; thus there is no prima facie method for understanding the meaning or importance of placement in this latent space. Thankfully, much work has been done on tools for exploring these spaces and investigating them for features or distinctions of interest. These include longstanding visualization techniques such as UMAP (explored below) and newer work investigating comparisons between multiple models.

##### <u>Manifold Projections for Qualitative Analysis and Dimension Reduction</u> 

The Uniform Manifold Approximation and Projection (UMAP) methodology can be used both as a general dimension reduction tool and as an interpretability/visualization tool, and here we use it for both. UMAP assumes the data lies on a manifold and utilizes local approximations (k-nearest neighbor graphs) to represent the data's topological and metric structure. Each local neighborhood is encoded as a fuzzy simplicial set, and these are then merged probabilistically to form a global topological structure (McInnes et al., 2020). Figure 5 provides examples of embeddings from different models and geospatial datasets projected into a lower-dimensional feature space using UMAP and overlaid with each sample's target value.


<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%; height:200px"
           src="./assets/UMAP_Examples.png"/>
<div style="text-align: center;"> Figure 5. Embeddings from different models and geospatial datasets with 4 classes (left), 2 classes (center), and 7 classes (right) projected into a 2D feature space using UMAP. </div>
<br/>
Notice that the separation of label sets in the projections degrades from left to right, indicating the models' increasing difficulty in representing class distinctions. Note that this does not always mean that a model "cannot" represent the data's structure (although this is definitely the case some of the time), but rather that the representation is not simple, leading to difficulties in projecting it into a lower dimension. While this is a limitation of the tool, it is useful in two ways:

1) This also likely means that downstream learning tasks will be harder, as more care and precision will be required to extract the representation using a given decoder head.
2) **Data needs to be human interpretable in order to be useful in the sciences, as our goal is to answer questions and understand the "whys" of processes, not just meet metric thresholds and deploy automated decision making systems.** The authors feel strongly that this should be the aim for all data science and deep learning practice, but definitely here in these domains.

Some additional limitations include: 

1) the use of stochastic operations, which means that different runs or subsamples can yield varied results, and
2) the sensitivity of the projected embeddings to hyperparameters; empirically, however, the authors note that a generally similar structure can be achieved with respect to class separation if the representations of the N classes are distinct enough.

While useful, the uncertainties of these tools require us to use multiple approaches to get a more comprehensive look at the state of model representation.


##### <u>Data Kernels for Inter-Model Embedding Comparisons</u> 

Encoder projections and embedding geometries are not directly comparable due to inherent incongruities - namely, differences in the dimensionality of original encodings or in their basis vectors. To address these discrepancies, we leverage recent advances in embedding space analysis and graph projection techniques, enabling the joint projection of transformations from diverse encoder architectures (Duderstadt et al., 2023). This encoding-centric analysis facilitates direct comparison of entire embedding spaces generated by different models.

Such an approach is especially valuable when downselecting from a large pool of candidate encoders. By quantifying representational similarity, we avoid exhaustive testing: models with highly similar encoding geometries are likely to yield comparable performance on a target task, meaning only representative models from each group need to be evaluated. Likewise, this methodology aids in constructing a taxonomy of applicable models tailored to specific tasks or domains.

Within a joint projection, each data point is represented N times, where N is the number of models under consideration. To assess the significance of differences between two representations of the same datum, we model the null distribution of representation distances, drawing on theoretical foundations from random graph literature. For a single model, resampling (via perturbation and bootstrapping) yields a rejection radius: a threshold delineating typical within-model embedding variation. When the paired representation of a sample from a different model falls outside of this radius, it indicates a materially different encoding of that sample. The probability of rejection tends to correlate with discrepancies in encoder training data and highlights possible gaps in training regimes.

Finally, this technique can be iteratively applied: by generating a distance matrix that quantifies distances between identical samples across all encoding sets, and then applying dimensionality reduction again, we obtain a streamlined, task-relevant comparison of model representations via a single point representative of each embedding set / model representation.
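The rejection-radius idea can be illustrated with a deliberately simplified sketch. This is not the procedure of Duderstadt et al. (2023): it assumes that all embeddings have already been placed in a shared joint space, it uses the centroid of resampled embeddings and an empirical 95% distance quantile as the threshold, and the function names and toy coordinates are hypothetical.

```python
import math


def rejection_radius(replicates, quantile=0.95):
    """Centroid and distance threshold covering typical within-model variation.

    `replicates` are embeddings of the same datum obtained by resampling or
    perturbing a single model, expressed in the shared (joint) space.
    """
    centroid = [sum(coords) / len(replicates) for coords in zip(*replicates)]
    distances = sorted(math.dist(point, centroid) for point in replicates)
    index = min(len(distances) - 1, math.ceil(quantile * len(distances)) - 1)
    return centroid, distances[index]


def is_rejected(other_embedding, centroid, radius):
    """True when another model's embedding falls outside the radius."""
    return math.dist(other_embedding, centroid) > radius


# Toy 2D joint space: four resampled embeddings of one datum from model A.
model_a_replicates = [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]]
centroid, radius = rejection_radius(model_a_replicates)

similar_model_point = [0.05, 0.0]    # hypothetical model B, close to model A
different_model_point = [1.0, 1.0]   # hypothetical model C, far from model A
print(is_rejected(similar_model_point, centroid, radius))
print(is_rejected(different_model_point, centroid, radius))
```

Repeating this test over many samples gives a rejection rate per model pair, which is the kind of quantity that can then be summarized in a model-by-model distance matrix.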


Like performance metrics for downstream tasks, each of these methodologies provides a piece of the full picture, and collectively they can provide a more comprehensive view of a model's representational skill and potential generalizability.


---

#### Decoders

---

You're probably wondering at this point what to do with the trained encoder. Just as you would with any trained neural network, you can load the weights for inference using code similar to what you might have seen in previous chapters. There is still one more consideration to make, depending on the type of task at hand and the shape of the encoder's output. If your goal is to classify images, a lightweight machine learning algorithm applied directly on top of the vector outputs of the encoder may be sufficient. The assumption here is that the internal representation of the image is robust and **relevant** to your task. These outputs are the aforementioned 'embeddings' and are analogous to a large, coarse-resolution raster. We will touch on this topic below. If your goal, on the other hand, is to segment the images, you may need to train a subsequent neural network to extract information from that internal representation and produce a usable map. Below is a summary list of contemporary decoders used in conjunction with transformer-based encoders. Each varies in complexity, purpose, and the method by which it extracts features from the transformer layers of the encoder.

##### Linear Probe

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:250px"
           src="./assets/linear.png"/>

<div style="text-align: center;">Figure 6. Simplified structure of a single linear layer.</div>
<br/>

The linear probe is the simplest decoder available. It is a single linear layer, or a small MLP, that processes just the output of the encoder. This technique is taken from the language domain, where large language models are often evaluated on downstream tasks with linear probes on frozen embeddings before doing full fine-tuning. In practice, it isn't typically used for segmentation due to its limited model complexity. However, this decoder is still useful for assessing representation quality, as it is cheap to train and can be implemented consistently, making it an ideal tool for benchmarking. Since the model is so simple, the performance of a linear decoder indicates how easily separable the encoder has made the samples from our dataset.
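The idea of probing frozen embeddings can be sketched without any deep learning framework. Below, a single logistic-regression layer is trained with plain gradient descent on a handful of fixed two-dimensional vectors that stand in for encoder outputs. The function names, learning rate, epoch count, and toy data are all illustrative assumptions; a real probe would be trained on actual encoder embeddings, typically with a framework such as PyTorch.

```python
import math


def train_linear_probe(embeddings, labels, epochs=200, learning_rate=0.5):
    """Fit one linear layer (logistic regression) on fixed embeddings."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    count = len(embeddings)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for vector, label in zip(embeddings, labels):
            logit = sum(w * x for w, x in zip(weights, vector)) + bias
            error = 1.0 / (1.0 + math.exp(-logit)) - label
            for d in range(dim):
                grad_w[d] += error * vector[d] / count
            grad_b += error / count
        # Only the probe is updated; the embeddings (encoder) stay frozen.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def probe_accuracy(weights, bias, embeddings, labels):
    """Fraction of samples on the correct side of the linear boundary."""
    correct = 0
    for vector, label in zip(embeddings, labels):
        logit = sum(w * x for w, x in zip(weights, vector)) + bias
        correct += int((logit > 0.0) == bool(label))
    return correct / len(labels)


# Hypothetical frozen embeddings for two classes (e.g. burned vs. unburned).
frozen_embeddings = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3],
                     [-2.0, 0.2], [-1.7, -0.1], [-2.1, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_linear_probe(frozen_embeddings, labels)
accuracy = probe_accuracy(weights, bias, frozen_embeddings, labels)
print(f"linear probe accuracy: {accuracy:.2f}")
```

If a probe this simple reaches high accuracy on held-out data, the encoder has already done most of the work of separating the classes; if it struggles, a more expressive decoder may be needed.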

##### FCN

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:250px"
           src="./assets/fcn.png"/>

<div style="text-align: center;"> Figure 7. Fully convolutional network architectural diagram from (Long et al., 2014).</div>
<br/>

The FCN is a stack of convolutional layers that upsamples and refines the final feature maps produced from the transformer's output embeddings and converts them back into an image-like spatial representation. This representation can then be fed through a final prediction layer, which typically itself consists of convolutional layers, to produce the final map (Long et al., 2014). This is typically more lightweight than some of the MLP-based decoders below, due to weight sharing in the learned kernel filters. Local context is also explicitly emphasized, as those same kernel filters may only cover a small portion of the whole input image.

##### Segformer

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:250px"
           src="./assets/segformer.png"/>

<div style="text-align: center;"> Figure 8. Segformer style network architectural diagram for semantic segmentation from (Xie et al., 2021).</div>
<br/>

Segformers use an MLP-style decoder to extract features from a hierarchical transformer-based encoder (Xie et al., 2021). Originally developed as a standalone framework for end-to-end training, the Segformer has recently been adapted by a handful of research groups attempting to utilize the representation generated by the pretraining phase. The assumption is that the transformer blocks in the encoder are sufficiently hierarchical in nature that their outputs can be composed using this method.

##### UperNet

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:300px"
           src="./assets/upernet.png"/>

<div style="text-align: center;">Figure 9. UperNet architectural diagram from (Xiao et al., 2018).</div>
<br/>

Similar to the Segformer architecture, UperNet leverages multi-scale hierarchical features to generate dense predictions. Rather than MLPs, UperNet utilizes pooling layers and convolutions to compose the features before the final prediction layers. It's worth noting that the original UperNet use case used a ResNet-50 backbone that was fine-tuned. In other words, this architecture has long been closely aligned with the 'pretrained encoder'/decoder paradigm that has recently been popularized (Xiao et al., 2018).

##### MUSTER

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:250px"
           src="./assets/muster.png"/>

<div style="text-align: center;">Figure 10. Architectural diagram of MUSTER decoder from (Xu et al., 2022).</div>
<br/>

MUSTER is a more recent decoder development in the computer vision realm. Similar to both the Segformer and UperNet, hierarchical features play an important role in its predictive strength, with the most important distinction being that the decoder itself is also transformer-based (Xu et al., 2022). Also similar to UperNet, the MUSTER decoder was designed to integrate with pretrained encoders. While this may lend itself to greater performance, it is important to consider the compute required to implement such an architecture.


## Section B: The Framework

---
---

[Link to original repository](https://github.com/VMarsocci/pangaea-bench)

We apply the Pangaea Bench benchmarking framework originally developed by the ESA Phi Lab team (Marsocci et al., 2024). While there are other frameworks in development by other teams, Pangaea conveniently has many of the more popular remote sensing foundation models integrated into its pure PyTorch framework, along with a number of well-established benchmarking datasets. It also includes a relatively simple dataset integration scheme, along with additional decoder heads and encoder evaluation methodologies integrated by the team at Spatial Informatics Group. The limited dependency list is especially beneficial for small-scale demonstrations such as this learning notebook. You will find themes similar to those in previous chapters of the Applied Deep Learning Book, such as dataloaders, PyTorch modules, and loss functions. These are not exclusive to training end-to-end models, and in fact the workflow is almost exactly the same! The EOFMs are intended to solve the same segmentation, classification, or regression problems in a different manner. Here, we focus on the additional considerations that come with using an EOFM for segmentation.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:300px"
           src="./assets/pangaea_workflow.png"/>

<div style="text-align: center;">Figure 11. Diagram of Pangaea's general structure.</div>
<br/>

---

### Software/Hardware considerations

CUDA is NVIDIA's software platform for running computations on its GPUs. While it is not necessary, we recommend running this notebook in a CUDA-enabled Linux environment for convenience and setup simplicity. We also recommend installing python>=3.10, either standalone or through a conda-managed environment. For running the model, we can estimate the VRAM required to store the model with a simple calculation based on the number of parameters and the numerical precision. For example, a 300M-parameter model at 32-bit precision (4 bytes per parameter) would require roughly 3.6 GB to store the weights and optimizer states. Another 1.2 GB is required for the gradients of each of those parameters, and a few more for the data itself. Likewise, the decoder used to extract from the internal representation also imposes VRAM and compute requirements. There is no universal formula here, but tools like PyTorch hooks, torch.cuda.memory_summary(), or profilers can help estimate how much compute you need. 12 GB of VRAM is a good place to start, and luckily there are free Google Colab options at this compute scale, but there is a trend towards larger and more complex models, which inevitably consume more compute.
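
The arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the helper name `estimate_training_vram_gb` is ours, and the default of two optimizer values per parameter is an assumption (it matches an Adam-style optimizer and reproduces the 3.6 GB figure quoted above). Activations, input batches, and the decoder are not included.

```python
def estimate_training_vram_gb(n_params, bytes_per_param=4, optimizer_states_per_param=2):
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes) for weights, gradients, and optimizer states.

    Assumption: the optimizer keeps `optimizer_states_per_param` extra values per
    parameter (2 for an Adam-style optimizer). Activations and data are ignored.
    """
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param
    optimizer = n_params * bytes_per_param * optimizer_states_per_param
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer_states": optimizer / 1e9,
        "total": (weights + gradients + optimizer) / 1e9,
    }


# The 300M-parameter, 32-bit example from the text.
estimate = estimate_training_vram_gb(300_000_000)
for name, gb in estimate.items():
    print(f"{name:>16}: {gb:.1f} GB")
```

Weights plus optimizer states come to about 3.6 GB and the gradients add about 1.2 GB, so roughly 4.8 GB is needed before any data is loaded. Halving `bytes_per_param` gives a quick feel for what 16-bit storage would save.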

#### Installing framework and requirements

We install directly from a cloned repository along with the available requirements.txt file. This may take a few moments, as PyTorch is a fairly large package. Please install Python and pip on your own and set up the notebook environment that best suits your preferences.


In [None]:
!git clone https://github.com/sig-gis/pangaea-bench.git "pangaea-bench"
!pip install -e "pangaea-bench"
!pip install -r pangaea-bench/requirements.txt

#### The 'torchrun' Command

The torchrun command is a utility provided by PyTorch to launch distributed training jobs across multiple processes, nodes, or GPUs. It replaces the older torch.distributed.launch and is part of the torch.distributed module. torchrun is designed to be simple and flexible, making it easier to scale up training scripts with minimal changes. At its core, torchrun sets up the environment variables necessary for distributed training, such as RANK, WORLD_SIZE, and MASTER_ADDR, and then spawns multiple processes, one per GPU or per node, depending on the configuration. This enables parallel training using frameworks like DistributedDataParallel (DDP), which synchronizes gradients across processes and can reduce training time significantly. This is especially useful for training larger networks; however, for the purposes of this demonstration, we will only use it to run on a single machine.
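
To make the role of these environment variables concrete, here is a minimal sketch of how a training script might read them. The helper name `read_distributed_env` and the single-process fallback values are our own assumptions for illustration; this is not how Pangaea itself parses them.

```python
import os


def read_distributed_env(env=None):
    """Read the variables torchrun exports to each worker process.

    Assumption: when a variable is missing (e.g. the script was started with
    plain `python`), we fall back to single-process defaults.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index of this process on its node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
        "master_addr": env.get("MASTER_ADDR", "localhost"),
    }


# Simulate what the second of two workers on one machine would see.
info = read_distributed_env(
    {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2", "MASTER_ADDR": "127.0.0.1"}
)
print(info)
```

In a single-machine, single-GPU run such as the one in this notebook, the world size is 1 and the rank is 0.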

The Pangaea benchmarking framework is a lightweight wrapper around this command that handles most of the boilerplate code that comes with training a neural network. This includes establishing the training loop, calculating metrics, logging, and setting up the datasets/dataloaders associated with the benchmarks. [See torchrun documentation on environment variables for more details](https://docs.pytorch.org/docs/stable/elastic/run.html). There are several command options available, which can be found in ./pangaea-bench/configs. We only use a subset in this demonstration notebook, but there are many more available. Each config file for a dataset, decoder, or encoder is associated with a PyTorch module that is instantiated by the framework based on the configurations provided.

```bash
# Arguments, in order:
#   --config-name               configuration name, found in ./pangaea-bench/configs
#   work_dir                    directory, relative to the working directory, to store checkpoint outputs
#   dataset, encoder, decoder   config files found in ./pangaea-bench/configs
#   preprocessing               preprocessing steps associated with the task type (e.g., normalizing images)
#   criterion                   the loss function to optimize during training
#   task                        task type (others include classification and regression)
#   use_wandb                   whether to use the online logging software
#   task.trainer.n_epochs       number of epochs to limit training
#   task.trainer.log_interval   logging interval of calculated metrics
#   task.trainer.ckpt_interval  how often to save checkpoints
#   encoder.encoder_weights     path to the pretrained weights of the FM
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=dofa \
    decoder=seg_fcn \
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
```

#### Logging and Model Examination

Weights & Biases (WandB) is a popular tool for experiment tracking, model monitoring, and collaboration in machine learning workflows. It integrates with PyTorch, TensorFlow, Keras, and other frameworks, allowing users to log training metrics, visualize results in real time, and compare model performance across runs. Pangaea-Bench ships with a built-in WandB integration, which is accessed using WandB's API key authentication. [See WandB documentation on environment variables for more details](https://docs.wandb.ai/guides/track/environment-variables/).
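
One way to avoid a failed run when no key is configured is to check for the `WANDB_API_KEY` environment variable before launching. The sketch below is an illustration only: the helper name `wandb_override` is ours, and choosing the `use_wandb` value this way is our own convention rather than something the framework does for you.

```python
import os


def wandb_override(env=None):
    """Return the `use_wandb=...` override to append to the torchrun command.

    Assumption: online logging is only worth enabling when WANDB_API_KEY
    is present in the environment.
    """
    env = os.environ if env is None else env
    has_key = bool(env.get("WANDB_API_KEY"))
    return f"use_wandb={'true' if has_key else 'false'}"


# Uses the real environment of the current notebook session.
print(wandb_override())
```

The returned string can be pasted in place of the `use_wandb=true` argument used in the commands in this notebook.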

## Section C: Encoder Assessments

### Frozen vs Unfrozen vs Random vs LoRA

Testing frozen, unfrozen, and randomly initialized encoders is useful for understanding the value and applicability of transfer learning for a specific task. A frozen encoder uses pretrained weights without updating them during training (Marsocci et al., 2024). This setup helps evaluate how useful the pretrained features are on their own, especially in cases where the target dataset is small or the model is prone to overfitting. In contrast, an unfrozen encoder allows those pretrained weights to be updated, enabling the model to adapt its learned representations to better suit the target task. This approach often yields better performance when sufficient labeled data is available and the task deviates from the original pretraining domain; however, it requires that additional gradients be computed and stored. On the other hand, a randomly initialized encoder serves as a baseline, providing a measure of how well the model performs without any prior knowledge. Comparing results from this setup against those using pretrained weights helps quantify the benefits of pretraining in terms of training time and final performance. It is worth noting that, depending on the task and the amount of available training data, the randomly initialized encoder may never achieve the same results as an unfrozen pretrained model. This behavior is still an ongoing area of research.

Overall, these three basic configurations allow for a comprehensive assessment of whether transfer learning is useful, whether fine-tuning improves results, and whether pretraining is necessary at all for a specific task. There is a fourth method, which we will not use here, called Low-Rank Adaptation (LoRA); it is somewhat of a compromise between a fully frozen and a fully unfrozen encoder and is most relevant for larger models (i.e., greater than 300M parameters) (Hu et al., 2021). LoRA efficiently fine-tunes large pretrained neural networks by injecting small, trainable weight matrices into the model while keeping the original weights frozen. Traditional fine-tuning updates all parameters of a model, which becomes expensive for large models. LoRA avoids this by approximating the weight updates as the product of two low-rank matrices, significantly reducing the number of trainable parameters. This low-rank decomposition acts as a bottleneck, enforcing efficiency and reducing overfitting. LoRA has gained popularity, particularly with large language models. It also enables modular fine-tuning, where separate LoRA modules can be swapped in for different tasks, making it attractive for applications requiring task specialization without retraining the entire model.
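
To get a feel for the savings, consider a single weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$. LoRA leaves $W$ frozen and learns an update $\Delta W = BA$ with $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$, so only $r(d_{out} + d_{in})$ values are trained instead of $d_{out} \cdot d_{in}$. The short sketch below works through that count; the layer size (768 x 768) and rank (8) are illustrative values we picked, not settings taken from any of the models in this notebook.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for a full update vs. a rank-`rank` LoRA update of one matrix."""
    full = d_out * d_in             # every entry of the weight update is trainable
    lora = rank * (d_out + d_in)    # B is d_out x rank, A is rank x d_in
    return full, lora


# Illustrative (hypothetical) sizes: one 768 x 768 projection, rank 8.
full, lora = lora_param_counts(768, 768, rank=8)
print(f"full update: {full:,} trainable parameters")
print(f"LoRA update: {lora:,} trainable parameters ({100 * lora / full:.2f}% of full)")
```

For this example the LoRA update trains 12,288 values instead of 589,824, or about 2% of the full update, which is where the reduction in gradient and optimizer memory comes from.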


### Highlighted Models
For the encoder evaluation, we will take a look at a large set of pretrained encoders available within Pangaea-Bench. From there, we will focus fine-tuning on a single task for three frozen encoders: DOFA, CROMA, and Prithvi 1.0. Below is a brief introduction to the models we will focus on in the fine-tuning process and beyond.
#### DOFA

DOFA is a transformer-based model that is optimized on both a masked image modeling objective and a self-distillation objective (Figure 2b). DOFA's major contribution is its wavelength-conditioned dynamic patch embedding, in which the central wavelength of a given sensor is used to derive the weights for processing the respective data (Xiong et al., 2024). It is apparent from this that the approach targets flexibility and generalizability across multiple modalities, and it does so fairly well. And yet, there are still open questions! For example, wavelength and reflectance do not necessarily translate to amplitude data from SAR. Does this matter? Benchmarking metrics will tell you no, but do keep in mind the question: why?

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%;"
           src="./assets/dofa.png"/>

<div style="text-align: center;">Figure 12: (a) Architecture design. DOFA builds on masked image modeling by processing input images with any number of channels within a single framework. (b) Dynamic weight generator and continual training framework from (Xiong et al., 2024). </div>

#### Prithvi 1.0

Prithvi 1.0 is one of the earliest foundation models and the simplest of the three models tested in this demonstration (Jakubik et al., 2023). Developed by NASA IMPACT, Prithvi 1.0 is a vanilla masked autoencoder trained exclusively on HLS data. This limited scope presents a unique research opportunity for studying a vision transformer with relatively few modifications to its structure. The expectation, then, is that for this specific test Prithvi would have the best performance metrics, given that its pretraining dataset most closely aligns with the downstream task. That remains to be determined!


<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:300px"
           src="./assets/prithvi.png"/>

<div style="text-align: center;">Figure 13: Masked autoencoder diagram for Prithvi's pretraining from (Jakubik et al., 2023).</div>

#### CROMA

CROMA is another transformer-based model, with a slightly different approach that combines both masked image modeling (MIM) and contrastive learning (Fuller et al., 2023). Rather than contrasting positive and negative samples, the CROMA framework contrasts optical and radar data using separate encoders, then combines those embeddings using cross-attention with a unifying transformer encoder. The output of this terminal encoder is then randomly masked and trained in a typical encoder-decoder MIM style. The optical encoder requires all 12 spectral bands from Sentinel-2, while the radar encoder requires the 2 polarization bands from Sentinel-1. In our downstream task, we are limited to 6 harmonized spectral bands. How can we expect this to impact our results?

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:300px"
           src="./assets/chroma.png"/>

<div style="text-align: center;">Figure 14: (Left) Pretraining framework for CROMA. (Right) Encoding workflow for leveraging internal representations.</div>

#### Encoder Weight-Matrix Analysis

Before we look at the encoders relative to the input data, let's evaluate the weight matrices.

Below is the example command for CROMA. The commands for all other encoders can be found here: [Multi-Encoder Per-Layer Weight Matrix Analysis](https://github.com/sig-gis/pangaea-bench/blob/embedding_analysis_nl/scripts/encoder_analysis/ww_mtbs.sh)

In [None]:
# Move to the analysis directory. %cd persists across cells, whereas !pushd
# only affects a temporary subshell. Use `%cd -` later to return.
%cd pangaea-bench/pangaea/encoder_analysis/

# The library writes to local directories.
!mkdir -p img ww-img
!python3 weight_watcher.py dataset=mtbs encoder=croma_optical \
    preprocessing=seg_resize_input_layer task=segmentation

# Move output to a location for longer-term storage.
!mkdir -p ww/croma/
!mv img/* ww/croma/
!mv ww-img/* ww/croma/

There are additional details produced within plots generated by the commands run, but we will use the per-layer $\alpha$ plots in Figure 15 to provide a top-level summary of the findings. CROMA appears to have the most consistently well-trained layers, with most of the $\alpha$ values staying within the desired [2,5] range, and the deviation outside of this range being very minimal. Prithvi appears to have a handful of layers that are underfit to a small degree, and DOFA has a somewhat evenly distributed set of layers throughout the model that appear to be overfit. 

There are a few layers that deviate past 5 at the beginning of each model, a pattern identified in previous studies of vision models using the WeightWatcher tool. The current hypothesis is that most high-level features learned in early layers of vision models are similar across all scenes in a large training set, so the early layers in a model, typically the ones that learn these high-level features, tend towards overfitting or a lack of generalizability. Here, the deviations only appear to be minor.
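
If you export the per-layer $\alpha$ values yourself, a simple summary is the fraction of layers inside the desired [2, 5] range, along with the indices of the layers outside it. The sketch below is a plain-Python illustration, not part of the WeightWatcher API; the helper name `summarize_alphas` is ours and the input values are made up for the example.

```python
def summarize_alphas(alphas, low=2.0, high=5.0):
    """Summarize per-layer alpha values against the desired [low, high] range."""
    below = [i for i, a in enumerate(alphas) if a < low]
    above = [i for i, a in enumerate(alphas) if a > high]
    inside = len(alphas) - len(below) - len(above)
    return {
        "fraction_in_range": inside / len(alphas),
        "layers_below": below,   # layer indices with alpha < low
        "layers_above": above,   # layer indices with alpha > high
    }


# Made-up values for illustration; these are NOT results from any model here.
example_alphas = [5.6, 4.1, 3.2, 2.8, 1.7, 3.9]
print(summarize_alphas(example_alphas))
```

Listing the out-of-range layer indices makes it easy to see whether deviations cluster at the start of the network, as described above, or are spread throughout it.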

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%; height:400px"
           src="./assets/Combined_Alpha_vs_Depth_Plots.png"/>

<div style="text-align: center;">Figure 15. Plots containing the $\alpha$ summary parameter from the model weight matrix analysis done for CROMA (top), Prithvi (center), and DOFA (bottom).</div>
<br/>


This analysis not only informs us about performance considerations when using these models with frozen weights, but also can provide some insight into considerations we need to take into account if we are to fine-tune the weights of these encoders. For example: 

**1)** Which models' overfit layers might be more prone to a phenomenon called catastrophic forgetting, where performance significantly dips after fine-tuning due to the information gained in pre-training being lost or overwritten.

**2)** Which model's layers may need more time to learn / optimize during the fine-tuning phase, due to being underfit in pre-training.


#### Resource Requirements - FLOPs, MACs, and Parameter Counts

Next, we will take a look at the computational and storage requirements of each model.

Below is the example command for CROMA. The commands for all other encoders can be found here: [MTBS Multi-Encoder Resource Consumption Estimation](https://github.com/sig-gis/pangaea-bench/blob/main/scripts/encoder_analysis/compute_flops_mtbs.sh). Table 1 provides the results of the computation for a larger set of models supported by Pangaea.


In [None]:
!python3 compute_flops.py dataset=mtbs encoder=croma_optical \
    preprocessing=seg_resize_input_layer task=segmentation


<img src="./assets/Resource_Consumption_Table.png"/>

<div style="text-align: center;">Table 1. Resource consumption of the three encoders relative to common pretrained models. Note the diversity in amount of calculations. We will touch on this in the following section.</div>
<br/>

Here we can see that CROMA, DOFA, and Prithvi stand out as some of the largest models, with some of the highest operation counts for forward and backward passes. These models are used here as examples and do not necessarily outperform every other option, whether available within Pangaea or outside of it. The authors re-emphasize that one should choose the smallest and most optimized (least resource-intensive) model that provides a good representation of the features of interest in the dataset and good enough (typically task-dependent) performance on the ultimate task.
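
For intuition about how parameter counts, MACs, and FLOPs relate, the sketch below works through a single fully connected layer. It uses the common convention that one multiply-accumulate (MAC) counts as two floating-point operations; how `compute_flops.py` counts operations internally is not shown here, so treat this as a generic illustration. The layer sizes and token count are hypothetical values chosen to resemble a ViT-style MLP layer.

```python
def linear_layer_cost(d_in, d_out, n_tokens, bias=True):
    """Parameter count and forward-pass cost of one linear layer applied to n_tokens tokens."""
    params = d_in * d_out + (d_out if bias else 0)
    macs = n_tokens * d_in * d_out   # one multiply-accumulate per weight per token
    flops = 2 * macs                 # convention: 1 MAC = 1 multiply + 1 add
    return {"params": params, "macs": macs, "flops": flops}


# Hypothetical sizes: 768 -> 3072 projection applied to 196 patch tokens.
cost = linear_layer_cost(d_in=768, d_out=3072, n_tokens=196)
for name, value in cost.items():
    print(f"{name:>6}: {value:,}")
```

Note that the parameter count is independent of the input size, while MACs and FLOPs grow with the number of tokens. This is why two models with similar parameter counts can differ widely in compute, depending on patch size and input resolution.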


#### Embedding generation

Next, we will take the input dataset and pass it through the pre-trained encoders to get model-specific representations, or embeddings, for each input scene. Below is an example command for the CROMA encoder. We also generate these for 16 other encoders, both pre-trained and randomly initialized. Additional examples can be found here: [MTBS Multi-Encoder Embedding Generation Script](https://github.com/sig-gis/pangaea-bench/blob/main/scripts/encoder_analysis/embed_gen_mtbs.sh)

In [None]:
!python3 embed.py dataset=mtbs encoder=croma_optical preprocessing=seg_default task=segmentation

#### UMAP Projection and KNN Graph Generation

Now that we have the embeddings, we can project them onto a lower-dimensional manifold and generate a KNN graph from that representation. Below is the example command for CROMA. The commands for all other encoders can be found here: [MTBS Multi-Encoder UMAP Projection and KNN-Graph Generation](https://github.com/sig-gis/pangaea-bench/blob/main/scripts/encoder_analysis/umap_and_knn_graph_gen_mtbs.sh)

In [None]:
!python3 knn_graph_gen.py dataset=mtbs \
                          encoder=croma_optical \
                          preprocessing=seg_default \
                          criterion=cross_entropy \
                          task=segmentation

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 50%; height:400px"
           src="./assets/MTBS_UMAP.png"/>

<div style="text-align: center;">Figure 16. A plot of the embeddings generated from CROMA (left), DOFA (center), and Prithvi (right) for samples from the test split of the MTBS dataset, projected onto a low-dimensional (2D) manifold via UMAP and overlaid with their associated target value. The key for the targets is on the bottom right.</div>
<br/>

We can use these plots to first do a qualitative analysis. In Figure 16 we can see weak to moderate separation of the classes for all models, which tells us that this is not an easy task for these models. In all three cases, to varying degrees, there is an intermixing of the label values. Both Prithvi and CROMA appear to do a better job of separating out subsets of the classes in different ways, while DOFA appears to have the hardest time representing the classes distinctly in its embedding space, as demonstrated by a lack of clear separation of the classes anywhere in the plot.

As mentioned before, this methodology is not perfect, and we are removing representative information from embedded samples by reducing the feature space, but trade-offs have to be made for interpretability and analysis of model performance, as results need to be human-interpretable in order for them to be useful. If a model is more easily representing distinctions between samples in a simplified way at a higher dimension, those distinctions will likely be picked up in this lower dimension space. 
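
For readers unfamiliar with KNN graphs, the idea behind the graph-generation step in this subsection can be sketched in plain Python: each sample is connected to its k closest samples in the projected space. This brute-force version is for illustration only and is not the implementation in `knn_graph_gen.py`; the function name `knn_graph` and the toy 2D points are ours.

```python
import math


def knn_graph(points, k):
    """Brute-force k-nearest-neighbor graph: maps each point index to its k closest other indices."""
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Toy 2D "projected embeddings": a tight group of three points and a distant pair.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
for node, neighbors in knn_graph(points, k=2).items():
    print(node, "->", neighbors)
```

Points in the tight group list each other as neighbors, while the two distant points pick each other first. The brute-force search scales quadratically with the number of samples, so approximate nearest-neighbor methods are typically used for large embedding sets.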

#### Inter-Model Embedding Comparisons Using Data Kernels

Here, we take the KNN graphs previously generated and use a data-kernel-based projection to project all embedding samples to a uniform feature space for subsequent analysis. Below is the example command for CROMA. The commands for all other encoders can be found here: [MTBS Multi-Encoder Embedding Comparison](https://github.com/sig-gis/pangaea-bench/blob/main/scripts/encoder_analysis/embed_compare_mtbs.sh)

In [None]:
!python3 data_kernel_analysis.py dataset=mtbs   task=segmentation 

From here we can start by generating a scatter plot of all samples to visualize similarities and differences. Figure 17 provides sample-wise scatter plots for a subset of samples projected into the uniform collective embedding space. This is shown for the models in focus (top left), a larger set of models supported by Pangaea-Bench (top right), and as individual plots for each model in focus, with an additional visualization of spatial density (bottom) that will come in handy in this analysis.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:400px"
           src="./assets/Embed_Scatter.png"/>

<div style="text-align: center;">Figure 17. A depiction of the joint embedding projection for the embeddings of the three models of focus (top left), as well as a larger set of models supported by Pangaea-Bench (top right). The bottom row depicts CROMA's (left), DOFA's (center), and Prithvi's (right) samples in the embedding space, with opacity representing the spatial density of samples.</div>
<br/>

We can also reduce the dimensionality again, creating a manifold where each set of embeddings is represented by a single point. In Figure 18 we can see that CROMA, DOFA, and Prithvi are all in distinct clusters, meaning that this measure of their representational structure shows strong distinctions. This gives us one point of information in our model selection and representation analysis journey, but should not be used as a comprehensive indicator on its own.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 45%; height:300px"
           src="./assets/Embed_Space_Dist_Mtx.png"/>

<div style="text-align: center;">Figure 18. A depiction of collapsing the embedding spaces into single points to visualize and measure population-level representational differences.</div>
<br/>

With these joint projections, we can also model the null distribution of differences between two representations of the same data by leveraging theoretical results from the random graph literature. Paired data that fall outside of a rejection radius are considered to be represented significantly differently. Figures 19 and 20 show pairwise comparisons based on these null distributions.

The likelihood of rejection is strongly correlated with differences in the training data of the encoders, so it can help identify potential gaps in the training and representation regimes. Here again, using a more statistically robust methodology, we can see that the representations are identified as distinct. In each case, the null hypothesis of similarity is rejected (the empirically derived functions in yellow do not map well to the expected uniform distribution), as most intercomparisons of sample distances fall outside the rejection radius.
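
One generic way to put a number on "does not map well to the expected uniform distribution" is the Kolmogorov-Smirnov distance between the empirical distribution of the p-values and Uniform(0, 1). The sketch below is a standalone illustration of that statistic and is not the data-kernel procedure used by `data_kernel_analysis.py`; the function name `ks_distance_from_uniform` and both sets of p-values are made up for the example.

```python
def ks_distance_from_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical CDF of p_values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; the uniform CDF equals x there.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


# Made-up p-values for illustration only.
near_uniform = [0.1, 0.3, 0.5, 0.7, 0.9]            # roughly what the null hypothesis predicts
mostly_rejected = [0.001, 0.002, 0.004, 0.01, 0.02]  # p-values piled up near zero

print(f"near uniform:    {ks_distance_from_uniform(near_uniform):.2f}")
print(f"mostly rejected: {ks_distance_from_uniform(mostly_rejected):.2f}")
```

A distance near 0 means the p-values look uniform, as expected when two representations are similar, while a distance near 1 means the p-values are concentrated at one end, as in the rejection case described above.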


<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:400px"
           src="./assets/Scatter_P_Val.png"/>

<div style="text-align: center;">Figure 19. Scatterplots of the embedding projections used to measure population-level representational shifts between pairs of models. The sample set from the model being compared against is plotted, and the samples of the model we are measuring deviation from are overlaid on top. Opacity has an inverse relationship to the p-value of a given sample. All values with a p-value close to or equal to 1 are colored white and therefore do not show up.</div>
<br/>

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:400px"
           src="./assets/P_Value_Dist.png"/>
<div style="text-align: center;">Figure 20. The distribution of p-values under the null hypothesis for each ordered pair of models.</div>
<br/>

The comparison of Prithvi and CROMA samples (center top and bottom) provides an interesting aside. Notice that when CROMA embeddings are held as the reference set and Prithvi as the comparison set we get the closest, albeit still not close, measure of similarity in the set of plots, while we see a complete rejection of the null hypothesis when the roles of the datasets are swapped. This is due to the difference in spread between each model's samples in the collective embedding space. The rejection radius for CROMA is larger than for Prithvi, given the spread of a larger set of samples from CROMA. Thus, when we use CROMA as reference, more sample tests accept the null hypothesis. The difference is significantly less when looking at the plots for DOFA and Prithvi because the largest density of DOFA's embeddings is farther away from that of Prithvi's, when compared to CROMA's. The case is very similar when comparing DOFA and CROMA. We can see these patterns show up in the density-scaled scatter plots in Figure 17 as well as the inverse-p-value scaled ones in Figure 19. All of this analysis backs up our initial information gained from Figures 16 & 17 about the representations in each of these three models being distinct and worth exploring further.

These tools can be used not only to inter-compare distinct encoders, but also to visualize what happens to a model's representations when processes are changed. For instance, they can be used to look at representational differences of a single encoder before and after fine-tuning, or given two different pre-training dataset splits (Duderstadt et al., 2023).

Identifying embedding-level differences and large-scale representation gaps is crucial, and both can be quantitatively measured using these tools. Methodologies that rigorously inter-compare representational capabilities are important both for better understanding encoder performance and for further architecture development.

 

## Section D: Example Fine Tuning

### Background

Now that we've established some of the basic concepts, we can move forward with fine-tuning the models. Along with the 3 FMs mentioned above, we will train the 5 different decoders described in Section A. That's 15 different models for just a single task! It is clear that, for a comparative analysis of foundation models, compute becomes a key bottleneck in research, especially when aiming for systematic comparisons across models, tasks, or training regimes. For the sake of this demonstration, we limit each training run to a batch size of 8 for 16 epochs on a single RTX 4000 workstation GPU with 12GB of VRAM. A machine of this size is readily available on any cloud compute platform.
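
Because the 15 runs differ only in the encoder and decoder, the command list can also be generated programmatically rather than written out by hand. The sketch below builds the argument lists with `itertools.product`. The decoder names and the DOFA and Prithvi checkpoint paths are taken from the commands in this notebook; the helper name `build_command` is ours, and using `croma_optical` as the CROMA encoder config for training, with no checkpoint path filled in, is an assumption you should adjust to your setup.

```python
from itertools import product

# Encoder config name -> pretrained checkpoint path (None = no weights override added).
ENCODERS = {
    "dofa": "pretrained_models/DOFA_ViT_base_e100.pth",
    "prithvi": "pretrained_models/Prithvi_EO_V1_100M.pt",
    "croma_optical": None,  # assumption: supply your own CROMA checkpoint path
}
DECODERS = ["seg_linear", "seg_fcn", "seg_segformer", "seg_upernet", "seg_muster"]


def build_command(encoder, weights, decoder, dataset="hlsburnscars"):
    """Assemble the torchrun argument list for one encoder/decoder pair."""
    args = [
        "torchrun", "pangaea-bench/pangaea/run.py",
        "--config-name=train",
        "work_dir=checkpoints",
        f"dataset={dataset}",
        f"encoder={encoder}",
        f"decoder={decoder}",
        "preprocessing=seg_default",
        "criterion=cross_entropy",
        "task=segmentation",
        "task.trainer.n_epochs=16",
    ]
    if weights is not None:
        args.append(f"encoder.encoder_weights={weights}")
    return args


commands = [
    build_command(encoder, weights, decoder)
    for (encoder, weights), decoder in product(ENCODERS.items(), DECODERS)
]
print(f"{len(commands)} training runs")
print(" ".join(commands[0]))
```

Each list can be passed to `subprocess.run` to launch the runs one after another, which keeps the encoder/decoder grid in one place and avoids copy-paste mistakes between near-identical commands.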

We target two fire-related tasks: segmenting burn scars in HLS scenes and multiclass segmentation of burn intensity. Both are HLS-derived benchmarks. HLS is an imagery source that is unique in that it draws its optical data from two satellite missions (Landsat and Sentinel) and, in doing so, provides a relatively level test bed for models trained on either. Nevertheless, it’s important to acknowledge the nature of the benchmark datasets used. While they have been carefully curated to support robust model evaluation, they may not fully reflect the diversity or complexity of real-world fire scenarios, which can have drastically different landscape dynamics, data availability, and task scope. The datasets are limited in geographic scope, vegetation types, and imaging conditions, which means performance metrics obtained here may not generalize well to all fire contexts, such as risk mapping. As such, results from this benchmark should be interpreted with consideration of these constraints.

### HLS BURN SCARS Training Commands


In [None]:

## DOFA

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_muster\
    preprocessing=seg_default\
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_fcn\
    preprocessing=seg_default\
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_linear\
    preprocessing=seg_default\
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=dofa\
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth
    
## PRITHVI

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_muster\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=prithvi \
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt
    
## CROMA

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_muster\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt
    
!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=hlsburnscars \
    encoder=croma_optical\
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt


### HLS BURN SCARS Results

Some things to consider:

Below are the results of running the above commands, aggregated into relatively simple charts. Keep in mind that we train for only 16 epochs for the purposes of a simple demonstration, which is a very short schedule for a deep learning model. There is a reasonable chance that these metrics will change given more training time, a different loss function, or a different set of hyperparameters. The framework allows for some level of configuration, so try setting new configurations on your own (see ./pangaea-bench/configs for details).

#### Training Graphs

In spite of a limited training window, we see a fairly distinct training curve, which suggests the models have learned some useful features. Whether this is due to the pretraining or to the simple nature of the task is still an open question. A straightforward test is to train with a randomly initialized encoder by setting the last override in each training command (encoder.encoder_weights) to null. In addition, there is a clear difference in loss between the models, with DOFA generally having the higher cross entropy loss.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:400px"
           src="./assets/training_graph.png"/>
<div style="text-align: center;">Figure 21: Training loss per step for all model variations. Each model was optimized on cross entropy loss.</div>
</br>
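
The randomly initialized baseline mentioned above can be sketched by rewriting the weights override of an existing command. This is a minimal sketch, not part of pangaea-bench: `with_random_init` is a hypothetical helper, and whether a given encoder accepts `null` weights without further changes is an assumption that should be checked against its implementation.

```python
import re


def with_random_init(command: str) -> str:
    """Swap the encoder.encoder_weights override for null (hypothetical helper)."""
    return re.sub(r"encoder\.encoder_weights=\S+", "encoder.encoder_weights=null", command)


# One of the pretrained runs from this notebook, written on a single line.
pretrained = (
    "torchrun pangaea-bench/pangaea/run.py --config-name=train work_dir=checkpoints "
    "dataset=hlsburnscars encoder=prithvi decoder=seg_linear preprocessing=seg_default "
    "criterion=cross_entropy task=segmentation use_wandb=true task.trainer.n_epochs=16 "
    "task.trainer.log_interval=1 task.trainer.ckpt_interval=16 "
    "encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt"
)

# Same run, but with the pretrained checkpoint replaced by null.
baseline = with_random_init(pretrained)
print(baseline)
```

Comparing the loss curve of `baseline` with that of the pretrained run would give a rough indication of how much of the learning comes from pretraining rather than from the task itself.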

#### Decoder Choice

The test metrics for the checkpoint saved at the final epoch tell a slightly different story. Overall, the CROMA combinations appear to be the most effective in terms of F1 and IoU. The PRITHVI encoder performs consistently well across decoders, with the linear probe, surprisingly, being the top performer in this group. The DOFA linear probe stands out as an outlier, with extremely low performance for burn scar detection. The poor performance of the probe into DOFA suggests that its internal representation may not be appropriate for this task and that the decoder might be doing the bulk of the heavy lifting in the other runs, which aligns with the encoder analysis in the sections above. More experimentation is required to confirm this. It's worth noting that, even with the same encoder, varying the decoder has a visible effect on performance that is comparable in magnitude to the differences between encoders.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 70%; height:400px"
           src="./assets/test_metrics.png"/>

<div style="text-align: center;">Figure 22: Test metrics per encoder-decoder variation</div>
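
For reference, the two headline metrics can be computed from a predicted and a reference mask as follows. This is a minimal sketch for the binary case using NumPy; `binary_f1_iou` is a hypothetical helper and the 3x3 masks are made-up toy data, not output from the runs above or from the benchmark's own evaluator.

```python
import numpy as np


def binary_f1_iou(pred, target):
    """Return (F1, IoU) of the positive class for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # burn predicted, burn labeled
    fp = np.logical_and(pred, ~target).sum()   # burn predicted, background labeled
    fn = np.logical_and(~pred, target).sum()   # background predicted, burn labeled
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return float(f1), float(iou)


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
target = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
pred = np.array([[0, 1, 0], [0, 1, 1], [0, 0, 0]])

f1, iou = binary_f1_iou(pred, target)
print(f"F1={f1:.3f}, IoU={iou:.3f}")  # F1=0.667, IoU=0.500
```

The two metrics are monotonically related for a single class (F1 = 2·IoU / (1 + IoU)), so they rank models identically per class, although F1 is always the larger value.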

#### Per Class Metrics

Looking at per class metrics gives a bit more insight into what's happening. In the aggregate metrics above, save for the DOFA linear probe outlier, the differences from model to model are realistically negligible, although further statistical testing would be needed to confirm this. In the per class metrics, however, we see a more noticeable difference in each model's ability to segment the positive class. This suggests a class imbalance issue: when background pixels dominate, aggregate scores can mask weak performance on the positive class. While the imbalance may be an accurate representation of the task, it needs to be taken into account when designing a benchmark dataset and evaluation framework.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 90%; height:500px"
           src="./assets/per_class.png"/>

<div style="text-align: center;">Figure 23: Per class metrics</div>
</br>
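
The masking effect of class imbalance can be illustrated with a small confusion matrix. This is a minimal sketch using NumPy; `per_class_iou` is a hypothetical helper and the pixel counts are invented for illustration (95% background, 5% burn scar), not taken from the HLS burn scars results.

```python
import numpy as np


def per_class_iou(conf):
    """Per-class IoU, where conf[i, j] counts pixels of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp   # pixels of the class that were missed
    fp = conf.sum(axis=0) - tp   # pixels wrongly assigned to the class
    denom = tp + fp + fn
    return np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)


# Made-up counts: rows are true classes (background, burn scar), columns are predictions.
conf = np.array([[9400, 100],
                 [250, 250]])

iou = per_class_iou(conf)
accuracy = np.trace(conf) / conf.sum()

print(f"pixel accuracy: {accuracy:.3f}")   # 0.965
print(f"background IoU: {iou[0]:.3f}")     # 0.964
print(f"burn scar IoU:  {iou[1]:.3f}")     # 0.417
```

Here the pixel accuracy and the background IoU both look strong even though half of the burn scar pixels are missed, which is why the positive-class metric is the more informative one for this task.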

#### Run Times and Compute Costs

Based on the above metrics alone, CROMA seems to be a top performer. Nevertheless, we must consider one more aspect: compute. CROMA consumed up to ~3 times the memory and up to ~9 times the training time of some of the other models for the same number of epochs. This is consistent with the FLOP calculations found in Table 1. In other words, for the same resources, DOFA combined with an even larger decoder and trained for 128 epochs instead of 16 may very well outperform the CROMA models. In the same vein, Prithvi consumed a comparable amount of memory to CROMA but is far simpler to implement and fine-tunes several times faster.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:250px"
           src="./assets/memory.png"/>
           
<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:250px"
           src="./assets/training_time.png"/>

<div style="text-align: center;">Figure 24: (Left) GPU VRAM usage per encoder-decoder combination. (Right) Run time in seconds to reach 16 epochs.</div>
</br>
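
The equal-budget argument above is simple arithmetic, sketched below. `epochs_within_budget` is a hypothetical helper, the ratios are the approximate ones quoted in the text, and the calculation assumes that time per epoch stays constant and ignores the extra cost of a larger decoder.

```python
import math


def epochs_within_budget(reference_epochs, time_ratio):
    """Epochs a faster model can run in the wall-clock time a slower model needs.

    time_ratio is the slower model's time per epoch divided by the faster model's.
    """
    if time_ratio <= 0:
        raise ValueError("time_ratio must be positive")
    return math.floor(reference_epochs * time_ratio)


print(epochs_within_budget(16, 9))  # 144
print(epochs_within_budget(16, 8))  # 128
```

At the quoted upper bound of ~9x, the faster model could train for roughly 144 epochs in the time the slower model needs for 16, so a 128-epoch schedule fits within that budget with some room left for a heavier decoder.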

#### Output Masks

We have all these metrics, and they all tell a similar story. Let's look at some of the outputs! After all, the end product should be a map. The first thing you should notice below is that the annotation is not perfect, so the metrics above should be interpreted carefully. For example, the PRITHVI masks from the different decoders look vastly different, while the metrics say otherwise. When the benchmark dataset was created, chip labels that contained multiple annotations in close proximity were likely left unfiltered. This does offer an opportunity to see how a model behaves in such situations, which are, frankly, all too common. Interestingly, whether or not the model captures the false negatives depends mostly on the decoder. In addition, there are artifacts in the FCN variations that arise from the structure of the decoder itself rather than from anything meaningful in the image.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 80%; height:500px"
           src="./assets/masks.png"/>

<div style="display: flex; justify-content: center; gap: 20px;">
  <img style="width: 20%; height: 200px;" src="./assets/hls_label.png" />
  <img style="width: 20%; height: 200px;" src="./assets/hls_image.png" />
</div>

<div style="text-align: center;">Figure 25: (Top) Grid table of mask outputs of all models with all decoders. (Bottom Left) Original annotation. (Bottom Right) Source Image</div>

#### Multimodality / Multisensor Models Scale

You may have noticed that, in spite of being a small decoder, the linear probe into PRITHVI performed comparably to the larger models on this task. While this may be attributed to the limited training time or architectural design choices, the more likely reason is that PRITHVI 1.0 was trained exclusively on HLS scenes. Foundation models, like all machine learning models, are shaped by their source data, and making them truly generalizable is an ongoing problem. Likewise, spatiotemporal patterns are diverse in both downstream tasks and sensor data. Ultimately, when selecting a base model, one of the first things to consider, before benchmark metrics, is how well the pretraining dataset aligns with your task.


### MTBS Training Commands

While the HLS burn scars dataset is an excellent example for learning to fine-tune due to its small size and simple design, it may be difficult to differentiate model utility with it. Next we'll examine the dataset on which we performed the encoder analysis by running the exact same commands as in the previous section with the dataset configuration set to mtbs. Luckily for us, we've also implemented the dataloader and configurations, found in ./pangaea-bench/pangaea and ./pangaea-bench/configs respectively. Since this is a much larger dataset (5 times larger, in fact), these runs may take up to two hours rather than a few minutes.
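
Because the runs differ only in dataset, encoder, decoder, and weights, the full grid of commands can also be generated programmatically. The following is a minimal sketch, not part of pangaea-bench: `build_command`, `ENCODER_WEIGHTS`, and `DECODERS` are hypothetical names, and the override keys and paths are copied from the commands used in this notebook.

```python
from itertools import product

# Encoder name -> pretrained weights path, as used in the commands in this notebook.
ENCODER_WEIGHTS = {
    "croma_optical": "pretrained_models/CROMA_large.pt",
    "prithvi": "pretrained_models/Prithvi_EO_V1_100M.pt",
    "dofa": "pretrained_models/DOFA_ViT_base_e100.pth",
}
DECODERS = ["seg_muster", "seg_fcn", "seg_linear", "seg_segformer", "seg_upernet"]


def build_command(dataset, encoder, decoder, n_epochs=16):
    """Return the torchrun argument list for one encoder-decoder run (hypothetical helper)."""
    return [
        "torchrun", "pangaea-bench/pangaea/run.py",
        "--config-name=train",
        "work_dir=checkpoints",
        f"dataset={dataset}",
        f"encoder={encoder}",
        f"decoder={decoder}",
        "preprocessing=seg_default",
        "criterion=cross_entropy",
        "task=segmentation",
        "use_wandb=true",
        f"task.trainer.n_epochs={n_epochs}",
        "task.trainer.log_interval=1",
        f"task.trainer.ckpt_interval={n_epochs}",
        f"encoder.encoder_weights={ENCODER_WEIGHTS[encoder]}",
    ]


# One command per encoder-decoder pair: 3 encoders x 5 decoders.
commands = [build_command("mtbs", enc, dec) for enc, dec in product(ENCODER_WEIGHTS, DECODERS)]
print(len(commands))  # 15
print(" ".join(commands[0]))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)`. Generating the grid this way keeps the dataset and weights consistent with the chosen encoder, which is easy to get wrong when copying commands by hand.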

In [None]:

# CROMA

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=croma_optical\
    decoder=seg_muster\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=croma_optical\
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=croma_optical\
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=croma_optical\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=croma_optical\
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/CROMA_large.pt

## PRITHVI

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=prithvi\
    decoder=seg_muster\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=prithvi\
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=prithvi\
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=prithvi\
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=prithvi\
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/Prithvi_EO_V1_100M.pt

## DOFA

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=dofa \
    decoder=seg_muster\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=dofa \
    decoder=seg_fcn\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=dofa \
    decoder=seg_linear\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=dofa \
    decoder=seg_segformer\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

!torchrun pangaea-bench/pangaea/run.py \
    --config-name=train \
    work_dir=checkpoints \
    dataset=mtbs \
    encoder=dofa \
    decoder=seg_upernet\
    preprocessing=seg_default \
    criterion=cross_entropy \
    task=segmentation \
    use_wandb=true \
    task.trainer.n_epochs=16 \
    task.trainer.log_interval=1 \
    task.trainer.ckpt_interval=16 \
    encoder.encoder_weights=pretrained_models/DOFA_ViT_base_e100.pth

### MTBS Results

#### Overall Metrics

Unlike the previous task, we see markedly poorer results. The typical hyperparameter settings a practitioner would apply in a first pass (i.e., cross entropy loss, a moderate learning rate, etc.) are insufficient to produce meaningful results. This is understandable: the burn intensity dataset has a greater number of classes, an additional temporal reasoning component, and regional differences that are not depicted here. These results are also consistent with the embedding analysis in Figure 16. Clever fine-tuning techniques may be able to compensate for these shortcomings, but they would potentially negate any pretraining benefits.

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 60%; height:400px"
           src="./assets/mtbs_test_metrics.png"/>

<div style="text-align: center;">Figure 15: Best test metrics for the best-performing decoder with each encoder.</div>
</br>

#### Interpretation

Change detection / attribution is a difficult topic, as even the veterans of the remote sensing world will tell you. It is often multivariate and involves very subtle landscape dynamics that are not clear to the human eye without aid. For example, below is a sample pulled from the MTBS dataset. These are scenes from the Boulder Lake, WA fire in August 2022. Between the second and third images, it is clear where the fire occurred (vegetation turned to bare soil) in spite of the smoke and other artifacts in the third image. Given our results from the first fine-tuning workflow, detecting this would easily be done by all models. Yet when we extend the problem to intensity, the model must additionally reason about what vegetation was present prior to the fire, what vegetation died, and what vegetation experienced a low enough fire temperature to survive and regrow the following year, all while contending with the strong noisy signals ever present in the images and the variations between them. This is an extremely complex task.

<div style="display: flex; justify-content: center; gap: 20px;">
  <img style="display: block; 
            margin-left: auto;
            margin-right: auto;
            width: 20%;"
            src="./assets/boulderlake1.png"/>
  <img style="display: block; 
            margin-left: auto;
            margin-right: auto;
            width: 20%; "
            src="./assets/boulderlake2.png"/>
  <img style="display: block; 
            margin-left: auto;
            margin-right: auto;
            width: 20%;"
            src="./assets/boulderlake3.png"/>
  <img style="display: block; 
            margin-left: auto;
            margin-right: auto;
            width: 20%;"
            src="./assets/boulderlake4.png"/>
</div>

<img style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 20%;"
           src="./assets/boulderlake4wlabels.png"/>

<div style="text-align: center;">Figure 26: (Top) Landsat imagery from the Boulder Lake, WA fire in August 2022. From left to right: 1 year pre-fire, 2 months post-fire, 1 year post-fire, 2 years post-fire. (Bottom) 2 years post-fire with class labels overlaid.</div>


## A Well Defined Problem

Benchmarks such as ImageNet for vision tasks or GLUE for language tasks have helped standardize comparisons, but emerging domains may lack such comprehensive baselines. In such cases, creating task-specific benchmarks becomes more necessary, especially as these "Foundation Models" become more and more prominent. Like all standardized evaluations, benchmarks can promote excessive metric fixation. Even so, these metrics are necessary when comparing across models, and the comparison must be done under the same conditions: using the same dataset splits, preprocessing steps, and evaluation protocols. Differences in any of these factors can distort the interpretation of results. For that reason, we see testing under multiple controlled conditions as an essential step for a robust evaluation. In our simplified example above, we vary the decoder attached to each frozen pretrained encoder, as the quality of the internal representation is only as good as how easily we can extract from it. Model complexity, training time, and inference speed are also crucial metrics to consider, especially in deployment scenarios where computational resources or latency requirements are constrained. Lastly, an assessment of representation quality is important for determining whether the architecture, the pretraining task, or both contribute to the model's performance. **While this process may seem overwhelming and extensive, it is imperative that this level of care and rigor goes into the evaluation of representation and performance to ensure that these function approximators are actually aiding in scientific discovery and other downstream tasks, and not just creating additional new webs of questions to disentangle. This should not be seen as an impassable barrier, but as a challenge and a standard that will allow us all to work together to improve capabilities in scientific understanding and geospatial intelligence.**


## References

1) Alain, G., & Bengio, Y. (2016, October 5). Understanding intermediate layers using linear classifier probes. arXiv.org. https://arxiv.org/abs/1610.01644

2) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021, April 29). Emerging Properties in Self-Supervised Vision Transformers. arXiv.org. https://arxiv.org/abs/2104.14294

3) Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, February 13). A simple framework for contrastive learning of visual representations. arXiv.org. https://arxiv.org/abs/2002.05709

4) Duderstadt, B., Helm, H. S., & Priebe, C. E. (2023, May 9). Comparing Foundation Models using Data Kernels. arXiv.org. https://arxiv.org/abs/2305.05126

5) Grill, J., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020, June 13). Bootstrap your own latent: A new approach to self-supervised Learning. arXiv.org. https://arxiv.org/abs/2006.07733

6) He, K., Fan, H.,  Wu, Y., Xie, S. and Girshick, R., "Momentum Contrast for Unsupervised Visual Representation Learning," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 9726-9735, doi: 10.1109/CVPR42600.2020.00975.

7) He, K., Chen, X., Xie, S., Li, Y., Dollár, P. and Girshick, R.,  "Masked Autoencoders Are Scalable Vision Learners," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 15979-15988, doi: 10.1109/CVPR52688.2022.01553.

8) Jakubik, J., Roy, S., Phillips, C. E., Fraccaro, P., Godwin, D., Zadrozny, B., Szwarcman, D., Gomes, C., Nyirjesy, G., Edwards, B., Kimura, D., Simumba, N., Chu, L., Mukkavilli, S. K., Lambhate, D., Das, K., Bangalore, R., Oliveira, D., Muszynski, M., . . . Ramachandran, R. (2023, October 28). Foundation Models for Generalist Geospatial Artificial Intelligence. arXiv.org. https://arxiv.org/abs/2310.18660

9) LaHaye, N., Garay, M. J., Bue, B. D., El-Askary, H., & Linstead, E. (2021). A Quantitative Validation of Multi-Modal Image Fusion and Segmentation for Object Detection and Tracking. Remote Sensing, 13(12), 2364. https://doi.org/10.3390/rs13122364

10) Long, J., Shelhamer, E., & Darrell, T. (2014, November 14). Fully convolutional networks for semantic segmentation. arXiv.org. https://arxiv.org/abs/1411.4038

11) Marsocci, V., Jia, Y., Bellier, G. L., Kerekes, D., Zeng, L., Hafner, S., Gerard, S., Brune, E., Yadav, R., Shibli, A., Fang, H., Ban, Y., Vergauwen, M., Audebert, N., & Nascetti, A. (2024, December 5). PANGAEA: a global and inclusive benchmark for Geospatial foundation models. arXiv.org. https://arxiv.org/abs/2412.04204

12) Martin, Charles H., & Mahoney, Michael W. Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning. JMLR 22(165):1-73, 2021

13) Martin, Charles H., Peng, Tongsu (Serena), & Mahoney, Michael W. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications 12(4122), 2021

14) McInnes, Leland, Healy, John, & Melville, James. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv preprint arXiv:1802.03426, 2020

15) Parajuli, P., Shinde, R., Gurung, I., Maskey, M., & Ramachandran, R. (2024). Curating AI-Ready datasets for equity and Environmental Justice: A Data-Centric AI case study. IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, 483–487. https://doi.org/10.1109/igarss53475.2024.10641786

16) Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L., Rothchild, D., So, D., Texier, M., & Dean, J. (2021, April 21). Carbon emissions and large neural network training. arXiv.org. https://arxiv.org/abs/2104.10350

17) Phillips, Christopher and Roy, Sujit and Ankur, Kumar and Ramachandran, Rahul. (2023, August). HLS Foundation Burnscars Dataset. https://huggingface.co/ibm-nasa-geospatial/hls_burn_scars

18) Strubell, E., Ganesh, A., & McCallum, A. (2019, June 5). Energy and policy considerations for deep learning in NLP. arXiv.org. https://arxiv.org/abs/1906.02243

19) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018, July 26). Unified perceptual parsing for scene understanding. arXiv.org. https://arxiv.org/abs/1807.10221

20) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021, May 31). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv.org. https://arxiv.org/abs/2105.15203

21) Xiong, Z., Wang, Y., Zhang, F., Stewart, A. J., Hanna, J., Borth, D., Papoutsis, I., Saux, B. L., Camps-Valls, G., & Zhu, X. X. (2024, March 22). Neural Plasticity-Inspired multimodal foundation model for earth observation. arXiv.org. https://arxiv.org/abs/2403.15356

22) Xu, J., Shi, W., Gao, P., Wang, Z., & Li, Q. (2022, November 25). MUSTER: a multi-scale transformer-based decoder for semantic segmentation. arXiv.org. https://arxiv.org/abs/2211.13928v2

