# Google Research - Identify Contrails in Satellite Images

## Introduction
This was a 3-month-long Kaggle competition hosted by Google Research ([link](https://www.kaggle.com/competitions/google-research-identify-contrails-reduce-global-warming/code)). The significance of the competition lies in the fact that contrails from planes are likely responsible for more global warming than the kerosene they burn and may account for around 1% of total global warming. Recognizing them in satellite images can help inform route choices or be used to evaluate the effectiveness of countermeasures.

Submissions were evaluated by comparing the uploaded model's pixel predictions on a hidden test set against the ground-truth labels. The ensemble of ML models I submitted scored 0.685, placing 73rd out of 954. The winner scored 0.724. The top 6 teams were awarded a total of $50,000 in prize money.

Due to the substantial size of the original dataset (450 GB) and the models I submitted (around 7 GB), I have uploaded the code segments on Kaggle. There, you can not only inspect the code but also run it without downloading all of this data. I also linked all relevant resources like the paper preprint and RGB recipes for satellite images.

# Project Timeline and Competition Details
The competition spanned 3 months, but I joined during the final week, so I had to use my time efficiently.
Below are the three project phases and the approximate timeline.
As a starter, [here is an animation visualizing inputs and labels](https://www.kaggle.com/code/raki21/time-step-visualization).

## Day 1-2, Overview and Exploratory Data Analysis
During the initial two days, I concentrated on gathering all the information available for this competition. I delved into domain knowledge and created Exploratory Data Analyses (EDAs) to thoroughly explore the data. My sources included the competition site, discussion forums, code-sharing sections, Google searches, and the [paper preprint](https://arxiv.org/abs/2304.02122) mentioned in the challenge description. I used GPT-4 to get oriented in areas such as satellite imagery and infrared channels, but mainly to find sources to read, since GPT can hallucinate and I would not have been able to spot errors immediately in this unfamiliar domain. My results are detailed in the following sections.

### Evaluation metric
The competition required the semantic segmentation of pixels into two classes: contrails and clear sky. The evaluation metric was the global Dice coefficient, calculated as

$$\text{Dice} = \frac{2 | X \cap Y |}{| X | + | Y |}$$

where $X$ is the set of pixels predicted as contrail and $Y$ is the set of ground-truth contrail pixels. The maximum score is 1 and the minimum is 0. The "global" aspect means that the score was calculated across the pixels of all images pooled together, rather than as the mean of per-image scores.
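
To make the difference between the global score and a per-image mean concrete, here is a minimal NumPy sketch. The function names and the toy arrays are illustrative assumptions, not the official scoring code.

```python
import numpy as np

def global_dice(preds, targets, eps=1e-7):
    """Dice over all pixels of all images pooled together."""
    preds = np.asarray(preds, dtype=bool)
    targets = np.asarray(targets, dtype=bool)
    intersection = np.logical_and(preds, targets).sum()
    return 2.0 * intersection / (preds.sum() + targets.sum() + eps)

def mean_image_dice(preds, targets, eps=1e-7):
    """Mean of per-image Dice scores, shown only for contrast."""
    scores = [global_dice(p, t, eps) for p, t in zip(preds, targets)]
    return float(np.mean(scores))

# Two toy 2x2 "images": the first contains a contrail, the second is empty.
targets = np.array([[[1, 1], [0, 0]], [[0, 0], [0, 0]]])
preds = np.array([[[1, 0], [0, 0]], [[0, 0], [0, 1]]])

global_score = global_dice(preds, targets)     # pooled over both images
mean_score = mean_image_dice(preds, targets)   # empty image scores 0 on its own
print(global_score, mean_score)
```

With the global variant, a single false positive on an empty image costs only a little, while the per-image mean gives that image a score of 0.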

### Input description
Each sample contained input data with the shape **(256,256,8)×9**.
* **256,256:** The height and width. Each pixel covers 2 km × 2 km, so each image covers 512 km × 512 km.
* **8:** The number of time steps, with the goal of identifying contrails in the 5th image, using images slightly before and after. The time sequence corresponds to minutes: -60, -45, -30, -15, 0, 15, 30, 45.
* **9:** Different infrared input channels. Explored in detail in [this notebook](https://www.kaggle.com/code/raki21/exploring-optimal-bands-for-contrail-detection).

### Label description
Labels were generated by multiple annotators, who were given the input data along with additional flight information. A pixel was labeled as a contrail if the majority agreed on its classification. The annotators used the [Ash Color Scheme](https://eumetrain.org/sites/default/files/2020-05/RGB_recipes.pdf), displaying three of the nine infrared channels as RGB. There was an oversampling of images with contrails in both the train and validation sets:

* Around 45% of training images and 30% of validation images contain contrails.
* Approximately 0.5% of training set pixels and 0.2% of validation set pixels represent contrails.
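
For reference, the Ash color scheme mentioned above can be sketched as follows. The band numbers (11, 14, 15) and the normalization bounds are assumptions taken from the values commonly used in the competition's public notebooks, so treat them as illustrative rather than authoritative. Each band is assumed to be an array of shape `(256, 256, 8)` holding brightness temperatures in kelvin.

```python
import numpy as np

# Assumed normalization bounds in kelvin (not stated in this write-up).
T11_BOUNDS = (243.0, 303.0)
CLOUD_TOP_TDIFF_BOUNDS = (-4.0, 5.0)
TDIFF_BOUNDS = (-4.0, 2.0)
T0 = 4  # index of the labeled frame (the 5th of 8 time steps)

def normalize_range(data, bounds):
    """Linearly map [lo, hi] to [0, 1]."""
    lo, hi = bounds
    return (data - lo) / (hi - lo)

def ash_false_color(band11, band14, band15):
    """Return an (H, W, 3, T) false-color array clipped to [0, 1]."""
    r = normalize_range(band15 - band14, TDIFF_BOUNDS)
    g = normalize_range(band14 - band11, CLOUD_TOP_TDIFF_BOUNDS)
    b = normalize_range(band14, T11_BOUNDS)
    return np.clip(np.stack([r, g, b], axis=2), 0.0, 1.0)

# Synthetic stand-in data; real bands would be loaded from the dataset.
band14 = np.full((256, 256, 8), 273.0)
band15 = band14 - 1.0
band11 = band14 - 0.5
ash = ash_false_color(band11, band14, band15)
labeled_frame = ash[..., T0]  # (256, 256, 3) image that the labels refer to
```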

### Tricks from discussions

Several challenges and strategies emerged during discussions:

* Data Augmentation: Many reported problems with even standard augmentation techniques, like horizontal or vertical flips.
* Upscaling: Mentioned in the paper, scaling images from 256x256 to 512x512 using bilinear interpolation improved model performance.
* Pseudolabeling: This was suggested as a way to enhance performance, applicable to the 7 time steps not at t=0.

By synthesizing these insights, I sought to create a good inference pipeline.

## Day 3-6, Pipeline and Optimization

### Setting Up the Framework
I set up a baseline pipeline using the [segmentation models pytorch](https://github.com/qubvel/segmentation_models.pytorch) library. The architecture combined a [UNet](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/) decoder with a small [ResNeSt](https://arxiv.org/abs/2004.08955) backbone from [PyTorch Image Models](https://timm.fast.ai/).

My updated training code is [here](https://www.kaggle.com/raki21/contrails-train). It is very similar to the code used to create the models for my final submission. The only significant change is the use of the 0.5 pixel shift, described in "Data Augmentation".

### Losses and labels
The losses that seemed most promising were Weighted Binary Cross Entropy (WBCE) and Dice Loss.
I considered using two types of labels: the hard labels, which are the ground truth in evaluation, and soft labels, computed as the mean of the individual annotators' masks, which I created [here](https://www.kaggle.com/code/raki21/exploring-optimal-bands-for-contrail-detection) and saved [here](https://www.kaggle.com/datasets/raki21/contrail-ash-band08).
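
The relationship between the two label types can be sketched like this. The array layout `(H, W, n_annotators)` and the function name are assumptions for illustration; the linked notebook contains the actual code.

```python
import numpy as np

def soft_and_hard_labels(individual_masks):
    """individual_masks: (H, W, n_annotators) array of 0/1 votes."""
    votes = np.asarray(individual_masks, dtype=np.float32)
    soft = votes.mean(axis=-1)            # fraction of annotators marking a contrail
    hard = (soft > 0.5).astype(np.uint8)  # strict majority vote
    return soft, hard

# One row of two pixels, four annotators each.
votes = np.array([[[1, 1, 0, 0], [1, 1, 1, 0]]])
soft, hard = soft_and_hard_labels(votes)
print(soft)  # agreement per pixel
print(hard)  # majority decision per pixel
```

The hard label discards how strongly the annotators agreed, which is the information the soft label keeps.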

* Dice loss performed similarly to or better than WBCE in all experiments I conducted. I improved on it further by developing my own Dice loss variant, because of a problem I noticed when combining soft labels with the regular Dice loss. This problem and the custom solution are explained [here](https://www.kaggle.com/raki21/adjusted-dice-loss-for-soft-labels).

* Soft labels appear to be more effective, as hard labels are a lossy compression.
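
As background, here is a small sketch of one well-known property of the standard soft Dice loss with non-binary targets: a prediction that exactly matches the soft target does not reach zero loss, and a more confident prediction can score better. Whether this is the specific problem addressed in the linked notebook is an assumption on my part; the adjusted loss itself is not reproduced here.

```python
import numpy as np

def soft_dice_loss(probs, targets, eps=1e-7):
    """Standard soft Dice loss: 1 - 2*sum(p*t) / (sum(p) + sum(t))."""
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    intersection = (probs * targets).sum()
    return 1.0 - 2.0 * intersection / (probs.sum() + targets.sum() + eps)

soft_target = np.array([0.0, 0.25, 0.5, 1.0])

# Prediction identical to the soft target.
loss_matching = soft_dice_loss(soft_target, soft_target)

# Over-confident prediction: the target binarized at 0.4.
loss_binarized = soft_dice_loss(soft_target > 0.4, soft_target)

print(loss_matching, loss_binarized)
```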

##### Validation Dice of Different Label & Loss Combinations, at the Optimal Threshold
    Hard + dice:     0.621 at 0.17
    Hard + WBCE:     0.618 at 0.85

    Soft + dice:     0.629 at 0.999
    Soft + WBCE:     0.621 at 0.9
    Soft + mydice:   0.633 at 0.62
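
An optimal threshold like the ones listed above can be found with a simple sweep over the validation predictions. This is a minimal sketch; the threshold grid and the function name are assumptions, and a finer grid would be needed to find values as extreme as 0.999.

```python
import numpy as np

def best_threshold(probs, targets, thresholds=None):
    """Return (threshold, global Dice) maximizing Dice over a grid."""
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)  # assumed grid
    probs = np.asarray(probs, dtype=np.float64).ravel()
    targets = np.asarray(targets).ravel().astype(bool)
    best_t, best_score = None, -1.0
    for t in thresholds:
        preds = probs > t
        denom = preds.sum() + targets.sum()
        score = 2.0 * np.logical_and(preds, targets).sum() / denom if denom else 0.0
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score

# Toy validation set: four pixels with predicted probabilities and hard labels.
threshold, score = best_threshold([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1])
print(threshold, score)
```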

### Input bands
I also tried to verify whether the additional band08 was advantageous, as proposed in [my notebook](https://www.kaggle.com/code/raki21/exploring-optimal-bands-for-contrail-detection).
It turned out that adding band08 made little difference, or was slightly negative.
Every other band decreased performance considerably, which I attribute to the model overfitting more easily.

### Model Architectures and Size
I experimented with various backbones like EfficientNet, ConvNeXt, and many timm models besides my ResNeSt, coupled with all decoders available in SMP, such as UNet++. While UNet++ yielded minor improvements, its computational overhead led me to invest in more impactful areas like larger backbones. After testing promising candidates like ConvNeXt or coatv3, I settled on ResNeSt for efficiency and scaled it up to the 200e variant, which outperformed the larger 269e.

### Upscaling
The discussions and the paper preprint recommended upscaling to 512x512, but after early trials the substantial computational expense led me to stop at 384x384. Even so, this upscaling produced a tangible boost in the global validation Dice score.

### Utilizing Time Steps

Given the limited time and resources, I was unable to develop or implement a complex LSTM/transformer+CNN hybrid, which might have effectively harnessed the information within different time steps.
Rather than pursuing this sophisticated approach, I opted to explore a more accessible pipeline to make use of the time information in the following manner:

1. **Initial Training:** I trained a model specifically on the labels and ash color input corresponding to contrails at t=0 (the 5th entry in the original data).
2. **Prediction Generation:** Using the trained model, I generated predictions for all other time steps.
3. **Secondary Training:** I trained a new model using only the predictions generated from other time steps, in conjunction with the original ash color input.

Unfortunately, this streamlined approach did not yield the promising results I had hoped for.

### Data Augmentation
Data augmentation, like horizontal and vertical flips, is a very powerful tool in computer vision with deep learning, but here all the spatial transforms, the type of data augmentation that should be best suited, performed poorly.

My first guess was that this was due to information about geographical location that the model infers from flight paths that would be lost in these augmentations. My fault was not checking alternative hypotheses. 

It turned out that these augmentations failed because a 0.5 pixel shift occurred when the annotation bounds were converted to pixel masks (on the organizers' side). The model had learned to correct for this subtle shift. When augmentations like flipping were introduced, the direction of the shift was randomized, severely undermining the model's predictive accuracy and its ability to discern correct patterns. This was my biggest miss in this competition: correcting the 0.5 pixel shift allows the use of many spatial data augmentations, such as shift, scale, rotate and flip, which improve the score considerably. It is also a typical pitfall in deep learning: ['Neural net training fails silently'](http://karpathy.github.io/2019/04/25/recipe/).
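
The effect of a flip on a systematic label offset can be illustrated in one dimension. The sketch uses an integer shift of one pixel as a stand-in for the 0.5 pixel shift, purely to keep the arithmetic exact.

```python
import numpy as np

signal = np.array([0, 0, 1, 0, 0, 0])  # "image": a feature at index 2
label = np.roll(signal, 1)             # "mask": systematically shifted by +1

def offset(image, mask):
    """Position of the mask feature relative to the image feature."""
    return int(np.argmax(mask) - np.argmax(image))

offset_original = offset(signal, label)
offset_flipped = offset(signal[::-1], label[::-1])
print(offset_original, offset_flipped)
```

After the flip the mask is displaced in the opposite direction, so a model trained on randomly flipped samples sees an inconsistent offset instead of one it can learn to compensate.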

I created a shifted dataset in [this notebook](https://www.kaggle.com/code/raki21/shifted-ash-dataset-creation-0-5-pixels) and saved it [here](https://www.kaggle.com/datasets/raki21/shifted-ash-for-contrails-05-pixels).


## Day 7, Ensembling and Submission
The last day I used exclusively to create and test ensembles of models I trained.
Initially I was skeptical about how well ensembles would work here, but the ensemble performed very well, improving my position from around 150 to 70. 
I looked at combinations of saved model checkpoints I trained over the last week and found 12 that each contributed positively to the validation metric when put in an ensemble. 
I attribute this success largely to the diversity in my modeling approach throughout the competition. By intermittently saving models with a rich variety of architectures, losses, scales, and both soft and hard label training strategies, the models had sufficiently varied error sources.
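
One simple way to pick checkpoints that each improve the validation metric is greedy forward selection. The sketch below is an assumption about how such a search could look, not the procedure used for the submission; the names and the fixed threshold are illustrative.

```python
import numpy as np

def dice(preds, targets):
    """Global Dice between two boolean arrays."""
    denom = preds.sum() + targets.sum()
    return 2.0 * np.logical_and(preds, targets).sum() / denom if denom else 0.0

def greedy_select(prob_maps, targets, threshold=0.5):
    """prob_maps: dict of name -> probability array (all the same shape).
    Repeatedly adds the model that most improves the Dice of the averaged ensemble.
    """
    targets = np.asarray(targets, dtype=bool)
    selected, best = [], -1.0
    improved = True
    while improved:
        improved = False
        best_name = None
        for name in prob_maps:
            if name in selected:
                continue
            trial = selected + [name]
            mean_prob = np.mean([prob_maps[n] for n in trial], axis=0)
            score = dice(mean_prob > threshold, targets)
            if score > best:
                best, best_name, improved = score, name, True
        if improved:
            selected.append(best_name)
    return selected, best

targets = np.array([1, 1, 0, 0])
prob_maps = {
    "a": np.array([0.9, 0.4, 0.1, 0.1]),  # finds only the first positive
    "b": np.array([0.4, 0.9, 0.1, 0.1]),  # finds only the second positive
    "c": np.array([0.1, 0.1, 0.9, 0.9]),  # wrong everywhere
}
selected, ensemble_score = greedy_select(prob_maps, targets)
print(selected, ensemble_score)
```

In the toy example the two models with complementary errors are kept and the third is rejected, which mirrors the point about varied error sources.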

The only difficulty was combining models trained at different input scales. The final result can be seen [here](https://www.kaggle.com/code/raki21/multi-scale-ensembler-inference).
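
A minimal NumPy sketch of combining probability maps from different scales: each map is resized to the label resolution with bilinear interpolation and the results are averaged. The resize helper, the equal weighting, and the threshold are assumptions for illustration; the linked notebook shows the actual inference code.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D array using pixel-center alignment."""
    in_h, in_w = img.shape
    # Map output pixel centers back into input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def ensemble_at_label_scale(prob_maps, size=256, threshold=0.5):
    """Resize every probability map to (size, size), average, and threshold."""
    resized = [resize_bilinear(p, size, size) for p in prob_maps]
    return np.mean(resized, axis=0) > threshold

# One model predicting at 384x384 and one at 256x256 (constant toy outputs).
maps = [np.full((384, 384), 0.8), np.full((256, 256), 0.4)]
mask = ensemble_at_label_scale(maps)
print(mask.shape, mask.all())
```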

## Conclusion 
In the end I learned a lot from this competition, and I am proud of placing highly in the short amount of time I had.
My biggest mistake was definitely not testing my ad hoc hypothesis about why data augmentation did not work well. On a more meta level, I should have planned more time for the competition.

* **Things I tried but couldn't develop far enough**: Using the time information better and looking more into large models like ConvNeXt.
* **Things I wanted to try but didn't have the time**: Pseudo labeling, postprocessing.


## List of Linked Files and Folders

[Visualization of Input and Labels over Time](https://www.kaggle.com/code/raki21/time-step-visualization)

[Train Code](https://www.kaggle.com/raki21/contrails-train)

[Multi Scale Ensemble](https://www.kaggle.com/code/raki21/multi-scale-ensembler-inference)

[Optimal Bands](https://www.kaggle.com/code/raki21/exploring-optimal-bands-for-contrail-detection)

[Ash+08 Soft Label Dataset](https://www.kaggle.com/datasets/raki21/contrail-ash-band08)

[Custom Dice Loss](https://www.kaggle.com/raki21/adjusted-dice-loss-for-soft-labels)

[Shifted Contrail with soft label: Creation](https://www.kaggle.com/code/raki21/shifted-ash-dataset-creation-0-5-pixels)

[Shifted Ash Soft Label Dataset](https://www.kaggle.com/datasets/raki21/shifted-ash-for-contrails-05-pixels)