# SETI Breakthrough Listen challenge solution using a model ensemble

This notebook presents a solution to the Kaggle “SETI Breakthrough Listen - E.T. Signal Search” challenge:
- https://www.kaggle.com/competitions/seti-breakthrough-listen

The ensemble approach is based on the following backbone models that have been trained on the SETI dataset:
- ECA NFNet
- EfficientNet
- Regnet

## Install Python Packages

Install packagese required to run this notebook.

In [1]:
# Pretrained models
!pip install timm

# Image transformations

!pip install albumentations

# PyTorch configured to run on GPU (CUDA)
# !pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# Evaluation metrics, such as accuracy
!pip install torchsummary

# Momentum based gradient descent optimizer
!pip install AdamP

# API for logging evaluation metrics to the cloud
!pip install wandb

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com



[notice] A new release of pip available: 22.2.1 -> 22.2.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com



[notice] A new release of pip available: 22.2.1 -> 22.2.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com



[notice] A new release of pip available: 22.2.1 -> 22.2.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com



[notice] A new release of pip available: 22.2.1 -> 22.2.2
[notice] To update, run: python.exe -m pip install --upgrade pip


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com



[notice] A new release of pip available: 22.2.1 -> 22.2.2
[notice] To update, run: python.exe -m pip install --upgrade pip


## Import Python Packages

Import packages used in this notebook.

In [2]:
# Python utility packages
import os
import random
from collections import defaultdict
from datetime import datetime
import zipfile
from io import BytesIO

# PyTorhc packages for building, running and evaluating models
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchsummary import summary
from torch.cuda.amp import GradScaler

# Scikit learn utility packages
from sklearn.decomposition import NMF # non-negative matrix factorisation
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score # evaluation metrics
from sklearn.model_selection import StratifiedKFold # Used for k-Folds cross validation

# OpenCV library used for image processing
import cv2

# Data and processing utility packages
import numpy as np
import pandas as pd

# Collection of pretrained models
import timm

# Momentum based gradient descent optimiser
from adamp import AdamP

# Batch processing helper
from tqdm import tqdm

# Image transformation helpers
import albumentations
from albumentations.pytorch.transforms import ToTensorV2

# Utility for logging metrics using a cloud based API
import wandb

# Graph plotting library
from matplotlib import pyplot as plt

## Global Configuration

Create a configuration object for important values used in the notebook.

This makes them easy to track and tweak.

In [3]:
# Configuration object for important parameters used in the notebook
config = {
    'num_workers': 4,
    'model': 'nf_regnet_b5',
    'device': 'cuda',
    'image_size': 224,
    'input_channels': 1,
    'output_features': 1,
    'seed': 42,
    'target_size': 1,
    'T_max': 10,
    'min_lr': 1e-6,
    'lr': 1e-4,
    'weight_decay': 1e-8,
    'batch_size': 100,
    'epochs': 1,
    'num_folds': 2,
    'wandb_project': 'SETI - 300 epochs - transform - 2',
    'wandb_run_name': 'ensemble'
}

## Random Seed Initialisation

Set random seeds to fixed values so the notebook results are reproducable.

In [4]:
'''
Sets seeds for randomness based functions.
'''
def set_seeds(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True



# Set the random seeds to a fixed value.
set_seeds(seed=config['seed'])

## Data Loading

Load the SETI dataset labels.

In [5]:
data_dir = r'D:\UoL\Level 6\CM3070 - Final Project\SETI Signal Detection\Data\2000 dataset'
labels_filepath = os.path.join(data_dir, '2000_balanced_labels.npy')

with open(labels_filepath, 'rb') as f:
    initial_data = np.load(f, allow_pickle=True)

initial_data_df = pd.DataFrame(initial_data, columns=['id', 'target', 'image_filepath']).convert_dtypes()
initial_data_df['target'] = initial_data_df['target'].astype('int')

In [6]:
initial_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              2000 non-null   string
 1   target          2000 non-null   int32 
 2   image_filepath  2000 non-null   string
dtypes: int32(1), string(2)
memory usage: 39.2 KB


We can see that the data sample format include three fields:
- `id`: Containing the unique ID for the sample
- `target`: Indicates if the sample is positive (0) or negative (1)
- `image_filepath`: The location of the file containing the sample image data

Split the data with 70% for a training set and 30% for a test set:

In [7]:
data_split_mask = np.random.rand(len(initial_data_df)) < 0.7

train_df = initial_data_df[data_split_mask]
test_df = initial_data_df[~data_split_mask]

## Image Processing Functions

Helper functions for processing the sample images.

In [8]:
'''
Resizes an image to the specified size.
'''
def resize_image(image):
    return cv2.resize(image, dsize=(config['image_size'], config['image_size']), interpolation=cv2.INTER_CUBIC)



'''
Plots an image.
'''
def plot_image(image):
    plt.figure(figsize = (20, 6))
    plt.imshow(image, aspect='auto')
    plt.show()




'''
Min-max normalises the image pixel values (between 0 and 1).
'''
def normalise_image(image):
    image_min = image.min()
    image_max = image.max()

    return (image - image_min) / (image_max - image_min)



'''
Factorises an image into two matrices, and returns them.
Used to help remove image background noise.
'''
def get_decomposition_matrices(image):
    model = NMF(n_components=2, init='random', random_state=0)
    W = model.fit_transform(image + 100) # add 100 to ensure no negative values
    H = model.components_

    return (W, H)



'''
Removes the background noise from a set of sample images.

Based on: https://www.kaggle.com/competitions/seti-breakthrough-listen/discussion/245950
'''
def get_denoised_image(sample_images):
    combined_on_images = None
    combined_off_images = None
    combined_denoised_image = None

    for i in range(0, len(sample_images), 2):
        on_target_image = sample_images[i] # Get on target image
        off_target_image = sample_images[i+1] # Get off target image

        on_W, on_H = get_decomposition_matrices(on_target_image) # Decompose on target images into factor matrices
        off_W, off_H = get_decomposition_matrices(off_target_image) # Decomponse off target images into factor matrices

        # Get noise approximation by multiplying a factor matrix from each of the on target, and off target images.
        # Then subtract the approximated noise from the on target images
        denoised_image = normalise_image(on_target_image - np.matmul(on_W, off_H))

        # Consolidate the on target, off target and denoised images.
        combined_on_images = on_target_image if combined_on_images is None else combined_on_images + on_target_image
        combined_off_images = off_target_image if combined_off_images is None else combined_off_images + off_target_image
        combined_denoised_image = denoised_image if combined_denoised_image is None else combined_denoised_image + denoised_image

    # Return the denoised image
    return combined_denoised_image

## Custom Dataset

Create a custom dataset class.

This is used by the model to interact with a data sample.

In [9]:
class CustomDataset(Dataset):
    def __init__(self, images_filepaths, targets, transform=None):
        self.images_filepaths = images_filepaths
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.images_filepaths)

    def __getitem__(self, idx):
        images_filepath = self.images_filepaths[idx]
        images = np.load(images_filepath).astype(np.float32)
        image = get_denoised_image(images)

        if self.transform is not None:
            image = self.transform(image=image)['image']
        else:
            image = resize_image(image)
            image = image[np.newaxis,:,:]
            image = torch.from_numpy(image).float()

        label = torch.tensor(self.targets[idx]).float()

        return image, label

## Metric Monitoring

Create utility class and function to track loss, accuracy and AUC ROC metrics.

Used with the Weights and Biases web API.

In [10]:
'''
Class used by Weights and Biases web API for metric monitoring.
'''
class MetricMonitor:
    def __init__(self, float_precision=3):
        self.float_precision = float_precision
        self.reset()

    def reset(self):
        self.metrics = defaultdict(lambda: {'val': 0, 'count': 0, 'avg': 0})

    def update(self, metric_name, val):
        metric = self.metrics[metric_name]

        metric['val'] += val
        metric['count'] += 1
        metric['avg'] = metric['val'] / metric['count']

    def __str__(self):
        return " | ".join(
            [
                '{metric_name}: {avg:.{float_precision}f}'.format(
                    metric_name=metric_name, avg=metric['avg'],
                    float_precision=self.float_precision
                )
                for (metric_name, metric) in self.metrics.items()
            ]
        )



'''
Helper method for returning the ROC AUC score for a prediction.
'''
def get_roc_auc(output, target):
    try:
        y_pred = torch.sigmoid(output).cpu()
        y_pred = y_pred.detach().numpy()
        target = target.cpu()

        return roc_auc_score(target, y_pred)
    except:
        return 0.5 # If an exception occurs, e.g. divide by zero, return a 0.5 score

## Model Classes

Create the classes for the models that will be used for classification.

In [11]:
'''
ECA NFNet model.
'''
class EcaNFNet(nn.Module):
    def __init__(self, model_name='eca_nfnet_l0', output_features=config['output_features'],
                 input_channels=config['input_channels'], pretrained=True):
        super().__init__()
        self.model = timm.create_model(model_name, pretrained=pretrained, in_chans=input_channels)

        # Modify final layer to allow for a binary classification
        n_features = self.model.head.fc.in_features
        self.model.head.fc = nn.Linear(n_features, output_features, bias=True)

    def forward(self, x):
        x = self.model(x)

        return x

In [12]:
'''
RegNet model.
'''
class RegNet(nn.Module):
    def __init__(self, model_name='nf_regnet_b1', output_features=config['output_features'],
                 input_channels=config['input_channels'], pretrained=True):
        super().__init__()
        self.model = timm.create_model(model_name, pretrained=pretrained, in_chans=input_channels)

        # Modify final layer to allow for a binary classification
        n_features = self.model.head.fc.in_features
        self.model.head.fc = nn.Linear(n_features, output_features, bias=True)

    def forward(self, x):
        x = self.model(x)

        return x

In [13]:
'''
EfficientNet model.
'''
class EfficientNet(nn.Module):
    def __init__(self, model_name='tf_efficientnet_b3', output_features=config['output_features'],
                 input_channels=config['input_channels'], pretrained=True):
        super().__init__()
        self.model = timm.create_model(model_name, pretrained=pretrained, in_chans=input_channels)

        # Modify final layer to allow for a binary classification
        n_features = self.model.classifier.in_features
        self.model.classifier = nn.Linear(n_features, output_features, bias=True)

    def forward(self, x):
        x = self.model(x)

        return x

## Loss Criterion

Define a loss criterion function for measuring loss:

In [14]:
criterion = nn.BCEWithLogitsLoss().to(config['device'])

## Data Loader

Data loader used during classification to access the data.

In [15]:
'''
Data loader for the test data.
'''

def get_test_loader(test_data):
    test_set = CustomDataset(
        images_filepaths=test_data['image_filepath'].values,
        targets=test_data['target'].values
    )

    return DataLoader(
        test_set,
        batch_size=config['batch_size'],
        shuffle=False,
        # num_workers=config['num_workers'],
        pin_memory=True
    )

    return test_loader

## Test Function

The function used to classify a model against the test dataset.

In [16]:
'''
Function for testing a model against the test dataset.

It receives a model and model state to restore as input.
It returns the model predictions for the ground truth labels.
'''
def test_model(model, model_state_file):
    # keep track of the predictions and ground truth labels
    final_targets = []
    final_outputs = []

    # initialise the model and place it in evaluation mode
    model_state = torch.load(os.path.join('ensemble models', model_state_file))
    model.load_state_dict(model_state)
    model.eval()

    # initialise the metric monitor for reporting progress
    metric_monitor = MetricMonitor()

    # initialise the test data loader with the test data
    test_loader = get_test_loader(test_df)

    # create the stream for iterating through the test data batch
    stream = tqdm(enumerate(test_loader), total=len(test_loader))

    with torch.no_grad(): # don't perform any back propogation as we only want to evaluate
        for i, (images, target) in stream: # iterate through each test data batch
            images = images.to('cpu', non_blocking=True)
            target = target.to('cpu', non_blocking=True).float().view(-1, 1)

            # make predictions
            output = model(images)

            # calculate metric performance metrics
            loss = criterion(output, target)
            accuracy = accuracy_score(target.cpu(), (output.cpu().detach().numpy() > 0), normalize=False)
            roc_auc = get_roc_auc(output, target)

            # report the metrics to the monitor for displaying model progress
            metric_monitor.update('Loss', loss.item())
            metric_monitor.update('ROC AUC', roc_auc)

            # report the metrics the W&B web API
            wandb.log({'Test Loss': loss.item(), 'Test Accuracy': accuracy, 'Test ROC AUC': roc_auc})

            # print the current model progress
            stream.set_description('Test {metric_monitor}'.format(metric_monitor=metric_monitor))

            # get the ground truth labels and predictions, so they can be returned
            targets = target.detach().cpu().numpy().tolist()
            outputs = output.detach().cpu().numpy().tolist()

            final_targets.extend(targets)
            final_outputs.extend(outputs)

    # return predictions and ground truth labels
    return final_outputs, final_targets

## Weights and Biases Initialisation

Initialise W&B with the project and run name to log metrics to.

In [17]:
wandb.init(
    project=config['wandb_project'],
    config=config,
    job_type='test',
    name=config['wandb_run_name']
)

[34m[1mwandb[0m: Currently logged in as: [33mmllm3[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Model Predictions

Instantiate models with states saved from when they were previously trained on the SETI dataset.

Then get their predictions against the test set:

In [18]:
# initialise models to test
ecanfnet_model = EcaNFNet()
regnet_model = RegNet()
efficientnet_model = EfficientNet()


# get model predictions against test data
ecanfnet_outputs, ecanfnet_targets = test_model(ecanfnet_model, 'eca_nfnet_l0_fold_2_epoch_1_roc_auc_0.999.pth')
regnet_outputs, regnet_targets = test_model(regnet_model, 'nf_regnet_b1_fold_2_epoch_1_roc_auc_0.998.pth')
efficientnet_outputs, efficientnet_targets = test_model(efficientnet_model, 'tf_efficientnet_b3_fold_2_epoch_2_roc_auc_0.999.pth')

Test Loss: 0.945 | ROC AUC: 0.506: 100%|██████████| 6/6 [03:00<00:00, 30.10s/it]
Test Loss: 1.066 | ROC AUC: 0.492: 100%|██████████| 6/6 [01:14<00:00, 12.43s/it]
Test Loss: 0.831 | ROC AUC: 0.508: 100%|██████████| 6/6 [02:20<00:00, 23.40s/it]


In [23]:
'''
Returns a dataframe composed of ground truth labels and binary classifications.
'''
def get_predictions_dataframe(targets, predictions):
    results_df = pd.DataFrame([targets, predictions])
    results_df = results_df.transpose()
    results_df.columns = ['label', 'prediction']

    results_df['label'] = results_df['label'].apply(lambda x: x[0])
    results_df['prediction'] = results_df['prediction'].apply(lambda x: x[0])

    # Use threshold of > 0.5 for a positive prediction, else it is a negative prediction
    results_df['prediction'] = results_df['prediction'].apply(lambda x: 1.0 if x > 0.5 else 0)

    return results_df



# get predictions for each model
ecanfnet_results_df = get_predictions_dataframe(ecanfnet_targets, ecanfnet_outputs)
regnet_results_df = get_predictions_dataframe(regnet_targets, regnet_outputs)
efficientnet_results_df = get_predictions_dataframe(efficientnet_targets, efficientnet_outputs)

## Ensemble Predictions



In [26]:
efficientnet_results_df.head()

Unnamed: 0,label,prediction
0,0.0,1.0
1,0.0,1.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0


## Results Evaluation

In [64]:
'''
Given three predictions, returns their mode value.
'''
def get_mode_prediction(prediction_1, prediction_2, prediction_3):
    predictions = pd.Series([prediction_1, prediction_2, prediction_3])

    return predictions.mode()

In [67]:
# build dataframe with all of the results
ensemble_results_df = pd.DataFrame()

ensemble_results_df['label'] = efficientnet_results_df['label']

ensemble_results_df['ecanfnet_prediction'] = efficientnet_results_df['prediction']
ensemble_results_df['regnet_prediction'] = regnet_results_df['prediction']
ensemble_results_df['efficientnet_prediction'] = efficientnet_results_df['prediction']

# get the mode prediction for each row for the ensemble prediction
ensemble_results_df['ensemble_prediction'] = ensemble_results_df.apply(
    lambda row: get_mode_prediction(row['ecanfnet_prediction'], row['regnet_prediction'], row['efficientnet_prediction']),
    axis=1
)

Print metrics for the ensemble predictions:

In [68]:
pd.crosstab(ensemble_results_df['ensemble_prediction'], ensemble_results_df['label'])

label,0.0,1.0
prediction,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,189,176
1.0,99,131


In [69]:
print(classification_report(ensemble_results_df['ensemble_prediction'], ensemble_results_df['label']))

              precision    recall  f1-score   support

         0.0       0.66      0.52      0.58       366
         1.0       0.43      0.58      0.49       229

    accuracy                           0.54       595
   macro avg       0.55      0.55      0.54       595
weighted avg       0.57      0.54      0.55       595

