# Update 7/15/21
I have updated the notebook to apply the old model (trained on the old, leaky data) to the new data from the contest reset. There seem to be more terrestrial needles now, or at least it looks that way to me. They look a lot cleaner too. At a crude guess, the ratio of extraterrestrial to terrestrial needles looks like eight to one.

# Summary
The SETI problem is often described as "searching for needles in a haystack." This Kaggle contest's objective is to find synthetic signatures ("needles") planted in some of the on-channels of radio-telescope spectrogram cadences. A cadence contains six channels: three "on-channels" pointing at the star of interest, and three "off-channels" pointing at other nearby stars. A signal that is only present when looking at the on-channels might be extraterrestrial, but if it also shows up in an off-channel we can assume it's just a boring terrestrial signal (i.e. from a source on Earth).

For this notebook, I borrowed a trained neural network and compared its response for on-channels to its response for off-channels. I think the comparison sheds some light both on how the data set was created and why some people are doing very well in this contest without using the off-channels at all.

# Description
How useful are the off-channels? It's a serious question for the contest, and potentially for the SETI project itself. There are different ways they might be useful; in this notebook, I will consider just one.

Theoretically, if we see a needle in the on-channels, seeing that needle in an off-channel would mean that it's a terrestrial source - non-target. But that's only if it's the same needle. The channels are recorded at different times, so there's a sometimes tricky correspondence problem. And how often does this scenario come up anyway?

To explore this question, I used @ttahara's model from this public notebook, https://www.kaggle.com/ttahara/seti-e-t-resnet18d-baseline, to look for needles in both the on- and off-channels of the test set cadences. By filtering for cadences where both the on- and off-channels show a high response, we can look for terrestrial needles and try to get a sense of how common they are.

As the model was trained to use channels 0, 2, and 4, the on-channel response is available directly in the submission file from that notebook (renamed 'test_on_channels.csv'). I obtained the off-channel response by substituting channels 1, 3, and 5 and running that through the model ('test_off_channels.csv'). In a similar way, I generated 'test_111.csv', 'test_333.csv', and 'test_555.csv' by combining three copies of the same channel and running that through the model.
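
The channel substitution described above can be sketched roughly as follows. This is only an illustration, not the code from @ttahara's notebook: the helper name `select_channels` is hypothetical, the toy array stands in for a real cadence, and I am assuming the first axis of a cadence indexes its six channels (as `show_cadence` in this notebook does).

```python
from typing import Sequence

import numpy as np

def select_channels(cadence: np.ndarray, channels: Sequence[int]) -> np.ndarray:
    """Return the requested channels of a cadence, in the given order (hypothetical helper)."""
    # list(...) forces NumPy fancy indexing along the first (channel) axis
    return cadence[list(channels)]

# Toy stand-in cadence: 6 channels of 4x5 pixels (real cadences are larger)
cadence = np.arange(6 * 4 * 5, dtype=np.float32).reshape(6, 4, 5)

on_input = select_channels(cadence, [0, 2, 4])       # the channels the model was trained on
off_input = select_channels(cadence, [1, 3, 5])      # off-channels substituted in their place
tripled_input = select_channels(cadence, [1, 1, 1])  # three copies of a single channel
```

Each result has the same three-channel shape, so it can be passed to a model that expects the on-channel triple without any other changes.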

# Acknowledgements
I've leveraged the great work in these notebooks:

* https://www.kaggle.com/ttahara/seti-e-t-resnet18d-baseline
* https://www.kaggle.com/ihelon/signal-search-exploratory-data-analysis

If you find this notebook helpful, please give these guys an upvote.


# The Preliminaries

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

def get_test_filename_by_id(_id: str) -> str:
    return f"../input/c/seti-breakthrough-listen/test/{_id[0]}/{_id}.npy"

def show_cadence(filename: str, label: float) -> None:
    plt.figure(figsize=(16, 10))
    arr = np.load(filename)
    for i in range(6):
        plt.subplot(6, 1, i + 1)
        if i == 0:
            plt.title(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
        plt.imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
        plt.text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        plt.xticks([])
    plt.show()
    

# Read in the Precalculated Model Responses

In [None]:
#../input/seti-onoff-channel-needle-comparison
A = pd.read_csv('../input/seti-onoff-channel-needle-comparison/test_on_channels.csv')
B = pd.read_csv('../input/seti-onoff-channel-needle-comparison/test_off_channels.csv')

L = len(A)  # number of test cadences
v = A.iloc[:, 1]  # second column holds the model response
print('on-channel response. min, max, mean', v.min(), v.max(), v.mean())
v = B.iloc[:, 1]
print('off-channel response. min, max, mean', v.min(), v.max(), v.mean())

# Sift for Examples with High Responses Both On-channel and Off.

In [None]:
on_threshold  = 0.3
off_threshold = 0.3
count = 0
on_count = 0
off_count = 0
terrestrials = []
for k in range(L):
    # Rows are paired by position, so both files must list the cadences in the same order
    name = A.iloc[k, 0]
    a, b = A.iloc[k, 1], B.iloc[k, 1]
    if a > on_threshold:
        on_count += 1
    if b > off_threshold:
        off_count += 1
    if a > on_threshold and b > off_threshold:
        count += 1
        terrestrials.append([k, name, a, b])
        print(k, name, a, b)
print(count, 'found. percent passed = ', 100 * count / L, 'out of', L) 
print('on_count percent = ', 100 * on_count / L)
print('off_count percent = ', 100 * off_count / L)
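
The loop above pairs rows by position, which only works if both CSV files list the cadences in the same order. A possible alternative, sketched here on toy data, joins the two tables on the ID column first. The column names `id` and `target` and the helper `find_both_high` are my assumptions for illustration; the loop itself only relies on column positions.

```python
import pandas as pd

# Toy stand-ins for the two response files (column names are assumed)
on_df = pd.DataFrame({"id": ["a1", "b2", "c3", "d4"], "target": [0.9, 0.1, 0.6, 0.2]})
off_df = pd.DataFrame({"id": ["d4", "c3", "b2", "a1"], "target": [0.8, 0.7, 0.05, 0.1]})

def find_both_high(on_df, off_df, on_threshold=0.3, off_threshold=0.3):
    """Return cadences whose on- and off-channel responses both exceed their thresholds."""
    # Join on the cadence ID so row order in the two files does not matter
    merged = on_df.merge(off_df, on="id", suffixes=("_on", "_off"))
    mask = (merged["target_on"] > on_threshold) & (merged["target_off"] > off_threshold)
    return merged.loc[mask].reset_index(drop=True)

both_high = find_both_high(on_df, off_df)
print(both_high)
```

In this toy example only `c3` passes both thresholds, even though the two tables list the IDs in opposite order.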
            

# Plot Them

In [None]:
n = 0
for terr in terrestrials:
    print(terr)
    show_cadence(get_test_filename_by_id(terr[1]), 0.5)  # test labels are unknown; 0.5 is a placeholder
    n += 1
    if n >= 70: 
        print('Only showing first 70')
        break

# Discussion

The plots above show some good examples of terrestrial needles. Lowering the thresholds will bring out more. But finding them isn't always helpful, and they're pretty scarce to begin with. In this haystack, it seems, the extraterrestrial needles outnumber the terrestrial ones by quite a lot.
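
To see how the counts move as the thresholds are lowered, one could sweep a shared threshold over both responses. The sketch below uses made-up scores purely for illustration, and `sweep_thresholds` is a hypothetical helper, not part of the notebook's original code.

```python
import numpy as np

def sweep_thresholds(on_scores, off_scores, thresholds):
    """For each threshold, count cadences that are on-high and those that are also off-high."""
    on_scores = np.asarray(on_scores, dtype=float)
    off_scores = np.asarray(off_scores, dtype=float)
    rows = []
    for t in thresholds:
        on_high = on_scores > t
        both_high = on_high & (off_scores > t)
        rows.append((t, int(on_high.sum()), int(both_high.sum())))
    return rows

# Made-up scores for illustration only (not contest data)
on_scores = [0.9, 0.5, 0.25, 0.05, 0.7]
off_scores = [0.05, 0.4, 0.25, 0.02, 0.15]

for t, n_on, n_both in sweep_thresholds(on_scores, off_scores, [0.1, 0.2, 0.3]):
    print(f"threshold={t:.1f}  on-high={n_on}  both-high={n_both}")
```

The both-high count is a rough proxy for terrestrial candidates at each threshold; the plots are still needed to confirm that the same needle really appears in both channel types.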

We know that the contest organizers created positive examples by adding needles to on-channels. They might also have created some negative examples by adding the same needle to both on- and off-channels. I see virtually no signs of the latter in this admittedly crude test.

I suspect that the apparently very small number of terrestrial needles (real or synthetic) leads most neural nets to simply learn to detect when needles are present in the on-channels. This may explain why some contestants can do so well while ignoring the off-channels.