<h3> Overview </h3>

This is a simple EDA notebook for the [SETI Breakthrough Listen - E.T. Signal Search](https://www.kaggle.com/c/seti-breakthrough-listen/overview) challenge. 

<h4> Notebook Structure </h4>

- [data structure](#data)
    - train/test
    - train_labels
    - sample_submission

<h5> Props </h5>

Props to [ihelon](https://www.kaggle.com/ihelon/signal-search-exploratory-data-analysis): I read through his notebook before starting mine. 

Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from colorama import Fore, Back, Style
g_ = Fore.GREEN
r_ = Fore.RED
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
import tqdm
import os
import glob

root_path = '/kaggle/input/seti-breakthrough-listen/'
train_path = os.path.join(root_path, "train")
test_path = os.path.join(root_path, "test")


# Functions taken and edited from https://www.kaggle.com/ihelon/signal-search-exploratory-data-analysis

def get_train_filename_by_id(_id: str) -> str:
    # Build the path from train_path so it does not depend on the working directory.
    return os.path.join(train_path, _id[0], f"{_id}.npy")

def show_cadence(filename: str, label: int) -> None:
    fig, axes = plt.subplots(6, 1, figsize = (16, 10))
    ax = axes.ravel()
    arr = np.load(filename)
    for i in range(6):
        
        ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
        ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        if i != 5:
            ax[i].set_xticks([])
            
    fig.text(0.5, -0.02, 'Frequency Range', ha='center', fontsize=18)
    fig.text(-0.02, 0.5, 'Seconds', va='center', rotation='vertical', fontsize=18)

    plt.suptitle(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
    fig.tight_layout()
    plt.show()


<a id = 'data'></a>
<h3> Data </h3>

In this competition you are tasked with looking for technosignature signals in cadence snippets taken from the Green Bank Telescope (GBT). Please read the extended description on the [Data Information tab](https://www.kaggle.com/c/seti-breakthrough-listen/overview/data-information) for detailed information about the data (that's too lengthy to include here).

Files
- **train/** - a training set of cadence snippet files stored in `numpy` `float16` format (v1.20.1), one file per cadence snippet id, with corresponding labels found in the `train_labels.csv` file. Each file has dimension (6, 273, 256), with the 1st dimension representing the 6 positions of the cadence, and the 2nd and 3rd dimensions representing the 2D spectrogram.
- **test/** - the test set cadence snippet files; you must predict whether or not the cadence contains a "needle", which is the target for this competition
- **sample_submission.csv** - a sample submission file in the correct format
- **train_labels.csv** - targets corresponding (by `id`) to the cadence snippet files found in the `train/` folder
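
As a quick illustration of the layout described above, the sketch below writes a synthetic cadence into a temporary folder that mimics the `train/<first id character>/<id>.npy` structure assumed by this notebook's helper, then loads it back and inspects it. The id `0a1b2c` and the random values are made up for the example; the real files live under the competition's input folder.

```python
import os
import tempfile

import numpy as np

# Hypothetical id and random data, used only to mimic the competition layout.
example_id = "0a1b2c"
rng = np.random.default_rng(0)
fake_cadence = rng.normal(size=(6, 273, 256)).astype(np.float16)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Files are grouped in sub-folders named after the first character of the id.
    folder = os.path.join(tmp_dir, "train", example_id[0])
    os.makedirs(folder)
    path = os.path.join(folder, f"{example_id}.npy")
    np.save(path, fake_cadence)

    # np.load reads the whole array into memory, so it survives the folder cleanup.
    cadence = np.load(path)

print(cadence.shape, cadence.dtype)

# Positions 0, 2, 4 are labelled "ON" and 1, 3, 5 "OFF" by show_cadence.
on_scans, off_scans = cadence[0::2], cadence[1::2]
print(on_scans.shape, off_scans.shape)
```

Slicing with `0::2` and `1::2` is just one convenient way to separate the two groups of positions; indexing them one by one works equally well.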

<h5> train_labels.csv </h5>


In [None]:
train_labels = pd.read_csv(os.path.join(root_path, 'train_labels.csv'))
display(train_labels.head(3))
print("\t\t\t\t{}{}Number of train labels: {}".format(r_, Back.BLACK, len(train_labels)))

In [None]:
# Plot the class balance; the style is applied only inside this block.
with plt.style.context('fivethirtyeight'):
    fig, ax = plt.subplots(1, 1, figsize = (12, 4))
    sns.countplot(data = train_labels, x = 'target', palette = 'pastel', ax = ax)
    fig.suptitle("Train target distribution")
    plt.show()

<h5> Train </h5>

- number of files
- example of some signals

In [None]:
train_files = glob.glob(train_path + "/*/*.npy")
print("\t\t\t\t{}{}Number of train files: {}".format(r_, Back.BLACK, len(train_files)))

In [None]:
positive_target = train_labels.query("target == 1").sample().id.item()
negative_target = train_labels.query("target == 0").sample().id.item()
show_cadence(get_train_filename_by_id(positive_target), 1)
show_cadence(get_train_filename_by_id(negative_target), 0)

<h5> Test </h5>

- number of files
- example of some signals

In [None]:
test_files = glob.glob(test_path + "/*/*.npy")
print("\t\t\t\t{}{}Number of test files: {}".format(r_, Back.BLACK, len(test_files)))

In [None]:
show_cadence(np.random.choice(test_files, 1).item(), None)
show_cadence(np.random.choice(test_files, 1).item(), None)

<h5> Sample submission </h5>

In [None]:
sample_sub = pd.read_csv(os.path.join(root_path, 'sample_submission.csv'))
display(sample_sub.sample(3))
print("\t\t\t\t{}{}Number of submission predictions: {}".format(r_, Back.BLACK, len(sample_sub)))

<h4> Data Exploration on a sample of train files </h4>

Based on what is written in the [Data Information tab](https://www.kaggle.com/c/seti-breakthrough-listen/overview/data-information) 
<img src = "https://i.imgur.com/AKcxEMZ.png" width=800></img>

it could be interesting to check whether the distribution of pointwise differences between the ON scans (positions 0, 2 and 4) differs between positively and negatively labelled cadences. 

Given an image/array `arr` of shape $(6, 273, 256)$:

- Take the difference and ravel: `(arr[0] - arr[2]).ravel()`
- Take the difference and ravel: `(arr[0] - arr[4]).ravel()`
- Take the difference and ravel: `(arr[2] - arr[4]).ravel()`
- Concatenate the three resulting vectors
- Compare the resulting distributions between negative and positive targets
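
The steps above can be sketched on a synthetic array to see what comes out and how large it is. The random input is an assumption standing in for a real cadence file, and the cast to `float32` is an optional precaution of this sketch (not part of the recipe above) to avoid doing arithmetic in half precision.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for one cadence file (same shape and dtype as the real data).
arr = rng.normal(size=(6, 273, 256)).astype(np.float16)

# Optional: work in float32 rather than float16.
on = arr[[0, 2, 4]].astype(np.float32)

# The three pairwise differences between positions 0, 2 and 4, each flattened.
diffs = np.concatenate([
    (on[0] - on[1]).ravel(),  # arr[0] - arr[2]
    (on[0] - on[2]).ravel(),  # arr[0] - arr[4]
    (on[1] - on[2]).ravel(),  # arr[2] - arr[4]
])

values_per_file = diffs.size  # 3 * 273 * 256 = 209664
n_files = 1000                # hypothetical number of files per class
bytes_float16 = values_per_file * n_files * 2  # float16 uses 2 bytes per value
print(values_per_file, f"{bytes_float16 / 1e6:.0f} MB for {n_files} files in float16")
```

Each file contributes 209664 values, so concatenating many files grows quickly; keeping an eye on the memory footprint helps when choosing the sample size.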

In [None]:
def pointwise_difference(signal):
    
    if not isinstance(signal, np.ndarray):
        raise TypeError("signal should be a np.ndarray")
    
    if signal.shape != (6, 273, 256):
        raise ValueError("signal has wrong shape")
        
    return np.concatenate(((signal[0]-signal[2]).ravel(), (signal[0]-signal[4]).ravel(), (signal[2]-signal[4]).ravel()))

In [None]:
%%time
SAMPLE_SIZE = 1000

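# Sample without replacement so that no file is drawn twice
# (this requires at least SAMPLE_SIZE ids in each class).
sample_positive = np.random.choice(train_labels.query("target == 1").id.tolist(), SAMPLE_SIZE, replace=False)
sample_negative = np.random.choice(train_labels.query("target == 0").id.tolist(), SAMPLE_SIZE, replace=False)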

sample_positive_files = [get_train_filename_by_id(x) for x in sample_positive]
sample_negative_files = [get_train_filename_by_id(x) for x in sample_negative]

In [None]:
%%time
positives_dist = [pointwise_difference(np.load(f)) for f in sample_positive_files]
print("finished calculating positive label images")
negatives_dist = [pointwise_difference(np.load(f)) for f in sample_negative_files]
print("finished calculating negative label images")

In [None]:
positives_dist = np.concatenate(positives_dist)
negatives_dist = np.concatenate(negatives_dist)

quantiles = np.linspace(0.05, 0.95, 19)

positive_dist_df = (pd.DataFrame({"difference": np.quantile(positives_dist, quantiles),
                                  "quantile": np.round(quantiles, 3)}))

negative_dist_df = (pd.DataFrame({"difference": np.quantile(negatives_dist, quantiles),
                                  "quantile": np.round(quantiles, 3)}))

In [None]:
plt.style.use('fivethirtyeight')  # set the style before creating the figure so it takes effect
fig, ax = plt.subplots(1, 1, figsize = (12, 6))

# Plot a random subsample: the first 200000 values would all come from a single file,
# since each file contributes 3 * 273 * 256 = 209664 values.
# `fill` replaces the deprecated `shade` argument (seaborn 0.11+).
sns.kdeplot(np.random.choice(positives_dist, 200000).astype(float), fill=True, alpha=0.5, ax = ax)
sns.kdeplot(np.random.choice(negatives_dist, 200000).astype(float), fill=True, alpha=0.2, ax = ax)

plt.legend(labels = ['positive', 'negative'], title='targets', bbox_to_anchor=(1.05, 1), loc='upper left')
ax.set_xlim(-3, 3)
ax.set_title("Positives vs negatives comparison: pixelwise difference")

It seems that the pixelwise differences of positively labelled images/signals have a wider distribution than those of negatively labelled ones.
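
One way to put a number on this observation is to compare simple spread statistics of the two samples, for example the standard deviation and the range between the 5% and 95% quantiles. The sketch below uses synthetic normal samples as stand-ins for the two difference arrays (an assumption for illustration only, not the competition data), and `spread_summary` is a hypothetical helper, not part of any library.

```python
import numpy as np
import pandas as pd


def spread_summary(values: np.ndarray) -> dict:
    """Return simple spread statistics for a 1-D sample."""
    values = np.asarray(values, dtype=np.float64)
    q05, q95 = np.quantile(values, [0.05, 0.95])
    return {"std": values.std(), "q05": q05, "q95": q95, "q95 - q05": q95 - q05}


rng = np.random.default_rng(0)
# Synthetic stand-ins: the "wide" sample has twice the scale of the "narrow" one.
wide = rng.normal(scale=2.0, size=100_000)
narrow = rng.normal(scale=1.0, size=100_000)

# One row per sample, one column per statistic.
summary = pd.DataFrame({"wide": spread_summary(wide), "narrow": spread_summary(narrow)}).T
print(summary.round(3))
```

With the arrays from this notebook, `positives_dist` and `negatives_dist` would take the place of `wide` and `narrow`; a larger standard deviation and quantile range for the positives would support the visual impression from the KDE plot.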