## Introduction
**(_This is a work in progress. More material on modelling will be added in the coming days._)**

__The Wow! Signal__ is a narrowband radio signal observed by Ohio State University's Big Ear radio telescope in 1977. The signal appeared to come from the direction of the constellation Sagittarius and lasted just 72 seconds. Jerry R. Ehman, the astronomer who spotted it on a computer printout days later, was so impressed that he quickly scrawled “Wow!” in red pen across the page. The data looked much like what SETI astronomers expected to see from an alien intelligence. However, despite many attempts to follow up on the find, the so-called “Wow! Signal” has never reappeared.

<a href="https://imgur.com/lV1dQtn"><img src="https://i.imgur.com/lV1dQtn.png" title="source: imgur.com" /></a>

The Breakthrough Listen instrument at the Green Bank Telescope (GBT) is a digital spectrometer, which takes incoming raw data from the telescope (amounting to hundreds of TB per day) and performs a Fourier Transform to generate a spectrogram.

Breakthrough Listen generates spectrograms which typically span several GHz of the radio spectrum (rather than the approx. 2 MHz shown above). The data are stored either as filterbank format or HDF5 format files, but essentially are arrays of intensity as a function of frequency and time, accompanied by headers containing metadata such as the direction the telescope was pointed in, the frequency scale, and so on. We generate over 1 PB of spectrograms per year; individual filterbank files can be tens of GB in size. For the purposes of the Kaggle challenge, we have discarded the majority of the metadata and are simply presenting numpy arrays consisting of small regions of the spectrograms that we refer to as “snippets”.
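To make the "array of intensity as a function of frequency and time" idea concrete, here is a minimal sketch of how a spectrogram can be built from a raw sample stream with a Fourier transform. It is not the Breakthrough Listen pipeline: the function name `make_spectrogram`, the synthetic tone, the noise level, and the channel count are all assumptions chosen for illustration, and a real spectrometer would typically add windowing or filtering and integrate many spectra together.

```python
import numpy as np

def make_spectrogram(samples, n_channels):
    """Cut a 1-D sample stream into frames, FFT each frame, return power.

    The result has shape (time, frequency), like the snippets in this notebook.
    """
    n_frames = len(samples) // n_channels
    frames = samples[: n_frames * n_channels].reshape(n_frames, n_channels)
    spectra = np.fft.fft(frames, axis=1)   # one spectrum per frame
    return np.abs(spectra) ** 2            # power in each frequency channel

# Synthetic input (made-up values): complex noise plus one pure tone.
rng = np.random.default_rng(0)
n_channels = 256
n_samples = 64 * n_channels
t = np.arange(n_samples)
tone_bin = 40                              # hypothetical channel of the tone
tone = np.exp(2j * np.pi * tone_bin * t / n_channels)
noise = rng.normal(size=n_samples) + 1j * rng.normal(size=n_samples)

spec = make_spectrogram(tone + 0.5 * noise, n_channels)
print(spec.shape)                          # (time, frequency)
print(spec.mean(axis=0).argmax())          # channel holding the tone
```

Each row of `spec` is one time step and each column one frequency channel, so a steady narrowband tone shows up as a bright vertical line.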

Breakthrough Listen is searching for candidate signatures of extraterrestrial technology - so-called technosignatures. The main obstacle to doing so is that our own human technology (not just radio stations, but wifi routers, cellphones, and even electronics that are not deliberately designed to transmit radio signals) also gives off radio signals. We refer to these human-generated signals as “radio frequency interference”, or RFI.

One method we use to isolate candidate technosignatures from RFI is to look for signals that appear to be coming from particular positions on the sky. Typically we do this by alternating observations of our primary target star with observations of three nearby stars: 5 minutes on star “A”, then 5 minutes on star “B”, then back to star “A” for 5 minutes, then “C”, then back to “A”, then finishing with 5 minutes on star “D”. One set of six observations (ABACAD) is referred to as a “cadence”. Since we’re just giving you a small range of frequencies for each cadence, we refer to the datasets you’ll be analyzing as “cadence snippets”.

<a href="https://imgur.com/HQqsec1"><img src="https://i.imgur.com/HQqsec1.png" title="source: imgur.com" /></a>

As the plot title suggests, this is the Voyager 1 spacecraft. Even though it’s 20 billion kilometers from Earth, it’s picked up clearly by the GBT. The first, third, and fifth panels are the “A” target (the spacecraft, in this case). The yellow diagonal line is the radio signal coming from Voyager. It’s detected when we point at the spacecraft, and it disappears when we point away. It’s a diagonal line in this plot because the relative motion of the Earth and the spacecraft imparts a Doppler drift, causing the frequency to change over time.
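The ON/OFF behaviour described above can be mimicked with a toy example. The sketch below builds a synthetic cadence of the same shape as the competition snippets, (6, 273, 256), from Gaussian noise and adds a drifting narrowband signal to the three "A" panels only. The helper `synthetic_cadence`, the drift rate, and the amplitude are hypothetical values picked so the effect is easy to see; they are not taken from real Breakthrough Listen data.

```python
import numpy as np

N_OBS, N_TIME, N_FREQ = 6, 273, 256  # shape of one cadence snippet

def synthetic_cadence(start_bin=100.0, drift_per_row=0.05, amplitude=20.0, seed=0):
    """Gaussian noise plus a drifting tone that is present only in the ON panels."""
    rng = np.random.default_rng(seed)
    cadence = rng.normal(size=(N_OBS, N_TIME, N_FREQ))
    rows = np.arange(N_TIME)
    for obs in (0, 2, 4):  # the three "A" (ON-target) observations
        # Time keeps running during the OFF panels, so the drift continues.
        elapsed = obs * N_TIME + rows
        cols = np.round(start_bin + drift_per_row * elapsed).astype(int)
        cadence[obs, rows, cols] += amplitude
    return cadence

cadence = synthetic_cadence()

# Average spectrum of the ON panels minus that of the OFF panels.
on_spectrum = cadence[0::2].mean(axis=(0, 1))
off_spectrum = cadence[1::2].mean(axis=(0, 1))
excess = on_spectrum - off_spectrum
print("channel with the largest ON-OFF excess:", excess.argmax())
```

A signal that follows the target on the sky leaves an excess in the ON panels, whereas interference that is present regardless of pointing tends to cancel in the ON minus OFF difference.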

In [None]:
# library imports
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Overview
- __train/__ - a training set of cadence snippet files stored in numpy float16 format (v1.20.1), one file per cadence snippet id, with corresponding labels found in the `train_labels.csv` file. Each file has shape (6, 273, 256): the 1st dimension indexes the 6 positions of the cadence, and the 2nd and 3rd dimensions form the 2D spectrogram.
- __test/__ - the test set cadence snippet files; you must predict whether or not the cadence contains a "needle", which is the target for this competition.
- __train_labels.csv__ - targets corresponding (by id) to the cadence snippet files found in the train/ folder.
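As a quick illustration of the file layout described above, the sketch below writes a stand-in array with the same shape and dtype to a temporary `.npy` file, loads it back, and separates the ON and OFF positions. The array contents and the file name `fake_snippet.npy` are made up; only the shape (6, 273, 256) and the float16 dtype come from the description.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for one cadence snippet: random values, real shape and dtype.
fake = np.random.default_rng(0).normal(size=(6, 273, 256)).astype(np.float16)

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "fake_snippet.npy"   # hypothetical file name
    np.save(file_path, fake)
    snippet = np.load(file_path)

print(snippet.shape, snippet.dtype)

# Positions 0, 2, 4 are the primary target ("A"); 1, 3, 5 are the other stars.
on_panels = snippet[0::2]
off_panels = snippet[1::2]
print(on_panels.shape, off_panels.shape)

# float16 has limited precision, so cast up before doing statistics.
mean_intensity = snippet.astype(np.float32).mean()
```

Casting to float32 before aggregating is a design choice rather than a requirement; it avoids accumulating rounding error in half precision.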

In [None]:
# read the CSV files
path = '../input/seti-breakthrough-listen/'
train_df = pd.read_csv(os.path.join(path, 'train_labels.csv'))
train_df.head()

Before moving further, let's check out the distribution of the target.

In [None]:
sns.countplot(x=train_df['target'])
plt.show()

In [None]:
train_df['target'].value_counts(normalize=True)

As suspected, this is a highly imbalanced classification problem: around 91% of the samples are `class:0` and the rest are `class:1`. Let's check out some of the training data.
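One common way to account for this kind of imbalance during training is to weight each class inversely to its frequency. The sketch below computes such "balanced" weights on a made-up label series with a 91/9 split; the series `labels` is a stand-in for `train_df['target']`, and whether weighting actually helps for this competition is not something this notebook has tested.

```python
import pandas as pd

# Made-up labels mimicking the imbalance: 910 negatives, 90 positives.
labels = pd.Series([0] * 910 + [1] * 90, name="target")

counts = labels.value_counts()
# Balanced weighting: n_samples / (n_classes * samples_in_class).
class_weights = (len(labels) / (len(counts) * counts)).to_dict()
print(class_weights)

# Per-sample weights, e.g. for a weighted loss or a weighted sampler.
sample_weights = labels.map(class_weights)
print(sample_weights.groupby(labels).sum())
```

With these weights both classes contribute the same total weight, so the rare positive class is not drowned out by the negatives.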

In [None]:
# get the file ids for both the classes
class_1 = train_df[train_df['target'] == 1][:2]
class_0 = train_df[train_df['target'] == 0][:2]
class_1 = list(zip(class_1.id, class_1.target))
class_0 = list(zip(class_0.id, class_0.target))
sample_data = [*class_1, *class_0]
sample_data

In [None]:
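# load one sample; files live in sub-folders named after the first character of the id
sample_id = sample_data[0][0]  # avoid shadowing the built-in id()
array = np.load(os.path.join(path, 'train', sample_id[0], f'{sample_id}.npy'))
array.shape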

In [None]:
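def plot_data(ids: str, target: int) -> None:
    """Plot the six spectrograms of one cadence snippet, top to bottom."""
    # files are grouped into sub-folders named after the first character of the id
    array = np.load(os.path.join(path, f'train/{ids[0]}/{ids}.npy'))
    fig = plt.figure(figsize=(9, 8))
    for i in range(6):
        ax = fig.add_subplot(6, 1, i + 1)
        # cast from float16 so matplotlib gets a standard float array
        ax.imshow(array[i].astype('float'), interpolation='nearest', aspect='auto')
        # even positions (0, 2, 4) are the ON-target "A" observations
        state = 'ON' if i % 2 == 0 else 'OFF'
        if i == 0:
            ax.set_title(f'Id: {ids}, target: {target}, state: {state} target', size=16)
        else:
            ax.set_title(f'{state} target', size=16)
    # adjust the layout once, after all panels are drawn
    fig.tight_layout()
    plt.show()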

In [None]:
# plot the first class 1 sample
sample_id, target = sample_data[0]
plot_data(sample_id, target)

In [None]:
# plot the second class 1 sample
sample_id, target = sample_data[1]
plot_data(sample_id, target)

In [None]:
# plot the first class 0 sample
sample_id, target = sample_data[2]
plot_data(sample_id, target)

In [None]:
# plot the second class 0 sample
sample_id, target = sample_data[3]
plot_data(sample_id, target)