In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import albumentations as A
import cv2
from PIL import Image 
import matplotlib.patches as patches

In [None]:
BASE_DIR = '../input/seti-breakthrough-listen/'
train_path = os.path.join(BASE_DIR, 'train')
test_path = os.path.join(BASE_DIR, 'test')
train = pd.read_csv(f'{BASE_DIR}train_labels.csv')
sam = pd.read_csv(f'{BASE_DIR}sample_submission.csv')
train['file_name'] = train['id'] + '.npy'  # each snippet is stored as <id>.npy
train['dir_name'] = train['id'].str[0]  # sub-folder is the first character of the id

# SETI Breakthrough Listen - E.T. Signal Search

In this competition, we are tasked with finding anomalous signals in cadence snippets taken from the Green Bank Telescope (GBT), to help in the search for candidate signatures of extraterrestrial technology, so-called technosignatures.<br>
<h3>A cooler way to put it is that we are looking for aliens. </h3>

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTEZ3m0SFpeTux8e4sV_TLFEqufMkeeBqtl7g&usqp=CAU)
<br>
<br>
We are provided with 50165 cadence snippets in the train set, and we are supposed to build an algorithm that detects these technosignatures. <b>A cadence snippet looks like this:</b>

In [None]:
df = train[train['target'] == 1].iloc[0]  # first positive example in the training labels
target = df['target']
cadence_snippet_name = os.path.join(train_path,str(df['dir_name']),str(df['file_name']))

arr = np.load(cadence_snippet_name)
fig, ax = plt.subplots(6, 1, figsize=(16, 10))
plt.suptitle(f"ID: {os.path.basename(cadence_snippet_name)} TARGET: {target}", fontsize=18)
for i in range(6):
    ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
    ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
    ax[i].set_xticks([])  # hide x ticks on this panel (plt.xticks only affects the current axes)
    if(i==0):
        rect = patches.Rectangle((87, 5), 10, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[0].add_patch(rect)
    if(i==2):
        rect = patches.Rectangle((105, 5), 15, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[2].add_patch(rect)
    if(i==4):
        rect = patches.Rectangle((136, 5), 30, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[4].add_patch(rect)

fig.show()
# df

<b>Let's break this cadence snippet down:</b>

<b>Firstly,</b><br>
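Each individual plot is one observation (a scan of a single target), and the set of six observations is called a cadence. Because only a small range of frequencies is provided for each cadence, the combination of the 6 plots is called a cadence snippet.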

<b>Secondly,</b><br>
What is the purpose of the ON and OFF labels in the snippet?<br>
* Human technology gives off radio signals. These human-generated signals are referred to as “radio frequency interference”, or RFI, and they create problems for Breakthrough Listen.
* To deal with this, the Listen team intersperses scans of the target star (ON) with scans of other regions of the sky (OFF). Any signal that appears in both sets of scans probably isn’t coming from the direction of the target star.<br>
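
The ON/OFF idea can be illustrated with a small sketch. It assumes a cadence snippet is an array of shape `(6, time, frequency)` with ON scans at indices 0, 2, 4 and OFF scans at indices 1, 3, 5. The function name `on_minus_off` and the synthetic data are hypothetical and only serve the illustration; this is not the organizers' method.

```python
import numpy as np

def on_minus_off(cadence):
    """Return the mean ON spectrogram minus the mean OFF spectrogram.

    Assumes `cadence` has shape (6, time, frequency), with ON scans at
    indices 0, 2, 4 and OFF scans at indices 1, 3, 5.
    """
    cadence = np.asarray(cadence, dtype=np.float32)
    on_mean = cadence[0::2].mean(axis=0)   # average of the three ON scans
    off_mean = cadence[1::2].mean(axis=0)  # average of the three OFF scans
    return on_mean - off_mean

# Synthetic demo (not real data): noise, one RFI line present in all six
# scans, and one signal present only in the ON scans.
rng = np.random.default_rng(0)
demo = rng.normal(0.0, 0.1, size=(6, 273, 256)).astype(np.float32)
demo[:, :, 50] += 5.0      # RFI: same frequency bin in every scan
demo[0::2, :, 120] += 5.0  # candidate: only in the ON scans

diff = on_minus_off(demo)
print(diff[:, 50].mean(), diff[:, 120].mean())
```

In the difference image the RFI line largely cancels while the ON-only signal survives. Real signals drift in frequency between scans, so a plain subtraction like this is only a rough first look.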

<b>Thirdly,</b><br> The line highlighted in red is the candidate signature of extraterrestrial technology. It is slanted because the relative motion of the Earth and the transmitting source (for example, a spacecraft) imparts a Doppler drift, causing the received frequency to change over time. The signal will not necessarily always be a straight diagonal.
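
To make the effect of a Doppler drift concrete, here is a minimal sketch that draws a linearly drifting narrowband signal into an empty spectrogram. The helper `inject_drifting_signal`, its parameters, and the `(273, 256)` panel shape are assumptions made for illustration; the organizers' actual injection procedure is not described in this notebook.

```python
import numpy as np

def inject_drifting_signal(panel, start_bin, drift_bins_per_row, amplitude=1.0):
    """Add a linearly drifting narrowband signal to a (time, frequency) array.

    Each time row gets extra power in one frequency bin; the bin index
    moves by `drift_bins_per_row` from one row to the next.
    """
    out = np.array(panel, dtype=np.float32, copy=True)
    n_time, n_freq = out.shape
    rows = np.arange(n_time)
    cols = np.rint(start_bin + drift_bins_per_row * rows).astype(int)
    inside = (cols >= 0) & (cols < n_freq)  # drop points that leave the band
    out[rows[inside], cols[inside]] += amplitude
    return out

# Hypothetical panel: 273 time rows, 256 frequency bins, no noise.
empty = np.zeros((273, 256), dtype=np.float32)
drifting = inject_drifting_signal(empty, start_bin=100,
                                  drift_bins_per_row=0.1, amplitude=5.0)
```

Plotting `drifting` with `imshow` gives a slanted line; a drift rate of zero would give a vertical line instead.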

<br>
The organizers have taken tens of thousands of cadence snippets, which they call the haystack, and hidden needles among them. Some of these needles should be easy to detect, even with classical detection algorithms. Others are hidden in noisy regions of the spectrum and will be harder, even though they might be relatively obvious on visual inspection.<br>
<b>Examples of this:</b>

In [None]:
df = train[train['id'] == 'e1f7c0159caa'].iloc[0]
target = df['target']
cadence_snippet_name = os.path.join(train_path,str(df['dir_name']),str(df['file_name']))

arr = np.load(cadence_snippet_name)
fig, ax = plt.subplots(6, 1, figsize=(16, 10))
plt.suptitle("Easier to detect", fontsize=18)
for i in range(6):
    ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
    ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
    ax[i].set_xticks([])  # hide x ticks on this panel (plt.xticks only affects the current axes)
    if(i==0):
        rect = patches.Rectangle((105, 225), 7, 44, linewidth=1, edgecolor='r', facecolor='none')
        ax[0].add_patch(rect)
    if(i==2):
        rect = patches.Rectangle((95, 5), 10, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[2].add_patch(rect)
    if(i==4):
        rect = patches.Rectangle((83, 5), 9, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[4].add_patch(rect)

fig.show()

In [None]:
df = train[train['id'] == '7072ba75fac0'].iloc[0]
target = df['target']
cadence_snippet_name = os.path.join(train_path,str(df['dir_name']),str(df['file_name']))

arr = np.load(cadence_snippet_name)
fig, ax = plt.subplots(6, 1, figsize=(16, 10))
plt.suptitle("Harder to detect", fontsize=18)
for i in range(6):
    ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
    ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
    ax[i].set_xticks([])  # hide x ticks on this panel (plt.xticks only affects the current axes)
    if(i==2):
        rect = patches.Rectangle((125, 5), 20, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[2].add_patch(rect)
    if(i==4):
        rect = patches.Rectangle((162, 5), 30, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[4].add_patch(rect)

fig.show()

Look at this [thread](https://www.kaggle.com/c/seti-breakthrough-listen/discussion/237980)<br>

<b>Some other nuances:</b>
* Not all of the “needle” signals look like diagonal lines
* Not every “needle” signal is present for the entirety of all three “A” observations

What they do have in common is that they are only present in some or all of the “A” observations (panels 1, 3, and 5 in the cadence snippets).

In [None]:
plt.title('Distribution of the target variable')
sns.countplot(x=train['target'])

As we can see, the training dataset is heavily imbalanced: negative examples (target 0) far outnumber positive ones (target 1).
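
One common way to account for such an imbalance is to weight each class inversely to its frequency when training a classifier. The sketch below uses toy labels and a hypothetical helper `inverse_frequency_weights`; on the real data one would pass `train['target'].values` instead.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels for illustration only: 90 negatives, 10 positives.
toy_labels = np.array([0] * 90 + [1] * 10)
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

The rarer class receives the larger weight, so mistakes on positive examples count more in a weighted loss.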
<br><br>
<b>About the generation of the candidate signatures of extraterrestrial technology:-</b><br>
These are obviously not actual technosignatures; if they were, it would mean that scientists have already discovered aliens. <br>
The organizers have taken tens of thousands of cadence snippets, which they call the haystack, and hidden needles among them. 
* Some of these needles are similar to the signals created by human-made interplanetary spacecraft and should be easy to detect, even with classical detection algorithms. 
* Others are hidden in noisy regions of the spectrum and will be harder, even though they might be relatively obvious on visual inspection.
<br><br>

<h3>Promising ways to approach the problem include computer vision, digital signal processing, anomaly detection, and more.</h3>

In [None]:
df = train[train['target'] == 1].iloc[2]  # third positive example in the training labels
target = df['target']
cadence_snippet_name = os.path.join(train_path,str(df['dir_name']),str(df['file_name']))

arr = np.load(cadence_snippet_name)
fig, ax = plt.subplots(6, 1, figsize=(16, 10))
plt.suptitle("Alien signals present", fontsize=18)
for i in range(6):
    ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
    ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
    ax[i].set_xticks([])  # hide x ticks on this panel (plt.xticks only affects the current axes)
    if(i==0):
        rect = patches.Rectangle((229, 5), 10, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[0].add_patch(rect)
    if(i==2):
        rect = patches.Rectangle((239, 5), 10, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[2].add_patch(rect)
    if(i==4):
        rect = patches.Rectangle((253, 5), 10, 264, linewidth=1, edgecolor='r', facecolor='none')
        ax[4].add_patch(rect)

fig.show()

In [None]:
df = train[train['target'] == 0].iloc[8]  # ninth negative example in the training labels
target = df['target']
cadence_snippet_name = os.path.join(train_path,str(df['dir_name']),str(df['file_name']))

arr = np.load(cadence_snippet_name)
fig, ax = plt.subplots(6, 1, figsize=(16, 10))
plt.suptitle("Alien signals not present", fontsize=18)
for i in range(6):
    ax[i].imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
    ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
    ax[i].set_xticks([])  # hide x ticks on this panel (plt.xticks only affects the current axes)
fig.show()

<b>Here it might look like technosignatures are present, but there aren't any. Why?</b><br>
Even though we can see some diagonals, those are actually radio frequency interference (RFI).<br>
A good way to understand this is that the images look the same even when the telescope points at a different location, which means the signals are not coming from the sky but rather from earthly devices such as radios.
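
This "same in every panel" observation can be turned into a rough numeric check: compare the time-averaged spectrum of the ON scans with that of the OFF scans. The helper `on_off_profile_correlation` and the synthetic arrays are hypothetical, and the `(6, time, frequency)` layout with ON scans at even indices is an assumption.

```python
import numpy as np

def on_off_profile_correlation(cadence):
    """Pearson correlation between the mean ON and mean OFF frequency profiles."""
    cadence = np.asarray(cadence, dtype=np.float64)
    on_profile = cadence[0::2].mean(axis=(0, 1))   # average over ON scans and time
    off_profile = cadence[1::2].mean(axis=(0, 1))  # average over OFF scans and time
    return float(np.corrcoef(on_profile, off_profile)[0, 1])

rng = np.random.default_rng(1)

# Synthetic RFI: the same bright frequency bin in all six scans.
rfi_only = rng.normal(0.0, 0.1, size=(6, 273, 256))
rfi_only[:, :, 80] += 5.0
corr_rfi = on_off_profile_correlation(rfi_only)

# Synthetic candidate: the bright bin appears only in the ON scans.
on_only = rng.normal(0.0, 0.1, size=(6, 273, 256))
on_only[0::2, :, 80] += 5.0
corr_on_only = on_off_profile_correlation(on_only)

print(corr_rfi, corr_on_only)
```

A correlation close to 1 suggests the bright features are shared by ON and OFF scans, which points to RFI; a low correlation alone does not prove a technosignature is present.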

It is highly recommended to check the competition's [Description](https://www.kaggle.com/c/seti-breakthrough-listen/overview/description) and [Data Understanding](https://www.kaggle.com/c/seti-breakthrough-listen/overview/data-information) sections to understand the cadence images better.

<h2>Thank you for reading. If you found this notebook helpful, do upvote.</h2>