# 1. Background
#### **1.1 Technosignatures**

A technosignature is any substance or phenomenon that provides scientific evidence for the existence of an intelligent civilization, much as a biosignature provides evidence of life. Examples include the atmospheric composition of a planet, artificial heat, artificial light, and radio signals coming from planetary systems spread across the universe. With today's technology, a feasible way to search for intelligent life is to detect radio signals coming from outer space. Such signals have certain characteristics that separate them from radio waves originating on Earth (mobile phones, FM stations, Wi-Fi, etc.).

![advanced_alien_life](https://i.pinimg.com/originals/89/e6/91/89e6912b1225c43ed18b7c2b31069f77.jpg)<br>

#### **1.2 Breakthrough Listen instrument & Green Bank Telescope**

The Breakthrough Listen instrument is a digital spectrometer that records the radio signals coming from the target at which the Green Bank Telescope is pointed. The signal is first recorded in the amplitude-time domain; after applying a Fourier transform, it is represented in the frequency-time domain (a spectrogram).

A signal is considered a potential marker only if it passes the two filters described in Section 4.

![green bank telescope](https://skyandtelescope.org/wp-content/uploads/Green-Bank-Telescope.jpg)<br>
#### ***fig 1 Green Bank telescope***

![amplitude-time domain signal](https://miro.medium.com/max/700/1*4nhf3gx4BTBDP_QEeyIeIg.png)<br>
#### ***fig 2 amplitude-time domain signal***


![frequency-time domain signal](https://miro.medium.com/max/700/1*Dbc9s0KfznzFYO0Q7OOOWg.png)<br>
#### ***fig 3 frequency-time domain signal (aka spectrogram)***
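
To make the step from the amplitude-time domain to the frequency-time domain concrete, here is a minimal sketch using only NumPy. It is not the Breakthrough Listen pipeline; the helper `simple_spectrogram`, the sample rate, the window size and the 100 Hz test tone are all illustrative assumptions.

```python
import numpy as np

def simple_spectrogram(signal, window_size, hop):
    """Short-time Fourier transform power, shape (n_frames, window_size // 2 + 1)."""
    window = np.hanning(window_size)                      # taper each frame to reduce leakage
    starts = range(0, len(signal) - window_size + 1, hop)
    frames = np.stack([signal[s:s + window_size] * window for s in starts])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

sample_rate = 1024                                        # samples per second (assumed)
t = np.arange(4 * sample_rate) / sample_rate              # 4 seconds of samples
tone_hz = 100.0                                           # constant test tone (assumed)
signal = np.sin(2 * np.pi * tone_hz * t)                  # amplitude-time representation

spec = simple_spectrogram(signal, window_size=256, hop=128)   # frequency-time representation
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)
peak_hz = freqs[spec.mean(axis=0).argmax()]               # frequency bin with the most power
print(spec.shape, peak_hz)
```

Each row of `spec` is one time step and each column is one frequency bin, which is the same layout as the spectrograms plotted later with `imshow`.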

# 2. Imports

In [None]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob

import torch
from tqdm.notebook import tqdm
from torch.utils.data import Dataset

# 3. Exploring training data

In [None]:
train_labels = pd.read_csv('../input/seti-breakthrough-listen/train_labels.csv')
train_labels.head()

In [None]:
sns.countplot(x = train_labels['target'].values)

In [None]:
value_counts = train_labels['target'].value_counts()
print(f"TARGET 0 PROPORTION : {value_counts[0]/(value_counts[0] + value_counts[1])}")
print(f"TARGET 1 PROPORTION : {value_counts[1]/(value_counts[0] + value_counts[1])}")

Target 0 corresponds to a radio signal generated on Earth, whereas Target 1 corresponds to a radio signal coming from outer space.<br>
Number of Target 0 samples : 45,471.<br>
Number of Target 1 samples : 4,694
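
As a quick check of how imbalanced the classes are, the arithmetic can be worked through in plain Python. The counts are the ones quoted above; the variable names are only for illustration.

```python
counts = {0: 45471, 1: 4694}            # sample counts quoted above
total = sum(counts.values())

# Share of each class in the training set
proportions = {label: n / total for label, n in counts.items()}
# How many Target 0 samples there are for every Target 1 sample
imbalance_ratio = counts[0] / counts[1]

print(total)
print({label: round(p, 4) for label, p in proportions.items()})
print(round(imbalance_ratio, 2))
```

With roughly nine Target 0 samples for every Target 1 sample, a model that always predicts 0 would already be right about 90% of the time, so plain accuracy is not a very informative score here.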

# 4. SETI's 2 Filters

SETI's Breakthrough Listen instrument generates hundreds of terabytes of data per day, and finding these special signals among the rest is like finding a needle in a haystack. SETI employs two filters that discard ordinary terrestrial signals.

#### **4.1 Filter 1 (Doppler Shift):**

Consider a car that honks continuously while driving down the street. An observer hears the pitch of the horn shift (it drops as the car moves away); this is the Doppler shift for sound waves. Similarly, radio waves, like any electromagnetic wave, are received at a shifted frequency when the source and the observer move relative to each other. Because Earth is constantly orbiting the Sun and rotating on its axis, the relative motion between a star and the telescope keeps changing, so a radio signal coming from that star should show a frequency that drifts over time. A transmitter fixed on the ground moves together with the telescope, so its signal shows no such drift.
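
To get a feel for the size of this effect, here is a rough sketch using the non-relativistic Doppler approximation. The function names and the 1.42 GHz observing frequency are illustrative assumptions that do not come from the competition data, and only the contribution of Earth's rotation at the equator is considered.

```python
import math

C = 299_792_458.0                        # speed of light in m/s

def doppler_shift_hz(rest_freq_hz, radial_velocity_ms):
    """Non-relativistic shift; a positive (receding) velocity lowers the frequency."""
    return -rest_freq_hz * radial_velocity_ms / C

def drift_rate_hz_per_s(rest_freq_hz, radial_accel_ms2):
    """Rate of change of the received frequency for a given line-of-sight acceleration."""
    return -rest_freq_hz * radial_accel_ms2 / C

# Largest line-of-sight acceleration from Earth's rotation (observer at the equator)
sidereal_day_s = 86_164.0
earth_radius_m = 6.378e6
omega = 2 * math.pi / sidereal_day_s
rotation_accel = omega ** 2 * earth_radius_m          # about 0.034 m/s^2

rest_freq_hz = 1.42e9                                 # assumed observing frequency
max_drift = abs(drift_rate_hz_per_s(rest_freq_hz, rotation_accel))
print(f"max drift from Earth's rotation: {max_drift:.3f} Hz/s")
```

A drift of a fraction of a hertz per second adds up over a five-minute observation, which is why a narrowband signal from the sky shows up as a slanted line rather than a vertical one in the spectrogram.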

#### **4.2 Filter 2 (Radio Frequency Interference):**

How do we know whether a radio signal is really coming from the star at which the telescope is pointed? It could be appearing there for several other reasons: it could be coming from nearby FM stations, Wi-Fi routers, or mobile phones. This disturbance generated by an external source is called Radio Frequency Interference (RFI). SETI deals with it by checking whether a signal appears only at a particular position in the sky. Typically they do this by alternating observations of their primary target star with observations of three nearby stars: 5 minutes on star “A”, then 5 minutes on star “B”, then back to star “A” for 5 minutes, then “C”, then back to “A”, then finishing with 5 minutes on star “D”. One set of six observations (ABACAD) is referred to as a “cadence”. If the signal appears only at position "A", it is most likely coming from the target star.

![spectogram](https://storage.googleapis.com/kaggle-media/competitions/SETI-Berkeley/Screen%20Shot%202021-05-03%20at%2011.34.06.png)<br>

***In the above cadence, "ON" refers to the radio signal captured when the telescope is pointed towards the target star, and "OFF" refers to any other position in the sky. We can conclude two points from this. First, there is a shift in the frequency (the slanting line in the "ON" spectrograms), and second, the signal appears only when the telescope points towards the star. This signal passes the 2 filters and becomes a strong candidate for a technosignature.***
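
The ON/OFF reasoning above can be sketched on synthetic data. This is only a toy illustration of the second filter, not SETI's actual algorithm: the panel size, the injected line, the `panel_score` helper and the threshold of 6 are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_freq = 64, 128                     # illustrative panel size
# Six panels of pure noise, ordered like a cadence: A B A C A D
cadence = rng.normal(size=(6, n_time, n_freq)).astype(np.float32)

# Inject a drifting narrowband line into the ON panels (0, 2, 4) only
rows = np.arange(n_time)
cols = 30 + rows // 2                        # the frequency bin moves as time advances
for panel in (0, 2, 4):
    cadence[panel, rows, cols] += 10.0

def panel_score(panel):
    """How far the brightest pixel stands above the panel mean, in standard deviations."""
    return float((panel.max() - panel.mean()) / panel.std())

scores = [panel_score(p) for p in cadence]
on_scores, off_scores = scores[0::2], scores[1::2]

# Candidate only if every ON panel has a strong peak and no OFF panel does
looks_like_candidate = min(on_scores) > 6.0 and max(off_scores) < 6.0
print([round(s, 1) for s in scores], looks_like_candidate)
```

A signal that showed a strong peak in the OFF panels as well would fail this check, which matches the idea that interference from the ground does not care where the telescope is pointing.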

In [None]:
def show_cadence(_id, label, dataset = "TRAIN"):
    # Files are stored in sub-folders named after the first character of the id
    if dataset == "TRAIN":
        file_name = f"../input/seti-breakthrough-listen/train/{_id[0]}/{_id}.npy"
    elif dataset == "TEST":
        file_name = f"../input/seti-breakthrough-listen/test/{_id[0]}/{_id}.npy"
    else:
        raise ValueError(f"dataset must be 'TRAIN' or 'TEST', got {dataset!r}")
    
    # One cadence holds six spectrograms, alternating ON and OFF target
    cadence = np.load(file_name).astype(np.float32)
    fig, ax = plt.subplots(nrows = 6, ncols = 1, figsize = (16, 10))
    fig.suptitle(f'ID:{_id}   TARGET:{label}', fontsize = 18)
    for i in range(6):
        ax[i].imshow(cadence[i], interpolation = 'nearest', aspect = 'auto')
        ax[i].text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        ax[i].get_xaxis().set_visible(False)
    
    plt.show()

train_sample_0 = train_labels[train_labels['target'] == 0].sample(5)
train_sample_1 = train_labels[train_labels['target'] == 1].sample(5)

for ind, row in train_sample_0.iterrows():
    show_cadence(row['id'], 0)
    
for ind, row in train_sample_1.iterrows():
    show_cadence(row['id'], 1)

#### ***In the samples above, the frequency shift of the Target 1 samples can be observed in the "ON" panels; in some of them it is clearly visible and in others it is very faint.***

# 5. Exploring test data

In [None]:
test_paths = glob.glob('../input/seti-breakthrough-listen/test/*/*.npy')
# Keep only the file name without its extension, i.e. the sample id
test_ids = [os.path.splitext(os.path.basename(p))[0] for p in test_paths]
test_df = pd.DataFrame({'path' : test_ids})
test_sample = test_df.sample(10)

for ind, row in test_sample.iterrows():
    show_cadence(row['path'], None, dataset = "TEST")

# 6. Building PyTorch dataset class

In [None]:
class SETI_Train(Dataset):
    
    def __init__(self, root_dir, train_csv, transform = None, tensorize = False):
        self.root_dir = root_dir
        self.train_csv = train_csv
        self.transform = transform
        self.tensorize = tensorize
        self.len = len(self.train_csv.index)
    
    def __len__(self):
        return self.len
    
    def __getitem__(self, index):
        if(index >= self.len):
            raise IndexError("Going beyond the set of valid indices")
            
        _id, target = self.train_csv.loc[index, ['id', 'target']].values
        filename = os.path.join(self.root_dir, _id[0], _id + '.npy')
        cadence = np.load(filename).astype(np.float32)
        
        if(self.transform): cadence = self.transform(image = cadence)['image'] #only when albumentations library is used for transformations. May change for others.
        if(self.tensorize): 
            cadence = torch.tensor(cadence)
            target = torch.tensor(target)
            
        return {'cadence':cadence, 'target':target}
        

d = SETI_Train(root_dir = '../input/seti-breakthrough-listen/train', train_csv = train_labels, transform = None, tensorize = False)

for data in tqdm(d, total = len(d), desc = "SANITY_CHECK_PROGRESS"):
    pass

# 7. Improvement & Role of Data Science
The two filters miss some signals, particularly those with complex time or frequency structure and those in regions of the spectrum with a lot of interference.
Using data science techniques, the goal is to build a model that detects these outer-space signals with a better "area under the ROC curve" score than the two-filter algorithm.
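
For readers who have not met the metric before, here is a small sketch of what "area under the ROC curve" measures: the probability that a randomly chosen Target 1 sample receives a higher score than a randomly chosen Target 0 sample, with ties counting as half. The helper `roc_auc` and the toy labels and scores are made up for illustration; in practice a library implementation would normally be used.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]                    # toy ground truth
scores = [0.1, 0.4, 0.35, 0.8]           # toy model outputs
print(roc_auc(labels, scores))           # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise comparison needs memory proportional to the number of positives times the number of negatives, so this form is meant for small examples rather than the full training set.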

# 8. Acknowledgements

Tons of thanks to Yaroslav Isaienkov for his [EDA](https://www.kaggle.com/ihelon/signal-search-exploratory-data-analysis) on this competition; his notebook helped me display the spectrograms with the right aspect ratio, which is crucial for visualizing the frequency shifts properly. Thanks also to Leland Roberts for writing [this](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53) article, which explains spectrograms in simple terms.

Lastly, I would like to thank the Breakthrough Listen team for providing us with this data and giving us the opportunity to explore it.