# Welcome to the SETI Institute Code Challenge!

This first tutorial explains what the data are and where to get them.

# Introduction

For the Code Challenge, you will use what we call the **"primary" data set**. The primary data set

  * is a labeled data set of 350,000 simulated signals
  * has 7 different labels, or "signal classifications"
  * totals about 128 GB of data

This data set should be used to train your models. **You do not need to train on the entire set if you do not want to.** There are also `small` and `medium` sized subsets of the primary data files.


### Data File Format

Each data file has a simple format: 

 * File name: &lt;UUID&gt;.dat
 * Content:
   * JSON header in the first line that contains:
      * UUID
      * signal_classification (label)
   * followed by a stream of complex-valued time-series data.

The [`ibmseti` Python package](https://pypi.python.org/pypi/ibmseti/) is available to assist in reading this data and performing some basic operations for you. 
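
As a rough illustration of the file layout described above, the sketch below separates the JSON header line from the rest of the bytes using only the standard library. The helper name `split_header_and_payload` and the sample bytes are made up for this example, and the header keys are assumed to be `uuid` and `signal_classification`. The `ibmseti` package remains the intended way to decode the time-series payload itself.

```python
import json

def split_header_and_payload(raw_bytes):
    """Split the bytes of a .dat file into (header dict, payload bytes)."""
    # The JSON header occupies the first line; everything after the first
    # newline is treated as the complex-valued time-series data.
    header_line, _, payload = raw_bytes.partition(b'\n')
    header = json.loads(header_line.decode('utf-8'))
    return header, payload

# Synthetic stand-in for a real file (the payload bytes here are invented).
example = b'{"uuid": "b1-example", "signal_classification": "narrowband"}\n\x01\x02\x03\x04'
header, payload = split_header_and_payload(example)
print(header.get('signal_classification'), len(payload))
```

Using `header.get(...)` rather than indexing means the same helper also works on files whose header carries no label.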


### Data Index Files

For all data sets, there exists an **index** file. That file is a comma-separated value (CSV) file. Each row holds the UUID and signal_classification (label) for a simulation file in the data set. You can use these index files in a few different ways, from keeping track of your downloads to parallelizing your analysis on Spark.

Example content:

```
  UUID,SIGNAL_CLASSIFICATION
  b1...2e,narrowband
  d8...e4,squiggle
```
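
A minimal sketch of reading such an index with the standard library follows. It assumes Python 3 and a header row of exactly `UUID,SIGNAL_CLASSIFICATION`, as in the example content; `read_index` and the sample UUIDs are hypothetical.

```python
import csv
import io
from collections import Counter

def read_index(text):
    """Parse index-file text into a list of (uuid, label) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row['UUID'], row['SIGNAL_CLASSIFICATION']) for row in reader]

# Invented sample rows in the same shape as the example content.
sample_index = (
    "UUID,SIGNAL_CLASSIFICATION\n"
    "uuid-0001,narrowband\n"
    "uuid-0002,squiggle\n"
    "uuid-0003,narrowband\n"
)
rows = read_index(sample_index)
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

Counting labels this way is a quick check that a downloaded index has the expected number of files per class.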


<hr>
# Getting started ("Basic") Data Set

There is also a second, simple and clean data set that you may use for warmup, which we call the **"basic" data set**. This basic set should be used as a sanity check and for very early-stage prototyping. We recommend that everybody starts with this. 

 * Only 4 different signal classifications
 * 1,000 simulation files for each class: 4,000 files total
 * Data ZIP file (~ 1.1 GB) 
   * File 1/1: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_basic_v2/basic4.zip
 * Index file: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_basic_v2_26may_2017.csv
       
### Basic Set versus Primary Set

> The difference between the `basic` and `primary` data sets is that the signals simulated in the `basic` set have, on average, a much higher signal-to-noise ratio (they are larger-amplitude signals). They also have other characteristics that make the different signal classes very distinguishable. **You should be able to get very high signal classification accuracy with the basic data set.** The primary data set has smaller-amplitude signals whose classes can look more similar to each other, which makes accurate classification harder. There are also only 4 classes in the basic data set, versus 7 in the primary set.


<hr>
# Primary Training Data Sets


During the code challenge you have access to the `full` primary data set and to `small` and `medium` sized subsets.

### Primary Small Data Set

The `primary small` is a subset of the full primary data set.  Use for early-stage prototyping.

  * This data set contains
    * All 7 signal classifications
    * 1,000 simulations per class 
    * 7,000 data files (7 classes * 1,000 simulations)
  * Data ZIP file (2 GB): 
    * File 1/1: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_small.zip
  * Index file: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v2_small_1june_2017.csv (&lt;1 MB)

### Primary Medium Data Set

The `primary medium` is a subset of the full primary data set.  Use for early-stage prototyping & model building.

  * This data set contains
   * All 7 signal classifications
   * 10,000 simulations per class 
   * 70,000 data files (7 classes * 10,000 simulations)
  * Large enough for relatively robust model construction
  * Data ZIP files (20 GB):
     * File 1/6: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_medium_1.zip
     * File 2/6: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_medium_2.zip
     * File 3/6: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_medium_3.zip
     * File 4/6: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_medium_4.zip
     * File 5/6: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_medium_5.zip
     * File 6/6: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_medium_6.zip
  * Index file: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v2_medium_1june_2017.csv (3.5 MB)
 
### Primary Full Data Set

The `primary full` is the entire primary data set.  Use only if you want an enormous training data set. You will need a small data center to process these data in a reasonable amount of time.

  * This data set contains
    * All 7 signal classifications
    * 50,000 simulations per class 
    * 350,000 data files (7 classes * 50,000 simulations)
  * Data files (130 GB):
    * 350k individual files
    * One must read through the index file and download files individually, which will take some time from outside of IBM Cloud systems
  * Index file: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_v2_full_1june_2017.csv (17 MB)
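
Since training does not require the whole primary set, one option is to pick a fixed number of files per class from the index and download only those. The sketch below is one way to do that, assuming the index has already been parsed into `(uuid, label)` pairs; `sample_per_class`, the seed, and the sample rows are illustrative, not part of the challenge tooling.

```python
import random
from collections import defaultdict

def sample_per_class(index_rows, n_per_class, seed=0):
    """Pick up to n_per_class UUIDs for each label from (uuid, label) pairs."""
    by_label = defaultdict(list)
    for uuid, label in index_rows:
        by_label[label].append(uuid)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for label, uuids in by_label.items():
        k = min(n_per_class, len(uuids))  # a class may have fewer files than requested
        subset[label] = rng.sample(uuids, k)
    return subset

# Invented index rows for demonstration.
rows = [('u1', 'narrowband'), ('u2', 'narrowband'), ('u3', 'squiggle'), ('u4', 'narrowband')]
subset = sample_per_class(rows, n_per_class=2)
print({label: len(uuids) for label, uuids in subset.items()})
```

Sampling per class keeps the subset balanced across labels, which matches how the `small` and `medium` subsets are described.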

<hr>
# Test Data Set

There is one `primary_test` data set. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key. 

  * This data set contains
    * All 7 signal classifications
    * ~1,000 simulations per class (+- 50) 
    * 7,014 total files
  * Data files only include the UUID in the header but not the classification
  * Data ZIP file (2 GB):
    * File 1/1: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2_zipped/primary_testset.zip
  * Index file: https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_primary_testset_1k_1june_2017.csv (&lt;1 MB)

> **Submitting Classification Results**
> See the [Judging Criteria](../Judging_Criteria.ipynb) notebook for information on submitting your test-set classifications.

<hr>
<hr>
<br>

# Programmatically Accessing the Data

The data are stored in `containers` on IBM Object Storage. You can access these data with HTTP calls. Here we use system-level `curl`, but you could just as easily use the Python `requests` package.

The URL for all data files is composed of  `base_url/container/objectname`.
 
The `base_url` is:

In [None]:
#If you are running this in IBM Apache Spark (via Data Science Experience)
base_url = 'https://dal05.objectstorage.service.networklayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#ELSE, if you are outside of IBM:
#base_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#NOTE: if you are outside of IBM, pulling down data will be slower. :/

In [None]:
# Define a local data folder to hold the downloaded data
import os

# os.getcwd() also works where the PWD environment variable is not set
mydatafolder = os.path.join(os.getcwd(), 'my_data_folder')
if not os.path.exists(mydatafolder):
    os.makedirs(mydatafolder)
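
The `base_url/container/objectname` pattern can be wrapped in a small helper so that each download builds its URL the same way. This is only a convenience sketch: `object_url` and `public_base_url` are names introduced here, and the example uses the public (outside IBM) base URL.

```python
def object_url(base_url, container, objectname):
    """Join the three URL parts, tolerating stray slashes on any of them."""
    return '/'.join([base_url.rstrip('/'), container.strip('/'), objectname.lstrip('/')])

public_base_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'
print(object_url(public_base_url, 'simsignals_basic_v2', 'basic4.zip'))
```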

## Accessing the Basic Data Set

Download the basic data set ZIP file and index file.

In [None]:
# download the data ZIP file
basic_container = 'simsignals_basic_v2'
basic4_zip_file = 'basic4.zip'
os.system('curl {}/{}/{} > {}'.format(base_url, basic_container, basic4_zip_file, mydatafolder + '/' + basic4_zip_file))
!ls -al $mydatafolder/$basic4_zip_file

# download the index csv file
basic4_csv_filename = 'public_list_basic_v2_26may_2017.csv'
basic4_csv_url = '{}/simsignals_files/{}'.format(base_url, basic4_csv_filename)
os.system('curl {} > {}'.format(basic4_csv_url, mydatafolder +'/'+ basic4_csv_filename))
!ls -al $mydatafolder/$basic4_csv_filename
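
If `curl` is not available, the same download can be done from Python. The sketch below streams a URL to disk using only the standard library (`urllib.request`, so Python 3 is assumed) and skips files that already exist. The `download` helper, the `.part` suffix, and the 30-second timeout are choices made for this example.

```python
import os
import shutil
import urllib.request

def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream url to dest_path; skip the download if the file already exists."""
    if os.path.exists(dest_path):
        return dest_path
    tmp_path = dest_path + '.part'
    # Copy in chunks so a multi-GB ZIP file is never held in memory at once.
    with urllib.request.urlopen(url, timeout=30) as response, open(tmp_path, 'wb') as fout:
        shutil.copyfileobj(response, fout, chunk_size)
    os.replace(tmp_path, dest_path)  # only a completed download gets the final name
    return dest_path

# Example call, using the variables defined in the cells above:
# download('{}/{}/{}'.format(base_url, basic_container, basic4_zip_file),
#          os.path.join(mydatafolder, basic4_zip_file))
```

Writing to a temporary name first means an interrupted download is not mistaken for a complete file on the next run.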

## Accessing the Primary Data Sets


### Accessing the Primary Small Data Set

Download the primary small data set ZIP file and index file.

In [None]:
# download the data ZIP file
primary_small_filename = 'primary_small.zip'
primary_small_url = '{}/simsignals_v2_zipped/{}'.format(base_url, primary_small_filename)
os.system('curl {} > {}'.format(primary_small_url, mydatafolder +'/'+ primary_small_filename))
!ls -al $mydatafolder/$primary_small_filename

# download the index csv file
primary_small_csv_filename = 'public_list_primary_v2_small_1june_2017.csv'
primary_small_csv_url = '{}/simsignals_files/{}'.format(base_url, primary_small_csv_filename)
os.system('curl {} > {}'.format(primary_small_csv_url, mydatafolder +'/'+ primary_small_csv_filename))
!ls -al $mydatafolder/$primary_small_csv_filename

### Accessing the Primary Medium Data Set

Download the primary medium data set ZIP files `simsignals_v2_zipped/primary_medium_1.zip` ... `simsignals_v2_zipped/primary_medium_6.zip` and the index file `public_list_primary_v2_medium_1june_2017.csv`.

In [None]:
# download the six data ZIP files
med_N = '{}/simsignals_v2_zipped/primary_medium_{}.zip'
for i in range(1, 7):
    med_url = med_N.format(base_url, i)
    output_file = mydatafolder + '/primary_medium_{}.zip'.format(i)
    print('GETing {}'.format(output_file))
    os.system('curl {} > {}'.format(med_url, output_file))
!ls -al $mydatafolder/primary_medium_*.zip
    
# download the index csv file    
primary_medium_csv_filename = 'public_list_primary_v2_medium_1june_2017.csv'
med_csv_url = '{}/simsignals_files/{}'.format(base_url, primary_medium_csv_filename)
os.system('curl {} > {}'.format(med_csv_url, mydatafolder +'/' + primary_medium_csv_filename))    
!ls -al $mydatafolder/$primary_medium_csv_filename

### Accessing the Primary Full Data Set

Download the index file for the full data set and the 350k data files, one file at a time. 

In [None]:
primary_full_csv_filename = 'public_list_primary_v2_full_1june_2017.csv'
prim_full = '{}/simsignals_files/{}'.format(base_url, primary_full_csv_filename)
os.system('curl {} > {}'.format(prim_full, mydatafolder +'/' + primary_full_csv_filename))
!ls -al $mydatafolder/$primary_full_csv_filename

> Download this list and begin to pull down files individually if desired. **Warning: this will take an extremely long time if you are not running on IBM Apache Spark.** IBM Apache Spark and Object Storage sit in the same data center and share a fast network connection.

The data are found in `base_url/simsignals_v2/<uuid>.dat`

Example data file URL:

https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2/aa7d082f-9263-4533-a9d4-5595c5cdde25.dat


In [None]:
import requests
file_list_container = 'simsignals_files'
file_list = 'public_list_primary_v2_full_1june_2017.csv'
primary_data_container = 'simsignals_v2'
r = requests.get('{}/{}/{}'.format(base_url, file_list_container, file_list), timeout=(9.0, 21.0))
r.raise_for_status()
# r.text is decoded text, so the line handling works on both Python 2 and 3
lines = r.text.splitlines()[1:]  # strip the header row
full_primary_files = [line.split(',')[0] + '.dat' for line in lines if line.strip()]  # list of file names (<uuid>.dat)

# save your data into a local subfolder
save_to_folder = mydatafolder + '/primary_data_set'
if not os.path.exists(save_to_folder):
    os.mkdir(save_to_folder)

count = 0
total = len(full_primary_files)
for row in full_primary_files:
    r = requests.get('{}/{}/{}'.format(base_url, primary_data_container, row), timeout=(9.0, 21.0))

    if count % 100 == 0:
        print('done {} out of {}'.format(count, total))
    count += 1

    # 'wb': the file content is binary, so write the raw bytes unchanged
    with open('{}/{}'.format(save_to_folder, row), 'wb') as fout:
        fout.write(r.content)
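
The sequential loop above issues one request at a time. As one possible way to overlap network waits on a single machine, the sketch below runs a per-file function over a thread pool from the standard library's `concurrent.futures`. This is not part of the original tutorial: `fetch_all`, `fake_fetch`, and `max_workers=8` are illustrative names and values, and `fake_fetch` stands in for a real function that would call `requests.get` and write the file.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(names, fetch_one, max_workers=8):
    """Run fetch_one(name) over names with a thread pool; return (ok, failed) lists."""
    def safe_fetch(name):
        try:
            fetch_one(name)
            return name, None
        except Exception as err:  # record the failure instead of stopping the batch
            return name, err

    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input names
        for name, err in pool.map(safe_fetch, names):
            if err is None:
                ok.append(name)
            else:
                failed.append(name)
    return ok, failed

def fake_fetch(name):
    """Stand-in for a real download; fails for names starting with 'bad'."""
    if name.startswith('bad'):
        raise IOError('simulated network error')

ok, failed = fetch_all(['a.dat', 'bad.dat', 'c.dat'], fake_fetch, max_workers=2)
print(ok, failed)
```

Keeping the list of failed names makes it straightforward to retry only those files in a second pass.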

> This will be a difficult data set to consume and process if you are using free-tier levels of software from any Cloud provider. You will likely want a robust machine, or a set of machines, with many threads and GPUs if you want to train models with such a large data set.

> For example, if you have access to an IBM Spark Enterprise cluster, because the network connection between IBM Spark and IBM Object Storage is so fast, we recommend that you **do NOT** download each file. Instead you could parallelize the index file and then retrieve and process each file on a worker node. 

In [None]:
## Using Spark -- can parallelize the job across your worker nodes
import ibmseti
def retrieve_and_process(row):
    try:
        r = requests.get('{}/{}/{}'.format(base_url, primary_data_container, row), timeout=(9.0, 21.0))
    except Exception as e:
        return (row, 'failed', [])
    
    aca = ibmseti.compamp.SimCompamp(r.content)
    spectrogram = aca.get_spectrogram() # or do something else
    features = my_feature_extractor(spectrogram) #example external function for reducing the spectrogram into a handful of features, perhaps
    
    signal_class = aca.header()['signal_classification']
        
    return (row, signal_class, features)

npartitions = 60  
rdd = sc.parallelize(full_primary_files, npartitions)

#Now ask Spark to run the job
process_results = rdd.map(retrieve_and_process).collect()
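
Because `retrieve_and_process` returns `(row, 'failed', [])` when a request raises, the collected results can be split into successes and rows worth retrying. The sketch below shows one way to do that in plain Python; `split_results` and the example tuples are hypothetical.

```python
def split_results(process_results):
    """Separate successful (row, label, features) tuples from failed rows."""
    succeeded = [r for r in process_results if r[1] != 'failed']
    failed_rows = [r[0] for r in process_results if r[1] == 'failed']
    return succeeded, failed_rows

# Invented results in the same shape that retrieve_and_process returns.
example_results = [
    ('uuid-0001.dat', 'narrowband', [0.1, 0.2]),
    ('uuid-0002.dat', 'failed', []),
]
succeeded, failed_rows = split_results(example_results)
print(len(succeeded), failed_rows)
```

The `failed_rows` list could then be passed through `sc.parallelize(...)` and `retrieve_and_process` again for a retry.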

## Accessing the Test Data Set

Once you've trained your model, finished your testing and tweaks, and are ready to submit an entry to the contest, you'll need to download the test data set and apply your model to it.

The test data set is similar to the labeled data, except that the JSON header is missing the 'signal_classification' key, and just contains the 'uuid'. 

Like the primary small and medium sets, this set is found in a `.zip` file in the `simsignals_v2_zipped` container:

In [None]:
# download the test data ZIP file
testset_filename = 'primary_testset.zip'
test_set_url = '{}/simsignals_v2_zipped/{}'.format(base_url, testset_filename)
os.system('curl {} > {}'.format(test_set_url, mydatafolder +'/'+testset_filename))
!ls -al $mydatafolder/$testset_filename

# download the test index csv file
testset_csv_filename = 'public_list_primary_testset_1k_1june_2017.csv'
test_set_csv_url = '{}/simsignals_files/{}'.format(base_url, testset_csv_filename)
os.system('curl {} > {}'.format(test_set_csv_url, mydatafolder + '/' + testset_csv_filename))
!ls -al $mydatafolder/$testset_csv_filename