# Build PNG Files

In this notebook, we'll take the `basic` data set, use the `ibmseti` Python package to convert each data file into a spectrogram, and save the spectrograms as `.png` files.


We'll also split the data set into a training set and a test set and create a handful of zip files for each class. This dovetails into the next tutorial, where we will train a custom Watson Visual Recognition classifier (using the zip files of PNGs) and measure its performance with the test set.

In [7]:
from __future__ import division

import cStringIO
import glob
import json
import requests
import ibmseti
import os
import zipfile
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#Make a local folder to hold the data.

#NOTE: YOU MUST do something like this on a Spark Enterprise cluster at the hackathon so that
#your data goes into a separate local file space. Otherwise, you'll likely collide with
#your fellow participants.

mydatafolder = 'my_team_name_data_folder/basic4'
if not os.path.exists(mydatafolder):
    os.makedirs(mydatafolder)

In [None]:
basic4zip = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_basic_v2/basic4.zip'
os.system('curl {} > {}/{}'.format(basic4zip, mydatafolder, 'basic4.zip'))
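If `curl` is not available, or you want an error raised when the download fails, the same step can be sketched with the standard library alone. This is a minimal sketch, not part of the original notebook: `download_file` and its `chunk_size` parameter are hypothetical names introduced here for illustration.

```python
import os
import shutil

try:                                   # Python 3
    from urllib.request import urlopen
except ImportError:                    # Python 2
    from urllib2 import urlopen


def download_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream `url` to `dest_path` and return the number of bytes written.

    `download_file` is a hypothetical helper, not part of ibmseti.
    """
    dest_dir = os.path.dirname(dest_path)
    if dest_dir and not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)

    response = urlopen(url)  # raises on HTTP or connection errors
    try:
        with open(dest_path, 'wb') as out:
            # Copy in chunks so the whole archive never sits in memory.
            shutil.copyfileobj(response, out, chunk_size)
    finally:
        response.close()
    return os.path.getsize(dest_path)
```

In this notebook it could be called as `download_file(basic4zip, os.path.join(mydatafolder, 'basic4.zip'))`.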

In [4]:
!ls my_team_name_data_folder/basic4

basic4.zip


In [5]:
outputpng_folder = 'my_team_name_data_folder/png'
if not os.path.exists(outputpng_folder):
    os.makedirs(outputpng_folder)

In [8]:
zz = zipfile.ZipFile(mydatafolder + '/' + 'basic4.zip')

In [18]:
#Use `ibmseti`, or other methods, to draw the spectrograms

def draw_spectrogram(data):
    
    aca = ibmseti.compamp.SimCompamp(data)
    spec = aca.get_spectrogram()

    # Instead of using the SimCompamp.get_spectrogram method, you can
    # perform your own signal processing here before you create the spectrogram.
    #
    # SimCompamp.get_spectrogram is relatively simple. Here's the code to reproduce it:
    #
    # header, raw_data = data.split('\n', 1)
    # complex_data = np.frombuffer(raw_data, dtype='i1').astype(np.float32).view(np.complex64)
    # shape = (32, 6144)
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data.reshape(*shape) ), 1) )**2
    #
    # But instead of the line above, can you manipulate `complex_data` with signal processing
    # techniques in the time domain (windowing? de-chirping?), or manipulate the output of the
    # np.fft.fft process in a way that improves the signal-to-noise ratio (Welch periodogram,
    # subtracting a noise model)?
    #
    # Example: apply a Hanning window
    # complex_data = complex_data.reshape(*shape)
    # complex_data = complex_data * np.hanning(complex_data.shape[1])
    # spec = np.abs( np.fft.fftshift( np.fft.fft( complex_data ), 1) )**2


    fig, ax = plt.subplots(figsize=(10, 5))   

    # do different color mappings affect Watson's classification accuracy?
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='hot')
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='gray')
    # ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='Greys')
    
    ax.imshow(np.log(spec), aspect = 0.5*float(spec.shape[1]) / spec.shape[0], cmap='gray')
    
    return fig
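The recipe in the comments of `draw_spectrogram` can be tried out on synthetic data before running it over the full data set. The sketch below is only an illustration under stated assumptions: `spectrogram_from_bytes` is a hypothetical helper, it expects the header line to have been stripped already, and the test signal is a made-up tone on a small 4 x 64 grid rather than a real 32 x 6144 file.

```python
import numpy as np


def spectrogram_from_bytes(raw_data, shape=(32, 6144), window=None):
    """Power spectrogram from interleaved signed 8-bit (real, imag) samples.

    Hypothetical helper that follows the commented recipe; `window` is an
    optional function such as np.hanning that maps a length to a 1-D array.
    """
    # int8 pairs -> float32 -> complex64 (each real/imag pair becomes one sample)
    complex_data = np.frombuffer(raw_data, dtype='i1').astype(np.float32).view(np.complex64)
    complex_data = complex_data.reshape(*shape)
    if window is not None:
        # The 1-D window broadcasts across every row (time slice).
        complex_data = complex_data * window(shape[1])
    # FFT along each row, move zero frequency to the center, take the power.
    return np.abs(np.fft.fftshift(np.fft.fft(complex_data), 1)) ** 2


# Synthetic input (an assumption for illustration): a tone at FFT bin 8.
rows, cols = 4, 64
n = np.arange(rows * cols)
tone = 100 * np.exp(2j * np.pi * 8 * n / cols)
iq = np.empty(2 * n.size, dtype=np.int8)
iq[0::2] = np.round(tone.real).astype(np.int8)
iq[1::2] = np.round(tone.imag).astype(np.int8)
raw = iq.tobytes()

plain = spectrogram_from_bytes(raw, shape=(rows, cols))
windowed = spectrogram_from_bytes(raw, shape=(rows, cols), window=np.hanning)
```

After `fftshift`, bin 8 of a 64-point FFT lands at column 40, so both spectrograms should peak there in every row. Comparing `plain` and `windowed` on known inputs like this is one way to check a processing change before regenerating thousands of PNGs.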


In [None]:
count = 0
total = len(zz.namelist())
for fn in zz.namelist():
    data = zz.open(fn).read()
    fig = draw_spectrogram(data)
    png_file = fn + '.png'
    fig.savefig(outputpng_folder + '/' + png_file)
    plt.close(fig)

    #count was never incremented before, so the progress message printed on every file
    count += 1
    if count % 200 == 0:
        print 'completed', count, 'out of', total

# Create Training / Test sets

Using the `basic` file list, we'll create training and test sets for each signal class. Then we'll archive the `.png` files into a handful of `.zip` files. Each `.zip` file needs to be smaller than 100 MB because Watson Visual Recognition limits the size of the data batches uploaded when training a classifier.

In [23]:
# Grab the Basic file list in order to 
# Organize the Data into classes

r = requests.get('https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_basic_v2_26may_2017.csv', timeout=(9.0, 21.0))

uuids_classes_as_list = r.text.split('\n')[1:-1]  #slice off the first line (header) and last line (empty)

def row_to_json(row):
    uuid,sigclass = row.split(',')
    return {'uuid':uuid, 'signal_classification':sigclass}

uuids_classes_as_list = [row_to_json(row) for row in uuids_classes_as_list]
print "found {} files".format(len(uuids_classes_as_list))

uuids_group_by_class = {}
for item in uuids_classes_as_list:
    uuids_group_by_class.setdefault(item['signal_classification'], []).append(item)

found 4000 files
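Slicing off the first and last lines works for this file, but it depends on the exact line endings and on there being exactly one trailing empty line. A slightly more defensive variant using the `csv` module might look like the sketch below. `group_uuids_by_class` is a hypothetical helper, and the header text in the sample is an assumption, not the real header of the file list.

```python
import csv
from collections import defaultdict


def group_uuids_by_class(csv_text):
    """Parse 'uuid,class' rows and group them by signal classification.

    Hypothetical helper; skips the header line and any blank or malformed rows.
    """
    reader = csv.reader(csv_text.splitlines())
    next(reader, None)  # skip the header row
    groups = defaultdict(list)
    for row in reader:
        if len(row) != 2:
            continue  # blank line or unexpected column count
        uuid, sigclass = (field.strip() for field in row)
        groups[sigclass].append({'uuid': uuid, 'signal_classification': sigclass})
    return dict(groups)


# Made-up sample in the same two-column layout (the header names are assumed).
sample = "UUID,SIGNAL_CLASSIFICATION\r\na1,noise\r\nb2,squiggle\r\nc3,noise\r\n\r\n"
grouped = group_uuids_by_class(sample)
```

With the real list, `group_uuids_by_class(r.text)` would take the place of the parsing and grouping steps in the cell above.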


In [24]:
training_percentage = 0.70

training_set_group_by_class = {}
test_set_group_by_class = {}
for k, v in uuids_group_by_class.iteritems():
    
    total = len(v)
    training_size = int(total * training_percentage)

    training_set = v[0:training_size]
    test_set = v[training_size:total]
    
    training_set_group_by_class[k] = training_set
    test_set_group_by_class[k] = test_set
    
    print '{}: training set size: {}'.format(k, len(training_set))
    print '{}: test set size: {}'.format(k, len(test_set))

squiggle: training set size: 700
squiggle: test set size: 300
narrowband: training set size: 700
narrowband: test set size: 300
noise: training set size: 700
noise: test set size: 300
narrowbanddrd: training set size: 700
narrowbanddrd: test set size: 300
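The split above takes the first 70% of each class in list order. If the order of the file list happens to correlate with anything about the signals (an assumption, nothing in the list confirms or rules it out), shuffling each class with a fixed seed first gives a split that is still reproducible. A minimal sketch, with `split_by_class` and its parameters as hypothetical names:

```python
import random


def split_by_class(groups, training_fraction=0.70, seed=0):
    """Shuffle each class with a fixed seed, then split into training and test.

    Hypothetical helper; `groups` maps a class name to a list of items.
    """
    rng = random.Random(seed)
    training, test = {}, {}
    for sigclass in sorted(groups):  # sorted so the result does not depend on dict order
        shuffled = list(groups[sigclass])
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * training_fraction)
        training[sigclass] = shuffled[:cut]
        test[sigclass] = shuffled[cut:]
    return training, test


# Made-up example: two classes of ten items each, split in half.
example = {'noise': list(range(10)), 'squiggle': list(range(100, 110))}
train_example, test_example = split_by_class(example, training_fraction=0.5, seed=42)
```

Every item ends up in exactly one of the two sets, and rerunning with the same `seed` reproduces the same split.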


In [31]:
training_set_group_by_class['noise'][0]

{'signal_classification': u'noise',
 'uuid': u'498becc2-3693-45b3-8533-50e93532706a'}

In [30]:
fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]

In [38]:
zipfilefolder = 'my_team_name_data_folder/zipfiles'
if os.path.exists(zipfilefolder) is False:
    os.makedirs(zipfilefolder)

In [60]:
#Figure out how many zip files we need to make

for k, v in training_set_group_by_class.iteritems():
    
    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/
    
    count = 1
    for fn in fnames:
        
        archive_name = '{}/classification_{}_{}.zip'.format(zipfilefolder, count, k)
        
        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')
           
        zz.write(fn)
        zz.close()
        
        #if the archive exceeds 75 MB, increase count so the next file starts a new archive
        if os.path.getsize(archive_name) > 75.0*1024**2:
            count += 1
            

creating new archive adam_data_folder/zipfiles/classification_2_squiggle.zip
creating new archive adam_data_folder/zipfiles/classification_1_narrowband.zip
creating new archive adam_data_folder/zipfiles/classification_2_narrowband.zip
creating new archive adam_data_folder/zipfiles/classification_1_noise.zip
creating new archive adam_data_folder/zipfiles/classification_2_noise.zip
creating new archive adam_data_folder/zipfiles/classification_1_narrowbanddrd.zip
creating new archive adam_data_folder/zipfiles/classification_2_narrowbanddrd.zip
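The loop above reopens the archive for every file and checks its size afterwards. Because `zipfile.ZipFile` stores files uncompressed by default, the archive size is roughly the sum of the file sizes, so the batches can also be planned up front and each archive opened once. The sketch below is an alternative under that assumption: `plan_batches` and `write_batches` are hypothetical helpers, and unlike the loop above they store only the file name inside the archive, not the folder path.

```python
import os
import zipfile


def plan_batches(paths, max_bytes):
    """Greedily group paths so each group's total file size stays within max_bytes."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        if current and current_size + size > max_bytes:
            batches.append(current)  # current batch is full, start a new one
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_batches(paths, archive_template, max_bytes=75 * 1024 ** 2):
    """Write one archive per batch; `archive_template` has one '{}' for the batch number."""
    archive_names = []
    for index, batch in enumerate(plan_batches(paths, max_bytes), start=1):
        archive_name = archive_template.format(index)
        with zipfile.ZipFile(archive_name, mode='w') as archive:
            for path in batch:
                # Store only the file name, not the folder path (differs from the loop above).
                archive.write(path, arcname=os.path.basename(path))
        archive_names.append(archive_name)
    return archive_names
```

For one class this could be called as `write_batches(fnames, zipfilefolder + '/classification_{}_' + k + '.zip')`. Zip headers add a little overhead per file, which the 75 MB default leaves room for under the 100 MB limit.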


In [69]:
for k, v in test_set_group_by_class.iteritems():
    
    fnames = [outputpng_folder + '/' + vv['uuid'] + '.dat.png' for vv in v]  #yes, files are <uuid>.dat.png :/
    
    for fn in fnames:
        
        archive_name = '{}/testset_{}.zip'.format(zipfilefolder, k)
        
        if os.path.exists(archive_name):
            zz = zipfile.ZipFile(archive_name, mode='a')
        else:
            print 'creating new archive', archive_name
            zz = zipfile.ZipFile(archive_name, mode='w')
           
        zz.write(fn)
        zz.close()
        

creating new archive adam_data_folder/zipfiles/testset_squiggle.zip
creating new archive adam_data_folder/zipfiles/testset_narrowband.zip
creating new archive adam_data_folder/zipfiles/testset_noise.zip
creating new archive adam_data_folder/zipfiles/testset_narrowbanddrd.zip


In [70]:
!ls -alrth my_team_name_data_folder/zipfiles

total 2.3G
drwxr-x--- 5 sfa7-9e7464df3e1117-5edfd8a0d95d users 4.0K Jun  1 04:44 ..
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  58M Jun  1 04:49 classification_2_squiggle.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  76M Jun  1 04:49 classification_1_narrowband.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  58M Jun  1 04:49 classification_2_narrowband.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  76M Jun  1 04:49 classification_1_noise.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  60M Jun  1 04:49 classification_2_noise.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  76M Jun  1 04:49 classification_1_narrowbanddrd.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  57M Jun  1 04:49 classification_2_narrowbanddrd.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  76M Jun  1 04:53 classification_1_squiggle.zip
-rw-r----- 1 sfa7-9e7464df3e1117-5edfd8a0d95d users  57M Jun  1 04:53 testset_squiggle.zip
-rw-r-