## Extracting Features from QPESUMS with Convolutional Autoencoder

In the earlier analysis, we used PCA, a common linear dimension-reduction technique. Here we take advantage of recent developments in deep learning and use a **convolutional autoencoder** as a nonlinear feature extraction tool.


### Autoencoder

Autoencoders are a widely used unsupervised application of neural networks. Their original purpose is to find a latent, lower-dimensional state space for a dataset, but they can also solve other problems, such as image denoising, enhancement, or colourization.

The main idea behind autoencoders is to reduce the input to a latent state space with fewer dimensions and then try to reconstruct the input from this representation. The first step is called encoding and the second is called decoding. By reducing the number of variables that represent the data, we force the model to keep only the meaningful information from which the input can be reconstructed. An autoencoder can therefore also be viewed as a compression technique.

<img src='figures/ae_illustrate.png' />
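
The encode/decode idea can be illustrated with a toy linear example. This is only a sketch: the weights `W` are picked by hand rather than learned, and the assumption that the 2-D data lie near the line y = 2x is made up for illustration. A point on that line survives the round trip through a single latent value, while a point far from it does not.

```python
# Toy linear "autoencoder" with hand-picked (not learned) weights.
# Assumption for illustration: the 2-D data lie close to the line y = 2x.
import math

# Unit vector along the line y = 2x; used as both encoder and decoder weights.
W = (1 / math.sqrt(5), 2 / math.sqrt(5))

def encode(point):
    """Compress a 2-D point into one latent value (its projection onto W)."""
    return point[0] * W[0] + point[1] * W[1]

def decode(code):
    """Reconstruct a 2-D point from its latent value."""
    return (code * W[0], code * W[1])

def reconstruction_error(point):
    """Squared error between a point and its reconstruction."""
    rec = decode(encode(point))
    return (point[0] - rec[0]) ** 2 + (point[1] - rec[1]) ** 2

on_line = (1.0, 2.0)   # lies exactly on y = 2x
off_line = (1.0, 0.0)  # far from the line

print(reconstruction_error(on_line))   # close to 0
print(reconstruction_error(off_line))  # about 0.8
```

A real autoencoder learns its encoder and decoder from data and adds nonlinear activations, but the reconstruction error plays the same role as the training signal.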


\[References\]
- [Aligning hand-written digits with Convolutional Autoencoders](https://towardsdatascience.com/aligning-hand-written-digits-with-convolutional-autoencoders-99128b83af8b)
- [Autoencoders — Deep Learning bits #1](https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694)
- [Autoencoders — Introduction and Implementation in TF](https://towardsdatascience.com/autoencoders-introduction-and-implementation-3f40483b0a85)
- [Deconvolution and Checkerboard Artifacts](https://distill.pub/2016/deconv-checkerboard/)
- [Up-sampling with Transposed Convolution](https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0)


### Problem statement

Here we use an autoencoder (AE) to extract features from the QPESUMS dataset and see how well it performs. First, we need some basic tools to read in the QPESUMS data.

In [1]:
''' Input processing '''
import os
import numpy as np
import pandas as pd

# Scan a directory tree for QPESUMS data stored as *.npy files (each 6*275*162)
def search_dbz(srcdir):
    ''' Return a DataFrame of file URIs and timestamps, sorted by timestamp. '''
    fileinfo = []
    for subdir, dirs, files in os.walk(srcdir, followlinks=True):
        for f in files:
            if f.endswith('.npy'):
                # The part of the file name before the first '.' is used as the timestamp
                furi = os.path.join(subdir, f)
                ftime = f.split('.')[0]
                fileinfo.append([furi, ftime])
    results = pd.DataFrame(fileinfo, columns=['furi', 'timestamp'])
    results = results.sort_values(by=['timestamp']).reset_index(drop=True)
    return results

# Read the files listed in flist, each containing QPESUMS data in the format of 6*275*162
def loadDBZ(flist, to_log=False):
    ''' Load a list of dbz files (in npy format) into one numpy array. '''
    xdata = []
    for f in flist:
        tmp = np.load(f)
        # Append the flattened data array if it is not None
        if tmp is not None:
            # Note: each sample is flattened to 1-D here; a Conv2D model needs it
            # reshaped back to (6, 275, 162)
            xdata.append(tmp.flatten())
    x = np.array(xdata, dtype=np.float32)
    # Convert to log space if specified: log(x + 1)
    if to_log:
        x = np.log1p(x)
    return x

def data_generator_ae(flist, batch_size, to_log=False):
    ''' Data generator for batched processing. '''
    nSample = len(flist)
    # This line is just to make the generator infinite, keras needs that    
    while True:
        batch_start = 0
        batch_end = batch_size
        while batch_start < nSample:
            limit = min(batch_end, nSample)
            X = loadDBZ(flist[batch_start:limit], to_log)
            #print(X.shape)
            yield (X,X) #a tuple with two numpy arrays with batch_size samples     
            batch_start += batch_size   
            batch_end += batch_size
    # End of generator
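
The batching logic in `data_generator_ae` can be checked without loading any files. The sketch below reproduces only its index arithmetic; `batch_bounds` is a hypothetical helper name, and the sample count and batch size are made-up values for illustration.

```python
import math

def batch_bounds(n_samples, batch_size):
    """Return the (start, end) slice bounds for one full pass over the data."""
    bounds = []
    start = 0
    while start < n_samples:
        # The last batch is shorter when batch_size does not divide n_samples.
        end = min(start + batch_size, n_samples)
        bounds.append((start, end))
        start += batch_size
    return bounds

# Illustrative numbers only: 10 files read in batches of 4.
print(batch_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]

# Number of batches that make up one pass, i.e. one epoch of the generator.
steps_per_epoch = math.ceil(10 / 4)
print(steps_per_epoch)      # 3
```

Because the generator never stops on its own, the training loop has to be told how many batches make up one epoch, which is what the second value is for.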


## Autoencoding the QPESUMS on Itself

Autoencoding the data on itself means that we want the decoded output to be as similar to the original input as possible.
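
One common way to make "as similar as possible" precise is the mean squared error, which corresponds to the `loss='mse'` setting in the model code. As a sketch, write $x$ for one input sample, $\hat{x}$ for its reconstruction, and $N$ for the number of values per sample (this notation is introduced here for illustration only):

$$
\hat{x} = \mathrm{decoder}(\mathrm{encoder}(x)), \qquad
\mathcal{L}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2
$$

For one QPESUMS sample of shape 6*275*162, $N = 6 \times 275 \times 162 = 267300$.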

In [None]:
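# Autoencoder model
# Imports assumed here: standalone Keras (use tensorflow.keras instead if that is what is installed)
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

def init_ae_mse(input_shape):
    # Define input layer; every layer below uses `channels_first`,
    # so input_shape is (channels, height, width), e.g. (6, 275, 162)
    input_data = Input(shape=input_shape)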
    # Define encoder layers
    x = Conv2D(32, (3, 3), activation='relu', padding='same', name='encoder_conv1', data_format='channels_first')(input_data)
    x = MaxPooling2D((5, 3), name='encoder_maxpool1', data_format='channels_first')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same', name='encoder_conv2', data_format='channels_first')(x)
    x = MaxPooling2D((1, 3), name='encoder_maxpool2', data_format='channels_first')(x)
    x = Conv2D(8, (3, 3), activation='relu', padding='same', name='encoder_conv3', data_format='channels_first')(x)
    x = MaxPooling2D((1, 3), name='encoder_maxpool3', data_format='channels_first')(x)
    x = Conv2D(4, (3, 3), activation='relu', padding='same', name='encoder_conv4', data_format='channels_first')(x)
    encoded = MaxPooling2D((5, 3), name='encoder_maxpool4', data_format='channels_first')(x)
    #encoded = x
    # Define decoder layers
    x = Conv2D(4, (3, 3), activation='relu', padding='same', name='decoder_conv1', data_format='channels_first')(encoded)
    x = UpSampling2D((5, 3), name='decoder_upsamp1', data_format='channels_first')(x)
    x = Conv2D(8, (3, 3), activation='relu', padding='same', name='decoder_conv2', data_format='channels_first')(x)
    x = UpSampling2D((1, 3), name='decoder_upsamp2', data_format='channels_first')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same', name='decoder_conv3', data_format='channels_first')(x)
    x = UpSampling2D((1, 3), name='decoder_upsamp3', data_format='channels_first')(x)
    x = Conv2D(32, (3, 3), activation='relu', padding='same', name='decoder_conv4', data_format='channels_first')(x)
    x = UpSampling2D((5, 3), name='decoder_upsamp4', data_format='channels_first')(x)
    # Sigmoid keeps the output in (0, 1), so the input data must be scaled to that range for the reconstruction to match it
    decoded = Conv2D(6, (3, 3), activation='sigmoid', name='decoder_output', padding='same', data_format='channels_first')(x)
    # Define autoencoder
    autoencoder = Model(input_data, decoded)
    #autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
    # Note: recent Keras releases call the 'cosine_proximity' metric 'cosine_similarity'
    autoencoder.compile(optimizer='adam', loss='mse', metrics=['cosine_proximity','binary_crossentropy'])
    # Encoder
    encoder = Model(input_data, encoded)
    return autoencoder, encoder
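
The pooling sizes in the model are not arbitrary: their products must divide the 275*162 grid so that the decoder's upsampling restores the original size. The sketch below traces the spatial shape through the four pooling and four upsampling layers with plain integer arithmetic. It assumes the Keras defaults for `MaxPooling2D` (stride equal to the pool size, `'valid'` padding, so sizes are floor-divided); `pooled_shape` and `upsampled_shape` are hypothetical helper names.

```python
def pooled_shape(shape, pool_sizes):
    """Spatial shape after a chain of max-pooling layers (floor division)."""
    h, w = shape
    for ph, pw in pool_sizes:
        h, w = h // ph, w // pw
    return h, w

def upsampled_shape(shape, factors):
    """Spatial shape after a chain of upsampling layers (multiplication)."""
    h, w = shape
    for fh, fw in factors:
        h, w = h * fh, w * fw
    return h, w

# Pool sizes of encoder_maxpool1..4; decoder_upsamp1..4 use the same factors.
POOLS = [(5, 3), (1, 3), (1, 3), (5, 3)]

encoded_hw = pooled_shape((275, 162), POOLS)
decoded_hw = upsampled_shape(encoded_hw, POOLS)
print(encoded_hw)  # (11, 2)
print(decoded_hw)  # (275, 162)

# The encoder output has 4 channels, so each sample is reduced to 4*11*2 values.
n_input = 6 * 275 * 162
n_encoded = 4 * encoded_hw[0] * encoded_hw[1]
print(n_input, n_encoded)  # 267300 88

# A grid that is not divisible by the pool sizes does not come back intact.
print(upsampled_shape(pooled_shape((276, 162), POOLS), POOLS))  # (275, 162)
```

Under these assumptions the encoder compresses each 267300-value sample into 88 features, and the last line shows why an input of a different size would cause a shape mismatch between the model output and its target.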
