# Autoencoder for NOAA-GridSat-B1

This notebook develops and evaluates autoencoders for the NOAA-GridSat-B1 data.


## Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. (from [wikipedia](https://en.wikipedia.org/wiki/Autoencoder))
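One common way to write this objective, using notation introduced here rather than taken from the quoted source: with an encoder $f_\theta$, a decoder $g_\phi$ and $N$ training samples $x_i$, training minimizes the mean squared reconstruction error. Up to a constant factor, this matches the `MeanSquaredError` loss the models in this notebook are compiled with.

$$
z_i = f_\theta(x_i), \qquad \hat{x}_i = g_\phi(z_i), \qquad \min_{\theta,\phi}\ \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^2
$$

Here $z_i$ is the reduced encoding, so its dimension is chosen to be much smaller than that of $x_i$.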

For advanced applications of autoencoders (AE), [this article](https://medium.com/%E7%A8%8B%E5%BC%8F%E5%B7%A5%E4%BD%9C%E7%B4%A1/autoencoder-%E4%B8%80-%E8%AA%8D%E8%AD%98%E8%88%87%E7%90%86%E8%A7%A3-725854ab25e8) introduces four AE variants and their applications.


### Factorization of the data dimension

Each image in the dataset used for analysis has shape (858, 858). We factorize this dimension so that the autoencoder can reduce the image size with factors that divide it evenly.

$858 = 2 \times 3 \times 11 \times 13$
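As a quick check of this factorization, the sketch below uses plain trial division. `prime_factors` is a helper name introduced here for illustration and is not part of the notebook's utilities. It also shows that dividing the 858-pixel axis by the first three factors in turn leaves the last factor.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order (trial division)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


factors = prime_factors(858)
print(factors)  # [2, 3, 11, 13]

# Divide the axis length by the first three factors, one at a time
size = 858
for f in factors[:3]:
    size //= f
    print(f, size)  # 2 429, then 3 143, then 11 13
```

Downsampling windows of 2, 3 and 11 therefore reduce each axis from 858 to 13 with no leftover pixels.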


### Utility functions for data I/O

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import os, argparse, logging


# Utility functions
def list_noaagridsatb1_files(dir, suffix='.v02r01.nc', to_remove=['GRIDSAT-B1.','.v02r01.nc']):
    ''' Scan the specified dir recursively and list the files ending with suffix, sorted by time-stamp. '''
    import os
    import pandas as pd
    xfiles = []
    for root, dirs, files in os.walk(dir, followlinks=True):  # Loop through the directory
        for fn in files:
            if fn.endswith(suffix):         # Filter files with suffix
                timestamp = fn
                for s in to_remove:         # Removing prefix and suffix to get time-stamp
                    timestamp = timestamp.replace(s,'')
                xfiles.append({'timestamp':timestamp, 'xuri':os.path.join(root, fn)})
    return(pd.DataFrame(xfiles).sort_values('timestamp').reset_index(drop=True))

# Binary reader
def read_noaagridsatb1(furi, var='irwin_cdr', scale=0.01, offset=200, remove_na=True, crop_east_asia=True):
    ''' Read a NOAA-GridSat-B1 image in netCDF4 format (.nc file). 
        The brightness temperature data is stored as int16 in 'irwin_cdr', with 
        a scale factor of 0.01 and an offset of 200. Missing values are flagged as -31999.
        More details of the data are described in https://developers.google.com/earth-engine/datasets/catalog/NOAA_CDR_GRIDSAT-B1_V2.
        Since our analysis focuses on East Asia (0-60'N, 100-160'E), we use an 
        option to crop the data to this region (index: lat:1000~1858, lon:4000~4858).
        The output is a 2-d numpy array of float32 with shape (858, 858).
    '''
    import numpy as np
    import netCDF4 as nc
    # Read in data
    data = nc.Dataset(furi)
    cdr = np.array(data.variables[var])*scale+offset   # read the variable named by `var`
    # Remove missing value
    if remove_na:
        cdr[cdr<0] = offset
    # Crop domain to East-Asia (0-60'N, 100-160'E)
    if crop_east_asia:
        return(cdr[0, 1000:1858, 4000:4858])
    else:
        return(cdr[0,:,:])

def read_multiple_noaagridsatb1(flist):
    ''' This method reads in a list of NOAA-GridSat-B1 images and returns a numpy array. '''
    import numpy as np
    data = []
    for f in flist:
        tmp = read_noaagridsatb1(f)
        if tmp is not None:
            data.append(tmp)
    return(np.array(data, dtype=np.float32))

def data_generator_ae(flist, batch_size):
    ''' Data generator for batched processing. '''
    nSample = len(flist)
    # This line is just to make the generator infinite, keras needs that    
    while True:
        batch_start = 0
        batch_end = batch_size
        while batch_start < nSample:
            limit = min(batch_end, nSample)
            X = read_multiple_noaagridsatb1(flist[batch_start:limit])
            #print(X.shape)
            yield (X,X) #a tuple with two numpy arrays with batch_size samples     
            batch_start += batch_size   
            batch_end += batch_size
    # End of generator
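To make the batching arithmetic in `data_generator_ae` concrete, here is a minimal sketch that reproduces only its slice bounds, without reading any files. `batch_bounds` is a hypothetical helper, and the sample count of 70 is an assumed value chosen for illustration.

```python
import math


def batch_bounds(n_sample, batch_size):
    """List the (start, end) slice bounds for one pass over n_sample items."""
    bounds = []
    start = 0
    while start < n_sample:
        end = min(start + batch_size, n_sample)  # the last batch may be shorter
        bounds.append((start, end))
        start += batch_size
    return bounds


print(batch_bounds(70, 32))  # [(0, 32), (32, 64), (64, 70)]
print(math.ceil(70 / 32))    # 3 batches cover the data once
```

The number of batches per pass equals `ceil(n_sample / batch_size)`, which is the value to use as the number of steps per epoch when the generator loops forever.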

### A few autoencoder models

We design a few autoencoder models for comparison.

In [8]:
# Autoencoder models
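from tensorflow.keras.layers import Input, Dense, Flatten, Reshape
from tensorflow.keras.layers import Conv2D, MaxPooling2D, UpSampling2D  # used by the convolutional model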
from tensorflow.keras.models import Model


def initialize_fc_autoencoder_noaagridsatb1(input_shape):
    # Compressed dimension
    latent_dim = 64
    # Debug information
    print('Input data dimension: '+str(input_shape))
    print('Flatten data dimension: '+str(latent_dim))
    # Define input layer
    input_data = Input(shape=input_shape)
    # Define encoder layers
    x = Flatten()(input_data)
    encoded = Dense(latent_dim, activation='relu')(x)
    # Define decoder layers
    x = Dense(np.prod(input_shape), activation='sigmoid')(encoded)
    decoded = Reshape(input_shape)(x)
    # Define autoencoder
    autoencoder = Model(input_data, decoded)
    autoencoder.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError(), metrics=['cosine_similarity'])
    # Encoder
    encoder = Model(input_data, encoded)
    return((autoencoder, encoder))


def initialize_conv_autoencoder_noaagridsatb1(input_shape):
    # Define input layer; Conv2D expects a channel axis, e.g. input_shape=(858, 858, 1)
    input_data = Input(shape=input_shape)  # adapt this if using `channels_first` image data format
    # Define encoder layers
    x = Conv2D(32, (3, 3), activation='relu', padding='same', name='encoder_conv1')(input_data)
    x = MaxPooling2D((2, 2), name='encoder_maxpool1')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same', name='encoder_conv2')(x)
    x = MaxPooling2D((3, 3), name='encoder_maxpool2')(x)
    x = Conv2D(8, (3, 3), activation='relu', padding='same', name='encoder_conv3')(x)
    encoded = MaxPooling2D((11, 11), name='encoder_maxpool3')(x)
    # Define decoder layers
    x = Conv2D(8, (3, 3), activation='relu', padding='same', name='decoder_conv1')(encoded)
    x = UpSampling2D((11, 11), name='decoder_upsamp1')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same', name='decoder_conv2')(x)
    x = UpSampling2D((3, 3), name='decoder_upsamp2')(x)
    x = Conv2D(32, (3, 3), activation='relu', padding='same', name='decoder_conv3')(x)
    x = UpSampling2D((2, 2), name='decoder_upsamp3')(x)
    decoded = Conv2D(1, (3, 3), activation='sigmoid', name='decoder_output', padding='same')(x)
    # Define autoencoder
    autoencoder = Model(input_data, decoded)
    autoencoder.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError(), metrics=['cosine_similarity'])
    # Encoder
    encoder = Model(input_data, encoded)
    return((autoencoder, encoder))
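The convolutional model pools by 2, 3 and 11, which are three of the factors of 858. The sketch below traces the expected tensor shapes through the encoder and decoder in plain Python, without building the model. It assumes a `channels_last` input of shape (858, 858, 1) and 'same'-padded convolutions that keep the spatial size; `trace_conv_autoencoder` is a name introduced here for illustration.

```python
def trace_conv_autoencoder(size=858, pools=(2, 3, 11), filters=(32, 16, 8)):
    """Trace (height, width, channels) after each pooling/upsampling stage.

    Assumes 'same'-padded convolutions (spatial size unchanged) and
    windows that divide the current size exactly.
    """
    shapes = [(size, size, 1)]
    # Encoder: each conv sets the channel count, each pool shrinks the grid
    for p, f in zip(pools, filters):
        assert size % p == 0, "pool window must divide the current size"
        size //= p
        shapes.append((size, size, f))
    latent = shapes[-1]
    # Decoder: mirror the encoder, upsampling by the same factors in reverse
    for p, f in zip(reversed(pools), reversed(filters)):
        size *= p
        shapes.append((size, size, f))
    shapes.append((size, size, 1))  # final single-channel output convolution
    return shapes, latent


shapes, latent = trace_conv_autoencoder()
for s in shapes:
    print(s)
print("latent values:", latent[0] * latent[1] * latent[2])  # 13 * 13 * 8 = 1352
```

Under these assumptions the encoding has 1,352 values per image, compared with 736,164 input pixels, and the decoder output returns to the input shape.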


In [9]:
# Define parameters
datadir = '../../data/noaa/'
finfo = list_noaagridsatb1_files(datadir)

NX = 858
NY = 858
batch_size = 32

# Simple AE
ae_fc = initialize_fc_autoencoder_noaagridsatb1((NY, NX))
# Debug info
nSample = finfo.shape[0]
print(ae_fc[0].summary())
print("Training autoencoder with data size: "+str(nSample))
steps_train = int(np.ceil(nSample/batch_size))   # steps_per_epoch expects an integer
print("Training data steps: " + str(steps_train))
# Fitting model
ae_fc[0].fit_generator(data_generator_ae(finfo['xuri'], batch_size), steps_per_epoch=steps_train, epochs=3, max_queue_size=batch_size, use_multiprocessing=False, verbose=1)

#ae_conv = initialize_conv_autoencoder_noaagridsatb1((NY, NX))


Input data dimension: (858, 858)
Flatten data dimension: 64
Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(None, 858, 858)]        0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 736164)            0         
_________________________________________________________________
dense_8 (Dense)              (None, 64)                47114560  
_________________________________________________________________
dense_9 (Dense)              (None, 736164)            47850660  
_________________________________________________________________
reshape_3 (Reshape)          (None, 858, 858)          0         
Total params: 94,965,220
Trainable params: 94,965,220
Non-trainable params: 0
_________________________________________________________________
None
Training autoencoder with data size: 8
Training 

<tensorflow.python.keras.callbacks.History at 0x2817ba5b508>
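The parameter counts in the summary above can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output. The sketch below redoes that arithmetic in plain Python; `dense_params` is a helper introduced here for illustration.

```python
def dense_params(n_in, n_out):
    """Parameter count of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


n_pixels = 858 * 858   # size of the flattened image
latent_dim = 64        # size of the encoding

encoder_params = dense_params(n_pixels, latent_dim)
decoder_params = dense_params(latent_dim, n_pixels)

print(n_pixels)                         # 736164
print(encoder_params)                   # 47114560
print(decoder_params)                   # 47850660
print(encoder_params + decoder_params)  # 94965220
```

These values agree with the `Param #` column and the total reported by `summary()`. Nearly all of the roughly 95 million parameters come from connecting every pixel to the 64-unit encoding and back.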