# analyzing sequence data

In this jupyter notebook, we'll compare a few different approaches for analyzing sequence data.

Of course, we'll need some sequence data to analyze.

In tensorflow, a 'sequence' is a 1-dimensional 'list' of data, with each 'data point' being a vector. So:

    [ 1.2, 2.3, 3.4, 4.5 ]

is *not* a sequence in tensorflow, but:

    [ [1.2], [2.3], [3.4], [4.5] ]

*is* a sequence.

You can think of a 'sequence' as similar to a 2-dimensional 'image' in tensorflow, except that sequence data has only one spatial dimension. As with 'image' data, we can talk about 'channels' in sequence data.

The sequence above has a single 'channel', whereas the sequence below has 3 'channels':

    [ [1,2,3], [4,5,6], [7,8,9], [10,11,12] ]

As in the case of 2D 'image' data, the *meaning* of the channels will depend on the *meaning* of the data, which is particular to how the data were collected and what the analysis problem is trying to accomplish.
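
To make the idea of 'channels' concrete, here is a small sketch that prints the shape of each example above using numpy. The variable names are made up for illustration.

```python
import numpy as np

# the three examples from the text; variable names are illustrative only
flat = np.array([1.2, 2.3, 3.4, 4.5])
one_channel = np.array([[1.2], [2.3], [3.4], [4.5]])
three_channels = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

print(flat.shape)            # (4,): no channel dimension, so not a sequence
print(one_channel.shape)     # (4, 1): length 4, 1 channel
print(three_channels.shape)  # (4, 3): length 4, 3 channels
```

The last entry of the shape is the number of channels, and the entry before it is the sequence length.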

For this exercise, we'll just simulate some sequence data, so we can gain some experience analyzing such data and compare different models.

In this case, we'll use the scikit-learn make_classification function to simulate sequences of 128 data points (or 'features'), each of which is one-dimensional. For classification, we'll set 32 features as 'informative', or correlated with class labels, while nearly all of the remaining features in the sequence are 'random' with regard to class labels (ie, not informative). By default, make_classification also generates 2 'redundant' features, which are linear combinations of the informative ones. We'll simulate a total of 38,262 data samples and use binary class labels.

In [None]:
import numpy as np
import tensorflow as tf
import sklearn.datasets
import sklearn.model_selection


# simulate 38,262 samples of 128 features, 32 of them informative (binary labels by default)
x, y = sklearn.datasets.make_classification(n_samples=38262,
                                            n_features=128,
                                            n_informative=32,
                                            random_state=8792439)

# hold out 20% of the samples for validation
train_x, valid_x, train_y, valid_y = sklearn.model_selection.train_test_split(
    x, y, test_size=0.2, random_state=849691)

# add a trailing channel dimension: (samples, 128) -> (samples, 128, 1)
train_x = np.expand_dims(train_x, axis=-1)
valid_x = np.expand_dims(valid_x, axis=-1)

# package as batched tensorflow Datasets
train_data = tf.data.Dataset.from_tensor_slices((train_x, train_y)).batch(32)
valid_data = tf.data.Dataset.from_tensor_slices((valid_x, valid_y)).batch(32)
print(train_data, valid_data)

In this case, we used the scikit-learn function train_test_split to randomly divide our simulated data into 80% training and 20% validation data.

Also notice that we had to use the numpy expand_dims function to transform our simulated data into the correct 'shape' for tensorflow sequences. The scikit-learn make_classification function creates 'sequences' of the form:

    [ a, b, c, d ]

which is *not* a tensorflow sequence, as it lacks the 'channel' dimension. To fix this, we add a new last dimension of size 1, so the data takes on the form:

    [ [a], [b], [c], [d] ]

which is compatible with how tensorflow sequence models work.
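
Here is a minimal sketch of that reshaping on a toy array, small enough to inspect by eye. The array `toy_x` is a made-up stand-in for the output of make_classification, with 2 samples of 4 features each.

```python
import numpy as np

# a toy stand-in for make_classification output: 2 samples, 4 features each
toy_x = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8]])

# add a trailing channel dimension of size 1
toy_seq = np.expand_dims(toy_x, axis=-1)

print(toy_x.shape)          # (2, 4)
print(toy_seq.shape)        # (2, 4, 1)
print(toy_seq[0].tolist())  # [[0.1], [0.2], [0.3], [0.4]]
```

Using axis=-1 adds the new dimension at the end, which is where tensorflow expects the channels by default.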

After getting our data in the correct format for tensorflow, it's easy to create tensorflow Dataset objects using the from_tensor_slices function.

You should see that your Dataset objects have the shape:

    ((None, 128,1), (None,))

So, the explanatory variables (ie, the sequence data) have shape:

    (None, 128, 1)

That is, there are 128 'features' in the sequence, and each feature has dimensionality 1. Another way of saying this is that the sequences have *length* 128, and the sequence data has a single *channel*.
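
The None in the first position is the batch dimension. It is left unspecified because batch(32) keeps a final, smaller batch by default, so not every batch has exactly 32 samples. The back-of-the-envelope sketch below shows how the samples divide into batches, assuming train_test_split rounds the 20% validation count up to a whole sample.

```python
import math

n_samples = 38262
batch_size = 32

# assumption: train_test_split rounds the validation count up to a whole sample
n_valid = math.ceil(0.2 * n_samples)
n_train = n_samples - n_valid

batch_info = {}
for name, n in (("train", n_train), ("valid", n_valid)):
    n_batches = math.ceil(n / batch_size)           # includes the final partial batch
    last_batch = n - (n_batches - 1) * batch_size   # size of that final batch
    batch_info[name] = (n, n_batches, last_batch)
    print(name, n, n_batches, last_batch)
# train 30609 957 17
# valid 7653 240 5
```

Since the final batch in each Dataset is smaller than 32, tensorflow reports the batch dimension as None rather than 32.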

## convolution model

First, we'll analyze our sequence data using a convolution neural-network model, similar to what we did with the 2-dimensional 'image' data.

The only 'difference' we need to be aware of when applying convolution models to sequence data is that we'll need to use a tf.keras.layers.Conv1D object to model the 1-dimensional sequence data, rather than the tf.keras.layers.Conv2D object we used to model 2-dimensional 'image' data. 

For 1D convolutions, the kernel has only a single dimension, so kernel_size takes a single value. We'll set:

    kernel_size=(3,)

when we create the 1D convolution layer. Of course, we could use a larger kernel size, if we wanted to.
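
To get a feel for what a kernel of size 3 does, here is a rough numpy sketch of a single filter sliding along a single-channel sequence. It assumes no padding and a stride of 1 (the Conv1D defaults), and it leaves out the bias and activation. The helper `conv1d_valid` and the kernel weights are made up for illustration; a real layer learns its weights during training.

```python
import numpy as np

def conv1d_valid(sequence, kernel):
    """Slide `kernel` along a single-channel sequence with no padding and stride 1."""
    length, k = len(sequence), len(kernel)
    return np.array([np.dot(sequence[i:i + k], kernel) for i in range(length - k + 1)])

sequence = np.arange(128, dtype=float)   # stand-in for one sequence of length 128
kernel = np.array([1.0, 0.0, -1.0])      # made-up weights; a real layer learns these

output = conv1d_valid(sequence, kernel)
print(output.shape)   # (126,), i.e. 128 - 3 + 1
print(output[:3])     # [-2. -2. -2.]
```

Each output value summarizes a window of 3 neighboring data points, so a larger kernel size would look at a wider window and produce a slightly shorter output.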

For now, we'll just use a single Conv1D layer with ReLU activation in our model. We'll need to decide on the number of filters we want to use in our convolution layer. For now, we'll just set:

    filters=64

as a reasonable starting value.

As with the 2D 'image' data, we'll need to 'flatten' the output of the convolution layer, collapsing its sequence and filter dimensions into a single vector for each sample, before sending it to the classification layer.

This is a binary classification problem, so we'll use a Dense layer with a single unit and sigmoid activation as the 'decision layer' in the network.
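
As a sanity check on the layers described above, the arithmetic below estimates the output sizes and parameter counts we would expect to see from model.summary(). It assumes a kernel size of 3, 64 filters, and the Conv1D defaults of no padding and a stride of 1.

```python
seq_len, channels = 128, 1
kernel_size, filters = 3, 64

# convolution layer: each filter has kernel_size * channels weights plus one bias
conv_len = seq_len - kernel_size + 1
conv_params = kernel_size * channels * filters + filters

# flatten layer: no parameters, just reshapes (conv_len, filters) into one vector
flat_units = conv_len * filters

# decision layer: one weight per flattened value plus one bias
dense_params = flat_units * 1 + 1

print(conv_len, conv_params)        # 126 256
print(flat_units, dense_params)     # 8064 8065
print(conv_params + dense_params)   # 8321
```

Notice that almost all of the parameters sit in the final Dense layer, not in the convolution layer.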

Finally, we'll need to use binary crossentropy loss as our loss function. In this example, we'll use the Adam optimizer, and we'll make sure to track the 'accuracy' of our model as it trains, which is commonly done for classification problems, as humans tend to 'reason' better with accuracy than with loss outputs.
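
To see why it helps to track both numbers, here is a simplified pure-Python sketch of binary crossentropy and accuracy. It is not the tensorflow implementation: the function names, the clipping value and the 0.5 threshold are assumptions made for illustration, and the predictions are invented.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # invented model outputs, far from 0.5
hesitant = [0.6, 0.4, 0.6, 0.4]    # invented model outputs, close to 0.5

for name, preds in (("confident", confident), ("hesitant", hesitant)):
    print(name, accuracy(labels, preds), round(binary_crossentropy(labels, preds), 3))
```

Both sets of predictions are 100% accurate, but the 'hesitant' predictions have a noticeably higher loss. Accuracy is easier to interpret, while the loss also reflects how confident the model is.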

With such a simple model, we should get a reasonable model fit after 50 epochs of training using the Adam optimizer. You might want to make sure you're running the model fit with GPU hardware acceleration, though. Click on the downward-arrow in the upper-right corner; select 'View resources', then click on 'Change Runtime Type' and select GPU hardware acceleration before saving. Then you can run the following code cell.

Training should take about 5 seconds per epoch with GPU acceleration, so roughly 4 to 5 minutes for all 50 epochs.

In [None]:
import numpy as np
import tensorflow as tf
import sklearn.datasets
import sklearn.model_selection

# simulate and package data
x, y = sklearn.datasets.make_classification(n_samples=38262,
                                            n_features=128,
                                            n_informative=32,
                                            random_state=8792439)
train_x, valid_x, train_y, valid_y = sklearn.model_selection.train_test_split(x,y, test_size=0.2, random_state=849691)
train_x = np.expand_dims(train_x, axis=-1)
valid_x = np.expand_dims(valid_x, axis=-1)
train_data = tf.data.Dataset.from_tensor_slices((train_x, train_y)).batch(32)
valid_data = tf.data.Dataset.from_tensor_slices((valid_x, valid_y)).batch(32)
print(train_data, valid_data)

# build model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv1D(filters=64, kernel_size=(3,), activation=tf.keras.activations.relu, input_shape=(128,1)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(units=1, activation=tf.keras.activations.sigmoid))
model.summary()

# compile model
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

# fit model
model.fit(train_data, epochs=50, validation_data=valid_data)