[MAJOR] Revisit model architecture #35

Closed · 3 tasks done
tallamjr opened this issue Nov 27, 2020 · 0 comments

tallamjr commented Nov 27, 2020

It is time to revisit the architecture to ensure there is a clear understanding of what transformations are taking place with respect to multivariate time-series data, and what analogies can be drawn to the Transformer architecture used for sequence modelling of speech.

It was previously thought that, by windowing the input sequence, it would be the windows themselves that attend to one another, but this was mistaken. It is each time step, i.e. each item at each time step, that attends to every other item in the sequence, including itself.
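To make this concrete, here is a minimal sketch (assuming TensorFlow; all shapes and layer sizes are illustrative only) of scaled dot-product self-attention over a multivariate time series, showing that the attention weights form a (timesteps x timesteps) matrix, i.e. every timestep attends to every other timestep:

```python
import tensorflow as tf

# Illustrative shapes only: a batch of light curves with 100 timesteps,
# already embedded into a d_model-dimensional space.
batch_size, timesteps, d_model = 32, 100, 64
x = tf.random.normal((batch_size, timesteps, d_model))

# Query/key/value projections (single head, for clarity).
wq = tf.keras.layers.Dense(d_model)
wk = tf.keras.layers.Dense(d_model)
wv = tf.keras.layers.Dense(d_model)
q, k, v = wq(x), wk(x), wv(x)

# scores[b, i, j] is how strongly timestep i attends to timestep j,
# so every item attends to every other item in the sequence, including itself.
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(d_model, tf.float32))
weights = tf.nn.softmax(scores, axis=-1)   # (batch, timesteps, timesteps)
output = tf.matmul(weights, v)             # (batch, timesteps, d_model)

print(weights.shape, output.shape)  # (32, 100, 100) (32, 100, 64)
```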


It was also previously thought that convolving inputs of shape (BATCH_SIZE, timesteps, num_features) would "preserve temporal information". The (incorrect) reasoning behind this came from the following passage in the "Hands-On ML" book:

Keras offers a TimeDistributed layer ... it wraps any layer (e.g., a Dense layer) and applies it at every time step of its input sequence. It does this efficiently, by reshaping the inputs so that each time step is treated as a separate instance (i.e., it reshapes the inputs from [batch size, time steps, input dimensions] to [batch size × time steps, input dimensions];

The Dense layer actually supports sequences as inputs (and even higher-dimensional inputs): it handles them just like TimeDistributed(Dense(...)), meaning it is applied to the last input dimension only (independently across all time steps). Thus, we could replace the last layer with just Dense(10). For the sake of clarity, however, we will keep using TimeDistributed(Dense(10)) because it makes it clear that the Dense layer is applied independently at each time step and that the model will output a sequence, not just a single vector.

  • Note that a TimeDistributed(Dense(n)) layer is equivalent to a Conv1D(n, filter_size=1) layer.
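As a quick sanity check of the note above (a minimal sketch, assuming TensorFlow/Keras; the layer width of 16 is arbitrary), Dense, TimeDistributed(Dense) and Conv1D with kernel_size=1 do indeed apply the same per-timestep linear map to a (batch, timesteps, features) input:

```python
import numpy as np
import tensorflow as tf

x = tf.random.normal((4, 100, 6))  # e.g. 100 timesteps of 6-D observations

dense = tf.keras.layers.Dense(16)
td = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(16))
conv = tf.keras.layers.Conv1D(16, kernel_size=1)

# All three produce the same output shape, applied independently per timestep.
print(dense(x).shape, td(x).shape, conv(x).shape)  # (4, 100, 16) for all three

# Copying the Dense kernel/bias into the Conv1D filter shows they compute
# exactly the same transformation.
kernel, bias = dense.get_weights()
conv.set_weights([kernel[np.newaxis, ...], bias])  # (6, 16) -> (1, 6, 16)
print(np.allclose(dense(x), conv(x), atol=1e-5))   # True
```

Crucially, none of these layers mixes information across timesteps, which is exactly why, on their own, they cannot preserve or exploit temporal ordering.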

What became apparent is that, just because the convolution is applied at each time step individually, this has no bearing on temporal information being preserved. In fact, the "Attention Is All You Need" paper states:

3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed.

Therefore, it is believed that a Positional Encoding is required to "bring back" the temporal information that is present in the original sequence.
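A minimal sketch of such a positional encoding, following the sinusoidal formulation in the paper and in the TensorFlow transformer tutorial linked in the TODO below (the max_position and d_model values are placeholders):

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_position)[:, np.newaxis]  # (max_position, 1)
    dims = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rads = positions / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cos

    # Shape (1, max_position, d_model) so it broadcasts over the batch dimension.
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)

print(positional_encoding(max_position=100, d_model=32).shape)  # (1, 100, 32)
```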

Whilst the Transformer architecture is used for input sequences of words (sentences), an analogy can be drawn to astrophysical transients, in the sense of a light curve being a sentence and the 6-D observation at each time step being equivalent to a word. Considering only the EncodingLayer for now, the encoder takes as input a batch of sentences/light curves represented as sequences of word IDs/6-D observations (the input shape is [batch size, max input sentence length]), and it encodes each word/6-D observation into a 512-dimensional/d_model representation (so the encoder's output shape is [batch size, max input sentence length, d_model]).


So, for our model, it will take in a full light curve, consisting of N timesteps for each object. It will then apply a convolutional embedding at each timestep to transform the data from [batch size, N-timesteps, 6-D] --> [batch size, N-timesteps, d_model]. From here, a positional encoding will be calculated using trigonometric functions to determine the position of each observation in the sequence. These are then summed together to produce an input of shape [batch size, max input sentence length (== N-timesteps), d_model]. At this point, the EncodingLayer will process this input through the multi-head self-attention layers as well as the other sub-layers.
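A shape walk-through of that pipeline (a sketch only; the layer names are illustrative rather than the eventual astronet API, and it reuses the positional_encoding sketch above plus the built-in tf.keras.layers.MultiHeadAttention, available in TF >= 2.4, as a stand-in for the full EncodingLayer):

```python
import tensorflow as tf

batch_size, n_timesteps, n_features, d_model = 32, 100, 6, 64

light_curves = tf.random.normal((batch_size, n_timesteps, n_features))

# 1. Convolutional embedding: [batch, N-timesteps, 6-D] -> [batch, N-timesteps, d_model]
embedding = tf.keras.layers.Conv1D(d_model, kernel_size=1)
x = embedding(light_curves)

# 2. Positional encodings have the same d_model dimension, so the two can be summed.
x = x + positional_encoding(max_position=n_timesteps, d_model=d_model)

# 3. The EncodingLayer then runs multi-head self-attention (plus the other
#    sub-layers, omitted here) over the full sequence.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=d_model)
encoded = mha(query=x, value=x, key=x)

print(encoded.shape)  # (32, 100, 64)
```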


Going forward, the first item required is to implement a PositionalEncoding class. Following this, a refactor of the architecture as a whole will need to be looked at. Furthermore, the plasticc data preprocessing that created the windowing should be looked into; this should be revised to 100 (N x GPs), where an input is a whole sequence, i.e. a single light curve for a single object.

TODO

  • PositionalEncoding class
  • Refactor model.py drawing examples from Hands-On book and tensorflow documentation: https://www.tensorflow.org/tutorials/text/transformer
  • Ensure parquet file is correct with "appropriate" windowing taking place. Perhaps reduce number of GPs (investigate)

Refs:

tallamjr added a commit that referenced this issue Dec 3, 2020
Since there is a direct relation with the number of timesteps and
number_gps, this should be set at time of loading the dataset and also
when creating the related parquet file on disk

See #33 (comment) for comments
and #35 for discussion

	modified:   astronet/t2/utils.py