[MAJOR] Revisit model architecture #35
Labels: `1 - refactor` (DRY conflict; refactor required), `1 - tests` (a failed test or tests need updating), `2 - enhancement` (a request or update to existing functionality), `3 - documentation` (improvements or additions to documentation), `3 - question` (general questions)
It is time to revisit the architecture to ensure there is a clear understanding of what transformations are taking place with respect to multivariate time-series data, and what analogies can be drawn to the Transformer architecture used for sequence modelling of speech.
It was previously thought that, by windowing the input sequence, it would be these windows that attend to one another, but this was mistaken. It is each time step, i.e. each item at each time step, that attends to every other item in the sequence, including itself.
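To make the point concrete, here is a minimal sketch (single-head, no learned projections, NumPy rather than TensorFlow) of scaled dot-product self-attention over the time axis: every row (timestep) attends to every row, including itself. The shapes are illustrative, not taken from the actual model.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over the time axis.

    x: array of shape (timesteps, d_model). Each timestep (row)
    attends to every other timestep in the sequence, itself included.
    """
    d_model = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_model)             # (timesteps, timesteps)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over rows
    return weights @ x                              # (timesteps, d_model)

x = np.random.randn(5, 8)      # 5 timesteps, d_model = 8 (hypothetical sizes)
out = self_attention(x)
print(out.shape)               # (5, 8)
```

Note the attention matrix is `(timesteps, timesteps)`: the interaction is between timesteps, not between windows of timesteps.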
It was also previously thought that convolving inputs of shape `(BATCH_SIZE, timesteps, num_features)` would "preserve temporal information". The (wrongful) reasoning behind this came from a passage in the "Hands-On ML" book. What became apparent is that, just because the convolution is applied at each timestep individually, this has no bearing on temporal information being preserved; in fact, the "Attention Is All You Need" paper states that, since the model contains no recurrence and no convolution, information about the position of the tokens in the sequence must be injected. Therefore, it is believed that Positional Encoding is required to "bring back" the temporal information that is originally present.
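For reference, a minimal NumPy sketch of the sinusoidal positional encoding from "Attention Is All You Need" (the same scheme used in the TensorFlow transformer tutorial linked below); the function name and sizes here are illustrative:

```python
import numpy as np

def positional_encoding(n_timesteps, d_model):
    """Sinusoidal positional encoding ("Attention Is All You Need"):

    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pos = np.arange(n_timesteps)[:, None]          # (timesteps, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                     # (timesteps, d_model)
    pe = np.zeros((n_timesteps, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd indices: cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)    # (100, 512)
```

Because each position gets a unique pattern of sines and cosines, adding `pe` to the embedded sequence re-injects the ordering information that the per-timestep convolution ignores.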
Whilst the Transformer architecture is used for input sequences of words (sentences), an analogy can be drawn to astrophysical transients: a light curve is a sentence, and the 6-D observation at each time step is equivalent to a word. Considering the `EncodingLayer` only for now, the encoder takes as input a batch of sentences/light curves represented as sequences of word IDs/6-D observations (input shape `[batch size, max input sentence length]`), and it encodes each word/6-D observation into a 512-dimensional/`d-model` representation (so the encoder's output shape is `[batch size, max input sentence length, d-model]`).

So, for our model, it will take in a full light curve, consisting of N timesteps for each object. It will then apply a convolutional embedding to each timestep to transform the data from `[batch size, N-timesteps, 6-D]` --> `[batch size, N-timesteps, d-model]`. From here, a positional encoding will be calculated using trigonometric functions to determine the position of each observation in the sequence. These are then summed together to produce an input of shape `[batch size, max input sentence length (== N-timesteps), d-model]`. At this point, the `EncodingLayer` will process this input through the multi-head self-attention layers, as well as other layers.

Going forward, the first item required is to implement a `PositionalEncoding` class. Following this, a refactor of the architecture as a whole will need to be looked at. Furthermore, the plasticc data preprocessing that created the windowing should be revisited; it should be revised to 100 (N x GPs), where an input is a whole sequence, i.e. a single light curve for a single object.

TODO
- [ ] Implement a `PositionalEncoding` class
- [ ] Refactor `model.py`, drawing examples from the Hands-On book and the tensorflow documentation: https://www.tensorflow.org/tutorials/text/transformer
- [ ] Check the `parquet` file is correct, with "appropriate" windowing taking place. Perhaps reduce the number of GPs (investigate)

Refs:
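As a sanity check on the shapes described above, a minimal NumPy sketch of the embedding-plus-positional-encoding pipeline. All sizes are hypothetical, and the Conv1D embedding is modelled here as a `kernel_size=1` convolution, which is equivalent to applying the same dense projection to every timestep independently:

```python
import numpy as np

# Hypothetical sizes for illustration: 32 light curves per batch,
# 100 timesteps each, 6 features per observation, d_model = 64.
batch_size, n_timesteps, n_features, d_model = 32, 100, 6, 64

rng = np.random.default_rng(42)
x = rng.normal(size=(batch_size, n_timesteps, n_features))

# "Convolutional embedding": a Conv1D with kernel_size=1 is just the same
# (n_features -> d_model) linear map applied at each timestep.
w = rng.normal(size=(n_features, d_model))
embedded = x @ w                               # (batch, timesteps, d_model)

# Sinusoidal positional encoding over the timesteps.
pos = np.arange(n_timesteps)[:, None]          # (timesteps, 1)
i = np.arange(d_model)[None, :]                # (1, d_model)
angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Sum embedding and positional encoding (broadcast over the batch)
# to form the EncodingLayer's input.
encoder_input = embedded + pe[None, :, :]
print(encoder_input.shape)                     # (32, 100, 64)
```

This reproduces the `[batch size, N-timesteps, 6-D]` --> `[batch size, N-timesteps, d-model]` transformation before the multi-head self-attention layers; in the actual model the projection `w` would be a learned Keras layer rather than a fixed random matrix.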