[MAJOR] Revisit model architecture #35

Closed · 3 tasks done
tallamjr opened this issue Nov 27, 2020 · 0 comments

tallamjr commented Nov 27, 2020

It is time to revisit the architecture to ensure there is a clear understanding of what transformations are taking place with respect to multivariate time-series data, and what analogies can be drawn to the Transformer architecture used for sequence modelling of speech.

It was previously thought that, by windowing the input sequence, it would be the windows themselves that attend to one another, but this was mistaken. It is each time step, i.e. each item at each time step, that attends to every other item in the sequence, including itself.
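To make this concrete, here is a minimal sketch (assuming TensorFlow; all shapes and layer sizes are illustrative only) of scaled dot-product self-attention over a multivariate time series, showing that the attention weights form a (timesteps x timesteps) matrix, i.e. every timestep attends to every other timestep:

```python
import tensorflow as tf

# Illustrative shapes only: a batch of light curves with 100 timesteps,
# already embedded into a d_model-dimensional space.
batch_size, timesteps, d_model = 32, 100, 64
x = tf.random.normal((batch_size, timesteps, d_model))

# Query/key/value projections (single head, for clarity).
wq = tf.keras.layers.Dense(d_model)
wk = tf.keras.layers.Dense(d_model)
wv = tf.keras.layers.Dense(d_model)
q, k, v = wq(x), wk(x), wv(x)

# scores[b, i, j] is how strongly timestep i attends to timestep j,
# so every item attends to every other item in the sequence, including itself.
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(d_model, tf.float32))
weights = tf.nn.softmax(scores, axis=-1)   # (batch, timesteps, timesteps)
output = tf.matmul(weights, v)             # (batch, timesteps, d_model)

print(weights.shape, output.shape)  # (32, 100, 100) (32, 100, 64)
```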


It was also previously thought that convolving inputs of shape (BATCH_SIZE, timesteps, num_features) would "preserve temporal information". The (incorrect) reasoning behind this came from the following passage in the "Hands-On ML" book:

Keras offers a TimeDistributed layer ... it wraps any layer (e.g., a Dense layer) and applies it at every time step of its input sequence. It does this efficiently, by reshaping the inputs so that each time step is treated as a separate instance (i.e., it reshapes the inputs from [batch size, time steps, input dimensions] to [batch size × time steps, input dimensions];

The Dense layer actually supports sequences as inputs (and even higher-dimensional inputs): it handles them just like TimeDistributed(Dense(...)), meaning it is applied to the last input dimension only (independently across all time steps). Thus, we could replace the last layer with just Dense(10). For the sake of clarity, however, we will keep using TimeDistributed(Dense(10)) because it makes it clear that the Dense layer is applied independently at each time step and that the model will output a sequence, not just a single vector.

  • Note that a TimeDistributed(Dense(n)) layer is equivalent to a Conv1D(n, filter_size=1) layer.
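As a quick sanity check of the note above (a minimal sketch, assuming TensorFlow/Keras; the layer width of 16 is arbitrary), Dense, TimeDistributed(Dense) and Conv1D with kernel_size=1 do indeed apply the same per-timestep linear map to a (batch, timesteps, features) input:

```python
import numpy as np
import tensorflow as tf

x = tf.random.normal((4, 100, 6))  # e.g. 100 timesteps of 6-D observations

dense = tf.keras.layers.Dense(16)
td = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(16))
conv = tf.keras.layers.Conv1D(16, kernel_size=1)

# All three produce the same output shape, applied independently per timestep.
print(dense(x).shape, td(x).shape, conv(x).shape)  # (4, 100, 16) for all three

# Copying the Dense kernel/bias into the Conv1D filter shows they compute
# exactly the same transformation.
kernel, bias = dense.get_weights()
conv.set_weights([kernel[np.newaxis, ...], bias])  # (6, 16) -> (1, 6, 16)
print(np.allclose(dense(x), conv(x), atol=1e-5))   # True
```

Crucially, none of these layers mixes information across timesteps, which is exactly why, on their own, they cannot preserve or exploit temporal ordering.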

What became apparent is that, just because the convolution is applied at each time step individually, this has no bearing on temporal information being preserved. In fact, the "Attention Is All You Need" paper states:

3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed.

Therefore, it is believed that a Positional Encoding is required to "bring back" the temporal information that is present in the original sequence.
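A minimal sketch of such a positional encoding, following the sinusoidal formulation in the paper and in the TensorFlow transformer tutorial linked in the TODO below (the max_position and d_model values are placeholders):

```python
import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(max_position)[:, np.newaxis]  # (max_position, 1)
    dims = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rads = positions / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sin
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cos

    # Shape (1, max_position, d_model) so it broadcasts over the batch dimension.
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)

print(positional_encoding(max_position=100, d_model=32).shape)  # (1, 100, 32)
```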

Whilst the Transformer architecture is used for input sequences of words (sentences), an analogy can be drawn to astrophysical transients, in the sense of a light curve being a sentence and the 6-D observation at each time step being equivalent to a word. Considering only the EncodingLayer for now, the encoder takes as input a batch of sentences/light curves represented as sequences of word IDs/6-D observations (the input shape is [batch size, max input sentence length]), and it encodes each word/6-D observation into a 512-dimensional/d_model representation (so the encoder's output shape is [batch size, max input sentence length, d_model]).


So, for our model, it will take in a full light curve, consisting of N timesteps for each object. It will then apply a convolutional embedding at each timestep to transform the data from [batch size, N-timesteps, 6-D] --> [batch size, N-timesteps, d_model]. From here, a positional encoding will be calculated using trigonometric functions to determine the position of each observation in the sequence. These are then summed together to produce an input of shape [batch size, max input sentence length (== N-timesteps), d_model]. At this point, the EncodingLayer will process this input through the multi-head self-attention layers as well as the other sub-layers.
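A shape walk-through of that pipeline (a sketch only; the layer names are illustrative rather than the eventual astronet API, and it reuses the positional_encoding sketch above plus the built-in tf.keras.layers.MultiHeadAttention, available in TF >= 2.4, as a stand-in for the full EncodingLayer):

```python
import tensorflow as tf

batch_size, n_timesteps, n_features, d_model = 32, 100, 6, 64

light_curves = tf.random.normal((batch_size, n_timesteps, n_features))

# 1. Convolutional embedding: [batch, N-timesteps, 6-D] -> [batch, N-timesteps, d_model]
embedding = tf.keras.layers.Conv1D(d_model, kernel_size=1)
x = embedding(light_curves)

# 2. Positional encodings have the same d_model dimension, so the two can be summed.
x = x + positional_encoding(max_position=n_timesteps, d_model=d_model)

# 3. The EncodingLayer then runs multi-head self-attention (plus the other
#    sub-layers, omitted here) over the full sequence.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=d_model)
encoded = mha(query=x, value=x, key=x)

print(encoded.shape)  # (32, 100, 64)
```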


Going forward, the first item required is to implement a PositionalEncoding class. Following this, a refactor of the architecture as a whole will need to be looked at. Furthermore, the plasticc data preprocessing that created the windowing should be looked into; this should be revised to 100 (N x GPs), where an input is a whole sequence, i.e. a single light curve for a single object.

TODO

  • PositionalEncoding class
  • Refactor model.py drawing examples from Hands-On book and tensorflow documentation: https://www.tensorflow.org/tutorials/text/transformer
  • Ensure parquet file is correct with "appropriate" windowing taking place. Perhaps reduce number of GPs (investigate)

Refs:

tallamjr added a commit that referenced this issue Dec 3, 2020
Since there is a direct relation with the number of timesteps and
number_gps, this should be set at time of loading the dataset and also
when creating the related parquet file on disk

See #33 (comment) for comments
and #35 for discussion

	modified:   astronet/t2/utils.py