
About inputs to the decoder #223

Closed
puzzlecollector opened this issue Sep 5, 2021 · 15 comments

Comments

@puzzlecollector

@zhouhaoyi
Suppose I want to input a 28-length sequence X_i,...,X_{i+27} and want to predict the 28-length sequence X_{i+28},...,X_{i+55}. Then the encoder will take in (X_i,...,X_{i+27}) as input and the decoder will take in (X_i,...,X_{i+27},0,...,0). Is my understanding correct? Is this what you meant in the paper when you said you concat the start token and the zero placeholder for the target?

@puzzlecollector
Author

@zhouhaoyi
By (X_i,...,X_{i+27},0,...,0) I mean the 28-length sequence that was passed to the encoder, concatenated with the 28-length zero-padded target sequence that the decoder has to predict.

@puzzlecollector
Author

@zhouhaoyi Or does the encoder receive (X_i,...,X_{i+27},X_{i+28},...,X_{i+55}) and the decoder receive (X_i,...,X_{i+27},0,...,0)?

@cookieminions
Collaborator

@zhouhaoyi
Suppose I want to input a 28-length sequence X_i,...,X_{i+27} and want to predict the 28-length sequence X_{i+28},...,X_{i+55}. Then the encoder will take in (X_i,...,X_{i+27}) as input and the decoder will take in (X_i,...,X_{i+27},0,...,0). Is my understanding correct? Is this what you meant in the paper when you said you concat the start token and the zero placeholder for the target?

Hi, your understanding is correct. The input of the Encoder is (X_i,...,X_{i+27}) and the input of the Decoder can be (X_j,...,X_{i+27},0,...,0), where i<=j<=i+27. (X_{i+28},...,X_{i+55}) is the ground truth and we do not use it as input.
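As a concrete illustration, here is a minimal PyTorch sketch of that decoder input (this is not the repo's own data loader; the tensor names are made up for the example):

import torch

seq_len, label_len, pred_len, n_features = 28, 28, 28, 1
x_enc = torch.randn(1, seq_len, n_features)            # X_i, ..., X_{i+27}

# start token: the last label_len steps of the known history (here the whole encoder window)
start_token = x_enc[:, -label_len:, :]
# zero placeholder standing in for the unknown X_{i+28}, ..., X_{i+55}
placeholder = torch.zeros(1, pred_len, n_features)

x_dec = torch.cat([start_token, placeholder], dim=1)   # [1, label_len + pred_len, n_features] = [1, 56, 1]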

@zhouhaoyi
Owner

Thanks @cookieminions

@puzzlecollector
Author

puzzlecollector commented Sep 6, 2021

@zhouhaoyi @cookieminions

Thank you for the kind answers. I have another question: when looking at the InformerStack class in Informer2020/models/model.py, what do x_mark_enc and x_mark_dec represent in the forward function? What information do I have to pass for those two parameters?

class InformerStack(nn.Module):
    def __init__(self, enc_in, dec_in, c_out, seq_len, label_len, out_len, 
                factor=5, d_model=512, n_heads=8, e_layers=[3,2,1], d_layers=2, d_ff=512, 
                dropout=0.0, attn='prob', embed='fixed', freq='h', activation='gelu',
                output_attention = False, distil=True, mix=True,
                device=torch.device('cuda:0')):
        super(InformerStack, self).__init__()
        self.pred_len = out_len
        self.attn = attn
        self.output_attention = output_attention

        # Encoding
        self.enc_embedding = DataEmbedding(enc_in, d_model, embed, freq, dropout)
        self.dec_embedding = DataEmbedding(dec_in, d_model, embed, freq, dropout)
        # Attention
        Attn = ProbAttention if attn=='prob' else FullAttention
        # Encoder

        inp_lens = list(range(len(e_layers))) # [0,1,2,...] you can customize here
        encoders = [
            Encoder(
                [
                    EncoderLayer(
                        AttentionLayer(Attn(False, factor, attention_dropout=dropout, output_attention=output_attention), 
                                    d_model, n_heads, mix=False),
                        d_model,
                        d_ff,
                        dropout=dropout,
                        activation=activation
                    ) for l in range(el)
                ],
                [
                    ConvLayer(
                        d_model
                    ) for l in range(el-1)
                ] if distil else None,
                norm_layer=torch.nn.LayerNorm(d_model)
            ) for el in e_layers]
        self.encoder = EncoderStack(encoders, inp_lens)
        # Decoder
        self.decoder = Decoder(
            [
                DecoderLayer(
                    AttentionLayer(Attn(True, factor, attention_dropout=dropout, output_attention=False), 
                                d_model, n_heads, mix=mix),
                    AttentionLayer(FullAttention(False, factor, attention_dropout=dropout, output_attention=False), 
                                d_model, n_heads, mix=False),
                    d_model,
                    d_ff,
                    dropout=dropout,
                    activation=activation,
                )
                for l in range(d_layers)
            ],
            norm_layer=torch.nn.LayerNorm(d_model)
        )
        # self.end_conv1 = nn.Conv1d(in_channels=label_len+out_len, out_channels=out_len, kernel_size=1, bias=True)
        # self.end_conv2 = nn.Conv1d(in_channels=d_model, out_channels=c_out, kernel_size=1, bias=True)
        self.projection = nn.Linear(d_model, c_out, bias=True)
        
    def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec, 
                enc_self_mask=None, dec_self_mask=None, dec_enc_mask=None):
        enc_out = self.enc_embedding(x_enc, x_mark_enc)
        enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)

        dec_out = self.dec_embedding(x_dec, x_mark_dec)
        dec_out = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask)
        dec_out = self.projection(dec_out)
        
        # dec_out = self.end_conv1(dec_out)
        # dec_out = self.end_conv2(dec_out.transpose(2,1)).transpose(1,2)
        if self.output_attention:
            return dec_out[:,-self.pred_len:,:], attns
        else:
            return dec_out[:,-self.pred_len:,:] # [B, L, D]

@cookieminions
Collaborator

Hi, x_mark_enc and x_mark_dec are the timestamps of x_enc and x_dec; enc_embedding and dec_embedding use them to add time features to the model inputs.
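For example, a minimal sketch of building such timestamp features with pandas, assuming hourly data and the month/day/weekday/hour column order used by TemporalEmbedding in embed.py (extract_time_features is a made-up helper, not the repo's utility):

import numpy as np
import pandas as pd

def extract_time_features(dates):
    # one integer column per time feature: month, day, weekday, hour
    idx = pd.DatetimeIndex(dates)
    return np.stack([idx.month, idx.day, idx.dayofweek, idx.hour], axis=-1)

stamps = pd.date_range("2016-01-01", periods=28, freq="H")
x_mark_enc = extract_time_features(stamps)   # shape [28, 4]; add a batch dimension before feeding the model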

@puzzlecollector
Author

puzzlecollector commented Sep 6, 2021

@cookieminions So I am guessing that this information corresponds to the section "Appendix B: The Uniform Input Representation" in the paper. If my data is a daily time series (Monday, Tuesday, Wednesday, ...), then what should I pass to the x_mark_enc and x_mark_dec arguments? Is it literally the timestamp data? So like "2016-01-01 Friday", "2016-01-02 Saturday", ... for instance.

@puzzlecollector
Author

puzzlecollector commented Sep 6, 2021

@cookieminions
I have taken a look at Informer2020/models/embed.py and I found this class

class TemporalEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='fixed', freq='h'):
        super(TemporalEmbedding, self).__init__()

        minute_size = 4; hour_size = 24
        weekday_size = 7; day_size = 32; month_size = 13

        Embed = FixedEmbedding if embed_type=='fixed' else nn.Embedding
        if freq=='t':
            self.minute_embed = Embed(minute_size, d_model)
        self.hour_embed = Embed(hour_size, d_model)
        self.weekday_embed = Embed(weekday_size, d_model)
        self.day_embed = Embed(day_size, d_model)
        self.month_embed = Embed(month_size, d_model)
    
    def forward(self, x):
        x = x.long()
        
        minute_x = self.minute_embed(x[:,:,4]) if hasattr(self, 'minute_embed') else 0.
        hour_x = self.hour_embed(x[:,:,3])
        weekday_x = self.weekday_embed(x[:,:,2])
        day_x = self.day_embed(x[:,:,1])
        month_x = self.month_embed(x[:,:,0])
        
        return hour_x + weekday_x + day_x + month_x + minute_x

So my timestamps are in year-month-day format, ranging from 2016-01-01 to 2020-09-28. I guess I can modify the class so that I have

class TemporalEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='fixed', freq='h'):
        super(TemporalEmbedding, self).__init__()

        minute_size = 4; hour_size = 24
        weekday_size = 7; day_size = 32; month_size = 13; year_size = 2022

        Embed = FixedEmbedding if embed_type=='fixed' else nn.Embedding
        if freq=='t':
            self.minute_embed = Embed(minute_size, d_model)
        self.hour_embed = Embed(hour_size, d_model)
        self.weekday_embed = Embed(weekday_size, d_model)
        self.day_embed = Embed(day_size, d_model)
        self.month_embed = Embed(month_size, d_model) 
        self.year_embed = Embed(year_size, d_model) 
    
    def forward(self, x):
        x = x.long()
        
        #minute_x = self.minute_embed(x[:,:,4]) if hasattr(self, 'minute_embed') else 0.
        #hour_x = self.hour_embed(x[:,:,3])
        weekday_x = self.weekday_embed(x[:,:,3])
        day_x = self.day_embed(x[:,:,2])
        month_x = self.month_embed(x[:,:,1])
        year_x = self.year_embed(x[:,:,0]) 
        return year_x + weekday_x + day_x + month_x  # minute_x is commented out above; use year_x instead

where x[:,:,0] = 2016, x[:,:,1] = 1, x[:,:,2] = 1, x[:,:,3] = 5 for date 2016-01-01 Friday.

Here I can encode Sunday, Monday, Tuesday, ... as 0, 1, 2, ...

and I will set freq = 'd' when I declare the model.
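For reference, a minimal sketch of how the (year, month, day, weekday) marks above could be built for daily data with pandas (make_daily_marks is a hypothetical helper, not part of the repo):

import numpy as np
import pandas as pd

def make_daily_marks(dates):
    idx = pd.DatetimeIndex(dates)
    # pandas uses Monday=0, so shift to the Sunday=0 encoding described above
    weekday = (idx.dayofweek + 1) % 7
    return np.stack([idx.year, idx.month, idx.day, weekday], axis=-1)

marks = make_daily_marks(pd.date_range("2016-01-01", periods=28, freq="D"))
print(marks[0])   # [2016, 1, 1, 5] -> 2016-01-01 is a Friday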

@puzzlecollector
Author

puzzlecollector commented Sep 6, 2021

@cookieminions
I think I kind of figured out the timestamp problem. But now I have a slightly different issue. Suppose I want to predict the price of 21 agricultural goods for the next 28 days (given the past 28 days). So I declared the model as follows

from Informer2020.models.model import Informer, InformerStack

model = InformerStack(enc_in = 21, 
                      dec_in = 21, 
                      c_out = 21, 
                      seq_len = 28, 
                      label_len = 28, 
                      out_len = 56, 
                      freq = 'd') 
model.cuda()

Now, did I define the model parameters correctly? So the decoder takes in 21 sequences, where each sequence has a starter length of 28 and a prediction length of 28 (so we want to predict the next 28 days). And I assume out_len is 56 because it is the sum of the starter sequence and the zero-padded target tokens?

So if the batch size is 32, the dimensions of my inputs are as follows:

encoder_input: [32,28,21]
decoder_input: [32,56,21]
target: [32,28,21]
encoder_marks: [32,28,4] (represents the timestamps year, month, day, weekday)
decoder_marks: [32,28,4] (represents the timestamps year, month, day, weekday)

But when I do model.forward() I get the following error

[Screenshot of the error traceback from model.forward()]

I tried running a simple code like this

cnt = 0 
for batch_item in train_dataloader:
    encoder_input = batch_item['encoder_input'].to(device) 
    decoder_input = batch_item['decoder_input'].to(device) 
    target = batch_item['target'].to(device) 
    enc_marks = batch_item['encoder_marks'].to(device) 
    dec_marks = batch_item['decoder_marks'].to(device) 
    
    print(encoder_input.shape, decoder_input.shape, target.shape, enc_marks.shape, dec_marks.shape)
    
    pred = model(x_enc=encoder_input, x_mark_enc=enc_marks, x_dec=decoder_input, x_mark_dec=dec_marks)
    
    if cnt == 0: 
        break 

@cookieminions
Collaborator

Hi, out_len of the decoder needs to be set to 28: seq_len is the length of the encoder input, label_len is the length of the start token, and out_len is the length of the prediction series.
If you set label_len=28 and out_len=28, the decoder input will be [32, 56, 21], and you need to make dec_marks [32, 56, 4], because dec_marks contains the timestamps of both the start token and the prediction series.
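Put differently, a minimal sketch of the expected shapes (made-up tensor names; batch size 32, 21 variables, label_len=28, pred_len=28, 4 time features):

import torch

batch, label_len, pred_len, n_vars, n_time = 32, 28, 28, 21, 4

x_enc      = torch.randn(batch, 28, n_vars)    # encoder input   [32, 28, 21]
x_mark_enc = torch.zeros(batch, 28, n_time)    # encoder stamps  [32, 28, 4]

# decoder input = start token (the last 28 known steps) + zero placeholder for the 28 steps to predict
x_dec      = torch.cat([x_enc[:, -label_len:, :],
                        torch.zeros(batch, pred_len, n_vars)], dim=1)   # [32, 56, 21]
x_mark_dec = torch.zeros(batch, label_len + pred_len, n_time)           # [32, 56, 4]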

@puzzlecollector
Author

puzzlecollector commented Sep 8, 2021

@cookieminions @zhouhaoyi
Thank you so much for your kind replies. I was able to successfully train the model on my dataset and the results are much better than other baselines (e.g. seq2seq+attention, ARIMA)!

I have a question though. During inference, although I set the model to eval mode with model.eval() and run the forward pass under torch.no_grad(), the predictions are slightly different every time. Is this expected behavior? My inference code goes something like this

test_model.eval()
### some code ###
with torch.no_grad():
    outputs = test_model(x_enc=test_encoder_inputs,
                         x_mark_enc=test_encoder_marks,
                         x_dec=test_decoder_inputs,
                         x_mark_dec=test_decoder_marks)

    outputs = outputs.cpu()
    outputs = outputs * torch_norm  # multiply output by some numbers to de-normalize

and the outputs are slightly different every time, even though eval() mode is on and I am using torch.no_grad().

@puzzlecollector
Author

@zhouhaoyi

Oh, I guess it was because of the ProbSparse attention. If I just use the full transformer (setting attn = 'full') I do not have the problem of inconsistent outputs. By the way, even if the outputs are inconsistent, they are not supposed to deviate that much, right?

@zhouhaoyi
Owner

I think the inconsistency comes from the current implementation of ProbSparse, in which the unselected attention may refer to the same leaf node rather than its original one. Please give a brief description of your architecture; it may help us locate the problem. Thanks!

@Zero-coder

nice discussion!

@HasnainKhanNiazi

HasnainKhanNiazi commented Dec 20, 2022

Hey guys @zhouhaoyi @cookieminions @puzzlecollector, such a nice discussion to follow. I am working on Informer for a multivariate problem with 94 features and one output target, and I have a question about model inference/prediction. In the _process_one_batch method, the decoder input is first initialized with zeros and then concatenated with batch_y. I removed the if conditions since I am working with padding==0. I am training in MS mode, but I am not passing the target variable in the input: I changed the read_data method so that seq_x contains only the features.

    # decoder input
    dec_inp = torch.zeros([batch_y.shape[0], self.args.pred_len, batch_y.shape[-1]]).float()
    dec_inp = torch.cat([batch_y[:,:self.args.label_len,:], dec_inp], dim=1).float().to(self.device)

My question is: why are we concatenating the batch_y values, since in real time we will not have batch_y? I trained a model and the results look very promising with the above decoder input, but if I don't use the batch_y concatenation part then the results aren't as good.
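For context, the label_len steps taken from batch_y overlap with the end of the input window in the repo's sliding-window loaders, so at prediction time the same decoder input can be built from the known history alone; only the pred_len placeholder is unknown. A hedged sketch with made-up tensor names:

import torch

label_len, pred_len, n_features = 48, 24, 94   # example values, not necessarily yours

history = torch.randn(1, 96, n_features)       # the most recent observed window

start_token = history[:, -label_len:, :]                                   # known past values only
placeholder = torch.zeros(history.shape[0], pred_len, history.shape[-1])   # zeros for the unknown future
dec_inp = torch.cat([start_token, placeholder], dim=1)                     # no future targets needed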
