IndexError when using 'learned' or 'fixed' in args.embed #70

Closed
lzmax888 opened this issue Mar 24, 2021 · 21 comments
Labels
discussion This only contains independent questions

Comments

@lzmax888

lzmax888 commented Mar 24, 2021

args.model = 'informerstack' # model of experiment, options: [informer, informerstack, informerlight(TBD)]

args.data = 'custom' # data
args.root_path = './' # root path of data file
args.data_path = 'test.csv' # data file
args.features = 'S' # forecasting task, options:[M, S, MS(TBD)]; M:multivariate predict multivariate, S:univariate predict univariate, MS:multivariate predict univariate
args.target = 'target' # target feature in S or MS task
args.freq = 't' # freq for time features encoding

args.seq_len = 128 # input sequence length of Informer encoder
args.label_len = 96 # start token length of Informer decoder
args.pred_len = 15 # prediction sequence length

args.enc_in = 1 # encoder input size number of features in input
args.dec_in = 1 # decoder input size number of features
args.c_out = 7 # output size output dimension before FN
args.factor = 5 # probsparse attn factor
args.d_model = 512 # dimension of model
args.n_heads = 8 # num of heads
args.e_layers = 3 # num of encoder layers
args.d_layers = 2 # num of decoder layers
args.d_ff = 512 # dimension of fcn in model
args.dropout = 0.05 # dropout
args.attn = 'full' # attention used in encoder, options:[prob, full]
args.embed = 'fixed' # time features encoding, options:[timeF, fixed, learned]
args.activation = 'gelu' # activation
args.distil = True # whether to use distilling in encoder
args.output_attention = False # whether to output attention in encoder

args.batch_size = 64
args.learning_rate = 0.0001 ## 0.0001
args.loss = 'mse'
args.lradj = 'type1'

args.num_workers = 0
args.itr = 1
args.train_epochs = 6
args.patience = 3
args.des = 'exp'

I trained with the parameters above and got the following error:


IndexError Traceback (most recent call last)
in
9 # train
10 print('>>>>>>>start training : {}>>>>>>>>>>>>>>>>>>>>>>>>>>'.format(setting))
---> 11 exp.train(setting)
12
13 # test

~/max/Informer2020/exp/exp_informer.py in train(self, setting)
169 # encoder - decoder
170 if self.args.output_attention:
--> 171 outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)[0]
172 else:
173 outputs = self.model(batch_x, batch_x_mark, dec_inp, batch_y_mark)

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/max/Informer2020/models/model.py in forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec, enc_self_mask, dec_self_mask, dec_enc_mask)
144 def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec,
145 enc_self_mask=None, dec_self_mask=None, dec_enc_mask=None):
--> 146 enc_out = self.enc_embedding(x_enc, x_mark_enc)
147 enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)
148

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/max/Informer2020/models/embed.py in forward(self, x, x_mark)
105
106 def forward(self, x, x_mark):
--> 107 x = self.value_embedding(x) + self.position_embedding(x) + self.temporal_embedding(x_mark)
108
109 return self.dropout(x)

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/max/Informer2020/models/embed.py in forward(self, x)
75 x = x.long()
76
---> 77 minute_x = self.minute_embed(x[:,:,4]) if hasattr(self, 'minute_embed') else 0.
78 hour_x = self.hour_embed(x[:,:,3])
79 weekday_x = self.weekday_embed(x[:,:,2])

IndexError: index 4 is out of bounds for dimension 2 with size 4

With 'timeF', everything works fine.

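@cookieminions
Collaborator

Thanks for pointing this out!

Setting freq='h' should also resolve the error. Dataset_Custom does not produce minute-level features when it loads the timestamps, so freq='t' leads to a dimension mismatch.

This is a problem with the fixed and learned embeddings. We have not updated these two embedding modes for a while; in upcoming updates we will integrate them better with freq and with the various Dataset classes.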
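The mismatch described above can be made concrete with a small pre-flight check. This is a minimal sketch, not code from the repository: `required_stamp_columns` and `check_stamp_width` are hypothetical helpers, and the column order (month, day, weekday, hour, minute) is an assumption extrapolated from the indices 2, 3, and 4 visible in the traceback.

```python
# Hypothetical helpers, not part of Informer2020.
# Assumed column order of the integer time stamp read by the fixed/learned
# temporal embedding: [month, day, weekday, hour, minute]. Only indices
# 2, 3, and 4 are confirmed by the traceback in this issue.
STAMP_COLUMNS = ["month", "day", "weekday", "hour", "minute"]


def required_stamp_columns(freq: str) -> int:
    """Number of stamp columns the embedding indexes for a given freq."""
    # With freq='t' the embedding also reads x[:, :, 4] (minute).
    return 5 if freq == "t" else 4


def check_stamp_width(n_columns: int, freq: str) -> None:
    """Raise a readable error before the model hits an IndexError."""
    needed = required_stamp_columns(freq)
    if n_columns < needed:
        missing = STAMP_COLUMNS[n_columns:needed]
        raise ValueError(
            f"freq={freq!r} needs {needed} time-stamp columns, "
            f"got {n_columns}; missing: {missing}"
        )


check_stamp_width(4, "h")  # passes: four columns are enough without minutes
try:
    check_stamp_width(4, "t")  # the situation reported in this issue
except ValueError as err:
    print(err)
```

Calling such a check on the last dimension of the time-stamp batch would turn the IndexError into a message that names the missing field.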

@zhrli

zhrli commented Mar 24, 2021

For finer time granularities, should the embedding encoding be changed? For example, to go down to the second level, would we set second=61?

@cookieminions
Collaborator

With the timeF embedding, second-level timestamps are already supported. With the fixed or learned embeddings, second-level support additionally requires changing data/data_loader.py so that the data returns second-level timestamps, and changing models/embed.py to add the corresponding embedding layer.

We will update the embedding code over the next few days so that every embedding supports the freq values that timeF supports.
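As a rough illustration of the first change mentioned above, the sketch below extracts integer time fields down to the second using only the standard library. `second_level_stamp` and its field order are hypothetical; the actual loader in data/data_loader.py may build its stamps differently. Python's `datetime.second` is always in `range(60)`, so the second field takes 60 distinct values.

```python
from datetime import datetime
from typing import List


def second_level_stamp(ts: datetime) -> List[int]:
    """Integer time fields for one timestamp (hypothetical field order)."""
    # weekday() returns 0 for Monday through 6 for Sunday.
    return [ts.month, ts.day, ts.weekday(), ts.hour, ts.minute, ts.second]


stamp = second_level_stamp(datetime(2021, 3, 24, 13, 5, 59))
print(stamp)  # [3, 24, 2, 13, 5, 59]
```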

@zhrli

zhrli commented Mar 24, 2021

Thank you very much. I hope you will also consider second-level time data whose values may be floats, so that higher-resolution data can be covered.

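@Erickurashi

Erickurashi commented Mar 25, 2021

@cookieminions If my data has a period of 70 points, can I change return index.hour / 23.0 - 0.5 to return index.hour / 69.0 - 0.5 and then use offsets.Hour: [HourOfDay]?
Is it right that the hour unit does not affect the computation and that what matters is the period? Thanks.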

@cookieminions
Collaborator

cookieminions commented Mar 25, 2021

If your data has a date feature, that is, human-defined dates and times, then whatever the period length, use the timeF embedding directly with no extra changes (here index.hour returns the hour of the day, a value between 0 and 23).

If your data has no date-related feature but does have a known period, such as 70 points, my suggestion is to add a new class, e.g. IndexOfPeriod, whose __call__ method does return index/69.0 - 0.5. Note that index here is no longer a time and no longer has attributes such as index.hour; it should cycle, i.e. 0-69, 0-69, ..., to mark the position within the period. Then add offset.IdxPeriod:{IndexOfPeriod} below it, and add 'i':1 to the map of the corresponding embedding in embed.py.

We will consider supporting, in a later update, data that has no date attribute and only a known period, or only a sequential ordering. If your data has a date attribute, ignore the period and just use timeF.
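Both encodings mentioned above are the same linear map: a position p in 0..P-1 becomes p / (P - 1) - 0.5, which lies in [-0.5, 0.5]. A minimal sketch, with `encode_position` as a hypothetical helper name:

```python
def encode_position(position: int, period: int) -> float:
    """Map a position in 0..period-1 linearly onto [-0.5, 0.5]."""
    if not 0 <= position < period:
        raise ValueError(f"position must be in 0..{period - 1}, got {position}")
    return position / (period - 1) - 0.5


# Hour of day: 24 positions, so the divisor is 23.
print(encode_position(0, 24), encode_position(23, 24))  # -0.5 0.5
# A 70-point period: 70 positions, so the divisor is 69.
print(encode_position(0, 70), encode_position(69, 70))  # -0.5 0.5
```

The divisor is always one less than the number of positions, which is why a 70-point period uses 69.0.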

@Erickurashi

Erickurashi commented Mar 25, 2021

The parts in bold below are what I added, and the code fails at runtime. How should I add an index i to represent IndexOfPeriod?

class SecondOfMinute(TimeFeature):
    """Second of minute encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index.second / 59.0 - 0.5

class MinuteOfHour(TimeFeature):
    """Minute of hour encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index.minute / 59.0 - 0.5

class HourOfDay(TimeFeature):
    """Hour of day encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index.hour / 23.0 - 0.5

class DayOfWeek(TimeFeature):
    """Day of week encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index.dayofweek / 6.0 - 0.5

class DayOfMonth(TimeFeature):
    """Day of month encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.day - 1) / 30.0 - 0.5

class DayOfYear(TimeFeature):
    """Day of year encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.dayofyear - 1) / 365.0 - 0.5

class MonthOfYear(TimeFeature):
    """Month of year encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.month - 1) / 11.0 - 0.5

class WeekOfYear(TimeFeature):
    """Week of year encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.isocalendar().week - 1) / 52.0 - 0.5

class IndexOfPeriod:  # added by the poster
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index/69.0 - 0.5

def time_features_from_frequency_str(freq_str: str) -> List[TimeFeature]:
    """
    Returns a list of time features that will be appropriate for the given frequency string.
    Parameters
    freq_str
        Frequency string of the form [multiple][granularity] such as "12H", "5min", "1D" etc.
    """
    features_by_offsets = {
        offsets.YearEnd: [],
        offsets.QuarterEnd: [MonthOfYear],
        offsets.MonthEnd: [MonthOfYear],
        offsets.Week: [DayOfMonth, WeekOfYear],
        offsets.Day: [DayOfWeek, DayOfMonth, DayOfYear],
        offsets.BusinessDay: [DayOfWeek, DayOfMonth, DayOfYear],
        offsets.Hour: [HourOfDay, DayOfWeek, DayOfMonth, DayOfYear],
        offsets.Minute: [
            MinuteOfHour,
            HourOfDay,
            DayOfWeek,
            DayOfMonth,
            DayOfYear,
        ],
        offsets.Second: [
            SecondOfMinute,
            MinuteOfHour,
            HourOfDay,
            DayOfWeek,
            DayOfMonth,
            DayOfYear,
        ],
        offset.IdxPeriod:[IndexOfPeriod]  # added by the poster
    }
    offset = to_offset(freq_str)

    for offset_type, feature_classes in features_by_offsets.items():
        if isinstance(offset, offset_type):
            return [cls() for cls in feature_classes]

    supported_freq_msg = f"""
    Unsupported frequency {freq_str}
    The following frequencies are supported:
        Y - yearly
            alias: A
        M - monthly
        W - weekly
        D - daily
        B - business days
        H - hourly
        T - minutely
            alias: min
        S - secondly
        i - period
    """

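@cookieminions
Collaborator

We will add this option in the next update, for data that has no time attribute and only a period expressed as numbers.

Making this change yourself also involves the data loading and the type of index in these methods. At the least, def __call__(self, index: pd.DatetimeIndex) -> np.ndarray: has to become def __call__(self, index) -> np.ndarray:, and the pd.to_datetime call in the data loading code has to be changed accordingly.

If convenient, you could also share part of your data (covering 1-2 periods) and how the period is represented in it, to help us refine the design of this part.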

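@Erickurashi

Erickurashi commented Mar 25, 2021

image
My data has no clear period; the average period is around 100 points (see the image above). How should this case be handled? Can I forecast without using a period? Thanks.
test.xlsx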

@cookieminions
Collaborator

If the data has neither a definite period nor time information, relying on the positional encoding alone is enough. Forcing a period onto it may actually hurt.

With the current code, you can change line 107 of models/embed.py to x = self.value_embedding(x) + self.position_embedding(x). Your data file still needs a date column, but the dates can be arbitrary (for example, every row can be 2021-01-01 00:00:00), because the embedding no longer uses the date information.
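The placeholder date column can be added with a few lines of standard-library Python. This sketch assumes a headered CSV with a value column named target (as in args.target in the configuration at the top of this issue); `add_placeholder_date` is a hypothetical helper, so adapt the column names to your own file.

```python
import csv
import io

# Any valid date works; the value is unused once the temporal embedding
# term has been removed from the forward pass.
PLACEHOLDER = "2021-01-01 00:00:00"


def add_placeholder_date(src, dst) -> None:
    """Copy a headered CSV, prepending a constant 'date' column."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow(["date"] + header)
    for row in reader:
        writer.writerow([PLACEHOLDER] + row)


# In-memory demonstration; pass open file objects for real data.
src = io.StringIO("target\n0.1\n0.2\n")
dst = io.StringIO()
add_placeholder_date(src, dst)
print(dst.getvalue())
```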

@Erickurashi

Thanks. So does this line, args.freq = 'h', not need to change?
And do args.seq_len, args.label_len, and args.pred_len no longer need to be adjusted to the period?

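@cookieminions
Collaborator

You do not need to change freq, because the temporal information is no longer used.

However, seq_len, label_len, and pred_len still need to be adjusted to your own data and task. pred_len is the forecast length, which depends on your application goal, while seq_len and label_len can be set from what you observe in the data, e.g. 100, 140, or 70.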
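To see how the three lengths interact, the sketch below computes the index ranges of one training window. The slicing scheme (the decoder input starts with the last label_len points of the encoder window and extends pred_len points beyond it) is an assumption based on the config comments at the top of this issue ("start token length", "prediction sequence length"), not code copied from the repository; `window_bounds` is a hypothetical name.

```python
from typing import Tuple

Range = Tuple[int, int]


def window_bounds(start: int, seq_len: int, label_len: int,
                  pred_len: int) -> Tuple[Range, Range]:
    """Return (encoder, decoder) index ranges as half-open [begin, end)."""
    if label_len > seq_len:
        raise ValueError("label_len should not exceed seq_len")
    enc_begin, enc_end = start, start + seq_len
    # The decoder starts with the last label_len encoder points (the start
    # token) and extends pred_len points past the end of the encoder window.
    dec_begin = enc_end - label_len
    dec_end = enc_end + pred_len
    return (enc_begin, enc_end), (dec_begin, dec_end)


# Values from the configuration at the top of this issue.
enc, dec = window_bounds(start=0, seq_len=128, label_len=96, pred_len=15)
print(enc, dec)  # (0, 128) (32, 143)
```

Under this assumption, each sample consumes seq_len + pred_len consecutive rows, so the series must be at least that long.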

@Erickurashi

Erickurashi commented Mar 27, 2021

Thanks. I would still like to ask: if I want to set a period of 70 data points (for example, by creating a new option called i to replace the original h), which parts of the code need to change? My data currently has two columns: one is the Date time and the other is the data. Do the name of the Date column and its time series also need to change? Thanks.

@zhouhaoyi zhouhaoyi added the discussion This only contains independent questions label Mar 27, 2021
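@cookieminions
Collaborator

If the Date in your data is meaningful, I suggest not worrying about what the period is and letting the model learn and understand it by itself.

If the Date in your data is not meaningful and the data simply has a period of roughly 70 points, you can make the following changes. First, add this to timefeatures.py:

class IndexOfPeriod(TimeFeature):
    """Position within the period encoded as value between [-0.5, 0.5]"""
    def __call__(self, index) -> np.ndarray:
        return index / 69.0 - 0.5

Then change any one of the existing offsets entries, e.g. offsets.Second, to offsets.Second:[IndexOfPeriod], and correspondingly change freq_map in TimeFeatureEmbedding in embed.py so that 's' maps to 1.

Next, the data loading part of the dataloader needs to change. First, make sure your date column no longer holds time data but numbers between 0 and 69 that correspond to the position in the period. Second, change data_stamp = time_features(pd.to_datetime(df_stamp['date'].values), freq=self.freq) to data_stamp = time_features(df_stamp['date'].values, freq=self.freq).

After that, train with --freq s.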
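Put together, the steps above amount to feeding a repeating position index into the new feature. The following sketch uses NumPy only and does not touch the repository code. `cyclic_positions` is a hypothetical helper, the class here takes the period as a parameter for illustration (unlike the hard-coded 69.0 in the snippet from the thread), and 70 is the example period from this discussion.

```python
import numpy as np


def cyclic_positions(n_rows: int, period: int) -> np.ndarray:
    """Position within the period per row: 0, 1, ..., period-1, 0, 1, ..."""
    return np.arange(n_rows) % period


class IndexOfPeriod:
    """Position within a fixed period encoded as a value in [-0.5, 0.5]."""

    def __init__(self, period: int = 70):
        self.period = period

    def __call__(self, index) -> np.ndarray:
        # Cast to float so that integer columns divide cleanly.
        return np.asarray(index, dtype=np.float64) / (self.period - 1) - 0.5


positions = cyclic_positions(n_rows=150, period=70)
features = IndexOfPeriod(period=70)(positions)
print(positions[68:72].tolist())                      # [68, 69, 0, 1]
print(float(features.min()), float(features.max()))   # -0.5 0.5
```

The modulo is what makes the index wrap back to 0 after the last position of each period.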

@Erickurashi

Thank you. After making the changes above, I get the following error:
Use GPU: cuda:0

start training : informer_custom_ftS_sl200_ll100_pl100_dm512_nh8_el2_dl1_df1024_atprob_fc5_ebtimeF_dtTrue_exp_0>>>>>>>>>>>>>>>>>>>>>>>>>>


UFuncTypeError Traceback (most recent call last)
in ()
10 # train
11 print('>>>>>>>start training : {}>>>>>>>>>>>>>>>>>>>>>>>>>>'.format(setting))
---> 12 exp.train(setting)
13
14 # test

6 frames
/content/Informer2020/exp/exp_informer.py in train(self, setting)
144
145 def train(self, setting):
--> 146 train_data, train_loader = self._get_data(flag = 'train')
147 vali_data, vali_loader = self._get_data(flag = 'val')
148 test_data, test_loader = self._get_data(flag = 'test')

/content/Informer2020/exp/exp_informer.py in _get_data(self, flag)
85 target=args.target,
86 timeenc=timeenc,
---> 87 freq=freq
88 )
89 print(flag, len(data_set))

/content/Informer2020/data/data_loader.py in __init__(self, root_path, flag, size, features, data_path, target, scale, timeenc, freq)
216 self.root_path = root_path
217 self.data_path = data_path
--> 218 self.__read_data__()
219
220 def __read_data__(self):

/content/Informer2020/data/data_loader.py in __read_data__(self)
258 data_stamp = df_stamp.drop(['date'],1).values
259 elif self.timeenc==1:
--> 260 data_stamp = time_features(df_stamp['date'].values, freq=self.freq)
261 data_stamp = data_stamp.transpose(1,0)
262

/content/Informer2020/utils/timefeatures.py in time_features(dates, freq)
108
109 def time_features(dates, freq='h'):
--> 110 return np.vstack([feat(dates) for feat in time_features_from_frequency_str(freq)])

/content/Informer2020/utils/timefeatures.py in <listcomp>(.0)
108
109 def time_features(dates, freq='h'):
--> 110 return np.vstack([feat(dates) for feat in time_features_from_frequency_str(freq)])

/content/Informer2020/utils/timefeatures.py in __call__(self, index)
19 """Minute of hour encoded as value between [-0.5, 0.5]"""
20 def __call__(self, index) -> np.ndarray:
---> 21 return index / 99.0 - 0.5
22
23

UFuncTypeError: ufunc 'true_divide' cannot use operands with types dtype('<M8[ns]') and dtype('float64')

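@cookieminions
Collaborator

Please check whether the date attribute in your data has a value problem. The error is related to the data type of date: the date column must no longer hold time data but numbers between 0 and 69 that correspond to the period.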

@Erickurashi

image
I am now using a period of 100, and the period column in Excel holds numbers between 0 and 99. I am not sure whether this change is correct.

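@cookieminions
Collaborator

In theory that is fine, but inside the program your date column still seems to contain dtype('<M8[ns]') data. You can check the data again, or try astype or a similar method to convert the dtype of date.values to np.float64.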

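@cookieminions
Collaborator

You can print dates inside time_features to check its data type, because the place you modified in the dataloader may also be the wrong one. You can change the relevant line in every Dataset class to data_stamp = time_features(df_stamp['date'].values, freq=self.freq).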
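The dtype check suggested above can be sketched as follows. `as_period_positions` is a hypothetical helper, not part of the repository: it prints the dtype, fails early with a readable message when the column still holds datetime64 values (the dtype named in the error), and otherwise returns float64 values.

```python
import numpy as np


def as_period_positions(values, period: int) -> np.ndarray:
    """Validate a period-position column and return it as float64."""
    arr = np.asarray(values)
    print("dtype:", arr.dtype)
    if np.issubdtype(arr.dtype, np.datetime64):
        raise TypeError(
            "the column still holds datetime64 values; check that nothing "
            "upstream converts it to dates before it reaches time_features"
        )
    arr = arr.astype(np.float64)
    if arr.min() < 0 or arr.max() > period - 1:
        raise ValueError(f"positions must lie in 0..{period - 1}")
    return arr


ok = as_period_positions([0, 1, 98, 99], period=100)
print(ok / 99.0 - 0.5)  # division now works on plain floats

try:
    as_period_positions(np.array(["2021-01-01"], dtype="datetime64[ns]"), 100)
except TypeError as err:
    print("rejected:", err)
```

If the second branch fires on your real column, the values were turned into dates somewhere between reading the file and calling time_features, so that is the place to look.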

@Erickurashi

I changed all the Dataset-related parts, but the same error still occurs. Should the values in the data reset to 0 after reaching 99?
image

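@zhouhaoyi
Owner

@Erickurashi Please try to focus on issues with the model itself. These questions are now beyond what we can handle (please do not ask us to debug for free). As for the dataloader, everyone designs it differently and has different ideas; our time is limited, and we cannot accommodate everyone's approach. I also think the basic problem has been properly resolved, so I am closing this issue here.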
