Description
🐛 Bug Description
When I run rdagent fin_mode, model training takes longer than 3600 seconds, so the process is killed.
The rdagent collect_info output shows these Startup Commands: /bin/sh -c timeout --kill-after=10 3600 qrun conf.yaml; entry_exit_code=$?; chmod -R 777 /workspace/qlib_workspace/; exit $entry_exit_code
How can I reduce the training time, rather than only increasing the kill timeout?
Also, when training is killed, can the error be returned to rdagent so that it can handle it, instead of the kill causing rdagent itself to exit?
How long does training a model normally take?
How can I make rdagent skip a model like this one and move on to developing another model?
I am using rocm/pytorch. Should I also change the default model config?
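To illustrate the question about returning the error instead of exiting, here is a minimal Python sketch of a runner that reports a timeout as an ordinary result. It is not RD-Agent's actual implementation; `RunResult` and `run_with_timeout` are hypothetical names used only for illustration. For reference, GNU `timeout` itself exits with status 124 when the time limit is hit, which is the value `entry_exit_code` would capture in the startup command above.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    """Hypothetical result type: a timeout is data, not an exception."""
    exit_code: Optional[int]  # None when the run was stopped by the timeout
    timed_out: bool
    output: str


def run_with_timeout(cmd: List[str], timeout_s: float) -> RunResult:
    """Run cmd and return a RunResult; never raises on timeout."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return RunResult(proc.returncode, False, proc.stdout)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; keep partial output.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return RunResult(None, True, out)


if __name__ == "__main__":
    # Stand-in for a long training job: sleeps longer than the limit.
    result = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
    )
    if result.timed_out:
        print("training timed out; caller can move on to the next model")
```

With a result object like this, the calling loop could record the timeout as feedback for the current model and continue with the next one.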
To Reproduce
Steps to reproduce the behavior:
[8:MainThread]([DATETIME],005) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:75] - GeneralPTNN pytorch version...
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:93] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cuda:0
n_jobs : 20
use_GPU : True
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20}
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model:
GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm(
(gru1): GRU(20, 128, batch_first=True)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(gru2): GRU(128, 128, batch_first=True)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(gru3): GRU(128, 128, batch_first=True)
(bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(dropout): Dropout(p=0.3, inplace=False)
(fc): Linear(in_features=128, out_features=1, bias=True)
)
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:131] - model size: 0.2448 MB
[8:MainThread]([DATETIME],289) INFO - qlib.timer - [log.py:127] - Time cost: 5.926s | Loading data Done
[8:MainThread]([DATETIME],635) INFO - qlib.timer - [log.py:127] - Time cost: 0.009s | FilterCol Done
[8:MainThread]([DATETIME],933) INFO - qlib.timer - [log.py:127] - Time cost: 0.297s | RobustZScoreNorm Done
[8:MainThread]([DATETIME],980) INFO - qlib.timer - [log.py:127] - Time cost: 0.047s | Fillna Done
[8:MainThread]([DATETIME],016) INFO - qlib.timer - [log.py:127] - Time cost: 0.012s | DropnaLabel Done
[8:MainThread]([DATETIME],120) INFO - qlib.timer - [log.py:127] - Time cost: 0.104s | CSRankNorm Done
[8:MainThread]([DATETIME],120) INFO - qlib.timer - [log.py:127] - Time cost: 0.831s | fit & process data Done
[8:MainThread]([DATETIME],121) INFO - qlib.timer - [log.py:127] - Time cost: 6.758s | Init data Done
[8:MainThread]([DATETIME],150) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:246] - Train samples: 478007
[8:MainThread]([DATETIME],150) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:247] - Valid samples: 128309
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:295] - training...
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch0:
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],161) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:302] - evaluating...
[8:MainThread]([DATETIME],041) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:305] - Epoch0: train 0.998860, valid 1.000189
[8:MainThread]([DATETIME],042) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch1:
[8:MainThread]([DATETIME],043) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],652) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch18:
[8:MainThread]([DATETIME],652) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],479) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:302] - evaluating...
───────────────────────────────────────────────────────── Docker Logs End
2025-05-30 22:13:53.184 | INFO | rdagent.utils.env:__run_ret_code_with_retry:167 - Running time: 3600.463972091675 seconds
2025-05-30 22:13:53.186 | WARNING | rdagent.utils.env:__run_ret_code_with_retry:169 - The running time exceeds 3600 seconds, so the process is killed.
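The numbers in the log above allow a rough estimate of how long the full run would need. The sketch below is back-of-the-envelope arithmetic only. It assumes that Epoch0 through Epoch17 finished, that Epoch18 was still evaluating when the process was killed, and that time was spread evenly across epochs; none of this is confirmed by the elided part of the log.

```python
import math

# Values copied from the log and the startup command.
TIME_LIMIT_S = 3600.0   # timeout ... 3600
INIT_DATA_S = 6.758     # "Init data Done"
N_EPOCHS = 100
TRAIN_SAMPLES = 478007
BATCH_SIZE = 2000

# Assumption: 18 epochs finished, the 19th (Epoch18) was in progress at the kill.
EPOCHS_FINISHED = 18
EPOCHS_STARTED = 19


def epoch_time_bounds(limit_s, init_s, finished, started):
    """Return (lower, upper) bounds on seconds per epoch."""
    train_s = limit_s - init_s
    return train_s / started, train_s / finished


low_s, high_s = epoch_time_bounds(
    TIME_LIMIT_S, INIT_DATA_S, EPOCHS_FINISHED, EPOCHS_STARTED
)
batches_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

print(f"batches per epoch: {batches_per_epoch}")
print(f"seconds per epoch: {low_s:.0f} to {high_s:.0f}")
print(
    f"hours for {N_EPOCHS} epochs: "
    f"{N_EPOCHS * low_s / 3600:.2f} to {N_EPOCHS * high_s / 3600:.2f}"
)
```

Under these assumptions, 100 epochs would need several hours, far beyond the 3600-second limit, although early_stop : 10 could end the run sooner.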
2. Model code
import torch
import torch.nn as nn


class GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm(nn.Module):
    def __init__(self, num_features):
        super(GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm, self).__init__()
        self.num_features = num_features
        self.gru1 = nn.GRU(num_features, 128, batch_first=True)
        self.bn1 = nn.BatchNorm1d(128)
        self.gru2 = nn.GRU(128, 128, batch_first=True)
        self.bn2 = nn.BatchNorm1d(128)
        self.gru3 = nn.GRU(128, 128, batch_first=True)
        self.bn3 = nn.BatchNorm1d(128)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(128, 1)

    def forward(self, x):
        # x shape: (batch_size, num_features)
        out = self.gru1(x)[0]
        out = self.bn1(out)
        out = self.dropout(out)
        out = self.gru2(out)[0]
        out = self.bn2(out)
        out = self.dropout(out)
        out = self.gru3(out)[0]
        out = self.bn3(out)
        out = self.dropout(out)
        out = self.fc(out)
        return out


model_cls = GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm
#Execution feedback:---------------
#Execution successful, output tensor shape: (8, 1)
#--------------Model value feedback:---------------
#The shape of the output is correct.
#No ground truth output provided. Value evaluation not impractical
-
a: Single AMD 6700 XT GPU, 12 GB VRAM
b: From rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.2.1
c: qlib master
d: rd-agent master
Expected Behavior
Screenshot
Environment
Note: Users can run rdagent collect_info to get system information and paste it directly here.
- Name of current operating system:
- Processor architecture:
- System, version, and hardware information:
- Version number of the system:
- Python version:
- Container ID:
- Container Name:
- Container Status:
- Image ID used by the container:
- Image tag used by the container:
- Container port mapping:
- Container Label:
- Startup Commands:
- RD-Agent version:
- Package version: