Model train time fix?

## 🐛 Bug Description
When running rdagent in fin_mode, model training takes longer than 3600 seconds, so the process is killed.
In the `rdagent collect_info` output I see `Startup Commands: /bin/sh -c timeout --kill-after=10 3600 qrun conf.yaml; entry_exit_code=$?; chmod -R 777 /workspace/qlib_workspace/; exit $entry_exit_code`.

My questions:

- How can I reduce the training time itself, rather than only raising the kill timeout?
- When training has to be killed, can the error be returned to rdagent, instead of the kill also making rdagent exit?
- How long does training a model normally take?
- How do I make rdagent skip a model like this one and move on to developing another model?
- I use rocm/pytorch. Should I also change the default model config?
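
To make the second question concrete, here is a minimal sketch of the behaviour I would expect: the caller catches the timeout and gets an error value back instead of exiting. This is not RD-Agent's actual code. `run_with_timeout` is a hypothetical helper, and returning 124 is only chosen to mirror the exit status GNU `timeout` uses when the time limit is hit.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, timed_out) instead of raising on timeout.

    Hypothetical helper for illustration only; not part of rdagent or qlib.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so the caller can simply record the failure and carry on.
        return 124, True  # 124 mirrors GNU timeout's "timed out" exit status


# Stand-in for a long training job: sleeps 5 s but is only given 0.5 s.
code, timed_out = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
if timed_out:
    print(f"training timed out (exit code {code}); report failure and try another model")
```

With something like this, a timeout becomes ordinary feedback for the loop rather than a reason to stop the whole run.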

## To Reproduce

Steps to reproduce the behavior:

1.
```bash
[8:MainThread]([DATETIME],005) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:75] - GeneralPTNN pytorch version...
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:93] - GeneralPTNN parameters setting:
n_epochs : 100
lr : 0.001
metric : loss
batch_size : 2000
early_stop : 10
optimizer : adam
loss_type : mse
device : cuda:0
n_jobs : 20
use_GPU : True
weight_decay : 0.0001
seed : None
pt_model_uri: model.model_cls
pt_model_kwargs: {'num_features': 20}
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:130] - model:
GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm(
(gru1): GRU(20, 128, batch_first=True)
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(gru2): GRU(128, 128, batch_first=True)
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(gru3): GRU(128, 128, batch_first=True)
(bn3): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(dropout): Dropout(p=0.3, inplace=False)
(fc): Linear(in_features=128, out_features=1, bias=True)
)
[8:MainThread]([DATETIME],984) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:131] - model size: 0.2448 MB
[8:MainThread]([DATETIME],289) INFO - qlib.timer - [log.py:127] - Time cost: 5.926s | Loading data Done
[8:MainThread]([DATETIME],635) INFO - qlib.timer - [log.py:127] - Time cost: 0.009s | FilterCol Done
[8:MainThread]([DATETIME],933) INFO - qlib.timer - [log.py:127] - Time cost: 0.297s | RobustZScoreNorm Done
[8:MainThread]([DATETIME],980) INFO - qlib.timer - [log.py:127] - Time cost: 0.047s | Fillna Done
[8:MainThread]([DATETIME],016) INFO - qlib.timer - [log.py:127] - Time cost: 0.012s | DropnaLabel Done
[8:MainThread]([DATETIME],120) INFO - qlib.timer - [log.py:127] - Time cost: 0.104s | CSRankNorm Done
[8:MainThread]([DATETIME],120) INFO - qlib.timer - [log.py:127] - Time cost: 0.831s | fit & process data Done
[8:MainThread]([DATETIME],121) INFO - qlib.timer - [log.py:127] - Time cost: 6.758s | Init data Done
[8:MainThread]([DATETIME],150) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:246] - Train samples: 478007
[8:MainThread]([DATETIME],150) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:247] - Valid samples: 128309
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:295] - training...
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch0:
[8:MainThread]([DATETIME],158) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],161) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:302] - evaluating...
[8:MainThread]([DATETIME],041) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:305] - Epoch0: train 0.998860, valid 1.000189
[8:MainThread]([DATETIME],042) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch1:
[8:MainThread]([DATETIME],043) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],652) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:299] - Epoch18:
[8:MainThread]([DATETIME],652) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:300] - training...
[8:MainThread]([DATETIME],479) INFO - qlib.GeneralPTNN - [pytorch_general_nn.py:302] - evaluating...
───────────────────────────────────────────────────────── Docker Logs End 
2025-05-30 22:13:53.184 | INFO     | rdagent.utils.env:__run_ret_code_with_retry:167 - Running time: 3600.463972091675 seconds
2025-05-30 22:13:53.186 | WARNING  | rdagent.utils.env:__run_ret_code_with_retry:169 - The running time exceeds 3600 seconds, so the process is killed.
```
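
As a rough budget check, the numbers in the log above can be turned into a per-epoch estimate. This is only a sketch under assumptions: the timestamps are masked, so I assume epochs 0 to 17 completed and epoch 18 was still evaluating when the process was killed, and that everything after "Init data Done" was training time.

```python
import math

# Values copied from the log above
total_runtime_s = 3600.46   # "Running time" reported by rdagent
init_data_s = 6.758         # "Init data Done" from the qlib timer
train_samples = 478007
batch_size = 2000
n_epochs = 100

# Assumption: Epoch0..Epoch17 finished, Epoch18 was still evaluating when killed
completed_epochs = 18
started_epochs = 19

train_time_s = total_runtime_s - init_data_s
batches_per_epoch = math.ceil(train_samples / batch_size)

# Bounds on the time per epoch, depending on how far Epoch18 had progressed
sec_per_epoch_low = train_time_s / started_epochs
sec_per_epoch_high = train_time_s / completed_epochs

# Time needed for all n_epochs if early stopping never triggers
full_run_hours_low = n_epochs * sec_per_epoch_low / 3600
full_run_hours_high = n_epochs * sec_per_epoch_high / 3600

print(f"batches per epoch: {batches_per_epoch}")
print(f"seconds per epoch: {sec_per_epoch_low:.0f} to {sec_per_epoch_high:.0f}")
print(f"hours for {n_epochs} epochs: {full_run_hours_low:.1f} to {full_run_hours_high:.1f}")
```

Under these assumptions an epoch takes roughly 190 to 200 seconds, so a full 100-epoch run would need more than 5 hours unless early stopping triggers, far beyond the 3600-second limit.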
2. Model code
```python

import torch
import torch.nn as nn

class GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm(nn.Module):
    def __init__(self, num_features):
        super(GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm, self).__init__()
        self.num_features = num_features
        self.gru1 = nn.GRU(num_features, 128, batch_first=True)
        self.bn1 = nn.BatchNorm1d(128)
        self.gru2 = nn.GRU(128, 128, batch_first=True)
        self.bn2 = nn.BatchNorm1d(128)
        self.gru3 = nn.GRU(128, 128, batch_first=True)
        self.bn3 = nn.BatchNorm1d(128)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(128, 1)

    def forward(self, x):
        # x shape: (batch_size, num_features)
        out = self.gru1(x)[0]
        out = self.bn1(out)
        out = self.dropout(out)
        out = self.gru2(out)[0]
        out = self.bn2(out)
        out = self.dropout(out)
        out = self.gru3(out)[0]
        out = self.bn3(out)
        out = self.dropout(out)
        out = self.fc(out)
        return out

model_cls = GRUModel3LayerHidden128LeakyReLUDropout03BatchNorm

#Execution feedback:---------------
#Execution successful, output tensor shape: (8, 1)

#--------------Model value feedback:---------------
#The shape of the output is correct.
#No ground truth output provided. Value evaluation not impractical
```
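
One thing that may be contributing to the slow epochs: `forward` documents `x` as `(batch_size, num_features)`, and `nn.GRU` treats a 2-D tensor as a single unbatched sequence. If that is what happens during training, each batch of 2000 rows would be stepped through one after another instead of in parallel. The sketch below only demonstrates this PyTorch behaviour on random data. Whether GeneralPTNN really passes 2-D input here is my assumption, based on the shape comment and the `(8, 1)` output in the execution feedback.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=20, hidden_size=128, batch_first=True)

# 2-D input: nn.GRU interprets this as ONE unbatched sequence of length 8,
# so the 8 rows are processed step by step and share a hidden state.
x2d = torch.randn(8, 20)
out2d, h2d = gru(x2d)

# 3-D input with an explicit sequence axis of length 1:
# 8 independent samples, each starting from a zero hidden state.
x3d = x2d.unsqueeze(1)  # shape (8, 1, 20)
out3d, h3d = gru(x3d)

print(out2d.shape, h2d.shape)  # output and final hidden state, unbatched case
print(out3d.shape, h3d.shape)  # output and final hidden state, batched case

# Row 0 matches in both cases (both start from a zero hidden state),
# but row 1 differs because the 2-D case carries state over from row 0.
first_row_same = torch.allclose(out2d[0], out3d[0, 0], atol=1e-5)
second_row_same = torch.allclose(out2d[1], out3d[1, 0], atol=1e-5)
print(first_row_same, second_row_same)
```

If this is the cause, the recurrence over the batch is inherently sequential, and the hidden state also leaks between unrelated samples, so it may be worth checking the input shape before changing the timeout.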

3. Setup
     - a: single AMD 6700 XT GPU, 12 GB VRAM
     - b: From rocm/pytorch:rocm6.3.3_ubuntu22.04_py3.10_pytorch_release_2.2.1
     - c: qlib master
     - d: rd-agent master


## Expected Behavior



## Screenshot



## Environment

**Note**: Users can run `rdagent collect_info` to get system information and paste it directly here.

 - Name of current operating system:
 - Processor architecture:
 - System, version, and hardware information:
 - Version number of the system:
 - Python version:
 - Container ID:
 - Container Name:
 - Container Status:
 - Image ID used by the container:
 - Image tag used by the container:
 - Container port mapping:
 - Container Label:
 - Startup Commands:
 - RD-Agent version:
 - Package version:

## Additional Notes
