
Reproducing training code #10

Closed
eugenelyj opened this issue May 31, 2022 · 10 comments

Comments

@eugenelyj

Hi,

Thanks for your amazing work!! I am currently trying to reproduce train.py (mainly copied from the RAFT project). Some training details are listed as follows:

  1. learning rate: 1e-4.
  2. epochs: 40.
  3. loss: sequence loss from RAFT (https://github.com/princeton-vl/RAFT/blob/master/train.py#L47) with gamma equals to 0.8.
  4. random crop size [288, 384]
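For reference, the sequence loss mentioned in point 3 looks roughly like this (a sketch following the linked RAFT train.py; the `max_flow` cutoff for filtering extreme displacements is part of the original code):

```python
import torch

def sequence_loss(flow_preds, flow_gt, valid, gamma=0.8, max_flow=400):
    """Exponentially weighted L1 loss over the iterative flow predictions,
    following RAFT's train.py."""
    n_predictions = len(flow_preds)
    total_loss = 0.0

    # Exclude invalid pixels and extremely large displacements.
    mag = torch.sum(flow_gt ** 2, dim=1).sqrt()
    valid = (valid >= 0.5) & (mag < max_flow)

    for i, pred in enumerate(flow_preds):
        # Later refinement iterations get exponentially higher weight.
        weight = gamma ** (n_predictions - i - 1)
        i_loss = (pred - flow_gt).abs()
        total_loss += weight * (valid[:, None] * i_loss).mean()

    return total_loss
```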

However, the performance is not satisfying even on the training set (e.g., 1.207 EPE on cropped images/events and 2.913 EPE on full images/events).

Did this happen to you while training? Looking forward to your reply.

The following are screenshots of my training status. Note that 1px/3px/5px denotes the fraction of valid pixels whose EPE is below the given threshold.

[screenshots: training metrics]

@magehrig
Contributor

magehrig commented Jul 22, 2022

Hi @eugenelyj

Sorry for the late reply. I reimplemented the method for a follow-up project and reached pretty much exactly the same performance but using a simpler One-Cycle LR schedule. Specifically, I used a learning rate of 0.0001, 250k steps, batch size of 3. In addition to the cropping you mentioned, I used 50% prob for horizontal flipping and also 10% prob for vertical flipping (have not tested without). These are the training metrics:

[screenshot: training metrics]

Nowadays, I would go with the One-Cycle LR schedule to keep it simple.
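For reference, a minimal sketch of that setup in PyTorch (peak LR 1e-4 and 250k steps as stated above; the `pct_start`/`anneal_strategy`/`cycle_momentum` values follow RAFT's train.py, and the model here is only a placeholder):

```python
import torch

# Placeholder network standing in for the recurrent flow model.
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

num_steps = 250_000
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, total_steps=num_steps,
    pct_start=0.05, cycle_momentum=False, anneal_strategy='linear')

# In the training loop, call scheduler.step() once per batch,
# after optimizer.step(), so the LR is low in the final stages.
```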

@eugenelyj
Author

@magehrig
Thanks! Also, what were the evaluation metrics at each epoch? This information would help me a lot :)

@magehrig
Contributor

magehrig commented Jul 23, 2022

I don't have an evaluation in the loop because I am training on the full training set here. In general, the checkpoint at the last step is performing the best. In this case, the last checkpoint achieves 0.786 EPE and 2.74 AE on the test set.
It's important that you have a low learning rate in the final stages of your training, otherwise, your test score fluctuates. That's why a One-Cycle LR schedule is quite convenient.

@eugenelyj
Author

Actually, I also use the One-Cycle LR schedule because my training code was migrated from RAFT. I will retrain since some of my settings differ: for example, my batch size is 6 and I skipped the flipping augmentation. Thanks for the information.

@magehrig
Contributor

I still think something else must be wrong with your code, because your training EPE is so high.
Another thing to check: In the original RAFT, if you want Flow from img1 to img2, you would give the context network the features from img1. In the case of event data, we provide the features from event representation 2, not 1.
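A sketch of that feature ordering (the `fnet`/`cnet` names follow RAFT's conventions; the actual E-RAFT code may structure this differently):

```python
import torch

def forward_features(fnet, cnet, ev_repr_1, ev_repr_2):
    """Sketch of the ordering described above: correlation features come
    from both event representations, but the context network receives the
    features from event representation 2, not 1."""
    fmap1 = fnet(ev_repr_1)    # correlation features, reference
    fmap2 = fnet(ev_repr_2)    # correlation features, target
    context = cnet(ev_repr_2)  # context from event representation 2
    return fmap1, fmap2, context
```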

@HTLeoo

HTLeoo commented Aug 17, 2022

@eugenelyj Hi, have you reproduced the results on the benchmark?

@eugenelyj
Author

eugenelyj commented Aug 22, 2022

@HTLeoo
Hi, I tried again, and the following are the latest results I got.
After convergence, it achieves about 1.5 EPE on the full-resolution training data (trained on the cropped data) and 12.713 EPE on average on the testing data (submitted to the DSEC website).

Some details:
lr: 1e-4
lr schedule: one-cycle
batch size: 8
epochs: 150
random crop size: [288, 384]
prob of y-flipping: 0.1
prob of x-flipping: 0.5

[screenshots: training curves]

@eugenelyj
Author

@magehrig
Hi, I fixed some mistakes in my code, and the training EPE is now low (about 0.58).
However, the EPE on the testing data (12.7 on average) and on the full-resolution training data (1.4 on average) is still high.
"In the case of event data, we provide the features from event representation 2, not 1." -> This should not be a problem for me, as the model code is from this repo; all I added was a training script.

@magehrig
Contributor

Are you correctly augmenting the flow labels when flipping? E.g. horizontal flipping -> invert x channel direction, and vertical flipping -> invert y channel? I think wrong data is the most likely explanation for your train-test difference in performance. Such overfitting is basically impossible with this model on this dataset.
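A minimal sketch of correct flip augmentation for flow labels (assuming channel 0 is the x displacement and channel 1 the y displacement, as in RAFT):

```python
import torch

def flip_flow(img, flow, horizontal=True):
    """Flip an image/event tensor (C, H, W) together with its flow label
    (2, H, W). The flow must be spatially flipped AND have the displacement
    along the flipped axis negated -- flipping it like an RGB image is wrong."""
    if horizontal:
        img = torch.flip(img, dims=[-1])
        flow = torch.flip(flow, dims=[-1])
        flow[0] = -flow[0]  # invert x displacement
    else:
        img = torch.flip(img, dims=[-2])
        flow = torch.flip(flow, dims=[-2])
        flow[1] = -flow[1]  # invert y displacement
    return img, flow
```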

@eugenelyj
Author

@magehrig Thank you, I did indeed mishandle the flow during data augmentation (I flipped the flow image the same way as an RGB image but forgot to negate the flow values).
@HTLeoo I have now reproduced the E-RAFT results; the benchmark numbers are shown below, and it even performs slightly better than the original paper. I guess this is because I also use slight scaling for data augmentation (I use augmentor.py from RAFT with the photometric augmentation removed).
[screenshot: benchmark results]
