Iteration stop randomly again #7

Closed
xtanitfy opened this issue Apr 12, 2018 · 4 comments

Comments

xtanitfy commented Apr 12, 2018

I tried updating msgpack-numpy and msgpack as you suggested, but it didn't help. Can you tell us which system you are using? My Ubuntu 16.04 machine automatically loses its IP address after a period of time, but your code uses pip. I suspect the random stops during iteration are caused by a system problem.
The printed log is as follows:
epoch:0, iter:4399, rpn_loss_cls: 0.0677, rpn_loss_box: 0.0325, loss_cls: 0.3604, loss_box: 0.6708, tot_losses: 1.1314, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4400, rpn_loss_cls: 0.0281, rpn_loss_box: 0.0015, loss_cls: 0.0247, loss_box: 0.0002, tot_losses: 0.0544, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4401, rpn_loss_cls: 0.0373, rpn_loss_box: 0.0025, loss_cls: 0.0489, loss_box: 0.0004, tot_losses: 0.0892, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4402, rpn_loss_cls: 0.0223, rpn_loss_box: 0.0026, loss_cls: 0.0173, loss_box: 0.0130, tot_losses: 0.0551, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4403, rpn_loss_cls: 0.0571, rpn_loss_box: 0.0198, loss_cls: 0.2124, loss_box: 0.2481, tot_losses: 0.5374, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4404, rpn_loss_cls: 0.0359, rpn_loss_box: 0.0135, loss_cls: 0.1283, loss_box: 0.1383, tot_losses: 0.3160, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4405, rpn_loss_cls: 0.0455, rpn_loss_box: 0.0516, loss_cls: 0.1455, loss_box: 0.0754, tot_losses: 0.3181, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4406, rpn_loss_cls: 0.0611, rpn_loss_box: 0.0380, loss_cls: 0.0184, loss_box: 0.0022, tot_losses: 0.1198, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4407, rpn_loss_cls: 0.0297, rpn_loss_box: 0.0195, loss_cls: 0.0216, loss_box: 0.0106, tot_losses: 0.0814, lr: 0.0006, speed: 0.391s/iter
epoch:0, iter:4408, rpn_loss_cls: 0.0397, rpn_loss_box: 0.0038, loss_cls: 0.0574, loss_box: 0.0496, tot_losses: 0.1505, lr: 0.0006, speed: 0.391s/iter: 32%|▎| 4408/13754 [28:43<57:22, 2.72it/s]

@zengarden (Owner)

@xtanitfy FYI

msgpack (0.5.6)
msgpack-numpy (0.4.3)
msgpack-python (0.4.8)

Ubuntu 16.04.3 LTS
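If it helps to reproduce this environment, pinning those versions with pip should look roughly like the line below (a sketch, assuming a plain pip/virtualenv setup; adjust to your own environment):

pip install msgpack==0.5.6 msgpack-numpy==0.4.3 msgpack-python==0.4.8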

BTW, we use 16 processes to produce data in light-head (controlled by nr_dataflow = 16 in config.py); this should be adjusted to match your machine, as sketched below.
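
For example, to lower this on a machine with fewer CPU cores, a change along these lines in config.py should be enough (a sketch; the exact surrounding contents of config.py may differ):

# config.py (sketch): use fewer dataflow processes on a smaller machine
nr_dataflow = 4  # default is 16; roughly match your available CPU cores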

xtanitfy commented Apr 13, 2018

I think I have solved this problem; at least there has been no stop after 11 epochs this time. The cause was most likely an oversight I made when converting my dataset.
Thanks for your helpful reply!

@dedoogong

I had the same problem and solved it too. The problem came from the same image file path being duplicated in several places in the odgt file generated by my own Python script. There must be a race condition when the 16 data processes access the same image, after which they wait forever.
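
As a quick sanity check, a small script along these lines can flag duplicated image paths in an odgt file (one JSON record per line). The fpath key and the train.odgt file name are assumptions here, so substitute whatever your own conversion script writes:

import json
from collections import Counter

def find_duplicate_paths(odgt_path, key="fpath"):
    # Count how often each image path appears in the odgt file
    # (one JSON record per line) and return the duplicated ones.
    counts = Counter()
    with open(odgt_path) as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)[key]] += 1
    return {path: n for path, n in counts.items() if n > 1}

if __name__ == "__main__":
    for path, n in find_duplicate_paths("train.odgt").items():
        print("%s appears %d times" % (path, n))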


HiKapok commented Jul 6, 2018

@dedoogong I reduced the number of data processes from 16 to 1, and the training still hangs.
