
It will stop in training, there is no error in terminal. #11

Closed
szm88 opened this issue Dec 26, 2017 · 12 comments

Comments

@szm88

szm88 commented Dec 26, 2017

Hi @jeasinema, I got your code and did the following steps:
1. Changed the GPU config: '3,1,2,0' -> '0'  # I only have a GTX 1070
2. python3 setup.py build_ext --inplace  # I use Python 3.5.2
3. cd utils
   python3 preprocess.py

   Error: there is no config module there, so I copied tf_voxelnet/config.py to tf_voxelnet/utils/. Is that right?
   The data is from the KITTI object dataset: data_object_velodyne.zip (about 29 GB), image_2 (12 GB), label_2, and voxel (about 25 GB).
4. python3 train.py

   Error: there are no labels for the testing data, so I copied "training" to "testing" in train.py.

Then:
..........
train: 18/60 @ epoch:3/10 loss: 1.9014711380004883 reg_loss: 0.31266674399375916 cls_loss: 1.5888043642044067 default
train ['000004']
--------------------using time: 73.70951771736145s-------------------
train: 19/60 @ epoch:3/10 loss: 1.5401957035064697 reg_loss: 0.23529152572155 cls_loss: 1.3049042224884033 default
train ['000001']
--------------------using time: 77.3743188381195s-------------------
train: 20/60 @ epoch:3/10 loss: 1.8793950080871582 reg_loss: 0.2751219868659973 cls_loss: 1.6042730808258057 default

It stops at this point, and there is no error in the terminal.


When I use 4 Titan X cards to train the model, it uses all 20 CPU threads and 45 GB of RAM, but only 149 MB * 8 of GPU memory. Why does it use so much CPU?
I found the GPU utilization is 0%, 0%, 0%, 50%. The 50% comes from another model I am training, so I think this job isn't using the GPU at all. What's the reason?

@szm88 szm88 closed this as completed Dec 26, 2017
@turboxin

Hi szm88, could you please share how you solved this problem? I seem to have the same problem:

train: 20/18700 @ epoch:0/10 loss: 4.318506240844727 reg_loss: 2.653141498565674 cls_loss: 1.6653645038604736 default

It just stops here with no error in the terminal, and the GPU utilization drops to 0% while GPU memory usage stays high at 8527/11172 MB on 4 1080 Ti cards.

@qianguih

qianguih commented Feb 1, 2018

I ran into the same problem. Any suggestion or comment will be appreciated. : )

@dominikj93

dominikj93 commented Feb 1, 2018

As far as I know, the labels for the testing set are not publicly available, so you cannot use the training/testing split as provided by the KITTI dataset. The solution is to split the training set into smaller training and validation sets. At least that's what worked for me. Hope it helped!
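
For reference, a minimal sketch of such a split, assuming the standard KITTI object folder layout (velodyne/, image_2/, label_2/, calib/) and a plain-text split file with one sample index per line. The paths and file names below are placeholders, not this repository's actual layout:

```python
import os
import shutil

# Source: the full KITTI object training folder; destinations: the smaller
# train/val folders. All paths here are placeholders.
KITTI_TRAINING = 'data/object/training_full'
OUTPUT_ROOT = 'data/object'
SUBDIRS = {'velodyne': '.bin', 'image_2': '.png', 'label_2': '.txt', 'calib': '.txt'}

def read_split(path):
    """Read sample indices such as '000123' from a split file, one per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def make_split(split_file, dest_name):
    """Copy every listed sample from the full training set into dest_name."""
    for idx in read_split(split_file):
        for sub, ext in SUBDIRS.items():
            dst_dir = os.path.join(OUTPUT_ROOT, dest_name, sub)
            os.makedirs(dst_dir, exist_ok=True)
            shutil.copy(os.path.join(KITTI_TRAINING, sub, idx + ext),
                        os.path.join(dst_dir, idx + ext))

if __name__ == '__main__':
    make_split('train.txt', 'training')      # e.g. the commonly used 3712-sample split
    make_split('val.txt', 'validation')      # e.g. the commonly used 3769-sample split
```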

@qianguih

qianguih commented Feb 5, 2018

@dominikj93 Thanks for your reply. Actually, I have already split the training data. However, it still crashes sometimes during training.

@jeasinema
Contributor

@qianguih please upload the terminal output from when you run the program. It's hard for us to determine what went wrong with such limited information.

BTW, @dominikj93 does provide the correct solution; sorry for not mentioning that I use a split file available here.

@jeasinema jeasinema reopened this Feb 6, 2018
@qianguih

qianguih commented Feb 6, 2018

@jeasinema Thanks for your reply. I did use the same split file in my experiments. It runs smoothly for a couple of epochs and then just stops training without reporting any errors or warnings. It works most of the time on a 1080 Ti GPU but fails frequently on a P100 GPU. I don't have a sample output right now; I am trying to reproduce the problem and will post one here when it is available. Currently, I suspect something is wrong with the multi-threaded processing in the data loader.
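
One way to confirm where the hang happens (not part of this repository, just a standard-library debugging aid) is Python's faulthandler module, which can dump every thread's stack trace once training stalls:

```python
import faulthandler
import signal

# Dump all thread stack traces to stderr when the process receives SIGUSR1,
# e.g. run `kill -USR1 <pid>` from another shell after training stalls.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump the stacks of all threads every 10 minutes:
# faulthandler.dump_traceback_later(600, repeat=True)
```

If the dump shows the loader threads blocked on a queue operation while the main thread waits for data, that points at the loader rather than the TensorFlow graph.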

@jeasinema
Contributor

@qianguih you did find a potential problem. The loaders may be competing with the model, so you can try to add more workers, like here.
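
As an illustration of that idea only (the identifiers below are placeholders, not this repository's actual loader), several worker processes fill one bounded queue so the training loop rarely waits on preprocessing:

```python
import multiprocessing as mp

def load_sample(idx):
    # Placeholder for the real per-sample work (point cloud reading,
    # voxelization, label parsing); returns the index so the sketch runs.
    return idx

def worker(indices, queue):
    """Each worker preprocesses its share of samples and pushes them to the queue."""
    for idx in indices:
        queue.put(load_sample(idx))

def start_loaders(sample_indices, num_workers=8, queue_size=32):
    """Spawn num_workers loader processes feeding one bounded queue."""
    queue = mp.Queue(maxsize=queue_size)   # bounded: workers block, not the trainer
    chunks = [sample_indices[i::num_workers] for i in range(num_workers)]
    procs = [mp.Process(target=worker, args=(chunk, queue), daemon=True)
             for chunk in chunks]
    for p in procs:
        p.start()
    return queue, procs
```

The training loop would then take samples with queue.get(); raising num_workers and queue_size trades CPU and RAM for keeping the GPU fed.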

@ashishkumar-rambhatla

@jeasinema Can you share with us the code to split the KITTI training data using the split files provided?

@qianguih

qianguih commented Feb 6, 2018

@jeasinema Attached is a sample log. The CPU thread is still running but the GPU thread is dead. There is no error or warning. I have tried 8 workers. However, it didn't help.

log.txt

@jeasinema
Contributor

@qianguih Have you tried pausing the training and then restarting it? Does it still get stuck there?

@qianguih

qianguih commented Feb 7, 2018

@jeasinema No, I didn't try that. Instead, I replaced the multi-threaded data loader with a plain single-threaded loader, and that solved the problem, which confirms the issue does come from the multi-threaded data loader.
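
For anyone looking for the same workaround, here is a minimal sketch of such a plain sequential loader; the identifiers are placeholders, and load_sample stands in for the repository's per-sample preprocessing:

```python
def load_sample(idx):
    # Placeholder for the real per-sample preprocessing (point cloud,
    # voxelization, labels); kept trivial so the sketch runs.
    return idx

def simple_loader(sample_indices, batch_size):
    """Yield batches one by one in the caller's thread: no queues, no workers."""
    for start in range(0, len(sample_indices), batch_size):
        batch_ids = sample_indices[start:start + batch_size]
        yield [load_sample(idx) for idx in batch_ids]

# Illustrative use in a TF1-style training loop:
# for batch in simple_loader(train_ids, batch_size=2):
#     sess.run(train_op, feed_dict=build_feed(batch))
```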

@zhanpx

zhanpx commented Oct 16, 2018

> @jeasinema No, I didn't try that. Instead, I replaced the multi-threaded data loader with a plain single-threaded loader, and that solved the problem, which confirms the issue does come from the multi-threaded data loader.

I ran across the same problem. Could you share your loader? Thanks!
