Train Faster RCNN #494
Comments
Yes. I've seen the same error on one machine (but not the others) and used the same solution. I think that's because the GPU on that machine is in exclusive mode, so using multiprocessing may cause problems like this.
Can you check your
My machine has 8 V100 GPUs and it is a bare-metal machine.
Interesting. Mine is P100. It might have something to do with how the new GPUs handle the fork. It works on old GPUs, though.
Got it, and thanks.
Do you mind sharing the training speed on 8 P100s for Faster RCNN? With 8 GPUs, I can only get ~200-300 seconds per epoch, and the utilization of each GPU is about 50%-60%.
The speed only becomes stable after about 3k steps. The default settings take 70~80 seconds per epoch, with GPU utilization at 70%~80%.
The dataflow problem should be fixed now. |
Thanks. It seems that the speed per epoch still varies on my machine even after 30 epochs (9k steps). Furthermore, may I ask a few questions about the training log?
Thanks.
Here the speed roughly decreased from 70 sec/epoch at epoch 10 to 120 sec/epoch at epoch 700. It slows down because of more and more positive predictions. I haven't seen 200.
Thanks for your help. After about 400 epochs, the speed is more stable (~105 seconds per epoch). After pulling the newest changes, I got an error about mismatched data types. At
Error message:
since you cast the
However, the
Note: I am using Python 3.6.
Yes.
Another error :( I just pulled the newest changes.
This looks more like a problem with your environment.
Hmm, weird. After I comment out line 78 in
It works again. Maybe my system needs to keep this variable set even when we place ops on the CPU.
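For context, the thread does not name the variable in question, but a common pattern for forcing ops onto the CPU is hiding GPUs through the `CUDA_VISIBLE_DEVICES` environment variable; this is a hypothetical sketch of the "keep it set rather than delete it" idea, not the actual line 78:

```python
import os

# Hypothetical sketch: some setups need the variable to stay *defined*
# (set to an empty string) rather than removed from the environment,
# even when all ops are meant to run on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # keep it set, but expose no GPUs
```

Deleting the variable entirely (`del os.environ[...]`) can behave differently from setting it empty, which may explain the machine-specific behavior.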
Hi, this might be more of a general TensorFlow question; if you think Stack Overflow is a better place to ask, please just ignore it. Now, the input image size is
Do you have any suggestion? Thanks.
I can't understand exactly what "dynamically change the image resolution" means. But from all I can see
Thanks for your reply. Yes. E.g., in my graph there are two tensors, A and B, and the size of A is larger than B's; at run time, I would like to resize B to the size of A (the size of A varies); however, I cannot infer the size of A since I create a placeholder
Thanks.
Sorry, my bad, never mind. Thanks for the reminder to use a "tensor"... I used to use a "list" to specify the target tensor size.
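To make the exchange concrete: `tf.shape(A)` returns a *tensor* holding the dynamic shape, so it can drive a resize even when A comes from a placeholder with unknown height and width. A minimal sketch (written against the current `tf.image.resize` API; the thread used TF 1.4, where the equivalent call is `tf.image.resize_images`):

```python
import tensorflow as tf

def resize_like(b, a):
    """Resize NHWC tensor b to the runtime height/width of NHWC tensor a."""
    # tf.shape(a) is a tensor of the dynamic shape, usable even when the
    # static shape (a.get_shape()) is only partially known.
    target_hw = tf.shape(a)[1:3]          # dynamic [height, width] of A
    return tf.image.resize(b, target_hw)
```

Using a Python list of static dimensions here fails for variable-size placeholders, which is exactly the "tensor, not list" point above.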
FYI, there was a bug introduced on Nov 13 and fixed just now. It affects the precision.
Noted and thanks. |
A bug when resuming Faster RCNN training: after I resume from a trained model, the learning rate of the first epoch is 0.003.
On the other hand, I can understand that the speed per epoch may vary, since the number of positive proposals may grow over more epochs; however, when I resume a model, the speed becomes as slow as training from scratch and then gradually increases. But the number of positive proposals after resuming should be identical or similar to before, so why is it slow? (As you can see, the log above shows finishing times of 844 sec, while I got about 200 seconds on average before resuming.)
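For the learning-rate part of this report: a piecewise schedule has to be evaluated at the *resumed* global epoch rather than restarting at epoch 0, otherwise the first resumed epoch falls back to the warm-up rate (e.g. 0.003). A hypothetical sketch of such a lookup (illustrative only, not tensorpack's actual scheduler):

```python
def lr_at(epoch, schedule):
    """Return the learning rate in effect at `epoch`.

    `schedule` is a list of (start_epoch, lr) pairs sorted by start_epoch.
    When resuming at epoch N, calling lr_at(N, schedule) avoids briefly
    reverting to the schedule's initial (warm-up) rate.
    """
    lr = schedule[0][1]
    for start, value in schedule:
        if epoch >= start:
            lr = value
    return lr
```

A resume path that forgets the epoch offset effectively calls `lr_at(0, ...)` for its first epoch, reproducing the 0.003 symptom described above.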
TensorFlow convolutions need warm-up. For variable-size inputs they need more.
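This warm-up effect also matters whenever you time the training loop: the first iterations are dominated by one-time costs (e.g. cuDNN autotuning, memory allocation), and with variable-size inputs those costs can recur for each new shape. A minimal timing helper that discards initial iterations (names are illustrative, not from tensorpack):

```python
import time

def timed_average(fn, warmup=10, iters=50):
    """Average runtime of fn(), ignoring the first `warmup` calls.

    Discarding early calls keeps one-time setup costs (autotuning,
    allocation) out of the measured steady-state speed.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```

This is why the per-epoch numbers in the thread only stabilize after a few thousand steps, and why the first epochs after a resume look slow again.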
Okay, got it. Thanks. |
A subtle bug that makes the result 2 points worse: 6fc4378. This again shows how important it is to match the paper's performance -- if I hadn't tried to compare against some reference number, I'd never have found hidden mistakes like this.
Thanks for sharing your experience. :) |
Just want to share the trained results; the base model is ResNet-50:
Thanks! I know the model probably gets slightly better than before, but I haven't had time to train it.
@ppwwyyxx @chunfuchen How can you achieve 60~200s per epoch? I used 4 K80s, and after 3k steps it still takes 1000s per epoch. I fine-tuned the Mask RCNN model with the following code:
There are many factors here:
These two lines were recently added. They may impact speed (probably not much, if any), but I haven't run a benchmark yet.
@ppwwyyxx Copy that~~Thanks for your patience~ |
I get an error when training Faster RCNN based on your example; however, with your model I am able to evaluate its performance and get the same results you posted on GitHub.
Always include the following:
./examples/FasterRCNN/train.py --load snapshots/tensorpack/COCO-ResNet50-FasterRCNN.npz --gpu 2,3 --datadir /path/to/COCO14 --logdir snapshots/fasterRCNN-ResNet50
and then the program idles there forever. Is it related to the line about
CUDA_ERROR_NOT_INITIALIZED
Your environment (TF version, GPUs), if it matters.
TF version 1.4.0, Python-3.6, CUDA 9, CUDNN-7.
Tensorpack version: the newest commit.
Others: if I comment out `ds = PrefetchDataZMQ(ds, 1)` in the `get_train_dataflow` function of the `data.py` file, the training runs. Or if I replace `ds = PrefetchDataZMQ(ds, 1)` with `ds = PrefetchData(ds, 500, 1)`, it works as well. Thanks.