
about training steps #6

Closed
ehabhelmy82 opened this issue Mar 24, 2019 · 7 comments

@ehabhelmy82

Dear Author,
Thanks for sharing your interesting work, but I have the following questions:

  1. What are the specs of the machine you used? Training takes days on mine and has not finished yet; for example, it has been running for 3 days and has only reached:
    [2019-03-24 17:43:27,001] [train step 145431] Loss: 4.35295 Pixel loss: 4.03669 Flow loss: 0.31627 (1.589 sec/batch, 5.036 instances/sec)
    So, how many training steps are needed in total?
  2. The input and output of your network are just images, not videos, right?
  3. I tried the following command:
    python trainer.py --batch_size 8 --dataset car --num_input 4
    but it gives the following error after reaching train step 4261 (see the traceback below). Do you have any idea why?

[2019-03-23 05:36:24,763] [train step 4261] Loss: 2.96025 Pixel loss: 2.85450 Flow loss: 0.10575 (1.607 sec/batch, 2.489 instances/sec)
Traceback (most recent call last):
File "trainer.py", line 380, in
main()
File "trainer.py", line 377, in main
trainer.train()
File "trainer.py", line 193, in train
opt_gan=s > gan_start_step, is_train=True)
File "trainer.py", line 209, in run_single_step
batch_chunk = self.session.run(batch)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: RandomShuffleQueue '_0_shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 4, current size 3)
[[Node: shuffle_batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_STRING, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](shuffle_batch/random_shuffle_queue, shuffle_batch/n)]]

Caused by op u'shuffle_batch', defined at:
File "trainer.py", line 380, in
main()
File "trainer.py", line 374, in main
trainer = Trainer(config, dataset_train, dataset_test)
File "trainer.py", line 48, in init
dataset, self.batch_size, is_training=True)
File "/data/ehab/Multiview2NovelviewMaster/input_ops.py", line 76, in create_input_ops
min_after_dequeue=min_capacity,
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 1220, in shuffle_batch
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 791, in _shuffle_batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 457, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1342, in _queue_dequeue_many_v2
timeout_ms=timeout_ms, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): RandomShuffleQueue '_0_shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 4, current size 3)
[[Node: shuffle_batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_STRING, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](shuffle_batch/random_shuffle_queue, shuffle_batch/n)]]

  4. I was able to get some figures from TensorBoard, but how can I obtain the numerical results published in the tables of your paper?

I really appreciate your time and your reply.
Regards
@ehabhelmy82
Author

?

@ehabhelmy82
Author

Any reply is really appreciated

@shaohua0116
Owner

Sorry for the delayed response.

  1. It takes a long time to train the model. If I remember correctly, it takes roughly 2 or 3 days (1M iterations).
  2. The input is just images.
  3. I tried your command and it works just fine. Usually, this error message means a data point is corrupted and cannot be read. Can you check your dataset file?
  4. Please download the provided checkpoints and run the evaluation script.
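
A minimal sketch for locating such a corrupted entry is shown below. It assumes the dataset is stored as a single HDF5 file with one group (or dataset) per example ID listed in an id file; the paths and layout here are illustrative assumptions, not taken from this thread.

# check_dataset.py -- sketch: scan an HDF5 dataset for entries that cannot be read.
# The file/ID-list paths and the per-example layout are hypothetical placeholders.
import h5py

DATA_FILE = 'datasets/shapenet_car/data.hdf5'   # hypothetical path
ID_FILE = 'datasets/shapenet_car/id_train.txt'  # hypothetical path

with h5py.File(DATA_FILE, 'r') as f:
    with open(ID_FILE) as ids:
        for line in ids:
            key = line.strip()
            if not key:
                continue
            try:
                obj = f[key]
                if isinstance(obj, h5py.Group):
                    # Force a full read of every array stored under this example.
                    for name in obj:
                        _ = obj[name][()]
                else:
                    _ = obj[()]
            except Exception as e:
                print('unreadable entry: %s (%s)' % (key, e))

Any ID printed by a scan like this would be a candidate for the data point that makes the input queue close early and trigger the OutOfRangeError above.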

@ehabhelmy82
Author

Thanks for your reply.
One more question:
After how many iterations were the results published in the paper computed?
In other words, after how many iterations does the model stop training?
Thanks

@shaohua0116
Owner

For ShapeNet (both cars and chairs), the models were trained for roughly 300k iterations without the GAN loss and 200k more iterations with the GAN loss.
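
A two-phase schedule like this can be expressed with a simple step threshold. The sketch below only illustrates the idea; the constants and function names are placeholders based on the `opt_gan=s > gan_start_step` call visible in the traceback above, not the repository's actual API.

# Sketch: which loss terms are active at a given training step under a
# "pixel/flow first, GAN later" schedule. Names and numbers are illustrative.
GAN_START_STEP = 300000   # first ~300k steps: pixel + flow losses only
MAX_STEPS = 500000        # then ~200k more steps with the GAN loss added

def losses_for_step(s):
    """Return the loss terms optimized at training step s."""
    active = ['pixel_loss', 'flow_loss']
    if s > GAN_START_STEP:          # mirrors opt_gan = s > gan_start_step
        active.append('gan_loss')
    return active

print(losses_for_step(100000))      # ['pixel_loss', 'flow_loss']
print(losses_for_step(400000))      # ['pixel_loss', 'flow_loss', 'gan_loss']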

@ehabhelmy82
Author

I used the following training command :
python trainer.py --batch_size 8 --dataset chair --num_input 4

and the following testing command:

python evaler.py --dataset chair --data_id_list ./testing_tuple_lists/id_chair_random_elevation.txt --loss --checkpoint /data/ehab/Multiview2NovelviewMaster/train_dir/chair-default-bs_8_lr_flow_0.0001_pixel_5e-05_d_0.0001-num_input-4-20190325-150046/model-335001 --write_summary --summary_file log_chair335.txt

The results recorded in the report are for only two views, not 4, as follows:

Checkpoint: /data/ehab/Multiview2NovelviewMaster/train_dir/chair-default-bs_8_lr_flow_0.0001_pixel_5e-05_d_0.0001-num_input-4-20190325-150046/model-335001
Dataset: chair
Id list: ./testing_tuple_lists/id_chair_random_elevation.txt
[Final Avg Report] Total datapoint: 10000 from ./testing_tuple_lists/id_chair_random_elevation.txt
[Loss]
aggregate_improvement: 0.00000
aggregate_l1_loss: 0.34866
aggregate_report_loss_0: 0.52299
aggregate_report_loss_1: 0.52299
aggregate_report_ssim_0: 0.85165
aggregate_report_ssim_1: 0.85165
aggregate_total_loss: 0.34866
best_of_pixel_of_flow_report_loss: 0.52299
best_of_pixel_of_flow_report_ssim: 0.85165
flow_avg_report_loss_0: 0.34866
flow_avg_report_loss_1: 0.34866
flow_improvement: 0.00000
flow_l1_loss: 0.34866
flow_only_aggregate_improvement: -0.00000
flow_only_aggregate_l1_loss: 0.34866
flow_only_aggregate_report_loss_0: 0.52299
flow_only_aggregate_report_loss_1: 0.52299
flow_only_aggregate_report_ssim_0: 0.85165
flow_only_aggregate_report_ssim_1: 0.85165
flow_only_aggregate_total_loss: 0.34866
flow_total_loss: 0.34866
pixel_improvement: 0.00000
pixel_l1_loss: 0.34866
pixel_report_loss_0: 0.52299
pixel_report_loss_1: 0.52299
pixel_report_ssim_0: 0.85165
pixel_report_ssim_1: 0.85165
pixel_total_loss: 0.34866
[Time] (542.412 sec)

Do you have any idea why the report contains results for only two views?
Regards

@shaohua0116
Owner

By default, the evaler only feeds two source views to the model (which can be seen here). You need to specify --num_input 4 to evaluate the model using four source images.
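
For example, the evaluation command from the earlier comment would become the following once the flag is added (paths unchanged; only --num_input 4 is new):

python evaler.py --dataset chair --num_input 4 --data_id_list ./testing_tuple_lists/id_chair_random_elevation.txt --loss --checkpoint /data/ehab/Multiview2NovelviewMaster/train_dir/chair-default-bs_8_lr_flow_0.0001_pixel_5e-05_d_0.0001-num_input-4-20190325-150046/model-335001 --write_summary --summary_file log_chair335.txt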
