
about training steps #6

Closed
ehabhelmy82 opened this issue Mar 24, 2019 · 7 comments

@ehabhelmy82

Dear Author,
Thanks for sharing your interesting work, but I have the following questions:

  1. What are the specs of the machine you used? Training takes days on mine and has not finished yet; for example, it has been running for 3 days and has only reached:
    [2019-03-24 17:43:27,001] [train step 145431] Loss: 4.35295 Pixel loss: 4.03669 Flow loss: 0.31627 (1.589 sec/batch, 5.036 instances/sec)
    So, how many training steps are needed in total?
  2. The input and output of your network are just images, not videos, right?
  3. I tried the following command:
    python trainer.py --batch_size 8 --dataset car --num_input 4
    but it gives the following error after reaching train step 4261 (see the traceback below). Do you have any idea why?

[2019-03-23 05:36:24,763] [train step 4261] Loss: 2.96025 Pixel loss: 2.85450 Flow loss: 0.10575 (1.607 sec/batch, 2.489 instances/sec)
Traceback (most recent call last):
File "trainer.py", line 380, in
main()
File "trainer.py", line 377, in main
trainer.train()
File "trainer.py", line 193, in train
opt_gan=s > gan_start_step, is_train=True)
File "trainer.py", line 209, in run_single_step
batch_chunk = self.session.run(batch)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: RandomShuffleQueue '_0_shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 4, current size 3)
[[Node: shuffle_batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_STRING, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](shuffle_batch/random_shuffle_queue, shuffle_batch/n)]]

Caused by op u'shuffle_batch', defined at:
File "trainer.py", line 380, in
main()
File "trainer.py", line 374, in main
trainer = Trainer(config, dataset_train, dataset_test)
File "trainer.py", line 48, in init
dataset, self.batch_size, is_training=True)
File "/data/ehab/Multiview2NovelviewMaster/input_ops.py", line 76, in create_input_ops
min_after_dequeue=min_capacity,
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 1220, in shuffle_batch
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 791, in _shuffle_batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 457, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1342, in _queue_dequeue_many_v2
timeout_ms=timeout_ms, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): RandomShuffleQueue '_0_shuffle_batch/random_shuffle_queue' is closed and has insufficient elements (requested 4, current size 3)
[[Node: shuffle_batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_STRING, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](shuffle_batch/random_shuffle_queue, shuffle_batch/n)]]

  4. I was able to get some figures from TensorBoard, but how can I obtain the numerical results published in the tables of your paper?

I really appreciate your time and your reply.
Regards
@ehabhelmy82
Author

?

@ehabhelmy82
Author

Any reply is really appreciated

@shaohua0116
Owner

Sorry for the delayed response.

  1. It takes a long time to train the model. If I remember correctly, it takes roughly 2 or 3 days (1M iterations).
  2. The input is just images.
  3. I tried your command and it works just fine. Usually, this error message means a data point is corrupted and cannot be read. Can you check your dataset file?
  4. Please download the provided checkpoints and run the evaluation script.
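
A minimal sketch for locating such a corrupted entry is shown below. It assumes the dataset is stored as a single HDF5 file with one group (or dataset) per example ID listed in an id file; the paths and layout here are illustrative assumptions, not taken from this thread.

# check_dataset.py -- sketch: scan an HDF5 dataset for entries that cannot be read.
# The file/ID-list paths and the per-example layout are hypothetical placeholders.
import h5py

DATA_FILE = 'datasets/shapenet_car/data.hdf5'   # hypothetical path
ID_FILE = 'datasets/shapenet_car/id_train.txt'  # hypothetical path

with h5py.File(DATA_FILE, 'r') as f:
    with open(ID_FILE) as ids:
        for line in ids:
            key = line.strip()
            if not key:
                continue
            try:
                obj = f[key]
                if isinstance(obj, h5py.Group):
                    # Force a full read of every array stored under this example.
                    for name in obj:
                        _ = obj[name][()]
                else:
                    _ = obj[()]
            except Exception as e:
                print('unreadable entry: %s (%s)' % (key, e))

Any ID printed by a scan like this would be a candidate for the data point that makes the input queue close early and trigger the OutOfRangeError above.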

@ehabhelmy82
Author

Thanks for your reply.
One more question:
After how many iterations were the results published in the paper computed?
In other words, after how many iterations does the model stop training?
Thanks

@shaohua0116
Owner

For ShapeNet (both cars and chairs), the models were trained for roughly 300k iterations without the GAN loss and 200k more iterations with the GAN loss.
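
A two-phase schedule like this can be expressed with a simple step threshold. The sketch below only illustrates the idea; the constants and function names are placeholders based on the `opt_gan=s > gan_start_step` call visible in the traceback above, not the repository's actual API.

# Sketch: which loss terms are active at a given training step under a
# "pixel/flow first, GAN later" schedule. Names and numbers are illustrative.
GAN_START_STEP = 300000   # first ~300k steps: pixel + flow losses only
MAX_STEPS = 500000        # then ~200k more steps with the GAN loss added

def losses_for_step(s):
    """Return the loss terms optimized at training step s."""
    active = ['pixel_loss', 'flow_loss']
    if s > GAN_START_STEP:          # mirrors opt_gan = s > gan_start_step
        active.append('gan_loss')
    return active

print(losses_for_step(100000))      # ['pixel_loss', 'flow_loss']
print(losses_for_step(400000))      # ['pixel_loss', 'flow_loss', 'gan_loss']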

@ehabhelmy82
Author

I used the following training command :
python trainer.py --batch_size 8 --dataset chair --num_input 4

and the following testing command:

python evaler.py --dataset chair --data_id_list ./testing_tuple_lists/id_chair_random_elevation.txt --loss --checkpoint /data/ehab/Multiview2NovelviewMaster/train_dir/chair-default-bs_8_lr_flow_0.0001_pixel_5e-05_d_0.0001-num_input-4-20190325-150046/model-335001 --write_summary --summary_file log_chair335.txt

The results recorded in the report are for only two views, not 4, as follows:

Checkpoint: /data/ehab/Multiview2NovelviewMaster/train_dir/chair-default-bs_8_lr_flow_0.0001_pixel_5e-05_d_0.0001-num_input-4-20190325-150046/model-335001
Dataset: chair
Id list: ./testing_tuple_lists/id_chair_random_elevation.txt
[Final Avg Report] Total datapoint: 10000 from ./testing_tuple_lists/id_chair_random_elevation.txt
[Loss]
aggregate_improvement: 0.00000
aggregate_l1_loss: 0.34866
aggregate_report_loss_0: 0.52299
aggregate_report_loss_1: 0.52299
aggregate_report_ssim_0: 0.85165
aggregate_report_ssim_1: 0.85165
aggregate_total_loss: 0.34866
best_of_pixel_of_flow_report_loss: 0.52299
best_of_pixel_of_flow_report_ssim: 0.85165
flow_avg_report_loss_0: 0.34866
flow_avg_report_loss_1: 0.34866
flow_improvement: 0.00000
flow_l1_loss: 0.34866
flow_only_aggregate_improvement: -0.00000
flow_only_aggregate_l1_loss: 0.34866
flow_only_aggregate_report_loss_0: 0.52299
flow_only_aggregate_report_loss_1: 0.52299
flow_only_aggregate_report_ssim_0: 0.85165
flow_only_aggregate_report_ssim_1: 0.85165
flow_only_aggregate_total_loss: 0.34866
flow_total_loss: 0.34866
pixel_improvement: 0.00000
pixel_l1_loss: 0.34866
pixel_report_loss_0: 0.52299
pixel_report_loss_1: 0.52299
pixel_report_ssim_0: 0.85165
pixel_report_ssim_1: 0.85165
pixel_total_loss: 0.34866
[Time] (542.412 sec)

Do you have any idea why the report contains results for only two views?
Regards

@shaohua0116
Owner

By default, the evaler only feeds two source views to the model (which can be seen here). You need to specify --num_input 4 to evaluate the model using four source images.
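
For example, the evaluation command from the earlier comment would become the following once the flag is added (paths unchanged; only --num_input 4 is new):

python evaler.py --dataset chair --num_input 4 --data_id_list ./testing_tuple_lists/id_chair_random_elevation.txt --loss --checkpoint /data/ehab/Multiview2NovelviewMaster/train_dir/chair-default-bs_8_lr_flow_0.0001_pixel_5e-05_d_0.0001-num_input-4-20190325-150046/model-335001 --write_summary --summary_file log_chair335.txt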
