Errors with Train.py #25

luke2997 · 2020-03-03T12:24:37Z

I've extracted image patches successfully, however I get the following error when running train.py. Any ideas?

  File "train.py", line 184, in run_once
    logger.set_logger_dir(save_dir)
  File "../tensorpack/utils/logger.py", line 131, in set_logger_dir
    action = input("Select Action: k (keep) / d (delete) / q (quit):").lower().strip()
EOFError: EOF when reading a line

Any idea here? This is inside my log file:

[0303 01:56:20 @logger.py:90] Argv: train.py --gpu=0,1
[0303 01:56:39 @training.py:50] [DataParallel] Training a model of 2 towers.
[0303 01:56:39 @interface.py:31] Automatically applying QueueInput on the DataFlow.
[0303 01:56:39 @interface.py:43] Automatically applying StagingInput on the DataFlow.


simongraham · 2020-03-03T13:45:02Z

Hi @luke2997

It seems that you have not put the GPU ids as a string. If using GPUs 0 and 1, use:

python train.py --gpu='0,1'

Please let us know if this fixes the issue.

luke2997 · 2020-03-03T13:49:43Z

@simongraham - this doesn't fix the problem unfortunately.

simongraham · 2020-03-03T14:16:56Z

Can you let me know which tensorpack version you are using, and then copy the exact command that you use in the terminal?

luke2997 · 2020-03-03T16:29:53Z

I have both TensorFlow v1.12 and tensorpack 0.9.0.1. In the command line I type:

python train.py --gpu='0,1'

I also get the below output file, not sure if related.

simongraham · 2020-03-03T18:22:30Z

The output is telling you that there is already a checkpoint file where you plan to save your logs. You need to press k (keep), d (delete) or q (quit) depending on what you want to do.

I'm still not sure exactly what you are requesting here? Are you getting an error? If so, please supply the terminal output.
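For context on the EOFError above: the prompt is read with input(), which raises EOFError when the script has no interactive stdin, for example when it is launched from a batch job (that the script was run this way is an assumption based on the log file). The sketch below reproduces that behaviour with a hypothetical helper, choose_action, which falls back to a default answer instead of crashing. As far as I know, tensorpack's logger.set_logger_dir also accepts an action argument ('k', 'd' or 'q') that skips the prompt, but please check this against your installed version.

```python
import io
import sys


def choose_action(default="k"):
    """Ask what to do with an existing log directory.

    Hypothetical helper (not part of tensorpack): returns the typed answer,
    or `default` when stdin is closed or the answer is empty.
    """
    try:
        answer = input("Select Action: k (keep) / d (delete) / q (quit):")
    except EOFError:
        # No terminal attached, e.g. the script runs inside a batch job.
        return default
    return answer.lower().strip() or default


if __name__ == "__main__":
    # Simulate a non-interactive run by replacing stdin with an empty stream.
    sys.stdin = io.StringIO("")
    print(choose_action())
```

Deleting the old log directory before launching the job avoids the prompt altogether.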

luke2997 · 2020-03-04T14:19:09Z

Yeah, I did delete it. I think I have now found the source of the issue: a couple of modules are not importing properly because of libpng12. I'll update when I get this resolved.

simongraham · 2020-03-04T17:08:22Z

I will close this issue for now. Please reopen if necessary with a specific question, and then we can be of more assistance.

luke2997 · 2020-03-06T09:16:52Z

Please reopen: I had to start all over again as there were issues with my venv. extract_patches.py worked before, but when I run it now I get the error below. Is it something to do with the dataset?

vqdang · 2020-03-06T11:18:03Z

@luke2997, for the record, you can reopen it yourself.
Please show me

hover_net/src/extract_patches.py

Lines 27 to 28 in 909ef03

    
    step_size = [80, 80] # should match self.train_mask_shape (config.py)
    win_size  = [540, 540] # should be at least twice time larger than

in your code. Please read the error message: it says that the input has the wrong type (float).

Also, please provide as many details as possible for what you have changed in the code compared to the github version.

luke2997 · 2020-03-07T08:19:42Z

Well, I only changed the paths, which is why I didn't understand where the error came from. So I have the same lines as those above.

I had a look and tried running config.py, and it seems I get an error related to #12 (comment). So it seems to be an error with my paths, although they are the same paths I used before, which worked then!

One thing that may be causing it is that when I first run the code I get this error:

/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found

And I export it using

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/*username*/.conda/envs/*VirtualEnvironment*/lib/

which then fixes it but causes the error above.

vqdang · 2020-03-07T10:15:15Z

That is very strange; maybe your new environment broke something. For now, you can try changing this

hover_net/src/misc/patch_extractor.py

Line 82 in 909ef03

return flag, last_step

into return flag, int(last_step) so that every argument passed to range is an int.
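A minimal sketch of why the cast helps: range accepts only integers, so a step count that arrives as a float (under Python 2, for instance, math.floor returns a float) cannot be passed to it directly. The function names below are illustrative assumptions, not the code in patch_extractor.py.

```python
def window_starts(last_step, step_size):
    """Return patch start offsets; int() guards against a float last_step."""
    return list(range(0, int(last_step), step_size))


def window_starts_unsafe(last_step, step_size):
    """Same loop without the cast; fails if last_step is a float."""
    return list(range(0, last_step, step_size))


if __name__ == "__main__":
    # With the cast, a float bound is accepted.
    print(window_starts(480.0, 80))
    # Without it, range rejects the float.
    try:
        window_starts_unsafe(480.0, 80)
    except TypeError as err:
        print("TypeError:", err)
```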

vqdang · 2020-03-07T10:19:54Z

Also, the preferred library versions are listed here: https://github.com/vqdang/hover_net/blob/master/requirements.txt

You may want to check whether your installed versions match. In case you need to reinstall, use the following as a guideline.

conda create --name test python=3.6
conda activate test
pip install "opencv-python==3.2.*" scipy scikit-image pandas matplotlib
pip install --upgrade git+https://github.com/tensorpack/tensorpack.git@0.9.0.1
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
source activate test
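After installing, a quick way to compare what is actually in the environment with the versions pinned in the commands above is a small script like the following. It is only a sketch: installed_version and report are hypothetical helper names, and the expected versions are taken from the pip commands in this comment rather than from requirements.txt.

```python
def installed_version(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for older interpreters such as 3.6
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Versions pinned by the install commands in this thread.
EXPECTED = {"tensorpack": "0.9.0.1", "tensorflow-gpu": "1.12.0"}


def report(expected=EXPECTED):
    """Map each package name to a (wanted, found) pair."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


if __name__ == "__main__":
    for name, (want, have) in report().items():
        status = "ok" if have == want else "MISMATCH"
        print("{:<16} expected {:<10} found {:<10} {}".format(name, want, str(have), status))
```

Run it with the interpreter from the activated environment so that it inspects the right site-packages.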

luke2997 · 2020-03-07T11:11:37Z

Perfect, thanks a lot! Changing that line of code fixed it and I was able to extract patches successfully. However, I have a few errors with train.py suggesting that TensorFlow GPU is not working, so I will create a new virtual environment as you suggested and try again, in case there was a fault with tensorflow-gpu. Although it is a bit of a pain getting these packages installed all together for some reason.

  File "train.py", line 278, in <module>
    trainer.run()
  File "train.py", line 244, in run
    self.run_once(opt, sess_init=init_weights, save_dir=log_dir)
  File "train.py", line 188, in run_once
    model = self.get_model()(**model_flags)
  File "/lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/graph.py", line 113, in __init__
    assert tf.test.is_gpu_available()
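The assertion in graph.py can be reproduced outside train.py, which makes it easier to tell an environment problem from a problem in the training code. The sketch below wraps the same tf.test.is_gpu_available() call in a hypothetical helper, gpu_status, that returns None when TensorFlow cannot be imported at all.

```python
def gpu_status():
    """Return TensorFlow's GPU check result, or None if TF is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is missing from this interpreter's environment.
        return None
    # Same call that graph.py asserts on.
    return bool(tf.test.is_gpu_available())


if __name__ == "__main__":
    status = gpu_status()
    if status is None:
        print("TensorFlow is not installed in this environment")
    else:
        print("GPU available:", status)
```

If this prints False in the same environment, the CUDA and cuDNN versions are worth checking before anything else.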

simongraham · 2020-03-07T13:46:59Z

Hi @luke2997 ,

As @vqdang suggested, please set up your environment from scratch to ensure there are no issues with library versions.

Although it is a bit of a pain getting these packages installed all together for some reason.

Installing the libraries should be simple enough if you follow @vqdang's instructions. This is only a few lines in the terminal. Of course, you need to make sure you run the commands separately, line by line.

After this has been done, let us know and we can advise how to proceed. What CUDA version do you have installed?

luke2997 · 2020-03-07T14:39:40Z

I'm in Mainland China, so some channels get blocked; for example, the TensorFlow command above doesn't work for me. I believe I do have all the requirements, though. I have restarted and got a little further, so I assume the GPU is working now. I do get some output, but also the long error below. I have cudatoolkit 9.2 and cuDNN 7.6.5.

Traceback (most recent call last):
  File "train.py", line 278, in <module>
    trainer.run()
  File "train.py", line 244, in run
    self.run_once(opt, sess_init=init_weights, save_dir=log_dir)
  File "train.py", line 215, in run_once
    launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_gpus))
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/interface.py", line 97, in launch_train_with_config
    extra_callbacks=config.extra_callbacks)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 341, in train_with_defaults
    steps_per_epoch, starting_epoch, max_epoch)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 313, in train
    self.main_loop(steps_per_epoch, starting_epoch, max_epoch)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 278, in main_loop
    self.run_step() # implemented by subclass
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 181, in run_step
    self.hooked_sess.run(self.train_op)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
    run_metadata=run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Integer division by zero
  [[node tower0/div_12 (defined at /lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/utils.py:182) = FloorDiv[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower0/sub_13, tower0/sub_14)]]

Caused by op u'tower0/div_12', defined at:
  File "train.py", line 278, in <module>
    trainer.run()
  File "train.py", line 244, in run
    self.run_once(opt, sess_init=init_weights, save_dir=log_dir)
  File "train.py", line 215, in run_once
    launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_gpus))
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/interface.py", line 87, in launch_train_with_config
    model._build_graph_get_cost, model.get_optimizer)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/tower.py", line 204, in setup_graph
    train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/trainers.py", line 106, in _setup_graph
    self._make_get_grad_fn(input, get_cost_fn, get_opt_fn), get_opt_fn)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/training.py", line 161, in build
    grad_list = DataParallelBuilder.build_on_towers(self.towers, get_grad_fn, devices)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/training.py", line 119, in build_on_towers
    ret.append(func())
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/tower.py", line 232, in get_grad_fn
    cost = get_cost_fn(*input.get_input_tensors())
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/tfutils/tower.py", line 284, in __call__
    output = self._tower_fn(*args)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/model_desc.py", line 246, in _build_graph_get_cost
    ret = self.build_graph(*inputs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/model_desc.py", line 162, in build_graph
    return self._build_graph(args)
  File "/lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/graph.py", line 305, in _build_graph
    true_np = colorize(true_np[...,0], cmap='jet')
  File "/lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/utils.py", line 182, in colorize
    value = (value - vmin) / (vmax - vmin) # vmin..vmax
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 999, in _div_python2
    return gen_math_ops.floor_div(x, y, name=name)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 3079, in floor_div
    "FloorDiv", x=x, y=y, name=name)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Integer division by zero
  [[node tower0/div_12 (defined at /lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/utils.py:182) = FloorDiv[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower0/sub_13, tower0/sub_14)]]

simongraham · 2020-03-07T14:44:03Z

It looks like you are using Python 2.7. As stated in the requirements, you need to use Python 3.6.
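The clue is in the traceback itself: every library path contains lib/python2.7, and the failing op is created by _div_python2, where / on integers is a floor division. A quick way to confirm which interpreter a script really runs under is a check like the sketch below; interpreter_info and meets_minimum are hypothetical helper names, and the (3, 6) minimum is taken from the requirement stated above.

```python
import sys


def interpreter_info():
    """Return the path and version of the interpreter running this script."""
    return {"executable": sys.executable, "version": tuple(sys.version_info[:3])}


def meets_minimum(version=None, minimum=(3, 6)):
    """Check a (major, minor) version against the required minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version)[:2] >= tuple(minimum)


if __name__ == "__main__":
    info = interpreter_info()
    print("Running", info["executable"], "version", info["version"])
    if not meets_minimum():
        print("This interpreter is older than Python 3.6")
```

Putting the same check at the top of train.py would fail fast if the environment silently falls back to Python 2.7.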

luke2997 · 2020-03-07T14:52:47Z

I mean, the virtual environment is fully set up with Python 3.6, but it seems to be initialising with 2.7, as you said. I'll try to see how I can fix this.

Yeah, it seems one of the packages I installed manually through conda has simultaneously downgraded Python. Will update.

luke2997 · 2020-03-08T04:04:41Z

Right, thanks a lot for the help, I appreciate it. I've successfully trained on the data after changing the Python version!

simongraham added the usage how to run label Mar 3, 2020

simongraham closed this as completed Mar 4, 2020

vqdang reopened this Mar 6, 2020

luke2997 closed this as completed Mar 8, 2020

vqdang mentioned this issue Apr 9, 2020

you have done an excellent code but i am facing some issues while reproducing your paper. i am using the packages and libraries that you mentioned in requirement.txt #38

Closed
