Errors with Train.py #25

Closed
luke2997 opened this issue Mar 3, 2020 · 18 comments
Labels
usage how to run

Comments

@luke2997

luke2997 commented Mar 3, 2020

I've extracted image patches successfully; however, I get the following error when running train.py. Any ideas?

  File "train.py", line 184, in run_once
    logger.set_logger_dir(save_dir)
  File "../tensorpack/utils/logger.py", line 131, in set_logger_dir
    action = input("Select Action: k (keep) / d (delete) / q (quit):").lower().strip()
EOFError: EOF when reading a line

This is inside my log file:

[0303 01:56:20 @logger.py:90] Argv: train.py --gpu=0,1
[0303 01:56:39 @training.py:50] [DataParallel] Training a model of 2 towers.
[0303 01:56:39 @interface.py:31] Automatically applying QueueInput on the DataFlow.
[0303 01:56:39 @interface.py:43] Automatically applying StagingInput on the DataFlow.

@simongraham
Collaborator

simongraham commented Mar 3, 2020

Hi @luke2997

It seems that you have not passed the GPU IDs as a string. If using GPUs 0 and 1, use:

python train.py --gpu='0,1'

Please let us know if this fixes the issue.

@simongraham simongraham added the usage how to run label Mar 3, 2020
@luke2997
Author

luke2997 commented Mar 3, 2020

@simongraham - this doesn't fix the problem unfortunately.
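One possible reason the quoting makes no difference: a POSIX shell strips the quotes before train.py ever sees its arguments, so --gpu=0,1 and --gpu='0,1' should arrive as the same string. The sketch below uses Python's shlex, which follows POSIX quoting rules, together with a stand-in argparse parser. The parser is an assumption made for illustration and is not the actual one in train.py.

```python
import argparse
import shlex


def parse_gpu_arg(command_line):
    """Split a command line the way a POSIX shell would, then parse --gpu."""
    argv = shlex.split(command_line)[2:]  # drop "python train.py"
    # Hypothetical stand-in for the parser in train.py.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", type=str, default="0")
    return parser.parse_args(argv).gpu


unquoted = parse_gpu_arg("python train.py --gpu=0,1")
quoted = parse_gpu_arg("python train.py --gpu='0,1'")
print(unquoted, quoted, unquoted == quoted)
```

Both calls yield the same string, which is consistent with the quoted form not changing the behaviour reported above.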

@simongraham
Collaborator

Can you let me know which tensorpack version you are using, and copy the command that you run in the terminal?

@luke2997
Author

luke2997 commented Mar 3, 2020

I have TensorFlow v1.12 and tensorpack 0.9.0.1. The command I type is:

python train.py --gpu='0,1'

I also get the output file below; I'm not sure whether it is related.

Screenshot 2020-03-04 at 00 27 00

@simongraham
Collaborator

simongraham commented Mar 3, 2020

The output is telling you that there is already a checkpoint file where you plan to save your logs. You need to press k (keep), d (delete) or q (quit) depending on what you want to do.

I'm still not sure exactly what you are requesting here? Are you getting an error? If so, please supply the terminal output.
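The EOFError in the first traceback suggests that the prompt ran with no interactive terminal attached (for example, with stdin redirected or closed), so input() had nothing to read. Here is a minimal sketch of deciding the action ahead of time. choose_logger_action is a hypothetical helper, and passing its result as action= to set_logger_dir assumes that the installed tensorpack accepts that keyword, which is worth confirming in logger.py before relying on it.

```python
import sys

VALID_ACTIONS = ("k", "d", "q")  # keep / delete / quit, as listed in the prompt


def choose_logger_action(interactive=None, default="k"):
    """Return None to allow the interactive prompt, or a preset action otherwise."""
    if interactive is None:
        # stdin can be None or a non-tty when the script runs unattended.
        interactive = sys.stdin is not None and sys.stdin.isatty()
    if interactive:
        return None
    if default not in VALID_ACTIONS:
        raise ValueError("default must be one of %r" % (VALID_ACTIONS,))
    return default


# Assumed usage in train.py (not verified against the repository):
#   logger.set_logger_dir(save_dir, action=choose_logger_action())
```

Defaulting to "k" keeps existing logs, which seems the least destructive choice for an unattended run.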

@luke2997
Author

luke2997 commented Mar 4, 2020

Yeah, I did delete it. I think I now realise the source of the issue: a couple of modules are not importing properly due to libpng12. I'll update when I get this resolved.

@simongraham
Collaborator

I will close this issue for now. Please reopen if necessary with a specific question, and then we can be of more assistance.

@luke2997
Author

luke2997 commented Mar 6, 2020

Please reopen: I had to start all over again as there were issues with my venv. Extract_patches.py worked before, but when I run it now I get the output below. Is it something to do with the dataset?

Screenshot 2020-03-06 at 17 12 56

@vqdang vqdang reopened this Mar 6, 2020
@vqdang
Owner

vqdang commented Mar 6, 2020

@luke2997, for the record, you can reopen it yourself.
Please show me

step_size = [80, 80] # should match self.train_mask_shape (config.py)
win_size = [540, 540] # should be at least twice time larger than
in your code. You have to read the error message: it says that the input has the wrong type (float).

Also, please provide as many details as possible for what you have changed in the code compared to the github version.

@luke2997
Author

luke2997 commented Mar 7, 2020

Well, I only changed the paths, which is why I didn't understand where the error came from. So I have the same lines as those above.

I had a look and tried running config.py, and it seems I get an error related to #12 (comment). So it seems to be an error with my paths, although they are the same paths I used before, which worked then!

One thing that may be causing it: when I first run the code I get this error:

/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found

And I export it using

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/*username*/.conda/envs/*VirtualEnvironment*/lib/

which then fixes it but causes the error above.
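To see which libstdc++ actually provides the missing symbol version, one option is to search the library file for the version tag, much like running strings on it and looking for CXXABI. This is a rough sketch only. The function name is a placeholder, and a byte search is a heuristic rather than a substitute for how the dynamic linker resolves versions.

```python
from pathlib import Path


def provides_symbol_version(lib_path, tag="CXXABI_1.3.9"):
    """Return True if the raw bytes of the library contain the version tag."""
    data = Path(lib_path).read_bytes()
    return tag.encode("ascii") in data
```

Calling it on /lib64/libstdc++.so.6 and on the libstdc++.so.6 inside the conda environment's lib directory would show which of the two carries the tag, and therefore which one needs to be found first at load time.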

@vqdang
Owner

vqdang commented Mar 7, 2020

That is very strange; maybe your new environment broke something. For now, you can try changing this

return flag, last_step

into return flag, int(last_step) to ensure that every input to range is an int.
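A small sketch of why the cast helps: range rejects float bounds, so a step position that comes back as a float has to be converted first. If last_step is computed with math.floor, a float result would also be consistent with running under Python 2, where math.floor returns a float rather than an int. The helper name last_step_for and the image length of 1000 are assumptions for illustration; the window and step sizes are the ones quoted earlier in the thread.

```python
import math


def last_step_for(length, win_size, step_size):
    """Hypothetical helper: end bound of the sliding-window offsets on one axis."""
    # True when the windows do not tile the axis exactly.
    flag = (length - win_size) % step_size != 0
    last_step = math.floor((length - win_size) / step_size)
    last_step = (last_step + 1) * step_size
    return flag, int(last_step)  # the cast keeps range() happy on any interpreter


flag, last_step = last_step_for(length=1000, win_size=540, step_size=80)
offsets = list(range(0, last_step, 80))
print(flag, last_step, offsets)

try:
    range(0, float(last_step), 80)
except TypeError as err:
    print("float bound rejected:", err)
```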

@vqdang
Owner

vqdang commented Mar 7, 2020

Also, the preferred library versions are listed here: https://github.com/vqdang/hover_net/blob/master/requirements.txt

So you may want to check whether yours match. In case you need to reinstall, use the following as a guideline.

conda create --name test python=3.6
conda activate test
pip install "opencv-python==3.2.*" scipy scikit-image pandas matplotlib
pip install --upgrade git+https://github.com/tensorpack/tensorpack.git@0.9.0.1
pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
source activate test

@luke2997
Author

luke2997 commented Mar 7, 2020

Perfect, thanks a lot! Changing that line of code fixed it and I was able to extract patches successfully. However, I have a few errors with train.py suggesting that TensorFlow GPU is not working, so I will create a new virtual environment as you suggested and try again, as perhaps there was a fault with tensorflow-gpu. Although it is a bit of a pain getting these packages installed all together for some reason.

  File "train.py", line 278, in <module>
    trainer.run()
  File "train.py", line 244, in run
    self.run_once(opt, sess_init=init_weights, save_dir=log_dir)
  File "train.py", line 188, in run_once
    model = self.get_model()(**model_flags)
  File "/lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/graph.py", line 113, in __init__
    assert tf.test.is_gpu_available()

@simongraham
Collaborator

Hi @luke2997 ,

As @vqdang suggested, please set up your environment from scratch to ensure there are no issues with library versions.

"Although it is a bit of a pain getting these packages installed all together for some reason."

Installing the libraries should be simple enough if you follow @vqdang's instructions. This is only a few lines in the terminal. Of course, you need to make sure you run the commands separately, line by line.

After this has been done, let us know and we can advise how to proceed. What CUDA version do you have installed?

@luke2997
Author

luke2997 commented Mar 7, 2020

I'm in Mainland China, so some channels get blocked; for example, the command above doesn't work for tensorflow. Anyway, I believe I do have all the requirements. I have restarted and got a little further, so I assume the GPU is working now! However, although I do get some output, I also get the long error below. I have cudatoolkit 9.2 and cudnn 7.6.5.

Traceback (most recent call last):
  File "train.py", line 278, in <module>
    trainer.run()
  File "train.py", line 244, in run
    self.run_once(opt, sess_init=init_weights, save_dir=log_dir)
  File "train.py", line 215, in run_once
    launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_gpus))
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/interface.py", line 97, in launch_train_with_config
    extra_callbacks=config.extra_callbacks)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 341, in train_with_defaults
    steps_per_epoch, starting_epoch, max_epoch)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 313, in train
    self.main_loop(steps_per_epoch, starting_epoch, max_epoch)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 278, in main_loop
    self.run_step() # implemented by subclass
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/base.py", line 181, in run_step
    self.hooked_sess.run(self.train_op)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
    run_metadata=run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Integer division by zero
  [[node tower0/div_12 (defined at /lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/utils.py:182) = FloorDiv[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower0/sub_13, tower0/sub_14)]]

Caused by op u'tower0/div_12', defined at:
  File "train.py", line 278, in <module>
    trainer.run()
  File "train.py", line 244, in run
    self.run_once(opt, sess_init=init_weights, save_dir=log_dir)
  File "train.py", line 215, in run_once
    launch_train_with_config(config, SyncMultiGPUTrainerParameterServer(nr_gpus))
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/interface.py", line 87, in launch_train_with_config
    model._build_graph_get_cost, model.get_optimizer)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/tower.py", line 204, in setup_graph
    train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/trainers.py", line 106, in _setup_graph
    self._make_get_grad_fn(input, get_cost_fn, get_opt_fn), get_opt_fn)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/training.py", line 161, in build
    grad_list = DataParallelBuilder.build_on_towers(self.towers, get_grad_fn, devices)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/training.py", line 119, in build_on_towers
    ret.append(func())
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/train/tower.py", line 232, in get_grad_fn
    cost = get_cost_fn(*input.get_input_tensors())
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/tfutils/tower.py", line 284, in __call__
    output = self._tower_fn(*args)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/model_desc.py", line 246, in _build_graph_get_cost
    ret = self.build_graph(*inputs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorpack/graph_builder/model_desc.py", line 162, in build_graph
    return self._build_graph(args)
  File "/lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/graph.py", line 305, in _build_graph
    true_np = colorize(true_np[...,0], cmap='jet')
  File "/lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/utils.py", line 182, in colorize
    value = (value - vmin) / (vmax - vmin) # vmin..vmax
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 999, in _div_python2
    return gen_math_ops.floor_div(x, y, name=name)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 3079, in floor_div
    "FloorDiv", x=x, y=y, name=name)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/lustre/home/acct-clsyzs/clsyzs/.conda/envs/hovernew/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Integer division by zero
  [[node tower0/div_12 (defined at /lustre/home/acct-clsyzs/clsyzs/Luke/hover_net-master/src/model/utils.py:182) = FloorDiv[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower0/sub_13, tower0/sub_14)]]

@simongraham
Collaborator

It looks like you are using Python 2.7. As stated in the requirements, you need to use Python 3.6.
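The interpreter version would also explain the specific error. In the traceback, the / in colorize is dispatched through _div_python2 to FloorDiv on DT_INT32 tensors, which is how integer tensors are divided under Python 2, and a zero denominator (vmax equal to vmin) then raises. The following is a plain-Python illustration of the same arithmetic, not the repository's colorize; normalize and its eps parameter are made up for the example.

```python
def normalize(values, vmin=None, vmax=None, eps=1e-8):
    """Scale values to [0, 1] with float division, guarding against vmax == vmin."""
    vmin = min(values) if vmin is None else vmin
    vmax = max(values) if vmax is None else vmax
    span = float(vmax - vmin)
    if abs(span) < eps:
        # A constant input has no range to stretch; map everything to 0.
        return [0.0 for _ in values]
    return [(v - vmin) / span for v in values]


print(normalize([0, 5, 10]))
print(normalize([3, 3, 3]))

try:
    (3 - 3) // (3 - 3)  # integer floor division, as in the FloorDiv op
except ZeroDivisionError as err:
    print("floor division failed:", err)
```

Under Python 3, / on integers is true division, so the floor-division path in the traceback would not be taken.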

@luke2997
Author

luke2997 commented Mar 7, 2020

I mean, the virtual environment is fully set up with Python 3.6... however, it seems to be initialising with 2.7, as you said. I'll try to see how I can fix this.

Yeah, it seems one of the packages I installed manually through conda has simultaneously downgraded Python. Will update.
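Since a package install can swap the interpreter without much notice, a small guard at the top of a script can make the mismatch obvious before training starts. This is a sketch under the assumption that Python 3.6 is the required version, as stated in this thread; python_matches is a hypothetical helper name.

```python
import sys

REQUIRED = (3, 6)  # from the thread: the project expects Python 3.6


def python_matches(version_info=None, required=REQUIRED):
    """Return True when the interpreter's major.minor equals the required pair."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if not python_matches():
    print("Warning: running Python %d.%d from %s, expected %d.%d" % (
        sys.version_info[0], sys.version_info[1], sys.executable,
        REQUIRED[0], REQUIRED[1]))
```

Printing sys.executable alongside the version shows which environment the interpreter actually came from.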

@luke2997
Author

luke2997 commented Mar 8, 2020

Right, thanks a lot for the help, I appreciate it. I've successfully trained on the data after changing the Python version!


3 participants