ValueError: At least two variables have the same name: init/initial_bn/beta #13

Open
arisliang opened this issue Apr 1, 2018 · 8 comments

Comments

@arisliang

I downloaded the trained model and ran it as follows:
python main.py --mode=gtp --model_path='./savedmodels/model-0.4114.ckpt'
It fails with this error:

CRITICAL root: Traceback (most recent call last):
File "main.py", line 231, in <module>
fnFLAGS.MODE
File "main.py", line 226, in <lambda>
'gtp': lambda: gtp(),
File "main.py", line 69, in gtp
engine = make_gtp_instance(flags=flags, hps=hps)
File "/home/ly/src/lib/alphazero/AlphaGOZero-python-tensorflow/utils/gtp_wrapper.py", line 110, in make_gtp_instance
n = Network(flags, hps)
File "/home/ly/src/lib/alphazero/AlphaGOZero-python-tensorflow/Network.py", line 85, in __init__
self.saver = tf.train.Saver(var_list=var_to_save, max_to_keep=10)
File "/home/ly/anaconda3/envs/learning/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1311, in __init__
self.build()
File "/home/ly/anaconda3/envs/learning/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1320, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/ly/anaconda3/envs/learning/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1357, in _build
build_save=build_save, build_restore=build_restore)
File "/home/ly/anaconda3/envs/learning/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 787, in _build_internal
saveables = self._ValidateAndSliceInputs(names_to_saveables)
File "/home/ly/anaconda3/envs/learning/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 635, in _ValidateAndSliceInputs
names_to_saveables = BaseSaverBuilder.OpListToDict(names_to_saveables)
File "/home/ly/anaconda3/envs/learning/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 612, in OpListToDict
name)
ValueError: At least two variables have the same name: init/initial_bn/beta
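
The error means that the list passed as var_list contains the same variable name more than once. A minimal sketch of how one might check a list of names for duplicates before building the saver is shown below. The helper duplicate_names is hypothetical and is not part of this repository or of TensorFlow; the sample names are taken from the error message, with gamma assumed for illustration.

```python
from collections import Counter


def duplicate_names(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Stand-in for something like [v.op.name for v in var_to_save]
# (assumption: var_to_save is the list built in Network.py).
names = [
    "init/initial_bn/beta",
    "init/initial_bn/gamma",
    "init/initial_bn/beta",  # listed twice, as in the reported error
]
dupes = duplicate_names(names)
print(dupes)
```

Any name reported here would presumably be the one that tf.train.Saver complains about.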

@yhyu13
Owner

yhyu13 commented Apr 9, 2018

Could you tell me which TensorFlow version you are using? It looks like a backward-incompatibility issue. I was using tf 1.4.0 when developing this project. If you find a solution, a pull request is welcome. Thanks.

Edit: This is confirmed to be a bug triggered by tf 1.7.0. The program still works if you comment out the global variables in var_to_save in Network.py. @arisliang

yhyu13 closed this as completed Apr 10, 2018
@arisliang
Author

I use tf 1.7.0. If we comment out var_to_save, don't we also need to comment out the self.saver attribute, since it depends on var_to_save for its initialization? And once that is commented out, the program will fail to load the model, because there is no saver anymore.

@arisliang
Author

By the way, the same error happens in 1.8.0 too. Thanks to your updated comments in the code, it is clearer to me how to apply this fix.

@yhyu13
Owner

yhyu13 commented Apr 30, 2018

@arisliang

I finally figured out what was wrong. All variables that we users create are called "global variables" (in contrast, variables created inside the TensorFlow API are called "local variables"). Among the global variables, those whose "trainable" flag isn't false are called "trainable variables".

If you take a look at the _batch_norm() I wrote, I created the offset (beta) and scale (gamma), which are trainable. But TensorFlow 1.4 didn't include them in the trainable variables. I noticed that and added them to var_to_save. The TensorFlow team fixed this bug in later versions (starting from 1.5, actually), so on those versions beta and gamma end up in var_to_save twice. Hope this explains everything.
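
One way to act on this explanation would be to add the batch-norm variables by hand only on versions that need it. The sketch below assumes the statement above is accurate (that versions before 1.5 omit beta and gamma from the trainable variables). The helpers major_minor and needs_explicit_bn_vars are hypothetical names, not code from this project.

```python
import re


def major_minor(version):
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def needs_explicit_bn_vars(tf_version):
    """True when beta/gamma should be appended to var_to_save by hand.

    Assumption (from the comment above): only versions before 1.5 need this.
    """
    return major_minor(tf_version) < (1, 5)


# In the real program the string would come from tf.__version__.
for version in ("1.4.0", "1.7.0", "1.8.0"):
    print(version, needs_explicit_bn_vars(version))
```

With such a check, the extra variables would be appended to var_to_save only when it returns True.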

yhyu13 reopened this Apr 30, 2018
@yhyu13
Owner

yhyu13 commented Apr 30, 2018

@arisliang

Since there is an issue loading the model, you may want to try installing tf 1.4 GPU (with py3.6 on Linux):
pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.4.0-cp36-cp36m-linux_x86_64.whl

@arisliang
Author

Wow, I didn't know tf 1.4 had this bug. Do you know which issue they created for this fix? I tried to google it but couldn't find one. Would you consider upgrading the code to a more recent tf, since newer versions include this bug fix plus, I would imagine, other improvements and bug fixes? I couldn't install tf 1.4 despite trying, because I have CUDA 9.0 installed and tf 1.4 seems to require CUDA 8.0.

@awilliamson

@yhyu13 This wouldn't be a general fix, as the installed CUDA version for people running TF 1.7/1.8 would be CUDA 8 or higher. A downgrade to 1.4 would require downgrading entire CUDA setups.

I would be interested in a fix that works on the latest TF version.

@yhyu13
Owner

yhyu13 commented Jul 8, 2018

@awilliamson The solution that worked is to comment out the list of extra variables and leave just var_to_save = tf.trainable_variables(). The remaining issue is that the already-trained model malfunctions under tf 1.5 and higher. The code works, but the model has to be retrained.
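
An alternative to commenting the list out would be to keep it and drop repeated names before the saver is built. This is only a sketch and not the fix applied in this repository. dedupe_by_name is a hypothetical helper, the SimpleNamespace objects stand in for TensorFlow variables, and moving_mean is an assumed name used for illustration. It would only avoid the ValueError; it does not address the trained checkpoint malfunctioning on newer versions.

```python
from collections import OrderedDict
from types import SimpleNamespace


def dedupe_by_name(variables):
    """Drop variables whose name was already seen, keeping the first occurrence."""
    unique = OrderedDict()
    for var in variables:
        # Names such as "init/initial_bn/beta:0" carry an output index; ignore it.
        key = var.name.split(":")[0]
        unique.setdefault(key, var)
    return list(unique.values())


# Stand-ins for tf.trainable_variables() and the hand-written extra list.
trainable = [
    SimpleNamespace(name="init/initial_bn/beta:0"),
    SimpleNamespace(name="init/initial_bn/gamma:0"),
]
extra = [
    SimpleNamespace(name="init/initial_bn/beta:0"),         # duplicate on newer versions
    SimpleNamespace(name="init/initial_bn/moving_mean:0"),  # assumed name
]

var_to_save = dedupe_by_name(trainable + extra)
saved_names = [v.name for v in var_to_save]
print(saved_names)
```

The resulting list could then be passed as var_list, and it would be the same on versions that already include beta and gamma and on versions that do not.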
