This repository has been archived by the owner on Dec 11, 2020. It is now read-only.

how to run elf in gogui #133

Closed
l1t1 opened this issue Feb 14, 2019 · 11 comments

Comments

l1t1 commented Feb 14, 2019

I can run df_console now, but how do I set its parameters?
For example, I want to use the pretrained-go-19x19-v2 model with 3200 visits.
I want to test the ELF engine with its original weights against LZ with the converted weights.

l1t1 commented Feb 14, 2019

If I run this command line:

D:\tool\go_gui>D:\tool\go_gui\gogui-twogtp -black "D:\elf_cpu_full\elf\df_console.exe " -white "D:\leela-zero-0.16-win64
\leelaz.exe --gtp -w D:\elfv2.gz --noponder -v 3201"  -games 10 -sgffile fb_elfv2 -auto -komi 7.5

it did not work at all.
If I run gogui.exe instead, it shows some warning messages but still plays:

The Go program is not responding to the command quit.

df2 sent a malformed response.
Text lines before the status character of the first response line are not allowed by the GTP standard.
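
The last warning says that something was printed before the `=` or `?` that must open a GTP response. A minimal Python sketch of the filtering a wrapper script could apply to the engine's output is shown here. The function name is hypothetical, and it is only an assumption that df_console writes its log text to the same stream as its replies.

```python
def strip_leading_noise(raw_response: str) -> str:
    """Drop every line that precedes the first GTP status line.

    A GTP response starts with '=' (success) or '?' (failure); text printed
    before that character is what the GoGui warning is about.
    """
    lines = raw_response.splitlines()
    for index, line in enumerate(lines):
        if line.startswith(("=", "?")):
            return "\n".join(lines[index:])
    # No status line at all: nothing usable to forward to the controller.
    return ""


if __name__ == "__main__":
    # Made-up log lines followed by a real-looking reply.
    noisy = "Loading model...\n[MCTS] ready\n= D16\n"
    print(strip_leading_noise(noisy))
```

A wrapper built on this would sit between GoGui and the engine and forward only the cleaned text.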

jillybob commented Feb 14, 2019

@l1t1 updated:

ELF works for me with the packaged Sabaki. However, when running it over GTP, I get the following error.

I point the path to the elf_gpu_full\elf folder and run df_console.exe --verbose --gpu 0 --num_block 20 --dim 256 --mcts_puct 1.50 --batchsize 16 --mcts_rollout_per_batch 16 --mcts_threads 2 --mcts_rollout_per_thread 8192 --resign_thres 0.05 --mcts_virtual_loss 1

I now get this error:

https://i.imgur.com/3Aul6au.png

It repeats for 19 resnet layers, until I get:

[15832] Failed to execute script df_console

l1t1 commented Feb 14, 2019

https://github.com/pytorch/ELF#running-a-go-bot says

Here is a basic set of commands to run and play the bot via the GTP protocol:
1. Build ELF and run source scripts/devmode_set_pythonpath.sh as described above.
2. Train a model, or grab a pretrained model.
3. Change directory to scripts/elfgames/go/
4. Run ./gtp.sh path/to/modelfile.bin --verbose --gpu 0 --num_block 20 --dim 256 --mcts_puct 1.50 --batchsize 16 --mcts_rollout_per_batch 16 --mcts_threads 2 --mcts_rollout_per_thread 8192 --resign_thres 0.05 --mcts_virtual_loss 1

We've found that the above settings work well for playing the bot. You may change mcts_rollout_per_thread to tune the thinking time per move.

After the environment is set up and the model is loaded, you can start to type gtp commands to get the response from the engine.

Looking at the file
https://github.com/pytorch/ELF/blob/master/scripts/elfgames/go/gtp.sh , it accepts a model file through --load $MODEL.
Is python3 df_console.py equivalent to df_console.exe?

MODEL=$1
shift

game=elfgames.go.game model=df_pred model_file=elfgames.go.df_model3 python3 df_console.py --mode online --keys_in_reply V rv \
    --use_mcts --mcts_verbose_time --mcts_use_prior --mcts_persistent_tree --load $MODEL \
    --server_addr localhost --port 1234 \
    --replace_prefix resnet.module,resnet \
    --no_check_loaded_options \
    --no_parameter_print \
    "$@"

l1t1 commented Feb 15, 2019

df_console.exe doesn't support elfv2.bin:

D:\elf_cpu_full\elf>df_console --load d:/elfv2.bin
Traceback (most recent call last):
  File "df_console.py", line 39, in <module>
    },
  File "rlpytorch\model_loader.py", line 162, in load_model
  File "rlpytorch\model_base.py", line 153, in load
  File "site-packages\torch\nn\modules\module.py", line 719, in load_state_dict
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
        Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias"
, "init_conv.1.running_mean", "init_conv.1.running_var", "pi_final_conv.1.num_batches_tracked", "value_final_conv.1.num_

Loading its own model.bin works:

D:\elf_cpu_full\elf>df_console --load ./model.bin
genmove b

= D16

model.bin cannot be converted to lz format by elf_convert.py

D:\>c:\python37\python elf_convert.py model.bin
Traceback (most recent call last):
  File "elf_convert.py", line 56, in <module>
    b = convert_block(state, 'resnet.module.resnet.{}.conv_lower'.format(block))
  File "elf_convert.py", line 13, in convert_block
    weight = np.array(t[name + '.0.weight'])
KeyError: 'resnet.module.resnet.0.conv_lower.0.weight'

l1t1 commented Feb 15, 2019

After I added --replace_prefix resnet.module,resnet, it ran far enough to give a more useful error message:

D:\elf_cpu_full\elf>df_console.exe --load d:/elfv2.bin --replace_prefix resnet.module,resnet
Traceback (most recent call last):
  File "df_console.py", line 39, in <module>
    },
  File "rlpytorch\model_loader.py", line 162, in load_model
  File "rlpytorch\model_base.py", line 153, in load
  File "site-packages\torch\nn\modules\module.py", line 719, in load_state_dict
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
        Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias"
, "init_conv.1.running_mean", "init_conv.1.running_var", "pi_final_conv.1.num_batches_tracked", "value_final_conv.1.num_
batches_tracked".
        Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.wei
ght", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var".
        size mismatch for pi_final_conv.0.weight: copying a param of torch.Size([2, 224, 1, 1]) from checkpoint, where t
he shape is torch.Size([2, 256, 1, 1]) in current model.

This shows that the expected size is 224x20, the size of the earlier released model.
I verified that using pretrained-go-19x19-v1.bin in place of model.bin works:

D:\elf_cpu_full\elf>df_console.exe --load d:/pretrained-go-19x19-v1.bin --replace_prefix resnet.module,resnet
genmove b

= D4

But the new elfv2 still doesn't work:

D:\elf_cpu_full\elf>df_console.exe --load d:/elfv2.bin --replace_prefix resnet.module,resnet --num_block 20 --dim 256
Traceback (most recent call last):
  File "df_console.py", line 39, in <module>
    },
  File "rlpytorch\model_loader.py", line 162, in load_model
  File "rlpytorch\model_base.py", line 153, in load
  File "site-packages\torch\nn\modules\module.py", line 719, in load_state_dict
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
        Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias"
, "init_conv.1.running_mean", "init_conv.1.running_var", "pi_final_conv.1.num_batches_tracked", "value_final_conv.1.num_
batches_tracked".
        Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.wei
ght", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var".
[7684] Failed to execute script df_console
^C
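
The "Unexpected key(s)" in these tracebacks differ from the "Missing key(s)" only by an extra `.module` segment, which is what an `old,new` prefix pair is meant to remove. The Python sketch below illustrates that renaming on a plain dictionary. It is not the rlpytorch implementation; `rename_prefixes` is a hypothetical helper and tuples stand in for tensor shapes.

```python
def rename_prefixes(state_dict, replacements):
    """Return a copy of state_dict with key prefixes rewritten.

    replacements is a list of (old_prefix, new_prefix) pairs, mirroring the
    "old,new" pairs given on the command line.
    """
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in replacements:
            # Match whole dotted segments only, so "init_conv.module"
            # does not match a key such as "init_conv.modules_extra".
            if new_key.startswith(old + "."):
                new_key = new + new_key[len(old):]
                break  # apply at most one rule per key
        renamed[new_key] = value
    return renamed


if __name__ == "__main__":
    checkpoint = {
        "init_conv.module.0.weight": (256, 18, 3, 3),
        "resnet.module.resnet.0.conv_lower.0.weight": (256, 256, 3, 3),
        "pi_linear.bias": (362,),
    }
    rules = [("resnet.module", "resnet"), ("init_conv.module", "init_conv")]
    for name in rename_prefixes(checkpoint, rules):
        print(name)
```

With both rules applied, the example keys become `init_conv.0.weight` and `resnet.resnet.0.conv_lower.0.weight`, the names the missing-key list asks for.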

l1t1 commented Feb 15, 2019

I inspected the checkpoint's structure with:
for key in state.keys():
    print(key, state[key].shape)

resnet.resnet.0.conv_lower.0.weight torch.Size([224, 224, 3, 3])
resnet.resnet.0.conv_lower.0.bias torch.Size([224])
resnet.resnet.0.conv_lower.1.weight torch.Size([224])
resnet.resnet.0.conv_lower.1.bias torch.Size([224])
resnet.resnet.0.conv_lower.1.running_mean torch.Size([224])
resnet.resnet.0.conv_lower.1.running_var torch.Size([224])

So I changed the following lines in elf_convert.py from
b = convert_block(state, 'resnet.module.resnet.{}.conv_lower'.format(block))
write_block(f, b)
b = convert_block(state, 'resnet.module.resnet.{}.conv_upper'.format(block))

to
b = convert_block(state, 'resnet.resnet.{}.conv_lower'.format(block))
write_block(f, b)
b = convert_block(state, 'resnet.resnet.{}.conv_upper'.format(block))

Now model.bin can be converted and used by leelaz:

D:\>c:\python37\python elf_convert1.py model.bin
D:\>D:\leela-zero-0.16-win64\leelaz -w model2_converted_weights.txt
Using 2 thread(s).
RNG seed: 18089123876350262531
Leela Zero 0.16  Copyright (C) 2017-2018  Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.

BLAS Core: Haswell
Detecting residual layers...v2...224 channels...20 blocks.
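
Instead of keeping two copies of the converter, the block prefix could be picked from the keys that are actually present. This is a small Python sketch of that idea, not part of elf_convert.py; `detect_block_prefix` is a hypothetical helper, and the two candidate prefixes are the ones that appear in this thread.

```python
# Prefixes seen in this thread: the newer checkpoints carry an extra ".module".
CANDIDATE_PREFIXES = ("resnet.module.resnet", "resnet.resnet")


def detect_block_prefix(keys):
    """Return the residual-block prefix used by a checkpoint's keys."""
    for prefix in CANDIDATE_PREFIXES:
        if f"{prefix}.0.conv_lower.0.weight" in keys:
            return prefix
    raise KeyError("no known residual-block prefix found in checkpoint keys")


if __name__ == "__main__":
    old_style = {"resnet.resnet.0.conv_lower.0.weight", "pi_linear.bias"}
    prefix = detect_block_prefix(old_style)
    # The converter's format string could then be built from the result.
    print("{}.{}.conv_lower".format(prefix, 0))
```

The converter would then call `convert_block` with a name built from the detected prefix rather than a hard-coded one.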

l1t1 commented Feb 15, 2019

I also found that model.bin is identical to elfv0:

D:\>more model2_converted_weights.txt
2
-0.0008018693 -0.003017578 0.00042337787 -0.00207849 -0.010346348 0.0010541559 -0.00016277476 -0.0027672334
D:\>more elf_converted_weights0.txt
2
-0.0008018693 -0.003017578 0.00042337787 -0.00207849 -0.010346348 0.0010541559 -0.00016277476 -0.0027672334
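
Comparing the first lines by eye only shows that the files start the same way. A full comparison can be done with Python's standard `filecmp` module, as in this sketch; `same_weights` is a hypothetical helper, and the demo uses stand-in files rather than the real weight files.

```python
import filecmp
import os
import tempfile


def same_weights(path_a, path_b):
    """Return True when both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting
    # the size and modification time reported by os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)


if __name__ == "__main__":
    # Stand-in files so the sketch runs anywhere; in practice pass the two
    # converted weight files named in the comment above.
    with tempfile.TemporaryDirectory() as folder:
        first = os.path.join(folder, "a.txt")
        second = os.path.join(folder, "b.txt")
        for path in (first, second):
            with open(path, "w") as handle:
                handle.write("2\n-0.0008018693 -0.003017578\n")
        print(same_weights(first, second))
```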

l1t1 commented Feb 15, 2019

model.bin | elfv2.bin
init_conv.0.weight torch.Size([224, 18, 3, 3]) | init_conv.module.0.weight torch.Size([256, 18, 3, 3])
init_conv.0.bias torch.Size([224]) | init_conv.module.0.bias torch.Size([256])
init_conv.1.weight torch.Size([224]) | init_conv.module.1.weight torch.Size([256])
init_conv.1.bias torch.Size([224]) | init_conv.module.1.bias torch.Size([256])
init_conv.1.running_mean torch.Size([224]) | init_conv.module.1.running_mean torch.Size([256])
init_conv.1.running_var torch.Size([224]) | init_conv.module.1.running_var torch.Size([256])
pi_final_conv.0.weight torch.Size([2, 224, 1, 1]) | init_conv.module.1.num_batches_tracked torch.Size([])
pi_final_conv.0.bias torch.Size([2]) | pi_final_conv.0.weight torch.Size([2, 256, 1, 1])
pi_final_conv.1.weight torch.Size([2]) | pi_final_conv.0.bias torch.Size([2])
pi_final_conv.1.bias torch.Size([2]) | pi_final_conv.1.weight torch.Size([2])
pi_final_conv.1.running_mean torch.Size([2]) | pi_final_conv.1.bias torch.Size([2])
pi_final_conv.1.running_var torch.Size([2]) | pi_final_conv.1.running_mean torch.Size([2])
value_final_conv.0.weight torch.Size([1, 224, 1, 1]) | pi_final_conv.1.running_var torch.Size([2])
value_final_conv.0.bias torch.Size([1]) | pi_final_conv.1.num_batches_tracked torch.Size([])
value_final_conv.1.weight torch.Size([1]) | value_final_conv.0.weight torch.Size([1, 256, 1, 1])
value_final_conv.1.bias torch.Size([1]) | value_final_conv.0.bias torch.Size([1])
value_final_conv.1.running_mean torch.Size([1]) | value_final_conv.1.weight torch.Size([1])
value_final_conv.1.running_var torch.Size([1]) | value_final_conv.1.bias torch.Size([1])
pi_linear.weight torch.Size([362, 722]) | value_final_conv.1.running_mean torch.Size([1])
pi_linear.bias torch.Size([362]) | value_final_conv.1.running_var torch.Size([1])
value_linear1.weight torch.Size([256, 361]) | value_final_conv.1.num_batches_tracked torch.Size([])
value_linear1.bias torch.Size([256]) | pi_linear.weight torch.Size([362, 722])
value_linear2.weight torch.Size([1, 256]) | pi_linear.bias torch.Size([362])
value_linear2.bias torch.Size([1]) | value_linear1.weight torch.Size([256, 361])
resnet.resnet.0.conv_lower.0.weight torch.Size([224, 224, 3, 3]) | value_linear1.bias torch.Size([256])
resnet.resnet.0.conv_lower.0.bias torch.Size([224]) | value_linear2.weight torch.Size([1, 256])
resnet.resnet.0.conv_lower.1.weight torch.Size([224]) | value_linear2.bias torch.Size([1])
resnet.resnet.0.conv_lower.1.bias torch.Size([224]) | resnet.module.resnet.0.conv_lower.0.weight torch.Size([256, 256, 3, 3])
resnet.resnet.0.conv_lower.1.running_mean torch.Size([224]) | resnet.module.resnet.0.conv_lower.0.bias torch.Size([256])
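
The listing above already contains the numbers needed for --dim and --num_block: the first convolution's output channels and the number of distinct residual-block indices. Here is a Python sketch that reads them from a name-to-shape mapping. `infer_dim_and_blocks` is a hypothetical helper, and treating the first dimension of the `init_conv` weight as the channel width is an assumption based on the 224 and 256 shapes shown.

```python
import re


def infer_dim_and_blocks(shapes):
    """Guess (dim, num_block) from a mapping of parameter name -> shape tuple."""
    dim = None
    block_ids = set()
    for name, shape in shapes.items():
        # Accept both key layouts, with or without the ".module" segment.
        if re.fullmatch(r"init_conv(\.module)?\.0\.weight", name):
            dim = shape[0]
        match = re.match(r"resnet(?:\.module)?\.resnet\.(\d+)\.", name)
        if match:
            block_ids.add(int(match.group(1)))
    return dim, len(block_ids)


if __name__ == "__main__":
    # A shortened, made-up checkpoint with two residual blocks.
    shapes = {
        "init_conv.module.0.weight": (256, 18, 3, 3),
        "resnet.module.resnet.0.conv_lower.0.weight": (256, 256, 3, 3),
        "resnet.module.resnet.0.conv_upper.0.weight": (256, 256, 3, 3),
        "resnet.module.resnet.1.conv_lower.0.weight": (256, 256, 3, 3),
    }
    print(infer_dim_and_blocks(shapes))
```

With a real checkpoint, the mapping could be built from the same loop that printed the listing, using `tuple(state[key].shape)` as the value.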

With one more tweak, only two missing keys remain:

D:\elf_cpu_full\elf>df_console.exe --load d:/elfv2.bin --replace_prefix resnet.module,resnet init_conv.module,init_conv
--num_block 20 --dim 256
Traceback (most recent call last):
  File "df_console.py", line 39, in <module>
    },
  File "rlpytorch\model_loader.py", line 162, in load_model
  File "rlpytorch\model_base.py", line 153, in load
  File "site-packages\torch\nn\modules\module.py", line 719, in load_state_dict
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
        Missing key(s) in state_dict: "pi_final_conv.1.num_batches_tracked", "value_final_conv.1.num_batches_tracked".
[8332] Failed to execute script df_console
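
The "Missing key(s)" and "Unexpected key(s)" lists in these errors are set differences between the names the model expects and the names found in the checkpoint. The Python sketch below reproduces that comparison on plain name lists so a checkpoint can be checked before loading. `diff_keys` is a hypothetical helper, and the example names are shortened from this thread.

```python
def diff_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) in the sense used by the error above.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names the checkpoint has but the model does not expect
    """
    expected, found = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)


if __name__ == "__main__":
    model_keys = [
        "pi_final_conv.1.running_var",
        "pi_final_conv.1.num_batches_tracked",
    ]
    file_keys = ["pi_final_conv.1.running_var"]
    missing, unexpected = diff_keys(model_keys, file_keys)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

PyTorch's own `load_state_dict` takes a `strict=False` argument that tolerates such differences; whether df_console exposes a switch for it is not something this thread shows.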

l1t1 commented Feb 15, 2019

https://dl.fbaipublicfiles.com/elfopengo/v2_training_run/models/1500000.bin
runs into a different problem than elfv2.bin:

D:\elf_cpu_full\elf>df_console.exe --load d:/1500000.bin --replace_prefix resnet.module,resnet init_conv.module,init_con
v--num_block 20 --dim 256
Traceback (most recent call last):
  File "df_console.py", line 39, in <module>
    },
  File "rlpytorch\model_loader.py", line 162, in load_model
  File "rlpytorch\model_base.py", line 148, in load
ValueError: not enough values to unpack (expected 2, got 1)
[8556] Failed to execute script df_console
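
One possible reading of this ValueError: as pasted, the command appears to contain `init_conv--num_block` with no space, so `20` may have been collected as a third --replace_prefix item, and an item without a comma cannot be split into an `old,new` pair. The Python sketch below reproduces that failure mode. It is an assumption about how the pairs are parsed, not the actual rlpytorch code, and `parse_prefix_pairs` is a hypothetical name.

```python
def parse_prefix_pairs(items):
    """Turn ["old,new", ...] into [("old", "new"), ...]."""
    pairs = []
    for item in items:
        # Unpacking fails unless the item contains exactly one comma.
        old, new = item.split(",")
        pairs.append((old, new))
    return pairs


if __name__ == "__main__":
    good = ["resnet.module,resnet", "init_conv.module,init_conv"]
    print(parse_prefix_pairs(good))

    # What the option might receive when the space before --num_block is lost.
    bad = ["resnet.module,resnet", "init_conv.module,init_conv--num_block", "20"]
    try:
        parse_prefix_pairs(bad)
    except ValueError as error:
        print("ValueError:", error)
```

If this reading is right, adding the missing space before --num_block would at least get past this particular error.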

l1t1 commented Feb 16, 2019

l1t1 commented Feb 25, 2019

It works now; see
#134 (comment)

@l1t1 l1t1 closed this as completed Feb 25, 2019