Weird changes in version 1.4 [bug] #495
Description
I have observed several odd changes after switching from tensor2tensor version 1.3.2 to version 1.4.1 (the TensorFlow version is 1.4.1 (GPU) for both).
Running exactly the same t2t-trainer command results in vastly different training runs. I will list all the strange and annoying differences I have observed below. I have no idea what the cause is; I don't know whether it is a bug, or whether I simply have to change some parameters to adapt to the new tensor2tensor version.
The command that I run:

```
t2t-trainer \
  --t2t_usr_dir=t2t_csaky \
  --generate_data=False \
  --data_dir=data_dir/facebook_ricsibot_character \
  --model=transformer \
  --problems=character_chatbot \
  --hparams_set=transformer_dorka_big_dropout \
  --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character \
  --train_steps=800000 \
  --keep_checkpoint_max=3 \
  --keep_checkpoint_every_n_hours=1
```
As you can see, I use my own problem and hparams definitions; however, this shouldn't affect anything, since the code in my registration files is exactly the same for both tensor2tensor versions. Running the above command results in the following changes from version 1.3.2 to 1.4.1:
- In 1.4.1 I can no longer see any training stats in TensorBoard (loss, learning rate, etc.).
- I can still see eval stats in 1.4.1, but compared to 1.3.2 there are now two eval folders, one named eval and one named eval_one_pass.
- In 1.4.1 my output_dir no longer contains the flags.txt and hparams.json files that 1.3.2 produced.
- In 1.4.1 training runs 2000 steps at a time, and when each chunk finishes the model is reloaded.
- This results in two checkpoints around every 2000th step (e.g. at steps 2001 and 2002).
- In 1.4.1 evaluation wants to run for 10000 steps, compared to 10 steps in 1.3.2.
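For what it's worth, some of this may come from the new trainer's default flag values rather than a bug. This is a guess on my part, not a verified fix: t2t-trainer does expose `--eval_steps` and `--local_eval_frequency` flags, but the values and the interpretation below are assumptions that should be checked against the 1.4.1 source.

```shell
# Hedged sketch: attempt to mimic the 1.3.2 behaviour.
# Assumptions (not verified against 1.4.1):
#   --eval_steps            number of evaluation batches per eval run
#                           (the 1.4.1 default appears to be 10000)
#   --local_eval_frequency  train steps between evaluations, which would
#                           explain the 2000-step train/reload cycle
t2t-trainer \
  --t2t_usr_dir=t2t_csaky \
  --data_dir=data_dir/facebook_ricsibot_character \
  --model=transformer \
  --problems=character_chatbot \
  --hparams_set=transformer_dorka_big_dropout \
  --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character \
  --train_steps=800000 \
  --eval_steps=10 \
  --local_eval_frequency=10000
```

If `--local_eval_frequency` really does gate the train/eval cycle, raising it should also reduce how often the model is reloaded and how many paired checkpoints appear.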
Despite these differences, the actual training behaves the same: the loss decreases in the same way.
