Weird changes in version 1.4 [bug] #495
Description
I have observed several odd changes after switching from tensor2tensor version 1.3.2 to version 1.4.1 (the TensorFlow version is 1.4.1 (GPU) for both).
Running exactly the same t2t-trainer command results in vastly different training runs. I will list all the strange and annoying differences I have observed below. I have no idea what the cause is; I don't know whether it is a bug, or whether I simply have to change some parameters to adapt to the new tensor2tensor version.
The command that I run:

```
t2t-trainer \
  --t2t_usr_dir=t2t_csaky \
  --generate_data=False \
  --data_dir=data_dir/facebook_ricsibot_character \
  --model=transformer \
  --problems=character_chatbot \
  --hparams_set=transformer_dorka_big_dropout \
  --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character \
  --train_steps=800000 \
  --keep_checkpoint_max=3 \
  --keep_checkpoint_every_n_hours=1
```
As you can see, I use my own problem and hparams definitions; however, this shouldn't affect anything, since the code in my registration files is exactly the same for both tensor2tensor versions. Running the above command results in the following changes from version 1.3.2 to 1.4.1:
- In 1.4.1 I can no longer see any training stats in TensorBoard (loss, learning rate, etc.).
- I can still see eval stats in 1.4.1, but compared to 1.3.2 there are now two eval folders, one named eval and one named eval_one_pass.
- In 1.4.1 my output_dir no longer contains the flags.txt and hparams.json files that 1.3.2 produced.
- In 1.4.1 training runs 2000 steps at a time, and when each chunk finishes the model is reloaded.
- This results in two checkpoints around every 2000th step (e.g. at steps 2001 and 2002).
- In 1.4.1 evaluation wants to run for 10000 steps, compared to 10 steps in 1.3.2.
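For what it's worth, some of this may come from the new trainer's default flag values rather than a bug. This is a guess on my part, not a verified fix: t2t-trainer does expose `--eval_steps` and `--local_eval_frequency` flags, but the values and the interpretation below are assumptions that should be checked against the 1.4.1 source.

```shell
# Hedged sketch: attempt to mimic the 1.3.2 behaviour.
# Assumptions (not verified against 1.4.1):
#   --eval_steps            number of evaluation batches per eval run
#                           (the 1.4.1 default appears to be 10000)
#   --local_eval_frequency  train steps between evaluations, which would
#                           explain the 2000-step train/reload cycle
t2t-trainer \
  --t2t_usr_dir=t2t_csaky \
  --data_dir=data_dir/facebook_ricsibot_character \
  --model=transformer \
  --problems=character_chatbot \
  --hparams_set=transformer_dorka_big_dropout \
  --output_dir=train_dir/trf_big_dropout_facebook_ricsibot_character \
  --train_steps=800000 \
  --eval_steps=10 \
  --local_eval_frequency=10000
```

If `--local_eval_frequency` really does gate the train/eval cycle, raising it should also reduce how often the model is reloaded and how many paired checkpoints appear.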
Despite these differences, the actual training behaves the same: the loss decreases in the same way.
