This repository was archived by the owner on Jul 7, 2023. It is now read-only.
When I make changes to the model architecture, I would like to know how the training cost changes.
So I'm wondering: is there any code in Tensor2Tensor (or any TensorFlow API) that can measure the training cost in FLOPs, as shown in Table 2 of the paper "Attention Is All You Need"? Also, is it possible to profile the running time of each layer in the network architecture?
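While waiting for a pointer to the right profiler API, a rough back-of-the-envelope estimate can get you the kind of per-layer FLOP counts shown in Table 2. The sketch below is my own approximation, not Tensor2Tensor code: it counts multiply-add operations for one Transformer encoder layer (the four attention projections, the two attention matmuls, and the two feed-forward matmuls), counting each multiply-add as 2 FLOPs and ignoring softmax, layer norm, and biases.

```python
def transformer_layer_flops(seq_len: int, d_model: int, d_ff: int) -> int:
    """Approximate forward-pass FLOPs for one Transformer encoder layer.

    Counts only the dominant matrix multiplications; each multiply-add
    is counted as 2 FLOPs. Softmax, layer norm, and biases are ignored.
    """
    # Q, K, V, and output projections: 4 matmuls of [seq_len, d_model] x [d_model, d_model]
    projections = 4 * 2 * seq_len * d_model * d_model
    # Attention score matmul (QK^T) and weighted-sum matmul (scores x V)
    attention = 2 * 2 * seq_len * seq_len * d_model
    # Feed-forward network: two matmuls through the d_ff hidden layer
    ffn = 2 * 2 * seq_len * d_model * d_ff
    return projections + attention + ffn


# Example with the base-model sizes from the paper (d_model=512, d_ff=2048)
# and a hypothetical sequence length of 128 tokens:
flops = transformer_layer_flops(seq_len=128, d_model=512, d_ff=2048)
print(f"~{flops / 1e9:.2f} GFLOPs per layer")
```

Multiplying by the number of layers and training steps gives a coarse analogue of the "training cost" column; an actual profiler would additionally capture the backward pass and per-op wall-clock time.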