The reason may be in layers.py#L2191: the loop adds too many ops to the graph.
An alternative approach is as follows:
Assume the input feature map X is a b*h*w*(r^2*c) tensor:
Xs=tf.split(X,r,3) # a list of r tensors, each b*h*w*(r*c)
Xr=tf.concat(Xs,2) # b*h*(r*w)*(r*c)
X=tf.reshape(Xr,(b,r*h,r*w,c)) # b*(r*h)*(r*w)*c
Then X is the value returned at layers.py#L2197.
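The split/concat/reshape sequence above can be sketched in NumPy, whose `np.split`/`np.concatenate`/`np.reshape` take the same arguments as the `tf.split`/`tf.concat`/`tf.reshape` calls proposed here (the sizes b, h, w, r, c below are hypothetical, just to show the shape bookkeeping):

```python
import numpy as np

# Hypothetical sizes: batch, height, width, upscale factor, channels.
b, h, w, r, c = 2, 4, 4, 2, 3

# Input feature map of shape b*h*w*(r^2*c), as in the description above.
X = np.arange(b * h * w * r * r * c, dtype=np.float32).reshape(b, h, w, r * r * c)

Xs = np.split(X, r, axis=3)        # a list of r tensors, each b*h*w*(r*c)
Xr = np.concatenate(Xs, axis=2)    # b*h*(r*w)*(r*c)
Y = Xr.reshape(b, r * h, r * w, c) # b*(r*h)*(r*w)*c

print(Y.shape)  # (2, 8, 8, 3)
```

The element count checks out: b*h*(r*w)*(r*c) equals b*(r*h)*(r*w)*c, so the final reshape is valid, and only three ops are added to the graph regardless of r.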
This way adds only three ops to the graph, so for a 24*24*64 feature map, constructing the graph is much faster.
Also, the time for constructing the train ops is shortened.
layers.py#L2188 to L2196 need to be changed, and L2170 to L2186 can be removed.