
Conversation

@ShieLian
Contributor

Reduce the number of ops to do subpixel convolution
Fix #175

@zsdonghao
Member

zsdonghao commented Jul 15, 2017

Hi, are you sure the code is correct? I can't find where h and w are defined in the code.
To test whether the implementation is correct, I suggest testing it on the SRGAN example. Do they also output images with the checkerboard artifact at the beginning?

@ShieLian
Contributor Author

emm...
Sorry for not checking the code before committing. (For readability, the code here differs from my local version.)
w and h should be a and b, or just change a and b to w and h.
If you'd like to make this quick fix yourself, that would be helpful. Otherwise I'll do it tomorrow. (I'm in UTC+8.)

@zsdonghao
Member

Great. What I'm worried about is that your implementation may use a different shuffle method than the original paper. Please double-check, many thanks.

@subzerofun

subzerofun commented Jul 21, 2017

@ShieLian @zsdonghao I'm still experimenting with the SRGAN project (the results are amazing!) and have also mentioned the slow model-building time of the SubpixelConv2d layers in an issue. But I thought that was due to a CPU core/speed bottleneck on my side (I have an older i7 quad-core and thought you need a bazillion-core Xeon machine for this kind of computation).

So I was glad to see a solution to the problem! But as @zsdonghao suspected, @ShieLian: your fix changes the image quality of the GAN's output drastically. You can see a checkerboard pattern in the generated sample images, as well as color artifacts and an overall low-res, blurry output. I tried letting the training run anew for a few epochs, because the model weights from the previous training (checkpoint) seem to reset after the code change. But the resolution (sharp edges, high-frequency details) does not improve, and the checkerboard pattern also persists in all further results.

@ShieLian If your code produced the same output as before, I should see no change at all in my generated images and should be able to use the same weights as before. Or am I wrong here?

If I revert the subpixel code (and model weights), the output images are clear and sharp again. I currently don't have access to the machine with my local repo, but I will upload the images later so you can maybe check why the artifacts happen.

The model build time is nearly instantaneous with this PR; I would love to see it produce the same results as the original code!

@ShieLian
Contributor Author

ShieLian commented Jul 21, 2017

For my local implementation: if the input (shape 1x2x2x4, i.e. batch, H, W, C) is
[
[ [0,1,2,3] , [4,5,6,7] ],
[ [8,9,10,11] , [12,13,14,15] ]
]
Then the output is
[
[ [0], [1], [4], [5]],
[ [2], [3], [6], [7]],
[ [8], [9],[12],[13]],
[[10],[11],[14],[15]]
]
While the original output is
[
[ [0], [2], [4], [6]],
[ [1], [3], [5], [7]],
[ [8], [10],[12],[14]],
[ [9],[11],[13],[15]]
]
Maybe that's why the same weights give different results @subzerofun.
But a model trained completely from scratch will give equivalent results.
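The two shuffle orderings above can be reproduced with a small NumPy sketch (an illustration only, not TensorLayer's actual code; `pixel_shuffle` and its `row_major` flag are hypothetical names). `row_major=True` corresponds to this PR's ordering, `row_major=False` to the previous layer's:

```python
import numpy as np

def pixel_shuffle(x, r, row_major=True):
    """Rearrange an (N, H, W, C*r*r) tensor into (N, H*r, W*r, C).

    row_major=True:  channel c fills block position (c // r, c % r)
                     (the ordering of this PR, same as tf.depth_to_space).
    row_major=False: channel c fills block position (c % r, c // r)
                     (the ordering of the previous TensorLayer layer).
    """
    n, h, w, c = x.shape
    out_c = c // (r * r)
    y = x.reshape(n, h, w, r, r, out_c)    # split channels into (i, j, k)
    if not row_major:
        y = y.transpose(0, 1, 2, 4, 3, 5)  # swap the block row/column roles
    y = y.transpose(0, 1, 3, 2, 4, 5)      # interleave: (n, h, i, w, j, k)
    return y.reshape(n, h * r, w * r, out_c)

x = np.arange(16).reshape(1, 2, 2, 4)      # the 1x2x2x4 input from above
print(pixel_shuffle(x, 2, row_major=True)[0, :, :, 0])
print(pixel_shuffle(x, 2, row_major=False)[0, :, :, 0])
```

The first print reproduces the first 4x4 grid above, the second print the "original" one, so the two layers really do place the same channels at different spatial positions.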

@ShieLian
Contributor Author

[image: the paper's equation defining the periodic shuffling operator PS]
According to the paper, PS(T)_{1,0,0} = T_{0,0,1} = 1.
So strictly speaking, my implementation is the same as the paper's.
But I will try to provide a version that behaves the same as TensorLayer's.
It's up to you to decide which method to use. @zsdonghao
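For reference, the paper's indexing formula for PS can be transcribed literally. This is a sketch only (`ps_paper` is my own name, and the triple loop is deliberately naive); it evaluates the formula PS(T)[x, y, c] = T[x // r, y // r, C*r*(y % r) + C*(x % r) + c] and confirms the PS(T)_{1,0,0} = T_{0,0,1} = 1 check quoted above:

```python
import numpy as np

def ps_paper(T, r):
    """Direct transcription of the periodic shuffling operator PS.

    T has shape (H, W, C*r*r); the result has shape (H*r, W*r, C).
    """
    H, W, Crr = T.shape
    C = Crr // (r * r)
    out = np.empty((H * r, W * r, C), dtype=T.dtype)
    for x in range(H * r):
        for y in range(W * r):
            for c in range(C):
                out[x, y, c] = T[x // r, y // r,
                                 C * r * (y % r) + C * (x % r) + c]
    return out

T = np.arange(16).reshape(2, 2, 4)  # the 2x2x4 example from above
print(ps_paper(T, 2)[1, 0, 0])      # PS(T)_{1,0,0} = T_{0,0,1} = 1
```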

@zsdonghao
Member

let me check

@zsdonghao zsdonghao merged commit a31ead9 into tensorlayer:master Jul 21, 2017
@subzerofun

I'm sorry, but I have no idea about the mathematics behind the changes. But I have started the SRGAN training from scratch with the new code. And when I compare the quality and output progress of the newly generated images to the old ones, the newer ones are clearly worse. I don't know why that is, but it has to be because of this change, since I did not modify anything else in the SRGAN repo.

Take a look for yourself: here are two pictures, both from training epoch no. 175. The new output image has a lot more artifacts, and some parts again show the checkerboard pattern. Just look at the palm leaves; the results are clearly not the same. I will leave the training running for a while, but I don't think the new model will reach the same fidelity as the old one.

Old pixelshuffler code:

[image: 0175_2000_train]

New code:

[image: new]

Could you please check the output again?

@zsdonghao
Member

zsdonghao commented Jul 23, 2017

@subzerofun I am training SRGAN as well; it may finish in 1 to 2 days. I will let you know when the training finishes~ 😊

@ShieLian
Contributor Author

ShieLian commented Jul 23, 2017

@subzerofun
If you train the network starting from the original weights, it may perform very badly.
Try resetting the weights of the Conv layer right before the first SubpixelConv and the weights after it, and train only those.
(In fact, it's because the channel order changed, so training from the original weights can hardly recover. The weights of the layers before the Conv that feeds the first SubpixelConv can all be reused, but that Conv layer's weights need their channel order adjusted.)
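A rough sketch of that channel-order adjustment for the Conv layer feeding the first SubpixelConv. Everything here is an assumption for illustration: `permute_subpixel_weights` is a hypothetical helper (not a TensorLayer API), and it assumes the old ordering stores block position (i, j) of output channel k at index j*r*C + i*C + k while the new ordering expects i*r*C + j*C + k:

```python
import numpy as np

def permute_subpixel_weights(w, r, out_channels):
    # w: conv kernel of shape (kh, kw, in_c, out_channels * r * r) whose
    # outputs feed the subpixel layer. Reorders the last axis so weights
    # trained under one channel-to-space ordering drive the other.
    C = out_channels
    perm = np.empty(C * r * r, dtype=int)
    for i in range(r):          # block row
        for j in range(r):      # block column
            for k in range(C):  # output channel within the block
                perm[i * r * C + j * C + k] = j * r * C + i * C + k
    return w[..., perm]
```

With r=2 and C=1 this swaps the middle two output channels, i.e. it turns the [[0, 2], [1, 3]] block layout into [[0, 1], [2, 3]].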

@ShieLian
Contributor Author

@subzerofun
The version compatible with the original one is here: patch-2

@ShieLian ShieLian deleted the patch-1 branch July 23, 2017 17:37
@subzerofun

I think you misunderstood me; I did delete all model weights (all .npz files) before starting the training again. The comparison above is of a newly trained model, with the only difference being the changed TensorLayer version (this PR), trained from epoch 0 to 175 and compared to the old subpixel function.

There has to be a difference between the two functions (new/old); unfortunately I don't understand the math behind it. @ShieLian But what could be the reason, if you say the (numerical) outcome should be the same? Thank you for the patch; I will test your new version and compare it to the other results (and will delete the old weights, of course!).

@zsdonghao Lucky you: two days on your Titan sounds great. It took me 5–6 days on my 1080 Ti to get to epoch 2000.

@zsdonghao
Member

@ShieLian @subzerofun, I re-trained SRGAN with the simplified version for 1400 epochs; please have a look.

From the results, it seems the simplified version is correct.

Simplified layer:

[image: train_1400]

Previous version:

[image: train_1400_ok]
