
100 epochs with 10,000 images from celebA... still noise? #4

Closed
RtFishers opened this issue Oct 24, 2017 · 15 comments

Comments

@RtFishers

RtFishers commented Oct 24, 2017

Hi, thanks very much for adding more layers so that the network can generate higher-res images...

I'm a bit confused about how to go about training properly. I put 10,000 images from "img_align_celebA" into the landscape/images folder and ran "DATA_ROOT=landscape dataset=folder ndf=30 ngf=90 th main.lua", but I'm still getting almost pure noise in the localhost:8000 display... is this normal?

@robbiebarrat
Owner

robbiebarrat commented Oct 24, 2017

Obviously something is wrong. I trained on celebA (didn't put up the weights, as it isn't really 'art') and got pretty good results (recognizable as a face) pretty early on...

Make sure that your display hasn't crashed and stopped updating - run th -ldisplay.start in a new terminal. Also, can you please paste in a selection of the training output (just the logs from one of the later epochs)? Has one loss dipped down to zero, i.e. is the generator or discriminator winning out over the other?
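
For reference, a minimal recap of the two commands in play here, run in separate terminals (a sketch assuming the display package is installed and the folder layout described above):

# terminal 1: start the display server, then open http://localhost:8000
th -ldisplay.start

# terminal 2: train on the folder dataset under landscape/images
DATA_ROOT=landscape dataset=folder ndf=30 ngf=90 th main.lua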

EDIT: sorry - didn't mean to close this issue :P

@RtFishers
Author

Okay, never mind... I removed two folders in my "art-DCGAN-master" directory that I believe may have been screwing with the process... one named "images" and another named "folder"... both empty.
I'm getting blocky noise that resembles the images now - using the landscape images scraped from the wiki (around 1,250 images).
How many epochs does it usually take until you get images that resemble the dataset?

@RtFishers
Author

RtFishers commented Oct 25, 2017

Hmmm... I wonder if it's still supposed to look like this after 100 epochs: https://imgur.com/a/VOG63

Here is a portion from the logs:
Epoch: [95][ 0 / 13] Time: 0.266 DataTime: 0.000 Err_G: 0.6444 Err_D: 1.4740
Epoch: [95][ 1 / 13] Time: 0.177 DataTime: 0.000 Err_G: 1.3765 Err_D: 1.6764
Epoch: [95][ 2 / 13] Time: 0.282 DataTime: 0.000 Err_G: 0.7322 Err_D: 1.3951
Epoch: [95][ 3 / 13] Time: 0.286 DataTime: 0.000 Err_G: 1.5892 Err_D: 0.8094
Epoch: [95][ 4 / 13] Time: 1.161 DataTime: 1.004 Err_G: 1.4294 Err_D: 0.6988
Epoch: [95][ 5 / 13] Time: 0.284 DataTime: 0.000 Err_G: 1.3500 Err_D: 0.7171
Epoch: [95][ 6 / 13] Time: 0.283 DataTime: 0.000 Err_G: 1.1339 Err_D: 0.8040
Epoch: [95][ 7 / 13] Time: 0.293 DataTime: 0.134 Err_G: 1.7658 Err_D: 0.6731
Epoch: [95][ 8 / 13] Time: 1.090 DataTime: 0.932 Err_G: 0.9855 Err_D: 0.9451
Epoch: [95][ 9 / 13] Time: 0.748 DataTime: 0.252 Err_G: 1.2098 Err_D: 1.4906
Epoch: [95][ 10 / 13] Time: 0.186 DataTime: 0.007 Err_G: 0.6246 Err_D: 1.4969
Epoch: [95][ 11 / 13] Time: 0.967 DataTime: 0.810 Err_G: 2.3561 Err_D: 1.0235
Epoch: [95][ 12 / 13] Time: 1.040 DataTime: 0.882 Err_G: 0.5786 Err_D: 1.5334
Epoch: [95][ 13 / 13] Time: 0.287 DataTime: 0.000 Err_G: 2.3278 Err_D: 1.0305
End of epoch 95 / 100 Time Taken: 8.909
Epoch: [96][ 0 / 13] Time: 0.271 DataTime: 0.000 Err_G: 0.7569 Err_D: 1.3361
Epoch: [96][ 1 / 13] Time: 0.174 DataTime: 0.000 Err_G: 1.4464 Err_D: 1.1587
Epoch: [96][ 2 / 13] Time: 0.288 DataTime: 0.000 Err_G: 1.1522 Err_D: 1.6824
Epoch: [96][ 3 / 13] Time: 0.285 DataTime: 0.000 Err_G: 0.6205 Err_D: 1.5008
Epoch: [96][ 4 / 13] Time: 1.099 DataTime: 0.937 Err_G: 2.8557 Err_D: 2.0824
Epoch: [96][ 5 / 13] Time: 0.462 DataTime: 0.301 Err_G: 0.2834 Err_D: 1.9902
Epoch: [96][ 6 / 13] Time: 0.702 DataTime: 0.545 Err_G: 1.3848 Err_D: 1.0088
Epoch: [96][ 7 / 13] Time: 0.284 DataTime: 0.000 Err_G: 0.8531 Err_D: 1.1756
Epoch: [96][ 8 / 13] Time: 0.618 DataTime: 0.462 Err_G: 1.3284 Err_D: 1.3724
Epoch: [96][ 9 / 13] Time: 1.511 DataTime: 0.335 Err_G: 0.7013 Err_D: 1.2385
Epoch: [96][ 10 / 13] Time: 0.218 DataTime: 0.036 Err_G: 2.0415 Err_D: 1.0599
Epoch: [96][ 11 / 13] Time: 0.739 DataTime: 0.584 Err_G: 0.7567 Err_D: 1.1829
Epoch: [96][ 12 / 13] Time: 0.338 DataTime: 0.180 Err_G: 1.2995 Err_D: 1.1113
Epoch: [96][ 13 / 13] Time: 1.486 DataTime: 1.330 Err_G: 1.1576 Err_D: 1.0317
End of epoch 96 / 100 Time Taken: 9.749
Epoch: [97][ 0 / 13] Time: 0.269 DataTime: 0.000 Err_G: 1.3777 Err_D: 1.2087
Epoch: [97][ 1 / 13] Time: 0.174 DataTime: 0.000 Err_G: 0.5777 Err_D: 1.4291
Epoch: [97][ 2 / 13] Time: 0.287 DataTime: 0.000 Err_G: 1.2322 Err_D: 1.4589
Epoch: [97][ 3 / 13] Time: 1.049 DataTime: 0.893 Err_G: 0.3955 Err_D: 1.7495
Epoch: [97][ 4 / 13] Time: 0.288 DataTime: 0.000 Err_G: 0.9927 Err_D: 1.9682
Epoch: [97][ 5 / 13] Time: 0.281 DataTime: 0.000 Err_G: 0.6527 Err_D: 1.6678
Epoch: [97][ 6 / 13] Time: 0.285 DataTime: 0.000 Err_G: 1.0144 Err_D: 1.0721
Epoch: [97][ 7 / 13] Time: 0.900 DataTime: 0.743 Err_G: 1.3015 Err_D: 0.8945
Epoch: [97][ 8 / 13] Time: 0.841 DataTime: 0.684 Err_G: 1.3483 Err_D: 0.6924
Epoch: [97][ 9 / 13] Time: 0.662 DataTime: 0.155 Err_G: 1.4783 Err_D: 1.0527
Epoch: [97][ 10 / 13] Time: 1.082 DataTime: 0.914 Err_G: 0.4055 Err_D: 1.6375
Epoch: [97][ 11 / 13] Time: 0.574 DataTime: 0.416 Err_G: 2.6284 Err_D: 1.5000
Epoch: [97][ 12 / 13] Time: 0.364 DataTime: 0.206 Err_G: 0.4391 Err_D: 1.5753
Epoch: [97][ 13 / 13] Time: 0.285 DataTime: 0.091 Err_G: 1.7904 Err_D: 0.8829
End of epoch 97 / 100 Time Taken: 9.038
Epoch: [98][ 0 / 13] Time: 0.270 DataTime: 0.000 Err_G: 1.3242 Err_D: 0.7591
Epoch: [98][ 1 / 13] Time: 0.174 DataTime: 0.000 Err_G: 0.5229 Err_D: 1.5423
Epoch: [98][ 2 / 13] Time: 0.557 DataTime: 0.400 Err_G: 1.8146 Err_D: 1.1900
Epoch: [98][ 3 / 13] Time: 0.447 DataTime: 0.288 Err_G: 0.9109 Err_D: 1.1277
Epoch: [98][ 4 / 13] Time: 0.570 DataTime: 0.414 Err_G: 1.2145 Err_D: 0.9000
Epoch: [98][ 5 / 13] Time: 0.286 DataTime: 0.000 Err_G: 1.3356 Err_D: 0.9014
Epoch: [98][ 6 / 13] Time: 0.287 DataTime: 0.000 Err_G: 1.5560 Err_D: 0.8859
Epoch: [98][ 7 / 13] Time: 0.746 DataTime: 0.588 Err_G: 1.1245 Err_D: 0.7293
Epoch: [98][ 8 / 13] Time: 1.034 DataTime: 0.875 Err_G: 1.1159 Err_D: 0.9739
Epoch: [98][ 9 / 13] Time: 0.741 DataTime: 0.165 Err_G: 1.3555 Err_D: 1.0987
Epoch: [98][ 10 / 13] Time: 0.724 DataTime: 0.550 Err_G: 0.5194 Err_D: 1.6065
Epoch: [98][ 11 / 13] Time: 0.323 DataTime: 0.166 Err_G: 0.9838 Err_D: 1.5638
Epoch: [98][ 12 / 13] Time: 0.515 DataTime: 0.359 Err_G: 2.6404 Err_D: 1.1168
Epoch: [98][ 13 / 13] Time: 0.295 DataTime: 0.000 Err_G: 0.2485 Err_D: 2.4634
End of epoch 98 / 100 Time Taken: 8.713
Epoch: [99][ 0 / 13] Time: 0.269 DataTime: 0.000 Err_G: 2.0790 Err_D: 1.5072
Epoch: [99][ 1 / 13] Time: 0.173 DataTime: 0.000 Err_G: 1.6221 Err_D: 0.9545
Epoch: [99][ 2 / 13] Time: 0.286 DataTime: 0.000 Err_G: 0.6393 Err_D: 1.4517
Epoch: [99][ 3 / 13] Time: 0.282 DataTime: 0.000 Err_G: 0.8820 Err_D: 1.2528
Epoch: [99][ 4 / 13] Time: 0.349 DataTime: 0.191 Err_G: 1.6816 Err_D: 1.0430
Epoch: [99][ 5 / 13] Time: 0.440 DataTime: 0.283 Err_G: 0.7517 Err_D: 1.1776
Epoch: [99][ 6 / 13] Time: 0.486 DataTime: 0.327 Err_G: 1.1272 Err_D: 0.9839
Epoch: [99][ 7 / 13] Time: 1.124 DataTime: 0.968 Err_G: 1.4441 Err_D: 1.2618
Epoch: [99][ 8 / 13] Time: 0.283 DataTime: 0.000 Err_G: 0.3661 Err_D: 1.5913
Epoch: [99][ 9 / 13] Time: 0.839 DataTime: 0.351 Err_G: 1.6691 Err_D: 1.8219
Epoch: [99][ 10 / 13] Time: 1.221 DataTime: 1.052 Err_G: 0.5908 Err_D: 1.5005
Epoch: [99][ 11 / 13] Time: 0.467 DataTime: 0.312 Err_G: 1.6038 Err_D: 1.1185
Epoch: [99][ 12 / 13] Time: 0.284 DataTime: 0.000 Err_G: 0.7994 Err_D: 1.1239
Epoch: [99][ 13 / 13] Time: 0.287 DataTime: 0.116 Err_G: 1.8182 Err_D: 1.0219
End of epoch 99 / 100 Time Taken: 8.841
Epoch: [100][ 0 / 13] Time: 0.271 DataTime: 0.000 Err_G: 0.9349 Err_D: 1.0812
Epoch: [100][ 1 / 13] Time: 0.174 DataTime: 0.000 Err_G: 1.3770 Err_D: 1.0310
Epoch: [100][ 2 / 13] Time: 0.282 DataTime: 0.000 Err_G: 1.0230 Err_D: 1.1099
Epoch: [100][ 3 / 13] Time: 0.283 DataTime: 0.000 Err_G: 0.9857 Err_D: 1.0808
Epoch: [100][ 4 / 13] Time: 1.084 DataTime: 0.927 Err_G: 0.9773 Err_D: 1.2150
Epoch: [100][ 5 / 13] Time: 0.374 DataTime: 0.217 Err_G: 1.2164 Err_D: 1.1224
Epoch: [100][ 6 / 13] Time: 0.286 DataTime: 0.000 Err_G: 0.7667 Err_D: 1.0097
Epoch: [100][ 7 / 13] Time: 0.282 DataTime: 0.000 Err_G: 2.4329 Err_D: 1.0925
Epoch: [100][ 8 / 13] Time: 1.605 DataTime: 1.450 Err_G: 0.6667 Err_D: 1.2033
Epoch: [100][ 9 / 13] Time: 0.618 DataTime: 0.000 Err_G: 2.2147 Err_D: 0.9510
Epoch: [100][ 10 / 13] Time: 0.335 DataTime: 0.165 Err_G: 0.6985 Err_D: 1.1703
Epoch: [100][ 11 / 13] Time: 1.274 DataTime: 1.118 Err_G: 1.7755 Err_D: 1.1486
Epoch: [100][ 12 / 13] Time: 0.303 DataTime: 0.145 Err_G: 0.8030 Err_D: 1.3872
Epoch: [100][ 13 / 13] Time: 0.349 DataTime: 0.191 Err_G: 1.0955 Err_D: 1.1228
End of epoch 100 / 100 Time Taken: 9.174

I think maybe I'm missing something?

@RtFishers
Author

Also... what happens if I run the command "DATA_ROOT=landscape dataset=folder ndf=50 ngf=150 th main.lua" AFTER I have already run it (choosing not to load from a checkpoint)? Does it just start the process over completely or does the file in the "cache" folder get involved?

@robbiebarrat
Owner

Delete the contents of your cache folder - it builds the dataset into arrays to be used by the network, and I think you are only training on a small portion of the dataset (the 13 in the logs is your number of batches per epoch, and it should be a lot larger). Delete the contents of cache/, make sure that you have at least a few thousand landscapes in your landscape/images folder, and begin training again... Let me know if that doesn't fix it.
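
As a quick sanity check on that batch count: batches per epoch should be roughly the image count divided by the batch size (64 by default, per the comments below), so 10,000 images should give about 156 batches per epoch, not 13:

# assuming batch size 64 and floor division
echo $((10000 / 64))   # 156; a logged count of 13 implies only ~13*64 = 832 images were read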

@RtFishers
Author

How big should it be? I deleted the cache file and now it's at 19...

@robbiebarrat
Owner

What's your batch size and number of images in your folder?

@RtFishers
Author

RtFishers commented Oct 25, 2017

Maybe my landscape dataset is too small? There are only 1,262 or so images, and the batch size is the default - 64.
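
(That arithmetic matches the batch count above: 1262 / 64 rounds down to 19 batches per epoch, so the cache is now picking up the whole dataset.)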

@RtFishers
Author

Oh yes... just for clarification:
Torch7, Lua 5.2.4 (though I think my Torch is using LuaJIT), CUDA 8.0, cuDNN 5.1...

@RtFishers
Author

But I didn't install the cudnn or cunn luarocks packages, because they caused errors when I ran the training code.

@RtFishers
Author

I think I'm gonna try with all 200,000+ images from celebA and see what happens... but I'm pretty sure I'm not getting the results I should be.

@robbiebarrat
Owner

Hmm... I doubt cudnn has anything to do with it, although you may run into some errors loading from saved models (don't quote me on that)...

I think that the dataset size is the problem; I've gotten some very strange results when trying to train on data under ~3,000 images...

Let me know what you get with celebA - as that should definitely work. If it doesn't, send me your entire project folder (minus the dataset, maybe) on google drive or something, and I'll take a look myself.

Keep in mind that the project is currently undergoing a total overhaul - it's being reimplemented in Python/Keras instead of Torch, with a better model - so if we're unable to solve your problem now, it shouldn't be an issue anymore in a week or two after the update.

@RtFishers
Author

Okay, great to hear :). I will report back shortly.

@RtFishers
Author

RtFishers commented Oct 25, 2017

Okay, it works great now!... although I had to adjust the filter counts (ndf:ngf) from 50:150 to 20:120 - otherwise my discriminator overpowers my generator every single time and the output just remains noise forever.

@robbiebarrat
Owner

Ah - nice, glad you got it working. Also, yeah, that makes sense. GANs are really hard to train - if you don't set all the hyperparameters just right, everything gets screwed up.

That's actually the reason I wanted to see your logs: usually, if the discriminator wins over the generator, d_loss goes down and hangs around 0.00001 or a similarly low value... The discriminator does have the easier job, so it often wins out unless you severely handicap its number of filters.

You might have to play around with different numbers of filters per network when moving to different datasets, because the 20:120 ratio might not work for all of them.
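
For example, the invocation with the filter counts that worked here would look like this (a sketch, assuming the same folder-dataset setup used earlier in the thread; adjust DATA_ROOT for your own dataset):

DATA_ROOT=landscape dataset=folder ndf=20 ngf=120 th main.lua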
