Resuming from pretrained network checkpoints on CPU fails: unknown Torch class <torch.CudaTensor> #3

Closed
JCBrouwer opened this issue Oct 12, 2017 · 31 comments

Comments

@JCBrouwer
Contributor

JCBrouwer commented Oct 12, 2017

Trying to resume training from one of the models on CPU returns an error regarding an unknown Torch class.

DATA_ROOT=myimages dataset=folder gpu=0 netD=checkpoints/landscapes_776_net_D.t7 netG=checkpoints/landscapes_776_net_G.t7 th main-128.lua
{
ntrain : inf
netD : "checkpoints/landscapes_776_net_D.t7"
nThreads : 4
niter : 100
batchSize : 64
netG : "checkpoints/landscapes_776_net_G.t7"
ndf : 40
fineSize : 128
nz : 100
loadSize : 129
gpu : 0
ngf : 160
dataset : "folder"
lr : 0.0002
noise : "normal"
name : "experiment1"
beta1 : 0.5
display_id : 10
display : 1
}
Random Seed: 8411
Starting donkey with id: 2 seed: 8413
table: 0x0a0c0bc8
Starting donkey with id: 1 seed: 8412
table: 0x0a0e2528
Starting donkey with id: 4 seed: 8415
table: 0x0a100ae0
Starting donkey with id: 3 seed: 8414
table: 0x0a122460
Loading train metadata from cache
Loading train metadata from cache
Loading train metadata from cache
Loading train metadata from cache
Dataset: folder Size: 5209
Initializing generator network from checkpoints/landscapes_776_net_G.t7
/Users/hans/torch/install/bin/luajit: /Users/hans/torch/install/share/lua/5.1/torch/File.lua:343: unknown Torch class <torch.CudaTensor>
stack traceback:
[C]: in function 'error'
/Users/hans/torch/install/share/lua/5.1/torch/File.lua:343: in function 'readObject'
/Users/hans/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/Users/hans/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
/Users/hans/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
/Users/hans/torch/install/share/lua/5.1/torch/File.lua:409: in function 'load'
main-128.lua:72: in main chunk
[C]: in function 'dofile'
...hans/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x010912cd60

This seems to be because the models were trained using a GPU and thus require CUNN to load. According to this comment however, this can be remedied simply by converting the models to float before saving them. I would test it out myself and pull request (seeing as this might be as simple as adding 2 lines) but I don't have an NVIDIA graphics card.

Next to this I have found this script, which seems to be able to convert checkpoints after the fact. This also requires CUNN though, so it would be nice if the checkpoints could be converted for us CPU users!

@robbiebarrat
Owner

robbiebarrat commented Oct 12, 2017

Hey! Thanks for bringing this to my attention;

I actually had no idea that CPU users couldn't use the pre-trained models I put up; so sorry about that.

I'll make a commit once I get home from work tonight with the converted models.

@Caselles

Caselles commented Oct 13, 2017

Indeed, it would be fantastic to obtain the CPU-compatible models!

@robbiebarrat
Owner

robbiebarrat commented Oct 13, 2017

@Caselles Agreed - CPU-compatible models are a must;

On my work computer - the conversion script in Karpathy's repo doesn't run (gives a very strange error)...

I think I'm going to try and add the "two-line change" that @JCBrouwer mentioned and see if it runs on a computer with CUDA_VISIBLE_DEVICES=""

@Caselles

Caselles commented Oct 17, 2017

@robbiebarrat Did you have the time to progress on the issue? I really want to try out the pretrained models but it is currently not possible for me.

@robbiebarrat
Owner

robbiebarrat commented Oct 17, 2017

Sorry - I have been working on this.

I keep getting errors when trying to run a generation with the converted model, and recently a very discouraging one

In 1 module of nn.Sequential: ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:32: Only Cuda supported duh!

Here; I've uploaded one of the models for you to try. See if it works by running `net=landscapes_776_net_G_cpu.t7 th generate.lua` - and maybe comment out some of the lines that require cuda/cudnn/cunn with `--`...

Get the model here and please let me know if you find anything out. In the meantime; I'll keep trying to solve this.

https://drive.google.com/open?id=0B-_m9VM1w1bKRnJMZmkzWEVtSDA

Also; the script I used for conversion is as follows:

require 'nn'
require 'optim'
require 'cunn'
require 'cudnn'

modelName = 'landscapes_776_net_G.t7'

model = torch.load(modelName)
model = model:float()
torch.save(modelName .. '_cpu', model:clearState())
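A possible reason the converted checkpoint still fails: the script above casts the tensors with `:float()` but leaves any `cudnn.*` layers in place, which is consistent with the `cudnn/SpatialConvolution.lua` error quoted earlier. The sketch below also swaps those layers for their `nn` equivalents before saving. It assumes the checkpoint holds a single network object, that it runs on a machine with `cunn` and `cudnn` installed, and that `cudnn.convert` covers every layer type in the model. `cpuName` and `convertToCpu` are hypothetical helper names, not part of this repository.

```lua
-- Sketch only: convert a GPU-trained Torch7 checkpoint into a CPU-loadable one.
-- Assumption: this runs under `th` on a machine with cunn and cudnn installed.

-- Hypothetical helper: 'foo.t7' -> 'foo_cpu.t7' (keeps the .t7 extension last).
local function cpuName(path)
  local stem = path:match('^(.*)%.t7$')
  if stem then
    return stem .. '_cpu.t7'
  end
  return path .. '_cpu'
end

-- Hypothetical helper: load a GPU checkpoint, strip CUDA-only parts, save a copy.
local function convertToCpu(src)
  require 'nn'
  require 'cunn'   -- needed to deserialize torch.CudaTensor
  require 'cudnn'  -- needed to deserialize cudnn.* modules

  local model = torch.load(src)
  model:clearState()        -- drop cached output/gradInput buffers
  cudnn.convert(model, nn)  -- replace cudnn.* layers with nn.* equivalents
  model = model:float()     -- CudaTensor -> FloatTensor

  local dst = cpuName(src)
  torch.save(dst, model)
  return dst
end
```

Calling `convertToCpu('landscapes_776_net_G.t7')` would write `landscapes_776_net_G_cpu.t7`, the file name the `generate.lua` command above expects. The original script appends `_cpu` after the extension and so writes `landscapes_776_net_G.t7_cpu` instead.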

@Caselles

Caselles commented Oct 17, 2017

Thanks for your work. I tried, and got the same error as you, related to cudnn : unknown Torch class <cudnn.SpatialFullConvolution>

It might be too much work, but isn't there a way to convert the code to Python? The problem seems related to Torch; I guess with Keras or TensorFlow these problems do not exist. I know it is a lot of work, but I assure you that a lot of people would be grateful, since it is VERY HARD to find pre-trained art GAN models. I think this repo would be used much more if it were in Python.

I'll try to seek answers for the cudnn problem though. Keep us in touch please :)

@robbiebarrat
Owner

robbiebarrat commented Oct 17, 2017

It might not be too much work, honestly; Python is the language that I'm most comfortable with, and Keras is my favorite library. I just finished converting an old project I wrote in PyBrain, of all things, to Keras and had great success doing that.

I'll keep you updated on the conversion process - I'm a little busy right now since I'm starting to apply to college, so it might take like, two or so weeks, but I feel like it'd definitely be worth it.

@Caselles

Caselles commented Oct 17, 2017

Two or so weeks would be great ! I would really appreciate you doing that. Thanks

@robbiebarrat
Owner

robbiebarrat commented Oct 17, 2017

no problem - It'll help me a lot with some art projects I'm doing, too, so it's a win-win

@robbiebarrat
Owner

robbiebarrat commented Oct 24, 2017

@Caselles @JCBrouwer
Alright - so I've finished the data-loading part in python, hopefully this week I'll be able to finish the actual GAN part (not that hard once you already know the architecture); i very well may double the resolution, too, and have it be 256x256...

Anyways; yeah expect a python rewrite sometime this week(end?)

@Caselles

Caselles commented Oct 25, 2017

Thanks a lot for the update. Do you mean we will have pre-trained models in 256x256 resolution, loadable in python ? This would be really great !

@robbiebarrat
Owner

robbiebarrat commented Oct 25, 2017

Yeah - I plan on making that the case. At my workplace, we have these insane GPU clusters that I'll set the models to train on; so training won't take forever, and once I finish defining the model in keras (i'm making some slight improvements in the architecture), it'll be relatively easy for me to train.

Let me know if you have any ideas for a pre-trained GAN you'd like to see included; I'm definitely going to do landscapes and nude portraits, but if you think of anything else you'd like just let me know.

@JCBrouwer
Contributor Author

JCBrouwer commented Oct 25, 2017

@robbiebarrat Thanks a ton for working on this!

Some more training sets I can think of are: cartoon characters, pixel art, graffiti, space, and psychedelic art.

Feel like all of those should have more than enough examples to train on. They could also all be interesting checkpoints to resume from and train with new data.

@Caselles

Caselles commented Oct 25, 2017

Maybe try the flowers dataset too : http://www.robots.ox.ac.uk/~vgg/data/flowers/

I would looove to see the results of a pre trained GAN on flowers that you fine tune on abstract art ! :)

@robbiebarrat
Owner

robbiebarrat commented Oct 25, 2017

@JCBrouwer @Caselles The things that jump out at me as really cool ideas are the space GAN and the flowers + abstract fine-tuning GAN. I've been meaning to have a network that can sort of "show off" the whole micro-training-on-a-different-dataset thing I've come up with and flowers sound really good for that...

Thanks so much for the suggestions; I'll keep you updated on the progress of the rewrite in the coming week.

@Caselles

Caselles commented Nov 3, 2017

@robbiebarrat Any progress on the rewrite ? :)

No hurry, just want to know if you had the time to work on it.

@robbiebarrat
Owner

robbiebarrat commented Nov 3, 2017

@Caselles yeah - I've finished pretty much everything except for the network implementation itself (like defining the model in keras, but that shouldn't take long at all).

I'll put it up as soon as it's like, presentable with results and stuff, which I think is going to be ~a week from now. By the end of next weekend for sure I'll have something ready to put up.

@Caselles

Caselles commented Nov 4, 2017

Thanks a lot! Looking forward to trying out these models!

@robbiebarrat
Owner

robbiebarrat commented Nov 12, 2017

Hey guys - I'm actually running into a lot of trouble with the Keras model; it's insanely hard to train at 256x256 resolution so I'm messing around with architectural changes... Really sorry, but this might take longer to do than I initially thought it would.

@Caselles

Caselles commented Nov 12, 2017

Hey @robbiebarrat, no worries and thanks for the update! Take your time, as long as you keep us updated about your work it is perfectly ok :) Good luck with the architectural changes !

@JCBrouwer
Contributor Author

JCBrouwer commented Nov 13, 2017

@robbiebarrat training for 256x256 is quite difficult! Most implementations have a lot of issues with mode collapse above 128x128 resolution.

Perhaps it's an idea to get a working implementation and trained datasets for 128x128 up first and then expand to larger resolutions later.

Otherwise some tips that might help with larger resolution training can be found here and maybe something to look at implementing eventually would be progressive growing of resolution.

@Caselles

Caselles commented Dec 4, 2017

Even if you only have results in 128x128, I am very much interested in getting the pretrained models and code in Python. Such repositories do not really exist at the moment, so even if it does not seem perfect enough to publish, I think you should consider it :)

@Caselles

Caselles commented Dec 27, 2017

?

@robbiebarrat
Owner

robbiebarrat commented Dec 27, 2017

@Caselles Hey - sorry about this; but I've run into a lot of problems with the python networks (mostly with regard to training stability). I've tried a bunch of things from ganhacks, different architectures, loss functions, etc. but to no avail. I thought about using the progressive growing of gans paper, but that takes multiple months to train. I'm putting this project on hold right now, since it's not really working out, and also I'm pretty swamped with applying to colleges right now...

I might come back to it in the next few months if I come across something that'll help me out, or if some cool new GAN paper comes out (like if there's one that does higher resolution generations easily) - but I really don't know if that's very likely.

@Caselles

Caselles commented Dec 28, 2017

Ok, really disappointed. Could you at least provide the code for the 128x128 GAN in Keras?

@robbiebarrat
Owner

robbiebarrat commented Dec 28, 2017

Unfortunately I don't have that working; otherwise i'd absolutely provide it.

I'm going to try something using tf-gan in the near future, which will just be straight up tensorflow as opposed to keras, but I'll update this repo with some equivalent networks in tf - can't promise 256x256 though.

Repository owner deleted a comment from Caselles Dec 31, 2017
@JCBrouwer
Contributor Author

JCBrouwer commented Jan 22, 2018

Hello @Caselles @robbiebarrat. I've finally gotten myself a CUDA-enabled GPU. I wrote a simple script that converts GPU checkpoints to CPU checkpoints and converted all the included pretrained networks with it. See #6

Landscape GAN

Generator
Discriminator

Nude-Portrait GAN

Generator
Discriminator

Portrait GAN

Generator
Discriminator

@vdaita

vdaita commented Jan 23, 2018

Thank you so much! I will try this out at once!

@vdaita

vdaita commented Jan 23, 2018

This worked! Thank you soooooo much! @JCBrouwer

@Eastkap

Eastkap commented Oct 26, 2018

I'm French and I feel you robbie

@karl-schulz

karl-schulz commented Oct 29, 2018

never trust french

@Boriyilla why make this about nationalities?

Repository owner deleted a comment from Boriyilla Oct 29, 2018