[MRG] Generic multi layer perceptron #3204
Conversation
What's the todo list for this one?
Hi larsmans, the todo list is: …
For the weight init, I would just use a …
Out of curiosity, does RBM initialisation mean that …
@ogrisel should we include another parameter - like …

@jnothman yes, an RBM trains on the unlabeled samples and its new, trained weights become the initial weights of the corresponding layer in the multi-layer perceptron. The image below shows a basic idea of how this is done.
I think we can leave the RBM init to a separate PR.
@larsmans sure thing :) For the travis build, I believe the error is coming from …
+1 for leaving the RBM init in a separate PR. Also, no need to couple the 2 models: just extract the weights from a pipeline of RBMs and manually stick them in as …
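A rough sketch of that decoupled pre-training idea, using the attribute names of the estimators as they exist in scikit-learn today (`BernoulliRBM.components_` / `intercept_hidden_`, `MLPClassifier.coefs_` / `intercepts_`), which are not necessarily the names used in this PR:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM, MLPClassifier
from sklearn.preprocessing import minmax_scale

X, y = load_digits(return_X_y=True)
X = minmax_scale(X)  # BernoulliRBM expects inputs in [0, 1]

n_hidden = 64

# Unsupervised pre-training: fit an RBM on the (potentially unlabeled) data.
rbm = BernoulliRBM(n_components=n_hidden, n_iter=20, random_state=0).fit(X)

# Fit the MLP for a single iteration just to allocate coefs_/intercepts_,
# then overwrite the first hidden layer with the RBM weights and fine-tune.
mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=1,
                    warm_start=True, random_state=0)
mlp.fit(X, y)
mlp.coefs_[0] = rbm.components_.T           # shape (n_features, n_hidden)
mlp.intercepts_[0] = rbm.intercept_hidden_  # shape (n_hidden,)
mlp.set_params(max_iter=200)
mlp.fit(X, y)  # fine-tuning phase, starting from the RBM weights
```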
Not only: the other builds have failed because the doc tests don't pass either, as I told you earlier in the previous PR.
```
Classifier             train-time  test-time  error-rate
---------------------------------------------------------
nystroem_approx_svm      124.819s     0.811s      0.0242
MultilayerPerceptron     359.460s     0.217s      0.0271
```
Isn't it possible to find hyperparameter values that reach better accuracy with tanh
activations? It should be possible to go below 2% error rate with a vanilla MLP on MNIST.
I assumed you intended to have additional unlabelled data, but perhaps
working out the best way to incorporate the unlabelled data into the
fitting procedure (particularly if you support partial_fit) might be a big
question of its own. So I'm +1 for delaying that decision :)
On 27 May 2014 19:43, Olivier Grisel notifications@github.com wrote:

> In benchmarks/bench_mnist.py:
>
> Benchmark multi-layer perceptron, Extra-Trees, and linear SVM with kernel
> approximation (RBFSampler and Nystroem) on the MNIST dataset. The dataset
> comprises 70,000 samples and 784 features. Here, we consider the task of
> predicting 10 classes - digits from 0 to 9. The experiment was run on a
> desktop computer with an Intel Core i7 3.6 GHz CPU, running 64-bit Windows 7.
Glad you found the source of the problem, it's great to have unit tests that check the correctness of the gradient!
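Since gradient correctness came up: a minimal, hypothetical illustration of the kind of finite-difference check such unit tests rely on (the loss and gradient below are made up for the example; `scipy.optimize.check_grad` is a real helper):

```python
import numpy as np
from scipy.optimize import check_grad

# Hypothetical example: squared loss f(w) = 0.5 * ||Xw - y||^2 with analytic
# gradient X^T (Xw - y); check_grad returns the norm of the difference between
# this gradient and a finite-difference approximation.
rng = np.random.RandomState(0)
X, y = rng.randn(20, 5), rng.randn(20)

def loss(w):
    return 0.5 * np.sum((X.dot(w) - y) ** 2)

def grad(w):
    return X.T.dot(X.dot(w) - y)

err = check_grad(loss, grad, rng.randn(5))
assert err < 1e-4, err  # a large value would indicate a gradient bug
```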
Hi guys, I made some major changes.
Your feedback will be greatly appreciated. Thank you! :)
@IssamLaradji great work! I will try to review in more detail soon. Maybe @jaberg and @kastnerkyle might be interested in reviewing this as well. Can you please fix the remaining expit-related failure under Python 3 with recent numpy / scipy? https://travis-ci.org/scikit-learn/scikit-learn/jobs/27179454#L5790
> Since hidden layers in MLP make the loss function non-convex - which contains more than one local minimum - random weights' initialization could impact the predictive accuracy of a trained model.
I would rather say: "meaning that different random initializations of the weights can lead to trained models with varying validation accuracy".
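To illustrate that point with the estimator as it eventually merged (`MLPClassifier` and its `random_state` parameter are today's names, not necessarily this PR's API):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Different random initializations of the weights (different seeds) typically
# lead to trained models with slightly different validation accuracy.
for seed in range(3):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=100,
                        random_state=seed).fit(X_train, y_train)
    print(seed, clf.score(X_val, y_val))
```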
Thanks for the feedback @ogrisel. I improved the documentation further, making it more didactic - especially in the mathematical formulation section. Thanks.
This looks pretty cool so far - I will run some trials on it and try to understand the py3 issues.

Things that would be nice, though maybe not strictly necessary for a first-cut PR:

- A constructor arg for a custom loss function instead of a fixed one (maybe it is against the API). Thinking of things like cross-entropy, hinge loss à la Charlie Tang, etc. instead of standard softmax or what have you. It would be nice to have a few default ones available by strings, with the ability to create a custom one if needed.
- I like @ogrisel's suggestion for layer_coefs_. It would be useful to run experiments with KMeans networks and also pretraining with autoencoders instead of RBMs. This also opens the door for side packages that can take in weights from other nets (looking at Overfeat, Decaf, Caffe, pylearn2, etc.) and load them into sklearn. This is more a personal interest of mine, but it is nice to see the building blocks there. It is also plausible that very deep nets can be used in feedforward mode on the CPU, even if we can't train them in sklearn directly.

Questions: I also like the support for other optimizers - it would be sweet to get a hessian-free optimizer into scipy and use it in this general setup. That could make deep-ish NN work somewhat accessible without a GPU, though CG is what (I believe) Hinton used for the original DBM/pretraining paper.
@IssamLaradji indeed it would be interesting to run a bench of lbfgs vs cg and maybe other optimizers from …
We might want to make it possible to use any optimizer from scipy.optimize if the API is homogeneous across all optimizers (I have not checked).
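A hypothetical sketch of what such a generic hook into `scipy.optimize.minimize` could look like (`fit_with_scipy`, `loss_grad` and `w0` are illustrative names, not the PR's actual API):

```python
from scipy.optimize import minimize

def fit_with_scipy(loss_grad, w0, method="L-BFGS-B", max_iter=200):
    """Minimize a loss given a callable returning (loss, flat gradient)."""
    result = minimize(loss_grad, w0, jac=True, method=method,
                      options={"maxiter": max_iter})
    return result.x  # optimized flat weight vector

# e.g. fit_with_scipy(mlp_loss_grad, initial_weights, method="CG"), where
# mlp_loss_grad packs/unpacks the layer weights into a single flat vector.
```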
@IssamLaradji about the expit pickling issue, it looks like a bug in numpy. I am working on a fix.
I submitted a bugfix upstream: numpy/numpy#4800. If the fix is accepted we might want to backport it in …
@IssamLaradji actually, can you please try to add the ufunc fix to …? Try to add something like:

```python
import pickle

# import added here so the snippet is self-contained
from scipy.special import expit

try:
    pickle.loads(pickle.dumps(expit))
except AttributeError:
    # monkeypatch numpy to backport a fix for:
    # https://github.com/numpy/numpy/pull/4800
    import numpy.core

    def _ufunc_reconstruct(module, name):
        mod = __import__(module, fromlist=[name])
        return getattr(mod, name)

    numpy.core._ufunc_reconstruct = _ufunc_reconstruct
```
Hi @kastnerkyle and @ogrisel, thanks for the reply.

For the layer sizes, they can differ arbitrarily - for example, 1024-512-256-128-64-28 - but, as Hinton said, nothing justifies any particular set of layer sizes since it depends on the problem instance. Anyhow, this framework can support any set of layer sizes, even if they are larger than the number of features.

Anyhow, L-BFGS is now a state-of-the-art optimizer. I tested it against CG, and L-BFGS always performed better and faster than CG for several datasets (most other optimizers were unsuitable and did not come close to CG and L-BFGS as far as speed and accuracy are concerned, but the scipy method also supports custom optimizers, which is very useful). This claim is also supported by Adam Coates and Andrew Ng here: http://cs.stanford.edu/people/ang/?portfolio=on-optimization-methods-for-deep-learning But I did read that CG can perform better and faster for special kinds of datasets. So I am all for adding the generic scipy optimizer if it wasn't for the minimum version issue. What do you think?

For the ufunc fix, did you mean …? Thank you.
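Regarding the arbitrary layer sizes mentioned above: with the estimator as it merged, such a stack is just a tuple (`hidden_layer_sizes` is the current scikit-learn parameter name, which may differ from this PR's):

```python
from sklearn.neural_network import MLPClassifier

# Any number of hidden layers, of any sizes (even larger than n_features).
clf = MLPClassifier(hidden_layer_sizes=(1024, 512, 256, 128, 64, 28))
```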
```python
elif 'l-bfgs':
    self._backprop_lbfgs(
        X, y, n_samples)
```
Please put method calls on one line when they fit in 80 columns:

```python
self._backprop_lbfgs(X, y, n_samples)
```
About the optimizers, thanks for the reference comparing lbfgs and CG. We could add support for arbitrary scipy optimizer and raise a …
It would be great to add squared_hinge and hinge loss functions. But in another PR. I would also consider pre-training and sparse penalties for autoencoders for separate PRs.
The gradients are fine, I think. I forgot to shuffle MNIST :-/ Now it looks good.
@IssamLaradji That was the place I meant. Sorry, I don't understand your explanation. Would the behavior of the code change if you discarded the return value of …?
@amueller oh I thought you meant something else. It wouldn't change the behavior. I could discard the … Also, +1 for setting …
Training time for the …
Strange, I ran it again now and I got … which is half the original training time. Are you training using …? My machine is equipped with 8 GB RAM and an Intel® Core™ i7-2630QM processor (6M cache, 2.00 GHz).
In my PR against @amueller's branch, with the enhanced "constant" learning rate and momentum, SGD seems to be faster than LBFGS, although I have not plotted the "validation score vs epoch" curve as we have no way to do so at the moment.
I ran exactly the same code, so lbfgs. I think we should definitely do SGD as it should be much faster on MNIST.
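For a rough comparison of the two approaches with the estimator as it merged (the `solver` parameter and `fetch_openml` loader below are today's scikit-learn API, not this PR's; timings depend heavily on the machine):

```python
from time import time
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]

# Standard MNIST split: first 60,000 samples for training, last 10,000 for test.
X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]

for solver in ("lbfgs", "sgd"):
    clf = MLPClassifier(hidden_layer_sizes=(100,), solver=solver,
                        max_iter=20, random_state=0)
    tic = time()
    clf.fit(X_train, y_train)
    print(solver, "train time: %.1fs" % (time() - tic),
          "error rate: %.4f" % (1.0 - clf.score(X_test, y_test)))
```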
👍
Excellent work to all, and an exciting feature to be added to sklearn! I have been looking forward to this functionality for a while - does it appear likely, or has momentum dissipated?
It will definitely be merged, and soon.
@amueller - That's fantastic news, I'm very much looking forward to it. Great work as always!
Using this a bit at the moment. Looks nice. Some notes: …
@naught101 It is a long name... maybe we should use MLP. Can you check if the shape is still wrong in #3939?
@amueller: Yes, the shape is still wrong.
Huh, wonder why the common tests don't complain.
Thanks for checking.
Merged via #5214
Wow!! That's fantastic!! :) :) Great work, team!
Thank you to everyone who worked on this. It will be really useful.
Yes, thank you very much! I've been waiting for this for a long time.
(And sorry that I never ended up making good on my offer to help.)
Yes, aren't sprints amazing from the outside? Dormant threads are suddenly … Thank you to all the sprinters from those of us on the outside, it's been a …
@jnothman indeed! It's a great surprise to see it merged, as I felt this would stay dormant for a much longer time. Thanks a lot for your great reviews and effort, team!!
Currently I am implementing `layers_coef_` to allow for any number of hidden layers.

This pull request is to implement the generic multi-layer perceptron as part of the GSoC 2014 proposal. The expected time to finish this pull request is June 15.

The goal is to extend the multi-layer perceptron to support more than one hidden layer, to support having a pre-training phase (initializing weights through Restricted Boltzmann Machines) and a fine-tuning phase, and to write its documentation.

This directly follows from this pull request: #2120

TODO:

- Replace attributes that are only needed inside `_fit` by local variables and pass them as arguments to private helper methods, to make the code more readable and reduce pickled model size by not storing stuff that is not necessary at prediction time.
- Refactor the `_fit` method to call into submethods for different algorithms.
- Use `self.t_` to store SGD learning rate progress and decouple it from `self.n_iter_`, which should consistently track epochs.
- Raise a `ConvergenceWarning` whenever `max_iter` is reached when calling `fit` (see the sketch below).
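On that last TODO item, this is how the warning surfaces with the merged estimator (a sketch assuming today's names, `sklearn.exceptions.ConvergenceWarning` and `MLPClassifier`):

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Deliberately too few iterations, so max_iter is reached before convergence.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    MLPClassifier(max_iter=5, random_state=0).fit(X, y)

print(any(issubclass(w.category, ConvergenceWarning) for w in caught))  # True
```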