# [MRG] Generic multi layer perceptron #3204

Closed · wants to merge 41 commits

## Conversation

Contributor

### IssamLaradji commented May 27, 2014

This pull request implements the generic Multi-layer Perceptron as part of the GSoC 2014 proposal; the expected time to finish it is June 15. The goal is to extend the Multi-layer Perceptron to support more than one hidden layer (I am currently implementing `layers_coef_` to allow for any number of hidden layers), to support a pre-training phase (initializing weights through Restricted Boltzmann Machines) and a fine-tuning phase, and to write its documentation. This directly follows from pull request #2120.

TODO:

- replace private attributes initialized in `_fit` by local variables and pass them as arguments to private helper methods, to make the code more readable and to reduce pickled model size by not storing things that are unnecessary at prediction time;
- refactor the `_fit` method to call into submethods for the different algorithms;
- introduce `self.t_` to store the SGD learning-rate progress and decouple it from `self.n_iter_`, which should consistently track epochs;
- issue a `ConvergenceWarning` whenever `max_iter` is reached when calling `fit`.


Owner

### larsmans commented May 27, 2014

 What's the todo list for this one?
Contributor

### IssamLaradji commented May 27, 2014

Hi @larsmans, the todo list is:

- it should support more than one hidden layer, so there would be one generic layer list `layers_coef_`;
- it should support weight initialization using trained Restricted Boltzmann Machines, like the one proposed by Hinton et al. (2006): http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf
Owner

### ogrisel commented May 27, 2014

For the weight init, I would just use a `warm_start=True` constructor param and let the user set the `layers_coef_` and `layers_intercept_` attributes manually, as done for other existing models such as SGDClassifier for instance.
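For illustration, the warm-start idea can be sketched in plain NumPy. `StubMLP` below is a hypothetical stand-in, not the PR's estimator; only the `warm_start`, `layers_coef_`, and `layers_intercept_` names come from the discussion above:

```python
import numpy as np

class StubMLP(object):
    """Hypothetical stand-in for the proposed MLP class, shown only to
    illustrate the warm-start attributes discussed in this thread."""
    def __init__(self, warm_start=False):
        self.warm_start = warm_start
        self.layers_coef_ = None
        self.layers_intercept_ = None

# layer sizes: 64 input features, one 32-unit hidden layer, 10 outputs
sizes = [64, 32, 10]
mlp = StubMLP(warm_start=True)

# weights obtained elsewhere (e.g. extracted from a trained RBM stack)
mlp.layers_coef_ = [np.zeros((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
mlp.layers_intercept_ = [np.zeros(n) for n in sizes[1:]]

# a later fit(X, y) with warm_start=True would start from these weights
print([w.shape for w in mlp.layers_coef_])  # [(64, 32), (32, 10)]
```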
Owner

### jnothman commented May 27, 2014

Out of curiosity, does RBM initialisation mean that fit may be provided with some unlabelled samples?
Contributor

### IssamLaradji commented May 27, 2014

@ogrisel should we include another parameter - like unsupervised_weight_init_ - that runs an RBM (or any unsupervised learning algorithm) to initialize the layer weights? I believe `warm_start` starts training with the previously trained weights but does not necessarily use an unsupervised learning algorithm for weight initialization. @jnothman yes, an RBM trains on the unlabeled samples and its new, trained weights become the initial weights of the corresponding layer in the multi-layer perceptron.
Owner

### larsmans commented May 27, 2014

 I think we can leave the RBM init to a separate PR.
Contributor

### IssamLaradji commented May 27, 2014

 @larsmans sure thing :) For the travis build, I believe the error is coming from OrthogonalMatchingPursuitCV, given in line 5442
Owner

### ogrisel commented May 27, 2014

+1 for leaving the RBM init to a separate PR. Also, no need to couple the two models: just extract the weights from a pipeline of RBMs and manually stick them in as `layers_coef_` of an MLP with `warm_start=True`, then call fit with the labels for fine tuning.

> For the travis build, I believe the error is coming from OrthogonalMatchingPursuitCV, given in line 5442

Not only: the other builds have failed because the doc tests don't pass either, as I told you earlier in the previous PR.

### ogrisel and 2 others commented on an outdated diff May 27, 2014

benchmarks/bench_mnist.py
```diff
+=======================
+
+Benchmark multi-layer perceptron, Extra-Trees, linear svm
+with kernel approximation of RBFSampler and Nystroem
+on the MNIST dataset. The dataset comprises 70,000 samples
+and 784 features. Here, we consider the task of predicting
+10 classes - digits from 0 to 9. The experiment was run in
+a computer with a Desktop Intel Core i7, 3.6 GHZ CPU,
+operating the Windows 7 64-bit version.
+
+    Classification performance:
+    ===========================
+    Classifier             train-time  test-time  error-rate
+    ------------------------------------------------------
+    nystroem_approx_svm      124.819s     0.811s      0.0242
+    MultilayerPerceptron     359.460s     0.217s      0.0271
```

#### ogrisel May 27, 2014

Owner

Isn't it possible to find hyperparameter values to reach better accuracy with tanh activations? It should be possible to go below 2% error rate with a vanilla MLP on MNIST.

#### jnothman May 27, 2014

Owner

I assumed you intended to have additional unlabelled data, but perhaps
working out the best way to incorporate the unlabelled data into the
fitting procedure (particularly if you support partial_fit) might be a big
question of its own. So I'm +1 for delaying that decision :)


Contributor

#### IssamLaradji May 27, 2014

@ogrisel I just got the error rate down to 0.017 :)
(fixed an issue with the tanh derivative - it didn't pass the gradient test until now)

@jnothman indeed, better to make RBM pipelining a separate PR

#### ogrisel May 27, 2014

Owner

Glad you found the source of the problem, it's great to have unit tests that check the correctness of the gradient!

Contributor

### IssamLaradji commented Jun 9, 2014

Hi guys, I made some major changes.

- The algorithm now supports more than one hidden layer, by simply putting a list of values in the `n_hidden` parameter. For example, for 3 hidden layers where the first and second layers have 100 neurons and the 3rd has 50 neurons, the list would be `n_hidden = [100, 100, 50]`.
- I improved the speed of the implementation by more than 25% by removing a redundant loop.
- I improved the documentation by making it more comprehensive.

Your feedback will be greatly appreciated. Thank you! :)
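As a quick sketch (not the PR's code), the weight-matrix shapes implied by `n_hidden = [100, 100, 50]` on MNIST-sized input can be computed by chaining the layer sizes:

```python
n_features, n_outputs = 784, 10   # MNIST-sized input, 10 digit classes
n_hidden = [100, 100, 50]

# each weight matrix connects consecutive layer sizes
layer_sizes = [n_features] + n_hidden + [n_outputs]
shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))
print(shapes)  # [(784, 100), (100, 100), (100, 50), (50, 10)]
```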

### coveralls commented Jun 10, 2014

 Coverage increased (+0.16%) when pulling 2e8dc56 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### ogrisel commented Jun 10, 2014

 @IssamLaradji great work! I will try to review in more details soon. Maybe @jaberg and @kastnerkyle might be interested in reviewing this as well. Can you please fix the remaining expit related failure under Python 3 w/ recent numpy / scipy? https://travis-ci.org/scikit-learn/scikit-learn/jobs/27179454#L5790

### ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
```diff
+:math:`i+1`. :math:`layers_intercept_` is a list of the bias vectors, where the vector
+at index :math:`i` represents the bias values that are added to layer :math:`i+1`.
+
+The advantages of Multi-layer Perceptron are:
+
+    + Capability to learn complex/non-linear models.
+
+    + Capability to learn models in real-time (on-line learning)
+      using partial_fit.
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+    + Since hidden layers in MLP make the loss function non-convex
+      - which contains more than one local minimum - random weights'
+      initialization could impact the predictive accuracy of a trained model.
```

#### ogrisel Jun 10, 2014

Owner

I would rather say: "meaning that different random initializations of the weights can lead to trained models with varying validation accuracy".

### ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
```diff
+The advantages of Multi-layer Perceptron are:
+
+    + Capability to learn complex/non-linear models.
+
+    + Capability to learn models in real-time (on-line learning)
+      using partial_fit.
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+    + Since hidden layers in MLP make the loss function non-convex
+      - which contains more than one local minimum - random weights'
+      initialization could impact the predictive accuracy of a trained model.
+
+    + MLP suffers from the Backpropagation diffusion problem; layers far from
+      the output update with decreasing momentum, leading to slow convergence.
```

#### ogrisel Jun 10, 2014

Owner

I would add: with squashing activation functions such as the logistic sigmoid and the tanh function. If we implement linear bottlenecks (the identity function) and ReLU later, this problem might no longer hold.
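The diffusion (vanishing-gradient) effect mentioned here can be seen from the derivative bounds alone. A rough numeric sketch (not from the PR): the logistic derivative peaks at 0.25, so an upper bound on the gradient scale surviving many squashing layers shrinks geometrically, whereas ReLU's derivative is exactly 1 on its active region:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)

sig = 1.0 / (1.0 + np.exp(-x))
d_sig = sig * (1.0 - sig)          # logistic derivative, peaks at 0.25
d_tanh = 1.0 - np.tanh(x) ** 2     # tanh derivative, peaks at 1.0
d_relu = (x > 0).astype(float)     # ReLU derivative: exactly 1 when active

# upper bound on the gradient scale surviving 10 layers of each activation
print(d_sig.max() ** 10)   # ~1e-6: gradients far from the output vanish
print(d_tanh.max() ** 10)  # 1.0, but only attained at exactly x == 0
print(d_relu.max() ** 10)  # 1.0 on the whole active region
```

This is only a per-unit upper bound (the weights also enter the product), but it shows why deep stacks of squashing units update slowly far from the output.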

### ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
```diff
+    MultilayerPerceptronClassifier(activation='tanh', algorithm='l-bfgs',
+           alpha=1e-05, batch_size=200, eta0=0.5,
+           learning_rate='constant', max_iter=200, n_hidden=[5, 2],
+           power_t=0.25, random_state=None, shuffle=False, tol=1e-05,
+           verbose=False, warm_start=False)
+
+After fitting (training), the model can predict labels for new samples::
+
+    >>> clf.predict([[2., 2.], [-1., -2.]])
+    array([1, 0])
+
+MLP can fit a non-linear model to the training data. The members
+clf.layers_coef_ containing the weight matrices constitute the model
+parameters::
+
+    >>> clf.layers_coef_
```

#### ogrisel Jun 10, 2014

Owner

I would just display:

   >>> [coef.shape for coef in clf.layers_coef_]


Contributor

### IssamLaradji commented Jun 11, 2014

 Thanks for the feedback @ogrisel. I improved the documentation more, making it more didactic - especially in the mathematical formulation section. For the expit related failure under Python 3, I am not sure how to fix the problem since I am using the expit version given in scikit-learn. Isn't the problem within sklearn.utils.fixes? Thanks.
Owner

### kastnerkyle commented Jun 11, 2014

This looks pretty cool so far - I will run some trials on it and try to understand the py3 issues.

Things that would be nice, though maybe not strictly necessary for a first-cut PR:

- A constructor arg for a custom loss function instead of a fixed one (maybe it is against the API). Thinking of things like cross-entropy, hinge loss à la Charlie Tang, etc. instead of standard softmax or what have you. It would be nice to have a few default ones available by strings, with the ability to create a custom one if needed.
- I like @ogrisel's suggestion for layer_coefs_. It would be useful to run experiments with KMeans networks and also pretraining with autoencoders instead of RBMs. This also opens the door for side packages that can take in weights from other nets (looking at Overfeat, Decaf, Caffe, pylearn2, etc.) and load them into sklearn. This is more a personal interest of mine, but it is nice to see the building blocks there. It is also plausible that very deep nets can be used in feedforward mode on the CPU, even if we can't train them in sklearn directly.

Questions:

- I see you have worked on deep autoencoders before - will this framework support that as well? In other words, can layer sizes be different but complementary? Or are they expected to be a "block" (uniform in size)?
- I also like the support for other optimizers - it would be sweet to get a hessian-free optimizer into scipy and use it in this general setup. It could make deep-ish NN work somewhat accessible without a GPU, though CG is what (I believe) Hinton used for the original DBM/pretraining paper.
Owner

### ogrisel commented Jun 11, 2014

 @IssamLaradji indeed it would be interesting to run a bench of lbfgs vs cg and maybe other optimizers from scipy.optimize, maybe on (a subset of) mnist for instance.
Owner

### ogrisel commented Jun 11, 2014

 We might want to make it possible to use any optimizer from scipy.optimize if the API is homogeneous across all optimizers (I have not checked).
Owner

### ogrisel commented Jun 11, 2014

 @IssamLaradji about the expit pickling issue, it looks like a bug in numpy. I am working on a fix.
Owner

### ogrisel commented Jun 11, 2014

 I submitted a bugfix upstream: numpy/numpy#4800 . If the fix is accepted we might want to backport it in sklearn.utils.fixes.
Owner

### ogrisel commented Jun 11, 2014

@IssamLaradji actually can you please try to add the ufunc fix to sklearn.utils.exists now to check that it works for us? Try to add something like:

```python
import pickle

try:
    pickle.loads(pickle.dumps(expit))
except AttributeError:
    # monkeypatch numpy to backport a fix for:
    # https://github.com/numpy/numpy/pull/4800
    import numpy.core

    def _ufunc_reconstruct(module, name):
        mod = __import__(module, fromlist=[name])
        return getattr(mod, name)

    numpy.core._ufunc_reconstruct = _ufunc_reconstruct
```
Contributor

### IssamLaradji commented Jun 11, 2014

Hi @kastnerkyle and @ogrisel, thanks for the reply.

- Custom loss function: I could add a parameter to the constructor that accepts strings for selecting the loss function. (In fact, I had done that in my older implementation, but was told to remove it since there weren't enough loss functions.)
- Pre-training: I could add a pipeline with a placeholder that selects a pre-trainer for the weights. Although I was told to keep that for the next PR, I don't see the harm in adding an additional constructor parameter and a small method containing the pre-trainer for a quick test :).
- Deep auto-encoder: yes, a sparse autoencoder is a simple adaptation of the feedforward network - I simply need to inject a sparsity parameter into the loss function and its derivatives. As for the layer sizes, they can differ in any way - for example, 1024-512-256-128-64-28 - but, as Hinton said, nothing justifies any particular set of layer sizes since it depends on the problem instance. Anyhow, this framework can support any set of layer sizes, even ones larger than the number of features.
- Selecting scipy optimizers: my older implementation of the vanilla MLP supported all scipy optimizers using the generic scipy minimize method, but there was one problem: it required users to have SciPy 0.13+, while scikit-learn requires SciPy >= 0.7. If we could raise the scipy version requirement, I could easily have this support all scipy optimizers. Anyhow, L-BFGS is now a state-of-the-art optimizer; I tested it against CG, and L-BFGS always performed better and faster than CG on several datasets (most other optimizers were unsuitable and did not come close to CG and L-BFGS as far as speed and accuracy are concerned, but the scipy method also supports custom optimizers, which is very useful). This claim is also supported by Adam Coates and Andrew Ng here: http://cs.stanford.edu/people/ang/?portfolio=on-optimization-methods-for-deep-learning. But I did read that CG can perform better and faster for special kinds of datasets. So I am all for adding the generic scipy optimizer if it weren't for the minimum version issue. What do you think?

For the ufunc fix, did you mean sklearn.utils.fixes? Because my sklearn version doesn't have sklearn.utils.exists :(. I added the fix to sklearn.utils.fixes and pushed the code to see if it resolves the expit problem. Thank you.
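For reference, the generic `scipy.optimize.minimize` interface under discussion (available since SciPy 0.11) takes the method name as a string, which is what would let users swap optimizers. A hypothetical benchmark sketch on a least-squares toy objective (standing in for the MLP cost; none of this is the PR's code):

```python
import numpy as np
from scipy.optimize import minimize  # generic interface, SciPy 0.11+

rng = np.random.RandomState(0)
A = rng.rand(50, 10)
b = rng.rand(50)

def loss(w):
    # simple least-squares objective standing in for the MLP cost
    r = A.dot(w) - b
    return 0.5 * r.dot(r)

def grad(w):
    return A.T.dot(A.dot(w) - b)

# swapping optimizers is just a string change with the generic API
for method in ("L-BFGS-B", "CG"):
    res = minimize(loss, np.zeros(10), jac=grad, method=method)
    print(method, round(res.fun, 6), res.nit)
```

On older SciPy one is limited to the per-optimizer functions such as `fmin_l_bfgs_b`, which is why the minimum-version question matters here.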

### coveralls commented Jun 11, 2014

 Coverage increased (+0.16%) when pulling 1d4911b on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

### ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+            for batch_slice in batch_slices:
+                cost = self._backprop_sgd(
+                    X[batch_slice], y[batch_slice],
+                    batch_size)
+
+            if self.verbose:
+                print("Iteration %d, cost = %.2f"
+                      % (i, cost))
+            if abs(cost - prev_cost) < self.tol:
+                break
+            prev_cost = cost
+            self.t_ += 1
+
+        elif 'l-bfgs':
+            self._backprop_lbfgs(
+                X, y, n_samples)
```

#### ogrisel Jun 11, 2014

Owner

Please put method calls on one line when they fit in 80 columns:

             self._backprop_lbfgs(X, y, n_samples)

### ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+        if self.algorithm == 'sgd':
+            prev_cost = np.inf
+
+            for i in range(self.max_iter):
+                for batch_slice in batch_slices:
+                    cost = self._backprop_sgd(
+                        X[batch_slice], y[batch_slice],
+                        batch_size)
+
+                if self.verbose:
+                    print("Iteration %d, cost = %.2f"
+                          % (i, cost))
+                if abs(cost - prev_cost) < self.tol:
+                    break
+                prev_cost = cost
+                self.t_ += 1
```

#### ogrisel Jun 11, 2014

Owner

I think this attribute would be better named n_iter_. Also, it might be interesting to report the number of batch iterations in the same n_iter_ attribute when the _backprop_lbfgs method is used, for consistency. This is reported under the 'nit' key in the information dictionary of fmin_l_bfgs_b.

n_iter_ would thus be the number of epochs for SGD and the number of batch iterations for the LBFGS optimizer.

#### ogrisel Jun 11, 2014

Owner

Note that reporting the final value of the objective function as another fitted attribute might also be interesting.

### ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+
+        n_samples, self.n_features = X.shape
+        self._validate_params()
+
+        if self.layers_coef_ is None:
+            self._init_param()
+            self._init_fit()
+
+        if self.t_ is None or self.eta_ is None:
+            self._init_t_eta_()
+
+        self._preallocate_memory(n_samples, X)
+
+        cost = self._backprop_sgd(X, y, n_samples)
+        if self.verbose:
+            print("Iteration %d, cost = %.2f" % (self.t_, cost))
```

#### ogrisel Jun 11, 2014

Owner

I would rather use cost = %f so as not to constrain the precision of the cost report. Or use a larger value like %0.8f for instance.

Owner

### ogrisel commented Jun 11, 2014

About the optimizers, thanks for the reference comparing L-BFGS and CG. We could add support for an arbitrary scipy optimizer and raise a RuntimeException if the version of scipy is too low (with an informative error message), while still using fmin_l_bfgs_b directly by default so that we keep backward compat for old versions of scipy.
Owner

### ogrisel commented Jun 11, 2014

 It would be great to add squared_hinge and hinge loss functions. But in another PR. I would also consider pre-training and sparse penalties for autoencoders for separate PRs.
Owner

### larsmans commented Jun 11, 2014

 Indeed. Let's get the basic thing merged first. Is this PR in MRG phase?
Contributor

### IssamLaradji commented Jun 11, 2014

Hi, thanks for the comments. @ogrisel fitted attributes is a great idea. I added a section under the classifier and regressor class documentations explaining these fitted attributes:

1. layers_coef_: the ith element in the list represents the weight matrix corresponding to layer i.
2. layers_intercept_: the ith element in the list represents the bias vector corresponding to layer i + 1.
3. cost_: the current cost value computed by the loss function.
4. n_iter_: the current number of iterations the algorithm has run.
5. eta_: the current learning rate.

So if a user prints mlp.cost_ after training, he'd get the minimum cost achieved by either sgd or l-bfgs.

@larsmans, I will set it as MRG; I think it is in its final phase - the scope is completed in my opinion. Things like generic optimizers, more loss functions, and pretraining can be done in the next PRs if that is okay. Thank you.

Owner

### kastnerkyle commented Jun 11, 2014

 That sounds awesome to me - looking forward to playing with this! Great job.
Contributor

### IssamLaradji commented Jun 11, 2014

 Thank you for the compliment @kastnerkyle.
Contributor

### IssamLaradji commented Jun 11, 2014

Oops, sounds like Travis does not have the SciPy version supporting d['nit'] for counting iterations. I will increment the iterations manually then.
Contributor

### IssamLaradji commented Jun 11, 2014

 Fixed :)

### coveralls commented Jun 11, 2014

 Coverage increased (+0.16%) when pulling de407c2 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### ogrisel commented Jun 12, 2014

 @IssamLaradji can you please try to find parameters (eta0, learning_rate and so on) that make the model converge with SGD algorithm on MNIST? I have tried with various constant learning rate but was not very successful. Maybe grid searching on 10% of the data would work.
Contributor

### IssamLaradji commented Jun 12, 2014

@ogrisel, on my side, SGD converged with eta0=0.01, learning_rate='constant', n_hidden=100, and max_iter=400. In the verbose output you can see the cost decreasing steadily. However, with a large eta0, the cost oscillates and never converges. The problem with the invscaling learning rate is that eta gets stuck around 0.1, while a good eta is 0.01 for the MNIST dataset.
Contributor

### IssamLaradji commented Jun 12, 2014

There is a power_t parameter that you could increase so that, with learning_rate='invscaling', eta decreases at a faster rate, therefore guaranteeing convergence. Thanks.
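Assuming the schedule follows SGDClassifier's invscaling convention, eta(t) = eta0 / t ** power_t (an assumption; the PR's exact formula is not quoted here), the effect of power_t is easy to see numerically:

```python
eta0 = 0.5

# eta(t) = eta0 / t ** power_t : a larger power_t decays the rate faster
for power_t in (0.25, 0.5, 1.0):
    print(power_t, [round(eta0 / t ** power_t, 4) for t in (1, 100, 1000)])
```

With power_t=0.25, eta is still about 0.089 after 1000 steps (the "stuck around 0.1" observed above), while power_t=0.5 already brings it down to about 0.016.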

### coveralls commented Jun 12, 2014

 Coverage increased (+0.16%) when pulling 07376cb on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### glouppe commented Jun 12, 2014

Regarding pre-training, I really wouldn't make that a priority. This heuristic has gone out of favor for some time now. What works best on common benchmark tasks is the good old backpropagation algorithm trained on labeled data only. The trick is in using appropriate activation functions (e.g., rectified linear units) and averaging strategies like dropout.
Owner

### jnothman commented on an outdated diff Jun 12, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):
+        """Pack the coefficient and intercept parameters into a single vector.
+        """
+        all_params_ = layers_coef_ + layers_intercept_
+
+        return np.hstack([l.ravel() for l in all_params_])
+
+    def _unpack(self, packed_parameters):
+        """Extract the coefficients and intercepts from packed_parameters."""
+        for i in range(self.n_layers - 1):
+            s, e, shape = self.parameter_subsets[i]
+            self.layers_coef_[i] = np.reshape(packed_parameters[s:e], (shape))
```

#### jnothman Jun 12, 2014

Owner

`(shape)` is identical to `shape`
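The pack/unpack round-trip under review can be sketched standalone; the function names mirror the PR's `_pack`/`_unpack`, but the shape bookkeeping below is mine, for illustration only:

```python
import numpy as np

def pack(coefs, intercepts):
    """Flatten weight matrices and bias vectors into one parameter vector,
    as needed by flat-vector optimizers such as L-BFGS."""
    return np.hstack([a.ravel() for a in coefs + intercepts])

def unpack(packed, shapes):
    """Inverse of pack: slice the flat vector back into the given shapes."""
    arrays, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(packed[start:start + size].reshape(shape))
        start += size
    return arrays

coefs = [np.arange(6.0).reshape(2, 3), np.arange(3.0).reshape(3, 1)]
intercepts = [np.ones(3), np.ones(1)]

flat = pack(coefs, intercepts)
restored = unpack(flat, [(2, 3), (3, 1), (3,), (1,)])

assert flat.shape == (13,)  # 6 + 3 + 3 + 1 parameters
assert all((a == b).all() for a, b in zip(coefs + intercepts, restored))
```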

Contributor

### IssamLaradji commented Jun 13, 2014

I agree with @kastnerkyle, pre-training is still a hot, useful research area, as many recent papers contain work along those lines - like [1] [2]. I also like the idea of sharing trained weights in a friendly format like scikit-learn's. It makes sense that pre-training is useful, as many samples on the internet are unlabeled; besides, Andrew Ng made a great achievement with pre-training: http://www.wired.com/2013/05/neuro-artificial-intelligence. But I also agree with @glouppe: the reason pre-training came to be discouraged is its very long training time, for a performance improvement that might not be worth it. Much of the relevant literature uses several GPUs for pre-training, which are unavailable to the common user. That's why more approaches are using dropout and sophisticated activation functions for a more convenient training time. Having said that, pre-training is part of the GSoC proposal, so I am forced to include it, unless we all agree to implement something else instead :-). Thank you!
Owner

### jnothman commented Jun 13, 2014

The API for pretraining might be a challenge. What might be worthwhile in the meantime is providing a way / documenting the ability to provide one's own initial weights (externally pre-trained).
Owner

### ogrisel commented Jun 13, 2014

> @ogrisel, on my side, SGD converged with eta0=0.01, learning_rate=constant, n_hidden = 100, and max_iter = 400. In the verbose output you could see the cost decreasing constantly. However, with large eta0, the cost oscillates and it would never converge. The problem with the invscaling learning rate is that eta gets stuck around 0.1, while a good eta is 0.01 for the MNIST dataset.

I think it would be useful to detect if the cost is increasing (e.g. 3 times in a row) and raise a ConvergenceWarning with a message that explains that the learning rate is probably too high or the data not properly standardized, and maybe stop the algorithm. Speaking of which, I think we could change the MNIST benchmark to pipeline a StandardScaler as preprocessor. That might make the weight init work better and the model converge faster, no?
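The "3 times in a row" heuristic could look something like the sketch below. `check_divergence` is a hypothetical helper, not the PR's code, and it uses a plain UserWarning in place of scikit-learn's ConvergenceWarning to stay self-contained:

```python
import warnings

def check_divergence(costs, patience=3):
    """Hypothetical helper: flag a run whose cost rose `patience`
    times in a row, as a cue to warn and maybe stop the algorithm."""
    streak = 0
    for prev, cur in zip(costs, costs[1:]):
        streak = streak + 1 if cur > prev else 0
        if streak >= patience:
            warnings.warn("cost increased %d times in a row; the learning "
                          "rate is probably too high or the data is not "
                          "properly standardized" % streak)
            return True
    return False

print(check_divergence([2.1, 1.7, 1.4, 1.2]))   # False: converging
print(check_divergence([1.0, 1.3, 1.9, 2.8]))   # True: diverging
```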
Owner

### ogrisel commented Jun 13, 2014

> Having said that, pre-training is part of the GSoC proposal, so I am forced to include it, unless we all agree to implement something else instead :-).

We can still decide together to change the task list of the initial GSoC proposal. For pre-training we can just add a new example that first fits a pipeline of 3 RBMs manually on a dataset, then initializes the weights (layers_coef_ + layers_intercept_) of an MLP instance manually, and trains it with warm_start=True for the fine tuning. No need to write a specific API; this can all be done manually with the existing API. It's more a matter of documenting how to do it in an example referenced from the narrative documentation.
Owner

### ogrisel commented Jun 13, 2014

 I created a new issue #3275 to move the discussion about pre-training there.
Contributor

### IssamLaradji commented Jun 13, 2014

@jnothman sure, that's a good start. I am thinking of implementing the methods I suggested in #3275.

@ogrisel ConvergenceWarning is a great idea. I added this code in the sgd body:

```python
if self.cost_ > prev_cost:
    cost_increase_count += 1
    if cost_increase_count == 0.2 * self.max_iter:
        warnings.warn("Cost is increasing for more than 20%"
                      " of the iterations. Consider reducing"
                      " eta0 and preprocessing your data"
                      " with StandardScaler or MinMaxScaler.",
                      ConvergenceWarning)
```

Since the cost might increase arbitrarily, the ConvergenceWarning takes place only when the cost keeps increasing for more than 20% of the iterations. We could decrease that percentage if it is more appropriate. For MNIST, I realized that, dividing the data by 255, sgd's cost consistently decreases and then converges even with a learning rate as high as 0.5. Since the MNIST benchmark already divides the data by 255, thereby normalizing it, it seems StandardScaler would be redundant. What do you think? Thanks!
Owner

### ogrisel commented Jun 13, 2014

> For MNIST, I realized that, dividing the data by 255, sgd's cost consistently decreases and then converges even with a learning rate as high as 0.5. Since the MNIST benchmark already divides the data by 255, thereby normalizing it, it seems StandardScaler would be redundant. What do you think?

I might be mistaken, but I think I could make it diverge even with the division by 255. Also, a [0 - 1] range is not equivalent to centering + unit scaling. The latter is what is expected by the fan-in / fan-out init scheme of the weights, so StandardScaler might help the algorithm converge slightly faster.
Contributor

### IssamLaradji commented Jun 13, 2014

I just compared the MNIST benchmark between StandardScaler, normalization, and MinMaxScaler. You are right: with StandardScaler, the multi-layer perceptron took 1/4 of the time to train compared with normalization (division by 255), but it had an error rate of 0.028 instead of 0.0169. I couldn't get the score lower than that :( with StandardScaler. Also, StandardScaler made nystroem_approx_svm and fourier_approx_svm perform really poorly (high error rate) while taking a very long time to train. Here are their scores with StandardScaler:

```
Algorithm               Training time   Error rate
--------------------------------------------------
nystroem_approx_svm           1382.65       0.4866
fourier_approx_svm            1363.49       0.6055
```

MinMaxScaler, however, helped them perform better, as expected. The multi-layer perceptron took half the time to train with MinMaxScaler than with normalization. Do you think we should use MinMaxScaler instead?

Edit: never mind - MinMaxScaler works essentially like division by 255, since both fix values between 0 and 1. The decrease in training time was the result of a random factor. I will dig deeper into the issue of StandardScaler with nystroem_approx_svm and fourier_approx_svm. Thanks.
Owner

### AlexanderFabisch commented Jun 14, 2014

There are some pixels in the MNIST dataset that are never greater than 0, which should not affect the result, but there are also some pixels that are almost never greater than zero. To get a variance of 1 for these components, the StandardScaler will multiply these pixels with a large value.

```python
In [1]: from sklearn.datasets import fetch_mldata

In [2]: mnist = fetch_mldata('MNIST original')

In [3]: X = mnist.data

In [4]: X.sum(axis=0)
Out[4]:
array([      0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,     126,     470,
           216,       9,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,      16,      93,     793,
          1615,    3026,    4357,    8255,   11987,   13539,   13306,
         14440,   12792,   11907,   10116,    6947,    4776,    3421,
          1282,     605,     212,       0,       0,       0,       0,
             0,       0,      64,      42,     417,     766,    3941,
        ...
       8505361, 7569160, 6196835, 4644333, 3183420, 2002145, 1165280,
        626008,  319420,  150857,   57252,   13300,    1238,      72,
             0,      24,    4175,   28731,  118835,  346211,  828203,
       1655578, 2794999, 4096358, 5390273, 6429651, 6988499, 6971463,
       6395512, 5346977, 4071292, 2831605, 1791788, 1055597,  584113,
        ...
         17195,    5667,    1470,      58,      59,       0,       0,
             0,       0,       0,       0,     152,     935,    2520,
          5787,    8581,   13136,   21761,   27641,   34682,   39975,
         46865,   41270,   33546,   23352,   13819,    6968,    3264,
          1163,     907,     120,       0,       0,       0,       0],
      dtype=uint64)
```

I think these pixels distort the result completely (although I did not really test it). This is why you usually do not want to use a StandardScaler for this dataset - it is very sensitive to outliers. In almost all experiments with the MNIST dataset that I have seen, the values have been scaled to [0, 1] by dividing each pixel by 255.0. I think the MinMaxScaler will give a very similar result, but not the same, because the maximum value in the data might not be 255.0 for each pixel.
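The amplification effect described here can be reproduced with a synthetic "pixel" (a sketch, not the MNIST data itself): a feature that is almost always zero gets a tiny standard deviation, so standardizing it blows up its rare activations:

```python
import numpy as np

rng = np.random.RandomState(0)

# a border-ish MNIST pixel: almost always 0, with a handful of 255s
pixel = np.zeros(10000)
pixel[rng.choice(10000, size=5, replace=False)] = 255.0

scaled = pixel / 255.0                               # the usual [0, 1] scaling
standardized = (pixel - pixel.mean()) / pixel.std()  # StandardScaler-style

print(scaled.max())        # 1.0
print(standardized.max())  # ~45: the rare activations get blown up
```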
Contributor

### IssamLaradji commented Jun 15, 2014

 @AlexanderFabisch Thanks for taking the time to analyze this. Indeed, I got different (though very similar) mean values between applying MinMaxScaler and division by 255, suggesting they are not identical for this data. I guess I will keep the division by 255.0 as the preprocessing step, since it is the popular normalization method for the MNIST dataset. Thanks.

### coveralls commented Jun 15, 2014

 Coverage increased (+0.17%) when pulling 3655474 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.


Contributor

### IssamLaradji commented Jun 16, 2014

 I pushed a new pull-request #3281 where I uploaded an example file mlp_with_pretraining.py that demonstrates pre-training an mlp with an rbm. Thanks.
Owner

### ogrisel commented Jun 17, 2014

 I just had a second look at the code of this PR and I think the following things should be done:

- All attributes that are not public constructor parameters and that are mutated during a call to fit should either be public attributes with a trailing _ (to show that they are estimated from the data) and properly documented in the docstring, or made private by adding a leading _ to their name. In my opinion, all the attributes that are currently not documented should be made private to limit backward-compat issues later if we decide to refactor the internals of this estimator, except n_outputs, which could be renamed n_outputs_ as done in many other scikit-learn models.
- I think that fit should only preserve layers_coef_ and layers_intercept_ when warm_start=True. All the other fitted parameters (e.g. eta_, n_iter_ and so on) should be reinitialized from scratch when calling fit, as if fit were called for the first time. This entails refactoring _init_fit; I would rename it to _init_random_weights and only do the init of layers_coef_ and layers_intercept_ there.
- Do not set up the packed_parameter_meta attribute; instead, directly introspect the shape of the layers in the _unpack method. There is no need to precompute it, as _unpack is only called once per fit call.
Contributor

### IssamLaradji commented Jun 18, 2014

 Hi @ogrisel, The reason I used packed_parameter_meta to precompute layer shapes is that lbfgs calls the _unpack method at every iteration: at each step, lbfgs runs the _cost_grad() method, whose first line calls _unpack to extract the updated parameters. The problem is that the nature of lbfgs, and of most scipy optimizers, is to pack and unpack parameters at every iteration. Hence the precomputation of layer shapes. I can precompute the layer shapes in a method other than _init_random_weights for easier pre-training; it is only needed for lbfgs, not sgd. I named the new method _precompute_layer_shapes. Thanks for explaining the naming convention for public and private methods/attributes - I wasn't really aware of it :(. I pushed the updated file with the correct naming. Thank you. Update: Thanks to your comments, I made the pre-training code #3281 much cleaner. Now, mlp only requires warm_start=True as well as coefficient and intercept assignments from the RBM for pre-training.

### coveralls commented Jun 18, 2014

 Coverage increased (+0.17%) when pulling 43250a7 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### ogrisel commented Jun 18, 2014

> The reason I used packed_parameter_meta to precompute layer shapes is that, lbfgs calls the _unpack method for every iteration. At each time step, lbfgs runs the _cost_grad() method where the first line calls _unpack to extract the updated parameters.

Indeed. For some reason I missed that when looking for _unpack calls.

> I can precompute the layer shapes in a method other than _init_random_weights for easier pre-training - it is only needed for lbfgs not sgd. I called the new method as _precompute_layer_shapes.

Exactly what I would have suggested.

> Update: Thanks to your comments, I made the pre-training code #3281 much cleaner. Now, mlp only requires warm_start = True as well as coefficient and intercept assignments from RBM for pretraining.

This is precisely why I wanted to have warm_start=True only affect layers_coef_ and layers_intercept_ :)

### ogrisel and 1 other commented on an outdated diff Jun 18, 2014

sklearn/utils/fixes.py
 @@ -49,6 +49,18 @@ def expit(x, out=None):
         return out
 
+# added a code block that addresses the expit issue with python3
+try:
+    pickle.loads(pickle.dumps(expit))

#### ogrisel Jun 18, 2014

Owner

Unfortunately this is actually triggering a bug under Python 3.4 on my box, for instance when I run plot_rbm_logistic_classification.py, but I cannot really understand why. Can you try to reproduce it?

Traceback (most recent call last):
File "examples/plot_rbm_logistic_classification.py", line 39, in <module>
from sklearn import linear_model, datasets, metrics
File "/Users/ogrisel/code/scikit-learn/sklearn/linear_model/__init__.py", line 12, in <module>
from .base import LinearRegression
File "/Users/ogrisel/code/scikit-learn/sklearn/linear_model/base.py", line 28, in <module>
from ..utils import as_float_array, atleast2d_or_csr, safe_asarray
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/__init__.py", line 11, in <module>
from .validation import (as_float_array, check_arrays, safe_asarray,
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 17, in <module>
from .fixes import safe_copy
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/fixes.py", line 54, in <module>
File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/numpy/core/__init__.py", line 61, in _ufunc_reduce
return _ufunc_reconstruct, (whichmodule(func, name), name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/pickle.py", line 283, in whichmodule
for module_name, module in sys.modules.items():
RuntimeError: dictionary changed size during iteration


Contributor

@ogrisel , I got the exact same error with Python 3.4. I will look into this. Thanks

Contributor

On investigating this, I found that the error happens only when you run import matplotlib.pyplot as plt before other imports. If you place import matplotlib.pyplot as plt below other imports, the code will work just fine. This is strange, but I think there is a name conflict in the libraries being imported.

#### ogrisel Jun 23, 2014

Owner

It looks like the call to _ufunc_reconstruct can sometimes trigger a module import that changes the dict of imported modules, sys.modules. I think this can be considered a bug in the pickle module, which should be made robust to such changes by using something like:

for module_name, module in list(sys.modules.items()):
...

It would be great to try to write a pure-python non-regression test to submit with a fix to Python.
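The failure mode, and the list() fix suggested above, can be demonstrated in pure Python without involving pickle at all (the dict contents here are illustrative):

```python
# Mutating a dict while iterating over it raises RuntimeError in Python 3,
# which is what happens when whichmodule() walks sys.modules while an
# import triggered mid-iteration registers a new module.
d = {'a': 1, 'b': 2}
try:
    for key in d:
        d[key + '_copy'] = d[key]  # simulates the import side effect
    raised = False
except RuntimeError:
    raised = True
assert raised

# Iterating over a snapshot, as in `list(sys.modules.items())`, is robust:
d = {'a': 1, 'b': 2}
for key in list(d):
    d[key + '_copy'] = d[key]
assert 'a_copy' in d and 'b_copy' in d
```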

Contributor

Working on it, will let you know when it is fixed :)

Contributor

Hi @ogrisel, since pickle.loads(pickle.dumps(expit)) always raises an error, can we remove it and keep only the code below instead?
I found that calling pickle in .fixes is what causes the import conflicts. If we remove it, we will face no problems with the Travis pickle test, nor with module import conflicts. Thanks.

# added a code block that addresses the expit issue with python3
def _ufunc_reconstruct(module, name):
mod = __import__(module, fromlist=[name])
return getattr(mod, name)
np.core._ufunc_reconstruct = _ufunc_reconstruct

### arjoly commented on an outdated diff Jun 18, 2014

sklearn/neural_network/multilayer_perceptron.py
+                if self.cost_ > prev_cost:
+                    cost_increase_count += 1
+                    if cost_increase_count == 0.2 * self.max_iter:
+                        warnings.warn('Cost is increasing for more than 20%%'
+                                      ' of the iterations. Consider reducing'
+                                      ' eta0 and preprocessing your data'
+                                      ' with StandardScaler or MinMaxScaler.'
+                                      % self.cost_, ConvergenceWarning)
+
+                elif prev_cost - self.cost_ < self.tol:
+                    break
+
+                prev_cost = self.cost_
+                self.n_iter_ += 1
+
+            elif 'l-bfgs':

Owner

looks like a bug
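The flagged line is `elif 'l-bfgs':`. A non-empty string literal is always truthy in Python, so that branch is taken whenever the preceding conditions fail, regardless of which algorithm was actually selected. A minimal sketch (the variable names here are illustrative, not the PR's code):

```python
algorithm = 'sgd'

# The buggy form: a bare string literal as the condition is always true.
assert bool('l-bfgs') is True

# The presumably intended form compares against the selected algorithm:
if algorithm == 'sgd':
    branch = 'sgd'
elif algorithm == 'l-bfgs':
    branch = 'l-bfgs'
assert branch == 'sgd'
```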

### ogrisel and 1 other commented on an outdated diff Jun 18, 2014

doc/modules/neural_networks_supervised.rst
+
+    + Capability to learn models in real-time (on-line learning)
+      using partial_fit
+
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+    + MLP with hidden layers have a non-convex loss function where there exists
+      more than one local minimum. Therefore different random weight
+      initializations can lead to different validation accuracy.
+
+    + MLP suffers from the Backpropagation diffusion problem; layers far from
+      the output update with decreasing momentum, leading to slow convergence.
+      However, with squashing activation functions such as the logistic sigmoid
+      and the tanh function, implementing linear bottlenecks (the identity
+      function) and ReLU later might resolve this problem.

#### ogrisel Jun 18, 2014

Owner

I would not mention ReLU and linear bottlenecks in the doc as long as we have not implemented them.

Contributor

Fixed :)

### ogrisel commented on an outdated diff Jun 18, 2014

sklearn/neural_network/multilayer_perceptron.py
+        Returns
+        -------
+        self
+        """
+        X = atleast2d_or_csr(X)
+
+        self._validate_params()
+
+        n_samples, n_features = X.shape
+        self.n_outputs_ = y.shape[1]
+
+        self._init_eta_()
+        self._init_param(n_features)
+
+        if not self.warm_start or \
+                (self.warm_start and self.layers_coef_ is None):

#### ogrisel Jun 18, 2014

Owner

Please use additional ( and ) around the conditional expression rather than a trailing \.

### coveralls commented Jun 22, 2014

 Coverage increased (+0.17%) when pulling 61d9de8 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

### coveralls commented Jun 27, 2014

 Coverage increased (+0.21%) when pulling a925a14 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### mblondel commented Jun 27, 2014

 Is the plan to merge all your PRs at once at the end of the summer? I'd rather merge progressively what is ready into master.
Contributor

### IssamLaradji commented Jun 27, 2014

 Hi @mblondel, I would be glad to have this merged before continuing on other PRs :). This is the farthest I have gotten in a PR, though; so if you or someone else can guide me through the last remaining steps, that would be great :). PS: a strange Travis error for Python 2.6 is related to 'OrthogonalMatchingPursuitCV'. I wonder if this PR is causing it. Thanks.
Contributor

### IssamLaradji commented Jun 30, 2014

 Also, I feel like working on the ELM PR #3306 would be much easier if this PR got merged first, since ELM's documentation belongs in the same documentation as MLP's :). Please let me know what final steps I should take to complete this. Thanks. :)
Owner

### ogrisel commented Jun 30, 2014

> PS: a strange travis error for python 2.6 is related to 'OrthogonalMatchingPursuitCV'. I wonder if this PR is causing this.

This is unrelated. See #3190. Unfortunately it's apparently very hard to reproduce outside of the Travis environment, so nobody has come up with a fix yet.

### coveralls commented Jul 1, 2014

 Coverage increased (+0.21%) when pulling b1562d7 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Contributor

### IssamLaradji commented Jul 1, 2014

 Finally fixed the expit problem without having to import pickle in .fixes, which caused conflicts with other imports. @ogrisel indeed, the 'OrthogonalMatchingPursuitCV' failure seems like a very subtle error - I had a similar error in #3306 that I couldn't reproduce, because the PR passed all the unit tests on my local machine, unlike on Travis.
Owner

### ogrisel commented Jul 2, 2014

> Finally fixed the expit problem without having to import pickle in .fixes which caused conflicts with other imports.

But the pickle check was intentional: I don't want us to monkey-patch numpy when it's not necessary. As the fix has been included in numpy master, after the branching of numpy 1.9 you can test whether the (major, minor) version of numpy is lower than or equal to (1, 9) and only apply the _ufunc_reconstruct monkey-patch in that case.
Owner

### ogrisel commented Jul 2, 2014

 BTW I reported the python bug here: http://bugs.python.org/issue21905

### coveralls commented Jul 3, 2014

 Coverage increased (+0.17%) when pulling 5c13b7d on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Contributor

### IssamLaradji commented Jul 3, 2014

 Great! :) I added the numpy (major, minor) check to call _ufunc_reconstruct only when it is necessary. Thanks for following up on this.

### coveralls commented Jul 5, 2014

 Coverage increased (+0.17%) when pulling c338ccc on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

### coveralls commented Jul 12, 2014

 Coverage increased (+0.18%) when pulling 72ffdb5 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.


Owner

### ogrisel commented Jul 30, 2014

 The master branch has changed quite a bit during the past week, could you please rebase on top of master (and squash your commits) and fix any conflict? Note: the input validation helpers have changed, see: http://scikit-learn.org/dev/developers/utilities.html#validation-tools
Contributor

### IssamLaradji commented Jul 30, 2014

 Reworked the code using the same advice I got for ELM #3306 - it's much more readable and cleaner now (I hope) :). But Travis is hitting the same errors as it did for ELM's pull request, even after rebasing. However, Travis for ELM #3306 fixed itself after a while (strange) :). Maybe Travis here will fix itself too? Thanks.

### arjoly and 2 others commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+                             "to partial_fit.")
+        elif self.classes_ is not None and classes is not None:
+            if np.any(self.classes_ != np.unique(classes)):
+                raise ValueError("classes is not the same as on last call "
+                                 "to partial_fit.")
+        elif classes is not None:
+            self.classes_ = classes
+
+        if not hasattr(self, '_lbin'):
+            self._lbin = LabelBinarizer()
+            self._lbin._classes = classes
+
+        X, y = check_X_y(X, y, accept_sparse='csr')
+
+        # needs a better way to check multi-label instances
+        if isinstance(np.reshape(y, (-1, 1))[0][0], list):

#### arjoly Jul 31, 2014

Owner

The label binarizer will handle that for you. Check the y_type_ attribute.

Contributor

That's awesome - makes life much easier :). I noticed that LabelBinarizer
for multi-labeling will be deprecated by 0.17, since I get this warning:

DeprecationWarning: Direct support for sequence of sequences
multilabel representation will be unavailable from version 0.17.

It seems I have to use MultiLabelBinarizer for multi-label instances. Wouldn't that add unnecessary clutter to the code?

I believe that in this case I can't use y_type_ to identify the type, since I wouldn't know whether I should use MultiLabelBinarizer or LabelBinarizer until I check the instance type manually, right?
Thanks

#### jnothman Aug 2, 2014

Owner

No, you only need to use MultiLabelBinarizer if you have multilabel data represented as a list of lists of classes (or similar). This format is deprecated, and the MultiLabelBinarizer is provided as a utility for users because, while cumbersome to process, that format is arguably more human-readable than a label indicator matrix which will continue to be supported.

So in short, use label_binarize and your support for that old format, but not for label indicator matrices, will disappear together with the rest of the deprecation.
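The two representations being contrasted here can be sketched in plain Python; the helper name is hypothetical, and the real conversion is done by scikit-learn's binarizers:

```python
# Convert the deprecated sequence-of-sequences multilabel format into a
# label indicator matrix, the format that remains supported.
def sequences_to_indicator(y_sequences, classes):
    return [[1 if c in labels else 0 for c in classes]
            for labels in y_sequences]

y_seq = [['cat'], ['cat', 'dog'], []]   # deprecated list-of-lists format
classes = ['cat', 'dog']

y_indicator = sequences_to_indicator(y_seq, classes)
assert y_indicator == [[1, 0], [1, 1], [0, 0]]  # indicator matrix format
```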

Contributor

Thanks @jnothman. I read all about the issues with having lists of lists for multi-label instances - it makes sense why its support is being deprecated.

I was having problems with make_multilabel_classification until I set return_indicator to True, which returned a friendly format for LabelBinarizer that worked nicely. Thanks :)

#### jnothman Aug 2, 2014

Owner

make_multilabel_classification is yet to be fixed for this and for sparse support (@hamsal, is
it to be yours?)


### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        y : array-like, shape (n_samples, n_outputs)
+            Subset of the target values.
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y)
+
+        if y.ndim == 1:
+            y = np.reshape(y, (-1, 1))
+
+        super(MultilayerPerceptronRegressor, self).partial_fit(X, y)
+        return self
+
+    def predict(self, X):

#### arjoly Jul 31, 2014

Owner

To avoid sub-classing you can do as in the tree module https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L279

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Training data, where n_samples in the number of samples
+            and n_features is the number of features.
+
+        y : array-like, shape (n_samples, n_outputs)
+            Subset of the target values.
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y)
+
+        if y.ndim == 1:
+            y = np.reshape(y, (-1, 1))

#### arjoly Jul 31, 2014

Owner

You can add a private self._validate_X_y to get generic code for both regression and classification. This would reduce the amount of boilerplate code.

### arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Training data, where n_samples is the number of samples
+            and n_features is the number of features.
+
+        y : array-like, shape (n_samples, n_outputs)
+            Target values.
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y, multi_output=True)
+
+        if y.ndim == 1:
+            y = np.reshape(y, (-1, 1))

#### arjoly Jul 31, 2014

Owner

You can add a private self._validate_X_y to get generic code for both regression and classification. This would reduce the amount of boiler plate code.

#### arjoly Jul 31, 2014

Owner

You could probably have only one fit function in the base class :-)

Contributor

Yes, exactly :). Also, one partial_fit in the base class is sufficient.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        Whether to print progress messages to stdout.
+
+    warm_start : bool, optional, default False
+        When set to True, reuse the solution of the previous
+        call to fit as initialization, otherwise, just erase the
+        previous solution.
+
+    Attributes
+    ----------
+    classes_ : array or list of array of shape = [n_classes]
+        Class labels for each output.
+
+    cost_ : float
+        The current cost value computed by the loss function.
+
+    eta_ : float

#### arjoly Jul 31, 2014

Owner

eta => learning_rate_?
This would be consistent with the gradient boosting module.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        iterates until convergence (determined by 'tol') or
+        this number of iterations.
+
+    random_state : int or RandomState, optional, default None
+        State of or seed for random number generator.
+
+    shuffle : bool, optional, default False
+        Whether to shuffle samples in each iteration before extracting
+        minibatches.
+
+    tol : float, optional, default 1e-5
+        Tolerance for the optimization. When the loss at iteration i+1 differs
+        less than this amount from that at iteration i, convergence is
+        considered to be reached and the algorithm exits.
+
+    eta0 : double, optional, default 0.1

#### arjoly Jul 31, 2014

Owner

learning_rate_init?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        Parameters
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Data, where n_samples is the number of samples
+            and n_features is the number of features.
+
+        Returns
+        -------
+        y_prob : array-like, shape (n_samples, n_classes)
+            The predicted probability of the sample for each class in the
+            model, where classes are ordered as they are in
+            self.classes_.
+        """
+        scores = self.decision_function(X)
+
+        if len(scores.shape) == 1:

#### arjoly Jul 31, 2014

Owner

Why not use ndim?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y, accept_sparse='csr')
+
+        # needs a better way to check multi-label instances
+        if isinstance(np.reshape(y, (-1, 1))[0][0], list):
+            self._multi_label = True
+        else:
+            self._multi_label = False
+
+        self.classes_ = np.unique(y)
+        self._lbin = LabelBinarizer()
+        y = self._lbin.fit_transform(y)

#### arjoly Jul 31, 2014

Owner

I would factor those checks into a _validate_X_y?

#### arjoly Jul 31, 2014

Owner

Note that the y could be optional, so you would be able to re-use that function in the predict functions.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
 + """ + def __init__(self, n_hidden=[100], activation="tanh", + algorithm='l-bfgs', alpha=0.00001, + batch_size=200, learning_rate="constant", eta0=0.5, + power_t=0.5, max_iter=200, shuffle=False, + random_state=None, tol=1e-5, + verbose=False, warm_start=False): + sup = super(MultilayerPerceptronClassifier, self) + sup.__init__(n_hidden=n_hidden, activation=activation, + algorithm=algorithm, alpha=alpha, batch_size=batch_size, + learning_rate=learning_rate, eta0=eta0, power_t=power_t, + max_iter=max_iter, shuffle=shuffle, + random_state=random_state, tol=tol, + verbose=verbose, warm_start=warm_start) + + self.loss = 'log_loss'

#### arjoly Jul 31, 2014

Owner

Why not pass this as a parameter to the base class?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ACTIVATIONS = {'tanh': _inplace_tanh, 'logistic': _inplace_logistic_sigmoid}
+
+DERIVATIVE_FUNCTIONS = {'tanh': _d_tanh, 'logistic': _d_logistic}
+
+
+class BaseMultilayerPerceptron(six.with_metaclass(ABCMeta, BaseEstimator)):
+    """Base class for MLP classification and regression.
+
+    Warning: This class should not be used directly.
+    Use derived classes instead.
+    """
+
+    _loss_functions = {
+        'squared_loss': _squared_loss,
+        'log_loss': _log_loss,
+    }

#### arjoly Jul 31, 2014

Owner

I would make two constants: CLASSIFICATION_LOSS and REGRESSION_LOSS. Later, you can use
is_classifier(self) to know whether you are in a regression or classification task.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+def _d_logistic(Z):
+    """Compute the derivative of the logistic function."""
+    return Z * (1 - Z)
+
+
+def _d_tanh(Z):
+    """Compute the derivative of the hyperbolic tan function."""
+    return 1 - (Z ** 2)
+
+
+def _squared_loss(Y, Z):
+    """Compute the square loss for regression."""
+    return np.sum((Y - Z) ** 2) / (2 * len(Y))
+
+
+def _log_loss(Y, Z):

#### arjoly Jul 31, 2014

Owner

y => y_true / Y_true,
Z => y_proba / Y_proba?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+    X /= X.sum(axis=1)[:, np.newaxis]
+
+    return X
+
+
+def _d_logistic(Z):
+    """Compute the derivative of the logistic function."""
+    return Z * (1 - Z)
+
+
+def _d_tanh(Z):
+    """Compute the derivative of the hyperbolic tan function."""
+    return 1 - (Z ** 2)
+
+
+def _squared_loss(Y, Z):

Owner

Y => y_true
Z => y_pred?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self.power_t = power_t
+        self.max_iter = max_iter
+        self.n_hidden = n_hidden
+        self.shuffle = shuffle
+        self.random_state = random_state
+        self.tol = tol
+        self.verbose = verbose
+        self.warm_start = warm_start
+
+        self.layers_coef_ = None
+        self.layers_intercept_ = None
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):

#### arjoly Jul 31, 2014

Owner

I would extract this to make it a small function instead of a method.

#### arjoly Jul 31, 2014

Owner

_pack => _pack_network?

### arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self.tol = tol
+        self.verbose = verbose
+        self.warm_start = warm_start
+
+        self.layers_coef_ = None
+        self.layers_intercept_ = None
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):
+        """Pack the coefficient and intercept parameters into a single vector.
+        """
+        return np.hstack([l.ravel() for l in layers_coef_ + layers_intercept_])
+
+    def _unpack(self, packed_parameters):

#### arjoly Jul 31, 2014

Owner

What do you think of extracting this method and creating a function instead?
In the code, you would do something like

self.layer_coef_, self.layer_intercept_ = _unpack_network(parameters, n_layer)


(_unpack => _unpack_network?)

Contributor

I agree, having it outside the class results in better readability.
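A minimal sketch of what the extracted helpers might look like as free functions, assuming numpy and using the _pack_network/_unpack_network names suggested above (the exact signatures are hypothetical):

```python
import numpy as np

def _pack_network(layers_coef, layers_intercept):
    # Concatenate all coefficient and intercept arrays into one flat vector.
    return np.hstack([a.ravel() for a in layers_coef + layers_intercept])

def _unpack_network(packed, layer_units):
    # Recover per-layer coefficient matrices and intercept vectors from the
    # flat vector, deriving each slice and shape from the layer sizes.
    coefs, intercepts = [], []
    offset = 0
    for n_in, n_out in zip(layer_units[:-1], layer_units[1:]):
        coefs.append(packed[offset:offset + n_in * n_out].reshape(n_in, n_out))
        offset += n_in * n_out
    for n_out in layer_units[1:]:
        intercepts.append(packed[offset:offset + n_out])
        offset += n_out
    return coefs, intercepts

# Round-trip check on a tiny 3-2-1 network:
layer_units = [3, 2, 1]
coefs = [np.arange(6.0).reshape(3, 2), np.arange(2.0).reshape(2, 1)]
intercepts = [np.zeros(2), np.zeros(1)]
packed = _pack_network(coefs, intercepts)
new_coefs, new_intercepts = _unpack_network(packed, layer_units)
assert all(np.array_equal(a, b) for a, b in zip(coefs, new_coefs))
```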

### arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self.layers_coef_ = None
+        self.layers_intercept_ = None
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):
+        """Pack the coefficient and intercept parameters into a single vector.
+        """
+        return np.hstack([l.ravel() for l in layers_coef_ + layers_intercept_])
+
+    def _unpack(self, packed_parameters):
+        """Extract the coefficients and intercepts from packed_parameters."""
+        for i in range(self._n_layers - 1):
+            s, e, shape = self._packed_parameter_meta[i]
+            self.layers_coef_[i] = np.reshape(packed_parameters[s:e], (shape))

#### arjoly Jul 31, 2014

Owner

Dumb question, why not storing the coefficient in a sparse matrix format?

Contributor

Good question :P, the coefficient matrix doesn't usually have zeros - though it mostly has small values.

#### arjoly Aug 2, 2014

Owner

The packed parameters and shapes artificially look like a sparse CSC matrix with only indptr and data. Would it make sense to have only a flat array for the coefficients and an indptr array? This might avoid the need to pack and unpack parameters.

Contributor

Was working on this and found it to be a really cool idea that could avoid the packing and unpacking of the parameters. :)
But what if there is a zero element in one of the parameters? Then .data will not return all the elements, since it ignores the zeros. I guess .reshape(-1,) would work.
Thanks

Contributor

Hmmm, but still: the optimizer expects concatenated, flattened coefficient parameters, yet the forward pass expects coefficients of different shapes in the form (n_samples, n_features). It seems unpacking or using reshape in every iteration is inevitable :(.

#### arjoly Aug 4, 2014

Owner

Nevertheless, with the flat array you will avoid many allocations, because you will work with memory views. Here is a small IPython example

In [1]: import numpy as np

In [2]: a = np.arange(4)

In [3]: b = np.arange(3)

In [4]: c = np.hstack([a, b])

In [5]: c.flags
Out[5]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

In [6]: d = c[:4]

In [7]: e = c[4:]

In [8]: d.flags
Out[8]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

In [9]: f = d.reshape((2, 2))

In [10]: f.flags
Out[10]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

In [11]: a.flags
Out[11]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False


Arrays a and b own their data, and there is a copy involved when creating c.
While the flat array c owns the data, taking a slice (array d or e) and reshaping a slice (array f) of c doesn't create any copy.

I have played a bit with the code and the packed_meta_parameters is not easy to use.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        if self.learning_rate in ("constant", "invscaling"):
+            if self.eta0 <= 0.0:
+                raise ValueError("eta0 must be > 0")
+
+        # raise ValueError if not registered
+        if self.activation not in ACTIVATIONS:
+            raise ValueError("The activation %s"
+                             " is not supported. " % self.activation)
+        if self.learning_rate not in ["constant", "invscaling"]:
+            raise ValueError("learning rate %s "
+                             " is not supported. " % self.learning_rate)
+        if self.algorithm not in ["sgd", "l-bfgs"]:
+            raise ValueError("The algorithm %s"
+                             " is not supported. " % self.algorithm)
+
+    def _scaled_weight_init(self, fan_in, fan_out):

#### arjoly Jul 31, 2014

Owner

I would inline this function.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self._coef_grads = [0] * (self._n_layers - 1)
+        self._intercept_grads = [0] * (self._n_layers - 1)
+
+        # output for regression
+        if self.classes_ is None:
+            self._inplace_out_activation = _identity
+        # output for multi class
+        elif len(self.classes_) > 2 and self._multi_label is False:
+            self._inplace_out_activation = _inplace_softmax
+        # output for binary class and multi-label
+        else:
+            self._inplace_out_activation = _inplace_logistic_sigmoid
+
+    def _init_eta_(self):
+        """Initialize the learning rate eta0 for SGD"""
+        self.eta_ = self.eta0

#### arjoly Jul 31, 2014

Owner

I would inline this function.

### ogrisel commented on an outdated diff Aug 21, 2014

examples/neural_network/plot_mlp_alpha.py
+
+
+# Author: Issam H. Laradji
+# License: BSD 3 clause
+
+import numpy as np
+from matplotlib import pyplot as plt
+from matplotlib.colors import ListedColormap
+from sklearn.cross_validation import train_test_split
+from sklearn.preprocessing import StandardScaler
+from sklearn.datasets import make_moons, make_circles, make_classification
+from sklearn.neural_network import MultilayerPerceptronClassifier
+
+h = .02  # step size in the mesh
+
+alphas = np.arange(0, 2, 0.15)

#### ogrisel Aug 21, 2014

Owner

Can you please try a wider range of alpha values with fewer intermediate steps, as done in the ELM pull request?

The range alphas = np.logspace(-4, 4, 5) looks interesting, for instance.

### ogrisel commented on an outdated diff Aug 21, 2014

examples/neural_network/plot_mlp_alpha.py
+
+classifiers = []
+for i in alphas:
+    classifiers.append(MultilayerPerceptronClassifier(alpha=i, random_state=1))
+
+X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
+                           random_state=0, n_clusters_per_class=1)
+rng = np.random.RandomState(2)
+X += 2 * rng.uniform(size=X.shape)
+linearly_separable = (X, y)
+
+datasets = [make_moons(noise=0.3, random_state=0),
+            make_circles(noise=0.2, factor=0.5, random_state=1),
+            linearly_separable]
+
+figure = plt.figure(figsize=(27, 9))

#### ogrisel Aug 21, 2014

Owner

Please use a smaller horizontal size when the number of alphas is reduced to be consistent with the layout of the ELM pull request.

Owner

### ogrisel commented Aug 21, 2014

 If activation='relu' is always better or faster in your experiments I would make it the default activation function as nowadays nobody uses tanh anymore.

### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +
 +        # Output for regression
 +        if not isinstance(self, ClassifierMixin):
 +            self.out_activation_ = 'identity'
 +        # Output for multi class
 +        elif self.label_binarizer_.y_type_ == 'multiclass':
 +            self.out_activation_ = 'softmax'
 +        # Output for binary class and multi-label
 +        else:
 +            self.out_activation_ = 'logistic'
 +
 +        # Initialize coefficient and intercept layers
 +        self.layers_coef_ = []
 +        self.layers_intercept_ = []
 +
 +        for i in range(self.n_layers_ - 1):

#### arjoly Aug 22, 2014

Owner

Small suggestion

for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])


with fan_in / n_fan_in and fan_out/n_fan_out?

#### arjoly Aug 22, 2014

Owner

There are other places where you might want to do that.
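arjoly's suggestion can be sketched as follows; the layer sizes below are made up for illustration, and both spellings produce the same (fan_in, fan_out) pairs:

```python
# Hypothetical layer sizes: input, two hidden layers, output.
layer_units = [64, 100, 50, 10]

# Index-based loop, as currently written in the PR.
shapes_indexed = [(layer_units[i], layer_units[i + 1])
                  for i in range(len(layer_units) - 1)]

# zip-based loop, as suggested: each pair reads as (fan_in, fan_out).
shapes_zipped = [(fan_in, fan_out)
                 for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])]
```

The zip form avoids index arithmetic and names the two sizes at the point of use.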

Owner

### arjoly commented Aug 22, 2014

I would add comments for the attributes that are private:

- a_layers: the values held by each layer except for the output layer.
- deltas: constitutes a large part of the equation that computes the gradient, reflecting the amount of change required for updating the solutions in an iteration.
- layers_units: contains the number of neurons for each layer. It allows for having clean loops.
- coef_grad: the amount of change used to update the coefficient parameters in an iteration.
- intercept_grads: the amount of change used to update the intercept parameters in an iteration.
- layers_coef: the weights connecting layer i and i+1.
- layers_intercept_: the bias vector for layer i+1.
- n_hidden: the number of hidden layers, not counting the input layer nor the output layer.
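The shapes of these private structures can be sketched for a toy network; the sizes here are invented and the names mirror the PR's attributes without the underscore prefix:

```python
import numpy as np

batch_size = 8
layer_units = [4, 5, 3]   # input width, one hidden layer, output width

# one activation array per layer, including the input layer
a_layers = [np.empty((batch_size, n)) for n in layer_units]
# one delta per non-input layer
deltas = [np.empty((batch_size, n)) for n in layer_units[1:]]
# gradients for the weights connecting layer i and layer i + 1
coef_grads = [np.empty((fan_in, fan_out))
              for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])]
# gradients for the bias vector of layer i + 1
intercept_grads = [np.empty(fan_out) for fan_out in layer_units[1:]]
```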

### arjoly and 1 other commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +                                             weight_init_bound,
 +                                             fan_out))
 +
 +        if self.shuffle:
 +            X, y = shuffle(X, y, random_state=self.random_state)
 +
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))

#### arjoly Aug 22, 2014

Owner

You might want to do

self._a_layers.extend(np.empty((batch_size, n_fan_out))
for n_fan_out in layer_units[1:])


Contributor

+1

### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +
 +        if self.shuffle:
 +            X, y = shuffle(X, y, random_state=self.random_state)
 +
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))
 +        self._deltas = [np.empty((batch_size, layer_units[i + 1]))
 +                        for i in range(self.n_layers_ - 1)]

#### arjoly Aug 22, 2014

Owner

You might want to do

self._deltas = [np.empty_like(a_layer) for a_layer in self._a_layers[:-1]]


### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))
 +        self._deltas = [np.empty((batch_size, layer_units[i + 1]))
 +                        for i in range(self.n_layers_ - 1)]
 +        self._coef_grads = [np.empty((layer_units[i], layer_units[i + 1]))
 +                            for i in range(self.n_layers_ - 1)]
 +        self._intercept_grads = [np.empty(layer_units[i + 1])
 +                                 for i in range(self.n_layers_ - 1)]

#### arjoly Aug 22, 2014

Owner

You might want to do

self._intercept_grads = [np.empty(fan_out) for fan_out in layer_units[1:]]


### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +            X, y = shuffle(X, y, random_state=self.random_state)
 +
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))
 +        self._deltas = [np.empty((batch_size, layer_units[i + 1]))
 +                        for i in range(self.n_layers_ - 1)]
 +        self._coef_grads = [np.empty((layer_units[i], layer_units[i + 1]))
 +                            for i in range(self.n_layers_ - 1)]

#### arjoly Aug 22, 2014

Owner

You might want to do:

self._coef_grads = [np.empty((fan_in, fan_out)) for fan_in, fan_out
in zip(layer_units[:-1], layer_units[1:])]


### arjoly and 2 others commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        # Run the Stochastic Gradient Descent algorithm
 +        if self.algorithm == 'sgd':
 +            prev_cost = np.inf
 +            cost_increase_count = 0
 +
 +            for i in range(self.max_iter):
 +                for batch_slice in gen_batches(n_samples, batch_size):
 +                    self._a_layers[0] = X[batch_slice]
 +                    self.cost_ = self._backprop(X[batch_slice], y[batch_slice])
 +
 +                    # update weights
 +                    for i in range(self.n_layers_ - 1):
 +                        self.layers_coef_[i] -= (self.learning_rate_ *
 +                                                 self._coef_grads[i])
 +                        self.layers_intercept_[i] -= (self.learning_rate_ *
 +                                                      self._intercept_grads[i])

#### arjoly Aug 22, 2014

Owner

Those lines look like daxpy (blas level1)

#### arjoly Aug 22, 2014

Owner

(dumb question) Is there interest in having a varying number of neurons across the hidden layers?
If not, those lines could be written as a BLAS level 2 function.

Contributor

@arjoly, if we do write these lines as blas level 2 functions, should we then force the user not to change n_hidden?

Owner

Probably yes.
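The update arjoly identifies is an axpy (y = a*x + y, BLAS level 1); a minimal numpy sketch of the equivalent in-place operation, with made-up sizes and learning rate:

```python
import numpy as np

lr = 0.05
coef = np.ones((4, 3))            # stands in for one layer of layers_coef_
coef_grad = np.full((4, 3), 0.2)  # stands in for the matching gradient

# The PR's per-layer update is an in-place axpy with a = -lr.
coef -= lr * coef_grad
```

Stacking layers of equal width into one array is what would turn this loop of level-1 calls into a single level-2/3 operation.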

#### larsmans Sep 30, 2014

Owner

Shall we leave the code bumming for later and get this thing in a mergeable state?

 IssamLaradji  intermediate update  fb02230
Contributor

### IssamLaradji commented Aug 22, 2014

 If activation='relu' is always better or faster in your experiments

Great! I will post a benchmark showing the convergence speed and accuracy on different sizes of MNIST for relu against tanh.
 IssamLaradji  updates  425ba58
Contributor

### IssamLaradji commented Aug 23, 2014

 Finally got a sufficiently powerful computer to work on the comments :-). These are the benchmark results on the digits dataset using 3-fold cross-validation:

n_hidden=[50, 25, 10]
    Testing score for relu: 0.9488, time: 3.76
    Testing score for tanh: 0.9310, time: 6.05
n_hidden=[150, 100]
    Testing score for relu: 0.9711, time: 4.78
    Testing score for tanh: 0.9694, time: 3.43
n_hidden=[50, 25]
    Testing score for relu: 0.9627, time: 1.52
    Testing score for tanh: 0.9471, time: 2.64
n_hidden=[50, 100]
    Testing score for relu: 0.9677, time: 1.88
    Testing score for tanh: 0.9605, time: 2.93

@ogrisel, like you said, relu is not always faster than tanh, as tanh sometimes converges faster. But my experimental results showed that relu consistently achieves a higher score than tanh.
 IssamLaradji  doc update  e623fdd

### coveralls commented Aug 23, 2014

 Changes Unknown when pulling e623fdd on IssamLaradji:generic-multi-layer-perceptron into master on scikit-learn.
 IssamLaradji  updates  cdb4dc6

### coveralls commented Aug 23, 2014

 Changes Unknown when pulling cdb4dc6 on IssamLaradji:generic-multi-layer-perceptron into master on scikit-learn.

Owner

### ogrisel commented Aug 25, 2014

 But, my experimental results showed that relu consistently achieves higher score than tanh.

The optimal values for the other hyperparameters (in particular the regularization) are probably not the same for relu and tanh. Can you please try to run a small grid search for the optimal value of alpha when n_hidden=[150, 100]?

### ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
 +    >>> clf.predict([[2., 2.], [-1., -2.]])
 +    array([1, 0])
 +
 +MLP can fit a non-linear model to the training data. clf.layers_coef_
 +contains the weight matrices that constitute the model parameters::
 +
 +    >>> [coef.shape for coef in clf.layers_coef_]
 +    [(2, 5), (5, 2), (2, 1)]
 +
 +To get the raw values before applying the output activation function, run the
 +following command,
 +
 +use :meth:`MultilayerPerceptronClassifier.decision_function`::
 +
 +    >>> clf.decision_function([[2., 2.], [1., 2.]])
 +    array([ 11.55408143,  11.55408143])

#### ogrisel Aug 25, 2014

Owner

You should use the ellipsis feature of doctests to have this test pass on all the travis platforms:


>>> clf.decision_function([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
array([ 11.55..., 11.55...])


### ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
 +use :meth:`MultilayerPerceptronClassifier.decision_function`::
 +
 +    >>> clf.decision_function([[2., 2.], [1., 2.]])
 +    array([ 11.55408143,  11.55408143])
 +
 +Currently, :class:`MultilayerPerceptronClassifier` supports only the
 +Cross-Entropy loss function, which allows probability estimates by running the
 +predict_proba method.
 +
 +MLP trains using backpropagation. For classification, it minimizes the
 +Cross-Entropy loss function, giving a vector of probability estimates
 +:math:`P(y|x)` per sample :math:`x`::
 +
 +    >>> clf.predict_proba([[2., 2.], [1., 2.]])
 +    array([[  9.59670230e-06,   9.99990403e-01],
 +           [  9.59670230e-06,   9.99990403e-01]])

#### ogrisel Aug 25, 2014

Owner

Please use doctest ellipsis here again:

>>> clf.predict_proba([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
array([[ 9.59...e-06, 9.99...e-01],
[ 9.59...e-06, 9.99...e-01]])


### ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
 +a one hidden layer MLP.
 +
 +.. figure:: ../images/multilayerperceptron_network.png
 +   :align: center
 +   :scale: 60%
 +
 +   **Figure 1 : One hidden layer MLP.**
 +
 +The leftmost layer, known as the input layer, consists of a set of neurons
 +:math:`\{x_i | x_1, x_2, ..., x_m\}` representing the input features. Each hidden
 +layer transforms the values from the previous layer by a weighted linear summation
 +:math:`w_1x_1 + w_2x_2 + ... + w_mx_m`, followed by a non-linear activation function
 +:math:`g(\cdot):R \rightarrow R` - like the hyperbolic tan function. The output layer
 +receives the values from the last hidden layer and transforms them into output values.
 +
 +The module contains the public attributes :math:`layers_coef_` and :math:`layers_intecept_`.

#### ogrisel Aug 25, 2014

Owner

You should use double backticks quoting for layers_coef_ & layers_intercept_ (there is also a typo here).

 IssamLaradji  doc update  551a440 IssamLaradji  ellipsis  73a6c73
Contributor

### IssamLaradji commented Aug 26, 2014

 These are the results of the grid search on the digits dataset for alphas=np.logspace(-4, 4, 5) (the first line represents ReLU scores and the second line represents tanh scores).

For n_hidden=[150, 100]

relu : [mean: 0.97106, std: 0.00644, params: {'alpha': 0.0001},
        mean: 0.96717, std: 0.00479, params: {'alpha': 0.01},
        mean: 0.95771, std: 0.01032, params: {'alpha': 1.0},
        mean: 0.96383, std: 0.00416, params: {'alpha': 100.0},
        mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]
tanh : [mean: 0.97051, std: 0.00208, params: {'alpha': 0.0001},
        mean: 0.98386, std: 0.00416, params: {'alpha': 0.01},
        mean: 0.98331, std: 0.00361, params: {'alpha': 1.0},
        mean: 0.96049, std: 0.00343, params: {'alpha': 100.0},
        mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]

For n_hidden=[100, 50]

relu : [mean: 0.96439, std: 0.00208, params: {'alpha': 0.0001},
        mean: 0.97551, std: 0.00928, params: {'alpha': 0.01},
        mean: 0.98108, std: 0.00672, params: {'alpha': 1.0},
        mean: 0.96327, std: 0.00472, params: {'alpha': 100.0},
        mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]
tanh : [mean: 0.96605, std: 0.00479, params: {'alpha': 0.0001},
        mean: 0.98386, std: 0.00284, params: {'alpha': 0.01},
        mean: 0.98331, std: 0.00273, params: {'alpha': 1.0},
        mean: 0.96550, std: 0.00630, params: {'alpha': 100.0},
        mean: 0.10184, std: 0.00000, params: {'alpha': 10000.0}]

It doesn't seem like ReLU performs better on average. I think I should test it on bigger datasets with a larger number of layers. Perhaps that's where ReLU shines.

 IssamLaradji  skip  2248aac IssamLaradji  doc update  9c451dc

### coveralls commented Aug 26, 2014

 Coverage increased (+0.07%) when pulling 9c451dc on IssamLaradji:generic-multi-layer-perceptron into 4b82379 on scikit-learn:master.

 IssamLaradji  doc update  855e3e9 IssamLaradji  update  9ed8d1f

### coveralls commented Aug 26, 2014

 Coverage increased (+0.07%) when pulling 9ed8d1f on IssamLaradji:generic-multi-layer-perceptron into 4b82379 on scikit-learn:master.
 IssamLaradji  last update  c2ce21f

### coveralls commented Aug 26, 2014

 Coverage increased (+0.07%) when pulling c2ce21f on IssamLaradji:generic-multi-layer-perceptron into 4b82379 on scikit-learn:master.

### pasky commented Sep 4, 2014

 Hi! I'm sorry to chime in as an external party - I've been watching this PR eagerly for quite some time now, and am admittedly a little disappointed it hasn't been merged yet. I was just wondering if there's some concrete list of TODO items that must be wrapped up before this can be merged, mainly regarding the API, which will stay more or less set in stone at that point? Maybe I could help with some of the items if Issam Laradji is busy with other things... (I wonder if precise tuning of the activation function needs to be figured out before this is merged?)

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

benchmarks/bench_mnist.py
 +
 +
 +def load_data(dtype=np.float32, order='F'):
 +    # Load dataset
 +    print("Loading dataset...")
 +    data = fetch_mldata('MNIST original')
 +    X, y = data.data, data.target
 +    if order.lower() == 'f':
 +        X = np.asfortranarray(X)
 +
 +    # Normalize features
 +    X = X.astype('float64')
 +    X = X / 255
 +
 +    # Create train-test split (as [Joachims, 2006])
 +    logger.info("Creating train-test split...")

#### GaelVaroquaux Sep 4, 2014

Owner

We do not use the logger (there is a pull request on it, but it got stalled). You should simply use a print controlled by a 'verbose' argument.

### GaelVaroquaux commented on the diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
 +
 +def test_verbose_sgd():
 +    """Test verbose."""
 +    X = [[3, 2], [1, 6]]
 +    y = [1, 0]
 +    clf = MultilayerPerceptronClassifier(algorithm='sgd',
 +                                         max_iter=2,
 +                                         verbose=10,
 +                                         n_hidden=2)
 +    old_stdout = sys.stdout
 +    sys.stdout = output = StringIO()
 +
 +    clf.fit(X, y)
 +    clf.partial_fit(X, y)
 +
 +    sys.stdout = old_stdout

#### GaelVaroquaux Sep 4, 2014

Owner

This should be done in the 'finally' of a try/finally' block, so that even if there is an exception, the stdout gets restored.
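The pattern GaelVaroquaux suggests can be sketched as follows; the print call stands in for the verbose output of clf.fit(X, y):

```python
import sys
from io import StringIO

# Redirect stdout, and restore it in a `finally` block so that the
# redirection is undone even if the captured call raises an exception.
old_stdout = sys.stdout
sys.stdout = output = StringIO()
try:
    print("iteration 1, loss = 0.5")  # stand-in for clf.fit(X, y)
finally:
    sys.stdout = old_stdout
```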

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
 +                                        random_state=1,
 +                                        batch_size=X.shape[0])
 +    for i in range(150):
 +        mlp.partial_fit(X, y)
 +
 +    pred2 = mlp.predict(X)
 +    assert_almost_equal(pred1, pred2, decimal=2)
 +    score = mlp.score(X, y)
 +    assert_greater(score, 0.75)
 +
 +
 +def test_partial_fit_errors():
 +    """Test partial_fit error handling."""
 +    X = [[3, 2], [1, 6]]
 +    y = [1, 0]
 +    clf = MultilayerPerceptronClassifier

#### GaelVaroquaux Sep 4, 2014

Owner

You don't need to define the clf intermediate variable here. It mostly hinders readability.

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
 +from sklearn.neural_network import MultilayerPerceptronRegressor
 +from sklearn.preprocessing import LabelBinarizer
 +from sklearn.preprocessing import StandardScaler, MinMaxScaler
 +from scipy.sparse import csr_matrix
 +from sklearn.utils.testing import assert_raises, assert_greater, assert_equal
 +
 +
 +np.seterr(all='warn')
 +
 +LEARNING_RATE_TYPES = ["constant", "invscaling"]
 +
 +ACTIVATION_TYPES = ["logistic", "tanh", "relu"]
 +
 +digits_dataset_multi = load_digits(n_class=3)
 +
 +Xdigits_multi = MinMaxScaler().fit_transform(digits_dataset_multi.data[:200])

#### GaelVaroquaux Sep 4, 2014

Owner

There should be an underscore between words: X_digits_multi, and y_digits_multi.

### GaelVaroquaux and 1 other commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 +            The predicted probability of the sample for each class in the
 +            model, where classes are ordered as they are in self.classes_.
 +        """
 +        y_scores = self.decision_function(X)
 +
 +        if y_scores.ndim == 1:
 +            y_scores = logistic(y_scores)
 +            return np.vstack([1 - y_scores, y_scores]).T
 +        else:
 +            return softmax(y_scores)
 +
 +
 +class MultilayerPerceptronRegressor(BaseMultilayerPerceptron, RegressorMixin):
 +    """Multi-layer Perceptron regressor.
 +
 +    Under a loss function, the algorithm trains either by l-bfgs or gradient

#### GaelVaroquaux Sep 4, 2014

Owner

All these paragraphs (everything aside from the first sentence of the docstring) should be moved to a 'Notes' section at the end of the docstring.

#### ogrisel Oct 2, 2014

Owner

I think it's good to have a generic overview in one or two paragraphs here. The list of parameters and attributes is long and people will not necessarily think to scroll down to the end just to get the big picture on how this estimator works.

Owner

### GaelVaroquaux commented Sep 4, 2014

 Naive question: I note that the default activation function is the relu, and the default algorithm the LBFGS. I had in mind that LBFGS was very bad with the relu, because it is not smooth. Am I wrong?

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 + """Multi-layer Perceptron regressor. + + Under a loss function, the algorithm trains either by l-bfgs or gradient + descent. The training is iterative, in that at each time step the + partial derivatives of the loss function with respect to the model + parameters are computed to update the parameters. + + It has a regularizer as a penalty term added to the loss function that + shrinks model parameters towards zero. + + This implementation works with data represented as dense and sparse numpy + arrays of floating point values for the features. + + Parameters + ---------- + n_hidden : python list, length = n_layers - 2, default [100]

#### GaelVaroquaux Sep 4, 2014

Owner

Just write 'list', and not 'python list'.

Contributor

### IssamLaradji commented Sep 4, 2014

 Hi @pasky , The requirements are in fact complete, but there might be some minor required changes (like naming or code conventions) that would come out from final reviews by the mentors. I can't wait to get this merged as well :). I will address the comments as they come along, but you are welcome to help. Thanks.

### GaelVaroquaux and 1 other commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        X : {array-like, sparse matrix}, shape (n_samples, n_features)
 +            The input data.
 +
 +        y : array-like, shape (n_samples,)
 +            The target values.
 +
 +        Returns
 +        -------
 +        self : returns a trained MLP model.
 +        """
 +        if self.algorithm != 'sgd':
 +            raise ValueError("only SGD algorithm supports partial fit")
 +
 +        return self._fit(X, y, incremental=True)
 +
 +    def _decision_scores(self, X):

#### GaelVaroquaux Sep 4, 2014

Owner

Stupid question, but isn't this the 'decision_function', according to scikit-learn's definition?

In that case, it should be named accordingly.

Contributor

The difference is that decision_function returns a raveled form of y_pred for the classifier when
self.n_outputs=1. This is not the case for the regressor.

### GaelVaroquaux and 3 others commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 +
 +        # For the last layer
 +        if with_output_activation:
 +            out_activation = ACTIVATIONS[self.out_activation_]
 +            self._a_layers[i + 1] = out_activation(self._a_layers[i + 1])
 +
 +    def _compute_cost_grad(self, layer, n_samples):
 +        """Compute the cost gradient for the layer."""
 +        self._coef_grads[layer] = safe_sparse_dot(self._a_layers[layer].T,
 +                                                  self._deltas[layer])
 +        self._coef_grads[layer] += (self.alpha * self.layers_coef_[layer])
 +        self._coef_grads[layer] /= n_samples
 +
 +        self._intercept_grads[layer] = np.mean(self._deltas[layer], 0)
 +
 +    def _cost_grad_lbfgs(self, packed_coef_inter, X, y):

#### GaelVaroquaux Sep 4, 2014

Owner

As a very general remark on the design of this object/algorithm, I must say that I am a bit uneasy with all the internal states of the algorithm that are very hidden/implicit in the code. I have in mind _a_layers, _deltas, _coef_grads, and _intercept_grads. They make the code hard to follow, because it is not possible, by looking at a function/method call, to know what has been changed.

How feasible would it be to explicitly pass them around the code, rather than having them as attributes on the object? At least for a_layers, since these seem to me to have no intrinsic difference from X and y, at least in terms of code organization.

Owner

+1

#### ogrisel Sep 30, 2014

Owner

We could make those functions have explicit arguments but might still want to keep them as pre-allocated attributes on the model itself for incremental learning with partial_fit.

Contributor

@ogrisel agreed, I kept them mainly for partial_fit. Is there a way I could keep their pre-allocation advantages while improving readability? How about getters and setters?
Cheers.

#### ogrisel Oct 2, 2014

Owner

no actually getters and setters would even be worse. @GaelVaroquaux 's point would be to explicitly pass the preallocated arrays for activations, gradients and deltas as arguments to private methods such as _compute_* and _backprop to make it more explicit about which datastructures are involved just by reading the prototype of the methods.

Actually by reading the code of _fit again, you don't even reuse the preallocated arrays for _a_layers, _coef_grads, _intercept_grads and _deltas, even when incremental is True. There is really no point in making them attributes of the class, please convert them to local variables of the _fit method and pass them explicitly as argument to the private helper methods.

Also, self._a_layers should be renamed activations and self._deltas should better be renamed updates.

#### ogrisel Oct 3, 2014

Owner

... and self._deltas should better be renamed updates.

Actually scratch that, I was confused. self._deltas can be renamed deltas as it's just the difference between the current activations and the backpropagated error at that level.

Contributor

Done, removed the attributes. Indeed the code looks more readable now. :)

Thanks.

Owner

### ogrisel commented Sep 30, 2014

 Naive question: I note that the default activation function is the relu, and the default algorithm the LBFGS. I had in mind that LBFGS was very bad with the relu, because it is not smooth. Am I wrong?

I asked this question earlier and @IssamLaradji reported that LBFGS worked fine with ReLU despite the non-smooth kink at zero. This is rather counter-intuitive to me: #3204 (comment)

### ogrisel commented on an outdated diff Oct 2, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        Number of outputs.
 +
 +    out_activation_ : string
 +        Name of the output activation function.
 +
 +    References
 +    ----------
 +    Hinton, Geoffrey E.
 +    "Connectionist learning procedures." Artificial intelligence 40.1
 +    (1989): 185-234.
 +
 +    Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of
 +    training deep feedforward neural networks." International Conference
 +    on Artificial Intelligence and Statistics. 2010.
 +    """
 +    def __init__(self, n_hidden=[100], activation="relu",

#### ogrisel Oct 2, 2014

Owner

It's considered bad practice to use a mutable default value for kwargs. Either use an immutable tuple: n_hidden=(100,), or, better in this case, make n_hidden=100 work with literal integers (the number of hidden layers would then be assumed to be 1).

If we want to enforce n_hidden to be sequence of ints (list, tuple...), I would rather rename that parameter to hidden_layers_sizes=(100,) instead.
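The pitfall behind this advice can be sketched in a few lines; make_layers is a hypothetical function, not PR code:

```python
# A mutable default such as n_hidden=[100] is a single list object created
# once at function definition time and shared across all calls that rely
# on the default.
def make_layers(n_hidden=[100]):
    n_hidden.append(50)   # any in-place mutation leaks into later calls
    return list(n_hidden)

first = make_layers()
second = make_layers()
# The second call sees the first call's mutation.
```

An immutable tuple default, or a None sentinel replaced inside the function, avoids this sharing.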

Owner

### ogrisel commented Oct 2, 2014

 Actually, members of the deep learning for speech recognition community reported that softplus(x) = log(1 + exp(x)), which is a smooth version of relu, can work significantly better (generalization perf) on some problems. However, when I tried it in a grid search on a small subset (3000 samples) of MNIST, LBFGS seems to have no problem optimizing the non-smooth ReLU (fewer iterations than softplus and significantly faster iterations). And validation accuracy seems to be slightly better as well.

Here are the results of a grid search on 3000 digits:

[mean: 0.89889, std: 0.01571, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.89722, std: 0.00550, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 0.0001},
 mean: 0.89333, std: 0.01434, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 1e-08},
 mean: 0.89056, std: 0.01853, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 0.0001},
 mean: 0.88833, std: 0.01312, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.88722, std: 0.00685, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 1e-08},
 mean: 0.88556, std: 0.01530, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 1e-08},
 mean: 0.88333, std: 0.01650, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 0.0001},
 mean: 0.88278, std: 0.01235, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 9.9999999999999995e-07},
 mean: 0.88222, std: 0.01577, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 0.0001},
 mean: 0.88111, std: 0.00906, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 1e-08},
 mean: 0.88111, std: 0.00671, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 1e-08},
 mean: 0.88111, std: 0.00685, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 0.0001},
 mean: 0.87833, std: 0.01534, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 0.0001},
 mean: 0.87778, std: 0.01517, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.87667, std: 0.00624, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 9.9999999999999995e-07},
 mean: 0.87444, std: 0.01612, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 1e-08},
 mean: 0.87111, std: 0.01370, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 9.9999999999999995e-07}]

Here is my implementation of softplus:

diff --git a/sklearn/neural_network/base.py b/sklearn/neural_network/base.py
index 114bba5..e7790b3 100644
--- a/sklearn/neural_network/base.py
+++ b/sklearn/neural_network/base.py
@@ -71,8 +71,12 @@ def relu(X):
     X_new : {array-like, sparse matrix}, shape (n_samples, n_features)
         The transformed data.
     """
-    np.clip(X, 0, np.finfo(X.dtype).max, out=X)
-    return X
+    return np.clip(X, 0, np.finfo(X.dtype).max, out=X)
+
+
+def softplus(X):
+    # log(1 + exp(X))
+    return np.logaddexp(X, 0, out=X)
 
 
 def softmax(X):
@@ -96,7 +100,7 @@ def softmax(X):
 
 ACTIVATIONS = {'identity': identity, 'tanh': tanh, 'logistic': logistic,
-               'relu': relu, 'softmax': softmax}
+               'relu': relu, 'softmax': softmax, 'softplus': softplus}
 
 
 def logistic_derivative(Z):
@@ -148,7 +152,7 @@ def relu_derivative(Z):
 
 DERIVATIVES = {'tanh': tanh_derivative, 'logistic': logistic_derivative,
-               'relu': relu_derivative}
+               'relu': relu_derivative, 'softplus': logistic}
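A quick numeric check of the "smooth relu" claim, using the same np.logaddexp trick as the softplus implementation above:

```python
import numpy as np

# softplus(x) = log(1 + exp(x)) tracks max(x, 0) for large |x| while
# staying differentiable at zero.
x = np.linspace(-10, 10, 101)
softplus = np.logaddexp(x, 0)   # log(exp(x) + exp(0)), overflow-safe
relu = np.clip(x, 0, None)

# The gap log(1 + exp(-|x|)) is largest at x = 0, where it equals log(2).
gap = np.max(np.abs(softplus - relu))
```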

### ogrisel commented on an outdated diff Oct 3, 2014

sklearn/neural_network/multilayer_perceptron.py
 +            for batch_slice in gen_batches(n_samples, batch_size):
 +                self._a_layers[0] = X[batch_slice]
 +                self.cost_ = self._backprop(X[batch_slice], y[batch_slice])
 +
 +                # update weights
 +                for i in range(self.n_layers_ - 1):
 +                    self.layers_coef_[i] -= (self.learning_rate_ *
 +                                             self._coef_grads[i])
 +                    self.layers_intercept_[i] -= (self.learning_rate_ *
 +                                                  self._intercept_grads[i])
 +
 +            if self.learning_rate == 'invscaling':
 +                self.learning_rate_ = self.learning_rate_init / \
 +                    (self.n_iter_ + 1) ** self.power_t
 +
 +            self.n_iter_ += 1

#### ogrisel Oct 3, 2014

Owner

self.n_iter_ currently does not have the same meaning for LBFGS and SGD: this is a source of confusion.

We should instead introduce a new variable (for instance self.t_, to be consistent with SGDClassifier) for the learning rate schedule: the total number of samples that were used to train the model (irrespective of the fact that some samples might have been seen several times when training with several passes over a finite training set).

self.n_iter_ on the other hand should always reflect the number of "epochs", that is the number of passes over the full training set. This attribute should not be set when the user calls the partial_fit method, as we don't know the total size of the full training set in that case.
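The bookkeeping ogrisel describes can be sketched as follows; the names t_ and n_iter_ follow SGDClassifier, and the loop bounds are made up for illustration:

```python
n_samples, batch_size, max_iter = 100, 10, 5

t_ = 0        # total samples seen; drives the learning-rate schedule
n_iter_ = 0   # epochs, i.e. full passes over the training set

for epoch in range(max_iter):
    for start in range(0, n_samples, batch_size):
        batch = min(batch_size, n_samples - start)
        t_ += batch          # advances once per sample consumed
    n_iter_ += 1             # advances once per full pass
```

With partial_fit only t_ can be maintained, since the size of the full training set is unknown.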

Owner

### ogrisel commented Oct 3, 2014

 Please refactor the _fit method to call a submethod per optimizer. For instance, at the end of _fit:

fit_method = getattr(self, '_fit_' + self.algorithm, None)
if fit_method is None:
    raise ValueError('algorithm="%s" is not supported by %s'
                     % (self.algorithm, type(self).__name__))
fit_method(X, y, activations, deltas)

Maybe also pass to fit_method additional datastructures initialized in _fit that I might have missed. This should make it easier for the user to derive from MultilayerPerceptronClassifier or MultilayerPerceptronRegressor to implement more experimental optimizers (e.g. Adadelta for instance).
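A minimal runnable version of this dispatch pattern; the Estimator class here is a stand-in for illustration, not the actual PR code:

```python
class Estimator:
    def __init__(self, algorithm='sgd'):
        self.algorithm = algorithm

    def _fit_sgd(self):
        # stand-in for the real SGD optimization loop
        return 'fitted with sgd'

    def fit(self):
        # look up _fit_<algorithm> by name; None if no such method exists
        fit_method = getattr(self, '_fit_' + self.algorithm, None)
        if fit_method is None:
            raise ValueError('algorithm="%s" is not supported by %s'
                             % (self.algorithm, type(self).__name__))
        return fit_method()

result = Estimator('sgd').fit()
```

A subclass can then add support for a new optimizer just by defining another _fit_<name> method.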
Owner

### ogrisel commented Oct 3, 2014

 Also please raise a ConvergenceWarning when the tol-based convergence criterion is not met prior to reaching max_iter in the fit method (do not do that for partial_fit).
Owner

### ogrisel commented Oct 3, 2014

 To get examples of usage of ConvergenceWarning in scikit-learn, run git grep ConvergenceWarning.
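A self-contained sketch of the requested behavior; the ConvergenceWarning class below is a stand-in for scikit-learn's own (which git grep would surface), so the example runs on its own, and the dummy loss decay is invented:

```python
import warnings


class ConvergenceWarning(UserWarning):
    # stand-in for scikit-learn's ConvergenceWarning
    pass


def fit_loop(max_iter=3, tol=1e-4):
    loss, prev_loss = 1.0, None
    for n_iter in range(max_iter):
        loss *= 0.9                      # dummy optimization step
        if prev_loss is not None and prev_loss - loss < tol:
            return loss                  # tol-based criterion met
        prev_loss = loss
    # max_iter exhausted without meeting tol: warn instead of failing
    warnings.warn("Maximum iterations reached and the optimization "
                  "hasn't converged yet.", ConvergenceWarning)
    return loss


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fit_loop(max_iter=3)
```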
Owner

### ogrisel commented Oct 3, 2014

 I started to summarize the remaining work in a todo list at the top of the PR.
Owner

### ogrisel commented Oct 6, 2014

 The doctests need to be updated: https://travis-ci.org/scikit-learn/scikit-learn/jobs/37154106#L1569

 IssamLaradji  updates  bb74c95 IssamLaradji  better updates  c3c49fc IssamLaradji  doc update  e5e002e IssamLaradji  doc update  b5f6500 IssamLaradji  doc update  52e4253 IssamLaradji  improved readability.  e037c7f
Owner

### amueller commented Dec 4, 2014

 @IssamLaradji are you working on refactoring the _fit method? Otherwise I'd be happy to help.
Owner

### amueller commented Dec 4, 2014

 Btw, did you try to do the MNIST bench with SGD? That should be quite a bit faster. I didn't get it to work though :-/
Owner

### amueller commented Dec 4, 2014

 There is a refactoring here: https://github.com/amueller/scikit-learn/tree/pr/3204
Owner

### ogrisel commented Dec 5, 2014

 I could not make it work either. I suspect a bug in the SGD solver. Also, we should add Nesterov momentum to the SGD solver with momentum=0.9 by default, as otherwise there can be problems where SGD converges too slowly.
Contributor

### IssamLaradji commented Dec 5, 2014

 Hi @amueller , it would be great if you help! Thanks :) It seems 'SGD' is not running as expected, let me double check the gradients. +1 for momentum.
Owner

### amueller commented Dec 5, 2014

 I agree about momentum. I'll see if I can make it work and I'll submit a parallel PR.
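A sketch of a classical momentum update on a toy quadratic objective (the Nesterov variant discussed here differs in where the gradient is evaluated); all names, sizes, and the objective are illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
coef = rng.randn(5)                 # stands in for one flattened weight layer
velocity = np.zeros_like(coef)
learning_rate, momentum = 0.1, 0.9


def grad(w):
    # gradient of the toy objective ||w||^2
    return 2 * w


for _ in range(300):
    # accumulate a decaying average of past gradients, then step
    velocity = momentum * velocity - learning_rate * grad(coef)
    coef += velocity
```

The velocity term lets the update keep moving along directions with consistent gradient sign, which is what speeds up plain SGD.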
Owner

### amueller commented Dec 5, 2014

 I find it a bit confusing that coef_grads and intercept_grads are both modified in-place and returned by the functions operating on them. What is the reason for that?
Contributor

### IssamLaradji commented Dec 5, 2014

 @amueller, you mean this?

```python
# Compute gradient for the last layer
coef_grads, intercept_grads = self._compute_cost_grad(
    last, n_samples, activations, deltas, coef_grads, intercept_grads)
```

The reason is that I call self._compute_cost_grad twice: once for the output layer and once in the for loop over the hidden layers. I guess it would be more readable if I combined the output layer and the hidden layers in one for loop? That way I wouldn't need the _compute_cost_grad method.
Owner

### amueller commented Dec 5, 2014

 The gradients are fine, I think. I forgot to shuffle MNIST :-/ Now it looks good. Maybe we want to set shuffle=True by default? It is so cheap compared to the backprop.

Closed

Owner

### amueller commented Dec 5, 2014

 @IssamLaradji That was the place I meant. Sorry, I don't understand your explanation. Would the behavior of the code change if you discarded the return value of _compute_cost_grad?
Contributor

### IssamLaradji commented Dec 5, 2014

 @amueller oh, I thought you meant something else. It wouldn't change the behavior: I could discard the return value and the left-hand side of the assignment coef_grads, intercept_grads = self._compute_cost_grad(...), and the results would remain the same. Also, +1 for setting shuffle=True as the default.
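The point can be demonstrated with a toy in-place gradient update; the function below is only an illustration, not the PR's `_compute_cost_grad`:

```python
import numpy as np


def compute_grad(activations, deltas, coef_grads):
    # Writes into coef_grads in place; the returned object is the same array,
    # so callers may discard the return value without changing behavior.
    coef_grads[:] = activations.T @ deltas
    return coef_grads


activation = np.array([[1.0, 2.0]])
delta = np.array([[0.5]])
grads = np.zeros((2, 1))
result = compute_grad(activation, delta, grads)
```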

Open

Closed

Owner

### amueller commented Dec 17, 2014

 Training time for bench_mnist.py is twice as high on my box as what you reported, but only for the MLP; the others have comparable speed. Could you run it again with the current parameters and see if it is still the same for you? How many cores do you have?
Contributor

### IssamLaradji commented Dec 18, 2014

 Strange, I ran it again now and I got:

```
Classifier                train-time      test-time    error-rate
-----------------------------------------------------------------
MultilayerPerceptron      364.75999999    0.088        0.0178
```

which is half the original training time. Are you training using lbfgs or sgd? lbfgs tends to converge faster. My machine is equipped with 8 GB RAM and an Intel® Core™ i7-2630QM processor (6M cache, 2.00 GHz).
Owner

### ogrisel commented Dec 18, 2014

 > lbfgs tends to converge faster.

In my PR against @amueller's branch, with the enhanced "constant" learning rate and momentum, SGD seems to be faster than LBFGS, although I have not plotted the "validation score vs. epoch" curve, as we have no way to do so at the moment.
Owner

### amueller commented Dec 18, 2014

 I ran exactly the same code, so lbfgs. I think we should definitely do SGD, as it should be much faster on MNIST.

 👍
Contributor

### digital-dharma commented Apr 7, 2015

 Excellent work, all, and an exciting feature to be added to sklearn! I have been looking forward to this functionality for a while. Does merging still appear likely, or has momentum dissipated?
Owner

### amueller commented Apr 7, 2015

 It will definitely be merged, and soon.
Contributor

### digital-dharma commented Apr 7, 2015

 @amueller - That's fantastic news, I'm very much looking forward to it. Great work as always!

### naught101 commented May 1, 2015

 Using this a bit at the moment. Looks nice. Some notes:

- Currently, if y in MultilayerPerceptronRegressor.fit() is a vector (shape (n,)), .predict() returns a 2D array with shape (n, 1). Other regressors just return a vector in the same format as y.
- That's a really long class name. Could it be MLPRegressor instead, similar to SGDRegressor? That abbreviation is common enough, I think (it's on the Wikipedia disambiguation page, comes first in a Google search for 'MLP learning', and I don't think people will confuse it with My Little Pony).
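The shape mismatch in the first note can be reproduced and fixed with `numpy.ravel`; this is a sketch of the issue, not the estimator's actual code:

```python
import numpy as np

# A column-vector prediction, as the regressor returned at the time:
y_pred_2d = np.array([[0.1], [0.2], [0.3]])

# Flattening restores the (n,) shape that matches a 1-D training target:
y_pred = np.ravel(y_pred_2d)
```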
Owner

### amueller commented May 1, 2015

 @naught101 It is a long name... maybe we should use MLP. Can you check if the shape is still wrong in #3939?

### naught101 commented May 4, 2015

 @amueller: Yes, the shape is still wrong.
Owner

### amueller commented May 4, 2015

 Huh, I wonder why the common tests don't complain.
Owner

### amueller commented May 4, 2015

 Thanks for checking.

### amueller referenced this pull request May 19, 2015

Merged

#### [MRG + 2] FIX ransac output shape, add test for regressor output shapes #4739

 IssamLaradji  added learning rates  5dce1a0

Closed

Merged

Owner

### amueller commented Oct 23, 2015

 Merged via #5214

Contributor

### IssamLaradji commented Oct 23, 2015

 Waw!! That's fantastic!! :) :) Great work team!

### naught101 commented Oct 24, 2015

 Thank you to everyone who worked on this. It will be really useful.

### pasky commented Oct 24, 2015

 Yes, thank you very much! I've been waiting for this for a long time. (And sorry that I never ended up making good on my offer to help.)
Owner

### jnothman commented Oct 24, 2015

 > Waw!! That's fantastic!! :) :) Great work team!

Yes, aren't sprints amazing from the outside? Dormant threads are suddenly marked merged, that project you'd been trying to complete forever is now off your todo list, and you're ready to book a holiday... Thank you to all the sprinters from those of us on the outside; it's been a good one!

On 24 October 2015, Petr Baudis wrote:

> Yes, thank you very much! I've been waiting for this for a long time. (And sorry that I never ended up making good on my offer to help.)
Contributor

### IssamLaradji commented Oct 25, 2015

 @jnothman indeed! It's a great surprise to see it merged, as I felt this would stay dormant for a much longer time. Thanks a lot for your great reviews and effort, team!!