[MRG] Generic multi layer perceptron #3204

Closed
wants to merge 41 commits into
from

Conversation

Projects
None yet
Contributor

IssamLaradji commented May 27, 2014

Currently I am implementing layers_coef_ to allow for any number of hidden layers.

This pull request is to implement the generic Multi-layer perceptron as part of the GSoC 2014 proposal.

The expected time to finish this pull request is June 15

The goal is to extend the multi-layer perceptron to support more than one hidden layer, to support a pre-training phase (initializing weights through Restricted Boltzmann Machines) followed by a fine-tuning phase, and to write its documentation.

This directly follows from this pull-request: #2120

TODO:

  • replace private attributes initialized in _fit by local variables and pass them as argument to private helper methods to make the code more readable and reduce pickled model size by not storing stuff that is not necessary at prediction time.
  • refactor the _fit method to call into submethods for different algorithms.
  • introduce self.t_ to store SGD learning rate progress and decouple it from self.n_iter_ that should consistently track epochs.
  • issue ConvergenceWarning whenever max_iter is reached when calling fit
Owner

larsmans commented May 27, 2014

What's the todo list for this one?

Contributor

IssamLaradji commented May 27, 2014

Hi larsmans, the todo list is,

  1. it should support more than one hidden layer; so there would be one generic layer list layer_coef_
  2. it should support weights' initialization using trained Restricted Boltzmann Machines, like the one proposed by Hinton et al. (2006): http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf
Owner

ogrisel commented May 27, 2014

For the weight init, I would just use a warm_start=True constructor param and let the user set the layers_coef_ and layers_intercept_ attributes manually, as is done for other existing models such as SGDClassifier.
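A minimal sketch of that workflow, assuming the estimator and attribute names proposed in this PR (MultilayerPerceptronClassifier, layers_coef_, layers_intercept_) and the one-weight-matrix-per-layer-transition convention; the toy weights stand in for externally obtained ones:

import numpy as np
# hypothetical import: the classifier only exists on this PR's branch
from sklearn.neural_network import MultilayerPerceptronClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(20, 10), rng.randint(0, 2, 20)

# weights obtained elsewhere (e.g. from an unsupervised pre-training step)
coef_init = [rng.randn(10, 5), rng.randn(5, 1)]
intercept_init = [rng.randn(5), rng.randn(1)]

clf = MultilayerPerceptronClassifier(n_hidden=[5], warm_start=True)
clf.layers_coef_ = coef_init
clf.layers_intercept_ = intercept_init
# fit fine-tunes from the user-provided weights instead of a random init
clf.fit(X, y)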

Owner

jnothman commented May 27, 2014

Out of curiosity, does RBM initialisation mean that fit may be provided with some unlabelled samples?


Contributor

IssamLaradji commented May 27, 2014

@ogrisel should we include another parameter - like unsupervised_weight_init_ - that runs an RBM (or any unsupervised learning algorithm) to initialize the layer weights? I believe warm_start starts training with the previously trained weights but does not necessarily use an unsupervised learning algorithm for weight initialization.

@jnothman yes, an RBM trains on the unlabeled samples and its new, trained weights become the initial weights of the corresponding layer in the multi-layer perceptron. The image below shows a basic idea of how this is done.
[image: RBM pre-training of a deep belief network]

Owner

larsmans commented May 27, 2014

I think we can leave the RBM init to a separate PR.

Contributor

IssamLaradji commented May 27, 2014

@larsmans sure thing :)

For the travis build, I believe the error is coming from OrthogonalMatchingPursuitCV, given in line 5442

Owner

ogrisel commented May 27, 2014

+1 for leaving the RBM init in a separate PR. Also, there is no need to couple the two models: just extract the weights from a pipeline of RBMs, manually set them as layers_coef_ of an MLP with warm_start=True, and then call fit with the labels for fine tuning.

For the travis build, I believe the error is coming from OrthogonalMatchingPursuitCV, given in line 5442

Not only that: the other builds have failed because the doc tests don't pass either, as I told you earlier in the previous PR.

@ogrisel ogrisel and 2 others commented on an outdated diff May 27, 2014

benchmarks/bench_mnist.py
+=======================
+
+Benchmark multi-layer perceptron, Extra-Trees, linear svm
+with kernel approximation of RBFSampler and Nystroem
+on the MNIST dataset. The dataset comprises 70,000 samples
+and 784 features. Here, we consider the task of predicting
+10 classes - digits from 0 to 9. The experiment was run in
+a computer with a Desktop Intel Core i7, 3.6 GHZ CPU,
+operating the Windows 7 64-bit version.
+
+ Classification performance:
+ ===========================
+ Classifier train-time test-time error-rate
+ ------------------------------------------------------
+ nystroem_approx_svm 124.819s 0.811s 0.0242
+ MultilayerPerceptron 359.460s 0.217s 0.0271
@ogrisel

ogrisel May 27, 2014

Owner

Isn't it possible to find hyperparameter values that reach better accuracy with tanh activations? It should be possible to go below a 2% error rate with a vanilla MLP on MNIST.

@jnothman

jnothman May 27, 2014

Owner

I assumed you intended to have additional unlabelled data, but perhaps working out the best way to incorporate the unlabelled data into the fitting procedure (particularly if you support partial_fit) might be a big question of its own. So I'm +1 for delaying that decision :)


@IssamLaradji

IssamLaradji May 27, 2014

Contributor

@ogrisel I just brought the error rate down to 0.017 :)
(I fixed an issue with the tanh derivative - it didn't pass the gradient test until now)

@jnothman indeed, better to make RBM pipelining a separate PR

@ogrisel

ogrisel May 27, 2014

Owner

Glad you found the source of the problem, it's great to have unit tests that check the correctness of the gradient!

Contributor

IssamLaradji commented Jun 9, 2014

Hi guys, I made some major changes.

  1. The algorithm now supports more than one hidden layer by simply putting a list of values in the n_hidden parameter (a short usage sketch follows at the end of this comment).
    For example, for 3 hidden layers where the first and second layers have 100 neurons and the 3rd has 50 neurons, the list would be n_hidden = [100, 100, 50].

  2. I improved the speed of the implementation by more than 25% by removing a redundant loop.

  3. I improved the documentation by making it more comprehensive.

Your feedback will be greatly appreciated. Thank you! :)
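A short usage sketch of the list-valued n_hidden parameter described in point 1 above; the class name and import path are the ones used in this PR, so this only runs against the PR branch:

from sklearn.datasets import make_classification
from sklearn.neural_network import MultilayerPerceptronClassifier  # this PR

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# three hidden layers with 100, 100 and 50 neurons
clf = MultilayerPerceptronClassifier(n_hidden=[100, 100, 50],
                                     algorithm='l-bfgs', random_state=0)
clf.fit(X, y)

# one weight matrix per layer transition: 20->100, 100->100, 100->50, 50->n_outputs
print([coef.shape for coef in clf.layers_coef_])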

Coverage Status

Coverage increased (+0.16%) when pulling 2e8dc56 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Owner

ogrisel commented Jun 10, 2014

@IssamLaradji great work! I will try to review in more detail soon. Maybe @jaberg and @kastnerkyle might be interested in reviewing this as well.

Can you please fix the remaining expit-related failure under Python 3 w/ recent numpy / scipy?

https://travis-ci.org/scikit-learn/scikit-learn/jobs/27179454#L5790

@ogrisel ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
+:math:`i+1`. :math:`layers_intercept_` is a list of the bias vectors, where the vector
+at index :math:`i` represents the bias values that are added to layer :math:`+i+1`.
+
+The advantages of Multi-layer Perceptron are:
+
+ + Capability to learn complex/non-linear models.
+
+ + Capability to learn models in real-time (on-line learning)
+ using ``partial_fit``
+
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+ + Since hidden layers in MLP make the loss function non-convex
+ - which contains more than one local minimum, random weights'
+ initialization could impact the predictive accuracy of a trained model.
@ogrisel

ogrisel Jun 10, 2014

Owner

I would rather say: "meaning that different random weight initializations can lead to trained models with varying validation accuracy".

@ogrisel ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
+The advantages of Multi-layer Perceptron are:
+
+ + Capability to learn complex/non-linear models.
+
+ + Capability to learn models in real-time (on-line learning)
+ using ``partial_fit``
+
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+ + Since hidden layers in MLP make the loss function non-convex
+ - which contains more than one local minimum, random weights'
+ initialization could impact the predictive accuracy of a trained model.
+
+ + MLP suffers from the Backpropagation diffusion problem; layers far from
+ the output update with decreasing momentum, leading to slow convergence.
@ogrisel

ogrisel Jun 10, 2014

Owner

I would add: with squashing activation functions such as the logistic sigmoid and the tanh function. If we implement linear bottlenecks (the identity function) and ReLU later, this problem might no longer hold.

@ogrisel ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
+ MultilayerPerceptronClassifier(activation='tanh', algorithm='l-bfgs',
+ alpha=1e-05, batch_size=200, eta0=0.5,
+ learning_rate='constant', max_iter=200, n_hidden=[5, 2],
+ power_t=0.25, random_state=None, shuffle=False, tol=1e-05,
+ verbose=False, warm_start=False)
+
+After fitting (training), the model can predict labels for new samples::
+
+ >>> clf.predict([[2., 2.], [-1., -2.]])
+ array([1, 0])
+
+MLP can fit a non-linear model to the training data. The members
+``clf.layers_coef_`` containing the weight matrices constitute the model
+parameters::
+
+ >>> clf.layers_coef_
@ogrisel

ogrisel Jun 10, 2014

Owner

I would just display:

   >>> [coef.shape for coef in clf.layers_coef_]

instead.

Contributor

IssamLaradji commented Jun 11, 2014

Thanks for the feedback @ogrisel. I improved the documentation further, making it more didactic - especially the mathematical formulation section.

For the expit-related failure under Python 3, I am not sure how to fix the problem since I am using the expit version provided in scikit-learn. Isn't the problem within sklearn.utils.fixes?

Thanks.

Owner

kastnerkyle commented Jun 11, 2014

This looks pretty cool so far - I will run some trials on it and try to understand the py3 issues.

Things that would be nice, though maybe not strictly necessary for a first cut PR:

A constructor arg for a custom loss function instead of a fixed one (maybe it is against the API). I'm thinking of things like cross-entropy, hinge loss à la Charlie Tang, etc., instead of standard softmax or what have you. It would be nice to have a few default ones available by string, with the ability to create a custom one if needed.

I like @ogrisel's suggestion for layers_coef_. It would be useful to run experiments with KMeans networks and also pretraining with autoencoders instead of RBMs. This also opens the door for side packages that can take in weights from other nets (looking at Overfeat, Decaf, Caffe, pylearn2, etc.) and load them into sklearn. This is more a personal interest of mine, but it is nice to see the building blocks there.

It is also plausible that very deep nets are possible to use in feedforward mode on the CPU, even if we can't train them in sklearn directly.

Questions:
I see you have worked on deep autoencoders before - will this framework support that as well? In other words, can layer sizes be different but complementary? Or are they expected to be a "block" (uniform in size)?

I also like the support for other optimizers - it would be sweet to get a Hessian-free optimizer into scipy and use it in this general setup. That could make deep-ish NN work somewhat accessible without a GPU, though CG is what (I believe) Hinton used for the original DBM/pretraining paper.

Owner

ogrisel commented Jun 11, 2014

@IssamLaradji indeed it would be interesting to run a bench of lbfgs vs cg and maybe other optimizers from scipy.optimize, for instance on (a subset of) mnist.

Owner

ogrisel commented Jun 11, 2014

We might want to make it possible to use any optimizer from scipy.optimize if the API is homogeneous across all optimizers (I have not checked).

Owner

ogrisel commented Jun 11, 2014

@IssamLaradji about the expit pickling issue, it looks like a bug in numpy. I am working on a fix.

Owner

ogrisel commented Jun 11, 2014

I submitted a bugfix upstream: numpy/numpy#4800 . If the fix is accepted we might want to backport it in sklearn.utils.fixes.

Owner

ogrisel commented Jun 11, 2014

@IssamLaradji actually, can you please try to add the ufunc fix to sklearn.utils.exists now to check that it works for us?

Try to add something like:

import pickle

try:
    pickle.loads(pickle.dumps(expit))
except AttributeError:
    # monkeypatch numpy to backport a fix for:
    # https://github.com/numpy/numpy/pull/4800
    import numpy.core
    def _ufunc_reconstruct(module, name):
        mod = __import__(module, fromlist=[name])
        return getattr(mod, name)
    numpy.core._ufunc_reconstruct = _ufunc_reconstruct
Contributor

IssamLaradji commented Jun 11, 2014

Hi @kastnerkyle and @ogrisel , thanks for the reply.

  1. Custom loss function: I could add a parameter to the constructor that accepts strings for selecting the loss function. (In fact, I have done that in my older implementation, but was told to remove it since there weren't enough loss functions)

  2. Pre-training: I could add a pipeline with a placeholder that selects a pre-trainer for the weights. Although I was told to keep that for the next PR, I don't see any harm in adding an additional constructor parameter and a small method containing the pre-trainer for a quick test :).

  3. Deep Auto-encoder: yes, a sparse autoencoder is a simple adaptation of the feedforward network - I simply need to inject a sparsity parameter into the loss function and its derivatives.

For the layer sizes, they can be different in any way - for example, 1024-512-256-128-64-28 - but, like Hinton said, nothing justifies any particular set of layer sizes since it depends on the problem instance. Anyhow, this framework can support any set of layer sizes, even if they are larger than the number of features.

  4. Selecting scipy optimizers: my older implementation of the vanilla MLP supported all scipy optimizers using the generic scipy minimize method, but there was one problem: it required users to have SciPy 0.13.0+, while scikit-learn requires SciPy (>= 0.7). If we could raise the scipy version requirement, I could easily make this support all scipy optimizers.

Anyhow, L-BFGS is now a state-of-the-art optimizer. I tested it against CG, and L-BFGS always performed better and faster than CG on several datasets (most other optimizers were unsuitable and did not come close to CG and L-BFGS as far as speed and accuracy are concerned, but the scipy method also supports custom optimizers, which is very useful).

This claim is also supported by Adam Coates and Andrew Ng here: http://cs.stanford.edu/people/ang/?portfolio=on-optimization-methods-for-deep-learning

But I did read that CG can perform better and faster for special kinds of datasets. So I am all for adding the generic scipy optimizer were it not for the minimum version issue. What do you think?

For the ufunc fix, did you mean sklearn.utils.fixes? Because my sklearn version doesn't have sklearn.utils.exists :(. I added the fix to sklearn.utils.fixes and pushed the code to see if it resolves the expit problem.

Thank you.

Coverage Status

Coverage increased (+0.16%) when pulling 1d4911b on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

@ogrisel ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
+ for batch_slice in batch_slices:
+ cost = self._backprop_sgd(
+ X[batch_slice], y[batch_slice],
+ batch_size)
+
+ if self.verbose:
+ print("Iteration %d, cost = %.2f"
+ % (i, cost))
+ if abs(cost - prev_cost) < self.tol:
+ break
+ prev_cost = cost
+ self.t_ += 1
+
+ elif 'l-bfgs':
+ self._backprop_lbfgs(
+ X, y, n_samples)
@ogrisel

ogrisel Jun 11, 2014

Owner

Please put method calls on one line when they fit in 80 columns:

             self._backprop_lbfgs(X, y, n_samples)

@ogrisel ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
+ if self.algorithm == 'sgd':
+ prev_cost = np.inf
+
+ for i in range(self.max_iter):
+ for batch_slice in batch_slices:
+ cost = self._backprop_sgd(
+ X[batch_slice], y[batch_slice],
+ batch_size)
+
+ if self.verbose:
+ print("Iteration %d, cost = %.2f"
+ % (i, cost))
+ if abs(cost - prev_cost) < self.tol:
+ break
+ prev_cost = cost
+ self.t_ += 1
@ogrisel

ogrisel Jun 11, 2014

Owner

I think this attribute would be better named n_iter_. Also, for consistency, it might be interesting to report the number of batch iterations in the same n_iter_ attribute when the _backprop_lbfgs method is used. This is reported under the 'nit' key in the information dictionary of fmin_l_bfgs_b.

n_iter_ would thus be the number of epochs for SGD and the number of batch iterations for the LBFGS optimizer.
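As a rough illustration of how that count can be recovered, assuming a cost/gradient callable like the PR's _cost_grad (here replaced by a toy quadratic); the 'nit' key is only present in sufficiently recent SciPy, hence the .get:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def cost_grad(w):
    # toy quadratic objective standing in for the MLP cost and gradient
    return np.sum(w ** 2), 2 * w

w0 = np.ones(5)
w_opt, final_cost, info = fmin_l_bfgs_b(cost_grad, w0)

n_iter_ = info.get('nit')   # number of LBFGS iterations, when available
cost_ = final_cost          # final objective value, another candidate fitted attribute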

@ogrisel

ogrisel Jun 11, 2014

Owner

Note that reporting the final value of the objective function as another fitted attribute might also be interesting.

@ogrisel ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+ n_samples, self.n_features = X.shape
+ self._validate_params()
+
+ if self.layers_coef_ is None:
+ self._init_param()
+ self._init_fit()
+
+ if self.t_ is None or self.eta_ is None:
+ self._init_t_eta_()
+
+ self._preallocate_memory(n_samples, X)
+
+ cost = self._backprop_sgd(X, y, n_samples)
+ if self.verbose:
+ print("Iteration %d, cost = %.2f" % (self.t_, cost))
@ogrisel

ogrisel Jun 11, 2014

Owner

I would rather use cost = %f to not constrain the precision of the cost report. Or use a larger value like %0.8f, for instance.

Owner

ogrisel commented Jun 11, 2014

About the optimizers, thanks for the reference comparing lbfgs and CG. We could add support for arbitrary scipy optimizers and raise a RuntimeError if the version of scipy is too low (with an informative error message), while still using fmin_l_bfgs_b directly by default so that we keep backward compat for old versions of scipy.
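A rough sketch of that dispatch, keeping fmin_l_bfgs_b as the default path; the helper name, the 0.13 threshold mentioned above, and the error message are illustrative assumptions, not this PR's code:

import scipy
from scipy.optimize import fmin_l_bfgs_b

def _optimize(cost_grad, w0, method='l-bfgs'):
    if method == 'l-bfgs':
        # default path, keeps backward compatibility with old SciPy versions
        w_opt, cost, info = fmin_l_bfgs_b(cost_grad, w0)
        return w_opt
    # any other method string is forwarded to scipy.optimize.minimize
    scipy_version = tuple(int(x) for x in scipy.__version__.split('.')[:2])
    if scipy_version < (0, 13):
        raise RuntimeError("method=%r requires scipy >= 0.13 "
                           "(scipy.optimize.minimize); found %s"
                           % (method, scipy.__version__))
    from scipy.optimize import minimize
    # cost_grad returns (cost, gradient), hence jac=True
    result = minimize(cost_grad, w0, jac=True, method=method)
    return result.x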

Owner

ogrisel commented Jun 11, 2014

It would be great to add squared_hinge and hinge loss functions. But in another PR.

I would also consider pre-training and sparse penalties for autoencoders for separate PRs.

Owner

larsmans commented Jun 11, 2014

Indeed. Let's get the basic thing merged first. Is this PR in MRG phase?

Contributor

IssamLaradji commented Jun 11, 2014

Hi, thanks for the comments.

@ogrisel fitted attributes are a great idea. I added a section under the classifier and regressor class documentation explaining these fitted attributes:
1) layers_coef_ : The ith element in the list represents the weight matrix corresponding to layer i.
2) layers_intercept_ : The ith element in the list represents the bias vector corresponding to layer i + 1.
3) cost_ : The current cost value computed by the loss function.
4) n_iter_ : The current number of iterations the algorithm has run.
5) eta_ : The current learning rate.
So if a user prints mlp.cost_ after training, he'd get the minimum cost achieved by either sgd or l-bfgs.

@larsmans, I will set it as MRG, I think it is in its final phase - the scope is completed in my opinion. Things like generic optimizers, more loss functions, and pretraining can be done for the next PRs if that is okay.

Thank you.

IssamLaradji changed the title from Generic multi layer perceptron to [MRG] Generic multi layer perceptron Jun 11, 2014

Owner

kastnerkyle commented Jun 11, 2014

That sounds awesome to me - looking forward to playing with this! Great job.

Contributor

IssamLaradji commented Jun 11, 2014

Thank you for the compliment @kastnerkyle.

Contributor

IssamLaradji commented Jun 11, 2014

Oops, it sounds like Travis does not have the SciPy version supporting d['nit'] for counting iterations. I will increment the iterations manually then.

Contributor

IssamLaradji commented Jun 11, 2014

Fixed :)

Coverage Status

Coverage increased (+0.16%) when pulling de407c2 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Owner

ogrisel commented Jun 12, 2014

@IssamLaradji can you please try to find parameters (eta0, learning_rate and so on) that make the model converge with SGD algorithm on MNIST? I have tried with various constant learning rate but was not very successful. Maybe grid searching on 10% of the data would work.

Contributor

IssamLaradji commented Jun 12, 2014

@ogrisel, on my side, SGD converged with eta0=0.01, learning_rate=constant, n_hidden = 100, and max_iter = 400. In the verbose output you can see the cost decreasing steadily. However, with a large eta0, the cost oscillates and never converges. The problem with the invscaling learning rate is that eta gets stuck around 0.1, while a good eta for the MNIST dataset is 0.01.

Contributor

IssamLaradji commented Jun 12, 2014

There is a power_t parameter that you could increase so that, with learning_rate = invscaling, eta decreases at a faster rate, which helps convergence.
Thanks.
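For reference, a small sketch of the schedule being discussed; the invscaling formula eta = eta0 / t ** power_t is the one used by scikit-learn's SGD-based estimators, and raising power_t makes eta decay faster:

def invscaling_eta(eta0, power_t, t):
    # learning rate at step t under the 'invscaling' schedule
    return eta0 / (t ** power_t)

eta0, n_steps = 0.5, 400
for power_t in (0.25, 0.5, 0.9):
    etas = [invscaling_eta(eta0, power_t, t) for t in range(1, n_steps + 1)]
    print("power_t=%.2f: eta goes from %.3f down to %.4f"
          % (power_t, etas[0], etas[-1]))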

Coverage Status

Coverage increased (+0.16%) when pulling 07376cb on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Owner

glouppe commented Jun 12, 2014

Regarding pre-training, I really wouldn't make that a priority. This heuristic has gone out of favor for some time now. What works best on common benchmark tasks is the good old backpropagation algorithm trained on labeled data only. The trick is in using appropriate activation functions (e.g., rectified linear units) and averaging strategies like dropout.

Owner

kastnerkyle commented Jun 12, 2014

I would save pretraining for another PR, but it can still be important in scenarios with very little labeled data but lots of unlabeled data. It also has merit because it can be done in an unsupervised fashion, which is pretty nice (KMeans networks, spike-and-slab coding, stacked autoencoders, etc.).

Pretrained nets made out of pretrained RBMs can be used as a generative model, which makes for nice demos and examples (and potentially, interesting applications in system simulation/testing...). These techniques are not as performant (now) as purely supervised nets, but it is possible to get ~81% on CIFAR10 with only 400 labels a class. Adam Coates' work covered this fairly extensively.

Having the ability to insert weights also gives opportunities for getting weights from other models, perhaps even other neural network packages, and using them in scikit-learn. This could include models which are only feasible to train on the GPU but are OK for the CPU in feedforward mode. I envision a scenario where one package (pylearn2 perhaps?) exports its weights in svmlight format, and we can load them into the NN using a built-in function.

I would be willing to do this if there are more pressing issues for the GSoC, but the fact remains that it seems useful. Granted, there are no convolutional layers yet, but that can be left to future work as well.

I am showing my bias again, but projects like overfeat and decaf have shown that transfer learning is definitely possible with these big deep nets, and using the feature extraction of a really complex model for preprocessing of new problems in scikit-learn's friendly format is very, very attractive. In a dream world, it works kind of like the dataset downloader - download the weights from someone's model hosted online, and BAM. You can replicate or extend their experiment without a ton of hardware.

OK, I will shut up about it now :)


@jnothman jnothman commented on an outdated diff Jun 12, 2014

sklearn/neural_network/multilayer_perceptron.py
+ self.cost_ = None
+ self.n_iter_ = None
+ self.eta_ = None
+
+ def _pack(self, layers_coef_, layers_intercept_):
+ """Pack the coefficient and intercept parameters into a single vector.
+ """
+ all_params_ = layers_coef_ + layers_intercept_
+
+ return np.hstack([l.ravel() for l in all_params_])
+
+ def _unpack(self, packed_parameters):
+ """Extract the coefficients and intercepts from packed_parameters."""
+ for i in range(self.n_layers - 1):
+ s, e, shape = self.parameter_subsets[i]
+ self.layers_coef_[i] = np.reshape(packed_parameters[s:e], (shape))
@jnothman

jnothman Jun 12, 2014

Owner

(shape) is identical to shape

Contributor

IssamLaradji commented Jun 13, 2014

I agree with @kastnerkyle , pre-training is still a hot, useful research area, as many recent papers contain work along those lines - like [1] [2]. I also like the idea of sharing trained weights in a friendly format like scikit-learn's.

It makes sense that pre-training is useful, as many samples on the internet are unlabeled; besides, Andrew Ng made a great achievement with pre-training: http://www.wired.com/2013/05/neuro-artificial-intelligence.

But I also agree with @glouppe: the reason pre-training has been discouraged is its very long training time, where the performance improvement might not be worth it. Much of the relevant literature uses several GPUs for pre-training, which are unavailable to the common user. That's why more approaches are using dropout and sophisticated activation functions for a more convenient training time.

Having said that, pre-training is part of the GSoC proposal, so I am forced to include it, unless we all agree to implement something else instead :-) . Thank you!

[1] http://www.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf
[2] http://arxiv.org/pdf/1112.6209.pdf

Owner

jnothman commented Jun 13, 2014

The API for pretraining might be a challenge. What might be worthwhile in the meantime is providing a way / documenting the ability to provide one's own initial weights (externally pre-trained).


Owner

ogrisel commented Jun 13, 2014

@ogrisel, on my side, SGD converged with eta0=0.01, learning_rate=constant, n_hidden = 100, and max_iter = 400. In the verbose output you can see the cost decreasing steadily. However, with a large eta0, the cost oscillates and never converges. The problem with the invscaling learning rate is that eta gets stuck around 0.1, while a good eta for the MNIST dataset is 0.01.

I think it would be useful to detect if the cost is increasing (e.g. 3 times in a row), raise a ConvergenceWarning with a message explaining that the learning rate is probably too high or the data not properly standardized, and maybe stop the algorithm.

Speaking of which, I think we could change the MNIST benchmark to pipeline a StandardScaler as a preprocessor. That might make the weight init work better and the model converge faster, no?
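A hedged sketch of that pipeline, using load_digits as a small stand-in for the MNIST data of bench_mnist.py; MultilayerPerceptronClassifier and its parameters are the ones proposed in this PR, and the hyperparameter values are just the ones mentioned in this thread:

from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MultilayerPerceptronClassifier  # this PR

digits = load_digits()   # small stand-in for the MNIST benchmark data

model = Pipeline([
    ('scale', StandardScaler()),
    ('mlp', MultilayerPerceptronClassifier(n_hidden=[100], algorithm='sgd',
                                           eta0=0.01, max_iter=400,
                                           random_state=0)),
])
model.fit(digits.data, digits.target)
print("training accuracy: %.3f" % model.score(digits.data, digits.target))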

Owner

ogrisel commented Jun 13, 2014

Having said that, pre-training is part of the GSoC proposal, so I am forced to include it, unless we all agree to implement something else instead :-).

We can still decide together to change the task list of the initial GSoC proposal. For pre-training we can just add a new example that first fits a pipeline of 3 RBMs on a dataset, then manually initializes the weights (layers_coef_ + layers_intercept_) of an MLP instance, and trains it with warm_start=True for the fine tuning. No need to write a specific API; this can all be done manually with the existing API. It's more a matter of documenting how to do it in an example referenced from the narrative documentation.
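A sketch of that manual workflow on toy data, using the existing BernoulliRBM together with the attribute names proposed in this PR; the layer sizes and the small random init of the output layer are illustrative assumptions:

import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.neural_network import MultilayerPerceptronClassifier  # this PR

rng = np.random.RandomState(0)
X_unlabeled = rng.rand(500, 100)       # plenty of unlabeled samples
X_labeled = rng.rand(50, 100)          # a few labeled samples
y_labeled = rng.randint(0, 2, 50)

# unsupervised pre-training: a stack of two RBMs
rbm1 = BernoulliRBM(n_components=64, random_state=0).fit(X_unlabeled)
rbm2 = BernoulliRBM(n_components=32, random_state=0).fit(rbm1.transform(X_unlabeled))

# supervised fine tuning: copy the RBM weights into the MLP and fit
mlp = MultilayerPerceptronClassifier(n_hidden=[64, 32], warm_start=True)
mlp.layers_coef_ = [rbm1.components_.T,        # (100, 64)
                    rbm2.components_.T,        # (64, 32)
                    rng.randn(32, 1) * 0.01]   # output layer keeps a random init
mlp.layers_intercept_ = [rbm1.intercept_hidden_,
                         rbm2.intercept_hidden_,
                         np.zeros(1)]
mlp.fit(X_labeled, y_labeled)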

Owner

ogrisel commented Jun 13, 2014

I created a new issue #3275 to move the discussion about pre-training there.

Contributor

IssamLaradji commented Jun 13, 2014

@jnothman sure, that's a good start. I am thinking of implementing these methods I suggested in #3275

@ogrisel ConvergenceWarning is a great idea. I added this code in the sgd body,

                if self.cost_ > prev_cost:
                    cost_increase_count += 1
                    if cost_increase_count == 0.2 * self.max_iter:
                        warnings.warn('Cost is increasing for more than 20%'
                                      ' of the iterations. Consider reducing'
                                      ' eta0 and preprocessing your data'
                                      ' with StandardScaler or MinMaxScaler.',
                                      ConvergenceWarning)

Since the cost might increase occasionally, the ConvergenceWarning is issued only when the cost keeps increasing for more than 20% of the iterations. We could decrease that percentage if it is more appropriate.

For MNIST, I realized that, after dividing the data by 255, sgd's cost consistently decreases and then converges even with a learning rate as high as 0.5. Since the MNIST benchmark already divides the data by 255, thereby normalizing it, it seems StandardScaler would be redundant. What do you think?

Thanks!

Owner

ogrisel commented Jun 13, 2014

For MNIST, I realized that, after dividing the data by 255, sgd's cost consistently decreases and then converges even with a learning rate as high as 0.5. Since the MNIST benchmark already divides the data by 255, thereby normalizing it, it seems StandardScaler would be redundant. What do you think?

I might be mistaken, but I think I could make it diverge even with the division by 255. Also, a [0, 1] range is not equivalent to centering + unit scaling. The latter is what is expected by the fan-in / fan-out init scheme of the weights, so StandardScaler might help the algorithm converge slightly faster.

Contributor

IssamLaradji commented Jun 13, 2014

I just compared the MNIST benchmark between StandardScaler, normalization, and MinMaxScaler. You are right: with StandardScaler, the multi-layer perceptron took a quarter of the time to train that it did with normalization (or division by 255), but it had an error rate of 0.028 instead of 0.0169. I couldn't get the score lower than that :( with StandardScaler.

Also, StandardScaler made nystroem_approx_svm and fourier_approx_svm perform really poorly (high error rate) while taking a very long time to train. Here are their scores with StandardScaler.

                 Algorithm              Training time (s)    Error rate
                 -------------------------------------------------------
                 nystroem_approx_svm    1382.65              0.4866
                 fourier_approx_svm     1363.49              0.6055

MinMaxScaler, however, helped them perform better, as expected. The multi-layer perceptron took half the time to train with MinMaxScaler than with normalization. Do you think we should use MinMaxScaler instead?

Edit: Never mind, MinMaxScaler works exactly like "division by 255" since they both scale values to between 0 and 1. The difference in speed was the result of a random factor. I will dig deeper into the issue of StandardScaler with nystroem_approx_svm and fourier_approx_svm.

Thanks

Owner

AlexanderFabisch commented Jun 14, 2014

There are some pixels in the MNIST dataset that are never greater than 0, which should not affect the result, but there are also some pixels that are almost never greater than zero. To get a variance of 1 for these components, the StandardScaler will multiply these pixels by a large value.

In [1]: from sklearn.datasets import fetch_mldata

In [2]: mnist = fetch_mldata('MNIST original')

In [3]: X = mnist.data

In [4]: X.sum(axis=0)
Out[4]: 
array([      0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,     126,     470,
           216,       9,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,      16,      93,     793,
          1615,    3026,    4357,    8255,   11987,   13539,   13306,
         14440,   12792,   11907,   10116,    6947,    4776,    3421,
          1282,     605,     212,       0,       0,       0,       0,
             0,       0,      64,      42,     417,     766,    3941,
         ...
       8505361, 7569160, 6196835, 4644333, 3183420, 2002145, 1165280,
        626008,  319420,  150857,   57252,   13300,    1238,      72,
             0,      24,    4175,   28731,  118835,  346211,  828203,
       1655578, 2794999, 4096358, 5390273, 6429651, 6988499, 6971463,
       6395512, 5346977, 4071292, 2831605, 1791788, 1055597,  584113,
        ...
         17195,    5667,    1470,      58,      59,       0,       0,
             0,       0,       0,       0,     152,     935,    2520,
          5787,    8581,   13136,   21761,   27641,   34682,   39975,
         46865,   41270,   33546,   23352,   13819,    6968,    3264,
          1163,     907,     120,       0,       0,       0,       0], dtype=uint64)

I think these pixels do distort the result completely (although I did not really test it). This is why you usually do not want to use a StandardScaler for this dataset: it is very sensitive to outliers. In almost all experiments with the MNIST dataset that I have seen, the values have been scaled to [0, 1] by dividing each pixel by 255.0. I think the MinMaxScaler will give a very similar result, but not the same, because the maximum value in the data might not be 255.0 for each pixel.

Contributor

IssamLaradji commented Jun 15, 2014

@AlexanderFabisch thanks for taking the time to analyze this.
Indeed - I got different, though very similar, mean values between applying MinMaxScaler and dividing by 255, suggesting they are not identical for this data.
I guess I will keep the division by 255.0 as the preprocessing step, since it is the popular normalization method for the MNIST dataset.
Thanks.

Coverage Status

Coverage increased (+0.17%) when pulling 3655474 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Contributor

IssamLaradji commented Jun 16, 2014

I pushed a new pull-request #3281 where I uploaded an example file mlp_with_pretraining.py that demonstrates pre-training an mlp with an rbm. Thanks.

Owner

ogrisel commented Jun 17, 2014

I just had a second look at the code of this PR and I think the following things should be done:

  • all attributes that are not public constructor parameters and that are mutated during a call to fit should be either public attributes with a trailing _ to show that they are estimated from the data and properly documented in the docstring, or made private by adding a leading _ to their name (in my opinion all the attributes that are currently not documented should be made private to limit backward compat issues later if we decide to refactor the internals of this estimator, except n_outputs that could be renamed n_outputs_ as done in many other scikit-learn models).
  • I think that fit should only preserve layers_coef_ and layers_intercept_ when warm_start=True. All the other fitted parameters (e.g. eta_, n_iter_ and so on) should be reinitialized from scratch when calling fit, as if fit was called for the first time.

This second item entails refactoring _init_fit. I would rename it to _init_random_weights and only do the init of layers_coef_ and layers_intercept_ there. Do not set up the packed_parameter_meta attribute; instead, directly introspect the shape of the layers in the _unpack method. There is no need to precompute it, as _unpack is only called once per fit call.

Contributor

IssamLaradji commented Jun 18, 2014

Hi @ogrisel ,

  1. The reason I used packed_parameter_meta to precompute layer shapes is that, lbfgs calls the _unpack method for every iteration. At each time step, lbfgs runs the _cost_grad() method where the first line calls _unpack to extract the updated parameters. Problem is, the nature of lbfgs, and most scipy optimizers, is to pack and unpack parameters per iteration. Hence the precomputation of layer shapes.

I can precompute the layer shapes in a method other than _init_random_weights for easier pre-training - it is only needed for lbfgs, not sgd. I called the new method _precompute_layer_shapes.

  2. Thanks for explaining the syntax difference between public and private methods/attributes - I wasn't really aware of them :(. I pushed the updated file with the correct syntax.

Thank you.

Update: Thanks to your comments, I made the pre-training code #3281 much cleaner. Now, mlp only requires warm_start = True as well as coefficient and intercept assignments from RBM for pretraining.

Coverage Status

Coverage increased (+0.17%) when pulling 43250a7 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Owner

ogrisel commented Jun 18, 2014

The reason I used packed_parameter_meta to precompute layer shapes is that, lbfgs calls the _unpack method for every iteration. At each time step, lbfgs runs the _cost_grad() method where the first line calls _unpack to extract the updated parameters.

Indeed. For some reason I missed that when looking for _unpack calls.

I can precompute the layer shapes in a method other than _init_random_weights for easier pre-training - it is only needed for lbfgs, not sgd. I called the new method _precompute_layer_shapes.

Exactly what I would have suggested.

Update: Thanks to your comments, I made the pre-training code #3281 much cleaner. Now, mlp only requires warm_start = True as well as coefficient and intercept assignments from RBM for pretraining.

This is precisely why I wanted to have warm_start=True only affect layers_coef_ and layers_intercept_ :)

@ogrisel ogrisel and 1 other commented on an outdated diff Jun 18, 2014

sklearn/utils/fixes.py
@@ -49,6 +49,18 @@ def expit(x, out=None):
return out
+# added a code block that addresses the `expit` issue with python3
+try:
+ pickle.loads(pickle.dumps(expit))
@ogrisel

ogrisel Jun 18, 2014

Owner

Unfortunately this is actually triggering a bug under Python 3.4 on my box, for instance when I run the plot_rbm_logistic_classification.py but I cannot really understand why. Can you try to reproduce it?

Traceback (most recent call last):
  File "examples/plot_rbm_logistic_classification.py", line 39, in <module>
    from sklearn import linear_model, datasets, metrics
  File "/Users/ogrisel/code/scikit-learn/sklearn/linear_model/__init__.py", line 12, in <module>
    from .base import LinearRegression
  File "/Users/ogrisel/code/scikit-learn/sklearn/linear_model/base.py", line 28, in <module>
    from ..utils import as_float_array, atleast2d_or_csr, safe_asarray
  File "/Users/ogrisel/code/scikit-learn/sklearn/utils/__init__.py", line 11, in <module>
    from .validation import (as_float_array, check_arrays, safe_asarray,
  File "/Users/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 17, in <module>
    from .fixes import safe_copy
  File "/Users/ogrisel/code/scikit-learn/sklearn/utils/fixes.py", line 54, in <module>
    pickle.loads(pickle.dumps(expit))
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/numpy/core/__init__.py", line 61, in _ufunc_reduce
    return _ufunc_reconstruct, (whichmodule(func, name), name)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/pickle.py", line 283, in whichmodule
    for module_name, module in sys.modules.items():
RuntimeError: dictionary changed size during iteration
@IssamLaradji

IssamLaradji Jun 22, 2014

Contributor

@ogrisel , I got the exact same error with Python 3.4. I will look into this. Thanks

@IssamLaradji

IssamLaradji Jun 22, 2014

Contributor

On investigating this, I found that the error happens only when you run import matplotlib.pyplot as plt before other imports. If you place import matplotlib.pyplot as plt below other imports, the code will work just fine. This is strange, but I think there is a name conflict in the libraries being imported.

@ogrisel

ogrisel Jun 23, 2014

Owner

It looks like the call to _ufunc_reconstruct can sometimes trigger a module import that changes the dict of imported modules, sys.modules. I think this can be considered a bug in the pickle module, which should be made robust to such changes by using something like:

for module_name, module in list(sys.modules.items()):
    ...

It would be great to try to write a pure-python non-regression test to submit with a fix to Python.

@IssamLaradji

IssamLaradji Jun 24, 2014

Contributor

Working on it, will let you know when it is fixed :)

@IssamLaradji

IssamLaradji Jun 26, 2014

Contributor

Hi @ogrisel, since pickle.loads(pickle.dumps(expit)) always raises an error, can we remove it and keep only the code below instead?
I found that calling pickle in .fixes is what causes the import conflicts. Therefore, if we remove it, we will not face any problems with the travis pickle test, nor with module import conflicts. Thanks.

# added a code block that addresses the `expit` issue with python3
def _ufunc_reconstruct(module, name):
    mod = __import__(module, fromlist=[name])
    return getattr(mod, name)
np.core._ufunc_reconstruct = _ufunc_reconstruct

@arjoly arjoly commented on an outdated diff Jun 18, 2014

sklearn/neural_network/multilayer_perceptron.py
+ if self.cost_ > prev_cost:
+ cost_increase_count += 1
+ if cost_increase_count == 0.2 * self.max_iter:
+ warnings.warn('Cost is increasing for more than 20%%'
+ ' of the iterations. Consider reducing'
+ ' eta0 and preprocessing your data'
+ ' with StandardScaler or MinMaxScaler.'
+ % self.cost_, ConvergenceWarning)
+
+ elif prev_cost - self.cost_ < self.tol:
+ break
+
+ prev_cost = self.cost_
+ self.n_iter_ += 1
+
+ elif 'l-bfgs':
@arjoly

arjoly Jun 18, 2014

Owner

Looks like a bug: `elif 'l-bfgs':` is always true.

@ogrisel ogrisel and 1 other commented on an outdated diff Jun 18, 2014

doc/modules/neural_networks_supervised.rst
+
+ + Capability to learn models in real-time (on-line learning)
+ using ``partial_fit``
+
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+ + MLP with hidden layers have a non-convex loss function where there exists
+ more than one local minimum. Therefore different random weight
+ initializations can lead to different validation accuracy.
+
+ + MLP suffers from the Backpropagation diffusion problem; layers far from
+ the output update with decreasing momentum, leading to slow convergence.
+ However, with squashing activation functions such as the logistic sigmoid
+ and the tanh function, implementing linear bottlnecks (the identity function)
+ and ReLU later might resolve this problem.
@ogrisel

ogrisel Jun 18, 2014

Owner

I would not mention ReLU and linear bottlenecks in the doc as long as we have not implemented them.

@IssamLaradji

IssamLaradji Jun 22, 2014

Contributor

Fixed :)

@ogrisel ogrisel commented on an outdated diff Jun 18, 2014

sklearn/neural_network/multilayer_perceptron.py
+ Returns
+ -------
+ self
+ """
+ X = atleast2d_or_csr(X)
+
+ self._validate_params()
+
+ n_samples, n_features = X.shape
+ self.n_outputs_ = y.shape[1]
+
+ self._init_eta_()
+ self._init_param(n_features)
+
+ if not self.warm_start or \
+ (self.warm_start and self.layers_coef_ is None):
@ogrisel

ogrisel Jun 18, 2014

Owner

Please use additional ( and ) around the conditional expression rather than a trailing \.

Coverage Status

Coverage increased (+0.17%) when pulling 61d9de8 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Coverage Status

Coverage increased (+0.21%) when pulling a925a14 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Owner

mblondel commented Jun 27, 2014

Is the plan to merge all your PRs at once at the end of the summer? I'd rather merge progressively what is ready into master.

Contributor

IssamLaradji commented Jun 27, 2014

Hi @mblondel , I would be glad to have this merged before continuing on other PRs :). This is the farthest I got in a PR though; so if you or someone can guide me on the last remaining steps, it would be great :).

PS: a strange travis error for python 2.6 is related to 'OrthogonalMatchingPursuitCV'. I wonder if this PR is causing this.

Thanks.

Contributor

IssamLaradji commented Jun 30, 2014

Also, I feel like working on the ELM PR #3306 would be much easier if this PR got merged first, since ELM's documentation belongs in the same documentation as MLP :).
Please let me know what final steps I should take to complete this. Thanks. :)

Owner

ogrisel commented Jun 30, 2014

PS: a strange travis error for python 2.6 is related to 'OrthogonalMatchingPursuitCV'. I wonder if this PR is causing this.

This is unrelated. See #3190. Unfortunately it's apparently very hard to reproduce outside of the travis env so nobody could come up with a fix yet.

Coverage Status

Coverage increased (+0.21%) when pulling b1562d7 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Contributor

IssamLaradji commented Jul 1, 2014

Finally fixed the expit problem without having to import pickle in .fixes which caused conflicts with other imports.
@ogrisel indeed, the 'OrthogonalMatchingPursuitCV' failure seems like a very subtle error - I had a similar error in #3306, which I couldn't reproduce because the PR passed all the unit tests on my local machine, unlike on Travis.

Owner

ogrisel commented Jul 2, 2014

Finally fixed the expit problem without having to import pickle in .fixes which caused conflicts with other imports.

But the pickle check was intentional: I don't want us to monkeypatch numpy when it's not necessary.

As the fix has been included in numpy master, after the branching of numpy 1.9 you can test that the (major, minor) version of numpy is lower or equal to (1, 9) and only apply the _ufunc_reconstruct monkey patch in that case.
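A sketch of that version gate, assuming simple (major, minor) parsing of numpy.__version__; sklearn.utils.fixes may already have a version helper worth reusing instead:

import numpy as np
import numpy.core

np_version = tuple(int(x) for x in np.__version__.split('.')[:2])

if np_version <= (1, 9):
    # versions up to 1.9 lack the upstream fix from numpy/numpy#4800
    def _ufunc_reconstruct(module, name):
        mod = __import__(module, fromlist=[name])
        return getattr(mod, name)
    numpy.core._ufunc_reconstruct = _ufunc_reconstruct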

Owner

ogrisel commented Jul 2, 2014

BTW I reported the python bug here: http://bugs.python.org/issue21905

Coverage Status

Coverage increased (+0.17%) when pulling 5c13b7d on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Contributor

IssamLaradji commented Jul 3, 2014

Great! :) I added the numpy (major, minor) check to call _ufunc_reconstruct only when it is necessary.
Thanks for following up on this.

Coverage Status

Coverage increased (+0.17%) when pulling c338ccc on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Coverage Status

Coverage increased (+0.18%) when pulling 72ffdb5 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

Owner

ogrisel commented Jul 30, 2014

The master branch has changed quite a bit during the past week, could you please rebase on top of master (and squash your commits) and fix any conflict? Note: the input validation helpers have changed, see: http://scikit-learn.org/dev/developers/utilities.html#validation-tools

Contributor

IssamLaradji commented Jul 30, 2014

Renovated the code using the same advice I got for ELM #3306 - It's much more readable and cleaner now (I hope) :)

But Travis is hitting the same errors as it did with ELM's pull request, even after rebasing.
However, Travis for ELM #3306 fixed itself after a while (strange) :). Maybe Travis here will fix itself too?

Thanks.

@arjoly arjoly and 2 others commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ "to partial_fit.")
+ elif self.classes_ is not None and classes is not None:
+ if np.any(self.classes_ != np.unique(classes)):
+ raise ValueError("`classes` is not the same as on last call "
+ "to partial_fit.")
+ elif classes is not None:
+ self.classes_ = classes
+
+ if not hasattr(self, '_lbin'):
+ self._lbin = LabelBinarizer()
+ self._lbin._classes = classes
+
+ X, y = check_X_y(X, y, accept_sparse='csr')
+
+ # needs a better way to check multi-label instances
+ if isinstance(np.reshape(y, (-1, 1))[0][0], list):
@arjoly

arjoly Jul 31, 2014

Owner

The label binarizer will make it for you. Check the y_type_ attribute.
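A small illustration of the y_type_ attribute being referred to: after fitting, LabelBinarizer records what kind of target it saw, so the estimator does not have to inspect y itself.

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit_transform(np.array([0, 1, 2, 1]))
print(lb.y_type_)                                   # 'multiclass'

lb = LabelBinarizer()
lb.fit_transform(np.array([[1, 0, 1], [0, 1, 0]]))
print(lb.y_type_)                                   # 'multilabel-indicator'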

@IssamLaradji

IssamLaradji Aug 2, 2014

Contributor

That's awesome - it makes life much easier :). I noticed that LabelBinarizer for multi-labeling will be deprecated by 0.17, since I get this warning.

DeprecationWarning: Direct support for sequence of sequences
multilabel representation will be unavailable from version 0.17.

It seems I have to use MultiLabelBinarizer for multi-label instances. Wouldn't that add unnecessary clutter to the code?

I believe that in this case I can't use y_type_ to identify the type since I wouldn't know whether I should use MultiLabelBinarizer or LabelBinarizer until I check the instance type manually, right?
Thanks

@jnothman

jnothman Aug 2, 2014

Owner

No, you only need to use MultiLabelBinarizer if you have multilabel data represented as a list of lists of classes (or similar). This format is deprecated, and the MultiLabelBinarizer is provided as a utility for users because, while cumbersome to process, that format is arguably more human-readable than a label indicator matrix which will continue to be supported.

So in short, use label_binarize and your support for that old format, but not for label indicator matrices, will disappear together with the rest of the deprecation.

@IssamLaradji

IssamLaradji Aug 2, 2014

Contributor

Thanks @jnothman . I read all about the issues with having lists of lists for multi-label instances - it makes sense why its support is being deprecated.

I was having problems with make_multilabel_classification, until I set return_indicator to True, which returned a friendly format for LabelBinarizer that worked nicely. Thanks :)

@jnothman

jnothman Aug 2, 2014

Owner

make_multilabel_classification is yet to be fixed for this and for sparse support (@hamsal, is it to be yours?)


@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ y : array-like, shape (n_samples, n_outputs)
+ Subset of the target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ X, y = check_X_y(X, y)
+
+ if y.ndim == 1:
+ y = np.reshape(y, (-1, 1))
+
+ super(MultilayerPerceptronRegressor, self).partial_fit(X, y)
+ return self
+
+ def predict(self, X):

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Training data, where n_samples in the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape (n_samples, n_outputs)
+ Subset of the target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ X, y = check_X_y(X, y)
+
+ if y.ndim == 1:
+ y = np.reshape(y, (-1, 1))
@arjoly

arjoly Jul 31, 2014

Owner

You can add a private self._validate_X_y to get generic code for both regression and classification. This would reduce the amount of boilerplate code.
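A sketch of such a shared helper; the name _validate_X_y is the reviewer's suggestion, check_X_y is the real validation utility, and in the actual PR this would live as a method on the base class:

import numpy as np
from sklearn.utils import check_X_y

def _validate_X_y(X, y):
    # shared by the classifier's and the regressor's fit/partial_fit methods
    X, y = check_X_y(X, y, accept_sparse='csr', multi_output=True)
    if y.ndim == 1:
        # the backpropagation code works on 2d targets internally
        y = np.reshape(y, (-1, 1))
    return X, y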

@arjoly arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape (n_samples, n_outputs)
+ Target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ X, y = check_X_y(X, y, multi_output=True)
+
+ if y.ndim == 1:
+ y = np.reshape(y, (-1, 1))
@arjoly

arjoly Jul 31, 2014

Owner

You can add a private self._validate_X_y to get generic code for both regression and classification. This would reduce the amount of boilerplate code.

@arjoly

arjoly Jul 31, 2014

Owner

You could probably have only one fit function in the base class :-)

@IssamLaradji

IssamLaradji Aug 3, 2014

Contributor

Yes, exactly :). Also one partial_fit in the base class is sufficient.

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ Whether to print progress messages to stdout.
+
+ warm_start : bool, optional, default False
+ When set to True, reuse the solution of the previous
+ call to fit as initialization, otherwise, just erase the
+ previous solution.
+
+ Attributes
+ ----------
+ `classes_` : array or list of array of shape = [n_classes]
+ Class labels for each output.
+
+ `cost_` : float
+ The current cost value computed by the loss function.
+
+ `eta_` : float
@arjoly

arjoly Jul 31, 2014

Owner

eta => learning_rate_?
This would be consistent with the gradient boosting module.

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ iterates until convergence (determined by 'tol') or
+ this number of iterations.
+
+ random_state : int or RandomState, optional, default None
+ State of or seed for random number generator.
+
+ shuffle : bool, optional, default False
+ Whether to shuffle samples in each iteration before extracting
+ minibatches.
+
+ tol : float, optional, default 1e-5
+ Tolerance for the optimization. When the loss at iteration i+1 differs
+ less than this amount from that at iteration i, convergence is
+ considered to be reached and the algorithm exits.
+
+ eta0 : double, optional, default 0.1
@arjoly

arjoly Jul 31, 2014

Owner

learning_rate_init?

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ y_prob : array-like, shape (n_samples, n_classes)
+ The predicted probability of the sample for each class in the
+ model, where classes are ordered as they are in
+ `self.classes_`.
+ """
+ scores = self.decision_function(X)
+
+ if len(scores.shape) == 1:
@arjoly

arjoly Jul 31, 2014

Owner

Why not use ndim?

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ X, y = check_X_y(X, y, accept_sparse='csr')
+
+ # needs a better way to check multi-label instances
+ if isinstance(np.reshape(y, (-1, 1))[0][0], list):
+ self._multi_label = True
+ else:
+ self._multi_label = False
+
+ self.classes_ = np.unique(y)
+ self._lbin = LabelBinarizer()
+ y = self._lbin.fit_transform(y)
@arjoly

arjoly Jul 31, 2014

Owner

I would factor those checks in a _validate_X_y?

@arjoly

arjoly Jul 31, 2014

Owner

Note that the y could be optional, so you would be able to re-use that function in the predict functions.

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ """
+ def __init__(self, n_hidden=[100], activation="tanh",
+ algorithm='l-bfgs', alpha=0.00001,
+ batch_size=200, learning_rate="constant", eta0=0.5,
+ power_t=0.5, max_iter=200, shuffle=False,
+ random_state=None, tol=1e-5,
+ verbose=False, warm_start=False):
+ sup = super(MultilayerPerceptronClassifier, self)
+ sup.__init__(n_hidden=n_hidden, activation=activation,
+ algorithm=algorithm, alpha=alpha, batch_size=batch_size,
+ learning_rate=learning_rate, eta0=eta0, power_t=power_t,
+ max_iter=max_iter, shuffle=shuffle,
+ random_state=random_state, tol=tol,
+ verbose=verbose, warm_start=warm_start)
+
+ self.loss = 'log_loss'
@arjoly

arjoly Jul 31, 2014

Owner

Why not pass this as a parameter to the base class?

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ACTIVATIONS = {'tanh': _inplace_tanh, 'logistic': _inplace_logistic_sigmoid}
+
+DERIVATIVE_FUNCTIONS = {'tanh': _d_tanh, 'logistic': _d_logistic}
+
+
+class BaseMultilayerPerceptron(six.with_metaclass(ABCMeta, BaseEstimator)):
+ """Base class for MLP classification and regression.
+
+ Warning: This class should not be used directly.
+ Use derived classes instead.
+ """
+
+ _loss_functions = {
+ 'squared_loss': _squared_loss,
+ 'log_loss': _log_loss,
+ }
@arjoly

arjoly Jul 31, 2014

Owner

I would make two constants: CLASSIFICATION_LOSS and REGRESSION_LOSS. Later, you can use
is_classifier(self) to know whether you are in a regression or classification task.
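
A minimal sketch of that suggestion, with simplified stand-ins for the module's private loss functions (the real _log_loss does more, e.g. clipping probabilities):

import numpy as np
from sklearn.base import is_classifier


def _squared_loss(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2) / (2 * len(y_true))


def _log_loss(y_true, y_prob):
    return -np.sum(y_true * np.log(y_prob)) / len(y_true)


# Separate tables per task instead of one shared dict on the base class.
CLASSIFICATION_LOSSES = {'log_loss': _log_loss}
REGRESSION_LOSSES = {'squared_loss': _squared_loss}


def _get_loss(estimator, loss_name):
    """Pick the loss table based on whether the estimator is a classifier."""
    table = (CLASSIFICATION_LOSSES if is_classifier(estimator)
             else REGRESSION_LOSSES)
    return table[loss_name]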

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+def _d_logistic(Z):
+ """Compute the derivative of the logistic function."""
+ return Z * (1 - Z)
+
+
+def _d_tanh(Z):
+ """Compute the derivative of the hyperbolic tan function."""
+ return 1 - (Z ** 2)
+
+
+def _squared_loss(Y, Z):
+ """Compute the square loss for regression."""
+ return np.sum((Y - Z) ** 2) / (2 * len(Y))
+
+
+def _log_loss(Y, Z):
@arjoly

arjoly Jul 31, 2014

Owner

y => y_true / Y_true,
Z => y_proba / Y_proba?

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ X /= X.sum(axis=1)[:, np.newaxis]
+
+ return X
+
+
+def _d_logistic(Z):
+ """Compute the derivative of the logistic function."""
+ return Z * (1 - Z)
+
+
+def _d_tanh(Z):
+ """Compute the derivative of the hyperbolic tan function."""
+ return 1 - (Z ** 2)
+
+
+def _squared_loss(Y, Z):
@arjoly

arjoly Jul 31, 2014

Owner

Y => y_true
Z => y_pred?

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ self.power_t = power_t
+ self.max_iter = max_iter
+ self.n_hidden = n_hidden
+ self.shuffle = shuffle
+ self.random_state = random_state
+ self.tol = tol
+ self.verbose = verbose
+ self.warm_start = warm_start
+
+ self.layers_coef_ = None
+ self.layers_intercept_ = None
+ self.cost_ = None
+ self.n_iter_ = None
+ self.eta_ = None
+
+ def _pack(self, layers_coef_, layers_intercept_):
@arjoly

arjoly Jul 31, 2014

Owner

I would extract this to make it a small function instead of a method.

@arjoly

arjoly Jul 31, 2014

Owner

_pack => _pack_network?

@arjoly arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ self.tol = tol
+ self.verbose = verbose
+ self.warm_start = warm_start
+
+ self.layers_coef_ = None
+ self.layers_intercept_ = None
+ self.cost_ = None
+ self.n_iter_ = None
+ self.eta_ = None
+
+ def _pack(self, layers_coef_, layers_intercept_):
+ """Pack the coefficient and intercept parameters into a single vector.
+ """
+ return np.hstack([l.ravel() for l in layers_coef_ + layers_intercept_])
+
+ def _unpack(self, packed_parameters):
@arjoly

arjoly Jul 31, 2014

Owner

What do you think of extracting this method and creating a function instead?
In the code, you would do something like

self.layer_coef_, self.layer_intercept_ = _unpack_network(parameters, n_layer)

(_unpack => _unpack_network?)

@IssamLaradji

IssamLaradji Aug 2, 2014

Contributor

I agree, having it outside the class results in better readability.
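
A module-level sketch of the two helpers being discussed; the shape-list arguments are illustrative (the PR itself stores start/end/shape metadata on the estimator):

import numpy as np


def _pack_network(layers_coef, layers_intercept):
    """Flatten all coefficient and intercept arrays into a single vector."""
    return np.hstack([a.ravel() for a in layers_coef + layers_intercept])


def _unpack_network(packed, coef_shapes, intercept_shapes):
    """Rebuild the per-layer arrays from the packed vector."""
    layers_coef, layers_intercept, offset = [], [], 0
    for shape in coef_shapes:
        size = int(np.prod(shape))
        layers_coef.append(packed[offset:offset + size].reshape(shape))
        offset += size
    for shape in intercept_shapes:
        size = int(np.prod(shape))
        layers_intercept.append(packed[offset:offset + size].reshape(shape))
        offset += size
    return layers_coef, layers_intercept


# Round trip for a tiny 2-10-1 network:
coefs = [np.ones((2, 10)), np.ones((10, 1))]
intercepts = [np.zeros(10), np.zeros(1)]
packed = _pack_network(coefs, intercepts)
coefs2, intercepts2 = _unpack_network(packed, [(2, 10), (10, 1)], [(10,), (1,)])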

@arjoly arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ self.layers_coef_ = None
+ self.layers_intercept_ = None
+ self.cost_ = None
+ self.n_iter_ = None
+ self.eta_ = None
+
+ def _pack(self, layers_coef_, layers_intercept_):
+ """Pack the coefficient and intercept parameters into a single vector.
+ """
+ return np.hstack([l.ravel() for l in layers_coef_ + layers_intercept_])
+
+ def _unpack(self, packed_parameters):
+ """Extract the coefficients and intercepts from packed_parameters."""
+ for i in range(self._n_layers - 1):
+ s, e, shape = self._packed_parameter_meta[i]
+ self.layers_coef_[i] = np.reshape(packed_parameters[s:e], (shape))
@arjoly

arjoly Jul 31, 2014

Owner

Dumb question, why not storing the coefficient in a sparse matrix format?

@IssamLaradji

IssamLaradji Aug 2, 2014

Contributor

Good question :P, the coefficient matrix doesn't usually have zeros - though it mostly has small values.

@arjoly

arjoly Aug 2, 2014

Owner

The packed parameters and shapes artificially look like a sparse CSC matrix with only indptr and data. Would it make sense to have only a flat array for the coefficients and an indptr array? This might avoid the need to pack and unpack parameters.

@IssamLaradji

IssamLaradji Aug 2, 2014

Contributor

Was working on this and found it to be a really cool idea that could avoid packing and unpacking the parameters. :)
But what if there is a zero element in one of the parameters? Then .data will not return all the elements, since it ignores zeros. I guess .reshape(-1,) would work.
Thanks

@IssamLaradji

IssamLaradji Aug 2, 2014

Contributor

hmmm but still. The optimizer expects concatenated, flattened coefficient parameters, yet the forward pass expects coefficients of different shapes in the form (n_samples, n_features). It seems unpacking or using reshape in every iteration is inevitable :(.

@arjoly

arjoly Aug 4, 2014

Owner

Nevertheless, with the flat array you will avoid many allocations, because you will work with memory views. Here is a small IPython example

In [1]: import numpy as np

In [2]: a = np.arange(4)

In [3]: b = np.arange(3)

In [4]: c = np.hstack([a, b])

In [5]: c.flags
Out[5]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [6]: d = c[:4]

In [7]: e = c[4:]

In [8]: d.flags
Out[8]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [9]: f = d.reshape((2, 2))

In [10]: f.flags
Out[10]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [11]: a.flags
Out[11]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

Arrays a and b own their data, and there is a copy involved when creating c.
While the flat array c owns the data, taking a slice (array d or e) and reshaping (array f) a sub-array of c doesn't create any copy.

I have played a bit with the code and the _packed_parameter_meta is not easy to use.
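
To make the idea concrete, a small sketch (layer sizes are made up) of allocating one flat parameter vector and viewing it per layer, so the optimizer and the forward pass share memory without packing and unpacking:

import numpy as np

layer_units = [4, 3, 2]   # hypothetical: 4 inputs, 3 hidden units, 2 outputs

n_params = sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:]))
packed = np.zeros(n_params)

# Reshaping a contiguous slice returns a view, not a copy.
coefs, intercepts, offset = [], [], 0
for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:]):
    coefs.append(packed[offset:offset + fan_in * fan_out].reshape(fan_in, fan_out))
    offset += fan_in * fan_out
    intercepts.append(packed[offset:offset + fan_out])
    offset += fan_out

coefs[0][:] = 1.0                  # write through the per-layer view...
assert packed[:12].sum() == 12.0   # ...and the flat vector sees the change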

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ if self.learning_rate in ("constant", "invscaling"):
+ if self.eta0 <= 0.0:
+ raise ValueError("eta0 must be > 0")
+
+ # raise ValueError if not registered
+ if self.activation not in ACTIVATIONS:
+ raise ValueError("The activation %s"
+ " is not supported. " % self.activation)
+ if self.learning_rate not in ["constant", "invscaling"]:
+ raise ValueError("learning rate %s "
+ " is not supported. " % self.learning_rate)
+ if self.algorithm not in ["sgd", "l-bfgs"]:
+ raise ValueError("The algorithm %s"
+ " is not supported. " % self.algorithm)
+
+ def _scaled_weight_init(self, fan_in, fan_out):
@arjoly

arjoly Jul 31, 2014

Owner

I would inline this function.

@arjoly arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ self._coef_grads = [0] * (self._n_layers - 1)
+ self._intercept_grads = [0] * (self._n_layers - 1)
+
+ # output for regression
+ if self.classes_ is None:
+ self._inplace_out_activation = _identity
+ # output for multi class
+ elif len(self.classes_) > 2 and self._multi_label is False:
+ self._inplace_out_activation = _inplace_softmax
+ # output for binary class and multi-label
+ else:
+ self._inplace_out_activation = _inplace_logistic_sigmoid
+
+ def _init_eta_(self):
+ """Initialize the learning rate `eta0` for SGD"""
+ self.eta_ = self.eta0
@arjoly

arjoly Jul 31, 2014

Owner

I would inline this function.

@ogrisel ogrisel commented on an outdated diff Aug 21, 2014

examples/neural_network/plot_mlp_alpha.py
+
+
+# Author: Issam H. Laradji
+# License: BSD 3 clause
+
+import numpy as np
+from matplotlib import pyplot as plt
+from matplotlib.colors import ListedColormap
+from sklearn.cross_validation import train_test_split
+from sklearn.preprocessing import StandardScaler
+from sklearn.datasets import make_moons, make_circles, make_classification
+from sklearn.neural_network import MultilayerPerceptronClassifier
+
+h = .02 # step size in the mesh
+
+alphas = np.arange(0, 2, 0.15)
@ogrisel

ogrisel Aug 21, 2014

Owner

Can you please try a wider range of alpha values with fewer intermediate steps, as done in the ELM pull request.

The range alphas = np.logspace(-4, 4, 5) looks interesting, for instance.

@ogrisel ogrisel commented on an outdated diff Aug 21, 2014

examples/neural_network/plot_mlp_alpha.py
+
+classifiers = []
+for i in alphas:
+ classifiers.append(MultilayerPerceptronClassifier(alpha=i, random_state=1))
+
+X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
+ random_state=0, n_clusters_per_class=1)
+rng = np.random.RandomState(2)
+X += 2 * rng.uniform(size=X.shape)
+linearly_separable = (X, y)
+
+datasets = [make_moons(noise=0.3, random_state=0),
+ make_circles(noise=0.2, factor=0.5, random_state=1),
+ linearly_separable]
+
+figure = plt.figure(figsize=(27, 9))
@ogrisel

ogrisel Aug 21, 2014

Owner

Please use a smaller horizontal size when the number of alphas is reduced to be consistent with the layout of the ELM pull request.

Owner

ogrisel commented Aug 21, 2014

If activation='relu' is always better or faster in your experiments I would make it the default activation function as nowadays nobody uses tanh anymore.

@arjoly arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+ # Output for regression
+ if not isinstance(self, ClassifierMixin):
+ self.out_activation_ = 'identity'
+ # Output for multi class
+ elif self.label_binarizer_.y_type_ == 'multiclass':
+ self.out_activation_ = 'softmax'
+ # Output for binary class and multi-label
+ else:
+ self.out_activation_ = 'logistic'
+
+ # Initialize coefficient and intercept layers
+ self.layers_coef_ = []
+ self.layers_intercept_ = []
+
+ for i in range(self.n_layers_ - 1):
@arjoly

arjoly Aug 22, 2014

Owner

Small suggestion

for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])

with fan_in / n_fan_in and fan_out/n_fan_out?

@arjoly

arjoly Aug 22, 2014

Owner

There are other places where you might want to do that.

Owner

arjoly commented Aug 22, 2014

a_layers: The values held by each layer except for the output layer.
deltas : constitutes a large part of the equation that computes the gradient, reflecting the amount of change required for updating the solutions in an iteration.
layers_units: contains the number of neurons for each layer. It allows for having clean loops.
coef_grad: the amount of change used to update the coefficient parameters in an iteration.
intercept_grads: the amount of change used to update the intercept parameters in an iteration.
layers_coef: the weights connecting layer i and i+1.
layers_intercept_ : the bias vector for layer i+1.
n_hidden: the number of hidden layers, not counting the input layer nor the output layer.

I would add comments for the attributes that are private.

@arjoly arjoly and 1 other commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
+ weight_init_bound,
+ fan_out))
+
+ if self.shuffle:
+ X, y = shuffle(X, y, random_state=self.random_state)
+
+ # l-bfgs does not support mini-batches
+ if self.algorithm == 'l-bfgs':
+ batch_size = n_samples
+ else:
+ batch_size = np.clip(self.batch_size, 1, n_samples)
+
+ # Initialize lists
+ self._a_layers = [X]
+ self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1))
@arjoly

arjoly Aug 22, 2014

Owner

You might want to do

self._a_layers.extend(np.empty((batch_size, n_fan_out))
                      for n_fan_out in layer_units[1:])

@arjoly arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+ if self.shuffle:
+ X, y = shuffle(X, y, random_state=self.random_state)
+
+ # l-bfgs does not support mini-batches
+ if self.algorithm == 'l-bfgs':
+ batch_size = n_samples
+ else:
+ batch_size = np.clip(self.batch_size, 1, n_samples)
+
+ # Initialize lists
+ self._a_layers = [X]
+ self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1))
+ self._deltas = [np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1)]
@arjoly

arjoly Aug 22, 2014

Owner

You might want to do

self._deltas = [np.empty_like(a_layer) for a_layer in self._a_layers[:-1]]

@arjoly arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
+ # l-bfgs does not support mini-batches
+ if self.algorithm == 'l-bfgs':
+ batch_size = n_samples
+ else:
+ batch_size = np.clip(self.batch_size, 1, n_samples)
+
+ # Initialize lists
+ self._a_layers = [X]
+ self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1))
+ self._deltas = [np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1)]
+ self._coef_grads = [np.empty((layer_units[i], layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1)]
+ self._intercept_grads = [np.empty(layer_units[i + 1])
+ for i in range(self.n_layers_ - 1)]
@arjoly

arjoly Aug 22, 2014

Owner

You might want to do

self._intercept_grads = [np.empty(fan_out) for fan_out in layer_units[1:]]

@arjoly arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
+ X, y = shuffle(X, y, random_state=self.random_state)
+
+ # l-bfgs does not support mini-batches
+ if self.algorithm == 'l-bfgs':
+ batch_size = n_samples
+ else:
+ batch_size = np.clip(self.batch_size, 1, n_samples)
+
+ # Initialize lists
+ self._a_layers = [X]
+ self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1))
+ self._deltas = [np.empty((batch_size, layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1)]
+ self._coef_grads = [np.empty((layer_units[i], layer_units[i + 1]))
+ for i in range(self.n_layers_ - 1)]
@arjoly

arjoly Aug 22, 2014

Owner

You might want to do:

self._coef_grads = [np.empty((fan_in, fan_out)) for fan_in, fan_out 
                    in zip(layer_units[:-1], layer_units[1:])]

@arjoly arjoly and 2 others commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
+ # Run the Stochastic Gradient Descent algorithm
+ if self.algorithm == 'sgd':
+ prev_cost = np.inf
+ cost_increase_count = 0
+
+ for i in range(self.max_iter):
+ for batch_slice in gen_batches(n_samples, batch_size):
+ self._a_layers[0] = X[batch_slice]
+ self.cost_ = self._backprop(X[batch_slice], y[batch_slice])
+
+ # update weights
+ for i in range(self.n_layers_ - 1):
+ self.layers_coef_[i] -= (self.learning_rate_ *
+ self._coef_grads[i])
+ self.layers_intercept_[i] -= (self.learning_rate_ *
+ self._intercept_grads[i])
@arjoly

arjoly Aug 22, 2014

Owner

Those lines look like a daxpy (BLAS level 1).

@arjoly

arjoly Aug 22, 2014

Owner

(dumb question) Is there interest in having a varying number of neurons in the hidden layers?
If not, those lines could be written as a BLAS level 2 function.

@IssamLaradji

IssamLaradji Aug 23, 2014

Contributor

@arjoly, if we do write these lines as blas level 2 functions, should we then force the user not to change n_hidden?

@arjoly

arjoly Aug 26, 2014

Owner

Probably yes.

@larsmans

larsmans Sep 30, 2014

Owner

Shall we leave the code bumming for later and get this thing in a mergeable state?

Contributor

IssamLaradji commented Aug 22, 2014

If activation='relu' is always better or faster in your experiments

great! I will post a benchmark showing the convergence speed and accuracy on different sizes of MNIST for relu against tanh.

Contributor

IssamLaradji commented Aug 23, 2014

Finally got a sufficiently powerful computer to work on the comments :-).

These are the benchmark results on the digits dataset using 3-fold cross-validation,

n_hidden= [50, 25, 10]
Testing score for  relu: 0.9488, time: 3.76
Testing score for  tanh: 0.9310, time: 6.05

n_hidden= [150, 100]
Testing score for relu: 0.9711, time: 4.78
Testing score for tanh: 0.9694, time: 3.43

n_hidden= [50, 25]
Testing score for relu: 0.9627, time: 1.52
Testing score for tanh: 0.9471, time: 2.64

n_hidden=[50, 100]
Testing score for relu: 0.9677, time: 1.88
Testing score for tanh: 0.9605, time: 2.93

@ogrisel, like you said, relu is not always faster than tanh, as tanh sometimes converges faster. But my experimental results showed that relu consistently achieves a higher score than tanh.


IssamLaradji reopened this Aug 23, 2014


Owner

ogrisel commented Aug 25, 2014

But, my experimental results showed that relu consistently achieves higher score than tanh.

The optimal values for the other hyperparameters (in particular the regularization) are probably not the same for relu and tanh. Can you please try to run a small grid search for the optimal value of alpha when n_hidden=[150, 100]?
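
A sketch of such a search on the digits dataset, using the estimator name and n_hidden parameter from this branch (not in a released scikit-learn); the GridSearchCV import path is the one in use at the time, it later moved to sklearn.model_selection:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV
from sklearn.neural_network import MultilayerPerceptronClassifier

digits = load_digits()
X, y = digits.data / 16., digits.target   # scale pixel counts to [0, 1]

param_grid = {'alpha': np.logspace(-4, 4, 5),
              'activation': ['relu', 'tanh']}
search = GridSearchCV(
    MultilayerPerceptronClassifier(n_hidden=[150, 100], random_state=0),
    param_grid, cv=3)
search.fit(X, y)
print(search.grid_scores_)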

@ogrisel ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
+ >>> clf.predict([[2., 2.], [-1., -2.]])
+ array([1, 0])
+
+MLP can fit a non-linear model to the training data. ``clf.layers_coef_``
+contains the weight matrices that constitute the model parameters::
+
+ >>> [coef.shape for coef in clf.layers_coef_]
+ [(2, 5), (5, 2), (2, 1)]
+
+To get the raw values before applying the output activation function, run the
+following command,
+
+use :meth:`MultilayerPerceptronClassifier.decision_function`::
+
+ >>> clf.decision_function([[2., 2.], [1., 2.]])
+ array([ 11.55408143, 11.55408143])
@ogrisel

ogrisel Aug 25, 2014

Owner

You should use the ellipsis feature of doctests to have this test pass on all the Travis platforms:


>>> clf.decision_function([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
array([ 11.55..., 11.55...])

@ogrisel ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
+use :meth:`MultilayerPerceptronClassifier.decision_function`::
+
+ >>> clf.decision_function([[2., 2.], [1., 2.]])
+ array([ 11.55408143, 11.55408143])
+
+Currently, :class:`MultilayerPerceptronClassifier` supports only the
+Cross-Entropy loss function, which allows probability estimates by running the
+``predict_proba`` method.
+
+MLP trains using backpropagation. For classification, it minimizes the
+Cross-Entropy loss function, giving a vector of probability estimates
+:math:`P(y|x)` per sample :math:`x`::
+
+ >>> clf.predict_proba([[2., 2.], [1., 2.]])
+ array([[ 9.59670230e-06, 9.99990403e-01],
+ [ 9.59670230e-06, 9.99990403e-01]])
@ogrisel

ogrisel Aug 25, 2014

Owner

Please use doctest ellipsis here again:

>>> clf.predict_proba([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
array([[ 9.59...e-06, 9.99...e-01],
       [ 9.59...e-06, 9.99...e-01]])

@ogrisel ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
+a one hidden layer MLP.
+
+.. figure:: ../images/multilayerperceptron_network.png
+ :align: center
+ :scale: 60%
+
+ **Figure 1 : One hidden layer MLP.**
+
+The leftmost layer, known as the input layer, consists of a set of neurons
+:math:`\{x_i | x_1, x_2, ..., x_m\}` representing the input features. Each hidden
+layer transforms the values from the previous layer by a weighted linear summation
+:math:`w_1x_1 + w_2x_2 + ... + w_mx_m`, followed by a non-linear activation function
+:math:`g(\cdot):R \rightarrow R` - like the hyperbolic tan function. The output layer
+receives the values from the last hidden layer and transforms them into output values.
+
+The module contains the public attributes :math:`layers_coef_` and :math:`layers_intecept_`.
@ogrisel

ogrisel Aug 25, 2014

Owner

You should use double backticks quoting for layers_coef_ & layers_intercept_ (there is also a typo here).

Contributor

IssamLaradji commented Aug 26, 2014

These are the results of the grid search on the digits dataset for alphas=np.logspace(-4, 4, 5).
(The first line represents ReLU scores and the second line represents tanh scores.)

For n_hidden=[150, 100]

relu : [mean: 0.97106, std: 0.00644, params: {'alpha': 0.0001}, mean: 0.96717, std: 0.00479, params: {'alpha': 0.01}, mean: 0.95771, std: 0.01032, params: {'alpha': 1.0}, mean: 0.96383, std: 0.00416, params: {'alpha': 100.0}, mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]
tanh : [mean: 0.97051, std: 0.00208, params: {'alpha': 0.0001}, mean: 0.98386, std: 0.00416, params: {'alpha': 0.01}, mean: 0.98331, std: 0.00361, params: {'alpha': 1.0}, mean: 0.96049, std: 0.00343, params: {'alpha': 100.0}, mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]

For n_hidden=[100, 50]

relu : [mean: 0.96439, std: 0.00208, params: {'alpha': 0.0001}, mean: 0.97551, std: 0.00928, params: {'alpha': 0.01}, mean: 0.98108, std: 0.00672, params: {'alpha': 1.0}, mean: 0.96327, std: 0.00472, params: {'alpha': 100.0}, mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]
tanh : [mean: 0.96605, std: 0.00479, params: {'alpha': 0.0001}, mean: 0.98386, std: 0.00284, params: {'alpha': 0.01}, mean: 0.98331, std: 0.00273, params: {'alpha': 1.0}, mean: 0.96550, std: 0.00630, params: {'alpha': 100.0}, mean: 0.10184, std: 0.00000, params: {'alpha': 10000.0}]

It doesn't seem like ReLU performs better on average. I think I should test it on bigger datasets with a larger number of layers. Perhaps that's where ReLU shines.


pasky commented Sep 4, 2014

Hi! I'm sorry to chime in as an external party - I've been watching this PR eagerly for quite some time now, and am admittedly a little disappointed it hasn't been merged yet. I was just wondering if there's a concrete list of TODO items that must be wrapped up before this can be merged, I suppose mainly regarding the API, which will stay more or less set in stone at that point? Maybe I could help with some of the items if Issam Laradji is busy with other things... (I wonder if precise tuning of the activation function needs to be figured out before this is merged?)

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 4, 2014

benchmarks/bench_mnist.py
+
+
+def load_data(dtype=np.float32, order='F'):
+ # Load dataset
+ print("Loading dataset...")
+ data = fetch_mldata('MNIST original')
+ X, y = data.data, data.target
+ if order.lower() == 'f':
+ X = np.asfortranarray(X)
+
+ # Normalize features
+ X = X.astype('float64')
+ X = X / 255
+
+ # Create train-test split (as [Joachims, 2006])
+ logger.info("Creating train-test split...")
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

We do not use the logger (there is a pull request on it, but it got stalled). You should simply use a print controlled by a 'verbose' argument.

@GaelVaroquaux GaelVaroquaux commented on the diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
+
+def test_verbose_sgd():
+ """Test verbose."""
+ X = [[3, 2], [1, 6]]
+ y = [1, 0]
+ clf = MultilayerPerceptronClassifier(algorithm='sgd',
+ max_iter=2,
+ verbose=10,
+ n_hidden=2)
+ old_stdout = sys.stdout
+ sys.stdout = output = StringIO()
+
+ clf.fit(X, y)
+ clf.partial_fit(X, y)
+
+ sys.stdout = old_stdout
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

This should be done in the 'finally' of a try/finally block, so that even if there is an exception, the stdout gets restored.
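
A sketch of the suggested pattern, reusing the test's own setup (estimator name and parameters as quoted in the diff above):

import sys
from io import StringIO

from sklearn.neural_network import MultilayerPerceptronClassifier


def test_verbose_sgd():
    X = [[3, 2], [1, 6]]
    y = [1, 0]
    clf = MultilayerPerceptronClassifier(algorithm='sgd', max_iter=2,
                                         verbose=10, n_hidden=2)
    old_stdout = sys.stdout
    sys.stdout = output = StringIO()
    try:
        clf.fit(X, y)
        clf.partial_fit(X, y)
    finally:
        # Restore stdout even if fit raises, so later tests are unaffected.
        sys.stdout = old_stdout
    assert output.getvalue() != ''  # assumes verbose mode writes progress to stdout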

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
+ random_state=1,
+ batch_size=X.shape[0])
+ for i in range(150):
+ mlp.partial_fit(X, y)
+
+ pred2 = mlp.predict(X)
+ assert_almost_equal(pred1, pred2, decimal=2)
+ score = mlp.score(X, y)
+ assert_greater(score, 0.75)
+
+
+def test_partial_fit_errors():
+ """Test partial_fit error handling."""
+ X = [[3, 2], [1, 6]]
+ y = [1, 0]
+ clf = MultilayerPerceptronClassifier
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

You don't need to define the clf intermediate variable here. It mostly hinders readability.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
+from sklearn.neural_network import MultilayerPerceptronRegressor
+from sklearn.preprocessing import LabelBinarizer
+from sklearn.preprocessing import StandardScaler, MinMaxScaler
+from scipy.sparse import csr_matrix
+from sklearn.utils.testing import assert_raises, assert_greater, assert_equal
+
+
+np.seterr(all='warn')
+
+LEARNING_RATE_TYPES = ["constant", "invscaling"]
+
+ACTIVATION_TYPES = ["logistic", "tanh", "relu"]
+
+digits_dataset_multi = load_digits(n_class=3)
+
+Xdigits_multi = MinMaxScaler().fit_transform(digits_dataset_multi.data[:200])
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

There should be an underscore between words: X_digits_multi, and y_digits_multi.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
+ The predicted probability of the sample for each class in the
+ model, where classes are ordered as they are in `self.classes_`.
+ """
+ y_scores = self.decision_function(X)
+
+ if y_scores.ndim == 1:
+ y_scores = logistic(y_scores)
+ return np.vstack([1 - y_scores, y_scores]).T
+ else:
+ return softmax(y_scores)
+
+
+class MultilayerPerceptronRegressor(BaseMultilayerPerceptron, RegressorMixin):
+ """Multi-layer Perceptron regressor.
+
+ Under a loss function, the algorithm trains either by l-bfgs or gradient
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

All these paragraphs (everything aside from the first sentence of the docstring) should be moved to a 'Notes' section at the end of the docstring.

@ogrisel

ogrisel Oct 2, 2014

Owner

I think it's good to have a generic overview in one or two paragraphs here. The list of parameters and attributes is long and people will not necessarily think to scroll down to the end just to get the big picture on how this estimator works.

Owner

GaelVaroquaux commented Sep 4, 2014

Naive question: I note that the default activation function is the relu, and the default algorithm the LBFGS. I had in mind that LBFGS was very bad with the relu, because it is not smooth. Am I wrong?

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
+ """Multi-layer Perceptron regressor.
+
+ Under a loss function, the algorithm trains either by l-bfgs or gradient
+ descent. The training is iterative, in that at each time step the
+ partial derivatives of the loss function with respect to the model
+ parameters are computed to update the parameters.
+
+ It has a regularizer as a penalty term added to the loss function that
+ shrinks model parameters towards zero.
+
+ This implementation works with data represented as dense and sparse numpy
+ arrays of floating point values for the features.
+
+ Parameters
+ ----------
+ n_hidden : python list, length = n_layers - 2, default [100]
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

Just write 'list', and not 'python list'.

Contributor

IssamLaradji commented Sep 4, 2014

Hi @pasky ,
The requirements are in fact complete, but there might be some minor required changes (like naming or code conventions) that would come out from final reviews by the mentors.

I can't wait to get this merged as well :). I will address the comments as they come along, but you are welcome to help. Thanks.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ The input data.
+
+ y : array-like, shape (n_samples,)
+ The target values.
+
+ Returns
+ -------
+ self : returns a trained MLP model.
+ """
+ if self.algorithm != 'sgd':
+ raise ValueError("only SGD algorithm supports partial fit")
+
+ return self._fit(X, y, incremental=True)
+
+ def _decision_scores(self, X):
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

Stupid question, but isn't this the 'decision_function', according to scikit-learn's definition?

In that case, it should be named accordingly.

@IssamLaradji

IssamLaradji Oct 6, 2014

Contributor

The difference is that decision_function returns a raveled form of y_pred for the classifier when
self.n_outputs=1. This is not the case with the regressor.

@GaelVaroquaux GaelVaroquaux and 3 others commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+ # For the last layer
+ if with_output_activation:
+ out_activation = ACTIVATIONS[self.out_activation_]
+ self._a_layers[i + 1] = out_activation(self._a_layers[i + 1])
+
+ def _compute_cost_grad(self, layer, n_samples):
+ """Compute the cost gradient for the layer."""
+ self._coef_grads[layer] = safe_sparse_dot(self._a_layers[layer].T,
+ self._deltas[layer])
+ self._coef_grads[layer] += (self.alpha * self.layers_coef_[layer])
+ self._coef_grads[layer] /= n_samples
+
+ self._intercept_grads[layer] = np.mean(self._deltas[layer], 0)
+
+ def _cost_grad_lbfgs(self, packed_coef_inter, X, y):
@GaelVaroquaux

GaelVaroquaux Sep 4, 2014

Owner

As a very general remark on the design of this object/algorithm, I must say that I am a bit uneasy with all the internal state of the algorithm, which is very hidden/implicit in the code. I have in mind _a_layers, _deltas,
_coef_grads and _intercept_grads. They make the code hard to follow, because it is not possible, by looking at a function/method call, to know what has been changed.

How feasible is it to explicitly pass them around in the code, rather than having them as attributes on the object? At least for _a_layers, as these don't seem to me to be intrinsically different from X and y, at least in terms of code organization.

@ogrisel

ogrisel Sep 30, 2014

Owner

We could make those functions have explicit arguments but might still want to keep them as pre-allocated attributes on the model itself for incremental learning with partial_fit.

@IssamLaradji

IssamLaradji Sep 30, 2014

Contributor

@ogrisel agreed, I kept them mainly for partial_fit. Is there a way I could utilize their pre-allocation advantages while improving readability? How about getters and setters?
cheers.

@ogrisel

ogrisel Oct 2, 2014

Owner

No, actually getters and setters would be even worse. @GaelVaroquaux's point is to explicitly pass the preallocated arrays for activations, gradients and deltas as arguments to private methods such as _compute_* and _backprop, so that it is explicit which data structures are involved just by reading the prototype of the methods.

Actually, reading the code of _fit again, you don't even reuse the preallocated arrays for _a_layers, _coef_grads, _intercept_grads and _deltas, even when incremental is True. There is really no point in making them attributes of the class; please convert them to local variables of the _fit method and pass them explicitly as arguments to the private helper methods.

Also, self._a_layers should be renamed activations and self._deltas should better be renamed updates.

@ogrisel

ogrisel Oct 3, 2014

Owner

... and self._deltas should better be renamed updates.

Actually scratch that, I was confused. self._deltas can be renamed deltas as it's just the difference between the current activations and the backpropagated error at that level.

@IssamLaradji

IssamLaradji Oct 6, 2014

Contributor

Done, removed the attributes. Indeed the code looks more readable now. :)

Thanks.
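
A minimal sketch of the calling convention that came out of this exchange: buffer allocation as a local helper, with the arrays then passed explicitly to the private methods (the helper name is illustrative, not the PR's code):

import numpy as np


def _init_buffers(X_batch, layer_units):
    """Allocate per-batch work buffers locally instead of storing them
    as estimator attributes."""
    batch_size = X_batch.shape[0]
    activations = [X_batch] + [np.empty((batch_size, n_out))
                               for n_out in layer_units[1:]]
    deltas = [np.empty_like(a) for a in activations[1:]]
    coef_grads = [np.empty((n_in, n_out))
                  for n_in, n_out in zip(layer_units[:-1], layer_units[1:])]
    intercept_grads = [np.empty(n_out) for n_out in layer_units[1:]]
    return activations, deltas, coef_grads, intercept_grads


# Inside _fit one would then call something like:
# cost = self._backprop(X_batch, y_batch, activations, deltas,
#                       coef_grads, intercept_grads)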

Owner

ogrisel commented Sep 30, 2014

Naive question: I note that the default activation function is the relu, and the default algorithm the LBFGS. I had in mind that LBFGS was very bad with the relu, because it is not smooth. Am I wrong?

I asked this question earlier and @IssamLaradji reported that LBFGS worked fine with ReLU despite the non-smooth kink at zero. This is rather counter-intuitive to me:

#3204 (comment)

@ogrisel ogrisel commented on an outdated diff Oct 2, 2014

sklearn/neural_network/multilayer_perceptron.py
+ Number of outputs.
+
+ `out_activation_` : string
+ Name of the output activation function.
+
+ References
+ ----------
+ Hinton, Geoffrey E.
+ "Connectionist learning procedures." Artificial intelligence 40.1
+ (1989): 185-234.
+
+ Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of
+ training deep feedforward neural networks." International Conference
+ on Artificial Intelligence and Statistics. 2010.
+ """
+ def __init__(self, n_hidden=[100], activation="relu",
@ogrisel

ogrisel Oct 2, 2014

Owner

It's considered bad practice to use mutable default values for kwargs. Either use an immutable tuple: n_hidden=(100,), or better in this case make n_hidden=100 work with literal integers (in that case the number of hidden layers would be assumed to be 1).

See: https://thenewcircle.com/static/bookshelf/python_fundamentals_tutorial/functions.html#_mutable_arguments_and_binding_of_default_values

If we want to enforce n_hidden to be a sequence of ints (list, tuple...), I would rather rename that parameter to hidden_layers_sizes=(100,) instead.
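
A small demonstration of why the mutable default is problematic, together with the tuple alternative suggested above (class names are illustrative):

class BadDefault(object):
    # The default list is created once at definition time and shared by
    # every instance that does not pass n_hidden explicitly.
    def __init__(self, n_hidden=[100]):
        self.n_hidden = n_hidden


a, b = BadDefault(), BadDefault()
a.n_hidden.append(50)
print(b.n_hidden)        # [100, 50] -- b was silently modified as well


class SafeDefault(object):
    # An immutable tuple default avoids the shared-state surprise.
    def __init__(self, hidden_layers_sizes=(100,)):
        self.hidden_layers_sizes = hidden_layers_sizes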

Owner

ogrisel commented Oct 2, 2014

Actually, members of the deep learning for speech recognition community reported that softplus(x) = log(1 + exp(x)), which is a smooth version of relu, can work significantly better (generalization performance) on some problems.

However, when I tried it in a grid search on a small subset (3000 samples) of MNIST, LBFGS seems to have no problem optimizing the non-smooth ReLU (fewer iterations than softplus and significantly faster iterations). And validation accuracy seems to be slightly better as well. Here are the results of a grid search on 3000 digits:

[mean: 0.89889, std: 0.01571, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.89722, std: 0.00550, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 0.0001},
 mean: 0.89333, std: 0.01434, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 1e-08},
 mean: 0.89056, std: 0.01853, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 0.0001},
 mean: 0.88833, std: 0.01312, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.88722, std: 0.00685, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 1e-08},
 mean: 0.88556, std: 0.01530, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 1e-08},
 mean: 0.88333, std: 0.01650, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 0.0001},
 mean: 0.88278, std: 0.01235, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 9.9999999999999995e-07},
 mean: 0.88222, std: 0.01577, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 0.0001},
 mean: 0.88111, std: 0.00906, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 1e-08},
 mean: 0.88111, std: 0.00671, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 1e-08},
 mean: 0.88111, std: 0.00685, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 0.0001},
 mean: 0.87833, std: 0.01534, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 0.0001},
 mean: 0.87778, std: 0.01517, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.87667, std: 0.00624, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 9.9999999999999995e-07},
 mean: 0.87444, std: 0.01612, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 1e-08},
 mean: 0.87111, std: 0.01370, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 9.9999999999999995e-07}]

Here is my implementation of softplus:

diff --git a/sklearn/neural_network/base.py b/sklearn/neural_network/base.py
index 114bba5..e7790b3 100644
--- a/sklearn/neural_network/base.py
+++ b/sklearn/neural_network/base.py
@@ -71,8 +71,12 @@ def relu(X):
     X_new : {array-like, sparse matrix}, shape (n_samples, n_features)
         The transformed data.
     """
-    np.clip(X, 0, np.finfo(X.dtype).max, out=X)
-    return X
+    return np.clip(X, 0, np.finfo(X.dtype).max, out=X)
+
+
+def softplus(X):
+    # log(1 + exp(X))
+    return np.logaddexp(X, 0, out=X)


 def softmax(X):
@@ -96,7 +100,7 @@ def softmax(X):


 ACTIVATIONS = {'identity': identity, 'tanh': tanh, 'logistic': logistic,
-               'relu': relu, 'softmax': softmax}
+               'relu': relu, 'softmax': softmax, 'softplus': softplus}


 def logistic_derivative(Z):
@@ -148,7 +152,7 @@ def relu_derivative(Z):


 DERIVATIVES = {'tanh': tanh_derivative, 'logistic': logistic_derivative,
-               'relu': relu_derivative}
+               'relu': relu_derivative, 'softplus': logistic}

@ogrisel ogrisel commented on an outdated diff Oct 3, 2014

sklearn/neural_network/multilayer_perceptron.py
+ for batch_slice in gen_batches(n_samples, batch_size):
+ self._a_layers[0] = X[batch_slice]
+ self.cost_ = self._backprop(X[batch_slice], y[batch_slice])
+
+ # update weights
+ for i in range(self.n_layers_ - 1):
+ self.layers_coef_[i] -= (self.learning_rate_ *
+ self._coef_grads[i])
+ self.layers_intercept_[i] -= (self.learning_rate_ *
+ self._intercept_grads[i])
+
+ if self.learning_rate == 'invscaling':
+ self.learning_rate_ = self.learning_rate_init / \
+ (self.n_iter_ + 1) ** self.power_t
+
+ self.n_iter_ += 1
@ogrisel

ogrisel Oct 3, 2014

Owner

self.n_iter_ currently does not have the same meaning for LBFGS and SGD: this is a source of confusion.

We should instead introduce a new variable (for instance self.t_, to be consistent with SGDClassifier) for the learning rate schedule: the total number of samples that were used to train the model, irrespective of the fact that some samples might have been seen several times when training with several passes over a finite training set.

self.n_iter_, on the other hand, should always reflect the number of "epochs", that is, the number of passes over the full training set. This attribute should not be set when the user calls the partial_fit method, as we don't know the total size of the full training set in that case.
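
A standalone sketch of the suggested bookkeeping, with plain variables standing in for the self.t_ and self.n_iter_ attributes:

from sklearn.utils import gen_batches

n_samples, batch_size, max_iter = 10, 4, 3
learning_rate_init, power_t = 0.5, 0.5

t = 0        # total samples seen; drives the 'invscaling' schedule
n_iter = 0   # epochs, i.e. full passes over the training set

for epoch in range(max_iter):
    for batch_slice in gen_batches(n_samples, batch_size):
        t += batch_slice.stop - batch_slice.start
        learning_rate = learning_rate_init / t ** power_t
    n_iter += 1   # incremented once per epoch, not once per minibatch

print(t, n_iter)   # 30 samples seen over 3 epochs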

Owner

ogrisel commented Oct 3, 2014

Please refactor the _fit method to call a submethod per optimizer. For instance, at the end of _fit:

fit_method = getattr(self, '_fit_' + self.algorithm, None)
if fit_method is None:
    raise ValueError('algorithm="%s" is not supported by %s'
                     % (self.algorithm, type(self).__name__))
fit_method(X, y, activations, deltas)

Maybe also pass to fit_method additional datastructures initialized in _fit that I might have missed.

This should make it easier for the user to derive from MultilayerPerceptronClassifier or MultilayerPerceptronRegressor to implement more experimental optimizers (e.g. Adadelta for instance).

Owner

ogrisel commented Oct 3, 2014

Also please raise a ConvergenceWarning when the tol based convergence criterion is not met prior to reaching max_iter in the fit method (do not do that for partial_fit).

Owner

ogrisel commented Oct 3, 2014

To get examples of ConvergenceWarning usage in scikit-learn, run git grep ConvergenceWarning.
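
A sketch of the kind of check meant here; the helper name is illustrative, and note that at the time of this PR ConvergenceWarning was importable from sklearn.utils (it now lives in sklearn.exceptions):

import warnings
from sklearn.exceptions import ConvergenceWarning


def _warn_if_not_converged(n_iter, max_iter, loss_delta, tol):
    """Warn when max_iter was exhausted before the tol criterion was met."""
    if n_iter >= max_iter and loss_delta > tol:
        warnings.warn("Maximum iterations (%d) reached and the optimization"
                      " hasn't converged yet." % max_iter,
                      ConvergenceWarning)


_warn_if_not_converged(n_iter=200, max_iter=200, loss_delta=1e-3, tol=1e-5)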

Owner

ogrisel commented Oct 3, 2014

I started to summarize the remaining work in a todo list at the top of the PR.

Owner

amueller commented Dec 4, 2014

@IssamLaradji are you working on refactoring the _fit method? Otherwise I'd be happy to help.

Owner

amueller commented Dec 4, 2014

Btw, did you try to do the MNIST bench with SGD? That should be quite a bit faster. I didn't get it to work though :-/

Owner

ogrisel commented Dec 5, 2014

I could not make it work either. I suspect a bug in the SGD solver. Also, we should add Nesterov momentum to the SGD solver with momentum=0.9 by default, as otherwise there can be problems where SGD converges too slowly.
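
For reference, a minimal standalone sketch of the classical (Nesterov) momentum update being proposed; names and the exact formulation are illustrative, not this PR's code:

import numpy as np


def momentum_update(w, grad, velocity, learning_rate=0.1, momentum=0.9,
                    nesterov=True):
    """One SGD step with momentum; grad is the gradient of the loss w.r.t. w."""
    velocity *= momentum
    velocity -= learning_rate * grad
    if nesterov:
        # Look-ahead variant: apply the momentum term once more on top of
        # the plain gradient step.
        w += momentum * velocity - learning_rate * grad
    else:
        w += velocity
    return w, velocity


w, velocity = np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
w, velocity = momentum_update(w, grad, velocity)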

Contributor

IssamLaradji commented Dec 5, 2014

Hi @amueller , it would be great if you help! Thanks :)

It seems 'SGD' is not running as expected, let me double check the gradients.

+1 for momentum.

Owner

amueller commented Dec 5, 2014

I agree about momentum. I'll see if I can make it work and I'll submit a parallel PR.

Owner

amueller commented Dec 5, 2014

I find it a bit confusing that coef_grads and intercept_grads are both modified in-place and returned by the functions operating on them. What is the reason for that?

Contributor

IssamLaradji commented Dec 5, 2014

@amueller , you mean this ?

 # Compute gradient for the last layer
        coef_grads, intercept_grads = self._compute_cost_grad(last, n_samples,
                                                              activations,
                                                              deltas,
                                                              coef_grads,
                                                              intercept_grads)

The reason is that I am calling self._compute_cost_grad twice; once for the output layer, and once in the for loop operating on the hidden layers.

I guess it would be more readable if I combined the output layer and the hidden layers in one for loop? That way I wouldn't need the _compute_cost_grad method.

Owner

amueller commented Dec 5, 2014

The gradients are fine, I think. I forgot to shuffle MNIST :-/ Now it looks good.
Maybe we want to set shuffle=True by default? It is so cheap compared to the backprop.

amueller referenced this pull request Dec 5, 2014

Closed

[MRG] Mlp finishing touches #3939

15 of 22 tasks complete
Owner

amueller commented Dec 5, 2014

@IssamLaradji That was the place I meant. Sorry, I don't understand your explanation. Would the behavior of the code change if you discarded the return value of _compute_cost_grad?

Contributor

IssamLaradji commented Dec 5, 2014

@amueller oh I thought you meant something else.

It wouldn't change the behavior. I could discard the return value and the "left-hand side" of the equation coef_grads, intercept_grads = _compute_cost_grad(...), and the results will remain the same.

Also, +1 for setting shuffle=True as default.

Owner

amueller commented Dec 17, 2014

Training time for bench_mnist.py is twice as high on my box as what you gave, but only for the MLP; the others have comparable speed. Could you try to run it again with the current parameters and see if it is still the same for you? How many cores do you have?

Contributor

IssamLaradji commented Dec 18, 2014

Strange, I ran it again now and I got,

Classifier                         train-time           test-time                   error-rate   
----------------------------------------------------------------------------------------------
MultilayerPerceptron                 364.75999999        0.088                      0.0178     

which is half the original training time. Are you training using lbfgs or sgd? lbfgs tends to converge faster.

My machine is equipped with 8 GB of RAM and an Intel® Core™ i7-2630QM processor (6M cache, 2.00 GHz).

Owner

ogrisel commented Dec 18, 2014

lbfgs tends to converge faster.

In my PR against @amueller's branch with the enhanced "constant" learning rate and momentum, SGD seems to be faster than LBFGS, although I have not plotted the "validation score vs epoch" curve as we have no way to do so at the moment.

Owner

amueller commented Dec 18, 2014

I ran exactly the same code, so lbfgs. I think we should definitely do SGD, as it should be much faster on MNIST.

👍

Contributor

digital-dharma commented Apr 7, 2015

Excellent work to all, and an exciting feature to be added to sklearn!

I have been looking forward to this functionality for a while - does it appear likely, or has momentum dissipated?

Owner

amueller commented Apr 7, 2015

It will definitely be merged, and soon.

Contributor

digital-dharma commented Apr 7, 2015

@amueller - That's fantastic news, I'm very much looking forward to it. Great work as always!

Using this a bit at the moment. Looks nice. Some notes:

  • Currently if y in MultilayerPerceptronRegressor.fit() is a vector (dimensions (n,)), .predict() returns a 2d array with dimensions (n,1). Other regressors just return a vector in the same format as y.
  • That's a really long class name. Could it be MLPRegressor instead, similar to SGDRegressor? That abbreviation is common enough, I think (it's on the Wikipedia disambiguation page, comes first in a Google search for 'MLP learning', and I don't think people will get it confused with My Little Pony).
Owner

amueller commented May 1, 2015

@naught101 It is a long name... maybe we should use MLP. Can you check if the shape is still wrong in #3939?

@amueller: Yes, the shape is still wrong.

Owner

amueller commented May 4, 2015

Huh, I wonder why the common tests don't complain.

Owner

amueller commented May 4, 2015

Thanks for checking.

Owner

amueller commented Oct 23, 2015

Merged via #5214

amueller closed this Oct 23, 2015

Contributor

IssamLaradji commented Oct 23, 2015

Waw!! That's fantastic!! :) :) Great work team!

Thank you to everyone who worked on this. It will be really useful.

pasky commented Oct 24, 2015

Yes, thank you very much! I've been waiting for this for a long time. (And sorry that I never ended up making good on my offer to help.)

Owner

jnothman commented Oct 24, 2015

Waw!! That's fantastic!! :) :) Great work team!

Yes, aren't sprints amazing from the outside? Dormant threads are suddenly
marked merged and that project you'd been trying to complete forever is now
off your todo list and you're ready to book a holiday...

Thank you to all the sprinters from those of us on the outside, it's been a
good one!

On 24 October 2015 at 20:04, Petr Baudis notifications@github.com wrote:

Yes, thank you very much! I've been waiting for this for a long time.
(And sorry that I never ended up making good on my offer to help.)



Contributor

IssamLaradji commented Oct 25, 2015

@jnothman indeed! It's a great surprise to see it merged, as I felt this would stay dormant for a much longer time.

Thanks a lot for your great reviews and effort team!!
