# [MRG] Generic multi layer perceptron #3204

Closed · wants to merge 41 commits

## Conversation

Contributor

### IssamLaradji commented May 27, 2014

This pull request implements the generic Multi-layer Perceptron as part of the GSoC 2014 proposal; the expected time to finish it is June 15. The goal is to extend the Multi-layer Perceptron to support more than one hidden layer (I am currently implementing `layers_coef_` to allow for any number of hidden layers), to support a pre-training phase (initializing weights through Restricted Boltzmann Machines) and a fine-tuning phase, and to write its documentation. This directly follows from pull request #2120.

TODO:

- replace private attributes initialized in `_fit` by local variables and pass them as arguments to private helper methods, to make the code more readable and to reduce pickled model size by not storing things that are unnecessary at prediction time;
- refactor the `_fit` method to call into submethods for the different algorithms;
- introduce `self.t_` to store the SGD learning-rate progress and decouple it from `self.n_iter_`, which should consistently track epochs;
- issue a `ConvergenceWarning` whenever `max_iter` is reached when calling `fit`.


Owner

### larsmans commented May 27, 2014

 What's the todo list for this one?
Contributor

### IssamLaradji commented May 27, 2014

Hi @larsmans, the todo list is:

- it should support more than one hidden layer, so there would be one generic layer list `layers_coef_`;
- it should support weight initialization using trained Restricted Boltzmann Machines, like the one proposed by Hinton et al. (2006): http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf
Owner

### ogrisel commented May 27, 2014

For the weight init, I would just use a `warm_start=True` constructor param and let the user set the `layers_coef_` and `layers_intercept_` attributes manually, as done for other existing models such as SGDClassifier for instance.
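For illustration, the warm-start idea can be sketched in plain NumPy. `StubMLP` below is a hypothetical stand-in, not the PR's estimator; only the `warm_start`, `layers_coef_`, and `layers_intercept_` names come from the discussion above:

```python
import numpy as np

class StubMLP(object):
    """Hypothetical stand-in for the proposed MLP class, shown only to
    illustrate the warm-start attributes discussed in this thread."""
    def __init__(self, warm_start=False):
        self.warm_start = warm_start
        self.layers_coef_ = None
        self.layers_intercept_ = None

# layer sizes: 64 input features, one 32-unit hidden layer, 10 outputs
sizes = [64, 32, 10]
mlp = StubMLP(warm_start=True)

# weights obtained elsewhere (e.g. extracted from a trained RBM stack)
mlp.layers_coef_ = [np.zeros((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
mlp.layers_intercept_ = [np.zeros(n) for n in sizes[1:]]

# a later fit(X, y) with warm_start=True would start from these weights
print([w.shape for w in mlp.layers_coef_])  # [(64, 32), (32, 10)]
```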
Owner

### jnothman commented May 27, 2014

Out of curiosity, does RBM initialisation mean that fit may be provided with some unlabelled samples?
Contributor

### IssamLaradji commented May 27, 2014

@ogrisel should we include another parameter - like unsupervised_weight_init_ - that runs an RBM (or any unsupervised learning algorithm) to initialize the layer weights? I believe `warm_start` starts training with the previously trained weights but does not necessarily use an unsupervised learning algorithm for weight initialization. @jnothman yes, an RBM trains on the unlabeled samples and its new, trained weights become the initial weights of the corresponding layer in the multi-layer perceptron.
Owner

### larsmans commented May 27, 2014

 I think we can leave the RBM init to a separate PR.
Contributor

### IssamLaradji commented May 27, 2014

 @larsmans sure thing :) For the travis build, I believe the error is coming from OrthogonalMatchingPursuitCV, given in line 5442
Owner

### ogrisel commented May 27, 2014

+1 for leaving the RBM init to a separate PR. Also, no need to couple the two models: just extract the weights from a pipeline of RBMs and manually stick them in as `layers_coef_` of an MLP with `warm_start=True`, then call fit with the labels for fine tuning.

> For the travis build, I believe the error is coming from OrthogonalMatchingPursuitCV, given in line 5442

Not only: the other builds have failed because the doc tests don't pass either, as I told you earlier in the previous PR.

### ogrisel and 2 others commented on an outdated diff May 27, 2014

benchmarks/bench_mnist.py
```diff
+=======================
+
+Benchmark multi-layer perceptron, Extra-Trees, linear svm
+with kernel approximation of RBFSampler and Nystroem
+on the MNIST dataset. The dataset comprises 70,000 samples
+and 784 features. Here, we consider the task of predicting
+10 classes - digits from 0 to 9. The experiment was run in
+a computer with a Desktop Intel Core i7, 3.6 GHZ CPU,
+operating the Windows 7 64-bit version.
+
+    Classification performance:
+    ===========================
+    Classifier             train-time  test-time  error-rate
+    ------------------------------------------------------
+    nystroem_approx_svm      124.819s     0.811s      0.0242
+    MultilayerPerceptron     359.460s     0.217s      0.0271
```

#### ogrisel May 27, 2014

Owner

Isn't it possible to find hyperparameter values to reach better accuracy with tanh activations? It should be possible to go below 2% error rate with a vanilla MLP on MNIST.

#### jnothman May 27, 2014

Owner

I assumed you intended to have additional unlabelled data, but perhaps
working out the best way to incorporate the unlabelled data into the
fitting procedure (particularly if you support partial_fit) might be a big
question of its own. So I'm +1 for delaying that decision :)


Contributor

#### IssamLaradji May 27, 2014

@ogrisel I just got the error rate down to 0.017 :)
(fixed an issue with the tanh derivative - it didn't pass the gradient test until now)

@jnothman indeed, better to make RBM pipelining a separate PR

#### ogrisel May 27, 2014

Owner

Glad you found the source of the problem, it's great to have unit tests that check the correctness of the gradient!

Contributor

### IssamLaradji commented Jun 9, 2014

Hi guys, I made some major changes.

- The algorithm now supports more than one hidden layer, by simply putting a list of values in the `n_hidden` parameter. For example, for 3 hidden layers where the first and second layers have 100 neurons and the 3rd has 50 neurons, the list would be `n_hidden = [100, 100, 50]`.
- I improved the speed of the implementation by more than 25% by removing a redundant loop.
- I improved the documentation by making it more comprehensive.

Your feedback will be greatly appreciated. Thank you! :)
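As a quick sketch (not the PR's code), the weight-matrix shapes implied by `n_hidden = [100, 100, 50]` on MNIST-sized input can be computed by chaining the layer sizes:

```python
n_features, n_outputs = 784, 10   # MNIST-sized input, 10 digit classes
n_hidden = [100, 100, 50]

# each weight matrix connects consecutive layer sizes
layer_sizes = [n_features] + n_hidden + [n_outputs]
shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))
print(shapes)  # [(784, 100), (100, 100), (100, 50), (50, 10)]
```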

### coveralls commented Jun 10, 2014

 Coverage increased (+0.16%) when pulling 2e8dc56 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### ogrisel commented Jun 10, 2014

 @IssamLaradji great work! I will try to review in more details soon. Maybe @jaberg and @kastnerkyle might be interested in reviewing this as well. Can you please fix the remaining expit related failure under Python 3 w/ recent numpy / scipy? https://travis-ci.org/scikit-learn/scikit-learn/jobs/27179454#L5790

### ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
```diff
+:math:`i+1`. :math:`layers_intercept_` is a list of the bias vectors, where the vector
+at index :math:`i` represents the bias values that are added to layer :math:`i+1`.
+
+The advantages of Multi-layer Perceptron are:
+
+    + Capability to learn complex/non-linear models.
+
+    + Capability to learn models in real-time (on-line learning)
+      using partial_fit.
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+    + Since hidden layers in MLP make the loss function non-convex
+      - which contains more than one local minimum - random weights'
+      initialization could impact the predictive accuracy of a trained model.
```

#### ogrisel Jun 10, 2014

Owner

I would rather say: "meaning that different random initializations of the weights can lead to trained models with varying validation accuracy".

### ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
```diff
+The advantages of Multi-layer Perceptron are:
+
+    + Capability to learn complex/non-linear models.
+
+    + Capability to learn models in real-time (on-line learning)
+      using partial_fit.
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+    + Since hidden layers in MLP make the loss function non-convex
+      - which contains more than one local minimum - random weights'
+      initialization could impact the predictive accuracy of a trained model.
+
+    + MLP suffers from the Backpropagation diffusion problem; layers far from
+      the output update with decreasing momentum, leading to slow convergence.
```

#### ogrisel Jun 10, 2014

Owner

I would add: with squashing activation functions such as the logistic sigmoid and the tanh function. If we implement linear bottlenecks (the identity function) and ReLU later, this problem might no longer hold.
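The diffusion (vanishing-gradient) effect mentioned here can be seen from the derivative bounds alone. A rough numeric sketch (not from the PR): the logistic derivative peaks at 0.25, so an upper bound on the gradient scale surviving many squashing layers shrinks geometrically, whereas ReLU's derivative is exactly 1 on its active region:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)

sig = 1.0 / (1.0 + np.exp(-x))
d_sig = sig * (1.0 - sig)          # logistic derivative, peaks at 0.25
d_tanh = 1.0 - np.tanh(x) ** 2     # tanh derivative, peaks at 1.0
d_relu = (x > 0).astype(float)     # ReLU derivative: exactly 1 when active

# upper bound on the gradient scale surviving 10 layers of each activation
print(d_sig.max() ** 10)   # ~1e-6: gradients far from the output vanish
print(d_tanh.max() ** 10)  # 1.0, but only attained at exactly x == 0
print(d_relu.max() ** 10)  # 1.0 on the whole active region
```

This is only a per-unit upper bound (the weights also enter the product), but it shows why deep stacks of squashing units update slowly far from the output.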

### ogrisel commented on an outdated diff Jun 10, 2014

doc/modules/neural_networks_supervised.rst
```diff
+    MultilayerPerceptronClassifier(activation='tanh', algorithm='l-bfgs',
+           alpha=1e-05, batch_size=200, eta0=0.5,
+           learning_rate='constant', max_iter=200, n_hidden=[5, 2],
+           power_t=0.25, random_state=None, shuffle=False, tol=1e-05,
+           verbose=False, warm_start=False)
+
+After fitting (training), the model can predict labels for new samples::
+
+    >>> clf.predict([[2., 2.], [-1., -2.]])
+    array([1, 0])
+
+MLP can fit a non-linear model to the training data. The members
+clf.layers_coef_ containing the weight matrices constitute the model
+parameters::
+
+    >>> clf.layers_coef_
```

#### ogrisel Jun 10, 2014

Owner

I would just display:

   >>> [coef.shape for coef in clf.layers_coef_]


Contributor

### IssamLaradji commented Jun 11, 2014

 Thanks for the feedback @ogrisel. I improved the documentation more, making it more didactic - especially in the mathematical formulation section. For the expit related failure under Python 3, I am not sure how to fix the problem since I am using the expit version given in scikit-learn. Isn't the problem within sklearn.utils.fixes? Thanks.
Owner

### kastnerkyle commented Jun 11, 2014

This looks pretty cool so far - I will run some trials on it and try to understand the py3 issues.

Things that would be nice, though maybe not strictly necessary for a first-cut PR:

- A constructor arg for a custom loss function instead of a fixed one (maybe it is against the API). Thinking of things like cross-entropy, hinge loss à la Charlie Tang, etc. instead of standard softmax or what have you. It would be nice to have a few default ones available by strings, with the ability to create a custom one if needed.
- I like @ogrisel's suggestion for layer_coefs_. It would be useful to run experiments with KMeans networks and also pretraining with autoencoders instead of RBMs. This also opens the door for side packages that can take in weights from other nets (looking at Overfeat, Decaf, Caffe, pylearn2, etc.) and load them into sklearn. This is more a personal interest of mine, but it is nice to see the building blocks there. It is also plausible that very deep nets can be used in feedforward mode on the CPU, even if we can't train them in sklearn directly.

Questions:

- I see you have worked on deep autoencoders before - will this framework support that as well? In other words, can layer sizes be different but complementary? Or are they expected to be a "block" (uniform in size)?
- I also like the support for other optimizers - it would be sweet to get a hessian-free optimizer into scipy and use it in this general setup. It could make deep-ish NN work somewhat accessible without a GPU, though CG is what (I believe) Hinton used for the original DBM/pretraining paper.
Owner

### ogrisel commented Jun 11, 2014

 @IssamLaradji indeed it would be interesting to run a bench of lbfgs vs cg and maybe other optimizers from scipy.optimize, maybe on (a subset of) mnist for instance.
Owner

### ogrisel commented Jun 11, 2014

 We might want to make it possible to use any optimizer from scipy.optimize if the API is homogeneous across all optimizers (I have not checked).
Owner

### ogrisel commented Jun 11, 2014

 @IssamLaradji about the expit pickling issue, it looks like a bug in numpy. I am working on a fix.
Owner

### ogrisel commented Jun 11, 2014

 I submitted a bugfix upstream: numpy/numpy#4800 . If the fix is accepted we might want to backport it in sklearn.utils.fixes.
Owner

### ogrisel commented Jun 11, 2014

@IssamLaradji actually can you please try to add the ufunc fix to sklearn.utils.exists now to check that it works for us? Try to add something like:

```python
import pickle

try:
    pickle.loads(pickle.dumps(expit))
except AttributeError:
    # monkeypatch numpy to backport a fix for:
    # https://github.com/numpy/numpy/pull/4800
    import numpy.core

    def _ufunc_reconstruct(module, name):
        mod = __import__(module, fromlist=[name])
        return getattr(mod, name)

    numpy.core._ufunc_reconstruct = _ufunc_reconstruct
```
Contributor

### IssamLaradji commented Jun 11, 2014

Hi @kastnerkyle and @ogrisel, thanks for the reply.

- Custom loss function: I could add a parameter to the constructor that accepts strings for selecting the loss function. (In fact, I had done that in my older implementation, but was told to remove it since there weren't enough loss functions.)
- Pre-training: I could add a pipeline with a placeholder that selects a pre-trainer for the weights. Although I was told to keep that for the next PR, I don't see the harm in adding an additional constructor parameter and a small method containing the pre-trainer for a quick test :).
- Deep auto-encoder: yes, a sparse autoencoder is a simple adaptation of the feedforward network - I simply need to inject a sparsity parameter into the loss function and its derivatives. As for the layer sizes, they can differ in any way - for example, 1024-512-256-128-64-28 - but, as Hinton said, nothing justifies any particular set of layer sizes since it depends on the problem instance. Anyhow, this framework can support any set of layer sizes, even ones larger than the number of features.
- Selecting scipy optimizers: my older implementation of the vanilla MLP supported all scipy optimizers using the generic scipy minimize method, but there was one problem: it required users to have SciPy 0.13+, while scikit-learn requires SciPy >= 0.7. If we could raise the scipy version requirement, I could easily have this support all scipy optimizers. Anyhow, L-BFGS is now a state-of-the-art optimizer; I tested it against CG, and L-BFGS always performed better and faster than CG on several datasets (most other optimizers were unsuitable and did not come close to CG and L-BFGS as far as speed and accuracy are concerned, but the scipy method also supports custom optimizers, which is very useful). This claim is also supported by Adam Coates and Andrew Ng here: http://cs.stanford.edu/people/ang/?portfolio=on-optimization-methods-for-deep-learning. But I did read that CG can perform better and faster for special kinds of datasets. So I am all for adding the generic scipy optimizer if it weren't for the minimum version issue. What do you think?

For the ufunc fix, did you mean sklearn.utils.fixes? Because my sklearn version doesn't have sklearn.utils.exists :(. I added the fix to sklearn.utils.fixes and pushed the code to see if it resolves the expit problem. Thank you.
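For reference, the generic `scipy.optimize.minimize` interface under discussion (available since SciPy 0.11) takes the method name as a string, which is what would let users swap optimizers. A hypothetical benchmark sketch on a least-squares toy objective (standing in for the MLP cost; none of this is the PR's code):

```python
import numpy as np
from scipy.optimize import minimize  # generic interface, SciPy 0.11+

rng = np.random.RandomState(0)
A = rng.rand(50, 10)
b = rng.rand(50)

def loss(w):
    # simple least-squares objective standing in for the MLP cost
    r = A.dot(w) - b
    return 0.5 * r.dot(r)

def grad(w):
    return A.T.dot(A.dot(w) - b)

# swapping optimizers is just a string change with the generic API
for method in ("L-BFGS-B", "CG"):
    res = minimize(loss, np.zeros(10), jac=grad, method=method)
    print(method, round(res.fun, 6), res.nit)
```

On older SciPy one is limited to the per-optimizer functions such as `fmin_l_bfgs_b`, which is why the minimum-version question matters here.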

### coveralls commented Jun 11, 2014

 Coverage increased (+0.16%) when pulling 1d4911b on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

### ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+            for batch_slice in batch_slices:
+                cost = self._backprop_sgd(
+                    X[batch_slice], y[batch_slice],
+                    batch_size)
+
+            if self.verbose:
+                print("Iteration %d, cost = %.2f"
+                      % (i, cost))
+            if abs(cost - prev_cost) < self.tol:
+                break
+            prev_cost = cost
+            self.t_ += 1
+
+        elif 'l-bfgs':
+            self._backprop_lbfgs(
+                X, y, n_samples)
```

#### ogrisel Jun 11, 2014

Owner

Please put method calls on one line when they fit in 80 columns:

             self._backprop_lbfgs(X, y, n_samples)

### ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+        if self.algorithm == 'sgd':
+            prev_cost = np.inf
+
+            for i in range(self.max_iter):
+                for batch_slice in batch_slices:
+                    cost = self._backprop_sgd(
+                        X[batch_slice], y[batch_slice],
+                        batch_size)
+
+                if self.verbose:
+                    print("Iteration %d, cost = %.2f"
+                          % (i, cost))
+                if abs(cost - prev_cost) < self.tol:
+                    break
+                prev_cost = cost
+                self.t_ += 1
```

#### ogrisel Jun 11, 2014

Owner

I think this attribute would be better named n_iter_. Also, it might be interesting to report the number of batch iterations in the same n_iter_ attribute when the _backprop_lbfgs method is used, for consistency. This is reported under the 'nit' key in the information dictionary of fmin_l_bfgs_b.

n_iter_ would thus be the number of epochs for SGD and the number of batch iterations for the LBFGS optimizer.

#### ogrisel Jun 11, 2014

Owner

Note that reporting the final value of the objective function as another fitted attribute might also be interesting.

### ogrisel commented on an outdated diff Jun 11, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+
+        n_samples, self.n_features = X.shape
+        self._validate_params()
+
+        if self.layers_coef_ is None:
+            self._init_param()
+            self._init_fit()
+
+        if self.t_ is None or self.eta_ is None:
+            self._init_t_eta_()
+
+        self._preallocate_memory(n_samples, X)
+
+        cost = self._backprop_sgd(X, y, n_samples)
+        if self.verbose:
+            print("Iteration %d, cost = %.2f" % (self.t_, cost))
```

#### ogrisel Jun 11, 2014

Owner

I would rather use cost = %f so as not to constrain the precision of the cost report. Or use a larger value like %0.8f for instance.

Owner

### ogrisel commented Jun 11, 2014

About the optimizers, thanks for the reference comparing L-BFGS and CG. We could add support for an arbitrary scipy optimizer and raise a RuntimeException if the version of scipy is too low (with an informative error message), while still using fmin_l_bfgs_b directly by default so that we keep backward compat for old versions of scipy.
Owner

### ogrisel commented Jun 11, 2014

 It would be great to add squared_hinge and hinge loss functions. But in another PR. I would also consider pre-training and sparse penalties for autoencoders for separate PRs.
Owner

### larsmans commented Jun 11, 2014

 Indeed. Let's get the basic thing merged first. Is this PR in MRG phase?
Contributor

### IssamLaradji commented Jun 11, 2014

Hi, thanks for the comments. @ogrisel fitted attributes is a great idea. I added a section under the classifier and regressor class documentations explaining these fitted attributes:

1. layers_coef_: the ith element in the list represents the weight matrix corresponding to layer i.
2. layers_intercept_: the ith element in the list represents the bias vector corresponding to layer i + 1.
3. cost_: the current cost value computed by the loss function.
4. n_iter_: the current number of iterations the algorithm has run.
5. eta_: the current learning rate.

So if a user prints mlp.cost_ after training, he'd get the minimum cost achieved by either sgd or l-bfgs.

@larsmans, I will set it as MRG; I think it is in its final phase - the scope is completed in my opinion. Things like generic optimizers, more loss functions, and pretraining can be done in the next PRs if that is okay. Thank you.

Owner

### kastnerkyle commented Jun 11, 2014

 That sounds awesome to me - looking forward to playing with this! Great job.
Contributor

### IssamLaradji commented Jun 11, 2014

 Thank you for the compliment @kastnerkyle.
Contributor

### IssamLaradji commented Jun 11, 2014

Oops, sounds like Travis does not have the SciPy version supporting d['nit'] for counting iterations. I will increment the iterations manually then.
Contributor

### IssamLaradji commented Jun 11, 2014

 Fixed :)

### coveralls commented Jun 11, 2014

 Coverage increased (+0.16%) when pulling de407c2 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### ogrisel commented Jun 12, 2014

 @IssamLaradji can you please try to find parameters (eta0, learning_rate and so on) that make the model converge with SGD algorithm on MNIST? I have tried with various constant learning rate but was not very successful. Maybe grid searching on 10% of the data would work.
Contributor

### IssamLaradji commented Jun 12, 2014

@ogrisel, on my side, SGD converged with eta0=0.01, learning_rate='constant', n_hidden=100, and max_iter=400. In the verbose output you can see the cost decreasing steadily. However, with a large eta0, the cost oscillates and never converges. The problem with the invscaling learning rate is that eta gets stuck around 0.1, while a good eta is 0.01 for the MNIST dataset.
Contributor

### IssamLaradji commented Jun 12, 2014

There is a power_t parameter that you could increase so that, with learning_rate='invscaling', eta decreases at a faster rate, therefore guaranteeing convergence. Thanks.
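Assuming the schedule follows SGDClassifier's invscaling convention, eta(t) = eta0 / t ** power_t (an assumption; the PR's exact formula is not quoted here), the effect of power_t is easy to see numerically:

```python
eta0 = 0.5

# eta(t) = eta0 / t ** power_t : a larger power_t decays the rate faster
for power_t in (0.25, 0.5, 1.0):
    print(power_t, [round(eta0 / t ** power_t, 4) for t in (1, 100, 1000)])
```

With power_t=0.25, eta is still about 0.089 after 1000 steps (the "stuck around 0.1" observed above), while power_t=0.5 already brings it down to about 0.016.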

### coveralls commented Jun 12, 2014

 Coverage increased (+0.16%) when pulling 07376cb on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### glouppe commented Jun 12, 2014

Regarding pre-training, I really wouldn't make that a priority. This heuristic has gone out of favor for some time now. What works best on common benchmark tasks is the good old backpropagation algorithm trained on labeled data only. The trick is in using appropriate activation functions (e.g., rectified linear units) and averaging strategies like dropout.
Owner

### jnothman commented on an outdated diff Jun 12, 2014

sklearn/neural_network/multilayer_perceptron.py
```diff
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):
+        """Pack the coefficient and intercept parameters into a single vector.
+        """
+        all_params_ = layers_coef_ + layers_intercept_
+
+        return np.hstack([l.ravel() for l in all_params_])
+
+    def _unpack(self, packed_parameters):
+        """Extract the coefficients and intercepts from packed_parameters."""
+        for i in range(self.n_layers - 1):
+            s, e, shape = self.parameter_subsets[i]
+            self.layers_coef_[i] = np.reshape(packed_parameters[s:e], (shape))
```

#### jnothman Jun 12, 2014

Owner

`(shape)` is identical to `shape`
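The pack/unpack round-trip under review can be sketched standalone; the function names mirror the PR's `_pack`/`_unpack`, but the shape bookkeeping below is mine, for illustration only:

```python
import numpy as np

def pack(coefs, intercepts):
    """Flatten weight matrices and bias vectors into one parameter vector,
    as needed by flat-vector optimizers such as L-BFGS."""
    return np.hstack([a.ravel() for a in coefs + intercepts])

def unpack(packed, shapes):
    """Inverse of pack: slice the flat vector back into the given shapes."""
    arrays, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(packed[start:start + size].reshape(shape))
        start += size
    return arrays

coefs = [np.arange(6.0).reshape(2, 3), np.arange(3.0).reshape(3, 1)]
intercepts = [np.ones(3), np.ones(1)]

flat = pack(coefs, intercepts)
restored = unpack(flat, [(2, 3), (3, 1), (3,), (1,)])

assert flat.shape == (13,)  # 6 + 3 + 3 + 1 parameters
assert all((a == b).all() for a, b in zip(coefs + intercepts, restored))
```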

Contributor

### IssamLaradji commented Jun 13, 2014

I agree with @kastnerkyle, pre-training is still a hot, useful research area, as many recent papers contain work along those lines - like [1] [2]. I also like the idea of sharing trained weights in a friendly format like scikit-learn's. It makes sense that pre-training is useful, as many samples on the internet are unlabeled; besides, Andrew Ng made a great achievement with pre-training: http://www.wired.com/2013/05/neuro-artificial-intelligence. But I also agree with @glouppe: the reason pre-training came to be discouraged is its very long training time, for a performance improvement that might not be worth it. Much of the relevant literature uses several GPUs for pre-training, which are unavailable to the common user. That's why more approaches are using dropout and sophisticated activation functions for a more convenient training time. Having said that, pre-training is part of the GSoC proposal, so I am forced to include it, unless we all agree to implement something else instead :-). Thank you!
Owner

### jnothman commented Jun 13, 2014

The API for pretraining might be a challenge. What might be worthwhile in the meantime is providing a way / documenting the ability to provide one's own initial weights (externally pre-trained).
Owner

### ogrisel commented Jun 13, 2014

> @ogrisel, on my side, SGD converged with eta0=0.01, learning_rate=constant, n_hidden = 100, and max_iter = 400. In the verbose output you could see the cost decreasing constantly. However, with large eta0, the cost oscillates and it would never converge. The problem with the invscaling learning rate is that eta gets stuck around 0.1, while a good eta is 0.01 for the MNIST dataset.

I think it would be useful to detect if the cost is increasing (e.g. 3 times in a row) and raise a ConvergenceWarning with a message that explains that the learning rate is probably too high or the data not properly standardized, and maybe stop the algorithm. Speaking of which, I think we could change the MNIST benchmark to pipeline a StandardScaler as preprocessor. That might make the weight init work better and the model converge faster, no?
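The "3 times in a row" heuristic could look something like the sketch below. `check_divergence` is a hypothetical helper, not the PR's code, and it uses a plain UserWarning in place of scikit-learn's ConvergenceWarning to stay self-contained:

```python
import warnings

def check_divergence(costs, patience=3):
    """Hypothetical helper: flag a run whose cost rose `patience`
    times in a row, as a cue to warn and maybe stop the algorithm."""
    streak = 0
    for prev, cur in zip(costs, costs[1:]):
        streak = streak + 1 if cur > prev else 0
        if streak >= patience:
            warnings.warn("cost increased %d times in a row; the learning "
                          "rate is probably too high or the data is not "
                          "properly standardized" % streak)
            return True
    return False

print(check_divergence([2.1, 1.7, 1.4, 1.2]))   # False: converging
print(check_divergence([1.0, 1.3, 1.9, 2.8]))   # True: diverging
```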
Owner

### ogrisel commented Jun 13, 2014

> Having said that, pre-training is part of the GSoC proposal, so I am forced to include it, unless we all agree to implement something else instead :-).

We can still decide together to change the task list of the initial GSoC proposal. For pre-training we can just add a new example that first fits a pipeline of 3 RBMs manually on a dataset, then initializes the weights (layers_coef_ + layers_intercept_) of an MLP instance manually, and trains it with warm_start=True for the fine tuning. No need to write a specific API; this can all be done manually with the existing API. It's more a matter of documenting how to do it in an example referenced from the narrative documentation.
Owner

### ogrisel commented Jun 13, 2014

 I created a new issue #3275 to move the discussion about pre-training there.
Contributor

### IssamLaradji commented Jun 13, 2014

@jnothman sure, that's a good start. I am thinking of implementing the methods I suggested in #3275.

@ogrisel ConvergenceWarning is a great idea. I added this code in the sgd body:

```python
if self.cost_ > prev_cost:
    cost_increase_count += 1
    if cost_increase_count == 0.2 * self.max_iter:
        warnings.warn("Cost is increasing for more than 20%"
                      " of the iterations. Consider reducing"
                      " eta0 and preprocessing your data"
                      " with StandardScaler or MinMaxScaler.",
                      ConvergenceWarning)
```

Since the cost might increase arbitrarily, the ConvergenceWarning takes place only when the cost keeps increasing for more than 20% of the iterations. We could decrease that percentage if it is more appropriate. For MNIST, I realized that, dividing the data by 255, sgd's cost consistently decreases and then converges even with a learning rate as high as 0.5. Since the MNIST benchmark already divides the data by 255, thereby normalizing it, it seems StandardScaler would be redundant. What do you think? Thanks!
Owner

### ogrisel commented Jun 13, 2014

> For MNIST, I realized that, dividing the data by 255, sgd's cost consistently decreases and then converges even with a learning rate as high as 0.5. Since the MNIST benchmark already divides the data by 255, thereby normalizing it, it seems StandardScaler would be redundant. What do you think?

I might be mistaken, but I think I could make it diverge even with the division by 255. Also, a [0 - 1] range is not equivalent to centering + unit scaling. The latter is what is expected by the fan-in / fan-out init scheme of the weights, so StandardScaler might help the algorithm converge slightly faster.
Contributor

### IssamLaradji commented Jun 13, 2014

I just compared the MNIST benchmark between StandardScaler, normalization, and MinMaxScaler. You are right: with StandardScaler, the multi-layer perceptron took 1/4 of the time to train compared with normalization (division by 255), but it had an error rate of 0.028 instead of 0.0169. I couldn't get the score lower than that :( with StandardScaler. Also, StandardScaler made nystroem_approx_svm and fourier_approx_svm perform really poorly (high error rate) while taking a very long time to train. Here are their scores with StandardScaler:

```
Algorithm               Training time   Error rate
--------------------------------------------------
nystroem_approx_svm           1382.65       0.4866
fourier_approx_svm            1363.49       0.6055
```

MinMaxScaler, however, helped them perform better, as expected. The multi-layer perceptron took half the time to train with MinMaxScaler than with normalization. Do you think we should use MinMaxScaler instead?

Edit: never mind - MinMaxScaler works essentially like division by 255, since both fix values between 0 and 1. The decrease in training time was the result of a random factor. I will dig deeper into the issue of StandardScaler with nystroem_approx_svm and fourier_approx_svm. Thanks.
Owner

### AlexanderFabisch commented Jun 14, 2014

There are some pixels in the MNIST dataset that are never greater than 0, which should not affect the result, but there are also some pixels that are almost never greater than zero. To get a variance of 1 for these components, the StandardScaler will multiply these pixels with a large value.

```python
In [1]: from sklearn.datasets import fetch_mldata

In [2]: mnist = fetch_mldata('MNIST original')

In [3]: X = mnist.data

In [4]: X.sum(axis=0)
Out[4]:
array([      0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,     126,     470,
           216,       9,       0,       0,       0,       0,       0,
             0,       0,       0,       0,       0,       0,       0,
             0,       0,       0,       0,      16,      93,     793,
          1615,    3026,    4357,    8255,   11987,   13539,   13306,
         14440,   12792,   11907,   10116,    6947,    4776,    3421,
          1282,     605,     212,       0,       0,       0,       0,
             0,       0,      64,      42,     417,     766,    3941,
        ...
       8505361, 7569160, 6196835, 4644333, 3183420, 2002145, 1165280,
        626008,  319420,  150857,   57252,   13300,    1238,      72,
             0,      24,    4175,   28731,  118835,  346211,  828203,
       1655578, 2794999, 4096358, 5390273, 6429651, 6988499, 6971463,
       6395512, 5346977, 4071292, 2831605, 1791788, 1055597,  584113,
        ...
         17195,    5667,    1470,      58,      59,       0,       0,
             0,       0,       0,       0,     152,     935,    2520,
          5787,    8581,   13136,   21761,   27641,   34682,   39975,
         46865,   41270,   33546,   23352,   13819,    6968,    3264,
          1163,     907,     120,       0,       0,       0,       0],
      dtype=uint64)
```

I think these pixels distort the result completely (although I did not really test it). This is why you usually do not want to use a StandardScaler for this dataset - it is very sensitive to outliers. In almost all experiments with the MNIST dataset that I have seen, the values have been scaled to [0, 1] by dividing each pixel by 255.0. I think the MinMaxScaler will give a very similar result, but not the same, because the maximum value in the data might not be 255.0 for each pixel.
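The amplification effect described here can be reproduced with a synthetic "pixel" (a sketch, not the MNIST data itself): a feature that is almost always zero gets a tiny standard deviation, so standardizing it blows up its rare activations:

```python
import numpy as np

rng = np.random.RandomState(0)

# a border-ish MNIST pixel: almost always 0, with a handful of 255s
pixel = np.zeros(10000)
pixel[rng.choice(10000, size=5, replace=False)] = 255.0

scaled = pixel / 255.0                               # the usual [0, 1] scaling
standardized = (pixel - pixel.mean()) / pixel.std()  # StandardScaler-style

print(scaled.max())        # 1.0
print(standardized.max())  # ~45: the rare activations get blown up
```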
Contributor

### IssamLaradji commented Jun 15, 2014

 @AlexanderFabisch Thanks for taking the time to analyze this. Indeed, I got different (though very similar) mean values between applying MinMaxScaler and division by 255, suggesting they are not identical for this data. I guess I will keep the division by 255.0 as the preprocessing step, since it is the popular normalization method for the MNIST dataset. Thanks.

### coveralls commented Jun 15, 2014

 Coverage increased (+0.17%) when pulling 3655474 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.


Contributor

### IssamLaradji commented Jun 16, 2014

 I pushed a new pull-request #3281 where I uploaded an example file mlp_with_pretraining.py that demonstrates pre-training an mlp with an rbm. Thanks.
Owner

### ogrisel commented Jun 17, 2014

 I just had a second look at the code of this PR and I think the following things should be done:

- All attributes that are not public constructor parameters and that are mutated during a call to fit should either be public attributes with a trailing _ (to show that they are estimated from the data) and properly documented in the docstring, or made private by adding a leading _ to their name. In my opinion, all the attributes that are currently not documented should be made private to limit backward-compat issues later if we decide to refactor the internals of this estimator, except n_outputs, which could be renamed n_outputs_ as done in many other scikit-learn models.
- I think that fit should only preserve layers_coef_ and layers_intercept_ when warm_start=True. All the other fitted parameters (e.g. eta_, n_iter_ and so on) should be reinitialized from scratch when calling fit, as if fit were called for the first time. This entails refactoring _init_fit; I would rename it to _init_random_weights and only do the init of layers_coef_ and layers_intercept_ there.
- Do not set up the packed_parameter_meta attribute; instead, directly introspect the shape of the layers in the _unpack method. There is no need to precompute it, as _unpack is only called once per fit call.
Contributor

### IssamLaradji commented Jun 18, 2014

 Hi @ogrisel, The reason I used packed_parameter_meta to precompute layer shapes is that lbfgs calls the _unpack method at every iteration: at each step, lbfgs runs the _cost_grad() method, whose first line calls _unpack to extract the updated parameters. The problem is that the nature of lbfgs, and of most scipy optimizers, is to pack and unpack parameters at every iteration. Hence the precomputation of layer shapes. I can precompute the layer shapes in a method other than _init_random_weights for easier pre-training; it is only needed for lbfgs, not sgd. I named the new method _precompute_layer_shapes. Thanks for explaining the naming convention for public and private methods/attributes - I wasn't really aware of it :(. I pushed the updated file with the correct naming. Thank you. Update: Thanks to your comments, I made the pre-training code #3281 much cleaner. Now, mlp only requires warm_start=True as well as coefficient and intercept assignments from the RBM for pre-training.

### coveralls commented Jun 18, 2014

 Coverage increased (+0.17%) when pulling 43250a7 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### ogrisel commented Jun 18, 2014

> The reason I used packed_parameter_meta to precompute layer shapes is that, lbfgs calls the _unpack method for every iteration. At each time step, lbfgs runs the _cost_grad() method where the first line calls _unpack to extract the updated parameters.

Indeed. For some reason I missed that when looking for _unpack calls.

> I can precompute the layer shapes in a method other than _init_random_weights for easier pre-training - it is only needed for lbfgs not sgd. I called the new method as _precompute_layer_shapes.

Exactly what I would have suggested.

> Update: Thanks to your comments, I made the pre-training code #3281 much cleaner. Now, mlp only requires warm_start = True as well as coefficient and intercept assignments from RBM for pretraining.

This is precisely why I wanted to have warm_start=True only affect layers_coef_ and layers_intercept_ :)

### ogrisel and 1 other commented on an outdated diff Jun 18, 2014

sklearn/utils/fixes.py
 @@ -49,6 +49,18 @@ def expit(x, out=None):
         return out
 
+# added a code block that addresses the expit issue with python3
+try:
+    pickle.loads(pickle.dumps(expit))

#### ogrisel Jun 18, 2014

Owner

Unfortunately this is actually triggering a bug under Python 3.4 on my box, for instance when I run plot_rbm_logistic_classification.py, but I cannot really understand why. Can you try to reproduce it?

Traceback (most recent call last):
File "examples/plot_rbm_logistic_classification.py", line 39, in <module>
from sklearn import linear_model, datasets, metrics
File "/Users/ogrisel/code/scikit-learn/sklearn/linear_model/__init__.py", line 12, in <module>
from .base import LinearRegression
File "/Users/ogrisel/code/scikit-learn/sklearn/linear_model/base.py", line 28, in <module>
from ..utils import as_float_array, atleast2d_or_csr, safe_asarray
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/__init__.py", line 11, in <module>
from .validation import (as_float_array, check_arrays, safe_asarray,
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 17, in <module>
from .fixes import safe_copy
File "/Users/ogrisel/code/scikit-learn/sklearn/utils/fixes.py", line 54, in <module>
File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/numpy/core/__init__.py", line 61, in _ufunc_reduce
return _ufunc_reconstruct, (whichmodule(func, name), name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/pickle.py", line 283, in whichmodule
for module_name, module in sys.modules.items():
RuntimeError: dictionary changed size during iteration


Contributor

@ogrisel , I got the exact same error with Python 3.4. I will look into this. Thanks

Contributor

On investigating this, I found that the error happens only when you run import matplotlib.pyplot as plt before other imports. If you place import matplotlib.pyplot as plt below other imports, the code will work just fine. This is strange, but I think there is a name conflict in the libraries being imported.

#### ogrisel Jun 23, 2014

Owner

It looks like the call to _ufunc_reconstruct can sometimes trigger a module import that changes the dict of imported modules, sys.modules. I think this can be considered a bug in the pickle module, which should be made robust to such changes by using something like:

for module_name, module in list(sys.modules.items()):
...

It would be great to try to write a pure-python non-regression test to submit with a fix to Python.
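The failure mode, and the list() fix suggested above, can be demonstrated in pure Python without involving pickle at all (the dict contents here are illustrative):

```python
# Mutating a dict while iterating over it raises RuntimeError in Python 3,
# which is what happens when whichmodule() walks sys.modules while an
# import triggered mid-iteration registers a new module.
d = {'a': 1, 'b': 2}
try:
    for key in d:
        d[key + '_copy'] = d[key]  # simulates the import side effect
    raised = False
except RuntimeError:
    raised = True
assert raised

# Iterating over a snapshot, as in `list(sys.modules.items())`, is robust:
d = {'a': 1, 'b': 2}
for key in list(d):
    d[key + '_copy'] = d[key]
assert 'a_copy' in d and 'b_copy' in d
```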

Contributor

Working on it, will let you know when it is fixed :)

Contributor

Hi @ogrisel, since pickle.loads(pickle.dumps(expit)) always raises an error, can we remove it and keep only the code below instead?
I found that calling pickle in .fixes is what causes the import conflicts. If we remove it, we will face no problems with the Travis pickle test, nor with module import conflicts. Thanks.

# added a code block that addresses the expit issue with python3
def _ufunc_reconstruct(module, name):
mod = __import__(module, fromlist=[name])
return getattr(mod, name)
np.core._ufunc_reconstruct = _ufunc_reconstruct

### arjoly commented on an outdated diff Jun 18, 2014

sklearn/neural_network/multilayer_perceptron.py
+                if self.cost_ > prev_cost:
+                    cost_increase_count += 1
+                    if cost_increase_count == 0.2 * self.max_iter:
+                        warnings.warn('Cost is increasing for more than 20%%'
+                                      ' of the iterations. Consider reducing'
+                                      ' eta0 and preprocessing your data'
+                                      ' with StandardScaler or MinMaxScaler.'
+                                      % self.cost_, ConvergenceWarning)
+
+                elif prev_cost - self.cost_ < self.tol:
+                    break
+
+                prev_cost = self.cost_
+                self.n_iter_ += 1
+
+            elif 'l-bfgs':

Owner

looks like a bug
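The flagged line is `elif 'l-bfgs':`. A non-empty string literal is always truthy in Python, so that branch is taken whenever the preceding conditions fail, regardless of which algorithm was actually selected. A minimal sketch (the variable names here are illustrative, not the PR's code):

```python
algorithm = 'sgd'

# The buggy form: a bare string literal as the condition is always true.
assert bool('l-bfgs') is True

# The presumably intended form compares against the selected algorithm:
if algorithm == 'sgd':
    branch = 'sgd'
elif algorithm == 'l-bfgs':
    branch = 'l-bfgs'
assert branch == 'sgd'
```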

### ogrisel and 1 other commented on an outdated diff Jun 18, 2014

doc/modules/neural_networks_supervised.rst
+
+    + Capability to learn models in real-time (on-line learning)
+      using partial_fit
+
+
+The disadvantages of Multi-layer Perceptron (MLP) include:
+
+    + MLP with hidden layers have a non-convex loss function where there exists
+      more than one local minimum. Therefore different random weight
+      initializations can lead to different validation accuracy.
+
+    + MLP suffers from the Backpropagation diffusion problem; layers far from
+      the output update with decreasing momentum, leading to slow convergence.
+      However, with squashing activation functions such as the logistic sigmoid
+      and the tanh function, implementing linear bottlenecks (the identity
+      function) and ReLU later might resolve this problem.

#### ogrisel Jun 18, 2014

Owner

I would not mention ReLU and linear bottlenecks in the doc as long as we have not implemented them.

Contributor

Fixed :)

### ogrisel commented on an outdated diff Jun 18, 2014

sklearn/neural_network/multilayer_perceptron.py
+        Returns
+        -------
+        self
+        """
+        X = atleast2d_or_csr(X)
+
+        self._validate_params()
+
+        n_samples, n_features = X.shape
+        self.n_outputs_ = y.shape[1]
+
+        self._init_eta_()
+        self._init_param(n_features)
+
+        if not self.warm_start or \
+                (self.warm_start and self.layers_coef_ is None):

#### ogrisel Jun 18, 2014

Owner

Please use additional ( and ) around the conditional expression rather than a trailing \.

### coveralls commented Jun 22, 2014

 Coverage increased (+0.17%) when pulling 61d9de8 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

### coveralls commented Jun 27, 2014

 Coverage increased (+0.21%) when pulling a925a14 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Owner

### mblondel commented Jun 27, 2014

 Is the plan to merge all your PRs at once at the end of the summer? I'd rather merge progressively what is ready into master.
Contributor

### IssamLaradji commented Jun 27, 2014

 Hi @mblondel, I would be glad to have this merged before continuing on other PRs :). This is the farthest I have gotten in a PR, though; so if you or someone else can guide me through the last remaining steps, that would be great :). PS: a strange Travis error for Python 2.6 is related to 'OrthogonalMatchingPursuitCV'. I wonder if this PR is causing it. Thanks.
Contributor

### IssamLaradji commented Jun 30, 2014

 Also, I feel like working on the ELM PR #3306 would be much easier if this PR got merged first, since ELM's documentation belongs in the same documentation as MLP's :). Please let me know what final steps I should take to complete this. Thanks. :)
Owner

### ogrisel commented Jun 30, 2014

> PS: a strange travis error for python 2.6 is related to 'OrthogonalMatchingPursuitCV'. I wonder if this PR is causing this.

This is unrelated. See #3190. Unfortunately it's apparently very hard to reproduce outside of the Travis environment, so nobody has come up with a fix yet.

### coveralls commented Jul 1, 2014

 Coverage increased (+0.21%) when pulling b1562d7 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Contributor

### IssamLaradji commented Jul 1, 2014

 Finally fixed the expit problem without having to import pickle in .fixes, which caused conflicts with other imports. @ogrisel indeed, the 'OrthogonalMatchingPursuitCV' failure seems like a very subtle error - I had a similar error in #3306 that I couldn't reproduce, because the PR passed all the unit tests on my local machine, unlike on Travis.
Owner

### ogrisel commented Jul 2, 2014

> Finally fixed the expit problem without having to import pickle in .fixes which caused conflicts with other imports.

But the pickle check was intentional: I don't want us to monkey-patch numpy when it's not necessary. As the fix has been included in numpy master, after the branching of numpy 1.9 you can test whether the (major, minor) version of numpy is lower than or equal to (1, 9) and only apply the _ufunc_reconstruct monkey-patch in that case.
Owner

### ogrisel commented Jul 2, 2014

 BTW I reported the python bug here: http://bugs.python.org/issue21905

### coveralls commented Jul 3, 2014

 Coverage increased (+0.17%) when pulling 5c13b7d on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.
Contributor

### IssamLaradji commented Jul 3, 2014

 Great! :) I added the numpy (major, minor) check to call _ufunc_reconstruct only when it is necessary. Thanks for following up on this.

### coveralls commented Jul 5, 2014

 Coverage increased (+0.17%) when pulling c338ccc on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.

### coveralls commented Jul 12, 2014

 Coverage increased (+0.18%) when pulling 72ffdb5 on IssamLaradji:generic-multi-layer-perceptron into daa1dba on scikit-learn:master.


Owner

### ogrisel commented Jul 30, 2014

 The master branch has changed quite a bit during the past week, could you please rebase on top of master (and squash your commits) and fix any conflict? Note: the input validation helpers have changed, see: http://scikit-learn.org/dev/developers/utilities.html#validation-tools
Contributor

### IssamLaradji commented Jul 30, 2014

 Reworked the code using the same advice I got for ELM #3306 - it's much more readable and cleaner now (I hope) :). But Travis is hitting the same errors as it did for ELM's pull request, even after rebasing. However, Travis for ELM #3306 fixed itself after a while (strange) :). Maybe Travis here will fix itself too? Thanks.

### arjoly and 2 others commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+                             "to partial_fit.")
+        elif self.classes_ is not None and classes is not None:
+            if np.any(self.classes_ != np.unique(classes)):
+                raise ValueError("classes is not the same as on last call "
+                                 "to partial_fit.")
+        elif classes is not None:
+            self.classes_ = classes
+
+        if not hasattr(self, '_lbin'):
+            self._lbin = LabelBinarizer()
+            self._lbin._classes = classes
+
+        X, y = check_X_y(X, y, accept_sparse='csr')
+
+        # needs a better way to check multi-label instances
+        if isinstance(np.reshape(y, (-1, 1))[0][0], list):

#### arjoly Jul 31, 2014

Owner

The label binarizer will handle that for you. Check the y_type_ attribute.

Contributor

That's awesome - makes life much easier :). I noticed that LabelBinarizer
for multi-labeling will be deprecated by 0.17, since I get this warning:

DeprecationWarning: Direct support for sequence of sequences
multilabel representation will be unavailable from version 0.17.

It seems I have to use MultiLabelBinarizer for multi-label instances. Wouldn't that add unnecessary clutter to the code?

I believe that in this case I can't use y_type_ to identify the type, since I wouldn't know whether I should use MultiLabelBinarizer or LabelBinarizer until I check the instance type manually, right?
Thanks

#### jnothman Aug 2, 2014

Owner

No, you only need to use MultiLabelBinarizer if you have multilabel data represented as a list of lists of classes (or similar). This format is deprecated, and the MultiLabelBinarizer is provided as a utility for users because, while cumbersome to process, that format is arguably more human-readable than a label indicator matrix which will continue to be supported.

So in short, use label_binarize and your support for that old format, but not for label indicator matrices, will disappear together with the rest of the deprecation.
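The two representations being contrasted here can be sketched in plain Python; the helper name is hypothetical, and the real conversion is done by scikit-learn's binarizers:

```python
# Convert the deprecated sequence-of-sequences multilabel format into a
# label indicator matrix, the format that remains supported.
def sequences_to_indicator(y_sequences, classes):
    return [[1 if c in labels else 0 for c in classes]
            for labels in y_sequences]

y_seq = [['cat'], ['cat', 'dog'], []]   # deprecated list-of-lists format
classes = ['cat', 'dog']

y_indicator = sequences_to_indicator(y_seq, classes)
assert y_indicator == [[1, 0], [1, 1], [0, 0]]  # indicator matrix format
```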

Contributor

Thanks @jnothman. I read all about the issues with having lists of lists for multi-label instances - it makes sense why its support is being deprecated.

I was having problems with make_multilabel_classification until I set return_indicator to True, which returned a friendly format for LabelBinarizer that worked nicely. Thanks :)

#### jnothman Aug 2, 2014

Owner

make_multilabel_classification is yet to be fixed for this and for sparse support (@hamsal, is
it to be yours?)


### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        y : array-like, shape (n_samples, n_outputs)
+            Subset of the target values.
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y)
+
+        if y.ndim == 1:
+            y = np.reshape(y, (-1, 1))
+
+        super(MultilayerPerceptronRegressor, self).partial_fit(X, y)
+        return self
+
+    def predict(self, X):

#### arjoly Jul 31, 2014

Owner

To avoid sub-classing you can do as in the tree module https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L279

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Training data, where n_samples in the number of samples
+            and n_features is the number of features.
+
+        y : array-like, shape (n_samples, n_outputs)
+            Subset of the target values.
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y)
+
+        if y.ndim == 1:
+            y = np.reshape(y, (-1, 1))

#### arjoly Jul 31, 2014

Owner

You can add a private self._validate_X_y to get generic code for both regression and classification. This would reduce the amount of boilerplate code.

### arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Training data, where n_samples is the number of samples
+            and n_features is the number of features.
+
+        y : array-like, shape (n_samples, n_outputs)
+            Target values.
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y, multi_output=True)
+
+        if y.ndim == 1:
+            y = np.reshape(y, (-1, 1))

#### arjoly Jul 31, 2014

Owner

You can add a private self._validate_X_y to get generic code for both regression and classification. This would reduce the amount of boiler plate code.

#### arjoly Jul 31, 2014

Owner

You could probably have only one fit function in the base class :-)

Contributor

Yes, exactly :). Also, one partial_fit in the base class is sufficient.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        Whether to print progress messages to stdout.
+
+    warm_start : bool, optional, default False
+        When set to True, reuse the solution of the previous
+        call to fit as initialization, otherwise, just erase the
+        previous solution.
+
+    Attributes
+    ----------
+    classes_ : array or list of array of shape = [n_classes]
+        Class labels for each output.
+
+    cost_ : float
+        The current cost value computed by the loss function.
+
+    eta_ : float

#### arjoly Jul 31, 2014

Owner

eta => learning_rate_?
This would be consistent with the gradient boosting module.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        iterates until convergence (determined by 'tol') or
+        this number of iterations.
+
+    random_state : int or RandomState, optional, default None
+        State of or seed for random number generator.
+
+    shuffle : bool, optional, default False
+        Whether to shuffle samples in each iteration before extracting
+        minibatches.
+
+    tol : float, optional, default 1e-5
+        Tolerance for the optimization. When the loss at iteration i+1 differs
+        less than this amount from that at iteration i, convergence is
+        considered to be reached and the algorithm exits.
+
+    eta0 : double, optional, default 0.1

#### arjoly Jul 31, 2014

Owner

learning_rate_init?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        Parameters
+        ----------
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)
+            Data, where n_samples is the number of samples
+            and n_features is the number of features.
+
+        Returns
+        -------
+        y_prob : array-like, shape (n_samples, n_classes)
+            The predicted probability of the sample for each class in the
+            model, where classes are ordered as they are in
+            self.classes_.
+        """
+        scores = self.decision_function(X)
+
+        if len(scores.shape) == 1:

#### arjoly Jul 31, 2014

Owner

Why not use ndim?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+
+        Returns
+        -------
+        self : returns an instance of self.
+        """
+        X, y = check_X_y(X, y, accept_sparse='csr')
+
+        # needs a better way to check multi-label instances
+        if isinstance(np.reshape(y, (-1, 1))[0][0], list):
+            self._multi_label = True
+        else:
+            self._multi_label = False
+
+        self.classes_ = np.unique(y)
+        self._lbin = LabelBinarizer()
+        y = self._lbin.fit_transform(y)

#### arjoly Jul 31, 2014

Owner

I would factor those checks into a _validate_X_y?

#### arjoly Jul 31, 2014

Owner

Note that the y could be optional, so you would be able to re-use that function in the predict functions.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
 + """ + def __init__(self, n_hidden=[100], activation="tanh", + algorithm='l-bfgs', alpha=0.00001, + batch_size=200, learning_rate="constant", eta0=0.5, + power_t=0.5, max_iter=200, shuffle=False, + random_state=None, tol=1e-5, + verbose=False, warm_start=False): + sup = super(MultilayerPerceptronClassifier, self) + sup.__init__(n_hidden=n_hidden, activation=activation, + algorithm=algorithm, alpha=alpha, batch_size=batch_size, + learning_rate=learning_rate, eta0=eta0, power_t=power_t, + max_iter=max_iter, shuffle=shuffle, + random_state=random_state, tol=tol, + verbose=verbose, warm_start=warm_start) + + self.loss = 'log_loss'

#### arjoly Jul 31, 2014

Owner

Why not pass this as a parameter to the base class?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+ACTIVATIONS = {'tanh': _inplace_tanh, 'logistic': _inplace_logistic_sigmoid}
+
+DERIVATIVE_FUNCTIONS = {'tanh': _d_tanh, 'logistic': _d_logistic}
+
+
+class BaseMultilayerPerceptron(six.with_metaclass(ABCMeta, BaseEstimator)):
+    """Base class for MLP classification and regression.
+
+    Warning: This class should not be used directly.
+    Use derived classes instead.
+    """
+
+    _loss_functions = {
+        'squared_loss': _squared_loss,
+        'log_loss': _log_loss,
+    }

#### arjoly Jul 31, 2014

Owner

I would make two constants: CLASSIFICATION_LOSS and REGRESSION_LOSS. Later, you can use
is_classifier(self) to know whether you are in a regression or classification task.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+def _d_logistic(Z):
+    """Compute the derivative of the logistic function."""
+    return Z * (1 - Z)
+
+
+def _d_tanh(Z):
+    """Compute the derivative of the hyperbolic tan function."""
+    return 1 - (Z ** 2)
+
+
+def _squared_loss(Y, Z):
+    """Compute the square loss for regression."""
+    return np.sum((Y - Z) ** 2) / (2 * len(Y))
+
+
+def _log_loss(Y, Z):

#### arjoly Jul 31, 2014

Owner

y => y_true / Y_true,
Z => y_proba / Y_proba?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+    X /= X.sum(axis=1)[:, np.newaxis]
+
+    return X
+
+
+def _d_logistic(Z):
+    """Compute the derivative of the logistic function."""
+    return Z * (1 - Z)
+
+
+def _d_tanh(Z):
+    """Compute the derivative of the hyperbolic tan function."""
+    return 1 - (Z ** 2)
+
+
+def _squared_loss(Y, Z):

Owner

Y => y_true
Z => y_pred?

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self.power_t = power_t
+        self.max_iter = max_iter
+        self.n_hidden = n_hidden
+        self.shuffle = shuffle
+        self.random_state = random_state
+        self.tol = tol
+        self.verbose = verbose
+        self.warm_start = warm_start
+
+        self.layers_coef_ = None
+        self.layers_intercept_ = None
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):

#### arjoly Jul 31, 2014

Owner

I would extract this to make it a small function instead of a method.

#### arjoly Jul 31, 2014

Owner

_pack => _pack_network?

### arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self.tol = tol
+        self.verbose = verbose
+        self.warm_start = warm_start
+
+        self.layers_coef_ = None
+        self.layers_intercept_ = None
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):
+        """Pack the coefficient and intercept parameters into a single vector.
+        """
+        return np.hstack([l.ravel() for l in layers_coef_ + layers_intercept_])
+
+    def _unpack(self, packed_parameters):

#### arjoly Jul 31, 2014

Owner

What do you think of extracting this method and creating a function instead?
In the code, you would do something like

self.layer_coef_, self.layer_intercept_ = _unpack_network(parameters, n_layer)


(_unpack => _unpack_network?)

Contributor

I agree, having it outside the class results in better readability.
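A minimal sketch of what the extracted helpers might look like as free functions, assuming numpy and using the _pack_network/_unpack_network names suggested above (the exact signatures are hypothetical):

```python
import numpy as np

def _pack_network(layers_coef, layers_intercept):
    # Concatenate all coefficient and intercept arrays into one flat vector.
    return np.hstack([a.ravel() for a in layers_coef + layers_intercept])

def _unpack_network(packed, layer_units):
    # Recover per-layer coefficient matrices and intercept vectors from the
    # flat vector, deriving each slice and shape from the layer sizes.
    coefs, intercepts = [], []
    offset = 0
    for n_in, n_out in zip(layer_units[:-1], layer_units[1:]):
        coefs.append(packed[offset:offset + n_in * n_out].reshape(n_in, n_out))
        offset += n_in * n_out
    for n_out in layer_units[1:]:
        intercepts.append(packed[offset:offset + n_out])
        offset += n_out
    return coefs, intercepts

# Round-trip check on a tiny 3-2-1 network:
layer_units = [3, 2, 1]
coefs = [np.arange(6.0).reshape(3, 2), np.arange(2.0).reshape(2, 1)]
intercepts = [np.zeros(2), np.zeros(1)]
packed = _pack_network(coefs, intercepts)
new_coefs, new_intercepts = _unpack_network(packed, layer_units)
assert all(np.array_equal(a, b) for a, b in zip(coefs, new_coefs))
```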

### arjoly and 1 other commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self.layers_coef_ = None
+        self.layers_intercept_ = None
+        self.cost_ = None
+        self.n_iter_ = None
+        self.eta_ = None
+
+    def _pack(self, layers_coef_, layers_intercept_):
+        """Pack the coefficient and intercept parameters into a single vector.
+        """
+        return np.hstack([l.ravel() for l in layers_coef_ + layers_intercept_])
+
+    def _unpack(self, packed_parameters):
+        """Extract the coefficients and intercepts from packed_parameters."""
+        for i in range(self._n_layers - 1):
+            s, e, shape = self._packed_parameter_meta[i]
+            self.layers_coef_[i] = np.reshape(packed_parameters[s:e], (shape))

#### arjoly Jul 31, 2014

Owner

Dumb question, why not storing the coefficient in a sparse matrix format?

Contributor

Good question :P, the coefficient matrix doesn't usually have zeros - though it mostly has small values.

#### arjoly Aug 2, 2014

Owner

The packed parameters and shapes artificially look like a sparse CSC matrix with only indptr and data. Would it make sense to have only a flat array for the coefficients and an indptr array? This might avoid the need to pack and unpack parameters.

Contributor

Was working on this and found it to be a really cool idea that could avoid the packing and unpacking of the parameters. :)
But what if there is a zero element in one of the parameters? Then .data will not return all the elements, since it ignores the zeros. I guess .reshape(-1,) would work.
Thanks

Contributor

Hmmm, but still: the optimizer expects concatenated, flattened coefficient parameters, yet the forward pass expects coefficients of different shapes in the form (n_samples, n_features). It seems unpacking or using reshape in every iteration is inevitable :(.

#### arjoly Aug 4, 2014

Owner

Nevertheless, with the flat array you will avoid many allocations, because you will work with memory views. Here is a small IPython example

In [1]: import numpy as np

In [2]: a = np.arange(4)

In [3]: b = np.arange(3)

In [4]: c = np.hstack([a, b])

In [5]: c.flags
Out[5]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

In [6]: d = c[:4]

In [7]: e = c[4:]

In [8]: d.flags
Out[8]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

In [9]: f = d.reshape((2, 2))

In [10]: f.flags
Out[10]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False

In [11]: a.flags
Out[11]:
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False


Arrays a and b own their data, and there is a copy involved when creating c.
While the flat array c owns the data, taking a slice (array d or e) and reshaping a slice (array f) of c doesn't create any copy.

I have played a bit with the code and the packed_meta_parameters is not easy to use.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        if self.learning_rate in ("constant", "invscaling"):
+            if self.eta0 <= 0.0:
+                raise ValueError("eta0 must be > 0")
+
+        # raise ValueError if not registered
+        if self.activation not in ACTIVATIONS:
+            raise ValueError("The activation %s"
+                             " is not supported. " % self.activation)
+        if self.learning_rate not in ["constant", "invscaling"]:
+            raise ValueError("learning rate %s "
+                             " is not supported. " % self.learning_rate)
+        if self.algorithm not in ["sgd", "l-bfgs"]:
+            raise ValueError("The algorithm %s"
+                             " is not supported. " % self.algorithm)
+
+    def _scaled_weight_init(self, fan_in, fan_out):

#### arjoly Jul 31, 2014

Owner

I would inline this function.

### arjoly commented on an outdated diff Jul 31, 2014

sklearn/neural_network/multilayer_perceptron.py
+        self._coef_grads = [0] * (self._n_layers - 1)
+        self._intercept_grads = [0] * (self._n_layers - 1)
+
+        # output for regression
+        if self.classes_ is None:
+            self._inplace_out_activation = _identity
+        # output for multi class
+        elif len(self.classes_) > 2 and self._multi_label is False:
+            self._inplace_out_activation = _inplace_softmax
+        # output for binary class and multi-label
+        else:
+            self._inplace_out_activation = _inplace_logistic_sigmoid
+
+    def _init_eta_(self):
+        """Initialize the learning rate eta0 for SGD"""
+        self.eta_ = self.eta0

#### arjoly Jul 31, 2014

Owner

I would inline this function.

### ogrisel commented on an outdated diff Aug 21, 2014

examples/neural_network/plot_mlp_alpha.py
+
+
+# Author: Issam H. Laradji
+# License: BSD 3 clause
+
+import numpy as np
+from matplotlib import pyplot as plt
+from matplotlib.colors import ListedColormap
+from sklearn.cross_validation import train_test_split
+from sklearn.preprocessing import StandardScaler
+from sklearn.datasets import make_moons, make_circles, make_classification
+from sklearn.neural_network import MultilayerPerceptronClassifier
+
+h = .02  # step size in the mesh
+
+alphas = np.arange(0, 2, 0.15)

#### ogrisel Aug 21, 2014

Owner

Can you please try a wider range of alpha values with fewer intermediate steps, as done in the ELM pull request?

The range alphas = np.logspace(-4, 4, 5) looks interesting, for instance.

### ogrisel commented on an outdated diff Aug 21, 2014

examples/neural_network/plot_mlp_alpha.py
+
+classifiers = []
+for i in alphas:
+    classifiers.append(MultilayerPerceptronClassifier(alpha=i, random_state=1))
+
+X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
+                           random_state=0, n_clusters_per_class=1)
+rng = np.random.RandomState(2)
+X += 2 * rng.uniform(size=X.shape)
+linearly_separable = (X, y)
+
+datasets = [make_moons(noise=0.3, random_state=0),
+            make_circles(noise=0.2, factor=0.5, random_state=1),
+            linearly_separable]
+
+figure = plt.figure(figsize=(27, 9))

#### ogrisel Aug 21, 2014

Owner

Please use a smaller horizontal size when the number of alphas is reduced to be consistent with the layout of the ELM pull request.

Owner

### ogrisel commented Aug 21, 2014

 If activation='relu' is always better or faster in your experiments I would make it the default activation function as nowadays nobody uses tanh anymore.

### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +
 +        # Output for regression
 +        if not isinstance(self, ClassifierMixin):
 +            self.out_activation_ = 'identity'
 +        # Output for multi class
 +        elif self.label_binarizer_.y_type_ == 'multiclass':
 +            self.out_activation_ = 'softmax'
 +        # Output for binary class and multi-label
 +        else:
 +            self.out_activation_ = 'logistic'
 +
 +        # Initialize coefficient and intercept layers
 +        self.layers_coef_ = []
 +        self.layers_intercept_ = []
 +
 +        for i in range(self.n_layers_ - 1):

#### arjoly Aug 22, 2014

Owner

Small suggestion

for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])


with fan_in / n_fan_in and fan_out/n_fan_out?

#### arjoly Aug 22, 2014

Owner

There are other places where you might want to do that.
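arjoly's suggestion can be sketched as follows; the layer sizes below are made up for illustration, and both spellings produce the same (fan_in, fan_out) pairs:

```python
# Hypothetical layer sizes: input, two hidden layers, output.
layer_units = [64, 100, 50, 10]

# Index-based loop, as currently written in the PR.
shapes_indexed = [(layer_units[i], layer_units[i + 1])
                  for i in range(len(layer_units) - 1)]

# zip-based loop, as suggested: each pair reads as (fan_in, fan_out).
shapes_zipped = [(fan_in, fan_out)
                 for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])]
```

The zip form avoids index arithmetic and names the two sizes at the point of use.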

Owner

### arjoly commented Aug 22, 2014

I would add comments for the attributes that are private:

- a_layers: the values held by each layer except for the output layer.
- deltas: constitutes a large part of the equation that computes the gradient, reflecting the amount of change required for updating the solutions in an iteration.
- layers_units: contains the number of neurons for each layer. It allows for having clean loops.
- coef_grad: the amount of change used to update the coefficient parameters in an iteration.
- intercept_grads: the amount of change used to update the intercept parameters in an iteration.
- layers_coef: the weights connecting layer i and i+1.
- layers_intercept_: the bias vector for layer i+1.
- n_hidden: the number of hidden layers, not counting the input layer nor the output layer.
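The shapes of these private structures can be sketched for a toy network; the sizes here are invented and the names mirror the PR's attributes without the underscore prefix:

```python
import numpy as np

batch_size = 8
layer_units = [4, 5, 3]   # input width, one hidden layer, output width

# one activation array per layer, including the input layer
a_layers = [np.empty((batch_size, n)) for n in layer_units]
# one delta per non-input layer
deltas = [np.empty((batch_size, n)) for n in layer_units[1:]]
# gradients for the weights connecting layer i and layer i + 1
coef_grads = [np.empty((fan_in, fan_out))
              for fan_in, fan_out in zip(layer_units[:-1], layer_units[1:])]
# gradients for the bias vector of layer i + 1
intercept_grads = [np.empty(fan_out) for fan_out in layer_units[1:]]
```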

### arjoly and 1 other commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +                                             weight_init_bound,
 +                                             fan_out))
 +
 +        if self.shuffle:
 +            X, y = shuffle(X, y, random_state=self.random_state)
 +
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))

#### arjoly Aug 22, 2014

Owner

You might want to do

self._a_layers.extend(np.empty((batch_size, n_fan_out))
for n_fan_out in layer_units[1:])


Contributor

+1

### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +
 +        if self.shuffle:
 +            X, y = shuffle(X, y, random_state=self.random_state)
 +
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))
 +        self._deltas = [np.empty((batch_size, layer_units[i + 1]))
 +                        for i in range(self.n_layers_ - 1)]

#### arjoly Aug 22, 2014

Owner

You might want to do

self._deltas = [np.empty_like(a_layer) for a_layer in self._a_layers[:-1]]


### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))
 +        self._deltas = [np.empty((batch_size, layer_units[i + 1]))
 +                        for i in range(self.n_layers_ - 1)]
 +        self._coef_grads = [np.empty((layer_units[i], layer_units[i + 1]))
 +                            for i in range(self.n_layers_ - 1)]
 +        self._intercept_grads = [np.empty(layer_units[i + 1])
 +                                 for i in range(self.n_layers_ - 1)]

#### arjoly Aug 22, 2014

Owner

You might want to do

self._intercept_grads = [np.empty(fan_out) for fan_out in layer_units[1:]]


### arjoly commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +            X, y = shuffle(X, y, random_state=self.random_state)
 +
 +        # l-bfgs does not support mini-batches
 +        if self.algorithm == 'l-bfgs':
 +            batch_size = n_samples
 +        else:
 +            batch_size = np.clip(self.batch_size, 1, n_samples)
 +
 +        # Initialize lists
 +        self._a_layers = [X]
 +        self._a_layers.extend(np.empty((batch_size, layer_units[i + 1]))
 +                              for i in range(self.n_layers_ - 1))
 +        self._deltas = [np.empty((batch_size, layer_units[i + 1]))
 +                        for i in range(self.n_layers_ - 1)]
 +        self._coef_grads = [np.empty((layer_units[i], layer_units[i + 1]))
 +                            for i in range(self.n_layers_ - 1)]

#### arjoly Aug 22, 2014

Owner

You might want to do:

self._coef_grads = [np.empty((fan_in, fan_out)) for fan_in, fan_out
in zip(layer_units[:-1], layer_units[1:])]


### arjoly and 2 others commented on an outdated diff Aug 22, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        # Run the Stochastic Gradient Descent algorithm
 +        if self.algorithm == 'sgd':
 +            prev_cost = np.inf
 +            cost_increase_count = 0
 +
 +            for i in range(self.max_iter):
 +                for batch_slice in gen_batches(n_samples, batch_size):
 +                    self._a_layers[0] = X[batch_slice]
 +                    self.cost_ = self._backprop(X[batch_slice], y[batch_slice])
 +
 +                    # update weights
 +                    for i in range(self.n_layers_ - 1):
 +                        self.layers_coef_[i] -= (self.learning_rate_ *
 +                                                 self._coef_grads[i])
 +                        self.layers_intercept_[i] -= (self.learning_rate_ *
 +                                                      self._intercept_grads[i])

#### arjoly Aug 22, 2014

Owner

Those lines look like daxpy (blas level1)

#### arjoly Aug 22, 2014

Owner

(dumb question) Is there interest in having a varying number of neurons across the hidden layers?
If not, those lines could be written as a BLAS level 2 function.

Contributor

@arjoly, if we do write these lines as blas level 2 functions, should we then force the user not to change n_hidden?

Owner

Probably yes.
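The update arjoly identifies is an axpy (y = a*x + y, BLAS level 1); a minimal numpy sketch of the equivalent in-place operation, with made-up sizes and learning rate:

```python
import numpy as np

lr = 0.05
coef = np.ones((4, 3))            # stands in for one layer of layers_coef_
coef_grad = np.full((4, 3), 0.2)  # stands in for the matching gradient

# The PR's per-layer update is an in-place axpy with a = -lr.
coef -= lr * coef_grad
```

Stacking layers of equal width into one array is what would turn this loop of level-1 calls into a single level-2/3 operation.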

#### larsmans Sep 30, 2014

Owner

Shall we leave the code bumming for later and get this thing in a mergeable state?

 IssamLaradji  intermediate update  fb02230
Contributor

### IssamLaradji commented Aug 22, 2014

 If activation='relu' is always better or faster in your experiments

Great! I will post a benchmark showing the convergence speed and accuracy on different sizes of MNIST for relu against tanh.
 IssamLaradji  updates  425ba58
Contributor

### IssamLaradji commented Aug 23, 2014

 Finally got a sufficiently powerful computer to work on the comments :-). These are the benchmark results on the digits dataset using 3-fold cross-validation:

n_hidden=[50, 25, 10]
    Testing score for relu: 0.9488, time: 3.76
    Testing score for tanh: 0.9310, time: 6.05
n_hidden=[150, 100]
    Testing score for relu: 0.9711, time: 4.78
    Testing score for tanh: 0.9694, time: 3.43
n_hidden=[50, 25]
    Testing score for relu: 0.9627, time: 1.52
    Testing score for tanh: 0.9471, time: 2.64
n_hidden=[50, 100]
    Testing score for relu: 0.9677, time: 1.88
    Testing score for tanh: 0.9605, time: 2.93

@ogrisel, like you said, relu is not always faster than tanh, as tanh sometimes converges faster. But my experimental results showed that relu consistently achieves a higher score than tanh.
 IssamLaradji  doc update  e623fdd

### coveralls commented Aug 23, 2014

 Changes Unknown when pulling e623fdd on IssamLaradji:generic-multi-layer-perceptron into master on scikit-learn.
 IssamLaradji  updates  cdb4dc6

### coveralls commented Aug 23, 2014

 Changes Unknown when pulling cdb4dc6 on IssamLaradji:generic-multi-layer-perceptron into master on scikit-learn.

Owner

### ogrisel commented Aug 25, 2014

 But, my experimental results showed that relu consistently achieves higher score than tanh.

The optimal values for the other hyperparameters (in particular the regularization) are probably not the same for relu and tanh. Can you please try to run a small grid search for the optimal value of alpha when n_hidden=[150, 100]?

### ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
 +    >>> clf.predict([[2., 2.], [-1., -2.]])
 +    array([1, 0])
 +
 +MLP can fit a non-linear model to the training data. clf.layers_coef_
 +contains the weight matrices that constitute the model parameters::
 +
 +    >>> [coef.shape for coef in clf.layers_coef_]
 +    [(2, 5), (5, 2), (2, 1)]
 +
 +To get the raw values before applying the output activation function, run the
 +following command,
 +
 +use :meth:`MultilayerPerceptronClassifier.decision_function`::
 +
 +    >>> clf.decision_function([[2., 2.], [1., 2.]])
 +    array([ 11.55408143,  11.55408143])

#### ogrisel Aug 25, 2014

Owner

You should use the ellipsis feature of doctests to have this test pass on all the travis platforms:


>>> clf.decision_function([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
array([ 11.55..., 11.55...])


### ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
 +use :meth:`MultilayerPerceptronClassifier.decision_function`::
 +
 +    >>> clf.decision_function([[2., 2.], [1., 2.]])
 +    array([ 11.55408143,  11.55408143])
 +
 +Currently, :class:`MultilayerPerceptronClassifier` supports only the
 +Cross-Entropy loss function, which allows probability estimates by running the
 +predict_proba method.
 +
 +MLP trains using backpropagation. For classification, it minimizes the
 +Cross-Entropy loss function, giving a vector of probability estimates
 +:math:`P(y|x)` per sample :math:`x`::
 +
 +    >>> clf.predict_proba([[2., 2.], [1., 2.]])
 +    array([[  9.59670230e-06,   9.99990403e-01],
 +           [  9.59670230e-06,   9.99990403e-01]])

#### ogrisel Aug 25, 2014

Owner

Please use doctest ellipsis here again:

>>> clf.predict_proba([[2., 2.], [1., 2.]])  # doctest: +ELLIPSIS
array([[ 9.59...e-06, 9.99...e-01],
[ 9.59...e-06, 9.99...e-01]])


### ogrisel commented on an outdated diff Aug 25, 2014

doc/modules/neural_networks_supervised.rst
 +a one hidden layer MLP.
 +
 +.. figure:: ../images/multilayerperceptron_network.png
 +   :align: center
 +   :scale: 60%
 +
 +   **Figure 1 : One hidden layer MLP.**
 +
 +The leftmost layer, known as the input layer, consists of a set of neurons
 +:math:`\{x_i | x_1, x_2, ..., x_m\}` representing the input features. Each hidden
 +layer transforms the values from the previous layer by a weighted linear summation
 +:math:`w_1x_1 + w_2x_2 + ... + w_mx_m`, followed by a non-linear activation function
 +:math:`g(\cdot):R \rightarrow R` - like the hyperbolic tan function. The output layer
 +receives the values from the last hidden layer and transforms them into output values.
 +
 +The module contains the public attributes :math:`layers_coef_` and :math:`layers_intecept_`.

#### ogrisel Aug 25, 2014

Owner

You should use double backticks quoting for layers_coef_ & layers_intercept_ (there is also a typo here).

 IssamLaradji  doc update  551a440 IssamLaradji  ellipsis  73a6c73
Contributor

### IssamLaradji commented Aug 26, 2014

 These are the results of the grid search on the digits dataset for alphas=np.logspace(-4, 4, 5) (the first line represents ReLU scores and the second line represents tanh scores).

For n_hidden=[150, 100]

relu : [mean: 0.97106, std: 0.00644, params: {'alpha': 0.0001},
        mean: 0.96717, std: 0.00479, params: {'alpha': 0.01},
        mean: 0.95771, std: 0.01032, params: {'alpha': 1.0},
        mean: 0.96383, std: 0.00416, params: {'alpha': 100.0},
        mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]
tanh : [mean: 0.97051, std: 0.00208, params: {'alpha': 0.0001},
        mean: 0.98386, std: 0.00416, params: {'alpha': 0.01},
        mean: 0.98331, std: 0.00361, params: {'alpha': 1.0},
        mean: 0.96049, std: 0.00343, params: {'alpha': 100.0},
        mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]

For n_hidden=[100, 50]

relu : [mean: 0.96439, std: 0.00208, params: {'alpha': 0.0001},
        mean: 0.97551, std: 0.00928, params: {'alpha': 0.01},
        mean: 0.98108, std: 0.00672, params: {'alpha': 1.0},
        mean: 0.96327, std: 0.00472, params: {'alpha': 100.0},
        mean: 0.10128, std: 0.00079, params: {'alpha': 10000.0}]
tanh : [mean: 0.96605, std: 0.00479, params: {'alpha': 0.0001},
        mean: 0.98386, std: 0.00284, params: {'alpha': 0.01},
        mean: 0.98331, std: 0.00273, params: {'alpha': 1.0},
        mean: 0.96550, std: 0.00630, params: {'alpha': 100.0},
        mean: 0.10184, std: 0.00000, params: {'alpha': 10000.0}]

It doesn't seem like ReLU performs better on average. I think I should test it on bigger datasets with a larger number of layers. Perhaps that's where ReLU shines.

 IssamLaradji  skip  2248aac IssamLaradji  doc update  9c451dc

### coveralls commented Aug 26, 2014

 Coverage increased (+0.07%) when pulling 9c451dc on IssamLaradji:generic-multi-layer-perceptron into 4b82379 on scikit-learn:master.

 IssamLaradji  doc update  855e3e9 IssamLaradji  update  9ed8d1f

### coveralls commented Aug 26, 2014

 Coverage increased (+0.07%) when pulling 9ed8d1f on IssamLaradji:generic-multi-layer-perceptron into 4b82379 on scikit-learn:master.
 IssamLaradji  last update  c2ce21f

### coveralls commented Aug 26, 2014

 Coverage increased (+0.07%) when pulling c2ce21f on IssamLaradji:generic-multi-layer-perceptron into 4b82379 on scikit-learn:master.

### pasky commented Sep 4, 2014

 Hi! I'm sorry to chime in as an external party - I've been watching this PR eagerly for quite some time now, and am admittedly a little disappointed it hasn't been merged yet. I was just wondering if there's some concrete list of TODO items that must be wrapped up before this can be merged, mainly regarding the API, which will stay more or less set in stone at that point? Maybe I could help with some of the items if Issam Laradji is busy with other things... (I wonder if precise tuning of the activation function needs to be figured out before this is merged?)

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

benchmarks/bench_mnist.py
 +
 +
 +def load_data(dtype=np.float32, order='F'):
 +    # Load dataset
 +    print("Loading dataset...")
 +    data = fetch_mldata('MNIST original')
 +    X, y = data.data, data.target
 +    if order.lower() == 'f':
 +        X = np.asfortranarray(X)
 +
 +    # Normalize features
 +    X = X.astype('float64')
 +    X = X / 255
 +
 +    # Create train-test split (as [Joachims, 2006])
 +    logger.info("Creating train-test split...")

#### GaelVaroquaux Sep 4, 2014

Owner

We do not use the logger (there is a pull request on it, but it got stalled). You should simply use a print controlled by a 'verbose' argument.

### GaelVaroquaux commented on the diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
 +
 +def test_verbose_sgd():
 +    """Test verbose."""
 +    X = [[3, 2], [1, 6]]
 +    y = [1, 0]
 +    clf = MultilayerPerceptronClassifier(algorithm='sgd',
 +                                         max_iter=2,
 +                                         verbose=10,
 +                                         n_hidden=2)
 +    old_stdout = sys.stdout
 +    sys.stdout = output = StringIO()
 +
 +    clf.fit(X, y)
 +    clf.partial_fit(X, y)
 +
 +    sys.stdout = old_stdout

#### GaelVaroquaux Sep 4, 2014

Owner

This should be done in the 'finally' of a try/finally' block, so that even if there is an exception, the stdout gets restored.
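The pattern GaelVaroquaux suggests can be sketched as follows; the print call stands in for the verbose output of clf.fit(X, y):

```python
import sys
from io import StringIO

# Redirect stdout, and restore it in a `finally` block so that the
# redirection is undone even if the captured call raises an exception.
old_stdout = sys.stdout
sys.stdout = output = StringIO()
try:
    print("iteration 1, loss = 0.5")  # stand-in for clf.fit(X, y)
finally:
    sys.stdout = old_stdout
```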

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
 +                                        random_state=1,
 +                                        batch_size=X.shape[0])
 +    for i in range(150):
 +        mlp.partial_fit(X, y)
 +
 +    pred2 = mlp.predict(X)
 +    assert_almost_equal(pred1, pred2, decimal=2)
 +    score = mlp.score(X, y)
 +    assert_greater(score, 0.75)
 +
 +
 +def test_partial_fit_errors():
 +    """Test partial_fit error handling."""
 +    X = [[3, 2], [1, 6]]
 +    y = [1, 0]
 +    clf = MultilayerPerceptronClassifier

#### GaelVaroquaux Sep 4, 2014

Owner

You don't need to define the clf intermediate variable here. It mostly hinders readability.

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/tests/test_mlp.py
 +from sklearn.neural_network import MultilayerPerceptronRegressor
 +from sklearn.preprocessing import LabelBinarizer
 +from sklearn.preprocessing import StandardScaler, MinMaxScaler
 +from scipy.sparse import csr_matrix
 +from sklearn.utils.testing import assert_raises, assert_greater, assert_equal
 +
 +
 +np.seterr(all='warn')
 +
 +LEARNING_RATE_TYPES = ["constant", "invscaling"]
 +
 +ACTIVATION_TYPES = ["logistic", "tanh", "relu"]
 +
 +digits_dataset_multi = load_digits(n_class=3)
 +
 +Xdigits_multi = MinMaxScaler().fit_transform(digits_dataset_multi.data[:200])

#### GaelVaroquaux Sep 4, 2014

Owner

There should be an underscore between words: X_digits_multi, and y_digits_multi.

### GaelVaroquaux and 1 other commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 +            The predicted probability of the sample for each class in the
 +            model, where classes are ordered as they are in self.classes_.
 +        """
 +        y_scores = self.decision_function(X)
 +
 +        if y_scores.ndim == 1:
 +            y_scores = logistic(y_scores)
 +            return np.vstack([1 - y_scores, y_scores]).T
 +        else:
 +            return softmax(y_scores)
 +
 +
 +class MultilayerPerceptronRegressor(BaseMultilayerPerceptron, RegressorMixin):
 +    """Multi-layer Perceptron regressor.
 +
 +    Under a loss function, the algorithm trains either by l-bfgs or gradient

#### GaelVaroquaux Sep 4, 2014

Owner

All these paragraphs (everything aside from the first sentence of the docstring) should be moved to a 'Notes' section at the end of the docstring.

#### ogrisel Oct 2, 2014

Owner

I think it's good to have a generic overview in one or two paragraphs here. The list of parameters and attributes is long and people will not necessarily think to scroll down to the end just to get the big picture on how this estimator works.

Owner

### GaelVaroquaux commented Sep 4, 2014

 Naive question: I note that the default activation function is the relu, and the default algorithm the LBFGS. I had in mind that LBFGS was very bad with the relu, because it is not smooth. Am I wrong?

### GaelVaroquaux commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 + """Multi-layer Perceptron regressor. + + Under a loss function, the algorithm trains either by l-bfgs or gradient + descent. The training is iterative, in that at each time step the + partial derivatives of the loss function with respect to the model + parameters are computed to update the parameters. + + It has a regularizer as a penalty term added to the loss function that + shrinks model parameters towards zero. + + This implementation works with data represented as dense and sparse numpy + arrays of floating point values for the features. + + Parameters + ---------- + n_hidden : python list, length = n_layers - 2, default [100]

#### GaelVaroquaux Sep 4, 2014

Owner

Just write 'list', and not 'python list'.

Contributor

### IssamLaradji commented Sep 4, 2014

 Hi @pasky , The requirements are in fact complete, but there might be some minor required changes (like naming or code conventions) that would come out from final reviews by the mentors. I can't wait to get this merged as well :). I will address the comments as they come along, but you are welcome to help. Thanks.

### GaelVaroquaux and 1 other commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        X : {array-like, sparse matrix}, shape (n_samples, n_features)
 +            The input data.
 +
 +        y : array-like, shape (n_samples,)
 +            The target values.
 +
 +        Returns
 +        -------
 +        self : returns a trained MLP model.
 +        """
 +        if self.algorithm != 'sgd':
 +            raise ValueError("only SGD algorithm supports partial fit")
 +
 +        return self._fit(X, y, incremental=True)
 +
 +    def _decision_scores(self, X):

#### GaelVaroquaux Sep 4, 2014

Owner

Stupid question, but isn't this the 'decision_function', according to scikit-learn's definition?

In that case, it should be named accordingly.

Contributor

The difference is that decision_function returns a raveled form of y_pred for the classifier when
self.n_outputs=1. This is not the case for the regressor.

### GaelVaroquaux and 3 others commented on an outdated diff Sep 4, 2014

sklearn/neural_network/multilayer_perceptron.py
 +
 +        # For the last layer
 +        if with_output_activation:
 +            out_activation = ACTIVATIONS[self.out_activation_]
 +            self._a_layers[i + 1] = out_activation(self._a_layers[i + 1])
 +
 +    def _compute_cost_grad(self, layer, n_samples):
 +        """Compute the cost gradient for the layer."""
 +        self._coef_grads[layer] = safe_sparse_dot(self._a_layers[layer].T,
 +                                                  self._deltas[layer])
 +        self._coef_grads[layer] += (self.alpha * self.layers_coef_[layer])
 +        self._coef_grads[layer] /= n_samples
 +
 +        self._intercept_grads[layer] = np.mean(self._deltas[layer], 0)
 +
 +    def _cost_grad_lbfgs(self, packed_coef_inter, X, y):

#### GaelVaroquaux Sep 4, 2014

Owner

As a very general remark on the design of this object/algorithm, I must say that I am a bit uneasy with all the internal states of the algorithm that are very hidden/implicit in the code. I have in mind _a_layers, _deltas, _coef_grads, and _intercept_grads. They make the code hard to follow, because it is not possible, by looking at a function/method call, to know what has been changed.

How feasible would it be to explicitly pass them around the code, rather than having them as attributes on the object? At least for a_layers, since these seem to me to have no intrinsic difference from X and y, at least in terms of code organization.

Owner

+1

#### ogrisel Sep 30, 2014

Owner

We could make those functions have explicit arguments but might still want to keep them as pre-allocated attributes on the model itself for incremental learning with partial_fit.

Contributor

@ogrisel agreed, I kept them mainly for partial_fit. Is there a way I could keep their pre-allocation advantages while improving readability? How about getters and setters?
Cheers.

#### ogrisel Oct 2, 2014

Owner

no actually getters and setters would even be worse. @GaelVaroquaux 's point would be to explicitly pass the preallocated arrays for activations, gradients and deltas as arguments to private methods such as _compute_* and _backprop to make it more explicit about which datastructures are involved just by reading the prototype of the methods.

Actually by reading the code of _fit again, you don't even reuse the preallocated arrays for _a_layers, _coef_grads, _intercept_grads and _deltas, even when incremental is True. There is really no point in making them attributes of the class, please convert them to local variables of the _fit method and pass them explicitly as argument to the private helper methods.

Also, self._a_layers should be renamed activations and self._deltas should better be renamed updates.

#### ogrisel Oct 3, 2014

Owner

... and self._deltas should better be renamed updates.

Actually scratch that, I was confused. self._deltas can be renamed deltas as it's just the difference between the current activations and the backpropagated error at that level.

Contributor

Done, removed the attributes. Indeed the code looks more readable now. :)

Thanks.

Owner

### ogrisel commented Sep 30, 2014

 Naive question: I note that the default activation function is the relu, and the default algorithm the LBFGS. I had in mind that LBFGS was very bad with the relu, because it is not smooth. Am I wrong?

I asked this question earlier and @IssamLaradji reported that LBFGS worked fine with ReLU despite the non-smooth kink at zero. This is rather counter-intuitive to me: #3204 (comment)

### ogrisel commented on an outdated diff Oct 2, 2014

sklearn/neural_network/multilayer_perceptron.py
 +        Number of outputs.
 +
 +    out_activation_ : string
 +        Name of the output activation function.
 +
 +    References
 +    ----------
 +    Hinton, Geoffrey E.
 +    "Connectionist learning procedures." Artificial intelligence 40.1
 +    (1989): 185-234.
 +
 +    Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of
 +    training deep feedforward neural networks." International Conference
 +    on Artificial Intelligence and Statistics. 2010.
 +    """
 +    def __init__(self, n_hidden=[100], activation="relu",

#### ogrisel Oct 2, 2014

Owner

It's considered bad practice to use a mutable default value for kwargs. Either use an immutable tuple: n_hidden=(100,), or, better in this case, make n_hidden=100 work with literal integers (the number of hidden layers would then be assumed to be 1).

If we want to enforce n_hidden to be sequence of ints (list, tuple...), I would rather rename that parameter to hidden_layers_sizes=(100,) instead.
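The pitfall behind this advice can be sketched in a few lines; make_layers is a hypothetical function, not PR code:

```python
# A mutable default such as n_hidden=[100] is a single list object created
# once at function definition time and shared across all calls that rely
# on the default.
def make_layers(n_hidden=[100]):
    n_hidden.append(50)   # any in-place mutation leaks into later calls
    return list(n_hidden)

first = make_layers()
second = make_layers()
# The second call sees the first call's mutation.
```

An immutable tuple default, or a None sentinel replaced inside the function, avoids this sharing.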

Owner

### ogrisel commented Oct 2, 2014

 Actually, members of the deep learning for speech recognition community reported that softplus(x) = log(1 + exp(x)), which is a smooth version of relu, can work significantly better (generalization perf) on some problems. However, when I tried it in a grid search on a small subset (3000 samples) of MNIST, LBFGS seems to have no problem optimizing the non-smooth ReLU (fewer iterations than softplus and significantly faster iterations). And validation accuracy seems to be slightly better as well.

Here are the results of a grid search on 3000 digits:

[mean: 0.89889, std: 0.01571, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.89722, std: 0.00550, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 0.0001},
 mean: 0.89333, std: 0.01434, params: {'activation': 'relu', 'n_hidden': [100, 100, 100], 'alpha': 1e-08},
 mean: 0.89056, std: 0.01853, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 0.0001},
 mean: 0.88833, std: 0.01312, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.88722, std: 0.00685, params: {'activation': 'relu', 'n_hidden': [100, 100], 'alpha': 1e-08},
 mean: 0.88556, std: 0.01530, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 1e-08},
 mean: 0.88333, std: 0.01650, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 0.0001},
 mean: 0.88278, std: 0.01235, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 9.9999999999999995e-07},
 mean: 0.88222, std: 0.01577, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 0.0001},
 mean: 0.88111, std: 0.00906, params: {'activation': 'relu', 'n_hidden': [100], 'alpha': 1e-08},
 mean: 0.88111, std: 0.00671, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 1e-08},
 mean: 0.88111, std: 0.00685, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 0.0001},
 mean: 0.87833, std: 0.01534, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 0.0001},
 mean: 0.87778, std: 0.01517, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 9.9999999999999995e-07},
 mean: 0.87667, std: 0.00624, params: {'activation': 'softplus', 'n_hidden': [100], 'alpha': 9.9999999999999995e-07},
 mean: 0.87444, std: 0.01612, params: {'activation': 'softplus', 'n_hidden': [100, 100], 'alpha': 1e-08},
 mean: 0.87111, std: 0.01370, params: {'activation': 'softplus', 'n_hidden': [100, 100, 100], 'alpha': 9.9999999999999995e-07}]

Here is my implementation of softplus:

diff --git a/sklearn/neural_network/base.py b/sklearn/neural_network/base.py
index 114bba5..e7790b3 100644
--- a/sklearn/neural_network/base.py
+++ b/sklearn/neural_network/base.py
@@ -71,8 +71,12 @@ def relu(X):
     X_new : {array-like, sparse matrix}, shape (n_samples, n_features)
         The transformed data.
     """
-    np.clip(X, 0, np.finfo(X.dtype).max, out=X)
-    return X
+    return np.clip(X, 0, np.finfo(X.dtype).max, out=X)
+
+
+def softplus(X):
+    # log(1 + exp(X))
+    return np.logaddexp(X, 0, out=X)
 
 
 def softmax(X):
@@ -96,7 +100,7 @@ def softmax(X):
 
 ACTIVATIONS = {'identity': identity, 'tanh': tanh, 'logistic': logistic,
-               'relu': relu, 'softmax': softmax}
+               'relu': relu, 'softmax': softmax, 'softplus': softplus}
 
 
 def logistic_derivative(Z):
@@ -148,7 +152,7 @@ def relu_derivative(Z):
 
 DERIVATIVES = {'tanh': tanh_derivative, 'logistic': logistic_derivative,
-               'relu': relu_derivative}
+               'relu': relu_derivative, 'softplus': logistic}
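A quick numeric check of the "smooth relu" claim, using the same np.logaddexp trick as the softplus implementation above:

```python
import numpy as np

# softplus(x) = log(1 + exp(x)) tracks max(x, 0) for large |x| while
# staying differentiable at zero.
x = np.linspace(-10, 10, 101)
softplus = np.logaddexp(x, 0)   # log(exp(x) + exp(0)), overflow-safe
relu = np.clip(x, 0, None)

# The gap log(1 + exp(-|x|)) is largest at x = 0, where it equals log(2).
gap = np.max(np.abs(softplus - relu))
```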

### ogrisel commented on an outdated diff Oct 3, 2014

sklearn/neural_network/multilayer_perceptron.py
 +            for batch_slice in gen_batches(n_samples, batch_size):
 +                self._a_layers[0] = X[batch_slice]
 +                self.cost_ = self._backprop(X[batch_slice], y[batch_slice])
 +
 +                # update weights
 +                for i in range(self.n_layers_ - 1):
 +                    self.layers_coef_[i] -= (self.learning_rate_ *
 +                                             self._coef_grads[i])
 +                    self.layers_intercept_[i] -= (self.learning_rate_ *
 +                                                  self._intercept_grads[i])
 +
 +            if self.learning_rate == 'invscaling':
 +                self.learning_rate_ = self.learning_rate_init / \
 +                    (self.n_iter_ + 1) ** self.power_t
 +
 +            self.n_iter_ += 1

#### ogrisel Oct 3, 2014

Owner

self.n_iter_ currently does not have the same meaning for LBFGS and SGD: this is a source of confusion.

We should instead introduce a new variable (for instance self.t_, to be consistent with SGDClassifier) for the learning rate schedule: the total number of samples that were used to train the model (irrespective of the fact that some samples might have been seen several times when training with several passes over a finite training set).

self.n_iter_ on the other hand should always reflect the number of "epochs", that is the number of passes over the full training set. This attribute should not be set when the user calls the partial_fit method, as we don't know the total size of the full training set in that case.
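The bookkeeping ogrisel describes can be sketched as follows; the names t_ and n_iter_ follow SGDClassifier, and the loop bounds are made up for illustration:

```python
n_samples, batch_size, max_iter = 100, 10, 5

t_ = 0        # total samples seen; drives the learning-rate schedule
n_iter_ = 0   # epochs, i.e. full passes over the training set

for epoch in range(max_iter):
    for start in range(0, n_samples, batch_size):
        batch = min(batch_size, n_samples - start)
        t_ += batch          # advances once per sample consumed
    n_iter_ += 1             # advances once per full pass
```

With partial_fit only t_ can be maintained, since the size of the full training set is unknown.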

Owner

### ogrisel commented Oct 3, 2014

 Please refactor the _fit method to call a submethod per optimizer. For instance, at the end of _fit:

fit_method = getattr(self, '_fit_' + self.algorithm, None)
if fit_method is None:
    raise ValueError('algorithm="%s" is not supported by %s'
                     % (self.algorithm, type(self).__name__))
fit_method(X, y, activations, deltas)

Maybe also pass to fit_method additional datastructures initialized in _fit that I might have missed. This should make it easier for the user to derive from MultilayerPerceptronClassifier or MultilayerPerceptronRegressor to implement more experimental optimizers (e.g. Adadelta for instance).
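A minimal runnable version of this dispatch pattern; the Estimator class here is a stand-in for illustration, not the actual PR code:

```python
class Estimator:
    def __init__(self, algorithm='sgd'):
        self.algorithm = algorithm

    def _fit_sgd(self):
        # stand-in for the real SGD optimization loop
        return 'fitted with sgd'

    def fit(self):
        # look up _fit_<algorithm> by name; None if no such method exists
        fit_method = getattr(self, '_fit_' + self.algorithm, None)
        if fit_method is None:
            raise ValueError('algorithm="%s" is not supported by %s'
                             % (self.algorithm, type(self).__name__))
        return fit_method()

result = Estimator('sgd').fit()
```

A subclass can then add support for a new optimizer just by defining another _fit_<name> method.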
Owner

### ogrisel commented Oct 3, 2014

 Also please raise a ConvergenceWarning when the tol-based convergence criterion is not met prior to reaching max_iter in the fit method (do not do that for partial_fit).
Owner

### ogrisel commented Oct 3, 2014

 To get examples of usage of ConvergenceWarning in scikit-learn, run git grep ConvergenceWarning.
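A self-contained sketch of the requested behavior; the ConvergenceWarning class below is a stand-in for scikit-learn's own (which git grep would surface), so the example runs on its own, and the dummy loss decay is invented:

```python
import warnings


class ConvergenceWarning(UserWarning):
    # stand-in for scikit-learn's ConvergenceWarning
    pass


def fit_loop(max_iter=3, tol=1e-4):
    loss, prev_loss = 1.0, None
    for n_iter in range(max_iter):
        loss *= 0.9                      # dummy optimization step
        if prev_loss is not None and prev_loss - loss < tol:
            return loss                  # tol-based criterion met
        prev_loss = loss
    # max_iter exhausted without meeting tol: warn instead of failing
    warnings.warn("Maximum iterations reached and the optimization "
                  "hasn't converged yet.", ConvergenceWarning)
    return loss


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fit_loop(max_iter=3)
```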
Owner

### ogrisel commented Oct 3, 2014

 I started to summarize the remaining work in a todo list at the top of the PR.
Owner

### ogrisel commented Oct 6, 2014

 The doctests need to be updated: https://travis-ci.org/scikit-learn/scikit-learn/jobs/37154106#L1569

 IssamLaradji  updates  bb74c95 IssamLaradji  better updates  c3c49fc IssamLaradji  doc update  e5e002e IssamLaradji  doc update  b5f6500 IssamLaradji  doc update  52e4253 IssamLaradji  improved readability.  e037c7f
Owner

### amueller commented Dec 4, 2014

 @IssamLaradji are you working on refactoring the _fit method? Otherwise I'd be happy to help.
Owner

### amueller commented Dec 4, 2014

 Btw, did you try to do the MNIST bench with SGD? That should be quite a bit faster. I didn't get it to work though :-/
Owner

### amueller commented Dec 4, 2014

 There is a refactoring here: https://github.com/amueller/scikit-learn/tree/pr/3204
Owner

### ogrisel commented Dec 5, 2014

 I could not make it work either. I suspect a bug in the SGD solver. Also, we should add Nesterov momentum to the SGD solver with momentum=0.9 by default, as otherwise there can be problems where SGD converges too slowly.
Contributor

### IssamLaradji commented Dec 5, 2014

 Hi @amueller , it would be great if you help! Thanks :) It seems 'SGD' is not running as expected, let me double check the gradients. +1 for momentum.
Owner

### amueller commented Dec 5, 2014

 I agree about momentum. I'll see if I can make it work and I'll submit a parallel PR.
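A sketch of a classical momentum update on a toy quadratic objective (the Nesterov variant discussed here differs in where the gradient is evaluated); all names, sizes, and the objective are illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
coef = rng.randn(5)                 # stands in for one flattened weight layer
velocity = np.zeros_like(coef)
learning_rate, momentum = 0.1, 0.9


def grad(w):
    # gradient of the toy objective ||w||^2
    return 2 * w


for _ in range(300):
    # accumulate a decaying average of past gradients, then step
    velocity = momentum * velocity - learning_rate * grad(coef)
    coef += velocity
```

The velocity term lets the update keep moving along directions with consistent gradient sign, which is what speeds up plain SGD.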
Owner

### amueller commented Dec 5, 2014

 I find it a bit confusing that coef_grads and intercept_grads are both modified in-place and returned by the functions operating on them. What is the reason for that?
Contributor

### IssamLaradji commented Dec 5, 2014

 @amueller, you mean this?

```python
# Compute gradient for the last layer
coef_grads, intercept_grads = self._compute_cost_grad(
    last, n_samples, activations, deltas, coef_grads, intercept_grads)
```

The reason is that I call self._compute_cost_grad twice: once for the output layer and once in the for loop over the hidden layers. I guess it would be more readable if I combined the output layer and the hidden layers in one for loop? That way I wouldn't need the _compute_cost_grad method.
Owner

### amueller commented Dec 5, 2014

 The gradients are fine, I think. I forgot to shuffle MNIST :-/ Now it looks good. Maybe we want to set shuffle=True by default? It is so cheap compared to the backprop.

Closed

Owner

### amueller commented Dec 5, 2014

 @IssamLaradji That was the place I meant. Sorry, I don't understand your explanation. Would the behavior of the code change if you discarded the return value of _compute_cost_grad?
Contributor

### IssamLaradji commented Dec 5, 2014

 @amueller oh, I thought you meant something else. It wouldn't change the behavior: I could discard the return value and the left-hand side of the assignment coef_grads, intercept_grads = self._compute_cost_grad(...), and the results would remain the same. Also, +1 for setting shuffle=True as the default.
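The point can be demonstrated with a toy in-place gradient update; the function below is only an illustration, not the PR's `_compute_cost_grad`:

```python
import numpy as np


def compute_grad(activations, deltas, coef_grads):
    # Writes into coef_grads in place; the returned object is the same array,
    # so callers may discard the return value without changing behavior.
    coef_grads[:] = activations.T @ deltas
    return coef_grads


activation = np.array([[1.0, 2.0]])
delta = np.array([[0.5]])
grads = np.zeros((2, 1))
result = compute_grad(activation, delta, grads)
```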

Open

Closed

Owner

### amueller commented Dec 17, 2014

 Training time for bench_mnist.py is twice as high on my box as what you reported, but only for the MLP; the others have comparable speed. Could you run it again with the current parameters and see if it is still the same for you? How many cores do you have?
Contributor

### IssamLaradji commented Dec 18, 2014

 Strange, I ran it again now and I got:

```
Classifier                train-time      test-time    error-rate
-----------------------------------------------------------------
MultilayerPerceptron      364.75999999    0.088        0.0178
```

which is half the original training time. Are you training using lbfgs or sgd? lbfgs tends to converge faster. My machine is equipped with 8 GB RAM and an Intel® Core™ i7-2630QM processor (6M cache, 2.00 GHz).
Owner

### ogrisel commented Dec 18, 2014

 > lbfgs tends to converge faster.

In my PR against @amueller's branch, with the enhanced "constant" learning rate and momentum, SGD seems to be faster than LBFGS, although I have not plotted the "validation score vs. epoch" curve, as we have no way to do so at the moment.
Owner

### amueller commented Dec 18, 2014

 I ran exactly the same code, so lbfgs. I think we should definitely do SGD, as it should be much faster on MNIST.

 👍
Contributor

### digital-dharma commented Apr 7, 2015

 Excellent work, all, and an exciting feature to be added to sklearn! I have been looking forward to this functionality for a while. Does merging still appear likely, or has momentum dissipated?
Owner

### amueller commented Apr 7, 2015

 It will definitely be merged, and soon.
Contributor

### digital-dharma commented Apr 7, 2015

 @amueller - That's fantastic news, I'm very much looking forward to it. Great work as always!

### naught101 commented May 1, 2015

 Using this a bit at the moment. Looks nice. Some notes:

- Currently, if y in MultilayerPerceptronRegressor.fit() is a vector (shape (n,)), .predict() returns a 2D array with shape (n, 1). Other regressors just return a vector in the same format as y.
- That's a really long class name. Could it be MLPRegressor instead, similar to SGDRegressor? That abbreviation is common enough, I think (it's on the Wikipedia disambiguation page, comes first in a Google search for 'MLP learning', and I don't think people will confuse it with My Little Pony).
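The shape mismatch in the first note can be reproduced and fixed with `numpy.ravel`; this is a sketch of the issue, not the estimator's actual code:

```python
import numpy as np

# A column-vector prediction, as the regressor returned at the time:
y_pred_2d = np.array([[0.1], [0.2], [0.3]])

# Flattening restores the (n,) shape that matches a 1-D training target:
y_pred = np.ravel(y_pred_2d)
```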
Owner

### amueller commented May 1, 2015

 @naught101 It is a long name... maybe we should use MLP. Can you check if the shape is still wrong in #3939?

### naught101 commented May 4, 2015

 @amueller: Yes, the shape is still wrong.
Owner

### amueller commented May 4, 2015

 Huh, I wonder why the common tests don't complain.
Owner

### amueller commented May 4, 2015

 Thanks for checking.

### amueller referenced this pull request May 19, 2015

Merged

#### [MRG + 2] FIX ransac output shape, add test for regressor output shapes #4739

 IssamLaradji  added learning rates  5dce1a0

Closed

Merged

Owner

### amueller commented Oct 23, 2015

 Merged via #5214

Contributor

### IssamLaradji commented Oct 23, 2015

 Waw!! That's fantastic!! :) :) Great work team!

### naught101 commented Oct 24, 2015

 Thank you to everyone who worked on this. It will be really useful.

### pasky commented Oct 24, 2015

 Yes, thank you very much! I've been waiting for this for a long time. (And sorry that I never ended up making good on my offer to help.)
Owner

### jnothman commented Oct 24, 2015

 > Waw!! That's fantastic!! :) :) Great work team!

Yes, aren't sprints amazing from the outside? Dormant threads are suddenly marked merged, that project you'd been trying to complete forever is now off your todo list, and you're ready to book a holiday... Thank you to all the sprinters from those of us on the outside; it's been a good one!

On 24 October 2015, Petr Baudis wrote:

> Yes, thank you very much! I've been waiting for this for a long time. (And sorry that I never ended up making good on my offer to help.)
Contributor

### IssamLaradji commented Oct 25, 2015

 @jnothman indeed! It's a great surprise to see it merged, as I felt this would stay dormant for a much longer time. Thanks a lot for your great reviews and effort, team!!