[MRG] GSoC 2014: Standard Extreme Learning Machines #3306

Open
wants to merge 2 commits into from
@IssamLaradji

Finished implementing the standard extreme learning machines (ELMs). I am getting the following results with 550 hidden neurons on the digits dataset:

Training accuracy using the logistic activation function: 0.999444
Training accuracy using the tanh activation function: 1.000000

Fortunately, this algorithm is much easier to implement and debug than multi-layer perceptron :).
I will push a test file soon.
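
Below is a rough sketch of how these numbers could be reproduced with the API proposed in this PR (ELMClassifier only exists on this branch, and the exact defaults may differ):

from sklearn.datasets import load_digits
from sklearn.neural_network import ELMClassifier  # only on this branch

digits = load_digits()
X, y = digits.data, digits.target

for activation in ('logistic', 'tanh'):
    clf = ELMClassifier(n_hidden=550, activation=activation)
    clf.fit(X, y)
    print(activation, clf.score(X, y))  # training accuracy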

@ogrisel , @larsmans

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling e5e363d on IssamLaradji:Extreme-Learning-Machines into 68b0a28 on scikit-learn:master.

@sveitser sveitser referenced this pull request in Shippable/support
Closed

Display Code Coverage on Github Pull Request Page #239

sklearn/neural_network/extreme_learning_machines.py
((109 lines not shown))
+ A += self.intercept_hidden_
+
+ Z = self._activation_func(A)
+
+ return Z
+
+ def fit(self, X, y):
+ """Fit the model to the data X and target y.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : numpy array of shape (n_samples)
@larsmans Owner

y should be an "array-like" and be validated as such.

Thanks for bringing this up. I made the changes in multi-layer perceptron as well.

sklearn/neural_network/extreme_learning_machines.py
((149 lines not shown))
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+
+ Returns
+ -------
+ array, shape (n_samples)
+ Predicted target values per element in X.
+ """
+ X = atleast2d_or_csr(X)
+
+ self.hidden_activations_ = self._get_hidden_activations(X)
+ output = safe_sparse_dot(self.hidden_activations_, self.coef_output_)
+
+ return output
+
@NelleV Owner
NelleV added a note

There should be an extra blank line here. Can you run pep8 on the file and check for pep8 compliance?

Thanks - I had problems with pep8 auto-formatting in Sublime Text; it is fixed now.

@IssamLaradji

Hi, I am wondering what extreme learning machines should display in verbose mode. Any ideas?

Thanks

@IssamLaradji

Travis is acting strangely: it raises an error for test_multilabel_classification(), although on my local machine the test_multilabel_classification() method in test_elm runs correctly with 1000 different seeds. The pull request also passed the local tests after executing make test on the whole library.

Is there a chance that Travis uses different library versions (or modified versions) from my local setup for testing?

@IssamLaradji IssamLaradji referenced this pull request
Open

[MRG] Generic multi layer perceptron #3204

3 of 4 tasks complete
@arjoly
Owner

It might be worth having a look at https://github.com/dclambert/Python-ELM.

@larsmans
Owner

Training squared error loss would seem appropriate for verbose output. Not every estimator has verbose output, though (naive Bayes doesn't because it runs instantly on typical problem sizes).

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 2be2941 on IssamLaradji:Extreme-Learning-Machines into 68b0a28 on scikit-learn:master.

@IssamLaradji

Thanks, displaying the training error in verbose mode is a useful idea.

@ogrisel
Owner

However, Travis raises an error for test_multilabel_classification(). Is there a chance that Travis uses different library versions (or modified versions) from my local setup for testing?

The versions of numpy / scipy used by the various Travis workers are given in the environment variables of each build. You can see the exact setup in:

@IssamLaradji

@ogrisel thanks, I will dig deeper to see where multi-label classification is being affected.

@IssamLaradji

Hi guys, I implemented weighted and regularized ELMs - here are their awesome results on the imbalanced dataset. :) :)

Non-Regularized ELMs (Large C)
non_regularized_elm

Regularized ELMs (Small C)
regularized_elm

sklearn/neural_network/extreme_learning_machines.py
((174 lines not shown))
+
+ self._init_random_weights()
+
+ H_tmp = self._get_hidden_activations(X)
+
+ if self.class_weight is not None:
+ # compute weighted output coefficients using eq. 12 in
+ # reference [1]
+ W = self._set_weights(y, n_samples)
+ H = safe_sparse_dot(H_tmp.T, W).T
+ else:
+ H = H_tmp
+
+ # compute regularized output coefficients using eq. 3 in reference [1]
+ left_part = pinv2(
+ safe_sparse_dot(H.T, H_tmp) + identity(self.n_hidden) / self.C)
@agramfort Owner

you should use the ridge implementation here.

Hi @agramfort, isn't this technically ridge regression? I am minimizing the L2 norm of the coefficients in the objective function - like in the equation below. Or do you mean I should use the scikit-learn implementation of ridge? Thanks.

l_elm

@agramfort Owner

this does not look like ridge, but you seem to compute

(H'H + 1/C Id)^{-1} H'

and this is really a ridge solution where H plays the role of X, y is y, and C = 1/alpha

Sorry, the equation I gave is for weighted ELMs, as it contains the weight term W which is not part of ridge. However, the implementation contains both versions - with W and without W.
The version without W computes the formula you mentioned, (H'H + 1/C Id)^{-1} H'y.
Thanks.

@agramfort Owner

without W it is then a ridge
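
For intuition, here is a minimal standalone numpy/scikit-learn sketch (not the PR's code) checking the equivalence described above between the closed-form solution and Ridge with alpha = 1/C:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
H = rng.randn(50, 10)  # stands in for the hidden activations
y = rng.randn(50)
C = 10e5

# closed form discussed above: (H'H + 1/C * Id)^{-1} H'y
beta_closed = np.linalg.solve(np.dot(H.T, H) + np.eye(10) / C, np.dot(H.T, y))

# the same solution via Ridge with alpha = 1/C and no intercept
beta_ridge = Ridge(alpha=1.0 / C, fit_intercept=False).fit(H, y).coef_

print(np.allclose(beta_closed, beta_ridge))  # True, up to numerical tolerance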

sklearn/neural_network/extreme_learning_machines.py
((330 lines not shown))
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+
+ Returns
+ -------
+ array, shape (n_samples, n_outputs)
+ Returns the probability of the sample for each class in the model,
+ where classes are ordered as they are in `self.classes_`.
+ """
+ scores = self.decision_function(X)
+
+ if len(self.classes_) == 2:
+ scores = logistic_sigmoid(scores)
+ return np.hstack([1 - scores, scores])
+ else:
+ return _softmax(scores)
@agramfort Owner

getting a proba here seems like a hack unless you use a log reg on top of your hidden features. Not a ridge.

I see, so to get proper probabilities I should use the regular least-squares solution ||Ax - B||^2 without minimizing the norm of the coefficients?
Thank you
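
To illustrate the suggestion above, here is a standalone sketch (not the PR's code) of fitting a logistic regression on top of random hidden features to get calibrated probabilities, instead of squashing ridge scores with a sigmoid/softmax:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X, y = digits.data / 16., digits.target

# random hidden layer, standing in for the ELM's random projection + tanh
rng = np.random.RandomState(0)
W = rng.uniform(-0.1, 0.1, (X.shape[1], 200))
b = rng.uniform(-0.1, 0.1, 200)
H = np.tanh(np.dot(X, W) + b)

log_reg = LogisticRegression(C=10e5).fit(H, y)
proba = log_reg.predict_proba(H)
print(proba.shape, proba[:3].sum(axis=1))  # rows sum to 1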

@IssamLaradji

Pushed a lot of improvements.
1) Added sequential ELM support - with partial_fit
2) Added relevant tests for sequential ELM and weighted ELM

Created two examples.
1) Weighted ELM plot
plot_weighted

2) Training vs. Testing with respect to hidden neurons
plot_testing_training

I will leave the documentation till the end - after I implement the remaining part (kernel support) and after the code is reviewed. Thanks.

examples/neural_networks/plot_weighted_elm.py
((54 lines not shown))
+
+clf_weightless = ELMClassifier(n_hidden=n_hidden, class_weight=None)
+clf_weightless.fit(X, Y)
+
+clf_weight_auto = ELMClassifier(n_hidden=n_hidden, class_weight='auto')
+clf_weight_auto.fit(X, Y)
+
+clf_weight_1000 = ELMClassifier(n_hidden=n_hidden, class_weight={1: 1000})
+clf_weight_1000.fit(X, Y)
+
+fig, axes = plt.subplots(1, 3, figsize=(20, 7))
+
+plot_decision_function(
+ clf_weightless, axes[0], 'ELM(class_weight=None, C=10e5)')
+plot_decision_function(
+ clf_weight_auto, axes[1], 'ELM(class_weight=\'auto\', C=10e5)')
@agramfort Owner

rather than using \' use " to define the string: 'ELM(class_weight="auto", C=10e5)'

sklearn/neural_network/extreme_learning_machines.py
((21 lines not shown))
+from ..utils import check_random_state, atleast2d_or_csr
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _identity(X):
+ """Return the same input array."""
+ return X
+
+
+def _tanh(X):
+ """Compute the hyperbolic tan function
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
@agramfort Owner

parameter descriptions are missing here and below

@vene Owner
vene added a note

Is this worth factoring out? It's just used in two places.

sklearn/neural_network/extreme_learning_machines.py
@@ -0,0 +1,680 @@
+"""Extreme Learning Machines
+"""
+
+# Author: Issam H. Laradji <issam.laradji@gmail.com>
+# Licence: BSD 3 clause
+
+from abc import ABCMeta, abstractmethod
+
+import numpy as np
+
+from numpy import diag
@agramfort Owner

why not use np.diag? That's what we usually do.

sklearn/neural_network/extreme_learning_machines.py
((41 lines not shown))
+ """
+ return np.tanh(X, X)
+
+
+def _softmax(Z):
+ """Compute the K-way softmax, (exp(Z).T / exp(Z).sum(axis=1)).T
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+
+ Returns
+ -------
+ X_new : {array-like, sparse matrix}, shape (n_samples, n_features)
+ """
+ exp_Z = np.exp(Z.T - Z.max(axis=1)).T
@agramfort Owner

rather than using these .T twice, use newaxis:

np.exp(Z - Z.max(axis=1)[:, np.newaxis])

same below in the return statement. It's more readable.
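
A quick standalone check that the two formulations compute the same thing:

import numpy as np

Z = np.random.RandomState(0).randn(4, 3)

a = np.exp(Z.T - Z.max(axis=1)).T             # current double-transpose version
b = np.exp(Z - Z.max(axis=1)[:, np.newaxis])  # suggested newaxis version
print(np.allclose(a, b))  # True; b just reads more naturally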

sklearn/neural_network/extreme_learning_machines.py
((60 lines not shown))
+class BaseELM(six.with_metaclass(ABCMeta, BaseEstimator)):
+
+ """Base class for ELM classification and regression.
+
+ Warning: This class should not be used directly.
+ Use derived classes instead.
+ """
+ _activation_functions = {
+ 'tanh': _tanh,
+ 'logistic': logistic_sigmoid
+ }
+
+ @abstractmethod
+ def __init__(
+ self, n_hidden, activation, algorithm, C, class_weight, batch_size,
+ verbose, random_state):
@agramfort Owner
    def __init__(self, n_hidden, activation, algorithm, C, class_weight,
                 batch_size, verbose, random_state):

is more standard indentation.

sklearn/neural_network/extreme_learning_machines.py
((136 lines not shown))
+ class_weight = {}
+
+ for class_ in np.unique(y_original):
+ class_size = len(np.where(y_original == class_)[0])
+ class_weight[class_] = 0.618 / class_size
+ else:
+ class_weight = dict(self.class_weight)
+
+ for class_ in self.classes_:
+ indices = np.where(y_original == class_)[0]
+ if class_ in class_weight.keys():
+ diagonals[indices] = class_weight[class_]
+ else:
+ diagonals[indices] = 1
+
+ return diag(diagonals)
@agramfort Owner

do you really need to allocate a full dense matrix? I doubt it.

Fixed :)
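
For reference, the diagonal weighting can be done with broadcasting instead of allocating a dense diagonal matrix (a standalone illustration, not the PR's code):

import numpy as np

rng = np.random.RandomState(0)
H = rng.randn(5, 3)
d = rng.rand(5)  # per-sample class weights

dense = np.dot(np.diag(d), H)       # allocates an (n_samples, n_samples) matrix
broadcast = d[:, np.newaxis] * H    # same result, no dense diagonal needed
print(np.allclose(dense, broadcast))  # True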

sklearn/neural_network/extreme_learning_machines.py
((162 lines not shown))
+ """Compute the least-square solutions for the whole dataset."""
+ H = self._get_hidden_activations(X)
+
+ # compute output coefficients by evaluating
+ # (ZH + identity/C)^{-1}Zy
+ if self.class_weight is not None:
+ # set Z = H'W for weighted ELM
+ W = self._assign_weights(y)
+ Z = safe_sparse_dot(H.T, W)
+ else:
+ # set Z = H' for ELM
+ Z = H.T
+
+ left_part = safe_sparse_dot(Z, H) + identity(self.n_hidden) / self.C
+ right_part = safe_sparse_dot(Z, y)
+ self.coef_output_ = safe_sparse_dot(pinv2(left_part), right_part)
@agramfort Owner

solving a linear system with pinv2 is never recommended due to numerical errors. I am pretty sure there is a better way to do it.

I am now solving the system using scipy.linalg.solve; is this more efficient? Thanks.
It takes around half the computation time as well. :)

@agramfort Owner

Oh yes, I am using what ridge is using. I will try creating a RidgeClassifier object to compute the ELM solutions. Thanks.

So, I am reusing ridge.ridge_regression, which makes the code much cleaner :). But I don't think I can reuse it for sequential ELM since the equation is fundamentally different.
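
For intuition only (this is not the PR's recursive OS-ELM update): accumulating the normal equations over batches and then solving once gives the same coefficients as the full-batch ridge formula, as this standalone sketch shows.

import numpy as np

rng = np.random.RandomState(0)
H = rng.randn(300, 20)   # hidden activations for the whole dataset
y = rng.randn(300, 1)
C = 10e5

# full-batch solution: (H'H + Id/C)^{-1} H'y
full = np.linalg.solve(np.dot(H.T, H) + np.eye(20) / C, np.dot(H.T, y))

# accumulate H'H and H'y over batches, then solve once
K = np.zeros((20, 20))
Hy = np.zeros((20, 1))
for sl in np.array_split(np.arange(300), 3):
    K += np.dot(H[sl].T, H[sl])
    Hy += np.dot(H[sl].T, y[sl])
sequential = np.linalg.solve(K + np.eye(20) / C, Hy)

print(np.allclose(full, sequential))  # True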

sklearn/neural_network/extreme_learning_machines.py
((379 lines not shown))
+ References
+ ----------
+ Zong, Weiwei, Guang-Bin Huang, and Yiqiang Chen.
+ "Weighted extreme learning machine for imbalance learning."
+ Neurocomputing 101 (2013): 229-242.
+
+ Liang, Nan-Ying, et al.
+ "A fast and accurate online sequential learning algorithm for
+ feedforward networks." Neural Networks, IEEE Transactions on
+ 17.6 (2006): 1411-1423.
+ http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf
+ """
+
+ def __init__(
+ self, n_hidden=500, activation='tanh', algorithm='regular', C=10e5,
+ class_weight=None, batch_size=200, verbose=False, random_state=None):
@agramfort Owner

same remark on indent

sklearn/neural_network/extreme_learning_machines.py
((528 lines not shown))
+
+ y = column_or_1d(y, warn=True)
+
+ y = self._lbin.fit_transform(y)
+ super(ELMClassifier, self).partial_fit(X, y)
+
+ return self
+
+
+class ELMRegressor(BaseELM, RegressorMixin):
+
+ """Extreme learning machines regressor.
+
+ The algorithm trains a single-hidden layer feedforward network by computing
+ the hidden layer values using randomized parameters, then solving
+ for the output weights using least-square solutions.
@agramfort Owner

this description is the same as for ELMClassifier - is that intended?

No, sorry. The difference is that ELMClassifier has an output gate function that converts continuous values to integers. Will change it now.

sklearn/neural_network/extreme_learning_machines.py
((593 lines not shown))
+ ----------
+ Zong, Weiwei, Guang-Bin Huang, and Yiqiang Chen.
+ "Weighted extreme learning machine for imbalance learning."
+ Neurocomputing 101 (2013): 229-242.
+
+ Liang, Nan-Ying, et al.
+ "A fast and accurate online sequential learning algorithm for
+ feedforward networks." Neural Networks, IEEE Transactions on
+ 17.6 (2006): 1411-1423.
+ http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf
+ """
+
+ def __init__(
+ self, n_hidden=100, activation='tanh', algorithm='regular',
+ batch_size=200, C=10e5, verbose=False, random_state=None):
+ class_weight = None
@agramfort Owner

indent

@IssamLaradji

@agramfort thanks for your comments. I pushed the updated code.

@IssamLaradji

Updates:
1) ELM now uses ridge regression as an off-the-shelf solver to compute its solutions.
2) Added support for kernels - linear, poly, rbf, sigmoid.
Is there a way we could reuse the fast, efficient SVM kernel methods?
Thanks.

@larsmans
Owner

There are kernels in sklearn.metrics. The ones in sklearn.svm are buried deep down in the C++ code for LibSVM.
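
For example, the pairwise kernels in sklearn.metrics can be used directly to build the kernel "hidden activations" (a standalone sketch, not the PR's code):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.RandomState(0)
X_train = rng.randn(20, 5)
X_test = rng.randn(4, 5)

# one column per training sample
H_rbf = rbf_kernel(X_test, X_train, gamma=1.0 / X_train.shape[1])
H_poly = polynomial_kernel(X_test, X_train, degree=3)
print(H_rbf.shape, H_poly.shape)  # (4, 20) (4, 20)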

doc/modules/neural_networks_supervised.rst
@@ -0,0 +1,330 @@
+.. _neural_network:
+
+==================================
+Neural network models (supervised)
+==================================
+
+.. currentmodule:: sklearn.neural_network
+
+
+.. _multilayer_perceptron:
+
+Multi-layer Perceptron
+======================
@agramfort Owner

why is the doc for MLP in this ELM PR?

Hi, I removed it, it was added by accident in the last push :)

@IssamLaradji

Thanks! Reusing scikit-learn kernels made the code much cleaner.

examples/neural_networks/plot_elm_training_vs_testing.py
((7 lines not shown))
+neurons. The more hidden neurons, the lower the training error, which eventually
+reaches zero. However, the testing error does not necessarily decrease, as having
+more hidden neurons than necessary causes overfitting on the data.
+
+"""
+print(__doc__)
+
+# Author: Issam H. Laradji <issam.laradji@gmail.com>
+# License: BSD 3 clause
+
+import numpy as np
+import matplotlib.pyplot as plt
+import random
+
+from sklearn import cross_validation
+from sklearn.datasets import load_digits, fetch_mldata
@agramfort Owner

'load_digits' imported but unused

run pyflakes on your files

@agramfort Owner

do we really need to fetch the full MNIST to illustrate this? it makes it impossible to run on crappy internet connection like mine now :(

examples/neural_networks/plot_weighted_elm.py
((37 lines not shown))
+ axis.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired)
+ axis.axis('off')
+ axis.set_title(title)
+
+
+# we create 40 separable points
+rng = np.random.RandomState(0)
+n_samples_1 = 1000
+n_samples_2 = 100
+X = np.r_[1.5 * rng.randn(n_samples_1, 2),
+ 0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
+Y = [0] * (n_samples_1) + [1] * (n_samples_2)
+
+
+# for reference, first fit without class weights
+# fit the model
@agramfort Owner

why 2 lines of comments?

@amueller Owner

maybe a for-loop over the different class_weight settings?

examples/neural_networks/plot_weighted_elm.py
((50 lines not shown))
+
+# for reference, first fit without class weights
+# fit the model
+n_hidden = 100
+
+clf_weightless = ELMClassifier(n_hidden=n_hidden, C=10e5, class_weight=None)
+clf_weightless.fit(X, Y)
+
+clf_weight_auto = ELMClassifier(n_hidden=n_hidden, C=10e5, class_weight='auto')
+clf_weight_auto.fit(X, Y)
+
+clf_weight_1000 = ELMClassifier(n_hidden=n_hidden, C=10e5,
+ class_weight={1: 1000})
+clf_weight_1000.fit(X, Y)
+
+fig, axes = plt.subplots(1, 3, figsize=(20, 7))
@agramfort Owner

this figsize is way too big for the doc. Make it no bigger than 10 inches wide and remove empty spaces with plt.subplots_adjust

examples/neural_networks/plot_elm_training_vs_testing.py
((16 lines not shown))
+
+import numpy as np
+import matplotlib.pyplot as plt
+import random
+
+from sklearn import cross_validation
+from sklearn.datasets import load_digits, fetch_mldata
+from sklearn.neural_network import ELMClassifier
+
+np.random.seed(0)
+
+# Generate sample data
+mnist = fetch_mldata('MNIST original')
+X, y = mnist.data, mnist.target
+
+indices = np.array(random.sample(range(70000), 2000))
@agramfort Owner

use np.random and get rid of the random from standard lib

examples/neural_networks/plot_elm_training_vs_testing.py
((18 lines not shown))
+import matplotlib.pyplot as plt
+import random
+
+from sklearn import cross_validation
+from sklearn.datasets import load_digits, fetch_mldata
+from sklearn.neural_network import ELMClassifier
+
+np.random.seed(0)
+
+# Generate sample data
+mnist = fetch_mldata('MNIST original')
+X, y = mnist.data, mnist.target
+
+indices = np.array(random.sample(range(70000), 2000))
+X, y = X[indices].astype('float64'), y[indices]
+X /= 255
@agramfort Owner

255.

it's a float

examples/neural_networks/plot_weighted_elm.py
((5 lines not shown))
+
+Plot decision functions of extreme learning machines with different class
+weights. Assigning larger weight to a class will push the decision function
+away from that class to have more of its samples correctly classified.
+Such scheme is useful for imbalanced data so that underrepresented classes
+are emphasized and therefore not ignored by the classifier.
+
+"""
+print(__doc__)
+
+# Author: Issam H. Laradji <issam.laradji@gmail.com>
+# License: BSD 3 clause
+
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn import svm
@agramfort Owner

svm unused

sklearn/neural_network/extreme_learning_machines.py
((18 lines not shown))
+from ..linear_model import ridge
+from ..utils import gen_even_slices
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils import check_random_state, atleast2d_or_csr
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _tanh(X):
+ """Compute the hyperbolic tan function."""
+ return np.tanh(X, X)
+
+
+def _softmax(Z):
+ """Compute the K-way softmax function. """
+ exp_Z = np.exp(Z - Z.max(axis=1)[:, np.newaxis])
@agramfort Owner

can you use Z to store the output of np.exp, like you did for tanh above?

sklearn/neural_network/extreme_learning_machines.py
((20 lines not shown))
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils import check_random_state, atleast2d_or_csr
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _tanh(X):
+ """Compute the hyperbolic tan function."""
+ return np.tanh(X, X)
+
+
+def _softmax(Z):
+ """Compute the K-way softmax function. """
+ exp_Z = np.exp(Z - Z.max(axis=1)[:, np.newaxis])
+
+ return (exp_Z / exp_Z.sum(axis=1)[:, np.newaxis])
@agramfort Owner

and do this division in place with a /=

@arjoly Owner
arjoly added a note

maybe rename Z to X to be consistent with the tanh and relu?

@ogrisel Owner
ogrisel added a note

I think it would be faster to test for the sparse case explicitly:

if sp.issparse(X):
     ...  # do the current stuff with dia_matrix
else:
    return X * sample_weight[np.newaxis, 1]
@ogrisel Owner
ogrisel added a note

This comment has not been addressed yet, right?

No, sorry. I will upload all addressed comments right away.

While testing the code, I found that X here is never sparse. The value of X is in fact either H_batch or y_batch which is incremented by self.intercept_hidden_ and therefore is forced into a dense matrix.

In other words, sp.issparse(X) is always false.

@ogrisel Owner
ogrisel added a note

Ok so doing X * sample_weight[np.newaxis, 1] should always work and be faster then, no?

Yes, but it would rather be X * sample_weight[:, np.newaxis], since sample_weight is a vector and sample_weight[np.newaxis, 1] will just take the second element.

Thanks!

@ogrisel Owner
ogrisel added a note

Indeed, I mixed the .reshape and the np.newaxis slice notations for triggering broadcasting...
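
A quick standalone illustration of the broadcasting point (plain numpy, unrelated to the PR's variables):

import numpy as np

X = np.arange(6.0).reshape(3, 2)           # (n_samples=3, n_features=2)
sample_weight = np.array([0.1, 1.0, 10.0])

# sample_weight[np.newaxis, 1] is just the second weight as a length-1 array,
# so it scales every entry of X by that single value
wrong = X * sample_weight[np.newaxis, 1]

# sample_weight[:, np.newaxis] has shape (3, 1) and broadcasts row-wise,
# scaling each sample by its own weight
right = X * sample_weight[:, np.newaxis]

print(wrong)
print(right)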

sklearn/neural_network/extreme_learning_machines.py
((95 lines not shown))
+ self._activation_func = self._activation_functions[self.activation]
+
+ if (self.kernel in ['poly', 'rbf', 'sigmoid']) and (self.gamma == 0):
+ # if custom gamma is not provided ...
+ self.gamma = 1.0 / self._n_features
+
+ if self.kernel != 'random':
+ self._X_train = X
+
+ def _scaled_weight_init(self, fan_in, fan_out):
+ """Scale the initial, random parameters for a specific layer."""
+ if self.activation == 'tanh':
+ interval = np.sqrt(6. / (fan_in + fan_out))
+
+ elif self.activation == 'logistic':
+ interval = 4. * np.sqrt(6. / (fan_in + fan_out))
@agramfort Owner

where are these numbers coming from? please point to the paper

Hi @agramfort, I am using the scaling scheme given here: [Xavier10] http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_GlorotB10.pdf
It is giving me better results than otherwise - cleaner plots and usually higher scores.

Since I am adding the ReLU activation function, I will add an else clause that sets the interval to 1./np.sqrt(n_features), a popular initialization method as claimed by the paper. Thanks
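
A minimal standalone sketch of that initialization scheme (intervals from Glorot & Bengio 2010; the helper name is illustrative, not the PR's exact function):

import numpy as np

def init_hidden_weights(n_features, n_hidden, activation, random_state=0):
    """Draw hidden-layer weights uniformly from [-interval, interval]."""
    rng = np.random.RandomState(random_state)
    if activation == 'tanh':
        interval = np.sqrt(6. / (n_features + n_hidden))
    elif activation == 'logistic':
        interval = 4. * np.sqrt(6. / (n_features + n_hidden))
    else:  # e.g. 'relu', as proposed above
        interval = 1. / np.sqrt(n_features)
    coef = rng.uniform(-interval, interval, (n_features, n_hidden))
    intercept = rng.uniform(-interval, interval, n_hidden)
    return coef, intercept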

sklearn/neural_network/extreme_learning_machines.py
((194 lines not shown))
+ H = self._get_hidden_activations(X_batch)
+
+ if self._K is None:
+ # initialize K and coef_output_
+ self._K = safe_sparse_dot(H.T, H)
+ y_ = safe_sparse_dot(H.T, y_batch)
+
+ self.coef_output_ = ridge.ridge_regression(self._K, y_,
+ 1.0 / self.C).T
+ else:
+ self._K += safe_sparse_dot(H.T, H)
+ H_updated = safe_sparse_dot(H, self.coef_output_)
+ y_ = safe_sparse_dot(H.T, (y_batch - H_updated))
+
+ self.coef_output_ += ridge.ridge_regression(self._K, y_,
+ 1.0 / self.C).T
@agramfort Owner

put this line outside of the if and remove this call above

sklearn/neural_network/extreme_learning_machines.py
((227 lines not shown))
+ X = atleast2d_or_csr(X)
+
+ self._validate_params()
+
+ n_samples, self._n_features = X.shape
+ self.n_outputs_ = y.shape[1]
+ self._init_param(X)
+
+ if self.algorithm == 'standard':
+ # compute the least-square solutions for the whole dataset
+ self._solve_lsqr(X, y)
+
+ elif self.algorithm == 'sequential':
+ # compute the least-square solutions in batches
+ batch_size = np.clip(self.batch_size, 0, n_samples)
+ n_batches = int(n_samples / batch_size)
@agramfort Owner

n_samples // batch_size

to force integer division
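
A quick illustration of the suggested floor division together with gen_even_slices (standalone, not the PR's code):

from sklearn.utils import gen_even_slices

n_samples, batch_size = 1050, 200
n_batches = n_samples // batch_size  # floor division, as suggested
batch_slices = list(gen_even_slices(n_batches * batch_size, n_batches))
print(batch_slices[0], batch_slices[-1])  # slice(0, 200, None) slice(800, 1000, None)
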

@arjoly Owner
arjoly added a note

scores => y_scores?

sklearn/neural_network/extreme_learning_machines.py
((239 lines not shown))
+ elif self.algorithm == 'sequential':
+ # compute the least-square solutions in batches
+ batch_size = np.clip(self.batch_size, 0, n_samples)
+ n_batches = int(n_samples / batch_size)
+ batch_slices = list(gen_even_slices(n_batches * batch_size,
+ n_batches))
+ self._K = None
+
+ for batch, batch_slice in enumerate(batch_slices):
+ self._sequential_solve_procedure(X[batch_slice],
+ y[batch_slice])
+
+ if self.verbose:
+ # compute training square error
+ cost = np.sum((y[batch_slice] - self.decision_function(
+ X[batch_slice])) ** 2) / (2 * batch_size)
@agramfort Owner
  1. * batch_size
sklearn/neural_network/extreme_learning_machines.py
((242 lines not shown))
+ n_batches = int(n_samples / batch_size)
+ batch_slices = list(gen_even_slices(n_batches * batch_size,
+ n_batches))
+ self._K = None
+
+ for batch, batch_slice in enumerate(batch_slices):
+ self._sequential_solve_procedure(X[batch_slice],
+ y[batch_slice])
+
+ if self.verbose:
+ # compute training square error
+ cost = np.sum((y[batch_slice] - self.decision_function(
+ X[batch_slice])) ** 2) / (2 * batch_size)
+
+ print("Training square error for batch %d = %f" %
+ (batch, cost))
@agramfort Owner

bad indent

sklearn/neural_network/extreme_learning_machines.py
((247 lines not shown))
+ for batch, batch_slice in enumerate(batch_slices):
+ self._sequential_solve_procedure(X[batch_slice],
+ y[batch_slice])
+
+ if self.verbose:
+ # compute training square error
+ cost = np.sum((y[batch_slice] - self.decision_function(
+ X[batch_slice])) ** 2) / (2 * batch_size)
+
+ print("Training square error for batch %d = %f" %
+ (batch, cost))
+
+ if self.verbose:
+ # compute training square error
+ cost = (np.sum((y - self.decision_function(X)) ** 2) /
+ (2 * n_samples))
@agramfort Owner

bad indent

sklearn/neural_network/extreme_learning_machines.py
((248 lines not shown))
+ self._sequential_solve_procedure(X[batch_slice],
+ y[batch_slice])
+
+ if self.verbose:
+ # compute training square error
+ cost = np.sum((y[batch_slice] - self.decision_function(
+ X[batch_slice])) ** 2) / (2 * batch_size)
+
+ print("Training square error for batch %d = %f" %
+ (batch, cost))
+
+ if self.verbose:
+ # compute training square error
+ cost = (np.sum((y - self.decision_function(X)) ** 2) /
+ (2 * n_samples))
+ print("Training square error for the dataset = %f" % (cost))
@agramfort Owner

remove the () around cost

sklearn/neural_network/extreme_learning_machines.py
((253 lines not shown))
+ cost = np.sum((y[batch_slice] - self.decision_function(
+ X[batch_slice])) ** 2) / (2 * batch_size)
+
+ print("Training square error for batch %d = %f" %
+ (batch, cost))
+
+ if self.verbose:
+ # compute training square error
+ cost = (np.sum((y - self.decision_function(X)) ** 2) /
+ (2 * n_samples))
+ print("Training square error for the dataset = %f" % (cost))
+
+ return self
+
+ def decision_function(self, X):
+ """Fit the model to the data X and target y.
@agramfort Owner

bad docstring

sklearn/neural_network/extreme_learning_machines.py
((281 lines not shown))
+ X = atleast2d_or_csr(X)
+
+ self.hidden_activations_ = self._get_hidden_activations(X)
+ output = safe_sparse_dot(self.hidden_activations_, self.coef_output_)
+
+ return output
+
+ def partial_fit(self, X, y):
+ """Fit the model to the data X and target y.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Subset of training data.
+
+ y : array-like, shape (n_samples)
@agramfort Owner

(n_samples) -> (n_samples,)

sklearn/neural_network/extreme_learning_machines.py
((286 lines not shown))
+ return output
+
+ def partial_fit(self, X, y):
+ """Fit the model to the data X and target y.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Subset of training data.
+
+ y : array-like, shape (n_samples)
+ Subset of target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
@agramfort Owner

make the formatting of return self consistent

@ogrisel Owner
ogrisel added a note

Also it's better to explain the motivation rather than stating a tautology:

self : return the estimator itself to chain a call to the predict method for instance.

sklearn/neural_network/extreme_learning_machines.py
((9 lines not shown))
+import numpy as np
+from scipy import linalg
+from scipy.sparse import identity
+
+from ..base import BaseEstimator, ClassifierMixin, RegressorMixin
+from ..externals import six
+from ..preprocessing import LabelBinarizer
+from ..metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
+from ..metrics.pairwise import sigmoid_kernel
+from ..linear_model import ridge
+from ..utils import gen_even_slices
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils import check_random_state, atleast2d_or_csr
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
@agramfort Owner

run pyflakes:

sklearn/neural_network/extreme_learning_machines.py:10 'linalg' imported but unused
sklearn/neural_network/extreme_learning_machines.py:11
  'identity' imported but unused
sklearn/neural_network/extreme_learning_machines.py:21
  redefinition of unused 'check_random_state' from line 20
sklearn/neural_network/extreme_learning_machines.py:21
  redefinition of unused 'atleast2d_or_csr' from line 20
sklearn/neural_network/extreme_learning_machines.py
((500 lines not shown))
+ """Return the log of probability estimates.
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ T : array-like, shape (n_samples, n_outputs)
+ Returns the log-probability of the sample for each class in the
+ model, where classes are ordered as they are in
+ `self.classes_`. Equivalent to log(predict_proba(X))
+ """
+ return np.log(self.predict_proba(X))
@agramfort Owner

use a temp var and apply the log in place

sklearn/neural_network/extreme_learning_machines.py
((553 lines not shown))
+ self.classes_ = classes
+
+ if not hasattr(self, '_lbin'):
+ self._lbin = LabelBinarizer()
+ self._lbin._classes = classes
+
+ y = column_or_1d(y, warn=True)
+
+ y = self._lbin.fit_transform(y)
+ super(ELMClassifier, self).partial_fit(X, y)
+
+ return self
+
+
+class ELMRegressor(BaseELM, RegressorMixin):
+
@agramfort Owner

remove empty line

sklearn/neural_network/extreme_learning_machines.py
((704 lines not shown))
+ Predicted target values per element in X.
+ """
+ X = atleast2d_or_csr(X)
+
+ return self.decision_function(X)
+
+ def partial_fit(self, X, y):
+ """Fit the model to the data X and target y.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape (n_samples)
@agramfort Owner

(n_samples) -> (n_samples,)

sklearn/neural_network/extreme_learning_machines.py
((714 lines not shown))
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape (n_samples)
+ Subset of the target values.
+
+ Returns
+ -------
+ self
+ """
+ y = np.atleast_1d(y)
+
+ if y.ndim == 1:
+ y = np.reshape(y, (-1, 1))
@agramfort Owner

doc says ndim has to be 1 so force it to be the case.

sklearn/neural_network/tests/test_elm.py
((67 lines not shown))
+ y_train = y[:150]
+ X_test = X[150:]
+
+ expected_shape_dtype = (X_test.shape[0], y_train.dtype.kind)
+
+ for activation in ACTIVATION_TYPES:
+ elm = ELMClassifier(n_hidden=50, activation=activation,
+ random_state=random_state)
+ elm.fit(X_train, y_train)
+
+ y_predict = elm.predict(X_test)
+ assert_greater(elm.score(X_train, y_train), 0.95)
+ assert_equal(
+ (y_predict.shape[0],
+ y_predict.dtype.kind),
+ expected_shape_dtype)
@agramfort Owner

indent looks weird

sklearn/neural_network/tests/test_elm.py
((179 lines not shown))
+ assert_raises(ValueError, clf(algorithm='standard').partial_fit, X, y)
+
+ elm = clf(algorithm='sequential')
+ elm.partial_fit(X, y, classes=[0, 1])
+ # different classes passed
+ assert_raises(ValueError, elm.partial_fit, X, y, classes=[0, 1, 2])
+
+
+def test_partial_fit_classification():
+ """
+ Test that partial_fit yields same results as 'fit'
+ for binary- and multi-class classification.
+ """
+ for X, y in classification_datasets:
+ X = X
+ y = y
@agramfort Owner

??

sklearn/neural_network/tests/test_elm.py
((190 lines not shown))
+ for binary- and multi-class classification.
+ """
+ for X, y in classification_datasets:
+ X = X
+ y = y
+ batch_size = 200
+ n_samples = X.shape[0]
+
+ elm = ELMClassifier(algorithm='sequential', random_state=random_state,
+ batch_size=batch_size)
+ elm.fit(X, y)
+ pred1 = elm.predict(X)
+
+ elm = ELMClassifier(algorithm='sequential', random_state=random_state)
+
+ n_batches = int(n_samples / batch_size)
@agramfort Owner

//

sklearn/neural_network/tests/test_elm.py
((222 lines not shown))
+ """
+ X = Xboston
+ y = yboston
+ batch_size = 100
+ n_samples = X.shape[0]
+
+ for activation in ACTIVATION_TYPES:
+ elm = ELMRegressor(algorithm='sequential', random_state=random_state,
+ activation=activation, batch_size=batch_size)
+ elm.fit(X, y)
+ pred1 = elm.predict(X)
+
+ elm = ELMRegressor(algorithm='sequential', activation=activation,
+ random_state=random_state)
+
+ n_batches = int(n_samples / batch_size)
@agramfort Owner

//

sklearn/neural_network/tests/test_elm.py
((351 lines not shown))
+ elm_weightless = ELMClassifier(n_hidden=n_hidden,
+ class_weight=None,
+ random_state=random_state)
+ elm_weightless.fit(X_train, y_train)
+
+ elm_weight_auto = ELMClassifier(n_hidden=n_hidden,
+ class_weight='auto',
+ random_state=random_state)
+ elm_weight_auto.fit(X_train, y_train)
+
+ score_weightless = roc_auc_score(
+ y_test, elm_weightless.predict_proba(X_test)[:, 1])
+ score_weighted = roc_auc_score(
+ y_test, elm_weight_auto.predict_proba(X_test)[:, 1])
+
+ assert_greater(score_weighted, score_weightless)
@agramfort Owner

bad indent

@agramfort
Owner

@IssamLaradji please address my comments and start a complete benchmark to find the best default parameters. I would also add the ReLU activation function, which is fast to compute.

cc @ogrisel

@ogrisel
Owner

And also sparse random weights (reusing code from sklearn.random_projection) as an alternative to dense Gaussian random weights. It can be significantly faster.

Also, the amplitude of the random weights seems to be a hyperparameter with a strong impact, as demonstrated in slides 14 and 15 of this deck: http://www.lce.hut.fi/~eiparvia/publ/KDIR_Parviainen_slides.pdf

We should therefore make the scale of the random weights an explicit hyperparameter of the ELM estimator(s) and write an example to highlight its importance, for instance using a grid search with a grid that includes the regularizer strength, the number of hidden nodes and the scale of the random weights.

@IssamLaradji

Thanks @agramfort and @ogrisel for all your comments. I pushed the updated code.
I added a weight_scale parameter that sets the interval that the uniform distribution picks values from.
By default weight_scale="auto", which selects the interval based on this paper (http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_GlorotB10.pdf), depending on the activation function. It works very well compared to other initialization methods.

For sklearn.random_projection.sparse_random_matrix, the output matrix contains only 3 unique values: a, -a, and 0, which doesn't provide the asymmetry we need for weight initialization, right?

Running grid-search on the load_digits dataset with the following range of parameters,

parameters = {'weight_scale': np.arange(0.1, 1, 0.1),
              'n_hidden': np.arange(50, 800, 50), 'C': [1, 10, 100, 1000]}

I got the following best combination,

ELMClassifier(C=1, activation='tanh', algorithm='standard', batch_size=200,
       class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='random',
       n_hidden=500, random_state=None, verbose=False,
       weight_scale=0.10000000000000001)

Thanks.
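
For reference, a rough sketch of how such a grid search can be run (GridSearchCV from scikit-learn; ELMClassifier is only available on this branch, and the values mirror the grid above):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later releases
from sklearn.neural_network import ELMClassifier  # only on this branch

digits = load_digits()
X, y = digits.data / 16., digits.target

parameters = {'weight_scale': np.arange(0.1, 1, 0.1),
              'n_hidden': np.arange(50, 800, 50),
              'C': [1, 10, 100, 1000]}

search = GridSearchCV(ELMClassifier(activation='tanh'), parameters, cv=3)
search.fit(X, y)
print(search.best_estimator_)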

sklearn/neural_network/extreme_learning_machines.py
((100 lines not shown))
+ if self.algorithm is not 'standard' and self.class_weight is not None:
+ raise NotImplementedError("class_weight is only supported "
+ "when algorithm='standard'.")
+
+ def _init_param(self, X):
+ """Set initial parameters."""
+ self._activation_func = self._activation_functions[self.activation]
+
+ if (self.kernel in ['poly', 'rbf', 'sigmoid']) and (self.gamma == 0):
+ # if custom gamma is not provided ...
+ self.gamma = 1.0 / self._n_features
+
+ if self.kernel != 'random':
+ self._X_train = X
+
+ def _scaled_weight_init(self, fan_in, fan_out):
@arjoly Owner
arjoly added a note

Maybe inline this function? It is called only once.

sklearn/neural_network/extreme_learning_machines.py
((210 lines not shown))
+
+ if self._K is None:
+ # initialize K and coef_output_
+ self.coef_output_ = np.zeros((self.n_hidden, self.n_outputs_))
+ self._K = safe_sparse_dot(H.T, H)
+ y_ = safe_sparse_dot(H.T, y_batch)
+
+ else:
+ self._K += safe_sparse_dot(H.T, H)
+ H_updated = safe_sparse_dot(H, self.coef_output_)
+ y_ = safe_sparse_dot(H.T, (y_batch - H_updated))
+
+ self.coef_output_ += ridge.ridge_regression(self._K, y_,
+ 1.0 / self.C).T
+
+ def fit(self, X, y):
@arjoly Owner
arjoly added a note

Do you think the common implementation of fit and partial_fit could be merged?

@arjoly Owner
arjoly added a note

I would still use something different than sub. Note, it's not entirely clear what the difference between a batch and a subset would be. What do you think of H_accumulated / H_all_batches / H_recursive / H_incremental?

I think H_accumulated and H_all_batches are the most appropriate names for it :-)
I will name it H_accumulated.

sklearn/neural_network/extreme_learning_machines.py
((370 lines not shown))
+
+ n_hidden: int, default 100
+ The number of neurons in the hidden layer, it only applies to
+ kernel='random'.
+
+ activation : {'logistic', 'tanh', 'relu'}, default 'tanh'
+ Activation function for the hidden layer. It only applies to
+ kernel='random'.
+
+ - 'logistic' for 1 / (1 + exp(x)).
+
+ - 'tanh' for the hyperbolic tangent.
+
+ - 'relu' for log(1 + exp(x))
+
+ algorithm : {'standard', 'sequential'}, default 'standard'
@arjoly Owner
arjoly added a note

Does it make any sense to have other solvers such as SGD?

@arjoly Owner
arjoly added a note

Why not call this 'solver'?

sklearn/neural_network/extreme_learning_machines.py
((391 lines not shown))
+
+ - 'sequential' computes the least-square solutions by training
+ on the dataset in batches using a recursive least-square
+ algorithm.
+
+ kernel : {'random', 'linear', 'poly', 'rbf', 'sigmoid'},
+ optional, default 'random'
+ Specifies the kernel type to be used in the algorithm.
+
+ degree : int, optional, default 3
+ Degree of the polynomial kernel function 'poly'.
+ Ignored by all other kernels.
+
+ gamma : float, optional, default 0.0
+ Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma is
+ 0.0 then 1/n_features will be used instead.
@arjoly Owner
arjoly added a note

Maybe a better default is to set it to None instead of 0.0.

sklearn/neural_network/extreme_learning_machines.py
((566 lines not shown))
+ raise ValueError(
+ "only 'sequential' algorithm supports partial fit")
+
+ if self.classes_ is None and classes is None:
+ raise ValueError("classes must be passed on the first call "
+ "to partial_fit.")
+ elif self.classes_ is not None and classes is not None:
+ if np.any(self.classes_ != np.unique(classes)):
+ raise ValueError("`classes` is not the same as on last call "
+ "to partial_fit.")
+ elif classes is not None:
+ self.classes_ = classes
+
+ if not hasattr(self, '_lbin'):
+ self._lbin = LabelBinarizer()
+ self._lbin._classes = classes
@arjoly Owner
arjoly added a note

I would avoid this. It would be better to patch the label binarizer to accept a classes argument. (+1 for another PR)

sklearn/neural_network/extreme_learning_machines.py
((750 lines not shown))
+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape (n_samples, n_outputs)
+ Subset of the target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ y = np.atleast_1d(y)
+
+ if y.ndim == 1:
+ # reshape is necessary to preserve the data contiguity against vs
+ # [:, np.newaxis] that does not.
+ y = np.reshape(y, (-1, 1))
@arjoly Owner
arjoly added a note

With a private _validate_y in the base class, the partial_fit function could be shared between regression and classification.

@arjoly Owner
arjoly added a note

It might not be necessary to have a _validate_y since it's possible to distinguish classification from regression using the ClassifierMixin.

sklearn/neural_network/extreme_learning_machines.py
((725 lines not shown))
+
+ def predict(self, X):
+ """Predict using the multi-layer perceptron model.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ array, shape (n_samples,)
+ Predicted target values per element in X.
+ """
+ X = atleast2d_or_csr(X)
@arjoly Owner
arjoly added a note

Since @amueller's improvements in the utils module, we could use the new awesome check_X_y, check_array or check_consistent_length. :-)

sklearn/neural_network/extreme_learning_machines.py
((676 lines not shown))
+ Zong, Weiwei, Guang-Bin Huang, and Yiqiang Chen.
+ "Weighted extreme learning machine for imbalance learning."
+ Neurocomputing 101 (2013): 229-242.
+
+ Liang, Nan-Ying, et al.
+ "A fast and accurate online sequential learning algorithm for
+ feedforward networks." Neural Networks, IEEE Transactions on
+ 17.6 (2006): 1411-1423.
+ http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf
+ """
+ def __init__(self, n_hidden=100, activation='tanh', algorithm='standard',
+ weight_scale='auto', kernel='random', batch_size=200, C=10e5,
+ degree=3, gamma=0.0, coef0=0.0, verbose=False,
+ random_state=None):
+
+ class_weight = None
@arjoly Owner
arjoly added a note

This could be handled in the base class.

sklearn/neural_network/extreme_learning_machines.py
((683 lines not shown))
+ 17.6 (2006): 1411-1423.
+ http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf
+ """
+ def __init__(self, n_hidden=100, activation='tanh', algorithm='standard',
+ weight_scale='auto', kernel='random', batch_size=200, C=10e5,
+ degree=3, gamma=0.0, coef0=0.0, verbose=False,
+ random_state=None):
+
+ class_weight = None
+
+ super(ELMRegressor, self).__init__(n_hidden, activation, algorithm,
+ kernel, C, degree, gamma, coef0,
+ class_weight, weight_scale,
+ batch_size, verbose, random_state)
+
+ self.classes_ = None
@arjoly Owner
arjoly added a note

Same here.

sklearn/neural_network/extreme_learning_machines.py
((602 lines not shown))
+
+ Parameters
+ ----------
+ C: float, optional, default 10e5
+ Regularization term.
+
+ weight_scale : float or 'auto', default 'auto'
+ Scales the weights that initialize the outgoing weights of the first
+ hidden layer. The weight values will range between plus and minus an
+ interval based on the uniform distribution. That interval
+ is 1 / (n_features + n_hidden) if weight_scale='auto'; otherwise,
+ the interval is the value given to weight_scale.
+
+ n_hidden: int, default 100
+ The number of neurons in the hidden layer, it only applies to
+ kernel='random'.
@arjoly Owner
arjoly added a note

What happens for the other kernels?

sklearn/neural_network/extreme_learning_machines.py
((598 lines not shown))
+ values.
+
+ This implementation works with data represented as dense and sparse numpy
+ arrays of floating point values for the features.
+
+ Parameters
+ ----------
+ C: float, optional, default 10e5
+ Regularization term.
+
+ weight_scale : float or 'auto', default 'auto'
+ Scales the weights that initialize the outgoing weights of the first
+ hidden layer. The weight values will range between plus and minus an
+ interval based on the uniform distribution. That interval
+ is 1 / (n_features + n_hidden) if weight_scale='auto'; otherwise,
+ the interval is the value given to weight_scale.
@arjoly Owner
arjoly added a note

That interval is 1 / (n_features + n_hidden) if weight_scale='auto'; otherwise, the interval is the value given to weight_scale.
=>
If weight_scale='auto', then weight_scale is set to 1 / (n_features + n_hidden)?

sklearn/neural_network/extreme_learning_machines.py
((591 lines not shown))
+class ELMRegressor(BaseELM, RegressorMixin):
+ """Extreme learning machines regressor.
+
+ The algorithm trains a single-hidden layer feedforward network by computing
+ the hidden layer values using randomized parameters, then solving
+ for the output weights using least-square solutions. For prediction,
+ ELMRegressor computes the forward pass resulting in continuous output
+ values.
+
+ This implementation works with data represented as dense and sparse numpy
+ arrays of floating point values for the features.
+
+ Parameters
+ ----------
+ C: float, optional, default 10e5
+ Regularization term.
@arjoly Owner
arjoly added a note

Which regularisation term?

sklearn/neural_network/extreme_learning_machines.py
((518 lines not shown))
+ return np.hstack([1 - scores, scores])
+ else:
+ return _softmax(scores)
+
+ def predict_log_proba(self, X):
+ """Return the log of probability estimates.
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ T : array-like, shape (n_samples, n_outputs)
@arjoly Owner
arjoly added a note

T => y_proba?

sklearn/neural_network/extreme_learning_machines.py
((523 lines not shown))
+ """Return the log of probability estimates.
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ T : array-like, shape (n_samples, n_outputs)
+ Returns the log-probability of the sample for each class in the
+ model, where classes are ordered as they are in
+ `self.classes_`. Equivalent to log(predict_proba(X))
+ """
+ tmp = self.predict_proba(X)
@arjoly Owner
arjoly added a note

tmp => y_proba

sklearn/neural_network/extreme_learning_machines.py
((524 lines not shown))
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ T : array-like, shape (n_samples, n_outputs)
+ Returns the log-probability of the sample for each class in the
+ model, where classes are ordered as they are in
+ `self.classes_`. Equivalent to log(predict_proba(X))
+ """
+ tmp = self.predict_proba(X)
+ return np.log(tmp, tmp)
@arjoly Owner
arjoly added a note

return np.log(y_proba, out=y_proba)?

@arjoly Owner
arjoly added a note

I was a bit surprised by the second argument.

sklearn/neural_network/extreme_learning_machines.py
((495 lines not shown))
+ scores = self.decision_function(X)
+
+ return self._lbin.inverse_transform(scores)
+
+ def predict_proba(self, X):
+ """Probability estimates.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ array, shape (n_samples, n_outputs)
@arjoly Owner
arjoly added a note

I assume you mean (n_samples, n_classes)

sklearn/neural_network/extreme_learning_machines.py
((459 lines not shown))
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : array-like, shape (n_samples,)
+ Target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ y = column_or_1d(y, warn=True)
+ self.classes_ = np.unique(y)
+ y = self._lbin.fit_transform(y)
@arjoly Owner
arjoly added a note

Could be factored into a self._validate_y function

@arjoly Owner
arjoly added a note

It might not be necessary to have a _validate_y since it's possible to distinguish classification from regression using the ClassifierMixin.

sklearn/neural_network/extreme_learning_machines.py
((312 lines not shown))
+ Subset of target values.
+
+ Returns
+ -------
+ self : returns an instance of self.
+ """
+ X = atleast2d_or_csr(X)
+
+ self.n_outputs_ = y.shape[1]
+
+ n_samples, self._n_features = X.shape
+ self._validate_params()
+ self._init_param(X)
+
+ if self.coef_output_ is None:
+ self._K = None
@arjoly Owner
arjoly added a note

What is _K?

sklearn/neural_network/extreme_learning_machines.py
((30 lines not shown))
+ # print X.shape
+ tmp = 1 + np.exp(X, X)
+ return np.log(tmp, tmp)
+
+
+def _softmax(Z):
+ """Compute the K-way softmax function. """
+ Z = np.exp(Z - Z.max(axis=1)[:, np.newaxis])
+ Z /= Z.sum(axis=1)[:, np.newaxis]
+
+ return Z
+
+
+def _square_error(y, y_pred, n_samples):
+ """Compute the square error."""
+ return np.sum((y - y_pred) ** 2) / (2 * n_samples)
@arjoly Owner
arjoly added a note

Why not infer n_samples from the data?

Why not reuse mean_squared_error from the metrics module?
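
For reference, the verbose cost is just half the mean squared error, so mean_squared_error from sklearn.metrics could indeed be reused (a standalone check, not the PR's code):

import numpy as np
from sklearn.metrics import mean_squared_error

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

cost = np.sum((y - y_pred) ** 2) / (2 * len(y))
print(np.allclose(cost, mean_squared_error(y, y_pred) / 2.))  # True
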

@arjoly Owner
arjoly added a note

I assume that activation units will be shared among neural network models, so you might want to define this dictionary there.

sklearn/neural_network/extreme_learning_machines.py
((43 lines not shown))
+def _square_error(y, y_pred, n_samples):
+ """Compute the square error."""
+ return np.sum((y - y_pred) ** 2) / (2 * n_samples)
+
+
+class BaseELM(six.with_metaclass(ABCMeta, BaseEstimator)):
+ """Base class for ELM classification and regression.
+
+ Warning: This class should not be used directly.
+ Use derived classes instead.
+ """
+ _activation_functions = {
+ 'tanh': _tanh,
+ 'logistic': logistic_sigmoid,
+ 'relu': _relu
+ }
@arjoly Owner
arjoly added a note

Why does this need to be an attribute instead of a global constant?

sklearn/neural_network/extreme_learning_machines.py
((79 lines not shown))
+ self.coef_output_ = None
+
+ def _validate_params(self):
+ """Validate input params."""
+ if self.n_hidden <= 0:
+ raise ValueError("n_hidden must be greater or equal zero")
+ if self.C <= 0.0:
+ raise ValueError("C must be > 0")
+
+ if self.activation not in self._activation_functions:
+ raise ValueError("The activation %s"
+ " is not supported. " % self.activation)
+
+ if self.algorithm not in ['standard', 'sequential']:
+ raise ValueError("The algorithm %s"
+ " is not supported. " % self.algorithm)
@arjoly Owner
arjoly added a note

It's nice to remind the user of the possible choices in that sort of exception.

@arjoly Owner
arjoly added a note

Is it really the hidden_activation at this stage?

The end result (line 117) is hidden_activations; the statements in lines 112 and 113 aren't meant to have a variable name, because the relevant paper considers the whole hidden_activations computation as one statement.
Thanks.

sklearn/neural_network/extreme_learning_machines.py
((85 lines not shown))
+ if self.C <= 0.0:
+ raise ValueError("C must be > 0")
+
+ if self.activation not in self._activation_functions:
+ raise ValueError("The activation %s"
+ " is not supported. " % self.activation)
+
+ if self.algorithm not in ['standard', 'sequential']:
+ raise ValueError("The algorithm %s"
+ " is not supported. " % self.algorithm)
+
+ if self.kernel not in ['random', 'linear', 'poly', 'rbf', 'sigmoid']:
+ raise ValueError("The kernel %s"
+ " is not supported. " % self.kernel)
+
+ if self.algorithm is not 'standard' and self.class_weight is not None:
@arjoly Owner
arjoly added a note

self.algorithm != 'standard'

sklearn/neural_network/extreme_learning_machines.py
((278 lines not shown))
+ print("Training square error for the dataset = %f" % cost)
+
+ return self
+
+ def decision_function(self, X):
+ """Predict using the trained model
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ array, shape (n_samples,)
@arjoly Owner
arjoly added a note

There is no variable name for the return value. In classification, this looks like it should be (n_samples, n_classes).

@vene Owner
vene added a note

keys are class labels?

sklearn/neural_network/extreme_learning_machines.py
((181 lines not shown))
+
+ elif self.kernel == 'poly':
+ H = polynomial_kernel(X, self._X_train, self.degree, self.gamma,
+ self.coef0)
+
+ elif self.kernel == 'rbf':
+ H = rbf_kernel(X, self._X_train, self.gamma)
+
+ elif self.kernel == 'sigmoid':
+ H = sigmoid_kernel(X, self._X_train, self.gamma, self.coef0)
+
+ return H
+
+ def _solve_lsqr(self, X, y):
+ """Compute the least-square solutions for the whole dataset."""
+ H = self._get_hidden_activations(X)
@arjoly Owner
arjoly added a note

I don't find H to be a particularly descriptive variable name.

sklearn/neural_network/extreme_learning_machines.py
((127 lines not shown))
+ return interval
+
+ def _init_hidden_weights(self):
+ """Initialize coef and intercept parameters for the hidden layer."""
+ rng = check_random_state(self.random_state)
+ fan_in, fan_out = self._n_features, self.n_hidden
+
+ interval = self._scaled_weight_init(fan_in, fan_out)
+
+ coef = rng.uniform(-interval, interval, (fan_in, fan_out))
+ intercept = rng.uniform(-interval, interval, (fan_out))
+
+ self.coef_hidden_ = coef
+ self.intercept_hidden_ = intercept
+
+ def _assign_weights(self, y):
@arjoly Owner
arjoly added a note

This is called only once. Maybe this function should be inlined.

sklearn/neural_network/extreme_learning_machines.py
((267 lines not shown))
+ y_pred = self.decision_function(X[batch_slice])
+ cost = _square_error(y[batch_slice], y_pred, batch_size)
+
+ print("Training square error for batch %d = %f" % (batch,
+ cost))
+
+ if self.verbose:
+ # compute training square error
+ y_pred = self.decision_function(X)
+ cost = _square_error(y, y_pred, n_samples)
+
+ print("Training square error for the dataset = %f" % cost)
+
+ return self
+
+ def decision_function(self, X):
@arjoly Owner
arjoly added a note

Should we also have this in regression?

@ogrisel Owner
ogrisel added a note

+1 for renaming this to _predict and defining:

class ELMClassifier(...):
    ...
    def decision_function(self, X):
        return self._predict(X)

and:

class ELMRegressor(...):
    ...
    def predict(self, X):
        return self._predict(X)
@arjoly Owner
arjoly added a note

Why not allow performing the standard algorithm first, and then continuing with the recursive least-squares algorithm?

sklearn/neural_network/extreme_learning_machines.py
((1 lines not shown))
+"""Extreme Learning Machines
+"""
+
+# Author: Issam H. Laradji <issam.laradji@gmail.com>
+# Licence: BSD 3 clause
+
+from abc import ABCMeta, abstractmethod
+
+import numpy as np
+
+from ..base import BaseEstimator, ClassifierMixin, RegressorMixin
+from ..externals import six
+from ..preprocessing import LabelBinarizer
+from ..metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
+from ..metrics.pairwise import sigmoid_kernel
+from ..linear_model import ridge
@arjoly Owner
arjoly added a note

Maybe importing ridge_regression directly would be equivalent.

sklearn/neural_network/extreme_learning_machines.py
((16 lines not shown))
+from ..linear_model import ridge
+from ..utils import gen_even_slices
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _tanh(X):
+ """Compute the hyperbolic tan function."""
+ return np.tanh(X, X)
+
+
+def _relu(X):
+ """Compute the rectified linear unit function."""
+ # print X.shape
+ tmp = 1 + np.exp(X, X)
@arjoly Owner
arjoly added a note

np.exp(X, out=X)

sklearn/neural_network/extreme_learning_machines.py
((17 lines not shown))
+from ..utils import gen_even_slices
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _tanh(X):
+ """Compute the hyperbolic tan function."""
+ return np.tanh(X, X)
+
+
+def _relu(X):
+ """Compute the rectified linear unit function."""
+ # print X.shape
+ tmp = 1 + np.exp(X, X)
+ return np.log(tmp, tmp)
@arjoly Owner
arjoly added a note

np.log(tmp, out=tmp)?

@ogrisel Owner
ogrisel added a note

This is not the ReLU activation but a smooth approximation. The traditional ReLU is non-smooth but cheaper to compute (inplace):

    np.clip(X, 0, np.finfo(X.dtype).max, out=X)
    return X

As we do no gradient descent on the input-to-hidden weights of the ELM, there is really no point in using the smooth variant. For a ReLU MLP trained with LBFGS, a smooth ReLU might be useful, but practitioners often use SGD, which does not care that much about the smoothness of the objective function.

sklearn/neural_network/extreme_learning_machines.py
((10 lines not shown))
+
+from ..base import BaseEstimator, ClassifierMixin, RegressorMixin
+from ..externals import six
+from ..preprocessing import LabelBinarizer
+from ..metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
+from ..metrics.pairwise import sigmoid_kernel
+from ..linear_model import ridge
+from ..utils import gen_even_slices
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _tanh(X):
+ """Compute the hyperbolic tan function."""
+ return np.tanh(X, X)
@arjoly Owner
arjoly added a note

np.tanh(X, out=X)?

@amueller Owner

maybe _multiply_sample_weights?

sklearn/neural_network/extreme_learning_machines.py
((8 lines not shown))
+
+import numpy as np
+
+from ..base import BaseEstimator, ClassifierMixin, RegressorMixin
+from ..externals import six
+from ..preprocessing import LabelBinarizer
+from ..metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
+from ..metrics.pairwise import sigmoid_kernel
+from ..linear_model import ridge
+from ..utils import gen_even_slices
+from ..utils import atleast2d_or_csr, check_random_state, column_or_1d
+from ..utils.extmath import safe_sparse_dot
+from ..utils.fixes import expit as logistic_sigmoid
+
+
+def _tanh(X):
@arjoly Owner
arjoly added a note

_tanh, _relu and _softmax should probably say that they operate in place.

@arjoly Owner
arjoly added a note

_inplace_tanh, _inplace_...?

sklearn/neural_network/extreme_learning_machines.py
((93 lines not shown))
+ raise ValueError("The algorithm %s"
+ " is not supported. " % self.algorithm)
+
+ if self.kernel not in ['random', 'linear', 'poly', 'rbf', 'sigmoid']:
+ raise ValueError("The kernel %s"
+ " is not supported. " % self.kernel)
+
+ if self.algorithm is not 'standard' and self.class_weight is not None:
+ raise NotImplementedError("class_weight is only supported "
+ "when algorithm='standard'.")
+
+ def _init_param(self, X):
+ """Set initial parameters."""
+ self._activation_func = self._activation_functions[self.activation]
+
+ if (self.kernel in ['poly', 'rbf', 'sigmoid']) and (self.gamma == 0):
@arjoly Owner
arjoly added a note

Parentheses are not needed.

@arjoly Owner
arjoly added a note

Thanks !

@ogrisel Owner
ogrisel added a note

I would rather have the constructor only store the hyperparameters and initialize the public and private attributes in _fit, using a hasattr(self, attribute_name) test if required. In particular, coef_hidden_ should not be reset / reinitialized if self.warm_start is True.

Great. I also removed self.coef_output_ = None because it is unnecessary.

sklearn/neural_network/extreme_learning_machines.py
((108 lines not shown))
+ if (self.kernel in ['poly', 'rbf', 'sigmoid']) and (self.gamma == 0):
+ # if custom gamma is not provided ...
+ self.gamma = 1.0 / self._n_features
+
+ if self.kernel != 'random':
+ self._X_train = X
+
+ def _scaled_weight_init(self, fan_in, fan_out):
+ """Scale the initial, random parameters for a specific layer."""
+ if self.weight_scale == 'auto':
+ if self.activation == 'tanh':
+ interval = np.sqrt(6. / (fan_in + fan_out))
+ elif self.activation == 'logistic':
+ interval = 4. * np.sqrt(6. / (fan_in + fan_out))
+ else:
+ interval = np.sqrt(1. / (fan_in))
@arjoly Owner
arjoly added a note

The auto mode is not in line with the documentation.

sklearn/neural_network/extreme_learning_machines.py
((125 lines not shown))
+ interval = self.weight_scale
+
+ return interval
+
+ def _init_hidden_weights(self):
+ """Initialize coef and intercept parameters for the hidden layer."""
+ rng = check_random_state(self.random_state)
+ fan_in, fan_out = self._n_features, self.n_hidden
+
+ interval = self._scaled_weight_init(fan_in, fan_out)
+
+ coef = rng.uniform(-interval, interval, (fan_in, fan_out))
+ intercept = rng.uniform(-interval, interval, (fan_out))
+
+ self.coef_hidden_ = coef
+ self.intercept_hidden_ = intercept
@arjoly Owner
arjoly added a note

I have the feeling there are lines that could easily be merged without impeding readability.

sklearn/neural_network/extreme_learning_machines.py
((175 lines not shown))
+ A = safe_sparse_dot(X, self.coef_hidden_)
+ A += self.intercept_hidden_
+ H = self._activation_func(A)
+
+ elif self.kernel == 'linear':
+ H = linear_kernel(X, self._X_train)
+
+ elif self.kernel == 'poly':
+ H = polynomial_kernel(X, self._X_train, self.degree, self.gamma,
+ self.coef0)
+
+ elif self.kernel == 'rbf':
+ H = rbf_kernel(X, self._X_train, self.gamma)
+
+ elif self.kernel == 'sigmoid':
+ H = sigmoid_kernel(X, self._X_train, self.gamma, self.coef0)
@arjoly Owner
arjoly added a note

Maybe use the pairwise_kernels function?

@arjoly Owner
arjoly added a note

It might be better to place those class_weight computations in init_param_validate.

sklearn/neural_network/extreme_learning_machines.py
((187 lines not shown))
+ H = rbf_kernel(X, self._X_train, self.gamma)
+
+ elif self.kernel == 'sigmoid':
+ H = sigmoid_kernel(X, self._X_train, self.gamma, self.coef0)
+
+ return H
+
+ def _solve_lsqr(self, X, y):
+ """Compute the least-square solutions for the whole dataset."""
+ H = self._get_hidden_activations(X)
+
+ # compute output coefficients using ridge_regression
+ if self.class_weight is not None:
+ W = self._assign_weights(y)
+ else:
+ W = None
@arjoly Owner
arjoly added a note

What is W? hidden_weights?

@arjoly
Owner

A bunch of random thoughts. I hope it helps. :-)

@IssamLaradji

Great comments @arjoly. I addressed most of them and added ELM documentation (neural_networks_supervised.rst). I was fixing some weird errors from GitHub like "fatal: unable to access 'https://github.com/scikit-learn/scikit-learn/'", hence the late reply :(.

These are my questions about some of your comments,

1) Do you think the common implementation of fit and partial_fit could be merged?
Do you mean merging the checks and the verbose output? Namely,

        X = atleast2d_or_csr(X)
        self.n_outputs_ = y.shape[1]
        n_samples, self._n_features = X.shape
        self._validate_params()
        self._init_param(X)
        if self.verbose:
            # compute training square error
            y_pred = self.decision_function(X)
            cost = mean_squared_error(y, y_pred)
            print("Training square error for the batch = %f" % (cost))

2) Does it make any sense to have other solvers such as SGD?
Oh yes, I could reuse SGD or logistic regression to learn the weights of the output layer - I will add this as a new pull request in the future :).

3) I would avoid this. Better to patch the label binarizer to accept a classes argument.
Also, it seems that LabelBinarizer will no longer support multi-label, since MultiLabelBinarizer will take its place for multi-label instances. Wouldn't that complicate things?

4) Since @amueller's improvements in the utils module, we could use the new awesome check_X_y, check_array or check_consistency_length.
I would love to use check_X_y, but it seems the latest available 64-bit scikit-learn release does not support it yet :(?

5) What happens for the other kernel?
The value of n_hidden does not affect other kernels at all.

6) What is _k?
It is a variable given in the literature for sequential learning. I am not sure how else I could name it, since it accumulates intermediate quantities such as H^T H.
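
Roughly, a sketch of the per-batch update from the online-sequential ELM literature, where _K accumulates H^T H over the batches seen so far (variable names are illustrative, not the PR's exact code):

    import numpy as np

    def sequential_update(K, beta, H_batch, T_batch):
        """One recursive least-squares step on a new batch.

        K accumulates H^T H over the batches seen so far; beta holds the
        current hidden-to-output weights.
        """
        K = K + H_batch.T.dot(H_batch)
        residual = T_batch - H_batch.dot(beta)
        beta = beta + np.linalg.solve(K, H_batch.T.dot(residual))
        return K, beta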

7) Why not call this solver?
We called it algorithm since we handle it the same way as the algorithm parameter in #3204. Also, solver is a keyword already used by ridge_regression, which we would like the user to set through the ELM object (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py).

8) I don't find H to be a particularly descriptive variable name.
H is the name used in the literature, though a more descriptive name would be desirable. Should we name it hidden_activations instead?

9) The auto mode is not inline with the documentation.
I agree - I will change that :-), thanks.

I addressed the remaining comments and learned new stuff about clean implementation in the process :).
Thanks!

@arjoly
Owner

1) Do you mean merging the checks and verbose?

Yes, having a _partial_fit method that would factor out all the logic of fit (except maybe the batch creation). I was thinking of something like the GaussianNB class. Would it be possible?
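
A rough sketch of that factoring (the method names and the reset flag are assumptions for illustration, not the PR's final API):

    class BaseELM(object):

        def fit(self, X, y):
            # a full fit starts from freshly initialized weights
            return self._partial_fit(X, y, reset=True)

        def partial_fit(self, X, y):
            # an incremental fit keeps the previously learned coefficients
            return self._partial_fit(X, y, reset=False)

        def _partial_fit(self, X, y, reset):
            # shared input validation, parameter initialization, verbose
            # reporting and the actual least-squares update would live here
            ...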

3) I would avoid this. Better patching the label binarizer to accept a classes argument.
Also, it seems that labelBinarizer will no longer support multi-label as multilabelbinarizer will take its place in dealing with multi-label instances. Wouldn't that complicate things?

I was thinking of the multiclass / sequence-of-sequences case, where you might have new classes that weren't there before. By setting the classes_ attribute you bypass a bunch of the logic and you might get in trouble later.

4) Since @amueller improvement in the utils module, we could the new awesome check_X_y, check_array or check_consistency_length.
I would love to use check_X_y but it seems the latest available 64bit scikit-learn version does not support it yet :(?

Hm strange, it should.

6) What is _k?
It is a variable given in the literature for sequential learning, I am not sure how else I could name it since it holds arbitrary information - like H^T H

Hm, I would try to find a better name. Reviewing is harder when variables aren't meaningful. Note that I am not familiar with ELMs and haven't been able to wrap my brain around that notation.

H is the variable given in the literature - though a more descriptive name is more desirable. Should we name it hidden_activations instead?

It already looks a lot better. Is it the output value of the hidden layer? Would it make sense to call it hidden_layer_output / y_hidden_layer / y_hidden?

sklearn/neural_network/extreme_learning_machines.py
((102 lines not shown))
+
+ def _init_param(self, X):
+ """Set initial parameters."""
+ self._activation_func = Activation_Functions[self.activation]
+
+ if self.kernel in ['poly', 'rbf', 'sigmoid'] and self.gamma is None:
+ # if custom gamma is not provided ...
+ self.gamma = 1.0 / self._n_features
+
+ if self.kernel != 'random':
+ self._X_train = X
+
+ def _init_hidden_weights(self):
+ """Initialize coef and intercept parameters for the hidden layer."""
+ rng = check_random_state(self.random_state)
+ fan_in, fan_out = self._n_features, self.n_hidden
@arjoly Owner
arjoly added a note

Do you need these aliases? I would keep the original names.

sklearn/neural_network/extreme_learning_machines.py
((107 lines not shown))
+ if self.kernel in ['poly', 'rbf', 'sigmoid'] and self.gamma is None:
+ # if custom gamma is not provided ...
+ self.gamma = 1.0 / self._n_features
+
+ if self.kernel != 'random':
+ self._X_train = X
+
+ def _init_hidden_weights(self):
+ """Initialize coef and intercept parameters for the hidden layer."""
+ rng = check_random_state(self.random_state)
+ fan_in, fan_out = self._n_features, self.n_hidden
+
+ # scale the initial, random parameters
+ if self.weight_scale == 'auto':
+ if self.activation == 'tanh':
+ interval = np.sqrt(6. / (fan_in + fan_out))
@arjoly Owner
arjoly added a note

interval => max_abs_weight or weight_scale to be consistent?

@ogrisel Owner
ogrisel added a note

Could you also make it possible to pass a user-specified sample_weight to the fit and partial_fit methods? Also please add a smoke test to check that AdaBoostClassifier(base_estimator=ELMClassifier()) works without crashing when calling fit on a toy dataset.

@ogrisel Owner
ogrisel added a note

If sample_weight is not None, one should not call compute_sample_weight but just use the provided weights. Please also add some tests on toy data to check that sample_weight can be used to ignore some samples from the training set, as done here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/tests/test_gradient_boosting.py#L978
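
A minimal smoke-test sketch along those lines (assuming the ELMClassifier from this PR gains sample_weight support; the toy data and the algorithm='SAMME' choice are illustrative):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.RandomState(0)
    X = np.r_[rng.randn(20, 5) - 1, rng.randn(20, 5) + 1]
    y = np.array([0] * 20 + [1] * 20)

    # algorithm='SAMME' only needs predict, so the base estimator does not
    # have to implement predict_proba
    clf = AdaBoostClassifier(base_estimator=ELMClassifier(n_hidden=20),
                             algorithm='SAMME')
    clf.fit(X, y)  # should simply not crash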

sklearn/neural_network/extreme_learning_machines.py
((115 lines not shown))
+ """Initialize coef and intercept parameters for the hidden layer."""
+ rng = check_random_state(self.random_state)
+ fan_in, fan_out = self._n_features, self.n_hidden
+
+ # scale the initial, random parameters
+ if self.weight_scale == 'auto':
+ if self.activation == 'tanh':
+ interval = np.sqrt(6. / (fan_in + fan_out))
+
+ elif self.activation == 'logistic':
+ interval = 4. * np.sqrt(6. / (fan_in + fan_out))
+
+ else:
+ interval = np.sqrt(1. / self._n_features)
+ else:
+ interval = self.weight_scale
@arjoly Owner
arjoly added a note

Why is it called a weight_scale?

sklearn/neural_network/extreme_learning_machines.py
((90 lines not shown))
+ if self.algorithm not in ALGORITHMS:
+ raise ValueError("The algorithm %s is not supported. Supported "
+ "algorithms are %s" % (self.algorithm,
+ ALGORITHMS))
+
+ if self.kernel not in KERNELS:
+ raise ValueError("The kernel %s is not supported. Supported "
+ "kernels are %s" % (self.kernel, KERNELS))
+
+ if self.algorithm != 'standard' and self.class_weight is not None:
+ raise NotImplementedError("class_weight is only supported "
+ "when algorithm='standard'.")
+
+ def _init_param(self, X):
+ """Set initial parameters."""
+ self._activation_func = Activation_Functions[self.activation]
@arjoly Owner
arjoly added a note

You don't really need that attribute ^^
Without it, you probably save yourself some pickling difficulties.

sklearn/neural_network/extreme_learning_machines.py
((28 lines not shown))
+def _relu_inplace(X):
+ """Compute the rectified linear unit function."""
+ tmp = 1 + np.exp(X, out=X)
+
+ return np.log(tmp, out=tmp)
+
+
+def _softmax_inplace(X):
+ """Compute the K-way softmax function. """
+ X = np.exp(X - X.max(axis=1)[:, np.newaxis])
+ X /= X.sum(axis=1)[:, np.newaxis]
+
+ return X
+
+
+Activation_Functions = {'tanh': _tanh_inplace, 'logistic': logistic_sigmoid,
@arjoly Owner
arjoly added a note

pep8 Activation_Functions => ACTIVATION_FUNCTIONS or ACTIVATIONS?

@arjoly Owner
arjoly added a note

Constants are written with capital letters.

@ogrisel Owner
ogrisel added a note

+1

sklearn/neural_network/extreme_learning_machines.py
((143 lines not shown))
+ A += self.intercept_hidden_
+ H = self._activation_func(A)
+
+ else:
+
+ if self.kernel == 'linear':
+ args = {}
+ elif self.kernel == 'rbf':
+ args = {'gamma': self.gamma}
+ elif self.kernel == 'sigmoid':
+ args = {'coef0': self.coef0, 'gamma': self.gamma}
+ elif self.kernel == 'poly':
+ args = {'degree': self.degree, 'coef0': self.coef0,
+ 'gamma': self.gamma}
+
+ H = pairwise_kernels(X, self._X_train, metric=self.kernel, **args)
@arjoly Owner
arjoly added a note

Looking at kernel_pca, you can probably do

    def _get_kernel(self, X, Y=None):
        if callable(self.kernel):
            params = self.kernel_params or {}
        else:
            params = {"gamma": self.gamma,
                      "degree": self.degree,
                      "coef0": self.coef0}
        return pairwise_kernels(X, Y, metric=self.kernel,
                                filter_params=True, **params)
@arjoly Owner
arjoly added a note

But I haven't looked at the doc of pairwise_kernels.

sklearn/neural_network/extreme_learning_machines.py
((68 lines not shown))
+ self.n_hidden = n_hidden
+ self.verbose = verbose
+ self.random_state = random_state
+
+ self.classes_ = None
+ self.coef_hidden_ = None
+ self.coef_output_ = None
+
+ def _validate_params(self):
+ """Validate input params."""
+ ALGORITHMS = ['standard', 'sequential']
+ KERNELS = ['random', 'linear', 'poly', 'rbf', 'sigmoid']
+
+ if self.n_hidden <= 0:
+ raise ValueError("n_hidden must be greater or equal zero")
+ if self.C <= 0.0:
@arjoly Owner
arjoly added a note

Add a blank line here to be consistent with the rest?

sklearn/neural_network/extreme_learning_machines.py
((367 lines not shown))
+ kernel='random'.
+
+ - 'logistic' for 1 / (1 + exp(-x)).
+
+ - 'tanh' for the hyperbolic tangent.
+
+ - 'relu' for log(1 + exp(x))
+
+ algorithm : {'standard', 'sequential'}, default 'standard'
+ The algorithm for computing least-square solutions.
+ Defaults to 'sequential'
+
+ - 'standard' computes the least-square solutions using the
+ whole matrix at once.
+
+ - 'sequential' computes the least-square solutions by training
@arjoly Owner
arjoly added a note

Would it make sense to call it recursive_lsq or recursive_least_square?

sklearn/neural_network/extreme_learning_machines.py
((129 lines not shown))
+ else:
+ interval = self.weight_scale
+
+ self.coef_hidden_ = rng.uniform(-interval, interval, (fan_in, fan_out))
+ self.intercept_hidden_ = rng.uniform(-interval, interval, (fan_out))
+
+ def _get_hidden_activations(self, X):
+ """Compute the hidden activations using the set kernel.
+ """
+ if self.kernel == 'random':
+ if self.coef_hidden_ is None:
+ self._init_hidden_weights()
+
+ A = safe_sparse_dot(X, self.coef_hidden_)
+ A += self.intercept_hidden_
+ H = self._activation_func(A)
@ogrisel Owner
ogrisel added a note

Please rename H and A as activations. Also rename _activation_func to _inplace_activation and do:

activations = safe_sparse_dot(X, self.coef_hidden_)
activations += self.intercept_hidden_
self._inplace_activation(activations)

to make it explicit that there is no memory copy besides the initial allocation of the safe_sparse_dot function.

@ogrisel Owner
ogrisel added a note

hidden_activations is an even better name.

@ogrisel
Owner

Please rebase on the current master and only use check_X_y(X, y) for input validation:

http://scikit-learn.org/dev/developers/utilities.html#validation-tools

@ogrisel
Owner

4) Since @amueller's improvements in the utils module, we could use the new awesome check_X_y, check_array or check_consistency_length.

I would love to use check_X_y, but it seems the latest available 64-bit scikit-learn release does not support it yet :(?

You should set up your dev environment to build scikit-learn from source. If you are on Windows, I have updated the instructions here:

http://scikit-learn.org/stable/install.html#building-on-windows

sklearn/neural_network/tests/test_elm.py
((325 lines not shown))
+ Test whether increasing weight for the minority class improves AUC
+ for the below imbalanced dataset.
+ """
+ rng = np.random.RandomState(random_state)
+ n_samples_1 = 500
+ n_samples_2 = 10
+ X = np.r_[1.5 * rng.randn(n_samples_1, 20),
+ 1.2 * rng.randn(n_samples_2, 20) + [2] * 20]
+ y = [0] * (n_samples_1) + [1] * (n_samples_2)
+
+ X_train, X_test, y_train, y_test = cross_validation.train_test_split(
+ X, y, test_size=0.8, random_state=random_state)
+
+ n_hidden = 20
+ for activation in ACTIVATION_TYPES:
+ elm_weightless = ELMClassifier(n_hidden=n_hidden,
@ogrisel Owner
ogrisel added a note

Wrong indentation level. Please run pyflakes on the source files of this PR and fix all the errors / warnings.

doc/modules/neural_networks_supervised.rst
((2 lines not shown))
+
+==================================
+Neural network models (supervised)
+==================================
+
+.. currentmodule:: sklearn.neural_network
+
+
+.. _elm:
+
+Extreme Learning Machines
+=========================
+
+**Extreme Learning Machines (ELMs)** are supervised nonlinear learning algorithm
+based on least-square solutions. It trains single-hidden layer
+feedforward networks, as shown in Figure 1. ELMs enjoy high speed and efficient
@ogrisel Owner
ogrisel added a note

I find it a confusing description. I would rather say:

ELM is a supervised nonlinear learning algorithm that results from the sequential application of the following steps:

  • a random projection, possibly to a larger dimensional space than the input space,
  • an element-wise non-linear activation function such as a tanh sigmoid function,
  • a linear one versus all classifier or a multi-output ridge regression model.

Therefore ELM can be interpreted as a single layer feed forward neural network (see figure 1) where:

  • only the hidden to output connection weights are trained,
  • the connection weights input to hidden are randomly initialized and left random,
  • the (default) output loss function is least squares.

ELM have been empirically found to generalize similarly to 1 hidden layer MLPs or RBF kernel support vector machines on a variety of problems while being significantly faster to train.

The main hyper-parameters of ELMs are:

  • the variance of the random projection weights,
  • the number of nodes on the hidden layer,
  • the regularization strength of the output linear model.
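
For concreteness, a minimal numpy sketch of those three steps (an illustration only, not the PR's code; plain least squares stands in for the ridge output model, and the size and scale values are placeholders):

    import numpy as np

    rng = np.random.RandomState(0)
    n_hidden, weight_scale = 100, 1.0

    def fit_elm(X, Y):
        # 1) random projection to the hidden space, never trained
        W_in = rng.uniform(-weight_scale, weight_scale,
                           (X.shape[1], n_hidden))
        b_in = rng.uniform(-weight_scale, weight_scale, n_hidden)
        # 2) element-wise non-linear activation
        H = np.tanh(np.dot(X, W_in) + b_in)
        # 3) linear output layer fit by least squares (ridge in the PR);
        #    Y is one-hot encoded for classification
        W_out = np.linalg.lstsq(H, Y, rcond=None)[0]
        return W_in, b_in, W_out

    def decision_function_elm(X, W_in, b_in, W_out):
        return np.dot(np.tanh(np.dot(X, W_in) + b_in), W_out)
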
@ogrisel Owner
ogrisel added a note

it applies a random projection...

@vene Owner
vene added a note

The prepositions are a bit off in this sentence. I'd say "to the input space, onto a possibly higher dimensional space"

@ogrisel
Owner

Please also update the benchmarks/bench_covertype.py file to compare ELMClassifier and Nystroem on that dataset. This will require doing some hyperparameter tuning.

If you need some compute power to run this kind of experiment, please tell me and I will send you credentials for a beefy Rackspace cloud server.

@IssamLaradji

Thanks for the great reviews as always. I addressed your comments @ogrisel and @arjoly and pushed the updated code. These are the things remaining,

**1) Why is it called a weight_scale?**
weight_scale scales the random weights of the input-to-hidden connections, but a more appropriate name could be weight_random_scale or, more academically driven, random_weight_variance?

**2) What is _k?**
Since it is related to recursive_lsqr, would it be desirable to name _k recursive_var?

**3) Please rebase on the current master and only use check_X_y(X, y) for input validation.**
It's nice that check_X_y(X, y) is working :), but what about methods that only accept X as input? Should I use check_array in such cases?

**4) If you need some compute power to run this kind of experiment, please tell me and I will send you credentials to use a beefy Rackspace cloud server.**
Good thing I have a 32 GB RAM machine, unless you mean speed :).
I ran ELMClassifier with grid search on the dataset and added the result to benchmarks/bench_covertype.py. The question is, should I run all the algorithms, or only ELMClassifier?

**5) But I haven't looked at the doc of pairwise_kernels.**
pairwise_kernels has a cool parameter, filter_params, that, when set to True, ignores unsupported parameters, making the code much cleaner :-)
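
A small illustration of that filter_params behaviour (toy data, not from the PR):

    import numpy as np
    from sklearn.metrics.pairwise import pairwise_kernels

    X = np.random.RandomState(0).randn(5, 3)
    params = {'gamma': 0.1, 'degree': 3, 'coef0': 1}

    # the rbf kernel only accepts gamma; with filter_params=True the unused
    # degree and coef0 entries are silently dropped instead of raising
    K = pairwise_kernels(X, X, metric='rbf', filter_params=True, **params)
    print(K.shape)  # (5, 5)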

Things to be done on my side,

1) Try to find a way to merge partial_fit and fit
2) Improve the documentation (neural_networks_supervised.rst)

Thank you very much.

@ogrisel
Owner

weight_scale is fine. weight_random_scale is wrong, as the scale is not random; it's the weights that are randomly initialized with the provided scale. random_weight_scale is correct but too long IMO.

@ogrisel
Owner

but what about methods that only accept X as input? Should I use check_array in such cases?

Yes. Read the docstring of check_array to choose the appropriate checks for each case.
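
A sketch of the validation calls being discussed (the accept_sparse and multi_output arguments are assumptions about what an ELM needs, not the PR's final code):

    from sklearn.utils import check_X_y, check_array

    def fit(self, X, y):
        # validates X and y together and checks that their lengths are consistent
        X, y = check_X_y(X, y, accept_sparse='csr', multi_output=True)
        ...

    def decision_function(self, X):
        # only X is available here, so check_array is the right tool
        X = check_array(X, accept_sparse='csr')
        ...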

@ogrisel
Owner

The question is, should I run all the algorithms, or only ELMClassifier?

Compare ELMClassifier, with the best parameters you could find by grid search, to the Nystroem method, which is a very similar baseline. Temporarily comment out (in your local source folder) the other methods so they do not run, but do not commit the commenting-out itself.
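
A rough sketch of such a comparison (parameter ranges are illustrative guesses; sklearn.grid_search was the module path at the time of this PR, and ELMClassifier is the estimator added here):

    from sklearn.grid_search import GridSearchCV
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import RidgeClassifier
    from sklearn.pipeline import make_pipeline

    param_grid = {'n_hidden': [256, 512, 1024],
                  'weight_scale': [0.1, 1.0, 10.0],
                  'C': [0.1, 1.0, 100.0]}
    elm_search = GridSearchCV(ELMClassifier(), param_grid, cv=3, n_jobs=-1)

    # baseline: an explicit Nystroem feature map of comparable width feeding
    # a ridge-regularized linear classifier
    nystroem_baseline = make_pipeline(Nystroem(n_components=512),
                                      RidgeClassifier(alpha=1.0))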

@ogrisel
Owner

A 17% error rate on covertype for a non-linear model seems quite bad; I would have expected better accuracy on this dataset. Have you tried increasing the number of hidden nodes, changing the weight_scale and finding the best value of C?

doc/modules/neural_networks_supervised.rst
((4 lines not shown))
+Neural network models (supervised)
+==================================
+
+.. currentmodule:: sklearn.neural_network
+
+
+.. _elm:
+
+Extreme Learning Machines