[MRG] GSoC 2014: Standard Extreme Learning Machines #3306

Open - wants to merge 2 commits into scikit-learn:master from IssamLaradji:Extreme-Learning-Machines

@IssamLaradji (Contributor) commented Jun 22, 2014

Finished implementing the standard extreme learning machines (ELMs). I am getting the following results with 550 hidden neurons against the digits dataset:

Training accuracy using the logistic activation function: 0.999444
Training accuracy using the tanh activation function: 1.000000

Fortunately, this algorithm is much easier to implement and debug than the multi-layer perceptron :).
I will push a test file soon.

@ogrisel, @larsmans
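For readers following along, here is a minimal numpy sketch of the standard ELM recipe discussed in this PR (random input-to-hidden weights, a fixed nonlinearity, then a least-squares fit of the output weights); the variable names are illustrative and this is not the PR's actual API:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.preprocessing import LabelBinarizer

    rng = np.random.RandomState(0)
    X, y = load_digits(return_X_y=True)
    X = X / 16.0                           # scale pixel values to [0, 1]
    Y = LabelBinarizer().fit_transform(y)  # one-hot targets for least squares

    n_hidden = 550
    # random input-to-hidden weights and biases; these are never trained
    W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1, 1, size=n_hidden)

    H = np.tanh(X.dot(W) + b)              # hidden activations
    # least-squares solve for the output weights
    coef_output, _, _, _ = np.linalg.lstsq(H, Y, rcond=None)

    y_pred = np.argmax(H.dot(coef_output), axis=1)
    print("training accuracy: %.4f" % np.mean(y_pred == y))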

@coveralls commented Jun 22, 2014

Coverage Status

Coverage increased (+0.0%) when pulling e5e363d on IssamLaradji:Extreme-Learning-Machines into 68b0a28 on scikit-learn:master.

@sveitser referenced this pull request in Shippable/support on Jun 23, 2014:
Closed - Display Code Coverage on Github Pull Request Page #239

+ Training data, where n_samples is the number of samples
+ and n_features is the number of features.
+
+ y : numpy array of shape (n_samples)

@larsmans (Member) commented Jun 23, 2014

y should be an "array-like" and be validated as such.

@IssamLaradji (Contributor) commented Jun 24, 2014

Thanks for bringing this up. I made the changes in multi-layer perceptron as well.

+ output = safe_sparse_dot(self.hidden_activations_, self.coef_output_)
+
+ return output
+

@NelleV (Member) commented Jun 23, 2014

There should be an extra blank line here. Can you run pep8 on the file and check for pep8 compliance?

@IssamLaradji (Contributor) commented Jun 24, 2014

Thanks - I had problems with pep8 auto-formatting in Sublime Text; I fixed it now.

@IssamLaradji (Contributor) commented Jun 27, 2014

Hi, I am wondering what extreme learning machines should display in verbose mode. Any ideas?

Thanks

@IssamLaradji (Contributor) commented Jun 30, 2014

Travis is acting strangely: it raises an error for test_multilabel_classification(), although on my local machine the test_multilabel_classification() method in test_elm runs correctly with 1000 different seeds. The pull request also passed the local test after executing make test on the whole library.

Is there a chance that Travis uses libraries that differ from (or are modified versions of) the local ones for testing?

@IssamLaradji referenced this pull request on Jun 30, 2014:
Closed - [MRG] Generic multi layer perceptron #3204 (3 of 4 tasks complete)
@arjoly (Member) commented Jun 30, 2014

It might be worth having a look at https://github.com/dclambert/Python-ELM.

@larsmans (Member) commented Jun 30, 2014

Training squared error loss would seem appropriate for verbose output. Not every estimator has verbose output, though (naive Bayes doesn't because it runs instantly on typical problem sizes).

@coveralls commented Jun 30, 2014

Coverage Status

Coverage increased (+0.07%) when pulling 2be2941 on IssamLaradji:Extreme-Learning-Machines into 68b0a28 on scikit-learn:master.

@IssamLaradji (Contributor) commented Jul 1, 2014

Thanks, displaying the training error as verbose output is a useful idea.

@ogrisel (Member) commented Jul 1, 2014

However, Travis raises an error for test_multilabel_classification(). Is there a chance that Travis uses libraries that differ from (or are modified versions of) the local ones for testing?

The versions of numpy / scipy used by the various Travis workers are given in the environment variables of each build. You can see the exact setup in:

@IssamLaradji (Contributor) commented Jul 2, 2014

@ogrisel thanks, I will dig deeper to see where multi-label classification is being affected.

@IssamLaradji (Contributor) commented Jul 2, 2014

Hi guys, I implemented weighted and regularized ELMs - here are their results on the imbalanced dataset. :) :)

Non-Regularized ELMs (Large C)
[figure: non_regularized_elm]

Regularized ELMs (Small C)
[figure: regularized_elm]

+
+ # compute regularized output coefficients using eq. 3 in reference [1]
+ left_part = pinv2(
+ safe_sparse_dot(H.T, H_tmp) + identity(self.n_hidden) / self.C)

@agramfort (Member) commented Jul 2, 2014

You should use the ridge implementation here.

@IssamLaradji (Contributor) commented Jul 2, 2014

Hi @agramfort, isn't this technically ridge regression? I am minimizing the L2 norm of the coefficients in the objective function - like in the equation below. Or do you mean I should use the scikit-learn implementation of ridge? Thanks.

[figure: l_elm]

@agramfort (Member) commented Jul 3, 2014

This does not look like ridge, but you seem to compute

(H'H + 1/C Id)^{-1} H'

and that is really a ridge solution where H is X, y is y, and C = 1/alpha.

@IssamLaradji (Contributor) commented Jul 3, 2014

Sorry, the equation I gave is for weighted ELMs, as it contains the weight term W which is not part of ridge. However, the implementation contains both versions - with W and without W.
The version without W computes the formula you mentioned, (H'H + 1/C Id)^{-1} H'y.
Thanks.

@agramfort (Member) commented Jul 3, 2014

Without W it is then a ridge.
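As a quick numerical sanity check of that equivalence (a sketch, not code from this PR): the closed-form solution (H'H + I/C)^{-1} H'y matches scikit-learn's Ridge with alpha = 1/C and no intercept.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    H = rng.randn(200, 50)    # stands in for the hidden activations
    y = rng.randn(200)
    C = 10e5

    # closed form from the discussion above: (H'H + I / C)^{-1} H'y
    w_closed = np.linalg.solve(H.T.dot(H) + np.eye(50) / C, H.T.dot(y))

    # ridge with alpha = 1 / C, no intercept
    w_ridge = Ridge(alpha=1.0 / C, fit_intercept=False).fit(H, y).coef_

    print(np.allclose(w_closed, w_ridge))  # expected: True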

+ scores = logistic_sigmoid(scores)
+ return np.hstack([1 - scores, scores])
+ else:
+ return _softmax(scores)

@agramfort (Member) commented Jul 2, 2014

Getting a proba here seems like a hack unless you use a logistic regression on top of your hidden features, not a ridge.

@IssamLaradji (Contributor) commented Jul 2, 2014

I see, so to get proper probabilities I should use the regular least-squares solution ||Ax - B||^2 without minimizing the norm of the coefficients?
Thank you

@IssamLaradji (Contributor) commented Jul 5, 2014

Pushed a lot of improvements.

  1. Added sequential ELM support - with partial_fit
  2. Added relevant tests for sequential ELM and weighted ELM

Created two examples.

  1. Weighted ELM plot
    [figure: plot_weighted]
  2. Training vs. testing accuracy with respect to the number of hidden neurons
    [figure: plot_testing_training]

I will leave the documentation until the end - after I implement the remaining part, which is kernel support, and after the code is reviewed. Thanks.

+plot_decision_function(
+ clf_weightless, axes[0], 'ELM(class_weight=None, C=10e5)')
+plot_decision_function(
+ clf_weight_auto, axes[1], 'ELM(class_weight=\'auto\', C=10e5)')

@agramfort (Member) commented Jul 5, 2014

Rather than using ' use " to define the string: 'ELM(class_weight="auto", C=10e5)'

+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)

@agramfort (Member) commented Jul 5, 2014

Parameter descriptions are missing here and below.

+
+import numpy as np
+
+from numpy import diag

@agramfort (Member) commented Jul 5, 2014

Why not use np.diag? That's what we usually do.

+ -------
+ X_new : {array-like, sparse matrix}, shape (n_samples, n_features)
+ """
+ exp_Z = np.exp(Z.T - Z.max(axis=1)).T

@agramfort (Member) commented Jul 5, 2014

Rather than using .T twice, use newaxis:

np.exp(Z - Z.max(axis=1)[:, np.newaxis])

Same below in the return statement. It's more readable.

+ @abstractmethod
+ def __init__(
+ self, n_hidden, activation, algorithm, C, class_weight, batch_size,
+ verbose, random_state):

@agramfort (Member) commented Jul 5, 2014

    def __init__(self, n_hidden, activation, algorithm, C, class_weight,
                 batch_size, verbose, random_state):

is more standard indentation.

+ else:
+ diagonals[indices] = 1
+
+ return diag(diagonals)

@agramfort (Member) commented Jul 5, 2014

Do you really need to allocate a full dense matrix? I doubt it.

@IssamLaradji (Contributor) commented Jul 5, 2014

Fixed :)

+
+ left_part = safe_sparse_dot(Z, H) + identity(self.n_hidden) / self.C
+ right_part = safe_sparse_dot(Z, y)
+ self.coef_output_ = safe_sparse_dot(pinv2(left_part), right_part)

@agramfort (Member) commented Jul 5, 2014

Solving a linear system with pinv2 is never recommended due to numerical errors. I am pretty sure there is a better way to do it.

@IssamLaradji (Contributor) commented Jul 5, 2014

I am now solving the system using scipy.linalg.solve - is this more efficient? Thanks.
It also takes around half the computation time. :)

@agramfort (Member) commented Jul 6, 2014

I am now solving the system using scipy.linalg.solve - is this more efficient?

Did you benchmark? I recommend you have a look at how the ridge is solved and maybe see how it can be reused (if it can). I would say profile / benchmark to see what's the best way to solve the linear systems in this PR.

@IssamLaradji (Contributor) commented Jul 12, 2014

Oh yes, I am using what ridge is using. I will try creating a RidgeClassifier object to compute the ELM solutions. Thanks.

@IssamLaradji (Contributor) commented Jul 13, 2014

So, I am reusing ridge.ridge_regression, which makes the code much cleaner :). But I don't think I can reuse it for the sequential ELM, since that equation is fundamentally different.
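For illustration, reusing the public ridge solver for the output weights could look roughly like this (a sketch with illustrative names, not the exact code in this PR):

    import numpy as np
    from sklearn.linear_model import ridge_regression

    def fit_output_weights(H, Y, C=10e5):
        """Regularized least-squares fit of the output weights.

        H : hidden activations, shape (n_samples, n_hidden)
        Y : one-hot encoded targets, shape (n_samples, n_outputs)
        """
        # ridge with alpha = 1 / C gives the regularized solution discussed above
        coef = ridge_regression(H, Y, alpha=1.0 / C)
        return coef.T  # shape (n_hidden, n_outputs)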

+
+ def __init__(
+ self, n_hidden=500, activation='tanh', algorithm='regular', C=10e5,
+ class_weight=None, batch_size=200, verbose=False, random_state=None):

@agramfort (Member) commented Jul 5, 2014

Same remark on indentation.

+
+ The algorithm trains a single-hidden layer feedforward network by computing
+ the hidden layer values using randomized parameters, then solving
+ for the output weights using least-square solutions.

@agramfort (Member) commented Jul 5, 2014

This description is the same as for ELMClassifier - is that intentional?

@IssamLaradji (Contributor) commented Jul 5, 2014

No, sorry. The difference is that ELMClassifier has an output gate function that converts continuous values to integers. I will change it now.

+ def __init__(
+ self, n_hidden=100, activation='tanh', algorithm='regular',
+ batch_size=200, C=10e5, verbose=False, random_state=None):
+ class_weight = None

@agramfort (Member) commented Jul 5, 2014

indent

@IssamLaradji (Contributor) commented Jul 5, 2014

@agramfort thanks for your comments. I pushed the updated code.

@IssamLaradji (Contributor) commented Jul 13, 2014

Updates:

  1. ELM now uses ridge regression as an off-the-shelf solver to compute its solutions.
  2. Added support for kernels - linear, poly, rbf, sigmoid.
    Is there a way we could reuse the fast, efficient SVM kernel methods?
    Thanks.
@larsmans (Member) commented Jul 13, 2014

There are kernels in sklearn.metrics. The ones in sklearn.svm are buried deep down in the C++ code for LibSVM.
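For example, the pairwise kernels in sklearn.metrics can be reused directly (pairwise_kernels is the existing API; the data here is random just to make the snippet self-contained):

    import numpy as np
    from sklearn.metrics.pairwise import pairwise_kernels

    rng = np.random.RandomState(0)
    X_train, X_test = rng.randn(100, 20), rng.randn(10, 20)

    # kernel matrix between training samples, e.g. an RBF kernel
    K = pairwise_kernels(X_train, metric='rbf', gamma=0.1)

    # at prediction time, kernels between test and training samples
    K_test = pairwise_kernels(X_test, X_train, metric='rbf', gamma=0.1)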

+.. _multilayer_perceptron:
+
+Multi-layer Perceptron
+======================

@agramfort (Member) commented Jul 13, 2014

Why is the doc for MLP in this ELM PR?

@IssamLaradji (Contributor) commented Jul 13, 2014

Hi, I removed it - it was added by accident in the last push :)

@IssamLaradji (Contributor) commented Jul 13, 2014

Thanks! Reusing the scikit-learn kernels made the code much cleaner.

+import random
+
+from sklearn import cross_validation
+from sklearn.datasets import load_digits, fetch_mldata

@agramfort (Member) commented Jul 14, 2014

'load_digits' imported but unused

Run pyflakes on your files.

@agramfort (Member) commented Jul 14, 2014

Do we really need to fetch the full MNIST to illustrate this? It makes it impossible to run on a crappy internet connection like mine right now :(

+
+
+# for reference, first fit without class weights
+# fit the model

@agramfort (Member) commented Jul 14, 2014

Why two lines of comments?

+ class_weight={1: 1000})
+clf_weight_1000.fit(X, Y)
+
+fig, axes = plt.subplots(1, 3, figsize=(20, 7))

@agramfort (Member) commented Jul 14, 2014

This figsize is way too big for the doc. Make it no bigger than 10 inches wide and remove empty spaces with plt.subplots_adjust.

+mnist = fetch_mldata('MNIST original')
+X, y = mnist.data, mnist.target
+
+indices = np.array(random.sample(range(70000), 2000))

@agramfort (Member) commented Jul 14, 2014

Use np.random and get rid of the random module from the standard library.

+
+indices = np.array(random.sample(range(70000), 2000))
+X, y = X[indices].astype('float64'), y[indices]
+X /= 255

@agramfort (Member) commented Jul 14, 2014

It's a float.

+
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn import svm

@agramfort (Member) commented Jul 14, 2014

svm unused

+
+def _softmax(Z):
+ """Compute the K-way softmax function. """
+ exp_Z = np.exp(Z - Z.max(axis=1)[:, np.newaxis])

@agramfort (Member) commented Jul 14, 2014

Can you use Z to store the output of np.exp, like you did for tanh above?

+ """Compute the K-way softmax function. """
+ exp_Z = np.exp(Z - Z.max(axis=1)[:, np.newaxis])
+
+ return (exp_Z / exp_Z.sum(axis=1)[:, np.newaxis])

@agramfort (Member) commented Jul 14, 2014

And do this division in place with /=.

@GaelVaroquaux (Member) commented May 7, 2015

[comment minimized; content not captured]

@mblondel (Member) commented May 7, 2015

I wouldn't take his words for granted.

Which is why I suggested to do a quick comparison with SVMs and RBF nets. Or maybe someone has experience using them?

Would it work as well?

It would be the same provided that we create a random projection transformer which does the same as

https://github.com/scikit-learn/scikit-learn/pull/3306/files#diff-cc171f4410591cbb16cfe64bcc31841fR70

I can't say if we use standard random projections instead.

@IssamLaradji (Contributor) commented May 7, 2015

In my experience, random neural networks (random_NN) perform really well. They usually perform about the same as a single-hidden-layer perceptron - although they need a much larger number of hidden neurons.
With some tricks, such as stacking random-based autoencoders, you can get surprisingly good results on image datasets. I have been using this for a while, and it was quite a success.

I got emails from several people about using this tool and, disregarding the self-citations, the number of citations indicates high demand for this algorithm.

I think it would be great if we keep it, change the name, and cite the older papers that came up with this idea (just an opinion).

I will work on showing some comparison results here today.

@amueller (Member) commented May 7, 2015

@mblondel most of this PR consists of "smartly solving the ridge with partial fit". That wouldn't be possible with a pipeline.

However, it sounds a bit like it should live in the ridge classifier / regressor and not here, and we should work on our partial-fit pipeline support?

@mblondel (Member) commented May 7, 2015

@amueller Yes, ideally, the two contributions (random NN and efficient ridge regression) should be decoupled. But still, I don't understand the point of this incremental fitting algorithm. The most expensive operation within the for loop is solving an n_hidden x n_hidden system of linear equations. All the other operations are basically matrix multiplications. How is looping over batches supposed to help scaling w.r.t. n_samples? Or maybe I miss the point...

+ if sample_weight is None:
+ return X
+ else:
+ return X * sample_weight[:, np.newaxis]

@mblondel (Member) commented May 7, 2015

* means matrix multiplication when X is a sparse matrix.
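A sparse-safe variant of the snippet above could look like this (a sketch, not necessarily the fix adopted in the PR):

    import numpy as np
    import scipy.sparse as sp

    def apply_sample_weight(X, sample_weight):
        """Scale each row of X by its sample weight, for dense or sparse X."""
        if sample_weight is None:
            return X
        if sp.issparse(X):
            # elementwise row scaling; X * w would be a matrix product here
            return X.multiply(sample_weight[:, np.newaxis]).tocsr()
        return X * sample_weight[:, np.newaxis]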

@mblondel (Member) commented May 7, 2015

we should work on our partial-fit pipeline support?

Yes, that would indeed be nice, although this particular transformer would be (mostly) stateless, so the user can just call transform in a for loop as shown in my comment. For the general case, it is not clear whether the transform steps should be fitted against the full data or not.

@amueller (Member) commented May 7, 2015

The point of the incremental fitting is to do a partial_fit for ridge, which will mean not inverting the whole matrix at once, and in particular not storing it all at once, as n_samples x n_hidden might be too large to fit in memory.
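In other words, partial_fit can accumulate the n_hidden x n_hidden Gram matrix and the right-hand side across batches and only solve a small system at the end; a sketch of the idea (not the PR's exact code):

    import numpy as np

    class IncrementalRidgeSketch(object):
        """Accumulate H'H and H'y over batches, then solve a small system."""

        def __init__(self, n_hidden, alpha=1e-5):
            self.G = np.zeros((n_hidden, n_hidden))  # running H'H
            self.b = np.zeros(n_hidden)              # running H'y
            self.alpha = alpha

        def partial_fit(self, H_batch, y_batch):
            self.G += H_batch.T.dot(H_batch)
            self.b += H_batch.T.dot(y_batch)
            return self

        def finalize(self):
            # only an n_hidden x n_hidden matrix is ever stored or solved
            eye = np.eye(self.G.shape[0])
            self.coef_ = np.linalg.solve(self.G + self.alpha * eye, self.b)
            return self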

@amueller (Member) commented May 7, 2015

For partial-fit pipelines, I would only support stateless transformers for the moment. I agree that you can write a pipeline like you did, but the for loop is actually a bit more complex, as you have to fit once in the first iteration. It is not complicated, but having a convenience function would be nice.

@mblondel (Member) commented May 7, 2015

The matrix that we need to invert for solving the system is n_hidden x n_hidden for all iterations of the loop over the batches. Storing an n_samples x n_hidden matrix shouldn't be a problem if n_hidden < n_features. If n_hidden > n_features, the usual kernel ridge trick can be used.

@amueller (Member) commented May 7, 2015

It is assumed that n_hidden >> n_features, I think.
The kernel trick only works if n_samples is small, right?
I think this is supposed to work with, say, expanding MNIST to 10000 features or something. At least that is the idea that I got. Has anyone tried the MNIST benchmark with this code?

@amueller (Member) commented May 7, 2015

On 05/07/2015 08:05 AM, Mathieu Blondel wrote:

I can't say if we use standard random projections instead.

Our random projections are normal, ELM's are uniform. I wouldn't think it makes a difference, but we could try.

@amueller (Member) commented May 7, 2015

It would be good to have an empirical case where the partial fit actually helps. The incremental fitting is what really makes this code non-trivial. If it is helpful, I'd say merge this with renaming / reassignment of credit, and later refactor it into Ridge.

If not, maybe just add a transformer and an example?

@mblondel (Member) commented May 7, 2015

For a stateless transformer, I presume the fit is mostly needed for input checking and setting the rng? The rng could be set in the first call to transform, although this might break your common tests. In any case, you can just call fit once outside of the for loop.

So, indeed, the incremental fitting is useful in the n_features < n_hidden < n_samples regime. But this is the usual out-of-core learning setting: your features are too big, so you need to build them on the fly and call partial_fit on small batches. @agramfort had an example using polynomial features in his PyData talk :)

If we really want to go the estimator way (rather than the transformer way), there is actually a more elegant and concise way to solve the problem using conjugate gradient with a LinearOperator. This technique can be used to solve the system of linear equations without ever materializing the transformed features of size n_samples x n_hidden. This is because conjugate gradient only needs to compute products between the n_hidden x n_hidden matrix and a vector. This should be like 10 lines of code. See https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L63 for an example of how this works.
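A rough sketch of that suggestion (with assumed shapes and a tanh feature map standing in for the hidden layer): solve (H'H + alpha * I) w = H'y with conjugate gradient, computing each matrix-vector product batch by batch so that H is never materialized in full.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.RandomState(0)
    n_samples, n_features, n_hidden = 5000, 20, 300
    X = rng.randn(n_samples, n_features)
    y = rng.randn(n_samples)
    W = rng.uniform(-1, 1, size=(n_features, n_hidden))  # random hidden weights
    alpha, batch_size = 1e-5, 500

    def hidden(X_batch):
        return np.tanh(X_batch.dot(W))

    def matvec(v):
        # (H'H + alpha * I) v, accumulated over batches of rows of H
        out = alpha * v
        for start in range(0, n_samples, batch_size):
            H_b = hidden(X[start:start + batch_size])
            out += H_b.T.dot(H_b.dot(v))
        return out

    A = LinearOperator((n_hidden, n_hidden), matvec=matvec)
    # right-hand side H'y, also accumulated batch by batch
    b = np.zeros(n_hidden)
    for start in range(0, n_samples, batch_size):
        H_b = hidden(X[start:start + batch_size])
        b += H_b.T.dot(y[start:start + batch_size])

    coef, info = cg(A, b)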

@amueller (Member) commented May 7, 2015

Yeah, stateless transformers and the common tests don't work well together; they are currently manually excluded, and that is something we / I need to fix.
The problem with setting the random weights on the first transform is that this would break if people want to use a transformer object on two different datasets. n_features is usually inferred in fit, and if you use a different one in transform that is an error. I guess you could set it in transform unless it was set in fit, and if you explicitly want to use it on another dataset, you have to call fit. That is slightly magic, though.

@amueller (Member) commented May 7, 2015

For better ways to solve the problem: well, it could be that n_features is small enough that you could fit n_samples x n_features into RAM, but not n_samples x n_hidden. I am not sure what typical n_features and n_hidden are.

You are the expert in solving linear problems, I am certainly not, so if there are smarter ways to solve this, then we should go for those.
I didn't mentor this GSoC; I just heard multiple times "this just needs a final review".

@mblondel (Member) commented May 8, 2015

For better ways to solve the problem: well it could be that n_features is small enough that you could fit n_samples x n_features into RAM, but not n_samples x n_hidden. Not sure what typical n_features and n_hidden are.

Sorry, when I was talking about generating features on the fly, I was referring to the features generated by the random projection + activation transformer. This is the same setting as with polynomial features: your original features fit in memory, but not the combination features obtained by PolynomialFeatures. But the principle is the same even if you start from your raw data, as long as the transformer used is stateless (e.g., FeatureHasher). In all cases, we loop over small batches of data, transform them, and call partial_fit.
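A sketch of that out-of-core pattern, using RBFSampler purely as a stand-in for the proposed random projection + activation transformer (names and sizes are illustrative):

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(10000, 20)
    y = (X[:, 0] > 0).astype(int)
    classes = np.unique(y)

    # fit the (mostly stateless) transformer once, then stream the batches
    transformer = RBFSampler(n_components=500, random_state=0).fit(X[:100])
    clf = SGDClassifier(random_state=0)

    for start in range(0, X.shape[0], 500):
        batch = slice(start, start + 500)
        H = transformer.transform(X[batch])  # features built on the fly
        clf.partial_fit(H, y[batch], classes=classes)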

@GaelVaroquaux (Member) commented May 8, 2015

For partial fit pipelines, I would only support stateless transformers for the moment.

+1: let's tackle the simple things first.

@mblondel (Member) commented May 8, 2015

[comment minimized; content not captured]
@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller May 8, 2015

Member

Did I say I have a plan? ;)
Three possible ways?

  1. detect (I have no idea how)
  2. annotate as stateless
  3. via api: stateless transformers don't need to call fit. Then we don't need to call fit in partial_fit. This would also allow users to provide a statefull transformer fit on a subset of the data.
Member

amueller commented May 8, 2015

Did I say I have a plan? ;)
Three possible ways?

  1. detect (I have no idea how)
  2. annotate as stateless
  3. via api: stateless transformers don't need to call fit. Then we don't need to call fit in partial_fit. This would also allow users to provide a statefull transformer fit on a subset of the data.
@amueller (Member) commented May 8, 2015

Can we decide what to do with this PR first?
@IssamLaradji put a lot of work into it, and it has been sitting around for way too long. If we feel that the algorithm isn't suited for a classifier / regressor class, we should see what we can salvage and add transformers / examples etc.

@mblondel (Member) commented May 8, 2015

+1 for a transformer on my side. Instead of using a pipeline of two transformers as I initially suggested, we can maybe create just one transformer that does the random projection and applies the activation function. This should be fairly straightforward to implement. For the examples, showing how to do grid search with a pipeline would be nice. For the name of the transformer, maybe RandomActivationTransformer?

@amueller (Member) commented May 8, 2015

Do you think the iterative ridge regression here has value, or are there better ways to partial_fit ridge regression?

@mblondel (Member) commented May 8, 2015

The idea of accumulating the n_hidden x n_hidden matrix is nice, but it won't scale if n_hidden is large. If we implement a general partial_fit out of this algorithm, it will crash when people try it on high-dimensional data like bags of words. We can add it and recommend not using it when n_features is large. It would still be useful in some settings where n_samples is huge but n_features is reasonably small. For large n_features, I guess one should use SGD's partial_fit.

@amueller (Member) commented May 8, 2015

OK. So let's do the transformer? @IssamLaradji do you want to do that, or do you think you don't have time?

I'm not sure about RandomActivationTransformer. Maybe NonlinearProjection, though "projection" kind of implies mapping to a lower-dimensional space. NonlinearRandomFeatures? RandomFeatures?

@vene (Member) commented May 8, 2015

I don't like RandomFeatures; it's way too generic. From the name I'd expect it to simply ignore X and return random features. Out of all the names here, it seems to me like RandomActivation is the most specific (it best conveys what the object does). (I'd remove the Transformer suffix.)

@jnothman (Member) commented May 9, 2015

An aside @amueller re stateless transformers: for this purpose, transformers that depend only on the type or number of columns of the input should also be acceptable, just to make things tricky!

@IssamLaradji (Contributor) commented May 9, 2015

@amueller yeah, sure! I can do the transformer.

So I will open a new pull request for this.

Should the file containing the algorithm be under the scikit-learn main directory? I mean, would it be something like from sklearn import RandomActivation?

Would the parameters be something like:

  • weight_scale, which sets the range of values for the uniform random sampling.
  • activation_function, which could be identity, relu, logistic, and so on.

PS: I think for ridge regression there is also feature-wise batch support, which scales with n_features rather than n_samples.
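A rough sketch of what such a transformer might look like, with parameter names loosely following the comment above (hypothetical code, not what eventually landed in #4703):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils import check_array, check_random_state

    class RandomActivation(BaseEstimator, TransformerMixin):
        """Random projection followed by a fixed nonlinearity (sketch)."""

        def __init__(self, n_hidden=100, weight_scale=1.0,
                     activation_function='tanh', random_state=None):
            self.n_hidden = n_hidden
            self.weight_scale = weight_scale
            self.activation_function = activation_function
            self.random_state = random_state

        def fit(self, X, y=None):
            X = check_array(X)
            rng = check_random_state(self.random_state)
            # uniform random weights in [-weight_scale, weight_scale]
            self.coef_ = rng.uniform(-self.weight_scale, self.weight_scale,
                                     size=(X.shape[1], self.n_hidden))
            self.intercept_ = rng.uniform(-self.weight_scale, self.weight_scale,
                                          size=self.n_hidden)
            return self

        def transform(self, X):
            X = check_array(X)
            Z = X.dot(self.coef_) + self.intercept_
            if self.activation_function == 'tanh':
                return np.tanh(Z)
            if self.activation_function == 'relu':
                return np.maximum(Z, 0)
            if self.activation_function == 'logistic':
                return 1.0 / (1.0 + np.exp(-Z))
            return Z  # 'identity'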

@mblondel (Member) commented May 9, 2015

One possible place would be the preprocessing module.

@mblondel (Member) commented May 9, 2015

Actually, how about putting it in the neural_network module?

@IssamLaradji (Contributor) commented May 9, 2015

Sounds good.

@IssamLaradji (Contributor) commented May 9, 2015

Right, it's only used by neural network algorithms as far as I know, so having it in the neural_network module is better IMO.

@IssamLaradji (Contributor) commented May 11, 2015

#4703 is a rough implementation of the RandomActivation algorithm.

@jnothman referenced this pull request on May 11, 2015:
Open - [MRG] RandomActivation #4703 (3 of 3 tasks complete)
@amueller (Member) commented May 11, 2015

How about "RandomBasisFunction" as a name?

@ekerazha commented May 13, 2015

The decoupled approach was the same approach as in https://github.com/dclambert/Python-ELM
You had a "random_layer" and you could also pipeline it before a Ridge regression.

Moreover, it also included a MELM-GRBF implementation.


@dchambers referenced this pull request in BladeRunnerJS/fell on Aug 13, 2015:
Merged - Code coverage demo #5

@ProfFan commented Jan 8, 2016

Just FYI: Anton Akusok et al. have recently implemented ELM in Python with MAGMA-based acceleration, under the name hpelm (PyPI: https://pypi.python.org/pypi/hpelm).
