Added Restricted Boltzmann machines #1200

Closed · wants to merge 66 commits into base: master

Conversation

@ynd
Contributor

ynd commented Oct 2, 2012

RBMs are state-of-the-art generative models. They've been used to win the Netflix challenge [1] and in record-breaking systems for speech recognition at Google [2] and Microsoft. This pull request adds a class for Restricted Boltzmann Machines (RBMs) to scikit-learn. The code is both easy to read and efficient.

[1] http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
[2] http://research.google.com/pubs/archive/38130.pdf
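
For readers skimming the thread, here is a minimal usage sketch of the estimator as proposed at this point of the PR. The import path, class name (RBM) and constructor arguments are all provisional and are revisited during review below (a neural_network module and different parameter names are discussed), so treat every name here as an assumption rather than the final API.

    from sklearn.datasets import load_digits

    # Hypothetical import path: this PR adds sklearn/rbm.py; the class is later
    # discussed under a neural_network module and with different names.
    from sklearn.rbm import RBM

    X = load_digits().data
    X = X / X.max()              # binary RBMs expect inputs in [0, 1]

    rbm = RBM(n_components=64)   # 64 binary hidden units
    rbm.fit(X)                   # unsupervised training (persistent CD)
    H = rbm.transform(X)         # hidden activations, shape (n_samples, 64)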

@mblondel

mblondel commented on sklearn/rbm.py in f192a8f Oct 3, 2012

verbose should be a constructor parameter.

@mblondel

Member

mblondel commented Oct 3, 2012

That's great! How hard would it be to support scipy sparse matrices too? We're trying to avoid adding new classes that can only operate with numpy arrays...

@glouppe

Member

glouppe commented Oct 3, 2012

Great addition indeed! What would be convenient is to define an RBMClassifier class on top of your RBM. There are several ways to do that, but the most straightforward would be to encode both X and y into the visible units and then train the RBM. For predictions, clamp the values of X on the visible units (but leave the visible units of y free), let the machine stabilize, and finally output the values at the visible units corresponding to y as the final predictions. This strategy can even be used for multi-output problems.

Also, do you intend to handle missing values? (since you motivate your PR with Netflix)
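
A rough sketch of the clamping scheme described above, assuming a generatively trained binary RBM whose visible layer is the concatenation [X, one_hot(y)]. The parameter names (W of shape (n_visible, n_hidden), b_hidden, b_visible) and the helper itself are illustrative assumptions, not part of this PR.

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def predict_by_clamping(W, b_hidden, b_visible, X_bin, n_classes,
                            n_steps=50, rng=np.random):
        """Clamp the X part of the visible layer, leave the label units free,
        run a short Gibbs chain, and read the labels off the free units."""
        n_samples, n_features = X_bin.shape
        y_units = np.full((n_samples, n_classes), 1.0 / n_classes)  # uniform start
        v = np.hstack([X_bin, y_units])
        for _ in range(n_steps):
            h_mean = logistic(v @ W + b_hidden)
            h = (rng.uniform(size=h_mean.shape) < h_mean).astype(float)
            v = logistic(h @ W.T + b_visible)   # mean-field reconstruction
            v[:, :n_features] = X_bin           # clamp the observed part each step
        return np.argmax(v[:, n_features:], axis=1)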

sklearn/rbm.py (outdated diff):
Attributes
----------
W : array-like, shape (n_visibles, n_components), optional

@glouppe

glouppe Oct 3, 2012

Member

In scikit-learn, the convention is to add a trailing underscore to all fitted attributes (self.W_).

@GaelVaroquaux

GaelVaroquaux Oct 3, 2012

Member

In scikit-learn, the convention is to add a trailing underscore to all fitted attributes (self.W_).

In addition, we now try to avoid single-letter variable names. Can you find a more explicit name?

@mblondel

mblondel Oct 3, 2012

Member

In PCA, a related attribute is components_.

sklearn/rbm.py (outdated diff):
units and n_components is the number of hidden units.
b : array-like, shape (n_components,), optional
Biases of the hidden units
c : array-like, shape (n_visibles,), optional

@glouppe

glouppe Oct 3, 2012

Member

b and c are not meaningful variable names. Maybe bias_hidden and bias_visible would be better names? (This is a suggestion)

@agramfort

agramfort Oct 3, 2012

Member

or maybe better intercept_hidden_ and intercept_visible_

@GaelVaroquaux

GaelVaroquaux Oct 3, 2012

Member

or maybe better intercept_hidden_ and intercept_visible_

intercept_hidden and intercept_visible

@amueller

amueller Oct 3, 2012

Member

Bias is the usual term in the community; before joining sklearn, I never heard the word intercept.
Consistency with the community or within sklearn is the question, I guess ;)

@GaelVaroquaux

GaelVaroquaux Oct 3, 2012

Member

On Wed, Oct 03, 2012 at 03:33:51AM -0700, Andreas Mueller wrote:

bias is the usual term in the community. before joining sklearn, I never heard
the word intercept.
Consistency with community or within sklearn is the question I guess ;)

In any case, the fact that the two terminologies coexist needs to be stressed in the documentation. I wasn't aware that "bias == intercept" :)

@larsmans

larsmans Oct 3, 2012

Member

+1 for intercept_{visible,hidden}_.

sklearn/rbm.py (outdated diff):
inds = range(X.shape[0])
np.random.shuffle(inds)

@glouppe

glouppe Oct 3, 2012

Member

self.random_state should be used instead.

@mblondel

mblondel Oct 3, 2012

Member

It seems to me that other parts of the scikit don't record the random state in an attribute (they pass it around instead).

@mblondel

mblondel Oct 4, 2012

Member

You could do del self.random_state at the end of fit then.

@ynd

ynd Oct 4, 2012

Contributor

On Thu, Oct 4, 2012 at 1:15 AM, Mathieu Blondel notifications@github.com wrote:

In sklearn/rbm.py:

    np.random.shuffle(inds)

You could do del self.random_state at the end of fit then.

What if fit is called twice? Wouldn't having deleted self.random_state cause some problems?
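
For reference, the convention the discussion converges on later (see ogrisel's review comment near the end of this thread) is to never overwrite the constructor parameter, which also answers the "fit called twice" concern. A minimal sketch with a hypothetical estimator name:

    import numpy as np
    from sklearn.utils import check_random_state

    class SomeEstimator(object):
        def __init__(self, random_state=None):
            self.random_state = random_state   # stored untouched; safe to clone and refit

        def fit(self, X, y=None):
            rng = check_random_state(self.random_state)  # local RNG, not an attribute
            inds = np.arange(X.shape[0])
            rng.shuffle(inds)                  # instead of np.random.shuffle(inds)
            # ... training loop drawing all randomness from `rng` ...
            return self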

sklearn/rbm.py (outdated diff):
return v.shape[1] * np.log(self._sigmoid(fe_ - fe))
def fit(self, X, y=None, verbose=False):

@glouppe

glouppe Oct 3, 2012

Member

As Mathieu said, verbose should be a constructor parameter.

@ynd

ynd Oct 3, 2012

Contributor

done.

On Wed, Oct 3, 2012 at 3:06 AM, Gilles Louppe notifications@github.com wrote:

In sklearn/rbm.py:

        v: array-like, shape (n_samples, n_visibles)

        Returns

        pseudo_likelihood: array-like, shape (n_samples,)
        """
        fe = self.free_energy(v)
        v_ = v.copy()
        i_ = self.random_state.randint(0, v.shape[1], v.shape[0])
        v_[range(v.shape[0]), i_] = v_[range(v.shape[0]), i_] == 0
        fe_ = self.free_energy(v_)
        return v.shape[1] * np.log(self._sigmoid(fe_ - fe))

    def fit(self, X, y=None, verbose=False):

As Mathieu said, verbose should be a constructor parameter.



sklearn/rbm.py (outdated diff):
if __name__ == '__main__':
main()

@glouppe

glouppe Oct 3, 2012

Member

Remove the lines above. Instead, could you provide an example in a separate file?

@amueller

amueller Oct 3, 2012

Member

+1
Also: how are we going to test this beast?

@GaelVaroquaux

GaelVaroquaux Oct 3, 2012

Member

Also: how are we going to test this beast?

Check that the energy decreases during the training?

@amueller

amueller Oct 3, 2012

Member

This is not really what you want. You want the probability to increase, and you usually can't compute the partition function.
We could do an example with ~5 hidden units and test that. That would actually be a good test.

But: (P)CD learning doesn't guarantee increasing the probability of the data set and may diverge.

So I guess plotting the true model probability of the data would be a cool example and we could also test against that.
We only need a function to calculate the partition function...
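
A sketch of the exact computation such a test would need, feasible only for the handful of hidden units suggested above; the parameterization (W of shape (n_visible, n_hidden) plus visible and hidden bias vectors) is an assumption for illustration, not the PR's attribute names.

    import itertools
    import numpy as np

    def exact_log_partition(W, b_visible, b_hidden):
        """Brute-force log Z of a binary-binary RBM by summing over all hidden
        configurations (2**n_hidden terms, so keep n_hidden around 5)."""
        n_hidden = W.shape[1]
        log_terms = []
        for bits in itertools.product([0, 1], repeat=n_hidden):
            h = np.asarray(bits, dtype=float)
            # log [ exp(b_h . h) * prod_i (1 + exp((W h + b_v)_i)) ]
            log_terms.append(b_hidden @ h
                             + np.sum(np.logaddexp(0, W @ h + b_visible)))
        return np.logaddexp.reduce(log_terms)

    def exact_mean_log_likelihood(V, W, b_visible, b_hidden):
        """Average exact log-likelihood of binary rows V, shape (n, n_visible);
        a test can check that this goes up (on average) during training."""
        neg_free_energy = V @ b_visible + np.sum(np.logaddexp(0, V @ W + b_hidden), axis=1)
        return neg_free_energy.mean() - exact_log_partition(W, b_visible, b_hidden)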

@dwf

dwf Oct 3, 2012

Member

On Wed, Oct 3, 2012 at 6:31 AM, Andreas Mueller notifications@github.com wrote:

This is not really what you want. You want the probability to increase and
you can't compute the partition function usually.
We could do an example with ~5 hidden units and test that. That would
actually a good test.

But: (P)CD learning doesn't guarantee increasing the probability of the
data set and may diverge.

Well, PCD does, at least asymptotically, as long as you use an
appropriately decaying learning rate. But asymptotic behaviour is kind of
hard to test.

@amueller

amueller Oct 3, 2012

Member

Really? I'm pretty sure it doesn't, as you cannot guarantee mixing of the chain.
Maybe I have to look into the paper again.

@amueller

amueller Oct 3, 2012

Member

Doesn't say anything like that in the paper as far as I can tell and I think I observed divergence in practice.
Could be that it is possible to converge if you decay the learning rate fast enough but that doesn't mean that you converge to a point that is better than the point you started with, only that you stop somewhere.

@dwf

dwf Oct 3, 2012

Member

The "PCD" paper kind of undersells it, what you want to look at is the Laurent Younes paper "On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates", which he cites (the contribution of the PCD paper was basically making this known to the ML community and demonstrating it on RBMs). There you get guarantees of almost sure convergence to a local maximum of the empirical likelihood, provided your initial learning rate is small enough. If you see things start to diverge or oscillate then your learning rate is probably too high.

@ynd

ynd Oct 3, 2012

Contributor

Done. I do plan to make a separate example. As @amueller and @ogrisel mentioned, I think it would be best to show its use as a feature extractor for a LinearSVC, and to show that it improves the accuracy on the test set very significantly, for example on digits. It would also show how to use GridSearchCV to find the optimal learning rate.

@pprett

Member

pprett commented Oct 3, 2012

Thanks for the contribution - I'm super excited - can't wait to see this merged to master!

sklearn/rbm.py (outdated diff):
-------
x_new: array-like, shape (M, N)
"""
return 1. / (1. + np.exp(-np.maximum(np.minimum(x, 30), -30)))

@glouppe

glouppe Oct 3, 2012

Member

I don't know if this changes anything, but wouldn't it be better to output 0. if x < -30 and 1. if x > 30? Also, since this very particular function is at the core of the algorithm, a Cython version may be a better choice.

@mblondel

mblondel Oct 3, 2012

Member

This method doesn't depend on the object state, so it can be changed to a function.

@larsmans

larsmans Oct 3, 2012

Member

Shouldn't we have logistic sigmoid in sklearn.utils.extmath?

@amueller

amueller Oct 4, 2012

Member

maybe call it logistic_sigmoid as sigmoid just refers to the shape....

@dwf

dwf Oct 4, 2012

Member

+1 @amueller. Drives me nuts when people around here simply call this "the sigmoid". :) The logistic function is its proper name.

@ynd

ynd Oct 4, 2012

Contributor

done
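
A sketch of the helper being discussed, using the same clipping trick as the line quoted above (the eventual name and whether it lands in sklearn.utils.extmath are still open at this point in the review):

    import numpy as np

    def logistic_sigmoid(x, clip=30.0):
        """Elementwise logistic function 1 / (1 + exp(-x)) with inputs clipped to
        [-clip, clip] so that np.exp cannot overflow.  For |x| > 30 the exact value
        is already within ~1e-13 of 0 or 1, so returning hard 0/1 there (as
        suggested above) would change very little numerically."""
        x = np.clip(x, -clip, clip)
        return 1.0 / (1.0 + np.exp(-x))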

sklearn/rbm.py (outdated diff):
return self.pseudo_likelihood(v_pos)
def pseudo_likelihood(self, v):

@amueller

amueller Oct 3, 2012

Member

Is this really a pseudolikelihood?
Can you give a reference where this was used?

@amueller

amueller Oct 3, 2012

Member

Hm I guess it is connected to pseudolikelihood... hm...
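
For reference, the quantity in question is a stochastic proxy for the log pseudo-likelihood: flip one randomly chosen visible unit per sample, compare free energies, and scale by the number of visible units. A self-contained sketch (the free energy is written out explicitly; parameter names are illustrative, not the PR's):

    import numpy as np

    def free_energy(v, W, b_visible, b_hidden):
        # F(v) = -b_v . v - sum_j log(1 + exp((v W + b_h)_j))
        return -(v @ b_visible) - np.sum(np.logaddexp(0, v @ W + b_hidden), axis=1)

    def pseudo_likelihood_proxy(v, W, b_visible, b_hidden, rng):
        """Flip one random bit per sample; sigmoid(F(v_flipped) - F(v)) equals
        P(v_i | v_-i) for the flipped unit i, and multiplying its log by
        n_visible extrapolates to the full log pseudo-likelihood."""
        n_samples, n_visible = v.shape
        fe = free_energy(v, W, b_visible, b_hidden)
        v_flipped = v.copy()
        idx = rng.randint(0, n_visible, n_samples)
        rows = np.arange(n_samples)
        v_flipped[rows, idx] = 1 - v_flipped[rows, idx]
        fe_flipped = free_energy(v_flipped, W, b_visible, b_hidden)
        # log sigmoid(x) = -log(1 + exp(-x)), computed stably with logaddexp
        return n_visible * (-np.logaddexp(0, fe - fe_flipped))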

@amueller

Member

amueller commented Oct 3, 2012

@glouppe Afaik classification with RBMs basically doesn't work. Is there any particular reason you want it? Usually throwing the representation into a linear SVM is way better.

I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff...
If I knew that I would have contributed one of my implementations ^^

An example would be great. Is there any application where we can demo that this works?
Maybe extracting features from digits and then classify? I guess that would train a while, though :-/

If we do include an RBM, it should definitely implement both CD1 and PCD.

sklearn/rbm.py (outdated diff):
"""
h_pos = self.mean_h(v_pos)
v_neg = self.sample_v(self.h_samples)
h_neg = self.mean_h(v_neg)

@amueller

amueller Oct 3, 2012

Member

Shouldn't h be sampled here?

@ynd

ynd Oct 3, 2012

Contributor

h is sampled for the persistent Gibbs chain. For the learning statistics, differentiating the log probability w.r.t. the parameters gives a difference of free-energy derivatives, which results in using the mean field of the h's (see http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf, page 64). When computing the learning statistics, it's mathematically valid and better to use the mean field because it adds less noise to learning.
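
A sketch of one PCD update as described here: mean-field hidden activations enter the gradient statistics, while the persistent chain is advanced with sampled states. The helper mirrors the diff context above (mean_h, sample_v, h_samples), but all names and the exact update form are assumptions for illustration, not a verbatim excerpt of the PR.

    import numpy as np

    def _logistic(x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

    def pcd_step(v_pos, W, b_visible, b_hidden, h_samples, lr, rng):
        """One persistent-CD update on a minibatch v_pos of shape (n, n_visible).
        W has shape (n_visible, n_hidden); h_samples is the persistent chain state."""
        # positive phase: mean-field hidden activations (lower-variance statistics)
        h_pos = _logistic(v_pos @ W + b_hidden)
        # negative phase: sample visibles from the persistent hidden samples,
        # then take the mean field of the hiddens for the statistics
        v_mean = _logistic(h_samples @ W.T + b_visible)
        v_neg = (rng.uniform(size=v_mean.shape) < v_mean).astype(float)
        h_neg = _logistic(v_neg @ W + b_hidden)
        # gradient ascent on the (approximate) log-likelihood
        W += lr * (v_pos.T @ h_pos / len(v_pos) - v_neg.T @ h_neg / len(v_neg))
        b_visible += lr * (v_pos.mean(axis=0) - v_neg.mean(axis=0))
        b_hidden += lr * (h_pos.mean(axis=0) - h_neg.mean(axis=0))
        # the persistent chain itself is advanced with *sampled* hidden units
        h_samples = (rng.uniform(size=h_neg.shape) < h_neg).astype(float)
        return W, b_visible, b_hidden, h_samples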

@dwf

dwf Oct 3, 2012

Member

@amueller Yoshua's book doesn't mention this AFAIK but this technique (replacing a sample with its expectation in a formula) is known as "Rao-Blackwellization", and it's trivial to demonstrate that it's still an unbiased estimator but with lower variance than the original sample-based estimate. See, for example, this tutorial: http://www.cs.ubc.ca/%7Enando/papers/ita2010.pdf

@ynd

ynd Oct 3, 2012

Contributor

@dwf cool, I had somehow overlooked that proof :).

On Wed, Oct 3, 2012 at 5:54 PM, David Warde-Farley notifications@github.com wrote:

In sklearn/rbm.py:

        v_pos: array-like, shape (n_samples, n_visibles)

        Returns

        pseudo_likelihood: array-like, shape (n_samples,), optional
            Pseudo Likelihood estimate for this batch.

        References

        [1] Tieleman, T. Training Restricted Boltzmann Machines using
            Approximations to the Likelihood Gradient. International Conference
            on Machine Learning (ICML) 2008
        """
        h_pos = self.mean_h(v_pos)
        v_neg = self.sample_v(self.h_samples)
        h_neg = self.mean_h(v_neg)

@amueller Yoshua's book doesn't mention this AFAIK but this technique (replacing a sample with its expectation in a formula) is known as "Rao-Blackwellization", and it's trivial to demonstrate that it's still an unbiased estimator but with lower variance than the original sample-based estimate. See, for example, this tutorial: http://www.cs.ubc.ca/%7Enando/papers/ita2010.pdf.



@amueller

amueller Oct 4, 2012

Member

On 10/03/2012 10:47 PM, Yann N. Dauphin wrote:

In sklearn/rbm.py:

        v_pos: array-like, shape (n_samples, n_visibles)

        Returns

        pseudo_likelihood: array-like, shape (n_samples,), optional
            Pseudo Likelihood estimate for this batch.

        References

        [1] Tieleman, T. Training Restricted Boltzmann Machines using
            Approximations to the Likelihood Gradient. International Conference
            on Machine Learning (ICML) 2008
        """
        h_pos = self.mean_h(v_pos)
        v_neg = self.sample_v(self.h_samples)
        h_neg = self.mean_h(v_neg)

h is sampled for the persistent gibbs chain. For computing the learning statistics, by deriving the log probability w.r.t. the parameters you get a difference between free energy derivatives, which results in using the mean field of the h's (see http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf page 64). When computing the learning statistics, it's mathematically valid and better to use the mean-field because it adds less noise to learning.



Thanks for the explanation. Makes sense as long as it is sampled again
for the MCMC chain as you do below.

@amueller

Member

amueller commented Oct 3, 2012

I just saw that you do implement persistent CD. I guess having only this is fine.

@weilinear

Member

weilinear commented Oct 3, 2012

Maybe we can have a pipeline connecting the RBM and LinearSVC for an example, and compare the result to using the raw features :)
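
A sketch of that pipeline, using only existing scikit-learn pieces (Pipeline, LinearSVC, GridSearchCV, cross_val_score) plus the estimator from this PR; the import path and parameter names (RBM, n_components, epsilon) follow the PR at this point and are expected to change.

    from sklearn.cross_validation import cross_val_score
    from sklearn.datasets import load_digits
    from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in later releases
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    from sklearn.rbm import RBM   # hypothetical path; this PR adds sklearn/rbm.py

    digits = load_digits()
    X = digits.data / 16.0        # scale pixel counts into [0, 1]
    y = digits.target

    pipe = Pipeline([("rbm", RBM(n_components=100)),
                     ("svc", LinearSVC())])

    # cross-validate the RBM learning rate jointly with the SVC regularization
    param_grid = {"rbm__epsilon": [1.0, 0.1, 0.01, 0.001],
                  "svc__C": [0.1, 1.0, 10.0]}
    search = GridSearchCV(pipe, param_grid)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)

    # baseline: the same linear classifier on the raw pixel features
    print(cross_val_score(LinearSVC(), X, y).mean())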

sklearn/rbm.py (outdated diff):
----------
n_components : int, optional
Number of binary hidden units
epsilon : float, optional

@amueller

amueller Oct 3, 2012

Member

maybe just call this learning_rate?

@mblondel

mblondel Oct 3, 2012

Member

+1, this would be consistent with SGDClassifier.

sklearn/rbm.py (outdated diff):
to tune this hyper-parameter. Possible values are 10**[0., -3.].
n_samples : int, optional
Number of fantasy particles to use during learning
epochs : int, optional

@amueller

amueller Oct 3, 2012

Member

Could you please use n_epochs?

@mblondel

mblondel Oct 3, 2012

Member

@amueller Or simply, n_iter :)

@dwf

dwf Oct 3, 2012

Member

n_iter is more ambiguous, I think. Even if "epoch" is jargony, it refers to a full sweep through the dataset, whereas "iter" could mean number of [stochastic] gradient updates.

@amueller

amueller Oct 7, 2012

Member

I think n_iter would be the right thing.
We use n_iter in SGDClassifier already, where it has the same meaning as here.

@glouppe

Member

glouppe commented Oct 3, 2012

@amueller Well, I was suggesting that because it nearly comes for free. It would have shown how an RBM could be used in a stand-alone way, as a pure generative model. But I agree that using an RBM to generate features and then feeding them to another classifier is probably a better idea in terms of end accuracy.

I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff...
If I knew that I would have contributed one of my implementations ^^

It happens that I came into the machine learning world through RBMs :) That's why I appreciate that model, though it's purely subjective.

@amueller

Member

amueller commented Oct 3, 2012

@glouppe I also came to machine learning via RBMs and I really dislike them and I am convinced they don't work - though it's purely subjective ;)

@amueller

Member

amueller commented Oct 3, 2012

@ynd Great job by the way! Looks very good :)

@glouppe

Member

glouppe commented Oct 3, 2012

@amueller Haha :-) Actually, my experience with RBMs is limited to Netflix and to image datasets, where I found accuracy to be quite decent. However, I confess that adjusting all the hyperparameters is a real nightmare. Maybe that's the reason why I switched to ensemble methods?

@jaquesgrobler

Member

jaquesgrobler commented Oct 3, 2012

Nice job on this. I actually worked a bit with @genji on a sklearn version of this. Given that, it would be good if he could have a look at this. Well done!

@GaelVaroquaux

Member

GaelVaroquaux commented Oct 3, 2012

@glouppe Afaik classification with RBMs basically doesn't work. Is there any
particular reason you want it? Usually throwing the representation into a
linear SVM is way better.

OK, we'll need an example showing this, hopefully with a pipeline.

@GaelVaroquaux

Member

GaelVaroquaux commented Oct 3, 2012

I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff...
If I knew that I would have contributed one of my implementations ^^

We're growing, I guess.

@amueller: if you have good implementation, you can probably give good
advice/good review on this PR. I have found that the best way to get good
code was to keep the good ideas from different codebases developed
independently.

@GaelVaroquaux

Member

GaelVaroquaux commented Oct 3, 2012

Nice job on this. I actually worked a bit with @genji on a sklearn version of
this..

You should give a pointer to the codebase.

@ogrisel

Member

ogrisel commented Oct 3, 2012

It would be great if someone could benchmark this implementation against @dwf's gist: https://gist.github.com/359323 in terms of CPU time until convergence, memory usage, and ability to generate features suitable for feeding to a linear classifier such as sklearn.svm.LinearSVC, for instance. The dataset to use for this bench could be a subset of MNIST (e.g. two digits such as 0 versus 8) to make it faster to evaluate.

@ogrisel

Member

ogrisel commented Oct 3, 2012

BTW @ynd great to see you contributing to sklearn. I really enjoyed your NIPS presentation last year. Also it would be great to submit a sklearn version of your numpy CAE that includes the sampling trick in another PR :)

@ogrisel

Member

ogrisel commented Oct 3, 2012

Nice job on this. I actually worked a bit with @genji on a sklearn version of this..

We would indeed appreciate @genji's input on reviewing this PR, and maybe pointers to common implementation pitfalls to avoid.

@amueller

Member

amueller commented Oct 3, 2012

Oh @ynd, I didn't recognize you from the picture. I think we talked quite a while at your poster at last NIPS. After that, my lab started working on contractive auto-encoders! Your work is really interesting!

sklearn/rbm.py (outdated diff):
from .utils import array2d, check_random_state
class RBM(BaseEstimator, TransformerMixin):

@mblondel

mblondel Oct 3, 2012

Member

We try to avoid acronyms, so RestrictedBoltzmannMachine sounds better.

Also, we may want to bootstrap the neural_network module.

@ogrisel

ogrisel Oct 3, 2012

Member

+1, esp. if Contractive Autoencoders are to follow.

@mblondel

mblondel Oct 3, 2012

Member

I was also thinking of the multi-layer perceptron, if @amueller feels motivated :)

@amueller

amueller Oct 3, 2012

Member

Yeah, I know, I should really look into that again... motivation level... not that high ;)

@ynd

ynd Oct 3, 2012

Contributor

A 'neural_network', or 'neural' module sounds really good.

@ogrisel Yes, Contractive Autoencoders are to follow! :)

@amueller There's a nice paper this year that makes MLPs much simpler to use, they show how to automatically tune the learning rate (http://arxiv.org/pdf/1206.1106v1). That only leaves the n_components hyper-parameter!

@amueller

amueller Oct 4, 2012

Member

A 'neural_network', or 'neural' module sounds really good.

@ogrisel Yes, Contractive Autoencoders are to follow! :)

@amueller There's a nice paper this year that makes MLPs much simpler to use, they show how to automatically tune the learning rate (http://arxiv.org/pdf/1206.1106v1). That only leaves the n_components hyper-parameter!

I saw the paper but didn't try it out yet. Have you given it a try? By the way, that looks a lot like RPROP, which we often use (but is a lot less heuristic).

Do you think that would also be interesting for the SGD module?

sklearn/rbm.py (outdated diff):
Learning rate to use during learning. It is *highly* recommended
to tune this hyper-parameter. Possible values are 10**[0., -3.].
n_samples : int, optional
Number of fantasy particles to use during learning

@mblondel

mblondel Oct 3, 2012

Member

n_particles sounds better, as n_samples is usually used for X.shape[0].

@dwf dwf closed this Oct 3, 2012

@dwf dwf reopened this Oct 3, 2012

@dwf

Member

dwf commented Oct 3, 2012

Oops, somehow I hit the close button. Sorry about that.

I'd just like to raise the obvious skeptic's take here: RBMs are relatively finicky to get to work properly for an arbitrary dataset, especially as compared to a lot of the models that sklearn implements. You don't even have a very good way of doing model comparison for non-trivial models, except approximately via Annealed Importance Sampling, which itself requires careful tuning. This was my reason for never submitting my own NumPy-based implementation: if sklearn is looking to provide canned solutions to non-experts, RBMs and deep learning in general are pretty much the antithesis of that (at least for now; the "no more pesky learning rates" paper offers hope for deterministic networks, but their technique does not apply in the least to the case where you have to stochastically estimate the gradient with MCMC, as here). On one hand, it'd be nice to get these techniques more exposure, but on the other, their inclusion in scikit-learn may be a false endorsement of how easy they are to get to work.

@amueller

Member

amueller commented Oct 3, 2012

I completely agree with @dwf. I thought that was the reason we didn't include it yet.
As many people seem to be excited about having it, I guess we could give it a shot, though ;)

@ynd

Contributor

ynd commented Oct 3, 2012

Thanks for all the feedback. Anyway, I'll answer each comment separately as I make commits.

@ogrisel @amueller Thanks!

@vene

Member

vene commented Feb 2, 2013

I'm fine with merging soon, but:

  1. @ogrisel, sadly with cross-validation it's easy to find a C for which results are better without the RBM. I dug for a long time and couldn't find a configuration to learn features that perform better than the raw ones. I blame this on the dataset. Maybe we should instead concatenate the original features with the learned ones?

  2. I'm still a bit put off though by how mini-batch size is controlled by changing n_particles, I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size?

  3. Thoughts on momentum?

@ogrisel

Member

ogrisel commented Feb 3, 2013

  1. @ogrisel, sadly with cross-validation it's easy to find a C for which results are better without the RBM. I dug for a long time and couldn't find a configuration to learn features that perform better than the raw ones. I blame this on the dataset. Maybe we should instead concatenate the original features with the learned ones?

Then we should just make it explicit in the example top level docstring that due to download size and computational constraints, this example uses a toy dataset that is either too small or too "linearly separable" for RBM features to be really interesting. In practice unsupervised feature extraction with RBMs is only useful on datasets with both many samples and internally structured on non linearly separable components / manifolds. Currently it's misleading the reader by asserting that RBM features are better than raw features on this toy data.

  1. I'm still a bit put off though by how mini-batch size is controlled by changing n_particles, I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size?

Having consistency on the batch_size parameter with other minibatch models such as MiniBatchKMeans and MiniBatchDictionaryLearning is a big +1 for me. I would rather have this changed before merging to master if this is not too complicated.

@ogrisel

Member

ogrisel commented Feb 3, 2013

BTW has anyone tried to let this implementation run long enough on the official MNIST train set to extract features that, when combined with a cross-validated LinearSVC / SVC, reach the expected predictive accuracy (I am not 100% sure, but I think it is possible to go below the 2% test error rate with this kind of architecture)?

Also if someone runs it long enough with the same parameters as in the bottom of the page in http://deeplearning.net/tutorial/rbm.html we could also qualitatively compare the outcome of the gibbs sampling with those of the deeplearning tutorial with the theano based implementation.

@ynd

Contributor

ynd commented Feb 3, 2013

  1. I'm still a bit put off though by how mini-batch size is controlled by
    changing n_particles, I find this hard to grok. Can't there be two separate
    parameters, n_particles and batch_size?

Having consistency on the batch_size parameter with other minibatch
models such as MiniBatchKMeans and MiniBatchDictionaryLearning is a big
+1 for me. I would rather have this changed before merging to master if
this is not too complicated.

Most implementations use the same number of particles as the batch size
(this is recommended by Hinton).

I will rename the parameter to batch_size though, that should clarify
things.



@ynd

Contributor

ynd commented Apr 15, 2013

@ogrisel I've added some tests:

sklearn.neural_network.rbm 84 6 93% 262, 318, 323, 326-328

All the uncovered lines are basically just conditionals on print statements for verbose=True.

@amueller

Member

amueller commented Apr 15, 2013

Thanks a lot for picking this up. Sorry we didn't work harder on merging it. It seems we are all a bit swamped right now. I'm certain it will go in the next release, though ;)

@vene

Member

vene commented Apr 15, 2013

Wow, actually yesterday I was thinking that after the conference deadline passes (which was last night) I wanted to come back to this PR!

A while back, in a local file, I hacked up a momentum implementation on top of this code; it did seem to speed up convergence a lot on MNIST. Interesting?

On Tue, Apr 16, 2013 at 5:27 AM, Andreas Mueller notifications@github.com wrote:

Thanks a lot for picking this up. Sorry we didn't work harder on merging
it. It seems we are all a bit swamped right now. I'm certain it will go in
the next release, though ;)



@amueller


amueller Apr 16, 2013

Member

@vene I would rather not add any features and merge this ASAP. It has been sitting around way too long! Do you have any idea what is still missing?


@vene


vene Apr 17, 2013

Member

IIRC I was not pleased with the example when it was moved to the digits
dataset instead of mnist, as the dataset is very small and the components
don't look that meaningful. I tried toying around with the params to no
avail.



@vene


vene Apr 17, 2013

Member

Also there is a failing common test because of the way random_state is used.


@ogrisel


ogrisel Apr 18, 2013

Member

@vene if you can come up with a quick PR against this branch and show how that impacts convergence speed and quality on MNIST, that sounds interesting.


self
"""
X = array2d(X)
self.random_state = check_random_state(self.random_state)


@ogrisel

ogrisel Apr 18, 2013

Member

Constructor parameters should not be mutated during fit and that holds for random_state as well. Please change this line to:

self.random_state_ = check_random_state(self.random_state)

or

random_state = check_random_state(self.random_state)

and then pass the random_state local variable explicitly as an argument to the subsequent private methods.
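
A minimal sketch of the suggested pattern (the estimator and helper below are purely illustrative, not this PR's code):

    from sklearn.utils import check_random_state

    class ExampleEstimator(object):
        """Illustrative estimator showing the suggested random_state handling."""

        def __init__(self, random_state=None):
            # the constructor only stores the parameter, untouched
            self.random_state = random_state

        def fit(self, X):
            # derive the RNG here; the user-supplied parameter is never mutated
            random_state = check_random_state(self.random_state)
            self.random_state_ = random_state    # fitted attribute convention
            return self._fit(X, random_state)

        def _fit(self, X, random_state):
            # private helpers receive the RNG explicitly
            self.noise_ = random_state.uniform(size=len(X))
            return self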


@amueller


amueller May 6, 2013

Member

I'm very inclined to fix the travis tests and merge. Having "not completely satisfying example filters" doesn't justify delaying this PR any longer.


@vene


vene May 6, 2013

Member

Probably a good idea, any speedups / new features can be added later.



@amueller


amueller May 6, 2013

Member

Does anyone have strong reasons not to merge this? (after fixing the random state issue)


@GaelVaroquaux


GaelVaroquaux May 6, 2013

Member

Some implicit casts (I have a version of numpy that purposely fails when
casting rules can induce problems):

======================================================================
FAIL: Doctest: sklearn.neural_network.rbm.BernoulliRBM
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for
sklearn.neural_network.rbm.BernoulliRBM
  File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py",
line 31, in BernoulliRBM
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py",
line 78, in sklearn.neural_network.rbm.BernoulliRBM
Failed example:
    model.fit(X)  # doctest: +ELLIPSIS
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "", line
1, in 
        model.fit(X)  # doctest: +ELLIPSIS
      File
"/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line
320, in fit
        pl_batch = self._fit(X[inds[minibatch::n_batches]])
      File
"/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line
252, in _fit
        v_pos *= lr
    TypeError: Cannot cast ufunc multiply output from dtype('float64') to
dtype('int64') with casting rule 'same_kind'
----------------------------------------------------------------------
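
The failure comes from an in-place multiply writing a float result back into an integer array; a small illustration of the issue and one possible kind of fix (not necessarily the fix applied on the pr_1200 branch):

    import numpy as np

    lr = 0.1                                 # float learning rate
    v_pos = np.arange(6).reshape(2, 3)       # int64 array, as when X holds integers

    # v_pos *= lr                            # raises: the float64 result cannot be
                                             # cast back into the int64 array in place

    # One possible fix: make sure the working array is floating point up front,
    # e.g. by casting the input once at the beginning of fit.
    v_pos = np.asarray(v_pos, dtype=np.float64)
    v_pos *= lr                              # fine now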
@GaelVaroquaux


GaelVaroquaux May 6, 2013

Member

Some implicit casts (I have a version of numpy that purposely fails when
casting rules can induce problems)

I have fixed that in the pr_1200 branch on my GitHub (it was preventing
me from reviewing this PR). Please pull/cherry-pick.


block Gibbs sampling by iteratively sampling each of :math:`v` and :math:`h`
given the other, until the chain mixes. Samples generated in this way are
sometimes referred to as fantasy particles. This is inefficient, and it is difficult
to determine whether the Markov chain mixes.
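
For concreteness, one block Gibbs step for a Bernoulli RBM can be sketched as below (a self-contained illustration with made-up shapes, not this PR's implementation):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b_h, b_v, rng):
        """One block Gibbs update: sample h | v, then v | h (Bernoulli units)."""
        p_h = sigmoid(v @ W + b_h)                      # P(h_j = 1 | v)
        h = (rng.uniform(size=p_h.shape) < p_h) * 1.0   # sample the hidden layer
        p_v = sigmoid(h @ W.T + b_v)                    # P(v_i = 1 | h)
        return (rng.uniform(size=p_v.shape) < p_v) * 1.0

    rng = np.random.RandomState(0)
    n_visible, n_hidden = 16, 8
    W = 0.01 * rng.randn(n_visible, n_hidden)
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)

    # Running the chain for many steps yields the "fantasy particles" above.
    v = rng.randint(2, size=(5, n_visible)).astype(float)
    for _ in range(100):
        v = gibbs_step(v, W, b_h, b_v, rng)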


@GaelVaroquaux

GaelVaroquaux May 6, 2013

Member

I think that the above paragraph can be much shortened. I don't see how it is useful to the end user.


@GaelVaroquaux


GaelVaroquaux May 6, 2013

Member

Does anyone have strong reasons not to merge this?

Plenty. I am doing a quick review, and it is not ready. The review will
come in later, but I am convinced that there is still some work to be
done here.


-------
x_new: array-like, shape (M, N)
"""
p[self.random_state.uniform(size=p.shape) < p] = 1.


@GaelVaroquaux

GaelVaroquaux May 6, 2013

Member

self.random_state cannot be an RNG object, because that would mean modifying the input parameter. You need to create a shadow self.random_state_ object, as done in the other learners.



@mblondel

mblondel May 6, 2013

Member

Or better, pass the random state as an argument to the method.



@GaelVaroquaux

GaelVaroquaux May 6, 2013

Member

Or better, pass the random state as an argument to the method.

+1. But the random_state attribute should not be modified anyhow.
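
A minimal sketch of that refactoring, applied to the in-place Bernoulli sampling quoted above (the helper name and call site are illustrative, not necessarily this PR's final code):

    import numpy as np
    from sklearn.utils import check_random_state

    def sample_bernoulli(p, rng):
        """Sample element-wise Bernoulli variables in place from probabilities p.

        Equivalent to rng.binomial(1, p), but reuses p's memory: entries are
        set to 1 with probability p and floored to 0 otherwise.
        """
        p[rng.uniform(size=p.shape) < p] = 1.0
        return np.floor(p, out=p)

    # fit() would build the RNG once from the constructor parameter and pass it
    # down explicitly, instead of storing a mutated random_state attribute:
    rng = check_random_state(0)
    probabilities = rng.uniform(size=(3, 4))
    samples = sample_bernoulli(probabilities, rng)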


@GaelVaroquaux


GaelVaroquaux May 6, 2013

Member

I have made a few minor changes on my pr_1200 branch.

A few general comments (it seems that this PR lacks a bit of polish and I
am not convinced that it is mergeable from a user standpoint):

The docs contain a lot of math, but lack intuitive messages. Most
importantly, the RBM section does not tell me why and when I should be
using RBMs.

Examples are not linked to in the docs. The docs do not have a figure.
They feel a bit more like a page from a textbook. For instance the remark
on 'Deep neural networks that are notoriously difficult to train from
scratch can be simplified by initializing each layer’s weights with the
weights of an RBM.' What is a user going to do with this remark: we don't
have deep neural networks in scikit-learn. There is too much maths and
not enough intuitions/practical advice.

The only example is not a 'plot_' example. It does not help in understanding
how the RBM works, nor does it demonstrate its benefits in the rendered
docs. It also takes ages to run. This long run time is strongly suboptimal,
as there is a grid-search object that refits the RBM many times with
unchanged parameters. To make it faster, I have hardcoded some values
obtained by a grid search. I also added some plotting.

I am also worried about the number of methods on the object. Have we
checked for redundancy in the names of the methods? I see a Gibbs
sampling method. Other objects also have sampling methods. We should make
sure that the names and signatures of these methods are compatible.

As a side note, after playing with this code, I am not terribly impressed
by deep learning as black-box learners. :P


@larsmans


larsmans May 6, 2013

Member

(Please disregard previous comment)

I agree with @GaelVaroquaux that there are far too many public methods.


@dwf


dwf May 6, 2013

Member

I haven't been following this all that closely, but the narrative docs appear to disagree with the implementation. The implementation uses stochastic maximum likelihood, which uses an unbiased estimator of the maximum likelihood gradient, not contrastive divergence (which the documentation seems to be describing). If the primary utility to scikit-learn users is that of feature extraction, I'd question whether an SML-based implementation is a good choice. CD tends to learn better features.

As a side note, after playing with this code, I am not terribly impressed by deep learning as black-box learners. :P

We'll make a believer out of you yet Gael!
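
For readers following this exchange, the difference is only where the negative-phase Gibbs chain starts; a schematic sketch with placeholder functions (not this PR's code):

    import numpy as np

    # Schematic only: `gibbs_step` stands for one full block Gibbs update and
    # `grad` for the resulting parameter update; both are placeholders here.

    def cd_update(v_data, gibbs_step, grad, k=1):
        """CD-k: the negative-phase chain is restarted at the data every update."""
        v_model = v_data.copy()
        for _ in range(k):
            v_model = gibbs_step(v_model)
        return grad(v_data, v_model)

    class PCDUpdater(object):
        """PCD / SML: a persistent chain is carried across updates."""

        def __init__(self, v_init):
            self.v_persistent = v_init.copy()

        def update(self, v_data, gibbs_step, grad, k=1):
            for _ in range(k):
                self.v_persistent = gibbs_step(self.v_persistent)
            return grad(v_data, self.v_persistent)

    # Toy demonstration with stand-in chain and gradient functions:
    rng = np.random.RandomState(0)
    v0 = (rng.rand(10, 16) > 0.5).astype(float)
    toy_gibbs = lambda v: (rng.rand(*v.shape) < 0.5).astype(float)
    toy_grad = lambda vd, vm: vd.mean(axis=0) - vm.mean(axis=0)

    print(cd_update(v0, toy_gibbs, toy_grad).shape)   # (16,)
    pcd = PCDUpdater(v0)
    print(pcd.update(v0, toy_gibbs, toy_grad).shape)  # (16,)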


@vene


vene May 7, 2013

Member

The implementation uses stochastic maximum likelihood, which uses an unbiased estimator
of the maximum likelihood gradient, not contrastive divergence (which the documentation
seems to be describing).

Isn't Persistent CD a synonym for SML? In the "SML Learning" section I described the CD algorithm first and then PCD as a variation, because I found it difficult to justify PCD by itself without contrasting it with CD. If I am factually incorrect, could you please push me in the right direction?

There is too much maths and not enough intuitions/practical advice.

I tried to push the maths as far down as possible and to lead with intuitions, but I don't know any more intuitions than what I put down. I agree the docs are lacking, but I really don't know what to write to improve them. As a black-box learner, an RBM should just be used in a simple pipeline before a classifier, hoping for the best.

The figure plotted by the example should indeed be in the docs. I was very unsatisfied with how the example looks on the digits dataset compared to how good it looks on MNIST; I tried to find good values but failed. I was wondering whether it would be better to use MNIST and keep it as a pregenerated example.
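
A minimal sketch of that "simple pipeline before a classifier" usage, assuming the estimator lands as BernoulliRBM and a reasonably recent scikit-learn (the hyperparameter values are illustrative):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0                              # scale pixel values into [0, 1]

    rbm = BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=10,
                       random_state=0)
    pipeline = Pipeline([("rbm", rbm),
                         ("logistic", LogisticRegression(max_iter=1000))])
    pipeline.fit(X, y)
    print(pipeline.score(X, y))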


@amueller


amueller May 7, 2013

Member

Ok, I'll try to work on docs and interfaces. First I want to have a look at @jnothman's grid-search work, though...


@GaelVaroquaux


GaelVaroquaux May 7, 2013

Member

Ok, I'll try to work on docs and interfaces.

The most worrying thing is the example: I haven't been convinced that it
does better than k-means at extracting mid-level features.


@amueller


amueller May 7, 2013

Member

I don't think that you can say that it is better than k-means in general.
Showing that RBMs work needs a large dataset and tuning. We can try on MNIST.

I don't think the goal of the example should be to show that it works well; rather, it should show how to use it.
The literature shows that there are cases in which it works. Reproducing them (in particular using just a CPU) is non-trivial and time-consuming.


@GaelVaroquaux


GaelVaroquaux May 7, 2013

Member

Showing that RBMs work needs a large dataset and tuning. We can try on
MNIST.

That must cost an arm and a leg!

I don't think the goal of the example should be to show that it works
well; rather, it should show how to use it.

Indeed. I think that the example on my pr_1200 branch does that, however when I see it, I cannot help thinking of k-means.

The literature shows that there are cases in which it works.

The older I get, the less I trust the literature :)

Reproducing them (in particular using just
a CPU) is non-trivial and time-consuming.

We should do a quick check to see if this implementation can be sped up: I find it slow on a simple problem.


@vene


vene May 7, 2013

Member

The older I get, the less I trust the literature :)

Totally understandable, but maybe this shouldn't be used as an argument
against merging. It seems that people are interested in trying these kinds
of models. I think it's better to provide it, and to make it obvious how easy
it is to replace it with something simpler by giving it the same API and
letting people see for themselves that for many problems, KMeans or Random
Projections extract better features than complicated and slow
architectures. If users manage to use it well, that's great! If users use
it wrongly, it's not much different than if they misuse any of our other
estimators, is it?

We should do a quick check to see if this implementation can be sped up: I
find it slow on a simple problem.

An algorithmic speedup would be momentum, in this case.

@amueller, I'd love to help with what I can (maybe API cleanup) but I might
step on your toes if you're on this, especially since the branch is @ynd's
at the moment. What could I do?


@GaelVaroquaux


GaelVaroquaux May 7, 2013

Member

The older I get, the less I trust the literature :)

Totally understandable, but maybe this shouldn't be used as an argument
against merging.

Agreed. I am not voting against a merge in the long run. I think that
there are still a few minor improvements to be done before a merge.

We should do a quick check to see if this implementation can be sped up: I
find it slow on a simple problem.

An algorithmic speedup would be momentum, in this case.

OK, that would be fantastic, but if it requires a lot of work, we should
maybe do it in a 2nd PR.


@amueller


amueller May 7, 2013

Member

@vladn feel free to go ahead, preferably in a second PR to master, I think. This is not my highest priority, but I want to get it done soonish.



@vene


vene May 9, 2013

Member

Moved to PR #1954, closing this one.
Posting here so people get pinged.
@dwf, could you help me understand the PCD-SML issue discussed above?