Added Restricted Boltzmann machines #1200

Closed
wants to merge 66 commits
@ynd
Contributor
ynd commented Oct 2, 2012

RBMs are a state-of-the-art generative model. They've been used to win the Netflix challenge [1] and in record-breaking speech recognition systems at Google [2] and Microsoft. This pull request adds a class for Restricted Boltzmann Machines (RBMs) to scikit-learn. The code is both easy to read and efficient.

[1] http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
[2] http://research.google.com/pubs/archive/38130.pdf

@mblondel

verbose should be a constructor parameter.

@mblondel
Member
mblondel commented Oct 3, 2012

That's great! How hard would it be to support scipy sparse matrices too? We're trying to avoid adding new classes that can only operate with numpy arrays...

@glouppe
Member
glouppe commented Oct 3, 2012

Great addition indeed! What would be convenient is to define an RBMClassifier class on top of your RBM. There are several ways to do that, but the most straightforward would be to encode both X and y into the visible units and then to train your RBM. For predictions, clamp the values of X on the visible units (but leave the visible units of y free), then let the machine stabilize and finally output the values of the visible units corresponding to y as the final predictions. This strategy can even be used for multi-output problems.

Also, do you intend to handle missing values? (since you motivate your PR with Netflix)
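For readers unfamiliar with the clamping idea above, a minimal sketch of how such a prediction step could look. It is purely illustrative and not part of this PR; the function name, the fixed number of Gibbs steps and the assumption that y is one-hot encoded at the end of the visible layer are all invented for the example.

import numpy as np

def _sigmoid(x):
    return 1. / (1. + np.exp(-x))

def predict_clamped(X, W, b, c, n_classes, n_gibbs=50, random_state=0):
    # Visible layer is assumed to be [X | one-hot(y)]; W, b, c follow the
    # docstring below (W: visibles x hiddens, b: hidden biases, c: visible biases).
    rng = np.random.RandomState(random_state)
    y_part = np.ones((X.shape[0], n_classes)) / n_classes  # y units start free
    for _ in range(n_gibbs):
        v = np.hstack([X, y_part])
        h = rng.binomial(1, _sigmoid(np.dot(v, W) + b))
        v_mean = _sigmoid(np.dot(h, W.T) + c)
        y_part = v_mean[:, -n_classes:]  # X stays clamped, only the y units update
    return y_part.argmax(axis=1)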

@glouppe glouppe and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ n_components : int, optional
+ Number of binary hidden units
+ epsilon : float, optional
+ Learning rate to use during learning. It is *highly* recommended
+ to tune this hyper-parameter. Possible values are 10**[0., -3.].
+ n_samples : int, optional
+ Number of fantasy particles to use during learning
+ epochs : int, optional
+ Number of epochs to perform during learning
+ random_state : RandomState or an int seed (0 by default)
+ A random number generator instance to define the state of the
+ random permutations generator.
+
+ Attributes
+ ----------
+ W : array-like, shape (n_visibles, n_components), optional
@glouppe
glouppe Oct 3, 2012 Member

In scikit-learn, we usually put a trailing underscore on all fitted attributes (self.W_).

@GaelVaroquaux
GaelVaroquaux Oct 3, 2012 Member

In scikit-learn, we usually put a trailing underscore on all fitted attributes (self.W_).

In addition, we now try to avoid single-letter variable names. Can you find a more explicit name?

@mblondel
mblondel Oct 3, 2012 Member

In PCA, a related attribute is components_.

@glouppe glouppe and 4 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ n_samples : int, optional
+ Number of fantasy particles to use during learning
+ epochs : int, optional
+ Number of epochs to perform during learning
+ random_state : RandomState or an int seed (0 by default)
+ A random number generator instance to define the state of the
+ random permutations generator.
+
+ Attributes
+ ----------
+ W : array-like, shape (n_visibles, n_components), optional
+ Weight matrix, where n_visibles is the number of visible
+ units and n_components is the number of hidden units.
+ b : array-like, shape (n_components,), optional
+ Biases of the hidden units
+ c : array-like, shape (n_visibles,), optional
@glouppe
glouppe Oct 3, 2012 Member

b and c are not meaningful variable names. Maybe bias_hidden and bias_visible would be better names? (This is a suggestion)

@agramfort
agramfort Oct 3, 2012 Member

or maybe better intercept_hidden_ and intercept_visible_

@GaelVaroquaux
GaelVaroquaux Oct 3, 2012 Member

or maybe better intercept_hidden_ and intercept_visible_

intercept_hidden and intercept_visible

@amueller
amueller Oct 3, 2012 Member

bias is the usual term in the community. Before joining sklearn, I never heard the word intercept.
Consistency with the community or within sklearn is the question, I guess ;)

@GaelVaroquaux
GaelVaroquaux Oct 3, 2012 Member

In any case, the fact that the two terminologies coexist needs to be stressed in the documentation. I wasn't aware that "bias == intercept" :)

@larsmans
larsmans Oct 3, 2012 Member

+1 for intercept_{visible,hidden}_.

@glouppe glouppe and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ verbose: bool, optional
+ When True (False by default) the method outputs the progress
+ of learning after each epoch.
+ """
+ X = array2d(X)
+
+ self.W = np.asarray(np.random.normal(0, 0.01,
+ (X.shape[1], self.n_components)), dtype=X.dtype)
+ self.b = np.zeros(self.n_components, dtype=X.dtype)
+ self.c = np.zeros(X.shape[1], dtype=X.dtype)
+ self.h_samples = np.zeros((self.n_samples, self.n_components),
+ dtype=X.dtype)
+
+ inds = range(X.shape[0])
+
+ np.random.shuffle(inds)
@glouppe
glouppe Oct 3, 2012 Member

self.random_state should be used instead.

@mblondel
mblondel Oct 3, 2012 Member

It seems to me that other parts of the scikit don't record the random state in an attribute (they pass it around instead).

@mblondel
mblondel Oct 4, 2012 Member

You could do del self.random_state at the end of fit then.

@ynd
ynd Oct 4, 2012 Contributor

What if fit is called twice? Wouldn't having deleted self.random_state cause problems?

@glouppe glouppe and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ v: array-like, shape (n_samples, n_visibles)
+
+ Returns
+ -------
+ pseudo_likelihood: array-like, shape (n_samples,)
+ """
+ fe = self.free_energy(v)
+
+ v_ = v.copy()
+ i_ = self.random_state.randint(0, v.shape[1], v.shape[0])
+ v_[range(v.shape[0]), i_] = v_[range(v.shape[0]), i_] == 0
+ fe_ = self.free_energy(v_)
+
+ return v.shape[1] * np.log(self._sigmoid(fe_ - fe))
+
+ def fit(self, X, y=None, verbose=False):
@glouppe
glouppe Oct 3, 2012 Member

As Mathieu said, verbose should be a constructor parameter.

@ynd
ynd Oct 3, 2012 Contributor

done.


@glouppe glouppe and 4 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ verbose: bool, optional
+ When True (False by default) the method outputs the progress
+ of learning after each epoch.
+ """
+ self.fit(X, y, verbose)
+
+ return self.transform(X)
+
+
+def main():
+ pass
+
+
+if __name__ == '__main__':
+ main()
+
@glouppe
glouppe Oct 3, 2012 Member

Remove the lines above. Instead, could you provide an example in a separate file?

@amueller
amueller Oct 3, 2012 Member

+1
Also: how are we going to test this beast?

@GaelVaroquaux
GaelVaroquaux Oct 3, 2012 Member

Also: how are we going to test this beast?

Check that the energy decreases during the training?

@amueller
amueller Oct 3, 2012 Member

This is not really what you want. You want the probability to increase and you can't compute the partition function usually.
We could do an example with ~5 hidden units and test that. That would actually be a good test.

But: (P)CD learning doesn't guarantee increasing the probability of the data set and may diverge.

So I guess plotting the true model probability of the data would be a cool example and we could also test against that.
We only need a function to calculate the partition function ....
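For what it's worth, the exact partition function is tractable when the number of hidden units is tiny. A sketch (not part of this PR, using the W/b/c naming from the docstring above, with the visible units summed out analytically):

import itertools
import numpy as np

def exact_log_partition(W, b, c):
    # Z = sum_h exp(b.h) * prod_i (1 + exp(c_i + (W h)_i)); enumerating the
    # 2**n_hidden hidden configurations is only feasible for ~5 hidden units.
    log_terms = []
    for h in itertools.product([0., 1.], repeat=W.shape[1]):
        h = np.asarray(h)
        log_terms.append(np.dot(b, h) + np.logaddexp(0, c + np.dot(W, h)).sum())
    log_terms = np.array(log_terms)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())  # log-sum-exp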

@dwf
dwf Oct 3, 2012 Member

Well, PCD does, at least asymptotically, as long as you use an
appropriately decaying learning rate. But asymptotic behaviour is kind of
hard to test.

@amueller
amueller Oct 3, 2012 Member

Really? Pretty sure that's not the case, as you cannot guarantee mixing of the chain.
Maybe I have to look into the paper again.

@amueller
amueller Oct 3, 2012 Member

Doesn't say anything like that in the paper as far as I can tell and I think I observed divergence in practice.
Could be that it is possible to converge if you decay the learning rate fast enough but that doesn't mean that you converge to a point that is better than the point you started with, only that you stop somewhere.

@dwf
dwf Oct 3, 2012 Member

The "PCD" paper kind of undersells it, what you want to look at is the Laurent Younes paper "On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates", which he cites (the contribution of the PCD paper was basically making this known to the ML community and demonstrating it on RBMs). There you get guarantees of almost sure convergence to a local maximum of the empirical likelihood, provided your initial learning rate is small enough. If you see things start to diverge or oscillate then your learning rate is probably too high.

@ynd
ynd Oct 3, 2012 Contributor

Done. I do plan to make a separate example. As @amueller and @ogrisel mentioned, I think it would be best to show its use as a feature extractor for a LinearSVC, for example on digits, and to show that it improves the accuracy on the test set very significantly. It would also show how to use GridSearchCV to find the optimal learning rate.
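Roughly, the example being described could look like the sketch below. It is only an outline: the class name RBM, the epsilon parameter and the binarization threshold are as they stand in this PR right now and will likely change.

from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.rbm import RBM  # module location is still being discussed
from sklearn.svm import LinearSVC

digits = datasets.load_digits()
X = (digits.data > 8).astype(float)  # crude binarization for the binary units
y = digits.target

pipeline = Pipeline([('rbm', RBM(n_components=100)), ('svm', LinearSVC())])
grid = GridSearchCV(pipeline, {'rbm__epsilon': [1., 0.1, 0.01, 0.001]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)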

@pprett
Member
pprett commented Oct 3, 2012

Thanks for the contribution - I'm super excited - can't wait to see this merged to master!

@glouppe glouppe and 5 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ self.epochs = epochs
+ self.random_state = check_random_state(random_state)
+
+ def _sigmoid(self, x):
+ """
+ Implements the logistic function.
+
+ Parameters
+ ----------
+ x: array-like, shape (M, N)
+
+ Returns
+ -------
+ x_new: array-like, shape (M, N)
+ """
+ return 1. / (1. + np.exp(-np.maximum(np.minimum(x, 30), -30)))
@glouppe
glouppe Oct 3, 2012 Member

I don't know if this changes anything, but wouldn't it be better to output 0. if x < -30 and 1. if x > 30? Also, since this very particular function is at the core of the algorithm, a Cython version may be a better choice.

@mblondel
mblondel Oct 3, 2012 Member

This method doesn't depend on the object state so it can be changed to a function.

@larsmans
larsmans Oct 3, 2012 Member

Shouldn't we have logistic sigmoid in sklearn.utils.extmath?

@amueller
amueller Oct 4, 2012 Member

maybe call it logistic_sigmoid as sigmoid just refers to the shape....

@dwf
dwf Oct 4, 2012 Member

+1 @amueller. Drives me nuts when people around here simply call this "the sigmoid". :) The logistic function is its proper name.

@ynd
ynd Oct 4, 2012 Contributor

done
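A minimal sketch of what such a shared helper could look like (the name and its eventual home in sklearn.utils.extmath are assumptions, not decisions made in this thread):

import numpy as np

def logistic_sigmoid(x, clip=30.):
    """Element-wise logistic function 1 / (1 + exp(-x)).

    The input is clipped to [-clip, clip] so np.exp cannot overflow; outside
    that range the result is numerically 0 or 1 anyway.
    """
    return 1. / (1. + np.exp(-np.clip(x, -clip, clip)))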

@amueller amueller commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ on Machine Learning (ICML) 2008
+ """
+ h_pos = self.mean_h(v_pos)
+ v_neg = self.sample_v(self.h_samples)
+ h_neg = self.mean_h(v_neg)
+
+ self.W += self.epsilon * (np.dot(v_pos.T, h_pos)
+ - np.dot(v_neg.T, h_neg)) / self.n_samples
+ self.b += self.epsilon * (h_pos.mean(0) - h_neg.mean(0))
+ self.c += self.epsilon * (v_pos.mean(0) - v_neg.mean(0))
+
+ self.h_samples = self.random_state.binomial(1, h_neg)
+
+ return self.pseudo_likelihood(v_pos)
+
+ def pseudo_likelihood(self, v):
@amueller
amueller Oct 3, 2012 Member

Is this really a pseudolikelihood?
Can you give a reference where this was used?

@amueller
amueller Oct 3, 2012 Member

Hm I guess it is connected to pseudolikelihood... hm...
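For reference, a stand-alone rendering of what the quoted method computes (illustrative only; free_energy is passed in rather than read from the estimator): one visible unit per sample is flipped, log P(v_i | v_-i) = log sigmoid(FE(v_flipped) - FE(v)), and scaling by n_features makes the expectation over the random choice of i equal to the full log pseudo-likelihood sum.

import numpy as np

def log_pseudo_likelihood_proxy(v, free_energy, rng):
    n_samples, n_features = v.shape
    fe = free_energy(v)
    v_flip = v.copy()
    rows = np.arange(n_samples)
    cols = rng.randint(0, n_features, n_samples)
    v_flip[rows, cols] = 1 - v_flip[rows, cols]
    fe_flip = free_energy(v_flip)
    # log sigmoid(fe_flip - fe) = log P(v_i | v_-i) for the flipped unit i;
    # multiplying by n_features gives a stochastic estimate of the sum over i.
    return -n_features * np.log1p(np.exp(-(fe_flip - fe)))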

@amueller
Member
amueller commented Oct 3, 2012

@glouppe Afaik classification with RBMs basically doesn't work. Is there any particular reason you want it? Usually throwing the representation into a linear SVM is way better.

I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff...
If I knew that I would have contributed one of my implementations ^^

An example would be great. Is there any application where we can demo that this works?
Maybe extracting features from digits and then classify? I guess that would train a while, though :-/

If we do include an RBM, it should definitely implement both CD1 and PCD.

@amueller amueller and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ v_pos: array-like, shape (n_samples, n_visibles)
+
+ Returns
+ -------
+ pseudo_likelihood: array-like, shape (n_samples,), optional
+ Pseudo Likelihood estimate for this batch.
+
+ References
+ ----------
+ [1] Tieleman, T. Training Restricted Boltzmann Machines using
+ Approximations to the Likelihood Gradient. International Conference
+ on Machine Learning (ICML) 2008
+ """
+ h_pos = self.mean_h(v_pos)
+ v_neg = self.sample_v(self.h_samples)
+ h_neg = self.mean_h(v_neg)
@amueller
amueller Oct 3, 2012 Member

Shouldn't h be sampled here?

@ynd
ynd Oct 3, 2012 Contributor

h is sampled for the persistent Gibbs chain. For the learning statistics, differentiating the log probability w.r.t. the parameters gives a difference of free-energy derivatives, which results in using the mean field of the h's (see http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf page 64). It's mathematically valid, and better, to use the mean field for those statistics because it adds less noise to learning.

@dwf
dwf Oct 3, 2012 Member

@amueller Yoshua's book doesn't mention this AFAIK, but this technique (replacing a sample with its expectation in a formula) is known as "Rao-Blackwellization", and it's trivial to demonstrate that it's still an unbiased estimator but with lower variance than the original sample-based estimate. See, for example, this tutorial: http://www.cs.ubc.ca/~nando/papers/ita2010.pdf
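A tiny numerical illustration of that point (all numbers below are made up): the Rao-Blackwellized statistic uses P(h=1|v) directly, and the sample-based version averages to the same thing but carries extra variance.

import numpy as np

rng = np.random.RandomState(0)
v = rng.binomial(1, 0.5, size=(10, 6)).astype(float)
p = rng.uniform(0.1, 0.9, size=(10, 4))  # stand-in for P(h=1|v)

mean_field = np.dot(v.T, p)  # Rao-Blackwellized positive-phase statistic
samples = np.array([np.dot(v.T, rng.binomial(1, p)) for _ in range(5000)])
print(np.abs(samples.mean(axis=0) - mean_field).max())  # close to 0: same expectation
print(samples.var(axis=0).mean())  # strictly positive: the extra sampling noise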

@ynd
ynd Oct 3, 2012 Contributor

@dwf cool, I had somehow overlooked that proof :).


@amueller
amueller Oct 4, 2012 Member


Thanks for the explanation. Makes sense as long as it is sampled again
for the MCMC chain as you do below.

@amueller
Member
amueller commented Oct 3, 2012

I just saw that you do implement persistent CD. I guess having only this is fine.

@weilinear
Member

Maybe we could have a pipeline connecting the RBM and LinearSVC as an example and compare the result to using the raw features :)

@amueller amueller and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+class RBM(BaseEstimator, TransformerMixin):
+ """
+ Restricted Boltzmann Machine (RBM)
+
+ A Restricted Boltzmann Machine with binary visible units and
+ binary hiddens. Parameters are estimated using Stochastic Maximum
+ Likelihood (SML).
+
+ The time complexity of this implementation is ``O(n ** 2)`` assuming
+ n ~ n_samples ~ n_features.
+
+ Parameters
+ ----------
+ n_components : int, optional
+ Number of binary hidden units
+ epsilon : float, optional
@amueller
amueller Oct 3, 2012 Member

maybe just call this learning_rate?

@mblondel
mblondel Oct 3, 2012 Member

+1, this would be consistent with SGDClassifier.

@amueller amueller and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ binary hiddens. Parameters are estimated using Stochastic Maximum
+ Likelihood (SML).
+
+ The time complexity of this implementation is ``O(n ** 2)`` assuming
+ n ~ n_samples ~ n_features.
+
+ Parameters
+ ----------
+ n_components : int, optional
+ Number of binary hidden units
+ epsilon : float, optional
+ Learning rate to use during learning. It is *highly* recommended
+ to tune this hyper-parameter. Possible values are 10**[0., -3.].
+ n_samples : int, optional
+ Number of fantasy particles to use during learning
+ epochs : int, optional
@amueller
amueller Oct 3, 2012 Member

Could you please use n_epochs?

@mblondel
mblondel Oct 3, 2012 Member

@amueller Or simply, n_iter :)

@dwf
dwf Oct 3, 2012 Member

n_iter is more ambiguous, I think. Even if "epoch" is jargony, it refers to a full sweep through the dataset, whereas "iter" could mean number of [stochastic] gradient updates.

@amueller
amueller Oct 7, 2012 Member

I think n_iter would be the right thing.
We use n_iter in SGDClassifier already, where it has the same meaning as here.

@glouppe
Member
glouppe commented Oct 3, 2012

@amueller Well, I was suggesting that because it comes nearly for free. It would have shown how an RBM could be used in a stand-alone way, as a pure generative model. But I agree that using an RBM to generate features and then feeding them to another classifier is probably a better idea in terms of end accuracy.

I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff...
If I knew that I would have contributed one of my implementations ^^

It happens that I came into the machine learning world through RBMs :) That's why I appreciate that model, though it's purely subjective.

@amueller
Member
amueller commented Oct 3, 2012

@glouppe I also came to machine learning via RBMs and I really dislike them and I am convinced they don't work - though it's purely subjective ;)

@amueller
Member
amueller commented Oct 3, 2012

@ynd Great job by the way! Looks very good :)

@glouppe
Member
glouppe commented Oct 3, 2012

@amueller Haha :-) Actually, my experience with RBMs is limited to Netflix and to image datasets, where I found accuracy to be quite decent. However I confess that adjusting all the hyperparameters is a real nightmare. Maybe that's the reason why I switched to ensemble methods?

@jaquesgrobler
Member

Nice job on this. I actually worked a bit with @genji on a sklearn version of this, so it could be good if he could have a look at this one. Well done!

@GaelVaroquaux
Member

@glouppe Afaik classification with RBMs basically doesn't work. Is there any
particular reason you want it? Usually throwing the representation into a
linear SVM is way better.

OK, we'll need an example showing this, hopefully with a pipeline.

@GaelVaroquaux
Member

I'm a bit surprised that so many people want an RBM. I thought we kind of didn't
want "deep learning" stuff...
If I knew that I would have contributed one of my implementations ^^

We're growing, I guess.

@amueller: if you have a good implementation, you can probably give good advice/a good review on this PR. I have found that the best way to get good code is to keep the good ideas from different codebases developed independently.

@GaelVaroquaux
Member

Nice job on this. I actually worked a bit with @genji on a sklearn version of
this..

You should give a pointer to the codebase.

@ogrisel
Member
ogrisel commented Oct 3, 2012

It would be great if someone could benchmark this implementation against @dwf's gist: https://gist.github.com/359323 both in terms of CPU time till convergence, memory usage and ability to generate features suitable for feeding to a linear classifier such as sklearn.linear_model.LinearSVC for instance. The dataset to use for this bench could be a subset of mnist (e.g. 2 digits such as 0 versus 8) to make it faster to evaluate.

@ogrisel
Member
ogrisel commented Oct 3, 2012

BTW @ynd great to see you contributing to sklearn. I really enjoyed your NIPS presentation last year. Also it would be great to submit a sklearn version of your numpy CAE that includes the sampling trick in another PR :)

@ogrisel
Member
ogrisel commented Oct 3, 2012

Nice job on this. I actually worked a bit with @genji on a sklearn version of this..

We would indeed appreciate @genji inputs on reviewing this PR and maybe point to common implementation pitfalls to avoid.

@amueller
Member
amueller commented Oct 3, 2012

oh @ynd I didn't recognize you from your picture. I think we talked quite a while at your poster last NIPS. After that, my lab started working on contrastive auto-encoders! Your work is really interesting!

@mblondel mblondel and 3 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
@@ -0,0 +1,307 @@
+""" Restricted Boltzmann Machine
+"""
+
+# Author: Yann N. Dauphin <dauphiya@iro.umontreal.ca>
+# License: BSD Style.
+
+import numpy as np
+
+from .base import BaseEstimator, TransformerMixin
+from .utils import array2d, check_random_state
+
+
+class RBM(BaseEstimator, TransformerMixin):
@mblondel
mblondel Oct 3, 2012 Member

We try to avoid acronyms so RestrictedBoltzmannMachine sounds better.

Also, we may want to bootstrap the neural_network module.

@ogrisel
ogrisel Oct 3, 2012 Member

+1, esp. if Contractive Autoencoders are to follow.

@mblondel
mblondel Oct 3, 2012 Member

I was also thinking of the multi-layer perceptron, if @amueller feels motivated :)

@amueller
amueller Oct 3, 2012 Member

Yeah, I know, I should really look into that again... motivation level... not that high ;)

@ynd
ynd Oct 3, 2012 Contributor

A 'neural_network', or 'neural' module sounds really good.

@ogrisel Yes, Contractive Autoencoders are to follow! :)

@amueller There's a nice paper this year that makes MLPs much simpler to use, they show how to automatically tune the learning rate (http://arxiv.org/pdf/1206.1106v1). That only leaves the n_components hyper-parameter!

@amueller
amueller Oct 4, 2012 Member

I saw the paper but didn't try it out yet. Have you given it a try?
By the way, that looks a lot like RPROP, which we often use (but is a lot less heuristic).

Do you think that would also be interesting for the SGD module?

@mblondel mblondel commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ A Restricted Boltzmann Machine with binary visible units and
+ binary hiddens. Parameters are estimated using Stochastic Maximum
+ Likelihood (SML).
+
+ The time complexity of this implementation is ``O(n ** 2)`` assuming
+ n ~ n_samples ~ n_features.
+
+ Parameters
+ ----------
+ n_components : int, optional
+ Number of binary hidden units
+ epsilon : float, optional
+ Learning rate to use during learning. It is *highly* recommended
+ to tune this hyper-parameter. Possible values are 10**[0., -3.].
+ n_samples : int, optional
+ Number of fantasy particles to use during learning
@mblondel
mblondel Oct 3, 2012 Member

n_particles sounds better, as n_samples is usually used for X.shape[0].

@dwf dwf closed this Oct 3, 2012
@dwf dwf reopened this Oct 3, 2012
@dwf
Member
dwf commented Oct 3, 2012

Oops, somehow I hit the close button. Sorry about that.

I'd just like to raise the obvious skeptic's take here: RBMs are relatively finicky to get to work properly for an arbitrary dataset, especially as compared to a lot of the models that sklearn implements. You don't even have a very good way of doing model comparison for non-trivial models, except approximately via Annealed Importance Sampling, which itself requires careful tuning. This was my reason for never submitting my own NumPy-based implementation: if sklearn is looking to provide canned solutions to non-experts, RBMs and deep learning in general are pretty much the antithesis of that (at least for now; the "no more pesky learning rates" paper offers hope for deterministic networks, but their technique does not apply in the least to the case where you have to stochastically estimate the gradient with MCMC, as here). On one hand, it'd be nice to get these techniques more exposure, but on the other, their inclusion in scikit-learn may be a false endorsement of how easy they are to get to work.

@amueller
Member
amueller commented Oct 3, 2012

I completely agree with @dwf. I thought that was the reason we didn't include it yet.
As many people seem to be excited about having it, I guess we could give it a shot, though ;)

@ynd
Contributor
ynd commented Oct 3, 2012

Thanks for all the feedback. Anyway, I'll answer each comment separately as I make commits.

@ogrisel @amueller Thanks!

Yann N. Dauphin added some commits Oct 3, 2012
@ynd
Contributor
ynd commented Oct 3, 2012

@dwf In my experience, many people are interested in these techniques but don't want to make the initial investment of implementing them. It's true that using RBMs is not as straightforward as using PCA, but I think including them in scikit-learn will allow a lot of people to develop the knowledge necessary to use them. This is good because a lot of the difficulty of using RBMs comes from the fact that only a few people in the community know how to use them. More users means more documentation, tutorials, papers, etc.


Yann N. Dauphin added some commits Oct 3, 2012
@larsmans larsmans and 4 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+
+ Attributes
+ ----------
+ W : array-like, shape (n_visibles, n_components), optional
+ Weight matrix, where n_visibles is the number of visible
+ units and n_components is the number of hidden units.
+ b : array-like, shape (n_components,), optional
+ Biases of the hidden units
+ c : array-like, shape (n_visibles,), optional
+ Biases of the visible units
+
+ Examples
+ --------
+
+ >>> import numpy as np
+ >>> from sklearn.rbm import RBM
@larsmans
larsmans Oct 3, 2012 Member

The top-level module is getting cluttered. Shouldn't this be in sklearn.deep, sklearn.neural, ...?

@amueller
amueller Oct 4, 2012 Member

+1
sklearn.neuralnet?

@GaelVaroquaux
GaelVaroquaux Oct 4, 2012 Member

sklearn.neuralnet?

neural_net dude! :}

@amueller
amueller Oct 4, 2012 Member

lol sorry ^^

@larsmans
larsmans Oct 4, 2012 Member

Why _net?

@vene
vene Oct 4, 2012 Member

shouldn't this be s/RBM/RestrictedBoltzmannMachine here and 2 lines below?

as for the module, neural_network is still shorter than cross_validation, why not?

@ynd
ynd Oct 4, 2012 Contributor

Thanks.


@larsmans larsmans and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ ----------
+ x: array-like, shape (M, N)
+
+ Returns
+ -------
+ x_new: array-like, shape (M, N)
+ """
+ return 1. / (1. + np.exp(-np.maximum(np.minimum(x, 30), -30)))
+
+ def transform(self, v):
+ """
+ Computes the probabilities P({\bf h}_j=1|{\bf v}).
+
+ Parameters
+ ----------
+ v: array-like, shape (n_samples, n_visibles)
@larsmans
larsmans Oct 3, 2012 Member

n_visibles is n_samples, right? If so, then this should just be called X, not v.

@ynd
ynd Oct 4, 2012 Contributor
Here n_visibles is actually n_features (it has been renamed in a more recent commit). v is indeed the input samples, but I use the notation that is common when dealing with RBMs to make things clearer.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/pull/1200/files#r1755791.

@larsmans larsmans commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+
+ def transform(self, v):
+ """
+ Computes the probabilities P({\bf h}_j=1|{\bf v}).
+
+ Parameters
+ ----------
+ v: array-like, shape (n_samples, n_visibles)
+
+ Returns
+ -------
+ h: array-like, shape (n_samples, n_components)
+ """
+ return self.mean_h(v)
+
+ def mean_h(self, v):
@larsmans
larsmans Oct 3, 2012 Member

mean_hidden

@larsmans larsmans and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
+ """
+ return self.mean_h(v)
+
+ def mean_h(self, v):
+ """
+ Computes the probabilities P({\bf h}_j=1|{\bf v}).
+
+ Parameters
+ ----------
+ v: array-like, shape (n_samples, n_visibles)
+
+ Returns
+ -------
+ h: array-like, shape (n_samples, n_components)
+ """
+ return self._sigmoid(np.dot(v, self.W) + self.b)
@larsmans
larsmans Oct 3, 2012 Member

There's no input validation here. Also, with safe_sparse_dot, sparse matrix input can be supported trivially.

@ynd
ynd Oct 3, 2012 Contributor

Thanks.


@mblondel

I just noticed that in PCA the shape is (n_components, n_features) so could you use the transpose of the above in your code?

Also when you initialize components_, try order="fortran" and order="c" to see which one is the fastest.

@mblondel
Member
mblondel commented Oct 4, 2012

You don't even have a very good way of doing model comparison for non-trivial models

I'm not familiar with RBM but can't you use the accuracy of the last step of your pipeline? (e.g., a LinearSVC)

@amueller
Member
amueller commented Oct 4, 2012

@mblondel Using a classifier is a very indirect method of evaluating RBMs. It has nothing to do with the objective function :-/

@amueller amueller commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
+ v_pos: array-like, shape (n_samples, n_features)
+
+ Returns
+ -------
+ pseudo_likelihood: array-like, shape (n_samples,), optional
+ Pseudo Likelihood estimate for this batch.
+
+ References
+ ----------
+ [1] Tieleman, T. Training Restricted Boltzmann Machines using
+ Approximations to the Likelihood Gradient. International Conference
+ on Machine Learning (ICML) 2008
+ """
+ h_pos = self.mean_h(v_pos)
+ v_neg = self.sample_v(self.h_samples_)
+ h_neg = self.mean_h(v_neg)
@amueller
amueller Oct 4, 2012 Member

I think the comment got lost during the last commits (damn you github): shouldn't you sample here?

@amueller
amueller Oct 4, 2012 Member

whoops just saw your reply, which also got somehow lost... meh!

@mblondel
Member
mblondel commented Oct 4, 2012

@amueller If your goal at hand is classification, that seems like a very valid way to choose hyperparameters to me. Unless you extract features for the sake of it? :) Anyway, all unsupervised algorithms (latent Dirichlet allocation, sparse coding...) have similar issues...

@larsmans larsmans and 1 other commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
+ x: array-like, shape (M, N)
+
+ Notes
+ -----
+ This is equivalent to calling numpy.random.binomial(1, p) but is
+ faster because it uses in-place operations on p.
+
+ Returns
+ -------
+ x_new: array-like, shape (M, N)
+ """
+ p[self.random_state.uniform(size=p.shape) < p] = 1.
+
+ return np.floor(p, p)
+
+ def transform(self, v):
@larsmans
larsmans Oct 4, 2012 Member

I still think this should be called X. We follow our own naming conventions rather than those of [fill in subfield of ML here], since otherwise practically every module would have to follow different conventions.

@mblondel
mblondel Oct 4, 2012 Member

Agreed with @larsmans.
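A quick sanity check of the in-place trick described in the quoted Notes (written here as a free function for illustration; the PR has it as a method):

import numpy as np

def sample_bernoulli_inplace(p, rng):
    # entries where uniform < p are set to 1.; the rest keep their value p < 1
    p[rng.uniform(size=p.shape) < p] = 1.
    # np.floor with an out argument then maps those remaining values to 0. in place
    return np.floor(p, p)

rng = np.random.RandomState(0)
p = np.ones((100000, 1)) * 0.3
print(sample_bernoulli_inplace(p, rng).mean())  # ~0.3, matching binomial(1, 0.3)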

@amueller
Member
amueller commented Oct 4, 2012

@mblondel it makes it very hard to judge learning algorithms for the method, though. For sparse coding, there is an obvious way to tell how good your optimization was!
For RBMs it is not possible to evaluate your objective function and you basically can't tell whether you made the model any better in terms of what you claim to be doing.

If you only evaluate using classification, you might end up with an algorithm that has nothing to do with the original objective. That is fine with me, but then you shouldn't claim that it has something to do with graphical models any more.

@mblondel
Member
mblondel commented Oct 4, 2012

@amueller: I agree with your point if your goal is evaluating an optimization algorithm. I was talking about hyperparameter tuning :)

@mblondel
Member
mblondel commented Oct 4, 2012

Argh, learning_rate is actually a string in SGDClassifier... Maybe the closest name is eta0... @pprett

@ogrisel
Member
ogrisel commented Oct 4, 2012

eta0 is not a very descriptive name. We should not make it a convention. initial_learning_rate would be better IMHO.

@mblondel
Member
mblondel commented Oct 4, 2012

If I understand the code correctly, here, the learning rate is constant...

@amueller
Member
amueller commented Oct 4, 2012

@mblondel yes, maybe that is not set in stone, though (wdyt @ynd ?)
@ogrisel agree. -1 on eta0.

@pprett
Member
pprett commented Oct 4, 2012

@mblondel The blame is on me - SGDClassifier's learning_rate is actually the learning rate schedule; eta0 is the initial learning rate. Even worse, GradientBoostingClassifier uses learn_rate for the learning rate (this time it's really the learning rate, not the schedule)... it seems I'm the nemesis of the consistency brigade :-/

Is there any other estimator that has a learning rate parameter?

@larsmans
Member
larsmans commented Oct 4, 2012

Well, Perceptron, but that's obviously SGD in disguise.

As a proposed fix for SGD, how about...

`learning_rate="constant"`, `eta0=.1` ⇨ `learning_rate=("constant", .1)`
`learning_rate="constant"`, `eta0=None` ⇨ `learning_rate="constant"`
`learning_rate=None`, `eta0=.2` ⇨ `learning_rate=.2`

That way, we can keep the learning_rate parameter and we only have to deprecate eta0, and we don't have to introduce parameters that require a lot of typing. (And to counter the "flat is better than nested" that is bound to come up, I'd like to add once again that I'm Dutch ;)
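A hypothetical sketch of how the proposed forms could be normalized inside an estimator, just to make the proposal concrete (nothing like this exists in the codebase and the default value below is invented):

def _parse_learning_rate(learning_rate, default_eta0=0.01):
    if isinstance(learning_rate, tuple):      # e.g. ("constant", .1)
        schedule, eta0 = learning_rate
    elif isinstance(learning_rate, str):      # e.g. "constant", with a default eta0
        schedule, eta0 = learning_rate, default_eta0
    else:                                     # bare float, e.g. .2 -> constant rate
        schedule, eta0 = "constant", float(learning_rate)
    return schedule, eta0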

@amueller
Member
amueller commented Oct 4, 2012

@larsmans basically I am +1 on renaming, though I am not entirely certain about how this will actually work. Also, we should keep grid-searches in mind. I usually use a decaying learning rate and grid-search over eta0.
Before: `param_dict = dict(eta0=2. ** np.arange(-5, -1))`
Proposed: `param_dict = dict(learning_rate=[("optimal", x) for x in 2. ** np.arange(-5, -1)])`
Not horrible but still awkward.
We should keep in mind that users might not be as well-versed in Python as we are ;)

@vene
Member
vene commented Oct 4, 2012

+1 for designing APIs in a way that makes grid search code cleaner where the parameter is something you're more likely than not to optimize by grid search.

@vene vene and 1 other commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
+ A Restricted Boltzmann Machine with binary visible units and
+ binary hiddens. Parameters are estimated using Stochastic Maximum
+ Likelihood (SML).
+
+ The time complexity of this implementation is ``O(d ** 2)`` assuming
+ d ~ n_features ~ n_components.
+
+ Parameters
+ ----------
+ n_components : int, optional
+ Number of binary hidden units
+ learning_rate : float, optional
+ Learning rate to use during learning. It is *highly* recommended
+ to tune this hyper-parameter. Possible values are 10**[0., -3.].
+ n_particles : int, optional
+ Number of fantasy particles to use during learning.
@vene
vene Oct 4, 2012 Member

Could this be more specific? Despite my knowing vaguely what an RBM is and what it would be useful for, I've never encountered this term so I wouldn't know where to start setting it. Unless I go read the reference first.

@amueller
amueller Oct 4, 2012 Member

I think we could say that fantasy particles are the particles of the MCMC chain.
Apart from that, I think reading the docs would be the way to go ;)

@vene vene commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
+ Biases of the visible units
+
+ Examples
+ --------
+
+ >>> import numpy as np
+ >>> from sklearn.rbm import RBM
+ >>> X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
+ >>> model = RBM(n_components=2)
+ >>> model.fit(X)
+
+ References
+ ----------
+
+ [1] Hinton, G. E., Osindero, S. and Teh, Y. A fast learning algorithm for
+ deep belief nets. Neural Computation 18, pp 1527-1554.
@vene
vene Oct 4, 2012 Member

Since the URI is so clean I would add it here. WDYT?

http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf

@vene vene and 2 others commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
+
+ np.random.shuffle(inds)
+
+ n_batches = int(np.ceil(len(inds) / float(self.n_particles)))
+
+ for epoch in range(self.n_epochs):
+ pl = 0.
+ for minibatch in range(n_batches):
+ pl += self._fit(X[inds[minibatch::n_batches]]).sum()
+ pl /= X.shape[0]
+
+ if self.verbose:
+ print "Epoch %d, Pseudo-Likelihood = %.2f" % (epoch, pl)
+
+ def fit_transform(self, X, y=None):
+ """
@vene
vene Oct 4, 2012 Member

If there is no trick that allows you to get the transformed value while fitting for free, there is no use in defining this implementation. The naive fit_transform as sequence of self.fit and self.transform is already defined in TransformerMixin.

@larsmans
larsmans Oct 4, 2012 Member

fit_transform would help if there were input validation, though, as it might prevent a copy.

@ynd
ynd Oct 4, 2012 Contributor

@vene Somehow if I don't define it, I get a failed test in sklearn.tests.test_common.test_transformers when I run make. Basically the default one outputs the wrong shape.



@larsmans
Member
larsmans commented Oct 4, 2012

@amueller Good point, never mind.

@jaberg
Contributor
jaberg commented Oct 4, 2012

@ynd nice work!

@amueller I wanted to respond to your statement "it makes it very hard to judge learning algorithms for the method, though. For sparse coding, there is an obvious way to tell how good your optimization was!"

You're right that you can actually see which model gets a lower sparse-coding score, but you should also consider that when using sparse coding for feature extraction, there is no natural way to pick the sparsity penalty in the first place... so ultimately you are in the same "guess some hyper-parameters, optimize for a while, test the representation" process as with RBMs. AFAIK there is no approach to unsupervised feature learning that comes with a measure of how effective subsequent supervised learning will be. I would even conjecture that the same old no-free-lunch theorem applies anyway to representations learned by unsupervised learning, just like it applies to raw data.

@ynd
Contributor
ynd commented Oct 4, 2012

It's not set in stone, but a constant learning rate by default is better.
It makes cross-validation much easier. I can add some schedules later.
Maybe with classes for handling the schedules?



Yann N. Dauphin added some commits Oct 4, 2012
@dwf
Member
dwf commented Oct 4, 2012

@ynd Why do you say that it makes cross-validation easier?

The trouble with constant learning rates is that mixing can really break down later in training. Depending on how nasty your dataset is, you really want to anneal.

@pprett
Member
pprett commented Oct 5, 2012

@larsmans @amueller @mblondel @ogrisel lets move the parameter renaming discussion into a separate issue - I've opened #1206 and copied some of your posts

@amueller
Member
amueller commented Oct 5, 2012


Thanks for your response. I feel it is great to share ideas :)

Maybe this is a bit arbitrary but I like to think separately of the
optimization parameters and the model parameters.
There is no use in doing model selection without looking at the goal of
the whole process, i.e. classification.
But once I specified a model, I would hope that I am actually optimizing
this model that I specified.

With sparse coding, I can be pretty sure what I am optimizing, with
RBMs, not so much.
In practice, in particular with non-convex optimization, this becomes
much more blurred.

Your perspective "guess some hyper-parameters, optimize for a while,
test the representation" is more pragmatic
and maybe more realistic, I guess.
Though I feel that with this perspective, there is no model any more,
only an algorithm.
I feel it is much harder to argue for algorithms than for models, which
is why I don't really like this perspective ;)

I hope that didn't sound too naive ;)

Cheers,
Andy

@jaberg
Contributor
jaberg commented Oct 5, 2012

That's right... I do see most feature-learning algorithms as just that:
algorithms, which are inspired by various principles that are not obviously
related to classification. There is still a classifier coming out of each
application of the algorithm though, and cross-validation is a good way of
evaluating those models.

@ynd
Contributor
ynd commented Oct 5, 2012

@ogrisel I have written a small benchmark https://gist.github.com/3842732 to compare the implementations. First, I had to bugfix it (using np.empty instead of np.zeros to initialize parameters leads to NaNs). Then I removed the l2 penalty and momentum, along with the objective estimation (reconstruction error) and stdout output, in an effort to benchmark the core efficiency.

The task is modeling 20 newsgroups data. Running the script gives us

Scikit Time: 58.4955779314 seconds
Gist Time: 100.643937469 seconds
Scikit PL: -270.043485947
Gist PL: -282.588828001

The proposed scikit implementation is 72% faster while giving better
pseudo-likelihood after two epochs. The edge in pseudo-likelihood is
probably due to using PCD instead of CD.

The test was run with numpy 1.6.1, 64bit Python 2.7.2 on an Intel(R)
Core(TM) i7 CPU 950 @ 3.07GHz.


@GaelVaroquaux
Member

The proposed scikit implementation is 72% faster while giving better
pseudo-likelihood after two epochs. The edge in pseudo-likelihood is
probably due to using PCD instead of CD.

Good job!

@amueller
Member
amueller commented Oct 6, 2012

@ynd could you maybe comment on using pseudo-likelihood as an evaluation criterion?
Is there a reference for that?

@vene vene and 3 others commented on an outdated diff Dec 30, 2012
sklearn/neural_networks/rbm.py
+ self.intercept_visible_ = np.zeros(X.shape[1], dtype=X.dtype)
+ self.h_samples_ = np.zeros((self.n_particles, self.n_components),
+ dtype=X.dtype)
+
+ inds = np.arange(X.shape[0])
+ self.random_state.shuffle(inds)
+
+ n_batches = int(np.ceil(len(inds) / float(self.n_particles)))
+
+ verbose = self.verbose
+ for iteration in xrange(self.n_iter):
+ pl = 0.
+ if verbose:
+ begin = time.time()
+ for minibatch in xrange(n_batches):
+ pl_batch = self._fit(X[inds[minibatch::n_batches]])
@vene
vene Dec 30, 2012 Member

This seems to be done slightly differently than the usual minibatches in scikit-learn. I wonder if it affects the speed or anything. Anyway since these minibatches are indirectly implied by the n_particles, I think I should read the paper first.

I wonder if it's just me or if it's actually confusing: I ran the RBM before reading the code and noticed that the higher n_particles is, the quicker it is, whereas my ignorant intuition was that "using more stuff should take more time". I commented once before about that parameter and I still argue that it's a particularity of the learning algorithm and not directly of RBMs themselves, so I wouldn't mind having its effect intuitively explained.
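To make the batching concrete, a small illustration of the strided indexing in the quoted loop: each minibatch takes every n_batches-th row, so the batches are interleaved rather than contiguous and each has roughly n_particles rows.

import numpy as np

inds = np.arange(10)
n_batches = 3
for minibatch in range(n_batches):
    print(inds[minibatch::n_batches])
# [0 3 6 9]
# [1 4 7]
# [2 5 8]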

@dwf
dwf Dec 30, 2012 Member

I haven't looked at this in a while and I'm not sure what stopping
criterion Yann is using but more particles will mean you have a better/more
stable estimate of the negative phase term in the gradient.


The number of particles is or can be effectively independent of the
minibatch size if you're doing stochastic maximum likelihood a.k.a.
"persistent contrastive divergence".

The basic idea is this: the gradient of the log likelihood of an RBM
comprises two terms, one tractable and the other intractable because it
involves the gradient of the partition function (i.e. the sum of the
exponentiated energies of all configurations). We need to approximate this
with a sample, but sampling is also intractable, and running a Markov chain
to convergence at each step of the learning is hopeless. Geoff Hinton's
contrastive divergence trick is to run the chain for 1 step and hope for
the best, and in practice this makes for pretty good feature extractors and
lousy density models.

In the late 90s Laurent Younes showed that assuming your model parameters
don't change too quickly (i.e., your learning rate is low) it is
theoretically sound to keep samples from a "previous version" of your model
(i.e. the model before the last parameter update) and treat them as if they
were samples under the "current version" of your model, and thus keep a
"persistent" Markov chain going throughout learning even though the
distribution from which this Markov chain is sampling is constantly
changing. In this way you get approximate but unbiased gradients of the log
likelihood, and in practice this leads to much better density models as
these samples manage to explore a much bigger region of the state space
than is the case with the 1-step CD hack.
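A minimal sketch of the persistent-chain update described above (illustrative only; the parameter names mirror this PR's docstring but the code is not taken from it):

import numpy as np

def _sigmoid(x):
    return 1. / (1. + np.exp(-x))

def pcd_step(v_pos, W, b, c, h_particles, lr, rng):
    # positive phase: mean-field hidden activations for the data batch
    h_pos = _sigmoid(np.dot(v_pos, W) + b)
    # negative phase: one Gibbs step from the persistent fantasy particles
    v_neg = rng.binomial(1, _sigmoid(np.dot(h_particles, W.T) + c))
    h_neg = _sigmoid(np.dot(v_neg, W) + b)

    W += lr * (np.dot(v_pos.T, h_pos) / len(v_pos)
               - np.dot(v_neg.T, h_neg) / len(v_neg))
    b += lr * (h_pos.mean(0) - h_neg.mean(0))
    c += lr * (v_pos.mean(0) - v_neg.mean(0))

    # keep the chain alive across updates instead of restarting it at the data
    return rng.binomial(1, h_neg)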

@vene
vene Dec 30, 2012 Member

Thanks for the input @dwf, I was just about to ping you :-)

I do understand contrastive divergence and the idea behind PCD, and I also understand why @amueller doesn't like it 👿. What was unclear to me was the dynamics of the n_particles parameter. Also I was referring to its description in the docstring, but since I'll be writing the narrative docs, it is probably important for me to understand everything too.

@ynd's code doesn't have any stopping condition, but an n_iter parameter (that must be optimized via grid search). I will check how you do it in your gist.

Since I got your attention: is there any value in increasing the k in CD-k after some time? IIUC it helps with the density modelling part (makes the learning closer to actual ML) but that's not really important here is it? The implementation here should IMHO be as close to black-box as possible, and if users are on to something good they should use or extend the one in pylearn.

@dwf
dwf Dec 30, 2012 Member


Ah dammit, I just realized that I had omitted actual discussion of the
number of particles. The story is basically this: in CD it is common to run
as many 1 or K step chains as there are training examples in your mini
batch. This is because (in the one step case) you've already done half the
work for that in computing the positive phase gradients, so why not
threshold and do two more matrix multiplies.

Since the particles (Markov chain states) stick around in PCD, there is no
intrinsic reason that the sums in the positive phase and negative phase
terms need to have the same number of terms - they are approximating
expectations under different distributions, and there's no reason you need
the same number of terms in each.


There isn't really any good answer, I think I just use a fixed number of
iterations too.

Since I got your attention: is there any value in increasing the k in
CD-k after some time? IIUC it helps with the density modelling part (makes
the learning closer to actual ML) but that's not really important here is
it? The implementation here should IMHO be as close to black-box as
possible, and if users are on to something good they should use or extend
the one in pylearn.

I've not really heard of annealing K upwards, but it makes sense that it
would help somewhat. It's also far more expensive than PCD :) and the best
annealing schedule is not self-evident and would probably be problem
dependent.

There are definitely valid ways to use an RBM as a density model even
though it's intractable to compute the probability of a test example (e.g.
using the free energy of a test example to do outlier detection) for which
PCD training would make a gigantic difference. I can't remember the sklearn
unsupervised API at the moment but if there is something that gives you a
scalar score for each example passed in, the free energy would be the thing
to use for this (note that since the partition function is different for
different RBM instances this can not be used to do model comparison, but
it can be used to judge the log probability a given model assigns to a
point up to an additive constant, and judge an individual model's relative
preference for one example over another).

It depends whether you want to leave open these use cases, or only care
about feature extraction for classification.
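
A minimal sketch of the free-energy scoring idea described above (standalone numpy with toy parameters; W, b and c play the roles of components_, intercept_hidden_ and intercept_visible_; this is an illustration, not the PR's code):

import numpy as np

def free_energy(v, W, b, c):
    # F(v) = -v.c - sum_j log(1 + exp(v.W_j + b_j))
    return -np.dot(v, c) - np.logaddexp(0., np.dot(v, W.T) + b).sum(axis=1)

rng = np.random.RandomState(0)
W = 0.01 * rng.randn(16, 64)               # (n_components, n_features)
b, c = np.zeros(16), np.zeros(64)
v = (rng.rand(5, 64) > 0.5).astype(float)  # a few binary examples

scores = -free_energy(v, W, b, c)
# higher score = the model "prefers" that example; comparing scores across
# examples of one model is fine, comparing across models is not, because the
# unknown partition function differs between models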

@amueller
amueller Dec 30, 2012 Member

I said I don't like PCD? Can't remember that. I might have said I don't like RBMs ;) But PCD is way better than CD!

@amueller
amueller Dec 30, 2012 Member

Just quickly about the other points: I think I read something about CD-k with k increasing (possibly Russ Salakhutdinov?) but I wouldn't include it. PCD is a good tradeoff of complexity and performance imho. Parallel tempering might work better but is also more to fiddle with.

Also I think fixing the number of iterations is the best stopping criterion. Otherwise I guess the free energy, or the pseudo-likelihood above, would be the thing to monitor (I haven't looked at it in more detail though).
@dwf how can you use the free energy to judge convergence?

@dwf
dwf Dec 31, 2012 Member


I don't think you can, because that unknown constant is not actually
constant with respect to learning (the partition function is always
changing). Average free energy of training data minus average free energy
of validation data might be a useful quantity to monitor (basically the
"odds" of training vs. validation, such that the partition functions
cancel) but I haven't tried it.
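
A small self-contained sketch of that monitoring quantity (toy parameters; X_train and X_valid are stand-in binary arrays, and free_energy is the same toy helper as in the sketch a few comments above):

import numpy as np

def free_energy(v, W, b, c):
    return -np.dot(v, c) - np.logaddexp(0., np.dot(v, W.T) + b).sum(axis=1)

rng = np.random.RandomState(0)
W, b, c = 0.01 * rng.randn(16, 64), np.zeros(16), np.zeros(64)
X_train = (rng.rand(200, 64) > 0.5).astype(float)
X_valid = (rng.rand(200, 64) > 0.5).astype(float)

gap = free_energy(X_train, W, b, c).mean() - free_energy(X_valid, W, b, c).mean()
# the unknown log partition function cancels in this difference; a gap that
# keeps drifting negative would suggest the model increasingly prefers
# training points over held-out points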

@amueller
amueller Dec 31, 2012 Member

Yeah, that was my impression also. I thought you suggested it above but that was probably a misunderstanding.

@vene
vene Dec 31, 2012 Member

@amueller I just meant you don't like training graphical models with algorithms that are so approximate that you can't really call it a graphical model anymore 😉

@ynd
ynd Jan 1, 2013 Contributor

I wonder if it's just me or if it's actually confusing: I ran the RBM
before reading the code and noticed that the higher n_particles is, the
quicker it is, whereas my ignorant intuition was that "using more stuff
should take more time". I commented once before about that parameter and I
still argue that it's a particularity of the learning algorithm and not
directly of RBMs themselves, so I wouldn't mind having its effect
intuitively explained.

This is the number of particles used to compute the gradient and make a
gradient step. This means that you need to compute fewer gradients to
perform an iteration over the training set. If you set n_particles to the
size of the training set, you will perform only one gradient step per pass
over the data. That's fast because the matrix-matrix multiplication in BLAS
will be very cache-friendly.
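
A small sketch of that batching logic, mirroring the strided indexing in the snippet quoted above (toy data; n_particles follows the PR's parameter name, everything else is illustrative):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 64)                 # toy training set
n_particles = 10                       # also acts as the mini-batch size here

inds = np.arange(X.shape[0])
rng.shuffle(inds)
n_batches = int(np.ceil(len(inds) / float(n_particles)))

for minibatch in range(n_batches):
    batch = X[inds[minibatch::n_batches]]   # roughly n_particles rows per slice
    # one PCD gradient step would be computed on `batch` here
# larger n_particles -> fewer, bigger batches -> fewer gradient steps per pass,
# which is why training gets faster as n_particles grows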

On Sun, Dec 30, 2012 at 7:14 PM, Vlad Niculae notifications@github.com wrote:


This seems to be done slightly differently than the usual minibatches in
scikit-learn. I wonder if it affects the speed or anything. Anyway since
these minibatches are indirectly implied by the n_particles, I think I
should read the paper first.


@vene vene and 1 other commented on an outdated diff Dec 30, 2012
sklearn/neural_networks/rbm.py
+ """
+ Fit the model to the data X.
+
+ Parameters
+ ----------
+ X: array-like, shape (n_samples, n_features)
+ Training data, where n_samples in the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ self
+ """
+ X = array2d(X)
+
+ self.components_ = np.asarray(self.random_state.normal(0, 0.01,
@vene
vene Dec 30, 2012 Member

Probably better to use 0.01 * self.random_state.randn(self.n_components, X.shape[1]). Also is 0.01 always good / as good as any other value?

@ynd
ynd Jan 1, 2013 Contributor

0.01 is recommended by Hinton in
http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf . In practice it works
better than other values on a range of problems.
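
A tiny illustration of that recommendation (illustrative shapes and names; this mirrors the suggested initialization, it is not the PR's exact code):

import numpy as np

rng = np.random.RandomState(0)
n_components, n_features = 256, 64
components_ = 0.01 * rng.randn(n_components, n_features)   # small Gaussian weights
intercept_hidden_ = np.zeros(n_components)
intercept_visible_ = np.zeros(n_features)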


@vene
Member
vene commented Dec 30, 2012

I am reworking the example on my local branch. I will also start working on the docs here, and if @ynd is still busy by that time I will try to improve the tests too. Maybe we can get this merged soon.

WDYT about adding rectified linear units? http://www.csri.utoronto.ca/~hinton/absps/reluICML.pdf

@amueller
Member

I'd rather get the standard binary version merged first, then we can think about gaussian or rectified linear units.

@vene
Member
vene commented Dec 31, 2012

1000 filters extracted from MNIST with learning rate 0.001 for 50 epochs:

MNIST filters

Still trying to get something meaningful out of the sklearn digits dataset...

@ynd
Contributor
ynd commented Jan 1, 2013

Digits is a pretty small and easy task, that's why things are close.

It would be nice if you did the narrative documentation, I wouldn't be able
to get it done soon.

On Sun, Dec 30, 2012 at 2:17 PM, Vlad Niculae notifications@github.com wrote:

As it is at this point, the example doesn't really illustrate much: the
averaged F-score is identical (after rounding), and you just see some
precision/recall tradeoffs across classes. I wonder whether this is because
of the small dataset. Maybe the whole classification report is not even
interesting here, just the score. I'll play around with it a bit.

I noticed that the components learned are very close to one another with
just a few weights different.

I volunteer to do the narrative documentation.



@ynd
Contributor
ynd commented Jan 1, 2013

Try a higher learning rate, that will help get cleaner filters.

Yann


@dwf
Member
dwf commented Jan 1, 2013


Also momentum.

@vene
Member
vene commented Jan 2, 2013

I just realized that unlike most of our estimator modules, neural_networks cannot fit under supervised or unsupervised in the hierarchy. Despite this, I would still rather group this (and the upcoming MLP) together in such a module.

@amueller
Member
amueller commented Jan 2, 2013

I added an unsupervised algorithm to ensemble without changing anything in the hierarchy - didn't really think about it. Btw, the narrative doesn't have to follow the structure of the modules, though it is probably helpful if it does.
But we could just put MLPs in the supervised part and Autoencoders in the unsupervised part. What other algorithms do you think will be there?

@mblondel
Member
mblondel commented Jan 2, 2013

Agreed with @amueller. I wouldn't be surprised to find an unsupervised
algorithm in a neural_networks module.

@vene
Member
vene commented Jan 2, 2013

Well, we will have at least two RBM estimators, Bernoulli and Gaussian, and I would be interested in the Replicated Softmax thing too. Then the multilayer perceptron (at least as a classifier). I would draw the line at custom architectures (convolutional nets) and more than 2-layer stuff. However I wouldn't mind having an autoencoder in here. In Hinton's class he lists some cool results with autoencoders, but the architectures have many layers, usually with decreasing numbers of units towards the middle, and he admits that the architectures are just guesses; still, maybe a simple 2-layer autoencoder as a black-box transformer can give useful enough representations.

@larsmans
Member
larsmans commented Jan 3, 2013

A basic, 2-layer autoencoder should be easy enough to build by setting Y=X in the backprop algorithm.
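
A rough sketch of that idea: a tied-weight, single-hidden-layer autoencoder trained with plain gradient descent on the squared reconstruction error (standalone numpy, purely illustrative):

import numpy as np

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

rng = np.random.RandomState(0)
X = (rng.rand(200, 30) > 0.5).astype(float)   # toy binary data, Y = X implicitly
W = 0.01 * rng.randn(10, 30)                  # 10 hidden units, tied weights
b_h, b_v = np.zeros(10), np.zeros(30)
lr = 0.5

for _ in range(200):
    H = sigmoid(np.dot(X, W.T) + b_h)         # encode
    R = sigmoid(np.dot(H, W) + b_v)           # decode (reconstruction)
    d_v = (R - X) * R * (1 - R)               # backprop: output delta
    d_h = np.dot(d_v, W.T) * H * (1 - H)      # backprop: hidden delta
    W -= lr * (np.dot(d_h.T, X) + np.dot(H.T, d_v)) / len(X)
    b_v -= lr * d_v.mean(axis=0)
    b_h -= lr * d_h.mean(axis=0)

# after training, sigmoid(X.dot(W.T) + b_h) is the learned representation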

@vene vene and 1 other commented on an outdated diff Jan 3, 2013
sklearn/neural_networks/rbm.py
+def logistic_sigmoid(x):
+ """
+ Implements the logistic function.
+
+ Parameters
+ ----------
+ x: array-like, shape (M, N)
+
+ Returns
+ -------
+ x_new: array-like, shape (M, N)
+ """
+ return 1. / (1. + np.exp(-np.clip(x, -30, 30)))
+
+
+class RestrictedBolzmannMachine(BaseEstimator, TransformerMixin):
@vene
vene Jan 3, 2013 Member

s/Bolzmann/Boltzmann/g
caught this because the docs weren't linking properly. I will fix in my branch (based on the current state of this PR).

@vene
vene Jan 3, 2013 Member

I will actually rename to BernoulliRBM since that seems the way to go judging by the previous discussion.

@vene
Member
vene commented Jan 3, 2013

I am still not at ease regarding how to organize the docs. It's clear that we should have a neural_networks module if we will have RBMs and MLPs; it is simply the place where one would look for such models first.

However the Neural Networks section of the user's guide should have an introductory paragraph before jumping into particular models, and such a page doesn't fit under Supervised or Unsupervised Learning in the user's guide.

I'm thinking of putting the narratives into several different files and include them in several different combinations:

  • Everything, if linked from people looking at the neural_networks module
  • Supervised stuff linked from Supervised learning / Supervised Neural Network Models
  • Unsupervised stuff linked from Unsupervised learning / Unsupervised Neural Network Models

A way out would be having the Neural Network doc section at the top level but I'm -1 for that.

@amueller
Member
amueller commented Jan 3, 2013

On 01/04/2013 12:52 AM, Vlad Niculae wrote:

I am still not at ease regarding how to organize the docs. It's clear
that we should have a |neural_networks| module if we will have RBMs
and MLPs, it is simply the place where one would look for such models
first.

However the Neural Networks section of the user's guide should have an
introductory paragraph before jumping into particular models, and such
a page doesn't fit under Supervised or Unsupervised Learning in the
user's guide.

I would just put the general stuff in the one that comes first and then
link back to it.
What do you want to say about neural nets in the context of MLPs that is
also relevant for RBMs?

@vene
Member
vene commented Jan 4, 2013

That's a good question; however, it is reasonable if we will have autoencoders. Even so, it would just feel off having the neural networks module under Unsupervised... EDIT: maybe link the page from under both sections? I'd rather have lower precision than lower recall in this setting; I don't want users not finding the MLP because it's under Unsupervised, or the other way around.

@amueller
Member
amueller commented Jan 4, 2013

I would really just put MLPs under supervised and Autoencoders under unsupervised (when we have them). I don't see the harm in referencing the MLP from the autoencoder about details of backprop or whatever ...

@vene
Member
vene commented Jan 4, 2013

That's true but where to point from the neural_network module itself? Maybe to both pages? Hmm... that's actually a great idea, I'll set it up like this.

@vene
Member
vene commented Jan 4, 2013

Some of the universal tests were failing after merging with master, uncovering some bugs: check_random_state was called in __init__ and transform wasn't validating. So I fixed them and sent a PR to @ynd just in case he wants to touch this PR before I finish the docs (I haven't even started yet, actually. I'll do it tomorrow 😊)

@GaelVaroquaux
Member

Even so, it would just feel off having the neural networks module under
Unsupervised...

I have the same feeling.

@larsmans
Member
larsmans commented Jan 4, 2013

Btw, didn't we decide in the MLP PR that the module should be called neural?

@mblondel
Member
mblondel commented Jan 4, 2013

Btw, didn't we decide in the MLP PR that the module should be called
neural?

Only a handful of people gave their opinion. We should probably vote on the
ML.

@vene vene and 1 other commented on an outdated diff Jan 20, 2013
sklearn/neural_networks/rbm.py
+ ----------
+ [1] Tieleman, T. Training Restricted Boltzmann Machines using
+ Approximations to the Likelihood Gradient. International Conference
+ on Machine Learning (ICML) 2008
+ """
+ h_pos = self.mean_hiddens(v_pos)
+ v_neg = self.sample_visibles(self.h_samples_)
+ h_neg = self.mean_hiddens(v_neg)
+
+ lr = self.learning_rate / self.n_particles
+ v_pos *= lr
+ v_neg *= lr
+ self.components_ += safe_sparse_dot(v_pos.T, h_pos).T
+ self.components_ -= np.dot(v_neg.T, h_neg).T
+ self.intercept_hidden_ += lr * (h_pos.sum(0) - h_neg.sum(0))
+ self.intercept_visible_ += (v_pos.sum(0) - v_neg.sum(0))
@vene
vene Jan 20, 2013 Member

Is it intended for the visible intercept to not be scaled by the learning rate? If so, could you explain why?
I'm experimenting with momentum and this jumped at me.

@ynd
ynd Jan 21, 2013 Contributor

They are scaled by the learning rate in place beforehand in the previous
lines. This makes the computation of the gradient for components faster.
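
A tiny check of that equivalence (scaling the visible activations before the outer product gives the same gradient as scaling afterwards; illustrative shapes):

import numpy as np

rng = np.random.RandomState(0)
v = rng.rand(5, 8)      # visible activations
h = rng.rand(5, 3)      # hidden activations
lr = 0.1

g_after = lr * np.dot(v.T, h)        # scale the gradient afterwards
g_before = np.dot((lr * v).T, h)     # scale v first, as done in place here
assert np.allclose(g_after, g_before)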


@ynd
Contributor
ynd commented Jan 24, 2013

Thanks to Vlad we have a narrative doc, what is missing for a merge?

@mblondel
Member

Thanks to Vlad we have a narrative doc, what is missing for a merge?

Agreeing on the module name. My vote goes to neural_network (without an s) to be consistent with linear_model.

@ogrisel
Member
ogrisel commented Jan 24, 2013

Agreeing on the module name. My vote goes to neural_network (without an s) to be consistent with linear_model.

+1 for this proposal as well.

@GaelVaroquaux
Member

Agreeing on the module name. My vote goes to neural_network (without an s) to
be consistent with linear_model.

+1

@ogrisel
Member
ogrisel commented Jan 24, 2013

The test coverage could probably be improved a bit:

sklearn.neural_networks.rbm                         84     10    88%   162, 222-225, 262, 318, 323, 326-328

Type make test-coverage in the top level folder to compute it again.

@ogrisel ogrisel commented on an outdated diff Jan 24, 2013
sklearn/neural_networks/rbm.py
+ return logistic_sigmoid(safe_sparse_dot(v, self.components_.T)
+ + self.intercept_hidden_)
+
+ def sample_hiddens(self, v):
+ """
+ Sample from the distribution ``P({\bf h}|{\bf v})``.
+
+ Parameters
+ ----------
+ v: array-like, shape (n_samples, n_features)
+
+ Returns
+ -------
+ h: array-like, shape (n_samples, n_components)
+ """
+ return self._sample_binomial(self.mean_hiddens(v))
@ogrisel
ogrisel Jan 24, 2013 Member

This is not tested.

@ogrisel ogrisel commented on an outdated diff Jan 24, 2013
sklearn/neural_networks/rbm.py
+ def gibbs(self, v):
+ """
+ Perform one Gibbs sampling step.
+
+ Parameters
+ ----------
+ v: array-like, shape (n_samples, n_features)
+
+ Returns
+ -------
+ v_new: array-like, shape (n_samples, n_features)
+ """
+ h_ = self.sample_hiddens(v)
+ v_ = self.sample_visibles(h_)
+
+ return v_
@ogrisel
ogrisel Jan 24, 2013 Member

This method is not tested.

@ogrisel ogrisel commented on an outdated diff Jan 24, 2013
sklearn/neural_networks/rbm.py
+ h_pos = self.mean_hiddens(v_pos)
+ v_neg = self.sample_visibles(self.h_samples_)
+ h_neg = self.mean_hiddens(v_neg)
+
+ lr = self.learning_rate / self.n_particles
+ v_pos *= lr
+ v_neg *= lr
+ self.components_ += safe_sparse_dot(v_pos.T, h_pos).T
+ self.components_ -= np.dot(v_neg.T, h_neg).T
+ self.intercept_hidden_ += lr * (h_pos.sum(0) - h_neg.sum(0))
+ self.intercept_visible_ += (v_pos.sum(0) - v_neg.sum(0))
+
+ self.h_samples_ = self._sample_binomial(h_neg)
+
+ if self.verbose:
+ return self.pseudo_likelihood(v_pos)
@ogrisel
ogrisel Jan 24, 2013 Member

This is not tested but I guess we don't want to pollute stdout when running the tests. So no opinion for this one.

@GaelVaroquaux GaelVaroquaux and 2 others commented on an outdated diff Jan 24, 2013
sklearn/neural_networks/rbm.py
+ Computes the free energy
+ ``\mathcal{F}({\bf v}) = - \log \sum_{\bf h} e^{-E({\bf v},{\bf h})}``.
+
+ Parameters
+ ----------
+ v: array-like, shape (n_samples, n_features)
+
+ Returns
+ -------
+ free_energy: array-like, shape (n_samples,)
+ """
+ return - np.dot(v, self.intercept_visible_) - np.log(1. + np.exp(
+ safe_sparse_dot(v, self.components_.T) + self.intercept_hidden_)) \
+ .sum(axis=1)
+
+ def gibbs(self, v):
@GaelVaroquaux
GaelVaroquaux Jan 24, 2013 Member

If I am not mistaken, the corresponding method in GMMs (mixture/gmm.py) is called 'sample'. It would be useful to have the same name for API consistency.

@dwf
dwf Jan 24, 2013 Member


$0.02: I don't think it's necessarily a good idea to conflate the two.

My reasoning: sample(), I'm pretty sure, draws an independent sample from
the model distribution. This is impossible with an RBM. This executes one
step of a Gibbs sampling process, requires an initial visible state, and
importantly does not yield independent samples on successive calls.

@GaelVaroquaux
GaelVaroquaux Jan 24, 2013 Member

If I am not mistaken, the corresponding method in GMMs
(mixture/gmm.py) is called 'sample'. It would be useful to have the
same name for API consistency.

$0.02: I don't think it's necessarily a good idea to conflate the two.
My reasoning: sample(), I'm pretty sure, draws an independent sample
from the model distribution. This is impossible with an RBM. This
executes one step of a Gibbs sampling process, requires an initial
visible state, and importantly does not yield independent samples on
successive calls.

Is this a good enough reason to have different names, or does it warrant
a note in the docstring?

@ogrisel
ogrisel Jan 24, 2013 Member

+1 for making it explicit in the docstring.

@dwf
dwf Jan 24, 2013 Member


Well, this has a different signature with a non-optional argument that
doesn't appear in sample. Having the same name thus doesn't really buy
you anything from an abstraction/generic code perspective. If uniformity
for the mere sake of uniformity is desirable then maybe it's worthwhile but
it seems like they serve different enough purposes (and require different
calling conventions) to be considered conceptually separate.

@ogrisel
ogrisel Jan 24, 2013 Member

I think we all agree on keeping different names but we should further add explicitly in the docstring of the gibbs method that consecutive calls to it will yield highly dependent samples as the internal state of the RBM model is updated (stateful model).

@ogrisel
ogrisel Jan 24, 2013 Member

Actually I made a mistake: the model itself is not stateful, as the state (v, the activation levels of the visible units) is passed as an argument to the method.

@ogrisel
Member
ogrisel commented Jan 24, 2013

The narrative doc lacks a link to the example with RBM used for unsupervised feature extraction in a digits classification pipeline.

I think this example should move to a neural_network subfolder of the examples folder, as we expect to have more neural-net-based models in the future, we don't want to crowd the top level example folder too much, and we don't want to break incoming links to the example page on the website in the future.

Also are the extracted filters too ugly to be plotted in the example and the doc?

@ogrisel
Member
ogrisel commented Jan 24, 2013

For information, here the output of running the example on my box:

=========================================================
Pipelining: chaining a RBM and a logistic regression
=========================================================

The BernoulliRBM does unsupervised feature extraction,
while the logistic regression does the prediction.

We use a GridSearchCV to set the number of hidden units and the learning rate
of the Bernoulli Restricted Boltzmann Machine.

We also train a simple logistic regression for comparison. The example shows
that the features extracted by the BernoulliRBM help improve
the classification accuracy.


Classification report for classifier Pipeline(logistic=LogisticRegression(C=10000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, penalty=l2,
          random_state=None, tol=0.0001),
     logistic__C=10000.0, logistic__class_weight=None,
     logistic__dual=False, logistic__fit_intercept=True,
     logistic__intercept_scaling=1, logistic__penalty=l2,
     logistic__random_state=None, logistic__tol=0.0001,
     rbm=BernoulliRBM(learning_rate=0.01, n_components=400, n_iter=30, n_particles=10,
       random_state=<mtrand.RandomState object at 0x108d8f168>,
       verbose=False),
     rbm__learning_rate=0.01, rbm__n_components=400, rbm__n_iter=30,
     rbm__n_particles=10,
     rbm__random_state=<mtrand.RandomState object at 0x108d8f168>,
     rbm__verbose=False):
             precision    recall  f1-score   support

          0       0.95      0.95      0.95        41
          1       0.98      0.93      0.95        44
          2       0.98      1.00      0.99        42
          3       0.97      1.00      0.99        33
          4       0.97      0.95      0.96        40
          5       1.00      1.00      1.00        29
          6       0.96      0.86      0.91        28
          7       0.92      1.00      0.96        36
          8       0.78      0.93      0.85        27
          9       0.97      0.88      0.92        40

avg / total       0.95      0.95      0.95       360


Classification report for classifier LogisticRegression(C=10000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, penalty=l2,
          random_state=None, tol=0.0001):
             precision    recall  f1-score   support

          0       0.95      0.95      0.95        41
          1       0.97      0.86      0.92        44
          2       1.00      1.00      1.00        42
          3       0.91      0.94      0.93        33
          4       0.95      0.97      0.96        40
          5       0.94      1.00      0.97        29
          6       1.00      0.89      0.94        28
          7       0.94      0.89      0.91        36
          8       0.68      0.96      0.80        27
          9       0.94      0.82      0.88        40

avg / total       0.94      0.93      0.93       360

So the pipeline is indeed a bit better than the logistic regression, although this isn't cross-validated so we have no idea of the variance of this estimate.

I think that anyway the dataset is too small for the RBM to be very useful here. I think we should mention that in the doc of the example.

We can also note that the __repr__ method of the Pipeline class is broken (it should at least respect the ordering of the steps) but this should be addressed in a separate PR.

@ogrisel
Member
ogrisel commented Jan 24, 2013

Also we should really add an explicit partial_fit method for incremental learning with mini-batches.

@larsmans
Member

Wouldn't minibatch learning require the persistent CD algorithm?

@larsmans
Member

Oh and still +1 on sklearn.neural.

@amueller
Member

@larsmans this PR does implement persistent CD.

@larsmans
Member

Excuse me, I was under the impression that it didn't.

@ynd
Contributor
ynd commented Jan 28, 2013

It seems the majority vote is 'neural_network'.

@amueller
Member
amueller commented Feb 2, 2013

+1 for neural_network.

@amueller
Member
amueller commented Feb 2, 2013

I think we should merge this PR now that there is a narrative and examples. The examples could be improved but we can still do that later. I think many people will play with this and we'll easily get a better example.

@vene
Member
vene commented Feb 2, 2013

I'm fine with merging soon, but:

  1. @ogrisel, sadly with cross-validation it's easy to find a C for which results are better without the RBM. I dug for a long time and couldn't find a configuration to learn features that perform better than the raw ones. I blame this on the dataset. Maybe we should instead concatenate the original features with the learned ones (see the sketch after this list)?

  2. I'm still a bit put off though by how mini-batch size is controlled by changing n_particles; I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size?

  3. Thoughts on momentum?
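
The concatenation idea from point 1 could look roughly like this (a sketch, assuming the BernoulliRBM from this PR is importable under the module name being voted on; the Identity transformer is a hypothetical helper written just for this illustration):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM   # assumed module path

class Identity(BaseEstimator, TransformerMixin):
    """Pass the raw features through unchanged."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

features = FeatureUnion([('raw', Identity()),
                         ('rbm', BernoulliRBM(n_components=100, learning_rate=0.01))])
clf = Pipeline([('features', features),
                ('logistic', LogisticRegression(C=100.))])
# clf.fit(X, y) would then train on the raw features concatenated with RBM features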

@ogrisel
Member
ogrisel commented Feb 3, 2013
  1. @ogrisel, sadly with cross-validation it's easy to find a C for which results are better without the RBM. I dug for a long time and couldn't find a configuration to learn features that perform better than the raw ones. I blame this on the dataset. Maybe we should instead concatenate the original features with the learned ones?

Then we should just make it explicit in the example's top-level docstring that due to download size and computational constraints, this example uses a toy dataset that is either too small or too "linearly separable" for RBM features to be really interesting. In practice, unsupervised feature extraction with RBMs is only useful on datasets that both have many samples and are internally structured into non-linearly separable components / manifolds. Currently it's misleading the reader by asserting that RBM features are better than raw features on this toy data.

  2. I'm still a bit put off though by how mini-batch size is controlled by changing n_particles; I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size?

Having consistency on the batch_size parameter with other minibatch models such as MiniBatchKMeans and MiniBatchDictionaryLearning is a big +1 for me. I would rather have this changed before merging to master if this is not too complicated.

@ogrisel
Member
ogrisel commented Feb 3, 2013

BTW has anyone tried to let this implementation run long enough on the official MNIST train set to extract features that, when combined with a cross-validated LinearSVC / SVC, reach the expected predictive accuracy (I am not 100% sure but I think it is possible to go below the 2% test error rate with this kind of architecture)?

Also if someone runs it long enough with the same parameters as in the bottom of the page in http://deeplearning.net/tutorial/rbm.html we could also qualitatively compare the outcome of the gibbs sampling with those of the deeplearning tutorial with the theano based implementation.

@ynd
Contributor
ynd commented Feb 3, 2013

Most implementations use the same number of particles as the batch size
(this is recommended by Hinton).

I will rename the parameter to batch_size though; that should clarify
things.



Yann N. Dauphin added some commits Feb 3, 2013
Yann N. Dauphin rename n_particles to batch_size f8e0110
Yann N. Dauphin bugfix test ad4877a
Yann N. Dauphin bugfix test 5400735
Yann N. Dauphin added tests 6dfc317
Yann N. Dauphin tests 83daf31
@ynd
Contributor
ynd commented Apr 15, 2013

@ogrisel I've added some tests:

sklearn.neural_network.rbm 84 6 93% 262, 318, 323, 326-328

All the uncovered lines are basically just conditionals on print statements for verbose=True.

@amueller
Member

Thanks a lot for picking this up. Sorry we didn't work harder on merging it. It seems we are all a bit swamped right now. I'm certain it will go in the next release, though ;)

@vene
Member
vene commented Apr 15, 2013

Wow, actually yesterday I was thinking that after the conference deadline
passes (which was last night) I wanna come back to this PR!

A while back in a local file, I hacked up a momentum implementation on top
of this code; it did seem to speed up convergence a lot on MNIST.
Interesting?


@amueller
Member

@vene I would rather not add any features and merge this asap. It has been sitting around way too long! Do you have any idea what is still missing?

@vene
Member
vene commented Apr 17, 2013

IIRC I was not pleased with the example when it was moved to the digits
dataset instead of mnist, as the dataset is very small and the components
don't look that meaningful. I tried toying around with the params to no
avail.


@vene
Member
vene commented Apr 17, 2013

Also there is a failing common test because of the way random_state is used.

@ogrisel
Member
ogrisel commented Apr 18, 2013

@vene If you can come up with a quick PR against this branch and show how that impacts the convergence speed and quality on MNIST, that sounds interesting.

@ogrisel ogrisel commented on the diff Apr 18, 2013
sklearn/neural_network/rbm.py
+ def fit(self, X, y=None):
+ """
+ Fit the model to the data X.
+
+ Parameters
+ ----------
+ X: array-like, shape (n_samples, n_features)
+ Training data, where n_samples in the number of samples
+ and n_features is the number of features.
+
+ Returns
+ -------
+ self
+ """
+ X = array2d(X)
+ self.random_state = check_random_state(self.random_state)
@ogrisel
ogrisel Apr 18, 2013 Member

Constructor parameters should not be mutated during fit and that holds for random_state as well. Please change this line to:

self.random_state_ = check_random_state(self.random_state)

or

random_state = check_random_state(self.random_state)

and then pass the random_state local variable explicitly as an argument to the subsequent private methods.

@amueller
Member
amueller commented May 6, 2013

I'm very inclined to fix the travis tests and merge. Having "not completely satisfying example filters" doesn't justify delaying this PR any longer.

@vene
Member
vene commented May 6, 2013

Probably a good idea, any speedups / new features can be added later.


@amueller
Member
amueller commented May 6, 2013

Does any one have strong reasons not to merge this? (after fixing the random state issue)

@GaelVaroquaux
Member

Some implicit castings: (I have a version of numpy that purposely fails when
casting rules can induce problems)

======================================================================
FAIL: Doctest: sklearn.neural_network.rbm.BernoulliRBM
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for
sklearn.neural_network.rbm.BernoulliRBM
  File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py",
line 31, in BernoulliRBM
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py",
line 78, in sklearn.neural_network.rbm.BernoulliRBM
Failed example:
    model.fit(X)  # doctest: +ELLIPSIS
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "", line
1, in 
        model.fit(X)  # doctest: +ELLIPSIS
      File
"/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line
320, in fit
        pl_batch = self._fit(X[inds[minibatch::n_batches]])
      File
"/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line
252, in _fit
        v_pos *= lr
    TypeError: Cannot cast ufunc multiply output from dtype('float64') to
dtype('int64') with casting rule 'same_kind'
>>  raise self.failureException(self.format_failure(>  instance at 0x9529a70>.getvalue()))
    
----------------------------------------------------------------------
@GaelVaroquaux
Member

Some implicit castings: (I have a version of numpy that purposely fails when
casting rules can induce problems)

I have fixed that in the pr_1200 branch on my github (it was preventing
me from reviewing this PR). Please pull/cherry-pick

@GaelVaroquaux GaelVaroquaux commented on the diff May 6, 2013
doc/modules/neural_networks.rst
+but often in practice, as well as in this implementation, it is optimized by
+averaging over mini-batches. The gradient with respect to the weights is
+formed of two terms corresponding to the ones above. They are usually known as
+the positive gradient and the negative gradient, because of their respective
+signs.
+
+In maximizing the log likelihood, the positive gradient makes the model prefer
+hidden states that are compatible with the observed training data. Because of
+the bipartite structure of RBMs, it can be computed efficiently. The
+negative gradient, however, is intractable. Its goal is to lower the energy of
+joint states that the model prefers, therefore making it stay true to the data
+and not fantasize. It can be approximated by Markov Chain Monte Carlo using
+block Gibbs sampling by iteratively sampling each of :math:`v` and :math:`h`
+given the other, until the chain mixes. Samples generated in this way are
+sometimes referred to as fantasy particles. This is inefficient and it's difficult
+to determine whether the Markov chain mixes.
@GaelVaroquaux
GaelVaroquaux May 6, 2013 Member

I think that the above paragraph can be much shortened. I don't see how it is useful to the end user.

@GaelVaroquaux
Member

Does any one have strong reasons not to merge this?

Plenty. I am doing a quick review, and it is not ready. The review will
come in later, but I am convinced that there still is some work to be
done here.

@GaelVaroquaux GaelVaroquaux commented on the diff May 6, 2013
sklearn/neural_network/rbm.py
+ Compute the element-wise binomial using the probabilities p.
+
+ Parameters
+ ----------
+ x: array-like, shape (M, N)
+
+ Notes
+ -----
+ This is equivalent to calling numpy.random.binomial(1, p) but is
+ faster because it uses in-place operations on p.
+
+ Returns
+ -------
+ x_new: array-like, shape (M, N)
+ """
+ p[self.random_state.uniform(size=p.shape) < p] = 1.
@GaelVaroquaux
GaelVaroquaux May 6, 2013 Member

self.random_state cannot be a rng object, because it would mean modifying the input parameter. You need to create a shadow self.random_state_ object, as done in the other learners.

@mblondel
mblondel May 6, 2013 Member

Or better, pass the random state as an argument to the method.
On May 7, 2013 6:33 AM, "Gael Varoquaux" notifications@github.com wrote:

In sklearn/neural_network/rbm.py:

  •    Compute the element-wise binomial using the probabilities p.
    
  •    Parameters
    

  •    x: array-like, shape (M, N)
    
  •    Notes
    

  •    This is equivalent to calling numpy.random.binomial(1, p) but is
    
  •    faster because it uses in-place operations on p.
    
  •    Returns
    

  •    x_new: array-like, shape (M, N)
    
  •    """
    
  •    p[self.random_state.uniform(size=p.shape) < p] = 1.
    

self.random_state cannot be a rng object, because it would mean modifying
the input parameter. You need to create a shadow self.random_state_ object,
as done in the other learners.


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/pull/1200/files#r4102207
.

@GaelVaroquaux
GaelVaroquaux May 6, 2013 Member

Or better, pass the random state as an argument to the method.

+1. But the random_state attribute should not be modified anyhow.

@GaelVaroquaux
Member

I have made a few minor changes on my pr_1200 branch.

A few general comments (it seems that this PR lacks a bit of polish and I
am not convinced that it is mergeable from a user standpoint):

The docs contain a lot of math, but lack intuitive messages. Most
importantly, the RBM section does not tell me why and when I should be
using RBMs.

Examples are not linked to in the docs. The docs do not have a figure.
They feel a bit more like a page from a textbook. For instance the remark
on 'Deep neural networks that are notoriously difficult to train from
scratch can be simplified by initializing each layer’s weights with the
weights of an RBM.' What is a user going to do with this remark: we don't
have deep neural networks in scikit-learn. There is too much maths and
not enough intuitions/practical advice.

The only example is not a 'plot_' example. It does not help with understanding
how the RBM works, nor does it demonstrate its benefits in the rendered
docs. It also takes ages to run. This long run time is strongly suboptimal,
as there is a grid-search object that refits the RBM many times with
unchanged parameters. To make it faster, I have hardcoded some values
obtained by a grid search. I also added some plotting.

I am also worried about the number of methods on the object. Have we
checked for redundancy in the names of the methods? I see a Gibbs
sampling method. Other objects also have sampling methods. We should make
sure that the names and signatures of these methods are compatible.

As a side note, after playing with this code, I am not terribly impressed
by deep learning as black-box learners. :P

@larsmans
Member
larsmans commented May 6, 2013

(Please disregard previous comment)

I agree with @GaelVaroquaux that there are far too many public methods.

@dwf
Member
dwf commented May 6, 2013

I haven't been following this all that closely, but the narrative docs appear to disagree with the implementation. The implementation uses stochastic maximum likelihood, which uses an unbiased estimator of the maximum likelihood gradient, not contrastive divergence (which the documentation seems to be describing). If the primary utility to scikit-learn users is that of feature extraction, I'd question whether an SML-based implementation is a good choice. CD tends to learn better features.
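
A compact way to see the distinction being drawn here, with the two negative-phase choices side by side (illustrative numpy; W, b and c are toy parameters standing in for the fitted attributes):

import numpy as np

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

rng = np.random.RandomState(0)
W, b, c = 0.01 * rng.randn(16, 64), np.zeros(16), np.zeros(64)
v_pos = (rng.rand(10, 64) > 0.5).astype(float)        # a mini-batch of data

# CD-1: the negative chain is restarted from the data at every update
h_pos = sigmoid(np.dot(v_pos, W.T) + b)
h_sample = (rng.rand(*h_pos.shape) < h_pos).astype(float)
v_neg_cd = sigmoid(np.dot(h_sample, W) + c)

# PCD / SML: the negative chain continues from persistent fantasy particles
h_particles = (rng.rand(10, 16) > 0.5).astype(float)  # kept across updates
v_neg_pcd = sigmoid(np.dot(h_particles, W) + c)
# after the gradient step, h_particles would be resampled from v_neg_pcd and reused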

As a side note, after playing with this code, I am not terribly impressed by deep learning as black-box learners. :P

We'll make a believer out of you yet Gael!

@vene
Member
vene commented May 7, 2013

The implementation uses stochastic maximum likelihood, which uses an unbiased estimator
of the maximum likelihood gradient, not contrastive divergence (which the documentation
seems to be describing).

Isn't Persistent CD a synonym for SML? In the "SML Learning" section I described the CD algorithm first and then PCD as a variation, because I found it difficult to justify PCD by itself without contrasting it with CD. If I am factually incorrect, could you please push me in the right direction?

There is too much maths and not enough intuitions/practical advice.

I tried to push the maths as far down as possible and to lead with intuitions, but I don't know any more intuitions than what I put down. I agree the docs are lacking but I really don't know what to write to improve them. As black-box learners, RBMs should just be used in a simple pipeline before a classifier, hoping for the best.

The figure plotted by the example should indeed be in the docs. I was very unsatisfied with the way the example looks on the digits dataset compared to how good it looks on MNIST; I tried to find good values but failed. I was wondering whether it would be better to use MNIST and keep it as a pregenerated example.

@amueller
Member
amueller commented May 7, 2013

Ok, I'll try to work on docs and interfaces. First I want to have a look at @jnothman's grid-search work, though...

@GaelVaroquaux
Member

Ok, I'll try to work on docs and interfaces.

The most worrying thing is the example: I haven't been convinced that it
does better than a K-means to extract mid-level features.

G

@amueller
Member
amueller commented May 7, 2013

I don't think that you can say that it is better than k-means in general.
Showing that RBMs work needs a large dataset and tuning. We can try on MNIST.

I don't think the goal of the example should be to show that it works well; rather it should be to show how to use it.
The literature shows that there are cases in which it works. Reproducing them (in particular using just a CPU) is non-trivial and time-consuming.

@GaelVaroquaux
Member

Showing that RBMs work needs a large dataset and tuning. We can try on
MNIST.

That must cost an arm and a leg!

I don't think the goal of the example should be to show that it works
well, ratherit should be how to use it.

Indeed. I think that the example on my pr_1200 branch does that; however, when I see it, I cannot help thinking of k-means.

The literature shows that there are cases in which it works.

The older I get, the less I trust the literature :)

Reproducing them (in particular using just
a CPU) is non-trivial and time-consuming.

We should do a quick check to see if this implementation can be sped up: I find it slow on a simple problem.

@vene
Member
vene commented May 7, 2013

The older I get, the less I trust the literature :)

Totally understandable, but maybe this shouldn't be used as an argument
against merging. It seems that people are interested in trying these kinds
of models. I think it's better to provide it, and make it obvious how easy
it is to replace it with something simpler by giving it the same API and
letting people see by themselves that for many problems, KMeans or Random
Projections extract better features than complicated and slow
architectures. If users manage to use it well, that's great! If users use
it wrongly, it's not much different than if they misuse any other of our
estimators, is it?

We should do a quick check to see if this implementation can be sped up: I
find it slow on a simple problem.

An algorithmic speedup would be momentum, in this case.

@amueller, I'd love to help with what I can (maybe API cleanup) but I might
step on your toes if you're on this, especially since the branch is @ynd's
at the moment. What could I do?

@GaelVaroquaux
Member

The older I get, the less I trust the literature :)

Totally understandable, but maybe this shouldn't be used as an argument
against merging.

Agreed. I am not voting against a merge in the long run. I think that
there are still a few minor improvements to be done before a merge.

We should do a quick check to see if this implementation can be sped up: I
find it slow on a simple problem.

An algorithmic speedup would be momentum, in this case.

OK, that would be fantastic, but if it requires a lot of work, we should
maybe do it in a 2nd PR.
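
A minimal sketch of the classical momentum update being discussed (self-contained, with a random stand-in for the gradient estimate; in the real code, grad would be the PCD gradient for components_):

import numpy as np

rng = np.random.RandomState(0)
components_ = 0.01 * rng.randn(16, 64)        # illustrative weight matrix
velocity = np.zeros_like(components_)
lr, momentum = 0.05, 0.9

for step in range(100):
    grad = rng.randn(*components_.shape)      # stand-in for the PCD gradient
    velocity = momentum * velocity + lr * grad
    components_ += velocity                   # momentum step instead of plain SGD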

@amueller
Member
amueller commented May 7, 2013

@vladn feel free to go ahead, preferably in a second PR to master, I think. This is not my highest priority but I want to get it done soonish.


@vene
Member
vene commented May 9, 2013

Moved to PR #1954, closing this one.
Posting here so people get pinged.
@dwf, could you help me understand the PCD-SML issue discussed above?

@vene vene closed this May 9, 2013
@dwf
Member
dwf commented May 10, 2013

Sure, I need to have a closer look at the narrative docs.

