# Added Restricted Boltzmann machines #1200

Closed
wants to merge 66 commits into
from
+684 −0


### 14 participants

Contributor
commented Oct 2, 2012
 RBMs are a state-of-the-art generative model. They've been used to win the Netflix challenge [1] and in record-breaking systems for speech recognition at Google [2] and Microsoft. This pull request adds a class for Restricted Boltzmann Machines (RBMs) to scikit-learn. The code is both easy to read and efficient.
 Yann N. Dauphin added class for Restricted Boltzmann machines f192a8f
commented on sklearn/rbm.py in f192a8f Oct 3, 2012
 verbose should be a constructor parameter.
Member
commented Oct 3, 2012
 That's great! How hard would it be to support scipy sparse matrices too? We're trying to avoid adding new classes that can only operate with numpy arrays...
Member
commented Oct 3, 2012
 Great addition indeed! What would be convenient is to define an RBMClassifier class from your RBM. There are several ways to do that, but the most straightforward would be to encode both X and y into the visible units and then to train your RBM. For predictions, clamp the values of X on the visible units (but leave the visible units of y free), then let the machine stabilize and finally output the values at the visible units corresponding to y as the final predictions. Actually this strategy can even be used for multi-output problems. Also, do you intend to handle missing values? (since you motivate your PR with Netflix)
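For reference, a rough sketch of the prediction scheme described above (illustration only, with hypothetical helper names; it assumes y is one-hot encoded into dedicated visible units and that the RBM has already been trained on the concatenated [X, y] visibles):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_clamped(W, b, c, x_units, y_units, X_bin, n_gibbs=50, rng=None):
    """W: (n_visible, n_hidden); b, c: hidden/visible biases; x_units/y_units:
    index arrays of the visible units carrying X and the one-hot encoded y."""
    if rng is None:
        rng = np.random.RandomState(0)
    v = np.zeros((X_bin.shape[0], W.shape[0]))
    v[:, x_units] = X_bin                       # clamp the X part of the visibles
    for _ in range(n_gibbs):
        h = rng.binomial(1, sigmoid(np.dot(v, W) + b))
        v = sigmoid(np.dot(h, W.T) + c)         # mean-field visible reconstruction
        v[:, x_units] = X_bin                   # re-clamp X after every step
    return v[:, y_units].argmax(axis=1)         # most probable class among the y units
```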
and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+    n_components : int, optional
+        Number of binary hidden units
+    epsilon : float, optional
+        Learning rate to use during learning. It is *highly* recommended
+        to tune this hyper-parameter. Possible values are 10**[0., -3.].
+    n_samples : int, optional
+        Number of fantasy particles to use during learning
+    epochs : int, optional
+        Number of epochs to perform during learning
+    random_state : RandomState or an int seed (0 by default)
+        A random number generator instance to define the state of the
+        random permutations generator.
+
+    Attributes
+    ----------
+    W : array-like, shape (n_visibles, n_components), optional
```
glouppe (Member): In scikit-learn, we are used to putting a trailing underscore on all fitted attributes (self.W_).

GaelVaroquaux (Member): In addition, we now try to avoid single-letter variable names. Can you find a more explicit name?

mblondel (Member): In PCA, a related attribute is components_.
and 4 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+    n_samples : int, optional
+        Number of fantasy particles to use during learning
+    epochs : int, optional
+        Number of epochs to perform during learning
+    random_state : RandomState or an int seed (0 by default)
+        A random number generator instance to define the state of the
+        random permutations generator.
+
+    Attributes
+    ----------
+    W : array-like, shape (n_visibles, n_components), optional
+        Weight matrix, where n_visibles in the number of visible
+        units and n_components is the number of hidden units.
+    b : array-like, shape (n_components,), optional
+        Biases of the hidden units
+    c : array-like, shape (n_visibles,), optional
```
glouppe (Member): b and c are not meaningful variable names. Maybe bias_hidden and bias_visible would be better names? (This is a suggestion)

agramfort (Member): or maybe better intercept_hidden_ and intercept_visible_

GaelVaroquaux (Member): intercept_hidden and intercept_visible

amueller (Member): bias is the usual term in the community. Before joining sklearn, I never heard the word intercept. Consistency with the community or within sklearn is the question I guess ;)

GaelVaroquaux (Member): In any case, the fact that the two terminologies coexist needs to be stressed in the documentation. I wasn't aware that "bias == intercept" :)

larsmans (Member): +1 for intercept_{visible,hidden}_.
and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+        verbose: bool, optional
+            When True (False by default) the method outputs the progress
+            of learning after each epoch.
+        """
+        X = array2d(X)
+
+        self.W = np.asarray(np.random.normal(0, 0.01,
+                            (X.shape[1], self.n_components)), dtype=X.dtype)
+        self.b = np.zeros(self.n_components, dtype=X.dtype)
+        self.c = np.zeros(X.shape[1], dtype=X.dtype)
+        self.h_samples = np.zeros((self.n_samples, self.n_components),
+                                  dtype=X.dtype)
+
+        inds = range(X.shape[0])
+
+        np.random.shuffle(inds)
```
glouppe (Member): self.random_state should be used instead.

mblondel (Member): It seems to me that other parts of the scikit don't record the random state in an attribute (they pass it around instead).

mblondel (Member): You could do del self.random_state at the end of fit then.

ynd (Contributor): What if fit is called twice? Wouldn't having deleted self.random_state cause some problems?
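For reference, a minimal sketch of the convention being discussed (not this PR's code; the class and attribute names are hypothetical): keep the random_state constructor argument untouched and only derive a RandomState inside fit.

```python
import numpy as np
from sklearn.utils import check_random_state

class SomeEstimator(object):
    """Hypothetical estimator illustrating the random_state convention."""

    def __init__(self, random_state=None):
        # store the parameter untouched so get_params/set_params round-trip
        self.random_state = random_state

    def fit(self, X):
        # derive a RandomState only when it is needed, inside fit
        rng = check_random_state(self.random_state)
        inds = np.arange(X.shape[0])
        rng.shuffle(inds)
        # ... use `rng` (not np.random) for every random draw during fitting
        return self
```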
and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+        v: array-like, shape (n_samples, n_visibles)
+
+        Returns
+        -------
+        pseudo_likelihood: array-like, shape (n_samples,)
+        """
+        fe = self.free_energy(v)
+
+        v_ = v.copy()
+        i_ = self.random_state.randint(0, v.shape[1], v.shape[0])
+        v_[range(v.shape[0]), i_] = v_[range(v.shape[0]), i_] == 0
+        fe_ = self.free_energy(v_)
+
+        return v.shape[1] * np.log(self._sigmoid(fe_ - fe))
+
+    def fit(self, X, y=None, verbose=False):
```
glouppe (Member): As Mathieu said, verbose should be a constructor parameter.

ynd (Contributor): done.
and 4 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+        verbose: bool, optional
+            When True (False by default) the method outputs the progress
+            of learning after each epoch.
+        """
+        self.fit(X, y, verbose)
+
+        return self.transform(X)
+
+
+def main():
+    pass
+
+
+if __name__ == '__main__':
+    main()
+
```
glouppe (Member): Remove the lines above. Instead, could you provide an example in a separate file?

amueller (Member): +1. Also: how are we going to test this beast?

GaelVaroquaux (Member):
> Also: how are we going to test this beast?

Check that the energy decreases during the training?

amueller (Member): This is not really what you want. You want the probability to increase, and you usually can't compute the partition function. We could do an example with ~5 hidden units and test that. That would actually be a good test. But: (P)CD learning doesn't guarantee increasing the probability of the data set and may diverge. So I guess plotting the true model probability of the data would be a cool example and we could also test against that. We only need a function to calculate the partition function...

dwf (Member): Well, PCD does, at least asymptotically, as long as you use an appropriately decaying learning rate. But asymptotic behaviour is kind of hard to test.

amueller (Member): Really? I'm pretty sure it doesn't, as you cannot guarantee mixing of the chain. Maybe I have to look into the paper again.

amueller (Member): Doesn't say anything like that in the paper as far as I can tell, and I think I observed divergence in practice. Could be that it is possible to converge if you decay the learning rate fast enough, but that doesn't mean that you converge to a point that is better than the point you started with, only that you stop somewhere.

dwf (Member): The "PCD" paper kind of undersells it; what you want to look at is the Laurent Younes paper "On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates", which he cites (the contribution of the PCD paper was basically making this known to the ML community and demonstrating it on RBMs). There you get guarantees of almost sure convergence to a local maximum of the empirical likelihood, provided your initial learning rate is small enough. If you see things start to diverge or oscillate then your learning rate is probably too high.

ynd (Contributor): done. I do plan to make a separate example. As @amueller and @ogrisel mentioned, I think it would be best to show its use as a feature extractor for a LinearSVC, for example, and to show that it improves the accuracy on the test set very significantly, for example on digits. It would also show how to use GridSearchCV to find the optimal learning rate.
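For reference, a minimal sketch of the example @ynd describes. The module paths follow the 0.12-era layout this PR targets, the import `from sklearn.rbm import RBM` is the class name as it stands in this PR, and the `learning_rate` parameter name is an assumption (the parameter is called `epsilon` at this point in the PR):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split   # sklearn ~0.12 layout
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.rbm import RBM                              # class name as in this PR

digits = load_digits()
X = (digits.data / 16.0 > 0.5).astype(np.float64)        # binarize for a binary-binary RBM
X_train, X_test, y_train, y_test = train_test_split(X, digits.target, test_size=0.2)

# Grid-search the RBM learning rate inside an RBM -> LinearSVC pipeline.
pipe = Pipeline([('rbm', RBM(n_components=100)), ('svc', LinearSVC())])
grid = GridSearchCV(pipe, {'rbm__learning_rate': 10.0 ** np.arange(-3, 1)})
grid.fit(X_train, y_train)

print("RBM features + LinearSVC:", grid.score(X_test, y_test))
print("raw pixels + LinearSVC:  ", LinearSVC().fit(X_train, y_train).score(X_test, y_test))
```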
Member
commented Oct 3, 2012
 Thanks for the contribution - I'm super excited - can't wait to see this merged to master!
and 5 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+        self.epochs = epochs
+        self.random_state = check_random_state(random_state)
+
+    def _sigmoid(self, x):
+        """
+        Implements the logistic function.
+
+        Parameters
+        ----------
+        x: array-like, shape (M, N)
+
+        Returns
+        -------
+        x_new: array-like, shape (M, N)
+        """
+        return 1. / (1. + np.exp(-np.maximum(np.minimum(x, 30), -30)))
```
glouppe (Member): I don't know if this changes anything, but wouldn't it be better to output 0. if x < -30 and 1. if x > 30? Also, since this very particular function is at the core of the algorithm, a Cython version may be a better choice.

mblondel (Member): This method doesn't depend on the object state so it can be changed to a function.

larsmans (Member): Shouldn't we have logistic sigmoid in sklearn.utils.extmath?

amueller (Member): maybe call it logistic_sigmoid, as sigmoid just refers to the shape...

dwf (Member): +1 @amueller. Drives me nuts when people around here simply call this "the sigmoid". :) The logistic function is its proper name.

ynd (Contributor): done
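For reference, a stand-alone version of the helper being discussed (a sketch, not necessarily what will land in sklearn.utils.extmath): clipping the argument keeps np.exp from overflowing, and effectively saturates the output near 0 or 1.

```python
import numpy as np

def logistic_sigmoid(x):
    """Element-wise logistic function 1 / (1 + exp(-x)), with clipping for stability."""
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))
```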
sklearn/rbm.py
```
+        on Machine Learning (ICML) 2008
+        """
+        h_pos = self.mean_h(v_pos)
+        v_neg = self.sample_v(self.h_samples)
+        h_neg = self.mean_h(v_neg)
+
+        self.W += self.epsilon * (np.dot(v_pos.T, h_pos)
+                                  - np.dot(v_neg.T, h_neg)) / self.n_samples
+        self.b += self.epsilon * (h_pos.mean(0) - h_neg.mean(0))
+        self.c += self.epsilon * (v_pos.mean(0) - v_neg.mean(0))
+
+        self.h_samples = self.random_state.binomial(1, h_neg)
+
+        return self.pseudo_likelihood(v_pos)
+
+    def pseudo_likelihood(self, v):
```
amueller (Member): Is this really a pseudolikelihood? Can you give a reference where this was used?

amueller (Member): Hm I guess it is connected to pseudolikelihood... hm...
Member
commented Oct 3, 2012
 @glouppe Afaik classification with RBMs basically doesn't work. Is there any particular reason you want it? Usually throwing the representation into a linear SVM is way better. I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff... If I had known that, I would have contributed one of my implementations ^^ An example would be great. Is there any application where we can demo that this works? Maybe extracting features from digits and then classifying? I guess that would train for a while, though :-/ If we do include an RBM, it should definitely implement both CD1 and PCD.
and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+        v_pos: array-like, shape (n_samples, n_visibles)
+
+        Returns
+        -------
+        pseudo_likelihood: array-like, shape (n_samples,), optional
+            Pseudo Likelihood estimate for this batch.
+
+        References
+        ----------
+        [1] Tieleman, T. Training Restricted Boltzmann Machines using
+            Approximations to the Likelihood Gradient. International Conference
+            on Machine Learning (ICML) 2008
+        """
+        h_pos = self.mean_h(v_pos)
+        v_neg = self.sample_v(self.h_samples)
+        h_neg = self.mean_h(v_neg)
```
amueller (Member): Shouldn't h be sampled here?

ynd (Contributor): h is sampled for the persistent Gibbs chain. For computing the learning statistics, by deriving the log probability w.r.t. the parameters you get a difference between free energy derivatives, which results in using the mean field of the h's (see http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf page 64). When computing the learning statistics, it's mathematically valid and better to use the mean field because it adds less noise to learning.

dwf (Member): @amueller Yoshua's book doesn't mention this AFAIK, but this technique (replacing a sample with its expectation in a formula) is known as "Rao-Blackwellization", and it's trivial to demonstrate that it's still an unbiased estimator but with lower variance than the original sample-based estimate. See, for example, this tutorial.

ynd (Contributor): @dwf cool, I had somehow overlooked that proof :).
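For reference, a tiny numeric illustration of the point above (illustration only, not PR code): replacing a Bernoulli sample by its expectation keeps the estimator unbiased while removing the sampling noise, which is what using the mean-field h instead of a sampled h does for the gradient statistics.

```python
import numpy as np

rng = np.random.RandomState(0)
p = rng.uniform(size=(100, 50))          # hypothetical P(h_j = 1 | v) for a batch

estimates_sampled = [rng.binomial(1, p).mean() for _ in range(1000)]
estimate_meanfield = p.mean()            # the "Rao-Blackwellized" version

print(np.mean(estimates_sampled), estimate_meanfield)   # both close to p.mean()
print(np.std(estimates_sampled))                        # > 0: extra noise from sampling
```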
Member
commented Oct 3, 2012
 I just saw that you do implement persistent CD. I guess having only this is fine.
Member
 Maybe we could have a pipeline connecting the RBM and LinearSVC for an example and compare the result to using the raw features :)
and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+class RBM(BaseEstimator, TransformerMixin):
+    """
+    Restricted Boltzmann Machine (RBM)
+
+    A Restricted Boltzmann Machine with binary visible units and
+    binary hiddens. Parameters are estimated using Stochastic Maximum
+    Likelihood (SML).
+
+    The time complexity of this implementation is O(n ** 2) assuming
+    n ~ n_samples ~ n_features.
+
+    Parameters
+    ----------
+    n_components : int, optional
+        Number of binary hidden units
+    epsilon : float, optional
```
amueller (Member): maybe just call this learning_rate?

mblondel (Member): +1, this would be consistent with SGDClassifier.
and 2 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+    binary hiddens. Parameters are estimated using Stochastic Maximum
+    Likelihood (SML).
+
+    The time complexity of this implementation is O(n ** 2) assuming
+    n ~ n_samples ~ n_features.
+
+    Parameters
+    ----------
+    n_components : int, optional
+        Number of binary hidden units
+    epsilon : float, optional
+        Learning rate to use during learning. It is *highly* recommended
+        to tune this hyper-parameter. Possible values are 10**[0., -3.].
+    n_samples : int, optional
+        Number of fantasy particles to use during learning
+    epochs : int, optional
```
amueller (Member): Could you please use n_epochs?

mblondel (Member): @amueller Or simply, n_iter :)

dwf (Member): n_iter is more ambiguous, I think. Even if "epoch" is jargony, it refers to a full sweep through the dataset, whereas "iter" could mean the number of [stochastic] gradient updates.

amueller (Member): I think n_iter would be the right thing. We use n_iter in SGDClassifier already, where it has the same meaning as here.
Member
commented Oct 3, 2012
 @amueller Well, I was suggesting that because it nearly comes for free. It would have shown how an RBM could be used in a stand-alone way, as a pure generative model. But I agree that using an RBM to generate features and then feed them to another classifier is probably a better idea in terms of end accuracy.

> I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff... If I knew that I would have contributed one of my implementations ^^

It happens that I came into the machine learning world through RBMs :) That's why I appreciate that model, though it's purely subjective.
Member
commented Oct 3, 2012
 @glouppe I also came to machine learning via RBMs and I really dislike them and I am convinced they don't work - though it's purely subjective ;)
Member
commented Oct 3, 2012
 @ynd Great job by the way! Looks very good :)
Member
commented Oct 3, 2012
 @amueller Haha :-) Actually, my experience with RBMs is limited to Netflix and to image datasets, where I found accuracy to be quite decent. However, I confess that adjusting all the hyperparameters is a real nightmare. Maybe that's the reason why I switched to ensemble methods?
Member
 Nice job on this. I actually worked a bit with @genji on a sklearn version of this. In this case, it could be good if he could have a look at it. Well done!
Member
> @glouppe Afaik classification with RBMs basically doesn't work. Is there any particular reason you want it? Usually throwing the representation into a linear SVM is way better.

OK, we'll need an example showing this, hopefully with a pipeline.
Member
> I'm a bit surprised that so many people want an RBM. I thought we kind of didn't want "deep learning" stuff... If I knew that I would have contributed one of my implementations ^^

We're growing, I guess. @amueller: if you have a good implementation, you can probably give good advice / a good review on this PR. I have found that the best way to get good code is to keep the good ideas from different codebases developed independently.
Member
> Nice job on this. I actually worked a bit with @genji on a sklearn version of this..

You should give a pointer to the codebase.
Member
commented Oct 3, 2012
 It would be great if someone could benchmark this implementation against @dwf's gist: https://gist.github.com/359323 in terms of CPU time till convergence, memory usage, and ability to generate features suitable for feeding to a linear classifier such as sklearn.linear_model.LinearSVC. The dataset to use for this bench could be a subset of MNIST (e.g. 2 digits such as 0 versus 8) to make it faster to evaluate.
Member
commented Oct 3, 2012
 BTW @ynd great to see you contributing to sklearn. I really enjoyed your NIPS presentation last year. Also it would be great to submit a sklearn version of your numpy CAE that includes the sampling trick in another PR :)
Member
commented Oct 3, 2012
> Nice job on this. I actually worked a bit with @genji on a sklearn version of this..

We would indeed appreciate @genji's input on reviewing this PR; he could maybe point to common implementation pitfalls to avoid.
Member
commented Oct 3, 2012
 oh @ynd I didn't recognize you from the picture. I think we talked quite a while at your poster last NIPS. After that, my lab started working on contractive auto-encoders! Your work is really interesting!
and 3 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
@@ -0,0 +1,307 @@
+""" Restricted Boltzmann Machine
+"""
+
+# Author: Yann N. Dauphin
+# License: BSD Style.
+
+import numpy as np
+
+from .base import BaseEstimator, TransformerMixin
+from .utils import array2d, check_random_state
+
+
+class RBM(BaseEstimator, TransformerMixin):
```
mblondel (Member): We try to avoid acronyms so RestrictedBoltzmannMachine sounds better. Also, we may want to bootstrap the neural_network module.

ogrisel (Member): +1, esp. if Contractive Autoencoders are to follow.

mblondel (Member): I was also thinking of the multi-layer perceptron, if @amueller feels motivated :)

amueller (Member): Yeah, I know, I should really look into that again... motivation level... not that high ;)

ynd (Contributor): A 'neural_network' or 'neural' module sounds really good. @ogrisel Yes, Contractive Autoencoders are to follow! :) @amueller There's a nice paper this year that makes MLPs much simpler to use; they show how to automatically tune the learning rate (http://arxiv.org/pdf/1206.1106v1). That only leaves the n_components hyper-parameter!

amueller (Member): I saw the paper but didn't try it out yet. Have you given it a try? By the way, that looks a lot like RPROP, which we often use (but is a lot less heuristic). Do you think that would also be interesting for the SGD module?
sklearn/rbm.py
```
+    A Restricted Boltzmann Machine with binary visible units and
+    binary hiddens. Parameters are estimated using Stochastic Maximum
+    Likelihood (SML).
+
+    The time complexity of this implementation is O(n ** 2) assuming
+    n ~ n_samples ~ n_features.
+
+    Parameters
+    ----------
+    n_components : int, optional
+        Number of binary hidden units
+    epsilon : float, optional
+        Learning rate to use during learning. It is *highly* recommended
+        to tune this hyper-parameter. Possible values are 10**[0., -3.].
+    n_samples : int, optional
+        Number of fantasy particles to use during learning
```
mblondel (Member): n_particles sounds better, as n_samples is usually used for X.shape[0].
closed this Oct 3, 2012
reopened this Oct 3, 2012
Member
commented Oct 3, 2012
 Oops, somehow I hit the close button. Sorry about that. I'd just like to raise the obvious skeptic's take here: RBMs are relatively finicky to get to work properly for an arbitrary dataset, especially as compared to a lot of the models that sklearn implements. You don't even have a very good way of doing model comparison for non-trivial models, except approximately via Annealed Importance Sampling, which itself requires careful tuning. This was my reason for never submitting my own NumPy-based implementation: if sklearn is looking to provide canned solutions to non-experts, RBMs and deep learning in general are pretty much the antithesis of that (at least for now; the "no more pesky learning rates" paper offers hope for deterministic networks, but their technique does not apply in the least to the case where you have to stochastically estimate the gradient with MCMC, as here). On one hand, it'd be nice to get these techniques more exposure, but on the other, their inclusion in scikit-learn may be a false endorsement of how easy they are to get to work.
Member
commented Oct 3, 2012
 I completely agree with @dwf. I thought that was the reason we didn't include it yet. As many people seem to be excited about having it, I guess we could give it a shot, though ;)
 Yann N. Dauphin initialize W with random_state 9c4afeb
Contributor
commented Oct 3, 2012
 Thanks for all the feedback. Anyway, I'll answer each comment separately as I make commits. @ogrisel @amueller Thanks!
added some commits Oct 3, 2012
- Yann N. Dauphin moved verbose parameter to init (be2b734)
- Yann N. Dauphin removed empty main function (57243d3)
- Yann N. Dauphin removed verbose from fit_transform (7016300)
Contributor
commented Oct 3, 2012
 @dwf In my experience, many people are interested in these techniques but don't want to spend the initial investment of implementing them. It's true that using RBMs is not as straightforward as using PCA, but I think including them in scikit-learn will allow a lot of people to develop the knowledge necessary to use them. This is good because a lot of the difficulty of using RBMs comes from the fact that only a few people in the community know how to use them. More users means more documentation, tutorials, papers, etc.
added some commits Oct 3, 2012
- Yann N. Dauphin rename epochs to n_epochs (3fd5947)
- Yann N. Dauphin rename n_samples to n_particles (e7b4fc6)
and 4 others commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+
+    Attributes
+    ----------
+    W : array-like, shape (n_visibles, n_components), optional
+        Weight matrix, where n_visibles in the number of visible
+        units and n_components is the number of hidden units.
+    b : array-like, shape (n_components,), optional
+        Biases of the hidden units
+    c : array-like, shape (n_visibles,), optional
+        Biases of the visible units
+
+    Examples
+    --------
+
+    >>> import numpy as np
+    >>> from sklearn.rbm import RBM
```
larsmans (Member): The top-level module is getting cluttered. Shouldn't this be in sklearn.deep, sklearn.neural, ...?

amueller (Member): +1 sklearn.neuralnet?

GaelVaroquaux (Member): sklearn.neuralnet? neural_net dude! :}

amueller (Member): lol sorry ^^

larsmans (Member): Why _net?

vene (Member): shouldn't this be s/RBM/RestrictedBoltzmannMachine here and 2 lines below? As for the module, neural_network is still shorter than cross_validation, why not?

ynd (Contributor):
> shouldn't this be s/RBM/RestrictedBoltzmannMachine here and 2 lines below?

Thanks.
and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
```
+        ----------
+        x: array-like, shape (M, N)
+
+        Returns
+        -------
+        x_new: array-like, shape (M, N)
+        """
+        return 1. / (1. + np.exp(-np.maximum(np.minimum(x, 30), -30)))
+
+    def transform(self, v):
+        """
+        Computes the probabilities P({\bf h}_j=1|{\bf v}).
+
+        Parameters
+        ----------
+        v: array-like, shape (n_samples, n_visibles)
```
larsmans (Member): n_visibles is n_samples, right? If so, then this should just be called X, not v.

ynd (Contributor): Here n_visibles is actually n_features (it has been renamed in a more recent commit). v is indeed the input samples, but I use the notation that is common when dealing with RBMs to make things clearer.
sklearn/rbm.py
```
+
+    def transform(self, v):
+        """
+        Computes the probabilities P({\bf h}_j=1|{\bf v}).
+
+        Parameters
+        ----------
+        v: array-like, shape (n_samples, n_visibles)
+
+        Returns
+        -------
+        h: array-like, shape (n_samples, n_components)
+        """
+        return self.mean_h(v)
+
+    def mean_h(self, v):
```
larsmans (Member): mean_hidden
and 1 other commented on an outdated diff Oct 3, 2012
sklearn/rbm.py
 + """ + return self.mean_h(v) + + def mean_h(self, v): + """ + Computes the probabilities P({\bf h}_j=1|{\bf v}). + + Parameters + ---------- + v: array-like, shape (n_samples, n_visibles) + + Returns + ------- + h: array-like, shape (n_samples, n_components) + """ + return self._sigmoid(np.dot(v, self.W) + self.b)
larsmans (Member): There's no input validation here. Also, with safe_sparse_dot, sparse matrix input can be supported trivially.

ynd (Contributor): Thanks.
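For reference, a sketch of the change @larsmans suggests (illustration only, with W and b standing in for the weight matrix and hidden biases as they are named at this point in the PR): safe_sparse_dot lets the same expression accept dense arrays and scipy.sparse matrices.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.extmath import safe_sparse_dot

def mean_hidden(v, W, b):
    """P(h_j = 1 | v) for dense or sparse v of shape (n_samples, n_features)."""
    return 1.0 / (1.0 + np.exp(-(safe_sparse_dot(v, W) + b)))

W = np.random.normal(0, 0.01, (6, 3))
b = np.zeros(3)
dense = np.random.rand(5, 6)
# The dense and sparse code paths give the same probabilities.
print(np.allclose(mean_hidden(dense, W, b), mean_hidden(sparse.csr_matrix(dense), W, b)))
```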
added some commits Oct 3, 2012
- Yann N. Dauphin support for sparse inputs (82ad100)
- Yann N. Dauphin renamed epsilon to learning rate (2c2cdfb)
- Yann N. Dauphin rename class to full name (ea2c5c9)
- Yann N. Dauphin mistake in the documented complexity (47af232)
- Yann N. Dauphin use n_features instead of n_visibles to be consistent with the codebase (92f7ea8)
- Yann N. Dauphin rename parameters to be consistent with codebase (dd79de0)
- Yann N. Dauphin 30% speed-up thanks to in-place binomial from https://gist.github.com/359323 (c02cdcf)
- Yann N. Dauphin 12% with ingenious ordering of operations (5e2abdf)
- Yann N. Dauphin inplace gradients for components_, 2% speed-up (e3182b4)
- Yann N. Dauphin rename h_samples to h_samples_ (b9e20cc)
commented on sklearn/rbm.py in dd79de0 Oct 4, 2012
 I just noticed that in PCA the shape is (n_components, n_features), so could you use the transpose of the above in your code? Also, when you initialize components_, try order="fortran" and order="c" to see which one is the fastest.
Owner
replied Oct 5, 2012
> I just noticed that in PCA the shape is (n_components, n_features), so could you use the transpose of the above in your code? Also, when you initialize components_, try order="fortran" and order="c" to see which one is the fastest.

Done. There is no performance penalty by switching to order='fortran'. …
Member
commented Oct 4, 2012
> You don't even have a very good way of doing model comparison for non-trivial models

I'm not familiar with RBMs, but can't you use the accuracy of the last step of your pipeline? (e.g., a LinearSVC)
Member
commented Oct 4, 2012
 @mblondel Using a classifier is a very indirect method of evaluating RBMs. It has nothing to do with the objective function :-/
sklearn/rbm.py
```
+        v_pos: array-like, shape (n_samples, n_features)
+
+        Returns
+        -------
+        pseudo_likelihood: array-like, shape (n_samples,), optional
+            Pseudo Likelihood estimate for this batch.
+
+        References
+        ----------
+        [1] Tieleman, T. Training Restricted Boltzmann Machines using
+            Approximations to the Likelihood Gradient. International Conference
+            on Machine Learning (ICML) 2008
+        """
+        h_pos = self.mean_h(v_pos)
+        v_neg = self.sample_v(self.h_samples_)
+        h_neg = self.mean_h(v_neg)
```
amueller (Member): I think the comment got lost during the last commits (damn you github): shouldn't you sample here?

amueller (Member): whoops just saw your reply, which also got somehow lost... meh!
Member
commented Oct 4, 2012
 @amueller If your goal at hand is classification, that seems like a very valid way to choose hyperparameters to me. Unless you extract features for the sake of it? :) Anyway, all unsupervised algorithms (latent Dirichlet allocation, sparse coding...) have similar issues...
and 1 other commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
```
+        x: array-like, shape (M, N)
+
+        Notes
+        -----
+        This is equivalent to calling numpy.random.binomial(1, p) but is
+        faster because it uses in-place operations on p.
+
+        Returns
+        -------
+        x_new: array-like, shape (M, N)
+        """
+        p[self.random_state.uniform(size=p.shape) < p] = 1.
+
+        return np.floor(p, p)
+
+    def transform(self, v):
```
larsmans (Member): I still think this should be called X. We follow our own naming conventions rather than those of [fill in subfield of ML here], since otherwise practically every module would have to follow different conventions.

mblondel (Member): Agreed with @larsmans.
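For reference, a stand-alone version of the in-place sampling trick quoted above (illustration only): overwrite p with Bernoulli(p) samples without allocating a new array.

```python
import numpy as np

def sample_bernoulli_inplace(p, rng):
    """Equivalent to rng.binomial(1, p) but reuses the memory of p."""
    p[rng.uniform(size=p.shape) < p] = 1.0   # winning entries become exactly 1
    return np.floor(p, p)                    # everything still below 1 is floored to 0, in place

rng = np.random.RandomState(0)
probs = rng.uniform(size=(4, 3))
samples = sample_bernoulli_inplace(probs, rng)   # array of 0.0 / 1.0 values
```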
Member
commented Oct 4, 2012
 @mblondel it makes it very hard to judge learning algorithms for the method, though. For sparse coding, there is an obvious way to tell how good your optimization was! For RBMs it is not possible to evaluate your objective function and you basically can't tell whether you made the model any better in terms of what you claim to be doing. If you only evaluate using classification, you might end up with an algorithm that has nothing to do with the original objective. That is fine with me, but then you shouldn't claim that it has something to do with graphical models any more.
Member
commented Oct 4, 2012
 @amueller: I agree with your point if your goal is evaluating an optimization algorithm. I was talking about hyperparameter tuning :)
Member
commented Oct 4, 2012
 Argh, learning_rate is actually a string in SGDClassifier... Maybe the closest name is eta0... @pprett
Member
commented Oct 4, 2012
 eta0 is not a very descriptive name. We should not make it a convention. initial_learning_rate would be better IMHO.
Member
commented Oct 4, 2012
 If I understand the code correctly, here, the learning rate is constant...
Member
commented Oct 4, 2012
 @mblondel yes, maybe that is not set in stone, though (wdyt @ynd ?) @ogrisel agree. -1 on eta0.
Member
commented Oct 4, 2012
 @mblondel The blame is on me - SGDClassifier's learning_rate is actually the learning rate schedule; eta0 is the initial learning rate. Even worse, GradientBoostingClassifier uses learn_rate for the learning rate (this time it's really the learning rate, not the schedule)... it seems I'm the nemesis of the consistency brigade :-/ Is there any other estimator that has a learning rate parameter?
Member
commented Oct 4, 2012
 Well, Perceptron, but that's obviously SGD in disguise. As a proposed fix for SGD, how about...

- learning_rate="constant", eta0=.1 ⇨ learning_rate=("constant", .1)
- learning_rate="constant", eta0=None ⇨ learning_rate="constant"
- learning_rate=None, eta0=.2 ⇨ learning_rate=.2

That way, we can keep the learning_rate parameter and we only have to deprecate eta0, and we don't have to introduce parameters that require a lot of typing. (And to counter the "flat is better than nested" that is bound to come up, I'd like to add once again that I'm Dutch ;)
Member
commented Oct 4, 2012
 @larsmans basically I am +1 on renaming, though I am not entirely certain about how this will actually work. Also, we should keep grid searches in mind. I usually use a decaying learning rate and grid-search over eta0.

Before: param_dict = dict(eta0=2. ** np.arange(-5, -1))

Proposed: param_dict = dict(learning_rate=[("optimal", x) for x in 2. ** np.arange(-5, -1)])

Not horrible but still awkward. We should keep in mind that users might not be as well-versed in Python as we are ;)
Member
commented Oct 4, 2012
 +1 for designing APIs in a way that makes grid search code cleaner where the parameter is something you're more likely than not to optimize by grid search.
and 1 other commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
```
+    A Restricted Boltzmann Machine with binary visible units and
+    binary hiddens. Parameters are estimated using Stochastic Maximum
+    Likelihood (SML).
+
+    The time complexity of this implementation is O(d ** 2) assuming
+    d ~ n_features ~ n_components.
+
+    Parameters
+    ----------
+    n_components : int, optional
+        Number of binary hidden units
+    learning_rate : float, optional
+        Learning rate to use during learning. It is *highly* recommended
+        to tune this hyper-parameter. Possible values are 10**[0., -3.].
+    n_particles : int, optional
+        Number of fantasy particles to use during learning.
```
vene (Member): Could this be more specific? Despite my knowing vaguely what an RBM is and what it would be useful for, I've never encountered this term so I wouldn't know where to start setting it. Unless I go read the reference first.

amueller (Member): I think we could say that fantasy particles are particles in the MCMC chain. Apart from that, I think reading the docs would be the way to go ;)
commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
```
+        Biases of the visible units
+
+    Examples
+    --------
+
+    >>> import numpy as np
+    >>> from sklearn.rbm import RBM
+    >>> X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
+    >>> model = RBM(n_components=2)
+    >>> model.fit(X)
+
+    References
+    ----------
+
+    [1] Hinton, G. E., Osindero, S. and Teh, Y. A fast learning algorithm for
+        deep belief nets. Neural Computation 18, pp 1527-1554.
```
vene (Member): Since the URI is so clean I would add it here. WDYT? http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
and 2 others commented on an outdated diff Oct 4, 2012
sklearn/rbm.py
```
+
+        np.random.shuffle(inds)
+
+        n_batches = int(np.ceil(len(inds) / float(self.n_particles)))
+
+        for epoch in range(self.n_epochs):
+            pl = 0.
+            for minibatch in range(n_batches):
+                pl += self._fit(X[inds[minibatch::n_batches]]).sum()
+            pl /= X.shape[0]
+
+            if self.verbose:
+                print "Epoch %d, Pseudo-Likelihood = %.2f" % (epoch, pl)
+
+    def fit_transform(self, X, y=None):
+        """
```
vene (Member): If there is no trick that allows you to get the transformed value while fitting for free, there is no use in defining this implementation. The naive fit_transform as a sequence of self.fit and self.transform is already defined in TransformerMixin.

larsmans (Member): fit_transform would help if there were input validation, though, as it might prevent a copy.

ynd (Contributor): @vene Somehow if I don't define it, I get a failed test in sklearn.tests.test_common.test_transformers when I run make. Basically the default one outputs the wrong shape.
Member
commented Oct 4, 2012
 @amueller Good point, never mind.
Contributor
commented Oct 4, 2012
 @ynd nice work! @amueller I wanted to respond to your statement "it makes it very hard to judge learning algorithms for the method, though. For sparse coding, there is an obvious way to tell how good your optimization was!" You're right that you can actually see which model gets a lower sparse-coding score, but you should also consider that when using sparse coding for feature extraction, there is no natural way to pick the sparsity penalty in the first place... so ultimately you are in the same "guess some hyper-parameters, optimize for a while, test the representation" process as with RBMs. AFAIK there is no approach to unsupervised feature learning that comes with a measure of how effective subsequent supervised learning will be. I would even conjecture that the same old no-free-lunch theorem applies anyway to representations learned by unsupervised learning, just like it applies to raw data.
added some commits Oct 4, 2012
- Yann N. Dauphin use X instead of v as parameter to transform (8303ef2)
- Yann N. Dauphin use X instead of v as parameter to transform (930c173)
Contributor
commented Oct 4, 2012
> @mblondel yes, maybe that is not set in stone, though (wdyt @ynd?) @ogrisel agree. -1 on eta0.

It's not set in stone, but a constant learning rate by default is better. It makes cross-validation much easier. I can add some schedules later. Maybe with classes for handling the schedules?
added some commits Oct 4, 2012
- Yann N. Dauphin clarify n_particles documentation (548a357)
- Yann N. Dauphin bugfix doctest (73de6f7)
- Yann N. Dauphin added URI for RBM reference (47ef7fa)
Member
commented Oct 4, 2012
 @ynd Why do you say that it makes cross-validation easier? The trouble with constant learning rates is that mixing can really break down later in training. Depending on how nasty your dataset is, you really want to anneal.
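For reference, a minimal sketch (not part of this PR) of the kind of annealed schedule dwf is referring to; eta0 and tau are illustrative constants, and the form decays roughly like 1/t so the usual stochastic-approximation conditions (the sum of rates diverges, the sum of squared rates converges) hold.

```python
def annealed_learning_rate(eta0, t, tau=1000.0):
    """Learning rate for update number t, decaying from eta0 roughly like 1/t."""
    return eta0 / (1.0 + t / tau)

print([round(annealed_learning_rate(0.1, t), 4) for t in (0, 1000, 10000, 100000)])
# [0.1, 0.05, 0.0091, 0.001]
```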
added some commits Oct 4, 2012
- Yann N. Dauphin improved docstring for transform (f8e635d)
- Yann N. Dauphin renamed _sigmoid to _logistic_sigmoid (5f5b5c9)
- Yann N. Dauphin use double backquotes around equations (33afdd3)
- Yann N. Dauphin logistic_sigmoid moved to function (4a34bcf)
referenced this pull request Oct 5, 2012:

#### Name learning rate related parameters consistently across all estimators #1206 (Closed)

Member
commented Oct 5, 2012
 @larsmans @amueller @mblondel @ogrisel let's move the parameter renaming discussion into a separate issue - I've opened #1206 and copied some of your posts
Member
commented Oct 5, 2012
> @amueller I wanted to respond to your statement "it makes it very hard to judge learning algorithms for the method, though. For sparse coding, there is an obvious way to tell how good your optimization was!" [...]

Thanks for your response. I feel it is great to share ideas :) Maybe this is a bit arbitrary, but I like to think separately of the optimization parameters and the model parameters. There is no use in doing model selection without looking at the goal of the whole process, i.e. classification. But once I have specified a model, I would hope that I am actually optimizing the model that I specified. With sparse coding, I can be pretty sure what I am optimizing; with RBMs, not so much. In practice, in particular with non-convex optimization, this becomes much more blurred. Your perspective, "guess some hyper-parameters, optimize for a while, test the representation", is more pragmatic and maybe more realistic, I guess. Though I feel that with this perspective, there is no model any more, only an algorithm. I feel it is much harder to argue for algorithms than for models, which is why I don't really like this perspective ;) I hope that didn't sound too naive ;) Cheers, Andy
Contributor
commented Oct 5, 2012
> Your perspective "guess some hyper-parameters, optimize for a while, test the representation" is more pragmatic and maybe more realistic, I guess. Though I feel that with this perspective, there is no model any more, only an algorithm. I feel it is much harder to argue for algorithms than for models, which is why I don't really like this perspective ;)

That's right... I do see most feature-learning algorithms as just that: algorithms, which are inspired by various principles that are not obviously related to classification. There is still a classifier coming out of each application of the algorithm though, and cross-validation is a good way of evaluating those models.
added some commits Oct 5, 2012
- Yann N. Dauphin transposed components_, no performance penalty (e4f5a53)
- Yann N. Dauphin only compute pseudolikelihood if verbose=True (2163337)
- Yann N. Dauphin more accurate pseudo-likelihood (ae0c10e)
Contributor
commented Oct 5, 2012
> It would be great if someone could benchmark this implementation against @dwf's gist: https://gist.github.com/359323

@ogrisel I have written a small benchmark (https://gist.github.com/3842732) to compare the implementations. First, I had to bugfix it (using np.empty instead of np.zeros to initialize parameters leads to NaNs). Then I removed the l2 penalty and momentum, along with objective estimation (reconstruction error) and stdout output, in an effort to benchmark the core efficiency. The task is modeling 20 newsgroups data. Running the script gives us:

```
Scikit Time: 58.4955779314 seconds
Gist Time:   100.643937469 seconds
Scikit PL:   -270.043485947
Gist PL:     -282.588828001
```

The proposed scikit implementation is 72% faster while giving better pseudo-likelihood after two epochs. The edge in pseudo-likelihood is probably due to using PCD instead of CD. The test was run with numpy 1.6.1, 64-bit Python 2.7.2 on an Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz.
Member
> The proposed scikit implementation is 72% faster while giving better pseudo-likelihood after two epochs. The edge in pseudo-likelihood is probably due to using PCD instead of CD.

Good job!
Member
commented Oct 6, 2012
 @ynd could you maybe comment on using pseudo-likelihood as an evaluation criterion? Is there a reference for that?
and 3 others commented on an outdated diff Dec 30, 2012
sklearn/neural_networks/rbm.py
```
+        self.intercept_visible_ = np.zeros(X.shape[1], dtype=X.dtype)
+        self.h_samples_ = np.zeros((self.n_particles, self.n_components),
+                                   dtype=X.dtype)
+
+        inds = np.arange(X.shape[0])
+        self.random_state.shuffle(inds)
+
+        n_batches = int(np.ceil(len(inds) / float(self.n_particles)))
+
+        verbose = self.verbose
+        for iteration in xrange(self.n_iter):
+            pl = 0.
+            if verbose:
+                begin = time.time()
+            for minibatch in xrange(n_batches):
+                pl_batch = self._fit(X[inds[minibatch::n_batches]])
```
vene (Member): This seems to be done slightly differently than the usual minibatches in scikit-learn. I wonder if it affects the speed or anything. Anyway, since these minibatches are indirectly implied by n_particles, I think I should read the paper first. I wonder if it's just me or if it's actually confusing: I ran the RBM before reading the code and noticed that the higher n_particles is, the quicker it is, whereas my ignorant intuition was that "using more stuff should take more time". I commented once before about that parameter and I still argue that it's a particularity of the learning algorithm and not directly of RBMs themselves, so I wouldn't mind having its effect intuitively explained.

dwf (Member): I haven't looked at this in a while and I'm not sure what stopping criterion Yann is using, but more particles will mean you have a better/more stable estimate of the negative phase term in the gradient. The number of particles is, or can be, effectively independent of the minibatch size if you're doing stochastic maximum likelihood a.k.a. "persistent contrastive divergence". The basic idea is this: the gradient of the log likelihood of an RBM comprises two terms, one tractable and the other intractable because it involves the gradient of the partition function (i.e. the sum of the exponentiated energies of all configurations). We need to approximate this with a sample, but sampling is also intractable, and running a Markov chain to convergence at each step of the learning is hopeless. Geoff Hinton's contrastive divergence trick is to run the chain for 1 step and hope for the best, and in practice this makes for pretty good feature extractors and lousy density models. In the late 90s Laurent Younes showed that, assuming your model parameters don't change too quickly (i.e., your learning rate is low), it is theoretically sound to keep samples from a "previous version" of your model (i.e. the model before the last parameter update) and treat them as if they were samples under the "current version" of your model, and thus keep a "persistent" Markov chain going throughout learning even though the distribution from which this Markov chain is sampling is constantly changing. In this way you get approximate but unbiased gradients of the log likelihood, and in practice this leads to much better density models, as these samples manage to explore a much bigger region of the state space than is the case with the 1-step CD hack.

vene (Member): Thanks for the input @dwf, I was just about to ping you :-) I do understand contrastive divergence and the idea behind PCD, and I also understand why @amueller doesn't like it 👿 What was unclear to me was the dynamics of the n_particles parameter. Also, I was referring to its description in the docstring, but since I'll be writing the narratives, it is probably important for me to understand everything too. @ynd's code doesn't have any stopping condition, but an n_iter parameter (that must be optimized via grid search). I will check how you do it in your gist. Since I got your attention: is there any value in increasing the k in CD-k after some time? IIUC it helps with the density modelling part (makes the learning closer to actual ML) but that's not really important here, is it? The implementation here should IMHO be as close to black-box as possible, and if users are on to something good they should use or extend the one in pylearn.

dwf (Member): Ah dammit, I just realized that I had omitted actual discussion of the number of particles. The story is basically this: in CD it is common to run as many 1- or K-step chains as there are training examples in your minibatch. This is because (in the one-step case) you've already done half the work for that in computing the positive phase gradients, so why not threshold and do two more matrix multiplies. Since the particles (Markov chain states) stick around in PCD, there is no intrinsic reason that the sums in the positive phase and negative phase terms need to have the same number of terms - they are approximating expectations under different distributions, and there's no reason you need the same number of terms in each. As for a stopping condition, there isn't really any good answer; I think I just use a fixed number of iterations too. I've not really heard of annealing K upwards, but it makes sense that it would help somewhat. It's also far more expensive than PCD :) and the best annealing schedule is not self-evident and would probably be problem dependent. There are definitely valid ways to use an RBM as a density model even though it's intractable to compute the probability of a test example (e.g. using the free energy of a test example to do outlier detection), for which PCD training would make a gigantic difference. I can't remember the sklearn unsupervised API at the moment, but if there is something that gives you a scalar score for each example passed in, the free energy would be the thing to use for this (note that since the partition function is different for different RBM instances this cannot be used to do model comparison, but it can be used to judge the log probability a given model assigns to a point up to an additive constant, and to judge an individual model's relative preference for one example over another). It depends whether you want to leave open these use cases, or only care about feature extraction for classification.

amueller (Member): I said I don't like PCD? Can't remember that. I might have said I don't like RBMs ;) But PCD is way better than CD!

amueller (Member): Just quickly about the other points: I think I read something about CD-k with k increasing (possibly Russ Salakhutdinov?) but I wouldn't include it. PCD is a good tradeoff of complexity and performance imho. Parallel tempering might work better but is also more to fiddle with. Also, I think fixing the number of iterations is the best stopping criterion. I guess the free energy, or otherwise the pseudo-likelihood above, would be the thing to monitor (I haven't looked at it in more detail though). @dwf how can you use the free energy to judge convergence?

dwf (Member): I don't think you can, because that unknown constant is not actually constant with respect to learning (the partition function is always changing). Average free energy of training data minus average free energy of validation data might be a useful quantity to monitor (basically the "odds" of training vs. validation, such that the partition functions cancel) but I haven't tried it.

amueller (Member): Yeah, that was my impression also. I thought you suggested it above but that was probably a misunderstanding.

vene (Member): @amueller I just meant you don't like training graphical models with algorithms that are so approximate that you can't really call it a graphical model anymore 😉

ynd (Contributor):
> I ran the RBM before reading the code and noticed that the higher n_particles is, the quicker it is, whereas my ignorant intuition was that "using more stuff should take more time".

This is the number of particles used to compute the gradient and make a gradient step. This means that you need to compute fewer gradients to perform an iteration over the training set. If you set n_particles to the size of the training set, you will perform only one gradient step. That's fast because the matrix-matrix multiplication in BLAS will be very cache friendly.
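For reference, a compact sketch of one PCD/SML update as described in the thread above (illustration only, written against plain NumPy rather than this PR's class; W is (n_features, n_components), b and c are the hidden and visible biases, and h_particles holds the persistent fantasy particles).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(W, b, c, v_pos, h_particles, lr, rng):
    # positive phase: mean-field hidden activations for the data mini-batch
    h_pos = sigmoid(np.dot(v_pos, W) + b)
    # negative phase: one Gibbs half-step from the persistent hidden samples
    v_neg = rng.binomial(1, sigmoid(np.dot(h_particles, W.T) + c))
    h_neg = sigmoid(np.dot(v_neg, W) + b)
    # gradient step on weights and biases
    W += lr * (np.dot(v_pos.T, h_pos) - np.dot(v_neg.T, h_neg)) / len(v_pos)
    b += lr * (h_pos.mean(axis=0) - h_neg.mean(axis=0))
    c += lr * (v_pos.mean(axis=0) - v_neg.mean(axis=0))
    # keep the chain persistent: resample hidden states for the next update
    h_particles = rng.binomial(1, h_neg)
    return W, b, c, h_particles

rng = np.random.RandomState(0)
W = rng.normal(0, 0.01, size=(16, 8)); b = np.zeros(8); c = np.zeros(16)
particles = np.zeros((10, 8))
data = rng.binomial(1, 0.5, size=(10, 16)).astype(float)
for _ in range(100):
    W, b, c, particles = pcd_update(W, b, c, data, particles, 0.05, rng)
```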
and 1 other commented on an outdated diff Dec 30, 2012
sklearn/neural_networks/rbm.py
 + """ + Fit the model to the data X. + + Parameters + ---------- + X: array-like, shape (n_samples, n_features) + Training data, where n_samples in the number of samples + and n_features is the number of features. + + Returns + ------- + self + """ + X = array2d(X) + + self.components_ = np.asarray(self.random_state.normal(0, 0.01,
 vene Member Probably better to use 0.01 * self.random_state.randn(self.n_components, X.shape[1]). Also, is 0.01 always good / as good as any other value? ynd Contributor 0.01 is recommended by Hinton in http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf . In practice it works better than other values on a range of problems. On Sun, Dec 30, 2012 at 7:28 PM, Vlad Niculae notifications@github.com wrote: self.random_state.randn
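For reference, the initialization being discussed is just a scaled standard normal; a minimal sketch (sizes and names hypothetical):

```python
import numpy as np

rng = np.random.RandomState(42)
n_components, n_features = 256, 64

# Small Gaussian weights (std 0.01, per Hinton's practical guide) keep the hidden
# units in the near-linear regime of the sigmoid at the start of training.
components = 0.01 * rng.randn(n_components, n_features)
```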
Member
commented Dec 30, 2012
 I am reworking the example on my local branch. I will also start working on the docs here, and if @ynd is still busy by that time I will try to improve the tests too. Maybe we can get this merged soon. WDYT about adding rectified linear units? http://www.csri.utoronto.ca/~hinton/absps/reluICML.pdf
Member
 I'd rather get the standard binary version merged first; then we can think about Gaussian or rectified linear units.
Member
commented Dec 31, 2012
 1000 filters extracted from MNIST with learning rate 0.001 for 50 epochs: Still trying to get something meaningful out of the sklearn digits dataset...
Contributor
commented Jan 1, 2013
 Digits is a pretty small and easy task, that's why things are close. It would be nice if you did the narrative documentation, I wouldn't be able to get it done soon. On Sun, Dec 30, 2012 at 2:17 PM, Vlad Niculae notifications@github.com wrote: As it is at this point, the example doesn't really illustrate much: the averaged F-score is identical (after rounding), and you just see some precision/recall tradeoffs across classes. I wonder whether this is because of the small dataset. Maybe the whole classification report is not even interesting here, just the score. I'll play around with it a bit. I noticed that the components learned are very close to one another with just a few weights different. I volunteer to do the narrative documentation. — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-11764427.
Contributor
commented Jan 1, 2013
 Try a higher learning rate, that will help get cleaner filters. Yann On Mon, Dec 31, 2012 at 3:29 PM, Vlad Niculae notifications@github.com wrote: 1000 filters extracted from MNIST with learning rate 0.001 for 50 epochs: Still trying to get something meaningful out of the sklearn digits dataset... — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-11778040.
 Yann N. Dauphin use train_test_split 8891e0e
Member
commented Jan 1, 2013
 On Tue, Jan 1, 2013 at 10:25 AM, Yann N. Dauphin notifications@github.com wrote: Try a higher learning rate, that will help get cleaner filters. Also momentum.
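A momentum update, as suggested here, could look roughly like the following sketch; grad stands for whatever gradient estimate (CD or PCD) the fit step produces, and every name is illustrative rather than part of this PR:

```python
import numpy as np

def momentum_step(param, grad, velocity, learning_rate=0.01, momentum=0.9):
    """One heavy-ball update: the velocity accumulates an exponentially decaying
    average of past gradients, smoothing the noisy stochastic steps."""
    velocity *= momentum
    velocity += learning_rate * grad
    param += velocity
    return param, velocity
```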
added some commits Jan 2, 2013
 vene Added RBM to whats_new.rst a71b6cc vene Added skeleton for RBM documentation 79e1a79
Member
commented Jan 2, 2013
 I just realized that unlike most of our estimator modules, neural_networks cannot fit under supervised or unsupervised in the hierarchy. Despite this, I would still rather group this (and the upcoming MLP) together in such a module.
Member
commented Jan 2, 2013
 I added an unsupervised algorithm to ensemble without changing anything in the hierarchy - didn't really think about it. Btw, the narrative doesn't have to follow the structure of the modules, though it is probably helpful if it does. But we could just put MLPs in the supervised part and Autoencoders in the unsupervised part. What other algorithms do you think will be there?
Member
commented Jan 2, 2013
 Agreed with @amueller. I wouldn't be surprised to find an unsupervised algorithm in a neural_networks module.
Member
commented Jan 2, 2013
 Well we will have at least two RBM estimators: Bernoulli, Gaussian and I would be interested in the Replicated Softmax thing too. Then the multilayer perceptron (at least as a classifier). I would draw the line at custom architecture (convolutional nets) and more than 2-layer stuff. However I wouldn't mind having an autoencoder in here. In Hinton's class he lists some cool results with autoencoders but the architectures have many layers, usually with decreasing numbers of units towards the middle, and he admits that the architectures are just guesses, but maybe a simple 2-layer autoencoder as a black box transformer can give useful enough representations.
Member
commented Jan 3, 2013
 A basic, 2-layer autoencoder should be easy enough to build by setting Y=X in the backprop algorithm; a sketch follows.
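A minimal sketch of that idea (a single-hidden-layer autoencoder trained by plain gradient descent on squared reconstruction error; nothing here is part of this PR):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(X, n_hidden=32, lr=0.1, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    n_features = X.shape[1]
    W1 = 0.01 * rng.randn(n_features, n_hidden)   # encoder weights
    W2 = 0.01 * rng.randn(n_hidden, n_features)   # decoder weights
    for _ in range(n_iter):
        H = sigmoid(np.dot(X, W1))                # hidden code
        R = np.dot(H, W2)                         # linear reconstruction
        err = R - X                               # the target is the input itself (Y = X)
        # Backprop of 0.5 * ||R - X||^2 through decoder and encoder.
        grad_W2 = np.dot(H.T, err)
        grad_H = np.dot(err, W2.T) * H * (1 - H)
        grad_W1 = np.dot(X.T, grad_H)
        W1 -= lr * grad_W1 / X.shape[0]
        W2 -= lr * grad_W2 / X.shape[0]
    return W1, W2
```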
and 1 other commented on an outdated diff Jan 3, 2013
sklearn/neural_networks/rbm.py
 +def logistic_sigmoid(x): + """ + Implements the logistic function. + + Parameters + ---------- + x: array-like, shape (M, N) + + Returns + ------- + x_new: array-like, shape (M, N) + """ + return 1. / (1. + np.exp(-np.clip(x, -30, 30))) + + +class RestrictedBolzmannMachine(BaseEstimator, TransformerMixin):
 vene Member s/Bolzmann/Boltzmann/g caught this because the docs weren't linking properly. I will fix in my branch (based on the current state of this PR). vene Member I will actually rename to BernoulliRBM since that seems the way to go judging by the previous discussion. amueller Member +1
added some commits Jan 3, 2013
 vene FIX mistake in rst 561fb99 vene Rename RestrictedBolzmannMachine to BernoulliRBM 11abf87
Member
commented Jan 3, 2013
 I am still not at ease regarding how to organize the docs. It's clear that we should have a neural_networks module if we will have RBMs and MLPs; it is simply the place where one would look for such models first. However, the Neural Networks section of the user's guide should have an introductory paragraph before jumping into particular models, and such a page doesn't fit under Supervised or Unsupervised Learning in the user's guide. I'm thinking of splitting the narrative into several files and including them in several combinations: everything, for people looking at the neural_networks module; the supervised material, linked from Supervised learning / Supervised Neural Network Models; and the unsupervised material, linked from Unsupervised learning / Unsupervised Neural Network Models. A way out would be having the Neural Network doc section at the top level, but I'm -1 for that.
Member
commented Jan 3, 2013
 On 01/04/2013 12:52 AM, Vlad Niculae wrote: I am still not at ease regarding how to organize the docs. It's clear that we should have a neural_networks module if we will have RBMs and MLPs, it is simply the place where one would look for such models first. However the Neural Networks section of the user's guide should have an introductory paragraph before jumping into particular models, and such a page doesn't fit under Supervised or Unsupervised Learning in the user's guide. I would just put the general stuff in the one that comes first and then link back to it. What do you want to say about neural nets in the context of MLPs that is also relevant for RBMs?
 vene FIX: make BernoulliRBM doctest pass 28c0093
Member
commented Jan 4, 2013
 That's a good question; however, it is reasonable if we will have autoencoders. Even so, it would just feel off having the neural networks module under Unsupervised... EDIT: maybe link the page from under both sections? I'd rather have lower precision than lower recall in this setting, I don't want users not finding the MLP because it's under Unsupervised, or the other way around.
Member
commented Jan 4, 2013
 I would really just put MLPs under supervised and Autoencoders under unsupervised (when we have them). I don't see the harm in referencing the MLP from the autoencoder about details of backprop or whatever ...
Member
commented Jan 4, 2013
 That's true but where to point from the neural_network module itself? Maybe to both pages? Hmm... that's actually a great idea, I'll set it up like this.
added some commits Jan 4, 2013
 vene Merge branch 'master' into rbm-docs Conflicts: doc/whats_new.rst d8efeb6 vene Rename also in whats_new 8dd07d7 vene FIX: BernoulliRBM check random state in fit, not in init e48c39a
Member
commented Jan 4, 2013
 Some of the universal tests were failing after merging with master, uncovering some bugs: check_random_state was called in __init__ and transform wasn't validating. So I fixed them and sent a PR to @ynd just in case he wants to touch this PR before I finish the docs (I haven't even started yet, actually. I'll do it tomorrow 😊)
 vene FIX: validation in BernoulliRBM.transform c38ce39
Member
 Even so, it would just feel off having the neural networks module under Unsupervised... I have the same feeling.
Member
commented Jan 4, 2013
 Btw, didn't we decide in the MLP PR that the module should be called neural?
Member
commented Jan 4, 2013
 Btw, didn't we decide in the MLP PR that the module should be called neural? Only a handful of people gave their opinion. We should probably vote on the ML.
and others added some commits Jan 4, 2013
 vene DOC: first attempt at RBM documentation b665631 vene Link to RBM docs from the unsupervised toctree 1dbecad vene FIX: uneven RBM image 9e77703 Yann N. Dauphin Merge https://github.com/scikit-learn/scikit-learn into rbm 9b7f5a6 vene DOC: PCD details and references a13a762 Yann N. Dauphin Merge branch 'rbm-docs' of https://github.com/vene/scikit-learn into … …vene-rbm-docs 2b5767a
and 1 other commented on an outdated diff Jan 20, 2013
sklearn/neural_networks/rbm.py
 + ---------- + [1] Tieleman, T. Training Restricted Boltzmann Machines using + Approximations to the Likelihood Gradient. International Conference + on Machine Learning (ICML) 2008 + """ + h_pos = self.mean_hiddens(v_pos) + v_neg = self.sample_visibles(self.h_samples_) + h_neg = self.mean_hiddens(v_neg) + + lr = self.learning_rate / self.n_particles + v_pos *= lr + v_neg *= lr + self.components_ += safe_sparse_dot(v_pos.T, h_pos).T + self.components_ -= np.dot(v_neg.T, h_neg).T + self.intercept_hidden_ += lr * (h_pos.sum(0) - h_neg.sum(0)) + self.intercept_visible_ += (v_pos.sum(0) - v_neg.sum(0))
 vene Member Is it intended for the visible intercept to not be scaled by the learning rate? If so, could you explain why? I'm experimenting with momentum and this jumped at me. ynd Contributor They are scaled by the learning rate in place beforehand in the preceding lines (v_pos *= lr and v_neg *= lr). This makes the computation of the gradient for components_ faster.
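In other words, the two formulations below compute the same visible-bias update; the in-place variant folds the learning rate into v_pos and v_neg once so it does not have to be applied again in the weight and visible-intercept updates (toy arrays only, not the PR's code):

```python
import numpy as np

rng = np.random.RandomState(0)
v_pos, v_neg = rng.rand(4, 3), rng.rand(4, 3)
lr = 0.1

# Explicit scaling at the update site.
update_a = lr * (v_pos.sum(axis=0) - v_neg.sum(axis=0))

# The PR's style: scale the statistics first, then sum without an extra lr factor.
v_pos_scaled, v_neg_scaled = lr * v_pos, lr * v_neg
update_b = v_pos_scaled.sum(axis=0) - v_neg_scaled.sum(axis=0)

assert np.allclose(update_a, update_b)
```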
Contributor
commented Jan 24, 2013
 Thanks to Vlad we have a narrative doc; what is missing for a merge?
Member
 Thanks to Vlad we have a narrative doc, what is missing for a merge? Agreeing on the module name. My vote goes to neural_network (without an s) to be consistent with linear_model.
Member
commented Jan 24, 2013
 Agreeing on the module name. My vote goes to neural_network (without an s) to be consistent with linear_model. +1 for this proposal as well.
Member
 Agreeing on the module name. My vote goes to neural_network (without an s) to be consistent with linear_model. +1
Member
commented Jan 24, 2013
 The test coverage could probably be improved a bit:
sklearn.neural_networks.rbm    84    10    88%    162, 222-225, 262, 318, 323, 326-328
Type make test-coverage in the top-level folder to compute it again.
sklearn/neural_networks/rbm.py
 + return logistic_sigmoid(safe_sparse_dot(v, self.components_.T) + + self.intercept_hidden_) + + def sample_hiddens(self, v): + """ + Sample from the distribution P({\bf h}|{\bf v}). + + Parameters + ---------- + v: array-like, shape (n_samples, n_features) + + Returns + ------- + h: array-like, shape (n_samples, n_components) + """ + return self._sample_binomial(self.mean_hiddens(v))
 ogrisel Member This is not tested.
sklearn/neural_networks/rbm.py
 + def gibbs(self, v): + """ + Perform one Gibbs sampling step. + + Parameters + ---------- + v: array-like, shape (n_samples, n_features) + + Returns + ------- + v_new: array-like, shape (n_samples, n_features) + """ + h_ = self.sample_hiddens(v) + v_ = self.sample_visibles(h_) + + return v_
 ogrisel Member This method is not tested.
sklearn/neural_networks/rbm.py
 + h_pos = self.mean_hiddens(v_pos) + v_neg = self.sample_visibles(self.h_samples_) + h_neg = self.mean_hiddens(v_neg) + + lr = self.learning_rate / self.n_particles + v_pos *= lr + v_neg *= lr + self.components_ += safe_sparse_dot(v_pos.T, h_pos).T + self.components_ -= np.dot(v_neg.T, h_neg).T + self.intercept_hidden_ += lr * (h_pos.sum(0) - h_neg.sum(0)) + self.intercept_visible_ += (v_pos.sum(0) - v_neg.sum(0)) + + self.h_samples_ = self._sample_binomial(h_neg) + + if self.verbose: + return self.pseudo_likelihood(v_pos)
 ogrisel Member This is not tested, but I guess we don't want to pollute stdout when running the tests. So no opinion for this one.
and 2 others commented on an outdated diff Jan 24, 2013
sklearn/neural_networks/rbm.py
 + Computes the free energy + \mathcal{F}({\bf v}) = - \log \sum_{\bf h} e^{-E({\bf v},{\bf h})}. + + Parameters + ---------- + v: array-like, shape (n_samples, n_features) + + Returns + ------- + free_energy: array-like, shape (n_samples,) + """ + return - np.dot(v, self.intercept_visible_) - np.log(1. + np.exp( + safe_sparse_dot(v, self.components_.T) + self.intercept_hidden_)) \ + .sum(axis=1) + + def gibbs(self, v):
 GaelVaroquaux Member If I am not mistaken, the corresponding method in GMMs (mixture/gmm.py) is called 'sample'. It would be useful to have the same name for API consistency. dwf Member On Thu, Jan 24, 2013 at 1:53 PM, Gael Varoquaux notifications@github.com wrote: If I am not mistaken, the corresponding method in GMMs (mixture/gmm.py) is called 'sample'. It would be useful to have the same name for API consistency. $0.02: I don't think it's necessarily a good idea to conflate the two. My reasoning: sample(), I'm pretty sure, draws an independent sample from the model distribution. This is impossible with an RBM. This executes one step of a Gibbs sampling process, requires an initial visible state, and importantly does not yield independent samples on successive calls. GaelVaroquaux Member Is this a good enough reason to have different names, or does it warrant a note in the docstring? ogrisel Member +1 for making it explicit in the docstring. dwf Member On 2013-01-24 2:07 PM, "Gael Varoquaux" notifications@github.com wrote: Is this a good enough reason to have different names, or does it warrant a note in the docstring? Well, this has a different signature with a non-optional argument that doesn't appear in sample. Having the same name thus doesn't really buy you anything from an abstraction/generic code perspective. If uniformity for the mere sake of uniformity is desirable then maybe it's worthwhile, but it seems like they serve different enough purposes (and require different calling conventions) to be considered conceptually separate.
ogrisel Member I think we all agree on keeping different names but we should further add explicitly in the docstring of the gibbs method that consecutive calls to it will yield highly dependent samples as the internal state of the RBM model is updated (stateful model). ogrisel Member Actually I made a mistake, the model itself is not stateful as the state (v the activation levels of the visible units) is passed as an argument to the method.
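To make the statefulness discussion concrete, here is roughly how one would run a chain with the gibbs method from this PR; successive calls are correlated because each step is seeded with the previous sample (a usage sketch under assumed toy data, not part of the PR):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM  # module name as agreed later in this thread

rng = np.random.RandomState(0)
X = (rng.rand(200, 16) > 0.5).astype(np.float64)          # toy binary data
rbm = BernoulliRBM(n_components=8, n_iter=10, random_state=0).fit(X)

v = X[:5].copy()            # seed the chain with a few training examples
samples = []
for step in range(1000):
    v = rbm.gibbs(v)        # one block-Gibbs step: sample h given v, then v given h
    if step % 100 == 0:
        samples.append(v.copy())   # thin the chain to reduce autocorrelation
```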
Member
commented Jan 24, 2013
 The narrative doc lacks a link to the example with RBM used for unsupervised feature extraction in a digits classification pipeline. I think this example should move to a neural_network subfolder of the examples folder, as we expect to have more neural-net-based models in the future, we don't want to crowd the top-level example folder too much, and we don't want to break incoming links to the example page on the website in the future. Also, are the extracted filters too ugly to be plotted in the example and the doc?
Member
commented Jan 24, 2013
 For information, here is the output of running the example on my box:
=========================================================
Pipelining: chaining a RBM and a logistic regression
=========================================================
The BernoulliRBM does unsupervised feature extraction, while the logistic regression does the prediction. We use a GridSearchCV to set the number of hidden units and the learning rate of the Bernoulli Restricted Boltzmann Machine. We also train a simple logistic regression for comparison. The example shows that the features extracted by the BernoulliRBM help improve the classification accuracy.
Classification report for classifier Pipeline(logistic=LogisticRegression(C=10000.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001), logistic__C=10000.0, logistic__class_weight=None, logistic__dual=False, logistic__fit_intercept=True, logistic__intercept_scaling=1, logistic__penalty=l2, logistic__random_state=None, logistic__tol=0.0001, rbm=BernoulliRBM(learning_rate=0.01, n_components=400, n_iter=30, n_particles=10, random_state=, verbose=False), rbm__learning_rate=0.01, rbm__n_components=400, rbm__n_iter=30, rbm__n_particles=10, rbm__random_state=, rbm__verbose=False):
             precision    recall  f1-score   support
          0       0.95      0.95      0.95        41
          1       0.98      0.93      0.95        44
          2       0.98      1.00      0.99        42
          3       0.97      1.00      0.99        33
          4       0.97      0.95      0.96        40
          5       1.00      1.00      1.00        29
          6       0.96      0.86      0.91        28
          7       0.92      1.00      0.96        36
          8       0.78      0.93      0.85        27
          9       0.97      0.88      0.92        40
avg / total       0.95      0.95      0.95       360
Classification report for classifier LogisticRegression(C=10000.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001):
             precision    recall  f1-score   support
          0       0.95      0.95      0.95        41
          1       0.97      0.86      0.92        44
          2       1.00      1.00      1.00        42
          3       0.91      0.94      0.93        33
          4       0.95      0.97      0.96        40
          5       0.94      1.00      0.97        29
          6       1.00      0.89      0.94        28
          7       0.94      0.89      0.91        36
          8       0.68      0.96      0.80        27
          9       0.94      0.82      0.88        40
avg / total       0.94      0.93      0.93       360
So the pipeline is indeed a bit better than the logistic regression, although this isn't cross-validated so we have no idea of the variance of this estimate. I think that anyway the dataset is too small for the RBM to be very useful here. I think we should mention that in the doc of the example. We can also note that the __repr__ method of the Pipeline class is broken (it should at least respect the ordering of the steps) but this should be addressed in a separate PR.
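For readers skimming the thread, the example whose output is shown above boils down to something like the following sketch; the hyperparameter values are taken from the printed repr, the manual split and scaling are my own simplifications, and none of this is the PR's example verbatim:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

digits = datasets.load_digits()
X = np.asarray(digits.data, dtype=np.float64)
X = (X - X.min()) / (X.max() - X.min())      # scale to [0, 1] so the units are Bernoulli-like
y = digits.target

# Simple holdout split (the real example uses train_test_split).
X_train, X_test = X[:-360], X[-360:]
y_train, y_test = y[:-360], y[-360:]

rbm = BernoulliRBM(n_components=400, learning_rate=0.01, n_iter=30, random_state=0)
logistic = LogisticRegression(C=10000.0)
clf = Pipeline([('rbm', rbm), ('logistic', logistic)])

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```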
Member
commented Jan 24, 2013
 Also we should really add an explicit partial_fit method for incremental learning with mini-batches.
Member
 Wouldn't minibatch learning require the persistent CD algorithm?
Member
 Oh and still +1 on sklearn.neural.
Member
 @larsmans this PR does implement persistent CD.
Member
 Excuse me, I was under the impression that it didn't.
Contributor
commented Jan 28, 2013
 It seems the majority vote is 'neural_network'.
Member
commented Feb 2, 2013
 +1 for neural_network.
Member
commented Feb 2, 2013
 I think we should merge this PR now that there is a narrative and examples. The examples could be improved but we can still do that later. I think many people will play with this and we'll easily get a better example.
 Yann N. Dauphin neural_networks -> neural_network 234b087
Member
commented Feb 2, 2013
 I'm fine with merging soon, but: @ogrisel, sadly with cross-validation it's easy to find a C for which results are better without the RBM. I dug for a long time and couldn't find a configuration to learn features that perform better than the raw ones. I blame this on the dataset. Maybe we should instead concatenate the original features with the learned ones? I'm still a bit put off though by how mini-batch size is controlled by changing n_particles; I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size? Thoughts on momentum?
Member
commented Feb 3, 2013
 @ogrisel, sadly with cross-validation it's easy to find a C for which results are better without the RBM. I dug for a long time and couldn't find a configuration to learn features that perform better than the raw ones. I blame this on the dataset. Maybe we should instead concatenate the original features with the learned ones? Then we should just make it explicit in the example's top-level docstring that due to download size and computational constraints, this example uses a toy dataset that is either too small or too "linearly separable" for RBM features to be really interesting. In practice, unsupervised feature extraction with RBMs is only useful on datasets with many samples that are internally structured into non-linearly-separable components / manifolds. Currently the example misleads the reader by asserting that RBM features are better than raw features on this toy data. I'm still a bit put off though by how mini-batch size is controlled by changing n_particles; I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size? Having consistency on the batch_size parameter with other minibatch models such as MiniBatchKMeans and MiniBatchDictionaryLearning is a big +1 for me. I would rather have this changed before merging to master if this is not too complicated.
Member
commented Feb 3, 2013
 BTW has anyone tried to let this implementation run long enough on the official MNIST train set to extract features that, when combined with a cross-validated LinearSVC / SVC, reach the expected predictive accuracy (I am not 100% sure, but I think it is possible to go below the 2% test error rate with this kind of architecture)? Also, if someone runs it long enough with the same parameters as at the bottom of the page in http://deeplearning.net/tutorial/rbm.html, we could qualitatively compare the outcome of the Gibbs sampling with that of the deeplearning.net tutorial's Theano-based implementation.
Contributor
commented Feb 3, 2013
 I'm still a bit put off though by how mini-batch size is controlled by changing n_particles; I find this hard to grok. Can't there be two separate parameters, n_particles and batch_size? Having consistency on the batch_size parameter with other minibatch models such as MiniBatchKMeans and MiniBatchDictionaryLearning is a big +1 for me. I would rather have this changed before merging to master if this is not too complicated. Most implementations use the same number of particles as the batch size (this is recommended by Hinton). I will rename the parameter to batch_size though, that should clarify things. — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-13044925.
added some commits Feb 3, 2013
 Yann N. Dauphin rename n_particles to batch_size f8e0110 Yann N. Dauphin bugfix test ad4877a Yann N. Dauphin bugfix test 5400735 Yann N. Dauphin added tests 6dfc317 Yann N. Dauphin tests 83daf31
Contributor
commented Apr 15, 2013
 @ogrisel I've added some tests:
sklearn.neural_network.rbm    84    6    93%    262, 318, 323, 326-328
All the uncovered lines are basically just conditionals on print statements for verbose=True.
Member
 Thanks a lot for picking this up. Sorry we didn't work harder on merging it. It seems we are all a bit swamped right now. I'm certain it will go in the next release, though ;)
Member
commented Apr 15, 2013
 Wow, actually yesterday I was thinking that after the conference deadline passes (which was last night) I wanna come back to this PR! A while back in a local file, I hacked up a momentum implementation on top of this code; it did seem to speed up convergence a lot on MNIST, interesting? On Tue, Apr 16, 2013 at 5:27 AM, Andreas Mueller notifications@github.com wrote: Thanks a lot for picking this up. Sorry we didn't work harder on merging it. It seems we are all a bit swamped right now. I'm certain it will go in the next release, though ;) — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-16409345 .
Member
 @vene I would rather not add any features and merge this asap. It has been sitting around way too long! Do you have any idea what is still missing?
Member
commented Apr 17, 2013
 IIRC I was not pleased with the example when it was moved to the digits dataset instead of MNIST, as the dataset is very small and the components don't look that meaningful. I tried toying around with the params to no avail. On Tue, Apr 16, 2013 at 6:37 PM, Andreas Mueller notifications@github.com wrote: @vene https://github.com/vene I would rather not add any features and merge this asap. It has been sitting around way too long! Do you have any idea what is still missing? — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-16435278 .
Member
commented Apr 17, 2013
 Also there is a failing common test because of the way random_state is used.
Member
commented Apr 18, 2013
 @vene If you can come up with a quick PR vs this branch and show how that impacts the convergence speed and quality on MNIST, that sounds interesting.
commented on the diff Apr 18, 2013
sklearn/neural_network/rbm.py
 + def fit(self, X, y=None): + """ + Fit the model to the data X. + + Parameters + ---------- + X: array-like, shape (n_samples, n_features) + Training data, where n_samples in the number of samples + and n_features is the number of features. + + Returns + ------- + self + """ + X = array2d(X) + self.random_state = check_random_state(self.random_state)
 ogrisel Member Constructor parameters should not be mutated during fit and that holds for random_state as well. Please change this line to: self.random_state_ = check_random_state(self.random_state)  or random_state = check_random_state(self.random_state)  and then pass the random_state local variable explicitly as an argument to the subsequent private methods.
Member
commented May 6, 2013
 I'm very inclined to fix the Travis tests and merge. Having "not completely satisfying example filters" doesn't justify delaying this PR any longer.
Member
commented May 6, 2013
 Probably a good idea, any speedups / new features can be added later. On Mon, May 6, 2013 at 6:39 PM, Andreas Mueller notifications@github.com wrote: I'm very inclined to fix the Travis tests and merge. Having "not completely satisfying example filters" doesn't justify delaying this PR any longer. — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-17473458 .
Member
commented May 6, 2013
 Does any one have strong reasons not to merge this? (after fixing the random state issue)
Member
 Some implicit castings (I have a version of numpy that purposely fails when casting rules can induce problems):
======================================================================
FAIL: Doctest: sklearn.neural_network.rbm.BernoulliRBM
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.neural_network.rbm.BernoulliRBM
  File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line 31, in BernoulliRBM
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line 78, in sklearn.neural_network.rbm.BernoulliRBM
Failed example:
    model.fit(X) # doctest: +ELLIPSIS
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "", line 1, in
        model.fit(X) # doctest: +ELLIPSIS
      File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line 320, in fit
        pl_batch = self._fit(X[inds[minibatch::n_batches]])
      File "/home/varoquau/dev/scikit-learn/sklearn/neural_network/rbm.py", line 252, in _fit
        v_pos *= lr
    TypeError: Cannot cast ufunc multiply output from dtype('float64') to dtype('int64') with casting rule 'same_kind'
----------------------------------------------------------------------
Member
 Some implicit castings: (I have a version of numpy that purposely fails when casting rules can induce problems) I have fixed that in the pr_1200 branch on my github (it was preventing me from reviewing this PR). Please pull/cherry-pick.
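The doctest failure above comes from the in-place v_pos *= lr hitting an integer input array; one hedged way to guard against it (not necessarily the fix applied on the pr_1200 branch) is to force a float dtype when validating the input:

```python
import numpy as np

X_int = np.arange(6).reshape(3, 2)         # e.g. digit counts stored as integers
X = np.asarray(X_int, dtype=np.float64)    # explicit cast before fitting
X *= 0.1                                   # in-place float operations are now safe
```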
commented on the diff May 6, 2013
doc/modules/neural_networks.rst
 +but often in practice, as well as in this implementation, it is optimized by +averaging over mini-batches. The gradient with respect to the weights is +formed of two terms corresponding to the ones above. They are usually known as +the positive gradient and the negative gradient, because of their respective +signs. + +In maximizing the log likelihood, the positive gradient makes the model prefer +hidden states that are compatible with the observed training data. Because of +the the bipartite structure of RBMs, it can be computed efficiently. The +negative gradient, however, is intractable. Its goal is to lower the energy of +joint states that the model prefers, therefore making it stay true to the data +and not fantasize. It can be approximated by Markov Chain Monte Carlo using +block Gibbs sampling by iteratively sampling each of :math:v and :math:h +given the other, until the chain mixes. Samples generated in this way are +sometimes refered as fantasy particles. This is inefficient and it's difficult +to determine whether the Markov chain mixes.
 GaelVaroquaux Member I think that the above paragraph can be much shortened. I don't see how it is useful to the end user.
Member
 Does any one have strong reasons not to merge this? Plenty. I am doing a quick review, and it is not ready. The review will come in later, but I am convinced that there still is some work to be done here.
commented on the diff May 6, 2013
sklearn/neural_network/rbm.py
 + Compute the element-wise binomial using the probabilities p. + + Parameters + ---------- + x: array-like, shape (M, N) + + Notes + ----- + This is equivalent to calling numpy.random.binomial(1, p) but is + faster because it uses in-place operations on p. + + Returns + ------- + x_new: array-like, shape (M, N) + """ + p[self.random_state.uniform(size=p.shape) < p] = 1.
 GaelVaroquaux Member self.random_state cannot be a rng object, because it would mean modifying the input parameter. You need to create a shadow self.random_state_ object, as done in the other learners. mblondel Member Or better, pass the random state as an argument to the method. GaelVaroquaux Member Or better, pass the random state as an argument to the method. +1. But the random_state attribute should not be modified anyhow.
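What mblondel suggests would look roughly like this sketch; the helper name and signature are illustrative, not the PR's code:

```python
import numpy as np
from sklearn.utils import check_random_state

def sample_bernoulli(p, rng):
    """Draw 0/1 samples with success probabilities p, given an explicit RNG."""
    return (rng.uniform(size=p.shape) < p).astype(p.dtype)

rng = check_random_state(0)   # validated once in fit, then threaded through private methods
h = sample_bernoulli(np.full((2, 3), 0.5), rng)
```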
Member
 I have made a few minor changes on my pr_1200 branch. A few general comments (it seems that this PR lacks a bit of polish and I am not convinced that it is mergeable from a user standpoint): The docs contain a lot of math, but lack intuitive messages. Most importantly, the RBM section does not tell me why and when I should be using RBMs. Examples are not linked to in the docs. The docs do not have a figure. They feel a bit more like a page from a textbook. For instance the remark 'Deep neural networks that are notoriously difficult to train from scratch can be simplified by initializing each layer's weights with the weights of an RBM.' What is a user going to do with this remark: we don't have deep neural networks in scikit-learn. There is too much maths and not enough intuitions/practical advice. The only example is not a 'plot_' example. It does not help with understanding how the RBM works, nor does it demonstrate its benefits on the rendered docs. It also takes ages to run. This long run time is strongly suboptimal, as there is a grid-search object that refits the RBM many times with unchanged parameters. To make it faster, I have hardcoded some values obtained by a grid search. I also added some plotting. I am also worried about the number of methods of the object. Have we checked for redundancy in the names of the methods? I see a Gibbs sampling method. Other objects also have sampling methods. We should make sure that the names and signatures of these methods are compatible. As a side note, after playing with this code, I am not terribly impressed by deep learning as black-box learners. :P
Member
commented May 6, 2013
 (Please disregard previous comment) I agree with @GaelVaroquaux that there are far too many public methods.
Member
commented May 6, 2013
 I haven't been following this all that closely, but the narrative docs appear to disagree with the implementation. The implementation uses stochastic maximum likelihood, which uses an unbiased estimator of the maximum likelihood gradient, not contrastive divergence (which the documentation seems to be describing). If the primary utility to scikit-learn users is that of feature extraction, I'd question whether an SML-based implementation is a good choice. CD tends to learn better features. As a side note, after playing with this code, I am not terribly impressed by deep learning as black-box learners. :P We'll make a believer out of you yet Gael!
Member
commented May 7, 2013
 The implementation uses stochastic maximum likelihood, which uses an unbiased estimator of the maximum likelihood gradient, not contrastive divergence (which the documentation seems to be describing). Isn't Persistent CD a synonym for SML? In the "SML Learning" section I described the CD algorithm first and then PCD as a variation, because I found it difficult to justify PCD by itself without contrasting it with CD. If I am factually incorrect, could you please push me in the right direction? There is too much maths and not enough intuitions/practical advice. I tried to push the maths as far down as possible and to lead with intuitions, but I don't know any more intuitions than what I put down. I agree the docs are lacking but I really don't know what to write to improve them. As black-box learners, RBMs should just be used in a simple pipeline before a classifier, hoping for the best. The figure plotted by the example should indeed be in the docs. I was very unsatisfied with the way the example looks on the digits dataset, compared to how good it looks on MNIST; I tried to find good values but failed. I was wondering whether it would be better to use MNIST and keep it as a pregenerated example.
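The distinction being asked about is only in where the negative-phase chain starts; a schematic contrast, written against this PR's gibbs method but not taken from its code:

```python
# One parameter update, schematically.

def cd_negative_phase(rbm, v_data, k=1):
    # CD-k: restart the Gibbs chain at the current minibatch on every update.
    v = v_data
    for _ in range(k):
        v = rbm.gibbs(v)
    return v                      # biased but low-variance negative samples

def pcd_negative_phase(rbm, persistent_v, k=1):
    # PCD / SML: keep a persistent set of fantasy particles across updates,
    # advancing the same chain a few steps each time the parameters change.
    v = persistent_v
    for _ in range(k):
        v = rbm.gibbs(v)
    return v                      # also stored back as the new persistent state
```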
Member
commented May 7, 2013
 Ok, I'll try to work on docs and interfaces. First I want to have a look at @jnothman's grid-search work, though...
Member
 Ok, I'll try to work on docs and interfaces. The most worrying thing is the example: I haven't been convinced that it does better than a K-means to extract mid-level features. G
Member
commented May 7, 2013
 I don't think that you can say that it is better than k-means in general. Showing that RBMs work needs a large dataset and tuning. We can try on MNIST. I don't think the goal of the example should be to show that it works well; rather, it should show how to use it. The literature shows that there are cases in which it works. Reproducing them (in particular using just a CPU) is non-trivial and time-consuming.
Member
 Showing that RBMs work needs a large dataset and tuning. We can try on MNIST. That must cost an arm and a leg! I don't think the goal of the example should be to show that it works well, ratherit should be how to use it. Indeed. I think that the example on my pr_1200 branch does that, however when I see it, I cannot help thinking of k-means. The literature shows that there are cases in which it works. The older I get, the less I trust the literature :) Reproducing them (in particular using just a CPU) is non-trivial and time-consuming. We should do a quick check to see if this implementation can be sped up: I find it slow on a simple problem.
Member
commented May 7, 2013
 The older I get, the less I trust the literature :) Totally understandable, but maybe this shouldn't be used as an argument against merging. It seems that people are interested in trying these kind of models. I think it's better to provide it, and make it obvious how easy it is to replace it with something simpler by giving it the same API and letting people see by themselves that for many problems, KMeans or Random Projections extract better features than complicated and slow architectures. If users manage to use it well, that's great! If users use it wrongly, it's not much different than if they misuse any other of our estimators, is it? We should do a quick check to see if this implementation can be sped up: I find it slow on a simple problem. An algorithmic speedup would be momentum, in this case. @amueller, I'd love to help with what I can (maybe API cleanup) but I might step on your toes if you're on this, especially since the branch is @ynd's at the moment. What could I do?
Member
 The older I get, the less I trust the literature :) Totally understandable, but maybe this shouldn't be used as an argument against merging. Agreed. I am not voting against a merge in the long run. I think that there are still a few minor improvements to be done before a merge. We should do a quick check to see if this implementation can be sped up: I find it slow on a simple problem. An algorithmic speedup would be momentum, in this case. OK, that would be fantastic, but if it requires a lot of work, we should maybe do it in a 2nd PR.
Member
commented May 7, 2013
 @vladn feel free to go ahead, preferably in a second PR to master, I think. This is not my highest priority but I want to get it done soonish.
referenced this pull request May 9, 2013
Closed

#### [MRG] Restricted Boltzmann Machines: director's cut #1954

Member
commented May 9, 2013
 Moved to PR #1954, closing this one. Posting here so people get pinged. @dwf, could you help me understand the PCD-SML issue discussed above?
closed this May 9, 2013
Member
commented May 10, 2013
 Sure, I need to have a closer look at the narrative docs. On Thu, May 9, 2013 at 10:23 AM, Vlad Niculae notifications@github.com wrote: Moved to PR #1954, closing this one. Posting here so people get pinged. @dwf https://github.com/dwf, could you help me understand the PCD-SML issue discussed above? — Reply to this email directly or view it on GitHub: https://github.com/scikit-learn/scikit-learn/pull/1200#issuecomment-17666810 .