Fergus semi-supervised label propagation #1185

Open
wants to merge 3 commits into from

7 participants

@magsol

This is a new semi-supervised algorithm based on Fergus et al 2009 [1].

Strengths:

  • Closed-form solution, no need for iterations or convergence threshold
  • Similar accuracy to the other sklearn semi-supervised algorithms on the moons data, and much better accuracy than them on concentric circles
  • In addition to 'knn' and 'rbf' modes, has an 'img' mode for segmenting images

Weaknesses:

  • A full order of magnitude slower than the other sklearn semi-supervised algorithms, likely due to eigh() on the graph laplacian (somewhat mitigated by the use of sparse data structures)
  • The eigen-decomposition precludes full integration with the existing semi-supervised class structure
  • The method for computing image affinities (img_rbf_kernel) is inefficient and could use some optimization

I would love to add this algorithm to scikit-learn's semi-supervised package. I'm definitely open to suggestions for improving it and bringing it up to par with scikit-learn's coding standards and algorithm performance.

[1] Rob Fergus, Yair Weiss, Antonio Torralba. Semi-Supervised Learning in Gigantic Image Collections, 2009. http://eprints.pascal-network.org/archive/00005636/01/ssl-1.pdf
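
For context, a minimal usage sketch on the moons data (the import path for FergusPropagation is hypothetical since the estimator is only proposed in this PR; LabelSpreading is shown for comparison, and the GMM-derived labels may not line up with y_true's encoding):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.semi_supervised import LabelSpreading
    # Hypothetical import path: FergusPropagation is the estimator added by this PR.
    from sklearn.semi_supervised import FergusPropagation

    X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

    # Semi-supervised convention: -1 marks unlabeled samples.
    y = -np.ones_like(y_true)
    labeled = np.random.RandomState(0).choice(len(y), size=10, replace=False)
    y[labeled] = y_true[labeled]

    fp = FergusPropagation(kernel='rbf', gamma=20).fit(X, y)
    ls = LabelSpreading(kernel='rbf', gamma=20).fit(X, y)

    print("FergusPropagation agreement:", np.mean(fp.labels_ == y_true))
    print("LabelSpreading agreement:", np.mean(ls.transduction_ == y_true))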

@magsol magsol New semi-supervised algorithm based on Fergus et al 2009. Rather than iteratively converge to a solution, this algorithm explicitly solves the closed-form solution using the eigenvectors of the normalized graph laplacian.
e7b673b
@amueller
Owner

Hey @magsol.
Thanks for the PR. I haven't read the paper but it sounds very interesting.
Which algorithms did you compare against when concluding this is an order of magnitude slower?

Maybe someone who is familiar with the algorithms can comment. In general I think we have several spectral algorithms, which all suffer from the "eigendecomposition of graph laplacian" slowness, but they are still widely used standard algorithms.
We usually use some direct LAPACK calls. Maybe look into SpectralClustering to see how the decomposition is done there; I don't know how many eigenvalues you need, though.

For the image algorithm, please take a look at grid_to_graph. I think you reimplement it here.

There is also ongoing work on refactoring the construction of the graph laplacian and spectral decomposition. This might be quite relevant for this PR. See #1170 and #734.

What is a typical data set size on which the current implementation is practical? Do you think reusing the optimized code from spectral clustering could make it feasible on larger datasets?

And can you explain the use-case with the image?

Thanks again for the PR.
Cheers,
Andy

sklearn/semi_supervised/fergus_propagation.py
@@ -0,0 +1,301 @@
+import numpy as np
+import scipy.linalg as LA
+import scipy.sparse
+import scipy.sparse.linalg as SLA
+from sklearn.base import ClassifierMixin
+from sklearn.base import BaseEstimator
+import sklearn.metrics.pairwise as pairwise
+import sklearn.mixture as mixture
+
+class FergusPropagation(BaseEstimator, ClassifierMixin):
+ '''
@amueller Owner

Not sure but shouldn't this be """? Or is that just convention?

@magsol
magsol added a note

You're probably right--I've gotten in the habit of using single quotes (a holdover from my PHP days, most likely), but I'll change it to double quotes to fit with the rest of the sklearn coding practices.

@weilinear Owner

One quick question, is SLA.eigsh just using arpack by default?

@magsol
magsol added a note

According to this, I believe so.

@weilinear Owner

I just noticed that in SpectralClustering there is a from ..utils.arpack import eigsh. So if scipy uses arpack by default, does importing eigsh from ..utils.arpack have some special intention? @GaelVaroquaux

@magsol
magsol added a note

From the scipy source, it looks like it's using lapack directly. So that's an excellent question.

@jakevdp Collaborator
jakevdp added a note

@kuantkid - utils.arpack is exactly the same as scipy.sparse.linalg.arpack in scipy v0.10. The reason it's in the utils is because Scipy v0.10 added some features that aren't in earlier versions, and we don't yet require scipy 0.10+. In fact, at the bottom of the file in utils/arpack.py you can see that we default to the Scipy code if the version is greater than 0.10.

@weilinear Owner

@jakevdp Thanks for that. If I've understood your point correctly, it's better to use utils.arpack for now so the code runs with scipy <0.10. So maybe @magsol would also need just a small modification here? :)

@jakevdp Collaborator
jakevdp added a note

Yes, until we update our dependencies to scipy 0.10+, the sklearn.utils version of arpack should be used.
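
In code, the pattern being recommended is roughly the following (a sketch; sklearn.utils.arpack existed at the time of this PR and defers to the scipy implementation on scipy >= 0.10):

    # Backported ARPACK wrapper: behaves like scipy >= 0.10 even on older scipy.
    from sklearn.utils.arpack import eigsh   # inside sklearn: from ..utils.arpack import eigsh

    # On scipy >= 0.10 this is equivalent to:
    # from scipy.sparse.linalg import eigsh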

@magsol
magsol added a note

Ahhhh. I can certainly change the arpack calls to use scipy utils :)

@amueller
Owner

ps: from a coding-style perspective this looks good, I haven't tried to understand the code yet, though ;)

@magsol

Thanks for the feedback!

The other algorithms I compared this one against were LabelSpreading and LabelPropagation.

The way lapack was invoked in the spectral clustering packages is more or less how I did it: specifying the particular eigenvalues and eigenvectors needed, as opposed to taking all N of them. Right now, the user can either specify the number of top eigenvalues and eigenvectors manually (parameter k in the constructor), or it is automatically set to the number of unique classes provided in the initial labels (obviously excluding the -1 unlabeled case).

grid_to_graph looks like exactly what I need. However, for 2D images it appears to compute a connectivity graph, instead of the RBF of the pixel values for the connected pixels. Still, at the very least I could use that implementation as a starting point to make the computation much more efficient. Thanks!
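
For instance, a rough sketch of how grid_to_graph's sparsity pattern could seed the affinity construction (neighborhood_rbf_affinity is a hypothetical helper; it assumes X holds one feature row per pixel, flattened in the same row-major order as the image, and note that grid_to_graph gives axis-aligned 4-connectivity rather than the 8-neighborhood used in _img_rbf_kernel):

    import numpy as np
    from scipy import sparse
    from sklearn.feature_extraction.image import grid_to_graph

    def neighborhood_rbf_affinity(X, height, width, gamma=20):
        # Sparse connectivity pattern (axis-aligned neighbors plus self-loops).
        conn = grid_to_graph(height, width).tocoo()
        # RBF weight only for connected pixel pairs; self-loops get exp(0) = 1.
        diffs = X[conn.row] - X[conn.col]
        weights = np.exp(-gamma * np.sum(diffs ** 2, axis=1))
        return sparse.coo_matrix((weights, (conn.row, conn.col)),
                                 shape=(X.shape[0], X.shape[0])).tocsc()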

The two issues definitely look relevant. I'll follow those.

I tested the algorithm with scikit-learn's datasets of varying sizes (moons, blobs, and circles), and I found ~5000 data points was the qualitative limit without switching to sparse data structures. Anything smaller than that will work just fine, and anything larger than that, provided it uses sparse data structures, works fine too.

The image use case involves, for example, segmenting background from foreground in an image. The labeled pixels could constitute a part of the background and a part of the foreground, and the labels would be propagated from there.

Please let me know if you have any other questions. Thanks again!

Shannon

@magsol

Regarding the construction of the graph laplacian: I wrote my own method for it (albeit based on the scikit-learn implementation) because the laplacian returned in scikit-learn when computing the normalized version did not seem to follow the form it takes in the literature:

L = D^-0.5 * A * D^-0.5

where A is the affinity matrix, and D is the diagonal matrix formed by summing the rows of A. The _un_normalized laplacian, D - A, appears correct in scikit-learn. Although I'm completely open to having missed something in the documentation that explains what is being computed regarding the normalized version.
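
For concreteness, a tiny numpy sketch of the two quantities under discussion; if I read the sklearn utility correctly, graph_laplacian(normed=True) returns the second form, which differs from the first only by the identity:

    import numpy as np

    A = np.array([[0., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.]])                     # toy symmetric affinity matrix
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=0))

    S = (A * d_inv_sqrt) * d_inv_sqrt[:, np.newaxis]   # D^-0.5 * A * D^-0.5 (form used in this PR)
    L_sym = np.eye(A.shape[0]) - S                     # I - D^-0.5 * A * D^-0.5 (L_sym in the tutorial)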

@GaelVaroquaux
@amueller
Owner

@magsol
For the normalized laplacian:
I think the difference is explained in the Luxburg tutorial on Spectral clustering.
Isn't one the Shi/Malik and the other one the Ng/Jordan version? Or am I confusing those again?

For the lapack calls: I think (really not sure) we are using a different lapack function than the standard scipy one. The utils is a backport of the next (or newest?) scipy interface to arpack.
I am not really familiar with this, though. Maybe @vene or @jakevdp can explain?

@jakevdp
Collaborator

@amueller - regarding arpack, see my line note above.

@magsol

Judging from that tutorial, it appears I'm trying to calculate a laplacian like L_sym, using the weight matrix W (or A in the code), but without the subtraction from the identity matrix. Is that incorrect?

@GaelVaroquaux
@weilinear
Owner

@magsol I looked through the paper; it seems the running-time performance improves compared to traditional methods, as stated by the following line:
"a semi-supervised learning scheme that is linear in the number of images"
So could I take it that this method will also apply if our data is text or anything else? Is there any benchmark of running time on large datasets yet, and more detail on the performance weakness stated in the PR? Maybe you can try %prun in ipython :)
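
For example, a quick profiling sketch (assumes the FergusPropagation estimator from this PR and some X, y already prepared):

    import cProfile

    fp = FergusPropagation(kernel='rbf', gamma=20)
    cProfile.run('fp.fit(X, y)', sort='cumulative')

    # or interactively in IPython:
    # %prun fp.fit(X, y)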

@magsol

@GaelVaroquaux: Its use cases are pretty much the same as the other two semi-supervised algorithms already in scikit. The practical benefit is that its accuracy is at least as good as the other two in all the cases I've tested (moons, blobs), and often significantly better. And there's no need to worry about iterations or convergence.

@kuantkid: The paper goes on to describe a method involving "eigenfunctions", smooth approximations of the eigenvectors that allow much faster computation. I did not implement that part of the algorithm (though I certainly could), as it would have made it substantially less comparable to the other two semi-supervised algorithms already in scikit in terms of use cases. Right now it can be used on any use case (and any data type: text, blobs, images, etc.) that the other algorithms can handle. I don't have rigorous performance benchmarks (though I'm happy to create them!) beyond what I mentioned in the PR.

@magsol magsol Removed the import to scipy's sparse eigensolver, and replaced it with sklearn's util.arpack wrapper for backwards compatibility.
7d3f5ae
@weilinear

Maybe it is a better idea to import from ..utils :)

Ahhh, fair enough, thanks :) Also need to fix the doc quotes anyway, I'll do that today.

@larsmans

Isn't this nested Python loop very slow?

It's not ideal, I'll certainly agree. I was pointed to this sklearn utility by another reviewer as a possible alternative for generating pairwise affinities in 8-pixel neighborhoods (since there's no reason to generate pairwise affinities between all pixels).

Even so, this nested loop actually isn't the bottleneck. I've tested it on reasonably-sized images--640x480, for example--and this loop runs within a second or two. The slowest part, by several orders of magnitude, is the eigensolver. Where this takes 1-2 seconds, even the sparse eigensolver can run for 10-15 seconds (or longer) on a medium-sized image.

Owner

That might be something to note in the documentation. Is this algo supposed to be transductive or inductive?

Absolutely. I'm not actually sure what you mean by transductive vs inductive in this particular case, but to the best of my understanding the algorithm is transductive: it considers all the points, even those that are not labeled, when forming the eigenfunctions for propagating the labels that do exist.

Owner

By transductive, I mean algorithms that cannot be applied to unseen points. I.e. you'd get a labeling for the unlabeled points in the training set, but you can't use it outside that set.

Ahh, I understand. In its current form it is transductive; it is theoretically possible to apply eigenfunction approximations on new unlabeled data, but new labeled data would still require a new eigensolution. Currently in practice, any new labeled or unlabeled data still require a new eigensolution.

I will remove the predict and predict_proba methods to reflect this.

@larsmans

Also, is it possible to construct this CSC matrix in one go?

As in, reduce the contents of this loop to a single call to the CSC constructor?

Owner

Not exactly, you'd still need the loop. But if the above loops aren't a problem, then neither will this, so never mind.

@larsmans
Owner

We don't usually name estimators after papers/authors. How about EigenPropagation? Or EigenfunctionPropagation?

@magsol

That sounds fine to me, I can certainly implement that change :) In general I apologize for the lack of updates; I have logged the changes that have been requested so far, but I've had to put this work on hold until we publish on another project. Standard fare for academia, I suppose.

@mblondel
Owner

@magsol Since your changes are rather self-contained, could you create a gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets?

Commits on Sep 27, 2012
  1. @magsol

    New semi-supervised algorithm based on Fergus et al 2009. Rather than iteratively converge to a solution, this algorithm explicitly solves the closed-form solution using the eigenvectors of the normalized graph laplacian.

    magsol authored
Commits on Oct 2, 2012
  1. @magsol

    Removed the import to scipy's sparse eigensolver, and replaced it with sklearn's util.arpack wrapper for backwards compatibility.

    magsol authored
Commits on Oct 4, 2012
  1. @magsol
Showing with 301 additions and 0 deletions.
  1. +301 −0 sklearn/semi_supervised/fergus_propagation.py
301 sklearn/semi_supervised/fergus_propagation.py
@@ -0,0 +1,301 @@
+import numpy as np
+import scipy.linalg as LA
+import scipy.sparse
+from ..utils import arpack as SLA
+from ..base import ClassifierMixin
+from ..base import BaseEstimator
+from ..metrics import pairwise
+from ..neighbors import NearestNeighbors
+from .. import mixture
+
+class FergusPropagation(BaseEstimator, ClassifierMixin):
+ """
+ The Fergus et al 2009 eigenfunction propagation classifier.
+
+ Parameters
+ ----------
+ kernel : {'rbf', 'img', 'knn'}
+ String identifier for kernel function to use.
+ rbf: Creates a fully-connected dense RBF kernel.
+ img: Creates a sparse RBF kernel where each pixel is connected only
+ to its 8-neighborhood; all others are 0. ONLY use for image data!
+ NOTE: Currently only supports .png images.
+ knn: Creates RBF kernel with connections to closest n_neighbors points,
+ measured in euclidean distance.
+ k : integer > 0
+ Number of eigenvectors to use in the eigen-decomposition (default: the
+ number of unique labels, not counting the unlabeled marker).
+ gamma : float > 0
+ Parameter for RBF kernel (modes: rbf, img) (default: 20).
+ n_neighbors : int > 0
+ Parameter for the KNN kernel (modes: knn).
+ lagrangian : float > 0
+ Lagrangian multiplier to constrain the regularization penalty (default: 10).
+ img_dims : array-like, shape = [width, height]
+ Tuple specifying the width and height (or COLS and ROWS, respectively)
+ of the input image. Used only with the 'img' kernel type.
+
+ Examples
+ --------
+ TODO
+
+ References
+ ----------
+ Rob Fergus, Yair Weiss, Antonio Torralba. Semi-Supervised Learning in
+ Gigantic Image Collections (2009).
+ http://eprints.pascal-network.org/archive/00005636/01/ssl-1.pdf
+ """
+
+ def __init__(self, kernel = 'rbf', k = -1, gamma = 20, n_neighbors = 7, lagrangian = 10, img_dims = (-1, -1)):
+ # This doesn't iterate at all, so the parameters are very different
+ # from the BaseLabelPropagation parameters.
+ self.kernel = kernel
+ self.k = k
+ self.gamma = gamma
+ self.n_neighbors = n_neighbors
+ self.lagrangian = lagrangian
+ self.img_dims = img_dims
+ self.nn_fit = None # cache for the NearestNeighbors model used by the 'knn' kernel
+
+ def fit(self, X, y):
+ """
+ Fit a semi-supervised eigenfunction label propagation model.
+
+ All input data is provided in matrix X (labeled and unlabeled), and
+ corresponding label vector y with dedicated marker value (-1) for
+ unlabeled samples.
+
+ Parameters
+ ----------
+ X : array-like, shape = [n_samples, n_features]
+ Feature vectors for data.
+ y : array-like, shape = [n_samples]
+ Label array, unlabeled points are marked as -1. All unlabeled
+ samples will be assigned labels.
+
+ Returns
+ -------
+ self : An instance of self.
+ """
+ self.X_ = X
+
+ # Construct the graph laplacian from the graph matrix.
+ W = self._get_kernel(self.X_)
+ L = self._normalized_graph_laplacian(W)
+
+ # Perform the eigen-decomposition.
+ vals = None
+ vects = None
+ classes = np.sort(np.unique(y))
+ if classes[0] == -1:
+ classes = np.delete(classes, 0) # remove the -1 from this list
+ if self.k == -1:
+ self.k = np.size(classes)
+
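+ # NOTE: _normalized_graph_laplacian() returns D^-0.5 * A * D^-0.5 rather than
+ # I - D^-0.5 * A * D^-0.5, so the largest eigenpairs requested below correspond
+ # to the smallest eigenvalues of the usual normalized graph laplacian.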
+ if scipy.sparse.isspmatrix(L):
+ vals, vects = SLA.eigsh(L, k = self.k, which = 'LM', maxiter = 1000)
+ else:
+ M = L.shape[0]
+ vals, vects = LA.eigh(L, eigvals = (M - self.k, M - 1))
+ self.u_ = vals
+
+ # Construct some matrices.
+ # U: k eigenvectors corresponding to smallest eigenvalues. (n_samples by k)
+ # S: Diagonal matrix of k smallest eigenvalues. (k by k)
+ # V: Diagonal matrix of LaGrange multipliers for labeled data, 0 for unlabeled. (n_samples by n_samples)
+ self.U_ = np.real(vects)
+ S = np.diag(self.u_)
+ V = np.diag(np.zeros(np.size(y)))
+ labeled = np.where(y != -1)
+ V[labeled, labeled] = self.lagrangian
+
+ # Solve for alpha and use it to compute the eigenfunctions, f.
+ A = S + np.dot(np.dot(self.U_.T, V), self.U_)
+ b = np.dot(np.dot(self.U_.T, V), y)
+ alpha = LA.solve(A, b)
+ f = np.dot(self.U_, alpha)
+
+ # Set up a GMM to assign the hard labels from the eigenfunctions.
+ g = mixture.GMM(n_components = np.size(classes))
+ g.fit(f)
+ self.labels_ = g.predict(f)
+ means = np.argsort(g.means_.flatten())
+ for i in range(0, np.size(self.labels_)):
+ self.labels_[i] = np.where(means == self.labels_[i])[0][0]
+
+ # Done!
+ return self
+
+ def predict(self, X):
+ """
+ Performs inductive inference across the model.
+
+ Parameters
+ ----------
+ X : array_like, shape = [n_samples, n_features]
+
+ Returns
+ -------
+ y : array_like, shape = [n_samples]
+ Predictions for input data
+ """
+ probas = self.predict_proba(X)
+ return self.labels_[np.argmax(probas, axis=1)].ravel()
+
+ def predict_proba(self, X):
+ """
+ Predict probability for each possible outcome.
+
+ Compute the probability estimates for each single sample in X
+ and each possible outcome seen during training (categorical
+ distribution).
+
+ Parameters
+ ----------
+ X : array_like, shape = [n_samples, n_features]
+
+ Returns
+ -------
+ probabilities : array, shape = [n_samples, n_classes]
+ Normalized probability distributions across
+ class labels
+ """
+ X_2d = None
+ if scipy.sparse.isspmatrix(X):
+ X_2d = X
+ else:
+ X_2d = np.atleast_2d(X)
+ weight_matrices = self._get_kernel(self.X_, X_2d)
+ if self.kernel == 'knn':
+ probabilities = []
+ for weight_matrix in weight_matrices:
+ ine = np.sum(self.label_distributions_[weight_matrix], axis=0)
+ probabilities.append(ine)
+ probabilities = np.array(probabilities)
+ else:
+ weight_matrices = weight_matrices.T
+ probabilities = np.dot(weight_matrices, self.label_distributions_)
+ normalizer = np.atleast_2d(np.sum(probabilities, axis=1)).T
+ probabilities /= normalizer
+ return probabilities
+
+ def _get_kernel(self, X, y = None):
+ if self.kernel == "rbf":
+ if y is None:
+ return pairwise.rbf_kernel(X, X, gamma = self.gamma)
+ else:
+ return pairwise.rbf_kernel(X, y, gamma = self.gamma)
+ elif self.kernel == "img":
+ return self._img_rbf_kernel(X)
+ elif self.kernel == "knn":
+ if self.nn_fit is None:
+ self.nn_fit = NearestNeighbors(self.n_neighbors).fit(X)
+ if y is None:
+ return self.nn_fit.kneighbors_graph(self.nn_fit._fit_X,
+ self.n_neighbors, mode = 'connectivity')
+ else:
+ return self.nn_fit.kneighbors(y, return_distance = False)
+ else:
+ raise ValueError("%s is not a valid kernel. Only rbf, img and knn"
+ " are supported at this time" % self.kernel)
+
+ def _img_rbf_kernel(self, X):
+ A = np.diag(np.ones(X.shape[0]))
+ index = 0
+ for i in range(0, self.img_dims[1]):
+ for j in range(0, self.img_dims[0]):
+
+ if i - 1 >= 0:
+ # Directly above.
+ other = index - self.img_dims[0]
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if i + 1 < self.img_dims[1]:
+ # Directly below.
+ other = index + self.img_dims[0]
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if j - 1 >= 0:
+ # Directly to the left.
+ other = index - 1
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if j + 1 < self.img_dims[0]:
+ # Directly to the right.
+ other = index + 1
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if i - 1 >= 0 and j - 1 >= 0:
+ # Upper left corner.
+ other = index - self.img_dims[0] - 1
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if i - 1 >= 0 and j + 1 < self.img_dims[0]:
+ # Upper right corner.
+ other = index - self.img_dims[0] + 1
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if i + 1 < self.img_dims[1] and j - 1 >= 0:
+ # Lower left corner.
+ other = index + self.img_dims[0] - 1
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ if i + 1 < self.img_dims[1] and j + 1 < self.img_dims[0]:
+ # Lower right corner.
+ other = index + self.img_dims[0] + 1
+ rbf = pairwise.rbf_kernel(X[index], X[other], gamma = self.gamma)
+ A[index, other] = rbf
+ A[other, index] = rbf
+
+ index += 1
+ return scipy.sparse.csc_matrix(A)
+
+ def _normalized_graph_laplacian(self, A):
+ """
+ Calculates the normalized graph laplacian, as the sklearn utility
+ graph_laplacian(normed = True) does not seem to do this.
+
+ Parameters
+ ----------
+ A : array, shape (n, n)
+ Symmetric affinity matrix.
+
+ Returns
+ -------
+ L : array, shape (n, n)
+ Normalized graph laplacian.
+ """
+ L = None
+ if scipy.sparse.isspmatrix(A):
+
+ # SKLEARN
+ n_nodes = A.shape[0]
+ if not A.format == 'coo':
+ L = A.tocoo()
+ else:
+ L = A.copy()
+ w = np.sqrt(np.asarray(L.sum(axis = 0)).squeeze())
+ L.data /= w[L.row]
+ L.data /= w[L.col]
+ else:
+ L = A.copy()
+ d = np.sqrt(L.sum(axis = 0))
+ L /= d
+ L /= d[:, np.newaxis]
+
+ # The last line effectively transposes the vector d, swapping
+ # the rows and columns of the division operation. Pretty clever,
+ # scikit-learn!
+ return L