[WIP] Sparse output KNN #3350

Open · hamsal wants to merge 35 commits into scikit-learn:master from hamsal:sprs-out-knn


@hamsal
hamsal commented Jul 8, 2014 edited
  • Implement Cython CSR row-wise mode function in sparsefuncs_fast
  • Implement Cython CSR row-wise weighted_mode function in sparsefuncs_fast
  • Backport scipy sparse advanced indexing

This PR originally started out with Cython-based improvements, but a middle ground has been chosen for simplicity of implementation.



@hamsal hamsal changed the title from [WIP] Sparse output support for KNN Classifiers to [WIP] Sparse output KNN Jul 8, 2014
@hamsal

Today I wrote some more code to replace the existing predict function in KNeighborsClassifier. The code is in a very rough state, but it is clear where the next improvements will come: 1) return the sparse format by convention (sparse target in means sparse target out); 2) replace the dense mode code with a mode function written for a sparse matrix, and do the same for weighted_mode.
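For reference, the convention targeted here would behave roughly like this (a sketch of this PR's intended behavior, not of current master):

import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(6, 2)
Y = sp.csr_matrix(rng.randint(0, 2, size=(6, 3)))  # sparse multilabel target

knn = KNeighborsClassifier(n_neighbors=3).fit(X, Y)
# Sparse target in means sparse target out:
assert sp.issparse(knn.predict(X))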

@hamsal

How can I import my Cython function into a module? I am writing a sparse row-wise mode function csr_row_mode for CSC/CSR matrices in the Cython file utils/sparsefuncs_fast.pyx, and I compile the Cython file to get a new sparsefuncs_fast.c. However, when I attempt to import the function in classification.py using from ..utils.sparsefuncs_fast import csr_row_mode, it fails to find it. I suspect there is an additional step to get it ready for import.

@GaelVaroquaux
scikit-learn member
@arjoly
scikit-learn member

Thinking about it. There is an implementation of mode with sparse matrix in the imputation module.
(pure numpy & scipy)

@hamsal

@arjoly the implementation for the mode looks like it loops over columns

from Imputer._sparse_fit

            # Most frequent
            elif strategy == "most_frequent":
                most_frequent = np.empty(len(columns))

                for i, column in enumerate(columns):
                    most_frequent[i] = _most_frequent(column,
                                                      0,
                                                      n_zeros_axis[i])
@hamsal

Thank you Gael, that did it.

@hamsal

The Cython CSR row-wise mode looks to be working correctly. Although I think there is room for improvement in efficiency, I am going to move on to a Cython implementation of weighted mode.

@arjoly arjoly and 2 others commented on an outdated diff Jul 11, 2014
sklearn/utils/sparsefuncs_fast.pyx
+        np.ndarray[np.float64_t, ndim=2, mode="c"] modes
+        np.ndarray[np.float64_t, ndim=1, mode="c"] data
+        np.ndarray[int, ndim=1, mode="c"] indices = X.indices
+        np.ndarray[int, ndim=1, mode="c"] indptr = X.indptr
+
+        np.npy_intp i, j
+        np.npy_int mode
+
+    # XXX these need to be dtype=X.data.dtype
+    modes = np.zeros((n_samples, 1), dtype=np.float64)
+    data = np.asarray(X.data, dtype=np.float64)  # might copy!
+
+    for i in range(n_samples):
+        nnz = indptr[i + 1] - indptr[i]
+        if nnz > 0:
+            nonz_mode, count = stats.mode(data[indptr[i]:indptr[i + 1]])
@hamsal
hamsal Jul 12, 2014

I see. I benchmarked this Cython function against Python code with the same implementation and it performed the same. The next step will be to do this calculation with Cython code.

@arjoly
arjoly Jul 12, 2014

Before working on an optimized version, does it work with the simple Python version?

@jnothman
jnothman Jul 12, 2014

FWIW, a vectorized implementation of mode for a 1d array is:

def mode(a):
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]

(and unlike scipy.stats.mode, this doesn't change the dtype)

@jnothman
jnothman Jul 12, 2014

Note also that weighted bincount should be sufficient to support the weighted case.
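For illustration, the weighted case might then look like this (a sketch; weighted_mode_1d is an illustrative name, not scikit-learn's weighted_mode):

import numpy as np

def weighted_mode_1d(a, w):
    # Same idea as the vectorized mode above, but each occurrence of a
    # value contributes its weight via np.bincount's `weights` argument.
    u, inv = np.unique(a, return_inverse=True)
    c = np.bincount(inv, weights=w, minlength=len(u))
    m = np.argmax(c)
    return u[m], c[m]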

@hamsal
hamsal Jul 12, 2014

This function as it is right now works as a mode function; it is slow, however, because of what you pointed out. I will focus on getting a functional PR first before coming back to optimize this using Joel's comments.

@jnothman jnothman and 1 other commented on an outdated diff Jul 12, 2014
sklearn/utils/sparsefuncs_fast.pyx
@@ -293,3 +294,33 @@ def assign_rows_csr(X,
        for ind in range(indptr[rX], indptr[rX + 1]):
            j = indices[ind]
            out[out_rows[i], j] = data[ind]
+
+
+@cython.boundscheck(False)
+@cython.wraparound(False)
+@cython.cdivision(True)
+def csr_row_mode(X):
+    """L2 norm of each row in CSR matrix X."""
@jnothman
jnothman Jul 12, 2014

Wrong comment.

@jnothman
jnothman Jul 12, 2014

You need to be clear that this is only the mode over nonzeros (or explicit entries). I.e. it is treating the sparse matrix as if it is a masked array.

@hamsal
hamsal Jul 12, 2014

I believe this function has support for the zeros in a sparse matrix: we first take the one-dimensional mode of the data in the row, then we use nnz to count the zeros and check whether the number of zeros is larger than the count of the most frequent element from the mode calculation, returning zero if so.

@jnothman
jnothman Jul 12, 2014

Oh, I didn't see that at the end. I haven't really understood what's going on in the predict function altogether, or what sparse support is needed for. But if the sparse representation is efficient, the mode should almost always be 0. In that case, testing for nnz being greater than half the elements should be done before any slower mode calculation.

And as suggested above, if more than half the elements are nonzero, CSR is not really useful.
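To make that concrete, a pure-Python version with that early exit might look like the following (a sketch only; the name and the tie-breaking toward zero are my choices, not the PR's):

import numpy as np

def csr_row_mode_py(X):
    # Row-wise mode of a CSR matrix, counting implicit zeros. If at most
    # half the entries of a row are explicit, the mode is 0 and the slow
    # path is skipped entirely.
    n_samples, n_features = X.shape
    modes = np.zeros(n_samples, dtype=X.dtype)
    nnz = np.diff(X.indptr)
    for i in range(n_samples):
        if 2 * nnz[i] <= n_features:
            continue  # zeros are at least as frequent as any stored value
        row = X.data[X.indptr[i]:X.indptr[i + 1]]
        u, inv = np.unique(row, return_inverse=True)
        c = np.bincount(inv, minlength=len(u))
        m = np.argmax(c)
        if c[m] > n_features - nnz[i]:  # must beat the implicit-zero count
            modes[i] = u[m]
    return modes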

@vene vene and 1 other commented on an outdated diff Jul 15, 2014
sklearn/neighbors/base.py
+                classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)
+                self.classes_.append(classes)
+
+            if not self.outputs_2d_:
+                self.classes_ = self.classes_[0]
+                self._y = self._y.ravel()
+        else:
+            y = sp.csc_matrix(y)
+
+            data = np.array([])
+            self.classes_ = []
+            for k in range(y.shape[1]):
+                k_col = y.getcol(k)
+                classes = np.unique(k_col.data)
+                if not k_col.nnz == y.shape[0]:
+                    classes = np.insert(classes,0,0)
@vene
vene Jul 15, 2014

This is not pep8. Also I think this can break in case y has explicit zeros.
Or what if a user has an array with {+1, -1} labels and no zeros, but for some crazy reason converts it to sparse matrix? :/

@hamsal
hamsal Jul 16, 2014

I have updated the code to eliminate zeros from the sparse matrix when it is initially cast to CSC. I believe the if statement protects against running this insert when the sparse column is fully populated.

@vene vene and 1 other commented on an outdated diff Jul 15, 2014
sklearn/neighbors/base.py
-            classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)
-            self.classes_.append(classes)
+        self.sparse_target_input_ = sp.issparse(y)
+
+        if not sp.issparse(y):
+            self.classes_ = []
+            self._y = np.empty(y.shape, dtype=np.int)
+            for k in range(self._y.shape[1]):
+                classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)
+                self.classes_.append(classes)
+
+            if not self.outputs_2d_:
+                self.classes_ = self.classes_[0]
+                self._y = self._y.ravel()
+        else:
+            y = sp.csc_matrix(y)
@vene
vene Jul 15, 2014

y is already sparse, shouldn't this use y.tocsc?

@vene vene and 1 other commented on an outdated diff Jul 15, 2014
sklearn/neighbors/tests/test_neighbors.py
@@ -418,6 +419,55 @@ def test_KNeighborsClassifier_multioutput():
        for proba_mo, proba_so in zip(y_pred_proba_mo, y_pred_proba_so):
            assert_array_almost_equal(proba_mo, proba_so)
+
+
+def test_KNeighborsClassifier_multioutput_sparse():
@vene
vene Jul 15, 2014

capital letters in test names (function names in general) are nonstandard

@hamsal
hamsal Jul 16, 2014

This has been renamed to test_neighbors_classifier_multioutput_sparse.

@hamsal

For some reason I have been unable to recreate any of the Travis errors. I have replicated the environments using the correct versions of numpy and scipy, and my local copy of this branch builds fine. The Python 3 Travis run makes it look like there is something wrong with building the sparsefuncs_fast.c file.

@hamsal

Ping @jnothman, @arjoly, @vene: is there a way to get advanced indexing support for sparse matrices in the earlier versions of numpy? In this PR there is a need for indexing a matrix with a 2D array. My first thought was to backport the __getitem__ code from scipy's sparse matrices. Unfortunately that code is pretty intertwined with a number of other functions and ultimately goes down to some C-level code, so this would be a very big port from scipy. The remaining solution I have come up with is to write a Cython function to do this, but first I would like some input.

@arjoly
scikit-learn member

Does it work with numpy 1.8?

hamsal added some commits Jul 8, 2014
@hamsal hamsal Included some preliminary scaffolding and testing for sparse output knn 40cb849
@hamsal hamsal Implemented some basic outline for predict with sparse support 7ffd392
@hamsal hamsal Saved _y as sparse matrix out of base fit if the input target data is sparse a41a463
@hamsal hamsal Implemented a row-wise mode cython function for CSC matrices e3afae1
@hamsal hamsal Completed implementation of csr_row_mode 60c04b2
@hamsal hamsal Cleaned lines no longer used 7057858
@hamsal hamsal Include full path for imports, remove print in predict 2ed6bda
@hamsal hamsal Use .tocsc instead of constructor, eliminate zeros, format pep8 3b1ac31
@hamsal hamsal Format pep8 768a613
@hamsal hamsal Rename testing fun lower case, format pep8, correct comment csr_row_mode
The function in test_neighbors has been renamed test_neighbors_classifier_multioutput_sparse
c14c771
@hamsal hamsal Implement csr_row_mode (Reintroduction to .c) b07035e
@hamsal hamsal Test kneighbors with sparse column 6870b8d
@hamsal hamsal Update comments in kneighbors predict
Reword comments concerning self.sparse_target_input to reflect that fit may not give sparse target data in all cases
19f8434
@hamsal hamsal Make dense columns during prediction as a naive correctness implementation 7c71da2
@hamsal hamsal Use assert almost equal to replace assert equal in testing 7c1fcdd
@hamsal hamsal Revert to sp mode function with naive dense column 2af4597
@hamsal hamsal Manually extract column data, scipy 0.9 vs 0.14 give different .data ordering 48efa6a
@hamsal hamsal Revert to naive advanced indexing by densifying column first c510547
@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling c510547 on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@hamsal

Python 3.4 is giving me ValueError: level must be >= 0. Googling this tells me I am probably doing something wrong with the Cython file / C code for Python 3.4.

@hamsal

Before doing an expensive backport of scipy's indexing, I have benchmarked this PR as it currently stands on scipy 0.14, which passes. The data used is the eurlex dataset. Here are the links to the benchmark script and the dataset in libsvm format.

Benchmark 1 - Dense Target Data (Master)

Benchmark 2 - Sparse Target Data

The benchmark shows that this PR has considerably reduced the memory used by the target data; the target data is about 550 MB when dense. Speed improvements are still necessary, but it is already known that these most likely come from the unoptimized sparse CSR mode, which I need to improve based on the comments given.

@jnothman
scikit-learn member

In this PR there is a need for indexing a matrix with a 2D array

It may be worth only providing a partial backport of scipy fancy indexing. I assume you are talking about the _y[neigh_ind, k] lookup, where _y is a CSC matrix, neigh_ind is a 2d array of ints, and k is an int. First, run _y.sum_duplicates(). Then you can use a searchsorted over _y.indices[_y.indptr[k]:_y.indptr[k+1]], looking up neigh_ind. You then need to get the corresponding _y.data entry, and set it to 0 if the row index retrieved does not match the one looked up. Does that make sense? Is it so complicated that it would better be backported in full?
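For concreteness, a sketch of that lookup might look like the following (csc_col_take is an illustrative name and the details are a sketch, not the PR's implementation):

import numpy as np

def csc_col_take(y, neigh_ind, k):
    # Roughly equivalent to y[neigh_ind, k].toarray() for a CSC matrix y
    # and a 2d int array neigh_ind, without scipy fancy indexing.
    y.sum_duplicates()                    # sorted, duplicate-free indices
    start, stop = y.indptr[k], y.indptr[k + 1]
    rows = y.indices[start:stop]          # sorted row indices of column k
    vals = y.data[start:stop]
    out = np.zeros(neigh_ind.shape, dtype=y.dtype)
    if len(rows) == 0:
        return out                        # column holds only implicit zeros
    pos = np.searchsorted(rows, neigh_ind).clip(max=len(rows) - 1)
    found = rows[pos] == neigh_ind        # explicit entry at this position?
    out[found] = vals[pos[found]]         # unmatched lookups stay 0
    return out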

@larsmans also knows scipy.sparse well and may have an opinion.

@larsmans
scikit-learn member

Your suggestion seems fine to me. The __getitem__ implementations in scipy.sparse are long and complicated; if possible I wouldn't backport them in full.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling 043ed2e on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@hamsal

Thanks @jnothman! It does make sense; I implemented it and it is functioning without error.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling e82770e on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@jnothman
scikit-learn member

You're welcome, but I would extract it out into a separate function and write a couple of (at least doctest) tests.
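For example, a doctest on the sketch above might read (illustrative values; the exact function name and signature are mine):

def csc_col_take(y, neigh_ind, k):
    """Take y[neigh_ind, k] from a CSC matrix without scipy fancy indexing.

    >>> import numpy as np
    >>> import scipy.sparse as sp
    >>> y = sp.csc_matrix(np.array([[0, 1], [2, 0], [0, 3]]))
    >>> csc_col_take(y, np.array([[0, 2], [1, 1]]), 1)
    array([[1, 3],
           [0, 0]])
    """
    ...  # body as in the sketch above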

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling 72f5cdd on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling e61368f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls

Coverage Status

Coverage decreased (-0.01%) when pulling cb04d96 on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@hamsal hamsal changed the title from [WIP] Sparse output KNN to [MRG] Sparse output KNN Aug 15, 2014
@hamsal

Here is a re-run of the benchmark for this PR in its most recent state.

Benchmark 1 - Dense Target Data (Master)

Benchmark 2 - Sparse Target Data

This benchmark again shows that this PR has considerably reduced the memory used by the target data, which is about 550 MB when dense. The newly adopted changes, dropping the unoptimized Cython code, also give a decent speed improvement, from 37 s to 24 s total runtime.

@jnothman
scikit-learn member

Looking good! Out of curiosity, do you know what causes that memory spike at 15s in the sparse diagram?

@hamsal

Again, maybe it's just my environment, but the line-by-line memory profiling results are not really meaningful for me, so I can't tell for sure.

It is possible that it comes from the nearest-neighbors Cython code. I am not too familiar with it, but I think it is the most likely explanation, since it is the only piece of code in predict that operates directly on the X array, which is where most of the memory consumption is to be expected.

@arjoly arjoly and 1 other commented on an outdated diff Aug 18, 2014
sklearn/neighbors/base.py
+            self._y = self._y.ravel()
+        else:
+            y = y.tocsc()
+            y.eliminate_zeros()
+            nnz = np.diff(y.indptr)
+            data = np.array([])
+            self.classes_ = []
+
+            for k in range(y.shape[1]):
+                k_col_data = y.data[y.indptr[k]:y.indptr[k + 1]]
+                classes = np.unique(k_col_data)
+
+                if not nnz[k] == y.shape[0]:
+                    classes = np.insert(classes, 0, 0)
+                self.classes_.append(classes)
+                data_k = [np.where(classes == e)[0][0] for e in k_col_data]
@arjoly
arjoly Aug 18, 2014

Can you explain what you are doing here?
Would it be possible to use np.unique with return_inverse?

@jnothman
jnothman Aug 18, 2014

If not, np.where(x)[0][0] === np.argmax(x) (as long as the former wouldn't have produced an IndexError).
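For what it's worth, arjoly's suggestion collapses the loop into one call (toy values; this ignores the zero class inserted at position 0):

import numpy as np

k_col_data = np.array([3., 1., 3., 7.])
classes, data_k = np.unique(k_col_data, return_inverse=True)
# classes -> array([1., 3., 7.]);  data_k -> array([1, 0, 1, 2])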

@arjoly arjoly commented on an outdated diff Aug 18, 2014
sklearn/neighbors/base.py
+        if not sp.issparse(y):
+            self.classes_ = []
+            self._y = np.empty(y.shape, dtype=np.int)
+            for k in range(self._y.shape[1]):
+                classes, self._y[:, k] = np.unique(y[:, k],
+                                                   return_inverse=True)
+                self.classes_.append(classes)
+
+            if not self.outputs_2d_:
+                self.classes_ = self.classes_[0]
+                self._y = self._y.ravel()
+        else:
+            y = y.tocsc()
+            y.eliminate_zeros()
+            nnz = np.diff(y.indptr)
+            data = np.array([])
@arjoly
arjoly Aug 18, 2014

Why not use a list?

@arjoly arjoly commented on an outdated diff Aug 18, 2014
sklearn/neighbors/classification.py
-from .base import \
+from sklearn.neighbors.base import \
@arjoly
arjoly Aug 18, 2014

Why do you need to change those imports?

@arjoly
arjoly Aug 18, 2014

If you want, it's possible to clean up the imports by having one import per line, in the form of

from .base import _check_weights
from .base import _get_weights
...
@arjoly
scikit-learn member

Since we are going for the simple road, it would be nice to make it pass in a similar way for predict_proba.

@arjoly arjoly commented on an outdated diff Aug 18, 2014
sklearn/neighbors/tests/test_neighbors.py
+        knn.fit(X_train, y_train.getcol(o).toarray().ravel())
+        y_pred_so.append(knn.predict(X_test))
+        y_pred_proba_so.append(knn.predict_proba(X_test))
+
+    y_pred_so = np.vstack(y_pred_so).T
+    assert_equal(y_pred_so.shape, y_test.shape)
+    assert_equal(len(y_pred_proba_so), n_output)
+
+    # Multioutput prediction
+    knn_mo = neighbors.KNeighborsClassifier(weights=weights,
+                                            algorithm=algorithm)
+    knn_mo.fit(X_train, y_train)
+    y_pred_mo = knn_mo.predict(X_test)
+
+    assert_equal(y_pred_mo.shape, y_test.shape)
+    assert_true(sp.issparse(y_pred_mo))
@arjoly
arjoly Aug 18, 2014

Those two assertions are already made in the next line.

@arjoly
scikit-learn member

Since we are going for the simple road, it would be nice to make it pass in a similar way for predict_proba.

I missed the relevant lines in the diff.

@arjoly
scikit-learn member

Looks good, last comment is cosmetic.

@arjoly arjoly changed the title from [MRG] Sparse output KNN to [MRG+1] Sparse output KNN Aug 19, 2014
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling ccae2ae on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@jnothman jnothman commented on the diff Aug 20, 2014
sklearn/neighbors/base.py
-        if not self.outputs_2d_:
-            self.classes_ = self.classes_[0]
-            self._y = self._y.ravel()
+        if not sp.issparse(y):
@jnothman
jnothman Aug 20, 2014

This logic should probably be moved to LabelEncoder at some point; it currently handles neither multioutput nor sparse targets (but the latter had only been used for binary targets until now). The sparse implementation is not explicitly tested, and some of its conditions are only exercised because the random number generation happens to produce both entirely dense and non-dense columns.

@jnothman jnothman and 1 other commented on an outdated diff Aug 20, 2014
sklearn/neighbors/classification.py
+        indices = array.array('i')
+        indptr = array.array('i', [0])
+
+        for k, classes_k in enumerate(classes_):
+            neigh_lbls_k = _y.getcol(k).toarray().ravel()[neigh_ind]
+
+            if weights is None:
+                mode, _ = stats.mode(neigh_lbls_k, axis=1)
+            else:
+                mode, _ = weighted_mode(neigh_lbls_k, weights, axis=1)
+
+            data.extend(mode[mode != 0].astype(_y.dtype))
+            indices.extend(np.where(mode != 0)[0])
+            indptr.append(len(indices))
+
+        y_pred = sp.csc_matrix((data, indices, indptr),
@jnothman
jnothman Aug 20, 2014

You seem to be missing a classes_.take step. Please add a test where class numbers are not contiguous (or else explicitly reject such data).

I am also not sure that you should be requiring the data to be intps, rather than classes_[0].dtype as in the dense case (with the caveat that only a dtype that is supported for the sparse input will be produced as output). I could imagine a boolean array input, but otherwise this might merely be moot.
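For reference, the missing classes_.take step maps encoded mode indices back to the original labels (toy values):

import numpy as np

classes_k = np.array([0, 3, 7])  # non-contiguous class labels
mode = np.array([1, 2, 0])       # modes as indices into classes_k
classes_k.take(mode)             # -> array([3, 7, 0])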

@hamsal
hamsal Aug 20, 2014

I have corrected this; the test was changed so that it fails if the sampling from classes_ is not used. The dtype is now also maintained when predicting, and there is an assert for this in the test.

@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling 2ba9f3f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 2ba9f3f on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling e640807 on hamsal:sprs-out-knn into 83223fd on scikit-learn:master.

@jnothman
scikit-learn member

thanks

@hamsal

The sparse multioutput LabelEncoder for use in this PR has been started in #3592.

@jnothman jnothman referenced this pull request Aug 26, 2014
Open

[WIP] Sparse and Multioutput LabelEncoder #3592

2 of 7 tasks complete
@hamsal hamsal changed the title from [MRG+1] Sparse output KNN to [WIP+1] Sparse output KNN Aug 28, 2014
@hamsal hamsal changed the title from [WIP+1] Sparse output KNN to [WIP] Sparse output KNN Aug 28, 2014
@hamsal

Marked WIP until some other PRs important for this one are finalized.

@jnothman
scikit-learn member

happened upon this. Are you going to finish it, @hamsal?
