WIP: check_unary_or_binary function in sklearn.utils.arrayfuncs #1412

Open
wants to merge 1 commit into from

4 participants

@tjanez
tjanez commented Nov 26, 2012

This is my first take at implementing a function that @mblondel suggested in issue #1393.

Please note that this is the first time I'm writing something in Cython, so comments and suggestions are very welcome.

@amueller
scikit-learn member

Thanks for the PR @tjanez.
Couldn't this be done with np.unique?

@amueller
scikit-learn member

Sorry, I didn't see @mblondel's comment. Could you compare performance with np.unique?

@amueller
scikit-learn member

One general comment: did you run cython -a on this? It is a very useful feature of cython that I discovered only pretty late..

@mblondel
scikit-learn member

A dedicated function has two advantages over np.unique: it can stop as soon as it detects more than 2 values, and it doesn't allocate memory (with np.unique, in the worst case, the allocated memory can be as big as the input).
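
To make the two advantages concrete, here is a minimal pure-Python sketch of the early-exit idea. It is not the Cython code from this PR: the name `is_unary_or_binary` is hypothetical, and it follows the PR's convention that an array with fewer than two elements passes the check. Unlike the loop quoted in the diff, it compares each element against both stored values rather than only the most recently stored one. The np.unique baseline it would be benchmarked against is roughly `np.unique(X).size <= 2`.

```python
def is_unary_or_binary(values):
    """Return True if `values` holds at most two distinct values.

    Hypothetical reference sketch, not the PR's Cython implementation.
    """
    seen = []  # holds at most two entries, independent of the input size
    for v in values:
        if v in seen:
            continue  # already one of the (at most two) known values
        if len(seen) == 2:
            return False  # third distinct value found: stop scanning here
        seen.append(v)
    return True


# The scan stops right after the third distinct value is seen.
it = iter([0.0, 1.0, 2.0, 3.0, 4.0])
print(is_unary_or_binary(it))             # False
print(is_unary_or_binary([1.0, 2.0, 1.0]))  # True
```

Because only two values are ever stored, the extra memory stays constant, whereas np.unique has to build an output array that can be as large as the input.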

@mblondel mblondel commented on the diff Nov 27, 2012
sklearn/utils/arrayfuncs.pyx
+ cdef int j
+ cdef float two_values[2]
+ if size < 2:
+ return 1
+ j = 0
+ two_values[j] = X[0]
+ for i in range(1, size):
+ if X[i] != two_values[j]:
+ if j == 0:
+ j += 1
+ two_values[j] = X[i]
+ else:
+ return 0
+ return 1
+
+cdef int _double_check_unary_or_binary(double *X, Py_ssize_t size):
@mblondel
scikit-learn member
mblondel added a note Nov 27, 2012

You can use the floating type. This will produce the two functions (double and float) for you. See:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise_fast.pyx

@larsmans
scikit-learn member
larsmans added a note Dec 3, 2012

Never mind. 😊

@mblondel mblondel commented on the diff Nov 27, 2012
sklearn/utils/arrayfuncs.pyx
@@ -113,3 +113,49 @@ def cholesky_delete(np.ndarray L, int go_out):
m = <int> L.strides[0] / sizeof (float)
float_cholesky_delete (m, n, <float *> L.data, go_out)
+def check_unary_or_binary(np.ndarray X):
+ """
+ Returns True if the given array contains only one or two unique values, and
+ False otherwise.
+ """
+ if X.dtype.name == 'float32':
+ return bool(_float_check_unary_or_binary(<float *> X.data, X.size))
+ elif X.dtype.name == 'float64':
+ return bool(_double_check_unary_or_binary(<double *> X.data, X.size))
+ else:
+ raise ValueError('Unsupported dtype for array X')
@mblondel
scikit-learn member
mblondel added a note Nov 27, 2012

If X is in COO, CSR or CSC sparse format, you can also pass X.data.data.
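
One thing to keep in mind with the sparse case: the `data` array of a COO, CSR or CSC matrix holds only the explicitly stored entries, so a check over it alone does not see the implicit zeros. Whether the zeros should count as one of the two values is not settled in this thread. The pure-Python sketch below assumes they should; the function name and the `n_total` parameter (rows times columns) are hypothetical.

```python
def sparse_is_unary_or_binary(stored_values, n_total):
    """Check a sparse matrix given its stored entries and total element count.

    Hypothetical sketch: `stored_values` plays the role of X.data and
    `n_total` is the number of elements in the full matrix. Assumes the
    implicit zeros count as a value.
    """
    seen = []
    if len(stored_values) < n_total:
        seen.append(0.0)  # at least one entry is an implicit zero
    for v in stored_values:
        if v in seen:
            continue
        if len(seen) == 2:
            return False  # third distinct value: stop early
        seen.append(v)
    return True


print(sparse_is_unary_or_binary([1.0, 1.0], 4))  # True: values are 0 and 1
print(sparse_is_unary_or_binary([1.0, 2.0], 4))  # False: values are 0, 1 and 2
```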

@mblondel
scikit-learn member

You didn't implement unit tests yet. And fixing #1393 directly in this PR would be nice.

BTW, thanks for the PR!

@amueller amueller commented on the diff Nov 27, 2012
sklearn/utils/arrayfuncs.pyx
@@ -113,3 +113,49 @@ def cholesky_delete(np.ndarray L, int go_out):
m = <int> L.strides[0] / sizeof (float)
float_cholesky_delete (m, n, <float *> L.data, go_out)
+def check_unary_or_binary(np.ndarray X):
+ """
+ Returns True if the given array contains only one or two unique values, and
+ False otherwise.
+ """
+ if X.dtype.name == 'float32':
+ return bool(_float_check_unary_or_binary(<float *> X.data, X.size))
+ elif X.dtype.name == 'float64':
+ return bool(_double_check_unary_or_binary(<double *> X.data, X.size))
+ else:
+ raise ValueError('Unsupported dtype for array X')
+
+cdef int _float_check_unary_or_binary(float *X, Py_ssize_t size):
@amueller
scikit-learn member
amueller added a note Nov 27, 2012

Instead of pointers, you could use typed memory views. Again, look at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise_fast.pyx for an example (because I used all the cool new sh*t there ;)

@tjanez
tjanez commented Nov 28, 2012

@amueller, @mblondel, thank you for your comments. I'll look into them and get back to you with any further questions.
