New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBSCAN gives incorrect result on precomputed sparse input if there are only zeros in first row #8306

Closed
alxgrh opened this Issue Feb 7, 2017 · 8 comments

Comments

Projects
None yet
4 participants
@alxgrh

alxgrh commented Feb 7, 2017

Description

DBSCAN returns incorrect labels array on given precomputed sparse input if there are only zeros in first row.

Steps/Code to Reproduce

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import dbscan

# Create example distance matrix 
# On such input and with epsilon value equal to 0.2 DBSCAN should leave first row unclustered, put 2nd and 3rd rows to one cluster and put 4th and 5th rows to another cluster
ar = np.array([
        [0.0, 0.0, 0.0, 0.0, 0.0 ],
        [0.0, 0.0, 0.2, 0.0, 0.3 ],
        [0.0, 0.2, 0.0, 0.0, 0.0 ],
        [0.0, 0.0, 0.0, 0.0, 0.1 ],
        [0.0, 0.3, 0.0, 0.1, 0.0 ]
    ])
matrix = csr_matrix(ar)

# direct method used just for reference. DBSCAN.fit() gives the same result.
dbscan(matrix, metric='precomputed', eps=0.2, min_samples=2)

Expected Results

(array([1, 2, 3, 4]), array([-1, 0, 0, 1, 1]))

Actual Results

(array([0, 2, 3, 4]), array([0, 0, 0, 1, 0]))

Error appears only if first row of input matrix consist of only zeroes.

Versions

Linux-4.9.4-100.fc24.x86_64-x86_64-with-fedora-24-Twenty_Four
('Python', '2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.4')
('SciPy', '0.17.0')
('Scikit-Learn', '0.17.1')

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Feb 7, 2017

Member

So we're to interpret this as: sample X[0] has no other items within eps distance. I agree this looks like a bug. Thanks for the report and the minimal example.

Member

jnothman commented Feb 7, 2017

So we're to interpret this as: sample X[0] has no other items within eps distance. I agree this looks like a bug. Thanks for the report and the minimal example.

@kahnchana

This comment has been minimized.

Show comment
Hide comment
@kahnchana

kahnchana Feb 7, 2017

Can I try working on this? I am new though.

kahnchana commented Feb 7, 2017

Can I try working on this? I am new though.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Feb 7, 2017

Member
Member

jnothman commented Feb 7, 2017

@gan3sh500

This comment has been minimized.

Show comment
Hide comment
@gan3sh500

gan3sh500 Feb 7, 2017

You use ar as a pairwise distance matrix but can a pairwise distance matrix have a row of zeros without everything being zeros?

gan3sh500 commented Feb 7, 2017

You use ar as a pairwise distance matrix but can a pairwise distance matrix have a row of zeros without everything being zeros?

@kahnchana

This comment has been minimized.

Show comment
Hide comment
@kahnchana

kahnchana Feb 7, 2017

But scipy.spatial.distance.is_valid_dm() verifies this as an acceptable distance matrix. I even tried plotting the corresponding points and got:
[0.1008,0.2193],[1.6069,0.0202],[-0.5538,-0.1653],[0.2431,-0.1523],[-1.3969,0.0781]

However, distance matrices (at least for Euclidean) aren't allowed to have zero values off the diagonal, are they?

kahnchana commented Feb 7, 2017

But scipy.spatial.distance.is_valid_dm() verifies this as an acceptable distance matrix. I even tried plotting the corresponding points and got:
[0.1008,0.2193],[1.6069,0.0202],[-0.5538,-0.1653],[0.2431,-0.1523],[-1.3969,0.0781]

However, distance matrices (at least for Euclidean) aren't allowed to have zero values off the diagonal, are they?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Feb 7, 2017

Member
Member

jnothman commented Feb 7, 2017

@alxgrh

This comment has been minimized.

Show comment
Hide comment
@alxgrh

alxgrh Feb 7, 2017

Let me be more concrete. Problem is specific to sparse matrix. Error happens if and only if the first line in matrix is empty. Example sparse matrix synthetically created from dense input for easy reproduction. All zero-values there are treated as non-existent elements.

Incorrect behavior is in this part of code
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/cluster/dbscan_.py#L122

    if metric == 'precomputed' and sparse.issparse(X):
        neighborhoods = np.empty(X.shape[0], dtype=object)
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X_mask = X.data <= eps
        masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
        masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1] # <<== UNVALIDATED SHIFT-DECREMENT
        # insert the diagonal: a point is its own neighbor, but 0 distance
        # means absence from sparse matrix data
        masked_indices = np.insert(masked_indices, masked_indptr,
                                   np.arange(X.shape[0]))
        masked_indptr = masked_indptr[:-1] + np.arange(1, X.shape[0])
        # split into rows
neighborhoods[:] = np.split(masked_indices, masked_indptr)

In sparce matrix row indexer first element is always zero. If first row is empty then second element in row indexer is also zero.
There is shift-decrement operation in code. After this operation second element becomes first with value equal -1 and thus points to the tail of column index. This treats first row as neighbor of all other rows.

In my project I reimplemented your dbscan() function with this code

    if metric == 'precomputed' and sparse.issparse(X):
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X.data = (X.data <= eps).astype(np.intp)
        X.eliminate_zeros()

        # Convert to linked list for better performance and to use built-in .rows object
        X_lil = X.tolil()
        # insert the diagonal: a point is its own neighbor.
        X_lil.setdiag(1)

        # split into rows
        neighborhoods[:] = X_lil.rows

I doubt it's the best option, but it works correctly and doesn't look like some David Blane street magic.

alxgrh commented Feb 7, 2017

Let me be more concrete. Problem is specific to sparse matrix. Error happens if and only if the first line in matrix is empty. Example sparse matrix synthetically created from dense input for easy reproduction. All zero-values there are treated as non-existent elements.

Incorrect behavior is in this part of code
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/cluster/dbscan_.py#L122

    if metric == 'precomputed' and sparse.issparse(X):
        neighborhoods = np.empty(X.shape[0], dtype=object)
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X_mask = X.data <= eps
        masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
        masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1] # <<== UNVALIDATED SHIFT-DECREMENT
        # insert the diagonal: a point is its own neighbor, but 0 distance
        # means absence from sparse matrix data
        masked_indices = np.insert(masked_indices, masked_indptr,
                                   np.arange(X.shape[0]))
        masked_indptr = masked_indptr[:-1] + np.arange(1, X.shape[0])
        # split into rows
neighborhoods[:] = np.split(masked_indices, masked_indptr)

In sparce matrix row indexer first element is always zero. If first row is empty then second element in row indexer is also zero.
There is shift-decrement operation in code. After this operation second element becomes first with value equal -1 and thus points to the tail of column index. This treats first row as neighbor of all other rows.

In my project I reimplemented your dbscan() function with this code

    if metric == 'precomputed' and sparse.issparse(X):
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X.data = (X.data <= eps).astype(np.intp)
        X.eliminate_zeros()

        # Convert to linked list for better performance and to use built-in .rows object
        X_lil = X.tolil()
        # insert the diagonal: a point is its own neighbor.
        X_lil.setdiag(1)

        # split into rows
        neighborhoods[:] = X_lil.rows

I doubt it's the best option, but it works correctly and doesn't look like some David Blane street magic.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Feb 7, 2017

Member
Member

jnothman commented Feb 7, 2017

Akshay0724 added a commit to Akshay0724/scikit-learn that referenced this issue Feb 15, 2017

Akshay0724 added a commit to Akshay0724/scikit-learn that referenced this issue Feb 15, 2017

@jnothman jnothman closed this in #8339 Feb 23, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment