
the documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but it does not actually include it #28800

Closed
AnPananas opened this issue Apr 10, 2024 · 4 comments
Labels
Documentation Needs Triage Issue requires triage

Comments

@AnPananas

Describe the issue linked to the documentation

The documentation page for DBSCAN describes the min_samples parameter as: "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself." (doc)

If you look at the (source), you can see how the neighborhoods are computed:

neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)

if sample_weight is None:
    n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
else:
    n_neighbors = np.array([np.sum(sample_weight[neighbors]) for neighbors in neighborhoods])

The point itself is not taken into account:

# A list of all core samples found.
core_samples = np.asarray(n_neighbors >= self.min_samples, dtype=np.uint8)

I took min_samples=2 and the points with one neighbor did not become core points.

Suggest a potential alternative/fix

I think the documentation needs to be corrected.

@AnPananas AnPananas added Documentation Needs Triage Issue requires triage labels Apr 10, 2024
@lesteve
Member

lesteve commented Apr 11, 2024

Thanks for opening an issue! I think the documentation is right, but I have to admit I am certainly not a DBSCAN expert.

I took min_samples=2 and the points with one neighbor did not become core points.

Have you tried playing with eps? If you can provide a snippet of code showing this issue, this would be great so that a maintainer can have a closer look.
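As a quick sanity check (my own minimal example on plain Euclidean data, not taken from the report above): with min_samples=2, a point that has a single neighbor within eps does become a core point, which suggests the point itself is being counted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two points within eps of each other, plus one isolated point.
X = np.array([[0.0], [1.0], [10.0]])

# min_samples=2: each of the first two points has itself + one neighbor
# inside eps=2.0, so both should be core points; the third is noise.
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
# labels -> array([ 0,  0, -1])
```

If the point itself were not counted, both of the first two points would need a second neighbor besides each other and would be labeled noise as well.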

@AnPananas
Author

AnPananas commented Apr 12, 2024

@lesteve
I use a matrix of distances as input data:

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

DBSCAN_model = DBSCAN(eps=100, min_samples=2, metric='precomputed', algorithm='brute')

similarity_matrix = [
    [1e10, 2500.966630, 572.004568, 2571.116203, 2637.008209, 2378.405924, 244.336929, 288.477526, 339.468194],
    [2500.966630, 1e10, 2437.781596, 70.149573, 1024.025578, 765.423293, 2256.629701, 2262.386342, 2205.245221],
    [572.004568, 2437.781596, 1e10, 2507.931168, 2573.823175, 2315.220890, 327.667640, 333.424280, 232.536374],
    [2571.116203, 70.149573, 2507.931168, 1e10, 1094.175151, 835.572866, 2326.779274, 2332.535914, 2275.394794],
    [2637.008209, 1024.025578, 2573.823175, 1094.175151, 1e10, 258.602285, 2392.671280, 2398.427921, 2341.286801],
    [2378.405924, 765.423293, 2315.220890, 835.572866, 258.602285, 1e10, 2134.068995, 2139.825636, 2082.684515],
    [244.336929, 2256.629701, 327.667640, 2326.779274, 2392.671280, 2134.068995, 1e10, 44.140597, 95.131265],
    [288.477526, 2262.386342, 333.424280, 2332.535914, 2398.427921, 2139.825636, 44.140597, 1e10, 100.887906],
    [339.468194, 2205.245221, 232.536374, 2275.394794, 2341.286801, 2082.684515, 95.131265, 100.887906, 1e10]
]

# Create DataFrame
df_similarity = pd.DataFrame(similarity_matrix, columns=range(1, 10), index=range(1, 10))

df_similarity_numpy = df_similarity.to_numpy()

neighbors_model = NearestNeighbors(
    radius=DBSCAN_model.eps,
    algorithm=DBSCAN_model.algorithm,
    leaf_size=DBSCAN_model.leaf_size,
    metric=DBSCAN_model.metric,
    metric_params=DBSCAN_model.metric_params,
    p=DBSCAN_model.p,
    n_jobs=DBSCAN_model.n_jobs,
)

neighbors_model.fit(df_similarity_numpy)
# This has worst case O(n^2) memory complexity
neighborhoods = neighbors_model.radius_neighbors(df_similarity_numpy, return_distance=False)

This code returns neighborhoods as an array; in the output we see:

array([array([], dtype=int64), array([3]), array([], dtype=int64),
       array([1]), array([], dtype=int64), array([], dtype=int64),
       array([7, 8]), array([6]), array([6])], dtype=object)

Only one point has more than 1 neighbor (array([7, 8])). I remind you that we have specified 2 neighbors "including the point in question" in order for it to become a core point.

Then we run this:

n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
core_samples = np.asarray(n_neighbors >= 2, dtype=np.uint8)
core_samples

in output:
array([0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8)

Only the one point that has 2 neighbors besides itself becomes a core point.

@lesteve
Member

lesteve commented Apr 12, 2024

As the doc says:

If metric is "precomputed", X is assumed to be a distance matrix and must be square.

A distance matrix means the matrix has 0 on the diagonal. Your matrix is a similarity matrix, which I am guessing is why you find points with no neighbors.
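To illustrate (a sketch using a made-up 3×3 distance matrix, not the matrix from the report): when the diagonal is 0, radius_neighbors returns each point as a member of its own neighborhood, so the min_samples count does include the point itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# A proper precomputed distance matrix: 0 on the diagonal, so every
# point is within any positive radius of itself.
D = np.array([
    [0.0,   50.0,  300.0],
    [50.0,  0.0,   300.0],
    [300.0, 300.0, 0.0],
])

nn = NearestNeighbors(radius=100, metric="precomputed").fit(D)
neighborhoods = nn.radius_neighbors(D, return_distance=False)

# Each neighborhood contains the point's own index:
# point 0 -> {0, 1}, point 1 -> {0, 1}, point 2 -> {2}
```

With the 1e10 values on the diagonal, as in the matrix above, no point is within eps of itself, which is why the neighborhoods came out one element short.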

I am going to close the issue, since at this point I feel this is more likely a scikit-learn usage question than a bug in scikit-learn.

@lesteve lesteve closed this as completed Apr 12, 2024
@AnPananas
Author

@lesteve Thank you very much for the prompt response, now I understand!
