
the documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but it does not actually include it #28800

Closed
AnPananas opened this issue Apr 10, 2024 · 4 comments
Labels
Documentation Needs Triage Issue requires triage

Comments

@AnPananas

Describe the issue linked to the documentation

The documentation page for DBSCAN describes the min_samples parameter as: "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself." (doc)

If you look at the (source), you can see how the neighborhoods are computed:

neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)

if sample_weight is None:
    n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
else:
    n_neighbors = np.array([np.sum(sample_weight[neighbors]) for neighbors in neighborhoods])

The point itself is not taken into account:

# A list of all core samples found.
core_samples = np.asarray(n_neighbors >= self.min_samples, dtype=np.uint8)

I took min_samples=2 and the points with one neighbor did not become core points.

Suggest a potential alternative/fix

I think the documentation needs to be corrected.

@AnPananas AnPananas added Documentation Needs Triage Issue requires triage labels Apr 10, 2024
@lesteve
Member

lesteve commented Apr 11, 2024

Thanks for opening an issue! I think the documentation is right, but I have to admit I am certainly not a DBSCAN expert.

I took min_samples=2 and the points with one neighbor did not become core points.

Have you tried playing with eps? If you can provide a snippet of code showing this issue, this would be great so that a maintainer can have a closer look.
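As a quick sanity check (my own minimal example on plain Euclidean data, not taken from the report above): with min_samples=2, a point that has a single neighbor within eps does become a core point, which suggests the point itself is being counted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two points within eps of each other, plus one isolated point.
X = np.array([[0.0], [1.0], [10.0]])

# min_samples=2: each of the first two points has itself + one neighbor
# inside eps=2.0, so both should be core points; the third is noise.
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
# labels -> array([ 0,  0, -1])
```

If the point itself were not counted, both of the first two points would need a second neighbor besides each other and would be labeled noise as well.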

@AnPananas
Author

AnPananas commented Apr 12, 2024

@lesteve
I use a matrix of distances as input data:

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

DBSCAN_model = DBSCAN(eps=100, min_samples=2, metric='precomputed', algorithm='brute')

similarity_matrix = [
    [1e10, 2500.966630, 572.004568, 2571.116203, 2637.008209, 2378.405924, 244.336929, 288.477526, 339.468194],
    [2500.966630, 1e10, 2437.781596, 70.149573, 1024.025578, 765.423293, 2256.629701, 2262.386342, 2205.245221],
    [572.004568, 2437.781596, 1e10, 2507.931168, 2573.823175, 2315.220890, 327.667640, 333.424280, 232.536374],
    [2571.116203, 70.149573, 2507.931168, 1e10, 1094.175151, 835.572866, 2326.779274, 2332.535914, 2275.394794],
    [2637.008209, 1024.025578, 2573.823175, 1094.175151, 1e10, 258.602285, 2392.671280, 2398.427921, 2341.286801],
    [2378.405924, 765.423293, 2315.220890, 835.572866, 258.602285, 1e10, 2134.068995, 2139.825636, 2082.684515],
    [244.336929, 2256.629701, 327.667640, 2326.779274, 2392.671280, 2134.068995, 1e10, 44.140597, 95.131265],
    [288.477526, 2262.386342, 333.424280, 2332.535914, 2398.427921, 2139.825636, 44.140597, 1e10, 100.887906],
    [339.468194, 2205.245221, 232.536374, 2275.394794, 2341.286801, 2082.684515, 95.131265, 100.887906, 1e10]
]

# Create DataFrame
df_similarity = pd.DataFrame(similarity_matrix, columns=range(1, 10), index=range(1, 10))

df_similarity_numpy = df_similarity.to_numpy()

neighbors_model = NearestNeighbors(
    radius=DBSCAN_model.eps,
    algorithm=DBSCAN_model.algorithm,
    leaf_size=DBSCAN_model.leaf_size,
    metric=DBSCAN_model.metric,
    metric_params=DBSCAN_model.metric_params,
    p=DBSCAN_model.p,
    n_jobs=DBSCAN_model.n_jobs,
)

neighbors_model.fit(df_similarity_numpy)
# This has worst case O(n^2) memory complexity
neighborhoods = neighbors_model.radius_neighbors(df_similarity_numpy, return_distance=False)

This code returns neighborhoods as an array; in the output we see:

array([array([], dtype=int64), array([3]), array([], dtype=int64),
       array([1]), array([], dtype=int64), array([], dtype=int64),
       array([7, 8]), array([6]), array([6])], dtype=object)

Only one point has more than 1 neighbor (array([7, 8])). I remind you that we have specified 2 neighbors "including the point in question" in order for it to become a core point.

Then we run this:

n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
core_samples = np.asarray(n_neighbors >= 2, dtype=np.uint8)
core_samples

in output:
array([0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8)

Only the one point that has 2 neighbors besides itself becomes a core point.

@lesteve
Member

lesteve commented Apr 12, 2024

As the doc says:

If metric is "precomputed", X is assumed to be a distance matrix and must be square.

A distance matrix means the matrix has 0 on the diagonal. Your matrix is a similarity matrix, which I am guessing is why you find points with no neighbors.
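To illustrate (a sketch using a made-up 3×3 distance matrix, not the matrix from the report): when the diagonal is 0, radius_neighbors returns each point as a member of its own neighborhood, so the min_samples count does include the point itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# A proper precomputed distance matrix: 0 on the diagonal, so every
# point is within any positive radius of itself.
D = np.array([
    [0.0,   50.0,  300.0],
    [50.0,  0.0,   300.0],
    [300.0, 300.0, 0.0],
])

nn = NearestNeighbors(radius=100, metric="precomputed").fit(D)
neighborhoods = nn.radius_neighbors(D, return_distance=False)

# Each neighborhood contains the point's own index:
# point 0 -> {0, 1}, point 1 -> {0, 1}, point 2 -> {2}
```

With the 1e10 values on the diagonal, as in the matrix above, no point is within eps of itself, which is why the neighborhoods came out one element short.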

I am going to close the issue, since at this point I feel this is more likely a scikit-learn usage question than a bug in scikit-learn.

@lesteve lesteve closed this as completed Apr 12, 2024
@AnPananas
Author

@lesteve Thank you very much for the prompt response, now I understand!
