-
-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
the documentation says that the min_samples parameter specifies the number of neighbors including the point itself, but does not actually include #28800
Comments
Thanks for opening an issue! I think the documentation is right but I have to admit, I am certainly not a DBSCAN expert.
Have you tried playing with |
@lesteve from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
DBSCAN_model = DBSCAN(eps=100, min_samples=2, metric='precomputed', algorithm='brute')
similarity_matrix = [
[1e10, 2500.966630, 572.004568, 2571.116203, 2637.008209, 2378.405924, 244.336929, 288.477526, 339.468194],
[2500.966630, 1e10, 2437.781596, 70.149573, 1024.025578, 765.423293, 2256.629701, 2262.386342, 2205.245221],
[572.004568, 2437.781596, 1e10, 2507.931168, 2573.823175, 2315.220890, 327.667640, 333.424280, 232.536374],
[2571.116203, 70.149573, 2507.931168, 1e10, 1094.175151, 835.572866, 2326.779274, 2332.535914, 2275.394794],
[2637.008209, 1024.025578, 2573.823175, 1094.175151, 1e10, 258.602285, 2392.671280, 2398.427921, 2341.286801],
[2378.405924, 765.423293, 2315.220890, 835.572866, 258.602285, 1e10, 2134.068995, 2139.825636, 2082.684515],
[244.336929, 2256.629701, 327.667640, 2326.779274, 2392.671280, 2134.068995, 1e10, 44.140597, 95.131265],
[288.477526, 2262.386342, 333.424280, 2332.535914, 2398.427921, 2139.825636, 44.140597, 1e10, 100.887906],
[339.468194, 2205.245221, 232.536374, 2275.394794, 2341.286801, 2082.684515, 95.131265, 100.887906, 1e10]
]
# Create DataFrame
df_similarity = pd.DataFrame(similarity_matrix, columns=range(1, 10), index=range(1, 10))
df_similarity_numpy = df_similarity.to_numpy()
neighbors_model = NearestNeighbors(
radius=DBSCAN_model.eps,
algorithm=DBSCAN_model.algorithm,
leaf_size=DBSCAN_model.leaf_size,
metric=DBSCAN_model.metric,
metric_params=DBSCAN_model.metric_params,
p=DBSCAN_model.p,
n_jobs=DBSCAN_model.n_jobs,
)
neighbors_model.fit(df_similarity_numpy)
# This has worst case O(n^2) memory complexity
neighborhoods = neighbors_model.radius_neighbors(df_similarity_numpy, return_distance=False) This code return neighborhoods as array in output we see:
only on point has more then 1 neighbor ( array([7, 8])). I remind you that we have specified 2 neighbors "including the point in question" in order for it to become the core. then we run this: n_neighbors = np.array([len(neighbors) for neighbors in neighborhoods])
core_samples = np.asarray(n_neighbors >= 2, dtype=np.uint8)
core_samples in output: only one point which have 2 neighbors except itself become a core point |
As the doc says:
Distance matrix, means the matrix has 0 on the diagonal. You matrix is a similarity matrix I am guessing this is why you find points with no neighbors. I am going to close the issue, since at this point I feel this is more likely to be a scikit-learn usage question rather than a bug in scikit-learn. |
@lesteve Thank you very much for the prompt response, now I understand) |
Describe the issue linked to the documentation
The documentation page for the parameters of the DBSCAN mentions that the min_samples parameter :
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
(doc)if you look at the (source) , you can see that in neighborhoods
the point itself is not taken into account
I took min_samples=2 and the points with one neighbor did not become the core
Suggest a potential alternative/fix
I think need to correct the documentation
The text was updated successfully, but these errors were encountered: