[MRG] Implementation of OPTICS #1984

Closed
wants to merge 137 commits into from
Changes from 2 commits
147323d
OPTICS clustering algorithm
espg May 21, 2013
d18e827
Create plot_optics
espg May 21, 2013
70daac1
pep8 fixes
espg May 22, 2013
4fea7ce
fixed conditional to be pep8
espg May 22, 2013
7cc7e12
updated to match sklearn API
espg Nov 5, 2014
24dac57
removed extra files
espg Nov 5, 2014
7d23d01
plotting example updated, small changes
espg Nov 5, 2014
15e452e
updated OPTICS.labels to OPTICS.labels_
espg Nov 5, 2014
9bed2e6
additional labels_ changes
espg Nov 5, 2014
9d163e3
added stability warning
espg Jan 22, 2015
b18da24
Noise fix; updated example plot
espg Jan 23, 2015
c0e24e5
Changed to match Sklearn API
espg Jul 15, 2015
90e21ac
Forcing 2 parameters
espg Jul 15, 2015
b06c76d
Conforming to API
espg Jul 24, 2015
abbdbf2
Fixing plot example
espg Jul 24, 2015
2c54cdc
Fixed issue with sparse matrices
freemansw1 Jul 24, 2015
58d530a
Another attempt at fixing the sparse matrix error
freemansw1 Jul 24, 2015
6034139
Better checking of sparse arrays
espg Jul 24, 2015
fd5c65f
General cleanup
espg Jul 25, 2015
39660a5
Merge remote-tracking branch 'upstream/master'
espg Jul 25, 2015
6ae8c94
Added unit tests for extract function
freemansw1 Jul 28, 2015
bc9cc1c
Attempting for near 100% coverage
freemansw1 Jul 28, 2015
34b9a6d
Fixed error in unit tests
freemansw1 Jul 28, 2015
8d5ac08
Trimmed extraneous 'if-else' check
espg Jul 28, 2015
286f90e
forcing to check for a warning.
freemansw1 Jul 28, 2015
ae2bddc
Updates to doc strings
espg Jul 30, 2015
55365c5
Style / pep8 changes
espg Jul 31, 2015
e1053b0
Added Narrative Documentation
espg Aug 2, 2015
965225e
Vectorized nneighbors lookups
espg Aug 2, 2015
1bda3b1
fixing init build error
espg Aug 2, 2015
1827abf
reverting init
espg Aug 2, 2015
08389fc
All code now vectorized
espg Aug 4, 2015
d9c2bb1
Style changes
espg Aug 4, 2015
659a697
Changing parameter style
espg Aug 7, 2015
bec154f
Extraction change; Authors update
espg Sep 21, 2015
5891945
Changed eps scaling to 5x instead of 10x
espg Sep 21, 2015
9427f1e
Fixing unit test
espg Sep 21, 2015
6462b5e
Actually fixing unit test
espg Sep 21, 2015
2785864
Making ordering_ and other attributes public
espg Sep 21, 2015
8c4572b
Updates for Documentation
espg Sep 21, 2015
3d5e0fc
Pep8 cleanup
espg Sep 21, 2015
02bb8c0
updating plot example to match new attribute name
espg Sep 22, 2015
3dd2d29
CamelCase fixes
espg Oct 19, 2015
e52cc1e
adding hierarchical extraction function
espg Oct 19, 2015
fcd5f70
added hierarchical switch to extract
espg Oct 19, 2015
4d7e3be
cluster order bug fix
espg Oct 20, 2015
b31ae1a
removed hierarchical cluster extraction
espg Oct 20, 2015
92ba90d
initial import of automatic cluster extraction code
espg Oct 20, 2015
768f7c6
wrapper for 'auto' extraction
espg Oct 21, 2015
f4d88de
test and example updates
espg Oct 21, 2015
b1aeb52
fixing unit coverage; pruning unused functions
espg Oct 21, 2015
8350711
Added 'filter' fuction
espg Nov 9, 2015
31df5eb
Vectorizing auto_cluster
espg Nov 9, 2015
428d7d4
updated filter function
espg Nov 18, 2015
517a72e
fixing test error
espg Aug 9, 2016
b7daad4
removing exception handling in favor of conditional check
Aug 10, 2016
a08543b
Merge pull request #2 from Broham/exception-handling
espg Aug 10, 2016
2019886
Updated unit tests
espg Aug 27, 2016
6adb41f
Additional unit test
espg Aug 27, 2016
78be6e4
Fix unit test bug / python 3 compat
espg Aug 27, 2016
da1ed0a
Fixed annoying deprecation warning for 1d array in BallTree
kflansburg Oct 6, 2016
e86f14a
More PEP8 and remove print statements fromt est
kflansburg Oct 6, 2016
b5db832
Merge pull request #3 from kflansburg/master
espg Oct 15, 2016
0b4cbdd
70 to 80% faster, fixed distance metrics
espg Oct 15, 2016
da58058
Fixing unit test failure
espg Oct 16, 2016
625aaa5
Exposed auto_cluster parameters as public
espg Oct 17, 2016
1aa913d
Fixed def with missing ':'
espg Oct 17, 2016
950fbb5
Fix bugs from api change...
espg Oct 17, 2016
9a04d0f
pep8 / pyflakes changes
espg Oct 18, 2016
5876f15
Updated example / plot
espg Oct 18, 2016
38784fd
Tuning plot example / pep8 change
espg Oct 18, 2016
11b13c7
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
espg Oct 28, 2016
0f2aff0
Bug fix for commit 0b4cbdd (enforce stable sort)
espg Nov 1, 2016
ff59289
Code review fixes (style)
espg Nov 6, 2016
b126ebb
fixing new unit test
espg Nov 6, 2016
3998a85
refactored min_heap to c extension
espg Nov 25, 2016
b9ffa6d
minor fixes..
espg Nov 25, 2016
4c96df3
small cython optimizations
espg Nov 26, 2016
f921d0e
cython fixes
espg Dec 6, 2016
9424090
Add foo.txt
espg Dec 6, 2016
b1c88f6
Remove foo.txt
espg Dec 6, 2016
f6ee95f
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
espg Dec 6, 2016
ea18017
fix compilation error
espg Dec 6, 2016
5b3a054
last optimizations cython/numpy
espg Dec 7, 2016
886bacc
fix pyflakes errors; change default eps value
espg Dec 7, 2016
00c50c6
_
espg Dec 7, 2016
8ce31ba
API changes from agramfort
espg Jul 15, 2017
c4711df
fixed core samples bug
espg Jul 15, 2017
88334e2
added fit_predict
espg Jul 15, 2017
5b4c7ac
updates to variable names; update plot
espg Jul 16, 2017
9d01ced
refactor to remove balltree specific code
espg Jul 16, 2017
b1a0e5f
major refactor
espg Jul 16, 2017
678330f
fixed bugs; test all pass again
espg Jul 16, 2017
e80a088
fixed weird cython bug
espg Jul 17, 2017
356609b
major refactor
espg Jul 20, 2017
9d57d09
Updated Documentation!
espg Sep 7, 2017
5add240
Merge branch 'heads/scikit-learn/master'
espg Sep 8, 2017
000c20f
fix flake8 error
espg Sep 8, 2017
af28770
added optics to cluster comparison
espg Sep 12, 2017
14ce7e8
Updated comparison plot to transpose
espg Sep 12, 2017
49ccd11
small fix
espg Sep 12, 2017
c198a93
flake8 error
espg Sep 12, 2017
050d2f3
reverting transpose of cluster comparison (seperate PR #9739)
espg Sep 20, 2017
610747e
fixes from agramfort's review
espg Oct 10, 2017
d5e2904
fix for error message
espg Oct 24, 2017
b41c71b
force cluster_id's to start at 0
espg Oct 24, 2017
e95d5d7
Fix sp. error and flake8 warning(s)
espg Oct 28, 2017
f48151a
Updated documentation
espg Nov 14, 2017
a8dff69
Removed extraneous files
espg Nov 14, 2017
61d049c
fixing lgtm alert
espg Nov 19, 2017
267129e
changes from jnothman
espg Dec 7, 2017
c9dff06
Fixes from jnothman's review
espg Feb 13, 2018
92e6824
Fixing flake8 error
espg Feb 13, 2018
547b255
Removed neighbors / balltree inheritance
espg Feb 14, 2018
98d76b7
Made nbrs private and moved initiation to fit()
espg Feb 15, 2018
2eac9fb
fixed non-standard characters
espg May 18, 2018
0b5fa48
Response to TomDLT review
espg May 23, 2018
085e6ae
Fixed labeling bug
espg May 23, 2018
26e9425
update unit test
espg May 23, 2018
fef8ccd
Simple fixes per jnothman
espg Jun 1, 2018
116e0fe
Auto-cluster tests
espg Jun 13, 2018
a98b37d
fixing test error
espg Jun 13, 2018
4932a16
removed python loop
espg Jun 13, 2018
8292495
Fixing test error
espg Jun 13, 2018
9dc2ca5
Fix typo in unit test
espg Jun 13, 2018
51f7009
documentation updates
espg Jun 14, 2018
04f7337
Merge remote-tracking branch 'remotes/upstream/master'
espg Jun 14, 2018
28d8ac4
Post-merge doctest fix
espg Jun 14, 2018
f83d38e
DBSCAN / OPTICS invariant test
espg Jul 5, 2018
338019a
Update _auto_cluster docstring
espg Jul 5, 2018
c4354c5
changes fro jnothman
espg Jul 5, 2018
c9a94e9
fix spelling error in tests
espg Jul 5, 2018
33d74d7
contingency_matrix test
espg Jul 11, 2018
b32a9c1
small unit test updates per jnothman
espg Jul 11, 2018
20823be
unit test typo fix
espg Jul 11, 2018
0b4c677
extract dbscan updates
espg Jul 12, 2018
369e4b5
updated documentation comparing OPTICS/DBSCAN
espg Jul 12, 2018
+295 −0
@@ -0,0 +1,107 @@
"""
===================================
Demo of OPTICS clustering algorithm
===================================

Finds core samples of high density and expands clusters from them.
"""
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import optics as op

jaquesgrobler May 22, 2013

Member

although it's normally useful importing as.. it maybe makes the examples a little bit
less readable. Perhaps using optics. instead of op. would be better for the sake of the
example


##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

##############################################################################

jaquesgrobler May 22, 2013

Member

I tend to prefer not having these separation lines.. A blank line is fine :)


##############################################################################
# Compute OPTICS

testtree = op.setOfObjects(X)

# Run the top-level optics algorithm

op.prep_optics(testtree,30,10)

jaquesgrobler May 22, 2013

Member

pep8 needed here

op.prep_optics(testtree, 30, 10)

# Note: build_optics should process using the same parameters as prep optics #
op.build_optics(testtree,30,10,'./list.txt')

# Extract clustering structure. This can be run for any clustering distance,
# and can be run multiple times without rerunning OPTICS.
# OPTICS does need to be re-run to change the min-pts parameter
op.ExtractDBSCAN(testtree,0.3)

##############################################################################
# Plot result

import pylab as pl

# Core samples and labels #
core_samples = testtree._index[testtree._is_core[:] > 0]
labels = testtree._cluster_id[:]
#len(testtree._index[testtree._is_core[:] > 0])

# Black removed and is used for noise instead.
unique_labels = set(testtree._cluster_id[:]) # modified from original #
colors = pl.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
        markersize = 6
    class_members = [index[0] for index in np.argwhere(labels == k)]
    cluster_core_samples = [index for index in core_samples
                            if labels[index] == k]
    for index in class_members:
        x = X[index]
        if index in core_samples and k != -1:
            markersize = 14
        else:
            markersize = 6
        pl.plot(x[0], x[1], 'o', markerfacecolor=col,
                markeredgecolor='k', markersize=markersize)

pl.title('Estimated number of clusters: %d' % n_clusters_)
pl.show()

##############################################################################
# Change epsilon, and plot results

op.ExtractDBSCAN(testtree,0.115)

# Core samples and labels #
core_samples = testtree._index[testtree._is_core[:] > 0]
labels = testtree._cluster_id[:]
#len(testtree._index[testtree._is_core[:] > 0])
n_clusters_ = max(testtree._cluster_id) # gives number of clusters
n_clusters_

n_clusters_ = max(testtree._cluster_id) # gives number of clusters
n_clusters_

# Plot results #
pl.figure()

# Black removed and is used for noise instead.
unique_labels = set(testtree._cluster_id[:]) # modified from original #
colors = pl.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
        markersize = 6
    class_members = [index[0] for index in np.argwhere(labels == k)]
    cluster_core_samples = [index for index in core_samples
                            if labels[index] == k]
    for index in class_members:
        x = X[index]
        if index in core_samples and k != -1:
            markersize = 14
        else:
            markersize = 6
        pl.plot(x[0], x[1], 'o', markerfacecolor=col,
                markeredgecolor='k', markersize=markersize)

pl.title('Estimated number of clusters: %d' % n_clusters_)
pl.show()
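
The title lines above read `n_clusters_`, which this version of the example only assigns later, as `max(testtree._cluster_id)`. That maximum equals the cluster count only while ids run from 1 to k and at least one cluster exists. Below is a small sketch of a count that ignores the noise label. It assumes labels are plain integers with -1 for noise; `count_clusters` is a hypothetical helper, not part of the PR:

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, ignoring the noise label (assumed -1)."""
    return len(set(labels) - {noise_label})


# Hypothetical label vector: three clusters (1, 2, 3) plus two noise points.
labels = [1, 1, 2, -1, 2, 3, -1]
n_clusters = count_clusters(labels)
```

Unlike `max(...)`, this stays correct when every point is noise or when cluster ids start at 0.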
@@ -0,0 +1,188 @@
# -*- coding: utf-8 -*-

###################################
## Written by Shane Grigsby ##
## Email: refuge@rocktalus.com ##
## Date: May 2013 ##
###################################


## Imports ##

import sys
import scipy

from sklearn.neighbors import BallTree

raghavrv Nov 1, 2016

Member

Inconsistency in style of importing... Use relative imports always...


from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

## Main Class ##

class setOfObjects(BallTree):

NelleV May 22, 2013

Member

Make sure to follow pep8's convention for class names.


    """Build balltree data structure with processing index from given data in preparation for OPTICS Algorithm
    Parameters
    ----------
    data_points: array [n_samples, n_features]"""

    def __init__(self,data_points):

        super(setOfObjects,self).__init__(data_points)

        self._n = len(self.data)
        self._processed = scipy.zeros((self._n,1),dtype=bool) ## Start all points as 'unprocessed' ##
        self._reachability = scipy.ones(self._n)*scipy.inf ## Important! ##
        self._core_dist = scipy.ones(self._n)*scipy.nan
        self._index = scipy.array(range(self._n)) ## Might be faster to use a list? ##
        self._nneighbors = scipy.ones(self._n,dtype=int)
        self._cluster_id = -scipy.ones(self._n,dtype=int) ## Start all points as noise ##
        self._is_core = scipy.ones(self._n,dtype=bool)
        self._ordered_list = [] ### DO NOT switch this to a hash table, ordering is important ###

    ## Used in prep step ##
    def _set_neighborhood(self,point,epsilon):
        self._nneighbors[point] = self.query_radius(self.data[point], epsilon, count_only=1)[0]

    ## Used in prep step ##
    def _set_core_dist(self,point,MinPts):
        self._core_dist[point] = self.query(self.data[point],MinPts)[0][0][-1]

## Prep Method ##

### Parallelizable! ###
def prep_optics(SetofObjects,epsilon,MinPts):

    """Prep data set for main OPTICS loop
    Parameters
    ----------
    SetofObjects: Instantiated instance of 'setOfObjects' class
    epsilon: float or int
        Determines maximum object size that can be extracted. Smaller epsilons reduce run time
    MinPts: int
        The minimum number of samples in a neighborhood to be considered a core point
    Returns
    -------
    Modified setOfObjects tree structure"""

    for i in SetofObjects._index:
        SetofObjects._set_neighborhood(i,epsilon)
    for j in SetofObjects._index:
        if SetofObjects._nneighbors[j] >= MinPts:
            SetofObjects._set_core_dist(j,MinPts)
    print('Core distances and neighborhoods prepped for ' + str(SetofObjects._n) + ' points.')

## Main OPTICS loop ##

def build_optics(SetOfObjects,epsilon,MinPts,Output_file_name):

    """Builds OPTICS ordered list of clustering structure
    Parameters
    ----------
    SetofObjects: Instantiated and prepped instance of 'setOfObjects' class
    epsilon: float or int
        Determines maximum object size that can be extracted. Smaller epsilons reduce run time. This should be equal to epsilon in 'prep_optics'
    MinPts: int
        The minimum number of samples in a neighborhood to be considered a core point. Must be equal to MinPts used in 'prep_optics'
    Output_file_name: string
        Valid path where write access is available. Stores cluster structure"""

    for point in SetOfObjects._index:
        if SetOfObjects._processed[point] == False:
            expandClusterOrder(SetOfObjects,point,epsilon,
                               MinPts,Output_file_name)

## OPTICS helper functions; these should not be public ##

### NOT parallelizable! The order that entries are written to the '_ordered_list' is important! ###
def expandClusterOrder(SetOfObjects,point,epsilon,MinPts,Output_file_name):
    if SetOfObjects._core_dist[point] <= epsilon:
        while not SetOfObjects._processed[point]:
            SetOfObjects._processed[point] = True
            SetOfObjects._ordered_list.append(point)
            ## Comment following two lines to not write to a text file ##
            with open(Output_file_name, 'a') as file:
                file.write((str(point) + ', ' + str(SetOfObjects._reachability[point]) + '\n'))
            ## Keep following line! ##
            point = set_reach_dist(SetOfObjects,point,epsilon)
        print('Object Found!')
    else:
        SetOfObjects._processed[point] = True # Probably not needed... #


### As above, NOT parallelizable! Parallelizing would allow items in 'unprocessed' list to switch to 'processed' ###
def set_reach_dist(SetOfObjects,point_index,epsilon):

    ### Assumes that the query returns ordered (smallest distance first) entries ###
    ### This is the case for the balltree query... ###
    ### ...switching to a query structure that does not do this will break things! ###
    ### And break in a non-obvious way: For cases where multiple entries are tied in ###
    ### reachability distance, it will cause the next point to be processed in ###
    ### random order, instead of the closest point. This may manifest in edge cases ###
    ### where different runs of OPTICS will give different ordered lists and hence ###
    ### different clustering structure...removing reproducibility. ###

    distances, indices = SetOfObjects.query(SetOfObjects.data[point_index],
                                            SetOfObjects._nneighbors[point_index])

    ## Checks to see if there is more than one member in the neighborhood ##
    if scipy.iterable(distances):

        ## Masking processed values ##
        unprocessed = indices[(SetOfObjects._processed[indices] < 1)[0].T]
        rdistances = scipy.maximum(distances[(SetOfObjects._processed[indices] < 1)[0].T],SetOfObjects._core_dist[point_index])
        SetOfObjects._reachability[unprocessed] = scipy.minimum(SetOfObjects._reachability[unprocessed], rdistances)

        ### Checks to see if everything is already processed; if so, return control to main loop ##
        if unprocessed.size > 0:
            ### Define return order based on reachability distance ###
            return sorted(zip(SetOfObjects._reachability[unprocessed],unprocessed), key=lambda reachability: reachability[0])[0][1]
        else:
            return point_index
    else: ## Not sure if this else statement is actually needed... ##
        return point_index

## Extract DBSCAN Equivalent cluster structure ##

# Important: Epsilon prime should be less than epsilon used in OPTICS #
def ExtractDBSCAN(SetOfObjects, epsilon_prime):

    """Performs DBSCAN equivalent extraction for arbitrary epsilon. Can be run multiple times.
    Parameters
    ----------
    SetOfObjects: Prepped and built instance of setOfObjects
    epsilon_prime: float or int
        Must be less than or equal to what was used for prep and build steps
    Returns
    -------
    Modified setOfObjects with cluster_id and is_core attributes."""

    # Start Cluster_id at zero, incremented to '1' for first cluster
    cluster_id = 0
    for entry in SetOfObjects._ordered_list:
        if SetOfObjects._reachability[entry] > epsilon_prime:
            if SetOfObjects._core_dist[entry] <= epsilon_prime:
                cluster_id += 1
                SetOfObjects._cluster_id[entry] = cluster_id
                # Two gives first member of the cluster; not meaningful, as first cluster members do not correspond to centroids #
                ## SetOfObjects._is_core[entry] = 2 ## Breaks boolean array :-( ##
            else:
                # This is only needed for compatibility for repeated scans. -1 is Noise points #
                SetOfObjects._cluster_id[entry] = -1
        else:
            SetOfObjects._cluster_id[entry] = cluster_id
            if SetOfObjects._core_dist[entry] <= epsilon_prime:
                # One (i.e., 'True') for core points #
                SetOfObjects._is_core[entry] = 1
            else:
                # Zero (i.e., 'False') for non-core, non-noise points #
                SetOfObjects._is_core[entry] = 0


##### End Algorithm #####
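
The module above boils down to three steps: compute a core distance per point, visit points in order of smallest reachability distance, then cut that ordering at an `epsilon_prime` to get DBSCAN-like labels. A minimal pure-Python sketch of that idea follows. It is not the code from this PR. It assumes a brute-force distance matrix instead of a `BallTree`, keeps a global seed set as in the textbook formulation of OPTICS rather than following a single chain of neighbours, and numbers clusters from 0. The names `optics_order` and `extract_dbscan` are hypothetical.

```python
import math


def optics_order(points, max_eps, min_pts):
    """Return (ordering, reachability, core_dist) for a small list of points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Core distance: distance to the min_pts-th nearest neighbour, counting
    # the point itself; treated as undefined (inf) if it exceeds max_eps.
    core = []
    for i in range(n):
        d = sorted(dist[i])[min_pts - 1]
        core.append(d if d <= max_eps else math.inf)
    reach = [math.inf] * n
    processed = [False] * n
    ordering = []
    for start in range(n):
        if processed[start]:
            continue
        seeds = {start}
        while seeds:
            # Smallest reachability first; ties broken by index so that
            # repeated runs give the same ordering.
            p = min(seeds, key=lambda i: (reach[i], i))
            seeds.remove(p)
            processed[p] = True
            ordering.append(p)
            if core[p] == math.inf:
                continue  # not a core point: it cannot reach anything
            for o in range(n):
                if not processed[o] and dist[p][o] <= max_eps:
                    reach[o] = min(reach[o], max(core[p], dist[p][o]))
                    seeds.add(o)
    return ordering, reach, core


def extract_dbscan(ordering, reach, core, eps):
    """Cut the ordering at eps; labels start at 0 and -1 marks noise."""
    labels = [-1] * len(ordering)
    cluster = -1
    for p in ordering:
        if reach[p] > eps:
            if core[p] <= eps:
                cluster += 1  # p starts a new cluster
                labels[p] = cluster
        else:
            labels[p] = cluster  # p is reachable from the current cluster
    return labels


# Two tight groups of three points and one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
ordering, reach, core = optics_order(points, max_eps=5.0, min_pts=3)
labels_wide = extract_dbscan(ordering, reach, core, eps=2.0)   # [0, 0, 0, 1, 1, 1, -1]
labels_tight = extract_dbscan(ordering, reach, core, eps=0.5)  # every point is noise
```

The ordering is computed once and the extraction is called twice with different `eps` values, which mirrors the example script re-running `ExtractDBSCAN` without rebuilding the ordering.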