hdbscan.HDBSCAN().fit('group') #50

Closed
ericgcoker opened this issue Aug 8, 2016 · 6 comments

Comments

@ericgcoker

I know that the algorithm will fit very easily using a column from a full pandas dataframe, but is there an elegant solution for 'fitting' across categorical groups, either by using 'groupby.transform' or through iteration?

@lmcinnes
Collaborator

lmcinnes commented Aug 8, 2016

I'm not sure exactly what you mean. Can you provide some example code of the sort of thing you would like to see? Are you trying to fit multiple different groups from a single dataframe, each group independent of the prior?

@lmcinnes
Collaborator

lmcinnes commented Aug 9, 2016

I see what you mean now, and I can definitely see why that might be desirable. I don't have any elegant solutions to offer, unfortunately. If you need to fit the groups independently then I think you have to effectively iterate through them in one way or another. That means either a transform as you have, or just iterating through the groups in the groupby and constructing the resulting series. I would lean toward the latter as it is "simpler" and will probably do the job, but obviously the transform will be faster. You can access some of the "under the hood" code if you like, to make the functions easier. Something like:

from hdbscan import hdbscan
from hdbscan._hdbscan_tree import outlier_scores

def outlier(series):
    l, c, p, tree, s, m = hdbscan(series)
    return outlier_scores(tree)

df['score'] = df.groupby('category')['numeric'].transform(outlier)

ought to work, although I admit I haven't tried it. Let me know if that is the sort of thing you had in mind.
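
For the iteration route mentioned above, a minimal sketch might look like the following. The helper name `score_groups`, its `score_fn` parameter, and the stand-in scorer `centered_distance` are illustrative assumptions, not part of hdbscan or pandas; in practice `score_fn` would wrap whichever hdbscan call produces the outlier scores. The sketch reshapes each group into a single-column 2D array, since sklearn-style code generally expects that layout.

```python
import numpy as np
import pandas as pd


def score_groups(df, by, column, score_fn):
    """Score each group independently and return one Series aligned to df.index."""
    pieces = []
    for _, group in df.groupby(by):
        # Shape (n_samples, 1): one feature column per group.
        values = group[column].to_numpy(dtype=float).reshape(-1, 1)
        scores = np.asarray(score_fn(values), dtype=float)
        pieces.append(pd.Series(scores, index=group.index))
    # Restore the original row order so the result can be assigned as a column.
    return pd.concat(pieces).reindex(df.index)


def centered_distance(values):
    # Hypothetical stand-in scorer: absolute distance from the group mean.
    return np.abs(values[:, 0] - values[:, 0].mean())


df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "numeric": [1.0, 10.0, 2.0, 30.0, 6.0],
})
df["score"] = score_groups(df, "category", "numeric", centered_distance)
```

Keeping the scorer as a parameter means the grouping logic can be checked with a trivial function before a clustering call is plugged in.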

@lmcinnes
Collaborator

lmcinnes commented Aug 9, 2016

Ah yes, the local variable x, which is of course my fault. This is why I should always test code that I type in. At least I can catch obvious errors. I've updated my comment to fix the obvious error (x should have been series). If we're lucky that might solve the other error too.

@lmcinnes
Collaborator

Ah, I see the problem. Sklearn wants 2D arrays, and a Series is 1D. You'll
need to wrap it in a second axis to make it work. I can't recall the syntax
off the top of my head but I'll post it when I get a chance to look it up.

On Wed, Aug 10, 2016 at 10:56 AM, Eric Coker notifications@github.com
wrote:

I'm running this instead of trying transform, because I thought it would
be simpler:

from hdbscan import hdbscan
from hdbscan._hdbscan_tree import outlier_scores

def outlier(series):
    l, c, p, tree, s, m = hdbscan(series)
    return outlier_scores(tree)

score = df.groupby(['categ1','categ2'])['num_col'].apply(outlier)

And I'm receiving this error, which seems to possibly be related to
'gen_min_span_tree':

`/home/user/.local/lib/python2.7/site-packages/pandas/core/groupby.pyc in
apply(self, func, *args, **kwargs)
713 # ignore SettingWithCopy here in case the user mutates
714 with option_context('mode.chained_assignment',None):
--> 715 return self._python_apply_general(f)
716
717 def _python_apply_general(self, f):

/home/user/.local/lib/python2.7/site-packages/pandas/core/groupby.pyc in
_python_apply_general(self, f)
717 def _python_apply_general(self, f):
718 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 719 self.axis)
720
721 return self._wrap_applied_output(keys, values,

/home/user/.local/lib/python2.7/site-packages/pandas/core/groupby.pyc in
apply(self, f, data, axis)
1404 # group might be modified
1405 group_axes = _get_axes(group)
-> 1406 res = f(group)
1407 if not _is_indexed_like(res, group_axes):
1408 mutated = True

/home/user/.local/lib/python2.7/site-packages/pandas/core/groupby.pyc in
f(g)
709 @wraps(func)
710 def f(g):
--> 711 return func(g, *args, **kwargs)
712
713 # ignore SettingWithCopy here in case the user mutates

in outlier(series)
3
4 def outlier(series):
----> 5 l, c, p, tree, s, m = hdbscan(series)
6 return outlier_scores(tree)
7

/home/user/anaconda2/lib/python2.7/site-packages/hdbscan-0.8.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc
in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size,
algorithm, memory, approx_min_span_tree, gen_min_span_tree,
core_dist_n_jobs, allow_single_cluster, **kwargs)
495 memory.cache(_hdbscan_prims_kdtree)(X, min_samples, alpha,
496 metric, p, leaf_size,
--> 497 gen_min_span_tree, **kwargs)
498 else:
499 (single_linkage_tree,

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/memory.pyc
in __call__(self, *args, **kwargs)
281
282 def __call__(self, *args, **kwargs):
--> 283 return self.func(*args, **kwargs)
284
285 def call_and_shelve(self, *args, **kwargs):

/home/user/anaconda2/lib/python2.7/site-packages/
hdbscan-0.8.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in
_hdbscan_prims_kdtree(X, min_samples, alpha, metric, p, leaf_size,
gen_min_span_tree, **kwargs)
169 core_distances = tree.query(X, k=min_samples,
170 dualtree=True,
--> 171 breadth_first=True)[0][:, -1].copy(order='C')
172 # Mutual reachability distance is implicit in mst_linkage_core_vector
173 min_spanning_tree = mst_linkage_core_vector(X, core_distances,
dist_metric, alpha)

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query
(sklearn/neighbors/kd_tree.c:10563)()

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.NeighborsHeap.__init__
(sklearn/neighbors/kd_tree.c:4971)()

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.get_memview_DTYPE_2D
(sklearn/neighbors/kd_tree.c:2662)()

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/neighbors/kd_tree.so
in View.MemoryView.array_cwrapper (sklearn/neighbors/kd_tree.c:25261)()

/home/user/anaconda2/lib/python2.7/site-packages/sklearn/neighbors/kd_tree.so
in View.MemoryView.array.__cinit__ (sklearn/neighbors/kd_tree.c:24186)()

ValueError: Invalid shape in axis 1: 0.`



@lmcinnes
Collaborator

I think what is needed is something like:

import numpy as np

def outlier(series):
    # Add a second axis so the 1D Series becomes an (n_samples, 1) feature matrix.
    feature_matrix = series.values[:, np.newaxis]
    l, c, p, tree, s, m = hdbscan(feature_matrix)
    return outlier_scores(tree)

I admit there may still be issues with hdbscan on one-dimensional data, but this should at least format it so that sklearn style APIs will deal with it appropriately.
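
As one possible alternative to the internal `_hdbscan_tree` import, the public estimator exposes outlier scores through its `outlier_scores_` attribute. The sketch below assumes that attribute is available in the installed hdbscan version. The helper name `group_outlier_scores` and its NaN policy for small groups are assumptions made for illustration: groups with fewer points than `min_cluster_size` are skipped rather than clustered, since very small one-dimensional groups may be among the cases that cause trouble.

```python
import numpy as np
import pandas as pd


def group_outlier_scores(series, min_cluster_size=5):
    """Return per-row outlier scores for one group, or NaN if the group is too small."""
    if len(series) < min_cluster_size:
        # Assumed policy: too few points to cluster, so leave the scores undefined.
        return pd.Series(np.nan, index=series.index)

    # Imported lazily, so the small-group path works even without hdbscan installed.
    import hdbscan

    # Shape (n_samples, 1), as sklearn-style estimators expect.
    features = series.to_numpy(dtype=float).reshape(-1, 1)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit(features)
    return pd.Series(clusterer.outlier_scores_, index=series.index)
```

It could then be applied with `df.groupby('category')['numeric'].transform(group_outlier_scores)`, though that has not been checked against every pandas or hdbscan release.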

@lmcinnes
Collaborator

I presume this is working now?
