Experiments on clustering tweets #87
Comments
In practice, that few data points in that high-dimensional a space is simply not enough for there to actually be clusters. So in some sense I think this is actually doing the right thing: I would be very surprised if there were real clusters, especially with a minimum cluster size of 10 data points. You could try reducing the dimension via some dimension reduction technique, but realistically you need more data when dealing with that sort of dimensionality, especially for density-based algorithms. If you want to know more about your data it might be helpful to look at a hubness plot.
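For concreteness, here is a minimal sketch of what that reduction step might look like; the TF-IDF/SVD pipeline and the 20-newsgroups stand-in corpus are illustrative assumptions, not anything specific to this issue:

```python
# Illustrative sketch: reduce a high-dimensional TF-IDF representation before
# density-based clustering. 20-newsgroups posts stand in for a tweet corpus here.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import hdbscan

docs = fetch_20newsgroups(subset="train").data[:1000]

# TF-IDF gives a very sparse, very high-dimensional matrix
X = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(docs)

# Project down to a few dozen dimensions (LSA) so density estimates are meaningful
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X_reduced)  # -1 = noise
```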
OK, in theory there should be clusters, as these tweets were collected using common keywords and within them there should be sub-topics. I have tried topic modelling and also TF-IDF with dimensionality reduction, with more tweets. Now it has 1000 instances with 50 features. I do get some clusters now, with one cluster being labelled '-1'; is this normal? Thanks for your reply 👍
Yes, that's pretty normal. If you just want to partition your data, and have every point assigned to a cluster regardless of how much of an outlier it is, then you'll want a different algorithm. In practice, with small datasets, unless everything is *very* tidily grouped you can expect a fair number of noise points.
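To make the `-1` label concrete, here is a small sketch on synthetic data (nothing tweet-specific is assumed; the blob data is just a stand-in for a 1000 x 50 reduced feature matrix):

```python
# Sketch: -1 is HDBSCAN's noise label, not a cluster in its own right.
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

# Synthetic stand-in for a 1000 x 50 reduced feature matrix
X, _ = make_blobs(n_samples=1000, n_features=50, centers=5, random_state=0)

labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)

n_clusters = labels.max() + 1        # real clusters are labelled 0 .. k-1
n_noise = int(np.sum(labels == -1))  # outliers the algorithm declined to assign
X_clustered = X[labels != -1]        # keep only the points that got a cluster
```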
Actually, what I ultimately want is to cluster my tweets into a set of sub-topics, then sub-sample a list of closely aligned tweets from each cluster, so that (ideally) these sub-sampled tweets are talking about similar things. Do you have any suggestions on what algorithm, or what type of clustering algorithm, might be more suitable in my case? So far, after a few quick experiments, it seems Affinity Propagation works better for me. Thanks very much for your help Leland!
I would personally be quite surprised if Affinity Propagation gave particularly useful results. If you need partitioning then Mean Shift is not a terrible idea, and I would also seriously consider a hierarchical clustering approach, which will give you a richer cluster structure to explore. On that front you could stick with HDBSCAN*, but instead of taking the labels directly you can explore the condensed tree, which gives a hierarchical decomposition of clusters that may be easier to work with (since it is a simpler tree) than standard hierarchical clustering.
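A rough sketch of what exploring the condensed tree looks like (synthetic data as a stand-in; the plotting call needs matplotlib installed):

```python
# Sketch: work with HDBSCAN*'s condensed tree rather than the flat labels.
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=50, centers=5, random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(X)

tree = clusterer.condensed_tree_   # simplified hierarchical decomposition of clusters
tree.plot()                        # visualise cluster births/splits (needs matplotlib)
df = tree.to_pandas()              # parent, child, lambda_val, child_size per edge
```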
Why do you think Affinity Propagation would not give useful results? Affinity Propagation simultaneously considers all data points as potential prototypes and passes soft information around until a subset of data points "win" and become the exemplars. Doesn't it sound applicable in my case? And it doesn't require a pre-defined K for the number of clusters, like K-means. I have tried Mean Shift a few times, and it returns a single cluster every time. The issue I have with hierarchical clustering is that it requires more careful parameter optimisation. I have to perform such clustering on many datasets, so I don't want to tune such hyper-parameters for each one. Also, I think the hierarchical agglomerative method makes hard decisions that can cause it to get stuck in poor solutions? Affinity Propagation is softer in making those decisions. My research is not in clustering so that might sound a bit naive.. :simple_smile:
I have personally had poor experiences getting Affinity Propagation to give good results, even on fairly easy-to-cluster datasets. Affinity Propagation was one of my favoured algorithms when I set out on a personal project to compare and contrast clustering algorithms over a wide range of datasets and clustering situations. By the time I was done, Affinity Prop was my least favourite. It can have a lot of difficulty actually getting good clusters; it is extremely sensitive to parameters -- in practice you *have* to play with the preference vector and with the damping parameter if you hope to get a good representative clustering; the preference vector is a proxy parameter for the number of clusters, but in a non-intuitive and non-linear way.
I am happy to recommend Affinity Prop for clustering non-metric space data, e.g. where you have asymmetric similarities etc., as it is one of the only algorithms that can do this. For general data under, say, a Euclidean metric, I have found it to very rarely be a good choice.
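For reference, these are the two knobs I mean, as exposed by scikit-learn's implementation (synthetic data and purely illustrative parameter values):

```python
# Sketch: the preference and damping parameters dominate Affinity Propagation's
# behaviour; the values here are illustrative, not recommendations.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Lower (more negative) preference -> fewer exemplars/clusters, but the
# relationship is non-linear; damping in [0.5, 1.0) trades stability for speed.
ap = AffinityPropagation(preference=-50, damping=0.9)
labels = ap.fit_predict(X)
print(len(ap.cluster_centers_indices_), "clusters found")
```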
hmm, thanks for the suggestions! 👍
Excellent implementation! Thanks guys!!
I have done some experiments trying to cluster a bunch of tweets (about 350) using hdbscan, but the results, I have to say, are much worse than those from other, more 'mainstream' algorithms in sklearn.
I have tried:
-- tweet2vec, then computing a cosine pairwise distance matrix
-- tweet2vec, then computing a Euclidean pairwise distance matrix
Most of the time I am getting '-1' for cluster labels (is this normal?), and quite often all instances are labelled '-1' if my N is over 10.. I wonder where the difficulty is? Has hdbscan proven to be a below-average clustering tool for short and noisy text like tweets?
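In case it helps, the setup I mean looks roughly like this (random vectors stand in for the tweet2vec output; names and sizes are illustrative):

```python
# Sketch of the precomputed-distance route: cosine distances over (stand-in)
# tweet2vec embeddings, fed to HDBSCAN with metric="precomputed".
import numpy as np
import hdbscan
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
vectors = rng.random((350, 100))        # stand-in for ~350 tweet2vec embeddings

dist = pairwise_distances(vectors, metric="cosine").astype(np.float64)

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="precomputed")
labels = clusterer.fit_predict(dist)    # -1 entries are noise points
```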
Thanks!