
Experiments on clustering tweets #87

Closed · bwang482 opened this issue Feb 21, 2017 · 8 comments

@bwang482

Excellent implementation! Thanks guys!!

I have done some experiments trying to cluster a bunch of tweets (about 350) using hdbscan, but the results, I have to say, are much worse than those from other, more 'mainstream' algorithms in sklearn.

I have tried:

- BOW with tfidf (288 dimensions)
  - tfidf, then computing a JSD pairwise distance matrix
  - tfidf, then running LSHForest (from sklearn) and generating a Euclidean distance matrix between points
- tweet2vec (500 dimensions)
  - tweet2vec, then computing a cosine pairwise distance matrix
  - tweet2vec, then computing a Euclidean pairwise distance matrix
  - tweet2vec, then running LSHForest (from sklearn) and generating a Euclidean distance matrix between points
In each case I then ran:

```python
import hdbscan

# sims: the precomputed pairwise distance matrix from one of the pipelines above
clusterer = hdbscan.HDBSCAN(min_cluster_size=N, metric='precomputed')
clusters = clusterer.fit_predict(sims)
# labels start at 0 and -1 marks noise, so the cluster count is max + 1
print('Number of clusters =', clusters.max() + 1)
print(clusters)
```
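For reference, a minimal sketch of the first pipeline in the list above (tfidf, then a precomputed pairwise distance matrix), assuming `docs` holds the raw tweet strings and using an illustrative `min_cluster_size`; one thing worth checking in this setup is that the precomputed matrix holds distances rather than similarities:

```python
import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# docs: list of raw tweet strings (illustrative name)
tfidf = TfidfVectorizer().fit_transform(docs)       # sparse (n_tweets, n_terms) matrix
dists = pairwise_distances(tfidf, metric='cosine')  # dense pairwise *distance* matrix

# HDBSCAN's 'precomputed' metric expects distances, not similarities
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric='precomputed')
labels = clusterer.fit_predict(dists)
```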

Most of the time I am getting '-1' for cluster labels (is this normal?), and quite often all instances are labelled '-1' when my N is over 10. I wonder where the difficulty lies. Has hdbscan proven to be a below-average clustering tool for short, noisy text like tweets?

Thanks!

@lmcinnes
Collaborator

In practice, that few data points in such a high-dimensional space is simply not enough for there to actually be clusters. So in some sense I think this is actually doing the right thing: I would be very surprised if there were real clusters, especially with a minimum cluster size of 10 data points. You could try reducing the dimension via some dimension reduction technique, but realistically you need more data when dealing with that sort of dimensionality, especially for density-based algorithms. If you want to know more about your data, it might be helpful to look at a hubness plot.
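A minimal sketch of that dimension reduction suggestion, assuming the sparse tfidf matrix `X` from the pipeline above; TruncatedSVD is just one option, and 50 components is an illustrative choice:

```python
import hdbscan
from sklearn.decomposition import TruncatedSVD

# X: sparse tfidf matrix of shape (n_tweets, 288), assumed from the earlier pipeline
X_reduced = TruncatedSVD(n_components=50).fit_transform(X)

# cluster in the reduced space with the default Euclidean metric
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(X_reduced)
```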

@bwang482
Author

OK, in theory there should be clusters, as these tweets were collected using common keywords, and within them there should be sub-topics.

I have tried topic modelling, and also tfidf with dimensionality reduction, on more tweets. The data now has 1000 instances with 50 features. I do get some clusters now, with one cluster labelled '-1'. Is this normal?
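For reference, -1 in hdbscan's output is the noise label rather than an ordinary cluster id; a quick check, assuming `labels` is the array returned by `fit_predict`:

```python
import numpy as np

# labels: the array returned by clusterer.fit_predict
n_clusters = labels.max() + 1        # cluster ids run from 0 to max
n_noise = int(np.sum(labels == -1))  # -1 marks points left unclustered as noise
print(f'{n_clusters} clusters, {n_noise} noise points out of {labels.size}')
```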

Thanks for your reply 👍

@lmcinnes
Collaborator

lmcinnes commented Feb 23, 2017 via email

@bwang482
Author

Actually, what I ultimately want is to cluster my tweets into a set of sub-topics, then sub-sample a list of closely aligned tweets from each cluster; ideally, these sub-sampled tweets should be talking about similar things.

Do you have any suggestions on which algorithm, or which type of clustering algorithm, might be more suitable in my case? So far, after a few quick experiments, it seems Affinity Propagation works better for me.

Thanks very much for your help Leland!

@lmcinnes
Collaborator

lmcinnes commented Feb 24, 2017 via email

@bwang482
Author

bwang482 commented Mar 2, 2017

Why do you think Affinity Propagation would not give useful results?

Affinity propagation simultaneously considers all data points as potential prototypes and passes soft information around until a subset of data points "win" and become the exemplars.

Doesn't that sound applicable in my case? It also doesn't require a pre-defined K for the number of clusters, like K-means does. I have tried Mean Shift a few times, and it returns a single cluster every time.
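A minimal Affinity Propagation sketch along those lines, assuming `X_reduced` is the 50-dimensional matrix from earlier (the damping value is illustrative):

```python
from sklearn.cluster import AffinityPropagation

# X_reduced: (n_tweets, 50) feature matrix, assumed from the earlier reduction step
ap = AffinityPropagation(damping=0.9)  # no pre-defined K; exemplars emerge during message passing
ap_labels = ap.fit_predict(X_reduced)
print('Found', len(ap.cluster_centers_indices_), 'clusters')
```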

The issue I have with hierarchical clustering is that it requires more careful parameter optimisation. I have to perform such clustering on many data sets, so I don't want to tune the hyper-parameters for each one. I also think the hierarchical agglomerative method makes hard decisions that can cause it to get stuck in poor solutions, whereas affinity propagation is softer in making those decisions.
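For contrast, a minimal agglomerative sketch on the same precomputed distances; the `n_clusters` value is exactly the kind of per-data-set hyper-parameter mentioned above:

```python
from sklearn.cluster import AgglomerativeClustering

# dists: precomputed pairwise distance matrix, assumed from earlier
agg = AgglomerativeClustering(n_clusters=5,            # must be tuned per data set
                              affinity='precomputed',  # renamed to metric= in newer sklearn
                              linkage='average')
agg_labels = agg.fit_predict(dists)
```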

My research is not in clustering, so that might sound a bit naive. :simple_smile:

@lmcinnes
Collaborator

lmcinnes commented Mar 2, 2017 via email

@bwang482
Author

bwang482 commented Mar 2, 2017

Hmm, thanks for the suggestions! 👍

@bwang482 bwang482 closed this as completed Mar 2, 2017