
I think one word has multiple contextualized embeddings in the corpus. How do you deal with that? #1

Closed
namespace-Pt opened this issue Mar 20, 2022 · 4 comments

@namespace-Pt

I like your paper, but I find it confusing how you tackle multiple embeddings of the same word/token. I wonder whether different embeddings of the same word could be mapped to different clusters, with each one lying quite close to its cluster center in the spherical space. How do you deal with that?

@yumeng5
Owner

yumeng5 commented Mar 20, 2022

Hi @namespace-Pt ,

Thanks for the question. You are right that each word can have multiple contextualized embeddings, and they can be mapped to different clusters during the clustering step in our algorithm. However, when deriving the final results, we take the average of the latent contextualized embeddings as the (context-free) representation for each word, which is then used for computing the topic-word distribution.
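
For concreteness, here is a minimal sketch of that averaging step (hypothetical names, not the actual TopClus code), assuming `latent_embs` maps each word to the list of its latent contextualized embeddings collected over the corpus and `topic_embs` holds the topic (cluster) centers in the same latent space:

```python
import numpy as np

# Hypothetical sketch of the averaging step (not the actual TopClus code).
def word_representations(latent_embs):
    """Average a word's latent contextualized embeddings into one context-free vector."""
    word_vecs = {}
    for word, embs in latent_embs.items():
        v = np.mean(np.stack(embs), axis=0)
        word_vecs[word] = v / np.linalg.norm(v)  # re-normalize onto the unit sphere
    return word_vecs

def topic_word_distribution(word_vecs, topic_embs):
    """Score each word against each topic center by cosine similarity."""
    words = list(word_vecs)
    W = np.stack([word_vecs[w] for w in words])                          # (V, d), unit rows
    T = topic_embs / np.linalg.norm(topic_embs, axis=1, keepdims=True)   # (K, d), unit rows
    sims = T @ W.T                                                       # (K, V) cosine similarities
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)       # softmax over vocabulary
    return words, probs
```

Only the word-level averaging and similarity scoring are shown here; the full clustering objective in the paper is of course more involved.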

I hope this helps. Please let me know if anything remains unclear.

Best,
Yu

@namespace-Pt
Author

Ok, I got it, thank you.

So the averaging step is like in the paper "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!"? Did you reweight the averaged token embeddings? Also, how do you deal with subwords?

@yumeng5
Owner

yumeng5 commented Mar 21, 2022

> So the averaging step is like in the paper "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!"?

Yes. The difference is that TopClus uses contextualized embeddings (instead of context-free embeddings as in that paper) for clustering.
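
To make the contrast concrete, here is a small illustration (my example, not the TopClus code) of how a pretrained model such as `bert-base-uncased` produces a different embedding for each occurrence of the same word, which is what gets clustered, whereas that paper clusters one static vector per word:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustration only: one contextualized embedding per token occurrence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.",
             "They walked along the river bank."]

with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]                              # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        idx = tokens.index("bank")
        print(sent, hidden[idx][:3])  # "bank" gets a different vector in each context
```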

> Did you reweight the averaged token embeddings?

No, we do not have any reweighting steps.

> Also, how do you deal with subwords?

We remove subwords from the vocabulary when deriving the final results, so our results will not contain subwords.
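
For example (my sketch, assuming a BERT-style WordPiece vocabulary where subword continuation pieces are prefixed with "##"):

```python
from transformers import AutoTokenizer

# Illustration only: keep full words, drop "##" continuation pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def full_word_vocab(tokenizer):
    """Return only tokens that are complete words (no '##' subword pieces)."""
    return [tok for tok in tokenizer.get_vocab() if not tok.startswith("##")]

vocab = full_word_vocab(tokenizer)
print(f"{len(vocab)} full-word tokens kept out of {tokenizer.vocab_size}")
```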

@namespace-Pt
Author

Thank you.
