
I think one word has multiple contextualized embeddings in the corpus. How do you deal with that? #1

Closed
namespace-Pt opened this issue Mar 20, 2022 · 4 comments

@namespace-Pt

I like your paper, but I find it confusing how you tackle multiple embeddings of the same word/token. I wonder whether different embeddings of the same word could be mapped to different clusters, with each one lying quite close to its cluster center in the spherical space. How do you deal with that?

@yumeng5
Owner

yumeng5 commented Mar 20, 2022

Hi @namespace-Pt ,

Thanks for the question. You are right that each word can have multiple contextualized embeddings, and they can be mapped to different clusters during the clustering step in our algorithm. However, when deriving the final results, we take the average of the latent contextualized embeddings as the (context-free) representation for each word, which is then used for computing the topic-word distribution.
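
For concreteness, here is a minimal sketch of that averaging step (hypothetical names, not the actual TopClus code), assuming `latent_embs` maps each word to the list of its latent contextualized embeddings collected over the corpus and `topic_embs` holds the topic (cluster) centers in the same latent space:

```python
import numpy as np

# Hypothetical sketch of the averaging step (not the actual TopClus code).
def word_representations(latent_embs):
    """Average a word's latent contextualized embeddings into one context-free vector."""
    word_vecs = {}
    for word, embs in latent_embs.items():
        v = np.mean(np.stack(embs), axis=0)
        word_vecs[word] = v / np.linalg.norm(v)  # re-normalize onto the unit sphere
    return word_vecs

def topic_word_distribution(word_vecs, topic_embs):
    """Score each word against each topic center by cosine similarity."""
    words = list(word_vecs)
    W = np.stack([word_vecs[w] for w in words])                          # (V, d), unit rows
    T = topic_embs / np.linalg.norm(topic_embs, axis=1, keepdims=True)   # (K, d), unit rows
    sims = T @ W.T                                                       # (K, V) cosine similarities
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)       # softmax over vocabulary
    return words, probs
```

Only the word-level averaging and similarity scoring are shown here; the full clustering objective in the paper is of course more involved.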

I hope this helps. Please let me know if anything remains unclear.

Best,
Yu

@namespace-Pt
Author

Ok, I got it, thank you.

So the averaging step is like in the paper "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!"? Did you reweight the averaged token embeddings? Also, how do you deal with subwords?

@yumeng5
Owner

yumeng5 commented Mar 21, 2022

> So the averaging step is like in the paper "Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!"?

Yes. The difference is that TopClus uses contextualized embeddings (instead of context-free embeddings as in that paper) for clustering.
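
To make the contrast concrete, here is a small illustration (my example, not the TopClus code) of how a pretrained model such as `bert-base-uncased` produces a different embedding for each occurrence of the same word, which is what gets clustered, whereas that paper clusters one static vector per word:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustration only: one contextualized embedding per token occurrence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.",
             "They walked along the river bank."]

with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]                              # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        idx = tokens.index("bank")
        print(sent, hidden[idx][:3])  # "bank" gets a different vector in each context
```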

> Did you reweight the averaged token embeddings?

No, we do not have any reweighting steps.

> Also, how do you deal with subwords?

We remove subwords from the vocabulary when deriving the final results, so our results will not contain subwords.
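
For example (my sketch, assuming a BERT-style WordPiece vocabulary where subword continuation pieces are prefixed with "##"):

```python
from transformers import AutoTokenizer

# Illustration only: keep full words, drop "##" continuation pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def full_word_vocab(tokenizer):
    """Return only tokens that are complete words (no '##' subword pieces)."""
    return [tok for tok in tokenizer.get_vocab() if not tok.startswith("##")]

vocab = full_word_vocab(tokenizer)
print(f"{len(vocab)} full-word tokens kept out of {tokenizer.vocab_size}")
```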

@namespace-Pt
Author

Thank you.
