This is the official implementation of our paper Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings (ICLR 2022).
The proposed WeTe is a new topic modeling framework that views a document as the set of its word embeddings, and views topics as a set of embedding vectors shared across all documents. The topic embeddings and the document-topic proportions are learned by minimizing a bidirectional transport cost between these two sets.
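As a rough illustration of this idea (this is a simplified sketch, not the exact objective used in the paper or in this repo's code), the bidirectional cost for a single document can be written as follows; the function name, the negative-inner-product cost, and the temperature parameter are all assumptions:

```python
import torch
import torch.nn.functional as F

def bidirectional_transport_cost(word_emb, topic_emb, doc_theta, temperature=1.0):
    """Simplified sketch of a bidirectional transport cost between a document's
    word embeddings and the shared topic embeddings.
    word_emb:  (N, D) embeddings of the N words in one document
    topic_emb: (K, D) shared topic embeddings
    doc_theta: (K,)   topic proportions of this document (non-negative, sums to 1)
    """
    # Pairwise cost between every word and every topic (negative inner product here)
    cost = -word_emb @ topic_emb.t()                        # (N, K)

    # Document -> topic direction: each word is softly transported to topics,
    # weighted by the document's topic proportions.
    logits_w2t = -cost / temperature + torch.log(doc_theta + 1e-10)
    pi_w2t = F.softmax(logits_w2t, dim=1)                   # (N, K)
    cost_w2t = (pi_w2t * cost).sum(dim=1).mean()

    # Topic -> document direction: each topic is softly transported to the words,
    # weighted by the document's topic proportions.
    pi_t2w = F.softmax(-cost.t() / temperature, dim=1)      # (K, N)
    cost_t2w = (doc_theta * (pi_t2w * cost.t()).sum(dim=1)).sum()

    return cost_w2t + cost_t2w
```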
- Clone this repo:
git clone git@github.com:wds2014/WeTe.git
cd WeTe
- Install PyTorch with CUDA support and any other requirements you need.
- Datasets in our paper
All datasets can be downloaded from Google Drive.
- Customising your own dataset
Organize the bag-of-words (BoW) matrix and the vocabulary of your corpus into the form WeTe expects, following the provided .pkl file in the dataset folder and dataloader.py, and you are ready to try WeTe! A minimal packaging sketch is given below.
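For example, a corpus could be packaged into a pickle file along these lines; the field names ("bow", "vocab", "label") and the file path are illustrative assumptions, so match them to the provided .pkl and dataloader.py:

```python
# Illustrative sketch of packaging a corpus into a pickle file.
# The exact field names expected by dataloader.py may differ; the keys below are assumptions.
import pickle
from sklearn.feature_extraction.text import CountVectorizer

docs = ["first document text ...", "second document text ..."]  # your raw corpus
labels = [0, 1]                                                  # optional class labels

vectorizer = CountVectorizer(max_features=20000, stop_words="english")
bow = vectorizer.fit_transform(docs)                # sparse (n_docs, vocab_size) BoW matrix
vocab = vectorizer.get_feature_names_out().tolist() # vocabulary list aligned with BoW columns

with open("dataset/my_corpus.pkl", "wb") as f:
    pickle.dump({"bow": bow, "vocab": vocab, "label": labels}, f)
```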
We recommend loading the pre-trained word embeddings for better results.
- Glove
The pretrained GloVe word embeddings can be downloaded from Glove.
- Or, train (fine-tune) word embeddings on your corpus with the word2vec tool.
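One possible way to build an embedding matrix for your vocabulary from a downloaded GloVe file (the file name, dimensionality, and random initialization for out-of-vocabulary words are assumptions):

```python
# Sketch of building a (vocab_size, dim) embedding matrix from pretrained GloVe vectors.
import numpy as np

def load_glove_for_vocab(glove_path, vocab, dim=300):
    """Words missing from GloVe are given small random vectors."""
    glove = {}
    with open(glove_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    emb = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in glove:
            emb[i] = glove[word]
    return emb

# word_embeddings = load_glove_for_vocab("glove.6B.300d.txt", vocab)
```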
- Easy to train:
python main.py
Change the arguments in main.py for different datasets and settings. The learned topics are saved in the runs folder.
- Clustering and Classification
We provide K-means clustering and LogisticRegression classification code in cluster_clc.py. These results are reported automatically during training.
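For reference, a standalone sketch of this kind of evaluation on the learned document-topic proportions (the file paths and variable names are hypothetical, not the ones used by cluster_clc.py):

```python
# Sketch: K-means clustering and logistic-regression classification on learned proportions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, normalized_mutual_info_score
from sklearn.model_selection import train_test_split

theta = np.load("runs/doc_theta.npy")   # (n_docs, K) learned proportions (hypothetical path)
labels = np.load("runs/labels.npy")     # (n_docs,) ground-truth labels (hypothetical path)

# Clustering quality: NMI between K-means assignments and true labels
pred = KMeans(n_clusters=len(set(labels)), n_init=10).fit_predict(theta)
print("NMI:", normalized_mutual_info_score(labels, pred))

# Classification accuracy with logistic regression on the proportions
x_tr, x_te, y_tr, y_te = train_test_split(theta, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
print("ACC:", accuracy_score(y_te, clf.predict(x_te)))
```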
- Topic quality
We provide the topic diversity computation in Trainer.py. For topic coherence, please refer to Palmetto, which is not included in this repo and needs to be downloaded and set up separately.
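For reference, one common definition of topic diversity is the fraction of unique words among the top-k words of all topics; the exact computation in Trainer.py may differ from this sketch:

```python
# Sketch of a common topic diversity metric (may differ from Trainer.py).
import numpy as np

def topic_diversity(topic_word, top_k=25):
    """topic_word: (K, V) topic-word weight matrix; returns a value in (0, 1]."""
    top_words = np.argsort(-topic_word, axis=1)[:, :top_k]  # top-k word indices per topic
    n_unique = len(np.unique(top_words))
    return n_unique / (topic_word.shape[0] * top_k)
```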
If you find this repo useful to your project, please consider citing it with the following BibTeX:
@inproceedings{
wang2022representing,
title={Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings},
author={Dongsheng Wang and Dandan Guo and He Zhao and Huangjie Zheng and Korawat Tanwisuth and Bo Chen and Mingyuan Zhou},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=IYMuTbGzjFU}
}