
# Neural_Topic_Models

Implementation of topic models based on neural network approaches.


PyTorch implementations of Neural Topic Model variants proposed in recent years, including NVDM-GSM, WTM-MMD (W-LDA), WTM-GMM, ETM, BATM, and GMNTM. The aim of this project is to provide practical, working examples of neural topic models to facilitate research in related fields. The configurations of the models are not exactly the same as those proposed in the papers, and the hyper-parameters are not carefully fine-tuned, but the core ideas are covered.

Empirically, NTMs are superior to classical statistical topic models, especially on short texts. Datasets of short news (cnews10k), dialogue utterances (zhddline), and conversations (zhdd) are provided for evaluation purposes, all of them in Chinese. For comparison with the NTMs, an out-of-the-box LDA script based on the gensim library is also provided.

If you have any questions or suggestions about this implementation, please do not hesitate to contact me. You are welcome to join me in making it better. ;)

Note: If the pictures in this README load slowly, you can read this article on my blog.

## 1. Installation

```bash
$ git clone https://github.com/zll17/Neural_Topic_Models
$ cd Neural_Topic_Models/
$ sudo pip install -r requirements.txt
```

## 2. Models

### 2.1 NVDM-GSM

Original paper: Discovering Discrete Latent Topics with Neural Variational Inference

Author: Yishu Miao

#### Description

VAE + Gaussian Softmax

The architecture of the model is a simple VAE, which takes the BOW of a document as its input. After sampling the latent vector z from the variational distribution Q(z|x), the model normalizes z through a softmax layer, and the result is taken as the topic distribution $\theta$ in the following steps. The configuration of the encoder and decoder can also be customized to suit your application.

Explanation of some arguments:

--taskname: the name of the dataset on which to build the topic model.

--n_topic: the number of topics.

--num_epochs: the number of training epochs.

--no_below: filter out tokens whose document frequency is below this threshold (an integer).

--no_above: filter out tokens whose document frequency is above this threshold (a float, given as a fraction of the number of documents).

--auto_adj: if set, there is no need to specify no_above; the model automatically filters out the 20 words with the highest document frequencies.

--bkpt_continue: if set, the model loads the last checkpoint file and continues training.

[Paper] [Code]

#### Run Example

```bash
$ python3 GSM_run.py --taskname cnews10k --n_topic 20 --num_epochs 1000 --no_above 0.0134 --no_below 5 --criterion cross_entropy
```
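To make the Gaussian-softmax step concrete, here is a minimal PyTorch sketch of the idea. The layer sizes and module layout are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSMSketch(nn.Module):
    """Minimal VAE + Gaussian-softmax sketch (illustrative, not the repo's exact code)."""
    def __init__(self, vocab_size, n_topic=20, hid_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hid_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hid_dim, n_topic)       # mean of Q(z|x)
        self.fc_logvar = nn.Linear(hid_dim, n_topic)   # log-variance of Q(z|x)
        self.decoder = nn.Linear(n_topic, vocab_size)  # topics -> word logits

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        theta = F.softmax(z, dim=1)  # Gaussian softmax -> topic distribution theta
        return self.decoder(theta), mu, logvar  # BOW reconstruction logits

model = GSMSketch(vocab_size=5000)
logits, mu, logvar = model(torch.rand(8, 5000))  # a fake batch of 8 BOW vectors
```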


### 2.2 WTM-MMD

Original paper: Topic Modeling with Wasserstein Autoencoders

Author: Feng Nan, Ran Ding, Ramesh Nallapati, Bing Xiang

#### Description

WAE with Dirichlet prior + Gaussian Softmax

The architecture is a WAE, which is essentially a plain autoencoder with an additional regularization on the latent space. Following the original paper, the prior distribution of the latent vectors z is set to a Dirichlet distribution, and the variational distribution is regularized under the Wasserstein distance (approximated via MMD, hence the name WTM-MMD). Compared with the GSM model, this model greatly alleviates the KL collapse problem and obtains more coherent topics.
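To make the latent-space regularization concrete, here is a minimal sketch of an MMD penalty pulling encoded topic distributions toward Dirichlet prior samples. The RBF kernel, the batch size, and the placeholder encoder output are illustrative assumptions; the paper uses a different kernel.

```python
import torch

def rbf_mmd(theta_q, theta_p, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel (a stand-in kernel choice;
    theta_q: encoded topic distributions, theta_p: Dirichlet prior samples)."""
    def kernel(a, b):
        sq = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-sq / (2 * sigma ** 2))
    return (kernel(theta_q, theta_q).mean()
            + kernel(theta_p, theta_p).mean()
            - 2 * kernel(theta_q, theta_p).mean())

# Hypothetical usage: pull encoded distributions toward a Dirichlet(alpha) prior.
batch, n_topic, alpha = 64, 20, 0.1
prior = torch.distributions.Dirichlet(torch.full((n_topic,), alpha))
theta_p = prior.sample((batch,))                              # prior samples
theta_q = torch.softmax(torch.randn(batch, n_topic), dim=1)   # placeholder encoder output
loss_mmd = rbf_mmd(theta_q, theta_p)
```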

Explanation of some arguments:

--dist: the type of the prior distribution; set it to dirichlet to use the model as W-LDA.

--alpha: the hyperparameter $\alpha$ of the Dirichlet distribution.

For the meaning of the other arguments, refer to the GSM model.

[Paper] [Code]

### 2.4 ETM

Original paper: Topic Modeling in Embedding Spaces

Author: Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei

#### Description

VAE + Gaussian Softmax + Embedding

The architecture is a straightforward VAE, with the topic-word distribution matrix decomposed as the product of the topic vectors and the word vectors. The topic vectors and word vectors are trained jointly with the topic modeling process. A noteworthy advantage of this model is that it can improve the interpretability of topics by locating the topic vectors and the word vectors in the same space. Correspondingly, the model requires more time than the others to converge to an ideal result, since it has more parameters to adjust.
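A hedged sketch of this factorization, with illustrative shapes (the repository's actual module structure may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; emb_dim corresponds to the --emb_dim argument below.
n_topic, vocab_size, emb_dim = 20, 5000, 300
topic_emb = torch.randn(n_topic, emb_dim, requires_grad=True)    # topic vectors
word_emb = torch.randn(vocab_size, emb_dim, requires_grad=True)  # word vectors

# The topic-word matrix is the (softmaxed) product of the two embedding tables,
# so topics and words live in the same space.
beta = F.softmax(topic_emb @ word_emb.t(), dim=1)  # (n_topic, vocab_size)

# Given a document's topic distribution theta, its word distribution is theta @ beta.
theta = F.softmax(torch.randn(1, n_topic), dim=1)
p_w = theta @ beta
```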

Explanation of some arguments:

--emb_dim: the dimension of the topic vectors as well as the word vectors; the default is 300.

For the meaning of the other arguments, refer to the GSM model.

[Paper] [Code]

### 2.6 BATM

Original paper: Neural Topic Modeling with Bidirectional Adversarial Training

Author: Rui Wang, Xuemeng Hu, Deyu Zhou, Yulan He, Yuxuan Xiong, Chenchen Ye, Haiyang Xu

(Architecture figure from the original paper.)

#### Description

GAN+Encoder

This model is made up of three modules: a Generator, a Discriminator, and an Encoder. The Encoder takes in a real document and outputs its topic distribution vector, concatenated with the normalized BOW of the original document. The Generator takes in samples from a prior Dirichlet distribution and produces the BOW vector of a fake document, concatenated with the sampled distribution vector. The Discriminator maximizes the likelihood of the real pairs and minimizes the likelihood of the fake pairs. Once training is done, the Encoder can output the topic distribution of a given document, while the Generator can output the topic-word distribution. Although this adversarial approach seems feasible for topic modeling, my implementation of this model does not yet work properly. I am still working on it and looking for solutions; any ideas or suggestions are welcome.
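Below is a hedged sketch of the pairing scheme described above; the module sizes, the Dirichlet concentration, and the WGAN-style loss are illustrative assumptions rather than the repository's exact design.

```python
import torch
import torch.nn as nn

n_topic, vocab_size = 20, 5000

encoder = nn.Sequential(nn.Linear(vocab_size, n_topic), nn.Softmax(dim=1))
generator = nn.Sequential(nn.Linear(n_topic, vocab_size), nn.Softmax(dim=1))
discriminator = nn.Linear(n_topic + vocab_size, 1)  # scores (theta, bow) pairs

bow_real = torch.rand(8, vocab_size)                   # placeholder normalized BOW batch
theta_real = encoder(bow_real)                         # inferred topic distributions
real_pair = torch.cat([theta_real, bow_real], dim=1)   # real pair: (theta, bow)

theta_fake = torch.distributions.Dirichlet(torch.full((n_topic,), 0.1)).sample((8,))
bow_fake = generator(theta_fake)                       # generated fake BOW
fake_pair = torch.cat([theta_fake, bow_fake], dim=1)   # fake pair

# The Discriminator pushes real-pair scores up and fake-pair scores down.
d_loss = discriminator(fake_pair).mean() - discriminator(real_pair).mean()
```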

[Paper] [Code]

#### Run Example

```bash
$ python3 BATM_run.py --taskname zhdd --n_topic 20 --num_epochs 300 --no_above 0.039 --no_below 5
```

## 3. Datasets

• cnews10k: short news sampled from the cnews dataset, in Chinese.

• zhddline: a dialogue dataset in Chinese, translated from the DailyDialog dataset via the Sogou translation API.

• zhdd: every conversation is concatenated into one document; there are 12336 documents in total.

• 3body1: the famous science-fiction novel The Three-Body Problem; each paragraph is taken as a document.

Basic statistics are listed in the following table:

| dataset | num of docs | genre | avg len of docs | language |
| --- | --- | --- | --- | --- |
| cnews10k | 10k | short news | 18.7 | Chinese |
| zhddline | 96785 | short utterances | 18.1 | Chinese |
| zhdd | 12336 | short dialogues | 142.1 | Chinese |
| 3body1 | 2626 | long novel | 73.8 | Chinese |

### Some snippets

#### 3.1 cnews10k

#### 3.2 zhddline

#### 3.3 zhdd

#### 3.4 3body1

## 4. Usage

In this section, I take the zhddline text data as an example and show how to apply the WTM-GMM model to it for topic modeling. You can use your own text data and follow the same steps.

### 4.1 Preparation

First, prepare the text data. One line is taken as one document, so keep each document on its own line; in this example, one utterance per line. Then rename the text file to the format {taskname}_lines.txt, in this case zhddline_lines.txt, and put the renamed file in the data directory. Finally, choose the right tokenizer or create one yourself; the tokenizer should be customized to the text type. The default configuration uses HanLP as the tokenizer for modern Chinese sentences. If you need to process other types of text (e.g. English or classical Chinese), open tokenization.py and modify the code in the Tokenizer function accordingly (a sketch is given after the run example below).

### 4.2 Run

Once the preparation is done, run the corresponding script; in this case, set the taskname to zhddline and specify the other necessary arguments.

```bash
$ python3 WTM_run.py --taskname zhddline --n_topic 20 --num_epochs 1000 --no_below 5 --dist gmm-std --auto_adj
```
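To illustrate the tokenizer customization mentioned in the preparation step, here is a hedged sketch of a simple English tokenizer; simple_english_tokenizer and mytask are hypothetical names, and the actual Tokenizer function in tokenization.py may have a different signature.

```python
import re

def simple_english_tokenizer(line):
    """Hypothetical replacement logic for the Tokenizer function in tokenization.py:
    lowercase the line and split on non-alphanumeric characters (illustrative only)."""
    return [tok for tok in re.split(r"[^0-9a-zA-Z]+", line.lower()) if tok]

print(simple_english_tokenizer("Hello, world! It's 2021."))
# -> ['hello', 'world', 'it', 's', '2021']

# Each line of data/{taskname}_lines.txt is one document, e.g.:
# with open("data/mytask_lines.txt", encoding="utf-8") as f:
#     docs = [simple_english_tokenizer(line) for line in f]
```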

The model evaluates topic coherence and topic diversity every 10 epochs and displays the top 20 topic words for each topic. The model weights are stored in the ckpt directory once training is done.
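For reference, here is a small sketch of one common topic-diversity definition, the fraction of unique words among the top-k words of all topics; the repository's exact metric may differ.

```python
def topic_diversity(top_words, k=25):
    """Fraction of unique words among the top-k words of all topics
    (one common definition; the repository's metric may differ)."""
    tops = [w for topic in top_words for w in topic[:k]]
    return len(set(tops)) / len(tops)

# Hypothetical usage with two tiny topics sharing one word:
print(topic_diversity([["economy", "market", "stock"],
                       ["game", "team", "market"]], k=3))  # -> 5/6 ≈ 0.83
```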

## 5. Acknowledgement

I would like to thank my supervisor, Prof. Qiang Zhou, for his helpful suggestions on these neural topic models. A big part of this project was supported by him.

In the construction of this project, some implementations were taken as references; I would like to thank the contributors of those projects.

I would also like to thank @NekoMt.Tai for kindly sharing her GPU machines with me.

Cite: If you find this code useful in your research, please consider citing:

```bibtex
@misc{ZLL2020,
  author = {Leilan Zhang},
  title = {Neural Topic Models},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/zll17/Neural_Topic_Models}},
  commit = {f02e8f876449fc3ebffc66f7635a59281b08c1eb}
}
```


## TODO

• Save trained model weights
• Plot training-log curves
• Document-topic distribution inference
• Retrieve and save ETM topic vectors and word vectors
• Plot the latent space
• Improve the Chinese documentation
• Compare the performance of the models
• Save weights periodically and resume training from checkpoints
• Highlight recommended models
• DETM

