Neural topic model (NTM) generalisation in terms of document representation.
We propose a Generalisation Regularisation (Greg) module to improve NTMs' generalisation capability in terms of document representation. As a result, an NTM trained on a source corpus still yields good document representations for unseen documents from other corpora.
See details in our [Paper](https://arxiv.org/pdf/2307.12564.pdf).
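At a high level, Greg adds a weighted regularisation term to NTM training that involves augmented documents (cf. the `regW`, `augRate` and `aug` settings in the running examples below). The snippet below is only a conceptual sketch of what such a regulariser could look like, using the POT library from the requirements; the function name, tensor shapes, and the choice of a Sinkhorn distance with a topic-embedding cost matrix are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import ot  # POT (Python Optimal Transport), listed in the requirements below


def generalisation_reg(theta, theta_aug, topic_emb, sinkhorn_reg=0.1):
    """Illustrative regulariser: an optimal-transport distance between the
    topic distribution of a document (theta) and that of its augmented copy
    (theta_aug). Both are 1-D tensors over K topics that sum to 1;
    topic_emb is a (K, D) matrix of topic embeddings."""
    # Ground cost between topics, e.g. Euclidean distances of topic embeddings.
    M = torch.cdist(topic_emb, topic_emb, p=2)
    # Entropic-regularised OT (Sinkhorn) distance; differentiable w.r.t. theta.
    return ot.sinkhorn2(theta, theta_aug, M, sinkhorn_reg)


# Hypothetical use inside a training step, weighted like the regW option below:
# loss = ntm_loss + reg_weight * generalisation_reg(theta, theta_aug, topic_emb)
```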
torch: 2.2.1+cu121
torchmetrics: 1.3.2
numpy: 1.24.1
scipy: 1.12.0
scikit-learn: 1.4.1.post1
gensim: 4.3.2
pot: 0.9.3
tqdm: 4.66.2

We use '20News', 'R8', 'Webs', 'TMN' and 'DBpedia' (a random subset) for our experiments. The pre-processed datasets are available for download at: https://drive.google.com/drive/folders/1aNpsTkd95yybj2cXAuwmgshFwBHgv1eF?usp=drive_link
We store our pre-processed datasets in .mat files, which can be loaded as dictionaries using scipy.io.loadmat() (see the loading example after this list). The datasets/dictionaries have the following common attributes/keys:
- wordsTrain, labelsTrain: bag-of-words (BOW) of training documents, and their labels.
- wordsTest, labelsTest: BOW of testing documents, and their labels.
- vocabulary, embeddings: the vocabulary of the corpus, and its word embeddings from 'glove-wiki-gigaword-50'.
- test1, test2: the first and second fold of the test BOWs (for computing document completion perplexity).
For source-to-target tasks, the source and target data have an extra suffix (e.g. 'wordsTrain_source' and 'wordsTrain_target').
For source-to-noisy tasks, the noisy target is stored in a separate 'data_aug.mat' file.
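For example, a dataset can be loaded and inspected as follows (a minimal sketch; the file path is illustrative, and the orientation of the BOW matrices should be checked against the vocabulary size):

```python
import scipy.io

# The file name is illustrative: point this at one of the downloaded .mat files.
data = scipy.io.loadmat('datasets/20News.mat')

bow_train = data['wordsTrain']      # bag-of-words of the training documents
labels_train = data['labelsTrain']  # labels of the training documents
vocab = data['vocabulary']          # the corpus vocabulary
embeddings = data['embeddings']     # word embeddings from 'glove-wiki-gigaword-50'

# Check the matrix orientation (documents x vocabulary or vocabulary x documents).
print(bow_train.shape, labels_train.shape, embeddings.shape)
```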
To run original topic models:
python main.py --model NVDM --dataset combined_20News_RestAll --n_topic 50

To run topic models with Greg:

python main.py --model NVDM --dataset combined_20News_RestAll --n_topic 50 --use_Greg

The evaluation is done by document classification and clustering, for both source and target documents.
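As an illustration of this protocol, document representations inferred by a trained NTM can be evaluated roughly as follows. This is a minimal sketch with scikit-learn: the classifier, clustering method and metrics (logistic regression, KMeans, accuracy, NMI) are stand-ins, and the 'TP, TN' clustering scores in the logs below come from the repository's own evaluation code, which may differ.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, normalized_mutual_info_score


def eval_doc_representations(z_train, y_train, z_test, y_test, n_clusters):
    """z_* are document representations inferred by a trained NTM
    (e.g. topic proportions); y_* are the document labels."""
    # Classification: fit on the training representations, predict held-out ones.
    clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    acc = accuracy_score(y_test, clf.predict(z_test))

    # Clustering: cluster the test representations and compare against labels.
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z_test)
    nmi = normalized_mutual_info_score(y_test, clusters)
    return acc, nmi
```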
A running example without Greg at epoch 50:
############################################
Evaluation at:
NVDM_dataset:combined_20News_RestAll_K50_RS1_epochs:50_LR0.0003_reg:False_regW300.0_augRate:0.5_aug:DA2
doc classification acc (original corpus): 0.3967073818374934
doc classification acc (R8): 0.7035830618892508
doc classification acc (DBpedia): 0.2738184546136534
doc classification acc (TMN): 0.3679447852760736
doc classification acc (Webs): 0.362753036437247
############################################
doc clustering TP, TN (original corpus): 0.16330323951141795 0.11003465044205976
doc clustering TP, TN (R8): 0.6319218241042345 0.1279169527204744
doc clustering TP, TN (DBpedia): 0.1827956989247312 0.07871181078741796
doc clustering TP, TN (TMN): 0.3214723926380368 0.034599746750148555
doc clustering TP, TN (Webs): 0.2813765182186235 0.06003729503945232
############################################
source document completion ppl: 15523.4
############################################

A running example with Greg at epoch 50:
############################################
Evaluation at:
NVDM_dataset:combined_20News_RestAll_K50_RS1_epochs:50_LR0.0003_reg:True_regW300.0_augRate:0.5_aug:DA2
doc classification acc (original corpus): 0.4159585767392459
doc classification acc (R8): 0.7817589576547231
doc classification acc (DBpedia): 0.49087271817954486
doc classification acc (TMN): 0.5625766871165644
doc classification acc (Webs): 0.5817813765182186
############################################
doc clustering TP, TN (original corpus): 0.16648964418481146 0.11442718376169024
doc clustering TP, TN (R8): 0.6436482084690553 0.15272100820116064
doc clustering TP, TN (DBpedia): 0.24081020255063765 0.14740831169213556
doc clustering TP, TN (TMN): 0.4170245398773006 0.10636887676325055
doc clustering TP, TN (Webs): 0.33765182186234816 0.11487270835789015
############################################
source document completion ppl: 15433.4
############################################

Comparing the two runs, Greg improves the document representation performance, in terms of both classification and clustering, on the source corpus as well as on the different target corpora.
Here is one of our results (Table 4 in the paper), over 5 runs with 20News as the source and the remaining corpora as targets, where the number of topics is set to 50.
Overall, Greg brings significant improvements (Tables 9-11 in the paper) to the original models in most cases regarding neural topical generalisation. See more details in our [Paper](https://arxiv.org/pdf/2307.12564.pdf).

Please cite our paper if it helps:
@article{yang2023towards,
  title={Towards Generalising Neural Topical Representations},
  author={Yang, Xiaohao and Zhao, He and Phung, Dinh and Du, Lan},
  journal={arXiv preprint arXiv:2307.12564},
  year={2023}
}