Skip to content
Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


The source code used for Weakly-Supervised Neural Text Classification, published in CIKM 2018.


Before running, you need to first install the required packages by typing following commands:

$ pip3 install -r requirements.txt

Python 3.6 is strongly recommended; using older python versions might lead to package incompatibility issues.

Quick Start

python --dataset ${dataset} --sup_source ${sup_source} --model ${model}

where you need to specify the dataset in ${dataset}, the weak supervision type in ${sup_source} (could be one of ['labels', 'keywords', 'docs']), and the type of neural model to use in ${model} (could be one of ['cnn', 'rnn']).

An example run is provided in, which can be executed by


More advanced settings on training and hyperparameters are commented in


The weak supervision sources ${sup_source} can come from any of the following:

  • Label surface names (labels); you need to provide class names for each class in ./${dataset}/classes.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class label surface name.
  • Class-related keywords (keywords); you need to provide class-related keywords for each class in ./${dataset}/keywords.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class-related keywords separated by commas.
  • Labeled documents (docs); you need to provide labeled document ids for each class in ./${dataset}/doc_id.txt, where each line begins with the class id (starting from 0), followed by a colon, and then document ids in the corpus (starting from 0) of the corresponding class separated by commas.

Examples are given under ./agnews/ and ./yelp/.


The final results (document labels) will be written in ./${dataset}/out.txt, where each line is the class label id for the corresponding document.

Intermediate results (e.g. trained network weights, self-training logs) will be saved under ./results/${dataset}/${model}/.

Running on a New Dataset

To execute the code on a new dataset, you need to

  1. Create a directory named ${dataset}.
  2. Put raw corpus (with or without true labels) under ./${dataset}.
  3. Modify the function read_file in so that it returns a list of documents in variable data, and corresponding true labels in variable y (If ground truth labels are not available, simply return y = None).
  4. Modify to accept the new dataset; you need to add ${dataset} to argparse, and then specify parameter settings (e.g. update_interval, pretrain_epochs) for the new dataset.

You can always refer to the example datasets when adapting the code for a new dataset.


Please cite the following paper if you find the code helpful for your research.

  title={Weakly-Supervised Neural Text Classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},


[CIKM 2018] Weakly-Supervised Neural Text Classification





No releases published


No packages published