Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


The source code used for Weakly-Supervised Neural Text Classification, published in CIKM 2018.


Before running, you need to first install the required packages by typing following commands:

$ pip3 install -r requirements.txt

Python 3.6 is strongly recommended; using older python versions might lead to package incompatibility issues.

Quick Start

python --dataset ${dataset} --sup_source ${sup_source} --model ${model}

where you need to specify the dataset in ${dataset}, the weak supervision type in ${sup_source} (could be one of ['labels', 'keywords', 'docs']), and the type of neural model to use in ${model} (could be one of ['cnn', 'rnn']).

An example run is provided in, which can be executed by


More advanced settings on training and hyperparameters are commented in


The weak supervision sources ${sup_source} can come from any of the following:

  • Label surface names (labels); you need to provide class names for each class in ./${dataset}/classes.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class label surface name.
  • Class-related keywords (keywords); you need to provide class-related keywords for each class in ./${dataset}/keywords.txt, where each line begins with the class id (starting from 0), followed by a colon, and then the class-related keywords separated by commas.
  • Labeled documents (docs); you need to provide labeled document ids for each class in ./${dataset}/doc_id.txt, where each line begins with the class id (starting from 0), followed by a colon, and then document ids in the corpus (starting from 0) of the corresponding class separated by commas.

Examples are given under ./agnews/ and ./yelp/.


The final results (document labels) will be written in ./${dataset}/out.txt, where each line is the class label id for the corresponding document.

Intermediate results (e.g. trained network weights, self-training logs) will be saved under ./results/${dataset}/${model}/.

Running on a New Dataset

To execute the code on a new dataset, you need to

  1. Create a directory named ${dataset}.
  2. Put raw corpus (with or without true labels) under ./${dataset}.
  3. Modify the function read_file in so that it returns a list of documents in variable data, and corresponding true labels in variable y (If ground truth labels are not available, simply return y = None).
  4. Modify to accept the new dataset; you need to add ${dataset} to argparse, and then specify parameter settings (e.g. update_interval, pretrain_epochs) for the new dataset.

You can always refer to the example datasets when adapting the code for a new dataset.


Please cite the following paper if you find the code helpful for your research.

  title={Weakly-Supervised Neural Text Classification},
  author={Meng, Yu and Shen, Jiaming and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},


No releases published


No packages published