Generating labels for topics automatically using neural embeddings


This package contains scripts, code files and tools to compute labels for topics automatically using Doc2vec and Word2vec (over phrases) models, as part of the publication "Automatic Labelling of Topics with Neural Embeddings". URLs to pre-trained models and annotated datasets are also given here.

Pre-Trained Models

Additional support files

Some Python Library Requirements

  • Gensim
  • Numpy
  • Pandas

Additional Libraries and Tools Needed

Running NETL requires some additional libraries for processing the Wikipedia dump, tokenisation and supervised learning.

Running the System

Running the pre-trained system directly to get labels (training the system is not needed for this step):

  • Download the pre-trained models, the SVM rank binaries (svm_rank_classify and svm_rank_learn) and the PageRank file from the URLs provided above.
  • Extract the doc2vec and word2vec models and place them in model_run/pre_trained_models.
  • Extract the PageRank file, svm_rank_classify and svm_rank_learn and place them in model_run/support_files.
  • Make sure your topic data file is in CSV format, then put it in model_run/data. Update the parameter data so that it holds the path to your topic file. It currently points to toy_data/toytopics.csv.
  • Run "python -cg -us -s". This writes candidate labels to a file and prints the unsupervised and supervised labels on the console as well as to output files. By default, three output files (output_candidates, output_unsupervised and output_supervised) are created in the same directory.
  • You can also update other parameters, such as the number of supervised or unsupervised labels needed.

Train the System.

  • Download a Wikipedia XML dump and place it in training/dump. Update the parameter input_dump with the correct dump filename.

  • Download and extract the Stanford Parser and the word2vec phrase list from the URLs provided above.

  • Place the Stanford Parser (the whole directory) in training/support_packages.

  • Place the word2vec phrase list file in training/additional_files.

  • Run "python -e -td -dv -ng -wv".

  • The Word2vec and Doc2vec models will be saved in training/trained_models, and the tokenised and other intermediate documents will be placed in training/processed_documents.

  • If you need your own word2vec phrase file, you can generate one, but the procedure changes slightly: run "python -e -td", then generate the phrase file, and after that run "python -dv -ng -wv".

Input Format

The input format if you just need to run the model:

  • Topic file: One line per topic (displaying the top-N words) in .csv format. The path to this file must be updated in the parameters.
  • Candidate labels file: This is only needed if you already have candidate labels and just want the supervised and unsupervised ranking of those labels. One line per topic, displaying the candidate labels. The lines should correspond to the topic file in ordering. (Note: to run only this step, update the path in the parameters and run just python -us -s.)

Examples are given in model_run/toy_data.
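As a rough illustration of the topic file format described above, the following sketch parses a file with one topic per line and the top-N words as comma-separated fields. The sample content and the helper name read_topics are assumptions made for illustration; the real toytopics.csv may have a header row or extra columns, so check it before relying on this layout.

```python
import csv
import io

# Hypothetical sample: one topic per line, top-N words as comma-separated fields.
# This layout is an assumption; compare it with model_run/toy_data/toytopics.csv.
SAMPLE = "space,nasa,launch,orbit,shuttle\nmarket,stock,price,trade,share\n"


def read_topics(handle):
    """Return a list of topics, each a list of non-empty, stripped terms."""
    topics = []
    for row in csv.reader(handle):
        terms = [cell.strip() for cell in row if cell.strip()]
        if terms:  # skip blank lines
            topics.append(terms)
    return topics


topics = read_topics(io.StringIO(SAMPLE))
```

To read a file on disk instead, pass an open file handle, for example open(path, newline="").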

The input format for training a new model.

  • XML Dump from Wikipedia. Place it in training/dump.

Directory Structure and Files

There are two main directories. The first is model_run. It holds the files used to run the system directly and produce candidate labels as well as unsupervised and supervised labels. Download the models, put them in the right directories and run it directly, without the training step.

  • model_run/ This script has all the parameters for running the model. All paths to the doc2vec and word2vec models and the SVM files are given here. All parameter changes should be made here.
  • model_run/ This file generates candidate labels for the topics and writes them to a file.
  • model_run/ This gives you the best labels for your topic in a totally unsupervised way, using letter trigrams.
  • model_run/ Run this to get supervised labels. It uses the SVM rank classifier (svm_rank_classify) to give you supervised labels for your topic.
  • model_run/toy_data/toytopics.csv: A sample topic file showing the format for your input.
  • model_run/toy_data/cand_label_output: A sample output of the candidate labels generated.
  • model_run/pre_trained_models: Directory for the trained doc2vec and word2vec models.
  • model_run/data: Place your topic data file here.
  • model_run/support_files/svm_model: SVM model trained on our dataset.
  • model_run/support_files/doc2vec_indices: Indices corresponding to the doc2vec tags from our pre-trained doc2vec model for the filtered/short document titles.
  • model_run/support_files/word2vec_indices: Indices corresponding to the word2vec tags from our word2vec model for our word2vec phrase list.
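To give a feel for what ranking labels "using letter trigrams" can look like, here is a minimal sketch that scores each candidate label by the cosine similarity between its letter-trigram counts and those of the topic terms. It is not the repository's implementation: the function names and the exact scoring are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def letter_trigrams(text):
    """Count overlapping letter trigrams inside each word (lowercased)."""
    counts = Counter()
    # Treat underscores as word separators, since phrase labels use them.
    for word in text.lower().replace("_", " ").split():
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by_trigrams(topic_terms, candidates):
    """Order candidate labels by trigram similarity to the topic terms (best first)."""
    topic_vec = letter_trigrams(" ".join(topic_terms))
    scored = [(cosine(letter_trigrams(c), topic_vec), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

For example, with the topic terms space, nasa, launch, orbit and shuttle, the label space_shuttle shares trigrams with the topic while stock_market shares none, so space_shuttle is ranked first.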

The second directory is training. It contains the scripts and code for training the embedding models. The files in training are:

  • training/ This is the script where all the parameters for training the system are specified. This is the file used to run the system, and all parameter changes can be made here.
  • training/ This converts the Wikipedia XML dump into documents.
  • training/ This uses the Stanford Parser tokeniser to tokenise the extracted XML dump.
  • training/ This calls the gensim-based Doc2Vec model and trains it on the tokenised documents.
  • training/create_ngrams: This uses the filtered Wikipedia phrases (either downloaded from the word2vec phrase list URL above or generated yourself) and creates n-grams (n<=4) in the documents. The words of each phrase are joined by underscores, and the phrases are tags in the word2vec model.
  • training/ This calls the gensim-based Word2Vec model and trains it on the documents with n-grams.
  • training/processed_documents: Outputs of the preprocessing steps, including create_ngrams, are saved here in separate directories.
  • training/trained_models: The final trained doc2vec and word2vec models are placed here.
  • training/support_packages: The Stanford Parser should be put here.
  • training/additional_files: Place the downloaded word2vec phrase list file here. If you generate this file yourself, the output is also saved here.
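The n-gram step described above can be pictured with the following sketch, which rewrites a tokenised document so that known phrases of up to four words become single underscore-joined tokens. The greedy longest-match strategy and the name join_phrases are assumptions for illustration; the actual create_ngrams code may work differently.

```python
def join_phrases(tokens, phrases, max_n=4):
    """Replace known phrases with underscore-joined tokens.

    tokens  -- list of word tokens from one document
    phrases -- set of underscore-joined phrases, e.g. {"new_york"}
    max_n   -- longest phrase length considered (n <= 4 in the README)

    Scans left to right and prefers the longest matching phrase.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to bigrams.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in phrases:
                out.append(candidate)
                i += n
                break
        else:
            # No phrase starts at this position; keep the single token.
            out.append(tokens[i])
            i += 1
    return out
```

Called on the tokens of "the new york stock exchange opened" with the phrases new_york, stock_exchange and new_york_stock_exchange, this keeps the single longest phrase rather than two shorter ones.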

Some additional code files:

  • Use this if you want to retrain the SVM model on a dataset. Update the parameters in this file for your own data. Currently it points to our dataset.
  • This takes all the document tags (which are Wikipedia titles) generated by the doc2vec model and filters out those whose document is shorter than 40 words or whose tag (Wikipedia title) is longer than 4 words. Only the remaining tags are considered as potential labels for topics. A pre-generated file is available under "Filtered/short Document titles" above, or you can generate your own using this script.
  • This is similar to the previous script, but because its output is used to generate the word2vec phrase file, brackets are removed from the Wikipedia titles, which are then tokenised with the Stanford tokeniser. A precomputed version is available via the word2vec phrase list URL above. Only these labels are considered when getting potential labels from the word2vec model. The output is written to training/additional_files.
  • This takes the output files of the two scripts above and gives the index positions of those labels in the respective models. For ease of running the model, these are already computed and placed in model_run/support_files (doc2vec_indices, word2vec_indices).
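The title filtering and index lookup described in this list can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, document lengths are assumed to be available as a title-to-word-count mapping, and titles are assumed to be split on whitespace.

```python
def filter_titles(doc_lengths, min_words=40, max_title_words=4):
    """Keep titles whose document has at least min_words words
    and whose title has at most max_title_words words.

    doc_lengths -- dict mapping a Wikipedia title to its document word count
    """
    return [
        title
        for title, n_words in doc_lengths.items()
        if n_words >= min_words and len(title.split()) <= max_title_words
    ]


def index_positions(kept_titles, model_tags):
    """Return the position of each kept title in the model's tag list.

    model_tags -- list of tags in the order the model stores them (assumed).
    Titles missing from the model are skipped.
    """
    position = {tag: i for i, tag in enumerate(model_tags)}
    return [position[t] for t in kept_titles if t in position]
```

Precomputing the positions once means the run step can look up label vectors by index instead of searching the tag list for every topic.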

NOTE: Before running the training models, move all previously generated directories to another location or change the names of the output files in the parameters. Otherwise, files or directories may be overwritten.

Wikipedia internal pageRank were calculated using the help of


  • Topic file: 228 topics, each with its top 10 terms, from four domains: blogs, books, news and PubMed.
  • Annotated files: For each topic, 19 labels were annotated.


Bhatia, Shraey, Jey Han Lau and Timothy Baldwin (2016) Automatic Labelling of Topics with Neural Embeddings, In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), Osaka, Japan, 953–963.

Known Issues:

It has been brought to my notice that a few people have run out of memory. This could be due to parallelization, which trades memory for speed. If you hit this problem, do the following:

In file model_run/ comment out (or replace) line 191 and line 192.

#pool = mp.Pool(processes=cores)
#result =, range(0,len(topic_list))

Add the following:

for i in range(0,len(topic_list)):

Of course, this will make it slower, as everything is now processed on just one core.
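The trade-off can be seen in the following self-contained sketch, which contrasts a process pool with a plain loop. The function label_topic is a hypothetical stand-in for the per-topic work, and the helper names are assumptions; the real code in model_run differs.

```python
import multiprocessing as mp


def label_topic(index):
    """Hypothetical stand-in for the per-topic labelling work."""
    return index * index


def run_parallel(n_topics, cores):
    """Faster, but each worker process holds its own copy of the data."""
    with mp.Pool(processes=cores) as pool:
        return pool.map(label_topic, range(0, n_topics))


def run_sequential(n_topics):
    """Slower, but everything stays in a single process."""
    result = []
    for i in range(0, n_topics):
        result.append(label_topic(i))
    return result
```

Both functions return results in topic order, so switching to the sequential version should not change the output, only the running time and memory use.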
