Cache efficient implementation for Latent Dirichlet Allocation

WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation


WarpLDA is a cache-efficient implementation of Latent Dirichlet Allocation that samples each token in O(1) time.



Prerequisites:

  • GCC (>=4.8.5)
  • CMake (>=2.8.12)
  • git
  • libnuma
    • CentOS: yum install libnuma-devel
    • Ubuntu: apt-get install libnuma-dev

Clone this project

git clone

Install third-party dependency


Download some data and split it into training and test sets

cd data
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..

Compile the project

cd release/src
make -j


Format the data

./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

./warplda --prefix train --k 100 --niter 300

Check the result. Each line describes one topic: its id, the number of tokens assigned to it, and its ten most frequent words with their probabilities.


Infer the latent topics of some test data

./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10

Data format

The data format is identical to that of Yahoo! LDA. The input is a text file in which each line is one document. The format of each line is

id1 id2 word1 word2 word3 ...

id1 and id2 are two string identifiers for the document, followed by the document's words. All fields are strings separated by white space.
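As an illustration of the line format described above, here is a minimal Python sketch that splits one input line into its two identifiers and its words. It is not part of WarpLDA; the names `Document` and `parse_document` are hypothetical and chosen only for this example.

```python
from typing import List, NamedTuple


class Document(NamedTuple):
    """One input line: two identifiers followed by the document's words."""
    id1: str
    id2: str
    words: List[str]


def parse_document(line: str) -> Document:
    """Split a line on white space into (id1, id2, words)."""
    fields = line.split()  # splits on any run of spaces, tabs or newlines
    if len(fields) < 2:
        raise ValueError("expected two document identifiers")
    return Document(fields[0], fields[1], fields[2:])


if __name__ == "__main__":
    doc = parse_document("doc42 sports the team won the game\n")
    print(doc.id1, doc.id2, len(doc.words))
```

A document with identifiers but no words parses to an empty word list in this sketch; whether format accepts such lines is not stated here.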

Output format

WarpLDA generates a number of files:

.vocab (generated by format)

Each line is a word in the vocabulary.

.info.full.txt (generated by warplda -estimate)

The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and its most frequent words in the format (probability, word). The number of words listed is controlled by -ntop. .info.words.txt is a simpler version that contains only the words.

.model (generated by warplda -estimate)

The word-topic count matrix. The first line contains four values

<size of vocabulary> <number of topics> <alpha> <beta>

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

<number of elements> index:count index:count ...

For example, 0:2 on the first row of the matrix means that the first word in the vocabulary is assigned to topic 0 twice.
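To make the layout concrete, the following Python sketch reads a .model file given as a list of lines and sums the counts per topic. It is an illustration only, not code from WarpLDA: `Model`, `parse_model` and `tokens_per_topic` are hypothetical names, the sample data is made up, and the sketch assumes alpha and beta are stored as decimal numbers.

```python
from typing import Dict, List, NamedTuple


class Model(NamedTuple):
    vocab_size: int
    num_topics: int
    alpha: float
    beta: float
    rows: List[Dict[int, int]]  # rows[w][k] = times word w is assigned to topic k


def parse_model(lines: List[str]) -> Model:
    """Parse the header line and the sparse word-topic rows."""
    header = lines[0].split()
    vocab_size, num_topics = int(header[0]), int(header[1])
    alpha, beta = float(header[2]), float(header[3])
    rows: List[Dict[int, int]] = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row: Dict[int, int] = {}
        for item in fields[1:]:
            topic, count = item.split(":")
            row[int(topic)] = int(count)
        if len(row) != int(fields[0]):
            raise ValueError("element count does not match the row contents")
        rows.append(row)
    return Model(vocab_size, num_topics, alpha, beta, rows)


def tokens_per_topic(model: Model) -> List[int]:
    """Sum each column of the word-topic matrix."""
    totals = [0] * model.num_topics
    for row in model.rows:
        for topic, count in row.items():
            totals[topic] += count
    return totals


if __name__ == "__main__":
    sample = ["3 2 0.5 0.01", "1 0:2", "2 0:1 1:4", "1 1:3"]  # made-up data
    print(tokens_per_topic(parse_model(sample)))
```

The column sums computed here should correspond to the per-topic token counts reported in .info.full.txt, assuming both files come from the same run.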

.z.estimate (generated by warplda -estimate)

The topic assignments of each token in the libsvm format. Each line is a document,

<number of tokens> <word id>:<topic id> <word id>:<topic id> ...

.z.inference (generated by warplda -inference)

The format is the same as .z.estimate.
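Since .z.estimate and .z.inference share one format, a single reader covers both. The Python sketch below parses one line into (word id, topic id) pairs and counts how many tokens of the document fall into each topic. The function names are hypothetical and the sample line is made up.

```python
from collections import Counter
from typing import List, Tuple


def parse_assignments(line: str) -> List[Tuple[int, int]]:
    """Parse '<number of tokens> <word id>:<topic id> ...' into pairs."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    pairs: List[Tuple[int, int]] = []
    for item in fields[1:]:
        word_id, topic_id = item.split(":")
        pairs.append((int(word_id), int(topic_id)))
    if len(pairs) != int(fields[0]):
        raise ValueError("token count does not match the number of pairs")
    return pairs


def topic_histogram(pairs: List[Tuple[int, int]]) -> Counter:
    """Count the tokens of one document assigned to each topic."""
    return Counter(topic_id for _, topic_id in pairs)


if __name__ == "__main__":
    pairs = parse_assignments("4 5:0 7:1 5:0 2:3")  # made-up line
    print(topic_histogram(pairs))
```

Dividing each count by the document length gives a rough empirical topic mixture for the document, without any smoothing by alpha.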

Other features

  • Use a custom prefix for output: -prefix myprefix

  • Output perplexity every 10 iterations: -perplexity 10

  • Tune the Dirichlet hyperparameters: -alpha 10 -beta 0.1

  • Use data from the UCI Machine Learning Repository

      gunzip docword.nips.txt.gz
      ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
      head -n 1400 nips.txt > nips_train.txt
      tail -n 100 nips.txt > nips_test.txt
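For readers curious what such a conversion involves, here is a rough Python sketch of turning UCI bag-of-words data into the line format described under Data format. It is not the source of uci-to-yahoo and may differ from it. It assumes the usual UCI layout: three header lines (number of documents, vocabulary size, number of nonzero entries) followed by "docID wordID count" triples with 1-based ids. Using the document number for both id1 and id2 is an assumption made for this example, as is the function name.

```python
from typing import Dict, Iterable, List


def uci_to_yahoo_lines(docword: Iterable[str], vocab: List[str]) -> List[str]:
    """Expand 'docID wordID count' triples into one line of words per document."""
    rows = iter(docword)
    num_docs = int(next(rows))
    num_words = int(next(rows))
    next(rows)  # number of nonzero entries, not needed here
    if num_words != len(vocab):
        raise ValueError("vocabulary size does not match the docword header")
    docs: Dict[int, List[str]] = {}
    for row in rows:
        if not row.strip():
            continue  # skip blank lines
        doc_id, word_id, count = (int(x) for x in row.split())
        # UCI word ids are 1-based; repeat the word once per occurrence.
        docs.setdefault(doc_id, []).extend([vocab[word_id - 1]] * count)
    # Assumption: the document number serves as both id1 and id2.
    return [" ".join([str(d), str(d)] + docs.get(d, []))
            for d in range(1, num_docs + 1)]


if __name__ == "__main__":
    docword = ["2", "3", "3", "1 1 2", "1 3 1", "2 2 1"]  # made-up data
    vocab = ["apple", "banana", "cherry"]
    for line in uci_to_yahoo_lines(docword, vocab):
        print(line)
```

Bag-of-words data carries no word order, so the order of words within each output line is arbitrary. That is harmless for LDA, which ignores word order within a document.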




Please cite WarpLDA if you find it useful!

  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},