
WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation


WarpLDA is a cache efficient implementation of Latent Dirichlet Allocation, which samples each token in O(1) time.



  • GCC (>=4.8.5)
  • CMake (>=2.8.12)
  • git
  • libnuma
    • CentOS: yum install libnuma-devel
    • Ubuntu: apt-get install libnuma-dev

Clone this project

git clone

Install third-party dependency


Download some data, and split it into a training set and a testing set

cd data
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..

Compile the project

cd release/src
make -j


Format the data

./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

./warplda --prefix train --k 100 --niter 300

Check the result. Each line describes one topic: its id, the number of tokens assigned to it, and its ten most frequent words with their probabilities.


Infer latent topics of some testing data.

./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10

Data format

The data format is identical to Yahoo! LDA. The input data is a text file with a number of lines, where each line is a document. The format of each line is

id1 id2 word1 word2 word3 ...

id1, id2 are two string document identifiers, and each word is a string, separated by white space.
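The line layout above can be read with a few lines of Python. This is a minimal sketch, not part of WarpLDA: the `Document` type, the `parse_document` helper, and the sample line are hypothetical names and data introduced only for illustration.

```python
from typing import List, NamedTuple


class Document(NamedTuple):
    """One input line: two string identifiers followed by the words."""
    id1: str
    id2: str
    words: List[str]


def parse_document(line: str) -> Document:
    """Split one input line on white space into identifiers and words."""
    fields = line.split()  # splits on any run of white space
    if len(fields) < 2:
        raise ValueError("expected at least two document identifiers")
    return Document(fields[0], fields[1], fields[2:])


# Made-up example line, not taken from any real data set.
doc = parse_document("doc42 sports football goal referee goal")
print(doc.id1, doc.id2, len(doc.words))
```

Because both identifiers are plain strings, any white-space-free labels work; only the position on the line distinguishes them from words.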

Output format

WarpLDA generates a number of files:

.vocab (generated by format)

Each line is a word in the vocabulary.

.info.full.txt (generated by warplda -estimate)

The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and a number of most frequent words in the format (probability, word). The number of most frequent words is controlled by -ntop. .info.words.txt is a simpler version which only contains words.

.model (generated by warplda -estimate)

The word-topic count matrix. The first line contains four numbers

<size of vocabulary> <number of topics> <alpha> <beta>

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

<number of elements> index:count index:count ...

For example, 0:2 in the first row of the matrix means that the first word in the vocabulary is assigned to topic 0 twice.
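As an illustration of the layout described above, the following sketch parses a .model file held in a string. It is not WarpLDA code: `parse_model` and the sample text are hypothetical, and it assumes that alpha and beta can be read as floating-point numbers, that each `index` is a topic id, and that the file contains no blank lines.

```python
from typing import Dict, List, Tuple

Model = Tuple[int, int, float, float, List[Dict[int, int]]]


def parse_model(text: str) -> Model:
    """Parse a header line plus one sparse row per vocabulary word."""
    lines = text.strip().splitlines()
    vocab_size, num_topics, alpha, beta = lines[0].split()
    rows: List[Dict[int, int]] = []
    for line in lines[1:]:
        fields = line.split()
        declared = int(fields[0])  # <number of elements> in this row
        row: Dict[int, int] = {}
        for pair in fields[1:]:
            topic, count = pair.split(":")
            row[int(topic)] = int(count)
        if len(row) != declared:
            raise ValueError("row length does not match its element count")
        rows.append(row)
    return int(vocab_size), int(num_topics), float(alpha), float(beta), rows


# Made-up model: 3 words, 4 topics, alpha = 0.5, beta = 0.01.
sample = "3 4 0.5 0.01\n2 0:2 3:1\n0\n1 1:5\n"
vocab_size, num_topics, alpha, beta, rows = parse_model(sample)
print(rows[0][0])  # count of the first word under topic 0
```

Storing each row as a dictionary keeps the sparse structure of the file; topics that do not appear in a row have an implicit count of zero.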

.z.estimate (generated by warplda -estimate)

The topic assignments of each token in the libsvm format. Each line is a document,

<number of tokens> <word id>:<topic id> <word id>:<topic id> ...
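A line in this layout can be turned into per-document topic counts as sketched below. The helper names and the sample line are hypothetical and only illustrate the format; they are not part of WarpLDA.

```python
from collections import Counter
from typing import List, Tuple


def parse_assignments(line: str) -> List[Tuple[int, int]]:
    """Return (word id, topic id) pairs for one document line."""
    fields = line.split()
    declared = int(fields[0])  # <number of tokens> in this document
    pairs = []
    for item in fields[1:]:
        word_id, topic_id = item.split(":")
        pairs.append((int(word_id), int(topic_id)))
    if len(pairs) != declared:
        raise ValueError("token count does not match the number of pairs")
    return pairs


def topic_histogram(pairs: List[Tuple[int, int]]) -> Counter:
    """Count how many tokens of the document fall under each topic."""
    return Counter(topic_id for _, topic_id in pairs)


# Made-up document with three tokens; word 10 appears twice under topic 0.
pairs = parse_assignments("3 10:0 7:2 10:0")
print(topic_histogram(pairs))
```

Normalizing the histogram by the number of tokens gives a rough empirical topic mixture for the document, before any smoothing by the Dirichlet prior.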

.z.inference (generated by warplda -inference)

The format is the same as .z.estimate.

Other features

  • Use a custom prefix for output: -prefix myprefix

  • Output perplexity every 10 iterations: -perplexity 10

  • Tune Dirichlet hyperparameters: -alpha 10 -beta 0.1

  • Use UCI machine learning repository data

      gunzip docword.nips.txt.gz
      ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
      head -n 1400 nips.txt > nips_train.txt
      tail -n 100 nips.txt > nips_test.txt




Please cite WarpLDA if you find it useful!

  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},